# Basic ML example using team names and indicator label (1=Red won)

Load the training data into a pandas DataFrame. The file is tab-separated and has no header row, so we supply the column names ourselves.

In [1]:
import pandas as pd

train = pd.read_csv('../train.txt', sep='\t', header=None, names=['Label','RedAlliance','BlueAlliance'])

Preview the first 10 rows.

In [2]:
train[0:10]

| | Label | RedAlliance | BlueAlliance |
|---|---|---|---|
| 0 | 1 | frc2910 frc2046 frc2907 | frc2930 frc4488 frc5468 |
| 1 | 1 | frc2910 frc2046 frc2907 | frc2930 frc4488 frc5468 |
| 2 | 1 | frc2910 frc2046 frc2907 | frc1983 frc1318 frc2928 |
| 3 | 1 | frc2910 frc2046 frc2907 | frc1983 frc1318 frc2928 |
| 4 | 1 | frc2471 frc2898 frc1425 | frc3663 frc2147 frc4513 |
| 5 | 0 | frc2471 frc2898 frc1425 | frc3663 frc2147 frc4513 |
| 6 | 0 | frc2471 frc2898 frc1425 | frc3663 frc2147 frc4513 |
| 7 | 1 | frc2990 frc4911 frc948 | frc2976 frc4469 frc2412 |
| 8 | 1 | frc2990 frc4911 frc948 | frc2976 frc4469 frc2412 |
| 9 | 1 | frc2930 frc4488 frc5468 | frc1540 frc3674 frc6443 |


We can borrow some concepts from https://stackabuse.com/text-classification-with-python-and-scikit-learn/ to build our model. The basic idea is to use the team names as features. Suppose frc492 is a really strong team: when it appears in the RedAlliance column, it adds weight to the probability that Red wins, and when it appears in the BlueAlliance column, it adds weight to the probability that Blue wins. So we want a predictor that learns how much it matters when frc492 (or any other team) appears in each column.

In [3]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.compose import ColumnTransformer

# Two count vectorizers, one per alliance column. Each turns a
# space-separated team list into a vector of per-team counts.
redVectorizer = CountVectorizer(max_features=1500, min_df=1, max_df=1.0, stop_words=None)
blueVectorizer = CountVectorizer(max_features=1500, min_df=1, max_df=1.0, stop_words=None)

ct = ColumnTransformer([('RedFeatures',redVectorizer,'RedAlliance'), ('BlueFeatures',blueVectorizer,'BlueAlliance')])

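# Shuffle the rows first. No random_state is set, so the order
# (and therefore the fold scores below) can differ between runs.
train = train.sample(frac=1.0)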

# produce the training features and labels.
X = ct.fit_transform(train)
y = train.Label



The data is now in a form we can train models on. First we'll try a basic random forest with 100 trees.

In [4]:
from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier(n_estimators=100, random_state=0, min_samples_split=3)  
classifier

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=3,
            min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=None,
            oob_score=False, random_state=0, verbose=0, warm_start=False)

In [5]:
# Run four-fold cross-validation
from sklearn.model_selection import cross_val_score
scores = cross_val_score(classifier, X, y, cv=4)
scores

array([0.6625    , 0.7625    , 0.63291139, 0.67948718])

In [6]:
# Report the mean accuracy across folds; the +/- figure is two standard deviations.
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

Accuracy: 0.68 (+/- 0.10)
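
An accuracy figure is easier to judge next to a trivial baseline that always predicts the more common outcome. The sketch below shows one way to compute that with `DummyClassifier`, using made-up labels (`toy_y`) and a placeholder feature matrix rather than the notebook's data. On the real data, the same call with `X` and `y` would give the number to compare against.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score

# Made-up labels for illustration: 8 Red wins (1) and 4 Blue wins (0).
toy_y = np.array([1, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1])
# The dummy model ignores the features, so a placeholder column is enough.
toy_features = np.zeros((len(toy_y), 1))

# Always predict whichever label is most frequent in the training folds.
baseline = DummyClassifier(strategy='most_frequent')
baseline_scores = cross_val_score(baseline, toy_features, toy_y, cv=4)
print("Baseline accuracy: %0.2f" % baseline_scores.mean())
```

If the random forest's cross-validated accuracy is not clearly above this baseline, the team-name features are adding little.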


## Let's also try logistic regression.

In [7]:
from sklearn.linear_model import LogisticRegression
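# Note: scikit-learn 1.5+ deprecates multi_class. It is not required for a
# binary label, though dropping it may change the fitted coefficients.
classifier = LogisticRegression(random_state=0, solver='lbfgs', multi_class='multinomial')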
scores = cross_val_score(classifier, X, y, cv=4)
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

Accuracy: 0.65 (+/- 0.08)
