# Machine Learning

### Binary Classification
- Predict whether someone will beat their previous season's free throw score

#### Input 
- Previous season scores? 
- Player name? 

#### Features
- team: players can change teams during a season, but assume it is constant per season
- num_scoring_games
- mean score per game
- (consecutive) years they've exceeded previous record?
- max score per game
- actual scores per game per season ~ 100 => too high dim
- number of years player has exceeded their previous scores 
- whether they beat last year
- whether their team made it to playoffs last season - will be implied by game count
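
As a rough illustration of the per-season aggregates listed above, the sketch below computes game count, total, mean and max from per-game scores. The record layout (`player`, `season`, `score`) and the name `season_aggregates` are assumptions made for this example; the notebook's own `get_season_features` is not shown and may work differently.

```python
from collections import defaultdict

def season_aggregates(games):
    """Aggregate per-game scores into one feature row per (player, season).

    `games` is an iterable of (player, season, score) tuples. This layout is
    hypothetical and not taken from the original data file.
    """
    scores = defaultdict(list)
    for player, season, score in games:
        scores[(player, season)].append(score)

    rows = {}
    for key, values in scores.items():
        rows[key] = {
            "num_games": len(values),
            "total_score": sum(values),
            "mean_score": sum(values) / len(values),
            "max_score": max(values),
        }
    return rows

# Made-up records, for illustration only.
games = [
    ("player_a", 2014, 3),
    ("player_a", 2014, 5),
    ("player_a", 2015, 4),
]
season_rows = season_aggregates(games)
print(season_rows[("player_a", 2014)])
```

Keeping four summary values per season avoids the roughly 100 per-game inputs that the list above flags as too high-dimensional.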

#### Limitations
- getting players per team is a bit time consuming
- possible workaround: check which team's score increased after a free throw is scored

#### Models
- Log Reg
- Dec Trees
- MLP

#### Accuracy Metrics
- Precision
- Recall
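
For reference, a minimal sketch of how these metrics (and the F score reported later) follow from confusion counts for the positive class. The function name and the counts are made up for illustration; the notebook itself relies on `sklearn.metrics`.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F1) for the positive class from raw counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical counts, chosen only to make the arithmetic easy to follow.
precision, recall, f1 = precision_recall_f1(tp=30, fp=10, fn=20)
print(precision, recall, f1)
```

Precision answers "of the players predicted to improve, how many did?", while recall answers "of the players who improved, how many were found?".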

In [1]:
%load_ext autoreload
%autoreload 2

from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn import metrics  # needed for metrics.precision_score etc. below
from sklearn.metrics import classification_report

import pandas as pd
import numpy as np
import sys
import os

sys.path.insert(1, os.path.join(sys.path[0], '..'))

from src.feature_generation import get_season_features, get_previous_score_info, add_target_info, encode_categorical_field
from src.utilities import add_info_a, add_info_b
from src.mlp import train_mlp

Using TensorFlow backend.


# Data

In [2]:
df = pd.read_csv('../../data/formatted_free_throws.csv')

### Generate Features

### Get basic features from season data

In [3]:
features = get_season_features(df)

### Add Sequential data about previous year

In [4]:
features_df = get_previous_score_info(features)

### Add Target data from next year

In [5]:
full_df = add_target_info(features_df)
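
The helpers in `src.feature_generation` are not shown, so the following is only a sketch of how the sequential fields and the target might be derived for a single player. The field names `previous_delta`, `beat_previous_score` and `target`, and the `-1` sentinel for a missing target, come from this notebook; the exact definitions, the function name `add_sequence_info` and the input layout are assumptions.

```python
def add_sequence_info(totals_by_season):
    """Derive previous-season and next-season fields for one player.

    `totals_by_season` maps season -> total score. The definitions below are
    assumptions; seasons are treated as consecutive once sorted.
    """
    seasons = sorted(totals_by_season)
    rows = []
    for i, season in enumerate(seasons):
        total = totals_by_season[season]
        # No earlier season: use a delta of 0 (an assumption for this sketch).
        prev_delta = total - totals_by_season[seasons[i - 1]] if i > 0 else 0
        if i + 1 < len(seasons):
            # Target: did the player beat this season's total next season?
            target = int(totals_by_season[seasons[i + 1]] > total)
        else:
            target = -1  # no next season, so the target is missing
        rows.append({
            "season": season,
            "total_score": total,
            "previous_delta": prev_delta,
            "beat_previous_score": int(prev_delta > 0),
            "target": target,
        })
    return rows

# Made-up totals, for illustration only.
sequence_rows = add_sequence_info({2013: 120, 2014: 150, 2015: 140})
for row in sequence_rows:
    print(row)
```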

#### Encode team field

In [6]:
encoded_features_df, encoded_teams = encode_categorical_field(full_df, 'team')
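
The call above returns both the encoded frame and the list of new column names, which is later appended to `FEATURES`. A plain-Python sketch of one-hot encoding with the same return shape is shown here; the `team_` prefix and the function name `one_hot` are assumptions, not the project's actual implementation.

```python
def one_hot(rows, field, prefix="team_"):
    """One-hot encode `field` in a list of dict rows.

    Returns (encoded_rows, new_columns). The column prefix is an assumption.
    """
    categories = sorted({row[field] for row in rows})
    new_columns = [prefix + str(c) for c in categories]
    encoded = []
    for row in rows:
        # Copy every field except the categorical one being replaced.
        out = {k: v for k, v in row.items() if k != field}
        for category, column in zip(categories, new_columns):
            out[column] = int(row[field] == category)
        encoded.append(out)
    return encoded, new_columns

# Made-up rows, for illustration only.
team_rows = [{"team": "BOS", "num_games": 70}, {"team": "LAL", "num_games": 65}]
encoded, team_columns = one_hot(team_rows, "team")
print(team_columns)
print(encoded[0])
```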

### Get features

In [7]:
encoded_features_df = encoded_features_df.fillna(0)

In [8]:
FEATURES = [
    'total_score',
    'mean_score',
    'max_score',
    'num_games',
    'previous_delta',
    'beat_previous_score'
]
FEATURES.extend(encoded_teams)

In [9]:
print("Total features: {}".format(len(FEATURES)))

Total features: 39



## Annotated Data

### Separate data points with missing targets i.e. target == -1

In [10]:
labelled_df = encoded_features_df.loc[encoded_features_df.target != -1]

print("Total annotations: {}".format(len(labelled_df)))

Total annotations: 3428


#### Output

In [11]:
labelled_df.to_csv('../../data/feature_data.csv', index=False)

### Check class balance

In [12]:
pos = len(labelled_df.loc[labelled_df.target == 1])
neg = len(labelled_df.loc[labelled_df.target == 0])
total = len(labelled_df)

print("Total +ve samples: {} [{:.2f}%]".format(pos, (pos/total * 100.0)))
print("Total -ve samples: {} [{:.2f}%]".format(neg, (neg/total * 100.0)))

Total +ve samples: 1506 [43.93%]
Total -ve samples: 1922 [56.07%]
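
The class split also gives a simple reference point for the models that follow: a classifier that always predicts the majority (negative) class. The short calculation below uses the counts printed above; the baseline itself is an addition here, not part of the original analysis.

```python
pos, neg = 1506, 1922  # counts printed by the notebook above
total = pos + neg

# Always predicting the majority (negative) class gives this accuracy.
majority_accuracy = max(pos, neg) / total
print("Majority-class accuracy: {:.4f}".format(majority_accuracy))
```

Any model whose accuracy is close to this figure is adding little beyond the class prior, which is one reason to look at precision and recall for the positive class rather than accuracy alone.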


## Train/Test Split

In [15]:
labelled_df = labelled_df.sample(frac=1).reset_index(drop=True)

target = np.ravel(labelled_df[['target']])

full_train_x, full_test_x, train_y, test_y = train_test_split(labelled_df, target, test_size=0.2)


train_x = full_train_x[FEATURES]
test_x = full_test_x[FEATURES]

print("Train X Shape: {}".format(train_x.shape))
print("Train Y Shape: {}".format(train_y.shape))
print("Test X Shape: {}".format(test_x.shape))
print("Test Y Shape: {}".format(test_y.shape))

Train X Shape: (2742, 39)
Train Y Shape: (2742,)
Test X Shape: (686, 39)
Test Y Shape: (686,)


# Models

## Train Logistic Regression Model

In [25]:
model = LogisticRegression(class_weight='balanced')
model.fit(train_x, train_y)

print('Model Coeffs Shape: {}'.format(model.coef_.shape))

Model Coeffs Shape: (1, 39)
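
`class_weight='balanced'` reweights each class as `n_samples / (n_classes * count)`, following the formula in the scikit-learn documentation. The sketch below applies that formula by hand. The training-set class counts are not printed by the notebook; they are inferred here from the overall class totals minus the test-set support shown in the classification reports, so treat them as an assumption.

```python
def balanced_class_weights(class_counts):
    """Weights as n_samples / (n_classes * count), as in scikit-learn's 'balanced' mode."""
    n_samples = sum(class_counts.values())
    n_classes = len(class_counts)
    return {label: n_samples / (n_classes * count)
            for label, count in class_counts.items()}

# Inferred training counts: 1922 - 386 negatives and 1506 - 300 positives.
weights = balanced_class_weights({0: 1536, 1: 1206})
print(weights)
```

With a 44/56 split the weights stay close to 1, so the reweighting here is mild.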




In [26]:
y_true = np.array(test_y)
y_pred = np.array(model.predict(test_x))

print("Precision %s" % metrics.precision_score(y_true, y_pred))
print("Recall %s" % metrics.recall_score(y_true, y_pred))
print("F Score %s" % metrics.f1_score(y_true, y_pred))
print(classification_report(y_true, y_pred))

Precision 0.557632398753894
Recall 0.5966666666666667
F Score 0.5764895330112721
              precision    recall  f1-score   support

           0       0.67      0.63      0.65       386
           1       0.56      0.60      0.58       300

   micro avg       0.62      0.62      0.62       686
   macro avg       0.61      0.61      0.61       686
weighted avg       0.62      0.62      0.62       686



## Train Decision Trees Model

In [23]:
model = DecisionTreeClassifier(class_weight='balanced')
model.fit(train_x, train_y)

DecisionTreeClassifier(class_weight='balanced', criterion='gini',
            max_depth=None, max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')

In [24]:
y_true = np.array(test_y)
y_pred = np.array(model.predict(test_x))

print("Precision %s" % metrics.precision_score(y_true, y_pred))
print("Recall %s" % metrics.recall_score(y_true, y_pred))
print("F Score %s" % metrics.f1_score(y_true, y_pred))
print(classification_report(y_true, y_pred))

Precision 0.4738675958188153
Recall 0.4533333333333333
F Score 0.46337308347529815
              precision    recall  f1-score   support

           0       0.59      0.61      0.60       386
           1       0.47      0.45      0.46       300

   micro avg       0.54      0.54      0.54       686
   macro avg       0.53      0.53      0.53       686
weighted avg       0.54      0.54      0.54       686



## Train MLP Model

- very basic POC implementation to validate whether approach is worthwhile
- used keras for speed
- **no** model optimisation/regularisation

In [43]:
mlp = train_mlp(train_x, train_y)

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_37 (Dense)             (None, 256)               10240     
_________________________________________________________________
dense_38 (Dense)             (None, 256)               65792     
_________________________________________________________________
dense_39 (Dense)             (None, 128)               32896     
_________________________________________________________________
dense_40 (Dense)             (None, 1)                 129       
Total params: 109,057
Trainable params: 109,057
Non-trainable params: 0
_________________________________________________________________
Train on 2467 samples, validate on 275 samples
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Ep

In [44]:
y_true = np.array(test_y)
y_pred = np.array([int(round(pred)) for pred in mlp.predict(test_x).flatten()])

print("Precision %s" % metrics.precision_score(y_true, y_pred))
print("Recall %s" % metrics.recall_score(y_true, y_pred))
print("F Score %s" % metrics.f1_score(y_true, y_pred))
print(classification_report(y_true, y_pred))

Precision 0.7142857142857143
Recall 0.15
F Score 0.24793388429752064
              precision    recall  f1-score   support

           0       0.59      0.95      0.73       386
           1       0.71      0.15      0.25       300

   micro avg       0.60      0.60      0.60       686
   macro avg       0.65      0.55      0.49       686
weighted avg       0.64      0.60      0.52       686
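
Rounding the MLP output amounts to a decision threshold of 0.5. The sketch below shows, on made-up labels and probabilities, how moving that threshold trades precision against recall. The function name and the data are hypothetical, and nothing here reproduces the model's actual outputs.

```python
def precision_recall_at(threshold, y_true, y_prob):
    """Precision and recall for the positive class at a given threshold."""
    y_pred = [int(p > threshold) for p in y_prob]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Hypothetical labels and predicted probabilities, for illustration only.
y_true_demo = [1, 1, 1, 0, 0, 1, 0, 0]
y_prob_demo = [0.9, 0.7, 0.45, 0.6, 0.3, 0.35, 0.2, 0.1]

for threshold in (0.3, 0.5, 0.8):
    p, r = precision_recall_at(threshold, y_true_demo, y_prob_demo)
    print("threshold={:.1f} precision={:.2f} recall={:.2f}".format(threshold, p, r))
```

On this toy data a higher threshold raises precision and lowers recall, so a sweep over thresholds on a validation set would be one way to check whether the MLP's low recall is partly a thresholding effect.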



# Conclusion

- Majority of time spent on data and feature generation
- 3 models tested very quickly to see whether the approach is valid or whether more feature engineering is required
- While no model is highly performant
    - the log reg model scores the highest F score (~0.58)
    - the neural network based approach scores both the highest precision (~0.71) and the lowest recall (~0.15)

- From a product perspective, predicting whether a player will exceed their previous year's total score seems like a precision-focused task. I would therefore prioritise another iteration on both the data features and the models, in particular the MLP, which shows the most potential: the first architecture already demonstrated significantly higher precision than the other models, so it seems plausible that tuning and better or more features will improve performance.
- Next steps:
    - visualise false positives/negatives to understand where the model is going wrong
    - extend/update features based on the above findings
    - implement the MLP directly in TensorFlow for a more customisable, less 'out of the box' model; add regularisation and tune hyperparameters
