# Predicting Georgia's offensive play-calls under Jim Chaney using a random forest classifier
This is directly based on the blog by Bill from https://collegefootballdata.com/

In [None]:
import numpy as np
import pandas as pd
import requests

In [None]:
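from io import StringIO

response = requests.get("https://api.collegefootballdata.com/teams/fbs")
# wrap the payload in StringIO; newer pandas versions deprecate passing a raw JSON string to read_json
teams = pd.read_json(StringIO(response.text))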

teams.head()

Using the CFBD API's `/plays` endpoint, loop through Jim Chaney's seasons at Georgia (as OC/QB coach), starting in 2016, filtering for plays where Georgia is the offense.

In [None]:
data = pd.DataFrame()

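# range's end bound is exclusive, so this fetches the 2016 and 2017 seasons
for year in range(2016,2018):
    response = requests.get("https://api.collegefootballdata.com/plays?seasonType=both&year={0}&offense=georgia".format(year))
    # flatten nested fields (e.g. clock.minutes) into columns; pd.io.json.json_normalize is deprecated
    df = pd.json_normalize(response.json())
    data = pd.concat([data, df])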
    
data.head()

Data cleanup: we select only the variables that we think are relevant to play calls and drop the remaining fields.

In [None]:
data = data[['home', 'away', 'offense_score', 'defense_score', 'period', 'clock.minutes', 
             'clock.seconds', 'yardstogoal', 'down', 'distance', 'play_type']]
data.head()

Create new variable for home/away. We don't necessarily care about which team was home and which was away, but we do care whether the team calling the plays is at home

In [None]:
data['is_home'] = np.where(data['home'] == 'Georgia', 1, 0)
data.head()

The `clock.minutes` and `clock.seconds` fields are not really valuable independent of one another, so we convert them into a single field: the raw seconds remaining in the current period.

In [None]:
data['seconds_remaining'] = (data['clock.minutes'] * 60) + data['clock.seconds']
data.head()

In [None]:
pass_types = ['Pass Reception', 'Pass Interception Return', 'Pass Incompletion', 
              'Sack', 'Passing Touchdown', 'Interception Return Touchdown']
rush_types = ['Rush', 'Rushing Touchdown']
punt_types = ['Punt', 'Punt Return Touchdown', 'Blocked Punt', 'Blocked Punt Touchdown']
fg_types = ['Field Goal Good', 'Field Goal Missed', 'Blocked Field Goal']

def getPlayCall(x):
    if x in pass_types:
        return 'pass'
    elif x in rush_types:
        return 'rush'
    elif x in punt_types:
        return 'punt'
    elif x in fg_types:
        return 'fg'
    else:
        return None
        
data['play_call'] = data['play_type'].apply(getPlayCall)
data.head()
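
The same mapping can also be written as a lookup table. The sketch below is one alternative, not the method this notebook uses: it builds a dictionary from the four groups above and applies it with `Series.map`, which leaves unmatched play types as missing values. The names `PLAY_CALLS` and `CALL_BY_TYPE` and the sample play types are made up for illustration.

```python
import pandas as pd

# Hypothetical lookup table built from the same four groups used above.
PLAY_CALLS = {
    'pass': ['Pass Reception', 'Pass Interception Return', 'Pass Incompletion',
             'Sack', 'Passing Touchdown', 'Interception Return Touchdown'],
    'rush': ['Rush', 'Rushing Touchdown'],
    'punt': ['Punt', 'Punt Return Touchdown', 'Blocked Punt', 'Blocked Punt Touchdown'],
    'fg': ['Field Goal Good', 'Field Goal Missed', 'Blocked Field Goal'],
}

# Invert the table so each play type points at its play call.
CALL_BY_TYPE = {play_type: call
                for call, play_types in PLAY_CALLS.items()
                for play_type in play_types}

# Made-up sample; 'Kickoff' and 'Timeout' belong to none of the four groups.
sample = pd.Series(['Rush', 'Sack', 'Kickoff', 'Punt', 'Field Goal Good', 'Timeout'])
mapped = sample.map(CALL_BY_TYPE)  # unmatched play types become NaN
print(mapped)
```

Because unmatched values come back as missing, the result works with the same `dropna` step used in the next cells.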

Some play types don't fit into any of the four play call classifications (field goal, pass, punt, rush) and are set to `None`. We'll use the convenient pandas `dropna` method to drop rows that have missing values, specifying which column or columns should be considered when looking for rows to drop.

In [None]:
pd.isna(data['play_call']).sum()

In [None]:
data.dropna(subset=['play_call'], inplace=True)
print(pd.isna(data['play_call']).sum())
data.head()

In [None]:
plays = data[['offense_score', 'defense_score', 'period', 'yardstogoal', 
              'down', 'distance', 'is_home', 'seconds_remaining', 'play_call']]
plays.head()

# Building a random forest prediction model
We want our model to take a set of inputs describing the game situation and use them to predict play calls. Our dependent variable is the play call, so everything else belongs in our feature set.
1. Separate our set of features (our independent variables) from what we want our model to output (our dependent variable).
2. Split our data into training and validation sets. Using the convenient `train_test_split` function imported below, we are going to pull out 20% of the data to use as a validation set.
    - We split out a validation set so that we can test the model on data it has not seen and confirm that it predicts well for the problem we are trying to solve. This guards against overfitting, which occurs when a model learns the training set so closely that its predictions are only good on the data it was trained on.

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

In [None]:
# split the data set between our independent variables (i.e. features) and our dependent variable or output
play_calls = plays['play_call']
plays = plays.drop(['play_call'], axis=1)

# split the data into training and validation sets
plays_train, plays_validation, calls_train, calls_validation = train_test_split(plays, play_calls, train_size=0.8, test_size=0.2, random_state=0)
plays_train.head()
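
The overfitting concern described in the steps above can be made concrete by comparing accuracy on the training rows with accuracy on the held-out rows. The following is a minimal sketch on synthetic data from `make_classification`, not on the Georgia plays; the sample size, feature count, and variable names are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the play data: 500 rows, 8 features, 4 classes.
X, y = make_classification(n_samples=500, n_features=8, n_informative=5,
                           n_classes=4, random_state=0)

# Hold out 20% of the rows, mirroring the split above.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, train_size=0.8, test_size=0.2, random_state=0)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# score() returns mean accuracy. A large gap between the two numbers
# suggests the model has memorized the training rows.
train_accuracy = model.score(X_train, y_train)
validation_accuracy = model.score(X_val, y_val)
print(f"training accuracy:   {train_accuracy:.3f}")
print(f"validation accuracy: {validation_accuracy:.3f}")
```

Only the validation number says anything about how the model might behave on plays it has not seen.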

Convert our categorical data (the dependent set of play calls) into numbers using the pandas `factorize` function. This will return two sets:
1. the data as a set of numbers ranging from 0 to 3
2. the set containing the key mappings telling us which number mapped to which label.

In [None]:
y, y_keys = pd.factorize(calls_train)
print(y[0:15,])
print(y_keys)
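
To see what `factorize` returns without the full data set, here is a small sketch on a made-up list of play calls. The variable names are illustrative only.

```python
import pandas as pd

# Made-up play calls, just to show the shape of the output.
calls = pd.Series(['rush', 'pass', 'rush', 'punt', 'fg', 'pass'])

codes, keys = pd.factorize(calls)
print(codes)  # integer codes, numbered in order of first appearance
print(keys)   # Index of labels; position i holds the label for code i

# Indexing the keys with the codes recovers the original labels.
restored = keys[codes]
print(list(restored))
```

The codes are assigned in order of first appearance, so the number attached to each label depends on the row order of the training set.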

### Build and train a random forest classifier model.

In [None]:
# build the classifier
classifier = RandomForestClassifier(random_state=0, n_estimators=100)

# train the classifier with our training set
classifier.fit(plays_train, y)

Pass in our validation set of features to the `predict` method and see what the classifier outputs

In [None]:
classifier.predict(plays_validation)

Unlike the `predict` method, which outputs a single predicted value for each set of inputs, the `predict_proba` method shown below provides more detail by outputting a probability for each possible label. Notice that we have four probabilities for each set of inputs, one per output label (pass/rush/fg/punt); the columns follow the order of `y_keys`.

In [None]:
classifier.predict_proba(plays_validation)[0:10]
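
The relationship between `predict` and `predict_proba` can be checked directly. The sketch below uses synthetic data rather than the play data, and the variable names are assumptions. It shows that each row of probabilities sums to 1 and that `predict` picks the class with the highest probability.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic four-class problem standing in for the play-call data.
X, y = make_classification(n_samples=300, n_features=6, n_informative=4,
                           n_classes=4, random_state=0)
model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

probabilities = model.predict_proba(X[:5])
print(model.classes_)          # column order of the probability matrix
print(probabilities.round(2))  # one row per input, one column per class

# Taking the most probable class per row reproduces predict().
most_likely = model.classes_[np.argmax(probabilities, axis=1)]
print(most_likely, model.predict(X[:5]))
```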

Convert the raw outputs into labels using the `y_keys` mapping object we created earlier.

In [None]:
predicted_calls = y_keys[classifier.predict(plays_validation)]
predicted_calls

Compare the predicted outputs with the actual outputs from our validation set. We can use the `crosstab` function in pandas. Each row represents the actual classification of play calls in our validation set. The columns represent what our classifier predicted the play calls to be.

In [None]:
pd.crosstab(calls_validation, predicted_calls, rownames=['Actual Calls'], colnames=['Predicted Calls'])
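
A single accuracy figure makes it easier to compare one run with the next. As a hedged sketch, the example below builds a cross-tabulation from two made-up label lists and derives the overall accuracy from its diagonal. The data and variable names are illustrative and are not results from this notebook.

```python
import pandas as pd

# Made-up actual and predicted calls, chosen so one prediction is wrong.
actual = pd.Series(['rush', 'pass', 'rush', 'punt', 'pass', 'fg'])
predicted = pd.Series(['rush', 'rush', 'rush', 'punt', 'pass', 'fg'])

table = pd.crosstab(actual, predicted,
                    rownames=['Actual Calls'], colnames=['Predicted Calls'])
print(table)

# Correct predictions sit where the row label equals the column label.
correct = sum(table.loc[label, label]
              for label in table.index if label in table.columns)
accuracy = correct / table.to_numpy().sum()
print(f"accuracy: {accuracy:.3f}")
```

The same number can be obtained directly as `(actual == predicted).mean()`; the table is still useful for seeing which calls get confused with which.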

### Improving our model

Inspect our model using the built-in `feature_importances_` attribute so we can see how heavily it weights each feature when making its predictions.

In [None]:
list(zip(plays_train, classifier.feature_importances_))
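
A sorted pandas Series can be easier to read than a list of tuples. The sketch below trains on synthetic data with placeholder column names (`feature_a` through `feature_e` are hypothetical), so the printed ranking says nothing about the real play data.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Placeholder feature names for a synthetic four-class data set.
feature_names = ['feature_a', 'feature_b', 'feature_c', 'feature_d', 'feature_e']
X, y = make_classification(n_samples=300, n_features=5, n_informative=3,
                           n_redundant=1, n_classes=4, n_clusters_per_class=1,
                           random_state=0)
features = pd.DataFrame(X, columns=feature_names)

model = RandomForestClassifier(n_estimators=50, random_state=0).fit(features, y)

# Pair each importance with its column name and rank from most to least important.
importances = pd.Series(model.feature_importances_,
                        index=features.columns).sort_values(ascending=False)
print(importances)
```

The importances are normalized to sum to 1, so each value can be read as a share of the total.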

Drop the `is_home` flag, as it's not helping the model and probably adding noise. 

In [None]:
# drop is_home column
plays = plays.drop(columns=['is_home'])

The same goes for the `period` field: our model isn't making much use of it. However, instead of dropping it altogether, let's fold it into seconds remaining, similar to what we did with minutes earlier.

In [None]:
# incorporate period into seconds_remaining
plays['seconds_remaining'] = ((4 - plays['period']) * 15 * 60 ) + plays['seconds_remaining']

# drop period column
plays = plays.drop(columns=['period'])
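
As a worked example of the conversion, the sketch below applies the same formula to a few made-up rows. The column name `game_seconds_remaining` and the constant `SECONDS_PER_QUARTER` are illustrative names, not part of the notebook.

```python
import pandas as pd

SECONDS_PER_QUARTER = 15 * 60

# Made-up rows; seconds_remaining is the time left in the current quarter.
sample = pd.DataFrame({
    'period': [1, 2, 4, 4],
    'seconds_remaining': [900, 300, 180, 0],
})

# Add the full length of every quarter still to be played.
sample['game_seconds_remaining'] = (
    (4 - sample['period']) * SECONDS_PER_QUARTER + sample['seconds_remaining']
)
print(sample)
```

With 5:00 left in the second quarter, the result is 2 * 900 + 300 = 2100 seconds. Note that the formula goes negative for any row with a period above 4, so if the data encodes overtime that way, those rows may need separate handling.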

Let's re-run everything to see if our model improved

In [None]:
plays_train, plays_validation, calls_train, calls_validation = train_test_split(plays, play_calls, train_size=0.8, test_size=0.2, random_state=0)
y, y_keys = pd.factorize(calls_train)

classifier = RandomForestClassifier(n_estimators=100, random_state=0)
classifier.fit(plays_train, y)

predicted_calls = y_keys[classifier.predict(plays_validation)]

pd.crosstab(calls_validation, predicted_calls, rownames=['Actual Calls'], colnames=['Predicted Calls'])
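
Since the same split, train, and compare steps are repeated after every feature change, they could be wrapped in a function. The helper below, `fit_and_compare`, is a hypothetical sketch that mirrors the cell above; it is demonstrated on made-up data in which the label depends only on `down`.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split


def fit_and_compare(features, labels, random_state=0):
    """Split, train, and cross-tabulate actual vs. predicted labels.

    Hypothetical helper mirroring the re-run cell above.
    """
    x_train, x_val, y_train, y_val = train_test_split(
        features, labels, train_size=0.8, test_size=0.2, random_state=random_state)
    codes, keys = pd.factorize(y_train)
    model = RandomForestClassifier(n_estimators=100, random_state=random_state)
    model.fit(x_train, codes)
    predicted = keys[model.predict(x_val)].to_numpy()
    actual = y_val.to_numpy()
    table = pd.crosstab(actual, predicted,
                        rownames=['Actual Calls'], colnames=['Predicted Calls'])
    accuracy = (actual == predicted).mean()
    return model, table, accuracy


# Made-up data: the label is a simple function of 'down'.
toy = pd.DataFrame({'down': [1, 2, 3, 4] * 25,
                    'distance': list(range(1, 11)) * 10})
toy_calls = toy['down'].map({1: 'rush', 2: 'rush', 3: 'pass', 4: 'punt'})

model, table, accuracy = fit_and_compare(toy, toy_calls)
print(table)
print(f"validation accuracy: {accuracy:.3f}")
```

Returning the accuracy alongside the table gives a single number to compare between feature sets.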

***NOT IMPROVED***... let's look at the feature importance now

In [None]:
list(zip(plays_train, classifier.feature_importances_))

The offense and defense score features are not really helping. Play calling is more likely a function of how far a team is behind or ahead than of the raw scores. Let's convert these two features into a single field, *score margin*.

In [None]:
# calculate new scoring margin field and drop the individual score columns
plays['margin'] = plays['offense_score'] - plays['defense_score']
plays = plays.drop(columns=['offense_score', 'defense_score'])

plays_train, plays_validation, calls_train, calls_validation = train_test_split(plays, play_calls, train_size=0.8, test_size=0.2, random_state=0)
y, y_keys = pd.factorize(calls_train)

classifier = RandomForestClassifier(n_estimators=100, random_state=0)
classifier.fit(plays_train, y)

predicted_calls = y_keys[classifier.predict(plays_validation)]

pd.crosstab(calls_validation, predicted_calls, rownames=['Actual Calls'], colnames=['Predicted Calls'])

***IMPROVED***... but not great

## Evaluate real-time data

In [None]:
def predict_call(yards, down, distance, seconds, margin):
    test_plays = pd.DataFrame({'yardstogoal': [yards], 'down': [down], 'distance': [distance], 'seconds_remaining': [seconds], 'margin': [margin]})
    return y_keys[classifier.predict(test_plays)][0]

Let's say the ball is at the 50 yard line. It's 4th and 1 with 3 minutes left and Georgia is down by 3 points. What does Georgia do?

In [None]:
call = predict_call(50, 4, 1, 180, -3)
call

What if Georgia is up by 10 points?

In [None]:
call = predict_call(50, 4, 1, 180, 10)
call

Same prediction! What about being up by 50 points?

In [None]:
call = predict_call(50, 4, 1, 180, 50)
call
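
When several margins produce the same call, the class probabilities can show whether the model's confidence shifted even though the top label did not. The sketch below is an assumption-laden illustration: it trains a separate classifier on synthetic plays labeled by a made-up rule, and `predict_call_probabilities` is a hypothetical variant of `predict_call`. Its output says nothing about Georgia's actual tendencies.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Synthetic plays with a made-up labeling rule, only to give the model something to fit.
rng = np.random.default_rng(0)
n = 400
toy_plays = pd.DataFrame({
    'yardstogoal': rng.integers(1, 100, n),
    'down': rng.integers(1, 5, n),
    'distance': rng.integers(1, 15, n),
    'seconds_remaining': rng.integers(0, 3601, n),
    'margin': rng.integers(-21, 22, n),
})
toy_calls = np.where(toy_plays['down'] == 4,
                     np.where(toy_plays['yardstogoal'] <= 35, 'fg', 'punt'),
                     np.where(toy_plays['distance'] >= 7, 'pass', 'rush'))

toy_y, toy_keys = pd.factorize(toy_calls)
toy_classifier = RandomForestClassifier(n_estimators=100, random_state=0)
toy_classifier.fit(toy_plays, toy_y)


def predict_call_probabilities(yards, down, distance, seconds, margin):
    """Return class probabilities indexed by play-call label, highest first."""
    situation = pd.DataFrame({'yardstogoal': [yards], 'down': [down],
                              'distance': [distance],
                              'seconds_remaining': [seconds], 'margin': [margin]})
    probabilities = toy_classifier.predict_proba(situation)[0]
    return pd.Series(probabilities, index=toy_keys).sort_values(ascending=False)


print(predict_call_probabilities(50, 4, 1, 180, -3))
```

Calling the function with different margins and comparing the resulting Series side by side shows how much the margin feature moves the prediction.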