In [None]:
%%html
<style type="text/css">

div.h2 {

    background-color: #159957;
    background-image: linear-gradient(120deg, #155799, #159957);
    text-align: left;
    color: white;              
    padding:9px;
    padding-right: 100px; 
    font-size: 20px; 
    max-width: 1500px; 
    margin: auto; 
    margin-top: 40px; 

}

                                                                         
body {

  font-size: 12px;

}    
                                     

div.h3 {

    color: #159957; 
    font-size: 18px; 
    margin-top: 20px; 
    margin-bottom:4px;

}
                                      

div.h4 {

    color: #159957;
    font-size: 15px; 
    margin-top: 20px; 
    margin-bottom: 8px;

}

   
span.note {

    font-size: 5; 
    color: gray; 
    font-style: italic;

}

  
hr {

    display: block; 
    color: gray;
    height: 1px; 
    border: 0; 
    border-top: 1px solid;

}
                                 

hr.light {

    display: block;
    color: lightgray;
    height: 1px; 
    border: 0; 
    border-top: 1px solid;

}   

   
                                      
                                      
                        
                                      
                                      
                                      
                                      
                                                          

table.dataframe th 

{

    border: 1px darkgray solid;
    color: black;
    text-align: left;
    background-color: white;

}

    

                                      

table.dataframe td 
                                      
{

    border: 1px darkgray solid;
    color: black;
    background-color: white;
    font-size: 12px;
    text-align: center;

} 
                                   

table.rules th 

{

    border: 1px darkgray solid;
    color: black;
    background-color: white;
    font-size: 11px;
    text-align: left;

}
                                   

table.rules td 

{

    border: 1px darkgray solid;
    color: black;
    background-color: white;
    font-size: 13px;
    text-align: center;

} 

   
table.rules tr.best

{

    color: green;

}    

    
.output { 

    align-items: left; 

}

        
.output_png {

    display: table-cell;

    text-align: left;

    margin:auto;

}                                          

                                
</style>

### Disclaimer

***All seasoned data analysts should ignore this notebook, since they will neither learn nor miss anything relevant by reading further. However, if you are a football fan or only a few weeks into data science just like me, this notebook can serve as a sort of peer publication.***

*I am a published author of suspense and young adult novels as well as a political scientist, not a coder or data analyst. I recently joined Kaggle after I noticed a competition by the NFL. As an avid follower of the league, I found it interesting to analyze possible reasons for player lower leg injuries.*

*This practice notebook is based on the dataset included in the NFL Big Data Bowl 2020, in which a model was created to predict ball carrier runs. I have studied data analysis for roughly a month and therefore have no skills for making a high-end custom ML model. This notebook includes no model-building and is itself part of my personal learning process more than anything else.* 

*Finally, since I am only a true data science rookie, this notebook is prone to mistakes, and its methods and results should not be applied as such in other contexts. It is also a fact that this is the second Jupyter Notebook I've done in my life.*
<hr>

<div class="h2"><i>Running Back vs. Linebacker: A Practice Notebook</i></div>

**TABLE OF CONTENTS**
1. Introduction <br>
2. Data Preprocessing <br>
2.1 *Euclidean Distance* <br>
2.2 *Acceleration*<br>
2.3 *Speed* <br>
3. Analysis <br>
4. (How To Become) ML Model<br>
5. Conclusion


### 1. Introduction 

Football is a situational game divided into individual plays. An average game in the 2019 NFL season consisted of 124 plays, each one with its specific play call, i.e. the plan to execute that particular play.

Running plays are the backbone of football. They are used to get the 'dirty yards' needed for a first down, i.e. four new attempts to gain ten yards on the field. In **NFL Big Data Bowl 2020** a challenge was set to create a model for predicting how many yards the ball carrier - most often the running back - will gain in a play.

This notebook will take a different approach - I have no skills for building a custom learning model. Using the dataset included in the original contest, it was possible to work out that the average number of yards per run play was 4.18 yards. This was the basis for this notebook and its research question.

Football is also a game of matchups. The offensive and defensive lines face each other dozens of times in a game, and the same goes for running backs and defensive backs. On defense, linebackers form the second line behind the big guys, and it is the job of agile linebackers to try to prevent the running back from gaining yards.

In this notebook, the matchups between running backs and linebackers are the focal point. In the end I decided to ask the following research question:

<div class="h4"><i>Are there factors linked either to running backs or linebackers that contribute to longer run plays than the average 4.18 yards?</i></div>
<br>
<hr>

### 2. Data Preprocessing

The dataset included in the **NFL Big Data Bowl 2020** includes a lot of information irrelevant to the research question. Therefore a part of the dataset can be dropped right from the start.

In [None]:
# original css stylesheet: Kaggle, member: TexasTom

#import modules
import pandas as pd
import numpy as np

#load dataset
df = pd.read_csv("../input/nfl-big-data-bowl-2020/train.csv", low_memory = False)

# set maximum number of columns in display
pd.set_option('display.max_columns', 36)

#drop possible NaN values
df.dropna(inplace = True)

# switch Position value HB (half back i.e. running back) to RB
# group different linebacker positions (ILB, MLB, OLB, LB) under same label LB 
df["Position"]= df["Position"].replace("HB", "RB")
df["Position"]= df["Position"].replace("ILB", "LB")
df["Position"]= df["Position"].replace("MLB", "LB")
df["Position"]= df["Position"].replace("OLB", "LB")

# select and drop original columns that are not relevant to the task at hand
cols = ['Orientation', 'Dir', 'Dis', 'DisplayName', 'JerseyNumber', 'Season', 'Team', 'PossessionTeam', 'FieldPosition', 'HomeScoreBeforePlay', 'VisitorScoreBeforePlay', 
       'PlayDirection', 'OffenseFormation', 'PlayerBirthDate', 'PlayerCollegeName', 'TimeHandoff', 'HomeTeamAbbr', 'VisitorTeamAbbr', 'Week', 'Stadium', 'Location', 'StadiumType', 'Turf',
       'GameWeather', 'Temperature', 'Humidity', 'WindSpeed', 'WindDirection']
df = df.drop(cols, axis=1)

# select only rows with RB or LB as Position value
df = df[df['Position'].isin(['RB', 'LB']) ]

# create a new column storing as string whether the player is RB or LB
df.loc[:,'RbLb'] = df['Position']

# arrange dataframe by column value, in this case PlayId 
# groupby().apply() leaves a two-level index (play, player within play)
# method courtesy of StackOverFlow, member: yogitha jaya reddy gari
df = df.sort_values(['PlayId'],ascending=False).groupby('PlayId',as_index = False).apply(lambda x: x.reset_index(drop = True))
# the two-level index is kept as-is; it is named ['Play', 'Players'] in the next cell

df.head(10)

Glancing at the data, it seems that not all linebackers are always labeled as such in specific plays. One reason for this is that there is an increasing number of 'flex players' in the NFL, meaning they are able to line up in several positions. Some of them even take the field in both offensive and defensive plays. This factor may add some 'data noise' to the further steps of this analysis.

Next some new columns will be created for new values based on the dataset.

In [None]:
# get numeric values for Position column
df = pd.get_dummies(df, columns=['Position'])

# create new columns based on existing data
df['Is3Wr'] = df['OffensePersonnel'].str.contains('3 WR')
df['Is3Wr'] = df['Is3Wr'].map({True: 1, False: 0})

df['Is3Lb'] = df['DefensePersonnel'].str.contains('3 LB')
df['Is3Lb'] = df['Is3Lb'].map({True: 1, False: 0})

df['Is4Lb'] = df['DefensePersonnel'].str.contains('4 LB')
df['Is4Lb'] = df['Is4Lb'].map({True: 1, False: 0})

df.loc[:,'X_lb'] = df['X']
df.loc[:,'Y_lb'] = df['Y']
df.loc[:,'X_rb'] = df['X']
df.loc[:,'Y_rb'] = df['Y']

# create separate x/y coordinate values for running backs and linebackers
# these values are taken from original dataset X and Y columns
df['X_lb'] = df['Position_LB'].apply(lambda x: None if x==1 else 0)
df['X_lb'] = df['X_lb'].fillna(df['X'])

df['Y_lb'] = df['Position_LB'].apply(lambda x: None if x==1 else 0)
df['Y_lb'] = df['Y_lb'].fillna(df['Y'])

df['X_rb'] = df['Position_LB'].apply(lambda x: None if x==0 else 0)
df['X_rb'] = df['X_rb'].fillna(df['X'])

df['Y_rb'] = df['Position_LB'].apply(lambda x: None if x==0 else 0)
df['Y_rb'] = df['Y_rb'].fillna(df['Y'])

# replace 0 values with NaN
df.X_lb = df.X_lb.replace(0, np.nan)
df.Y_lb = df.Y_lb.replace(0, np.nan)
df.X_rb = df.X_rb.replace(0, np.nan)
df.Y_rb = df.Y_rb.replace(0, np.nan)

# sort dataframe index and create multi index consisting of Play and Players
df.sort_index(inplace = True) 
df.index.names = ['Play','Players']

# df.head()

The multi index level Play numbers the individual plays in the dataset. The Players level refers to the players on the field in each individual play. For example, the first two rows hold two players from the same Play, and so on. All in all, the number of players listed per play ranges from two upward. 

If one takes a look at the first play in the dataset, the DefensePersonnel column indicates that there were three linebackers included in the play. However only one linebacker is listed under that specific PlayId. This tells us that **not all linebackers on the field in individual plays have been included in the dataset, which as such is a major deficiency.**
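
The observation above can be checked for every play at once rather than by eye. The following is only a minimal sketch on an invented toy frame: the column names `PlayId`, `RbLb` and `DefensePersonnel` follow the notebook, while the values, the regular expression and the helper names `expected`, `listed` and `check` are assumptions made for illustration.

```python
import pandas as pd

# Toy stand-in for the notebook's dataframe (values are invented for illustration)
toy = pd.DataFrame({
    "PlayId": [1, 1, 2, 2, 2, 2],
    "RbLb": ["RB", "LB", "RB", "LB", "LB", "LB"],
    "DefensePersonnel": ["4 DL, 3 LB, 4 DB"] * 6,
})

# Number of linebackers announced in the personnel string, e.g. "3 LB" -> 3
expected = (
    toy.groupby("PlayId")["DefensePersonnel"].first()
    .str.extract(r"(\d+) LB", expand=False)
    .astype(float)
)

# Number of linebacker rows actually present for each play
listed = toy[toy["RbLb"] == "LB"].groupby("PlayId").size()

# Side-by-side comparison; a positive missing_lb means rows are absent
check = pd.DataFrame({"expected_lb": expected, "listed_lb": listed}).fillna(0)
check["missing_lb"] = check["expected_lb"] - check["listed_lb"]
print(check)
```

On the real data, the share of plays with `missing_lb > 0` would give a rough measure of how much linebacker data is absent.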
<hr>
Next, the NaN values in the coordinate columns will be replaced by average values of player position on the field. On its own this would make no sense, but the average values are calculated by play, as identified by the Play index level. As all players are by rule in fixed positions when the ball is snapped, the average value is based on player positions in that particular play. 

The purpose of all this is to create rows where, for example, a linebacker has his own position coordinates, and the running back on the field in that same play has his own position coordinates. In running back rows the average position of linebackers in that play is applied, if there is more than one linebacker in the dataset in that particular play. 

To accomplish this, some temporary columns are first created below.
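
As a point of comparison, the per-play idea described above can also be written with `Series.where` and `groupby(...).transform("mean")`. This is a hedged sketch on invented toy values, not the method used in the cells that follow: those cells offset the other position's coordinate by one dataset-wide average gap, whereas this version fills each gap with the mean of the same play. The column `is_lb` is a hypothetical stand-in for the notebook's `Position_LB` dummy column.

```python
import pandas as pd

# Toy data: two plays, with linebacker (is_lb=True) and running back rows
toy = pd.DataFrame({
    "Play": [0, 0, 0, 1, 1],
    "is_lb": [True, True, False, True, False],
    "X": [40.0, 42.0, 35.0, 60.0, 66.0],
})

# Keep X only on rows of the matching position; other rows become NaN
toy["X_lb"] = toy["X"].where(toy["is_lb"])
toy["X_rb"] = toy["X"].where(~toy["is_lb"])

# Fill the gaps with the mean of the same play (NaN values are skipped by mean)
toy["X_lb"] = toy["X_lb"].fillna(toy.groupby("Play")["X_lb"].transform("mean"))
toy["X_rb"] = toy["X_rb"].fillna(toy.groupby("Play")["X_rb"].transform("mean"))
print(toy)
```

With this approach a running back row receives the average linebacker position of its own play, so no temporary columns are needed. A play without any linebacker rows would still be left with NaN values.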

In [None]:
# the average positional X coordinates by Play
a1 = df.groupby('Play')['X_lb'].mean()
b1 = df.groupby('Play')['X_rb'].mean()

# subtract LB average X coordinate values from RB average X values (per play)
c1 = (b1 - a1)

# make sure these values are absolute i.e. positive
c1 = np.absolute(c1)
# average the per-play gaps into one single value for the whole dataset
c1 = c1.mean()

# create new temporary column xtr1
# this value is the existing running back X position minus the calculated average 
df['xtr1'] = df['X_rb'] - c1

# on rows where there are no linebacker X coordinate values, use values stored in the new column
df['X_lb'] = df['X_lb'].fillna(df.xtr1)

## df.head(20)

In [None]:
# repeat the process above on Y coordinates to fill the missing linebacker Y values
a2 = df.groupby('Play')['Y_lb'].mean()
b2 = df.groupby('Play')['Y_rb'].mean()

c2 = (b2 - a2)
c2 = np.absolute(c2)
c2 = c2.mean()

df['xtr2'] = df['Y_rb'] - c2
df['Y_lb'] = df['Y_lb'].fillna(df.xtr2)

# df.head(20)

Next, the remaining NaN values are filled with one line of code based on the Play index. In theory this should fill all remaining NaN cells in the coordinate columns. This does not happen, however, which means the process above must be repeated on the two remaining coordinate columns.

In [None]:
# fill NaN values based on specific multi index. Original code: StackOverFlow, user: piRSquared
df = df.groupby(level='Play').bfill()

# repeat filling missing values with average values
a3 = df.groupby('Play')['X_rb'].mean()
b3 = df.groupby('Play')['X_lb'].mean()

c3 = (b3 - a3)
c3 = np.absolute(c3)
c3 = c3.mean()

df['xtr3'] = df['X_lb'] + c3
df['X_rb'] = df['X_rb'].fillna(df.xtr3)


# repeat filling missing values with average values
a4 = df.groupby('Play')['Y_rb'].mean()
b4 = df.groupby('Play')['Y_lb'].mean()

c4 = (b4 - a4)
c4 = np.absolute(c4)
c4 = c4.mean()

df['xtr4'] = df['Y_lb'] + c4
df['Y_rb'] = df['Y_rb'].fillna(df.xtr4)

# drop unnecessary coordinate columns
xtr_cols = ['X', 'Y', 'xtr1', 'xtr2', 'xtr3', 'xtr4']
df = df.drop(xtr_cols, axis=1)

# df.head(20)

<hr>

#### 2.1 Euclidean Distance

Next the positional coordinates are used to calculate the Euclidean distance between running back and linebacker for each row in the dataset. As noted before, in some cases this value is based on the positional average of several linebackers in a particular play. However, as the defensive players are lined up by rule at the time of the snap, the variance between these average values and the actual positions is not a major issue compared to the fact that many linebackers' positional data is missing from the original dataset.
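
For reference, the distance in question is sqrt((X_rb - X_lb)^2 + (Y_rb - Y_lb)^2). A minimal vectorized sketch with `np.hypot` is shown here on invented toy coordinates; the column names follow the notebook, the values are assumptions chosen so that the result is easy to verify by hand.

```python
import numpy as np
import pandas as pd

# Toy coordinates: a 3-4-5 triangle and a purely horizontal gap of 5 yards
toy = pd.DataFrame({
    "X_lb": [40.0, 42.0], "Y_lb": [26.0, 30.0],
    "X_rb": [37.0, 37.0], "Y_rb": [22.0, 30.0],
})

# np.hypot(dx, dy) computes sqrt(dx**2 + dy**2) element-wise
toy["euc"] = np.hypot(toy["X_rb"] - toy["X_lb"], toy["Y_rb"] - toy["Y_lb"]).round(2)
print(toy)
```

Working on whole columns avoids the intermediate lists and keeps NaN coordinates as NaN distances.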

In [None]:
# import module
import math 

# x and y coordinates to lists
a = df['X_lb'].values.tolist()
b = df['Y_lb'].values.tolist()
c = df['X_rb'].values.tolist()
d = df['Y_rb'].values.tolist()      

# function to calculate Euclidean distance row by row from four equally long coordinate lists
def distance(x1, y1, x2, y2): 
    return [math.sqrt((qx - px) ** 2 + (qy - py) ** 2) for (px, py, qx, qy) in zip(x1, y1, x2, y2)]

# execute function on list values            
MyList = distance(a, b, c, d)

# round MyList to two digits to fit the dataframe format
MyList = np.round(MyList, 2)

#create new column 'euc' for Euclidean distance
df['euc'] = MyList

# df.head(25)

One thing that is instantly clear is the similarity of the distance values. This is caused by the rules of football, according to which both offense and defense must be lined up in a certain way when the ball is snapped. This means that more or less the same positional setup at the line of scrimmage is repeated dozens of times in a game.
<hr>

#### 2.2 Acceleration

Next separate acceleration columns based on player position are created.

In [None]:
# create a new column acc_lb for acceleration by LB position
df.loc[:,'acc_lb'] = df['A']

# insert the acceleration value from column A, otherwise 0
df['acc_lb'] = df['Position_LB'].apply(lambda x: None if x==1 else 0)
df['acc_lb'] = df['acc_lb'].fillna(df['A'])

# replace 0 values in column with NaN
df.acc_lb = df.acc_lb.replace(0, np.nan)

# fill NaN values based on multi index Play. Original code: StackOverFlow, user: piRSquared
df = df.groupby(level='Play').bfill()


# there are still NaN values left in acc_lb column
# next the NaN values are replaced with the average LB acceleration


# create average acceleration for LB
acc_1 = df['acc_lb'].mean()

# round acc_1 to two digits
acc_1 = np.round(acc_1, 2)

# replace acc_lb NaN values with average LB acceleration (acc_1) 
df['acc_lb'] = df['acc_lb'].fillna(acc_1)


# next the process above is repeated on RB position


# create a new column acc_rb for acceleration by RB position
df.loc[:,'acc_rb'] = df['A']

# insert the acceleration value from column A, otherwise 0
df['acc_rb'] = df['Position_LB'].apply(lambda x: None if x==0 else 0)
df['acc_rb'] = df['acc_rb'].fillna(df['A'])

# replace 0 values in column with NaN
df.acc_rb = df.acc_rb.replace(0, np.nan)

# fill NaN values based on multi index Play
df = df.groupby(level='Play').bfill()

# for remaining NaN values, create average acceleration for RB
acc_2 = df['acc_rb'].mean()

# round acc_2 to two digits
acc_2 = np.round(acc_2, 2)

# replace acc_rb NaN values with average RB acceleration (acc_2) 
df['acc_rb'] = df['acc_rb'].fillna(acc_2)


# drop original acceleration column A
acc_col = ['A']
df = df.drop(acc_col, axis=1)


# df.head(20)

A key aspect of the running back vs. linebacker matchup is their acceleration in a particular play. The faster the running back accelerates, the more yards he can be expected to gain before being caught by the second line of defense, i.e. the linebackers. Conversely, linebackers are hypothetically more likely to catch the ball carrier if their acceleration is high.

Next a value is created comparing the running back and linebacker acceleration in particular plays, as identified by the Play index. The value is calculated simply by dividing the running back acceleration value by the linebacker acceleration.

For example, if the running back acceleration is 2.53 compared to a linebacker acceleration of 1.26 in a particular play, the result is 2.53 / 1.26 = 2.01. Thus the new value is higher when the running back has the advantage acceleration-wise in a particular play. 
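
The worked example can be reproduced on whole columns. This is a small sketch with invented values; the zero-handling step and the name `safe_lb` are assumptions added for illustration. It also shows why a ratio is sensitive to small denominators: a linebacker acceleration close to zero inflates the result enormously.

```python
import numpy as np
import pandas as pd

# Toy accelerations: the worked example, a near-zero LB value, and a zero LB value
toy = pd.DataFrame({"acc_rb": [2.53, 3.10, 2.00], "acc_lb": [1.26, 0.01, 0.0]})

# Treat a zero linebacker acceleration as missing so the ratio is NaN instead of inf
safe_lb = toy["acc_lb"].replace(0, np.nan)

# Element-wise ratio of RB to LB acceleration, rounded to two digits
toy["RelAcc"] = (toy["acc_rb"] / safe_lb).round(2)
print(toy)
```

The second row gives a ratio of 310, which illustrates how extreme values can appear in such a column even when both inputs look ordinary.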

In [None]:
# RB and LB acceleration values to two lists
ac1 = df['acc_lb'].values.tolist()
ac2 = df['acc_rb'].values.tolist()

# function to calculate relative acceleration (RB divided by LB) from two lists of values
def relative_acc(lb_values, rb_values): 
    return [(rb / lb) for (lb, rb) in zip(lb_values, rb_values)]

# execute function on list values            
RelAcc = relative_acc(ac1, ac2)

# round RelAcc to two digits to fit the dataframe format
RelAcc = np.round(RelAcc, 2)

#create new column 'RelAcc' for relative acceleration value
df['RelAcc'] = RelAcc
        
# df.head(20)        

<hr>

#### 2.3 Speed

Finally, the speed value (S) is treated in a similar manner. First, two separate columns for running back and linebacker speed are created. After those values are in place, a new value RelSpd is created, dividing the running back speed by linebacker speed.

In [None]:
# create a new column spd_lb for speed by LB position
df.loc[:,'spd_lb'] = df['S']

# insert the speed value from column S, otherwise 0
df['spd_lb'] = df['Position_LB'].apply(lambda x: None if x==1 else 0)
df['spd_lb'] = df['spd_lb'].fillna(df['S'])

# replace 0 values in column with NaN
df.spd_lb = df.spd_lb.replace(0, np.nan)

# fill NaN values based on multi index Play. Original code: StackOverFlow, user: piRSquared
df = df.groupby(level='Play').bfill()

# create average speed for LB
spd_1 = df['spd_lb'].mean()

# round spd_1 to two digits
spd_1 = np.round(spd_1, 2)

# replace spd_lb NaN values with average LB speed (spd_1) 
df['spd_lb'] = df['spd_lb'].fillna(spd_1)


# the process above is repeated on RB position


# create a new column spd_rb for speed by RB position
df.loc[:,'spd_rb'] = df['S']

# insert the speed value from column S, otherwise 0
df['spd_rb'] = df['Position_LB'].apply(lambda x: None if x==0 else 0)
df['spd_rb'] = df['spd_rb'].fillna(df['S'])

# replace 0 values in column with NaN
df.spd_rb = df.spd_rb.replace(0, np.nan)

# fill NaN values based on multi index Play
df = df.groupby(level='Play').bfill()

# for remaining NaN values, create average speed for RB
spd_2 = df['spd_rb'].mean()

# round spd_2 to two digits
spd_2 = np.round(spd_2, 2)

# replace spd_rb NaN values with average RB speed (spd_2) 
df['spd_rb'] = df['spd_rb'].fillna(spd_2)



# RB and LB speed values to two lists
sp1 = df['spd_lb'].values.tolist()
sp2 = df['spd_rb'].values.tolist()

# function to calculate relative speed (RB divided by LB) from two lists of values
def relative_spd(lb_values, rb_values): 
    return [(rb / lb) for (lb, rb) in zip(lb_values, rb_values)]

# execute function on list values            
RelSpd = relative_spd(sp1, sp2)

# round RelSpd to two digits to fit the dataframe format
RelSpd = np.round(RelSpd, 2)

#create new column RelSpd for relative speed value
df['RelSpd'] = RelSpd

# create a new column PlayYards with Yards column values
# this is not necessary but it easily relocates the column
df.loc[:,'PlayYards'] = df['Yards']


# drop original speed column S and Yards 
drp_cols = ['S', 'Yards']
df = df.drop(drp_cols, axis=1)

# df.head(20)

<hr>

### 3. Analysis

The starting point of this analysis was the very observation that the average number of yards gained in a play by ball carrier (i.e. running back) in the 2020 Data Bowl dataset was 4.18 yards. Based on this, I formulated the following research question: is it possible to predict which run plays go longer than the average value? Or, rather, what features in the dataset are emphasized in plays reaching more than 4.18 yards?

Below a new column Yds4_18 is created. The value of the column is 0 if the run is equal to or less than 4.18 yards, and 1 if the run play reaches over the average length.
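
The same flag can be derived with a single boolean comparison. A minimal sketch on invented yardage values follows; the constant name `THRESHOLD` is an assumption for illustration.

```python
import pandas as pd

# Toy yardage values, including a loss of yards
toy = pd.DataFrame({"PlayYards": [-2, 3, 4, 5, 12]})

THRESHOLD = 4.18  # average yards per run play reported in this notebook

# True where the run is longer than the threshold, then cast to 0/1
toy["Yds4_18"] = (toy["PlayYards"] > THRESHOLD).astype(int)
print(toy)
```

Keeping the threshold in one named constant makes it easier to rerun the analysis with a different cutoff, for example the median.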

In [None]:
# the average yards gained in a play in the dataset is 4.18
# create a new column Yds4_18
df.loc[:,'Yds4_18'] = df['PlayYards']

# set new column value 0 if PlayYards are equal or less than 4.18, 1 if more
f = lambda x: 0 if x <= 4.18 else 1
df['Yds4_18'] = df['Yds4_18'].map(f)

#df.head(20)

The overall plot below shows that there is - as could be assumed - over two times more data on linebackers compared to running backs. After all, most of the time there is only one running back in a play. However, the dataset would be even more imbalanced if all linebacker data were included in the 2020 Data Bowl dataset, as noted before. 

Not having that data available necessarily affects the outcome of this analysis as well. Of course this is only a practice notebook, so this issue will be sidelined from now on.

In [None]:
# import modules
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

# set plot size and font
sns.set(rc={'figure.figsize':(9.7,8.27)})
sns.set(font='sans-serif', palette='colorblind')

# set plot parameters
plot = sns.countplot(x = 'RbLb',
              data = df,
              order = df['RbLb'].value_counts().index)

# set plot title etc.
plot.axes.set_title('Total count of dataset rows divided by position',fontsize=24)
plot.set_xlabel("Position",fontsize=18)
plot.set_ylabel("Total count of rows",fontsize=18)
plot.tick_params(labelsize=14)

# show plot
plt.show()

Sending three wide receivers onto the field in a play increases the relative chance of a 4.18+ yard run. A likely reason for this is that the defense is expecting - and thus preparing for - a pass play instead of a run.

In [None]:
# set plot size and font
sns.set(rc={'figure.figsize':(9.7,8.27)})
sns.set(font='sans-serif', palette='colorblind')

# set plot parameters
plot = sns.countplot(x = 'Yds4_18',
              data = df,
              hue = 'Is3Wr',
              order = df['Yds4_18'].value_counts().index)

# set plot title etc.
plot.axes.set_title('4.18 yards threshold divided by 3WR on field',fontsize=24)
plot.set_xlabel("0 = 4.18 yards or less, 1 = more",fontsize=18)
plot.set_ylabel("Total count of rows",fontsize=18)
plot.tick_params(labelsize=14)
plot.legend (loc=1, fontsize = 16, fancybox=True, framealpha=1, shadow=True, borderpad=1, title = '0 = not 3WR, 1 = 3WR')

# show plot
plt.show()

The lineplot below takes the running back acceleration as the x value and compares it to the yards gained by the ball carrier in a play. The hue factor is 4.18+ yard plays, and when outliers are excluded, it looks like increasing acceleration does not have a clear effect on yards gained as an individual feature.

Calling describe() on the acc_rb column (not in the code) reveals that the running back acceleration average (yards/second²) is 2.59 with a standard deviation of 0.79. Although there is a slight increase in 4.18+ yard rushes in the acceleration range of 2.5 - 4.0 yards/second², it does not really stand out from the overall data.
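
Since the summary call is not part of the notebook's code, a short sketch of what it looks like is given here. The values are invented, so the printed numbers will not match the 2.59 and 0.79 quoted above.

```python
import pandas as pd

# Toy running back accelerations (invented values)
toy = pd.DataFrame({"acc_rb": [1.8, 2.4, 2.6, 3.1, 3.0]})

# describe() returns count, mean, std, min, quartiles and max as a Series
summary = toy["acc_rb"].describe()
print(summary[["mean", "std", "75%"]].round(2))
```

The same call on the RelAcc and euc columns yields the figures discussed later in this section.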


In [None]:
# set plot parameters
sns.set(font='sans-serif', palette='colorblind', font_scale=1.5) 
sns.lineplot(y='PlayYards', x='acc_rb', data=df, hue='Yds4_18', legend = 'full')

The linebacker acceleration average in the dataset is 1.75 yards/second² with a standard deviation of 0.93. In the pass rush linebacker acceleration is a key factor, because by the rules the players must stay in their lineup before the snap: a flying start results in an offside call and a five-yard penalty.

Again, as an individual feature linebacker acceleration seems to affect neither the number of yards gained nor the likelihood of a 4.18+ yard run play. The high values between 4 and 6 yards/second² look like outliers when compared to the mean value of linebacker acceleration (1.75 yards/second²).


In [None]:
# set plot parameters
sns.set(font='sans-serif', palette='colorblind', font_scale=1.5) 
sns.lineplot(y='PlayYards', x='acc_lb', data=df, hue='Yds4_18', legend = 'full')

The RelAcc column value was created by dividing the running back acceleration by the linebacker acceleration in each play (and applying the average linebacker acceleration when necessary).

Calling describe() on RelAcc reveals that the column average is 2.32 with a large-ish standard deviation of 4.44. The third quartile (the value below which 75 percent of all datapoints fall) is 2.38.

The RelAcc column includes very high values as outliers, with the maximum value being 268.5. This is presented as a scatterplot below:


In [None]:
# define plot
sns.scatterplot(x = "RelAcc", y = "PlayYards", data = df, color = 'lime')

# set plot title etc.
plt.xlabel('Relative acceleration')
plt.ylabel('PlayYards')
plt.title('Relative acceleration and yards gained in play')

# show plot
plt.show()

When the 2.32 average is added to the standard deviation of 4.44, we get 6.76 with these rounded figures; the code below uses 6.77 as the cutoff. In the following steps RelAcc values are split at 6.77 to give a closer view of the case.

In [None]:
# create a new column RelAcc6_77
df.loc[:,'RelAcc6_77'] = df['RelAcc']

# set new column value 0 if RelAcc is equal or less than 6.77, 1 if more
f = lambda x: 0 if x <= 6.77 else 1
df['RelAcc6_77'] = df['RelAcc6_77'].map(f)

The count below shows that of 69613 total datapoints, 66954 are at or below 6.77 and only 2659 (just under 4 percent) have a value above 6.77. 

In [None]:
# create variable acc_count with count of different values in column RelAcc6_77
acc_count = df['RelAcc6_77'].value_counts()

# print variable
print (acc_count)

The same dichotomy is shown as a plot below. The blue color represents 96.1 percent of all datapoints (those with a value of 6.77 or below).

In [None]:
# set plot parameters
sns.set(font='sans-serif', palette='colorblind', font_scale=1.5) 
sns.lineplot(y='PlayYards', x='RelAcc', data=df, hue='RelAcc6_77', legend = 'full')

Based on information above, next the highest 3.9 percent of RelAcc values are dropped from the dataset, and a new column RelAcc_2 is created for the remaining values. In the new column, the dropped values (those above 6.77) are replaced with the average value of RelAcc column (2.32).
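
The replacement step described above can also be expressed with `Series.mask`, which swaps out values where a condition holds. This is a hedged sketch on invented values: the cutoff is computed here as mean plus one standard deviation of the toy column, whereas the notebook hard-codes 6.77, and the names `mean_val` and `cutoff` are assumptions.

```python
import pandas as pd

# Toy ratios with one obvious outlier
toy = pd.DataFrame({"RelAcc": [1.0, 2.0, 3.0, 2.0, 50.0]})

# Cutoff: column mean plus one (sample) standard deviation
mean_val = toy["RelAcc"].mean()
cutoff = mean_val + toy["RelAcc"].std()

# Replace values above the cutoff with the rounded column mean, keep the rest
toy["RelAcc_2"] = toy["RelAcc"].mask(toy["RelAcc"] > cutoff, round(mean_val, 2))
print(toy)
```

Note that the mean used as the replacement is itself pulled upward by the outliers; using the median instead would be one way to reduce that effect.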

In [None]:
# store relacc_mean
relacc_mean = df['RelAcc'].mean()

# round relacc_mean to two digits
relacc_mean = np.round(relacc_mean, 2)

# create new column RelAcc_2 where value is 0 if RelAcc is greater than 6.77
df['RelAcc_2'] = df['RelAcc'].apply(lambda x: None if x <= 6.77 else 0)

# get other column values from RelAcc
df['RelAcc_2'] = df['RelAcc_2'].fillna(df['RelAcc']) 

# replace 0 values with Nan
df.RelAcc_2 = df.RelAcc_2.replace(0, np.nan)

# replace Nan values with average RelAcc value stored in relacc_mean
df['RelAcc_2'] = df['RelAcc_2'].fillna(relacc_mean)

# drop unnecessary columns for relative acceleration
relacc_cols = ['RelAcc', 'RelAcc6_77']
df = df.drop(relacc_cols, axis=1)

#df.head(20)

In the histogram below, the now-overrepresented average relative acceleration value stands out in an otherwise relatively even distribution. In percentages, however, we are talking about a 5.5 percent share for a bin that would otherwise hold about 2 percent. Compared to the earlier outliers in the same data, this seems acceptable. 

In [None]:
# import module
import plotly.express as px

# set plot parameters
fig = px.histogram(df, x="RelAcc_2", nbins = 100, histnorm = 'percent')
fig.data[0].marker.color = "orange"
fig.data[0].marker.line.width = 2
fig.data[0].marker.line.color = "black"

# set plot title
fig.update_layout(
    title={
        'text': "Relative acceleration datapoints divided by percentage",
        'y':0.95,
        'x':0.5,
        'xanchor': 'center',
        'yanchor': 'top'})

# show plot
fig.show()

At first glance, increased relative acceleration seems to correlate slightly with 4.18+ yard run plays. However - just like the position group acceleration values - it is more of a complementary factor in a larger setting than a decisive factor by itself.

In [None]:
# set plot parameters
sns.set(font='sans-serif', palette='colorblind', font_scale=1.5) 
sns.lineplot(y='PlayYards', x='RelAcc_2', data=df, hue='Yds4_18')

Below is a plot showing the number of 4.18+ yard rush plays compared to relative acceleration of 2.32 or more.

As one can see, in both categories 4.18+ yard run plays form roughly a quarter of all datapoints included. This means that increasing the value of relative acceleration does not significantly increase the likelihood of a 4.18+ yard run by ball carrier.

Thus a blind guess before a run play in a football game (let's assume we know it will be a run play) is right three out of four times, if the guess predicts the run to be less than 4.18 yards. Conversely, a guess predicting a longer run would on average be correct in one out of four cases.

If we could create a machine learning model able to predict 4.18+ yard run plays with better accuracy than 25 percent as well as shorter runs with 75+ percent accuracy, it could in this context be considered a relative success compared to such blind guessing.
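
The guessing argument corresponds to what is often called a majority-class baseline: always predict the more common outcome and see how often that is right. A minimal sketch on invented labels follows; the names `share_long` and `baseline_accuracy` are assumptions for illustration.

```python
import pandas as pd

# Toy labels: 1 = run longer than 4.18 yards, 0 = otherwise (invented values)
toy = pd.DataFrame({"Yds4_18": [0, 0, 0, 1, 0, 1, 0, 0]})

# The mean of a 0/1 column is the share of rows labeled 1
share_long = toy["Yds4_18"].mean()

# Accuracy obtained by always predicting the more frequent class
baseline_accuracy = max(share_long, 1 - share_long)
print(f"share of long runs: {share_long:.2f}, baseline accuracy: {baseline_accuracy:.2f}")
```

Any model would need to beat this overall accuracy to show that it has learned something beyond the class frequencies.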


In [None]:
# create a new column RelAcc2_32
df.loc[:,'RelAcc2_32'] = df['RelAcc_2']

# set new column value 0 if RelAcc_2 is equal or less than 2.32, 1 if more
f = lambda x: 0 if x <= 2.32 else 1
df['RelAcc2_32'] = df['RelAcc2_32'].map(f)

# define plot size, color etc.
sns.set(rc={'figure.figsize':(9.7,7.27)})
sns.set(font='sans-serif', palette='colorblind')

# set plot parameters
plot = sns.countplot(x = 'Yds4_18',
              data = df,
              hue = 'RelAcc2_32',
              order = df['Yds4_18'].value_counts().index)

# set plot title etc.
plot.axes.set_title('4.18 yards threshold divided by relative acceleration (2.32)',fontsize=18)
plot.set_xlabel("0 = 4.18 yards run or less, 1 = more",fontsize=18)
plot.set_ylabel("Total count of rows",fontsize=18)
plot.tick_params(labelsize=14)
plot.legend (loc=1, fontsize = 16, fancybox=True, framealpha=1, shadow=True, borderpad=1, title = '0 = 2.32 or less, 1 = more')

# show plot
plt.show()

The value first created for further analysis was euc, which describes the Euclidean distance between running back and linebacker (or linebackers). As noted, at the beginning of the play both teams are by rule lined up at the line of scrimmage.

As the histogram below shows, the two peaks - running back and the average position of a linebacker at the line of scrimmage - together form some two thirds of all datapoints. It seems reasonable to assume that the shorter-distance peak consists of running back vs. middle linebacker distances, and that the other one describes the distance between the running back and the outside linebackers on both sides of the defensive line, closer to the sidelines.

We can also make the assumption that the lower of the two peaks - the one with the wider range of values - describes the linebacker positions, since in the dataset there are 1-4 linebackers on the field depending on the play. The running back typically lines up next to the quarterback for the handoff, whereas linebackers are spread out in a wider formation.


In [None]:
# import module
import plotly.express as px

# set plot parameters
fig = px.histogram(df, x="euc", nbins = 100, histnorm = 'percent')
fig.data[0].marker.color = "orange"
fig.data[0].marker.line.width = 2
fig.data[0].marker.line.color = "black"

# set plot title
fig.update_layout(
    title={
        'text': "Euclidean distance datapoints divided by percentage",
        'y':0.95,
        'x':0.5,
        'xanchor': 'center',
        'yanchor': 'top'})

# show plot
fig.show()

As the plot shows, there may be some outliers among the Euclidean distance values. According to the euc.describe() output, the 69613 datapoints have a mean value of 6.46 with a standard deviation of 1.77. The third quartile (75th percentile) is 6.88, and the maximum value is 48.6.

However, the histogram above shows no clusters larger than 0.2 percent after the euc value reaches 12.

Based on this, a new column euc_12 is created splitting the euc values in two groups, one consisting of values 12.0 and smaller, leaving the other one for values larger than 12.0.


In [None]:
# create a new column euc_12
df.loc[:,'euc_12'] = df['euc']

# set new column value 0 if euc is equal or less than 12.0, 1 if more
f = lambda x: 0 if x <= 12.0 else 1
df['euc_12'] = df['euc_12'].map(f)

# print out the number of 0 and 1 values in the new column
euc_count = df['euc_12'].value_counts()
print (euc_count)

<hr>
The printout shows that only about 0.01 of all euc datapoints in the dataset have a value larger than 12.

Based on this, next euc values larger than 12 will be removed, and they will be replaced with the average euc value. This will be done by creating a new column euc_2.
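The replacement described above can also be written in one step with pandas Series.where. The sketch uses made-up toy values, not the real column, and it is an alternative way of writing the idea rather than the code this notebook runs.

```python
import numpy as np
import pandas as pd

# toy distances; 48.0 stands in for an outlier above the 12.0 cutoff
toy = pd.DataFrame({'euc': [0.0, 5.0, 7.0, 48.0]})

# mean of the original column, rounded to two digits as in this notebook
toy_mean = np.round(toy['euc'].mean(), 2)

# keep values up to 12.0, substitute the mean everywhere else
toy['euc_2'] = toy['euc'].where(toy['euc'] <= 12.0, toy_mean)
print(toy)
```

One difference worth knowing: this version leaves a genuine 0 value untouched, whereas a route that temporarily marks outliers with 0 and then replaces every 0 cannot tell the two apart.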


In [None]:
# store euc_mean
euc_mean = df['euc'].mean()

# round euc_mean to two digits
euc_mean = np.round(euc_mean, 2)

# create new column euc_2 where value is 0 if euc is greater than 12.0 (None otherwise)
df['euc_2'] = df['euc'].apply(lambda x: None if x <= 12.0 else 0)

# get other column values from euc
df['euc_2'] = df['euc_2'].fillna(df['euc']) 

# replace 0 values with NaN (note: this also catches any genuine 0 values in euc)
df.euc_2 = df.euc_2.replace(0, np.nan)

# replace NaN values with average euc value stored in euc_mean
df['euc_2'] = df['euc_2'].fillna(euc_mean)

# drop previous columns for Euclidean distance as well as RelAcc2_32
euc_cols = ['euc', 'euc_12', 'RelAcc2_32']
df = df.drop(euc_cols, axis=1)

# df.head(10)

Now the two clusters in the distance values show even more clearly:

In [None]:
# set histogram parameters
fig = px.histogram(df, x="euc_2", nbins = 100, histnorm = 'percent')
fig.data[0].marker.color = "orange"
fig.data[0].marker.line.width = 2
fig.data[0].marker.line.color = "black"

# set plot title
fig.update_layout(
    title={
        'text': "euc_2 column datapoints divided by percentage",
        'y':0.95,
        'x':0.5,
        'xanchor': 'center',
        'yanchor': 'top'})

# show histogram
fig.show()

As a rule of thumb, the farther a linebacker is from the ball carrier (running back), the smaller his chance, on average, of reaching the running back before the 4.18-yard threshold.

It is good to remember, though, that depending on the play call this is not always even the task. ***Playcalling is an essential part of football, and excluding it from the dataset inevitably limits the scope and relevance of any play analysis.***

One of the most talked-about plays in the 2020 Super Bowl was the '3rd & 15 play' in the fourth quarter. The Kansas City Chiefs quarterback Patrick Mahomes faced third down with 15 yards to gain, and the Chiefs were losing the game. The Chiefs' offensive formation was shotgun, which usually leads to a quick pass to a wide receiver along with a wish that the receiver can somehow gain the necessary yards after the catch. The shotgun formation is also risky, because it leaves the quarterback without protection if something goes wrong.

Instead of a regular shotgun play, Mahomes had called the 'wasp play', which in the Chiefs playbook meant one of the receivers going deep for a long pass. Mahomes backed up eleven steps (normally 5-7 is the maximum) and threw the ball for 55 yards. In the end the Chiefs scored a touchdown and turned the game around for a Super Bowl win.

Also, a mere description of personnel formation does not tell everything about the play. For example, a running back cannot - or shouldn't - run against a defensive blitz: 3-5 defensive players trying to get to the quarterback. Although blitzes most often happen with fewer defenders in the box, this is not always the case. The important point, however, is that a blitz may occur with the same defensive personnel as some other playcall - it cannot be deduced from the personnel list.
<hr>
The plot below shows that the likelihood of a 4.18+ yard run stays roughly the same regardless of whether the Euclidean distance is larger than average. Using the earlier coin toss example, there is a slightly larger than 50-50 chance for a 4.18+ yard run when the distance reaches the average value 6.46. However, the same also goes for runs below 4.18 yards, so the Euclidean distance by itself does not explain anything.

More likely, the classic run play rule is still valid also in the age of analytics: the run play should always be aimed at the linebacker considered weakest - either by speed, strength, acceleration, agility, experience or all of them combined.

In [None]:
# create a new column euc6_46
df.loc[:,'euc6_46'] = df['euc_2']

# set new column value 0 if euc is equal or less than 6.46, 1 if more
f = lambda x: 0 if x <= 6.46 else 1
df['euc6_46'] = df['euc6_46'].map(f)

# set plot size, color etc.
sns.set(rc={'figure.figsize':(9.7,7.27)})
sns.set(font='sans-serif', palette='colorblind')

# set plot parameters
plot = sns.countplot(x = 'Yds4_18',
              data = df,
              hue = 'euc6_46',
              order = df['Yds4_18'].value_counts().index)

# set plot title etc.
plot.axes.set_title('4.18 yards threshold divided by Euclidean distance (6.46)',fontsize=18)
plot.set_xlabel("0 = 4.18 yards run or less, 1 = more",fontsize=18)
plot.set_ylabel("Total count of rows",fontsize=18)
plot.tick_params(labelsize=14)
plot.legend (loc=1, fontsize = 16, fancybox=True, framealpha=1, shadow=True, borderpad=1, title = '0 = 6.46 or less, 1 = more')

# show plot
plt.show()

The linebacker speed has an average value of 2.59 (yds/second) with a standard deviation of 1.21. The third quartile is 3.31 and the maximum 7.77, meaning the linebacker speed data does not include any significant outliers.

As for running backs, the average value is 4.17 with a standard deviation of 0.89. The third quartile is 4.48, and the maximum is 8.5. This means both speed columns are usable as far as the integrity of the dataset is concerned.

The RelSpd value was created by dividing running back speed by linebacker speed (or by the average linebacker speed in a play). The aim was to see if a higher relative running back speed affects the probability of a 4.18+ yard run.

The average RelSpd value is 2.32 with a standard deviation of 4.64. The third quartile is 2.35, and the maximum is 417, meaning there are significant outliers.

The next plot inspects those outliers more closely.
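A ratio feature like RelSpd tends to produce extreme values whenever the denominator is close to zero. The sketch shows the effect with made-up speeds; that this is what produced the large values in the real column is an assumption, not something verified here.

```python
import pandas as pd

# toy speeds in yards per second; all values are made up for illustration
toy = pd.DataFrame({
    'spd_rb': [4.0, 4.2, 4.17],
    'spd_lb': [2.0, 2.8, 0.01],   # the last linebacker is almost standing still
})

# relative speed: running back speed divided by linebacker speed
toy['RelSpd'] = toy['spd_rb'] / toy['spd_lb']
print(toy['RelSpd'].round(2).tolist())
```

The first two rows land near the typical range, while the third explodes even though the running back's speed is perfectly ordinary.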


In [None]:
# set histogram parameters
fig = px.histogram(df, x="RelSpd", nbins = 200, histnorm = 'percent')
fig.data[0].marker.color = "orange"
fig.data[0].marker.line.width = 2
fig.data[0].marker.line.color = "black"

# set plot title
fig.update_layout(
    title={
        'text': "Relative speed datapoints divided by percentage",
        'y':0.95,
        'x':0.5,
        'xanchor': 'center',
        'yanchor': 'top'})

# show histogram
fig.show()

As one can see, only a very small percentage of all datapoints lies beyond the 12.0 mark, so we exclude all values above 12 using the method familiar from acceleration and distance.

In [None]:
# create a new column RelSpd_12
df.loc[:,'RelSpd_12'] = df['RelSpd']

# set new column value 0 if RelSpd is equal or less than 12.0, 1 if more
f = lambda x: 0 if x <= 12.0 else 1
df['RelSpd_12'] = df['RelSpd_12'].map(f)

# print out the number of 0 and 1 values in the new column
RelSpd_count = df['RelSpd_12'].value_counts()
print (RelSpd_count)

<hr>
Of almost 94000 values, the exclusion will remove only 990.

In [None]:
# store RelSpd_mean
RelSpd_mean = df['RelSpd'].mean()

# round RelSpd_mean to two digits
RelSpd_mean = np.round(RelSpd_mean, 2)

# create new column RelSpd_2 where value is 0 if RelSpd is greater than 12 (None otherwise)
df['RelSpd_2'] = df['RelSpd'].apply(lambda x: None if x <= 12 else 0)

# get other column values from RelSpd
df['RelSpd_2'] = df['RelSpd_2'].fillna(df['RelSpd']) 

# replace 0 values with NaN
df.RelSpd_2 = df.RelSpd_2.replace(0, np.nan)

# replace NaN values with average RelSpd value stored in RelSpd_mean
df['RelSpd_2'] = df['RelSpd_2'].fillna(RelSpd_mean)

# drop previous columns for RelSpd
relspd_cols = ['RelSpd', 'RelSpd_12']
df = df.drop(relspd_cols, axis=1)

# df.head(20)

In [None]:
df.RelSpd_2.describe()

<hr>
Next a new column RelSpd2_06 is created, with values equal to or less than the relative speed average from the above printout (2.06) marked as 0 and higher values as 1.

In [None]:
# create a new column RelSpd2_06
df.loc[:,'RelSpd2_06'] = df['RelSpd_2']

# set new column value 0 if RelSpd is equal or less than RelSpd_mean_2 (2.06), 1 if more
f = lambda x: 0 if x <= 2.06 else 1
df['RelSpd2_06'] = df['RelSpd2_06'].map(f)

# df.head(20)

As the plot below shows, when the relative speed of the running back vs. the linebacker or linebackers increases, there is a better chance for the ball carrier to reach 4.18+ yards. However, in terms of probability we are still talking about a coin toss, since any reasonable margin of error would cover both outcomes.
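Count plots make it hard to read off proportions. One way to put numbers on the "coin toss" is a row-normalized crosstab, sketched here on made-up toy rows that reuse this notebook's column names; the values do not come from the dataset.

```python
import pandas as pd

# toy data: 1 = above threshold, 0 = at or below; values are made up
toy = pd.DataFrame({
    'RelSpd2_06': [0, 0, 0, 0, 1, 1, 1, 1],
    'Yds4_18':    [0, 0, 0, 1, 0, 0, 1, 1],
})

# share of short (0) and long (1) runs within each relative speed group
shares = pd.crosstab(toy['RelSpd2_06'], toy['Yds4_18'], normalize='index')
print(shares)
```

Each row sums to 1, so the column for Yds4_18 = 1 directly gives the share of long runs in the slow and fast groups.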

In [None]:
# set plot size, color etc.
sns.set(rc={'figure.figsize':(9.7,7.27)})
sns.set(font='sans-serif', palette='colorblind')

# set plot parameters
plot = sns.countplot(x = 'Yds4_18',
                data = df,
                hue = 'RelSpd2_06',
                order = df['Yds4_18'].value_counts().index)

# set plot title etc.
plot.axes.set_title('4.18 yards threshold divided by relative speed (2.06)',fontsize=18)
plot.set_xlabel("0 = 4.18 yards run or less, 1 = more",fontsize=18)
plot.set_ylabel("Total count of rows",fontsize=18)
plot.tick_params(labelsize=14)
plot.legend (loc=1, fontsize = 16, fancybox=True, framealpha=1, shadow=True, borderpad=1, title = '0 = 2.06 or less, 1 = more')

# show plot
plt.show()

As argued earlier, football is a situational game especially when it comes to running plays. For example, if the ball carrier is near his own goal line, there is a risk of him fumbling the ball, which would probably result in a defensive touchdown. Conversely, in the same setting linebackers have the majority of the field behind them, meaning there is a risk of them being too aggressive and leaving the field open for a pass or an easy long run.

Below is a histogram of the X position of running backs in the dataset. There are two visible clusters in the histogram, but in fact they both describe the same thing. This is explained in the plot after the histogram.


In [None]:
# set histogram parameters
fig = px.histogram(df, x="X_rb", nbins = 100, histnorm = 'percent')
fig.data[0].marker.color = "orange"
fig.data[0].marker.line.width = 2
fig.data[0].marker.line.color = "black"

# set plot title
fig.update_layout(
    title={
        'text': "Running back X position datapoints divided by percentage",
        'y':0.95,
        'x':0.5,
        'xanchor': 'center',
        'yanchor': 'top'})

# show histogram
fig.show()

The YardLine column in the dataset describes where on the field the play started from. One must keep in mind here that a football field consists of two halves, each having the same yard lines from 0 to 49, 0 being the end zone goal line. The emphasis on the 25-yard line is explained by the rules of football: that is where the plays start from after a touchdown or touchback. This is also why the 25-yard line showed up twice in the earlier histogram.

In [None]:
# set histogram parameters
fig = px.histogram(df, x="YardLine", nbins = 100, histnorm = 'percent')
fig.data[0].marker.color = "orange"
fig.data[0].marker.line.width = 2
fig.data[0].marker.line.color = "black"

# set plot title
fig.update_layout(
    title={
        'text': "YardLine column datapoints divided by percentage",
        'y':0.95,
        'x':0.5,
        'xanchor': 'center',
        'yanchor': 'top'})

# show histogram
fig.show()

It is customary for teams to use run plays on first and second down. This is because gaining a first down requires advancing the ball for 10 yards, and getting halfway by running the ball makes the following pass plays much more likely to succeed.


In [None]:
# set histogram parameters
fig = px.histogram(df, x="Down", nbins = 8, histnorm = 'percent')
fig.data[0].marker.color = "orange"
fig.data[0].marker.line.width = 4
fig.data[0].marker.line.color = "black"

# set plot title
fig.update_layout(
    title={
        'text': "Down column datapoints divided by percentage",
        'y':0.95,
        'x':0.5,
        'xanchor': 'center',
        'yanchor': 'top'})

# set x ticks
fig.update_xaxes(nticks = 4)

# show histogram
fig.show()

The likelihood for successful 4.18+ yard run does not correlate with down, as the plot below shows.

In [None]:
# set plot size, color etc.
sns.set(rc={'figure.figsize':(9.7,7.27)})
sns.set(font='sans-serif', palette='colorblind')

# set plot parameters
plot = sns.countplot(x = 'Yds4_18',
                data = df,
                hue = 'Down',
                order = df['Yds4_18'].value_counts().index)

# set plot title etc.
plot.axes.set_title('4.18 yards threshold divided by down (1-4)',fontsize=24)
plot.set_xlabel("0 = 4.18 yards run or less, 1 = more",fontsize=18)
plot.set_ylabel("Total count of rows",fontsize=18)
plot.tick_params(labelsize=14)
plot.legend (loc=1, fontsize = 16, fancybox=True, framealpha=1, shadow=True, borderpad=1, title = 'Down (1-4)')

# show plot
plt.show()

<div class="h3"><i>Concluding the job, no significant features correlating with 4.18+ yard runs were found in the analysis.
<br>
<br>
It is now time to introduce a chimpanzee with dart.
</i></div>
<hr>

### 4. (How To Become) ML Model

***Referring to the disclaimer in the beginning, I am not a coder or data scientist. Therefore I have no skills for building high-end machine learning custom models.***

If one takes a look at the NFL Big Data Bowl 2020 results, the winning models were excellent in predicting how many yards the ball carrier will reach in a play - I could never come up with anything similar. All content in this notebook is practice only, and reading through is useful only if you are an absolute beginner just like me.

This is where the chimpanzee with dart stems from. A couple of years ago in Russia, Lusha the chimpanzee took on some of the top bankers in the country and beat 94 percent of them in making lucrative investment decisions. Some serious questions were raised afterwards about the wunderkinds of Russian financial sector with their ridiculous bonuses. Lusha made the high-paid professionals look like clowns, which was an accomplishment in itself considering that it was Lusha who actually worked in circus for a living.

Keeping this in mind, it is not useful for me trying to copy-paste existing ML model gurus in this notebook. Rather, I will start from the other end.

**Next I will imagine myself as an ML model, which in the beginning knows nothing about what it should actually do.**

I start by using 4.18+ yard runs as the deciding factor in the dataset:

In [None]:
# set plot size, color etc.
sns.set(rc={'figure.figsize':(9.7,7.27)})
sns.set(font='sans-serif', palette='colorblind')

# set plot parameters
plot = sns.countplot(x = 'Yds4_18',
              data = df,
              order = df['Yds4_18'].value_counts().index)

# set plot title and x/y labels
plot.axes.set_title('Total count of dataset rows divided by 4.18+ run plays',fontsize=18)
plot.set_xlabel("0 = 4.18 yards or less, 1 = more",fontsize=18)
plot.set_ylabel("Total count of rows",fontsize=18)
plot.tick_params(labelsize=14)

#show the plot
plt.show()

The same as printout:

In [None]:
run4_18_count = df['Yds4_18'].value_counts()
print (run4_18_count)

<hr>
67 percent of all datapoints in the dataset are from runs that did not reach the 4.18 yard benchmark, which is the average value of PlayYards column. Thus, for a model trying to predict longer runs, only one third of the dataset is from that perspective relevant.

As an aspiring machine learning model, I would still start from the larger portion of the dataset. Next I will treat runs of less than 0 yards as an anomaly, and anything longer than four yards as temporarily irrelevant. After all, in a coin toss I would already have a 67-percent chance of predicting a run to result in less than four yards.
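The head start described above is usually called a majority-class baseline: always guess the more common label. A minimal sketch with scikit-learn's DummyClassifier, using toy labels that mirror the 67/33 split rather than the real dataframe:

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# toy labels: 67 short runs (0) and 33 long runs (1), mirroring the split above
y_toy = np.array([0] * 67 + [1] * 33)
X_toy = np.zeros((len(y_toy), 1))   # features are ignored by the dummy model

# always predict the most frequent class
baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(X_toy, y_toy)
baseline_accuracy = baseline.score(X_toy, y_toy)
print(baseline_accuracy)
```

Any real model has to beat this number before its accuracy means anything.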

Below is the PlayYards column plotted by yards:


In [None]:
# set histogram parameters
fig = px.histogram(df, x="PlayYards", nbins = 100, histnorm = 'percent')
fig.data[0].marker.color = "orange"
fig.data[0].marker.line.width = 2
fig.data[0].marker.line.color = "black"

# set plot title
fig.update_layout(
    title={
        'text': "PlayYards column datapoints divided by percentage",
        'y':0.95,
        'x':0.5,
        'xanchor': 'center',
        'yanchor': 'top'})

# show histogram
fig.show()

If the short run plays are inspected more closely, we can see that the number of datapoints starts to come down already after two yards:

In [None]:
# create new dataframe including only PlayYards values between 0-4
df_2 = df[(df['PlayYards']>= 0 ) & (df['PlayYards']<= 4)]

# plot histogram with yards between 0-4
fig = px.histogram(df_2, x="PlayYards", nbins = 16, histnorm = 'percent')
fig.data[0].marker.color = "orange"
fig.data[0].marker.line.width = 4
fig.data[0].marker.line.color = "black"

# set plot title
fig.update_layout(
    title={
        'text': "PlayYards column datapoints divided by yards 0-4",
        'y':0.95,
        'x':0.5,
        'xanchor': 'center',
        'yanchor': 'top'})

# set x ticks
fig.update_xaxes(nticks=5)

# show plot
fig.show()

Altogether, there's a data-journalistic minor scoop above.

**Of 0-4 yard runs, about 60 percent of plays in football will not go longer than two yards. This means that the running back never confronts linebackers in those plays.**

If I were an aspiring machine learning model, this would be more good news. First I could get two thirds of predictions right only by predicting runs to result in less than four yards, and now selecting two yards or less will give me 6 cases out of 10 correct in the subcategory 0-4 yards. Also, two yards would be the best individual guess with 23 percent of all datapoints in the 0-4 yards selection.

Now let's do the same with yards between 5-10:

In [None]:
# create new dataframe including only PlayYard values between 5-10
df_3 = df[(df['PlayYards']>= 5 ) & (df['PlayYards']<= 10)]

# plot histogram with yards between 5-10
fig = px.histogram(df_3, x="PlayYards", nbins = 16, histnorm = 'percent')
fig.data[0].marker.color = "orange"
fig.data[0].marker.line.width = 4
fig.data[0].marker.line.color = "black"

# set plot title
fig.update_layout(
    title={
        'text': "PlayYards column datapoints divided by yards 5-10",
        'y':0.95,
        'x':0.5,
        'xanchor': 'center',
        'yanchor': 'top'})

# show plot
fig.show()

<div class="h4"><i>
In longer run plays, 5-7 yard runs form two thirds of all datapoints just like 0-2 yard runs in the other subcategory. My strategy as an aspiring ML model would be to go for 5-7 if I think the run will be more than average, and 0-2 if I think the run will not reach the 4.18 yard average threshold.
<br>    
<br>
Divided between positive yardage 0-99, that's only six datapoints out of 100, meaning hitting those six datapoints correctly would give me a decent chance for a baseline case.
 <hr>
</i></div>

However, the original Big Data Bowl contest was not about long runs: it was about predicting the overall yardage in any run play. Everything so far has indicated that predicting 4.18+ yard runs with an ML model and the dataset in use is - frankly - not possible.

In fact, an orangutan with dart would probably fare better.

Thus the expectations are low, and this is not a problem. I don't know much of anything about machine learning, and I am only doing this for the practice. Were I hired to do this analysis with some relevant results expected, at this point I would run for the hills for sure.

Earlier I contended that if we could create "machine learning model able to predict 4.18+ yard run plays with better accuracy than 25 percent as well as shorter runs with a 75+ percent accuracy, it could in this context be considered as a relative success."

I now get back to being an aspiring ML model myself and start inspecting which values in the dataset would be most useful in beating Lusha the chimpanzee in prediction game.


In [None]:
# import modules 
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# show dataframe column names
df.columns

<hr>
We already know that there is no 'golden column' in the dataset able to show us the road to 4.18+ yard runs. Therefore I will start by taking a wide selection of different values and see what the algorithm thinks about them. If you just opened this notebook by chance, there is more about these values in the Analysis section.

In [None]:
# select the desired features as X
X = df[['Is3Wr', 'Is3Lb', 'Is4Lb', 'acc_lb', 'acc_rb', 'spd_lb', 
               'spd_rb','RelAcc_2', 'euc_2','RelSpd_2']]

# select the labels as y (in this case 4.18+ yard runs)
y = df['Yds4_18']

# perform train-test split, train data 80%, fixed random state for reproducibility
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8, random_state = 0)

# print train-test dataset sizes
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

<hr>
The printout shows that I - an aspiring ML model - now have ten columns and 75826 datapoints to train myself with, and 18957 datapoints are left to evaluate my performance compared to the aforementioned chimpanzee with dart.

In [None]:
# import module
from sklearn.preprocessing import StandardScaler

# scale the feature data
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# create and fit the Logistic Regression model
model = LogisticRegression(solver='lbfgs', class_weight='balanced')
model.fit(X_train, y_train)

# print the scores
print (model.score(X_train, y_train))
print (model.score(X_test, y_test))

<hr>
So far I - an aspiring ML model - know what I already knew after analysing the data.

I am practically no better than a coin toss since I haven't really learned anything new in my modeling career yet.

In [None]:
# print the coefficients
print(model.coef_)

<hr>
Looking at the coefficients (how much a specific feature affects the labels, in this case 4.18+ runs), this is no wonder. 

Our columns simply are not very helpful in predicting 4.18+ yard runs. Of the ten feature columns used, number five in the list (running back acceleration) looks like the most significant factor if the Logistic Regression algorithm is asked.

In [None]:
# print each feature with its respective coefficient value
print(list(zip(['Is3Wr', 'Is3Lb', 'Is4Lb', 'acc_lb', 'acc_rb', 'spd_lb', 
               'spd_rb','RelAcc_2', 'euc_2','RelSpd_2'],model.coef_[0])))

<hr>
Double-checking the coefficients by feature (i.e. column) name confirms this, and so does the confusion matrix below:

In [None]:
# import module
from sklearn.metrics import confusion_matrix

# set model prediction
y_pred = model.predict(X_test)

# compute and print the confusion matrix (stored as cm so the imported function is not overwritten)
cm = confusion_matrix(y_test, y_pred)
print(cm)

 <hr>
 As the printout shows, there are 7410 correct and 5296 incorrect predictions for short runs and 3565 correct predictions compared to 2686 incorrect ones for longer 4.18+ yard runs.
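 The counts quoted above are enough to work out accuracy and the F1-scores by hand. The sketch assumes the usual scikit-learn layout (rows are actual classes, columns are predicted classes), with short runs as class 0 and long runs as class 1.

```python
import numpy as np

# counts quoted above: rows = actual (short, long), columns = predicted (short, long)
cm = np.array([[7410, 5296],
               [2686, 3565]])

def f1_for_class(cm, k):
    """F1-score for class k from a confusion matrix."""
    tp = cm[k, k]
    precision = tp / cm[:, k].sum()   # of everything predicted as k
    recall = tp / cm[k, :].sum()      # of everything that really is k
    return 2 * precision * recall / (precision + recall)

accuracy = float(np.trace(cm) / cm.sum())
f1_short = float(f1_for_class(cm, 0))
f1_long = float(f1_for_class(cm, 1))
print(round(accuracy, 3), round(f1_short, 2), round(f1_long, 2))
```

 Overall accuracy comes out at roughly 58 percent, which is below what always guessing "short run" would give on the same 18957 test rows.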
 
 The classification report below returns similar figures, with an F1-score of 0.65 (1 is the best score, 0 is the worst) for short runs but only 0.47 for longer ones:

In [None]:
# import module
from sklearn.metrics import classification_report

# print report
print(classification_report(y_test, y_pred))

<hr>
As shown below, building a random forest classifier from the same dataset does not improve anything - if anything, the opposite.

To put it bluntly, we simply have nothing relevant in our dataset to predict 4.18+ yard runs.

In [None]:
# import module
from sklearn.ensemble import RandomForestClassifier

# select the desired features as X
X = df[['Is3Wr', 'Is3Lb', 'Is4Lb', 'acc_lb', 'acc_rb', 'spd_lb', 
               'spd_rb','RelAcc_2', 'euc_2','RelSpd_2']]

# select the labels as y (in this case 4.18+ yard runs)
y = df['Yds4_18']

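# perform train-test split, train data 80%, fixed random state for reproducibility
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8, random_state = 0)

# scale the feature data (fit on the training split only)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)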

# define and fit Random Forest Classifier model
classifier = RandomForestClassifier(random_state=0, max_depth = 8, n_estimators = 100)
classifier.fit(X_train, y_train)

# set model prediction
y_pred = classifier.predict(X_test)

# print report
print(classification_report(y_test, y_pred))

<hr>
One other way of evaluating the dataset within a model is a ROC curve, ROC being an abbreviation of Receiver Operating Characteristic. 

If the model works well, the ROC curve rises steeply toward a True Positive Rate of 1 while the False Positive Rate is still low, and only then bends right along the top of the plot, so that the area under the curve is as large as possible.

By just throwing our dataset in, one can see below that the ROC curve (orange line) covers only slightly more area than a 50-50 coin toss (the straight line in the middle) or the now-infamous Lusha.

This basically means there is no predictive model available using our dataset.
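The "area under the curve" can also be reduced to a single number with scikit-learn's roc_auc_score: 0.5 corresponds to the diagonal line and 1.0 to a perfect model. The sketch uses synthetic labels and scores only, so the printed values say nothing about the football data.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

# toy labels with roughly one third positives, as in this dataset
y_toy = (rng.random(5000) < 0.33).astype(int)

# scores unrelated to the labels: the "dart throwing" case
random_scores = rng.random(5000)

# scores that carry some signal: label plus noise
informative_scores = y_toy + rng.normal(0, 1, 5000)

auc_random = roc_auc_score(y_toy, random_scores)
auc_informative = roc_auc_score(y_toy, informative_scores)
print(round(auc_random, 3), round(auc_informative, 3))
```

For a fitted classifier the same call would take the test labels and the positive-class probabilities from predict_proba.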


In [None]:
# import module
from sklearn.metrics import roc_curve

# define function for ROC curve
def plot_roc_curve(fper, tper):  
    plt.plot(fper, tper, color='orange', label='ROC')
    plt.plot([0, 1], [0, 1], color='darkblue', linestyle='--')
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('Receiver Operating Characteristic (ROC) Curve')
    plt.legend()
    plt.show()

# set predictions        
probs = classifier.predict_proba(X_test)  
probs = probs[:, 1]  
fper, tper, thresholds = roc_curve(y_test, probs) 

# plot ROC curve
plot_roc_curve(fper, tper)


Even as a mere aspiring ML model, I start to get the hang of what's going on: not much.

The Gradient Booster algorithm can do a lot, but even it cannot make an irrelevant dataset relevant. And we do have next to nothing to boost.

Just for fun, let's first run the learning rate code below to see what would be the best learning rate value for Gradient Booster.

Learning rates below 1.0 shrink the correction each new tree makes to the model: it is common to use small values in the range of 0.1 to 0.3, or even values below 0.1.

In [None]:
# import modules
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report

# select the desired features as X
X = df[['Is3Wr', 'Is3Lb', 'Is4Lb', 'acc_lb', 'acc_rb', 'spd_lb', 
               'spd_rb','RelAcc_2', 'euc_2','RelSpd_2']]

# select the labels as y (in this case 4.18+ yard runs)
y = df['Yds4_18']

# perform train-test split, train data 80%, fixed random state for reproducibility
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8, random_state = 0)

# scale the feature data (fit on the training split only)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# define list of learning rates to test
lr_list = [0.05, 0.075, 0.1, 0.25, 0.5, 1, 2]

# check which learning rate is the best and print the result
for learning_rate in lr_list:
    gb_clf = GradientBoostingClassifier(n_estimators= 32, learning_rate=learning_rate, max_features=2, max_depth=2, random_state=0)
    gb_clf.fit(X_train, y_train)   
    print("Learning rate: ", learning_rate)
    print("Accuracy score (training): {0:.3f}".format(gb_clf.score(X_train, y_train)))
    print("Accuracy score (test): {0:.3f}".format(gb_clf.score(X_test, y_test)))

<hr>
Learning rate 0.25 looks like the top result for the test dataset (this may vary depending on running the code).

I know that the score is really low, but as an aspiring ML model I will go for that:

In [None]:
# import modules
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report

# select the desired features as X
X = df[['Is3Wr', 'Is3Lb', 'Is4Lb', 'acc_lb', 'acc_rb', 'spd_lb', 
               'spd_rb','RelAcc_2', 'euc_2','RelSpd_2']]

# select the labels as y (in this case 4.18+ yard runs)
y = df['Yds4_18']

# perform train-test split, train data 80%, fixed random state for reproducibility
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8, random_state = 0)

# scale the feature data (fit on the training split only)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# fit the Gradient Booster model and set predictions
gb_clf2 = GradientBoostingClassifier(n_estimators=32, learning_rate=0.25, max_features=2, max_depth=3, random_state=0)
gb_clf2.fit(X_train, y_train)
predictions = gb_clf2.predict(X_test)

#print results
print("Confusion Matrix:")
print(confusion_matrix(y_test, predictions))

print("Classification Report")
print(classification_report(y_test, predictions))

<hr>
As the confusion matrix shows, the Gradient Booster classifier got 12592 short runs right compared to 114 wrong predictions. This is a success rate of 99.1 percent, which would be excellent were it not for the long, 4.18+ yard runs. In that category, 114 predictions were correct compared to 6137 incorrect ones, meaning that here Lusha's dart predictions would reign victorious forever.
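The numbers above illustrate why plain accuracy is misleading on an unbalanced dataset. A small worked example with the quoted counts, assuming rows are actual classes and columns are predicted classes:

```python
import numpy as np

# counts quoted above: rows = actual (short, long), columns = predicted (short, long)
cm = np.array([[12592, 114],
               [6137, 114]])

recall_per_class = cm.diagonal() / cm.sum(axis=1)        # share of each actual class found
accuracy = float(cm.diagonal().sum() / cm.sum())
balanced_accuracy = float(recall_per_class.mean())
majority_share = float(cm.sum(axis=1).max() / cm.sum())  # always guessing "short"

print(recall_per_class.round(3), round(accuracy, 3),
      round(balanced_accuracy, 3), round(majority_share, 3))
```

Accuracy lands at about 67 percent, exactly what always guessing "short" would achieve, while balanced accuracy (the mean of the two recalls) stays near 0.5. In other words the classifier has mostly learned to predict the majority class.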

Then again, this practice notebook is all about trial and error, not winning an ML modeling contest. Let's just go on and run the code below:


In [None]:
# import modules
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error
from math import sqrt

# select the desired features as X
X = df[['Is3Wr', 'Is3Lb', 'Is4Lb', 'acc_lb', 'acc_rb', 'spd_lb', 
               'spd_rb','RelAcc_2', 'euc_2','RelSpd_2']]

# select the labels as y (in this case 4.18+ yard runs)
y = df['Yds4_18']

# scale the feature data
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# define and fit the K nearest neighbors model
model = KNeighborsRegressor(n_neighbors=9)
model.fit(X_train, y_train)

# calculate the errors for our training data
mse = mean_squared_error(y_train, model.predict(X_train))
mae = mean_absolute_error(y_train, model.predict(X_train))

# print results
print("mean squared error = ",mse," & mean absolute error = ",mae," & root mean squared error = ", sqrt(mse))

<hr>
The clearest error metric above is the mean absolute error. In our dataset, the average prediction was 0.18 away from the real datapoint.

Again, 0 is the best value compared to 1, so at least in theory I - an aspiring ML model - would not be doing that badly were it not for those unpredictable longer run plays.

We can calculate the same on the test data:


In [None]:
# calculate the errors for our test data
test_mse = mean_squared_error(y_test, model.predict(X_test))
test_mae = mean_absolute_error(y_test, model.predict(X_test))

# print results
print("mean squared error = ",test_mse," & mean absolute error = ",test_mae," & root mean squared error = ", sqrt(test_mse))

<hr>


The fact that those error rates are larger in test set indicates that there is overfitting going on. In this case this is not an issue since I am an aspiring ML model, not a ML catwalk professional.

The reason for overfitting most likely lies in choices made earlier. As the features (i.e. columns) used in evaluation don't correlate with our labels (i.e. whether the run will reach 4.18 yards), the model falls back on the more likely event (a short run play) in its predictions.
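The effect described above can be reproduced on made-up data. The sketch below is an illustration under loud assumptions: the features are pure random noise, the label is 1 roughly one third of the time, and none of it comes from the notebook's dataset. It compares a model against a baseline that always predicts the most frequent class.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# synthetic stand-in data (an assumption, not the notebook's dataset):
# 10 pure-noise features and a label that is 1 about one third of the time
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(3000, 10))
y_demo = (rng.random(3000) < 0.33).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, train_size=0.8, random_state=0)

# baseline that always predicts the most frequent class
baseline = DummyClassifier(strategy="most_frequent").fit(X_tr, y_tr)

# a real model trained on features that carry no signal
knn_demo = KNeighborsClassifier(n_neighbors=7).fit(X_tr, y_tr)

baseline_acc = baseline.score(X_te, y_te)
knn_train_acc = knn_demo.score(X_tr, y_tr)
knn_test_acc = knn_demo.score(X_te, y_te)

print("baseline accuracy: ", baseline_acc)
print("KNN train accuracy:", knn_train_acc)
print("KNN test accuracy: ", knn_test_acc)
```

When the features carry no information, a model tends to look better on the training data than on the test data, and it struggles to beat the boring always-guess-the-common-class baseline. That is a useful yardstick to keep next to any accuracy figure on an imbalanced label.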

Knowing what we know, it is also unlikely that the K Neighbors classifier algorithm would be some kind of game saver here. After all, it bases each prediction on the labels of the surrounding datapoints, and so far nothing we have tried has managed to predict 4.18+ yard runs.

In [None]:
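# import modules
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report, confusion_matrix

# select the desired features as X
X = df[['Is3Wr', 'Is3Lb', 'Is4Lb', 'acc_lb', 'acc_rb', 'spd_lb', 
               'spd_rb','RelAcc_2', 'euc_2','RelSpd_2']]

# select the labels as y (in this case 4.18+ yard runs)
y = df['Yds4_18']

# perform train-test split, train data 80%
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8, random_state=0)

# scale the feature data (the scaler is fitted on the training set only)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# define and fit K nearest neighbor classifier model
classifier = KNeighborsClassifier(n_neighbors=7)
classifier.fit(X_train, y_train)

# set predictions
y_pred = classifier.predict(X_test)

# store the matrix under its own name so the imported function is not overwritten
cm = confusion_matrix(y_test, y_pred)

# print results
print(cm)
print(classification_report(y_test, y_pred))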

<hr>
The K Neighbors classifier predicted 83 percent of short runs correctly, whereas the success rate on longer runs was considerably lower (less than 23 percent).

The key value in the K Neighbors algorithm is K, which describes the number of neighboring datapoints used in classification. Just think about drawing a closed area around a datapoint and expanding it gradually to fit more datapoints, i.e. nearest neighbors.

K=1 would mean that only the nearest neighbor is taken into account. With this approach, someone living in a white house next to 1600 Pennsylvania Avenue could also be classified as the President of the United States based on his neighbor's house color, if the model were asked.

Of course this would be incorrect, or 'overfitting', as they say. Then again, too large a K value results in underfitting. Just imagine drawing a circle around half a million people celebrating a Super Bowl win in downtown Kansas City and trying to predict whether those people have consumed whiskey. Undoubtedly we would get a positive response, but the response would be the same for vodka, beer and horse tranquilizers. That last item is no claim about Kansas City Chiefs fans: there were actual horses in the victory parade.
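Both extremes are easy to demonstrate. The sketch below runs on synthetic data from `make_classification` (an assumption for illustration, not the football dataset) and compares K=1, K=7 and a K as large as the whole training set.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# synthetic, imbalanced data with some label noise (flip_y)
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=10, n_informative=4,
    weights=[0.67, 0.33], flip_y=0.1, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, train_size=0.8, random_state=0)

# train and test accuracy for a tiny, a moderate and a huge K
scores = {}
for k in (1, 7, len(X_tr)):
    knn_demo = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    scores[k] = (knn_demo.score(X_tr, y_tr), knn_demo.score(X_te, y_te))
    print(f"K={k:>3}  train accuracy={scores[k][0]:.3f}  "
          f"test accuracy={scores[k][1]:.3f}")
```

With K=1 every training point is its own nearest neighbor, so the training accuracy is perfect while the test accuracy is not: the model has memorized the noise. With K equal to the size of the training set, every prediction is simply the majority class, which is the whiskey-and-horse-tranquilizers situation in numbers.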

A common rule of thumb is to start with K values such as 5 or 7, but it is also possible to look for a suitable K value by calculating the error rate for a range of candidates:


In [None]:
# create an empty list for error values
error = []

# loop over candidate K values and record the test error rate of each
for i in range(1, 40):
    knn = KNeighborsClassifier(n_neighbors=i)
    knn.fit(X_train, y_train)
    pred_i = knn.predict(X_test)
    error.append(np.mean(pred_i != y_test))

# create plot for visualizing the results    
plt.figure(figsize=(12, 6))
plt.plot(range(1, 40), error, color='red', linestyle='dashed', marker='o',
         markerfacecolor='blue', markersize=10)

# set plot title etc.
plt.title('Error Rate K Value')
plt.xlabel('K Value')
plt.ylabel('Mean Error')

# show plot
plt.show()

What we would usually hope to see is a line coming down to a low error rate when the K value is somewhere between 5 and 10. However, this is not the case in our dataset.

Then again, the plot above describes our problem perfectly. Because predicting longer runs is not possible within the dataset, it raises the error rate so that there are no meaningful K values available. 

For example, the error rate of 0.38 for K=5 is ridiculously high. As mentioned, the error rate should be far lower at that point. Then again, the lowest error rates in the graph require around 30 nearest neighbors, meaning the model would underfit in a serious manner.
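One caveat about the K search above is that it scores every candidate K on the test set, so the test set ends up influencing the model choice. A common alternative is cross-validation on the training data. The sketch below shows the idea on synthetic data; the data, the choice of five folds and the odd-valued K range are all assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# synthetic stand-in data, not the notebook's dataset
X_demo, y_demo = make_classification(
    n_samples=600, n_features=10, n_informative=4, random_state=0)

# odd K values avoid tied votes in a two-class problem
k_values = list(range(1, 40, 2))
cv_error = []

for k in k_values:
    # scaling lives inside the pipeline so it is refitted on every fold
    pipe = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    acc = cross_val_score(pipe, X_demo, y_demo, cv=5, scoring="accuracy")
    cv_error.append(1 - acc.mean())

# pick the K with the lowest average cross-validated error
best_k = k_values[int(np.argmin(cv_error))]
print("K with the lowest cross-validated error:", best_k)
```

The test set then stays untouched until the very end, when it is used once to judge the chosen K.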

The code below shows the same problem from another perspective.


In [None]:
# import modules
import eli5
from pdpbox import pdp, get_dataset, info_plots
from eli5.sklearn import PermutationImportance
import joblib
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split


# select the desired features as X
X = df[['Is3Wr', 'Is3Lb', 'Is4Lb', 'acc_lb', 'acc_rb', 'spd_lb', 
               'spd_rb','RelAcc_2', 'euc_2','RelSpd_2']]

# select the labels as y (in this case 4.18+ yard runs)
y = df['Yds4_18']

# perform train-test split, train data 80%, fixed random state
# (no scaling: tree-based models do not need it, and the unscaled
# DataFrames keep the column names used further down)
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8, random_state = 0)

# define Random Forest Classifier model
model = RandomForestClassifier(n_estimators=32, random_state=0).fit(X_train, y_train)

# fit permutation importance and show the results 
perm = PermutationImportance(model, random_state=1).fit(X_test, y_test)
eli5.show_weights(perm, feature_names = X_test.columns.tolist())

The only dataset columns (i.e. features) with any significance (i.e. weight) for 4.18+ yard runs are running back acceleration and three wide receivers on the field, and even those two aren't that significant. This does not mean that the listed features have no meaning at all. What the weights tell us is that shuffling the values of any one of them does not affect the overall outcome of our model in any significant way, because their relative importance is so similar.

**Thus, as an aspiring ML model, I don't learn anything about 4.18+ yard runs because the dataset cannot teach me anything.**
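For comparison, it helps to see what permutation importance looks like when a feature really does matter. The sketch below uses scikit-learn's own `permutation_importance` instead of eli5, on synthetic data with one informative column and two noise columns. The column names and the data are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# synthetic data: one hypothetical informative feature, two pure-noise ones
rng = np.random.default_rng(0)
n = 2000
signal = rng.normal(size=n)
noise = rng.normal(size=(n, 2))
X_demo = np.column_stack([signal, noise])
y_demo = (signal + 0.5 * rng.normal(size=n) > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, train_size=0.8, random_state=0)
forest = RandomForestClassifier(n_estimators=32, random_state=0).fit(X_tr, y_tr)

# shuffle one column at a time and measure how much the test accuracy drops
result = permutation_importance(forest, X_te, y_te, n_repeats=10, random_state=1)

for name, mean, std in zip(["signal", "noise_1", "noise_2"],
                           result.importances_mean, result.importances_std):
    print(f"{name:8s} {mean:.3f} +/- {std:.3f}")
```

A feature the model depends on shows a clear drop in accuracy when shuffled, while the noise columns hover around zero. In our football results nearly every feature behaves like the noise columns.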

The graph below picks the most significant factor (running back acceleration) from the list above. As one can see, acceleration values from 2.5 yards/second² up to 4 yards/second² make longer runs more likely.

*Then again, what kind of a football coach didn't already know that?*

In [None]:
# define column used in graph
feature_name = 'acc_rb'

# Create the data that we will plot
my_pdp = pdp.pdp_isolate(model=model, dataset=X_train, model_features=X_train.columns, feature=feature_name)

# set the plot
pdp.pdp_plot(my_pdp, feature_name)

# show plot
plt.show()
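The idea behind a partial dependence plot can also be written out by hand: force one feature to a fixed value for every row, average the model's predicted probability, and repeat over a grid of values. The sketch below does this on synthetic data. The acceleration-like feature and the rule linking it to the label are made up for illustration, and it is not a claim about how pdpbox works internally.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# synthetic data: a hypothetical acceleration-like feature and an unrelated one
rng = np.random.default_rng(0)
n = 2000
acc_demo = rng.uniform(0, 5, size=n)
other = rng.normal(size=n)
X_demo = np.column_stack([acc_demo, other])

# hypothetical rule: label 1 becomes more likely as the first feature grows
y_demo = (rng.random(n) < 0.1 + 0.15 * acc_demo).astype(int)

forest = RandomForestClassifier(
    n_estimators=32, min_samples_leaf=20, random_state=0).fit(X_demo, y_demo)

# partial dependence by hand, one grid value at a time
grid = np.linspace(0.5, 4.5, 9)
partial = []
for value in grid:
    X_mod = X_demo.copy()
    X_mod[:, 0] = value  # force the feature to one value for every row
    partial.append(forest.predict_proba(X_mod)[:, 1].mean())

for value, p in zip(grid, partial):
    print(f"feature value {value:.1f} -> average predicted probability {p:.3f}")
```

Plotting `partial` against `grid` gives the same kind of curve as the graph above: a rising line means higher feature values push the prediction towards the positive class.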

<hr>

### 5. Conclusion

Concluding this notebook, sometimes the data at hand cannot offer answers to every single question. In this case it seemed likely from the beginning that long, 4.18+ yard runs cannot be predicted with the dataset used. However, in the beginning this was a mere coffee table hypothesis - now we have the results to back it up.

Another thing to remember is the set of fundamental choices made in the beginning. These results show no correlation between running backs and linebackers where 4.18+ yard runs are concerned. This does not mean that other factors in the dataset could not provide us with more information. In fact, we have actually just been given a new research question:

<div class="h4"><i>
If 4.18+ yard runs in football are not about the matchup between running backs and linebackers, then what are they all about?
</i></div>
<br>
Earlier analysis showed that many run plays are so short that the ball carrier never confronts the linebackers in the first place. As a hypothesis, it could be that *the fate of longer runs rests on the matchup between offensive and defensive lines.* The ball carrier can break free only if his lead blocker - an offensive lineman - paves the road for him. Conversely, if a defensive lineman wins his own matchup against the opposing offensive lineman, the linebacker is then free to reach for the running back. At least this would be my starting point for further analysis.

As an aspiring ML model, I did not learn how to predict long runs in football. However, as someone only starting to learn the ropes of data analysis, I did discover new things that may prove to be useful in the future.

***I am looking forward to facing Lusha's dart again in future data analysis matchups.***
<hr>