# Analysis of Football Defense
### Collaborators: 
#### Gabriel Berlin; kaggle.com/gberlin email is gabeberlin@gmail.com
#### Danila Rozhevskii; kaggle.com/danilarozhevskii email is jorryvtanke@yandex.ru

## Introduction and Methodology
The main idea of this competition is to measure defensive performance on the plays in the given data.

First, we did a basic analysis of the data, looking for correlations between the predictor variables and the response variables.

Based on the play outcomes, our goal was to identify predictor and response variables and build a regression decision tree to find the predictors that most affect the offensive play result.

The data is both quantitative and categorical, so we convert all categorical columns to a NumPy array of dummy variables and then combine them with the quantitative columns.

Then we fit decision trees to the data, map the feature importances back to columns, and create a dictionary of all the predictors and their associated importances.
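
The encoding, fitting, and importance-mapping steps can be sketched on a small made-up table. The data values and the helper name `grouped_importances` are illustrative assumptions, not part of the competition data; only the column names mirror `plays.csv`.

```python
import numpy as np
import pandas as pd
from sklearn import tree

# Toy data (made up for illustration); column names mirror plays.csv
toy = pd.DataFrame({
    'offenseFormation': ['SHOTGUN', 'EMPTY', 'SHOTGUN', 'I_FORM', 'EMPTY', 'I_FORM'],
    'typeDropback': ['TRADITIONAL', 'SCRAMBLE', 'TRADITIONAL',
                     'TRADITIONAL', 'SCRAMBLE', 'TRADITIONAL'],
    'defendersInTheBox': [6, 5, 7, 8, 5, 7],
    'offensePlayResult': [8, 12, 3, -1, 15, 2],
})

categorical = ['offenseFormation', 'typeDropback']
numeric = ['defendersInTheBox']

# One-hot encode the categorical columns, then append the numeric ones
dummies = pd.get_dummies(toy[categorical])
X = pd.concat([dummies, toy[numeric]], axis=1)
y = -toy['offensePlayResult']   # negate so larger values mean better defense

model = tree.DecisionTreeRegressor(random_state=0)
model.fit(X.to_numpy(dtype=float), y)

# Map each importance back to its column name
importances = pd.Series(model.feature_importances_, index=X.columns)

def grouped_importances(importances, categorical):
    """Sum dummy-column importances back to their source column (hypothetical helper)."""
    def source(col):
        for name in categorical:
            if col.startswith(name + '_'):
                return name
        return col
    return importances.groupby(source).sum()

by_variable = grouped_importances(importances, categorical)
print(by_variable.sort_values(ascending=False))
```

Grouping by prefix avoids listing every dummy column by hand, so the totals stay correct if a category appears or disappears after filtering.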

In [None]:
#load libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt   #for plotting
from sklearn import tree          #for decision trees
from sklearn import preprocessing

In [None]:
#read data
games = pd.read_csv('../input/nfl-big-data-bowl-2021/games.csv')
players = pd.read_csv('../input/nfl-big-data-bowl-2021/players.csv')
plays = pd.read_csv('../input/nfl-big-data-bowl-2021/plays.csv')

# Exploratory Data Analysis


### Games table

In [None]:
games.head(10)

In [None]:
games.info()

### Players table

In [None]:
players.head(10)

In [None]:
players.info()

In [None]:
#clean height data
#convert string into normal format and replace values
replacement_values = {'5-6':'66','5-7':'67','5-8':'68','5-10':'70','5-11':'71','5-9':'69',
                      '6-0':'72','6-1':'73','6-2':'74',
                      '6-3':'75','6-4':'76','6-5':'77','6-6':'78','6-7':'79'}

#assign the result back instead of calling replace(inplace=True) on a column selection
players['height'] = players['height'].replace(replacement_values).astype('int32')
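
The lookup table above only covers the heights listed in it. As a sketch of a more general approach, a small function can parse any `feet-inches` string. `height_to_inches` is a hypothetical helper name, and the sketch assumes every value is either a plain inch count or a `feet-inches` string.

```python
import pandas as pd

def height_to_inches(value):
    """Convert '6-2' style strings to inches; pass plain inch counts through."""
    text = str(value)
    if '-' in text:
        feet, inches = text.split('-')
        return int(feet) * 12 + int(inches)
    return int(text)

heights = pd.Series(['72', '6-2', '5-11', '75'])   # made-up sample values
converted = heights.map(height_to_inches)
print(converted.tolist())
```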

In [None]:
#position value counts and number of positions
print("Player position value counts: ",players['position'].value_counts(),sep='\n')
print('\n',"Number of Positions: ",players['position'].unique().size,sep='')

In [None]:
players['height'].hist()

The players' heights are fairly normally distributed.

In [None]:
#note: order by number of players descending
players['height'].hist(by=players['position'],figsize=(10,10))

In [None]:
players['weight'].hist()

The players' weights are roughly bell-shaped with a heavy right tail, meaning there are many very heavy players.
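
One way to check the direction of a tail numerically is the sample skewness, which is positive when the long tail lies on the high side. The weights below are made up for illustration and are not taken from `players.csv`.

```python
import pandas as pd

# Made-up weights (lb): most values cluster low, a few are very large
weights = pd.Series([190, 195, 200, 205, 210, 215, 220, 300, 320, 340])

skewness = weights.skew()
print(skewness)                           # positive for this sample
print(weights.mean() > weights.median())  # the mean is pulled toward the heavy tail
```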

In [None]:
#note: order by number of players descending
players['weight'].hist(by=players['position'],figsize=(10,10))

Focus on defensive players. Split players into offensive and defensive tables.

In [None]:
#drop collegeName and displayName - not needed in analysis
drops = ['collegeName','displayName']
players = players.drop(columns=drops)

#split table into offensive and defensive players
offensive_positions = ['WR','QB','TE','RB','FB','HB','P','LS','K']
defensive_positions = ['CB','OLB','FS','SS','MLB','DE','LB','ILB','DB','S','NT','DT']
offensive_players = players[players['position'].isin(offensive_positions)] #subset of players
defensive_players = players[players['position'].isin(defensive_positions)] #subset of players

In [None]:
#number of defensive players
defensive_players.shape

In [None]:
defensive_players['position'].value_counts()

In [None]:
defensive_players['height'].hist(figsize=(10,10))

In [None]:
defensive_players['height'].hist(by=defensive_players['position'],figsize=(10,10))

In [None]:
#now can use a histogram
defensive_players['weight'].hist(figsize=(10,10))

In [None]:
defensive_players['weight'].hist(by=defensive_players['position'],figsize=(10,10))

# Plays Data Table
This is the table that we will use in most of our analysis.

In [None]:
plays.head()

In [None]:
#average number of plays per game
ave_num_plays_per_game = len(plays['playId'])/len(plays['gameId'].unique())
print(ave_num_plays_per_game)

In [None]:
plays.info()

There are 27 columns. Many of them are not as useful in this analysis.

In [None]:
less_important_columns = ['gameId', 'playId', 'playDescription', 'quarter', 'down', 'yardsToGo','possessionTeam',
                          'yardlineSide', 'yardlineNumber','preSnapVisitorScore', 'preSnapHomeScore', 'gameClock',
                          'absoluteYardlineNumber', 'penaltyCodes', 'penaltyJerseyNumbers','isDefensivePI']
plays.drop(columns=less_important_columns,inplace=True)

In [None]:
plays.info()

In [None]:
#number of missing values in each column
plays.isna().sum()

In [None]:
plays['playType'].value_counts()

In [None]:
plays['offenseFormation'].value_counts()

In [None]:
plays['personnelO'].value_counts()

In [None]:
plays['personnelD'].value_counts()

In [None]:
plays['typeDropback'].value_counts()

In [None]:
plays['passResult'].value_counts()

In [None]:
plays['defendersInTheBox'].value_counts()

In [None]:
plays['numberOfPassRushers'].value_counts()

## Correlations between variables and results
We want to see whether any of the variables correlate with the play or pass result.
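
As a minimal sketch of the two checks used in this section, group means summarize a categorical variable against the outcome, and a Pearson correlation summarizes a numeric one. The values below are made up; only the column names come from `plays.csv`.

```python
import pandas as pd

toy = pd.DataFrame({
    'offenseFormation': ['SHOTGUN', 'SHOTGUN', 'EMPTY', 'EMPTY', 'I_FORM'],
    'defendersInTheBox': [6, 7, 5, 5, 8],
    'offensePlayResult': [7, 4, 10, 12, 1],
})

# Categorical predictor: mean yards gained per formation
formation_means = toy.groupby('offenseFormation')['offensePlayResult'].mean()
print(formation_means)

# Numeric predictor: Pearson correlation with yards gained
r = toy['defendersInTheBox'].corr(toy['offensePlayResult'])
print(r)
```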

In [None]:
#yards gained
plays['offensePlayResult'].hist()

In [None]:
print(plays['offensePlayResult'].mean())
print(plays['playResult'].mean())

Mean net yardage including penalty yardage (playResult) is slightly higher than mean yardage excluding penalties (offensePlayResult), by about 0.2 yards. That means, on average, penalties slightly help the offense.
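
When neither column has missing values, the gap between the two means equals the mean of the per-play difference `playResult - offensePlayResult`, which isolates the net yardage attributable to penalties. A sketch with made-up values:

```python
import pandas as pd

toy = pd.DataFrame({
    'offensePlayResult': [5, 0, 12, -3],   # yards excluding penalties (made up)
    'playResult':        [5, 15, 2, -3],   # net yards including penalties (made up)
})

# Per-play yardage change caused by penalties
penalty_effect = toy['playResult'] - toy['offensePlayResult']

mean_of_difference = penalty_effect.mean()
difference_of_means = toy['playResult'].mean() - toy['offensePlayResult'].mean()
print(mean_of_difference, difference_of_means)
```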

In [None]:
print(plays['defendersInTheBox'].groupby(plays['passResult']).mean())
print(plays['defendersInTheBox'].corr(plays['offensePlayResult']))

In [None]:
#type of dropback
print(plays['offensePlayResult'].groupby(plays['typeDropback']).mean())

In [None]:
#offense formation and play result
plays['offensePlayResult'].groupby(plays['offenseFormation']).mean()

In [None]:
plays['offensePlayResult'].groupby(plays['defendersInTheBox']).mean()

# Decision Tree Regression

We use a decision tree to predict the offense play result (yards gained or lost). We predict the offense play result rather than the play result because the former does not include penalties. This way we can understand the importance of different factors independent of penalties.

Categorical (must be converted into dummy variables): offenseFormation, personnelO, personnelD, typeDropback, (passResult)

Quantitative: defendersInTheBox, numberOfPassRushers, (offensePlayResult)

passResult and offensePlayResult, shown in parentheses, are response variables.

The data is not totally clean. The personnel columns have many unneeded values. There are also missing values in this dataset. We will train our models on a large subset of the data that doesn't contain missing values and contains only common offensive and defensive personnel.

In [None]:
#drop rows with missing values
cleaned_plays = plays.dropna()

In [None]:
print(plays['personnelO'].unique().size,plays['personnelD'].unique().size,sep='\n')

There are a lot of values for the offensive and defensive personnel columns, but only a few are common. We will only use the most relevant values.
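
Keeping only the frequent categories can be sketched as follows. The threshold of 2 and the sample values are chosen for this toy example only; the real data calls for a much larger threshold.

```python
import pandas as pd

# Made-up personnel labels: two common groupings and one rare one
personnel = pd.Series(['1 RB, 1 TE, 3 WR'] * 4
                      + ['1 RB, 2 TE, 2 WR'] * 3
                      + ['0 RB, 0 TE, 5 WR'])

counts = personnel.value_counts()
common = counts[counts > 2].index            # categories seen more than twice
filtered = personnel[personnel.isin(common)]

print(filtered.value_counts())
```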

In [None]:
cleaned_plays['personnelO'].value_counts().loc[lambda x: x > 500]

In [None]:
offense_counts = cleaned_plays['personnelO'].value_counts()
defense_counts = cleaned_plays['personnelD'].value_counts()

cleaned_plays = cleaned_plays[cleaned_plays['personnelO'].isin(offense_counts[offense_counts > 500].index)]
cleaned_plays = cleaned_plays[cleaned_plays['personnelD'].isin(defense_counts[defense_counts > 500].index)]

In [None]:
cleaned_plays['personnelD'].value_counts()

In [None]:
cleaned_plays.info()

In [None]:
#multiply yards gained by -1 to measure defensive success rather than offensive success
#use offensePlayResult so that penalty yardage is excluded, as described above
y = cleaned_plays['offensePlayResult']*-1

In [None]:
DummyDF = pd.get_dummies(cleaned_plays[['offenseFormation','personnelO','personnelD','typeDropback']])
cleaned_numeric = cleaned_plays[['defendersInTheBox','numberOfPassRushers']].to_numpy(copy=True)
X = np.concatenate((DummyDF.to_numpy(copy=True),cleaned_numeric),axis=1)

In [None]:
#average over 10 random states
decision_trees = []

for i in range(10):
    clf = tree.DecisionTreeRegressor(random_state=i+1)
    clf.fit(X,y)
    decision_trees.append(clf)

In [None]:
#sum the importances across trees, then divide by the number of trees
s = np.zeros(X.shape[1])

for t in decision_trees:
    s += t.feature_importances_
    
average_importances = s/len(decision_trees)

In [None]:
average_importances

In [None]:
#map feature importance to columns
#feature importance array -> dictionary with column_name:column_importance
#X holds the dummy columns first, followed by the two numeric columns
feature_names = list(DummyDF.columns) + ['defendersInTheBox','numberOfPassRushers']
column_importances = dict(zip(feature_names, average_importances))
column_importances

In [None]:
#add all the values for each original column
#sum every dummy column that shares a prefix, so no category is overwritten or left out
def total_importance(prefix):
    return sum(v for k, v in column_importances.items() if k.startswith(prefix + '_'))

offenseForm = total_importance('offenseFormation')
personnelO = total_importance('personnelO')
personnelD = total_importance('personnelD')
typeDropback = total_importance('typeDropback')


# print column importances
print("Offense Formation Importance:",offenseForm)
print("Offensive Personnel Importance:",personnelO)
print("Defensive Personnel Importance:",personnelD)
print("Type of Dropback Importance:",typeDropback)
print("Number of Defenders in the Box Importance:",column_importances['defendersInTheBox'])
print("Number of Pass Rushers Importance:",column_importances['numberOfPassRushers'])

# Conclusions
According to the decision trees' averaged feature importances, the two most important variables are the number of defenders in the box and the number of pass rushers. Coaches should focus on these two variables when making decisions about defense.