# NFL Data Bowl Analysis

This notebook walks through the data collection, analysis methodology, and results of an analysis narrowed down to passing plays in the 2018 NFL season. The data comes from the NFL's 2021 Big Data Bowl: https://www.kaggle.com/c/nfl-big-data-bowl-2021/

*There is potential to combine this with the 2020 Data Bowl data, which contains similar information to the 2021 data but covers rushing plays from 2017-2019. Combining these sources would produce a notebook similar to the original tendency analysis: less data, but more information in our columns.*

The main focus of this analysis is to examine offensive / defensive personnel matchups, the distance from the closest defender to the targeted receiver, and other features detailed later in the notebook. This is also a unique opportunity to use tracking / location data of players.

Other data bowls for reference:
- https://www.kaggle.com/c/nfl-big-data-bowl-2020: Forecast yardage gained on the run plays
- https://www.kaggle.com/c/nfl-big-data-bowl-2022: Analyze special teams data
- https://github.com/nfl-football-ops/Big-Data-Bowl: Inaugural data bowl from 2019, useful R code on animation of tracking

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import os
pd.set_option('display.max_columns', None)
pd.options.mode.chained_assignment = None  # default='warn'

In [2]:
bdb_pass = pd.read_csv('nfl-big-data-bowl/pass-2018.csv')
bdb_rush = pd.read_csv('nfl-big-data-bowl/rush.csv')


# Useful Features Present in Both Datasets

From a cursory examination of the data before cleaning, these are the main features I think will be useful in this analysis. Only pre-snap features can be used; including anything recorded after the snap would defeat the purpose.

- Type of play (run/pass): TARGET (need to create)
- Drive number (need to create / merge with pbp)
- Offensive / defensive team (`HomeTeamAbbr`, `VisitorTeamAbbr`, `PossessionTeam`, need to create `DefTeam`)
- Quarter of the game (`Quarter`)
- Down number (`Down`)
- Time left in a quarter (`GameClock`, need to format this)
- Yards to gain for a first down (`Distance`)
- Yards to gain for a touchdown (from the current yardline position, `YardLine`, together with `FieldPosition`)
- Current score in the game (model as difference, `HomeScoreBeforePlay - VisitorScoreBeforePlay`)
- Offensive formation (`OffenseFormation`)
- Offensive personnel (`OffensePersonnel`)
- Defenders in the box (`DefendersInTheBox`)
- Defensive personnel (`DefensePersonnel`)
- Week of season (`Week`)
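As a rough sketch of two of the derived features in the list above (formatting the clock and modeling the score as a difference), here is a toy example. The rows are made up, the `clock_to_minutes` helper is hypothetical, and it assumes `GameClock` strings look like `MM:SS:00`.

```python
import numpy as np
import pandas as pd

# Toy rows; column names follow the rushing data, values are made up
toy = pd.DataFrame({
    'GameClock': ['14:14:00', '01:30:00', '00:05:00'],
    'PossessionTeam': ['NE', 'KC', 'NE'],
    'HomeTeamAbbr': ['NE', 'NE', 'NE'],
    'HomeScoreBeforePlay': [7, 7, 14],
    'VisitorScoreBeforePlay': [0, 10, 10],
})

def clock_to_minutes(clock):
    """Convert 'MM:SS:00' strings (assumed format) to minutes left in the quarter."""
    parts = clock.str.split(':', expand=True)
    return pd.to_numeric(parts[0]) + pd.to_numeric(parts[1]) / 60

toy['MinutesLeftInQuarter'] = clock_to_minutes(toy['GameClock'])

# Score difference from the point of view of the team with the ball
home_has_ball = toy['PossessionTeam'] == toy['HomeTeamAbbr']
home_margin = toy['HomeScoreBeforePlay'] - toy['VisitorScoreBeforePlay']
toy['ScoreDifference'] = np.where(home_has_ball, home_margin, -home_margin)

print(toy[['MinutesLeftInQuarter', 'ScoreDifference']])
```

Expressing the score from the offense's perspective keeps the feature comparable across home and away possessions.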

# Useful Features Present Only in Rushing Data

Some features are only included in the rushing play data. However, for 2018 alone, we could match up the games and include further information in the analysis for pass plays as well.

- Stadium type (`StadiumType`)
- Turf or grass (`Turf`)
- Weather in game (`GameWeather`)
- Temperature on game day (`Temperature`)
- Humidity on game day (`Humidity`)
- Wind speed on game day (`WindSpeed`)
- Wind direction (`WindDirection`)
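One way the game-level columns could be attached to pass plays is a lookup table keyed on `GameId`. This is only a sketch on made-up rows; `rush`, `passes`, and `game_info` are placeholder names, not the notebook's dataframes.

```python
import pandas as pd

# Toy stand-ins: rush rows carry game-level columns, pass rows do not
rush = pd.DataFrame({
    'GameId': [2018090600, 2018090600, 2018090900],
    'StadiumType': ['Outdoor', 'Outdoor', 'Indoor'],
    'Turf': ['Grass', 'Grass', 'FieldTurf'],
})
passes = pd.DataFrame({
    'GameId': [2018090600, 2018090900, 2018090900],
    'Down': [1, 2, 3],
})

# One row per game with the columns that are constant within a game
game_cols = ['StadiumType', 'Turf']
game_info = rush.drop_duplicates('GameId')[['GameId'] + game_cols]

# Left merge keeps every pass play; validate guards against duplicate games
passes_with_env = passes.merge(game_info, on='GameId', how='left',
                               validate='many_to_one')
print(passes_with_env)
```

A left merge leaves missing values for any pass game that has no matching rush game, which makes such gaps easy to count afterwards.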

## Next Steps / Preparing Data

- Is it more useful to do an analysis on an imbalanced dataset with rushing 2017-2019 and passing 2018, with only the first subset of features?
- Or would I rather narrow down the rushing plays to only 2018, and use all features (can match up `GameId` and thus find the weather/stadium data by game for the passing plays as well)?
- Aside from that, still need to complete the following data cleaning steps regardless of which direction I choose:
    - Merge both datasets into same format
    - Ignore post-snap outcomes (i.e. yardage gained, direction of the run, play results, type of pass dropback)
    - Ignore tracking data (for now) since all of it is post-snap movements
        - Look into if I can see whether a play is in "motion" i.e. the WR shifting from the X spot to the slot
        - I have the "time of snap" data, so there may be some potential?
    - Change the offensive formation to one-hot encoded columns
    - Change the personnel for offense and defense to one-hot encoded columns
        - Or find a unique way to deal with this since there are many combinations, some with few observations
    - Narrow down each play to a single row
    - More TBD..
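For the encoding steps in the list above, one possible alternative to one-hot encoding every personnel combination is to split each personnel string into per-position counts. The sketch below assumes strings shaped like `1 RB, 1 TE, 3 WR`; the `personnel_counts` helper and the toy rows are hypothetical.

```python
import pandas as pd

def personnel_counts(personnel, prefix):
    """Turn strings like '1 RB, 1 TE, 3 WR' into one count column per position."""
    rows = []
    for entry in personnel.fillna(''):
        counts = {}
        for group in entry.split(','):
            group = group.strip()
            if not group:
                continue
            n, pos = group.split(' ', 1)
            counts[f'{prefix}_{pos}'] = int(n)
        rows.append(counts)
    # Positions absent from a grouping get a count of 0
    return pd.DataFrame(rows, index=personnel.index).fillna(0).astype(int)

# Toy plays; values are made up
toy = pd.DataFrame({
    'OffenseFormation': ['SHOTGUN', 'I_FORM', 'SHOTGUN'],
    'OffensePersonnel': ['1 RB, 1 TE, 3 WR', '2 RB, 1 TE, 2 WR', '1 RB, 2 TE, 2 WR'],
})

off_counts = personnel_counts(toy['OffensePersonnel'], 'Off')
formation_dummies = pd.get_dummies(toy['OffenseFormation'], prefix='Formation')
features = pd.concat([formation_dummies, off_counts], axis=1)
print(features)
```

Counts keep the number of columns small even when some personnel groupings appear in only a handful of plays.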

# Merging 2018 run/pass

First I will try the simpler case, where we have more features but must narrow our scope to only 2018 plays. We can match these games on `GameId`, which will remove run plays from 2017 and 2019. The key issue is that the run data contains many rows for the same play, since the tracking data is baked in. The solution is to remove the tracking data from the rush dataset, effectively leaving one row per play, and then combine the two datasets, matching games on `GameId`. I will do this below by grouping on `PlayId` and retaining only information about the play as a whole. Maybe I'll go back to individual players later and see if there's an avenue to work with tracking data / pre-snap motion and add it in as my own feature...we'll see.
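Before collapsing to one row per play, it may be worth checking that the columns being kept really are constant within a play. A minimal sketch on made-up rows, where `toy` and `play_cols` are illustrative names only:

```python
import pandas as pd

# Toy tracking-style table: several player rows per play (values are made up)
toy = pd.DataFrame({
    'GameId': [2018090600] * 4,
    'PlayId': [1, 1, 2, 2],
    'Down': [1, 1, 2, 2],
    'Distance': [10, 10, 7, 7],
    'X': [45.1, 46.3, 52.0, 53.4],  # per-player column, varies within a play
})

play_cols = ['GameId', 'Down', 'Distance']

# A column is play-level if it has at most one distinct value within each play
max_distinct = toy.groupby('PlayId')[play_cols].nunique().max()
assert (max_distinct <= 1).all()

# Safe to keep the first row's values for those columns
plays = toy.groupby('PlayId')[play_cols].first().reset_index()
print(plays)
```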

In [3]:
print('Unique pass game data provided: {}'.format(bdb_pass['gameId'].nunique()))
print('Unique rush game data provided: {}   (includes 2017, 2019)'.format(bdb_rush['GameId'].nunique()))
print('Unique pass play data provided: {}'.format(bdb_pass['gameId'].count()))
print('Unique rush play data provided: {} (includes 2017, 2019)'.format(bdb_rush['PlayId'].nunique()))

Unique pass game data provided: 253
Unique rush game data provided: 688   (includes 2017, 2019)
Unique pass play data provided: 19239
Unique rush play data provided: 31007 (includes 2017, 2019)


In [4]:
## Group by play ID, take first since the data we want is all the same
## Don't care for any tracking / individual player data (yet)
bdb_rush_play = bdb_rush.groupby('PlayId').first()

In [5]:
## List of things to drop
to_drop = ['X','Y','S','A','Dis','Dir','Orientation','NflId','DisplayName',
           'JerseyNumber','Season','NflIdRusher','PlayerHeight','PlayerWeight',
           'PlayerBirthDate','PlayerCollegeName','Position', 'PlayDirection',
           'TimeHandoff', 'TimeSnap', 'Yards', 'Team', 'Location']
bdb_rush = bdb_rush_play.drop(to_drop, axis=1)
## only 2018 rush plays
bdb_rush_2018 = bdb_rush[bdb_rush['GameId'].astype('str').str[:4] == '2018']

In [6]:
print('Unique pass game data provided: {}'.format(bdb_pass['gameId'].nunique()))
print('Unique rush game data provided: {}     (2018)'.format(bdb_rush_2018['GameId'].nunique()))
print('Unique pass play data provided: {}'.format(bdb_pass['gameId'].count()))
print('Unique rush play data provided: {}   (2018)'.format(bdb_rush_2018['GameId'].count()))

Unique pass game data provided: 253
Unique rush game data provided: 256     (2018)
Unique pass play data provided: 19239
Unique rush play data provided: 11271   (2018)


In [7]:
bdb_rush_2018 = bdb_rush_2018.reset_index(drop=True)

In [8]:
# Show columns where we have missing data
bdb_rush_2018.isnull().sum()[bdb_rush_2018.isnull().sum() > 0]

FieldPosition         142
OffenseFormation        2
DefendersInTheBox       1
StadiumType           719
GameWeather           767
Temperature          1002
Humidity              280
WindSpeed            1394
WindDirection        1654
dtype: int64

In [19]:
## FieldPosition
# Upon examination, FieldPosition is None when at 50 yardline, replace None with 'MID'
bdb_rush_2018.loc[bdb_rush_2018['FieldPosition'].isnull(), 'FieldPosition'] = 'MID'

## Don't need to fix OffenseFormation or DefendersInTheBox...it's 3 plays...we can forgo that much

# The three missing stadiums were MetLife, Stubhub, TIAA Bank Field...all Outdoor (from commented out command below)
# print(bdb_rush_2018.loc[bdb_rush_2018['StadiumType'].isnull(), 'Stadium'].unique())
bdb_rush_2018.loc[bdb_rush_2018['StadiumType'].isnull(), 'StadiumType'] = 'Outdoor'

## When StadiumType is indoor, we don't have values for GameWeather, Temperature, Humidity, WindSpeed, or WindDirection
# First clean up StadiumType to be either indoor or outdoor
indoor_types = {'Retr. Roof - Closed', 'Indoors', 'Dome', 'Domed, closed', 'Retr. Roof-Closed', 'Domed'}
outdoor_types = {'Outdoors', 'Open', 'Retr. Roof - Open', 'Domed, Open', 'Domed, open',
                 'Outside', 'Cloudy', 'Bowl', 'Retractable Roof'}
bdb_rush_2018['StadiumType'] = ['Indoor' if st in indoor_types else
                                'Outdoor' if st in outdoor_types
                                else st for st in bdb_rush_2018['StadiumType']]


## For the rest, no way to easily tell what the true values are for this data...
## First, clean up the categorical values
## GameWeather
bdb_rush_2018['GameWeather'].unique()

## Temperature
#bdb_rush_2018['Temperature'].unique()

## Humidity
#bdb_rush_2018['Humidity'].unique()

## WindSpeed
#bdb_rush_2018['WindSpeed'].unique()

## WindDirection
#bdb_rush_2018['WindDirection'].unique()

array(['Cloudy', 'Rain', 'Scattered Showers', 'N/A (Indoors)',
       'Cloudy and Cool', 'Partly Cloudy', 'Sunny', 'N/A Indoor', 'Clear',
       'Controlled Climate', None, 'Fair', 'Rain Chance 40%',
       'Light Rain', 'Clear and sunny', 'Mostly sunny', 'Sunny and warm',
       'Mostly Cloudy', 'Mostly Sunny', 'Partly Sunny', 'Partly clear',
       'cloudy', 'Cloudy, 50% change of rain', 'Clear and Sunny',
       'Partly cloudy', 'Sunny, Windy', 'Clear and Cool', 'Clear skies',
       'Sunny and clear', 'Hazy', 'Indoors', 'Mostly Sunny Skies',
       'Partly Clouidy', 'Mostly cloudy', 'Clear Skies', 'Snow',
       'Sunny Skies', 'Overcast', 'T: 51; H: 55; W: NW 10 mph',
       'Cloudy, Rain', 'Rain shower', 'Clear and cold', 'Rainy',
       'Sunny and cold'], dtype=object)
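The labels above overlap heavily ('Mostly cloudy' vs 'Mostly Cloudy', 'Partly Clouidy', several ways of saying indoors), so one option is to bucket them with keyword matching. This is a sketch only: the `simplify_weather` helper and the bucket names are my own assumptions, and rules such as treating 'Rain Chance 40%' as rain are debatable.

```python
import pandas as pd

def simplify_weather(label):
    """Map a free-text weather label to a coarse bucket (buckets are an assumption)."""
    if pd.isna(label):
        return None
    text = str(label).lower()
    if 'indoor' in text or 'controlled' in text:
        return 'indoor'
    if 'snow' in text:
        return 'snow'
    if 'rain' in text or 'shower' in text:
        return 'rain'
    # 'clou' also catches the 'Clouidy' typo
    if 'clou' in text or 'overcast' in text or 'hazy' in text:
        return 'cloudy'
    if 'sun' in text or 'clear' in text or 'fair' in text:
        return 'clear'
    return 'unknown'

# A few labels copied from the output above
labels = pd.Series(['N/A (Indoors)', 'Partly Clouidy', 'Rain shower',
                    'Clear and cold', None, 'T: 51; H: 55; W: NW 10 mph'])
simplified = labels.map(simplify_weather)
print(simplified.tolist())
```

The order of the checks matters: precipitation is tested before cloud cover so that 'Cloudy, Rain' lands in the rain bucket.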

In [11]:
# Show columns where we have missing data after cleaning
bdb_rush_2018.isnull().sum()[bdb_rush_2018.isnull().sum() > 0]

OffenseFormation        2
DefendersInTheBox       1
GameWeather           767
Temperature          1002
Humidity              280
WindSpeed            1394
WindDirection        1654
dtype: int64

In [12]:
## Align bdb_rush_2018 and bdb_pass_2018
## Once these are aligned, we can create general features for the whole dataset at once
to_drop = ['playId', 'playDescription', 'typeDropback', 'penaltyCodes', 'penaltyJerseyNumbers',
           'epa', 'isDefensivePI', 'passResult','playType','numberOfPassRushers',
           'offensePlayResult', 'playResult']
bdb_pass_2018 = bdb_pass.drop(to_drop, axis=1)
bdb_pass_2018['Type'] = 'pass'

In [13]:
## Match columns between rush and pass
print(bdb_rush_2018.columns)
print(bdb_pass_2018.columns)

Index(['GameId', 'YardLine', 'Quarter', 'GameClock', 'PossessionTeam', 'Down',
       'Distance', 'FieldPosition', 'HomeScoreBeforePlay',
       'VisitorScoreBeforePlay', 'OffenseFormation', 'OffensePersonnel',
       'DefendersInTheBox', 'DefensePersonnel', 'HomeTeamAbbr',
       'VisitorTeamAbbr', 'Week', 'Stadium', 'StadiumType', 'Turf',
       'GameWeather', 'Temperature', 'Humidity', 'WindSpeed', 'WindDirection'],
      dtype='object')
Index(['gameId', 'quarter', 'down', 'yardsToGo', 'possessionTeam',
       'yardlineSide', 'yardlineNumber', 'offenseFormation', 'personnelO',
       'defendersInTheBox', 'personnelD', 'preSnapVisitorScore',
       'preSnapHomeScore', 'gameClock', 'absoluteYardlineNumber', 'Type'],
      dtype='object')


In [14]:
pass_cols = ['GameId', 'Quarter', 'Down', 'Distance', 'PossessionTeam',
             'FieldPosition', 'YardLine', 'OffenseFormation', 'OffensePersonnel',
             'DefendersInTheBox', 'DefensePersonnel', 'VisitorScoreBeforePlay',
             'HomeScoreBeforePlay', 'GameClock', 'Distance for TD', 'Type']
# assign column names
bdb_pass_2018.columns = pass_cols

## need to merge:
# HomeTeamAbbr, VisitorTeamAbbr, Week, Stadium, StadiumType, Turf,
# GameWeather, Temperature, Humidity, WindSpeed, WindDirection

## need to make this myself, after merging
# PosTeamScore, DefTeamScore, ScoreDifference, Redzone, Under2Min

In [15]:
## Merge both dataframes, then determine best way to fill in missing data
bdb_2018 = pd.concat([bdb_rush_2018, bdb_pass_2018]).reset_index(drop=True)

In [16]:
# find NaNs from passing data, there will be exactly 19239
print('Num of passing rows:  **{}**'.format(len(bdb_pass_2018)))
# these columns didn't exist in my initial data
print(bdb_2018.isnull().sum()[bdb_2018.isnull().sum() == 19239])

Num of passing rows:  **19239**
HomeTeamAbbr       19239
VisitorTeamAbbr    19239
Week               19239
Stadium            19239
StadiumType        19239
Turf               19239
dtype: int64


In [17]:
## Fill in the game-level columns that are missing for all 19239 pass rows
# HomeTeamAbbr, VisitorTeamAbbr, Week, Stadium, Turf (StadiumType is not filled here)
# Each is constant within a game, so copy the first non-null value per GameId
for col in ['HomeTeamAbbr', 'VisitorTeamAbbr', 'Week', 'Stadium', 'Turf']:
    bdb_2018[col] = bdb_2018.groupby('GameId')[col].transform('first')

In [18]:
## Columns still missing for every pass row after the fill
print(bdb_2018.isnull().sum()[bdb_2018.isnull().sum() == 19239])

StadiumType    19239
dtype: int64


In [19]:
## Create features for complete data
# Create DefensiveTeam variable
bdb_2018['DefTeam'] = np.where(bdb_2018['PossessionTeam'] == bdb_2018['HomeTeamAbbr'], bdb_2018['VisitorTeamAbbr'], bdb_2018['HomeTeamAbbr'])

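# Change GameClock to minutes left in the quarter (float)
# The first five characters are 'MM:SS'; parsed as '%H:%M', minutes land in .hour and seconds in .minute
min_left_qtr = pd.to_datetime(bdb_2018['GameClock'].str[:5], format='%H:%M')
bdb_2018['Time Left in Quarter'] = min_left_qtr.dt.hour + (min_left_qtr.dt.minute/60)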

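# Yards to go for first = Distance
# Yards to go for TD: YardLine only runs 1-50 and FieldPosition gives the side of the field,
# so use 100 - YardLine in the offense's own half and YardLine in the opponent's half
# (assumes FieldPosition uses the same team abbreviations as PossessionTeam)
bdb_2018['Distance for TD'] = np.where(bdb_2018['FieldPosition'] == bdb_2018['PossessionTeam'],
                                       100 - bdb_2018['YardLine'], bdb_2018['YardLine'])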

# Current score in the game
# Difference between PossessionTeam and DefTeam
# First create "PosTeamScore" and "DefTeamScore", then take the difference
bdb_2018['PosTeamScore'] = np.where(bdb_2018['PossessionTeam'] == bdb_2018['HomeTeamAbbr'], bdb_2018['HomeScoreBeforePlay'], bdb_2018['VisitorScoreBeforePlay'])
bdb_2018['DefTeamScore'] = np.where(bdb_2018['DefTeam'] == bdb_2018['HomeTeamAbbr'], bdb_2018['HomeScoreBeforePlay'], bdb_2018['VisitorScoreBeforePlay'])
bdb_2018['ScoreDifference'] = bdb_2018['PosTeamScore'] - bdb_2018['DefTeamScore']

# create indicator if team is in Redzone (<= 20 yds to go from goal)
bdb_2018['Redzone'] = np.where(bdb_2018['Distance for TD'] <= 20, 1, 0)

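# create indicator if time left in HALF is <= 2min (i.e. late in the 2nd or 4th quarter)
# isin() avoids the & / | precedence problem that would flag every 4th-quarter play
bdb_2018['Under2Min'] = np.where((bdb_2018['Time Left in Quarter'] <= 2.0) & bdb_2018['Quarter'].isin([2, 4]), 1, 0)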

# need to include drive #, timeout data