# NFL Data Bowl Analysis

This notebook will go through the methodology of data collection, analysis, and results of a narrowed-down analysis to only passing plays in the 2018 NFL season. The data is obtained from the NFL's 2021 Big Data Bowl: https://www.kaggle.com/c/nfl-big-data-bowl-2021/

*There is potential to combine the 2020 data bowl data, which contains similar info to 2021 data bowl data except about rushing plays 2017-2019. Combining these sources to produce a similar notebook to the original tendency analysis. Less data, but more information in our columns.*

The main focus of this analysis is to see how offensive / defensive personnel matchups, distance from closest defender to targeted receiver, among others later detailed in the notebook. This is a unique opportunity to utilize tracking / location data of players as well.

Other data bowls for reference:
- https://www.kaggle.com/c/nfl-big-data-bowl-2020: Forecast yardage gained on the run plays
- https://www.kaggle.com/c/nfl-big-data-bowl-2022: Analyze special teams data
- https://github.com/nfl-football-ops/Big-Data-Bowl: Inaugural data bowl from 2019, useful R code on animation of tracking

In [54]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import os
pd.set_option('display.max_columns', None)
pd.options.mode.chained_assignment = None  # default='warn'

In [55]:
bdb_pass = pd.read_csv('nfl-big-data-bowl/pass-2018.csv')
bdb_rush = pd.read_csv('nfl-big-data-bowl/rush.csv')

  has_raised = await self.run_ast_nodes(code_ast.body, cell_name,


# Useful Features Present in Both Datasets

Before cleaning the data, from a cursory examination of our data, the main features I think will be useful in this analysis are the following. We can only use pre-snap features in this analysis, or else it defeats the purpose.

- Type of play (run/pass): TARGET (need to create)
- Drive number (need to create / merge with pbp)
- Offensive / defensive team (`HomeTeamAbbr`, `VisitorTeamAbbr`, `PossessionTeam`, need to create `DefTeam`)
- Quarter of the game (`Quarter`)
- Down number (`Down`)
- Time left in a quarter (`GameClock`, need to format this)
- Yards to gain for a first down (`Distance`)
- Yards to gain for a touchdown (100 minus current yardline position, `YardLine`)
- Current score in the game (model as difference, `HomeScoreBeforePlay - VisitorScoreBeforePlay`)
- Offensive formation (`OffenseFormation`)
- Offensive personnel (`OffensePersonnel`)
- Defenders in the box (`DefendersInTheBox`)
- Defensive personnel (`DefensePersonnel`)
- Week of season (`Week`)

# Useful Features Present Only in Rushing Data

Some features are only included in the rushing play data. However, for 2018 alone, we could match up the games and include further information in the analysis for pass plays as well.

- Stadium type (`StadiumType`)
- Turf or grass (`Turf`)
- Weather in game (`GameWeather`)
- Temperature on game day (`Temperature`)
- Humidity on game day (`Humidity`)
- Wind speed on game day (`WindSpeed`)
- Wind direction (`WindDirection`)

## Next Steps / Preparing Data

- Is it more useful to do an analysis on an imbalanced dataset with rushing 2017-2019 and passing 2018, with only the first subset of features?
- Or would I rather narrow down the rushing plays to only 2018, and use all features (can match up `GameID` and thus find the weather/stadium data by game for the passing plays as well)
- Aside from that, still need to complete the following data cleaning steps regardless of which direction I choose:
    - Merge both datasets into same format
    - Ignore post-snap outcomes (i.e. yardage gained, direction of the run, play results, type of pass dropback)
    - Ignore tracking data (for now) since all of it is post-snap movements
        - Look into if I can see whether a play is in "motion" i.e. the WR shifting from the X spot to the slot
        - I have the "time of snap" data, so there may be some potential?
    - Change the offensive formation to one-hot encoded columns
    - Change the personnel for offense and defense to one-hot encoded columns
        - Or find a unique way to deal with this since there are many combinations, some with few observations
    - Narrow down each play to a single row
    - More TBD..

# Merging 2018 run/pass

First I will try the simpler case, where we have more features but must narrow our scope to only 2018 plays. We can match these games on `GameID` which will remove run plays in 2017 and 2019. The key issue is that the run data contains many rows for the same play, since the tracking data is baked in. The solution is to remove the tracking data for the rush dataset, effectively leaving one row per play, then merging the two datasets on their `GameID`. I will do this below by grouping on `PlayID` and only retaining information on the play as a whole. Maybe I'll go back to individual players and see if there's an avenue to work with tracking data / pre-snap motion later and add it in as my own feature...we'll see.

In [56]:
print('Unique pass game data provided: {}'.format(bdb_pass['gameId'].nunique()))
print('Unique rush game data provided: {}   (includes 2017, 2019)'.format(bdb_rush['GameId'].nunique()))
print('Unique pass play data provided: {}'.format(bdb_pass['gameId'].count()))
print('Unique rush play data provided: {} (includes 2017, 2019)'.format(bdb_rush['PlayId'].nunique()))

Unique pass game data provided: 253
Unique rush game data provided: 688   (includes 2017, 2019)
Unique pass play data provided: 19239
Unique rush play data provided: 31007 (includes 2017, 2019)


In [57]:
## Group by play ID, take first since the data we want is all the same
## Don't care for any tracking / individual player data (yet)
bdb_rush_play = bdb_rush.groupby('PlayId').first()

In [58]:
## List of things to drop
to_drop = ['X','Y','S','A','Dis','Dir','Orientation','NflId','DisplayName',
           'JerseyNumber','Season','NflIdRusher','PlayerHeight','PlayerWeight',
           'PlayerBirthDate','PlayerCollegeName','Position', 'PlayDirection',
           'TimeHandoff', 'TimeSnap', 'Yards', 'Team']
bdb_rush = bdb_rush_play.drop(to_drop, axis=1)
## only 2018 rush plays
bdb_rush_2018 = bdb_rush[bdb_rush['GameId'].astype('str').str[:4] == '2018']

In [59]:
print('Unique pass game data provided: {}'.format(bdb_pass['gameId'].nunique()))
print('Unique rush game data provided: {}     (2018)'.format(bdb_rush_2018['GameId'].nunique()))
print('Unique pass play data provided: {}'.format(bdb_pass['gameId'].count()))
print('Unique rush play data provided: {}   (2018)'.format(bdb_rush_2018['GameId'].count()))

Unique pass game data provided: 253
Unique rush game data provided: 256     (2018)
Unique pass play data provided: 19239
Unique rush play data provided: 11271   (2018)


In [60]:
bdb_rush_2018 = bdb_rush_2018.reset_index(drop=True)

In [67]:
## Align bdb_rush_2018 and bdb_pass_2018
## Once these are aligned, we can create general features for the whole dataset at once
to_drop = ['playId', 'playDescription', 'typeDropback', 'penaltyCodes', 'penaltyJerseyNumbers',
           'epa', 'isDefensivePI', 'passResult','playType','numberOfPassRushers',
           'offensePlayResult', 'playResult']
bdb_pass_2018 = bdb_pass.drop(to_drop, axis=1)

In [68]:
## Create features for rush data
# Create DefensiveTeam variable
bdb_rush_2018['DefTeam'] = np.where(bdb_rush_2018['PossessionTeam'] == bdb_rush_2018['HomeTeamAbbr'], bdb_rush_2018['VisitorTeamAbbr'], bdb_rush_2018['HomeTeamAbbr'])

# Change GameClock to integer of minutes
min_left_qtr = pd.to_datetime(bdb_rush_2018['GameClock'].str[:5])
bdb_rush_2018['Time Left in Quarter'] = min_left_qtr.dt.hour + (min_left_qtr.dt.minute/60)

# Yards to go for first = Distance
# Yards to go for TD = 100-YardLine
bdb_rush_2018['Distance for TD'] = 100 - bdb_rush_2018['YardLine']

# Current score in the game
# Difference between PossessionTeam and DefTeam
# First create "PosTeamScore" and "DefTeamScore", then take the difference
bdb_rush_2018['PosTeamScore'] = np.where(bdb_rush_2018['PossessionTeam'] == bdb_rush_2018['HomeTeamAbbr'], bdb_rush_2018['HomeScoreBeforePlay'], bdb_rush_2018['VisitorScoreBeforePlay'])
bdb_rush_2018['DefTeamScore'] = np.where(bdb_rush_2018['DefTeam'] == bdb_rush_2018['HomeTeamAbbr'], bdb_rush_2018['HomeScoreBeforePlay'], bdb_rush_2018['VisitorScoreBeforePlay'])
bdb_rush_2018['ScoreDifference'] = bdb_rush_2018['PosTeamScore'] - bdb_rush_2018['DefTeamScore']

# create indicator if team is in Redzone (<= 20 yds to go from goal)
bdb_rush_2018['Redzone'] = np.where(bdb_rush_2018['Distance for TD'] <= 20, 1, 0)

# create indicator if time left in HALF is <= 2min
bdb_rush_2018['Under2Min'] = np.where(((bdb_rush_2018['Time Left in Quarter']<=2.0) & (bdb_rush_2018['Quarter']==2)|(bdb_rush_2018['Quarter']==4)), 1, 0)
# need to include timeout data, match on old_game_id...do this after matching up both datasets

In [69]:
## Match columns between rush and pass
print(bdb_rush_2018.columns)
print(bdb_pass_2018.columns)
bdb_pass_2018.head(1)

Index(['GameId', 'YardLine', 'Quarter', 'GameClock', 'PossessionTeam', 'Down',
       'Distance', 'FieldPosition', 'HomeScoreBeforePlay',
       'VisitorScoreBeforePlay', 'OffenseFormation', 'OffensePersonnel',
       'DefendersInTheBox', 'DefensePersonnel', 'HomeTeamAbbr',
       'VisitorTeamAbbr', 'Week', 'Stadium', 'Location', 'StadiumType', 'Turf',
       'GameWeather', 'Temperature', 'Humidity', 'WindSpeed', 'WindDirection',
       'DefTeam', 'Time Left in Quarter', 'Distance for TD', 'PosTeamScore',
       'DefTeamScore', 'ScoreDifference', 'Redzone', 'Under2Min'],
      dtype='object')
Index(['gameId', 'quarter', 'down', 'yardsToGo', 'possessionTeam',
       'yardlineSide', 'yardlineNumber', 'offenseFormation', 'personnelO',
       'defendersInTheBox', 'personnelD', 'preSnapVisitorScore',
       'preSnapHomeScore', 'gameClock', 'absoluteYardlineNumber'],
      dtype='object')


Unnamed: 0,gameId,quarter,down,yardsToGo,possessionTeam,yardlineSide,yardlineNumber,offenseFormation,personnelO,defendersInTheBox,personnelD,preSnapVisitorScore,preSnapHomeScore,gameClock,absoluteYardlineNumber
0,2018090600,1,1,15,ATL,ATL,20,I_FORM,"2 RB, 1 TE, 2 WR",7.0,"4 DL, 2 LB, 5 DB",0.0,0.0,15:00:00,90.0


In [70]:
bdb_rush_2018.head(1)

Unnamed: 0,GameId,YardLine,Quarter,GameClock,PossessionTeam,Down,Distance,FieldPosition,HomeScoreBeforePlay,VisitorScoreBeforePlay,OffenseFormation,OffensePersonnel,DefendersInTheBox,DefensePersonnel,HomeTeamAbbr,VisitorTeamAbbr,Week,Stadium,Location,StadiumType,Turf,GameWeather,Temperature,Humidity,WindSpeed,WindDirection,DefTeam,Time Left in Quarter,Distance for TD,PosTeamScore,DefTeamScore,ScoreDifference,Redzone,Under2Min
0,2018090600,30,1,14:22:00,ATL,2,5,ATL,0,0,I_FORM,"2 RB, 1 TE, 2 WR",7.0,"4 DL, 2 LB, 5 DB",PHI,ATL,1,Lincoln Financial Field,"Philadelphia, Pa.",Outdoor,Grass,Cloudy,81.0,71.0,8.0,NNW,PHI,14.366667,70,0,0,0,0,0


In [71]:
pass_cols = ['GameId', 'Quarter', 'Down', 'Distance', 'PossessionTeam',
             'FieldPosition', 'YardLine', 'OffenseFormation', 'OffensePersonnel',
             'DefendersInTheBox', 'DefensePersonnel', 'VisitorScoreBeforePlay',
             'HomeScoreBeforePlay', 'GameClock', 'Distance for TD']

## need to merge:
## HomeTeamAbbr, VisitorTeamAbbr, Week, Stadium, Location, StadiumType, Turf,
## GameWeather, Temperature, Humidity, WindSpeed, WindDirection

## need to make this myself:
## PosTeamScore, DefTeamScore, ScoreDifference, Redzone, Under2Min

# assign column names
bdb_pass_2018.columns = pass_cols

In [73]:
# Next...do I merge first, or do I create empty columns and then do it?
# Decisions, decisions. I will sleep on it.

Unnamed: 0,GameId,YardLine,Quarter,GameClock,PossessionTeam,Down,Distance,FieldPosition,HomeScoreBeforePlay,VisitorScoreBeforePlay,OffenseFormation,OffensePersonnel,DefendersInTheBox,DefensePersonnel,HomeTeamAbbr,VisitorTeamAbbr,Week,Stadium,Location,StadiumType,Turf,GameWeather,Temperature,Humidity,WindSpeed,WindDirection,DefTeam,Time Left in Quarter,Distance for TD,PosTeamScore,DefTeamScore,ScoreDifference,Redzone,Under2Min
0,2018090600,30,1,14:22:00,ATL,2,5,ATL,0,0,I_FORM,"2 RB, 1 TE, 2 WR",7.0,"4 DL, 2 LB, 5 DB",PHI,ATL,1,Lincoln Financial Field,"Philadelphia, Pa.",Outdoor,Grass,Cloudy,81.0,71.0,8.0,NNW,PHI,14.366667,70,0,0,0,0,0
1,2018090600,41,1,13:46:00,ATL,1,10,ATL,0,0,SINGLEBACK,"1 RB, 1 TE, 3 WR",7.0,"4 DL, 2 LB, 5 DB",PHI,ATL,1,Lincoln Financial Field,"Philadelphia, Pa.",Outdoor,Grass,Cloudy,81.0,71.0,8.0,NNW,PHI,13.766667,59,0,0,0,0,0
2,2018090600,6,1,12:15:00,ATL,1,6,PHI,0,0,SINGLEBACK,"1 RB, 1 TE, 3 WR",7.0,"4 DL, 2 LB, 5 DB",PHI,ATL,1,Lincoln Financial Field,"Philadelphia, Pa.",Outdoor,Grass,Cloudy,81.0,71.0,8.0,NNW,PHI,12.250000,94,0,0,0,0,0
3,2018090600,1,1,11:41:00,ATL,2,1,PHI,0,0,JUMBO,"2 RB, 3 TE, 0 WR",10.0,"6 DL, 3 LB, 2 DB",PHI,ATL,1,Lincoln Financial Field,"Philadelphia, Pa.",Outdoor,Grass,Cloudy,81.0,71.0,8.0,NNW,PHI,11.683333,99,0,0,0,0,0
4,2018090600,1,1,10:55:00,ATL,4,1,PHI,0,0,JUMBO,"2 RB, 3 TE, 0 WR",11.0,"6 DL, 3 LB, 2 DB",PHI,ATL,1,Lincoln Financial Field,"Philadelphia, Pa.",Outdoor,Grass,Cloudy,81.0,71.0,8.0,NNW,PHI,10.916667,99,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
11266,2018123015,35,4,03:03:00,ARZ,1,10,SEA,24,21,I_FORM,"2 RB, 1 TE, 2 WR",8.0,"4 DL, 3 LB, 4 DB",SEA,ARI,17,CenturyLink Field,"Seattle, WA",Outdoor,FieldTurf,Cloudy,45.0,76.0,5,SE,SEA,3.050000,65,21,24,-3,0,1
11267,2018123015,25,4,01:49:00,SEA,1,10,SEA,24,24,SHOTGUN,"1 RB, 1 TE, 3 WR",6.0,"4 DL, 2 LB, 5 DB",SEA,ARI,17,CenturyLink Field,"Seattle, WA",Outdoor,FieldTurf,Cloudy,45.0,76.0,5,SE,ARI,1.816667,75,24,24,0,0,1
11268,2018123015,34,4,01:24:00,SEA,3,1,SEA,24,24,SHOTGUN,"1 RB, 1 TE, 3 WR",7.0,"4 DL, 2 LB, 5 DB",SEA,ARI,17,CenturyLink Field,"Seattle, WA",Outdoor,FieldTurf,Cloudy,45.0,76.0,5,SE,ARI,1.400000,66,24,24,0,0,1
11269,2018123015,25,4,00:56:00,SEA,1,10,ARZ,24,24,SHOTGUN,"1 RB, 1 TE, 3 WR",7.0,"4 DL, 2 LB, 5 DB",SEA,ARI,17,CenturyLink Field,"Seattle, WA",Outdoor,FieldTurf,Cloudy,45.0,76.0,5,SE,ARI,0.933333,75,24,24,0,0,1


# Models

# Preliminary Results