# NFL Data Bowl Analysis

This notebook walks through the data collection, analysis, and results of a study narrowed down to passing plays from the 2018 NFL season. The data comes from the NFL's 2021 Big Data Bowl: https://www.kaggle.com/c/nfl-big-data-bowl-2021/

*There is potential to combine this with the 2020 Big Data Bowl data, which contains similar information to the 2021 data but covers rushing plays from 2017-2019. Combining the two sources would allow a notebook similar to the original tendency analysis: less data, but more information in our columns.*

The main focus of this analysis is to see how offensive / defensive personnel matchups, the distance from the closest defender to the targeted receiver, and other factors detailed later in the notebook factor into each play. This is also a unique opportunity to use player tracking / location data.

Other data bowls for reference:
- https://www.kaggle.com/c/nfl-big-data-bowl-2020: Forecast yardage gained on the run plays
- https://www.kaggle.com/c/nfl-big-data-bowl-2022: Analyze special teams data
- https://github.com/nfl-football-ops/Big-Data-Bowl: Inaugural data bowl from 2019, useful R code on animation of tracking

In [3]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import os
pd.set_option('display.max_columns', None)

In [4]:
bdb_pass = pd.read_csv('nfl-big-data-bowl/pass-2018.csv')
bdb_pass

Unnamed: 0,gameId,playId,playDescription,quarter,down,yardsToGo,possessionTeam,playType,yardlineSide,yardlineNumber,offenseFormation,personnelO,defendersInTheBox,numberOfPassRushers,personnelD,typeDropback,preSnapVisitorScore,preSnapHomeScore,gameClock,absoluteYardlineNumber,penaltyCodes,penaltyJerseyNumbers,passResult,offensePlayResult,playResult,epa,isDefensivePI
0,2018090600,75,(15:00) M.Ryan pass short right to J.Jones pus...,1,1,15,ATL,play_type_pass,ATL,20,I_FORM,"2 RB, 1 TE, 2 WR",7.0,4.0,"4 DL, 2 LB, 5 DB",TRADITIONAL,0.0,0.0,15:00:00,90.0,,,C,10,10,0.261827,False
1,2018090600,146,(13:10) M.Ryan pass incomplete short right to ...,1,1,10,ATL,play_type_pass,PHI,39,SINGLEBACK,"1 RB, 1 TE, 3 WR",7.0,4.0,"4 DL, 2 LB, 5 DB",TRADITIONAL,0.0,0.0,13:10:00,49.0,,,I,0,0,-0.372360,False
2,2018090600,168,(13:05) (Shotgun) M.Ryan pass incomplete short...,1,2,10,ATL,play_type_pass,PHI,39,SHOTGUN,"2 RB, 1 TE, 2 WR",6.0,4.0,"4 DL, 2 LB, 5 DB",TRADITIONAL,0.0,0.0,13:05:00,49.0,,,I,0,0,-0.702779,False
3,2018090600,190,(13:01) (Shotgun) M.Ryan pass deep left to J.J...,1,3,10,ATL,play_type_pass,PHI,39,SHOTGUN,"1 RB, 1 TE, 3 WR",6.0,5.0,"4 DL, 1 LB, 6 DB",SCRAMBLE_ROLLOUT_LEFT,0.0,0.0,13:01:00,49.0,,,C,33,33,3.047530,False
4,2018090600,256,(10:59) (Shotgun) M.Ryan pass incomplete short...,1,3,1,ATL,play_type_pass,PHI,1,SHOTGUN,"2 RB, 3 TE, 0 WR",8.0,6.0,"6 DL, 3 LB, 2 DB",TRADITIONAL,0.0,0.0,10:59:00,11.0,,,I,0,0,-0.842272,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
19234,2018122200,2300,(7:53) J.Johnson pass incomplete short left [K...,3,2,5,WAS,play_type_unknown,WAS,31,SINGLEBACK,"1 RB, 1 TE, 3 WR",7.0,,"2 DL, 4 LB, 5 DB",,,,,,DH,TEN 24,I,0,5,0.848703,False
19235,2018122200,3177,(6:53) (Shotgun) B.Gabbert pass incomplete sho...,4,3,7,TEN,play_type_unknown,WAS,37,SHOTGUN,"1 RB, 1 TE, 3 WR",6.0,,"2 DL, 4 LB, 5 DB",,,,,,DH,WAS 31,I,0,5,1.227695,False
19236,2018122201,566,(5:32) (Shotgun) P.Rivers pass deep right to K...,1,3,4,LAC,play_type_unknown,LAC,49,SHOTGUN,"1 RB, 1 TE, 3 WR",6.0,,"1 DL, 5 LB, 5 DB",,,,,,OPI,LAC 13,C,27,-10,-0.720168,False
19237,2018122201,1719,(1:08) P.Rivers pass incomplete deep middle to...,2,3,1,LAC,play_type_unknown,LAC,48,SINGLEBACK,"1 RB, 1 TE, 3 WR",7.0,,"2 DL, 3 LB, 6 DB",,,,,,DPI,BAL 23,I,0,15,1.336078,True


In [5]:
bdb_rush = pd.read_csv('nfl-big-data-bowl/rush.csv')
bdb_rush

Unnamed: 0,GameId,PlayId,Team,X,Y,S,A,Dis,Orientation,Dir,NflId,DisplayName,JerseyNumber,Season,YardLine,Quarter,GameClock,PossessionTeam,Down,Distance,FieldPosition,HomeScoreBeforePlay,VisitorScoreBeforePlay,NflIdRusher,OffenseFormation,OffensePersonnel,DefendersInTheBox,DefensePersonnel,PlayDirection,TimeHandoff,TimeSnap,Yards,PlayerHeight,PlayerWeight,PlayerBirthDate,PlayerCollegeName,Position,HomeTeamAbbr,VisitorTeamAbbr,Week,Stadium,Location,StadiumType,Turf,GameWeather,Temperature,Humidity,WindSpeed,WindDirection
0,2017090700,20170907000118,away,73.91,34.84,1.69,1.13,0.40,81.99,177.18,496723,Eric Berry,29,2017,35,1,14:14:00,NE,3,2,NE,0,0,2543773,SHOTGUN,"1 RB, 1 TE, 3 WR",6.0,"2 DL, 3 LB, 6 DB",left,2017-09-08T00:44:06.000Z,2017-09-08T00:44:05.000Z,8,6-0,212,12/29/1988,Tennessee,SS,NE,KC,1,Gillette Stadium,"Foxborough, MA",Outdoor,Field Turf,Clear and warm,63.0,77.0,8.0,SW
1,2017090700,20170907000118,away,74.67,32.64,0.42,1.35,0.01,27.61,198.70,2495116,Allen Bailey,97,2017,35,1,14:14:00,NE,3,2,NE,0,0,2543773,SHOTGUN,"1 RB, 1 TE, 3 WR",6.0,"2 DL, 3 LB, 6 DB",left,2017-09-08T00:44:06.000Z,2017-09-08T00:44:05.000Z,8,6-3,288,03/25/1989,Miami,DE,NE,KC,1,Gillette Stadium,"Foxborough, MA",Outdoor,Field Turf,Clear and warm,63.0,77.0,8.0,SW
2,2017090700,20170907000118,away,74.00,33.20,1.22,0.59,0.31,3.01,202.73,2495493,Justin Houston,50,2017,35,1,14:14:00,NE,3,2,NE,0,0,2543773,SHOTGUN,"1 RB, 1 TE, 3 WR",6.0,"2 DL, 3 LB, 6 DB",left,2017-09-08T00:44:06.000Z,2017-09-08T00:44:05.000Z,8,6-3,270,01/21/1989,Georgia,DE,NE,KC,1,Gillette Stadium,"Foxborough, MA",Outdoor,Field Turf,Clear and warm,63.0,77.0,8.0,SW
3,2017090700,20170907000118,away,71.46,27.70,0.42,0.54,0.02,359.77,105.64,2506353,Derrick Johnson,56,2017,35,1,14:14:00,NE,3,2,NE,0,0,2543773,SHOTGUN,"1 RB, 1 TE, 3 WR",6.0,"2 DL, 3 LB, 6 DB",left,2017-09-08T00:44:06.000Z,2017-09-08T00:44:05.000Z,8,6-3,245,11/22/1982,Texas,ILB,NE,KC,1,Gillette Stadium,"Foxborough, MA",Outdoor,Field Turf,Clear and warm,63.0,77.0,8.0,SW
4,2017090700,20170907000118,away,69.32,35.42,1.82,2.43,0.16,12.63,164.31,2530794,Ron Parker,38,2017,35,1,14:14:00,NE,3,2,NE,0,0,2543773,SHOTGUN,"1 RB, 1 TE, 3 WR",6.0,"2 DL, 3 LB, 6 DB",left,2017-09-08T00:44:06.000Z,2017-09-08T00:44:05.000Z,8,6-0,206,08/17/1987,Newberry,FS,NE,KC,1,Gillette Stadium,"Foxborough, MA",Outdoor,Field Turf,Clear and warm,63.0,77.0,8.0,SW
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
682149,2019112500,20191125003789,home,69.58,31.33,4.19,1.33,0.43,92.32,20.94,2556593,Cory Littleton,58,2019,38,4,04:11:00,BLT,2,9,BLT,6,45,2562407,PISTOL,"1 RB, 1 TE, 3 WR",6.0,"3 DL, 3 LB, 5 DB",left,2019-11-26T03:59:06.000Z,2019-11-26T03:59:05.000Z,4,6-3,228,11/18/1993,Washington,ILB,LA,BAL,12,Los Angeles Memorial Coliseum,"Los Angeles, CA",Outdoor,Grass,Clear,62.0,64.0,5,WSW
682150,2019112500,20191125003789,home,72.16,25.09,1.01,1.40,0.08,65.83,76.48,2558199,Tanzel Smart,92,2019,38,4,04:11:00,BLT,2,9,BLT,6,45,2562407,PISTOL,"1 RB, 1 TE, 3 WR",6.0,"3 DL, 3 LB, 5 DB",left,2019-11-26T03:59:06.000Z,2019-11-26T03:59:05.000Z,4,6-1,295,11/06/1994,Tulane,DT,LA,BAL,12,Los Angeles Memorial Coliseum,"Los Angeles, CA",Outdoor,Grass,Clear,62.0,64.0,5,WSW
682151,2019112500,20191125003789,home,72.23,31.95,1.75,0.68,0.17,93.99,90.26,2558830,Samson Ebukam,50,2019,38,4,04:11:00,BLT,2,9,BLT,6,45,2562407,PISTOL,"1 RB, 1 TE, 3 WR",6.0,"3 DL, 3 LB, 5 DB",left,2019-11-26T03:59:06.000Z,2019-11-26T03:59:05.000Z,4,6-3,245,05/09/1995,Eastern Washington,OLB,LA,BAL,12,Los Angeles Memorial Coliseum,"Los Angeles, CA",Outdoor,Grass,Clear,62.0,64.0,5,WSW
682152,2019112500,20191125003789,home,72.08,26.74,0.70,1.59,0.07,61.47,48.34,2562391,Greg Gaines,91,2019,38,4,04:11:00,BLT,2,9,BLT,6,45,2562407,PISTOL,"1 RB, 1 TE, 3 WR",6.0,"3 DL, 3 LB, 5 DB",left,2019-11-26T03:59:06.000Z,2019-11-26T03:59:05.000Z,4,6-1,312,05/06/1996,Washington,DT,LA,BAL,12,Los Angeles Memorial Coliseum,"Los Angeles, CA",Outdoor,Grass,Clear,62.0,64.0,5,WSW


# Useful Features Present in Both Datasets

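From a cursory examination of the data, before any cleaning, the features below are the ones I expect to be most useful. Only pre-snap features can be used, since post-snap information would give away the play type and defeat the purpose of the analysis. Column names follow the rushing data; the passing data has camelCase counterparts such as `quarter`, `down`, and `yardsToGo`.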

- Type of play (run/pass): TARGET (need to create)
- Drive number (need to create / merge with pbp)
- Offensive / defensive team (`HomeTeamAbbr`, `VisitorTeamAbbr`, `PossessionTeam`, need to create `DefTeam`)
- Quarter of the game (`Quarter`)
- Down number (`Down`)
- Time left in a quarter (`GameClock`, need to format this)
- Yards to gain for a first down (`Distance`)
- Yards to gain for a touchdown (`YardLine` only runs up to 50, so this is 100 minus `YardLine` when `FieldPosition` is the offense's own side, and `YardLine` itself otherwise)
- Current score in the game (model as difference, `HomeScoreBeforePlay - VisitorScoreBeforePlay`)
- Offensive formation (`OffenseFormation`)
- Offensive personnel (`OffensePersonnel`)
- Defenders in the box (`DefendersInTheBox`)
- Defensive personnel (`DefensePersonnel`)
- Week of season (`Week`)
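
Several of the features above have to be derived rather than read directly. A minimal sketch of those derivations is shown here on a toy frame that reuses the rushing data's column names; the values, the helper name `add_presnap_features`, and the output column names `SecondsLeftInQuarter`, `ScoreDiff`, and `YardsToEndZone` are made up for illustration.

```python
import pandas as pd

# Toy rows with the rushing data's column names; values are illustrative only.
plays = pd.DataFrame({
    "HomeTeamAbbr": ["NE", "NE"],
    "VisitorTeamAbbr": ["KC", "KC"],
    "PossessionTeam": ["NE", "KC"],
    "FieldPosition": ["NE", "NE"],
    "YardLine": [35, 20],
    "GameClock": ["14:14:00", "02:05:00"],
    "HomeScoreBeforePlay": [0, 7],
    "VisitorScoreBeforePlay": [0, 3],
})

def add_presnap_features(df):
    """Hypothetical helper: derive DefTeam, clock, score and field-position features."""
    out = df.copy()
    # The defense is whichever of the two teams does not have the ball.
    home_has_ball = out["PossessionTeam"] == out["HomeTeamAbbr"]
    out["DefTeam"] = out["VisitorTeamAbbr"].where(home_has_ball, out["HomeTeamAbbr"])
    # GameClock looks like "MM:SS:00"; keep only minutes and seconds.
    parts = out["GameClock"].str.split(":", expand=True)
    out["SecondsLeftInQuarter"] = parts[0].astype(int) * 60 + parts[1].astype(int)
    # Score modeled as home minus visitor, as described in the list above.
    out["ScoreDiff"] = out["HomeScoreBeforePlay"] - out["VisitorScoreBeforePlay"]
    # YardLine is relative to one side of the field, so the side matters.
    own_side = out["FieldPosition"] == out["PossessionTeam"]
    out["YardsToEndZone"] = (100 - out["YardLine"]).where(own_side, out["YardLine"])
    return out

features = add_presnap_features(plays)
```

One caveat visible in the rushing output: `PossessionTeam` and the `HomeTeamAbbr` / `VisitorTeamAbbr` columns do not always use the same abbreviation (`BLT` vs. `BAL`), so the team codes would need to be harmonized before the equality checks above can be trusted.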

# Useful Features Present Only in Rushing Data

Some features are only included in the rushing play data. For 2018, however, we could match games across the two datasets and bring this information into the pass plays as well.

- Stadium type (`StadiumType`)
- Turf or grass (`Turf`)
- Weather in game (`GameWeather`)
- Temperature on game day (`Temperature`)
- Humidity on game day (`Humidity`)
- Wind speed on game day (`WindSpeed`)
- Wind direction (`WindDirection`)
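
Matching games could look roughly like the sketch below: collapse the rushing data to one row per game, then left-join it onto the pass plays. The two small frames stand in for `bdb_rush` and `bdb_pass`, and their weather values are invented; only the key names (`GameId` in the rushing data, `gameId` in the passing data) come from the outputs above.

```python
import pandas as pd

# Stand-in for bdb_rush: many rows per game, game-level columns repeated on each.
rush = pd.DataFrame({
    "GameId": [2018090600, 2018090600, 2018090900],
    "StadiumType": ["Outdoor", "Outdoor", "Indoor"],
    "Turf": ["Grass", "Grass", "Field Turf"],
    "Temperature": [81.0, 81.0, 70.0],
})
# Stand-in for bdb_pass, which spells the key gameId.
passes = pd.DataFrame({
    "gameId": [2018090600, 2018090900, 2018091000],
    "playId": [75, 51, 90],
})

game_cols = ["StadiumType", "Turf", "Temperature"]
# One row per game, keeping the first occurrence of each game-level value.
game_info = rush.drop_duplicates(subset="GameId")[["GameId"] + game_cols]

# Left join keeps every pass play; validate guards against duplicate game rows.
passes_with_weather = passes.merge(
    game_info,
    how="left",
    left_on="gameId",
    right_on="GameId",
    validate="many_to_one",
).drop(columns="GameId")
```

Pass plays from a game with no rushing rows (the third toy game here) end up with missing weather values, so it would be worth counting those before relying on the merged columns.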

## Next Steps / Cleaning Data

- Is it more useful to analyze an imbalanced dataset with rushing plays from 2017-2019 and passing plays from 2018, using only the first subset of features?
- Or would I rather narrow the rushing plays down to 2018 only and use all features (matching on `GameId` / `gameId` gives the weather and stadium data by game for the passing plays as well)?
- Regardless of which direction I choose, the following data cleaning steps remain:
    - Merge both datasets into same format
    - Ignore post-snap outcomes (i.e. yardage gained, direction of the run, play results, type of pass dropback)
    - Ignore tracking data (for now) since all of it is post-snap movements
        - Look into whether I can detect pre-snap motion, i.e. a WR shifting from the X spot to the slot
        - I have the time-of-snap data (`TimeSnap`), so there may be some potential
    - Change the offensive formation to one-hot encoded columns
    - Change the personnel for offense and defense to one-hot encoded columns
        - Or find a unique way to deal with this since there are many combinations, some with few observations
    - Narrow down each play to a single row
    - More TBD..
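
Two of the steps above, narrowing each play to a single row and merging both datasets into the same format, can be sketched together. The toy frames stand in for `bdb_rush` (one row per player per play) and `bdb_pass` (one row per play); the rename map `PASS_TO_RUSH`, the `SHARED` column list, and the target name `IsPass` are assumptions for illustration, not a final schema.

```python
import pandas as pd

# Stand-in for bdb_rush: the same play repeats once per tracked player.
rush = pd.DataFrame({
    "GameId": [2017090700, 2017090700, 2017090700],
    "PlayId": [20170907000118, 20170907000118, 20170907000139],
    "NflId": [496723, 2495116, 496723],
    "Quarter": [1, 1, 1],
    "Down": [3, 3, 1],
    "Distance": [2, 2, 10],
    "Yards": [8, 8, 3],  # post-snap outcome, dropped below
})
# Stand-in for bdb_pass: already one row per play, camelCase names.
passes = pd.DataFrame({
    "gameId": [2018090600],
    "playId": [75],
    "quarter": [1],
    "down": [1],
    "yardsToGo": [15],
    "passResult": ["C"],  # post-snap outcome, dropped below
})

# Assumed mapping from the passing columns to the rushing names.
PASS_TO_RUSH = {
    "gameId": "GameId",
    "playId": "PlayId",
    "quarter": "Quarter",
    "down": "Down",
    "yardsToGo": "Distance",
}
SHARED = ["GameId", "PlayId", "Quarter", "Down", "Distance"]

# Keep one row per rushing play and only the shared pre-snap columns.
rush_plays = rush.drop_duplicates(subset=["GameId", "PlayId"])[SHARED].assign(IsPass=0)
pass_plays = passes.rename(columns=PASS_TO_RUSH)[SHARED].assign(IsPass=1)

# Stack both sources; IsPass is the run/pass target.
plays = pd.concat([rush_plays, pass_plays], ignore_index=True)
```

Judging from the rows shown earlier, the rushing `PlayId` appears to embed the game id while the passing `playId` does not, so the two id columns are only comparable within their own source unless one of them is converted.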

In [6]:
bdb_rush['OffenseFormation'].unique()

array(['SHOTGUN', 'SINGLEBACK', 'JUMBO', 'PISTOL', 'I_FORM', 'ACE',
       'WILDCAT', nan, 'EMPTY'], dtype=object)
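
Since the formation list is short, one-hot encoding it is straightforward with `pd.get_dummies`. The sketch below runs on a small stand-in Series; the `UNKNOWN` label for the missing value and the `Formation` prefix are arbitrary choices made here.

```python
import pandas as pd

# Stand-in for bdb_rush['OffenseFormation'], including a missing value.
formations = pd.Series(
    ["SHOTGUN", "I_FORM", None, "SHOTGUN"], name="OffenseFormation"
)

# Give the missing formation its own explicit category before encoding.
formation_dummies = pd.get_dummies(
    formations.fillna("UNKNOWN"), prefix="Formation", dtype=int
)
```

Each row then has exactly one indicator set, and the resulting columns can be joined back onto the play-level frame by index.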

In [7]:
bdb_rush['OffensePersonnel'].unique()

array(['1 RB, 1 TE, 3 WR', '6 OL, 2 RB, 2 TE, 0 WR', '1 RB, 3 TE, 1 WR',
       '1 RB, 2 TE, 2 WR', '6 OL, 1 RB, 2 TE, 1 WR', '2 RB, 1 TE, 2 WR',
       '2 RB, 2 TE, 1 WR', '0 RB, 3 TE, 2 WR', '0 RB, 1 TE, 4 WR',
       '6 OL, 1 RB, 0 TE, 3 WR', '6 OL, 1 RB, 1 TE, 2 WR',
       '1 RB, 2 TE, 1 WR,1 DL', '1 RB, 3 TE, 0 WR,1 DL',
       '1 RB, 0 TE, 4 WR', '1 RB, 1 TE, 2 WR,1 DL',
       '6 OL, 2 RB, 0 TE, 2 WR', '2 RB, 0 TE, 3 WR',
       '6 OL, 2 RB, 1 TE, 1 WR', '7 OL, 1 RB, 0 TE, 2 WR',
       '7 OL, 2 RB, 0 TE, 1 WR', '7 OL, 1 RB, 2 TE, 0 WR',
       '2 RB, 3 TE, 0 WR', '3 RB, 1 TE, 1 WR', '6 OL, 1 RB, 3 TE, 0 WR',
       '6 OL, 1 RB, 2 TE, 0 WR,1 DL', '2 RB, 3 TE, 1 WR',
       '6 OL, 1 RB, 1 TE, 1 WR,1 DL', '1 RB, 4 TE, 0 WR',
       '1 RB, 2 TE, 1 WR,1 LB', '1 RB, 3 TE, 0 WR,1 LB',
       '7 OL, 2 RB, 1 TE, 0 WR', '0 RB, 2 TE, 3 WR',
       '1 RB, 0 TE, 3 WR,1 DB', '6 OL, 1 RB, 2 TE, 0 WR,1 LB',
       '1 RB, 1 TE, 2 WR,1 DB', '0 RB, 0 TE, 5 WR', '1 RB, 2 TE, 3 WR',
       '1 RB, 
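
With this many personnel combinations, some of them rare, one alternative to one-hot encoding every string is to split each string into per-position counts. The sketch below assumes every entry follows the "count, position" pattern seen above; the helper name `personnel_counts` and the `Off_` prefix are made up for illustration.

```python
import re
import pandas as pd

def personnel_counts(text):
    """Turn '6 OL, 1 RB, 2 TE, 0 WR,1 DL' into {'OL': 6, 'RB': 1, ...}."""
    if pd.isna(text):
        return {}
    # Each group is a number followed by a position code, with optional spacing.
    return {pos: int(n) for n, pos in re.findall(r"(\d+)\s*([A-Z]+)", text)}

# A few strings taken from the output above.
personnel = pd.Series([
    "1 RB, 1 TE, 3 WR",
    "6 OL, 1 RB, 2 TE, 0 WR,1 DL",
    "2 RB, 3 TE, 0 WR",
])

# One column per position; positions absent from a string are filled with 0.
counts = (
    pd.DataFrame(personnel.map(personnel_counts).tolist())
    .fillna(0)
    .astype(int)
    .add_prefix("Off_")
)
```

A string without an `OL` entry gets `Off_OL = 0` here, which presumably means the usual five linemen rather than none, so that column may need a different default. The same function should work for `DefensePersonnel` with a different prefix.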

In [8]:
bdb_rush['DefensePersonnel'].unique()

array(['2 DL, 3 LB, 6 DB', '4 DL, 4 LB, 3 DB', '3 DL, 2 LB, 6 DB',
       '3 DL, 4 LB, 4 DB', '3 DL, 3 LB, 5 DB', '4 DL, 3 LB, 4 DB',
       '4 DL, 1 LB, 6 DB', '4 DL, 2 LB, 5 DB', '5 DL, 2 LB, 4 DB',
       '2 DL, 4 LB, 5 DB', '2 DL, 5 LB, 4 DB', '5 DL, 4 LB, 2 DB',
       '1 DL, 5 LB, 5 DB', '5 DL, 3 LB, 3 DB', '6 DL, 2 LB, 3 DB',
       '3 DL, 5 LB, 3 DB', '6 DL, 3 LB, 2 DB', '1 DL, 3 LB, 7 DB',
       '2 DL, 2 LB, 7 DB', '4 DL, 5 LB, 2 DB', '1 DL, 4 LB, 6 DB',
       '4 DL, 5 LB, 1 DB, 1 OL', '6 DL, 1 LB, 4 DB',
       '2 DL, 4 LB, 4 DB, 1 OL', '6 DL, 4 LB, 1 DB', '5 DL, 1 LB, 5 DB',
       '4 DL, 6 LB, 1 DB', '0 DL, 5 LB, 6 DB', '5 DL, 4 LB, 1 DB, 1 OL',
       '3 DL, 1 LB, 7 DB', '4 DL, 0 LB, 7 DB', '3 DL, 4 LB, 3 DB, 1 OL',
       '5 DL, 5 LB, 1 DB', '5 DL, 3 LB, 2 DB, 1 OL', '0 DL, 6 LB, 5 DB',
       '1 DL, 2 LB, 8 DB', '0 DL, 4 LB, 7 DB', '7 DL, 2 LB, 2 DB',
       '0 DL, 4 LB, 6 DB, 1 RB', '1 DL, 4 LB, 5 DB, 1 RB',
       '2 DL, 4 LB, 4 DB, 1 RB', '2 DL, 3 LB, 5 DB, 1 RB',
 