# Overall Result and Questions

One of the data checks I wanted to run was whether the change in the ball's x coordinate from the first frame to the last frame of a play matches the playResult variable.

As you can see below, this analysis shows a large variance between the reported play result and the measured movement of the ball along the x coordinate from the first frame to the last frame.

So the questions I have are:
1. Is my assumption correct that the play result should equal the movement of the ball along the x coordinate from the first frame to the last frame?
2. If so, have I done this analysis correctly to find that answer?
3. If so, why would variances this large occur?

# The Analysis

In [1]:
import pandas as pd
import numpy as np
pd.options.display.max_columns = None

To do this comparison I'll need the playResult from the plays dataset. The best way I can see to track the ball across frames is to track the ball carrier's x coordinate in each frame, so I'll also need the plays dataset for the ballCarrierId information.

To keep this simple, I'll just do this comparison for the first week. I'm assuming whatever I find will be reflected in the other 8 weeks. 

In [2]:
plays = pd.read_csv('./data/plays.csv')
week1 = pd.read_csv('./data/tracking_week_1.csv')

In [3]:
plays

Unnamed: 0,gameId,playId,ballCarrierId,ballCarrierDisplayName,playDescription,quarter,down,yardsToGo,possessionTeam,defensiveTeam,yardlineSide,yardlineNumber,gameClock,preSnapHomeScore,preSnapVisitorScore,passResult,passLength,penaltyYards,prePenaltyPlayResult,playResult,playNullifiedByPenalty,absoluteYardlineNumber,offenseFormation,defendersInTheBox,passProbability,preSnapHomeTeamWinProbability,preSnapVisitorTeamWinProbability,homeTeamWinProbabilityAdded,visitorTeamWinProbilityAdded,expectedPoints,expectedPointsAdded,foulName1,foulName2,foulNFLId1,foulNFLId2
0,2022100908,3537,48723,Parker Hesse,(7:52) (Shotgun) M.Mariota pass short middle t...,4,1,10,ATL,TB,ATL,41,7:52,21,7,C,6.0,,9,9,N,69,SHOTGUN,7.0,0.747284,0.976785,0.023215,-0.006110,0.006110,2.360609,0.981955,,,,
1,2022091103,3126,52457,Chase Claypool,(7:38) (Shotgun) C.Claypool right end to PIT 3...,4,1,10,PIT,CIN,PIT,34,7:38,14,20,,,,3,3,N,76,SHOTGUN,7.0,0.416454,0.160485,0.839515,-0.010865,0.010865,1.733344,-0.263424,,,,
2,2022091111,1148,42547,Darren Waller,(8:57) D.Carr pass short middle to D.Waller to...,2,2,5,LV,LAC,LV,30,8:57,10,3,C,11.0,,15,15,N,40,I_FORM,6.0,0.267933,0.756661,0.243339,-0.037409,0.037409,1.312855,1.133666,,,,
3,2022100212,2007,46461,Mike Boone,(13:12) M.Boone left tackle to DEN 44 for 7 ya...,3,2,10,DEN,LV,DEN,37,13:12,19,16,,,,7,7,N,47,SINGLEBACK,6.0,0.592704,0.620552,0.379448,-0.002451,0.002451,1.641006,-0.043580,,,,
4,2022091900,1372,47857,Devin Singletary,(8:33) D.Singletary right guard to TEN 32 for ...,2,1,10,BUF,TEN,TEN,35,8:33,7,7,,,,3,3,N,75,I_FORM,7.0,0.470508,0.836290,0.163710,0.001053,-0.001053,3.686428,-0.167903,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
12481,2022100204,123,43293,Ezekiel Elliott,(13:31) E.Elliott right tackle to WAS 38 for 1...,1,1,10,DAL,WAS,WAS,39,13:31,0,0,,,,1,1,N,49,SINGLEBACK,6.0,0.577800,0.638600,0.361400,-0.011542,0.011542,3.642571,-0.504018,,,,
12482,2022091200,3467,46189,Will Dissly,(6:08) G.Smith pass short right to W.Dissly to...,4,1,10,SEA,DEN,SEA,30,6:08,17,16,C,0.0,,2,2,N,40,SINGLEBACK,7.0,0.298983,0.615241,0.384759,-0.025458,0.025458,1.434580,-0.444642,,,,
12483,2022101605,3371,44860,Joe Mixon,(9:35) (Shotgun) J.Mixon left end to CIN 47 fo...,4,1,10,CIN,NO,CIN,41,9:35,26,21,,,,6,6,N,69,SHOTGUN,6.0,0.639439,0.667054,0.332946,-0.005164,0.005164,2.115356,0.203819,,,,
12484,2022100207,2777,52449,Jonathan Taylor,(2:02) (Shotgun) J.Taylor up the middle to TEN...,3,1,10,IND,TEN,TEN,34,2:02,17,24,,,,-2,-2,N,44,SHOTGUN,6.0,0.518695,0.410611,0.589389,-0.046648,0.046648,3.946232,-0.976039,,,,


Since playId is "...not unique across games" I'll combine that column and gameId to create a new variable called gamePlayId to have a unique identifier for each play. 

In [4]:
# Concatenate the gameId and playId digits into a single integer key
plays['gamePlayId'] = (plays['gameId'].astype(str) + plays['playId'].astype(str)).astype('int64')

I'll do the same thing on the week1 dataset. 

In [5]:
# Same key construction as for plays, so the two tables can be merged on it
week1['gamePlayId'] = (week1['gameId'].astype(str) + week1['playId'].astype(str)).astype('int64')

I only need a few columns from the plays dataset to merge into the week1 dataset.

In [6]:
plays_merge = plays[['gamePlayId', 'ballCarrierId', 'playResult']]

In [7]:
plays_merge

Unnamed: 0,gamePlayId,ballCarrierId,playResult
0,20221009083537,48723,9
1,20220911033126,52457,3
2,20220911111148,42547,15
3,20221002122007,46461,7
4,20220919001372,47857,3
...,...,...,...
12481,2022100204123,43293,1
12482,20220912003467,46189,2
12483,20221016053371,44860,6
12484,20221002072777,52449,-2


In [8]:
week1 = pd.merge(week1, plays_merge, on=['gamePlayId'], how='left')
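
As an alternative to building a combined key, the merge can be done on both key columns directly. The sketch below uses small made-up tables (`plays_toy` and `tracking_toy` are hypothetical stand-ins, not the real files) and adds the `validate` and `indicator` options of `DataFrame.merge` to check that each tracking row matches exactly one play.

```python
import pandas as pd

# Toy stand-ins for the real plays and tracking tables (values are made up).
plays_toy = pd.DataFrame({
    "gameId": [2022090800, 2022090800, 2022091100],
    "playId": [56, 80, 56],
    "ballCarrierId": [42489, 46076, 47857],
    "playResult": [6, 7, 11],
})
tracking_toy = pd.DataFrame({
    "gameId": [2022090800, 2022090800, 2022091100],
    "playId": [56, 56, 56],
    "frameId": [1, 2, 1],
})

# Merge on both key columns. validate="many_to_one" raises an error if a
# (gameId, playId) pair appears more than once in plays_toy.
merged = tracking_toy.merge(
    plays_toy,
    on=["gameId", "playId"],
    how="left",
    validate="many_to_one",
    indicator=True,
)

# Rows flagged "left_only" would be tracking rows with no matching play.
unmatched = int((merged["_merge"] != "both").sum())
print(merged)
print("tracking rows without a matching play:", unmatched)
```

Note that playId 56 appears in two different games here and is still matched correctly, because gameId is part of the key.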

In [9]:
week1

Unnamed: 0,gameId,playId,nflId,displayName,frameId,time,jerseyNumber,club,playDirection,x,y,s,a,dis,o,dir,event,gamePlayId,ballCarrierId,playResult
0,2022090800,56,35472.0,Rodger Saffold,1,2022-09-08 20:24:05.200000,76.0,BUF,left,88.370000,27.27,1.62,1.15,0.16,231.74,147.90,,202209080056,42489,6
1,2022090800,56,35472.0,Rodger Saffold,2,2022-09-08 20:24:05.299999,76.0,BUF,left,88.470000,27.13,1.67,0.61,0.17,230.98,148.53,pass_arrived,202209080056,42489,6
2,2022090800,56,35472.0,Rodger Saffold,3,2022-09-08 20:24:05.400000,76.0,BUF,left,88.560000,27.01,1.57,0.49,0.15,230.98,147.05,,202209080056,42489,6
3,2022090800,56,35472.0,Rodger Saffold,4,2022-09-08 20:24:05.500000,76.0,BUF,left,88.640000,26.90,1.44,0.89,0.14,232.38,145.42,,202209080056,42489,6
4,2022090800,56,35472.0,Rodger Saffold,5,2022-09-08 20:24:05.599999,76.0,BUF,left,88.720000,26.80,1.29,1.24,0.13,233.36,141.95,,202209080056,42489,6
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1407434,2022091200,3826,,football,49,2022-09-12 23:05:57.799999,,football,left,56.220001,9.89,2.56,1.25,0.25,,,tackle,20220912003826,53464,9
1407435,2022091200,3826,,football,50,2022-09-12 23:05:57.900000,,football,left,56.060001,10.08,2.50,1.14,0.24,,,,20220912003826,53464,9
1407436,2022091200,3826,,football,51,2022-09-12 23:05:58.000000,,football,left,55.889999,10.27,2.38,1.70,0.25,,,,20220912003826,53464,9
1407437,2022091200,3826,,football,52,2022-09-12 23:05:58.099999,,football,left,55.730000,10.44,2.07,2.83,0.24,,,,20220912003826,53464,9


To track where the ball is on any frame, I'll want to find each frame where the nflId is equal to the ballCarrierId. However, there are 61,193 rows where the nflId is 'NaN'. These correspond to the rows where the displayName is 'football'. So for this exercise, I'll just fill the 'NaN' nflId values with zeros.

In [10]:
week1['nflId'].isna().sum()

61193

In [11]:
week1['displayName'].value_counts()

football             61193
Josh Allen            3280
David Long            3261
Jalen Pitre           2854
Jonathan Owens        2854
                     ...  
Keisean Nixon           20
P.J. Locke              16
Eric Johnson            15
Jonathan Williams       14
J.C. Hassenauer         13
Name: displayName, Length: 1166, dtype: int64

In [12]:
week1['nflId'] = week1['nflId'].fillna(0)
week1['nflId'] = week1['nflId'].astype(int)
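
Since the 'football' rows carry their own x coordinate, another option is to read the ball's position from those rows instead of from the ball carrier. The following is only a sketch on made-up data; `tracking_toy`, `xFirstBall` and `xLastBall` are hypothetical names, and whether the football rows are a better reference than the ball carrier is an open question here.

```python
import pandas as pd

# Made-up tracking rows: two plays, two frames each, one player plus the football.
tracking_toy = pd.DataFrame({
    "gamePlayId": [1, 1, 1, 1, 2, 2, 2, 2],
    "displayName": ["football", "Player A", "football", "Player A",
                    "football", "Player B", "football", "Player B"],
    "frameId": [1, 1, 2, 2, 1, 1, 2, 2],
    "x": [80.0, 80.6, 79.0, 79.5, 30.0, 28.0, 41.0, 40.5],
})

# Keep only the football rows and order them by frame within each play.
ball = tracking_toy[tracking_toy["displayName"] == "football"]
ball = ball.sort_values(["gamePlayId", "frameId"])

# First and last x of the football for every play.
ball_x = ball.groupby("gamePlayId")["x"].agg(xFirstBall="first", xLastBall="last")
print(ball_x)
```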

The new variable xBallLocation will hold the ball carrier's x coordinate on the rows where nflId matches ballCarrierId, and 0 on every other row. That should be the yard line the ball is on for each frame of a particular play.

In [13]:
week1['xBallLocation'] = np.where(week1['ballCarrierId'] == week1['nflId'], week1['x'], 0)

Next, I'll identify the first and last frame of each play. 

In [14]:
week1['firstFrame'] = week1.groupby('gamePlayId')['frameId'].transform('min')
week1['lastFrame'] = week1.groupby('gamePlayId')['frameId'].transform('max')

Next, I'll create the xFirstFrame and xLastFrame variables to identify where the ball is on those frames. Then I'll spread those values to all the rows of that play, not just the rows that belong to the ball carrier.

In [15]:
week1['xFirstFrame'] = np.where(week1['frameId'] == week1['firstFrame'], week1['xBallLocation'], 0)
week1['xLastFrame'] = np.where(week1['frameId'] == week1['lastFrame'], week1['xBallLocation'], 0)

In [16]:
# The non-carrier rows hold 0, so the per-play max picks out the ball carrier's x value
week1['xFirstFrame'] = week1.groupby('gamePlayId')['xFirstFrame'].transform('max')
week1['xLastFrame'] = week1.groupby('gamePlayId')['xLastFrame'].transform('max')
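
The same first-frame and last-frame values can also be computed without a 0 placeholder, by filtering to the ball carrier's rows and taking the first and last x per play. This is a minimal sketch on made-up data (`tracking_toy` and `per_play` are hypothetical names); it produces one row per play rather than repeating the values on every tracking row.

```python
import pandas as pd

# Made-up tracking rows: two plays, two frames each, two players per frame.
tracking_toy = pd.DataFrame({
    "gamePlayId":    [1, 1, 1, 1, 2, 2, 2, 2],
    "nflId":         [10, 11, 10, 11, 20, 21, 20, 21],
    "ballCarrierId": [10, 10, 10, 10, 21, 21, 21, 21],
    "frameId":       [1, 1, 2, 2, 1, 1, 2, 2],
    "x":             [80.6, 88.4, 79.5, 88.5, 30.0, 28.0, 35.0, 40.5],
})

# Keep only the rows that belong to the ball carrier, ordered by frame.
carrier = tracking_toy[tracking_toy["nflId"] == tracking_toy["ballCarrierId"]]
carrier = carrier.sort_values(["gamePlayId", "frameId"])

# One row per play with the carrier's x on the first and last tracked frame.
per_play = (
    carrier.groupby("gamePlayId")["x"]
    .agg(xFirstFrame="first", xLastFrame="last")
    .reset_index()
)
print(per_play)
```

Because a comparison against NaN evaluates to False in pandas, the football rows drop out of the filter on their own, so this variant does not need the NaN nflId values to be filled first.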

In [17]:
week1

Unnamed: 0,gameId,playId,nflId,displayName,frameId,time,jerseyNumber,club,playDirection,x,y,s,a,dis,o,dir,event,gamePlayId,ballCarrierId,playResult,xBallLocation,firstFrame,lastFrame,xFirstFrame,xLastFrame
0,2022090800,56,35472,Rodger Saffold,1,2022-09-08 20:24:05.200000,76.0,BUF,left,88.370000,27.27,1.62,1.15,0.16,231.74,147.90,,202209080056,42489,6,0.0,1,22,80.60,79.51
1,2022090800,56,35472,Rodger Saffold,2,2022-09-08 20:24:05.299999,76.0,BUF,left,88.470000,27.13,1.67,0.61,0.17,230.98,148.53,pass_arrived,202209080056,42489,6,0.0,1,22,80.60,79.51
2,2022090800,56,35472,Rodger Saffold,3,2022-09-08 20:24:05.400000,76.0,BUF,left,88.560000,27.01,1.57,0.49,0.15,230.98,147.05,,202209080056,42489,6,0.0,1,22,80.60,79.51
3,2022090800,56,35472,Rodger Saffold,4,2022-09-08 20:24:05.500000,76.0,BUF,left,88.640000,26.90,1.44,0.89,0.14,232.38,145.42,,202209080056,42489,6,0.0,1,22,80.60,79.51
4,2022090800,56,35472,Rodger Saffold,5,2022-09-08 20:24:05.599999,76.0,BUF,left,88.720000,26.80,1.29,1.24,0.13,233.36,141.95,,202209080056,42489,6,0.0,1,22,80.60,79.51
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1407434,2022091200,3826,0,football,49,2022-09-12 23:05:57.799999,,football,left,56.220001,9.89,2.56,1.25,0.25,,,tackle,20220912003826,53464,9,0.0,1,53,70.71,55.72
1407435,2022091200,3826,0,football,50,2022-09-12 23:05:57.900000,,football,left,56.060001,10.08,2.50,1.14,0.24,,,,20220912003826,53464,9,0.0,1,53,70.71,55.72
1407436,2022091200,3826,0,football,51,2022-09-12 23:05:58.000000,,football,left,55.889999,10.27,2.38,1.70,0.25,,,,20220912003826,53464,9,0.0,1,53,70.71,55.72
1407437,2022091200,3826,0,football,52,2022-09-12 23:05:58.099999,,football,left,55.730000,10.44,2.07,2.83,0.24,,,,20220912003826,53464,9,0.0,1,53,70.71,55.72


Finally, I'll create the xFrameChange variable to measure the change in the ball location from the first frame to the last frame. I'm using absolute values since many of the changes in the x coordinate might be negative numbers as the offensive team moves onto their opponent's side of the field (for example a 10 yard gain from their opponent's 35 yard line to their opponent's 25 yard line). I know this will create positive values for negative yard plays, but for this overall analysis, that shouldn't be a material problem. 

Additionally, I'll create a framePlayVariance variable to track the difference between the frame change and the play result. 

In [18]:
week1['xFrameChange'] = (week1['xLastFrame'] - week1['xFirstFrame']).abs()
week1['framePlayVariance'] = week1['playResult'] - week1['xFrameChange']
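
Taking the absolute value discards the sign of the change. A signed version can be sketched with the playDirection column, under the assumption (not confirmed in this notebook) that a playDirection of 'right' means the offense is moving toward larger x values and 'left' toward smaller ones. The data and the names `per_play` and `xFrameChangeSigned` below are made up for illustration.

```python
import numpy as np
import pandas as pd

# Made-up per-play values: direction of play, carrier x on first/last frame, reported result.
per_play = pd.DataFrame({
    "gamePlayId": [1, 2, 3],
    "playDirection": ["left", "right", "left"],
    "xFirstFrame": [80.6, 28.0, 50.0],
    "xLastFrame": [79.5, 40.5, 53.0],
    "playResult": [6, 12, -3],
})

# ASSUMPTION: "right" means the offense gains yards as x increases, "left" as x decreases.
direction_sign = np.where(per_play["playDirection"] == "right", 1, -1)

# Positive values are movement toward the opponent's end zone, negative values are losses.
per_play["xFrameChangeSigned"] = (
    per_play["xLastFrame"] - per_play["xFirstFrame"]
) * direction_sign
per_play["framePlayVariance"] = per_play["playResult"] - per_play["xFrameChangeSigned"]
print(per_play)
```

With this sign convention, the third toy play (a 3-yard loss while moving left) keeps its negative value instead of being turned into a positive change.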

In [19]:
week1

Unnamed: 0,gameId,playId,nflId,displayName,frameId,time,jerseyNumber,club,playDirection,x,y,s,a,dis,o,dir,event,gamePlayId,ballCarrierId,playResult,xBallLocation,firstFrame,lastFrame,xFirstFrame,xLastFrame,xFrameChange,framePlayVariance
0,2022090800,56,35472,Rodger Saffold,1,2022-09-08 20:24:05.200000,76.0,BUF,left,88.370000,27.27,1.62,1.15,0.16,231.74,147.90,,202209080056,42489,6,0.0,1,22,80.60,79.51,1.09,4.91
1,2022090800,56,35472,Rodger Saffold,2,2022-09-08 20:24:05.299999,76.0,BUF,left,88.470000,27.13,1.67,0.61,0.17,230.98,148.53,pass_arrived,202209080056,42489,6,0.0,1,22,80.60,79.51,1.09,4.91
2,2022090800,56,35472,Rodger Saffold,3,2022-09-08 20:24:05.400000,76.0,BUF,left,88.560000,27.01,1.57,0.49,0.15,230.98,147.05,,202209080056,42489,6,0.0,1,22,80.60,79.51,1.09,4.91
3,2022090800,56,35472,Rodger Saffold,4,2022-09-08 20:24:05.500000,76.0,BUF,left,88.640000,26.90,1.44,0.89,0.14,232.38,145.42,,202209080056,42489,6,0.0,1,22,80.60,79.51,1.09,4.91
4,2022090800,56,35472,Rodger Saffold,5,2022-09-08 20:24:05.599999,76.0,BUF,left,88.720000,26.80,1.29,1.24,0.13,233.36,141.95,,202209080056,42489,6,0.0,1,22,80.60,79.51,1.09,4.91
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1407434,2022091200,3826,0,football,49,2022-09-12 23:05:57.799999,,football,left,56.220001,9.89,2.56,1.25,0.25,,,tackle,20220912003826,53464,9,0.0,1,53,70.71,55.72,14.99,-5.99
1407435,2022091200,3826,0,football,50,2022-09-12 23:05:57.900000,,football,left,56.060001,10.08,2.50,1.14,0.24,,,,20220912003826,53464,9,0.0,1,53,70.71,55.72,14.99,-5.99
1407436,2022091200,3826,0,football,51,2022-09-12 23:05:58.000000,,football,left,55.889999,10.27,2.38,1.70,0.25,,,,20220912003826,53464,9,0.0,1,53,70.71,55.72,14.99,-5.99
1407437,2022091200,3826,0,football,52,2022-09-12 23:05:58.099999,,football,left,55.730000,10.44,2.07,2.83,0.24,,,,20220912003826,53464,9,0.0,1,53,70.71,55.72,14.99,-5.99


# The Results

The first 15 plays, sorted by gamePlayId, show a large variance between these two values.

In [21]:
# Every row of a play carries the same values, so max simply picks that value
play_yards = week1.groupby('gamePlayId').agg(
    PlayResult=('playResult', 'max'),
    FrameChange=('xFrameChange', 'max'),
    Variance=('framePlayVariance', 'max'),
).reset_index()

play_yards_sorted = play_yards.sort_values(by='gamePlayId')
play_yards_sorted.head(15)

Unnamed: 0,gamePlayId,PlayResult,FrameChange,Variance
0,202209080056,6,1.09,4.91
1,202209080080,7,11.47,-4.47
2,202209110057,11,18.24,-7.24
3,202209110078,6,13.94,-7.94
4,202209110185,-5,3.08,-8.08
5,202209110286,3,7.91,-4.91
6,202209110358,3,4.94,-1.94
7,202209110382,5,1.01,3.99
8,202209110458,-1,6.08,-7.08
9,202209110486,50,58.8,-8.8
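
One thing that may be worth checking for question 3 is whether the ball carrier's first tracked frame is actually at the line of scrimmage. The sketch below compares the last-frame x against absoluteYardlineNumber from the plays dataset instead of against the first frame. It assumes that absoluteYardlineNumber is on the same scale as the tracking x coordinate and that playDirection 'right' means increasing x; both are assumptions, and the data and the names `gainFromLos` and `losVariance` are made up.

```python
import numpy as np
import pandas as pd

# Made-up per-play values (not taken from the real data).
per_play = pd.DataFrame({
    "gamePlayId": [1, 2],
    "playDirection": ["left", "right"],
    "absoluteYardlineNumber": [76, 40],
    "xLastFrame": [73.0, 55.0],
    "playResult": [3, 15],
})

# ASSUMPTION: "right" means the offense gains yards as x increases.
sign = np.where(per_play["playDirection"] == "right", 1, -1)

# ASSUMPTION: absoluteYardlineNumber is the line of scrimmage on the tracking x scale.
per_play["gainFromLos"] = (
    per_play["xLastFrame"] - per_play["absoluteYardlineNumber"]
) * sign
per_play["losVariance"] = per_play["playResult"] - per_play["gainFromLos"]
print(per_play)
```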


The average variances for all 16 games of the first week are also large.

In [22]:
# Average over every tracking row of each game
play_yards = week1.groupby('gameId').agg(
    PlayResult=('playResult', 'mean'),
    FrameChange=('xFrameChange', 'mean'),
    Variance=('framePlayVariance', 'mean'),
).reset_index()

play_yards_sorted = play_yards.sort_values(by='gameId')
play_yards_sorted

Unnamed: 0,gameId,PlayResult,FrameChange,Variance
0,2022090800,6.730256,8.833769,-2.103513
1,2022091100,7.906611,12.33494,-4.428329
2,2022091101,8.269632,12.620509,-4.350877
3,2022091102,6.157533,10.326426,-4.168893
4,2022091103,6.678283,9.788816,-3.110533
5,2022091104,9.133441,13.444933,-4.311492
6,2022091105,6.612102,10.075265,-3.463163
7,2022091106,6.928593,10.088701,-3.160107
8,2022091107,5.617452,9.494224,-3.876773
9,2022091108,8.784129,14.063491,-5.279362
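
The per-game means above are taken over every tracking row, so a play with more frames counts more heavily than a short one. A sketch of the difference between that and a per-play average, using made-up numbers (`tracking_toy`, `row_weighted` and `play_weighted` are hypothetical names):

```python
import pandas as pd

# Made-up data: one game, a long play (4 rows) and a short play (1 row).
tracking_toy = pd.DataFrame({
    "gameId":            [1, 1, 1, 1, 1],
    "gamePlayId":        [11, 11, 11, 11, 12],
    "playResult":        [10, 10, 10, 10, 2],
    "xFrameChange":      [12.0, 12.0, 12.0, 12.0, 1.0],
    "framePlayVariance": [-2.0, -2.0, -2.0, -2.0, 1.0],
})

# Mean over all tracking rows: the long play dominates.
row_weighted = tracking_toy.groupby("gameId")["framePlayVariance"].mean()

# Mean over plays: keep one row per play first, then average.
per_play = tracking_toy.drop_duplicates("gamePlayId")
play_weighted = per_play.groupby("gameId")["framePlayVariance"].mean()

print("row-weighted mean:", row_weighted.loc[1])
print("play-weighted mean:", play_weighted.loc[1])
```

In this toy case the row-weighted mean is -1.4 while the play-weighted mean is -0.5, so the two approaches can give noticeably different per-game figures.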
