Just some preliminary EDA -- Go RAVENS! :)

We'll start with a high-level overview of the pass plays from a SINGLE game, then take a quick look at the player and weekly tracking data. This peek should show us what we need to know before merging the data in different ways to dig into team-level and player-level insights later on. As a football fan, this should be a fun way to pick up some empirical ammunition to back up the trash talking among friends who root for other teams :)

So we have csv files for the following:<br>
players.csv: information on each player <br>
plays.csv: information on each passing play in a game <br>
games.csv: information on each game <br>
weekx.csv: player tracking data from all the pass plays during week x

In [None]:
import numpy as np 
import pandas as pd
import os
from pathlib import Path
import collections
import matplotlib.pyplot as plt

In [None]:
input_path = Path('../input/nfl-big-data-bowl-2021')

In [None]:
plays_df = pd.read_csv(input_path/'plays.csv')
players_df = pd.read_csv(input_path/'players.csv')
games_df = pd.read_csv(input_path/'games.csv')
#just pull 1 week for now
week1_df = pd.read_csv(input_path/'week1.csv')
plays_df.shape, players_df.shape, games_df.shape, week1_df.shape

In [None]:
plays_df.head()

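Each play is identified by a gameId together with a playId. Every game has a unique gameId, but a playId is only unique within its game: the same playId can show up in different games, and the number of plays varies from game to game. So we need the (gameId, playId) pair to pick out a single play.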
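A quick way to sanity-check the key structure described above is sketched here. It uses a tiny made-up frame instead of plays_df (the second gameId and playId 146 are hypothetical), so treat it as an illustration of the check rather than a result from the real data.

```python
import pandas as pd

# Toy stand-in for plays_df: playId 75 appears in two different games.
# The second gameId and playId 146 are made up for illustration.
toy_plays = pd.DataFrame({
    "gameId": [2018090600, 2018090600, 2018090900],
    "playId": [75, 146, 75],
})

# playId on its own repeats across games...
play_id_is_unique = toy_plays["playId"].is_unique

# ...but the (gameId, playId) pair identifies exactly one row.
pair_is_unique = not toy_plays.duplicated(subset=["gameId", "playId"]).any()

# The number of plays differs from game to game.
plays_per_game = toy_plays.groupby("gameId")["playId"].count()
print(play_id_is_unique, pair_is_unique)
print(plays_per_game)
```

Running the same three checks on the real plays_df should show whether the pair really is a safe merge key.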

Let's just look at one game:

In [None]:
#89 pass plays in this game!
one_game = plays_df[plays_df['gameId'] == 2018090600]
one_game.shape

In [None]:
one_game.iloc[0]

In [None]:
#the falcons had 7 more pass plays than the eagles this game
one_game['possessionTeam'].value_counts()

In [None]:
atl_plays = one_game[one_game['possessionTeam'] == 'ATL']
phi_plays = one_game[one_game['possessionTeam'] == 'PHI']

In [None]:
atl_plays.offenseFormation.value_counts(), phi_plays.offenseFormation.value_counts()

Nick Foles operated out of the shotgun for nearly the entire game. Why was this? Without knowing how many total plays there were, it's difficult to tell how pass-heavy either team was. But the Eagles (Phi) did not use as many unique formations on their pass plays.

In [None]:
plt.hist([atl_plays.yardsToGo, phi_plays.yardsToGo],
        label=['Atl', 'Phi'])
plt.title('How many yards needed for a first on pass attempts')
plt.legend(loc='upper right');

You normally need 10 yards to get a new set of downs. You can see that the Falcons (Atl) threw the ball more often when the distance to a first down was either very short (less than 5 yards) or at least 10 yards.

In [None]:
plt.hist([atl_plays.defendersInTheBox, phi_plays.defendersInTheBox],
        label=['Atl', 'Phi'])
plt.title('How many defenders in the box is the passing team facing')
plt.legend(loc='upper right');

Looks like the Falcons (Atl) consistently packed the box, while on a bunch of plays the Eagles (Phi) had very few people in the box. As shown above, Foles was in shotgun on most of his pass plays. So it's interesting to see so many guys packed in the box when the Eagles seem to be signaling a pass by lining up in shotgun (unless they're in shotgun on all their run plays as well).

In [None]:
plt.hist([atl_plays.numberOfPassRushers, phi_plays.numberOfPassRushers],
        label=['Atl', 'Phi'])
plt.title('How many pass rushers is the passing team facing')
plt.legend(loc='upper right');

Both teams look like they normally send 3-4 rushers. But the Eagles seem more willing to send extra pass rushers -- a 7-man rush means there are very few defenders left in the secondary to cover any pass catchers. It would be interesting to analyze outcomes by number of pass rushers at both the play level and the player level.
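As a rough sketch of what that play-level analysis might look like, the snippet below groups plays by numberOfPassRushers and summarizes epa. The numbers are invented toy values, not taken from plays_df, and mean epa is just one possible outcome measure.

```python
import pandas as pd

# Toy values only -- NOT real data from plays_df.
toy_plays = pd.DataFrame({
    "numberOfPassRushers": [4, 4, 5, 7, 4, 5],
    "epa": [0.5, -0.3, 1.2, -2.0, 0.1, 0.4],
})

# For each rush size: how many plays, and the average epa on those plays.
rush_summary = (
    toy_plays.groupby("numberOfPassRushers")["epa"]
    .agg(plays="count", mean_epa="mean")
)
print(rush_summary)
```

With only one game the groups will be tiny, so this is more useful once all games are included.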

In [None]:
#on one play the eagles only had two defensive backs
phi_plays.personnelD.value_counts(), atl_plays.personnelD.value_counts()

On a pass play, you normally have to consider the types of players you have on the field. Defensive Backs (DB) are normally used to cover wide receivers. They are smaller and faster than Linebackers (LB), and even more so compared to Defensive Linemen (DL). If you had 9 guys on the field who are big and strong, you might be able to power your way to the quarterback easily, but you won't have enough speed to catch up if he gets the ball away before you reach him.

In [None]:
#neither Foles nor Ryan is mobile enough to be scrambling around often :)
atl_plays.typeDropback.value_counts(), phi_plays.typeDropback.value_counts()

In [None]:
plt.hist([atl_plays.absoluteYardlineNumber, phi_plays.absoluteYardlineNumber],
        label=['Atl', 'Phi'])
plt.title('How far down the field is each team on their pass plays')
plt.legend(loc='upper right');

Atlanta threw the ball a lot in the Redzone, while it looks like the Eagles threw it at least twice when starting from terrible field position. Being 100 yards from the endzone means you are most likely right at your own goal line, where running the ball might be risky: if you lose yardage you could give up 2 points and possession of the ball (a safety).

This information could be analyzed alongside things like passResult, playResult and epa to determine if there is some correlation between field position and the ultimate outcome of the play -- in other words, are the Falcons deadly in the Redzone (inside the opponent's 20-yard line)?
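One hedged way to start on the Redzone question is sketched below. It assumes a hypothetical column, yards_to_endzone, has already been derived from the field-position columns (how to derive it depends on how absoluteYardlineNumber is defined, which is not checked here), and the values are toy numbers.

```python
import pandas as pd

# Toy values only. `yards_to_endzone` is a HYPOTHETICAL derived column,
# not something that exists in plays.csv as-is.
toy_plays = pd.DataFrame({
    "possessionTeam": ["ATL", "ATL", "ATL", "PHI", "PHI"],
    "yards_to_endzone": [8, 15, 60, 99, 18],
    "epa": [1.5, -0.5, 0.2, -0.1, 0.9],
})

# Label each play as Redzone (20 yards or fewer to go) or not.
toy_plays["zone"] = (toy_plays["yards_to_endzone"] <= 20).map(
    {True: "redzone", False: "other"}
)

# Average epa per team inside vs. outside the Redzone.
redzone_epa = toy_plays.groupby(["possessionTeam", "zone"])["epa"].mean()
print(redzone_epa)
```

The same grouping could be repeated with passResult or playResult in place of epa.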

In [None]:
plt.hist([atl_plays.playResult, phi_plays.playResult],
        label=['Atl', 'Phi'])
plt.title('Net yards gained per pass play')
plt.legend(loc='upper right');

Most plays end up going nowhere -- but you can see that the Falcons had a few bigger pass plays this game. Once again, without knowing the rushing information, you can't conclude the Falcons were beating them just because of this :)

We could have also looked at offensePlayResult, but that does not include any penalties, and we want to see how each pass play moves the ball on the field (or moves it backwards, for the negative-valued plays).

In [None]:
plt.hist([atl_plays.epa, phi_plays.epa],
        label=['Atl', 'Phi'])
plt.title('Expected points added on the play -- avg of every next scoring outcome')
plt.legend(loc='upper right');

While the Falcons seem to have had a few more plays that resulted in better expected results, they also seem to have a higher number in the negative range. As expected, passing the ball is a less safe strategy than running it all game and the gamble does not always work out well!

Now let's examine some player level information!

In [None]:
players_df.head()

In [None]:
players_df['position'].value_counts()

In [None]:
#The big name college programs have the most players, as expected
players_df.collegeName.value_counts()

In [None]:
plt.hist(players_df.weight)
plt.title('weight of players involved in pass plays');

Mostly players you would expect to see throwing/catching the ball on offense, plus all the defensive positions that are trying to stop the passing play (registering when they make a tackle).<br>
You can also see both punters (P) and placekickers (K) in this -- punters are involved in 13 pass plays in 2018 vs kickers at 5. This makes sense: on a field goal there is a holder for the ball, so it does not end up in the kicker's hands, while on a punt the punter has his hands on the ball and can throw it. The kickers might be catching the ball.

In [None]:
#now lets take a look at the game data
games_df.head()

We want to merge the home team and visitor information with the plays df because this will allow us to figure out the score of the game at the time of each pass play<br>

It will also be useful to take the date of the game into account, perhaps to analyze how teams' passing tendencies shift as the season progresses and how that affects game outcomes.
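A minimal sketch of that merge is below. It uses toy frames, and the column names gameDate, homeTeamAbbr and visitorTeamAbbr are assumed to match what games_df.head() shows; playId 146 is made up.

```python
import pandas as pd

# Toy stand-ins for games_df and plays_df (column names are assumptions).
toy_games = pd.DataFrame({
    "gameId": [2018090600],
    "gameDate": ["09/06/2018"],
    "homeTeamAbbr": ["PHI"],
    "visitorTeamAbbr": ["ATL"],
})
toy_plays = pd.DataFrame({
    "gameId": [2018090600, 2018090600],
    "playId": [75, 146],  # 146 is a made-up playId
    "possessionTeam": ["ATL", "ATL"],
})

# Many plays map to one game; validate= raises if gameId repeats in toy_games.
plays_with_games = toy_plays.merge(
    toy_games, on="gameId", how="left", validate="m:1"
)

# Parse the date so plays can later be ordered or grouped by time of season.
plays_with_games["gameDate"] = pd.to_datetime(
    plays_with_games["gameDate"], format="%m/%d/%Y"
)

# Flag whether the team with the ball is the home team.
plays_with_games["possessionIsHome"] = (
    plays_with_games["possessionTeam"] == plays_with_games["homeTeamAbbr"]
)
print(plays_with_games)
```

Using a left merge keeps every play even if a game row were missing, which makes gaps easy to spot as NaNs.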

In [None]:
#We can use the gameId and playId information to map this data onto our plays dataframe
week1_df.head()

In [None]:
sorted(week1_df.s, reverse=True)[0:5], sorted(week1_df.a, reverse=True)[0:5]

In [None]:
#lol the fastest speeds and accelerations recorded are of the ball itself??
week1_df[week1_df.a >35].iloc[0:2]

In [None]:
week1_df_players = week1_df[week1_df.displayName != 'Football']
sorted(week1_df_players.s, reverse=True)[0:5], sorted(week1_df_players.a, reverse=True)[0:5]

Ok, now we should only have measurements of the players. Let's see if we can see who the fastest players in week 1 of 2018 were!

In [None]:
week1_df_players[week1_df_players.s > 11]

In [None]:
week1_df_players[week1_df_players.a > 16]

The fastest clocked speeds and accelerations are mainly from outside linebackers and strong safeties, both on the defensive side of the ball. I had presumed wide receivers and cornerbacks would be the fastest.

While safeties are defensive backs like cornerbacks, they are normally bigger in stature due to their heavy involvement in stopping the run.

Also note that Kirksey and Kindred clocked among the highest in both speed and acceleration.
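Instead of eyeballing thresholds, the peak speed and acceleration per player could be ranked directly. Here is a small sketch on toy tracking rows (player names and values are invented); the column names s, a, displayName and position are the ones used in the cells above.

```python
import pandas as pd

# Toy tracking rows -- names and numbers are made up for illustration.
toy_tracking = pd.DataFrame({
    "displayName": ["Player A", "Player A", "Player B", "Football", "Player C"],
    "position": ["OLB", "OLB", "WR", None, "SS"],
    "s": [9.8, 11.2, 10.1, 25.0, 11.5],
    "a": [5.0, 16.3, 7.2, 40.0, 8.1],
})

# Drop the ball so only players are ranked.
players_only = toy_tracking[toy_tracking["displayName"] != "Football"]

# Peak speed and peak acceleration per player across all tracked frames.
peak = (
    players_only.groupby(["displayName", "position"])[["s", "a"]]
    .max()
    .reset_index()
)

top_speed = peak.nlargest(2, "s")
top_accel = peak.nlargest(1, "a")
print(top_speed)
print(top_accel)
```

Taking one row per player avoids the same player filling up the top of the list with consecutive frames.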

In [None]:
atl_plays.iloc[0].gameId, atl_plays.iloc[0].playId

In [None]:
wk1_atl_phi = week1_df[week1_df['gameId'] == 2018090600]
wk1_atl_phi

Whoa, that's kind of a lot of player information for just the passing plays in one game! haha

In [None]:
one_play = wk1_atl_phi[wk1_atl_phi['playId'] == 75]
one_play

Ok so for a SINGLE play we have 826 data points tracking the players' position, movement, etc.
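To see where a row count like that comes from, it may help to break a play down into frames and tracked entities. The sketch below builds a toy play and assumes the tracking data has a frameId column (not shown in the cells above, so treat the name as an assumption).

```python
import pandas as pd

# Toy play: 3 tracked entities (2 players + the ball) over 4 frames.
# `frameId` is an ASSUMED column name for the per-play frame counter.
rows = [
    {"frameId": frame, "displayName": name}
    for frame in range(1, 5)
    for name in ["Player A", "Player B", "Football"]
]
toy_play = pd.DataFrame(rows)

n_frames = toy_play["frameId"].nunique()       # snapshots in the play
n_tracked = toy_play["displayName"].nunique()  # players plus the ball
rows_per_frame = toy_play.groupby("frameId").size()

# If every entity is tracked in every frame, rows = frames * entities.
print(len(toy_play), n_frames, n_tracked)
```

Applying the same three lines to one_play would show how many frames and tracked entities make up the real play.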

In [None]:
one_play['position'].value_counts()

In [None]:
#lets see the personnel on this same play
atl_plays.iloc[0].personnelO, atl_plays.iloc[0].personnelD

It looks like we have two Wide Receivers on this play, two Cornerbacks, two Free Safeties and a single Strong Safety.

CB, FS, SS are all Defensive Backs (DB)
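To compare the personnel strings against the tracked positions, the strings can be split into counts. This sketch assumes the personnel columns look like "4 DL, 2 LB, 5 DB" (a count, a space, a position code, separated by commas); the helper name parse_personnel is made up here.

```python
def parse_personnel(personnel: str) -> dict:
    """Turn a string like '4 DL, 2 LB, 5 DB' into {'DL': 4, 'LB': 2, 'DB': 5}.

    Hypothetical helper; assumes every comma-separated chunk is
    '<count> <position code>'.
    """
    counts = {}
    for part in personnel.split(","):
        number, position = part.strip().split()
        counts[position] = int(number)
    return counts


# Example string in the assumed format (not read from the data).
example = parse_personnel("4 DL, 2 LB, 5 DB")
print(example)
```

With counts in hand, the DB total could be checked against the number of CB, FS and SS rows tracked on the play.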

There is no player tracking data for the defensive linemen, as explained in the data description of the competition.