# NFL EDA FTW

### Kaggle Description

American football is a complex sport. From the 22 players on the field to specific characteristics that ebb and flow throughout the game, it can be challenging to quantify the value of specific plays and actions within a play. Fundamentally, the goal of football is for the offense to run (rush) or throw (pass) the ball to gain yards, moving towards, then across, the opposing team’s side of the field in order to score. And the goal of the defense is to prevent the offensive team from scoring.

In the National Football League (NFL), roughly a third of teams’ offensive yardage comes from run plays. Ball carriers are generally assigned the most credit for these plays, but their teammates (by way of blocking), coach (by way of play call), and the opposing defense also play a critical role. Traditional metrics such as ‘yards per carry’ or ‘total rushing yards’ can be flawed; in this competition, the NFL aims to provide better context into what contributes to a successful run play.

As an “armchair quarterback” watching the game, you may think you can predict the result of a play when a ball carrier takes the handoff - but what does the data say? In this competition, you will develop a model to predict how many yards a team will gain on given rushing plays as they happen. You'll be provided game, play, and player-level data, including the position and speed of players as provided in the NFL’s Next Gen Stats data. And the best part - you can see how your model performs from your living room, as the leaderboard will be updated week after week on the current season’s game data as it plays out.

Deeper insight into rushing plays will help teams, media, and fans better understand the skill of players and the strategies of coaches. It will also help the NFL and its teams evaluate the ball carrier, his teammates, his coach, and the opposing defense, in order to make adjustments as necessary.

Additionally, the winning model will be provided to the NFL’s Next Gen Stats group to potentially share with teams. You could help the NFL Network generate models to use during games, or for pre-game/post-game breakdowns.

### About This Competition

This dataset contains Next Gen Stats tracking data for running plays. You must use features known at the time when the ball is handed off (**TimeHandoff**) to forecast the yardage gained on that play (**PlayId**).

Because this is a time-series code competition that will be evaluated on future data, you will receive data and make predictions with a time-series API. This API provides plays in the time order in which they occurred in a game. Refer to the competition's starter notebook for an example of how to complete a submission.

### Evaluation

Submissions will be evaluated on the Continuous Ranked Probability Score (CRPS). For each PlayId, you must predict a cumulative probability distribution for the yardage gained or lost. In other words, each column you predict indicates the probability that the team gains <= that many yards on the play. 

The CRPS is computed as follows:

$$ C = \frac{1}{199N} \sum_{m=1}^{N} \sum_{n=-99}^{99} (P(y \le n) -H(n - Y_m))^2 $$

where P is the predicted distribution, N is the number of plays in the test set, Y is the actual yardage and H(x) is the Heaviside step function (H(x)=1 for x≥0 and zero otherwise).

The submission will not score if any of the predicted values has

$$ P(y \le k) > P(y \le k+1) $$

for any k (i.e. the CDF must be non-decreasing).
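
To make the metric concrete, here is a minimal NumPy sketch of the formula above. The names `crps`, `is_valid_cdf`, and `YARD_GRID` are illustrative and not part of the competition API. The sketch assumes predictions arrive as an array with one row per play and 199 columns covering -99 through 99 yards.

```python
import numpy as np

# Yardage thresholds n = -99, ..., 99 (199 columns), as in the formula above.
YARD_GRID = np.arange(-99, 100)

def crps(pred_cdf, yards):
    """Mean squared gap between predicted CDFs and the step function at the true yardage."""
    pred_cdf = np.asarray(pred_cdf, dtype=float)      # shape (N, 199)
    yards = np.asarray(yards).reshape(-1, 1)          # shape (N, 1)
    heaviside = (YARD_GRID >= yards).astype(float)    # H(n - Y_m): 1 where n >= Y_m
    # Averaging over all N * 199 cells equals the 1 / (199 N) double sum.
    return float(np.mean((pred_cdf - heaviside) ** 2))

def is_valid_cdf(pred_cdf):
    """Check that every row satisfies P(y <= k) <= P(y <= k + 1)."""
    return bool(np.all(np.diff(np.asarray(pred_cdf, dtype=float), axis=1) >= 0))

# Toy example: two plays that gained 3 and -2 yards.
actual = np.array([3, -2])
perfect = (YARD_GRID >= actual.reshape(-1, 1)).astype(float)
print(crps(perfect, actual))    # a perfect step-function forecast scores 0.0
print(is_valid_cdf(perfect))
```

Lower is better: a forecast that puts all of its probability mass on the true yardage scores 0, and every yard of misplaced mass adds to the score.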

In [None]:
# Competition Specific
from kaggle.competitions import nflrush

# Data Management
import numpy as np 
import pandas as pd 
pd.set_option('display.max_columns', 100)

# Visualization Libraries
import matplotlib.pyplot as plt
import seaborn as sns
plt.style.use('ggplot')

# Managing Warnings 
import warnings
warnings.filterwarnings('ignore')

# Plot Figures Inline
%matplotlib inline

# Extras
import math, string, os

# View Available Files
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
train_df = pd.read_csv('/kaggle/input/nfl-big-data-bowl-2020/train.csv', low_memory=False)
print(train_df.shape)
train_df.head()

In [None]:
train_df.columns

### Columns

Each row in the file corresponds to a single player's involvement in a single play. The dataset was intentionally joined (i.e. denormalized) to make the API simple. All the columns are contained in one large dataframe which is grouped and provided by PlayId.

+ **GameId** - a unique game identifier
+ **PlayId** - a unique play identifier
+ **Team** - home or away
+ **X** - player position along the long axis of the field. See figure below.
+ **Y** - player position along the short axis of the field. See figure below.
+ **S** - speed in yards/second
+ **A** - acceleration in yards/second^2
+ **Dis** - distance traveled from prior time point, in yards
+ **Orientation** - orientation of player (deg)
+ **Dir** - angle of player motion (deg)
+ **NflId** - a unique identifier of the player
+ **DisplayName** - player's name
+ **JerseyNumber** - jersey number
+ **Season** - year of the season
+ **YardLine** - the yard line of the line of scrimmage
+ **Quarter** - game quarter (1-4)
+ **GameClock** - time on the game clock
+ **PossessionTeam** - team with possession
+ **Down** - the down (1-4)
+ **Distance** - yards needed for a first down
+ **FieldPosition** - which side of the field the play is happening
+ **HomeScoreBeforePlay** - home team score before play started
+ **VisitorScoreBeforePlay** - visitor team score before play started
+ **NflIdRusher** - the NflId of the rushing player
+ **OffenseFormation** - offense formation
+ **OffensePersonnel** - composition of offense
+ **DefendersInTheBox** - number of defenders in the box
+ **DefensePersonnel** - composition of defense
+ **PlayDirection** - direction the play is headed
+ **TimeHandoff** - UTC time of the handoff
+ **TimeSnap** - UTC time of the snap
+ **Yards** - the yardage gained on the play (you are predicting this)
+ **PlayerHeight** - player height (ft-in)
+ **PlayerWeight** - player weight (lbs)
+ **PlayerBirthDate** - birth date (mm/dd/yyyy)
+ **PlayerCollegeName** - where the player attended college
+ **Position** - the player's position (e.g. RB, QB, DE)
+ **HomeTeamAbbr** - home team abbreviation
+ **VisitorTeamAbbr** - visitor team abbreviation
+ **Week** - week into the season
+ **Stadium** - stadium where the game is being played
+ **Location** - city where the game is being played
+ **StadiumType** - description of the stadium environment
+ **Turf** - description of the turf
+ **GameWeather** - description of the game weather
+ **Temperature** - temperature (deg F)
+ **Humidity** - humidity
+ **WindSpeed** - wind speed in miles/hour
+ **WindDirection** - wind direction
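
Because each play contributes one row per player on the field, a quick sanity check is to count rows per PlayId and confirm that play-level columns repeat on every row. The sketch below uses a tiny hand-made frame with invented values rather than the real training file. With the real data you would expect 22 rows per play.

```python
import pandas as pd

# Toy stand-in for train_df: 2 plays x 3 players each (real plays have 22 rows).
toy = pd.DataFrame({
    'PlayId': [1, 1, 1, 2, 2, 2],
    'NflId':  [10, 11, 12, 10, 13, 14],
    'Yards':  [4, 4, 4, -1, -1, -1],
})

# Rows per play: every play should have the same number of player rows.
rows_per_play = toy.groupby('PlayId').size()
print(rows_per_play.unique())

# Play-level columns such as Yards are repeated on every row of the play.
yards_is_constant = toy.groupby('PlayId')['Yards'].nunique().eq(1).all()
print(yards_is_constant)
```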

## Summary Statistics

In [None]:
train_df.info()

In [None]:
train_df.isnull().sum()

### Observations
+ A few columns with missing values
+ WindSpeed, WindDirection, Temperature, GameWeather, Humidity, StadiumType, and FieldPosition are missing the most values.
+ A few columns serve as identifiers and can potentially be ignored.
+ Player height should be converted to a numerical value.
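
Raw null counts are easier to compare as percentages. The helper below is a small sketch of that idea. `missing_summary` is a hypothetical name, and the toy frame uses invented values in place of `train_df`.

```python
import numpy as np
import pandas as pd

def missing_summary(df):
    """Return count and percentage of missing values per column, largest first."""
    counts = df.isnull().sum()
    summary = pd.DataFrame({
        'missing': counts,
        'percent': (100 * counts / len(df)).round(2),
    })
    # Keep only the columns that actually have gaps.
    return summary[summary['missing'] > 0].sort_values('missing', ascending=False)

# Toy frame with invented values, standing in for train_df.
toy = pd.DataFrame({
    'WindSpeed':   [np.nan, '10', np.nan, '5 mph'],
    'Temperature': [63.0, np.nan, 70.0, 55.0],
    'GameId':      [1, 1, 2, 2],
})
print(missing_summary(toy))
```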

In [None]:
def height_to_numerical(height):
    """
    Convert string representing height into total inches
    
    Ex. '5-11' --> 71
    Ex. '6-3'  --> 75
    """  
    feet   = height.split('-')[0]
    inches = height.split('-')[1]
    return int(feet)*12 + int(inches)

In [None]:
train_df['PlayerHeight'] = train_df['PlayerHeight'].apply(height_to_numerical)

In [None]:
train_df.drop(['GameId', 'PlayId', 'NflId', 'JerseyNumber', 'NflIdRusher'], axis=1).describe(include=['O']).T

In [None]:
train_df.drop(['GameId', 'PlayId', 'NflId', 'JerseyNumber', 'NflIdRusher'], axis=1).describe().T

### Observations
+ Wow! Some of the outliers here are interesting. 
+ There's a 153 lb player in the NFL? I have to know who that is.
+ Also, the max number of quarters is 5? That doesn't match the 1-4 range in the column description and might be a mistake (unless they count overtime as a fifth quarter).
+ Speed and acceleration of 0 seem interesting, so we'll examine that in more depth later on.
+ We're only looking at 2017-2018 data.

In [None]:
train_df[ train_df['PlayerWeight'] == 153.00 ]

## Data Cleaning and Handling Missing Values 

We'll go through the columns with missing values in pieces, starting with stadium type.

#### Stadium Type

In [None]:
train_df['StadiumType'].value_counts()

### Observations
+ Quite a few typos.
+ Many of the same stadium types are represented differently.
+ Is Cloudy a stadium type?
+ Heinz Field is home to the Pittsburgh Steelers and is an Outdoor stadium.

Let's group them as best we can. All retractable roof stadiums will be considered indoor stadiums (unless indicated otherwise), and if the state of the roof isn't specified we'll treat it as closed.

In [None]:
def group_stadium_types(stadium):
    outdoor       = [
        'Outdoor', 'Outdoors', 'Cloudy', 'Heinz Field', 
        'Outdor', 'Ourdoor', 'Outside', 'Outddors', 
        'Outdoor Retr Roof-Open', 'Oudoor', 'Bowl'
    ]
    indoor_closed = [
        'Indoors', 'Indoor', 'Indoor, Roof Closed', 'Indoor, Roof Closed', 
        'Retractable Roof', 'Retr. Roof-Closed', 'Retr. Roof - Closed', 'Retr. Roof Closed',
    ]
    indoor_open   = ['Indoor, Open Roof', 'Open', 'Retr. Roof-Open', 'Retr. Roof - Open']
    dome_closed   = ['Dome', 'Domed, closed', 'Closed Dome', 'Domed', 'Dome, closed']
    dome_open     = ['Domed, Open', 'Domed, open']
    
    if stadium in outdoor:
        return 'outdoor'
    elif stadium in indoor_closed:
        return 'indoor closed'
    elif stadium in indoor_open:
        return 'indoor open'
    elif stadium in dome_closed:
        return 'dome closed'
    elif stadium in dome_open:
        return 'dome open'
    else:
        return 'unknown'

In [None]:
train_df['StadiumType'] = train_df['StadiumType'].apply(group_stadium_types)

#### GameWeather

Next up is GameWeather.

In [None]:
weather = pd.DataFrame(train_df['GameWeather'].value_counts())
pd.options.display.max_rows=100
weather

In [None]:
def group_game_weather(weather):
    rain = [
        'Rainy', 'Rain Chance 40%', 'Showers',
        'Cloudy with periods of rain, thunder possible. Winds shifting to WNW, 10-20 mph.',
        'Scattered Showers', 'Cloudy, Rain', 'Rain shower', 'Light Rain', 'Rain'
    ]
    overcast = [
        'Cloudy, light snow accumulating 1-3"', 'Party Cloudy', 'Cloudy, chance of rain',
        'Coudy', 'Cloudy, 50% change of rain', 'Rain likely, temps in low 40s.',
        'Cloudy and cold', 'Cloudy, fog started developing in 2nd quarter',
        'Partly Clouidy', '30% Chance of Rain', 'Mostly Coudy', 'Cloudy and Cool',
        'cloudy', 'Partly cloudy', 'Overcast', 'Hazy', 'Mostly cloudy', 'Mostly Cloudy',
        'Partly Cloudy', 'Cloudy'
    ]
    clear = [
        'Partly clear', 'Sunny and clear', 'Sun & clouds', 'Clear and Sunny',
        'Sunny and cold', 'Sunny Skies', 'Clear and Cool', 'Clear and sunny',
        'Sunny, highs to upper 80s', 'Mostly Sunny Skies', 'Cold',
        'Clear and warm', 'Sunny and warm', 'Clear and cold', 'Mostly sunny',
        'T: 51; H: 55; W: NW 10 mph', 'Clear Skies', 'Clear skies', 'Partly sunny',
        'Fair', 'Partly Sunny', 'Mostly Sunny', 'Clear', 'Sunny'
    ]
    snow  = ['Heavy lake effect snow', 'Snow']
    none  = ['N/A Indoor', 'Indoors', 'Indoor', 'N/A (Indoors)', 'Controlled Climate']
    
    if weather in rain:
        return 'rain'
    elif weather in overcast:
        return 'overcast'
    elif weather in clear:
        return 'clear'
    elif weather in snow:
        return 'snow'
    elif weather in none:
        return 'none'
    
    return 'none'

In [None]:
train_df['GameWeather'] = train_df['GameWeather'].apply(group_game_weather)

#### WindSpeed and WindDirection

In [None]:
train_df['WindSpeed'].value_counts()

### Observations
+ We have some weird values here.
+ We'll take the lower end of any range of values and set any non-numeric values to 0.

In [None]:
def clean_wind_speed(windspeed):
    """
    This is not a very robust function, 
    but it should do the job for this dataset.
    """
    ws = str(windspeed)
    # if it's already a number just return an int value
    if ws.isdigit():
        return int(ws)
    # if it's a range, just take the first value
    if '-' in ws:
        return int(ws.split('-')[0])
    # if there's a space between the number and mph
    if ws.split(' ')[0].isdigit():
        return int(ws.split(' ')[0])
    # if it looks like '10MPH' or '12mph' just take the first part
    if 'mph' in ws.lower():
        return int(ws.lower().split('mph')[0])
    else:
        return 0

In [None]:
train_df['WindSpeed'] = train_df['WindSpeed'].apply(clean_wind_speed)
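
The same rule (take the first number, fall back to 0) can also be written with a regular expression. This is a sketch of an alternative, not a replacement for the function above. `clean_wind_speed_regex` is a hypothetical name, and the example values are illustrative strings of the kinds described in the observations. Note that it truncates decimals, so `'7.5'` would become 7.

```python
import re

def clean_wind_speed_regex(windspeed):
    """Return the first integer found in the value, or 0 if there is none."""
    match = re.search(r'\d+', str(windspeed))
    return int(match.group()) if match else 0

examples = ['8', '11-17', '6 mph', '12mph', '10MPH', 'Calm', 'SSW', float('nan'), 10.0]
cleaned = [clean_wind_speed_regex(v) for v in examples]
print(cleaned)
```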

In [None]:
train_df['WindDirection'].value_counts()

In [None]:
# This function has been updated to reflect what Subin An (https://www.kaggle.com/subinium) mentioned in the comments.
# WindDirection is indicated by the direction that wind is flowing FROM - https://en.wikipedia.org/wiki/Wind_direction

def clean_wind_direction(wind_direction):
    wd = str(wind_direction).upper()

    # Check the intercardinal 'FROM ...' values first: 'FROM S' is a substring
    # of 'FROM SW', so testing the cardinal directions first would shadow them.
    if 'FROM SW' in wd or 'FROM SSW' in wd or 'FROM WSW' in wd:
        return 'south west'
    if 'FROM SE' in wd or 'FROM SSE' in wd or 'FROM ESE' in wd:
        return 'south east'
    if 'FROM NW' in wd or 'FROM NNW' in wd or 'FROM WNW' in wd:
        return 'north west'
    if 'FROM NE' in wd or 'FROM NNE' in wd or 'FROM ENE' in wd:
        return 'north east'

    # Plain cardinal directions
    if wd == 'N' or 'FROM N' in wd:
        return 'north'
    if wd == 'S' or 'FROM S' in wd:
        return 'south'
    if wd == 'W' or 'FROM W' in wd:
        return 'west'
    if wd == 'E' or 'FROM E' in wd:
        return 'east'

    # Remaining abbreviations and spelled-out intercardinal directions
    if 'NW' in wd or 'NORTHWEST' in wd:
        return 'north west'
    if 'NE' in wd or 'NORTH EAST' in wd or 'NORTHEAST' in wd:
        return 'north east'
    if 'SW' in wd or 'SOUTHWEST' in wd:
        return 'south west'
    if 'SE' in wd or 'SOUTHEAST' in wd:
        return 'south east'

    # Anything else (including NaN) is treated as no direction
    return 'none'

In [None]:
train_df['WindDirection'] = train_df['WindDirection'].apply(clean_wind_direction)

In [None]:
train_df['WindDirection'].value_counts()

#### Temperature and Humidity

For temperature and humidity, we'll use the mean.

In [None]:
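# Assign the result back instead of calling fillna(inplace=True) on a column selection,
# which newer pandas versions warn about and may not propagate to the dataframe
train_df['Humidity'] = train_df['Humidity'].fillna(train_df['Humidity'].mean())
train_df['Temperature'] = train_df['Temperature'].fillna(train_df['Temperature'].mean())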

In [None]:
train_df['FieldPosition'].value_counts()

#### FieldPosition

For FieldPosition, my guess is that all the null values represent the fact that the ball is on the 50 yard line.

In [None]:
train_df['FieldPosition'].isnull().sum()

In [None]:
train_df[ train_df['YardLine'] == 50 ].shape[0]

Since that seems to be true, we'll just use whatever team had possession to fill in that value.

In [None]:
train_df['FieldPosition'] = np.where(train_df['YardLine'] == 50, train_df['PossessionTeam'], train_df['FieldPosition'])
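
Matching counts are suggestive, but they don't prove that the null rows are the same rows as the 50 yard line rows. A stricter check, sketched here on invented toy data, compares the two boolean masks row by row. On the real data it would need to run before the fill above.

```python
import numpy as np
import pandas as pd

# Toy stand-in for train_df with invented values.
toy = pd.DataFrame({
    'YardLine':       [50, 35, 50, 12],
    'FieldPosition':  [np.nan, 'NE', np.nan, 'KC'],
    'PossessionTeam': ['NE', 'NE', 'KC', 'KC'],
})

null_mask = toy['FieldPosition'].isnull()
midfield_mask = toy['YardLine'] == 50

# True only if nulls and 50 yard line rows coincide exactly.
masks_agree = bool((null_mask == midfield_mask).all())
print(masks_agree)

# Fill the gaps with the possessing team, leaving other rows untouched.
filled = toy['FieldPosition'].where(~null_mask, toy['PossessionTeam'])
print(filled.tolist())
```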

#### Orientation, Dir, DefendersInTheBox, OffenseFormation

For the last null values, we'll handle them in a few ways.
+ For Orientation, Dir, and DefendersInTheBox, we'll use the mean value.
+ For OffenseFormation, we'll use 'UNKNOWN'

In [None]:
na_map = {
    'Orientation': train_df['Orientation'].mean(),
    'Dir': train_df['Dir'].mean(),
    'DefendersInTheBox': math.ceil(train_df['DefendersInTheBox'].mean()),
    'OffenseFormation': 'UNKNOWN'
}

train_df.fillna(na_map, inplace=True)

In [None]:
train_df['DefendersInTheBox'].value_counts()

In [None]:
train_df.isnull().sum()

## Univariate Analysis

In [None]:
columns_to_plot = [
     'X', 'Y', 'S', 'A',
    'Dis', 
    'Orientation', 
    'Dir',
    'YardLine', 
    'HomeScoreBeforePlay',
    'VisitorScoreBeforePlay',
    'OffenseFormation',
    'DefendersInTheBox',
    'Yards',
    'PlayerHeight',
    'PlayerWeight',
    'PlayerBirthDate',     
    'PlayerCollegeName',
    'Position',
    'Week',
    'Stadium',
    'Location',
    'StadiumType',
    'Turf',
    'GameWeather',
    'Temperature',
    'Humidity',
    'WindSpeed',
    'WindDirection',
]

# Plot the distribution of each feature
def plot_distribution(dataset, cols=5, width=20, height=25, hspace=0.4, wspace=0.5):
    """
    Plot distributions for each column in a dataset.
    Seaborn countplots are used for categorical data and histograms with a KDE overlay for numerical data

    args:
    ----
    dataset {dataframe} - the data that will be plotted
    cols {int} - how many distributions to plot for each row
    width {int} - width of the whole figure
    height {int} - height of the whole figure
    hspace {float} - vertical space between rows of plots
    wspace {float} - horizontal space between columns of plots
    """
    # plot styling
    plt.style.use('fivethirtyeight')
    fig = plt.figure(figsize=(width, height))
    fig.subplots_adjust(left=None, bottom=None, right=None, top=None, wspace=wspace, hspace=hspace)
    # calculate rows needed
    rows = math.ceil(float(dataset.shape[1]) / cols)
    # create a countplot of the top 10 values for categorical columns
    # and a histogram for numerical columns
    for i, column in enumerate(dataset.columns):
        ax = fig.add_subplot(rows, cols, i + 1)
        ax.set_title(column)
        # np.object was removed in NumPy 1.24, so test for non-numeric columns instead
        if not pd.api.types.is_numeric_dtype(dataset[column]):
            # grab the top 10 for each countplot
            g = sns.countplot(y=column, 
                              data=dataset,
                              order=dataset[column].value_counts().index[:10],
                              ax=ax)
            # make labels only 20 characters long and rotate x labels for nicer displays
            substrings = [s.get_text()[:20] for s in g.get_yticklabels()]
            g.set(yticklabels=substrings)
            plt.xticks(rotation=25)
        else:
            # sns.distplot is deprecated; histplot with kde=True draws a similar figure
            g = sns.histplot(dataset[column], kde=True, stat='density', ax=ax)
            plt.xticks(rotation=25)
    
plot_distribution(train_df[columns_to_plot], cols=3, width=30, height=50, hspace=0.45, wspace=0.5)

### Observations
+ Most plays start at the 25 yard line (where a touchback on a kickoff places the ball)
+ Singleback and Shotgun formations dominate plays
+ Pretty much either 6, 7, or 8 players in the box every play.
+ Nothing super interesting about X, Y, S, A, Dis, Orientation, Dir, or Yards
+ PlayerWeight has an interesting dip in the 270-290 lb range. We see very few players at this weight, potentially because they're too small to be linemen and too massive to play linebacker or defensive end. The skill players seem to have a bimodal distribution, potentially representing skill players who are big bruisers and skill players who are finesse and speed oriented.
+ The top birth dates are between 1988 and 1993, so those players are roughly in their mid-20s to early 30s.
+ Players most often come from an SEC school or big names like Ohio State or Notre Dame.
+ Games are most often played outdoors, on grass, with overcast or clear weather.
+ The temperature is most often around 60 degrees. Humidity has a spike at zero and a larger mass between 40 and 80.
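
Birth dates are hard to read as a distribution, so one option is to convert them to an age at the time of the play. The sketch below assumes PlayerBirthDate uses the mm/dd/yyyy format listed in the column descriptions and that TimeHandoff is an ISO 8601 UTC timestamp. `PlayerAge` is a hypothetical column name, and the rows are invented.

```python
import pandas as pd

# Toy rows with invented values; formats are assumptions about the real columns.
toy = pd.DataFrame({
    'PlayerBirthDate': ['09/08/1990', '01/15/1988'],
    'TimeHandoff':     ['2017-09-08T00:44:06.000Z', '2018-12-30T21:10:00.000Z'],
})

# Parse both columns as UTC so the subtraction does not mix naive and aware timestamps.
birth = pd.to_datetime(toy['PlayerBirthDate'], format='%m/%d/%Y', utc=True)
handoff = pd.to_datetime(toy['TimeHandoff'], utc=True)

# Age in (approximate) years at the moment of the handoff.
toy['PlayerAge'] = (handoff - birth).dt.days / 365.25
print(toy['PlayerAge'].round(1).tolist())
```

A numeric age would then be plotted as a histogram by `plot_distribution` instead of a countplot of the ten most common birth dates.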

## Multivariate Analysis

We'll start by looking at comparisons between Yards gained (the target variable) and other features.

First, let's look at a subset of data where the player is the one actually running the football.

In [None]:
# reference: https://www.kaggle.com/sudalairajkumar/simple-exploration-notebook-nfl

rushing_df = train_df[ train_df['NflId'] == train_df['NflIdRusher']]
print(rushing_df.shape)
rushing_df.head()

In [None]:
rushing_df['Position'].value_counts()

### Observations
+ This is interesting. DE is a defensive end, DT is a defensive tackle, and CB is a cornerback. Those are all defensive positions. For them to rush the ball would mean they are either playing both offense and defense (with defense as their primary listed position), or they had an interception and this dataset is counting those plays as rushing plays.
+ If anyone has insight here please comment below!

In [None]:
rushing_df[rushing_df['Position'] == 'DE']

From this single sample, it looks like the Los Angeles Chargers let Melvin Ingram - a primarily defensive player - rush the football from the 1 yard line to try and punch in a touchdown. This doesn't happen very often, so there seems to be only one example of it during the season. That makes sense and is an interesting observation.

In [None]:
rushing_df[rushing_df['Position'] == 'DT']

The same seems to be true here. At the 1 yard line and trying to punch in a touchdown with a big man.

In [None]:
rushing_df[rushing_df['Position'] == 'CB']

For cornerbacks, it seems to be a different scenario. These 3 players can be used on both offense and defense, but are primarily defensive skill players. We see a noticeably larger proportion of big plays here, potentially because these are trick plays or unusual formations that the defense doesn't see very often.

In [None]:
plt.figure(figsize=(12, 6))
sns.scatterplot(x='S', y='Yards', data=rushing_df, color='b')
plt.xlabel('Speed of Rusher')
plt.ylabel('Yards Gained')
plt.title('Running Speed vs Yards Gained', fontsize=24)
plt.show()

In [None]:
plt.figure(figsize=(12, 6))
sns.scatterplot(x='A', y='Yards', data=rushing_df, color='r')
plt.xlabel('Acceleration of Rusher')
plt.ylabel('Yards Gained')
plt.title('Rusher Acceleration vs Yards Gained', fontsize=24)
plt.show()

In [None]:
plt.figure(figsize=(12, 6))
sns.scatterplot(x='Dis', y='Yards', data=rushing_df, color='g')
plt.xlabel('Distance Traveled')
plt.ylabel('Yards Gained')
plt.title('Distance Traveled vs Yards Gained', fontsize=24)
plt.show()

In [None]:
plt.figure(figsize=(20, 6))
sns.boxplot(x='Distance', y='Yards', data=rushing_df, color='dodgerblue')
plt.xlabel('Yards Needed For First Down')
plt.ylabel('Yards Gained')
plt.title('Yards Needed for a First Down vs Yards Gained', fontsize=24)
plt.show()

In [None]:
plt.figure(figsize=(20, 10))
sns.boxplot(x='DefendersInTheBox', y='Yards', data=rushing_df[rushing_df['DefendersInTheBox'] > 3], color='dodgerblue')
plt.xlabel('Defenders in the Box')
plt.ylabel('Yards Gained')
plt.title('Defenders in the Box vs Yards Gained', fontsize=24)
plt.show()

### Observations
+ Yards gained increase slightly as the rusher's speed and acceleration increase, but not by much.
+ As the rusher travels a larger distance, they tend to gain more yards, up to a point, after which the total yards gained declines. This may be due to sweep plays where the rusher is traveling sideways as fast as possible to get around the defense before heading upfield.
+ We see a larger number of yards gained when 10 yards are needed for a first down. That may be the result of a few different scenarios. Whenever a team gets a first down, they need to travel 10 yards from their current spot to gain another first down by default.
   + Perhaps on first down, when a team's offense is rolling, they hand the ball off and gain a significant chunk of yards multiple times. 
   + Or perhaps when teams realize that the passing game isn't working, they hand the ball off on 3rd and 10 and break a big run.
   + Or since more plays happen when 10 yards are needed for a first down, we're just seeing a much larger sample size of rushes that covers a wider span of outcomes.
+ Few big rushing plays are seen when more than 20 yards are needed. This is probably due to the defense playing a soft zone and ensuring that a first down is not achieved, even if the rusher picks up between 10-15 yards on the play. As long as they don't get a first down, the defense is satisfied.
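
One way to probe the sample-size explanation above is to put the number of plays next to the summary statistics for each Distance value. This is a sketch on invented toy data standing in for `rushing_df`. If the median stays flat while the count and the longest run grow, the wider spread at 10 yards is more likely a sample-size effect than a real shift.

```python
import pandas as pd

# Toy stand-in for rushing_df (one row per play); values are invented.
toy = pd.DataFrame({
    'Distance': [10, 10, 10, 10, 10, 10, 3, 3, 20],
    'Yards':    [2, 4, 1, 35, -2, 6, 2, 5, 8],
})

# Count, typical gain, and longest run for each yards-to-go value.
by_distance = (
    toy.groupby('Distance')['Yards']
       .agg(plays='count', median='median', mean='mean', longest='max')
       .sort_values('plays', ascending=False)
)
print(by_distance)
```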


In [None]:
plt.style.use('ggplot')

kws = dict(linewidth=.9)

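# 'size' was renamed to 'height' in seaborn 0.9
g = sns.FacetGrid(train_df, col='OffenseFormation', col_wrap=3, height=8, aspect=.7, sharex=False)
# call each step separately: chaining .fig.subplots_adjust() would assign its None return value to g
g.map(sns.boxplot, 'DefendersInTheBox', 'Yards', **kws)
g.set_titles("{col_name}")
g.fig.subplots_adjust(wspace=.1, hspace=.2)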
# for ax in g.axes.flat:
#   ax.set_title(ax.get_title().split(' = ')[1])
#   for label in ax.get_xticklabels():
#     label.set_rotation(90)

## To Be Continued...

## Modeling

...more to come here