# NFL EDA 2021 Edition - Lean, Clean, and Astroturf Green

An analysis of NFL statistics for the NFL Big Data Bowl 2021

## Description

When a quarterback takes a snap and drops back to pass, what happens next may seem like chaos. As offensive players move in various patterns, the defense works together to prevent successful pass completions and then to quickly tackle receivers that do catch the ball. In this year’s Kaggle competition, your goal is to use data science to better understand the schemes and players that make for a successful defense against passing plays.

In American football, there is a plethora of defensive strategies and outcomes. The National Football League (NFL) has used previous Kaggle competitions to focus on offensive plays, but as the old proverb goes, “defense wins championships.” Though metrics for analyzing quarterbacks, running backs, and wide receivers are consistently a part of public discourse, techniques for analyzing the defensive part of the game lag behind. Identifying player, team, or strategic advantages on the defensive side of the ball would be a significant breakthrough for the game.

This competition uses NFL’s Next Gen Stats data, which includes the position and speed of every player on the field during each play. You’ll employ player tracking data for all drop-back pass plays from the 2018 regular season. The goal of submissions is to identify unique and impactful approaches to measure defensive performance on these plays. There are several different directions for participants to ‘tackle’ (ha)—which may require levels of football savvy, data aptitude, and creativity. As examples:

+ What are coverage schemes (man, zone, etc) that the defense employs? What coverage options tend to be better performing?
+ Which players are the best at closely tracking receivers as they try to get open?
+ Which players are the best at closing on receivers when the ball is in the air?
+ Which players are the best at defending pass plays when the ball arrives?
+ Is there any way to use player tracking data to predict whether or not certain penalties – for example, defensive pass interference – will be called?
+ Who are the NFL’s best players against the pass?
+ How does a defense react to certain types of offensive plays?
+ Is there anything about a player – for example, their height, weight, experience, speed, or position – that can be used to predict their performance on defense?
+ What does data tell us about defending the pass play? You are about to find out.

Note: Are you a university participant? Students have the option to participate in a college-only Competition, where you’ll work on the identical themes above. Students can opt-in for either the Open or College Competitions, but not both.

## Goals

+ Analyze player data
+ Analyze game data
+ Analyze aggregate team data

## Imports

In [None]:
from typing import Any, List, Callable, Union

# Data Management
import numpy as np 
import pandas as pd 
import scipy.stats  # binds scipy and loads the stats submodule used for the KDE curves

pd.set_option('display.max_columns', 100)
pd.set_option('display.max_rows', 50)

# Visualization Libraries
import matplotlib.pyplot as plt
import seaborn as sns
import plotly
import plotly.subplots  # make_subplots is called below as plotly.subplots.make_subplots
from plotly.offline import init_notebook_mode, iplot, plot
init_notebook_mode(connected=True)
import plotly.graph_objs as go
import plotly.figure_factory as ff
import plotly.express as px
from IPython.display import HTML, Image


# Managing Warnings 
import warnings
warnings.filterwarnings('ignore')

# Plot Figures Inline
%matplotlib inline

# Extras
import math, string, os, datetime
import dateutil.relativedelta  # submodule must be imported explicitly for calculate_age

for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

## Players Analysis

### Columns Overview

+ nflId: Player identification number, unique across players (numeric)
+ height: Player height (text)
+ weight: Player weight (numeric)
+ birthDate: Date of birth (YYYY-MM-DD)
+ collegeName: Player college (text)
+ position: Player position (text)
+ displayName: Player name (text)

### Initial Questions:
+ What are the distributions for player height, weight, age, and position?
+ Which colleges have the most active NFL players?
+ What are the average heights, weights, and ages by position?

In [None]:
df_players = pd.read_csv('/kaggle/input/nfl-big-data-bowl-2021/players.csv')
print(df_players.shape)
df_players.tail()

### Clean Players Data

+ Clean player height
+ Create age column

In [None]:
def height_to_numerical(height):
    """
    Convert string representing height into total inches
    
    Ex. '5-11' --> 71
    Ex. '6-3'  --> 75
    """  
    feet   = height.split('-')[0]
    inches = height.split('-')[1]
    return int(feet)*12 + int(inches)


def clean_height(val):
    try:
        # height is already in inches
        height = int(val)
    except (TypeError, ValueError):
        # convert it from string
        height = height_to_numerical(val)

    return height
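
The same conversion can also be done without a row-wise `apply`. The sketch below is one possible vectorized variant, not the approach used in this notebook; `heights_to_inches` is a hypothetical helper name and the sample values are made up for illustration.

```python
import pandas as pd


def heights_to_inches(heights: pd.Series) -> pd.Series:
    """Convert a mix of 'feet-inches' strings and plain inch values to inches."""
    parts = heights.astype(str).str.split("-", expand=True)
    first = pd.to_numeric(parts[0])
    if parts.shape[1] == 1:
        # No 'feet-inches' strings present: every value is already in inches.
        return first.astype(int)
    second = pd.to_numeric(parts[1])
    # Rows with a second part are 'feet-inches'; rows without are already inches.
    inches = (first * 12 + second).where(second.notna(), first)
    return inches.astype(int)


sample = pd.Series(["5-11", "6-3", "72", 74])  # made-up sample values
print(heights_to_inches(sample).tolist())  # [71, 75, 72, 74]
```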

In [None]:
def calculate_age(birthDate):
    today = datetime.date.today()
    age = dateutil.relativedelta.relativedelta(today, birthDate)
    return age.years + (age.months / 12)
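
Because the tracking data covers the 2018 regular season, an age measured against today's date grows every time the notebook is re-run. A minimal sketch of an alternative, assuming a fixed reference date is preferable: `age_on` and `SEASON_START` are hypothetical names, and the choice of 2018-09-06 as the reference date is an assumption rather than something taken from the data.

```python
import pandas as pd

# Assumption: use the start of the 2018 season as the reference date.
SEASON_START = pd.Timestamp("2018-09-06")


def age_on(birth_dates: pd.Series, reference: pd.Timestamp = SEASON_START) -> pd.Series:
    """Return age in fractional years on the reference date."""
    birth_dates = pd.to_datetime(birth_dates)
    return (reference - birth_dates).dt.days / 365.25


ages = age_on(pd.Series(["1990-09-06", "1985-03-15"]))  # made-up birth dates
print(ages.round(1).tolist())
```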

In [None]:
def clean_players_data(df):
    # Operate on the frame that was passed in rather than on the global df_players
    df["height"] = df["height"].apply(clean_height)
    df["birthDate"] = pd.to_datetime(df["birthDate"])
    df["age"] = df["birthDate"].apply(calculate_age)
    return df.set_index("nflId")
    
df_players_cleaned = clean_players_data(df_players)

In [None]:
df_players_cleaned.info()

In [None]:
df_players_cleaned.describe(include=["O"])

In [None]:
df_players_cleaned.describe()

### Observations

+ More players went to Alabama than anywhere else.
+ Surprisingly, there is still a player under 160 lbs.
+ The average player is 6'1".
+ The oldest player is ~43 and the youngest is ~22. Most are between ~26 and ~30.

In [None]:
def get_bar_trace(
    *,
    df: pd.DataFrame, 
    column: str, 
    num_entries: int = 20,
    colorscale: Union[str, List[str]] = "Portland",
    orientation: str = "h",
) -> go.Bar:
    # Top `num_entries` categories, reversed so the largest bar is drawn last
    data = df[column].value_counts()[:num_entries][::-1]
    x = data.values if orientation == "h" else data.index
    y = data.index if orientation == "h" else data.values

    return go.Bar(
        x=x,
        y=y,
        name=column,
        marker=dict(
            color=data.values,
            line=dict(color="black", width=1.5),
            colorscale=colorscale,  # honor the argument instead of a hard-coded scale
        ),
        text=data.values,
        textposition="auto",
        orientation=orientation,  # honor the argument instead of always using "h"
        showlegend=False,
    )


def get_hist_trace(
    *, 
    df: pd.DataFrame, 
    column: str,
    color: str = "dodgerblue",
) -> go.Histogram:
    return go.Histogram(
        x=df[column],
        opacity=0.75,
        name=column,
        marker=dict(
            color=color,
            line=dict(color="black", width=1.5),   
        ),
        text=df[column].values, 
#         histnorm="probability"
    )


def get_scatter_trace(
    *, 
    df: pd.DataFrame, 
    column: str,
    color: str = "dodgerblue",
) -> go.Scatter:
    data = df[column].dropna()  # gaussian_kde cannot handle missing values
    kde = scipy.stats.gaussian_kde(data.values)
    x = np.linspace(data.min(), data.max(), len(data))
    y = kde(x) * len(data)  # scale the density by the number of observations
    return go.Scatter(
        x=x, 
        y=y,
        marker=dict(
            size=6,
            color=color,
        ),
        showlegend=False
    )
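
The `# denormalize` step above multiplies the density by the number of observations. For the curve to sit on the same scale as a histogram of raw counts, the bin width matters as well. The sketch below illustrates the relationship on synthetic data with fixed-width NumPy bins; it is not tied to how Plotly chooses its bins, which would have to be fixed explicitly for the two to match.

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
values = rng.normal(loc=74, scale=2.5, size=500)  # synthetic stand-in for a numeric column

# Histogram of raw counts with fixed-width bins.
counts, edges = np.histogram(values, bins=20)
bin_width = edges[1] - edges[0]

# A density integrates to 1, so the expected count in a bin is
# roughly n * bin_width * density evaluated at the bin centre.
kde = gaussian_kde(values)
centres = (edges[:-1] + edges[1:]) / 2
expected_counts = len(values) * bin_width * kde(centres)

# The two totals should be close, since both approximate n.
print(counts.sum(), round(float(expected_counts.sum()), 1))
```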


In [None]:
def plotly_distributions(
    df: pd.DataFrame, 
    height: int = 1000, 
    width: int = 1500,    
    cols: int = 3,
    horizontal_spacing: float = 0.2,
    vertical_spacing: float = 0.3,
    colorscale: List[str] = px.colors.diverging.Portland,
) -> None: 
    rows = math.ceil(float(df.shape[1]) / cols)
    fig = plotly.subplots.make_subplots(
        rows=rows, 
        cols=cols,
        horizontal_spacing=horizontal_spacing,
        vertical_spacing=vertical_spacing,
        subplot_titles=df.columns,
    )
    
    for i, column in enumerate(df.columns):
        row = math.ceil((i + 1) / cols)
        col = (i % cols) + 1
        if df.dtypes[column] == object:
            fig.add_trace(
                get_bar_trace(
                    df=df, 
                    column=column, 
                    colorscale=colorscale
                ), 
                row=row, 
                col=col
            )
            fig.update_xaxes(title_text="Count", row=row, col=col)
        else:
#             distplfig = ff.create_distplot(
#                 [df[column]], 
#                 group_labels=[column], 
#                 colors=colorscale,
#                 bin_size=.2, 
#                 show_rug=False,
#             )

#             for k in range(len(distplfig.data)):
#                 fig.add_trace(
#                     distplfig.data[k],
#                     row=row, 
#                     col=col,
#                 )            
            fig.add_trace(
                get_hist_trace(
                    df=df, 
                    column=column, 
                    color=colorscale[i % len(colorscale)]
                ), 
                row=row, 
                col=col
            )
            fig.add_trace(
                get_scatter_trace(
                    df=df, 
                    column=column, 
                    color=colorscale[(i + 1) % len(colorscale)]                    
                ), 
                row=row, 
                col=col,
            )
            fig.update_xaxes(title_text=column, row=row, col=col)
            fig.update_yaxes(title_text="Count", row=row, col=col)
            
    fig.update_layout(
        height=height, 
        width=width
    )

    iplot(fig)

In [None]:
columns_to_plot = [col for col in df_players.columns if col not in ["nflId", "birthDate", "displayName"]]

plotly_distributions(df_players[columns_to_plot], horizontal_spacing=0.10, vertical_spacing=0.15)

### By Position Analysis

We'll start with offense and then move to defense.

#### Wide Receivers

In [None]:
df_wr = df_players[df_players["position"] == "WR"]
columns_to_plot = [col for col in df_players.columns if col not in ["nflId", "birthDate", "displayName", "position"]]
plotly_distributions(df_wr[columns_to_plot], cols=2, horizontal_spacing=0.10, vertical_spacing=0.15)

#### Running Backs and Fullbacks

In [None]:
# df_wr = df_players[(df_players["position"] == "RB") | (df_players["position"] == "FB")]
# columns_to_plot = [col for col in df_players.columns if col not in ["nflId", "birthDate", "displayName", "position"]]
# plotly_distributions(df_wr[columns_to_plot], cols=2, horizontal_spacing=0.10, vertical_spacing=0.15)

#### Quarterbacks

In [None]:
# df_wr = df_players[df_players["position"] == "QB"]
# columns_to_plot = [col for col in df_players.columns if col not in ["nflId", "birthDate", "displayName", "position"]]
# plotly_distributions(df_wr[columns_to_plot], cols=2, horizontal_spacing=0.10, vertical_spacing=0.15)

#### Tight Ends

In [None]:
# df_wr = df_players[df_players["position"] == "TE"]
# columns_to_plot = [col for col in df_players.columns if col not in ["nflId", "birthDate", "displayName", "position"]]
# plotly_distributions(df_wr[columns_to_plot], cols=2, horizontal_spacing=0.10, vertical_spacing=0.15)

#### Defensive Linemen

In [None]:
# df_wr = df_players[df_players["position"].isin(["DT", "DE", "NT"])]
# columns_to_plot = [col for col in df_players.columns if col not in ["nflId", "birthDate", "displayName", "position"]]
# plotly_distributions(df_wr[columns_to_plot], cols=2, horizontal_spacing=0.10, vertical_spacing=0.15)

#### Linebackers

In [None]:
# df_wr = df_players[df_players["position"].isin(["LB", "ILB", "OLB", "MLB"])]
# columns_to_plot = [col for col in df_players.columns if col not in ["nflId", "birthDate", "displayName", "position"]]
# plotly_distributions(df_wr[columns_to_plot], cols=2, horizontal_spacing=0.10, vertical_spacing=0.15)

#### Defensive Backs

In [None]:
# df_wr = df_players[df_players["position"].isin(["SS", "FS", "CB", "DB"])]
# columns_to_plot = [col for col in df_players.columns if col not in ["nflId", "birthDate", "displayName", "position"]]
# plotly_distributions(df_wr[columns_to_plot], cols=2, horizontal_spacing=0.10, vertical_spacing=0.15)

## Games Analysis

### Columns Overview

+ gameId: Game identifier, unique (numeric)
+ gameDate: Game Date (time, mm/dd/yyyy)
+ gameTimeEastern: Start time of game (time, HH:MM:SS, EST)
+ homeTeamAbbr: Home team three-letter code (text)
+ visitorTeamAbbr: Visiting team three-letter code (text)
+ week: Week of game (numeric)

### Initial Questions:
+ What are the distributions for bye weeks in the NFL?
+ What time do most games start?

In [None]:
df_games = pd.read_csv('/kaggle/input/nfl-big-data-bowl-2021/games.csv')
print(df_games.shape)
df_games.head()

### Clean Games

In [None]:
def clean_games_df(df):
    df["gameDate"] = pd.to_datetime(df["gameDate"])
    return df.set_index("gameId")


In [None]:
df_games_cleaned = clean_games_df(df_games)

In [None]:
df_games_cleaned.info()

In [None]:
def plot_bar_chart(
    *,
    x: List[Any],
    y: List[Any],
    name: str,
    title: str,
    xaxis_title: str,
    yaxis_title: str,    
    colorscale: List[str] = px.colors.diverging.Portland,
) -> None:
    trace = go.Bar(
        x=x,
        y=y,
        name=name,
        marker=dict(
            color=y,
            line=dict(color="black", width=1.5),
            colorscale=px.colors.diverging.Portland,
        ),
        text=y,
        textposition="auto",
        orientation="v",
    )
    layout = go.Layout(
        title=title, 
        xaxis=dict(title=xaxis_title), 
        yaxis=dict(title=yaxis_title)
    )
    fig = go.Figure(data=[trace], layout=layout)
    fig.update_xaxes(type='category')    
    iplot(fig)    

In [None]:
df_grouped_by_week = df_games_cleaned.groupby(by=["week"]).count()
data = df_grouped_by_week["gameDate"]

plot_bar_chart(
    x=data.index, 
    y=data.values, 
    name="Weekly Game Count", 
    title="Weekly Game Count", 
    xaxis_title="Week",
    yaxis_title="Number of Games",
)

In [None]:
data = df_games_cleaned["gameTimeEastern"].value_counts()

plot_bar_chart(
    x=data.index, 
    y=data.values, 
    name="Number of Games Per Start Time", 
    title="Number of Games Per Start Time",
    xaxis_title="Game Time",
    yaxis_title="Number of Games",
)

In [None]:
# teamAbbrevs = list(set(df_games["homeTeamAbbr"].tolist() + df_games["visitorTeamAbbr"].values.tolist()))

# df_grouped_by_week = df_games.groupby(by=["week"])[["homeTeamAbbr", "visitorTeamAbbr"]].agg({
#     "homeTeamAbbr": list,
#     "visitorTeamAbbr": list,
# })
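
One way to approach the bye-week question is to list, for each week, the teams that do not appear as either home or visiting team. The sketch below uses a small made-up schedule with the same three columns as `games.csv`; the variable names are illustrative only.

```python
import pandas as pd

# Made-up schedule for illustration; the column names match games.csv.
games = pd.DataFrame({
    "week": [1, 1, 2, 2, 3],
    "homeTeamAbbr": ["ATL", "DAL", "ATL", "NYG", "DAL"],
    "visitorTeamAbbr": ["PHI", "NYG", "DAL", "PHI", "PHI"],
})

# Stack home and visiting teams into one long (week, team) table.
appearances = games.melt(
    id_vars="week",
    value_vars=["homeTeamAbbr", "visitorTeamAbbr"],
    value_name="team",
)[["week", "team"]]

all_teams = set(appearances["team"])
teams_by_week = appearances.groupby("week")["team"].agg(set)

# Teams with no game in a week are on a bye (or their game is missing from the data).
byes_by_week = teams_by_week.apply(lambda playing: sorted(all_teams - playing))
print(byes_by_week.to_dict())  # {1: [], 2: [], 3: ['ATL', 'NYG']}
```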

## Plays Analysis

### Columns Overview

+ **gameId**: Game identifier, unique (numeric)
+ **playId**: Play identifier, not unique across games (numeric)
+ **playDescription**: Description of play (text)
+ **quarter**: Game quarter (numeric)
+ **down**: Down (numeric)
+ **yardsToGo**: Distance needed for a first down (numeric)
+ **possessionTeam**: Team on offense (text)
+ **playType**: Outcome of dropback: sack or pass (text)
+ **yardlineSide**: 3-letter team code corresponding to line-of-scrimmage (text)
+ **yardlineNumber**: Yard line at line-of-scrimmage (numeric)
+ **offenseFormation**: Formation used by possession team (text)
+ **personnelO**: Personnel used by offensive team (text)
+ **defendersInTheBox**: Number of defenders in close proximity to line-of-scrimmage (numeric)
+ **numberOfPassRushers**: Number of pass rushers (numeric)
+ **personnelD**: Personnel used by defensive team (text)
+ **typeDropback**: Dropback categorization of quarterback (text)
+ **preSnapHomeScore**: Home score prior to the play (numeric)
+ **preSnapVisitorScore**: Visiting team score prior to the play (numeric)
+ **gameClock**: Time on clock of play (MM:SS)
+ **absoluteYardlineNumber**: Distance from end zone for possession team (numeric)
+ **penaltyCodes**: NFL categorization of the penalties that occurred on the play. For purposes of this contest, the most important penalties are Defensive Pass Interference (DPI), Offensive Pass Interference (OPI), Illegal Contact (ICT), and Defensive Holding (DH). Multiple penalties on a play are separated by a ; (text)
+ **penaltyJerseyNumber**: Jersey number and team code of the player committing each penalty. Multiple penalties on a play are separated by a ; (text)
+ **passResult**: Outcome of the passing play (C: Complete pass, I: Incomplete pass, S: Quarterback sack, IN: Intercepted pass, text)
+ **offensePlayResult**: Yards gained by the offense, excluding penalty yardage (numeric)
+ **playResult**: Net yards gained by the offense, including penalty yardage (numeric)
+ **epa**: Expected points added on the play, relative to the offensive team. Expected points is a metric that estimates the average of every next scoring outcome given the play's down, distance, yardline, and time remaining (numeric)
+ **isDefensivePI**: An indicator variable for whether or not a DPI penalty occurred on a given play (TRUE/FALSE)

### Initial Questions:

General:

+ Which team ran the most plays?
+ How many games went into overtime?
+ How many times did teams go for it on 4th down?
+ What formations worked best on 4th down?
+ What was the pass result frequency over the course of a season?

For formations:

+ Which formations produce the most yards on average?
+ What relationship holds between yards to go and offensive/defensive formation? What about time left in the game?
+ Which formations produce the most penalties? Defensive pass interference? Sacks? Interceptions?
+ What's the correlation between formation and play type?
+ What teams favor what formations? 
+ Are there formations that work best for certain teams?
+ What relationship holds between formations, defenders in the box, number of pass rushers, and play result?
+ Do formation frequencies change over the course of a season?

Potential for textual analysis on `playDescription` column.
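
As a starting point for that textual analysis, pass depth and direction could be pulled out of the description with a regular expression. This is only a sketch: the pattern assumes descriptions contain a phrase like "pass short right", the example sentence and player names are invented, and `parse_pass_description` is a hypothetical helper.

```python
import re
from typing import Dict, Optional

# Assumption: pass plays read "... pass [incomplete] <depth> <direction> ...".
PASS_PATTERN = re.compile(r"pass (?:incomplete )?(short|deep) (left|middle|right)")


def parse_pass_description(description: str) -> Optional[Dict[str, str]]:
    """Return pass depth and direction, or None when the pattern is absent."""
    match = PASS_PATTERN.search(description)
    if match is None:
        return None
    depth, direction = match.groups()
    return {"depth": depth, "direction": direction}


example = "(12:30) A.Passer pass short right to B.Catcher for 8 yards"  # invented text
print(parse_pass_description(example))  # {'depth': 'short', 'direction': 'right'}
```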

In [None]:
df_plays = pd.read_csv('/kaggle/input/nfl-big-data-bowl-2021/plays.csv')
print(df_plays.shape)
df_plays.head()

In [None]:
df_plays.info()

In [None]:
def convert_game_clock_to_seconds(gameClock: str) -> float:
    # Expects three ":"-separated fields (the third is ignored);
    # NaN or malformed values fail to unpack or parse and become NaN
    try:
        minutes, seconds, _ = str(gameClock).split(':')
        return int(minutes) * 60 + int(seconds)
    except ValueError:
        return np.nan
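
Once the clock is in seconds, it can be combined with `quarter` to answer the "time left in the game" question. A minimal sketch, assuming four 15-minute quarters and ignoring overtime; `seconds_remaining_in_regulation` is a hypothetical helper, not part of this notebook.

```python
def seconds_remaining_in_regulation(quarter: int, clock_seconds: float) -> float:
    """Seconds left in regulation, assuming four 15-minute quarters."""
    if not 1 <= quarter <= 4:
        raise ValueError("only regulation quarters (1-4) are handled in this sketch")
    return (4 - quarter) * 15 * 60 + clock_seconds


# 2:00 left in the second quarter leaves the whole second half plus two minutes.
print(seconds_remaining_in_regulation(2, 120))  # 1920
```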

In [None]:
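def clean_and_merge_plays_df(df):
    # Work on a copy so that re-running the cell does not re-convert gameClock
    df = df.copy()
    df["gameClock"] = df["gameClock"].apply(convert_game_clock_to_seconds)
    # df_games_cleaned is indexed by gameId, so join on its index
    merged_with_games = df.merge(df_games_cleaned, left_on="gameId", right_index=True)
    return merged_with_games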

In [None]:
df_merged = clean_and_merge_plays_df(df_plays)
df_merged.head(10)

In [None]:
data = df_merged["down"].value_counts()

trace = go.Bar(
    x=data.index,
    y=data.values,
    marker=dict(
        color=data.values,
        line=dict(color="black", width=1.5),
        colorscale=px.colors.diverging.Portland,
    ),
    text=data.values,
    textposition="auto",
    orientation="v",
)
layout = go.Layout(
    title="Number Of Plays Run By Down",
    yaxis=dict(title="Number of Plays"),
    xaxis=dict(title="Down")
)
fig = go.Figure(data=[trace], layout=layout)
fig.update_xaxes(type='category')
iplot(fig) 

In [None]:

def grouped_bar_chart(
    *, 
    title: str,
    x_values: List[Any], 
    y_values: List[Any],
    labels: List[str],
    xaxis_title: str = "",
    yaxis_title: str = "",
) -> None:
    traces = []
    
    for i, label in enumerate(labels):
        x = x_values
        y = y_values[i]
        
        trace = go.Bar(
            x=x,
            y=y,
            name=label,
            marker=dict(
                color=y,
                line=dict(color="black", width=1.5),
                colorscale=px.colors.diverging.Portland,
            ),
            text=label,
            textposition="auto",
            orientation="v",
            offsetgroup=i,
        )
        traces.append(trace)        

    layout = go.Layout(
        title=title, 
        xaxis=dict(title=xaxis_title), 
        yaxis=dict(title=yaxis_title)
    )
    fig = go.Figure(data=traces, layout=layout)
    fig.update_xaxes(type='category')
    iplot(fig)
    

In [None]:
weekly_down_df = pd.DataFrame(df_merged.groupby(by=["week"])["down"].value_counts()).unstack()
grouped_bar_chart(
    labels=["1st Down", "2nd Down", "3rd Down", "4th Down"],
    x_values=weekly_down_df.index.tolist(), 
    y_values=weekly_down_df.values.transpose().tolist(), 
    title="Plays By Down Week over Week",
    xaxis_title="Week",
    yaxis_title="Number of Plays Run",
    
)

In [None]:
def convert_to_percent(df):
    # Scale each column so that its values sum to 100
    return df / df.sum() * 100

weekly_down_percentage = convert_to_percent(
    pd.DataFrame(
        df_merged.groupby(by=["week"])["down"].value_counts()
    ).unstack(level=0)
)

In [None]:
grouped_bar_chart(
    labels=["1st Down", "2nd Down", "3rd Down", "4th Down"],
    x_values=[idx[1] for idx in weekly_down_percentage.transpose().index], 
    y_values=weekly_down_percentage.values.tolist(), 
    title="Percentage of Plays Run By Down Week over Week",
    xaxis_title="Week",
    yaxis_title="Percentage of Total Plays Run",
    
)

In [None]:
def plot_weekly_categorical_values(
    *,
    df: pd.DataFrame,
    column: str,
    title: str = "Formation Count By Week",
    horizontal_spacing=0.10,
    vertical_spacing=0.10,   
) -> None:
    traces = []
    fig = plotly.subplots.make_subplots(
        rows=2, 
        cols=1,
        horizontal_spacing=horizontal_spacing,
        vertical_spacing=vertical_spacing,
        shared_xaxes=True,
        row_heights=[0.4, 0.6],        
    )
    
    uniqs = df[~(pd.isna(df[column]))][column].unique().tolist()
    
    for i, category in enumerate(uniqs):
        df_categorical = df[df[column] == category]
        weekly_counts = df_categorical.groupby(by=["week"]).count()
        x = weekly_counts[column].index
        y = weekly_counts[column].values
        
        fig.add_trace(
            go.Scatter(
                x=x,
                y=y,
                name=category,
            ),
            row=1,
            col=1,
        )        
        
        fig.add_trace(
            go.Scatter(
                x=x,
                y=y,
                mode="none",
                fill="tozeroy" if i == 0 else "tonexty",
                name=category,
                stackgroup="one",
            ),
            row=2,
            col=1,
        )

    fig.update_layout(title=title, xaxis=dict(title="Week"), height=800)
    fig.update_xaxes(type='category')
    fig.show()    
    

In [None]:
plot_weekly_categorical_values(df=df_merged, column="offenseFormation")

In [None]:
plot_weekly_categorical_values(df=df_merged, column="passResult", title="Pass Result By Week")

In [None]:
def plot_weekly_numerical_values(
    *,
    df: pd.DataFrame,
    title: str,
    agg_mapping: dict,
    horizontal_spacing=0.10,
    vertical_spacing=0.10, 
) -> None:
    fig = plotly.subplots.make_subplots(
        rows=2, 
        cols=1,
        horizontal_spacing=horizontal_spacing,
        vertical_spacing=vertical_spacing,
        shared_xaxes=True,
        row_heights=[0.4, 0.6],        
    )
    
    # One aggregated value per week for each column named in agg_mapping
    weekly_data = df.groupby(by=["week"])[list(agg_mapping.keys())].agg(agg_mapping)
    
    for col in weekly_data.columns:
        data = weekly_data[col]
        x = data.index
        y = data.values

        fig.add_trace(
            go.Scatter(
                x=x,
                y=y,
                name=col,
            ),
            row=1,
            col=1,
        )        
        
        fig.add_trace(
            go.Scatter(
                x=x,
                y=y,
                mode="none",
#                 fill="tozeroy" if i == 1 else "tonexty",
                fill="tonexty",
                name=col,
                stackgroup='one'
            ),
            row=2,
            col=1,
        )

    fig.update_layout(title=title, xaxis=dict(title="Week"), height=800)
    fig.update_xaxes(type='category')
    fig.show()  

In [None]:
plot_weekly_numerical_values(
    df=df_merged,
    agg_mapping={
#         "offensePlayResult": np.sum,
#         "epa": np.sum,
        "preSnapHomeScore": np.sum,
        "preSnapVisitorScore": np.sum,
    },
    title="Results"
)

In [None]:

# def plot_scatter_matrix(df):
#     data = df.loc[:, ["offensePlayResult", "preSnapHomeScore", "preSnapVisitorScore", "epa"]]
#     data.index = np.arange(1, len(data)+1)

#     fig = ff.create_scatterplotmatrix(
#         data,
#         diag='box', 
#         colormap='Portland',
#         colormap_type='cat',
#         height=700, 
#         width=700,
#     )

#     iplot(fig)
    
# plot_scatter_matrix(df_merged)

In [None]:
data = df_plays.groupby(by=["offenseFormation"])["playId"].count()

trace = go.Bar(
    x=data.index,
    y=data.values,
    marker=dict(
        color=data.values,
        line=dict(color="black", width=1.5),
        colorscale=px.colors.diverging.Portland,      
    ),
    text=data.values,
    textposition="auto",
    orientation="v",        
)
layout = go.Layout(title="Offensive Formation Count")
fig = go.Figure(data=[trace], layout=layout)
iplot(fig)

In [None]:
# df_plays.groupby(by=["personnelO"])["playId"].count().sort_values()[::-1]

In [None]:
# df_plays.groupby(by=["personnelD"])["playId"].count().sort_values()[::-1]

## Weekly Tracking Analysis

Each of the 17 week[week].csv files contains player tracking data from all passing plays during Week [week] of the 2018 regular season. Nearly all plays from each [gameId] are included; certain plays or games with insufficient data are dropped. Each team and player plays no more than 1 game in a given week.

+ **time**: Time stamp of play (time, yyyy-mm-dd, hh:mm:ss)
+ **x**: Player position along the long axis of the field, 0 - 120 yards. See Figure 1 below. (numeric)
+ **y**: Player position along the short axis of the field, 0 - 53.3 yards. See Figure 1 below. (numeric)
+ **s**: Speed in yards/second (numeric)
+ **a**: Acceleration in yards/second^2 (numeric)
+ **dis**: Distance traveled from prior time point, in yards (numeric)
+ **o**: Player orientation (deg), 0 - 360 degrees (numeric)
+ **dir**: Angle of player motion (deg), 0 - 360 degrees (numeric)
+ **event**: Tagged play details, including moment of ball snap, pass release, pass catch, tackle, etc (text)
+ **nflId**: Player identification number, unique across players (numeric)
+ **displayName**: Player name (text)
+ **jerseyNumber**: Jersey number of player (numeric)
+ **position**: Player position group (text)
+ **team**: Team (away or home) of corresponding player (text)
+ **frameId**: Frame identifier for each play, starting at 1 (numeric)
+ **gameId**: Game identifier, unique (numeric)
+ **playId**: Play identifier, not unique across games (numeric)
+ **playDirection**: Direction that the offense is moving (text, left or right)
+ **route**: Route run by offensive player (text)
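
Because `playDirection` can be either left or right, comparing positions across plays is easier once every play is oriented the same way. The sketch below shows one common way to do that, rotating left-moving plays by 180 degrees. It assumes the coordinate ranges listed above; `standardize_direction` is a hypothetical helper, and the two sample rows are made up.

```python
import pandas as pd

FIELD_LENGTH = 120.0  # yards, the x range given in the column list
FIELD_WIDTH = 53.3    # yards, the y range given in the column list


def standardize_direction(tracking: pd.DataFrame) -> pd.DataFrame:
    """Return a copy in which every play has the offense moving to the right."""
    out = tracking.copy()
    moving_left = out["playDirection"] == "left"
    out.loc[moving_left, "x"] = FIELD_LENGTH - out.loc[moving_left, "x"]
    out.loc[moving_left, "y"] = FIELD_WIDTH - out.loc[moving_left, "y"]
    # Angles rotate by 180 degrees when the field is rotated.
    for angle in ("o", "dir"):
        out.loc[moving_left, angle] = (out.loc[moving_left, angle] + 180) % 360
    out["playDirection"] = "right"
    return out


frames = pd.DataFrame({  # made-up sample rows
    "x": [30.0, 90.0],
    "y": [10.0, 43.3],
    "o": [90.0, 270.0],
    "dir": [45.0, 225.0],
    "playDirection": ["right", "left"],
})
print(standardize_direction(frames))
```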

In [None]:
df_week_1 = pd.read_csv("/kaggle/input/nfl-big-data-bowl-2021/week1.csv")
print(df_week_1.shape)
df_week_1.head()

In [None]:
df_week_1.info()

In [None]:
def clean_weekly_df(df):
    df["time"] = pd.to_datetime(df["time"])
    
    # Join tracking rows to their play (and game) details on the shared keys
    return df.merge(df_merged, on=["gameId", "playId"])

In [None]:
# One tracking file per week of the 17-week regular season
weekly_data_filenames = [
    f'/kaggle/input/nfl-big-data-bowl-2021/week{week}.csv'
    for week in range(1, 18)
]

In [None]:
weekly_data = [pd.read_csv(filename) for filename in weekly_data_filenames]

weekly_qb_data = clean_weekly_df(
    pd.concat(
        [df[df["position"] == "QB"] for df in weekly_data],
        axis="index",
    )
)
print(weekly_qb_data.shape)

In [None]:
weekly_qb_data.head()

In [None]:
weekly_qb_data["epa"].describe()

In [None]:
def plot_categorical_bar_chart(
    *,
    x_values: List[Any],
    y_values: List[Any],
    name: str,
    title: str,
    xaxis_title: str,
    yaxis_title: str,    
    colorscale: List[str] = px.colors.diverging.Portland,
) -> None:
    layout = go.Layout(
        title=title, 
        xaxis=dict(title=xaxis_title), 
        yaxis=dict(title=yaxis_title)
    )        
    fig = go.Figure(
        data=[
            go.Bar(
                x=x_values,
                y=y_values,
                text=y_values,
                textposition="auto",
                orientation="v",  
                marker=dict(
                    color=y_values,
                    line=dict(color="black", width=1.5),
                    colorscale=px.colors.diverging.Portland,
                )                
            )
        ], 
        layout=layout,
    )    

    fig.update_xaxes(type='category')    
    iplot(fig)

In [None]:
touchdown_data = weekly_qb_data[(weekly_qb_data["event"] == "touchdown") | (weekly_qb_data["event"] == "pass_outcome_touchdown")]

plot_categorical_bar_chart(
    x_values=touchdown_data.groupby(by=["offenseFormation"])["event"].count().index, 
    y_values=touchdown_data.groupby(by=["offenseFormation"])["event"].count().values, 
    name="Number of Touchdowns Per Formation", 
    title="Number of Touchdowns Per Formation", 
    xaxis_title="Formation",
    yaxis_title="Touchdowns",
)

In [None]:
sack_data = weekly_qb_data[weekly_qb_data["event"].isin(["qb_sack", "qb_strip_sack"])]

# Count distinct plays (not tracking frames) per formation, and divide as aligned
# Series so that a formation without any sacks cannot shift the results
play_keys = ["gameId", "playId"]
sacks_by_formation = sack_data.drop_duplicates(play_keys).groupby("offenseFormation").size()
plays_by_formation = weekly_qb_data.drop_duplicates(play_keys).groupby("offenseFormation").size()
sack_pct = (sacks_by_formation / plays_by_formation * 100).fillna(0).round(3)

plot_categorical_bar_chart(
    x_values=sack_pct.index, 
    y_values=sack_pct.values, 
    name="Percentage of Plays Resulting in a Sack By Formation", 
    title="Percentage of Plays Resulting in a Sack By Formation", 
    xaxis_title="Formation",
    yaxis_title="Percentage of Plays Resulting in a Sack",
)
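
When computing rates like this, dividing pandas Series keeps each formation paired with its own total, whereas dividing raw `.values` arrays pairs entries purely by position. A small sketch with made-up counts, in which one formation has no sacks at all:

```python
import pandas as pd

# Made-up counts: no sacks were recorded out of the "WILDCAT" formation.
sacks = pd.Series({"EMPTY": 3, "SHOTGUN": 12})
plays = pd.Series({"EMPTY": 60, "SHOTGUN": 300, "WILDCAT": 5})

# Series division aligns on the index; a missing numerator becomes NaN, then 0.
sack_pct = (sacks / plays * 100).fillna(0).round(2)
print(sack_pct.to_dict())  # {'EMPTY': 5.0, 'SHOTGUN': 4.0, 'WILDCAT': 0.0}
```

With positional arrays the same inputs would have different lengths, so the division would either fail or, in less obvious cases, silently match a count to the wrong formation.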

In [None]:
interception_data = weekly_qb_data[weekly_qb_data["passResult"] == "IN"]

# Count distinct plays (not tracking frames) and divide as aligned Series
play_keys = ["gameId", "playId"]
interceptions_by_formation = interception_data.drop_duplicates(play_keys).groupby("offenseFormation").size()
plays_by_formation = weekly_qb_data.drop_duplicates(play_keys).groupby("offenseFormation").size()
interception_pct = (interceptions_by_formation / plays_by_formation * 100).fillna(0).round(3)

plot_bar_chart(
    x=interception_pct.index, 
    y=interception_pct.values,
    name="Percentage of Plays Resulting in an Interception by Formation",
    title="Percentage of Plays Resulting in an Interception by Formation",
    xaxis_title="Formation",
    yaxis_title="Percentage of Plays Resulting in an Interception",
)

In [None]:
def plot_scatterplot(
    *,
    df: pd.DataFrame, 
    x_column: str, 
    y_column: str,
    title: str,
) -> None:
    trace = go.Scattergl(
        x=df[x_column],
        y=df[y_column],
        mode="markers",
    )
    
    layout = go.Layout(
        title=title,
        xaxis=dict(title=x_column),
        yaxis=dict(title=y_column),
    )
    fig = go.Figure(data=[trace], layout=layout)
    iplot(fig)
        

In [None]:
def plot_boxplot(
    *,
    df: pd.DataFrame, 
    x_column: str, 
    y_column: str,
    title: str,
) -> None:    
    data = []
    categorical_labels = df[x_column].dropna().unique()
    
    # Sort numerically when every label parses as a number (including 0),
    # otherwise fall back to alphabetical order
    try:
        sorted_labels = sorted(categorical_labels, key=float)
    except (TypeError, ValueError):
        sorted_labels = sorted(categorical_labels, key=str)
    
    for label in sorted_labels:
        data.append(
            dict(
                x=label,
                y=df[df[x_column] == label][y_column],
            )
        )
        
    fig = go.Figure()
    for item in data:
        fig.add_trace(
            go.Box(
                y=item["y"],
                name=item["x"],
                line_width=1,
                whiskerwidth=0.2,
            )
        )
        
    fig.update_layout(
        title=title,
        xaxis=dict(title=x_column),
        yaxis=dict(
            title=y_column,
            autorange=True,
            showgrid=True,
            zeroline=True,
            dtick=5,
            gridcolor='rgb(255, 255, 255)',
            gridwidth=1,
            zerolinecolor='rgb(255, 255, 255)',
            zerolinewidth=2,            
        ),
        showlegend=False,
        margin=dict(
            l=40,
            r=30,
            b=80,
            t=100,
        ),        
    )
    fig.update_xaxes(type='category')    
    iplot(fig)
        

In [None]:
plot_boxplot(
    df=weekly_qb_data,
    x_column="numberOfPassRushers",
    y_column="offensePlayResult",
    title="QB Play Results Based on Number of Pass Rushers",
)

In [None]:
plot_boxplot(
    df=weekly_qb_data,
    x_column="offenseFormation",
    y_column="offensePlayResult",
    title="QB Play Results Based on Offensive Formation",
)

In [None]:
plot_boxplot(
    df=weekly_qb_data,
    x_column="defendersInTheBox",
    y_column="offensePlayResult",
    title="QB Play Results Based on Defenders in the Box",
)

In [None]:
plot_boxplot(
    df=weekly_qb_data,
    x_column="offenseFormation",
    y_column="epa",
    title="EPA Based on Offensive Formation",
)

## To Be Continued...