# Introduction

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

The 2022 Big Data Bowl data contains Next Gen Stats player tracking, play, game, player, and PFF scouting data for all 2018-2020 Special Teams plays.

This notebook analyzes data on punt plays to show various strategies. The code uses the provided game data, play data, and scouting data.

This notebook was created by data science students as a project to gain experience in real-world data science applications. Please feel free to provide suggestions and comments to help improve our knowledge.

In [None]:
#Import some useful, common python libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
from datetime import datetime

import warnings
warnings.filterwarnings('ignore')

In [None]:
#Load the datasets from the kaggle page

# Date/Time/Etc of games and who played
games = pd.read_csv('../input/nfl-big-data-bowl-2022/games.csv')

# Info about a specific play
plays = pd.read_csv('../input/nfl-big-data-bowl-2022/plays.csv')

# Play-level scouting info for each game
scouting = pd.read_csv('../input/nfl-big-data-bowl-2022/PFFScoutingData.csv')

# Cleaning the Data

The "scouting" and "plays" datasets are similar in content and size, so they were merged. This gives us a greater amount of information in a more concise manner.

In [None]:
print("Scouting data shape:",scouting.shape)
print("Play data shape:", plays.shape)
merged_df = pd.merge(scouting, plays, on=['gameId', 'playId'])
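
The merge above can be sanity-checked on a small example. This is a minimal sketch on invented toy data, assuming each table holds exactly one row per (`gameId`, `playId`) pair; the `validate` and `indicator` arguments are optional pandas checks that the original cell does not use.

```python
import pandas as pd

# Toy stand-ins for the scouting and plays tables (all values are invented).
toy_scouting = pd.DataFrame({
    "gameId": [1, 1, 2],
    "playId": [10, 20, 10],
    "hangTime": [4.1, 3.9, 4.5],
})
toy_plays = pd.DataFrame({
    "gameId": [1, 1, 2],
    "playId": [10, 20, 10],
    "specialTeamsPlayType": ["Punt", "Kickoff", "Punt"],
})

# validate="one_to_one" raises an error if a key pair is duplicated on either side;
# indicator=True adds a "_merge" column showing which table(s) each row came from.
toy_merged = pd.merge(
    toy_scouting, toy_plays,
    on=["gameId", "playId"],
    how="inner",
    validate="one_to_one",
    indicator=True,
)
print(toy_merged.shape)
```

Note that `playId` 10 appears in both games, which is why the join uses both keys rather than `playId` alone.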

In [None]:
merged_df = merged_df.loc[merged_df["specialTeamsPlayType"] == "Punt"]
merged_df.reset_index(inplace=True,drop=True)
merged_df.info()

Now we check the number of unique values in each column of the dataframe. Because the goal is to determine how these values affect the punt result, a column with only 1 unique value is not helpful and is therefore dropped. For example, the "down" column contains only 1 value, meaning that every punt in this data occurs on the same down (in this instance, 4th down). This will not help in determining a strategy, so it will be dropped.

If a column contains 0 unique values, every entry is null, which indicates that the feature does not apply to the play. For example, kickoff return formations do not exist for punt plays, so that column has 0 unique values and will be dropped.

In [None]:
merged_df.nunique()

Once the columns are dropped, we can see the remaining columns in our dataframe.

In [None]:
columnsToDrop = ['kickoffReturnFormation','down','specialTeamsPlayType','kickerId','kickBlockerId',
                 'penaltyJerseyNumbers','playDescription']

strategy_df = merged_df.drop(columns=columnsToDrop)
strategy_df.info()
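
The uniqueness rule described above could also be applied programmatically instead of reading the `nunique()` output by eye. This is a minimal sketch on an invented toy frame; the names `toy`, `low_info_cols`, and `toy_reduced` are hypothetical. It covers only the "0 or 1 unique values" rule, so any other columns in `columnsToDrop` would still be listed by hand.

```python
import pandas as pd

# Toy frame: one constant column, one all-null column, one informative column.
toy = pd.DataFrame({
    "down": [4, 4, 4],
    "kickoffReturnFormation": [None, None, None],
    "hangTime": [4.1, 3.9, 4.5],
})

# nunique() ignores nulls, so an all-null column reports 0 unique values.
unique_counts = toy.nunique()
low_info_cols = unique_counts[unique_counts <= 1].index.tolist()

toy_reduced = toy.drop(columns=low_info_cols)
print(low_info_cols)
```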

Next, several of the existing columns are edited so the data is in a format that will be easier to use in our model later.

A new column called "tacklers" is created to represent the total number of tacklers involved with the attempted tackle on the play. This is done by summing the counts of players from the "missedTackler", "assistTackler", and "tackler" columns, and will replace those features in the dataframe. This was done to determine if the number of players involved would affect the play result.

In [None]:
# Replaces the "missedTackler", "assistTackler", and "tackler" columns with one "tacklers" column
# that contains the total number of tacklers involved in the play.

def tacklers(row):
    # Each column holds a ';'-separated list of players, so entries = separators + 1.
    total = 0
    for col in ['missedTackler', 'tackler', 'assistTackler']:
        if pd.notnull(row[col]):
            total += row[col].count(";") + 1
    return total

strategy_df['tacklers'] = strategy_df.apply(tacklers, axis=1)
strategy_df.drop(columns=['tackler','assistTackler','missedTackler'],inplace=True)
strategy_df[['tacklers','specialTeamsResult']].head(10)
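
The same count can be computed without a row-wise `apply`. This is a minimal sketch on invented toy data, assuming the columns hold semicolon-separated strings with nulls where no player is listed; `count_listed` is a hypothetical helper, not part of the notebook.

```python
import numpy as np
import pandas as pd

# Toy data: player labels are invented for illustration.
toy = pd.DataFrame({
    "tackler": ["BAL 41", np.nan, "NE 18"],
    "assistTackler": [np.nan, np.nan, "NE 23; NE 51"],
    "missedTackler": ["BAL 36; BAL 57", np.nan, np.nan],
})

def count_listed(series):
    """Number of ';'-separated entries per row; missing values count as 0."""
    counts = series.str.count(";") + 1   # null entries stay null at this step
    return counts.fillna(0).astype(int)

# Add up the per-column counts to get the total number of tacklers per play.
toy["tacklers"] = sum(
    count_listed(toy[col]) for col in ["tackler", "assistTackler", "missedTackler"]
)
print(toy["tacklers"].tolist())
```

The same helper would work for any single semicolon-separated column, as long as that column contains at least one string so the `.str` accessor is available.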

A similar method was used for "gunners", "puntRushers", "specialTeamsSafeties", and "vises" - the values for these features were replaced with an integer representing the total number of players for each position associated with the play.

In [None]:
# Replaces the values for "gunners", "puntRushers", "specialTeamsSafeties", and "vises" with the number of players
# for each respective position involved in the play.
def positionPlayers(row):
    players = 0
    if pd.notnull(row):
        players = players + row.count(";") + 1
    return players

strategy_df['gunners'] = strategy_df['gunners'].apply(lambda row: positionPlayers(row))
strategy_df['puntRushers'] = strategy_df['puntRushers'].apply(lambda row: positionPlayers(row))
strategy_df['specialTeamsSafeties'] = strategy_df['specialTeamsSafeties'].apply(lambda row: positionPlayers(row))
strategy_df['vises'] = strategy_df['vises'].apply(lambda row: positionPlayers(row))
strategy_df[['gunners','puntRushers','specialTeamsSafeties','vises']].head(10)

Next, two additional columns are created to provide better contextual information for each play. 

The new "yardsToEndzone" column replaces the previous "absoluteYardlineNumber" and "yardlineNumber" columns and represents the possession team's remaining yards to the opposite end zone. This gives a better indication of the remaining distance that needs to be closed in the following plays.

The new "pointDifference" column replaces "preSnapVisitorScore" and "preSnapHomeScore" and indicates by how many points the possession team is winning (positive) or losing (negative) relative to the opposing team. While the pre-snap scores are useful, they do not indicate whether the possession team is the home team or the visiting team. This feature may be important in determining whether strategy changes based on the score.

In [None]:
# Based on whether the possession team is the home team or visiting team, this function returns
# the total yards until the opposite endzone and the possession team's point margin
# (positive when leading, negative when trailing).

def winOrLose(row):
    homeTeam = games.loc[games['gameId'] == row['gameId'], 'homeTeamAbbr'].item()
    if homeTeam == row['possessionTeam']:
        yards = 120 - 10 - row['absoluteYardlineNumber']
        points = row['preSnapHomeScore'] - row['preSnapVisitorScore']
    else:
        yards = row['absoluteYardlineNumber'] - 10
        points = row['preSnapVisitorScore'] - row['preSnapHomeScore']
    return yards, points

# Each element of "values" is a (yards, points) tuple for one play.
values = strategy_df.apply(winOrLose, axis=1)
strategy_df['yardsToEndzone'] = [v[0] for v in values]
strategy_df['pointDifference'] = [v[1] for v in values]
strategy_df.drop(columns=['preSnapHomeScore','preSnapVisitorScore','absoluteYardlineNumber', 'yardlineNumber'], inplace=True)
strategy_df[['yardsToEndzone','pointDifference']].head(10)
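
For larger tables, the per-row lookup into `games` can be replaced by a single merge. This is a minimal sketch on invented toy data that applies the same arithmetic as `winOrLose`; the frames `toy_games` and `toy_plays` and their values are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the games and plays tables (all values are invented).
toy_games = pd.DataFrame({"gameId": [1, 2], "homeTeamAbbr": ["PHI", "NE"]})
toy_plays = pd.DataFrame({
    "gameId": [1, 1, 2],
    "possessionTeam": ["PHI", "ATL", "NE"],
    "absoluteYardlineNumber": [45, 80, 30],
    "preSnapHomeScore": [7, 7, 0],
    "preSnapVisitorScore": [3, 3, 10],
})

# Attach the home team to every play once, then branch on who has possession.
toy = toy_plays.merge(toy_games, on="gameId", how="left", validate="many_to_one")
is_home = toy["possessionTeam"] == toy["homeTeamAbbr"]

# Same formulas as winOrLose: 120 - 10 - x for the home team, x - 10 otherwise.
toy["yardsToEndzone"] = np.where(
    is_home,
    110 - toy["absoluteYardlineNumber"],
    toy["absoluteYardlineNumber"] - 10,
)
home_lead = toy["preSnapHomeScore"] - toy["preSnapVisitorScore"]
toy["pointDifference"] = np.where(is_home, home_lead, -home_lead)
print(toy[["yardsToEndzone", "pointDifference"]])
```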

The time remaining on the game clock may be an important feature when determining a punt strategy. The time format exists as a string, so here it is converted into a numeric value representing the number of seconds left on the clock.

In [None]:
strategy_df['gameClock'] = pd.to_datetime(strategy_df['gameClock'], format='%M:%S:%f').dt.time
strategy_df['gameClock'] = strategy_df['gameClock'].apply(lambda x: (x.minute*60 + x.second))

In [None]:
strategy_df[['gameClock','quarter']].head(10)
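
The clock conversion can also be done with plain string operations. This is a minimal sketch, assuming the clock strings have three colon-separated fields with minutes first and seconds second (as the `'%M:%S:%f'` format above implies); the sample values are invented.

```python
import pandas as pd

# Invented sample clock strings in minutes:seconds:fraction form.
clock = pd.Series(["15:00:00", "02:31:00", "00:07:00"])

# Split into integer fields and combine minutes and seconds; the third field is ignored.
parts = clock.str.split(":", expand=True).astype(int)
seconds_left = parts[0] * 60 + parts[1]
print(seconds_left.tolist())
```

This value is the time left in the current quarter only, which is why the cell above displays it next to "quarter".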

Now we check for numerical columns with null values. In this case, the null values in several columns representing time or distance are filled with "0" to indicate that the particular value does not exist for the corresponding play.

In [None]:
strategy_df.isna().sum()

In [None]:
# Assign the result back instead of calling fillna(inplace=True) on a column selection,
# which can act on a temporary copy in recent pandas versions.
fillCols = ['penaltyYards', 'kickLength', 'hangTime', 'operationTime', 'snapTime']
strategy_df[fillCols] = strategy_df[fillCols].fillna(0)

# Visualizations

Next, we visualize some of the features within the dataframe. 

First, we use the new "yardsToEndzone" column to see whether punt plays occur at specific distances.

In [None]:
# Count rows per distance; playId is only unique within a game, so nunique() on it can undercount.
puntLocation = strategy_df.groupby('yardsToEndzone').size().reset_index(name='puntCount')
fig = px.bar(puntLocation, 
             x='yardsToEndzone', 
             y='puntCount',
             labels={'yardsToEndzone':'Yards To Endzone',
                   'puntCount':'Number of Punts Attempted'}
            )
fig.update_layout(title_text='Number of Punts Attempted by Distance to Endzone')

fig.show()

From the graph above, we see that punts occur across a wide range of yardlines. Distance to the endzone alone is therefore not a strong indicator of when a team chooses to punt.

Below we see the distribution of results following a punt play. The most common result is a return, followed by a fair catch. 

In [None]:
# Count rows per result; playId is only unique within a game, so nunique() on it can undercount.
puntResult = strategy_df.groupby('specialTeamsResult').size().reset_index(name='playCount')

fig = px.bar(puntResult, 
             x='specialTeamsResult', 
             y='playCount',
             labels={'specialTeamsResult':'Result of punts',
                   'playCount':'Number of plays'}
            )
fig.update_layout(title_text='Number of plays per punt result')

fig.show()

The averages below, grouped by punt result type, give an interesting depiction of how useful each result is for gaining yards. At first thought, it would be fair to assume that a punt return would gain many more yards than a fair catch because the punt returner has the opportunity to run the ball. However, this graph shows no substantial difference in yardage gained, which means that in many cases it might be more efficient, in terms of energy cost and injury risk, to call a fair catch.

In [None]:
# Average play result (in yards) for each punt outcome
means = strategy_df.groupby('specialTeamsResult')['playResult'].mean().reset_index()
fig = px.bar(means,
             x='specialTeamsResult', 
             y='playResult',
             labels={'specialTeamsResult':'Punt Results', 'playResult': 'Average Yards'})
fig.update_layout(title_text='Average yards gained/lost per punt result')
fig.show()
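
Because a mean can hide both skew and small group sizes, it may help to look at the count and median next to it before comparing result types. This is a minimal sketch on invented toy values; `summary` is a hypothetical name.

```python
import pandas as pd

# Invented play results (in yards) for a few punt outcomes.
toy = pd.DataFrame({
    "specialTeamsResult": ["Return", "Return", "Fair Catch", "Fair Catch", "Touchback"],
    "playResult": [35, 45, 40, 42, 30],
})

# One row per outcome with the number of plays, the mean, and the median.
summary = (
    toy.groupby("specialTeamsResult")["playResult"]
       .agg(["count", "mean", "median"])
       .reset_index()
)
print(summary)
```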

# Building a Basic Model

Lastly, we build a simple model using our cleaned data to determine whether punt results can be predicted from existing features.

This portion highlights more nuanced information about each play, which could help identify the more successful plays and trends upon review. Studying these features may also make it possible to discern earlier in the game whether a team is likely to win.

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn import metrics

In [None]:
punt_strat = strategy_df.copy()
punt_strat.info()

The dataframe X is created to contain the features we want to use when training our model. Dummy variables are created for the categorical variables, such as the kick type.

In [None]:
X = punt_strat[['snapDetail','snapTime','hangTime','operationTime',
                'kickType','kickDirectionIntended','kickDirectionActual', 'returnDirectionIntended',
                'returnDirectionActual','kickContactType','quarter','yardsToGo','gameClock', 
                'penaltyYards', 'kickLength','puntRushers','gunners','tacklers',
                'specialTeamsSafeties', 'vises','yardsToEndzone','pointDifference']]
X = pd.get_dummies(X)
X.info()

Our target variable is the special teams result for punts (Return, Fair Catch, Downed, Out of Bounds, Touchback, Muffed, Non-Special Teams Result, Blocked Punt). Each result is mapped to a numerical value.

In [None]:
punt_strat['specialTeamsResult'] = punt_strat['specialTeamsResult'].map({
    'Return':1, 'Fair Catch':2, 'Downed':3, 'Out of Bounds':4, 'Touchback':5, 'Muffed':6,
    'Non-Special Teams Result':7, 'Blocked Punt':8})

y1 = punt_strat['specialTeamsResult']

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y1, test_size = 0.3)
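
Since some punt results are much rarer than others, a stratified and seeded split may give more stable results than a plain random one. This is a minimal sketch on synthetic data; the `stratify` and `random_state` arguments are not used in the original cell, and the class sizes here are invented.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic features and an imbalanced three-class target (sizes are invented).
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 3))
y_toy = np.array([1] * 70 + [2] * 20 + [3] * 10)

# stratify keeps class proportions equal in both splits;
# random_state makes the split reproducible between runs.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, stratify=y_toy, random_state=42
)
print(np.bincount(y_te))
```

Stratifying requires at least two rows per class, so very rare results would need to be checked before applying this to the real target.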

The accuracy score below measures how often the model correctly predicts the outcome of the punt, whether it is a fair catch, a blocked punt, etc. The predictions are based on features such as the game clock, how many yards a team has until it reaches the end zone, and the hang time of the punt. A model like this could become a useful tool for the defense in anticipating what the offense may do next.

In [None]:
clf = RandomForestClassifier(n_estimators=100)
clf.fit(X_train,y_train)
y_pred = clf.predict(X_test)

print("Accuracy:", metrics.accuracy_score(y_test, y_pred))
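
An accuracy value is easier to interpret next to a naive baseline and a per-class breakdown. This is a minimal sketch on synthetic data generated with `make_classification`, not the punt data; the `DummyClassifier` baseline and `classification_report` are additions for illustration and are not part of the original analysis.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced three-class problem standing in for the punt results.
X_syn, y_syn = make_classification(
    n_samples=600, n_features=10, n_informative=5, n_classes=3,
    weights=[0.6, 0.3, 0.1], random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.3, stratify=y_syn, random_state=0
)

# Baseline: always predict the most common class seen in training.
baseline = DummyClassifier(strategy="most_frequent").fit(X_tr, y_tr)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

baseline_acc = accuracy_score(y_te, baseline.predict(X_te))
forest_acc = accuracy_score(y_te, forest.predict(X_te))
print("Baseline accuracy:", baseline_acc)
print("Forest accuracy:  ", forest_acc)

# Precision and recall per class show whether rare classes are predicted at all.
print(classification_report(y_te, forest.predict(X_te), zero_division=0))
```

If the forest only marginally beats the baseline, most of the accuracy comes from the dominant class rather than from the features.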

These values below denote the importance of each feature in predicting the target response with the above accuracy.

In [None]:
feature_imp = pd.Series(clf.feature_importances_,index=X.columns).sort_values(ascending=False)
feature_imp
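
Because `pd.get_dummies` splits each categorical feature into several columns, its importance is spread over those columns. This is a minimal sketch of summing the pieces back per original feature, assuming dummy columns are named `<feature>_<value>`; the importance values and category labels below are invented, and `source_feature` is a hypothetical helper.

```python
import pandas as pd

# Invented importances; dummy columns follow the "<feature>_<value>" naming.
toy_imp = pd.Series({
    "hangTime": 0.30,
    "kickLength": 0.25,
    "kickType_N": 0.15,
    "kickType_A": 0.10,
    "kickDirectionActual_L": 0.12,
    "kickDirectionActual_R": 0.08,
})
categorical = ["kickType", "kickDirectionActual"]

def source_feature(col):
    """Map a dummy column back to its original feature name."""
    for name in categorical:
        if col.startswith(name + "_"):
            return name
    return col

# Group by the mapped index labels and add up the importances.
grouped_imp = toy_imp.groupby(source_feature).sum().sort_values(ascending=False)
print(grouped_imp)
```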

In all, these functions account for many of the variables that make football data complex and that are involved in special teams decisions. Following this kind of analysis may therefore help a team be accurate, efficient, and calculated in various situations on a weekly basis, no matter which team it faces. This notebook provides a way to predict results and present data efficiently, and when accompanied by coaching expertise and player intuition, that combination becomes a formula for sustainable football success.