# NFL Big Data Bowl 2022 - Version-2

---

**Problem Statement:**

* Before National Football League (NFL) coaches celebrate a big W, they strategize ways to improve field position and score points. 
* Both of these objectives receive significant contributions from special teams plays, which consist of punts, kickoffs, field goals and extra points. 
* These play types take on important roles in a game’s final score—so much so that coaches say they're a third of the game. 
* Yet special teams remain an understudied part of American football, with an opportunity for data science to offer better ways to understand its impact.

**Note:** In this competition, we need to quantify what happens on special teams plays. We might have to create a new special teams metric, quantify team or individual strategies, rank players, or even something we haven’t considered.

---

**Dataset Involves:**

The 2022 Big Data Bowl data contains Next Gen Stats player tracking, play, game, player, and PFF scouting data for all 2018-2020 Special Teams plays. 

* **Game data:** The games.csv contains the teams playing in each game. The key variable is **gameId**.

* **Play data:** The plays.csv file contains play-level information from each game. The key variables are **gameId** and **playId**.

* **Player data:** The players.csv file contains player-level information from players that participated in any of the tracking data files. The key variable is **nflId**.

* **Tracking data:** Files tracking[season].csv contain player tracking data from season [season]. The key variables are **gameId**, **playId**, and **nflId**.

* **PFF Scouting data:** The PFFScoutingData.csv file contains play-level scouting information for each game. The key variables are **gameId** and **playId**.
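
The key variables listed above can be checked for uniqueness before any merge. The following is a minimal sketch on small made-up frames (the values and the helper name `is_unique_key` are illustrative assumptions, not part of the competition files):

```python
import pandas as pd

# Toy stand-ins for games.csv and plays.csv; all values are made up for illustration.
games = pd.DataFrame({"gameId": [1, 2], "season": [2018, 2019]})
plays = pd.DataFrame({"gameId": [1, 1, 2], "playId": [10, 11, 10]})

def is_unique_key(df, key_cols):
    """Return True when no two rows share the same values in key_cols."""
    return not df.duplicated(subset=key_cols).any()

games_ok = is_unique_key(games, ["gameId"])              # one row per game
plays_ok = is_unique_key(plays, ["gameId", "playId"])    # one row per play
plays_by_game_only = is_unique_key(plays, ["gameId"])    # gameId alone repeats

print(games_ok, plays_ok, plays_by_game_only)
```

In this toy data, `playId` repeats across games, which is why the play-level key combines `gameId` and `playId`.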

---

**Importing Libraries:**

* To get started we will use Python for data pre-processing and data analysis.

* Import the Python libraries needed for loading the data first; other libraries can be imported later as needed.

---


In [None]:
# NumPy for numerical computing
import numpy as np
# pandas for data loading and analysis
import pandas as pd
# glob finds all pathnames matching a specified pattern
import glob
# os provides a portable way of using operating-system-dependent functionality
import os
# pyplot interface of matplotlib
import matplotlib.pyplot as plt
# seaborn for statistical visualization
import seaborn as sns
# WordCloud for text data visualization
from wordcloud import WordCloud
# matplotlib base package
import matplotlib
# datetime for working with dates and times
from datetime import datetime
# plotly express for interactive plots
import plotly.express as px

---

In [None]:
# Loading dataset games.csv
game_data = pd.read_csv('../input/nfl-big-data-bowl-2022/games.csv')

In [None]:
# convert gameDate from object type to datetime type
game_data['gameDate'] = pd.to_datetime(game_data['gameDate'])

---

In [None]:
# Loading dataset plays.csv
play_data = pd.read_csv('../input/nfl-big-data-bowl-2022/plays.csv')

---

In [None]:
# Loading dataset players.csv
player_data = pd.read_csv('../input/nfl-big-data-bowl-2022/players.csv')

In [None]:
# convert birthDate from object type to datetime type
player_data['birthDate'] = pd.to_datetime(player_data['birthDate'])

---

In [None]:
# Loading dataset tracking2018.csv
tracking2018_data = pd.read_csv('../input/nfl-big-data-bowl-2022/tracking2018.csv')

# Loading dataset tracking2019.csv
tracking2019_data = pd.read_csv('../input/nfl-big-data-bowl-2022/tracking2019.csv')

# Loading dataset tracking2020.csv
tracking2020_data = pd.read_csv('../input/nfl-big-data-bowl-2022/tracking2020.csv')

**Note:** For EDA, a random sample of rows is drawn from each of the tracking2020, tracking2019, and tracking2018 season datasets.

In [None]:
# create sample of tracking dataset for EDA (For faster execution and to avoid out of memory issue for notebook)
tracking_data_sample = pd.concat([tracking2020_data.sample(n=100000, random_state=1),tracking2019_data.sample(n=100000, random_state=1),tracking2018_data.sample(n=100000, random_state=1)])
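
Row-level sampling keeps isolated frames from many different plays. If complete player trajectories are needed later, one alternative (not what this notebook does) is to sample whole plays and keep all of their rows. A minimal sketch on made-up data, where the helper name `sample_whole_plays` is hypothetical:

```python
import pandas as pd

# Toy tracking frame: 3 plays with 2 frames each (values are made up).
tracking = pd.DataFrame({
    "gameId":  [1, 1, 1, 1, 2, 2],
    "playId":  [10, 10, 11, 11, 10, 10],
    "frameId": [1, 2, 1, 2, 1, 2],
    "x":       [10.0, 11.0, 20.0, 21.0, 30.0, 31.0],
})

def sample_whole_plays(df, n_plays, random_state=1):
    """Pick n_plays distinct (gameId, playId) pairs and return all of their rows."""
    keys = df[["gameId", "playId"]].drop_duplicates()
    chosen = keys.sample(n=n_plays, random_state=random_state)
    return df.merge(chosen, on=["gameId", "playId"], how="inner")

play_sample = sample_whole_plays(tracking, n_plays=2)
print(play_sample)
```

The trade-off is that the number of rows returned depends on how many frames each chosen play has, so memory use is less predictable than with a fixed `n` of rows.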

In [None]:
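# reset the index after concatenating the three samples
# (drop=True discards the old row labels instead of adding them as an 'index' column)
tracking_data_sample.reset_index(drop=True, inplace=True)
# preview the first five rows of the sample
tracking_data_sample.head()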

---

In [None]:
# Loading dataset PFFScoutingData.csv
PFFScouting_data = pd.read_csv('../input/nfl-big-data-bowl-2022/PFFScoutingData.csv')

---

---
# Data Definition/Description of NFL 2022 data

---

Note: The datasets are merged step by step to arrive at a combined sample dataset for EDA.

---
**Merge game and play dataset - game_play_data**

---

In [None]:
# Merge of game and play dataset using key as gameId
game_play_data = pd.merge(game_data, play_data, how="inner", on=["gameId"])

---
**Merge player and tracking_data_sample dataset - player_tracking_data**

---

In [None]:
# Merge of player and tracking_data_sample season dataset using key as nflId
player_tracking_data = pd.merge(player_data,tracking_data_sample, how="inner", on=["nflId"])

---
**Merge game_play_data and PFFScouting dataset - game_play_scouting_data**

---

In [None]:
# Merge game_play_data and PFFScouting_data using gameId and playId as keys
game_play_scouting_data = pd.merge(game_play_data, PFFScouting_data, how="inner", on=["gameId", "playId"])

---
**Merge game_play_scouting and player_tracking dataset - nfl2022_data_sample**

---

In [None]:
# Merge game_play_scouting_data and player_tracking_data using gameId and playId as keys
nfl2022_data_sample = pd.merge(game_play_scouting_data, player_tracking_data, how="inner", on=["gameId", "playId"])
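
An inner merge silently drops rows that have no match on the other side. The `validate` and `indicator` parameters of `pd.merge` can make this visible. The sketch below uses made-up frames rather than the competition data:

```python
import pandas as pd

# Toy play-level and tracking-level frames (values are made up).
plays = pd.DataFrame({
    "gameId": [1, 1],
    "playId": [10, 11],
    "specialTeamsPlayType": ["Punt", "Kickoff"],
})
tracking = pd.DataFrame({
    "gameId": [1, 1, 1],
    "playId": [10, 10, 12],
    "nflId":  [100, 101, 102],
})

# validate="one_to_many" raises if (gameId, playId) is not unique on the left side.
# indicator=True adds a "_merge" column saying where each row came from.
merged = pd.merge(
    plays, tracking,
    how="outer", on=["gameId", "playId"],
    validate="one_to_many", indicator=True,
)
match_counts = merged["_merge"].value_counts()
print(match_counts)
```

Rows marked `left_only` or `right_only` are the ones an inner merge would discard, so their counts give a quick estimate of how much data is lost at each step.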

In [None]:
# the merges left two copies of displayName (suffixes _x and _y); keep one under the original name
nfl2022_data_sample.rename(columns={'displayName_x':'displayName'}, inplace=True)
nfl2022_data_sample.drop(columns=['displayName_y'], inplace=True)

In [None]:
# get shape of dataframe
print('Shape of nfl2022_data_sample dataset is:', nfl2022_data_sample.shape)

# print summary of dataframe
nfl2022_data_sample.info()

In [None]:
# print first 5 rows of dataframe
nfl2022_data_sample.head()

In [None]:
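# Impute missing kickLength values with the median
# (assigning the result back avoids the chained in-place fillna, which recent pandas versions warn about)
nfl2022_data_sample['kickLength'] = nfl2022_data_sample['kickLength'].fillna(nfl2022_data_sample['kickLength'].median())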


In [None]:
# Impute missing kickReturnYardage values with the median
nfl2022_data_sample['kickReturnYardage'] = nfl2022_data_sample['kickReturnYardage'].fillna(nfl2022_data_sample['kickReturnYardage'].median())

In [None]:
# Impute missing collegeName values with the string "Missing"
nfl2022_data_sample['collegeName'] = nfl2022_data_sample['collegeName'].fillna("Missing")
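
The cells above fill gaps with a single overall median, which mixes play types whose kick distances may differ. One possible alternative, shown here only as a sketch on made-up numbers, is to impute with the median of each special teams play type via `groupby(...).transform`:

```python
import numpy as np
import pandas as pd

# Toy data: one missing kickLength per play type (values are made up).
df = pd.DataFrame({
    "specialTeamsPlayType": ["Kickoff", "Kickoff", "Kickoff", "Punt", "Punt", "Punt"],
    "kickLength": [60.0, 70.0, np.nan, 40.0, 50.0, np.nan],
})

# Median per play type, aligned row by row with the original frame.
group_median = df.groupby("specialTeamsPlayType")["kickLength"].transform("median")

# Fill each gap with the median of its own play type.
df["kickLength_imputed"] = df["kickLength"].fillna(group_median)
print(df)
```

A play type with no observed values at all has no median, so its rows stay missing under this approach and would need a separate decision.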

In [None]:
# print summary of dataframe
nfl2022_data_sample.info()

---
# Data Analysis/EDA of NFL 2022 data

---

---
**Q: What is the distribution of Season and Team with respect to Special Teams Play Type?**

---

In [None]:
#distribution count for specialTeamsPlayType with respect to team and season
sns.catplot(data=nfl2022_data_sample, x="specialTeamsPlayType", col="team", hue="season", kind="count", height=8, aspect=0.8, palette="rainbow")
plt.show()

The plot above shows that every season follows a similar pattern across Kickoff, Punt, Field Goal, and Extra Point plays. In this sample, the count of Kickoffs is higher in season 2020 and the count of Punts is higher in season 2018.
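
The numbers behind a count plot like this can be read from a cross-tabulation. Because the merged sample has one row per sampled tracking record, row counts and play counts can differ. The sketch below uses made-up data to show both:

```python
import pandas as pd

# Toy merged sample: several tracking rows can belong to the same play (values are made up).
rows = pd.DataFrame({
    "gameId": [1, 1, 1, 2, 2, 3],
    "playId": [10, 10, 11, 10, 10, 10],
    "season": [2018, 2018, 2018, 2019, 2019, 2020],
    "specialTeamsPlayType": ["Punt", "Punt", "Kickoff", "Kickoff", "Kickoff", "Punt"],
})

# Counts per sampled row (what a count plot on the merged frame shows).
row_counts = pd.crosstab(rows["season"], rows["specialTeamsPlayType"])

# Counts per distinct play: keep one row for each (gameId, playId) pair first.
plays = rows.drop_duplicates(subset=["gameId", "playId"])
play_counts = pd.crosstab(plays["season"], plays["specialTeamsPlayType"])

print(row_counts)
print(play_counts)
```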

---
**Q: What is the distribution of Season with respect to Special Teams Play Type?**

---

In [None]:
# distribution of season with respect to specialTeamsPlayType
fig = px.sunburst(nfl2022_data_sample, path=['season', 'specialTeamsPlayType'], color='specialTeamsPlayType', hover_data=['season'])
fig.show()

The plot above shows that every season has a similar distribution of Special Teams Play Type, as expected.

---
**Q: What is the distribution of Team with respect to Special Teams Play Type?**

---

In [None]:
# distribution of team with respect to specialTeamsPlayType
fig = px.sunburst(nfl2022_data_sample, path=['team', 'specialTeamsPlayType'], color='specialTeamsPlayType', hover_data=['team'])
fig.show()

The plot above shows that, for both the away and home teams, Special Teams Play Type has a similar distribution, with more Punts than Kickoffs.

---
**Q: What is the distribution of Play Direction with respect to Special Teams Play Type?**

---

In [None]:
# distribution of playDirection with respect to specialTeamsPlayType
fig = px.sunburst(nfl2022_data_sample, path=['playDirection', 'specialTeamsPlayType'], color='specialTeamsPlayType', hover_data=['playDirection'])
fig.show()

The plot above shows that, for both the right and left play directions, Special Teams Play Type has a similar distribution, with more Punts than Kickoffs. In this sample data, the number of Punts is higher for the right play direction than for the left.

---
**Q: What is the distribution of Special Teams Play Type with respect to Special Teams Result?**


---

In [None]:
# distribution of specialTeamsPlayType with respect to specialTeamsResult
fig = px.sunburst(nfl2022_data_sample, path=['specialTeamsPlayType', 'specialTeamsResult'], color='specialTeamsPlayType', hover_data=['specialTeamsPlayType'])
fig.show()

The plot above shows the most common Special Teams Result for each play type in this sample data:

* Kickoff: mostly Return and Touchback.
* Punt: mostly Return, Fair Catch, and Downed.
* Field Goal: mostly Kick Attempt Good, followed by Kick Attempt No Good.
* Extra Point: mostly Kick Attempt Good.

---
**Q: What is the distribution for Kick Type and Kick Length with respect to Special Teams Play Type?**

---

In [None]:
#distribution of kickType, kickLength with respect to specialTeamsPlayType
sns.catplot(data=nfl2022_data_sample, x="kickType", y="kickLength", hue="specialTeamsPlayType", height=8, aspect=0.8)
plt.show()

D: Deep, N: Normal - standard punt style, A: Nose down or Aussie-style punts, Q: Squib, P: Pooch kick, F: Flat, O: Obvious Onside, K: Free Kick, R: Rugby style punt, S: Surprise Onside, B: Deep Direct OOB
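
The one-letter codes in the legend above can be turned into readable labels before plotting. A minimal sketch, assuming a new column name `kickTypeLabel` (not part of the original data) and a small made-up frame:

```python
import pandas as pd

# Code-to-label mapping taken from the kickType legend in this notebook.
KICK_TYPE_LABELS = {
    "D": "Deep",
    "N": "Normal",
    "A": "Nose down / Aussie-style",
    "Q": "Squib",
    "P": "Pooch kick",
    "F": "Flat",
    "O": "Obvious Onside",
    "K": "Free Kick",
    "R": "Rugby style punt",
    "S": "Surprise Onside",
    "B": "Deep Direct OOB",
}

# Toy frame with one missing code (values are made up).
toy = pd.DataFrame({"kickType": ["D", "N", "Q", None]})

# Codes that are missing or not in the mapping become NaN.
toy["kickTypeLabel"] = toy["kickType"].map(KICK_TYPE_LABELS)
print(toy)
```

Passing the label column as `x` to the plotting call would then put the full names on the axis instead of single letters.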

The plot above shows that, in this sample data, most Kickoffs have a kickType of Deep, Squib, Pooch Kick, Flat, Obvious Onside, or Free Kick, while most Punts have a kickType of Normal or Nose down, with some Rugby style punts.

---
**Q: What is the distribution for Kick Type and Play Result with respect to Special Teams Play Type?**

---

In [None]:
#distribution of kickType, playResult with respect to specialTeamsPlayType
sns.catplot(data=nfl2022_data_sample, x="kickType", y="playResult", hue="specialTeamsPlayType", height=8, aspect=0.8)
plt.show()

D: Deep, N: Normal - standard punt style, A: Nose down or Aussie-style punts, Q: Squib, P: Pooch kick, F: Flat, O: Obvious Onside, K: Free Kick, R: Rugby style punt, S: Surprise Onside, B: Deep Direct OOB

As with Kick Length, the plot above shows that, in this sample data, most Kickoffs have a kickType of Deep, Squib, Pooch Kick, Flat, Obvious Onside, or Free Kick, while most Punts have a kickType of Normal or Nose down, with some Rugby style punts.

---
**Q: What is the distribution of Special Teams Result and Special Teams Play Type with respect to Kick Length?**

---

In [None]:
# get specialTeamsResult and specialTeamsPlayType sorted by kickLength
kickLength = nfl2022_data_sample.groupby(by = ['specialTeamsPlayType','specialTeamsResult'], as_index = False)['kickLength'].agg('median').sort_values(by ='kickLength', ascending = False)
# set plot figure size
figure = plt.figure(figsize = [15, 10])
# plot comparison of specialTeamsResult
sns.barplot(data = kickLength[0:12],x = 'kickLength', y ='specialTeamsResult', hue='specialTeamsPlayType',palette='winter')
# set plot label
plt.xlabel(xlabel = 'Kick Length')
plt.ylabel(ylabel = 'Special Teams Result')
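# show vertical grid lines (the flag is passed positionally; the old 'b' keyword was renamed in newer matplotlib)
plt.grid(True, axis = 'x')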
plt.show()

---
**Q: What is the distribution of Special Teams Result and Special Teams Play Type with respect to Play Result?**

---

In [None]:
# get specialTeamsResult and specialTeamsPlayType sorted by playResult
playResult = nfl2022_data_sample.groupby(by = ['specialTeamsPlayType','specialTeamsResult'], as_index = False)['playResult'].agg('mean').sort_values(by ='playResult', ascending = False)
# set plot figure size
figure = plt.figure(figsize = [15, 10])
# plot comparison of specialTeamsResult
sns.barplot(data = playResult[0:12], x = 'playResult', y ='specialTeamsResult',hue='specialTeamsPlayType',palette='summer')
# set plot label
plt.xlabel(xlabel = 'Play Result')
plt.ylabel(ylabel = 'Special Teams Result')
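# show vertical grid lines (the flag is passed positionally; the old 'b' keyword was renamed in newer matplotlib)
plt.grid(True, axis = 'x')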
plt.show()

---
**Q: What is the distribution for Kick Length and Play Result with respect to Special Teams Play Type?**

---

In [None]:
# distribution of kickLength and playResult with respect to specialTeamsPlayType
# note: log_x=True puts playResult on a log axis, which cannot display zero or negative values
fig = px.scatter(nfl2022_data_sample, x="playResult", y="kickLength",size ='absoluteYardlineNumber',color="specialTeamsPlayType",
                 hover_name="specialTeamsPlayType", log_x=True, size_max=10)
fig.show()

---
# Summary
---

**EDA Involves:**
* Generate actionable, practical, and novel insights from player tracking data that corresponds to special teams play. 

Potential topics for EDA:

* Create a new special teams metric. The winning algorithm from the 2020 Big Data Bowl has been adopted by the NFL/NFL Network for on air distribution, and we are hopeful that there could be a new stat for special teams plays that could come from this year’s competition.
* Quantify special teams strategy. Special teams’ coaches are among the most creative and innovative in the league. Compare/contrast how each team game plans. Which strategies yield the best results? What are other strategies that could be adopted?
* Rank special teams players. Each team employs a variety of players (including long snappers, kickers, punters, and other utility special teams players). How do they stack up with respect to one another?
---

**Highlights:**

Note: Click/Touch on any category (Kickoff, Punt, Field Goal or Extra Point) in plot to drill down further.

---
**Q: What are the Game Characteristics for top 5 Play Result with respect to Special Teams Play Type?**

---

In [None]:
# get distribution of specialTeamsPlayType,specialTeamsResult,season,homeTeamAbbr and visitorTeamAbbr with respect to playResult
kickoff = nfl2022_data_sample[nfl2022_data_sample.specialTeamsPlayType.isin(["Kickoff"])].groupby(by =['specialTeamsPlayType','specialTeamsResult','season', 'team','homeTeamAbbr','visitorTeamAbbr'], as_index = False)['playResult'].agg('max').sort_values(by ='playResult', ascending = False)
punt = nfl2022_data_sample[nfl2022_data_sample.specialTeamsPlayType.isin(["Punt"])].groupby(by =['specialTeamsPlayType','specialTeamsResult','season', 'team','homeTeamAbbr','visitorTeamAbbr'], as_index = False)['playResult'].agg('max').sort_values(by ='playResult', ascending = False)
fieldGoal = nfl2022_data_sample[nfl2022_data_sample.specialTeamsPlayType.isin(["Field Goal"])].groupby(by =['specialTeamsPlayType','specialTeamsResult','season', 'team','homeTeamAbbr','visitorTeamAbbr'], as_index = False)['playResult'].agg('max').sort_values(by ='playResult', ascending = False)
extraPoint = nfl2022_data_sample[nfl2022_data_sample.specialTeamsPlayType.isin(["Extra Point"])].groupby(by =['specialTeamsPlayType','specialTeamsResult','season', 'team','homeTeamAbbr','visitorTeamAbbr'], as_index = False)['playResult'].agg('max').sort_values(by ='playResult', ascending = False)
# combine each specialTeamsPlayType aggregation in single dataframe
topGameCharacteristicsbyPlayResult = pd.concat([kickoff[:5],punt[:5],fieldGoal[:5],extraPoint[:5]])
# plot distribution of top 5 game characteristics
fig = px.sunburst(topGameCharacteristicsbyPlayResult, path=['specialTeamsPlayType','specialTeamsResult','season', 'team','homeTeamAbbr','visitorTeamAbbr','playResult'], color='specialTeamsPlayType', hover_data=['specialTeamsPlayType'])
fig.show()

In [None]:
# table view display
topGameCharacteristicsbyPlayResult
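
The cell above builds the top 5 per play type with four near-identical lines, and the same pattern repeats in the cells that follow. One way to fold it into a helper is sketched below on made-up data; the function name `top_n_by_play_type` is hypothetical and not part of the notebook:

```python
import pandas as pd

def top_n_by_play_type(df, group_cols, value_col="playResult", n=5):
    """For each specialTeamsPlayType, return the n groups with the largest max of value_col."""
    grouped = (
        df.groupby(["specialTeamsPlayType"] + group_cols, as_index=False)[value_col]
        .max()
        .sort_values(value_col, ascending=False)
    )
    # head(n) per play type keeps the first n rows of each group in sorted order.
    return grouped.groupby("specialTeamsPlayType", sort=False).head(n)

# Toy data (values are made up).
toy = pd.DataFrame({
    "specialTeamsPlayType": ["Punt", "Punt", "Punt", "Kickoff", "Kickoff"],
    "specialTeamsResult": ["Return", "Fair Catch", "Downed", "Return", "Touchback"],
    "playResult": [45, 38, 50, 30, 25],
})

top2 = top_n_by_play_type(toy, ["specialTeamsResult"], n=2)
print(top2)
```

Unlike the `pd.concat` version, the rows come back ordered by `playResult` across all play types rather than in a fixed Kickoff, Punt, Field Goal, Extra Point order.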

---
**Q: What are the Player Characteristics for top 5 Play Result with respect to Special Teams Play Type?**

---

In [None]:
# get distribution of specialTeamsPlayType, specialTeamsResult, displayName,collegeName,height, weight and position with respect to playResult
kickoff = nfl2022_data_sample[nfl2022_data_sample.specialTeamsPlayType.isin(["Kickoff"])].groupby(by =['specialTeamsPlayType','specialTeamsResult','displayName','collegeName','height','weight','position'], as_index = False)['playResult'].agg('max').sort_values(by ='playResult', ascending = False)
punt = nfl2022_data_sample[nfl2022_data_sample.specialTeamsPlayType.isin(["Punt"])].groupby(by =['specialTeamsPlayType','specialTeamsResult','displayName','collegeName','height','weight','position'], as_index = False)['playResult'].agg('max').sort_values(by ='playResult', ascending = False)
fieldGoal = nfl2022_data_sample[nfl2022_data_sample.specialTeamsPlayType.isin(["Field Goal"])].groupby(by =['specialTeamsPlayType','specialTeamsResult','displayName','collegeName','height','weight','position'], as_index = False)['playResult'].agg('max').sort_values(by ='playResult', ascending = False)
extraPoint = nfl2022_data_sample[nfl2022_data_sample.specialTeamsPlayType.isin(["Extra Point"])].groupby(by =['specialTeamsPlayType','specialTeamsResult','displayName','collegeName','height','weight','position'], as_index = False)['playResult'].agg('max').sort_values(by ='playResult', ascending = False)
# combine each specialTeamsPlayType aggregation in single dataframe
topPlayerCharacteristicsbyPlayResult = pd.concat([kickoff[:5],punt[:5],fieldGoal[:5],extraPoint[:5]])
# plot distribution of top 5 player characteristics
fig = px.sunburst(topPlayerCharacteristicsbyPlayResult, path=['specialTeamsPlayType','specialTeamsResult','displayName','collegeName', 'height','weight','position','playResult'], color='specialTeamsPlayType', hover_data=['specialTeamsPlayType'])
fig.show()

In [None]:
# table view display
topPlayerCharacteristicsbyPlayResult

---
**Q: What are the Player Tracking Characteristics for top 5 Play Result with respect to Special Teams Play Type?**

---

In [None]:
# get distribution of specialTeamsPlayType, specialTeamsResult, displayName,x,y,s and a with respect to playResult
kickoff = nfl2022_data_sample[nfl2022_data_sample.specialTeamsPlayType.isin(["Kickoff"])].groupby(by =['specialTeamsPlayType','specialTeamsResult','displayName','x','y','s','a'], as_index = False)['playResult'].agg('max').sort_values(by ='playResult', ascending = False)
punt = nfl2022_data_sample[nfl2022_data_sample.specialTeamsPlayType.isin(["Punt"])].groupby(by =['specialTeamsPlayType','specialTeamsResult','displayName','x','y','s','a'], as_index = False)['playResult'].agg('max').sort_values(by ='playResult', ascending = False)
fieldGoal = nfl2022_data_sample[nfl2022_data_sample.specialTeamsPlayType.isin(["Field Goal"])].groupby(by =['specialTeamsPlayType','specialTeamsResult','displayName','x','y','s','a'], as_index = False)['playResult'].agg('max').sort_values(by ='playResult', ascending = False)
extraPoint = nfl2022_data_sample[nfl2022_data_sample.specialTeamsPlayType.isin(["Extra Point"])].groupby(by =['specialTeamsPlayType','specialTeamsResult','displayName','x','y','s','a'], as_index = False)['playResult'].agg('max').sort_values(by ='playResult', ascending = False)
# combine each specialTeamsPlayType aggregation in single dataframe
topPlayerTrackingCharacteristicsbyPlayResult = pd.concat([kickoff[:5],punt[:5],fieldGoal[:5],extraPoint[:5]])
# plot distribution of top 5 player tracking characteristics
fig = px.sunburst(topPlayerTrackingCharacteristicsbyPlayResult, path=['specialTeamsPlayType','specialTeamsResult','displayName','x','y','s','a','playResult'], color='specialTeamsPlayType', hover_data=['specialTeamsPlayType'])
fig.show()

In [None]:
# table view display
topPlayerTrackingCharacteristicsbyPlayResult

---
**Q: What are the Play Characteristics for top 5 play result with respect to Special Teams Play Type?**

---

In [None]:
# get distribution of specialTeamsPlayType, specialTeamsResult, possessionTeam, penaltyYards, playDescription with respect to playResult
kickoff = nfl2022_data_sample[nfl2022_data_sample.specialTeamsPlayType.isin(["Kickoff"])].groupby(by =['specialTeamsPlayType','specialTeamsResult','possessionTeam','penaltyYards','playDescription'], as_index = False)['playResult'].agg('max').sort_values(by ='playResult', ascending = False)
punt = nfl2022_data_sample[nfl2022_data_sample.specialTeamsPlayType.isin(["Punt"])].groupby(by =['specialTeamsPlayType','specialTeamsResult','possessionTeam','penaltyYards','playDescription'], as_index = False)['playResult'].agg('max').sort_values(by ='playResult', ascending = False)
fieldGoal = nfl2022_data_sample[nfl2022_data_sample.specialTeamsPlayType.isin(["Field Goal"])].groupby(by =['specialTeamsPlayType','specialTeamsResult','possessionTeam','penaltyYards','playDescription'], as_index = False)['playResult'].agg('max').sort_values(by ='playResult', ascending = False)
extraPoint = nfl2022_data_sample[nfl2022_data_sample.specialTeamsPlayType.isin(["Extra Point"])].groupby(by =['specialTeamsPlayType','specialTeamsResult','possessionTeam','penaltyYards','playDescription'], as_index = False)['playResult'].agg('max').sort_values(by ='playResult', ascending = False)
# combine each specialTeamsPlayType aggregation in single dataframe
topPlayCharacteristicsbyPlayResult = pd.concat([kickoff[:5],punt[:5],fieldGoal[:5],extraPoint[:5]])
# plot distribution of top 5 play characteristics
fig = px.sunburst(topPlayCharacteristicsbyPlayResult, path=['specialTeamsPlayType', 'specialTeamsResult','possessionTeam','penaltyYards','playResult','playDescription'], color='specialTeamsPlayType', hover_data=['specialTeamsPlayType'])
fig.show()

In [None]:
# table view display
topPlayCharacteristicsbyPlayResult

---
**Q: What are the Play Tracking Characteristics for top 5 play result with respect to Special Teams Play Type?**

---

In [None]:
# get distribution of specialTeamsPlayType, specialTeamsResult, team, playDirection, event with respect to playResult
kickoff = nfl2022_data_sample[nfl2022_data_sample.specialTeamsPlayType.isin(["Kickoff"])].groupby(by =['specialTeamsPlayType','specialTeamsResult','team','playDirection','event'], as_index = False)['playResult'].agg('max').sort_values(by ='playResult', ascending = False)
punt = nfl2022_data_sample[nfl2022_data_sample.specialTeamsPlayType.isin(["Punt"])].groupby(by =['specialTeamsPlayType','specialTeamsResult','team','playDirection','event'], as_index = False)['playResult'].agg('max').sort_values(by ='playResult', ascending = False)
fieldGoal = nfl2022_data_sample[nfl2022_data_sample.specialTeamsPlayType.isin(["Field Goal"])].groupby(by =['specialTeamsPlayType','specialTeamsResult','team','playDirection','event'], as_index = False)['playResult'].agg('max').sort_values(by ='playResult', ascending = False)
extraPoint = nfl2022_data_sample[nfl2022_data_sample.specialTeamsPlayType.isin(["Extra Point"])].groupby(by =['specialTeamsPlayType','specialTeamsResult','team','playDirection','event'], as_index = False)['playResult'].agg('max').sort_values(by ='playResult', ascending = False)
# combine each specialTeamsPlayType aggregation in single dataframe
topPlayTrackingCharacteristicsbyPlayResult = pd.concat([kickoff[:5],punt[:5],fieldGoal[:5],extraPoint[:5]])
# plot distribution of top 5 play tracking characteristics
fig = px.sunburst(topPlayTrackingCharacteristicsbyPlayResult, path=['specialTeamsPlayType', 'specialTeamsResult','team','playDirection','event', 'playResult'], color='specialTeamsPlayType', hover_data=['specialTeamsPlayType'])
fig.show()

In [None]:
# table view display
topPlayTrackingCharacteristicsbyPlayResult

---
**Thank you and Happy Learning.**

---

In [None]:
thank_you_str="Thanks,Happy Learning,Collaboration,Thankyou,Keep Learning"
# create WordCloud with converted string
wordcloud = WordCloud(width = 1000, height = 500, random_state=1, background_color='white', collocations=True).generate(thank_you_str)
plt.figure(figsize=(20, 20))
plt.imshow(wordcloud) 
plt.axis("off")
plt.show()