# NFL Big Data Bowl 2022

---

**Problem Statement:**

* Before National Football League (NFL) coaches celebrate a big W, they strategize ways to improve field position and score points. 
* Both of these objectives receive significant contributions from special teams plays, which consist of punts, kickoffs, field goals and extra points. 
* These play types take on important roles in a game’s final score—so much so that coaches say they're a third of the game. 
* Yet special teams remain an understudied part of American football, with an opportunity for data science to offer better ways to understand its impact.

**Note:** In this competition, we need to quantify what happens on special teams plays. We might have to create a new special teams metric, quantify team or individual strategies, rank players, or even something we haven’t considered.

---

**Dataset Involves:**

The 2022 Big Data Bowl data contains Next Gen Stats player tracking, play, game, player, and PFF scouting data for all 2018-2020 Special Teams plays. 

* **Game data:** The games.csv contains the teams playing in each game. The key variable is **gameId**.

* **Play data:** The plays.csv file contains play-level information from each game. The key variables are **gameId** and **playId**.

* **Player data:** The players.csv file contains player-level information from players that participated in any of the tracking data files. The key variable is **nflId**.

* **Tracking data:** Files tracking[season].csv contain player tracking data from season [season]. The key variables are **gameId**, **playId**, and **nflId**.

* **PFF Scouting data:** The PFFScoutingData.csv file contains play-level scouting information for each game. The key variables are **gameId** and **playId**.
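The key variables listed above are what link the files together. The following is a minimal sketch on toy data (the rows are invented and only a few columns are shown), assuming the keys behave as described: `gameId` identifies a game, and a play is identified by the pair (`gameId`, `playId`).

```python
import pandas as pd

# Toy stand-ins for games.csv and plays.csv (values are made up for illustration).
games = pd.DataFrame({
    "gameId": [1, 2],
    "season": [2018, 2019],
    "homeTeamAbbr": ["PHI", "ATL"],
})
plays = pd.DataFrame({
    "gameId": [1, 1, 2],
    "playId": [10, 20, 10],
    "specialTeamsPlayType": ["Kickoff", "Punt", "Field Goal"],
})

# playId repeats across games, so a play is identified by (gameId, playId).
assert not plays.duplicated(subset=["gameId", "playId"]).any()

# Attach game-level context to each play; validate raises if gameId is duplicated in games.
plays_with_games = plays.merge(games, on="gameId", how="left", validate="many_to_one")
print(plays_with_games)
```

The same pattern would apply to the PFF scouting and tracking files, joining on `gameId` and `playId` (plus `nflId` for player-level rows).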

---

**Importing Libraries:**

* We use Python for data pre-processing and data analysis.

* Import the libraries needed for loading and exploring the data up front; other libraries can be imported later as needed.

---


In [None]:
# NumPy for numerical computing
import numpy as np
# pandas for data loading and analysis
import pandas as pd
# glob finds all pathnames matching a specified pattern
import glob
# os provides a portable way of using operating-system-dependent functionality
import os
# pyplot interface of matplotlib for plotting
import matplotlib.pyplot as plt
# seaborn for statistical data visualization
import seaborn as sns
# WordCloud for text data visualization
from wordcloud import WordCloud
# matplotlib top-level package
import matplotlib
# datetime for working with dates and times
from datetime import datetime

# Data Definition/Description of Game data

---

**Data Definition:**
* **Game data:** The games.csv contains the teams playing in each game. The key variable is **gameId**.

In [None]:
# Loading dataset games.csv
game_data = pd.read_csv('../input/nfl-big-data-bowl-2022/games.csv')

---
**Q: What is the structure of game dataset?**

---

| No. | Feature Name | Description of the feature |
| :-- | :--| :--| 
|01| **gameId**   | Game identifier, unique (numeric) |
|02| **season** | Season of game                 |
|03| **week**   | Week of game                 |
|04| **gameDate**   | Game Date (time, mm/dd/yyyy)  |
|05| **gameTimeEastern**   | Start time of game (time, HH:MM:SS, EST)|
|06| **homeTeamAbbr**   | Home team three-letter code (text)  |
|07| **visitorTeamAbbr**   | Visiting team three-letter code (text)|

In [None]:
# get shape of dataframe
print('Shape of game dataset is:', game_data.shape)

# print summary of dataframe
game_data.info()

**game dataset information:**

* There are 764 data points (rows) and 7 features (columns) in the game dataset.
* Three columns are numerical and four are of categorical (object) type.
* There are no missing values: the non-null count equals the row count for all seven columns.


---
**Q: What does the data look like for the game dataset?**

---

In [None]:
# print first 10 rows of dataframe
game_data.head(10)

---
**Q: What are the descriptive statistics for the game dataset?**

---

In [None]:
# print descriptive statistics for both object and numeric type
game_data.describe(include='all').round(1)

**game dataset data description:**

* **gameId** has no missing values: its count is 764.
* The mean equals the median for **season** and **week**, which suggests roughly symmetric distributions (this alone does not establish normality).
* There are 33 unique **homeTeamAbbr** values, one more than the 32 NFL teams; this is likely because the Raiders appear under two codes after relocating from Oakland to Las Vegas in 2020.
 * PHI is the most frequent **homeTeamAbbr**.
* There are 33 unique **visitorTeamAbbr** values.
 * ATL is the most frequent **visitorTeamAbbr**.
* There are 16 unique **gameTimeEastern** values.
 * 13:00:00 is the most frequent **gameTimeEastern**.
* There are 151 unique **gameDate** values.
 * 01/03/2021 is the most frequent **gameDate**.

---

# Data Analysis/EDA of Game data

---

In [None]:
# convert gameDate from object type to datetime type
game_data['gameDate']= pd.to_datetime(game_data['gameDate'])

# group by date without index
game_data_by_date = game_data.groupby(by = 'gameDate', as_index = False).agg('max')
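A small sketch of what `agg('max')` does in this grouping may help when reading the plots that follow. The toy frame below is invented for illustration; it shows that the maximum is taken column by column, so a grouped row can combine values from different games played on the same date.

```python
import pandas as pd

# Toy stand-in for games.csv: two games on the first date, one on the second.
toy_games = pd.DataFrame({
    "gameDate": pd.to_datetime(["2018-09-09", "2018-09-09", "2018-09-10"]),
    "gameId": [1, 2, 3],
    "homeTeamAbbr": ["BAL", "NYG", "DET"],
    "visitorTeamAbbr": ["BUF", "ATL", "NYJ"],
})

# Same call as in the notebook: one row per date, each column reduced independently.
by_date = toy_games.groupby(by="gameDate", as_index=False).agg("max")
print(by_date)

# For 2018-09-09 the result pairs NYG (from game 2) with BUF (from game 1):
# string columns are reduced alphabetically and independently of each other.
```

If one representative game per date is wanted instead, `drop_duplicates(subset='gameDate')` would keep an actual row; which option fits depends on what the plot is meant to show.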

---
**Q: What is the trend for game dataset with respect to Game Date for Home Team?**

---

In [None]:
# trend for game dataset with respect to homeTeamAbbr
figure = plt.figure(figsize = [15, 7])
sns.lineplot(x = 'gameDate', y = 'homeTeamAbbr', hue= 'season', data = game_data_by_date, color = '#D96552')

plt.xlabel('Game Date', size = 14)
plt.ylabel('Home Team', size = 14)
plt.title('Game Trend', size = 16)
plt.grid(True, axis = 'y')
plt.show()

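The per-date home-team line varies across seasons, and the 2020 season data points appear on the higher side. Because the y-axis holds team abbreviations, vertical position reflects the ordering of team codes rather than a numeric quantity.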

Note: The National Football League (NFL) is a professional American football league consisting of 32 teams, divided equally between the National Football Conference (NFC) and the American Football Conference (AFC).

---
**Q: What is the trend for game dataset with respect to Game Date for Visitor Team?**

---

In [None]:
# trend for game dataset with respect to visitorTeamAbbr
figure = plt.figure(figsize = [15, 7])
sns.lineplot(x = 'gameDate', y = 'visitorTeamAbbr', hue= 'season', data = game_data_by_date, color = '#D96552')

plt.xlabel('Game Date', size = 14)
plt.ylabel('Visitor Team', size = 14)
plt.title('Game Trend', size = 16)
plt.grid(True, axis = 'y')
plt.show()

The per-date visitor-team line varies across seasons, and the 2020 season data points appear on the higher side.

---
**Q: What is the trend for game dataset with respect to Game Week for Home Team?**

---

In [None]:
# trend for game dataset with respect to homeTeamAbbr
figure = plt.figure(figsize = [15, 7])
sns.lineplot(x = 'week', y = 'homeTeamAbbr', hue= 'season', data = game_data_by_date, color = '#D96552')

plt.xlabel('Game Week', size = 14)
plt.ylabel('Home Team', size = 14)
plt.title('Game Trend', size = 16)
plt.grid(True, axis = 'y')
plt.show()

The weekly home-team line varies across seasons, and the 2020 season appears to show the most variation.

Note: For the 2018-2020 seasons covered by this dataset, the NFL regular season ran 17 weeks with each team playing 16 games, followed by a single-elimination playoff (12 teams in 2018 and 2019, 14 teams in 2020) culminating in the Super Bowl, the league's championship game. The 18-week, 17-game regular season began in 2021.
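A count of games per season and week is a complementary way to inspect the schedule structure described in the note. The sketch below uses an invented toy frame with the `season` and `week` columns from the game table; it is an illustration, not output from the real data.

```python
import pandas as pd

# Toy stand-in for games.csv; the rows are invented for illustration only.
toy_games = pd.DataFrame({
    "gameId": [1, 2, 3, 4, 5],
    "season": [2018, 2018, 2018, 2019, 2019],
    "week": [1, 1, 2, 1, 2],
})

# Rows = season, columns = week, cells = number of games played.
games_per_week = (
    toy_games.groupby(["season", "week"])
    .size()
    .unstack(fill_value=0)
)
print(games_per_week)
```

On the real data, the number of columns in this table would show directly how many distinct weeks each season contains.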

---
**Q: What is the trend for game dataset with respect to Game Week for Visitor Team?**

---

In [None]:
# trend for game dataset with respect to visitorTeamAbbr
figure = plt.figure(figsize = [15, 7])
sns.lineplot(x = 'week', y = 'visitorTeamAbbr', hue= 'season', data = game_data_by_date, color = '#D96552')

plt.xlabel('Game Week', size = 14)
plt.ylabel('Visitor Team', size = 14)
plt.title('Game Trend', size = 16)
plt.grid(True, axis = 'y')
plt.show()

The weekly visitor-team line varies across seasons, and the 2020 season appears to show the most variation.

---

---
# Data Definition/Description of Play data

---

**Data Definition:**
* **Play data:** The plays.csv file contains play-level information from each game. The key variables are **gameId** and **playId**.


In [None]:
# Loading dataset plays.csv
play_data = pd.read_csv('../input/nfl-big-data-bowl-2022/plays.csv')

---
**Q: What is the structure of play dataset?**

---

| No. | Feature Name | Description of the feature |
| :-- | :--| :--| 
|01| **gameId**   | Game identifier, unique (numeric) |
|02| **playId** | Play identifier, not unique across games (numeric) |
|03| **playDescription**   | Description of play (text)              |
|04| **quarter**   | Game quarter (numeric)  |
|05| **down**   | Down (numeric)|
|06| **yardsToGo**   | Distance needed for a first down (numeric)  |
|07| **possessionTeam**   | Team punting, placekicking or kicking off the ball (text)|
|08| **specialTeamsPlayType**   | Formation of play: Extra Point, Field Goal, Kickoff or Punt (text)             |
|09| **specialTeamsResult**   | Special Teams outcome of play dependent on play type: Blocked Kick Attempt, Blocked Punt, Downed, Fair Catch, Kick Attempt Good, Kick Attempt No Good, Kickoff Team Recovery, Muffed, Non-Special Teams Result, Out of Bounds, Return or Touchback (text) |
|10| **kickerId**   | nflId of placekicker, punter or kickoff specialist on play (numeric)|
|11| **returnerId**   | nflId(s) of returner(s) on play if there was a special teams return. Multiple returners on a play are separated by a ; (text)  |
|12| **kickBlockerId**   | nflId of blocker of kick on play if there was a blocked field goal or blocked punt (numeric)|
|13| **yardlineSide**   | 3-letter team code corresponding to line-of-scrimmage (text)              |
|14| **yardlineNumber**   | Yard line at line-of-scrimmage (numeric)  |
|15| **gameClock**   | Time on clock of play (MM:SS)|
|16| **penaltyCodes**   |  NFL categorization of the penalties that occurred on the play. Multiple penalties on a play are separated by a ; (text)  |
|17| **penaltyJerseyNumbers**   | Jersey number and team code of the player committing each penalty. Multiple penalties on a play are separated by a ; (text)|
|18| **penaltyYards**   | Yards gained by possessionTeam by penalty (numeric)|
|19| **preSnapHomeScore**   | Home score prior to the play (numeric)  |
|20| **preSnapVisitorScore**   | Visiting team score prior to the play (numeric)|
|21| **passResult**   | Scrimmage outcome of the play if specialTeamsResult is "Non-Special Teams Result" (C: Complete pass, I: Incomplete pass, S: Quarterback sack, IN: Intercepted pass, R: Scramble, ' ': Designed Rush, text)  |
|22| **kickLength**   | Kick length in air of kickoff, field goal or punt (numeric)|
|23| **kickReturnYardage**   | Yards gained by return team if there was a return on a kickoff or punt (numeric)|
|24| **playResult**   | Net yards gained by the kicking team, including penalty yardage (numeric)|
|25| **absoluteYardlineNumber**   | Location of ball downfield in tracking data coordinates (numeric)|

In [None]:
# get shape of dataframe
print('Shape of play dataset is:', play_data.shape)

# print summary of dataframe
play_data.info()

**play dataset information:**

* There are 19979 data points (rows) and 25 features (columns) in the play dataset.
* Fifteen columns are numerical and ten are of categorical (object) type.
* The following columns have missing values (non-null count below 19979):
  * kickerId, returnerId, kickBlockerId,
  * yardlineSide, penaltyCodes, penaltyJerseyNumbers, penaltyYards,
  * passResult, kickLength and kickReturnYardage.
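Rather than reading missing counts off the `info()` output, they can be tabulated directly. The following is a small sketch on an invented toy frame; `missing_summary` is a hypothetical helper name, not part of pandas.

```python
import numpy as np
import pandas as pd


def missing_summary(df):
    """Return per-column missing counts and percentages, largest first."""
    counts = df.isna().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (100 * counts / len(df)).round(1),
    })
    # Keep only columns that actually have missing values.
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)


# Toy stand-in for a few play columns (values invented for illustration).
toy_plays = pd.DataFrame({
    "playId": [1, 2, 3, 4],
    "kickerId": [100.0, np.nan, 101.0, 102.0],
    "kickBlockerId": [np.nan, np.nan, np.nan, 7.0],
})
report = missing_summary(toy_plays)
print(report)
```

Calling `missing_summary(play_data)` in the notebook would list the columns named above together with how much of each is missing.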


---
**Q: What does the data look like for the play dataset?**

---

In [None]:
# print first 10 rows of dataframe
play_data.head(10)

---
**Q: What are the descriptive statistics for the play dataset?**

---

In [None]:
# print descriptive statistics for numeric type
play_data.describe().round(1)

**play dataset data description:**

*  **gameId** and **playId** have no missing values: both have a count of 19979.
*  The mean of **quarter** is close to its median, which suggests a roughly symmetric distribution.
*  The mean of **down** is higher than its median, which suggests a positively skewed distribution.
*  The mean of **yardsToGo** is higher than its median, which suggests a positively skewed distribution.
 * **yardsToGo** appears to have outliers.
* **kickerId** has missing values: its count is below 19979.
* **kickBlockerId** is mostly missing: its count is only 100.
*  The mean of **yardlineNumber** is lower than its median, which suggests a negatively skewed distribution.
*  The mean of **penaltyYards** is lower than its median, which suggests a negatively skewed distribution.
*  The mean of **preSnapHomeScore** is higher than its median, which suggests a positively skewed distribution.
*  The mean of **preSnapVisitorScore** is higher than its median, which suggests a positively skewed distribution.
*  The mean of **kickLength** is close to its median, which suggests a roughly symmetric distribution.
 * **kickLength** has missing values: its count is below 19979.
*  The mean of **kickReturnYardage** is close to its median, which suggests a roughly symmetric distribution.
 * **kickReturnYardage** has missing values: its count is below 19979.
*  The mean of **absoluteYardlineNumber** is close to its median, which suggests a roughly symmetric distribution.
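The mean-versus-median comparison used above is a quick heuristic for skew. As a sketch of how it relates to a direct skewness estimate, the example below uses two invented lists and the standard-library `statistics` module; `skew_summary` is a hypothetical helper name.

```python
from statistics import mean, median, pstdev


def skew_summary(values):
    """Return (mean, median, moment skewness) for a list of numbers."""
    m = mean(values)
    sd = pstdev(values)
    # Population skewness: average cubed z-score; 0 for perfectly symmetric data.
    skew = 0.0 if sd == 0 else sum(((v - m) / sd) ** 3 for v in values) / len(values)
    return m, median(values), skew


symmetric = [1, 2, 2, 3, 3, 3, 4, 4, 5]        # invented, mirror-symmetric around 3
right_tailed = [1, 1, 1, 2, 2, 3, 4, 10, 20]   # invented, long right tail

sym_stats = skew_summary(symmetric)
tail_stats = skew_summary(right_tailed)
print("symmetric   :", sym_stats)
print("right-tailed:", tail_stats)
```

Note that mean close to median only indicates symmetry: a uniform distribution also has equal mean and median without being normal, so a histogram or a normality test is needed before treating a column as normally distributed. On the real data, `play_data.skew(numeric_only=True)` gives a comparable per-column estimate.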

In [None]:
# print descriptive statistics for object type
play_data.describe(include=[object])

**play dataset data description:**

* There are 12355 unique **playDescription** values.
 * The most frequent **playDescription** is: W.Lutz kicks 65 yards from NO 35 to end zone.
* There are 33 unique **possessionTeam** values.
 * BAL is the most frequent **possessionTeam**.
* There are 4 unique **specialTeamsPlayType** values.
 * Kickoff is the most frequent **specialTeamsPlayType**.
* There are 12 unique **specialTeamsResult** values.
 * Kick Attempt Good is the most frequent **specialTeamsResult**.
* **returnerId** has missing values: its count is below 19979.
* There are 33 unique **yardlineSide** values.
 * NYJ is the most frequent **yardlineSide**.
 * **yardlineSide** has missing values: its count is below 19979.
* There are 900 unique **gameClock** values.
 * 15:00 is the most frequent **gameClock**.
* There are 71 unique **penaltyCodes** values.
 * OH is the most frequent **penaltyCodes** value.
 * **penaltyCodes** has missing values: its count is below 19979.
* There are 700 unique **penaltyJerseyNumbers** values.
 * BAL 41 is the most frequent **penaltyJerseyNumbers** value.
 * **penaltyJerseyNumbers** has missing values: its count is below 19979.
* There are 4 unique **passResult** values.
 * C is the most frequent **passResult**.
 * **passResult** has missing values: its count is below 19979.

---

---
# Data Analysis/EDA of Play data

---

---
**Q: What is the percentage distribution count for Possession Team?**

---

In [None]:
# plot percentage distribution count of possessionTeam
team_counts = play_data['possessionTeam'].value_counts()
# one explode offset per slice, so the list always matches the number of teams
team_counts.plot(kind='pie', explode=[0.1] * len(team_counts), fontsize=14, autopct='%3.1f%%',
                 figsize=(20,15), shadow=True, startangle=135, legend=False, cmap='rainbow')
plt.suptitle(t = 'Percentage distribution count for Possession Team', y = 1.05, size = 30)
plt.ylabel('')
plt.axis('equal')
plt.legend(labels = team_counts.index, loc ='lower left', frameon = True)
plt.show()

The plot above shows a nearly equal share of plays across possession teams (the team punting, placekicking or kicking off the ball).
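The "nearly equal" reading of a pie chart can be checked numerically by comparing each team's share with the share a perfectly uniform split would give. This is a sketch on an invented toy series, not on the real `possessionTeam` column.

```python
import pandas as pd

# Toy possessionTeam column (invented); the notebook would use play_data['possessionTeam'].
toy_teams = pd.Series(["BAL", "BAL", "PHI", "ATL", "PHI", "BAL", "ATL", "NYJ"])

# Percentage share of plays per team.
shares = toy_teams.value_counts(normalize=True).mul(100).round(1)
print(shares)

# With k equally represented teams, each share would be 100 / k.
uniform_share = 100 / toy_teams.nunique()
max_deviation = (shares - uniform_share).abs().max()
print("Largest deviation from a uniform split:", max_deviation)
```

A small maximum deviation on the real data would support the claim that possession is spread almost evenly across teams.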

---
**Q: What is the distribution of Penalty Yards with respect to Possession Team?**

---

In [None]:
#distribution of penaltyYards with respect to possessionTeam
sns.displot(data=play_data, x="penaltyYards", hue='possessionTeam', height=8, aspect=.8)
plt.show()

The plot above shows that, for most possession teams, penalty yardage gained is concentrated around 10 yards.

---
**Q: What is the percentage distribution count for Special Teams Play Type?**

---

In [None]:
# plot percentage distribution count of specialTeamsPlayType
play_data['specialTeamsPlayType'].value_counts().plot(kind='pie', explode=[0.1,0.1,0.1,0.1], fontsize=14, autopct='%3.1f%%', 
                                               figsize=(10,5), shadow=True, startangle=135, legend=False, cmap='winter')
plt.suptitle(t = 'Percentage distribution count for Special Teams Play Type', y = 1.05, size = 20)
plt.ylabel('')
plt.axis('equal')
plt.legend(labels = play_data['specialTeamsPlayType'].value_counts().index, loc ='lower left', frameon = True)
plt.show()

The plot above shows that Kickoff and Punt account for the majority of special teams play types in this dataset.

Note: Special teams are the units that handle kickoffs, punts, field goal attempts and extra point attempts.

* A Kickoff is a type of free kick where the ball is placed on a tee (or held) at the kicking team's 35-yard line.

* A Punt is a kick made by dropping the ball from the hands and kicking it before it reaches the ground.

* An Extra Point, also called the Point After Touchdown (PAT), follows a touchdown: the offense kicks the ball through the goal posts to earn one point. Alternatively, two points can be scored by running or throwing the ball into the end zone, similar to a touchdown.

* A Field Goal is scored when the ball is place kicked, drop kicked or free kicked between the opponent's goal posts.

---
**Q: What is the distribution of Kick Length with respect to Special Teams Play Type?**

---

In [None]:
#distribution of kickLength with respect to specialTeamsPlayType
sns.displot(data=play_data, x='kickLength', hue='specialTeamsPlayType',height=8, aspect=.8)
plt.show()

---
**Q: What is the distribution of Kick Return Yardage with respect to Special Teams Play Type?**

---

In [None]:
#distribution of kickReturnYardage with respect to specialTeamsPlayType
sns.displot(data=play_data, x='kickReturnYardage', hue='specialTeamsPlayType',height=8, aspect=.8)
plt.show()

---
**Q: What is the distribution of Play Result with respect to Special Teams Play Type?**

---

In [None]:
#distribution of playResult with respect to specialTeamsPlayType
sns.displot(data=play_data, x='playResult', hue='specialTeamsPlayType',height=8, aspect=.8)
plt.show()

---
**Q: What is the percentage distribution count for Pass Result?**

---

In [None]:
# plot percentage distribution count of passResult
play_data['passResult'].value_counts().plot(kind='pie', explode=[0.1,0.1,0.1,0.1], fontsize=14, autopct='%3.1f%%', 
                                               figsize=(10,5), shadow=True, startangle=135, legend=False, cmap='autumn')
plt.suptitle(t = 'Percentage distribution count for Pass Result', y = 1.05, size = 20)
plt.ylabel('')
plt.axis('equal')
plt.legend(labels = play_data['passResult'].value_counts().index, loc ='lower left', frameon = True)
plt.show()

The plot above shows that C (complete pass) and I (incomplete pass) are the most common scrimmage outcomes for plays with a Non-Special Teams Result.

---
**Q: What is the distribution of Yard Line Number with respect to Pass Result?**

---

In [None]:
#distribution of yardlineNumber with respect to passResult
sns.displot(data=play_data, x='yardlineNumber', hue='passResult',kind='kde')
plt.show()

I: Incomplete pass, C: Complete pass, IN: Intercepted pass, S: Quarterback sack

---

---
**Q: What is the distribution of Pre Snap Home Score with respect to Pass Result?**

---

In [None]:
#distribution of preSnapHomeScore with respect to passResult
sns.displot(data=play_data, x='preSnapHomeScore', hue='passResult',kind='kde')
plt.show()

I: Incomplete pass, C: Complete pass, IN: Intercepted pass, S: Quarterback sack

---
**Q: What is the distribution of Pre Snap Visitor Score with respect to Pass Result?**

---

In [None]:
#distribution of preSnapVisitorScore with respect to passResult
sns.displot(data=play_data, x='preSnapVisitorScore', hue='passResult',kind='kde')
plt.show()

I: Incomplete pass, C: Complete pass, IN: Intercepted pass, S: Quarterback sack

---
**Q: What is the percentage distribution count for Special Teams Result?**

---

In [None]:
# plot percentage distribution count of specialTeamsResult
play_data['specialTeamsResult'].value_counts().plot(kind='pie', explode=[0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.3,0.4,0.5,0.6], fontsize=14, autopct='%3.1f%%', 
                                               figsize=(20,15), shadow=True, startangle=135, legend=False, cmap='rainbow')
plt.suptitle(t = 'Percentage distribution count for Special Teams Result', y = 1.05, size = 30)
plt.ylabel('')
plt.axis('equal')
plt.legend(labels = play_data['specialTeamsResult'].value_counts().index, loc ='lower left', frameon = True)
plt.show()

The plot above shows that Kick Attempt Good, Return and Touchback are the most common outcomes for Special Teams Result.

---
# Data Definition/Description of Player data

---

**Data Definition:**
* **Player data:** The players.csv file contains player-level information from players that participated in any of the tracking data files. The key variable is **nflId**.


In [None]:
# Loading dataset players.csv
player_data = pd.read_csv('../input/nfl-big-data-bowl-2022/players.csv')

---
**Q: What is the structure of player dataset?**

---

| No. | Feature Name | Description of the feature |
| :-- | :--| :--| 
|01| **nflId**   | Player identification number, unique across players (numeric)   |
|02| **height**   | Player height (text)  |
|03| **weight**   | Player weight (numeric)|
|04| **birthDate**   | Date of birth (YYYY-MM-DD)  |
|05| **collegeName**   | Player college (text)|
|06| **Position**   | Player position (text) |
|07| **displayName**   | Player name (text) |

In [None]:
# get shape of dataframe
print('Shape of player dataset is:', player_data.shape)

# print summary of dataframe
player_data.info()

**player dataset information:**

* There are 2732 data points (rows) and 7 features (columns) in the player dataset.
* Two columns are numerical and five are of categorical (object) type.
* The following columns have missing values (non-null count below 2732):
  * birthDate, collegeName

---
**Q: What does the data look like for the player dataset?**

---

In [None]:
# print first 10 rows of dataframe
player_data.head(10)

---
**Q: What are the descriptive statistics for the player dataset?**

---

In [None]:
# print descriptive statistics for both object and numeric type
player_data.describe(include='all').round(1)

**player dataset data description:**

*  **nflId** has no missing values: its count is 2732.
* There are 30 unique **height** values.
 * 6-3 is the most frequent **height**.
*  The mean of **weight** is close to its median, which suggests a roughly symmetric distribution.
* There are 2035 unique **birthDate** values.
 * 1997-02-14 is the most frequent **birthDate**.
 * **birthDate** has missing values.
* There are 322 unique **collegeName** values.
 * Alabama is the most frequent **collegeName**.
 * **collegeName** has missing values.
* There are 26 unique **Position** values.
 * WR is the most frequent **Position**.
* There are 2718 unique **displayName** values.
 * Chris Jones is the most frequent **displayName**.

# Data Analysis/EDA of Player data

---

In [None]:
# convert birthDate from object type to datetime type
player_data['birthDate'] = pd.to_datetime(player_data['birthDate'])
# group by birthDate without index
player_data_by_birthdate = player_data.groupby(by = 'birthDate', as_index = False).agg('max')

---
**Q: What is the distribution count for Birth Date?**

---

In [None]:
#distribution of birthDate
sns.displot(data=player_data, x='birthDate', kde=True)
plt.show()

The distribution above shows that the majority of players were born around the mid-1990s.
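Birth dates are often easier to interpret as ages. The sketch below computes a whole-year age from an ISO `YYYY-MM-DD` string, as described in the player table; the helper name `age_on` and the September 1 reference date are assumptions chosen for illustration.

```python
from datetime import date


def age_on(birth_date, reference):
    """Whole years between an ISO 'YYYY-MM-DD' birth date and a reference date."""
    born = date.fromisoformat(birth_date)
    # Subtract one year if the birthday has not yet occurred in the reference year.
    before_birthday = (reference.month, reference.day) < (born.month, born.day)
    return reference.year - born.year - before_birthday


# Hypothetical reference point: September 1 of the 2020 season.
print(age_on("1997-02-14", date(2020, 9, 1)))  # 23
print(age_on("1997-10-14", date(2020, 9, 1)))  # 22
```

If the real column mixes date formats, parsing it with `pd.to_datetime` first (as the notebook does) and working from the resulting timestamps would be the safer route.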

---
**Q: What is the trend for player dataset with respect to Birth Date and Height?**

---

In [None]:
# trend for player with respect to height
figure = plt.figure(figsize = [15, 7])
sns.lineplot(x = 'birthDate', y = 'height', data = player_data_by_birthdate, color = 'b')

plt.xlabel('Birth Date', size = 14)
plt.ylabel('Height', size = 14)
plt.title('Player Trend', size = 16)
plt.grid(True, axis = 'y')
plt.show()

The trend of height against birth date shows that heights are well spread across the various age groups.
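Because `height` is stored as text (for example `6-3`), it is plotted as a category rather than a number. A conversion to inches makes numeric comparisons possible. The following is a sketch with a hypothetical helper, `height_to_inches`; treating a bare number as inches is an assumption about how such values might appear, not something confirmed by the data description.

```python
def height_to_inches(height):
    """Convert a 'feet-inches' string such as '6-3' to total inches."""
    text = str(height).strip()
    if "-" in text:
        feet, inches = text.split("-")
        return int(feet) * 12 + int(inches)
    # Assumption: a bare number is already expressed in inches.
    return int(text)


print(height_to_inches("6-3"))  # 75
print(height_to_inches("74"))   # 74
```

In the notebook this could be applied as `player_data['height'].map(height_to_inches)` before plotting.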

---
**Q: What is the trend for player dataset with respect to Birth Date and Weight?**

---

In [None]:
# trend for player with respect to weight
figure = plt.figure(figsize = [15, 7])
sns.lineplot(x = 'birthDate', y = 'weight', data = player_data_by_birthdate, color = 'g')

plt.xlabel('Birth Date', size = 14)
plt.ylabel('Weight', size = 14)
plt.title('Player Trend', size = 16)
plt.grid(True, axis = 'y')
plt.show()

The trend of weight against birth date shows that weights are well spread across the various age groups.

---
**Q: What is the percentage distribution count for Position?**

---

In [None]:
# plot percentage distribution count of Position
player_data['Position'].value_counts().plot(kind='pie', explode=[0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.2,0.3,0.4,0.5], fontsize=14, autopct='%3.1f%%', 
                                               figsize=(25,20), shadow=True, startangle=135, legend=False, cmap='winter')
plt.suptitle(t = 'Percentage distribution count for Position', y = 1.05, size = 35)
plt.ylabel('')
plt.axis('equal')
plt.legend(labels = player_data['Position'].value_counts().index, loc ='lower left', frameon = True)
plt.show()

The plot above shows that WR and CB are the most common player positions in this dataset.

---
**Q: What is the text distribution view for Display Name?**

---

In [None]:
# get unique Display Name
display_name= player_data['displayName'].unique()
# convert numpy array to string
display_name_str = ",".join(display_name)

# create WordCloud with converted string
wordcloud = WordCloud(width = 1000, height = 500, random_state=1, background_color='white', collocations=True).generate(display_name_str)
plt.figure(figsize=(15, 15))
plt.imshow(wordcloud) 
plt.axis("off")
plt.show()

Names starting with J, such as Johnson, Jordan, Jones and Justin, are the most prominent in the word cloud.
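A word cloud gives a visual impression; a frequency count makes the same observation checkable. The sketch below uses a handful of toy names and the standard-library `Counter`; in the notebook the input would be `player_data['displayName']`.

```python
from collections import Counter

# Toy names for illustration; the notebook would pass player_data['displayName'] instead.
names = ["Chris Jones", "Justin Tucker", "Julio Jones", "Tyreek Hill", "Jordan Berry"]

# Count every word (first and last names) and the initial letter of each word.
words = [word for name in names for word in name.split()]
word_counts = Counter(words)
initial_counts = Counter(word[0].upper() for word in words)

print(word_counts.most_common(1))     # [('Jones', 2)]
print(initial_counts.most_common(1))  # [('J', 5)]
```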

---

---
# Data Definition/Description of Tracking data

---

**Data Definition:**
* **Tracking data:** Files tracking[season].csv contain player tracking data from season [season]. The key variables are **gameId**, **playId**, and **nflId**.



In [None]:
# Loading dataset tracking2018.csv
tracking2018_data = pd.read_csv('../input/nfl-big-data-bowl-2022/tracking2018.csv')

# Loading dataset tracking2019.csv
tracking2019_data = pd.read_csv('../input/nfl-big-data-bowl-2022/tracking2019.csv')

# Loading dataset tracking2020.csv
tracking2020_data = pd.read_csv('../input/nfl-big-data-bowl-2022/tracking2020.csv')

---
**Q: What is the structure of tracking dataset?**

---

| No. | Feature Name | Description of the feature |
| :-- | :--| :--| 
|01| **time**   | Time stamp of play (time, yyyy-mm-dd, hh:mm:ss) |
|02| **x** | Player position along the long axis of the field, 0 - 120 yards (numeric) |
|03| **y**   | Player position along the short axis of the field, 0 - 53.3 yards (numeric)  |
|04| **s**   | Speed in yards/second (numeric)  |
|05| **a**   | Acceleration in yards/second^2 (numeric)|
|06| **dis**   | Distance traveled from prior time point, in yards (numeric) |
|07| **o**   | Player orientation (deg), 0 - 360 degrees (numeric)|
|08| **dir**   | Angle of player motion (deg), 0 - 360 degrees (numeric)|
|09| **event**   | Tagged play details, including moment of ball snap, pass release, pass catch, tackle, etc (text) |
|10| **nflId**   | Player identification number, unique across players (numeric) |
|11| **displayName**   | Player name (text) |
|12| **jerseyNumber**   | Jersey number of player (numeric) |
|13| **position**   | Player position group (text) |
|14| **team**   | Team (away or home) of corresponding player (text) |
|15| **frameId**   | Frame identifier for each play, starting at 1 (numeric) |
|16| **gameId**   | Game identifier, unique (numeric) |
|17| **playId**   | Play identifier, not unique across games (numeric) |
|18| **playDirection**   | Direction that the offense is moving (left or right)|
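Because `playDirection` can be left or right, tracking coordinates from different plays are not directly comparable. A common preprocessing step is to mirror left-moving plays so that every play runs toward increasing `x`. The sketch below assumes the field dimensions from the table and the values `'left'` and `'right'` for `playDirection`; the function name `standardize` is hypothetical.

```python
FIELD_LENGTH = 120.0   # yards along x, per the table above
FIELD_WIDTH = 53.3     # yards along y, per the table above


def standardize(x, y, direction_deg, play_direction):
    """Mirror a tracking row so the play always runs toward increasing x."""
    if play_direction == "left":
        # Rotating the field by 180 degrees flips both axes and the heading.
        return FIELD_LENGTH - x, FIELD_WIDTH - y, (direction_deg + 180.0) % 360.0
    return x, y, direction_deg


flipped = standardize(100.0, 10.0, 270.0, "left")
unchanged = standardize(100.0, 10.0, 270.0, "right")
print(flipped)
print(unchanged)
```

The same rotation would apply to the orientation column `o`. Whether to standardize depends on the analysis; for plots that compare home and away teams by direction, the raw coordinates may be what is wanted.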

In [None]:
# get shape of dataframe
print('Shape of tracking 2018 season dataset is:', tracking2018_data.shape)

# print summary of dataframe
tracking2018_data.info()

In [None]:
# get shape of dataframe
print('Shape of tracking 2019 season dataset is:', tracking2019_data.shape)

# print summary of dataframe
tracking2019_data.info()

In [None]:
# get shape of dataframe
print('Shape of tracking 2020 season dataset is:', tracking2020_data.shape)

# print summary of dataframe
tracking2020_data.info()

---
**Q: What does the data look like for the tracking 2018 season dataset?**

---

In [None]:
# print first 10 rows of dataframe
tracking2018_data.head(10)

---
**Q: What does the data look like for the tracking 2019 season dataset?**

---

In [None]:
# print first 10 rows of dataframe
tracking2019_data.head(10)

---
**Q: What does the data look like for the tracking 2020 season dataset?**

---

In [None]:
# print first 10 rows of dataframe
tracking2020_data.head(10)

---
**Q: What are the descriptive statistics for the tracking 2018 season dataset?**

---

In [None]:
# print descriptive statistics for both object and numeric type
tracking2018_data.describe(include='all').round(1)

---
**Q: What are the descriptive statistics for the tracking 2019 season dataset?**

---

In [None]:
# print descriptive statistics for both object and numeric type
tracking2019_data.describe(include='all').round(1)

---
**Q: What are the descriptive statistics for the tracking 2020 season dataset?**

---

In [None]:
# print descriptive statistics for both object and numeric type
tracking2020_data.describe(include='all').round(1)

**Note:** For EDA, we use a random sample of 100,000 rows from each of the tracking 2020, 2019 and 2018 season datasets.

In [None]:
# create a sample of the tracking data for EDA (faster execution and avoids out-of-memory issues in the notebook)
tracking_data_sample = pd.concat([
    tracking2020_data.sample(n=100000, random_state=1),
    tracking2019_data.sample(n=100000, random_state=1),
    tracking2018_data.sample(n=100000, random_state=1),
])

In [None]:
# reset the index after concatenating the three samples; drop=True avoids keeping the old index as an extra column
tracking_data_sample.reset_index(drop=True, inplace=True)
# get quick summary of new dataset sample
tracking_data_sample.head()
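Sampling individual rows is fine for the marginal distributions plotted below, but it breaks up the frame-by-frame sequence of each play. Where whole plays are needed, one alternative is to sample play keys instead of rows. This is a sketch on an invented toy frame, not the approach used in this notebook.

```python
import pandas as pd

# Toy tracking frame: 3 plays with 4 frames each (values invented for illustration).
toy_tracking = pd.DataFrame({
    "gameId": [1] * 8 + [2] * 4,
    "playId": [10] * 4 + [20] * 4 + [10] * 4,
    "frameId": [1, 2, 3, 4] * 3,
    "s": [0.0, 1.2, 2.5, 3.1, 0.1, 0.9, 1.8, 2.2, 0.0, 0.5, 1.1, 1.6],
})

# Draw play keys rather than rows, so each sampled play keeps every frame.
play_keys = toy_tracking[["gameId", "playId"]].drop_duplicates()
sampled_keys = play_keys.sample(n=2, random_state=1)

# An inner merge keeps only the rows whose (gameId, playId) was sampled.
sampled_plays = toy_tracking.merge(sampled_keys, on=["gameId", "playId"], how="inner")
print(sampled_plays)
```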

---
# Data Analysis/EDA of Tracking data

---

---
**Q: What is the percentage distribution count for Team?**

---

In [None]:
# plot percentage distribution count of team
tracking_data_sample['team'].value_counts().plot(kind='pie', explode=[0.1,0.1,0.1], fontsize=14, autopct='%3.1f%%', 
                                               figsize=(10,5), shadow=True, startangle=135, legend=False, cmap='plasma')
plt.suptitle(t = 'Percentage distribution count for Team', y = 1.05, size = 20)
plt.ylabel('')
plt.axis('equal')
plt.legend(labels = tracking_data_sample['team'].value_counts().index, loc ='lower left', frameon = True)
plt.show()

The plot above shows that the sampled datapoints are split roughly evenly between the home and away teams; the third, much smaller slice is the remaining `team` value (the football).

---
**Q: What is the percentage distribution count for Play Direction?**

---

In [None]:
# plot percentage distribution count of playDirection
tracking_data_sample['playDirection'].value_counts().plot(kind='pie', explode=[0.1,0.1], fontsize=14, autopct='%3.1f%%', 
                                               figsize=(10,5), shadow=True, startangle=135, legend=False, cmap='cividis')
plt.suptitle(t = 'Percentage distribution count for Play Direction', y = 1.05, size = 20)
plt.ylabel('')
plt.axis('equal')
plt.legend(labels = tracking_data_sample['playDirection'].value_counts().index, loc ='lower left', frameon = True)
plt.show()

The plot above shows that the datapoints are split roughly evenly between the two play directions (offense moving left or right).

---
**Q: What is the distribution for Player Position with respect to Position, Play Direction and Team?**

---

In [None]:
#distribution plot for player position- Long and Short axis
sns.relplot(data=tracking_data_sample, x="x", y="y", col="playDirection", hue="team", style="position", kind="scatter", palette='dark')
plt.show()

The scatter plot above shows only slight variation between play directions for the home and away teams in player location along the long (x) and short (y) axes, across the various player positions.
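Splitting the plot by play direction is needed because the raw coordinates mirror each other for plays moving left and plays moving right. A common alternative is to flip the left-moving plays so that every play runs in the same direction. The sketch below assumes the field dimensions used elsewhere in this notebook (0-120 yards long, 0-53.3 yards wide); `standardize_direction` is a hypothetical helper and the two rows are made up.

```python
import pandas as pd

FIELD_LENGTH = 120.0     # yards along the long axis, including both end zones
FIELD_WIDTH = 160.0 / 3  # yards along the short axis, about 53.3

def standardize_direction(df):
    """Return a copy whose x/y are mirrored for plays moving left."""
    out = df.copy()
    flip = out["playDirection"] == "left"
    out.loc[flip, "x"] = FIELD_LENGTH - out.loc[flip, "x"]
    out.loc[flip, "y"] = FIELD_WIDTH - out.loc[flip, "y"]
    return out

# Made-up rows: one play moving right, one moving left
toy = pd.DataFrame({
    "x": [20.0, 100.0],
    "y": [10.0, 43.3],
    "playDirection": ["right", "left"],
})
std = standardize_direction(toy)
print(std)
```

If the angle columns `o` and `dir` are used after such a flip, they would also need to be rotated by 180 degrees for the flipped rows; that step is left out here.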

---
**Q: What is the distribution for Player Speed and Acceleration with respect to Position, Play Direction and Team?**

---

In [None]:
# distribution plot for player speed and acceleration
sns.relplot(data=tracking_data_sample, x="s", y="a", col="playDirection", hue="team", style="position", kind="scatter", palette='bright')
plt.show()

The scatter plot above shows only slight variation between play directions for the home and away teams in player speed and acceleration, across the various player positions.

---
**Q: What is the distribution for Player Orientation and Motion with respect to Position, Play Direction and Team?**

---

In [None]:
#distribution plot for orientation
sns.relplot(data=tracking_data_sample, x="o", y="dir", col="playDirection", hue="team", style="position", kind="scatter", palette='rocket')
plt.show()

The scatter plot above shows little variation between play directions for the home and away teams in player orientation and angle of player motion, across the various player positions.

---
# Data Definition/Description of PFF Scouting data

---

**Data Definition:**
* **PFF Scouting data:** The PFFScoutingData.csv file contains play-level scouting information for each game. The key variables are **gameId** and **playId**.


In [None]:
# Loading dataset PFFScoutingData.csv
PFFScouting_data = pd.read_csv('../input/nfl-big-data-bowl-2022/PFFScoutingData.csv')

---
**Q: What is the structure of PFFScouting dataset?**

---

| No. | Feature Name | Description of the feature |
| :-- | :--| :--| 
|01| **gameId**   | Game identifier, unique (numeric) |
|02| **playId**   | Play identifier, not unique across games (numeric) |
|03| **snapDetail**   | On Punts, whether the snap was on target and if not, provides detail (H: High, L: Low, <: Left, >: Right, OK: Accurate Snap, text)  |
|04| **snapTime**   | Timing from snap to kick on punt plays in seconds: (numeric)  |
|05| **operationTime**   | Timing from snap to kick on punt plays in seconds: (numeric)  |
|06| **hangTime**   |  Hangtime of player's punt or kickoff attempt in seconds. Timing is taken from impact with foot to impact with the ground or a player. (numeric)|
|07| **kickType**   | Kickoff or Punt Type (text)|
|||Possible values for kickoff plays:|
|||D: Deep - your normal deep kick with decent hang time
|||F: Flat - different than a Squib in that it will have some hang time and no roll but has a lower trajectory and hang time than a Deep kick off
|||K: Free Kick - Kick after a safety
|||O: Obvious Onside - score and situation dictates the need to regain possession. Also the hands team is on for the returning team
|||P: Pooch kick - high for hangtime but not a lot of distance - usually targeting an upman
|||Q: Squib - low-line drive kick that bounces or rolls considerably, with virtually no hang time
|||S: Surprise Onside - accounting for score and situation an onsides kick that the returning team doesn’t expect. Hands teams probably aren't on the field
|||B: Deep Direct OOB - Kickoff that is aimed deep (regular kickoff) that goes OOB directly (doesn't bounce)
||| Possible values for punt plays:
|||N: Normal - standard punt style
|||R: Rugby style punt
|||A: Nose down or Aussie-style punts
|08| **kickDirectionIntended**   | Intended kick direction from the kicking team's perspective - based on how coverage unit sets up and other factors (L: Left, R: Right, C: Center, text).|
|09| **kickDirectionActual**   | Actual kick direction from the kicking team's perspective (L: Left, R: Right, C: Center, text).|
|10| **returnDirectionIntended**   | The return direction the punt return or kick off return unit is set up for from the return team's perspective (L: Left, R: Right, C: Center, text). |
|11| **returnDirectionActual**   | Actual return direction from the return team's perspective (L: Left, R: Right, C: Center, text). |
|12| **missedTacklers**   | Jersey number and team code of player(s) charged with a missed tackle on the play. It will be reasonable to assume that he should have brought down the ball carrier and failed to do so. This situation does not have to entail contact, but it most frequently does. Missed tackles on a QB by a pass rusher are also included here. Multiple missed tacklers on a play are separated by a ; (text). |
|13| **assistTacklers**   | Jersey number and team code of player(s) assisting on the tackle. Multiple assist tacklers on a play are separated by a ; (text). |
|14| **tacklers**   | Jersey number and team code of player making the tackle (text). |
|15| **kickoffReturnFormation**   | 3 digit code indicating the number of players in the Front Wall, Mid Wall and Back Wall (text). |
|16| **gunners**   | Jersey number and team code of player(s) lined up as gunner on punt unit. Multiple gunners on a play are separated by a ; (text). |
|17| **puntRushers**   | Jersey number and team code of player(s) on the punt return unit with "Punt Rush" role for actively trying to block the punt. Does not include players crossing the line of scrimmage to engage in punt coverage players in a "Hold Up" role. Multiple punt rushers on a play are separated by a ; (text).|
|18| **specialTeamsSafeties**   | Jersey number and team code for player(s) with "Safety" roles on kickoff coverage and field goal/extra point block units - and those not actively advancing towards the line of scrimmage on the punt return unit. Multiple special teams safeties on a play are separated by a ; (text).|
|19| **vises**   | Jersey number and team code for player(s) with a "Vise" role on the punt return unit. Multiple vises on a play are separated by a ; (text).|
|20| **kickContactType**   | Detail on how a punt was fielded, or what happened when it wasn't fielded (text).|
|||Possible values:|
|||BB: Bounced Backwards
|||BC: Bobbled Catch from Air
|||BF: Bounced Forwards
|||BOG: Bobbled on Ground
|||CC: Clean Catch from Air
|||CFFG: Clean Field From Ground
|||DEZ: Direct to Endzone
|||ICC: Incidental Coverage Team Contact
|||KTB: Kick Team Knocked Back
|||KTC: Kick Team Catch
|||KTF: Kick Team Knocked Forward
|||MBC: Muffed by Contact with Non-Designated Returner
|||MBDR: Muffed by Designated Returner
|||OOB: Directly Out Of Bounds
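Several of the columns above (for example **gunners**, **vises** and **missedTacklers**) pack multiple players into one text value separated by `;`. Before counting or joining individual players, such a column can be split into one row per player. This is a minimal sketch on made-up rows; the output column names `gunner`, `teamCode` and `jerseyNumber` are hypothetical.

```python
import pandas as pd

# Made-up rows in the "TEAM NUMBER; TEAM NUMBER" layout described in the table
toy = pd.DataFrame({
    "gameId": [1, 1, 2],
    "playId": [10, 20, 10],
    "gunners": ["SEA 28; SEA 23", "SEA 28", None],
})

gunners_long = (
    toy.dropna(subset=["gunners"])                                 # plays without gunners
       .assign(gunner=lambda d: d["gunners"].str.split(";"))       # text -> list of players
       .explode("gunner", ignore_index=True)                       # one row per player
       .assign(gunner=lambda d: d["gunner"].str.strip())
       .drop(columns="gunners")
)

# Split "SEA 28" into team code and jersey number (the space is optional)
parts = gunners_long["gunner"].str.extract(r"^(?P<teamCode>[A-Z]+)\s*(?P<jerseyNumber>\d+)$")
gunners_long = pd.concat([gunners_long, parts], axis=1)
print(gunners_long)
```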

In [None]:
# get shape of dataframe
print('Shape of PFFScouting Data dataset is:', PFFScouting_data.shape)

# print summary of dataframe
PFFScouting_data.info()

**PFFScouting dataset information:**

* There are 19979 data points (rows) and 20 features (columns) in the PFFScouting dataset.
* Five columns are numeric and fifteen are of object (categorical or text) type.
* Every column except **gameId** and **playId** has missing values (its non-null count is below 19979).
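The non-null counts from `info()` can be turned into a missing-value percentage per column, which is easier to compare across columns. The sketch below uses a made-up four-row frame; on the real data the same expression would be applied to `PFFScouting_data`.

```python
import pandas as pd

# Made-up rows standing in for the scouting data
toy = pd.DataFrame({
    "gameId": [1, 1, 2, 2],
    "snapTime": [0.8, None, None, 0.9],
    "kickType": ["D", "N", None, "D"],
})

# isna() marks missing cells; the column mean of those flags is the missing share
missing_pct = toy.isna().mean().mul(100).round(1).sort_values(ascending=False)
print(missing_pct)
```

Many of the scouting columns are defined only for one play type (for instance, **snapDetail** is described as a punt field), so a high missing share there is likely structural rather than a data-quality problem.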

---
**Q: What does the data look like for the PFFScouting dataset?**

---

In [None]:
# print first 10 rows of dataframe
PFFScouting_data.head(10)

---
**Q: What are the descriptive statistics for the PFFScouting dataset?**

---

In [None]:
# print descriptive statistics for numeric type
PFFScouting_data.describe().round(1)

**PFFScouting dataset data description:**

* There are no missing values for **gameId** and **playId**: both have the full count of 19979 data points.
* The mean of **snapTime** matches its median (at the displayed precision), which suggests a roughly symmetric distribution; on its own this does not establish that the distribution is normal.
  * There are missing values for **snapTime**.
* The mean of **operationTime** matches its median, which again suggests a roughly symmetric distribution.
  * There are missing values for **operationTime**.
* The mean of **hangTime** matches its median, which again suggests a roughly symmetric distribution.
  * There are missing values for **hangTime**.
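Comparing mean and median is a quick symmetry check, but a distribution can pass it and still be far from normal. The made-up series below has two separate peaks, equal mean and median, and zero skew; only the kurtosis reveals the non-normal shape. It is a sketch of the kind of extra check that could be run on **snapTime**, **operationTime** and **hangTime**.

```python
import pandas as pd

# Made-up, deliberately two-peaked data
two_peaks = pd.Series([1, 1, 1, 1, 9, 9, 9, 9], dtype=float)

summary = {
    "mean": two_peaks.mean(),
    "median": two_peaks.median(),
    "skew": two_peaks.skew(),             # 0 for a symmetric sample
    "excess_kurtosis": two_peaks.kurt(),  # close to 0 for normal data
}
print(summary)
```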

In [None]:
# print descriptive statistics for object type
PFFScouting_data.describe(include=[object])

**PFFScouting dataset data description:**

* **snapDetail**: 5 unique values; OK is the most frequent.
* **kickType**: 11 unique values; D is the most frequent.
* **kickDirectionIntended**: 3 unique values; C is the most frequent.
* **kickDirectionActual**: 3 unique values; C is the most frequent.
* **returnDirectionIntended**: 3 unique values; C is the most frequent.
* **returnDirectionActual**: 3 unique values; C is the most frequent.
* **missedTacklers**: 860 unique values; BAL 10 is the most frequent.
* **assistTacklers**: 576 unique values; LAC 48 is the most frequent.
* **tacklers**: 1081 unique values; ARI 47 is the most frequent.
* **kickoffReturnFormation**: 16 unique values; 8-0-2 is the most frequent.
* **gunners**: 926 unique values; SEA28;SEA23 is the most frequent.
* **puntRushers**: 2089 unique values; PIT 45 is the most frequent.
* **specialTeamsSafeties**: 3107 unique values; HOU 33;HOU 41 is the most frequent.
* **vises**: 1779 unique values; LA 31;LA 25 is the most frequent.
* **kickContactType**: 14 unique values; CC is the most frequent.
* Every column listed above has missing values.

---

# Data Analysis/EDA of PFF Scouting data

---

---
**Q: What is the distribution count for Snap Detail?**

---

In [None]:
# plot percentage distribution count of snapDetail
PFFScouting_data['snapDetail'].value_counts().plot(kind='pie', explode=[0.1,0.3,0.5,0.7,0.9], fontsize=14, autopct='%3.1f%%', 
                                               figsize=(10,5), shadow=True, startangle=135, legend=False, cmap='winter')
plt.suptitle(t = 'Percentage distribution count for Snap Detail', y = 1.05, size = 20)
plt.ylabel('')
plt.axis('equal')
plt.legend(labels = PFFScouting_data['snapDetail'].value_counts().index, loc ='lower left', frameon = True)
plt.show()

H: High, L: Low, <: Left, >: Right, OK: Accurate Snap

The plot above shows that the majority of snaps with a recorded Snap Detail are Accurate Snaps (OK).

---
**Q: What is the distribution of Snap Time with respect to Snap Detail?**

---

In [None]:
#distribution of snapTime with respect to snapDetail
sns.displot(data=PFFScouting_data, x='snapTime', hue='snapDetail',kind='kde')
plt.show()

H: High, L: Low, <: Left, >: Right, OK: Accurate Snap

The density plot above shows that most snaps were on target (OK), with `snapTime` mostly between 0.6 and 1.0 seconds on punt plays.

---
**Q: What is the distribution of Operation Time with respect to Snap Detail?**

---

In [None]:
#distribution of operationTime with respect to snapDetail
sns.displot(data=PFFScouting_data, x='operationTime', hue='snapDetail',kind='kde')
plt.show()

H: High, L: Low, <: Left, >: Right, OK: Accurate Snap

The density plot above shows that most snaps were on target (OK), with `operationTime` (snap to kick on punt plays) mostly between 1.5 and 2.5 seconds.

---
**Q: What is the distribution of Hang Time with respect to Snap Detail?**

---

In [None]:
#distribution of hangTime with respect to snapDetail
sns.displot(data=PFFScouting_data, x='hangTime', hue='snapDetail',kind='kde')
plt.show()

H: High, L: Low, <: Left, >: Right, OK: Accurate Snap

The density plot above shows that, for plays with a recorded Snap Detail, hang time (from impact with the foot to impact with the ground or a player) falls mostly between 3 and 6 seconds, and that most of these plays had an accurate snap (OK).

---
**Q: What is the distribution count for kickType?**

---

In [None]:
#check distribution and frequency of kickType
PFFScouting_data['kickType'].value_counts()

| | | |
| :-- | :--| :--| 
|| **kickType**   | Kickoff or Punt Type (text)|
|||**Possible values for kickoff plays:**|
|||D: Deep - your normal deep kick with decent hang time
|||F: Flat - different than a Squib in that it will have some hang time and no roll but has a lower trajectory and hang time than a Deep kick off
|||K: Free Kick - Kick after a safety
|||O: Obvious Onside - score and situation dictates the need to regain possession. Also the hands team is on for the returning team
|||P: Pooch kick - high for hangtime but not a lot of distance - usually targeting an upman
|||Q: Squib - low-line drive kick that bounces or rolls considerably, with virtually no hang time
|||S: Surprise Onside - accounting for score and situation an onsides kick that the returning team doesn’t expect. Hands teams probably aren't on the field
|||B: Deep Direct OOB - Kickoff that is aimed deep (regular kickoff) that goes OOB directly (doesn't bounce)
||| **Possible values for punt plays:**
|||N: Normal - standard punt style
|||R: Rugby style punt
|||A: Nose down or Aussie-style punts

---
**Q: What is the distribution of Kick Type with respect to Kick Direction Intended?**

---

In [None]:
#distribution of kickType with respect to kickDirectionIntended
sns.displot(data=PFFScouting_data, x='kickType', hue='kickDirectionIntended',height=8, aspect=.8)
plt.show()

The distribution plot above shows that most Deep kicks and Normal punts had an intended kick direction of Center or Left.
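Because Deep kicks and Normal punts dominate the counts, the shares within each kick type are easier to compare as row percentages. A minimal sketch with `pd.crosstab` on made-up rows follows; on the real data the same call would take the `kickType` and `kickDirectionIntended` columns of `PFFScouting_data`.

```python
import pandas as pd

# Made-up rows standing in for the scouting data
toy = pd.DataFrame({
    "kickType": ["D", "D", "D", "D", "N", "N"],
    "kickDirectionIntended": ["C", "C", "L", "R", "L", "L"],
})

# normalize="index" makes every kickType row sum to 1; scale to percentages
share = (
    pd.crosstab(toy["kickType"], toy["kickDirectionIntended"], normalize="index")
      .mul(100)
      .round(1)
)
print(share)
```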

---
**Q: What is the distribution of Kick Type with respect to Kick Direction Actual?**

---

In [None]:
#distribution of kickType with respect to kickDirectionActual
sns.displot(data=PFFScouting_data, x='kickType', hue='kickDirectionActual',height=8, aspect=.8)
plt.show()

The distribution plot above shows that most Deep kicks and Normal punts had an actual kick direction of Center or Left.
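The intended and actual directions can also be compared row by row to estimate how often a kick went where it was aimed. The sketch below computes that rate per kick type on made-up rows; `onTarget` and `hit_rate` are hypothetical names, and rows missing either direction are dropped before comparing.

```python
import pandas as pd

# Made-up rows standing in for the scouting data
toy = pd.DataFrame({
    "kickType": ["D", "D", "D", "N", "N", "N"],
    "kickDirectionIntended": ["C", "L", "R", "L", "R", None],
    "kickDirectionActual": ["C", "L", "C", "L", "L", "R"],
})

# Keep only plays where both directions were recorded
both = toy.dropna(subset=["kickDirectionIntended", "kickDirectionActual"])

hit_rate = (
    both.assign(onTarget=both["kickDirectionIntended"] == both["kickDirectionActual"])
        .groupby("kickType")["onTarget"]
        .mean()  # share of kicks whose actual direction matched the intended one
)
print(hit_rate)
```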

---
**Q: What is the distribution of Kick Type with respect to Return Direction Intended?**

---

In [None]:
#distribution of kickType with respect to returnDirectionIntended
sns.displot(data=PFFScouting_data, x='kickType', hue='returnDirectionIntended',height=8, aspect=.8)
plt.show()

The distribution plot above shows that most Deep kicks and Normal punts had an intended return direction of Center or Right.

---
**Q: What is the distribution of Kick Type with respect to Return Direction Actual?**

---

In [None]:
#distribution of kickType with respect to returnDirectionActual
sns.displot(data=PFFScouting_data, x='kickType', hue='returnDirectionActual',height=8, aspect=.8)
plt.show()

The distribution plot above shows that most Deep kicks and Normal punts had an actual return direction of Center or Left.

---

---
# Data Definition/Description of NFL 2022 data

---

Note: The datasets are merged step by step below to arrive at a combined sample dataset for EDA.

---
**Merge game and play dataset - game_play_data**

---

In [None]:
# Merge of game and play dataset using key as gameId
game_play_data = pd.merge(game_data, play_data, how="inner", on=["gameId"])

In [None]:
# get shape of dataframe
print('Shape of game_play_data dataset is:', game_play_data.shape)

# print summary of dataframe
game_play_data.info()

In [None]:
# print first 5 rows of dataframe
game_play_data.head()
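This merge relies on **gameId** being unique in the game data, so that each play matches exactly one game. The `validate` argument of `pd.merge` can assert that assumption instead of leaving it implicit. The frames below are made up; the second call shows what happens when the "one" side contains a duplicate key.

```python
import pandas as pd

# Made-up stand-ins for the game and play data
games = pd.DataFrame({"gameId": [1, 2], "season": [2018, 2019]})
plays = pd.DataFrame({"gameId": [1, 1, 2, 3], "playId": [10, 20, 10, 10]})

# one_to_many: gameId must be unique in the left frame
merged = pd.merge(games, plays, how="inner", on="gameId", validate="one_to_many")
print(merged.shape)

# A duplicated game row is reported instead of silently duplicating plays
dup_games = pd.concat([games, games.iloc[[0]]], ignore_index=True)
try:
    pd.merge(dup_games, plays, on="gameId", validate="one_to_many")
    duplicate_detected = False
except pd.errors.MergeError:
    duplicate_detected = True
print(duplicate_detected)
```

Note that the inner join also drops the play whose `gameId` has no matching game, which is why the merged toy frame has three rows rather than four.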

---
**Merge player and tracking_data_sample dataset - player_tracking_data**

---

In [None]:
# Merge of player and tracking_data_sample (2018-2020 sample) dataset using key as nflId
player_tracking_data = pd.merge(player_data,tracking_data_sample, how="inner", on=["nflId"])

In [None]:
# get shape of dataframe
print('Shape of player_tracking_data dataset is:', player_tracking_data.shape)

# print summary of dataframe
player_tracking_data.info()

In [None]:
# print first 5 rows of dataframe
player_tracking_data.head()
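When both frames share a non-key column name, `pd.merge` keeps both copies and appends `_x` and `_y` by default. The sketch below assumes that `displayName` appears in both the player and the tracking data (worth confirming with `.columns` on the real frames) and uses the `suffixes` argument to give the copies readable names. All rows are made up.

```python
import pandas as pd

# Made-up stand-ins; displayName is assumed to exist in both real frames
players = pd.DataFrame({
    "nflId": [1, 2],
    "displayName": ["A. Kicker", "B. Punter"],
    "Position": ["K", "P"],
})
tracking = pd.DataFrame({
    "nflId": [1, 2, 2],
    "displayName": ["A. Kicker", "B. Punter", "B. Punter"],
    "x": [10.0, 20.0, 21.0],
})

merged = pd.merge(
    players, tracking, how="inner", on="nflId",
    suffixes=("_player", "_tracking"),  # replaces the default _x / _y
)
print(merged.columns.tolist())
```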

---
**Merge game_play_data and PFFScouting dataset- game_play_scouting_data**

---

In [None]:
# Merge of game_play_data and PFFScouting_data dataset using key as gameId and playId
game_play_scouting_data = pd.merge(game_play_data, PFFScouting_data, how="inner", on=["gameId", "playId"])

In [None]:
# get shape of dataframe
print('Shape of game_play_scouting_data dataset is:', game_play_scouting_data.shape)

# print summary of dataframe
game_play_scouting_data.info()

In [None]:
# print first 5 rows of dataframe
game_play_scouting_data.head()

---
**Merge game_play_scouting and player_tracking dataset- nfl2022_data_sample**

---

In [None]:
# Merge of game_play_scouting_data and player_tracking_data dataset using key as gameId and playId
nfl2022_data_sample = pd.merge(game_play_scouting_data, player_tracking_data, how="inner", on=["gameId", "playId"])

In [None]:
# get shape of dataframe
print('Shape of nfl2022_data_sample dataset is:', nfl2022_data_sample.shape)

# print summary of dataframe
nfl2022_data_sample.info()

In [None]:
# print first 5 rows of dataframe
nfl2022_data_sample.head()

---
# Data Analysis/EDA of NFL 2022 data

---

Review the data types of the variables in nfl2022_data_sample before starting the EDA.

---
**Q: What are the numeric float variables in nfl2022_data_sample?**

---

In [None]:
# Get list of numeric float variables
s = (nfl2022_data_sample.dtypes == 'float64')
numeric_float_cols = list(s[s].index)

print("Numerical float variables:")
print(numeric_float_cols)

---
**Q: What are the numeric int variables in nfl2022_data_sample?**

---

In [None]:
# Get list of numeric int variables
s = (nfl2022_data_sample.dtypes == 'int64')
numeric_int_cols = list(s[s].index)

print("Numerical int variables:")
print(numeric_int_cols)

---
**Q: What are the object variables in nfl2022_data_sample?**

---

In [None]:
# Get list of object variables
s = (nfl2022_data_sample.dtypes == 'object')
object_cols = list(s[s].index)

print("Object variables:")
print(object_cols)

---
**Q: What are the datetime variables in nfl2022_data_sample?**

---

In [None]:
# Get list of datetime variables
s = (nfl2022_data_sample.dtypes == 'datetime64[ns]')
datetime_cols = list(s[s].index)

print("datetime variables:")
print(datetime_cols)
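The four cells above compare `dtypes` against exact type strings. `DataFrame.select_dtypes` does the same job and is less sensitive to the exact datetime resolution. Date columns read from CSV normally arrive as text, so the datetime list stays empty unless a column is converted first. In the sketch below the column names follow the game data, but the values and the `MM/DD/YYYY` date format are assumptions made for illustration.

```python
import pandas as pd

# Made-up rows; the MM/DD/YYYY layout of gameDate is an assumption
toy = pd.DataFrame({
    "gameId": [2018090600, 2018090900],
    "kickLength": [45.0, 60.0],
    "gameDate": ["09/06/2018", "09/09/2018"],
})

# Text dates must be converted before they count as datetime columns
toy["gameDate"] = pd.to_datetime(toy["gameDate"], format="%m/%d/%Y")

float_cols = toy.select_dtypes(include="float64").columns.tolist()
int_cols = toy.select_dtypes(include="int64").columns.tolist()
datetime_cols = toy.select_dtypes(include="datetime").columns.tolist()
print(float_cols, int_cols, datetime_cols)
```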

---
**Q: What is the correlation for key numeric variables in nfl2022_data_sample?**

---

In [None]:
#checking correlation between key numeric variables via heatmap
corr = nfl2022_data_sample[['penaltyYards', 'kickLength', 'kickReturnYardage', 'snapTime', 'operationTime', 'hangTime', 'x', 'y', 's', 'a', 'dis', 'o', 'dir']].corr(method='pearson')
plt.figure(figsize=(25,25))
sns.heatmap(corr,vmax=.8,linewidth=.01, square = True, annot = True,cmap='YlGnBu',linecolor ='black')
plt.show()

The correlation heatmap above shows no strong positive or negative linear relationship between most pairs of numeric variables. The exception is Speed (s) and Distance (dis), which are strongly positively correlated; this is expected, because dis is the distance traveled since the previous time point and therefore grows with speed.

Note: s is player speed in yards/second and dis is the distance traveled since the prior time point, in yards.
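The link between s and dis can be illustrated with synthetic numbers: if a player moves at speed s for one tracking frame, the distance covered is roughly s times the frame duration. The sketch below assumes 10 tracking frames per second (0.1 seconds per frame) and adds a small amount of made-up measurement noise; none of the values come from the real data.

```python
import numpy as np
import pandas as pd

FRAME_SECONDS = 0.1  # assumption: tracking is sampled at 10 frames per second

rng = np.random.default_rng(0)
speed = rng.uniform(0, 10, size=500)  # synthetic speeds in yards/second
noise = rng.normal(0, 0.02, size=500)  # synthetic measurement noise in yards
distance = speed * FRAME_SECONDS + noise  # yards covered per frame

toy = pd.DataFrame({"s": speed, "dis": distance})
r = toy["s"].corr(toy["dis"])  # Pearson correlation, as in the heatmap
print(round(r, 3))
```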

---
**Q: What is the distribution and relationship for key numerical variables in nfl2022_data_sample?**

---

In [None]:
# plot pairwise relationships for key numeric variables (no hue is set, so no palette is needed)
sns.pairplot(data=nfl2022_data_sample[['penaltyYards', 'kickLength', 'kickReturnYardage', 'snapTime', 'operationTime', 'hangTime', 'x', 'y', 's', 'a', 'dis', 'o', 'dir']], diag_kind='kde')
plt.show()

The pair plot above shows the pairwise relationships between the numeric variables, with the distribution of each variable (KDE) on the diagonal.
Note that Player Orientation (o) and Angle of Player Motion (dir) have differently shaped distribution curves, probably because their values are angles in degrees (0-360).

---
**Q: What is the distribution for Kick Type and Kick Length with respect to Special Teams Play Type?**

---

In [None]:
#distribution of kickType, kickLength with respect to specialTeamsPlayType
sns.catplot(data=nfl2022_data_sample, x="kickType", y="kickLength", hue="specialTeamsPlayType", height=8, aspect=0.8)
plt.show()

D: Deep, N: Normal - standard punt style, A: Nose down or Aussie-style punts, Q: Squib, P: Pooch kick, F: Flat, O: Obvious Onside, K: Free Kick, R: Rugby style punt, S: Surprise Onside, B: Deep Direct OOB

---
**Q: What is the distribution for Kick Contact Type and Kick Length with respect to Special Teams Play Type?**

---

In [None]:
#distribution of kickContactType, kickLength with respect to specialTeamsPlayType
sns.catplot(data=nfl2022_data_sample, x="kickContactType", y="kickLength", hue="specialTeamsPlayType", height=8, aspect=0.8)
plt.show()

CC: Clean Catch from Air, BF: Bounced Forwards, BB: Bounced Backwards, MBC: Muffed by Contact with Non-Designated Returner, CFFG: Clean Field From Ground, MBDR: Muffed by Designated Returner, OOB: Directly Out Of Bounds, KTF: Kick Team Knocked Forward, BC: Bobbled Catch from Air, KTB: Kick Team Knocked Back, BOG: Bobbled on Ground, DEZ: Direct to Endzone, KTC: Kick Team Catch, ICC: Incidental Coverage Team Contact 

---
**Q: What is the distribution for Snap Detail and Snap Time with respect to Special Teams Play Type?**

---

In [None]:
#distribution of snapDetail, snapTime with respect to specialTeamsPlayType
sns.catplot(data=nfl2022_data_sample, x="snapDetail", y="snapTime", hue="specialTeamsPlayType", height=8, aspect=0.8)
plt.show()

OK: Accurate Snap, <: Left, H: High, >: Right, L: Low

---
**Q: What is the distribution for Snap Detail and Operation Time with respect to Special Teams Play Type?**

---

In [None]:
#distribution of snapDetail, operationTime with respect to specialTeamsPlayType
sns.catplot(data=nfl2022_data_sample, x="snapDetail", y="operationTime", hue="specialTeamsPlayType", height=8, aspect=0.8)
plt.show()

OK: Accurate Snap, <: Left, H: High, >: Right, L: Low

---
**Q: What is the distribution for Snap Detail and Hang Time with respect to Special Teams Play Type?**

---

In [None]:
#distribution of snapDetail, hangTime with respect to specialTeamsPlayType
sns.catplot(data=nfl2022_data_sample, x="snapDetail", y="hangTime", hue="specialTeamsPlayType", height=8, aspect=0.8)
plt.show()

OK: Accurate Snap, <: Left, H: High, >: Right, L: Low

---

---
# Summary
---

**EDA Involves:**
* Providing insights on **Special Teams plays** and associated metrics.

---

In [None]:
#distribution count for specialTeamsPlayType with respect to team and season
sns.catplot(data=nfl2022_data_sample, x="specialTeamsPlayType", col="team", hue="season", kind="count", height=8, aspect=0.8, palette="rainbow")
plt.show()

In the sample data, Kickoff counts appear higher for the 2020 season, Punt counts appear higher for the 2018 season, Field Goal counts are similar across all seasons, and Extra Point counts appear higher for the 2020 season. These counts are tracking rows in the sample, not distinct plays.
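Each play contributes many tracking rows, so a count plot over the merged sample weights plays by how many of their rows were sampled. To compare seasons by number of plays, the rows can first be reduced to one per (`gameId`, `playId`). A minimal sketch on made-up rows:

```python
import pandas as pd

# Made-up rows: the first play appears twice, as it would with tracking data
toy = pd.DataFrame({
    "season": [2018, 2018, 2018, 2019, 2019],
    "gameId": [1, 1, 1, 2, 2],
    "playId": [10, 10, 20, 10, 10],
    "specialTeamsPlayType": ["Punt", "Punt", "Kickoff", "Punt", "Punt"],
})

plays_per_season = (
    toy.drop_duplicates(subset=["gameId", "playId"])  # keep one row per play
       .groupby(["season", "specialTeamsPlayType"])
       .size()
       .rename("plays")
       .reset_index()
)
print(plays_per_season)
```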

---
**Highlights:**

---
**Q: What is the distribution of Special Teams Result and Special Teams Play Type with respect to Player Position along the Long axis of the field?**

---

In [None]:
# get specialTeamsResult and specialTeamsPlayType sorted by x
position_longaxis = nfl2022_data_sample.groupby(by = ['specialTeamsPlayType','specialTeamsResult'], as_index = False)['x'].agg('mean').sort_values(by ='x', ascending = False)
# set plot figure size
figure = plt.figure(figsize = [15, 10])
# plot comparison of specialTeamsResult
sns.barplot(x = 'x', y ='specialTeamsResult', hue='specialTeamsPlayType',data = position_longaxis[0:12])
# set plot label
plt.xlabel(xlabel = 'Player Position - Long Axis, 0-120 yards')
plt.ylabel(ylabel = 'Special Teams Result')
plt.grid(True, axis = 'x')  # pass the flag positionally; the 'b' keyword was renamed in newer Matplotlib
plt.show()
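A group mean can be dominated by a handful of rows when a play type and result combination is rare in the sample. Reporting the group size next to the mean makes it possible to filter out such groups before ranking. In this sketch `MIN_ROWS` is a hypothetical threshold and the rows are made up.

```python
import pandas as pd

# Made-up rows standing in for nfl2022_data_sample
toy = pd.DataFrame({
    "specialTeamsPlayType": ["Punt", "Punt", "Punt", "Kickoff"],
    "specialTeamsResult": ["Return", "Return", "Return", "Touchback"],
    "x": [40.0, 60.0, 50.0, 110.0],
})

MIN_ROWS = 2  # hypothetical cut-off; tune it to the real sample size

summary = (
    toy.groupby(["specialTeamsPlayType", "specialTeamsResult"], as_index=False)
       .agg(mean_x=("x", "mean"), n_rows=("x", "size"))  # mean plus group size
)
reliable = summary[summary["n_rows"] >= MIN_ROWS].sort_values("mean_x", ascending=False)
print(reliable)
```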

---
**Q: What is the distribution of Special Teams Result and Special Teams Play Type with respect to Player Position along the Short axis of the field?**

---

In [None]:
# get specialTeamsResult and specialTeamsPlayType sorted by y
position_shortaxis = nfl2022_data_sample.groupby(by = ['specialTeamsPlayType','specialTeamsResult'], as_index = False)['y'].agg('mean').sort_values(by ='y', ascending = False)
# set plot figure size
figure = plt.figure(figsize = [15, 10])
# plot comparison of specialTeamsResult
sns.barplot(x = 'y', y ='specialTeamsResult', hue='specialTeamsPlayType',data = position_shortaxis[0:12])
# set plot label
plt.xlabel(xlabel = 'Player Position - Short Axis, 0-53.3 yards')
plt.ylabel(ylabel = 'Special Teams Result')
plt.grid(True, axis = 'x')  # pass the flag positionally; the 'b' keyword was renamed in newer Matplotlib
plt.show()

---
**Q: What is the distribution of Special Teams Result and Special Teams Play Type with respect to Player Speed?**

---

In [None]:
# get specialTeamsResult and specialTeamsPlayType sorted by s
speed = nfl2022_data_sample.groupby(by = ['specialTeamsPlayType','specialTeamsResult'], as_index = False)['s'].agg('mean').sort_values(by ='s', ascending = False)
# set plot figure size
figure = plt.figure(figsize = [15, 10])
# plot comparison of specialTeamsResult
sns.barplot(data = speed[0:12],x = 's', y ='specialTeamsResult', hue='specialTeamsPlayType',palette='winter')
# set plot label
plt.xlabel(xlabel = 'Speed in yards/second')
plt.ylabel(ylabel = 'Special Teams Result')
plt.grid(True, axis = 'x')  # pass the flag positionally; the 'b' keyword was renamed in newer Matplotlib
plt.show()

---
**Q: What is the distribution of Special Teams Result and Special Teams Play Type with respect to Player Acceleration?**

---

In [None]:
# get specialTeamsResult and specialTeamsPlayType sorted by a
acceleration = nfl2022_data_sample.groupby(by = ['specialTeamsPlayType','specialTeamsResult'], as_index = False)['a'].agg('mean').sort_values(by ='a', ascending = False)
# set plot figure size
figure = plt.figure(figsize = [15, 10])
# plot comparison of specialTeamsResult
sns.barplot(data = acceleration[0:12], x = 'a', y ='specialTeamsResult',hue='specialTeamsPlayType',palette='winter')
# set plot label
plt.xlabel(xlabel = 'Acceleration in yards/second^2')
plt.ylabel(ylabel = 'Special Teams Result')
plt.grid(True, axis = 'x')  # pass the flag positionally; the 'b' keyword was renamed in newer Matplotlib
plt.show()

---
**Q: What is the distribution of Special Teams Result and Special Teams Play Type with respect to Player Orientation?**

---

In [None]:
# get specialTeamsResult and specialTeamsPlayType sorted by o
orientation = nfl2022_data_sample.groupby(by = ['specialTeamsPlayType','specialTeamsResult'], as_index = False)['o'].agg('mean').sort_values(by ='o', ascending = False)
# set plot figure size
figure = plt.figure(figsize = [15, 10])
# plot comparison of specialTeamsResult
sns.barplot(data = orientation[0:12],x = 'o', y ='specialTeamsResult',hue='specialTeamsPlayType',palette='autumn')
# set plot label
plt.xlabel(xlabel = 'Player Orientation (deg), 0-360 degrees')
plt.ylabel(ylabel = 'Special Teams Result')
plt.grid(True, axis = 'x')  # pass the flag positionally; the 'b' keyword was renamed in newer Matplotlib
plt.show()

---
**Q: What is the distribution of Special Teams Result and Special Teams Play Type with respect to the angle of Player motion?**

---

In [None]:
# get specialTeamsResult and specialTeamsPlayType sorted by dir
direction = nfl2022_data_sample.groupby(by = ['specialTeamsPlayType','specialTeamsResult'], as_index = False)['dir'].agg('mean').sort_values(by ='dir', ascending = False)
# set plot figure size
figure = plt.figure(figsize = [15, 10])
# plot comparison of specialTeamsResult
sns.barplot(data = direction[0:12],x = 'dir', y ='specialTeamsResult',hue='specialTeamsPlayType',palette='autumn')
# set plot label
plt.xlabel(xlabel = 'Angle of Player Motion (deg), 0-360 degrees')
plt.ylabel(ylabel = 'Special Teams Result')
plt.grid(True, axis = 'x')  # pass the flag positionally; the 'b' keyword was renamed in newer Matplotlib
plt.show()
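Orientation and motion angle are measured in degrees on a circle, so an arithmetic mean can mislead: the average of 350 and 10 degrees is 180, which points the opposite way. A circular mean averages the sine and cosine components instead. The sketch below is one standard way to compute it; `circular_mean_deg` is a hypothetical helper, not something used in the cells above.

```python
import numpy as np

def circular_mean_deg(angles_deg):
    """Mean direction of angles given in degrees, wrapped into the 0-360 range."""
    radians = np.deg2rad(np.asarray(angles_deg, dtype=float))
    # Average the unit vectors, then convert the resulting vector back to an angle
    mean_angle = np.arctan2(np.sin(radians).mean(), np.cos(radians).mean())
    return float(np.rad2deg(mean_angle) % 360.0)

angles = [350.0, 10.0]  # two directions just either side of 0 degrees
arithmetic = float(np.mean(angles))
circular = circular_mean_deg(angles)
print(arithmetic, circular)
```

With pandas, the same helper could be passed to `groupby(...)['o'].agg(circular_mean_deg)` in place of `'mean'`, if group-level directions are of interest.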

---
**Thank you and Happy Learning.**

---

In [None]:
thank_you_str="Thanks,Happy Learning,Collaboration,Thankyou,Keep Learning"
# create WordCloud with converted string
wordcloud = WordCloud(width = 1000, height = 500, random_state=1, background_color='white', collocations=True).generate(thank_you_str)
plt.figure(figsize=(20, 20))
plt.imshow(wordcloud) 
plt.axis("off")
plt.show()