## Project: Investigate a Dataset (NFL)
by: **Abdallah Mahmoud**, **Ali Aboelnaga**, **Habiba Mourad**, **Omnia Ahmed**, **Reem Tarek**

## Table of Contents
<ul>
<li><a href="#intro">1. Introduction</a></li>
<li><a href="#wrangling">2. Data Wrangling</a></li>
<li><a href="#eda">3. Exploratory Data Analysis</a></li>
<li><a href="#conclusions">4. Conclusions</a></li>
</ul>

<a id='intro'></a>
## 1. Introduction
### Overview
The 2022 Big Data Bowl data contains Next Gen Stats player tracking, play, game, player, and PFF scouting data for all 2018-2020 Special Teams plays.

### Questions
<ul>
<li><a href="#q1">1. What is the most frequent kickoff type?</a></li>
<li><a href="#q2">2. How often does the actual kick direction differ from the intended direction?</a></li>
<li><a href="#q3">3. Relation between hangTime and operationTime</a></li>
<li><a href="#q4">4. Number of plays each quarter</a></li>
<li><a href="#q5">5. Number of plays each down</a></li>
<li><a href="#q6">6. Minimum, maximum and average play result</a></li>
<li><a href="#q7">7. Top 10 frequent possession teams</a></li>
<li><a href="#q8">8. Most frequent specialTeamsPlayType</a></li>
<li><a href="#q9">9. What is the relation between height and position?</a></li>
<li><a href="#q10">10. Which college graduates the most special teams players?</a></li>
<li><a href="#q11">11. Which season has the most plays?</a></li>
<li><a href="#q12">12. Number of matches by week</a></li>
<li><a href="#q13">13. Number of games for every date</a></li>
<li><a href="#q14">14. Number of games based on Timing</a></li>
<li><a href="#q15">15. Number of games through years</a></li>
</ul>

In [None]:
import numpy as np
import pandas as pd

import statsmodels.api as sm
from statsmodels.graphics.gofplots import qqplot

import matplotlib.pyplot as plt
import seaborn as sns

import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots

import warnings
warnings.filterwarnings('ignore')

<a id='wrangling'></a>
## 2. Data Wrangling

In this section of the report, we will load in the data, check for cleanliness, and then trim and clean the dataset for analysis.

### 2.1 PFFScouting Data

In [None]:
pf = pd.read_csv('/kaggle/input/nfl-big-data-bowl-2022/PFFScoutingData.csv')
pf.columns

Since the dataset is large, we keep only the columns most relevant to our questions and drop the rest.

In [None]:
# Copy the selection so the later in-place edits do not act on a view of the original frame
pf = pf[['gameId', 'hangTime', 'operationTime', 'kickType', 'kickDirectionIntended', 'kickDirectionActual']].copy()

Let's first check for duplicates and drop them.

In [None]:
pf.duplicated().sum()

In [None]:
pf.drop_duplicates(inplace=True)

Now, check for `null` values

In [None]:
print("data shape:", pf.shape)
pf.isna().sum()

There are many missing values.

We have several options:
1. drop columns with too many `NaN`
2. replace them with 0
3. fill them with the mean/mode
4. leave them

Since we have plenty of data, we can drop the rows that contain null values.
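The options listed above can be compared on a small example. This is only a sketch: the `toy` frame and its values are made up for illustration and are not taken from the scouting data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the scouting data
toy = pd.DataFrame({
    "hangTime": [4.2, np.nan, 4.6, 4.0],
    "kickType": ["N", "A", None, "N"],
})

# Option 1: drop columns whose share of NaN exceeds a threshold (here 50%)
keep_cols = toy.columns[toy.isna().mean() <= 0.5]
dropped_cols = toy[keep_cols]

# Option 2: replace numeric gaps with 0 (this distorts averages)
zero_filled = toy["hangTime"].fillna(0)

# Option 3: fill with the mean (numeric) or the mode (categorical)
mean_filled = toy["hangTime"].fillna(toy["hangTime"].mean())
mode_filled = toy["kickType"].fillna(toy["kickType"].mode()[0])

# Option used in this notebook: drop every row that has any NaN
rows_dropped = toy.dropna(axis=0)
print(rows_dropped)
```

Dropping rows is the simplest choice, but it only makes sense when enough rows remain afterwards, which is why the shape is checked right after the drop.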

In [None]:
pf.dropna(axis=0, inplace=True)

In [None]:
pf.shape

In [None]:
pf.head(3)

### 2.2 Plays Data

In [None]:
plays = pd.read_csv('/kaggle/input/nfl-big-data-bowl-2022/plays.csv')
print(plays.columns)

Select only the target columns.

In [None]:
plays = plays[['gameId', 'playId', 'quarter', 'down', 'yardsToGo', 'possessionTeam', 'specialTeamsPlayType',
              'specialTeamsResult', 'playResult']]

Check for duplicates.

In [None]:
plays.duplicated().sum()

Again, check for nulls.

In [None]:
plays.isna().sum()

We have no nulls.

In [None]:
plays.shape

In [None]:
plays.head(3)

### 2.3 Player Data

In [None]:
players = pd.read_csv('/kaggle/input/nfl-big-data-bowl-2022/players.csv')
print(players.columns)

In [None]:
players.duplicated().sum()

In [None]:
players.isna().sum()

Drop nulls.

In [None]:
players.dropna(inplace=True)

Convert `height` to inches and `birthDate` to datetime.

In [None]:
def func(row):
    # Heights come either as plain inches ("74") or as feet-inches ("6-2")
    s = row['height'].split('-')
    if len(s) == 1:
        return int(s[0])
    # feet * 12 + inches
    return int(s[0]) * 12 + int(s[1])

In [None]:
players['height'] = players.apply(func, axis=1)
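As a quick sanity check of the height conversion, the same logic can be run on a few hand-picked values. The helper name `height_to_inches` and the sample values are hypothetical and only serve the illustration.

```python
import pandas as pd

def height_to_inches(value):
    """Convert '6-2' (feet-inches) or '74' (inches) to an integer number of inches."""
    parts = str(value).split("-")
    if len(parts) == 1:
        return int(parts[0])
    feet, inches = parts
    return int(feet) * 12 + int(inches)

# Hypothetical sample values in both formats
sample = pd.Series(["6-2", "74", "5-11"])
converted = sample.map(height_to_inches)
print(converted.tolist())  # [74, 74, 71]
```

Mapping over the single column avoids passing whole rows around, which is a little lighter than a row-wise `apply`.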

In [None]:
players['birthDate'] = pd.to_datetime(players['birthDate'])

In [None]:
players.shape

In [None]:
players.head(3)

### 2.4 Games Data

In [None]:
games = pd.read_csv('/kaggle/input/nfl-big-data-bowl-2022/games.csv')
games.columns

In [None]:
games.duplicated().sum()

In [None]:
games.isna().sum()

In [None]:
# games['gameDate'] = pd.to_datetime(games['gameDate'])

In [None]:
games.shape

In [None]:
games.head(3)

<a id='eda'></a>
## 3. Exploratory Data Analysis

<a id='q1'></a>
### 3.1 What is the most frequent kickoff type?

In [None]:
pf['kickType'].unique()

In [None]:
pf['kickType'] = pf['kickType'].map({'N':'standard punt style', 'A':'Aussie-style punts', 'R':'Rugby style punt'})
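One caveat with `Series.map` and a dictionary: any value that is not a key in the dictionary becomes `NaN`. The sketch below shows how to spot such values; the code `"K"` is a made-up stand-in for a code missing from the mapping.

```python
import pandas as pd

# Hypothetical codes; "K" stands in for any code missing from the mapping
codes = pd.Series(["N", "A", "R", "K"])
labels = {
    "N": "standard punt style",
    "A": "Aussie-style punts",
    "R": "Rugby style punt",
}

mapped = codes.map(labels)

# Codes that were silently turned into NaN by the mapping
unmapped = codes[mapped.isna()].unique()
print(unmapped)
```

If `unmapped` is empty on the real column, the mapping covers every code that survived the cleaning step.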

In [None]:
pf['kickType'].value_counts()

In [None]:
px.bar(pf['kickType'].value_counts(), title='Most frequent kick types', labels={'index':'style', 'value':'count'})

**The most frequent kick type is the `standard`, followed by the `Aussie`**

<a id='q2'></a>
### 3.2 How often does the actual kick direction differ from the intended direction?

In [None]:
pf[pf['kickDirectionActual']!=pf['kickDirectionIntended']].shape[0]

In [None]:
Actual_Intend_dir = pd.pivot_table(data=pf, index=['kickDirectionIntended'], columns='kickDirectionActual', values='gameId', aggfunc='count').fillna(0)

In [None]:
plt.figure(figsize=(7,5))
sns.heatmap(Actual_Intend_dir, cmap='Blues', annot=True);

**There are 191 cases where the player misses the intended direction.**<br>
<br>Most frequent:
- 90 times the intended direction was `L` (left), but the actual was `C` (center)
- 71 times the intended direction was `R` (right), but the actual was `C` (center)
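The count of misses and the intended-versus-actual table can also be obtained with a boolean mask and `pd.crosstab`. This is a minimal sketch on invented directions, not the real scouting rows.

```python
import pandas as pd

# Hypothetical kick directions (L = left, C = center, R = right)
kicks = pd.DataFrame({
    "kickDirectionIntended": ["L", "L", "R", "C", "R", "L"],
    "kickDirectionActual":   ["L", "C", "R", "C", "C", "L"],
})

# True where the kick did not go in the intended direction
missed = kicks["kickDirectionIntended"] != kicks["kickDirectionActual"]
n_missed = int(missed.sum())
miss_rate = missed.mean()

# Rows = intended direction, columns = actual direction
table = pd.crosstab(kicks["kickDirectionIntended"], kicks["kickDirectionActual"])
print(n_missed, round(miss_rate, 3))
print(table)
```

Reporting the rate alongside the raw count makes it easier to judge how common a miss is relative to all kicks.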

<a id='q3'></a>
### 3.3 Relation between hangTime and operationTime

In [None]:
px.scatter(data_frame=pf, x='operationTime', y='hangTime', marginal_x='histogram', marginal_y='histogram')

#### Check for normality with QQ-Plot

In [None]:
plt.figure(figsize=(15,5))

x1 = plt.subplot(1, 2, 1)
q1 = qqplot(pf['hangTime'], line='s',ax = x1)

x2 = plt.subplot(1, 2, 2)
q2 = qqplot(pf['operationTime'], line='s',ax = x2)

**There is no observed relation between `hangTime` and `operationTime`**
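A visual impression like this can be backed by a number, for example the Pearson correlation coefficient. The sketch below uses synthetic, independently generated samples (the means and spreads are arbitrary assumptions), so it only shows the mechanics and says nothing about the real data.

```python
import numpy as np
import pandas as pd

# Synthetic, independent samples; not the real scouting data
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "operationTime": rng.normal(2.0, 0.1, 500),
    "hangTime": rng.normal(4.4, 0.4, 500),
})

# Pearson correlation: values near 0 suggest no linear relation
pearson_r = toy["operationTime"].corr(toy["hangTime"])
print(round(pearson_r, 3))
```

On the notebook's own frame the equivalent call would be `pf['operationTime'].corr(pf['hangTime'])`.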

In [None]:
plays.head()

<a id='q4'></a>
### 3.4 Number of plays each quarter

In [None]:
px.bar(plays['quarter'].value_counts(), width=700, height=400, title='Number of plays each quarter', labels={'index':'quarter', 'value':'count'})

<a id='q5'></a>
### 3.5 Number of plays each down

In [None]:
px.bar(plays['down'].value_counts(), width=700, height=400, title='Number of plays each down', labels={'index':'down', 'value':'count'})

<a id='q6'></a>
### 3.6 Minimum, maximum, and average play result

In [None]:
plays['playResult'].describe()

In [None]:
px.histogram(plays['playResult'], width=1000, height=500)

In [None]:
plt.figure(figsize=(12,7))
sns.violinplot(x=plays['playResult']);

- minimum result is -72
- maximum result is 82
- average result is 27

We can see that the most frequent results are **0** and **40**.

<a id='q7'></a>
### 3.7 Top 10 frequent possession teams

In [None]:
top_10 = plays['possessionTeam'].value_counts()[:10]
top_10

In [None]:
px.bar(top_10, width=1000, height=500, orientation='h', title='Number of plays each possessionTeam', labels={'index':'possessionTeam', 'value':'count'})

<a id='q8'></a>
### 3.8 Most frequent specialTeamsPlayType

In [None]:
t = plays['specialTeamsPlayType'].value_counts()
t

In [None]:
px.pie(names=t.index, values=t, title='Special teams play type')

**The most frequent specialTeamsPlayType is Kickoff**

In [None]:
players.head()

<a id='q9'></a>
### 3.9 What is the relation between `Position` and `height`?

In [None]:
plt.figure(figsize=(15,5))
sns.lineplot(data=players, x='Position', y='height');

In [None]:
plt.figure(figsize=(15,5))
sns.lineplot(data=players, x='Position', y='weight');

<a id='q10'></a>
### 3.10 What college graduates most special teams players?

In [None]:
c = players['collegeName'].value_counts()[:10]

plt.figure(figsize=(12,7))
plt.barh(c.index, c);

<a id='q11'></a>
### 3.10 Who are the tallest and shortest players?

In [None]:
players['height'].describe()

In [None]:
players.query('height==66')

In [None]:
players.query('height==81')

<a id='q11'></a>
### 3.11 Which season has most plays?

In [None]:
print("Unique NFL seasons and their counts :")
g_season = games.pivot_table(index = ['season'], aggfunc = 'size') 
g_season = g_season.reset_index()
g_season.columns= ["Seasons", "Counts"]
g_season = g_season.sort_values("Counts", ascending = False)
print(g_season)

In [None]:
px.bar(g_season, x="Seasons", y="Counts", title="NFL Seasons", color="Seasons")
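The `pivot_table(..., aggfunc='size')` pattern used here counts rows per group. An equivalent and somewhat more direct formulation is `groupby(...).size()`; the sketch below shows it on a made-up games table with one row per game.

```python
import pandas as pd

# Hypothetical games table with one row per game
toy_games = pd.DataFrame({"season": [2018, 2018, 2019, 2020, 2020, 2020]})

counts = (
    toy_games.groupby("season")
    .size()                                  # number of rows (games) per season
    .reset_index(name="Counts")
    .rename(columns={"season": "Seasons"})
    .sort_values("Counts", ascending=False)
)
print(counts)
```

Either way, each row of `games.csv` is one game, so these counts are numbers of games per group.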

<a id='q12'></a>
### 3.12 Number of matches by week

In [None]:
print("Unique NFL weeks and their counts :")
g_week = games.pivot_table(index = ['week'], aggfunc = 'size') 
g_week = g_week.reset_index()
g_week.columns= ["Weeks", "Counts"]
g_week = g_week.sort_values("Counts", ascending = False)
print(g_week)

In [None]:
px.bar(g_week, x="Weeks", y="Counts", title="NFL Weeks", color="Weeks")

In [None]:
temp=games['week']
Matches_on_weekly_basis=pd.DataFrame()
Matches_on_weekly_basis['week']=temp.value_counts().index
Matches_on_weekly_basis['Count']=temp.value_counts().values
Matches_on_weekly_basis=Matches_on_weekly_basis.sort_values(by='week')
fig = px.line(Matches_on_weekly_basis, x="week", y="Count", title='Matches on Weekly Basis')
fig.show()

<a id='q13'></a>
### 3.13 Number of games for every date

In [None]:
print("Unique NFL dates and their counts :")
g_date = games.pivot_table(index = ['gameDate'], aggfunc = 'size') 
g_date = g_date.reset_index()
g_date.columns= ["Date", "Counts"]
g_date = g_date.sort_values("Counts", ascending = False)
print(g_date)

In [None]:
fig = px.bar(g_date, x="Date", y="Counts", title='Number of games for every date', color="Counts")
fig.show()

In [None]:
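temp = games['gameDate'].value_counts().reset_index()
temp.columns = ['date', 'games']
# Sort chronologically so the line follows the calendar rather than the counts
temp['date'] = pd.to_datetime(temp['date'])
temp = temp.sort_values('date')

fig = px.line(temp, x='date',y="games",  title='Number of games for every date')
fig.show()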

<a id='q14'></a>
### 3.14 Number of games based on Timing


In [None]:
print("Unique NFL timings and their counts :")
g_time = games.pivot_table(index = ['gameTimeEastern'], aggfunc = 'size') 
g_time = g_time.reset_index()
g_time.columns= ["Time", "Counts"]
g_time = g_time.sort_values("Counts", ascending = False)
print(g_time)

In [None]:
px.bar(g_time, x="Time", y="Counts", title="Number of games based on Timing", color="Counts")

In [None]:
temp = games['gameTimeEastern'].value_counts().reset_index()
temp.columns = ['time', 'games']
# Sort by kickoff time so the line runs from the earliest to the latest start
temp = temp.sort_values('time')
fig = px.line(temp, x='time',y="games",  title='Number of games based on Timing')
fig.show()

<a id='q15'></a>
### 3.15 Number of games through years


In [None]:
games['gameYear'] = pd.DatetimeIndex(games['gameDate']).year
print(games["gameYear"])
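Note that the calendar year of `gameDate` is not the same thing as the `season` column: a game played in early January still belongs to the previous season. A small sketch with two invented rows illustrates the difference; the `MM/DD/YYYY` format is an assumption based on the dates quoted in the conclusions.

```python
import pandas as pd

# Hypothetical rows: a late-season game can fall in the next calendar year
toy_games = pd.DataFrame({
    "season": [2020, 2020],
    "gameDate": ["12/27/2020", "01/03/2021"],
})

# Parse the dates explicitly, then take the calendar year
toy_games["gameDate"] = pd.to_datetime(toy_games["gameDate"], format="%m/%d/%Y")
toy_games["gameYear"] = toy_games["gameDate"].dt.year
print(toy_games[["season", "gameYear"]])
```

This is why counts grouped by `gameYear` can differ from counts grouped by `season`.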

In [None]:
print("number of games through years :")
g_year = games.pivot_table(index = ['gameYear'], aggfunc = 'size') 
g_year = g_year.reset_index()
g_year.columns= ["Year", "Counts"]
g_year = g_year.sort_values("Counts", ascending = False)
print(g_year)

In [None]:
px.bar(g_year, x="Year", y="Counts", title="number of games through years ", color="Year")

<a id='conclusions'></a>
## 4. Conclusions

1. The most frequent kick type is the standard, followed by the Aussie.
2. There are 191 cases where the player misses the intended direction; in 90 of them the intended direction was L (left), but the actual was C (center).
3. `hangTime` and `operationTime` appear approximately normally distributed, but there is no observed relation between them.
4. Most plays happen in the second quarter, while only a few are in the 5th quarter (overtime).
5. Most plays happen in down 0, followed by 4, and only a few happen in downs 1, 2, and 3.
6. The minimum result was -72, the maximum was 82, and the average was 27; the most frequent results are **0** and **40**.
7. The most frequent teams are `BAL` and `NO`.
8. The most frequent specialTeamsPlayType is kickoff.
9. `T` and `OT` tend to be the tallest players.
10. Alabama graduates the most special teams players.
11. The number of games in each season is almost the same.
12. The number of games per week varies from 41 to 48.
13. The most games were played on 01/03/2021, and the fewest on 11/04/2019.
14. Most games start at 13:00.
15. 2020 has the most games.