![](https://miro.medium.com/max/2834/1*85UdCLBSCGMObaXVj1Tf7w.png)

The 2021 Big Data Bowl data contains **player tracking,** **play,** **game,** and **player level information** for all possible passing **plays during the 2018 regular season.** <font color="blue"> For purposes of this event, passing plays are considered to be ones on which a pass was thrown, the quarterback was sacked, or any one of five different penalties was called (defensive pass interference, offensive pass interference, defensive holding, illegal contact, or roughing the passer). </font>On each play, tracking data for linemen (both offensive and defensive) are not provided. <font color="green"> The focus of this year's contest is on pass coverage. </font>


<br> Now let us take an introductory look at the data itself; we will gradually dig deeper. 


In [None]:
import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go

!pip install calplot
!pip install calmap
import calmap
import calplot
import matplotlib.pyplot as plt
from IPython.display import clear_output

%matplotlib inline
clear_output()



games_df = pd.read_csv('../input/nfl-big-data-bowl-2021/games.csv')
players_df = pd.read_csv('../input/nfl-big-data-bowl-2021/players.csv')
plays_df = pd.read_csv('../input/nfl-big-data-bowl-2021/plays.csv')
teams_df = pd.read_csv('../input/nfl-team-names/teams.csv')
positions_df = pd.read_csv('../input/nfl-team-names/positions.csv')

<a id="1"></a>
<h2 style='background:#900C3F; border:0; color:white'><center>01. Game Data Analysis</center></h2> 
    
Let's first look at the `games.csv` file, which stores the date and start time of each game, the week of the season in which it was played, and the names of the home and visiting teams. The key variable here is `gameId`. The following table shows the data columns and their formats.
    
|Column Name|Description|
|:--|:--|
| **gameId:** |Game identifier, unique (numeric)|
| **gameDate:** |Game date (time, mm/dd/yyyy format)|
| **gameTimeEastern:** |Start time of game (time, HH:MM:SS, EST)|
| **homeTeamAbbr:** |Home team three-letter code (text)|
| **visitorTeamAbbr:** |Visiting team three-letter code (text)|
| **week:** |Week of game (numeric)|
    
<br>
 
## Number of Games Played on Different Dates

In [None]:
df = games_df['gameDate'].value_counts().reset_index().copy()
df.columns = ['date', 'games']
df.sort_values(['games'], inplace = True, ascending = False)

fig = px.bar(df,
             x='date',
             y="games", 
             color = "games",
             title='Number of Games on each Date',
             height=400,
             width=800,
             color_continuous_scale=px.colors.sequential.Viridis_r
)
fig.update_layout(title_x=0.5, xaxis_title = 'Dates', yaxis_title = '#games' )
fig.update_xaxes(type='category', tickangle = 60 )
fig.show()

## Games Played Over Time
The calendar plot shows that the majority of the games were played on Sundays during the last four months of the year. 

In [None]:
game_date = pd.to_datetime(games_df['gameDate']).value_counts()
# game_date = pd.Series(game_date.values, index= game_date.index)
plt.figure()
calplot.calplot(game_date, cmap='YlGn', figsize=(15.,3.5))
plt.title('Games Played Over Time')
clear_output()
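
The Sunday observation can also be checked numerically instead of being read off the calendar. The sketch below is a minimal illustration on a small hand-made frame (`toy_games` and its dates are made up for the example, not taken from `games.csv`); it counts games per weekday. Swapping in `games_df` should work as long as `gameDate` follows the mm/dd/yyyy format described in the table above.

```python
import pandas as pd

# Toy stand-in for games_df; these rows are illustrative, not real data.
toy_games = pd.DataFrame({
    "gameDate": ["09/06/2018", "09/09/2018", "09/09/2018", "09/10/2018"],
})

# Parse the mm/dd/yyyy strings, then count how many games fall on each weekday.
dates = pd.to_datetime(toy_games["gameDate"], format="%m/%d/%Y")
weekday_counts = dates.dt.day_name().value_counts()
print(weekday_counts)
```

Passing an explicit `format` avoids any ambiguity between day-first and month-first dates.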

## Gameplay Time

In [None]:
df = games_df['gameTimeEastern'].value_counts().reset_index()
df.columns = ['time', 'games']
df.sort_values(['games'], ascending = True, inplace = True)

fig = px.bar(
    df, 
    y='time', 
    x="games", 
    orientation='h', 
    color = "games",
    title='Games Played in Time of Day', 
    height=400, 
    width=800,
    color_continuous_scale=px.colors.sequential.Viridis_r
)
fig.update_layout(title_x = 0.5, xaxis_title = '#games')
fig.show()

## Gameplay Time Proportions

In [None]:
fig = go.Figure(data=[go.Pie(labels=df['time'], values=df['games'], hole=.5)])
fig.update_layout(title = 'Game Play Time Proportions', title_x = 0.5)
fig.show()

The two figures above suggest that most of the games kicked off at 13:00:00 Eastern, that is, 1:00 pm in the afternoon. 
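
To put a number on the most common kickoff slot, the shares can be computed directly. This is a minimal sketch on a toy series (`kickoffs` is a made-up stand-in for `games_df['gameTimeEastern']`, and its values are illustrative only).

```python
import pandas as pd

# Toy stand-in for games_df["gameTimeEastern"]; values are illustrative only.
kickoffs = pd.Series(["13:00:00", "13:00:00", "16:25:00", "20:20:00"])

# normalize=True returns fractions of the total instead of raw counts.
shares = kickoffs.value_counts(normalize=True)
most_common_time = shares.idxmax()
print(most_common_time, round(shares.max(), 2))
```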


## Weekly Number of Games Played

In [None]:
df = games_df.groupby(['week']).count()['gameId'].copy()
df = pd.DataFrame({'week': df.index, 'games': df.values})
fig = px.bar(df, x='week', y = 'games', color= 'games', color_continuous_scale=px.colors.sequential.Viridis_r)
fig.update_xaxes(type='category')
fig.update_layout(title = 'Number of Games Played Weekly', title_x = 0.5, xaxis_title = 'Weeks ->', yaxis_title = '#games')
fig.show()

## Number of Home Games for Each Team
The chart below shows that, except for three teams, every team played 8 home games during the 2018 season. 

In [None]:
df = games_df['homeTeamAbbr'].value_counts().copy()
df = pd.DataFrame({'homeTeamAbbr': df.index, 'games': df.values })
df.sort_values(['homeTeamAbbr'], inplace = True)
df.reset_index(inplace = True, drop = True)

team_names = teams_df.set_index('short').to_dict()['long']


fig = px.bar(df.replace(team_names), 
             x='homeTeamAbbr',
             y= 'games',
             color = 'games',
             title='Home Games for Each Team',
#              height=400,
#              width=800,
             color_continuous_scale=px.colors.sequential.Viridis_r
)
fig.update_layout(title_x=0.5, xaxis_title = 'Home Teams', yaxis_title = '#games' )
fig.update_xaxes(tickangle = 60)
fig.show()
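
The teams that deviate from 8 home games can also be listed directly rather than spotted on the chart. A minimal sketch, assuming a frame shaped like `games_df` (`toy_games` and the team codes `AAA`, `BBB`, `CCC` are placeholders invented for the example):

```python
import pandas as pd

# Toy stand-in for games_df; team codes and counts are illustrative only.
toy_games = pd.DataFrame(
    {"homeTeamAbbr": ["AAA"] * 8 + ["BBB"] * 7 + ["CCC"] * 8}
)

home_counts = toy_games["homeTeamAbbr"].value_counts()
# Keep only the teams whose home-game count differs from the expected 8.
exceptions = home_counts[home_counts != 8]
print(exceptions)
```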


<a id="2"></a>
<h2 style='background:#900C3F; border:0; color:white'><center>02. Player Data Analysis</center></h2> 

The `players.csv` file contains player-level information for each player who participated in any of the tracking data files. The key variable here is `nflId`, which lets us look up the individual traits of each player. 

|<font color="blue">Column Name </font>|<font color="blue"> Description </font> |
|:--|:--|
|`nflId:` | Player identification number, unique across players (numeric)|
| `height:`| Player height (text)|
| `weight:` |Player weight (numeric, `lbs`)|
| `birthDate:` |Date of birth (YYYY-MM-DD)|
| `collegeName:` |Player college (text)|
| `position:` |Player position (text)|
| `displayName:` |Player name (text)|

### Preprocessing: 
Here we need to do a bit of pre-processing. The height is given in `inches` for some players and in `feet-inches` for others, so we convert every value to inches. The players' dates of birth are also given in different formats, so we unify them too. 

In [None]:
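# Height conversion function: heights appear either as feet-inches (e.g. "6-2") or as plain inches (e.g. "74")
def height_convert(x):
    if '-' in x:
        ft, inch = x.split('-')
        return int(ft) * 12 + int(inch)
    return int(x)

# Apply the conversion to the height column only
players_df['height'] = players_df['height'].apply(height_convert)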
# Convert all the dates to one single format
players_df['birthDate'] = pd.to_datetime(players_df.birthDate)
players_df.head(5)
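
The two normalisation steps can be tried out on a few literal values without loading the data. The sketch below is standalone and uses only the standard library; `height_to_inches` and `parse_birth_date` are hypothetical helper names, and the two date formats listed are an assumption about what the file contains, not something confirmed by the data description. The cell above relies on `pd.to_datetime` to infer the format instead.

```python
from datetime import datetime


def height_to_inches(raw):
    """Convert '6-2' (feet-inches) or '74' (inches) to a whole number of inches."""
    if "-" in raw:
        feet, inches = raw.split("-")
        return int(feet) * 12 + int(inches)
    return int(raw)


def parse_birth_date(raw, formats=("%Y-%m-%d", "%m/%d/%Y")):
    """Try each candidate format in turn; the formats listed are assumptions."""
    for fmt in formats:
        try:
            return datetime.strptime(raw, fmt).date()
        except ValueError:
            continue
    raise ValueError(f"unrecognised date format: {raw!r}")


# Illustrative values, not rows from players.csv.
print(height_to_inches("6-2"), height_to_inches("74"))
print(parse_birth_date("1990-08-20"), parse_birth_date("08/20/1990"))
```

Listing the accepted formats explicitly makes an unexpected format fail loudly instead of being silently misparsed.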

## Number of Players by Position

In [None]:
df = players_df['position'].value_counts()
df = pd.DataFrame({'position': df.index, 'count': df.values})
df.sort_values(['count'], inplace = True, ascending = True)
positions_map = positions_df.set_index('short').to_dict()['long']

fig = px.bar(df.replace(positions_map),
             y='position', 
             x="count",
             color = 'count',
             orientation='h', 
             title='Number of Players at Different Positions',
             height=600,
             width=800)
# The bars are horizontal: counts run along the x-axis, positions along the y-axis
fig.update_layout(title_x = 0.5, xaxis_title = 'Player Counts', yaxis_title = 'Position')
fig.show()

## Players from Different Colleges

In [None]:
df = players_df['collegeName'].value_counts().reset_index().copy()
df.columns = ['college', 'players']
df.sort_values('players', ascending= True, inplace = True)

fig = px.bar(df.tail(30), 
    y='college', 
    x="players", 
    orientation='h',
    color = "players",
    title='Top 30 colleges by number of players',
    height=900,
    width=800
)
fig.update_layout(title_x = 0.5, xaxis_title = 'Player Count', yaxis_title = 'College Name')
fig.show()

## Players' Height Distribution
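Here is an interesting insight: the heavier players are usually found in the defensive positions, as the weight plots further below indicate. First, let's look at how height is distributed across the playing positions. 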

In [None]:
fig = px.box(players_df.replace(positions_map) , y="height", color="position", title="Height Distribution by Player Position", width = 1000)
fig.update_layout(title_x = 0.5, xaxis_title = 'Player Position', yaxis_title = 'Height Distribution')
fig.show()

## Players' Average Weight by Position
Let's have a look at the average weight of the players at each playing position. 

In [None]:
# Select the weight column before averaging so that text columns are never passed to mean()
df = players_df.groupby(['position'])['weight'].mean().copy()
df = pd.DataFrame({'position': df.index, 'weight': df.values})
df.sort_values(['weight'], inplace = True, ascending = False)

positions_map = positions_df.set_index('short').to_dict()['long']


fig = px.bar(df.replace(positions_map), 
             x='position',
             y= 'weight',
             color = 'weight',
             title='Average Weight of The Players by Position',
#              height=400,
             width=800,
             color_continuous_scale=px.colors.sequential.Viridis_r
)
fig.update_layout(title_x=0.5, xaxis_title = 'Player Position', yaxis_title = 'Average Weight' )
fig.update_xaxes(tickangle = 60)
# fig.update_yaxes(type='category')
fig.show()

## Weight Distribution of the Players
To get a deeper understanding than the averages allow, let's take a closer look at the full distribution of the weights for each playing position. 

In [None]:
fig = px.box(players_df.replace(positions_map) , y="weight", color="position", title="Weight Distribution by Player Position", width = 1000)
fig.update_layout(title_x = 0.5, xaxis_title = 'Player Position', yaxis_title = 'Weight Distribution')
fig.show()
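
The numbers behind the box plot can be tabulated per position as well. A minimal sketch on a toy frame (`toy_players` is made up for the example; the positions and weights are illustrative, not taken from `players.csv`):

```python
import pandas as pd

# Toy stand-in for players_df; positions and weights are illustrative only.
toy_players = pd.DataFrame({
    "position": ["QB", "QB", "WR", "WR", "WR"],
    "weight": [220, 230, 190, 200, 210],
})

# Selecting the column before aggregating keeps non-numeric columns out of the statistics.
summary = (
    toy_players.groupby("position")["weight"]
    .agg(["count", "mean", "median", "min", "max"])
)
print(summary)
```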

<a id="3"></a>
<h2 style='background:#900C3F; border:0; color:white'><center>03. Play Data Analysis</center></h2> 

The `plays.csv` file contains play-level information from each game. The key variables are `gameId` and `playId`.

|Column Name | Description |
|:--|:--|
|**`gameId:`**| Game identifier, unique (numeric)|
|**`playId:`** |Play identifier, not unique across games (numeric)|
| **`playDescription:`** | Description of play (text)|
| **`quarter:`** |Game quarter (numeric)|
| **`down:`** |Down (numeric)|
| **`yardsToGo:`** |Distance needed for a first down (numeric)|
| **`possessionTeam:`** |Team on offense (text)|
| **`playType:`** |Outcome of dropback: sack or pass (text)|
| **`yardlineSide:`** |3-letter team code corresponding to line-of-scrimmage (text)|
| **`yardlineNumber:`** |Yard line at line-of-scrimmage (numeric)|
| **`offenseFormation:`** |Formation used by possession team (text)|
| **`personnelO:`** |Personnel used by offensive team (text)|
| **`defendersInTheBox:`** |Number of defenders in close proximity to line-of-scrimmage (numeric)|
| **`numberOfPassRushers:`** |Number of pass rushers (numeric)|
| **`personnelD:`** |Personnel used by defensive team (text)|
| **`typeDropback:`** |Dropback categorization of quarterback (text)|
| **`preSnapHomeScore:`** |Home score prior to the play (numeric)|
| **`preSnapVisitorScore:`** |Visiting team score prior to the play (numeric)|
| **`gameClock:`** |Time on clock of play (MM:SS)|
| **`absoluteYardlineNumber:`**| Distance from end zone for possession team (numeric)|
| **`penaltyCodes:`** |NFL categorization of the penalties that occurred on the play. For purposes of this contest, the most important penalties include Defensive Pass Interference (DPI). Multiple penalties on a play are separated by a ; (text)|
| **`penaltyJerseyNumber:`** |Jersey number and team code of the player committing each penalty. Multiple penalties on a play are separated by a ; (text)|
| **`passResult:`** |Outcome of the passing play (C: Complete pass, I: Incomplete pass, S: Quarterback sack, IN: Intercepted pass, text)|
| **`offensePlayResult:`** |Yards gained by the offense, excluding penalty yardage (numeric)|
| **`playResult:`** |Net yards gained by the offense, including penalty yardage (numeric)|
| **`epa:`** |Expected points added on the play, relative to the offensive team. Expected points is a metric that estimates the average of every next scoring outcome given the play's down, distance, yardline, and time remaining (numeric)|
| **`isDefensivePI:`** |An indicator variable for whether or not a DPI penalty occurred on a given play (TRUE/FALSE)|
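
Since `penaltyCodes` and `penaltyJerseyNumber` both pack several penalties into one `;`-separated string, they usually need to be split before analysis. The sketch below is one possible way to do it; `split_penalties` is a hypothetical helper, the example strings are invented (`XX` is a placeholder code and `AAA` a placeholder team), and it assumes missing values have already been replaced by empty strings, for example with `fillna('')`.

```python
def split_penalties(penalty_codes, penalty_jerseys):
    """Pair each penalty code with the matching jersey entry.

    Both fields are assumed to use ';' as the separator, in the same order.
    """
    if not penalty_codes:
        return []
    codes = penalty_codes.split(";")
    jerseys = penalty_jerseys.split(";") if penalty_jerseys else []
    # Pad with None when the jersey field lists fewer entries than the code field.
    jerseys += [None] * (len(codes) - len(jerseys))
    return list(zip(codes, jerseys))


# Illustrative strings, not rows from plays.csv.
print(split_penalties("DPI;XX", "AAA 21;AAA 34"))
print(split_penalties("", ""))
```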

## Number of Plays for Each Team
Let's have a look at the number of plays for each of the teams. 

In [None]:
df = plays_df['possessionTeam'].value_counts().reset_index()
df.columns = ['team', 'plays']
df = df.sort_values('plays')

fig = px.bar(df.replace(team_names), 
             y='team', 
             x="plays", 
             orientation='h',
             color = "plays",
             title='Number of Plays for Each Team',
             height=800,
             width=800,
             color_continuous_scale=px.colors.sequential.Viridis
)
fig.update_layout(title_x = 0.5, xaxis_title = "No of Plays", yaxis_title = "Teams")
fig.show()

## Number of Plays of Every Type
Now let us have a look at the proportion of each type of play. 

In [None]:
df = plays_df['playType'].value_counts().reset_index()
df.columns = ['type', 'plays']
df.sort_values(['plays'], inplace = True, ascending = False)

fig = px.pie(
    df, 
    names='type', 
    values="plays",  
    title='Number of plays of every type',
    height=600,
    width=600, 
    hole = 0.5
    
)
fig.update_layout(title_x = 0.5)
fig.show()

## Plays at Yard Line Level
It is easy to understand why the 25-yard line has the largest number of plays: after a touchback on a kickoff, the offense starts its drive from there. 

In [None]:
df = plays_df['yardlineNumber'].value_counts().reset_index()
df.columns = ['yardline', 'plays']
df.sort_values('plays', inplace = True)

fig = px.bar(df, 
    x='yardline', 
    y="plays",  
    color = "plays", 
    title='Number of Plays for Every Yard Line',
    height=600,
    width=800
)
fig.update_layout(title_x = 0.5, xaxis_title = 'Yard Line Number', yaxis_title = 'Number of Plays')
fig.show()
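
Note that `yardlineNumber` on its own does not say which half of the field the ball is in; that information sits in `yardlineSide`. The sketch below shows one way to combine them into a single distance to the opponent's goal line. `yards_to_end_zone` is a hypothetical helper and the team codes are placeholders; it assumes that `yardlineSide` equals `possessionTeam` when the ball is in the offense's own half, which should be checked against the data before relying on it.

```python
def yards_to_end_zone(yardline_number, yardline_side, possession_team):
    """Distance from the line of scrimmage to the opponent's goal line.

    Assumes yardline_side equals possession_team when the ball is in the
    offense's own half; midfield (50) belongs to neither side.
    """
    if yardline_number == 50:
        return 50
    if yardline_side == possession_team:
        return 100 - yardline_number
    return yardline_number


# Illustrative calls with placeholder team codes.
print(yards_to_end_zone(25, "AAA", "AAA"))  # own 25-yard line
print(yards_to_end_zone(25, "BBB", "AAA"))  # opponent's 25-yard line
```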

## Number of Plays for Every Offense Formation Type

In [None]:
df = plays_df['offenseFormation'].value_counts().reset_index()
df.columns = ['offenseFormation', 'plays']
df = df.sort_values('plays')

fig = px.pie(df, 
             names='offenseFormation',
             values="plays",  
             title='Number of Plays for Every Offense Formation Type',
             height=600,
             width=600, 
             hole = 0.4)

fig.update_layout(title_x = 0.5)
fig.show()

![](https://cdn.imgbin.com/5/12/13/imgbin-under-construction-5fUZhLgcZb6DkMJfFGfjZr6zE.jpg)