**> General information**
In this kernel I work with the **NFL Big Data Bowl** competition.

As an “armchair quarterback” watching the game, you may think you can predict the result of a play when a ball carrier takes the handoff - but what does the data say? In this competition, you will develop a model to predict how many yards a team will gain on given rushing plays as they happen. You'll be provided game, play, and player-level data, including the position and speed of players as provided in the NFL’s Next Gen Stats data. And the best part - you can see how your model performs from your living room, as the leaderboard will be updated week after week on the current season’s game data as it plays out.

**Contents of the Notebook:**

**Part_1:** Exploratory Data Analysis (EDA):
1. Analysis of the features.

2. Finding any relations or trends considering multiple features.

**Part_2:** Feature Engineering and Data Cleaning:
1. Adding a few features.

2. Removing redundant features.

3. Converting features into suitable form for modeling.

**Part_3:** Predictive Modeling
1. Running Basic Algorithms.

2. Cross Validation.

3. Ensembling.

4. Important Features Extraction.

**Part_1: Exploratory Data Analysis (EDA)**

In [None]:
import numpy as np 
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
plt.style.use('fivethirtyeight')
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline

In [None]:
data=pd.read_csv('../input/nfl-big-data-bowl-2020/train.csv')

In [None]:
data.head()

In [None]:
data.isnull().sum() #checking for total null values

The **Orientation, Dir, FieldPosition, OffenseFormation, DefendersInTheBox, StadiumType, GameWeather, Temperature, Humidity, WindSpeed, WindDirection** have null values. I will try to fix them.
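
Before deciding how to fix the nulls, it may help to look at the share of missing values per column rather than the raw counts. The following is only a minimal sketch: `toy` is a made-up stand-in for `data`, and its values are illustrative assumptions, not taken from the competition file.

```python
import numpy as np
import pandas as pd

# Toy stand-in for `data`; the values are made up for illustration
toy = pd.DataFrame({
    'Temperature': [63.0, np.nan, 71.0, np.nan],
    'Humidity': [77.0, 60.0, np.nan, 55.0],
    'Yards': [8, 3, 5, 2],
})

# Fraction of missing values per column, keeping only columns that have any
null_share = toy.isnull().mean()
null_share = null_share[null_share > 0].sort_values(ascending=False)
print(null_share)
```

Columns with a small share of nulls could probably be imputed, while columns that are mostly empty might be better dropped; which option fits each feature is a judgment call.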

In [None]:
data.dtypes #Checking features types

**Types Of Features**

**Categorical Features:**
A categorical variable is one that has two or more categories, and each value in that feature belongs to one of them. For example, StadiumType is a categorical variable. We cannot sort or give any ordering to such variables. They are also known as **Nominal Variables**.

**Categorical Features in the dataset:** Team, DisplayName, PossessionTeam, FieldPosition, OffenseFormation, OffensePersonnel, DefensePersonnel, PlayDirection, PlayerCollegeName, Position, HomeTeamAbbr, VisitorTeamAbbr, Stadium, Location, StadiumType, Turf, GameWeather, WindDirection.

**Ordinal Features:**
An ordinal variable is similar to a categorical variable, but its values have a relative ordering. For example, if we have a feature like Height with values Tall, Medium, Short, then Height is an ordinal variable, because its values can be sorted.

**Ordinal Features in the dataset:** GameClock, TimeHandoff, PlayerHeight, TimeSnap, PlayerBirthDate, WindSpeed.
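
The difference between nominal and ordinal features can be sketched with pandas' ordered categoricals. This example uses the hypothetical Height feature from the explanation above (Short, Medium, Tall), not a column of the dataset.

```python
import pandas as pd

# Hypothetical Height feature with an explicit order: Short < Medium < Tall
height = pd.Series(['Tall', 'Short', 'Medium', 'Short'])
height = height.astype(pd.CategoricalDtype(['Short', 'Medium', 'Tall'], ordered=True))

# Ordered categories support sorting and comparisons
print(height.sort_values().tolist())
print((height > 'Short').tolist())

# Integer codes follow the declared order and can feed a model directly
codes = height.cat.codes.tolist()
print(codes)
```

With `ordered=False` the comparison would raise an error, which is the practical meaning of "no ordering" for nominal variables.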

**Continuous Feature:**
A feature is said to be continuous if it can take any value between two points, that is, between the minimum and maximum values in the feature's column.

**Continuous Features in the dataset:** X, Y, S, A, Dis, Orientation, Dir, NflId, JerseyNumber, Season, YardLine, Quarter, Down, Distance, HomeScoreBeforePlay, VisitorScoreBeforePlay, NflIdRusher, DefendersInTheBox, Yards, PlayerWeight, Week, Temperature, Humidity

In [None]:
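# Continuous features listed above, kept as a list of column names for reuse
continuous_features = ['X', 'Y', 'S', 'A', 'Dis', 'Orientation', 'Dir', 'NflId', 'JerseyNumber', 'Season', 'YardLine', 'Quarter',
                       'Down', 'Distance', 'HomeScoreBeforePlay', 'VisitorScoreBeforePlay',
                       'NflIdRusher', 'DefendersInTheBox', 'Yards', 'PlayerWeight', 'Week', 'Temperature', 'Humidity']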

In [None]:
# Mean of each numeric feature for every observed Yards value, one panel per feature.
# 'Dir' is the actual column name (the lowercase 'dir' does not exist in the data).
features = ['PlayerWeight', 'Temperature', 'X', 'Y', 'S', 'A', 'JerseyNumber', 'Dis', 'Dir',
            'NflId', 'YardLine', 'Quarter', 'Down', 'Distance', 'HomeScoreBeforePlay',
            'VisitorScoreBeforePlay', 'Week', 'Humidity', 'DefendersInTheBox']
f,ax=plt.subplots(1,len(features),figsize=(25,10))
for i,col in enumerate(features):
    data[[col,'Yards']].groupby(['Yards']).mean().plot.bar(ax=ax[i])
    ax[i].set_title('Mean '+col+' per Yards value')

plt.show()

In [None]:
print('Longest gain:',data['Yards'].max(),'yards')
print('Smallest gain:',data['Yards'].min(),'yards')
print('Average gain:',data['Yards'].mean(),'yards')

**Doing some feature engineering to analyse players' ages**

We need to extract the year of birth and calculate each player's age.

In [None]:
data['Age'] = data['PlayerBirthDate'].map(lambda x: 2018-int(x.split('/')[2]))
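
The cell above uses a fixed reference year of 2018. If the data spans more than one season, a possible variant is to subtract the birth year from each row's `Season` instead. This is only a sketch: it assumes `PlayerBirthDate` is in MM/DD/YYYY format and that `Season` holds the season year, and `toy` is a made-up stand-in for `data`.

```python
import pandas as pd

# Toy stand-in for `data`; the values are made up for illustration
toy = pd.DataFrame({
    'PlayerBirthDate': ['12/29/1988', '03/25/1989', '07/02/1995'],
    'Season': [2017, 2018, 2018],
})

# Parse the birth date (assumed MM/DD/YYYY) and subtract its year from the season
birth_year = pd.to_datetime(toy['PlayerBirthDate'], format='%m/%d/%Y').dt.year
toy['Age'] = toy['Season'] - birth_year
print(toy['Age'].tolist())
```

This still ignores the month and day, so it is the age the player reaches during that season's calendar year rather than the exact age on game day.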

In [None]:
print('Oldest player:',data['Age'].max(),'years')
print('Youngest player:',data['Age'].min(),'years')
print('Average age on the field:',data['Age'].mean(),'years')

**I want to add a new column named 'Experience' that indicates whether a player is experienced or not. I will use 25 years of age as the threshold.**

In [None]:
data['Experience'] = data['Age'].map(lambda x: 1 if x>25 else 0)

In [None]:
data['Experience'].head()

In [None]:
f,ax=plt.subplots(1,2,figsize=(18,8))
# x and y are passed as keywords; recent seaborn versions no longer accept them positionally
sns.violinplot(x="JerseyNumber",y="Age", hue="Experience", data=data,split=True,ax=ax[0])
ax[0].set_title('JerseyNumber and Age vs Experience')
ax[0].set_yticks(range(0,110,10))
sns.violinplot(x="A",y="Age", hue="Experience", data=data,split=True,ax=ax[1])
ax[1].set_title('Acceleration and Age vs Experience')
ax[1].set_yticks(range(0,110,10))
plt.show()

In [None]:
f,ax=plt.subplots(1,2,figsize=(18,8))
# x and y are passed as keywords; recent seaborn versions no longer accept them positionally
sns.violinplot(x="JerseyNumber",y="Yards", hue="Experience", data=data,split=True,ax=ax[0])
ax[0].set_title('JerseyNumber and Yards vs Experience')
ax[0].set_yticks(range(0,110,10))
sns.violinplot(x="A",y="Yards", hue="Experience", data=data,split=True,ax=ax[1])
ax[1].set_title('Acceleration and Yards vs Experience')
ax[1].set_yticks(range(0,110,10))
plt.show()

In [None]:
data['PlayerCollegeName'].head()

In [None]:
print('There are '+str(data['PlayerCollegeName'].nunique())+' different PlayerCollegeName values')

In [None]:
data.groupby(['PlayerCollegeName','A'])['PlayerCollegeName'].count()

In [None]:
data.groupby(['PlayerCollegeName','S'])['PlayerCollegeName'].count()

In [None]:
data.groupby(['PlayerCollegeName','PlayerHeight'])['PlayerCollegeName'].count()

In [None]:
data.groupby(['PlayerCollegeName','PlayerWeight'])['PlayerCollegeName'].count()

In [None]:
data.groupby(['PlayerCollegeName','Yards'])['PlayerCollegeName'].count()

**Youngstown State** players appear to gain the most yards, with the highest speed and acceleration
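
The grouped counts above are hard to rank by eye, so a per-college average may be a more direct check of this observation. A minimal sketch, assuming that `Yards` describes the whole play and that the ball carrier is the row where `NflId` equals `NflIdRusher`; `toy` and its values are made up for illustration.

```python
import pandas as pd

# Toy stand-in for `data`: two plays, two player rows each (values are made up)
toy = pd.DataFrame({
    'NflId': [1, 2, 3, 4],
    'NflIdRusher': [1, 1, 3, 3],
    'PlayerCollegeName': ['Alabama', 'Ohio State', 'Ohio State', 'Alabama'],
    'S': [4.0, 2.0, 5.0, 1.0],
    'A': [2.0, 1.0, 3.0, 0.5],
    'Yards': [8, 8, 2, 2],
})

# Keep only the ball carrier's row of each play
rushers = toy[toy['NflId'] == toy['NflIdRusher']]

# Average yards, speed and acceleration per college, best yardage first
college_stats = (rushers.groupby('PlayerCollegeName')[['Yards', 'S', 'A']]
                 .mean()
                 .sort_values('Yards', ascending=False))
print(college_stats)
```

Without the rusher filter, every player on the field is credited with the play's yardage, which would blur any per-college comparison.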

**Let's look at the yards gained by player age**

In [None]:
data['VisitorTeamAbbr'].head()

In [None]:
print('There are '+str(data['VisitorTeamAbbr'].nunique())+' different VisitorTeamAbbr values')

In [None]:
data.groupby(['VisitorTeamAbbr','Yards'])['VisitorTeamAbbr'].count()

In [None]:
data.groupby(['VisitorTeamAbbr','A'])['VisitorTeamAbbr'].count()

In [None]:
data.groupby(['VisitorTeamAbbr','S'])['VisitorTeamAbbr'].count()

In [None]:
data.groupby(['VisitorTeamAbbr','Age'])['VisitorTeamAbbr'].count()

In [None]:
data.groupby(['Age','Yards'])['Age'].count()

**Older players seem to gain more yards than younger ones, which may be explained by their experience**
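
One way to put a number on the relationship between age and yards is to summarise the mean gain per age and the correlation between the two. The sketch below again assumes the ball carrier is the row where `NflId` equals `NflIdRusher`; `toy` is a made-up stand-in for `data` with an `Age` column already computed.

```python
import pandas as pd

# Toy stand-in for `data`: three plays, two player rows each (values are made up)
toy = pd.DataFrame({
    'NflId': [1, 2, 3, 4, 5, 6],
    'NflIdRusher': [1, 1, 3, 3, 5, 5],
    'Age': [23, 30, 30, 24, 23, 28],
    'Yards': [6, 6, 2, 2, 4, 4],
})

# Only the ball carrier's age is relevant to the yards gained on a play
rushers = toy[toy['NflId'] == toy['NflIdRusher']]

# Mean gain and number of carries per age, plus the linear correlation
age_summary = rushers.groupby('Age')['Yards'].agg(['mean', 'count'])
print(age_summary)
print('Correlation between Age and Yards:', rushers['Age'].corr(rushers['Yards']))
```

The `count` column matters here: ages with only a handful of carries can show extreme means that say little about experience.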

In [None]:
# PlayerHeight is a 'ft-in' string, so convert it to inches before comparing
height_in = data['PlayerHeight'].map(lambda h: int(h.split('-')[0])*12 + int(h.split('-')[1]))
print('Tallest player:',height_in.max(),'in')
print('Shortest player:',height_in.min(),'in')
print('Average height on the field:',height_in.mean(),'in')

**Let's look at the yards gained by player height**

In [None]:
data.groupby(['PlayerHeight','Yards'])['PlayerHeight'].count()

**Players with a height of 6-9 appear to gain the most yards**

**Let's take a look at the jersey number. I suspect that some specific numbers are given to good yard gainers**

In [None]:
JerseyNumber = data['JerseyNumber'].to_list()
len(set(JerseyNumber))

**There are 99 different jersey numbers**

In [None]:
data.groupby(['JerseyNumber','Yards'])['JerseyNumber'].count()

**As in association football, where the most famous players often wear number 10 on their shirt, it seems to be the same here but with the number 99**

**Let's take a look at player speed**

In [None]:
print('Fastest player speed:',data['S'].max(),'yards/second')
print('Slowest player speed:',data['S'].min(),'yards/second')
print('Average speed on the field:',data['S'].mean(),'yards/second')

In [None]:
data.groupby(['JerseyNumber','S'])['JerseyNumber'].count()

In [None]:
data.groupby(['JerseyNumber','Age'])['JerseyNumber'].count()

In [None]:
print('Highest player acceleration:',data['A'].max(),'yards/second^2')
print('Lowest player acceleration:',data['A'].min(),'yards/second^2')
print('Average acceleration on the field:',data['A'].mean(),'yards/second^2')

In [None]:
data.groupby(['JerseyNumber','A'])['JerseyNumber'].count()

In [None]:
print('Heaviest player:',data['PlayerWeight'].max(),'lbs')
print('Lightest player:',data['PlayerWeight'].min(),'lbs')
print('Average weight on the field:',data['PlayerWeight'].mean(),'lbs')

In [None]:
data.groupby(['JerseyNumber','PlayerWeight'])['JerseyNumber'].count()

**Again, players with jersey number 99 appear to be the fastest and to have the highest acceleration. This number seems to be given mostly to experienced players, since they are older than the others [29, 30, 31, 38,...]. Players with jersey number 99 also have the highest weight, which may explain their strength and muscle**
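
These jersey-number observations could be checked with a per-number summary instead of reading the grouped counts. In this sketch, speed and acceleration are averaged over all rows, while weight is averaged once per player, assuming `NflId` identifies a player; `toy` and its values are made up for illustration.

```python
import pandas as pd

# Toy stand-in for `data`; player 10 appears in two plays (values are made up)
toy = pd.DataFrame({
    'NflId': [10, 10, 20, 30],
    'JerseyNumber': [99, 99, 21, 99],
    'S': [1.0, 3.0, 5.0, 2.0],
    'A': [0.5, 1.5, 2.5, 1.0],
    'PlayerWeight': [300, 300, 210, 280],
})

# Tracking values change on every play, so average them over all rows of a number
motion = toy.groupby('JerseyNumber')[['S', 'A']].mean()

# Weight is fixed per player, so count each player once before averaging
players = toy.drop_duplicates('NflId')
weight = players.groupby('JerseyNumber')['PlayerWeight'].mean()

# One row per jersey number, fastest first
jersey_stats = motion.join(weight).sort_values('S', ascending=False)
print(jersey_stats)
```

Since the same number is worn by different players on different teams, any pattern found this way describes the kind of player who wears it rather than a single person.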

In [None]:
# catplot with kind='point' is the current name for the older factorplot
sns.catplot(x='Yards',y='Experience',col='Quarter',data=data,kind='point')
plt.show()

In [None]:
sns.catplot(x='Yards',y='Experience',col='Down',data=data,kind='point')
plt.show()

In [None]:
sns.catplot(x='Yards',y='Experience',col='PlayDirection',data=data,kind='point')
plt.show()

In [None]:
sns.catplot(x='Yards',y='Experience',col='Location',data=data,kind='point')
plt.show()
