# NFL Big Data Bowl

### How many yards will an NFL player gain after receiving a handoff?

Read the full data description on [here](https://www.kaggle.com/c/nfl-big-data-bowl-2020)

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.patches as patches
import seaborn as sns
import datetime
import math
import seaborn as sns

from string import punctuation

import tensorflow as tf
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder

In [2]:
train = pd.read_csv(r'train.csv')

  interactivity=interactivity, compiler=compiler, result=result)


In [3]:
train.head()

Unnamed: 0,GameId,PlayId,Team,X,Y,S,A,Dis,Orientation,Dir,...,Week,Stadium,Location,StadiumType,Turf,GameWeather,Temperature,Humidity,WindSpeed,WindDirection
0,2017090700,20170907000118,away,73.91,34.84,1.69,1.13,0.4,81.99,177.18,...,1,Gillette Stadium,"Foxborough, MA",Outdoor,Field Turf,Clear and warm,63.0,77.0,8,SW
1,2017090700,20170907000118,away,74.67,32.64,0.42,1.35,0.01,27.61,198.7,...,1,Gillette Stadium,"Foxborough, MA",Outdoor,Field Turf,Clear and warm,63.0,77.0,8,SW
2,2017090700,20170907000118,away,74.0,33.2,1.22,0.59,0.31,3.01,202.73,...,1,Gillette Stadium,"Foxborough, MA",Outdoor,Field Turf,Clear and warm,63.0,77.0,8,SW
3,2017090700,20170907000118,away,71.46,27.7,0.42,0.54,0.02,359.77,105.64,...,1,Gillette Stadium,"Foxborough, MA",Outdoor,Field Turf,Clear and warm,63.0,77.0,8,SW
4,2017090700,20170907000118,away,69.32,35.42,1.82,2.43,0.16,12.63,164.31,...,1,Gillette Stadium,"Foxborough, MA",Outdoor,Field Turf,Clear and warm,63.0,77.0,8,SW


# Data Dictionary

Each row in the file corresponds to a single player's involvement in a single play. The dataset was intentionally joined (i.e. denormalized) to make the API simple. All the columns are contained in one large dataframe which is grouped and provided by PlayId.

- GameId - a unique game identifier
- PlayId - a unique play identifier
- Team - home or away
- X - player position along the long axis of the field. See figure below.
- Y - player position along the short axis of the field. See figure below.
- S - speed in yards/second
- A - acceleration in yards/second^2
- Dis - distance traveled from prior time point, in yards
- Orientation - orientation of player (deg)
- Dir - angle of player motion (deg)
- NflId - a unique identifier of the player
- DisplayName - player's name
- JerseyNumber - jersey number
- Season - year of the season
- YardLine - the yard line of the line of scrimmage
- Quarter - game quarter (1-5, 5 == overtime)
- GameClock - time on the game clock
- PossessionTeam - team with possession
- Down - the down (1-4)
- Distance - yards needed for a first down
- c - which side of the field the play is happening on
- HomeScoreBeforePlay - home team score before play started
- VisitorScoreBeforePlay - visitor team score before play started
- NflIdRusher - the NflId of the rushing player
- OffenseFormation - offense formation
- OffensePersonnel - offensive team positional grouping
- DefendersInTheBox - number of defenders lined up near the line of scrimmage, spanning the width of the offensive line
- DefensePersonnel - defensive team positional grouping
- PlayDirection - direction the play is headed
- TimeHandoff - UTC time of the handoff
- TimeSnap - UTC time of the snap
- Yards - the yardage gained on the play (you are predicting this)
- PlayerHeight - player height (ft-in)
- PlayerWeight - player weight (lbs)
- PlayerBirthDate - birth date (mm/dd/yyyy)
- PlayerCollegeName - where the player attended college
- Position - the player's position (the specific role on the field that they typically play)
- HomeTeamAbbr - home team abbreviation
- VisitorTeamAbbr - visitor team abbreviation
- Week - week into the season
- Stadium - stadium where the game is being played
- Location - city where the game is being played
- StadiumType - description of the stadium environment
- Turf - description of the field surface
- GameWeather - description of the game weather
- Temperature - temperature (deg F)
- Humidity - humidity
- WindSpeed - wind speed in miles/hour
- WindDirection - wind direction

# Data Cleaning & Feature Engineering
The direction variable can be a bit confusion since it is the degree the player is moving relative to the play direction.  For this reason we will need to clean the variable such that the degree of player movement is representative of forward movement vs. backward movement.  We will adjust our variable so any angle between 0<x<180 represents forward movement, with 90 degree being straight down field, and 180 < x < 360 represents backward movement.

Lets adjust the PlayDirection variable to be 1 if the play is moving right and 0 if its moving left.

Also lets adjust the dir variable based on the direction of the play.

In [4]:

train['PlayDirection'] = train['PlayDirection'].apply(lambda x: x.strip() == 'right')

def new_orientation(angle, play_direction):
    if play_direction == 0:
        new_angle = 360.0 - angle
        if new_angle == 360.0:
            new_angle = 0.0
        return new_angle
    else:
        return angle

train['Orientation'] = train.apply(lambda row: new_orientation(row['Orientation'], row['PlayDirection']), axis=1)
train['Dir'] = train.apply(lambda row: new_orientation(row['Dir'], row['PlayDirection']), axis=1)    

In [5]:
# Lets see how many categorical columns we have and how we should handle those

cat_features = []

for col in train.columns:
    if train[col].dtype == 'object':
        cat_features.append((col, len(train[col].unique())))

cat_features

[('Team', 2),
 ('DisplayName', 2568),
 ('GameClock', 901),
 ('PossessionTeam', 32),
 ('FieldPosition', 33),
 ('OffenseFormation', 9),
 ('OffensePersonnel', 61),
 ('DefensePersonnel', 45),
 ('TimeHandoff', 30709),
 ('TimeSnap', 30721),
 ('PlayerHeight', 16),
 ('PlayerBirthDate', 1897),
 ('PlayerCollegeName', 314),
 ('Position', 25),
 ('HomeTeamAbbr', 32),
 ('VisitorTeamAbbr', 32),
 ('Stadium', 61),
 ('Location', 67),
 ('StadiumType', 34),
 ('Turf', 23),
 ('GameWeather', 74),
 ('WindSpeed', 70),
 ('WindDirection', 59)]

There are 23 categorical columns, but some of these look like they could be numerical, such as windspeed, height, time, gameclock.

In [6]:
train.columns

Index(['GameId', 'PlayId', 'Team', 'X', 'Y', 'S', 'A', 'Dis', 'Orientation',
       'Dir', 'NflId', 'DisplayName', 'JerseyNumber', 'Season', 'YardLine',
       'Quarter', 'GameClock', 'PossessionTeam', 'Down', 'Distance',
       'FieldPosition', 'HomeScoreBeforePlay', 'VisitorScoreBeforePlay',
       'NflIdRusher', 'OffenseFormation', 'OffensePersonnel',
       'DefendersInTheBox', 'DefensePersonnel', 'PlayDirection', 'TimeHandoff',
       'TimeSnap', 'Yards', 'PlayerHeight', 'PlayerWeight', 'PlayerBirthDate',
       'PlayerCollegeName', 'Position', 'HomeTeamAbbr', 'VisitorTeamAbbr',
       'Week', 'Stadium', 'Location', 'StadiumType', 'Turf', 'GameWeather',
       'Temperature', 'Humidity', 'WindSpeed', 'WindDirection'],
      dtype='object')

In [7]:
# Stadium Type

train.StadiumType.value_counts()

Outdoor                      362516
Outdoors                      92708
Indoors                       56826
Dome                          23122
Indoor                        19140
Retractable Roof              18766
Open                          11308
Retr. Roof-Closed             11044
Domed, closed                  6908
Retr. Roof - Closed            6446
Domed, open                    3696
Retr. Roof-Open                3014
Retractable Roof - Closed      2222
Closed Dome                    2134
Dome, closed                   1826
Domed                          1826
Domed, Open                    1760
OUTDOOR                        1254
Oudoor                         1188
indoor                         1166
Retr. Roof Closed              1056
Indoor, Roof Closed            1056
Outddors                        968
Bowl                            968
Heinz Field                     902
Retr. Roof - Open               880
Outdoor Retr Roof-Open          880
Outdor                      

### Stadium
Some of the data above appears to have some misspellings.  We need to clean these up.

In [8]:

# Removes punctuation marks
def stadium_clean(stadium):
    if pd.isna(stadium):
        return np.nan
    stadium = stadium.lower() # make all text characters lowercase
    stadium = ''.join([c for c in stadium if c not in punctuation])
    stadium = stadium.replace('outside','outdoor')
    stadium = stadium.replace('ourdoor','outdoor')
    stadium = stadium.replace('outdor','outdoor')
    stadium = stadium.replace('oudoor','outdoor')
    stadium = stadium.replace('outddors','outdoor')  
    stadium = stadium.replace('outdoors','outdoor')
    stadium = stadium.replace('indoors','indoor')
    stadium = stadium.replace('retr roofopen','retractable roof open')
    stadium = stadium.replace('retr roof  open','retractable roof open')
    stadium = stadium.replace('outdoor retr roofopen','retractable roof open')
    stadium = stadium.replace('outdoor retractable roof open','retractable roof open')
    stadium = stadium.replace('indoor roof open','retractable roof open')
    stadium = stadium.replace('indoor open roof','retractable roof open')
    stadium = stadium.replace('dome open','retractable roof open')
    stadium = stadium.replace('retr roofclosed','retractable roof closed')
    stadium = stadium.replace('retr roof  closed','retractable roof closed')
    stadium = stadium.replace('retr roof closed','retractable roof closed')
    stadium = stadium.replace('indoor roof closed','retractable roof closed')
    stadium = stadium.replace('retractable roof  closed','retractable roof closed')
    stadium = stadium.replace('dome closed','retractable roof  closed')
    stadium = stadium.replace('domed closed','retractable roof  closed')
    stadium = stadium.replace('domed','dome')
    
    # We will focus on whether it is open/outdoors vs indoor/closed.
    if 'outdoor' in stadium or 'open' in stadium:
        return 1
    if 'indoor' in stadium or 'closed' in stadium:
        return 0
    
train['StadiumType'] = train["StadiumType"].apply(stadium_clean)

train.StadiumType.value_counts() 

1.0    484286
0.0    109824
Name: StadiumType, dtype: int64

### Turf
From a few trials, cleaning the turf variable shows a slight improvement in model performance.

In [9]:
grass_labels = ['grass', 'natural grass', 'natural', 'naturall grass']
train['Grass'] = np.where(train.Turf.str.lower().isin([grass_labels]), 1, 0)

### Possession
There are some issues with inconsistencies in team name abbreciations.  We will fix these now.

In [10]:
train[(train.PossessionTeam != train.HomeTeamAbbr) & (train.PossessionTeam != train.VisitorTeamAbbr)][['PossessionTeam','HomeTeamAbbr','VisitorTeamAbbr']].drop_duplicates()



Unnamed: 0,PossessionTeam,HomeTeamAbbr,VisitorTeamAbbr
2992,BLT,CIN,BAL
4334,CLV,CLE,PIT
5060,ARZ,DET,ARI
5962,HST,HOU,JAX
14872,HST,CIN,HOU
...,...,...,...
640134,BLT,BAL,NE
645062,CLV,CLE,BUF
648714,ARZ,TB,ARI
673222,CLV,CLE,MIA


In [11]:
train[(train.PossessionTeam != train.HomeTeamAbbr) & (train.PossessionTeam != train.VisitorTeamAbbr)][['PossessionTeam']].drop_duplicates()

# BLT = BAL
# CLV = CLE
# ARZ = ARI
# HST = HOU

Unnamed: 0,PossessionTeam
2992,BLT
4334,CLV
5060,ARZ
5962,HST


In [12]:
abbr_mapper = {'BAL':'BLT','CLE':'CLV','ARI':'ARZ','HOU':'HST'}
train['PossessionTeam'] = train['PossessionTeam'].map(abbr_mapper).fillna(train['PossessionTeam'])
train['HomeTeamAbbr'] = train['HomeTeamAbbr'].map(abbr_mapper).fillna(train['HomeTeamAbbr'])
train['VisitorTeamAbbr'] = train['VisitorTeamAbbr'].map(abbr_mapper).fillna(train['VisitorTeamAbbr'])


### Home Team Possession

In [13]:
train['HomeTeamPossession'] = np.where(train.PossessionTeam == train.HomeTeamAbbr, 1,0)

### Opponents Territory / Field Position
We will create a flag if the team is in the opponents territory.

In [14]:
train['OpponentsTerritory'] = np.where(train.PossessionTeam != train.FieldPosition,1,0)

### Offense Formation
Lets one-hot encode these variables.

In [15]:
train = pd.concat([train.drop(['OffenseFormation'], axis=1), pd.get_dummies(train['OffenseFormation'], prefix='Formation')], axis=1)
dummy_col = train.columns

In [16]:
train.head()

Unnamed: 0,GameId,PlayId,Team,X,Y,S,A,Dis,Orientation,Dir,...,HomeTeamPossession,OpponentsTerritory,Formation_ACE,Formation_EMPTY,Formation_I_FORM,Formation_JUMBO,Formation_PISTOL,Formation_SHOTGUN,Formation_SINGLEBACK,Formation_WILDCAT
0,2017090700,20170907000118,away,73.91,34.84,1.69,1.13,0.4,278.01,182.82,...,1,0,0,0,0,0,0,1,0,0
1,2017090700,20170907000118,away,74.67,32.64,0.42,1.35,0.01,332.39,161.3,...,1,0,0,0,0,0,0,1,0,0
2,2017090700,20170907000118,away,74.0,33.2,1.22,0.59,0.31,356.99,157.27,...,1,0,0,0,0,0,0,1,0,0
3,2017090700,20170907000118,away,71.46,27.7,0.42,0.54,0.02,0.23,254.36,...,1,0,0,0,0,0,0,1,0,0
4,2017090700,20170907000118,away,69.32,35.42,1.82,2.43,0.16,347.37,195.69,...,1,0,0,0,0,0,0,1,0,0


## Game Clock



In [17]:
train['GameClock'].value_counts()

15:00:00    19008
02:00:00     7172
14:54:00     2794
14:55:00     2640
14:56:00     1694
            ...  
14:38:00      154
14:58:00      110
14:39:00       88
14:59:00       88
00:00:00       22
Name: GameClock, Length: 901, dtype: int64

In [18]:
def strtoseconds(txt):
    txt = txt.split(':')
    ans = int(txt[0])*60 + int(txt[1]) + int(txt[2])/60
    return ans

train['GameClock'] = train['GameClock'].apply(strtoseconds)


### Player Height, Birthdate, Time at Handoff, Time at snap



In [19]:

train['PlayerHeight'] = train['PlayerHeight'].apply(lambda x: 12*int(x.split('-')[0])+int(x.split('-')[1]))
train['TimeHandoff'] = train['TimeHandoff'].apply(lambda x: datetime.datetime.strptime(x, "%Y-%m-%dT%H:%M:%S.%fZ"))
train['TimeSnap'] = train['TimeSnap'].apply(lambda x: datetime.datetime.strptime(x, "%Y-%m-%dT%H:%M:%S.%fZ"))
train['TimeDelta'] = train.apply(lambda row: (row['TimeHandoff'] - row['TimeSnap']).total_seconds(), axis=1)
train['PlayerBirthDate'] = train['PlayerBirthDate'].apply(lambda x: datetime.datetime.strptime(x, "%m/%d/%Y"))
seconds_in_year = 60*60*24*365.25
train['PlayerAge'] = train.apply(lambda row: (row['TimeHandoff']-row['PlayerBirthDate']).total_seconds()/seconds_in_year, axis=1)
train = train.drop(['TimeHandoff', 'TimeSnap', 'PlayerBirthDate'], axis=1)

### Wind Speed and Direction

### Weather

We are going to apply the following preprocessing:

- Lower case
- N/A Indoor, N/A (Indoors) and Indoor => indoor Let's try to cluster those together.
- coudy and clouidy => cloudy
- party => partly
- sunny and clear => clear and sunny
- skies and mostly => ""



To encode our weather we are going to do the following map:

climate controlled or indoor => 3, sunny or sun => 2, clear => 1, cloudy => -1, rain => -2, snow => -3, others => 0
partly => multiply by 0.5
I don't have any expercience with american football so I don't know if playing in a climate controlled or indoor stadium is good or not, if someone has a good idea on how to encode this it would be nice to leave it in the comments :)

In [20]:
train.GameWeather.value_counts()

Cloudy           147114
Sunny            143088
Partly Cloudy     55880
Clear             55726
Mostly Cloudy     27962
                  ...  
Party Cloudy        814
Sunny, Windy        792
Light rain          792
Partly clear        770
Rainy               704
Name: GameWeather, Length: 73, dtype: int64

In [21]:
train['GameWeather'] = train['GameWeather'].str.lower()
indoor = "indoor"
train['GameWeather'] = train['GameWeather'].apply(lambda x: indoor if not pd.isna(x) and indoor in x else x)
train['GameWeather'] = train['GameWeather'].apply(lambda x: x.replace('coudy', 'cloudy').replace('clouidy', 'cloudy').replace('party', 'partly') if not pd.isna(x) else x)
train['GameWeather'] = train['GameWeather'].apply(lambda x: x.replace('clear and sunny', 'sunny and clear') if not pd.isna(x) else x)
train['GameWeather'] = train['GameWeather'].apply(lambda x: x.replace('skies', '').replace("mostly", "").strip() if not pd.isna(x) else x)

In [22]:
def map_weather(txt):
    ans = 1
    if pd.isna(txt):
        return 0
    if 'partly' in txt:
        ans*=0.5
    if 'climate controlled' in txt or 'indoor' in txt:
        return ans*3
    if 'sunny' in txt or 'sun' in txt:
        return ans*2
    if 'clear' in txt:
        return ans
    if 'cloudy' in txt:
        return -ans
    if 'rain' in txt or 'rainy' in txt:
        return -2*ans
    if 'snow' in txt:
        return -3*ans
    return 0

In [23]:
train['GameWeather'] = train['GameWeather'].apply(map_weather)

### RusherID flag

In [24]:
train['IsRusher'] = np.where(train.NflId == train.NflIdRusher,1,0)
#train.drop(['NflId','NflIdRusher'], axis = 1, inplace = True)

#### Same Fields
- GameId
- PlayId
- Season
- YardLine
- Quarter
- GameClock
- Down
- Distance
- FieldPosition
- HomeScoreBeforePlay
- VisitorScoreBeforePlay
- NflIdRusher
- OffensePersonnel
- DefendersInTheBox
- DefensePersonnel
- PlayDirection
- Yards
- HomeTeamAbbr
- VisitorTeamAbbr
- Week
- Stadium
- Location
- StatiumType
- Turf
- GameWeather
- Temperature
- Humidity
- WindSpeed
- WindDirection
- Grass
- HomeTeamPossession
- OpponenentsTerritory
- Formation_ACE
- Formation_EMPTY
- Formation_I_FORM
- Formation_JUMBO
- Formation_PISTOL
- Formation_SHOTGUN
- Formation_SINGLEBACK
- Formation_WILDCAT
- TimeDelta
- 

#### Different fields
- Team
- X
- Y
- S
- A
- Orientation
- Dir
- NflId
- DisplayName
- JerseyNumber
- PossessionTeam
- PlayerHeight
- PlayerWeight
- PlayerCollegeName
- Position
- PlayerAge
- IsRusher

In [25]:
#train[train.PlayId ==20170907000118].iloc[:,55:65]
train.columns

Index(['GameId', 'PlayId', 'Team', 'X', 'Y', 'S', 'A', 'Dis', 'Orientation',
       'Dir', 'NflId', 'DisplayName', 'JerseyNumber', 'Season', 'YardLine',
       'Quarter', 'GameClock', 'PossessionTeam', 'Down', 'Distance',
       'FieldPosition', 'HomeScoreBeforePlay', 'VisitorScoreBeforePlay',
       'NflIdRusher', 'OffensePersonnel', 'DefendersInTheBox',
       'DefensePersonnel', 'PlayDirection', 'Yards', 'PlayerHeight',
       'PlayerWeight', 'PlayerCollegeName', 'Position', 'HomeTeamAbbr',
       'VisitorTeamAbbr', 'Week', 'Stadium', 'Location', 'StadiumType', 'Turf',
       'GameWeather', 'Temperature', 'Humidity', 'WindSpeed', 'WindDirection',
       'Grass', 'HomeTeamPossession', 'OpponentsTerritory', 'Formation_ACE',
       'Formation_EMPTY', 'Formation_I_FORM', 'Formation_JUMBO',
       'Formation_PISTOL', 'Formation_SHOTGUN', 'Formation_SINGLEBACK',
       'Formation_WILDCAT', 'TimeDelta', 'PlayerAge', 'IsRusher'],
      dtype='object')

In [26]:
rusher_coords = train[train.IsRusher==1][['PlayId','X','Y']]
rusher_coords.rename(columns = {'X':'Rusher_X','Y':'Rusher_Y'}, inplace = True)
train = pd.merge(train, rusher_coords, how = 'left', on = 'PlayId')

In [28]:
#distance_from_rusher(2,4,7,9)

In [29]:
def distance_from_rusher(player_x, player_y, rusher_x, rusher_y):
    #player_x.astype(float)
    
    DistanceFromRusher = math.sqrt((player_x-rusher_x)**2 + (player_y-rusher_y)**2) + np.random.normal(0,.001,1)[0]
    return DistanceFromRusher

train['DistanceFromRusher'] = train.apply(lambda x: distance_from_rusher(x.X, x.Y, x.Rusher_X, x.Rusher_Y), axis=1)




In [30]:
np.random.normal(0,.001,1)[0]

-0.0009000331538326386

In [31]:
train['Rank'] = train.groupby(['PlayId','Team']).DistanceFromRusher.rank('dense', ascending = True)
train['TeamRank']= train.Team + train.Rank.astype(str)
train.head()

Unnamed: 0,GameId,PlayId,Team,X,Y,S,A,Dis,Orientation,Dir,...,Formation_SINGLEBACK,Formation_WILDCAT,TimeDelta,PlayerAge,IsRusher,Rusher_X,Rusher_Y,DistanceFromRusher,Rank,TeamRank
0,2017090700,20170907000118,away,73.91,34.84,1.69,1.13,0.4,278.01,182.82,...,0,0,1.0,28.69276,0,78.75,30.53,6.479059,4.0,away4.0
1,2017090700,20170907000118,away,74.67,32.64,0.42,1.35,0.01,332.39,161.3,...,0,0,1.0,28.457305,0,78.75,30.53,4.593417,1.0,away1.0
2,2017090700,20170907000118,away,74.0,33.2,1.22,0.59,0.31,356.99,157.27,...,0,0,1.0,28.62979,0,78.75,30.53,5.448816,3.0,away3.0
3,2017090700,20170907000118,away,71.46,27.7,0.42,0.54,0.02,0.23,254.36,...,0,0,1.0,34.79543,0,78.75,30.53,7.819091,6.0,away6.0
4,2017090700,20170907000118,away,69.32,35.42,1.82,2.43,0.16,347.37,195.69,...,0,0,1.0,30.061685,0,78.75,30.53,10.621108,8.0,away8.0


In [32]:
train_melt = pd.melt(train, id_vars = ['PlayId','TeamRank'], value_vars = ['X','Y','S','A','Orientation','Dir'])
train_melt['NewColumnName'] =  train_melt.TeamRank + train_melt.variable
train_melt = train_melt[['PlayId','NewColumnName','value']]

In [33]:
#train_melt[train_melt.duplicated()==True]
#train_melt = train_melt.set_index('PlayId')
train_melt = train_melt.pivot(index = 'PlayId', columns = 'NewColumnName',values = 'value')


In [34]:
train_melt = train_melt.reset_index()

In [35]:
# Removed OffensePersonnel - Might want to create columns for the various types of offense quantities
train_same = train[train.IsRusher==1][[
    'PlayId', 'Season','YardLine','Quarter','GameClock','Down','Distance','HomeScoreBeforePlay','VisitorScoreBeforePlay',
    'DefendersInTheBox','PlayDirection','Week','StadiumType','Grass','HomeTeamPossession','OpponentsTerritory',
    'Formation_ACE','Formation_EMPTY','Formation_I_FORM','Formation_JUMBO','Formation_PISTOL','Formation_SHOTGUN',
    'Formation_SINGLEBACK','Formation_WILDCAT','TimeDelta','Yards']]

    

In [36]:
train_final = pd.merge(train_same, train_melt, on = 'PlayId',how = 'left')

In [37]:
# Make sure all categorical variables are gone

cat_features = []

for col in train_final.columns:
    if train_final[col].dtype=='object':
        cat_features.append(col)
        
cat_features

[]

### Yards Left
We will calculate how many yards left to go to the endzone.  We will also need to clean up some data issues where yards left < yards.

In [None]:
#train['HomeField'] = train['FieldPosition'] == train['HomeTeamAbbr']
#train['YardsLeft'] = train.apply(lambda row: 100-row['YardLine'] if row['HomeField'] else row['YardLine'], axis=1)
#train['YardsLeft'] = train.apply(lambda row: row['YardsLeft'] if row['PlayDirection'] else 100-row['YardsLeft'], axis=1)
#train.drop(train.index[(train['YardsLeft']<train['Yards']) | (train['YardsLeft']-100>train['Yards'])], inplace=True)

# Split to Train & Test

In [118]:
cols = train_final.columns
train_final[cols] = train_final[cols].fillna(0)

In [127]:
train_final[['Formation_ACE']].head()

Unnamed: 0,Formation_ACE
0,0
1,0
2,0
3,0
4,0


In [177]:
train_df = train_final[train_final.Season ==2019]
test_df = train_final[train_final.Season !=2019]

In [178]:
y_train = train_df.pop('Yards')
y_test = test_df.pop('Yards')

## Normalize the Data
We will first save all our stats to a train_stats

In [128]:
train_stats = train_df.describe()
train_stats = train_stats.transpose()

test_stats = test_df.describe()
test_stats = train_stats.transpose()




In [132]:
from sklearn import preprocessing

x = train_df.values
min_max_scaler  = preprocessing.MinMaxScaler()
x_scaled = min_max_scaler.fit_transform(x)
norm_train_data = pd.DataFrame(x_scaled, columns = train_df.columns)

x = test_df.values
min_max_scaler  = preprocessing.MinMaxScaler()
x_scaled = min_max_scaler.fit_transform(x)
norm_test_data = pd.DataFrame(x_scaled,columns = test_df.columns)




In [133]:
norm_train_data.head()

Unnamed: 0,PlayId,Season,YardLine,Quarter,GameClock,Down,Distance,HomeScoreBeforePlay,VisitorScoreBeforePlay,DefendersInTheBox,...,home8.0Orientation,home8.0S,home8.0X,home8.0Y,home9.0A,home9.0Dir,home9.0Orientation,home9.0S,home9.0X,home9.0Y
0,0.0,0.0,0.489796,0.0,1.0,0.0,0.264706,0.0,0.0,0.555556,...,0.686155,0.458384,0.269806,0.710335,0.391026,0.830417,0.737091,0.499418,0.352329,0.554706
1,6.590853e-07,0.0,0.959184,0.0,0.836485,0.0,0.264706,0.0,0.0,0.555556,...,0.501668,0.545235,0.537649,0.263464,0.345085,0.217194,0.714357,0.46915,0.512968,0.281873
2,1.604523e-06,0.0,0.22449,0.0,0.636263,0.333333,0.264706,0.0,0.0,0.444444,...,0.502752,0.551267,0.170123,0.674187,0.060897,0.417667,0.736702,0.147846,0.251507,0.519703
3,1.977241e-06,0.0,0.714286,0.0,0.525028,0.0,0.264706,0.0,0.0,0.444444,...,0.239505,0.594692,0.349963,0.25279,0.356838,0.249611,0.794564,0.514552,0.365571,0.576263
4,2.127243e-06,0.0,0.632653,0.0,0.479422,0.333333,0.147059,0.0,0.0,0.444444,...,0.293939,0.091677,0.338845,0.210092,0.305556,0.247778,0.266244,0.562282,0.3221,0.590403


# Change to TensorFlow dataset

In [134]:
def make_input_fn(data_df, label_df, num_epochs=10, shuffle=True, batch_size=32):
  def input_function():  # inner function, this will be returned
    ds = tf.data.Dataset.from_tensor_slices((dict(data_df), label_df))  # create tf.data.Dataset object with data and its label
    if shuffle:
      ds = ds.shuffle(1000)  # randomize order of data
    ds = ds.batch(batch_size).repeat(num_epochs)  # split dataset into batches of 32 and repeat process for number of epochs
    return ds  # return a batch of the dataset
  return input_function  # return a function object for use

train_input_fn = make_input_fn(norm_train_data, y_train)  # here we will call the input_function that was returned to us to get a dataset object we can feed to the model
eval_input_fn = make_input_fn(norm_test_data, y_test, num_epochs=1, shuffle=False)

In [135]:
features = norm_train_data.columns
feature_cols = [tf.feature_column.numeric_column(k) for k in features]		
#linear_est = tf.estimator.LinearRegressor(feature_columns = feature_cols)
linear_est = tf.estimator.LinearRegressor( 
                                   feature_columns = feature_cols)

INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_model_dir': 'C:\\Users\\tacke\\AppData\\Local\\Temp\\tmpb1sye5jg', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': None, '_save_checkpoints_secs': 600, '_session_config': allow_soft_placement: true
graph_options {
  rewrite_options {
    meta_optimizer_iterations: ONE
  }
}
, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 100, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_experimental_max_worker_delay_secs': None, '_session_creation_timeout_secs': 7200, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x000002493443AE48>, '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}


In [136]:
linear_est.train(train_input_fn)  # train


INFO:tensorflow:Calling model_fn.


To change all layers to have dtype float64 by default, call `tf.keras.backend.set_floatx('float64')`. To change just this layer, pass dtype='float64' to the layer constructor. If you are the author of this layer, you can disable autocasting by passing autocast=False to the base Layer constructor.

INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Saving checkpoints for 0 into C:\Users\tacke\AppData\Local\Temp\tmpb1sye5jg\model.ckpt.
INFO:tensorflow:loss = 51.0, step = 0
INFO:tensorflow:global_step/sec: 37.3552
INFO:tensorflow:loss = 62.767105, step = 100 (2.685 sec)
INFO:tensorflow:global_step/sec: 116.17
INFO:tensorflow:loss = 14.778769, step = 200 (0.853 sec)
INFO:tensorflow:global_step/sec: 115.065
INFO:tensorflow:loss = 31.494305, step = 300 (0.879 sec)
INFO:tensorflow:global_step

<tensorflow_estimator.python.estimator.canned.linear.LinearRegressorV2 at 0x2493443a048>

In [139]:
result = linear_est.evaluate(eval_input_fn)  # get model metrics/stats by testing on tetsing data
print(result)

INFO:tensorflow:Calling model_fn.


To change all layers to have dtype float64 by default, call `tf.keras.backend.set_floatx('float64')`. To change just this layer, pass dtype='float64' to the layer constructor. If you are the author of this layer, you can disable autocasting by passing autocast=False to the base Layer constructor.

INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Starting evaluation at 2020-08-29T07:15:05Z
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from C:\Users\tacke\AppData\Local\Temp\tmpb1sye5jg\model.ckpt-2450
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Finished evaluation at 2020-08-29-07:15:16
INFO:tensorflow:Saving dict for global step 2450: average_loss = 41.675915, global_step = 2450, label/mean = 4.212334, loss = 41.650753, prediction/mean = 3.361598
INFO:tensorflow:Saving 'checkpoint_path' summary for global step 2450: C:\Users\tacke\AppData\Local\Temp\tmpb1sye5jg\mo

# Build Model

In [196]:
del model
def build_model():
    model = tf.keras.Sequential()
    # Adds a densely-connected layer with 64 units to the model:
    model.add(tf.keras.layers.Dense(64, activation='relu'))
    # Add another:
    model.add(tf.keras.layers.Dense(64, activation='relu'))
    # Add a softmax layer with 10 output units:
    model.add(tf.keras.layers.Dense(1, activation='relu'))
    
    model.compile(loss='mse',
                optimizer='adam',
                metrics=['mae', 'mse'])
    return model

In [199]:


model = build_model()

EPOCHS = 100

history = model.fit(
  norm_train_data.values, y_train.values,
  epochs=EPOCHS, validation_split = 0.2, verbose=1
  )

Train on 6268 samples, validate on 1568 samples
Epoch 1/100
Epoch 2/100
Epoch 3/100
Epoch 4/100
Epoch 5/100
Epoch 6/100
Epoch 7/100
Epoch 8/100
Epoch 9/100
Epoch 10/100
Epoch 11/100
Epoch 12/100
Epoch 13/100
Epoch 14/100
Epoch 15/100
Epoch 16/100
Epoch 17/100
Epoch 18/100
Epoch 19/100
Epoch 20/100
Epoch 21/100
Epoch 22/100
Epoch 23/100
Epoch 24/100
Epoch 25/100
Epoch 26/100
Epoch 27/100
Epoch 28/100
Epoch 29/100
Epoch 30/100
Epoch 31/100
Epoch 32/100
Epoch 33/100
Epoch 34/100
Epoch 35/100
Epoch 36/100
Epoch 37/100
Epoch 38/100
Epoch 39/100
Epoch 40/100
Epoch 41/100
Epoch 42/100
Epoch 43/100
Epoch 44/100
Epoch 45/100
Epoch 46/100
Epoch 47/100


Epoch 48/100
Epoch 49/100
Epoch 50/100
Epoch 51/100
Epoch 52/100
Epoch 53/100
Epoch 54/100
Epoch 55/100
Epoch 56/100
Epoch 57/100
Epoch 58/100
Epoch 59/100
Epoch 60/100
Epoch 61/100
Epoch 62/100
Epoch 63/100
Epoch 64/100
Epoch 65/100
Epoch 66/100
Epoch 67/100
Epoch 68/100
Epoch 69/100
Epoch 70/100
Epoch 71/100
Epoch 72/100
Epoch 73/100
Epoch 74/100
Epoch 75/100
Epoch 76/100
Epoch 77/100
Epoch 78/100
Epoch 79/100
Epoch 80/100
Epoch 81/100
Epoch 82/100
Epoch 83/100
Epoch 84/100
Epoch 85/100
Epoch 86/100
Epoch 87/100
Epoch 88/100
Epoch 89/100
Epoch 90/100
Epoch 91/100
Epoch 92/100
Epoch 93/100


Epoch 94/100
Epoch 95/100
Epoch 96/100
Epoch 97/100
Epoch 98/100
Epoch 99/100
Epoch 100/100


In [200]:
predicted = model.predict(norm_test_data.values)

array([[5.7965174 , 0.        , 6.019923  , ..., 5.9124365 , 6.080338  ,
        6.0235453 ],
       [2.052656  , 0.        , 2.131425  , ..., 2.1318884 , 2.0659988 ,
        2.205007  ],
       [3.793678  , 0.        , 3.9661932 , ..., 3.8805876 , 3.8693492 ,
        4.099879  ],
       ...,
       [0.59189147, 0.        , 0.3201431 , ..., 0.20799628, 0.19303831,
        0.0876514 ],
       [0.6812155 , 0.        , 0.38169417, ..., 0.30381218, 0.23729801,
        0.05304793],
       [2.510951  , 0.        , 2.5177283 , ..., 2.485957  , 2.4351315 ,
        2.5881333 ]], dtype=float32)

In [203]:
final_df = y_test.copy()
pred_df = pd.DataFrame( data = predicted, columns =['Predicted'])
pred_df['Actual'] = final_df
pred_df

Unnamed: 0,Predicted,Actual
0,21.747719,8
1,5.600811,3
2,4.176294,5
3,0.000000,2
4,1.465611,7
...,...,...
23166,0.000000,1
23167,2.684316,4
23168,2.273566,4
23169,2.666243,2


In [192]:
pred_df.shape

(23171, 1)

In [27]:
# First pass lets use non categorical variables

train_baseline = train.sort_values(by =['PlayId','Team','JerseyNumber']).reset_index()
train_baseline.drop(['GameId','PlayId','Team','index'], axis=1, inplace = True)


In [28]:
# For the baseline model we will drop all the categorical features.

cat_features = []

for col in train_baseline.columns:
    if train[col].dtype=='object':
        cat_features.append(col)

train_baseline = train_baseline.drop(cat_features, axis = 1)

In [29]:
# Unpivot all the rows to 1 row for each play

train_baseline.fillna(-999, inplace = True)

players_col = []
for col in train_baseline.columns:
    if train_baseline[col][:22].std()!=0:
        players_col.append(col)
        
        

In [38]:
train.NflId.unique()

AttributeError: 'DataFrame' object has no attribute 'NflId'

In [30]:
X_train = np.array(train_baseline[players_col]).reshape(-1, len(players_col)*22)

In [150]:
play_col = train.drop(players_col+['Yards'], axis=1).columns
X_play_col = np.zeros(shape=(X_train.shape[0], len(play_col)))
for i, col in enumerate(play_col):
    X_play_col[:, i] = train[col][::22]

ValueError: could not convert string to float: 'away'