# NFL Data Analysis

In this notebook, I examine the 2015 NFL play-by-play data posted on kaggle.com. I look for trends and do some initial data analysis before any ML or prediction, recording my steps and inferences along the way so that I learn how to analyze datasets and build pipelines for them.

In [10]:
#Import all needed packages such as pandas and numpy
import re
import numpy as np
import pandas as pd
import os
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
import statsmodels.formula.api as smf
import statsmodels.api as sm

# Disable annoying SettingWithCopyWarning
pd.options.mode.chained_assignment = None

path = '/Users/Gerrit/Desktop/nflplaybyplay2015.csv'

#Now reading in the NFL dataset from kaggle.com using pandas
data = pd.read_csv(path)

data.head(5)



,Unnamed: 0,Date,GameID,Drive,qtr,down,time,TimeUnder,TimeSecs,PlayTimeDiff,...,Accepted.Penalty,PenalizedTeam,PenaltyType,PenalizedPlayer,Penalty.Yards,PosTeamScore,DefTeamScore,ScoreDiff,AbsScoreDiff,Season
0,36,2015-09-10,2015091000,1,1,,15:00,15.0,3600.0,0.0,...,0,,,,0,0.0,0.0,0.0,0.0,2015
1,51,2015-09-10,2015091000,1,1,1.0,15:00,15.0,3600.0,0.0,...,0,,,,0,0.0,0.0,0.0,0.0,2015
2,72,2015-09-10,2015091000,1,1,1.0,14:21,15.0,3561.0,39.0,...,0,,,,0,0.0,0.0,0.0,0.0,2015
3,101,2015-09-10,2015091000,1,1,2.0,14:04,15.0,3544.0,17.0,...,0,,,,0,0.0,0.0,0.0,0.0,2015
4,122,2015-09-10,2015091000,1,1,1.0,13:26,14.0,3506.0,38.0,...,0,,,,0,0.0,0.0,0.0,0.0,2015


In [11]:
data.describe()



,Unnamed: 0,GameID,Drive,qtr,down,TimeUnder,TimeSecs,PlayTimeDiff,yrdln,yrdline100,...,Fumble,Sack,Challenge.Replay,Accepted.Penalty,Penalty.Yards,PosTeamScore,DefTeamScore,ScoreDiff,AbsScoreDiff,Season
count,46129.0,46129.0,46129.0,46129.0,39006.0,46102.0,46102.0,46075.0,46021.0,46021.0,...,46129.0,46129.0,46129.0,46129.0,46129.0,42878.0,42878.0,42878.0,42878.0,46129.0
mean,55648.646556,2015164000.0,12.279607,2.583407,1.996949,7.326038,1686.735847,20.214585,28.56509,49.411312,...,0.013592,0.027401,0.008953,0.076633,0.652388,10.802066,11.86401,-1.061943,7.977797,2015.0
std,87501.612069,218316.4,7.144244,1.134256,1.003834,4.659934,1065.494471,17.613538,12.591719,24.852971,...,0.115792,0.163252,0.094198,0.266011,2.716825,9.670936,10.138756,10.946834,7.570614,0.0
min,35.0,2015091000.0,1.0,1.0,1.0,0.0,-747.0,0.0,1.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-41.0,0.0,2015.0
25%,12661.0,2015101000.0,6.0,2.0,,,,,,,...,0.0,0.0,0.0,0.0,0.0,,,,,2015.0
50%,26032.0,2015111000.0,12.0,3.0,,,,,,,...,0.0,0.0,0.0,0.0,0.0,,,,,2015.0
75%,39415.0,2015121000.0,18.0,4.0,,,,,,,...,0.0,0.0,0.0,0.0,0.0,,,,,2015.0
max,466111.0,2016010000.0,33.0,5.0,4.0,15.0,3600.0,940.0,50.0,99.0,...,1.0,1.0,1.0,1.0,55.0,52.0,51.0,41.0,41.0,2015.0


In [9]:
#This is the collection of column labels.
data.columns

Index([u'Unnamed: 0', u'Date', u'GameID', u'Drive', u'qtr', u'down', u'time',
       u'TimeUnder', u'TimeSecs', u'PlayTimeDiff', u'SideofField', u'yrdln',
       u'yrdline100', u'ydstogo', u'ydsnet', u'GoalToGo', u'FirstDown',
       u'posteam', u'DefensiveTeam', u'desc', u'PlayAttempted',
       u'Yards.Gained', u'sp', u'Touchdown', u'ExPointResult', u'TwoPointConv',
       u'DefTwoPoint', u'Safety', u'PlayType', u'Passer', u'PassAttempt',
       u'PassOutcome', u'PassLength', u'PassLocation', u'InterceptionThrown',
       u'Interceptor', u'Rusher', u'RushAttempt', u'RunLocation', u'RunGap',
       u'Receiver', u'Reception', u'ReturnResult', u'Returner', u'Tackler1',
       u'Tackler2', u'FieldGoalResult', u'FieldGoalDistance', u'Fumble',
       u'RecFumbTeam', u'RecFumbPlayer', u'Sack', u'Challenge.Replay',
       u'ChalReplayResult', u'Accepted.Penalty', u'PenalizedTeam',
       u'PenaltyType', u'PenalizedPlayer', u'Penalty.Yards', u'PosTeamScore',
       u'DefTeamScore', u'ScoreDif

In [13]:
#Take a look at the levels of any column that seems ambiguous, starting with PlayType.
data['PlayType'].unique()

array(['Kickoff', 'Run', 'Pass', 'Sack', 'No Play', 'Field Goal', 'Punt',
       'QB Kneel', 'Onside Kick', 'End of Game', 'Spike'], dtype=object)

I will create separate data sets for pass plays, run plays, kickoffs, field goals, etc.

In [36]:
#This groups the data by play type, so all the passes, runs, kickoffs, etc.
# end up in separate groups. There are 11 play types (see the unique() output above).
#The filter to play types with more than 500 training examples is not applied in this cell yet.
play_types = data.groupby("PlayType")
print(type(play_types))

<class 'pandas.core.groupby.DataFrameGroupBy'>
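
The grouping and the 500-example filter mentioned in the comments could be combined roughly as follows. This is only a sketch on a made-up toy frame: `split_by_play_type` and `min_rows` are hypothetical names, and the threshold is lowered to 2 so the toy data shows the effect. Only the `PlayType` and `Yards.Gained` column names come from the dataset.

```python
import pandas as pd

# Toy stand-in for the play-by-play frame (values are invented for illustration).
toy = pd.DataFrame({
    "PlayType": ["Pass"] * 4 + ["Run"] * 3 + ["Spike"],
    "Yards.Gained": [5, 0, 12, 7, 3, -1, 4, 0],
})

def split_by_play_type(df, min_rows):
    """Return {play type: sub-frame}, keeping only types with more than min_rows rows."""
    return {
        name: group
        for name, group in df.groupby("PlayType")
        if len(group) > min_rows
    }

frames = split_by_play_type(toy, min_rows=2)
print(sorted(frames))       # play types that survived the filter
print(len(frames["Pass"]))  # number of pass plays kept
```

On the real data the call would presumably be `split_by_play_type(data, min_rows=500)`, after which `frames["Pass"]`, `frames["Run"]`, and so on are ordinary DataFrames.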


Go through and eliminate the no plays and spikes.
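
One possible way to drop those rows is a boolean mask built with `isin`. The sketch below runs on an invented toy frame; the labels `'No Play'` and `'Spike'` are taken from the `unique()` output above, while `unwanted` and `cleaned` are hypothetical names.

```python
import pandas as pd

# Toy stand-in for the play-by-play frame (values are invented for illustration).
toy = pd.DataFrame({
    "PlayType": ["Pass", "No Play", "Run", "Spike", "Punt"],
    "Yards.Gained": [8, 0, 3, 0, 0],
})

# Play types to discard.
unwanted = ["No Play", "Spike"]

# isin builds a boolean mask; ~ inverts it so only the other rows are kept.
cleaned = toy[~toy["PlayType"].isin(unwanted)].reset_index(drop=True)
print(cleaned["PlayType"].tolist())
```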

In [26]:
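# Collect the distinct play types in a plain list.
play_types = data['PlayType'].unique().tolist()

#Now creating different data sets by individual play type.
# Assumption: each data set is selected with a boolean mask on PlayType.
pass_plays = data[data['PlayType'] == 'Pass']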
    



Some important questions that an NFL dataset might help answer:
Do defenses slack off at the end of quarters?
    Look at the relationship between the end of quarters/games and successful plays.
Do passes/runs have more success at the beginning/end of quarters/games?
Is there a relationship between effort and the score?
    Look at the correlation between point differential and the success of a play.
    From this, try to predict lazy or underperforming defenses.
Is there a relationship between effort and how far into the season it is?
    Look for a relationship between the date and the success of plays.
Is there a point of no return? That is, is there a trend showing that teams do not come back from a certain point differential at a certain time in the game?
    We can even apply this to a whole season: is there a point in the season when a certain record will not get you into the playoffs?
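
A first pass at two of these questions might look like the sketch below. It runs on an invented toy frame and rests on two assumptions: that `Yards.Gained` is an acceptable stand-in for "success", and that `TimeUnder` is the number of minutes remaining in the quarter, rounded up (which is how it appears in the `head()` output). The `late_in_qtr` column is a hypothetical helper.

```python
import pandas as pd

# Toy stand-in for the play-by-play frame (values are invented for illustration).
toy = pd.DataFrame({
    "AbsScoreDiff": [0, 3, 7, 10, 14, 21, 24, 28],
    "Yards.Gained": [2, 4, 3, 6, 5, 9, 8, 11],
    "TimeUnder": [15, 9, 2, 12, 1, 7, 2, 1],
})

# Pearson correlation between the size of the score gap and yards gained.
corr = toy["AbsScoreDiff"].corr(toy["Yards.Gained"])

# Hypothetical "end of quarter" flag: two minutes or less left in the quarter.
toy["late_in_qtr"] = toy["TimeUnder"] <= 2

# Average yards gained late in a quarter versus the rest of the quarter.
by_phase = toy.groupby("late_in_qtr")["Yards.Gained"].mean()

print(round(corr, 3))
print(by_phase)
```

A correlation on its own says nothing about effort, so this is only a starting point for the questions above.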

Next, find the features that are real-valued.
Get dummy variables for the direction columns (left, middle, right, etc.).
Then append all possible combinations/interactions between features.
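
These three steps could be sketched as follows on an invented toy frame. The column names `ydstogo`, `yrdline100`, and `RunLocation` come from the dataset, but the direction values and every other name here are assumptions. Interactions are taken to be pairwise products, which is one common choice rather than the only one.

```python
from itertools import combinations

import pandas as pd

# Toy stand-in for the play-by-play frame (values are invented for illustration).
toy = pd.DataFrame({
    "ydstogo": [10, 7, 3, 10],
    "yrdline100": [80, 73, 45, 20],
    "RunLocation": ["left", "middle", "right", "left"],
})

# 1. Real-valued features: keep only the numeric columns.
numeric = toy.select_dtypes(include="number")

# 2. Dummy (one-hot) variables for the direction column.
dummies = pd.get_dummies(toy["RunLocation"], prefix="RunLocation", dtype=int)

# 3. Pairwise interactions: the product of every pair of feature columns.
features = pd.concat([numeric, dummies], axis=1)
interactions = pd.DataFrame({
    f"{a}*{b}": features[a] * features[b]
    for a, b in combinations(features.columns, 2)
})

design = pd.concat([features, interactions], axis=1)
print(design.shape)
```

Products of two dummies from the same column are always zero, so those columns can be dropped before modelling.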

In [None]:
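#This is a categorical scatterplot by which direction the play was run.
#Assumption: RunLocation holds the direction and Yards.Gained the outcome; swap in other columns as needed.
run_sample = data[data['PlayType'] == 'Run'].sample(500, random_state=0)  # swarmplot is slow on many points
sns.swarmplot(x="RunLocation", y="Yards.Gained", data=run_sample)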