The purpose of this project is to predict the probability that a quarterback succeeds in the NFL. Success is defined as three or more years on an NFL roster. Two techniques are used: logistic regression and random forests.

The regression will be based on the following explanatory variables:

a. The number of college games started: A quarterback who is an NFL prospect should be his team's starter and should therefore play in more games. More starts mean more game experience, which should make him a better player.  

b. The pass completion percentage: The number of passes completed divided by the number of passes attempted. Dividing by attempts accounts for differences in passing volume, which varies with the opponent and with the offensive system each quarterback plays in; some teams rely on the pass more than others. Completion percentage is also a measure of quarterback accuracy.  

c. Yards per Attempt: The number of yards gained on completed forward passes divided by the number of pass attempts. A high yards-per-attempt figure suggests that the quarterback throws more long passes than short ones.  

d. Passer Efficiency Rating: A measure of quarterback efficiency, given by the following formula:  

$$\text{Eff} = \frac{8.4\,(\text{Yds Gained}) + 330\,(\text{Num Touchdowns}) + 100\,(\text{Num Completions}) - 200\,(\text{Num Interceptions})}{\text{Num Pass Attempts}}$$  

e. Touchdown to Interception Ratio: A measure of a quarterback's decision making. A low TD/INT ratio suggests that he has trouble reading opposing defensive coverage schemes.  

f. Passing Yards per Game: A measure of passing success in a game. A quarterback who throws for a high number of yards per game usually wins more games than he loses, and high yardage is also indicative of good decision making.  

g. Touchdowns per Attempt: The number of touchdowns thrown in a season, normalized by the number of passes attempted.  

h. Interceptions per Attempt: The number of interceptions thrown in a season, normalized by the number of passes attempted.
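All of the rate metrics in items b through h can be derived from season totals. The sketch below is only an illustration of those definitions, not the project's code: the function name `passing_metrics`, its argument names, and the sample season line are hypothetical. It assumes completion percentage is on a 0-100 scale and that the efficiency rating uses the number of completions in the numerator.

```python
def passing_metrics(games, attempts, completions, yards, touchdowns, interceptions):
    """Derive per-attempt and per-game passing metrics from season totals."""
    if attempts <= 0 or games <= 0:
        raise ValueError("attempts and games must be positive")
    return {
        "CompPct": 100.0 * completions / attempts,   # assumed 0-100 scale
        "YPAtt": yards / attempts,
        "PassRt": (8.4 * yards + 330 * touchdowns
                   + 100 * completions - 200 * interceptions) / attempts,
        # The ratio is undefined when no interceptions were thrown.
        "TDPInt": touchdowns / interceptions if interceptions else float("nan"),
        "YPGm": yards / games,
        "TDPAtt": touchdowns / attempts,
        "IntPAtt": interceptions / attempts,
    }


# Hypothetical season line, chosen only to make the arithmetic easy to follow.
stats = passing_metrics(games=12, attempts=400, completions=240,
                        yards=3000, touchdowns=24, interceptions=12)
for name, value in stats.items():
    print(f"{name:8s} {value:.3f}")
```

Returning `nan` for a quarterback with zero interceptions is one possible convention; how the source data handles that case is not stated.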

In [1]:
import pandas as pd
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt
import seaborn as sn
from statsmodels.formula.api import logit, glm
from sklearn.model_selection import train_test_split
import sklearn
%matplotlib inline

QB0005 = pd.read_csv('/home/brianc/McNulty Project/Data/PassingStats20002005.csv')
QB0611 = pd.read_csv('/home/brianc/McNulty Project/Data/PassingStats20062011.csv')
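# Harmonize column names between the two files.
QB0005.rename(columns={'TotGames':'Games'}, inplace=True)
# Completion rate as a 0-1 fraction. The design-matrix preview further down shows
# other rows on a 0-100 scale, so check that both files share one scale.
QB0611['CompPct'] = QB0611['TotComp'] / QB0611['TotAtt']
QB0611 = QB0611[['Last','First','Games','TotAtt','TotComp','TotYds','TotTD','TotInt','PassRt','CompPct','YPAtt','TDPAtt','IntPAtt','YPGm','TDPInt','yr3NFLVet']]
QBall = pd.concat([QB0005, QB0611], axis=0)
# sort_values returns a new DataFrame; the result is not assigned, so row order is unchanged.
QBall.sort_values(by=['Last','First'])
# Replace the duplicated per-file indices with a fresh 0..n-1 index.
QBall = QBall.reset_index(drop=True)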
train = QBall.sample(frac=0.8, random_state=1)
test = QBall.loc[~QBall.index.isin(train.index)]
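Because the outcome is imbalanced (about 13.5% of training rows are successes, per the summary statistics later in the notebook), a purely random 80/20 split can leave the training and test sets with different success rates. One possible alternative, shown here as a sketch rather than as what the notebook does, is to sample within each outcome class. The frame `demo` is synthetic stand-in data with made-up values; only the column names mirror the notebook.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for QBall: 200 rows, 15% positives (made-up numbers).
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "Games": rng.integers(5, 50, size=200),
    "yr3NFLVet": np.r_[np.ones(30, dtype=int), np.zeros(170, dtype=int)],
})

# Sample 80% within each outcome class so both splits keep the same success rate.
train_demo = demo.groupby("yr3NFLVet", group_keys=False).sample(frac=0.8, random_state=1)
test_demo = demo.drop(train_demo.index)

print(len(train_demo), train_demo["yr3NFLVet"].mean())
print(len(test_demo), test_demo["yr3NFLVet"].mean())
```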

## Exploratory Data Analysis

## One Parameter Analysis

In [2]:
# Logit is the default link for the Binomial family, so it need not be passed explicitly.
QBallMod1 = glm('yr3NFLVet ~ Games', data=train, family=sm.families.Binomial()).fit()
print(QBallMod1.summary())

                 Generalized Linear Model Regression Results                  
Dep. Variable:              yr3NFLVet   No. Observations:                  563
Model:                            GLM   Df Residuals:                      561
Model Family:                Binomial   Df Model:                            1
Link Function:                  logit   Scale:                             1.0
Method:                          IRLS   Log-Likelihood:                -214.20
Date:                Thu, 28 Jul 2016   Deviance:                       428.40
Time:                        16:51:55   Pearson chi2:                     552.
No. Iterations:                     7                                         
                 coef    std err          z      P>|z|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept     -2.7953      0.272    -10.273      0.000        -3.329    -2.262
Games          0.0428      0.010      4.231      0.0
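The fitted coefficients are on the log-odds scale, which is hard to read directly. As a rough worked example, the snippet below plugs the Games model's reported estimates (intercept -2.7953, slope 0.0428) into the inverse-logit function to get an odds ratio and predicted probabilities. The game counts of 10, 30, and 50 are arbitrary illustrations, and the helper name `predicted_probability` is hypothetical.

```python
import math

# Point estimates copied from the Games-only model summary.
intercept, slope = -2.7953, 0.0428


def predicted_probability(games):
    """Inverse logit of the linear predictor for a given number of games."""
    log_odds = intercept + slope * games
    return 1.0 / (1.0 + math.exp(-log_odds))


# Multiplicative change in the odds of success per extra game, and per ten games.
odds_ratio_per_game = math.exp(slope)
odds_ratio_per_10_games = math.exp(10 * slope)
print(f"odds ratio per game: {odds_ratio_per_game:.3f}")
print(f"odds ratio per 10 games: {odds_ratio_per_10_games:.3f}")

for g in (10, 30, 50):
    print(g, round(predicted_probability(g), 3))
```

These are point estimates only; they ignore the standard error of the coefficients.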

In [3]:
QBallMod1 = glm('yr3NFLVet ~ CompPct', data=train, family=sm.families.Binomial()).fit()
print(QBallMod1.summary())

                 Generalized Linear Model Regression Results                  
Dep. Variable:              yr3NFLVet   No. Observations:                  563
Model:                            GLM   Df Residuals:                      561
Model Family:                Binomial   Df Model:                            1
Link Function:                  logit   Scale:                             1.0
Method:                          IRLS   Log-Likelihood:                -204.83
Date:                Thu, 28 Jul 2016   Deviance:                       409.66
Time:                        16:51:56   Pearson chi2:                     577.
No. Iterations:                     8                                         
                 coef    std err          z      P>|z|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept     -2.9292      0.271    -10.828      0.000        -3.459    -2.399
CompPct        0.0289      0.005      5.399      0.0

In [4]:
QBallMod1 = glm('yr3NFLVet ~ YPAtt', data=train, family=sm.families.Binomial()).fit()
print(QBallMod1.summary())

                 Generalized Linear Model Regression Results                  
Dep. Variable:              yr3NFLVet   No. Observations:                  563
Model:                            GLM   Df Residuals:                      561
Model Family:                Binomial   Df Model:                            1
Link Function:                  logit   Scale:                             1.0
Method:                          IRLS   Log-Likelihood:                -212.50
Date:                Thu, 28 Jul 2016   Deviance:                       424.99
Time:                        16:51:57   Pearson chi2:                     537.
No. Iterations:                     7                                         
                 coef    std err          z      P>|z|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept     -6.1894      1.014     -6.104      0.000        -8.177    -4.202
YPAtt          0.5909      0.134      4.396      0.0

In [5]:
QBallMod1 = glm('yr3NFLVet ~ PassRt', data=train, family=sm.families.Binomial()).fit()
print(QBallMod1.summary())

                 Generalized Linear Model Regression Results                  
Dep. Variable:              yr3NFLVet   No. Observations:                  563
Model:                            GLM   Df Residuals:                      561
Model Family:                Binomial   Df Model:                            1
Link Function:                  logit   Scale:                             1.0
Method:                          IRLS   Log-Likelihood:                -209.43
Date:                Thu, 28 Jul 2016   Deviance:                       418.86
Time:                        16:51:57   Pearson chi2:                     547.
No. Iterations:                     7                                         
                 coef    std err          z      P>|z|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept     -7.1125      1.075     -6.619      0.000        -9.219    -5.006
PassRt         0.0399      0.008      5.049      0.0

In [6]:
QBallMod1 = glm('yr3NFLVet ~ TDPInt', data=train, family=sm.families.Binomial()).fit()
print(QBallMod1.summary())

                 Generalized Linear Model Regression Results                  
Dep. Variable:              yr3NFLVet   No. Observations:                  563
Model:                            GLM   Df Residuals:                      561
Model Family:                Binomial   Df Model:                            1
Link Function:                  logit   Scale:                             1.0
Method:                          IRLS   Log-Likelihood:                -209.86
Date:                Thu, 28 Jul 2016   Deviance:                       419.72
Time:                        16:51:58   Pearson chi2:                     540.
No. Iterations:                     7                                         
                 coef    std err          z      P>|z|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept     -2.9637      0.270    -10.958      0.000        -3.494    -2.434
TDPInt         0.5739      0.115      4.986      0.0

In [7]:
QBallMod1 = glm('yr3NFLVet ~ YPGm', data=train, family=sm.families.Binomial()).fit()
print(QBallMod1.summary())

                 Generalized Linear Model Regression Results                  
Dep. Variable:              yr3NFLVet   No. Observations:                  563
Model:                            GLM   Df Residuals:                      561
Model Family:                Binomial   Df Model:                            1
Link Function:                  logit   Scale:                             1.0
Method:                          IRLS   Log-Likelihood:                -203.89
Date:                Thu, 28 Jul 2016   Deviance:                       407.78
Time:                        16:51:59   Pearson chi2:                     527.
No. Iterations:                     7                                         
                 coef    std err          z      P>|z|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept     -4.8714      0.550     -8.857      0.000        -5.949    -3.793
YPGm           0.0142      0.002      5.948      0.0

In [8]:
QBallMod1 = glm('yr3NFLVet ~ TDPAtt', data=train, family=sm.families.Binomial()).fit()
print(QBallMod1.summary())

                 Generalized Linear Model Regression Results                  
Dep. Variable:              yr3NFLVet   No. Observations:                  563
Model:                            GLM   Df Residuals:                      561
Model Family:                Binomial   Df Model:                            1
Link Function:                  logit   Scale:                             1.0
Method:                          IRLS   Log-Likelihood:                -215.64
Date:                Thu, 28 Jul 2016   Deviance:                       431.27
Time:                        16:51:59   Pearson chi2:                     551.
No. Iterations:                     7                                         
                 coef    std err          z      P>|z|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept     -3.2891      0.416     -7.910      0.000        -4.104    -2.474
TDPAtt        27.6726      7.300      3.791      0.0

In [9]:
QBallMod1 = glm('yr3NFLVet ~ IntPAtt', data=train, family=sm.families.Binomial()).fit()
print(QBallMod1.summary())

                 Generalized Linear Model Regression Results                  
Dep. Variable:              yr3NFLVet   No. Observations:                  563
Model:                            GLM   Df Residuals:                      561
Model Family:                Binomial   Df Model:                            1
Link Function:                  logit   Scale:                             1.0
Method:                          IRLS   Log-Likelihood:                -213.72
Date:                Thu, 28 Jul 2016   Deviance:                       427.44
Time:                        16:52:12   Pearson chi2:                     551.
No. Iterations:                     7                                         
                 coef    std err          z      P>|z|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept     -0.1698      0.416     -0.408      0.683        -0.986     0.646
IntPAtt      -56.8745     14.254     -3.990      0.0
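All eight one-parameter models are fit to the same 563 observations with the same number of parameters, so their deviances can be compared directly: a lower deviance means a better fit. The snippet below simply ranks the deviances reported in the summaries above. For ungrouped binary outcomes the deviance equals -2 times the log-likelihood (for example, -2 × -214.20 = 428.40 for Games), so AIC is the deviance plus twice the parameter count. The variable names are illustrative.

```python
# Deviances copied from the eight single-predictor model summaries.
deviance = {
    "Games": 428.40, "CompPct": 409.66, "YPAtt": 424.99, "PassRt": 418.86,
    "TDPInt": 419.72, "YPGm": 407.78, "TDPAtt": 431.27, "IntPAtt": 427.44,
}

N_PARAMS = 2  # intercept plus one slope in every model

# Sort from best (lowest deviance) to worst.
ranked = sorted(deviance.items(), key=lambda kv: kv[1])
for name, dev in ranked:
    aic = dev + 2 * N_PARAMS
    print(f"{name:8s} deviance={dev:7.2f}  AIC={aic:7.2f}")
```

By this measure YPGm and CompPct fit best on their own and TDPAtt fits worst, although the gaps between several of the models are small.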

## Variable Reduction

In [2]:
# Full model with all eight predictors; logit is the default link for the Binomial family.
QBallMod1 = glm('yr3NFLVet ~ Games+CompPct+YPAtt+PassRt+TDPInt+YPGm+TDPAtt+IntPAtt', data=train, family=sm.families.Binomial()).fit()
print(QBallMod1.summary())

                 Generalized Linear Model Regression Results                  
Dep. Variable:              yr3NFLVet   No. Observations:                  563
Model:                            GLM   Df Residuals:                      554
Model Family:                Binomial   Df Model:                            8
Link Function:                  logit   Scale:                             1.0
Method:                          IRLS   Log-Likelihood:                -173.96
Date:                Thu, 28 Jul 2016   Deviance:                       347.91
Time:                        16:57:20   Pearson chi2:                     540.
No. Iterations:                     8                                         
                 coef    std err          z      P>|z|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept     -7.5961      2.138     -3.553      0.000       -11.787    -3.405
Games          0.0285      0.012      2.393      0.0
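One way to judge whether the eight-variable model improves on a single-variable model is a likelihood-ratio test: the drop in deviance between nested models is compared with a chi-square distribution whose degrees of freedom equal the number of added parameters. The sketch below uses the deviances reported above for the full model (347.91) and the YPGm-only model (407.78). The critical value is the tabulated 95th percentile for 7 degrees of freedom, hard-coded here to avoid an extra dependency.

```python
# Reported deviances and parameter counts (intercept included).
full_deviance, full_params = 347.91, 9   # intercept + 8 predictors
ypgm_deviance, ypgm_params = 407.78, 2   # intercept + YPGm

# Likelihood-ratio statistic: reduction in deviance from adding the extra predictors.
lr_stat = ypgm_deviance - full_deviance
extra_params = full_params - ypgm_params

# Tabulated 95th percentile of the chi-square distribution with 7 degrees of freedom.
CHI2_CRIT_95_DF7 = 14.067

full_model_is_better = lr_stat > CHI2_CRIT_95_DF7
print(f"LR statistic = {lr_stat:.2f} on {extra_params} df; "
      f"exceeds critical value: {full_model_is_better}")
```

A significant result says only that the extra predictors jointly add information; it does not say which of them are worth keeping, which is what the feature-selection step addresses.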

In [3]:
# Build the response vector y and the design matrix X from the training set.
from patsy import dmatrices
y, X = dmatrices('yr3NFLVet ~ Games+CompPct+YPAtt+PassRt+TDPInt+YPGm+TDPAtt+IntPAtt', data=train, return_type='dataframe')
X.head()


     Intercept  Games    CompPct  YPAtt  PassRt  TDPInt    YPGm  TDPAtt  IntPAtt
402        1.0   21.0   0.596947   7.39  135.12    1.71  230.57    0.06     0.04
422        1.0   39.0   0.60374    8.22  147.26    3.36  236.74    0.07     0.02
331        1.0   11.0  50.53       8.19  131.41    1.55  210.82    0.06     0.04
189        1.0   11.0  56.1        7.76  125.48    1.0   173.45    0.03     0.03
185        1.0    9.0  50.88       5.65  100.6     0.75  107.44    0.04     0.05


In [32]:
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
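estimator = LogisticRegression(C=0.5)
# n_features_to_select=None keeps half of the features; one feature is dropped per step.
selector = RFE(estimator, n_features_to_select=None, step=1)
selector = selector.fit(X, np.ravel(y))
selector.support_
selector.ranking_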

array([1, 2, 3, 5, 4, 1, 6, 1, 1])

In [33]:
select = selector.get_support()
# Use a new name so the held-out `test` DataFrame is not overwritten.
selected_cols = X.columns[select]
print(selected_cols)

Index(['Intercept', 'TDPInt', 'TDPAtt', 'IntPAtt'], dtype='object')
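The `ranking_` array is easier to read when paired with the column names of the design matrix. In RFE, rank 1 marks the features that were kept, and higher ranks mark features that were eliminated earlier. The snippet below lines up the ranking printed above with the columns of `X`; the list literals are copied from the notebook output and the variable names are illustrative.

```python
# Column order of the patsy design matrix and the ranking reported by RFE.
columns = ["Intercept", "Games", "CompPct", "YPAtt", "PassRt",
           "TDPInt", "YPGm", "TDPAtt", "IntPAtt"]
ranking = [1, 2, 3, 5, 4, 1, 6, 1, 1]

# Features with rank 1 were retained.
selected = [name for rank, name in zip(ranking, columns) if rank == 1]

# Higher rank means earlier elimination, so sort descending to recover the order.
elimination_order = [name for rank, name in
                     sorted(zip(ranking, columns), reverse=True) if rank > 1]

print("selected:", selected)
print("eliminated, first to last:", elimination_order)
```

Note that the constant `Intercept` column is among the retained features. Since `LogisticRegression` fits its own intercept by default (`fit_intercept=True` in the parameters shown below), it may be worth dropping that column before running RFE so that it does not occupy one of the selected slots.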


In [27]:
selector.get_params()

{'estimator': LogisticRegression(C=0.5, class_weight=None, dual=False, fit_intercept=True,
           intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
           penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
           verbose=0, warm_start=False),
 'estimator__C': 0.5,
 'estimator__class_weight': None,
 'estimator__dual': False,
 'estimator__fit_intercept': True,
 'estimator__intercept_scaling': 1,
 'estimator__max_iter': 100,
 'estimator__multi_class': 'ovr',
 'estimator__n_jobs': 1,
 'estimator__penalty': 'l2',
 'estimator__random_state': None,
 'estimator__solver': 'liblinear',
 'estimator__tol': 0.0001,
 'estimator__verbose': 0,
 'estimator__warm_start': False,
 'estimator_params': None,
 'n_features_to_select': 6,
 'step': 1,
 'verbose': 0}

       yr3NFLVet
count  563.0
mean     0.134991
std      0.342018
min      0.0
25%      0.0
50%      0.0
75%      0.0
max      1.0
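The mean of a 0/1 outcome is the share of successes, so these summary statistics also describe the class balance. As a quick check using only the reported count and mean, the snippet below recovers the number of successful quarterbacks in the training set and the accuracy a trivial "always predict failure" classifier would achieve. Any model's accuracy should be read against that baseline.

```python
# Values copied from the summary statistics of yr3NFLVet.
n_obs = 563
success_rate = 0.134991

# Number of training rows with yr3NFLVet == 1.
n_success = round(n_obs * success_rate)
n_failure = n_obs - n_success

# Accuracy of a classifier that predicts "not a three-year veteran" for everyone.
baseline_accuracy = 1 - success_rate

print(f"successes: {n_success}, failures: {n_failure}")
print(f"majority-class baseline accuracy: {baseline_accuracy:.3f}")
```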


In [29]:
y.shape

(563, 1)

