# Who should you take in the NFL draft? - QB Edition

To get started, we first need data. After looking around the web, I found that http://www.pro-football-reference.com/ has easily accessible, machine-readable tables of Combine results as well as historical Pro Bowl data.

So the question is clear: can we predict, from Combine metrics alone, which players at a given position are most likely to become Pro Bowlers?

In [4]:
import pandas as pd
import numpy as np

from sklearn.model_selection import cross_val_score  # sklearn.cross_validation was removed in scikit-learn 0.20
from sklearn import linear_model, ensemble, decomposition
import matplotlib.pyplot as plt
import matplotlib
import seaborn as sns
sns.set()

from imblearn.over_sampling import SMOTE

%matplotlib inline
import warnings
warnings.filterwarnings('ignore')

%config InlineBackend.figure_format='retina'
matplotlib.rcParams['figure.figsize'] = (12.0, 8.0)

## Quarterback

We can start with the quarterback position. Before doing so, however, the scraped data needs some cleanup: repeated header rows are dropped, heights are converted to inches, and the measurement columns are cast to numeric types.

In [31]:
# `df` is the scraped table (Combine results plus a Probowl column); loading it is not shown here.
df_qb = df[df['Pos'] == 'QB'].copy()
# Drop header rows that the scrape repeated inside the table body.
df_qb = df_qb.drop(df_qb[df_qb['Player'] == 'Player'].index)

# Convert feet-inches strings such as '6-4' to total inches (extract once, reuse both groups).
height = df_qb['Height'].str.extract(r'([0-9]+)-([0-9]*\.?[0-9]+)').astype(float)
df_qb['Height_inches'] = 12 * height[0] + height[1]
df_qb = df_qb.drop(columns=['Height'])

df_qb = df_qb.drop(columns=['BenchReps']).dropna()  # most QBs don't do bench reps!

feature_cols = ['Wt', '40YD', 'Vertical', 'Broad Jump', '3Cone', 'Shuttle', 'Height_inches']
# np.number is an abstract type and is rejected by astype in current NumPy; cast to float instead.
df_qb[['Year'] + feature_cols] = df_qb[['Year'] + feature_cols].astype(float)
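
The height conversion in the cell above is easy to get wrong, so here is a minimal standalone sketch of the same idea using only the standard library. The helper name `height_to_inches` is hypothetical, and the sample strings assume the scraped heights look like `'6-4'` (feet, hyphen, inches).

```python
import re

# Same pattern as the notebook cell: feet, a hyphen, then inches.
HEIGHT_PATTERN = re.compile(r'([0-9]+)-([0-9]*\.?[0-9]+)')

def height_to_inches(height):
    """Convert a 'feet-inches' string such as '6-4' to total inches.

    Returns None when the string does not match, so bad rows can be
    dropped later instead of raising mid-pipeline.
    """
    match = HEIGHT_PATTERN.search(height)
    if match is None:
        return None
    feet, inches = match.groups()
    return 12 * int(feet) + float(inches)

for raw in ['6-4', '6-6', '5-11', 'n/a']:
    print(raw, '->', height_to_inches(raw))
```

A 6-4 quarterback comes out at 76.0 inches, which matches the `Height_inches` scale used as a feature.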


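It turns out that quarterbacks hardly ever do the bench press at the Combine! Rows with any other missing measurement were dropped as well, so only quarterbacks who completed every remaining drill are left. We can look at the 2015 class to see which of them went on to make the Pro Bowl: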

In [56]:
df_qb[df_qb['Year'] == 2015]

| | Year | Player | Pos | School | Wt | 40YD | Vertical | Broad Jump | 3Cone | Shuttle | Probowl | Height_inches |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3306 | 2015.0 | Jameis Winston | QB | Florida State | 231.0 | 4.97 | 28.5 | 103.0 | 7.16 | 4.36 | 1.0 | 76.0 |
| 3345 | 2015.0 | Marcus Mariota | QB | Oregon | 222.0 | 4.52 | 36.0 | 121.0 | 6.87 | 4.11 | 0.0 | 76.0 |
| 3346 | 2015.0 | Sean Mannion | QB | Oregon State | 229.0 | 5.14 | 31.0 | 105.0 | 7.29 | 4.39 | 0.0 | 78.0 |


In [41]:
cutoff_year = 2010

# Train on Combine classes before the cutoff; test on the cutoff year and later.
df_train = df_qb[df_qb['Year'] < cutoff_year]
df_test = df_qb[df_qb['Year'] >= cutoff_year]

X_train = df_train[feature_cols]
y_train = df_train.Probowl

X_test = df_test[feature_cols]
y_test = df_test.Probowl

print(len(y_train), len(y_test))

74 41


In [55]:
# 1 - (number of Pro Bowlers / number of non-Pro Bowlers) in the training set
counts = df_train.groupby('Probowl').size()
1 - counts[1] / counts[0]

0.87878787878787878
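
The value printed above is one minus the ratio of Pro Bowlers to non-Pro Bowlers. That is close to, but not the same as, the accuracy of always predicting "no Pro Bowl". As a rough check, the sketch below assumes 8 Pro Bowlers and 66 non-Pro Bowlers. Those counts are inferred from the printed ratio and the 74 training rows; the notebook does not state them.

```python
# Assumed class counts: inferred from 1 - 8/66 = 0.8788 and 8 + 66 = 74,
# not read from the data itself.
n_probowl = 8
n_non_probowl = 66

# What the cell above computes.
ratio_metric = 1 - n_probowl / n_non_probowl

# Accuracy of a model that always predicts "no Pro Bowl".
majority_baseline = n_non_probowl / (n_probowl + n_non_probowl)

print(round(ratio_metric, 4))
print(round(majority_baseline, 4))
```

Under these assumed counts, a classifier has to beat roughly 89% accuracy before it adds anything over the trivial guess.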

In [50]:
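lr = linear_model.LogisticRegression()  # defined for comparison; not scored in this cell
rf = ensemble.RandomForestClassifier(n_jobs=8)
# 10-fold cross-validated accuracy of the random forest on the pre-2010 classes
scores = cross_val_score(rf, X_train, y_train, cv=10, scoring='accuracy')
print(np.round(scores, 2))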

[ 0.62  0.88  0.88  0.62  0.75  0.88  0.86  0.86  1.    1.  ]
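
Ten individual fold scores are hard to read at a glance. With only 74 training rows split ten ways, each fold holds seven or eight quarterbacks, so a single misclassification moves a fold's accuracy by more than ten percentage points. The sketch below summarizes the scores printed above; the numbers are copied from that output and are already rounded, so the summary is approximate.

```python
from statistics import mean, stdev

# Fold accuracies copied from the printed output above (already rounded to 2 d.p.).
scores = [0.62, 0.88, 0.88, 0.62, 0.75, 0.88, 0.86, 0.86, 1.0, 1.0]

print(f"mean accuracy:    {mean(scores):.3f}")
print(f"std across folds: {stdev(scores):.3f}")
print(f"range:            {min(scores):.2f} to {max(scores):.2f}")
```

The mean works out to about 0.835, with a wide spread between the worst and best folds, which is worth keeping in mind before reading much into any single run.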


In [51]:
rf.fit(X_train, y_train)


RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=8,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

In [54]:
# Predict once and reuse; the probability columns follow rf.classes_ (0.0, then 1.0).
proba = rf.predict_proba(X_test)

results = pd.DataFrame({
    'non probowl prob': np.round(proba[:, 0], 2),
    'probowl prob': np.round(proba[:, 1], 2),
    'prediction': rf.predict(X_test),
    'actual': y_test.values,
})

# DataFrame.sort was removed from pandas; sort_values is its replacement.
results.sort_values('probowl prob', ascending=False)

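| | non probowl prob | probowl prob | prediction | actual |
|---|---|---|---|---|
| 40 | 0.5 | 0.5 | 0.0 | 0.0 |
| 27 | 0.5 | 0.5 | 0.0 | 0.0 |
| 6 | 0.6 | 0.4 | 0.0 | 0.0 |
| 18 | 0.7 | 0.3 | 0.0 | 0.0 |
| 29 | 0.7 | 0.3 | 0.0 | 0.0 |
| 11 | 0.7 | 0.3 | 0.0 | 0.0 |
| 32 | 0.8 | 0.2 | 0.0 | 0.0 |
| 14 | 0.8 | 0.2 | 0.0 | 1.0 |
| 30 | 0.8 | 0.2 | 0.0 | 0.0 |
| 5 | 0.8 | 0.2 | 0.0 | 0.0 |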
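
Accuracy alone can hide what is going on in a table like this one, so it helps to also count how many actual Pro Bowlers the model flagged. The sketch below does that for the ten rows shown above only. The test set has 41 rows and the rest are not displayed, so treat the numbers as illustrative rather than as the model's real test performance.

```python
# (prediction, actual) pairs for the ten rows shown above; the remaining
# test rows are not displayed, so these numbers are illustrative only.
rows = [(0, 0), (0, 0), (0, 0), (0, 0), (0, 0),
        (0, 0), (0, 0), (0, 1), (0, 0), (0, 0)]

true_pos = sum(1 for pred, actual in rows if pred == 1 and actual == 1)
false_neg = sum(1 for pred, actual in rows if pred == 0 and actual == 1)
correct = sum(1 for pred, actual in rows if pred == actual)

accuracy = correct / len(rows)

# Recall: share of actual Pro Bowlers that the model flagged; guard against 0/0.
actual_pos = true_pos + false_neg
recall = true_pos / actual_pos if actual_pos else float('nan')

print(f"accuracy={accuracy:.2f}, recall={recall:.2f}")
```

On these rows the accuracy is 0.90 while the recall is 0.00: the model never predicts a Pro Bowl, and the one actual Pro Bowler shown received a probability of only 0.2.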
