# Who should you take in the NFL draft? - Webscrape

Machine learning has reshaped many fields and industries over at least the past 10 years. In sports, statistics began playing a large role in team design and recruitment, especially after the success of the Oakland A's in the early 2000s using the methods described in the book and movie *Moneyball*. To read more about the use of statistics in baseball, see [Sabermetrics on Wikipedia](https://en.wikipedia.org/wiki/Sabermetrics).

I happen to live under a rock, so I hadn't actually seen *Moneyball* until very recently, on a flight, and wow, what an awesome movie! I was quite inspired after watching it, and figured that by now all kinds of stats are probably being used to guide business and roster decisions in every major sport. I'm a huge AI and machine learning aficionado, and I've been a lifelong sufferer as a Miami Dolphins fan, so I decided: why not check out some NFL datasets and see whether one can predict the success of certain players? Maybe the poor Dolphins could be better advised on who to draft.

To get started, the first thing we need is data. After looking around the web a bit, I found that http://www.pro-football-reference.com/ has some easily accessible (and machine-readable) tables of Combine results and historical Pro Bowl data.

So the question is pretty clear: could we predict, based on Combine metrics alone, which players at each position are most likely to become Pro Bowlers?

In [49]:
import pandas as pd
import numpy as np
import sklearn
# sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection
from sklearn.model_selection import cross_val_score
from sklearn import linear_model
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
import matplotlib
import seaborn as sns
sns.set()

from sklearn import ensemble

from imblearn.over_sampling import SMOTE

%matplotlib inline
import warnings
warnings.filterwarnings('ignore')

# Save a nice dark grey as a variable
almost_black = '#262626'
%config InlineBackend.figure_format='retina'

matplotlib.rcParams['figure.figsize'] = (12.0, 8.0)

Above I just import the libraries we need for the analysis and override some matplotlib defaults (using Python, in case you haven't realized!). The next thing we need is a list of Pro Bowlers and the Combine data for each year. Both can be scraped from the reference site listed above.
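The Combine queries used in the scraping cell differ only in the year and the list of positions, so the URL can be assembled from parameters rather than written out each time. Here is a minimal sketch of that idea; the helper name `combine_url` and the constant `BASE` are made up for illustration, and the query parameters are simply copied from the URLs used in this notebook.

```python
from urllib.parse import urlencode

# Base endpoint taken from the URLs used in this notebook
BASE = "http://www.pro-football-reference.com/play-index/nfl-combine-results.cgi"

def combine_url(year, positions):
    """Build the Combine query URL for one year and a list of positions.

    Hypothetical helper: it only reproduces the query strings used in this
    notebook and makes no claim about what else the site accepts.
    """
    params = [
        ("request", 1),
        ("year_min", year),
        ("year_max", year),
        ("height_min", 65),
        ("height_max", 82),
        ("weight_min", 149),
        ("weight_max", 375),
    ]
    # The site takes one 'pos' parameter per position
    params += [("pos", p) for p in positions]
    params += [("show", "p"), ("order_by", "year_id")]
    return BASE + "?" + urlencode(params)

print(combine_url(2000, ["K"]))
```

A URL built this way can be passed to `pd.read_html` in place of the hand-formatted strings.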

Of course, this being dirty web data, we have to do some cleaning! Below I do the following:

* Some players' names in the Pro Bowl lists carry a '+' or '\*' to designate special status (for example, MVP of the game or secondary selection). We only want to train on whether a player went to the Pro Bowl, so we strip those markers.
* I convert the height from a feet-inches format to plain inches.
* Finally, I drop the duplicates and some irrelevant columns, then save the dataframe as a CSV for later analysis.
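To make the cleaning steps concrete, here is a small sketch on made-up rows. The helper names `clean_name` and `height_to_inches` and the sample data are hypothetical, and it assumes heights always look like `6-2`; the real cell that follows works directly on the scraped tables.

```python
import pandas as pd

def clean_name(name):
    """Remove the status markers appended to some Pro Bowl names."""
    for marker in "+*%&":
        name = name.replace(marker, "")
    return name

def height_to_inches(height):
    """Convert a 'feet-inches' string such as '6-2' to total inches."""
    feet, inches = height.split("-")
    return 12 * int(feet) + int(inches)

# Made-up rows, for illustration only
combine = pd.DataFrame({
    "Player": ["Player A", "Player B"],
    "Height": ["6-2", "5-11"],
})
pro_bowl_names = ["Player A*+"]

# Strip markers, convert heights, then label by name membership
cleaned = {clean_name(n) for n in pro_bowl_names}
combine["Height_inches"] = combine["Height"].map(height_to_inches)
combine["Probowl"] = combine["Player"].isin(cleaned).astype(float)
print(combine)
```

Note that matching on names alone will confuse two different players who share a name, which is a limitation of this labeling approach.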

In [18]:
probowlers = []

df = pd.read_html('http://www.pro-football-reference.com/play-index/nfl-combine-results.cgi?request=1&year_min=2000&year_max=2000&height_min=65&height_max=82&weight_min=149&weight_max=375&pos=K&show=p&order_by=year_id')[0]

# range(first, last + 1): the end point is exclusive, so this loop covers 2000 through 2015.

for year in range(2000,2016):
    df1 = pd.read_html('http://www.pro-football-reference.com/play-index/nfl-combine-results.cgi?request=1&year_min=%s&year_max=%s&height_min=65&height_max=82&weight_min=149&weight_max=375&pos=QB&pos=WR&pos=TE&pos=RB&pos=FB&pos=OT&pos=OG&pos=C&show=p&order_by=year_id' % (year, year))[0]
    df2 = pd.read_html('http://www.pro-football-reference.com/play-index/nfl-combine-results.cgi?request=1&year_min=%s&year_max=%s&height_min=65&height_max=82&weight_min=149&weight_max=375&pos=DE&pos=DT&pos=ILB&pos=OLB&pos=SS&pos=FS&pos=CB&show=p&order_by=year_id' % (year, year))[0]
    df3 = pd.read_html('http://www.pro-football-reference.com/play-index/nfl-combine-results.cgi?request=1&year_min=%s&year_max=%s&height_min=65&height_max=82&weight_min=149&weight_max=375&pos=LS&pos=K&pos=P&show=p&order_by=year_id' % (year, year))[0]
    df = pd.concat([df, df1, df2, df3], ignore_index=True)
    probowlers += pd.read_html('http://www.pro-football-reference.com/years/%s/probowl.htm' % year)[0]['Unnamed: 1'].tolist()

df = df.drop_duplicates()
probowlers = [s.replace('+', '').replace('*','').replace('%', '').replace('&', '') for s in probowlers]
df_pb = pd.Series(probowlers)
df_pb = df_pb.drop_duplicates()

# Label a player 1.0 if their name appears on any Pro Bowl roster, else 0.0.
# isin() avoids the chained assignment and the index misalignment after drop_duplicates().
df['Probowl'] = df['Player'].isin(df_pb).astype(float)
            
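# Drop repeated header rows that were scraped as data (Player == 'Player')
df = df.drop(df[df['Player'] == 'Player'].index)
# Convert 'feet-inches' strings (e.g. '6-2') to total inches
height_parts = df['Height'].str.extract(r'([0-9]+)-([0-9]*\.?[0-9]+)')
df['Height_inches'] = 12 * height_parts[0].astype(int) + height_parts[1].astype(int)
df = df.drop(['Height'], axis=1)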
            
drop_columns = ['Rk', 'AV', 'College', 'Drafted (tm/rnd/yr)']
df = df.drop(drop_columns, axis=1)

df.to_csv('/Users/richard/data/NFL.csv')

Running the above cell produces a CSV file that we can analyze later (see the other notebooks in this series for the position-by-position analysis).
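One detail matters when loading the file again: `to_csv` writes the dataframe index as an unnamed first column by default, so reading it back with `index_col=0` avoids a stray `Unnamed: 0` column. The sketch below shows the round trip on made-up rows, using an in-memory buffer in place of the real file path.

```python
import io
import pandas as pd

# Made-up rows standing in for the scraped dataframe
df = pd.DataFrame({
    "Player": ["Player A", "Player B"],
    "Probowl": [1.0, 0.0],
})

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
df.to_csv(buffer)  # the index is written as the first, unnamed column
buffer.seek(0)

# index_col=0 restores that first column as the index
loaded = pd.read_csv(buffer, index_col=0)

# A quick look at the class balance of the label column
print(loaded["Probowl"].value_counts())
```

Checking the class balance at this point is useful because Pro Bowlers are a small minority of Combine participants, which affects how any later classifier should be trained and evaluated.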