# Who should you take in the NFL draft? - Webscrape

Machine learning has reshaped many fields and industries over the past decade or so. In sports, statistics began playing a large role in team design and recruitment, especially after the success of the Oakland A's in the early 2000s using the methods described in the book and movie *Moneyball*. To read more about the use of statistics in baseball, see [Sabermetrics on Wikipedia](https://en.wikipedia.org/wiki/Sabermetrics).

I happen to live under a rock, so I hadn't actually seen *Moneyball* until very recently on a flight, and wow, what an awesome movie! I was quite inspired after watching it and figured that by now all kinds of stats are probably being used to guide business and sports decisions in every major sport. I'm a huge AI and machine learning aficionado, and I've been a lifelong sufferer as a Miami Dolphins fan, so I decided: why not check out some NFL datasets and see if one can predict how successful certain players will be? Maybe the poor Dolphins could be better advised on who to draft.

To get started, the first thing we need is data. After looking around the web a bit, I found that http://www.pro-football-reference.com/ has some easily accessible (and machine-readable) tables of Combine results and historic Pro Bowl data.

So the question is pretty clear: could we predict, based on Combine metrics alone, which players, by position, are most likely to become Pro Bowlers?

In [2]:
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore')

Above I just import the libraries we need for the web scraping (the focus of this notebook). The next thing we need is a list of Pro Bowlers and the Combine data for each year. Both can be scraped from the reference site listed above.

Of course, this being dirty web data, we have to do some cleaning! Below I do the following:

* Remove the '+' and '\*' markers that some players' names carry in the Pro Bowl list to designate special status (for example, MVP of the game or secondary selection). We only want to train on whether a player went to the Pro Bowl, so the markers have to go.
* Convert the height from a feet-inches format to just inches.
* Drop the duplicates.
* Add a column that records whether the player has made it to the Pro Bowl.
* Drop the columns we don't need and save the dataframe as a CSV for later analysis.
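To make the first two cleaning steps concrete, here is a minimal sketch in plain Python. The helper names `clean_name` and `height_to_inches` are hypothetical (they are not part of the notebook, which does the same work with pandas string methods), and the inputs are made up for illustration.

```python
import re

def clean_name(name):
    """Strip the '+' and '*' status markers from a player name (hypothetical helper)."""
    return name.replace('+', '').replace('*', '').strip()

def height_to_inches(height):
    """Convert a 'feet-inches' string such as '6-4' to total inches (hypothetical helper)."""
    match = re.fullmatch(r'(\d+)-(\d+)', height.strip())
    if match is None:
        return None  # leave unparseable heights as missing
    feet, inches = match.groups()
    return 12 * int(feet) + int(inches)

# Made-up inputs: the same player listed with different markers in different years.
names = ['Tom Brady*', 'Tom Brady*+', 'Jamal Lewis']
cleaned = sorted(set(clean_name(n) for n in names))

print(cleaned)                   # ['Jamal Lewis', 'Tom Brady']
print(height_to_inches('6-4'))   # 76
```

Once the markers are gone, repeated Pro Bowl appearances collapse to a single name, which is why deduplicating the list afterwards is enough.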

In [32]:
probowlers = []

df = pd.read_html('http://www.pro-football-reference.com/play-index/nfl-combine-results.cgi?request=1&year_min=2000&year_max=2000&height_min=65&height_max=82&weight_min=149&weight_max=375&pos=K&show=p&order_by=year_id')[0]

for year in range(2000,2016):
    df1 = pd.read_html('http://www.pro-football-reference.com/play-index/nfl-combine-results.cgi?request=1&year_min=%s&year_max=%s&height_min=65&height_max=82&weight_min=149&weight_max=375&pos=QB&pos=WR&pos=TE&pos=RB&pos=FB&pos=OT&pos=OG&pos=C&show=p&order_by=year_id' % (year, year))[0]
    df2 = pd.read_html('http://www.pro-football-reference.com/play-index/nfl-combine-results.cgi?request=1&year_min=%s&year_max=%s&height_min=65&height_max=82&weight_min=149&weight_max=375&pos=DE&pos=DT&pos=ILB&pos=OLB&pos=SS&pos=FS&pos=CB&show=p&order_by=year_id' % (year, year))[0]
    df3 = pd.read_html('http://www.pro-football-reference.com/play-index/nfl-combine-results.cgi?request=1&year_min=%s&year_max=%s&height_min=65&height_max=82&weight_min=149&weight_max=375&pos=LS&pos=K&pos=P&show=p&order_by=year_id' % (year, year))[0]
    df = pd.concat([df, df1, df2, df3], ignore_index=True)
    probowlers += pd.read_html('http://www.pro-football-reference.com/years/%s/probowl.htm' % year)[0]['Unnamed: 1'].tolist()

probowlers = [s.replace('+', '').replace('*','').replace('%', '').replace('&', '') for s in probowlers]

df = df.drop(df[df['Player']=='Player'].index)
# Split 'feet-inches' once, then combine into total inches (float so missing heights stay NaN)
height_parts = df['Height'].str.extract(r'([0-9]+)-([0-9]*\.?[0-9]+)').astype(float)
df['Height_inches'] = 12*height_parts[0] + height_parts[1]
            
drop_columns = ['Rk', 'AV', 'College', 'Drafted (tm/rnd/yr)', 'Height']
df = df.drop(columns=drop_columns)
df = df.drop_duplicates()

probowlers = list(set(probowlers))

df['Probowl'] = df['Player'].isin(probowlers)

numeric_cols = ['Year','Wt', '40YD', 'Vertical', 'BenchReps', 'Broad Jump', '3Cone', 'Shuttle', 'Height_inches']

# np.number is an abstract type and is rejected by current pandas; cast to float instead
df[numeric_cols] = df[numeric_cols].astype(float)

df['Probowl'] = df['Probowl'].astype('category')

df.to_csv('/Users/richard/data/NFL.csv', index_label='idx')

Running the cell above produces a CSV file that we can analyze later (see the other notebooks in this series for position-by-position analysis).
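One thing worth knowing about the CSV round trip: dtypes are not stored in the file, so the `category` dtype set on `Probowl` comes back as a plain boolean. The sketch below shows this on a toy frame with made-up rows, writing to an in-memory buffer instead of the real file path.

```python
import io
import pandas as pd

# Toy frame with made-up rows, standing in for the scraped data.
toy = pd.DataFrame({
    'Player': ['Player A', 'Player B'],
    'Wt': [211.0, 192.0],
    'Probowl': [True, False],
})
toy['Probowl'] = toy['Probowl'].astype('category')

# Write and read back with the same index label the notebook uses.
buffer = io.StringIO()
toy.to_csv(buffer, index_label='idx')
buffer.seek(0)
restored = pd.read_csv(buffer, index_col='idx')

print(toy['Probowl'].dtype)       # category
print(restored['Probowl'].dtype)  # bool
```

If a later notebook needs the categorical dtype, it has to reapply `astype('category')` after loading.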

In [33]:
df = pd.read_csv('/Users/richard/data/NFL.csv', index_col='idx')

In [34]:
df[df['Probowl']==True]

| idx | Year | Player | Pos | School | Wt | 40YD | Vertical | BenchReps | Broad Jump | 3Cone | Shuttle | Height_inches | Probowl |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2000.0 | Sebastian Janikowski | K | Florida State | 260.0 | | | | | | | 73.0 | True |
| 6 | 2000.0 | Marc Bulger | QB | West Virginia | 208.0 | 4.97 | | | 100.0 | 7.46 | 4.34 | 74.0 | True |
| 8 | 2000.0 | Tom Brady | QB | Michigan | 211.0 | 5.28 | 24.5 | | 99.0 | 7.20 | 4.38 | 76.0 | True |
| 18 | 2000.0 | Laveranues Coles | WR | Florida State | 192.0 | 4.41 | 34.0 | | 115.0 | 6.89 | 4.39 | 71.0 | True |
| 21 | 2000.0 | Chad Clifton | OT | Tennessee | 334.0 | 5.05 | 30.5 | 24.0 | 102.0 | 7.58 | 4.73 | 77.0 | True |
| 22 | 2000.0 | Shaun Alexander | RB | Alabama | 218.0 | 4.58 | | | | | | 72.0 | True |
| 36 | 2000.0 | James Williams | WR | Marshall | 180.0 | 4.59 | 36.0 | | 123.0 | 7.22 | 4.16 | 71.0 | True |
| 40 | 2000.0 | Chris Samuels | OT | Alabama | 325.0 | 5.08 | | | | | | 77.0 | True |
| 44 | 2000.0 | Marvel Smith | OT | Arizona State | 320.0 | 5.37 | 27.5 | 24.0 | 100.0 | 7.87 | 4.83 | 77.0 | True |
| 59 | 2000.0 | Jamal Lewis | RB | Tennessee | 240.0 | 4.58 | | 23.0 | | | | 72.0 | True |
