## Introduction
This document demonstrates how I obtained and cleaned NFL combine data, injury data, and player statistics before performing exploratory data analysis. The data came from [NFLsavant.com](http://nflsavant.com) (combine data), [Pro Football Reference](https://www.pro-football-reference.com/) (player birth dates, statistics, and draft data), and [Man-Games Lost](https://www.mangameslost.com) (injury data; access required a $7 subscription fee). I specifically wanted to look at potential relationships between players' NFL combine results and their injury rates, defined as games missed due to injury. I also pulled in general NFL player data to get each player's birth date, so I could determine their age at the beginning of each season and see how much age affected injury rates. Additionally, I grabbed career statistics for every player (rushing and passing yards, receptions, etc.) and NFL draft data (team, round, pick number) to see how the various combine measurements correlated with these. The data came either as CSV files, as tables that I copied and pasted into CSV files, or as tables on web pages that I scraped using the requests and bs4 (BeautifulSoup) packages in Python. All of it needed considerable cleaning and formatting before analysis, and I chose to do that work with pandas in Python.

## Import Required Packages

In [1]:
import csv
import pandas as pd
import numpy as np
from pprint import pprint
import random
import re

## Define Function for Auditing Dataframes
This is a function I use to get a snapshot of the data in a dataframe, so I can (hopefully) quickly identify problems that need to be fixed. It prints the total record count, the null count for each column, the maximum and minimum value of each column, and a random sample of up to 20 distinct values from each column.

In [10]:
# check for problem data
def audit_df(df):
    # show total records for dataset
    print("Total records in dataset:")
    print(len(df))
    print()
    # show how many null values for each column
    print("null values by column:")
    pprint(df.isnull().sum())
    print()
    print("Max value by column:")
    pprint(df.max())
    print("Min value by column:")
    pprint(df.min())
    for column in df.columns.values:
        print(column + ":")
        # convert column values to strings for better printing
        # (a list, because random.sample no longer accepts a set in recent Python versions)
        column_as_strings = list({str(i) for i in df[column]})
        # grab a random sample of up to 20 distinct values from each column
        pprint(random.sample(column_as_strings, min(20, len(column_as_strings))))
        print()

## NFL Combine Data
First we'll read in and format the NFL combine data from a single CSV file, which has data for every NFL combine event from 1999 to 2015.

### Read in data

In [11]:
# load data in from CSV
with open('combine.csv', 'r') as combine:
    combine_df = pd.read_csv(combine)
combine_df.head()

Unnamed: 0,year,name,firstname,lastname,position,heightfeet,heightinches,heightinchestotal,weight,arms,...,bench,round,college,pick,pickround,picktotal,wonderlic,nflgrade,Unnamed: 26,Unnamed: 27
0,2015,Ameer Abdullah,Ameer,Abdullah,RB,5,9,69.0,205,0.0,...,24.0,0,Nebraska,,0,0.0,0,5.9,,
1,2015,Nelson Agholor,Nelson,Agholor,WR,6,0,72.0,198,0.0,...,12.0,0,USC,,0,0.0,0,5.6,,
2,2015,Jay Ajayi,Jay,Ajayi,RB,6,0,72.0,221,0.0,...,19.0,0,Boise St.,,0,0.0,0,6.0,,
3,2015,Kwon Alexander,Kwon,Alexander,OLB,6,1,73.0,227,0.0,...,24.0,0,LSU,,0,0.0,0,5.4,,
4,2015,Mario Alford,Mario,Alford,WR,5,8,68.0,180,0.0,...,13.0,0,West Virginia,,0,0.0,0,5.3,,


### Format Combine Data

In [12]:
# remove rows where the heightfeet is listed as Jr. instead of a number
combine_df.query("heightfeet != ' Jr.'", inplace=True)

# remove all columns that start with "Unnamed"
for column in list(combine_df):
    if column.startswith("Unnamed"):
        combine_df.drop(column, inplace=True, axis=1)
        
# round all float values to 2 decimal places for uniformity
combine_df = combine_df.round(2)

# drop the columns we don't need
unneeded = [
    'heightfeet',
    'heightinches',
    'round',
    'college',
    'pick',
    'pickround',
    'picktotal',
    'wonderlic',
    'nflgrade',
    'firstname',
    'lastname'
]
combine_df = combine_df.loc[:, combine_df.columns.difference(unneeded)]

audit_df(combine_df)

Total records in dataset:
4945

null values by column:
arms                 0
bench                0
broad                0
fortyyd              0
hands                0
heightinchestotal    0
name                 0
position             0
tenyd                0
threecone            0
twentyss             0
twentyyd             0
vertical             0
weight               0
year                 0
dtype: int64

Max value by column:
arms                       37.75
bench                         51
broad                        147
fortyyd                     6.05
hands                      11.38
heightinchestotal             82
name                 Ziggy Ansah
position                      WR
tenyd                       1.92
threecone                   8.31
twentyss                    5.56
twentyyd                    2.98
vertical                      46
weight                       386
year                        2015
dtype: object
Min value by column:
arms                            0
...

## NFL Injuries Data
This data lives in CSV files, one for each year from 2009 to 2016. The code below reads all of these files into a single pandas DataFrame, then formats and cleans the data as needed.

### Read in injuries data

In [4]:
# read in data from the various CSV files
yearly_dfs = []
for year in range(2009, 2017):
    file_name = "player_injuries_{}.csv".format(year)
    # keep only rows where Player is not null (NaN never equals itself)
    temp_df = pd.read_csv(file_name).query("Player == Player")
    temp_df['season'] = year # add the current year
    temp_df['Total_Games'] = 16 # add the number of total games for that year
    yearly_dfs.append(temp_df)
# DataFrame.append was removed in pandas 2.0, so concatenate the yearly frames instead
raw_injuries_df = pd.concat(yearly_dfs)
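
The `query("Player == Player")` filter works because a null value never compares equal to itself. The following is a minimal sketch on a made-up frame (the player names and values are hypothetical, not taken from the injury files), showing that it selects the same rows as the more explicit `dropna`:

```python
import numpy as np
import pandas as pd

# hypothetical toy data: the second row has no player name
toy_df = pd.DataFrame({
    "Player": ["Player A", np.nan, "Player B"],
    "Total Injured": [7.0, 3.0, 9.0],
})

# NaN != NaN, so "Player == Player" is False only for the null row
via_query = toy_df.query("Player == Player")

# the more explicit equivalent
via_dropna = toy_df.dropna(subset=["Player"])

print(via_query)
print(via_query.equals(via_dropna))
```

Either form works; `dropna(subset=[...])` is arguably easier to read for someone unfamiliar with the self-comparison idiom.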

### Format Injuries Data

In [3]:
# keep only the columns we care about
injuries_df = raw_injuries_df.loc[:, ['Player', 'Pos', 'Total Injured', 'Total_Games', 'season']].copy()
# replace the space in the 'Total Injured' column name with an underscore
injuries_df['Total_Injured'] = injuries_df['Total Injured']
injuries_df.drop('Total Injured', axis=1, inplace=True)
injuries_df.head()

Unnamed: 0,Player,Pos,Total_Games,season,Total_Injured
0,Bryan Scott,DB,16,2009,7.0
1,Michael Lewis,WR,16,2009,9.0
2,Ed Reed,DB,16,2009,4.0
3,Antoine Winfield,DB,16,2009,6.0
4,Antwan Odom,DE,16,2009,10.0


## Get Player Data for Birth Dates and Various Statistics
This was trickier than I thought it would be. I ended up querying each player's page on the Pro Football Reference website to get birth dates, career statistics, and draft position. This required using the requests module to grab the page content and then crawling through the HTML with BeautifulSoup. It took a lot of trial and error, but I was eventually able to get accurate birth dates and other statistics for most (>96%) of the 2,342 players for whom I had both injury and combine data.

### Grab Alphabet Listing Pages for All Players
To find the individual pages for each player, I first have to grab all the alphabetical listing pages from A to Z, which have the links for each individual player page.

In [586]:
from bs4 import BeautifulSoup
import requests
from IPython import display

ses = requests.Session()
alphabet_pages = {}
pfr_url = "https://www.pro-football-reference.com/players/"
for letter in "ABCDEFGHIJKLMNOPQRSTUVWXYZ":
    resp = ses.get(pfr_url + letter)
    if resp.ok:
        soup = BeautifulSoup(resp.text, "lxml")
        alphabet_pages[letter] = soup

draft_pages = []
pfr_url = "https://www.pro-football-reference.com/years/"
for year in range(2000, 2017):
    resp = ses.get(pfr_url + str(year) + "/draft.htm")
    if resp.ok:
        soup = BeautifulSoup(resp.text, "lxml")
        table = soup.find('body').find(id='wrap').find(id='content').find(id='all_drafts').find(
            'div',
            class_='table_outer_container mobile_table'
        ).find(id='div_drafts').find('table', id='drafts').find('tbody')
        draft_pages.append(table)
        

## Grab Player Birthdates and Career Statistics

In [675]:
# get only player names seen in both injury and combine data
players = set(
    pd.merge(
        injuries_df.loc[:, ['Player']],
        combine_df.loc[:, ['name']],
        left_on=['Player'],
        right_on=['name'],
        how='inner'
    )['Player']
)
base_uri = 'https://www.pro-football-reference.com/'
player_stats_list = []
skipped_players = []
errors = []
dates_search = re.compile(r"\d{4}-\d{4}")
for player in players:
    letter = player.split(" ")[1][0]
    soup = alphabet_pages[letter]
    player_stats = {'Player': player}
    try:
        players_list = soup.find('body').find(id='wrap').find(id='content').find(id='all_players').find(id='div_players')
        for p in players_list.findAll('p'):
            if player in p.text:
                if re.search(dates_search, p.text):
                    dates = re.search(dates_search, p.text).group(0)
                    if dates.split("-")[-1] < '2009':
                        continue
                    player_stats['Start_Year'] = dates.split("-")[0]
                    player_stats['End_Year'] = dates.split("-")[-1]
                player_resp = ses.get(base_uri + p.find('a')['href'])
                if player_resp.ok:
                    player_soup = BeautifulSoup(player_resp.text, 'lxml')
                    stats_pullout = player_soup.find('body').find(id='wrap').find(id='info').find(class_='stats_pullout')
                    for div in stats_pullout.findAll('div', class_='p1'):
                        player_stats[div.find('h4').text] = div.findAll('p')[-1].text
                    for div in stats_pullout.findAll('div', class_='p2'):
                        player_stats[div.find('h4').text] = div.findAll('p')[-1].text
                    player_info = player_soup.find('body').find(id='wrap').find(id='info').find(id='meta').find('div', {
                            'itemtype': 'https://schema.org/Person'
                        })
                    # use a separate loop variable so the outer listing entry `p` is not shadowed
                    for info_p in player_info.findAll('p'):
                        if 'Position' in info_p.text:
                            player_stats['Pos'] = info_p.text.split(": ")[-1].split()[0]
                        if 'Born' in info_p.text:
                            if info_p.find('span', {'itemprop': 'birthDate'}):
                                player_stats['DOB'] = info_p.find('span', {'itemprop': 'birthDate'})['data-birth']
    except Exception as error:
        errors.append(error)
    if player_stats != {'Player': player}:
        player_stats_list.append(player_stats)
    else:
        skipped_players.append(player)
    display.clear_output(wait=True)
    print("Succeeded: {0}, Failed: {1}, Total: {2}, Success Rate: {3}%".format(
        len(player_stats_list),
        len(skipped_players),
        len(players),
        round((len(player_stats_list) / (len(player_stats_list) + len(skipped_players) * 1.0) * 100), 2)
    ))

Succeeded: 2258, Failed: 200, Total: 2342, Success Rate: 91.86%
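
The loop above filters listing entries by the career span it finds with the `dates_search` regular expression. As a minimal sketch of just that step, the example below parses the span into integers and keeps careers that reach the injury data window (2009 onward). The listing lines and the `career_span` helper are hypothetical; the real page text may be shaped differently.

```python
import re

dates_search = re.compile(r"\d{4}-\d{4}")

# hypothetical listing lines, assumed to look like "Name (POS) start-end"
listing_lines = [
    "Player One (WR) 2011-2016",
    "Player Two (QB) 1998-2005",
    "Player Three (RB)",
]

def career_span(text):
    """Return (start_year, end_year) as ints, or None if no span is present."""
    match = dates_search.search(text)
    if match is None:
        return None
    start, end = match.group(0).split("-")
    return int(start), int(end)

spans = [career_span(line) for line in listing_lines]

# keep only careers that extend into the injury data window (2009 onward)
recent = [span for span in spans if span is not None and span[1] >= 2009]
print(recent)
```

Comparing integers instead of strings avoids relying on every year having exactly four digits.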


### Convert to DataFrame

In [None]:
player_stats_df = pd.DataFrame(player_stats_list)

## Grab Statistics for Draft Position

### Grab Tables for Each Draft Year
The data for each draft is kept in an individual table on Pro Football Reference, so we first need to grab those tables so we can then scan through them to look for each player's draft data.

In [684]:
# key the pages by draft year, so a failed request cannot shift later years onto the wrong key
draft_pages = {}
ses = requests.Session()
pfr_url = "https://www.pro-football-reference.com/years/"
for year in range(2000, 2017):
    resp = ses.get(pfr_url + str(year) + "/draft.htm")
    if resp.ok:
        draft_pages[str(year)] = BeautifulSoup(resp.text, "lxml")

# pull the body of the draft table out of each page
draft_tables = {}
for year, page in draft_pages.items():
    draft_tables[year] = page.find('body').find(id='wrap').find(id='content').find(id='all_drafts').find(
        'div',
        {'class': 'table_outer_container'}).find(id='div_drafts').find(id='drafts').find('tbody')

### Find Player Draft Data within the Tables
The success rate is much lower than for the player statistics (~80% vs. ~96%), but this makes sense when you consider that the draft data only goes back to the year 2000, which excludes any players drafted before then. Any player who entered the league as an undrafted free agent is also missing from these tables.

In [705]:
draft_stats = []
skipped_draft_stats = []
for row in player_stats_df.loc[:, ['Player', 'Start_Year', 'Pos']].iterrows():
    player = row[1]['Player']
    start_year = row[1]['Start_Year']
    position = row[1]['Pos']
    player_dict = {'Player': player, 'Pos': position}
    if draft_tables.get(start_year):
        table = draft_tables[start_year]
        for tr in table.findAll('tr'):
            if tr.find('td', {'data-stat': 'player'}):
                if player in tr.find('td', {'data-stat': 'player'}).text:
                    # the table was looked up by start year, so that is the draft year
                    player_dict['Draft_Year'] = start_year
                    if tr.find('th', {'data-stat': 'draft_round'}):
                        player_dict['Draft_Round'] = tr.find('th', {'data-stat': 'draft_round'}).text
                    if tr.find('td', {'data-stat': 'draft_pick'}):
                        player_dict['Draft_Pick'] = tr.find('td', {'data-stat': 'draft_pick'}).text
        if player_dict != {'Player': player, 'Pos': position}:
            draft_stats.append(player_dict)
        else:
            skipped_draft_stats.append(player)
    display.clear_output(wait=True)
    print("Succeeded: {0}, Failed: {1}, Total: {2}, Success Rate: {3}%".format(
        len(draft_stats),
        len(skipped_draft_stats),
        len(players),
        round((len(draft_stats) / (len(draft_stats) + len(skipped_draft_stats) * 1.0) * 100), 2)
    ))

Succeeded: 1775, Failed: 458, Total: 2342, Success Rate: 79.49%


### Convert to Pandas DataFrame

In [707]:
draft_stats_df = pd.DataFrame(draft_stats)

## Merge the Statistics and Draft DataFrames

In [751]:
merged_stats_df = pd.merge(
    left=player_stats_df,
    right=draft_stats_df,
    left_on=['Player', 'Pos'],
    right_on=['Player', 'Pos'],
    how='left'
)

### Fix Issues Discovered During Auditing

In [None]:
# ensure we have a position for each player
merged_stats_df.query("Pos == Pos", inplace=True)

# for players with multiple positions listed, create one record for each position (flatten out the values)
multiple_positions = []
for row in merged_stats_df.iterrows():
    position = row[1]['Pos']
    if "," in position:
        positions = position.split(",")
    elif "-" in position:
        positions = position.split("-")
    else:
        positions = []
    for pos in positions:
        # copy the record so each position gets its own dict
        temp_dict = dict(row[1])
        temp_dict['Pos'] = pos.strip()
        multiple_positions.append(temp_dict)
# DataFrame.append was removed in pandas 2.0; concatenate and keep the result
merged_stats_df = pd.concat([merged_stats_df, pd.DataFrame(multiple_positions)], ignore_index=True)

# remove the records where the positions were split
merged_stats_df = merged_stats_df[~merged_stats_df.Pos.str.contains("-", regex=False)]
merged_stats_df = merged_stats_df[~merged_stats_df.Pos.str.contains(",", regex=False)]

# fix duplicate names for the same position
merged_stats_df.loc[merged_stats_df['Pos'] == 'G', 'Pos'] = 'OG'
merged_stats_df.loc[merged_stats_df['Pos'] == 'T', 'Pos'] = 'OT'

# add position group for each player
position_groups = {'C': 'OL', 'OG': 'OL', 'OT': 'OL', 'DT': 'DL', 'DE': 'DL', 'ILB': 'LB', 'OLB': 'LB',
                   'CB': 'DB', 'FS': 'DB', 'SS': 'DB', 'K': 'ST', 'P': 'ST', 'TE': 'TE', 'FB': 'FB', 'S': 'DB', 'LB': 'LB',
                   'NT': 'DL'}
def pos_to_group(pos):
    if position_groups.get(pos):
        return position_groups[pos]
    else:
        return pos
merged_stats_df['position_group'] = merged_stats_df['Pos'].apply(pos_to_group)
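
The "one record per position" step can also be written without an explicit loop. The following is a minimal sketch of that alternative using `str.split` and `DataFrame.explode` on a made-up frame; the player names and position strings are hypothetical, and this is not the approach used in the cell above.

```python
import pandas as pd

# hypothetical toy data with comma- and hyphen-separated positions
toy_stats = pd.DataFrame({
    "Player": ["Player A", "Player B", "Player C"],
    "Pos": ["G,T", "DE-OLB", "WR"],
})

flattened = (
    toy_stats
    # split on either separator; a multi-character pattern is treated as a regex
    .assign(Pos=toy_stats["Pos"].str.split(r"[,-]"))
    # one row per list element, with the other columns repeated
    .explode("Pos")
    .reset_index(drop=True)
)
flattened["Pos"] = flattened["Pos"].str.strip()
print(flattened)
```

Because `explode` replaces each original row with its split rows, there is no separate step to remove the unsplit records afterwards.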

## Add Birthdates to Injuries Data to Calculate Age
Here we'll merge the injuries data with the player statistics just to get the birthdates for each player to calculate age at the beginning of each season.

In [770]:
injuries_output_df = pd.merge(
    left=injuries_df,
    right=merged_stats_df.loc[:, ['Player', 'Pos', 'DOB', 'position_group']],
    left_on=['Player', 'Pos'],
    right_on=['Player', 'Pos'],
    how='left'
).query("DOB == DOB")
injuries_output_df.head()

Unnamed: 0,Player,Pos,Total_Games,season,Total_Injured,DOB,position_group
0,Bryan Scott,DB,16,2009,7.0,1981-04-13,DB
3,Antoine Winfield,DB,16,2009,6.0,1977-06-24,DB
4,Antwan Odom,DE,16,2009,10.0,1981-09-24,DL
5,Owen Daniels,TE,16,2009,8.0,1982-11-09,TE
6,Kris Jenkins,DT,16,2009,10.0,1979-08-03,DL


## Calculate Player Age for Each Season
Here we'll identify how old, in years, each player is at the beginning of each NFL season.

In [771]:
# create month and day columns, to combine with the season (year) column to calculate the player's age each year
# use September 1st for the month and date values, to approximate the beginning of the league year
injuries_output_df['month'] = 9
injuries_output_df['day'] = 1
# pandas requires the year field to be named 'year' to do the .to_datetime operation
injuries_output_df['year'] = injuries_output_df['season']
# take the time between the season start and the birth date, then convert it to whole years
# (365.2425 days per year, the average Gregorian year; astype('timedelta64[Y]') is rejected by pandas 2.x)
season_start = pd.to_datetime(injuries_output_df[['year', 'month', 'day']])
birth_date = pd.to_datetime(injuries_output_df['DOB'])
injuries_output_df['age'] = (season_start - birth_date).dt.days // 365.2425
# remove the columns we no longer need
injuries_output_df.drop(['month', 'day', 'year', 'DOB'], inplace=True, axis=1)
injuries_output_df.head()

Unnamed: 0,Player,Pos,Total_Games,season,Total_Injured,position_group,age
0,Bryan Scott,DB,16,2009,7.0,DB,28.0
3,Antoine Winfield,DB,16,2009,6.0,DB,32.0
4,Antwan Odom,DE,16,2009,10.0,DL,27.0
5,Owen Daniels,TE,16,2009,8.0,TE,26.0
6,Kris Jenkins,DT,16,2009,10.0,DL,30.0
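
As a cross-check on the age calculation, the sketch below computes the same "age on September 1st" with plain calendar arithmetic instead of a timedelta conversion. It uses three players and birth dates from the output above; the `birthday_passed` name is only illustrative.

```python
import pandas as pd

toy = pd.DataFrame({
    "Player": ["Bryan Scott", "Antoine Winfield", "Antwan Odom"],
    "season": [2009, 2009, 2009],
    "DOB": ["1981-04-13", "1977-06-24", "1981-09-24"],
})
dob = pd.to_datetime(toy["DOB"])

# has the player's birthday already occurred by September 1st of the season?
birthday_passed = (dob.dt.month < 9) | ((dob.dt.month == 9) & (dob.dt.day <= 1))

# year difference, minus one if the birthday falls after September 1st
toy["age"] = toy["season"] - dob.dt.year - (~birthday_passed).astype(int)
print(toy[["Player", "age"]])
```

This version returns integer ages and does not depend on an average year length, which matters only for players born within a day or so of September 1st.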


## Fix Other Miscellaneous Issues in Dataset

In [773]:
# remove duplicates in dataset where the player, position, and season are all the same
# keep the higher of two values when there's a conflict
injuries_output_df = injuries_output_df.groupby(by=['Player', 'Pos', 'season']).max().reset_index()

injuries_output_df.head()

Unnamed: 0,Player,Pos,season,Total_Games,Total_Injured,position_group,age
0,A.J. Cann,OG,2015,16,0.0,OL,23.0
1,A.J. Derby,TE,2016,16,1.0,TE,24.0
2,A.J. Edds,LB,2011,16,4.0,LB,23.0
3,A.J. Edds,LB,2012,16,16.0,LB,24.0
4,A.J. Edds,LB,2014,16,3.0,LB,26.0
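
To illustrate what the de-duplication does, here is a minimal sketch on made-up rows (hypothetical players and values) in which one player has two conflicting records for the same season:

```python
import pandas as pd

# hypothetical toy data: Player A appears twice for the 2011 season
toy = pd.DataFrame({
    "Player": ["Player A", "Player A", "Player B"],
    "Pos": ["LB", "LB", "WR"],
    "season": [2011, 2011, 2011],
    "Total_Injured": [4.0, 6.0, 1.0],
})

# one row per (Player, Pos, season), keeping the largest value in each remaining column
deduped = toy.groupby(["Player", "Pos", "season"]).max().reset_index()
print(deduped)
```

Note that the maximum is taken column by column, so with several value columns the surviving row can combine values from different duplicates.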


## Group Injuries Data to Calculate Overall Injured Percentage

In [774]:
grouped_injured_df = injuries_output_df.loc[:, ['Player', 'Pos', 'Total_Injured', 'Total_Games', 'position_group']].groupby(
    by=['Player', 'Pos']).agg({'position_group': 'max', 'Total_Injured': 'sum', 'Total_Games': 'sum'}).reset_index()
grouped_injured_df['injured_pct'] = (grouped_injured_df['Total_Injured'] / grouped_injured_df['Total_Games']).round(2)
grouped_injured_df.head()

Unnamed: 0,Player,Pos,position_group,Total_Games,Total_Injured,injured_pct
0,A.J. Cann,OG,OL,16,0.0,0.0
1,A.J. Derby,TE,TE,16,1.0,0.06
2,A.J. Edds,LB,LB,48,23.0,0.48
3,A.J. Green,WR,WR,96,9.0,0.09
4,A.J. Hawk,LB,LB,32,2.0,0.06
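
As a worked example of the injured percentage, the sketch below reproduces the A.J. Edds row using his three seasons from the de-duplicated output earlier in this document (4, 16, and 3 games missed, out of 16 games each):

```python
import pandas as pd

edds = pd.DataFrame({
    "Player": ["A.J. Edds"] * 3,
    "Pos": ["LB"] * 3,
    "season": [2011, 2012, 2014],
    "Total_Injured": [4.0, 16.0, 3.0],
    "Total_Games": [16, 16, 16],
})

# sum games missed and games possible across all of the player's seasons
grouped = edds.groupby(["Player", "Pos"], as_index=False)[["Total_Injured", "Total_Games"]].sum()

# 23 games missed out of 48 possible
grouped["injured_pct"] = (grouped["Total_Injured"] / grouped["Total_Games"]).round(2)
print(grouped)
```

The result, 23 / 48 rounded to two decimals, agrees with the 0.48 shown for A.J. Edds in the table above.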


## Add Injured Percentage and Position Group to Combine Data

In [775]:
combine_output_df = pd.merge(
    left=combine_df,
    right=grouped_injured_df.loc[:, ['Player','Pos','injured_pct','position_group']],
    left_on=['name', 'position'],
    right_on=['Player','Pos'],
    how='left'
).drop(['name','position'], axis=1).query("injured_pct == injured_pct")
combine_output_df.head()

Unnamed: 0,arms,bench,broad,fortyyd,hands,heightinchestotal,tenyd,threecone,twentyss,twentyyd,vertical,weight,year,Player,Pos,injured_pct,position_group
0,0.0,24.0,130.0,4.6,0.0,69.0,0.0,6.79,3.95,0.0,42.5,205,2015,Ameer Abdullah,RB,0.44,RB
1,0.0,12.0,0.0,4.42,0.0,72.0,0.0,0.0,0.0,0.0,0.0,198,2015,Nelson Agholor,WR,0.19,WR
2,0.0,19.0,121.0,4.57,0.0,72.0,0.0,7.1,4.1,0.0,39.0,221,2015,Jay Ajayi,RB,0.25,RB
4,0.0,13.0,121.0,4.43,0.0,68.0,0.0,6.64,4.07,0.0,34.0,180,2015,Mario Alford,WR,0.06,WR
5,0.0,11.0,121.0,4.53,0.0,72.0,0.0,6.96,4.28,0.0,35.5,221,2015,Javorius Allen,RB,0.06,RB
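
The left merge followed by the `injured_pct == injured_pct` filter keeps only combine records that found a match in the injury data. When it is useful to see which records did not match, the `indicator` parameter of `pd.merge` can help. This is a minimal sketch on made-up frames (hypothetical names and values), not part of the original pipeline:

```python
import pandas as pd

# hypothetical toy data: Player C has combine data but no injury record
combine_toy = pd.DataFrame({
    "name": ["Player A", "Player B", "Player C"],
    "position": ["RB", "WR", "QB"],
    "fortyyd": [4.6, 4.42, 4.9],
})
injured_toy = pd.DataFrame({
    "Player": ["Player A", "Player B"],
    "Pos": ["RB", "WR"],
    "injured_pct": [0.44, 0.19],
})

merged = pd.merge(
    left=combine_toy,
    right=injured_toy,
    left_on=["name", "position"],
    right_on=["Player", "Pos"],
    how="left",
    indicator=True,  # adds a _merge column: 'both' or 'left_only'
)

# combine records with no matching injury record
unmatched = merged.loc[merged["_merge"] == "left_only", "name"].tolist()

# matched records, shaped like the output of the cell above
matched = merged[merged["_merge"] == "both"].drop(columns=["name", "position", "_merge"])
print(unmatched)
```

Listing the unmatched names is a quick way to spot spelling or position mismatches between the two sources.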


## Add Injured Percentage to Player Stats

In [776]:
stats_output_df = pd.merge(
    left=merged_stats_df,
    right=grouped_injured_df.loc[:, ['Player', 'Pos', 'injured_pct']],
    left_on=['Player', 'Pos'],
    right_on=['Player', 'Pos'],
    how='left'
).query("injured_pct == injured_pct")
stats_output_df.head()

Unnamed: 0,AV,Cmp%,DOB,End_Year,FF,FGA,FGM,FantPt,G,GS,...,XPA,XPM,Y/A,Y/R,Yds,Draft_Pick,Draft_Round,Draft_Year,position_group,injured_pct
0,80,,1986-04-22,2015,,,,1591.4,127,,...,,,4.3,,9112.0,12,1,2008,RB,0.1
1,17,,1984-11-14,2015,,,,,99,,...,,,,,197.0,142,5,2008,DB,0.03
2,79,,1985-03-29,2016,,,,,126,123.0,...,,,,,,59,2,2008,OL,0.25
3,89,,1979-09-12,2012,,,,,136,132.0,...,,,,,,164,5,2008,OL,0.48
4,48,,1988-05-27,2016,,,,,115,,...,,,,,228.0,25,1,2008,DB,0.12


## Write Datasets to CSV
Finally, we'll write the dataframes to CSV so we can open them in R and perform the detailed exploratory data analysis.

In [777]:
stats_output_df.to_csv('football_data_(stats).csv')
combine_output_df.to_csv('football_data_(combine).csv')
injuries_output_df.to_csv('football_data_(injuries).csv')
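
By default `to_csv` also writes the dataframe index as an unnamed first column, which shows up as `Unnamed: 0` when the file is read back with pandas. The sketch below demonstrates this on a made-up frame, using an in-memory buffer instead of a file; whether to keep the index is a judgment call, and `index=False` is shown only as an option.

```python
import io
import pandas as pd

# hypothetical toy data
toy = pd.DataFrame({"Player": ["Player A", "Player B"], "injured_pct": [0.44, 0.19]})

# default behaviour: the index is written as an unnamed first column
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
reread_default = pd.read_csv(with_index)

# index=False leaves the index out of the file
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
reread_clean = pd.read_csv(without_index)

print(list(reread_default.columns))
print(list(reread_clean.columns))
```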