<div style="text-align: right"> Tommy Evans-Barton </div>
<div style="text-align: right"> GM Draft Analysis </div>

# EDA of Combine, Draft, and Executive NFL Data

The purpose of this notebook is to briefly analyze the **raw** data used in the main report and to justify the cleaning approaches applied later. Most of this work is academic and is mainly useful for a deeper dive into the project. All data is courtesy of Pro Football Reference.

In [17]:
import pandas as pd
import numpy as np
import os
import json
import requests
import urllib
import re
import sys
from glob import glob

In [18]:
# Assumes the notebook is launched from the project root, so that PWD is the top-level path
TOP_PATH = os.environ['PWD']

In [30]:
%load_ext autoreload
%autoreload 2

In [20]:
sys.path.append(TOP_PATH + '/config')
sys.path.append(TOP_PATH + '/src')
sys.path.append(TOP_PATH + '/src/data')
sys.path.append(TOP_PATH + '/src/processing')

In [21]:
import etl
import processing

In [22]:
# Download data if need be
# with open(TOP_PATH + '/config/data-params.json') as fh:
#     data_cfg = json.load(fh)
# etl.get_data(**data_cfg)

## Draft Data
This section explores the draft data, using the 2014 draft class as a proxy for the other years.

In [27]:
raw_draft_2014 = pd.read_csv(TOP_PATH + '/data/raw/DRAFT_2014.csv')

### Checking Typing of Draft Data
Based on the 2014 data, the inferred column types look reasonable as loaded. The counting stats (and `To`) come in as `float64` rather than `int64` only because they contain missing values.

In [29]:
raw_draft_2014.dtypes

Rnd                int64
Pick               int64
Tm                object
Player            object
Pos               object
Age                int64
To               float64
AP1                int64
PB                 int64
St                 int64
G                float64
Passing Cmp      float64
Passing Att      float64
Passing Yds      float64
Passing TD       float64
Passing Int      float64
Rushing Att      float64
Rushing Yds      float64
Rushing TD       float64
Receiving Rec    float64
Receiving Yds    float64
Receiving TD     float64
Solo             float64
Int              float64
Sk               float64
College/Univ      object
YEAR               int64
dtype: object

### Checking Missingness of Draft Data

It seems that most of the missingness in the draft data is in the defensive stats (`Solo` : *42.2%*, `Int` : *78.5%*, and `Sk` : *71.5%*).

In [33]:
raw_draft_2014.isnull().mean()

Rnd              0.000000
Pick             0.000000
Tm               0.000000
Player           0.000000
Pos              0.000000
Age              0.000000
To               0.015625
AP1              0.000000
PB               0.000000
St               0.000000
G                0.015625
Passing Cmp      0.015625
Passing Att      0.015625
Passing Yds      0.015625
Passing TD       0.015625
Passing Int      0.015625
Rushing Att      0.015625
Rushing Yds      0.015625
Rushing TD       0.015625
Receiving Rec    0.015625
Receiving Yds    0.015625
Receiving TD     0.015625
Solo             0.421875
Int              0.785156
Sk               0.714844
College/Univ     0.000000
YEAR             0.000000
dtype: float64

In [72]:
raw_draft_2014.isnull().sum()

Rnd                0
Pick               0
Tm                 0
Player             0
Pos                0
Age                0
To                 4
AP1                0
PB                 0
St                 0
G                  4
Passing Cmp        4
Passing Att        4
Passing Yds        4
Passing TD         4
Passing Int        4
Rushing Att        4
Rushing Yds        4
Rushing TD         4
Receiving Rec      4
Receiving Yds      4
Receiving TD       4
Solo             108
Int              201
Sk               183
College/Univ       0
YEAR               0
dtype: int64

As the table below shows, these missing entries are spread across positions rather than concentrated in one group. Given that, along with the reliability of the data source and the nature of these features (counting stats), it seems reasonable to *treat missing entries in these statistical fields as equivalent to zero*.

In [63]:
raw_draft_2014[['Pos', 'Solo', 'Int', 'Sk']].groupby('Pos').agg(lambda x : x.isnull().sum())

| Pos | Solo | Int | Sk |
|-----|------|------|------|
| C | 5.0 | 6.0 | 6.0 |
| DB | 4.0 | 21.0 | 30.0 |
| DE | 3.0 | 13.0 | 7.0 |
| DT | 2.0 | 18.0 | 6.0 |
| FB | 1.0 | 2.0 | 2.0 |
| G | 7.0 | 9.0 | 9.0 |
| K | 2.0 | 2.0 | 2.0 |
| LB | 4.0 | 22.0 | 13.0 |
| OL | 17.0 | 19.0 | 19.0 |
| P | 0.0 | 1.0 | 1.0 |


The remaining columns with missing values each have exactly 4 nulls, and on closer inspection these all belong to the same four players. These players appear never to have played a game at the NFL level, so their statistical entries can be set to zero, while `To` can be left null to indicate their lack of NFL experience/career length.

In [70]:
raw_draft_2014[raw_draft_2014['To'].isnull()].T

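|  | 212 | 225 | 243 | 251 |
|--------|-----|-----|-----|-----|
| Rnd | 6 | 7 | 7 | 7 |
| Pick | 213 | 226 | 244 | 252 |
| Tm | NYJ | STL | NWE | CIN |
| Player | Tajh Boyd | Mitchell Van Dyk | Jeremy Gallon | Lavelle Westbrooks |
| Pos | QB | OL | WR | DB |
| Age | 23 | 23 | 24 | 22 |
| To | NaN | NaN | NaN | NaN |
| AP1 | 0 | 0 | 0 | 0 |
| PB | 0 | 0 | 0 | 0 |
| St | 0 | 0 | 0 | 0 |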
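
The cleaning rule described above (zero for missing stats, `To` left null) might look roughly like the sketch below. It is not the project's actual `processing` code: the function name `fill_missing_stats`, its `keep_null` parameter, and the toy frame are all hypothetical, and it assumes every numeric column other than `To` is safe to zero-fill.

```python
import pandas as pd

# Toy stand-in for the raw draft data; the values are made up for illustration.
toy_draft = pd.DataFrame({
    "Player": ["A", "B", "C"],
    "To": [2019.0, None, 2021.0],
    "G": [48.0, None, 80.0],
    "Solo": [None, None, 120.0],
    "Int": [None, None, 3.0],
    "Sk": [4.5, None, None],
})

def fill_missing_stats(draft, keep_null=("To",)):
    """Return a copy with numeric columns zero-filled, except those in `keep_null`."""
    out = draft.copy()
    # Assumption: every numeric column except `keep_null` is a counting stat.
    stat_cols = [c for c in out.select_dtypes("number").columns if c not in keep_null]
    out[stat_cols] = out[stat_cols].fillna(0)
    return out

cleaned = fill_missing_stats(toy_draft)
```

On the real data, columns such as `Rnd`, `Pick`, and `Age` would also be selected, but since they contain no nulls the fill leaves them unchanged.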


## Combine Data
This section explores the Combine data, using 2014 as a proxy for the other years.

In [85]:
raw_combine_2014 = pd.read_csv(TOP_PATH + '/data/raw/COMBINE_2014.csv')

### Checking Typing of Combine Data

Below are the variable types of the features in the Combine dataset. Only one needs changing: `Ht` (height) is currently a string in 'Ft-In' format and ought to be numeric (an integer height in inches).

In [87]:
raw_combine_2014.dtypes

Player         object
Pos            object
School         object
Ht             object
Wt              int64
40yd          float64
Vertical      float64
Bench         float64
Broad Jump    float64
3Cone         float64
Shuttle       float64
dtype: object

In [88]:
print(raw_combine_2014.Ht[0])

6-1
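
A minimal sketch of the `Ht` conversion, assuming every non-null entry follows the 'Ft-In' pattern seen above. The helper name `height_to_inches` and the toy series are hypothetical and not part of the project code.

```python
import pandas as pd

def height_to_inches(ht):
    """Convert a 'Ft-In' string such as '6-1' to total inches; missing values pass through."""
    if pd.isnull(ht):
        return None
    feet, inches = str(ht).split("-")
    return int(feet) * 12 + int(inches)

# Toy heights in the same format as the `Ht` column.
toy_heights = pd.Series(["6-1", "5-11", "6-4"])
heights_in = toy_heights.map(height_to_inches)
```

On the real frame this would be applied as `raw_combine_2014['Ht'].map(height_to_inches)`. An entry that does not match the pattern would raise a `ValueError`, which is arguably the desired behavior during exploration.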


### Missingness of Combine Data
At first glance, the missingness of the data tracks with real-life experience: players often opt out of some or all of the athletic testing, but always weigh in. Each statistical feature was also inspected in more detail for a 'placeholder stat' standing in for null entries, but none appeared in this dataset.

In [89]:
raw_combine_2014.isnull().mean()

Player        0.000000
Pos           0.000000
School        0.000000
Ht            0.000000
Wt            0.000000
40yd          0.021021
Vertical      0.192192
Bench         0.270270
Broad Jump    0.213213
3Cone         0.339339
Shuttle       0.333333
dtype: float64

Therefore, these nulls will be left as they are, to indicate that the player did not participate in the drill.
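
The placeholder check mentioned above is not shown in this notebook. One way such a check could be done is sketched below. The function `placeholder_report`, the sentinel values, and the toy data are assumptions for illustration only, not a record of what was actually run.

```python
import pandas as pd

# Toy stand-in for two combine drill columns; the values are made up.
toy_combine = pd.DataFrame({
    "40yd": [4.43, None, 4.61, 5.02],
    "Bench": [None, 22.0, 18.0, None],
})

def placeholder_report(df, cols, sentinels=(0, -1, 99, 999)):
    """Summarize null share, hypothetical sentinel hits, and value range per column."""
    rows = {}
    for col in cols:
        rows[col] = {
            "null_share": df[col].isnull().mean(),
            # Count entries equal to any suspected placeholder value.
            "sentinel_count": int(df[col].isin(list(sentinels)).sum()),
            "min": df[col].min(),
            "max": df[col].max(),
        }
    return pd.DataFrame(rows).T

report = placeholder_report(toy_combine, ["40yd", "Bench"])
```

A nonzero `sentinel_count`, or a `min`/`max` outside a plausible range for the drill, would suggest that some missing results were recorded as placeholder numbers rather than as nulls.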

## Executive Data

In [103]:
raw_executives = pd.read_csv(TOP_PATH + '/data/raw/EXECUTIVES.csv')

### Checking Typing of Executives Data

The typing of the Executives data looks fine, but each row records only the first (`From`) and last (`To`) year of a tenure rather than a complete list of the years worked, which will be important for future work.

In [105]:
raw_executives.dtypes

Person    object
Teams     object
From       int64
To         int64
Titles    object
dtype: object
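
One way to obtain a complete list of years worked from `From` and `To` is to expand each row into one row per year. The sketch below assumes that both endpoints are inclusive and that each tenure is contiguous. The `Year` column, the function `expand_years`, and the toy data are hypothetical.

```python
import pandas as pd

# Toy stand-in for the executives data; the names and years are made up.
toy_execs = pd.DataFrame({
    "Person": ["Exec A", "Exec B"],
    "From": [2012, 2015],
    "To": [2014, 2015],
})

def expand_years(execs):
    """Return one row per (person, year), treating From and To as inclusive bounds."""
    out = execs.copy()
    out["Year"] = [
        list(range(start, end + 1)) for start, end in zip(out["From"], out["To"])
    ]
    # explode turns each list of years into separate rows.
    return out.explode("Year").reset_index(drop=True)

exec_years = expand_years(toy_execs)
```

A long format like this makes it straightforward to join executives to draft classes on a year column, though an executive with several teams listed in `Teams` would still need separate handling.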

### Checking Missingness of Executives Data
The executives data has no missing values in any column.

In [106]:
raw_executives.isnull().mean()

Person    0.0
Teams     0.0
From      0.0
To        0.0
Titles    0.0
dtype: float64

## Closing
In closing, the data is fairly complete and comprehensive. Some alterations will be necessary, but for the most part it makes for a nice group of datasets to work with.