<div style="text-align: right"> Tommy Evans-Barton </div>
<div style="text-align: right"> WR Year 2 Jumps </div>

# Modeling and Analysis Notebook: Traditional vs. Advanced Statistics

The purpose of this notebook is to build two sets of models, one based on traditional statistics (yards, touchdowns, receptions, etc.) and one based on advanced statistics (DVOA, DYAR, etc.), or functions of these statistics, to see which set of statistics produces the more accurate model.

In [27]:
import os
import sys
import pandas as pd
import numpy as np
import json
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import Binarizer
from sklearn.preprocessing import FunctionTransformer
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso
from sklearn.pipeline import Pipeline
from sklearn.metrics import r2_score
from sklearn.metrics import mean_squared_error
from scipy import stats

In [2]:
from warnings import simplefilter
simplefilter(action='ignore', category=FutureWarning)

In [3]:
%load_ext autoreload
%autoreload 2

In [4]:
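# Project root; fall back to the working directory if PWD is not set
TOP_PATH = os.environ.get('PWD', os.getcwd())

In [5]:
# Make the project's src/ modules importable
sys.path.append(os.path.join(TOP_PATH, 'src'))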

In [6]:
import modeling

## Reading in Data

In [7]:
modeling.DF

| Unnamed: 0 | Rnd | Pick | Team | Player | First Year | Age Draft | G | GS | Tgt | WR Tgt Share | ... | Projected Rec Share | Projected Rec | Projected Yds Share | Projected Yds | Projected TD Share | Projected TD | Rec Pts First Season | Rec Pts/G First Season | Rec Pts Second Season | Rec Pts/G Second Season |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1 | 3 | CLE | B.Edwards | 2005 | 22 | 10.0 | 7.0 | 59.0 | 0.226923 | ... | 0.425007 | 61.201005 | 0.498598 | 981.739542 | 0.222222 | 2.0 | 69.2 | 6.920000 | 124.4 | 7.775000 |
| 1 | 1 | 7 | MIN | T.Williamson | 2005 | 22 | 14.0 | 3.0 | 52.0 | 0.180556 | ... | 0.484076 | 76.000000 | 0.483380 | 1047.000000 | 0.307692 | 4.0 | 49.2 | 3.514286 | 45.5 | 3.250000 |
| 2 | 1 | 10 | DET | M.Williams | 2005 | 21 | 14.0 | 4.0 | 57.0 | 0.256757 | ... | 0.381579 | 43.500000 | 0.257708 | 374.707921 | -0.444444 | -4.0 | 41.0 | 2.928571 | 15.9 | 1.987500 |
| 3 | 1 | 21 | JAX | M.Jones | 2005 | 22 | 16.0 | 1.0 | 69.0 | 0.206587 | ... | 0.582418 | 106.000000 | 0.563735 | 1455.000000 | 0.611111 | 11.0 | 73.2 | 4.575000 | 88.3 | 6.307143 |
| 4 | 1 | 22 | BAL | M.Clayton | 2005 | 23 | 14.0 | 10.0 | 87.0 | 0.388393 | ... | 0.338462 | 44.000000 | 0.305052 | 471.000000 | 0.400000 | 2.0 | 59.1 | 4.221429 | 123.9 | 7.743750 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 125 | 2 | 57 | PHI | J.Arcega-Whiteside | 2019 | 22 | 16.0 | 5.0 | 22.0 | 0.126437 | ... | 0.485149 | 49.000000 | 0.450466 | 532.000000 | 0.400000 | 4.0 | 22.9 | 1.431250 | NaN | NaN |
| 126 | 2 | 64 | SEA | D.Metcalf | 2019 | 21 | 16.0 | 15.0 | 100.0 | 0.375940 | ... | 0.424419 | 73.000000 | 0.457451 | 1145.000000 | 0.444444 | 8.0 | 132.0 | 8.250000 | NaN | NaN |
| 127 | 3 | 66 | PIT | D.Johnson | 2019 | 23 | 16.0 | 12.0 | 92.0 | 0.380165 | ... | 0.406897 | 59.000000 | 0.345704 | 680.000000 | 0.454545 | 5.0 | 98.0 | 6.125000 | NaN | NaN |
| 128 | 3 | 76 | WAS | T.McLaurin | 2019 | 23 | 14.0 | 14.0 | 93.0 | 0.505435 | ... | 0.508772 | 58.000000 | 0.620108 | 919.000000 | 0.875000 | 7.0 | 133.9 | 9.564286 | NaN | NaN |


## Splitting the Data into Modeling and Prediction Data
Some players (those who had not yet played a second season as of 2020) have no target value to train on. Because these players can't be used to fit or assess the models being built, they aren't of use for this analysis.
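
As a rough illustration of this split, the sketch below separates rows with and without a second-season target. The toy values are made up, and treating `Rec Pts/G Second Season` as the target column is my assumption; the actual split lives in `modeling` and is not shown here.

```python
import numpy as np
import pandas as pd

# Assumed target column; the notebook does not state which column is the target.
TARGET = 'Rec Pts/G Second Season'

# Toy frame with made-up values, for illustration only.
df = pd.DataFrame({
    'Player': ['A', 'B', 'C', 'D'],
    'First Year': [2017, 2018, 2019, 2019],
    TARGET: [7.1, 4.9, np.nan, np.nan],
})

# Rows with a known second-season target can be used to train and evaluate.
has_target = df[TARGET].notna()
df_model = df[has_target].reset_index(drop=True)

# Rows without a target can only be predicted on, not scored.
df_predict = df[~has_target].reset_index(drop=True)

print(len(df_model), len(df_predict))
```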

### Modeling Dataframe
Players drafted before 2019

In [8]:
modeling.DF_MODEL

| Unnamed: 0 | Rnd | Pick | Team | Player | First Year | Age Draft | G | GS | Tgt | WR Tgt Share | ... | Projected Rec Share | Projected Rec | Projected Yds Share | Projected Yds | Projected TD Share | Projected TD | Rec Pts First Season | Rec Pts/G First Season | Rec Pts Second Season | Rec Pts/G Second Season |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1 | 3 | CLE | B.Edwards | 2005 | 22 | 10.0 | 7.0 | 59.0 | 0.226923 | ... | 0.425007 | 61.201005 | 0.498598 | 981.739542 | 0.222222 | 2.000000 | 69.2 | 6.920000 | 124.4 | 7.775000 |
| 1 | 1 | 7 | MIN | T.Williamson | 2005 | 22 | 14.0 | 3.0 | 52.0 | 0.180556 | ... | 0.484076 | 76.000000 | 0.483380 | 1047.000000 | 0.307692 | 4.000000 | 49.2 | 3.514286 | 45.5 | 3.250000 |
| 2 | 1 | 10 | DET | M.Williams | 2005 | 21 | 14.0 | 4.0 | 57.0 | 0.256757 | ... | 0.381579 | 43.500000 | 0.257708 | 374.707921 | -0.444444 | -4.000000 | 41.0 | 2.928571 | 15.9 | 1.987500 |
| 3 | 1 | 21 | JAX | M.Jones | 2005 | 22 | 16.0 | 1.0 | 69.0 | 0.206587 | ... | 0.582418 | 106.000000 | 0.563735 | 1455.000000 | 0.611111 | 11.000000 | 73.2 | 4.575000 | 88.3 | 6.307143 |
| 4 | 1 | 22 | BAL | M.Clayton | 2005 | 23 | 14.0 | 10.0 | 87.0 | 0.388393 | ... | 0.338462 | 44.000000 | 0.305052 | 471.000000 | 0.400000 | 2.000000 | 59.1 | 4.221429 | 123.9 | 7.743750 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 115 | 2 | 51 | CHI | A.Miller | 2018 | 23 | 15.0 | 4.0 | 54.0 | 0.192171 | ... | 0.544404 | 98.537037 | 0.495570 | 1036.732673 | 0.533333 | 8.000000 | 84.3 | 5.620000 | 77.6 | 4.850000 |
| 116 | 2 | 60 | PIT | J.Washington | 2018 | 22 | 14.0 | 6.0 | 38.0 | 0.094763 | ... | 0.550607 | 136.000000 | 0.532306 | 1623.000000 | 0.695652 | 16.000000 | 27.7 | 1.978571 | 91.5 | 6.100000 |
| 117 | 2 | 61 | JAX | D.Chark | 2018 | 21 | 11.0 | 0.0 | 32.0 | 0.109589 | ... | 0.208546 | 34.618557 | 0.291061 | 596.675743 | 0.148148 | 1.333333 | 17.4 | 1.581818 | 148.8 | 9.920000 |
| 118 | 3 | 81 | DAL | M.Gallup | 2018 | 22 | 16.0 | 8.0 | 68.0 | 0.311927 | ... | 0.787760 | 107.923077 | 0.795709 | 1346.339575 | 0.677778 | 6.100000 | 62.7 | 3.918750 | 146.7 | 10.478571 |


## Features for Each Type of Model

### Standard Features

In [14]:
modeling.REG_FEATURES

['Tgt', 'Rec', 'Catch Rate', 'Yds', 'Y/R', 'TD', 'Y/Tgt', 'R/G', 'Y/G']

### Advanced Features

In [15]:
modeling.ADV_FEATURES

['DYAR', 'YAR', 'DVOA', 'VOA', 'EYds', 'EYds/G']

## Simulations to Create Data
Now that we have data for training and evaluating the linear regression models, I am going to train and evaluate 1000 models each for the standard and advanced statistics, scoring every model by its $R^2$.

In [45]:
reg_results, adv_results = modeling.run_simulations(1000)
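
The body of `modeling.run_simulations` is not shown in this notebook, so the following is only a guess at its general shape: on each iteration, draw one random train/test split, fit a linear regression on each feature set using that same split, and record both $R^2$ scores. The function name `run_simulations_sketch`, the 25% test fraction, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split


def run_simulations_sketch(X_reg, X_adv, y, n_sims=1000, test_size=0.25, seed=0):
    """Hypothetical stand-in for modeling.run_simulations (not the real code)."""
    rng = np.random.RandomState(seed)
    idx = np.arange(len(y))
    reg_scores, adv_scores = [], []
    for _ in range(n_sims):
        # One random split per iteration, shared by both models so scores are paired.
        train, test = train_test_split(idx, test_size=test_size, random_state=rng)
        for X, scores in ((X_reg, reg_scores), (X_adv, adv_scores)):
            model = LinearRegression().fit(X[train], y[train])
            scores.append(r2_score(y[test], model.predict(X[test])))
    return np.array(reg_scores), np.array(adv_scores)


# Synthetic data for illustration only (not the receiver dataset).
data_rng = np.random.RandomState(1)
X_reg = data_rng.normal(size=(120, 3))
X_adv = data_rng.normal(size=(120, 2))
y = X_reg @ np.array([1.0, 0.5, -0.5]) + X_adv[:, 0] + data_rng.normal(size=120)

reg_scores, adv_scores = run_simulations_sketch(X_reg, X_adv, y, n_sims=50)
print(reg_scores.shape, adv_scores.shape)
```

Sharing the split between the two models is what makes a paired comparison of the scores meaningful.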

## Hypothesis Testing
Now that the $R^2$ scores have been collected for the linear regressions on the regular statistics and on the advanced statistics, a paired t-test can be performed at a significance level of .05. The test is paired because, in each iteration, an advanced model and a regular model are trained and evaluated on the same subsets of data.

**Null Hypothesis**: There is no difference between the mean $R^2$ scores of the advanced models and the regular models

**Alternative Hypothesis**: There is a difference between the mean $R^2$ scores of the advanced models and the regular models

$$H_0 : \mu_{reg} = \mu_{adv}$$

$$H_A : \mu_{reg} \neq \mu_{adv}$$
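
For reference, the paired test works on the per-iteration differences in $R^2$. Writing $d_i$ for the difference in iteration $i$ (taken here as advanced minus regular, to match the argument order of the `ttest_rel` call) and $n$ for the number of iterations, the standard paired t-statistic is:

$$d_i = R^2_{adv,i} - R^2_{reg,i}, \qquad \bar{d} = \frac{1}{n}\sum_{i=1}^{n} d_i, \qquad s_d = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}\left(d_i - \bar{d}\right)^2}$$

$$t = \frac{\bar{d}}{s_d / \sqrt{n}}$$

Under $H_0$ this statistic follows a t-distribution with $n - 1$ degrees of freedom, and $H_0 : \mu_{reg} = \mu_{adv}$ is equivalent to the mean difference being zero.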

In [52]:
# Paired t-test on the per-iteration R^2 scores; the sign of the statistic follows adv - reg
results = stats.ttest_rel(adv_results, reg_results)

In [53]:
results

Ttest_relResult(statistic=-19.323737852126914, pvalue=6.2202948986632634e-71)
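
To make the sign convention concrete, here is a minimal sketch on synthetic paired scores (not the notebook's results) showing that `stats.ttest_rel(a, b)` matches a hand-computed statistic on the differences `a - b`, and also matches a one-sample t-test of those differences against zero. The variable names and the synthetic distributions are assumptions for illustration.

```python
import numpy as np
from scipy import stats

# Synthetic paired scores for illustration only (not the notebook's results).
rng = np.random.RandomState(0)
reg_scores = rng.normal(loc=0.30, scale=0.10, size=200)
adv_scores = reg_scores - rng.normal(loc=0.02, scale=0.05, size=200)

# Paired t-test with the same argument order as the notebook: first minus second.
paired = stats.ttest_rel(adv_scores, reg_scores)

# Same statistic computed by hand from the per-pair differences.
d = adv_scores - reg_scores
t_manual = d.mean() / (d.std(ddof=1) / np.sqrt(len(d)))

# Equivalent one-sample test of the differences against zero.
one_sample = stats.ttest_1samp(d, 0.0)

print(paired.statistic, t_manual, one_sample.statistic)
```

A negative statistic therefore means the first argument (the advanced scores) has the lower mean.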

## Results
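The results of this paired t-test show that we can *reject* the null hypothesis, as the p-value (6.22 E-71) is far below .05. Because the test was run as `ttest_rel(adv_results, reg_results)`, the negative t-statistic (-19.324) means that the advanced models have the lower mean $R^2$, i.e. they perform worse than the regular models. Note that the t-statistic reflects how reliably this difference shows up across the 1000 paired runs rather than how large the gap in $R^2$ is. In conclusion, on this dataset, traditional statistics are better for building predictive models of the second-year jump in production of highly drafted receivers.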
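
Because a t-statistic grows with the number of simulations, the size of the gap is easier to read from the mean paired difference in $R^2$ and its confidence interval. The sketch below shows one way to compute both. It uses synthetic scores rather than the notebook's `adv_results` and `reg_results`, so the printed numbers say nothing about the real models.

```python
import numpy as np
from scipy import stats

# Synthetic paired R^2 scores for illustration only (not the notebook's results).
rng = np.random.RandomState(0)
reg_scores = rng.normal(loc=0.30, scale=0.10, size=1000)
adv_scores = reg_scores - rng.normal(loc=0.02, scale=0.03, size=1000)

# Per-iteration difference, advanced minus regular.
d = adv_scores - reg_scores
mean_diff = d.mean()

# Standard error of the mean difference (scipy's sem uses ddof=1 by default).
sem = stats.sem(d)

# 95% confidence interval for the mean difference from the t-distribution.
ci_low, ci_high = stats.t.interval(0.95, len(d) - 1, loc=mean_diff, scale=sem)

print(f"mean difference in R^2: {mean_diff:.4f}")
print(f"95% CI: ({ci_low:.4f}, {ci_high:.4f})")
```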