<div style="text-align: right"> Tommy Evans-Barton </div>
<div style="text-align: right"> WR Year 2 Jumps </div>

# Analysis and Modeling Notebook

The purpose of this notebook is to develop the model used to predict second-year production for receivers from their first-year statistics. It also serves as a preliminary 'final notebook' ahead of the final presentation of this project's findings.

In [None]:
import os
import sys
import pandas as pd
import numpy as np
import json
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline

In [None]:
%load_ext autoreload
%autoreload 2

In [None]:
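# 'PWD' is not set in every environment (e.g. Windows), so fall back to the current working directory
TOP_PATH = os.environ.get('PWD', os.getcwd())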

## Reading in Data

In [None]:
df = pd.read_csv(os.path.join(TOP_PATH, 'data', 'final', 'FINAL_DATA.csv'))
df

In [None]:
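# Receivers who entered the league before 2019 are used to build the model
rec_model = df[df['First Year'] < 2019].reset_index(drop=True)
# The 2019 rookies are the players whose second season is to be predicted
rec_prediction = df[df['First Year'] == 2019].reset_index(drop=True)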

In [None]:
rec_model

In [None]:
rec_prediction

In [None]:
rec_model.columns

In [None]:
# Hold out 20% of the pre-2019 receivers as a test set
X_temp, X_test, y_temp, y_test = train_test_split(
    rec_model.drop(['Rec Pts Second Season', 'Rec Pts Jump'], axis=1),
    rec_model['Rec Pts Second Season'],
    test_size=0.2,
    random_state=1,
)

In [None]:
# Split the remainder again: 20% of it (16% of the full set) becomes the validation set
X_train, X_valid, y_train, y_valid = train_test_split(X_temp, y_temp, test_size=0.2, random_state=1)

In [None]:
X_train = X_train.reset_index(drop = True)
X_valid = X_valid.reset_index(drop = True)
X_test = X_test.reset_index(drop = True)
y_train = y_train.reset_index(drop = True)
y_valid = y_valid.reset_index(drop = True)
y_test = y_test.reset_index(drop = True)

In [None]:
cat_feat = ['Rnd']

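cat_transformer = Pipeline(steps=[
    # one-hot encode the categorical columns; a round not seen during fit is encoded
    # as all zeros instead of raising an error at prediction time
    ('onehot', OneHotEncoder(categories='auto', handle_unknown='ignore'))
])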

reg_num_feat = ['Pick', 'First Year', 'Age Draft', 'G', 'GS',
       'Tgt', 'Rec', 'Ctch%', 'Yds', 'Y/R', 'TD', '1D', 'Lng', 'Y/Tgt', 'R/G',
       'Y/G', 'Rec Pts First Season']
adv_num_feat = ['DYAR', 'YAR', 'DVOA', 'VOA', 'EYds', 'DPI Pens', 'DPI Yds']
num_feat = reg_num_feat + adv_num_feat

num_transformer = Pipeline(steps=[
    ('scaler', StandardScaler())   # z-scale
])

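# Note: only the advanced stats (adv_num_feat) are scaled and passed to the model here;
# num_feat is defined above but not used yet
preproc = ColumnTransformer(transformers=[
    ('num', num_transformer, adv_num_feat),
    ('cat', cat_transformer, cat_feat),
])

pl = Pipeline(steps=[('preprocessor', preproc), ('regressor', LinearRegression())])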

In [None]:
pl.fit(X_train[adv_num_feat + ['Rnd']], y_train)

In [None]:
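# One-hot encoding expands 'Rnd' into one column per round, so the input column names
# no longer line up with coef_; take the names from the fitted preprocessor instead
# (get_feature_names_out requires scikit-learn >= 1.0)
pd.DataFrame({
    'feature': pl.named_steps['preprocessor'].get_feature_names_out(),
    'coefficient': pl.named_steps['regressor'].coef_,
})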

In [None]:
# Predict on the same columns the pipeline was fitted on
valid_pred = pl.predict(X_valid[adv_num_feat + cat_feat])

In [None]:
valid_pred

In [None]:
temp = X_valid.copy()
temp['Actual'] = y_valid
temp['Prediction'] = valid_pred

In [None]:
np.sqrt(np.mean((valid_pred - y_valid)**2))

In [None]:
temp

Thoughts:

- One-hot encode draft round (`Rnd`)
- Use pick as a raw numeric feature
- Use year as a raw numeric feature (?)
- Use age at draft as a raw numeric feature
- Use each stat as a raw numeric feature
- Might need more data

Possible additional features:

- Target share
- Yard share
- Number of receivers ahead of them on their team
- Number of available targets: targets that left the team minus targets that came into the team (?)
- Receivers drafted by the team in the first two days of the draft
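
A few of the features listed above could be computed along these lines if team-level totals were available. This is only a sketch: the columns `Team Tgt`, `Team Yds`, `Tgt Departed` and `Tgt Arrived` are hypothetical, are not in `FINAL_DATA.csv`, and would need to be sourced separately. The numbers are made up.

```python
import pandas as pd

# Toy rows; every team-level column here is a hypothetical placeholder
shares = pd.DataFrame({
    'Player': ['A', 'B', 'C'],
    'Tgt': [100, 60, 30],
    'Yds': [900, 500, 200],
    'Team Tgt': [500, 600, 550],       # hypothetical: team pass targets that season
    'Team Yds': [4000, 4500, 3800],    # hypothetical: team receiving yards that season
    'Tgt Departed': [120, 40, 200],    # hypothetical: targets of players who left the team
    'Tgt Arrived': [40, 90, 50],       # hypothetical: prior-year targets of incoming players
})

# Share of the team's targets and yards that went to the receiver
shares['Tgt Share'] = shares['Tgt'] / shares['Team Tgt']
shares['Yds Share'] = shares['Yds'] / shares['Team Yds']

# Net targets opened up on the roster going into the second season
shares['Available Tgt'] = shares['Tgt Departed'] - shares['Tgt Arrived']

print(shares[['Player', 'Tgt Share', 'Yds Share', 'Available Tgt']])
```

A negative `Available Tgt` would mean more target volume arrived than left, which is one way the "(?)" in the list could be resolved.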