<div style="text-align: right"> Tommy Evans-Barton </div>
<div style="text-align: right"> WR Year 2 Jumps </div>

# Data Cleaning Notebook

This notebook explains the logic and steps of the data cleaning process that prepares the data for the final analysis. Most of this work is academic: it documents how the raw data becomes the final dataset, and it does not need to be understood in order to follow the analysis.

In [1]:
import os
import sys
import pandas as pd
import numpy as np
import json
import matplotlib.pyplot as plt

In [2]:
%load_ext autoreload
%autoreload 2

In [3]:
TOP_PATH = os.environ['PWD']

In [4]:
sys.path.append(TOP_PATH + '/src')
sys.path.append(TOP_PATH + '/src/viz')

In [5]:
import processing

In [6]:
receivers = pd.read_csv(TOP_PATH + '/data/raw/RECEIVERS.csv')
rec_stats = pd.read_csv(TOP_PATH + '/data/raw/REC_STATS.csv')
adv_stats = pd.read_csv(TOP_PATH + '/data/raw/ADV_REC_STATS.csv')

## Cleaning Receiver Data
- **Removed Position Column**: This column was not necessary as all players were wide receivers
- **Altered Player Column**: Names were converted to the abbreviated format used by the advanced stats name column (first initial plus last name, e.g. `B.Edwards`), since that format is the less granular of the two
- **Created a First Year and Second Year Column**: Removed the generic YEAR column and replaced it with a First Year and Second Year column for merging in receiver stats later
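The steps above can be sketched roughly as follows. This is an illustrative sketch and not the actual `processing.clean_receivers()` implementation: the helper names `abbreviate_name` and `clean_receivers_sketch` are hypothetical, and the name rule (first initial plus last whitespace-separated token) is an assumption inferred from the output shown below.

```python
import pandas as pd


def abbreviate_name(full_name: str) -> str:
    """Turn 'Braylon Edwards' into 'B.Edwards' (first initial + last token).

    Assumed rule, inferred from the cleaned output in this notebook.
    """
    parts = full_name.split()
    return f"{parts[0][0]}.{parts[-1]}"


def clean_receivers_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in for the receiver cleaning step."""
    out = df.drop(columns=["Pos"])           # every player is a WR
    out["Player"] = out["Player"].map(abbreviate_name)
    out["First Year"] = out["YEAR"]          # draft year = first season
    out["Second Year"] = out["YEAR"] + 1     # the season being predicted
    return out.drop(columns=["YEAR"])


# Two rows copied from the raw receivers table
sample = pd.DataFrame({
    "Rnd": [1, 2],
    "Pick": [3, 64],
    "Tm": ["CLE", "SEA"],
    "Player": ["Braylon Edwards", "D.K. Metcalf"],
    "Pos": ["WR", "WR"],
    "Age": [22, 21],
    "YEAR": [2005, 2019],
})
cleaned = clean_receivers_sketch(sample)
print(cleaned)
```

Taking only the first initial and the last token means different players can collapse to the same abbreviated name, which is one reason the later merges also key on team and year.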

In [7]:
receivers

Unnamed: 0,Rnd,Pick,Tm,Player,Pos,Age,YEAR
0,1,3,CLE,Braylon Edwards,WR,22,2005
1,1,7,MIN,Troy Williamson,WR,22,2005
2,1,10,DET,Mike Williams,WR,21,2005
3,1,21,JAX,Matt Jones,WR,22,2005
4,1,22,BAL,Mark Clayton,WR,23,2005
...,...,...,...,...,...,...,...
188,2,64,SEA,D.K. Metcalf,WR,21,2019
189,3,66,PIT,Diontae Johnson,WR,23,2019
190,3,67,SFO,Jalen Hurd,WR,23,2019
191,3,76,WAS,Terry McLaurin,WR,23,2019


In [8]:
processing.clean_receivers()

Unnamed: 0,Rnd,Pick,Tm,Player,Age,First Year,Second Year
0,1,3,CLE,B.Edwards,22,2005,2006
1,1,7,MIN,T.Williamson,22,2005,2006
2,1,10,DET,M.Williams,21,2005,2006
3,1,21,JAX,M.Jones,22,2005,2006
4,1,22,BAL,M.Clayton,23,2005,2006
...,...,...,...,...,...,...,...
188,2,64,SEA,D.Metcalf,21,2019,2020
189,3,66,PIT,D.Johnson,23,2019,2020
190,3,67,SFO,J.Hurd,23,2019,2020
191,3,76,WAS,T.McLaurin,23,2019,2020


## Cleaning Basic Statistics Data

- **Removed Position Column**: This column was not necessary as all players were wide receivers
- **Removed Fumbles Column**: As found in the EDA, fumbles were heavily concentrated among players who also play special teams, so this column was removed in order to isolate receiving talent
- **Altered Player Column**: Names were converted to the abbreviated format used by the advanced stats name column (first initial plus last name), since that format is the less granular of the two
- **Altered Catch Rate Column**: Turned this column into a numeric column for analysis
- **Created a Rec Pts Column**: Created a column to account for receiver production based only on yards and touchdowns, modeled after fantasy football scoring: 6 points for a touchdown and 1 point for every 10 receiving yards
- **Created a Rec Pts per Game Column**: As above, but divided by games played. Because injuries are unpredictable, this per-game column may be a better target for the model than the season total

***Note***: Minimum target requirements for certain 'per target' stats were considered, but rejected: the entries they would remove are disproportionately the first and second year players under investigation, so the filter would greatly affect the data.
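As a worked example of the scoring rule, the sketch below recomputes Rec Pts and Rec Pts/G for two rows from the table. The function name `add_rec_pts` is hypothetical and this is not the code in `processing.clean_stats()`; it only applies the formula described above (yards / 10 + 6 × touchdowns) and the percent-string conversion for catch rate.

```python
import pandas as pd


def add_rec_pts(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: numeric catch rate plus the two Rec Pts columns."""
    out = df.copy()
    # '62.4%' -> 62.4
    out["Ctch%"] = out["Ctch%"].str.rstrip("%").astype(float)
    # 1 point per 10 receiving yards, 6 points per receiving touchdown
    out["Rec Pts"] = out["Yds"] / 10 + 6 * out["TD"]
    out["Rec Pts/G"] = out["Rec Pts"] / out["G"]
    return out


# Values copied from the 2005 rows shown in this notebook
# (names are already in cleaned form; name cleaning is omitted here)
sample = pd.DataFrame({
    "Player": ["L.Fitzgerald", "A.Boldin"],
    "G": [16, 14],
    "Ctch%": ["62.4%", "59.6%"],
    "Yds": [1409, 1402],
    "TD": [10, 7],
})
scored = add_rec_pts(sample)
print(scored)
```

For L.Fitzgerald this gives 1409 / 10 + 6 × 10 = 200.9 points, or about 12.56 per game over 16 games, which agrees with the cleaned table.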

In [14]:
rec_stats

Unnamed: 0,Player,Tm,Age,Pos,G,GS,Tgt,Rec,Ctch%,Yds,Y/R,TD,1D,Lng,Y/Tgt,R/G,Y/G,Fmb,YEAR
0,Larry Fitzgerald*,ARI,22,WR,16,16,165.0,103,62.4%,1409,13.7,10,68,47,8.5,6.4,88.1,0,2005
1,Steve Smith*+,CAR,26,WR,16,16,150.0,103,68.7%,1563,15.2,12,72,80,10.4,6.4,97.7,2,2005
2,Anquan Boldin,ARI,25,WR,14,14,171.0,102,59.6%,1402,13.7,7,69,54,8.2,7.3,100.1,2,2005
3,Torry Holt*,STL,29,WR,14,14,163.0,102,62.6%,1331,13.0,9,63,44,8.2,7.3,95.1,2,2005
4,Chad Johnson *+,CIN,27,WR,16,16,155.0,97,62.6%,1432,14.8,9,74,70,9.2,6.1,89.5,1,2005
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4855,Jordan Thomas,HOU,23,,5,2,3.0,1,33.3%,8,8.0,0,0,8,2.7,0.2,1.6,0,2019
4856,Eric Tomlinson,3TM,27,,8,3,1.0,1,100.0%,1,1.0,0,0,1,1.0,0.1,0.1,0,2019
4857,John Ursua,SEA,25,,3,0,1.0,1,100.0%,11,11.0,0,1,11,11.0,0.3,3.7,0,2019
4858,Dwayne Washington,NOR,25,,16,0,1.0,1,100.0%,6,6.0,0,0,6,6.0,0.1,0.4,0,2019


In [15]:
processing.clean_stats()

Unnamed: 0,Player,Tm,Age,G,GS,Tgt,Rec,Ctch%,Yds,Y/R,TD,1D,Lng,Y/Tgt,R/G,Y/G,YEAR,Rec Pts,Rec Pts/G
0,L.Fitzgerald,ARI,22,16,16,165.0,103,62.4,1409,13.7,10,68,47,8.5,6.4,88.1,2005,200.9,12.556250
1,S.Smith,CAR,26,16,16,150.0,103,68.7,1563,15.2,12,72,80,10.4,6.4,97.7,2005,228.3,14.268750
2,A.Boldin,ARI,25,14,14,171.0,102,59.6,1402,13.7,7,69,54,8.2,7.3,100.1,2005,182.2,13.014286
3,T.Holt,LAR,29,14,14,163.0,102,62.6,1331,13.0,9,63,44,8.2,7.3,95.1,2005,187.1,13.364286
4,C.Johnson,CIN,27,16,16,155.0,97,62.6,1432,14.8,9,74,70,9.2,6.1,89.5,2005,197.2,12.325000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4848,J.Thomas,HOU,23,5,2,3.0,1,33.3,8,8.0,0,0,8,2.7,0.2,1.6,2019,0.8,0.160000
4849,E.Tomlinson,,27,8,3,1.0,1,100.0,1,1.0,0,0,1,1.0,0.1,0.1,2019,0.1,0.012500
4850,J.Ursua,SEA,25,3,0,1.0,1,100.0,11,11.0,0,1,11,11.0,0.3,3.7,2019,1.1,0.366667
4851,D.Washington,NOR,25,16,0,1.0,1,100.0,6,6.0,0,0,6,6.0,0.1,0.4,2019,0.6,0.037500


## Cleaning Advanced Statistics Data
- **Altered Player Column**: Made the format consistent with the first two datasets
- **Altered Team Column**: Mapped some alternative team encodings to the more traditional versions (e.g. `GB` to `GNB`), and replaced 2TM designations with null entries for simpler analysis
- **Altered DVOA and VOA Columns**: Reformatted these entries into numeric values
- **Split DPI Column**: Split the DPI column into DPI Penalties and DPI Yards in order to have this information in numeric form
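The transformations above could look something like the following. This is a sketch under assumptions and not the contents of `processing.clean_adv_stats()`: `TEAM_MAP` holds only the two mappings visible in the output in this notebook, and the names `TEAM_MAP`, `MULTI_TEAM`, and `clean_adv_sketch` are hypothetical.

```python
import pandas as pd

# Only the two mappings that can be read off this notebook's output;
# the real mapping presumably covers more teams.
TEAM_MAP = {"GB": "GNB", "KC": "KAN"}
MULTI_TEAM = ["2TM", "3TM"]  # multi-team season markers (assumed list)


def clean_adv_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in for the advanced stats cleaning step."""
    out = df.copy()
    out["Team"] = out["Team"].replace(TEAM_MAP)
    # Keep the team unless it is a multi-team marker; those become null
    out["Team"] = out["Team"].where(~out["Team"].isin(MULTI_TEAM))
    # '17.1%' -> 17.1
    for col in ["DVOA", "VOA"]:
        out[col] = out[col].str.rstrip("%").astype(float)
    # '6/175' -> 6 penalties, 175 yards
    dpi = out["DPI"].str.split("/", expand=True).astype(int)
    out["DPI Pens"] = dpi[0]
    out["DPI Yds"] = dpi[1]
    return out.drop(columns=["DPI"])


# Two rows copied from the raw advanced stats table
sample = pd.DataFrame({
    "Player": ["D.Driver", "Z.Jones"],
    "Team": ["GB", "2TM"],
    "DVOA": ["17.1%", "-38.6%"],
    "VOA": ["14.4%", "-39.0%"],
    "DPI": ["6/175", "1/9"],
    "YEAR": [2005, 2019],
})
adv_clean = clean_adv_sketch(sample)
print(adv_clean)
```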

In [16]:
adv_stats

Unnamed: 0,Player,Team,DYAR,YAR,DVOA,VOA,EYds,DPI,YEAR
0,S.Smith,CAR,502,508,29.0%,29.6%,1720,2/44,2005
1,C.Johnson,CIN,415,412,19.9%,19.7%,1648,3/58,2005
2,S.Moss,WAS,402,390,25.6%,24.4%,1450,4/51,2005
3,D.Driver,GB,355,323,17.1%,14.4%,1496,6/175,2005
4,E.Kennison,KC,343,343,27.6%,27.7%,1199,3/67,2005
...,...,...,...,...,...,...,...,...,...
2272,T.Benjamin,LAC,-84,-87,-78.9%,-81.5%,-21,0/0,2019
2273,Z.Jones,2TM,-91,-92,-38.6%,-39.0%,151,1/9,2019
2274,P.Campbell,IND,-104,-88,-73.4%,-64.4%,-14,0/0,2019
2275,K.Johnson,ARI,-105,-103,-45.6%,-44.8%,105,0/0,2019


In [17]:
processing.clean_adv_stats()

Unnamed: 0,Player,Team,DYAR,YAR,DVOA,VOA,EYds,YEAR,DPI Pens,DPI Yds
0,S.Smith,CAR,502,508,29.0,29.6,1720,2005,2,44
1,C.Johnson,CIN,415,412,19.9,19.7,1648,2005,3,58
2,S.Moss,WAS,402,390,25.6,24.4,1450,2005,4,51
3,D.Driver,GNB,355,323,17.1,14.4,1496,2005,6,175
4,E.Kennison,KAN,343,343,27.6,27.7,1199,2005,3,67
...,...,...,...,...,...,...,...,...,...,...
2266,T.Benjamin,LAC,-84,-87,-78.9,-81.5,-21,2019,0,0
2267,Z.Jones,,-91,-92,-38.6,-39.0,151,2019,1,9
2268,P.Campbell,IND,-104,-88,-73.4,-64.4,-14,2019,0,0
2269,K.Johnson,ARI,-105,-103,-45.6,-44.8,105,2019,0,0


## Combining the Data
- **Merged the Receivers with the Receiving Stats Data for First Year**: Merged (left merge) on Player, Team, and First Year in order to get each player's statistics for their first year, which will act as half of the features for predicting second year production
- **Merged the Receivers with a Subset of the Receiving Stats Data for Second Year**: Merged (left merge) on Player, Team, and Second Year with the Rec Pts and Rec Pts/G columns, in order to add the columns that will act as the target for the predictive model
- **Dropped Entries Where Receivers had Insufficient Data**: Receivers who lacked the stats needed to compute Rec Pts for their second year, even though that season had already been played (i.e. players not drafted in 2019), were removed, as they would have no target to use for the model
- **Merged in the Advanced Receiving Stats for First Year**: Merged in the advanced receiver stats for each player's first season, which supply the other half of the features for the model. Seven players did not have advanced stats for their first season, but they are being kept in the dataset for now
- **Dropped Redundant Columns and Reordered the Remaining Ones**: Got rid of duplicate columns and reordered them to make the dataset easier to read
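The two left merges and the drop rule can be illustrated on a toy example. This is a minimal sketch, not the code in `processing.merge_data()`: the player `X.Nostats` and his numbers are made up to show a dropped row, `LATEST_DRAFT` is a hypothetical constant, and only the Rec Pts column is carried through to keep the example short.

```python
import pandas as pd

receivers = pd.DataFrame({
    "Player": ["B.Edwards", "D.Metcalf", "X.Nostats"],  # X.Nostats is fictional
    "Tm": ["CLE", "SEA", "DET"],
    "First Year": [2005, 2019, 2005],
    "Second Year": [2006, 2020, 2006],
})
stats = pd.DataFrame({
    "Player": ["B.Edwards", "B.Edwards", "D.Metcalf", "X.Nostats"],
    "Tm": ["CLE", "CLE", "SEA", "DET"],
    "YEAR": [2005, 2006, 2019, 2005],
    "Rec Pts": [69.2, 124.4, 132.0, 10.0],
})

# First-year stats: left merge so every drafted receiver keeps a row
first = receivers.merge(
    stats.rename(columns={"YEAR": "First Year",
                          "Rec Pts": "Rec Pts First Season"}),
    how="left", on=["Player", "Tm", "First Year"],
)

# Second-year target: same keys, but matched on the Second Year column
both = first.merge(
    stats.rename(columns={"YEAR": "Second Year",
                          "Rec Pts": "Rec Pts Second Season"}),
    how="left", on=["Player", "Tm", "Second Year"],
)

# Drop receivers with no second-year target, except the latest draft
# class, whose second season had not been played yet
LATEST_DRAFT = 2019
keep = both["Rec Pts Second Season"].notna() | (both["First Year"] == LATEST_DRAFT)
merged = both[keep].reset_index(drop=True)
print(merged)
```

Here `B.Edwards` keeps both seasons, `D.Metcalf` is retained with a missing second-season value, and the fictional `X.Nostats` is dropped because his 2006 season has no stats row.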

In [19]:
processing.merge_data()

Unnamed: 0,Rnd,Pick,Team,Player,First Year,Age Draft,G,GS,Tgt,Rec,...,YAR,DVOA,VOA,EYds,DPI Pens,DPI Yds,Rec Pts First Season,Rec Pts/G First Season,Rec Pts Second Season,Rec Pts/G Second Season
0,1,3,CLE,B.Edwards,2005,22,10.0,7.0,59.0,32.0,...,80.0,3.3,4.9,474.0,0.0,0.0,69.2,6.920000,124.4,7.775000
1,1,7,MIN,T.Williamson,2005,22,14.0,3.0,52.0,24.0,...,14.0,-5.4,-9.3,370.0,0.0,0.0,49.2,3.514286,45.5,3.250000
2,1,10,DET,M.Williams,2005,21,14.0,4.0,57.0,29.0,...,-19.0,-16.3,-16.9,337.0,1.0,23.0,41.0,2.928571,15.9,1.987500
3,1,21,JAX,M.Jones,2005,22,16.0,1.0,69.0,36.0,...,42.0,-6.4,-4.7,483.0,0.0,0.0,73.2,4.575000,88.3,6.307143
4,1,22,BAL,M.Clayton,2005,23,14.0,10.0,87.0,44.0,...,-63.0,-23.0,-22.1,446.0,1.0,21.0,59.1,4.221429,123.9,7.743750
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
142,2,62,ARI,A.Isabella,2019,22,15.0,1.0,13.0,9.0,...,41.0,33.4,29.9,144.0,0.0,0.0,24.9,1.660000,,
143,2,64,SEA,D.Metcalf,2019,21,16.0,15.0,100.0,58.0,...,85.0,0.6,-2.0,801.0,1.0,4.0,132.0,8.250000,,
144,3,66,PIT,D.Johnson,2019,23,16.0,12.0,92.0,59.0,...,7.0,-8.9,-11.6,601.0,2.0,43.0,98.0,6.125000,,
145,3,76,WAS,T.McLaurin,2019,23,14.0,14.0,93.0,58.0,...,244.0,18.9,19.8,961.0,3.0,49.0,133.9,9.564286,,
