<div style="text-align: right"> Tommy Evans-Barton </div>
<div style="text-align: right"> WR Year 2 Jumps </div>

# Data Cleaning Notebook

This notebook explains the logic and steps of the data cleaning process that prepares the raw data for the final analysis. Most of this work is academic: it documents how the raw data becomes the final dataset, and it does not need to be understood in order to follow the analysis.

In [None]:
import os
import sys
import pandas as pd
import numpy as np
import json
import matplotlib.pyplot as plt

In [None]:
%load_ext autoreload
%autoreload 2

In [None]:
TOP_PATH = os.environ['PWD']

In [None]:
sys.path.append(TOP_PATH + '/src')
sys.path.append(TOP_PATH + '/src/viz')

In [None]:
import processing

In [None]:
receivers = pd.read_csv(TOP_PATH + '/data/raw/RECEIVERS.csv')
rec_stats = pd.read_csv(TOP_PATH + '/data/raw/REC_STATS.csv')
adv_stats = pd.read_csv(TOP_PATH + '/data/raw/ADV_REC_STATS.csv')

## Cleaning Receiver Data
- **Removed Position Column**: This column was not necessary as all players were wide receivers
- **Altered Player Column**: This column was edited to match the name format of the advanced stats data, since that format is the less granular of the two
- **Created a First Year and Second Year Column**: Removed the generic YEAR column and replaced it with a First Year and Second Year column for merging in receiver stats later
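
The steps above can be sketched roughly as follows. This is an illustration on toy data, not the code in `processing.clean_receivers()`: the column names (`PLAYER`, `POS`, `YEAR`, `FIRST_YEAR`, `SECOND_YEAR`) and the "first initial plus last name" target format are assumptions made for the example.

```python
import pandas as pd

# Toy data; column names and name format are assumptions for illustration only.
toy_receivers = pd.DataFrame({
    "PLAYER": ["John Smith", "Alex Jones"],
    "POS": ["WR", "WR"],
    "YEAR": [2017, 2018],
})

def sketch_clean_receivers(df):
    # Every player is a wide receiver, so the position column carries no information.
    out = df.drop(columns=["POS"])
    # Hypothetical target format: first initial plus last name, e.g. "J.Smith".
    parts = out["PLAYER"].str.split(" ", n=1, expand=True)
    out["PLAYER"] = parts[0].str[0] + "." + parts[1]
    # Replace the generic YEAR column with explicit first and second seasons.
    out["FIRST_YEAR"] = out["YEAR"]
    out["SECOND_YEAR"] = out["YEAR"] + 1
    return out.drop(columns=["YEAR"])

cleaned_receivers_sketch = sketch_clean_receivers(toy_receivers)
```

Keeping both year columns lets the same receiver row be joined once against first-year stats and once against second-year stats.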

In [None]:
receivers

In [None]:
processing.clean_receivers()

## Cleaning Basic Statistics Data

- **Removed Position Column**: This column was not necessary as all players were wide receivers
- **Removed Fumbles Column**: As found in the EDA, fumbles were recorded heavily for players who also played special teams, so this column was removed in order to isolate receiving talent
- **Altered Player Column**: This column was edited to match the name format of the advanced stats data, since that format is the less granular of the two
- **Altered Catch Rate Column**: Turned this column into a numeric column for analysis
- **Created a Rec Pts Column**: Created a column to account for receiver production based only on yards and touchdowns, modeled after fantasy football scoring: 6 points for a touchdown and 1 point for every 10 receiving yards

***Note***: Minimum target requirements for certain 'per target' stats were considered but not applied. The entries that would be filtered out are disproportionately the first and second year players under investigation, so such a threshold would greatly affect the data.
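
As an illustration of the Catch Rate and Rec Pts steps above, here is a minimal sketch on toy data. The column names (`YDS`, `TD`, `CTCH_PCT`, `REC_PTS`) and the percent-string format are assumptions, and fractional points for partial blocks of 10 yards are assumed as well; `processing.clean_stats()` may handle these details differently.

```python
import pandas as pd

# Toy data; column names and the "61.5%" string format are assumptions.
toy_stats = pd.DataFrame({
    "PLAYER": ["J.Smith", "A.Jones"],
    "YDS": [850, 1204],
    "TD": [5, 9],
    "CTCH_PCT": ["61.5%", "70.2%"],
})

def sketch_add_rec_pts(df):
    out = df.copy()
    # Strip the percent sign so the catch rate can be treated as a number.
    out["CTCH_PCT"] = out["CTCH_PCT"].str.rstrip("%").astype(float)
    # 6 points per touchdown plus 1 point per 10 receiving yards
    # (assumed fractional, so 1204 yards is worth 120.4 points).
    out["REC_PTS"] = 6 * out["TD"] + out["YDS"] / 10
    return out

stats_sketch = sketch_add_rec_pts(toy_stats)
```

With this scoring, a season of 850 yards and 5 touchdowns is worth 6 × 5 + 850 / 10 = 115 points.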

In [None]:
rec_stats

In [None]:
processing.clean_stats()

## Cleaning Advanced Statistics Data
- **Altered Player Column**: Made the format consistent with the first two datasets
- **Altered Team Column**: Mapped some alternative team encodings to their more traditional versions, and replaced 2TM designations (players who appeared for two teams) with None entries for simpler analysis
- **Altered DVOA and VOA Columns**: Reformatted these entries into numeric values
- **Split DPI Column**: Split the DPI column into DPI Penalties and DPI Yards in order to have this information in numeric form
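
A rough sketch of these transformations on toy data follows. The raw formats shown (`"12.3%"` for DVOA, `"penalties/yards"` for DPI), the column names, and the `LARM` to `LAR` mapping are assumptions for illustration and are not taken from `processing.clean_adv_stats()`.

```python
import pandas as pd

# Toy data; all formats and names below are assumptions for illustration.
toy_adv = pd.DataFrame({
    "PLAYER": ["J.Smith", "A.Jones"],
    "TEAM": ["LARM", "2TM"],
    "DVOA": ["12.3%", "-4.1%"],
    "DPI": ["3/45", "0/0"],
})

# Hypothetical mapping from an alternative team code to the traditional one.
TEAM_MAP = {"LARM": "LAR"}

def sketch_clean_adv(df):
    out = df.copy()
    out["TEAM"] = out["TEAM"].replace(TEAM_MAP)
    # Blank out the multi-team designation (mask leaves a missing value).
    out["TEAM"] = out["TEAM"].mask(out["TEAM"] == "2TM")
    # Percent strings become plain floats.
    out["DVOA"] = out["DVOA"].str.rstrip("%").astype(float)
    # Split "penalties/yards" into two integer columns.
    dpi = out["DPI"].str.split("/", expand=True)
    out["DPI_PENALTIES"] = dpi[0].astype(int)
    out["DPI_YARDS"] = dpi[1].astype(int)
    return out.drop(columns=["DPI"])

adv_sketch = sketch_clean_adv(toy_adv)
```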

In [None]:
adv_stats

In [None]:
processing.clean_adv_stats()

## Combining the Data
- **Merged the Receivers with the Receiving Stats Data for First Year**: Merged (left merge) on Player, Team, and First Year in order to get a player's statistics for their first year, which will act as half of the features for predicting second year production
- **Merged the Receivers with a Subset of the Receiving Stats Data for Second Year**: Merged (left merge) on Player, Team, and Second Year, bringing in only the Rec Pts column, which will act as the target for the predictive model
- **Dropped Entries Where Receivers had Insufficient Data**: If a receiver had no second year stats from which to compute Rec Pts even though that season had been played (i.e. the player was not drafted in 2019), the entry was removed, as it would have no target to use for the model
- **Created a Rec Pts Jump Column**: A column to chart the leap in Rec Pts from first year to second year
- **Merged in the Advanced Receiving Stats for First Year**: Merged in the advanced receiver stats for each player's first season, which provide the other half of the features for the model. Seven players did not have advanced stats for their first season, but they are being kept in the dataset for now
- **Dropped Redundant Columns and Reordered the Remaining Ones**: Got rid of duplicate columns and reordered them to make the dataset easier to read
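
The first two merges, the drop, and the jump column can be sketched as below. This is a simplified illustration on toy data with assumed column names (`YEAR`, `REC_PTS`, `REC_PTS_Y2`, `REC_PTS_JUMP`); it leaves out the advanced stats merge and the column reordering, and it is not the code in `processing.merge_data()`.

```python
import pandas as pd

# Toy data; column names are assumptions for illustration only.
toy_receivers = pd.DataFrame({
    "PLAYER": ["J.Smith", "A.Jones"],
    "TEAM": ["LAR", "NYG"],
    "FIRST_YEAR": [2017, 2018],
    "SECOND_YEAR": [2018, 2019],
})
toy_stats = pd.DataFrame({
    "PLAYER": ["J.Smith", "J.Smith", "A.Jones"],
    "TEAM": ["LAR", "LAR", "NYG"],
    "YEAR": [2017, 2018, 2018],
    "REC_PTS": [80.0, 115.0, 60.0],
})

def sketch_merge(receivers, stats):
    # First-year stats become model features.
    first = receivers.merge(
        stats.rename(columns={"YEAR": "FIRST_YEAR"}),
        how="left",
        on=["PLAYER", "TEAM", "FIRST_YEAR"],
    )
    # Only Rec Pts is taken from the second year; it is the model target.
    second = stats[["PLAYER", "TEAM", "YEAR", "REC_PTS"]].rename(
        columns={"YEAR": "SECOND_YEAR", "REC_PTS": "REC_PTS_Y2"}
    )
    merged = first.merge(second, how="left", on=["PLAYER", "TEAM", "SECOND_YEAR"])
    # Rows without a second-year target cannot be used by the model.
    merged = merged.dropna(subset=["REC_PTS_Y2"])
    merged["REC_PTS_JUMP"] = merged["REC_PTS_Y2"] - merged["REC_PTS"]
    return merged.reset_index(drop=True)

merged_sketch = sketch_merge(toy_receivers, toy_stats)
```

In the toy data, the second receiver has no 2019 row, so the left merge leaves the target missing and the row is dropped; the remaining receiver shows a jump of 35 points.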

In [None]:
processing.merge_data()