<div style="text-align: right"> Tommy Evans-Barton </div>
<div style="text-align: right"> WR Year 2 Jumps </div>

# EDA of Wide Receiver Data Pre-Cleaning

The purpose of this notebook is to do a technical analysis of the **raw** data that will be used in the main report, along with some justifications for future cleaning approaches. Most of this work will be academic and mainly to perform due diligence, and only useful for deeper dives into the work of this project. Data courtesy of Pro Football Reference and Football Outsiders.

In [None]:
import os
import sys
import pandas as pd
import numpy as np
import json
import matplotlib.pyplot as plt

In [None]:
%load_ext autoreload
%autoreload 2

In [None]:
TOP_PATH = os.environ['PWD']

In [None]:
sys.path.append(TOP_PATH + '/src')
sys.path.append(TOP_PATH + '/src/viz')

In [None]:
import eda_viz_pre

In [None]:
receivers = pd.read_csv(TOP_PATH + '/data/raw/RECEIVERS.csv')
rec_stats = pd.read_csv(TOP_PATH + '/data/raw/REC_STATS.csv')
adv_stats = pd.read_csv(TOP_PATH + '/data/raw/ADV_REC_STATS.csv')

## Receivers Drafted Data
This data is being used in order to isolate the receivers drafted in the first three rounds from 2010 to 2019, as well as identify which years are their first and second in the league.

In [None]:
receivers.head()

### Feature Types
Initially, it seems that the typing of the receivers data is fairly reasonable, as all numeric data are integers and all other data are strings.

In [None]:
receivers.dtypes

### Missingness
There seem to be no missing entries for this dataset, which is in line with initial assumptions.

In [None]:
receivers.isnull().mean()

### Categoric Features

#### Team

This data seems to use a fairly standard 3 letter encoding of the teams, with all teams accounted for, including former franchises the St. Louis Rams and the San Diego Chargers, who will probably be combined into their L.A. counterparts.

In [None]:
receivers['Tm'].nunique()

In [None]:
receivers['Tm'].value_counts()
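
As a rough sketch of how the former franchises could later be folded into their L.A. counterparts, the snippet below remaps team codes with a dictionary. The specific abbreviations (`STL`, `SDG`, `LAR`, `LAC`) and the helper name `merge_relocated_teams` are assumptions for illustration and should be checked against the output of `value_counts()` above.

```python
import pandas as pd

# Assumed codes: verify against receivers['Tm'].value_counts() before use.
RELOCATED_TEAMS = {'STL': 'LAR', 'SDG': 'LAC'}


def merge_relocated_teams(teams: pd.Series) -> pd.Series:
    """Replace former-franchise codes with their current counterparts."""
    return teams.replace(RELOCATED_TEAMS)


# Toy data standing in for receivers['Tm'].
example = pd.Series(['STL', 'LAR', 'SDG', 'GNB'])
merged = merge_relocated_teams(example)
print(merged.tolist())
```

Codes that are not in the mapping pass through unchanged, so the same helper could be applied to any of the team columns.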

### Distribution of Numeric Features

#### Rounds
Over the three rounds, the number of receivers selected increases by round. However, there are more picks in the third round, due to compensatory selections, which will also affect the distribution of receivers by pick in the chart that follows this one.

In [None]:
eda_viz_pre.round_distribution(savefig = True)

#### Picks
When the receivers drafted are binned in groups of 10 picks, there is an odd dip in the middle of each round (around the 11th to 20th pick), with spikes at the beginning and end of rounds. However, while interesting in its own right, this will not have an effect on this analysis.

In [None]:
eda_viz_pre.pick_distribution(savefig = True)

#### Age
The distribution of the ages of the players is as expected: a fairly normal distribution with a peak at 22 years old, a fairly standard age to be finishing college at.

In [None]:
eda_viz_pre.age_distribution(savefig = True)

## Receiver Stats Data
These fairly standard statistics (courtesy of Pro Football Reference) for receivers will be used for the projections of second year jumps. They will need to be trimmed down to only the rookie and second year players for each year. (*Note:* a Glossary of this dataset's features is available in the reference section)

In [None]:
rec_stats.head()
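
One possible way to trim the stats down to rookie and second-year seasons is sketched below. It assumes the drafted-receivers table carries a draft year and the stats table carries a season year; the column names (`Player`, `Year`, `Draft_Year`) and the helper `keep_first_two_seasons` are hypothetical and would need to match the real data.

```python
import pandas as pd


def keep_first_two_seasons(stats, drafted, name_col='Player',
                           season_col='Year', draft_col='Draft_Year'):
    """Keep only seasons played in the draft year or the year after it."""
    # The inner merge drops anyone who is not in the drafted-receivers list.
    merged = stats.merge(drafted[[name_col, draft_col]], on=name_col, how='inner')
    # 0 = rookie season, 1 = second season.
    seasons_since_draft = merged[season_col] - merged[draft_col]
    return merged[seasons_since_draft.isin([0, 1])].reset_index(drop=True)


# Toy data: player A has three seasons, player B was not in the drafted list.
toy_stats = pd.DataFrame({
    'Player': ['A', 'A', 'A', 'B'],
    'Year': [2015, 2016, 2017, 2016],
})
toy_drafted = pd.DataFrame({'Player': ['A'], 'Draft_Year': [2015]})

trimmed = keep_first_two_seasons(toy_stats, toy_drafted)
print(trimmed)
```

Merging on name alone is only safe if names are unique and consistently formatted across the datasets, which the later sections suggest may not hold without cleaning.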

### Feature Types
For the typing of our features, the catch percentage column (`Ctch%`) is a string and will need to be converted to a float, while the rest of the features are in the correct format for analysis. Additionally, on viewing the player name column, some slight cleaning of extra characters will be needed.

In [None]:
rec_stats.dtypes
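
The two cleaning steps described above might look something like the following sketch. It assumes the extra characters in player names are trailing markers such as `*` or `+` and a backslash-delimited ID; the toy rows and the helper names `clean_player_name` and `percent_to_float` are made up for illustration.

```python
import pandas as pd

# Toy rows; the exact junk characters in the real data are an assumption.
toy = pd.DataFrame({
    'Player': ['Calvin Johnson*+\\JohnCa00', 'Julio Jones *', 'Wes Welker'],
    'Ctch%': ['59.8%', '67.0%', None],
})


def clean_player_name(names: pd.Series) -> pd.Series:
    """Drop a trailing backslash-delimited ID, then '*' and '+' markers."""
    return (names.str.replace(r'\\.*$', '', regex=True)
                 .str.replace(r'[*+]', '', regex=True)
                 .str.strip())


def percent_to_float(col: pd.Series) -> pd.Series:
    """Turn strings like '59.8%' into floats; missing values become NaN."""
    return pd.to_numeric(col.str.replace('%', '', regex=False), errors='coerce')


toy['Player'] = clean_player_name(toy['Player'])
toy['Ctch%'] = percent_to_float(toy['Ctch%'])
print(toy)
```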

### Missingness
The only column that has a substantial amount of missing values is the Position column, and since we will later be merging on a list of only receivers, it won't matter. Certain target numbers are also missing, and therefore the corresponding yards per target are missing as well. However, this only applies to three individuals, and shouldn't greatly affect the model.

In [None]:
rec_stats.isnull().mean()

In [None]:
rec_stats[rec_stats['Tgt'].isnull()]

### Categorical Features

#### Team
This data seems to use a similar 3 character encoding of the teams as described previously. When the team encodings in this dataset are compared to the drafted receivers dataset, it can be seen that they are the same, and therefore won't need to be changed, albeit the encodings for a player being on multiple teams will most likely need to be replaced with NaNs.

In [None]:
rec_stats['Tm'].nunique()

In [None]:
rec_stats['Tm'].value_counts()

In [None]:
set(receivers['Tm']) - set(rec_stats['Tm'])

In [None]:
set(rec_stats['Tm']) - set(receivers['Tm'])
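
Replacing the multi-team encodings with NaNs could be done along these lines. The sketch assumes multi-team rows are coded as a digit followed by `TM` (for example `2TM`); that pattern and the helper name `mask_multi_team` are assumptions to be confirmed against the `value_counts()` output above.

```python
import pandas as pd


def mask_multi_team(teams: pd.Series) -> pd.Series:
    """Set codes like '2TM' or '3TM' to NaN, leaving real team codes alone."""
    # Assumed pattern: one or more digits followed by 'TM'.
    is_multi = teams.str.match(r'^\d+TM$', na=False)
    return teams.mask(is_multi)


# Toy data standing in for rec_stats['Tm'].
example = pd.Series(['DET', '2TM', '3TM', 'ATL'])
masked = mask_multi_team(example)
print(masked)
```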

#### Position
Since every position in the receivers data is WR the nulls don't matter, and this column may just end up being dropped.

In [None]:
rec_stats['Pos'].value_counts(dropna = False)

### Distribution of Numeric Features

#### Games
Most players played in the majority of games, with very few playing in fewer than 2 games, probably due to an artifact of the recording (a player is less likely to have receiving stats recorded if they didn't play many games).

In [None]:
eda_viz_pre.game_distribution(savefig = True)

#### Games Started
Interestingly, while the Games feature is far more dense at the top of its range, the games started distribution seems to be a far more uniform distribution from 5 to 14, with an uptick again at 15. It also makes sense that the majority of players did not start a game.

In [None]:
eda_viz_pre.gs_distribution(savefig = True)

#### Targets
Targets has a heavily right-skewed distribution with a large spike in the 0-25 range, which makes sense as most players will not receive much playing time (i.e. snaps), so their targets will be low. This distribution also has a very large outlier in the 200 range. Upon further investigation, there are 2 receivers in this data who have put up 200 targets in a single season (Calvin Johnson and Julio Jones), both of whom had historic seasons in these years.

In [None]:
eda_viz_pre.tgt_distribution(savefig = True)

In [None]:
rec_stats[rec_stats['Tgt'] >= 200]

#### Receptions
The Receptions distribution has a very similar shape to the Targets distribution, which intuitively makes sense. There is one rather large outlier, which has been a bit washed out by the size of the zero bin, and upon inspection that is Michael Thomas, who broke the record for receptions in a season in 2019.

In [None]:
eda_viz_pre.rec_distribution(savefig = True)

In [None]:
rec_stats[rec_stats['Rec'] >= 140]

#### Catch Rate
The Catch Rate of each receiver has a fairly tight normal distribution, with a peak around 50%. There is also a very large spike in the range above 90%, and upon further investigation, while there are many individuals in this bin, almost none of them had significant target levels, and therefore their presence misrepresents the data as a whole. The entries with catch rates lower than 10 percent are shown to be players without any targets listed, making these entries meaningless.

In [None]:
eda_viz_pre.ctr_distribution(savefig = True)

In [None]:
rec_stats[pd.to_numeric(rec_stats['Ctch%'].str.replace('%', '')) >= 90]['Tgt'].value_counts().sort_index()

In [None]:
rec_stats[pd.to_numeric(rec_stats['Ctch%'].str.replace('%', '')) < 10]
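
Since the extreme catch rates come from players with very few targets, one option when inspecting rate statistics is to apply a minimum-target filter first. The sketch below does this on toy data; the cutoff of 20 targets and the helper name `qualified` are arbitrary illustrations, not values chosen in this project.

```python
import pandas as pd

MIN_TARGETS = 20  # hypothetical cutoff, not taken from the analysis

# Toy rows mimicking the low-volume extremes seen above.
toy = pd.DataFrame({
    'Tgt': [1, 2, 50, 120, None],
    'Ctch%': ['100.0%', '0.0%', '62.0%', '58.3%', None],
})


def qualified(stats: pd.DataFrame, min_targets: int = MIN_TARGETS) -> pd.DataFrame:
    """Keep rows with at least `min_targets` targets."""
    # NaN compares as False, so rows with missing targets are dropped too.
    return stats[stats['Tgt'] >= min_targets]


subset = qualified(toy)
print(subset)
```

The same filter, keyed on receptions instead of targets, would apply to the other rate statistics that show low-volume tails.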

#### Yards
The Yards distribution again follows the patterns of the Receptions and Targets distributions (this makes sense that they would be related). There are a few outliers (Calvin Johnson's and Julio Jones's seasons that were mentioned previously, as well as a very prolific Antonio Brown season), but none that need to be corrected or that will dramatically affect an analysis. Again most players, being non-starters, had low receiving yard numbers.

In [None]:
eda_viz_pre.yards_distribution(savefig = True)

In [None]:
rec_stats[rec_stats['Yds'] >= 1800]

#### Yards per Reception
Yards per Reception displays a normal distribution with a peak in the 10-12 bin. On the upper end, while it may initially seem that there are many players with more than 20 yards per reception, many of these players only had about 1 to 4 receptions in an entire year, a similar issue as previously seen in the other rate statistics.

In [None]:
eda_viz_pre.ypr_distribution(savefig = True)

In [None]:
rec_stats[rec_stats['Y/R'] >= 20]['Rec'].value_counts().sort_index()

#### Touchdowns
Touchdowns has a right-skewed distribution with a peak at 0, again due to the lower bound at zero and the presence of part-time players. The right tail makes sense, as there are very few stars putting up large touchdown numbers.

In [None]:
eda_viz_pre.td_distribution(savefig = True)

#### First Downs
First Downs again demonstrates a right-skewed distribution (peak at around 0-10), with similar reasoning to the other right skews. When looking at the outliers on the high side, each came from a historic wide receiver season (Calvin Johnson, Julio Jones, Michael Thomas).

In [None]:
eda_viz_pre.fd_distribution(savefig = True)

In [None]:
rec_stats[rec_stats['1D'] >= 90]

#### Long
The Long distribution is a normal one with a right skew, with a peak around 10-20.

In [None]:
eda_viz_pre.long_distribution(savefig = True)

#### Yards per Target
Yards per Target displays a normal distribution centered around 6 with a right skew. However, almost all of the players out on the right tail have very few receptions, showing that this tail misrepresents the data somewhat.

In [None]:
eda_viz_pre.ypt_distribution(savefig = True)

In [None]:
rec_stats[rec_stats['Y/Tgt'] >= 14]['Rec'].value_counts().sort_index()

#### Receptions per Game
Receptions per Game displays a right-skewed distribution with a few high outliers from players who had elite seasons (Michael Thomas, Wes Welker, Julio Jones, Antonio Brown, Keenan Allen). Most players averaged close to zero receptions per game.

In [None]:
eda_viz_pre.rpg_distribution(savefig = True)

In [None]:
rec_stats[rec_stats['R/G'] >= 8]

#### Receiving Yards per Game
Receiving Yards per Game displays an inverse relationship with a very long right tail as well, with a high outlier for a historic receiving season (Calvin Johnson).

In [None]:
eda_viz_pre.ypg_distribution(savefig = True)

In [None]:
rec_stats[rec_stats['Y/G'] >= 120]

#### Fumbles
Fumbles displays a right-skewed distribution, which is expected. However, there are several players with very high fumble numbers. Upon investigation, many of the fumbles recorded were on special teams plays (e.g. punt returns). Therefore, including this statistic in this analysis would overly punish players asked to play special teams, and it might be cut from this dataset, as most players have negligible fumble counts anyway.

In [None]:
eda_viz_pre.fmb_distribution(savefig = True)

In [None]:
rec_stats[rec_stats['Fmb'] >= 6]

## Advanced Receiver Stats Data
These advanced statistics (courtesy of Football Outsiders) for receivers will be used for the projections of second year jumps. They will need to be trimmed down to only the rookie and second year players for each year. (Note: a Glossary of this dataset's features is available in the reference section)

In [None]:
adv_stats.head()

### Feature Types
The typing is fairly reasonable, however both `DVOA` as well as `VOA` will need to be changed to floats. `DPI`, which is formatted in *number of penalties/yards*, will most likely need to be split up into two different columns. Additionally the `Player` column is formatted VERY differently from the other two datasets, going with a *firstinitial.lastname* format. This will most likely require the other datasets to be changed to this format, and instead of merging purely on name, merging on name and team.

In [None]:
adv_stats.dtypes
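
The conversions described above could be sketched as follows on toy data. The new column names `DPI_Pens` and `DPI_Yds` and the helper `to_initial_lastname` are hypothetical, and the name conversion assumes a simple "First Last" format in the other datasets.

```python
import pandas as pd

# Toy rows in the format described above.
toy = pd.DataFrame({
    'Player': ['C.Johnson', 'J.Jones'],
    'DVOA': ['25.3%', '-4.1%'],
    'DPI': ['3/57', '0/0'],
})

# Percent strings to floats.
toy['DVOA'] = pd.to_numeric(toy['DVOA'].str.replace('%', '', regex=False))

# Split "penalties/yards" into two integer columns (names are assumptions).
dpi = toy['DPI'].str.split('/', expand=True).astype(int)
toy['DPI_Pens'] = dpi[0]
toy['DPI_Yds'] = dpi[1]


def to_initial_lastname(full_name: str) -> str:
    """Convert 'Calvin Johnson' to 'C.Johnson'; assumes 'First Last' input."""
    first, _, last = full_name.partition(' ')
    return f'{first[0]}.{last}'


print(toy)
print(to_initial_lastname('Calvin Johnson'))
```

Because different players can share a first initial and last name, this key is not unique on its own, which is why merging on both name and team is suggested above. Suffixes such as "Jr." would also need separate handling.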

### Missingness
There seem to be no missing entries for this dataset, which is in line with initial assumptions.

In [None]:
adv_stats.isnull().mean()

### Distribution of Categorical Features

#### Team
Team has a slightly different encoding than the other datasets, as well as being slightly inconsistent. When specifically comparing the encodings between this dataset and the previous two, it can be seen that some edits to the encodings will be required for certain teams.

In [None]:
adv_stats['Team'].nunique()

In [None]:
adv_stats['Team'].value_counts()

In [None]:
set(receivers['Tm']) - set(adv_stats['Team'])

In [None]:
set(adv_stats['Team']) - set(receivers['Tm'])

### Distribution of Numeric Features

#### DYAR (Defense-adjusted Yards Above Replacement)
DYAR has a fairly normal distribution with a peak in the 0-100 bin. It has a few outliers at the high end, including star players such as Calvin Johnson, Jordy Nelson, Antonio Brown, and Michael Thomas, and two outliers on the low end in Tavon Austin and Chris Chambers. Both of these low-end players also had very poor statistics in almost every other advanced metric.

In [None]:
eda_viz_pre.dyar_distribution(savefig = True)

In [None]:
adv_stats[adv_stats['DYAR'] >= 500]

In [None]:
adv_stats[adv_stats['DYAR'] < -200]

#### YAR (Yards Above Replacement)
YAR has a similarly shaped distribution as DYAR (with a peak again in the 0-100 bin), which is to be expected as it is essentially the same statistic, except not accounting for the defenses played. Something interesting to note, however, is that Tavon Austin does not appear as an extremely low outlier, as he most likely played fairly poorly, but his performance was even worse when taking into account his opponents. Instead, Brandon Lloyd is down around the bottom. There are similar high outliers, except that Marvin Harrison, Andre Johnson, and an Antonio Brown season don't appear, while a Randall Cobb season does.

In [None]:
eda_viz_pre.yar_distribution(savefig = True)

In [None]:
adv_stats[adv_stats['YAR'] >= 500]

In [None]:
adv_stats[adv_stats['YAR'] < -200]

#### DVOA (Defense-adjusted Value Over Average)
DVOA displays a normal distribution (peak in the -10-0 bin), with outliers on both ends. On the upper end, there are receivers with seemingly small sample sizes (based on their EYds numbers), in Kelley Washington, Marvin Hall, and Greg Camarillo. On the lower end of the spectrum, Brandon Lloyd has the lowest DVOA, joined by Stanley Morgan. 

In [None]:
eda_viz_pre.dvoa_distribution(savefig = True)

In [None]:
adv_stats[pd.to_numeric(adv_stats['DVOA'].str.replace('%', '')) >= 80]

In [None]:
adv_stats[pd.to_numeric(adv_stats['DVOA'].str.replace('%', '')) < -80]

#### VOA (Value Over Average)
VOA follows the same distribution pattern as DVOA as expected (peak in the -20-0 bin), with slightly more spread, but with the same upper outliers as DVOA. However, in the lower outliers, in addition to Stanley Morgan, Donte Moncrief and Travis Benjamin also make an appearance, receivers with much larger names than Morgan. Brandon Lloyd is also not a lower outlier in this distribution.

In [None]:
eda_viz_pre.voa_distribution(savefig = True)

In [None]:
adv_stats[pd.to_numeric(adv_stats['VOA'].str.replace('%', '')) >= 80]

In [None]:
adv_stats[pd.to_numeric(adv_stats['VOA'].str.replace('%', '')) < -80]

#### Effective Yards
EYds has a right-skewed distribution (peak in the 0-250 bin), with a very small number of negative entries. In terms of outliers on the high end, elite receivers Antonio Brown and Calvin Johnson make an appearance.

In [None]:
eda_viz_pre.eyds_distribution(savefig = True)

In [None]:
adv_stats[adv_stats['EYds'] >= 2000]

In [None]:
adv_stats[adv_stats['EYds'] < 0]