<div style="text-align: right"> Tommy Evans-Barton </div>
<div style="text-align: right"> WR Year 2 Jumps </div>

# EDA of Wide Receiver Data (Univariate)

The purpose of this notebook is to do a technical analysis of the **raw** data that will be used in the main report, along with some justifications for future cleaning approaches. Most of this work is academic due diligence and will mainly be useful for deeper dives into this project. Data courtesy of Pro Football Reference and Football Outsiders.

In [None]:
import os
import sys
import pandas as pd
import numpy as np
import json
import matplotlib.pyplot as plt

In [None]:
%load_ext autoreload
%autoreload 2

In [None]:
TOP_PATH = os.environ['PWD']

In [None]:
sys.path.append(TOP_PATH + '/src')
sys.path.append(TOP_PATH + '/src/viz')

In [None]:
import eda_viz

In [None]:
receivers = pd.read_csv(TOP_PATH + '/data/raw/RECEIVERS.csv')
rec_stats = pd.read_csv(TOP_PATH + '/data/raw/REC_STATS.csv')
adv_stats = pd.read_csv(TOP_PATH + '/data/raw/ADV_REC_STATS.csv')

## Receivers Drafted Data
This data is being used in order to isolate the receivers drafted in the first three rounds from 2010 to 2019, as well as identify which years are their first and second in the league.

In [None]:
receivers.head()

### Feature Types
Initially, it seems that the typing of the receivers data is fairly reasonable, as all numeric data are integers and all other data are strings.

In [None]:
receivers.dtypes

### Missingness
There seem to be no missing entries for this dataset, which is in line with initial assumptions.

In [None]:
receivers.isnull().mean()

### Categorical Features

#### Team

This data seems to use a fairly standard three-letter encoding of the teams, with all teams accounted for, including the former St. Louis Rams and San Diego Chargers franchises, which will probably be combined with their L.A. counterparts.

In [None]:
receivers['Tm'].nunique()

In [None]:
receivers['Tm'].value_counts()
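
If the relocated franchises are folded into their Los Angeles successors as suggested above, one option is a simple replacement map. This is a minimal sketch on toy data; the codes `STL`, `SDG`, `LAR`, and `LAC` are assumptions and should be checked against the `value_counts()` output before use.

```python
import pandas as pd

# Assumed abbreviations; verify against receivers['Tm'].value_counts() first.
RELOCATED_TEAMS = {'STL': 'LAR', 'SDG': 'LAC'}

# Toy stand-in for the receivers data.
toy_receivers = pd.DataFrame({'Tm': ['STL', 'LAR', 'SDG', 'LAC', 'DET']})

# Map former franchise codes onto their current counterparts.
toy_receivers['Tm'] = toy_receivers['Tm'].replace(RELOCATED_TEAMS)
print(toy_receivers['Tm'].value_counts())
```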

### Distribution of Numeric Features

#### Rounds
The number of receivers selected increases with each round. Note, however, that the third round contains more picks overall, due to compensatory selections, which also affects the distribution of receivers by pick in the chart that follows this one.

In [None]:
eda_viz.round_distribution(savefig = True)

#### Picks
When receivers drafted are grouped into bins of 10 picks, there is an odd dip in the middle of each round, i.e. around the 11th to 20th pick of each round, with spikes at the beginning and end of rounds. However, while interesting in its own right, this will not have an effect on this analysis.

In [None]:
eda_viz.pick_distribution(savefig = True)
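
The implementation of `eda_viz.pick_distribution` is not shown in this notebook, so the following is only a sketch of how picks might be grouped into bins of 10 with `pd.cut`. The `Pick` column name and the toy pick numbers are assumptions.

```python
import pandas as pd

# Toy overall pick numbers; 'Pick' is an assumed column name.
toy_picks = pd.DataFrame({'Pick': [1, 5, 12, 33, 34, 64, 65, 70, 96, 100]})

# Right-closed bins of width 10: (0, 10], (10, 20], ..., (90, 100].
bins = list(range(0, 101, 10))
toy_picks['Pick_Bin'] = pd.cut(toy_picks['Pick'], bins=bins)

# Count receivers per bin, ordered by pick range.
bin_counts = toy_picks['Pick_Bin'].value_counts().sort_index()
print(bin_counts)
```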

#### Age
The distribution of the ages of the players is as expected: a fairly normal distribution with a peak at 22 years old, a typical age for finishing college.

In [None]:
eda_viz.age_distribution(savefig = True)

## Receiver Stats Data
These fairly standard statistics (courtesy of Pro Football Reference) for receivers will be used for the projections of second year jumps. They will need to be trimmed down to only the rookie and second year players for each year. (*Note:* a Glossary of this dataset's features is available in the reference section)

In [None]:
rec_stats.head()
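
Trimming the stats down to rookie and second-year seasons could look roughly like the sketch below, which joins season stats to the draft data and keeps only the first two seasons after the draft. The column names `Player`, `Draft_Year`, `Year`, and `Season_Num` are hypothetical, and the rows are toy data.

```python
import pandas as pd

# Toy draft data; column names are assumptions.
drafted = pd.DataFrame({
    'Player': ['Player A', 'Player B'],
    'Draft_Year': [2012, 2015],
})

# Toy season stats; 'Player C' was not drafted in the rounds of interest.
season_stats = pd.DataFrame({
    'Player': ['Player A', 'Player A', 'Player A', 'Player B', 'Player C'],
    'Year': [2012, 2013, 2014, 2015, 2013],
    'Rec': [40, 65, 80, 30, 55],
})

# Keep only drafted receivers, then number each season relative to the draft.
merged = season_stats.merge(drafted, on='Player', how='inner')
merged['Season_Num'] = merged['Year'] - merged['Draft_Year'] + 1

# Rookie (1) and second-year (2) seasons only.
first_two = merged[merged['Season_Num'].isin([1, 2])]
print(first_two)
```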

### Feature Types
For the typing of our features, while the `Catch Percentage` column is a string and will need to be converted to a float, the rest of the features are in the correct format for analysis. Additionally, on viewing the `Player Name` column, there will need to be some slight cleaning of extra characters.

In [None]:
rec_stats.dtypes
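
The two cleaning steps mentioned above might be sketched as follows. The `Ctch%` column name comes from later cells in this notebook; the `Player` column name and the assumption that the extra characters are trailing `*` and `+` markers are guesses that should be checked against `rec_stats.head()`.

```python
import pandas as pd

# Toy rows with placeholder names and values.
toy_stats = pd.DataFrame({
    'Player': ['Player A*+', 'Player B*', 'Player C'],
    'Ctch%': ['59.8%', '67.0%', '47.6%'],
})

# Percent strings -> floats.
toy_stats['Ctch%'] = pd.to_numeric(toy_stats['Ctch%'].str.rstrip('%'))

# Strip trailing marker characters (assumed to be '*' and '+').
toy_stats['Player'] = (
    toy_stats['Player'].str.replace(r'[*+]+$', '', regex=True).str.strip()
)
print(toy_stats.dtypes)
```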

### Missingness
There seem to be no missing entries for this dataset, which is in line with initial assumptions.

In [None]:
rec_stats.isnull().mean()

### Categorical Features

#### Team
This data seems to use a similar 3 character encoding of the teams as described previously. When the team encodings in this dataset are compared to the drafted receivers dataset, it can be seen that they are the same, and therefore won't need to be changed.

In [None]:
rec_stats['Tm'].nunique()

In [None]:
rec_stats['Tm'].value_counts()

In [None]:
set(receivers['Tm']) - set(rec_stats['Tm'])

In [None]:
set(rec_stats['Tm']) - set(receivers['Tm'])

#### Position
Since every position is WR, albeit with slightly different representations, this column may just end up being dropped.

In [None]:
rec_stats['Pos'].value_counts()
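
Before dropping the column, it may be worth confirming that the different representations really collapse to a single value. The sketch below assumes the variants differ only in case and surrounding whitespace, which may not hold for the real data.

```python
import pandas as pd

# Toy variants of the same position label (assumed forms).
toy_pos = pd.DataFrame({'Pos': ['WR', 'wr', ' WR', 'Wr'], 'Rec': [50, 42, 61, 33]})

# Normalize case and whitespace, then drop the column only if it is constant.
normalized = toy_pos['Pos'].str.strip().str.upper()
if normalized.nunique() == 1:
    toy_pos = toy_pos.drop(columns='Pos')
print(toy_pos.columns.tolist())
```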

### Distribution of Numeric Features

#### Games
Most players played in the majority of games, with none playing in 0-3 games, possibly due to an artifact of the source not recording information for those players who never played.

In [None]:
eda_viz.game_distribution(savefig = True)

#### Games Started
Interestingly, while the Games feature is far more dense at the top of its range, the games started distribution seems to be a far more uniform distribution from 4 to 16. There is also an interesting outlier in the 0-1 bin, which upon further inspection is Colts receiver T.Y. Hilton. What is even more interesting is that despite only being credited with one start, he still put up 861 yards and 50 receptions. This high usage despite low start numbers may be due to the fact that Hilton mainly plays in the slot, which, while heavily used in the modern game, traditionally had not been viewed as a starting role until recently.

In [None]:
eda_viz.gs_distribution(savefig = True)

In [None]:
rec_stats[rec_stats['GS'] < 2]

#### Targets
Targets follows a roughly bell-shaped distribution with a right skew, which makes sense because targets are bounded below at zero. The distribution peaks at around 75 and has a very large outlier in the 200 range. Upon further investigation, only 2 receivers in this dataset recorded 200 or more targets in a single season (Calvin Johnson and Julio Jones), both in historic seasons.

In [None]:
eda_viz.tgt_distribution(savefig = True)

In [None]:
rec_stats[rec_stats['Tgt'] >= 200]
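
The right skew described for this and several later features can be checked numerically as well as visually. A minimal sketch on made-up target counts, using `Series.skew()` and a mean-versus-median comparison:

```python
import pandas as pd

# Made-up target counts with one very high value, for illustration only.
toy_targets = pd.Series([40, 55, 60, 70, 75, 75, 80, 95, 120, 204])

# Positive sample skewness indicates a longer right tail.
skewness = toy_targets.skew()

# A mean above the median is another common sign of right skew.
mean_above_median = toy_targets.mean() > toy_targets.median()
print(skewness, mean_above_median)
```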

#### Receptions
The Receptions distribution has a very similar shape to the Targets (with a peak around 50 receptions) which intuitively makes some sense. There is one rather large outlier, and upon inspection that is Michael Thomas, who broke the record for receptions in a season in 2019.

In [None]:
eda_viz.rec_distribution(savefig = True)

In [None]:
rec_stats[rec_stats['Rec'] >= 140]

#### Catch Rate
The Catch Rate of each receiver has a fairly tight normal distribution, with a peak around 60%. There is also one large outlier, Lee Smith, who had a catch rate of 100%, but on only 4 targets. This outlier suggests setting a minimum number of targets required to qualify for catch rate.

In [None]:
eda_viz.ctr_distribution(savefig = True)

In [None]:
rec_stats[pd.to_numeric(rec_stats['Ctch%'].str.replace('%', '')) >= 90]
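
A target floor of the kind suggested above could be applied as a simple row filter. In this sketch the threshold `MIN_TARGETS = 20` is an arbitrary placeholder rather than a value chosen by this analysis, and the rows are toy data.

```python
import pandas as pd

# Hypothetical qualification threshold; the real cutoff is still to be decided.
MIN_TARGETS = 20

# Toy rows: one low-volume player with a perfect catch rate.
toy_catch = pd.DataFrame({
    'Player': ['Low Volume', 'Starter A', 'Starter B'],
    'Tgt': [4, 90, 120],
    'Ctch%': [100.0, 61.1, 58.3],
})

# Keep only players with enough targets to qualify.
qualified = toy_catch[toy_catch['Tgt'] >= MIN_TARGETS]
print(qualified)
```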

#### Yards
The Yards distribution again follows the patterns of the Receptions and Targets distributions (this makes sense that they would be related), with a peak at around 500. There are a few outliers (Calvin Johnson's and Julio Jones's seasons that were mentioned previously, as well as a very prolific Antonio Brown season), but none that need to be corrected or that will dramatically affect an analysis.

In [None]:
eda_viz.yards_distribution(savefig = True)

In [None]:
rec_stats[rec_stats['Yds'] >= 1800]

#### Yards per Reception
Yards per Reception displays a fairly tight normal distribution with a peak in the 12-13 range, and outliers on both ends of the spectrum. On the upper end, there's one part-time starter (Joe Morgan), who had very low target numbers, as well as two deep threats in Ladarius Green and DeSean Jackson. Jackson is the more notable player, as he put up over 1000 yards to Green's 300. His role in an offense is often as a deep threat, which accounts for his high yards per reception. On the lower end, there are two players with very low usage numbers. Overall, these numbers may indicate that there should be a lower limit on the number of receptions to qualify for this metric.

In [None]:
eda_viz.ypr_distribution(savefig = True)

In [None]:
rec_stats[rec_stats['Y/R'] >= 22]

In [None]:
rec_stats[rec_stats['Y/R'] < 4]

#### Touchdowns
Touchdowns has a heavy right skew, again due to the lower bound at zero, with a peak at 2. The right tail makes sense, as there are very few stars putting up large touchdown numbers. One interesting note is that the number of players with 0 touchdowns is much lower than expected.

In [None]:
eda_viz.td_distribution(savefig = True)

#### First Downs
First Downs again demonstrates a right skewed normal distribution (peak at around 30), with similar reasoning to the other right skews. When looking at the outliers on the high side, each came from historical wide receiver seasons (Calvin Johnson, Julio Jones, Michael Thomas).

In [None]:
eda_viz.fd_distribution(savefig = True)

In [None]:
rec_stats[rec_stats['1D'] >= 90]

#### Long
The Long distribution is a normal one, with a peak around 50 and two low outliers, both belonging to Lee Smith, who was previously noted as having a high Catch Rate due to low volume.

In [None]:
eda_viz.long_distribution(savefig = True)

In [None]:
rec_stats[rec_stats['Lng'] < 10]

#### Yards per Target
Yards per Target displays a tight normal distribution centered around 8, with two fairly distinct outliers: Joe Morgan on the high side and Tavon Austin on the low side. Both of these players had very low target numbers, however, which accounts for these outliers and indicates that there should probably be a lower limit on targets in order to qualify for this stat.

In [None]:
eda_viz.ypt_distribution(savefig = True)

In [None]:
rec_stats[rec_stats['Y/Tgt'] >= 17]

In [None]:
rec_stats[rec_stats['Y/Tgt'] < 3]
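
An alternative to dropping low-volume rows entirely is to blank out only the rate statistics that are unreliable at low volume, keeping the counting stats intact. The sketch below does this with `Series.where`; both thresholds are hypothetical placeholders and the rows are toy data.

```python
import pandas as pd

# Hypothetical qualification thresholds.
MIN_TARGETS = 20
MIN_RECEPTIONS = 10

toy_rates = pd.DataFrame({
    'Player': ['Low Volume', 'Starter A'],
    'Tgt': [5, 110],
    'Rec': [3, 70],
    'Y/Tgt': [18.0, 8.2],
    'Y/R': [30.0, 12.9],
})

# Rate stats become NaN when the player does not meet the volume floor.
toy_rates['Y/Tgt'] = toy_rates['Y/Tgt'].where(toy_rates['Tgt'] >= MIN_TARGETS)
toy_rates['Y/R'] = toy_rates['Y/R'].where(toy_rates['Rec'] >= MIN_RECEPTIONS)
print(toy_rates)
```

Masking keeps every player in the table, which matters if the counting stats of low-volume players are still wanted elsewhere.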

#### Receptions per Game
Receptions per Game displays a right skew like most stats in this dataset (peak at 3), albeit less dramatic than others. The high outliers are all from elite receivers.

In [None]:
eda_viz.rpg_distribution(savefig = True)

In [None]:
rec_stats[rec_stats['R/G'] >= 8]

#### Receiving Yards per Game
Receiving Yards per Game displays a right skew as well (peak at around 45), with a high outlier for a historic receiving season (Calvin Johnson).

In [None]:
eda_viz.ypg_distribution(savefig = True)

In [None]:
rec_stats[rec_stats['Y/G'] >= 120]

#### Fumbles
Fumbles display an inverse (decaying) distribution, which is expected, since most players have few or none. However, there are several players with very high fumble numbers. Upon investigation, many of the fumbles recorded were on special teams plays (e.g. punt returns). Therefore, including this statistic in this analysis would overly punish players asked to play special teams, and it might be cut from this dataset, as most players have negligible fumble counts anyway.

In [None]:
eda_viz.fmb_distribution(savefig = True)

In [None]:
rec_stats[rec_stats['Fmb'] >= 6]

## Advanced Receiver Stats Data
These advanced statistics (courtesy of Football Outsiders) for receivers will be used for the projections of second year jumps. They will need to be trimmed down to only the rookie and second year players for each year. (*Note:* a Glossary of this dataset's features is available in the reference section)

In [None]:
adv_stats.head()

### Feature Types
The typing is fairly reasonable; however, both `DVOA` and `VOA` will need to be changed to floats. `DPI`, which is formatted as *number of penalties/yards*, will most likely need to be split up into two different columns. Additionally, the `Player` column is formatted very differently from the other two datasets, using a *firstinitial.lastname* format. This will most likely require the names in the other datasets to be converted to this format and, because abbreviated names are more likely to collide, merging on name and team rather than on name alone.

In [None]:
adv_stats.dtypes
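
The conversions described above might be sketched as follows on toy rows. The new column names `DPI_Pens` and `DPI_Yds` are hypothetical, the values are placeholders, and the name conversion is a simple regular expression that will not handle suffixes such as "Jr." without extra care.

```python
import pandas as pd

# Toy rows with placeholder values; not real player statistics.
toy_adv = pd.DataFrame({
    'Player': ['A.Smith', 'B.Jones'],
    'DVOA': ['12.5%', '-8.3%'],
    'VOA': ['10.1%', '-11.0%'],
    'DPI': ['2/31', '0/0'],
})

# Percent strings -> floats.
for col in ['DVOA', 'VOA']:
    toy_adv[col] = pd.to_numeric(toy_adv[col].str.rstrip('%'))

# 'penalties/yards' -> two integer columns.
dpi_parts = toy_adv['DPI'].str.split('/', expand=True).astype(int)
toy_adv['DPI_Pens'] = dpi_parts[0]
toy_adv['DPI_Yds'] = dpi_parts[1]
toy_adv = toy_adv.drop(columns='DPI')

# Full names (as in the other datasets) -> 'F.Lastname' keys.
full_names = pd.Series(['Bob Jones', 'T.Y. Hilton'])
name_keys = full_names.str.replace(r'^(\w)\S*\s+', r'\1.', regex=True)
print(toy_adv.dtypes)
print(name_keys.tolist())
```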

### Missingness
There seem to be no missing entries for this dataset, which is in line with initial assumptions.

In [None]:
adv_stats.isnull().mean()

### Distribution of Categorical Features

#### Team
Team has a slightly different encoding than the other datasets, as well as being slightly inconsistent. When specifically comparing the encodings between this dataset and the previous two, it can be seen that some edits to the encodings will be required for certain teams.

In [None]:
adv_stats['Team'].nunique()

In [None]:
adv_stats['Team'].value_counts()

In [None]:
set(receivers['Tm']) - set(adv_stats['Team'])

In [None]:
set(adv_stats['Team']) - set(receivers['Tm'])
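
Once the mismatched codes are known from the set differences above, they can be harmonized with a replacement map and the datasets merged on name and team together. In this sketch the mapping entries, the `Player_Key` column, and all rows are hypothetical; the real map should be built from the actual set-difference output.

```python
import pandas as pd

# Hypothetical examples of mismatched codes (advanced stats -> standard stats).
TEAM_CODE_MAP = {'GB': 'GNB', 'KC': 'KAN'}

# Toy standard stats, with names already converted to 'F.Lastname' keys.
toy_std = pd.DataFrame({
    'Player_Key': ['A.Smith', 'A.Smith', 'B.Jones'],
    'Tm': ['GNB', 'DET', 'KAN'],
    'Rec': [50, 30, 70],
})

# Toy advanced stats: two different players share the key 'A.Smith'.
toy_adv = pd.DataFrame({
    'Player': ['A.Smith', 'A.Smith', 'B.Jones'],
    'Team': ['GB', 'DET', 'KC'],
    'DYAR': [120, -15, 200],
})

# Harmonize team codes, then confirm nothing is left unmatched.
toy_adv['Team'] = toy_adv['Team'].replace(TEAM_CODE_MAP)
unmatched = set(toy_adv['Team']) - set(toy_std['Tm'])

# Merge on name AND team so that shared abbreviated names do not collide.
merged = toy_std.merge(
    toy_adv,
    left_on=['Player_Key', 'Tm'],
    right_on=['Player', 'Team'],
    how='left',
    indicator=True,
)
print(merged[['Player_Key', 'Tm', 'Rec', 'DYAR', '_merge']])
```

The `indicator=True` column makes it easy to spot rows that failed to find a partner after the merge.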

### Distribution of Numeric Features

#### DYAR (Defense-adjusted Yards Above Replacement)
DYAR has a fairly normal distribution with a peak in the 0-100 bin. It has a few outliers at the high end, with star players such as Calvin Johnson, Jordy Nelson, Antonio Brown, and Michael Thomas, and one outlier on the low end in Tavon Austin.

In [None]:
eda_viz.dyar_distribution(savefig = True)

In [None]:
adv_stats[adv_stats['DYAR'] >= 500]

In [None]:
adv_stats[adv_stats['DYAR'] < -200]

#### YAR (Yards Above Replacement)
YAR has a similarly shaped distribution to DYAR (with a peak again in the 0-100 bin), which is to be expected as it is essentially the same statistic, except that it does not account for the defenses played. Something interesting to note, however, is that Tavon Austin does not appear as an extremely low outlier here: he most likely played fairly poorly, but his performance looks even worse once his opponents are taken into account. Additionally, Randall Cobb appears as a high outlier in YAR but not in DYAR, while an Antonio Brown season appears as a high outlier in DYAR but not in YAR.

In [None]:
eda_viz.yar_distribution(savefig = True)

In [None]:
adv_stats[adv_stats['YAR'] >= 500]

#### DVOA (Defense-adjusted Value Over Average)
DVOA displays a normal distribution (peak in the -10 to 0 bin), with outliers on both ends. On the upper end, there are receivers with seemingly small sample sizes (based on their EYds numbers), in Kelley Washington and Marvin Hall. On the lower end of the spectrum, Stanley Morgan has the lowest DVOA, though by a smaller margin.

In [None]:
eda_viz.dvoa_distribution(savefig = True)

In [None]:
adv_stats[pd.to_numeric(adv_stats['DVOA'].str.replace('%', '')) >= 80]

In [None]:
adv_stats[pd.to_numeric(adv_stats['DVOA'].str.replace('%', '')) < -80]

#### VOA (Value Over Average)
VOA follows the same distribution pattern as DVOA, as expected (peak in the -20 to 0 bin), with slightly more spread but the same upper outliers. Among the lower outliers, however, Stanley Morgan is joined by Donte Moncrief and Travis Benjamin, receivers with much bigger names than Morgan.

In [None]:
eda_viz.voa_distribution(savefig = True)

In [None]:
adv_stats[pd.to_numeric(adv_stats['VOA'].str.replace('%', '')) >= 80]

In [None]:
adv_stats[pd.to_numeric(adv_stats['VOA'].str.replace('%', '')) < -80]

#### Effective Yards
EYds has an inverse (decaying) distribution (peak in the 0-250 bin), with a very small number of negative entries. In terms of outliers on the high end, elite receivers Antonio Brown and Calvin Johnson make an appearance.

In [None]:
eda_viz.eyds_distribution(savefig = True)

In [None]:
adv_stats[adv_stats['EYds'] >= 2000]

In [None]:
adv_stats[adv_stats['EYds'] < 0]