<div style="text-align: right"> Tommy Evans-Barton </div>
<div style="text-align: right"> WR Year 2 Jumps </div>

# EDA of Wide Receiver Data

This notebook is a technical analysis of the **raw** data used in the main report, along with justifications for the cleaning approaches applied later. Most of this work is due diligence and is mainly useful for deeper dives into the project. Data courtesy of Pro Football Reference and Football Outsiders.

In [None]:
import os
import sys
import pandas as pd
import numpy as np
import json
import matplotlib.pyplot as plt

In [None]:
%load_ext autoreload
%autoreload 2

In [None]:
TOP_PATH = os.environ['PWD']

In [None]:
sys.path.append(TOP_PATH + '/src')
sys.path.append(TOP_PATH + '/src/viz')

In [None]:
import eda_viz

In [None]:
receivers = pd.read_csv(TOP_PATH + '/data/raw/RECEIVERS.csv')
rec_stats = pd.read_csv(TOP_PATH + '/data/raw/REC_STATS.csv')
adv_stats = pd.read_csv(TOP_PATH + '/data/raw/ADV_REC_STATS.csv')

## Receivers Drafted Data
This data is used to isolate the receivers drafted in the first three rounds from 2010 to 2019, and to identify which seasons are their first and second in the league.

In [None]:
receivers.head()

### Feature Types
On first inspection, the types in the receivers data look reasonable: all numeric columns are integers and all other columns are strings.

In [None]:
receivers.dtypes

### Missingness
There seem to be no missing entries for this dataset, which is in line with initial assumptions.

In [None]:
receivers.isnull().mean()

### Categorical Features

#### Team

This data seems to use a fairly standard three-letter encoding of the teams, with all teams accounted for. This includes the former St. Louis Rams and San Diego Chargers franchises, which will probably be combined with their Los Angeles counterparts.

In [None]:
receivers['Tm'].nunique()

In [None]:
receivers['Tm'].value_counts()
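
As a minimal sketch of the franchise merge mentioned above, the former-franchise codes could be mapped onto their current counterparts. The abbreviations `STL`, `SDG`, `LAR` and `LAC` are assumptions here and should be checked against the `value_counts()` output; the frame below is toy data, not the real file.

```python
import pandas as pd

# Assumed abbreviations for the relocated franchises; verify against the raw data.
RELOCATED = {'STL': 'LAR', 'SDG': 'LAC'}


def combine_relocated(teams: pd.Series) -> pd.Series:
    """Replace former-franchise codes with their current counterparts."""
    return teams.replace(RELOCATED)


# Toy example only
toy = pd.DataFrame({'Tm': ['STL', 'LAR', 'SDG', 'LAC', 'GNB']})
toy['Tm'] = combine_relocated(toy['Tm'])
print(toy['Tm'].value_counts())
```

Codes that are not in the mapping pass through unchanged, so the same function can be reused on the other datasets if they share this encoding.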

### Distribution of Numeric Features

#### Rounds
Over the three rounds, the number of receivers selected increases by round. However, the third round also contains more picks overall because of compensatory selections, so the raw counts should be read relative to the number of picks in each round.

In [None]:
eda_viz.round_distribution(savefig = True)
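
One way to account for the extra third-round picks is to divide the receiver count in each round by the total number of picks in that round. The sketch below uses toy data throughout: the column name `Rnd` and the pick totals in `total_picks` are hypothetical placeholders, not values taken from the dataset.

```python
import pandas as pd

# Toy draft table; 'Rnd' is an assumed column name.
toy = pd.DataFrame({'Rnd': [1, 1, 2, 2, 2, 3, 3, 3, 3]})

# Hypothetical total picks per round (all positions) over the same span.
total_picks = pd.Series({1: 32, 2: 32, 3: 40})

# Receivers selected per round, then as a share of all picks in that round.
wr_counts = toy['Rnd'].value_counts().sort_index()
wr_share = wr_counts / total_picks
print(wr_share)
```

The division aligns on the round number, so rounds missing from either series would show up as `NaN` rather than being silently dropped.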

#### Picks
When receivers are grouped into bins of 10 picks, there is an odd dip from picks 10-19 followed by a large spike from picks 20-29. This does not have a large impact on the analysis, but it is an interesting deviation to note.

In [None]:
eda_viz.pick_distribution(savefig = True)
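
The 10-pick bins can also be tabulated directly, which makes the dip and spike easy to check numerically. This is a sketch on toy data and is not necessarily how `eda_viz.pick_distribution` builds its bins; the column name `Pick` is an assumption.

```python
import pandas as pd

# Toy pick numbers; 'Pick' is an assumed column name.
toy = pd.DataFrame({'Pick': [4, 7, 12, 21, 22, 27, 29, 35, 61, 99]})

# Floor each pick to the start of its 10-pick bin (0-9, 10-19, 20-29, ...).
bin_start = (toy['Pick'] // 10) * 10
per_bin = bin_start.value_counts().sort_index()
print(per_bin)
```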

#### Age
Player ages are distributed as expected: roughly normally.

In [None]:
eda_viz.age_distribution(savefig = True)

## Receiver Stats Data
These fairly standard receiving statistics will be used to project second-year jumps. They will need to be trimmed down to only the rookie and second-year seasons of each player. (*Note:* a glossary of this dataset's features is available in the reference section.)

In [None]:
rec_stats.head()
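
A minimal sketch of the trimming step described above, assuming the drafted-receivers table carries a draft year and the stats table carries a season year. The column names `Player`, `Draft_Year`, `Year` and `Season_No` are hypothetical, and the frames are toy data.

```python
import pandas as pd

# Toy stand-ins for the drafted-receivers and stats tables.
drafted = pd.DataFrame({'Player': ['A', 'B'], 'Draft_Year': [2015, 2017]})
stats = pd.DataFrame({
    'Player': ['A', 'A', 'A', 'B', 'C'],
    'Year': [2015, 2016, 2017, 2018, 2016],
    'Rec': [40, 65, 70, 55, 30],
})

# Keep only drafted players, then number each season relative to the draft year.
merged = stats.merge(drafted, on='Player', how='inner')
merged['Season_No'] = merged['Year'] - merged['Draft_Year'] + 1

# Seasons 1 and 2 are the rookie and second years.
first_two = merged[merged['Season_No'].isin([1, 2])]
print(first_two)
```

Joining on the name alone is fragile when two players share a name, so a real implementation would likely want a second key, such as team or a player identifier, if one is available.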

### Feature Types
The catch percentage column (`Ctch%`) is stored as a string and will need to be converted to a float; the rest of the features are already in the correct format for analysis. Additionally, the player name column contains some extra characters that will need slight cleaning.

In [None]:
rec_stats.dtypes
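
The two cleaning steps noted above could look roughly like the following. The column name `Player` and the specific trailing characters (`*` and `+`) are assumptions for illustration; the raw file may use a different name or contain other extra characters.

```python
import pandas as pd

# Toy rows; the names and values are made up.
toy = pd.DataFrame({
    'Player': ['Alpha Receiver*+', 'Beta Receiver*', 'Gamma Receiver'],
    'Ctch%': ['71.4%', '63.0%', '58.9%'],
})

# Strip the percent sign and convert the remaining text to a float.
toy['Ctch%'] = pd.to_numeric(toy['Ctch%'].str.rstrip('%'))

# Remove the assumed marker characters and any stray whitespace from names.
toy['Player'] = toy['Player'].str.replace(r'[*+]', '', regex=True).str.strip()
print(toy.dtypes)
```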

### Missingness
There seem to be no missing entries for this dataset, which is in line with initial assumptions.

In [None]:
rec_stats.isnull().mean()

### Categorical Features

#### Team
This data seems to use a three-character team encoding similar to the one described previously.

In [None]:
rec_stats['Tm'].nunique()

In [None]:
rec_stats['Tm'].value_counts()

#### Position
Since every position is WR, albeit with slightly different representations, this column may just end up being dropped.

In [None]:
rec_stats['Pos'].value_counts()
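
Before dropping the column, it may be worth confirming that the different representations really do collapse to a single value. The sketch below assumes the differences are only case and whitespace, which is a guess; the toy values are illustrative.

```python
import pandas as pd

# Toy rows with assumed variations of the same position label.
toy = pd.DataFrame({'Pos': ['WR', 'wr', ' WR'], 'Rec': [50, 20, 35]})

# Normalize case and whitespace, then drop the column only if it is constant.
normalized = toy['Pos'].str.strip().str.upper()
if normalized.nunique() == 1:
    toy = toy.drop(columns='Pos')
print(toy.columns.tolist())
```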

### Distribution of Numeric Features

#### Games
Most players played in the majority of games, and none appear with 0-3 games played. This is possibly an artifact of the source not recording players who never, or very rarely, played.

In [None]:
eda_viz.game_distribution(savefig = True)

#### Games Started
Interestingly, while the Games feature is concentrated at the top of its range, Games Started is spread far more uniformly from 4 to 16.

In [None]:
eda_viz.gs_distribution(savefig = True)

#### Targets
Targets appear roughly bell-shaped with a right skew, which makes sense because the distribution is bounded below at zero while the upper tail is open.

In [None]:
eda_viz.tgt_distribution(savefig = True)

In [None]:
rec_stats[rec_stats['Tgt'] >= 200]
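
The right skew can be checked numerically as well as visually. The sketch below uses made-up target counts rather than the real `Tgt` column; a positive sample skewness and a mean above the median are both consistent with a right-skewed distribution.

```python
import pandas as pd

# Toy target counts with a long right tail.
toy_targets = pd.Series([5, 12, 20, 25, 30, 38, 45, 60, 95, 160])

# Sample skewness: positive values indicate a longer right tail.
skewness = toy_targets.skew()
print(skewness, toy_targets.mean(), toy_targets.median())
```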

#### Receptions
The Receptions distribution has a very similar shape to the Targets distribution, which makes sense because every reception requires a target.

In [None]:
eda_viz.rec_distribution(savefig = True)

In [None]:
rec_stats[rec_stats['Rec'] >= 140]

#### Catch Percentage

In [None]:
eda_viz.ctr_distribution(savefig = True)

In [None]:
rec_stats[pd.to_numeric(rec_stats['Ctch%'].str.replace('%', '', regex=False)) >= 90]

#### Yards

In [None]:
eda_viz.yards_distribution(savefig = True)

In [None]:
rec_stats[rec_stats['Yds'] >= 1800]

#### Yards per Reception

In [None]:
eda_viz.ypr_distribution(savefig = True)

In [None]:
rec_stats[rec_stats['Y/R'] >= 22]

In [None]:
rec_stats[rec_stats['Y/R'] < 4]

#### Touchdowns

In [None]:
eda_viz.td_distribution(savefig = True)

#### First Downs

In [None]:
eda_viz.fd_distribution(savefig = True)

In [None]:
rec_stats[rec_stats['1D'] >= 90]

#### Long

In [None]:
eda_viz.long_distribution(savefig = True)

In [None]:
rec_stats[rec_stats['Lng'] < 10]

#### Yards per Target

In [None]:
eda_viz.ypt_distribution(savefig = True)

In [None]:
rec_stats[rec_stats['Y/Tgt'] >= 17]

In [None]:
rec_stats[rec_stats['Y/Tgt'] < 3]

#### Receptions per Game

In [None]:
eda_viz.rpg_distribution(savefig = True)

In [None]:
rec_stats[rec_stats['R/G'] >= 9]

#### Receiving Yards per Game

In [None]:
eda_viz.ypg_distribution(savefig = True)

In [None]:
rec_stats[rec_stats['Y/G'] >= 120]

#### Fumbles

In [None]:
eda_viz.fmb_distribution(savefig = True)

In [None]:
rec_stats[rec_stats['Fmb'] >= 6]

## Advanced Receiver Stats Data

In [None]:
adv_stats.head()

### Feature Types

In [None]:
adv_stats.dtypes

### Missingness

In [None]:
adv_stats.isnull().mean()

### Distribution of Categorical Features

#### Team

In [None]:
adv_stats['Team'].nunique()

In [None]:
adv_stats['Team'].value_counts()

### Distribution of Numeric Features

#### DYAR

#### YAR

#### DVOA

#### VOA

#### Effective Yards