## Will a Wide Receiver be good in the NFL?

I have many thoughts about how wide receivers can be projected coming out of college into the NFL.

As a Patriots fan, I have watched many receivers get drafted in the first or second round and then get cut only a few years later. I don't expect any pity given the Super Bowls I remember so fondly, but our last great drafted receiver was Julian Edelman, a quarterback from Kent State taken in the 7th round, and our more recent pretty good receiver was Jakobi Meyers, a converted quarterback from NC State. Why did Bill have an easier time finding QBs who convert to WR than people who played WR their whole lives and are 99th percentile athletes? Blows my mind.

We look back at the N'Keal Harry draft and wince: nearly every other first round pick was a hit AND some of the best receivers in the league were taken after him. DK Metcalf, AJ Brown, Deebo Samuel, Terry McLaurin... c'mon man.

Since I feel it can't be THAT hard, I will pool public data on these draft prospects and see whether that data and a relatively simple modeling strategy can predict whether a guy is going to be good before he's good.

As a data scientist, not a scout watching an athlete from the stands at the Senior Bowl, during his college season, or closely throughout his NFL career, I need proxies for what gives a WR a leg up in the draft process and what determines success in the NFL.

My first step is to pick a proxy for success in the NFL. Although it is not a perfect measurement, my metric will be APY (average per year) as a % of Cap at Signing from [Over The Cap](https://overthecap.com/). We still need to account for inflation in the WR market, but this metric controls for the changing salary cap while also representing how much a team is willing to invest in a player. We will look specifically at every player's second contract, which reflects how well the prospect performed on his rookie deal. Obviously some players are overpaid and others take pay cuts to play on a good team; we can figure that out later.
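
To make the metric concrete, here is a minimal sketch of the arithmetic behind it. The function name and the dollar figures are hypothetical, picked only because they are round numbers; they are not taken from Over The Cap.

```python
def apy_pct_of_cap(total_value, years, cap_at_signing):
    """Average per year (APY) of a contract as a fraction of the cap in the signing year."""
    apy = total_value / years
    return apy / cap_at_signing

# Hypothetical contract: 4 years, $80M total, signed when the cap was $200M
share = apy_pct_of_cap(80_000_000, 4, 200_000_000)
print(f"{share:.1%}")  # 10.0%
```

Two deals with the same APY signed under different caps get different shares, which is the reason for using this metric instead of raw dollars.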

My hypothesis is that elite receivers are usually 6'0 to 6'2 and around 195 to 210 lbs. The ideal receiver in my mind is 6'1, 200 lbs, with long arms (yes, I'm basically just describing Justin Jefferson). Smaller receivers will likely get pushed around on routes and can't stay on the field on all downs to block on running plays. Larger receivers will likely not move well enough to get separation or yards after catch against NFL defenders and will become contested catch merchants.

<img src="https://patriotswire.usatoday.com/wp-content/uploads/sites/71/2024/08/USATSI_23983546.jpg?w=1000&h=600&crop=1" alt="Jalynn Polk" width="800">


# Table of Contents

## [Sneak Peek at Results](#Lets-skip-to-the-good-part)

## [Section 1. Import and Clean the data](#Section-1.-Import-and-Clean-the-data)
- [Combine Data](#a.-Scrape-Combine-Data)
- [Prospect Grades](#b.-Scrape-WR-Prospect-Grades)
- [Second Contract Data](#c.-Scrape-Second-Contract-Data)
- [College Receiving Stats](#d.-Scrape-Receiving-Stats)
- [Join the data and make model variables](#e.-Join-the-data-and-make-model-variables)
        
## [Section 2. Multiple Linear Regression](#Section-2.-Build-The-Model)
- [Build the Regression Model](#Section-2.-Build-The-Model)
- [Predict APY as % of Cap for the Test Set](#Predict-APY-%-for-10-Receivers-that-got-large-contracts)
        

## [Section 3. Predict The Unseen Young Receivers](#Section-3.-Predict-Contracts-for-2021-through-2024-Draft-Classes)
- [Final Results!](#The-Top-5-Predicted-Receivers-from-2021-through-2024:)




# Lets skip to the good part

If you scroll to the bottom of the notebook you will see the top rankings for wide receivers based on their college data. Here are the top results...

# The Top 5 Predicted Receivers from 2021 through 2024:

### 1. Ja'Marr Chase - 6.4%
<img src="https://statico.profootballnetwork.com/wp-content/uploads/2024/11/07193200/jamarr-chase-net-worth-lifestyle-1920x1280.jpg" alt="Jamarr Chase" width="300">



### 2. Rome Odunze - 6.2%
<img src="https://static.clubs.nfl.com/image/upload/t_person_squared_mobile/f_auto/v1725550057/bears/t3lmtlfbcktn3vlxd4bz.jpg" alt="Rome Odunze" width="300">


### 3. Jaxon Smith-Njigba - 5.9%
<img src="https://images.seattletimes.com/wp-content/uploads/2023/09/09062023_1_141406.jpg?d=2040x1632" alt="JSN" width="300">

### 4. Malik Nabers - 5.7%
<img src="https://images2.minutemediacdn.com/image/upload/c_crop,w_1599,h_899,x_0,y_0/c_fill,w_912,ar_16:9,f_auto,q_auto,g_auto/images/voltaxMediaLibrary/mmsport/si/01j8s0dny0qd3rwn3sea.jpg" alt="Nabers" width="300">

### 5. Jameson Williams - 5.5%
<img src="https://static.clubs.nfl.com/image/upload/t_editorial_landscape_12_desktop/lions/llgjkaa0a7rd15qupozm" alt="Jamo" width="300">

In [None]:
#Import Tools
import pandas as pd
import numpy as np
import requests
import warnings
warnings.filterwarnings('ignore')
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import string
import re

In [None]:
# set the years
college_years_list = ['2010','2011','2012','2013','2014','2015','2016','2017','2018', '2019']
combine_years_list = ['2013','2014','2015','2016','2017','2018','2019', '2020']

## Section 1. Import and Clean the data

## a. Scrape Combine Data

In [None]:
def scrape_combine(years_list):
    class_list = []
    for year in years_list:
        url = f'https://www.pro-football-reference.com/draft/{year}-combine.htm#combine'
        html = requests.get(url).content
        df_list = pd.read_html(html)
        df = df_list[-1]

        #drop rows with headers
        df = df[df["Player"] != "Player"]

        # Fill missing Drafted values with an "Undrafted" placeholder for this class year
        df['Drafted (tm/rnd/yr)'] = df['Drafted (tm/rnd/yr)'].fillna(f'Undrafted / 8th / 251st pick / {year}')
        
        class_list.append(df)
        
    # Stack all draft classes into one frame (also works for a single year)
    df = pd.concat(class_list, axis=0, ignore_index=True)

    # College is not used downstream
    df = df.drop(columns=['College'])

    return df

In [None]:
# Scrape the combine results for the training draft classes
draft_classes = scrape_combine(combine_years_list)
draft_classes.head()
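
The `Drafted (tm/rnd/yr)` column packs team, round, pick, and year into one slash-separated string. Here is a small sketch of pulling those pieces apart, using the placeholder format the scraper assigns to undrafted players. `parse_drafted` is a hypothetical helper for illustration, not part of the pipeline.

```python
import re

def parse_drafted(text):
    """Split a 'team / round / pick / year' string into typed fields."""
    team, rnd, pick, year = [part.strip() for part in text.split('/')]
    return {
        'team': team,
        'round': int(re.search(r'\d+', rnd).group()),   # "8th" -> 8
        'pick': int(re.search(r'\d+', pick).group()),   # "251st pick" -> 251
        'year': int(year),
    }

print(parse_drafted('Undrafted / 8th / 251st pick / 2018'))
# {'team': 'Undrafted', 'round': 8, 'pick': 251, 'year': 2018}
```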

In [None]:
def clean_combine_stats(df):
    # Work on a copy so the scraped frame is left untouched
    df = df.copy()

    # Split height ("6-1") into feet and inches and convert to total inches.
    # Missing heights stay NaN so the mean imputation below can fill them
    # (a 0 height would make BMI divide by zero).
    ht_parts = df['Ht'].str.split('-')
    ft_ = pd.to_numeric(ht_parts.str[0], errors='coerce')
    in_ = pd.to_numeric(ht_parts.str[-1], errors='coerce')
    wt_ = pd.to_numeric(df['Wt'], errors='coerce')
    inches_ = (12*ft_) + in_

    # BMI = weight (kg) / height (m) squared
    m = inches_ * 0.0254
    kg = wt_ * 0.453592
    df['Ht'] = inches_
    df['BMI'] = kg / m**2
    
    # Split the "team / round / pick / year" string once
    drafted = df['Drafted (tm/rnd/yr)'].str.split('/')

    # Draft pick: keep only the number from e.g. " 5th pick "
    df['Draft Pick'] = drafted.str[2].str.extract(r'(\d+)', expand=False).astype(int)

    # Draft round: keep only the number from e.g. " 1st " so the model gets a numeric feature
    df['Draft Round'] = drafted.str[1].str.extract(r'(\d+)', expand=False).astype(int)

    # Draft year as an integer
    df['Draft Year'] = drafted.str[3].str.strip().astype(int)

    # drop drafted column
    df = df.drop(columns=['Drafted (tm/rnd/yr)'])

    # isolate just wide receivers for now (copy so later assignments are safe)
    df = df[df['Pos'] == 'WR'].copy()

    # mean impute combine stats
    columns = ['Ht', 'Wt', '40yd', 'Vertical', 'Bench', 'Broad Jump', '3Cone', 'Shuttle','BMI']
    for column in columns:
        df[column] = pd.to_numeric(df[column])
        df[column] = df[column].fillna(df[column].mean())
        df[column] = round(df[column],2)
        
    # Create Player_ID
    df['Player_ID'] = df['Player'].apply(lambda x: x.translate(str.maketrans('', '', string.punctuation)).replace(' ', '').lower())

    
    return df

In [None]:
combine_stats = clean_combine_stats(draft_classes)
combine_stats.head()
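
As a sanity check on the height and BMI conversion, here is a worked example for the 6'1, 200 lb receiver from my hypothesis. The two helper functions are hypothetical stand-ins for the logic inside `clean_combine_stats`, using the same conversion constants.

```python
def height_to_inches(ht):
    """Convert a 'feet-inches' string such as '6-1' to total inches."""
    feet, inches = ht.split('-')
    return 12 * int(feet) + int(inches)

def bmi(height_in, weight_lb):
    """Body mass index from inches and pounds."""
    meters = height_in * 0.0254
    kg = weight_lb * 0.453592
    return kg / meters ** 2

inches = height_to_inches('6-1')
print(inches, round(bmi(inches, 200), 2))  # 73 26.39
```

That puts the prototype receiver inside the 25 to 31 BMI window used for the `ideal_bmi` flag later on.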

## b. Scrape WR Prospect Grades

In [None]:
# I manually copied the draft grades from https://www.nfl.com/draft/tracker/prospects/wr/all-colleges/all-statuses/2024?page=3&sortField=gradeSort&sortIsAscending=false
# They are in the google sheet below but I downloaded locally for ease
# google sheet link: https://docs.google.com/spreadsheets/d/1-v-_LSCyjjCrK9H_XTodxq5SJI76wRWUb948AMG_2lM/edit?gid=50233396#gid=50233396
def get_draft_grades(years_list):
    class_list = []
    for year in years_list:
        df = pd.read_excel('WR Draft Grades.xlsx', sheet_name= year)
        #drop rows with headers
        df = df[df["Player"] != "Player"]
        #isolate columns wanted
        df = df[['Player', 'Grade']]
        df['Year'] = year
        # fill in no grade with 0
        df['Grade'] = df['Grade'].fillna(0)
        # drop rows with null
        df = df.dropna()
        # # Isolate just their name
        df['Player']= df['Player'].apply(lambda x: x.split('\n')[0])
        # Create Player_ID
        df['Player_ID'] = df['Player'].apply(lambda x: x.translate(str.maketrans('', '', string.punctuation)).replace(' ', '').lower())


        
        class_list.append(df)
        
    if len(years_list)>1:   
        df = pd.concat(class_list, axis = 0)
        
        
    return df

In [None]:
draft_grades = get_draft_grades(combine_years_list)
draft_grades.head()
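
Every source gets joined on `Player_ID`, so it is worth seeing what the normalization does and where it can still miss. A minimal sketch, with `make_player_id` as a hypothetical wrapper around the same one-liner used in each scraper:

```python
import string

def make_player_id(name):
    """Lowercase a name and strip ASCII punctuation and spaces."""
    no_punct = name.translate(str.maketrans('', '', string.punctuation))
    return no_punct.replace(' ', '').lower()

print(make_player_id('A.J. Brown'))     # ajbrown
print(make_player_id("Ja'Marr Chase"))  # jamarrchase

# A suffix that appears in only one source still breaks the match
print(make_player_id('Marvin Harrison Jr.') == make_player_id('Marvin Harrison'))  # False
```

Two different players with the same name would also collapse to one ID, so the joins are only as good as the names are unique.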

## c. Scrape Second Contract Data

In [None]:
def scrape_second_contracts():
    # Scrape the contract history table for wide receivers
    url = 'https://overthecap.com/contract-history/wide-receiver'
    html = requests.get(url).content
    df_list = pd.read_html(html)
    df = df_list[-1]
    
    # take most important columns
    df = df[['Player','Year Signed','APY as % Of Cap At Signing']]
    
    # Sort the DataFrame by Player and Year Signed
    df = df.sort_values(by=['Player', 'Year Signed'])

    # Rank each player's contracts by signing year (1 = first contract)
    df['rank'] = df.groupby('Player')['Year Signed'].rank(method='first')

    # Keep only each player's second contract (rank == 2)
    df = df[df['rank'] == 2]

    # Drop the helper 'rank' column (optional)
    df = df.drop(columns=['rank'])

    # Reset the index
    df = df.reset_index(drop=True)

    # convert APY as % of Cap At Signing from a percent string to a decimal
    df['APY as % Of Cap At Signing'] = df['APY as % Of Cap At Signing'].replace('%', '', regex=True).astype(float) / 100
    
    # create Player_ID
    df['Player_ID'] = df['Player'].apply(lambda x: x.translate(str.maketrans('', '', string.punctuation)).replace(' ', '').lower())
    
    return df

In [None]:
second_contracts = scrape_second_contracts()
second_contracts.head()
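
Here is the "second contract" selection on a toy table so the logic is easy to check by eye. The players and numbers are made up for illustration. This sketch uses `cumcount` instead of `rank(method='first')`; after sorting by signing year the two should number the contracts the same way.

```python
import pandas as pd

# Hypothetical contract history: names and numbers are made up
contracts = pd.DataFrame({
    'Player': ['Receiver A', 'Receiver A', 'Receiver A', 'Receiver B'],
    'Year Signed': [2019, 2022, 2024, 2020],
    'APY as % Of Cap At Signing': [0.005, 0.080, 0.110, 0.004],
})

contracts = contracts.sort_values(['Player', 'Year Signed'])

# cumcount numbers each player's contracts 0, 1, 2, ... in signing order
contracts['contract_number'] = contracts.groupby('Player').cumcount() + 1

# Keep only the second contract; players with a single contract drop out
second = contracts[contracts['contract_number'] == 2].reset_index(drop=True)
print(second)
```

Receiver B never signed a second deal, so he disappears from the training target entirely. That is worth remembering when reading the model results: the training set only contains receivers who earned a second contract.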

In [None]:
# Calculate the mean and median of the column
mean_value = second_contracts['APY as % Of Cap At Signing'].mean()
median_value = second_contracts['APY as % Of Cap At Signing'].median()

# Set the size of the plot
plt.figure(figsize=(8, 6))

# Create a histogram with Seaborn
sns.histplot(second_contracts['APY as % Of Cap At Signing'], kde=True, bins=10)

# Add a vertical line for the mean value (shown as a percent in the legend)
plt.axvline(mean_value, color='red', linestyle='--', label=f'Mean: {mean_value:.1%}')

# Add a vertical line for the median value (shown as a percent in the legend)
plt.axvline(median_value, color='blue', linestyle='--', label=f'Median: {median_value:.1%}')

# Add title and labels
plt.title('Distribution of Wide Receiver Second Contracts', fontsize=16)
plt.xlabel('APY as % of Cap at Signing', fontsize=14)
plt.ylabel('Frequency', fontsize=14)

# Display the legend
plt.legend()

# Show the plot
plt.show()

## d. Scrape Receiving Stats

Get Receiving Stats from best season

In [None]:
# Scrape college receiving stats and keep each player's best season
def scrape_receiving_stats(years_list):
    class_list = []
    for year in years_list:
        url = f'https://www.sports-reference.com/cfb/years/{year}-receiving.html'
        html = requests.get(url).content
        df_list = pd.read_html(html)
        df = df_list[-1]
        
        class_list.append(df)
        
    if len(years_list)>1:   
        df = pd.concat(class_list, axis = 0)
        
    # The scraped table has a two-level header; keep the second level
    df.columns = df.columns.get_level_values(1)

    # Keep the Awards column, drop the last six columns, then add Awards back
    awards = df['Awards']
    df = df.iloc[:, :-6].copy()
    df['Awards'] = awards

    # 'Yds' and 'TD' each appear twice; the first is treated as receiving
    # and the second as rushing, so rename them by position
    cols = list(df.columns)
    yds_columns = [i for i, col in enumerate(cols) if col == 'Yds']
    cols[yds_columns[0]] = 'rec_yds'
    cols[yds_columns[1]] = 'rush_yds'
    td_columns = [i for i, col in enumerate(cols) if col == 'TD']
    cols[td_columns[0]] = 'rec_TD'
    cols[td_columns[1]] = 'rush_TD'
    df.columns = cols

    # Create the 'award_count' column
    df['award_count'] = df['Awards'].apply(lambda x: 0 if pd.isna(x) else len(x.split(',')))

    # Remove asterisks from the 'Player' column
    df['Player'] = df['Player'].str.replace('*', '', regex=False)
    
    # Create the 'years_played' column
    df['years_played'] = df.groupby('Player')['Player'].transform('count')

    # Sort the dataframe by 'Player' and 'rec_yds' to get the row with the highest 'rec_yds' for each player
    df_sorted = df.sort_values(by=['Player', 'rec_yds'], ascending=[True, False])

    # Drop duplicates based on the 'Player' column, keeping the row with the highest 'rec_yds'
    df = df_sorted.drop_duplicates(subset='Player', keep='first')
    
    
    # fill Y/A with 0
    df['Y/A'] = df['Y/A'].fillna(0)
    
    
    # create player_id
    df['Player_ID'] = df['Player'].apply(lambda x: x.translate(str.maketrans('', '', string.punctuation)).replace(' ', '').lower())

        
    return df

In [None]:
college_stats = scrape_receiving_stats(college_years_list)
college_stats.sort_values(by = 'Player').head()
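
The "best season" step is easier to follow on a toy table. The stat lines below are made up for illustration, and this sketch uses `idxmax` as an alternative to the sort-then-`drop_duplicates` approach in `scrape_receiving_stats`.

```python
import pandas as pd

# Hypothetical season lines, made up for illustration
seasons = pd.DataFrame({
    'Player': ['Receiver A', 'Receiver A', 'Receiver B'],
    'rec_yds': [650, 1200, 900],
    'Rec': [45, 80, 60],
})

# Count seasons per player before collapsing to one row each
seasons['years_played'] = seasons.groupby('Player')['Player'].transform('count')

# idxmax returns the row label of each player's highest-yardage season
best = seasons.loc[seasons.groupby('Player')['rec_yds'].idxmax()].reset_index(drop=True)
print(best)
```

Note that `years_played` here really means "seasons that show up in the scraped tables", so it depends on which years are in `college_years_list`.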

## e. Join the data and make model variables

In [None]:
len(combine_stats)

In [None]:
combine_stats[combine_stats['Player_ID']== 'ajbrown']

In [None]:
len(college_stats)

In [None]:
college_stats[college_stats['Player_ID']== 'ajbrown']

In [None]:
# Perform the joins
df_merged = combine_stats.merge(college_stats, on=['Player_ID'], how='inner',suffixes=('', '_drop'))

In [None]:
len(df_merged)

In [None]:
df_merged[df_merged['Player_ID']== 'ajbrown']

In [None]:
# Perform the joins
df_merged = df_merged.merge(draft_grades, on='Player_ID', how='inner',suffixes=('', '_drop'))

In [None]:
len(df_merged)

In [None]:
# Drop exact duplicate rows first, then check the AJ Brown row
df_merged = df_merged.drop_duplicates()
df_merged[df_merged['Player_ID']== 'ajbrown']

In [None]:
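# Inner join: only receivers with a recorded second contract are kept
df_merged = df_merged.merge(second_contracts, on='Player_ID', how='inner', suffixes=('', '_drop'))
# Safeguard in case any contract fields are still missing after the join
df_merged[['Year Signed','APY as % Of Cap At Signing']] = df_merged[['Year Signed','APY as % Of Cap At Signing']].fillna(0)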

In [None]:
len(df_merged)

In [None]:
df_merged[df_merged['Player_ID']== 'ajbrown']
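
Each inner join quietly drops anyone whose `Player_ID` is missing from one side, which is why I keep checking `len(df_merged)`. A sketch of a more direct check, using pandas' `indicator=True` option on made-up IDs:

```python
import pandas as pd

# Hypothetical IDs, made up for illustration
combine = pd.DataFrame({'Player_ID': ['receivera', 'receiverb', 'receiverc']})
college = pd.DataFrame({'Player_ID': ['receivera', 'receiverc', 'receiverd']})

# how='outer' with indicator=True labels each row by which side it came from
check = combine.merge(college, on='Player_ID', how='outer', indicator=True)
print(check['_merge'].value_counts())

# Rows that an inner join would silently drop from the combine side
unmatched = check.loc[check['_merge'] == 'left_only', 'Player_ID'].tolist()
print(unmatched)  # ['receiverb']
```

Running something like this against the real frames would list the receivers lost to name mismatches instead of just showing a smaller row count.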

In [None]:
def create_new_vars(df_merged):
    
    # Create the 'ideal_ht' column: 1 if Ht > 71 and < 75
    df_merged['ideal_ht'] = np.where((df_merged['Ht'] > 71) & (df_merged['Ht'] < 75), 1, 0)

    # Create the 'ideal_wt' column: 1 if Wt > 185 and < 216
    df_merged['ideal_wt'] = np.where((df_merged['Wt'] > 185) & (df_merged['Wt'] < 216), 1, 0)

    # Create the 'ideal_bmi' column: 1 if BMI > 25 and < 31
    df_merged['ideal_bmi'] = np.where((df_merged['BMI'] > 25) & (df_merged['BMI'] < 31), 1, 0)

    # Create the 'ideal_40' column: 1 if 40yd > 4.29 and < 4.6
    df_merged['ideal_40'] = np.where((df_merged['40yd'] > 4.29) & (df_merged['40yd'] < 4.6), 1, 0)

    # Create the 'ideal_shuttle' column: 1 if Shuttle < 4.55
    df_merged['ideal_shuttle'] = np.where((df_merged['Shuttle'] < 4.55), 1, 0)

    # Create 'ideal_all' if all ideal categories are met
    df_merged['ideal_all'] = ((df_merged['ideal_ht'] == 1) & 
                       (df_merged['ideal_bmi'] == 1) & 
                       (df_merged['ideal_40'] == 1) & 
                       (df_merged['ideal_shuttle'] == 1)).astype(int)

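    #Create 'elite_prod' column: 1 if any production threshold is met
    # (each comparison needs its own parentheses because | binds tighter than >=)
    df_merged['elite_prod'] = np.where((df_merged['rec_yds'] >= 1100) | (df_merged['rec_TD'] >= 10) | (df_merged['Rec'] >= 80), 1, 0)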

    #Create 'td/rec': touchdowns per reception
    df_merged['td/rec'] = np.where(df_merged['Rec'] > 0, df_merged['rec_TD'] / df_merged['Rec'], 0)

    #Create 'rec/g'
    df_merged['rec/g'] =np.where(df_merged['Rec'] > 0, df_merged['Rec'] / df_merged['G'], 0)

    #bmi/shuttle
    df_merged['bmi/shuttle'] = df_merged['BMI'] / df_merged['Shuttle']

    #bmi/bench
    df_merged['bmi/bench'] = df_merged['BMI'] / df_merged['Bench']

    #bmi/40
    df_merged['bmi/40'] = df_merged['BMI'] / df_merged['40yd']

    #bmi/vertical
    df_merged['bmi/vertical'] = df_merged['BMI'] / df_merged['Vertical']

    # Create the 'H_win' column if won the heisman
    # If the number is 1, set 'H_win' to 1
    df_merged['Awards'] = df_merged['Awards'].fillna('None')
    df_merged['H_win'] = df_merged['Awards'].apply(lambda x: 1 if re.search(r'H-(\d+)', x) and int(re.search(r'H-(\d+)', x).group(1)) == 1 else 0)

    # Create dummy variables for the 'School' column
#     df_schools = pd.get_dummies(df_merged['Conf'], prefix='Conf')

    # Combine the original dataframe with the new dummy columns
#     df_merged = pd.concat([df_merged, df_schools], axis=1)
    
    
    return df_merged

In [None]:
df_merged = create_new_vars(df_merged)
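
One thing to watch in flags like `elite_prod`: in Python, `|` binds more tightly than `>=`, so each comparison has to sit in its own parentheses before the conditions are combined. A small sketch on made-up stat lines:

```python
import pandas as pd

# Hypothetical stat lines, made up for illustration
df = pd.DataFrame({
    'rec_yds': [1300, 700, 600, 500],
    'rec_TD': [8, 12, 3, 2],
    'Rec': [70, 50, 85, 40],
})

# Without the parentheses, `1100 | (...)` would be evaluated first
elite = (df['rec_yds'] >= 1100) | (df['rec_TD'] >= 10) | (df['Rec'] >= 80)
print(elite.astype(int).tolist())  # [1, 1, 1, 0]
```

The first three rows each clear exactly one threshold (yards, touchdowns, receptions) and the last clears none.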

## Section 2. Build the Model

In [None]:
interesting_vars = ['Player','Player_ID','Ht','Wt','40yd', 'Vertical',
       'Broad Jump', '3Cone', 'Shuttle', 'BMI', 'Draft Pick', 'Grade',
       'Draft Year', 'G', 'Rec', 'rec_yds', 'Y/R',
       'rec_TD', 'Y/G']


df_merged2 = df_merged[interesting_vars].copy()

In [None]:
# Define the independent variables (X) and the dependent variable (Y)
ind_vars = ['Vertical','Ht','Broad Jump', '3Cone', 'Shuttle', '40yd','BMI',
            'Draft Round', 'Draft Year', 'Grade',
            'rec_yds','Rec','Y/G','years_played', 'rec_TD', 
            'bmi/40', 'bmi/shuttle',
#            'award_count','bmi/vertical',
            
            #dummy vars
            'H_win',
            'elite_prod',
            'ideal_ht','ideal_bmi', 'ideal_40','ideal_shuttle']

X = df_merged[ind_vars] # Independent variables
y = df_merged['APY as % Of Cap At Signing']  # Dependent variable

# Split data into training and test sets (optional, for validation)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=42)

# Create a linear regression model
model = LinearRegression()

# Fit the model on the training data
model.fit(X_train, y_train)

# Get the model coefficients (weights)
print("Intercept:", model.intercept_)
print("Coefficients:", model.coef_)

# Make predictions
y_pred = model.predict(X_test)

# Print predictions
print("Predictions:", y_pred)

In [None]:
# Calculate accuracy metrics

# R-squared
r2 = r2_score(y_test, y_pred)

# Mean Absolute Error (MAE)
mae = mean_absolute_error(y_test, y_pred)

# Mean Squared Error (MSE)
mse = mean_squared_error(y_test, y_pred)

# Root Mean Squared Error (RMSE): square root of the MSE
# (the squared=False argument has been removed from recent scikit-learn releases)
rmse = np.sqrt(mse)

# Print the metrics
print("R-squared:", r2)
print("Mean Absolute Error (MAE):", mae)
print("Mean Squared Error (MSE):", mse)
print("Root Mean Squared Error (RMSE):", rmse)
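
With `test_size=0.1` the test set is small, so these metrics can swing a lot depending on which receivers land in it. One way to get a steadier read is k-fold cross-validation. This is a sketch on synthetic data standing in for the receiver features, not a result from the real data set.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic data standing in for the receiver features and target
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(120, 4))
y_demo = X_demo @ np.array([0.02, -0.01, 0.0, 0.005]) + rng.normal(scale=0.01, size=120)

# Five shuffled folds; scikit-learn reports MAE as a negative score, so flip the sign
cv = KFold(n_splits=5, shuffle=True, random_state=42)
mae_scores = -cross_val_score(LinearRegression(), X_demo, y_demo,
                              cv=cv, scoring='neg_mean_absolute_error')

print("MAE per fold:", np.round(mae_scores, 4))
print("Mean MAE:", round(float(mae_scores.mean()), 4))
```

Swapping `X_demo` and `y_demo` for `X` and `y` would give one MAE per fold on the actual receivers.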

In [None]:
# Rank features by coefficient size. The features are not standardized,
# so these magnitudes also reflect each feature's units.
coefficients = pd.DataFrame(model.coef_, X.columns, columns=['Coefficient'])

# Sort the coefficients by the absolute value to find the most important features
coefficients['Abs_Coefficient'] = coefficients['Coefficient'].abs()
coefficients_sorted = coefficients.sort_values(by='Abs_Coefficient', ascending=False)

# Display the top 5 most important features
top_5_features = coefficients_sorted.head(5)
print("\nTop 5 Most Important Features:")
print(top_5_features)
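
Raw coefficients are hard to compare when features live on different scales (a 0/1 flag versus receiving yards in the hundreds). A sketch of the effect on synthetic data, with placeholder feature names, comparing raw coefficients to ones fit on standardized features:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic features on very different scales (names are placeholders)
rng = np.random.default_rng(0)
demo = pd.DataFrame({'small_scale': rng.normal(0, 1, 500),
                     'large_scale': rng.normal(0, 100, 500)})
target = 0.5 * demo['small_scale'] + 0.02 * demo['large_scale'] + rng.normal(0, 0.1, 500)

# Fit once on the raw features and once on z-scored features
raw = LinearRegression().fit(demo, target)
scaled = LinearRegression().fit(StandardScaler().fit_transform(demo), target)

comparison = pd.DataFrame({'raw': raw.coef_, 'standardized': scaled.coef_},
                          index=demo.columns)
print(comparison.round(3))
```

In this toy setup the raw coefficient ranks `small_scale` first, while the standardized one ranks `large_scale` first, so the "top 5" list above could change if the features were scaled before fitting.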

## Predict APY % for 10 Receivers that got large contracts

In [None]:
# Predict for every receiver in the merged data (training and test rows alike)
preds = model.predict(df_merged[X.columns])

In [None]:
df_merged2['Predicted APY as % of Cap'] = preds

df_merged2['APY as % Of Cap At Signing'] = y

In [None]:
# Sort by actual
df_merged2.sort_values(by = 'APY as % Of Cap At Signing', ascending = False).reset_index(drop = True).head(10)

In [None]:
# Sort by predicted
df_merged2.sort_values(by = 'Predicted APY as % of Cap', ascending = False).reset_index(drop = True).head(10)

## Section 3. Predict Contracts for 2021 through 2024 Draft Classes

In [None]:
#years_list = ['2021', '2022', '2023', '2024']
college_years_list = ['2017','2018','2019','2020','2021','2022', '2023']
combine_years_list = ['2021', '2022','2023','2024']


# Scrape and clean combine data and grades for the new draft classes
draft_classes = scrape_combine(combine_years_list)
combine_stats = clean_combine_stats(draft_classes)
draft_grades = get_draft_grades(combine_years_list)
combine_stats.sort_values(by = 'Player_ID').head()

In [None]:
combine_stats[combine_stats['Player_ID']== 'jamarrchase']

In [None]:
college_stats = scrape_receiving_stats(college_years_list)
college_stats.sort_values(by = 'Player_ID').head()

In [None]:
college_stats[college_stats['Player_ID']== 'jamarrchase']

In [None]:
# Perform the joins
new_wrs = combine_stats.merge(college_stats, on='Player_ID', how='inner', suffixes=('', '_drop')).merge(draft_grades, on = 'Player_ID', how = 'inner', suffixes=('', '_drop'))
new_wrs.head()

In [None]:
# create model vars
X = create_new_vars(new_wrs)

# create predictions, append on 
X = X[ind_vars]

preds = model.predict(X)

# Create prediction dataframe
new_wrs = new_wrs[interesting_vars]

new_wrs['Predicted APY as % of Cap'] = preds

In [None]:
new_wrs[new_wrs['Player_ID']== 'jamarrchase']

In [None]:
# Print receivers to be predicted as top paid
new_wrs.sort_values(by = 'Predicted APY as % of Cap', ascending = False).reset_index(drop = True).head(25)

## Next Step: Try to account for the fact that receivers are overpaid

Additional data that could help: X/Y/Z/F alignment %, arm length, yards per route run, age, and maybe sentiment analysis of scouting reports.