## Part A: Data Collection 

RB Player Data is getting web scraped from ProFootballReference site between the years 2014-2024


### Websites Used For Support:<br>
- [BrowserStack - Download File using Selenium](https://www.browserstack.com/guide/download-file-using-selenium-python)
- [GeeksForGeeks - Scrape and Save Table using Selenium](https://www.geeksforgeeks.org/scrape-and-save-table-data-in-csv-file-using-selenium-in-python/#)
- [RealPython - Modern Web Automation with Selenium](https://realpython.com/modern-web-automation-with-python-and-selenium/#locate-elements-in-the-dom) 
- [StackOverflow - Wait for file to be downloaded in Selenium](https://stackoverflow.com/questions/63637077/how-to-wait-for-a-file-to-be-downloaded-in-selenium-and-python-before-moving-for)

Semi-Automated Data Extraction:

Selenium opens Google Chrome,<br>
User downloads the Excel file,<br>
File renamed according to offset in its URL

In [None]:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
import undetected_chromedriver as uc
import os, time

#File download Paths
selenium_profile_path = r"C:\SeleniumProfiles\StatheadSession"
download_dir = os.path.join(os.getcwd(), "selenium_downloads")
os.makedirs(download_dir, exist_ok=True)

# Chrome Options Setup
options = uc.ChromeOptions()
options.user_data_dir = selenium_profile_path
options.add_argument("--no-sandbox")
options.add_argument("--disable-dev-shm-usage")
options.add_argument("--disable-gpu")
prefs = {"download.default_directory": download_dir,
         "download.prompt_for_download": False,
         "directory_upgrade": True,
         "safebrowsing.enabled": True}
options.add_experimental_option("prefs", prefs)

driver = uc.Chrome(options=options, user_data_dir=selenium_profile_path, headless=False)

# Detect if download is finished
def download_complete():
    print("Checking if download is complete")
    return not any(f.endswith(".crdownload") for f in os.listdir(download_dir))


max_rows = 71000 #Estimated finished based on website query
base_url = f"https://stathead.com/football/player-game-finder.cgi?request=1&timeframe=seasons&match=player_game&qb_start_num_career_max=400&season_end=-1&rookie=N&team_game_num_season_min=1&weight_max=500&comp_type=reg&qb_start_num_career_min=1&player_game_num_career_min=1&draft_pick_type=overall&player_game_num_career_max=400&year_min=2014&year_max=2024&season_start=1&season_positions[]=rb&player_game_num_season_min=1&week_num_season_max=22&team_game_num_season_max=17&week_num_season_min=1&player_game_num_season_max=18&order_by=fantasy_points&cstat[1]=rush_att&ccomp[1]=gt&cval[1]=1"
print(f"File will populate here: {download_dir}")
files_preDownload = set(os.listdir(download_dir))
print(f"Content before download: {files_preDownload}")
for offset in range (0, max_rows, 200):
    try:
        url = base_url + f"&offset={offset}"
        driver.get(url)
        print(f"🟢 Opened URL: {url}")
        time.sleep(10)

        #Time Delay to allow user to click 'download'
        while not download_complete():
            time.sleep(5)

        # Rename file
        files_postDownload = set(os.listdir(download_dir))
        print("Prepping for post download workflow")
        new_file = (files_postDownload - files_preDownload)
        new_xlsx_files = {f for f in new_file if f.endswith(".xls") or f.endswith(".xlsx")}
        if len(new_xlsx_files) == 1:
            original_name = new_xlsx_files.pop()
            new_name = f"Weekly-NFL-RB_stats({offset}).xlsx"
            os.rename(
                os.path.join(download_dir, original_name),
                os.path.join(download_dir, new_name)
            )
            print(f"Renamed: {original_name} → {new_name}")
        elif len(new_xlsx_files) > 1:
            print(f"Multiple new files detected: {new_xlsx_files}. Skipping rename")
        else:
            print("No new file detected")
        files_preDownload = set(os.listdir(download_dir))
    except Exception as e:
        print(f"An error occured: {e}")

driver.quit()
print("Complete, closing Chrome")


### Dataframe Creation  <br>

html_to_df(): Converting all 'Excel' files that turned out to be in HTML format into a single combined pandas dataframe for simpler data manipulation <br>

### Data Cleaning <br>

Cleaned up first two row layout of dataframe since scrapped HTML data put headers into two rows <br>
Renamed columns to give better clarity <br>
Deconstructed Results Column that Outputted Value like "W 24-10" to isolate the scores and calculate the score differential to determine team's win or loss <br>
Split Date Column into 'Year' and 'Month-Day' <br>
Dropped Unecessary Columns in Dataframe <br>
Simplified Age Column Through Only Keeping Year and Dropping Days <br>
Handled Null Values and Validated Data Types <br>
Utilized Pandas Get Dummies Encoding to Transform Home_Away_Determinant into Binary Numerical Values



Print statements to evaluate the data better <br><br>

In [1]:
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.widgets import Cursor
import mplcursors

def html_to_df(path):
    weekly_rb_stats = []
    htmlFiles = [os.path.join(path, f) for f in os.listdir(path) if f.endswith(".xlsx")]
    for file in htmlFiles:
        try:
            df = pd.read_html(file,header=[0,1])[0]
            weekly_rb_stats.append(df)
        except ValueError:
            print(f"No valid tables found in {file}")
        except Exception as e:
            print(f"Error processing file: {e}")
    weekly_rb_stats = pd.concat(weekly_rb_stats)
    return weekly_rb_stats
    

download_dir = os.path.join(os.getcwd(), "selenium_downloads")
weekly_stats_df = html_to_df(download_dir)
## Print statements inspecting dataframe
#print(f"Preview of Weekly RB Stats: {weekly_stats_df.head()}")
#print(f"\n Statistical Summary of Weekly RB Stats: {weekly_stats_df.describe()}")
#print(f"\n Size of Weekly RB Stats Dataframe: {weekly_stats_df.shape}")


#Cleaning up Column Names
newColumns = []
for column in weekly_stats_df.columns:
    if isinstance(column,tuple) and column[0].startswith('Unnamed'):
        newColumns.append(column[1])
    elif isinstance(column, tuple):
        newColumns.append(column[0] + '_' + column[1])
    else:
        newColumns.append(column)
weekly_stats_df.columns = newColumns

#Renaming Columns for better transparency
weekly_stats_df.rename(columns={'Unnamed: 10_level_1':'Home_Away_Determinant','FantPt':'Fantasy_Pts', 
                                'Att':'Rushing_Att','Rushing_Y/A':'Rushing_YPC',
                                'G#':'Game_Number','Rushing_1D':'Rushing_FirstDown'}, inplace=True)

#Breaking Down Results Column To Simplify Analysis
weekly_stats_df[['Victory_Status', 'Score']] = weekly_stats_df['Result'].str.split(' ', n=1, expand=True)
weekly_stats_df[['Team_Pts', 'Opp_Pts']] = weekly_stats_df['Score'].str.split('-', n=1, expand=True)
weekly_stats_df['Team_Pts'] = weekly_stats_df['Team_Pts'].astype(int)
weekly_stats_df['Opp_Pts'] = weekly_stats_df['Opp_Pts'].str.extract(r'(\d+)').astype(int)
weekly_stats_df['Score_Diff'] = weekly_stats_df['Team_Pts'] - weekly_stats_df['Opp_Pts'] #Negative values indicate a loss

#Splitting the 'Date' column into 'Year' and 'Month-Day'
weekly_stats_df['Year'] = weekly_stats_df['Date'].str.split('-').str[0]
weekly_stats_df['Year'] = weekly_stats_df['Year'].astype(int)  
weekly_stats_df['Month_Day'] = weekly_stats_df['Date'].str.split('-').str[1] + '-' + weekly_stats_df['Date'].str.split('-').str[2]


#Evaluating Unique Values in All Columns in DataFrame
unique_values = weekly_stats_df.nunique()
#print(f"Unique values in each column:\n{unique_values}")

#Dropping Unecessary Columns
weekly_stats_df_clean = weekly_stats_df.drop(columns=['Pos.','Dayâ¼','Rk', 'Result', 'Score', 'Victory_Status', 'Date'])

#Keeping only first occurrence of Duplicated Column
weekly_stats_df_clean = weekly_stats_df_clean.loc[:, ~weekly_stats_df_clean.columns.duplicated()]
#Simplifying Age Column
weekly_stats_df_clean['Age'] = weekly_stats_df_clean['Age'].str.split('-').str[0] #Only including year for age
weekly_stats_df_clean['Age'] = weekly_stats_df_clean['Age'].astype(int) #Making age column as integer

nullColumns = weekly_stats_df_clean.isna().any()
#print(f"Null Columns in dataframe: {nullColumns}")
#Handling Null Values
weekly_stats_df_clean['Home_Away_Determinant'] = weekly_stats_df_clean['Home_Away_Determinant'].fillna('vs')
weekly_stats_df_clean['Rushing_FirstDown'] = weekly_stats_df_clean['Rushing_FirstDown'].fillna(0.0)
weekly_stats_df_clean['Day'] = weekly_stats_df_clean['Day'].fillna('Sun') #Assuming no date recorded is Sunday game for simplicity

#Validating Data Types
weekly_stats_df_clean['Week'] = weekly_stats_df_clean['Week'].astype(int)
weekly_stats_df_clean['Game_Number'] = weekly_stats_df_clean['Game_Number'].astype(int)

#Encoding of Home/Away Determinant
home_away_num = pd.get_dummies(weekly_stats_df_clean['Home_Away_Determinant']).astype(int)
weekly_stats_df_clean = pd.concat([weekly_stats_df_clean, home_away_num], axis=1)
weekly_stats_df_clean = weekly_stats_df_clean.drop(columns=['Home_Away_Determinant', '@'])
weekly_stats_df_clean.rename(columns= {'vs': 'Home_Determinant'}, inplace=True) # 1 if home game, 0 if away game


weekly_stats_df_clean.info()


<class 'pandas.core.frame.DataFrame'>
Index: 14200 entries, 0 to 199
Data columns (total 24 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Player             14200 non-null  object 
 1   Fantasy_Pts        14200 non-null  float64
 2   Rushing_Att        14200 non-null  int64  
 3   Day                14200 non-null  object 
 4   Game_Number        14200 non-null  int64  
 5   Week               14200 non-null  int64  
 6   Age                14200 non-null  int64  
 7   Team               14200 non-null  object 
 8   Opp                14200 non-null  object 
 9   Rushing_Yds        14200 non-null  int64  
 10  Rushing_YPC        14200 non-null  float64
 11  Rushing_TD         14200 non-null  int64  
 12  Rushing_FirstDown  14200 non-null  float64
 13  Rushing_Succ%      14200 non-null  float64
 14  Fantasy_FantPt     14200 non-null  float64
 15  Fantasy_PPR        14200 non-null  float64
 16  Fantasy_DKPt       14200 non-

### Stadium Mapping from 2014-2024 <br>

Note: Houston Texans switched from natural grass to artificial turf shortly before Week 2 of 2015 season.  For all intensive purposes, will assume all 2015 season was played on turf <br>

Created Dictionary to Map Field Types of All NFL Teams
    Accounted for cases where NFL team switched stadium or field types
    Also includes cases where game was played on international field

surfaceObtainer(): Uses Team, Year, Week to extract the surface type for that game

Implemented One Hot Encoding 



References:<br><br>
 - [ESPN NFL Stadium Surface Types](https://www.espn.com/nfl/story/_/id/38565107/nfl-stadium-surfaces-strategies-challenges-faqs) <br>
 - [Sports Illustrated Stadium Surface Types](https://www.si.com/nfl/2015/09/29/nfl-stadium-turf-grass-rankings#:~:text=Let's%20revisit%20the%20formula%20that,recovery%20plays%20well%20in%20Charlotte.)<br>
 - [Baltimore Ravens switch to grass](https://www.baltimoreravens.com/news/ravens-switching-to-natural-grass-at-m-t-bank-stadium-16430494) <br>
- [Houston Texans switch to grass](https://www.houstontexans.com/news/texans-to-play-on-artificial-turf-for-rest-of-2015-15899086#:~:text=For%20the%20remainder%20of%20the,September%2027%20against%20Tampa%20Bay.) <br>
- [OAK stadium](https://turfgrasssod.org/raiders-maintain-the-tradition-of-football-with-natural-grass-field/)<br>
- [TEN switch to turf](https://www.tennesseetitans.com/news/why-the-titans-are-switching-to-turf-at-nissan-stadium-starting-in-2023)<br>
- [Wembley(London) Stadium Surface Type](https://www.profootballnetwork.com/is-wembley-stadium-turf-or-grass/)<br>
- [Twickenham Stadium Surface Type](https://www.nflweather.com/stadium/twikenham-stadium#)<br>
- [Estadio Stadium Surface Type](https://www.nflweather.com/stadium/estadio-azteca)<br>
- [Tottenham Stadium Surface Type](https://www.nflweather.com/stadium/tottenham-hotspur-stadium) <br>
- [Allianz Arena Surface Type](https://www.nflweather.com/stadium/allianz-arena) <br>
- [Frankfurt Stadium Surface Type](https://www.nflweather.com/stadium/frankfurt-stadium) <br>
- [Corinthians Arena Surface Type](https://www.nflweather.com/stadium/corinthians-arena) <br>

In [23]:
stadium_surface_dict = {
    'ARI': 'grass',
    'ATL': 'turf', 
    'BAL': 'grass',  #add to exceptions - had artificial turf in 2016 (X)
    'BUF': 'grass',
    'CAR': 'grass', #add to exceptions - had turf from 2021 onwards (X)
    'CHI': 'grass',
    'CIN': 'turf',
    'CLE': 'grass',
    'DAL': 'turf',
    'DEN': 'grass',
    'DET': 'turf',
    'GNB': 'grass',
    'HOU': 'turf', #add to exceptions - had grass in 2014/2015 (X)
    'IND': 'turf',
    'JAX': 'grass', 
    'KAN': 'grass', 
    'LAC': 'turf', 
    'LAR': 'turf', 
    'LVR': 'grass', 
    'MIA': 'grass',
    'MIN': 'turf', 
    'NOR': 'turf',
    'NWE': 'turf',
    'NYG': 'turf', 
    'NYJ': 'turf',
    'OAK': 'grass',
    'PHI': 'grass',
    'PIT': 'grass',
    'SDG': 'grass',
    'SEA': 'turf',
    'SFO': 'grass', 
    'STL': 'turf',
    'TAM': 'grass',
    'TEN': 'grass', #add to exceptions - switched to turf in 2023 (X)
    'WAS': 'grass'
}

#Defining exceptions where teams had a different field type for a few seasons before transition
stadium_surface_exceptions_dict = {
    (2023, 'TEN'): 'turf',
    (2024, 'TEN'): 'turf', 
    (2014, 'HOU'): 'grass',
    (2021, 'CAR'): 'turf',
    (2022, 'CAR'): 'turf',
    (2023, 'CAR'): 'turf',
    (2024, 'CAR'): 'turf',
    (2014, 'BAL'): 'turf',
    (2015, 'BAL'): 'turf'
}

#Dictionary will list designated 'home' team as the last value for surface type key
#Setup will be (year, season week, 'home team'): 'field type'
int_games_dict = {
    (2014, 4, 'OAK'): 'turf',
    (2014, 8, 'ATL'): 'turf',
    (2014, 10, 'JAX'): 'turf',
    (2015, 4, 'MIA'): 'turf',
    (2015, 7, 'JAX'): 'turf',
    (2015, 8, 'KAN'): 'turf',
    (2016, 4, 'JAX'): 'turf',
    (2016, 7, 'LAR'): 'grass', #Twickenham stadium in London
    (2016, 8, 'CIN'): 'turf',
    (2016, 10, 'OAK'): 'grass', #Mexico stadium
    (2017, 3, 'JAX'): 'turf',
    (2017, 4, 'MIA'): 'turf',
    (2017, 7, 'LAR'): 'grass', #Twickenham stadium 
    (2017, 8, 'CLE'): 'grass', #Twickenham stadium 
    (2017, 10, 'OAK'): 'grass', #Mexico stadium
    (2018, 6, 'OAK'): 'turf',
    (2018, 7, 'LAC'): 'turf',
    (2018, 8, 'JAX'): 'turf',
    (2019, 5, 'OAK'): 'grass', #Tottenham stadium
    (2019, 6, 'TAM'): 'grass', #Tottenham stadium
    (2019, 8, 'LAR'): 'turf',
    (2019, 9, 'JAX'): 'turf',
    (2019, 11, 'LAC'): 'grass', #Mexico stadium
    (2021, 5, 'ATL'): 'grass', #Tottenham stadium
    (2021, 6, 'JAX'): 'grass', #Tottenham stadium
    (2022, 4, 'NOR'): 'grass', #Tottenham stadium
    (2022, 5, 'GNB'): 'grass', #Tottenham stadium
    (2022, 8, 'JAX'): 'turf',
    (2022, 10, 'TAM'): 'grass', #Allianz Arena
    (2022, 11, 'ARI'): 'grass', #Mexico stadium
    (2023, 4, 'JAX'): 'turf',
    (2023, 5, 'BUF'): 'grass', #Tottenham stadium
    (2023, 6, 'TEN'): 'grass', #Tottenham stadium
    (2023, 9, 'KAN'): 'grass', #Frankfurt stadium
    (2023, 10, 'NWE'): 'grass', #Frankfurt stadium
    (2024, 1, 'PHI'): 'grass', #Brazil stadium
    (2024, 5, 'MIN'): 'grass', #Tottenham stadium
    (2024, 6, 'CHI'): 'grass', #Tottenham stadium
    (2024, 7, 'JAX'): 'grass', #Tottenham stadium
    (2024, 10, 'CAR'): 'grass' #Allianz Arena   
}

def surfaceObtainer(team, year, week=None):
    if week and (year, week, team) in int_games_dict:
        return int_games_dict[(year, week, team)]
    return stadium_surface_exceptions_dict.get((year, team)) or stadium_surface_dict.get(team)

#Integrating the surfaceObtainer function into the DataFrame
weekly_stats_df_clean['Surface_Type'] = weekly_stats_df_clean.apply( 
    lambda row: surfaceObtainer(row['Team'], row['Year'], row['Week']), axis=1
)

#One-Hot Encoding the Surface Type
surfaceType_num = pd.get_dummies(weekly_stats_df_clean['Surface_Type']).astype(int)
weekly_stats_clean2 = pd.concat([weekly_stats_df_clean, surfaceType_num], axis=1)

weekly_stats_clean2 = weekly_stats_clean2.drop(columns=['Surface_Type', 'turf'])
weekly_stats_clean2.rename(columns={'grass': 'Grass_Determinant'}, inplace=True) #Indicates 1 if grass, 0 if turf

print(weekly_stats_clean2.info())


<class 'pandas.core.frame.DataFrame'>
Index: 14200 entries, 0 to 199
Data columns (total 25 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Player             14200 non-null  object 
 1   Fantasy_Pts        14200 non-null  float64
 2   Rushing_Att        14200 non-null  int64  
 3   Day                14200 non-null  object 
 4   Game_Number        14200 non-null  int64  
 5   Week               14200 non-null  int64  
 6   Age                14200 non-null  int64  
 7   Team               14200 non-null  object 
 8   Opp                14200 non-null  object 
 9   Rushing_Yds        14200 non-null  int64  
 10  Rushing_YPC        14200 non-null  float64
 11  Rushing_TD         14200 non-null  int64  
 12  Rushing_FirstDown  14200 non-null  float64
 13  Rushing_Succ%      14200 non-null  float64
 14  Fantasy_FantPt     14200 non-null  float64
 15  Fantasy_PPR        14200 non-null  float64
 16  Fantasy_DKPt       14200 non-

### Dataset Cleaning and Data Visualization

assign_year_season(): Method handled 'duplicate' rows where week, year and player are identical but month differs.  It attributes all January games played to the previous year to reflect that the season started in the prior year

Rolling Averages were created for both rush attempts and yards per carry to obtain a better baseline of running back's usage throughout the season and grouped by player and year<br>
    Rush Attempts: 3 week rolling average deployed to ensure stronger signal and show realistic wear trends <br>
    Yards Per Carry (YPC): 2 week rolling average established to avoid overfitting model and better react to changes in performance <br>
    Applied filter to ensure rush attempts and yards per carry are positive.  Negative values of these filters introduce unnecessary noise <br>

Plotly Express is implemented to visualize features of this dataset in both boxplots and histograms.  This library is implemented to help levearge its interactive features, allowing cursor to hover over different areas of the plot for specific datapoints.  The documentation is referenced for boxplots, where the trace feature is used to display the mean and standard deviation of the specified features in additon to standard outputs of min, max, median, IQR typically revealed for these plots <br>

Binning helps place potentially predictive features of age, rushing first downs score difference into buckets.
    This simplifies ability to recognize patterns and see trends.  The number of buckets for the different features were largely influenced by results from the boxplots and histograms made earlier <br>
        Age: The term 'running back cliff' gets referenced frequently by NFL analysts due to industry-perceived trend that NFL running backs slow down production once they reach the age of 28.  Based on the dataset, most weekly statistics were captured from players aging from 22-27.  Four bins were established for players to help separate this large overlap into more defined sub-groups in order to see whether model training reveals importance of specific age groups <br>
        Rushing First Downs: This feature can shed some light on game script narratives, where coaching staff may be more likely to give a specific player more carries after helping the offense make first downs on potentially game critical plays <br>  <br>
        Score Difference: Another feature that could have high feature importance once model testing arrives.  If the player's team is significantly behind or ahead, this could certainly influence decisions on which running backs get more playing time compared to when the game comes down to the final seconds.  Five buckets are made to classify whether game was blowout (win or loss), moderate (win or loss) and when the game falls within a single touchdown (7pts).  An important distinction for this feature is including one-hot encoding following binning.  The justification is the bins having nominal label assignments where the number assignments do not have significance in linear ordering.  Linear based models like logistics regression assumes numerical significance as numbers increase and this is more difficult to place with scoring difference
    


References:<br>
 - [Binning](https://www.geeksforgeeks.org/machine-learning/what-is-feature-engineering/) <br>
 - [Plotly Express - General](https://plotly.com/python/)<br>
 - [Plotly Graphing - Boxplots](https://plotly.com/python/box-plots)
 

In [None]:
#Handling cases where'duplicate' of week/year/player exist but month differs
def assign_year_season(row):
    month = int(row['Month_Day'].split('-')[0])
    year = row['Year']
    return year - 1 if month == 1 else year
weekly_stats_clean2['Year'] = weekly_stats_clean2.apply(assign_year_season, axis=1)

#Creating Rolling Averages for Rushing Attempts, Yards Per Carry and Rushing Yards
weekly_stats_clean2 = weekly_stats_clean2.sort_values(['Year', 'Week', 'Player'])
weekly_stats_clean2['Rolling_Avg_Rush_Att'] = weekly_stats_clean2.groupby(['Player', 'Year'])['Rushing_Att'].transform(lambda x: x.shift(1).rolling(window=3, min_periods=1).mean())
weekly_stats_clean2['Rolling_Avg_YPC'] = weekly_stats_clean2.groupby(['Player', 'Year'])['Rushing_YPC'].transform(lambda x: x.shift(1).rolling(window=3, min_periods=1).mean())
weekly_stats_clean2['Rolling_Avg_RushYds'] = weekly_stats_clean2.groupby(['Player', 'Year'])['Rushing_Yds'].transform(lambda x: x.shift(1).rolling(window=3, min_periods=1).mean())
#Filtering out null rows, where rush attempts or yards per carry may be null or negative
weekly_stats_clean2 = weekly_stats_clean2[(weekly_stats_clean2['Rolling_Avg_Rush_Att'] > 0) &
                                          (weekly_stats_clean2['Rolling_Avg_YPC'] > 0) &
                                          (weekly_stats_clean2['Rushing_Yds'] > 0)]

#Verifying no data leakage with rolling average
#print(weekly_stats_clean2[['Player', 'Week', 'Rushing_YPC', 'Rolling_Avg_YPC-2WK']].sort_values(['Player', 'Week']).head(20))


#Ensures no null values
weekly_stats_clean2 = weekly_stats_clean2[weekly_stats_clean2['Rolling_Avg_YPC'].notna()]



#Defines Performance Dip as a weekly rushing YPC that is 15% less than rolling average of prior three weeks
#weekly_stats_clean2['Performance_Dip'] = np.where(weekly_stats_clean2['Performance_Delta'] / weekly_stats_clean2['Rolling_Avg_YPC'] < 0.85, 1, 0)

import plotly.express as px  
#Boxplots of Age and Rushing First Downs
cols = ['Age', 'Rushing_FirstDown']
df_1 = weekly_stats_clean2[cols].melt(var_name='Features', value_name='Values')
fig = px.box(df_1, x='Features', y="Values", 
             title= 'Distribution of Age & Rushing First Downs', points = "outliers", color='Features')
fig.update_traces(boxmean = 'sd')
fig.show()

#Boxplot of Score Difference
fig = px.box(weekly_stats_clean2['Score_Diff'], y='Score_Diff',
             title='Distribution of Score Difference in Weekly Games',
             points = 'outliers')
fig.update_traces(boxmean = 'sd', name='Score Difference')
fig.show()

#Histogram of Rushing First Downs
fig = px.histogram(weekly_stats_clean2['Rushing_FirstDown'], 
                   x='Rushing_FirstDown', nbins=15)
fig.update_layout(xaxis_title ='Number of Rushing First Downs',
                  yaxis_title='Frequency in Dataset')
fig.show()


#Histogram of Player Age
fig = px.histogram(weekly_stats_clean2, 
                   x='Age', nbins=15)
fig.update_layout(xaxis_title ='Ages of NFL Running Backs',
                  yaxis_title='Frequency in Dataset')
fig.show()

## Binning
#Binning Age Group to discrete bins
#print(f"Age Distribution: {weekly_stats_clean2['Age'].describe()}")
age_bins = [19, 23, 25, 27, 40]
age_labels = [0, 1, 2, 3] # Numeric in order to utilize in models
age_label_map = {0: '20-23', 
                 1:'24-25', 
                 2: '26-27', 
                 3: '28-37'} #Key value pair to help keep track of what numbers represent
weekly_stats_clean2['Age_Group'] = pd.cut(weekly_stats_clean2['Age'],
                                          bins=age_bins, labels=age_labels, right=False)
#print(f"Score difference distribution: {weekly_stats_clean2['Score_Diff'].describe()}")
#Binning Rushing First Down
rush_1D_bins = [0, 1, 2, 3, 4, 16]
rush_1D_labels = [0, 1, 2, 3, 4]
rush_label_map = {0: '0', 
                  1: '1', 
                  2: '2', 
                  3: '3-4', 
                  4:'5-16'}
weekly_stats_clean2['Rushing_FirstDown_Bin'] = pd.cut(weekly_stats_clean2['Rushing_FirstDown'],
                                                      bins=rush_1D_bins, labels=rush_1D_labels, right=False)
#Binning score difference since broad range of scores can be further classified into 5 main categories
scoreDiff_bins = [-54, -14, -7, 7, 14, 54]
scoreDiff_labels = [0, 1, 2, 3, 4] #Numeric to simply and utilize input in models
scoreDiff_label_map = {0: 'Blowout Loss (> 14pts)', 
                       1: 'Moderate Loss (8-14 pts)', 
                       2: 'Close Game (Within 1 TD', 
                       3: 'Moderate Win (8-14pts)', 
                       4: 'Blowout Win (> 14 pts)'}
weekly_stats_clean2['Score_Diff_Bin'] = pd.cut(weekly_stats_clean2['Score_Diff'],
                                               bins=scoreDiff_bins, labels=scoreDiff_labels, right=False)
#ScoreDiff - One Hot Encoding
scoreDiff_encoded = pd.get_dummies(weekly_stats_clean2['Score_Diff_Bin'], prefix='ScoreDiff',
                                   dtype=int, drop_first=True)
weekly_stats_clean2 = pd.concat([weekly_stats_clean2.drop(columns=['Score_Diff_Bin'], errors='ignore'),
                                 scoreDiff_encoded], axis=1)
print(weekly_stats_clean2.dtypes)
#Confirming binning applied above did not leave any null values in rows
column_null = weekly_stats_clean2.isna().any()
#print(column_null)

Player                     object
Fantasy_Pts               float64
Rushing_Att                 int64
Day                        object
Game_Number                 int64
Week                        int64
Age                         int64
Team                       object
Opp                        object
Rushing_Yds                 int64
Rushing_YPC               float64
Rushing_TD                  int64
Rushing_FirstDown         float64
Rushing_Succ%             float64
Fantasy_FantPt            float64
Fantasy_PPR               float64
Fantasy_DKPt              float64
Fantasy_FDPt              float64
Team_Pts                    int64
Opp_Pts                     int64
Score_Diff                  int64
Year                        int64
Month_Day                  object
Home_Determinant            int64
Grass_Determinant           int64
Rolling_Avg_Rush_Att      float64
Rolling_Avg_YPC           float64
Rolling_Avg_RushYds       float64
PercentChange             float64
Performance_Di

Exporting to Excel for Tableau Data Visualization

In [87]:
#weekly_stats_clean2_excel = weekly_stats_clean2.to_excel("Weekly_NFL_RB_stats_cleaned.xlsx", index=False)
print("DataFrame saved to 'Weekly_NFL_RB_stats_cleaned.xlsx'")

DataFrame saved to 'Weekly_NFL_RB_stats_cleaned.xlsx'


In [4]:
#weekly_stats_clean2.info()
weekly_stats_clean2['Rushing_FirstDown'].describe()

count    11939.000000
mean         2.140967
std          2.039806
min          0.000000
25%          1.000000
50%          2.000000
75%          3.000000
max         15.000000
Name: Rushing_FirstDown, dtype: float64

### Data Visualization 2 <br><br>

Boxplots serve the purpose of visualizing the distributions of rolling averages for the rush attempts and yards per carry (YPC), helping detect extreme outliers that can negatively impact model training <br>

The scatter plot compares the single game rushing yard against the rolling average rushing yards to flag any anomaly performances that are disproportionate to running back rushing trends to limit the amount of bias into the models 


In [None]:
#Boxplots
cols = ['Rolling_Avg_Rush_Att', 'Rolling_Avg_YPC']
df_1 = weekly_stats_clean2[cols].melt(var_name='Features', value_name='Values')
fig = px.box(df_1, x='Features', y="Values", 
             title= 'Distribution of Rolling Averages for Rush Attempts & YPC', 
             points = "outliers", color='Features')
fig.update_traces(boxmean = 'sd')
fig.show()



#Scatter Plot Analyzing Rushing Yards
fig = px.scatter(weekly_stats_clean2,
                x= weekly_stats_clean2['Rushing_Yds'], 
                y=weekly_stats_clean2['Rolling_Avg_RushYds'])
fig.update_layout(xaxis_title = 'Rushing Yards',
                  yaxis_title='Rolling Avg Rushing Yards',
                  title = 'Rushing Yards vs Rolling Average Rushing Yards')
fig.show()


### Dropping Extreme Outliers<br>

Based on above data visualization, it is seen that weekly rushing attempts over 30 are outliers and that workload isn't representative for the majority of players.  <br> Additionally, weekly rushing yards per carry above 20 are clearly outliers in this dataset.  Looking specifically at the scatter plot for the rolling averages of rush attempts vs yards per carry, all instances of players having greater than 20 YPC happened in under five carries.  This shows that these instances are anomalies due to low rushing sample size from that player <br>


In [38]:


rush_att_con = weekly_stats_clean2['Rolling_Avg_Rush_Att'] > 30
ypc_con = weekly_stats_clean2['Rolling_Avg_YPC'] > 20
#Checking percentile distribution for rushing yards
rushingYards_outlier = weekly_stats_clean2['Rushing_Yds'].quantile(.995)
rollingAvg_rushYds_outlier = weekly_stats_clean2['Rolling_Avg_RushYds'].quantile(.995)
print(f" 99.5th percentile of Rushing Yards: {rushingYards_outlier}")
print(f" 99.5th percentile of Rolling Average Rushing Yards: {rollingAvg_rushYds_outlier}")
rush_yds_con1 = ((weekly_stats_clean2['Rushing_Yds'] >= weekly_stats_clean2['Rolling_Avg_RushYds'] * 3 ) & 
                 ((weekly_stats_clean2['Rushing_Yds'] > rushingYards_outlier) | (weekly_stats_clean2['Rolling_Avg_RushYds'] > rollingAvg_rushYds_outlier)))
rush_yds_con2 = weekly_stats_clean2['Rushing_Yds'] > 250
combined_con=rush_att_con & ypc_con & rush_yds_con1 & rush_yds_con2

#Dropping all extreme outliers from dataset
weekly_stats_clean3 = weekly_stats_clean2[~combined_con]
print(weekly_stats_clean3.info)

 99.5th percentile of Rushing Yards: 178.0
 99.5th percentile of Rolling Average Rushing Yards: 149.1766666666666
<bound method DataFrame.info of                Player  Fantasy_Pts  Rushing_Att  Day  Game_Number  Week  Age  \
198    Ahmad Bradshaw         11.9           11  Thu            6     6   28   
125     Alfred Morris          5.5           13  Sun            6     6   25   
56     Andre Williams          5.9           17  Sun            6     6   22   
140     Bishop Sankey          6.8           18  Sun            6     6   22   
54       Bobby Rainey          4.2            7  Sun            6     6   26   
..                ...          ...          ...  ...          ...   ...  ...   
82         Ty Johnson          2.5            4  Sun           16    17   27   
177      Tyjae Spears         10.3           20  Sun           16    17   23   
26     Tyler Allgeier          1.9            3  Sun           16    17   24   
26   Tyrone Tracy Jr.          7.3           20  Sun  

### Establishing Target Variable Visualzing Class Imbalance of Performance Dip <br>

The target variable for this project is performance dip, a binary indication on whether the player's performance is declining during the season. <br>

First, the player's percent change in performance is calculated, which calculates the percentage of how much the current week's yards per carry (YPC) differs from the player's rolling average YPC. <br>
After establishing the percent change, performance dip is calculated.  There are two main criteria used to indicate performance dip: falls below first quartile of percent change and has minimum of 6 carries.  The percentile-based distinction is made relative to this dataset's distribution and can adjust to player performance variability.  The minimum of 6 carries demonstrates that the player has a workload floor and the dip will reflect prolonged usage of a player<br>

Visualizing the imbalance of performance dip, it is largely expected for the majority of players to not have a performance dip because players having a declined performance in the season should be a relatively rare occurence.  It is critical to visualize and understand class imbalance of target variable since this will influence the metrics used for analyzing model performance.  Accuracy will be a poor choice since always prediciting "no dip" will yield 85% accuracy.  Focusing on metrics that handle imbalance well such as F1 score

In [None]:
#Percent Change Variable compares current performance against previous weeks rolling average
weekly_stats_clean3['PercentChange'] = (weekly_stats_clean3['Rushing_YPC'] - weekly_stats_clean3['Rolling_Avg_YPC'])/weekly_stats_clean3['Rolling_Avg_YPC']
min_carries = weekly_stats_clean3['Rushing_Att'] >= 6
threshold = weekly_stats_clean3['PercentChange'].quantile(0.25)
weekly_stats_clean3['Performance_Dip'] = ((weekly_stats_clean3['PercentChange'] <= threshold) & min_carries).astype(int)
rate = weekly_stats_clean3['Performance_Dip'].mean()
counts = weekly_stats_clean3['Performance_Dip'].value_counts()
print(f"Dip rate: {rate:.2%}\nCounts:\n{counts}")



fig = px.pie(weekly_stats_clean3['Performance_Dip'], names='Performance_Dip')
fig.show()

Dip rate: 14.78%
Counts:
Performance_Dip
0    5535
1     960
Name: count, dtype: int64


In [56]:
print(weekly_stats_clean3.dtypes)


Player                     object
Fantasy_Pts               float64
Rushing_Att                 int64
Day                        object
Game_Number                 int64
Week                        int64
Age                         int64
Team                       object
Opp                        object
Rushing_Yds                 int64
Rushing_YPC               float64
Rushing_TD                  int64
Rushing_FirstDown         float64
Rushing_Succ%             float64
Fantasy_FantPt            float64
Fantasy_PPR               float64
Fantasy_DKPt              float64
Fantasy_FDPt              float64
Team_Pts                    int64
Opp_Pts                     int64
Score_Diff                  int64
Year                        int64
Month_Day                  object
Home_Determinant            int64
Grass_Determinant           int64
Rolling_Avg_Rush_Att      float64
Rolling_Avg_YPC           float64
Rolling_Avg_RushYds       float64
PercentChange             float64
Performance_Di