# NFL Game Prediction Project

By: Christian Ortiz

10/5/2023

YouTube Video Link:
https://youtu.be/ViaGirGFJZY?si=6FzpJDXvc45G2Ozs

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


## Step 1: Import libraries / Links to Data

In [3]:
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

Here are the links to the datasets I used:


This data is already cleaned and ready to use

[2023 NFL Play by Play Data](https://drive.google.com/file/d/1idR5SMXhOIpfA4C70VSgJe9deKzB38mW/view?usp=drive_link)

[2022 NFL Play by Play Data](https://drive.google.com/file/d/1-prfK4T9kFwfJxVG6yJe0moDebrbYmQS/view?usp=drive_link)

[Upcoming Schedule NFL Week 5](https://drive.google.com/file/d/1KwNnayIiXBxqCu4ffpaBLJkVOFGbm71e/view?usp=drive_link)

<br>

Here is the raw data if you'd like it

[Raw 2023 NFL Play by Play Data](https://drive.google.com/file/d/1B2oFowZOIWgZtF1bw8wF865NVmxv7rpk/view?usp=drive_link)

[Raw 2022 NFL Play by Play Data](https://drive.google.com/file/d/14xYBJibEQW89LMI-k0H5oCQEJfe3fgXo/view?usp=drive_link)

[Raw NFL Scores for 2023 Season](https://drive.google.com/file/d/1nc1Ejguj4X_mu5ZaKTJb2foBh_kqy7NL/view?usp=drive_link)

[Raw NFL Scores for 2022 Season](https://drive.google.com/file/d/12787BxkzBFORgF40s_wCP-V2rFmYBFqG/view?usp=drive_link)

<br>

I got the Play by Play data originally from this website:

[NFL Savant](https://nflsavant.com/about.php)

And the box scores from this site

[The Football Database](https://www.footballdb.com/games/index.html?lg=NFL&yr=2023)

### HOW TO IMPORT FILES

Click the folder icon on the left, drag and drop the files in, then right-click each file to copy its path and paste it inside the quotes of read_csv("") in the code you will see later.

You'll probably have to change the file paths in this project to the paths for your files.

## Step 2: Data Cleaning and Preparation

Here is what the data looks like now BEFORE the cleaning

NOTE: You will likely have to change these paths below to the ones in your environment



In [4]:
og_2024_data = pd.read_csv("/content/drive/MyDrive/NFL data/pbp-2024.csv")

og_2024_scores = pd.read_csv("/content/drive/MyDrive/NFL data/raw_2024_scores.csv")

In [5]:
og_2024_data.head()


Unnamed: 0,GameId,GameDate,Quarter,Minute,Second,OffenseTeam,DefenseTeam,Down,ToGo,YardLine,...,IsTwoPointConversion,IsTwoPointConversionSuccessful,RushDirection,YardLineFixed,YardLineDirection,IsPenaltyAccepted,PenaltyTeam,IsNoPlay,PenaltyType,PenaltyYards
0,2024122907,2024-12-29,3,7,3,GB,MIN,1,10,16,...,0,0,CENTER,84,OPP,0,,0,,0
1,2024122907,2024-12-29,3,9,44,MIN,GB,0,0,0,...,0,0,,100,OPP,0,,0,,0
2,2024122907,2024-12-29,3,9,44,MIN,GB,0,0,15,...,0,0,,85,OPP,0,,0,,0
3,2024122907,2024-12-29,3,9,50,MIN,GB,1,10,18,...,0,0,,82,OPP,0,,0,,0
4,2024122907,2024-12-29,3,15,0,MIN,GB,0,0,35,...,0,0,,65,OPP,0,,0,,0


In [6]:
og_2024_scores.head()


Unnamed: 0,Week,Date,Visitor,VisitorScore,Home,HomeScore,OT
0,Week 1,09/05/2024,Baltimore Ravens,20,Kansas City Chiefs,27,
1,Week 1,09/06/2024,Green Bay Packers,29,Philadelphia Eagles,34,
2,Week 1,09/08/2024,Carolina Panthers,10,New Orleans Saints,47,
3,Week 1,09/08/2024,Tennessee Titans,17,Chicago Bears,24,
4,Week 1,09/08/2024,New England Patriots,16,Cincinnati Bengals,10,


#### Issues with the Play by Play Data


*   The play-by-play data uses a different date format than the scores data
*   There are no Home Team or Visitor Team columns, just OffenseTeam and DefenseTeam
*   There is no final box score for the games
*   The play-by-play data uses team abbreviations, while the scores data uses full team names





#### The following code blocks show how I cleaned this data, but you can skip them because I provide the cleaned data. If you'd like to run them, you'll have to adjust some variable names to make them work

### Changing the Date Format

In [7]:
# Convert the 'GameDate' format in play-by-play data to match the 'Date' format in scores dataset
og_2024_data['GameDate'] = pd.to_datetime(og_2024_data['GameDate']).dt.strftime('%m/%d/%Y')

# Work on a copy of the 2024 play-by-play data
# (concatenating the frame with itself would duplicate every play)
nfl_play = og_2024_data.copy()

# Display the first few rows with the reformatted dates
nfl_play.head()


Unnamed: 0,GameId,GameDate,Quarter,Minute,Second,OffenseTeam,DefenseTeam,Down,ToGo,YardLine,...,IsTwoPointConversion,IsTwoPointConversionSuccessful,RushDirection,YardLineFixed,YardLineDirection,IsPenaltyAccepted,PenaltyTeam,IsNoPlay,PenaltyType,PenaltyYards
0,2024122907,12/29/2024,3,7,3,GB,MIN,1,10,16,...,0,0,CENTER,84,OPP,0,,0,,0
1,2024122907,12/29/2024,3,9,44,MIN,GB,0,0,0,...,0,0,,100,OPP,0,,0,,0
2,2024122907,12/29/2024,3,9,44,MIN,GB,0,0,15,...,0,0,,85,OPP,0,,0,,0
3,2024122907,12/29/2024,3,9,50,MIN,GB,1,10,18,...,0,0,,82,OPP,0,,0,,0
4,2024122907,12/29/2024,3,15,0,MIN,GB,0,0,35,...,0,0,,65,OPP,0,,0,,0
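
The date conversion can be checked in isolation. This is a minimal sketch on two made-up rows, assuming the play-by-play dates arrive as `YYYY-MM-DD` strings as in the `head()` output above; the `pbp` frame is a hypothetical stand-in for the real data.

```python
import pandas as pd

# Toy play-by-play dates in ISO format (illustrative values only)
pbp = pd.DataFrame({"GameDate": ["2024-12-29", "2025-01-04"]})

# Parse with an explicit format, then render as MM/DD/YYYY to match the scores file
pbp["GameDate"] = pd.to_datetime(pbp["GameDate"], format="%Y-%m-%d").dt.strftime("%m/%d/%Y")

print(pbp["GameDate"].tolist())
```

Passing an explicit `format` makes pandas raise an error on a malformed date instead of guessing how to read it.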


### Changing Scores data to use abbreviations

In [8]:
# Changing Scores Data to use Abbreviations
# Extract unique team names from scores data
unique_visitor_teams_2024 = og_2024_scores['Visitor'].unique()
unique_home_teams_2024 = og_2024_scores['Home'].unique()


# Combine all unique teams
all_unique_teams = set(unique_visitor_teams_2024).union(set(unique_home_teams_2024))

# Extract unique team abbreviations from play-by-play data
unique_teams_abbrev_2024 = set(nfl_play['OffenseTeam'].dropna().unique()).union(set(nfl_play['DefenseTeam'].dropna().unique()))


# Combine all unique abbreviations
all_unique_abbrev = unique_teams_abbrev_2024

all_unique_teams, all_unique_abbrev


({'Arizona Cardinals',
  'Atlanta Falcons',
  'Baltimore Ravens',
  'Buffalo Bills',
  'Carolina Panthers',
  'Chicago Bears',
  'Cincinnati Bengals',
  'Cleveland Browns',
  'Dallas Cowboys',
  'Denver Broncos',
  'Detroit Lions',
  'Green Bay Packers',
  'Houston Texans',
  'Indianapolis Colts',
  'Jacksonville Jaguars',
  'Kansas City Chiefs',
  'Las Vegas Raiders',
  'Los Angeles Chargers',
  'Los Angeles Rams',
  'Miami Dolphins',
  'Minnesota Vikings',
  'New England Patriots',
  'New Orleans Saints',
  'New York Giants',
  'New York Jets',
  'Philadelphia Eagles',
  'Pittsburgh Steelers',
  'San Francisco 49ers',
  'Seattle Seahawks',
  'Tampa Bay Buccaneers',
  'Tennessee Titans',
  'Washington Commanders'},
 {'ARI',
  'ATL',
  'BAL',
  'BUF',
  'CAR',
  'CHI',
  'CIN',
  'CLE',
  'DAL',
  'DEN',
  'DET',
  'GB',
  'HOU',
  'IND',
  'JAX',
  'KC',
  'LA',
  'LAC',
  'LV',
  'MIA',
  'MIN',
  'NE',
  'NO',
  'NYG',
  'NYJ',
  'PHI',
  'PIT',
  'SEA',
  'SF',
  'TB',
  'TEN',
  '

In [9]:
# Mapping of team names to abbreviations
team_mapping = {
    'Arizona Cardinals': 'ARI',
    'Atlanta Falcons': 'ATL',
    'Baltimore Ravens': 'BAL',
    'Buffalo Bills': 'BUF',
    'Carolina Panthers': 'CAR',
    'Chicago Bears': 'CHI',
    'Cincinnati Bengals': 'CIN',
    'Cleveland Browns': 'CLE',
    'Dallas Cowboys': 'DAL',
    'Denver Broncos': 'DEN',
    'Detroit Lions': 'DET',
    'Green Bay Packers': 'GB',
    'Houston Texans': 'HOU',
    'Indianapolis Colts': 'IND',
    'Jacksonville Jaguars': 'JAX',
    'Kansas City Chiefs': 'KC',
    'Las Vegas Raiders': 'LV',
    'Los Angeles Chargers': 'LAC',
    'Los Angeles Rams': 'LA',
    'Miami Dolphins': 'MIA',
    'Minnesota Vikings': 'MIN',
    'New England Patriots': 'NE',
    'New Orleans Saints': 'NO',
    'New York Giants': 'NYG',
    'New York Jets': 'NYJ',
    'Philadelphia Eagles': 'PHI',
    'Pittsburgh Steelers': 'PIT',
    'San Francisco 49ers': 'SF',
    'Seattle Seahawks': 'SEA',
    'Tampa Bay Buccaneers': 'TB',
    'Tennessee Titans': 'TEN',
    'Washington Commanders': 'WAS'
}

# Replace team names with abbreviations in scores data
og_2024_scores['Visitor'] = og_2024_scores['Visitor'].map(team_mapping)
og_2024_scores['Home'] = og_2024_scores['Home'].map(team_mapping)


# Check the updated 2024 scores data
new_scores = og_2024_scores

new_scores.head(20)

Unnamed: 0,Week,Date,Visitor,VisitorScore,Home,HomeScore,OT
0,Week 1,09/05/2024,BAL,20,KC,27,
1,Week 1,09/06/2024,GB,29,PHI,34,
2,Week 1,09/08/2024,CAR,10,NO,47,
3,Week 1,09/08/2024,TEN,17,CHI,24,
4,Week 1,09/08/2024,NE,16,CIN,10,
5,Week 1,09/08/2024,PIT,18,ATL,10,
6,Week 1,09/08/2024,ARI,28,BUF,34,
7,Week 1,09/08/2024,MIN,28,NYG,6,
8,Week 1,09/08/2024,JAX,17,MIA,20,
9,Week 1,09/08/2024,HOU,29,IND,27,
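
One thing worth checking after a `.map()` like this: any team name missing from the dictionary silently becomes NaN. A small sketch of that check follows. The frame and the deliberately unmapped name are made up for illustration, and the mapping is cut down to three entries.

```python
import pandas as pd

# Hypothetical scores rows; "Washington Football Team" is deliberately absent from the mapping
scores = pd.DataFrame({
    "Visitor": ["Baltimore Ravens", "Washington Football Team"],
    "Home": ["Kansas City Chiefs", "Green Bay Packers"],
})
team_mapping = {"Baltimore Ravens": "BAL", "Kansas City Chiefs": "KC", "Green Bay Packers": "GB"}

# Collect every name that .map() would turn into NaN
names = pd.concat([scores["Visitor"], scores["Home"]])
unmapped = sorted(set(names) - set(team_mapping))

print(unmapped)
```

An empty list means every name in the scores file has an abbreviation.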


In [14]:
nfl_play.head(20)

Unnamed: 0,GameId,GameDate,Quarter,Minute,Second,OffenseTeam,DefenseTeam,Down,ToGo,YardLine,...,IsTwoPointConversion,IsTwoPointConversionSuccessful,RushDirection,YardLineFixed,YardLineDirection,IsPenaltyAccepted,PenaltyTeam,IsNoPlay,PenaltyType,PenaltyYards
0,2024122907,12/29/2024,3,7,3,GB,MIN,1,10,16,...,0,0,CENTER,84,OPP,0,,0,,0
1,2024122907,12/29/2024,3,9,44,MIN,GB,0,0,0,...,0,0,,100,OPP,0,,0,,0
2,2024122907,12/29/2024,3,9,44,MIN,GB,0,0,15,...,0,0,,85,OPP,0,,0,,0
3,2024122907,12/29/2024,3,9,50,MIN,GB,1,10,18,...,0,0,,82,OPP,0,,0,,0
4,2024122907,12/29/2024,3,15,0,MIN,GB,0,0,35,...,0,0,,65,OPP,0,,0,,0
5,2024122907,12/29/2024,2,2,0,GB,MIN,0,0,0,...,0,0,,100,OPP,0,,0,,0
6,2025010401,01/04/2025,2,5,32,PIT,CIN,0,0,0,...,0,0,,100,OPP,0,,0,,0
7,2025010401,01/04/2025,2,6,33,PIT,CIN,0,0,0,...,0,0,,100,OPP,0,,0,,0
8,2025010401,01/04/2025,2,6,37,CIN,PIT,4,1,37,...,0,0,,63,OPP,0,,0,,0
9,2025010401,01/04/2025,2,7,20,CIN,PIT,3,6,42,...,0,0,,58,OPP,0,,0,,0


In [11]:
print(new_scores.columns)

Index(['Week', 'Date', 'Visitor', 'VisitorScore', 'Home', 'HomeScore', 'OT'], dtype='object')


### Copying scores to play by play, matching based on date and teams

In [16]:
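# Merge the 2024 play-by-play data with the 2024 scores data based on Date, OffenseTeam, and DefenseTeam
# First pass: plays where the offense is the visiting team
all_data = nfl_play.merge(new_scores, left_on=['GameDate', 'OffenseTeam', 'DefenseTeam'], right_on=['Date', 'Visitor', 'Home'], how='left')
# Second pass: plays where the offense is the home team
# NOTE: this pass keys on 'Date', which is NaN for every row the first pass did not match,
# so those rows stay unmatched (the NaN rows in the output below). Keying on 'GameDate' would avoid this.
all_data = all_data.merge(new_scores, left_on=['Date', 'OffenseTeam', 'DefenseTeam'], right_on=['Date', 'Home', 'Visitor'], how='left', suffixes=('', '_reverse'))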

# Combine the columns from the two merges
for column in ['Week', 'Visitor', 'VisitorScore', 'Home', 'HomeScore', 'OT']:
    all_data[column] = all_data[column].combine_first(all_data[column + '_reverse'])

# Drop the redundant columns from the reverse merge
columns_to_drop = [col + '_reverse' for col in ['Week', 'Visitor', 'VisitorScore', 'Home', 'HomeScore', 'OT']]
all_data = all_data.drop(columns=columns_to_drop)

# Check the result
all_data[['Date', 'OffenseTeam', 'DefenseTeam', 'Week', 'Visitor', 'VisitorScore', 'Home', 'HomeScore', 'OT']].head(10)



Unnamed: 0,Date,OffenseTeam,DefenseTeam,Week,Visitor,VisitorScore,Home,HomeScore,OT
0,12/29/2024,GB,MIN,Week 17,GB,25.0,MIN,27.0,
1,,MIN,GB,,,,,,
2,,MIN,GB,,,,,,
3,,MIN,GB,,,,,,
4,,MIN,GB,,,,,,
5,12/29/2024,GB,MIN,Week 17,GB,25.0,MIN,27.0,
6,,PIT,CIN,,,,,,
7,,PIT,CIN,,,,,,
8,01/04/2025,CIN,PIT,Week 18,CIN,19.0,PIT,17.0,
9,01/04/2025,CIN,PIT,Week 18,CIN,19.0,PIT,17.0,
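
In the output above, the rows where the home team has the ball (for example MIN on offense against GB) come back as NaN. One possible way to match both orderings in a single merge is sketched below. The toy frames reuse values from the displayed rows, and the `lookup` table is a helper introduced here for illustration, not part of the original cleaning code.

```python
import pandas as pd

# Toy inputs mirroring the columns used above
plays = pd.DataFrame({
    "GameDate": ["12/29/2024", "12/29/2024"],
    "OffenseTeam": ["GB", "MIN"],
    "DefenseTeam": ["MIN", "GB"],
})
scores = pd.DataFrame({
    "Date": ["12/29/2024"], "Visitor": ["GB"], "Home": ["MIN"],
    "VisitorScore": [25], "HomeScore": [27],
})

# One lookup row per (date, offense, defense) ordering, so a single merge covers both cases
as_visitor = scores.assign(OffenseTeam=scores["Visitor"], DefenseTeam=scores["Home"])
as_home = scores.assign(OffenseTeam=scores["Home"], DefenseTeam=scores["Visitor"])
lookup = pd.concat([as_visitor, as_home], ignore_index=True).rename(columns={"Date": "GameDate"})

matched = plays.merge(lookup, on=["GameDate", "OffenseTeam", "DefenseTeam"], how="left")
print(matched[["OffenseTeam", "Home", "HomeScore", "VisitorScore"]])
```

Every play is joined on its own `GameDate`, so the result does not depend on which team had the ball.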


### Adding HomeWon

In [13]:
# Adding the HomeWon column (1 if the home team won, 0 otherwise)
# Uses the merged all_data frame from the previous step; rows with missing scores evaluate to 0
all_data['HomeWon'] = (all_data['HomeScore'] > all_data['VisitorScore']).astype(int)

# Display first few rows of updated data for verification
all_data[['Date', 'Home', 'Visitor', 'HomeScore', 'VisitorScore', 'HomeWon']].head()


## Step 3: Team Feature Extraction

### Offensive Features

First, we will create some offensive features from our data

NOTE: You will likely have to change these paths below to the ones in your environment


In [None]:
import pandas as pd
import numpy as np

# Load the datasets for the 2022 and 2023 seasons that include a "HomeWon" column.
# This column indicates if the home team won the game (1 for win, 0 for loss).

# NOTE: You'll probably have to change these file paths to the paths for your environment
# (click folder icon on left, drag and drop them in, and then right click them to get the path and paste them in the quotes below)
data_2022_updated = pd.read_csv("/content/BEST_2022_nfl_data_play_with_scores_and_HomeWon.csv")
data_2023_updated = pd.read_csv("/content/BEST_2023_nfl_data_play_with_scores_and_HomeWon.csv")

# Load the dataset containing the upcoming games schedule.
upcoming_games = pd.read_csv("/content/BEST New_Upcoming_Schedule.csv")

# Use only the 2023 season for feature extraction.
# To include 2022 as well, swap in the commented-out concat line below.
#all_data = pd.concat([data_2022_updated, data_2023_updated])
all_data = data_2023_updated

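# 1. Average Points Scored
# Calculate the average points scored by each team at home and as a visitor.
# NOTE: all_data has one row per play, so these means are weighted by the number of plays in each game.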
avg_points_scored_home = all_data.groupby('Home')['HomeScore'].mean()
avg_points_scored_visitor = all_data.groupby('Visitor')['VisitorScore'].mean()

# 2. Average Points Allowed
# Calculate the average points allowed by the home and visitor teams.
avg_points_allowed_home = all_data.groupby('Home')['VisitorScore'].mean()
avg_points_allowed_visitor = all_data.groupby('Visitor')['HomeScore'].mean()

# Calculate the overall average points scored and allowed by combining the home and visitor averages.
overall_avg_points_scored = (avg_points_scored_home + avg_points_scored_visitor) / 2
overall_avg_points_allowed = (avg_points_allowed_home + avg_points_allowed_visitor) / 2

# 3. Win Rate
# Calculate the total number of wins for home and visitor teams.
home_wins = all_data.groupby('Home')['HomeWon'].sum()
visitor_wins = all_data.groupby('Visitor').apply(lambda x: len(x) - x['HomeWon'].sum())

# Calculate the total number of games played by each team as home and visitor.
total_games_home = all_data['Home'].value_counts()
total_games_visitor = all_data['Visitor'].value_counts()

# Calculate the overall number of wins and total games played by each team.
overall_wins = home_wins + visitor_wins
total_games = total_games_home + total_games_visitor

# Calculate the win rate for each team.
win_rate = overall_wins / total_games

# Calculate the average outcome of games between each pair of teams (home vs visitor).
# head_to_head = all_data.groupby(['Home', 'Visitor'])['HomeWon'].mean()

# Create a new data frame to store the features for each team.
team_features = pd.DataFrame({
    'AvgPointsScored': overall_avg_points_scored,
    'AvgPointsAllowed': overall_avg_points_allowed,
    'WinRate': win_rate
})

# Name the index "Team" and turn it into a regular column.
# (rename_axis does not depend on what the index happens to be called after the arithmetic above)
team_features = team_features.rename_axis('Team').reset_index()

# Display the first few rows of the team_features DataFrame.
team_features.head()
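
Because `all_data` holds one row per play, a `groupby(...).mean()` on a score column weights each game by how many plays it had. Here is a minimal sketch of the difference, using made-up rows and assuming a game can be identified by `Date`, `Home`, and `Visitor`:

```python
import pandas as pd

# Toy play-level frame: the first game has 3 plays, the second has 1 (hypothetical values)
plays = pd.DataFrame({
    "Date": ["09/08/2024"] * 3 + ["09/15/2024"],
    "Home": ["CHI"] * 4,
    "Visitor": ["TEN"] * 3 + ["HOU"],
    "HomeScore": [24, 24, 24, 10],
})

# Play-weighted mean: the first game counts three times
play_weighted = plays.groupby("Home")["HomeScore"].mean()

# Game-level mean: keep one row per game before averaging
games = plays.drop_duplicates(subset=["Date", "Home", "Visitor"])
game_level = games.groupby("Home")["HomeScore"].mean()

print(play_weighted["CHI"], game_level["CHI"])
```

The same `drop_duplicates` step would apply to the win counts, which otherwise count plays rather than games.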


In [None]:
# Examining the columns and a few rows of the datasets to ensure correct structure
upcoming_games.head()

### Defensive Features

Now we do the same thing but with defensive features

In [None]:
# Calculate defensive features for each NFL team.

# 1. Average points defended:
# This metric is essentially the same as AvgPointsAllowed, which we already computed in previous steps so we won't recompute it here.

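# 2. Average conceded plays:
# A play is considered successful for the offense if it results in a touchdown or doesn't result in a turnover.
# Create a new column 'SuccessfulPlay' in the all_data DataFrame to represent this.
# The 0/1 flag columns are cast to bool first so that ~ acts as logical NOT rather than bitwise NOT on integers.
is_td = all_data['IsTouchdown'].astype(bool)
is_int = all_data['IsInterception'].astype(bool)
is_fumble = all_data['IsFumble'].astype(bool)
all_data['SuccessfulPlay'] = is_td | (~is_int & ~is_fumble)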

# Calculate the average rate of successful plays conceded when playing at home.
avg_conceded_plays_home = all_data.groupby('Home')['SuccessfulPlay'].mean()

# Calculate the average rate of successful plays conceded when playing as a visitor.
avg_conceded_plays_visitor = all_data.groupby('Visitor')['SuccessfulPlay'].mean()

# Calculate the overall average rate of successful plays conceded for each team.
overall_avg_conceded_plays = (avg_conceded_plays_home + avg_conceded_plays_visitor) / 2

# 3. Average forced turnovers:
# Create a new column 'Turnover' that indicates if a play resulted in a turnover (either interception or fumble).
all_data['Turnover'] = all_data['IsInterception'] | all_data['IsFumble']

# Calculate the average rate of turnovers forced when playing at home.
avg_forced_turnovers_home = all_data.groupby('Home')['Turnover'].mean()

# Calculate the average rate of turnovers forced when playing as a visitor.
avg_forced_turnovers_visitor = all_data.groupby('Visitor')['Turnover'].mean()

# Calculate the overall average rate of turnovers forced for each team.
overall_avg_forced_turnovers = (avg_forced_turnovers_home + avg_forced_turnovers_visitor) / 2

# Create a new DataFrame to store the defensive features for each team.
team_features_defensive = pd.DataFrame({
    'Team': team_features['Team'].values,
    'AvgPointsDefended': team_features['AvgPointsAllowed'].values,
    'AvgConcededPlays': overall_avg_conceded_plays.values,
    'AvgForcedTurnovers': overall_avg_forced_turnovers.values
})

# Merge the defensive features with the original team features to create a combined DataFrame.
team_features_combined = team_features.merge(team_features_defensive, on='Team')

# Display the first few rows of the combined team features DataFrame.
team_features_combined.head()
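
A quick check of why the flag columns should be boolean before `~` is applied. This sketch assumes the flags are stored as 0/1 integers, as the `head()` output earlier suggests; the three rows are made up.

```python
import pandas as pd

flags = pd.DataFrame({"IsTouchdown": [0, 0, 1], "IsInterception": [0, 1, 0], "IsFumble": [0, 0, 0]})

# Bitwise NOT on int64 flips all bits: ~0 is -1 and ~1 is -2, not True/False
raw = flags["IsTouchdown"] | (~flags["IsInterception"] & ~flags["IsFumble"])

# Casting to bool first makes ~, & and | behave as logical operators
b = flags.astype(bool)
successful = b["IsTouchdown"] | (~b["IsInterception"] & ~b["IsFumble"])

print(raw.tolist())         # [-1, -2, -1]
print(successful.tolist())  # [True, False, True]
```

The integer version is a constant offset away from the boolean one, so differences between teams are unaffected, but the per-team averages themselves are not readable as rates.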


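### Additional offensive features

In [None]:
# Calculate additional offensive features
# NOTE on naming: the "per game" features divide by .size(), which counts plays rather than games,
# the completion rate is taken over all plays rather than pass plays only,
# and "rush success rate" is the mean yards gained on rushing plays.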

# 1. Average yards per play
avg_yards_per_play_home = all_data.groupby('Home')['Yards'].mean()
avg_yards_per_play_visitor = all_data.groupby('Visitor')['Yards'].mean()
overall_avg_yards_per_play = (avg_yards_per_play_home + avg_yards_per_play_visitor) / 2

# 2. Average total yards per game
total_yards_per_game_home = all_data.groupby(['SeasonYear', 'Home'])['Yards'].sum() / all_data.groupby(['SeasonYear', 'Home']).size()
total_yards_per_game_visitor = all_data.groupby(['SeasonYear', 'Visitor'])['Yards'].sum() / all_data.groupby(['SeasonYear', 'Visitor']).size()
overall_avg_yards_per_game = (total_yards_per_game_home + total_yards_per_game_visitor).groupby(level=1).mean()

# 3. Average pass completion rate
avg_pass_completion_rate_home = all_data.groupby('Home').apply(lambda x: 1 - x['IsIncomplete'].mean())
avg_pass_completion_rate_visitor = all_data.groupby('Visitor').apply(lambda x: 1 - x['IsIncomplete'].mean())
overall_avg_pass_completion_rate = (avg_pass_completion_rate_home + avg_pass_completion_rate_visitor) / 2

# 4. Average touchdowns per game
avg_touchdowns_per_game_home = all_data.groupby(['SeasonYear', 'Home'])['IsTouchdown'].sum() / all_data.groupby(['SeasonYear', 'Home']).size()
avg_touchdowns_per_game_visitor = all_data.groupby(['SeasonYear', 'Visitor'])['IsTouchdown'].sum() / all_data.groupby(['SeasonYear', 'Visitor']).size()
overall_avg_touchdowns_per_game = (avg_touchdowns_per_game_home + avg_touchdowns_per_game_visitor).groupby(level=1).mean()

# 5. Average rush success rate
avg_rush_success_rate_home = all_data.groupby('Home').apply(lambda x: x['Yards'][x['IsRush'] == 1].mean())
avg_rush_success_rate_visitor = all_data.groupby('Visitor').apply(lambda x: x['Yards'][x['IsRush'] == 1].mean())
overall_avg_rush_success_rate = (avg_rush_success_rate_home + avg_rush_success_rate_visitor) / 2

# Creating a dataframe for the new offensive features
new_offensive_features = pd.DataFrame({
    'Team': team_features_combined['Team'],
    'AvgYardsPerPlay': overall_avg_yards_per_play.values,
    'AvgYardsPerGame': overall_avg_yards_per_game.values,
    'AvgPassCompletionRate': overall_avg_pass_completion_rate.values,
    'AvgTouchdownsPerGame': overall_avg_touchdowns_per_game.values,
    'AvgRushSuccessRate': overall_avg_rush_success_rate.values
})

# Merging with the existing combined features
team_features_expanded = team_features_combined.merge(new_offensive_features, on='Team')

team_features_expanded.head()


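### Additional Defensive Features

In [None]:
# Calculate additional defensive features
# NOTE: these group by 'Home' and 'Visitor' exactly as the offensive block does, so each value
# equals its offensive counterpart. A true "allowed" stat would group by 'DefenseTeam'.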

# 1. Average yards allowed per play
avg_yards_allowed_per_play_home = all_data.groupby('Home')['Yards'].mean()
avg_yards_allowed_per_play_visitor = all_data.groupby('Visitor')['Yards'].mean()
overall_avg_yards_allowed_per_play = (avg_yards_allowed_per_play_home + avg_yards_allowed_per_play_visitor) / 2

# 2. Average total yards allowed per game
total_yards_allowed_per_game_home = all_data.groupby(['SeasonYear', 'Home'])['Yards'].sum() / all_data.groupby(['SeasonYear', 'Home']).size()
total_yards_allowed_per_game_visitor = all_data.groupby(['SeasonYear', 'Visitor'])['Yards'].sum() / all_data.groupby(['SeasonYear', 'Visitor']).size()
overall_avg_yards_allowed_per_game = (total_yards_allowed_per_game_home + total_yards_allowed_per_game_visitor).groupby(level=1).mean()

# 3. Average pass completion allowed rate
avg_pass_completion_allowed_rate_home = all_data.groupby('Home').apply(lambda x: 1 - x['IsIncomplete'].mean())
avg_pass_completion_allowed_rate_visitor = all_data.groupby('Visitor').apply(lambda x: 1 - x['IsIncomplete'].mean())
overall_avg_pass_completion_allowed_rate = (avg_pass_completion_allowed_rate_home + avg_pass_completion_allowed_rate_visitor) / 2

# 4. Average touchdowns allowed per game
avg_touchdowns_allowed_per_game_home = all_data.groupby(['SeasonYear', 'Home'])['IsTouchdown'].sum() / all_data.groupby(['SeasonYear', 'Home']).size()
avg_touchdowns_allowed_per_game_visitor = all_data.groupby(['SeasonYear', 'Visitor'])['IsTouchdown'].sum() / all_data.groupby(['SeasonYear', 'Visitor']).size()
overall_avg_touchdowns_allowed_per_game = (avg_touchdowns_allowed_per_game_home + avg_touchdowns_allowed_per_game_visitor).groupby(level=1).mean()

# 5. Average rush success allowed rate
avg_rush_success_allowed_rate_home = all_data.groupby('Home').apply(lambda x: x['Yards'][x['IsRush'] == 1].mean())
avg_rush_success_allowed_rate_visitor = all_data.groupby('Visitor').apply(lambda x: x['Yards'][x['IsRush'] == 1].mean())
overall_avg_rush_success_allowed_rate = (avg_rush_success_allowed_rate_home + avg_rush_success_allowed_rate_visitor) / 2

# Creating a dataframe for the new defensive features
new_defensive_features = pd.DataFrame({
    'Team': team_features_expanded['Team'],
    'AvgYardsAllowedPerPlay': overall_avg_yards_allowed_per_play.values,
    'AvgYardsAllowedPerGame': overall_avg_yards_allowed_per_game.values,
    'AvgPassCompletionAllowedRate': overall_avg_pass_completion_allowed_rate.values,
    'AvgTouchdownsAllowedPerGame': overall_avg_touchdowns_allowed_per_game.values,
    'AvgRushSuccessAllowedRate': overall_avg_rush_success_allowed_rate.values
})

# Merging with the existing combined features
team_features_complete = team_features_expanded.merge(new_defensive_features, on='Team')

team_features_complete
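
To separate what a team gains from what it allows, one option is to group by the `OffenseTeam` and `DefenseTeam` columns that the play-by-play data already has. This is a minimal sketch on made-up plays, not the method used above:

```python
import pandas as pd

# Hypothetical plays: each row records who had the ball and who was defending
plays = pd.DataFrame({
    "OffenseTeam": ["GB", "GB", "MIN", "MIN"],
    "DefenseTeam": ["MIN", "MIN", "GB", "GB"],
    "Yards": [5, 7, 2, 4],
})

# Yards gained per play: group by the team on offense
yards_per_play = plays.groupby("OffenseTeam")["Yards"].mean()

# Yards allowed per play: group the same column by the team on defense
yards_allowed_per_play = plays.groupby("DefenseTeam")["Yards"].mean()

print(yards_per_play.to_dict())
print(yards_allowed_per_play.to_dict())
```

With this split, GB's offensive and defensive numbers come from different plays, so the two features carry different information.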


### Encoding upcoming games with features

In [None]:
# Reload the upcoming games data
upcoming_games = pd.read_csv("/content/BEST New_Upcoming_Schedule.csv")

# Feature encoding: merging the upcoming games data with the team features data
upcoming_encoded_home = upcoming_games.merge(team_features_complete, left_on='Home', right_on='Team', how='left')
upcoming_encoded_both = upcoming_encoded_home.merge(team_features_complete, left_on='Visitor', right_on='Team', suffixes=('_Home', '_Visitor'), how='left')

In [None]:
upcoming_encoded_both

In [None]:
# Calculate the difference in features as this might be a more predictive representation
for col in ['AvgPointsScored', 'AvgPointsAllowed', 'WinRate', 'AvgPointsDefended', 'AvgConcededPlays', 'AvgForcedTurnovers',
            'AvgYardsPerPlay', 'AvgYardsPerGame', 'AvgPassCompletionRate', 'AvgTouchdownsPerGame', 'AvgRushSuccessRate',
            'AvgYardsAllowedPerPlay', 'AvgYardsAllowedPerGame', 'AvgPassCompletionAllowedRate', 'AvgTouchdownsAllowedPerGame', 'AvgRushSuccessAllowedRate']:
    upcoming_encoded_both[f'Diff_{col}'] = upcoming_encoded_both[f'{col}_Home'] - upcoming_encoded_both[f'{col}_Visitor']

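# Selecting only the difference columns and the teams for clarity
# (.copy() so that adding the prediction column later does not modify a slice of upcoming_encoded_both)
upcoming_encoded_final = upcoming_encoded_both[['Home', 'Visitor'] + [col for col in upcoming_encoded_both.columns if 'Diff_' in col]].copy()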

upcoming_encoded_final

### Why use the difference for features?

**Predictive Power:**  Differences in team metrics might be more predictive of game outcomes than the absolute metrics of each team. For instance, if one team scores on average 10 points more than another team, this differential might be a stronger predictor of the game's outcome than knowing each team's average score in isolation.

**Simplification for Modeling:** By creating differential features, the model can focus on the relative strengths and weaknesses between the two teams, potentially simplifying the decision-making process of the model.

**Normalization:** Teams might play in different conditions, against different opponents, or might have had easy/hard schedules. Calculating the difference between their stats might normalize some of these external factors.
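
As a small worked example of the idea, here is the same merge-then-subtract pattern on a single made-up matchup. The team numbers are hypothetical and only two features are used:

```python
import pandas as pd

# Hypothetical per-team features
team_features = pd.DataFrame({
    "Team": ["KC", "BAL"],
    "AvgPointsScored": [27.0, 20.0],
    "WinRate": [0.75, 0.5],
})
matchups = pd.DataFrame({"Home": ["KC"], "Visitor": ["BAL"]})

# Attach the home team's features, then the visitor's, with suffixes to tell them apart
home = matchups.merge(team_features, left_on="Home", right_on="Team").drop(columns="Team")
both = home.merge(
    team_features, left_on="Visitor", right_on="Team", suffixes=("_Home", "_Visitor")
).drop(columns="Team")

# Home minus visitor for each feature
for col in ["AvgPointsScored", "WinRate"]:
    both[f"Diff_{col}"] = both[f"{col}_Home"] - both[f"{col}_Visitor"]

print(both[["Home", "Visitor", "Diff_AvgPointsScored", "Diff_WinRate"]])
```

A positive difference favors the home team on that feature and a negative one favors the visitor.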

## Step 4: Training Data Preparation

In [None]:
# Prepare training data

# Merge play-by-play data with team features for home teams
training_encoded_home = all_data.merge(team_features_complete, left_on='Home', right_on='Team', how='left')
# Merge the result with team features for visitor teams
training_encoded_both = training_encoded_home.merge(team_features_complete, left_on='Visitor', right_on='Team', suffixes=('_Home', '_Visitor'), how='left')

# Calculate the difference in features
for col in ['AvgPointsScored', 'AvgPointsAllowed', 'WinRate', 'AvgPointsDefended', 'AvgConcededPlays', 'AvgForcedTurnovers',
            'AvgYardsPerPlay', 'AvgYardsPerGame', 'AvgPassCompletionRate', 'AvgTouchdownsPerGame', 'AvgRushSuccessRate',
            'AvgYardsAllowedPerPlay', 'AvgYardsAllowedPerGame', 'AvgPassCompletionAllowedRate', 'AvgTouchdownsAllowedPerGame', 'AvgRushSuccessAllowedRate']:
    training_encoded_both[f'Diff_{col}'] = training_encoded_both[f'{col}_Home'] - training_encoded_both[f'{col}_Visitor']

# Filtering out the required columns
training_data = training_encoded_both[[col for col in training_encoded_both.columns if 'Diff_' in col]]
training_labels = training_encoded_both['HomeWon']

In [None]:
training_data.head()

In [None]:
training_data.shape

In [None]:
training_labels.head()

## Step 5: AI Model Training

NOTE: There will be an error in this block, read on to find out how we fix it

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Initialize the logistic regression model
logreg = LogisticRegression(max_iter=1000)

# Evaluate the model's performance using cross-validation
cross_val_scores = cross_val_score(logreg, training_data, training_labels, cv=5)

cross_val_scores_mean = cross_val_scores.mean()

cross_val_scores_mean


In [None]:
# Checking the shape of the training data
training_data.shape

#### You will get an error like this

The error is caused by NaN values in the training data:

"ValueError: Input contains NaN, infinity or a value too large for dtype('float64')."

Let's see how many rows are affected

In [None]:
# Checking for NaN values in the training data
nan_columns = training_data.columns[training_data.isna().any()].tolist()

# Displaying columns with NaN values and the number of NaN values in them
nan_counts = training_data[nan_columns].isna().sum()
nan_counts


The output shows 3,288 rows in our training data where the difference values for the listed columns are NaN.

This suggests that for these rows the home team, the visitor team, or both had no matching entry in our team features dataset, which produces NaN when the difference is calculated.

We would need to take a closer look at the data to find out why. It could be games that were called off, plays with NaN values, or something else.

However, the rows with NaN values make up only about 6.5% of our total training data.

Excluding this proportion of the data might result in a slight reduction in the training data's robustness and diversity. However, given that it's a relatively small fraction of the total, excluding these rows likely won't have a major detrimental effect on the model's training and performance.

And since it may take significantly more time to solve this issue, I decided that I would remove these rows, leave it as it is, and live with the results. For a more important project where the highest accuracy is absolutely crucial, you would likely need to investigate this further.
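
If you do want to investigate, one way to start is to ask the merge which rows found no partner. This is a hedged sketch on stand-in frames; the `all_data` and `team_features` here are tiny made-up versions, and the missing `Home` value is only one possible cause.

```python
import pandas as pd

# Hypothetical stand-ins: one play row has no Home/Visitor because the scores merge missed it
all_data = pd.DataFrame({"Home": ["KC", None, "KC"], "Visitor": ["BAL", None, "BAL"]})
team_features = pd.DataFrame({"Team": ["KC", "BAL"], "WinRate": [0.75, 0.5]})

# indicator=True adds a _merge column saying whether each left row found a partner
check = all_data.merge(team_features, left_on="Home", right_on="Team", how="left", indicator=True)
unmatched = check[check["_merge"] == "left_only"]

print(len(unmatched))
print(unmatched["Home"].unique())
```

Looking at the distinct `Home` values among the unmatched rows tells you whether the problem is a missing team name or a team absent from the features table.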

In [None]:
# Exclude rows with NaN values from the training data and labels
training_data_cleaned = training_data.dropna()
training_labels_cleaned = training_labels.loc[training_data_cleaned.index]

# Checking the shape of the cleaned data
training_data_cleaned.shape, training_labels_cleaned.shape


The rows with NaN values have been successfully excluded. Now, our cleaned training data consists of 47,319 rows (entries) and 16 feature columns.

Let's try again now...

In [None]:
# Re-evaluate the model's performance using cross-validation on the cleaned data
cross_val_scores_cleaned = cross_val_score(logreg, training_data_cleaned, training_labels_cleaned, cv=5)

cross_val_scores_cleaned_mean = cross_val_scores_cleaned.mean()

cross_val_scores_cleaned_mean


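Nice! After 5-fold cross-validation on our training data, the model reaches about 71% accuracy at predicting the winners of the games. Keep in mind that each training row is a single play, so plays from the same game, which share identical features and the same label, can land in both the training and validation folds. That likely makes this estimate optimistic.

If you'd like more info on cross-validation, here is [a gentle intro to it](https://youtu.be/fSytzGwwBVw?si=4aD1WMcJbRkkussn)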
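
Since every play of a game carries the same features and label, one option is to keep all rows from a game in the same fold. Here is a minimal sketch with scikit-learn's `GroupKFold` on synthetic data. The `groups` array stands in for a per-game identifier, which is an assumption about what the real data could provide.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GroupKFold, cross_val_score

rng = np.random.default_rng(0)

# Synthetic data: 40 games, 25 identical play rows each
n_games, plays_per_game = 40, 25
game_features = rng.normal(size=(n_games, 3))
game_labels = (game_features[:, 0] + rng.normal(scale=1.0, size=n_games) > 0).astype(int)

X = pd.DataFrame(
    np.repeat(game_features, plays_per_game, axis=0),
    columns=["Diff_a", "Diff_b", "Diff_c"],
)
y = np.repeat(game_labels, plays_per_game)
groups = np.repeat(np.arange(n_games), plays_per_game)  # one id per game

# GroupKFold never puts rows from the same group in both train and validation
model = LogisticRegression(max_iter=1000)
scores = cross_val_score(model, X, y, cv=GroupKFold(n_splits=5), groups=groups)

print(scores.mean())
```

An alternative with the same effect is to drop duplicate rows so that each game appears once before cross-validating.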

The next step would be to train the model on the entire cleaned training dataset and then use it to predict the outcomes of the upcoming games.

### Train the Model on the Entire Cleaned Data

In [None]:
# Train the logistic regression model on the entire cleaned training dataset
logreg.fit(training_data_cleaned, training_labels_cleaned)

## Step 6: Make Predictions On Upcoming Games

This data is for the upcoming NFL matchups for Week 5 in the 2023 season. Feel free to check after these games have happened to see how accurate the predictions were.

In [None]:
# Predict the probability of the home team winning for the upcoming games
upcoming_game_probabilities = logreg.predict_proba(upcoming_encoded_final[[col for col in upcoming_encoded_final.columns if 'Diff_' in col]])

In [None]:
upcoming_game_probabilities

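Each row holds two probabilities for one game: the first column is the probability that the home team loses, and the second is the probability that the home team wins.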
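
The column order of `predict_proba` follows the model's `classes_` attribute. A tiny sketch on made-up data, with one difference feature and label 1 meaning a home win:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Tiny hypothetical training set: one difference feature, label 1 = home win
X = np.array([[-2.0], [-1.0], [1.0], [2.0]])
y = np.array([0, 0, 1, 1])

model = LogisticRegression().fit(X, y)
proba = model.predict_proba(np.array([[1.5]]))

# Columns follow model.classes_, so column 0 is P(label 0) and column 1 is P(label 1)
print(model.classes_)
print(proba)
```

Checking `classes_` before slicing a column is a cheap way to confirm which column is the home-win probability.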

In [None]:
# Extract the probability that the home team will win (second column of the result)
upcoming_game_prob_home_win = upcoming_game_probabilities[:, 1]

# Add the predictions to the upcoming games dataframe
upcoming_encoded_final['HomeWinProbability'] = upcoming_game_prob_home_win

# Sort by the probability of the home team winning for better visualization
upcoming_predictions = upcoming_encoded_final[['Home', 'Visitor', 'HomeWinProbability']].sort_values(by='HomeWinProbability', ascending=False)

upcoming_predictions

And there we have it! In descending order, here is the probability that the home team will win each of these matchups. Cool, right?

Hopefully these numbers turn out to be accurate; I'll check back later myself to see. If you have any questions, feel free to comment on the YouTube video and I'll try to answer them as best I can. Here is a link back to the video: https://youtu.be/ViaGirGFJZY?si=6FzpJDXvc45G2Ozs

That said, this approach is not without its flaws, so I think it'd be good to point out a few of them.

## Flaws with this approach

Some of the flaws with this approach:

*   Doesn't account for injuries carried over from last season to this one
*   Doesn't account for team trades, hirings, or firings
*   Doesn't account for weather



