<a href="https://colab.research.google.com/github/tyslas/CS5265-tyslas-nfl-spread-line-outcomes/blob/main/NFLSpreadAndLineOutcomes.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


# Project to predict future NFL Spread & Line outcomes

## Author: Tito Yslas

### Background
I enjoy watching the NFL and playing fantasy football. I also like to place bets on games using apps like FanDuel. The purpose of this project is to increase my understanding of the NFL betting market and possibly create a machine learning model to give myself an edge next season.

### Project Description
I found a dataset from Kaggle titled [NFL scores and betting data](https://www.kaggle.com/datasets/tobycrabtree/nfl-scores-and-betting-data?resource=download). This dataset has over 13,000 samples and 17 features. The goal is to use this dataset to train a model that will predict the winner of a given game.

### Performance Metric
For my performance metric, I will aim for the model to have at least 70% accuracy.

## Import Libraries

In [139]:
import pandas as pd
import numpy as np
import os
import seaborn as sns
import matplotlib.pyplot as plt

## Load Data

### Data Dictionary
- schedule_date: Date that the game took place. This column is a date in the format MM/DD/YYYY
- schedule_season: Year the season began. NFL seasons start in the fall and end before the spring of the following year, so the 2023 season refers to the years 2023-2024. This column is a number in the format YYYY
- schedule_week: Week of the NFL season. This column is either a number during regular season weeks or a string in playoff weeks. For the purposes of this project the playoff weeks will be converted to numbers and the schedule_playoff column will be used to determine whether the week is a playoff game or not
- schedule_playoff: This column is a boolean. FALSE is regular season and TRUE is playoffs
- team_home: Name of the home team. This column is a string
- score_home: Points scored by the home team. This column is a number
- score_away: Points scored by the away team. This column is a number
- winner: This column will be a feature derived from the score_home and score_away columns that will use one hot encoding - if team_home scores more points this will be a 1 - if team_home scores fewer points it will be a 0
- team_away: Name of the away team. This column is a string
- team_favorite_id: Acronym of the team that was determined most likely to win by the betting market. It is either two or three letters. For the purposes of this project this column will be changed to be either the team_home or team_away name
- team_home_favorite: this will represent the encoded team_favorite_id - if team_home is favored this column will be marked as a 1 - if it's a zero then we know that team_away is favored
- spread_favorite: The number of points that the favored team needs to win by for a bet placed on the spread of the favorite to win. This column will either be a negative number or zero
- over_under_line: The number of points that both teams combined need to score for a bet placed on the 'line' to win. This column is a positive number
- stadium: Name of the venue where the game is played
- stadium_neutral: This column is a boolean. FALSE is not a neutral venue and TRUE is a neutral venue - for the purposes of this project this column will be one hot encoded with a neutral venue being marked as a 1 and non-neutral marked as a 0
- weather_temperature: The temperature in Fahrenheit at the venue where the game is played. This column is a number
- weather_wind_mph: The speed of wind in miles per hour. This column is a number
- weather_humidity: The measurement of water vapor in the air during the game measured as a percentage. This column is a number
- weather_detail: Other information about the weather conditions - if the venue is indoor or the venue has a retractable roof. This column is a string


## Exploratory Data Analysis
### Questions to answer with EDA:
1. Which columns, if any, should have their data type modified to better train my model?
1. Which columns, if any, should I remove from the training and test data so that the model can be effectively trained?
1. Which columns, if any, should I remove or insert derived data for in the case that there is a lot of missing data?
1. What features could it make sense to introduce to improve the training and performance of my model?

In [140]:
# download the data files, then print all file names in the directory
!wget -O spreadspoke_scores.csv https://raw.githubusercontent.com/tyslas/CS5265-tyslas-nfl-spread-line-outcomes/main/spreadspoke_scores.csv
!wget -O data_dictionary.csv https://raw.githubusercontent.com/tyslas/CS5265-tyslas-nfl-spread-line-outcomes/main/data_dictionary.csv
!wget -O team_ids.py https://raw.githubusercontent.com/tyslas/CS5265-tyslas-nfl-spread-line-outcomes/main/team_ids.py
for file in os.listdir():
  print(file)

--2023-06-19 22:04:29--  https://raw.githubusercontent.com/tyslas/CS5265-tyslas-nfl-spread-line-outcomes/main/spreadspoke_scores.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1441380 (1.4M) [text/plain]
Saving to: ‘spreadspoke_scores.csv’


2023-06-19 22:04:29 (24.1 MB/s) - ‘spreadspoke_scores.csv’ saved [1441380/1441380]

--2023-06-19 22:04:29--  https://raw.githubusercontent.com/tyslas/CS5265-tyslas-nfl-spread-line-outcomes/main/data_dictionary.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2502 (2.4K) [text/plain]
Sa

In [141]:
scores = pd.read_csv('spreadspoke_scores.csv')

In [142]:
scores.info()
# scores.describe()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13516 entries, 0 to 13515
Data columns (total 17 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   schedule_date        13516 non-null  object 
 1   schedule_season      13516 non-null  int64  
 2   schedule_week        13516 non-null  object 
 3   schedule_playoff     13516 non-null  bool   
 4   team_home            13516 non-null  object 
 5   score_home           13516 non-null  int64  
 6   score_away           13516 non-null  int64  
 7   team_away            13516 non-null  object 
 8   team_favorite_id     11037 non-null  object 
 9   spread_favorite      11037 non-null  float64
 10  over_under_line      11027 non-null  object 
 11  stadium              13516 non-null  object 
 12  stadium_neutral      13516 non-null  bool   
 13  weather_temperature  12309 non-null  float64
 14  weather_wind_mph     12293 non-null  float64
 15  weather_humidity     8468 non-null  

### Answer to Question 1
- I think that it makes sense to modify both team_home and team_away to have the same IDs as team_favorite_id to make it easier for the model to identify when team_home is the same as or different from team_favorite_id
- Currently Pandas is identifying schedule_date as an object. It could make sense to see if there's a way for Pandas to identify this as a date
- Currently Pandas is identifying schedule_week as an object. This is because some of the data in this column is in string format. It could make sense to convert the string entries in this column to int64 and use the schedule_playoff column as the sole determination of whether or not the schedule_week is a playoff game
- Currently Pandas is identifying the over_under_line column as an object data type despite the fact that it should be a float. I will explore how to ensure that this column's data type is correctly identified
- The stadium_neutral column is currently a boolean type. I think I will convert this to use one hot encoding instead

### Answer to Question 2
- Based on my initial examination, I'm not sure if it makes sense to remove any of my columns from the data set on which I will train my model

In [143]:
scores.isna().sum() # number of missing values for each column

schedule_date              0
schedule_season            0
schedule_week              0
schedule_playoff           0
team_home                  0
score_home                 0
score_away                 0
team_away                  0
team_favorite_id        2479
spread_favorite         2479
over_under_line         2489
stadium                    0
stadium_neutral            0
weather_temperature     1207
weather_wind_mph        1223
weather_humidity        5048
weather_detail         10597
dtype: int64

### Answer to Question 3
- The columns with missing data are team_favorite_id, spread_favorite, over_under_line, weather_temperature, weather_wind_mph, weather_humidity, and weather_detail. These columns have many missing values because this data was not collected in earlier seasons. For example, there is minimal team_favorite_id information for the 1978 schedule_season and earlier, likely because of the lack of public betting information before that time
- For the missing data I don't think that it makes sense to remove these columns; however, it might make sense to only train and test on the observations from the 1979 schedule_season and beyond
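
A minimal sketch of restricting the data to the 1979 season and beyond, assuming a small made-up frame with the same column names (the values and the `first_season` name are illustrative, not taken from the dataset):

```python
import pandas as pd

# toy rows standing in for the real data; values are made up for illustration
toy = pd.DataFrame({
    'schedule_season': [1977, 1978, 1979, 1980],
    'team_favorite_id': [None, 'PIT', 'TB', 'MIA'],
    'spread_favorite': [None, -3.5, -3.0, -5.0],
})

first_season = 1979  # assumed cutoff, taken from the note above

# keep only games from the cutoff season onward and renumber the rows
modern = toy[toy['schedule_season'] >= first_season].reset_index(drop=True)

# confirm the betting columns have no gaps left in the filtered frame
print(modern[['team_favorite_id', 'spread_favorite']].isna().sum())
```

Filtering on `schedule_season` does not depend on the row order of the CSV, which may make it easier to re-run if the file is updated.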

### Answer to Question 4
- I think it makes sense to introduce/derive three different target columns for understanding the performance of the model
- The three target columns I am thinking about introducing are derived from score_home, score_away, and over_under_line
- These targets would be one hot encoded as team_home_win, team_home_cover_spread, and cover_line
- team_home_win would be a 1 if team_home wins or a 0 if they lose
- team_home_cover_spread would be a 1 if they cover the spread_favorite and a 0 if they don't
- cover_line would be a 1 if score_home + score_away is greater than the over_under_line and a 0 if it's less than

## Feature Engineering
1. convert `schedule_date` column to actually be read in as a date/time object instead of generic object
1. convert `schedule_week` to only be numbers - this could be more challenging than I initially thought because there will not be direct mappings for the non-numeric values `Division`, `Wildcard`, `Conference`, and `Superbowl`, since the NFL has expanded the number of games played during the regular season over the years and added in the `Wildcard` games
1. convert the `schedule_playoff` column from true/false to 1/0
1. convert the `stadium_neutral` column from true/false to 1/0
1. add a target/feature of `winning_team` to be derived from `score_home` and `score_away` to make it easier to determine how the model performs
1. convert all `team_home` and `team_away` entries to the acronym identifiers
1. drop rows 0 - 2499 because they don't have data for `team_favorite_id`, `spread_favorite` and `over_under_line`
1. create new target column of `favorite_won` with 1 for true and 0 for false
1. create new target column of `spread_favorite_covered` with 1 for true and 0 for false
1. impute the values for `over_under_line`, `weather_temperature`, and `weather_wind_mph`
1. create new target column of `over_under_covered`
1. convert the `weather_detail` column to a categorical variable that is one-hot encoded


In [144]:
# convert schedule_date to proper date type
scores['schedule_date'] = pd.to_datetime(scores['schedule_date'])
schedule_date_data_type = scores['schedule_date'].dtype
print('data type of schedule_date: ', schedule_date_data_type)

data type of schedule_date:  datetime64[ns]
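
A small sketch of the same conversion with an explicit format string, assuming every row follows the MM/DD/YYYY layout described in the data dictionary (the sample strings below are made up):

```python
import pandas as pd

# sample strings in the MM/DD/YYYY layout described in the data dictionary
dates = pd.Series(['09/01/1979', '01/21/1979', '02/12/2023'])

# an explicit format removes day/month ambiguity and raises on rows that do not match
parsed = pd.to_datetime(dates, format='%m/%d/%Y')
print(parsed.dt.year.tolist())
```

If some rows in the real file use a different layout, this call would raise an error instead of silently guessing, which can be useful during cleaning.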


In [145]:
# convert schedule_week to only be numbers
scores['schedule_week'].value_counts()
# this may require a fair amount of time spent manually mapping week numbers

2             831
13            829
1             826
14            826
12            825
11            808
3             782
10            772
4             757
9             754
7             752
8             751
5             745
6             740
15            701
16            687
17            527
Division      217
Wildcard      168
Conference    115
Superbowl      57
18             46
Name: schedule_week, dtype: int64
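
One possible way to avoid mapping the playoff weeks by hand is to number each playoff round relative to the last regular-season week seen in that season. The sketch below is only an illustration on made-up rows: the `playoff_offset` ordering and the `schedule_week_num` column name are assumptions, and seasons without a `Wildcard` round would simply skip a number while keeping the rounds in order.

```python
import pandas as pd

# toy rows; the real column mixes week numbers (as strings) with round names
toy = pd.DataFrame({
    'schedule_season': [1979, 1979, 1979, 2022, 2022, 2022],
    'schedule_week': ['1', '16', 'Superbowl', '18', 'Wildcard', 'Conference'],
})

# assumed ordering of the playoff rounds, as an offset past the last regular week
playoff_offset = {'Wildcard': 1, 'Division': 2, 'Conference': 3, 'Superbowl': 4}

# regular-season weeks parse as numbers; round names become NaN
week_number = pd.to_numeric(toy['schedule_week'], errors='coerce')

# last regular-season week observed in each season
last_regular_week = week_number.groupby(toy['schedule_season']).transform('max')

# playoff rows get the last regular week plus their round offset
offsets = toy['schedule_week'].map(playoff_offset)
toy['schedule_week_num'] = week_number.fillna(last_regular_week + offsets).astype(int)
print(toy)
```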

In [228]:
# convert the schedule_playoff column from true/false into 1/0

scores['schedule_playoff'] = scores['schedule_playoff'].astype(int)
scores.tail()


Unnamed: 0,schedule_date,schedule_season,schedule_week,schedule_playoff,team_home,score_home,score_away,team_away,team_favorite_id,spread_favorite,over_under_line,stadium,stadium_neutral,weather_temperature,weather_wind_mph,weather_humidity,weather_detail,winning_team
13511,2023-01-22,2022,Division,1,BUF,10,27,CIN,BUF,-6.0,48.5,Highmark Stadium,0,32.0,4.0,100.0,snow,CIN
13512,2023-01-22,2022,Division,1,SF,19,12,DAL,SF,-3.5,46.5,Levi's Stadium,0,55.0,19.0,47.0,,SF
13513,2023-01-29,2022,Conference,1,KC,23,20,CIN,KC,-1.5,48.0,GEHA Field at Arrowhead Stadium,0,22.0,13.0,55.0,,KC
13514,2023-01-29,2022,Conference,1,PHI,31,7,SF,PHI,-2.5,45.5,Lincoln Financial Field,0,52.0,14.0,48.0,rain,PHI
13515,2023-02-12,2022,Superbowl,1,PHI,35,38,KC,PHI,-1.0,51.0,State Farm Stadium,1,76.0,8.0,8.0,retractable (open roof),KC


In [227]:
# convert the stadium_neutral column from true/false into 1/0

scores['stadium_neutral'] = scores['stadium_neutral'].astype(int)
scores.tail()


Unnamed: 0,schedule_date,schedule_season,schedule_week,schedule_playoff,team_home,score_home,score_away,team_away,team_favorite_id,spread_favorite,over_under_line,stadium,stadium_neutral,weather_temperature,weather_wind_mph,weather_humidity,weather_detail,winning_team
13511,2023-01-22,2022,Division,1,BUF,10,27,CIN,BUF,-6.0,48.5,Highmark Stadium,0,32.0,4.0,100.0,snow,CIN
13512,2023-01-22,2022,Division,1,SF,19,12,DAL,SF,-3.5,46.5,Levi's Stadium,0,55.0,19.0,47.0,,SF
13513,2023-01-29,2022,Conference,1,KC,23,20,CIN,KC,-1.5,48.0,GEHA Field at Arrowhead Stadium,0,22.0,13.0,55.0,,KC
13514,2023-01-29,2022,Conference,1,PHI,31,7,SF,PHI,-2.5,45.5,Lincoln Financial Field,0,52.0,14.0,48.0,rain,PHI
13515,2023-02-12,2022,Superbowl,1,PHI,35,38,KC,PHI,-1.0,51.0,State Farm Stadium,1,76.0,8.0,8.0,retractable (open roof),KC


In [229]:
# convert all team_home and team_away entries to the acronym identifiers

# map of team names and their corresponding IDs
team_ids = {
    'San Francisco 49ers': 'SF',
    'Dallas Cowboys': 'DAL',
    'Pittsburgh Steelers': 'PIT',
    'Green Bay Packers': 'GB',
    'Philadelphia Eagles': 'PHI',
    'Minnesota Vikings': 'MIN',
    'Denver Broncos': 'DEN',
    'Miami Dolphins': 'MIA',
    'Kansas City Chiefs': 'KC',
    'Buffalo Bills': 'BUF',
    'Chicago Bears': 'CHI',
    'New York Giants': 'NYG',
    'Atlanta Falcons': 'ATL',
    'New Orleans Saints': 'NO',
    'New York Jets': 'NYJ',
    'Detroit Lions': 'DET',
    'Cincinnati Bengals': 'CIN',
    'New England Patriots': 'NE',
    'Washington Redskins': 'WAS',
    'Cleveland Browns': 'CLE',
    # should be SD but all the data uses LAC
    'San Diego Chargers': 'LAC',
    'Seattle Seahawks': 'SEA',
    'Tampa Bay Buccaneers': 'TB',
    # should be OAK but all the data uses LVR
    'Oakland Raiders': 'LVR',
    'Indianapolis Colts': 'IND',
    'Los Angeles Rams': 'LAR',
    'Arizona Cardinals': 'ARI',
    # should be HOU but all the data uses TEN
    'Houston Oilers': 'TEN',
    'Carolina Panthers': 'CAR',
    'Jacksonville Jaguars': 'JAX',
    'Baltimore Ravens': 'BAL',
    'Tennessee Titans': 'TEN',
    # should be STL but all the data uses LAR
    'St. Louis Rams': 'LAR',
    'Houston Texans': 'HOU',
    # should be STL but all the data uses ARI
    'St. Louis Cardinals': 'ARI',
    'Baltimore Colts': 'BAL',
    # should be LAR but all the data uses LVR
    'Los Angeles Raiders': 'LVR',
    'Los Angeles Chargers': 'LAC',
    # should be PHX but all the data uses ARI
    'Phoenix Cardinals': 'ARI',
    # 'Boston Patriots': '', the franchise changed the name of their team to the New England Patriots in 1971
    # the data does not have any team_favorite_id listed before the 12/24/78 in the 1978 season (row 2494)
    'Las Vegas Raiders': 'LVR',
    'Washington Football Team': 'WAS',
    'Tennessee Oilers': 'TEN',
    'Washington Commanders': 'WAS',
}

scores['team_home'] = scores['team_home'].replace(team_ids)
scores['team_away'] = scores['team_away'].replace(team_ids)

scores.loc[2492:2502]

Unnamed: 0,schedule_date,schedule_season,schedule_week,schedule_playoff,team_home,score_home,score_away,team_away,team_favorite_id,spread_favorite,over_under_line,stadium,stadium_neutral,weather_temperature,weather_wind_mph,weather_humidity,weather_detail,winning_team
2492,1978-12-24,1978,Wildcard,1,ATL,14,13,PHI,ATL,-2.5,,Atlanta-Fulton County Stadium,0,43.0,11.0,77.0,,ATL
2493,1978-12-24,1978,Wildcard,1,MIA,9,17,TEN,MIA,-6.5,,Orange Bowl,0,77.0,12.0,78.0,,TEN
2494,1978-12-30,1978,Division,1,DAL,27,20,ATL,DAL,-15.0,,Texas Stadium,0,38.0,15.0,97.0,,DAL
2495,1978-12-30,1978,Division,1,PIT,33,10,DEN,PIT,-7.0,,Three Rivers Stadium,0,30.0,7.0,75.0,,PIT
2496,1978-12-31,1978,Division,1,LAR,34,10,MIN,LAR,-7.5,,Los Angeles Memorial Coliseum,0,53.0,6.0,52.0,,LAR
2497,1978-12-31,1978,Division,1,NE,14,31,TEN,NE,-6.0,,Foxboro Stadium,0,36.0,8.0,70.0,,TEN
2498,1979-01-07,1978,Conference,1,LAR,0,28,DAL,DAL,-3.5,,Los Angeles Memorial Coliseum,0,56.0,8.0,77.0,,DAL
2499,1979-01-07,1978,Conference,1,PIT,34,5,TEN,PIT,-7.0,,Three Rivers Stadium,0,25.0,8.0,85.0,,PIT
2500,1979-01-21,1978,Superbowl,1,DAL,31,35,PIT,PIT,-3.5,37.0,Orange Bowl,1,71.0,18.0,84.0,rain,PIT
2501,1979-09-01,1979,1,0,TB,31,16,DET,TB,-3.0,30.0,Houlihan's Stadium,0,79.0,9.0,87.0,,TB
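
Because `replace` leaves any name that is missing from the dictionary unchanged, it can help to list unmapped names before running it. A minimal sketch, using a shortened hypothetical mapping and made-up rows:

```python
import pandas as pd

# shortened, hypothetical version of the mapping used above
team_ids = {'Dallas Cowboys': 'DAL', 'Pittsburgh Steelers': 'PIT'}

toy = pd.DataFrame({
    'team_home': ['Dallas Cowboys', 'Boston Patriots'],
    'team_away': ['Pittsburgh Steelers', 'Dallas Cowboys'],
})

# collect every team name that appears in either column
all_names = pd.concat([toy['team_home'], toy['team_away']]).unique()

# names missing from the mapping would be left unchanged by replace()
unmapped = sorted(set(all_names) - set(team_ids))
print(unmapped)
```

This check is meant to run on the original full team names, before the columns are converted to IDs.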


In [148]:
# add a target/feature of 'winning_team'
# Derive a new column based on the comparison of two existing columns
def determine_winner(row):
  if row['score_home'] > row['score_away']:
    val = row['team_home']
  elif row['score_home'] == row['score_away']:
    val = 'tie'
  else:
    val = row['team_away']
  return val

scores['winning_team'] = scores.apply(determine_winner, axis=1)
print(scores['winning_team'].tail(20))
print('number of ties:', scores['winning_team'].value_counts()['tie'])

13496    MIA
13497    CAR
13498    PHI
13499    PIT
13500     SF
13501    SEA
13502    WAS
13503    JAX
13504     SF
13505    BUF
13506    CIN
13507    NYG
13508    DAL
13509     KC
13510    PHI
13511    CIN
13512     SF
13513     KC
13514    PHI
13515     KC
Name: winning_team, dtype: object
number of ties: 91
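
The same winner logic can also be written without a row-by-row `apply`. This is a sketch on made-up rows using `np.select`, not a change to the cell above:

```python
import numpy as np
import pandas as pd

# toy rows with the same column names; the scores are made up
toy = pd.DataFrame({
    'team_home': ['DAL', 'BUF', 'PHI'],
    'team_away': ['PIT', 'MIA', 'CIN'],
    'score_home': [31, 7, 13],
    'score_away': [35, 9, 13],
})

home = toy['team_home'].to_numpy(dtype=object)
away = toy['team_away'].to_numpy(dtype=object)

# conditions are checked in order; rows matching neither condition are ties
toy['winning_team'] = np.select(
    [toy['score_home'] > toy['score_away'], toy['score_home'] < toy['score_away']],
    [home, away],
    default='tie',
)
print(toy['winning_team'].tolist())
```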


In [149]:
# drop rows 0 - 2499 because they don't have data for team_favorite_id, spread_favorite and over_under_line
start_index = 0
end_index = 2500

scores_dropped = scores.drop(scores.index[start_index:end_index])
scores_dropped.reset_index(drop = True, inplace = True)
scores_dropped.head(20)

Unnamed: 0,schedule_date,schedule_season,schedule_week,schedule_playoff,team_home,score_home,score_away,team_away,team_favorite_id,spread_favorite,over_under_line,stadium,stadium_neutral,weather_temperature,weather_wind_mph,weather_humidity,weather_detail,winning_team
0,1979-01-21,1978,Superbowl,1,DAL,31,35,PIT,PIT,-3.5,37.0,Orange Bowl,True,71.0,18.0,84.0,rain,PIT
1,1979-09-01,1979,1,0,TB,31,16,DET,TB,-3.0,30.0,Houlihan's Stadium,False,79.0,9.0,87.0,,TB
2,1979-09-02,1979,1,0,BUF,7,9,MIA,MIA,-5.0,39.0,Ralph Wilson Stadium,False,74.0,15.0,74.0,,MIA
3,1979-09-02,1979,1,0,CHI,6,3,GB,CHI,-3.0,31.0,Soldier Field,False,78.0,11.0,68.0,,CHI
4,1979-09-02,1979,1,0,DEN,10,0,CIN,DEN,-3.0,31.5,Mile High Stadium,False,69.0,6.0,38.0,,DEN
5,1979-09-02,1979,1,0,KC,14,0,BAL,KC,-1.0,37.0,Arrowhead Stadium,False,76.0,8.0,71.0,,KC
6,1979-09-02,1979,1,0,LAR,17,24,LVR,LAR,-4.0,36.5,Anaheim Stadium,False,70.0,10.0,77.0,,LVR
7,1979-09-02,1979,1,0,MIN,28,22,SF,MIN,-7.0,32.0,Metropolitan Stadium,False,70.0,11.0,67.0,,MIN
8,1979-09-02,1979,1,0,NO,34,40,ATL,NO,-5.0,32.0,Louisiana Superdome,False,72.0,0.0,,indoor,ATL
9,1979-09-02,1979,1,0,NYJ,22,25,CLE,NYJ,-2.0,41.0,Giants Stadium,False,73.0,10.0,76.0,,CLE


In [150]:
# create new target column of favorite_won with 1 for true and 0 for false
# the favorite_won column is returning a float instead of an int -- need to fix
# scores_dropped.drop('favorite_won', inplace = True)

def determine_favorite_won(row):
  result = 1
  if row['winning_team'] != row['team_favorite_id']:
    result = 0
  return int(result)

scores_dropped['favorite_won'] = scores_dropped.apply(determine_favorite_won, axis = 1)
scores_dropped.head()

Unnamed: 0,schedule_date,schedule_season,schedule_week,schedule_playoff,team_home,score_home,score_away,team_away,team_favorite_id,spread_favorite,over_under_line,stadium,stadium_neutral,weather_temperature,weather_wind_mph,weather_humidity,weather_detail,winning_team,favorite_won
0,1979-01-21,1978,Superbowl,1,DAL,31,35,PIT,PIT,-3.5,37.0,Orange Bowl,True,71.0,18.0,84.0,rain,PIT,1
1,1979-09-01,1979,1,0,TB,31,16,DET,TB,-3.0,30.0,Houlihan's Stadium,False,79.0,9.0,87.0,,TB,1
2,1979-09-02,1979,1,0,BUF,7,9,MIA,MIA,-5.0,39.0,Ralph Wilson Stadium,False,74.0,15.0,74.0,,MIA,1
3,1979-09-02,1979,1,0,CHI,6,3,GB,CHI,-3.0,31.0,Soldier Field,False,78.0,11.0,68.0,,CHI,1
4,1979-09-02,1979,1,0,DEN,10,0,CIN,DEN,-3.0,31.5,Mile High Stadium,False,69.0,6.0,38.0,,DEN,1


In [152]:
# create new target column of spread_favorite_covered with 1 for true and 0 for false
# note: this compares the home team's margin to the spread, so it only reflects the favorite when team_home is the favorite

def determine_favorite_covered_spread(row):
  return 1 if row['score_home'] - row['score_away'] > abs(row['spread_favorite']) else 0

scores_dropped['spread_favorite_covered'] = scores_dropped.apply(determine_favorite_covered_spread, axis = 1)

counts = scores_dropped['spread_favorite_covered'].value_counts()
# look the counts up by label because value_counts() orders by frequency, not by label
favorite_missed = counts.get(0, 0)
favorite_covered = counts.get(1, 0)
percent_favorite_won = round((favorite_covered / (favorite_covered + favorite_missed)) * 100, 2)

print(f"the favorite team covered the spread {percent_favorite_won}% of the time\n")

display(scores_dropped.head(20))

the favorite team covered the spread 40.52% of the time



Unnamed: 0,schedule_date,schedule_season,schedule_week,schedule_playoff,team_home,score_home,score_away,team_away,team_favorite_id,spread_favorite,over_under_line,stadium,stadium_neutral,weather_temperature,weather_wind_mph,weather_humidity,weather_detail,winning_team,favorite_won,spread_favorite_covered
0,1979-01-21,1978,Superbowl,1,DAL,31,35,PIT,PIT,-3.5,37.0,Orange Bowl,True,71.0,18.0,84.0,rain,PIT,1,0
1,1979-09-01,1979,1,0,TB,31,16,DET,TB,-3.0,30.0,Houlihan's Stadium,False,79.0,9.0,87.0,,TB,1,1
2,1979-09-02,1979,1,0,BUF,7,9,MIA,MIA,-5.0,39.0,Ralph Wilson Stadium,False,74.0,15.0,74.0,,MIA,1,0
3,1979-09-02,1979,1,0,CHI,6,3,GB,CHI,-3.0,31.0,Soldier Field,False,78.0,11.0,68.0,,CHI,1,0
4,1979-09-02,1979,1,0,DEN,10,0,CIN,DEN,-3.0,31.5,Mile High Stadium,False,69.0,6.0,38.0,,DEN,1,1
5,1979-09-02,1979,1,0,KC,14,0,BAL,KC,-1.0,37.0,Arrowhead Stadium,False,76.0,8.0,71.0,,KC,1,1
6,1979-09-02,1979,1,0,LAR,17,24,LVR,LAR,-4.0,36.5,Anaheim Stadium,False,70.0,10.0,77.0,,LVR,0,0
7,1979-09-02,1979,1,0,MIN,28,22,SF,MIN,-7.0,32.0,Metropolitan Stadium,False,70.0,11.0,67.0,,MIN,1,0
8,1979-09-02,1979,1,0,NO,34,40,ATL,NO,-5.0,32.0,Louisiana Superdome,False,72.0,0.0,,indoor,ATL,0,0
9,1979-09-02,1979,1,0,NYJ,22,25,CLE,NYJ,-2.0,41.0,Giants Stadium,False,73.0,10.0,76.0,,CLE,0,0
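
The cell above measures the margin from the home team's side, so rows where the away team is the favorite (row 0, for example, where PIT won by 4 against a 3.5 spread) are marked as not covered. A minimal sketch of one way to measure the margin from the favorite's side instead, on made-up rows with the same column names; counting a push as not covered is an assumption:

```python
import numpy as np
import pandas as pd

# toy rows with the same column names as the real data
toy = pd.DataFrame({
    'team_home': ['DAL', 'TB', 'BUF'],
    'team_away': ['PIT', 'DET', 'MIA'],
    'team_favorite_id': ['PIT', 'TB', 'MIA'],
    'score_home': [31, 31, 7],
    'score_away': [35, 16, 9],
    'spread_favorite': [-3.5, -3.0, -5.0],
})

home_margin = toy['score_home'] - toy['score_away']

# flip the sign of the margin when the away team is the favorite
home_is_favorite = toy['team_home'] == toy['team_favorite_id']
favorite_margin = np.where(home_is_favorite, home_margin, -home_margin)

# covered means the favorite won by more than the spread
toy['spread_favorite_covered'] = (favorite_margin > toy['spread_favorite'].abs()).astype(int)
print(toy['spread_favorite_covered'].tolist())
```

Applying this to the full data would change the percentage printed above, so the recorded output would need to be regenerated.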


In [165]:
# impute the values for `over_under_line`, `weather_temperature`, and `weather_wind_mph` with the median values of each

# assign the filled column back instead of calling fillna(inplace = True) on a column selection
scores_dropped['over_under_line'] = scores_dropped['over_under_line'].fillna(scores_dropped['over_under_line'].median())
scores_dropped['weather_temperature'] = scores_dropped['weather_temperature'].fillna(scores_dropped['weather_temperature'].median())
scores_dropped['weather_wind_mph'] = scores_dropped['weather_wind_mph'].fillna(scores_dropped['weather_wind_mph'].median())
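
Since `over_under_line` is read in as an object column, the median may be easier to reason about if the column is cast to a number first. A minimal sketch on a made-up column (the blank-string entry is an assumption about what the real column might contain):

```python
import pandas as pd

# toy column: numbers stored as strings, one blank entry, and one missing value
toy = pd.DataFrame({
    'over_under_line': pd.Series(['37.0', '30', ' ', None, '41'], dtype=object)
})

# cast first so blanks become NaN and the median is computed on real numbers
toy['over_under_line'] = pd.to_numeric(toy['over_under_line'], errors='coerce')

median_line = toy['over_under_line'].median()

# fill the gaps with the median and assign the result back to the column
toy['over_under_line'] = toy['over_under_line'].fillna(median_line)
print(median_line, toy['over_under_line'].tolist())
```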


In [166]:
# create new target column of over_under_covered

# need to cast over_under_line column to a float -- pandas recognizes it as a string
scores_dropped['over_under_line'] = pd.to_numeric(scores_dropped['over_under_line'], errors='coerce')
# scores_dropped['over_under_line'].astype(float)
# print(scores_dropped['over_under_line'].dtypes)

def determine_over_under_covered(row):
  return 1 if row['score_home'] + row['score_away'] > row['over_under_line'] else 0

scores_dropped['over_under_covered'] = scores_dropped.apply(determine_over_under_covered, axis = 1)

counts = scores_dropped['over_under_covered'].value_counts()
# look the counts up by label because value_counts() orders by frequency, not by label
over_missed = counts.get(0, 0)
over_covered = counts.get(1, 0)
percent_over_covered = round((over_covered / (over_covered + over_missed)) * 100, 2)

print(f'the over hit {percent_over_covered}% of the time\n')

display(scores_dropped[['score_home', 'score_away', 'over_under_line', 'over_under_covered']].head(20))

the over hit 48.24% of the time



Unnamed: 0,score_home,score_away,over_under_line,over_under_covered
0,31,35,37.0,1
1,31,16,30.0,1
2,7,9,39.0,0
3,6,3,31.0,0
4,10,0,31.5,0
5,14,0,37.0,0
6,17,24,36.5,1
7,28,22,32.0,1
8,34,40,32.0,1
9,22,25,41.0,1


In [167]:
# convert the weather_detail column to categorical variable that is one-hot encoded

counts = scores_dropped['weather_detail'].value_counts()

# value_counts() skips missing values, so the sum is the number of rows with a weather detail
totalWeatherDetails = counts.sum()

print(f'total number of rows with weather details: {totalWeatherDetails}')
numRows = scores_dropped.shape[0]
print(f'number of rows: {numRows}\n')

one_hot_encoded = pd.get_dummies(scores_dropped['weather_detail'], prefix = 'conditions')
scores_encoded = pd.concat([scores_dropped, one_hot_encoded], axis = 1)

scores_encoded.head(20)

total number of rows with weather details: 2758
number of rows: 11016



Unnamed: 0,schedule_date,schedule_season,schedule_week,schedule_playoff,team_home,score_home,score_away,team_away,team_favorite_id,spread_favorite,...,spread_favorite_covered,over_under_covered,conditions_fog,conditions_indoor,conditions_rain,conditions_rain | fog,conditions_retractable (open roof),conditions_snow,conditions_snow | Freezing rain,conditions_snow | fog
0,1979-01-21,1978,Superbowl,1,DAL,31,35,PIT,PIT,-3.5,...,0,1,0,0,1,0,0,0,0,0
1,1979-09-01,1979,1,0,TB,31,16,DET,TB,-3.0,...,1,1,0,0,0,0,0,0,0,0
2,1979-09-02,1979,1,0,BUF,7,9,MIA,MIA,-5.0,...,0,0,0,0,0,0,0,0,0,0
3,1979-09-02,1979,1,0,CHI,6,3,GB,CHI,-3.0,...,0,0,0,0,0,0,0,0,0,0
4,1979-09-02,1979,1,0,DEN,10,0,CIN,DEN,-3.0,...,1,0,0,0,0,0,0,0,0,0
5,1979-09-02,1979,1,0,KC,14,0,BAL,KC,-1.0,...,1,0,0,0,0,0,0,0,0,0
6,1979-09-02,1979,1,0,LAR,17,24,LVR,LAR,-4.0,...,0,1,0,0,0,0,0,0,0,0
7,1979-09-02,1979,1,0,MIN,28,22,SF,MIN,-7.0,...,0,1,0,0,0,0,0,0,0,0
8,1979-09-02,1979,1,0,NO,34,40,ATL,NO,-5.0,...,0,1,0,1,0,0,0,0,0,0
9,1979-09-02,1979,1,0,NYJ,22,25,CLE,NYJ,-2.0,...,0,1,0,0,0,0,0,0,0,0
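
Combined values such as `rain | fog` end up as their own columns in the output above. If separate indicators per condition are preferred, one option is `Series.str.get_dummies`. This is a sketch on sample values copied from the column names above, not a replacement for the cell:

```python
import pandas as pd

# sample values copied from the column names in the output above
weather = pd.Series(['rain', 'rain | fog', None, 'snow | fog', 'indoor'])

# normalise the separator, then build one indicator column per condition
flags = (
    weather.str.replace(' | ', '|', regex=False)
           .str.get_dummies(sep='|')
           .add_prefix('conditions_')
)
print(flags)
```

Rows with no weather detail come out as all zeros, the same as with `pd.get_dummies` in the cell above.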


In [168]:
# explore the weather_humidity column because of sparse data

missingHumidity = scores_encoded['weather_humidity'].isna().sum()
totalRows = scores_encoded.shape[0]
percentMissing = round(( missingHumidity / totalRows ) * 100, 2)

print(f'number of missing values for weather_humidity: {missingHumidity}')
print(f'percent of missing values for weather_humidity: {percentMissing}\n')


number of missing values for weather_humidity: 4657
percent of missing values for weather_humidity: 42.27



In [169]:
scores_encoded.tail()

Unnamed: 0,schedule_date,schedule_season,schedule_week,schedule_playoff,team_home,score_home,score_away,team_away,team_favorite_id,spread_favorite,...,spread_favorite_covered,over_under_covered,conditions_fog,conditions_indoor,conditions_rain,conditions_rain | fog,conditions_retractable (open roof),conditions_snow,conditions_snow | Freezing rain,conditions_snow | fog
11011,2023-01-22,2022,Division,1,BUF,10,27,CIN,BUF,-6.0,...,0,0,0,0,0,0,0,1,0,0
11012,2023-01-22,2022,Division,1,SF,19,12,DAL,SF,-3.5,...,1,0,0,0,0,0,0,0,0,0
11013,2023-01-29,2022,Conference,1,KC,23,20,CIN,KC,-1.5,...,1,0,0,0,0,0,0,0,0,0
11014,2023-01-29,2022,Conference,1,PHI,31,7,SF,PHI,-2.5,...,1,0,0,0,1,0,0,0,0,0
11015,2023-02-12,2022,Superbowl,1,PHI,35,38,KC,PHI,-1.0,...,0,1,0,0,0,0,1,0,0,0




## Train/Test/Hold-Out Split

In [170]:
# machine learning libs
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, LabelBinarizer, StandardScaler
from sklearn import config_context
from sklearn.metrics import classification_report, confusion_matrix, ConfusionMatrixDisplay

In [190]:
# remove score_home, score_away, and winning_team columns from data so that the model cannot have access to outcome information
# remove weather_detail column from data since this column was turned into one-hot encoded categorical variables
# remove weather_humidity column because there's too much missing data
# remove columns that are not numerical variables since regression does not work with non-numerical variables
# non-numerical columns: schedule_date, schedule_week, team_home, team_away, team_favorite_id, stadium
# if I do the manual work to convert schedule_week to all numerical values, I may include it in the future
columns_to_drop = ['score_home', 'score_away', 'weather_humidity', 'winning_team', 'weather_detail', 'schedule_date', 'schedule_week', 'team_home', 'team_away', 'team_favorite_id', 'stadium']
prepared_data = scores_encoded.drop(columns_to_drop, axis = 1)

# would like to find a way to make binary variables int instead of float
# columns that are binary variables that need to be converted from float to int
# columns_to_convert = ['schedule_season', 'schedule_playoff', 'stadium_neutral', 'favorite_won']

# convert the specified columns to int type
# prepared_data[columns_to_convert] = prepared_data[columns_to_convert].astype(int)

# multiple targets: y1 = favorite_won, y2 = spread_favorite_covered, y3 = over_under_covered
# 'X' is the feature matrix and 'y1', 'y2', 'y3' are the target variables
class_columns = ['favorite_won', 'spread_favorite_covered', 'over_under_covered']
X = prepared_data.drop(columns = class_columns)
y1 = prepared_data['favorite_won']
y2 = prepared_data['spread_favorite_covered']
y3 = prepared_data['over_under_covered']
test_percent = 0.25
random_seed = 42
holdout_percent = 0.5

# split the data into train and combined test-holdout sets (75% train, 25% test + holdout)
X_train, X_test_holdout, y1_train, y1_test_holdout, y2_train, y2_test_holdout, y3_train, y3_test_holdout = train_test_split(X, y1, y2, y3, test_size = test_percent, random_state = random_seed)

# further split the test-holdout set into test and holdout sets (50% test, 50% holdout)
X_test, X_holdout, y1_test, y1_holdout, y2_test, y2_holdout, y3_test, y3_holdout = train_test_split(X_test_holdout, y1_test_holdout, y2_test_holdout, y3_test_holdout, test_size = holdout_percent, random_state = random_seed)


# X train
print('On X train:')
print(f'X train dimensions: {X_train.shape}')
display(X_train.head())

# X test
print('\nOn X test:')
print(f'X test dimensions: {X_test.shape}')
display(X_test.head())

# X holdout
print('\nOn X holdout:')
print(f'X holdout dimensions: {X_holdout.shape}')
display(X_holdout.head())


On X train:
X train dimensions: (8262, 15)


Unnamed: 0,schedule_season,schedule_playoff,spread_favorite,over_under_line,stadium_neutral,weather_temperature,weather_wind_mph,conditions_fog,conditions_indoor,conditions_rain,conditions_rain | fog,conditions_retractable (open roof),conditions_snow,conditions_snow | Freezing rain,conditions_snow | fog
2836,1991,0,-5.5,42.0,False,72.0,0.0,0,1,0,0,0,0,0,0
5317,2001,0,-6.0,45.5,False,51.0,12.0,0,0,0,0,0,0,0,0
7049,2008,0,-3.0,43.5,False,72.0,0.0,0,1,0,0,0,0,0,0
8525,2013,0,-7.0,53.0,False,72.0,0.0,0,1,0,0,0,0,0,0
994,1983,0,-1.5,42.0,False,41.0,8.0,0,0,0,0,0,0,0,0



On X test:
X test dimensions: (1377, 15)


Unnamed: 0,schedule_season,schedule_playoff,spread_favorite,over_under_line,stadium_neutral,weather_temperature,weather_wind_mph,conditions_fog,conditions_indoor,conditions_rain,conditions_rain | fog,conditions_retractable (open roof),conditions_snow,conditions_snow | Freezing rain,conditions_snow | fog
10039,2019,0,-4.0,50.0,False,64.0,7.0,0,0,0,0,0,0,0,0
10142,2019,0,-7.0,46.5,False,72.0,0.0,0,1,0,0,0,0,0,0
9180,2016,0,-4.0,47.5,False,72.0,0.0,0,1,0,0,0,0,0,0
259,1980,0,-7.0,44.0,False,68.0,8.0,0,0,0,0,0,0,0,0
1950,1988,0,-2.5,44.0,False,68.0,11.0,0,0,0,0,0,0,0,0



On X holdout:
X holdout dimensions: (1377, 15)


Unnamed: 0,schedule_season,schedule_playoff,spread_favorite,over_under_line,stadium_neutral,weather_temperature,weather_wind_mph,conditions_fog,conditions_indoor,conditions_rain,conditions_rain | fog,conditions_retractable (open roof),conditions_snow,conditions_snow | Freezing rain,conditions_snow | fog
2815,1991,0,-7.5,40.0,False,37.0,18.0,0,0,0,0,0,0,0,0
1224,1984,0,-4.0,44.0,False,36.0,19.0,0,0,0,0,0,0,0,0
467,1981,0,-1.0,37.0,False,81.0,8.0,0,0,0,0,0,0,0,0
3050,1992,0,-3.5,35.0,False,62.0,8.0,0,0,0,0,0,0,0,0
7187,2008,0,-6.5,36.5,False,43.0,18.0,0,0,0,0,0,0,0,0
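
The shapes printed above can be sanity-checked without the real data. The sketch below is only an illustration: it splits a stand-in array of row ids (`row_ids`, a hypothetical name) using the same `test_size` values and seed, assuming the 11016-row count that `prepared_data.info()` reports.

```python
import numpy as np
from sklearn.model_selection import train_test_split

n_rows = 11016                 # row count reported by prepared_data.info()
row_ids = np.arange(n_rows)    # stand-in for the real feature matrix

# first split: 75% train, 25% combined test + holdout
train_ids, rest_ids = train_test_split(row_ids, test_size=0.25, random_state=42)
# second split: halve the remainder into test and holdout
test_ids, holdout_ids = train_test_split(rest_ids, test_size=0.5, random_state=42)

print(len(train_ids), len(test_ids), len(holdout_ids))  # 8262 1377 1377
```

The three sets are disjoint, so the overall proportions are 75% train, 12.5% test, and 12.5% holdout.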


In [191]:
# y1 = favorite_won, y2 = spread_favorite_covered, y3 = over_under_covered

# favorite_won targets
# y1 train
print('On y1 train (favorite won):')
print(f'y1 train dimensions: {y1_train.shape}')
display(y1_train.head())

# y1 test
print('\nOn y1 test (favorite won):')
print(f'y1 test dimensions: {y1_test.shape}')
display(y1_test.head())

# y1 holdout
print('\nOn y1 holdout (favorite won):')
print(f'y1 holdout dimensions: {y1_holdout.shape}')
display(y1_holdout.head())

# spread_favorite_covered targets
# y2 train
print('\nOn y2 train (spread favorite covered):')
print(f'y2 train dimensions: {y2_train.shape}')
display(y2_train.head())

# y2 test
print('\nOn y2 test (spread favorite covered):')
print(f'y2 test dimensions: {y2_test.shape}')
display(y2_test.head())

# y2 holdout
print('\nOn y2 holdout (spread favorite covered):')
print(f'y2 holdout dimensions: {y2_holdout.shape}')
display(y2_holdout.head())

# over_under_covered targets
# y3 train
print('\nOn y3 train (over/under covered):')
print(f'y3 train dimensions: {y3_train.shape}')
display(y3_train.head())

# y3 test
print('\nOn y3 test (over/under covered):')
print(f'y3 test dimensions: {y3_test.shape}')
display(y3_test.head())

# y3 holdout
print('\nOn y3 holdout (over/under covered):')
print(f'y3 holdout dimensions: {y3_holdout.shape}')
display(y3_holdout.head())


On y1 train (favorite won):
y1 train dimensions: (8262,)


2836    1
5317    0
7049    0
8525    0
994     0
Name: favorite_won, dtype: int64


On y1 test (favorite won):
y1 test dimensions: (1377,)


10039    0
10142    1
9180     1
259      1
1950     0
Name: favorite_won, dtype: int64


On y1 holdout (favorite won):
y1 holdout dimensions: (1377,)


2815    0
1224    0
467     1
3050    1
7187    0
Name: favorite_won, dtype: int64


On y2 train (spread favorite covered):
y2 train dimensions: (8262,)


2836    1
5317    0
7049    0
8525    1
994     0
Name: spread_favorite_covered, dtype: int64


On y2 test (spread favorite covered):
y2 test dimensions: (1377,)


10039    1
10142    1
9180     1
259      0
1950     1
Name: spread_favorite_covered, dtype: int64


On y2 holdout (spread favorite covered):
y2 holdout dimensions: (1377,)


2815    1
1224    0
467     1
3050    1
7187    1
Name: spread_favorite_covered, dtype: int64


On y3 train (over/under covered):
y3 train dimensions: (8262,)


2836    1
5317    0
7049    0
8525    1
994     1
Name: over_under_covered, dtype: int64


On y3 test (over/under covered):
y3 test dimensions: (1377,)


10039    0
10142    0
9180     1
259      1
1950     0
Name: over_under_covered, dtype: int64


On y3 holdout (over/under covered):
y3 holdout dimensions: (1377,)


2815    0
1224    0
467     0
3050    0
7187    0
Name: over_under_covered, dtype: int64

## Establish Training Pipeline

In [192]:
# check that the prepared data is what we expect
print(prepared_data.info())
print(f'\nmissing data:\n{prepared_data.isna().sum()}')


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11016 entries, 0 to 11015
Data columns (total 18 columns):
 #   Column                              Non-Null Count  Dtype  
---  ------                              --------------  -----  
 0   schedule_season                     11016 non-null  int64  
 1   schedule_playoff                    11016 non-null  int64  
 2   spread_favorite                     11016 non-null  float64
 3   over_under_line                     11016 non-null  float64
 4   stadium_neutral                     11016 non-null  bool   
 5   weather_temperature                 11016 non-null  float64
 6   weather_wind_mph                    11016 non-null  float64
 7   favorite_won                        11016 non-null  int64  
 8   spread_favorite_covered             11016 non-null  int64  
 9   over_under_covered                  11016 non-null  int64  
 10  conditions_fog                      11016 non-null  uint8  
 11  conditions_indoor                   11016

### Create Two Pipelines:
- categorical
- numeric

In [193]:
# build two separate pipes, one for handling numeric data and the other for categorical data

# not sure if I need this pipeline since I have already created the categorical columns
cat_pipeline = Pipeline(steps=[('cat_impute', SimpleImputer(strategy = 'most_frequent')),
                               ('onehot_cat', OneHotEncoder())])

num_pipeline = Pipeline(steps=[('impute_num', SimpleImputer(strategy='mean', missing_values=np.nan)),
                               ('scale_num', StandardScaler())])
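
To make the numeric steps concrete, here is a small sketch that applies the same two steps to a toy column. `toy_pipeline` and `toy` are made-up names and values, not part of the NFL data.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# same two steps as num_pipeline, rebuilt here so the example is self-contained
toy_pipeline = Pipeline(steps=[('impute_num', SimpleImputer(strategy='mean', missing_values=np.nan)),
                               ('scale_num', StandardScaler())])

# toy temperatures with one missing value
toy = np.array([[60.0], [np.nan], [80.0], [70.0]])

# the NaN is replaced by the column mean (70), then the column is standardized
scaled = toy_pipeline.fit_transform(toy)
print(scaled.ravel())
```

Because the missing value is filled with the mean before scaling, it ends up at exactly 0 in the standardized column.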


### Preprocessing
- In my data preparation, I already transformed weather_details into categorical columns, so I don't think I need to do any further categorical transformations.


In [217]:
# create preprocessing pipeline by columns

num_cols = [
    'schedule_season',
    'schedule_playoff',
    'spread_favorite',
    'over_under_line',
    'stadium_neutral',
    'weather_temperature',
    'weather_wind_mph',
    'conditions_fog',
    'conditions_indoor',
    'conditions_rain',
    'conditions_rain | fog',
    'conditions_retractable (open roof)',
    'conditions_snow',
    'conditions_snow | Freezing rain',
    'conditions_snow | fog'
]

preproc = ColumnTransformer([('num_pipe', num_pipeline, num_cols)], remainder = 'passthrough')
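
One possible variation, shown only as a sketch and not what this notebook does: scale just the continuous columns and let the 0/1 indicator columns pass through unchanged. The toy frame and the `continuous_cols` list are hypothetical.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# toy rows using a few of the column names from the prepared data
toy = pd.DataFrame({
    'weather_temperature': [72.0, 51.0, 41.0, 68.0],
    'weather_wind_mph': [0.0, 12.0, 8.0, 11.0],
    'conditions_indoor': [1, 0, 0, 0],
})
continuous_cols = ['weather_temperature', 'weather_wind_mph']  # hypothetical subset

# scale only the continuous columns; everything else is passed through as-is
toy_preproc = ColumnTransformer([('scale', StandardScaler(), continuous_cols)],
                                remainder='passthrough')
out = toy_preproc.fit_transform(toy)

# transformed columns come first, passthrough columns are appended after them
print(out[:, 2])
```

Whether this helps is an open question for this dataset; with a regularized logistic regression, scaling indicator columns changes how strongly they are penalized, so it may be worth comparing both setups.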


### Generate the Training Pipeline with preprocessing and modeling

In [218]:
# generate the entire training pipeline with preprocessing and modeling
# use the LogisticRegression model and choose its initial parameters
# initial parameters: penalty='elasticnet', solver='saga', tol=0.01 (other settings are worth trying)
pipe = Pipeline(steps = [('preproc', preproc),
                       ('mdl', LogisticRegression(penalty = 'elasticnet', solver = 'saga', tol = 0.01))])

# visualization of the pipeline
with config_context(display = 'diagram'):
    display(pipe)

### Cross-Validation & Hyperparameter Tuning

In [219]:
# cross validation and hyperparameter tuning
tuning_grid = {'mdl__l1_ratio' : np.linspace(0,1,5),
               'mdl__C': np.logspace(-1, 6, 3) }

# each target is modeled as its own function of the same features: y1 = f1(X), y2 = f2(X), y3 = f3(X)
# so one grid search is built per target, each using 5-fold cross-validation
grid_search_y1 = GridSearchCV(pipe, param_grid = tuning_grid, cv = 5, return_train_score = True, n_jobs = -1)
grid_search_y2 = GridSearchCV(pipe, param_grid = tuning_grid, cv = 5, return_train_score = True, n_jobs = -1)
grid_search_y3 = GridSearchCV(pipe, param_grid = tuning_grid, cv = 5, return_train_score = True, n_jobs = -1)


In [220]:
tuning_grid

{'mdl__l1_ratio': array([0.  , 0.25, 0.5 , 0.75, 1.  ]),
 'mdl__C': array([1.00000000e-01, 3.16227766e+02, 1.00000000e+06])}
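
As a rough sketch of what this grid implies, the snippet below counts the candidate combinations and shows how the `mdl__` prefix addresses a parameter of the pipeline step named `mdl`. `toy_pipe` and `toy_grid` are stand-ins built only for this illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import ParameterGrid
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# stand-in pipeline whose final step is named 'mdl', as in the notebook
toy_pipe = Pipeline(steps=[('scale', StandardScaler()), ('mdl', LogisticRegression())])
toy_grid = {'mdl__l1_ratio': np.linspace(0, 1, 5),
            'mdl__C': np.logspace(-1, 6, 3)}

# '<step name>__<parameter>' is how a grid reaches into a pipeline step
print('mdl__C' in toy_pipe.get_params())  # True

n_candidates = len(ParameterGrid(toy_grid))  # 5 l1_ratio values x 3 C values
n_folds = 5
print(n_candidates, n_candidates * n_folds)  # 15 75
```

So each of the three searches fits 75 models during cross-validation, plus one final refit on the full training set when `refit` is left at its default.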

### Feed the training set to the training pipeline

In [222]:
# train against the 3 training sets
# X_train, y1_train, y2_train, y3_train
# X_test, X_holdout, y1_test, y1_holdout, y2_test, y2_holdout, y3_test, y3_holdout

In [223]:
# train against y1_train - favorite_won
grid_search_y1.fit(X_train, y1_train.values.ravel()) # bulk of training


In [224]:
# train against y2_train - spread_favorite_covered
grid_search_y2.fit(X_train, y2_train.values.ravel()) # bulk of training


In [225]:
# train against y3_train - over_under_covered
grid_search_y3.fit(X_train, y3_train.values.ravel()) # bulk of training


In [226]:
# check the scores and params
print(f'best score y1: {grid_search_y1.best_score_}')
print(f'best params y1: {grid_search_y1.best_params_}')

print(f'best score y2: {grid_search_y2.best_score_}')
print(f'best params y2: {grid_search_y2.best_params_}')

print(f'best score y3: {grid_search_y3.best_score_}')
print(f'best params y3: {grid_search_y3.best_params_}')


best score y1: 0.6592831435690336
best params y1: {'mdl__C': 0.1, 'mdl__l1_ratio': 1.0}
best score y2: 0.599129032399819
best params y2: {'mdl__C': 0.1, 'mdl__l1_ratio': 1.0}
best score y3: 0.5162197574591065
best params y3: {'mdl__C': 1000000.0, 'mdl__l1_ratio': 0.25}
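
These accuracies are easier to interpret next to a majority-class baseline. The sketch below shows one way to compute such a baseline with `DummyClassifier`; it uses synthetic labels (`toy_y`, roughly 65% positive by construction), so the number it prints says nothing about the real targets.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(42)
toy_X = rng.normal(size=(1000, 3))              # synthetic features, not the NFL data
toy_y = (rng.random(1000) < 0.65).astype(int)   # synthetic labels, about 65% positive

# always predicts the most frequent class seen in the training folds
baseline = DummyClassifier(strategy='most_frequent')
baseline_scores = cross_val_score(baseline, toy_X, toy_y, cv=5)
print(f'baseline accuracy: {baseline_scores.mean():.3f}')
```

On the real data, the same check would use `X_train` with `y1_train`, `y2_train`, and `y3_train`; a tuned model is only informative to the extent that it beats this baseline.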


In [202]:
# visualize y1 results
pd.DataFrame(grid_search_y1.cv_results_)


Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_mdl__C,param_mdl__l1_ratio,params,split0_test_score,split1_test_score,split2_test_score,...,mean_test_score,std_test_score,rank_test_score,split0_train_score,split1_train_score,split2_train_score,split3_train_score,split4_train_score,mean_train_score,std_train_score
0,0.132299,0.027476,0.022014,0.013035,0.1,0.0,"{'mdl__C': 0.1, 'mdl__l1_ratio': 0.0}",0.660617,0.658802,0.668886,...,,,1,0.660009,0.662581,0.658245,0.664297,0.662179,0.661462,0.00211
1,0.304896,0.053541,0.028062,0.002239,0.1,0.25,"{'mdl__C': 0.1, 'mdl__l1_ratio': 0.25}",0.660012,0.658197,0.669492,...,,,1,0.65895,0.662581,0.657489,0.663238,0.661725,0.660796,0.002207
2,0.236486,0.053101,0.028137,0.009185,0.1,0.5,"{'mdl__C': 0.1, 'mdl__l1_ratio': 0.5}",0.663642,0.656382,0.671308,...,,,1,0.659253,0.662581,0.657035,0.662632,0.662027,0.660706,0.002216
3,0.1983,0.019689,0.024838,0.007294,0.1,0.75,"{'mdl__C': 0.1, 'mdl__l1_ratio': 0.75}",0.663642,0.658197,0.673123,...,,,1,0.659555,0.663187,0.657791,0.662027,0.663086,0.661129,0.002121
4,0.181871,0.03442,0.01991,0.00827,0.1,1.0,"{'mdl__C': 0.1, 'mdl__l1_ratio': 1.0}",0.664247,0.658197,0.672518,...,,,1,0.660312,0.663338,0.657337,0.662784,0.663389,0.661432,0.002337
5,0.310527,0.073072,0.013205,0.002392,316.227766,0.0,"{'mdl__C': 316.22776601683796, 'mdl__l1_ratio'...",0.661222,0.659407,0.664649,...,,,1,0.659858,0.661522,0.658245,0.664145,0.663238,0.661402,0.002157
6,0.379226,0.089528,0.013876,0.000918,316.227766,0.25,"{'mdl__C': 316.22776601683796, 'mdl__l1_ratio'...",0.661222,0.659407,0.664649,...,,,1,0.659858,0.661522,0.658245,0.663843,0.663238,0.661341,0.002082
7,0.380669,0.092858,0.01209,0.002133,316.227766,0.5,"{'mdl__C': 316.22776601683796, 'mdl__l1_ratio'...",0.661222,0.659407,0.664649,...,,,1,0.659858,0.661522,0.658396,0.663994,0.663238,0.661402,0.002075
8,0.39482,0.096572,0.013323,0.00261,316.227766,0.75,"{'mdl__C': 316.22776601683796, 'mdl__l1_ratio'...",0.661222,0.659407,0.664649,...,,,1,0.659858,0.661522,0.658245,0.663843,0.663238,0.661341,0.002082
9,0.366243,0.087492,0.01212,0.002585,316.227766,1.0,"{'mdl__C': 316.22776601683796, 'mdl__l1_ratio'...",0.661222,0.659407,0.664649,...,,,1,0.659858,0.661371,0.658245,0.663843,0.663238,0.661311,0.00208


In [203]:
# visualize y2 results
pd.DataFrame(grid_search_y2.cv_results_)


Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_mdl__C,param_mdl__l1_ratio,params,split0_test_score,split1_test_score,split2_test_score,...,mean_test_score,std_test_score,rank_test_score,split0_train_score,split1_train_score,split2_train_score,split3_train_score,split4_train_score,mean_train_score,std_train_score
0,0.186141,0.036478,0.025392,0.005419,0.1,0.0,"{'mdl__C': 0.1, 'mdl__l1_ratio': 0.0}",0.598911,0.596491,0.596247,...,,,1,0.599183,0.599334,0.6,0.599395,0.600908,0.599764,0.000636
1,0.303437,0.042767,0.027836,0.004767,0.1,0.25,"{'mdl__C': 0.1, 'mdl__l1_ratio': 0.25}",0.598911,0.595886,0.595036,...,,,1,0.599486,0.600545,0.599697,0.599849,0.599395,0.599794,0.000408
2,0.311281,0.064256,0.018761,0.007866,0.1,0.5,"{'mdl__C': 0.1, 'mdl__l1_ratio': 0.5}",0.599516,0.595281,0.598063,...,,,1,0.599183,0.600545,0.59879,0.6,0.599244,0.599552,0.000632
3,0.159184,0.031124,0.01344,0.00121,0.1,0.75,"{'mdl__C': 0.1, 'mdl__l1_ratio': 0.75}",0.599516,0.595281,0.598668,...,,,1,0.599183,0.599788,0.598336,0.598638,0.599092,0.599008,0.000497
4,0.116274,0.008986,0.012466,0.002502,0.1,1.0,"{'mdl__C': 0.1, 'mdl__l1_ratio': 1.0}",0.599516,0.598911,0.598668,...,,,1,0.599334,0.599183,0.599244,0.599244,0.599395,0.59928,7.5e-05
5,0.328035,0.065494,0.013741,0.001772,316.227766,0.0,"{'mdl__C': 316.22776601683796, 'mdl__l1_ratio'...",0.599516,0.597701,0.594431,...,,,1,0.600091,0.599939,0.600303,0.598638,0.598941,0.599582,0.000664
6,0.419359,0.090165,0.013407,0.000974,316.227766,0.25,"{'mdl__C': 316.22776601683796, 'mdl__l1_ratio'...",0.599516,0.597701,0.594431,...,,,1,0.600091,0.599939,0.600303,0.598638,0.598941,0.599582,0.000664
7,0.416617,0.080765,0.012383,0.002512,316.227766,0.5,"{'mdl__C': 316.22776601683796, 'mdl__l1_ratio'...",0.599516,0.597701,0.594431,...,,,1,0.600091,0.599939,0.600303,0.598638,0.598941,0.599582,0.000664
8,0.418238,0.080698,0.012154,0.002546,316.227766,0.75,"{'mdl__C': 316.22776601683796, 'mdl__l1_ratio'...",0.599516,0.597701,0.594431,...,,,1,0.600091,0.599939,0.600303,0.598638,0.598941,0.599582,0.000664
9,0.41188,0.08398,0.013325,0.001104,316.227766,1.0,"{'mdl__C': 316.22776601683796, 'mdl__l1_ratio'...",0.599516,0.597701,0.594431,...,,,1,0.600091,0.599939,0.600303,0.598638,0.598941,0.599582,0.000664


In [204]:
# visualize y3 results
pd.DataFrame(grid_search_y3.cv_results_)


Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_mdl__C,param_mdl__l1_ratio,params,split0_test_score,split1_test_score,split2_test_score,...,mean_test_score,std_test_score,rank_test_score,split0_train_score,split1_train_score,split2_train_score,split3_train_score,split4_train_score,mean_train_score,std_train_score
0,0.107441,0.006275,0.017244,0.005019,0.1,0.0,"{'mdl__C': 0.1, 'mdl__l1_ratio': 0.0}",0.517241,0.507562,0.517554,...,,,1,0.523831,0.527009,0.531316,0.524962,0.52239,0.525902,0.0031
1,0.124092,0.011547,0.014226,0.001328,0.1,0.25,"{'mdl__C': 0.1, 'mdl__l1_ratio': 0.25}",0.522081,0.511797,0.515133,...,,,1,0.523982,0.52837,0.529047,0.527231,0.52466,0.526658,0.002006
2,0.117858,0.004703,0.014393,0.001661,0.1,0.5,"{'mdl__C': 0.1, 'mdl__l1_ratio': 0.5}",0.522081,0.511797,0.50908,...,,,1,0.525949,0.526706,0.529349,0.527383,0.523752,0.526628,0.001828
3,0.118052,0.010852,0.016038,0.001992,0.1,0.75,"{'mdl__C': 0.1, 'mdl__l1_ratio': 0.75}",0.522081,0.513007,0.511501,...,,,1,0.526403,0.527614,0.529198,0.527837,0.523298,0.52687,0.001994
4,0.114206,0.007577,0.013551,0.000931,0.1,1.0,"{'mdl__C': 0.1, 'mdl__l1_ratio': 1.0}",0.520266,0.510587,0.512712,...,,,1,0.526857,0.527765,0.529955,0.527383,0.522995,0.526991,0.002259
5,0.290758,0.029392,0.01269,0.001671,316.227766,0.0,"{'mdl__C': 316.22776601683796, 'mdl__l1_ratio'...",0.519661,0.507562,0.518765,...,,,1,0.523377,0.52716,0.530862,0.527231,0.523298,0.526386,0.002826
6,0.360158,0.038007,0.013839,0.002544,316.227766,0.25,"{'mdl__C': 316.22776601683796, 'mdl__l1_ratio'...",0.519661,0.508167,0.519976,...,,,1,0.523377,0.52716,0.53056,0.52708,0.523298,0.526295,0.002722
7,0.358389,0.029069,0.012671,0.003069,316.227766,0.5,"{'mdl__C': 316.22776601683796, 'mdl__l1_ratio'...",0.519661,0.506957,0.51937,...,,,1,0.523226,0.527311,0.530862,0.527383,0.523298,0.526416,0.002878
8,0.358652,0.021917,0.012464,0.002428,316.227766,0.75,"{'mdl__C': 316.22776601683796, 'mdl__l1_ratio'...",0.519056,0.506957,0.51937,...,,,1,0.523226,0.527311,0.530862,0.52708,0.523298,0.526356,0.00286
9,0.368777,0.036189,0.012143,0.001508,316.227766,1.0,"{'mdl__C': 316.22776601683796, 'mdl__l1_ratio'...",0.519661,0.506957,0.518765,...,,,1,0.523226,0.527311,0.530711,0.527231,0.523147,0.526325,0.002854
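
The full `cv_results_` tables are wide, and the most useful columns tend to get elided in the display. A compact view can be built by selecting a few columns and sorting by rank. The sketch below does this on a small synthetic search (`toy_search` is a hypothetical stand-in, tuning only `C`).

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
toy_X = rng.normal(size=(200, 3))   # synthetic features
toy_y = (toy_X[:, 0] + rng.normal(scale=0.5, size=200) > 0).astype(int)

toy_search = GridSearchCV(LogisticRegression(), param_grid={'C': [0.1, 1.0, 10.0]},
                          cv=5, return_train_score=True)
toy_search.fit(toy_X, toy_y)

# keep only the columns needed to compare candidates, best rank first
summary_cols = ['param_C', 'mean_test_score', 'std_test_score',
                'mean_train_score', 'rank_test_score']
summary = (pd.DataFrame(toy_search.cv_results_)[summary_cols]
           .sort_values('rank_test_score'))
print(summary)
```

For the searches in this notebook, the parameter columns would be `param_mdl__C` and `param_mdl__l1_ratio` instead of `param_C`.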


### Final Fit

In [206]:
# final fit y1: best_estimator_ is the pipeline refit on all of X_train with the best parameters
grid_search_y1.best_estimator_


In [207]:
# final fit y2: best_estimator_ is the pipeline refit on all of X_train with the best parameters
grid_search_y2.best_estimator_


In [208]:
# final fit y3: best_estimator_ is the pipeline refit on all of X_train with the best parameters
grid_search_y3.best_estimator_


In [209]:
# print target classes
print(f'y1 target classes: {grid_search_y1.classes_}')
print(f'y2 target classes: {grid_search_y2.classes_}')
print(f'y3 target classes: {grid_search_y3.classes_}')


y1 target classes: [0 1]
y2 target classes: [0 1]
y3 target classes: [0 1]
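
The order of `classes_` matters when reading predicted probabilities: the columns of `predict_proba` follow that order. A minimal sketch on synthetic data (`toy_model` is a stand-in, not one of the fitted searches):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
toy_X = rng.normal(size=(200, 2))        # synthetic features
toy_y = (toy_X[:, 0] > 0).astype(int)    # synthetic 0/1 target

toy_model = LogisticRegression().fit(toy_X, toy_y)
proba = toy_model.predict_proba(toy_X[:5])

# look up which column holds the probability of class 1
positive_col = list(toy_model.classes_).index(1)
p_positive = proba[:, positive_col]
print(toy_model.classes_, p_positive.round(3))
```

With classes `[0 1]`, column 1 holds the probability of the positive outcome (for example, the favorite winning for y1).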
