In [1]:
%load_ext nb_black


# Predicting results of college football games from the 2019 season

## Features
* success_rate: successful plays/total plays
* What is a successful play?
    * On first down: Gaining half of the required yards to gain a first down
    * On second down: Gaining 70 percent of the required yards to gain a first down
    * On third/fourth down: Gaining all of the required yards to gain a first down
* ppapsp: Projected points added per successful play
* afp: Average starting field position
* ppti40: Points per trip inside the 40 yard line
* ppa: Projected points added per play
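
The success-rate definition above can be written as a small function. This is a minimal sketch: the names `is_successful_play` and `success_rate`, and the `(down, distance, yards_gained)` tuple layout, are assumptions made for illustration and are not part of the database or API used below.

```python
# Hypothetical helpers illustrating the success-rate definition above.
# Fraction of the yards to go that a play must gain, keyed by down.
REQUIRED_FRACTION = {1: 0.5, 2: 0.7, 3: 1.0, 4: 1.0}


def is_successful_play(down, distance, yards_gained):
    """Return True if the play gained the required share of the yards to go."""
    return yards_gained >= REQUIRED_FRACTION[down] * distance


def success_rate(plays):
    """plays: iterable of (down, distance, yards_gained) tuples."""
    plays = list(plays)
    if not plays:
        return float("nan")
    return sum(is_successful_play(*p) for p in plays) / len(plays)


# Made-up plays: two of the four meet their threshold.
example_plays = [(1, 10, 5), (2, 5, 3), (3, 2, 2), (1, 10, 4)]
print(success_rate(example_plays))
```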

## Goal
Predict the result of each game, then use the prediction to bet on whether the favorite will cover the spread or the underdog will beat it.
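
As a rough sketch of how a prediction could turn into a bet, assume the model outputs a predicted margin (favorite score minus underdog score) and that the spread is stored as a positive number, as it is later in this notebook. The function names here are hypothetical.

```python
# Hypothetical helpers; predicted_margin is favorite score minus underdog score.
def pick_side(predicted_margin, spread):
    """Bet the favorite only if it is predicted to win by more than the spread."""
    return "favorite" if predicted_margin > spread else "underdog"


def favorite_covered(favorite_score, dog_score, spread):
    """True if the favorite won by more than the spread.

    Returns False on a push (margin exactly equal to the spread).
    """
    return (favorite_score - dog_score) > spread


print(pick_side(predicted_margin=10.0, spread=7.5))
print(favorite_covered(favorite_score=30, dog_score=27, spread=7.5))
```

A bet on the favorite wins when `favorite_covered` is true; how a push should be scored is left out of this sketch.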

In [2]:
import sys


In [421]:
# Imports
import pandas as pd
import numpy as np
import sqlite3

import json
import urllib3
import certifi

from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.ensemble import (
    RandomForestClassifier,
    RandomForestRegressor,
    GradientBoostingRegressor,
    GradientBoostingClassifier,
)
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import (
    classification_report,
    mean_absolute_error,
    confusion_matrix,
    r2_score,
)
from sklearn.inspection import permutation_importance


In [4]:
# Load the games table from the db, encode W/L as binary (0 = home win, 1 = home loss or tie), drop some columns and duplicate rows
http = urllib3.PoolManager(cert_reqs="CERT_REQUIRED", ca_certs=certifi.where())

conn = sqlite3.connect("NCAAF.db")

c = conn.cursor()

all_games = pd.read_sql_query("SELECT * FROM games ", conn)
all_games["W/L"] = np.where(all_games["home_score"] > all_games["away_score"], 0, 1)
all_games.drop(columns=["home_offense_plays", "away_offense_plays"], inplace=True)
all_games = all_games.drop_duplicates()


In [5]:
# Get betting lines for each season and combine them into lines_df
lines_frames = []
for year in ["2015", "2016", "2017", "2018", "2019"]:
    lines = http.request(
        "GET",
        "https://api.collegefootballdata.com/lines?year="
        + year
        + "&seasonType=regular",
    )

    lines_dict_list = json.loads(lines.data.decode("UTF-8"))
    lines_frames.append(pd.DataFrame(lines_dict_list))

# pd.concat replaces DataFrame.append, which was removed in pandas 2.0;
# building the list fresh also keeps a re-run of this cell from duplicating rows
lines_df = pd.concat(lines_frames)

lines_df = lines_df[["id", "lines"]]
lines_df = lines_df.rename(columns={"id": "game_id"})
lines_df["game_id"] = lines_df["game_id"].astype(str)
lines_df.head()

Unnamed: 0,game_id,lines
0,400763398,"[{'provider': 'consensus', 'spread': '16', 'fo..."
1,400756895,[]
2,400756882,"[{'provider': 'consensus', 'spread': '-30.5', ..."
3,400603840,"[{'provider': 'consensus', 'spread': '-3.5', '..."
4,400763593,"[{'provider': 'consensus', 'spread': '-17', 'f..."



In [6]:
# Merge betting lines with games df
all_games = all_games.merge(lines_df, on="game_id")
all_games

Unnamed: 0,game_id,year,week,home_team,away_team,home_ypp,home_success_rate,home_ppapsp,home_afp,home_ppti40,...,away_ypp,away_success_rate,away_ppapsp,away_afp,away_ppti40,away_turnovers,home_score,away_score,W/L,lines
0,400603840,2015,1,South Carolina,North Carolina,4.425,0.388,0.432,31.818,4.250,...,6.100,0.400,0.439,22.800,2.167,0.043,17,13,0,"[{'provider': 'consensus', 'spread': '-3.5', '..."
1,400763593,2015,1,UCF,Florida International,4.353,0.382,0.522,33.727,2.800,...,4.743,0.338,0.410,18.167,4.250,0.014,14,15,1,"[{'provider': 'consensus', 'spread': '-17', 'f..."
2,400763399,2015,1,Central Michigan,Oklahoma State,4.392,0.378,0.425,20.917,4.333,...,5.620,0.394,0.434,30.000,4.000,0.000,13,24,1,"[{'provider': 'consensus', 'spread': '20.5', '..."
3,400756896,2015,1,Wake Forest,Elon,7.190,0.488,0.579,26.000,4.556,...,2.385,0.135,0.146,25.167,3.000,0.000,41,3,0,[]
4,400787299,2015,1,Ball State,VMI,5.290,0.480,0.510,41.067,4.800,...,6.524,0.369,0.596,22.071,5.429,0.000,48,36,0,[]
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4036,401132981,2019,15,LSU,Georgia,6.090,0.487,0.571,28.818,4.625,...,4.113,0.352,0.401,26.727,2.000,0.028,37,10,0,"[{'provider': 'Bovada', 'spread': '-7.5', 'for..."
4037,401132979,2019,15,Boise State,Hawai'i,5.449,0.420,0.513,32.000,5.167,...,4.315,0.301,0.455,27.818,2.000,0.014,31,10,0,"[{'provider': 'Bovada', 'spread': '-14.5', 'fo..."
4038,401132975,2019,15,Clemson,Virginia,8.986,0.571,0.816,34.923,6.200,...,5.184,0.434,0.429,22.500,4.250,0.039,62,17,0,"[{'provider': 'Bovada', 'spread': '-29.0', 'fo..."
4039,401132983,2019,15,Wisconsin,Ohio State,6.014,0.392,0.586,20.818,3.500,...,6.333,0.469,0.589,32.364,4.250,0.012,21,34,1,"[{'provider': 'Caesars', 'spread': '16.5', 'fo..."



In [7]:
# Get recruiting score for every team and create df with mean of recruiting score of current and previous two seasons
recruiting_frames = []
for year in ["2013", "2014", "2015", "2016", "2017", "2018", "2019"]:
    recruiting = http.request(
        "GET", "https://api.collegefootballdata.com/recruiting/teams?year=" + year
    )

    recruiting_dict_list = json.loads(recruiting.data.decode("UTF-8"))
    recruiting_frames.append(pd.DataFrame(recruiting_dict_list))

# pd.concat replaces DataFrame.append, which was removed in pandas 2.0
recruiting_df = pd.concat(recruiting_frames)

teams = list(recruiting_df["team"].unique())
average_recruiting_list = list()
for year in [2015, 2016, 2017, 2018, 2019]:
    year_df = recruiting_df.loc[recruiting_df["year"] <= year]
    year_df = year_df.loc[year_df["year"] >= year - 2]
    for team in teams:
        average_recruiting_dict = dict()
        team_df = year_df.loc[year_df["team"] == team]
        points = round(team_df["points"].astype(float).mean(), 2)
        average_recruiting_dict["team"] = team
        average_recruiting_dict["year"] = year
        average_recruiting_dict["points"] = points
        average_recruiting_list.append(average_recruiting_dict)

average_recruiting = pd.DataFrame(average_recruiting_list)
average_recruiting = average_recruiting.dropna()
average_recruiting

Unnamed: 0,team,year,points
0,Alabama,2015,316.77
1,Ohio State,2015,293.01
2,Florida,2015,262.25
3,Michigan,2015,239.03
4,Notre Dame,2015,271.00
...,...,...,...
1249,Morehead State,2019,5.80
1250,Sacred Heart,2019,12.50
1252,Drake,2019,0.97
1253,Dayton,2019,10.68
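
The three-season average above can also be computed with a filter and a `groupby` per target year instead of a loop over every team. This is a sketch on made-up data; `toy_recruiting` and `three_year_average` are hypothetical names, and only the `team`, `year`, and `points` columns are taken from the cell above.

```python
import pandas as pd

# Toy recruiting data; the real frame comes from the API call above.
toy_recruiting = pd.DataFrame(
    {
        "team": ["A", "A", "A", "B", "B"],
        "year": [2013, 2014, 2015, 2014, 2015],
        "points": ["300", "280", "290", "100", "120"],
    }
)
toy_recruiting["points"] = toy_recruiting["points"].astype(float)


def three_year_average(recruiting, target_years):
    """Mean recruiting points over each target year and the two before it."""
    frames = []
    for year in target_years:
        # between() is inclusive on both ends: year - 2, year - 1, year
        window = recruiting[recruiting["year"].between(year - 2, year)]
        avg = window.groupby("team", as_index=False)["points"].mean()
        avg["points"] = avg["points"].round(2)
        avg["year"] = year
        frames.append(avg)
    return pd.concat(frames, ignore_index=True)


print(three_year_average(toy_recruiting, [2015]))
```

Teams with no rows in a window simply do not appear in the result, which matches the effect of the `dropna()` above.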



In [8]:
# Merge average recruiting df with games df
all_games = all_games.merge(
    average_recruiting.rename(columns={"team": "home_team"}), on=["home_team", "year"]
)
all_games = all_games.rename(columns={"points": "home_team_recruiting"})
all_games = all_games.merge(
    average_recruiting.rename(columns={"team": "away_team"}), on=["away_team", "year"]
)
all_games = all_games.rename(columns={"points": "away_team_recruiting"})
all_games

Unnamed: 0,game_id,year,week,home_team,away_team,home_ypp,home_success_rate,home_ppapsp,home_afp,home_ppti40,...,away_ppapsp,away_afp,away_ppti40,away_turnovers,home_score,away_score,W/L,lines,home_team_recruiting,away_team_recruiting
0,400603840,2015,1,South Carolina,North Carolina,4.425,0.388,0.432,31.818,4.250,...,0.439,22.800,2.167,0.043,17,13,0,"[{'provider': 'consensus', 'spread': '-3.5', '...",231.28,211.35
1,400756942,2015,5,Georgia Tech,North Carolina,5.488,0.585,0.501,40.385,3.875,...,0.622,37.889,5.429,0.000,31,38,1,"[{'provider': 'consensus', 'spread': '-10.5', ...",170.17,211.35
2,400756966,2015,9,Pittsburgh,North Carolina,4.762,0.405,0.451,27.091,3.333,...,0.603,29.182,3.714,0.000,19,26,1,"[{'provider': 'consensus', 'spread': '2.5', 'f...",188.96,211.35
3,400756997,2015,13,NC State,North Carolina,5.011,0.436,0.428,30.833,4.250,...,0.721,32.231,4.500,0.027,34,45,1,"[{'provider': 'consensus', 'spread': '6', 'for...",192.01,211.35
4,400756992,2015,12,Virginia Tech,North Carolina,4.466,0.432,0.413,33.529,3.857,...,0.521,31.812,5.167,0.012,27,30,1,"[{'provider': 'consensus', 'spread': '7.5', 'f...",217.76,211.35
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4013,401110791,2019,3,Ole Miss,Southeastern Louisiana,5.627,0.386,0.576,33.857,5.125,...,0.562,36.267,4.429,0.042,40,29,0,"[{'provider': 'consensus', 'spread': '-30.5', ...",219.08,16.95
4014,401117861,2019,2,Houston,Prairie View,5.181,0.431,0.437,39.867,4.111,...,0.365,23.000,4.250,0.013,37,17,0,"[{'provider': 'consensus', 'spread': '-35', 'f...",164.20,29.02
4015,401114165,2019,2,Arizona,Northern Arizona,9.342,0.632,0.914,23.833,7.000,...,0.573,26.000,5.125,0.024,65,41,0,"[{'provider': 'consensus', 'spread': '-28', 'f...",182.19,58.88
4016,401112464,2019,4,Wake Forest,Elon,7.155,0.536,0.659,29.182,6.125,...,0.347,20.667,7.000,0.000,49,7,0,"[{'provider': 'consensus', 'spread': '-29.5', ...",173.52,11.30
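
Both merges above are inner joins, so a game disappears whenever either team has no recruiting average for that year (the displayed row index ends at 4039 before these merges and at 4016 after them). One way to see which teams fail to match is a left merge with `indicator=True`. The frames below are made up for illustration.

```python
import pandas as pd

# Toy frames; team "C" has no recruiting row and would be lost in an inner merge.
toy_games = pd.DataFrame(
    {"home_team": ["A", "B", "C"], "year": [2015, 2015, 2015]}
)
toy_recruiting = pd.DataFrame(
    {"team": ["A", "B"], "year": [2015, 2015], "points": [290.0, 110.0]}
)

checked = toy_games.merge(
    toy_recruiting.rename(columns={"team": "home_team"}),
    on=["home_team", "year"],
    how="left",
    indicator=True,  # adds a _merge column: "both" or "left_only"
)
unmatched = checked.loc[checked["_merge"] == "left_only", "home_team"].tolist()
print(unmatched)
```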



In [9]:
# Create spread, formatted_spread, and favorite columns from lines data and drop lines column
for index, row in all_games.iterrows():
    # Default to NaN when a game has no lines at all or no usable spread
    spread, formatted_spread = np.nan, np.nan
    for line in row["lines"]:
        try:
            spread = abs(float(line["spread"]))
        except TypeError:
            # This provider has no usable spread; try the next one
            continue
        formatted_spread = line["formattedSpread"]
        break
    all_games.loc[index, "spread"] = spread
    all_games.loc[index, "formatted_spread"] = formatted_spread

all_games = all_games.dropna()
all_games = all_games.drop(columns=["lines"])

for index, row in all_games.iterrows():
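    # Compare the full team name (the text before the trailing spread number) so that
    # a home team such as "Florida" is not matched inside e.g. "Florida State -7"
    if row["formatted_spread"].rsplit(" ", 1)[0] == row["home_team"]: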
        all_games.loc[index, "favorite"] = row["home_team"]
    else:
        all_games.loc[index, "favorite"] = row["away_team"]

all_games

Unnamed: 0,game_id,year,week,home_team,away_team,home_ypp,home_success_rate,home_ppapsp,home_afp,home_ppti40,...,away_ppti40,away_turnovers,home_score,away_score,W/L,home_team_recruiting,away_team_recruiting,spread,formatted_spread,favorite
0,400603840,2015,1,South Carolina,North Carolina,4.425,0.388,0.432,31.818,4.250,...,2.167,0.043,17,13,0,231.28,211.35,3.5,South Carolina -3.5,South Carolina
1,400756942,2015,5,Georgia Tech,North Carolina,5.488,0.585,0.501,40.385,3.875,...,5.429,0.000,31,38,1,170.17,211.35,10.5,Georgia Tech -10.5,Georgia Tech
2,400756966,2015,9,Pittsburgh,North Carolina,4.762,0.405,0.451,27.091,3.333,...,3.714,0.000,19,26,1,188.96,211.35,2.5,North Carolina -2.5,North Carolina
3,400756997,2015,13,NC State,North Carolina,5.011,0.436,0.428,30.833,4.250,...,4.500,0.027,34,45,1,192.01,211.35,6.0,North Carolina -6,North Carolina
4,400756992,2015,12,Virginia Tech,North Carolina,4.466,0.432,0.413,33.529,3.857,...,5.167,0.012,27,30,1,217.76,211.35,7.5,North Carolina -7.5,North Carolina
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4013,401110791,2019,3,Ole Miss,Southeastern Louisiana,5.627,0.386,0.576,33.857,5.125,...,4.429,0.042,40,29,0,219.08,16.95,30.5,Ole Miss -30.5,Ole Miss
4014,401117861,2019,2,Houston,Prairie View,5.181,0.431,0.437,39.867,4.111,...,4.250,0.013,37,17,0,164.20,29.02,35.0,Houston -35,Houston
4015,401114165,2019,2,Arizona,Northern Arizona,9.342,0.632,0.914,23.833,7.000,...,5.125,0.024,65,41,0,182.19,58.88,28.0,Arizona -28,Arizona
4016,401112464,2019,4,Wake Forest,Elon,7.155,0.536,0.659,29.182,6.125,...,7.000,0.000,49,7,0,173.52,11.30,29.5,Wake Forest -29.5,Wake Forest
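
The same extraction can be packaged as two small functions, which makes it easy to check against hand-written cases. This is a sketch: the function names and the sample entries (including the provider names) are hypothetical; only the `spread` and `formattedSpread` keys come from the API output shown earlier.

```python
def first_usable_line(lines):
    """Return (spread, formatted_spread) from the first entry whose spread
    converts to a float, or None if there is no such entry."""
    for line in lines:
        try:
            return abs(float(line["spread"])), line["formattedSpread"]
        except TypeError:  # e.g. spread is None
            continue
    return None


def favorite_from(formatted_spread, home_team, away_team):
    """The favorite is the team named before the trailing spread number."""
    name = formatted_spread.rsplit(" ", 1)[0]
    return home_team if name == home_team else away_team


# Made-up entries: the first one has no spread, so the second is used.
sample = [
    {"provider": "x", "spread": None, "formattedSpread": None},
    {"provider": "y", "spread": "-6", "formattedSpread": "North Carolina -6"},
]
spread, text = first_usable_line(sample)
print(spread, favorite_from(text, "NC State", "North Carolina"))
```

Comparing the whole name rather than testing for a substring matters when one team's name is contained in the other's.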



In [10]:
# Get season stats for each team leading up to each game
stats_d_list = list()
for year in ["2015", "2016", "2017", "2018", "2019"]:
    fbs = http.request(
        "GET", "https://api.collegefootballdata.com/teams/fbs?year=" + year
    )
    fbs_teams_dict = json.loads(fbs.data.decode("UTF-8"))
    fbs_teams = list()
    for d in fbs_teams_dict:
        fbs_teams.append(d["school"])

    for team in fbs_teams:
        team_df = all_games[
            (all_games["home_team"] == team) | (all_games["away_team"] == team)
        ]
        team_df = team_df.loc[team_df["year"] == int(year)]
        weeks = list(team_df["week"].values)
        weeks.sort(reverse=True)
        for week in weeks:
            stats_d = dict()
            current_game = team_df.loc[team_df["week"] == week]
            # Copy so the column assignments below modify this frame, not a slice of team_df
            previous_games = team_df.loc[team_df["week"] < week].copy()
            for index, row in previous_games.iterrows():
                if row["home_team"] == team:
                    previous_games.loc[index, "offense_ypp"] = row["home_ypp"]
                    previous_games.loc[index, "offense_success_rate"] = row[
                        "home_success_rate"
                    ]
                    previous_games.loc[index, "offense_ppapsp"] = row["home_ppapsp"]
                    previous_games.loc[index, "offense_ppti40"] = row["home_ppti40"]
                    previous_games.loc[index, "offense_turnovers"] = row[
                        "home_turnovers"
                    ]
                    previous_games.loc[index, "offense_afp"] = row["home_afp"]

                    previous_games.loc[index, "defense_ypp"] = row["away_ypp"]
                    previous_games.loc[index, "defense_success_rate"] = row[
                        "away_success_rate"
                    ]
                    previous_games.loc[index, "defense_ppapsp"] = row["away_ppapsp"]
                    previous_games.loc[index, "defense_ppti40"] = row["away_ppti40"]
                    previous_games.loc[index, "defense_turnovers"] = row[
                        "away_turnovers"
                    ]
                    previous_games.loc[index, "defense_afp"] = row["away_afp"]
                    previous_games.loc[index, "average_opponent_recruiting"] = row[
                        "away_team_recruiting"
                    ]
                else:
                    previous_games.loc[index, "offense_ypp"] = row["away_ypp"]
                    previous_games.loc[index, "offense_success_rate"] = row[
                        "away_success_rate"
                    ]
                    previous_games.loc[index, "offense_ppapsp"] = row["away_ppapsp"]
                    previous_games.loc[index, "offense_ppti40"] = row["away_ppti40"]
                    previous_games.loc[index, "offense_turnovers"] = row[
                        "away_turnovers"
                    ]
                    previous_games.loc[index, "offense_afp"] = row["away_afp"]
                    previous_games.loc[index, "defense_ypp"] = row["home_ypp"]
                    previous_games.loc[index, "defense_success_rate"] = row[
                        "home_success_rate"
                    ]
                    previous_games.loc[index, "defense_ppapsp"] = row["home_ppapsp"]
                    previous_games.loc[index, "defense_ppti40"] = row["home_ppti40"]
                    previous_games.loc[index, "defense_turnovers"] = row[
                        "home_turnovers"
                    ]
                    previous_games.loc[index, "defense_afp"] = row["home_afp"]
                    previous_games.loc[index, "average_opponent_recruiting"] = row[
                        "home_team_recruiting"
                    ]

            previous_games = previous_games.drop(
                columns=[
                    "home_score",
                    "away_score",
                    "W/L",
                    "game_id",
                    "year",
                    "week",
                    "home_team",
                    "away_team",
                    "home_ypp",
                    "home_success_rate",
                    "home_ppapsp",
                    "home_afp",
                    "home_ppti40",
                    "home_turnovers",
                    "away_ypp",
                    "away_success_rate",
                    "away_ppapsp",
                    "away_afp",
                    "away_ppti40",
                    "away_turnovers",
                    "home_team_recruiting",
                    "away_team_recruiting",
                    "spread",
                    "formatted_spread",
                    "favorite",
                ]
            )
            means = previous_games.mean()
            stats_d["team"] = team
            stats_d["game_id"] = current_game["game_id"].values[0]
            for k, v in means.items():
                stats_d[k] = round(v, 2)

            stats_d_list.append(stats_d)

stats_df = pd.DataFrame(stats_d_list)
stats_df = stats_df.dropna()
stats_df



Unnamed: 0,team,game_id,offense_ypp,offense_success_rate,offense_ppapsp,offense_ppti40,offense_turnovers,offense_afp,defense_ypp,defense_success_rate,defense_ppapsp,defense_ppti40,defense_turnovers,defense_afp,average_opponent_recruiting
0,Air Force,400852675,6.14,0.44,0.50,4.77,0.03,27.85,5.00,0.32,0.48,4.07,0.02,30.04,126.91
1,Air Force,400787292,6.04,0.45,0.49,4.55,0.02,27.69,4.75,0.32,0.46,3.99,0.02,29.21,128.10
2,Air Force,400790881,5.81,0.45,0.47,4.54,0.02,27.70,4.73,0.32,0.46,3.96,0.02,28.79,123.51
3,Air Force,400787280,5.70,0.45,0.45,4.49,0.02,28.36,4.53,0.32,0.44,3.76,0.03,29.32,125.29
4,Air Force,400760497,5.63,0.45,0.46,4.41,0.03,29.01,4.78,0.32,0.48,4.08,0.03,29.39,130.71
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7466,Wyoming,401117526,5.79,0.38,0.48,4.81,0.01,31.40,4.90,0.39,0.48,3.79,0.02,27.79,146.38
7467,Wyoming,401117523,6.00,0.38,0.48,4.27,0.01,32.44,4.91,0.39,0.50,3.61,0.02,27.74,144.77
7468,Wyoming,401117514,5.33,0.37,0.39,3.86,0.01,29.10,5.30,0.42,0.53,3.40,0.02,28.70,150.74
7469,Wyoming,401117509,5.62,0.41,0.40,4.03,0.01,30.89,5.60,0.46,0.51,3.10,0.03,27.15,158.41
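
The "average of all earlier games this season" computed above can be expressed without nested loops by using an expanding mean shifted by one game. The sketch below assumes a long-format frame with one row per team per game and at most one game per team per week; the frame and the `season_offense_ypp` column name are made up for illustration.

```python
import pandas as pd

# Toy long-format data: one row per team per game.
toy = pd.DataFrame(
    {
        "team": ["A", "A", "A", "B", "B"],
        "year": [2019] * 5,
        "week": [1, 2, 3, 1, 2],
        "offense_ypp": [4.0, 6.0, 8.0, 5.0, 7.0],
    }
).sort_values(["team", "year", "week"])

# Mean of all earlier games in the same season; shift(1) excludes the current game,
# so each team's first game of a season gets NaN.
toy["season_offense_ypp"] = toy.groupby(["team", "year"])["offense_ypp"].transform(
    lambda s: s.expanding().mean().shift(1)
)
print(toy)
```

The NaN rows correspond to the games that the `dropna()` above removes.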



In [11]:
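# Snapshot all_games before season stats are added; the dropna in the next cell removes
# games that have no earlier games to average (e.g. week-one games)
week_ones = all_games.copy()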


In [12]:
# Add season stats to all games df
for index, row in all_games.iterrows():
    id_df = stats_df.loc[stats_df["game_id"] == row["game_id"]]
    for i, r in id_df.iterrows():
        if r["team"] == row["home_team"]:
            for k, v in r.items():
                if k not in ["team", "game_id"]:
                    all_games.loc[index, "home_season_" + k] = v
        elif r["team"] == row["away_team"]:
            for k, v in r.items():
                if k not in ["team", "game_id"]:
                    all_games.loc[index, "away_season_" + k] = v

all_games = all_games.dropna()
all_games

Unnamed: 0,game_id,year,week,home_team,away_team,home_ypp,home_success_rate,home_ppapsp,home_afp,home_ppti40,...,away_season_offense_ppti40,away_season_offense_turnovers,away_season_offense_afp,away_season_defense_ypp,away_season_defense_success_rate,away_season_defense_ppapsp,away_season_defense_ppti40,away_season_defense_turnovers,away_season_defense_afp,away_season_average_opponent_recruiting
1,400756942,2015,5,Georgia Tech,North Carolina,5.488,0.585,0.501,40.385,3.875,...,4.41,0.02,30.53,4.78,0.41,0.42,3.03,0.02,26.81,108.05
2,400756966,2015,9,Pittsburgh,North Carolina,4.762,0.405,0.451,27.091,3.333,...,4.81,0.02,32.20,4.72,0.42,0.41,3.05,0.02,28.86,137.36
3,400756997,2015,13,NC State,North Carolina,5.011,0.436,0.428,30.833,4.250,...,5.01,0.01,32.25,4.76,0.44,0.44,3.32,0.02,28.84,161.12
4,400756992,2015,12,Virginia Tech,North Carolina,4.466,0.432,0.413,33.529,3.857,...,4.99,0.01,32.29,4.79,0.44,0.44,3.26,0.02,28.37,155.46
5,400603867,2015,2,South Carolina,Kentucky,6.677,0.492,0.487,34.545,2.875,...,5.12,0.02,30.29,5.52,0.53,0.46,4.43,0.03,25.31,129.01
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3981,401114266,2019,4,UTEP,Nevada,3.712,0.394,0.417,21.545,7.000,...,3.68,0.03,29.29,5.37,0.38,0.53,4.86,0.02,34.00,163.91
3982,401117542,2019,11,San Diego State,Nevada,3.231,0.346,0.313,25.273,3.250,...,3.16,0.03,28.48,5.68,0.40,0.54,4.75,0.02,31.33,137.12
3983,401114193,2019,2,Oregon,Nevada,7.462,0.475,0.800,44.923,6.364,...,5.67,0.00,33.00,5.77,0.45,0.50,3.88,0.04,25.31,190.64
3984,401117527,2019,8,Utah State,Nevada,5.570,0.367,0.441,38.000,3.400,...,3.50,0.03,30.25,5.65,0.41,0.57,5.48,0.02,31.78,143.91
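
Attaching the per-team season stats can also be done with two merges, one per side, rather than row-by-row assignment. This is a sketch on toy frames; `attach` is a hypothetical helper, and the column prefixes mirror the `home_season_` / `away_season_` names used above.

```python
import pandas as pd

# Toy frames shaped like all_games and stats_df (one stats row per team per game).
toy_games = pd.DataFrame(
    {"game_id": ["1", "2"], "home_team": ["A", "B"], "away_team": ["B", "C"]}
)
toy_stats = pd.DataFrame(
    {
        "team": ["A", "B", "B", "C"],
        "game_id": ["1", "1", "2", "2"],
        "offense_ypp": [5.0, 6.0, 6.5, 4.5],
    }
)


def attach(games, stats, side):
    """Merge per-team stats onto games for one side ("home" or "away")."""
    # Prefix every stats column except game_id, then map the team column
    # onto the matching home_team / away_team key.
    renamed = stats.rename(
        columns=lambda c: c if c == "game_id" else f"{side}_season_{c}"
    ).rename(columns={f"{side}_season_team": f"{side}_team"})
    return games.merge(renamed, on=["game_id", f"{side}_team"], how="left")


merged = attach(attach(toy_games, toy_stats, "home"), toy_stats, "away")
print(merged)
```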



In [39]:
# Drop unwanted columns and add stats columns for the favorite and underdog
games = all_games[
    [
        "game_id",
        "year",
        "week",
        "home_team",
        "away_team",
        "home_score",
        "away_score",
        "spread",
        "formatted_spread",
        "favorite",
        "home_team_recruiting",
        "home_season_average_opponent_recruiting",
        "away_team_recruiting",
        "away_season_average_opponent_recruiting",
        "home_season_offense_turnovers",
        "home_season_defense_turnovers",
        "away_season_offense_turnovers",
        "away_season_defense_turnovers",
        "home_season_offense_ppti40",
        "home_season_defense_ppti40",
        "away_season_offense_ppti40",
        "away_season_defense_ppti40",
        "home_season_offense_afp",
        "home_season_defense_afp",
        "away_season_offense_afp",
        "away_season_defense_afp",
    ]
].reset_index(drop=True)

for index, row in games.iterrows():
    if row["favorite"] == row["home_team"]:
        games.loc[index, "dog"] = row["away_team"]
        for k, v in row.items():
            if k not in [
                "game_id",
                "year",
                "week",
                "home_team",
                "away_team",
                "spread",
                "formatted_spread",
                "favorite",
            ]:
                if "home" in k:
                    games.loc[index, "favorite_" + k[5:]] = v
                if "away" in k:
                    games.loc[index, "dog_" + k[5:]] = v
    else:
        games.loc[index, "dog"] = row["home_team"]
        for k, v in row.items():
            if k not in [
                "game_id",
                "year",
                "week",
                "home_team",
                "away_team",
                "spread",
                "formatted_spread",
                "favorite",
            ]:
                if "home" in k:
                    games.loc[index, "dog_" + k[5:]] = v
                if "away" in k:
                    games.loc[index, "favorite_" + k[5:]] = v

games = games.drop(
    columns=[
        "home_team",
        "away_team",
        "home_score",
        "away_score",
        "home_team_recruiting",
        "home_season_average_opponent_recruiting",
        "away_team_recruiting",
        "away_season_average_opponent_recruiting",
        "home_season_offense_turnovers",
        "home_season_defense_turnovers",
        "away_season_offense_turnovers",
        "away_season_defense_turnovers",
        "home_season_offense_ppti40",
        "home_season_defense_ppti40",
        "away_season_offense_ppti40",
        "away_season_defense_ppti40",
        "home_season_offense_afp",
        "home_season_defense_afp",
        "away_season_offense_afp",
        "away_season_defense_afp",
    ]
)

games["favorite_score"] = games["favorite_score"].astype(int)
games["dog_score"] = games["dog_score"].astype(int)
games["point_difference"] = games["favorite_score"] - games["dog_score"]
games["W/L"] = np.where(games["favorite_score"] > games["dog_score"], 0, 1)

games

Unnamed: 0,game_id,year,week,spread,formatted_spread,favorite,dog,favorite_score,dog_score,favorite_team_recruiting,...,favorite_season_offense_ppti40,favorite_season_defense_ppti40,dog_season_offense_ppti40,dog_season_defense_ppti40,favorite_season_offense_afp,favorite_season_defense_afp,dog_season_offense_afp,dog_season_defense_afp,point_difference,W/L
0,400756942,2015,5,10.5,Georgia Tech -10.5,Georgia Tech,North Carolina,31,38,170.17,...,4.38,4.25,4.41,3.03,34.88,34.67,30.53,26.81,-7,1
1,400756966,2015,9,2.5,North Carolina -2.5,North Carolina,Pittsburgh,26,19,211.35,...,4.81,3.05,4.36,4.98,32.20,28.86,35.23,24.94,7,0
2,400756997,2015,13,6.0,North Carolina -6,North Carolina,NC State,45,34,211.35,...,5.01,3.32,4.73,4.22,32.25,28.84,33.94,26.56,11,0
3,400756992,2015,12,7.5,North Carolina -7.5,North Carolina,Virginia Tech,30,27,211.35,...,4.99,3.26,4.18,3.96,32.29,28.37,33.96,31.04,3,0
4,400603867,2015,2,7.5,South Carolina -7.5,South Carolina,Kentucky,22,26,231.28,...,4.25,2.17,5.12,4.43,31.82,22.80,30.29,25.31,-4,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3237,401114266,2019,4,13.5,Nevada -13.5,Nevada,UTEP,37,21,142.09,...,3.68,4.86,3.06,4.04,29.29,34.00,24.38,33.37,16,0
3238,401117542,2019,11,17.0,San Diego State -17.0,San Diego State,Nevada,13,17,152.81,...,3.33,3.59,3.16,4.75,33.32,24.85,28.48,31.33,-4,1
3239,401114193,2019,2,24.0,Oregon -24,Oregon,Nevada,77,6,256.08,...,3.50,3.86,5.67,3.88,33.47,35.71,33.00,25.31,71,0
3240,401117527,2019,8,22.5,Utah State -22.5,Utah State,Nevada,36,10,131.18,...,3.29,3.34,3.50,5.48,30.50,28.10,30.25,31.78,26,0
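
The home/away to favorite/dog re-orientation above boils down to a conditional select per column, which `np.where` handles directly. A minimal sketch on made-up rows, shown for the score columns only:

```python
import numpy as np
import pandas as pd

# Toy rows: the home team is the favorite in the first game, the away team in the second.
toy = pd.DataFrame(
    {
        "home_team": ["A", "C"],
        "away_team": ["B", "D"],
        "favorite": ["A", "D"],
        "home_score": [31, 19],
        "away_score": [38, 26],
    }
)

home_is_favorite = toy["favorite"] == toy["home_team"]
toy["favorite_score"] = np.where(home_is_favorite, toy["home_score"], toy["away_score"])
toy["dog_score"] = np.where(home_is_favorite, toy["away_score"], toy["home_score"])
toy["point_difference"] = toy["favorite_score"] - toy["dog_score"]
print(toy[["favorite", "favorite_score", "dog_score", "point_difference"]])
```

The same pattern would apply to each `home_*` / `away_*` stat pair.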



In [16]:
# Get advanced stats for each game
advanced_frames = []
for year in ["2015", "2016", "2017", "2018", "2019"]:
    advanced_games = http.request(
        "GET",
        "https://api.collegefootballdata.com/stats/game/advanced?year="
        + year
        + "&excludeGarbageTime=true",
    )

    advanced_games_dict_list = json.loads(advanced_games.data.decode("UTF-8"))
    advanced_frames.append(pd.DataFrame(advanced_games_dict_list))

# pd.concat replaces DataFrame.append, which was removed in pandas 2.0
advanced_games_df = pd.concat(advanced_frames)

advanced_games_df = advanced_games_df.reset_index(drop=True)
advanced_games_df

Unnamed: 0,gameId,week,team,opponent,offense,defense
0,400603827,1,Alabama,Wisconsin,"{'plays': 66, 'drives': 12, 'ppa': 0.464209394...","{'plays': 55, 'drives': 12, 'ppa': -0.01760603..."
1,400603827,1,Wisconsin,Alabama,"{'plays': 55, 'drives': 12, 'ppa': -0.01760603...","{'plays': 66, 'drives': 12, 'ppa': 0.464209394..."
2,400603828,1,Arkansas,UTEP,"{'plays': 31, 'drives': 8, 'ppa': 0.9844748975...","{'plays': 31, 'drives': 7, 'ppa': 0.0349158642..."
3,400603828,1,UTEP,Arkansas,"{'plays': 31, 'drives': 7, 'ppa': 0.0349158642...","{'plays': 31, 'drives': 8, 'ppa': 0.9844748975..."
4,400603829,1,Auburn,Louisville,"{'plays': 60, 'drives': 10, 'ppa': 0.247254666...","{'plays': 79, 'drives': 11, 'ppa': 0.232098552..."
...,...,...,...,...,...,...
8709,401147693,1,Liberty,Georgia Southern,"{'plays': 72, 'drives': 15, 'ppa': 0.098314702...","{'plays': 68, 'drives': 15, 'ppa': -0.00021730..."
8710,401147695,1,Georgia State,Wyoming,"{'plays': 65, 'drives': 13, 'ppa': -0.00528030...","{'plays': 71, 'drives': 12, 'ppa': 0.324194409..."
8711,401147695,1,Wyoming,Georgia State,"{'plays': 71, 'drives': 12, 'ppa': 0.324194409...","{'plays': 65, 'drives': 13, 'ppa': -0.00528030..."
8712,401183793,13,Air Force,New Mexico,"{'plays': 53, 'drives': 9, 'ppa': 0.7214334945...","{'plays': 52, 'drives': 8, 'ppa': 0.2251921060..."



In [17]:
# Add columns for each statistic
for index, row in advanced_games_df.iterrows():
    for elem in ["offense", "defense"]:
        for k, v in row[elem].items():
            try:
                advanced_games_df.loc[index, elem + "_" + k] = v
            except (ValueError):
                for key, value in v.items():
                    advanced_games_df.loc[index, elem + "_" + k + "_" + key] = value

advanced_games_df = advanced_games_df.drop(
    columns=[
        "offense",
        "defense",
        "offense_standardDowns",
        "offense_passingDowns",
        "offense_rushingPlays",
        "offense_passingPlays",
        "defense_standardDowns",
        "defense_passingDowns",
        "defense_rushingPlays",
        "defense_passingPlays",
    ]
)
advanced_games_df

Unnamed: 0,gameId,week,team,opponent,offense_plays,offense_drives,offense_ppa,offense_totalPPA,offense_successRate,offense_explosiveness,...,defense_passingDowns_successRate,defense_passingDowns_explosiveness,defense_rushingPlays_ppa,defense_rushingPlays_totalPPA,defense_rushingPlays_successRate,defense_rushingPlays_explosiveness,defense_passingPlays_ppa,defense_passingPlays_totalPPA,defense_passingPlays_successRate,defense_passingPlays_explosiveness
0,400603827,1,Alabama,Wisconsin,66.0,12.0,0.464209,30.637820,0.469697,1.518528,...,0.315789,1.430430,-0.538572,-8.078584,0.133333,0.486212,0.162932,6.354361,0.384615,1.270774
1,400603827,1,Wisconsin,Alabama,55.0,12.0,-0.017606,-0.968332,0.327273,1.154995,...,0.272727,3.257768,0.508344,17.283705,0.500000,1.561503,0.417316,13.354115,0.437500,1.466344
2,400603828,1,Arkansas,UTEP,31.0,8.0,0.984475,30.518722,0.645161,1.871235,...,0.444444,1.065530,-0.093475,-1.962966,0.380952,0.849197,0.332032,2.656256,0.500000,1.247460
3,400603828,1,UTEP,Arkansas,31.0,7.0,0.034916,1.082392,0.451613,0.869466,...,0.555556,2.953390,0.472273,8.028640,0.529412,1.524874,1.818364,23.638730,0.769231,2.484947
4,400603829,1,Auburn,Louisville,60.0,10.0,0.247255,14.835280,0.516667,1.000941,...,0.423077,1.093286,0.470520,19.761857,0.571429,1.162231,-0.041982,-1.511344,0.361111,1.187383
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8709,401147693,1,Liberty,Georgia Southern,72.0,15.0,0.098315,7.078659,0.319444,1.310535,...,0.178571,1.181051,0.078738,3.385741,0.279070,1.330397,-0.081391,-1.790613,0.181818,1.442744
8710,401147695,1,Georgia State,Wyoming,65.0,13.0,-0.005280,-0.343220,0.353846,1.196881,...,0.400000,2.816403,0.230504,9.911674,0.441860,1.019805,0.426784,11.523176,0.370370,2.924325
8711,401147695,1,Wyoming,Georgia State,71.0,12.0,0.324194,23.017803,0.422535,1.673416,...,0.263158,1.166447,0.013668,0.519400,0.421053,0.940456,-0.004181,-0.108709,0.230769,2.205813
8712,401183793,13,Air Force,New Mexico,53.0,9.0,0.721433,38.235975,0.547170,1.695182,...,0.153846,1.912481,0.404539,13.349784,0.666667,0.793831,-0.141450,-2.546105,0.222222,1.363326
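
Flattening nested dictionaries into columns is what `pd.json_normalize` is for, and with `sep="_"` it produces the same style of column names as the loop above. The records below are made up and much smaller than the real API response.

```python
import pandas as pd

# Toy records with the same nesting pattern as the advanced stats response.
records = [
    {
        "gameId": 1,
        "team": "A",
        "offense": {"plays": 66, "ppa": 0.46, "standardDowns": {"ppa": 0.3}},
    },
    {
        "gameId": 1,
        "team": "B",
        "offense": {"plays": 55, "ppa": -0.02, "standardDowns": {"ppa": 0.1}},
    },
]

# Nested keys are joined with "_", e.g. offense -> standardDowns -> ppa
# becomes the column offense_standardDowns_ppa.
flat = pd.json_normalize(records, sep="_")
print(flat.columns.tolist())
```

Applied to the parsed JSON list for each year, this would avoid both the row-by-row assignment and the later drop of the leftover dictionary columns.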



In [21]:
# Get season stats for each team prior to each game
advanced_games = advanced_games_df.copy()
advanced_games = advanced_games.rename(columns={"gameId": "game_id"})
advanced_games["game_id"] = advanced_games["game_id"].astype(str)
advanced_games = advanced_games.merge(week_ones[["year", "game_id"]], on="game_id")

stats_d_list = list()
for year in sorted(list(advanced_games["year"].unique())):
    for team in sorted(list(advanced_games["team"].unique())):
        for week in sorted(list(advanced_games["week"].unique()), reverse=True):
            stats_d = dict()
            team_df = advanced_games.loc[advanced_games["team"] == team]
            team_df = team_df.loc[team_df["year"] == year]
            current_game = team_df.loc[team_df["week"] == week]
            if len(current_game) > 0:
                team_df = team_df.loc[team_df["week"] < week]
                stats_d["game_id"] = current_game["game_id"].values[0]
                stats_d["year"] = year
                stats_d["week"] = week
                stats_d["team"] = team
                means = team_df.drop(
                    columns=["game_id", "week", "team", "opponent", "year"]
                ).mean()
                for k, v in means.items():
                    stats_d[k] = round(v, 2)
                stats_d_list.append(stats_d)

advanced_stats = pd.DataFrame(stats_d_list)
advanced_stats

Unnamed: 0,game_id,year,week,team,offense_plays,offense_drives,offense_ppa,offense_totalPPA,offense_successRate,offense_explosiveness,...,defense_passingDowns_successRate,defense_passingDowns_explosiveness,defense_rushingPlays_ppa,defense_rushingPlays_totalPPA,defense_rushingPlays_successRate,defense_rushingPlays_explosiveness,defense_passingPlays_ppa,defense_passingPlays_totalPPA,defense_passingPlays_successRate,defense_passingPlays_explosiveness
0,400852675,2015,14,Air Force,64.64,12.27,0.24,15.96,0.44,1.23,...,0.27,1.96,0.06,2.08,0.29,1.56,0.29,5.65,0.35,2.04
1,400787292,2015,13,Air Force,65.80,12.00,0.24,16.10,0.45,1.18,...,0.26,2.00,0.03,0.67,0.29,1.51,0.22,5.34,0.34,1.98
2,400790881,2015,12,Air Force,65.11,11.67,0.23,15.23,0.45,1.17,...,0.26,2.03,0.01,0.01,0.28,1.50,0.25,6.16,0.34,1.98
3,400787280,2015,11,Air Force,63.62,11.50,0.21,13.75,0.45,1.13,...,0.26,2.04,0.01,0.00,0.27,1.54,0.26,5.54,0.35,1.98
4,400760497,2015,10,Air Force,64.86,11.57,0.23,15.01,0.46,1.15,...,0.27,2.07,0.04,1.08,0.27,1.69,0.30,6.43,0.37,1.90
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7901,401117523,2019,7,Wyoming,59.50,13.00,0.19,9.70,0.38,1.41,...,0.32,1.72,0.08,2.01,0.34,1.25,0.09,3.28,0.37,1.79
7902,401117514,2019,5,Wyoming,63.33,13.33,0.09,5.34,0.36,1.31,...,0.36,1.79,0.11,2.79,0.39,1.10,0.07,3.05,0.40,1.73
7903,401117509,2019,4,Wyoming,63.50,12.50,0.10,5.72,0.43,0.97,...,0.37,1.48,0.20,5.55,0.43,1.01,0.07,3.04,0.44,1.43
7904,401117500,2019,2,Wyoming,59.00,13.00,0.23,13.75,0.42,1.33,...,0.32,1.17,0.21,7.79,0.51,0.70,0.28,13.45,0.46,1.42
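
The loop above re-filters the full frame for every year, team and week combination. As a hedged alternative, the same "mean of all earlier games in the season" could be computed with a grouped expanding mean shifted by one game. The helper name `prior_season_means` and the toy data are illustrative and not part of the notebook; the sketch assumes at most one game per team per week.

```python
import pandas as pd


def prior_season_means(games_df, stat_cols):
    """Mean of each stat over a team's earlier games in the same season."""
    ordered = games_df.sort_values(["year", "team", "week"]).reset_index(drop=True)
    prior = (
        ordered.groupby(["year", "team"])[stat_cols]
        # expanding mean includes the current game, so shift by one to exclude it
        .transform(lambda s: s.expanding().mean().shift(1))
        .round(2)
    )
    id_cols = ordered[["game_id", "year", "week", "team"]]
    return pd.concat([id_cols, prior], axis=1)


# Toy data (made up for illustration)
toy = pd.DataFrame(
    {
        "game_id": ["1", "2", "3", "4"],
        "year": [2019, 2019, 2019, 2019],
        "week": [1, 2, 3, 1],
        "team": ["A", "A", "A", "B"],
        "offense_ppa": [0.1, 0.3, 0.5, 0.2],
    }
)
toy_stats = prior_season_means(toy, ["offense_ppa"])
print(toy_stats)
```

As in the loop, a team's first game of a season has no prior games and comes out as NaN, so the later `dropna()` still applies.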



In [30]:
advanced_stats = advanced_stats.dropna()


In [41]:
# Add season stats to games df
for index, row in games.iterrows():
    id_df = advanced_stats.loc[advanced_stats["game_id"] == row["game_id"]]
    for i, r in id_df.iterrows():
        if r["team"] == row["favorite"]:
            for k, v in r.items():
                if k not in ["team", "game_id"]:
                    games.loc[index, "favorite_season_" + k] = v
        elif r["team"] == row["dog"]:
            for k, v in r.items():
                if k not in ["team", "game_id"]:
                    games.loc[index, "dog_season_" + k] = v

games = games.dropna()
games

Unnamed: 0,game_id,year,week,spread,formatted_spread,favorite,dog,favorite_score,dog_score,favorite_team_recruiting,...,dog_season_defense_passingDowns_successRate,dog_season_defense_passingDowns_explosiveness,dog_season_defense_rushingPlays_ppa,dog_season_defense_rushingPlays_totalPPA,dog_season_defense_rushingPlays_successRate,dog_season_defense_rushingPlays_explosiveness,dog_season_defense_passingPlays_ppa,dog_season_defense_passingPlays_totalPPA,dog_season_defense_passingPlays_successRate,dog_season_defense_passingPlays_explosiveness
0,400756942,2015,5,10.5,Georgia Tech -10.5,Georgia Tech,North Carolina,31,38,170.17,...,0.22,1.91,0.09,4.15,0.41,1.03,-0.12,-1.27,0.32,1.52
1,400756966,2015,9,2.5,North Carolina -2.5,North Carolina,Pittsburgh,26,19,211.35,...,0.31,1.50,0.14,4.73,0.39,1.22,0.05,1.32,0.31,1.65
2,400756997,2015,13,6.0,North Carolina -6,North Carolina,NC State,45,34,211.35,...,0.21,1.62,-0.06,-0.36,0.29,1.22,0.17,4.83,0.37,1.56
3,400756992,2015,12,7.5,North Carolina -7.5,North Carolina,Virginia Tech,30,27,211.35,...,0.31,1.80,0.05,1.78,0.35,1.23,0.19,4.81,0.34,2.00
4,400603867,2015,2,7.5,South Carolina -7.5,South Carolina,Kentucky,22,26,231.28,...,0.35,0.85,0.41,17.03,0.57,1.08,-0.01,-0.24,0.44,0.90
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3237,401114266,2019,4,13.5,Nevada -13.5,Nevada,UTEP,37,21,142.09,...,0.42,1.50,0.13,2.94,0.39,1.07,0.32,12.67,0.48,1.45
3238,401117542,2019,11,17.0,San Diego State -17.0,San Diego State,Nevada,13,17,152.81,...,0.34,2.02,0.08,1.67,0.38,1.12,0.30,7.80,0.41,1.78
3239,401114193,2019,2,24.0,Oregon -24,Oregon,Nevada,77,6,256.08,...,0.24,1.45,-0.05,-1.47,0.39,0.59,0.40,19.81,0.47,1.56
3240,401117527,2019,8,22.5,Utah State -22.5,Utah State,Nevada,36,10,131.18,...,0.34,2.08,0.03,0.01,0.39,1.04,0.42,11.88,0.45,1.82
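
The nested `iterrows` loops above write one cell at a time, which is slow on a few thousand games. A minimal sketch of the same join using `merge` follows; the helper `attach_season_stats` and the toy frames are hypothetical, and the sketch drops the stats table's `year` and `week` up front rather than adding and later removing `favorite_season_year` and similar columns.

```python
import pandas as pd


def attach_season_stats(games_df, stats_df, role):
    """Left-join one side's season-to-date stats onto games, prefixed by role."""
    prefix = f"{role}_season_"
    stats = stats_df.drop(columns=["year", "week"])
    stats = stats.rename(
        columns={c: prefix + c for c in stats.columns if c not in ("game_id", "team")}
    ).rename(columns={"team": role})
    # Match on the game and on the team playing this role ("favorite" or "dog")
    return games_df.merge(stats, on=["game_id", role], how="left")


# Toy data (made up for illustration)
toy_games = pd.DataFrame(
    {"game_id": ["1"], "favorite": ["A"], "dog": ["B"], "spread": [7.5]}
)
toy_season_stats = pd.DataFrame(
    {
        "game_id": ["1", "1"],
        "year": [2019, 2019],
        "week": [3, 3],
        "team": ["A", "B"],
        "offense_ppa": [0.25, 0.10],
    }
)

merged = toy_games
for role in ("favorite", "dog"):
    merged = attach_season_stats(merged, toy_season_stats, role)
print(merged)
```

Games with no matching stats row end up with NaN in the new columns, so a following `dropna()` removes them just as in the loop version.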



In [45]:
# Drop redundant columns
games = games.drop(
    columns=[
        "favorite_season_year",
        "favorite_season_week",
        "dog_season_year",
        "dog_season_week",
    ]
).reset_index(drop=True)
games

Unnamed: 0,game_id,year,week,spread,formatted_spread,favorite,dog,favorite_score,dog_score,favorite_team_recruiting,...,dog_season_defense_passingDowns_successRate,dog_season_defense_passingDowns_explosiveness,dog_season_defense_rushingPlays_ppa,dog_season_defense_rushingPlays_totalPPA,dog_season_defense_rushingPlays_successRate,dog_season_defense_rushingPlays_explosiveness,dog_season_defense_passingPlays_ppa,dog_season_defense_passingPlays_totalPPA,dog_season_defense_passingPlays_successRate,dog_season_defense_passingPlays_explosiveness
0,400756942,2015,5,10.5,Georgia Tech -10.5,Georgia Tech,North Carolina,31,38,170.17,...,0.22,1.91,0.09,4.15,0.41,1.03,-0.12,-1.27,0.32,1.52
1,400756966,2015,9,2.5,North Carolina -2.5,North Carolina,Pittsburgh,26,19,211.35,...,0.31,1.50,0.14,4.73,0.39,1.22,0.05,1.32,0.31,1.65
2,400756997,2015,13,6.0,North Carolina -6,North Carolina,NC State,45,34,211.35,...,0.21,1.62,-0.06,-0.36,0.29,1.22,0.17,4.83,0.37,1.56
3,400756992,2015,12,7.5,North Carolina -7.5,North Carolina,Virginia Tech,30,27,211.35,...,0.31,1.80,0.05,1.78,0.35,1.23,0.19,4.81,0.34,2.00
4,400603867,2015,2,7.5,South Carolina -7.5,South Carolina,Kentucky,22,26,231.28,...,0.35,0.85,0.41,17.03,0.57,1.08,-0.01,-0.24,0.44,0.90
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3164,401114266,2019,4,13.5,Nevada -13.5,Nevada,UTEP,37,21,142.09,...,0.42,1.50,0.13,2.94,0.39,1.07,0.32,12.67,0.48,1.45
3165,401117542,2019,11,17.0,San Diego State -17.0,San Diego State,Nevada,13,17,152.81,...,0.34,2.02,0.08,1.67,0.38,1.12,0.30,7.80,0.41,1.78
3166,401114193,2019,2,24.0,Oregon -24,Oregon,Nevada,77,6,256.08,...,0.24,1.45,-0.05,-1.47,0.39,0.59,0.40,19.81,0.47,1.56
3167,401117527,2019,8,22.5,Utah State -22.5,Utah State,Nevada,36,10,131.18,...,0.34,2.08,0.03,0.01,0.39,1.04,0.42,11.88,0.45,1.82



## Modeling

In [122]:
games["home_away_W/L"].value_counts()

0    1837
1    1332
Name: home_away_W/L, dtype: int64


In [201]:
home_away_cols = [
    "home_team_recruiting",
    "home_season_average_opponent_recruiting",
    "away_team_recruiting",
    "away_season_average_opponent_recruiting",
    "home_season_offense_turnovers",
    "home_season_defense_turnovers",
    "away_season_offense_turnovers",
    "away_season_defense_turnovers",
    "home_season_offense_ppti40",
    "home_season_defense_ppti40",
    "away_season_offense_ppti40",
    "away_season_defense_ppti40",
    "home_season_offense_afp",
    "home_season_defense_afp",
    "away_season_offense_afp",
    "away_season_defense_afp",
    "home_season_offense_stuffRate",
    "home_season_offense_rushingPlays_ppa",
    "home_season_offense_rushingPlays_successRate",
    "home_season_offense_rushingPlays_explosiveness",
    "home_season_offense_passingPlays_ppa",
    "home_season_offense_passingPlays_successRate",
    "home_season_offense_passingPlays_explosiveness",
    "home_season_defense_stuffRate",
    "home_season_defense_rushingPlays_ppa",
    "home_season_defense_rushingPlays_successRate",
    "home_season_defense_rushingPlays_explosiveness",
    "home_season_defense_passingPlays_ppa",
    "home_season_defense_passingPlays_successRate",
    "home_season_defense_passingPlays_explosiveness",
    "away_season_offense_stuffRate",
    "away_season_offense_rushingPlays_ppa",
    "away_season_offense_rushingPlays_successRate",
    "away_season_offense_rushingPlays_explosiveness",
    "away_season_offense_passingPlays_ppa",
    "away_season_offense_passingPlays_successRate",
    "away_season_offense_passingPlays_explosiveness",
    "away_season_defense_stuffRate",
    "away_season_defense_rushingPlays_ppa",
    "away_season_defense_rushingPlays_successRate",
    "away_season_defense_rushingPlays_explosiveness",
    "away_season_defense_passingPlays_ppa",
    "away_season_defense_passingPlays_successRate",
    "away_season_defense_passingPlays_explosiveness",
]

favorite_dog_cols = [
    "favorite_team_recruiting",
    "favorite_season_average_opponent_recruiting",
    "dog_team_recruiting",
    "dog_season_average_opponent_recruiting",
    "favorite_season_offense_turnovers",
    "favorite_season_defense_turnovers",
    "dog_season_offense_turnovers",
    "dog_season_defense_turnovers",
    "favorite_season_offense_ppti40",
    "favorite_season_defense_ppti40",
    "dog_season_offense_ppti40",
    "dog_season_defense_ppti40",
    "favorite_season_offense_afp",
    "favorite_season_defense_afp",
    "dog_season_offense_afp",
    "dog_season_defense_afp",
    "favorite_season_offense_stuffRate",
    "favorite_season_offense_rushingPlays_ppa",
    "favorite_season_offense_rushingPlays_successRate",
    "favorite_season_offense_rushingPlays_explosiveness",
    "favorite_season_offense_passingPlays_ppa",
    "favorite_season_offense_passingPlays_successRate",
    "favorite_season_offense_passingPlays_explosiveness",
    "favorite_season_defense_stuffRate",
    "favorite_season_defense_rushingPlays_ppa",
    "favorite_season_defense_rushingPlays_successRate",
    "favorite_season_defense_rushingPlays_explosiveness",
    "favorite_season_defense_passingPlays_ppa",
    "favorite_season_defense_passingPlays_successRate",
    "favorite_season_defense_passingPlays_explosiveness",
    "dog_season_offense_stuffRate",
    "dog_season_offense_rushingPlays_ppa",
    "dog_season_offense_rushingPlays_successRate",
    "dog_season_offense_rushingPlays_explosiveness",
    "dog_season_offense_passingPlays_ppa",
    "dog_season_offense_passingPlays_successRate",
    "dog_season_offense_passingPlays_explosiveness",
    "dog_season_defense_stuffRate",
    "dog_season_defense_rushingPlays_ppa",
    "dog_season_defense_rushingPlays_successRate",
    "dog_season_defense_rushingPlays_explosiveness",
    "dog_season_defense_passingPlays_ppa",
    "dog_season_defense_passingPlays_successRate",
    "dog_season_defense_passingPlays_explosiveness",
]


In [202]:
train_games = games.loc[games["year"] < 2019]
test_games = games.loc[games["year"] == 2019]


### Predict the margin of victory or defeat for the home team

#### Random forest regression

In [225]:
X_train = train_games[home_away_cols]
y_train = train_games["home_away_points_difference"]

points_X_test = test_games[home_away_cols]
points_y_test = test_games["home_away_points_difference"]


In [205]:
grid = {
    "n_estimators": [2000, 2500, 3000],
    "max_depth": [10, 15, 25],
    "min_samples_leaf": [4, 6, 8],
}

points_diff_rfr = GridSearchCV(RandomForestRegressor(), grid, verbose=1, cv=2)
points_diff_rfr.fit(X_train, y_train)
print(points_diff_rfr.best_params_)

Fitting 2 folds for each of 27 candidates, totalling 54 fits


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  54 out of  54 | elapsed: 36.6min finished


{'max_depth': 15, 'min_samples_leaf': 4, 'n_estimators': 2500}



In [206]:
points_diff_rfr.score(points_X_test, points_y_test)

0.3573907357551106


In [409]:
y_pred = points_diff_rfr.predict(points_X_test)
print(round(mean_absolute_error(points_y_test, y_pred), 2))
print(round(r2_score(points_y_test, y_pred), 2))

14.02
0.36



### Predict the winner of a game

#### Random forest classification

In [222]:
X_train = train_games[home_away_cols]
y_train = train_games["home_away_W/L"]

homeaway_X_test = test_games[home_away_cols]
homeaway_y_test = test_games["home_away_W/L"]


In [216]:
grid = {
    "n_estimators": [2000, 2500, 3000],
    "max_depth": [25, 35, 45],
    "min_samples_leaf": [2, 4, 6],
}

win_loss_rfc = GridSearchCV(RandomForestClassifier(), grid, verbose=1, cv=2)
win_loss_rfc.fit(X_train, y_train)
print(win_loss_rfc.best_params_)

Fitting 2 folds for each of 27 candidates, totalling 54 fits


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  54 out of  54 | elapsed: 10.1min finished


{'max_depth': 25, 'min_samples_leaf': 2, 'n_estimators': 3000}



In [217]:
win_loss_rfc.score(homeaway_X_test, homeaway_y_test)

0.7392638036809815


In [414]:
y_pred = win_loss_rfc.predict(homeaway_X_test)
# Predictions are passed first, so rows are predicted classes and columns are actual results
pd.DataFrame(confusion_matrix(y_pred, homeaway_y_test))

Unnamed: 0,0,1
0,313,98
1,72,169
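
As a worked check on the table above, the accuracy and a simple baseline can be recomputed from the four counts. This is a sketch with the numbers copied by hand; because the predictions were passed as the first argument, rows are read as predicted classes and columns as actual results, with class 0 meaning a home win.

```python
# Counts copied from the table above: rows = predicted class, columns = actual class
# (class 0 = home win, class 1 = away win).
matrix = [[313, 98], [72, 169]]

total = sum(sum(row) for row in matrix)
accuracy = (matrix[0][0] + matrix[1][1]) / total

# Share of test games the home team actually won,
# i.e. the accuracy of always picking the home team.
home_win_rate = (matrix[0][0] + matrix[1][0]) / total

# Precision and recall for the "home win" class.
home_precision = matrix[0][0] / sum(matrix[0])
home_recall = matrix[0][0] / (matrix[0][0] + matrix[1][0])

print(f"accuracy:       {accuracy:.3f}")
print(f"always-home:    {home_win_rate:.3f}")
print(f"home precision: {home_precision:.3f}")
print(f"home recall:    {home_recall:.3f}")
```

The accuracy matches the `score` reported earlier (about 0.739), while always picking the home team would be right on about 59% of these games, which is the more useful yardstick than 50%.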



### Predict whether the favorite will cover or the dog will beat the spread

#### Random forest classification

In [221]:
X_train = train_games[favorite_dog_cols]
y_train = train_games["spread_result"]

spread_X_test = test_games[favorite_dog_cols]
spread_y_test = test_games["spread_result"]


In [219]:
grid = {
    "n_estimators": [500, 750, 1000],
    "max_depth": [40, 50, 60],
    "min_samples_leaf": [2, 4, 6],
}

spread_rfc = GridSearchCV(RandomForestClassifier(), grid, verbose=1, cv=2)
spread_rfc.fit(X_train, y_train)
print(spread_rfc.best_params_)

Fitting 2 folds for each of 27 candidates, totalling 54 fits


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  54 out of  54 | elapsed:  3.4min finished


{'max_depth': 60, 'min_samples_leaf': 2, 'n_estimators': 750}



In [220]:
spread_rfc.score(spread_X_test, spread_y_test)

0.47392638036809814


In [416]:
y_pred = spread_rfc.predict(spread_X_test)
# Predictions are passed first, so rows are predicted classes and columns are actual results
pd.DataFrame(confusion_matrix(y_pred, spread_y_test), columns=["cover", "beat"])

Unnamed: 0,cover,beat
0,125,135
1,208,184



## Making a bet based on the prediction of the score difference model

In [254]:
spread_probs = spread_rfc.predict_proba(spread_X_test)
win_loss_probs = win_loss_rfc.predict_proba(homeaway_X_test)
probs_df = pd.DataFrame()
probs_df["win loss prediction"] = win_loss_rfc.predict(homeaway_X_test)
probs_df["win loss result"] = homeaway_y_test.reset_index(drop=True)
probs_df["prob home win"] = win_loss_probs[:, 0]
probs_df["prob away win"] = win_loss_probs[:, 1]
probs_df["points diff prediction"] = points_diff_rfr.predict(points_X_test)
probs_df = probs_df.merge(
    test_games[
        [
            "home_away_points_difference",
            "home_team",
            "away_team",
            "week",
            "favorite_score",
            "dog_score",
            "favorite",
            "dog",
            "spread",
            "formatted_spread",
            "point_difference",
        ]
    ].reset_index(drop=True),
    left_index=True,
    right_index=True,
)

probs_df

Unnamed: 0,win loss prediction,win loss result,prob home win,prob away win,points diff prediction,home_away_points_difference,home_team,away_team,week,favorite_score,dog_score,favorite,dog,spread,formatted_spread,point_difference
0,0,0,0.636375,0.363625,11.064243,3,North Carolina,Miami,2,25,28,Miami,North Carolina,4.5,Miami -4.5,-3
1,1,1,0.291602,0.708398,-9.681975,-4,Pittsburgh,Miami,9,12,16,Pittsburgh,Miami,4.5,Pittsburgh -4.5,-4
2,0,1,0.506270,0.493730,-1.176490,-17,Florida State,Miami,10,10,27,Florida State,Miami,3.0,Florida State -3.0,-17
3,1,0,0.297231,0.702769,-8.272396,10,Duke,Miami,14,17,27,Miami,Duke,9.0,Miami -9.0,-10
4,1,0,0.299011,0.700989,-12.777568,6,Florida International,Miami,13,24,30,Miami,Florida International,21.5,Miami -21.5,-6
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
647,1,1,0.473010,0.526990,-1.451778,-16,UTEP,Nevada,4,37,21,Nevada,UTEP,13.5,Nevada -13.5,16
648,0,1,0.809306,0.190694,15.981547,-4,San Diego State,Nevada,11,13,17,San Diego State,Nevada,17.0,San Diego State -17.0,-4
649,0,0,0.528850,0.471150,3.802350,71,Oregon,Nevada,2,77,6,Oregon,Nevada,24.0,Oregon -24,71
650,0,0,0.669206,0.330794,11.009476,26,Utah State,Nevada,8,36,10,Utah State,Nevada,22.5,Utah State -22.5,26



### Betting strategy
If the underdog is predicted to win the game, we will bet that they will beat the spread. If the favorite is predicted to win, we will bet that they will cover the spread.
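
The rule can be sketched as two small functions, one that picks the bet from the predicted favorite margin and one that grades it against the actual margin. The function names are illustrative. Treating an exact tie with the spread as a "push" is an assumption added here (sportsbooks typically refund such bets); the notebook's own cell counts those games as misses.

```python
def choose_bet(predicted_favorite_margin):
    """Bet 'cover' if the favorite is predicted to win, otherwise 'beat'."""
    return "cover" if predicted_favorite_margin > 0 else "beat"


def grade_bet(bet, favorite_margin, spread):
    """Grade a bet given the favorite's actual margin and the (positive) spread."""
    if favorite_margin == spread:
        return "push"  # assumption: ties with the spread are neither hit nor miss
    favorite_covered = favorite_margin > spread
    return "hit" if (bet == "cover") == favorite_covered else "miss"


# Two games from the 2019 test set shown above
oregon_bet = choose_bet(3.80)  # Oregon -24 over Nevada, predicted to win
print(oregon_bet, grade_bet(oregon_bet, 71, 24.0))

miami_bet = choose_bet(-11.06)  # Miami -4.5 at North Carolina, predicted to lose
print(miami_bet, grade_bet(miami_bet, -3, 4.5))
```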

In [305]:
probs_df["favorite points diff prediction"] = np.where(
    probs_df["favorite"] == probs_df["home_team"],
    probs_df["points diff prediction"],
    (probs_df["points diff prediction"] * -1),
)

probs_df["bet"] = np.where(
    (probs_df["favorite points diff prediction"] > 0), "cover", "beat"
)

probs_df["bet result"] = np.where(
    ((probs_df["bet"] == "cover") & (probs_df["point_difference"] > probs_df["spread"]))
    | (
        (probs_df["bet"] == "beat")
        & (probs_df["point_difference"] < probs_df["spread"])
    ),
    "hit",
    "miss",
)
probs_df["favorite win loss"] = np.where(
    probs_df["favorite_score"] > probs_df["dog_score"], 0, 1
)


In [428]:
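bet_counts = probs_df["bet result"].value_counts()
display(bet_counts)
# Compute the rate from the counts instead of hard-coding them
print("success rate:", str(round(bet_counts["hit"] / bet_counts.sum() * 100, 2)) + "%")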

hit     346
miss    306
Name: bet result, dtype: int64

success rate: 53.07%



In [404]:
for week in range(2, 16):
    print("week:", week)
    display(probs_df.loc[probs_df["week"] == week]["bet result"].value_counts())
    print("`^*^`" * 16, "\n")

week: 2


miss    14
hit     12
Name: bet result, dtype: int64

`^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^` 

week: 3


hit     22
miss    19
Name: bet result, dtype: int64

`^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^` 

week: 4


hit     24
miss    22
Name: bet result, dtype: int64

`^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^` 

week: 5


miss    25
hit     25
Name: bet result, dtype: int64

`^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^` 

week: 6


hit     30
miss    17
Name: bet result, dtype: int64

`^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^` 

week: 7


hit     31
miss    20
Name: bet result, dtype: int64

`^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^` 

week: 8


hit     34
miss    26
Name: bet result, dtype: int64

`^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^` 

week: 9


miss    27
hit     26
Name: bet result, dtype: int64

`^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^` 

week: 10


hit     27
miss    19
Name: bet result, dtype: int64

`^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^` 

week: 11


miss    25
hit     23
Name: bet result, dtype: int64

`^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^` 

week: 12


hit     31
miss    20
Name: bet result, dtype: int64

`^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^` 

week: 13


miss    36
hit     22
Name: bet result, dtype: int64

`^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^` 

week: 14


hit     33
miss    31
Name: bet result, dtype: int64

`^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^` 

week: 15


hit     6
miss    5
Name: bet result, dtype: int64

`^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^` 




Using this strategy, we had a success rate of 53% and 4 losing weeks. In practice, we don't want to bet on every single game. Let's see if we can refine this strategy and improve our success rate.

In [500]:
bets_df = probs_df[
    [
        "week",
        "home_team",
        "away_team",
        "favorite",
        "dog",
        "spread",
        "favorite_score",
        "dog_score",
        "point_difference",
        "favorite points diff prediction",
    ]
]

bets_df = bets_df.copy()

bets_df["spread result"] = np.where(
    bets_df["point_difference"] > bets_df["spread"], "cover", "beat"
)

bets_df

Unnamed: 0,week,home_team,away_team,favorite,dog,spread,favorite_score,dog_score,point_difference,favorite points diff prediction,spread result
0,2,North Carolina,Miami,Miami,North Carolina,4.5,25,28,-3,-11.064243,beat
1,9,Pittsburgh,Miami,Pittsburgh,Miami,4.5,12,16,-4,-9.681975,beat
2,10,Florida State,Miami,Florida State,Miami,3.0,10,27,-17,-1.176490,beat
3,14,Duke,Miami,Miami,Duke,9.0,17,27,-10,8.272396,beat
4,13,Florida International,Miami,Miami,Florida International,21.5,24,30,-6,12.777568,beat
...,...,...,...,...,...,...,...,...,...,...,...
647,4,UTEP,Nevada,Nevada,UTEP,13.5,37,21,16,1.451778,cover
648,11,San Diego State,Nevada,San Diego State,Nevada,17.0,13,17,-4,15.981547,beat
649,2,Oregon,Nevada,Oregon,Nevada,24.0,77,6,71,3.802350,cover
650,8,Utah State,Nevada,Utah State,Nevada,22.5,36,10,26,11.009476,cover



In [444]:
bets_df["spread result"].value_counts()

cover    333
beat     319
Name: spread result, dtype: int64


#### Dogs to beat the spread

In [505]:
dog_results = bets_df.loc[
    (bets_df["favorite points diff prediction"] < 0)
    & (bets_df["spread"] - abs(bets_df["favorite points diff prediction"]) < 7),
    "spread result",
].value_counts()
display(dog_results)
# A bet on the dog hits when the actual spread result is "beat"
print("success rate:", str(round(dog_results["beat"] / dog_results.sum() * 100, 2)) + "%")

beat     55
cover    39
Name: spread result, dtype: int64

success rate: 58.51%



#### Favorites to cover the spread

In [506]:
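favorite_results = bets_df.loc[
    (bets_df["favorite points diff prediction"] > 0)
    & (bets_df["spread"] - bets_df["favorite points diff prediction"] > 7),
    "spread result",
].value_counts()
display(favorite_results)
# A bet on the favorite hits when the actual spread result is "cover"
print(
    "success rate:",
    str(round(favorite_results["cover"] / favorite_results.sum() * 100, 2)) + "%",
)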

cover    89
beat     66
Name: spread result, dtype: int64

success rate: 57.42%



Using our refined strategy, we were able to increase our success rate to about 58% (144 hits on 249 bets).
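
One rough way to judge whether these hit rates differ from coin flipping is a z-score under a normal approximation to the binomial. This is a sketch, not part of the original analysis: it treats every bet as an independent 50/50 event, and if the 7-point threshold was chosen after looking at these same 2019 games, the refined figure will look better than it would on new data. The counts come from the cells above (346 of 652 for every game; 55 + 89 hits on 94 + 155 bets for the refined rule).

```python
import math


def hit_rate_z_score(hits, bets, null_rate=0.5):
    """Standard deviations between the observed hits and the count expected by chance."""
    expected = bets * null_rate
    std = math.sqrt(bets * null_rate * (1 - null_rate))
    return (hits - expected) / std


results = {
    "bet every game": (346, 652),
    "refined strategy": (55 + 89, 94 + 155),
}
for label, (hits, bets) in results.items():
    z = hit_rate_z_score(hits, bets)
    print(f"{label}: {hits}/{bets} = {hits / bets:.1%}, z = {z:.2f}")
```

A z-score near 2 or above is usually read as unlikely to be pure chance, but one season is a small sample either way.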

In [513]:
# Dogs to beat the spread; .copy() avoids SettingWithCopyWarning when adding a column
beats = bets_df.loc[
    (bets_df["favorite points diff prediction"] < 0)
    & (bets_df["spread"] - abs(bets_df["favorite points diff prediction"]) < 7)
].copy()
beats["bet"] = "beat"

# Favorites to cover the spread
covers = bets_df.loc[
    (bets_df["favorite points diff prediction"] > 0)
    & (bets_df["spread"] - bets_df["favorite points diff prediction"] > 7)
].copy()
covers["bet"] = "cover"

# DataFrame.append was removed in pandas 2.0; pd.concat stacks the frames the same way
bets = pd.concat([beats, covers])
bets["bet result"] = np.where(bets["bet"] == bets["spread result"], "hit", "miss")
bets


Unnamed: 0,week,home_team,away_team,favorite,dog,spread,favorite_score,dog_score,point_difference,favorite points diff prediction,spread result,bet,bet result
0,2,North Carolina,Miami,Miami,North Carolina,4.5,25,28,-3,-11.064243,beat,beat,hit
1,9,Pittsburgh,Miami,Pittsburgh,Miami,4.5,12,16,-4,-9.681975,beat,beat,hit
2,10,Florida State,Miami,Florida State,Miami,3.0,10,27,-17,-1.176490,beat,beat,hit
6,11,Kentucky,Tennessee,Kentucky,Tennessee,1.5,13,17,-4,-2.645196,beat,beat,hit
9,6,Florida,Auburn,Auburn,Florida,2.5,13,24,-11,-0.884705,beat,beat,hit
...,...,...,...,...,...,...,...,...,...,...,...,...,...
639,3,Georgia,Arkansas State,Georgia,Arkansas State,33.0,55,0,55,24.403429,cover,cover,hit
642,12,Vanderbilt,Kentucky,Kentucky,Vanderbilt,8.5,38,14,24,0.482674,cover,cover,hit
647,4,UTEP,Nevada,Nevada,UTEP,13.5,37,21,16,1.451778,cover,cover,hit
649,2,Oregon,Nevada,Oregon,Nevada,24.0,77,6,71,3.802350,cover,cover,hit



In [514]:
for week in range(2, 16):
    print("week:", week)
    # Filter with bets' own week column rather than a mask built from probs_df
    display(bets.loc[bets["week"] == week, "bet result"].value_counts())
    print("`^*^`" * 16, "\n")

week: 2


miss    8
hit     6
Name: bet result, dtype: int64

`^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^` 

week: 3


hit     12
miss     6
Name: bet result, dtype: int64

`^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^` 

week: 4


hit     12
miss     6
Name: bet result, dtype: int64

`^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^` 

week: 5


miss    12
hit     11
Name: bet result, dtype: int64

`^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^` 

week: 6


hit     14
miss     1
Name: bet result, dtype: int64

`^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^` 

week: 7


hit     11
miss     5
Name: bet result, dtype: int64

`^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^` 

week: 8


hit     16
miss     9
Name: bet result, dtype: int64

`^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^` 

week: 9


miss    12
hit     10
Name: bet result, dtype: int64

`^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^` 

week: 10


hit     9
miss    6
Name: bet result, dtype: int64

`^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^` 

week: 11


hit     12
miss     8
Name: bet result, dtype: int64

`^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^` 

week: 12


miss    11
hit     10
Name: bet result, dtype: int64

`^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^` 

week: 13


miss    11
hit      5
Name: bet result, dtype: int64

`^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^` 

week: 14


hit     15
miss     9
Name: bet result, dtype: int64

`^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^` 

week: 15


hit     1
miss    1
Name: bet result, dtype: int64

`^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^``^*^` 




## Conclusion

Using this strategy in 2019, we hit on 58% of our bets. It still places roughly 18 bets a week (249 bets over 14 weeks), which is a lot; in reality you probably would not want to bet on that many games. There is also a lot of information this model does not account for, such as injuries, weather, or the death of a mascot.

Combining the model's predictions with practical knowledge about a contest could help in placing profitable bets. Hopefully there is a 2020 college football season so we can test this on live games.
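
To put the 58% figure in context, a hit rate only matters relative to the price of the bet. The sketch below assumes standard -110 American odds on every spread bet (risk 110 to win 100); the notebook does not record odds, so this is an assumption, and the function names are illustrative.

```python
def break_even_rate(american_odds=-110):
    """Win probability needed to break even at negative American odds."""
    risk = abs(american_odds)
    return risk / (risk + 100)


def expected_profit_per_unit(win_rate, american_odds=-110):
    """Expected profit per unit staked at negative American odds."""
    payout = 100 / abs(american_odds)  # profit on a winning one-unit bet
    return win_rate * payout - (1 - win_rate)


refined_rate = 144 / 249  # refined-strategy hits over bets from the 2019 test season
print(f"break-even hit rate at -110: {break_even_rate():.2%}")
print(f"expected profit per unit at {refined_rate:.1%}: {expected_profit_per_unit(refined_rate):+.3f}")
```

Under that assumption the break-even hit rate is a little over 52%, so the 53% from betting every game leaves almost no margin, while the refined strategy's rate would be profitable if it held up on future seasons.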