# NHL Predictive Model

With the release of puck-tracking statistics through NHL EDGE, it becomes possible to build predictive models on these data. This notebook starts with a simple comparative model based on shot location. Additional models of increasing complexity will be developed as necessary.

In [14]:
# Import required libraries
import numpy as np
import pandas as pd
import requests 
from bs4 import BeautifulSoup
import warnings 
warnings.filterwarnings("ignore")
import matplotlib.pyplot as plt
import datetime as dt

#!pip install selenium
#!pip install webdriver_manager
#!pip install jupyter_scheduler
from selenium import webdriver 
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service as ChromeService 
from webdriver_manager.chrome import ChromeDriverManager 



## Baseline Model: MoneyPuck Expected Value

MoneyPuck (https://moneypuck.com/index.html) publishes daily win probabilities for every NHL game, calculated from their historical statistics. Their model consists of two submodels (Home and Away) that estimate each team's win probability independently. A Meta model then combines the two, accounting for home-ice advantage, rest time, and other factors. A full description of MoneyPuck's model can be found here: https://moneypuck.com/about.htm <br>
If we assume that MoneyPuck's model produces accurate win probabilities, a simple expected value formula can be used to identify profitable bets:

$ EV = O*P(W) - 1 $

Where 
* $EV$: Expected Value, the predicted profit or loss per \\$1 wagered on a given bet (multiply by 100 for a percentage)
* $O$: Odds (\\$ returned per \\$ wagered, stake included), a team's moneyline decimal odds
* $P(W)$: Win Probability, MoneyPuck's stated win probability

Since decimal odds give the total payout per dollar wagered (stake included), we subtract the \\$1 cost to play from the expected payout ($O*P(W)$) to obtain the expected profit. <br>
The predictions of our shot-location (NHL EDGE) model will be compared against MoneyPuck's results, both from a betting standpoint and in terms of overall accuracy. To give further insights into each model's betting performance, we will also analyze the performance of some simple betting strategies that do not make use of predictive models.
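As a minimal sketch of the expected value formula above, the snippet below computes $EV$ for a single bet along with the break-even win probability $1/O$, the point at which $EV = 0$. The function names are illustrative and not part of the notebook's pipeline. The inputs (decimal odds of 2.35 and a stated win probability of 0.431) give an EV of roughly 1.3 cents per dollar wagered.

```python
def expected_value(decimal_odds: float, win_prob: float) -> float:
    """Expected profit per $1 staked: EV = O * P(W) - 1."""
    if decimal_odds <= 1.0:
        raise ValueError("decimal odds must be greater than 1")
    if not 0.0 <= win_prob <= 1.0:
        raise ValueError("win probability must lie in [0, 1]")
    return decimal_odds * win_prob - 1.0


def break_even_probability(decimal_odds: float) -> float:
    """Win probability at which EV is exactly zero (P = 1 / O)."""
    return 1.0 / decimal_odds


# Illustrative inputs: decimal odds of 2.35 and a stated win probability of 43.1%.
ev = expected_value(2.35, 0.431)
print(f"EV per $1 staked: {ev:+.4f}")
print(f"Break-even win probability: {break_even_probability(2.35):.3f}")
```

A bet is only attractive under this criterion when the model's win probability exceeds the break-even probability implied by the odds.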

In [5]:
'''This is the initial code used to create moneypuck_df.
    It remains in this notebook for posterity, along with the code for the other major dataframes.


options = webdriver.ChromeOptions()  # instantiate options 
options.add_argument("--headless=new")  # run browser in headless mode; the options.headless setter is deprecated in Selenium 4.8+
# instantiate driver 
driver = webdriver.Chrome(options=options) 
 
# load website 
url = 'https://moneypuck.com/index.html?date=2023-11-02' 
driver.get(url) # get the entire website content 

moneypuck_df = pd.DataFrame()
away_win_probs, home_win_probs = [], []
away_teams, home_teams = [], []
 
# select win probability table
win_table = driver.find_elements(By.ID, 'includedContent')[0] 
games = win_table.find_elements(By.TAG_NAME, "tr")
for game in games:
    win_probs = game.find_elements(By.TAG_NAME, "h2")
    teams = game.find_elements(By.TAG_NAME, "img")
    away_win_prob = win_probs[0].text
    home_win_prob = win_probs[1].text
    away_team = teams[0].get_attribute("alt")
    home_team = teams[1].get_attribute("alt")
    away_win_prob = float(away_win_prob.replace("%",""))/100 #Remove % symbol and convert to float (0-1)
    home_win_prob = float(home_win_prob.replace("%",""))/100
    away_teams.append(away_team)
    home_teams.append(home_team)
    away_win_probs.append(away_win_prob)
    home_win_probs.append(home_win_prob)
     
# Build the frame in one step; the scalar date is broadcast to every row
moneypuck_df = pd.DataFrame({
    "date": str(dt.date.today()), #Initial data was collected on 10/26/23
    "away_team": away_teams,
    "away_win_prob": away_win_probs,
    "home_team": home_teams,
    "home_win_prob": home_win_probs,
})
moneypuck_df.head()'''

| | date | away_team | away_win_prob | home_team | home_win_prob |
|---|---|---|---|---|---|
| 0 | 2023-11-02 | TAMPA BAY LIGHTNING | 0.569 | COLUMBUS BLUE JACKETS | 0.431 |
| 1 | 2023-11-02 | FLORIDA PANTHERS | 0.533 | DETROIT RED WINGS | 0.467 |
| 2 | 2023-11-02 | CAROLINA HURRICANES | 0.517 | NEW YORK RANGERS | 0.483 |
| 3 | 2023-11-02 | LOS ANGELES KINGS | 0.506 | OTTAWA SENATORS | 0.494 |
| 4 | 2023-11-02 | NEW YORK ISLANDERS | 0.521 | WASHINGTON CAPITALS | 0.479 |


Since MoneyPuck is a dynamically loaded website, we cannot simply use a GET request to extract the win probabilities, as they are not present in the static HTML. Instead, we use a headless Selenium browser to render the page, JavaScript included, and then scrape the loaded content. The process of updating each of these dataframes has been moved into a separate Jupyter Notebook (Update_NHL_Data) scheduled to run at 11 AM every day. This notebook simply loads the CSV files generated by Update_NHL_Data and proceeds from there.
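The Update_NHL_Data notebook is not reproduced here, so the following is only a sketch of how a daily run might append freshly scraped rows to a history CSV without duplicating games when the job is re-run. The helper name `append_daily_rows` and the choice of `date`, `away_team`, and `home_team` as the identifying key are assumptions, not the notebook's actual implementation.

```python
import os

import pandas as pd

# Assumption: these three columns uniquely identify a game.
KEY_COLUMNS = ["date", "away_team", "home_team"]


def append_daily_rows(new_rows: pd.DataFrame, csv_path: str) -> pd.DataFrame:
    """Append today's scrape to a history CSV, keeping the latest row per game."""
    if os.path.exists(csv_path):
        history = pd.read_csv(csv_path, index_col=0)
        combined = pd.concat([history, new_rows], ignore_index=True)
    else:
        combined = new_rows.reset_index(drop=True)
    # Re-running on the same day replaces that day's rows instead of duplicating them.
    combined = combined.drop_duplicates(subset=KEY_COLUMNS, keep="last")
    combined = combined.reset_index(drop=True)
    combined.to_csv(csv_path)
    return combined
```

Under these assumptions, a scheduled run would call something like `append_daily_rows(moneypuck_df, "Historical_Moneypuck_Predictions.csv")` after scraping.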

In [12]:
'''Initial code included for posterity, see above


odds_df = pd.DataFrame(columns=["date"])
away_odds, home_odds = [], []
away_teams, home_teams = [], []
over_under = []
away_puck_line, away_puck_odds = [], []
home_puck_line, home_puck_odds = [], []

options = webdriver.ChromeOptions()  # instantiate options 
options.add_argument("--headless=new")  # run browser in headless mode; the options.headless setter is deprecated in Selenium 4.8+
# instantiate driver 
driver = webdriver.Chrome(options=options) 

driver.get('https://www.espn.com/nhl/lines') # get the entire website content

games = driver.find_elements(By.TAG_NAME, 'tr')
for i, game in enumerate(games):
    if i % 3 == 0: #Skip the header of every table
        continue
    team_data = game.text.splitlines()
    odds_line = team_data[-1].split(" ") #Line containing each team's odds is the last of the table
    if len(odds_line) == 6: #The over is always listed on the away team's row
        away_teams.append(team_data[0].upper())
        away_odds.append(convert_odds(int(odds_line[2]))) #Have to account for the goalie name being split
        over_under.append(float(odds_line[3]))
        away_puck_line.append(float(odds_line[4]))
        away_puck_odds.append(convert_odds(int(odds_line[5])))
    else:
        home_teams.append(team_data[0].upper())
        home_odds.append(convert_odds(int(odds_line[2])))
        home_puck_line.append(float(odds_line[3]))
        home_puck_odds.append(convert_odds(int(odds_line[4])))

odds_df["away_team"] = away_teams
odds_df["away_odds"] = away_odds
odds_df["home_team"] = home_teams
odds_df["home_odds"] = home_odds
odds_df["over_under"] = over_under
odds_df["away_puck_line"] = away_puck_line
odds_df["away_puck_odds"] = away_puck_odds
odds_df["home_puck_line"] = home_puck_line #Home puck line is inverse of away puck line, column might be redundant
odds_df["home_puck_odds"] = home_puck_odds
odds_df["date"] = str(dt.date.today())
odds_df.head()'''

| | date | away_team | away_odds | home_team | home_odds | over_under | away_puck_line | away_puck_odds | home_puck_line | home_puck_odds |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2023-11-02 | LOS ANGELES KINGS | 1.909091 | OTTAWA SENATORS | 1.909091 | 6.5 | 1.5 | 1.363636 | -1.5 | 3.0 |
| 1 | 2023-11-02 | FLORIDA PANTHERS | 1.833333 | DETROIT RED WINGS | 2.0 | 6.5 | -1.5 | 2.85 | 1.5 | 1.4 |
| 2 | 2023-11-02 | CAROLINA HURRICANES | 1.952381 | NEW YORK RANGERS | 1.869565 | 5.5 | 1.5 | 1.333333 | -1.5 | 3.1 |
| 3 | 2023-11-02 | NEW YORK ISLANDERS | 1.833333 | WASHINGTON CAPITALS | 2.0 | 6.5 | -1.5 | 2.9 | 1.5 | 1.377358 |
| 4 | 2023-11-02 | TAMPA BAY LIGHTNING | 1.606061 | COLUMBUS BLUE JACKETS | 2.35 | 6.5 | -1.5 | 2.5 | 1.5 | 1.555556 |
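The odds scraping code calls `convert_odds`, which is not defined in this notebook. The version below is a sketch of what it presumably does, assuming the scraped values are American moneyline odds (for example -110 or +135) and the target is decimal odds. This standard conversion is consistent with the stored values: -110 maps to about 1.909091 and +135 maps to 2.35.

```python
def convert_odds(american_odds: int) -> float:
    """Convert American moneyline odds to decimal odds (total return per $1 staked)."""
    if american_odds > 0:
        # Underdog: +135 means a $100 stake wins $135.
        return 1 + american_odds / 100
    if american_odds < 0:
        # Favorite: -165 means a $165 stake wins $100.
        return 1 + 100 / abs(american_odds)
    raise ValueError("American odds cannot be zero")


for line in (-165, -110, 135):
    print(line, round(convert_odds(line), 6))
```

Because the result includes the returned stake, it can be used directly as $O$ in the expected value formula.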


In [134]:
'''Included for posterity, see above


historical_df = pd.DataFrame(columns=["date"])
away_teams, home_teams = [], []
away_scores, home_scores = [], []

hockey_scores = requests.get("https://www.hockey-reference.com/boxscores/")
soup = BeautifulSoup(hockey_scores.text, "html.parser") #Name the parser explicitly so results do not depend on which parsers are installed
games = soup.find_all(attrs={"class":"teams"})
for game in games: 
    teams = game.find_all("a")[::2] #Ignore the "Final", since it's not a team
    scores = game.find_all(attrs={"class":"right"})[:3:2]
    i = 0
    for team, score in zip(teams,scores):
        if i % 2 == 0:
            away_teams.append(team.get_text().upper())
            away_scores.append(score.get_text())
        else:
            home_teams.append(team.get_text().upper())
            home_scores.append(score.get_text())
        i += 1
            
historical_df["away_team"] = away_teams
historical_df["away_score"] = away_scores
historical_df["home_team"] = home_teams
historical_df["home_score"] = home_scores
historical_df["date"] = str(dt.date.today() - dt.timedelta(days=1))
historical_df.head()'''

| | date | away_team | away_score | home_team | home_score |
|---|---|---|---|---|---|
| 0 | 2023-10-26 | ANAHEIM DUCKS | 4 | BOSTON BRUINS | 3 |
| 1 | 2023-10-26 | SEATTLE KRAKEN | 2 | CAROLINA HURRICANES | 3 |
| 2 | 2023-10-26 | ST. LOUIS BLUES | 3 | CALGARY FLAMES | 0 |
| 3 | 2023-10-26 | TORONTO MAPLE LEAFS | 4 | DALLAS STARS | 1 |
| 4 | 2023-10-26 | WINNIPEG JETS | 4 | DETROIT RED WINGS | 1 |


### Load Data

In [15]:
moneypuck_df = pd.read_csv("Historical_Moneypuck_Predictions.csv",index_col=0)
odds_df = pd.read_csv("Historical_Odds.csv", index_col=0)
games_df = pd.read_csv("Game_Outcomes.csv", index_col=0)

### Data Manipulation

In [19]:
temp_df = moneypuck_df.merge(odds_df, on=["date","away_team","home_team"])
#Currently not using puck line, since moneypuck doesn't predict score, just win probability
temp_df.drop(columns=["over_under","away_puck_line","away_puck_odds","home_puck_line","home_puck_odds"], inplace=True)
temp_df["away_ev"] = temp_df["away_win_prob"] * temp_df["away_odds"] - 1
temp_df["home_ev"] = temp_df["home_win_prob"] * temp_df["home_odds"] - 1
temp_df

| | date | away_team | away_win_prob | home_team | home_win_prob | away_odds | home_odds | away_ev | home_ev |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2023-11-02 | TAMPA BAY LIGHTNING | 0.569 | COLUMBUS BLUE JACKETS | 0.431 | 1.606061 | 2.35 | -0.086152 | 0.01285 |
| 1 | 2023-11-02 | FLORIDA PANTHERS | 0.533 | DETROIT RED WINGS | 0.467 | 1.833333 | 2.0 | -0.022833 | -0.066 |
| 2 | 2023-11-02 | CAROLINA HURRICANES | 0.517 | NEW YORK RANGERS | 0.483 | 1.952381 | 1.869565 | 0.009381 | -0.097 |
| 3 | 2023-11-02 | LOS ANGELES KINGS | 0.506 | OTTAWA SENATORS | 0.494 | 1.909091 | 1.909091 | -0.034 | -0.056909 |
| 4 | 2023-11-02 | NEW YORK ISLANDERS | 0.521 | WASHINGTON CAPITALS | 0.479 | 1.833333 | 2.0 | -0.044833 | -0.042 |
| 5 | 2023-11-02 | TORONTO MAPLE LEAFS | 0.458 | BOSTON BRUINS | 0.542 | 1.869565 | 1.952381 | -0.143739 | 0.05819 |
| 6 | 2023-11-02 | NEW JERSEY DEVILS | 0.559 | MINNESOTA WILD | 0.441 | 1.833333 | 2.0 | 0.024833 | -0.118 |
| 7 | 2023-11-02 | DALLAS STARS | 0.416 | EDMONTON OILERS | 0.584 | 2.3 | 1.666667 | -0.0432 | -0.026667 |
| 8 | 2023-11-02 | MONTREAL CANADIENS | 0.468 | ARIZONA COYOTES | 0.532 | 2.1 | 1.769231 | -0.0172 | -0.058769 |
| 9 | 2023-11-02 | NASHVILLE PREDATORS | 0.465 | SEATTLE KRAKEN | 0.535 | 2.0 | 1.833333 | -0.07 | -0.019167 |
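The merge above uses pandas' default inner join, which silently drops any game whose `date`, `away_team`, or `home_team` differs between the two sources (for instance, if the two sites spell a team name differently). The helper below is a hypothetical sanity check, not part of the original pipeline, that lists games appearing in only one frame by using an outer join with `indicator=True`.

```python
import pandas as pd

KEYS = ["date", "away_team", "home_team"]


def unmatched_games(predictions: pd.DataFrame, odds: pd.DataFrame) -> pd.DataFrame:
    """Return games present in only one of the two frames, with the side noted."""
    merged = predictions[KEYS].merge(odds[KEYS], on=KEYS, how="outer", indicator=True)
    # "_merge" is "left_only", "right_only", or "both" for each key combination.
    return merged[merged["_merge"] != "both"].reset_index(drop=True)
```

Calling `unmatched_games(moneypuck_df, odds_df)` before merging should return an empty frame when every game lines up. Any rows it does return point to keys that need normalizing.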


In [None]:
#TODO: build profit_df and implement the favorite, underdog, and MoneyPuck EV strategies
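One possible shape for the planned profit comparison is sketched below. It is not the author's implementation: the function name `strategy_profits`, the flat \\$1 stake, the rule that ties in odds go to the home team, and the rule that the EV strategy bets only the higher-EV side when that EV is positive are all assumptions. It expects the odds and EV columns computed above plus `away_score` and `home_score` from the game outcomes.

```python
import numpy as np
import pandas as pd


def strategy_profits(df: pd.DataFrame, stake: float = 1.0) -> pd.DataFrame:
    """Per-game profit for three flat-stake strategies.

    Expects columns away_odds, home_odds, away_ev, home_ev, away_score, home_score.
    """
    home_won = df["home_score"] > df["away_score"]
    # A winning bet nets stake * (decimal odds - 1); a losing bet costs the stake.
    home_profit = np.where(home_won, stake * (df["home_odds"] - 1), -stake)
    away_profit = np.where(home_won, -stake, stake * (df["away_odds"] - 1))

    out = pd.DataFrame(index=df.index)
    # Favorite = side with the lower decimal odds; ties go to the home team (assumption).
    home_is_favorite = df["home_odds"] <= df["away_odds"]
    out["favorite"] = np.where(home_is_favorite, home_profit, away_profit)
    out["underdog"] = np.where(home_is_favorite, away_profit, home_profit)
    # EV strategy: back the higher-EV side, but only when that EV is positive.
    best_is_home = df["home_ev"] >= df["away_ev"]
    best_ev = np.where(best_is_home, df["home_ev"], df["away_ev"])
    best_profit = np.where(best_is_home, home_profit, away_profit)
    out["moneypuck_ev"] = np.where(best_ev > 0, best_profit, 0.0)
    return out
```

With the frames loaded earlier, this could be applied as `profit_df = strategy_profits(temp_df.merge(games_df, on=["date", "away_team", "home_team"]))`, after which `profit_df.sum()` gives each strategy's cumulative profit.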