# NHL Predictive Model

With the release of puck-tracking stats via NHL EDGE, there is an opportunity to build predictive models on these data. The approach here starts with a simple, shot-location-based comparative predictive model; models of increasing complexity will be developed as necessary.

In [141]:
# Import required libraries
import numpy as np
import pandas as pd
import requests 
from bs4 import BeautifulSoup
import warnings 
warnings.filterwarnings("ignore")
import matplotlib.pyplot as plt
import datetime as dt

#!pip install selenium
#!pip install webdriver_manager
from selenium import webdriver 
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service as ChromeService 
from webdriver_manager.chrome import ChromeDriverManager 

## Baseline Model: MoneyPuck Expected Value

MoneyPuck (https://moneypuck.com/index.html) publishes daily win probabilities for every NHL game, calculated from their historical statistics. Their model consists of two submodels (Home and Away) that calculate each team's win probability independently; a Meta model then combines them, accounting for home-ice advantage, rest time, and other factors. A full description of MoneyPuck's model can be found here: https://moneypuck.com/about.htm <br>
If we assume that MoneyPuck's model produces accurate win probabilities, a simple expected value formula can be used to identify profitable bets:

$ EV = O*P(W) - 1 $

Where 
* $EV$: Expected Value, predicted profit/loss per dollar wagered (e.g. 0.034 is a 3.4% expected profit)
* $O$: Odds (\\$ returned per \\$ wagered, stake included), A team's moneyline decimal odds 
* $P(W)$: Win Probability, MoneyPuck's stated win probability

Since decimal odds give the total payout per dollar wagered (stake included), we subtract the one-dollar cost to play from the expected payout ($O*P(W)$). <br>
The predictions of our shot-location (NHL EDGE) model will be compared against MoneyPuck's results, both from a betting standpoint and in terms of overall accuracy. To give further insights into each model's betting performance, we will also analyze the performance of some simple betting strategies that do not make use of predictive models.
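
As a worked instance of the formula, take two lines from the 2023-10-27 slate tabulated later in this notebook: the Los Angeles Kings at decimal odds 1.740741 with a stated win probability of 0.594, and the Chicago Blackhawks at 3.6 with 0.235. The break-even condition on the last line follows from setting $EV > 0$ and assumes, as above, that the stated probabilities are accurate.

$$ EV_{LAK} = 1.740741 \cdot 0.594 - 1 \approx 0.034 $$

$$ EV_{CHI} = 3.6 \cdot 0.235 - 1 = -0.154 $$

$$ EV > 0 \iff P(W) > \frac{1}{O} $$

Under these numbers only the Kings line clears its break-even probability ($1/1.740741 \approx 0.574 < 0.594$), while the Blackhawks would need $P(W) > 1/3.6 \approx 0.278$.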

In [59]:
options = webdriver.ChromeOptions()  # instantiate options 
options.add_argument("--headless=new")  # run browser in headless mode (newer Selenium 4 releases no longer honor options.headless)
# instantiate driver 
driver = webdriver.Chrome(service=ChromeService(ChromeDriverManager().install()), options=options) 
 
# load website 
url = 'https://moneypuck.com/index.html?date=2023-10-26' 
driver.get(url) # get the entire website content 

moneypuck_df = pd.DataFrame()
away_win_probs, home_win_probs = [], []
away_teams, home_teams = [], []
 
# select win probability table
win_table = driver.find_elements(By.ID, 'includedContent')[0] 
games = win_table.find_elements(By.TAG_NAME, "tr")
for game in games:
    win_probs = game.find_elements(By.TAG_NAME, "h2")
    teams = game.find_elements(By.TAG_NAME, "img")
    away_win_prob = win_probs[0].text
    home_win_prob = win_probs[1].text
    away_team = teams[0].get_attribute("alt")
    home_team = teams[1].get_attribute("alt")
    away_win_prob = float(away_win_prob.replace("%",""))/100 #Remove % symbol and convert to float (0-1)
    home_win_prob = float(home_win_prob.replace("%",""))/100
    away_teams.append(away_team)
    home_teams.append(home_team)
    away_win_probs.append(away_win_prob)
    home_win_probs.append(home_win_prob)
     
moneypuck_df["date"] = str(dt.date.today()) #Create the date column first so it leads the frame; it is filled in once rows exist (initial data was collected on 10/26/23)
moneypuck_df["away_team"] = away_teams
moneypuck_df["away_win_prob"] = away_win_probs
moneypuck_df["home_team"] = home_teams
moneypuck_df["home_win_prob"] = home_win_probs
moneypuck_df["date"] = str(dt.date.today()) #Fill in the date now that the rows exist
driver.quit() #Shut down the headless browser
moneypuck_df.head()

Unnamed: 0,date,away_team,away_win_prob,home_team,home_win_prob
0,2023-10-26,ANAHEIM DUCKS,0.27,BOSTON BRUINS,0.73
1,2023-10-26,SEATTLE KRAKEN,0.352,CAROLINA HURRICANES,0.648
2,2023-10-26,WINNIPEG JETS,0.493,DETROIT RED WINGS,0.507
3,2023-10-26,COLUMBUS BLUE JACKETS,0.447,MONTREAL CANADIENS,0.553
4,2023-10-26,COLORADO AVALANCHE,0.496,PITTSBURGH PENGUINS,0.504


Because MoneyPuck loads its content dynamically, a plain GET request cannot retrieve the win probabilities: they are not present in the static HTML. Instead, we use a headless Selenium browser to render the page and then scrape the loaded content. The cell above demonstrates this process for the initial load of the dataframe; the function below wraps the same steps for daily updates.
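
Once the rendered page is available, the text of each scraped `h2` element still has to be turned into a probability. A minimal sketch of that step follows. The helper name `parse_win_prob` is hypothetical, and hard-coded strings stand in for live Selenium elements so the snippet runs without a browser. The sum-to-one check reflects the pairs shown in the table above rather than any documented guarantee.

```python
def parse_win_prob(text):
    """Convert a percentage string such as '35.2%' to a probability in [0, 1]."""
    return float(text.strip().replace("%", "")) / 100


# Hard-coded stand-ins for the text of the two <h2> elements in one table row
away_text, home_text = "35.2%", "64.8%"
away_prob = parse_win_prob(away_text)
home_prob = parse_win_prob(home_text)

# In the scraped data the two sides of a game sum to 1 (up to rounding),
# so a large deviation suggests the row was parsed incorrectly
if abs(away_prob + home_prob - 1) > 0.005:
    raise ValueError("win probabilities do not sum to 1; check the row layout")

print(away_prob, home_prob)
```

Running a check like this per row is one way to catch layout changes on the site before they silently corrupt the dataframe.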

In [163]:
def get_todays_matchups():
    date = dt.date.today()
    url = "https://moneypuck.com/index.html?date=" + str(date)
    return url

def update_matchups(matchup_df, url=None):
    # Resolve the URL at call time; a default of get_todays_matchups() would be
    # evaluated only once, when the function is defined, and go stale on later days
    if url is None:
        url = get_todays_matchups()
    options = webdriver.ChromeOptions()  # instantiate options 
    options.add_argument("--headless=new")  # run browser in headless mode 
    # instantiate driver (the options must be passed in to take effect)
    driver = webdriver.Chrome(options=options) 

    # load website  
    driver.get(url) # get the entire website content 

    # select win probability table
    win_table = driver.find_elements(By.ID, 'includedContent')[0] 
    games = win_table.find_elements(By.TAG_NAME, "tr")
    date = str(dt.date.today())
    rows = []
    for game in games:
        win_probs = game.find_elements(By.TAG_NAME, "h2")
        teams = game.find_elements(By.TAG_NAME, "img")
        away_win_prob = win_probs[0].text
        home_win_prob = win_probs[1].text
        away_team = teams[0].get_attribute("alt")
        home_team = teams[1].get_attribute("alt")
        away_win_prob = float(away_win_prob.replace("%",""))/100 #Remove % symbol and convert to float (0-1)
        home_win_prob = float(home_win_prob.replace("%",""))/100
        rows.append({"date":date, "away_team":away_team, "away_win_prob":away_win_prob, "home_team":home_team, "home_win_prob":home_win_prob})
    
    driver.quit() # end the browser session and the driver process
    # DataFrame.append was removed in pandas 2.0; concatenate the new rows instead
    return pd.concat([matchup_df, pd.DataFrame(rows)], ignore_index=True)

In [164]:
moneypuck_df = update_matchups(moneypuck_df)
moneypuck_df.tail(10)

Unnamed: 0,date,away_team,away_win_prob,home_team,home_win_prob
19,2023-11-02,CAROLINA HURRICANES,0.517,NEW YORK RANGERS,0.483
20,2023-11-02,LOS ANGELES KINGS,0.506,OTTAWA SENATORS,0.494
21,2023-11-02,NEW YORK ISLANDERS,0.521,WASHINGTON CAPITALS,0.479
22,2023-11-02,TORONTO MAPLE LEAFS,0.458,BOSTON BRUINS,0.542
23,2023-11-02,NEW JERSEY DEVILS,0.559,MINNESOTA WILD,0.441
24,2023-11-02,DALLAS STARS,0.416,EDMONTON OILERS,0.584
25,2023-11-02,MONTREAL CANADIENS,0.468,ARIZONA COYOTES,0.532
26,2023-11-02,NASHVILLE PREDATORS,0.465,SEATTLE KRAKEN,0.535
27,2023-11-02,WINNIPEG JETS,0.447,VEGAS GOLDEN KNIGHTS,0.553
28,2023-11-02,VANCOUVER CANUCKS,0.687,SAN JOSE SHARKS,0.313


### Scrape Daily Lines

In [107]:
odds_df = pd.DataFrame(columns=["date"])
away_odds, home_odds = [], []
away_teams, home_teams = [], []
over_under = []
away_puck_line, away_puck_odds = [], []
home_puck_line, home_puck_odds = [], []

options = webdriver.ChromeOptions()  # instantiate options 
options.add_argument("--headless=new")  # run browser in headless mode (newer Selenium 4 releases no longer honor options.headless)
# instantiate driver 
driver = webdriver.Chrome(service=ChromeService(ChromeDriverManager().install()), options=options) 

driver.get('https://www.espn.com/nhl/lines') # get the entire website content

games = driver.find_elements(By.TAG_NAME, 'tr')
i = 0
for game in games:
    game_str = game.text
    if i % 3 != 0: #Skip the header of every table
        team_data = game_str.splitlines()
        odds_line = team_data[-1] #Line containing each team's odds is the last of the table
        odds_line = odds_line.split(" ")
        if len(odds_line) == 6: #The over is always listed on the away team's row
            away_teams.append(team_data[0].upper())
            away_odds.append(convert_odds(int(odds_line[2]))) #Have to account for the goalie name being split
            over_under.append(float(odds_line[3]))
            away_puck_line.append(float(odds_line[4]))
            away_puck_odds.append(convert_odds(int(odds_line[5])))
        else:
            home_teams.append(team_data[0].upper())
            home_odds.append(convert_odds(int(odds_line[2])))
            home_puck_line.append(float(odds_line[3]))
            home_puck_odds.append(convert_odds(int(odds_line[4])))
    i += 1

odds_df["away_team"] = away_teams
odds_df["away_odds"] = away_odds
odds_df["home_team"] = home_teams
odds_df["home_odds"] = home_odds
odds_df["over_under"] = over_under
odds_df["away_puck_line"] = away_puck_line
odds_df["away_puck_odds"] = away_puck_odds
odds_df["home_puck_line"] = home_puck_line #Home puck line is inverse of away puck line, column might be redundant
odds_df["home_puck_odds"] = home_puck_odds
odds_df["date"] = str(dt.date.today())
odds_df.head()

Unnamed: 0,date,away_team,away_odds,home_team,home_odds,over_under,away_puck_line,away_puck_odds,home_puck_line,home_puck_odds
0,2023-10-27,CHICAGO BLACKHAWKS,3.6,VEGAS GOLDEN KNIGHTS,1.298507,6.0,1.5,2.1,-1.5,1.769231
1,2023-10-27,BUFFALO SABRES,2.58,NEW JERSEY DEVILS,1.526316,7.0,1.5,1.666667,-1.5,2.26
2,2023-10-27,MINNESOTA WILD,2.05,WASHINGTON CAPITALS,1.8,6.5,1.5,1.4,-1.5,3.05
3,2023-10-27,SAN JOSE SHARKS,4.45,CAROLINA HURRICANES,1.21978,6.5,1.5,2.4,-1.5,1.606061
4,2023-10-27,ST. LOUIS BLUES,2.58,VANCOUVER CANUCKS,1.526316,6.0,1.5,1.645161,-1.5,2.3


In [90]:
def convert_odds(odds):
    '''Converts a single American odds value (e.g. -150 or 130) to decimal odds'''
    if odds < 0: #If team is the favorite (negative american odds)
        decimal_odds = -1/(odds/100) + 1
    else:
        decimal_odds = (odds/100) + 1
    return decimal_odds

def update_odds(odds_df):
    options = webdriver.ChromeOptions()  # instantiate options 
    options.add_argument("--headless=new")  # run browser in headless mode 
    # instantiate driver 
    driver = webdriver.Chrome(options=options)
    driver.get('https://www.espn.com/nhl/lines') # get the entire website content

    date = str(dt.date.today())
    games = driver.find_elements(By.TAG_NAME, 'tr')
    rows = []
    i = 0
    for game in games:
        game_str = game.text
        if i % 3 != 0: #Skip the header of every table
            team_data = game_str.splitlines()
            odds_line = team_data[-1] #Line containing each team's odds is the last of the table
            odds_line = odds_line.split(" ")
            if len(odds_line) == 6: #The over is always listed on the away team's row
                away_team = team_data[0].upper()
                away_odds = convert_odds(int(odds_line[2])) #Have to account for the goalie name being split
                over_under = float(odds_line[3])
                away_puck_line = float(odds_line[4])
                away_puck_odds = convert_odds(int(odds_line[5]))
            else:
                home_team = team_data[0].upper()
                home_odds = convert_odds(int(odds_line[2]))
                home_puck_line = float(odds_line[3])
                home_puck_odds = convert_odds(int(odds_line[4]))
                # The home row completes a game, so record it here; waiting for the
                # next table's header would drop the last game on the page
                rows.append({"date": date, "away_team":away_team, "away_odds":away_odds, "home_team":home_team, "home_odds":home_odds,
                             "over_under":over_under, "away_puck_line":away_puck_line, "away_puck_odds":away_puck_odds, 
                             "home_puck_line":home_puck_line, "home_puck_odds":home_puck_odds})
        i += 1
    driver.quit() # end the browser session and the driver process
    # DataFrame.append was removed in pandas 2.0; concatenate the new rows instead
    return pd.concat([odds_df, pd.DataFrame(rows)], ignore_index=True)
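
The conversion rule in `convert_odds` can be sanity-checked against values that appear in the scraped table. The sketch below uses a hypothetical, self-contained helper `american_to_decimal` that restates the same rule. The American prices are an assumption: they are back-calculated from the decimal odds shown in the table, not taken from ESPN directly.

```python
def american_to_decimal(odds):
    """Convert one American moneyline price to decimal odds."""
    if odds < 0:
        # Favorite: a stake of |odds| wins 100, plus the stake is returned
        return 100 / -odds + 1
    # Underdog: a stake of 100 wins `odds`, plus the stake is returned
    return odds / 100 + 1


# American price -> decimal odds as displayed in the table (6 decimal places)
examples = {-335: 1.298507, 260: 3.6, -190: 1.526316, 158: 2.58}

for american, expected in examples.items():
    decimal = american_to_decimal(american)
    assert abs(decimal - expected) < 1e-6, (american, decimal)
    print(f"{american:>5} -> {decimal:.6f}")
```

Note that the function expects American odds only. Passing it a value that is already decimal (such as 3.6) takes the underdog branch and returns a meaningless number, so the conversion should be applied exactly once.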

In [157]:
odds_df = update_odds(odds_df)
odds_df.tail(10)

PermissionError: [WinError 5] Access is denied: 'C:\\Users\\geniu\\.wdm\\drivers\\chromedriver\\win64\\118.0.5993.70\\chromedriver-win32\\chromedriver.exe' -> 'C:\\Users\\geniu\\.wdm\\drivers\\chromedriver\\win64\\118.0.5993.70\\chromedriver.exe'

In [103]:
temp_df = moneypuck_df.merge(odds_df, on=["date","away_team","home_team"])
# odds_df already stores decimal odds (converted during scraping), so convert_odds
# is not applied again here; a second pass would corrupt the values
temp_df

Unnamed: 0,date,away_team,away_win_prob,home_team,home_win_prob,away_odds,home_odds,over_under,away_puck_line,away_puck_odds,home_puck_line,home_puck_odds
0,2023-10-27,CHICAGO BLACKHAWKS,0.235,VEGAS GOLDEN KNIGHTS,0.765,3.6,1.298507,6.0,1.5,2.05,-1.5,1.8
1,2023-10-27,SAN JOSE SHARKS,0.212,CAROLINA HURRICANES,0.788,4.35,1.227273,6.0,1.5,2.35,-1.5,1.625
2,2023-10-27,BUFFALO SABRES,0.359,NEW JERSEY DEVILS,0.641,2.62,1.512821,7.0,1.5,1.645161,-1.5,2.3
3,2023-10-27,MINNESOTA WILD,0.468,WASHINGTON CAPITALS,0.532,2.05,1.8,6.5,1.5,1.416667,-1.5,2.96
4,2023-10-27,LOS ANGELES KINGS,0.594,ARIZONA COYOTES,0.406,1.740741,2.15,6.5,-1.5,2.8,1.5,1.454545
5,2023-10-27,ST. LOUIS BLUES,0.387,VANCOUVER CANUCKS,0.613,2.58,1.526316,6.0,1.5,1.645161,-1.5,2.3


In [105]:
master_df = historical_df.copy() # copy so later changes to historical_df do not alter master_df

In [134]:
historical_df = pd.DataFrame(columns=["date"])
away_teams, home_teams = [], []
away_scores, home_scores = [], []

hockey_scores = requests.get("https://www.hockey-reference.com/boxscores/")
soup = BeautifulSoup(hockey_scores.text, "html.parser") # name the parser explicitly for consistent results across machines
games = soup.find_all(attrs={"class":"teams"})
for game in games: 
    teams = game.find_all("a")[::2] #Ignore the "Final", since it's not a team
    scores = game.find_all(attrs={"class":"right"})[:3:2]
    i = 0
    for team, score in zip(teams,scores):
        if i % 2 == 0:
            away_teams.append(team.get_text().upper())
            away_scores.append(score.get_text())
        else:
            home_teams.append(team.get_text().upper())
            home_scores.append(score.get_text())
        i += 1
            
historical_df["away_team"] = away_teams
historical_df["away_score"] = away_scores
historical_df["home_team"] = home_teams
historical_df["home_score"] = home_scores
historical_df["date"] = str(dt.date.today() - dt.timedelta(days=1))
historical_df.head()

Unnamed: 0,date,away_team,away_score,home_team,home_score
0,2023-10-26,ANAHEIM DUCKS,4,BOSTON BRUINS,3
1,2023-10-26,SEATTLE KRAKEN,2,CAROLINA HURRICANES,3
2,2023-10-26,ST. LOUIS BLUES,3,CALGARY FLAMES,0
3,2023-10-26,TORONTO MAPLE LEAFS,4,DALLAS STARS,1
4,2023-10-26,WINNIPEG JETS,4,DETROIT RED WINGS,1


In [137]:
temp_df = moneypuck_df.merge(odds_df, on=["date","away_team","home_team"])
temp_df.drop(columns=["over_under","away_puck_line","away_puck_odds","home_puck_line","home_puck_odds"], inplace=True)
temp_df["away_ev"] = temp_df["away_win_prob"] * temp_df["away_odds"] - 1
temp_df["home_ev"] = temp_df["home_win_prob"] * temp_df["home_odds"] - 1
temp_df.head()

Unnamed: 0,date,away_team,away_win_prob,home_team,home_win_prob,away_odds,home_odds,away_ev,home_ev
0,2023-10-27,CHICAGO BLACKHAWKS,0.235,VEGAS GOLDEN KNIGHTS,0.765,3.6,1.298507,-0.154,-0.006642
1,2023-10-27,SAN JOSE SHARKS,0.212,CAROLINA HURRICANES,0.788,4.45,1.21978,-0.0566,-0.038813
2,2023-10-27,BUFFALO SABRES,0.359,NEW JERSEY DEVILS,0.641,2.58,1.526316,-0.07378,-0.021632
3,2023-10-27,MINNESOTA WILD,0.468,WASHINGTON CAPITALS,0.532,2.05,1.8,-0.0406,-0.0424
4,2023-10-27,LOS ANGELES KINGS,0.594,ARIZONA COYOTES,0.406,1.740741,2.15,0.034,-0.1271
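
With both sides' expected values in hand, flagging candidate bets reduces to a filter on $EV > 0$. The sketch below illustrates this on the five rows shown above, hard-coded as tuples so it runs without the scraped dataframes. The names `expected_value` and `positive_ev_bets` and the `threshold` parameter are hypothetical, and the result is only as good as the assumption that MoneyPuck's probabilities are accurate.

```python
# (away_team, away_win_prob, away_odds, home_team, home_win_prob, home_odds)
games = [
    ("CHICAGO BLACKHAWKS", 0.235, 3.6, "VEGAS GOLDEN KNIGHTS", 0.765, 1.298507),
    ("SAN JOSE SHARKS", 0.212, 4.45, "CAROLINA HURRICANES", 0.788, 1.21978),
    ("BUFFALO SABRES", 0.359, 2.58, "NEW JERSEY DEVILS", 0.641, 1.526316),
    ("MINNESOTA WILD", 0.468, 2.05, "WASHINGTON CAPITALS", 0.532, 1.8),
    ("LOS ANGELES KINGS", 0.594, 1.740741, "ARIZONA COYOTES", 0.406, 2.15),
]


def expected_value(odds, win_prob):
    """EV per dollar wagered: expected payout minus the one-dollar stake."""
    return odds * win_prob - 1


def positive_ev_bets(games, threshold=0.0):
    """Return (team, EV) for every side whose EV exceeds `threshold`."""
    bets = []
    for away, p_away, o_away, home, p_home, o_home in games:
        for team, prob, odds in ((away, p_away, o_away), (home, p_home, o_home)):
            ev = expected_value(odds, prob)
            if ev > threshold:
                bets.append((team, round(ev, 4)))
    return bets


print(positive_ev_bets(games))  # [('LOS ANGELES KINGS', 0.034)]
```

Raising `threshold` above zero is one way to demand a margin of safety before betting, since small positive values can easily come from noise in the probability estimates.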
