Predicting the Outcome of NHL Games

Preliminary Setup:
Let's begin by loading the data and fixing any issues, such as renaming columns.

In [24]:
import os
import numpy as np  
import pandas as pd  

games_filename = "nhl-202223-asplayed.csv"

In [26]:
results = pd.read_csv(games_filename)
results.head()

Unnamed: 0,Date,Start Time (Sask),Start Time (ET),Visitor,Score,Home,Score.1,Status
0,2022-10-07,12:00 PM,2:00 PM,San Jose Sharks,1,Nashville Predators,4,Regulation
1,2022-10-08,12:00 PM,2:00 PM,Nashville Predators,3,San Jose Sharks,2,Regulation
2,2022-10-11,5:30 PM,7:30 PM,Tampa Bay Lightning,1,New York Rangers,3,Regulation
3,2022-10-11,8:00 PM,10:00 PM,Vegas Golden Knights,4,Los Angeles Kings,3,Regulation
4,2022-10-12,5:00 PM,7:00 PM,Boston Bruins,5,Washington Capitals,2,Regulation


In [16]:
#Check if "date" column represents actual dates
results.dtypes.head(1)

Date    object
dtype: object

Pandas did not interpret the 'Date' column as actual dates, so we need to pass an extra parameter to our "read_csv" call to fix this issue.

In [20]:
results = pd.read_csv(games_filename, parse_dates=["Date"])
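As a quick check of what "parse_dates" does, here is a minimal sketch that reads two rows copied from the table above out of an in-memory buffer instead of the real file. The names `sample_csv` and `parsed` are illustrative only and are not used elsewhere in this notebook.

```python
import io

import pandas as pd

# Two rows copied from the games table; an in-memory stand-in for the CSV file.
sample_csv = io.StringIO(
    "Date,Visitor,Score,Home,Score.1\n"
    "2022-10-07,San Jose Sharks,1,Nashville Predators,4\n"
    "2022-10-08,Nashville Predators,3,San Jose Sharks,2\n"
)

# parse_dates asks pandas to convert the listed columns to datetimes.
parsed = pd.read_csv(sample_csv, parse_dates=["Date"])

print(parsed["Date"].dtype)          # a datetime64 dtype rather than plain text
print(parsed["Date"].dt.day_name())  # date accessors such as .dt now work
```

Running the same dtype check on the real `results` frame after this cell should confirm that the fix took effect.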


In [28]:

# Delete the columns we do not need in our analysis
results = results.drop(results.columns[[1,2,7]], axis=1)
results

Unnamed: 0,Date,Visitor,Score,Home,Score.1
0,2022-10-07,San Jose Sharks,1,Nashville Predators,4
1,2022-10-08,Nashville Predators,3,San Jose Sharks,2
2,2022-10-11,Tampa Bay Lightning,1,New York Rangers,3
3,2022-10-11,Vegas Golden Knights,4,Los Angeles Kings,3
4,2022-10-12,Boston Bruins,5,Washington Capitals,2
...,...,...,...,...,...
1307,2023-04-13,Los Angeles Kings,5,Anaheim Ducks,3
1308,2023-04-13,Vancouver Canucks,5,Arizona Coyotes,4
1309,2023-04-13,Vegas Golden Knights,3,Seattle Kraken,1
1310,2023-04-14,Buffalo Sabres,5,Columbus Blue Jackets,2


In [29]:
# Fix the names of the columns
results.columns = ["Date", "Visitor Team", "Visitor Goals", "Home Team", "Home Goals"]
results

Unnamed: 0,Date,Visitor Team,Visitor Goals,Home Team,Home Goals
0,2022-10-07,San Jose Sharks,1,Nashville Predators,4
1,2022-10-08,Nashville Predators,3,San Jose Sharks,2
2,2022-10-11,Tampa Bay Lightning,1,New York Rangers,3
3,2022-10-11,Vegas Golden Knights,4,Los Angeles Kings,3
4,2022-10-12,Boston Bruins,5,Washington Capitals,2
...,...,...,...,...,...
1307,2023-04-13,Los Angeles Kings,5,Anaheim Ducks,3
1308,2023-04-13,Vancouver Canucks,5,Arizona Coyotes,4
1309,2023-04-13,Vegas Golden Knights,3,Seattle Kraken,1
1310,2023-04-14,Buffalo Sabres,5,Columbus Blue Jackets,2


Create the class to predict
We are trying to predict whether a certain team won their match. In our model, we will predict whether the Home Team won their match. Let's create the appropriate attribute:

In [32]:
# Create a new attribute called "HomeWin" to show which team won that game
results["HomeWin"] = results["Home Goals"] > results["Visitor Goals"]

# This will be the class we are trying to predict for our model, whether the home team won or not
y_true = results["HomeWin"].values
results

Unnamed: 0,Date,Visitor Team,Visitor Goals,Home Team,Home Goals,HomeWin
0,2022-10-07,San Jose Sharks,1,Nashville Predators,4,True
1,2022-10-08,Nashville Predators,3,San Jose Sharks,2,False
2,2022-10-11,Tampa Bay Lightning,1,New York Rangers,3,True
3,2022-10-11,Vegas Golden Knights,4,Los Angeles Kings,3,False
4,2022-10-12,Boston Bruins,5,Washington Capitals,2,False
...,...,...,...,...,...,...
1307,2023-04-13,Los Angeles Kings,5,Anaheim Ducks,3,False
1308,2023-04-13,Vancouver Canucks,5,Arizona Coyotes,4,False
1309,2023-04-13,Vegas Golden Knights,3,Seattle Kraken,1,False
1310,2023-04-14,Buffalo Sabres,5,Columbus Blue Jackets,2,False


Create the Performance Indicator
We will use the F1 score as our main indicator of overall model performance. The F1 score is the harmonic mean of precision and recall, so it is high only when both are high:
F1 = 2 x ((precision x recall) / (precision + recall))


In [33]:
# We will use the classification report and the F1 score metric to show the performance of our model
from sklearn.metrics import f1_score, make_scorer, classification_report

# Let's designate a scorer object with the F1 score
scorer = make_scorer(f1_score, pos_label=None, average='weighted')

Creating a baseline to beat
In general, home teams win more often than visitors, a pattern seen in many other sports as well.

Our model needs to beat a baseline drawn straight from the dataset, with no added features: always predicting a home win. The baseline performance is as follows:

In [37]:
# Calculate the number of times home teams won in the dataset
n_games = results["HomeWin"].count()
n_homewins = results["HomeWin"].sum()
home_win_pct = n_homewins / n_games

print("Home win percentage:", home_win_pct)

Home win percentage: 0.5236280487804879


In our dataset, the home team wins 52.36% of the time.

In [38]:
from sklearn.metrics import f1_score

# Baseline: predict a home win for every game
y_pred = [1] * len(y_true)
print("F1 Score: {:.4f}".format(f1_score(y_true, y_pred, pos_label=None, average="weighted")))

F1 Score: 0.3599
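To see where this number comes from, the weighted F1 of an always-home-win prediction can be worked out by hand from the home win rate alone. The sketch below is plain arithmetic rather than a call to scikit-learn; the variable names are illustrative, and it assumes the never-predicted class is given an F1 of 0, which is how scikit-learn scores a class with no predicted samples by default.

```python
# Home-win rate printed earlier; the baseline predicts "home win" for every game.
p = 0.5236280487804879

# Class "home win": every game is predicted positive, so recall is 1
# and precision equals the share of games the home team actually won.
precision_win, recall_win = p, 1.0
f1_win = 2 * precision_win * recall_win / (precision_win + recall_win)

# Class "home loss": never predicted, so its F1 counts as 0.
f1_loss = 0.0

# average="weighted" weights each class's F1 by its share of the true labels.
weighted_f1 = p * f1_win + (1 - p) * f1_loss
print(f"{weighted_f1:.4f}")  # 0.3599
```

The home-win class on its own scores about 0.69, but the weighted average is pulled down because the other class contributes nothing.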




The baseline F1 score is 0.3599, so any model we build needs to score higher than that.
Let's create more attributes to produce a better model.

New feature: Momentum - Whether the home or visitor team won their last game
Let us create features that indicate whether each team won its last game, since momentum is an important factor to account for in sports.

Let's see if this feature can improve our model.


In [39]:
# Create the new features with a default value of "False"
results["HomeLastWin"] = False
results["VisitorLastWin"] = False

In [48]:
from collections import defaultdict

# Most recent result per team; teams with no earlier game default to False
won_last = defaultdict(bool)

for index, row in results.iterrows():
    home_team = row["Home Team"]
    visitor_team = row["Visitor Team"]
    # Record each team's previous result before updating it with this game
    results.at[index, "HomeLastWin"] = won_last[home_team]
    results.at[index, "VisitorLastWin"] = won_last[visitor_team]
    # Set which team won
    won_last[home_team] = bool(row["HomeWin"])
    won_last[visitor_team] = not row["HomeWin"]

In [52]:
results.loc[20:25]

Unnamed: 0,Date,Visitor Team,Visitor Goals,Home Team,Home Goals,HomeWin,HomeLastWin,VisitorLastWin
20,2022-10-14,Tampa Bay Lightning,5,Columbus Blue Jackets,2,False,False,False
21,2022-10-14,Montreal Canadiens,0,Detroit Red Wings,3,True,False,True
22,2022-10-14,New York Rangers,1,Winnipeg Jets,4,True,False,True
23,2022-10-14,Carolina Hurricanes,2,San Jose Sharks,1,False,False,True
24,2022-10-15,Florida Panthers,4,Buffalo Sabres,3,False,True,True
25,2022-10-15,Vancouver Canucks,2,Philadelphia Flyers,3,True,True,False
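The order of operations in the loop matters: each game's features are read before that game's result is stored, so a feature never sees the outcome it is meant to help predict. Here is a minimal trace on a made-up three-game schedule; the teams "A", "B" and "C" and the `features` list are hypothetical and exist only for this illustration.

```python
from collections import defaultdict

# Hypothetical mini-schedule: (home team, visitor team, home team won?)
games = [
    ("A", "B", True),
    ("B", "C", True),
    ("C", "A", False),
]

won_last = defaultdict(bool)  # teams with no earlier game default to False
features = []

for home, visitor, home_win in games:
    # Read the features first: they may only reflect games played earlier.
    features.append((won_last[home], won_last[visitor]))
    # Only then record this game's result for each team's next game.
    won_last[home] = home_win
    won_last[visitor] = not home_win

print(features)  # [(False, False), (False, False), (False, True)]
```

In the third game, team A arrives as the visitor with a win behind it, which is why only that row carries a True.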


Classification with Decision Trees
Let's use the Decision Tree Classifier as our model to predict which team won the game.

In [54]:
# Note: Cross validation score uses multiple folds
from sklearn.model_selection import cross_val_score

# Create a feature matrix (NumPy array) with just the necessary features
X_previousWins = results[["HomeLastWin", "VisitorLastWin"]].values

# The object, tree_clf, has the Decision Tree Classifier ready
from sklearn.tree import DecisionTreeClassifier

# random_state is for reproducibility 
tree_clf = DecisionTreeClassifier(random_state=10)

#Compute F1 Score
scores = cross_val_score(tree_clf, X_previousWins, y_true, scoring=scorer)
print("Using only our new attributes, whether each team won their last match, we get an F1 Score of:")
print("F1 Score: {0:.4f}".format(np.mean(scores)))


Using only our new attributes, whether each team won their last match, we get an F1 Score of:
F1 Score: 0.5331
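The score above is the mean over several folds: the data is split into parts, and each part takes one turn as the test set while the model trains on the rest. The sketch below shows that idea with plain contiguous folds. It is a simplification and not what "cross_val_score" does internally; recent scikit-learn versions default to five stratified folds for classifiers. The helper `k_fold_indices` is a hypothetical name.

```python
def k_fold_indices(n_samples, n_folds=5):
    """Yield (train_indices, test_indices) for contiguous, unshuffled folds."""
    # Spread the remainder over the first folds so sizes differ by at most one.
    base, extra = divmod(n_samples, n_folds)
    start = 0
    for fold in range(n_folds):
        stop = start + base + (1 if fold < extra else 0)
        test = list(range(start, stop))
        train = list(range(0, start)) + list(range(stop, n_samples))
        yield train, test
        start = stop


# Every sample lands in exactly one test fold.
for train, test in k_fold_indices(n_samples=12, n_folds=5):
    print(len(train), test)
```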


With the momentum attributes, our F1 score improves from 0.3599 to 0.5331.

In [72]:
df = pd.DataFrame(index = ['Baseline', "Last Win"], data=[(0.3599, 0), (0.5331, 0.5331-0.3599)], columns=["F1 Score", "Overall Performance Boost"])
df

Unnamed: 0,F1 Score,Overall Performance Boost
Baseline,0.3599,0.0
Last Win,0.5331,0.1732


New Feature: Win Streaks
As with momentum, a win streak can boost morale, whereas a losing streak can cause problems within the team and lead to poorer performance in each successive game.

Can we improve our model by incorporating win streaks?

In [60]:
# Set our new features with a default value of 0
results["HomeWinStreak"] = 0
results["VisitorWinStreak"] = 0

from collections import defaultdict
win_streak = defaultdict(int)

for index, row in results.iterrows():
    home_team = row["Home Team"]
    visitor_team = row["Visitor Team"]
    # Streak each team carries into this game
    results.at[index, "HomeWinStreak"] = win_streak[home_team]
    results.at[index, "VisitorWinStreak"] = win_streak[visitor_team]

    if row["HomeWin"]:
        win_streak[home_team] += 1  # home team won: extend its streak
        win_streak[visitor_team] = 0  # visitor lost: reset its streak
    else:
        win_streak[home_team] = 0  # home team lost: reset its streak
        win_streak[visitor_team] += 1  # visitor won: extend its streak
        
results.loc[100:105]

Unnamed: 0,Date,Visitor Team,Visitor Goals,Home Team,Home Goals,HomeWin,HomeLastWin,VisitorLastWin,HomeWinStreak,VisitorWinStreak
100,2022-10-25,New Jersey Devils,6,Detroit Red Wings,2,False,True,False,1,0
101,2022-10-25,Colorado Avalanche,3,New York Rangers,2,False,False,True,0,1
102,2022-10-25,Florida Panthers,2,Chicago Blackhawks,4,True,True,True,3,1
103,2022-10-25,Pittsburgh Penguins,1,Calgary Flames,4,True,True,False,1,0
104,2022-10-25,Buffalo Sabres,1,Seattle Kraken,5,True,False,True,0,3
105,2022-10-25,Tampa Bay Lightning,2,Los Angeles Kings,4,True,False,True,0,2
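The streak counter can be checked by hand on a small made-up schedule. This sketch wraps the same bookkeeping in a helper; `streaks_before_each_game` and the teams "A", "B" and "C" are hypothetical names used only here.

```python
from collections import defaultdict


def streaks_before_each_game(games):
    """Return (home streak, visitor streak) going into each game, plus final streaks."""
    win_streak = defaultdict(int)  # teams with no earlier game start at 0
    before = []
    for home, visitor, home_win in games:
        before.append((win_streak[home], win_streak[visitor]))
        winner, loser = (home, visitor) if home_win else (visitor, home)
        win_streak[winner] += 1  # a win extends the streak
        win_streak[loser] = 0    # a loss resets it
    return before, dict(win_streak)


# Hypothetical schedule: (home team, visitor team, home team won?)
games = [
    ("A", "B", True),
    ("A", "C", True),
    ("B", "A", False),
    ("A", "C", False),
]
before, final = streaks_before_each_game(games)
print(before)  # [(0, 0), (1, 0), (0, 2), (3, 0)]
print(final)   # {'A': 0, 'B': 0, 'C': 1}
```

Team A enters the fourth game on a three-game streak and loses, so its counter drops back to 0 while C starts a new streak.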


In [63]:
# Let's run the decision tree classifier again to see if the F1 Score has improved
tree_clf = DecisionTreeClassifier(random_state=10)
X_winstreak = results[["HomeLastWin", "VisitorLastWin", "HomeWinStreak", "VisitorWinStreak"]].values
scores = cross_val_score(tree_clf, X_winstreak, y_true, scoring = scorer)
print("By adding win streaks, our new F1 score is:")
print("F1 Score: {0:.4f}".format(np.mean(scores)))

By adding win streaks, our new F1 score is:
F1 Score: 0.5209


Surprisingly, our F1 score actually decreased, from 0.5331 to 0.5209.
Note: The Overall Performance Boost for "Last Win" is measured against "Baseline", whereas the Overall Performance Boost for each later test is measured against "Last Win", not "Baseline".

In [74]:
df2 = pd.DataFrame(index = ["Win Streaks & Last Win"], data=[(0.5209, 0.5209 - 0.5331)], columns=["F1 Score", "Overall Performance Boost"])
# Keep the row labels so each score stays identifiable
df = pd.concat([df, df2])
df

Unnamed: 0,F1 Score,Overall Performance Boost
Baseline,0.3599,0.0
Last Win,0.5331,0.1732
Win Streaks & Last Win,0.5209,-0.0122


So far, all our features have been derived from the same dataset. The more features we derive from it, the greater the risk of overfitting, which would leave the model unable to predict well on new, unseen data.

One dataset we can consider is the standings of all the teams from the previous year.


Avoiding Overfitting: Team Rankings from the previous Season
If a team placed high last season, its morale is most likely high as well. Imagine a team that finished essentially in last place and lost every game, now facing the team that won last season in a landslide: would the losing team be nervous about the upcoming game? This information could be a strong indicator of the outcome of a game.

Let us create a new feature that checks whether the home team ranked higher than the visitor team last season.

In [81]:
#Import data

ladders_filename = "nhl_standings.csv"
ladder = pd.read_csv(ladders_filename)
ladder

Unnamed: 0.1,Unnamed: 0,Unnamed: 1,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,...,Unnamed: 12,Unnamed: 13,Unnamed: 14,Unnamed: 15,Unnamed: 16,Unnamed: 17,Unnamed: 18,Unnamed: 19,Unnamed: 20,Unnamed: 21
0,Rk,,Overall,Shootout,Overtime,Home,Road,EAS,WES,ATL,...,PAC,≤1,≥3,Oct,Nov,Dec,Jan,Feb,Mar,Apr
1,1,Boston Bruins,65-12-5,4-3,7-2,34-4-3,31-8-2,40-7-3,25-5-2,18-5-3,...,12-2-2,23-6-5,24-4-0,8-1-0,11-2-0,9-1-4,10-3-1,9-1-0,11-4-0,7-0-0
2,2,Carolina Hurricanes,52-21-9,4-3,9-6,28-10-3,24-11-6,34-13-3,18-8-6,15-7-2,...,9-4-3,25-6-9,21-10-0,6-2-1,6-4-4,12-0-1,9-3-2,6-2-0,8-7-1,5-3-0
3,3,New Jersey Devils,52-22-8,2-4,11-4,24-13-4,28-9-4,30-17-3,22-5-5,12-11-1,...,14-0-2,20-6-8,20-9-0,6-3-0,13-1-0,4-7-2,9-2-2,7-2-1,8-5-3,5-2-0
4,4,Vegas Golden Knights,51-22-9,5-4,8-5,25-15-1,26-7-8,22-8-2,29-14-7,12-4-0,...,14-9-3,25-8-9,19-11-0,8-2-0,9-4-1,8-6-1,4-6-2,6-1-2,11-3-1,5-0-2
5,5,Toronto Maple Leafs,50-21-11,1-2,7-9,27-8-6,23-13-5,31-13-6,19-8-5,15-7-4,...,8-5-3,21-4-11,25-10-0,4-4-2,11-1-3,8-3-1,8-4-2,6-3-0,7-5-2,6-1-1
6,6,Colorado Avalanche,51-24-7,6-3,9-4,22-13-6,29-11-1,18-11-3,33-13-4,9-6-1,...,14-7-3,22-8-7,22-9-0,4-4-1,8-3-0,7-6-2,8-5-0,7-1-2,10-5-1,7-0-1
7,7,Edmonton Oilers,50-23-9,0-4,5-5,23-12-6,27-11-3,17-11-4,33-12-5,11-4-1,...,19-6-1,15-9-9,22-9-0,6-3-0,7-7-0,7-6-2,8-2-2,4-3-4,12-2-1,6-0-0
8,8,Dallas Stars,47-21-14,4-3,4-11,22-10-9,25-11-5,17-7-8,30-14-6,8-4-4,...,12-10-2,13-6-14,29-9-0,5-3-1,8-3-3,10-3-2,5-4-4,3-3-3,10-4-1,6-1-0
9,9,New York Rangers,47-22-13,4-3,6-10,23-13-5,24-9-8,26-16-8,21-6-5,11-7-6,...,11-2-3,15-9-13,19-9-0,5-3-2,6-6-2,8-3-2,8-2-2,7-3-1,10-4-2,3-1-2


In [95]:
# Create our new feature - HomeTeamRanksHigher
def home_team_ranks_higher(row):
    home_team = row["Home Team"]
    visitor_team = row["Visitor Team"]

    # Use .get method with default rank for missing teams
    home_rank = ladder.get(home_team, {"Rk": 18})["Rk"]
    visitor_rank = ladder.get(visitor_team, {"Rk": 18})["Rk"]
    
    
    return home_rank < visitor_rank # Remember, a higher ranking is a lower number (in other words, 1st place > 2nd place)

# This will return a boolean result, either True or False and the line below will take whatever this boolean result is
# and place it into "HomeTeamRanksHigher"

results["HomeTeamRanksHigher"] = results.apply(home_team_ranks_higher, axis=1)
results[['Date', 'Visitor Team', 'Visitor Goals', 'Home Team', 'Home Goals', 'HomeWin', 'HomeTeamRanksHigher']]

Unnamed: 0,Date,Visitor Team,Visitor Goals,Home Team,Home Goals,HomeWin,HomeTeamRanksHigher
0,2022-10-07,San Jose Sharks,1,Nashville Predators,4,True,False
1,2022-10-08,Nashville Predators,3,San Jose Sharks,2,False,False
2,2022-10-11,Tampa Bay Lightning,1,New York Rangers,3,True,False
3,2022-10-11,Vegas Golden Knights,4,Los Angeles Kings,3,False,False
4,2022-10-12,Boston Bruins,5,Washington Capitals,2,False,False
...,...,...,...,...,...,...,...
1307,2023-04-13,Los Angeles Kings,5,Anaheim Ducks,3,False,False
1308,2023-04-13,Vancouver Canucks,5,Arizona Coyotes,4,False,False
1309,2023-04-13,Vegas Golden Knights,3,Seattle Kraken,1,False,False
1310,2023-04-14,Buffalo Sabres,5,Columbus Blue Jackets,2,False,False
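Every row in the output above is False. "DataFrame.get(key, default)" looks up a column label, and the standings frame has no column named after a team, so the default {"Rk": 18} comes back for both sides and the comparison is always 18 < 18. Below is a minimal sketch of one way to look ranks up by team name instead. It assumes, based on the standings table as displayed, that the rank sits in the column shown as "Unnamed: 0", the team name in "Unnamed: 1", and the real header in row 0. `ladder_raw`, `rank_by_team`, `DEFAULT_RANK` and `games` are illustrative names, and the toy frames hold only a few teams.

```python
import pandas as pd

# Toy copy of the first two columns of the standings table as it was loaded:
# the real header sits in row 0, rank in "Unnamed: 0", team name in "Unnamed: 1".
ladder_raw = pd.DataFrame({
    "Unnamed: 0": ["Rk", "1", "9"],
    "Unnamed: 1": [None, "Boston Bruins", "New York Rangers"],
})

# Skip the embedded header row and map team name -> integer rank.
standings = ladder_raw.iloc[1:]
rank_by_team = {
    team: int(rank)
    for team, rank in zip(standings["Unnamed: 1"], standings["Unnamed: 0"])
}

DEFAULT_RANK = 18  # fallback for teams missing from the table, as in the cell above


def home_team_ranks_higher(row):
    home_rank = rank_by_team.get(row["Home Team"], DEFAULT_RANK)
    visitor_rank = rank_by_team.get(row["Visitor Team"], DEFAULT_RANK)
    return bool(home_rank < visitor_rank)  # a lower number is a better rank


# Two toy games; teams absent from the toy table fall back to DEFAULT_RANK.
games = pd.DataFrame({
    "Visitor Team": ["Boston Bruins", "Tampa Bay Lightning"],
    "Home Team": ["Washington Capitals", "New York Rangers"],
})
games["HomeTeamRanksHigher"] = games.apply(home_team_ranks_higher, axis=1)
print(games)
```

With a dictionary keyed by team name, the feature takes both values in this toy data. On the full standings table the result for each game would depend on the actual ranks, so the column should be regenerated rather than read off this example.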
