Dataset was taken from MoneyPuck: https://moneypuck.com/moneypuck/playerData/careers/gameByGame/all_teams.csv

In [18]:
dataset = 'data.csv'

The dataset that is being used has games going from the 2008 season all the way up to the 2025 season. For this case we'll only be taking games from the 2025-26 season.

In [19]:
import pandas as pd

# Use Pandas to read the csv into a variable
games = pd.read_csv(dataset)

# Creata a datafram using the csv
df = pd.DataFrame(games)

# Filter out games before the beginning of the 2025-26 regular season, include all
# situations (Power Play, OT, etc...)
df = df[df['season'] > 2024]
df = df[df['situation'] == 'all']

# Sort dataframe
df = df.sort_values(['gameDate', 'gameId', 'team'])
df = df.reset_index(drop=True)

print(df.shape)
print(df.head())

(1558, 111)
  team  season name      gameId playerTeam opposingTeam home_or_away  \
0  CHI    2025  CHI  2025020001        CHI          FLA         AWAY   
1  FLA    2025  FLA  2025020001        FLA          CHI         HOME   
2  NYR    2025  NYR  2025020002        NYR          PIT         HOME   
3  PIT    2025  PIT  2025020002        PIT          NYR         AWAY   
4  COL    2025  COL  2025020003        COL          LAK         AWAY   

   gameDate    position situation  ...  unblockedShotAttemptsAgainst  \
0  20251007  Team Level       all  ...                          50.0   
1  20251007  Team Level       all  ...                          32.0   
2  20251007  Team Level       all  ...                          41.0   
3  20251007  Team Level       all  ...                          38.0   
4  20251007  Team Level       all  ...                          42.0   

   scoreAdjustedUnblockedShotAttemptsAgainst  dZoneGiveawaysAgainst  \
0                                     49.745       

Order the games by if the home team won. 1 = true, 0 = false

In [20]:
import numpy as np

df['home_win'] = np.where(
    df['home_or_away'] == 'HOME',
    (df['goalsFor'] > df['goalsAgainst']).astype(int),
    np.nan
)

# Create a clean df with data we can see easier
df_game = df[df['home_or_away'] == 'HOME'].copy()
df_game['home_win'] = (df_game['goalsFor'] > df_game['goalsAgainst']).astype(int)

print(df_game[['gameId', 'team', 'opposingTeam', 'goalsFor', 'goalsAgainst', 'home_win']].head())

       gameId team opposingTeam  goalsFor  goalsAgainst  home_win
1  2025020001  FLA          CHI       3.0           2.0         1
2  2025020002  NYR          PIT       0.0           3.0         0
5  2025020003  LAK          COL       1.0           4.0         0
7  2025020004  TOR          MTL       5.0           2.0         1
9  2025020005  WSH          BOS       1.0           3.0         0


The LSTM will need feature columns to train with. We're training on mostly the shots/goals related data as opposed to data for the play.

In [21]:
feature_columns = [
    'goalsFor', 'goalsAgainst',
    'xGoalsFor', 'xGoalsAgainst',
    'shotsOnGoalFor', 'shotsOnGoalAgainst',
    'unblockedShotAttemptsFor', 'unblockedShotAttemptsAgainst',
    'corsiPercentage', 'fenwickPercentage',
    'highDangerShotsFor', 'highDangerShotsAgainst',
    'blockedShotAttemptsFor', 'blockedShotAttemptsAgainst',
    'giveawaysFor', 'giveawaysAgainst',
    'takeawaysFor', 'takeawaysAgainst',
    'faceOffsWonFor', 'faceOffsWonAgainst',
    'reboundxGoalsFor', 'reboundxGoalsAgainst',
    'totalShotCreditFor', 'totalShotCreditAgainst'
]

We need to prepare time-sequence data to tell if the home team wins the game. This is what the LSTM will train on.

In [22]:
from sklearn.preprocessing import StandardScaler

window = 8

def get_team_sequence(team, current_date, current_game_id, df_all, window, features):
    past = df_all[
        (df_all['team'] == team) &
        (df_all['gameDate'] < current_date) &
        (df_all['gameId'] < current_game_id)
    ].sort_values('gameDate').tail(window)

    if len(past) < window:
        return None

    return past[features].values  # shape (window, n_features)

X_home_list = []
X_away_list = []
y_list = []

for _, row in df_game.iterrows():
    home_team = row['team']
    away_team = row['opposingTeam']
    gdate = row['gameDate']
    gid = row['gameId']
    home_seq = get_team_sequence(home_team, gdate, gid, df, window, feature_columns)
    away_seq = get_team_sequence(away_team, gdate, gid, df, window, feature_columns)

    if home_seq is not None and away_seq is not None:
        X_home_list.append(home_seq)
        X_away_list.append(away_seq)
        y_list.append(row['home_win'])

X_home = np.array(X_home_list)
X_away = np.array(X_away_list)
X = np.concatenate([X_home, X_away], axis=2)  # (n_games, window, 2 * n_features)
y = np.array(y_list)

print(X.shape, y.shape)

scaler = StandardScaler()
X_reshaped = X.reshape(-1, X.shape[-1])
X_scaled = scaler.fit_transform(X_reshaped).reshape(X.shape)

# Time-based split (last ~20% as test)
split_idx = int(0.8 * len(X_scaled))
X_train, X_test = X_scaled[:split_idx], X_scaled[split_idx:]
y_train, y_test = y[:split_idx], y[split_idx:]

(647, 8, 48) (647,)


Time for the fun part! Building and training the LSTM.

In [23]:
import torch
import torch.nn as nn

class HockeyLSTM(nn.Module):
    def __init__(self, input_size, hidden_size=64, num_layers=2):
        super().__init__()
        self.lstm = nn.LSTM(input_size, hidden_size, num_layers, batch_first=True)
        self.fc = nn.Linear(hidden_size, 1)  # binary output
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        out, _ = self.lstm(x)
        out = self.fc(out[:, -1, :])  # last time step
        return self.sigmoid(out)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

X_train_tensor = torch.tensor(X_train, dtype=torch.float32).to(device)
y_train_tensor = torch.tensor(y_train, dtype=torch.float32).view(-1, 1).to(device)

X_test_tensor  = torch.tensor(X_test,  dtype=torch.float32).to(device)
y_test_tensor  = torch.tensor(y_test,  dtype=torch.float32).view(-1, 1).to(device)

input_size = X_train.shape[-1]
model = HockeyLSTM(input_size=input_size).to(device)
criterion = nn.BCELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

epochs = 50

# Training loop
for epoch in range(epochs):
    model.train()
    optimizer.zero_grad()
    outputs = model(X_train_tensor)
    loss = criterion(outputs, y_train_tensor)
    loss.backward()
    optimizer.step()

    if (epoch + 1) % 5 == 0:
        with torch.no_grad():
            train_preds = (outputs >= 0.5).float()
            train_acc = (train_preds == y_train_tensor).float().mean().item()
        print(f'Epoch {epoch+1}/{epochs}, Loss: {loss.item():.4f}, Train Acc: {train_acc:.4f}')

# Evaluation
model.eval()
with torch.no_grad():
    test_outputs = model(X_test_tensor)
    test_preds = (test_outputs >= 0.5).float()
    test_acc = (test_preds == y_test_tensor).float().mean().item()
    test_loss = criterion(test_outputs, y_test_tensor).item()

print("\nFinal Results:")
print(f"Test Loss: {test_loss:.4f}")
print(f"Test Accuracy: {test_acc:.4f}")

Epoch 5/50, Loss: 0.6869, Train Acc: 0.5435
Epoch 10/50, Loss: 0.6815, Train Acc: 0.5474
Epoch 15/50, Loss: 0.6716, Train Acc: 0.5783
Epoch 20/50, Loss: 0.6560, Train Acc: 0.6286
Epoch 25/50, Loss: 0.6352, Train Acc: 0.6460
Epoch 30/50, Loss: 0.6015, Train Acc: 0.6712
Epoch 35/50, Loss: 0.5538, Train Acc: 0.7234
Epoch 40/50, Loss: 0.4749, Train Acc: 0.7737
Epoch 45/50, Loss: 0.3675, Train Acc: 0.8511
Epoch 50/50, Loss: 0.2440, Train Acc: 0.9014

Final Results:
Test Loss: 1.3036
Test Accuracy: 0.5077


We're currently getting a little more than 50% accuracy. Pretty much just a completely uninformed guess. A lot of wok to be done still.