<font color="#7da7ca" size=5><center> 🏒 Analyzing and Projecting NHL Skater Metrics 🏒 </center></font>

# Introduction
This notebook will focus on analysis and regression of select NHL skater metrics in previous seasons, with the goal of creating a simple model to predict these statistics for future seasons. The following metrics will be considered:

1. Time On Ice (TOI) per game
2. Goals
3. Assists
4. Points
5. Shots on Goal

We will only consider skaters who have a minimum of 15 games played in a season. Once we've analyzed and regressed skaters' previous performances, we will build a model inspired by [Tom Tango](https://twitter.com/tangotiger)'s MARCEL baseball forecasting model, and project these metrics for the 2020-2021 NHL season.

Note: Only skaters who have met the requirement for minimum games played in the previous 3 seasons will be considered. Additionally, the projections will be made with the assumption that skaters will play the entire season without missing any games.

# Table of Contents
1. [Clean and group NHL game data](#1)
2. [Data analysis](#2)
3. [Define model by weighing past performance through linear regression](#3)
4. [Project and analyze results](#4)
5. [References](#5)

----------------------------------------------------------------------------------------------------------------------------------------------
<a id ="1" > </a>
# 1. Clean and group NHL game data by skater and season

The first thing we'll do is import the libraries we need.

In [None]:
import numpy as np
import pandas as pd
import datetime
from datetime import date
from dateutil.relativedelta import relativedelta
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.metrics import mean_absolute_error
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots

We'll be working with [Martin Ellis' NHL Game Data dataset](https://www.kaggle.com/martinellis/nhl-game-data), so we'll need to clean and group this data by skater and by season before we begin.

In [None]:
# Load data
game_df = pd.read_csv('../input/nhl-game-data/game.csv')
game_player_df = pd.read_csv('../input/nhl-game-data/game_skater_stats.csv')
player_df = pd.read_csv('../input/nhl-game-data/player_info.csv')
scratched_df = pd.read_csv('../input/nhl-game-data/game_scratches.csv')

# Merge game_player and player dfs
clean_df = game_player_df.merge(player_df[['player_id', 'firstName', 'lastName', 'primaryPosition', 'birthDate']],
              left_on='player_id', right_on='player_id')

# Merge game above with game df 
clean_df = clean_df.merge(game_df[['game_id', 'season', 'type']],
             left_on='game_id', right_on='game_id')

# Merge above with games scratched
clean_df = clean_df.merge(scratched_df, on=['game_id', 'player_id'], how='left', indicator=True)
clean_df['scratched'] = clean_df.pop('_merge').eq('both')

# Drop columns that are not needed
if 'team_id_x' in clean_df.columns:
    clean_df = clean_df.drop(['team_id_x'], axis=1)
# Remove goalies
clean_df = clean_df.loc[clean_df.primaryPosition != 'G']

# Filter for game type R (regular season)
clean_df = clean_df.loc[clean_df.type == 'R']

# Filter out games where player was scratched
clean_df = clean_df.loc[~clean_df.scratched]
clean_df = clean_df.drop_duplicates()
# Create dictionary for aggregates
agg_dict_sum = {'assists': ['sum'], 'goals': ['sum'], 'shots': ['sum'], 'hits': ['sum'], 'powerPlayGoals': ['sum'], 'powerPlayAssists': ['sum'], 'shortHandedGoals': ['sum'],
     'shortHandedAssists': ['sum'], 'blocked': ['sum'], 'timeOnIce': ['sum'], 'powerPlayTimeOnIce': ['sum'], 'evenTimeOnIce': ['sum'], 'penaltyMinutes': ['sum'], 'player_id' : ['count']}

# Set flag for TOI conversion
toi_converted = 0

# Group data by player_id and season
grouped_df = clean_df.groupby(['player_id', 'firstName', 'lastName', 'primaryPosition', 'birthDate', 'season']).agg(agg_dict_sum).reset_index()

# Rename columns
grouped_df.columns = ['player_id','firstName', 'lastName', 'position', 'birthDate', 'season', 'assists', 'goals', 'shots', 'hits', 'powerPlayGoals',
                     'powerPlayAssists', 'shortHandedGoals', 'shortHandedAssists', 'blocks', 'timeOnIce', 'powerPlayTimeOnIce', 'evenTimeOnIce',
                     'penaltyMinutes', 'gamesPlayed']

# Convert TOI and PP TOI from seconds to minutes
if toi_converted == 0:
    # Convert toi
    grouped_df['timeOnIce'] = grouped_df['timeOnIce']/60
    grouped_df['powerPlayTimeOnIce'] = grouped_df['powerPlayTimeOnIce']/60
    grouped_df['evenTimeOnIce'] = grouped_df['evenTimeOnIce']/60
    
    # Add Shorthanded TOI column 
    grouped_df['shortHandedTimeOnIce'] = round(grouped_df['timeOnIce'] - grouped_df['evenTimeOnIce'] - grouped_df['powerPlayTimeOnIce'], 2)

    # Add per game toi columns
    grouped_df['timeOnIcePerGame'] = grouped_df['timeOnIce']/grouped_df['gamesPlayed']
    grouped_df['evenTimeOnIcePerGame'] = grouped_df['evenTimeOnIce']/grouped_df['gamesPlayed']
    grouped_df['powerPlayTimeOnIcePerGame'] = grouped_df['powerPlayTimeOnIce']/grouped_df['gamesPlayed']
    grouped_df['shortHandedTimeOnIcePerGame'] = grouped_df['shortHandedTimeOnIce']/grouped_df['gamesPlayed']    
    
    toi_converted = 1

# Add points column
grouped_df['points'] = grouped_df['assists'] + grouped_df['goals']

# Add PP points column
grouped_df['powerPlayPoints'] = grouped_df['powerPlayGoals'] + grouped_df['powerPlayAssists']

# Add SH points column
grouped_df['shortHandedPoints'] = grouped_df['shortHandedGoals'] + grouped_df['shortHandedAssists']

# Add Even Strength goals, assists, and points column
grouped_df['evenStrengthGoals'] = grouped_df['goals'] - grouped_df['powerPlayGoals'] - grouped_df['shortHandedGoals']
grouped_df['evenStrengthAssists'] = grouped_df['assists'] - grouped_df['powerPlayAssists'] - grouped_df['shortHandedAssists']
grouped_df['evenStrengthPoints'] = grouped_df['evenStrengthGoals'] + grouped_df['evenStrengthAssists']

# Sort by seasons and points
grouped_df = grouped_df.sort_values(['season', 'points'], 
              ascending = [False, False])

# Concatenate first and last name
if 'firstName' in grouped_df.columns and 'lastName' in grouped_df.columns:
    grouped_df['name'] = grouped_df['firstName'] + ' ' + grouped_df['lastName']

# Drop unnecessary columns
if 'firstName' in grouped_df.columns:
    grouped_df = grouped_df.drop(['firstName'], axis=1)
    
if 'lastName' in grouped_df.columns:   
    grouped_df = grouped_df.drop(['lastName'], axis=1)
    
# Reorder column names
grouped_df = grouped_df.reindex(columns = ['player_id', 'name', 'birthDate', 'position', 'season','goals', 'assists', 'points', 'shots', 'hits', 'blocks',
                                           'powerPlayGoals', 'powerPlayAssists', 'powerPlayPoints', 'shortHandedGoals', 'shortHandedAssists', 'shortHandedPoints',
                                           'evenStrengthGoals', 'evenStrengthAssists', 'evenStrengthPoints','penaltyMinutes', 'timeOnIce', 'evenTimeOnIce',
                                           'powerPlayTimeOnIce', 'shortHandedTimeOnIce', 'timeOnIcePerGame', 'evenTimeOnIcePerGame', 'powerPlayTimeOnIcePerGame',
                                           'shortHandedTimeOnIcePerGame', 'gamesPlayed'])

# Output data
grouped_df.to_csv('skater_data_by_season.csv',index=False)

In [None]:
# Global variable for 3 seasons to work with
sample_seasons = [20192020, 20182019, 20172018]

# Seasons to calculate avg toi
sample_seasons_toi = [20192020, 20182019, 20172018, 20162017, 20152016]

# Seasons to look at for age vs toi analysis
sample_seasons_aging = [20192020, 20182019, 20172018, 20162017, 20152016, 20142015, 20132014, 20122013, 20112012, 20102011]

# Global variable - value to iterate through seasons when performing linear regression, since season is stored as an int (i.e. season 2019-2020 is stored as integer: 20192020)
ITERATOR = abs(sample_seasons[0] - sample_seasons[1])

# Global variable - determine how many games to project upcoming season for
GAMES_TO_PLAY = 56

# Global variable - season start date
SEASON_START_DATE = datetime.datetime(2021, 1, 1)

# Global variable - coefficient labels
coefficient_labels = ['Season n-1 coefficient', 'Season n-2 coefficient', 'Season n-3 coefficient']

Now that our data has been cleaned and grouped by skater and by season, let's take a look at the top 10 point leaders from the 2019-2020 NHL season.

In [None]:
# Copy data to work with
skater_all_seasons_df = grouped_df.copy()

# Display top 10 point-scoring skaters from 2019-2020 season
skater_all_seasons_df = skater_all_seasons_df.sort_values(['season', 'points'], ascending = [False, False])
skater_all_seasons_df.loc[skater_all_seasons_df.season == 20192020].head(10)

Let's calculate and add the shooting percentage, as well as the "per 60" metrics for goals, assists, and shots on goal to our dataframe.
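
To make the per-60 normalization concrete, here is a minimal sketch on made-up numbers. The helper name `per_60` and the sample values are illustrative assumptions rather than part of the notebook's pipeline; the formula is the same one applied to the dataframe columns in this section.

```python
def per_60(total, toi_minutes):
    """Scale a season total to a rate per 60 minutes of ice time."""
    if toi_minutes <= 0:
        # No ice time recorded, so the rate is undefined
        return float("nan")
    return total * 60 / toi_minutes


# Made-up example: 30 goals in 1,500 minutes of total ice time
example_goals_per_60 = per_60(30, 1500)
```

In this example, 30 goals in 1,500 minutes works out to 1.2 goals per 60 minutes.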

In [None]:
# Filter out players who haven't played at least 15 games in a season
skater_all_seasons_df = skater_all_seasons_df.loc[skater_all_seasons_df.gamesPlayed >= 15]

# Add goals per 60 metric
skater_all_seasons_df['goalsPer60'] = (skater_all_seasons_df['goals']*60) / skater_all_seasons_df['timeOnIce']

# Add assists per 60 metric
skater_all_seasons_df['assistsPer60'] = (skater_all_seasons_df['assists']*60) / skater_all_seasons_df['timeOnIce']

# Add points per 60 metric
skater_all_seasons_df['pointsPer60'] = (skater_all_seasons_df['points']*60) / skater_all_seasons_df['timeOnIce']

# Add shots per 60 metric
skater_all_seasons_df['shotsPer60'] = (skater_all_seasons_df['shots']*60) / skater_all_seasons_df['timeOnIce']

# Add shooting percentage
skater_all_seasons_df['shootingPercentage'] = round((skater_all_seasons_df['goals'] / skater_all_seasons_df['shots']), 2)*100

Now we'll split the data into two groups, forwards and defensemen. Then we make copies of the dataframes we'll be working with, only including columns we need.

In [None]:
# Separate data into two groups: forwards and defensemen
forwards_df = skater_all_seasons_df.loc[skater_all_seasons_df.position != 'D']
defensemen_df = skater_all_seasons_df.loc[skater_all_seasons_df.position == 'D']

working_forwards_df = forwards_df[['player_id', 'name', 'birthDate', 'position', 'season', 'goals', 'goalsPer60', 'assists', 'assistsPer60', 'points', 'pointsPer60', 'shots', 'shotsPer60', 'shootingPercentage', 'timeOnIcePerGame', 'timeOnIce', 'gamesPlayed']].copy()
working_defensemen_df = defensemen_df[['player_id', 'name', 'birthDate', 'position', 'season', 'goals', 'goalsPer60', 'assists', 'assistsPer60', 'points', 'pointsPer60', 'shots', 'shotsPer60', 'shootingPercentage', 'timeOnIcePerGame','timeOnIce', 'gamesPlayed']].copy()

Let's see what our data looks like now.

In [None]:
working_forwards_df.head(5)

----------------------------------------------------------------------------------------------------------------------------------------------
<a id ="2" > </a>
# 2. Data analysis
In this section, we'll be analyzing how average time on ice per game relates to age and to shots/60, and we'll draw some conclusions from these relationships. When looking at the relationship between avg. TOI per game and age, we will consider the last 10 NHL regular seasons, in order to have a larger sample size with the goal of identifying an aging curve for time on ice.

Additionally, we'll find the average time on ice per game for forwards and for defensemen. These values will be used in our model in order to regress a skater's performance to the mean, depending on their position. 
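
One common way to regress an observed value toward a group mean is to blend the two, giving the observation more weight as the sample grows. The sketch below illustrates the idea; the function name `regress_to_mean` and the `ballast_games` constant are hypothetical, and this is not necessarily the exact scheme used later in the notebook.

```python
def regress_to_mean(observed, games_played, positional_mean, ballast_games=20):
    """Blend an observed per-game value with the positional mean.

    ballast_games is a hypothetical tuning constant: the number of
    "league-average" games mixed in alongside the skater's own games.
    """
    weighted_sum = observed * games_played + positional_mean * ballast_games
    return weighted_sum / (games_played + ballast_games)


# Made-up example: 20.0 min/game over 60 games, positional mean of 16.0 min/game
example_regressed_toi = regress_to_mean(20.0, 60, 16.0)
```

With these made-up inputs the regressed value is 19.0 minutes per game. A skater with fewer games played would be pulled closer to the positional mean.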

Let's start by visualizing the average time on ice per age group for the last 10 seasons.

In [None]:
# Method to calculate player's age at the start of a season
def get_player_age_in_season(season, birthDate):
    # Assume each season starts in October of that year
    season_year = int(str(season)[:4])
    season_start = datetime.datetime(season_year, 10, 1)
    
    # Format birthDate as date time
    birthDate_dt = datetime.datetime.strptime(birthDate, '%Y-%m-%d %H:%M:%S')
    
    # Calculate age in years at time season started
    return relativedelta(season_start, birthDate_dt).years

In [None]:
all_skaters_2010_df = skater_all_seasons_df.loc[skater_all_seasons_df.season.isin(sample_seasons_aging)][['player_id', 'name', 'birthDate', 'position', 'season', 'goalsPer60', 'assistsPer60', 'shotsPer60', 'pointsPer60', 'timeOnIcePerGame', 'gamesPlayed']].copy()

toi_dictionary = {'timeOnIcePerGame': ['mean'], 'age' : ['count']}
all_skaters_2010_df['age'] = np.vectorize(get_player_age_in_season)(all_skaters_2010_df['season'], all_skaters_2010_df['birthDate']) 

# Group by age to get mean TOI per age
age_2010_df = all_skaters_2010_df.groupby(['age']).agg(toi_dictionary).reset_index()

# Rename columns
age_2010_df.columns = ['age', 'avgTimeOnIcePerGame', 'numberOfPlayers']

In [None]:
# Display age distribution
bar_age = px.bar(
            age_2010_df,
            x='age', y='numberOfPlayers',
            hover_data=['numberOfPlayers'],
            labels={'age': 'Age', 'numberOfPlayers':'Number of Players'},
            title = 'Skater Age Distribution (2010-2020)'
        )

bar_age.update_layout(
    title={
                'text': "Skater Age Distribution (2010-2020)",
                'y':0.9,
                'x':0.5,
                'xanchor': 'center',
                'yanchor': 'top'})

In [None]:
bar_age.show()

In [None]:
# Display age distributed avg time on ice
bar_age_toi = px.bar(
            age_2010_df,
            x='age', y='avgTimeOnIcePerGame',
            hover_data=['numberOfPlayers', 'avgTimeOnIcePerGame'],
            labels={'age': 'Age', 'avgTimeOnIcePerGame':'Average TOI/Game'},
            color='numberOfPlayers',
            color_continuous_scale=px.colors.sequential.Bluered
        )

bar_age_toi.update_layout(
    title={
                'text': "Average TOI/Game Grouped by Age (2010-2020)",
                'y':0.95,
                'x':0.5,
                'xanchor': 'center',
                'yanchor': 'top'})

In [None]:
bar_age_toi.show()

As we can see, the age groups with the highest average time on ice per game were mostly in their mid-30s or older, but these groups contained very few players. The previous graph (Skater Age Distribution 2010-2020) shows that the majority of players were actually between the ages of 22 and 31. Within this range there is a distinct curve peaking around 28-30 years old, while the data outside this range is too noisy to draw conclusions from.

We will assume that the few players still playing at an NHL level in these older age groups were likely elite veterans who had earned their ice time as forwards (e.g. Jaromir Jagr), or experienced defensemen who play large chunks of time for their team (e.g. Zdeno Chara). If we exclude these older age groups as outliers, player time on ice seems to peak around age 29.
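
The outlier exclusion described above can be sketched as a minimum-sample-size filter applied before taking the maximum. The function name `peak_age`, the `min_players` threshold, and the toy numbers are all assumptions made for illustration; they are not values taken from the dataset.

```python
def peak_age(toi_by_age, min_players=100):
    """Return the age with the highest average TOI among well-populated age groups.

    toi_by_age maps age -> (average TOI per game, number of players).
    """
    eligible = {
        age: avg_toi
        for age, (avg_toi, n_players) in toi_by_age.items()
        if n_players >= min_players
    }
    return max(eligible, key=eligible.get)


# Made-up numbers, shaped like the grouped age data but not taken from it
toy_toi_by_age = {
    22: (14.1, 400),
    26: (16.0, 520),
    29: (16.8, 350),
    33: (16.2, 120),
    40: (19.5, 3),
}
toy_peak = peak_age(toy_toi_by_age)
```

In this toy data the age-40 group has the highest average but only 3 players, so it is filtered out and the peak lands at 29.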

This age is significant because the MARCEL forecasting system takes age into account when projecting metrics. A player below the peak age can expect a slight increase in time on ice, while a player above the peak age can expect a slight decrease.
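
As a rough sketch of how such an age adjustment could be expressed, the function below returns a multiplier that is above 1 for players younger than the peak age and below 1 for older players. The name `age_adjustment_factor` and both rate constants are placeholders chosen for illustration; they are not values fitted in this notebook.

```python
PEAK_AGE = 29  # peak age identified in the analysis above


def age_adjustment_factor(age, peak_age=PEAK_AGE, young_rate=0.006, old_rate=0.003):
    """Return a multiplier to apply to a projected metric based on age.

    young_rate and old_rate are hypothetical per-year adjustments.
    """
    if age < peak_age:
        # Younger players are still trending upward
        return 1 + (peak_age - age) * young_rate
    # Players at or past the peak trend flat or downward
    return 1 - (age - peak_age) * old_rate


factor_at_25 = age_adjustment_factor(25)
factor_at_35 = age_adjustment_factor(35)
```

Multiplying a projected time on ice by this factor nudges it up for a 25-year-old and down for a 35-year-old, while leaving a 29-year-old unchanged.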

In order to regress a skater's performance to the mean, let's find the average TOI per game for forwards and defensemen over the past 3 seasons.

In [None]:
# Calculate average time on ice per game for forwards over the last 3 seasons
fwd_toi_df = working_forwards_df.loc[working_forwards_df.season.isin(sample_seasons)]
fwd_toi_df = fwd_toi_df.loc[fwd_toi_df.gamesPlayed >= 40]

mean_toi_fwd_n3 = fwd_toi_df.loc[fwd_toi_df.season == sample_seasons[2]]['timeOnIcePerGame'].mean()
mean_toi_fwd_n2 = fwd_toi_df.loc[fwd_toi_df.season == sample_seasons[1]]['timeOnIcePerGame'].mean()
mean_toi_fwd_n1 = fwd_toi_df.loc[fwd_toi_df.season == sample_seasons[0]]['timeOnIcePerGame'].mean()
mean_toi_fwd = round(((mean_toi_fwd_n3 + mean_toi_fwd_n2 + mean_toi_fwd_n1) / 3),2)

mean_fwd_dict = {"Position":"Forward", "Avg. TOI Per Game": mean_toi_fwd}
#print("The average time on ice per game for forwards over the last 3 seasons is " + str(mean_toi_fwd) + " minutes per game")

In [None]:
# Calculate average time on ice per game for defensemen over the last 3 seasons
def_toi_df = working_defensemen_df.loc[working_defensemen_df.season.isin(sample_seasons)]
def_toi_df = def_toi_df.loc[def_toi_df.gamesPlayed >= 40]

mean_toi_def_n3 = def_toi_df.loc[def_toi_df.season == sample_seasons[2]]['timeOnIcePerGame'].mean()
mean_toi_def_n2 = def_toi_df.loc[def_toi_df.season == sample_seasons[1]]['timeOnIcePerGame'].mean()
mean_toi_def_n1 = def_toi_df.loc[def_toi_df.season == sample_seasons[0]]['timeOnIcePerGame'].mean()
mean_toi_def = round(((mean_toi_def_n3 + mean_toi_def_n2 + mean_toi_def_n1) / 3),2)

mean_def_dict = {"Position":"Defense", "Avg. TOI Per Game": mean_toi_def}
#print("The average time on ice per game for defensemen over the last 3 seasons is " + str(round(mean_toi_def,2)) + " minutes per game")

In [None]:
# Add mean toi to dataframe
mean_toi_df = pd.DataFrame([mean_fwd_dict, mean_def_dict])

# Plot avg time on ice for both positions
mean_toi_bar = px.bar(mean_toi_df, x="Position", y="Avg. TOI Per Game",
             color='Position', labels={'Avg. TOI Per Game':'Average TOI/Game'})

mean_toi_bar.update_layout(
    title={
                'text': "Average TOI/Game by Position (2017-2020)",
                'y':0.95,
                'x':0.5,
                'xanchor': 'center',
                'yanchor': 'top'})

In [None]:
mean_toi_bar.show()

In this next section, we will be using "per 60" metrics for shots on goal in order to put all skaters on a level playing field, regardless of their time on ice. This will help us identify some skaters who may deserve a little more ice time when it comes to shot production.

In [None]:
scatter_dict = {'pointsPer60': ['mean'], 'shotsPer60': ['mean'], 'timeOnIcePerGame': ['mean']}

# Group by player to get mean per 60 and toi values
season_n1_fwd = working_forwards_df.loc[working_forwards_df.season == sample_seasons[0]]
season_n2_fwd = working_forwards_df.loc[working_forwards_df.season == sample_seasons[1]]
season_n3_fwd = working_forwards_df.loc[working_forwards_df.season == sample_seasons[2]]

filter_fwd_active_n1 = working_forwards_df.player_id.isin(season_n1_fwd.player_id)
filter_fwd_active_n2 = working_forwards_df.player_id.isin(season_n2_fwd.player_id)
filter_fwd_active_n3 = working_forwards_df.player_id.isin(season_n3_fwd.player_id)

scatter_fwd_df = working_forwards_df[filter_fwd_active_n1 & filter_fwd_active_n2 & filter_fwd_active_n3].loc[working_forwards_df.season.isin(sample_seasons)].groupby(['player_id', 'name']).agg(scatter_dict).reset_index()

# Rename columns
scatter_fwd_df.columns = ['player_id','name', 'pointsPer60Avg', 'shotsPer60Avg', 'timeOnIcePerGameAvg']

# Note: the three filters above already restrict the sample to players who appeared in all 3 seasons

toi_shots60_fwd_scatter = px.scatter(scatter_fwd_df, x="shotsPer60Avg", y="timeOnIcePerGameAvg",
                                     labels={'timeOnIcePerGameAvg':'Average TOI/Game', 'shotsPer60Avg':'Average Shots/60'},
                                     text="name", size_max=20, trendline="ols")

toi_shots60_fwd_scatter.update_traces(textposition='top center')

toi_shots60_fwd_scatter.update_layout(
    title={
                'text': "Forwards Average TOI/Game vs Average Shots/60 (2017-2020)",
                'y':0.95,
                'x':0.5,
                'xanchor': 'center',
                'yanchor': 'top'})

In [None]:
toi_shots60_fwd_scatter.show()

As we can see, Brendan Gallagher is the forward who has averaged the highest shots/60 over the last three seasons, slightly beating out Alexander Ovechkin. Considering how much Ovechkin shoots and scores, this comes as a bit of a surprise. It may be difficult to compare Gallagher to Ovechkin due to the difference in their shot quality, as Ovechkin is known to have one of the most powerful and accurate shots in the league, but as Wayne Gretzky once said, "You miss 100% of the shots you don't take".

Given this metric, it might be worth considering giving Gallagher more playing time. After all, every shot is an additional goal-scoring opportunity.

![](https://i.imgur.com/zY2zPFt.jpg)

In [None]:
# Group by player to get mean per 60 and toi values
season_n1_def = working_defensemen_df.loc[working_defensemen_df.season == sample_seasons[0]]
season_n2_def = working_defensemen_df.loc[working_defensemen_df.season == sample_seasons[1]]
season_n3_def = working_defensemen_df.loc[working_defensemen_df.season == sample_seasons[2]]

filter_def_active_n1 = working_defensemen_df.player_id.isin(season_n1_def.player_id)
filter_def_active_n2 = working_defensemen_df.player_id.isin(season_n2_def.player_id)
filter_def_active_n3 = working_defensemen_df.player_id.isin(season_n3_def.player_id)

scatter_def_df = working_defensemen_df[filter_def_active_n1 & filter_def_active_n2 & filter_def_active_n3].loc[working_defensemen_df.season.isin(sample_seasons)].groupby(['player_id', 'name']).agg(scatter_dict).reset_index()

# Rename columns
scatter_def_df.columns = ['player_id', 'name', 'pointsPer60Avg', 'shotsPer60Avg', 'timeOnIcePerGameAvg']

toi_shots60_def_scatter = px.scatter(scatter_def_df, x="shotsPer60Avg", y="timeOnIcePerGameAvg",
                                     labels={'timeOnIcePerGameAvg':'Average TOI/Game', 'shotsPer60Avg':'Average Shots/60'},
                                     text="name", size_max=20, trendline="ols")

toi_shots60_def_scatter.update_traces(textposition='top center')

toi_shots60_def_scatter.update_layout(
    title={
                'text': "Defensemen Average TOI/Game vs Average Shots/60 (2017-2020)",
                'y':0.95,
                'x':0.5,
                'xanchor': 'center',
                'yanchor': 'top'})

In [None]:
toi_shots60_def_scatter.show()

The big outlier here is Dougie Hamilton of the Carolina Hurricanes (although Roman Josi and Brent Burns are also elite). He has the highest average shots/60 of the group over the past 3 seasons, but does not receive as much ice time as other players of the same caliber in this metric.

Earlier in his career, he didn't receive consistent minutes on the power play, but now that he has been given the reins of the first power-play unit in Carolina, he has emerged as one of the best offensive defensemen in the league. The number of shots/60 he has been generating is just one piece of evidence of how strong an offensive defenseman he has been in recent years.

----------------------------------------------------------------------------------------------------------------------------------------------
<a id ="3" > </a>
# 3. Define model by weighing past performance through linear regression
In this section we will adapt Tom Tango's MARCEL baseball projection system to hockey. We will use this model to project time on ice per game, goals, assists, points, and shots for the 2020-2021 NHL season.

Instead of the fixed 5/4/3 weighting used in the traditional MARCEL system, we will run a multiple linear regression on each metric to obtain weights fitted to the data.

For each regression, our target variable will be the metric for season n, while our feature variables will be the same metric for seasons n-1, n-2, and n-3.
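
For comparison, the fixed 5/4/3 weighting can be sketched as a simple weighted average of the three previous seasons. This is only the weighting step: the function name `marcel_weighted_average` is hypothetical, and the sketch leaves out the other parts of a MARCEL-style projection, such as regression toward the mean and the age adjustment.

```python
def marcel_weighted_average(stat_n1, stat_n2, stat_n3, weights=(5, 4, 3)):
    """Weighted average of a metric over seasons n-1, n-2, and n-3.

    The most recent season (n-1) gets the largest weight.
    """
    w1, w2, w3 = weights
    weighted_sum = w1 * stat_n1 + w2 * stat_n2 + w3 * stat_n3
    return weighted_sum / (w1 + w2 + w3)


# Made-up example: 30, 24, and 18 goals in seasons n-1, n-2, and n-3
example_weighted_goals = marcel_weighted_average(30, 24, 18)
```

With these made-up totals the weighted average is 25 goals. The regression below replaces the fixed weights with coefficients estimated from the data.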

In [None]:
# Method to linearly regress a stat based on a select number of previous seasons
def regress_stat(df, stat_to_regress, sample_seasons, normalize=False):
    
    # Init dataframes
    full_results_df = pd.DataFrame()
    return_df = pd.DataFrame()

    # Create working dataframe
    columns = df.columns.values.tolist()
    columns_to_keep = ['player_id', 'season', 'gamesPlayed', stat_to_regress]
    columns_to_remove = set(columns) - set(columns_to_keep)
    
    working_df = df.drop(columns_to_remove, axis=1)
    
    # Drop the players who didn't play a minimum of 15 games
    working_df = working_df.loc[working_df.gamesPlayed >= 15]    
    
    # Create column labels
    target_label = stat_to_regress + '_target'
    s1_label = 's1'
    s2_label = 's2'
    s3_label = 's3'
    
    for target_season in sample_seasons:
        feat_season_n3 = target_season - 3*(ITERATOR)
        feat_season_n2 = target_season - 2*(ITERATOR)
        feat_season_n1 = target_season - ITERATOR

        feat_s3_df = working_df.loc[working_df.season == feat_season_n3]
        feat_s3_df_copy = feat_s3_df.copy()
        feat_s3_df_copy[s3_label] = feat_s3_df[stat_to_regress]
        if feat_s3_df_copy.columns.isin(['season', stat_to_regress, 'gamesPlayed']).any():
            feat_s3_df_copy = feat_s3_df_copy.drop(['season', stat_to_regress, 'gamesPlayed'], axis=1)

        feat_s2_df = working_df.loc[working_df.season == feat_season_n2]
        feat_s2_df_copy = feat_s2_df.copy()
        feat_s2_df_copy[s2_label] = feat_s2_df[stat_to_regress]
        if feat_s2_df.columns.isin(['season', stat_to_regress, 'gamesPlayed']).any():
            feat_s2_df_copy = feat_s2_df_copy.drop(['season', stat_to_regress, 'gamesPlayed'], axis=1)

        feat_s1_df = working_df.loc[working_df.season == feat_season_n1]
        feat_s1_df_copy = feat_s1_df.copy()
        feat_s1_df_copy[s1_label] = feat_s1_df[stat_to_regress]
        if feat_s1_df_copy.columns.isin(['season', stat_to_regress, 'gamesPlayed']).any():
            feat_s1_df_copy = feat_s1_df_copy.drop(['season', stat_to_regress, 'gamesPlayed'], axis=1)

        feat_s2_s3_df = feat_s2_df_copy.merge(feat_s3_df_copy[['player_id', s3_label]],
                      left_on='player_id', right_on='player_id')

        feat_all_df = feat_s1_df_copy.merge(feat_s2_s3_df[['player_id', s2_label, s3_label]],
                      left_on='player_id', right_on='player_id')

        toi_target = working_df.loc[working_df.season == target_season]
        toi_target_copy = toi_target.copy()
        toi_target_copy[target_label] = toi_target[stat_to_regress]
        
        if toi_target_copy.columns.isin(['season', stat_to_regress, 'gamesPlayed']).any():
            toi_target_copy = toi_target_copy.drop(['season', stat_to_regress, 'gamesPlayed'], axis=1)

        all_features_df = feat_all_df.merge(toi_target_copy[['player_id', target_label]],
                      left_on='player_id', right_on='player_id')

        feature_columns = [s1_label, s2_label, s3_label]
        predictor_column = [target_label]
        
        feature_z_columns = [s1_label + '_zscore', s2_label + '_zscore', s3_label + '_zscore']
        predictor_z_column = [target_label + '_zscore']
        
        # Add columns for normalized stat if normalize == true
        if (normalize == True):
            for col in feature_columns + predictor_column:
                col_zscore = col + '_zscore'
                all_features_df[col_zscore] = (all_features_df[col] - all_features_df[col].mean())/all_features_df[col].std(ddof=0)
        
        features_to_use = feature_columns
        predictor_to_use = predictor_column
        if (normalize == True):
            features_to_use = feature_z_columns
            predictor_to_use = predictor_z_column
            
        # Run linear regression
        X = all_features_df[features_to_use]
        y = all_features_df[predictor_to_use]
        
        regressor = LinearRegression()
        regressor.fit(X, y)
        y_pred = regressor.predict(X)
        
        # Get r2 score
        regression_r2_score = r2_score(y, y_pred)
        # Get intercept
        intercept = regressor.intercept_[0]
        # Get coefficients
        coef_array = np.array(regressor.coef_)
        
        # Append all values to results dataframe
        all_dict = { "target_season": target_season, coefficient_labels[2]: coef_array[0][2], coefficient_labels[1]: coef_array[0][1], coefficient_labels[0]: coef_array[0][0],"intercept": intercept, "r2_score": regression_r2_score }
        full_results_df = pd.concat([full_results_df, pd.DataFrame([all_dict])], ignore_index=True)
    
    return_dict = { "stat": stat_to_regress, coefficient_labels[2] : round(full_results_df[coefficient_labels[2]].mean(),3), coefficient_labels[1] : round(full_results_df[coefficient_labels[1]].mean(),3), coefficient_labels[0] : round(full_results_df[coefficient_labels[0]].mean(),3),
                   "intercept": round(full_results_df['intercept'].mean(),2), "r2_score": full_results_df['r2_score'].mean() }
    
    #return_df = return_df.append(return_dict, ignore_index=True)
    
    # Reorder column names
    #return_df = return_df.reindex(columns = ['stat', coefficient_labels[0], coefficient_labels[1], coefficient_labels[2], 'intercept', 'r2_score'])
    
    return return_dict

In [None]:
# Apply linear regression for toi per game
toi_per_game_linear_regression = regress_stat(skater_all_seasons_df, "timeOnIcePerGame", sample_seasons)

# Store the result in a dataframe; columns are ordered explicitly so the table below lines up with its headers
toi_regression_df = pd.DataFrame([toi_per_game_linear_regression],
                                 columns=coefficient_labels + ['intercept', 'r2_score', 'stat'])

# Create a list of coefficients returned
coefficients = [toi_per_game_linear_regression[coefficient_labels[0]], toi_per_game_linear_regression[coefficient_labels[1]], toi_per_game_linear_regression[coefficient_labels[2]]]

# Assign colors to the three coefficients using a list
toi_colors = ['#7da7ca', '#4682B4', '#315b7d']

# Bar chart of linear coefficients for TOI/Game
bar_toi = go.Bar(x=coefficient_labels, y=coefficients, text=coefficients, textposition='none', showlegend=False, marker_color=toi_colors, name='Bar')

# Pie chart of linear coefficients for TOI/Game
pie_toi = go.Pie(labels=coefficient_labels, values=coefficients, name='Pie', marker_colors=toi_colors)

table_toi = go.Table(
        header=dict(
            values=[coefficient_labels[0], coefficient_labels[1], coefficient_labels[2], "Intercept"],
            font=dict(size=10),
            align="left"
        ),
        cells=dict(
            values=[toi_regression_df[k].tolist() for k in coefficient_labels + ['intercept']],
            align = "left")
    )

# Create subplots
toi_plots = make_subplots(rows=2, cols=1, specs=[[{"type": "table"}], [{"type": "bar"}]], vertical_spacing = 0.001)
toi_plots.add_trace(table_toi, row=1, col=1)
toi_plots.add_trace(bar_toi, row=2, col=1)

# Update width, height, and title of subplots
toi_plots.update_layout(
            height=450,
            width=800,
            title={
                'text': "Linear Coefficients for Predicting TOI/Game for Skaters in Season n",
                'y':0.9,
                'x':0.5,
                'xanchor': 'center',
                'yanchor': 'top'})

In [None]:
toi_plots.show()

As shown in the graph and table above, season n-1 is the feature that holds the most weight when predicting a skater's time on ice per game for the next season. Season n-2 is about 1/8th of season n-1's value, and season n-3 has very close to zero influence on season n's projected time on ice.

We can simplify the coefficients by rounding to one decimal place (the season n-3 coefficient rounds to zero and drops out), which results in the following formula:
> **Projected Time on Ice per Game = 0.8(Season n-1) + 0.1(Season n-2) + 1.0**

Let's create a method that applies this formula and returns a projected time on ice value. As an example, we will calculate the projected time on ice per game for the 2020-2021 season for the New York Rangers' star forward, Artemi Panarin.

In [None]:
# Find the data for Artemi Panarin's last 3 seasons
panarin_df = skater_all_seasons_df.loc[skater_all_seasons_df.player_id == 8478550][['player_id', 'name', 'birthDate', 'position', 'season', 'timeOnIcePerGame', 'goals', 'assists', 'shots', 'gamesPlayed']].copy()
panarin_df = panarin_df.loc[panarin_df.season.isin(sample_seasons)]
panarin_df.head()

In [None]:
def get_projected_toi(toi_n1, toi_n2):
    return round((0.8*(toi_n1) + 0.1*(toi_n2) + 1),2)

In [None]:
panarin_proj_toi = get_projected_toi(panarin_df.iloc[0]['timeOnIcePerGame'], panarin_df.iloc[1]['timeOnIcePerGame'])

print("Artemi Panarin's projected TOI per game for the 2020-2021 season is: " + str(panarin_proj_toi) + " mins per game")
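As a quick sanity check of the formula, here is the arithmetic for a hypothetical skater. The TOI values below are made up for illustration and are not Panarin's actual figures.

```python
# Hypothetical TOI per game (minutes) for seasons n-1 and n-2; not real player data
toi_n1 = 20.0
toi_n2 = 19.0

# Projected TOI per game = 0.8 * (season n-1) + 0.1 * (season n-2) + 1.0
example_proj_toi = round(0.8 * toi_n1 + 0.1 * toi_n2 + 1.0, 2)
print(example_proj_toi)
```

Note that a skater who averaged the same TOI in both seasons is projected at 0.9 times that value plus one minute, so anyone above 10 minutes per game is nudged slightly downward.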

We'll perform a multivariate regression on goals, assists, and shots on goal as well, but this time we will normalize the metrics before regressing. This ensures that the best-fit formula has no intercept, so the coefficients alone describe how much weight each previous season carries.
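To illustrate why normalizing removes the intercept, here is a minimal sketch on synthetic data. It assumes that "normalize" means z-score standardization of both the features and the target; the `regress_stat` helper is defined earlier in the notebook and may normalize differently. The data, the `zscore` helper, and all variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Synthetic "true talent" plus season-to-season noise (hypothetical data)
talent = rng.normal(20, 8, n)
season_n1 = talent + rng.normal(0, 3, n)
season_n2 = talent + rng.normal(0, 5, n)
season_n3 = talent + rng.normal(0, 7, n)
target = talent + rng.normal(0, 3, n)

def zscore(a):
    # Center to mean 0 and scale to standard deviation 1 (column-wise for 2D input)
    return (a - a.mean(axis=0)) / a.std(axis=0)

features = zscore(np.column_stack([season_n1, season_n2, season_n3]))
target_z = zscore(target)

# Ordinary least squares with an explicit intercept column
design = np.column_stack([np.ones(n), features])
beta, *_ = np.linalg.lstsq(design, target_z, rcond=None)
intercept, coefs = beta[0], beta[1:]

# Share of the total feature weight held by each season
shares = coefs / coefs.sum()
print("intercept:", intercept)
print("shares:", shares)
```

Because every standardized column has a mean of zero, the fitted intercept is zero up to floating-point error, and the coefficients can be compared directly as shares of the total weight.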

In [None]:
# Create a dataframe to append to
goals_regression_df = pd.DataFrame()

# Apply linear regression for goals
goals_linear_regression = regress_stat(skater_all_seasons_df, "goals", sample_seasons, normalize=True)

# Append to goals_regression_df (DataFrame.append was removed in pandas 2.0)
goals_regression_df = pd.concat([goals_regression_df, pd.DataFrame([goals_linear_regression])], ignore_index=True)

# Create a list of coefficients returned
coefficients = [goals_linear_regression[coefficient_labels[0]], goals_linear_regression[coefficient_labels[1]], goals_linear_regression[coefficient_labels[2]]]

# Colors for the three coefficients
goals_colors = ['#93c993', '#66B266', '#477c47']

# Bar chart of linear coefficients for goals
bar_goals = go.Bar(x=coefficient_labels, y=coefficients, text=coefficients, textposition='auto', showlegend=False, marker_color=goals_colors, name='Bar')

# Pie chart of linear coefficients for goals
pie_goals = go.Pie(labels=coefficient_labels, values=coefficients, name='Pie', marker_colors=goals_colors)

# Create subplots
goals_plots = make_subplots(rows=1, cols=2, specs=[[{"type": "bar"}, {"type": "pie"}]])
goals_plots.add_trace(bar_goals, row=1, col=1)
goals_plots.add_trace(pie_goals, row=1, col=2)

# Update width, height, and title of subplots
goals_plots.update_layout(
            title={
                'text': "Weighting of Previous Performance for Predicting Goals for Skaters in Season n",
                'y':0.9,
                'x':0.5,
                'xanchor': 'center',
                'yanchor': 'top'})

In [None]:
goals_plots.show()

As shown in the graph and pie chart above, season n-1 is the feature that holds the most weight, making up 56.6% of the total feature weight. Season n-2 is about half of season n-1's value (28%), and season n-3 holds 15.3% of the total feature weight.

In the traditional MARCEL forecasting system, seasons n-1, n-2, and n-3 are given weights of 5, 4, and 3, respectively. We can similarly convert the weights above to whole numbers in order to simplify our calculations. Tom Tango himself has said that there is not much loss in precision when rounding these weights to whole numbers.

> **For goals, we'll use a weighting system of 6/3/1**
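The notebook does not state exactly how the percentage shares were converted to whole numbers. Plain rounding of 56.6/28/15.3 on a 10-point scale would give 6/3/2, which sums to 11. One way to land on weights that sum to exactly 10 is the largest-remainder method, sketched below as an assumption rather than the author's actual procedure. The function name `to_whole_weights` is hypothetical.

```python
import math

def to_whole_weights(shares, total=10):
    # Scale the shares so they sum to `total`
    scaled = [s * total / sum(shares) for s in shares]
    # Start from the floor of each scaled share
    weights = [math.floor(v) for v in scaled]
    leftover = total - sum(weights)
    # Hand the leftover units to the entries with the largest fractional parts
    by_remainder = sorted(range(len(scaled)), key=lambda i: scaled[i] - weights[i], reverse=True)
    for i in by_remainder[:leftover]:
        weights[i] += 1
    return weights

# Shares of total feature weight for goals, taken from the text above
goal_weights = to_whole_weights([56.6, 28.0, 15.3])
print(goal_weights)  # [6, 3, 1]
```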

As an example, let's calculate Artemi Panarin's projected goals for the 2020-2021 season using the following steps:
1. Calculate the weighted goals per game using our weighting system of 6/3/1
2. Calculate the league-wide rate of goals per minute for forwards in each season, and multiply it by Panarin's weighted TOI per game for that season. This gives the goals a league-average forward would be expected to score in the same ice time.
3. Find the league-average rate of goals per 30.46 min of TOI (30.46 is used as it is double 15.23, the average TOI for forwards over the last 3 seasons), add it to the weighted goals from step 1, and divide by the weighted TOI plus 30.46 to get a regressed rate of goals per minute. For defensemen, 39.3 is used instead (double 19.65, the average TOI for defensemen over the last 3 seasons).
4. Calculate Panarin's expected goals by multiplying the rate of goals per minute from step 3 by Panarin's projected TOI per game, and by the number of games in the season (56, in this case)
5. Apply the aging factor to Panarin's projected goals. Using the peak age for time on ice found in Section 2, we will slightly increase a player's projected stats (by 0.006 for each year below the peak age) and slightly decrease them (by 0.003 for each year above the peak age). The factors of 0.006 and 0.003 are used in the traditional MARCEL baseball forecasting system, and we'll stick with those for these projections.

Note: These steps will be reproduced to project Assists and Shots on Goal as well.
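Steps 1 through 4 amount to regressing a player's weighted scoring rate toward the league average. The sketch below walks through them with made-up numbers. Every input except the 6/3/1 weights, the 15.23 average TOI, and the 56-game schedule is hypothetical, and aging (step 5) is left out.

```python
# Hypothetical three-season line for a forward, most recent season first
goals = [30, 25, 20]
games_played = [80, 75, 70]
toi_per_game = [19.0, 18.0, 17.0]
league_goals_per_min = [0.012, 0.011, 0.011]  # hypothetical league rates

weights = [6, 3, 1]
avg_toi_fwd = 15.23     # average forward TOI per game, from the text
proj_toi = 18.0         # hypothetical projected TOI per game
games_to_play = 56

# Step 1: weighted goals per game
weighted_goals = sum(w * g / gp for w, g, gp in zip(weights, goals, games_played))

# Weighted TOI per game, using the same weights
weighted_toi = [w * t for w, t in zip(weights, toi_per_game)]
weighted_toi_sum = sum(weighted_toi)

# Step 2: goals a league-average forward would score in the same weighted ice time
league_expected = sum(t * r for t, r in zip(weighted_toi, league_goals_per_min))
league_rate = league_expected / weighted_toi_sum

# Step 3: add 2 * average TOI minutes of league-average scoring, then take the rate
baseline = 2 * avg_toi_fwd
regressed_rate = (weighted_goals + league_rate * baseline) / (weighted_toi_sum + baseline)

# Step 4: scale up to the projected ice time and schedule length
projected_goals = regressed_rate * proj_toi * games_to_play

raw_rate = weighted_goals / weighted_toi_sum
print(round(raw_rate, 4), round(league_rate, 4), round(regressed_rate, 4))
```

The regressed rate always lands between the player's own weighted rate and the league rate. The fewer minutes a player has logged, the closer it sits to the league average.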

In [None]:
# Step 1: Calculate weighted goals/GP
def get_weighted_stat(stat_n1, stat_n2, stat_n3, gp_n1, gp_n2, gp_n3, weight_n1, weight_n2, weight_n3):
    return weight_n1*(stat_n1/gp_n1) + weight_n2*(stat_n2/gp_n2) + weight_n3*(stat_n3/gp_n3)

# Step 2: Calculate league-wide goals per minute for each season
def get_league_stat_per_min(stat_total, toi_total):
    return stat_total/toi_total

def get_weighted_stat_per_min(toi_n1, toi_n2, toi_n3, league_n1, league_n2, league_n3):
    return (toi_n1*league_n1) + (toi_n2*league_n2) + (toi_n3*league_n3)

# Step 3 and 4: Calculate projected stat
def get_projected_stat(weighted_stat, weighted_stat_per_min, weighted_toi_sum, avg_toi, proj_toi, gp):
    toi_baseline = 2*avg_toi
    stat_avg_rate = (weighted_stat_per_min / weighted_toi_sum) * (toi_baseline)
    stat_per_min = (weighted_stat + stat_avg_rate) / (toi_baseline + weighted_toi_sum)
    return stat_per_min*proj_toi*gp

def project_stat(stat, player_slice_df, skaters_df, projected_toi, mean_toi_position, gamesToPlay, weight_n1, weight_n2, weight_n3, birth_date, season_start, seasons):
    weighted_stat = get_weighted_stat(player_slice_df.iloc[0][stat], player_slice_df.iloc[1][stat], 
                                               player_slice_df.iloc[2][stat], player_slice_df.iloc[0]['gamesPlayed'],
                                               player_slice_df.iloc[1]['gamesPlayed'], player_slice_df.iloc[2]['gamesPlayed'],
                                               weight_n1, weight_n2, weight_n3)

    league_stat_per_min_n1 = get_league_stat_per_min(skaters_df.loc[skaters_df.season == seasons[0]][stat].sum(),
                                                      skaters_df.loc[skaters_df.season == seasons[0]]['timeOnIce'].sum())

    league_stat_per_min_n2 = get_league_stat_per_min(skaters_df.loc[skaters_df.season == seasons[1]][stat].sum(),
                                                      skaters_df.loc[skaters_df.season == seasons[1]]['timeOnIce'].sum())

    league_stat_per_min_n3 = get_league_stat_per_min(skaters_df.loc[skaters_df.season == seasons[2]][stat].sum(),
                                                      skaters_df.loc[skaters_df.season == seasons[2]]['timeOnIce'].sum())

    weighted_toi_n1 = weight_n1*(player_slice_df.iloc[0]['timeOnIcePerGame'])
    weighted_toi_n2 = weight_n2*(player_slice_df.iloc[1]['timeOnIcePerGame'])
    weighted_toi_n3 = weight_n3*(player_slice_df.iloc[2]['timeOnIcePerGame'])
    weighted_toi_sum = weighted_toi_n1 + weighted_toi_n2 + weighted_toi_n3

    weighted_stat_per_min = get_weighted_stat_per_min(weighted_toi_n1, weighted_toi_n2, weighted_toi_n3,
                                                      league_stat_per_min_n1, league_stat_per_min_n2, league_stat_per_min_n3)

    proj_stat =round(get_projected_stat(weighted_stat, weighted_stat_per_min, weighted_toi_sum, mean_toi_position, projected_toi, gamesToPlay))
    
    # apply aging
    birth_date_dt = datetime.datetime.strptime(birth_date, '%Y-%m-%d %H:%M:%S')
    age = relativedelta(season_start, birth_date_dt).years
    
    proj_stat = apply_aging(proj_stat, age)
    
    return int(proj_stat)

def apply_aging(proj_stat, age):
    # Boost by 0.6% per year below the peak age of 29; reduce by 0.3% per year above it
    if age <= 29:
        return (1 + (29 - age)*0.006) * proj_stat
    else:
        return (1 + (29 - age)*0.003) * proj_stat
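To make the aging adjustment concrete, the sketch below computes the multiplier on its own for a few ages. The helper name `aging_multiplier` and its keyword arguments are hypothetical; the peak age of 29 and the 0.006/0.003 factors are the ones used in `apply_aging`.

```python
PEAK_AGE = 29

def aging_multiplier(age, peak_age=PEAK_AGE, young_rate=0.006, old_rate=0.003):
    # Players at or below the peak age get a boost per year; older players get a smaller reduction
    rate = young_rate if age <= peak_age else old_rate
    return 1 + (peak_age - age) * rate

for age in (25, 29, 33):
    print(age, round(aging_multiplier(age), 3))
```

A 25-year-old is adjusted up by about 2.4%, a 29-year-old is left unchanged, and a 33-year-old is adjusted down by about 1.2%.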

In [None]:
panarin_birthdate = panarin_df.iloc[0]['birthDate']

panarin_proj_goals = project_stat('goals', panarin_df, working_forwards_df, panarin_proj_toi, mean_toi_fwd, GAMES_TO_PLAY, 6, 3, 1, panarin_birthdate, SEASON_START_DATE, sample_seasons)
print("Panarin's projected goals for 2020-2021 season is " + str(panarin_proj_goals) + " goals in " + str(GAMES_TO_PLAY) + " games.")

In [None]:
# Create a dataframe to append to
assists_regression_df = pd.DataFrame()

# Apply linear regression for assists
assists_linear_regression = regress_stat(skater_all_seasons_df, "assists", sample_seasons, normalize=True)

# Append to assists_regression_df (DataFrame.append was removed in pandas 2.0)
assists_regression_df = pd.concat([assists_regression_df, pd.DataFrame([assists_linear_regression])], ignore_index=True)

# Create a list of coefficients returned
coefficients = [assists_linear_regression[coefficient_labels[0]], assists_linear_regression[coefficient_labels[1]], assists_linear_regression[coefficient_labels[2]]]

# Colors for the three coefficients
assists_colors = ['#dc8c8c', 'indianred', '#8f4040']

# Bar chart of linear coefficients for assists
bar_assists = go.Bar(x=coefficient_labels, y=coefficients, text=coefficients, textposition='auto', showlegend=False, marker_color=assists_colors, name='Bar')

# Pie chart of linear coefficients for assists
pie_assists = go.Pie(labels=coefficient_labels, values=coefficients, name='Pie', marker_colors=assists_colors)

# Create subplots
assists_plots = make_subplots(rows=1, cols=2, specs=[[{"type": "bar"}, {"type": "pie"}]])
assists_plots.add_trace(bar_assists, row=1, col=1)
assists_plots.add_trace(pie_assists, row=1, col=2)

# Update width, height, and title of subplots
assists_plots.update_layout(
            title={
                'text': "Weighting of Previous Performance for Predicting Assists for Skaters in Season n",
                'y':0.9,
                'x':0.5,
                'xanchor': 'center',
                'yanchor': 'top'})

In [None]:
assists_plots.show()

As shown in the graph and pie chart above, season n-1 is the feature that holds the most weight, making up 63% of the total feature weight. Season n-2 holds 26%, and season n-3 holds 11% of the total feature weight.

> **Let's use the 6/3/1 weighting system to project assists as well.**

In [None]:
panarin_proj_assists = project_stat('assists', panarin_df, working_forwards_df, panarin_proj_toi, mean_toi_fwd, GAMES_TO_PLAY, 6, 3, 1, panarin_birthdate, SEASON_START_DATE, sample_seasons)
print("Panarin's projected assists for 2020-2021 season is " + str(panarin_proj_assists) + " assists in " + str(GAMES_TO_PLAY) + " games.")

In [None]:
# Create a dataframe to append to
shots_regression_df = pd.DataFrame()

# Apply linear regression for shots
shots_linear_regression = regress_stat(skater_all_seasons_df, "shots", sample_seasons, normalize=True)

# Append to shots_regression_df (DataFrame.append was removed in pandas 2.0)
shots_regression_df = pd.concat([shots_regression_df, pd.DataFrame([shots_linear_regression])], ignore_index=True)

# Create a list of coefficients returned
coefficients = [shots_linear_regression[coefficient_labels[0]], shots_linear_regression[coefficient_labels[1]], shots_linear_regression[coefficient_labels[2]]]

# Colors for the three coefficients
shots_colors = ['#d2a5d2', '#bf7fbf', '#855885']

# Bar chart of linear coefficients for shots
bar_shots = go.Bar(x=coefficient_labels, y=coefficients, text=coefficients, textposition='auto', showlegend=False, marker_color=shots_colors, name='Bar')

# Pie chart of linear coefficients for shots
pie_shots = go.Pie(labels=coefficient_labels, values=coefficients, name='Pie', marker_colors=shots_colors)

# Create subplots
shots_plots = make_subplots(rows=1, cols=2, specs=[[{"type": "bar"}, {"type": "pie"}]])
shots_plots.add_trace(bar_shots, row=1, col=1)
shots_plots.add_trace(pie_shots, row=1, col=2)

# Update width, height, and title of subplots
shots_plots.update_layout(
            title={
                'text': "Weighting of Previous Performance for Predicting Shots for Skaters in Season n",
                'y':0.9,
                'x':0.5,
                'xanchor': 'center',
                'yanchor': 'top'})

In [None]:
shots_plots.show()

When it comes to shots on goal, season n-1 holds slightly more of the total feature weight than it did for goals and assists, while season n-3 holds close to zero of the total feature weight. Season n-2 holds a similar weight as it did for goals and assists.

> **In this case, we'll use a 7/3/0 weighting system to project shots on goal.**

In [None]:
panarin_proj_shots = project_stat('shots', panarin_df, working_forwards_df, panarin_proj_toi, mean_toi_fwd, GAMES_TO_PLAY, 7, 3, 0, panarin_birthdate, SEASON_START_DATE, sample_seasons)
print("Panarin's projected shots for 2020-2021 season is " + str(panarin_proj_shots) + " shots in " + str(GAMES_TO_PLAY) + " games.")

panarin_projections_dict = {
    "player_id": panarin_df.iloc[0]['player_id'],
    "name": panarin_df.iloc[0]['name'],
    "timeOnIcePerGame": panarin_proj_toi,
    "goals": panarin_proj_goals,
    "assists": panarin_proj_assists,
    "points": panarin_proj_assists + panarin_proj_goals,
    "shots": panarin_proj_shots
}

panarin_projections_df = pd.DataFrame([panarin_projections_dict])

panarin_projections_df['player_id'] = panarin_projections_df['player_id'].values.astype(int)
panarin_projections_df['goals'] = panarin_projections_df['goals'].values.astype(int)
panarin_projections_df['assists'] = panarin_projections_df['assists'].values.astype(int)
panarin_projections_df['points'] = panarin_projections_df['points'].values.astype(int)
panarin_projections_df['shots'] = panarin_projections_df['shots'].values.astype(int)

# Reorder column names
panarin_projections_df = panarin_projections_df.reindex(columns = ['player_id', 'name', 'goals', 'assists', 'points', 'shots', 'timeOnIcePerGame'])

Due to the global pandemic, the 2020-2021 season was shortened to 56 games (as opposed to the usual 82 games). This is what Artemi Panarin's projected stats look like for this 56-game season.

In [None]:
panarin_projections_df

Great! We've built our model and used it to project Artemi Panarin's metrics for the 2020-2021 NHL season. We can now apply it to all eligible skaters to get their projected time on ice, goals, assists, and shots on goal.

----------------------------------------------------------------------------------------------------------------------------------------------
<a id ="4" > </a>
# 4. Project and analyze results

In this section we will use the model defined in section 3 in order to project TOI per game, goals, assists, points, and shots on goal for all eligible skaters for the 2020-2021 season. (Reminder that a skater is only eligible to be projected by this model if they've played at least 15 games in each of the last 3 seasons).
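The eligibility rule can also be expressed as a single pandas filter. The sketch below uses a tiny made-up table and assumes, as in the notebook's working dataframes, that only seasons meeting the 15-game minimum appear as rows. The data and variable names are hypothetical.

```python
import pandas as pd

# Hypothetical season rows: player 1 has all three seasons, player 2 only two
window_seasons = [20192020, 20182019, 20172018]
rows = pd.DataFrame({
    "player_id": [1, 1, 1, 2, 2],
    "season": [20192020, 20182019, 20172018, 20192020, 20182019],
    "goals": [30, 25, 20, 10, 12],
})

# Keep only rows inside the three-season window
in_window = rows[rows["season"].isin(window_seasons)]

# Keep only players who have a row for every season in the window
eligible = in_window.groupby("player_id").filter(
    lambda g: g["season"].nunique() == len(window_seasons)
)
eligible_ids = sorted(eligible["player_id"].unique().tolist())
print(eligible_ids)  # [1]
```

The loops below reach the same result by checking `player_slice.shape[0] == 3` for each player in turn.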

In [None]:
# Initialize new dataframe to return with all projections
projections_df = pd.DataFrame(columns=['player_id', 'name','position','goals','assists','points','shots', 'timeOnIcePerGame'])

# Forwards and defensemen to be projected
forwards_to_project_df = working_forwards_df.loc[working_forwards_df.season.isin(sample_seasons)]
defensemen_to_project_df = working_defensemen_df.loc[working_defensemen_df.season.isin(sample_seasons)]

# Project forwards
for player_id, birthDate, name, position in zip(forwards_to_project_df['player_id'], forwards_to_project_df['birthDate'], forwards_to_project_df['name'], forwards_to_project_df['position']):
    if projections_df["player_id"].isin([player_id]).any():
        continue
    else:
        # Get slice of player data
        player_slice = forwards_to_project_df.loc[forwards_to_project_df.player_id == player_id]
        
        # Only project for players who have met the gp requirements in all 3 previous seasons
        number_of_seasons_played = player_slice.shape[0]
        
        if(number_of_seasons_played == 3):
            # Project toi, goals, assists, shots
            proj_toi = get_projected_toi(player_slice.iloc[0]['timeOnIcePerGame'], player_slice.iloc[1]['timeOnIcePerGame'])
            proj_goals = project_stat('goals', player_slice, forwards_to_project_df, proj_toi, mean_toi_fwd, GAMES_TO_PLAY, 6, 3, 1, birthDate, SEASON_START_DATE, sample_seasons)
            proj_assists = project_stat('assists', player_slice, forwards_to_project_df, proj_toi, mean_toi_fwd, GAMES_TO_PLAY, 6, 3, 1, birthDate, SEASON_START_DATE, sample_seasons)
            proj_shots = project_stat('shots', player_slice, forwards_to_project_df, proj_toi, mean_toi_fwd, GAMES_TO_PLAY, 7, 3, 0, birthDate, SEASON_START_DATE, sample_seasons)
            
            # Create dictionary to append to df
            projected_stats = { "player_id": player_id, "name": name, "position": position, "goals": proj_goals,
                               "assists": proj_assists, "points": proj_goals+proj_assists, "shots": proj_shots, "timeOnIcePerGame": proj_toi }
            # Append player's projections to dataframe
            projections_df = pd.concat([projections_df, pd.DataFrame([projected_stats])], ignore_index=True)

# Project defensemen
for player_id, birthDate, name, position in zip(defensemen_to_project_df['player_id'], defensemen_to_project_df['birthDate'], defensemen_to_project_df['name'], defensemen_to_project_df['position']):
    if projections_df["player_id"].isin([player_id]).any():
        continue
    else:
        # Get slice of player data
        player_slice = defensemen_to_project_df.loc[defensemen_to_project_df.player_id == player_id]
        
        # Only project for players who have met the gp requirements in all 3 previous seasons
        number_of_seasons_played = player_slice.shape[0]
        
        if(number_of_seasons_played == 3):
            # Project toi, goals, assists, shots
            proj_toi = get_projected_toi(player_slice.iloc[0]['timeOnIcePerGame'], player_slice.iloc[1]['timeOnIcePerGame'])
            proj_goals = project_stat('goals', player_slice, defensemen_to_project_df, proj_toi, mean_toi_def, GAMES_TO_PLAY, 6, 3, 1, birthDate, SEASON_START_DATE, sample_seasons)
            proj_assists = project_stat('assists', player_slice, defensemen_to_project_df, proj_toi, mean_toi_def, GAMES_TO_PLAY, 6, 3, 1, birthDate, SEASON_START_DATE, sample_seasons)
            proj_shots = project_stat('shots', player_slice, defensemen_to_project_df, proj_toi, mean_toi_def, GAMES_TO_PLAY, 7, 3, 0, birthDate, SEASON_START_DATE, sample_seasons)
            
            # Create dictionary to append to df
            projected_stats = { "player_id": player_id, "name": name, "position": position, "goals": proj_goals,
                               "assists": proj_assists, "points": proj_goals+proj_assists, "shots": proj_shots, "timeOnIcePerGame": proj_toi }
            # Append player's projections to dataframe
            projections_df = pd.concat([projections_df, pd.DataFrame([projected_stats])], ignore_index=True)

projections_df = projections_df.sort_values(['points'], ascending = [False])

# Output projections
projections_df.reset_index(drop=True).to_csv('projections_' + str(sample_seasons[0] + ITERATOR) + '.csv',index=False)

Here are the top 25 projected skaters, sorted by most projected points. The full projection results can be found in .csv format in the output section of the notebook.

In [None]:
projections_df.reset_index(drop=True).head(25)

If we want to test the accuracy of these projections, we can use the model to project previous seasons and compare those projections to the actual results. Let's project the 2018-2019 season and compare against how skaters actually performed by calculating the mean absolute error (MAE) and R² values. When calculating these values, we will only consider players who played at least half of the 2018-2019 season, and we will pro-rate all metrics to the full 82 games in order to reduce some of the variance created by missed games from injuries or other events.
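For reference, here is a small sketch of the pro-rating step and the two metrics computed by hand, using the standard definitions of MAE (the mean of the absolute differences) and R² (one minus the ratio of residual to total sum of squares). All numbers are made up for illustration.

```python
# Hypothetical actual goals, games played, and projected goals for three skaters
actual_goals = [20, 10, 30]
games_played = [41, 82, 60]
projected_goals = [35, 12, 38]

# Pro-rate actual totals to a full 82-game season
prorated = [g / gp * 82 for g, gp in zip(actual_goals, games_played)]

# Mean absolute error: average size of the miss, in goals
mae = sum(abs(a - p) for a, p in zip(prorated, projected_goals)) / len(prorated)

# R^2: 1 - (residual sum of squares / total sum of squares)
mean_actual = sum(prorated) / len(prorated)
ss_res = sum((a - p) ** 2 for a, p in zip(prorated, projected_goals))
ss_tot = sum((a - mean_actual) ** 2 for a in prorated)
r2 = 1 - ss_res / ss_tot

print(round(mae, 2), round(r2, 2))
```

MAE is expressed in the same units as the stat itself, so it is larger for high-volume stats such as shots, while R² is unitless and can be compared across stats.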

In [None]:
# Initialize new dataframe to return with all projections
projections_2018_df = pd.DataFrame(columns=['player_id', 'name','position','goals','assists','points','shots', 'timeOnIcePerGame'])

target_season_df = skater_all_seasons_df.loc[skater_all_seasons_df.season == 20182019]
feature_seasons = [20172018, 20162017, 20152016]

# Forwards and defensemen to be projected
forwards_to_project_2018_df = working_forwards_df.loc[working_forwards_df.season.isin(feature_seasons)]
defensemen_to_project_2018_df = working_defensemen_df.loc[working_defensemen_df.season.isin(feature_seasons)]

start_date_2018 = datetime.datetime(2018, 10, 1)

# Project forwards
for player_id, birthDate, name, position in zip(forwards_to_project_2018_df['player_id'], forwards_to_project_2018_df['birthDate'], forwards_to_project_2018_df['name'], forwards_to_project_2018_df['position']):
    if projections_2018_df["player_id"].isin([player_id]).any():
        continue
    else:
        # Get slice of player data
        player_slice = forwards_to_project_2018_df.loc[forwards_to_project_2018_df.player_id == player_id]
        
        # Only project for players who have met the gp requirements in all 3 previous seasons
        number_of_seasons_played = player_slice.shape[0]
        
        if(number_of_seasons_played == 3):
            # Project toi, goals, assists, shots
            proj_toi = get_projected_toi(player_slice.iloc[0]['timeOnIcePerGame'], player_slice.iloc[1]['timeOnIcePerGame'])           
            proj_goals = project_stat('goals', player_slice, forwards_to_project_2018_df, proj_toi, mean_toi_fwd, 82, 6, 3, 1, birthDate, start_date_2018, feature_seasons)
            proj_assists = project_stat('assists', player_slice, forwards_to_project_2018_df, proj_toi, mean_toi_fwd, 82, 6, 3, 1, birthDate, start_date_2018, feature_seasons)                
            proj_shots = project_stat('shots', player_slice, forwards_to_project_2018_df, proj_toi, mean_toi_fwd, 82, 7, 3, 0, birthDate, start_date_2018, feature_seasons)
            
            # Create dictionary to append to df
            projected_stats = { "player_id": player_id, "name": name, "position": position, "goals": proj_goals,
                               "assists": proj_assists, "points": proj_goals+proj_assists, "shots": proj_shots, "timeOnIcePerGame": proj_toi }
            # Append player's projections to dataframe
            projections_2018_df = pd.concat([projections_2018_df, pd.DataFrame([projected_stats])], ignore_index=True)

# Project defensemen
for player_id, birthDate, name, position in zip(defensemen_to_project_2018_df['player_id'], defensemen_to_project_2018_df['birthDate'], defensemen_to_project_2018_df['name'], defensemen_to_project_2018_df['position']):
    if projections_2018_df["player_id"].isin([player_id]).any():
        continue
    else:
        # Get slice of player data
        player_slice = defensemen_to_project_2018_df.loc[defensemen_to_project_2018_df.player_id == player_id]
        
        # Only project for players who have met the gp requirements in all 3 previous seasons
        number_of_seasons_played = player_slice.shape[0]
        
        if(number_of_seasons_played == 3):
            # Project toi, goals, assists, shots
            proj_toi = get_projected_toi(player_slice.iloc[0]['timeOnIcePerGame'], player_slice.iloc[1]['timeOnIcePerGame'])
            proj_goals = project_stat('goals', player_slice, defensemen_to_project_2018_df, proj_toi, mean_toi_def, 82, 6, 3, 1, birthDate, start_date_2018, feature_seasons)
            proj_assists = project_stat('assists', player_slice, defensemen_to_project_2018_df, proj_toi, mean_toi_def, 82, 6, 3, 1, birthDate, start_date_2018, feature_seasons)
            proj_shots = project_stat('shots', player_slice, defensemen_to_project_2018_df, proj_toi, mean_toi_def, 82, 7, 3, 0, birthDate, start_date_2018, feature_seasons)
            
            # Create dictionary to append to df
            projected_stats = { "player_id": player_id, "name": name, "position": position, "goals": proj_goals,
                               "assists": proj_assists, "points": proj_goals+proj_assists, "shots": proj_shots, "timeOnIcePerGame": proj_toi }
            # Append player's projections to dataframe
            projections_2018_df = pd.concat([projections_2018_df, pd.DataFrame([projected_stats])], ignore_index=True)


Let's compare the top 5 point-scorers of the 2018-2019 season with our model's top 5 *projected* point-scorers for that season.

**Projected 2018-2019 Statistics**

In [None]:
projections_2018_df = projections_2018_df.sort_values(['points'], ascending = [False])
projections_2018_df.reset_index(drop=True).head()

**Actual 2018-2019 Statistics**

In [None]:
target_season_df = target_season_df.sort_values(['points'], ascending = [False])
target_season_df[['player_id', 'name', 'position', 'goals', 'assists', 'points', 'shots']].reset_index(drop=True).head()

At a glance, our model looks a little conservative on assists and points for the top 5. However, it's worth noting that [scoring increased by 0.2 goals per game in the 2017-2018 season](https://www.hockey-reference.com/leagues/stats.html), after having been quite steady for the 8 or 9 seasons before that. Unexpectedly large increases in scoring can reduce the accuracy of our model, as it relies entirely on past performance.

Now let's calculate and visualize our error values.

In [None]:
gp_threshold = 41

# Remove missing players, and only consider players who played at least half of the season
error_target_season_df = target_season_df.loc[target_season_df.player_id.isin(projections_2018_df['player_id']) & (target_season_df.gamesPlayed >= gp_threshold)]
error_projections_2018_df = projections_2018_df.loc[projections_2018_df.player_id.isin(error_target_season_df['player_id'])]

# Sample size, season, and gp
sample_size = 'n = ' + str(len(error_target_season_df.index))
season = 'season: ' + str(20182019)
games_played = 'gp threshold: ' + str(gp_threshold)

categories = pd.Series(['goals', 'assists', 'points', 'shots', sample_size, season, games_played ])
MAE = pd.Series([], dtype=float)
R2 = pd.Series([], dtype=float)
error_df = pd.DataFrame(columns=['Category', 'R2', 'MAE'])

error_projections_2018_df = error_projections_2018_df.sort_values(['player_id'], ascending = [False])
error_target_season_df = error_target_season_df.sort_values(['player_id'], ascending = [False])

# Pro rate stats for target season to compare to projections
target_skater_prorated_df = error_target_season_df.copy()
target_skater_prorated_df['goals'] = error_target_season_df['goals']/error_target_season_df['gamesPlayed']*82
target_skater_prorated_df['assists'] = error_target_season_df['assists']/error_target_season_df['gamesPlayed']*82
target_skater_prorated_df['points'] = error_target_season_df['points']/error_target_season_df['gamesPlayed']*82
target_skater_prorated_df['shots'] = error_target_season_df['shots']/error_target_season_df['gamesPlayed']*82

# Calculate error for goals
MAE[0] = mean_absolute_error(target_skater_prorated_df['goals'], error_projections_2018_df['goals'])
R2[0] = r2_score(target_skater_prorated_df['goals'], error_projections_2018_df['goals'])

# Calculate error for assists
MAE[1] = mean_absolute_error(target_skater_prorated_df['assists'], error_projections_2018_df['assists'])
R2[1] = r2_score(target_skater_prorated_df['assists'], error_projections_2018_df['assists'])

# Calculate error for pts
MAE[2] = mean_absolute_error(target_skater_prorated_df['points'], error_projections_2018_df['points'])
R2[2] = r2_score(target_skater_prorated_df['points'], error_projections_2018_df['points'])

# Calculate error for shots
MAE[3] = mean_absolute_error(target_skater_prorated_df['shots'], error_projections_2018_df['shots'])
R2[3] = r2_score(target_skater_prorated_df['shots'], error_projections_2018_df['shots'])

# Create dataframe from series and output
error_frame = { 'Category': categories, 'R²': R2, 'MAE': MAE }
error_df = pd.DataFrame(error_frame)

# Round errors to 2 decimal places
error_df = error_df.round(decimals=2)
display(error_df)
#error_df.to_csv('projection-error-' + str(20182019) + '.csv',index=False)

As demonstrated above, with a sample size of 369 players from the 2018-2019 season, this model performs well when it comes to predicting goals and points.
An MAE below 8 for assists is reasonable, considering that secondary assists can be quite random and are hard to repeat consistently.
Our projections for shots have an R² value of 0.59, which is acceptable, but the MAE looks quite high when compared to the other metrics, indicating that there is a fair amount of variance in shots on goal between players. Hockey, like all sports, inevitably contains a certain amount of randomness. It's impossible to perfectly predict how many shots every player will take in a season, so being within 30 shots on average is a decent start for a simple model.

Overall, the MARCEL model performs quite well when applied to hockey instead of baseball, and could provide a good starting point for building your own model. It is worth noting that this model would not work as well for players who do not have three seasons' worth of data to draw from. However, with a few tweaks, a model based on the MARCEL forecasting system could provide accurate projections for the majority of skaters.

----------------------------------------------------------------------------------------------------------------------------------------------
<a id ="5" > </a>
# 5. References

The following works were referenced throughout the creation of this notebook:
* [Tom Tango's Introduction to Marcel](http://www.tangotiger.net/archives/stud0346.shtml)
* [Using Marcels to forecast player performance in hockey](https://ownthepuck.wordpress.com/2015/09/05/using-marcels-to-forecast-player-performance-in-hockey/)
* [Marcels - The Hockey Edition](https://hfboards.mandatory.com/threads/marcels-the-hockey-edition.2775884/)
* [Beyond The Box Score: A guide to the projection systems](https://www.beyondtheboxscore.com/2016/2/22/11079186/projections-marcel-pecota-zips-steamer-explained-guide-math-is-fun)
* [FanGraphs Prep: Build and Test Your Own Projection System](https://blogs.fangraphs.com/fangraphs-prep-build-and-test-your-own-projection-system/)
* [Martin Ellis' NHL Game Data dataset](https://www.kaggle.com/martinellis/nhl-game-data)
* [DatsyukToZetterberg's 2021 Fantasy Hockey Projections](https://www.reddit.com/r/fantasyhockey/comments/kc3amu/datsyuktozetterbergs_2021_fantasy_hockey/) (The author of this Reddit post was kind enough to reply to my many questions, and was extremely helpful and informative)
* [NHL League Statistics](https://www.hockey-reference.com/leagues/stats.html)

**If you enjoyed this notebook, please hit the upvote button! Feel free to provide feedback or suggestions in the comment section below.**