<h1>Final Capstone Project - NHL Salary Predictor</h1>

**Introduction**

The NHL (National Hockey League) is one of the most exciting sports leagues in North America. Its popularity has grown steadily over the past few years in both local and international markets. Led by a group of talented, exciting young players, the league has never been more popular, and its revenues have increased dramatically as a result. One of each franchise's biggest challenges is assembling a team that can compete for the Stanley Cup under a salary cap, so player evaluation and contract negotiation have become crucial tasks for management teams.
This project looks at some of the player stats that may be key in determining a player's value. Using multiple linear regression, it offers a high-level view into the world of sports economics.


**Data**

Player data for the NHL 2018/2019 season was downloaded from: https://www.hockey-reference.com/leagues/NHL_2019_skaters.html
Both basic stats and a few advanced metrics were retrieved. I chose ... stats that I felt were important in evaluating a player's performance. Obviously, this is a very simplified approach; in real life, the metrics would be much more complex.

The salary numbers were retrieved from https://www.spotrac.com/nhl/rankings/ for the 2018/2019 season.

The two datasets were joined to form a complete set of data to work with. 
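
Joining on player names can silently lose rows when a name is spelled differently in the two sources. The sketch below is one possible way to check for that; it uses made-up toy tables (`toy_stats`, `toy_salary`) rather than the project's files, and relies on the `indicator` option of `pd.merge`.

```python
import pandas as pd

# Toy stand-ins for the stats and salary tables (all values are made up).
toy_stats = pd.DataFrame({"Player": ["Player A", "Player B", "Player C"],
                          "PTS": [60, 25, 10]})
toy_salary = pd.DataFrame({"Player": ["Player A", "Player B"],
                           "Salary": [5_000_000, 900_000]})

# indicator=True adds a "_merge" column recording where each row came from.
joined = pd.merge(toy_stats, toy_salary, how="left", on="Player", indicator=True)

# Rows marked "left_only" have stats but no matching salary record.
unmatched = joined.loc[joined["_merge"] == "left_only", "Player"].tolist()
print(unmatched)
```

Listing the unmatched names before dropping them makes it easier to tell real gaps in the salary data from simple spelling differences.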


In [2]:
import numpy as np # library to handle data in a vectorized manner
import pandas as pd # library for data analysis
import requests
from sklearn import linear_model
from sklearn.preprocessing import PolynomialFeatures

import matplotlib.pyplot as plt

print ("done")

done


In [87]:
stats = pd.read_csv ('Stats.csv') 
salary = pd.read_csv ('Salary_1.csv') 
advance=pd.read_csv('AdvanceStats.csv')

Several basic adjustments were made:
* Players who played on multiple teams during the season had their stats combined
* Players who played 25 or fewer games were excluded to avoid potential outliers
* Only forwards were included, since the raw stats and evaluation criteria for defensemen are different

Some of the stats include:
* GP = games played
* G = goals
* A = assists
* PTS = points (G+A)
* EVG = even-strength goals
* PPG = power-play goals
* SHG = shorthanded goals
* CF% = Corsi For % at even strength (more than 50% means the team had more possession of the puck while the player was on the ice)
* oiSH% = team on-ice shooting %
* TK = takeaways
* FO% = face-off %
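
As a rough illustration of how the rate stats relate to the raw counts, here is a small sketch with hypothetical season totals. The helper `share_pct` and all numbers are made up for illustration. CF% is downloaded from the source rather than computed in this project; the formula used here is the usual definition (shot attempts for, divided by all shot attempts while the player is on the ice).

```python
def share_pct(won, lost):
    """Percentage share of `won` out of `won + lost`; 0.0 when both are zero."""
    total = won + lost
    return round(100 * won / total, 2) if total else 0.0

# Hypothetical season totals for one forward (not taken from the dataset).
g, a, gp, shots = 20, 30, 82, 200
fow, fol = 300, 200   # face-offs won / lost
cf, ca = 1100, 900    # shot attempts for / against while on the ice

pts = g + a                           # PTS = G + A
points_per_game = round(pts / gp, 2)  # points per game played
shot_pct = round(100 * g / shots, 2)  # goals as a share of shots
fo_pct = share_pct(fow, fol)          # FO%
cf_pct = share_pct(cf, ca)            # CF%

print(pts, points_per_game, shot_pct, fo_pct, cf_pct)
```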

In [88]:
stats = stats[stats.Pos != 'D']  # keep forwards only
# Combine the rows of players who appeared for more than one team
sum_cols = ['GP','G','A','PTS','+/-','EVG','PPG','SHG','GWG','EVA','PPA','SHA','S','BLK','HIT','FOW','FOL']
stats1 = stats.groupby(['Player','Age'], as_index=False).agg({col: 'sum' for col in sum_cols})
# Keep only the part of the Player field before the first backslash
stats1["Player"] = stats1["Player"].str.split('\\', expand=True)[0]
advance["Player"] = advance["Player"].str.split('\\', expand=True)[0]
# Left-join the salary and advanced stats onto the combined basic stats
stats_combo = pd.merge(left=stats1, right=salary, how='left', left_on='Player', right_on='Player')
stats_combo = pd.merge(left=stats_combo, right=advance, how='left', left_on='Player', right_on='Player')
stats_combo.dropna(subset=['Salary'], inplace=True)  # drop players that have no salary info

stats_combo = stats_combo[stats_combo.GP > 25]  # keep players who played more than 25 games
stats_combo["Shot%"]=round(100*(stats_combo["G"]/stats_combo["S"]),2)
stats_combo["PointsPerGame"]=round(stats_combo["PTS"]/stats_combo["GP"],2)
stats_combo["FO%"]=round(100*(stats_combo["FOW"]/(stats_combo["FOW"]+stats_combo["FOL"])),2).fillna(0)
stats_combo=stats_combo.reset_index(drop=True)
stats_combo

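| | Player | Age | GP | G | A | PTS | +/- | EVG | PPG | SHG | ... | FOL | Salary | CF% | FF% | oiSH% | TOI/60 | TK | Shot% | PointsPerGame | FO% |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Adam Erne | 23 | 65 | 7 | 13 | 20 | 10 | 5 | 2 | 0 | ... | 12 | 800000.0 | 48.2 | 48.2 | 10.0 | 10:33 | 16 | 10.00 | 0.31 | 47.83 |
| 1 | Adam Gaudette | 22 | 56 | 5 | 7 | 12 | -8 | 5 | 0 | 0 | ... | 221 | 916666.0 | 47.0 | 46.7 | 6.9 | 10:57 | 10 | 9.09 | 0.21 | 40.43 |
| 2 | Adam Henrique | 28 | 82 | 18 | 24 | 42 | -5 | 10 | 8 | 0 | ... | 511 | 4000000.0 | 46.3 | 46.4 | 9.5 | 16:27 | 39 | 14.75 | 0.51 | 52.77 |
| 3 | Adam Lowry | 25 | 78 | 12 | 11 | 23 | 6 | 11 | 0 | 1 | ... | 473 | 2916666.0 | 50.3 | 50.5 | 7.2 | 14:38 | 42 | 11.43 | 0.29 | 57.62 |
| 4 | Adrian Kempe | 22 | 81 | 12 | 16 | 28 | -10 | 12 | 0 | 0 | ... | 424 | 894167.0 | 51.6 | 51.8 | 7.1 | 14:29 | 20 | 10.17 | 0.35 | 42.63 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 388 | Zach Parise | 34 | 74 | 28 | 33 | 61 | -2 | 18 | 10 | 0 | ... | 25 | 7538462.0 | 51.4 | 53.2 | 9.5 | 18:40 | 22 | 12.23 | 0.82 | 26.47 |
| 389 | Zach Sanford | 24 | 60 | 8 | 12 | 20 | 8 | 8 | 0 | 0 | ... | 5 | 875000.0 | 50.8 | 52.2 | 8.8 | 12:35 | 31 | 10.39 | 0.33 | 61.54 |
| 390 | Zack Kassian | 28 | 79 | 15 | 11 | 26 | -6 | 14 | 0 | 1 | ... | 6 | 1950000.0 | 48.4 | 47.4 | 10.0 | 14:48 | 35 | 13.51 | 0.33 | 0.00 |
| 391 | Zack Smith | 30 | 70 | 9 | 19 | 28 | -6 | 8 | 0 | 1 | ... | 417 | 3250000.0 | 44.6 | 45.0 | 7.5 | 16:21 | 42 | 8.91 | 0.40 | 49.15 |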


**Methodology**

A multiple linear regression model was used as a simplified way to identify the key stats that determine a player's value (salary). Roughly 80% of the dataset was randomly picked to train the model and the remaining 20% was used to test it. The coefficients and intercept were then determined.

The independent variables were chosen to be: Age, EVG (even-strength goals), EVA (even-strength assists), PPG (power-play goals), PPA (power-play assists), BLK (blocked shots), HIT (hits), CF% (Corsi %), oiSH% (on-ice team shooting %), TK (takeaways), and PointsPerGame. I chose stats such as BLK, HIT, and TK because they are more defensive and can capture the value of defensive-minded players who primarily do the hitting and shot blocking on the team.
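
A random mask selects only approximately 80% of the rows and gives a different split on every run. As an alternative sketch (not the approach used in this notebook), `train_test_split` with a fixed `random_state` gives an exact, repeatable 80/20 split. The data below is synthetic and the feature subset is chosen only to keep the example short.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic players: every value here is made up for illustration.
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "Age": rng.integers(19, 40, size=n),
    "HIT": rng.integers(0, 200, size=n),
    "PointsPerGame": rng.uniform(0.1, 1.5, size=n).round(2),
})
# Made-up salary relationship plus noise.
toy["Salary"] = (500_000
                 + 100_000 * (toy["Age"] - 19)
                 + 4_000_000 * toy["PointsPerGame"]
                 + rng.normal(0, 500_000, size=n))

features = ["Age", "HIT", "PointsPerGame"]

# Fixed random_state makes the 80/20 split reproducible across runs.
train, test = train_test_split(toy, test_size=0.2, random_state=42)

model = LinearRegression().fit(train[features], train["Salary"])
r2 = model.score(test[features], test["Salary"])  # R^2 on the held-out 20%
print(len(train), len(test), round(r2, 2))
```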



In [117]:
msk = np.random.rand(len(stats_combo)) < 0.8
train = stats_combo[msk]
test = stats_combo[~msk]


regr = linear_model.LinearRegression()
x = np.asanyarray(train[['Age','EVG','EVA','PPG','PPA','BLK','HIT','CF%','oiSH%','TK','PointsPerGame']])
y = np.asanyarray(train[['Salary']])
regr.fit (x, y)
# The coefficients
print ('Coefficients: ', regr.coef_)
print ('Intercept: ', regr.intercept_)

Coefficients:  [[ 2.71915283e+05 -2.17534239e+03 -3.04971710e+02 -2.22953934e+04
   1.78104999e+04 -3.15158102e+03  1.00658950e+02 -5.26671487e+04
  -1.79646565e+05  2.48057123e+04  5.39926200e+06]]
Intercept:  [-3340986.62838368]
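
The coefficients above are printed in the same order as the feature list passed to the model, so the first value (about 2.7e5) belongs to Age and the last (about 5.4e6) to PointsPerGame. A possible way to make this easier to read is to label each coefficient with its feature name. The sketch below does this on synthetic data with a reduced, illustrative feature list.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

features = ["Age", "HIT", "PointsPerGame"]

# Synthetic data with a known, noise-free relationship (made up).
rng = np.random.default_rng(1)
n = 100
X = pd.DataFrame({
    "Age": rng.integers(19, 40, size=n),
    "HIT": rng.integers(0, 200, size=n),
    "PointsPerGame": rng.uniform(0.1, 1.5, size=n),
})
y = 200_000 * X["Age"] + 5_000_000 * X["PointsPerGame"]

model = LinearRegression().fit(X[features], y)

# np.ravel handles both 1-D and 2-D coef_ arrays (coef_ is 2-D when the
# target is passed as a single-column 2-D array, as in this notebook).
coef_table = pd.Series(np.ravel(model.coef_), index=features)
coef_table = coef_table.sort_values(key=abs, ascending=False)
print(coef_table)
```

Each coefficient is in dollars per one unit of its stat, so the magnitudes are not directly comparable across stats measured on very different scales.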


In [118]:
y_hat= regr.predict(test[['Age','EVG','EVA','PPG','PPA','BLK','HIT','CF%','oiSH%','TK','PointsPerGame']])
x = np.asanyarray(test[['Age','EVG','EVA','PPG','PPA','BLK','HIT','CF%','oiSH%','TK','PointsPerGame']])
y = np.asanyarray(test[['Salary']])
print("Mean squared error: %.2f"
      % np.mean((y_hat - y) ** 2))


print('R^2 score: %.2f' % regr.score(x, y))
print('R^2 score: 1 is perfect prediction')

Mean squared error: 2223820125812.47
R^2 score: 0.62
R^2 score: 1 is perfect prediction
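
The first figure printed above is the mean of the squared errors, so it is in squared dollars; taking its square root gives a typical error of roughly $1.49 million per player. The second figure is the R² returned by `regr.score`. The sketch below shows how the same metrics can be obtained from `sklearn.metrics`, using four made-up salaries rather than the project's data.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up actual and predicted salaries, for illustration only.
y_true = np.array([1_000_000, 2_000_000, 4_000_000, 7_000_000], dtype=float)
y_pred = np.array([1_500_000, 1_500_000, 5_000_000, 6_000_000], dtype=float)

mse = mean_squared_error(y_true, y_pred)  # mean of squared errors (dollars squared)
rmse = np.sqrt(mse)                       # back in dollars, easier to interpret
r2 = r2_score(y_true, y_pred)             # 1.0 would be a perfect prediction

print(f"MSE:  {mse:.2f}")
print(f"RMSE: {rmse:.2f}")
print(f"R^2:  {r2:.2f}")
```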


**Analysis**

The players' stats were fed into the fitted model and the predicted salary was compared with each player's actual salary. The top 10 underpaid players are listed below for analysis.

* 6 of the 10 players on the list are 25 or younger. These are the rising stars of the league who are still on entry-level contracts. Their potential and value were recognized by their teams' management, as several of them signed lucrative long-term contract extensions during the following off-season.
* The other 4 players did not have strong offensive numbers but are experienced, defensive-minded players. Their high HIT and CF% numbers make them invaluable, especially during the playoffs.

In [119]:
y_prediction=regr.predict(stats_combo[['Age','EVG','EVA','PPG','PPA','BLK','HIT','CF%','oiSH%','TK','PointsPerGame']])
stats_combo['Salary_Prediction'] = pd.DataFrame(y_prediction)
stats_combo['Difference'] = stats_combo['Salary_Prediction'] - stats_combo['Salary']
stats_combo_summary = stats_combo[['Age','Player','PointsPerGame','EVG','HIT','BLK','CF%','TK','oiSH%','Salary','Salary_Prediction','Difference']].sort_values(by=['Difference'], ascending=False)
stats_combo_summary.head(10)


| | Age | Player | PointsPerGame | EVG | HIT | BLK | CF% | TK | oiSH% | Salary | Salary_Prediction | Difference |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 263 | 22 | Mikko Rantanen | 1.18 | 15 | 59 | 41 | 53.9 | 39 | 9.8 | 894167.0 | 5158658.0 | 4264491.0 |
| 334 | 21 | Sebastian Aho | 1.01 | 23 | 65 | 34 | 57.2 | 81 | 10.0 | 925000.0 | 5169761.0 | 4244761.0 |
| 51 | 22 | Brayden Point | 1.16 | 21 | 31 | 43 | 51.9 | 35 | 11.1 | 686667.0 | 4677182.0 | 3990515.0 |
| 277 | 25 | Nikita Kucherov | 1.56 | 26 | 44 | 31 | 52.6 | 58 | 11.8 | 4766666.0 | 8515372.0 | 3748706.0 |
| 147 | 24 | Jake Guentzel | 0.93 | 33 | 105 | 47 | 52.7 | 45 | 10.3 | 734166.0 | 4433117.0 | 3698951.0 |
| 77 | 39 | Chris Kunitz | 0.18 | 5 | 85 | 15 | 50.9 | 13 | 6.4 | 1000000.0 | 4676436.0 | 3676436.0 |
| 35 | 21 | Auston Matthews | 1.07 | 25 | 28 | 60 | 53.1 | 57 | 10.0 | 925000.0 | 4593019.0 | 3668019.0 |
| 238 | 37 | Matt Hendricks | 0.12 | 0 | 80 | 16 | 47.9 | 5 | 5.2 | 700000.0 | 3990698.0 | 3290698.0 |
| 44 | 33 | Brad Richardson | 0.41 | 16 | 63 | 55 | 49.0 | 27 | 7.7 | 1250000.0 | 4348375.0 | 3098375.0 |
| 233 | 21 | Mathew Barzal | 0.76 | 15 | 25 | 56 | 52.2 | 66 | 8.0 | 863333.0 | 3908278.0 | 3044945.0 |


**Discussion**

From the analysis, 6 of the 10 players deserve significant pay raises based on their offensive production. This shows the importance of PointsPerGame and EVG as key offensive stats. In fact, these players all received lucrative contract extensions with their teams in the off-season.
* Auston Matthews: 5Y/58 Million
* Sebastian Aho: 5Y/42 Million
* Mikko Rantanen: 6Y/55.5 Million
* Brayden Point: 3Y/20 Million
* Nikita Kucherov: 8Y/76 Million
* Jake Guentzel: 5Y/30 Million
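
To put these extensions on the same footing as the model's single-season predictions, each contract can be converted to an average annual value (total divided by years). The sketch below does this arithmetic using the contract figures listed above and the `Salary_Prediction` values from the top-10 table, rounded to millions; the variable names are illustrative.

```python
# (years, total value in millions), as listed in the discussion above.
contracts = {
    "Auston Matthews": (5, 58.0),
    "Sebastian Aho": (5, 42.0),
    "Mikko Rantanen": (6, 55.5),
    "Brayden Point": (3, 20.0),
    "Nikita Kucherov": (8, 76.0),
    "Jake Guentzel": (5, 30.0),
}

# Model predictions from the top-10 table, rounded to millions.
predicted = {
    "Auston Matthews": 4.59,
    "Sebastian Aho": 5.17,
    "Mikko Rantanen": 5.16,
    "Brayden Point": 4.68,
    "Nikita Kucherov": 8.52,
    "Jake Guentzel": 4.43,
}

# Average annual value = total contract value / contract length.
aav = {player: round(total / years, 2)
       for player, (years, total) in contracts.items()}

for player, value in aav.items():
    print(f"{player}: AAV {value}M vs predicted {predicted[player]}M")
```

On these numbers, every extension's average annual value is above the model's prediction for that player, which suggests the simple model is conservative for top offensive players.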

The other 4 players are known to be tough, physical, defensive players who become especially valuable during the playoffs. High HIT, BLK, and CF% stats show a player's strong defensive game. In fact, Chris Kunitz and Troy Brouwer were both known as key players during their respective teams' Stanley Cup runs for shutting down the opposition's key players.

**Conclusion**

The project shows the difficulty that sports franchises can face when evaluating the value of their most valuable assets, the players. Retaining a team's best players is a no-brainer. However, sports teams are relying more and more on analytics and data science to determine the most appropriate value for each player. Obviously, the project used a very simple model and the projections were not very accurate, but it gives a small glimpse of how complex the task can be. Note that this only takes one season's stats into consideration. Data should be taken from multiple seasons to provide a larger data pool and a more accurate model.
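
One possible way to pool several seasons is to tag each season's table with a season label and stack them into a single DataFrame. The sketch below assumes each season has already been cleaned into the same columns; the frames, player names, and values are hypothetical.

```python
import pandas as pd

# Hypothetical per-season tables with identical columns (values are made up).
season_frames = {
    "2017-18": pd.DataFrame({"Player": ["Player A", "Player B"], "PTS": [50, 20]}),
    "2018-19": pd.DataFrame({"Player": ["Player A", "Player C"], "PTS": [65, 30]}),
}

# Add a Season column to each table, then stack them into one data pool.
all_seasons = pd.concat(
    [df.assign(Season=season) for season, df in season_frames.items()],
    ignore_index=True,
)
print(all_seasons)
```

Keeping the season label makes it possible to hold out an entire season for testing instead of a random subset of players.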