# Capstone Project: NHL Cap Hit Predictions

#### Brian Johns, April 2022.  BrainStation

## Notebook #4: Findings

### Overview

The goal of this project has been to answer the question: **How can we predict and evaluate the Cap Hit of NHL players using their basic and advanced statistics?**

By doing this, the value added for an NHL team would be that we could:

1. Identify statistics that players are currently being rewarded for.

2. Identify statistics that may be undervalued and could be gaps in the market for players.

3. Identify players that are under/overvalued based on their performance profile.

In order to do this, I will outline the following:

1. Review Results from Modeling
2. Interpret the Results
3. Continued Research
4. Project Learnings

### Modeling Results

#### EDA

The initial findings from the Exploratory Data Analysis showed that most of the performance-based statistics had a positive correlation with the Cap Hit of an NHL player.  In particular, Goals For and basic production statistics (Goals, Assists, Points) had the strongest positive correlations.  Some notable exceptions were:

- Hits For and Against, which were negatively correlated
- Penalties Taken, especially Major Penalties, which were negatively correlated
- Most surprisingly, advanced defensive statistics like Defensive GAR, which had no correlation or a minor negative correlation
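The kind of correlation ranking described above can be sketched as follows. This is a minimal illustration on made-up numbers: the column names (`cap_hit`, `points`, `hits`, `def_gar`) and all values are hypothetical and are not taken from the project's dataset.

```python
import pandas as pd

# Hypothetical toy data: names and values are illustrative only.
players = pd.DataFrame({
    "cap_hit": [0.9, 1.5, 3.0, 5.5, 8.0, 10.5],
    "points": [8, 15, 30, 48, 70, 95],
    "hits": [150, 120, 90, 70, 40, 30],
    "def_gar": [1.2, -0.5, 0.8, 0.1, 1.0, 0.3],
})

# Pearson correlation of every feature with the target, strongest first.
corr_with_cap_hit = (
    players.corr()["cap_hit"]
    .drop("cap_hit")
    .sort_values(ascending=False)
)
print(corr_with_cap_hit)
```

In this toy data, `points` ranks at the top, `hits` is strongly negative, and `def_gar` sits near zero, mirroring the pattern described in the text.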

#### Linear Regression

The best performing model for Linear Regression used Lasso Regression with Standard Scaled data.  The final result of the linear regression showed numerous signs of multicollinearity, making the coefficients difficult to interpret.

The greatest positive coefficients were:
- Time On Ice
- Expected Goals Against
- Age
- First Assist/60 minutes
- Goals For

The greatest negative coefficients were:
- Games Played
- Points/60 minutes
- Expected Goals For Percentage
- Expected Goals Against/60 minutes
- Corsi Against

Time On Ice and Age having large positive coefficients starts to paint the picture that factors outside of performance have significant effects on the Cap Hit of an NHL player.  Goals For appearing as a positive coefficient, along with First Assists/60 minutes, hints that offensive stats may be more important than defensive stats.

However, even among the features with the largest coefficients, it is clear there is overlap in what the data may represent, so other models and reduced dimensionality were needed.
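To illustrate why multicollinearity makes Lasso coefficients hard to read, here is a minimal sketch of a standard-scaled Lasso fit on synthetic data with two nearly collinear features. The feature names, the data, and `alpha=0.05` are assumptions for illustration, not the settings used in this project.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 200

# Synthetic, hypothetical features: "points" is almost a copy of "toi".
toi = rng.normal(15, 3, n)
points = 2.0 * toi + rng.normal(0, 1, n)
noise = rng.normal(0, 1, n)  # unrelated to the target
X = np.column_stack([toi, points, noise])
y = 0.5 * toi + rng.normal(0, 0.5, n)

# Scale first so the L1 penalty treats every feature on the same footing.
model = make_pipeline(StandardScaler(), Lasso(alpha=0.05))
model.fit(X, y)

coefs = model.named_steps["lasso"].coef_
for name, coef in zip(["toi", "points", "noise"], coefs):
    print(f"{name:>6}: {coef: .3f}")
```

Because `toi` and `points` carry nearly the same information, how the weight is split between them says little about which one matters. That is the interpretability problem described above.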

#### XGBoost Regressor

The XGBoost model provided us with our best performing model of the project with the following metrics:

- R-Squared = 0.679
- MAE = 0.9398
- RMSE = 1.2905
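For reference, the three metrics reported above can be computed from actual and predicted values as in the sketch below. The function name and the four sample values are hypothetical and only show the arithmetic.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return R-squared, MAE and RMSE for two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    mae = np.mean(np.abs(residuals))
    rmse = np.sqrt(np.mean(residuals ** 2))
    ss_res = np.sum(residuals ** 2)                  # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation
    r2 = 1.0 - ss_res / ss_tot
    return {"r2": r2, "mae": mae, "rmse": rmse}

# Made-up cap hits and predictions, purely for illustration.
actual = [1.0, 2.0, 4.0, 9.0]
predicted = [1.5, 2.5, 3.5, 7.0]
metrics = regression_metrics(actual, predicted)
print(metrics)
```

Note how the single large miss on the highest value pulls RMSE above MAE, since RMSE penalizes large errors more heavily.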

The feature importances of this model showed the following features to be the most important to its predictions:
- Age
- Time on Ice for the Power Play
- Overall Draft Pick
- Games Played
- Goals For
- Expected Goals For/60 Minutes

In plotting the predicted values from the XGBoost model against the actual Cap Hit, our model could not predict the Cap Hit of players with 'mega' contracts (more than 10 million), nor that of veterans and rookies who significantly over-performed on their 'cheap' contracts.

#### Random Forest Regressor

The Random Forest model was the second-highest performing model.  Although its test score was lower than that of the XGBoost model, it may be the more robust of the two.

The features that were important to the Random Forest model were similar to those of the XGBoost model, but much more concentrated: out of over 140 features, only 5 truly had an effect on the model, especially the top 3:

- Time On Ice on the Power Play
- Age
- Overall Draft Pick
- Goals For
- Expected Goals For

To an even greater degree than the XGBoost model, the Random Forest model was unable to predict the Cap Hit of any player with a big contract (more than 8 million).
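A feature-importance ranking like the one above can be read from a fitted Random Forest as sketched below. The data is synthetic and the feature names are hypothetical; the target is built to depend on only three of the five features, so the importances concentrate on those three.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(42)
n = 500
feature_names = ["pp_toi", "age", "draft_pick", "hits", "blocks"]  # hypothetical

X = rng.normal(size=(n, len(feature_names)))
# By construction, only the first three features drive the target.
y = 3.0 * X[:, 0] + 2.0 * X[:, 1] - 1.5 * X[:, 2] + rng.normal(0, 0.5, n)

forest = RandomForestRegressor(n_estimators=100, random_state=0)
forest.fit(X, y)

# Impurity-based importances sum to 1 across all features.
ranked = sorted(
    zip(feature_names, forest.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, importance in ranked:
    print(f"{name:>10}: {importance:.3f}")
```

Impurity-based importances are only one option. Permutation importance is a common alternative when features are strongly correlated, as they are in this project.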

#### Modeling with PCA

In order to further combat the multicollinearity in the data, feature selection and PCA were used.  Model performance on the PCA-transformed data dropped significantly compared with the previous regression models (down 10-15% for the R-Squared score).

In interpreting the principal components (PCs) used to separate the data, the biggest differentiators were:

- Age
- Overall Draft Pick
- Basic Offensive Stats (Goals, Assists)
- Advanced Offensive Stats (Offensive GAR, WAR)

In conjunction with the results of the linear regression models, the PCA-reduced models supported a trend that the Player's Age and Draft Pick were the most important factors in predicting a player's Cap Hit, followed by offensive statistics.
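Interpreting principal components usually means inspecting their loadings, that is, how strongly each original feature contributes to each component. Below is a minimal sketch on synthetic data; the three feature names are hypothetical, and two of them are built to be highly correlated so that they load on the same component.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
n = 300

# Hypothetical features: goals and assists share one underlying "offence" signal.
offence = rng.normal(size=n)
goals = offence + 0.1 * rng.normal(size=n)
assists = offence + 0.1 * rng.normal(size=n)
age = rng.normal(size=n)

feature_names = ["goals", "assists", "age"]
X_scaled = StandardScaler().fit_transform(np.column_stack([goals, assists, age]))

pca = PCA(n_components=2).fit(X_scaled)

# Rows of components_ are PCs, columns are the original features.
for i, component in enumerate(pca.components_, start=1):
    loadings = ", ".join(f"{n_}={w: .2f}" for n_, w in zip(feature_names, component))
    print(f"PC{i} ({pca.explained_variance_ratio_[i - 1]:.0%} of variance): {loadings}")
```

The sign of a component is arbitrary, so it is the magnitude and grouping of the loadings that carry the interpretation.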

#### Clustering

The data showed almost no natural cluster structure, and the silhouette scores used to choose the number of clusters were low.  Comparing clustering on the full dataset with clustering each position separately, the silhouette scores were equally low, but separating the positions offered more potential for actionable results.

In doing so, we ended up with 6 clusters for Forwards and 4 clusters for Defencemen:

**For Forwards**:

1. Star Forwards
2. Offensive Producer, Defensive Liability
3. Primary Defensive Forwards
4. Secondary Defensive Forwards
5. Fringe NHL Forwards
6. Enforcers

**For Defencemen**:

1. Star Defencemen
2. Offensive Producer
3. Defensive Defenceman
4. Fringe NHL Defenceman

Relating these clusters to the players' cap hits again showed that players with more offensive production (Forward Clusters 1 and 2, Defencemen Cluster 1) made the most money.

In analyzing the data, there was a significant portion of players making near minimum salaries who were still able to produce at the 'star' level.  As well, many of the 'overpaid' players were identified as players who were past their prime playing years at the end of long-term contracts.
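Choosing a cluster count by silhouette score, as described at the start of this section, can be sketched as follows. This assumes K-Means as the clustering algorithm, which is an assumption of the example. It also uses well-separated synthetic blobs, so the scores here are far higher than the low scores seen on the real player data.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with three clearly separated groups (illustrative only).
X, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [6, 6], [-6, 6]],
    cluster_std=1.0,
    random_state=0,
)

# Fit K-Means for a range of k and record the mean silhouette score of each.
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
for k, score in scores.items():
    print(f"k={k}: silhouette={score:.3f}")
print("best k:", best_k)
```

Silhouette scores range from -1 to 1. Values near 0 indicate overlapping clusters, which is the situation the player data was in.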

### Interpreting the Results

In interpreting these results, I want to answer the value added statements that I proposed at the start of the project.

**1. Identify statistics that players are currently being rewarded for.**

*Age* is a significant factor in how much a player is paid.  This is partly due to the financial structure of the league (contract limitations for young players) and athlete development (as a player gets more experience, they play better and get paid more).

*Overall Draft Pick*.  Where a player is drafted has a surprisingly long-term impact on the player's salary.  It makes sense that those picked earlier would generally be paid more (players with higher upside are picked early and then fulfill that promise).  However, the lingering effect of being drafted high is a significant explanation for the player's salary even in later years.

*Time On Ice - Power Play*.  Beyond these factors, players who have higher cap hits have the *expectation* of producing more offense, not necessarily actually being rewarded for it.  This is highlighted by the fact that a common factor across all models was the amount of Time on Ice a player spent on the Power Play.  Not even *production* on the power play, just the amount of time spent on the ice.  This expectation is highly rewarded.

*Goals For*.  Finally, for actual production, the most important statistic for determining a player's cap hit was Goals For, the number of goals their team scored when they were on the ice.  A player's ability to score and playmake so others can score is ultimately the skill that is the most rewarded in the NHL.


**2. Identify statistics that may be undervalued and could be gaps in the market for players.**

*Advanced Statistics, such as GAR*.  Despite the increase in use and visibility for advanced statistics, there still seems to be a gap between how a player is evaluated through their advanced statistics and that player's compensation.  There still seems to be an edge that a team can leverage if it believes advanced statistics are a better measure of a player's value.

*Defensive Metrics*.  Among these advanced statistics, the statistics measuring a player's defensive contribution are still relatively irrelevant to a player's current cap hit.  The nature of the game makes it difficult to measure a player's defensive value, and furthermore it can be difficult to separate that value from that of the other players he is on the ice with.  However, teams may find value in players that have better defensive contributions if they believe it's important for winning games.


**3. Identify players that are under/overvalued based on their performance profile.**

For undervalued players, a common theme across all of the models was the potential for value in players who were making 1 million or less.  Effective players in this category came from two places:

1. Rookies on their entry-level contract.  This one might be more common sense, but drafting well and having those players contribute early in their career maximizes the value of the contract they're on.

2. Aging veterans signed to near minimum deals.  This one was a bit more surprising but was a theme shared across all models.  Veterans who may have been seen as being 'past their prime' and did not get a large contract consistently contributed at a high level, especially if they had played at a high level earlier in their career.

Many of the players in the defensively oriented clusters (for both forwards and defencemen) were making less than 1 million.  Combining the undervalued stats with a search for young or veteran players who specialize in those defensive abilities may expose a weakness in the player market that an NHL team could exploit.

For overvalued players, a very surprising trend arose: players that have 'mega' contracts (9-10 million or more) may not be, and may never be, worth the size of their contract.  None of the models were able to come close to predicting their values, and in fact these were the contracts on which the linear models missed by the most.  This raises a pretty profound team building question.  Would you rather have:

A) Connor McDavid, maybe the best player in the game, making 12.5 million but never able to 'produce' more than 9 million in actual value?

B) 3 Players, all making 4 million and unlikely to be all-stars, but with the potential to produce at a higher level than their contract?

The results of these models suggest that taking the second option might be better if the goal is to have as many players on the team producing at a good, if not elite, level.

Another category of overpaid players was players past their prime at the tail end of high-paying, long-term contracts.  Signing such contracts may still be a good strategy for ensuring that a team retains an all-star level player for the best years of their career, knowing that by the end of the contract that player may be a shell of himself and unable to perform even at an NHL level.
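One simple way to surface under- and overvalued players from any of the models is to compare each model prediction with the actual cap hit. The sketch below assumes values in millions. Only the 12.5 / 9.0 pair echoes the comparison made above; every other label and number is hypothetical.

```python
import pandas as pd

# Cap hits and model predictions in millions (illustrative values).
contracts = pd.DataFrame({
    "player": [
        "Mega contract",
        "Rookie (entry-level)",
        "Veteran (near-minimum deal)",
        "Late-contract veteran",
    ],
    "cap_hit": [12.5, 0.9, 0.75, 7.0],
    "predicted": [9.0, 3.2, 2.1, 4.0],
})

# Positive surplus: the model values the player above what they are paid.
contracts["surplus"] = contracts["predicted"] - contracts["cap_hit"]

ranked = contracts.sort_values("surplus", ascending=False).reset_index(drop=True)
print(ranked)
```

Sorting by `surplus` puts the most undervalued contracts at the top and the most overvalued at the bottom, which matches the two groups discussed in this section.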


Even the best model in this project was able to account for only about 68% of the variance in this data (R-Squared = 0.679).  That still leaves roughly a third of the variance unexplained by the statistics provided!  That is a large amount considering the statistics in this project included every major statistic that a player can produce on the ice.  Perhaps there is value in this gap that is not or cannot be measured (e.g. leadership, home team marketing, quality of the player's agent, etc.), but I would expect this unexplained variance to shrink as more teams become analytics-savvy.

In a perfect world, a player's pay would be perfectly explained by their impact on the ice.  However, the real-world and financial decisions that teams and players make will always leave some room for a gap between how much a player earns and their production on the ice.  I believe that, over time, the teams that leverage these gaps and find the best ways to put as much talent on the ice as possible will be the ones that have the best team in the end.

### Continued Research

This project has focused more on the player level and what factors go into an individual player's salary and cap hit.  However, this analysis can go further to a team level and see how the most successful teams construct their roster given the information here.

While advanced statistics may not have translated to players' salaries yet, analyzing how players' contracts evolve over the next few years as advanced statistics become more ingrained in the sport may give more insight into future areas that teams can leverage for team success.

Through the course of this project, some interesting correlations (positive and negative) arose between the *expected* statistics (e.g. Expected Goals For) and the totals of the regular statistic (e.g. Goals For).  From a team building perspective, it could be interesting to develop profiles of players that compare their *expected* statistics against their actual statistics: Process vs Outcome.
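A first pass at such a Process vs Outcome profile could be as simple as the difference between actual and expected totals. The sketch below uses made-up players and numbers; the column names are hypothetical.

```python
import pandas as pd

# Hypothetical on-ice totals for three players.
on_ice = pd.DataFrame({
    "player": ["Player A", "Player B", "Player C"],
    "goals_for": [62, 45, 50],
    "expected_goals_for": [50.5, 52.0, 49.8],
})

# Outcome minus process: positive means results are ahead of the underlying play.
on_ice["gf_minus_xgf"] = on_ice["goals_for"] - on_ice["expected_goals_for"]
on_ice["profile"] = on_ice["gf_minus_xgf"].apply(
    lambda diff: "outperforming" if diff > 0 else "underperforming"
)
print(on_ice)
```

A fuller profile would likely track this gap over several seasons to separate sustained skill from short-term luck.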

There is an increasing number of resources becoming available that tell great stories in hockey and look great when shared on Twitter or other platforms.

[Evolving Hockey](https://evolving-hockey.com) has been the main influence in this project, but a few others have inspired me as well:

[HockeyViz](https://hockeyviz.com/game/2021021046/xG) which gives real-time updates on the expected scores of games

[JFresh](https://jfresh.substack.com/p/2022-nhl-player-cards-explainer?s=r) - Making advanced stats look like hockey cards

### Project Learnings

For this project, there was a large number of statistics that were hard to interpret individually, which made for some messy modeling and analysis.  Focusing on a specific band of the advanced statistics (the different forms of GAR for example) may have made for a more concise analysis but would be answering a different question.

Learning how to optimize my hyperparameters was a time-consuming process, and I learned new tools for making it more efficient well after I had gone through the process I used for this project.  I have learned how to do more things programmatically through the course of this project but there is always more room to be more efficient.

Trying to merge the data together at the start really highlighted the importance of understanding how to prepare the data, and ensure that it is ready for use in modeling.  Many times through this process I caught something in the modeling that needed to be corrected in the data cleaning.  I suspect my previous hockey knowledge enabled me to recognize many of these errors, but for my next project I have more awareness of where these errors may come from.