# Gold Data Load

In this notebook, the Silver tables will be combined into a single dataframe, which will then be modified in several ways:

Players with fewer than 100 total games played (a threshold the user can adjust) will be omitted, as we are attempting to track year-over-year performance for career NHL players. Players who do not play more than 2 complete seasons would likely be unhelpful. Additionally, players who have appeared in the NHL in fewer than 3 seasons (also adjustable by the user) will be excluded. Stints shorter than 20 games will be excluded as well, as these are small sample sizes and the player's performance during that brief period could be high or low due to normal variance.

Aggregated data will then be computed for each player. This data will include items like "total games played", "goals per game per stint", "number of same birth country teammates per playerID & year combination", and "number of years played in NHL".

Then, each player's stats will be standardized within their own career, so that for a given stat the player's stints have a mean of 0 and a standard deviation of 1. An overall score for each stint will then be calculated by adding all of these standardized scores together (with optional weights to increase or reduce the impact of certain stats). We will then be able to compare the overall score for a stint with the number of teammates with the same birth country to study the correlation between these two items.

<hr>

# Gold Process

<hr>

## Load & Clean Data

### Load Silver Scoring

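We will start by loading the Scoring table, as it is the main fact table we will be using. There are a few rows where the stint has 0 or NaN games played (GP); these should not be included. Rows with NaN GP are dropped once the missing values have been handled, and stints with 0 GP are removed by the minimum-games filter that is applied to stints later in the notebook.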

In [33]:
import pandas as pd

# load table
df = pd.read_csv("Data/Silver/Scoring.csv")

### Handle Missing Values

In the NHL's history, several stats were not tracked until later. These include: +/-, PPG, PPA, SHG, SHA, GWG, SOG. For players whose records span the periods where these stats were both untracked and tracked, it would be difficult to compare their performance between years where data for these stats is and is not present. For this reason, we find, for each stat, the earliest year in which it has no null values. If a player has any record before that year, that stat is set to 0 on all of that player's rows, so that it stays constant across their career.

In [34]:
# Define the stats to be checked and updated
stats = ['+/-', 'PPG', 'PPA', 'SHG', 'SHA', 'GWG', 'SOG']

# Find the earliest year each stat in question has no null values

# Initialize a dictionary to store the earliest year with no null values for each stat
earliest_non_null_years = {}

# Iterate through each stat to find the earliest year with no null values
for stat in stats:
    # Group by year and filter out rows where the stat is null
    non_null_by_year = df[df[stat].notnull()].groupby('year').size()
    
    # Check if the number of non-null records matches the total number of records for each year
    total_by_year = df.groupby('year').size()
    
    # Align the indices of the two Series
    aligned_non_null_by_year = non_null_by_year.reindex(total_by_year.index, fill_value=0)
    
    # Find the earliest year where all values for the stat are non-null
    fully_non_null_years = aligned_non_null_by_year[aligned_non_null_by_year == total_by_year]
    if not fully_non_null_years.empty:
        earliest_non_null_year = fully_non_null_years.index.min()
    else:
        earliest_non_null_year = None
    
    # Store the result in the dictionary
    earliest_non_null_years[stat] = earliest_non_null_year

# Update the earliest year for GWG to 1963 - The stat was tracked in 1917, and then not again until 1963
earliest_non_null_years['GWG'] = 1963 

# Convert the dictionary to a DataFrame for better visualization
earliest_non_null_years_df = pd.DataFrame(list(earliest_non_null_years.items()), columns=['Stat', 'Earliest Year with No Nulls'])

# Iterate through each stat and update the main dataframe as per the logic described
for index, row in earliest_non_null_years_df.iterrows():
    stat = row['Stat']
    earliest_year = row['Earliest Year with No Nulls']
    
    # Identify players with a record earlier than the earliest year for the current stat
    players_to_update = df[df['year'] < earliest_year]['playerID'].unique()
    
    # Update all rows to 0 for those players and the current stat
    df.loc[df['playerID'].isin(players_to_update), stat] = 0

# Remove NaN GP rows
df = df.dropna(subset=['GP'])
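
To illustrate the logic of the cell above on a single stat, here is a minimal sketch on made-up data. The player IDs, years, and values are hypothetical, and the names `coverage`, `earliest_full_year`, and `early_players` are introduced only for this example.

```python
import pandas as pd

# Hypothetical data: SOG is missing in 1950 and fully recorded from 1968 on
toy = pd.DataFrame({
    "playerID": ["a01", "a01", "b01", "b01", "c01"],
    "year":     [1950, 1970, 1968, 1969, 1970],
    "SOG":      [None, 120.0, 90.0, 100.0, 80.0],
})

# For each year, is SOG non-null on every row?
coverage = toy["SOG"].notnull().groupby(toy["year"]).all()

# Earliest year in which the stat has no null values
earliest_full_year = coverage[coverage].index.min()

# Players with any record before that year get the stat set to 0 on all rows
early_players = toy.loc[toy["year"] < earliest_full_year, "playerID"].unique()
toy.loc[toy["playerID"].isin(early_players), "SOG"] = 0

print(earliest_full_year)
print(toy)
```

Player `a01` has a 1950 record, so both of that player's rows are set to 0, including the 1970 row where SOG was actually recorded.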


### Index Column for Stint

Each player gets an index column, stint_index, that numbers their stints in chronological order starting from 0, allowing us to track the player's career in order without referencing the year and stint columns.

In [35]:
# order the dataframe by playerID, year, then stint
df = df.sort_values(by=['playerID', 'year', 'stint'], ascending=[True, True, True])

# Adding the stint_index column starting from 0
df['stint_index'] = df.groupby('playerID').cumcount()
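
As a quick check of how the sort and `cumcount` interact, the sketch below runs the same two steps on a few made-up rows that start out of order. The player IDs are hypothetical.

```python
import pandas as pd

# Hypothetical rows, deliberately out of chronological order
toy = pd.DataFrame({
    "playerID": ["b01", "a01", "a01", "b01", "a01"],
    "year":     [2001, 2000, 1999, 2000, 2000],
    "stint":    [1, 2, 1, 1, 1],
})

# Sort first, so that cumcount numbers the stints chronologically
toy = toy.sort_values(by=["playerID", "year", "stint"])
toy["stint_index"] = toy.groupby("playerID").cumcount()

print(toy)
```

Because `cumcount` numbers rows in their current order within each group, skipping the sort would number the stints in file order rather than career order.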

### Calculate Point In Career

To assist us in tracking the player through their career, we need to find how many games they have played in total. We also need to track how many games they have played up to the current point in their career. Lastly, we can take total_GP_prior_to_stint / career_games to get the player's current point in their career.

In [36]:
# Calculate the total GP for each player
total_gp_per_player = df.groupby('playerID')['GP'].sum().reset_index(name='career_games')

# Merge this total GP back into the original DataFrame
df = pd.merge(df, total_gp_per_player, on='playerID', how='left')

# Calculate the total GP prior to the current stint for each player:
# the running total within the player, minus the current stint's own games
# (this is 0 for each player's first stint)
df['total_GP_prior_to_stint'] = df.groupby('playerID')['GP'].cumsum() - df['GP']

# Calculate the percentage of total games played prior to each stint out of the career games
df['percent_through_career'] = df['total_GP_prior_to_stint'] / df['career_games']

# Check for any values greater than 1
invalid_values = df[df['percent_through_career'] > 1]
if not invalid_values.empty:
    print("Warning: There are values greater than 1 in 'percent_through_career'.")
    print(invalid_values)

# Display the updated dataframe
print(df[['playerID', 'GP', 'career_games', 'total_GP_prior_to_stint', 'percent_through_career']])

        playerID    GP  career_games  total_GP_prior_to_stint  \
0      aaltoan01   3.0         151.0                      0.0   
1      aaltoan01  73.0         151.0                      3.0   
2      aaltoan01  63.0         151.0                     76.0   
3      aaltoan01  12.0         151.0                    139.0   
4      abbotre01   3.0           3.0                      0.0   
...          ...   ...           ...                      ...   
38148  zyuzian01  66.0         496.0                    227.0   
38149  zyuzian01  65.0         496.0                    293.0   
38150  zyuzian01  57.0         496.0                    358.0   
38151  zyuzian01  49.0         496.0                    415.0   
38152  zyuzian01  32.0         496.0                    464.0   

       percent_through_career  
0                    0.000000  
1                    0.019868  
2                    0.503311  
3                    0.920530  
4                    0.000000  
...                       .
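
The arithmetic behind these columns can be checked on a small example. This is a minimal sketch on hypothetical data; it computes the prior-games column as a per-player running total minus the current stint's games, which is one way to get the same numbers.

```python
import pandas as pd

# Hypothetical stints, already in career order for each player
toy = pd.DataFrame({
    "playerID": ["a01", "a01", "a01", "b01", "b01"],
    "GP":       [10.0, 40.0, 50.0, 20.0, 60.0],
})

# Career total per player, repeated on every row of that player
toy["career_games"] = toy.groupby("playerID")["GP"].transform("sum")

# Games played before the current stint began
toy["total_GP_prior_to_stint"] = toy.groupby("playerID")["GP"].cumsum() - toy["GP"]

# Share of the career already played when the stint began
toy["percent_through_career"] = toy["total_GP_prior_to_stint"] / toy["career_games"]

print(toy)
```

Since the value is measured at the start of each stint, it is 0 for a player's first stint and stays below 1 for every stint with at least one game.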

### Drop Players With Too Few Total GP

In [37]:

# remove players with too few games
game_threshold = 100
df = df[df['career_games'] >= game_threshold]

### Drop Stints With Too Few Games Played

For one reason or another, a player's stint with a team could be as short as one game. We want to be able to effectively compare stints of different lengths, and we will convert stats to stat-per-game averages to do this. To avoid creating outliers, stints with too few games played will be removed. At minimum, stints must be 20 games or longer to be included.

After stints have been dropped, reset the stint_index to index only the remaining stints.

In [38]:
gp_threshold = 20

# remove stints with too few games played
# (.copy() so that adding columns below does not trigger SettingWithCopyWarning)
df = df[df['GP'] >= gp_threshold].copy()

# reset stint_index
df['stint_index'] = df.groupby('playerID').cumcount()
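
To see why very short stints distort per-game averages, consider the following sketch. The data is made up: a one-game stint with a single goal produces a goals-per-game rate far above the same player's full seasons.

```python
import pandas as pd

# Hypothetical player: a one-game stint followed by two full seasons
toy = pd.DataFrame({
    "playerID": ["a01", "a01", "a01"],
    "GP":       [1.0, 60.0, 82.0],
    "G":        [1.0, 15.0, 20.0],
})
toy["G_per_game"] = toy["G"] / toy["GP"]
print(toy)

# Apply the same kind of threshold as above, then renumber the stints
gp_threshold = 20
kept = toy[toy["GP"] >= gp_threshold].copy()
kept["stint_index"] = kept.groupby("playerID").cumcount()
print(kept)
```

The one-game stint scores 1.0 goals per game against roughly 0.25 for the full seasons, which would dominate any comparison across the player's career if it were kept.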

### Drop Players With Too Few Years
 
Now that specific stints have been removed, we need to ensure that the remaining population have records that span several years in the NHL. Players with only a single record will not have year-over-year changes to be observed. We are most interested in seeing data for players who have spent a lot of time in the NHL. Players with fewer than 3 distinct values in their year column will be removed.

In [39]:
year_threshold = 3

# Group by playerID and count distinct years
year_counts = df.groupby('playerID')['year'].nunique().reset_index()

# Keep playerIDs with at least year_threshold distinct years
valid_playerIDs = year_counts[year_counts['year'] >= year_threshold]['playerID']

# Filter the dataframe to keep only valid playerIDs
df = df[df['playerID'].isin(valid_playerIDs)].copy()

# Spot check: `in` on a Series tests the index labels, so check against the values
print('aaltoan01' in valid_playerIDs.values)


False
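
One detail worth keeping in mind when spot-checking membership like this: the `in` operator on a pandas Series looks at the index labels, not the values. The short sketch below, on a hypothetical Series, shows the difference.

```python
import pandas as pd

# Hypothetical Series of player IDs with a non-default integer index
ids = pd.Series(["a01", "b01", "c01"], index=[10, 11, 12])

in_index = "a01" in ids                 # tests the index labels (10, 11, 12)
in_values = "a01" in ids.values         # tests the underlying values
isin_check = ids.isin(["a01"]).any()    # vectorized test on the values

print(in_index, in_values, isin_check)
```

Checking against `.values` or using `isin` is the safer way to confirm whether a given playerID survived a filter.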



## Join Other Tables

### Calculate Longest Stint

Next, we will determine, for each player and year, which stint had the most games played. This will be used to attribute Awards to that player.

In [40]:
# Identify which stint has the most games played per player per year
df['longest_stint'] = 0

# Group by playerID and year, and get the index of the max GP within each group
idx = df.groupby(['playerID', 'year'])['GP'].idxmax()

# Update the 'longest_stint' column to 1 for the rows with the max GP within each group
df.loc[idx, 'longest_stint'] = 1
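
The `idxmax` pattern can be verified on a small example. In this sketch the data is hypothetical: player `a01` was traded during 2000 and has two stints that year.

```python
import pandas as pd

# Hypothetical stints: a01 has two stints in 2000
toy = pd.DataFrame({
    "playerID": ["a01", "a01", "a01", "b01"],
    "year":     [2000, 2000, 2001, 2000],
    "stint":    [1, 2, 1, 1],
    "GP":       [30.0, 45.0, 80.0, 70.0],
})

toy["longest_stint"] = 0

# Index label of the row with the most GP in each (playerID, year) group
idx = toy.groupby(["playerID", "year"])["GP"].idxmax()
toy.loc[idx, "longest_stint"] = 1

print(toy)
```

When two stints in the same year have the same GP, `idxmax` returns the first of them, so exactly one stint per player and year is flagged.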

### Joining Awards

Awards for each player will be attributed to the stint where the player played the most games that season. Awards will be joined to Scoring on playerID and year where longest_stint = 1.  Instead of joining actual award information, we will just load the number of awards earned by the player that year.

In [41]:
# Load Awards table
df_awards = pd.read_csv('Data/Silver/Awards.csv')

# Find award count per player per year
df_awards = df_awards.groupby(['playerID', 'year']).size().reset_index(name='award_count')

# Ensure 'playerID' and 'year' are of the same type in both DataFrames
df['playerID'] = df['playerID'].astype(str)
df['year'] = df['year'].astype(int)
df_awards['playerID'] = df_awards['playerID'].astype(str)
df_awards['year'] = df_awards['year'].astype(int)

# Filter the main dataframe to include only the longest stint
filtered_df = df[df['longest_stint'] == 1]

# Merge the filtered df with the aggregated awards table on playerID and year
merged_df = pd.merge(filtered_df, df_awards, on=['playerID', 'year'], how='left')

# Fill NaN values in award_count with 0
merged_df['award_count'] = merged_df['award_count'].fillna(0)

# Assign the merged dataframe back to the original dataframe to include the new column
df.loc[filtered_df.index, 'award_count'] = merged_df['award_count'].values

# Fill NaN values in the original DataFrame's award_count with 0
df['award_count'] = df['award_count'].fillna(0)

# remove longest_stint as it is no longer needed
df = df.drop(columns=['longest_stint'])
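
For comparison, the same attribution rule can be written as a single merge followed by zeroing out the shorter stints. This is only an alternative sketch on made-up data, not the approach used in the cell above; the award names and player IDs are hypothetical.

```python
import pandas as pd

# Hypothetical stints with the longest stint of each year already flagged
stints = pd.DataFrame({
    "playerID":      ["a01", "a01", "b01"],
    "year":          [2000, 2000, 2000],
    "GP":            [30.0, 45.0, 70.0],
    "longest_stint": [0, 1, 1],
})

# Hypothetical awards: a01 won two awards in 2000, b01 won none
awards = pd.DataFrame({
    "playerID": ["a01", "a01"],
    "year":     [2000, 2000],
    "award":    ["X", "Y"],
})

# One row per player and year with the number of awards
counts = awards.groupby(["playerID", "year"]).size().reset_index(name="award_count")

# Left merge keeps every stint; players without awards get NaN, then 0
merged = stints.merge(counts, on=["playerID", "year"], how="left")
merged["award_count"] = merged["award_count"].fillna(0)

# Only the longest stint of the year keeps the awards
merged.loc[merged["longest_stint"] != 1, "award_count"] = 0

print(merged)
```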

### Joining Teams

Teams will join on tmID and year in both Teams and Scoring. From Teams, we will add the name column to Scoring as team_name.

In [42]:
# Load Teams table
teams_df = pd.read_csv('Data/Silver/Teams.csv')

# Rename the 'name' column to 'team_name'
teams_df.rename(columns={'name': 'team_name'}, inplace=True)

# Merge the main dataframe with the Teams dataframe on 'year' and 'tmID'
# Note: the result is held in merged_df for inspection; df itself is not reassigned in this cell
merged_df = pd.merge(df, teams_df, on=['year', 'tmID'], how='left')

# Print the first few rows of the merged dataframe to verify the merge
print("Merged DataFrame:\n", merged_df.head())

Merged DataFrame:
     playerID  year  stint tmID pos    GP    G     A   Pts   PIM  ...  SHG  \
0  abdelju01  2009      1  DET   L  50.0  3.0   3.0   6.0  35.0  ...  0.0   
1  abdelju01  2010      1  DET   L  74.0  7.0  12.0  19.0  61.0  ...  0.0   
2  abdelju01  2011      1  DET   L  81.0  8.0  14.0  22.0  62.0  ...  0.0   
3   abelcl01  1926      1  NYR   D  44.0  8.0   4.0  12.0  78.0  ...  0.0   
4   abelcl01  1927      1  NYR   D  23.0  0.0   1.0   1.0  28.0  ...  0.0   

   SHA  GWG    SOG  stint_index  career_games  total_GP_prior_to_stint  \
0  0.0  0.0   79.0            0         209.0                      4.0   
1  1.0  1.0  129.0            1         209.0                     54.0   
2  0.0  1.0  121.0            2         209.0                    128.0   
3  0.0  0.0    0.0            0         333.0                      0.0   
4  0.0  0.0    0.0            1         333.0                     44.0   

   percent_through_career  award_count          team_name  
0            

### Joining Skaters

In this step, we will add much of the biographical data kept in Skaters into the Scoring table. This will include birth country, height, weight, and others.

In [43]:
# Load Skaters table
skaters_df = pd.read_csv('Data/Silver/Skaters.csv')

# Print the first few rows of the skaters dataframe to verify its contents
print("Skaters DataFrame:\n", skaters_df.head())

# Merge the main dataframe with the Skaters dataframe on 'playerID'
df = pd.merge(df, skaters_df, on='playerID', how='left')

# Verify the resulting DataFrame
print("Merged DataFrame:\n", df.head())

Skaters DataFrame:
     playerID firstName    lastName  height  weight  birthYear birthCountry
0  aaltoan01     Antti       Aalto    73.0   210.0     1975.0      Finland
1  abbeybr01     Bruce       Abbey    73.0   185.0     1951.0       Canada
2  abbotre01       Reg      Abbott    71.0   164.0     1930.0       Canada
3  abdelju01    Justin  Abdelkader    73.0   195.0     1987.0          USA
4   abelcl01  Clarence        Abel    73.0   225.0     1900.0          USA
Merged DataFrame:
     playerID  year  stint tmID pos    GP    G     A   Pts   PIM  ...  \
0  abdelju01  2009      1  DET   L  50.0  3.0   3.0   6.0  35.0  ...   
1  abdelju01  2010      1  DET   L  74.0  7.0  12.0  19.0  61.0  ...   
2  abdelju01  2011      1  DET   L  81.0  8.0  14.0  22.0  62.0  ...   
3   abelcl01  1926      1  NYR   D  44.0  8.0   4.0  12.0  78.0  ...   
4   abelcl01  1927      1  NYR   D  23.0  0.0   1.0   1.0  28.0  ...   

   career_games  total_GP_prior_to_stint  percent_through_career  award_count 

## Calculated Fields

In this section, the calculated fields that will be used to evaluate player performance will be introduced.

### Per-Game Averages

In order to compare stints with each other, they need to be normalized by the number of games played.

In [44]:
# List of columns to convert to per-game averages
stats_columns = ['G', 'A', 'Pts', 'PIM', 'PPG', 'PPA', 'SHG', 'SHA', 'GWG', 'SOG', 'award_count']
per_game_columns = [col + '_per_game' for col in stats_columns]

# '+/-' is scored from its raw value rather than a per-game average
per_game_columns.append('+/-')

# Calculate per-game averages and add them as new columns
for col in stats_columns:
    df[col + '_per_game'] = df[col] / df['GP']

df.head()

    playerID  year  stint tmID pos    GP    G     A   Pts   PIM  ...  \
0  abdelju01  2009      1  DET   L  50.0  3.0   3.0   6.0  35.0  ...   
1  abdelju01  2010      1  DET   L  74.0  7.0  12.0  19.0  61.0  ...   
2  abdelju01  2011      1  DET   L  81.0  8.0  14.0  22.0  62.0  ...   
3   abelcl01  1926      1  NYR   D  44.0  8.0   4.0  12.0  78.0  ...   
4   abelcl01  1927      1  NYR   D  23.0  0.0   1.0   1.0  28.0  ...   

   A_per_game  Pts_per_game  PIM_per_game  PPG_per_game  PPA_per_game  \
0    0.060000      0.120000      0.700000           0.0           0.0   
1    0.162162      0.256757      0.824324           0.0           0.0   
2    0.172840      0.271605      0.765432           0.0           0.0   
3    0.090909      0.272727      1.772727           0.0           0.0   
4    0.043478      0.043478      1.217391           0.0           0.0   

   SHG_per_game  SHA_per_game  GWG_per_game  SOG_per_game  award_count_per_game  
0           0.0      0.000000      0.000000      1.580000                   0.0  
1           0.0      0.013514      0.013514      1.743243                   0.0  
2           0.0      0.000000      0.012346      1.493827                   0.0  
3           0.0      0.000000      0.000000      0.000000                   0.0  
4           0.0      0.000000      0.000000      0.000000                   0.0  


### Normalize Per-Game Averages

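In this step, each player's per-game averages will be standardized across their career, such that each stat has a mean of 0 and a standard deviation of 1 within that player's stints. A stint's score in a stat is therefore the number of standard deviations it sits above or below the player's own career average.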

In [45]:
from sklearn.preprocessing import StandardScaler

# Function to standardize a series
def standardize_series(series):
    scaler = StandardScaler()
    series_standardized = scaler.fit_transform(series.values.reshape(-1, 1))
    return pd.Series(series_standardized.flatten(), index=series.index)

# Group by playerID and standardize per-game columns
df_standardized = df.copy()
for col in per_game_columns:
    df_standardized[col + '_score'] = df.groupby('playerID')[col].transform(lambda x: standardize_series(x))

# Update df
df = df_standardized

# Print the updated DataFrame
print(df)
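
The same per-player standardization can be expressed with pandas alone, which makes the arithmetic easier to inspect. This is a minimal sketch on hypothetical data, assuming the population standard deviation (`ddof=0`), which is what `StandardScaler` uses, and assuming a constant stat should score 0, which is also how `StandardScaler` treats zero-variance input. The helper name `zscore` is introduced only for this example.

```python
import pandas as pd

# Hypothetical per-game averages; b01 has the same value in every stint
toy = pd.DataFrame({
    "playerID":   ["a01", "a01", "a01", "b01", "b01"],
    "G_per_game": [0.25, 0.5, 0.75, 0.5, 0.5],
})

def zscore(series):
    """Standardize one player's values to mean 0 and standard deviation 1."""
    std = series.std(ddof=0)  # population standard deviation
    if std == 0:
        # A constant stat carries no information about better or worse stints
        return pd.Series(0.0, index=series.index)
    return (series - series.mean()) / std

toy["G_per_game_score"] = toy.groupby("playerID")["G_per_game"].transform(zscore)

print(toy)
```

For `a01` the three stints score about -1.22, 0, and 1.22. For `b01` both stints score 0, since there is no variation to measure.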

### Calculate Overall Score for Stint

In this step, all score columns are added together to form a score for the player. Optional weights are included as well.  All weights will be coefficients to the different score columns when calculating the overall score.

In [None]:
# Define Weights
G_weight = 1
A_weight = 1
Pts_weight = 1
PIM_weight = 1
plus_minus_weight = 1  # Converted +/- to plus_minus
PPG_weight = 1
PPA_weight = 1
SHG_weight = 1
SHA_weight = 1
GWG_weight = 1
SOG_weight = 1
award_count_weight = 1

# Calculate stint_score
df['stint_score'] = (
    (df['G_per_game_score'] * G_weight) +
    (df['A_per_game_score'] * A_weight) +
    (df['Pts_per_game_score'] * Pts_weight) +
    (df['PIM_per_game_score'] * PIM_weight) +
    (df['PPG_per_game_score'] * PPG_weight) +
    (df['PPA_per_game_score'] * PPA_weight) +
    (df['SHG_per_game_score'] * SHG_weight) +
    (df['SHA_per_game_score'] * SHA_weight) +
    (df['GWG_per_game_score'] * GWG_weight) +
    (df['SOG_per_game_score'] * SOG_weight) +
    (df['+/-_score'] * plus_minus_weight) +
    (df['award_count_per_game_score'] * award_count_weight)
)
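
If the list of stats grows, the weighted sum can also be driven by a dictionary that maps each score column to its weight. The sketch below is one possible arrangement on hypothetical data. The weight values are assumptions for illustration only, including the negative weight on penalty minutes.

```python
import pandas as pd

# Hypothetical standardized scores for two stints
toy = pd.DataFrame({
    "G_per_game_score":   [1.0, -1.0],
    "A_per_game_score":   [0.5, -0.5],
    "PIM_per_game_score": [2.0, 0.0],
})

# Hypothetical weights: a negative weight makes a high score count against the stint
weights = {
    "G_per_game_score": 1.0,
    "A_per_game_score": 1.0,
    "PIM_per_game_score": -0.5,
}

# Weighted sum of the score columns
toy["stint_score"] = sum(toy[col] * w for col, w in weights.items())

print(toy)
```

Keeping the weights in one dictionary means a stat can be added, removed, or reweighted without editing the sum itself.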

### Compare the stint_score to the Previous stint_score

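In this step, each stint is compared to the player's previous stint, using stint_index to identify the previous one. A new column records whether the overall score of the current stint is higher or lower than that of the previous stint, and holds the value "better" or "worse". A player's first stint has no previous stint and is left null; a stint whose score equals the previous one is marked "worse".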

In [None]:

# Sorting the dataframe by playerID and stint_index
df = df.sort_values(by=['playerID', 'stint_index'])

# Previous stint's score within each player (NaN for a player's first stint)
prev_stint_score = df.groupby('playerID')['stint_score'].shift()

# Initialize the comparison column with null values
df['stint_vs_prev_stint'] = None

# Label every stint after the first: 'better' if the score rose, otherwise 'worse'
has_prev = df['stint_index'] != 0
is_better = df['stint_score'] > prev_stint_score
df.loc[has_prev & is_better, 'stint_vs_prev_stint'] = 'better'
df.loc[has_prev & ~is_better, 'stint_vs_prev_stint'] = 'worse'

# Display the updated dataframe
print(df.head())

    playerID  year  stint tmID pos    GP    G     A   Pts   PIM  ...  \
0  abdelju01  2009      1  DET   L  50.0  3.0   3.0   6.0  35.0  ...   
1  abdelju01  2010      1  DET   L  74.0  7.0  12.0  19.0  61.0  ...   
2  abdelju01  2011      1  DET   L  81.0  8.0  14.0  22.0  62.0  ...   
3   abelcl01  1926      1  NYR   D  44.0  8.0   4.0  12.0  78.0  ...   
4   abelcl01  1927      1  NYR   D  23.0  0.0   1.0   1.0  28.0  ...   

   PPG_per_game_score  PPA_per_game_score  SHG_per_game_score  \
0                 0.0                 0.0                 0.0   
1                 0.0                 0.0                 0.0   
2                 0.0                 0.0                 0.0   
3                 0.0                 0.0                 0.0   
4                 0.0                 0.0                 0.0   

   SHA_per_game_score  GWG_per_game_score  SOG_per_game_score  \
0           -0.707107           -1.409907           -0.248378   
1            1.414214            0.800463     
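
The comparison rule can be checked on a handful of rows. This is a minimal sketch on hypothetical scores that uses a grouped `shift` to line each stint up with the one before it.

```python
import pandas as pd

# Hypothetical stint scores, already sorted by playerID and stint_index
toy = pd.DataFrame({
    "playerID":    ["a01", "a01", "a01", "b01", "b01"],
    "stint_index": [0, 1, 2, 0, 1],
    "stint_score": [1.0, 2.5, 2.5, -0.5, -1.0],
})

# Score of the previous stint for the same player (NaN for the first stint)
prev = toy.groupby("playerID")["stint_score"].shift()

toy["stint_vs_prev_stint"] = None
has_prev = prev.notna()
toy.loc[has_prev & (toy["stint_score"] > prev), "stint_vs_prev_stint"] = "better"
toy.loc[has_prev & (toy["stint_score"] <= prev), "stint_vs_prev_stint"] = "worse"

print(toy)
```

The third row ties with the second and is therefore labelled "worse", and the first stint of each player stays null.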

### Number of Teammates with Same Nationality

In [None]:
# Group by the specified columns and count the occurrences
group_counts = df.groupby(['tmID', 'year', 'birthCountry']).size().reset_index(name='count')

# Subtract 1 so that the player is not counted as their own teammate
group_counts['count'] -= 1

# Merge the count back into the original dataframe
df = df.merge(group_counts, on=['tmID', 'year', 'birthCountry'], how='left')

# Rename the count column to something more descriptive
df.rename(columns={'count': 'teammates_same_nationality'}, inplace=True)

# Display the updated dataframe
print(df.head())

    playerID  year  stint tmID pos    GP    G     A   Pts   PIM  ...  \
0  abdelju01  2009      1  DET   L  50.0  3.0   3.0   6.0  35.0  ...   
1  abdelju01  2010      1  DET   L  74.0  7.0  12.0  19.0  61.0  ...   
2  abdelju01  2011      1  DET   L  81.0  8.0  14.0  22.0  62.0  ...   
3   abelcl01  1926      1  NYR   D  44.0  8.0   4.0  12.0  78.0  ...   
4   abelcl01  1927      1  NYR   D  23.0  0.0   1.0   1.0  28.0  ...   

   PPA_per_game_score  SHG_per_game_score  SHA_per_game_score  \
0                 0.0                 0.0           -0.707107   
1                 0.0                 0.0            1.414214   
2                 0.0                 0.0           -0.707107   
3                 0.0                 0.0            0.000000   
4                 0.0                 0.0            0.000000   

   GWG_per_game_score  SOG_per_game_score  award_count_per_game_score  \
0           -1.409907           -0.248378                         0.0   
1            0.800463         
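
A small example makes the counting rule concrete. The sketch below uses hypothetical rosters and a grouped `transform` in place of the group-then-merge steps, which gives the same count per row.

```python
import pandas as pd

# Hypothetical rosters for a single season
toy = pd.DataFrame({
    "playerID":     ["a01", "b01", "c01", "d01", "e01"],
    "tmID":         ["DET", "DET", "DET", "NYR", "NYR"],
    "year":         [2010, 2010, 2010, 2010, 2010],
    "birthCountry": ["USA", "USA", "Canada", "Canada", "Canada"],
})

# Rows sharing team, year, and birth country, minus 1 for the player themself
toy["teammates_same_nationality"] = (
    toy.groupby(["tmID", "year", "birthCountry"])["playerID"].transform("count") - 1
)

print(toy)
```

Note that the count only sees rows still present in the dataframe, so teammates whose stints were removed by the earlier filters are not included.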

### Compare the Stint to the Previous Stint - teammates_same_nationality

In [None]:
# Sorting the dataframe by playerID and stint_index
df = df.sort_values(by=['playerID', 'stint_index'])

# Previous stint's teammates_same_nationality within each player (NaN for the first stint)
prev_tsn = df.groupby('playerID')['teammates_same_nationality'].shift()

# Initialize the comparison column with null values
df['tsm_vs_prev_stint'] = None

# Label every stint after the first; 'no change' unless the count rose or fell
has_prev = df['stint_index'] != 0
df.loc[has_prev, 'tsm_vs_prev_stint'] = 'no change'
df.loc[has_prev & (df['teammates_same_nationality'] > prev_tsn), 'tsm_vs_prev_stint'] = 'more'
df.loc[has_prev & (df['teammates_same_nationality'] < prev_tsn), 'tsm_vs_prev_stint'] = 'fewer'

# Display the updated dataframe
print(df.head())                

    playerID  year  stint tmID pos    GP    G     A   Pts   PIM  ...  \
0  abdelju01  2009      1  DET   L  50.0  3.0   3.0   6.0  35.0  ...   
1  abdelju01  2010      1  DET   L  74.0  7.0  12.0  19.0  61.0  ...   
2  abdelju01  2011      1  DET   L  81.0  8.0  14.0  22.0  62.0  ...   
3   abelcl01  1926      1  NYR   D  44.0  8.0   4.0  12.0  78.0  ...   
4   abelcl01  1927      1  NYR   D  23.0  0.0   1.0   1.0  28.0  ...   

   SHG_per_game_score  SHA_per_game_score  GWG_per_game_score  \
0                 0.0           -0.707107           -1.409907   
1                 0.0            1.414214            0.800463   
2                 0.0           -0.707107            0.609444   
3                 0.0            0.000000            0.000000   
4                 0.0            0.000000            0.000000   

   SOG_per_game_score  award_count_per_game_score  +/-_score  stint_score  \
0           -0.248378                         0.0  -1.282503    -9.118580   
1            1.329897 

### Age
We will calculate an approximate age for the player. This will be equal to the difference between the year column and the birthYear column.

In [None]:
df['age'] = df['year'] - df['birthYear']

## Save Gold CSV

In [None]:
import os

file_path = 'Data/Gold/main.csv'

df.to_csv(file_path, index=False)

file_size = os.path.getsize(file_path)

# Convert the file size to a more readable format (e.g., kilobytes or megabytes)
file_size_kb = file_size / 1024
file_size_mb = file_size_kb / 1024

# Display the file size
print('=============main=================')
print(f'The file size is {file_size} bytes')
print(f'The file size is {file_size_kb:.2f} KB')
print(f'The file size is {file_size_mb:.2f} MB')
print('===================================')
print(df.shape)

The file size is 12401333 bytes
The file size is 12110.68 KB
The file size is 11.83 MB
(24959, 56)
