# Gold Data Load

In this notebook, the Silver tables are combined into a single dataframe, which is then modified in several ways:

Players with fewer than 100 total games played (a threshold the user can adjust) will be omitted, because we are attempting to track year-over-year performance for career NHL players; players who do not play more than 2 complete seasons would likely be unhelpful.  Additionally, players who have appeared in the NHL in fewer than 3 seasons (also adjustable by the user) will be excluded.  Stints shorter than 10 games will be excluded as well: these are small sample sizes, and a player's performance during such a brief period could be high or low due to normal variance.

Aggregated data will be computed for each player.  This data will include items like "total games played", "goals per game per stint", "number of same birth country teammates per playerID & year combination", and "number of years played in NHL".
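
As a rough illustration of the "same birth country teammates" aggregate, the sketch below counts, for each stint, how many other players on the same team in the same year share the player's birth country. The toy data and the `birthCountry` and `same_country_teammates` column names are assumptions made for this example; they are not taken from the Silver tables.

```python
import pandas as pd

# Toy stint rows. 'birthCountry' is a hypothetical column, assumed to have
# been joined on from a player-level table.
stints = pd.DataFrame({
    'playerID':     ['a', 'b', 'c', 'd', 'a'],
    'year':         [2000, 2000, 2000, 2000, 2001],
    'tmID':         ['ANA', 'ANA', 'ANA', 'DET', 'DET'],
    'birthCountry': ['Finland', 'Finland', 'Canada', 'Finland', 'Finland'],
})

# Distinct players from the same country on the same team in the same year,
# minus 1 so that a player is not counted as their own teammate.
group_cols = ['year', 'tmID', 'birthCountry']
stints['same_country_teammates'] = (
    stints.groupby(group_cols)['playerID'].transform('nunique') - 1
)

print(stints)
```

This gives one value per stint. A player with several stints in one year would still need those values reduced to a single playerID and year figure, and how best to do that is left open here.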

Then, each player's stats will be normalized so that their worst stint in a given stat is recorded as a 0, the best stint is a 1, and all other stints fall in between.  An overall score for each stint will then be calculated by adding these normalized stats together (with optional weights to increase or reduce the impact of certain stats).  We will then be able to compare the overall score for that stint with the number of teammates with the same birth country to study the correlation between these two items.
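
The normalization and scoring step can be sketched as a per-player min-max scaling followed by a weighted sum. The stat columns, the weights, and the choice to score a stat as 0 when a player's stints all have the same value are assumptions made for this illustration.

```python
import pandas as pd

# Toy per-stint rates for two players (hypothetical column names).
stints = pd.DataFrame({
    'playerID': ['a', 'a', 'a', 'b', 'b'],
    'G_per_GP': [0.25, 0.75, 0.50, 0.50, 0.50],
    'A_per_GP': [0.50, 0.25, 0.375, 0.125, 0.25],
})
stat_cols = ['G_per_GP', 'A_per_GP']
weights = pd.Series({'G_per_GP': 1.0, 'A_per_GP': 0.5})  # hypothetical weights

# Each player's worst and best value per stat, broadcast back to every stint.
grouped = stints.groupby('playerID')[stat_cols]
lo = grouped.transform('min')
hi = grouped.transform('max')
spread = hi - lo

# Min-max scale within each player. Where a player's stints are all equal the
# spread is 0, so the division is masked and the result set to 0 (assumption).
# Note that fillna would also turn missing stats into 0.
normalized = ((stints[stat_cols] - lo) / spread.where(spread != 0)).fillna(0.0)

# Weighted sum of the normalized stats gives one overall score per stint.
stints['overall_score'] = normalized.mul(weights, axis='columns').sum(axis=1)

print(stints)
```

Because the scaling is done within each player, a score of 1 in a stat means "this player's best stint", not "best in the league".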

<hr>

# Gold Process
<hr>

## Build Wide Table

### Load Silver Scoring

We will start by loading the Scoring table, as it is the main fact table we will be using. We will then drop any rows where GP is 0 or NaN.  A few stints are recorded with 0 games played, and these should not be included.

In [18]:
import pandas as pd

# Scoring is main table and will be loaded first
df = pd.read_csv('Data/Silver/Scoring.csv')

# Remove rows where GP is NaN or 0
df = df.dropna(subset=['GP'])
df = df[df['GP'] > 0]


### Drop Players With Too Few Total GP Or Years

In [19]:
# calculate total GP
total_gp_per_player = df.groupby('playerID')['GP'].sum()

# identify players with too few games
game_threshold = 100
players_to_remove = total_gp_per_player[total_gp_per_player < game_threshold].index

# Remove rows from df with these players
df = df[~df['playerID'].isin(players_to_remove)]
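

The cell above applies only the games-played threshold. The season-count exclusion described in the introduction could be written in the same style; the sketch below uses toy data, and `season_threshold`, `seasons_per_player`, and `players_to_keep` are hypothetical names. It assumes a "season" is a distinct value of `year` for a player.

```python
import pandas as pd

# Toy stand-in for the Scoring dataframe.
df = pd.DataFrame({
    'playerID': ['a', 'a', 'a', 'a', 'b', 'b', 'c', 'c', 'c'],
    'year':     [2000, 2000, 2001, 2002, 2000, 2001, 1999, 2000, 2001],
    'GP':       [40, 30, 82, 80, 82, 82, 60, 70, 75],
})

# Minimum number of distinct seasons a player must appear in (adjustable).
season_threshold = 3

# Count distinct years per player, so two stints in one year count once.
seasons_per_player = df.groupby('playerID')['year'].nunique()

# Keep only players who meet the threshold.
players_to_keep = seasons_per_player[seasons_per_player >= season_threshold].index
df = df[df['playerID'].isin(players_to_keep)]

print(df)
```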

### Drop Stints With Fewer Than 10 Games Played

In [29]:
df = df[df['GP'] >= 10]

### Calculate Longest Stint

Next, we will determine, for each player and year, which stint had the most games played.  This will be used to attribute awards to that stint.

In [30]:
# Identify which stint has the most games played per player per year
df['longest_stint'] = 0

# Group by playerID and year, and get the index of the max GP within each group
idx = df.groupby(['playerID', 'year'])['GP'].idxmax()

# Update the 'longest_stint' column to 1 for the rows with the max GP within each group
df.loc[idx, 'longest_stint'] = 1

### Joining Awards

Awards for each player will be attributed to the stint where the player played the most games that season. Awards will be joined to Scoring on playerID and year where longest_stint = 1.  Instead of joining actual award information, we will just load the number of awards earned by the player that year.

In [31]:
# load Awards table
df_awards = pd.read_csv('Data/Silver/Awards.csv')

# find award count per player per year
df_awards = df_awards.groupby(['playerID', 'year']).size().reset_index(name='award_count')

# keep only each player's longest stint per year
filtered_df = df[df['longest_stint'] == 1]

# Merge the filtered df with the aggregated awards table on playerID and year
merged_df = pd.merge(filtered_df, df_awards, on=['playerID', 'year'], how='left')

# Update the original df with the merged information
df.update(merged_df)

# Verify the resulting DataFrame
print(df.head())

    playerID  year  stint tmID pos    GP    G     A   Pts   PIM   +/-  PPG  \
1  aaltoan01  1999      1  ANA   C  63.0  7.0  11.0  18.0  26.0 -13.0  1.0   
2  aaltoan01  2000      1  ANA   C  12.0  1.0   1.0   2.0   2.0   1.0  0.0   
3  abdelju01  2009      1  DET   L  50.0  3.0   3.0   6.0  35.0 -11.0  0.0   
6   abelcl01  1926      1  NYR   D  44.0  8.0   4.0  12.0  78.0   4.0  0.0   
7   abelcl01  1927      1  NYR   D  23.0  0.0   1.0   1.0  28.0  15.0  0.0   

   PPA  SHG  SHA  GWG  GTG    SOG  longest_stint  
1  0.0  0.0  0.0  1.0  0.0  102.0              1  
2  0.0  0.0  0.0  0.0  0.0   18.0              1  
3  0.0  0.0  0.0  0.0  0.0   79.0              1  
6  0.0  0.0  0.0  1.0  NaN  121.0              1  
7  0.0  0.0  1.0  1.0  NaN  129.0              1  
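
The printed output above has no `award_count` column. That is consistent with how `DataFrame.update` behaves: it aligns on index labels and only overwrites columns that already exist in the calling dataframe, so a new column is not added. In addition, `pd.merge` returns a fresh 0..n-1 index, so the aligned labels may no longer refer to the same rows. A minimal sketch of one alternative, using toy data, is to left-merge the counts onto every stint and then zero them out for all but the longest stint. Treating "no award" as 0 is an assumption of this sketch.

```python
import pandas as pd

# Toy stand-in for the filtered Scoring dataframe, with non-consecutive
# index labels as left behind by earlier row filtering.
df = pd.DataFrame({
    'playerID':      ['a', 'a', 'b', 'c'],
    'year':          [2000, 2000, 2000, 2000],
    'stint':         [1, 2, 1, 1],
    'GP':            [50, 20, 82, 60],
    'longest_stint': [1, 0, 1, 1],
}, index=[1, 2, 5, 7])

# Toy stand-in for the aggregated awards table.
df_awards = pd.DataFrame({
    'playerID':    ['a', 'b'],
    'year':        [2000, 2000],
    'award_count': [2, 1],
})

# A left merge keeps every stint row and adds the award_count column.
merged = df.merge(df_awards, on=['playerID', 'year'], how='left')

# Players with no awards that year get 0 (assumption) instead of NaN.
merged['award_count'] = merged['award_count'].fillna(0).astype(int)

# Attribute awards only to the longest stint of each player-year.
merged.loc[merged['longest_stint'] != 1, 'award_count'] = 0

print(merged)
```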
