# Gold Data Load

In this notebook, the Silver tables will be combined into a single dataframe, which will then be modified in several ways:

Aggregated data will be computed for each player.  This data will include items like "total games played", "goals per game per stint", "number of same birth country teammates per playerID & year combination", and "number of years played in NHL".
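As a rough illustration of these aggregates, the sketch below computes them on a toy dataframe. The column names (`playerID`, `year`, `stint`, `tmID`, `GP`, `G`, `birthCountry`) and the output names are assumptions made for the example, and counting same-country teammates per team and year is only one possible reading of the description above.

```python
import pandas as pd

# Toy stand-in for the combined Silver tables; all column names are assumed.
toy = pd.DataFrame({
    "playerID":     ["a", "a", "a", "b", "c"],
    "year":         [2000, 2001, 2001, 2000, 2000],
    "stint":        [1, 1, 2, 1, 1],
    "tmID":         ["X", "X", "Y", "X", "X"],
    "GP":           [80, 40, 30, 60, 70],
    "G":            [20, 10, 3, 6, 7],
    "birthCountry": ["CA", "CA", "CA", "CA", "SE"],
})

# Goals per game for each stint (one row per stint)
toy["goals_per_game"] = toy["G"] / toy["GP"]

# Career totals per player: total games played and number of distinct years
career = toy.groupby("playerID").agg(
    total_games=("GP", "sum"),
    years_played=("year", "nunique"),
)

# Teammates on the same team in the same year who share a birth country:
# count distinct players in the group, then subtract the player themself
toy["same_country_teammates"] = (
    toy.groupby(["year", "tmID", "birthCountry"])["playerID"].transform("nunique") - 1
)
```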

Players with fewer than 100 total games played (a threshold the user can adjust) will be omitted, as we are attempting to track year-over-year performance for career NHL players.  Players who have not played more than 2 complete seasons would likely be unhelpful.  Additionally, players who have appeared in the NHL in fewer than 3 seasons (also adjustable by the user) will be excluded.  Stints shorter than 10 games will be excluded as well, as these are small sample sizes and the player's performance during that brief period could be high or low due to normal variance.
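A minimal sketch of these three filters follows. The function name `filter_players`, the constant names, and the column names are hypothetical, and the sketch assumes that career totals are computed before short stints are dropped, which the description above does not specify.

```python
import pandas as pd

# User-adjustable thresholds (names are hypothetical)
MIN_TOTAL_GAMES = 100
MIN_SEASONS = 3
MIN_STINT_GAMES = 10

def filter_players(df, min_total_games=MIN_TOTAL_GAMES,
                   min_seasons=MIN_SEASONS, min_stint_games=MIN_STINT_GAMES):
    """Keep stints of established players, dropping very short stints."""
    # Career totals, broadcast back onto every stint row of the player
    total_games = df.groupby("playerID")["GP"].transform("sum")
    seasons = df.groupby("playerID")["year"].transform("nunique")
    keep = (
        (total_games >= min_total_games)
        & (seasons >= min_seasons)
        & (df["GP"] >= min_stint_games)
    )
    return df[keep].copy()

# Toy data: only player "a" meets both career thresholds
stints = pd.DataFrame({
    "playerID": ["a", "a", "a", "b", "b", "c", "c", "c"],
    "year":     [2000, 2001, 2002, 2000, 2001, 2000, 2001, 2002],
    "GP":       [50, 5, 60, 80, 80, 20, 20, 20],
})
filtered = filter_players(stints)
```

Here player "b" is dropped for appearing in only two seasons, player "c" for having fewer than 100 total games, and the 5-game stint of player "a" for being too short.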

Then, each player's stats will be normalized so that their worst stint in a given stat is recorded as a 0, the best stint is a 1, and all other stints fall in between.  An overall score for each stint will then be calculated by adding these normalized stats together (with optional weights to increase or reduce the impact of certain stats). We will then be able to compare the overall score for that stint with the number of teammates with the same birth country to study the correlation between these two items.
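The normalization and scoring described above might be sketched as follows. The helper names, the stat columns (`goals_per_game`, `assists_per_game`), and the weights are all illustrative assumptions. Leaving a stat as NaN when a player's value never varies is a choice made for this sketch, not something the description prescribes.

```python
import pandas as pd

def minmax_per_player(df, stat_cols):
    """Scale each stat to [0, 1] within each player's own stints."""
    grouped = df.groupby("playerID")[stat_cols]
    lo = grouped.transform("min")
    hi = grouped.transform("max")
    span = hi - lo
    # A zero span (stat never changes) becomes NaN to avoid dividing by zero
    return (df[stat_cols] - lo) / span.where(span != 0)

def overall_score(normalized, weights):
    """Weighted sum of normalized stats; NaN stats are skipped by sum()."""
    return (normalized * pd.Series(weights)).sum(axis=1)

# Toy data for one player with three stints (column names are assumed)
stats = pd.DataFrame({
    "playerID":         ["a", "a", "a"],
    "goals_per_game":   [0.25, 0.75, 0.50],
    "assists_per_game": [0.50, 1.00, 1.50],
})
stat_cols = ["goals_per_game", "assists_per_game"]
normalized = minmax_per_player(stats, stat_cols)
scores = overall_score(normalized, {"goals_per_game": 1.0, "assists_per_game": 0.5})
```

With these weights, the second stint scores highest: it is the player's best stint for goals (1.0) and the midpoint for assists (0.5, weighted by 0.5).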

<hr>

# Gold Process
<hr>

## Build Wide Table

### Load Silver Scoring

We will start by loading the Scoring table, as it is the main fact table we will be using. We will then drop any rows where GP is 0 or NaN.  A few rows record a stint with 0 games played, and these should not be included.

In [10]:
import pandas as pd

# Scoring is main table and will be loaded first
df = pd.read_csv('Data/Silver/Scoring.csv')

# Remove rows where GP is NaN or 0
df = df.dropna(subset=['GP'])
df = df[df['GP'] > 0]


Next, we will determine which stint had the most games played for each player in each year.  This will be used to attribute Awards to that player.

In [12]:

# Identify which stint has the most games played per player per year
df['longest_stint'] = 0

# Group by playerID and year, and get the index of the max GP within each group
idx = df.groupby(['playerID', 'year'])['GP'].idxmax()

# Update the 'longest_stint' column to 1 for the rows with the max GP within each group
df.loc[idx, 'longest_stint'] = 1
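To make the behaviour of this step concrete, here is the same pattern on a small toy frame (the data is made up for illustration). Two details are worth noting: `idxmax` returns index labels, so the index should be unique for the `.loc` assignment to flag exactly one row per group, and when two stints tie on GP the first occurrence is the one selected.

```python
import pandas as pd

# Toy data: player "a" has two tied stints in 2001
toy = pd.DataFrame({
    "playerID": ["a", "a", "a", "b"],
    "year":     [2001, 2001, 2002, 2001],
    "stint":    [1, 2, 1, 1],
    "GP":       [41, 41, 82, 60],
})

# idxmax returns index labels, so labels must be unique
assert toy.index.is_unique

toy["longest_stint"] = 0
idx = toy.groupby(["playerID", "year"])["GP"].idxmax()
toy.loc[idx, "longest_stint"] = 1
```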


### Joining Awards

N