# Silver Data Layer

This notebook narrows the data set from every row and column in every table to the relevant rows and columns of the relevant tables. Coaches and goalies will not be reviewed, so their tables are not brought into the silver layer. Tables in this database also contain rows for games played in leagues other than the National Hockey League (NHL). Our focus is on NHL play only, so those rows are removed, including the non-NHL rows of players who eventually play in the NHL.
<hr>

NOTE: Some cells assume that the preceding cells were run immediately before them. These will be noted.

## Tables That Are Excluded


<strong>abbrev</strong> - this table contains abbreviations that are used throughout the rest of the database for things related to post-season rounds as well as the conferences and divisions that make up the league. We will not need these.

<strong>AwardsCoaches</strong> - this table has data related to the awards hockey coaches have won

<strong>AwardsMisc</strong> - this table has data for awards that will not be considered

<strong>Coaches</strong> - coaches are not a focus of this study

<strong>CombinedShutouts, Goalies, GoaliesSC, and GoaliesShootout</strong> - these are related to goalies and will not be used

<strong>HOF</strong> - this table has data about players who have been inducted into the Hall of Fame

<strong>ScoringSC</strong> - this table has data about Stanley Cup Final matchups, but no data related to specific players. 

<strong>ScoringShootout</strong> - this covers performance in the shootout, a feature that has been absent for much of the NHL's history.

<strong>ScoringSup</strong> - this table has data related to power play assists and short-handed assists.  This table covers a small number of years relative to the rest of the dataset.  Additionally, the grain does not match the player to a team - it is player totals by year.

<strong>SeriesPost</strong> - this contains data about team performance in the post-season each year. This study is reviewing personal player performance and not team success.

<strong>TeamsHalf</strong> - this table has data that compares the first and second half of the season for a small handful of teams over a small handful of years.  This is not relevant to our focus.

<strong>TeamSplits</strong> - this table has month-by-month data for each team (wins, losses, et al). This falls outside of the scope of this study.

<strong>TeamsPost</strong> - this table has post-season performance for each team (wins, losses, et al). This cannot be reliably attributed to individual players and cannot be used for this reason.

<strong>TeamsSC</strong> - this table has team information for teams that competed in team championships during a very brief window of time relative to the rest of the data set.

<strong>TeamVsTeam</strong> - this table has team matchup records for each year and team combination.

<hr>

## Tables That Are Included


<strong>AwardsPlayers</strong> - this contains data about significant awards that have been won by NHL players

<strong>Master</strong> - this is the primary data table for details about the individuals themselves (name, birthdate, height, weight, etc.). Rows related to goalies, Hall of Fame inductions, and coaches are removed in this Silver layer.

<strong>Scoring</strong> - this is the primary data table for each player's hockey performance. There is one row per year, per player, per stint with a specific team. Goals scored, assists, etc. are tracked here.

<strong>Teams</strong> - this has data related to the various teams in the NHL and other leagues. These can change from year to year as franchises change cities, etc., so there is one row per team per year.


<hr>

#  Bronze to Silver Process

At this point, the tables we are keeping will be loaded in one at a time, have their scope narrowed to match our purpose, and then saved as silver level files.

In [57]:
import pandas as pd

def Write_Silver_CSV(df, toName):
    '''
    This function will write a dataframe to the Silver data layer folder with the name provided in toName

    Parameters:
    df - pandas dataframe
    toName - string - destination file name
    '''
    df.to_csv(f'Data/Silver/{toName}.csv', index=False)
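
`to_csv` does not create missing directories, so the helper above assumes that `Data/Silver` already exists. A minimal sketch of a variant that creates the folder first is shown below. The name `write_silver_csv` and the `base_dir` parameter are hypothetical and are not part of the original notebook.

```python
from pathlib import Path

import pandas as pd


def write_silver_csv(df, to_name, base_dir='Data/Silver'):
    """Write df to <base_dir>/<to_name>.csv, creating the folder if needed.

    Hypothetical variant of Write_Silver_CSV; base_dir is an assumed parameter.
    """
    out_dir = Path(base_dir)
    # parents=True builds intermediate folders; exist_ok=True makes reruns safe
    out_dir.mkdir(parents=True, exist_ok=True)
    out_path = out_dir / f'{to_name}.csv'
    df.to_csv(out_path, index=False)
    return out_path
```

Returning the path is a small convenience so that a caller can log or re-read the file that was just written.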

## AwardsPlayers

For our purposes, we are interested in awards given to skaters (not goalies) in the NHL.  Also, because we have dropped the AwardsCoaches table, this table is renamed to just "Awards".

In [58]:
toName = 'Awards'

# load the bronze table
df = pd.read_csv('Data/Bronze/AwardsPlayers.csv')

# filter the dataframe to remove non-NHL awards and Goalie awards. Awards where the position is missing are excluded as well
filtered_df = df[(df['pos'] != 'G') & (df['pos'] != '') & (df['lgID'] == 'NHL') & (df['pos'].notna())]

# select desired columns
filtered_df = filtered_df[['playerID', 'award', 'year']]

# write the file to our silver folder
Write_Silver_CSV(filtered_df, toName)

filtered_df.head()

Unnamed: 0,playerID,award,year
0,abelsi01,First Team All-Star,1948
1,abelsi01,First Team All-Star,1949
3,abelsi01,Second Team All-Star,1941
4,abelsi01,Second Team All-Star,1950
5,alfreda01,All-Rookie,1995
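
The filter above combines four conditions. The sketch below replays the same logic on a small made-up frame (the rows are illustrative, not taken from the data set) to show why the explicit `notna()` check matters: a comparison such as `NaN != 'G'` evaluates to `True`, so a missing position would otherwise pass the goalie test.

```python
import numpy as np
import pandas as pd

# toy rows standing in for AwardsPlayers (values are illustrative only)
toy = pd.DataFrame({
    'playerID': ['p1', 'p2', 'p3', 'p4', 'p5'],
    'award': ['Hart', 'Vezina', 'Hart', 'Hart', 'Hart'],
    'year': [1990, 1990, 1991, 1992, 1993],
    'lgID': ['NHL', 'NHL', 'WHA', 'NHL', 'NHL'],
    'pos': ['C', 'G', 'D', np.nan, ''],
})

# a skater has a known, non-empty position that is not 'G'
is_skater = toy['pos'].notna() & (toy['pos'] != '') & (toy['pos'] != 'G')
is_nhl = toy['lgID'] == 'NHL'

kept = toy.loc[is_skater & is_nhl, ['playerID', 'award', 'year']]
print(kept)

# without notna() and the empty-string check, p4 (NaN) and p5 ('') slip through
without_checks = toy[(toy['pos'] != 'G') & (toy['lgID'] == 'NHL')]
print(without_checks['playerID'].tolist())
```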


In [59]:
num_rows, num_columns = filtered_df.shape

# Display the number of rows and columns
print(f'The DataFrame has {num_rows} rows and {num_columns} columns.')

filtered_df.dtypes

The DataFrame has 956 rows and 3 columns.


playerID    object
award       object
year         int64
dtype: object

## Master

This table originally served as a list of all players, coaches, and Hall of Fame inductees. We are interested in only the players, and more specifically, we are only interested in the skaters (non-goalie players).  We will remove all rows related to items other than skaters, select relevant columns, and save the file as Skaters.

In [60]:
toName = 'Skaters'

# load the bronze table
df = pd.read_csv('Data/Bronze/Master.csv')

# remove records not associated with players
filtered_df = df[(df['playerID'] != '') & (df['playerID'].notna())]

# remove goalies as well
filtered_df = filtered_df[filtered_df['pos'] != 'G']

# select relevant columns
filtered_df = filtered_df[['playerID', 'firstName', 'lastName', 'height', 'weight', 'birthYear', 'birthCountry']]

# write the file to our silver folder
Write_Silver_CSV(filtered_df, toName)

filtered_df.head()

Unnamed: 0,playerID,firstName,lastName,height,weight,birthYear,birthCountry
0,aaltoan01,Antti,Aalto,73.0,210.0,1975.0,Finland
1,abbeybr01,Bruce,Abbey,73.0,185.0,1951.0,Canada
3,abbotre01,Reg,Abbott,71.0,164.0,1930.0,Canada
4,abdelju01,Justin,Abdelkader,73.0,195.0,1987.0,USA
5,abelcl01,Clarence,Abel,73.0,225.0,1900.0,USA


In [61]:
num_rows, num_columns = filtered_df.shape

# Display the number of rows and columns
print(f'The DataFrame has {num_rows} rows and {num_columns} columns.')

filtered_df.dtypes

The DataFrame has 6761 rows and 7 columns.


playerID         object
firstName        object
lastName         object
height          float64
weight          float64
birthYear       float64
birthCountry     object
dtype: object
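
`height`, `weight`, and `birthYear` load as `float64`, which is how pandas stores a numeric column that contains missing values (presumably the case here). If whole-number columns are preferred downstream, one option is the nullable `Int64` dtype. The sketch below uses toy data and is not something the notebook itself does.

```python
import numpy as np
import pandas as pd

# toy stand-in for the Skaters frame (values are illustrative only)
toy = pd.DataFrame({
    'playerID': ['p1', 'p2', 'p3'],
    'height': [73.0, np.nan, 71.0],
    'birthYear': [1975.0, 1951.0, np.nan],
})

# nullable Int64 keeps missing values as <NA> instead of forcing floats
toy = toy.astype({'height': 'Int64', 'birthYear': 'Int64'})
print(toy.dtypes)
```

The conversion only succeeds when every non-missing value is a whole number, so it also acts as a light sanity check on the column.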

## Scoring

This is the table that has the performance data we are interested in.  Now that the Skaters file has been created, it can be used to filter out players from Scoring that are not found in Skaters.  Non-NHL data needs to be removed as well.

NOTE: The following code assumes that the Python cells under Master were run immediately beforehand, so that `filtered_df` still holds the Skaters data.
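
One way to drop that dependency is to rebuild the ID set from the saved Silver file instead of the in-memory `filtered_df`. This is a sketch, assuming `Data/Silver/Skaters.csv` has already been written by the Master cell. The helper name `load_skater_ids` is hypothetical.

```python
import pandas as pd


def load_skater_ids(path='Data/Silver/Skaters.csv'):
    """Return the set of playerIDs stored in the Silver Skaters file.

    Hypothetical helper; the default path assumes the Master cell has
    already saved its output.
    """
    skaters = pd.read_csv(path, usecols=['playerID'])
    return set(skaters['playerID'].dropna())
```

With this in place, `players_in_skaters = load_skater_ids()` could replace the line that reads the IDs from `filtered_df`, at the cost of one extra file read.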

In [62]:
toName = 'Scoring'

# create a set of playerIDs held in the Silver Skaters file
players_in_skaters = set(filtered_df['playerID'])

# load the bronze table
df = pd.read_csv('Data/Bronze/Scoring.csv')

# filter out players not in the Silver Skaters table
filtered_df = df[df['playerID'].isin(players_in_skaters)]

# remove all non-NHL rows from Scoring
filtered_df = filtered_df[filtered_df['lgID'] == 'NHL']

# remove all goalie rows (though these should already be removed)
filtered_df = filtered_df[filtered_df['pos'] != 'G']

# Remove rows where 'GP' is NaN or null
filtered_df = filtered_df.dropna(subset=['GP'])

# Remove rows where 'GP' is less than 1 (build the mask from filtered_df, not df)
filtered_df = filtered_df[filtered_df['GP'] >= 1]

# select relevant columns
filtered_df = filtered_df[['playerID', 'year', 'stint', 'tmID', 'pos', 'GP', 'G', 'A', 'Pts', 'PIM', '+/-', 'PPG', 'PPA', 'SHG', 'SHA', 'GWG', 'SOG']]

# write the file to our silver folder
Write_Silver_CSV(filtered_df, toName)

filtered_df.head()


Unnamed: 0,playerID,year,stint,tmID,pos,GP,G,A,Pts,PIM,+/-,PPG,PPA,SHG,SHA,GWG,SOG
0,aaltoan01,1997,1,ANA,C,3.0,0.0,0.0,0.0,0.0,-1.0,0.0,0.0,0.0,0.0,0.0,1.0
1,aaltoan01,1998,1,ANA,C,73.0,3.0,5.0,8.0,24.0,-12.0,2.0,1.0,0.0,0.0,0.0,61.0
2,aaltoan01,1999,1,ANA,C,63.0,7.0,11.0,18.0,26.0,-13.0,1.0,0.0,0.0,0.0,1.0,102.0
3,aaltoan01,2000,1,ANA,C,12.0,1.0,1.0,2.0,2.0,1.0,0.0,0.0,0.0,0.0,0.0,18.0
6,abbotre01,1952,1,MTL,C,3.0,0.0,0.0,0.0,0.0,,,,,,,
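
When a frame is filtered in several steps, each boolean mask is best built from the frame being filtered at that step, so that the mask and the frame share the same index. The sketch below shows the pattern on toy data (values are illustrative only). It also shows that `NaN >= 1` evaluates to `False`, so rows with a missing `GP` are dropped by the comparison as well.

```python
import pandas as pd

# toy stand-in for Scoring (values are illustrative only)
toy = pd.DataFrame({
    'playerID': ['p1', 'p2', 'p3', 'p4'],
    'lgID': ['NHL', 'WHA', 'NHL', 'NHL'],
    'GP': [10.0, 5.0, 0.0, None],
})

# step 1: keep NHL rows
nhl = toy[toy['lgID'] == 'NHL']

# step 2: the mask comes from nhl itself, so its index matches nhl exactly
played = nhl[nhl['GP'] >= 1]
print(played['playerID'].tolist())  # ['p1']
```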


In [63]:
num_rows, num_columns = filtered_df.shape

# Display the number of rows and columns
print(f'The DataFrame has {num_rows} rows and {num_columns} columns.')

filtered_df.dtypes

The DataFrame has 38153 rows and 17 columns.


playerID     object
year          int64
stint         int64
tmID         object
pos          object
GP          float64
G           float64
A           float64
Pts         float64
PIM         float64
+/-         float64
PPG         float64
PPA         float64
SHG         float64
SHA         float64
GWG         float64
SOG         float64
dtype: object
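
Scoring is described as having one row per year, per player, per stint. A quick way to confirm such a grain is to look for duplicated key combinations. The sketch below runs the check on toy data. Applying it to the real table would mean loading the silver Scoring file instead, and the result there is not asserted here.

```python
import pandas as pd

# toy stand-in for Scoring (values are illustrative only)
toy = pd.DataFrame({
    'playerID': ['p1', 'p1', 'p1', 'p2'],
    'year': [1999, 2000, 2000, 2000],
    'stint': [1, 1, 2, 1],
    'tmID': ['ANA', 'ANA', 'BOS', 'MTL'],
})

key = ['playerID', 'year', 'stint']

# keep=False flags every row of a duplicated key, not just the repeats
dupes = toy[toy.duplicated(subset=key, keep=False)]
print(f'{len(dupes)} rows violate the {key} grain')
```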

## Teams

This table has the names of the individual Teams.  In this bronze to silver process, we will remove any teams not associated with the NHL. There are also a number of columns related to the team's performance.  These are outside of the scope of this project and will be omitted.

In [64]:
toName = 'Teams'

# load the bronze table
df = pd.read_csv('Data/Bronze/Teams.csv')

# remove non-NHL rows
filtered_df = df[df['lgID'] == 'NHL']

# select columns
filtered_df = filtered_df[['year', 'tmID', 'name']]

# save the silver table
Write_Silver_CSV(filtered_df, toName)

filtered_df.head()


Unnamed: 0,year,tmID,name
65,1917,MTL,Montreal Canadiens
66,1917,MTW,Montreal Wanderers
67,1917,OTS,Ottawa Senators
70,1917,TOA,Toronto Arenas
72,1918,MTL,Montreal Canadiens


In [65]:
num_rows, num_columns = filtered_df.shape

# Display the number of rows and columns
print(f'The DataFrame has {num_rows} rows and {num_columns} columns.')

filtered_df.dtypes

The DataFrame has 1325 rows and 3 columns.


year     int64
tmID    object
name    object
dtype: object
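
Because Teams holds one row per team per year, `year` and `tmID` together identify a team name. The sketch below shows how a later step might attach names to scoring rows. It uses toy data with made-up team names, and the join is an assumption about downstream use, not part of this notebook.

```python
import pandas as pd

# toy stand-ins for the silver Scoring and Teams tables (illustrative values)
scoring = pd.DataFrame({
    'playerID': ['p1', 'p1', 'p2'],
    'year': [1998, 1999, 1998],
    'tmID': ['ANA', 'ANA', 'MTL'],
    'G': [3, 7, 12],
})
teams = pd.DataFrame({
    'year': [1998, 1999, 1998],
    'tmID': ['ANA', 'ANA', 'MTL'],
    'name': ['Team A 1998', 'Team A 1999', 'Team M 1998'],
})

# validate='many_to_one' raises if (year, tmID) is not unique in teams
named = scoring.merge(teams, on=['year', 'tmID'], how='left', validate='many_to_one')
print(named[['playerID', 'year', 'name', 'G']])
```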

## Information about the size and shape of the silver data

In [66]:
import os

files = ['Awards', 'Scoring', 'Skaters', 'Teams']

for file in files:
    # Specify the path to the CSV file
    file_path = f'Data/Silver/{file}.csv'

    # Get the file size in bytes
    file_size = os.path.getsize(file_path)

    # Convert the file size to a more readable format (e.g., kilobytes or megabytes)
    file_size_kb = file_size / 1024
    file_size_mb = file_size_kb / 1024

    # Display the file size
    print(f'==========={file}=================')
    print(f'The file size is {file_size} bytes')
    print(f'The file size is {file_size_kb:.2f} KB')
    print(f'The file size is {file_size_mb:.2f} MB')
    print('===================================')


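===========Awards=================
The file size is 33418 bytes
The file size is 32.63 KB
The file size is 0.03 MB
===================================
===========Scoring=================
The file size is 2726932 bytes
The file size is 2663.02 KB
The file size is 2.60 MB
===================================
===========Skaters=================
The file size is 328548 bytes
The file size is 320.85 KB
The file size is 0.31 MB
===================================
===========Teams=================
The file size is 36927 bytes
The file size is 36.06 KB
The file size is 0.04 MB
===================================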


In [67]:
dfs = []
dfs.append(pd.read_csv('Data/Silver/Awards.csv'))
dfs.append(pd.read_csv('Data/Silver/Skaters.csv'))
dfs.append(pd.read_csv('Data/Silver/Teams.csv'))
dfs.append(pd.read_csv('Data/Silver/Scoring.csv'))

width = 0
length = 0

for df in dfs:
    length += df.shape[0]
    width += df.shape[1]

print(f'{width=}')
print(f'{length=}')

width=30
length=47195
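
The totals above are the sums of the per-table shapes reported earlier (3 + 7 + 3 + 17 columns and 956 + 6761 + 1325 + 38153 rows). For a per-table view, the size and shape checks could be combined into one summary frame. This is a sketch only; the function name `summarize_tables` is hypothetical.

```python
from pathlib import Path

import pandas as pd


def summarize_tables(folder, names):
    """Return one row per CSV file with its row count, column count, and size in KB.

    Hypothetical helper combining the two cells above.
    """
    rows = []
    for name in names:
        path = Path(folder) / f'{name}.csv'
        df = pd.read_csv(path)
        rows.append({
            'table': name,
            'rows': df.shape[0],
            'columns': df.shape[1],
            'size_kb': round(path.stat().st_size / 1024, 2),
        })
    return pd.DataFrame(rows)
```

A call such as `summarize_tables('Data/Silver', ['Awards', 'Scoring', 'Skaters', 'Teams'])` would then report all four files at once.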
