I need to get the player data first, since it is the focus of the first half, and then the team/game data for later. The player data will also feed into the feature engineering in the final portion.

# Getting the player data and exploring

There are a few different APIs I'm going to try first before I resort to writing my own scraping functions. I want to pull data for as many players as I can. I anticipate only being able to get accurate data going back to roughly the mid-2000s. Player data might be easier to find; I know game data will not be as readily available going back more than a few years, although NHL data has become more popular in recent years.

In [2]:
import pandas as pd
import functools
# doing it the old-fashioned way

def player_data(year):
    """Scrape the skater stats table for the season ending in `year`."""
    # create the link based on year
    link = f"https://www.hockey-reference.com/leagues/NHL_{year}_skaters.html#stats"
    # use pandas to read in the first table on the page
    data = pd.read_html(link)[0]
    # rename the columns to be more usable/readable
    # (assumes the table has exactly these 28 columns, in this order)
    data.columns = [
        "rank", "player", "age", "team", "position", "games_played", "goals",
        "assists", "points", "plus_minus", "penalties_in_minutes", "point_shares",
        "even_strength_goals", "powerplay_goals", "shorthanded_goals", "game_winning_goals",
        "even_strength_assists", "powerplay_assists", "shorthanded_assists", "shots",
        "shooting_pct", "time_on_ice_min", "average_time_on_ice", "blocks",
        "hits", "faceoff_wins", "faceoff_losses", "faceoff_win_pct"
    ]
    # adding a column indicating the season, e.g. 2018/2019 for year=2019
    data["season"] = f"{year - 1}/{year}"
    return data

def salary_data(year):
    # create the link; the URL has no year component, so `year` is currently unused
    link = "https://www.hockey-reference.com/friv/current_nhl_salaries.cgi"
    data = pd.read_html(link)[0]
    return data
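
To pull several seasons at once, the year-to-URL pattern used in `player_data` can be generated up front. This is a minimal sketch, assuming the hockey-reference URL scheme holds for every season in the range; `season_urls` is a hypothetical helper that is not part of this notebook, and it does no network access.

```python
def season_urls(first_year, last_year):
    """Map season labels like '2018/2019' to skater-stats URLs.

    Hypothetical helper: assumes the URL pattern from player_data
    is valid for every end-of-season year in the range.
    """
    base = "https://www.hockey-reference.com/leagues/NHL_{year}_skaters.html"
    return {
        f"{year - 1}/{year}": base.format(year=year)
        for year in range(first_year, last_year + 1)
    }


urls = season_urls(2018, 2019)
for season, url in urls.items():
    print(season, url)
```

Keeping the season label as the key means the label can go straight into the `season` column once each table is read.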



In [61]:
# this is how I'm going to get the game data: good ol' JSON scraping. I tried loading it with pandas
# in case that was a little faster than requests.get, but requests.get is significantly faster
teams = pd.read_json("https://statsapi.web.nhl.com/api/v1/teams?expand=team.roster&season=20142015")

In [62]:
# same JSON scraping, this time with requests.get
import requests

team_data = requests.get("https://statsapi.web.nhl.com/api/v1/teams?expand=team.roster&season=20142015").json()

# team_data


In [17]:
team_data.keys()

dict_keys(['copyright', 'teams'])

In [32]:
team_data['teams'][0].keys()

dict_keys(['id', 'name', 'link', 'venue', 'abbreviation', 'teamName', 'locationName', 'firstYearOfPlay', 'division', 'conference', 'franchise', 'roster', 'shortName', 'officialSiteUrl', 'franchiseId', 'active'])
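
Each team entry carries its roster under the `roster` key. Below is a minimal sketch of flattening that into one row per player. It assumes the player entries sit in a list at `team['roster']['roster']` and that each entry has `person` and `position` sub-dictionaries; that layout is an assumption, not something shown in the output above. `flatten_rosters` is a hypothetical helper, and the `sample` dict is made up for illustration.

```python
def flatten_rosters(team_data):
    """Return one dict per player from a teams response with expanded rosters."""
    rows = []
    for team in team_data.get("teams", []):
        # assumed layout: team["roster"]["roster"] is a list of player entries
        for entry in team.get("roster", {}).get("roster", []):
            rows.append({
                "team": team.get("abbreviation"),
                "player_id": entry.get("person", {}).get("id"),
                "player": entry.get("person", {}).get("fullName"),
                "position": entry.get("position", {}).get("abbreviation"),
            })
    return rows


# made-up sample in the assumed shape (not real API output)
sample = {"teams": [{"abbreviation": "AAA", "roster": {"roster": [
    {"person": {"id": 1, "fullName": "Example Player"},
     "position": {"abbreviation": "C"}},
]}}]}

rows = flatten_rosters(sample)
print(rows)
```

The resulting list of dicts can be passed directly to `pd.DataFrame(rows)`.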

In [63]:
# team_data['teams'][0]['roster']

In [5]:
def get_salary(team, year):
    """Scrape one team's cap table from Spotrac; returns None if unavailable."""
    link = f"https://www.spotrac.com/nhl/{team}/cap/{year}/"
    try:
        data = pd.read_html(link)[0]
        data['team'] = team
        data['season'] = f"{year}/{year + 1}"
        data.columns = ["player", "position", 'age', 'base_salary',
                        'signing_bonus', 'perf_bonus', 'total_salary', 'na',
                        'total_cap_hit', 'adjusted_cap_hit', 'cap_pct',
                        'team', 'season']
        data.drop(columns=['na'], inplace=True)
        return data
    except Exception:
        # best-effort scrape: any failure (no table, bad page, column mismatch) is reported and skipped
        print(f"Salary data not available for {team} for {year}/{year + 1} season")
        return None


In [8]:
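# current_nhl_teams must be defined beforehand; it is not created in the cells shown here
salary_list = [get_salary(team, year) for year in (2018, 2019) for team in current_nhl_teams]
# get_salary returns None on failure, so drop those before stacking
salary_df = pd.concat([df for df in salary_list if df is not None], axis=0, ignore_index=True)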

In [14]:
from nhl import *

In [None]:
# salary_df.to_csv()