# NHL API Shot Location Data

Tyler Young 

May 2019

As a side project to mourn the end of the Sharks' season, I wanted to collect data from the NHL to analyze. To start, I had a vision of creating a heatmap of where Sharks players took their shots and the outcome of each. From my searching online, I could not find this data readily available, so I did some digging into the NHL API and found this helpful documentation: https://gitlab.com/dword4/nhlapi.

I learned I needed a couple of IDs to pull the correct data. The first was the team ID for the San Jose Sharks, which is 28. Then I needed the ID of each game they played, which I found by specifying teamId=28 and setting the start and end dates to match the 2018-2019 regular season.

API call to get the game IDs for the Sharks' regular season:

https://statsapi.web.nhl.com/api/v1/schedule?teamId=28&startDate=2018-10-02&endDate=2019-04-07
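The schedule URL above is just the endpoint plus three query parameters, so it can be assembled from its parts. Here is a minimal sketch using only the standard library; the helper name `schedule_url` is my own and not part of the API.

```python
from urllib.parse import urlencode

BASE_URL = "https://statsapi.web.nhl.com/api/v1"

def schedule_url(team_id, start_date, end_date):
    """Build the schedule URL for one team between two dates (YYYY-MM-DD)."""
    query = urlencode({
        "teamId": team_id,
        "startDate": start_date,
        "endDate": end_date,
    })
    return f"{BASE_URL}/schedule?{query}"

print(schedule_url(28, "2018-10-02", "2019-04-07"))
```

Swapping in a different team ID or date range should give the schedule for another team or season in the same way.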

From the API call above, I could extract the game IDs for all the regular season games the Sharks played in. Those IDs could then be plugged into the next API call to collect all the data behind the play-by-play feed of each game.

Here is an example: the game played on October 31st, 2018 (by the feed's UTC timestamps; the evening of October 30th in San Jose), Rangers vs Sharks.

https://statsapi.web.nhl.com/api/v1/game/2018020177/feed/live
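The game ID in the URL above is not an arbitrary number. As far as I can tell from the documentation linked earlier, the first four digits are the year the season started, the next two are the game type (01 preseason, 02 regular season, 03 playoffs), and the last four are the game number. The sketch below splits an ID under that assumption and builds the feed URL; `parse_game_id` and `feed_url` are hypothetical helper names.

```python
# Game type codes as I understand them from the unofficial docs (an assumption).
GAME_TYPES = {"01": "preseason", "02": "regular season", "03": "playoffs"}

def parse_game_id(game_id):
    """Split a 10-digit game ID into season, game type, and game number."""
    text = str(game_id)
    if len(text) != 10 or not text.isdigit():
        raise ValueError(f"expected a 10-digit game ID, got {game_id!r}")
    return {
        "season_start_year": int(text[:4]),
        "game_type": GAME_TYPES.get(text[4:6], "other"),
        "game_number": int(text[6:]),
    }

def feed_url(game_id):
    """Build the play-by-play (live feed) URL for one game."""
    return f"https://statsapi.web.nhl.com/api/v1/game/{game_id}/feed/live"

print(parse_game_id(2018020177))
print(feed_url(2018020177))
```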

I used this example game to test extracting the data I wanted for my analysis. My plan after that was to repeat the process for each game ID and append the results into one pandas dataframe. The test section below is where I created the variables, and the final 'Scale up' section is where I looped through the list of game IDs to collect every play of the entire season.

I plan to use the final dataset in a separate file, probably visualized with Tableau. Hopefully this structure helps anyone else trying to get similar data from the NHL API. Enjoy!

---

## Collect Game IDs from the Sharks 2018-2019 Regular Season

SHARKS ID: 28

https://statsapi.web.nhl.com/api/v1/teams/28

In [1]:
import requests
import json
import pandas as pd
pd.options.display.max_rows = 99

In [2]:
response = requests.get("https://statsapi.web.nhl.com/api/v1/schedule?teamId=28&startDate=2018-10-02&endDate=2019-04-07")
sharks_games = response.json()

In [3]:
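#each entry in 'dates' holds the games played that day; a single team has one game per date,
#so we keep the first (and only) game dictionary from each date
game_ids = []
for day in sharks_games['dates']:
    game_ids.append(day['games'][0])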

In [4]:
df_games = pd.DataFrame(game_ids)[['gameDate','gamePk','gameType','link']].sort_values('gameDate')
df_games.head()

| | gameDate | gamePk | gameType | link |
|---|---|---|---|---|
| 0 | 2018-10-04T02:30:00Z | 2018020004 | R | /api/v1/game/2018020004/feed/live |
| 1 | 2018-10-06T02:30:00Z | 2018020016 | R | /api/v1/game/2018020016/feed/live |
| 2 | 2018-10-08T17:00:00Z | 2018020033 | R | /api/v1/game/2018020033/feed/live |
| 3 | 2018-10-09T23:00:00Z | 2018020036 | R | /api/v1/game/2018020036/feed/live |
| 4 | 2018-10-11T23:00:00Z | 2018020049 | R | /api/v1/game/2018020049/feed/live |
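The loop above keeps only the first game listed under each date, which works here because one team plays at most one game per date. If the schedule were pulled without a team filter, a date could hold several games. The sketch below flattens every game under every date; it runs on a trimmed, hand-written sample in the same shape as the response (values taken from the table above), and `extract_games` is a hypothetical helper name.

```python
def extract_games(schedule):
    """Return one row per game from a schedule response, sorted by game date."""
    rows = []
    for day in schedule.get("dates", []):
        for game in day.get("games", []):
            rows.append({
                "gameDate": game["gameDate"],
                "gamePk": game["gamePk"],
                "gameType": game["gameType"],
                "link": game["link"],
            })
    return sorted(rows, key=lambda row: row["gameDate"])

# Trimmed sample in the same shape as the response; not a full payload.
sample_schedule = {
    "dates": [
        {"games": [{"gameDate": "2018-10-06T02:30:00Z", "gamePk": 2018020016,
                    "gameType": "R", "link": "/api/v1/game/2018020016/feed/live"}]},
        {"games": [{"gameDate": "2018-10-04T02:30:00Z", "gamePk": 2018020004,
                    "gameType": "R", "link": "/api/v1/game/2018020004/feed/live"}]},
    ]
}

games = extract_games(sample_schedule)
print([g["gamePk"] for g in games])
```

The resulting list of dictionaries can be passed straight to `pd.DataFrame` in place of `game_ids`.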


---

## Test code to get data from one game

Extracting play-by-play data from this one game: https://statsapi.web.nhl.com/api/v1/game/2018020177/feed/live

First, explore the JSON to find the structure of the data I want.

In [5]:
#%timeit requests.get("https://statsapi.web.nhl.com/api/v1/game/2018020177/feed/live")
resp = requests.get("https://statsapi.web.nhl.com/api/v1/game/2018020177/feed/live")
single_game = resp.json()

In [6]:
single_game.keys()

dict_keys(['copyright', 'gamePk', 'link', 'metaData', 'gameData', 'liveData'])

In [7]:
#pull keys from the 'liveData' key
single_game['liveData'].keys()

dict_keys(['plays', 'linescore', 'boxscore', 'decisions'])

In [8]:
#pull keys from the 'plays' key
single_game['liveData']['plays'].keys()

dict_keys(['allPlays', 'scoringPlays', 'penaltyPlays', 'playsByPeriod', 'currentPlay'])
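Calling `.keys()` one level at a time is how I found the path `liveData` → `plays` → `allPlays`. Once the path is known, a small helper can follow it in one call and return a default instead of raising when a key is missing. This is a sketch on a hand-written sample; `dig` is a hypothetical name, not something from the API or pandas.

```python
def dig(data, *path, default=None):
    """Follow dict keys / list indexes in order; return default if any step is missing."""
    current = data
    for step in path:
        try:
            current = current[step]
        except (KeyError, IndexError, TypeError):
            return default
    return current

# Hand-written sample with the same nesting as the live feed.
sample_game = {
    "liveData": {
        "plays": {
            "allPlays": [
                {"result": {"event": "Shot"}, "coordinates": {"x": 63.0, "y": 19.0}},
            ]
        }
    }
}

print(dig(sample_game, "liveData", "plays", "allPlays", 0, "coordinates"))
print(dig(sample_game, "liveData", "plays", "noSuchKey", default="missing"))
```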

In [26]:
#example of data stored within a single play
single_game['liveData']['plays']['allPlays'][5]

{'players': [{'player': {'id': 8474053,
    'fullName': 'Logan Couture',
    'link': '/api/v1/people/8474053'},
   'playerType': 'Shooter'},
  {'player': {'id': 8468685,
    'fullName': 'Henrik Lundqvist',
    'link': '/api/v1/people/8468685'},
   'playerType': 'Goalie'}],
 'result': {'event': 'Shot',
  'eventCode': 'SJS8',
  'eventTypeId': 'SHOT',
  'description': 'Logan Couture Wrist Shot saved by Henrik Lundqvist',
  'secondaryType': 'Wrist Shot'},
 'about': {'eventIdx': 5,
  'eventId': 8,
  'period': 1,
  'periodType': 'REGULAR',
  'ordinalNum': '1st',
  'periodTime': '00:18',
  'periodTimeRemaining': '19:42',
  'dateTime': '2018-10-31T02:38:35Z',
  'goals': {'away': 0, 'home': 0}},
 'coordinates': {'x': 63.0, 'y': 19.0},
 'team': {'id': 28,
  'name': 'San Jose Sharks',
  'link': '/api/v1/teams/28',
  'triCode': 'SJS'}}

#### Create a pandas dataframe using all plays from a single game

In [35]:
df_plays = pd.DataFrame(single_game['liveData']['plays']['allPlays'])
#Here is the same example play now in our pandas dataframe
df_plays[5:6]

| | about | coordinates | players | result | team |
|---|---|---|---|---|---|
| 5 | {'eventIdx': 5, 'eventId': 8, 'period': 1, 'pe... | {'x': 63.0, 'y': 19.0} | [{'player': {'id': 8474053, 'fullName': 'Logan... | {'event': 'Shot', 'eventCode': 'SJS8', 'eventT... | {'id': 28, 'name': 'San Jose Sharks', 'link': ... |


In [36]:
#Flag events that have coordinates, then keep only those rows.
#Events without coordinates (an empty dictionary) are things like the start and end of a period, which this analysis does not need.
df_plays['has_coordinates'] = df_plays['coordinates'].apply(bool)
df_plays = df_plays[df_plays['has_coordinates']].reset_index(drop = True)

Create variables we want to analyze by extracting data from the dictionaries within our columns.

In [37]:
df_plays['date'] = df_plays['about'].apply(lambda x: x['dateTime'].split('T')[0])
df_plays['event'] = df_plays['result'].apply(lambda x: x['event'])
df_plays['eventTypeId'] = df_plays['result'].apply(lambda x: x['eventTypeId'])
df_plays['description'] = df_plays['result'].apply(lambda x: x['description'])
df_plays['period'] = df_plays['about'].apply(lambda x: x['period'])
df_plays['periodType'] = df_plays['about'].apply(lambda x: x['periodType'])
df_plays['periodTimeRemaining'] = df_plays['about'].apply(lambda x: x['periodTimeRemaining'])
df_plays['xcoord'] = df_plays['coordinates'].apply(lambda x: x['x'])
df_plays['ycoord'] = df_plays['coordinates'].apply(lambda x: x['y'])
df_plays['player1_team'] = df_plays['team'].apply(lambda x: x['name'])
df_plays['player1_name'] = df_plays['players'].apply(lambda x: x[0]['player']['fullName'])
df_plays['player1_type'] = df_plays['players'].apply(lambda x: x[0]['playerType'])
df_plays['player2_name'] = df_plays['players'].apply(lambda x: x[1]['player']['fullName'] if len(x)>1 else None)
df_plays['player2_type'] = df_plays['players'].apply(lambda x: x[1]['playerType'] if len(x)>1 else None)
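The cell above pulls one field at a time out of each nested dictionary. pandas also offers `pd.json_normalize`, which flattens nested dictionaries into dotted column names (for example `coordinates.x`) in one call. The sketch below runs on two hand-written plays shaped like the feed, so the values are illustrative only. Note that it does not unpack the `players` list, which would still need the per-index approach.

```python
import pandas as pd

# Two hand-written plays in the shape of 'allPlays'; the second has no coordinates.
sample_plays = [
    {"result": {"event": "Shot", "eventTypeId": "SHOT"},
     "about": {"period": 1, "periodTimeRemaining": "19:42"},
     "coordinates": {"x": 63.0, "y": 19.0},
     "team": {"name": "San Jose Sharks"}},
    {"result": {"event": "Period Start", "eventTypeId": "PERIOD_START"},
     "about": {"period": 1, "periodTimeRemaining": "20:00"},
     "coordinates": {}},
]

# Nested keys become columns such as 'result.event' and 'coordinates.x'.
flat = pd.json_normalize(sample_plays)

# Plays without coordinates end up with NaN in the coordinate columns, so drop them.
flat = flat.dropna(subset=["coordinates.x", "coordinates.y"]).reset_index(drop=True)
print(flat[["result.event", "about.period", "coordinates.x", "coordinates.y", "team.name"]])
```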

Create clean version of dataframe by dropping columns no longer needed.

In [38]:
df_plays_clean = df_plays.copy()
df_plays_clean.drop(['about','players','result','team','has_coordinates'], inplace = True, axis = 1)
#df_plays_clean['coordinates'] = df_plays_clean['coordinates'].apply(lambda x: str(x['x'])+","+str(x['y']))
df_plays_clean.head(3)

| | coordinates | date | event | eventTypeId | description | period | periodType | periodTimeRemaining | xcoord | ycoord | player1_team | player1_name | player1_type | player2_name | player2_type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | {'x': 0.0, 'y': 0.0} | 2018-10-31 | Faceoff | FACEOFF | Logan Couture faceoff won against Mika Zibanejad | 1 | REGULAR | 20:00 | 0.0 | 0.0 | San Jose Sharks | Logan Couture | Winner | Mika Zibanejad | Loser |
| 1 | {'x': 77.0, 'y': 40.0} | 2018-10-31 | Hit | HIT | Mika Zibanejad hit Logan Couture | 1 | REGULAR | 19:44 | 77.0 | 40.0 | New York Rangers | Mika Zibanejad | Hitter | Logan Couture | Hittee |
| 2 | {'x': 63.0, 'y': 19.0} | 2018-10-31 | Shot | SHOT | Logan Couture Wrist Shot saved by Henrik Lundq... | 1 | REGULAR | 19:42 | 63.0 | 19.0 | San Jose Sharks | Logan Couture | Shooter | Henrik Lundqvist | Goalie |


---

## Scale up to get data from Sharks games in a season

First I pull the list of game IDs from before.

In [39]:
list_of_gameIDs = df_games['gamePk']
list_of_gameIDs[0:2]

0    2018020004
1    2018020016
Name: gamePk, dtype: int64

Function to loop through all games from the list created before and make a data frame of all the plays from the season.

In [40]:
def game_data(list_of_gameIDs):
    #Build one dataframe per game and concatenate once at the end,
    #which avoids growing a dataframe inside the loop.
    frames = []
    for ID in list_of_gameIDs:
        r = requests.get("https://statsapi.web.nhl.com/api/v1/game/"+str(ID)+"/feed/live")
        r.raise_for_status()  #stop early if a request fails
        s_game = r.json()
        frames.append(pd.DataFrame(s_game['liveData']['plays']['allPlays']))
    return pd.concat(frames, ignore_index = True)
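The function above makes one request per game, so every rerun downloads the whole season again. One option is to keep each game's raw JSON on disk and only hit the API for games not yet saved. The sketch below shows the idea; `load_game`, `cache_dir`, and the `fetch` argument are hypothetical names. Here `fetch` is a stand-in so the example runs offline, and in the notebook it could be something like `lambda url: requests.get(url).json()`.

```python
import json
import tempfile
from pathlib import Path

def load_game(game_id, fetch, cache_dir):
    """Return the feed for one game, reading from cache_dir when a copy already exists."""
    cache_file = Path(cache_dir) / f"{game_id}.json"
    if cache_file.exists():
        return json.loads(cache_file.read_text())
    url = f"https://statsapi.web.nhl.com/api/v1/game/{game_id}/feed/live"
    data = fetch(url)
    cache_file.write_text(json.dumps(data))
    return data

# Stand-in for the real request; it records each URL it is asked for.
calls = []
def fake_fetch(url):
    calls.append(url)
    return {"liveData": {"plays": {"allPlays": []}}}

with tempfile.TemporaryDirectory() as tmp:
    first = load_game(2018020177, fake_fetch, tmp)
    second = load_game(2018020177, fake_fetch, tmp)  # served from the cached file

print(len(calls))
```

The second call never reaches `fake_fetch`, so only one request is recorded.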

In [41]:
df_play = game_data(list_of_gameIDs)

In [42]:
#Flag events that have coordinates, then keep only those rows.
#Events without coordinates (an empty dictionary) are things like the start and end of a period, which this analysis does not need.
df_play['has_coordinates'] = df_play['coordinates'].apply(bool)
df_play = df_play[df_play['has_coordinates']].reset_index(drop = True)


Create variables we want to analyze by extracting data from the dictionaries within our columns.

In [48]:
df_play['date'] = df_play['about'].apply(lambda x: x['dateTime'].split('T')[0])
df_play['event'] = df_play['result'].apply(lambda x: x['event'])
df_play['eventTypeId'] = df_play['result'].apply(lambda x: x['eventTypeId'])
df_play['description'] = df_play['result'].apply(lambda x: x['description'])
df_play['period'] = df_play['about'].apply(lambda x: x['period'])
df_play['periodType'] = df_play['about'].apply(lambda x: x['periodType'])
df_play['periodTimeRemaining'] = df_play['about'].apply(lambda x: x['periodTimeRemaining'])
df_play['xcoord'] = df_play['coordinates'].apply(lambda x: x['x'])
df_play['ycoord'] = df_play['coordinates'].apply(lambda x: x['y'])
df_play['player1_team'] = df_play['team'].apply(lambda x: x['name'])
df_play['player1_name'] = df_play['players'].apply(lambda x: x[0]['player']['fullName'])
df_play['player1_type'] = df_play['players'].apply(lambda x: x[0]['playerType'])
df_play['player2_name'] = df_play['players'].apply(lambda x: x[1]['player']['fullName'] if len(x)>1 else None)
df_play['player2_type'] = df_play['players'].apply(lambda x: x[1]['playerType'] if len(x)>1 else None)
df_play['player3_name'] = df_play['players'].apply(lambda x: x[2]['player']['fullName'] if len(x)>2 else None)
df_play['player3_type'] = df_play['players'].apply(lambda x: x[2]['playerType'] if len(x)>2 else None)
df_play['player4_name'] = df_play['players'].apply(lambda x: x[3]['player']['fullName'] if len(x)>3 else None)
df_play['player4_type'] = df_play['players'].apply(lambda x: x[3]['playerType'] if len(x)>3 else None)
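The `player1` to `player4` columns depend on the order of the `players` list, so a goalie can land in `player2` for a shot and `player4` for a goal with two assists. An alternative is to group the names by `playerType`. The sketch below uses the goal by Max Comtois from this dataset as sample input; `players_by_type` is a hypothetical helper name.

```python
from collections import defaultdict

def players_by_type(players):
    """Map each playerType (e.g. 'Scorer', 'Assist') to a list of player names."""
    grouped = defaultdict(list)
    # 'players or []' covers None and empty lists.
    for entry in players or []:
        grouped[entry["playerType"]].append(entry["player"]["fullName"])
    return dict(grouped)

goal_players = [
    {"player": {"fullName": "Max Comtois"}, "playerType": "Scorer"},
    {"player": {"fullName": "Adam Henrique"}, "playerType": "Assist"},
    {"player": {"fullName": "Jakob Silfverberg"}, "playerType": "Assist"},
    {"player": {"fullName": "Martin Jones"}, "playerType": "Goalie"},
]

print(players_by_type(goal_players))
```

Applied with `df_play['players'].apply(players_by_type)`, this would give one dictionary per play that can be queried by role rather than by position.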

Create clean version of dataframe by dropping columns no longer needed.

In [49]:
df_all_plays_reg_season = df_play.copy()
df_all_plays_reg_season.drop(['about','players','result','team','has_coordinates'], inplace = True, axis = 1)
#df_all_plays_reg_season['coordinates'] = df_all_plays_reg_season['coordinates'].apply(lambda x: str(x['x'])+","+str(x['y']))
df_all_plays_reg_season.head(3)

| | coordinates | date | event | eventTypeId | description | period | periodType | periodTimeRemaining | xcoord | ycoord | player1_team | player1_name | player1_type | player2_name | player2_type | player3_name | player3_type | player4_name | player4_type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | {'x': 0.0, 'y': 0.0} | 2018-10-04 | Faceoff | FACEOFF | Joe Thornton faceoff won against Ryan Getzlaf | 1 | REGULAR | 20:00 | 0.0 | 0.0 | San Jose Sharks | Joe Thornton | Winner | Ryan Getzlaf | Loser | | | | |
| 1 | {'x': 63.0, 'y': -26.0} | 2018-10-04 | Takeaway | TAKEAWAY | Takeaway by Erik Karlsson | 1 | REGULAR | 19:27 | 63.0 | -26.0 | San Jose Sharks | Erik Karlsson | PlayerID | | | | | | |
| 2 | {'x': -65.0, 'y': 10.0} | 2018-10-04 | Goal | GOAL | Max Comtois (1) Wrist Shot, assists: Adam Henr... | 1 | REGULAR | 19:11 | -65.0 | 10.0 | Anaheim Ducks | Max Comtois | Scorer | Adam Henrique | Assist | Jakob Silfverberg | Assist | Martin Jones | Goalie |
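The rows above include both positive and negative `xcoord` values. I am assuming here that coordinates are measured from center ice and that the sign of x depends on which end a team is attacking, which the data suggests but which I have not confirmed. Under that assumption, one simple way to draw every shot on a single half of the rink is to reflect points with negative x through center ice. The helper name `mirror_to_one_end` is hypothetical.

```python
def mirror_to_one_end(x, y):
    """Reflect a point through center ice so that x is never negative."""
    if x < 0:
        return -x, -y
    return x, y

print(mirror_to_one_end(-65.0, 10.0))
print(mirror_to_one_end(63.0, 19.0))
```

This is a rough rule: a shot taken from a team's own half of the ice would be mirrored to the wrong end, so it is only a starting point for a heatmap.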


In [53]:
len(df_all_plays_reg_season)
#there are over 21,000 plays we collected from the Sharks' season to analyze.

21310

In [56]:
#number of plays where Joe Thornton scored a goal.
len(df_all_plays_reg_season[(df_all_plays_reg_season['player1_name']=='Joe Thornton')&
                           (df_all_plays_reg_season['player1_type']=='Scorer')])

16

As a sanity check, I counted the plays during the season where Joe Thornton was the goal scorer and found 16. According to the stats on the Sharks website, this checks out!

In [55]:
df_all_plays_reg_season.to_csv('sharks_2018-2019_reg_season_plays.csv')
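For the shot heatmap, only the shot events in the exported file matter. The sketch below reads the CSV back with the standard library and keeps rows whose `eventTypeId` is `SHOT` or `GOAL`, the two shot-related types visible in the tables above; other shot-like types may exist in the data and would need to be added to the set. It runs on a small in-memory sample with a subset of the exported columns, and `shot_rows` is a hypothetical helper name.

```python
import csv
import io

SHOT_EVENT_TYPES = {"SHOT", "GOAL"}  # extend if other shot-like types appear in the data

def shot_rows(csv_file, team=None):
    """Yield rows for shot events, optionally limited to one team."""
    for row in csv.DictReader(csv_file):
        if row["eventTypeId"] not in SHOT_EVENT_TYPES:
            continue
        if team is not None and row["player1_team"] != team:
            continue
        yield row

# Small in-memory sample standing in for the exported file.
sample_csv = io.StringIO(
    "eventTypeId,xcoord,ycoord,player1_team,player1_name\n"
    "FACEOFF,0.0,0.0,San Jose Sharks,Joe Thornton\n"
    "SHOT,63.0,19.0,San Jose Sharks,Logan Couture\n"
    "GOAL,-65.0,10.0,Anaheim Ducks,Max Comtois\n"
)

sharks_shots = list(shot_rows(sample_csv, team="San Jose Sharks"))
print([(r["player1_name"], float(r["xcoord"]), float(r["ycoord"])) for r in sharks_shots])
```

With the real export, the same function can be called on `open('sharks_2018-2019_reg_season_plays.csv', newline='')` in place of `sample_csv`.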