<div id="container" style="position:relative;">
<div style="float:left"><h1> Predicting NHL Goal Scoring - Data Collection </h1></div>

***BrainStation Data Science Capstone Project*** <br/>
***Author:***  &ensp;    **Taylor Gallivan** <br/>
***Date:*** &ensp; **Sep-Nov 2023** 

### Introduction

***This notebook is solely for the purposes of data collection.  Please see notebook `API_Debugging.ipynb` for a breakdown of the code.***

In this notebook, the NHL API was called in increments of roughly 5,000 player IDs.  The upper and lower bounds of the ID range shifted as I became more familiar with how the API is organized.  New player IDs appear to be assigned in chronological order (i.e., the more recent a player is, the higher their ID), whereas older players appear to have been added in batches, in alphabetical order.  As such, an 8th API call was added to capture the beginning of one such batch, some of whose players were active in the era covered by the analysis and modeling.

Players were called by their ID Number, as stored in the API.  A partial list of player IDs was retrieved from https://hockey-statistics.com/, which helped shape the structuring of the API calls.  

A second collection DataFrame was instantiated prior to Call 4 as a precautionary measure.  After all calls were completed, the two collection DataFrames were merged into one.

The call breakdowns are as follows:
 - Call 1 IDs:  8445000 - 8450000  <-- starts at 'Dave Andreychuk'
 - Call 2 IDs:  8450000 - 8455000
 - Call 3 IDs:  8455000 - 8460000
 - Call 4 IDs:  8460000 - 8465000  <-- new DataFrame started here
 - Call 5 IDs:  8465000 - 8470000
 - Call 6 IDs:  8470000 - 8475000
 - Call 7 IDs:  8475000 - 8482246  <-- ends at 'Artem Zub'
 - Call 8 IDs:  8444894 - 8445000  <-- added to original DataFrame to round out the lower bound of player IDs
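
The breakdown above can also be written down as data, so that each call looks up its bounds instead of relying on a hand-edited `range(...)`. This is a minimal sketch, not the notebook's own code: `CALL_BOUNDS` and `ids_for_call` are hypothetical names, and the upper bounds are assumed to be exclusive, as they are in the `range(8445000, 8450000)` call used for Call 1.

```python
# Hypothetical names; the bounds are copied from the call breakdown above.
# Assumption: every upper bound is exclusive, matching Python's range().
CALL_BOUNDS = {
    1: (8445000, 8450000),
    2: (8450000, 8455000),
    3: (8455000, 8460000),
    4: (8460000, 8465000),
    5: (8465000, 8470000),
    6: (8470000, 8475000),
    7: (8475000, 8482246),
    8: (8444894, 8445000),
}


def ids_for_call(call_number):
    """Return the player IDs covered by one call as a range object."""
    lower, upper = CALL_BOUNDS[call_number]
    return range(lower, upper)


# Total number of IDs requested across all eight calls.
total_ids = sum(len(ids_for_call(n)) for n in CALL_BOUNDS)
print(total_ids)
```

Because consecutive calls share a boundary value and the upper bound is exclusive, no ID is requested twice under this assumption.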

In [1]:
import requests
import json
import pandas as pd
from datetime import datetime
import numpy as np

### Call 1 - DONE
Player IDs:  8445000 - 8450000

In [4]:
# Call 1: IDs 8445000 through 8449999 (range() excludes the upper bound).
df_main = pd.DataFrame()
base_url = 'https://statsapi.web.nhl.com/api/v1/people/'
range1 = range(8445000, 8450000)

for num in range1:
    people_url = f'{base_url}{num}'
    response = requests.get(people_url)
    
    if response.status_code != 404:
        suggestions = json.loads(response.content)['people']
        player = (pd.json_normalize(suggestions))
        
        if player['primaryPosition.code'][0] != 'G':
            stats_url = f'{base_url}{num}/stats/?stats=yearByYear'
            response = requests.get(stats_url)
            
            # evaluate response code, pass if 500
            if response.status_code != 500:
                content = json.loads(response.content)['stats']
                splits = content[0]['splits']
                
                df_splits = (pd.json_normalize(splits, sep = "_" )
                             .query('league_name == "National Hockey League"')
                            )
                            
                # keep only players with at least 3 NHL rows (one row per season and team)
                if df_splits.shape[0] >= 3:
                    df_splits['player_id'] = player['id'][0]
                    df_splits['name'] = player['fullName'][0]
                    df_splits['position_code'] = player['primaryPosition.code'][0]
                    df_splits['stat_games'] = df_splits['stat_games'].astype(int)
                    # sum NHL games over the player's career and require more than 180
                    total_games = df_splits.groupby(['player_id', 'name',])['stat_games'].sum().reset_index()
                    filtered_total_games = total_games[total_games['stat_games'] > 180]
                    
                    if not filtered_total_games.empty:
                        df_splits['season_start_yr'] = [x[0:4] for x in df_splits['season']]
                        df_splits['season_start_dt'] =  [datetime.strptime(x + '0930', "%Y%m%d") for x in df_splits['season_start_yr']] 
                        df_splits['season_end'] = [x[4:8] for x in df_splits['season']]
                        
                        df_splits['weight'] = player['weight'][0]
                        df_splits['height'] = player['height'][0]
                        df_splits['shot_dir'] = player['shootsCatches'][0]
                        df_splits['birth_date'] = pd.to_datetime(player['birthDate'][0])
                        df_splits['age'] = (np.floor((df_splits['season_start_dt'] - df_splits['birth_date'])/ np.timedelta64(1,'Y') ))
                        df_splits['age'] = df_splits['age'].astype(int)
                        df_splits['position_name'] = player['primaryPosition.name'][0]
                        df_splits['position_type'] = player['primaryPosition.type'][0]
                        df_splits['birth_country'] = player['birthCountry'][0]
                        df_splits['nationality'] = player['nationality'][0]
                        
                        # concatenate the DF for the current player, to the main DF
                        df_main = pd.concat([df_main, df_splits], sort=False).reset_index(drop=True)
                    else:
                        pass        
                else:
                    pass
            else:
                pass
        else:
            pass
    else:
        pass

UndefinedVariableError: name 'league_name' is not defined
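
The error above is consistent with a player whose `splits` list is empty: normalizing an empty list yields a DataFrame with no columns, so `query` has no `league_name` to resolve. The sketch below shows one way to guard against that case. It uses made-up records rather than a real API response, and `nhl_rows` is a hypothetical helper name.

```python
import pandas as pd


def nhl_rows(splits):
    """Return only the NHL rows from a list of season splits (hypothetical helper)."""
    if not splits:
        # Nothing to normalize: return an empty frame instead of querying it.
        return pd.DataFrame()
    df = pd.json_normalize(splits, sep="_")
    if "league_name" not in df.columns:
        return pd.DataFrame()
    return df[df["league_name"] == "National Hockey League"]


# Illustrative records shaped like the nested fields the notebook flattens.
sample_splits = [
    {"season": "19821983", "league": {"name": "National Hockey League"}, "stat": {"games": 43}},
    {"season": "19811982", "league": {"name": "Example Junior League"}, "stat": {"games": 60}},
]

print(nhl_rows(sample_splits))
print(nhl_rows([]).empty)
```

Filtering with a boolean mask instead of `query` is a design choice here, not a requirement; it simply makes the missing-column case explicit.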

In [7]:
# Before running another pass here, let's rearrange some columns for readability

id_df = df_main[['player_id', 'name', 'position_code']]

# Sanity Check
id_df

Unnamed: 0,player_id,name,position_code
0,8445000,Dave Andreychuk,L
1,8445000,Dave Andreychuk,L
2,8445000,Dave Andreychuk,L
3,8445000,Dave Andreychuk,L
4,8445000,Dave Andreychuk,L
...,...,...,...
1032,8445400,Curt Bennett,R
1033,8445400,Curt Bennett,R
1034,8445400,Curt Bennett,R
1035,8445400,Curt Bennett,R


In [10]:
# Rebuild the DataFrame with the ID columns first
df_main = id_df.join(df_main.drop(columns=['player_id', 'name', 'position_code']))
df_main

Unnamed: 0,player_id,name,position_code,season,sequenceNumber,stat_assists,stat_goals,stat_games,stat_points,team_name,...,season_end,weight,height,shot_dir,birth_date,age,position_name,position_type,birth_country,nationality
0,8445000,Dave Andreychuk,L,19821983,1,23.0,14.0,43,37.0,Buffalo Sabres,...,1983,225,"6' 4""",R,1963-09-29,19,Left Wing,Forward,CAN,CAN
1,8445000,Dave Andreychuk,L,19831984,1,42.0,38.0,78,80.0,Buffalo Sabres,...,1984,225,"6' 4""",R,1963-09-29,20,Left Wing,Forward,CAN,CAN
2,8445000,Dave Andreychuk,L,19841985,1,30.0,31.0,64,61.0,Buffalo Sabres,...,1985,225,"6' 4""",R,1963-09-29,21,Left Wing,Forward,CAN,CAN
3,8445000,Dave Andreychuk,L,19851986,1,51.0,36.0,80,87.0,Buffalo Sabres,...,1986,225,"6' 4""",R,1963-09-29,22,Left Wing,Forward,CAN,CAN
4,8445000,Dave Andreychuk,L,19861987,1,48.0,25.0,77,73.0,Buffalo Sabres,...,1987,225,"6' 4""",R,1963-09-29,23,Left Wing,Forward,CAN,CAN
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1032,8445400,Curt Bennett,R,19761977,1,25.0,22.0,76,47.0,Atlanta Flames,...,1977,195,"6' 3""",L,1948-03-27,28,Right Wing,Forward,CAN,CAN
1033,8445400,Curt Bennett,R,19771978,1,7.0,3.0,25,10.0,Atlanta Flames,...,1978,195,"6' 3""",L,1948-03-27,29,Right Wing,Forward,CAN,CAN
1034,8445400,Curt Bennett,R,19771978,2,17.0,7.0,50,24.0,St. Louis Blues,...,1978,195,"6' 3""",L,1948-03-27,29,Right Wing,Forward,CAN,CAN
1035,8445400,Curt Bennett,R,19781979,1,19.0,14.0,74,33.0,St. Louis Blues,...,1979,195,"6' 3""",L,1948-03-27,30,Right Wing,Forward,CAN,CAN
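
The `age` column in the output above is the player's age on 30 September of the season's starting year (the collection loop appends `'0930'` to the first four digits of `season`). The notebook derives it by dividing a timedelta by `np.timedelta64(1,'Y')`, which as far as I can tell relies on a fixed-length approximation of a year and may not be accepted by every pandas version. The sketch below is an alternative that compares calendar dates directly; `age_on` and `age_at_season_start` are hypothetical names, not part of the notebook.

```python
from datetime import date


def age_on(birth_date, on_date):
    """Whole years elapsed between birth_date and on_date."""
    years = on_date.year - birth_date.year
    # Subtract one if the birthday has not yet occurred in on_date's year.
    if (on_date.month, on_date.day) < (birth_date.month, birth_date.day):
        years -= 1
    return years


def age_at_season_start(birth_date, season):
    """Age on 30 September of the season's first year, e.g. season '19821983'."""
    start_year = int(season[:4])
    return age_on(birth_date, date(start_year, 9, 30))


# Birth dates taken from the output above.
print(age_at_season_start(date(1963, 9, 29), "19821983"))
print(age_at_season_start(date(1948, 3, 27), "19761977"))
```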


In [11]:
# Restart at 8445401

base_url = 'https://statsapi.web.nhl.com/api/v1/people/'
range1 = range(8445401, 8450000)

for num in range1:
    people_url = f'{base_url}{num}'
    response = requests.get(people_url)
    
    if response.status_code != 404:
        suggestions = json.loads(response.content)['people']
        player = (pd.json_normalize(suggestions))
        
        if player['primaryPosition.code'][0] != 'G':
            stats_url = f'{base_url}{num}/stats/?stats=yearByYear'
            response = requests.get(stats_url)
            
            # evaluate response code, pass if 500
            if response.status_code != 500:
                content = json.loads(response.content)['stats']
                splits = content[0]['splits']
                
                if not splits:
                    pass
                else:
                    df_splits = (pd.json_normalize(splits, sep = "_" )
                                 .query('league_name == "National Hockey League"')
                                )
                                
                    if df_splits.shape[0] >= 3:
                        df_splits['player_id'] = player['id'][0]
                        df_splits['name'] = player['fullName'][0]
                        df_splits['position_code'] = player['primaryPosition.code'][0]
                        df_splits['stat_games'] = df_splits['stat_games'].astype(int)
                        total_games = df_splits.groupby(['player_id', 'name',])['stat_games'].sum().reset_index()
                        filtered_total_games = total_games[total_games['stat_games'] > 180]
                        
                        if not filtered_total_games.empty:
                            df_splits['season_start_yr'] = [x[0:4] for x in df_splits['season']]
                            df_splits['season_start_dt'] =  [datetime.strptime(x + '0930', "%Y%m%d") for x in df_splits['season_start_yr']] 
                            df_splits['season_end'] = [x[4:8] for x in df_splits['season']]
                            
                            df_splits['weight'] = player['weight'][0]
                            df_splits['height'] = player['height'][0]
                            df_splits['shot_dir'] = player['shootsCatches'][0]
                            df_splits['birth_date'] = pd.to_datetime(player['birthDate'][0])
                            df_splits['age'] = (np.floor((df_splits['season_start_dt'] - df_splits['birth_date'])/ np.timedelta64(1,'Y') ))
                            df_splits['age'] = df_splits['age'].astype(int)
                            df_splits['position_name'] = player['primaryPosition.name'][0]
                            df_splits['position_type'] = player['primaryPosition.type'][0]
                            df_splits['birth_country'] = player['birthCountry'][0]
                            df_splits['nationality'] = player['nationality'][0]
                            
                            df_main = pd.concat([df_main, df_splits], sort=False).reset_index(drop=True)
                        else:
                            pass        
                    else:
                        pass
            else:
                pass
        else:
            pass
    else:
        pass

KeyError: 'shootsCatches'
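
When a pass stops partway like this, the restart point has to be worked out from the last rows collected. The sketch below wraps the loop so that the failing ID is reported directly. It is only an illustration: `run_ids` and `fake_process` are hypothetical, and `fake_process` stands in for the real per-player work without making any requests.

```python
def run_ids(ids, process_one):
    """Process IDs in order; stop at the first failure and report where it happened."""
    last_done = None
    for num in ids:
        try:
            process_one(num)
        except Exception as exc:  # broad on purpose: any failure ends the pass
            return {"last_done": last_done, "failed_id": num, "error": repr(exc)}
        last_done = num
    return {"last_done": last_done, "failed_id": None, "error": None}


def fake_process(num):
    # Stand-in for the per-player work; fails on one ID to mimic the KeyError above.
    if num == 8448811:
        raise KeyError("shootsCatches")


status = run_ids(range(8448809, 8448815), fake_process)
print(status)
```

`KeyboardInterrupt` is not a subclass of `Exception`, so interrupting the kernel still stops the loop immediately rather than being recorded as a failed ID.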

In [14]:
# Get Dave Lewis stats --> 8448811

base_url = 'https://statsapi.web.nhl.com/api/v1/people/'
num = 8448811

people_url = f'{base_url}{num}'
response = requests.get(people_url)
   
if response.status_code != 404:
    suggestions = json.loads(response.content)['people']
    player = (pd.json_normalize(suggestions))
    
    if player['primaryPosition.code'][0] != 'G':
        stats_url = f'{base_url}{num}/stats/?stats=yearByYear'
        response = requests.get(stats_url)
           
        if response.status_code != 500:
            content = json.loads(response.content)['stats']
            splits = content[0]['splits']
               
            if not splits:
                pass
            else:
                df_splits = (pd.json_normalize(splits, sep = "_" )
                            .query('league_name == "National Hockey League"')
                            )
                                    
                if df_splits.shape[0] >= 3:
                    df_splits['player_id'] = player['id'][0]
                    df_splits['name'] = player['fullName'][0]
                    df_splits['position_code'] = player['primaryPosition.code'][0]
                    df_splits['stat_games'] = df_splits['stat_games'].astype(int)
                    total_games = df_splits.groupby(['player_id', 'name',])['stat_games'].sum().reset_index()
                    filtered_total_games = total_games[total_games['stat_games'] > 180]
                    
                    if not filtered_total_games.empty:
                        df_splits['season_start_yr'] = [x[0:4] for x in df_splits['season']]
                        df_splits['season_start_dt'] =  [datetime.strptime(x + '0930', "%Y%m%d") for x in df_splits['season_start_yr']] 
                        df_splits['season_end'] = [x[4:8] for x in df_splits['season']]
                                
                        df_splits['weight'] = player['weight'][0]
                        df_splits['height'] = player['height'][0]
                        df_splits['shot_dir'] = 'L'
                        df_splits['birth_date'] = pd.to_datetime(player['birthDate'][0])
                        df_splits['age'] = (np.floor((df_splits['season_start_dt'] - df_splits['birth_date'])/ np.timedelta64(1,'Y') ))
                        df_splits['age'] = df_splits['age'].astype(int)
                        df_splits['position_name'] = player['primaryPosition.name'][0]
                        df_splits['position_type'] = player['primaryPosition.type'][0]
                        df_splits['birth_country'] = player['birthCountry'][0]
                        df_splits['nationality'] = player['nationality'][0]
                                
                        df_main = pd.concat([df_main, df_splits], sort=False).reset_index(drop=True)
                    else:
                        pass        
                else:
                    pass
        else:
            pass
    else:
        pass
else:
    pass

In [15]:
df_main

Unnamed: 0,player_id,name,position_code,season,sequenceNumber,stat_assists,stat_goals,stat_games,stat_points,team_name,...,season_end,weight,height,shot_dir,birth_date,age,position_name,position_type,birth_country,nationality
0,8445000,Dave Andreychuk,L,19821983,1,23.0,14.0,43,37.0,Buffalo Sabres,...,1983,225,"6' 4""",R,1963-09-29,19,Left Wing,Forward,CAN,CAN
1,8445000,Dave Andreychuk,L,19831984,1,42.0,38.0,78,80.0,Buffalo Sabres,...,1984,225,"6' 4""",R,1963-09-29,20,Left Wing,Forward,CAN,CAN
2,8445000,Dave Andreychuk,L,19841985,1,30.0,31.0,64,61.0,Buffalo Sabres,...,1985,225,"6' 4""",R,1963-09-29,21,Left Wing,Forward,CAN,CAN
3,8445000,Dave Andreychuk,L,19851986,1,51.0,36.0,80,87.0,Buffalo Sabres,...,1986,225,"6' 4""",R,1963-09-29,22,Left Wing,Forward,CAN,CAN
4,8445000,Dave Andreychuk,L,19861987,1,48.0,25.0,77,73.0,Buffalo Sabres,...,1987,225,"6' 4""",R,1963-09-29,23,Left Wing,Forward,CAN,CAN
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
11608,8448811,Dave Lewis,D,19831984,1,5.0,2.0,66,7.0,New Jersey Devils,...,1984,205,"6' 2""",L,1953-07-03,30,Defenseman,Defenseman,CAN,CAN
11609,8448811,Dave Lewis,D,19841985,1,9.0,3.0,74,12.0,New Jersey Devils,...,1985,205,"6' 2""",L,1953-07-03,31,Defenseman,Defenseman,CAN,CAN
11610,8448811,Dave Lewis,D,19851986,1,15.0,0.0,69,15.0,New Jersey Devils,...,1986,205,"6' 2""",L,1953-07-03,32,Defenseman,Defenseman,CAN,CAN
11611,8448811,Dave Lewis,D,19861987,1,5.0,2.0,58,7.0,Detroit Red Wings,...,1987,205,"6' 2""",L,1953-07-03,33,Defenseman,Defenseman,CAN,CAN


In [16]:
# Restart at 8448813

base_url = 'https://statsapi.web.nhl.com/api/v1/people/'
range1 = range(8448813, 8450000)

for num in range1:
    people_url = f'{base_url}{num}'
    response = requests.get(people_url)
    
    if response.status_code != 404:
        suggestions = json.loads(response.content)['people']
        player = (pd.json_normalize(suggestions))
        
        if player['primaryPosition.code'][0] != 'G':
            stats_url = f'{base_url}{num}/stats/?stats=yearByYear'
            response = requests.get(stats_url)
            
            # evaluate response code, pass if 500
            if response.status_code != 500:
                content = json.loads(response.content)['stats']
                splits = content[0]['splits']
                
                if not splits:
                    pass
                else:
                    df_splits = (pd.json_normalize(splits, sep = "_" )
                                 .query('league_name == "National Hockey League"')
                                )
                                
                    if df_splits.shape[0] >= 3:
                        df_splits['player_id'] = player['id'][0]
                        df_splits['name'] = player['fullName'][0]
                        df_splits['position_code'] = player['primaryPosition.code'][0]
                        df_splits['stat_games'] = df_splits['stat_games'].astype(int)
                        total_games = df_splits.groupby(['player_id', 'name',])['stat_games'].sum().reset_index()
                        filtered_total_games = total_games[total_games['stat_games'] > 180]
                        
                        if not filtered_total_games.empty:
                            df_splits['season_start_yr'] = [x[0:4] for x in df_splits['season']]
                            df_splits['season_start_dt'] =  [datetime.strptime(x + '0930', "%Y%m%d") for x in df_splits['season_start_yr']] 
                            df_splits['season_end'] = [x[4:8] for x in df_splits['season']]
                            
                            df_splits['weight'] = player['weight'][0]
                            df_splits['height'] = player['height'][0]
                            df_splits['shot_dir'] = player['shootsCatches'][0]
                            df_splits['birth_date'] = pd.to_datetime(player['birthDate'][0])
                            df_splits['age'] = (np.floor((df_splits['season_start_dt'] - df_splits['birth_date'])/ np.timedelta64(1,'Y') ))
                            df_splits['age'] = df_splits['age'].astype(int)
                            df_splits['position_name'] = player['primaryPosition.name'][0]
                            df_splits['position_type'] = player['primaryPosition.type'][0]
                            df_splits['birth_country'] = player['birthCountry'][0]
                            df_splits['nationality'] = player['nationality'][0]
                            
                            df_main = pd.concat([df_main, df_splits], sort=False).reset_index(drop=True)
                        else:
                            pass        
                    else:
                        pass
            else:
                pass
        else:
            pass
    else:
        pass

KeyError: 'shootsCatches'

Evidently, the NHL's database has no shooting direction recorded for several players.  New conditionals are added to supply default values for any information missing from player bios.
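
The default-value logic can be condensed into a single lookup. This is a minimal sketch assuming, as in the notebook, that `player` is the one-row DataFrame produced by `pd.json_normalize`; `BIO_DEFAULTS` and `bio_value` are hypothetical names. The defaults mirror the ones the notebook uses: `0` for weight and `'unknown'` for height and shooting direction.

```python
import pandas as pd

# Hypothetical mapping of optional bio fields to their fallback values.
BIO_DEFAULTS = {"weight": 0, "height": "unknown", "shootsCatches": "unknown"}


def bio_value(player, column):
    """Return the player's value for column, or its default if the field is absent."""
    if column in player.columns:
        return player[column].iloc[0]
    return BIO_DEFAULTS[column]


# Illustrative bio with no 'shootsCatches' field.
player = pd.json_normalize(
    [{"id": 8448811, "fullName": "Dave Lewis", "weight": 205, "height": "6' 2\""}]
)

print(bio_value(player, "weight"))
print(bio_value(player, "shootsCatches"))
```

Keeping the defaults in one mapping means a newly discovered missing field only needs one extra entry, rather than another `if`/`else` pair in each call.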

In [44]:
# Restart at 8448846

base_url = 'https://statsapi.web.nhl.com/api/v1/people/'
range1 = range(8448846, 8450000)

for num in range1:
    people_url = f'{base_url}{num}'
    response = requests.get(people_url)
    
    if response.status_code != 404:
        suggestions = json.loads(response.content)['people']
        player = (pd.json_normalize(suggestions))
        
        if player['primaryPosition.code'][0] != 'G':
            stats_url = f'{base_url}{num}/stats/?stats=yearByYear'
            response = requests.get(stats_url)
            
            if response.status_code != 500:
                content = json.loads(response.content)['stats']
                splits = content[0]['splits']
                
                if not splits:
                    pass
                else:
                    df_splits = (pd.json_normalize(splits, sep = "_" )
                                 .query('league_name == "National Hockey League"')
                                )
                                
                    if df_splits.shape[0] >= 3:
                        df_splits['player_id'] = player['id'][0]
                        df_splits['name'] = player['fullName'][0]
                        df_splits['position_code'] = player['primaryPosition.code'][0]
                        df_splits['stat_games'] = df_splits['stat_games'].astype(int)
                        total_games = df_splits.groupby(['player_id', 'name',])['stat_games'].sum().reset_index()
                        filtered_total_games = total_games[total_games['stat_games'] > 180]
                        
                        if not filtered_total_games.empty:
                            df_splits['season_start_yr'] = [x[0:4] for x in df_splits['season']]
                            df_splits['season_start_dt'] =  [datetime.strptime(x + '0930', "%Y%m%d") for x in df_splits['season_start_yr']] 
                            df_splits['season_end'] = [x[4:8] for x in df_splits['season']]
                            
                            df_splits['birth_date'] = pd.to_datetime(player['birthDate'][0])
                            df_splits['age'] = (np.floor((df_splits['season_start_dt'] - df_splits['birth_date'])/ np.timedelta64(1,'Y') ))
                            df_splits['age'] = df_splits['age'].astype(int)
                            df_splits['position_name'] = player['primaryPosition.name'][0]
                            df_splits['position_type'] = player['primaryPosition.type'][0]
                            df_splits['birth_country'] = player['birthCountry'][0]
                            df_splits['nationality'] = player['nationality'][0]
                            
                            # conditionals to pass default values where missing
                            if 'weight' not in player:
                                df_splits['weight'] = 0
                            else:
                                df_splits['weight'] = player['weight'][0]
                                
                            if 'height' not in player:
                                df_splits['height'] = 'unknown'
                            else:
                                df_splits['height'] = player['height'][0]
                            
                            if 'shootsCatches' not in player:
                                df_splits['shot_dir'] = 'unknown'
                            else:
                                df_splits['shot_dir'] = player['shootsCatches'][0]
                                
                            df_main = pd.concat([df_main, df_splits], sort=False).reset_index(drop=True)
                        else:
                            pass        
                    else:
                        pass
            else:
                pass
        else:
            pass
    else:
        pass

In [45]:
df_main

Unnamed: 0,player_id,name,position_code,season,sequenceNumber,stat_assists,stat_goals,stat_games,stat_points,team_name,...,season_end,weight,height,shot_dir,birth_date,age,position_name,position_type,birth_country,nationality
0,8445000,Dave Andreychuk,L,19821983,1,23.0,14.0,43,37.0,Buffalo Sabres,...,1983,225,"6' 4""",R,1963-09-29,19,Left Wing,Forward,CAN,CAN
1,8445000,Dave Andreychuk,L,19831984,1,42.0,38.0,78,80.0,Buffalo Sabres,...,1984,225,"6' 4""",R,1963-09-29,20,Left Wing,Forward,CAN,CAN
2,8445000,Dave Andreychuk,L,19841985,1,30.0,31.0,64,61.0,Buffalo Sabres,...,1985,225,"6' 4""",R,1963-09-29,21,Left Wing,Forward,CAN,CAN
3,8445000,Dave Andreychuk,L,19851986,1,51.0,36.0,80,87.0,Buffalo Sabres,...,1986,225,"6' 4""",R,1963-09-29,22,Left Wing,Forward,CAN,CAN
4,8445000,Dave Andreychuk,L,19861987,1,48.0,25.0,77,73.0,Buffalo Sabres,...,1987,225,"6' 4""",R,1963-09-29,23,Left Wing,Forward,CAN,CAN
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
15049,8449973,Eddie Olczyk,R,19961997,1,23.0,21.0,67,44.0,Los Angeles Kings,...,1997,207,"6' 1""",L,1966-08-16,30,Right Wing,Forward,USA,USA
15050,8449973,Eddie Olczyk,R,19961997,2,7.0,4.0,12,11.0,Pittsburgh Penguins,...,1997,207,"6' 1""",L,1966-08-16,30,Right Wing,Forward,USA,USA
15051,8449973,Eddie Olczyk,R,19971998,1,11.0,11.0,56,22.0,Pittsburgh Penguins,...,1998,207,"6' 1""",L,1966-08-16,31,Right Wing,Forward,USA,USA
15052,8449973,Eddie Olczyk,R,19981999,1,15.0,10.0,61,25.0,Chicago Blackhawks,...,1999,207,"6' 1""",L,1966-08-16,32,Right Wing,Forward,USA,USA


In [48]:
print(df_main[df_main['name'] == 'Lyle Odelein'])

       player_id          name position_code    season  sequenceNumber  \
14966    8449957  Lyle Odelein             D  19891990               1   
14967    8449957  Lyle Odelein             D  19901991               1   
14968    8449957  Lyle Odelein             D  19911992               1   
14969    8449957  Lyle Odelein             D  19921993               1   
14970    8449957  Lyle Odelein             D  19931994               1   
14971    8449957  Lyle Odelein             D  19941995               1   
14972    8449957  Lyle Odelein             D  19951996               1   
14973    8449957  Lyle Odelein             D  19961997               1   
14974    8449957  Lyle Odelein             D  19971998               1   
14975    8449957  Lyle Odelein             D  19981999               1   
14976    8449957  Lyle Odelein             D  19992000               1   
14977    8449957  Lyle Odelein             D  19992000               2   
14978    8449957  Lyle Odelein        

### Call 2 - DONE
Player IDs:  8450000 - 8455000

In [9]:
# Call 2

base_url = 'https://statsapi.web.nhl.com/api/v1/people/'
range1 = range(8450000, 8455000)

for num in range1:
    people_url = f'{base_url}{num}'
    response = requests.get(people_url)
    
    if response.status_code != 404:
        suggestions = json.loads(response.content)['people']
        player = (pd.json_normalize(suggestions))
        
        if player['primaryPosition.code'][0] != 'G':
            stats_url = f'{base_url}{num}/stats/?stats=yearByYear'
            response = requests.get(stats_url)
            
            if response.status_code != 500:
                content = json.loads(response.content)['stats']
                splits = content[0]['splits']
                
                if not splits:
                    pass
                else:
                    df_splits = (pd.json_normalize(splits, sep = "_" )
                                 .query('league_name == "National Hockey League"')
                                )
                                
                    if df_splits.shape[0] >= 3:
                        df_splits['player_id'] = player['id'][0]
                        df_splits['name'] = player['fullName'][0]
                        df_splits['position_code'] = player['primaryPosition.code'][0]
                        df_splits['stat_games'] = df_splits['stat_games'].astype(int)
                        total_games = df_splits.groupby(['player_id', 'name',])['stat_games'].sum().reset_index()
                        filtered_total_games = total_games[total_games['stat_games'] > 180]
                        
                        if not filtered_total_games.empty:
                            df_splits['season_start_yr'] = [x[0:4] for x in df_splits['season']]
                            df_splits['season_start_dt'] =  [datetime.strptime(x + '0930', "%Y%m%d") for x in df_splits['season_start_yr']] 
                            df_splits['season_end'] = [x[4:8] for x in df_splits['season']]
                            
                            df_splits['birth_date'] = pd.to_datetime(player['birthDate'][0])
                            df_splits['age'] = (np.floor((df_splits['season_start_dt'] - df_splits['birth_date'])/ np.timedelta64(1,'Y') ))
                            df_splits['age'] = df_splits['age'].astype(int)
                            df_splits['position_name'] = player['primaryPosition.name'][0]
                            df_splits['position_type'] = player['primaryPosition.type'][0]
                            df_splits['birth_country'] = player['birthCountry'][0]
                            df_splits['nationality'] = player['nationality'][0]
                            
                            # conditionals to pass default values where missing
                            if 'weight' not in player:
                                df_splits['weight'] = 0
                            else:
                                df_splits['weight'] = player['weight'][0]
                                
                            if 'height' not in player:
                                df_splits['height'] = 'unknown'
                            else:
                                df_splits['height'] = player['height'][0]
                            
                            if 'shootsCatches' not in player:
                                df_splits['shot_dir'] = 'unknown'
                            else:
                                df_splits['shot_dir'] = player['shootsCatches'][0]
                                
                            df_main = pd.concat([df_main, df_splits], sort=False).reset_index(drop=True)
                        else:
                            pass        
                    else:
                        pass
            else:
                pass
        else:
            pass
    else:
        pass


KeyboardInterrupt



### Call 3 - DONE
Player IDs:  8455000 - 8460000

In [57]:
# Call 3

base_url = 'https://statsapi.web.nhl.com/api/v1/people/'
range1 = range(8455000, 8460000)

for num in range1:
    people_url = f'{base_url}{num}'
    response = requests.get(people_url)
    
    if response.status_code != 404:
        suggestions = json.loads(response.content)['people']
        player = (pd.json_normalize(suggestions))
        
        if player['primaryPosition.code'][0] != 'G':
            stats_url = f'{base_url}{num}/stats/?stats=yearByYear'
            response = requests.get(stats_url)
            
            if response.status_code != 500:
                content = json.loads(response.content)['stats']
                splits = content[0]['splits']
                
                if not splits:
                    pass
                else:
                    df_splits = (pd.json_normalize(splits, sep = "_" )
                                 .query('league_name == "National Hockey League"')
                                )
                                
                    if df_splits.shape[0] >= 3:
                        df_splits['player_id'] = player['id'][0]
                        df_splits['name'] = player['fullName'][0]
                        df_splits['position_code'] = player['primaryPosition.code'][0]
                        df_splits['stat_games'] = df_splits['stat_games'].astype(int)
                        total_games = df_splits.groupby(['player_id', 'name',])['stat_games'].sum().reset_index()
                        filtered_total_games = total_games[total_games['stat_games'] > 180]
                        
                        if not filtered_total_games.empty:
                            df_splits['season_start_yr'] = [x[0:4] for x in df_splits['season']]
                            df_splits['season_start_dt'] =  [datetime.strptime(x + '0930', "%Y%m%d") for x in df_splits['season_start_yr']] 
                            df_splits['season_end'] = [x[4:8] for x in df_splits['season']]
                            
                            df_splits['birth_date'] = pd.to_datetime(player['birthDate'][0])
                            df_splits['age'] = (np.floor((df_splits['season_start_dt'] - df_splits['birth_date'])/ np.timedelta64(1,'Y') ))
                            df_splits['age'] = df_splits['age'].astype(int)
                            df_splits['position_name'] = player['primaryPosition.name'][0]
                            df_splits['position_type'] = player['primaryPosition.type'][0]
                            df_splits['birth_country'] = player['birthCountry'][0]
                            df_splits['nationality'] = player['nationality'][0]
                            
                            # conditionals to pass default values where missing
                            if 'weight' not in player:
                                df_splits['weight'] = 0
                            else:
                                df_splits['weight'] = player['weight'][0]
                                
                            if 'height' not in player:
                                df_splits['height'] = 'unknown'
                            else:
                                df_splits['height'] = player['height'][0]
                            
                            if 'shootsCatches' not in player:
                                df_splits['shot_dir'] = 'unknown'
                            else:
                                df_splits['shot_dir'] = player['shootsCatches'][0]
                                
                            df_main = pd.concat([df_main, df_splits], sort=False).reset_index(drop=True)

In [24]:
df_main

Unnamed: 0,player_id,name,position_code,season,sequenceNumber,stat_assists,stat_goals,stat_games,stat_points,team_name,...,season_end,weight,height,shot_dir,birth_date,age,position_name,position_type,birth_country,nationality
0,8445000,Dave Andreychuk,L,19821983,1,23.0,14.0,43,37.0,Buffalo Sabres,...,1983,225,"6' 4""",R,1963-09-29,19,Left Wing,Forward,CAN,CAN
1,8445000,Dave Andreychuk,L,19831984,1,42.0,38.0,78,80.0,Buffalo Sabres,...,1984,225,"6' 4""",R,1963-09-29,20,Left Wing,Forward,CAN,CAN
2,8445000,Dave Andreychuk,L,19841985,1,30.0,31.0,64,61.0,Buffalo Sabres,...,1985,225,"6' 4""",R,1963-09-29,21,Left Wing,Forward,CAN,CAN
3,8445000,Dave Andreychuk,L,19851986,1,51.0,36.0,80,87.0,Buffalo Sabres,...,1986,225,"6' 4""",R,1963-09-29,22,Left Wing,Forward,CAN,CAN
4,8445000,Dave Andreychuk,L,19861987,1,48.0,25.0,77,73.0,Buffalo Sabres,...,1987,225,"6' 4""",R,1963-09-29,23,Left Wing,Forward,CAN,CAN
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
22988,8444998,Ace Bailey,R,19291930,1,21.0,22.0,43,43.0,Toronto Maple Leafs,...,1930,160,"5' 10""",R,1903-07-03 00:00:00,26,Right Wing,Forward,CAN,CAN
22989,8444998,Ace Bailey,R,19301931,1,19.0,23.0,40,42.0,Toronto Maple Leafs,...,1931,160,"5' 10""",R,1903-07-03 00:00:00,27,Right Wing,Forward,CAN,CAN
22990,8444998,Ace Bailey,R,19311932,1,5.0,8.0,44,13.0,Toronto Maple Leafs,...,1932,160,"5' 10""",R,1903-07-03 00:00:00,28,Right Wing,Forward,CAN,CAN
22991,8444998,Ace Bailey,R,19321933,1,8.0,10.0,47,18.0,Toronto Maple Leafs,...,1933,160,"5' 10""",R,1903-07-03 00:00:00,29,Right Wing,Forward,CAN,CAN


In [25]:
# save df_main
df_main.to_csv('df_main1.csv', index=False)

### Call 4 - DONE
Player IDs:  8460000 - 8465000 <br>
Starting a new DataFrame here as a precaution.  Will merge the two larger DataFrames once data collection is done.
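
The loop in each call keeps a player only when the NHL-only stats have at least three rows and the games across those rows total more than 180. A minimal sketch of that rule on its own, in plain Python, is given here. The nested `league` and `stat` keys are assumed from the flattened column names (`league_name`, `stat_games`), and `nhl_rows` and `is_eligible` are hypothetical helper names that do not appear in the notebook.

```python
NHL = "National Hockey League"


def nhl_rows(splits):
    """Keep only the season rows whose league is the NHL."""
    return [s for s in splits if s.get("league", {}).get("name") == NHL]


def is_eligible(splits, min_rows=3, min_games=180):
    """Mirror the notebook's two thresholds on a raw `splits` list.

    Rows are counted, not distinct seasons: a mid-season trade produces
    two rows (sequenceNumber 1 and 2) for the same season.
    """
    rows = nhl_rows(splits)
    if len(rows) < min_rows:
        return False
    total_games = sum(int(r["stat"]["games"]) for r in rows)
    return total_games > min_games


# Hypothetical, hand-made input in the assumed shape of the API response.
example = [
    {"season": "19971998", "league": {"name": NHL}, "stat": {"games": 82}},
    {"season": "19981999", "league": {"name": NHL}, "stat": {"games": 71}},
    {"season": "19992000", "league": {"name": NHL}, "stat": {"games": 78}},
    {"season": "19961997", "league": {"name": "WHL"}, "stat": {"games": 67}},
]
print(is_eligible(example))       # True: 3 NHL rows, 231 games
print(is_eligible(example[:2]))   # False: only 2 NHL rows
```

Keeping the rule in one function would let all eight calls share a single, testable definition of "player of interest".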

In [2]:
# Call 4

df_main2 = pd.DataFrame()
base_url = 'https://statsapi.web.nhl.com/api/v1/people/'
range1 = range(8460000, 8465000)

for num in range1:
    people_url = f'{base_url}{num}'
    response = requests.get(people_url)
    
    if response.status_code != 404:
        suggestions = json.loads(response.content)['people']
        player = (pd.json_normalize(suggestions))
        
        if player['primaryPosition.code'][0] != 'G':
            stats_url = f'{base_url}{num}/stats/?stats=yearByYear'
            response = requests.get(stats_url)
            
            if response.status_code != 500:
                content = json.loads(response.content)['stats']
                splits = content[0]['splits']
                
                if splits:
                    # .copy() gives an independent frame, avoiding a possible SettingWithCopyWarning on the assignments below
                    df_splits = (pd.json_normalize(splits, sep="_")
                                 .query('league_name == "National Hockey League"')
                                 .copy()
                                )
                                
                    if df_splits.shape[0] >= 3:
                        df_splits['player_id'] = player['id'][0]
                        df_splits['name'] = player['fullName'][0]
                        df_splits['position_code'] = player['primaryPosition.code'][0]
                        df_splits['stat_games'] = df_splits['stat_games'].astype(int)
                        total_games = df_splits.groupby(['player_id', 'name',])['stat_games'].sum().reset_index()
                        filtered_total_games = total_games[total_games['stat_games'] > 180]
                        
                        if not filtered_total_games.empty:
                            df_splits['season_start_yr'] = [x[0:4] for x in df_splits['season']]
                            df_splits['season_start_dt'] =  [datetime.strptime(x + '0930', "%Y%m%d") for x in df_splits['season_start_yr']] 
                            df_splits['season_end'] = [x[4:8] for x in df_splits['season']]
                            
                            df_splits['birth_date'] = pd.to_datetime(player['birthDate'][0])
                            # whole years as days / 365.2425 (NumPy's year length); recent pandas releases reject np.timedelta64(1, 'Y')
                            age_days = (df_splits['season_start_dt'] - df_splits['birth_date']).dt.days
                            df_splits['age'] = np.floor(age_days / 365.2425).astype(int)
                            df_splits['position_name'] = player['primaryPosition.name'][0]
                            df_splits['position_type'] = player['primaryPosition.type'][0]
                            df_splits['birth_country'] = player['birthCountry'][0]
                            df_splits['nationality'] = player['nationality'][0]
                            
                            # conditionals to pass default values where missing
                            if 'weight' not in player:
                                df_splits['weight'] = 0
                            else:
                                df_splits['weight'] = player['weight'][0]
                                
                            if 'height' not in player:
                                df_splits['height'] = 'unknown'
                            else:
                                df_splits['height'] = player['height'][0]
                            
                            if 'shootsCatches' not in player:
                                df_splits['shot_dir'] = 'unknown'
                            else:
                                df_splits['shot_dir'] = player['shootsCatches'][0]
                                
                            df_main2 = pd.concat([df_main2, df_splits], sort=False).reset_index(drop=True)

In [4]:
df_main2

Unnamed: 0,season,sequenceNumber,stat_timeOnIce,stat_assists,stat_goals,stat_pim,stat_games,stat_penaltyMinutes,stat_points,team_name,...,season_end,birth_date,age,position_name,position_type,birth_country,nationality,weight,height,shot_dir
0,19961997,1,,0,0,4.0,21,4,0,Colorado Avalanche,...,1997,1973-10-29,22,Left Wing,Forward,CAN,CAN,197,"6' 2""",L
1,19971998,1,1110:09,12,4,20.0,62,20,16,Colorado Avalanche,...,1998,1973-10-29,23,Left Wing,Forward,CAN,CAN,197,"6' 2""",L
2,19981999,1,425:18,2,4,14.0,31,14,6,Colorado Avalanche,...,1999,1973-10-29,24,Left Wing,Forward,CAN,CAN,197,"6' 2""",L
3,19992000,1,637:05,6,3,24.0,61,24,9,Colorado Avalanche,...,2000,1973-10-29,25,Left Wing,Forward,CAN,CAN,197,"6' 2""",L
4,20002001,1,785:35,7,5,26.0,64,26,12,Colorado Avalanche,...,2001,1973-10-29,26,Left Wing,Forward,CAN,CAN,197,"6' 2""",L
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1617,20012002,1,718:47,10,9,8.0,53,8,19,Montréal Canadiens,...,2002,1978-03-18,23,Center,Forward,CZE,CZE,209,"6' 0""",L
1618,20022003,1,1287:37,24,16,30.0,82,30,40,Montréal Canadiens,...,2003,1978-03-18,24,Center,Forward,CZE,CZE,209,"6' 0""",L
1619,20032004,1,1232:19,17,13,30.0,72,30,30,Montréal Canadiens,...,2004,1978-03-18,25,Center,Forward,CZE,CZE,209,"6' 0""",L
1620,20052006,1,1140:35,20,20,50.0,73,50,40,Montréal Canadiens,...,2006,1978-03-18,27,Center,Forward,CZE,CZE,209,"6' 0""",L


In [5]:
# print Derek Morris stats
print(df_main2[df_main2['player_id'] == 8464966])

        season  sequenceNumber stat_timeOnIce  stat_assists  stat_goals  \
1436  19971998               1        1589:14            20           9   
1437  19981999               1        1472:05            27           7   
1438  19992000               1        1937:51            29           9   
1439  20002001               1        1318:02            23           5   
1440  20012002               1        1504:47            30           4   
1441  20022003               1        1786:33            37          11   
1442  20032004               1        1441:21            22           6   
1443  20032004               2         350:29             4           0   
1444  20052006               1        1105:37            21           6   
1445  20062007               1        1680:16            19           6   
1446  20072008               1        1780:32            17           8   
1447  20082009               1        1212:28             7           5   
1448  20082009           

### Call 5 - DONE
Player IDs:  8465000 - 8470000 <br>
Results added to second DataFrame (`df_main2`).
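
For each season row, the loop splits the eight-digit `season` string into start and end years, pins the season start to 30 September, and derives the player's age on that date. The sketch below does the same with the standard library only. Note that it swaps the notebook's division by `np.timedelta64(1, 'Y')` for a calendar comparison, so results may differ by a year when a birthday falls very close to 30 September. `season_bounds` and `age_at_season_start` are illustrative names, not notebook functions.

```python
from datetime import date


def season_bounds(season):
    """Split a season string such as '19821983' into (1982, 1983)."""
    return int(season[0:4]), int(season[4:8])


def age_at_season_start(birth, season, month=9, day=30):
    """Age in whole years on the assumed season start date (30 September)."""
    start_year, _ = season_bounds(season)
    start = date(start_year, month, day)
    had_birthday = (start.month, start.day) >= (birth.month, birth.day)
    return start.year - birth.year - (0 if had_birthday else 1)


# Dave Andreychuk, born 1963-09-29, 1982-1983 season (values from the output above).
print(season_bounds("19821983"))                            # (1982, 1983)
print(age_at_season_start(date(1963, 9, 29), "19821983"))   # 19
```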

In [7]:
# Call 5

base_url = 'https://statsapi.web.nhl.com/api/v1/people/'
range1 = range(8465000, 8470000)

for num in range1:
    people_url = f'{base_url}{num}'
    response = requests.get(people_url)
    
    if response.status_code != 404:
        suggestions = json.loads(response.content)['people']
        player = (pd.json_normalize(suggestions))
        
        if player['primaryPosition.code'][0] != 'G':
            stats_url = f'{base_url}{num}/stats/?stats=yearByYear'
            response = requests.get(stats_url)
            
            if response.status_code != 500:
                content = json.loads(response.content)['stats']
                splits = content[0]['splits']
                
                if splits:
                    # .copy() gives an independent frame, avoiding a possible SettingWithCopyWarning on the assignments below
                    df_splits = (pd.json_normalize(splits, sep="_")
                                 .query('league_name == "National Hockey League"')
                                 .copy()
                                )
                                
                    if df_splits.shape[0] >= 3:
                        df_splits['player_id'] = player['id'][0]
                        df_splits['name'] = player['fullName'][0]
                        df_splits['position_code'] = player['primaryPosition.code'][0]
                        df_splits['stat_games'] = df_splits['stat_games'].astype(int)
                        total_games = df_splits.groupby(['player_id', 'name',])['stat_games'].sum().reset_index()
                        filtered_total_games = total_games[total_games['stat_games'] > 180]
                        
                        if not filtered_total_games.empty:
                            df_splits['season_start_yr'] = [x[0:4] for x in df_splits['season']]
                            df_splits['season_start_dt'] =  [datetime.strptime(x + '0930', "%Y%m%d") for x in df_splits['season_start_yr']] 
                            df_splits['season_end'] = [x[4:8] for x in df_splits['season']]
                            
                            df_splits['birth_date'] = pd.to_datetime(player['birthDate'][0])
                            # whole years as days / 365.2425 (NumPy's year length); recent pandas releases reject np.timedelta64(1, 'Y')
                            age_days = (df_splits['season_start_dt'] - df_splits['birth_date']).dt.days
                            df_splits['age'] = np.floor(age_days / 365.2425).astype(int)
                            df_splits['position_name'] = player['primaryPosition.name'][0]
                            df_splits['position_type'] = player['primaryPosition.type'][0]
                            df_splits['birth_country'] = player['birthCountry'][0]
                            df_splits['nationality'] = player['nationality'][0]
                            
                            # conditionals to pass default values where missing
                            if 'weight' not in player:
                                df_splits['weight'] = 0
                            else:
                                df_splits['weight'] = player['weight'][0]
                                
                            if 'height' not in player:
                                df_splits['height'] = 'unknown'
                            else:
                                df_splits['height'] = player['height'][0]
                            
                            if 'shootsCatches' not in player:
                                df_splits['shot_dir'] = 'unknown'
                            else:
                                df_splits['shot_dir'] = player['shootsCatches'][0]
                                
                            df_main2 = pd.concat([df_main2, df_splits], sort=False).reset_index(drop=True)

### Call 6 - DONE
Player IDs:  8470000 - 8475000 <br>
Results added to second DataFrame (`df_main2`).
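
Each loop treats any people response other than 404, and any stats response other than 500, as usable. A stricter variant, sketched here as an assumption rather than the notebook's method, accepts only HTTP 200 and returns `None` otherwise. The `get` parameter is a hypothetical injection point (pass `requests.get` in practice) so the logic can be exercised without network access; `fetch_json`, `FakeResponse`, and `fake_get` are illustrative names.

```python
import json


def fetch_json(url, get):
    """Return the parsed JSON body for `url`, or None unless the status is 200."""
    response = get(url)
    if response.status_code != 200:
        return None
    return json.loads(response.content)


class FakeResponse:
    """Stand-in for requests.Response, used only for this offline demo."""

    def __init__(self, status_code, payload=None):
        self.status_code = status_code
        self.content = json.dumps(payload or {}).encode("utf-8")


def fake_get(url):
    # Hypothetical canned responses keyed by the trailing player ID.
    if url.endswith("/1"):
        return FakeResponse(200, {"people": [{"id": 1, "fullName": "Test Player"}]})
    return FakeResponse(404)


base_url = "https://example.invalid/people/"   # placeholder, not the NHL API
print(fetch_json(f"{base_url}1", fake_get))     # {'people': [{'id': 1, 'fullName': 'Test Player'}]}
print(fetch_json(f"{base_url}2", fake_get))     # None
```

With a helper like this, the loop body can skip an ID whenever either request returns `None`, instead of testing for one specific error code per endpoint.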

In [10]:
# Call 6 

base_url = 'https://statsapi.web.nhl.com/api/v1/people/'
range1 = range(8470000, 8475000)

for num in range1:
    people_url = f'{base_url}{num}'
    response = requests.get(people_url)
    
    if response.status_code != 404:
        suggestions = json.loads(response.content)['people']
        player = (pd.json_normalize(suggestions))
        
        if player['primaryPosition.code'][0] != 'G':
            stats_url = f'{base_url}{num}/stats/?stats=yearByYear'
            response = requests.get(stats_url)
            
            if response.status_code != 500:
                content = json.loads(response.content)['stats']
                splits = content[0]['splits']
                
                if splits:
                    # .copy() gives an independent frame, avoiding a possible SettingWithCopyWarning on the assignments below
                    df_splits = (pd.json_normalize(splits, sep="_")
                                 .query('league_name == "National Hockey League"')
                                 .copy()
                                )
                                
                    if df_splits.shape[0] >= 3:
                        df_splits['player_id'] = player['id'][0]
                        df_splits['name'] = player['fullName'][0]
                        df_splits['position_code'] = player['primaryPosition.code'][0]
                        df_splits['stat_games'] = df_splits['stat_games'].astype(int)
                        total_games = df_splits.groupby(['player_id', 'name',])['stat_games'].sum().reset_index()
                        filtered_total_games = total_games[total_games['stat_games'] > 180]
                        
                        if not filtered_total_games.empty:
                            df_splits['season_start_yr'] = [x[0:4] for x in df_splits['season']]
                            df_splits['season_start_dt'] =  [datetime.strptime(x + '0930', "%Y%m%d") for x in df_splits['season_start_yr']] 
                            df_splits['season_end'] = [x[4:8] for x in df_splits['season']]
                            
                            df_splits['birth_date'] = pd.to_datetime(player['birthDate'][0])
                            # whole years as days / 365.2425 (NumPy's year length); recent pandas releases reject np.timedelta64(1, 'Y')
                            age_days = (df_splits['season_start_dt'] - df_splits['birth_date']).dt.days
                            df_splits['age'] = np.floor(age_days / 365.2425).astype(int)
                            df_splits['position_name'] = player['primaryPosition.name'][0]
                            df_splits['position_type'] = player['primaryPosition.type'][0]
                            df_splits['birth_country'] = player['birthCountry'][0]
                            df_splits['nationality'] = player['nationality'][0]
                            
                            # conditionals to pass default values where missing
                            if 'weight' not in player:
                                df_splits['weight'] = 0
                            else:
                                df_splits['weight'] = player['weight'][0]
                                
                            if 'height' not in player:
                                df_splits['height'] = 'unknown'
                            else:
                                df_splits['height'] = player['height'][0]
                            
                            if 'shootsCatches' not in player:
                                df_splits['shot_dir'] = 'unknown'
                            else:
                                df_splits['shot_dir'] = player['shootsCatches'][0]
                                
                            df_main2 = pd.concat([df_main2, df_splits], sort=False).reset_index(drop=True)

In [11]:
df_main2

Unnamed: 0,season,sequenceNumber,stat_timeOnIce,stat_assists,stat_goals,stat_pim,stat_games,stat_penaltyMinutes,stat_points,team_name,...,season_end,birth_date,age,position_name,position_type,birth_country,nationality,weight,height,shot_dir
0,19961997,1,,0.0,0.0,4.0,21,4,0,Colorado Avalanche,...,1997,1973-10-29,22,Left Wing,Forward,CAN,CAN,197,"6' 2""",L
1,19971998,1,1110:09,12.0,4.0,20.0,62,20,16,Colorado Avalanche,...,1998,1973-10-29,23,Left Wing,Forward,CAN,CAN,197,"6' 2""",L
2,19981999,1,425:18,2.0,4.0,14.0,31,14,6,Colorado Avalanche,...,1999,1973-10-29,24,Left Wing,Forward,CAN,CAN,197,"6' 2""",L
3,19992000,1,637:05,6.0,3.0,24.0,61,24,9,Colorado Avalanche,...,2000,1973-10-29,25,Left Wing,Forward,CAN,CAN,197,"6' 2""",L
4,20002001,1,785:35,7.0,5.0,26.0,64,26,12,Colorado Avalanche,...,2001,1973-10-29,26,Left Wing,Forward,CAN,CAN,197,"6' 2""",L
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10066,20192020,1,1161:48,30.0,29.0,28.0,69,28,59,Florida Panthers,...,2020,1989-11-24,29,Center,Forward,CAN,CAN,190,"6' 0""",L
10067,20202021,1,783:20,19.0,17.0,10.0,52,10,36,St. Louis Blues,...,2021,1989-11-24,30,Center,Forward,CAN,CAN,190,"6' 0""",L
10068,20212022,1,1141:49,20.0,15.0,32.0,67,32,35,Montréal Canadiens,...,2022,1989-11-24,31,Center,Forward,CAN,CAN,190,"6' 0""",L
10069,20222023,1,1050:08,20.0,14.0,28.0,67,28,34,Montréal Canadiens,...,2023,1989-11-24,32,Center,Forward,CAN,CAN,190,"6' 0""",L


### Call 7 - DONE
Player IDs:  8475000 - 8482246 <br>
Results added to second DataFrame (`df_main2`).
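
Calls 1 through 7 walk the ID space in half-open ranges (Python's `range` excludes its upper bound), so a boundary ID such as 8470000 is requested by exactly one call. A small sketch of generating such chunks follows; `id_chunks` is a hypothetical helper, and a fixed step of 5,000 splits the final stretch into two chunks where the notebook uses a single Call 7.

```python
def id_chunks(start, stop, size=5000):
    """Yield half-open (lo, hi) pairs that tile [start, stop) without overlap."""
    for lo in range(start, stop, size):
        yield lo, min(lo + size, stop)


chunks = list(id_chunks(8445000, 8482246))
print(len(chunks))    # 8
print(chunks[0])      # (8445000, 8450000)
print(chunks[-1])     # (8480000, 8482246)
```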

In [13]:
# save df_main2 (checkpoint before Call 7)
df_main2.to_csv('df_main2.csv', index=False)

In [14]:
# Call 7

base_url = 'https://statsapi.web.nhl.com/api/v1/people/'
range1 = range(8475000, 8482246)

for num in range1:
    people_url = f'{base_url}{num}'
    response = requests.get(people_url)
    
    if response.status_code != 404:
        suggestions = json.loads(response.content)['people']
        player = (pd.json_normalize(suggestions))
        
        if player['primaryPosition.code'][0] != 'G':
            stats_url = f'{base_url}{num}/stats/?stats=yearByYear'
            response = requests.get(stats_url)
            
            if response.status_code != 500:
                content = json.loads(response.content)['stats']
                splits = content[0]['splits']
                
                if splits:
                    # .copy() gives an independent frame, avoiding a possible SettingWithCopyWarning on the assignments below
                    df_splits = (pd.json_normalize(splits, sep="_")
                                 .query('league_name == "National Hockey League"')
                                 .copy()
                                )
                                
                    if df_splits.shape[0] >= 3:
                        df_splits['player_id'] = player['id'][0]
                        df_splits['name'] = player['fullName'][0]
                        df_splits['position_code'] = player['primaryPosition.code'][0]
                        df_splits['stat_games'] = df_splits['stat_games'].astype(int)
                        total_games = df_splits.groupby(['player_id', 'name',])['stat_games'].sum().reset_index()
                        filtered_total_games = total_games[total_games['stat_games'] > 180]
                        
                        if not filtered_total_games.empty:
                            df_splits['season_start_yr'] = [x[0:4] for x in df_splits['season']]
                            df_splits['season_start_dt'] =  [datetime.strptime(x + '0930', "%Y%m%d") for x in df_splits['season_start_yr']] 
                            df_splits['season_end'] = [x[4:8] for x in df_splits['season']]
                            
                            df_splits['birth_date'] = pd.to_datetime(player['birthDate'][0])
                            # whole years as days / 365.2425 (NumPy's year length); recent pandas releases reject np.timedelta64(1, 'Y')
                            age_days = (df_splits['season_start_dt'] - df_splits['birth_date']).dt.days
                            df_splits['age'] = np.floor(age_days / 365.2425).astype(int)
                            df_splits['position_name'] = player['primaryPosition.name'][0]
                            df_splits['position_type'] = player['primaryPosition.type'][0]
                            df_splits['birth_country'] = player['birthCountry'][0]
                            df_splits['nationality'] = player['nationality'][0]
                            
                            # conditionals to pass default values where missing
                            if 'weight' not in player:
                                df_splits['weight'] = 0
                            else:
                                df_splits['weight'] = player['weight'][0]
                                
                            if 'height' not in player:
                                df_splits['height'] = 'unknown'
                            else:
                                df_splits['height'] = player['height'][0]
                            
                            if 'shootsCatches' not in player:
                                df_splits['shot_dir'] = 'unknown'
                            else:
                                df_splits['shot_dir'] = player['shootsCatches'][0]
                                
                            df_main2 = pd.concat([df_main2, df_splits], sort=False).reset_index(drop=True)

In [15]:
df_main2

Unnamed: 0,season,sequenceNumber,stat_timeOnIce,stat_assists,stat_goals,stat_pim,stat_games,stat_penaltyMinutes,stat_points,team_name,...,season_end,birth_date,age,position_name,position_type,birth_country,nationality,weight,height,shot_dir
0,19961997,1,,0.0,0.0,4.0,21,4,0,Colorado Avalanche,...,1997,1973-10-29,22,Left Wing,Forward,CAN,CAN,197,"6' 2""",L
1,19971998,1,1110:09,12.0,4.0,20.0,62,20,16,Colorado Avalanche,...,1998,1973-10-29,23,Left Wing,Forward,CAN,CAN,197,"6' 2""",L
2,19981999,1,425:18,2.0,4.0,14.0,31,14,6,Colorado Avalanche,...,1999,1973-10-29,24,Left Wing,Forward,CAN,CAN,197,"6' 2""",L
3,19992000,1,637:05,6.0,3.0,24.0,61,24,9,Colorado Avalanche,...,2000,1973-10-29,25,Left Wing,Forward,CAN,CAN,197,"6' 2""",L
4,20002001,1,785:35,7.0,5.0,26.0,64,26,12,Colorado Avalanche,...,2001,1973-10-29,26,Left Wing,Forward,CAN,CAN,197,"6' 2""",L
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
14797,20232024,1,103:16,4.0,2.0,8.0,5,8,6,Ottawa Senators,...,2024,2002-01-15,21,Center,Forward,DEU,DEU,197,"6' 0""",L
14798,20202021,1,863:58,11.0,3.0,26.0,47,26,14,Ottawa Senators,...,2021,1995-10-03,25,Defenseman,Defenseman,RUS,RUS,204,"6' 3""",R
14799,20212022,1,1704:14,16.0,6.0,60.0,81,60,22,Ottawa Senators,...,2022,1995-10-03,26,Defenseman,Defenseman,RUS,RUS,204,"6' 3""",R
14800,20222023,1,1073:25,7.0,3.0,39.0,53,39,10,Ottawa Senators,...,2023,1995-10-03,27,Defenseman,Defenseman,RUS,RUS,204,"6' 3""",R


In [20]:
# save df_main2
df_main2.to_csv('df_main2.csv', index=False)

### Call 8 - DONE

We add one more call to make sure all players of interest are captured (i.e., those who played at least 3 seasons after the 1978-1979 season). <br>
Calling the whole lower ID range would be far too large, so this call starts at the earliest player of interest, Greg Adams (ID 8444894). <br>
Player IDs:  8444894 - 8445000 <br>
First, load the original DataFrame (`df_main`).

In [16]:
df_main = pd.read_csv('df_main.csv')

In [17]:
df_main

Unnamed: 0,player_id,name,position_code,season,sequenceNumber,stat_assists,stat_goals,stat_games,stat_points,team_name,...,season_end,weight,height,shot_dir,birth_date,age,position_name,position_type,birth_country,nationality
0,8445000,Dave Andreychuk,L,19821983,1,23.0,14.0,43,37.0,Buffalo Sabres,...,1983,225,"6' 4""",R,1963-09-29,19,Left Wing,Forward,CAN,CAN
1,8445000,Dave Andreychuk,L,19831984,1,42.0,38.0,78,80.0,Buffalo Sabres,...,1984,225,"6' 4""",R,1963-09-29,20,Left Wing,Forward,CAN,CAN
2,8445000,Dave Andreychuk,L,19841985,1,30.0,31.0,64,61.0,Buffalo Sabres,...,1985,225,"6' 4""",R,1963-09-29,21,Left Wing,Forward,CAN,CAN
3,8445000,Dave Andreychuk,L,19851986,1,51.0,36.0,80,87.0,Buffalo Sabres,...,1986,225,"6' 4""",R,1963-09-29,22,Left Wing,Forward,CAN,CAN
4,8445000,Dave Andreychuk,L,19861987,1,48.0,25.0,77,73.0,Buffalo Sabres,...,1987,225,"6' 4""",R,1963-09-29,23,Left Wing,Forward,CAN,CAN
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
22623,8459687,Scott Nichol,C,20122013,1,0.0,1.0,30,1.0,St. Louis Blues,...,2013,179,"5' 8""",R,1974-12-31,37,Center,Forward,CAN,CAN
22624,8459700,Hans Jonsson,D,19992000,1,11.0,3.0,68,14.0,Pittsburgh Penguins,...,2000,176,"6' 1""",L,1973-08-02,26,Defenseman,Defenseman,SWE,SWE
22625,8459700,Hans Jonsson,D,20002001,1,18.0,4.0,58,22.0,Pittsburgh Penguins,...,2001,176,"6' 1""",L,1973-08-02,27,Defenseman,Defenseman,SWE,SWE
22626,8459700,Hans Jonsson,D,20012002,1,5.0,2.0,53,7.0,Pittsburgh Penguins,...,2002,176,"6' 1""",L,1973-08-02,28,Defenseman,Defenseman,SWE,SWE
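
Reloading `df_main` from CSV returns `birth_date` as plain text and `season` as an integer unless told otherwise, whereas rows appended by a later call carry Timestamps and string seasons. That is likely why the later output shows both `1963-09-29` and `1903-07-03 00:00:00` in the same column. A minimal sketch of a reload that keeps the types aligned, using an in-memory CSV as a stand-in for the saved file:

```python
import io

import pandas as pd

# Two rows in the shape of the saved file (values copied from the output above).
csv_text = (
    "player_id,name,season,birth_date\n"
    "8445000,Dave Andreychuk,19821983,1963-09-29\n"
    "8445000,Dave Andreychuk,19831984,1963-09-29\n"
)

df = pd.read_csv(
    io.StringIO(csv_text),
    dtype={"season": str},        # keep '19821983' sliceable as text
    parse_dates=["birth_date"],   # restore datetime64 values
)

print(df.dtypes)
```

The same two keyword arguments can be passed when reading the real file, so that concatenating freshly collected rows does not leave mixed types behind.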


In [18]:
# Call 8

base_url = 'https://statsapi.web.nhl.com/api/v1/people/'
range1 = range(8444894, 8445000)

for num in range1:
    people_url = f'{base_url}{num}'
    response = requests.get(people_url)
    
    if response.status_code != 404:
        suggestions = json.loads(response.content)['people']
        player = (pd.json_normalize(suggestions))
        
        if player['primaryPosition.code'][0] != 'G':
            stats_url = f'{base_url}{num}/stats/?stats=yearByYear'
            response = requests.get(stats_url)
            
            if response.status_code != 500:
                content = json.loads(response.content)['stats']
                splits = content[0]['splits']
                
                if splits:
                    # .copy() gives an independent frame, avoiding a possible SettingWithCopyWarning on the assignments below
                    df_splits = (pd.json_normalize(splits, sep="_")
                                 .query('league_name == "National Hockey League"')
                                 .copy()
                                )
                                
                    if df_splits.shape[0] >= 3:
                        df_splits['player_id'] = player['id'][0]
                        df_splits['name'] = player['fullName'][0]
                        df_splits['position_code'] = player['primaryPosition.code'][0]
                        df_splits['stat_games'] = df_splits['stat_games'].astype(int)
                        total_games = df_splits.groupby(['player_id', 'name',])['stat_games'].sum().reset_index()
                        filtered_total_games = total_games[total_games['stat_games'] > 180]
                        
                        if not filtered_total_games.empty:
                            df_splits['season_start_yr'] = [x[0:4] for x in df_splits['season']]
                            df_splits['season_start_dt'] =  [datetime.strptime(x + '0930', "%Y%m%d") for x in df_splits['season_start_yr']] 
                            df_splits['season_end'] = [x[4:8] for x in df_splits['season']]
                            
                            df_splits['birth_date'] = pd.to_datetime(player['birthDate'][0])
                            # whole years as days / 365.2425 (NumPy's year length); recent pandas releases reject np.timedelta64(1, 'Y')
                            age_days = (df_splits['season_start_dt'] - df_splits['birth_date']).dt.days
                            df_splits['age'] = np.floor(age_days / 365.2425).astype(int)
                            df_splits['position_name'] = player['primaryPosition.name'][0]
                            df_splits['position_type'] = player['primaryPosition.type'][0]
                            df_splits['birth_country'] = player['birthCountry'][0]
                            df_splits['nationality'] = player['nationality'][0]
                            
                            # conditionals to pass default values where missing
                            if 'weight' not in player:
                                df_splits['weight'] = 0
                            else:
                                df_splits['weight'] = player['weight'][0]
                                
                            if 'height' not in player:
                                df_splits['height'] = 'unknown'
                            else:
                                df_splits['height'] = player['height'][0]
                            
                            if 'shootsCatches' not in player:
                                df_splits['shot_dir'] = 'unknown'
                            else:
                                df_splits['shot_dir'] = player['shootsCatches'][0]
                                
                            df_main = pd.concat([df_main, df_splits], sort=False).reset_index(drop=True)

In [19]:
df_main

Unnamed: 0,player_id,name,position_code,season,sequenceNumber,stat_assists,stat_goals,stat_games,stat_points,team_name,...,season_end,weight,height,shot_dir,birth_date,age,position_name,position_type,birth_country,nationality
0,8445000,Dave Andreychuk,L,19821983,1,23.0,14.0,43,37.0,Buffalo Sabres,...,1983,225,"6' 4""",R,1963-09-29,19,Left Wing,Forward,CAN,CAN
1,8445000,Dave Andreychuk,L,19831984,1,42.0,38.0,78,80.0,Buffalo Sabres,...,1984,225,"6' 4""",R,1963-09-29,20,Left Wing,Forward,CAN,CAN
2,8445000,Dave Andreychuk,L,19841985,1,30.0,31.0,64,61.0,Buffalo Sabres,...,1985,225,"6' 4""",R,1963-09-29,21,Left Wing,Forward,CAN,CAN
3,8445000,Dave Andreychuk,L,19851986,1,51.0,36.0,80,87.0,Buffalo Sabres,...,1986,225,"6' 4""",R,1963-09-29,22,Left Wing,Forward,CAN,CAN
4,8445000,Dave Andreychuk,L,19861987,1,48.0,25.0,77,73.0,Buffalo Sabres,...,1987,225,"6' 4""",R,1963-09-29,23,Left Wing,Forward,CAN,CAN
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
22988,8444998,Ace Bailey,R,19291930,1,21.0,22.0,43,43.0,Toronto Maple Leafs,...,1930,160,"5' 10""",R,1903-07-03 00:00:00,26,Right Wing,Forward,CAN,CAN
22989,8444998,Ace Bailey,R,19301931,1,19.0,23.0,40,42.0,Toronto Maple Leafs,...,1931,160,"5' 10""",R,1903-07-03 00:00:00,27,Right Wing,Forward,CAN,CAN
22990,8444998,Ace Bailey,R,19311932,1,5.0,8.0,44,13.0,Toronto Maple Leafs,...,1932,160,"5' 10""",R,1903-07-03 00:00:00,28,Right Wing,Forward,CAN,CAN
22991,8444998,Ace Bailey,R,19321933,1,8.0,10.0,47,18.0,Toronto Maple Leafs,...,1933,160,"5' 10""",R,1903-07-03 00:00:00,29,Right Wing,Forward,CAN,CAN


### Combining DataFrames

Now we need to concatenate the two collection DataFrames (`df_main` and `df_main2`) into a single one.  Before doing so, I am checking for duplicates.
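
A minimal sketch of one way to run such a duplicate check, using toy stand-ins rather than the real collection DataFrames.  The names `df_a`, `df_b` and `key_cols` are hypothetical, and treating `player_id`, `season` and `sequenceNumber` as the unique key for a row is an assumption, not something confirmed by the API.

```python
import pandas as pd

# Toy stand-ins for the two collection DataFrames (illustrative values only).
df_a = pd.DataFrame({
    "player_id": [1, 1, 2],
    "season": [19821983, 19831984, 19821983],
    "sequenceNumber": [1, 1, 1],
})
df_b = pd.DataFrame({
    "player_id": [2, 3],
    "season": [19821983, 20202021],
    "sequenceNumber": [1, 1],
})

# Assumed key: one row per player, season and sequenceNumber.
key_cols = ["player_id", "season", "sequenceNumber"]

combined = pd.concat([df_a, df_b], sort=False).reset_index(drop=True)

# keep=False flags every row of a duplicated group, not only the repeats,
# which makes it easier to inspect where the overlap came from.
dupes = combined[combined.duplicated(subset=key_cols, keep=False)]
n_dupes = len(dupes)

# Keep the first occurrence of each key and renumber the index.
deduped = combined.drop_duplicates(subset=key_cols).reset_index(drop=True)
```

If `n_dupes` is zero, the two frames do not overlap on the assumed key and the concatenation can proceed as-is.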

In [21]:
# Combine df_main and df_main2

df_main_total = pd.concat([df_main, df_main2], sort=False).reset_index(drop=True)

In [22]:
df_main_total

Unnamed: 0,player_id,name,position_code,season,sequenceNumber,stat_assists,stat_goals,stat_games,stat_points,team_name,...,season_end,weight,height,shot_dir,birth_date,age,position_name,position_type,birth_country,nationality
0,8445000,Dave Andreychuk,L,19821983,1,23.0,14.0,43,37.0,Buffalo Sabres,...,1983,225,"6' 4""",R,1963-09-29,19,Left Wing,Forward,CAN,CAN
1,8445000,Dave Andreychuk,L,19831984,1,42.0,38.0,78,80.0,Buffalo Sabres,...,1984,225,"6' 4""",R,1963-09-29,20,Left Wing,Forward,CAN,CAN
2,8445000,Dave Andreychuk,L,19841985,1,30.0,31.0,64,61.0,Buffalo Sabres,...,1985,225,"6' 4""",R,1963-09-29,21,Left Wing,Forward,CAN,CAN
3,8445000,Dave Andreychuk,L,19851986,1,51.0,36.0,80,87.0,Buffalo Sabres,...,1986,225,"6' 4""",R,1963-09-29,22,Left Wing,Forward,CAN,CAN
4,8445000,Dave Andreychuk,L,19861987,1,48.0,25.0,77,73.0,Buffalo Sabres,...,1987,225,"6' 4""",R,1963-09-29,23,Left Wing,Forward,CAN,CAN
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
37790,8482116,Tim Stützle,C,20232024,1,4.0,2.0,5,6.0,Ottawa Senators,...,2024,197,"6' 0""",L,2002-01-15 00:00:00,21,Center,Forward,DEU,DEU
37791,8482245,Artem Zub,D,20202021,1,11.0,3.0,47,14.0,Ottawa Senators,...,2021,204,"6' 3""",R,1995-10-03 00:00:00,25,Defenseman,Defenseman,RUS,RUS
37792,8482245,Artem Zub,D,20212022,1,16.0,6.0,81,22.0,Ottawa Senators,...,2022,204,"6' 3""",R,1995-10-03 00:00:00,26,Defenseman,Defenseman,RUS,RUS
37793,8482245,Artem Zub,D,20222023,1,7.0,3.0,53,10.0,Ottawa Senators,...,2023,204,"6' 3""",R,1995-10-03 00:00:00,27,Defenseman,Defenseman,RUS,RUS


In [26]:
# Save the combined DataFrame to df_main.csv
df_main_total.to_csv('df_main.csv', index=False)

### Dropping Rows

The following two blocks of code are used to drop rows of data once a bug has been encountered in the collection process.  The first need for this arose after an incorrectly written conditional caused valuable data to be skipped.  The first block of code is used to find the indices to drop; the second block executes the drop over a defined range of indices.  In all cases, we will know the player ID to return to, as it will be defined by the range of the API call that preceded it.  The code below has been commented out so that it is not accidentally run out of turn.

In [35]:
# Use player_id to find the range of indices to drop
#print(df_main.loc[df_main['player_id'] >= 8448846, :])

       player_id            name position_code    season  sequenceNumber  \
11747    8448846     Nick Libett             L  19671968               1   
11748    8448846     Nick Libett             L  19681969               1   
11749    8448846     Nick Libett             L  19691970               1   
11750    8448846     Nick Libett             L  19701971               1   
11751    8448846     Nick Libett             L  19711972               1   
...          ...             ...           ...       ...             ...   
11897    8448873  Mark Lofthouse             R  19801981               1   
11898    8448873  Mark Lofthouse             R  19811982               1   
11899    8448873  Mark Lofthouse             R  19821983               1   
11901    8448876      Barry Long             D  19731974               1   
11902    8448876      Barry Long             D  19791980               1   


In [55]:
# Define the range of indices to drop
# (range() excludes the stop value, so this covers index 18902 only)
indices_to_drop = list(range(18902, 18903))

# Drop the specified range of indices
df_main = df_main.drop(indices_to_drop)
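
The two steps above can also be combined so that the indices are taken directly from the player ID condition rather than typed in by hand.  This is a minimal sketch on a toy DataFrame; `df_demo`, `restart_id` and `df_trimmed` are hypothetical names, and the rows are illustrative rather than real collection output.

```python
import pandas as pd

# Toy stand-in for df_main (illustrative rows only).
df_demo = pd.DataFrame({
    "player_id": [8445000, 8448846, 8448846, 8448873],
    "season": [19821983, 19671968, 19681969, 19801981],
})

# Hypothetical restart point: the first player ID of the call to be re-run.
restart_id = 8448846

# Index labels of every row at or beyond the restart point.
indices_to_drop = df_demo.index[df_demo["player_id"] >= restart_id]

# Drop those rows and renumber the index.
df_trimmed = df_demo.drop(indices_to_drop).reset_index(drop=True)
```

Taking the labels from a boolean mask avoids listing an index that no longer exists: by default, `DataFrame.drop` raises a `KeyError` when asked to drop a missing label, which a hand-written `range` can trigger if the index has gaps.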