<h1>Pull Single Games Manually</h1>

<h2>Required modules</h2>
<p>This is a simple script to pull information using the Asyncio library. It takes a list of URLs and then returns them as a list of JSON responses in a Jupyter notebook.</p>
<p>It requires the following libraries:</p>
<ol>
<li><span class="nn">Asyncio and Aiohttp (3.6.2) to allow the API calls to run asynchronously.</span></li>
<li><span class="nn">Ipywidgets (7.5.1)</span>&nbsp;to set the parameters.</li>
<li>json to process the JSON results.</li>
<li>time to display the runtime.</li></ol>
<h3>This script pulls missing player data from the game logs</h3>
<ul>
<li>It takes the missing season-player pairs, calls the NBA API to get all of the player game logs for that season, and saves them to a dataframe.</li>
</ul>
<p>This notebook is used to pull data for a single game. Parameters are set manually to fix one-off errors in the batch pulls.</p>

In [5]:
import random
import time
import json
import asyncio
import aiohttp
from ipywidgets import interact, interactive, fixed, interact_manual,Layout
import ipywidgets as widgets
import pandas as pd


<h2>Set up parameters and headers</h2>
<p>There are some parameters that need to be set. The mandatory ones are set using widgets.</p>
<p>The headers and parameter dictionaries are optional and can be left empty. For this example the script does not require either to be set; however, if the API call does require additional headers or query parameters, these dictionaries can be used to pass them into the API call.</p>
<p>If the headers/parameters are dynamic they should be set inside the actual loop, not here.</p>
<p>The time value, in seconds, sets how long the whole query is allowed to run. This prevents the script from spamming the server and getting banned.</p>
<p>For example, if time is set to 600, the queries are spread over roughly 10 minutes (600 s ÷ 60 s per minute). The script delays each request by a random amount drawn from a uniform distribution between 0 and the time value. For small numbers of requests this shouldn't matter, but for large ones it prevents sending too many concurrently.</p>
<p>The connections parameter sets the maximum number of concurrent connections. Setting this higher allows more queries to be completed at once, but setting it too high can cause issues. For large numbers of requests you can experiment with this value to see if it increases speed. The <strong>set_request_param</strong> function that reads these values is called in the main routine.</p>
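<p>To make the spacing concrete, here is a minimal sketch of how uniform random delays spread requests over the time window. The helper name <strong>schedule_delays</strong> and the request count of 1000 are hypothetical and used only for illustration; the notebook itself draws each delay inside the response functions.</p>

```python
import random

def schedule_delays(n_requests, wait_base, seed=None):
    """Return one start delay (in seconds) per request, drawn from U(0, wait_base).

    Hypothetical helper: the notebook draws the same kind of delay with
    random.uniform(0, wait_base) inside each response function.
    """
    rng = random.Random(seed)
    return sorted(rng.uniform(0, wait_base) for _ in range(n_requests))

# 1000 requests spread over a 600 second (10 minute) window
delays = schedule_delays(n_requests=1000, wait_base=600, seed=42)

print(f"first request after {delays[0]:.2f}s, last after {delays[-1]:.2f}s")
print(f"average rate: {len(delays) / 600:.2f} requests per second")
```

<p>Every delay falls inside the window, so the average request rate is simply the number of requests divided by the time value. Short bursts are still possible because the draws are random.</p>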

In [6]:
#Request headers and parameters are optional dictionaries.
req_headers = {
    'Host': 'stats.nba.com',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:71.0) Gecko/20100101 Firefox/71.0',
    'Accept': 'application/json, text/plain, */*',
    'Accept-Language': 'en-US,en;q=0.5',
    'Referer': 'https://stats.nba.com/',
    'Accept-Encoding': 'gzip, deflate, br',
    'Connection': 'keep-alive',
}
#Default query parameters; values in each item's 'params' dictionary override these.
req_params = {
    'EndPeriod': '10',
    'EndRange': '28800',
    'GameID': '0021400340',
    'RangeType': '0',
    'SeasonType': 'Regular Season',
    'StartPeriod': '1',
    'StartRange': '0',
}

#Widgets for the max number of connections, the total query time and the base URL
style = {'description_width': 'initial'}
x=widgets.IntText(value=50,description='Set the max number of connections', layout=Layout(width='30%'),style=style)
y=widgets.IntText(value=3,description='Time for the query to run in seconds', layout=Layout(width='30%'),style=style)
url_widget= widgets.Text(value='https://stats.nba.com/stats',description='The URL for the base API', layout=Layout(width='30%'),style=style)
display(x,y,url_widget) #show the selection widgets

def set_request_param(x,y,url_widget):
    '''Read the widget values and return (connector, wait time in seconds, base URL).'''
    conn= aiohttp.TCPConnector(limit=x.value) #caps the number of simultaneous connections
    wait_base=y.value
    base_url=url_widget.value
    
    return conn,wait_base,base_url

IntText(value=50, description='Set the max number of connections', layout=Layout(width='30%'), style=Descripti…

IntText(value=3, description='Time for the query to run in seconds', layout=Layout(width='30%'), style=Descrip…

Text(value='https://stats.nba.com/stats', description='The URL for the base API', layout=Layout(width='30%'), …
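
<p>The effect of a connection cap can be illustrated without any network access. The sketch below is a stand-in, not how aiohttp implements its limit: it uses an <strong>asyncio.Semaphore</strong> in place of <strong>TCPConnector(limit=...)</strong>, and the names <strong>fake_request</strong> and <strong>run_capped</strong> are hypothetical.</p>

```python
import asyncio

async def fake_request(sem, tracker):
    # The semaphore stands in for the connector's connection limit (assumption)
    async with sem:
        tracker["active"] += 1
        tracker["peak"] = max(tracker["peak"], tracker["active"])
        await asyncio.sleep(0.01)  # simulated network latency
        tracker["active"] -= 1

async def run_capped(n_requests, limit):
    """Run n_requests fake requests and return the highest concurrency observed."""
    sem = asyncio.Semaphore(limit)
    tracker = {"active": 0, "peak": 0}
    await asyncio.gather(*(fake_request(sem, tracker) for _ in range(n_requests)))
    return tracker["peak"]

peak = asyncio.run(run_capped(n_requests=20, limit=5))
print(f"peak concurrency: {peak}")
```

<p>Inside a notebook, where an event loop is already running, replace <strong>asyncio.run(...)</strong> with <strong>await run_capped(...)</strong>.</p>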

<h2>Request functions</h2>
<p>There are two functions that are set up before the request itself:</p>
<ol>
<li><strong>get_json</strong>. This function makes the API call.</li>
<li><strong>response</strong> (defined below as <strong>response_basic</strong> and <strong>response_adv</strong>). This function builds the URL and parameters for each API request and uses the wait parameter to space out the API calls.</li>
</ol>
<p>Note that since these are asynchronous functions, each definition needs to start with the keyword <strong>async</strong>. Also, the keyword <strong>await</strong> is required when calling another asynchronous function so that its result is available before the function returns.</p>
<p><strong>get_json</strong> takes four parameters:</p>
<ol>
<li>Client. This is the session inherited from the main routine.</li>
<li>Headers. This takes the request headers from the dictionary in the main body.</li>
<li>Params. This takes the parameters from the dictionary in the main body.</li>
<li>URL. The URL for the API call.</li>
</ol>
<p>The <strong>response</strong> functions take the same parameters as the <strong>get_json</strong> function plus two more: wait_base and item_dict. wait_base is the number of seconds allocated for all of the queries. It is used to draw a random number from a uniform distribution that determines when each query in the sequence is sent.</p>
<p>The <strong>item_dict</strong> contains a dictionary that includes the path_url and additional parameters (if applicable). They are merged into the web request before it is passed on to the <strong>get_json</strong> function.</p>
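<p>The merge step can be sketched on its own. The helper name <strong>build_request</strong> is hypothetical; the values are taken from this notebook. One difference to note: this sketch tolerates a missing 'params' key, whereas the response functions below expect it to be present.</p>

```python
def build_request(base_url, base_params, item_dict):
    """Combine the base URL and default parameters with one item's overrides."""
    url = base_url + item_dict.get("path_url", "")
    # Later keys win, so the item's parameters override the defaults
    params = {**base_params, **item_dict.get("params", {})}
    return url, params

base_params = {"GameID": "0021400340", "StartPeriod": "1", "EndPeriod": "10"}
item = {
    "path_url": "/boxscoreadvancedv2",
    "params": {"Season": "1999-00", "GameID": "0029900712"},
}

url, params = build_request("https://stats.nba.com/stats", base_params, item)
print(url)
print(params)
```

<p>Because the merge creates a new dictionary, the default parameters are left unchanged and can be reused for the next item.</p>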

In [7]:
async def get_json(client,req_headers,req_params,url):
    '''Make the API call and return the parsed JSON (None if the response cannot be parsed).'''
    async with client.get(url,params=req_params,headers=req_headers) as response:
        try:
            ret=await response.json()
            return ret
        except Exception as e:
            #report the failure; the function then returns None for this request
            print(f'The API call to {url} returned an error, {e}')

In [8]:
async def response_basic(wait_base,client,req_headers,req_params,url,item_dict):
    url=url+item_dict.get('path_url','') #add the path url if applicable
    new_param=item_dict['params'] #pull the parameters out of the item dictionary
    req_params ={**req_params , **new_param} #item parameters override the defaults
    
    #wait time between requests is pulled from a random uniform distribution
    wait_t=random.uniform(0,wait_base)
    #set a sleep between requests based on the value calculated above
    await asyncio.sleep(wait_t)
    response= await get_json(client,req_headers,req_params,url)
    return response

#response_adv currently has the same body as response_basic; it is kept separate for the advanced box score requests
async def response_adv(wait_base,client,req_headers,req_params,url,item_dict):
    url=url+item_dict.get('path_url','') #add the path url if applicable
    new_param=item_dict['params'] #pull the parameters out of the item dictionary
    req_params ={**req_params , **new_param} #item parameters override the defaults
    
    #wait time between requests is pulled from a random uniform distribution
    wait_t=random.uniform(0,wait_base)
    #set a sleep between requests based on the value calculated above
    await asyncio.sleep(wait_t)
    response= await get_json(client,req_headers,req_params,url)
    return response

<h2>Main Routine</h2>
<p>The main routine is relatively straightforward. Since this is also an asynchronous function it requires the&nbsp;<strong>async</strong> keyword.</p>
<ol>
<li>Call the function&nbsp;<strong>set_request_param</strong>, which returns a tuple containing the request parameters that were set up earlier.</li>
<li>Create a list of dictionaries that contains the URL path and parameters for each query. In this notebook the list holds a single entry: the advanced box score (<strong>/boxscoreadvancedv2</strong>) for game 0029900712 from the 1999-00 season. Keep in mind this is more to illustrate the concept. In practice you would only use asyncio when you need to make many requests.</li>
<li>Start an aiohttp client session. This is where the actual requests take place.</li>
<li>The&nbsp;<strong>json_data_adv</strong> list stores the tasks and starts them (via&nbsp;<strong>asyncio.create_task</strong>). Each task calls the&nbsp;<strong>response_adv</strong> function (which in turn calls the <strong>get_json</strong> function for the API call).&nbsp;In this case there will be one JSON response stored in the list.</li>
<li>The function runs and stores a list of the individual responses in the variable<strong> results_adv</strong>. As was the case with the&nbsp;<strong>get_json</strong> function, the keyword <strong>await</strong> is required to make sure the script waits for all of the responses to be returned.</li>
<li>The results of the query will be available for processing in a list called&nbsp;<strong>results_Adv</strong>.</li>
</ol>
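<p>The steps above can be sketched without network access. In this minimal example the functions <strong>fake_get_json</strong>, <strong>fake_response</strong> and <strong>demo_main</strong> are hypothetical stand-ins for the notebook's functions, and the items are placeholders rather than real API paths.</p>

```python
import asyncio
import random

async def fake_get_json(item):
    # Stand-in for get_json: no network call, it just echoes the path back
    await asyncio.sleep(0.01)
    return {"echo": item["path_url"]}

async def fake_response(wait_base, item):
    # Random delay before the call, as in the response functions
    await asyncio.sleep(random.uniform(0, wait_base))
    return await fake_get_json(item)

async def demo_main(items, wait_base=0.05):
    # create_task schedules every request right away
    tasks = [asyncio.create_task(fake_response(wait_base, item)) for item in items]
    # gather waits for all tasks and returns results in the order of the task list
    return await asyncio.gather(*tasks)

demo_items = [{"path_url": "/a"}, {"path_url": "/b"}, {"path_url": "/c"}]
demo_results = asyncio.run(demo_main(demo_items))
print(demo_results)
```

<p>Even though the random delays mean the requests finish in an unpredictable order, <strong>asyncio.gather</strong> returns the results in the same order as the tasks, so each result can be matched back to its item by position.</p>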

In [35]:
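def total_games(season):
    #season appears to be the starting year of the season, e.g. 1998 for 1998-99
    if season == 1998:
        total_g=725
    elif season ==2011:
        total_g=1025
    elif season <=1987:
        total_g=943
    elif season <=1994:
        total_g=1107
    #ASSUMPTION: the original repeated "season <=1994" here, which made this branch unreachable.
    #2003 is assumed because 1189 games is 29 teams x 82 games / 2, and the league had 29 teams through 2003-04.
    elif season <=2003:
        total_g=1189
    else:
        total_g=1230
    return total_g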

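items_basic=[] #no basic box score requests in this run

#one advanced box score request: the path URL plus the parameters that override req_params
items_adv=[{'path_url':'/boxscoreadvancedv2','params':{'Season':'1999-00','GameID':'0029900712'}}]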
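
<p>For more than a handful of games, the item dictionaries could be generated rather than typed. The sketch below rests on an assumption that is not stated anywhere in this notebook: that a regular season game ID is laid out as the prefix 002, the last two digits of the season's starting year, and a five-digit game number. Both IDs used in this notebook fit that layout. The helper names are hypothetical.</p>

```python
def season_label(start_year):
    """Format a starting year as a season string, e.g. 1999 -> '1999-00'."""
    return f"{start_year}-{(start_year + 1) % 100:02d}"

def regular_season_game_id(start_year, game_number):
    # ASSUMED layout: '002' + two-digit starting year + five-digit game number
    return f"002{start_year % 100:02d}{game_number:05d}"

def make_item(start_year, game_number, path_url="/boxscoreadvancedv2"):
    """Build one item dictionary in the shape used by items_adv."""
    return {
        "path_url": path_url,
        "params": {
            "Season": season_label(start_year),
            "GameID": regular_season_game_id(start_year, game_number),
        },
    }

item = make_item(1999, 712)
print(item)
```

<p>Check a generated ID against a known game before relying on it for a batch pull.</p>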

In [36]:
async def main():
    #get the connection and wait parameters based on the user's inputs
    conn,wait_base,base_url=set_request_param(x,y,url_widget)
    #wait_base is in seconds, so dividing by 60 twice reports it in hours
    print(f'Estimated time in hours: {wait_base/60/60}')
    start_time = time.time()
    
    async with aiohttp.ClientSession(connector=conn) as client: #create the client session object that persists across requests
        
        #create_task schedules one task per dictionary in items_adv (path URL plus parameters)
        json_data_adv=[asyncio.create_task(response_adv(wait_base,client,req_headers,req_params,base_url,item_dict)) for item_dict in items_adv]
            
        #The await...gather ensures all of the queries are complete before the function returns the list of JSONs back to the main program.
        #With return_exceptions=True a failed task appears in the list as an exception object instead of raising.
        results_adv = await asyncio.gather(*json_data_adv, return_exceptions=True)
        print(f'it took {round(time.time() - start_time,2)} seconds to go through: {len(items_basic)+len(items_adv)} items')
        return results_adv
    

In [38]:
#Call the main routine and return the results of the web queries.
#Top-level await works in a Jupyter notebook; in a plain script use asyncio.run(main()) instead.
if __name__ ==  '__main__':
    results_Adv =await main()

Estimated time in hours: 0.0008333333333333334
it took 6.02 seconds to go through: 1 items
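
<p>Because the main routine gathers with <strong>return_exceptions=True</strong>, and <strong>get_json</strong> returns None when a response cannot be parsed, the returned list can contain a mix of JSON results, exception objects and None. A minimal sketch of separating them, using the hypothetical helper <strong>split_results</strong> and simulated tasks instead of real requests:</p>

```python
import asyncio

async def ok(value):
    await asyncio.sleep(0)
    return value

async def boom():
    await asyncio.sleep(0)
    raise ValueError("simulated failure")

async def gather_demo():
    # The failing task is returned as an exception object instead of raising
    return await asyncio.gather(ok(1), boom(), ok(3), return_exceptions=True)

def split_results(results):
    """Return (good, bad) lists of (index, value) pairs."""
    good, bad = [], []
    for i, r in enumerate(results):
        if r is None or isinstance(r, BaseException):
            bad.append((i, r))   # failed task or unparsable response
        else:
            good.append((i, r))
    return good, bad

results = asyncio.run(gather_demo())
good, bad = split_results(results)
print(f"{len(good)} succeeded, {len(bad)} failed")
```

<p>The index in each pair matches the position in the item list, so the failed items can be collected and retried.</p>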


<h2>Results</h2>


In [26]:
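#json_normalize flattens the nested JSON results into a table.
#In pandas 1.0 and later it is available from the top-level package; pandas.io.json.json_normalize is deprecated.
from pandas import json_normalize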

In [23]:
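def create_df(dataframe,df_cols):
    #dataframe is a list of result-set dictionaries; each entry in 'rowSet' becomes one dataframe row
    #df_cols is not used here; the caller assigns the column names afterwards
    df=json_normalize(dataframe,record_path=['rowSet'],sep="_")
    return df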

In [24]:
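def adv_results(results_Adv,tableindex):
    '''Stack one table (selected by tableindex) from every response into a single dataframe.'''
    df_adv=[] #result-set dictionaries that were read successfully
    df_err=[] #(index, exception) pairs for responses that could not be read
    #column names come from the last response, so that response must be a valid JSON result
    df_cols=results_Adv[-1]['resultSets'][tableindex]['headers']
    for i,result in enumerate(results_Adv):
        try:
            df_adv.append(result['resultSets'][tableindex])
            
        except Exception as e:
            df_err.append((i,e))
    
    df_adv=create_df(df_adv,df_cols)
    df_adv.columns=df_cols
    return df_adv,df_err 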
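
<p>The response shape that <strong>adv_results</strong> relies on (a 'resultSets' list whose entries each carry 'headers' and 'rowSet') can be illustrated with plain Python. The response below is a hand-made toy with three columns, not a real API payload, and <strong>result_set_to_records</strong> is a hypothetical helper.</p>

```python
def result_set_to_records(response, table_index):
    """Pair each row of one result set with its column headers."""
    table = response["resultSets"][table_index]
    return [dict(zip(table["headers"], row)) for row in table["rowSet"]]

# Toy response mimicking the structure indexed by adv_results (assumed shape)
sample = {
    "resultSets": [
        {
            "headers": ["GAME_ID", "TEAM_ABBREVIATION", "PACE"],
            "rowSet": [
                ["0029900712", "UTA", 97.0],
                ["0029900712", "BOS", 104.86],
            ],
        }
    ]
}

records = result_set_to_records(sample, 0)
for record in records:
    print(record)
```

<p>A list of dictionaries like this can also be passed straight to <strong>pd.DataFrame</strong>, which is a simpler route when only one response is involved.</p>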

In [39]:
df_adv_PlayerStats,adv_PlayerStats_err=adv_results(results_Adv,0)
df_adv_TeamStats,adv_TeamStats_err=adv_results(results_Adv,1)

In [40]:
df_adv_TeamStats[df_adv_TeamStats.GAME_ID=='0029900712']

Unnamed: 0,GAME_ID,TEAM_ID,TEAM_NAME,TEAM_ABBREVIATION,TEAM_CITY,MIN,E_OFF_RATING,OFF_RATING,E_DEF_RATING,DEF_RATING,...,TM_TOV_PCT,EFG_PCT,TS_PCT,USG_PCT,E_USG_PCT,E_PACE,PACE,PACE_PER40,POSS,PIE
0,29900712,1610612762,Jazz,UTA,Utah,240:00,101.5,102.1,103.1,104.1,...,18.6,0.438,0.524,1.0,0.199,97.74,97.0,80.83,97,0.516
1,29900712,1610612738,Celtics,BOS,Boston,222:01,104.0,97.7,102.7,95.9,...,16.3,0.494,0.568,1.0,0.202,98.22,104.86,87.38,97,0.48


In [42]:
df_adv_PlayerStats.to_parquet('df_p.gzip',compression='gzip')
df_adv_TeamStats.to_parquet('df_t.gzip',compression='gzip')

<h2>Code to merge into the main dataframe</h2>

In [None]:
#df_p=pd.read_parquet('df_p.gzip')
#df_t=pd.read_parquet('df_t.gzip')
#df_adv_TeamStats[df_adv_TeamStats.GAME_ID=='0029900712']
#df_adv_PlayerStats[df_adv_PlayerStats.GAME_ID=='0029900712']
#df_adv_PlayerStats=pd.concat([df_adv_PlayerStats,df_p])
#df_adv_TeamStats=pd.concat([df_adv_TeamStats,df_t])