<h1>Asyncio web request module</h1>

<h2>Required modules</h2>
<p>This is a simple script that pulls information using the Asyncio library. It takes a list of URLs and returns the responses as a list of JSON objects in a Jupyter notebook.</p>
<p>It requires the following libraries:</p>
<ol>
<li><span class="nn">Asyncio and Aiohttp (3.6.2) to allow the API calls to run asynchronously.</span></li>
<li><span class="nn">Ipywidgets (7.5.1)</span>&nbsp;to set the parameters.</li>
<li>Json to process the JSON results.</li>
<li>Time to display the runtime.</li></ol>
<h3>This script specifically pulls missing player data from the game logs</h3>
<p>It takes the missing season-player pairs, calls the NBA API to get all of each player's game logs for that season, and saves them to a dataframe.</p>
    
**After the fact**
After comparing this data to the original dataset we confirmed that both queries return incomplete results for rebounds before 1985.
- We know Basketball Reference has the information. However, since scraping it won't add that much value, we will look at interpolating it in the other file.
- If we can't find a good way of interpolating the data then we will just ignore those records since they won't have an impact on the future machine learning exercises. It will, however, impact what we can visualize going back in time.
- We can be more confident our original query wasn't the issue since we tried something different and got the same results. Given the number of items we looked at, that indicates a likely issue with the source data.

In [68]:
import random
import time
import json
import asyncio
import aiohttp
from ipywidgets import interact, interactive, fixed, interact_manual,Layout
import ipywidgets as widgets
import pandas as pd


<h2>Set up parameters and headers</h2>
<p>There are some parameters that need to be set. The ones that are mandatory will be set using a widget.</p>
<p>The headers and parameter dictionaries are optional and can be left empty. If the API call requires additional headers or parameters, these dictionaries can be used to pass them into the API call.</p>
<p>If the headers/parameters are dynamic they should be set inside the actual loop, not here.</p>
<p>The time value, in seconds, sets the window over which the whole set of queries is sent. This prevents the script from spamming the server and getting banned.</p>
<p>For example if time is set to 600 the queries will take ~10 minutes (600s / 60s) to run. The script spaces each request randomly using a uniform distribution. For small numbers of requests this shouldn't matter but for large ones it prevents sending too many concurrently.</p>
<p>The connections parameter sets the max number of concurrent connections. Setting this higher allows more queries to be completed at once, but setting it too high can cause requests to fail. For large numbers of requests you can experiment with this value to see if it increases speed. The <strong>set_request_param</strong> function defined below reads these values and is called in the main routine.</p>
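<p>To get a feel for how the time value spreads the requests out, here is a minimal sketch that only samples the delays and sends nothing. The helper name <strong>sample_delays</strong> and the fixed seed are assumptions made for illustration. The 945 requests and the 600 second window are example values.</p>

```python
import random

def sample_delays(n_requests, wait_base, seed=0):
    """Draw one start delay per request from a uniform distribution on [0, wait_base]."""
    rng = random.Random(seed)  # fixed seed only so the sketch is repeatable
    return sorted(rng.uniform(0, wait_base) for _ in range(n_requests))

wait_base = 600  # seconds allocated to the whole batch
delays = sample_delays(n_requests=945, wait_base=wait_base)

# Every request starts somewhere inside the window, so the average rate is n / wait_base
print(f"first request at {delays[0]:.2f}s, last at {delays[-1]:.2f}s")
print(f"average rate: {len(delays) / wait_base:.2f} requests per second")
```

<p>Because the delays are random, short bursts are still possible. The connection limit is what caps how many requests are open at the same moment.</p>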

In [100]:
#Import the data and clean it for the player lookup:
#1. get a list of the seasons and players
#2. convert to a list of tuples
#3. convert to a set to remove duplicates
#4. create a list with a dictionary of the params for each pair
df=pd.read_csv('missing_data.csv')
initial_lookup=df[['SEASON','PLAYER_ID']].values.tolist()
lookup = list(set(tuple(l) for l in initial_lookup))
#the slicing drops characters 5 and 6 of the season string, so a value like '1982-1983' (assumed format) becomes '1982-83'
l1=[{'params':{'Season':param[0][:5]+param[0][7:],'PlayerID':param[1]}} for param in lookup]
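
<p>The lookup steps above can be tried without the CSV file. This is a small sketch under the assumption that the SEASON column holds strings such as '1982-1983'. The helper <strong>to_api_season</strong> and the sample pairs are made up for illustration.</p>

```python
def to_api_season(season):
    """Turn 'YYYY-YYYY' into 'YYYY-YY' (input format is an assumption)."""
    # Keep the first five characters ('1982-') and everything from index 7 ('83')
    return season[:5] + season[7:]

# Illustrative season/player pairs, including one duplicate
pairs = [("1982-1983", 76385), ("1982-1983", 76385), ("1980-1981", 77097)]

# A set removes the duplicate; sorting just makes the order predictable here
unique_pairs = sorted(set(pairs))

items = [{"params": {"Season": to_api_season(s), "PlayerID": p}} for s, p in unique_pairs]
print(items)
```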


In [112]:
#request headers and parameters are optional dictionaries.
req_headers = {
    'Host': 'stats.nba.com',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:71.0) Gecko/20100101 Firefox/71.0',
    'Accept': 'application/json, text/plain, */*',
    'Accept-Language': 'en-US,en;q=0.5',
    'Referer': 'https://stats.nba.com/',
    'Accept-Encoding': 'gzip, deflate, br',
    'Connection': 'keep-alive',
        }
req_params={'SeasonType': 'Regular Season'}


#Widgets that set the max number of connections, the time for the query to run, and the base URL
style = {'description_width': 'initial'}
x=widgets.IntText(value=50,description='Set the max number of connections', layout=Layout(width='30%'),style=style)
y=widgets.IntText(value=3,description='Time for the query to run in seconds', layout=Layout(width='30%'),style=style)
url_widget= widgets.Text(value='https://stats.nba.com/stats/playergamelog?DateFrom=&DateTo=&LeagueID=00',description='The URL for the base API', layout=Layout(width='30%'),style=style)
display(x,y,url_widget) #show the selection widgets

def set_request_param(x,y,url_widget):
    conn= aiohttp.TCPConnector(limit=x.value)
    wait_base=y.value
    base_url=url_widget.value
    
    return(conn,wait_base,base_url)

IntText(value=50, description='Set the max number of connections', layout=Layout(width='30%'), style=Descripti…

IntText(value=3, description='Time for the query to run in seconds', layout=Layout(width='30%'), style=Descrip…

Text(value='https://stats.nba.com/stats/playergamelog?DateFrom=&DateTo=&LeagueID=00', description='The URL for…

<h2>Request functions</h2>
<p>There are two functions that are set up before the request itself:</p>
<ol>
<li><strong>get_json</strong>. This function makes the API call.</li>
<li><strong>response</strong>. This is the function that determines the parameters for the API request. By default it only serves to use the wait parameter to space out the API calls.&nbsp;</li>
</ol>
<p>Note that since these are asynchronous functions, each definition needs to start with the keyword <strong>async</strong>. Also, the keyword <strong>await</strong> is required whenever a function waits on the result of another asynchronous call.</p>
<p><strong>get_json</strong> takes four parameters:</p>
<ol>
<li>Client. This is the session inherited from the main routine.</li>
<li>Headers. This takes the headers from the dictionary in the main body.&nbsp;</li>
<li>Params. This takes the parameters from the dictionary in the main body.</li>
<li>URL. The URL for the API call.</li>
</ol>
<p><strong>response</strong> takes the same parameters as the get_json function plus two more: <strong>wait_base</strong> and <strong>item_dict</strong>. wait_base is the number of seconds we have allocated for all of the queries. It is used to pull a random number from the uniform distribution, which determines when to send out each query.</p>
<p>The <strong>item_dict</strong> contains a dictionary that includes the path_url and additional parameters (if applicable). They are merged into the web request before it is passed on to the <strong>get_json</strong> function.</p>
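<p>The flow described above can be sketched without touching the network. Here <strong>fake_get_json</strong> is a hypothetical stand-in for get_json that simply echoes its inputs, and <strong>paced_request</strong> mirrors what response does: merge the parameters, sleep for a random delay, then make the call. The example URL is a placeholder.</p>

```python
import asyncio
import random

async def fake_get_json(url, params):
    # Hypothetical stand-in for get_json: no HTTP request, just echo the inputs
    await asyncio.sleep(0)
    return {"url": url, "params": params}

async def paced_request(wait_base, base_params, url, item_dict):
    # Append the optional path and merge the per-item params over the shared ones
    url = url + item_dict.get("path_url", "")
    params = {**base_params, **item_dict.get("params", {})}
    # Wait a random amount of time drawn uniformly from [0, wait_base]
    await asyncio.sleep(random.uniform(0, wait_base))
    return await fake_get_json(url, params)

item = {"params": {"Season": "1982-83", "PlayerID": 76385}}
# asyncio.run is for a plain script; inside a notebook cell use "await paced_request(...)" instead
demo = asyncio.run(
    paced_request(0.01, {"SeasonType": "Regular Season"}, "https://example.com/api", item)
)
print(demo)
```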

In [113]:
#Function is used to pull the request data for a single API call
#It returns None if the response cannot be parsed as JSON
async def get_json(client,req_headers,req_params,url):
    async with client.get(url,params=req_params,headers=req_headers) as response:
        try:
            ret=await response.json()
            return ret
        except Exception as e:
            print(f'The API call to {url} returned an error, {e}')

In [114]:
async def response(wait_base,client,req_headers,req_params,url,item_dict):
    try:
        url=url+item_dict['path_url'] #add the path url if applicable
    except KeyError: #no path url was supplied for this item
        pass
    new_param=item_dict['params'] #pull the parameters out of the item dictionary
    req_params ={**req_params , **new_param}
    
    #wait time between requests is pulled from a random uniform distribution
    wait_t=random.uniform(0,wait_base)
    #set a sleep between requests based on the value calculated above
    await asyncio.sleep(wait_t)
    response= await get_json(client,req_headers,req_params,url)
    return response

<h2>Main Routine</h2>
<p>The main routine is relatively straightforward. Since this is also an asynchronous function it requires the&nbsp;<strong>async</strong> keyword.</p>
<ol>
<li>Call the function&nbsp;<strong>set_request_param</strong>, which returns a tuple containing the request parameters that were set up earlier.</li>
<li>Create a list of dictionaries that contains the URL paths and parameters for each query. In this notebook that is the list <strong>l1</strong> of season and player parameters built earlier. Keep in mind that in practice you would only use Asyncio when you need to make many requests.</li>
<li>Third, an AIOHTTP client session is started. This is where the actual requests take place.</li>
<li>The&nbsp;<strong>json_data</strong> list stores a list of the tasks and&nbsp;starts the loop (via&nbsp;<strong>asyncio.create_task</strong>) that calls the&nbsp;<strong>response</strong> function (which in turn calls the <strong>get_json</strong> function for the API calls).&nbsp;There will be one response stored in the list for each item.</li>
<li>The function runs and stores a list of the individual responses in the variable<strong> results</strong>. As was the case with the&nbsp;<strong>get_json</strong> function the keyword <strong>await</strong> is required to make sure the script waits for all of the responses to be returned.</li>
<li>The list is returned to the caller, where it is available for processing in a variable called&nbsp;<strong>result</strong>.</li>
</ol>
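<p>The task and gather pattern from the steps above can be tried in isolation. In this sketch <strong>fetch</strong> is a hypothetical coroutine standing in for one API request, and one call is made to fail on purpose. With return_exceptions=True the exception comes back as an item in the results list instead of being raised, so it helps to separate the two kinds of items before processing.</p>

```python
import asyncio

async def fetch(i):
    # Hypothetical coroutine standing in for a single API request
    await asyncio.sleep(0.01)
    if i == 2:
        raise ValueError(f"request {i} failed")
    return {"id": i}

async def run_all(n):
    # create_task schedules every coroutine right away so they run concurrently
    tasks = [asyncio.create_task(fetch(i)) for i in range(n)]
    # gather waits for all tasks and keeps the results in the same order as the tasks
    return await asyncio.gather(*tasks, return_exceptions=True)

# asyncio.run is for a plain script; inside a notebook cell use "await run_all(4)" instead
results = asyncio.run(run_all(4))

ok = [r for r in results if not isinstance(r, Exception)]
failed = [r for r in results if isinstance(r, Exception)]
print(f"{len(ok)} succeeded, {len(failed)} failed")
```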
<p>&nbsp;</p>

In [118]:
async def main():
    #get the connection and wait parameters based on the user's inputs
    conn,wait_base,base_url=set_request_param(x,y,url_widget)
    #This is a list of dictionaries to loop through that includes any path URLs or parameters
    items=l1
    start_time = time.time()
    
    async with aiohttp.ClientSession(connector=conn) as client: #create the client session object that persists across requests
        
        #create_task schedules each request so that all of the async requests run concurrently
        json_data=[asyncio.create_task(response(wait_base,client,req_headers,req_params,base_url,item_dict)) for item_dict in items]
            
        #The await...gather ensures all of the queries are complete before the function returns the list of JSONs back to the main program
        results = await asyncio.gather(*json_data, return_exceptions=True)
        print(f'it took {round(time.time() - start_time,2)} seconds to go through: {len(items)} items')
        return results
    

In [119]:
#Call the main routine and return the results of the web queries
if __name__ ==  '__main__':
    result =await main()

it took 901.42 seconds to go through: 945 items


<h2>Missing Player Logs</h2>
<p>The missing player logs are stored in a list called result. Use the resultSets to get the headers and values for the dataframe:</p>
<ol>
<li>Flatten the JSON and create a dataframe with the missing player data.</li>
<li>Save the data to a file so we can merge it back and correct for the missing records in the first pull.</li>
</ol>
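<p>As a rough illustration of the flattening step, the sketch below pairs each row in a rowSet with its headers using plain Python. The function name <strong>result_sets_to_records</strong> and the two-column sample are assumptions made for the example. It also skips any item that is not a dictionary, since a failed call can leave None or an exception in the list.</p>

```python
def result_sets_to_records(results):
    """Flatten the first result set of each response into a list of row dictionaries."""
    records = []
    for res in results:
        # Skip failed calls (None from a JSON error, or an exception returned by gather)
        if not isinstance(res, dict):
            continue
        result_set = res["resultSets"][0]
        headers = result_set["headers"]
        for row in result_set["rowSet"]:
            records.append(dict(zip(headers, row)))
    return records

# Illustrative sample with only two columns; real responses carry many more
sample = [
    {"resultSets": [{"headers": ["Player_ID", "PTS"], "rowSet": [[76385, 10], [76385, 12]]}]},
    None,
    {"resultSets": [{"headers": ["Player_ID", "PTS"], "rowSet": [[77097, 17]]}]},
]

records = result_sets_to_records(sample)
print(records)
```

<p>A list of dictionaries like this can be passed straight to pd.DataFrame, which uses the keys as column names.</p>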

In [211]:
#Get the headers
df_col=result[0]['resultSets'][0]['headers']
#Create a list for the dataframe by pulling the first result set out of each response
player_data= [r['resultSets'][0] for r in result]

In [253]:
from pandas.io.json import json_normalize #in pandas 1.0 and later this is also available as pd.json_normalize
#Normalize the results to flatten them
df_player=pd.DataFrame(json_normalize(player_data,record_path=['rowSet'],sep="_"))
#Need to add the column headers after creating the dataframe. Not clear why. Assigning headers when creating the dataframe results in nan values
df_player.columns=df_col

In [263]:
import os
savedir=os.getcwd()
#os.path.join builds the path with the right separator for the current operating system
df_player.to_parquet(os.path.join(savedir,'df_missing_players.gzip'),compression='gzip')

In [260]:
df_player

Unnamed: 0,SEASON_ID,Player_ID,Game_ID,GAME_DATE,MATCHUP,WL,MIN,FGM,FGA,FG_PCT,...,DREB,REB,AST,STL,BLK,TOV,PF,PTS,PLUS_MINUS,VIDEO_AVAILABLE
0,21982,76385,0028200942,"APR 17, 1983",PHL @ BOS,L,26,3,8.0,0.375,...,3.0,4.0,7.0,1.0,1.0,0.0,0.0,10,,0
1,21982,76385,0028200930,"APR 15, 1983",PHL @ NJN,W,26,5,7.0,0.714,...,,3.0,4.0,,0.0,,1.0,12,,0
2,21982,76385,0028200909,"APR 13, 1983",PHL vs. WAS,L,23,6,10.0,0.600,...,1.0,2.0,2.0,2.0,0.0,1.0,0.0,12,,0
3,21982,76385,0028200907,"APR 12, 1983",PHL @ ATL,L,31,3,7.0,0.429,...,1.0,2.0,6.0,1.0,0.0,2.0,2.0,10,,0
4,21982,76385,0028200895,"APR 10, 1983",PHL vs. NYK,W,39,6,7.0,0.857,...,2.0,2.0,9.0,6.0,1.0,2.0,1.0,14,,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
59250,21980,77097,0028000045,"OCT 18, 1980",DEN @ UTH,L,40,6,16.0,,...,7.0,12.0,2.0,0.0,0.0,1.0,4.0,17,,0
59251,21980,77097,0028000039,"OCT 17, 1980",DEN @ SDC,W,39,12,16.0,,...,8.0,9.0,3.0,1.0,1.0,0.0,2.0,30,,0
59252,21980,77097,0028000029,"OCT 15, 1980",DEN vs. DAL,W,29,9,19.0,,...,5.0,7.0,1.0,5.0,0.0,0.0,2.0,25,,0
59253,21980,77097,0028000018,"OCT 12, 1980",DEN vs. UTH,L,46,10,19.0,,...,4.0,6.0,3.0,3.0,2.0,2.0,6.0,29,,0
