### Get the list of all games with their ID numbers and output a file at `/data/game_id.csv`
As of 11/8/2019, there are 345727 games. More information about the API can be found at https://rawg.io/apidocs, and its endpoints are listed at https://api.rawg.io/docs/

In [1]:
import json
import requests
from pprint import pprint
import os
import csv
from time import time
import concurrent.futures
import functools
import math

## Multithreading
This function requests pages of games (20 games per page) and saves each page as a JSON file in `/data/game_id/`. As of 11/8/2019, there are 17272 pages.

In [2]:
def worker(start_index, pages_per_worker, urls, downloaded_files, headers):
    # Each worker handles one contiguous slice of the URL list
    chunk = urls[start_index : start_index + pages_per_worker]
    for url in chunk:
        # The page number is the value of the trailing "page" query parameter
        page_no = url.rsplit("page=", 1)[-1]
        # Skip pages already saved by a previous run
        if page_no in downloaded_files:
            continue
        try:
            # Request API
            response = requests.get(url, headers=headers)
            response.raise_for_status()
            json_data = response.json()

            # Get wanted data
            D = {game["id"]: game["slug"] for game in json_data["results"]}

            # Save data
            with open(f"../data/game_id/{page_no}.json", "w", encoding="utf8") as f:
                json.dump(D, f)
        except Exception as e:
            print(f"Error with {url}: {e}")
    # Verbose notification (urls[i] holds page i + 1)
    print(f"Done from page {start_index + 1} to {start_index + len(chunk)}")
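The extraction step inside the worker can be tried offline. The following is a minimal sketch, assuming a trimmed-down page payload: `sample_page` and the helper `extract_id_slug` are hypothetical, and only the `count`, `results`, `id`, and `slug` keys are taken from the code above (the `name` field is an assumption). It also shows that integer ids become string keys once the mapping is written as JSON.

```python
import json

# Hypothetical, trimmed-down page payload; real responses carry more fields.
sample_page = {
    "count": 3,
    "results": [
        {"id": 1, "slug": "game-one", "name": "Game One"},
        {"id": 2, "slug": "game-two", "name": "Game Two"},
    ],
}


def extract_id_slug(page):
    """Map each game's id to its slug, as the worker does for one page."""
    return {game["id"]: game["slug"] for game in page["results"]}


id_to_slug = extract_id_slug(sample_page)

# JSON object keys are always strings, so integer ids turn into strings
# after a dump/load round trip (the same happens when saving to disk).
round_tripped = json.loads(json.dumps(id_to_slug))

print(id_to_slug)
print(round_tripped)
```

Because of the round trip, the ids read back from the page files later are strings rather than integers.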

In [3]:
# Create the output folder if it does not exist
os.makedirs('../data/game_id/', exist_ok=True)

The following code uses concurrent programming to speed up the download. A pool of worker threads (`max_workers`, set to 32 in the code below) runs at the same time, and each worker makes its own requests. Run time dropped from about 4 hours to about 40 minutes for 17272 pages.
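The pattern can be illustrated without touching the network. This is a minimal sketch, assuming a stand-in function `fake_worker` that sleeps instead of making a request; the page list, chunk size, and worker count are hypothetical values chosen to keep the example fast.

```python
import concurrent.futures
import functools
import time


def fake_worker(start_index, pages_per_worker, pages):
    """Stand-in for the real worker: 'fetches' one chunk of pages."""
    chunk = pages[start_index : start_index + pages_per_worker]
    for _ in chunk:
        time.sleep(0.01)  # simulated network latency, not a real request
    return chunk


pages = list(range(1, 41))  # 40 hypothetical page numbers
pages_per_worker = 10
start_indices = range(0, len(pages), pages_per_worker)

# Fix the shared arguments so that map() only supplies start_index
task = functools.partial(fake_worker, pages_per_worker=pages_per_worker, pages=pages)

t0 = time.perf_counter()
with concurrent.futures.ThreadPoolExecutor(max_workers=4) as executor:
    # map() yields results in input order; list() waits for every chunk
    results = list(executor.map(task, start_indices))
elapsed = time.perf_counter() - t0

fetched = [page for chunk in results for page in chunk]
print(f"Fetched {len(fetched)} pages in {elapsed:.2f}s")
```

Threads suit this workload because each worker spends most of its time waiting on I/O, so the waits overlap instead of adding up.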

In [7]:
# Make the first request to get the total number of pages (20 games per page)
headers = {'User-Agent': 'App Name: Education purpose'}
json_data = requests.get("https://api.rawg.io/api/games", headers=headers).json()
no_of_pages = math.ceil(json_data["count"] / 20)

# Set up number of workers
max_workers = 32
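# Round up so the number of chunks never exceeds max_workers and the step is never 0
pages_per_worker = max(1, math.ceil(no_of_pages / max_workers))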
start_index = range(0, no_of_pages, pages_per_worker)

# Build one URL per page (pages are numbered from 1)
base_url = "https://api.rawg.io/api/games?page="
urls = [base_url + str(i) for i in range(1, no_of_pages + 1)]
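Before starting a long download, it can help to check that the chunks cover every page exactly once. The sketch below is one way to do that, assuming a hypothetical helper `chunk_starts`. It uses ceiling division, a choice made for this sketch, so that the chunk count never exceeds the worker count and the range step is never zero.

```python
import math


def chunk_starts(no_of_pages, max_workers):
    """Return (pages_per_worker, start indices) covering every page once."""
    pages_per_worker = max(1, math.ceil(no_of_pages / max_workers))
    return pages_per_worker, list(range(0, no_of_pages, pages_per_worker))


no_of_pages = 17272  # page count quoted in the text
max_workers = 32
pages_per_worker, starts = chunk_starts(no_of_pages, max_workers)

# Rebuild the page numbers each chunk would cover (URLs are 1-based)
covered = []
for start in starts:
    end = min(start + pages_per_worker, no_of_pages)
    covered.extend(range(start + 1, end + 1))

print(pages_per_worker, len(starts))
```

With plain floor division, 17272 pages over 32 workers gives 539 pages per chunk and 33 chunks; the executor still processes all of them, but one chunk has to wait for a free thread.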

In [8]:
# Collect the page numbers that were already downloaded so they can be skipped
downloaded_files = {file.split(".",1)[0] for file in os.listdir("../data/game_id/")}

# Time
t0=time()
with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
    temp = functools.partial(worker,
                             pages_per_worker=pages_per_worker,
                             urls=urls,
                             downloaded_files=downloaded_files,
                             headers=headers,
                            )
    # Consume the results so any exception raised in a worker is surfaced here
    list(executor.map(temp, start_index))
print(f"Time taken: {time()-t0}")

Time taken: 0.9984233379364014


Load each JSON file in `/data/game_id/` and write its contents to a single CSV file saved at `/data/game_id.csv`.

In [9]:
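with open("../data/game_id.csv", "w", newline="", encoding="utf8") as f:
    csv_file = csv.writer(f, lineterminator="\n")
    for file in os.listdir("../data/game_id/"):
        try:
            with open(f"../data/game_id/{file}", "r", encoding="utf8") as page_file:
                json_data = json.load(page_file)
        except (OSError, ValueError):
            # Report the unreadable file and skip it instead of reusing stale data
            print(file)
            continue
        # Each page file maps a game id to its slug
        for game_id, slug in json_data.items():
            csv_file.writerow([game_id, slug])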
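`os.listdir` returns files in arbitrary order, so the rows in the CSV do not follow page order. If page order matters, the files can be sorted by page number first. The following is a minimal sketch, assuming a hypothetical helper `merge_pages_to_csv` and file names of the form `<page_no>.json`; it runs against a temporary directory with made-up page contents rather than the real data folder.

```python
import csv
import json
import os
import tempfile


def merge_pages_to_csv(json_dir, csv_path):
    """Write one (id, slug) row per game, reading page files in page order."""
    files = [name for name in os.listdir(json_dir) if name.endswith(".json")]
    # Sort numerically so that 10.json comes after 2.json
    files.sort(key=lambda name: int(name.split(".", 1)[0]))
    rows = 0
    with open(csv_path, "w", newline="", encoding="utf8") as out:
        writer = csv.writer(out, lineterminator="\n")
        for name in files:
            with open(os.path.join(json_dir, name), encoding="utf8") as f:
                page = json.load(f)
            for game_id, slug in page.items():
                writer.writerow([game_id, slug])
                rows += 1
    return rows


with tempfile.TemporaryDirectory() as tmp:
    json_dir = os.path.join(tmp, "game_id")
    os.makedirs(json_dir)
    # Hypothetical page files, written out of order on purpose
    sample_pages = [(10, {"3": "game-c"}), (2, {"1": "game-a", "2": "game-b"})]
    for page_no, content in sample_pages:
        with open(os.path.join(json_dir, f"{page_no}.json"), "w", encoding="utf8") as f:
            json.dump(content, f)

    csv_path = os.path.join(tmp, "game_id.csv")
    row_count = merge_pages_to_csv(json_dir, csv_path)
    with open(csv_path, encoding="utf8") as f:
        csv_text = f.read()

print(row_count)
print(csv_text)
```

The sketch assumes every `.json` file name starts with an integer; a file named otherwise would raise a `ValueError` in the sort key.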