# Homework 3 - Cleaning and Prepping Data

For this homework, I chose to obtain data by interacting with the API for the Internet Game Database (IGDB.com). The data available on IGDB is a compilation of many different resources, including video game rating agencies like ESRB and PEGI, as well as the promotional material that publishers release for their games. Because the data available on each game in IGDB is incredibly varied, the sources from which the information is obtained are necessarily varied as well. For example, the database includes the time it takes to complete a game, both at a normal pace and as a record-breaking speedrunner.

To interact with the IGDB API, I used a number of resources that they provided:

* [IGDB API Documentation](https://igdb.github.io/api/)
* [IGDB Main Website](https://api.igdb.com)
* [IGDB Python API Wrapper Repository](https://github.com/igdb/igdb_api_python)

This data is meaningful to me in a very direct way because my primary major is in Game Design, so the opportunity to incorporate that into my assignment was quite nice, even though it did mean wrestling with a lot of material I wasn't familiar with.

For the assignment, I will be looking at a quasi-randomly selected group of games for a particular console. Specifically, I will look at the frequency of the different ESRB ratings among my data sample, and will also try to determine whether there is any relation between ESRB rating and the user ratings that the game receives.

## Extra Credit

For this assignment, I will be **tackling pagination** and **incorporating visualizations using** `matplotlib`. These were the obvious choices: any request to the IGDB API is automatically paginated, and visualizations are a natural way to answer my questions.

In [55]:
# To do this import, I first had to install the API wrapper by running:
# `pip install igdb_api_python`
from igdb_api_python.igdb import igdb
import requests

# igdb(api_key) creates the client object that sends requests to the API.
# Note that the assignment below rebinds the name `igdb` from the class to that instance.
api_key = "750c9d13c29e3ee77695e1cfebae2c62"
igdb = igdb(api_key)

# this is my attempt to find the id # for PS4 games just by cycling through
# a bunch of platforms entries
platforms = igdb.platforms({
    'fields': ['name', 'id'],
    'scroll': 1,
    'limit': 25
})

for platform in platforms.body:
    print(platform['name'] +" has id " +str(platform['id']))

p = igdb.scroll(platforms)

for platform in p.body:
    print(platform['name'] +" has id " +str(platform['id']))


platforms/?fields=name,id&limit=25&scroll=1
Pokémon mini has id 166
PlayStation 3 has id 9
Xbox has id 11
Amstrad CPC has id 25
Sega 32X has id 30
Virtual Console (Nintendo) has id 47
3DO Interactive Multiplayer has id 50
Sega CD has id 78
Family Computer (FAMICOM) has id 99
PLATO has id 110
OnLive Game System has id 113
PC-8801 has id 125
Dragon 32/64 has id 153
ColecoVision has id 68
Atari Lynx has id 61
Windows Phone has id 74
Xbox Live Arcade has id 36
Atari ST/STE has id 63
Neo Geo AES has id 80
WonderSwan Color has id 123
Acorn Electron has id 134
SteamVR has id 163
Thomson MO5 has id 156
Nintendo eShop has id 160
Atari 5200 has id 66
Family Computer Disk System has id 51
Tapwave Zodiac has id 44
MSX2 has id 53
Neo Geo Pocket has id 119
SDS Sigma 7 has id 106
Microcomputer has id 112
Commodore PET has id 90
Donner Model 30 has id 85
Fairchild Channel F has id 127
SwanCrystal has id 124
1292 Advanced Programmable Video System has id 139
AY-3-8710 has id 144
Commodore CDTV has id 

In [58]:
#I then discovered you can search for a particular entry
ps4_id = igdb.platforms({
    'search': "PlayStation 4",
    'fields': ['id','name']
})

print(ps4_id.body)

platforms/?search=PlayStation 4&fields=id,name
[{'id': 48, 'name': 'PlayStation 4'}, {'id': 7, 'name': 'PlayStation'}, {'id': 9, 'name': 'PlayStation 3'}, {'id': 8, 'name': 'PlayStation 2'}, {'id': 165, 'name': 'PlayStation VR'}, {'id': 131, 'name': 'Nintendo PlayStation'}, {'id': 38, 'name': 'PlayStation Portable'}, {'id': 45, 'name': 'PlayStation Network'}, {'id': 46, 'name': 'PlayStation Vita'}]
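
The search returns every platform whose name resembles the query, so the ID still has to be picked out of the list. Here is a minimal sketch of that step, using the response body printed above as literal data. The helper name `find_platform_id` is my own and is not part of the API wrapper.

```python
# Search results copied from the output above.
search_results = [
    {'id': 48, 'name': 'PlayStation 4'},
    {'id': 7, 'name': 'PlayStation'},
    {'id': 9, 'name': 'PlayStation 3'},
    {'id': 8, 'name': 'PlayStation 2'},
    {'id': 165, 'name': 'PlayStation VR'},
    {'id': 131, 'name': 'Nintendo PlayStation'},
    {'id': 38, 'name': 'PlayStation Portable'},
    {'id': 45, 'name': 'PlayStation Network'},
    {'id': 46, 'name': 'PlayStation Vita'},
]


def find_platform_id(platforms, name):
    """Return the id of the platform whose name matches exactly, or None."""
    for platform in platforms:
        if platform['name'] == name:
            return platform['id']
    return None


ps4_platform_id = find_platform_id(search_results, 'PlayStation 4')
print(ps4_platform_id)
```

Matching on the exact name avoids relying on the order in which the search happens to return its results.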


In [95]:
#this is the request I'm ultimately using to get my info. 
result = igdb.games({
    'filters':{
        "[platforms][eq]":48,
    },
    'fields': ['name','esrb.rating','total_rating','time_to_beat'],
    'scroll': 1,
    'limit': 50,
    'order': 'name:desc'
})



games/?fields=name,esrb.rating,total_rating,time_to_beat&filter[platforms][eq]=48&order=name:desc&limit=50&scroll=1


## Looping to get more entries
As noted in the [pagination section of the API documentation](https://igdb.github.io/api/references/pagination/), if you use a `scroll` parameter in your request, the API returns `X-Count` and `X-Next-Page` as headers on the response object. Since the API wrapper handles the request for the next page through its `scroll()` function, we are only interested in `X-Count`, which tells us how many results in total match our request, so we can loop safely to get all our entries.
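
The loop described here could look roughly like the sketch below. It is written against generic objects rather than the wrapper itself: `collect_all_pages` is a hypothetical helper, and I am assuming that each response exposes `.headers` and `.body` and that the function passed as `fetch_next_page` (in this notebook, `igdb.scroll`) accepts the previous response, as in `result = igdb.scroll(result)`.

```python
import math


def collect_all_pages(first_page, fetch_next_page, page_size):
    """Gather the entries from every page of a scrolled request.

    first_page      -- response with `.headers` (dict) and `.body` (list)
    fetch_next_page -- callable that takes the previous response and
                       returns the next one (assumed: `igdb.scroll`)
    page_size       -- the `limit` used in the original request
    """
    total = int(first_page.headers['X-Count'])  # header values are strings
    pages_needed = math.ceil(total / page_size)

    entries = list(first_page.body)
    page = first_page
    for _ in range(pages_needed - 1):  # the first page is already in hand
        page = fetch_next_page(page)
        entries.extend(page.body)
    return entries
```

With an `X-Count` of 3030 and a `limit` of 50, this works out to 61 pages, so 60 further calls after the initial request. In the notebook the call would be something like `collect_all_pages(result, igdb.scroll, 50)`.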

In [96]:
#when given the parameters scroll = 1, the API returns an X-
print(result.headers)

{'Content-Type': 'application/json', 'Date': 'Tue, 09 Oct 2018 03:14:55 GMT', 'Server': 'openresty/1.9.15.1', 'X-Count': '3030', 'X-Next-Page': '/games/scroll/DXF1ZXJ5QW5kRmV0Y2gBAAAAAAAoq3kWRVdqWW9YQnJUYVNBNDNYV0FWNlItUQ==/?fields=name,esrb.rating,total_rating,time_to_beat', 'X-Powered-By': '3scale API Management - http://www.3scale.net', 'transfer-encoding': 'chunked', 'Connection': 'keep-alive'}


Before dealing with pagination, I first wanted to parse the data from the first 50 entries into a readable structure. Interestingly, the response body is a list of dicts, with one entry stored at each index. This is demonstrated below:

In [97]:
import json
import pandas as pd

for i in range(5):
    print(result.body[i])

{'id': 81958, 'name': '永遠消失的幻想鄉 ～ The Disappearing of Gensokyo', 'total_rating': 80.0}
{'id': 20744, 'name': 'Ōkami HD', 'total_rating': 86.62726547927755, 'esrb': {'rating': 5}}
{'id': 23636, 'name': 'theHunter', 'total_rating': 40.0}
{'id': 6465, 'name': 'iO', 'total_rating': 66.5}
{'id': 27277, 'name': 'forma.8', 'total_rating': 77.94368919630995, 'esrb': {'rating': 4}}


To form a DataFrame in pandas, my strategy here was to create a DataFrame from the dictionary at each index of `result.body` and then append it to a 'master' DataFrame. Admittedly this is probably pretty inefficient and will only get more inefficient once I add pagination, but after trying a few janky solutions this seems to be the one that works.
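
One possible alternative, sketched here rather than used in the notebook, is to flatten each entry into a plain row first and build the DataFrame once at the end. The helper `flatten_entry` and the column name `esrb_rating` are my own choices, and I am assuming that a missing field should simply become `None`.

```python
# Top-level fields requested from the API, copied as-is into each row.
FIELDS = ('id', 'name', 'total_rating', 'time_to_beat')


def flatten_entry(entry):
    """Turn one API entry into a flat dict with the same keys every time."""
    row = {field: entry.get(field) for field in FIELDS}
    # 'esrb' arrives as a nested dict such as {'rating': 5}, or not at all.
    esrb = entry.get('esrb')
    row['esrb_rating'] = esrb.get('rating') if isinstance(esrb, dict) else None
    return row


# Three entries copied from the output above.
sample_entries = [
    {'id': 81958, 'name': '永遠消失的幻想鄉 ～ The Disappearing of Gensokyo', 'total_rating': 80.0},
    {'id': 20744, 'name': 'Ōkami HD', 'total_rating': 86.62726547927755, 'esrb': {'rating': 5}},
    {'id': 23636, 'name': 'theHunter', 'total_rating': 40.0},
]

rows = [flatten_entry(entry) for entry in sample_entries]
for row in rows:
    print(row)
```

Passing the finished list to `pd.DataFrame(rows)` then builds the whole table in a single call instead of growing it one row at a time. This also sidesteps `DataFrame.append`, which was removed in pandas 2.0, so the loop in the next cell only runs on older pandas versions.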

In [164]:
igdb_df = pd.DataFrame()

for i in range(len(result.body)):
    try:
        tmp_df = pd.DataFrame(result.body[i], index=[i])
        igdb_df = igdb_df.append(tmp_df, sort=True, ignore_index=True)
    except AttributeError:
        tmp_df = pd.DataFrame.from_dict(result.body[i])
        igdb_df = igdb_df.append(tmp_df, sort=True, ignore_index=True)
    

In [165]:
igdb_df

| | esrb | id | name | time_to_beat | total_rating |
|---|---|---|---|---|---|
| 0 | | 81958 | 永遠消失的幻想鄉 ～ The Disappearing of Gensokyo | | 80.0 |
| 1 | 5.0 | 20744 | Ōkami HD | | 86.627265 |
| 2 | | 23636 | theHunter | | 40.0 |
| 3 | | 6465 | iO | | 66.5 |
| 4 | 4.0 | 27277 | forma.8 | | 77.943689 |
| 5 | 3.0 | 1353 | flOw | | 74.696166 |
| 6 | | 19008 | ecotone | | 72.5 |
| 7 | | 95399 | duplicate Zanki Zero: Last Beginning | | |
| 8 | | 52737 | duplicate Rocksmith 2014 Edition | | 90.0 |
| 9 | | 55162 | duplicate Pillars of the Earth | | 72.375 |


Clearly, the data is inconsistent: many entries have no ESRB rating, and `time_to_beat` is empty in every row shown, so that field is of no use here.
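
As a rough sketch of how the cleaning could feed the two questions from the introduction, the snippet below drops incomplete rows and then counts ESRB ratings and averages `total_rating` per rating. It uses only the ten rows shown in the table above, copied in as literal values, so the numbers say nothing about the full sample; the variable names are my own.

```python
from collections import Counter, defaultdict
from statistics import mean

# (esrb, total_rating) pairs copied from the ten rows shown above;
# None marks a missing value.
table_rows = [
    (None, 80.0), (5, 86.627265), (None, 40.0), (None, 66.5),
    (4, 77.943689), (3, 74.696166), (None, 72.5), (None, None),
    (None, 90.0), (None, 72.375),
]

# Keep only the rows where both values are present.
complete = [(esrb, rating) for esrb, rating in table_rows
            if esrb is not None and rating is not None]

# Question 1: how often does each ESRB rating occur?
esrb_counts = Counter(esrb for esrb, _ in complete)

# Question 2: what is the average user rating for each ESRB rating?
ratings_by_esrb = defaultdict(list)
for esrb, rating in complete:
    ratings_by_esrb[esrb].append(rating)
mean_rating_by_esrb = {esrb: mean(values)
                       for esrb, values in ratings_by_esrb.items()}

print(esrb_counts)
print(mean_rating_by_esrb)
```

Only three of the ten rows survive the filter, which suggests that missing ESRB ratings will cut the usable sample considerably.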

In [None]:
result=igdb.scroll(result)