# Game Recommender System
---

### Problem Statement: 

Far more games are on the market today than a few decades ago. As a result, both hardcore PC gamers and newcomers to gaming often struggle to choose a game worth investing their time or money in.

My objective is to build a game recommender system that suggests several games based on user input, such as a favorite game or a recently played game the user found interesting. This can help players who are looking for something new, as well as gamers who have just been introduced to PC gaming, by giving them a short list of games to choose from.
This can also assist company like 
The games in the recommender system are strictly limited to those offered on [Steam](https://store.steampowered.com/). Steam is a well-known gaming platform by Valve Corporation with a large game catalog and player base.


### Table of Contents:

1. [Acquiring Entire App Library on Steam](#1.-Acquiring-Entire-App-Library-On-Steam)
2. [Acquiring App Detail](#2.-Acquiring-App-Detail)
3. [Filtering Apps that are Game](#3.-Filtering-categories%2C-genres-and-description-from-App-type-that-are-"Games")
4. [Acquiring Game Reviews](#4.-Acquiring-Game-Reviews)
---

### Import Libraries

In [2]:
import requests
import json
import pandas as pd
import time
import re

# 1. Acquiring Entire App Library On Steam
---

I will obtain every app in the Steam library. This includes games, soundtracks, DLCs, and other products that Steam offers.

In [None]:
urls = 'https://api.steampowered.com/ISteamApps/GetAppList/v2/'
ress = requests.get(urls)
games = json.loads(ress.content)['applist']['apps']

After acquiring the Steam apps, I turn them into a dataframe that holds the unique game_id and the title of each app.

In [None]:
top_df = pd.DataFrame(games)
top_df.columns = ['game_id', 'title']
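Assigning `top_df.columns` by position works as long as the API returns its keys in the same order. A minimal sketch of a rename by key instead, assuming each entry in `applist.apps` carries the keys `appid` and `name` (the sample response below is made up from two rows of my data, not fetched live):

```python
import pandas as pd

# Hypothetical sample shaped like the GetAppList response (assumed keys: appid, name).
sample_response = {
    "applist": {
        "apps": [
            {"appid": 939780, "name": "Cat Demon Island"},
            {"appid": 939790, "name": "Royal Alchemist"},
        ]
    }
}

apps = pd.DataFrame(sample_response["applist"]["apps"])
# Rename by key so the result does not depend on the column order of the response.
apps = apps.rename(columns={"appid": "game_id", "name": "title"})
print(apps)
```

Renaming by key fails loudly (the columns simply keep their old names) if the response shape ever changes, instead of silently swapping the two columns.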

Steam updates its library so often that the app list changes every time I request it. Therefore, after acquiring the library once, I saved it to CSV so that later updates do not keep changing my dataframe.

In [4]:
top_df.to_csv('./datas/top_df.csv')

In [3]:
top_df = pd.read_csv('./datas/top_df.csv')
top_df.drop('Unnamed: 0', axis=1, inplace=True)
top_df.head()

Unnamed: 0,game_id,title
0,216938,Pieterw test app76 ( 216938 )
1,660010,test2
2,660130,test3
3,939780,Cat Demon Island
4,939790,Royal Alchemist
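The `Unnamed: 0` column appears because `to_csv` writes the dataframe index by default. A small sketch of an alternative round trip, assuming the index carries no information worth keeping; the in-memory buffer is only a stand-in for `./datas/top_df.csv`:

```python
import io
import pandas as pd

# Toy stand-in for the app list, using two rows from the output above.
apps = pd.DataFrame(
    {"game_id": [939780, 939790], "title": ["Cat Demon Island", "Royal Alchemist"]}
)

buffer = io.StringIO()            # stand-in for the CSV file on disk
apps.to_csv(buffer, index=False)  # index=False keeps the row index out of the file
buffer.seek(0)

reloaded = pd.read_csv(buffer)
print(list(reloaded.columns))
```

With `index=False` the reloaded frame has only the `game_id` and `title` columns, so no extra drop step is needed.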


I have a total of 74,060 apps; keep in mind that not all of these apps are games.

In [5]:
len(top_df)

74060

# 2. Acquiring App Detail
--- 

I will call the Steam API to obtain the app details for each app, using the game_id acquired from the app list above. This returns the detailed information for each app, from which we can determine whether the app is categorized as a "game" or not. I ran this process on several AWS instances; each batch of 5,000 apps took approximately 3 hours. With 74,060 apps, that adds up to around 44 hours in total to get all the app details.

In [102]:
top_game_details = []
start_time = time.time()
for i in top_df['game_id']:
    try:
        game = {}
        url = 'http://store.steampowered.com/api/appdetails?appids='+str(i)
        res = requests.get(url)
        game['data'] = json.loads(res.content)
        top_game_details.append(game)
        time.sleep(2)  # pause between requests to avoid being blocked
    except Exception:
        # print the game_id that failed so it can be checked manually later
        print(i)
print(f'Time Spent: {round(((time.time()-start_time)/60)/60)} Hours')

Time Spent: 3 Hours
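As a sanity check on the total, the arithmetic behind the 44-hour figure can be written out using the numbers quoted above. The second calculation is my own rough lower bound: it counts only the 2-second sleep per request and ignores request latency.

```python
# Figures quoted in the text above.
total_apps = 74_060
apps_per_batch = 5_000
hours_per_batch = 3

# Total time if every batch of 5,000 apps takes about 3 hours.
estimated_hours = total_apps / apps_per_batch * hours_per_batch
print(f"Estimated total scraping time: {estimated_hours:.1f} hours")

# Time spent only in time.sleep(2) for one batch of 5,000 apps.
sleep_only_hours = apps_per_batch * 2 / 3600
print(f"Sleep time per batch: {sleep_only_hours:.1f} hours")
```

Most of the 3 hours per batch is therefore the deliberate delay rather than the requests themselves.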


This check shows me what was missing or dropped during the API scraping process. If a game_id shows up, I go back and manually enter it into the web API to see whether it actually exists or whether I need to adjust the delay time, because requesting the API too quickly can get the requests blocked or return no information. The result shows 0 missing, which is a good sign.

In [103]:
missing = 0
missing_list = []  # positions in top_game_details that hold no usable response
for position, game in enumerate(top_game_details):
    try:
        appid = [*game['data'].keys()][0]
    except Exception:
        missing += 1
        missing_list.append(position)

In [104]:
missing

0
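Had any game_id come back missing, adjusting the delay by hand could be replaced by an automatic retry. This is only a sketch, not what I ran: `fetch_with_retry` and its parameters are hypothetical names, and the fetcher is passed in as a callable so the example runs without network access.

```python
import time


def fetch_with_retry(fetch, app_id, retries=3, base_delay=2.0, sleep=time.sleep):
    """Call fetch(app_id); after a failure wait base_delay * 2**attempt and retry.

    Returns None when every attempt fails, so the caller can record the ID as missing.
    """
    for attempt in range(retries):
        try:
            return fetch(app_id)
        except Exception:
            if attempt < retries - 1:
                sleep(base_delay * 2 ** attempt)
    return None


# Fake fetcher that fails twice and then succeeds (simulates temporary blocking).
calls = {"count": 0}


def flaky_fetch(app_id):
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("simulated throttling")
    return {str(app_id): {"success": True}}


waits = []  # record the delays instead of actually sleeping
result = fetch_with_retry(flaky_fetch, 939780, sleep=waits.append)
print(result, waits)
```

In real use, `fetch` would wrap the `requests.get` call from the loop above and `sleep` would stay at its default.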

# 3. Filtering categories, genres and description from App type that are "Games"
---

After acquiring all the app details, we move on to filtering for apps that are games. If an app is a game, I keep its name along with its categories (features), genres (features), and detailed description. Because the features have little variety, I will use the detailed description as a feature as well; each game has its own unique description.

In [105]:
game_keys = {}
for game in top_game_details:
    try:
        appid = [*game['data'].keys()][0]
        if game['data'][appid]['data']['type'] == 'game':
            game_keys[appid] = {'name': game['data'][appid]['data']['name'], 'categories':[], 'genres':[], 'detailed_description': game['data'][appid]['data']['about_the_game']}
            if 'genres' in game['data'][appid]['data'].keys():
                game_keys[appid]['genres'].extend(x['description'] for x in game['data'][appid]['data']['genres'])
            if 'categories' in game['data'][appid]['data'].keys():
                game_keys[appid]['categories'].extend(x['description'] for x in game['data'][appid]['data']['categories'])
    except:
        pass
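The same filtering logic can be sketched as a small function that uses `.get` instead of a blanket `try`/`except`, which makes it easier to see why an app was skipped. The payloads below are made up and only mirror the fields the loop above reads; the `success` flag is an assumption about the response shape, and `extract_game` is a hypothetical helper name.

```python
def extract_game(payload):
    """Return (appid, record) if the payload describes a game, otherwise None."""
    appid, body = next(iter(payload.items()))
    data = body.get("data") or {}
    if data.get("type") != "game":
        return None
    record = {
        "name": data.get("name", ""),
        "categories": [c["description"] for c in data.get("categories", [])],
        "genres": [g["description"] for g in data.get("genres", [])],
        "detailed_description": data.get("about_the_game", ""),
    }
    return appid, record


# Made-up payloads: a game, a DLC, and a response with no details.
game_payload = {"7000": {"success": True, "data": {
    "type": "game",
    "name": "Example Game",
    "about_the_game": "An example description.",
    "genres": [{"description": "Action"}, {"description": "Adventure"}],
    "categories": [{"description": "Single-player"}],
}}}
dlc_payload = {"7001": {"success": True, "data": {"type": "dlc", "name": "Example DLC"}}}
failed_payload = {"7002": {"success": False}}

for payload in (game_payload, dlc_payload, failed_payload):
    print(extract_game(payload))
```

Only the first payload produces a record; the DLC and the failed response both return `None`.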

I will separate the game features (categories and genres) from the game descriptions, because I will perform different EDA and data cleaning processes on these two dataframes.

I created an empty dictionary that combines the categories and genres of each game, keyed by the unique game_id, so it can be turned into a dataframe.

In [106]:
game_features = {}

for game_id, game_dict in game_keys.items():
    combined_list = game_dict['categories'] + game_dict['genres']
    zipped_list = zip(combined_list, [1]*len(combined_list))
    game_features[game_id] = dict(zipped_list)

I created an empty dictionary that maps each unique game_id to its detailed description, so it can be turned into a dataframe.

In [107]:
game_context = {}

for game_id, game_dict in game_keys.items():
    
    game_context[game_id] = game_dict['detailed_description'] 

I transform game_features from a dictionary into a dataframe and transpose it so that the unique game_id is the index and the features are the columns.

In [108]:
game_features = pd.DataFrame(game_features).T
game_features.head()

Unnamed: 0,Action,Adventure,Animation & Modeling,Audio Production,Captions available,Casual,Co-op,Commentary available,Cross-Platform Multiplayer,Design & Illustration,...,Steam Turn Notifications,Steam Workshop,SteamVR Collectibles,Strategy,Utilities,VR Support,Valve Anti-Cheat enabled,Video Production,Violent,Web Publishing
6910,1.0,,,,,,,,,,...,,,,,,,,,,
6920,1.0,,,,,,,,,,...,,,,,,,,,,
6980,1.0,,,,,,,,,,...,,,,,,,,,,
7000,1.0,1.0,,,,,,,,,...,,,,,,,,,,
7010,1.0,,,,,,,,,,...,,,,,,,,,,
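To make the shape of this table concrete, here is the same dictionary-to-dataframe step on two toy games (the labels are illustrative). The `fillna(0)` at the end is an optional extra that is not part of the step above: it turns the missing values into an explicit 0/1 matrix.

```python
import pandas as pd

# Toy stand-in for game_keys with only the fields used to build the features.
game_keys_demo = {
    "7000": {"categories": ["Single-player"], "genres": ["Action", "Adventure"]},
    "6910": {"categories": ["Single-player"], "genres": ["Action"]},
}

# One inner dict per game: every category or genre the game has maps to 1.
features_demo = {
    game_id: {label: 1 for label in info["categories"] + info["genres"]}
    for game_id, info in game_keys_demo.items()
}

# Outer keys become columns, so transpose to get one row per game_id.
matrix = pd.DataFrame(features_demo).T
# Labels a game lacks come out as NaN; fill them with 0 for a binary matrix.
matrix = matrix.fillna(0).astype(int)
print(matrix)
```

Game `6910` has no Adventure genre, so its Adventure cell is NaN before the fill and 0 after it.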


As mentioned above, I obtained the games from the API on several different AWS instances to increase productivity, so I will import all the games I obtained and merge them together.

In [109]:
game_feat = pd.read_csv('./datas/game_features_1.csv', index_col='Unnamed: 0')

In [111]:
game_feat.head()

Unnamed: 0,Accounting,Action,Adventure,Animation & Modeling,Audio Production,Captions available,Casual,Co-op,Commentary available,Cross-Platform Multiplayer,...,Экшены,Включает редактор уровней,Имеется античит Valve,Контроллер (полностью),Мастерская Steam,Покупки внутри приложения,Mods,Mods (require HL2),Контроллер (частично),Приключенческие игры
939780,,1.0,1.0,,,,,,,,...,,,,,,,,,,
939790,,,,,,,,,,,...,,,,,,,,,,
939920,,,,1.0,,,,,,,...,,,,,,,,,,
939930,,,,,,,,,,,...,,,,,,,,,,
940130,,,,,,,,,,,...,,,,,,,,,,


I will concatenate the games from the other AWS instances together, using their unique game_id as the index.

In [112]:
game_feature = pd.concat([game_feat,game_features], sort=False)


After scraping the API for the entire 74,060-app library, I end up with 31,578 games, along with 86 unique features mentioned across them.

In [113]:
game_feature.shape

(31578, 86)
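When batches from different instances are concatenated, the result carries the union of their feature columns, and a game that was scraped on two instances would appear twice. A toy sketch of both effects, with made-up batches, plus a duplicate check that is my own addition:

```python
import pandas as pd

# Two made-up batches with partly different feature columns.
batch_a = pd.DataFrame({"Action": [1.0], "Adventure": [1.0]}, index=[939780])
batch_b = pd.DataFrame({"Action": [1.0], "Strategy": [1.0]}, index=[6910])

# Rows are stacked; columns missing from a batch are filled with NaN.
combined = pd.concat([batch_a, batch_b], sort=False)
print(combined)

# Count game_ids that occur more than once after the merge.
duplicate_count = int(combined.index.duplicated().sum())
print("duplicate game_ids:", duplicate_count)
```

A non-zero duplicate count would mean the batches overlap and the row count overstates the number of games.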

Saving my final game_features dataframe to csv to move to the next procedure.

In [114]:
game_feature.to_csv('./datas/game_features_1.csv')

Similarly to game_features, I will transform the description of each game into a dataframe.

In [115]:
game_description = pd.DataFrame(game_context, index=['game_description']).T
game_description.head()


Unnamed: 0,game_description
6910,The year is 2052 and the world is a dangerous ...
6920,Approximately 20 years after the events depict...
6980,"You are Garrett, the master thief. Rarely seen..."
7000,Follow Lara Croft down a path of discovery as ...
7010,<p>Experience the dramatic intensity of the fr...


Having acquired all the data from the different instances, I load it and concatenate the game descriptions, indexed by the unique game_id of each game.

In [116]:
game_des = pd.read_csv('./datas/game_description_1.csv', index_col='Unnamed: 0')
game_descriptions = pd.concat([game_des,game_description], sort=False)

My game_features and game_descriptions should have the same index length, meaning they should contain the same number of games. In this case that holds: both have 31,578 games.

In [122]:
game_descriptions.shape

(31578, 1)
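Equal row counts are a good sign, but they do not prove that both frames hold the same games. A stricter check, sketched here on toy frames rather than my actual data, compares the game_id sets directly:

```python
import pandas as pd

# Toy stand-ins for the features and descriptions frames.
features = pd.DataFrame({"Action": [1.0, 1.0]}, index=[6910, 7000])
descriptions = pd.DataFrame({"game_description": ["first", "second"]}, index=[7000, 6910])

# Compare sorted indexes so row order does not matter.
same_ids = features.index.sort_values().equals(descriptions.index.sort_values())
# IDs present in one frame but not the other (empty when the frames match).
only_in_features = features.index.difference(descriptions.index)
only_in_descriptions = descriptions.index.difference(features.index)
print(same_ids, list(only_in_features), list(only_in_descriptions))
```

If `same_ids` is false, the two difference lists show exactly which game_ids to investigate.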

In [118]:
game_descriptions.to_csv('./datas/game_description_1.csv')

# 4. Acquiring Game Reviews
---

To make my recommender perform better, I will include more than the game features: I will also obtain the review counts for each game. If a game does not come back with review data, I will drop it from the game_features and game_description data.
Scraping all the game reviews takes approximately 12 hours.

In [125]:
game_reviews = {}
start_time = time.time()
for i in game_descriptions.index:
    try:
        game = {}
        url_reviews = 'http://store.steampowered.com/appreviews/'+str(i)+'?json=1'
        res_reviews = requests.get(url_reviews)
        game = json.loads(res_reviews.content)['query_summary']
        game_reviews[i] = {'total_reviews': game['total_reviews'], 'positive_reviews': game['total_positive']}
        time.sleep(1)
    except:
        print(i)
print(f'Time Spent: {round(((time.time()-start_time)/60)/60)} hours')

906900
997050
961440
35450
1006940
Time Spent: 12 hours
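The loop above treats any exception as "no reviews". A sketch of a parser that separates a usable summary from a response without one is shown below; `summarize_reviews` is a hypothetical helper, and the sample responses are made up to match the two fields I read (`total_reviews` and `total_positive`).

```python
def summarize_reviews(response_json):
    """Return the review counts, or None when the response has no usable summary."""
    summary = (response_json or {}).get("query_summary")
    if not summary or "total_reviews" not in summary:
        return None
    return {
        "total_reviews": summary["total_reviews"],
        "positive_reviews": summary.get("total_positive", 0),
    }


# A response with a summary, and one without (shape of the second is assumed).
with_summary = summarize_reviews(
    {"query_summary": {"total_reviews": 2, "total_positive": 1}}
)
without_summary = summarize_reviews({"success": 2})
print(with_summary, without_summary)
```

A game with zero reviews still yields a record with `total_reviews` equal to 0, while a response without `query_summary` yields `None`, so the two cases stay distinguishable.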


After requesting reviews for all the games I got from the API scraping, a few games did not return any review data. The games that did not come back with a result from the review API are:

- 906900
- 997050
- 961440
- 35450
- 1006940

I went ahead and checked each of these game_ids. Either the game is so unpopular that no one has written a review for it, or Steam does not have the review system enabled for that game yet. Since I only lose 5 games, I will match the result from game_reviews against my features and description CSVs.

In [126]:
len(game_reviews)

31573
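Matching the other tables to the games that have review data can be done with an index intersection. A minimal sketch on toy frames (the feature values are made up; `35450` stands in for one of the IDs that returned no reviews):

```python
import pandas as pd

# Toy features frame with three games; reviews came back for only two of them.
features = pd.DataFrame({"Action": [1.0, 1.0, 1.0]}, index=[939780, 35450, 939920])
reviews = pd.DataFrame({"total_reviews": [0, 2]}, index=[939780, 939920])

# Keep only the games present in both frames.
common_ids = features.index.intersection(reviews.index)
features_kept = features.loc[common_ids]

# Games that are dropped because no review data exists for them.
dropped_ids = features.index.difference(reviews.index)
print(sorted(features_kept.index), list(dropped_ids))
```

The same two lines would apply to the description frame, so all three tables end up with an identical set of game_ids.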

In [127]:
game_review = pd.DataFrame(game_reviews).T

I set the index name to game_id, the unique ID each game has, and move on to the next step, EDA.

In [128]:
game_review.index.name = 'game_id'

In [129]:
game_review.head()

Unnamed: 0_level_0,positive_reviews,total_reviews
game_id,Unnamed: 1_level_1,Unnamed: 2_level_1
939780,0,0
939790,0,0
939920,1,2
939930,0,3
940130,1,1


In [130]:
game_review.to_csv('./datas/game_review_1.csv')