# PC Game Recommendations - a Content Based and Collaborative Recommender for Steam

## Overview

## Business Understanding

Steam is the most popular digital distribution service and store for PC games. While physical discs were once the norm for PC game sales, digital downloads have become the standard. Steam offers a platform for users to manage their library of games, with access to services such as cloud saves and community features. In 2021, Steam averaged around 69 million daily active players. 

Recommender systems matter to online stores because they surface the products that are most likely to interest consumers and be purchased. Steam has various recommender systems in its store. There are content-based systems (based on what the user has played) as well as collaborative systems (based on what similar users or friends of the user have played). 

My goal with this project was to leverage data from Steam in order to build recommender models. 

## Data Understanding

I used two different datasets to build my recommendation models. The content based recommender uses a dataset from Kaggle, while the collaborative recommender uses data I collected from Steam's Web API.

### Steam Store Games  - Kaggle Dataset

I used a dataset from Kaggle containing information on video game categories and genres. The dataset can be found [here](https://www.kaggle.com/datasets/nikdavis/steam-store-games?select=steamspy_tag_data.csv).

The author documented their process of data collection via Steam Web API calls and SteamSpy API calls. This [link](https://nik-davis.github.io/posts/2019/steam-data-collection/) goes into how the author collected and cleaned the dataset.

The dataset can be found in this repository in the data folder within the Steam_store_data subfolder.

In [3]:
#import statements
import numpy as np
import pandas as pd
import json
import requests

from sklearn.compose import ColumnTransformer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import OneHotEncoder, MultiLabelBinarizer

In [4]:
steam_df = pd.read_csv('./data/Steam_store_data/steam.csv')

In [5]:
steam_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 27075 entries, 0 to 27074
Data columns (total 18 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   appid             27075 non-null  int64  
 1   name              27075 non-null  object 
 2   release_date      27075 non-null  object 
 3   english           27075 non-null  int64  
 4   developer         27075 non-null  object 
 5   publisher         27075 non-null  object 
 6   platforms         27075 non-null  object 
 7   required_age      27075 non-null  int64  
 8   categories        27075 non-null  object 
 9   genres            27075 non-null  object 
 10  steamspy_tags     27075 non-null  object 
 11  achievements      27075 non-null  int64  
 12  positive_ratings  27075 non-null  int64  
 13  negative_ratings  27075 non-null  int64  
 14  average_playtime  27075 non-null  int64  
 15  median_playtime   27075 non-null  int64  
 16  owners            27075 non-null  object

Here are brief descriptions for each column:
* appid : Unique identifier for each game on Steam
* name : title of app(game)
* release_date : release date in YYYY-MM-DD format
* english : 1 if game is in English, otherwise 0
* developer : name(s) of developer(s), delimited by semicolon if multiple devs
* publisher : name(s) of publisher(s), delimited by semicolon if multiple publishers
* platforms : supported platforms (includes Windows, Mac, and Linux), delimited by semicolon
* required_age : minimum required age based on PEGI UK ratings, 0 denotes unrated or unsupplied
* categories : game categories, delimited by semicolon
* genres: game genres, delimited by semicolon
* steamspy_tags: community voted tags, delimited by semicolon
* achievements : number of in-game achievements
* positive_ratings : number of positive ratings (from SteamSpy)
* negative_ratings : number of negative ratings (from SteamSpy)
* average_playtime : average user playtime in minutes (from SteamSpy)
* median_playtime : median user playtime in minutes (from SteamSpy)
* owners: estimated number of owners given as a range
* price : full price of title in GBP

There was some overlap between categories, genres, and steamspy_tags. For example, 'Action' showed up in both genre and as a tag while 'Multi-player' showed up with different spelling in categories and tags.
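
One way to quantify this overlap is to split each semicolon-delimited column into a set of labels and intersect the sets. The sketch below is only an illustration: `toy_df` is a hypothetical hand-made sample that reuses the column names of `steam_df`, and `unique_labels` is a helper name I made up for it.

```python
import pandas as pd

# Hypothetical three-row sample mimicking the steam.csv column layout
toy_df = pd.DataFrame({
    'categories': ['Multi-player;Online Multi-Player', 'Single-player', 'Single-player;Co-op'],
    'genres': ['Action', 'Action;Indie', 'Strategy'],
    'steamspy_tags': ['Action;FPS;Multiplayer', 'Action;Indie;Side Scroller', 'Strategy;RTS;Co-op'],
})

def unique_labels(series):
    '''Return the set of distinct labels in a semicolon-delimited column.'''
    return set(series.str.split(';').explode())

genre_set = unique_labels(toy_df['genres'])
tag_set = unique_labels(toy_df['steamspy_tags'])
category_set = unique_labels(toy_df['categories'])

# Labels that appear verbatim in more than one column
genre_tag_overlap = genre_set & tag_set
category_tag_overlap = category_set & tag_set
print(sorted(genre_tag_overlap))
print(sorted(category_tag_overlap))
```

Exact matching like this misses spelling variants such as 'Multi-player' and 'Multiplayer', so those would need to be normalized separately before they count as the same label.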

### Exploring the Data

In [6]:
steam_df.shape

(27075, 18)

There are 27,075 rows - each representing a game or app in Steam's store - and 18 columns.

Let's go through each column one-by-one:

In [8]:
#unique id for each game
steam_df['appid']

0             10
1             20
2             30
3             40
4             50
          ...   
27070    1065230
27071    1065570
27072    1065650
27073    1066700
27074    1069460
Name: appid, Length: 27075, dtype: int64

In [26]:
steam_df['appid'].value_counts()

397310    1
272270    1
710480    1
640850    1
431960    1
         ..
875710    1
563810    1
244930    1
298180    1
655360    1
Name: appid, Length: 27075, dtype: int64

'appid' is a unique identifier of a game in the dataset. No two games will have the same 'appid'.

In [9]:
#name of app/game
steam_df['name']

0                    Counter-Strike
1             Team Fortress Classic
2                     Day of Defeat
3                Deathmatch Classic
4         Half-Life: Opposing Force
                    ...            
27070               Room of Pandora
27071                     Cyber Gun
27072              Super Star Blast
27073    New Yankee 7: Deer Hunters
27074                     Rune Lord
Name: name, Length: 27075, dtype: object

In [10]:
steam_df['name'].value_counts()

Dark Matter                 3
Beyond the Wall             2
Escape Room                 2
Bounce                      2
Surge                       2
                           ..
Rocket Riot                 1
Gravity Wars: Black Hole    1
BellyBots                   1
3SwitcheD                   1
Snake Party                 1
Name: name, Length: 27033, dtype: int64

'name' is not a unique identifier for games. There are a few games with the same name. 

In [13]:
steam_df[steam_df['name'] == 'Dark Matter']

Unnamed: 0,appid,name,release_date,english,developer,publisher,platforms,required_age,categories,genres,steamspy_tags,achievements,positive_ratings,negative_ratings,average_playtime,median_playtime,owners,price
1975,251410,Dark Matter,2013-10-17,1,InterWave Studios,Iceberg Interactive,windows;mac;linux,0,Single-player;Steam Achievements;Full controll...,Action;Indie,Action;Indie;Side Scroller,17,78,107,0,0,0-20000,6.99
4673,345130,Dark Matter,2015-02-27,1,Meridian4,Meridian4,windows,0,Single-player;Partial Controller Support,Action;Casual;Indie,Action;Casual;Indie,0,75,50,0,0,100000-200000,3.99
21986,850250,Dark Matter,2018-05-04,1,Barty Games,Barty Games,windows,0,Single-player;Steam Achievements,Action;Indie,Action;Indie,3,3,3,0,0,0-20000,0.79


In [12]:
steam_df[steam_df['name'] == 'Dark Matter'].index

Int64Index([1975, 4673, 21986], dtype='int64')
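
Because names can collide, a lookup keyed on 'name' may return several rows, while 'appid' always resolves to one. The sketch below rebuilds the three 'Dark Matter' rows from the output above (a subset of columns only) to show the difference; it is an illustration, not part of the original pipeline.

```python
import pandas as pd

# Rows copied from the 'Dark Matter' output above (subset of columns)
dark_matter = pd.DataFrame(
    {'appid': [251410, 345130, 850250],
     'name': ['Dark Matter'] * 3,
     'developer': ['InterWave Studios', 'Meridian4', 'Barty Games']},
    index=[1975, 4673, 21986],
)

# A name lookup is ambiguous: it matches three rows
by_name = dark_matter[dark_matter['name'] == 'Dark Matter']
print(len(by_name))

# An appid lookup resolves to a single row
by_appid = dark_matter[dark_matter['appid'] == 345130]
print(by_appid['developer'].iloc[0])
```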

The dataset included some downloadable content ('DLC') listed as separate products. Since DLC is typically an add-on to the original game, the similarity between a DLC and its base game is likely very high. However, the recommender should not include DLC. Steam's platform already shows users what DLC is available for games in their library. Additionally, DLC is not usable unless you own the base game. 

In [57]:
pack = steam_df[steam_df['name'].str.contains('Pack')]
pack

Unnamed: 0,appid,name,release_date,english,developer,publisher,platforms,required_age,categories,genres,steamspy_tags,achievements,positive_ratings,negative_ratings,average_playtime,median_playtime,owners,price
53,2330,QUAKE II Mission Pack: The Reckoning,2007-08-03,1,Xatrix Entertainment,id Software,windows,0,Single-player;Multi-player;Steam Cloud,Action,Action;FPS;Classic,0,65,12,32,32,200000-500000,2.49
54,2340,QUAKE II Mission Pack: Ground Zero,2007-08-03,1,Rogue Entertainment,id Software,windows,0,Single-player;Multi-player;Steam Cloud,Action,Action;FPS;Shooter,0,54,33,1,1,200000-500000,2.49
226,9030,QUAKE Mission Pack 2: Dissolution of Eternity,2007-08-03,1,Rogue Entertainment,id Software,windows,0,Single-player;Multi-player;Steam Cloud,Action,Action;FPS;Shooter,0,143,15,88,91,200000-500000,2.49
227,9040,QUAKE Mission Pack 1: Scourge of Armagon,2007-08-03,1,Ritual Entertainment,id Software,windows,0,Single-player;Multi-player;Steam Cloud,Action,Action;FPS;Shooter,0,205,13,129,152,200000-500000,2.49
406,17440,SPORE™ Creepy & Cute Parts Pack,2008-12-19,1,Maxis™,Electronic Arts,windows,0,Single-player,Simulation,Simulation;Adventure;Open World,0,481,100,171,240,500000-1000000,9.99
599,32470,STAR WARS™ Empire at War - Gold Pack,2010-05-25,1,Petroglyph,LucasArts;Lucasfilm;Disney Interactive,windows,0,Single-player;Multi-player;Online Multi-Player...,Strategy,Strategy;Star Wars;RTS,0,6526,396,395,219,1000000-2000000,15.49
676,36130,Tradewinds Caravans + Odyssey Pack,2009-07-17,1,Sandlot Games,Sandlot Games,windows,0,Single-player,Casual,Casual,0,4,1,0,0,0-20000,10.99
704,37400,"Time Gentlemen, Please! and Ben There, Dan Tha...",2009-08-25,1,Size Five Games,Size Five Games,windows,0,Single-player,Adventure;Indie,Point & Click;Adventure;Indie,0,494,118,10,10,200000-500000,2.99
709,37960,Jewel Quest Pack,2009-08-24,1,iWin,iWin,windows,0,Single-player,Casual,Casual;Puzzle;Match 3,0,19,3,0,0,0-20000,14.99
1299,207400,eXceed 3rd - Jade Penetrate Black Package,2012-08-02,1,Tennen-sozai,Nyu Media,windows,0,Single-player;Steam Trading Cards,Action,Bullet Hell;Anime;Shoot 'Em Up,0,489,41,127,141,50000-100000,4.79


In [51]:
pack.shape

(32, 18)

In [52]:
pack['name']

53                    QUAKE II Mission Pack: The Reckoning
54                      QUAKE II Mission Pack: Ground Zero
226          QUAKE Mission Pack 2: Dissolution of Eternity
227               QUAKE Mission Pack 1: Scourge of Armagon
406                        SPORE™ Creepy & Cute Parts Pack
599                   STAR WARS™ Empire at War - Gold Pack
676                     Tradewinds Caravans + Odyssey Pack
704      Time Gentlemen, Please! and Ben There, Dan Tha...
709                                       Jewel Quest Pack
1299             eXceed 3rd - Jade Penetrate Black Package
1756                             The Apogee Throwback Pack
2281                                        Red Baron Pack
2685                                  Carmageddon Max Pack
2777           RollerCoaster Tycoon® 2: Triple Thrill Pack
3059             Putt-Putt® and Fatty Bear's Activity Pack
4145                                The Jackbox Party Pack
4839                               Fruits Inc. Deluxe Pa

I extracted a list of possible DLC by searching for the word 'Pack' within the app name. Since there were only 32 instances, I manually went through Steam's store to determine what was considered a game and what was considered DLC.
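
The search-then-review workflow could be sketched as follows. Everything in this example is hypothetical: the names and appids in `toy_df` are made up, and `dlc_appids` is a placeholder for the outcome of the manual store check, which is not recorded in this notebook. The search here ignores case, which casts a slightly wider net than the case-sensitive search above.

```python
import pandas as pd

# Hypothetical sample; names and appids are made up for illustration
toy_df = pd.DataFrame({
    'appid': [1, 2, 3, 4],
    'name': ['Example Quest', 'Example Quest - Map Pack',
             'Party Pack Trivia', 'Backpack Hero Demo'],
})

# Candidate DLC: any title containing 'pack', ignoring case
candidates = toy_df[toy_df['name'].str.contains('pack', case=False)]

# Placeholder result of the manual store check (assumed, not real data)
dlc_appids = [2]

# Keep everything that was not confirmed as DLC
games_only = toy_df[~toy_df['appid'].isin(dlc_appids)]
print(candidates['appid'].tolist())
print(games_only['appid'].tolist())
```

Note that a substring search also flags titles like 'Backpack Hero Demo', which is one reason a manual review of the candidates is still needed.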

In [None]:
#date of release (YYYY-MM-DD)
steam_df['release_date']

'release_date' is stored as a string.

In [14]:
#release_date is stored as a string
type(steam_df['release_date'][0])

str

In [17]:
steam_df['release_date'] = pd.to_datetime(steam_df['release_date'])

In [19]:
steam_df['release_date'].describe(datetime_is_numeric = True)

count                            27075
mean     2016-12-31 14:21:17.252077568
min                1997-06-30 00:00:00
25%                2016-04-04 00:00:00
50%                2017-08-08 00:00:00
75%                2018-06-06 12:00:00
max                2019-05-01 00:00:00
Name: release_date, dtype: object

The dataset contains games released between June 30, 1997 and May 01, 2019.

In [20]:
#0 if non-english, 1 if english
steam_df['english']

0        1
1        1
2        1
3        1
4        1
        ..
27070    1
27071    1
27072    1
27073    1
27074    1
Name: english, Length: 27075, dtype: int64

In [21]:
steam_df['english'].value_counts()

1    26564
0      511
Name: english, dtype: int64

There are 26,564 English games and 511 non-English games.

In [22]:
#name of game's developer
steam_df['developer']

0                     Valve
1                     Valve
2                     Valve
3                     Valve
4          Gearbox Software
                ...        
27070           SHEN JIAWEI
27071        Semyon Maximov
27072           EntwicklerX
27073    Yustas Game Studio
27074      Adept Studios GD
Name: developer, Length: 27075, dtype: object

In [23]:
steam_df['developer'].value_counts()

Choice of Games               94
KOEI TECMO GAMES CO., LTD.    72
Ripknot Systems               62
Laush Dmitriy Sergeevich      51
Nikita "Ghost_RUS"            50
                              ..
Myoubouh Corp                  1
AndAll Interactive             1
RESPECT TEAM STUDIO            1
First Game Studio              1
Permadeath                     1
Name: developer, Length: 17113, dtype: int64

In [25]:
#looking at the developers who made games not in English
non_english_devs = steam_df[steam_df['english'] == 0]['developer'].value_counts()
non_english_devs

KOEI TECMO GAMES CO., LTD.    38
ARTDINK                        5
凝冰剑斩                           5
上海アリス幻樂団                       4
あみそ組                           3
                              ..
豆干叠叠木                          1
Jiruo Software                 1
Atansoft                       1
LiveActive                     1
Neosia Entertainment           1
Name: developer, Length: 415, dtype: int64

Our dataset contains some special characters. In this subset consisting of non-English games, I saw some Chinese and Japanese text.

In [27]:
#name of game's publisher
steam_df['publisher']

0                       Valve
1                       Valve
2                       Valve
3                       Valve
4                       Valve
                 ...         
27070             SHEN JIAWEI
27071        BekkerDev Studio
27072             EntwicklerX
27073    Alawar Entertainment
27074    Alawar Entertainment
Name: publisher, Length: 27075, dtype: object

In [28]:
steam_df['publisher'].value_counts()

Big Fish Games                        212
Strategy First                        136
Ubisoft                               111
THQ Nordic                             98
Square Enix                            97
                                     ... 
crowgames UG (haftungsbeschränkt)       1
The Station                             1
Vikong                                  1
Teleporter Realities, Inc.              1
Permadeath                              1
Name: publisher, Length: 14354, dtype: int64

In [35]:
same_dev_pub = steam_df[steam_df['publisher'] == steam_df['developer']]
same_dev_pub.head()

Unnamed: 0,appid,name,release_date,english,developer,publisher,platforms,required_age,categories,genres,steamspy_tags,achievements,positive_ratings,negative_ratings,average_playtime,median_playtime,owners,price
0,10,Counter-Strike,2000-11-01,1,Valve,Valve,windows;mac;linux,0,Multi-player;Online Multi-Player;Local Multi-P...,Action,Action;FPS;Multiplayer,0,124534,3339,17612,317,10000000-20000000,7.19
1,20,Team Fortress Classic,1999-04-01,1,Valve,Valve,windows;mac;linux,0,Multi-player;Online Multi-Player;Local Multi-P...,Action,Action;FPS;Multiplayer,0,3318,633,277,62,5000000-10000000,3.99
2,30,Day of Defeat,2003-05-01,1,Valve,Valve,windows;mac;linux,0,Multi-player;Valve Anti-Cheat enabled,Action,FPS;World War II;Multiplayer,0,3416,398,187,34,5000000-10000000,3.99
3,40,Deathmatch Classic,2001-06-01,1,Valve,Valve,windows;mac;linux,0,Multi-player;Online Multi-Player;Local Multi-P...,Action,Action;FPS;Multiplayer,0,1273,267,258,184,5000000-10000000,3.99
5,60,Ricochet,2000-11-01,1,Valve,Valve,windows;mac;linux,0,Multi-player;Online Multi-Player;Valve Anti-Ch...,Action,Action;FPS;Multiplayer,0,2758,684,175,10,5000000-10000000,3.99


In [39]:
same_dev_pub.shape

(17592, 18)

In [36]:
same_dev_pub['developer'].value_counts()

Choice of Games               94
KOEI TECMO GAMES CO., LTD.    69
Ripknot Systems               62
Dexion Games                  45
RewindApp                     43
                              ..
Neople                         1
sniperoncall                   1
Nikku Nomura                   1
VRChat Inc.                    1
Pavel Black                    1
Name: developer, Length: 11408, dtype: int64

17,592 games in the dataset have a developer who is also the publisher. This could be important to consider when thinking about sequels or games within the same series.

In [40]:
#platform availability for game
steam_df['platforms']

0        windows;mac;linux
1        windows;mac;linux
2        windows;mac;linux
3        windows;mac;linux
4        windows;mac;linux
               ...        
27070              windows
27071              windows
27072              windows
27073          windows;mac
27074          windows;mac
Name: platforms, Length: 27075, dtype: object

In [41]:
steam_df['platforms'].value_counts()

windows              18398
windows;mac;linux     4623
windows;mac           3439
windows;linux          610
mac                      3
mac;linux                1
linux                    1
Name: platforms, dtype: int64

Windows-only was the most frequent platform for games in the dataset. There were only 5 games here that are not available on Windows.
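
The count of non-Windows games can be checked directly from the value counts. This small sketch copies the counts from the output above into a Series and sums the entries whose platform string does not mention Windows.

```python
import pandas as pd

# Platform counts copied from the value_counts() output above
platform_counts = pd.Series({
    'windows': 18398,
    'windows;mac;linux': 4623,
    'windows;mac': 3439,
    'windows;linux': 610,
    'mac': 3,
    'mac;linux': 1,
    'linux': 1,
})

# True where the platform string does not list Windows
no_windows = ~platform_counts.index.str.contains('windows')
non_windows_total = int(platform_counts[no_windows].sum())
print(non_windows_total)  # 5
```

On the full dataframe, the equivalent filter would be `steam_df[~steam_df['platforms'].str.contains('windows')]`.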

In [42]:
steam_df[steam_df['platforms'] == 'mac']

Unnamed: 0,appid,name,release_date,english,developer,publisher,platforms,required_age,categories,genres,steamspy_tags,achievements,positive_ratings,negative_ratings,average_playtime,median_playtime,owners,price
1413,214630,Call of Duty: Black Ops - Mac Edition,2012-09-27,1,Aspyr,Aspyr,mac,18,Single-player;Multi-player;Co-op;Steam Achieve...,Action,Action;Zombies;Multiplayer,68,168,105,0,0,50000-100000,15.49
12479,569050,Paul Pixel - The Awakening,2017-01-09,1,Xoron GmbH,Xoron GmbH,mac,0,Single-player,Adventure;Indie,Adventure;Indie;Point & Click,0,5,0,0,0,0-20000,2.89
16662,694180,MobileZombie,2017-10-13,1,YIMING ZHANG,YIMING ZHANG,mac,0,Single-player;Partial Controller Support,Adventure;Casual;Free to Play;Indie,Free to Play;Adventure;Indie,0,14,11,0,0,0-20000,0.0


The presence of a Mac Edition of a game informed me that there are ports in the dataset. Ports are versions of a game that have been adapted specifically for a new platform but are almost identical to the original.

In [58]:
#age-requirement for game
steam_df['required_age']

0        0
1        0
2        0
3        0
4        0
        ..
27070    0
27071    0
27072    0
27073    0
27074    0
Name: required_age, Length: 27075, dtype: int64

In [60]:
steam_df['required_age'].value_counts()

0     26479
18      308
16      192
12       73
7        12
3        11
Name: required_age, dtype: int64

Within this notebook, I'll use the term **labels** to collectively refer to categories, genres, and tags. Each of these three types of labels contains information on the content of a game, which could be how the game is played (e.g. single-player), what features the game supports (e.g. controller support), or what the game is about (e.g. sci-fi).

Some labels are more descriptive in terms of game content, while others are more functional and may not have broad appeal to Steam users. For example, I thought it unlikely that a user would make a purchase decision based on the presence of 'Valve Anti-Cheat'. The content recommender section goes into detail on how I approached these labels.
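
As a rough illustration of treating the three columns as one pool of labels, the sketch below merges them per game and drops labels considered functional. The sample rows, the `functional_labels` set, and the `combine_labels` helper are all assumptions made for this example; which labels count as functional is a judgment call.

```python
import pandas as pd

# Hypothetical two-row sample using the steam.csv column names
toy_df = pd.DataFrame({
    'categories': ['Multi-player;Valve Anti-Cheat enabled', 'Single-player;Steam Cloud'],
    'genres': ['Action', 'Adventure;Indie'],
    'steamspy_tags': ['FPS;World War II;Multiplayer', 'Adventure;Indie;Point & Click'],
})

# Assumed set of functional labels to ignore (illustrative only)
functional_labels = {'Valve Anti-Cheat enabled', 'Steam Cloud'}

def combine_labels(row):
    '''Merge the three label columns into one sorted, de-duplicated list.'''
    labels = set()
    for column in ('categories', 'genres', 'steamspy_tags'):
        labels.update(row[column].split(';'))
    return sorted(labels - functional_labels)

toy_df['labels'] = toy_df.apply(combine_labels, axis=1)
print(toy_df['labels'].tolist())
```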

In [61]:
steam_df['categories']

0        Multi-player;Online Multi-Player;Local Multi-P...
1        Multi-player;Online Multi-Player;Local Multi-P...
2                    Multi-player;Valve Anti-Cheat enabled
3        Multi-player;Online Multi-Player;Local Multi-P...
4        Single-player;Multi-player;Valve Anti-Cheat en...
                               ...                        
27070                     Single-player;Steam Achievements
27071                                        Single-player
27072    Single-player;Multi-player;Co-op;Shared/Split ...
27073                            Single-player;Steam Cloud
27074                            Single-player;Steam Cloud
Name: categories, Length: 27075, dtype: object

In [62]:
steam_df['categories'].value_counts()

Single-player                                                                                                                                                                         6110
Single-player;Steam Achievements                                                                                                                                                      2334
Single-player;Steam Achievements;Steam Trading Cards                                                                                                                                   848
Single-player;Partial Controller Support                                                                                                                                               804
Single-player;Steam Trading Cards                                                                                                                                                      792
                                                                 

### Steam User Libraries - Steam Web API

I used Steam's Web API service to collect information on users and their personal libraries of games. You need to have a Steam Account in order to request a Web API key. 

[Here](https://steamcommunity.com/dev) is a general overview of how to access Steam's Web API. This [link](https://developer.valvesoftware.com/wiki/Steam_Web_API) has more documentation on the different types of API calls and what kinds of information are available.

I saved my API key in a text file in my local repository.
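
A minimal sketch of reading the key and requesting a user's library might look like the following. It assumes the `GetOwnedGames` endpoint described in the Web API documentation linked above; the helper names, the key file path, and the sample payload are hypothetical, and the network call is kept in its own function so the parsing logic can be tried without a key.

```python
from pathlib import Path

# GetOwnedGames endpoint as listed in the Steam Web API documentation
OWNED_GAMES_URL = 'https://api.steampowered.com/IPlayerService/GetOwnedGames/v0001/'

def load_api_key(path):
    '''Read the API key from a local text file (the path is an assumption).'''
    return Path(path).read_text().strip()

def owned_games_params(api_key, steam_id):
    '''Build the query parameters for a GetOwnedGames request.'''
    return {'key': api_key, 'steamid': steam_id,
            'include_played_free_games': 1, 'format': 'json'}

def parse_owned_games(payload):
    '''Map appid -> minutes played; a response without games yields {}.'''
    games = payload.get('response', {}).get('games', [])
    return {g['appid']: g.get('playtime_forever', 0) for g in games}

def fetch_owned_games(api_key, steam_id):
    '''Call the API and return the parsed library (requires network access).'''
    import requests  # third-party; imported here so the helpers above work without it
    response = requests.get(OWNED_GAMES_URL,
                            params=owned_games_params(api_key, steam_id),
                            timeout=10)
    response.raise_for_status()
    return parse_owned_games(response.json())

# Hand-written payload in the documented response shape (not real user data)
sample_payload = {'response': {'game_count': 2, 'games': [
    {'appid': 10, 'playtime_forever': 120},
    {'appid': 20, 'playtime_forever': 0},
]}}
print(parse_owned_games(sample_payload))
```

Keeping the key in a separate file only helps if that file is excluded from version control, for example through `.gitignore`.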

## Modeling

I made a content-based recommendation system using categories, genres, and tags as the main indicators of game content.

In this notebook, I'll use the term 'label' as a collective term for category, genre, and tag. 

I assigned each game a value of 0 or 1 for each label: 0 if that label was not present for that game, and 1 if that label was present. This created a matrix containing encoded features for each game based on their content labels. 
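
The encoding step can be sketched with scikit-learn's `MultiLabelBinarizer`. The three games and their labels below are a hypothetical sample; in the real notebook, each game's label list would come from its categories, genres, and tags.

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical sample: one list of labels per game, indexed like the games dataframe
labels = pd.Series(
    [['Action', 'FPS', 'Multi-player'],
     ['Action', 'Indie', 'Single-player'],
     ['Indie', 'Puzzle', 'Single-player']],
    index=[0, 1, 2],
)

mlb = MultiLabelBinarizer()
encoded = mlb.fit_transform(labels)

# One row per game, one 0/1 column per label
content_matrix = pd.DataFrame(encoded, index=labels.index, columns=mlb.classes_)
print(content_matrix)
```

Keeping the index of `content_matrix` aligned with the games dataframe matters here, because the recommender functions look up rows in both by the same index values.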

### Simple Content-based Recommender

In [1]:
def simple_recommender(game, content_matrix, df, n):
    '''
    Returns a dataframe containing n content-based recommendations for a specified game
    
    Arguments: 
    game: name of the game to get recommendations for
    content_matrix: MLB encoded matrix describing game content with 0's and 1's
    df: dataframe of game info, function will return a subset of this dataframe
    n: number of recommendations to return
    '''
    #get the index of the game from the games info dataframe
    game_idx = df[df['name'] == game].index
    #get the features for the game from the content matrix
    game_features = content_matrix.loc[game_idx]
    #drop the game from the content matrix
    other_games_df = content_matrix.drop(game_idx, axis=0)
    #compute cosine similarity with sklearn
    cos_sim = cosine_similarity(game_features, other_games_df)
    #transform array of cos_sims into dataframe
    cos_sim_df = pd.DataFrame(cos_sim, index=game_idx, 
                              columns=other_games_df.index).T
    cos_sim_df.sort_values(by=[game_idx[0]], ascending=False, inplace=True)
    top_matches = cos_sim_df.iloc[:n]
    print(top_matches)
    #get the index for each recommended game
    top_idx = top_matches.index
    #refer to full games info matrix to return info for recommended games
    rec_df = df.loc[top_idx]
    return rec_df
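
To make the ranking criterion concrete: for 0/1 label vectors, cosine similarity is the number of shared labels divided by the square root of the product of each game's label count. The sketch below computes it by hand with NumPy on three hypothetical label vectors; it is a worked example of the measure, not a replacement for the `cosine_similarity` call above.

```python
import numpy as np

# Hypothetical 0/1 label vectors for three games (each position is one label)
game_a = np.array([1, 1, 0, 1, 0, 0])
game_b = np.array([1, 1, 0, 0, 0, 0])
game_c = np.array([0, 0, 1, 0, 1, 1])

def cosine(u, v):
    '''Cosine similarity: dot product divided by the product of the norms.'''
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

sim_ab = cosine(game_a, game_b)  # 2 shared labels, 3 and 2 labels in total
sim_ac = cosine(game_a, game_c)  # no shared labels
print(round(sim_ab, 3), round(sim_ac, 3))
```

Games that share most of their labels score close to 1, while games with no labels in common score 0, so sorting in descending order puts the closest matches first.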

In [None]:
#give 10 recommendations for the game 'Warframe'

My next steps for this content-based recommender were to:

* Prompt the user for input (e.g. game name, number of recommendations, any filters)
* Organize the output and return key information in a user-friendly format

### Content-based Recommender with User Input

In [9]:
def game_input(df):
    '''
    Prompts for a game name and returns the matching index from df,
    or None if the name is not in the dataset
    '''
    #input() always returns a string, so no type check is needed
    game = input('Please type the name of the game you would like recommendations for: ').strip()
    game_idx = df[df['name'] == game].index
    if len(game_idx) == 0:
        print('Sorry, recommendations are not available for this game!')
        return None
    #names are not unique, so report the first match
    game_name = df.loc[game_idx[0], 'name']
    game_id = df.loc[game_idx[0], 'appid']
    print(f'Looking for recommendations for {game_name} (app id {game_id}):')
    return game_idx

In [None]:
def content_recommender(game, content_matrix, df, n=10):
    '''
    Returns a dataframe containing n content-based recommendations for a specified game
    
    Arguments: 
    game: name of the game to get recommendations for
    content_matrix: MLB encoded matrix describing game content with 0's and 1's
    df: dataframe of game info, function will return a subset of this dataframe
    n: number of recommendations to return (default of 10 is an assumption)
    '''
    #get the index of the game from the games info dataframe
    game_idx = df[df['name'] == game].index
    #get the features for the game from the content matrix
    game_features = content_matrix.loc[game_idx]
    #drop the game from the content matrix
    other_games_df = content_matrix.drop(game_idx, axis=0)
    #compute cosine similarity with sklearn
    cos_sim = cosine_similarity(game_features, other_games_df)
    #transform array of cos_sims into dataframe
    cos_sim_df = pd.DataFrame(cos_sim, index=game_idx, 
                              columns=other_games_df.index).T
    cos_sim_df.sort_values(by=[game_idx[0]], ascending=False, inplace=True)
    top_matches = cos_sim_df.iloc[:n]
    print(top_matches)
    #get the index for each recommended game
    top_idx = top_matches.index
    #refer to full games info matrix to return info for recommended games
    rec_df = df.loc[top_idx]
    return rec_df

## Results

## Conclusions