# Steam Recommendations

## Overview

## Business Understanding

Steam is a video game storefront and distribution platform, and it is the largest and most popular digital store for PC games.

## Data Understanding

### Steam Store Games - Kaggle Dataset

I used a dataset from Kaggle containing information on video game categories and genres. The dataset can be found [here](https://www.kaggle.com/datasets/nikdavis/steam-store-games?select=steamspy_tag_data.csv).

The author documented their process of data collection via Steam Web API calls and SteamSpy API calls. This [link](https://nik-davis.github.io/posts/2019/steam-data-collection/) goes into how the data was collected and cleaned.

I stored the data under a separate folder within my local repository. 

In [3]:
#import statements
import numpy as np
import pandas as pd
import json
import requests

from sklearn.compose import ColumnTransformer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import OneHotEncoder, MultiLabelBinarizer

In [4]:
steam_df = pd.read_csv('./data/Steam_store_data/steam.csv')

In [5]:
steam_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 27075 entries, 0 to 27074
Data columns (total 18 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   appid             27075 non-null  int64  
 1   name              27075 non-null  object 
 2   release_date      27075 non-null  object 
 3   english           27075 non-null  int64  
 4   developer         27075 non-null  object 
 5   publisher         27075 non-null  object 
 6   platforms         27075 non-null  object 
 7   required_age      27075 non-null  int64  
 8   categories        27075 non-null  object 
 9   genres            27075 non-null  object 
 10  steamspy_tags     27075 non-null  object 
 11  achievements      27075 non-null  int64  
 12  positive_ratings  27075 non-null  int64  
 13  negative_ratings  27075 non-null  int64  
 14  average_playtime  27075 non-null  int64  
 15  median_playtime   27075 non-null  int64  
 16  owners            27075 non-null  object

Here are brief descriptions for each column:
* appid : unique identifier for each game on Steam
* name : title of the app (game)
* release_date : release date in YYYY-MM-DD format
* english : 1 if the game is in English, otherwise 0
* developer : name(s) of developer(s), delimited by semicolon if there are multiple developers
* publisher : name(s) of publisher(s), delimited by semicolon if there are multiple publishers
* platforms : supported platforms (includes Windows, Mac, and Linux), delimited by semicolon
* required_age : minimum required age based on PEGI UK ratings, 0 denotes unrated or unsupplied
* categories : game categories, delimited by semicolon
* genres : game genres, delimited by semicolon
* steamspy_tags : community-voted tags, delimited by semicolon
* achievements : number of in-game achievements
* positive_ratings : number of positive ratings (from SteamSpy)
* negative_ratings : number of negative ratings (from SteamSpy)
* average_playtime : average user playtime in minutes (from SteamSpy)
* median_playtime : median user playtime in minutes (from SteamSpy)
* owners : estimated number of owners, given as a range
* price : full price of the title in GBP
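
Several of these columns need light parsing before they can be used. The sketch below shows one way to do that on a small made-up frame; the toy values and the `low-high` format of the owners range are assumptions for illustration, not rows taken from the dataset.

```python
import pandas as pd

# Toy rows that mimic the column formats described above (values are made up).
toy_df = pd.DataFrame({
    'name': ['Game A', 'Game B'],
    'release_date': ['2000-11-01', '2018-06-15'],
    'platforms': ['windows;mac;linux', 'windows'],
    'owners': ['10000000-20000000', '0-20000'],  # assumed "low-high" format
})

# Parse YYYY-MM-DD strings into datetimes.
toy_df['release_date'] = pd.to_datetime(toy_df['release_date'], format='%Y-%m-%d')

# Split semicolon-delimited strings into Python lists.
toy_df['platforms'] = toy_df['platforms'].str.split(';')

# Split the owners range into numeric lower and upper bounds.
owner_bounds = toy_df['owners'].str.split('-', expand=True).astype(int)
toy_df['owners_low'] = owner_bounds[0]
toy_df['owners_high'] = owner_bounds[1]

print(toy_df[['name', 'platforms', 'owners_low', 'owners_high']])
```

The same `str.split(';')` pattern applies to developer, publisher, categories, genres, and steamspy_tags.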

There was some overlap between categories, genres, and steamspy_tags. For example, 'Action' appeared both as a genre and as a tag, while 'Multi-player' appeared with different spellings in categories and tags.
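
One way to check this kind of overlap is to collect the distinct labels in each column and intersect the sets. This is a rough sketch on made-up rows; the helper names `unique_labels` and `normalize` are hypothetical, and the exact tag spellings in the toy data are assumptions.

```python
import pandas as pd

# Made-up rows in the same semicolon-delimited format as the dataset.
toy_df = pd.DataFrame({
    'categories': ['Multi-player;Online Multi-Player', 'Single-player'],
    'genres': ['Action', 'Action;Indie'],
    'steamspy_tags': ['Action;FPS;Multiplayer', 'Indie;Puzzle;Singleplayer'],
})

def unique_labels(series):
    """Return the set of distinct labels in a semicolon-delimited column."""
    return set(series.str.split(';').explode())

def normalize(label):
    """Lowercase a label and drop hyphens and spaces so spelling variants match."""
    return label.lower().replace('-', '').replace(' ', '')

categories = unique_labels(toy_df['categories'])
genres = unique_labels(toy_df['genres'])
tags = unique_labels(toy_df['steamspy_tags'])

# Labels spelled identically in genres and tags.
exact_overlap = genres & tags

# Labels shared by categories and tags once spelling differences are ignored.
normalized_overlap = {normalize(c) for c in categories} & {normalize(t) for t in tags}

print(sorted(exact_overlap))
print(sorted(normalized_overlap))
```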

### Steam User Libraries - Steam Web API

I used Steam's Web API service to collect information on users and their personal libraries of games. You need to have a Steam Account in order to request a Web API key. 

[Here](https://steamcommunity.com/dev) is a general overview of how to access Steam's Web API. This [link](https://developer.valvesoftware.com/wiki/Steam_Web_API) has more documentation on the different types of API calls and what kinds of information are available.

I saved my API key under a text file in my local repository.
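
As a rough sketch of what a library request could look like, the code below reads the key from a text file and calls the `GetOwnedGames` method described in the Valve wiki linked above. The helper names, the key file path, and the choice to keep only playtime per app are assumptions for illustration; this is not necessarily how the data for this project was collected.

```python
import requests

# GetOwnedGames endpoint, as documented on the Valve developer wiki.
OWNED_GAMES_URL = 'https://api.steampowered.com/IPlayerService/GetOwnedGames/v0001/'

def read_api_key(path):
    """Read the API key from a local text file (the path is up to you)."""
    with open(path) as f:
        return f.read().strip()

def fetch_owned_games(api_key, steam_id, timeout=10):
    """Request the owned-games payload for one 64-bit Steam ID."""
    params = {
        'key': api_key,
        'steamid': steam_id,
        'include_appinfo': 1,
        'format': 'json',
    }
    response = requests.get(OWNED_GAMES_URL, params=params, timeout=timeout)
    response.raise_for_status()
    return response.json()

def parse_owned_games(payload):
    """Map appid -> total playtime in minutes.

    Profiles with a private library return no 'games' key, which yields {}.
    """
    games = payload.get('response', {}).get('games', [])
    return {game['appid']: game.get('playtime_forever', 0) for game in games}

# Example usage (hypothetical file name and Steam ID, requires network access):
# api_key = read_api_key('./steam_api_key.txt')
# library = parse_owned_games(fetch_owned_games(api_key, '7656119XXXXXXXXXX'))
```

Keeping the parsing step separate from the request makes it possible to test the parsing on a saved response without calling the API again.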

## Modeling

I made a content-based recommendation system using categories, genres, and tags as the main indicators of game content.

In this notebook, I'll use the term 'label' as a collective term for category, genre, and tag. 

I assigned each game a value of 0 or 1 for each label: 0 if that label was not present for that game, and 1 if it was. This created a matrix of encoded features for each game based on its content labels.
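
A minimal sketch of this encoding with scikit-learn's `MultiLabelBinarizer` is shown below on made-up rows. Merging the three columns into a single label set per game (so that 'Action' as a genre and as a tag counts once) is an assumption made for this example; the columns could also be encoded separately.

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Made-up rows in the same semicolon-delimited format as the dataset.
toy_df = pd.DataFrame({
    'name': ['Game A', 'Game B', 'Game C'],
    'categories': ['Multi-player', 'Single-player', 'Single-player;Multi-player'],
    'genres': ['Action', 'Indie', 'Action;Indie'],
    'steamspy_tags': ['FPS;Action', 'Puzzle', 'Action'],
})

label_cols = ['categories', 'genres', 'steamspy_tags']

# Combine the labels from all three columns into one set per game.
labels_per_game = [
    set(';'.join(row).split(';'))
    for row in toy_df[label_cols].itertuples(index=False)
]

# One column per label: 1 if the game has the label, 0 otherwise.
mlb = MultiLabelBinarizer()
content_matrix = pd.DataFrame(
    mlb.fit_transform(labels_per_game),
    index=toy_df.index,      # same index as the games dataframe
    columns=mlb.classes_,
)

print(content_matrix)
```

Keeping the same index as the games dataframe matters because the recommender looks up rows in both frames by that index.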

### Simple Content-based Recommender

In [6]:
def simple_recommender(game, content_matrix, df, n):
    '''
    Return info for the n games most similar to `game` by cosine similarity.

    game : exact title as it appears in df['name']
    content_matrix : dataframe of 0/1 label features, indexed like df
    df : dataframe of games info
    n : number of recommendations to return
    '''
    #get the index of the game from the games info dataframe
    game_idx = df[df['name'] == game].index
    if len(game_idx) == 0:
        raise ValueError(f'No game named {game!r} was found.')
    #get the features for the game from the content matrix
    game_features = content_matrix.loc[game_idx]
    #drop the game from the content matrix
    other_games_df = content_matrix.drop(game_idx, axis=0)
    #compute cosine similarity with sklearn
    cos_sim = cosine_similarity(game_features, other_games_df)
    #transform array of cos_sims into dataframe (one row per other game)
    cos_sim_df = pd.DataFrame(cos_sim, index=game_idx,
                              columns=other_games_df.index).T
    #sort by similarity to the first matching title, highest first
    cos_sim_df = cos_sim_df.sort_values(by=game_idx[0], ascending=False)
    top_matches = cos_sim_df.iloc[:n]
    print(top_matches)
    #get the index for each recommended game
    top_idx = top_matches.index
    #refer to full games info matrix to return info for recommended games
    rec_df = df.loc[top_idx]
    return rec_df
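
For 0/1 label vectors, cosine similarity reduces to the number of shared labels divided by the square root of the product of the two label counts. The small check below, on two made-up vectors, compares that hand calculation with `cosine_similarity` from scikit-learn.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Two made-up games encoded over five labels.
a = np.array([[1, 1, 0, 1, 0]])  # has 3 labels
b = np.array([[1, 0, 0, 1, 1]])  # has 3 labels, 2 of them shared with a

# Hand calculation: shared labels / sqrt(labels in a * labels in b).
shared = int((a & b).sum())
manual = shared / np.sqrt(a.sum() * b.sum())

# Same quantity computed by scikit-learn.
sklearn_value = cosine_similarity(a, b)[0, 0]

print(manual, sklearn_value)
```

Both values come out to 2/3: the games share two labels and each has three.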

My next steps for this content-based recommender were to:

* Prompt the user for input (e.g. game name, number of recommendations, any filters)
* Organize the output and return key information in a user-friendly format
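
For the second step, one possible shape for the output is a small table with a few key columns. The helper below is a hypothetical sketch, not part of the notebook's code: the name `format_recommendations`, the choice of columns, and the derived `pct_positive` column are all assumptions.

```python
import pandas as pd

def format_recommendations(rec_df):
    """Return a compact, readable table from a dataframe of recommended games."""
    out = rec_df[['name', 'genres', 'price']].copy()
    # Show genres as a comma-separated list instead of semicolon-delimited.
    out['genres'] = out['genres'].str.replace(';', ', ', regex=False)
    # Share of positive ratings; left as NaN when a game has no ratings.
    total = rec_df['positive_ratings'] + rec_df['negative_ratings']
    out['pct_positive'] = (100 * rec_df['positive_ratings'] / total.where(total > 0)).round(1)
    return out.reset_index(drop=True)

# Made-up rows with the columns the helper expects.
toy_recs = pd.DataFrame({
    'name': ['Game A', 'Game B'],
    'genres': ['Action;Indie', 'Puzzle'],
    'price': [7.19, 0.0],
    'positive_ratings': [90, 0],
    'negative_ratings': [10, 0],
}, index=[12, 40])

formatted = format_recommendations(toy_recs)
print(formatted)
```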

### Content-based Recommender with User Input

In [9]:
def game_input(df):
    '''
    Prompt for a game name and return its index in df, or None if not found.
    '''
    game = input('Please type the name of the game you would like recommendations for: ').strip()
    #input() always returns a string, so no type check is needed
    game_idx = df[df['name'] == game].index
    if len(game_idx) == 0:
        print('Sorry, recommendations are not available for this game!')
        return None
    game_name = df.loc[game_idx[0], 'name']
    game_id = df.loc[game_idx[0], 'appid']
    print(f'Looking for recommendations for {game_name} (app id {game_id}):')
    return game_idx

In [None]:
def content_recommender(game, content_matrix, df, n):
    '''
    Return info for the n games most similar to `game` by cosine similarity.

    game : exact title as it appears in df['name']
    content_matrix : dataframe of 0/1 label features, indexed like df
    df : dataframe of games info
    n : number of recommendations to return
    '''
    #get the index of the game from the games info dataframe
    game_idx = df[df['name'] == game].index
    if len(game_idx) == 0:
        raise ValueError(f'No game named {game!r} was found.')
    #get the features for the game from the content matrix
    game_features = content_matrix.loc[game_idx]
    #drop the game from the content matrix
    other_games_df = content_matrix.drop(game_idx, axis=0)
    #compute cosine similarity with sklearn
    cos_sim = cosine_similarity(game_features, other_games_df)
    #transform array of cos_sims into dataframe (one row per other game)
    cos_sim_df = pd.DataFrame(cos_sim, index=game_idx,
                              columns=other_games_df.index).T
    #sort by similarity to the first matching title, highest first
    cos_sim_df = cos_sim_df.sort_values(by=game_idx[0], ascending=False)
    top_matches = cos_sim_df.iloc[:n]
    print(top_matches)
    #get the index for each recommended game
    top_idx = top_matches.index
    #refer to full games info matrix to return info for recommended games
    rec_df = df.loc[top_idx]
    return rec_df

## Results

## Conclusions