Steam Challenge for Datatonic

Steven Jordan

Exercise 3: Advanced

For this exercise, I built a collaborative-filtering recommender system for Steam users. Because the data set contains no user ratings, I used a user’s playtime on each game as a proxy for a rating, on the assumption that people spend more time playing games they like.


In [1]:
# Import libraries that will be used for the recommender system
import pandas as pd
import numpy as np
import sklearn
import sklearn.metrics.pairwise as skmp
from sklearn.cluster import KMeans

In [2]:
# Read the .csv files that will be used 
df = pd.read_csv('reshaped_games_df2.csv')  # User Data (created previously in PySpark)
df_appid = pd.read_csv('App_ID_Info.csv')   # Game Data to get the titles
df.head(6)

,steamid,10,100,10000,1002,10040,100400,100410,10080,10090,...,99400,99410,9960,9970,99700,9980,99810,99830,99890,9990
0,76561197973784324,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,76561198027864348,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,76561197972510274,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,76561198001205398,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,76561197972024750,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5,76561198077212088,1431,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
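
The table above is a wide user-by-game matrix: one row per SteamID, one column per appid, with playtime as the value and 0 for unplayed games. The original reshaping was done in PySpark and is not shown here. As a rough illustration only, the same shape could be produced in pandas as follows; the long-format column names (`steamid`, `appid`, `playtime`) and the sample values are hypothetical.

```python
import pandas as pd

# Hypothetical long-format records: one row per (user, game) pair.
long_df = pd.DataFrame({
    "steamid":  [1001, 1001, 1002, 1003, 1003],
    "appid":    [10, 730, 10, 570, 730],
    "playtime": [1431, 20, 300, 55, 0],
})

# One row per user, one column per appid; games a user never played become 0.
wide_df = long_df.pivot_table(
    index="steamid", columns="appid", values="playtime",
    aggfunc="sum", fill_value=0,
)

# Column labels read back from a CSV are strings, so mirror that here.
wide_df.columns = wide_df.columns.astype(str)
print(wide_df)
```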


In [3]:
# Separate the labels and the data - use the steamid as indices
sim_users = df.loc[:,['steamid']]
data = df.drop(['steamid'], axis = 1)
df = df.set_index('steamid')
df.head(6)

steamid,10,100,10000,1002,10040,100400,100410,10080,10090,100970,...,99400,99410,9960,9970,99700,9980,99810,99830,99890,9990
76561197973784324,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
76561198027864348,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
76561197972510274,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
76561198001205398,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
76561197972024750,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
76561198077212088,1431,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [4]:
# The provided steamid can be replaced with any other steamid,
# or obtained through user input

#userid = int(input("Please enter your Steam ID: "))
userid = 76561198077212088

The cell below is not strictly necessary for the way the recommender system currently takes its input (a single user). However, it would be useful if one had a batch of users to send recommendations to.

In [5]:
# Finds the centroid of a group of users. Because only a single SteamID is used for this
# demonstration, the centroid is identical to that user's record

user_data = [df.loc[userid]]
kmeans = KMeans(n_clusters=1).fit(user_data)
user_centroid = kmeans.cluster_centers_[0]
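
For a batch of users, a single-cluster k-means centroid is simply the column-wise mean of their playtime rows, so the same "average user" can be computed directly. The sketch below uses NumPy and made-up playtime values purely for illustration; it is not part of the original notebook.

```python
import numpy as np

# Hypothetical playtime rows for a batch of three users (columns = games).
batch = np.array([
    [1431.0,  0.0, 20.0],
    [ 300.0,  0.0, 40.0],
    [  69.0, 90.0,  0.0],
])

# With one cluster, the k-means centroid is the mean of each column.
batch_centroid = batch.mean(axis=0)
print(batch_centroid)
```

One consequence of averaging is that a single heavy player in the batch can dominate the centroid, so the batch may need to be chosen with some care.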

In [7]:
# Computes a similarity score between the input user and every user in the data set,
# based on both games played and the playtime of each game

sim_list = skmp.cosine_similarity([user_centroid], data)[0]

In [8]:
# Zips the similarity scores to their associated SteamIDs, sorting them by similarity
pairs = zip(sim_list, sim_users.steamid)
tups_sorted = sorted(pairs, key=lambda pair: pair[0], reverse=True)

# Prints the top 10 most similar Steam IDs and their similarity scores.
# Index 0 is skipped because it is expected to be the input user (similarity of 1 with itself)
tups_sorted[1:11]

[(0.9658393953021863, 76561197973970142),
 (0.9657963690076256, 76561198065204329),
 (0.9656602123291641, 76561197989059688),
 (0.9652960219623384, 76561198029515752),
 (0.9651958231014126, 76561197972540006),
 (0.9650707031043491, 76561198001419384),
 (0.9649507290765706, 76561198074113666),
 (0.9648888848439591, 76561198024855826),
 (0.9647140694695266, 76561198028019764),
 (0.9644978835679908, 76561198028449188)]
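
Cosine similarity measures the angle between two playtime vectors, so it reflects the mix of games and their relative playtime rather than total hours played. The small sketch below illustrates this with hypothetical playtime values; the helper function is written out by hand for clarity and is only a stand-in for the library call used above. Returning 0 for an all-zero vector is a convention chosen for this sketch.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two playtime vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    # A user with no playtime has a zero vector; treat the similarity as 0.
    return float(a @ b / denom) if denom else 0.0

casual   = [10, 0, 5]       # hypothetical playtimes
hardcore = [1000, 0, 500]   # same mix of games, 100x the playtime
other    = [0, 8, 0]        # plays only a game the other two ignore

sim_scaled = cosine_similarity(casual, hardcore)    # approximately 1
sim_disjoint = cosine_similarity(casual, other)     # exactly 0
print(sim_scaled, sim_disjoint)
```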

This recommender system is set up so that it:
1. Finds the 100 users most similar to the input SteamID
2. Sums the playtime of those 100 users for each game
3. Recommends the games with the most total playtime among those similar users that the input user has not yet played

In [9]:
# Pulls out the most similar user SteamIDs. I am using 100 for the recommender system, but this
# can be adjusted based on testing

most_similar_users = [steamid for _, steamid in tups_sorted[1:101]]

In [10]:
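# Iterates through the similar users, summing up their total playtime for each game/app.
# Starts from a copy of the input user's own playtime so that the in-place += cannot
# write back into df; the user's own games are filtered out of the recommendations later
total_playtime = df.loc[userid].copy()
for u in most_similar_users:
    total_playtime += df.loc[u]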

# Sorts the potentially recommended games by total playtime, converting the result to a pandas DataFrame
total_playtime_sorted = total_playtime.sort_values(ascending=False)
sim_df = pd.DataFrame(total_playtime_sorted)
sim_df.head()

,76561198077212088
10,4513158
570,1622675
730,47933
10190,21058
240,18550


In [11]:
# Iterates through the candidate games in order of total playtime, and pulls out
# the top 10 of those which the input user has not played

recs = []
for label in sim_df.index:
    if df.loc[userid, label] == 0:
        recs.append(label)
        if len(recs) == 10:
            break

In [12]:
# Sets the appids as the index, to be used to find the titles
df_appid = df_appid.set_index('appid')

In [13]:
# Print out the top 10 recommendations for the input user

print("For User with Steam ID:", userid)
print("...we recommend the following games or applications: \n")

for r in recs:
    print('{:40} {:7}{}'.format(df_appid['Title'].loc[int(r)],'appid:',r))
    

For User with Steam ID: 76561198077212088
...we recommend the following games or applications: 

Counter-Strike: Global Offensive         appid: 730
Call of Duty®: Modern Warfare® 2         appid: 10190
Counter-Strike: Source                   appid: 240
Call of Duty®: Modern Warfare® 3         appid: 42690
Counter-Strike: Condition Zero           appid: 80
Left 4 Dead 2                            appid: 550
Total War: SHOGUN 2                      appid: 34330
Killing Floor                            appid: 1250
Call of Duty®: Modern Warfare® 3         appid: 42680
PAYDAY™ The Heist                        appid: 24240
