# Social Computing/Social Gaming - Summer 2022

# Exercise Sheet 3: Collaborative Filtering with Steam Games

In this exercise, we will build a collaborative filtering recommender system using data we gather from Steam. We will use your friends list to get information about owned games for each ID, and the time each game was played.

Usually, collaborative filtering is based on some sort of rating to determine the similarity between users. However, for games, the enjoyment and a rating do not always match. Additionally, only about 10% of players actually rate the games they play, which would make for a very incomplete dataset. Therefore, the playtime will be used instead of a rating system. This has the added benefit that playtime is usually the most authentic metric of enjoyment, as players are very unlikely to spend much time on a game they don't enjoy.

## Task 3.1: Obtaining the data


**1.** Your first task is to **gather the data** needed to create the recommender system. **Create a data structure** that holds the needed information for each player and game. To do this, **open the URL** with the given `request()` function, **read** the json response and retrieve your games library and playtime. Then **save** the games into a dictionary with `key=name` and `values=playtime`. **Do not add** games with 0 playtime to this dictionary.


**Notes:** 
- You have three different options to solve this exercise. You can either:
    - Use your own Steam profile (strongly recommended)
    - Use the provided default Steam account (in case you do not own a Steam profile)
    - Use the provided .json file (in case you do not have a Steam profile and the default Steam account becomes overcrowded)
- your choice will not affect your grade in any way
- You cannot obtain a list from your profile with the Steam API unless your profile is set to public. 
- Upon executing the code below, you will notice that a lot of profiles "`couldnt decode`". These are private or deleted profiles and it is totally fine to get this message.


**Hints**:
- In case you wish to use your own Steam profile, but are afraid to share your personal [key](https://steamcommunity.com/dev/apikey) [1] and id, please be informed that you can delete them **after** solving the tasks and before submitting your solutions. The outputs will be saved in the Jupyter Notebook.
- To obtain the games a user owns, use this: `games = data['response']['games']`. This returns a list of games, including the playtime (in minutes) which can be retrieved like this: `playtime = game['playtime_forever']`, where game refers to an item from the list of games. 

Execute the following code cell to install the needed library for this exercise.

In [1]:
!pip install mlxtend



In [2]:
# Use this if you want to work with the default IDs
import requests
import urllib
import pandas as pd
import json
from urllib.request import Request, urlopen
from pandas.io.json import json_normalize
from requests.exceptions import HTTPError

# You can replace these values with your own ID and API key
key = "DC0E92192AB08493651E1846A726EB7B"
id = "76561198085432677"
url = "http://api.steampowered.com/ISteamUser/GetPlayerSummaries/v0002/?key="+key+"&steamids="+id
r = requests.get(url)
data = r.json()

# Get friendslist
# This is just a template. In order to get your personalized list, you need to change the id and key above.
request = Request("http://api.steampowered.com/ISteamUser/GetFriendList/v0001/?key="+key+"&steamid="+id+"&relationship=friend")
response = urlopen(request)
elevations = response.read()
data = json.loads(elevations)
friendslist = data['friendslist']
friends = friendslist['friends']

# Get all friends
friendids = []
tempIDs = []
for friend in friends:
    friendids.append(friend['steamid'])
    
print(len(friendids), "ok")

# Get friends of friends
x = 0

while x < len(friendids):
    friendID = friendids[x]
    request = Request("http://api.steampowered.com/ISteamUser/GetFriendList/v0001/?key="+key+"&steamid="+friendID+"&relationship=friend")
    try:
        response = urlopen(request)    
    except urllib.error.HTTPError as e:
        print('401')
    elevations = response.read()
    try:
        data = json.loads(elevations)
    except json.JSONDecodeError:
        print("couldn't decode")
    friendslist = data['friendslist']
    friends = friendslist['friends']

    friendidsNew = []
    for friend in friends:
        friendidsNew.append(friend['steamid'])
        
    tempIDs += friendidsNew
    x += 1

friendids += tempIDs
friendids = list(dict.fromkeys(friendids))
friendids = list(set(friendids))
print(len(friendids))


70 ok
401
couldn't decode
401
couldn't decode
401
couldn't decode
401
couldn't decode
401
couldn't decode
401
couldn't decode
401
couldn't decode
401
couldn't decode
401
couldn't decode
401
couldn't decode
401
couldn't decode
6569


In [3]:
# Trim the list of IDs to reasonable values:
if len(friendids)>250:
    friendids = friendids[:250]
print(len(friendids))

users_gamedicts = {} # The dictionary containing all information for every ID
gamedict = {} # A dict containing information for one player

# Get owned games of friendslist:
request = Request("http://api.steampowered.com/IPlayerService/GetOwnedGames/v0001/?key="+key+"&steamid="+id+"&include_appinfo=1&format=json")

# DONE:
# Open the URL, read the json response and retrieve your games library and playtime
# Save the games into a dictionary with key=name and values=playtime
# Hint 1: You can obtain the games a user owns with data['response']['games']
# Hint 2: You can retrieve their playtime with game['playtime_forever']

response = urlopen(request)
elevations = response.read()
data = json.loads(elevations)

games = data['response']['games']
for game in games:
    gamedict[game['name']] = game['playtime_forever']

# Add the dictionary to the users_gamedict       
users_gamedicts[id] = gamedict

# Do the same for your friends and their friends
for friendID in friendids:
    # DONE:
    gamedict = {}
    
    request = Request("http://api.steampowered.com/IPlayerService/GetOwnedGames/v0001/?key="+key+"&steamid="+friendID+"&include_appinfo=1&format=json")
    response = urlopen(request)
    elevations = response.read()
    data = json.loads(elevations)
    
    if data['response'] != {} and data['response']['game_count'] != 0: # I was having some problems with friends that have 0 games
        games = data['response']['games']
        for game in games:
            gamedict[game['name']] = game['playtime_forever']
        users_gamedicts[friendID] = gamedict


250


## Task 3.2: Association rule mining

Before we start with the "real" recommender system, let us take a look at a more general form of recommending items using association rules.

The concept of association rule mining is rather simple: Looking at an itemset, one tries to find dependencies between items that could "belong together". A common example would be buying food at the store: If, for example, meat and salt are bought together often, but meat without salt not that often, it is assumed that there is a connection between those two. For games, if it was found that most of the users who own the demo version of a game also own the full version of that game, it would be a reasonable assumption that these users liked the demo and therefore bought the full version.


Let us first cover the mathematical basis for association rules. The most important metrics used are **support**,  **confidence** and **lift**. The first is defined as the amount of times an item occurs in the itemset divided by the total number of items in the set; the second is defined as the support of a list of items [x,y,...] divided by the support of x. Lift is a measure describing the correlation between items. Written down mathematically:

$$supp(x)= \frac{len(x)}{len(n)}$$

$$conf(x=>y) = \frac{supp(x,y)}{supp(x)}$$

$$lift(x=>y) = \frac{P(x \cap y)}{P(x) * P(y)}$$



It is important to note that support refers to an item or a list of items, while confidence refers to a rule. Also note that a lift of 1 means that x and y occur independently of each other, while a lift greater 1 means a positive correlation.


**1.** Your task here is to first **convert** the dictionary you created into a list of lists as this is the input required for the algorithm to work. Then, **print out** the most frequent items using the `min_support` attribute. Finally, **print out** the association rules and **play around with the threshold value** to get a reasonable amount of rules. 

**Hint:** Play around with the threshold values until you get a reasonable amount (4-30) rows as output.

**2.** **Discuss your results** and try to answer the following questions: 
- What kind of recommendations can be made?
- What does a confidence of 1.0 mean and is it meaningful for recommending games? 
- Can you spot a correlation between the games with the highest support and the rules with the highest confidence? How does this affect the lift?  

**Hint:** There is a high chance that games such as "Counter-Strike: Global Offensive" appear very often, you should at least have two different games in the antecedents and consequents column to make meaningful conclusions.

In [4]:
gamesofallusers = []

# DONE 1: Convert the gamedict to a list of lists:
for player in users_gamedicts.items():
    a, b = player
    gamesofallusers.append(list(b.keys()))
    
# It should look something like this:
'''
[
    [
    'Path of Exile',
    'Europa Universalis IV',
    'Titan Quest Anniversary Edition',
    'Black Desert Online',
    'Crusader Kings II'
    ],
    [
    'Counter-Strike',
    'Day of Defeat',
    'Deathmatch Classic',
    'Ricochet'
    ]
]
''' 
# Each list within this list represents the games of one user
    
    
# Remove common Steam entries that are not games:
for game in gamesofallusers:
    if 'Dota 2 Test' in game:
        game.remove('Dota 2 Test')
    if 'True Sight' in game:
        game.remove('True Sight')
    if 'True Sight: Episode 1' in game:
        game.remove('True Sight: Episode 1')
    if 'True Sight: Episode 2' in game:
        game.remove('True Sight: Episode 2')
    if 'True Sight: Episode 3' in game:
        game.remove('True Sight: Episode 3')
    if 'True Sight: The Kiev Major Grand Finals' in game:
        game.remove('True Sight: The Kiev Major Grand Finals')
    if 'True Sight: The International 2017' in game:
        game.remove('True Sight: The International 2017')
    if 'True Sight: The International 2018 Finals' in game:
        game.remove('True Sight: The International 2018 Finals')
    if 'Paladins - Public Test' in game:
        game.remove('Paladins - Public Test')

In [5]:
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori

te = TransactionEncoder()
# DONE 2: Tinker around with the values
te_ary = te.fit(gamesofallusers).transform(gamesofallusers)
df = pd.DataFrame(te_ary, columns=te.columns_)
frequent_itemsets = apriori(df, min_support=0.4, use_colnames=True, max_len = 2)

frequent_itemsets

Unnamed: 0,support,itemsets
0,0.431373,(Brawlhalla)
1,0.941176,(Counter-Strike: Global Offensive)
2,0.470588,(Grand Theft Auto V)
3,0.431373,(H1Z1: Test Server)
4,0.490196,(Left 4 Dead 2)
5,0.470588,(PAYDAY 2)
6,0.568627,(PUBG: BATTLEGROUNDS)
7,0.490196,(Paladins)
8,0.470588,(Unturned)
9,0.431373,(Z1 Battle Royale)


In [6]:
from mlxtend.frequent_patterns import association_rules

# DONE 2: Play around with the treshold value
association_rules(frequent_itemsets, metric="confidence", min_threshold=0.4)

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
0,(Counter-Strike: Global Offensive),(Brawlhalla),0.941176,0.431373,0.431373,0.458333,1.0625,0.025375,1.049774
1,(Brawlhalla),(Counter-Strike: Global Offensive),0.431373,0.941176,0.431373,1.0,1.0625,0.025375,inf
2,(Counter-Strike: Global Offensive),(Grand Theft Auto V),0.941176,0.470588,0.470588,0.5,1.0625,0.027682,1.058824
3,(Grand Theft Auto V),(Counter-Strike: Global Offensive),0.470588,0.941176,0.470588,1.0,1.0625,0.027682,inf
4,(Counter-Strike: Global Offensive),(H1Z1: Test Server),0.941176,0.431373,0.431373,0.458333,1.0625,0.025375,1.049774
5,(H1Z1: Test Server),(Counter-Strike: Global Offensive),0.431373,0.941176,0.431373,1.0,1.0625,0.025375,inf
6,(Counter-Strike: Global Offensive),(Left 4 Dead 2),0.941176,0.490196,0.470588,0.5,1.02,0.009227,1.019608
7,(Left 4 Dead 2),(Counter-Strike: Global Offensive),0.490196,0.941176,0.470588,0.96,1.02,0.009227,1.470588
8,(Counter-Strike: Global Offensive),(PAYDAY 2),0.941176,0.470588,0.470588,0.5,1.0625,0.027682,1.058824
9,(PAYDAY 2),(Counter-Strike: Global Offensive),0.470588,0.941176,0.470588,1.0,1.0625,0.027682,inf


When analysing the frequent_itemsets, one can recommend the games that have the biggest support value. These are the games that most of the user friends play or have played it. It is not a very specific or actually usefull recommendation, since some of these games like Brawlhalla, Paladins and Counter Strike are very common because they are free to play, not because people actually like the game.

Taking a look at the association_rules, items that have a confidence of 1.0 are games that when the player plays the first one, he also plays the second one. This confidence number is more meaningful for recomending games in some cases but it is not flawless. For example, players that play Paladins also play Counter Strike, but that doesn't mean much since both games are free to play and we can not actually see how much time the players actually spent playing both games.

The games that have the highest support are offten also the games involved that have the highest confidence. 

In apriori -> changing verbose doesn't change anything, changing max_len changes how many games are in the datasets

In association_rules -> min_treshold changes the minimal confidence that two games must have

## Task 3.3: The Recommender System: Similarity Score


Finally, it is time to build the recommender system. 

**1.** The first thing to do is to **implement a similarity score** that will be used to predict a user's playtime of an unowned game. We implement a similarity score between two users by taking the relative distance between two players. We use the following formula:

$$d(u, v) = \sum_{i~\in~common~games} \frac{|r_{u,i} - r_{v,i}|}{r_{v,i}}$$ 

Where $u$ and $v$ are users and $r_{u,i}$ is the playtime of user $u$ for game $i$. 

You can then return the similarity with  
$$ w_{u,v} = \frac{1}{1 + d(u, v)} $$

**Notes:** 
- If no common games exist return 0.

**a) Implement similarity scores:** Besides the given similarity score, we want to explore how other measurements behave. Hence, we will implement the euclidean distance and cosine similarity. The scores can be selected by setting the respective variable on `True`.

In [57]:
import numpy as np
from numpy import dot
from numpy.linalg import norm
from math import sqrt, dist
    
def calculate_similarity(user1ID, user2ID, given=True, euclidean=False, cosine=False):    
    common_games = []
    user1games = users_gamedicts[user1ID]
    user2games = users_gamedicts.get(user2ID,user1games)
    common_games = list(set(user1games).intersection(user2games))
    differences = []
    
    
    
    user1g = []
    user2g = []
    for game in common_games:          #This part is for creating two lists just               
        user1g.append(user1games[game])#containing the numbers of hours per game
        user2g.append(user2games[game])#of each player
         
    
    if len(common_games) == 0:
        return 0
    
    if euclidean == True:
        return dist(user1g, user2g)
    elif cosine == True:
        return dot(user1g, user2g) / (norm(user1g) * norm(user2g))
    elif given == True:
        d_u_v = 0
        for game in common_games:
            if user2games[game] != 0:
                d_u_v = d_u_v + (abs(user1games[game] - user2games[game])/user2games[game])
        return 1 / (1 + d_u_v)
               
    # DONE: Calculate the similarity score between two friends based on their common games:

## Task 3.4: Recommender System: Predict ratings

With the similarity score calculated, we can now predict a user's playtime for games they don't own.

**1.** First, we **create a set of all games**, but we **delete** all games that are owned by less than 3 players. The reason is simple: If only 1 or 2 players own a game, it is impossible to derive a meaningful prediction since there is not enough data. 

The predicted playtime for a game works analogous to the predicted rating of a movie/item in a conventional collaborative filtering recommender system:

$$r_{u,i} = \frac{\sum_{v \in N_i(u)} w_{u,v}r_{v,i}}{\sum_{v \in N_i(u)} w_{u,v}}$$

where 
- $r_{u,i}$ is the estimated recommendation of item $i$ for target user $u$. 
- $N_i(u)$ is the set of similar users to target user $u$ for the designated item $i$. 
- $w_{u,v}$ is the similarity score between users $u$ and $v$ (used as a weighting factor).  

**Notes:** 
- In our case, we use playtime as a recommendation measure and the set $N_i(u)$ consists of user $u$ friends list and friends of friends list. In our scenario, we do not need the index $i$ as our friends list does not change between games.
- Keep in mind that we have already taken out the games with a playtime of 0. In this case, they are considered "unowned" and not taken into account in this exercise.

In [51]:
# List of all games that are owned by at least 1 person
allGames = []
for user in gamesofallusers:
    for game in user:
        allGames.append(game)

        
# DONE : Create a list of games owned by at least 3 people
allGames3 = []
for game in allGames:
    if allGames.count(game) >= 3 and not game in allGames3:
        allGames3.append(game)

# DONE: Find out which games you do not own out of all games because we are only interested in recommendations for games that we do not own
def difference(allGames, yourGames): 
    # TODO:
    difference = []
    for game in allGames:
        if not game in yourGames and not game in difference:
            difference.append(game)
    return difference

# DONE: Predict ratings based on the formula above for each unowned game
# use 'given', 'euclidean' and 'cosine' to switch between measurements
def predict_ratings(given=True, euclidean=False, cosine=False):
    similarity_scores = {}
    # DONE:
    games_unowned = difference(allGames3, list(users_gamedicts[id].keys()))
    num = 0
    den = 0
    for game in games_unowned:
        for user in list(users_gamedicts.keys()):
            if user != id:
                if game in list(users_gamedicts[user].keys()):
                    #Numerator
                    num = num + calculate_similarity(id, user, given= given, euclidean=euclidean, cosine=cosine) * users_gamedicts[user][game]
                    #Denomirator
                    den = den + calculate_similarity(id, user, given= given, euclidean=euclidean, cosine=cosine)
        if den != 0:
            similarity_scores[game] = num / den        
            num = 0
            den = 0
    return similarity_scores
    '''Hint: Iterate over all unowned games and for each game calculate a rating based
        on your friends playtime and similarity score '''


## Task 3.5: Recommender System: Discussion

**1.** **Sort** the predicted ratings by estimated playtime (highest first) and **print out** the top 8 predictions for you (or the default user if you are using the default ID). 

**2.** **Discuss** the difference in recommendations between the collaborative filtering approach and the association rule approach. Would you consider one more accurate than the other? Why/why not?

**3.** **Discuss** the differences in the similarity scores.

In [56]:
#1.
from operator import itemgetter
sorted_list = sorted(predict_ratings().items(), key = itemgetter(1), reverse = True)
for x in range(0,8):
    print(sorted_list[x]) 

('Arma 2: Operation Arrowhead', 80822.36336601683)
('MIR4', 31035.054520561305)
('Rust', 17594.979605514603)
('Arma 3', 15055.385408696384)
('Clicker Heroes', 14969.071445328178)
('Counter-Strike: Source', 14527.884782212688)
('ASTRONEER', 11688.229445981166)
('Rocket League', 11063.422803251246)


In [10]:
# TODO:  

In [11]:
# TODO: 

2. After analysing both recommendations techniques, I prefer collaborative filtering as a better and more accurate recommendation system. The association rule oftens suggests the combination of games that I already have and often the games that are the most popular ones are the ones that are free to play. They are not necessarily the best recommendation. In the collaborative filtering, the games are based on the playtime of friends and are games that I don't already have, therefore the result is a more intelligent recommendation.
3. The simitarity scores are different and the games recommended in the collaborative filtering don't appear in the association rule.

## References

[1] https://steamcommunity.com/dev/apikey