# **"More Like This" Recommendation System for Steam Games**
*ITCS 5154, Evan Youssef*

## **Overview**

This notebook explores using K-Nearest Neighbor (KNN) Content-Based filtering to recommend Steam games. I gather data on the top 1,000 games using the SteamSpy API, collecting each game's AppID, name, publisher, developer, a dictionary of user-applied tags, and more. I encode the tags using several methods (settling on TF-IDF). I use the encoded tags as features to fit a KNN model, measuring similarity with several metrics (settling on Cosine Similarity). Finally, I output a list of the games most similar to the inputted game.

I designed this notebook based on multiple sources, all of which are hyperlinked when relevant.

## **Installations**

In [None]:
# !pip install requests
# !pip install pandas
# !pip install scikit-learn


## **Data Collection**
*Based on ["Scraping Information of All Games From Steam With Python"](https://medium.com/codex/scraping-information-of-all-games-from-steam-with-python-6e44eb01a299) by mmmmmm4*

### **Collecting App IDs**

Here, I access the top 1,000 games on Steam via SteamSpy and store all their associated data excluding tags. The tags need to be collected separately.

In [5]:
import requests
import json

# Request SteamSpy's "all" listing; the JSON response is keyed by AppID.
url = 'https://steamspy.com/api.php?request=all'
response = requests.get(url)
app_data = response.json()
app_ids = list(app_data.keys())

print(len(app_ids))

1000


### **Collecting App Tags**

I use the list of AppIDs to collect each game's associated tags. Then, I add the tags to the app_data dictionary. These tags will be encoded and used as features for the model.

In [6]:
from collections import deque
import time

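def get_app_details(app_ids):
    """Fetch each app's tags from SteamSpy and store them in app_data."""
    # A deque of unique AppIDs acts as the work queue.
    remaining_apps = deque(set(app_ids))

    while remaining_apps:
        app_id = remaining_apps.popleft()

        req = requests.get(f'https://steamspy.com/api.php?request=appdetails&appid={app_id}')

        app_details = req.json()
        # Fall back to an empty dict when the response has no 'tags' entry.
        tags = app_details.get('tags', {})
        app_data[app_id]['tags'] = tags

get_app_details(app_ids)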

app_data

{'570': {'appid': 570,
  'name': 'Dota 2',
  'developer': 'Valve',
  'publisher': 'Valve',
  'score_rank': '',
  'positive': 2013644,
  'negative': 456439,
  'userscore': 0,
  'owners': '200,000,000 .. 500,000,000',
  'average_forever': 42081,
  'average_2weeks': 1409,
  'median_forever': 843,
  'median_2weeks': 805,
  'price': '0',
  'initialprice': '0',
  'discount': '0',
  'ccu': 424747,
  'tags': {'Free to Play': 59991,
   'MOBA': 20195,
   'Multiplayer': 15390,
   'Strategy': 14271,
   'e-sports': 11800,
   'Team-Based': 10976,
   'Competitive': 8306,
   'Action': 7930,
   'Online Co-Op': 7483,
   'PvP': 6063,
   'Difficult': 5364,
   'Co-op': 4328,
   'RTS': 4123,
   'RPG': 3801,
   'Tower Defense': 3791,
   'Fantasy': 3759,
   'Character Customization': 2937,
   'Replay Value': 2763,
   'Action RPG': 2470,
   'Simulation': 1986}},
 '730': {'appid': 730,
  'name': 'Counter-Strike: Global Offensive',
  'developer': 'Valve',
  'publisher': 'Valve',
  'score_rank': '',
  'positive':

## **Data Cleaning & Preparation**

### **Removing Columns**

First, I remove unnecessary columns of data. I only use tags as features for the model, but I keep appid, name, developer, and publisher in order to identify the game.

As seen below, the tags are represented by a dictionary where the key is the name of the tag and the value represents the number of times users applied the tag to the game. For example, the game Dota 2's most popular tag is the 'Free to Play' tag, which has been applied by nearly 60,000 users. 

In [10]:
import pandas as pd

df = pd.DataFrame(app_data.values())
df = df[['appid', 'name', 'developer', 'publisher', 'tags']]

df.head()

Unnamed: 0,appid,name,developer,publisher,tags
0,570,Dota 2,Valve,Valve,"{'Free to Play': 59991, 'MOBA': 20195, 'Multip..."
1,730,Counter-Strike: Global Offensive,Valve,Valve,"{'FPS': 91036, 'Shooter': 65529, 'Multiplayer'..."
2,578080,PUBG: BATTLEGROUNDS,PUBG Corporation,"KRAFTON, Inc.","{'Survival': 14863, 'Shooter': 12754, 'Battle ..."
3,1623730,Palworld,Pocketpair,Pocketpair,"{'Open World': 1408, 'Survival': 1294, 'Multip..."
4,1172470,Apex Legends,Respawn,Electronic Arts,"{'Free to Play': 2201, 'Battle Royale': 1500, ..."
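The tag dictionary described above can be inspected directly with plain Python. The sketch below is only an illustration: `dota_tags` is a hand-copied subset of the Dota 2 tags shown in the SteamSpy output, and the variable names are my own rather than part of the notebook's pipeline.

```python
# A few of Dota 2's tags, copied by hand from the SteamSpy output shown earlier.
dota_tags = {
    'Free to Play': 59991,
    'MOBA': 20195,
    'Multiplayer': 15390,
    'Strategy': 14271,
}

# The most-applied tag is the key with the largest count.
top_tag = max(dota_tags, key=dota_tags.get)

# Sorting by count orders the tags from most to least applied.
ranked_tags = sorted(dota_tags.items(), key=lambda item: item[1], reverse=True)

print(top_tag, dota_tags[top_tag])
print(ranked_tags[:3])
```

Because each value is a count rather than a flag, the encoders in the next section can either discard it (binarization) or use it as a weight (count and TF-IDF vectorization).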


### **Encoding**

To use the tags as features, I need to encode them as numeric values the model can read. Different types of encoding create different results. The three types of encoding I test are Multi-Label Binarization (MLB), Count Vectorization, and TF-IDF Vectorization.

#### *Multi-Label Binarization*

Below, I use MLB to convert tags to 0s and 1s. A value of 0 means the tag was not applied to the game, and a value of 1 means it was. This weights each tag equally per game, meaning that a tag applied 50,000 times and a tag applied once have equal weight in the model.

In [94]:
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

mlb = MultiLabelBinarizer()
X_binary = mlb.fit_transform(df['tags'])
X_binary_df = pd.DataFrame(X_binary, columns=mlb.classes_)

X_binary_df.head()

Unnamed: 0,1980s,1990's,2.5D,2D,2D Fighter,2D Platformer,3D,3D Fighter,3D Platformer,3D Vision,...,Warhammer 40K,Well-Written,Werewolves,Western,Wholesome,World War I,World War II,Wrestling,Zombies,e-sports
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


#### *Count Vectorization*

Next, I use Count Vectorization to encode the tags. Instead of 0s and 1s, Count Vectorization represents each tag by the number of times it has been applied to each game. This adds weight to each tag but disregards proportionality. For example, a game whose top tag has been applied 50,000 times holds more weight than a game whose top tag has been applied 5,000 times. Therefore, the model disregards games with fewer total tags applied, making the recommendations biased toward popular games.

In [96]:
vectorizer = DictVectorizer(sparse=False)
X_counts = vectorizer.fit_transform(df['tags'])
X_counts_df = pd.DataFrame(X_counts, columns=vectorizer.get_feature_names_out())

X_counts_df.head()

Unnamed: 0,1980s,1990's,2.5D,2D,2D Fighter,2D Platformer,3D,3D Fighter,3D Platformer,3D Vision,...,Warhammer 40K,Well-Written,Werewolves,Western,Wholesome,World War I,World War II,Wrestling,Zombies,e-sports
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,11800.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,43602.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,386.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


#### *TF-IDF Vectorization*

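Finally, I use a TF-IDF transformer to create proportional weights. TF-IDF stands for Term Frequency-Inverse Document Frequency: each tag's count for a game (the term frequency) is scaled by a factor that shrinks as more games carry that tag (the inverse document frequency), so tags found on nearly every game count for less than distinctive ones. Scikit-Learn's `TfidfTransformer` also normalizes each game's vector to unit length by default, so a game's tags are weighted relative to one another rather than by the game's total number of tag votes.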

In [98]:
tfidf = TfidfTransformer()
X_tfidf = tfidf.fit_transform(X_counts)
X_tfidf_df = pd.DataFrame(X_tfidf.toarray(), columns=vectorizer.get_feature_names_out())

X_tfidf_df.head()

Unnamed: 0,1980s,1990's,2.5D,2D,2D Fighter,2D Platformer,3D,3D Fighter,3D Platformer,3D Vision,...,Warhammer 40K,Well-Written,Werewolves,Western,Wholesome,World War I,World War II,Wrestling,Zombies,e-sports
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.25975
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.379331
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.103705,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


## **Content-Based KNN Model**

### **Content-Based Filtering**

Below, I use a K-Nearest Neighbor model to calculate the distance between an inputted game's features and every other game's features. I chose to use a KNN Content-Based Filtering model based on the evaluations in ["Evaluation of KNN-NMF Algorithm for Recommendation Systems in E-Commerce" by Hew Sok Yen, et al](https://ieeexplore.ieee.org/document/10862917).

In their article, the authors evaluate the performance of multiple KNN models for recommendation systems. The authors evaluate three types of KNN filtering: Collaborative Filtering (CF), Content-Based Filtering (CBF), and a hybrid approach. CBF only uses the features of the data to calculate distance while CF weights distance based on unique user data. For example, a CBF model might directly compare tags between two games to determine if they are similar while a CF model might use a specific user's rated games and their tags to weight the recommendations.

I chose to use a CBF model for two reasons. Firstly, the data I have access to only includes information from the items themselves, meaning in order to create a CF model, I would need to create fake user data. Secondly, the authors find fairly even performance between CF and CBF models.

### **Distance Calculation**

In their article ["The Research for Recommendation System Based on Improved KNN Algorithm"](https://ieeexplore.ieee.org/document/9213566), authors Bin Li, et al. describe the KNN algorithm and potential improvements. In the code below, I replicate their process using Scikit-Learn. The authors use cosine similarity as their standard measure of distance between feature vectors.

Cosine similarity takes two feature vectors and calculates the cosine of the angle between them. The smaller the angle, the higher the similarity. Other distance metrics include Euclidean and Manhattan distance, which the article doesn't cover but which can easily be tested using Scikit-Learn. In the get_recommendations function, I include a `_metric` parameter that lets me easily test different distance calculations. For the remainder of the notebook, I emulate Bin Li, et al. and use cosine similarity.
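As a rough illustration of why the choice of metric matters, the sketch below compares cosine similarity with Euclidean distance on three made-up tag-count vectors. The vectors and the helper `cosine_similarity` are my own assumptions for this example and are not taken from the article or the dataset.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical tag-count vectors over the tags [Action, RPG, Puzzle].
game_a = np.array([50000.0, 5000.0, 0.0])  # heavily tagged
game_b = np.array([500.0, 50.0, 0.0])      # same proportions, 100x fewer votes
game_c = np.array([0.0, 40.0, 600.0])      # different profile, vote volume close to game_b

sim_ab = cosine_similarity(game_a, game_b)
sim_bc = cosine_similarity(game_b, game_c)

dist_ab = float(np.linalg.norm(game_a - game_b))
dist_bc = float(np.linalg.norm(game_b - game_c))

print(f"cosine similarity  a-b: {sim_ab:.3f}  b-c: {sim_bc:.3f}")
print(f"euclidean distance a-b: {dist_ab:.1f}  b-c: {dist_bc:.1f}")
```

Cosine similarity treats `game_a` and `game_b` as near-identical because they point in the same direction, while Euclidean distance places `game_b` closer to `game_c` simply because their vote totals are similar. Note that Scikit-Learn's `'cosine'` metric is a distance, defined as 1 minus the cosine similarity, so smaller values mean more similar games.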

### **Final Model**

In the final model, I fit the KNN algorithm on the encoded input features. After the algorithm calculates the distances, I store the N most similar games and print them in order of distance.

In [179]:
from sklearn.neighbors import NearestNeighbors

def get_recommendations(input_appid, features, _metric, neighbors):
    # extract features
    df_features = pd.concat([df[['appid']], features], axis=1)

    input_features = df_features[df_features['appid'] == input_appid].drop('appid', axis=1)

    # check for app id
    if input_features.empty:
        raise ValueError(f"No features found for AppID {input_appid}.")

    # configure model
    knn = NearestNeighbors(n_neighbors=neighbors + 1, metric=_metric)
    knn.fit(features)

    # run model
    distances, indices = knn.kneighbors(input_features)
    
    recommended_game_ids = df_features.iloc[indices[0]]['appid'].values
    recommended_game_ids = recommended_game_ids[recommended_game_ids != input_appid][:neighbors]

    # print output
    print('### More Games Like', app_data[str(input_appid)]['name'], "###")
    
    for i in recommended_game_ids:
        print(app_data[str(i)]['name'])
    
    # return ids
    return recommended_game_ids
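The function above asks for `neighbors + 1` results and then filters out the input game. The sketch below, using four made-up feature rows (the `toy_` names are hypothetical), shows the reason: when the query point is part of the fitted data, its nearest neighbor is itself at a distance of zero.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Hypothetical feature rows for four made-up games.
toy_features = np.array([
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])

toy_knn = NearestNeighbors(n_neighbors=2, metric='cosine')
toy_knn.fit(toy_features)

# Query with row 0, which is itself one of the fitted rows.
toy_distances, toy_indices = toy_knn.kneighbors(toy_features[[0]])

print(toy_indices[0])                 # row indices, nearest first
print(np.round(toy_distances[0], 4))  # matching cosine distances
```

The first index returned is the query row itself, so requesting one extra neighbor and dropping the input AppID leaves the intended number of recommendations.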

## **Using the Model**

Replace the arguments with different combinations of parameters.

1. Steam AppID - (try `730`, `620`, `570`, `271590`, `1245620` or other Steam game AppIDs. Note that I only collected data from the top 1,000 Steam games)
2. Encoding model - (`X_binary_df`, `X_counts_df`, `X_tfidf_df`)
3. Metric - (`'euclidean'`, `'cosine'`, etc. Reference [sklearn documentation](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html))
4. Number of Neighbors - (number of recommendations)

In [187]:
recommended_games = get_recommendations(1245620, X_tfidf_df, 'cosine', 12)

### More Games Like ELDEN RING ###
DARK SOULS II: Scholar of the First Sin
DARK SOULS III
DARK SOULS: REMASTERED
Nioh: Complete Edition
DARK SOULS: Prepare To Die Edition
The Surge
Remnant: From the Ashes
Sekiro: Shadows Die Twice - GOTY Edition
CODE VEIN
Darksiders III
Monster Hunter: World
Black Myth: Wukong


## **Evaluation**

To evaluate the model, I compare the first 8 recommendations to Steam's "More Like This" section on associated Steam pages.

I test 6 different games: *Portal 2*, *Counter-Strike: Global Offensive*, *Elden Ring*, *Fallout: New Vegas*, *Civilization VI*, and *Balatro*. I chose these games because they are in different genres, so testing them helps me see how the model performs with different types of tags.

### **Portal 2**

In [188]:
recommended_games = get_recommendations(620, X_tfidf_df, 'cosine', 8)

### More Games Like Portal 2 ###
Portal
Chained Together
Human Fall Flat
LIMBO
Trine Enchanted Edition
BattleBlock Theater
Mirror's Edge
Braid


**Steam's Recommendations:**\
Split Fiction\
It Takes Two\
High On Life\
Carry The Glass\
Ratchet & Clank\
Uncharted: Legacy of Thieves Collection\
Epic Mickey: Rebrushed\
Mirror's Edge

### **Counter-Strike: Global Offensive**

In [189]:
recommended_games = get_recommendations(730, X_tfidf_df, 'cosine', 8)

### More Games Like Counter-Strike: Global Offensive ###
Counter-Strike: Source
Tom Clancy's Rainbow Six Siege
Counter-Strike
Insurgency
Counter-Strike: Condition Zero
Call of Duty: Modern Warfare 2 (2009)
Insurgency: Sandstorm
Battlefield: Bad Company 2


**Steam's Recommendations:**\
Rainbow Six: Siege\
PUBG: Battlegrounds\
Arma Reforger\
Squad\
Marvel Rivals\
The Finals\
Call of Duty\
World of Warships

### **Elden Ring**

In [190]:
recommended_games = get_recommendations(1245620, X_tfidf_df, 'cosine', 8)

### More Games Like ELDEN RING ###
DARK SOULS II: Scholar of the First Sin
DARK SOULS III
DARK SOULS: REMASTERED
Nioh: Complete Edition
DARK SOULS: Prepare To Die Edition
The Surge
Remnant: From the Ashes
Sekiro: Shadows Die Twice - GOTY Edition


**Steam's Recommendations:**\
The First Berserker: Khazan\
Ghost of Tsushima\
Lies of P\
Helldivers II\
Rise of the Ronin\
Diablo IV\
Path of Exile\
Black Myth: Wukong

### **Fallout: New Vegas**

In [191]:
recommended_games = get_recommendations(22380, X_tfidf_df, 'cosine', 8)

### More Games Like Fallout: New Vegas ###
Fallout 3: Game of the Year Edition
Fallout 4
S.T.A.L.K.E.R.: Clear Sky
S.T.A.L.K.E.R.: Call of Pripyat
S.T.A.L.K.E.R.: Shadow of Chernobyl
Far Cry New Dawn
Metro Exodus
The Elder Scrolls IV: Oblivion Game of the Year Edition Deluxe


**Steam's Recommendations:**\
Stalker 2\
Atomfall\
Final Fantasy XIV\
Cyberpunk 2077\
The Elder Scrolls Online\
Forever Skies\
ARK: Survival Evolved\
No Man's Sky

### **Civilization VI**

In [192]:
recommended_games = get_recommendations(289070, X_tfidf_df, 'cosine', 8)

### More Games Like Sid Meier’s Civilization VI ###
Sid Meier's Civilization III Complete
Sid Meier's Civilization V
Age of Wonders III
Total War: THREE KINGDOMS
Total War: NAPOLEON – Definitive Edition
ENDLESS Legend
Total War: ROME II - Emperor Edition
Total War: EMPIRE – Definitive Edition


**Steam's Recommendations:**\
Old World\
Crusader Kings III\
Total War: Warhammer II\
Hearts of Iron IV\
Total War: Warhammer III\
Anno 1800\
Factorio\
Age of Empires IV

### **Balatro**

In [193]:
recommended_games = get_recommendations(2379780, X_tfidf_df, 'cosine', 8)

### More Games Like Balatro ###
Slay the Spire
Inscryption
ROUNDS
Artifact
Loop Hero
Shadowverse CCG
GWENT: The Witcher Card Game
Into the Breach


**Steam's Recommendations:**\
Slay the Spire\
Dungeons & Degenerate Gamblers\
Lonestar\
Die in the Dungeon\
Inscryption\
Across the Obelisk\
Dungeon Crawler\
Fights in Tight Spaces
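One rough way to put a number on the side-by-side comparisons above is to count how many titles appear in both lists. The sketch below does this for the Balatro lists, which are copied from the output above. The `overlap` helper is my own illustration, and it measures agreement with Steam rather than recommendation quality.

```python
def overlap(model_recs, steam_recs):
    """Return the titles found in both lists, compared case-insensitively."""
    steam_titles = {title.casefold() for title in steam_recs}
    return [title for title in model_recs if title.casefold() in steam_titles]

# Lists copied from the Balatro comparison above.
model_balatro = [
    'Slay the Spire', 'Inscryption', 'ROUNDS', 'Artifact',
    'Loop Hero', 'Shadowverse CCG', 'GWENT: The Witcher Card Game',
    'Into the Breach',
]
steam_balatro = [
    'Slay the Spire', 'Dungeons & Degenerate Gamblers', 'Lonestar',
    'Die in the Dungeon', 'Inscryption', 'Across the Obelisk',
    'Dungeon Crawler', 'Fights in Tight Spaces',
]

shared = overlap(model_balatro, steam_balatro)
print(shared, len(shared) / len(model_balatro))
```

Exact title matching is fragile: "Tom Clancy's Rainbow Six Siege" and "Rainbow Six: Siege" would not match, for example. Comparing AppIDs instead of names would be more robust, assuming Steam's recommended AppIDs were collected as well.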

## **Conclusion**

I noticed two main differences between Steam's and my model's recommendations: accuracy and variety.

I would argue that my model is more accurate in selecting similar games. While this can't be measured scientifically, many of the games that Steam recommends are in entirely different genres from the target game. However, Steam's algorithm almost always provided a wider variety of franchises and genres, whereas my model frequently recommended games within the same franchise.

These differences reveal Valve's (developer of Steam) objective when recommending games. While I built a model to find the most similar games, Steam's model seems to prioritize giving the user more options. This makes sense when considering the context of the user experience. For example, if a user is viewing the Steam page for Civilization VI, they already know that there are at least 5 other games within the franchise and don't need an algorithm to tell them that. However, an algorithm that gives them more variety can lead them to a game that they never would've heard of otherwise. By expanding the user's horizons, Valve is simultaneously opening up new opportunities for the user to purchase products.

By comparing and contrasting the results of my model with the Steam model, I now have a better understanding of how the Steam recommendation algorithm might work. More importantly, I have a better understanding of how the engineers at Valve design an algorithm around the actual goal of a model, not the perceived goal of a model.