# Data Preprocessing

Two datasets are merged here. The original dataset, [steam-analysis.csv](https://www.kaggle.com/datasets/tamber/steam-video-games/data), contained too few variables for robust item-based collaborative filtering, so the [video_games.csv dataset](https://corgis-edu.github.io/corgis/csv/video_games/) was added. The final dataset contains the following variables:
* Game (name of the game, string)
* UserID (unique user IDs, numeric)
* HoursPlayed (gaming hours by each user, numeric)
* Ratings (on a scale from 1 to 5, categorical)
* Genres (genres of the game, string)
* Sales (total sales of each game, numeric)
* Release Year (release year of the game, numeric).

In [6]:
df_steam <- read.csv("/content/steam-analysis.csv")
df_steam$Action <- NULL
df_steam

UserID,Game,HoursPlayed,Ratings
<int>,<chr>,<dbl>,<int>
151603712,The Elder Scrolls V Skyrim,273.0,2
151603712,Fallout 4,87.0,4
151603712,Spore,14.9,3
151603712,Fallout New Vegas,12.1,1
151603712,Left 4 Dead 2,8.9,4
151603712,HuniePop,8.5,4
151603712,Path of Exile,8.1,4
151603712,Poly Bridge,7.5,5
151603712,Left 4 Dead,3.3,2
151603712,Team Fortress 2,2.8,3


In [7]:
library(dplyr)

Video games data: https://corgis-edu.github.io/corgis/csv/video_games/

The video games dataset is merged with the Steam dataset to add the genre of each video game.

In [8]:
df_vg <- read.csv("/content/video_games.csv")
df_vg$Game <- df_vg$Title
df_vg$Title <- NULL
df_vg

Features.Handheld.,Features.Max.Players,Features.Multiplatform.,Features.Online.,Metadata.Genres,Metadata.Licensed.,Metadata.Publishers,Metadata.Sequel.,Metrics.Review.Score,Metrics.Sales,⋯,Length.Main...Extras.Leisure,Length.Main...Extras.Median,Length.Main...Extras.Polled,Length.Main...Extras.Rushed,Length.Main.Story.Average,Length.Main.Story.Leisure,Length.Main.Story.Median,Length.Main.Story.Polled,Length.Main.Story.Rushed,Game
<chr>,<int>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<int>,<dbl>,⋯,<dbl>,<dbl>,<int>,<dbl>,<dbl>,<dbl>,<dbl>,<int>,<dbl>,<chr>
True,1,True,True,Action,True,Nintendo,True,85,4.69,⋯,29.9666667,25.0000000,16,18.3333333,14.3333333,18.3166667,14.5000000,21,9.7000000,Super Mario 64 DS
True,1,True,True,Strategy,True,Ubisoft,True,89,0.56,⋯,9.8666667,9.7500000,2,9.6166667,10.3333333,11.0833333,10.0000000,3,9.5833333,Lumines: Puzzle Fusion
True,2,True,True,"Action,Racing / Driving,Sports",True,Nintendo,True,81,0.54,⋯,5.6666667,3.3333333,11,2.7833333,1.9166667,2.9333333,1.8333333,30,1.4333333,WarioWare Touched!
True,1,True,True,Sports,True,Sony,True,81,0.49,⋯,0.0000000,0.0000000,0,0.0000000,0.0000000,0.0000000,0.0000000,0,0.0000000,Hot Shots Golf: Open Tee
True,1,True,True,Action,True,Activision,True,61,0.45,⋯,17.3166667,12.5000000,12,10.4833333,8.3500000,11.0833333,8.0000000,23,5.3333333,Spider-Man 2
True,1,True,True,Simulation,True,EA,True,67,0.41,⋯,25.2000000,20.0000000,3,16.4500000,15.5000000,15.7500000,15.5000000,2,15.2500000,The Urbz: Sims in the City
True,1,True,True,Racing / Driving,True,Namco,True,88,0.36,⋯,0.9333333,0.8833333,2,0.8333333,0.6166667,0.7833333,0.5333333,3,0.4500000,Ridge Racer
True,1,True,True,Strategy,True,Konami,True,75,0.34,⋯,27.4833333,25.1000000,6,21.9166667,20.7000000,23.6000000,20.7833333,11,17.8833333,Metal Gear Ac!d
True,1,True,True,Sports,True,EA,True,68,0.25,⋯,0.0000000,0.0000000,0,0.0000000,0.0000000,0.0000000,0.0000000,0,0.0000000,Madden NFL 2005
True,1,True,True,Racing / Driving,True,Nintendo,True,46,0.22,⋯,0.0000000,0.0000000,0,0.0000000,1.1166667,1.2000000,1.0833333,3,1.0500000,Pokémon Dash


In [9]:
merged_df <- merge(df_steam,df_vg,by="Game")

In [10]:
merged_df <- select(merged_df, Game, UserID, HoursPlayed, Ratings, Metadata.Genres, Metrics.Sales, Release.Year)
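
By default, merge() performs an inner join, so games missing from either file are dropped, and a title that appears several times in one file repeats the matching rows from the other. The sketch below illustrates this on toy data. The data frames and their values are made up for illustration, and deduplicating on Game is an assumption about what might be wanted, not a step from the original notebook.

```r
# Toy stand-ins for df_steam and df_vg (hypothetical values).
steam <- data.frame(
  Game = c("Spore", "Fallout 4", "Spore"),
  UserID = c(1L, 1L, 2L),
  Ratings = c(3L, 4L, 5L)
)
vg <- data.frame(
  Game = c("Spore", "Spore", "Ridge Racer"),
  Metadata.Genres = c("Simulation", "Simulation", "Racing / Driving")
)

# Inner join: only games present in both data frames survive,
# and each Steam row is repeated once per matching row in vg.
inner <- merge(steam, vg, by = "Game")
n_inner <- nrow(inner)

# Keeping one row per title before merging avoids the repeated rows.
vg_unique <- vg[!duplicated(vg$Game), ]
inner_unique <- merge(steam, vg_unique, by = "Game")
n_unique <- nrow(inner_unique)

print(n_inner)
print(n_unique)
```

Comparing nrow() before and after a merge is a cheap way to spot unexpected row multiplication.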


# Merged Dataset Item-Based Collaborative Filtering


Merging the data

In [31]:
library(dplyr)
df_steam <- read.csv("/content/steam-analysis.csv")
df_steam$Action <- NULL
df_vg <- read.csv("/content/video_games.csv")
df_vg$Game <- df_vg$Title
df_vg$Title <- NULL
merged_df <- merge(df_steam,df_vg,by="Game")
merged_df <- select(merged_df, Game, UserID, HoursPlayed, Ratings, Metadata.Genres, Release.Year)


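Creating a rating matrix. Note that coercing a data.frame to a realRatingMatrix reads its first three columns as user, item, and rating. Since merged_df starts with Game, UserID, HoursPlayed, the call below treats HoursPlayed as the rating; passing merged_df[, c("UserID", "Game", "Ratings")] would build the matrix from Ratings instead.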

In [32]:
rating_matrix <- as(merged_df, "realRatingMatrix")

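An 80/20 train-test split. For each test user, 5 ratings are given to the recommender and the rest are withheld for evaluation.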

In [33]:
evaluation_scheme <- evaluationScheme(rating_matrix, method="split", train=0.8, given=5)


“Dropping these users from the evaluation since they have fewer rating than specified in given!
These users are 1, 5, 8, 9, 12, 22, 23, 25, 27, 30, 32, 33”
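
To make the role of given concrete, here is a minimal base R sketch of how one test user's ratings could be divided into a known and an unknown part. The ratings and names are hypothetical, and the drop rule follows the wording of the warning above; this is not recommenderlab's internal code.

```r
set.seed(42)  # fixed seed so the split is reproducible
given <- 5

# Hypothetical ratings of one test user (names are games).
user_ratings <- c(A = 4, B = 2, C = 5, D = 3, E = 1, F = 4, G = 2)

# A user with fewer than `given` ratings cannot be split and is dropped.
can_evaluate <- length(user_ratings) >= given

# `given` randomly chosen ratings are shown to the recommender ("known");
# the remaining ones are withheld for scoring ("unknown").
known_idx <- sample(seq_along(user_ratings), given)
known <- user_ratings[known_idx]
unknown <- user_ratings[-known_idx]

print(names(known))
print(names(unknown))
```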


We use item-based collaborative filtering (IBCF).

In [34]:
algorithm <- "IBCF"


The recommender model is fitted on the training data.

In [35]:
recommender_model <- Recommender(getData(evaluation_scheme, "train"), method=algorithm)
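
The idea behind item-based filtering can be sketched in base R: compute a similarity between item columns, then predict a missing rating as a similarity-weighted average of the ratings the user gave to other items. The toy matrix, the helper names cosine_items and predict_item_based, and the use of plain cosine similarity over co-rated users are assumptions for illustration; recommenderlab's IBCF additionally normalizes ratings and keeps only the most similar items.

```r
# Hypothetical user x item ratings (NA = not rated).
ratings <- matrix(
  c(5, 4, NA,
    4, 5, 2,
    1, 2, 5,
    2, 1, 4),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("u", 1:4), c("A", "B", "C"))
)

# Cosine similarity between two item columns, using only users who rated both.
cosine_items <- function(m, i, j) {
  both <- !is.na(m[, i]) & !is.na(m[, j])
  x <- m[both, i]
  y <- m[both, j]
  sum(x * y) / (sqrt(sum(x^2)) * sqrt(sum(y^2)))
}

items <- colnames(ratings)
sim <- outer(seq_along(items), seq_along(items),
             Vectorize(function(i, j) cosine_items(ratings, i, j)))
dimnames(sim) <- list(items, items)

# Predict a user's rating for `item` from the items that user already rated.
predict_item_based <- function(m, sim, user, item) {
  rated <- setdiff(names(which(!is.na(m[user, ]))), item)
  w <- sim[item, rated]
  sum(w * m[user, rated]) / sum(abs(w))
}

pred_ibcf <- predict_item_based(ratings, sim, "u1", "C")
print(pred_ibcf)
```

Because the prediction is a weighted average of the user's own ratings, it always stays within the range of those ratings.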


Ratings are predicted for the test users from the known part of their data.

In [36]:
predictions <- predict(recommender_model, getData(evaluation_scheme, "known"), type="ratings")


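Actual ratings from the unknown part of the data are extracted and converted into a matrix to calculate the RMSE. The value printed below, about 70, is far outside the 1-5 rating scale, which is consistent with the rating matrix having been built from HoursPlayed rather than Ratings.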

In [37]:
# Extract actual ratings from the 'unknown' part of the data
actual_ratings_matrix <- as(getData(evaluation_scheme, "unknown"), "matrix")

# Convert predicted ratings to a matrix
predicted_ratings_matrix <- as(predictions, "matrix")

# Calculate RMSE
rmse_value <- sqrt(mean((actual_ratings_matrix - predicted_ratings_matrix)^2, na.rm = TRUE))
print(rmse_value)


[1] 70.04719
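
The RMSE calculation above relies on NA handling: a cell contributes only when both the actual and the predicted rating are present. A small worked sketch with made-up matrices shows this, together with the fact that ratings on a 1-5 scale cannot produce an RMSE above 4.

```r
# Hypothetical 2 x 2 matrices; NA marks a missing rating or prediction.
actual    <- matrix(c(4, NA, 2, 5), nrow = 2)
predicted <- matrix(c(3, 4, NA, 5), nrow = 2)

# Squared errors are NA wherever either side is missing.
squared_error <- (actual - predicted)^2

# Only the two complete pairs (4 vs 3, 5 vs 5) enter the mean.
n_used <- sum(!is.na(squared_error))
rmse <- sqrt(mean(squared_error, na.rm = TRUE))

# Largest possible error on a 1-5 scale.
max_rmse_1_to_5 <- 5 - 1

print(n_used)
print(rmse)
```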


# Non-personalized Engine

Installing and loading the necessary packages and libraries.

In [None]:
install.packages("recommenderlab")
library(recommenderlab)

We extract the columns "UserID," "Game," and "Ratings," and convert them into a "realRatingMatrix."

In [11]:
# Assuming Ratings is a numeric value
rating_matrix <- as(merged_df[, c("UserID", "Game", "Ratings")], "realRatingMatrix")
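
Conceptually, this conversion turns a long table of (user, item, rating) triples into a user x item matrix in which unrated pairs are missing. A base R sketch with hypothetical data follows; recommenderlab stores the result as a sparse matrix rather than a dense one with NA.

```r
# Hypothetical long-format ratings in the column order user, item, rating.
ratings_long <- data.frame(
  UserID = c(1, 1, 2, 3),
  Game = c("A", "B", "A", "C"),
  Ratings = c(4, 2, 5, 3)
)

users <- sort(unique(ratings_long$UserID))
items <- sort(unique(ratings_long$Game))

# Start with an all-NA matrix, one row per user and one column per game.
wide <- matrix(NA_real_, nrow = length(users), ncol = length(items),
               dimnames = list(as.character(users), items))

# Fill in each observed (user, game) cell with its rating.
cell_index <- cbind(match(ratings_long$UserID, users),
                    match(ratings_long$Game, items))
wide[cell_index] <- ratings_long$Ratings

print(wide)
```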


The following recommendation model uses the "POPULAR" method, which recommends items based on their overall popularity among users.

In [12]:
recomm_model <- Recommender(rating_matrix, method = "POPULAR")


We take the recommendation model and the user-item rating matrix to produce a list of the top 5 recommendations for each user based on popularity.

In [None]:
recommendations <- predict(recomm_model, rating_matrix, n = 5) # Top 5 recommendations
as(recommendations, "list")
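
A popularity-based top-N list can be sketched in a few lines of base R: rank items once for everyone, then remove what each user has already rated. Here popularity is taken to be the number of ratings per game, which is an assumption; recommenderlab's POPULAR method may define popularity differently. The matrix and the helper top_n_for are hypothetical.

```r
# Hypothetical user x item ratings (NA = not rated).
ratings <- matrix(
  c(5, NA, 4, NA,
    4, 3, NA, NA,
    2, 5, 4, 2,
    3, NA, 5, NA),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("u", 1:4), c("A", "B", "C", "D"))
)

# Popularity = number of users who rated each game, most popular first.
popularity <- sort(colSums(!is.na(ratings)), decreasing = TRUE)

# Top-n popular games that the given user has not rated yet.
top_n_for <- function(user, n) {
  seen <- colnames(ratings)[!is.na(ratings[user, ])]
  head(setdiff(names(popularity), seen), n)
}

print(names(popularity))
print(top_n_for("u2", 2))
```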


We use the binarize function to convert the rating_matrix into a binary matrix. With minRating = 3, any rating of 3 or above becomes 1 (a positive interaction) and any rating below 3 becomes 0, as do missing ratings. The resulting binary matrix is useful when the focus is on whether a user liked an item rather than on the specific rating value.

In [14]:
binary_matrix <- binarize(rating_matrix, minRating = 3)
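
The thresholding rule can be reproduced on a plain matrix in base R. The values below are made up; the sketch only mirrors the rule described above and does not use recommenderlab's own data structures.

```r
# Hypothetical user x item ratings (NA = not rated).
ratings <- matrix(
  c(5, NA, 2,
    3, 1, 4),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("u1", "u2"), c("A", "B", "C"))
)
min_rating <- 3

# TRUE only where a rating exists and reaches the threshold;
# missing ratings and ratings below the threshold both become 0.
binary <- !is.na(ratings) & ratings >= min_rating
storage.mode(binary) <- "integer"

print(binary)
```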


We create an evaluation scheme for the non-personalized model and then evaluate its performance. The scheme uses 3-fold cross-validation with given = 2, meaning that 2 items per test user are given to the recommender and the remaining items are withheld for evaluation; users with fewer ratings than that are dropped, as the warning below shows. Averaged over the folds, the top-5 lists reach a recall of about 70% and a precision of about 11%.

In [15]:
evaluation_scheme <- evaluationScheme(binary_matrix, method = "cross-validation", given = 2, k = 3)
eval_result <- evaluate(evaluation_scheme, method = "POPULAR", n = 5)
avg(eval_result)


“Dropping these users from the evaluation since they have fewer rating than specified in given!
These users are 1, 4, 5, 6, 8, 10, 11, 12, 13, 14, 15, 18, 22, 23, 26, 27, 31, 32, 35, 37, 39, 40, 46, 49, 51, 54, 55, 56, 57, 62, 63, 64, 66, 69, 70, 73, 75, 76, 78, 79, 80, 81, 82, 84, 86, 89, 90, 92, 93, 94, 95, 96, 97, 99, 101, 103, 104, 107, 108, 109, 110, 111, 112, 114, 119, 121, 122, 127, 129, 130, 132, 133, 138, 139, 141, 142, 144, 145, 146, 149, 150, 152, 153, 154, 155, 156, 157, 158, 160, 161, 162, 164, 165, 167, 171, 172, 174, 175, 179, 181, 182, 183, 185, 187, 189, 190, 194, 195, 196, 197, 198, 199, 200, 202, 203, 205, 207, 208, 210, 211, 212, 213, 214, 215, 216, 217, 219, 220, 221, 222, 223, 224, 227, 228, 231, 232, 233, 234, 235, 236, 237, 239, 240, 242, 243, 244, 245, 247, 249, 250, 251, 252, 253, 259, 260, 261, 262, 264, 265, 266, 267, 269, 270, 271, 273, 274, 275, 280, 281, 282, 284, 287, 288, 291, 292, 293, 294, 297, 298, 299, 301, 303, 306, 308, 310, 313, 315, 316, 317, 32

POPULAR run fold/sample [model time/prediction time]
	 1  [0.003sec/0.03sec] 
	 2  [0.002sec/0.042sec] 
	 3  [0.002sec/0.018sec] 


TP,FP,FN,TN,N,precision,recall,TPR,FPR,n
0.5388128,4.461187,0.283105,26.71689,32,0.1077626,0.7015551,0.7015551,0.142145,5
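
The precision and recall columns follow the usual confusion-matrix definitions, sketched below. The single-user counts are hypothetical; the last calculation reuses the averaged TP and FP from the table above. Recomputing recall from the averaged counts gives about 0.66 rather than the reported 0.70, presumably because the metrics are computed per user and then averaged.

```r
precision <- function(tp, fp) tp / (tp + fp)
recall <- function(tp, fn) tp / (tp + fn)

# Hypothetical user: a top-5 list with 1 hit, and 2 relevant withheld items.
precision_user <- precision(tp = 1, fp = 4)
recall_user <- recall(tp = 1, fn = 1)

# Averaged TP and FP from the table above. TP + FP is 5 for every user,
# so the result agrees with the reported precision column.
precision_avg <- precision(tp = 0.5388128, fp = 4.461187)

print(precision_user)
print(recall_user)
print(precision_avg)
```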


Data Sources


https://www.kaggle.com/datasets/tamber/steam-video-games/data -> steam-analysis.csv

https://corgis-edu.github.io/corgis/csv/video_games/ -> video_games.csv

Libraries

In [None]:
install.packages("recommenderlab")
library(recommenderlab)


# User-Based Collaborative Filtering

Load Data. The *Action* variable did not yield any fruitful information, as it only contained the value "*play*", so it was removed.

In [46]:
merged_df

Game,UserID,HoursPlayed,Ratings,Metadata.Genres,Release.Year
<chr>,<int>,<dbl>,<int>,<chr>,<int>
Alone in the Dark,189858084,0.4,5,"Action,Adventure,Racing / Driving",2008
Alone in the Dark,189858084,0.4,5,"Action,Adventure,Racing / Driving",2008
Assassin's Creed,76451157,7.3,4,Action,2007
Assassin's Creed,76451157,7.3,4,Action,2007
Assassin's Creed,22371742,10.9,2,Action,2007
Assassin's Creed,22371742,10.9,2,Action,2007
Assassin's Creed,33865373,1.1,2,Action,2007
Assassin's Creed,33865373,1.1,2,Action,2007
Assassin's Creed,37490443,29.0,3,Action,2007
Assassin's Creed,37490443,29.0,3,Action,2007


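Create Rating Matrix: representation of user-item interactions. The data.frame coercion reads the first three columns as user, item, and rating, so with the column order printed above (Game, UserID, HoursPlayed) the matrix is built from HoursPlayed rather than Ratings.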

In [40]:
rating_matrix <- as(merged_df, "realRatingMatrix")

Train-test split of rating matrix

In [41]:
evaluation_scheme <- evaluationScheme(rating_matrix, method="split", train=0.8, given=5)


“Dropping these users from the evaluation since they have fewer rating than specified in given!
These users are 1, 5, 8, 9, 12, 22, 23, 25, 27, 30, 32, 33”


We are using user-based collaborative filtering

In [42]:
algorithm <- "UBCF" # IBCF for item-based


getData(evaluation_scheme, "train") -> Retrieves the training portion of the dataset based on the evaluation_scheme defined earlier.

Recommender(..., method=algorithm) -> Constructs a recommendation model using the specified algorithm.

recommender_model <- ... -> Stores the fitted model so it can be used for predictions or further analysis.

In [43]:
recommender_model <- Recommender(getData(evaluation_scheme, "train"), method=algorithm)
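
User-based filtering swaps the roles of rows and columns: similarities are computed between users, and a missing rating is predicted from what similar users gave the item. The base R sketch below uses a toy matrix, hypothetical helper names, and plain cosine similarity over co-rated items; recommenderlab's UBCF additionally normalizes ratings and restricts the average to the nearest neighbours.

```r
# Hypothetical user x item ratings (NA = not rated).
ratings <- matrix(
  c(5, 4, NA,
    4, 5, 2,
    1, 2, 5,
    2, 1, 4),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("u", 1:4), c("A", "B", "C"))
)

# Cosine similarity between two users, using only items both have rated.
cosine_users <- function(m, a, b) {
  both <- !is.na(m[a, ]) & !is.na(m[b, ])
  x <- m[a, both]
  y <- m[b, both]
  sum(x * y) / (sqrt(sum(x^2)) * sqrt(sum(y^2)))
}

# Predict `user`'s rating for `item` from all other users who rated it.
predict_user_based <- function(m, user, item) {
  neighbours <- setdiff(rownames(m)[!is.na(m[, item])], user)
  w <- sapply(neighbours, function(o) cosine_users(m, user, o))
  sum(w * m[neighbours, item]) / sum(abs(w))
}

pred_ubcf <- predict_user_based(ratings, "u1", "C")
print(pred_ubcf)
```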


Makes predictions using known data

In [44]:
predictions <- predict(recommender_model, getData(evaluation_scheme, "known"), type="ratings")


Calculates RMSE for evaluation. Actual ratings are extracted from the unknown part of the data and converted into a matrix, then the RMSE is calculated. As with the item-based model, the value of about 68 printed below is far outside the 1-5 rating scale.

In [45]:
actual_ratings_matrix <- as(getData(evaluation_scheme, "unknown"), "matrix")

predicted_ratings_matrix <- as(predictions, "matrix")

rmse_value <- sqrt(mean((actual_ratings_matrix - predicted_ratings_matrix)^2, na.rm = TRUE))
print(rmse_value)


[1] 68.03561


# Cold Start Solution:

### Libraries:

In [None]:
# install.packages('tidyr')
# install.packages('dplyr')
# install.packages('caret')
library(tidyr)
library(dplyr)
library(caret)

In [None]:
# Read dataset
df <- read.csv("steam-analysis.csv")
df

In [None]:
top_games <- df %>%
  filter(Action == "play") %>%
  group_by(Game) %>%
  summarise(TotalHours = sum(HoursPlayed), AverageRating = mean(Ratings)) %>%
  arrange(desc(AverageRating), desc(TotalHours)) %>%
  head(10)

Top 10 games ranked by the highest average rating, with ties broken by the highest TotalHours played by users.

In [None]:
print(top_games)

# A tibble: 10 × 3
   Game                                               TotalHours AverageRating
   <chr>                                                   <dbl>         <dbl>
 1 NOBUNAGA'S AMBITION Kakushin with Power Up Kit          267               5
 2 Cultures - Northland                                    194               5
 3 Uncharted Waters Online                                 181               5
 4 liteCam Game 100 FPS Game Capture                        68               5
 5 The Incredible Adventures of Van Helsing Final Cut       58               5
 6 Legends of Eisenwald                                     50.8             5
 7 East India Company                                       50               5
 8 Hotel Giant 2                                            44               5
 9 Tree of Life

***Interpretation:***

This is a solution to the cold start problem for cases where we have no user or content data. It gives us a top 10 list of games that can be recommended to users when they first start playing.
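
One possible refinement of this ranking, not part of the original analysis, is to require a minimum number of ratings per game so that a single 5-star rating cannot place a title at the top. The base R sketch below uses made-up data; the min_ratings threshold and the NumRatings column are assumptions for illustration.

```r
# Hypothetical play records.
plays <- data.frame(
  Game = c("A", "A", "A", "B", "C", "C"),
  HoursPlayed = c(10, 5, 1, 200, 3, 4),
  Ratings = c(4, 5, 4, 5, 3, 5)
)

# Per-game totals, averages, and rating counts (all in the same game order).
total_hours <- tapply(plays$HoursPlayed, plays$Game, sum)
avg_rating <- tapply(plays$Ratings, plays$Game, mean)
n_ratings <- tapply(plays$Ratings, plays$Game, length)

game_stats <- data.frame(
  Game = names(total_hours),
  TotalHours = as.numeric(total_hours),
  AverageRating = as.numeric(avg_rating),
  NumRatings = as.integer(n_ratings)
)

# Keep only games with enough ratings, then rank as in the notebook:
# highest average rating first, ties broken by total hours.
min_ratings <- 2  # assumed threshold
eligible <- game_stats[game_stats$NumRatings >= min_ratings, ]
ranked <- eligible[order(-eligible$AverageRating, -eligible$TotalHours), ]

print(ranked)
```

In this toy data, game B has a perfect average from a single rating and is filtered out, so A and C remain.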