### Implementation steps
1. Load the cleaned Steam reviews data from HDFS into Spark.
2. Select only the columns needed (app_id, app_name, author_steamid, etc.).
3. Remove outliers from author_playtime_forever.
4. Keep only serious users (long playtime, valid Steam IDs).
5. Map user IDs and game names to integer indices.
6. Train an ALS recommendation model (collaborative filtering).
7. Recommend five games for each player.



Goals:

- Find serious players who played more than one game.
- Filter by high playtime (above average).
- Focus on real, valid users (Steam IDs).
- Map users and games to integer indices (for ALS).
- Recommend 5 games that serious players are likely to enjoy.


In [3]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql.types import *
from pyspark.sql.functions import from_unixtime
from pyspark.ml.recommendation import ALS
from pyspark.ml.feature import StringIndexer, IndexToString
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from scipy import stats
from pyspark.sql.functions import col, sum as _sum

In [4]:
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("SteamReviewsHDFS") \
    .config("spark.hadoop.fs.defaultFS", "hdfs://localhost:9000") \
    .config("spark.driver.memory", "4g") \
    .config("spark.executor.memory", "4g") \
    .config("spark.sql.shuffle.partitions", "100")\
    .config("spark.sql.execution.arrow.pyspark.enabled", "true") \
    .getOrCreate()


Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
25/04/28 20:30:53 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
25/04/28 20:30:53 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.


In [5]:
df = spark.read.parquet("/user/tejashree/project/cleaned_steam_reviews.parquet")

#### Are there pairs of games that are played by the same players? That is, if player A plays game X, is there a good chance they also play game Y? Analyze any patterns.

- Using the `data_rec` DataFrame, we extract the review authors who have reviewed more than one game.
- We use the `author_playtime_forever` column to keep only gamers whose playtime is above average, so that authors who merely collect many briefly played games do not dominate.
- We then consider five games that are common among these reviewers and recommend them to other players in the same category.

Using the `data_rec` DataFrame, we first identify review authors who have reviewed more than one game. To ensure that these are serious players (and not users who just briefly tried a game), we filter on the `author_playtime_forever` column, keeping only those whose playtime is well above the average across all users (the code below uses five times the mean). For these active players, we then analyze the games they have commonly played and recommended. Based on this behavior, we recommend five games that are frequently associated with such serious players to other users who show similar engagement patterns.
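
The question in the heading (does playing game X go along with playing game Y?) can also be looked at directly by counting how many players two games share. The following is only a minimal plain-Python sketch on made-up data, not the notebook's Spark code; the `reviews` list and the `count_game_pairs` helper are hypothetical names introduced for illustration.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy data: (author_steamid, app_name) rows for serious players.
reviews = [
    ("u1", "Rust"), ("u1", "Terraria"), ("u1", "Arma 3"),
    ("u2", "Rust"), ("u2", "Terraria"),
    ("u3", "Rust"), ("u3", "Arma 3"),
    ("u4", "Terraria"),
]


def count_game_pairs(rows):
    """Count how many players reviewed both games of each unordered pair."""
    games_by_user = {}
    for user, game in rows:
        games_by_user.setdefault(user, set()).add(game)

    pair_counts = Counter()
    for games in games_by_user.values():
        # sorted() gives every pair a canonical order, so (X, Y) and (Y, X)
        # are counted as the same pair.
        for pair in combinations(sorted(games), 2):
            pair_counts[pair] += 1
    return pair_counts


pair_counts = count_game_pairs(reviews)
for pair, n_players in pair_counts.most_common():
    print(pair, n_players)
```

Pairs with a high shared-player count are candidates for "players of X also play Y"; on the real data the same idea would need to be expressed as a Spark self-join or aggregation rather than Python loops.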

In [8]:
col_rec = ["app_id", "app_name", "review_id", "language", "author_steamid", "timestamp_created" ,"author_playtime_forever","recommended"]

In [9]:
data_rec = df.select(*col_rec)

In [10]:
def remove_outliers(df, column):
    # Approximate first and third quartiles (relative error 0.01)
    q1, q3 = df.approxQuantile(column, [0.25, 0.75], 0.01)
    iqr = q3 - q1
    # Fences at 2 * IQR (wider than the common 1.5 * IQR rule)
    lower_limit = q1 - 2 * iqr
    upper_limit = q3 + 2 * iqr
    return df.filter((col(column) >= lower_limit) & (col(column) <= upper_limit))

data_rec = remove_outliers(data_rec, "author_playtime_forever")
# Mean playtime after outlier removal
mean_playtime = data_rec.agg(mean("author_playtime_forever").alias("Mean")).collect()[0]["Mean"]

                                                                                

In [11]:
mean_playtime/3600 # average playtime in hours, assuming the column is stored in seconds; if it is stored in minutes, divide by 60 instead

2.1252337349570114
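
To make the outlier rule in `remove_outliers` concrete, here is a small plain-Python sketch of the same fence logic on made-up playtimes. It is an illustration under assumptions: `statistics.quantiles` computes exact quartiles (the notebook uses Spark's approximate `approxQuantile`), and the `playtimes` values and the `iqr_filter` helper are hypothetical.

```python
from statistics import quantiles

# Hypothetical playtimes with one extreme value.
playtimes = [10, 20, 30, 40, 50, 60, 70, 5000]


def iqr_filter(values, k=2):
    """Keep values within k * IQR of the first and third quartiles."""
    q1, _median, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    lower_limit = q1 - k * iqr
    upper_limit = q3 + k * iqr
    return [v for v in values if lower_limit <= v <= upper_limit]


kept = iqr_filter(playtimes)
print(kept)
```

With `k=2` the fences are wider than the common 1.5 × IQR convention, so fewer rows are discarded as outliers.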

The next cell keeps only serious users. It retains reviews whose playtime is at least five times the mean, groups them by Steam ID, and keeps authors with more than one such review. It also requires the ID to be at least 76560000000000000 so that only valid Steam IDs remain, and finally sorts the authors by their number of qualifying games in descending order.

In [13]:
pair_games = data_rec.filter(col("author_playtime_forever")>=5*mean_playtime).groupBy("author_steamid").count()
pair_games = pair_games.filter((pair_games["count"]>1)&(pair_games["author_steamid"]>=76560000000000000)).orderBy(pair_games["count"].desc())
pair_games.show()

                                                                                

+-----------------+-----+
|   author_steamid|count|
+-----------------+-----+
|76561199003745475|   11|
|76561198008966571|    7|
|76561198107639116|    6|
|76561198876426843|    5|
|76561198303461537|    5|
|76561198998112614|    5|
|76561198847533327|    5|
|76561198253481314|    5|
|76561198384612405|    4|
|76561198291222351|    4|
|76561198196563587|    4|
|76561198368118101|    4|
|76561198373495048|    4|
|76561198262809392|    4|
|76561198008376294|    4|
|76561198244556704|    4|
|76561198152295267|    4|
|76561198120553051|    4|
|76561198147205760|    4|
|76561198313573755|    4|
+-----------------+-----+
only showing top 20 rows



In [14]:
pair_games.count()

3415
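
The filter above can be traced by hand on a few rows. This plain-Python sketch mirrors the same three conditions (playtime threshold, more than one qualifying review, minimum Steam ID) on made-up data; the `rows` values and the `serious_multi_game_users` helper are hypothetical, and only the ID threshold is taken from the notebook.

```python
from collections import Counter

MIN_STEAM_ID = 76560000000000000  # threshold used in the notebook's filter


def serious_multi_game_users(rows, mean_playtime, factor=5):
    """rows: iterable of (author_steamid, app_id, playtime) tuples."""
    # Count qualifying reviews per author (playtime >= factor * mean).
    counts = Counter(
        steamid
        for steamid, _app_id, playtime in rows
        if playtime >= factor * mean_playtime
    )
    # Keep authors with more than one qualifying review and a valid ID.
    kept = [(s, n) for s, n in counts.items() if n > 1 and s >= MIN_STEAM_ID]
    return sorted(kept, key=lambda item: item[1], reverse=True)


# Hypothetical toy rows; with mean_playtime=100 the threshold is 500.
rows = [
    (76561198000000001, 10, 600), (76561198000000001, 20, 700),
    (76561198000000001, 30, 900),
    (76561198000000002, 10, 800), (76561198000000002, 20, 550),
    (76561198000000003, 10, 900), (76561198000000003, 20, 100),
    (12345, 10, 900), (12345, 20, 900),  # ID below the validity threshold
]

result = serious_multi_game_users(rows, mean_playtime=100)
print(result)
```

The third author drops out because only one review passes the playtime threshold, and the last one because the ID is below the minimum.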

In [15]:
new_pair_games = data_rec.filter(col("author_playtime_forever")>=5*mean_playtime)
new_pair_games = new_pair_games.filter(new_pair_games["author_steamid"]>=76560000000000000).select("author_steamid","app_id", "app_name","recommended")
new_pair_games.show()

+-----------------+------+--------------------+-----------+
|   author_steamid|app_id|            app_name|recommended|
+-----------------+------+--------------------+-----------+
|76561198106605585|578080|PLAYERUNKNOWN'S B...|      false|
|76561198825325306|578080|PLAYERUNKNOWN'S B...|      false|
|76561198807124221|578080|PLAYERUNKNOWN'S B...|      false|
|76561198364490662|578080|PLAYERUNKNOWN'S B...|       true|
|76561198139947090|578080|PLAYERUNKNOWN'S B...|      false|
|76561198006176001|578080|PLAYERUNKNOWN'S B...|      false|
|76561198833888989|578080|PLAYERUNKNOWN'S B...|       true|
|76561198060538461|578080|PLAYERUNKNOWN'S B...|      false|
|76561198243019194|578080|PLAYERUNKNOWN'S B...|      false|
|76561198382383005|578080|PLAYERUNKNOWN'S B...|       true|
|76561198368000612|578080|PLAYERUNKNOWN'S B...|      false|
|76561198066304531|578080|PLAYERUNKNOWN'S B...|      false|
|76561198423969868|578080|PLAYERUNKNOWN'S B...|      false|
|76561198428351076|578080|PLAYERUNKNOWN'

In [16]:
new_pair_games.count()

311158

In [17]:
# Index author_steamid and app_name as numeric indices (ALS requires numeric IDs)
author_indexer = StringIndexer(inputCol="author_steamid", outputCol="author_index").fit(new_pair_games)
app_indexer = StringIndexer(inputCol="app_name", outputCol="app_index").fit(new_pair_games)
# Binary rating: 5 if the review recommends the game, 1 otherwise
new_pair_games = new_pair_games.withColumn("Rating", when(col("recommended") == True, 5).otherwise(1))

                                                                                

In [18]:
new_pair = author_indexer.transform(app_indexer.transform(new_pair_games))
new_pair.show()

+-----------------+------+--------------------+-----------+------+---------+------------+
|   author_steamid|app_id|            app_name|recommended|Rating|app_index|author_index|
+-----------------+------+--------------------+-----------+------+---------+------------+
|76561198106605585|578080|PLAYERUNKNOWN'S B...|      false|     1|      0.0|     98146.0|
|76561198825325306|578080|PLAYERUNKNOWN'S B...|      false|     1|      0.0|    275158.0|
|76561198807124221|578080|PLAYERUNKNOWN'S B...|      false|     1|      0.0|    269883.0|
|76561198364490662|578080|PLAYERUNKNOWN'S B...|       true|     5|      0.0|    227159.0|
|76561198139947090|578080|PLAYERUNKNOWN'S B...|      false|     1|      0.0|    124272.0|
|76561198006176001|578080|PLAYERUNKNOWN'S B...|      false|     1|      0.0|     20044.0|
|76561198833888989|578080|PLAYERUNKNOWN'S B...|       true|     5|      0.0|    277495.0|
|76561198060538461|578080|PLAYERUNKNOWN'S B...|      false|     1|      0.0|     58972.0|
|765611982

25/04/28 20:31:04 WARN DAGScheduler: Broadcasting large task binary with size 11.7 MiB


In [19]:
games = new_pair.select("app_index","app_name").distinct().orderBy("app_index")
games.count()

                                                                                

937
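
By default `StringIndexer` orders labels by descending frequency, which is why the most-reviewed game receives `app_index` 0.0 in the output above. The sketch below imitates that ordering and the 5/1 rating rule in plain Python on made-up rows; the `frequency_index` and `to_rating` helpers are hypothetical, and the alphabetical tie-break is an assumption of this sketch.

```python
from collections import Counter


def frequency_index(labels):
    """Map each label to a float index, most frequent label first."""
    counts = Counter(labels)
    # Sort by descending count; ties are broken alphabetically (assumption).
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}


def to_rating(recommended):
    """Same rule as the notebook's Rating column: 5 if recommended, else 1."""
    return 5 if recommended else 1


# Hypothetical (app_name, recommended) rows.
toy_rows = [
    ("Rust", True), ("Terraria", True), ("Rust", False),
    ("Arma 3", True), ("Rust", True), ("Terraria", False),
]

app_index = frequency_index(name for name, _ in toy_rows)
ratings = [to_rating(rec) for _, rec in toy_rows]
print(app_index)
print(ratings)
```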

In [20]:
# Create an ALS (Alternating Least Squares) model.
# Note: the roles are swapped here: games (app_index) are passed as ALS "users"
# and authors (author_index) as ALS "items".
als = ALS(maxIter=10, regParam=0.01, userCol="app_index", itemCol="author_index", ratingCol="Rating", coldStartStrategy="drop")

# Fit the model to the data
model = als.fit(new_pair)

# Because of the swap, recommendForAllItems returns the top 5 games for each author
app_recommendations = model.recommendForAllItems(5)

# Display the recommendations
app_recommendations.show(truncate=False)

25/04/28 20:31:12 WARN InstanceBuilder: Failed to load implementation from:dev.ludovic.netlib.blas.JNIBLAS
25/04/28 20:31:12 WARN InstanceBuilder: Failed to load implementation from:dev.ludovic.netlib.lapack.JNILAPACK

+------------+-----------------------------------------------------------------------------------------+
|author_index|recommendations                                                                          |
+------------+-----------------------------------------------------------------------------------------+
|4           |[{143, 7.0673294}, {904, 6.562283}, {139, 6.4849224}, {127, 6.4391174}, {866, 6.286461}] |
|7           |[{37, 7.0302258}, {866, 6.9079695}, {43, 6.4405684}, {174, 6.328453}, {666, 6.2992015}]  |
|8           |[{639, 5.943613}, {401, 5.861968}, {876, 5.8265657}, {277, 5.613238}, {603, 5.570956}]   |
|23          |[{171, 6.799896}, {66, 6.796947}, {159, 6.7675366}, {3, 6.7417817}, {750, 6.7164574}]    |
|31          |[{384, 7.4822216}, {532, 7.1989446}, {105, 7.009874}, {765, 6.9626865}, {835, 6.428036}] |
|34          |[{193, 8.024941}, {291, 7.986845}, {490, 7.761163}, {221, 7.6659217}, {87, 7.471557}]    |
|39          |[{430, 6.0836554}, {325, 5.641981}, {277,

                                                                                

In [21]:
games.show()

+---------+--------------------+
|app_index|            app_name|
+---------+--------------------+
|      0.0|PLAYERUNKNOWN'S B...|
|      1.0|Tom Clancy's Rain...|
|      2.0|  Grand Theft Auto V|
|      3.0|       Rocket League|
|      4.0|                Rust|
|      5.0|            Terraria|
|      6.0|         Garry's Mod|
|      7.0|    Dead by Daylight|
|      8.0|ARK: Survival Evo...|
|      9.0|Monster Hunter: W...|
|     10.0|Euro Truck Simula...|
|     11.0|              Arma 3|
|     12.0|   Hearts of Iron IV|
|     13.0|            PAYDAY 2|
|     14.0|The Elder Scrolls...|
|     15.0|Sid Meier's Civil...|
|     16.0|Europa Universali...|
|     17.0|Total War: WARHAM...|
|     18.0|           Fallout 4|
|     19.0|The Binding of Is...|
+---------+--------------------+
only showing top 20 rows
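
The recommendation rows hold only `app_index` values, so they have to be translated back to names with the `games` mapping. Below is a minimal plain-Python sketch of that lookup. The index-to-name pairs are copied from the table above, while the recommendation row, its scores, and the `decode` helper are hypothetical.

```python
# Slice of the app_index -> app_name mapping shown in the table above.
index_to_name = {
    2: "Grand Theft Auto V",
    3: "Rocket League",
    4: "Rust",
    5: "Terraria",
}

# Hypothetical recommendation row: (app_index, predicted rating) pairs.
row = {"author_index": 30, "recommendations": [(2, 4.92), (5, 4.40), (3, 3.96)]}


def decode(recommendations, names):
    """Replace each app_index with its name, keeping the predicted score."""
    return [
        (names.get(idx, f"<unknown app_index {idx}>"), score)
        for idx, score in recommendations
    ]


decoded = decode(row["recommendations"], index_to_name)
for name, score in decoded:
    print(f"{name}: {score:.2f}")
```

In Spark the same lookup would typically be an `explode` of the `recommendations` array followed by a join against `games`. Note that `games.app_index` is a double (0.0, 1.0, ...) while the recommendation structs hold integers, so a cast may be needed before joining.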



### Save model

In [29]:
# Save your ALS trained model
model.save("/user/tejashree/project/als_game_recommendation_model")



In [31]:
from pyspark.ml.recommendation import ALSModel
model = ALSModel.load("/user/tejashree/project/als_game_recommendation_model")


### Save recommendations

In [37]:
# Top 5 games per author (authors are the ALS "items" in this model)
user_recommendations = model.recommendForAllItems(5)
user_recommendations.write.mode("overwrite").parquet("/user/tejashree/project/user_recommendations.parquet")

                                                                                

### Save the app_index → app_name mapping

In [None]:
games.write.mode("overwrite").parquet("/user/tejashree/project/games_mapping.parquet")


In [41]:
# 1. Read back saved recommendations
recommendations_df = spark.read.parquet("/user/tejashree/project/user_recommendations.parquet")

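# 2. Display 20 rows
# Note: show() prints the table and returns None, so chaining .where() onto it
# raises the AttributeError below; filter first, then show.
recommendations_df.show(20, truncate=False).where(author_index==55)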


+------------+-----------------------------------------------------------------------------------------+
|author_index|recommendations                                                                          |
+------------+-----------------------------------------------------------------------------------------+
|0           |[{183, 8.556689}, {33, 8.345333}, {155, 8.06116}, {828, 7.814534}, {575, 7.7418513}]     |
|9           |[{370, 7.958337}, {159, 7.5476995}, {70, 7.4782033}, {99, 7.337401}, {160, 7.2315965}]   |
|17          |[{835, 6.309818}, {59, 6.2674575}, {83, 6.1001024}, {595, 5.861573}, {22, 5.749616}]     |
|18          |[{704, 7.339795}, {216, 6.922491}, {861, 6.8088093}, {571, 6.5223966}, {105, 6.390658}]  |
|30          |[{2, 4.924485}, {160, 4.398764}, {766, 4.2458377}, {151, 3.957089}, {470, 3.8709142}]    |
|35          |[{327, 6.3794274}, {744, 6.3401575}, {403, 5.994085}, {569, 5.860409}, {909, 5.843494}]  |
|36          |[{539, 6.624758}, {143, 6.2727175}, {301,

AttributeError: 'NoneType' object has no attribute 'where'

In [43]:
from pyspark.sql.functions import col

# First filter, then show
recommendations_df.where(col("author_index") == 55).show(20, truncate=False)


+------------+-----------------------------------------------------------------------------------------+
|author_index|recommendations                                                                          |
+------------+-----------------------------------------------------------------------------------------+
|55          |[{511, 7.1217494}, {495, 6.9848433}, {159, 6.5207806}, {278, 6.4827876}, {299, 6.438356}]|
+------------+-----------------------------------------------------------------------------------------+



In [47]:
from pyspark.ml.evaluation import RegressionEvaluator

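# Step 1: Refit the model on `new_pair` (this replaces the model loaded earlier)
model = als.fit(new_pair)

# Step 2: Predict on the same rows used for training, so the RMSE below
# measures training fit, not performance on unseen data
predictions = model.transform(new_pair)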

# Step 3: Create an evaluator
evaluator = RegressionEvaluator(
    metricName="rmse",
    labelCol="Rating",
    predictionCol="prediction"
)

# Step 4: Calculate RMSE
rmse = evaluator.evaluate(predictions)
print(f"✅ RMSE of your ALS model = {rmse:.4f}")



✅ RMSE of your ALS model = 0.0053
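
The RMSE above is computed on the same rows the model was trained on, so it shows how closely the model reproduces its training ratings rather than how well it predicts unseen ones. A held-out split (for example via `DataFrame.randomSplit`) would give a more realistic estimate. The plain-Python sketch below only illustrates the metric and the train-versus-held-out gap; all numbers and the `rmse` helper are hypothetical.

```python
import math


def rmse(pairs):
    """Root mean squared error over (actual, predicted) pairs."""
    pairs = list(pairs)
    return math.sqrt(sum((a - p) ** 2 for a, p in pairs) / len(pairs))


# Hypothetical predictions: near-perfect on training rows, worse on held-out rows.
train_pairs = [(5, 5.0), (1, 1.0), (5, 4.99)]
heldout_pairs = [(5, 3.0), (1, 2.0)]

train_rmse = rmse(train_pairs)
heldout_rmse = rmse(heldout_pairs)
print(f"train RMSE    = {train_rmse:.4f}")
print(f"held-out RMSE = {heldout_rmse:.4f}")
```

A very small training RMSE combined with a much larger held-out RMSE would suggest overfitting, which is worth checking given the low `regParam` used here.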

