
## MovieLens Recommender with ALS (Spark MLlib): Ratings Only

**Dataset:** MovieLens **small** (~100k ratings)  
**Goal:** Build a collaborative filtering recommender using **ALS** from `pyspark.ml.recommendation`.  
**Libraries:** `pyspark.ml.recommendation`, `pyspark.sql`


## 1) Environment Setup

In [23]:

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("MovieLens-ALS")
         .getOrCreate())


## 2) Load Ratings Data

In [24]:
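# No schema is given, so every column is read as a string;
# the casts in step 6 convert them to numeric types.
df = (spark.read
      .option("header", "true")
      .csv("dataset/ratings.csv"))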

df.show()

+------+-------+------+---------+
|userId|movieId|rating|timestamp|
+------+-------+------+---------+
|     1|      1|   4.0|964982703|
|     1|      3|   4.0|964981247|
|     1|      6|   4.0|964982224|
|     1|     47|   5.0|964983815|
|     1|     50|   5.0|964982931|
|     1|     70|   3.0|964982400|
|     1|    101|   5.0|964980868|
|     1|    110|   4.0|964982176|
|     1|    151|   5.0|964984041|
|     1|    157|   5.0|964984100|
|     1|    163|   5.0|964983650|
|     1|    216|   5.0|964981208|
|     1|    223|   3.0|964980985|
|     1|    231|   5.0|964981179|
|     1|    235|   4.0|964980908|
|     1|    260|   5.0|964981680|
|     1|    296|   3.0|964982967|
|     1|    316|   3.0|964982310|
|     1|    333|   5.0|964981179|
|     1|    349|   4.0|964982563|
+------+-------+------+---------+
only showing top 20 rows


## 3) Select the userId, movieId, and rating Columns

In [25]:
ratings = df.select("userId", "movieId", "rating")
ratings.show(10, truncate=False)

+------+-------+------+
|userId|movieId|rating|
+------+-------+------+
|1     |1      |4.0   |
|1     |3      |4.0   |
|1     |6      |4.0   |
|1     |47     |5.0   |
|1     |50     |5.0   |
|1     |70     |3.0   |
|1     |101    |5.0   |
|1     |110    |4.0   |
|1     |151    |5.0   |
|1     |157    |5.0   |
+------+-------+------+
only showing top 10 rows


## 4) Split Data

In [26]:
train, test = ratings.randomSplit([0.8, 0.2], seed=42)
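
`randomSplit` assigns rows by random sampling rather than by cutting at an exact row count, so the 80/20 weights are targets and the resulting sizes are only approximate. The sketch below imitates that idea in plain Python on made-up row ids. `bernoulli_split` is a hypothetical helper, not a Spark API, and Spark's own sampler will not reproduce this exact assignment for the same seed.

```python
import random

def bernoulli_split(rows, train_weight=0.8, seed=42):
    """Assign each row to train or test with an independent random draw."""
    rng = random.Random(seed)  # a fixed seed makes the split repeatable
    train, test = [], []
    for row in rows:
        # Each row lands in train with probability train_weight.
        (train if rng.random() < train_weight else test).append(row)
    return train, test

rows = list(range(1000))  # stand-in for 1000 rating rows
train_rows, test_rows = bernoulli_split(rows)
print(len(train_rows), len(test_rows))  # close to 800/200, rarely exact
```

In the notebook itself, `train.count()` and `test.count()` show the actual sizes of the two splits.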

## 5) Build Model

In [27]:
from pyspark.ml.recommendation import ALS
als = ALS(
    maxIter=10,
    regParam=0.1,
    rank=10,
    userCol="userId",
    itemCol="movieId",
    ratingCol="rating",
    coldStartStrategy="drop"  # drop rows with NaN predictions (users or movies unseen in training)
)
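
To make the `rank`, `regParam`, and `maxIter` settings concrete, here is a toy rank-1 version of alternating least squares in plain Python. It is a simplified sketch, not Spark's implementation: the ratings are invented, the `loss` and `update` helpers are hypothetical, and details such as how Spark scales the regularization or distributes the solve are ignored.

```python
# Toy ratings: (user, item) -> rating. Invented data for illustration only.
ratings = {
    (0, 0): 5.0, (0, 1): 4.0,
    (1, 0): 4.0, (1, 2): 2.0,
    (2, 1): 5.0, (2, 2): 3.0,
}
n_users, n_items = 3, 3
reg = 0.1       # plays the role of regParam
max_iter = 10   # plays the role of maxIter

def loss(u, v):
    """Regularized squared error over the observed ratings."""
    err = sum((r - u[i] * v[j]) ** 2 for (i, j), r in ratings.items())
    return err + reg * (sum(x * x for x in u) + sum(x * x for x in v))

def update(fixed, axis, size):
    """Closed-form ridge solution for one side while the other is held fixed.

    axis=0 updates user factors, axis=1 updates item factors.
    """
    out = []
    for k in range(size):
        num = den = 0.0
        for key, r in ratings.items():
            if key[axis] == k:
                f = fixed[key[1 - axis]]  # factor of the opposite side
                num += r * f
                den += f * f
        out.append(num / (den + reg))
    return out

u = [1.0] * n_users  # one factor per user (rank = 1)
v = [1.0] * n_items  # one factor per item
history = [loss(u, v)]
for _ in range(max_iter):
    u = update(v, 0, n_users)  # hold items fixed, solve for users
    v = update(u, 1, n_items)  # hold users fixed, solve for items
    history.append(loss(u, v))

print([round(x, 3) for x in history])
```

Because each half-step exactly minimizes the loss over one set of factors, the printed loss never increases from one iteration to the next. A higher `rank` replaces each scalar factor with a vector and each division with a small linear solve.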

## 6) Fit the Model on the Training Data

In [28]:
from pyspark.sql.functions import col

# The CSV was read without a schema, so all columns are strings.
# ALS expects numeric id and rating columns, so cast all three.
train = train.withColumn("userId", col("userId").cast("integer"))
train = train.withColumn("movieId", col("movieId").cast("integer"))
train = train.withColumn("rating", col("rating").cast("float"))


# Apply the same casts to the test set so its schema matches the training set
test = test.withColumn("userId", col("userId").cast("integer"))
test = test.withColumn("movieId", col("movieId").cast("integer"))
test = test.withColumn("rating", col("rating").cast("float"))

# Fit ALS on the training data
model = als.fit(train)


## 7) Evaluate Model 

In [29]:
predictions = model.transform(test)

In [30]:
from pyspark.ml.evaluation import RegressionEvaluator
evaluator = RegressionEvaluator(
    metricName="rmse",
    labelCol="rating",
    predictionCol="prediction"
)
rmse = evaluator.evaluate(predictions)
print(f"Root-mean-square error = {rmse:.3f}")

Root-mean-square error = 0.877
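
RMSE is the square root of the mean squared difference between actual and predicted ratings, so a value of 0.877 means predictions miss by a little under one star in the root-mean-square sense. The plain-Python sketch below uses invented numbers to show the calculation and why `coldStartStrategy="drop"` matters: a single NaN prediction would turn the whole metric into NaN.

```python
import math

def rmse(pairs):
    """Root-mean-square error over (actual, predicted) pairs."""
    return math.sqrt(sum((a - p) ** 2 for a, p in pairs) / len(pairs))

# Invented (actual, predicted) pairs; the last one mimics a cold-start user.
pairs = [(4.0, 3.5), (5.0, 4.0), (3.0, 3.5), (2.0, float("nan"))]

print(rmse(pairs))  # nan: one missing prediction poisons the metric

# Filtering out NaN rows first, which is what "drop" does, gives a usable score.
kept = [(a, p) for a, p in pairs if not math.isnan(p)]
print(round(rmse(kept), 3))  # 0.707
```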


## 8) Generate Recommendations

- User Recommendations with Movie Titles

In [36]:
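from pyspark.sql.functions import explode

# Movie metadata (movieId, title, genres) for readable output
movies = spark.read.csv("dataset/movies.csv", header=True, inferSchema=True)

# Top 10 recommended movies per user
user_recs = model.recommendForAllUsers(10)
# Explode the recommendations array into one row per (user, movie) pair;
# "rating" here is the predicted rating, not an observed one
user_recs_exploded = (user_recs
    .withColumn("rec", explode("recommendations"))
    .select("userId", "rec.movieId", "rec.rating"))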

# Join with movies to get title and genres
user_recs_with_movies = user_recs_exploded.join(movies, on="movieId")

print("Sample user recommendations (with movie titles):")
user_recs_with_movies.show(10, truncate=False)

Sample user recommendations (with movie titles):
+-------+------+---------+------------------------------------------------------------------------------------+----------------------+
|movieId|userId|rating   |title                                                                               |genres                |
+-------+------+---------+------------------------------------------------------------------------------------+----------------------+
|27611  |1     |5.56343  |Battlestar Galactica (2003)                                                         |Drama|Sci-Fi|War      |
|6666   |1     |5.539753 |Discreet Charm of the Bourgeoisie, The (Charme discret de la bourgeoisie, Le) (1972)|Comedy|Drama|Fantasy  |
|3379   |1     |5.506819 |On the Beach (1959)                                                                 |Drama                 |
|3606   |1     |5.4826846|On the Town (1949)                                                                  |Comedy|Musical|Romance|
|13233
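
Some predicted ratings above exceed 5 because an ALS score is the dot product of a learned user vector and a learned item vector, and nothing clips it to the 0.5 to 5 rating scale. The sketch below shows the top-N idea in plain Python with invented rank-2 factors. `top_n` is a hypothetical helper, and Spark's distributed implementation is more elaborate than this loop.

```python
# Invented rank-2 factor vectors, keyed by user id and movie id.
user_factors = {1: [1.2, 0.4], 2: [0.3, 1.1]}
item_factors = {10: [4.0, 1.5], 20: [1.0, 3.5], 30: [2.0, 2.0]}

def top_n(user_vec, items, n):
    """Score every item by dot product and return the n highest-scoring ones."""
    scores = [
        (item_id, sum(a * b for a, b in zip(user_vec, vec)))
        for item_id, vec in items.items()
    ]
    return sorted(scores, key=lambda s: s[1], reverse=True)[:n]

for user_id, vec in user_factors.items():
    print(user_id, top_n(vec, item_factors, 2))
```

For user 1 the best score comes out above 5, mirroring the out-of-range values in the notebook output. If scores need to stay on the rating scale for display, they can be clipped after the fact; the ranking itself is unaffected.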

- Movie Recommendations with User IDs

In [39]:
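# Top 2 users predicted to rate each movie highest
movie_recs = model.recommendForAllItems(2)

# Explode the recommendations array into one row per (movie, user) pair
movie_recs_exploded = (movie_recs
    .withColumn("rec", explode("recommendations"))
    .select("movieId", "rec.userId", "rec.rating"))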

# Join with movies for readability
movie_recs_with_movies = movie_recs_exploded.join(movies, on="movieId")

print("Sample movie recommendations (with userIds):")
movie_recs_with_movies.show(10, truncate=False)

Sample movie recommendations (with userIds):
+-------+------+---------+----------------------------------+----------------------------+
|movieId|userId|rating   |title                             |genres                      |
+-------+------+---------+----------------------------------+----------------------------+
|6      |53    |5.0836234|Heat (1995)                       |Action|Crime|Thriller       |
|6      |99    |4.807187 |Heat (1995)                       |Action|Crime|Thriller       |
|9      |43    |4.244302 |Sudden Death (1995)               |Action                      |
|9      |498   |4.1505384|Sudden Death (1995)               |Action                      |
|12     |543   |4.473932 |Dracula: Dead and Loving It (1995)|Comedy|Horror               |
|12     |544   |4.0560737|Dracula: Dead and Loving It (1995)|Comedy|Horror               |
|13     |53    |3.8215892|Balto (1995)                      |Adventure|Animation|Children|
|13     |267   |3.7578259|Balto (1995)       

## 9) Stop Spark

In [40]:
spark.stop()