<a href="https://colab.research.google.com/github/rashboldb/clearml/blob/master/UCU_2025_CL_teacher.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Collaborative Filtering

Prepared for the UCU 2025 Data Mining course.

(approx. **30 min** needed to complete)  

Credits: based on the Stanford CS246 course.

In [None]:
from IPython.display import Image
Image(url='https://media.giphy.com/media/v1.Y2lkPTc5MGI3NjExbWZ0aDlyeTNtZ21xeTc5bHBnYXdmZ2VlbWR2amFlZno2aTdsODd4YSZlcD12MV9naWZzX3NlYXJjaCZjdD1n/xT9IgG50Fb7Mi0prBC/giphy.gif',width=500)

### Setup

Let's set up Spark in your Colab environment. Run the cell below!

In [None]:
!pip install pyspark
!pip install -U -q PyDrive2
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-17-openjdk-amd64"



Download the MovieLens 100K dataset.

In [None]:
# Download MovieLens 100K dataset
!wget -q https://files.grouplens.org/datasets/movielens/ml-100k.zip
!unzip -q ml-100k.zip

If you executed the cells above, you should be able to see the dataset we will use for this Colab under the "Files" tab on the left panel.

Next, we import some of the common libraries needed for our task.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

import pyspark
from pyspark.sql import *
from pyspark.sql.types import *
from pyspark.sql.functions import *
from pyspark import SparkContext, SparkConf
from pyspark.sql.functions import col

Let's initialize the Spark context.

In [None]:
# configure Spark (UI port)
conf = SparkConf().set("spark.ui.port", "4050")

# create (or reuse) the context, then the session on top of it;
# getOrCreate keeps this cell safe to re-run
sc = SparkContext.getOrCreate(conf=conf)
spark = SparkSession.builder.getOrCreate()

### Data Loading

In this Colab, we will be using the [MovieLens dataset](https://grouplens.org/datasets/movielens/), specifically the 100K dataset (which contains 100,000 ratings in total, from 943 users on 1,682 movies).

We load the ratings data in an 80%-20% ```training```/```test``` split, while the ```items``` dataframe contains the movie titles associated with the item identifiers.

In [None]:
# Define schemas
schema_ratings = StructType([
    StructField("user_id", IntegerType(), False),
    StructField("item_id", IntegerType(), False),
    StructField("rating", IntegerType(), False),
    StructField("timestamp", IntegerType(), False)
])

schema_items = StructType([
    StructField("item_id", IntegerType(), False),
    StructField("movie", StringType(), False)
])

# Load MovieLens data into Spark DataFrames
training = spark.read.option("sep", "\t").csv("ml-100k/u1.base", header=False, schema=schema_ratings)
test = spark.read.option("sep", "\t").csv("ml-100k/u1.test", header=False, schema=schema_ratings)

# Load the pipe-separated u.item file (it is ISO-8859-1 encoded, not UTF-8)
items_full = spark.read.option("sep", "|").option("encoding", "ISO-8859-1").csv("ml-100k/u.item", header=False)

# Keep only the first two columns, casting the id to int
items = items_full.select(
    col("_c0").cast("int").alias("item_id"),
    col("_c1").alias("movie")
)


In [None]:
training.printSchema()
test.printSchema()

root
 |-- user_id: integer (nullable = true)
 |-- item_id: integer (nullable = true)
 |-- rating: integer (nullable = true)
 |-- timestamp: integer (nullable = true)

root
 |-- user_id: integer (nullable = true)
 |-- item_id: integer (nullable = true)
 |-- rating: integer (nullable = true)
 |-- timestamp: integer (nullable = true)



In [None]:
training.show(5)
test.show(5)

+-------+-------+------+---------+
|user_id|item_id|rating|timestamp|
+-------+-------+------+---------+
|      1|      1|     5|874965758|
|      1|      2|     3|876893171|
|      1|      3|     4|878542960|
|      1|      4|     3|876893119|
|      1|      5|     3|889751712|
+-------+-------+------+---------+
only showing top 5 rows

+-------+-------+------+---------+
|user_id|item_id|rating|timestamp|
+-------+-------+------+---------+
|      1|      6|     5|887431973|
|      1|     10|     3|875693118|
|      1|     12|     5|878542960|
|      1|     14|     5|874965706|
|      1|     17|     3|875073198|
+-------+-------+------+---------+
only showing top 5 rows



In [None]:
items.printSchema()

root
 |-- item_id: integer (nullable = true)
 |-- movie: string (nullable = true)



### Your task

Let's compute some stats! How many ratings are in the training and test datasets? How many movies are in our dataset?

Expected output:


```
Number of ratings in training set: 80000
Number of ratings in test set: 20000
Number of movies: 1682
```



In [None]:
''' 3 lines of code in total expected.
For sub-parts of the question (if any), creating different cells of code would be recommended.'''
# YOUR CODE HERE (3 lines of code)
print(f"Number of ratings in training set: {training.count()}")
print(f"Number of ratings in test set: {test.count()}")
print(f"Number of movies: {items.count()}")

Number of ratings in training set: 80000
Number of ratings in test set: 20000
Number of movies: 1682


Using the training set, train a model with the Alternating Least Squares (ALS) method available in Spark MLlib: [https://spark.apache.org/docs/latest/ml-collaborative-filtering.html](https://spark.apache.org/docs/latest/ml-collaborative-filtering.html)

Parameters to use:

```
maxIter=5, regParam=0.01, coldStartStrategy="drop", seed=42
```

Please explore the other parameters. What else do you need?
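
To build intuition for what ALS does with `maxIter` and `regParam`, here is a minimal pure-Python sketch, assuming rank-1 factors (one number per user and per item) and a tiny made-up rating table. The names `user_f`, `item_f` and `loss` are ours, and Spark's implementation differs in its details (for example, higher rank and its own scaling of the regularization term), so treat this as an illustration of the alternating idea only.

```python
# Toy ratings keyed by (user, item); illustrative values, not MovieLens data.
ratings = {
    (0, 0): 5.0, (0, 1): 3.0,
    (1, 0): 4.0, (1, 2): 1.0,
    (2, 1): 2.0, (2, 2): 1.0,
}
n_users, n_items = 3, 3
reg = 0.01     # plays the role of regParam
max_iter = 5   # plays the role of maxIter

# Rank-1 factors: one number per user and one per item.
user_f = [1.0] * n_users
item_f = [1.0] * n_items


def loss():
    """Squared error on observed ratings plus the L2 penalty."""
    err = sum((r - user_f[u] * item_f[i]) ** 2 for (u, i), r in ratings.items())
    penalty = reg * (sum(x * x for x in user_f) + sum(x * x for x in item_f))
    return err + penalty


losses = [loss()]
for _ in range(max_iter):
    # Step 1: hold item factors fixed, solve each user's ridge problem exactly.
    for u in range(n_users):
        rated = [(i, r) for (uu, i), r in ratings.items() if uu == u]
        num = sum(r * item_f[i] for i, r in rated)
        den = reg + sum(item_f[i] ** 2 for i, r in rated)
        user_f[u] = num / den
    # Step 2: hold user factors fixed, solve each item's ridge problem exactly.
    for i in range(n_items):
        rated = [(u, r) for (u, ii), r in ratings.items() if ii == i]
        num = sum(r * user_f[u] for u, r in rated)
        den = reg + sum(user_f[u] ** 2 for u, r in rated)
        item_f[i] = num / den
    losses.append(loss())

print([round(x, 3) for x in losses])
```

Each half-step is an exact least-squares solve with the other side held fixed, so the regularized loss never increases from one iteration to the next.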



In [None]:
from pyspark.ml.recommendation import ALS
# YOUR CODE HERE (2 lines of code)
als = ALS(maxIter=5, regParam=0.01, userCol="user_id", itemCol="item_id", ratingCol="rating",
          coldStartStrategy="drop", seed=42)
model = als.fit(training)

Now compute the RMSE on the test dataset.

Expected output:


```
Root-mean-square error = 1.109848223762331
```
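
As a reminder of what the metric measures, RMSE is the square root of the mean squared difference between the true rating and the prediction. The sketch below computes it by hand on a few made-up `(rating, prediction)` pairs. It also mimics `coldStartStrategy="drop"`, under which rows whose prediction is NaN (users or items unseen during training) are removed before evaluation. The data and variable names here are hypothetical.

```python
import math

# Hypothetical (rating, prediction) pairs; NaN stands for a cold-start row.
pairs = [(5, 4.0), (3, 3.5), (4, float("nan")), (1, 2.0)]

# Mirror coldStartStrategy="drop": discard rows without a usable prediction.
kept = [(r, p) for r, p in pairs if not math.isnan(p)]

# RMSE = sqrt(mean((rating - prediction)^2)) over the remaining rows.
rmse = math.sqrt(sum((r - p) ** 2 for r, p in kept) / len(kept))
print(f"Root-mean-square error = {rmse}")
```

Without the drop step, a single NaN prediction would turn the whole metric into NaN.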




In [None]:
from pyspark.ml.evaluation import RegressionEvaluator
# YOUR CODE HERE (4-5 lines of code)
predictions = model.transform(test)
evaluator = RegressionEvaluator(metricName="rmse", labelCol="rating",
                                predictionCol="prediction")
rmse = evaluator.evaluate(predictions)
print(f"Root-mean-square error = {rmse}")

Root-mean-square error = 1.109848223762331


At this stage, you will use your trained model to generate personalized movie recommendations.

**Your task:**

1. Generate top-3 movie recommendations for each user based on the trained model.

2. Map the movie IDs to their corresponding movie titles to present the recommendations in a readable format.

The desired schema for the output dataframe is


```
+-------+-------+----------+--------------------+
|item_id|user_id|prediction|               movie|
+-------+-------+----------+--------------------+
```
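
The two steps can be pictured in plain Python before writing the Spark version: pick the N highest-scoring items per user, then look each item ID up in a title table. This is only a sketch on invented scores and invented titles. In Spark the same shape is typically obtained by flattening the per-user recommendation array and joining with `items`.

```python
import heapq

# Hypothetical predicted scores: {user_id: {item_id: score}}.
scores = {
    1: {10: 4.2, 11: 3.1, 12: 4.9, 13: 2.0},
    2: {10: 1.5, 12: 3.3, 13: 4.4, 11: 3.9},
}
# Made-up titles standing in for the items dataframe.
titles = {10: "Movie A", 11: "Movie B", 12: "Movie C", 13: "Movie D"}
N = 3

rows = []
for user_id, item_scores in scores.items():
    # Step 1: keep the N best-scored items for this user.
    top = heapq.nlargest(N, item_scores.items(), key=lambda kv: kv[1])
    # Step 2: attach the title, matching the target column order.
    for item_id, prediction in top:
        rows.append((item_id, user_id, prediction, titles[item_id]))

for row in rows:
    print(row)
```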



In [None]:
N = 3

# YOUR CODE HERE
# Generate top-N movie recommendations for each user
userRecs = model.recommendForAllUsers(N)
userRecs.show(10)

# Explode the recommendations array to get one row per (user, item) pair
userRecs_exploded = userRecs.withColumn("recommendation", explode("recommendations"))
userRecs_items = userRecs_exploded.select(
    "user_id",
    col("recommendation.item_id").alias("item_id"),
    col("recommendation.rating").alias("prediction"),
)

# Display the flattened recommendations (optional, for checking)
userRecs_items.show(10)

# Join with items DataFrame to get movie titles
userRecs_with_titles = userRecs_items.join(items, "item_id")

# Display recommendations with titles (optional, for checking)
userRecs_with_titles.show(15)

+-------+--------------------+
|user_id|     recommendations|
+-------+--------------------+
|      1|[{1639, 6.8228292...|
|      2|[{320, 8.049304},...|
|      3|[{1643, 9.578323}...|
|      4|[{1172, 11.647497...|
|      5|[{394, 8.292595},...|
|      6|[{1643, 6.6778836...|
|      7|[{1643, 7.776578}...|
|      8|[{962, 8.6741705}...|
|      9|[{6, 10.562657}, ...|
|     10|[{1131, 6.762258}...|
+-------+--------------------+
only showing top 10 rows

+-------+-------+----------+
|user_id|item_id|prediction|
+-------+-------+----------+
|      1|   1639| 6.8228292|
|      1|   1335| 6.4798656|
|      1|   1001| 6.3261967|
|      2|    320|  8.049304|
|      2|    962| 7.9501586|
|      2|   1131|  7.785435|
|      3|   1643|  9.578323|
|      3|   1512|  8.101799|
|      3|    793|  8.044994|
|      4|   1172| 11.647497|
+-------+-------+----------+
only showing top 10 rows

+-------+-------+----------+--------------------+
|item_id|user_id|prediction|               movie|
+-------

**To check your recommendation model we perform the following steps:**

1. Select a specific user:
Consider the user with `user_id = 10`.

2. Inspect training data:
Print all movies that this user has already rated or watched in the training set (display movie titles).

3. Review recommendations:
Print the titles of the top-3 recommended movies for this user.
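
One quick sanity check when comparing the two lists is whether a recommended title already appears in the user's history, since recommendations are not necessarily restricted to unseen movies. The sketch below does this with plain Python collections. It uses only a handful of titles taken from this notebook's outputs, so it illustrates the check and says nothing about the user's full history.

```python
# A few titles user 10 rated in the training set (partial list, for illustration).
watched = {"Star Wars (1977)", "Taxi Driver (1976)", "Forrest Gump (1994)"}

# The top-3 titles from the expected recommendation.
recommended = ["Safe (1995)", "Angel Baby (1995)", "Mina Tannenbaum (1994)"]

# Split the recommendations into already-seen and new titles.
already_seen = [m for m in recommended if m in watched]
novel = [m for m in recommended if m not in watched]

print("Already seen:", already_seen)
print("New to the user:", novel)
```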


Expected recommendation:



```
Top 3 recommended movies for user 10:
+----------------------+
|movie                 |
+----------------------+
|Safe (1995)           |
|Angel Baby (1995)     |
|Mina Tannenbaum (1994)|
+----------------------+
```





In [None]:
# YOUR CODE HERE
user_id_to_check = 10

# Print all movies this user rated in the training set
watched_movies = training.filter(training.user_id == user_id_to_check).join(items, "item_id")
print(f"Movies watched by user {user_id_to_check} in the training set:")
watched_movies.select("movie").show(watched_movies.count(), truncate=False)

# Print the titles of the top-N recommended movies for this user
print(f"\nTop {N} recommended movies for user {user_id_to_check}:")
recommended_movies = userRecs_with_titles.filter(userRecs_with_titles.user_id == user_id_to_check)
recommended_movies.select("movie").show(N, truncate=False)

Movies watched by user 10 in the training set:
+----------------------------------------------------------+
|movie                                                     |
+----------------------------------------------------------+
|Dead Man Walking (1995)                                   |
|Seven (Se7en) (1995)                                      |
|Usual Suspects, The (1995)                                |
|Taxi Driver (1976)                                        |
|Crumb (1994)                                              |
|Desperado (1995)                                          |
|To Wong Foo, Thanks for Everything! Julie Newmar (1995)   |
|Star Wars (1977)                                          |
|Three Colors: Red (1994)                                  |
|Three Colors: Blue (1993)                                 |
|Forrest Gump (1994)                                       |
|Four Weddings and a Funeral (1994)                        |
|Jurassic Park (1993)                 