# Experiment 6
Collaborative Filtering for Movie Recommendation Using PySpark's Alternating Least Squares (ALS) Algorithm


## Aim
To implement collaborative filtering for personalized movie recommendations using PySpark's Alternating Least Squares (ALS) model on the MovieLens dataset and evaluate its performance.


## Objectives
1. To load and preprocess the MovieLens dataset for collaborative filtering.
2. To perform exploratory data analysis (EDA) to understand the dataset structure.
3. To build and train the ALS model for recommendation generation.
4. To split data into training and testing sets for performance evaluation.
5. To compute model performance metrics like root-mean-square error (RMSE).
6. To provide personalized movie recommendations for users based on their preferences.


## Course Outcomes
1. Understand collaborative filtering and recommendation systems.
2. Gain experience in loading and analyzing datasets with PySpark.
3. Implement Alternating Least Squares (ALS) for building recommendation systems.
4. Evaluate the performance of recommendation models using metrics such as RMSE.
5. Utilize PySpark for building scalable solutions for data-intensive applications.


## Theory

- Recommendation Systems: Recommendation systems suggest items to users based on their preferences and behavior. Collaborative filtering is a type of recommendation system that relies on the interactions between users and items. It assumes that users who had similar preferences in the past are likely to have similar preferences in the future.

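- Alternating Least Squares (ALS): ALS is a matrix factorization algorithm that decomposes the sparse user-item interaction matrix into two smaller matrices of latent factors, one for users and one for items. It alternates between holding the item factors fixed while solving for the user factors, and holding the user factors fixed while solving for the item factors. The dot product of a user's factors and an item's factors predicts a missing rating. PySpark's implementation distributes this computation so it scales to large datasets, and it provides a `coldStartStrategy` parameter for users or items that appear at prediction time but were absent from the training data.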

- MovieLens Dataset: The MovieLens dataset is a popular dataset for recommendation system research. It contains user ratings for movies, where each entry represents a user’s rating for a specific movie. Key attributes include:
    - userId: Unique identifier for users.
    - movieId: Unique identifier for movies.
    - rating: Numeric value representing a user's rating for a movie.
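The alternating updates described under ALS above can be illustrated with a small NumPy sketch. This is a simplified teaching version, not PySpark's implementation: the function name `als_sketch`, the toy matrix, and the `rank`, `reg`, and `iters` values are all assumptions made for illustration, and the regularization here is a plain ridge penalty, which may differ from how Spark scales `regParam` internally.

```python
import numpy as np

def als_sketch(ratings, mask, rank=2, reg=0.1, iters=10, seed=0):
    """Toy ALS: approximate ratings by user_f @ item_f.T on observed cells only."""
    rng = np.random.default_rng(seed)
    n_users, n_items = ratings.shape
    user_f = rng.normal(scale=0.1, size=(n_users, rank))
    item_f = rng.normal(scale=0.1, size=(n_items, rank))
    ridge = reg * np.eye(rank)
    losses = []
    for _ in range(iters):
        # Hold item factors fixed; solve a small ridge regression per user.
        for u in range(n_users):
            seen = mask[u]
            items = item_f[seen]
            user_f[u] = np.linalg.solve(items.T @ items + ridge,
                                        items.T @ ratings[u, seen])
        # Hold user factors fixed; solve a small ridge regression per item.
        for i in range(n_items):
            seen = mask[:, i]
            users = user_f[seen]
            item_f[i] = np.linalg.solve(users.T @ users + ridge,
                                        users.T @ ratings[seen, i])
        # Regularized squared error over the observed ratings.
        residual = (ratings - user_f @ item_f.T)[mask]
        penalty = reg * ((user_f ** 2).sum() + (item_f ** 2).sum())
        losses.append(float(residual @ residual + penalty))
    return user_f, item_f, losses

# Hypothetical 4x4 toy matrix; 0 marks a missing rating.
toy = np.array([[5, 4, 0, 1],
                [4, 0, 1, 1],
                [1, 1, 0, 5],
                [0, 1, 5, 4]], dtype=float)
observed = toy > 0

user_f, item_f, losses = als_sketch(toy, observed)
predicted = user_f @ item_f.T  # includes estimates for the missing cells
print(np.round(predicted, 2))
```

Because each half-step solves its least-squares problem exactly, the regularized loss recorded in `losses` should not increase from one sweep to the next, which is a convenient sanity check for a sketch like this.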


## Procedure

1. Create a Spark Session
    - Initialize a Spark session using `SparkSession.builder`.
    - Name the application for identification (e.g., "Collaborative Filtering with PySpark").

2. Load the Dataset
    - Import the MovieLens dataset using `spark.read.csv()`.
    - Set `header=True` so column names are read from the first row, and `inferSchema=True` so numeric columns are typed as numbers rather than strings.

3. Perform Exploratory Data Analysis (EDA)
    - Display the schema using `printSchema()`.
    - Show the first few rows of the dataset with `show()`.
    - Compute summary statistics using `describe()`.
    - Count the number of unique users and movies using `distinct()`.

4. Prepare Data for Collaborative Filtering
    - Split the dataset into training and testing subsets using `randomSplit()`.
    - Use an 80-20 split to ensure sufficient data for training while maintaining a robust test set.

5. Build the ALS Model
    - Initialize the ALS model using PySpark’s `ALS` class.
      - Set hyperparameters such as `maxIter` (number of iterations) and `regParam` (regularization parameter).
      - Define user and item columns (`userCol` and `itemCol`) and the rating column (`ratingCol`).
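      - Set `coldStartStrategy="drop"` so that test rows whose user or movie was not seen during training, which would otherwise receive a `NaN` prediction, are excluded from the results.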
    - Train the ALS model on the training data using `fit()`.

6. Make Predictions
    - Generate predictions for the test dataset using the trained ALS model.
    - Store the predictions for evaluation and further analysis.

7. Evaluate the Model
    - Use `RegressionEvaluator` to compute RMSE, a common metric for assessing recommendation model performance.
    - Evaluate predictions against actual ratings in the test dataset.

8. Provide Recommendations
    - Use the ALS model’s `recommendForAllUsers()` method to generate top-5 movie recommendations for all users.
    - Display the recommendations to assess the personalization aspect.
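Steps 4 and 5 interact: after a random split, some users or movies may end up only in the test set, and a factorization model has no latent factors for them. The plain-Python sketch below illustrates which test rows a "drop" cold-start policy would exclude. The toy ratings, the seed, and the per-row sampling are assumptions for illustration and only approximate how `randomSplit()` behaves.

```python
import random

# Hypothetical toy ratings: (userId, movieId, rating).
ratings = [
    (1, 10, 4.0), (1, 11, 3.5), (1, 12, 5.0),
    (2, 10, 2.0), (2, 13, 4.5),
    (3, 11, 3.0), (3, 12, 4.0), (3, 14, 1.5),
    (4, 15, 5.0),
]

# Assign each row independently to train (about 80%) or test (about 20%).
rng = random.Random(7)
train, test = [], []
for row in ratings:
    (train if rng.random() < 0.8 else test).append(row)

# Only users and movies seen in training would have learned factors.
train_users = {user for user, _, _ in train}
train_movies = {movie for _, movie, _ in train}

# Test rows the model could score versus rows a "drop" policy would discard.
scorable = [r for r in test if r[0] in train_users and r[1] in train_movies]
cold = [r for r in test if r not in scorable]

print("train rows:", len(train))
print("scorable test rows:", scorable)
print("cold-start test rows:", cold)
```

One consequence of dropping these rows is that the RMSE is computed only over the scorable subset of the test set, which is worth keeping in mind when comparing runs.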


## Results

### Exploratory Data Analysis
- Schema: The dataset contains the columns `userId`, `movieId`, `rating`, and `timestamp`. The model uses only the first three.
- Summary Statistics: The dataset holds 100,836 ratings with a mean rating of about 3.50.
- Unique Counts:
   - Number of unique users: ~610.
   - Number of unique movies: ~9,700.

### Model Training and Evaluation
- Training: The ALS model was trained with `maxIter=10` and `regParam=0.01`.
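- Evaluation: The RMSE on the test dataset was approximately **0.87**, meaning predicted ratings were typically off by a little under one rating point. Because `randomSplit()` is called without a seed, this value varies from run to run.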
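RMSE is the square root of the mean squared difference between actual and predicted ratings. As a minimal sketch of what the metric measures, the plain-Python function below computes it for a handful of made-up values; the helper name `rmse` and the sample ratings are hypothetical and are not taken from the experiment.

```python
import math

def rmse(actual, predicted):
    """Root-mean-square error between two equal-length rating lists."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical ratings: errors are 0.5, 0.0, 1.0 and -1.0.
actual = [4.0, 3.0, 5.0, 2.0]
predicted = [3.5, 3.0, 4.0, 3.0]
print(rmse(actual, predicted))  # 0.75
```

Squaring the errors before averaging means a few large misses raise RMSE more than many small ones.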

### Recommendations
- Personalized recommendations for users were generated successfully.
- The `recommendForAllUsers()` method identified top-5 movies for each user based on latent factors.
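To make "based on latent factors" concrete: once user and item factors exist, a top-k list can be produced by scoring every item with a dot product and keeping the highest scores. The NumPy sketch below shows that idea on made-up factors. The function `top_k_items`, the item IDs, and the factor values are hypothetical, and this is not a description of how `recommendForAllUsers()` is implemented internally.

```python
import numpy as np

def top_k_items(user_vec, item_factors, item_ids, k=5):
    """Return the k (item_id, score) pairs with the highest predicted score."""
    scores = item_factors @ user_vec      # one predicted rating per item
    order = np.argsort(-scores)[:k]       # indices of the k largest scores
    return [(item_ids[i], float(scores[i])) for i in order]

# Hypothetical rank-2 factors for four movies and one user.
item_ids = [10, 11, 12, 13]
item_factors = np.array([[1.0, 0.0],
                         [0.0, 1.0],
                         [1.0, 1.0],
                         [0.5, 0.2]])
user_vec = np.array([2.0, 1.0])

print(top_k_items(user_vec, item_factors, item_ids, k=2))
# [(12, 3.0), (10, 2.0)]
```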


## Conclusions
The ALS model used collaborative filtering to provide personalized movie recommendations based on user-item interactions. The RMSE of about 0.87 shows that predicted ratings typically fell within about one rating point of the actual values. The personalized recommendations highlight the potential of collaborative filtering for enhancing user experiences.

This experiment underscores the importance of exploring and preprocessing datasets, selecting appropriate algorithms, and evaluating model performance in recommendation systems. Further improvements could include hyperparameter tuning, adding implicit feedback, or incorporating side information like movie genres for better recommendations.


```python
!pip install pyspark
```



```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.ml.recommendation import ALS
from pyspark.ml.evaluation import RegressionEvaluator

# Step 1: Create a Spark session
spark = SparkSession.builder \
    .appName("Collaborative Filtering with PySpark") \
    .getOrCreate()

# Step 2: Load the dataset
ratings = spark.read.csv("/content/sample_data/ratings.csv", header=True, inferSchema=True)

# Step 3: Basic Exploratory Data Analysis (EDA)
print("Schema of the dataset:")
ratings.printSchema()

print("First 5 rows of the dataset:")
ratings.show(5)

print("Summary statistics:")
ratings.describe().show()

# Count the number of unique users and movies
num_users = ratings.select("userId").distinct().count()
num_movies = ratings.select("movieId").distinct().count()
print(f"Number of unique users: {num_users}")
print(f"Number of unique movies: {num_movies}")

# Step 4: Prepare the data for collaborative filtering
# Split the data into training and test sets (no seed, so the split differs per run)
(training, test) = ratings.randomSplit([0.8, 0.2])

# Step 5: Build the ALS model
als = ALS(
    maxIter=10,
    regParam=0.01,
    userCol="userId",
    itemCol="movieId",
    ratingCol="rating",
    coldStartStrategy="drop",  # drop rows that would get a NaN prediction
)
model = als.fit(training)

# Step 6: Make predictions
predictions = model.transform(test)

# Step 7: Evaluate the model
evaluator = RegressionEvaluator(metricName="rmse", labelCol="rating", predictionCol="prediction")
rmse = evaluator.evaluate(predictions)
print(f"Root-mean-square error = {rmse}")

# Step 8: Show some recommendations
user_recs = model.recommendForAllUsers(5)
user_recs.show(truncate=False)

# Stop the Spark session
spark.stop()
```


Schema of the dataset:
root
 |-- userId: integer (nullable = true)
 |-- movieId: integer (nullable = true)
 |-- rating: double (nullable = true)
 |-- timestamp: integer (nullable = true)

First 5 rows of the dataset:
+------+-------+------+---------+
|userId|movieId|rating|timestamp|
+------+-------+------+---------+
|     1|      1|   4.0|964982703|
|     1|      3|   4.0|964981247|
|     1|      6|   4.0|964982224|
|     1|     47|   5.0|964983815|
|     1|     50|   5.0|964982931|
+------+-------+------+---------+
only showing top 5 rows

Summary statistics:
+-------+------------------+----------------+------------------+--------------------+
|summary|            userId|         movieId|            rating|           timestamp|
+-------+------------------+----------------+------------------+--------------------+
|  count|            100836|          100836|            100836|              100836|
|   mean|326.12756356856676|19435.2957177992| 3.501556983616962|1.2059460873684695E9|
| 