# PySpark Recommendation System with ALS Algorithm

This Jupyter notebook builds a recommendation system in PySpark using ALS (Alternating Least Squares), a collaborative filtering algorithm that factorizes the user-item rating matrix into low-rank user and item factors.

## Introduction to Recommendation Systems

Recommendation systems suggest items to users based on their preferences and past interactions. Collaborative filtering methods such as ALS do this by analyzing user-item interactions (here, ratings) rather than the content of the items.


## Loading and Displaying Ratings Data

We initialize a Spark session with `SparkSession.builder.getOrCreate()` and load the ratings from a JSON file named "movies 1.json" into a DataFrame named `ratings`, keeping only the columns "user_id", "product_id", and "score" and caching the result. We then take the first 10,000 rows with `ratings.head(10000)`, which returns a list of `Row` objects on the driver, and rebuild a DataFrame from that list with `spark.createDataFrame()`. Finally, we display the first 5 rows of `ratings`.


In [None]:
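from pyspark.sql import SparkSession
from pyspark.sql import functions as f

# Reuse the active Spark session or start a new one.
spark = SparkSession.builder.getOrCreate()

# Keep only the three columns needed for collaborative filtering.
ratings = spark.read.json("movies 1.json").select("user_id", "product_id", "score").cache()
# head() collects the first 10,000 rows to the driver as a list of Row objects.
ratings = ratings.head(10000)
ratings = spark.createDataFrame(ratings)

ratings.show(5)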
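
By default, `spark.read.json` expects one JSON object per line (JSON Lines). The sketch below mimics the load, select, and limit steps in plain Python on three made-up records. The record values, the extra `summary` field, and the helper name `load_ratings` are illustrative assumptions, not taken from "movies 1.json".

```python
import json

# Hypothetical JSON Lines input: one object per line.
raw_lines = [
    '{"user_id": "u1", "product_id": "p1", "score": 5.0, "summary": "great"}',
    '{"user_id": "u2", "product_id": "p1", "score": 3.0, "summary": "ok"}',
    '{"user_id": "u1", "product_id": "p2", "score": 4.0, "summary": "good"}',
]

COLUMNS = ("user_id", "product_id", "score")


def load_ratings(lines, limit):
    """Parse JSON lines, keep only the rating columns, and stop after `limit` rows."""
    rows = []
    for line in lines[:limit]:
        record = json.loads(line)
        rows.append({column: record[column] for column in COLUMNS})
    return rows


sample = load_ratings(raw_lines, limit=2)
for row in sample:
    print(row)
```

Fields other than the three selected columns are dropped, which is what the `select(...)` call does on the Spark side.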

## Collaborative Filtering with ALS Algorithm

We import the modules needed for collaborative filtering with ALS, including `ALS` from `pyspark.ml.recommendation` and `StringIndexer` from `pyspark.ml.feature`. ALS requires numeric user and item IDs, so we create a `StringIndexer` for each of the "user_id" and "product_id" columns and chain them in a `Pipeline` that is applied to the `ratings` DataFrame. After indexing, we split the data into training and validation sets (roughly 80/20) using `randomSplit()`. We initialize the ALS model with rank 10, 5 iterations, a regularization parameter of 0.01, and `coldStartStrategy="drop"`, which drops validation rows whose user or item was not seen during training instead of returning NaN predictions. We also define an RMSE `RegressionEvaluator` for use in the evaluation section. Finally, we fit the model to the training data, make predictions on the validation data, and display the first 10 predictions.
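
To make the indexing step concrete, here is a plain-Python sketch of what a frequency-based string indexer does: the most frequent label gets index 0.0, the next gets 1.0, and so on. The helper `fit_string_index` and the sample IDs are hypothetical, and the alphabetical tie-break is an assumption of this sketch rather than a statement about every Spark version.

```python
from collections import Counter


def fit_string_index(values):
    """Map each distinct label to a float index, most frequent label first.

    Ties are broken alphabetically here (an assumption of this sketch).
    """
    counts = Counter(values)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(position) for position, label in enumerate(ordered)}


user_ids = ["alice", "bob", "alice", "carol", "bob", "alice"]  # made-up IDs
user_index = fit_string_index(user_ids)
print(user_index)
```

Under this scheme, an index such as 1.0 refers to the second most frequent label, not to any ID stored in the raw data.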


In [None]:
from pyspark.ml import Pipeline
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.feature import StringIndexer
from pyspark.ml.recommendation import ALS

# ALS needs numeric IDs, so map each string ID column to a numeric index column.
indexers = [
    StringIndexer(inputCol=column, outputCol=column + "_index")
    for column in ["user_id", "product_id"]
]

# The pipeline fits both indexers on the ratings and then applies them.
pipeline = Pipeline(stages=indexers)
ratings_indexed = pipeline.fit(ratings).transform(ratings)

# Split roughly 80/20; randomSplit normalizes the weights.
training_data, validation_data = ratings_indexed.randomSplit([0.8, 0.2])

als = ALS(
    userCol="user_id_index",
    itemCol="product_id_index",
    ratingCol="score",
    rank=10,
    maxIter=5,
    regParam=0.01,
    coldStartStrategy="drop",  # drop rows whose prediction would be NaN
)
# RMSE evaluator, used later in the evaluation section.
evaluator = RegressionEvaluator(metricName="rmse", labelCol="score", predictionCol="prediction")

model = als.fit(training_data)
predictions = model.transform(validation_data)
predictions.show(10, False)
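
The alternating idea behind ALS can be sketched without Spark. The toy below uses rank 1 (one number per user and per item) so each update has a simple closed form; Spark's implementation uses rank-k factors and distributed solves, so treat this as an illustration only. The ratings, function names, and the plain L2 penalty are assumptions of the sketch.

```python
# Made-up (user_index, item_index, score) triples.
toy_ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0), (1, 2, 1.0), (2, 1, 2.0), (2, 2, 1.0)]


def solve_side(ratings, fixed, n_free, free_pos, fixed_pos, reg):
    """Closed-form rank-1 ridge update of one side while the other side is held fixed."""
    numer = [0.0] * n_free
    denom = [reg] * n_free
    for triple in ratings:
        other = fixed[triple[fixed_pos]]
        numer[triple[free_pos]] += triple[2] * other
        denom[triple[free_pos]] += other * other
    return [n / d for n, d in zip(numer, denom)]


def objective(ratings, user_f, item_f, reg):
    """Squared error on observed ratings plus an L2 penalty on all factors."""
    error = sum((r - user_f[u] * item_f[i]) ** 2 for u, i, r in ratings)
    penalty = reg * (sum(x * x for x in user_f) + sum(y * y for y in item_f))
    return error + penalty


def als_rank1(ratings, n_users, n_items, reg=0.01, max_iter=5):
    user_f = [1.0] * n_users
    item_f = [1.0] * n_items
    history = []
    for _ in range(max_iter):
        # Fix the item factors and solve for users, then swap roles.
        user_f = solve_side(ratings, item_f, n_users, 0, 1, reg)
        item_f = solve_side(ratings, user_f, n_items, 1, 0, reg)
        history.append(objective(ratings, user_f, item_f, reg))
    return user_f, item_f, history


user_f, item_f, history = als_rank1(toy_ratings, n_users=3, n_items=3)
print(history)
```

Because each half-step exactly minimizes the regularized objective over one side, the recorded objective cannot increase from one iteration to the next.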

## Generating Recommendations for a Specific User

We filter the validation data for a single user (index 1.0 in this case), selecting the columns 'product_id', 'user_id', 'user_id_index', and 'product_id_index', and display the result with `user1.show()`. Applying `model.transform()` to `user1` then predicts a score for each product this user rated in the validation set; it ranks those existing user-product pairs rather than proposing products the user has not seen. Finally, we sort the rows by predicted score in descending order using `orderBy('prediction', ascending=False)` and display the results.


In [None]:
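# Validation rows for the user with index 1.0 (the score column is dropped).
user1 = validation_data.filter(validation_data["user_id_index"] == 1.0).select(
    ["product_id", "user_id", "user_id_index", "product_id_index"]
)
user1.show()

# Predict a score for each of these (user, product) pairs and rank them.
recommendations = model.transform(user1)
recommendations.orderBy("prediction", ascending=False).show()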
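
To recommend products the user has not rated yet, one common approach is to score every candidate item with the dot product of the user and item factors and keep the top N. The sketch below does this in plain Python with made-up factor vectors; `recommend_top_n` is a hypothetical helper, and skipping already-seen items is a design choice of this sketch. In PySpark, the fitted model also exposes `recommendForUserSubset`, which is worth checking in the API documentation for this purpose.

```python
def recommend_top_n(user_vector, item_vectors, seen_items, n):
    """Score unseen items by dot product with the user vector and return the best n."""
    scores = []
    for item_id, item_vector in item_vectors.items():
        if item_id in seen_items:
            continue  # skip items the user has already rated
        score = sum(u * v for u, v in zip(user_vector, item_vector))
        scores.append((item_id, score))
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:n]


# Made-up rank-2 factors for one user and four products.
user_vector = [1.0, 0.5]
item_vectors = {
    "p1": [4.0, 0.0],
    "p2": [1.0, 2.0],
    "p3": [3.0, 2.0],
    "p4": [0.0, 1.0],
}

top = recommend_top_n(user_vector, item_vectors, seen_items={"p1"}, n=2)
print(top)
```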

## Evaluating Model Performance

We use the `RegressionEvaluator` defined earlier to compute the Root Mean Squared Error (RMSE) between the actual and predicted ratings in `predictions`, calling `evaluator.evaluate(predictions)` and printing the result. We then create a second evaluator for Mean Absolute Error (MAE) and print the value returned by `evaluator_mae.evaluate(predictions)`. Both metrics are in rating units and lower is better; RMSE penalizes large errors more heavily than MAE.


In [None]:
from pyspark.ml.evaluation import RegressionEvaluator

# RMSE on the validation predictions, using the evaluator created with the model.
rmse = evaluator.evaluate(predictions)
print(f"Root Mean Squared Error (RMSE) = {rmse}")

# Second evaluator with the same columns but the MAE metric.
evaluator_mae = RegressionEvaluator(
    metricName="mae",
    labelCol="score",
    predictionCol="prediction"
)

mae = evaluator_mae.evaluate(predictions)
print(f"Mean Absolute Error (MAE) = {mae}")
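
As a sanity check on what the two metrics measure, they can be computed by hand on a few values. The numbers and function names below are made up for illustration and are not results from the model above.

```python
import math


def root_mean_squared_error(actual, predicted):
    """Square root of the mean of squared differences."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def mean_absolute_error(actual, predicted):
    """Mean of absolute differences."""
    absolute = [abs(a - p) for a, p in zip(actual, predicted)]
    return sum(absolute) / len(absolute)


actual = [4.0, 3.0, 5.0]      # made-up true scores
predicted = [3.5, 3.0, 4.0]   # made-up predictions

rmse_value = root_mean_squared_error(actual, predicted)
mae_value = mean_absolute_error(actual, predicted)
print(f"RMSE = {rmse_value:.4f}, MAE = {mae_value:.4f}")
```

The errors here are 0.5, 0.0, and 1.0, so the MAE is 0.5 while the RMSE is slightly higher because the largest error is weighted more.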