
# ***Build a Recommendation Engine with Spark Using a Dataset of Your Choice***

Description:
1. Import Tools
We import the modules the notebook needs: `urllib.request` and `os` for downloading the file, and PySpark's `SparkSession` and `ALS` for loading the data and training the model.

2. Start Spark Session
We create a `SparkSession`, the entry point through which all data loading and processing happens.

3. Download the Dataset
We download the MovieLens 100K ratings file (`u.data`) programmatically, skipping the download if the file is already present locally.

4. Load and Name the Data
We load the downloaded file into a Spark DataFrame and give each column a meaningful name (`userId`, `movieId`, `rating`, `timestamp`).

5. View the Data
We take a quick look at the data to confirm it’s loaded correctly and understand what it contains.

6. Train the Recommendation Model
We fit an ALS (Alternating Least Squares) collaborative-filtering model, which learns each user's preferences from their past ratings.

7. Make Recommendations for Users
For each user, the model returns the 10 movies with the highest predicted ratings.

8. Make Recommendations for Movies
For each movie, the model also returns the 10 users with the highest predicted ratings for it.

9. Predict a Specific Rating
We ask the model for the rating it predicts for one specific user and movie pair (user 1 and movie 10 in the code).

10. Stop the Spark Session
Finally, we stop the Spark session to release its resources.
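
To make step 6 more concrete, here is a minimal pure-Python sketch of the alternating idea behind ALS, reduced to a rank-1 model where each user and each movie gets a single number. It is not Spark's implementation: the function name `als_rank1`, the toy ratings, and the `reg` and `n_iters` values are all assumptions chosen for illustration.

```python
def als_rank1(ratings, n_users, n_items, reg=0.1, n_iters=10):
    """Fit rating ~ u[user] * v[item] by alternating least squares (rank 1)."""
    u = [1.0] * n_users
    v = [1.0] * n_items
    losses = []
    for _ in range(n_iters):
        # Hold the item factors fixed; each user's factor has a closed-form solution.
        for i in range(n_users):
            rated = [(j, r) for (ui, j, r) in ratings if ui == i]
            num = sum(r * v[j] for j, r in rated)
            den = reg + sum(v[j] ** 2 for j, _ in rated)
            u[i] = num / den
        # Hold the user factors fixed; solve for each item's factor the same way.
        for j in range(n_items):
            raters = [(i, r) for (i, ij, r) in ratings if ij == j]
            num = sum(r * u[i] for i, r in raters)
            den = reg + sum(u[i] ** 2 for i, _ in raters)
            v[j] = num / den
        # Regularized squared error over the observed ratings only.
        err = sum((r - u[i] * v[j]) ** 2 for (i, j, r) in ratings)
        penalty = reg * (sum(x * x for x in u) + sum(x * x for x in v))
        losses.append(err + penalty)
    return u, v, losses


# Hypothetical toy data: (user index, item index, rating), zero-based.
toy_ratings = [
    (0, 0, 5.0), (0, 1, 3.0),
    (1, 0, 4.0), (1, 2, 1.0),
    (2, 1, 2.0), (2, 2, 1.0),
]
u, v, losses = als_rank1(toy_ratings, n_users=3, n_items=3)
print("loss per iteration:", [round(x, 4) for x in losses])
```

Each half-step minimizes the objective exactly for one set of factors while the other is held fixed, so the recorded loss never increases from one iteration to the next. Spark's `ALS` applies the same alternation with higher-rank factor vectors and distributed solves.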

In [None]:
import urllib.request
import os
from pyspark.sql import SparkSession
from pyspark.ml.recommendation import ALS
from pyspark.sql.functions import col

In [None]:
# Start the Spark session
spark = SparkSession.builder.appName("MovieLensRecommendation").getOrCreate()

In [None]:
# Download the MovieLens 100K ratings file if it is not already present
data_url = "http://files.grouplens.org/datasets/movielens/ml-100k/u.data"
local_path = "/tmp/u.data"
if not os.path.exists(local_path):
    urllib.request.urlretrieve(data_url, local_path)

# u.data is tab-separated with no header row, so name the columns explicitly
columns = ["userId", "movieId", "rating", "timestamp"]
ratings_df = spark.read.csv(local_path, sep="\t", inferSchema=True, header=False)
ratings_df = ratings_df.toDF(*columns)

# Preview the first rows to confirm the load worked
ratings_df.show(5)
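
For readers who want to see what the raw file looks like before Spark names its columns, the sketch below parses three tab-separated rows with Python's standard `csv` module. The rows are copied from the `show(5)` output in this notebook; the variable names `sample` and `rows` are illustrative and not part of the original code.

```python
import csv
import io

# Three rows in the tab-separated layout of u.data (values taken from the notebook output).
sample = (
    "196\t242\t3\t881250949\n"
    "186\t302\t3\t891717742\n"
    "22\t377\t1\t878887116\n"
)
columns = ["userId", "movieId", "rating", "timestamp"]

# Split each line on tabs, convert the fields to integers, and attach the column names.
reader = csv.reader(io.StringIO(sample), delimiter="\t")
rows = [dict(zip(columns, map(int, fields))) for fields in reader]

for row in rows:
    print(row)
```

The file has no header row, which is why the notebook passes `header=False` and assigns the names with `toDF(*columns)`.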

In [None]:
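# Configure the ALS model
als = ALS(
    userCol="userId",
    itemCol="movieId",
    ratingCol="rating",
    coldStartStrategy="drop",  # drop rows whose prediction would be NaN (unseen users or movies)
    nonnegative=True,          # constrain the learned factors to be non-negative
)

# Fit the model on all ratings
model = als.fit(ratings_df)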
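
A fitted ALS model stores one factor vector per user and one per movie (available as `model.userFactors` and `model.itemFactors`), and a predicted rating is the dot product of the two. The pure-Python sketch below shows how a prediction and a top-N list follow from such vectors. The factor values, the item IDs, and the helper names `predict` and `top_n` are made up for illustration and do not come from the trained model.

```python
def predict(user_vec, item_vec):
    """Predicted rating: dot product of a user vector and an item vector."""
    return sum(a * b for a, b in zip(user_vec, item_vec))


def top_n(user_vec, item_factors, n):
    """Score every item for one user and keep the n highest-scoring ones."""
    scored = [(item_id, predict(user_vec, vec)) for item_id, vec in item_factors.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:n]


# Hypothetical rank-2 factors for one user and three movies.
user_vec = [1.0, 2.0]
item_factors = {
    10: [1.0, 1.0],
    20: [2.0, 0.0],
    30: [0.0, 2.5],
}

recs = top_n(user_vec, item_factors, n=2)
print(recs)  # [(30, 5.0), (10, 3.0)]
```

Ranking every movie for every user in this way is, conceptually, what a top-10 recommendation list per user amounts to.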

In [None]:
# Recommend the top 10 movies for each user
user_recs = model.recommendForAllUsers(10)
user_recs.show(5, truncate=False)

# Recommend the top 10 users for each movie
movie_recs = model.recommendForAllItems(10)
movie_recs.show(5, truncate=False)

# Predict the rating for one specific user and movie
user_id = 1
movie_id = 10
user_movie_df = spark.createDataFrame([(user_id, movie_id)], ["userId", "movieId"])
predicted_rating = model.transform(user_movie_df)
predicted_rating.show()

# Stop the Spark session
spark.stop()

+------+-------+------+---------+
|userId|movieId|rating|timestamp|
+------+-------+------+---------+
|   196|    242|     3|881250949|
|   186|    302|     3|891717742|
|    22|    377|     1|878887116|
|   244|     51|     2|880606923|
|   166|    346|     1|886397596|
+------+-------+------+---------+
only showing top 5 rows

+------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|userId|recommendations                                                                                                                                                                            |
+------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|1     |[{1449, 5.023085}, {1463, 4.9531827}, {1643, 4.9056563}, {119, 4.83429