# Final Project Proposal & Project 6

**Name:** Umais Siddiqui  
**Class:** Data 612 - Recommender Systems  
**Github Link:** https://github.com/umais/DATA612_Recommender_Systems/blob/master/FinalProject_Project6/app/FinalProjectProposalDocument.ipynb

I will submit the Final Project and Project 6 as a single project, since the Final Project fulfills all the requirements for Project 6. The idea is to build a personalized recommender system using Amazon’s Electronics review dataset. It applies Spark-based collaborative filtering (ALS) to generate product recommendations for users and will deploy a real-time API using Flask and Azure Kubernetes Service (AKS). All data and results are stored in Azure Blob Storage and Azure SQL Database.

**Environment Setup**:  
The system runs inside a Docker container hosted on an Azure Virtual Machine. Azure File Share is mounted on the VM to persist intermediate files and logs. Azure Blob is used for scalable object storage, and Azure SQL is used to store structured results.

---


##  **Dataset Overview**

- **Source**: Amazon Reviews (Electronics)
- **Size**: ~7 million reviews
- **Columns Used**: `reviewerID`, `asin`, `overall` (rating), `reviewText`
- **Goal**: Generate Top-N recommendations per user


# **Project Description and Deliverable**


The goal of this project is to build a scalable and efficient recommender system using Apache Spark’s Alternating Least Squares (ALS) algorithm, hosted on Microsoft Azure infrastructure. The project workflow involves the following steps:

- Create a new Azure Virtual Machine (VM) configured to access data stored in an Azure File Share via a mounted drive.
- Install Docker on the VM to run Spark inside a containerized environment, ensuring portability and ease of management.
- Load data into the Spark application directly from the mounted Azure File Share, so that processing works against persistent cloud storage.
- Develop the recommender system by applying and iteratively improving the ALS model on the dataset to improve its prediction accuracy. Once the model achieves satisfactory results, save it for production use.
- Create an API service that exposes the recommendations for real-time usage, with endpoints for retrieving personalized product recommendations.
- Build a frontend web application using Flask that serves as a visualization layer and displays recommendations interactively to end users.
- Deploy and host the entire system, including the API and frontend, on Azure to take advantage of the platform’s scalability and reliability.

## **Final Deliverable**

- Jupyter Notebook on GitHub
- Recorded PowerPoint presentation demonstrating the working recommender system.

# **Alignment with Project 6 Requirements: Hands-on with Microsoft Azure**

This final recommender system project will fully incorporate the essential Azure components specified in Project 6, demonstrating practical experience with deploying a cloud solution on Microsoft Azure:

**1. Persistent Storage**

The project uses Azure File Share, which is part of Azure Storage, to store datasets and the trained ALS model.

This provides long-term, durable, and scalable storage for both input data and model artifacts.

The mounted Azure File Share is accessed both by the Azure VM running Spark inside Docker and by the Flask application hosted on Azure App Service, ensuring consistent data availability.

**2. Compute Resource**

The project provisions an Azure Virtual Machine that runs Docker and hosts Apache Spark inside containers for data processing, model training, and evaluation.

The trained ALS model is then served through a Flask web application deployed on Azure App Service, a fully managed platform-as-a-service (PaaS) compute environment.

This multi-tier compute architecture showcases the use of different Azure compute options: IaaS (VM) and PaaS (App Service).

**3. Network Security**

The Azure VM is placed within an Azure Virtual Network (VNet) to isolate and secure network traffic.

Appropriate Network Security Groups (NSGs) are configured to restrict inbound and outbound access, ensuring only authorized users and services can connect to the VM and Azure File Share.

Similarly, the Azure App Service uses private endpoints or VNet integration to securely access the Azure File Share and backend resources.

This setup enforces secure communication channels between compute and storage services, complying with best practices for cloud security.






# Initial Setup, Data Cleaning, Analysis, and Model Generation

The project begins with setting up the environment on an Azure Virtual Machine, where Docker is installed and configured to run Apache Spark in a containerized environment. The dataset, stored on an Azure File Share and mounted inside the container, is accessed directly for processing.

# Data Cleaning and Exploration
The raw dataset is loaded into a Spark DataFrame, as shown in the first code cell below.

**Initial data cleaning steps are performed, including:**

- Selecting relevant columns (e.g., user IDs, item IDs, and ratings).

- Dropping rows with missing values to ensure data quality.

- Filtering users and items to retain only those with sufficient interaction counts (e.g., users with at least 5 ratings and items rated by at least 5 users).

Exploratory data analysis is conducted to understand the distribution of ratings, user behavior, and item popularity, which informs model tuning decisions.

**Model Building**

The cleaned dataset is then used to train a collaborative filtering model using Spark’s ALS (Alternating Least Squares) algorithm.

The model training includes hyperparameter tuning and iterative improvements aimed at enhancing recommendation accuracy.

Once the model reaches satisfactory performance metrics, it is saved to the mounted Azure File Share for persistence and later use by the Flask application.
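
The hyperparameter tuning described above could be organized as a simple grid search. The sketch below is a minimal, Spark-free illustration: `evaluate` is a hypothetical callback standing in for "fit ALS with these settings and return the validation RMSE", and the grid values are illustrative assumptions rather than settings used in this project. The keys `rank`, `regParam`, and `maxIter` match parameter names of `pyspark.ml.recommendation.ALS`.

```python
from itertools import product


def grid_search(param_grid, evaluate):
    """Return (best_params, best_score) for the lowest score over the grid.

    `evaluate` is a hypothetical callback: it receives a dict of
    hyperparameters and returns a validation RMSE (lower is better).
    """
    names = sorted(param_grid)
    best_params, best_score = None, float("inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = evaluate(params)
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Illustrative grid; the values are assumptions, not tuned settings.
param_grid = {"rank": [10, 20], "regParam": [0.05, 0.1], "maxIter": [10]}


def fake_rmse(params):
    """Stand-in scoring function so the sketch runs without Spark."""
    return abs(params["rank"] - 20) * 0.01 + abs(params["regParam"] - 0.1)


best_params, best_score = grid_search(param_grid, fake_rmse)
print(best_params, best_score)
```

In the notebook, each `params` dict could be unpacked into the `ALS(...)` constructor and scored with the `RegressionEvaluator`. Spark's own `ParamGridBuilder` and `CrossValidator` in `pyspark.ml.tuning` are an alternative to a hand-written loop.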

In [1]:
from pyspark.sql import SparkSession
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.recommendation import ALS
import matplotlib.pyplot as plt
import pandas as pd
from pyspark.sql.functions import col
from sklearn.metrics import mean_squared_error
import numpy as np


# Initialize Spark session with proper Jetty JARs
spark = (
    SparkSession.builder
    .appName("AmazonRecommender")
    .config("spark.driver.extraClassPath",
            "/opt/spark-3.4.1-bin-hadoop3/jars/jetty-6.1.26.jar:" +
            "/opt/spark-3.4.1-bin-hadoop3/jars/jetty-util-6.1.26.jar:" +
            "/opt/spark-3.4.1-bin-hadoop3/jars/jetty-ajax-6.1.26.jar")
    .config("spark.executor.extraClassPath",
            "/opt/spark-3.4.1-bin-hadoop3/jars/jetty-6.1.26.jar:" +
            "/opt/spark-3.4.1-bin-hadoop3/jars/jetty-util-6.1.26.jar:" +
            "/opt/spark-3.4.1-bin-hadoop3/jars/jetty-ajax-6.1.26.jar")
    .config("spark.pyspark.python", "/usr/local/bin/python3")
    .config("spark.pyspark.driver.python", "/usr/local/bin/python3")
    .getOrCreate()
)


# Read data from the mounted Azure File Share directory
df = spark.read.json("/media/amazonratings/Electronics_5.json")

# Show schema and sample data
df.printSchema()
df.select("reviewerID", "asin", "overall").show(5)


Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
25/07/12 20:03:50 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
25/07/12 20:03:51 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
                                                                                

root
 |-- asin: string (nullable = true)
 |-- helpful: array (nullable = true)
 |    |-- element: long (containsNull = true)
 |-- overall: double (nullable = true)
 |-- reviewText: string (nullable = true)
 |-- reviewTime: string (nullable = true)
 |-- reviewerID: string (nullable = true)
 |-- reviewerName: string (nullable = true)
 |-- summary: string (nullable = true)
 |-- unixReviewTime: long (nullable = true)

+--------------+----------+-------+
|    reviewerID|      asin|overall|
+--------------+----------+-------+
| AO94DHGC771SJ|0528881469|    5.0|
| AMO214LNFCEI4|0528881469|    1.0|
|A3N7T0DY83Y4IG|0528881469|    3.0|
|A1H8PY3QHMQQA0|0528881469|    2.0|
|A24EV6RXELQZ63|0528881469|    1.0|
+--------------+----------+-------+
only showing top 5 rows



## **Data Cleaning & Filtering**

To reduce noise and improve model performance:
- Users with fewer than 5 reviews are excluded
- Items with fewer than 5 reviews are excluded
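
A single pass of count-based filtering can leave some users or items below the threshold once their counterparts have been removed. One possible refinement, shown here as a minimal pure-Python sketch on toy data, repeats the filter until nothing changes. The function name `k_core` and the sample ratings are illustrative assumptions, not part of the notebook.

```python
from collections import Counter


def k_core(ratings, k):
    """Repeatedly drop users and items that have fewer than k ratings.

    `ratings` is a list of (user, item, rating) tuples.
    """
    current = list(ratings)
    while True:
        user_counts = Counter(user for user, _, _ in current)
        item_counts = Counter(item for _, item, _ in current)
        kept = [
            (user, item, rating)
            for user, item, rating in current
            if user_counts[user] >= k and item_counts[item] >= k
        ]
        if len(kept) == len(current):
            return kept  # stable: every remaining user and item has >= k ratings
        current = kept


toy = [
    ("u1", "a", 5.0), ("u1", "b", 4.0),
    ("u2", "a", 3.0), ("u2", "b", 2.0),
    ("u3", "a", 1.0), ("u3", "c", 5.0),
]
print(k_core(toy, 2))
```

In this toy example, item `c` is dropped in the first pass, which leaves user `u3` with a single rating, so `u3` is dropped in the second pass. A one-pass filter would have kept `u3`.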


In [6]:
from pyspark.sql.functions import col, count

ratings_df = df.selectExpr("reviewerID as user", "asin as item", "overall as rating").dropna()
user_counts = ratings_df.groupBy("user").agg(count("item").alias("user_count")).filter("user_count >= 5")
item_counts = ratings_df.groupBy("item").agg(count("user").alias("item_count")).filter("item_count >= 5")

# Number of users with at least 5 items
num_users = user_counts.count()
print(f"Total users with at least 5 items: {num_users}")

# Number of items with at least 5 users
num_items = item_counts.count()
print(f"Total items with at least 5 users: {num_items}")
# Show user counts
print("User counts with >= 5 items:")
user_counts.show()

# Show item counts
print("Item counts with >= 5 users:")
item_counts.show()

filtered_df = ratings_df.join(user_counts, "user").join(item_counts, "item")

                                                                                

Total users with at least 5 items: 192403
Total items with at least 5 users: 63001
User counts with >= 5 items:

+--------------+----------+
|          user|user_count|
+--------------+----------+
|A18FTRFQQ141CP|         5|
|A2GPNXFUUV51ZZ|         7|
|A15K7HV1XD6YWR|         8|
|A3PDGWYC08DXF4|        15|
|  A44UKZE6XEV9|        16|
|A3FE9EUVTU3UD8|         8|
|A3DKP8M0GSP8UK|        26|
|A37LCWTTQMBMFX|         6|
|A141E91QV31KER|         6|
|A1SCWY8O0IL2HU|        29|
| AWWBZZXN32I6H|         7|
|A1PG70NH85K859|        34|
|A1IJOBQD8CY8K1|        25|
|A2690TEJA2N778|        11|
|A3GHZZM7CNK77I|         5|
|A3TPM2VJA0X1Y2|         5|
|A1MRESWHA86B5B|         5|
|A2X8NZUNAWX9SO|         7|
|A2WY7M2G4FUK9Y|         8|
|A2JL1GIC0JAFW9|        15|
+--------------+----------+
only showing top 20 rows

Item counts with >= 5 users:

+----------+----------+
|      item|item_count|
+----------+----------+
|B00000J3Q1|        14|
|B00001W0DC|        15|
|B00003OPEV|         6|
|B00005853W|         5|
|B00005Q5U5|        33|
|B00005T3Z7|        16|
|B000068UY7|        24|
|B00006JLOT|        18|
|B000083GPS|         5|
|B00008WIX2|         6|
|B00008ZPN3|         8|
|B00009R6FQ|         5|
|B0000AKACN|         9|
|B0000E6FY7|         5|
|B0000UV0IQ|         8|
|B0001CLYAW|         6|
|B00021EE4U|        67|
|B00021Z98A|        14|
|B0002D05RI|         5|
|B0002D6PNQ|         8|
+----------+----------+
only showing top 20 rows

## **ALS Model Training & Evaluation**
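
Spark's ALS expects numeric user and item columns, which is why the cell below assigns an integer index to every `reviewerID` and `asin`. The same idea in plain Python, as a small sketch with hypothetical helper names, also shows the reverse lookup needed later to turn a recommended `item_id` back into an `asin`.

```python
def build_index(values):
    """Map each distinct value to an integer index in first-seen order."""
    index = {}
    for value in values:
        if value not in index:
            index[value] = len(index)
    return index


# A few ASINs from the outputs above, with one repeated on purpose.
asins = ["0528881469", "B00000J3Q1", "0528881469", "B00001W0DC"]

item_to_id = build_index(asins)
id_to_item = {idx: asin for asin, idx in item_to_id.items()}

print(item_to_id)
print(id_to_item[1])
```

Keeping the reverse mapping alongside the saved model lets the API return product identifiers instead of internal indices.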

In [4]:
from pyspark.ml.recommendation import ALS
from pyspark.ml.evaluation import RegressionEvaluator

# Assign each distinct user string an integer index, since ALS requires numeric IDs
users = (
    filtered_df.select("user")
    .distinct()
    .rdd
    .zipWithIndex()
    .map(lambda x: (x[0][0], x[1]))  # extract string from Row
    .toDF(["user", "user_id"])
)

# Assign each distinct item string an integer index in the same way
items = (
    filtered_df.select("item")
    .distinct()
    .rdd
    .zipWithIndex()
    .map(lambda x: (x[0][0], x[1]))
    .toDF(["item", "item_id"])
)

# Join on correct columns (user and item are strings, so joins work)
als_df = filtered_df.join(users, "user").join(items, "item")

# Split dataset
(training, test) = als_df.randomSplit([0.8, 0.2], seed=42)

# ALS model setup and fit
als = ALS(userCol="user_id", itemCol="item_id", ratingCol="rating", coldStartStrategy="drop")
model = als.fit(training)

# Predict and evaluate
predictions = model.transform(test)
evaluator = RegressionEvaluator(metricName="rmse", labelCol="rating", predictionCol="prediction")
rmse = evaluator.evaluate(predictions)
print(f"RMSE: {rmse:.4f}")

25/07/12 20:10:59 WARN InstanceBuilder: Failed to load implementation from:dev.ludovic.netlib.blas.JNIBLAS
25/07/12 20:10:59 WARN InstanceBuilder: Failed to load implementation from:dev.ludovic.netlib.blas.VectorBLAS
25/07/12 20:11:00 WARN InstanceBuilder: Failed to load implementation from:dev.ludovic.netlib.lapack.JNILAPACK

RMSE: 1.4286
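
For reference, RMSE is the square root of the mean squared difference between actual and predicted ratings, so a value of about 1.43 on a 1 to 5 scale indicates that predictions are often off by more than one star. The following is a minimal sketch of the calculation on made-up numbers (not taken from the dataset).

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equal-length rating lists."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Toy ratings for illustration only.
actual = [5.0, 1.0, 3.0, 2.0]
predicted = [4.0, 2.0, 3.0, 4.0]

# Squared errors are 1, 1, 0, 4, so the mean is 1.5.
print(rmse(actual, predicted))
```

Because the errors are squared before averaging, a few large misses (such as the 2-star item predicted as 4) raise the score more than many small ones.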



                                                                                

## **Generating and Storing Top-N Recommendations**

In [7]:
top_n = model.recommendForAllUsers(10)
top_n_named = top_n.join(users, "user_id").select("user", "recommendations")
top_n_named.write.json("top_10_recommendations.json")
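
Note that Spark's `DataFrame.write.json` writes a directory of part files in JSON Lines format (one JSON object per line) rather than a single JSON document. The sketch below flattens one such line into `(user, item_id, rating)` rows. The sample line is made up for illustration, with field names that follow the `user`, `item_id`, and `rating` columns used above.

```python
import json


def flatten_recommendations(json_line):
    """Turn one JSON Lines record into (user, item_id, rating) tuples."""
    record = json.loads(json_line)
    return [
        (record["user"], rec["item_id"], rec["rating"])
        for rec in record["recommendations"]
    ]


# Hypothetical record in the shape produced by the cell above.
sample_line = (
    '{"user": "A18FTRFQQ141CP", "recommendations": '
    '[{"item_id": 12, "rating": 4.8}, {"item_id": 7, "rating": 4.5}]}'
)

rows = flatten_recommendations(sample_line)
print(rows)
```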

                                                                                

In [8]:
model.save("/media/amazonratings/als_model")

                                                                                

###  **Uploading to Azure Blob**

In [9]:
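# top_10_recommendations.json is a directory of part files, so upload it as a batch
!az storage blob upload-batch \
  --account-name amazondatastore \
  --destination recommender-data \
  --destination-path top_10_recommendations.json \
  --source top_10_recommendations.json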

/usr/bin/sh: 1: az: not found


### Optional: Save to Azure SQL Database

In [None]:
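import glob
from sqlalchemy import create_engine
import pandas as pd

# Spark wrote a directory of JSON Lines part files, so read each part and concatenate
part_files = sorted(glob.glob("top_10_recommendations.json/part-*.json"))
df = pd.concat((pd.read_json(p, lines=True) for p in part_files), ignore_index=True)
df_exploded = df.explode("recommendations")
# Each recommendation is a struct with the ALS column names: item_id and rating
df_exploded["item_id"] = df_exploded["recommendations"].apply(lambda x: x["item_id"])
df_exploded["score"] = df_exploded["recommendations"].apply(lambda x: x["rating"])

engine = create_engine("mssql+pyodbc://umais:password@amazonrecommendersql.database.windows.net/recommenderdb?driver=ODBC+Driver+17+for+SQL+Server")
df_exploded[["user", "item_id", "score"]].to_sql("Recommendations", con=engine, if_exists="replace", index=False)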

# Conclusion & Next Steps

In this proposal, I:

- Outlined the project details and implementation plan
- Processed a large-scale dataset using PySpark on Azure VM (Docker)
- Trained and evaluated a collaborative filtering model using ALS
- Generated personalized top-N recommendations
- Saved the model on the mounted drive
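- Drafted the steps for storing results in Azure Blob and Azure SQL for scalable access (the Blob upload cell did not run because the Azure CLI was not installed in the container)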

# More Work to Do in the Final

**Model Optimization and Tuning:**

Further refine ALS hyperparameters and experiment with alternative recommendation algorithms (e.g., matrix factorization variants or deep learning approaches) to improve accuracy and scalability.

**Develop and Deploy the Recommendation API:**

Build the Flask API to serve personalized recommendations based on the trained ALS model. This includes endpoints for real-time querying, error handling, and security features.
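
The core of such an endpoint is a lookup that returns a user's stored recommendations or a clear error. The sketch below keeps that logic independent of any web framework so it can be unit tested. The function name, the in-memory `store`, and the response shape are all assumptions for illustration. In a Flask view, the payload could be wrapped with `jsonify` and returned together with the status code.

```python
def get_recommendations(store, user, n=10):
    """Return (payload, http_status) for a recommendation request.

    `store` is a hypothetical dict mapping a user ID to a ranked list
    of recommendation dicts.
    """
    if not isinstance(n, int) or n < 1:
        return {"error": "n must be a positive integer"}, 400
    if user not in store:
        return {"error": f"unknown user: {user}"}, 404
    return {"user": user, "recommendations": store[user][:n]}, 200


# Made-up scores; the IDs are taken from the outputs above.
store = {
    "A18FTRFQQ141CP": [
        {"item": "B00000J3Q1", "score": 4.8},
        {"item": "B00001W0DC", "score": 4.5},
    ]
}

print(get_recommendations(store, "A18FTRFQQ141CP", n=1))
print(get_recommendations(store, "missing-user"))
```

Returning an explicit 404 for unknown users matters here because the model was trained with `coldStartStrategy="drop"`, so users outside the training data have no stored recommendations.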

**Frontend Development:**

Create and enhance the web application UI to visualize recommendations dynamically and improve user experience.

**Real-Time Recommendations:**

Implement streaming data ingestion and real-time model updates to provide up-to-date recommendations for users as new interactions occur.

**API Enhancement:**

Extend the Flask API with authentication, rate limiting, and caching layers to improve security, scalability, and response times.
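
Rate limiting can be implemented in several ways. One common approach is a token bucket, sketched below in plain Python as an illustration rather than a chosen design. The class name and parameters are hypothetical, and the injectable `clock` exists only to make the behavior easy to test.

```python
import time


class TokenBucket:
    """Allow up to `capacity` requests in a burst, refilled at a steady rate."""

    def __init__(self, capacity, refill_per_second, clock=time.monotonic):
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self.clock = clock
        self.tokens = float(capacity)
        self.last_refill = clock()

    def allow(self):
        """Return True and consume a token if the request may proceed."""
        now = self.clock()
        elapsed = now - self.last_refill
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_second)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


# Burst of three requests against a bucket that holds two tokens.
bucket = TokenBucket(capacity=2, refill_per_second=1)
print([bucket.allow() for _ in range(3)])
```

In the API, one bucket per client (for example, per API key) could be consulted before the recommendation lookup, returning HTTP 429 when `allow()` is false.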

**Frontend Improvements:**

Enhance the web application with advanced visualization features, user personalization, and responsive design for better user engagement.

**Robust Deployment:**

Automate deployment pipelines using Azure DevOps or GitHub Actions, implement monitoring and alerting for production readiness, and ensure high availability with load balancing.

**Security and Compliance:**

Strengthen network security using Azure Private Endpoints, implement data encryption at rest and in transit, and ensure compliance with relevant data protection standards.

**Scalability and Cost Optimization:**

Explore container orchestration with Azure Kubernetes Service (AKS) for better scalability and manage costs using Azure Cost Management tools.

## **Final Deliverable**

- Jupyter Notebook on GitHub with a link to the recording and the project implementation details.
- Recorded PowerPoint presentation demonstrating the working recommender system.
- Submission of Project 6 and the Final Project together, using the above .ipynb file, which includes all the work.
