In [0]:
#Project Summary: Building a Product Recommendation System Using K-means Clustering in PySpark
#Objective:
#The goal of this project is to build a recommendation system using K-means clustering rather than a traditional collaborative filtering method such as ALS (alternating least squares). We cluster user-product interactions (user_id, product_id, and rating) and then use the resulting clusters to recommend products to users. The dataset includes users, products, ratings, and product details such as product_name and segment.

#Dataset:
#The dataset contains 2000 rows with the following columns:
user_id: The unique identifier for each user.
product_id: The unique identifier for each product.
rating: The rating given by the user for a particular product (on a scale from 1 to 5).
product_name: The name of the product.
segment: The category or segment to which the product belongs (e.g., Electronics, Clothing, etc.).
The dataset is used to understand user-product interactions and generate product recommendations.


#Steps Involved:
#Data Preprocessing:
#1. Load the dataset into a Spark DataFrame. The code below queries the amazon_ratings table with spark.sql(); a CSV file could be loaded with spark.read.csv() instead.
#2. Create a feature vector from the user_id, product_id, and rating columns for clustering. The product_name and segment columns are retained for displaying the final recommendations.

#Clustering Using K-means:
#1. Apply K-means clustering to group user-product interactions. The number of clusters is set to 5 in this example.
#2. K-means uses the feature vector (user_id, product_id, and rating) to partition the rating rows into clusters whose feature values lie close to one another.
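
The clustering step described above can be illustrated without Spark. The following is a minimal sketch of the basic K-means loop (assign each point to its nearest center, then move each center to the mean of its points), assuming plain Euclidean distance and hand-picked starting centers. The four toy rows and the helper names `assign`, `update`, and `kmeans` are made up for illustration; Spark's `KMeans` uses its own initialization (`k-means||` by default) and a distributed implementation.

```python
import math

def assign(points, centers):
    """Return, for each point, the index of its nearest center (Euclidean distance)."""
    return [
        min(range(len(centers)), key=lambda j: math.dist(p, centers[j]))
        for p in points
    ]

def update(points, labels, centers):
    """Move each center to the mean of its assigned points; leave empty clusters in place."""
    new_centers = []
    for j, old in enumerate(centers):
        members = [p for p, lab in zip(points, labels) if lab == j]
        if members:
            new_centers.append(tuple(sum(c) / len(members) for c in zip(*members)))
        else:
            new_centers.append(old)
    return new_centers

def kmeans(points, centers, max_iter=20):
    """Alternate assignment and update steps until the centers stop moving."""
    for _ in range(max_iter):
        new_centers = update(points, assign(points, centers), centers)
        if new_centers == centers:
            break
        centers = new_centers
    return centers, assign(points, centers)

# Hypothetical (user_id, product_id, rating) rows, not taken from the dataset
points = [(10.0, 20.0, 4.0), (12.0, 22.0, 2.0), (90.0, 80.0, 5.0), (88.0, 82.0, 1.0)]
centers, labels = kmeans(points, [points[0], points[2]])
print(centers, labels)
```

On this toy input the two low-ID rows end up in one cluster and the two high-ID rows in the other, even though their ratings differ widely.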

#Evaluating Clustering:
#1. Evaluate the quality of the clustering using the Silhouette score, which ranges from -1 to 1. A higher score indicates more compact, better-separated clusters.

#Generating Recommendations:
#1. After clustering, aggregate the ratings for each product within each cluster and rank the products by average rating.
#2. Recommend the top-ranked products (highest average rating) from each cluster to users.

#Personalized Recommendations for Users:
#1. For a given user (e.g., user_id = 52), identify the cluster the user belongs to, then recommend the top products from that cluster that the user has not yet rated. The product_name and segment columns can be used to enrich the displayed results.
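
Because the feature vector is built per rating row, a user with several ratings can have rows in more than one cluster, so "the cluster they belong to" needs a rule. One option, assumed for this sketch and not necessarily what the notebook code does, is a majority vote over the user's rows. The `(user_id, cluster)` pairs and the helper `majority_cluster` below are hypothetical.

```python
from collections import Counter

# Hypothetical (user_id, cluster) pairs, one per rating row
row_clusters = [(52, 1), (52, 1), (52, 0), (93, 3), (93, 3)]

def majority_cluster(user_id, rows):
    """Return the cluster holding most of the user's rows, or None if the user has no rows.

    Ties go to the smallest cluster index so the result is deterministic.
    """
    counts = Counter(cluster for uid, cluster in rows if uid == user_id)
    if not counts:
        return None
    return min(counts, key=lambda c: (-counts[c], c))

cluster_52 = majority_cluster(52, row_clusters)
print(cluster_52)
```

Returning `None` for an unknown user lets the caller handle the "user not found" case explicitly.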


#Key Features of the Model:
1. Clustering: K-means groups user-product interactions with similar feature values.
2. Top Recommendations: Products are ranked by average rating within a user’s cluster, giving group-based recommendations.
3. Personalized Recommendations: Each user receives recommendations drawn from their own cluster, excluding products they have already rated.
4. Product Information: The product_name and segment columns are available to make recommendations more informative.

In [0]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import avg, col, lit
from pyspark.sql.types import DoubleType, StringType
from pyspark.ml.feature import VectorAssembler, StandardScaler
from pyspark.ml.clustering import KMeans
from pyspark.ml.evaluation import ClusteringEvaluator

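# Reuse the notebook's active Spark session, or create one if none exists
spark = SparkSession.builder.getOrCreate()

# Query the data using SQL (assumes a table or view named amazon_ratings is registered)
df = spark.sql("SELECT * FROM amazon_ratings")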

#df.printSchema()
df.show(5)

# Step 3: Data Preprocessing
# Cast columns to appropriate data types
df = df.withColumn("user_id", col("user_id").cast(DoubleType())) \
       .withColumn("product_id", col("product_id").cast(DoubleType())) \
       .withColumn("rating", col("rating").cast(DoubleType())) \
       .withColumn("product_name", col("product_name").cast(StringType())) \
       .withColumn("segment", col("segment").cast(StringType()))

# Check updated schema
df.printSchema()

+-------+----------+------+------------+-----------+
|user_id|product_id|rating|product_name|    segment|
+-------+----------+------+------------+-----------+
|     52|        76|     2|  Product_76|Electronics|
|     93|        46|     2|  Product_46|Electronics|
|     15|        32|     5|  Product_32|   Clothing|
|     72|        79|     4|  Product_79|      Books|
|     61|        80|     2|  Product_80|       Toys|
+-------+----------+------+------------+-----------+
only showing top 5 rows

root
 |-- user_id: double (nullable = true)
 |-- product_id: double (nullable = true)
 |-- rating: double (nullable = true)
 |-- product_name: string (nullable = true)
 |-- segment: string (nullable = true)



In [0]:

# Step 4: Feature Engineering
# Use VectorAssembler to create feature vectors for clustering
assembler = VectorAssembler(inputCols=["user_id", "product_id", "rating"], outputCol="features")
df_features = assembler.transform(df)
df_features.show()

+-------+----------+------+------------+-----------+----------------+
|user_id|product_id|rating|product_name|    segment|        features|
+-------+----------+------+------------+-----------+----------------+
|   52.0|      76.0|   2.0|  Product_76|Electronics| [52.0,76.0,2.0]|
|   93.0|      46.0|   2.0|  Product_46|Electronics| [93.0,46.0,2.0]|
|   15.0|      32.0|   5.0|  Product_32|   Clothing| [15.0,32.0,5.0]|
|   72.0|      79.0|   4.0|  Product_79|      Books| [72.0,79.0,4.0]|
|   61.0|      80.0|   2.0|  Product_80|       Toys| [61.0,80.0,2.0]|
|   21.0|      54.0|   3.0|  Product_54|      Books| [21.0,54.0,3.0]|
|   83.0|      86.0|   3.0|  Product_86|Electronics| [83.0,86.0,3.0]|
|   87.0|      92.0|   5.0|  Product_92|   Clothing| [87.0,92.0,5.0]|
|   75.0|      20.0|   5.0|  Product_20|       Toys| [75.0,20.0,5.0]|
|   75.0|      33.0|   1.0|  Product_33|       Home| [75.0,33.0,1.0]|
|   88.0|      74.0|   1.0|  Product_74|      Books| [88.0,74.0,1.0]|
|  100.0|      40.0|

In [0]:
# Step 5: Train K-means Model
# Define K-means clustering model with a specified number of clusters (e.g., k=5)
kmeans = KMeans(featuresCol="features", predictionCol="prediction", k=5, seed=42)
model = kmeans.fit(df_features)

In [0]:
# Step 6: Evaluate Clustering
# Make predictions
predictions = model.transform(df_features)

# Evaluate clustering using Silhouette score
evaluator = ClusteringEvaluator(featuresCol="features", predictionCol="prediction", metricName="silhouette")
silhouette_score = evaluator.evaluate(predictions)
print(f"Silhouette Score: {silhouette_score}")

Silhouette Score: 0.5628003893884516


The Silhouette Score evaluates clustering quality by comparing how close each point is to the other points in its own cluster (cohesion) with how close it is to the points in the nearest other cluster (separation). The score ranges from -1 to 1, with the following interpretations:

+1: The point is well matched to its own cluster and poorly matched to neighboring clusters.
0: The point lies on or very close to the boundary between two clusters.
-1: The point has probably been assigned to the wrong cluster.

The value reported above is the mean over all points; a score of about 0.56 suggests reasonably well-separated clusters.
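
To make the definition concrete, the sketch below computes the textbook silhouette coefficient, s = (b - a) / max(a, b), where a is a point's mean dissimilarity to the other points in its own cluster and b is its mean dissimilarity to the nearest other cluster. It uses squared Euclidean distance, which is the default `distanceMeasure` of Spark's `ClusteringEvaluator`. The toy points and the helper names are assumptions for illustration, and Spark's distributed implementation is not reproduced here.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def silhouette(points, labels):
    """Mean silhouette coefficient; expects at least two distinct cluster labels."""
    scores = []
    for i, (p, lab) in enumerate(zip(points, labels)):
        own = [
            sq_dist(p, q)
            for j, (q, lab_q) in enumerate(zip(points, labels))
            if lab_q == lab and j != i
        ]
        if not own:
            scores.append(0.0)  # common convention for a single-point cluster
            continue
        a = sum(own) / len(own)
        b = min(
            sum(sq_dist(p, q) for q, lab_q in zip(points, labels) if lab_q == other)
            / labels.count(other)
            for other in set(labels)
            if other != lab
        )
        scores.append(0.0 if max(a, b) == 0 else (b - a) / max(a, b))
    return sum(scores) / len(scores)

# Two tight, well-separated groups (hypothetical points)
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good_score = silhouette(points, [0, 0, 1, 1])  # labels match the groups
bad_score = silhouette(points, [0, 1, 0, 1])   # labels cut across the groups
print(good_score, bad_score)
```

With labels that match the two groups the score is close to 1; with labels that cut across them it turns negative.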

In [0]:

# Step 7: Generate Recommendations for a Specific User
# Specify the user ID for which to generate recommendations
specific_user_id = 52  # Replace with a valid user_id from your dataset

# Select this user's rows once and check that the user exists in the dataset
user_rows = predictions.filter(col("user_id") == specific_user_id)
user_exists = user_rows.count()

if user_exists == 0:
    print(f"User {specific_user_id} does not exist in the dataset.")
else:
    # Each rating row has its own cluster label, so a user with several ratings
    # may appear in more than one cluster; this takes the label of the first row returned
    user_cluster = user_rows.select("prediction").first()["prediction"]
    print(f"User {specific_user_id} belongs to cluster {user_cluster}.")

    # Keep every rating row assigned to the same cluster
    users_in_cluster = predictions.filter(col("prediction") == user_cluster)

    # Average the ratings of each product within that cluster
    products_in_cluster = users_in_cluster.groupBy("product_id").agg(avg("rating").alias("avg_rating"))

    # Get the products already rated by the specific user
    user_products = [row["product_id"] for row in user_rows.select("product_id").distinct().collect()]

    # Filter out products already rated by the specific user
    recommendations = products_in_cluster.filter(~col("product_id").isin(user_products)) \
                                         .orderBy(col("avg_rating").desc())

    # Show top 10 product recommendations
    print(f"Top 10 recommendations for user {specific_user_id}:")
    recommendations.select("product_id", "avg_rating").show(10)

User 52 belongs to cluster 1.
Top 10 recommendations for user 52:
+----------+------------------+
|product_id|        avg_rating|
+----------+------------------+
|      64.0| 4.142857142857143|
|      51.0|               4.0|
|      79.0|               4.0|
|      82.0|               3.8|
|      63.0|3.6666666666666665|
|      52.0|               3.6|
|      94.0|               3.5|
|      90.0|               3.5|
|      69.0|3.4545454545454546|
|      83.0|3.3333333333333335|
+----------+------------------+
only showing top 10 rows



This step generates product recommendations for a specific user from the cluster assignments. The script first checks whether the specified user (specific_user_id) exists in the dataset. If the user is not found, it prints a message and stops.

If the user exists, the script reads the user's cluster (user_cluster) from the predictions made by the K-means model. It then selects all rating rows in that cluster and computes the average rating of each product. Products the user has already rated are excluded so that the recommendations are new to the user. The remaining products are sorted by average rating in descending order, and the top 10 are displayed with their product IDs and average ratings. The recommendations therefore reflect the ratings of interactions grouped in the same cluster, without repeating products the user has already rated.
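
The ranking and filtering logic described above can be sketched in plain Python. The rows and the `recommend` helper below are hypothetical and only mirror the steps: restrict to the user's cluster, drop products the user has already rated, average the remaining ratings per product, and sort in descending order. Breaking ties by product ID is an added assumption to keep the order deterministic.

```python
from collections import defaultdict

# Hypothetical rows: (user_id, product_id, rating, cluster)
rows = [
    (52, 76, 2.0, 1),
    (80, 64, 5.0, 1),
    (81, 64, 4.0, 1),
    (82, 79, 4.0, 1),
    (83, 76, 5.0, 1),
    (15, 32, 5.0, 2),
]

def recommend(user_id, user_cluster, rows, top_n=10):
    """Return up to top_n (product_id, avg_rating) pairs from the user's cluster."""
    already_rated = {pid for uid, pid, _, _ in rows if uid == user_id}
    ratings = defaultdict(list)
    for _, pid, rating, cluster in rows:
        if cluster == user_cluster and pid not in already_rated:
            ratings[pid].append(rating)
    averages = {pid: sum(r) / len(r) for pid, r in ratings.items()}
    # Highest average first; ties broken by product_id
    return sorted(averages.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]

top = recommend(52, 1, rows)
print(top)
```

Product 76 is left out because user 52 has already rated it, and product 32 is left out because it belongs to a different cluster.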

In [0]:
# Step 8: Analyze Cluster Centers
# Extract and display cluster centers
# model.clusterCenters() returns the coordinates of the centroids
centers = model.clusterCenters()
print("Cluster Centers:")
# Print each center with its corresponding cluster index (Cluster 0, Cluster 1, etc.)
for idx, center in enumerate(centers):
    print(f"Cluster {idx}: {center}")

Cluster Centers:
Cluster 0: [51.92307692 31.77948718  2.96410256]
Cluster 1: [77.20048309 76.41545894  2.99758454]
Cluster 2: [17.28571429 25.53506494  3.14805195]
Cluster 3: [85.04610951 23.85590778  2.99423631]
Cluster 4: [25.21336207 77.80172414  2.99353448]


Cluster Centers Breakdown

Each center lists the cluster's mean user_id (Feature 1), mean product_id (Feature 2), and mean rating (Feature 3), in the order of the inputCols passed to VectorAssembler.

Cluster 0: [51.92, 31.78, 2.96]
Feature 1 (user_id): 51.92
Feature 2 (product_id): 31.78
Feature 3 (rating): 2.96
This cluster has mid-range user IDs, lower product IDs, and a mean rating of 2.96.

Cluster 1: [77.20, 76.42, 2.99]
Feature 1 (user_id): 77.20
Feature 2 (product_id): 76.42
Feature 3 (rating): 2.99
This cluster has high user IDs and high product IDs, with a mean rating of 2.99.

Cluster 2: [17.29, 25.54, 3.15]
Feature 1 (user_id): 17.29
Feature 2 (product_id): 25.54
Feature 3 (rating): 3.15
This cluster has low user IDs and low product IDs, and the highest mean rating of the five (3.15).

Cluster 3: [85.05, 23.86, 2.99]
Feature 1 (user_id): 85.05
Feature 2 (product_id): 23.86
Feature 3 (rating): 2.99
This cluster has very high user IDs, low product IDs, and a mean rating of 2.99.

Cluster 4: [25.21, 77.80, 2.99]
Feature 1 (user_id): 25.21
Feature 2 (product_id): 77.80
Feature 3 (rating): 2.99
This cluster has low user IDs, very high product IDs, and a mean rating of 2.99.

#What This Shows:
1. Each cluster center is the mean position of all points in that cluster in the feature space.
2. Feature 1 is user_id and Feature 2 is product_id, following the inputCols order given to VectorAssembler; Feature 3 is the mean rating of the rows in the cluster.
3. Cluster 1, for example, consists of rows with high values for both Feature 1 and Feature 2 and a mean rating of 2.99, which is close to that of most other clusters.
4. Cluster 2, by contrast, consists of rows with low values for Feature 1 and Feature 2 and the highest mean rating (3.15).

#How This Information Is Useful:
1. Cluster Profiles: The centers summarize the typical feature values of each cluster. Because the mean ratings differ only slightly (2.96 to 3.15), the clusters here are separated mainly by user_id and product_id ranges, so behavioral labels such as "selective" or "mainstream" users would need further evidence.
2. Business Applications: Recommendations can be tailored to a user's assigned cluster. If a user belongs to Cluster 1, for example, products rated highly within that cluster can be recommended.
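
The cluster centers above differ mainly in their first two coordinates, while the third (rating) stays close to 3 in every cluster. One possible explanation, offered as an assumption rather than a finding of this notebook, is feature scale: in the sample rows the IDs run up to 100 while ratings run from 1 to 5, so unscaled Euclidean distance is dominated by the ID columns. The sketch below standardizes each column (z-score) and compares how much of the squared distance between two rows comes from the rating before and after scaling. The helper names are hypothetical; the five rows are copied from the df_features output shown earlier. In Spark, the `StandardScaler` that the notebook imports but does not apply would be the closest counterpart.

```python
from statistics import mean, pstdev

# First five rows of the df_features output: (user_id, product_id, rating)
rows = [
    (52.0, 76.0, 2.0),
    (93.0, 46.0, 2.0),
    (15.0, 32.0, 5.0),
    (72.0, 79.0, 4.0),
    (61.0, 80.0, 2.0),
]

def standardize(rows):
    """Rescale every column to zero mean and unit (population) standard deviation."""
    cols = list(zip(*rows))
    mus = [mean(c) for c in cols]
    sigmas = [pstdev(c) for c in cols]
    return [tuple((v - m) / s for v, m, s in zip(row, mus, sigmas)) for row in rows]

def rating_share(p, q):
    """Fraction of the squared Euclidean distance between p and q due to the rating column."""
    parts = [(a - b) ** 2 for a, b in zip(p, q)]
    return parts[2] / sum(parts)

raw_share = rating_share(rows[0], rows[2])
scaled = standardize(rows)
scaled_share = rating_share(scaled[0], scaled[2])
print(f"rating share of squared distance: raw={raw_share:.4f}, scaled={scaled_share:.4f}")
```

For this pair of rows the rating contributes well under 1% of the raw squared distance and a much larger share after standardization. Whether scaling would improve the recommendations is a separate question that this sketch does not answer.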

In [0]:
#Conclusion:
#This project demonstrates how K-means clustering can be used as an alternative to traditional collaborative filtering techniques such as ALS for generating product recommendations. By clustering user-product interactions and recommending the highest-rated products from a user's cluster, the system offers a simple, group-based recommendation approach. Although K-means clustering may not capture latent factors as effectively as ALS, it still provides a useful way to deliver cluster-based recommendations from user-product interactions.