<a href="https://colab.research.google.com/github/Harshtyagi001/recommendation-grid/blob/main/BTP_Recommendation_System_TRAINING_%26_KAFKA.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **PERSONALISED RECOMMENDATION SYSTEM**

We took on the challenge of building a personalized recommender engine similar to those used by e-commerce platforms such as Amazon and Flipkart.

Our spin was to try to use real-time clickstream data instead of publicly available datasets, with Apache Kafka as the processing pipeline. Clickstream events are ingested (collected and imported) by Kafka and processed to filter out noise, after which two things happen:
1. The data is streamed directly to the model during the login session, allowing for the generation of instant relevant recommendations.
2. Simultaneously, the data is stored in our database for subsequent use in model training. This stored data is provided to the model in batches.


---




# Overview

In this project, we used collaborative filtering, a widely used technique that draws on data about how users interact with items to offer personalized recommendations. Our system predicts user preferences for items based on their past interactions and their similarity to other users.

To assess our system's effectiveness, we adopted a customized evaluation metric. Our approach combines click-through rate and diversity index, enabling us to calculate the Personalized Click-Through Diversity Index (PCDI) and better measure the system's performance.

As mentioned, we tried to integrate Apache Kafka to collect real-time user data.




---


### **METHODOLOGY**

Our approach involves several steps:

1. Data Preprocessing: We'll preprocess the user interaction data, including clicks, add-to-cart actions, and ratings, to create a utility matrix.
2. Collaborative Filtering: We'll train a collaborative filtering model using the Surprise library, which will learn user and item embeddings to make personalized recommendations.
3. Evaluation: We'll evaluate the model's performance using the PCDI metric, which combines click-through rate and diversity of recommendations.
4. Interpretation: We'll analyze the results and gain insights into the effectiveness of the recommendation system.
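
As a minimal sketch of the utility matrix mentioned in step 1, the function below collects `(user_id, product_id, rating)` triples into a nested dictionary. The name `build_utility_matrix` and the choice to average repeated ratings for the same user and product are assumptions made for illustration, not something the notebook prescribes.

```python
from collections import defaultdict


def build_utility_matrix(interactions):
    """Map user_id -> {product_id: mean rating} from (user, product, rating) triples.

    Averaging repeated ratings for the same pair is an illustrative assumption.
    """
    sums = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(lambda: defaultdict(int))
    for user_id, product_id, rating in interactions:
        sums[user_id][product_id] += rating
        counts[user_id][product_id] += 1
    return {
        user: {item: sums[user][item] / counts[user][item] for item in items}
        for user, items in sums.items()
    }


# Hypothetical interactions: user 1 rated product 10 twice
demo_interactions = [(1, 10, 4), (1, 10, 2), (1, 11, 5), (2, 10, 3)]
utility = build_utility_matrix(demo_interactions)
print(utility)
```

Missing entries simply stay absent from the inner dictionaries, which mirrors the sparsity of a real user-item matrix.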

### **Data**

To tackle the cold-start problem, we used a publicly available dataset to recommend popular products. We then planned to gather clickstream data, but since this proved a daunting task, we ultimately simulated clickstream data by taking user input and by generating a dataset in Python with the Faker library.

Here is a flowchart of our methodology.

![image](https://github.com/priyansh2120/recommendation-grid/assets/96059277/506b31c1-33d3-4607-99a1-ea5af68c2d0f)


## **Importing Necessary Libraries**

Here is an explanation of a few of the libraries we used:

- **Surprise Library**: Surprise is a library for building recommendation systems. We'll use it to create collaborative filtering models and make user-product recommendations.

- **SQLAlchemy and SQLite**: These libraries enable interaction with SQLite databases. We'll use them to store and retrieve data when necessary.

- **Confluent Kafka**: Kafka is a streaming platform, and Confluent Kafka is a Python client for Kafka. We'll use it to simulate real-time data interactions in our recommendation system.

- **Faker**: Faker is a library for generating synthetic data. We'll use it for now to create artificial user interactions for testing and development.

In [None]:
# Install third-party packages before importing them
!pip install surprise confluent_kafka faker kafka

# Standard library
import sqlite3

# Third-party libraries
import numpy as np
import pandas as pd
import sqlalchemy
from confluent_kafka import Producer, Consumer
from faker import Faker
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.model_selection import train_test_split
from surprise import Dataset, Reader, SVD


# **Generating Synthetic Data for Testing**

Obtaining real user interactions and preferences is challenging during the initial stages of development. To address this, we created a function called `generate_synthetic_data`. We could have used an existing dataset, but our goal was to avoid relying on one.

**Function Purpose:**
The `generate_synthetic_data` function leverages the Faker library to generate synthetic user interactions for testing purposes. These interactions include user clicks, product add-to-cart actions, ratings, and timestamps. The generated data will be diverse to simulate real world scenarios.

***The generated data mimics real-world user behavior and allows us to validate the functionality of our recommendation system under various scenarios.***


**Data Storage:**
Once all interactions are generated, the function creates a Pandas DataFrame to organize the synthetic data. This DataFrame is then saved as a CSV file named 'synthetic_data.csv'.

In [None]:
def generate_synthetic_data():
    fake = Faker()
    # Generate synthetic user interactions
    interactions = []
    for _ in range(10000):
        user_id = fake.random_int(min=1, max=100)
        product_id = fake.random_int(min=1, max=1000)
        click = fake.random_int(min=0, max=12)
        add_to_cart = fake.random_element(elements=('added', 'not_added'))
        rating = fake.random_int(min=1, max=5)
        timestamp = fake.date_time_between(start_date='-1y', end_date='now')
        interactions.append({
            'user_id': user_id,
            'product_id': product_id,
            'click': click,
            'add_to_cart': add_to_cart,
            'rating': rating,
            'timestamp': timestamp
        })
    # Save synthetic data as CSV
    df = pd.DataFrame(interactions)
    df.to_csv('synthetic_data.csv', index=False)  # index=False argument ensures that the row index is not included in the CSV file

To generate the synthetic data, we simply call the `generate_synthetic_data` function.

In [None]:
# Generate synthetic data
generate_synthetic_data()
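
Because the generated rows change on every run, results are hard to compare between runs. The sketch below shows one way to make the generation reproducible. It swaps Faker for the standard library's `random` module so that it runs without extra dependencies; the function name `generate_rows`, the `seed` parameter, and the fixed reference date are all assumptions made for illustration.

```python
import random
from datetime import datetime, timedelta


def generate_rows(n, seed=42, now=None):
    """Return n synthetic interaction rows; the same seed yields the same rows."""
    rng = random.Random(seed)          # private generator, so global state is untouched
    now = now or datetime(2024, 1, 1)  # fixed reference date (assumption) keeps timestamps stable
    rows = []
    for _ in range(n):
        rows.append({
            'user_id': rng.randint(1, 100),
            'product_id': rng.randint(1, 1000),
            'click': rng.randint(0, 12),
            'add_to_cart': rng.choice(('added', 'not_added')),
            'rating': rng.randint(1, 5),
            # Any moment within the year before the reference date
            'timestamp': now - timedelta(seconds=rng.randint(0, 365 * 24 * 3600)),
        })
    return rows


sample_rows = generate_rows(5, seed=1)
print(sample_rows[0])
```

With the Faker-based function, calling `Faker.seed(...)` before creating the `Faker()` instance should serve the same purpose.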

## **Creating SQLite Database and Importing Synthetic Data**

We created the `create_sqlite_database` function to make an SQLite database and import the previously generated synthetic data.

**Function Purpose:**
The `create_sqlite_database` function does two things:
1. It connects to an SQLite database named **'recommendation.db'**.
2. It creates a table named **'user_interactions'** in the database to store user interactions like user IDs, product IDs, click actions, add-to-cart actions, ratings, and timestamps.

**How It Works:**
- The function starts by connecting to the SQLite database and getting a cursor object for executing SQL commands.
- It defines the structure of the 'user_interactions' table with specific columns for each type of interaction.
- The synthetic data stored in 'synthetic_data.csv' is then imported. The function uses Pandas to read the CSV file into a DataFrame and the `to_sql` method to insert the data into the 'user_interactions' table.
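- Because `to_sql` is called with `if_exists='replace'`, pandas drops the existing 'user_interactions' table and recreates it from the DataFrame. The new data therefore replaces any existing data, and the table's columns come from the CSV rather than from the `CREATE TABLE` statement (so the `id` column is not kept).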

**Data Storage and Management:**
The SQLite database acts as a central storehouse for user interaction details. This data is easily accessible for analysis, model training, and generating recommendations.
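
The effect of the `if_exists` argument can be checked on a small in-memory database. The sketch below is illustrative only: the trimmed-down column list and the `df_demo` DataFrame are assumptions, and it simply compares the table's columns after an `'append'` and after a `'replace'`.

```python
import sqlite3

import pandas as pd

conn = sqlite3.connect(':memory:')  # throwaway database for the demonstration
conn.execute('''
    CREATE TABLE user_interactions (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        user_id INTEGER,
        product_id INTEGER,
        rating INTEGER
    )
''')

df_demo = pd.DataFrame({'user_id': [1, 2], 'product_id': [10, 20], 'rating': [4, 5]})

# 'append' inserts into the existing table, so the declared schema (with id) survives
df_demo.to_sql('user_interactions', conn, if_exists='append', index=False)
columns_after_append = [row[1] for row in conn.execute('PRAGMA table_info(user_interactions)')]

# 'replace' drops the table and recreates it from the DataFrame's columns
df_demo.to_sql('user_interactions', conn, if_exists='replace', index=False)
columns_after_replace = [row[1] for row in conn.execute('PRAGMA table_info(user_interactions)')]

conn.close()
print(columns_after_append)
print(columns_after_replace)
```

If the auto-incrementing `id` column matters, `'append'` into a freshly created table is one option to consider.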

In [None]:
# Create SQLite database and import synthetic data
def create_sqlite_database():
    conn = sqlite3.connect('recommendation.db')
    cursor = conn.cursor()
    # Create table for user interactions
    cursor.execute('''
        CREATE TABLE IF NOT EXISTS user_interactions (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            user_id INTEGER,
            product_id INTEGER,
            click INTEGER,
            add_to_cart TEXT,
            rating INTEGER,
            timestamp DATETIME
        )
    ''')
    # Import synthetic data into the table
    df = pd.read_csv('synthetic_data.csv')
    df.to_sql('user_interactions', conn, if_exists='replace', index=False)
    conn.commit()
    conn.close()

To create the SQLite database and import synthetic data, we once again simply call the `create_sqlite_database()` function.

In [None]:
# Create SQLite database and import synthetic data
create_sqlite_database()

### **Verifying Database Creation and Data Retrieval**

To ensure the successful creation of the SQLite database, we run the following snippet. This code demonstrates how to connect to the database, extract data from the `user_interactions` table, and display the initial rows of the dataset.

**Verifying Database Connection:**
Before interacting with the database, we use the `sqlite3` module to establish a connection to the SQLite database file named `recommendation.db`.

Then, we retrieve the data, close the connection, and display the first few rows to confirm the successful creation and retrieval of data.

In [None]:
# Verify that the database was created
import sqlite3
conn = sqlite3.connect('recommendation.db')
df = pd.read_sql_query('SELECT * FROM user_interactions', conn)
conn.close()
df.head()

| | user_id | product_id | click | add_to_cart | rating | timestamp |
|---|---|---|---|---|---|---|
| 0 | 78 | 964 | 10 | added | 2 | 2023-10-13 02:47:56 |
| 1 | 92 | 830 | 9 | not_added | 4 | 2023-03-22 19:40:50 |
| 2 | 32 | 687 | 11 | added | 4 | 2023-05-14 19:52:16 |
| 3 | 22 | 287 | 1 | added | 5 | 2023-10-03 14:53:27 |
| 4 | 32 | 924 | 6 | not_added | 4 | 2023-01-27 08:32:01 |




---


# **Kafka Configuration for Click Stream Producer**

In this code snippet, we define a configuration dictionary called `kafka_config` that configures the Kafka producer. The producer is responsible for sending click stream events to a specific Kafka topic. A Kafka topic is a specific category to which records are published. The configuration dictionary includes two key-value pairs:

1. `'bootstrap.servers'`: This specifies the address and port of the Kafka broker (server that facilitates the communication between the producers and consumers).

2. `'client.id'`: This sets an identifier for the Kafka producer. In this case, it is set to `'clickstream-producer'`, serving as a unique name to identify the producer.

In [None]:
kafka_config = {
    'bootstrap.servers': 'localhost:9092',
    'client.id': 'clickstream-producer'
}

## **Simulating Click Stream Events and Adding to DataFrame**

In this section, we simulate capturing multiple click stream events and add them to the existing DataFrame. Because we had no real user base, and because of limited technical experience and budget, we simulate click stream events rather than capturing them as originally planned.

Each click stream event is represented by a `ClickStreamEvent` object containing the user ID, product ID, click count, add-to-cart status, rating, and timestamp. The simulated data is then appended to the existing DataFrame for further analysis and processing.

In [None]:
from datetime import datetime
import time
import pandas as pd

class ClickStreamEvent:
    def __init__(self, user_id, product_id, click, add_to_cart, rating, timestamp):
        self.user_id = user_id
        self.product_id = product_id
        self.click = click
        self.add_to_cart = add_to_cart
        self.rating = rating
        self.timestamp = timestamp

# Simulate capturing multiple click stream events
click_stream_events = []

num_events = int(input("Enter the number of click stream events: "))

for _ in range(num_events):
    user_id = input("Enter user ID: ")
    product_id = input("Enter product ID: ")
    click = int(input("Enter the number of clicks: "))  # Modified to accept the number of clicks as an integer
    add_to_cart = input("Did the user add to cart? (yes/no): ").lower() == "yes"
    rating = int(input("Enter rating (if given): "))
    timestamp = input("Enter timestamp (YYYY-MM-DD HH:MM:SS): ")

    event = ClickStreamEvent(user_id, product_id, click, add_to_cart, rating, timestamp)
    click_stream_events.append(event)

rows = []
for event in click_stream_events:
    new_row = {
        'user_id': event.user_id,
        'product_id': event.product_id,
        'click': event.click,
        'add_to_cart': event.add_to_cart,
        'rating': event.rating,
        'timestamp': event.timestamp
    }
    rows.append(new_row)

df = pd.concat([df, pd.DataFrame(rows)], ignore_index=True)

Enter the number of click stream events: 1
Enter user ID: 23
Enter product ID: 123
Enter the number of clicks: 2
Did the user add to cart? (yes/no): no
Enter rating (if given): 1
Enter timestamp (YYYY-MM-DD HH:MM:SS): 2023:12:23 11:22:23
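
Manually entered values are easy to get wrong: IDs arrive as strings, and a timestamp may not follow the requested `YYYY-MM-DD HH:MM:SS` layout. The sketch below shows one way to validate and normalize an event before it is stored. The names `ValidatedClickStreamEvent` and `parse_event` are hypothetical, and converting the yes/no answer to `'added'`/`'not_added'` is an assumption chosen to match the values in the synthetic dataset.

```python
from dataclasses import dataclass
from datetime import datetime

TIMESTAMP_FORMAT = '%Y-%m-%d %H:%M:%S'


@dataclass
class ValidatedClickStreamEvent:
    user_id: int
    product_id: int
    click: int
    add_to_cart: str
    rating: int
    timestamp: datetime


def parse_event(user_id, product_id, click, added, rating, timestamp):
    """Convert raw input values into a typed event, raising ValueError on bad input."""
    rating = int(rating)
    if not 1 <= rating <= 5:
        raise ValueError(f"rating must be between 1 and 5, got {rating}")
    return ValidatedClickStreamEvent(
        user_id=int(user_id),
        product_id=int(product_id),
        click=int(click),
        add_to_cart='added' if added else 'not_added',
        rating=rating,
        # strptime raises ValueError if the text does not match the expected layout
        timestamp=datetime.strptime(timestamp, TIMESTAMP_FORMAT),
    )


event = parse_event('23', '123', '2', False, '1', '2023-12-23 11:22:23')
print(event)
```

With this check in place, a timestamp typed as `2023:12:23 11:22:23` is rejected at entry time rather than at a later preprocessing step.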


### **Sending Click Stream Events to Kafka**

This section demonstrates how to use a Kafka producer to send simulated click stream events to a Kafka topic for further processing and analysis. The click stream events will be serialized into JSON format and then sent to the Kafka topic named `click_stream_topic`.

1. **Creating Kafka Producer**: We first create a Kafka producer using the provided `kafka_config` dictionary. The producer will be responsible for sending the click stream events to the Kafka topic.

2. **Processing and Sending Events**: We use a loop to iterate through the list of `click_stream_events`. For each event, we create a dictionary `event_data` containing the attributes of the event such as user ID, product ID, click status, add-to-cart status, rating, and timestamp. We then serialize this dictionary into JSON format using the `json.dumps()` function.

3. **Sending to Kafka**: Using the Kafka producer, we send the serialized event data to the `click_stream_topic` Kafka topic. We also use `producer.flush()` to ensure that the message is sent immediately.

4. **Print and Delay**: After sending each event, we print a message confirming the sent event's data and use `time.sleep(1)` to introduce a delay of 1 second before processing the next event.

5. **Releasing the Producer**: Once all events are sent, we call `producer.flush()` one final time and set `producer` to `None` to release it.

In [None]:
import json
# Create Kafka producer
producer = Producer(kafka_config)

# Process and send the list of click stream events to Kafka
for event in click_stream_events:
    # Serialize event data
    event_data = {
        'user_id': event.user_id,
        'product_id': event.product_id,
        'click': event.click,
        'add_to_cart': event.add_to_cart,
        'rating': event.rating,
        'timestamp': event.timestamp
    }
    serialized_event = json.dumps(event_data)

    # Send event data to Kafka topic
    producer.produce('click_stream_topic', value=serialized_event.encode('utf-8'))
    producer.flush()
    print("Sent click stream event to Kafka:", event_data)
    print("-" * 20)

    # Simulate some delay before processing the next event
    time.sleep(1)  # Adjust the delay as needed

# Flush any outstanding messages and close the Kafka producer
producer.flush()
producer = None  # Set the producer object to None to release its resources

Sent click stream event to Kafka: {'user_id': '54', 'product_id': '666', 'click': 4, 'add_to_cart': False, 'rating': 2, 'timestamp': '2023-12-12 11:23:23'}
--------------------
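
Printing after `produce()` only shows that a message was queued. To see whether the broker actually accepted it, `confluent_kafka` lets us pass a delivery callback. The sketch below is a hedged variant of the sending loop: the helper names `on_delivery`, `send_events`, and `delivery_log` are assumptions, and `send_events` needs a reachable broker, so it is defined but not called here.

```python
import json

delivery_log = []  # collects one entry per delivery report


def on_delivery(err, msg):
    """Delivery report callback; confluent_kafka invokes it with (err, msg)."""
    if err is not None:
        delivery_log.append(('failed', str(err)))
    else:
        delivery_log.append(('delivered', msg.topic(), msg.partition(), msg.offset()))


def send_events(events, config, topic='click_stream_topic'):
    """Send a list of event dictionaries and record a delivery report for each."""
    from confluent_kafka import Producer  # imported here so the sketch loads without Kafka

    producer = Producer(config)
    for event_data in events:
        payload = json.dumps(event_data).encode('utf-8')
        producer.produce(topic, value=payload, on_delivery=on_delivery)
        producer.poll(0)  # serve any delivery callbacks that are ready
    producer.flush()      # block until all queued messages are delivered or have failed
```

With a broker listening on the address in `kafka_config`, a call such as `send_events([{'user_id': 54, 'rating': 2}], kafka_config)` would fill `delivery_log` with one entry per message.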



### **Processing Click Stream Events**

In this section, we display the details of the click stream events that were sent to the Kafka topic. Note that this cell does not consume messages from Kafka; it iterates through the local `click_stream_events` list and prints the attributes of each event, including user ID, product ID, click count, add-to-cart status, rating, and timestamp.

In [None]:
# Process the list of click stream events
for event in click_stream_events:
    print("Received click stream event:")
    print("User ID:", event.user_id)
    print("Product ID:", event.product_id)
    print("Click:", event.click)
    print("Add to Cart:", event.add_to_cart)
    print("Rating:", event.rating)
    print("Timestamp:", event.timestamp)
    print("-" * 20)

Received click stream event:
User ID: 54
Product ID: 666
Click: 4
Add to Cart: False
Rating: 2
Timestamp: 2023-12-12 11:23:23
--------------------
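
For comparison, reading the events back from Kafka itself might look like the sketch below. It is an illustration rather than part of the original pipeline: the helper names `decode_event` and `consume_events` are hypothetical, the consumer configuration must include a `group.id` (for example `'clickstream-consumer'`, also an assumption), and a running broker is required for `consume_events` to return anything.

```python
import json


def decode_event(raw_value):
    """Turn the UTF-8 JSON bytes produced earlier back into a dictionary."""
    return json.loads(raw_value.decode('utf-8'))


def consume_events(config, topic='click_stream_topic', max_messages=10, timeout=1.0):
    """Poll the topic and return up to max_messages decoded events."""
    from confluent_kafka import Consumer  # imported here so the sketch loads without Kafka

    consumer = Consumer(config)
    consumer.subscribe([topic])
    events = []
    try:
        while len(events) < max_messages:
            msg = consumer.poll(timeout)
            if msg is None:
                break  # nothing arrived within the timeout
            if msg.error():
                print("Consumer error:", msg.error())
                continue
            events.append(decode_event(msg.value()))
    finally:
        consumer.close()  # leave the consumer group cleanly
    return events


decoded = decode_event(b'{"user_id": "54", "product_id": "666", "click": 4}')
print(decoded)
```

A configuration such as `{'bootstrap.servers': 'localhost:9092', 'group.id': 'clickstream-consumer', 'auto.offset.reset': 'earliest'}` would let a new consumer group start from the oldest available message.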


Here, we add the processed clickstream data to our database.

In [None]:
# Append the generated clickstream events to the database
conn = sqlite3.connect('recommendation.db')
cursor = conn.cursor()

for event in click_stream_events:
    cursor.execute('''
        INSERT INTO user_interactions (user_id, product_id, click, add_to_cart, rating, timestamp)
        VALUES (?, ?, ?, ?, ?, ?)
    ''', (event.user_id, event.product_id, event.click, event.add_to_cart, event.rating, event.timestamp))

conn.commit()
conn.close()

print("Generated clickstream events have been added to the database.")

Generated clickstream events have been added to the database.




---


## **Collaborative Filtering and K-Nearest Neighbors (KNN) Recommendation System**

In this section, we build a recommendation system using collaborative filtering and the K-Nearest Neighbors (KNN) algorithm. Collaborative filtering is a widely used technique in recommendation systems that predicts a user's preferences based on the preferences of similar users. KNN identifies the 'k' nearest neighbors of a particular user or item and uses their preferences to make ranked recommendations.

We'll use the Surprise library for building and analyzing the recommender system.

Here, we begin by importing the necessary libraries.

In [None]:
from surprise import SVD, Dataset, Reader, KNNBasic
from surprise.model_selection import train_test_split
import pandas as pd
from datetime import datetime

### **Loading Data for Collaborative Filtering**

In this section, we'll discuss the process of loading data for collaborative filtering, a popular technique used in recommendation systems. Collaborative filtering involves making automatic predictions (filtering) about the interests of a user by collecting preferences from many users (collaborating).

We'll utilize the Surprise library to load and preprocess our data.

The code snippet demonstrates how to load data for collaborative filtering using the `load_data_for_collaborative_filtering` function.

This function takes a DataFrame (`df`) as input, which contains user interactions. The `Reader` class is used to specify the rating scale (here, between 1 and 5).

The `Dataset.load_from_df` method from the Surprise library is then used to load the data into a Surprise-compatible format. This format is crucial for training and evaluating collaborative filtering models. The loaded data is returned as a `data` object.

In [None]:
# Load data for collaborative filtering
def load_data_for_collaborative_filtering(df):
    reader = Reader(rating_scale=(1, 5))
    data = Dataset.load_from_df(df[['user_id', 'product_id', 'rating']], reader)
    return data

### **Splitting Data into Training and Testing Sets**

The next step is to split the data into training and testing sets.

The code snippet demonstrates how to split the data into training and testing sets using the `split_data` function. The `data` object, loaded in the previous step, is used as input. The `train_test_split` function from the Surprise library is employed for this purpose.

The function returns two sets: `trainset` and `testset`. The `trainset` is used for training our collaborative filtering model, while the `testset` is used to evaluate the model's performance. The parameter `test_size` specifies the proportion of data allocated for testing (in this case, 20% of the data).

It's important to note that setting a `random_state` ensures reproducibility in the data split, allowing us to obtain consistent results across different runs. Simply put, it controls the shuffling of data before it is split.

In [None]:
# Split data into training and testing sets
def split_data(data):
    trainset, testset = train_test_split(data, test_size=0.2, random_state=42)
    return trainset, testset
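
To illustrate what `random_state` controls, here is a small standard-library sketch of a seeded shuffle-and-split. It is not Surprise's implementation and will not reproduce the same partition as `train_test_split`; the function name `seeded_split` is an assumption.

```python
import random


def seeded_split(rows, test_size=0.2, random_state=42):
    """Shuffle rows with a fixed seed, then cut off the first test_size share as the test set."""
    shuffled = list(rows)
    random.Random(random_state).shuffle(shuffled)  # same seed -> same order on every run
    n_test = int(round(len(shuffled) * test_size))
    return shuffled[n_test:], shuffled[:n_test]


train_rows, test_rows = seeded_split(range(10))
print(len(train_rows), len(test_rows))
```

Calling the function again with the same `random_state` returns the same partition, while a different value shuffles the rows differently.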

## **Training Collaborative Filtering Model**

The next step is to train a collaborative filtering model.

Within the function, we create an instance of the Singular Value Decomposition (SVD) model using `SVD()` from the Surprise library. SVD is a matrix factorization technique commonly used for collaborative filtering. We then fit the model to the training data using the `fit` method.

The trained model is returned, which can be used to generate recommendations for users based on their interactions and preferences.

In [None]:
# Train collaborative filtering model
def train_collaborative_filtering_model(trainset):
    model = SVD()
    model.fit(trainset)
    return model
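
With its default settings, Surprise's `SVD` estimates a rating as the global mean plus a user bias, an item bias, and the dot product of the user and item factor vectors, then clips the result to the rating scale. The sketch below reproduces that arithmetic by hand; the function name `svd_estimate` and all numbers are made up for illustration.

```python
def svd_estimate(mu, b_u, b_i, p_u, q_i, rating_scale=(1, 5)):
    """Estimate a rating as mu + b_u + b_i + dot(q_i, p_u), clipped to the rating scale."""
    raw = mu + b_u + b_i + sum(p * q for p, q in zip(p_u, q_i))
    low, high = rating_scale
    return min(high, max(low, raw))


# Hypothetical values: global mean 3.0, user bias +0.2, item bias -0.1, two latent factors
estimate = svd_estimate(3.0, 0.2, -0.1, p_u=[0.5, -0.2], q_i=[0.4, 0.1])
print(round(estimate, 2))
```

Here the dot product is 0.5 × 0.4 + (−0.2) × 0.1 = 0.18, so the estimate is 3.0 + 0.2 − 0.1 + 0.18 = 3.28.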

## **Generating Unranked Recommendations using Collaborative Filtering**

After training the collaborative filtering model, we build a list of unranked recommendations for a specific user. In this implementation, the list serves as a candidate set for the ranking step: it contains the items the user has already interacted with in the training set, and the `model` argument is accepted but not used.

Inside the function, we access the user's interactions in the `trainset` using `trainset.ur[trainset.to_inner_uid(user_id)]`. For each item the user has interacted with, we retrieve its inner item ID and convert it back to the original product ID using `trainset.to_raw_iid(inner_iid)`. These product IDs form the list of unranked recommendations.

*An inner ID is the unique numerical identifier that the Surprise library assigns to each user or item within its internal data structures. This helps the library perform computations efficiently. We convert each inner ID back to the original value so that the results match the IDs stored in the database.*

These unranked recommendations provide an initial set of items tied to the user's past interactions. They are not yet ranked by relevance or preference.

In [None]:
# Generate unranked recommendations based on collaborative filtering model
def generate_unranked_recommendations(model, user_id, trainset):
    user_items = list(trainset.ur[trainset.to_inner_uid(user_id)])
    recommendations = []
    for inner_iid, _ in user_items:
        product_id = trainset.to_raw_iid(inner_iid)
        recommendations.append(product_id)
    return recommendations
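
A common alternative is to score the items a user has *not* interacted with and keep the highest-scoring ones. The sketch below shows that idea in a library-independent form. The function name `top_n_unseen` and the `estimate` callable are assumptions, and the scores in the example are invented.

```python
import heapq


def top_n_unseen(seen_items, all_items, estimate, n=10):
    """Return the n unseen items with the highest estimated rating."""
    candidates = set(all_items) - set(seen_items)
    return heapq.nlargest(n, candidates, key=estimate)


# Hypothetical predicted ratings for four products; the user has already seen product 1
predicted = {1: 4.5, 2: 3.0, 3: 4.9, 4: 2.1}
top_items = top_n_unseen(seen_items=[1], all_items=predicted.keys(), estimate=predicted.get, n=2)
print(top_items)
```

With a trained Surprise model, `estimate` could be something like `lambda product_id: model.predict(user_id, product_id).est`, where `predict` returns a prediction object whose `est` field holds the estimated rating.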


## **Training a K-Nearest Neighbors (KNN) Model**

In addition to the SVD model, we used the K-Nearest Neighbors (KNN) algorithm, a neighborhood-based technique that uses the similarity between users or items to make recommendations.

Inside the function, we specify the `sim_options` parameter, where we choose the similarity metric to be the cosine similarity. Additionally, we set `user_based` to `False`, indicating that we are using item-based similarity. The model is then trained on the `trainset`.

In [None]:
# Train KNN model
def train_knn_model(trainset):
    sim_options = {
        'name': 'cosine',
        'user_based': False
    }
    model = KNNBasic(sim_options=sim_options)
    model.fit(trainset)
    return model
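
To make the similarity metric concrete, the sketch below computes the cosine similarity between two items by hand, using only the users who rated both of them. It is a simplified illustration, not Surprise's code; the function name `cosine_sim` and the ratings are made up.

```python
from math import sqrt


def cosine_sim(ratings_a, ratings_b):
    """Cosine similarity of two items, each given as a dict of user_id -> rating.

    Only users who rated both items contribute; no common users gives 0.0.
    """
    common = ratings_a.keys() & ratings_b.keys()
    if not common:
        return 0.0
    dot = sum(ratings_a[u] * ratings_b[u] for u in common)
    norm_a = sqrt(sum(ratings_a[u] ** 2 for u in common))
    norm_b = sqrt(sum(ratings_b[u] ** 2 for u in common))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Users 1 and 2 rated both items; user 3 rated only the second item and is ignored
item_a = {1: 4, 2: 2}
item_b = {1: 2, 2: 1, 3: 5}
similarity = cosine_sim(item_a, item_b)
print(round(similarity, 3))
```

The two rating vectors over the common users, (4, 2) and (2, 1), point in the same direction, so their similarity is 1 even though the absolute ratings differ.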


### **Generating Ranked Recommendations using K-Nearest Neighbors (KNN) Model**

The next step is to generate ranked recommendations using this model.

We define the `generate_ranked_knn_recommendations` function, which takes the trained KNN `model` and a list of `unranked_recommendations` as input. The unranked recommendations are the items that the collaborative filtering model has suggested.

Inside the function, we iterate through each `product_id` in the `unranked_recommendations` list. For each product, we map the `product_id` to its corresponding inner item ID using `model.trainset.to_inner_iid(product_id)`.

We then read a similarity score from the KNN model's similarity matrix using `model.sim[inner_iid][0]`. This is the first element of the item's similarity row, that is, its similarity to the item with inner ID 0. The implementation uses this single value as the sort key, which is a simplification: it orders items by their similarity to one particular item rather than by an aggregate relevance score.

The recommendations are sorted in descending order of similarity, ensuring that items with higher similarity scores are ranked higher. Finally, we return a list of the recommended item IDs in ranked order.

In [None]:
# Generate ranked recommendations using KNN model
def generate_ranked_knn_recommendations(model, unranked_recommendations):
    knn_ranked_recommendations = []
    for product_id in unranked_recommendations:
        inner_iid = model.trainset.to_inner_iid(product_id)
        # Use the first element of the similarity array for sorting
        similarity_score = model.sim[inner_iid][0]
        knn_ranked_recommendations.append((inner_iid, similarity_score))
    knn_ranked_recommendations.sort(key=lambda x: x[1], reverse=True)  # Sort by similarity in descending order
    return [item_id for item_id, _ in knn_ranked_recommendations]
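
One possible refinement, shown here only as a sketch, is to rank each candidate by its mean similarity to a set of reference items (for example, the items the user interacted with) and to drop duplicate candidates first. The function name `rank_by_mean_similarity` and the 3×3 matrix are assumptions; `sim` stands in for a square similarity matrix indexed by inner item ID.

```python
def rank_by_mean_similarity(candidates, reference_items, sim):
    """Rank unique candidates by mean similarity to the reference items, highest first."""
    unique_candidates = list(dict.fromkeys(candidates))  # drop duplicates, keep first occurrence

    def score(item):
        others = [ref for ref in reference_items if ref != item]  # skip self-similarity
        if not others:
            return 0.0
        return sum(sim[item][ref] for ref in others) / len(others)

    return sorted(unique_candidates, key=score, reverse=True)


# Hypothetical similarity matrix for three items with inner IDs 0, 1 and 2
demo_sim = [
    [1.0, 0.9, 0.1],
    [0.9, 1.0, 0.2],
    [0.1, 0.2, 1.0],
]
ranked = rank_by_mean_similarity(candidates=[2, 0, 1, 0], reference_items=[0, 1, 2], sim=demo_sim)
print(ranked)
```

In this example the mean similarities are 0.5 for item 0, 0.55 for item 1, and 0.15 for item 2, which gives the order 1, 0, 2.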

### **Data Preprocessing**
We have now defined all the relevant functions. Before training any recommendation models, it's crucial to preprocess the user interaction data to ensure it's in the right format for analysis. The following preprocessing steps are applied to the dataset:

1. **Timestamp Conversion:**
   The timestamps in the dataset are converted from string format to datetime format using the `pd.to_datetime` function.

2. **Mapping Add-to-Cart:**
   To simplify the analysis and modeling, the 'add_to_cart' column is mapped to binary values. Specifically, 'added' values are mapped to 1 (indicating an item was added to the cart), and 'not_added' values are mapped to 0.

By converting timestamps and mapping categorical variables to binary values, we create a structured and standardized dataset for recommendation analysis.

In [None]:
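# Preprocess timestamp to datetime format
df['timestamp'] = pd.to_datetime(df['timestamp'])

# Map add_to_cart to binary values (manually entered events store booleans, so map those too)
df['add_to_cart'] = df['add_to_cart'].map({'added': 1, 'not_added': 0, True: 1, False: 0})

# Manually entered IDs are strings; cast them so that '23' and 23 denote the same ID
df['user_id'] = df['user_id'].astype(int)
df['product_id'] = df['product_id'].astype(int)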



---


# **Converting Data to Surprise-Compatible Format**

To effectively train and evaluate recommendation models using the Surprise library, the user interaction data needs to be converted into a format compatible with Surprise. The following steps outline this conversion process:

1. **Initializing Reader:**
   The `Reader` class from the Surprise library is used to specify the rating scale. In this case, the rating scale is set to be between 1 and 5.

2. **Loading Data from DataFrame:**
   The `Dataset.load_from_df` function is employed to load the data from the `df` DataFrame, containing columns 'user_id', 'product_id', and 'rating'. This creates a Surprise `Dataset` object that encapsulates the user-item interactions.

3. **Building Trainset:**
   The `build_full_trainset()` method is called on the `Dataset` object to construct a Surprise `Trainset`. This trainset includes all user-item interactions from the original DataFrame and is suitable for training collaborative filtering models.


In [None]:
# Convert the 'df' DataFrame to Surprise-compatible format
reader = Reader(rating_scale=(1, 5))
data = Dataset.load_from_df(df[['user_id', 'product_id', 'rating']], reader)
trainset = data.build_full_trainset()

Now, we simply start calling the functions to train our models.

In [None]:
# Train collaborative filtering model
collab_model = train_collaborative_filtering_model(trainset)

## **Generating Unranked Recommendations**

Let's take a closer look at how we can generate unranked recommendations for a specific user using the trained collaborative filtering model.

**Step 1: Define Example User**
We choose an example user for whom we want to generate unranked recommendations. In this case, the `example_user_id` is set to 60.

**Step 2: Generating Unranked Recommendations**
We call the `generate_unranked_recommendations()` function, passing the following arguments:
- `collab_model`: The collaborative filtering model trained using the SVD algorithm.
- `example_user_id`: The ID of the user for whom we want to generate recommendations.
- `trainset`: The training dataset used to train the collaborative filtering model.

The function returns a list of unranked recommendations for the specified user. In this implementation, the list consists of the items that the user has interacted with in the training set.

In [None]:
# Example user for generating unranked recommendations
example_user_id = 60
unranked_recommendations = generate_unranked_recommendations(collab_model, example_user_id, trainset)

Let's look at the unranked recommendations for this user. They are the products the user has interacted with, listed once per interaction (so a product can appear more than once), and they have not been sorted or ranked by any criterion.


In [None]:
print(f"Unranked Recommendations from Collaborative Filtering for user {example_user_id}:")
for product_id in unranked_recommendations:
    print(f"Product {product_id}")

Unranked Recommendations from Collaborative Filtering for user 60:
Product 502
Product 326
Product 736
Product 670
Product 144
Product 925
Product 788
Product 321
Product 424
Product 426
Product 920
Product 467
Product 71
Product 927
Product 804
Product 774
Product 684
Product 58
Product 26
Product 922
Product 621
Product 268
Product 118
Product 965
Product 769
Product 296
Product 304
Product 479
Product 103
Product 545
Product 798
Product 115
Product 75
Product 189
Product 777
Product 340
Product 534
Product 454
Product 26
Product 935
Product 85
Product 323
Product 383
Product 77
Product 793
Product 636
Product 440
Product 888
Product 342
Product 143
Product 339
Product 629
Product 458
Product 251
Product 970
Product 622
Product 636
Product 753
Product 666
Product 873
Product 671
Product 496
Product 246
Product 578
Product 677
Product 496
Product 671
Product 50
Product 320
Product 226
Product 636
Product 909
Product 764
Product 771
Product 127
Product 937
Product 211
Product 631
Produ

## **Training K-Nearest Neighbors (KNN) Model and Generating Ranked Recommendations**

The KNN algorithm is particularly useful for finding items that are similar to those already interacted with by the user, thereby improving recommendation quality.

**Step 1: Training the KNN Model**
We use the `train_knn_model()` function to train the KNN model.

**Step 2: Generating Ranked Recommendations using KNN**
We proceed to generate ranked recommendations using the `generate_ranked_knn_recommendations()` function. This function takes two arguments:
- `knn_model`: The KNN model trained in the previous step.
- `unranked_recommendations`: The list of unranked recommendations obtained from the collaborative filtering model.

For each item in the list of unranked recommendations, the function calculates its similarity score based on the KNN model's similarity matrix. The recommendations are then sorted based on the similarity score in descending order to create a list of ranked recommendations. These ranked recommendations can provide users with more relevant and similar items based on their previous interactions and preferences.

In [None]:
# Train KNN model
knn_model = train_knn_model(trainset)

# Generate ranked recommendations using KNN model
knn_ranked_recommendations = generate_ranked_knn_recommendations(knn_model, unranked_recommendations)

Computing the cosine similarity matrix...
Done computing similarity matrix.


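Let's look at the ranked recommendations now. Note that the list is not deduplicated, so a product that appears more than once in the unranked list (for example, Product 496) also appears more than once in the ranking.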

In [None]:
print(f"Ranked KNN Recommendations for user {example_user_id}:")
for rank, item_id in enumerate(knn_ranked_recommendations, start=1):
    print(f"Rank {rank}: Product {trainset.to_raw_iid(item_id)}")

Ranked KNN Recommendations for user 60:
Rank 1: Product 736
Rank 2: Product 144
Rank 3: Product 321
Rank 4: Product 467
Rank 5: Product 769
Rank 6: Product 296
Rank 7: Product 304
Rank 8: Product 479
Rank 9: Product 798
Rank 10: Product 75
Rank 11: Product 777
Rank 12: Product 454
Rank 13: Product 383
Rank 14: Product 888
Rank 15: Product 339
Rank 16: Product 970
Rank 17: Product 246
Rank 18: Product 320
Rank 19: Product 937
Rank 20: Product 631
Rank 21: Product 190
Rank 22: Product 474
Rank 23: Product 484
Rank 24: Product 120
Rank 25: Product 489
Rank 26: Product 560
Rank 27: Product 357
Rank 28: Product 426
Rank 29: Product 268
Rank 30: Product 143
Rank 31: Product 753
Rank 32: Product 804
Rank 33: Product 666
Rank 34: Product 965
Rank 35: Product 677
Rank 36: Product 684
Rank 37: Product 118
Rank 38: Product 342
Rank 39: Product 342
Rank 40: Product 670
Rank 41: Product 629
Rank 42: Product 440
Rank 43: Product 496
Rank 44: Product 496
Rank 45: Product 935
Rank 46: Product 578
Rank

## **Simulating User Interaction Data for Recommendation Evaluation**

In this section, we simulate user interaction data to demonstrate how recommendations are evaluated. Each synthetic record describes one interaction between a user and a product, with a user ID, product ID, click count, add-to-cart flag, rating, and timestamp.

The synthetic data gives us a controlled context for showing how the evaluation metrics work. Because the values are random, the resulting scores illustrate how the metrics are computed rather than how well the model performs.

In [None]:
from faker import Faker

# Simulated data for demonstration
def generate_fake_data(num_users, num_items):
    """Return `num_users` random interaction records, one dict per record."""
    fake = Faker()
    data = []
    # One record per iteration. User IDs are drawn at random, so a user may
    # appear several times or not at all.
    for _ in range(num_users):
        user_id = fake.random_int(min=1, max=num_users, step=1)
        product_id = fake.random_int(min=1, max=num_items, step=1)
        click = fake.random_int(min=0, max=12)  # number of clicks, 0 to 12
        added_to_cart = fake.random_element(elements=('yes', 'no'))
        rating = fake.random_int(min=1, max=5, step=1)
        timestamp = fake.date_time_between(start_date='-30d', end_date='now')
        data.append({
            'user_id': user_id,
            'product_id': product_id,
            'click': click,
            'added_to_cart': added_to_cart,
            'rating': rating,
            'timestamp': timestamp,
        })
    return data

# Simulate generating data: 100 records over up to 100 users and 200 products
sample_data = generate_fake_data(num_users=100, num_items=200)
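Random data makes the metric values change from run to run. As a hedged alternative, the sketch below produces records of the same shape using only the standard library and a fixed seed, and separates the number of records from the number of users. The function name `generate_interactions` and its `num_rows` and `seed` parameters are assumptions for illustration and are not part of the notebook.

```python
import random
from datetime import datetime, timedelta


def generate_interactions(num_rows, num_users, num_items, seed=0):
    """Return num_rows random interaction records, reproducible for a given seed."""
    rng = random.Random(seed)  # private generator, leaves global random state alone
    now = datetime.now()
    thirty_days_in_seconds = 30 * 24 * 3600
    rows = []
    for _ in range(num_rows):
        rows.append({
            'user_id': rng.randint(1, num_users),
            'product_id': rng.randint(1, num_items),
            'click': rng.randint(0, 12),
            'added_to_cart': rng.choice(('yes', 'no')),
            'rating': rng.randint(1, 5),
            # Some moment within the last 30 days
            'timestamp': now - timedelta(seconds=rng.randint(0, thirty_days_in_seconds)),
        })
    return rows


rows_a = generate_interactions(num_rows=100, num_users=100, num_items=200, seed=42)
rows_b = generate_interactions(num_rows=100, num_users=100, num_items=200, seed=42)

# Same seed, same users, products, clicks and ratings
print([r['product_id'] for r in rows_a] == [r['product_id'] for r in rows_b])  # True
```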

### **Calculating Personalized Click-Through Diversity Index (PCDI)**

In this section, we implement the Personalized Click-Through Diversity Index (PCDI), an evaluation metric for recommendation systems that combines the click-through rate (CTR) with the diversity of the clicked items.

1. **Calculate CTR (Click-Through Rate):**
The function `calculate_ctr` computes the CTR, which represents the ***ratio of clicked recommended items to the total number of recommended items.***

2. **Calculate Diversity Score:**
The function `calculate_diversity_score` calculates the diversity score of the clicked items from their embeddings. It computes the pairwise cosine similarity between the embeddings and returns one minus the mean similarity. The lower the similarity, the more diverse the clicked items are and the higher the score.

*Embeddings are representations of items as vectors in a multi-dimensional space, where the dimensions capture features of the item. They let the system measure how similar or dissimilar two items are, which supports recommendations based on item properties.*


3. **Calculate PCDI (Personalized Click-Through Diversity Index):**
The function `calculate_pcdi` combines the CTR and diversity score to compute the final PCDI metric. It allows specifying weights for CTR and diversity, which can be adjusted based on the importance assigned to each aspect. The function takes CTR, diversity score, and optional weights as input and returns the PCDI.

These calculations are integral for evaluating recommendation models that aim to balance user engagement (CTR) with diversity of recommended items.
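In symbols, the three quantities described above can be written as follows. The notation is introduced here for illustration only: $n$ is the number of clicked items, $e_i$ is the embedding of the $i$-th clicked item, and $w_{\mathrm{ctr}}$ and $w_{\mathrm{div}}$ are the two weights. The diversity term takes the mean over every entry of the similarity matrix, which is what a plain `np.mean` over that matrix computes.

```latex
\mathrm{CTR} = \frac{\text{number of clicked recommended items}}{\text{number of recommended items}}

D = 1 - \frac{1}{n^{2}} \sum_{i=1}^{n} \sum_{j=1}^{n} \cos(e_i, e_j)

\mathrm{PCDI} = w_{\mathrm{ctr}} \cdot \mathrm{CTR} + w_{\mathrm{div}} \cdot D
```

If $\mathrm{CTR}$ and $D$ both lie in $[0, 1]$ and the weights are non-negative with $w_{\mathrm{ctr}} + w_{\mathrm{div}} = 1$, then $\mathrm{PCDI}$ also lies in $[0, 1]$.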

In [None]:
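import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Calculate CTR
def calculate_ctr(clicked_items, total_recommended_items):
    # Ratio of clicked items to recommended items
    return len(clicked_items) / total_recommended_items

# Calculate Diversity Score
def calculate_diversity_score(clicked_items_embeddings):
    # Pairwise cosine similarity between all clicked-item embeddings
    similarity_matrix = cosine_similarity(clicked_items_embeddings)
    # The mean includes the diagonal (each item's similarity to itself, 1.0)
    diversity_score = 1 - np.mean(similarity_matrix)
    return diversity_score

# Calculate PCDI
def calculate_pcdi(ctr, diversity_score, weight_ctr=0.5, weight_diversity=0.5):
    # Weighted sum of engagement (CTR) and diversity
    return (ctr * weight_ctr) + (diversity_score * weight_diversity)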
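Averaging the full similarity matrix counts each item's similarity to itself, which is always 1 and therefore pulls the diversity score down, most noticeably for small sets. One possible variant averages only over distinct pairs of items. The sketch below uses the standard library only; the names `cosine` and `pairwise_diversity` are hypothetical, and it assumes no embedding is the zero vector.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity of two non-zero vectors given as sequences of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def pairwise_diversity(embeddings):
    """One minus the mean cosine similarity over distinct pairs of items."""
    pairs = list(combinations(embeddings, 2))
    if not pairs:
        return 0.0  # fewer than two items, so there is nothing to compare
    mean_similarity = sum(cosine(u, v) for u, v in pairs) / len(pairs)
    return 1.0 - mean_similarity


identical = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
orthogonal = [[1.0, 0.0], [0.0, 1.0]]

print(pairwise_diversity(identical))   # 0.0
print(pairwise_diversity(orthogonal))  # 1.0
```

For the two orthogonal vectors, a mean over the full 2x2 similarity matrix would give 0.5 instead of 1.0, because two of its four entries are self-similarities.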


**Calculating PCDI for Evaluating Recommendations**

In this part of the code, we proceed with calculating the Personalized Click-Through Diversity Index (PCDI) for evaluating the performance of our recommendation system.

1. **Generate Clicked Items List and Item Embeddings:**
The code first builds a list called `clicked_items` containing the product IDs of items that users clicked. It does this by iterating through the sample data and selecting the records whose `click` count is greater than zero. We then simulate item embeddings with a random matrix (`clicked_items_embeddings`), in which each row represents a clicked item and the 10 columns are the embedding dimensions. In a real system these embeddings would represent the characteristics of the items.

2. **Define Total Recommended Items:**
The `total_recommended_items` variable holds the total number of items recommended to users. Here it is hard-coded to 80 for demonstration.

3. **Calculate CTR (Click-Through Rate):**
The `calculate_ctr` function is applied to calculate the Click-Through Rate (CTR). ***This metric reflects the proportion of recommended items that users actually engaged with***.

4. **Calculate Diversity Score:**
The `calculate_diversity_score` function is used to calculate the diversity score of the clicked items. A lower similarity score indicates higher diversity among the clicked items.

By combining these metrics using the `calculate_pcdi` function, we can evaluate the recommendation system's performance.

In [None]:
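# Process data to calculate PCDI
# Product IDs of every record with at least one click (duplicates are kept)
clicked_items = [item['product_id'] for item in sample_data if item['click'] > 0]

# Random 10-dimensional stand-ins for real item embeddings
clicked_items_embeddings = np.random.rand(len(clicked_items), 10)

# Fixed for demonstration, not derived from an actual recommendation list
total_recommended_items = 80

# Calculate CTR
ctr = calculate_ctr(clicked_items, total_recommended_items)

# Calculate Diversity Score
diversity_score = calculate_diversity_score(clicked_items_embeddings)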
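Because the cell above divides the number of clicked records by a fixed constant, the resulting ratio is not guaranteed to stay at or below 1. A minimal sketch of a CTR that stays in [0, 1] counts only clicks on items that were actually recommended, and counts each item once. The function name `bounded_ctr` and the example IDs are assumptions for illustration.

```python
def bounded_ctr(recommended_items, clicked_items):
    """Fraction of distinct recommended items that were clicked at least once."""
    recommended = set(recommended_items)
    if not recommended:
        return 0.0  # nothing was recommended, so there is nothing to click through
    clicked_recommended = recommended & set(clicked_items)
    return len(clicked_recommended) / len(recommended)


recommended = [101, 102, 103, 104]
clicked = [102, 102, 104, 999]  # 102 clicked twice, 999 was never recommended

print(bounded_ctr(recommended, clicked))  # 0.5
```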

### **Calculating PCDI**

In this part of the code, we finalize the calculation of the Personalized Click-Through Diversity Index (PCDI) by considering the weights assigned to the Click-Through Rate (CTR) and Diversity Score.

In [None]:
# Calculate PCDI
weight_ctr = 0.90
weight_diversity = 0.10
pcdi = calculate_pcdi(ctr, diversity_score, weight_ctr, weight_diversity)

print(f"CTR: {ctr:.2f}")
print(f"Diversity Score: {diversity_score:.2f}")
print(f"PCDI: {pcdi:.2f}")

CTR: 1.18
Diversity Score: 0.23
PCDI: 1.08


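The Personalized Click-Through Diversity Index (PCDI) is a metric that combines Click-Through Rate (CTR) and Diversity Score to evaluate recommendation system performance. It accounts for both user engagement (CTR) and the variety of the items users clicked (Diversity Score).

- **Click-Through Rate (CTR)** indicates user interaction with recommendations. A higher CTR suggests the recommendations are relevant to user preferences.
- **Diversity Score** measures variety among the clicked items. A higher Diversity Score indicates more varied items.

**PCDI Value Significance:**
- PCDI is a weighted sum of CTR and Diversity Score. Here the weights are 0.90 and 0.10.
- A higher PCDI indicates better combined engagement and diversity under the chosen weights.
- When both components lie in [0, 1] and the weights sum to 1, PCDI also lies in [0, 1]. The CTR and PCDI above exceed 1 only because the simulated data contains more clicked records than the hard-coded `total_recommended_items = 80`, so they should not be read as a measure of model quality.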
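To keep PCDI values comparable across runs, the weighted sum can be guarded with input checks. The sketch below is one possible approach, not the notebook's `calculate_pcdi`: the name `pcdi_checked`, the range checks, and the weight normalization are assumptions for illustration.

```python
def pcdi_checked(ctr, diversity_score, weight_ctr=0.5, weight_diversity=0.5):
    """Weighted mean of CTR and diversity, rejecting inputs outside [0, 1]."""
    for name, value in (("ctr", ctr), ("diversity_score", diversity_score)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must lie in [0, 1], got {value}")
    total_weight = weight_ctr + weight_diversity
    if weight_ctr < 0 or weight_diversity < 0 or total_weight == 0:
        raise ValueError("weights must be non-negative and not both zero")
    # Dividing by the total weight keeps the result in [0, 1]
    return (ctr * weight_ctr + diversity_score * weight_diversity) / total_weight


score = pcdi_checked(0.5, 0.2, weight_ctr=0.9, weight_diversity=0.1)
print(f"{score:.2f}")  # 0.47
```

With these checks, a CTR such as 1.18 raises a `ValueError` instead of silently producing a PCDI above 1.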


---



## **TEAM**
1. Harsh - 112115057
2. Priyansh - 112115118
3. Rohan Kumar - 112115133
4. Lingam Saikrishna Seshendra - 112116033
5. Santu Dhali - 112116034