<a href="https://colab.research.google.com/github/salarMokhtariL/Groceries-Recommender/blob/main/Groceries_Recommender.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Scalable Recommender System Development with PyTorch and Pandas for Enhanced Personalization

>Realized on April


**Team members:**
*   **Helya Hosseini Nami**
*   **Salar Mokhtari Laleh**

# Introduction
This notebook implements a recommender system in PyTorch and shows how to train and evaluate the model on a dataset of grocery transactions. The system represents each item as a learned embedding and recommends the items whose embeddings are most similar.

# Dataset
The dataset is a list of 9,835 grocery transactions. Each transaction is a list of items purchased by a customer, stored as one comma-separated line. The dataset is available at the following URL:

https://raw.githubusercontent.com/stedy/Machine-Learning-with-R-datasets/master/groceries.csv

# Implementation
The implementation consists of the following steps:

1. Load the dataset using pandas and convert it to a list of lists.
2. Create a dictionary to map each item to a unique integer.
3. Convert the dataset to a list of lists of integers using the mapping dictionary.
4. Convert the dataset to a PyTorch tensor.
5. Define the model architecture using PyTorch.
6. Train the model on the dataset using PyTorch.
7. Compute the item embeddings and similarity matrix.
8. Generate recommendations for each item using the similarity matrix.




# Step 1: Load the dataset
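The first step is to load the dataset using pandas and convert it to a list of lists. Because the file is read with `delimiter='\t'` while its fields are separated by commas, pandas places each transaction in a single column as one comma-separated string.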

In [1]:
import pandas as pd

data = pd.read_csv('https://raw.githubusercontent.com/stedy/Machine-Learning-with-R-datasets/master/groceries.csv',
                   header=None, delimiter='\t')

In [2]:
data.head(3)

|   | 0 |
|---|---|
| 0 | citrus fruit,semi-finished bread,margarine,rea... |
| 1 | tropical fruit,yogurt,coffee |
| 2 | whole milk |


In [3]:
data.shape

(9835, 1)

In [4]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9835 entries, 0 to 9834
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   0       9835 non-null   object
dtypes: object(1)
memory usage: 77.0+ KB


In [5]:
# Convert the data to a list of lists

data = data.values.tolist()
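
Since each row arrives as one comma-separated string, splitting it is what yields individual items. A minimal sketch of that step, using two rows copied from the `data.head(3)` output as sample input; the helper name `split_transactions` is hypothetical and not part of the notebook:

```python
import csv

def split_transactions(rows):
    """Split comma-separated transaction strings into lists of item names."""
    # csv.reader parses each string as one record and returns its fields as a list.
    return [[item.strip() for item in fields] for fields in csv.reader(rows)]

# Two rows copied from the data.head(3) output above.
sample_rows = ["tropical fruit,yogurt,coffee", "whole milk"]
transactions = split_transactions(sample_rows)
print(transactions)
```

Applied to the notebook's data, `split_transactions(row[0] for row in data)` would be one way to obtain per-item lists, assuming every row holds exactly one string as the output above suggests.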

# Step 2: Create a mapping dictionary
The second step is to create a dictionary to map each item to a unique integer.

In [6]:
item_to_int = {}
count = 0
for transaction in data:
    for item in transaction:
        if item not in item_to_int:
            item_to_int[item] = count
            count += 1

# Step 3: Convert the dataset to a list of lists of integers
The third step is to convert the dataset to a list of lists of integers using the mapping dictionary.

In [7]:
data_int = []
for transaction in data:
    transaction_int = [item_to_int[item] for item in transaction]
    data_int.append(transaction_int)
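
Steps 2 and 3 can also be folded into two small helpers. The sketch below reserves ID 0 for a padding token, which is an assumption that differs from the notebook's mapping (there, real items start at 0), so that variable-length transactions could later be padded without colliding with a real item. The names `build_vocab` and `encode` are hypothetical.

```python
def build_vocab(transactions, pad_token="<pad>"):
    """Map each distinct item to an integer ID, keeping ID 0 for padding."""
    item_to_int = {pad_token: 0}
    for transaction in transactions:
        for item in transaction:
            # setdefault assigns the next free ID only the first time an item is seen.
            item_to_int.setdefault(item, len(item_to_int))
    return item_to_int

def encode(transactions, item_to_int):
    """Replace item names with their integer IDs."""
    return [[item_to_int[item] for item in t] for t in transactions]

# Hypothetical, already-split transactions for illustration.
transactions = [["tropical fruit", "yogurt", "coffee"], ["whole milk"], ["yogurt", "whole milk"]]
vocab = build_vocab(transactions)
encoded = encode(transactions, vocab)
print(vocab)
print(encoded)
```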

# Step 4: Convert the dataset to a PyTorch tensor
The fourth step is to convert the dataset to a PyTorch tensor.



In [8]:
import torch

data_tensor = torch.tensor(data_int, dtype=torch.long)
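
A tensor needs rows of equal length, which transactions generally do not have once they are split into individual items. One common workaround, sketched here in plain Python under the assumption that ID 0 is reserved for padding (the notebook does not do this), is to pad every transaction to the longest length. Padding on the left keeps the last column, which the training loop uses as the target, filled with a real item. The name `pad_left` is hypothetical.

```python
def pad_left(sequences, pad_id=0):
    """Left-pad integer sequences to a common length so they form a rectangle."""
    max_len = max(len(seq) for seq in sequences)
    return [[pad_id] * (max_len - len(seq)) + list(seq) for seq in sequences]

# Hypothetical encoded transactions of different lengths.
encoded = [[1, 2, 3], [4], [2, 4]]
padded = pad_left(encoded)
print(padded)  # [[1, 2, 3], [0, 0, 4], [0, 2, 4]]
```

The padded list can then be passed to `torch.tensor(padded, dtype=torch.long)`.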

# Step 5: Define the model architecture
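The fifth step is to define the model architecture using PyTorch. The model embeds each input item, averages the embeddings across each row (mean pooling), passes the result through a fully connected layer with ReLU activation, and projects it to one score (logit) per item in the output layer.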

In [9]:
import torch.nn as nn

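class Recommender(nn.Module):
    def __init__(self, n_items, embedding_dim):
        super().__init__()
        # One learnable vector of size embedding_dim per item ID
        self.item_embedding = nn.Embedding(n_items, embedding_dim)
        self.fc1 = nn.Linear(embedding_dim, 128)
        # Output layer: one logit per item
        self.fc2 = nn.Linear(128, n_items)
        self.activation = nn.ReLU()

    def forward(self, x):
        # x: (batch_size, seq_len) tensor of item IDs
        x = self.item_embedding(x)   # (batch_size, seq_len, embedding_dim)
        x = torch.mean(x, dim=1)     # average over the items in each row
        x = self.fc1(x)
        x = self.activation(x)
        x = self.fc2(x)              # (batch_size, n_items)
        return x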
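
If transactions are padded to a common length, a plain `torch.mean` averages the padding vectors together with the real items. A hedged sketch of a masked mean, assuming ID 0 is a padding index (an assumption; the notebook's model has no padding handling) and using the hypothetical class name `MaskedMeanRecommender`:

```python
import torch
import torch.nn as nn

class MaskedMeanRecommender(nn.Module):
    def __init__(self, n_items, embedding_dim, hidden_dim=128, pad_id=0):
        super().__init__()
        self.pad_id = pad_id
        # padding_idx keeps the padding row out of the gradient updates.
        self.item_embedding = nn.Embedding(n_items, embedding_dim, padding_idx=pad_id)
        self.fc1 = nn.Linear(embedding_dim, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, n_items)
        self.activation = nn.ReLU()

    def forward(self, x):
        mask = (x != self.pad_id).unsqueeze(-1).float()      # (batch, seq_len, 1)
        summed = (self.item_embedding(x) * mask).sum(dim=1)  # sum over real items only
        # clamp avoids division by zero for rows that contain only padding.
        counts = mask.sum(dim=1).clamp(min=1.0)
        pooled = summed / counts
        return self.fc2(self.activation(self.fc1(pooled)))

model = MaskedMeanRecommender(n_items=5, embedding_dim=8)
batch = torch.tensor([[1, 2, 3], [0, 0, 4]])  # hypothetical left-padded item IDs
logits = model(batch)
print(logits.shape)
```

Dividing by the count of real items, rather than by the padded length, keeps short transactions from being scaled toward zero.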

# Step 6: Train the model
This step sets the hyperparameters, initializes the model, optimizer, and loss function, and then trains the model. Training iterates over the data in batches and updates the model's parameters based on the loss computed on each batch.

In [10]:
import torch.optim as optim

In [11]:
# Set the hyperparameters

n_items = len(item_to_int)
embedding_dim = 64
lr = 0.001
n_epochs = 10
batch_size = 256

In [12]:
# Initialize the model and optimizer

model = Recommender(n_items, embedding_dim)
optimizer = optim.Adam(model.parameters(), lr=lr)

In [13]:
# Define the loss function

criterion = nn.CrossEntropyLoss()

In [14]:
# Train the model

for epoch in range(n_epochs):
    for i in range(0, len(data_tensor), batch_size):
        batch = data_tensor[i:i+batch_size]
        targets = batch[:, -1]
        inputs = batch[:, :-1]
        optimizer.zero_grad()
        outputs = model(inputs)
        loss = criterion(outputs, targets)
        loss.backward()
        optimizer.step()
    print(f'Epoch {epoch+1}/{n_epochs} Loss: {loss.item()}')


Epoch 1/10 Loss: nan
Epoch 2/10 Loss: nan
Epoch 3/10 Loss: nan
Epoch 4/10 Loss: nan
Epoch 5/10 Loss: nan
Epoch 6/10 Loss: nan
Epoch 7/10 Loss: nan
Epoch 8/10 Loss: nan
Epoch 9/10 Loss: nan
Epoch 10/10 Loss: nan


In each epoch, the data is split into batches of size `batch_size`, and the model's parameters are updated after every batch based on that batch's loss. The `optimizer.zero_grad()` call clears the gradients of all optimized tensors before the backward pass, and `optimizer.step()` updates the parameters from the computed gradients. The value printed per epoch is the loss of the last batch, not an epoch average. It is `nan` here because every row of `data_tensor` holds a single ID: once `batch[:, :-1]` sets the target column aside, `inputs` has shape `(batch_size, 0)`, and `torch.mean` over that empty dimension returns `nan`.
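
The slicing in the loop takes the last column as the target and everything before it as the input, so a row with a single entry leaves nothing to predict from. A minimal sketch of building (inputs, target) pairs that skips such rows, assuming transactions are already split into lists of item IDs; `make_examples` is a hypothetical helper, not part of the notebook:

```python
def make_examples(encoded_transactions):
    """Return (inputs, target) pairs: the last item is the target, the rest are inputs."""
    examples = []
    for transaction in encoded_transactions:
        if len(transaction) < 2:
            # A single-item transaction would leave no inputs to predict from.
            continue
        examples.append((transaction[:-1], transaction[-1]))
    return examples

encoded = [[1, 2, 3], [4], [2, 4]]  # hypothetical item IDs
examples = make_examples(encoded)
print(examples)  # [([1, 2], 3), ([2], 4)]
```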

The loss function is cross-entropy, which is commonly used for multi-class classification. Here the task is to predict the last item of a transaction from the items before it, so the target is a single integer item ID. `nn.CrossEntropyLoss()` expects inputs of shape `(batch_size, n_classes)` and targets of shape `(batch_size,)`, where each element is the integer label of the correct class. In our case, the inputs are the model's outputs (one logit per item) and the targets are the last item of each transaction in the batch.
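
For a single sample, cross-entropy is the negative log of the softmax probability assigned to the correct class. The plain-Python sketch below works through that formula on hypothetical logits; it is an illustration of the definition rather than PyTorch's implementation, and `nn.CrossEntropyLoss()` additionally averages the per-sample values over the batch by default.

```python
import math

def cross_entropy(logits, target):
    """Cross-entropy for one sample: -log(softmax(logits)[target])."""
    # Subtracting the max keeps exp() numerically stable without changing the result.
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]

logits = [2.0, 0.5, -1.0]  # hypothetical scores for three items
loss = cross_entropy(logits, target=0)
print(round(loss, 4))
```

With equal logits over `n` classes the loss is `log(n)`, which gives a useful baseline for judging whether training has made progress.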

# Step 7: Compute Item Similarities and Get Recommendations

After training the model, we can compute the similarity between items and use it to generate recommendations for each item in the dataset.

First, we compute the similarity matrix between item embeddings using the dot product. The dot product equals cosine similarity, a common metric for comparing embeddings, only when the vectors are normalized to unit length; on the raw embeddings used here it also grows with vector magnitude.

Then, for each item, we take the top-k most similar items by dot-product score. Finally, we convert the integer IDs back to the original item names and print the top-k recommendations for each item.

Here's the code for computing item similarities and getting recommendations:

In [15]:
import numpy as np

In [16]:
# Get the item embeddings

item_embeddings = model.item_embedding.weight.data.cpu().numpy()

In [17]:
# Compute the similarity between items

similarity_matrix = np.dot(item_embeddings, item_embeddings.T)
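
The dot product above depends on vector length as well as direction. If cosine similarity is what is wanted, the rows can be normalized first. A minimal sketch with hypothetical 2-D embeddings; `cosine_similarity_matrix` is an illustrative name, not a notebook function:

```python
import numpy as np

def cosine_similarity_matrix(embeddings, eps=1e-12):
    """Pairwise cosine similarity between the rows of an embedding matrix."""
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    # eps guards against division by zero for all-zero rows.
    unit = embeddings / np.maximum(norms, eps)
    return unit @ unit.T

# Hypothetical embeddings: rows 0 and 1 point the same way but differ in length.
emb = np.array([[1.0, 0.0], [3.0, 0.0], [0.0, 2.0]])
cos = cosine_similarity_matrix(emb)
dot = emb @ emb.T
print(cos)
print(dot)
```

In this toy case rows 0 and 1 have cosine similarity 1, while their raw dot products with themselves and each other differ only because of length.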

In [18]:
# Get the top k recommendations for each item

k = 5
top_k = np.argsort(similarity_matrix, axis=1)[:, -(k+1):]
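
Taking `k+1` columns and later dropping the first one assumes that each item is its own best match. With raw dot products that is not guaranteed, since a longer vector pointing in the same direction can score higher than the item does against itself. A sketch that excludes the item explicitly, using a hypothetical similarity matrix and the illustrative name `top_k_neighbors`:

```python
import numpy as np

def top_k_neighbors(similarity, k):
    """Indices of the k most similar items per row, best first, excluding the item itself."""
    sim = similarity.astype(float)     # astype returns a copy, so the input is untouched
    np.fill_diagonal(sim, -np.inf)     # an item can never recommend itself
    order = np.argsort(-sim, axis=1)   # descending similarity
    return order[:, :k]

# Hypothetical 4x4 similarity matrix.
similarity = np.array([
    [1.0, 0.9, 0.1, 0.4],
    [0.9, 1.0, 0.3, 0.2],
    [0.1, 0.3, 1.0, 0.8],
    [0.4, 0.2, 0.8, 1.0],
])
neighbors = top_k_neighbors(similarity, k=2)
print(neighbors)
```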

In [19]:
# Convert the integer ids back to item names

int_to_item = {idx: name for name, idx in item_to_int.items()}
top_k_items = np.vectorize(int_to_item.get)(top_k)

In [20]:
# Print the top k recommendations for each item

for i, item in enumerate(int_to_item.values()):
    print(f"Top {k} recommendations for {item}: {', '.join(top_k_items[i][::-1][1:])}")

Streaming output truncated to the last 5000 lines.
Top 5 recommendations for beef,citrus fruit,whole milk: UHT-milk,salt,soda,specialty chocolate,shopping bags, chicken,tropical fruit, beef,root vegetables,white bread,soda, citrus fruit,pip fruit,beverages,hard cheese,pet care,soda,fruit/vegetable juice,waffles, frankfurter,meat,root vegetables,onions,other vegetables,whole milk,yogurt,UHT-milk,rolls/buns,oil,mustard,canned fish,coffee,newspapers,shopping bags
Top 5 recommendations for yogurt,UHT-milk,rolls/buns,candles: frankfurter,pip fruit,whole milk,soda, frankfurter,pork,whole milk,pastry,soda, hamburger meat,root vegetables,whole milk,whipped/sour cream,flour,chocolate marshmallow,newspapers, pork,citrus fruit,grapes,root vegetables,whole milk,ready soups,pot plants, ham,processed cheese,rolls/buns,white bread,fruit/vegetable juice,canned beer,long life bakery product,waffles,chocolate,specialty bar
Top 5 recommendations for other vegetables,butter,yogurt,rolls/buns

# Step 8: Evaluate the Recommendations
This step involves defining a function `evaluate_recommendations` that takes an input item and prints the top k recommended items based on similarity scores.

Here is the implementation of the function:

In [21]:
def evaluate_recommendations(input_item, top_k=5):
    # Convert the input item to its integer id
    input_item_id = item_to_int[input_item]

    # Get the item embeddings
    item_embeddings = model.item_embedding.weight.data.cpu().numpy()

    # Compute the similarity between items
    similarity_scores = np.dot(item_embeddings[input_item_id], item_embeddings.T)

    # Get the top k recommendations
    top_k_indices = np.argsort(similarity_scores)[-top_k:]

    # Convert the integer ids back to item names
    int_to_item = {v: k for k, v in item_to_int.items()}
    top_k_items = [int_to_item[i] for i in top_k_indices[::-1]]

    # Print the top k recommendations for the input item
    print(f"Top {top_k} recommendations for {input_item}: {', '.join(top_k_items)}")

This function takes two parameters:

* `input_item`: the name of the item for which recommendations are to be generated.
* `top_k` (default 5): the number of top recommendations to be generated.

The function first converts the input item name to its integer ID using the `item_to_int` dictionary. It then retrieves the item embeddings from the trained model and computes similarity scores as the dot product between the input item's embedding and the embedding of every item, including the input item itself.

Next, it takes the indices of the k highest scores and converts these integer IDs back to item names using the `int_to_item` dictionary. Finally, it prints the top k recommended items for the input item. Because the input item is not excluded, it can appear in its own recommendations.
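
A variant that returns the recommendations instead of printing them, and that leaves out the input item, may be easier to reuse and test. The sketch below is one possible shape for it, not the notebook's function: `recommend` is a hypothetical name, and the vocabulary and embeddings are made-up values passed in as arguments so the example runs on its own.

```python
import numpy as np

def recommend(input_item, item_to_int, item_embeddings, top_k=5):
    """Return the top_k items most similar to input_item, excluding input_item itself."""
    if input_item not in item_to_int:
        raise KeyError(f"unknown item: {input_item!r}")
    query_id = item_to_int[input_item]
    # Dot product between the query embedding and every item embedding.
    scores = item_embeddings @ item_embeddings[query_id]
    int_to_item = {idx: name for name, idx in item_to_int.items()}
    # Rank all items from most to least similar, then drop the query item.
    ranked = [int(i) for i in np.argsort(-scores) if int(i) != query_id]
    return [int_to_item[i] for i in ranked[:top_k]]

# Hypothetical vocabulary and embeddings, for illustration only.
item_to_int = {"chicken": 0, "beef": 1, "yogurt": 2, "coffee": 3}
item_embeddings = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.5, 0.5]])
result = recommend("chicken", item_to_int, item_embeddings, top_k=2)
print(result)
```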

Here is an example usage of the function:

In [22]:
evaluate_recommendations('chicken')

Top 5 recommendations for chicken: chicken, beef,other vegetables,whole milk,rolls/buns, beef,citrus fruit,root vegetables,specialty chocolate, citrus fruit,tropical fruit,whole milk,yogurt, meat,yogurt
