# Retrieval Augmented Generation (RAG) Example With E-Commerce Reviews

In [1]:
!pip install -qU langchain

## Data Preparation

We are going to use an e-commerce review dataset. For simplicity, we will keep only the product with the most reviews. Download the CSV file from the following URL.

Link: https://www.kaggle.com/datasets/nicapotato/womens-ecommerce-clothing-reviews/data


### Download Data

In [2]:
import pandas as pd

In [3]:
reviews_df = pd.read_csv("Womens Clothing E-Commerce Reviews.csv", index_col=[0])

In [4]:
reviews_df["Clothing ID"].value_counts()

1078    1024
862      806
1094     756
1081     582
872      545
        ... 
776        1
668        1
633        1
734        1
522        1
Name: Clothing ID, Length: 1206, dtype: int64

In [5]:
reviews_df = reviews_df[reviews_df["Clothing ID"] == 1078]

In [6]:
reviews_df.head()

Unnamed: 0,Clothing ID,Age,Title,Review Text,Rating,Recommended IND,Positive Feedback Count,Division Name,Department Name,Class Name
69,1078,56,Great summer fabric!,"I really wanted this to work. alas, it had a s...",3,0,1,General Petite,Dresses,Dresses
90,1078,51,Sweet flattering dress,"I love cute summer dresses and this one, espec...",4,1,0,General Petite,Dresses,Dresses
117,1078,32,,This is the perfect summer dress. it can be dr...,5,1,2,General Petite,Dresses,Dresses
467,1078,61,Great sweater dress!,"Nice fit and flare style, not clingy at all. i...",5,1,1,General,Dresses,Dresses
470,1078,33,"Cute, but cheap",When i first opened this dress and tried it on...,3,0,0,General,Dresses,Dresses


In [7]:
reviews_df.shape

(1024, 10)

## Basic Preprocessing

In [8]:
reviews_df = reviews_df.dropna()

In [9]:
def text_preprocessing(text: str):
    text = text.replace("\n", " ")
    text = text.replace("  ", " ")
    return text

reviews_df["Review Text"] = reviews_df["Review Text"].apply(text_preprocessing)
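The two `replace` calls above handle newlines and make a single pass over double spaces. As a hedged alternative (not what the rest of this notebook uses, so the outputs shown below would differ slightly), a split-and-join approach collapses any run of whitespace, including `\r` characters. The function name `normalize_whitespace` is an assumption made for illustration.

```python
def normalize_whitespace(text: str) -> str:
    # str.split() with no arguments splits on any run of whitespace
    # (spaces, tabs, \n, \r) and ignores leading/trailing whitespace.
    return " ".join(text.split())


# Toy input modeled on the kind of line breaks found in the reviews.
print(normalize_whitespace("natural fabrics!\r \r very   pretty\ncolor"))
```

It could be applied with `reviews_df["Review Text"].apply(normalize_whitespace)` in place of `text_preprocessing`.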

In [10]:
reviews_df = reviews_df.drop_duplicates(subset=["Review Text"])

In [11]:
reviews_df.shape

(871, 10)

### Split Into Chunks

Long documents are generally hard to fit into a prompt, so in most cases they need to be split into chunks. In this case, the reviews are generally short enough to use as they are. Still, I will include the chunk-splitting code for demonstration.

In [12]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    separators = ["."],
    chunk_size = 400,
    chunk_overlap  = 0,
    is_separator_regex = False,
)

def split_into_chunks(text):
    docs = text_splitter.create_documents([text])
    text_chunks = [doc.page_content for doc in docs]

    return text_chunks

In [13]:
reviews_df["text_chunk"] = reviews_df["Review Text"].apply(split_into_chunks)
reviews_df = reviews_df.explode("text_chunk")
reviews_df["chunk_id"] = reviews_df.groupby(level=0).cumcount()

In [14]:
reviews_df.head(3)

Unnamed: 0,Clothing ID,Age,Title,Review Text,Rating,Recommended IND,Positive Feedback Count,Division Name,Department Name,Class Name,text_chunk,chunk_id
69,1078,56,Great summer fabric!,"I really wanted this to work. alas, it had a s...",3,0,1,General Petite,Dresses,Dresses,"I really wanted this to work. alas, it had a s...",0
90,1078,51,Sweet flattering dress,"I love cute summer dresses and this one, espec...",4,1,0,General Petite,Dresses,Dresses,"I love cute summer dresses and this one, espec...",0
467,1078,61,Great sweater dress!,"Nice fit and flare style, not clingy at all. i...",5,1,1,General,Dresses,Dresses,"Nice fit and flare style, not clingy at all. i...",0


In [15]:
reviews_df.shape

(1206, 12)

Now we have the `text_chunk` column, which holds a short string in each row.
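Several of the retrieved chunks shown later in this notebook begin with a leftover `. `, most likely because the splitter keeps the `"."` separator at the start of the following chunk. A minimal sketch of an optional clean-up step is below; `clean_chunk` is a hypothetical helper, not part of the original pipeline, and applying it would change the outputs shown in the rest of the notebook.

```python
def clean_chunk(chunk: str) -> str:
    # Remove leading periods/spaces left over from splitting on ".",
    # then trim any remaining surrounding whitespace.
    return chunk.lstrip(". ").strip()


# Examples modeled on chunks that appear in this notebook's outputs.
print(clean_chunk(". returning it :("))
print(clean_chunk("I purchased this for a very good price"))
```

If wanted, it could be applied with `reviews_df["text_chunk"].apply(clean_chunk)` before computing the embeddings.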

## Creating Document Vectors

There are several text embedding methods:
- OpenAI embedding service (requires API credentials)
- SBert (runs locally)
- Google VertexAI Embedding Service (requires API credentials)
- Word2Vec (traditional baseline)
- ...

OpenAI embeddings tend to give strong results, but SBert models are sufficient in most cases and are free to use.

In [16]:
!pip install -qU sentence-transformers

There are several open-source embedding models. See the list here: https://www.sbert.net/docs/pretrained_models.html

In [17]:
from sentence_transformers import SentenceTransformer

In [18]:
model_name = "all-mpnet-base-v2"
model = SentenceTransformer(model_name)

We could store those vectors in a vector database such as Qdrant, Elasticsearch, or Chroma, or in a local folder. Here we simply keep them in memory.
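For the "local folder" option, a minimal sketch using NumPy's `.npy` format is shown here. The helper names, the file name `text_chunk_vectors.npy`, and the random toy vectors are assumptions for illustration; in the notebook the real array would come from `model.encode`.

```python
import tempfile
from pathlib import Path

import numpy as np


def save_vectors(vectors: np.ndarray, path: Path) -> None:
    # Persist the embedding matrix as a single .npy file.
    np.save(path, vectors)


def load_vectors(path: Path) -> np.ndarray:
    # Load the embedding matrix back into memory.
    return np.load(path)


# Toy stand-in for the (n_chunks, 768) matrix produced by the embedding model.
demo_vectors = np.random.default_rng(0).normal(size=(4, 768)).astype(np.float32)

with tempfile.TemporaryDirectory() as tmp_dir:
    vector_path = Path(tmp_dir) / "text_chunk_vectors.npy"
    save_vectors(demo_vectors, vector_path)
    restored_vectors = load_vectors(vector_path)

print(restored_vectors.shape)
```

Because retrieval looks rows up by position, the saved matrix only stays valid as long as the row order of `reviews_df` is unchanged; saving the dataframe alongside the vectors is one way to keep them aligned.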

In [19]:
text_chunks = reviews_df["text_chunk"].tolist()
text_chunk_vectors = model.encode(text_chunks, show_progress_bar=True)

Batches:   0%|          | 0/38 [00:00<?, ?it/s]

In [20]:
text_chunk_vectors.shape

(1206, 768)

## Retrieve Relevant Documents

At inference time, we need to vectorize the user's query and find the k most similar documents in our vector database.

In [21]:
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def retrieve_relevant_documents(query, text_chunk_vectors, k):
    query_embedding = model.encode(query)

    # Calculate cosine similarity between the query and each document
    similarities = cosine_similarity([query_embedding], text_chunk_vectors)[0]

    # Get indices of the top k most similar documents
    top_k_indices = np.argsort(similarities)[::-1][:k]

    # Retrieve the relevant documents
    return reviews_df.iloc[top_k_indices]
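To make the ranking step concrete, here is a small self-contained sketch of the same idea using only NumPy on toy 2-dimensional vectors. The function name `top_k_cosine` and the toy data are assumptions for illustration; the notebook itself relies on scikit-learn's `cosine_similarity`.

```python
import numpy as np


def top_k_cosine(query_vector: np.ndarray, document_vectors: np.ndarray, k: int):
    # Normalize to unit length so the dot product equals cosine similarity.
    query_unit = query_vector / np.linalg.norm(query_vector)
    document_units = document_vectors / np.linalg.norm(
        document_vectors, axis=1, keepdims=True
    )
    similarities = document_units @ query_unit

    # Sort ascending, reverse to descending, keep the first k indices.
    top_k_indices = np.argsort(similarities)[::-1][:k]
    return top_k_indices, similarities[top_k_indices]


toy_documents = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]])
toy_query = np.array([1.0, 0.1])

indices, scores = top_k_cosine(toy_query, toy_documents, k=2)
print(indices, scores)
```

The query points almost along the first axis, so the first document ranks highest and the diagonal document comes second.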

Let's run a sanity check:

In [22]:
relevant_rows = retrieve_relevant_documents("Fabric quality", text_chunk_vectors, 5)

In [23]:
relevant_rows

Unnamed: 0,Clothing ID,Age,Title,Review Text,Rating,Recommended IND,Positive Feedback Count,Division Name,Department Name,Class Name,text_chunk,chunk_id
14392,1078,34,Beautiful but runs tiny,This dress is my first retailer purchase in se...,4,1,8,General Petite,Dresses,Dresses,. the cotton is nice - glad to see retailer go...,1
23460,1078,34,Such high hopes!,I purchased this for a very good price and i t...,3,0,0,General,Dresses,Dresses,I purchased this for a very good price and i t...,0
7843,1078,46,Another winner from floreat,I found this dress to be a perfect flowy fit f...,5,1,9,General Petite,Dresses,Dresses,. personally do not think it has an abundance ...,1
14412,1078,43,"Wrinkly cotton, nice shape","While the cut, drape, and shape of this dress ...",3,1,0,General Petite,Dresses,Dresses,"While the cut, drape, and shape of this dress ...",0
6601,1078,39,Into the west,Totally out of my style range but am totally e...,5,1,3,General,Dresses,Dresses,. if you have tried the marvelous silk santee ...,1


In [24]:
relevant_rows["text_chunk"].tolist()

['. the cotton is nice - glad to see retailer go back to offering natural fabrics!\r \r very pretty color, and print.',
 "I purchased this for a very good price and i typically love maeve... should have been a win-win. the fabric is thin and slinky in the most unfortunate way. it made finding appropriate undergarments very difficult. i had to add a slip (that helped) and figured that since i'm losing weight, it would look better when i'm down a few more pounds",
 '. personally do not think it has an abundance of fabric. in fact, it is a lot less roomy than a lot of others in this cut due to the embroidery',
 "While the cut, drape, and shape of this dress are very attractive, seeing the fabric in person was disappointing to me. it's clearly well-made and there are abundant high-quality details. however, the combination of a large-ish pattern and the wrinkly, thin cotton still felt kind of limp and cheap. i sent it back.",
 '. if you have tried the marvelous silk santee swing dress, the 

## Prompt Engineering

Now we need to create an augmented prompt to feed into the chat model.

In [25]:
prompt_template = """
You are a useful chat assistant. I will give you several product reviews in an e-commerce platform for a specific product.
I want you to consider different opinions in the information I give you, and create a short but informative answer for user's query.
If there is no information to answer the question, then don't answer.

Here is the documents:
<documents>

User query: <query>
"""

def create_prompt(query, k=5):
    # Retrieve the chunks most relevant to the query and insert them into the prompt template.
    # Other columns (e.g. Title or Rating) could be added to the prompt as well.
    relevant_rows = retrieve_relevant_documents(query, text_chunk_vectors, k)
    text_chunks = relevant_rows["text_chunk"].tolist()
    text_chunks_string = "\n".join(text_chunks)

    prompt = prompt_template
    prompt = prompt.replace("<documents>", text_chunks_string)
    prompt = prompt.replace("<query>", query)

    return prompt

Let's run a sanity check.

In [26]:
prompt = create_prompt("Fabric Quality")
print(prompt)


You are a useful chat assistant. I will give you several product reviews in an e-commerce platform for a specific product.
I want you to consider different opinions in the information I give you, and create a short but informative answer for user's query.
If there is no information to answer the question, then don't answer.

Here is the documents:
. the cotton is nice - glad to see retailer go back to offering natural fabrics!  very pretty color, and print.
I purchased this for a very good price and i typically love maeve... should have been a win-win. the fabric is thin and slinky in the most unfortunate way. it made finding appropriate undergarments very difficult. i had to add a slip (that helped) and figured that since i'm losing weight, it would look better when i'm down a few more pounds
. personally do not think it has an abundance of fabric. in fact, it is a lot less roomy than a lot of others in this cut due to the embroidery
While the cut, drape, and shape of this dress ar

In [27]:
prompt = create_prompt("Return policy")
print(prompt)


You are a useful chat assistant. I will give you several product reviews in an e-commerce platform for a specific product.
I want you to consider different opinions in the information I give you, and create a short but informative answer for user's query.
If there is no information to answer the question, then don't answer.

Here is the documents:
. returning it :(
. returning and hope the replacement is in better shape. if so, it will be a keeper!
. i agree with the other reviewer that a slim person could carry the look well. i usually don't return online purchases unless they
. this dress will not make it to sale.
. thank you allison for such great customer service and retailer for employing such great staff.

User query: Return policy



In [28]:
prompt = create_prompt("Which size I should buy")
print(prompt)


You are a useful chat assistant. I will give you several product reviews in an e-commerce platform for a specific product.
I want you to consider different opinions in the information I give you, and create a short but informative answer for user's query.
If there is no information to answer the question, then don't answer.

Here is the documents:
. i sized up to a l because of my linebacker shoulders/broad bust (thanks retailer user reviewers!) i am comfy as long as i don't flail or flap. my one teeny tiny little complaint is right out of the factory cellophane bag, i had a loose
. i went with the xs to avoid issues in the shoulders, but i maybe could have gone even smaller. fabric doesn't have a lot of give, so if you are large in the chest, my guess would be a possi
. i got the small and it fits great. the print is lovely - the cream fabric is speckled in gray dots in some
. this length was also a bit too long, so if you are 5'5 or shorter you may want to order the petite size.
. i

## Answer Generation

The last step in our pipeline is to feed the augmented prompt into a chat model. We could use open models such as Llama or closed-source models such as ChatGPT, PaLM, or Claude.

Let's use ChatGPT.

In [29]:
!pip install -q openai

In [30]:
import os
os.environ["OPENAI_API_KEY"] = "XXXXXXXXXXXX"

In [31]:
from openai import OpenAI
client = OpenAI()

def generate(prompt):
    completion = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "user", "content": prompt}
        ]
    )

    return completion.choices[0].message.content
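The `generate` function above sends the whole prompt as a single user message. One possible variation, sketched here under the assumption that separating instructions from content is desirable, is to put the instructions in a `system` message and the reviews plus the query in a `user` message. `build_messages` is a hypothetical helper and is not used elsewhere in this notebook.

```python
def build_messages(instructions: str, documents: list[str], query: str) -> list[dict]:
    # One review chunk per line, prefixed with "- " so chunk boundaries stay visible.
    documents_block = "\n".join(f"- {doc}" for doc in documents)
    user_content = f"Reviews:\n{documents_block}\n\nUser query: {query}"

    return [
        {"role": "system", "content": instructions},
        {"role": "user", "content": user_content},
    ]


example_messages = build_messages(
    instructions="Answer using only the product reviews provided.",
    documents=["i got the small and it fits great.", "runs slightly large."],
    query="Which size I should buy",
)
print(example_messages[1]["content"])
```

The returned list could be passed as the `messages` argument of `client.chat.completions.create` instead of the single-message list used above.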

Sanity check:

In [32]:
user_query = "Which size I should buy"
prompt = create_prompt(user_query)
answer = generate(prompt)

In [33]:
print("PROMPT")
print(prompt)
print("-" * 50)
print("MODEL ANSWER")
print(answer)

PROMPT

You are a useful chat assistant. I will give you several product reviews in an e-commerce platform for a specific product.
I want you to consider different opinions in the information I give you, and create a short but informative answer for user's query.
If there is no information to answer the question, then don't answer.

Here is the documents:
. i sized up to a l because of my linebacker shoulders/broad bust (thanks retailer user reviewers!) i am comfy as long as i don't flail or flap. my one teeny tiny little complaint is right out of the factory cellophane bag, i had a loose
. i went with the xs to avoid issues in the shoulders, but i maybe could have gone even smaller. fabric doesn't have a lot of give, so if you are large in the chest, my guess would be a possi
. i got the small and it fits great. the print is lovely - the cream fabric is speckled in gray dots in some
. this length was also a bit too long, so if you are 5'5 or shorter you may want to order the petite si

## Finalize RAG Pipeline

In [34]:
def RAG(query, verbose=False):
    prompt = create_prompt(query, k=20)
    model_answer = generate(prompt)

    if verbose:
        return model_answer, prompt
    else:
        return model_answer

In [35]:
model_answer, prompt = RAG("How is the fabric quality?", verbose=True)

In [36]:
print(model_answer)

The fabric quality of the product has received mixed reviews. Some customers have praised it as soft, comfortable, and natural, with one referring to it as "buttery soft". Some reviews also mentioned that the product's cotton feels more like a sateen than a poplin weave and other parts of it are patchworked with linen and jersey materials. On the other hand, a few customers found the fabric to be disappointing, describing it as flimsy, thin, see-through, and overly stiff in some parts. A handful even mentioned that the fabric quality felt cheap or slinky in an unflattering way.


In [37]:
print(prompt)


You are a useful chat assistant. I will give you several product reviews in an e-commerce platform for a specific product.
I want you to consider different opinions in the information I give you, and create a short but informative answer for user's query.
If there is no information to answer the question, then don't answer.

Here is the documents:
. the cotton is nice - glad to see retailer go back to offering natural fabrics!  very pretty color, and print.
Decent quality. i am 5'4" 130 pounds. m is too big. runs slightly large.
. loved that the boning at the front bust and sides keeps the dress up and in place without the straps. side pockets, and the cotton feels soft, more like a sateen than a poplin weave. lovely colo
Beautiful color; however, the fabric did not work for me. tight on top, large on bottom.
Flattering, flowy, comfortable. the back strops have elastic so easy on and off. flows when you walk, the ties have two different types of black fabric, matching knit on one s

## Chat With RAG

In [38]:
RAG("Is it a price performance product?")

"Yes, the majority of the reviews indicate that the product is worth the price. Both the quality and the design of the dress are appreciated by many customers. However, a few users mentioned that the fabric may feel a bit cheap or thin to some. Also, the fitting is praised, but it seems to run slightly smaller, especially across the shoulders. It's also worth mentioning that it has received positive feedback especially when purchased on sale or discount."

In [39]:
RAG("Does it keep warm?")

'The dress seems to be more suited to cooler temperatures rather than hot, as mentioned by multiple users. Some of them wore it in warm climates with additional layers like tights and cardigans, while others said it was comfortable during the winter season. The material Composition (polyester and rayon) could also retain heat as suggested by one user. However, one user from California mentioned it is not good for winter even with a sweater layered over it. So, it may depend on the intensity of the cold in your area.'

In [40]:
RAG("What color is that?")

"The product comes in several colors including green, lime, cranberry, blue motif, tangerine, burnt orange, grey, and a blue shade. The blue motif is cream, mushroom, black and very dark navy. Some users have mentioned that the tangerine can look more red than orange, and there's a version with a deep navy background that can look almost black from some angles. Please note that perceived colors may vary due to lighting or screen settings."

In [41]:
RAG("Can I return the product with no issue?")

'Based on the reviews, some customers have experienced issues with the product and have successfully returned it. Hence, it seems that returning the product should be possible without any issue.'

In [42]:
RAG("What is the meaning of life?")

'The documents do not provide information on the meaning of life.'

In [43]:
RAG("Why should I buy it?")

"The product is highly recommended by many purchasers for its good quality and unique design. It's a classic, timeless piece that is versatile for many occasions from work to a date. The dress seems well-made, and it fits nicely due to the elastic at the back. The print is eye-catching and it's comfortable. Furthermore, it can be both dressed up or down. Some purchasers also mentioned that it is worth its price. However, do note that some suggest to size up due to its fitted construction, and the buttons may be loosely sewn on."

In [44]:
RAG("Why shouldn't I buy it?")

'Some users found issues with the sizing, especially around the chest area with the dress being too tight or loose. Others had concerns about the fabric quality and zipper. Some reviewers also mentioned the dress has a weird underwire that could impact its fit. Certain colors that were available online were not available in stores, and there were some negative comments around the price point being too high for the quality of the dress. Lastly, for some users, the dress was not flattering for their figure.'

In [45]:
RAG("Is it transparent?")

"The product has received mixed reviews about its transparency. Some reviews state that the dress is unlined but not see-through while others suggest that it's quite sheer and see-through, which might require an additional layer underneath for modesty. Therefore, depending on the color and personal preference, the level of transparency might vary."