# Practical Example: Add Context for a Large Language Model (LLM)

In this section, you’ll get hands-on experience using ChromaDB to provide context to OpenAI’s ChatGPT LLM.

To set the scene, you’re a data scientist who works for a large car dealership. The dealership has sold hundreds of thousands of cars and received many reviews. Your stakeholders would like you to create a system that summarizes different types of car reviews. They’ll use these summaries to improve marketing and prevent poor customer experiences in the future.

You’re responsible for designing and implementing the back-end logic that creates these summaries. You’ll take the following steps:

* Create a ChromaDB collection that stores car reviews along with associated metadata.
* Create a system that accepts a query, finds semantically similar documents, and uses the similar documents as context to an LLM. The LLM will use the documents to answer the question posed in the query.

This process of retrieving relevant documents and using them as context for a generative model is known as retrieval-augmented generation (RAG). This allows LLMs to make inferences using information that wasn’t included in their training dataset, and this is the most common way to apply ChromaDB in LLM applications.

There are lots of factors and variations to consider when implementing a RAG system, but for this example, you’ll only need to know the fundamentals. Here’s what a RAG system might look like with ChromaDB:

<img src="rag.avif" width=700>

\
You first embed and store your documents in a ChromaDB collection. In this example, those documents are car reviews. You then run a query like **find and summarize the best car reviews** through ChromaDB to find semantically relevant documents, and you pass the query and relevant documents to an LLM to generate a context-informed response.

The key here is that the LLM takes both the original query and the relevant documents as input, allowing it to generate a meaningful response that it wouldn’t be able to create without the documents.

In reality, your deliverable for this project would likely be a chatbot that stakeholders use to ask questions about car reviews through a user interface. While building a full-fledged chatbot is beyond the scope of this tutorial, in later modules you will learn **LangChain** that are designed specifically to help you assemble LLM applications.

The focus of this example is for you to see how you can use ChromaDB for RAG. This practical knowledge will help reduce the learning curve for LangChain.

## Prepare and Inspect Your Dataset

You’ll use the <a href="https://www.kaggle.com/datasets/ankkur13/edmundsconsumer-car-ratings-and-reviews">Edmunds-Consumer Car Ratings and Reviews</a> dataset from Kaggle to create the review collection. This dataset contains over 200,000 reviews and ratings covering 62 major car brands.

Once you’ve downloaded the dataset, unzip the file and store the data in your project directory inside a subdirectory called `data/`. There’s one CSV file per car, and you should store all of them within `data/archive/`.

To start, you can take a look at the dataset using Polars, a popular DataFrame library. Make sure that you have Polars installed in your environment using Shell or Terminal: `python -m pip install polars`

The focus of this tutorial isn’t on Polars, so you won’t get a detailed explanation of the Polars code. 

Here’s a function that you can use to prepare the car reviews dataset for ChromaDB:

In [1]:
import pathlib
import polars as pl

def prepare_car_reviews_data(data_path: pathlib.Path, vehicle_years: list[int] = [2017]):
    """Prepare the car reviews dataset for ChromaDB"""

    # Define the schema to ensure proper data types are enforced
    dtypes = {
        "": pl.Int64,
        "Review_Date": pl.Utf8,
        "Author_Name": pl.Utf8,
        "Vehicle_Title": pl.Utf8,
        "Review_Title": pl.Utf8,
        "Review": pl.Utf8,
        "Rating": pl.Float64,
    }
    # This part defines the data types for each column in your dataset. This helps Polars read and process the data efficiently and correctly. For instance, pl.Utf8 indicates text data, pl.Int64 represents integers, and pl.Float64 signifies floating-point numbers.

    # Scan the car reviews dataset(s)
    car_reviews = pl.scan_csv(data_path, schema_overrides=dtypes)
    # This line uses the pl.scan_csv function to load the car reviews dataset from the specified path. By providing the dtypes we defined earlier, we ensure the data is loaded with the correct types.
    

    # Extract the vehicle title and year as new columns
    # Filter on selected years
    car_review_db_data = (
        car_reviews.with_columns(
            [
                (
                    pl.col("Vehicle_Title").str.split(by=" ").list.get(0).cast(pl.Int64)
                ).alias("Vehicle_Year"),
                
                (
                    pl.col("Vehicle_Title").str.split(by=" ").list.get(1)
                ).alias("Vehicle_Model"),
            ]
        )
        .filter(pl.col("Vehicle_Year").is_in(vehicle_years))
        .select(["Review_Title", "Review", "Rating", "Vehicle_Year", "Vehicle_Model"])
        .sort(["Vehicle_Model", "Rating"])
        .collect()
    )
    # with_columns: This adds two new columns, "Vehicle_Year" and "Vehicle_Model", extracted from the "Vehicle_Title" column. It cleverly splits the title by spaces, takes the first part as the year (converted to an integer), and the second part as the model.
        # filter: This filters the data to include only reviews for the specified vehicle_years.
        # select: This selects the desired columns for the final output.
        # sort: This sorts the data first by "Vehicle_Model" and then by "Rating".
        # collect: This triggers the actual computation and returns the processed data as a Polars DataFrame.


    # Create ids, documents, and metadatas data in the format chromadb expects
    ids = [f"review{i}" for i in range(car_review_db_data.shape[0])]
    documents = car_review_db_data["Review"].to_list()
    metadatas = car_review_db_data.drop("Review").to_dicts()

    return {"ids": ids, "documents": documents, "metadatas": metadatas}

`prepare_car_reviews_data()` accepts the path to the car reviews dataset and a list of vehicle years to filter on, and it returns a dictionary with the review data properly formatted for ChromaDB. You can include different vehicle years, but keep in mind that the more years you include, the longer it’ll take to build the collection. By default, you’re only including vehicles from 2017.

You can see this function in action with the following code:

In [2]:
path="data/archive/"
data_path = pathlib.Path(path)
car_reviews_data = prepare_car_reviews_data(data_path)

In [3]:
car_reviews_data

{'ids': ['review0',
  'review1',
  'review2',
  'review3',
  'review4',
  'review5',
  'review6',
  'review7',
  'review8',
  'review9',
  'review10',
  'review11',
  'review12',
  'review13',
  'review14',
  'review15',
  'review16',
  'review17',
  'review18',
  'review19',
  'review20',
  'review21',
  'review22',
  'review23',
  'review24',
  'review25',
  'review26',
  'review27',
  'review28',
  'review29',
  'review30',
  'review31',
  'review32',
  'review33',
  'review34',
  'review35',
  'review36',
  'review37',
  'review38',
  'review39',
  'review40',
  'review41',
  'review42',
  'review43',
  'review44',
  'review45',
  'review46',
  'review47',
  'review48',
  'review49',
  'review50',
  'review51',
  'review52',
  'review53',
  'review54',
  'review55',
  'review56',
  'review57',
  'review58',
  'review59',
  'review60',
  'review61',
  'review62',
  'review63',
  'review64',
  'review65',
  'review66',
  'review67',
  'review68',
  'review69',
  'review70',
  'review

## Create a Collection and Add Reviews

Next, you’ll create a collection and add the reviews. This function will help you create a collection in a modular way. Before running this function, make sure you’ve installed more-itertools using shell or terminal: `python -m pip install more-itertools`

In [4]:
#this code is mine
import pathlib
import chromadb
from chromadb.utils import embedding_functions
from more_itertools import batched
from sentence_transformers import SentenceTransformer, util
from chromadb.config import Settings


def build_chroma_collection(
        chroma_path: pathlib.Path,
        collection_name: str,
        embedding_func_name: str,
        ids: list[str],
        documents: list[str],
        metadatas: list[dict],
        distance_func_name: str = "cosine",
):
    # Initialize the ChromaDB client
    client = chromadb.PersistentClient(path=chroma_path)

    # Create the embedding function
    embedding_func = embedding_functions.SentenceTransformerEmbeddingFunction(
        model_name=embedding_func_name
    )

    # Create the collection
    collection = client.create_collection(
        name=collection_name,
        embedding_function=embedding_func,
    )
    
    # Add data to the collection in batches
    document_indices = list(range(len(ids)))
    for batch in batched(document_indices, 166):
        batch_ids = [ids[i] for i in batch]
        batch_documents = [documents[i] for i in batch]
        batch_metadatas = [metadatas[i] for i in batch]
        collection.add(
            ids=batch_ids,
            documents=batch_documents,
            metadatas=batch_metadatas
        )

    return collection

  from .autonotebook import tqdm as notebook_tqdm


In lines 1 to 4, you import the dependencies needed to define `build_chroma_collection()`. This function accepts the path where you’ll store the embeddings, the name of the collection to create, the name of the embedding function to use, the data to store in the collection, and the name of the distance function to use.

You then instantiate a `PersistentClient()` object, create the collection, and add data to the collection. In lines 29 to 39, you add data to the collection in batches using the `more-itertools` library. Calling `batched(document_indices, 166)` breaks `document_indices` into a list of tuples, each with size 166. ChromaDB’s current maximum batch size is 166, but this might change in the future.

You can now create the collection that stores car reviews:

In [5]:
DATA_PATH = "data/archive/*"
CHROMA_PATH = "car_review_embeddings/"
EMBEDDING_FUNC_NAME = "multi-qa-MiniLM-L6-cos-v1"
COLLECTION_NAME = "car_reviews"



As before, you import dependencies, define some configuration variables, and transform the raw reviews data. You then build a collection called car_review_embeddings using `build_chroma_collection()`. Notice that you’re now using the `"multi-qa-MiniLM-L6-cos-v1"` embedding function. The model behind this embedding function was specifically trained to solve question-and-answer semantic search tasks.

Building the collection will take a few minutes, but once it completes, you can run queries like the following:

In [6]:
client = chromadb.PersistentClient(CHROMA_PATH)
embedding_func = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name=EMBEDDING_FUNC_NAME
)

In [7]:
#if collection exists, it gives an error

car_reviews_data = prepare_car_reviews_data(data_path)
build_chroma_collection(
    chroma_path=CHROMA_PATH,
    collection_name=COLLECTION_NAME,
    embedding_func_name=EMBEDDING_FUNC_NAME,
    ids=car_reviews_data["ids"],
    documents=car_reviews_data["documents"],
    metadatas=car_reviews_data["metadatas"],
)
collection = client.get_collection(name=COLLECTION_NAME, embedding_function=embedding_func)

#great_reviews = collection.search("Rating:5", limit=5)

great_reviews = collection.query( query_texts=["Rating:5"], n_results=5)

# Print the results
print(great_reviews)


{'ids': [['review4746', 'review5815', 'review2006', 'review4346', 'review5525']], 'embeddings': None, 'documents': [[' Past tests have rated this vehicle very favorable.', " I don't know why my rating is coming up as two stars.  I'd rate it four stars.  I've only have my S90 for a short time, but so far, it's everything I wanted: all-wheel drive, fuel efficient, roomy, stylish, four heated seats, individual climate control, dynamic cruise control, LED steerable lights, handless trunk, good user interface.  I have a 2015 Lexus GS 350 and the Volvo turned out to be a better value than a 2017 GS 350.  Where I live, you can't find a Lexus with comparable equipment.  And if you want a user interface that will drive you nuts, try Lexus.  My only two complaints are that the stock stereo is a bit anemic and there are a few settings, like auto highbeams, that have to be turned on every time you get in the car.  Also, the car is more of a cruiser than a sports sedan.  However, that it is not to 

In [8]:
great_reviews["documents"]

[[' Past tests have rated this vehicle very favorable.',
  " I don't know why my rating is coming up as two stars.  I'd rate it four stars.  I've only have my S90 for a short time, but so far, it's everything I wanted: all-wheel drive, fuel efficient, roomy, stylish, four heated seats, individual climate control, dynamic cruise control, LED steerable lights, handless trunk, good user interface.  I have a 2015 Lexus GS 350 and the Volvo turned out to be a better value than a 2017 GS 350.  Where I live, you can't find a Lexus with comparable equipment.  And if you want a user interface that will drive you nuts, try Lexus.  My only two complaints are that the stock stereo is a bit anemic and there are a few settings, like auto highbeams, that have to be turned on every time you get in the car.  Also, the car is more of a cruiser than a sports sedan.  However, that it is not to say that it isn't stable or sufficiently responsive.  Unlike the previous reviewer, I haven't had any problems na

In [9]:
collection

Collection(name=car_reviews)

You query the `car_reviews` collection with Find me some positive reviews that discuss the car’s performance, and you display the most similar result. All of your reviews are now embedded, and you’re ready to integrate them into the summarization application.

## Connect to an LLM Service

As you know, you’re going to use the car reviews as context to an LLM. This means that you’ll ask the LLM a question like How would you summarize the most common complaints from negative car reviews?, and you’ll provide relevant reviews to help the LLM answer this question. To do this, you’ll first need to install the Groq library: `python -m pip install groq`

Also install: python -m pip install python-dotenv

In the root directory of your project, create a file named .env.

Open the .env file and add the following line, replacing "My API Key" with your actual Groq API key:
  > GROQ_API_KEY="My API Key"

In [10]:
from dotenv import load_dotenv
import os

load_dotenv()

api_key = os.getenv("GROQ_API_KEY")

In [14]:
import os
from groq import Groq

client = Groq(
    api_key=api_key
)

context = "You are a customer success employee at a large car dealership."
question = "What's the key to great customer satisfaction?"

chat_completion = client.chat.completions.create(
    messages=[
        {
            "role": "user",
            "content": "What is the hardship of living in extremely cold conditions? ",
        }
    ],
    model="llama3-8b-8192",
)

print(chat_completion.choices[0].message.content)


Living in extremely cold conditions can be extremely challenging and has several hardships. Here are some of the most significant difficulties people may face:

1. **Hypothermia and Frostbite**: Prolonged exposure to cold temperatures can lead to hypothermia (body temperature below 95°F/35°C) and frostbite (damage to skin and underlying tissues due to freezing). Both can be life-threatening if not treated promptly.
2. **Reduced Mobility**: Cold weather can make it difficult to move around, especially for older adults, young children, and people with certain medical conditions. Icy roads, sidewalks, and surfaces can cause falls and injuries.
3. **Limited Access to Basic Amenities**: In extreme cold, essential services like water, electricity, and heating may be disrupted or unavailable. This can make daily life, including basic hygiene and cooking, a significant struggle.
4. **Higher Energy Consumption**: Maintaining a warm indoor temperature requires more energy, which can increase bil

The context message, You are a customer success employee at a large car dealership, helps set the behavior of the LLM so that its responses are more likely to have a desired tone. This type of message is also sometimes called a role prompt. The user message, What’s the key to great customer satisfaction?, is the actual question or task that you want the LLM to respond to.

## Provide Context to the LLM

As you can see, the LLM gives you a fairly generic description of what it takes to promote customer satisfaction. None of this information is particularly useful to you because it isn’t specific to your car dealership. To make this response more tailored to your business, you need to provide the LLM with some reviews as context:

In [15]:
question = "What's the key to great customer satisfaction?"

good_reviews = collection.query(
    query_texts=[question],
    n_results=5,
    where={"Rating": {"$gte": 3}} #gte means greater or equal to 3
)

reviews_str = "\n".join([f"{i+1}. {review}" for i, review in enumerate(good_reviews['documents'])])


context = """
You are a customer success employee at a large
 car dealership. Use the following car reviews
 to answer questions: {}
"""

question = """
What's the key to great customer satisfaction
 based on detailed positive reviews?
"""

client = Groq(api_key=api_key)


good_review_summaries = client.chat.completions.create(
    messages=[
        {"role": "system", "content": context},
        {"role": "user", "content": question}
    ],
    model="llama3-8b-8192",
    temperature=0
)

print(good_review_summaries.choices[0].message.content)

After reviewing numerous positive reviews from satisfied customers, I've identified some common themes that contribute to great customer satisfaction. Here are the key takeaways:

1. **Knowledgeable andFriendly Sales Team**: Many customers praised our sales team's expertise, attentiveness, and pleasant demeanor. They felt comfortable asking questions and received unbiased advice, which helped build trust and confidence in their purchasing decision.

Example Review: "The sales staff was incredible! They took the time to understand our needs and showed us exactly what we wanted to see."

2. **Efficient andStreamlined Process**: Customers appreciated the seamless and hassle-free buying experience, which started with online research and ended with a smooth test drive and paperwork process.

Example Review: "The whole process was so easy and quick! We got in and out in no time, and the staff made sure we had everything we needed."

3. **Personalized Attention**: Satisfied customers often me

As before, you import dependencies, define configuration variables, set your OpenAI API key, and load the car_reviews collection. You then define context and question variables that you’ll feed into an LLM for inference. The key difference in context is the {} at the end, which will be replaced with relevant reviews that give the LLM context to base its answers on.

You then pass the question into collection.query() and request ten reviews that are most related to the question. In this query, `where={"Rating": {"$gte": 3}}` filters the collection to reviews that have a rating greater than or equal to 3. Lastly, you pass the comma-separated review_str into context and request an answer from Llama.

Notice how much more specific and detailed ChatGPT’s response is now that you’ve given it relevant car reviews as context. For example, if you look through the documents in good_reviews, then you’ll see reviews that mention smooth acceleration and federal tax credits, both of which are incorporated into the LLM’s response.

> Note: It’s a common misconception that setting temperature=0 guarantees deterministic responses from ChatGPT. While responses are closer to deterministic when temperature=0, there’s no guarantee that you’ll get the same response for identical requests. Because of this, ChatGPT might output slightly different results than what you see in this example.

Now, even though ChatGPT used relevant reviews to inform its response, you might still be thinking that the response was fairly generic. To really see the power of using ChromaDB to provide ChatGPT with context, you can ask a question about a specific review:

In [16]:
poor_reviews = collection.query(
    query_texts=[question],
    n_results=5,
    where={"Rating": {"$lt": 3}} #lt means less than 2

)

# Format each review with a number
formatted_reviews = [f"{index + 1}. {review}" for index, review in enumerate(poor_reviews["documents"])]

# Combine all formatted reviews into a single string with line breaks
reviews_str = "\n".join(formatted_reviews)


poor_review_analysis = client.chat.completions.create(
    model="llama3-8b-8192",
    messages=[
        {"role": "system", "content": context},
        {"role": "user", "content": f"""
        Based on the provided following negative reviews:
        {reviews_str}
        What can be improved to increase customer satisfaction?
        """}
    ],
    temperature=0
)

print(poor_review_analysis.choices[0].message.content)

Based on the negative reviews provided, the following can be improved to increase customer satisfaction:

1. Quality Control: The first review highlights the importance of quality control, stating that "quantity over quality mentality" is a significant issue. Ensuring that the cars are thoroughly inspected and tested before being shipped out to customers can help minimize the number of defective vehicles.
2. Communication: The first review also mentions that the salesperson did not properly explain the features of the car, which led to disappointment. Improving communication with customers, especially during the sales process, can help build trust and ensure that customers understand what they're getting.
3. Warranty and Repair: The third review emphasizes the importance of having a reliable warranty and repair process. Ensuring that issues are efficiently addressed and covered by warranty can help build trust with customers.
4. Product Design and Durability: The fourth review criticiz

In this example, you query the collection for five reviews that have the worst implications on the dealership, and you filter on reviews that have a rating less than or equal to 3. You then pass this question, along with the five relevant reviews, to Llama.

Llama points to a specific review where a customer had a poor experience at the dealership, quoting the review directly. ChatGPT has no knowledge of this review without your providing it, and you may not have found this review without a vector database capable of accurate semantic search. This is the power that you unlock when combining vector databases with LLMs.

You’ve now seen why vector databases like ChromaDB are so useful for adding context to LLMs. In this example, you’ve scratched the surface of what you can create with ChromaDB, so just think about all the potential use cases for applications like this. The LLM and vector database landscape will likely continue to evolve at a rapid pace, but you can now feel confident in your understanding of how the two technologies interplay with each other.