In [None]:
import os
import openai
import chromadb
import numpy as np
from chromadb.utils import embedding_functions
from dotenv import load_dotenv

load_dotenv()

# Let's use Chroma to implement an in-memory vector store.
# This example will use generated data about cars for embedding storage and retrieval.

# Step 1: Install necessary packages (run once in a notebook cell, before the imports above)
# !pip install openai chromadb numpy python-dotenv

"""
The objective of this notebook is to demonstrate how to create a Retrieval-Augmented Generation (RAG) system using:
1. Generated data about cars.
2. OpenAI's Embedding model to convert car descriptions to vectors.
3. Chroma as the in-memory vector database to store and retrieve relevant vectors.
"""

In [None]:
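# Step 2: Setting up OpenAI key
# Note: You need an OpenAI API key to proceed. Put it in a .env file as OPENAI_API_KEY,
# or replace the placeholder below. The placeholder is used only if the variable is not already set,
# so a key loaded by load_dotenv() is not overwritten.

os.environ.setdefault("OPENAI_API_KEY", "YOUR_OPENAI_API_KEY")
openai.api_key = os.environ["OPENAI_API_KEY"]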
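
# A minimal sketch of a guard that fails early when the key is missing or is still the placeholder.
# `require_api_key` is a hypothetical helper for this notebook, not part of the openai package.
import os

def require_api_key(env=None, var_name="OPENAI_API_KEY", placeholder="YOUR_OPENAI_API_KEY"):
    """Return the API key found in `env` (defaults to os.environ) or raise ValueError."""
    env = os.environ if env is None else env
    key = env.get(var_name, "").strip()
    if not key or key == placeholder:
        raise ValueError(f"Set {var_name} to a real API key before running the next cells.")
    return key

# Possible usage: openai.api_key = require_api_key()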


In [None]:
# Step 3: Generate data for the cars
"""
To simulate a real dataset, we'll generate information about cars. Each car will have a name, price, engine type, and description.
The following data is used to showcase how information is stored and used in the vector database for retrieval.
"""

# Here we define a small set of initial car data manually.
# This will serve as our base dataset to which we will later add more generated cars.
cars = [
    {
        "name": "Superfast Coupe 2024",
        "price": "$80,000",
        "engine": "V8 5.0L",
        "description": "The Superfast Coupe 2024 offers a V8 engine, exceptional acceleration, and high-speed performance. It is ideal for sports car enthusiasts seeking adrenaline and sleek design."
    },
    {
        "name": "EcoDrive Hatchback 2024",
        "price": "$25,000",
        "engine": "Electric",
        "description": "The EcoDrive Hatchback 2024 is a fully electric vehicle, designed for urban environments with excellent range efficiency, a compact form factor, and environment-friendly features."
    },
    {
        "name": "Family SUV XL 2024",
        "price": "$45,000",
        "engine": "V6 3.5L",
        "description": "The Family SUV XL 2024 is a spacious and versatile SUV that provides comfort, safety, and excellent driving dynamics for long journeys."
    },
    {
        "name": "Luxury Sedan Prime 2024",
        "price": "$70,000",
        "engine": "Hybrid",
        "description": "The Luxury Sedan Prime 2024 offers a combination of luxury, efficiency, and hybrid technology, ensuring a comfortable and refined ride for passengers."
    },
]

# Step 3a: Generate more car data using an OpenAI completion model
"""
To enhance the dataset, we will generate additional car entries using an OpenAI completion model.
The generated data will include car name, price, engine type, and a detailed description.
We will use a prompt to guide the model to return data in a consistent and predictable format.
"""

# Generate 50 car data entries.
# Assumes openai>=1.0. gpt-3.5-turbo-instruct is used here as the completion model;
# substitute any completion model available to your account.

def generate_car_data(num_cars=50):
    """
    Generates car data using an OpenAI completion model.
    Args:
        num_cars (int): Number of car entries to request.
    Returns:
        List[dict]: A list of dictionaries containing car details. It may hold fewer than
        num_cars entries, because responses that cannot be parsed are skipped.
    """
    openai_client = openai.OpenAI(api_key=openai.api_key)
    car_data = []
    for i in range(num_cars):
        response = openai_client.completions.create(
            model="gpt-3.5-turbo-instruct",
            prompt=(
                "Generate details for a car including the following fields: \n"
                "Name: <Car Name>\n"
                "Price: <Car Price>\n"
                "Engine: <Engine Type>\n"
                "Description: <Car Description>\n"
                "Please provide each field in the exact order and format as shown above."
            ),
            max_tokens=150
        )
        # Drop blank lines so that extra spacing in the response does not shift the field positions.
        raw_lines = response.choices[0].text.strip().split('\n')
        car_details = [line for line in raw_lines if line.strip()]
        try:
            # maxsplit=1 keeps values that themselves contain ': ' intact.
            car = {
                "name": car_details[0].split(': ', 1)[1],
                "price": car_details[1].split(': ', 1)[1],
                "engine": car_details[2].split(': ', 1)[1],
                "description": car_details[3].split(': ', 1)[1]
            }
            car_data.append(car)
        except IndexError:
            print(f"Error parsing car details for iteration {i}, skipping entry.")
            continue
    return car_data
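
# Sketch: parsing the model output by label instead of by position
"""
The positional parsing in generate_car_data assumes the model returns the four fields on consecutive lines in a fixed order.
A minimal sketch of a more tolerant alternative follows: it matches each line by its label, so blank lines or reordered fields do not break it.
`parse_car_text` and the sample text are hypothetical and only illustrate the idea.
"""
def parse_car_text(text):
    """Parse 'Label: value' lines into a car dict, or return None if any field is missing."""
    wanted = ("name", "price", "engine", "description")
    fields = {}
    for line in text.splitlines():
        # partition splits on the first colon only, so colons inside the value are preserved.
        label, sep, value = line.partition(":")
        key = label.strip().lower()
        if sep and key in wanted and value.strip():
            fields[key] = value.strip()
    return fields if all(k in fields for k in wanted) else None

# Made-up sample response with a blank line and a colon inside the description.
sample_text = (
    "Name: Example Roadster\n"
    "\n"
    "Price: $30,000\n"
    "Engine: Inline-4 2.0L\n"
    "Description: A made-up entry: light and nimble."
)
print(parse_car_text(sample_text))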

# Append the generated cars to the existing cars list
"""
Here, we append the generated cars to our initial dataset.
If all 50 responses parse successfully, this results in a dataset of 54 cars that will be used for embedding storage and retrieval.
Entries that fail to parse are skipped, so the final count can be lower.
"""
cars += generate_car_data(50)

# Step 4: Initialize Chroma DB and prepare embeddings
"""
We will use OpenAI embeddings to convert the car descriptions into vectors, which can be easily stored and queried in Chroma.
Chroma will serve as our in-memory vector database, allowing us to perform fast similarity searches for the car descriptions.
"""

# Set up Chroma and OpenAI embedding function
client = chromadb.Client()
openai_ef = embedding_functions.OpenAIEmbeddingFunction(api_key=openai.api_key)

# Create a new collection for storing car embeddings
"""
A Chroma collection is similar to a table in a database.
In this collection, we will store embeddings representing each car's description along with metadata such as car name, price, and engine type.
"""
# get_or_create_collection avoids an error if this cell is re-run in the same session.
car_collection = client.get_or_create_collection(name="car_collection", embedding_function=openai_ef)

# Add car data to the Chroma collection
car_ids = [str(i) for i in range(len(cars))]
car_descriptions = [car["description"] for car in cars]
car_metadata = [{"name": car["name"], "price": car["price"], "engine": car["engine"]} for car in cars]

"""
Adding the car data to Chroma involves specifying unique IDs for each car, the descriptions to embed, and relevant metadata.
The metadata will be useful for presenting information to the user when we query the database.
"""
car_collection.add(ids=car_ids, metadatas=car_metadata, documents=car_descriptions)

# Step 5: Querying Chroma for information
"""
In this step, we will demonstrate how to query the Chroma collection to find relevant cars.
We will use a natural language prompt to find cars that match specific requirements.
The embeddings will allow us to determine the similarity between the query and the car descriptions.
"""

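# Sketch: how a similarity comparison between embeddings works
"""
To make the idea of similarity between a query and the car descriptions concrete, here is a minimal sketch on toy data.
The 3-dimensional vectors below are invented for illustration; real embeddings have far more dimensions.
Chroma computes distances internally, and its configured metric is not necessarily cosine similarity, so this shows the principle rather than Chroma's implementation.
"""
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings for two descriptions and one query.
toy_vectors = {"family_suv": [0.9, 0.1, 0.2], "sports_coupe": [0.1, 0.9, 0.3]}
toy_query = [0.8, 0.2, 0.1]

# The best match is the stored vector with the highest similarity to the query.
best_match = max(toy_vectors, key=lambda name: cosine_similarity(toy_query, toy_vectors[name]))
print(best_match)
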
# Example query prompt
prompt = "I want a comfortable car for my family with good safety features."

# Retrieve the top match from Chroma collection
"""
Using the `query` method, we search for the most relevant car based on the given prompt.
The query will return the car description that is most similar to the provided input.
"""
results = car_collection.query(query_texts=[prompt], n_results=1)

# Display the result
"""
Once we get the result, we extract the metadata for the recommended car, such as its name, price, and engine type.
We then print out the recommended car's details for the user.
"""
result_metadata = results["metadatas"][0][0]
result_name = result_metadata["name"]
result_price = result_metadata["price"]
result_engine = result_metadata["engine"]

print(f"Recommended Car: {result_name}\nPrice: {result_price}\nEngine: {result_engine}")
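
# Sketch: listing several matches instead of only the top one
"""
The query result groups its values per query text, which is why the code above indexes with [0][0].
A minimal sketch of formatting more than one match follows, as would be needed with n_results greater than 1.
`format_results` is a hypothetical helper, and `sample_results` is a hand-written stand-in for a query result with made-up distances.
"""
def format_results(query_results):
    """Return one display line per match for the first query text."""
    metadatas = query_results["metadatas"][0]
    distances = query_results["distances"][0]
    lines = []
    for rank, (meta, dist) in enumerate(zip(metadatas, distances), start=1):
        lines.append(f"{rank}. {meta['name']} ({meta['price']}, {meta['engine']}), distance {dist:.3f}")
    return lines

sample_results = {
    "metadatas": [[
        {"name": "Family SUV XL 2024", "price": "$45,000", "engine": "V6 3.5L"},
        {"name": "Luxury Sedan Prime 2024", "price": "$70,000", "engine": "Hybrid"},
    ]],
    "distances": [[0.312, 0.398]],  # made-up values for illustration
}
for line in format_results(sample_results):
    print(line)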

"""
This result demonstrates the retrieval half of a RAG system: the query and the car descriptions are embedded, and the car whose description is closest to the query is returned.
For a user looking for a specific type of car, such as one that is comfortable for a family, our system can find the best match from the dataset.
"""

# Step 6: Summary and possible extensions
"""
Each main function and step of this notebook is documented so that students and developers can follow the process step by step.
The system in this notebook handles simple car queries by returning the single closest description.
It could be extended by storing additional metadata, trying a different embedding model, returning several matches per query, or integrating a front-end for an interactive experience.
"""

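# Sketch: turning retrieved cars into a prompt for the generation step
"""
The code above ends once a matching car has been retrieved. A minimal sketch of the next step in a RAG pipeline follows:
the retrieved descriptions and metadata are assembled into a prompt that a language model could answer from.
`build_rag_prompt` and the prompt wording are assumptions made for illustration, not part of Chroma or the OpenAI library.
"""
def build_rag_prompt(question, documents, metadatas):
    """Combine retrieved car descriptions and metadata with the user's question."""
    context_lines = [
        f"- {meta['name']} ({meta['price']}, {meta['engine']}): {doc}"
        for doc, meta in zip(documents, metadatas)
    ]
    context = "\n".join(context_lines)
    return (
        "Answer the question using only the cars listed below.\n\n"
        f"Cars:\n{context}\n\n"
        f"Question: {question}"
    )

# Example inputs copied from the initial dataset. With the query above, the inputs
# would be results["documents"][0] and results["metadatas"][0].
example_prompt = build_rag_prompt(
    "I want a comfortable car for my family with good safety features.",
    ["The Family SUV XL 2024 is a spacious and versatile SUV that provides comfort, safety, and excellent driving dynamics for long journeys."],
    [{"name": "Family SUV XL 2024", "price": "$45,000", "engine": "V6 3.5L"}],
)
print(example_prompt)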