<a href="https://colab.research.google.com/github/sualeh/introduction-to-chatgpt-api/blob/main/chatgpt-rag.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

----------

> **How to Run This Notebook**

To get started, create an OpenAI API account, set up billing, and generate an API key at https://platform.openai.com/. If you are running the notebook locally in Visual Studio Code or another IDE, create a file called `.env` and add the line `OPENAI_API_KEY=<your-openai-api-key>`. The key will be read by the `load_dotenv` function from the `python-dotenv` library.

Otherwise, if you are running in Google Colab, create a secret called `OPENAI_API_KEY` and set it to the value of your OpenAI API key.

Run the code below to read the key.


In [None]:
%pip install -qq python-dotenv

from os import environ as env
from dotenv import load_dotenv
import logging

logger = logging.getLogger(__name__)
logging.basicConfig(
    level=logging.INFO, format="%(asctime)s - %(levelname)s - %(message)s"
)

# Load the key from an environment variable called "OPENAI_API_KEY"
# python-dotenv (https://pypi.org/project/python-dotenv/) reads
# environment variables from the .env file
load_dotenv()
try:
  # In Google Colab, fall back to a secret if the variable is not already set
  from google.colab import userdata
  if 'OPENAI_API_KEY' not in env:
    env['OPENAI_API_KEY'] = userdata.get('OPENAI_API_KEY')
except ModuleNotFoundError:
  logger.info("Not running in Google Colab")
  # No action - rely on the OPENAI_API_KEY environment variable
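
Before making any API calls, it can help to confirm that the key was actually found. The check below is a minimal sketch: `has_api_key` is a hypothetical helper, not part of `python-dotenv` or the OpenAI library, and it only tests that the variable is set and non-blank, not that the key is valid.

```python
from os import environ


def has_api_key(variables=environ, name="OPENAI_API_KEY"):
    """Return True if the named variable is set to a non-blank value."""
    # Hypothetical helper: checks presence only, not whether the key is valid
    value = variables.get(name)
    return bool(value and value.strip())


if not has_api_key():
    print("OPENAI_API_KEY is not set; check your .env file or Colab secret")
```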



----------

# Embedding and Similarity

## Getting Started

In [None]:
%pip install -qq openai

from openai import OpenAI

# Create the OpenAI client used to call the API
client = OpenAI()

## Embedding

Convert your text into an embedding vector. An embedding vector is a vector of floating point values. Embeddings translate human language into a mathematical form that AI models can understand. When text is converted to these numerical vectors, documents or pieces of text with similar meanings end up close to each other in the vector space, even if they use different words to express the same ideas. This enables machines to understand semantic relationships beyond simple keyword matching.

By default, the length of the embedding vector is 1,536 for `text-embedding-3-small` or 3,072 for `text-embedding-3-large`. You can reduce the number of dimensions by passing the `dimensions` parameter, without the embedding losing its concept-representing properties (from [What are Embeddings](https://platform.openai.com/docs/guides/embeddings/what-are-embeddings)).

In [None]:
def get_embedding(text, model="text-embedding-3-small"):
    # Replace newlines with spaces before embedding
    text = text.replace("\n", " ")
    response = client.embeddings.create(input=[text], model=model)
    return response.data[0].embedding

print(get_embedding("Hello, World!"))
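
As a sketch of the `dimensions` parameter described above, the variant below requests a shorter vector. The function name `get_short_embedding` and the value 256 are illustrative choices, not part of the original notebook. The client is passed in as an argument so that the function can be defined without an API key.

```python
def get_short_embedding(client, text, dimensions=256,
                        model="text-embedding-3-small"):
    """Request an embedding with a reduced number of dimensions."""
    text = text.replace("\n", " ")
    # `dimensions` asks the API for a shorter vector than the model default
    response = client.embeddings.create(
        input=[text], model=model, dimensions=dimensions
    )
    return response.data[0].embedding
```

With the `client` created earlier, `get_short_embedding(client, "Hello, World!")` should return a list of 256 floating point values instead of 1,536.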

Cosine similarity is a simple way to compare two vectors: it measures the cosine of the angle between them. It ranges from -1 (pointing in opposite directions) to 1 (pointing in the same direction), with 0 indicating orthogonality (no relationship). When comparing embedding vectors:

* A value close to 1 means the texts have very similar semantic meaning
* A value close to 0 means the texts are unrelated
* A value close to -1 is rare with embeddings and would indicate opposite meanings

The function below calculates this similarity measure, which we will use to determine which content is most relevant to a query.

In [None]:
%pip install -qq scipy

from scipy import spatial

def cosine_similarity(vector1, vector2):
    return 1 - spatial.distance.cosine(vector1, vector2)
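
To make the range of values concrete, here is a minimal sketch that computes cosine similarity directly from its definition (the dot product divided by the product of the vector lengths) on tiny hand-made vectors. The name `cosine_similarity_plain` is illustrative, and the function assumes neither vector is all zeros.

```python
import math


def cosine_similarity_plain(vector1, vector2):
    """Cosine similarity from its definition: dot(a, b) / (|a| * |b|)."""
    dot = sum(a * b for a, b in zip(vector1, vector2))
    norm1 = math.sqrt(sum(a * a for a in vector1))
    norm2 = math.sqrt(sum(b * b for b in vector2))
    # Assumes both vectors are non-zero, otherwise this divides by zero
    return dot / (norm1 * norm2)


print(cosine_similarity_plain([1, 0], [2, 0]))   # same direction: 1.0
print(cosine_similarity_plain([1, 0], [0, 3]))   # orthogonal: 0.0
print(cosine_similarity_plain([1, 0], [-1, 0]))  # opposite direction: -1.0
```

Note that the length of a vector does not affect the result; only its direction does.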

## Find Similarity

Compute embedding vectors for the user query and for each block of content, then find the content whose vector is most similar to the query vector. If the query is a question, this is where the answer is most likely to be.

Here is what happens in the code below:

1. We start with a collection of quotes (which simulates a document database)
2. The user asks a question (the prompt or query)
3. We convert both the query and all quotes into embedding vectors
4. We calculate the similarity between the query vector and each quote vector
5. We rank the quotes by similarity score
6. The quote with the highest similarity score is likely to contain the information needed to answer the query

In real-world applications, this same process might search through thousands or millions of documents to find the most relevant information.

In [None]:
quotes = [
    '"Stop worrying about the potholes in the road and enjoy the journey" - Babs Hoffman.',
    '"A journey of a thousand miles begins with a single step" - Chinese proverb',
    '"One small step for man, one giant leap for mankind." - Neil Armstrong'
]
user_query = "What's a famous saying about a long trip?"

# Encode the query
query_vector = get_embedding(user_query)

# Calculate cosine similarities between the user query and the quotes
scores = []
for quote in quotes:
    vector = get_embedding(quote)
    similarity = cosine_similarity(query_vector, vector)
    print(f'{quote[:50]}… - {similarity}')
    scores.append((similarity, quote))

# Find most similar text by sorting to find the highest similarity score
scores.sort(reverse=True)
most_similar_text = scores[0][1]

print()
print(f'The text most similar to the query:\n\t{user_query}\nis:\n\t{most_similar_text}')
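
In a larger collection you would typically embed the documents once and return several top matches per query rather than just one. The sketch below assumes the vectors have already been computed. The function `top_k` and the toy two-dimensional vectors are illustrative stand-ins, not real embeddings.

```python
import math


def top_k(query_vector, documents, k=2):
    """Return the k (score, text) pairs most similar to the query.

    `documents` is a list of (text, vector) pairs computed ahead of time.
    """
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm_a = math.sqrt(sum(x * x for x in a))
        norm_b = math.sqrt(sum(y * y for y in b))
        return dot / (norm_a * norm_b)

    scored = [(cosine(query_vector, vector), text) for text, vector in documents]
    # Sort on the score only, so ties never fall back to comparing the texts
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Toy 2-dimensional vectors standing in for real embeddings
documents = [
    ("north", [0.0, 1.0]),
    ("east", [1.0, 0.0]),
    ("north-east", [1.0, 1.0]),
]
for score, text in top_k([0.9, 0.1], documents):
    print(f"{text}: {score:.3f}")
```

Here the query points mostly east, so "east" ranks first and "north-east" second.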

Embeddings can be saved for future use, so that you do not have to recompute them for every query. For large datasets, use a vector database. Vector databases are specialized for storing and searching embedding vectors efficiently. They use Approximate Nearest Neighbor (ANN) algorithms to quickly find the vectors most similar to a given query vector without comparing against every vector in the database.
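
For a small dataset, a plain file can be enough. The following is a minimal sketch, assuming the embeddings fit in memory and are kept as a dictionary that maps each text to its vector. The helper names `save_embeddings` and `load_embeddings` are hypothetical.

```python
import json


def save_embeddings(path, embeddings_by_text):
    """Write a {text: vector} dictionary to a JSON file."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(embeddings_by_text, f)


def load_embeddings(path):
    """Read back a {text: vector} dictionary written by save_embeddings."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)
```

For example, `save_embeddings("quotes.json", {quote: get_embedding(quote) for quote in quotes})` would store the vectors for the quotes above, where `quotes.json` is an arbitrary file name.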

## Retrieval Augmented Generation (RAG)

What we are doing here is a simplified version of Retrieval Augmented Generation (RAG). We have already found the most relevant text (the quote) to our query using embedding similarity. We include this relevant text in the system message to give ChatGPT context, and also send the user's query as the user message. ChatGPT can now provide a more accurate, grounded response because it has the specific information needed.

This approach has several benefits:

* It reduces hallucinations since the model has specific facts to work with
* It makes responses more precise and relevant
* It can save tokens and cost, as you are focusing the model on relevant information
* It allows the model to work with up-to-date or domain-specific information

In [None]:
user_prompt = "What's a famous saying about a long trip? How many steps start a journey? Answer with a numeric value."

chat_completion = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": most_similar_text},
        {"role": "user", "content": user_prompt},
    ],
)

reply = chat_completion.choices[0].message.content
print(reply)
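
The cell above passes the retrieved quote as the entire system message. A common variation is to wrap the retrieved text in explicit instructions and to allow more than one passage. The sketch below shows one way to build such messages. The function `build_rag_messages` and the instruction wording are assumptions made for illustration, not a recommended prompt.

```python
def build_rag_messages(context_passages, user_query):
    """Build chat messages that place retrieved passages in the system message."""
    context = "\n".join(f"- {passage}" for passage in context_passages)
    # The instruction text is an illustrative assumption; adjust it to your use case
    system_message = (
        "Answer the question using only the context below. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}"
    )
    return [
        {"role": "system", "content": system_message},
        {"role": "user", "content": user_query},
    ]


messages = build_rag_messages(
    ['"A journey of a thousand miles begins with a single step" - Chinese proverb'],
    "How many steps start a journey?",
)
print(messages[0]["content"])
```

The resulting list can be passed as the `messages` argument of `client.chat.completions.create`, in place of the literal list used above.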