# Retrieval-Augmented Generation (RAG)

## Getting Started

To get started, create an OpenAI API account, set up billing, and generate an API key at https://platform.openai.com/. Create a file called `.env` and add the line `OPENAI_API_KEY=<your-openai-api-key>`. The `load_dotenv` function from the `python-dotenv` package reads this key into the environment.

Install the `openai` Python package with `pip install -r requirements.txt`, then run the code below to read the key.

In [None]:
from openai import OpenAI
from dotenv import load_dotenv

# Load the key from an environment variable called "OPENAI_API_KEY".
# python-dotenv (https://pypi.org/project/python-dotenv/) copies the
# variables defined in .env into the environment.
load_dotenv()
client = OpenAI()
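
A missing or empty key usually surfaces only when the first request fails, so it can help to check for it up front. The following is a minimal sketch, not part of the original notebook; `require_api_key` is a hypothetical helper name, and it only inspects the environment without contacting the API.

```python
import os

def require_api_key(env=os.environ, name="OPENAI_API_KEY"):
    """Return the key stored under `name`, or raise if it is missing or empty.

    Hypothetical helper for illustration; it does not validate the key
    against the OpenAI service.
    """
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"{name} is not set; check your .env file")
    return key
```

Calling `require_api_key()` right after `load_dotenv()` fails fast with a readable message instead of an authentication error later on.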

## Embedding

Convert your text into an embedding vector, which is a list of floating-point values. Pieces of text with similar meanings are mapped to vectors that lie close to each other, so the distance between two vectors captures some of the semantic relationship between the texts.

By default, the length of the embedding vector is 1536 for `text-embedding-3-small` or 3072 for `text-embedding-3-large`. You can reduce the dimensions of the embedding by passing the `dimensions` parameter, without the embedding losing its concept-representing properties. (From [What are Embeddings](https://platform.openai.com/docs/guides/embeddings/what-are-embeddings)).

In [None]:
def get_embedding(text, model="text-embedding-3-small"):
    text = text.replace("\n", " ")
    return client.embeddings.create(input=[text], model=model).data[0].embedding

print(get_embedding("Hello, World!"))
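
The `dimensions` parameter mentioned above can be passed in the same call. This is a minimal sketch under a few assumptions: `get_short_embedding` is a hypothetical name, the default of 256 is an arbitrary illustrative value, and the client is passed in explicitly so the function does not depend on a global.

```python
def get_short_embedding(client, text, dimensions=256,
                        model="text-embedding-3-small"):
    """Request an embedding shortened to `dimensions` values.

    `client` is expected to be an `OpenAI()` instance. The 256 default
    is an illustrative assumption, not a recommended setting.
    """
    text = text.replace("\n", " ")
    response = client.embeddings.create(
        input=[text], model=model, dimensions=dimensions
    )
    return response.data[0].embedding
```

With the client created earlier, `len(get_short_embedding(client, "Hello, World!"))` should equal the requested number of dimensions. Vectors are only comparable when they were created with the same model and the same dimensions.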

A simple cosine similarity function measures how similar two vectors are.

In [None]:
from scipy import spatial

def cosine_similarity(vector1, vector2):
    return 1 - spatial.distance.cosine(vector1, vector2)
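
To make the calculation explicit, here is a sketch of the same measure written without SciPy. Cosine similarity is the dot product of the two vectors divided by the product of their lengths. The name `cosine_similarity_plain` is an illustrative choice, and the function assumes both vectors have the same length and neither is all zeros.

```python
import math

def cosine_similarity_plain(vector1, vector2):
    """Cosine similarity: dot(a, b) / (|a| * |b|).

    Assumes equal-length, non-zero vectors; a zero vector would cause
    a division by zero.
    """
    dot = sum(a * b for a, b in zip(vector1, vector2))
    norm1 = math.sqrt(sum(a * a for a in vector1))
    norm2 = math.sqrt(sum(b * b for b in vector2))
    return dot / (norm1 * norm2)

# Three toy cases: same direction, orthogonal, and opposite direction.
print(cosine_similarity_plain([1.0, 0.0], [2.0, 0.0]))
print(cosine_similarity_plain([1.0, 0.0], [0.0, 1.0]))
print(cosine_similarity_plain([1.0, 0.0], [-1.0, 0.0]))
```

Vectors pointing the same way score 1, orthogonal vectors score 0, and opposite vectors score -1, regardless of their lengths.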

## Find Similarity

Compute embedding vectors for the user query and for each block of content, then find the content whose vector is most similar to the query vector. If the query is a question, this is where the answer is most likely to be.

In [None]:
quotes = [
    '"Stop worrying about the potholes in the road and enjoy the journey" - Babs Hoffman.',
    '"A journey of a thousand miles begins with a single step" - Chinese proverb',
    '"One small step for man, one giant leap for mankind." - Neil Armstrong'
]
user_query = "What's a famous saying about a long trip?"

# Encode the query
query_vector = get_embedding(user_query)

# Calculate cosine similarities between the user query and the quotes
scores = []
for quote in quotes:
    vector = get_embedding(quote)
    similarity = cosine_similarity(query_vector, vector)
    print(f'{quote[:25]}… - {similarity}')
    scores.append((similarity, quote))

# Find most similar text by sorting to find the highest similarity score
scores.sort(reverse=True)
most_similar_text = scores[0][1]

print()
print(f'The text most similar to the query:\n\t{user_query}\nis:\n\t{most_similar_text}')
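
When the answer may be spread over several blocks, it can be useful to keep the best few matches rather than only the first. The sketch below assumes a list of `(similarity, text)` pairs like `scores` above; `top_k` is a hypothetical helper, and the numbers in the example are made up for illustration.

```python
import heapq

def top_k(scored_texts, k=2):
    """Return the k (similarity, text) pairs with the highest similarity,
    ordered from most to least similar."""
    return heapq.nlargest(k, scored_texts, key=lambda pair: pair[0])

# Made-up scores, only to show the shape of the input and output.
example_scores = [(0.21, "quote A"), (0.57, "quote B"), (0.33, "quote C")]
for similarity, text in top_k(example_scores, k=2):
    print(f"{similarity:.2f}  {text}")
```

`heapq.nlargest` avoids sorting the full list, which matters little for three quotes but helps as the number of blocks grows.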

Embeddings can be saved for future use, which avoids recomputing them for every query. For large datasets, use a vector database, which can quickly find the stored vectors that are most similar to a given query vector.
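
For a small dataset, a plain JSON file is enough to avoid recomputing embeddings. This is a minimal sketch rather than a recommended storage format; the function names and the `{text: vector}` layout are assumptions made for illustration.

```python
import json

def save_embeddings(path, embeddings):
    """Write a {text: vector} mapping to a JSON file."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(embeddings, f)

def load_embeddings(path):
    """Read a {text: vector} mapping previously written by save_embeddings."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)
```

A typical use would be to build the mapping once, for example `{quote: get_embedding(quote) for quote in quotes}`, save it, and load it in later sessions instead of calling the API again.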

## Complete With Context Using ChatGPT

Pass only the most similar text to the model as context. This keeps the prompt short and focused, which makes the completion more efficient.

In [None]:
user_prompt = "How many steps start a journey? Answer with a numeric value"

chat_completion = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": most_similar_text},
        {"role": "user", "content": user_prompt},
    ],
)

reply = chat_completion.choices[0].message.content
print(reply)
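
The cell above passes the retrieved text as the whole system message. One possible variation, sketched here as an assumption rather than the notebook's method, wraps the context in an explicit instruction so the model knows how to use it. `build_messages` is a hypothetical helper and the instruction wording is only an example.

```python
def build_messages(context, question):
    """Build a chat message list that asks the model to answer from `context`.

    The instruction text is an illustrative assumption, not a prescribed prompt.
    """
    system = (
        "Answer the question using only the context below. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}"
    )
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": question},
    ]
```

The result can be passed as `messages=build_messages(most_similar_text, user_prompt)` in the `client.chat.completions.create` call.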