![workflow](./img/workflow1.png)

# Introduction
This is a demonstration of Retrieval-Augmented Generation (RAG) using the OpenAI API (`gpt-3.5`). At its core, RAG augments the original query with additional context selected from an external (vectorized) dataset, in order to enhance the response or to supply information that was not part of the training dataset. It is therefore an alternative to fine-tuning. The diagram above contrasts the traditional workflow, where user input is fed directly into the LLM and output is generated, with RAG.

There are several steps to implementing RAG that are listed below:
- __Find out where the LLM is lacking performance__: Based on defined requirements, find out where the responses of the current LLM fall short. Common causes include: (1) the dataset used for training does not contain the desired information; (2) the response does not have the desired granularity; (3) the response is biased.
- __Define the dataset__: Based on the requirements and the outcome of the previous step, select the dataset that we want to use for augmentation.
- __Clean the dataset__: Make sure that the dataset is in a format that aligns with the queries we will be asking. For example, for historic events, we want the date of each event at the beginning of the sentence that describes it (as is done in this example).
- __Vectorize the dataset__: Use an embedding model to vectorize the cleaned entries. This is done once, and in most cases we can save the resulting database for reuse.

After these steps, we have a vectorized dataset that we can use for RAG. When a user asks a question, we project the query into the embedding space, using the same embedding model that we used to create the dataset. We then select the entries from the database that are closest to the query embedding according to a distance metric (usually cosine distance). Depending on the context window size of the LLM, we augment the original query with as many of these entries as fit. The resulting augmented query is then sent to the LLM, which returns the response.
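The retrieval step described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the two-dimensional vectors and the entry labels are made up, and the helper names `cosine_similarity` and `top_k` are hypothetical and are not part of the implementation that follows.

```python
import math
from typing import List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query: List[float], entries: List[Tuple[str, List[float]]], k: int) -> List[str]:
    """Return the texts of the k entries most similar to the query."""
    ranked = sorted(entries, key=lambda e: cosine_similarity(query, e[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


# Toy 2-D "embeddings"; real embedding vectors have many more dimensions.
toy_entries = [
    ("entry pointing the same way as the query", [0.9, 0.1]),
    ("entry pointing in a different direction", [0.1, 0.9]),
    ("entry pointing the opposite way", [-1.0, 0.0]),
]
toy_query = [1.0, 0.2]

closest = top_k(toy_query, toy_entries, k=2)
print(closest)
```

The selected texts would then be placed in front of the user's question before the prompt is sent to the LLM.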

# Implementation

For this exercise we use the `gpt-3.5-turbo-instruct` model. Its training data ends in 2021, so it contains no information about the year 2023. To illustrate this, we ask the following two questions:

In [None]:
from openai import OpenAI

client = OpenAI(
    api_key="your-key"
)

COMPLETION_MODEL_NAME = "gpt-3.5-turbo-instruct"

##### Question 1

In [None]:
coronation_prompt = """
Question: "When was the last king of England coronated?"
Answer:
"""
initial_coronation_answer = client.completions.create(
    model=COMPLETION_MODEL_NAME, prompt=coronation_prompt, max_tokens=150
)
print(initial_coronation_answer.choices[0].text)


The last king of England to be coronated was King George VI on May 12, 1937.


##### Question 2

In [None]:
gpt_v_prompt = """
Question: "What is the last version of GPT?"
Answer:
"""
initial_gpt_v_answer = client.completions.create(
    model=COMPLETION_MODEL_NAME, prompt=gpt_v_prompt, max_tokens=150
)
print(initial_gpt_v_answer.choices[0].text)


The last version of GPT (Generative Pre-trained Transformer) is GPT-3 (Generative Pre-trained Transformer 3) which was released in June 2020.


Both answers are out of date: a king was crowned in 2023 (Charles III, on May 6), and the latest iteration of GPT is GPT-4, which was released in March 2023. To augment the queries, we use the Wikipedia article [2023](https://en.wikipedia.org/wiki/2023), which gives an overview of that year's events and developments.

### Preparation of dataset and embeddings

In [None]:
from dateutil.parser import parse
import pandas as pd
import requests

In [None]:
resp = requests.get(
    "https://en.wikipedia.org/w/api.php?action=query&prop=extracts&exlimit=1&titles=2023&explaintext=1&formatversion=2&format=json"
)

As a first step in cleaning the data, which the API returns as plain text (`explaintext=1`), we store it line by line in a dataframe and remove all empty lines and section headings, which start with `==`.

In [None]:
df = pd.DataFrame()
df["text"] = resp.json()["query"]["pages"][0]["extract"].split("\n")
df = df[(df["text"].str.len() > 0) & (~df["text"].str.startswith("=="))]

The following two cells further clean and prepare the data. They ensure that each entry in the dataset starts with its own date tag, followed by an en dash (`–`).

In [None]:
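from typing import Optional, Tuple


def is_date(string: str) -> Tuple[bool, Optional[str]]:
    # The text before the first en dash is the candidate date, e.g. "January 1"
    possible_date_string = string.split("–")[0].strip()
    try:
        parse(possible_date_string)
        return True, possible_date_string
    except (ValueError, OverflowError):
        # dateutil raises ParserError (a ValueError subclass) or OverflowError
        return False, None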

In [None]:
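current_date = None
cleaned_texts = []
# Build a new list instead of writing to the rows returned by iterrows(),
# which may be copies, so that the changes are not silently lost
for text in df["text"]:
    is_a_date, date = is_date(text)
    if is_a_date:
        current_date = date
    elif current_date is not None:
        # Prefix undated entries with the most recent date seen above them
        text = current_date + " – " + text
    cleaned_texts.append(text)
df = df.assign(text=cleaned_texts)

# Keep only entries that carry a date tag
df = df[df["text"].str.contains("–")]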
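To make the effect of the date tagging concrete, here is a minimal standalone sketch on three made-up lines. It is an illustration, not the notebook's code: it uses `datetime.strptime` with an assumed "Month day" format instead of `dateutil`, and the names `leading_date` and `tag_with_dates` are hypothetical.

```python
from datetime import datetime
from typing import List, Optional


def leading_date(line: str) -> Optional[str]:
    """Return the text before the first en dash if it looks like 'Month day', else None."""
    candidate = line.split("–")[0].strip()
    try:
        # A leap year is appended so that "February 29" is accepted as well.
        datetime.strptime(candidate + " 2000", "%B %d %Y")
    except ValueError:
        return None
    return candidate


def tag_with_dates(lines: List[str]) -> List[str]:
    """Prefix undated lines with the most recent date seen above them."""
    tagged = []
    current_date = None
    for line in lines:
        date = leading_date(line)
        if date is not None:
            current_date = date
            tagged.append(line)
        elif current_date is not None:
            tagged.append(current_date + " – " + line)
        else:
            tagged.append(line)
    return tagged


# Made-up lines in the same shape as the article text.
toy_lines = [
    "January 1 – First event of the day.",
    "Second event of the same day.",
    "January 2 – Another event.",
]
tagged_lines = tag_with_dates(toy_lines)
for line in tagged_lines:
    print(line)
```

After tagging, every line can be understood on its own, which matters because each line is embedded and retrieved independently.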

Now we are ready to feed the cleaned dataset into an embedding model. Once every entry has its embedding, we add the embeddings to the dataframe as a new column.

In [None]:
EMBEDDING_MODEL_NAME = "text-embedding-ada-002"
batch_size = 100
embeddings = []

In [None]:
for idx in range(0, len(df), batch_size):
    response = client.embeddings.create(
        input=df.iloc[idx : idx + batch_size]["text"].tolist(), model=EMBEDDING_MODEL_NAME
    )
    embeddings.extend([data.embedding for data in response.data])

In [None]:
df["embeddings"] = embeddings

We save the resulting dataset (text plus embeddings) for later use.

In [None]:
# index=False avoids an extra "Unnamed: 0" column when the file is read back
df.to_csv("embeddings.csv", index=False)

### Query and Augmentation 

We are now ready to make some queries with the created embeddings. We first read the saved dataset and convert the embeddings to arrays.

In [None]:
import pandas as pd
import numpy as np

In [None]:
df = pd.read_csv("embeddings.csv")
df["embeddings"] = df["embeddings"].apply(lambda x: np.fromstring(x.strip("[]"), sep=","))
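The conversion above is needed because a CSV file stores every field as text, so each embedding list comes back as a string. The following standard-library sketch shows the same round trip on one made-up row; using `json` to serialise and parse the list is an assumption made for this illustration, not what the cells above do.

```python
import csv
import io
import json

# One made-up row; a real embedding has many more values.
rows = [{"text": "toy entry", "embeddings": [0.25, -0.5, 1.0]}]

# Write: the list is serialised to its text form, e.g. "[0.25, -0.5, 1.0]".
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["text", "embeddings"])
writer.writeheader()
for row in rows:
    writer.writerow({"text": row["text"], "embeddings": json.dumps(row["embeddings"])})

# Read: every field comes back as a string and has to be parsed again.
buffer.seek(0)
restored = []
for row in csv.DictReader(buffer):
    restored.append({"text": row["text"], "embeddings": json.loads(row["embeddings"])})

print(restored)
```

For larger datasets a binary format that preserves arrays may be a better fit, but CSV keeps this example easy to inspect.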

The following cell defines two helper functions: one gets the embedding for a given query, and the other measures the distance between the query embedding and each stored embedding.

In [None]:
from typing import List
from scipy.spatial import distance


def get_embedding(text: str, model_name: str) -> List[float]:
    text = text.replace("\n", " ")
    return client.embeddings.create(input=[text], model=model_name).data[0].embedding


def distances_from_embeddings(
    query_embedding: List[float], embeddings: List[List[float]], distance_metric="cosine"
) -> List[float]:
    distance_metrics = {
        "cosine": distance.cosine,
        "L1": distance.cityblock,
        "L2": distance.euclidean,
        "inf": distance.chebyshev,
    }
    distances = [distance_metrics[distance_metric](query_embedding, embedding) for embedding in embeddings]
    return distances
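The choice of metric matters less than it may seem when the vectors have unit length. As a small worked example, assuming unit-length vectors, the squared Euclidean distance equals twice the cosine distance, so both metrics rank entries in the same order. The helpers below are written in plain Python for illustration and are not the `scipy` functions used above.

```python
import math
from typing import List


def normalize(v: List[float]) -> List[float]:
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine_distance(a: List[float], b: List[float]) -> float:
    """1 minus the cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def euclidean_distance(a: List[float], b: List[float]) -> float:
    """Straight-line (L2) distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


a = normalize([3.0, 4.0])
b = normalize([1.0, 0.0])

cos_d = cosine_distance(a, b)
euc_d = euclidean_distance(a, b)
# For unit vectors, squared Euclidean distance equals twice the cosine distance.
print(math.isclose(euc_d ** 2, 2 * cos_d))

# Cosine distance ignores vector length: the unnormalized vectors give the same value.
scaled_d = cosine_distance([3.0, 4.0], [1.0, 0.0])
print(math.isclose(scaled_d, cos_d))
```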

The function `get_rows_by_relevance` is especially important: it creates an embedding for the question and then sorts the dataframe by cosine distance to that embedding, closest entries first.

In [None]:
def get_rows_by_relevance(question: str, df: pd.DataFrame) -> pd.DataFrame:
    q_embedding = get_embedding(question, model_name=EMBEDDING_MODEL_NAME)
    df_copy = df.copy()
    df_copy["distances"] = distances_from_embeddings(
        query_embedding=q_embedding, embeddings=df_copy["embeddings"].values, distance_metric="cosine"
    )
    df_copy.sort_values("distances", ascending=True, inplace=True)
    return df_copy

In the following function, `create_prompt`, we finally put everything together. We build our custom prompt from the question and the entries closest to the query embedding, while ensuring that the prompt stays below a given token budget, which in turn must fit within the context window of the model.

In [None]:
import tiktoken


def create_prompt(question: str, df: pd.DataFrame, max_token_count: int) -> str:
    prompt_template = """
Answer the question based on the context below, and if the question
can't be answered based on the context, say "I don't know"

Context: 

{}

---

Question: {}
Answer:"""

    tokenizer = tiktoken.get_encoding("cl100k_base")
    current_token_count = len(tokenizer.encode(prompt_template)) + len(tokenizer.encode(question))
    context = []

    for text in get_rows_by_relevance(question, df)["text"].values:
        current_token_count += len(tokenizer.encode(text))
        if current_token_count < max_token_count:
            context.append(text)
        else:
            break

    return prompt_template.format("\n\n###\n\n".join(context), question)
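The budget logic in `create_prompt` is a greedy packing loop: entries are taken in order of relevance until the next one would reach the limit. Here is a minimal sketch of just that loop, with a whitespace word count standing in for the real tokenizer. The names `pack_context` and `whitespace_token_count` are hypothetical, and the `used` argument plays the role of the tokens already spent on the template and question.

```python
from typing import Callable, List


def pack_context(
    candidates: List[str], budget: int, count_tokens: Callable[[str], int], used: int = 0
) -> List[str]:
    """Greedily keep candidates, in order, while the running total stays below the budget."""
    selected = []
    for text in candidates:
        used += count_tokens(text)
        if used >= budget:
            break
        selected.append(text)
    return selected


def whitespace_token_count(text: str) -> int:
    """Crude stand-in for a tokenizer: counts whitespace-separated words."""
    return len(text.split())


# Candidates are assumed to be sorted from most to least relevant.
ranked = ["one two three", "four five", "six seven eight nine"]
selected = pack_context(ranked, budget=8, count_tokens=whitespace_token_count, used=2)
print(selected)
```

Because the loop stops at the first entry that does not fit, a long but highly relevant entry can block shorter ones behind it; skipping it instead of breaking would be a different design choice.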

Finally, we wrap the API call that retrieves the response from the LLM in the following function.

In [None]:
def answer_question(
    question: str, df: pd.DataFrame, max_prompt_tokens: int = 1500, max_answer_tokens: int = 500
) -> Tuple[str, str]:
    """
    Given a question, a dataframe containing rows of text, and the maximum
    number of tokens allowed in the prompt and in the response, return a
    tuple (answer, prompt), where the answer comes from an OpenAI Completion model

    If the API call raises an error, print it and return an empty string as the answer
    """

    prompt = create_prompt(question, df, max_prompt_tokens)

    try:
        response = client.completions.create(model=COMPLETION_MODEL_NAME, prompt=prompt, max_tokens=max_answer_tokens)
        return response.choices[0].text, prompt
    except Exception as e:
        print(e)
        return "", prompt

With the RAG mechanism in place, we can test the response of the LLM again:

##### Question 1

In [None]:
resp, prompt = answer_question("When was the last king of England coronated?", df)
print(resp)


The last coronation of a king in England was on May 6, 2023, when Charles III and Camilla were crowned as the King and Queen of the United Kingdom and the other Commonwealth realms. However, the last king to be coronated solely as the King of England was George V in 1911.


##### Question 2

In [None]:
resp, prompt = answer_question("What is the latest version of GPT?", df)
print(resp)

 GPT-4


Both responses are improved and correctly identify the date of the most recent coronation and the latest iteration of GPT.