# Using `gpt-3.5-turbo` to answer questions posed in natural language, using a custom dataset and Retrieval-Augmented Generation



---
**Credit**: Adapted from this [OpenAI notebook](https://github.com/openai/openai-cookbook/blob/main/examples/Question_answering_using_embeddings.ipynb)

---

Many use cases require us to respond to user questions with relevant and accurate answers. For example, a customer support chatbot may need to provide answers to common support questions.

The GPT models have picked up a lot of general knowledge in training (remember, GPT-3's training corpus held roughly 500 billion tokens!), but we often would like the model to *use our own dataset or library* of more specific information to answer questions (e.g., we would like our customer service chatbot to consult a library of service manuals when it answers a user question). We'd expect those tailored responses to be more helpful and accurate than generic responses uninformed by our specific data.

In this notebook we will demonstrate a method for enabling `gpt-3.5-turbo` to answer questions using a library of text as a reference. We'll be using a dataset of Wikipedia articles about the 2022 Winter Olympic Games but the same approach can be used with a library of books, articles, documentation, service manuals, or much much more.

## Setup

Let's get started by installing the `openai` Python package and `tiktoken`, a package that tokenizes inputs using BPE in a manner compatible with the OpenAI models.


In [None]:
!pip install --upgrade openai
!pip install tiktoken

In [None]:
import pandas as pd  # for storing text and embeddings data


We will use `gpt-3.5-turbo` in this colab; this is the GPT variant that powered the initial release of ChatGPT and remains a potential backend for that service.




---
We will be using **pre-trained contextual embeddings** as well. For that, we will
use the `text-embedding-ada-002` model ([link](https://openai.com/blog/new-and-improved-embedding-model)).


---

Finally, let's set the OpenAI API key. You can get yours [here](https://platform.openai.com/account/api-keys), and then enter it under `OPENAI_API_KEY` in your Colab secrets. We will create an OpenAI API client using this key.





In [None]:
# models
EMBEDDING_MODEL = "text-embedding-ada-002"
GPT_MODEL = "gpt-3.5-turbo"

import os # for getting API token from env variable OPENAI_API_KEY
from google.colab import userdata, drive

os.environ["OPENAI_API_KEY"] = userdata.get('OPENAI_API_KEY')

# client for OpenAI API
from openai import OpenAI # for calling the OpenAI API
client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY", "<your OpenAI API key if not set as env var>"))

## Prompting without custom data

Before we try anything fancy, let's simply ask `gpt-3.5-turbo` a question on the 2020 Summer Olympics and see how it responds.

First, we prepare the prompt.

In [None]:
query = 'Which athlete won the gold medal in the high jump at the 2020 Summer Olympics?'

Next, we send the request to the model using the chat completions endpoint of the OpenAI API. [Documentation](https://platform.openai.com/docs/api-reference/completions/create?lang=python).


In [None]:
response = client.chat.completions.create(
    messages=[
        {'role': 'system', 'content': 'You answer questions about the Olympics.'},
        {'role': 'user', 'content': query},
    ],
    model=GPT_MODEL,
    temperature=0,
)

print(response.choices[0].message.content)

Mutaz Essa Barshim of Qatar and Gianmarco Tamberi of Italy both won the gold medal in the high jump at the 2020 Summer Olympics. They decided to share the gold medal rather than participate in a jump-off.


We can check that this answer is in fact correct [here](https://en.wikipedia.org/wiki/Athletics_at_the_2020_Summer_Olympics_%E2%80%93_Men%27s_high_jump). Impressive. But now let's change the query around and ask something about the 2022 Winter Olympics.

In [None]:
query = 'Which athletes won the gold medal in curling at the 2022 Winter Olympics?'

response = client.chat.completions.create(
    messages=[
        {'role': 'system', 'content': 'You answer questions about the Winter Olympics.'},
        {'role': 'user', 'content': query},
    ],
    model=GPT_MODEL,
    temperature=0,
)

print(response.choices[0].message.content)

The gold medal in curling at the 2022 Winter Olympics was won by the Swedish men's team and the South Korean women's team.


If we fact-check this, it turns out that ....
<br> <br> <br> <br> <br> <br> <br> <br> <br> <br> <br> <br> <br> <br> <br> <br> <br> <br> <br> <br> <br> <br> <br> <br>













... Sweden did win the men's gold and the South Korean team did participate, but **Great Britain won the women's gold**.


<br>

<br>



Sounds like `gpt-3.5-turbo` could use some help. 😆


### "Engineering" the prompt to reduce hallucinations



One simple thing we can try right off the bat is to tell `gpt-3.5-turbo` to say "I don't know" if it doesn't know, rather than make stuff up (i.e., "hallucinate").


How? By asking nicely? 😀 Well, almost.



By asking **explicitly!**

Let's modify our prompt as follows.


In [None]:
query = f"""Answer the question as truthfully as possible, \
and if you're unsure of the answer, say "Sorry, I don't know".

Question: Which athletes won the gold medal in curling at the 2022 Winter Olympics?"""

Note the explicit extra instruction in the above prompt: *as truthfully as possible, and if you're unsure of the answer, say "Sorry, I don't know"*
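
Since we will reuse this instruction with other questions, it can help to keep it in one place. Here is a minimal sketch of that idea. The names `IDK_INSTRUCTION` and `with_idk_instruction` are hypothetical helpers introduced only for illustration; they are not part of the OpenAI library or of the rest of this notebook.

```python
# Hypothetical constant: the explicit instruction used in the prompt above.
IDK_INSTRUCTION = (
    "Answer the question as truthfully as possible, "
    "and if you're unsure of the answer, say \"Sorry, I don't know\"."
)


def with_idk_instruction(question: str) -> str:
    """Prefix a question with the explicit 'say you don't know' instruction."""
    return f"{IDK_INSTRUCTION}\n\nQuestion: {question}"


example = with_idk_instruction(
    "Which athletes won the gold medal in curling at the 2022 Winter Olympics?"
)
print(example)
```

The string this produces has the same layout as the `query` defined above, so it could be passed as the `user` message in the same way.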

In [None]:
response = client.chat.completions.create(
    messages=[
        {'role': 'system', 'content': 'You answer questions about the 2022 Winter Olympics.'},
        {'role': 'user', 'content': query},
    ],
    model=GPT_MODEL,
    temperature=0,
)

print(response.choices[0].message.content)

Sorry, I don't know.


Wow, it worked. The model is being humble and honest 👀.

It is an interesting question why the model knew the high jump answer but not this one. Let's check the [cutoff date](https://platform.openai.com/docs/models/gpt-3-5-turbo) for the training data.

## Using custom data

To help the model answer a question, we can provide relevant custom data **in the prompt itself**. This extra information we provide in the prompt is referred to as **context**.



### Manually enriching the prompt with custom data

We will first show how to do this by ***manually*** finding and adding information (that's relevant to the question) to the prompt.

First, we will use the Wikipedia article for the 2022 Winter Olympics curling event as context.

Second, we will **explicitly tell the model to make use of the provided context**.

There's a deeper lesson here: **telling LLMs explicitly what you want them to do often helps.**

In [None]:
# text copied and pasted from: https://en.wikipedia.org/wiki/Curling_at_the_2022_Winter_Olympics
# Only the portion of the article up until the medalists is included.

wikipedia_article_on_curling = """Curling at the 2022 Winter Olympics

Article
Talk
Read
Edit
View history
From Wikipedia, the free encyclopedia
Curling
at the XXIV Olympic Winter Games
Curling pictogram.svg
Curling pictogram
Venue	Beijing National Aquatics Centre
Dates	2–20 February 2022
No. of events	3 (1 men, 1 women, 1 mixed)
Competitors	114 from 14 nations
← 20182026 →
Men's curling
at the XXIV Olympic Winter Games
Medalists
1st place, gold medalist(s)		 Sweden
2nd place, silver medalist(s)		 Great Britain
3rd place, bronze medalist(s)		 Canada
Women's curling
at the XXIV Olympic Winter Games
Medalists
1st place, gold medalist(s)		 Great Britain
2nd place, silver medalist(s)		 Japan
3rd place, bronze medalist(s)		 Sweden
Mixed doubles's curling
at the XXIV Olympic Winter Games
Medalists
1st place, gold medalist(s)		 Italy
2nd place, silver medalist(s)		 Norway
3rd place, bronze medalist(s)		 Sweden
Curling at the
2022 Winter Olympics
Curling pictogram.svg
Qualification
Statistics
Tournament
Men
Women
Mixed doubles
vte
The curling competitions of the 2022 Winter Olympics were held at the \
Beijing National Aquatics Centre, one of the Olympic Green venues. Curling \
competitions were scheduled for every day of the games, from February 2 to \
February 20.[1] This was the eighth time that curling was part of the Olympic \
program.

In each of the men's, women's, and mixed doubles competitions, 10 nations \
competed. The mixed doubles competition was expanded for its second appearance \
in the Olympics.[2] A total of 120 quota spots (60 per sex) were distributed to \
the sport of curling, an increase of four from the 2018 Winter Olympics.[3] A \
total of 3 events were contested, one for men, one for women, and one mixed.[4]

Qualification
Main article: Curling at the 2022 Winter Olympics – Qualification
Qualification to the Men's and Women's curling tournaments at the Winter \
Olympics was determined through two methods (in addition to the host nation).\
 Nations qualified teams by placing in the top six at the 2021 World Curling \
 Championships. Teams could also qualify through Olympic qualification events \
 which were held in 2021. Six nations qualified via World Championship \
 qualification placement, while three nations qualified through qualification \
 events. In men's and women's play, a host will be selected for the Olympic \
 Qualification Event (OQE). They would be joined by the teams which competed \
 at the 2021 World Championships but did not qualify for the Olympics, and \
 two qualifiers from the Pre-Olympic Qualification Event (Pre-OQE). The \
 Pre-OQE was open to all member associations.[5]

For the mixed doubles competition in 2022, the tournament field was expanded \
from eight competitor nations to ten.[2] The top seven ranked teams at the \
2021 World Mixed Doubles Curling Championship qualified, along with two teams \
from the Olympic Qualification Event (OQE) – Mixed Doubles. This OQE was open \
to a nominated host and the fifteen nations with the highest qualification \
points not already qualified to the Olympics. As the host nation, China \
qualified teams automatically, thus making a total of ten teams per event \
in the curling tournaments.[6]

Summary
Nations	Men	Women	Mixed doubles	Athletes
 Australia			Yes	2
 Canada	Yes	Yes	Yes	12
 China	Yes	Yes	Yes	12
 Czech Republic			Yes	2
 Denmark	Yes	Yes		10
 Great Britain	Yes	Yes	Yes	10
 Italy	Yes		Yes	6
 Japan		Yes		5
 Norway	Yes		Yes	6
 ROC	Yes	Yes		10
 South Korea		Yes		5
 Sweden	Yes	Yes	Yes	11
 Switzerland	Yes	Yes	Yes	12
 United States	Yes	Yes	Yes	11
Total: 14 NOCs	10	10	10	114
Competition schedule

The Beijing National Aquatics Centre served as the venue of the curling \
competitions.
Curling competitions started two days before the Opening Ceremony and finished \
on the last day of the games, meaning the sport was the only one to have had a \
competition every day of the games. The following was the competition schedule \
for the curling competitions:

RR	Round robin	SF	Semifinals	B	3rd place play-off	F	Final
Date
Event
Wed 2	Thu 3	Fri 4	Sat 5	Sun 6	Mon 7	Tue 8	Wed 9	Thu 10	Fri 11	Sat 12	Sun 13	\
Mon 14	Tue 15	Wed 16	Thu 17	Fri 18	Sat 19	Sun 20
Men's tournament								RR	RR	RR	RR	RR	RR	RR	RR	RR	SF	B	F
Women's tournament									RR	RR	RR	RR	RR	RR	RR	RR	SF	B	F
Mixed doubles	RR	RR	RR	RR	RR	RR	SF	B	F
Medal summary
Medal table
Rank	Nation	Gold	Silver	Bronze	Total
1	 Great Britain	1	1	0	2
2	 Sweden	1	0	2	3
3	 Italy	1	0	0	1
4	 Japan	0	1	0	1
 Norway	0	1	0	1
6	 Canada	0	0	1	1
Totals (6 entries)	3	3	3	9
Medalists
Event	Gold	Silver	Bronze
Men
details	 Sweden
Niklas Edin
Oskar Eriksson
Rasmus Wranå
Christoffer Sundgren
Daniel Magnusson	 Great Britain
Bruce Mouat
Grant Hardie
Bobby Lammie
Hammy McMillan Jr.
Ross Whyte	 Canada
Brad Gushue
Mark Nichols
Brett Gallant
Geoff Walker
Marc Kennedy
Women
details	 Great Britain
Eve Muirhead
Vicky Wright
Jennifer Dodds
Hailey Duff
Mili Smith	 Japan
Satsuki Fujisawa
Chinami Yoshida
Yumi Suzuki
Yurika Yoshida
Kotomi Ishizaki	 Sweden
Anna Hasselborg
Sara McManus
Agnes Knochenhauer
Sofia Mabergs
Johanna Heldin
Mixed doubles
details	 Italy
Stefania Constantini
Amos Mosaner	 Norway
Kristin Skaslien
Magnus Nedregotten	 Sweden
Almida de Val
Oskar Eriksson
"""

In [None]:
query = f"""Use the below article on the 2022 Winter Olympics to answer the \
subsequent question. Answer the question as truthfully as possible, and if \
you're unsure of the answer, say "Sorry, I don't know".

Article:
```
{wikipedia_article_on_curling}
```

Question: Which teams won the gold medal in curling at the 2022 Winter Olympics?"""

print(query)

Take a moment to notice what the prompt has grown to.


OK, let's run it.

In [None]:
response = client.chat.completions.create(
    messages=[
        {'role': 'system', 'content': 'You answer questions about the Olympics.'},
        {'role': 'user', 'content': query},
    ],
    model=GPT_MODEL,
    temperature=0,
)

print(response.choices[0].message.content)

The teams that won the gold medal in curling at the 2022 Winter Olympics were:

- Men's Curling: Sweden
- Women's Curling: Great Britain
- Mixed Doubles Curling: Italy


Nicely done, `gpt-3.5-turbo`!

---

But maybe it wasn't super hard since the answer is literally in the context we provided.


Let's make it a bit harder.


I noticed that Oskar Eriksson actually won two medals in the event...which tempts me to ask whether any athlete won multiple medals.

Let's try it.

In [None]:
query = f"""Use the below article on the 2022 Winter Olympics to answer \
the subsequent question. Answer the question as truthfully as possible, \
and if you're unsure of the answer, say "Sorry, I don't know"

Article:
```
{wikipedia_article_on_curling}
```

Question: Did any athlete win multiple medals in curling at the 2022 Winter \
Olympics?"""

Notice that the question has changed. Everything else is unchanged.
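
Because only the question varies between these prompts, the prompt can be treated as a template. The sketch below shows one way to do that. `build_context_prompt` is a hypothetical helper (not something defined elsewhere in this notebook), and the toy article is made up to keep the example short.

```python
def build_context_prompt(article: str, question: str) -> str:
    """Assemble instruction + article + question, mirroring the prompt layout above."""
    instruction = (
        "Use the below article on the 2022 Winter Olympics to answer the "
        "subsequent question. Answer the question as truthfully as possible, "
        "and if you're unsure of the answer, say \"Sorry, I don't know\"."
    )
    fence = "`" * 3  # the triple-backtick delimiter placed around the article
    return (
        f"{instruction}\n\nArticle:\n{fence}\n{article}\n{fence}"
        f"\n\nQuestion: {question}"
    )


# Toy stand-in for the real article text (an assumption for illustration).
toy_article = "Women's curling gold: Great Britain."
q1 = build_context_prompt(toy_article, "Which team won the women's gold?")
q2 = build_context_prompt(toy_article, "Did any athlete win multiple medals?")

# Only the final question line differs between the two prompts.
print(q1.splitlines()[:-1] == q2.splitlines()[:-1])  # prints True
```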

In [None]:
response = client.chat.completions.create(
    messages=[
        {'role': 'system', 'content': 'You answer questions about the 2022 Winter Olympics.'},
        {'role': 'user', 'content': query},
    ],
    model=GPT_MODEL,
    temperature=0,
)

print(response.choices[0].message.content)

Yes, Oskar Eriksson from Sweden won multiple medals in curling at the 2022 Winter Olympics. He won a gold medal in the men's event and a bronze medal in the mixed doubles event.


WHOAH!!!!

👏 👍

Google cannot do this. In fact, poor Oskar doesn't show up anywhere on the results page summary.



---




### RETRIEVAL AUGMENTED GENERATION: *Automatically* enriching the prompt with custom data





**Manually** adding extra information into the prompt obviously doesn't scale. So, we will now show how to **automatically** enrich the prompt with custom relevant data.

First, a caveat: we typically can't include **all** the custom data in the prompt, for an important reason.

The prompt for every model has a limit (called the **context window**) on how many tokens you can send in and get out. For `gpt-3.5-turbo`,  the context window is 16,385 tokens ([link](https://platform.openai.com/docs/models/gpt-3-5-turbo)).

Note that the context window includes both the prompt and the response - **together**, they can't exceed 16,385 tokens. We will get deeper into this a bit later but for now, understand this is one key reason we can't include ALL data in the prompt. Another reason is expense. OpenAI charges by the token and these charges can easily add up.

(BTW, GPT-4's context window is way bigger - it ranges up to 128K tokens, depending on the particular GPT-4 model)
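
To make the arithmetic concrete, here is a small sketch of how the shared limit splits between prompt and response. The function name `prompt_token_budget` and the 500-token response reservation are assumptions chosen for illustration; only the 16,385 figure comes from the text above.

```python
CONTEXT_WINDOW = 16_385  # gpt-3.5-turbo limit quoted above (prompt + response)


def prompt_token_budget(max_response_tokens: int,
                        context_window: int = CONTEXT_WINDOW) -> int:
    """Tokens left for the prompt after reserving room for the response."""
    if not 0 <= max_response_tokens <= context_window:
        raise ValueError("max_response_tokens must be between 0 and the context window")
    return context_window - max_response_tokens


# Reserving 500 tokens for the answer (an illustrative value).
budget = prompt_token_budget(max_response_tokens=500)
print(budget)  # 15885
```

Whatever we reserve for the answer comes straight out of the space available for instructions, context, and the question.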

If we can't include all the custom data, the logical thing to do is to only include data that's **relevant** to the question.

How can we measure the relevance between a question and a piece of (our custom) data?

Using pretrained contextual embeddings!



---




This is our overall process.



**RETRIEVAL AUGMENTED GENERATION**

**One-time setup**
* Preprocess the custom dataset by splitting it into 'sections'
* We calculate an embedding vector for each section using the `text-embedding-ada-002` model and store it somewhere handy


**Each time we receive a question, we do this:**
* We calculate an embedding vector for the question (again using the same `text-embedding-ada-002` model)
* For each section in our custom dataset, we calculate the *cosine similarity* between that section's embedding vector and the question's embedding vector (this is the dot product of the two vectors after each has been scaled to unit length)
* We rank the sections from most-cosine-similar to the question to least-cosine-similar
* Starting from the most-cosine-similar section, we include as many sections in the prompt as can fit into the context window
* Send the prompt into `gpt-3.5-turbo`.

![RAG](https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4063347e-8920-40c6-86b3-c520084b303c_1272x998.jpeg)

Credit for 👆image: https://magazine.sebastianraschka.com/p/finetuning-large-language-models
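
Before we build the real thing, the per-question steps listed above can be sketched end to end on toy data. Everything here is made up for illustration: the three-dimensional "embeddings", the token counts, and the helper name `select_sections`. Real embeddings come from the embedding model, and real token counts from a tokenizer.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def select_sections(question_vec, sections, token_budget):
    """Rank sections by similarity, then greedily keep those that fit the budget.

    `sections` is a list of (text, embedding, n_tokens) tuples.
    """
    ranked = sorted(
        sections,
        key=lambda s: cosine_similarity(question_vec, s[1]),
        reverse=True,
    )
    chosen, used = [], 0
    for text, _, n_tokens in ranked:
        if used + n_tokens > token_budget:
            break  # stop at the first section that no longer fits
        chosen.append(text)
        used += n_tokens
    return chosen


# Toy data: (text, made-up embedding, made-up token count).
toy_sections = [
    ("curling medalists", [0.9, 0.1, 0.0], 40),
    ("ice hockey qualification", [0.1, 0.9, 0.0], 30),
    ("curling schedule", [0.7, 0.2, 0.1], 50),
]
question_vec = [1.0, 0.0, 0.0]
selected = select_sections(question_vec, toy_sections, token_budget=100)
print(selected)  # ['curling medalists', 'curling schedule']
```

The two curling sections rank highest and together use 90 of the 100 tokens; the hockey section would push the total to 120, so it is left out.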

#### One-time setup

We first need to break up the custom dataset into "sections".

Sections should be large enough to contain enough information to answer a question; but small enough to fit one or several into the `gpt-3.5-turbo` prompt.

Approximately a paragraph of text is usually a good length, but you should experiment for your particular use case. In this example, Wikipedia articles are already organized under section headers, so we will use these to define our sections. This preprocessing (for a related dataset) has already been done in [this notebook](https://github.com/openai/openai-cookbook/blob/main/examples/fine-tuned_qa/olympics-1-collect-data.ipynb), so we will load the results and use them.
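
To give a feel for what header-based sectioning involves, here is a rough sketch that splits wikitext at `==Heading==` lines. This is not the preprocessing used in the linked notebook; `split_by_headings` and the sample text are hypothetical, and a real pipeline would also need to handle templates, tables, and very long sections.

```python
import re

# Matches wikitext heading lines such as "==Medal summary==" or "===Medal table===".
HEADING = re.compile(r"^(={2,})\s*(.+?)\s*\1\s*$")


def split_by_headings(title: str, wikitext: str) -> list[tuple[str, str]]:
    """Split wikitext into (heading, body) sections at heading lines.

    Text before the first heading is filed under the article title.
    Headings with no body text of their own are skipped.
    """
    sections = []
    heading, body = title, []
    for line in wikitext.splitlines():
        match = HEADING.match(line)
        if match:
            text = "\n".join(body).strip()
            if text:
                sections.append((heading, text))
            heading, body = match.group(2), []
        else:
            body.append(line)
    text = "\n".join(body).strip()
    if text:
        sections.append((heading, text))
    return sections


# Made-up sample text for illustration.
sample = """Curling was held in Beijing.
==Qualification==
Ten nations competed in each event.
==Medal summary==
===Medal table===
Great Britain, Sweden and Italy each won one gold."""

sections = split_by_headings("Curling at the 2022 Winter Olympics", sample)
for heading, text in sections:
    print(heading, "->", text)
```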

In [None]:
# OpenAI has hosted the processed dataset, so we can download it directly without having to recreate it.
# This dataset has already been split into 'chunks' (apparently one row for each section of the Wikipedia page)
# and a contextual embedding for each chunk has been computed.
# This file is ~200 MB, so may take a minute depending on your connection speed

embeddings_path = "https://cdn.openai.com/API/examples/data/winter_olympics_2022.csv"

df = pd.read_csv(embeddings_path)

import ast  # for converting embeddings saved as strings to arrays
df['embedding'] = df['embedding'].apply(ast.literal_eval)

Let's print out 5 randomly chosen chunks.

In [None]:
pd.set_option('display.max_colwidth', 500)
df[['text']].sample(5)

Unnamed: 0,text
5324,"Russian Olympic Committee athletes at the 2022 Winter Olympics\n\n==Curling==\n\n===Men's tournament===\n\n{{main|Curling at the 2022 Winter Olympics – Men's tournament}}\n\nRussia has qualified their men's team (five athletes), by finishing in the top six teams in the [[2021 World Men's Curling Championship]].\n\n{{#lst:Curling at the 2022 Winter Olympics – Men's tournament|Standings}}\n;Round robin\nRussia had a [[Bye (sports)|bye]] in draws 5, 7 and 12.\n{{Col-float-begin|style=width:45em..."
982,"Freestyle skiing at the 2022 Winter Olympics\n\n==Medal summary==\n\n===Women's events===\n\n{| {{MedalistTable|type=Event|columns=2}}\n|-style=""vertical-align: top;""\n| Aerials<br />{{DetailsLink|Freestyle skiing at the 2022 Winter Olympics – Women's aerials}}\n|{{flagIOCmedalist|[[Xu Mengtao]]|CHN|2022 Winter}} || 108.61\n|{{flagIOCmedalist|[[Hanna Huskova]]|BLR|2022 Winter}} || 107.95\n|{{flagIOCmedalist|[[Megan Nick]]|USA|2022 Winter}} || 93.76\n|-style=""vertical-align: top;""\n| Big air<..."
4750,"Brandon Frazier\n\n== Career ==\n\n=== 2020–2021 season: New partnership, first Grand Prix title, and second national title ===\n\nOn April 1, Frazier announced that he was teaming up with [[Alexa Knierim]], whose husband and former partner [[Chris Knierim]] had opted to retire.<ref name=knierim-frazier/> The new pair started skating together in May 2020 due to restrictions caused by the [[COVID-19 pandemic]]. They began training in [[Irvine, California]], at [[Great Park Ice & FivePoint Are..."
931,"Alexander Bolshunov\n\n==Career==\n\n===2019–20: Tour de Ski champion, World Cup overall winner===\n\nBolshunov started the [[2019–20 FIS Cross-Country World Cup]] by participating in the mini-event [[2019 Nordic Opening]], where he was positioned fifth in the overall ranking. He won the next stage in [[Lillehammer]] for the first time in the 30&nbsp;km skiathlon classic and freestyle event. Bolshunov entered the [[2019–20 Tour de Ski]] by reaching third place in 15&nbsp;km mass start freest..."
3069,"Ice hockey at the 2022 Winter Olympics – Men's qualification\n\n==Final qualification==\n\n===Group E===\n\n| stadium = [[Arena Riga]], [[Riga]]\n| attendance = 197\n| shots1 = 22\n| shots2 = 23\n| penalties1 = 6\n| penalties2 = 8\n}}\n{{Ice hockey box\n| bg = \n| date = 27 August 2021\n| time = 20:00\n| team1 = '''{{ih-rt|LAT}}'''\n| team2 = {{ih|HUN}}\n| score = 9–0\n| periods = (1–0, 0–0, 8–0)\n| reference = https://stats.iih..."


Next, we define a function to calculate the embedding using the `text-embedding-ada-002` model, given a piece of text. The API call is simple (see below). [Link](https://openai.com/blog/new-and-improved-embedding-model).

In [None]:
def get_embedding(text: str, model: str=EMBEDDING_MODEL) -> list[float]:
    result = client.embeddings.create(
      model=model,           # which embedding model to use (defaults to EMBEDDING_MODEL)
      input=text,            # feed in the text for which you want to calc the embedding
    )
    return result.data[0].embedding

Let's try it on "HODL is amazing!!" 😃

In [None]:
e = get_embedding("HODL is amazing!!")

Let's see how long the embedding vector is.

In [None]:
len(e)

1536

In [None]:
f = get_embedding("HODL is incredible!!")

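Let's calculate the cosine similarity between the two embeddings. The `scipy.spatial.distance.cosine` function is handy here; it returns the cosine *distance*, so we subtract it from 1 to get the similarity.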

In [None]:
from scipy import spatial  # for calculating cosine similarities for search

1-spatial.distance.cosine(e, f)

0.9934264140022533
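
For intuition about what this number means, here is cosine similarity written out by hand in plain Python, using two small made-up vectors. It also shows why cosine similarity is "more or less" a dot product: once both vectors are scaled to unit length, the two are the same. (OpenAI's documentation describes its embeddings as normalized to length 1, in which case a plain dot product would give the same ranking; treat that as something to verify for the model you use.)

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def norm(a):
    """Euclidean length of a vector."""
    return math.sqrt(dot(a, a))


def cosine_similarity(a, b):
    """Dot product divided by the product of the vector lengths."""
    return dot(a, b) / (norm(a) * norm(b))


def normalize(a):
    """Scale a vector to unit length."""
    n = norm(a)
    return [x / n for x in a]


# Toy vectors chosen so the arithmetic is easy to follow.
u, v = [3.0, 4.0], [4.0, 3.0]
print(cosine_similarity(u, v))          # 24 / (5 * 5) = 0.96
print(dot(normalize(u), normalize(v)))  # same value, up to floating-point rounding
```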

Given a dataframe like `df` with a column of text chunks, we can use the `get_embedding` function to calculate the embeddings for all the text chunks in the column.

In [None]:
def compute_doc_embeddings(df: pd.DataFrame) -> dict[int, list[float]]:
    """
    Create an embedding for each row in the dataframe using the OpenAI Embeddings API.

    Return a dictionary that maps each row's index in the dataframe to that row's embedding vector.
    """
    return {
        idx: get_embedding(r.text) for idx, r in df.iterrows()
    }

To calculate the embeddings from scratch, uncomment the below line and run. Warning - it will take some time!

In [None]:
#document_embeddings = compute_doc_embeddings(df)


But happily for us, OpenAI has already calculated the embeddings so we don't have to; in fact, they are available in the `embedding` column of the dataframe `df` we downloaded.

In [None]:
df.sample(5)

Unnamed: 0,text,embedding
1363,"Oslo bid for the 2022 Winter Olympics\n\n{{Olympic bid|2022|Winter|\n| Paralympics = yes\n| logo = File:2022 Oslo Olympic bid logo.svg\n| logo-size = 200px\n| fullname = [[Oslo]], [[Norway]]\n| chair = [[Stian Berger Røsland]] (President)<br /> Eli Grimsby (CEO)\n| committee = [[Norwegian Olympic and Paralympic Committee and Confederation of Sports]] (NOR)\n| history = Hosted the [[1952 Winter Olympics]]\n}}\n\n'''Oslo 2022 Winter Olympics''' was a campaign by the private organization [[Nor...","[-0.011376652866601944, -0.013184035196900368, -0.020387277007102966, 0.004955514799803495, -0.015931256115436554, 0.021780604496598244, -0.01991407200694084, 0.004541459959000349, -0.028313471004366875, -0.017889803275465965, 0.020794760435819626, 0.02089991606771946, -0.005530591122806072, -0.04329831898212433, -0.020965639501810074, -0.016969680786132812, 0.02557939477264881, 0.009766438975930214, 0.00032081041717901826, -0.0019010379910469055, -0.013841265812516212, -0.005507587920874357..."
3391,"Kaori Sakamoto\n\n== Skating career ==\n\n===Senior career===\n\n==== 2019–2020 season: Struggles ====\n\nSakamoto began the season at the [[2019 CS Ondrej Nepela Memorial]], where she won the silver medal, her first [[ISU Challenger Series|Challenger]] medal.\n\nBeginning on the [[2019-20 ISU Grand Prix of Figure Skating|Grand Prix]] at [[2019 Skate America]], Sakamoto placed second in the short program and fourth in the free skate after popping two of her jumps and finished the event fourt...","[-0.0020558552350848913, 0.004040032625198364, 0.016055870801210403, -0.018558084964752197, -0.014205276034772396, 0.004424487240612507, -0.009181300178170204, -0.002492439467459917, -0.022141985595226288, -0.03531770408153534, 0.018922992050647736, 0.00676379632204771, -0.008262517862021923, -0.025569496676325798, -0.011963709257543087, -0.024709360674023628, 0.018375631421804428, -0.020656295120716095, 0.010823377408087254, -0.026950927451252937, -0.007845482788980007, 0.006718183401972055..."
1652,"Estonia at the 2022 Winter Olympics\n\n==Cross-country skiing==\n\n{{main article|Cross-country skiing at the 2022 Winter Olympics|Cross-country skiing at the 2022 Winter Olympics – Qualification}}\nEstonia qualified four male and five female cross-country skiers.\n\n;Distance\n{|class=wikitable style=font-size:90%;text-align:center\n|-\n!rowspan=2|Athlete\n!rowspan=2|Event\n!colspan=2|Classical\n!colspan=2|Freestyle\n!colspan=3|Final\n|-style=""font-size: 95%""\n!Time\n!Rank\n!Time\n!Rank\n!T...","[-0.01062480453401804, -0.006469001527875662, -0.005809508264064789, -0.0009246165282092988, -0.034916702657938004, 0.043983906507492065, -0.03584463149309158, -0.00931907445192337, 0.0003173191216774285, -0.028235601261258125, 0.01973840780556202, 0.027228133752942085, -0.003701779991388321, -0.02834165096282959, -0.010492243804037571, -0.005773053504526615, 0.01927444338798523, -0.005166584625840187, -0.007324023172259331, -0.016530420631170273, -0.012069725431501865, 0.0327426940202713, 0..."
22,Aleksandr Galliamov\n\n== Competitive highlights ==\n\n''GP: [[ISU Grand Prix of Figure Skating|Grand Prix]]; CS: [[ISU Challenger Series|Challenger Series]]; JGP: [[ISU Junior Grand Prix|Junior Grand Prix]]'',"[-0.020467450842261314, 0.022396815940737724, 0.020980149507522583, -0.02320633828639984, -0.021223006770014763, 0.027294432744383812, -0.027739670127630234, -0.009478170424699783, -0.02102062478661537, -0.03823649138212204, 0.020764276385307312, 0.016811104491353035, -0.00897221826016903, -0.022923005744814873, 0.010672217234969139, -0.006381743121892214, 0.020629355683922768, -0.012109121307730675, 0.006479560863226652, -0.006948409602046013, 0.0014891858445480466, 0.009984122589230537, 0...."
2594,"Sui Wenjing\n\n==Competitive highlights==\n\n===With Han===\n\n{| class=""wikitable"" style=""text-align:center""\n|-\n! colspan=15 style=""background-color: #ffdead; "" align=""center"" | '''International'''<ref name=""isucrWSCH"" />\n|-\n! Event\n! [[2008–09 figure skating season|08–09]]\n! [[2009–10 figure skating season|09–10]]\n! [[2010–11 figure skating season|10–11]]\n! [[2011–12 figure skating season|11–12]]\n! [[2012–13 figure skating season|12–13]]\n! [[2013–14 figure skating season|13–14]]\...","[0.0005034424248151481, 0.004824342671781778, 0.030532967299222946, -0.0019635511562228203, -0.014837951399385929, 0.006294076796621084, -0.020824018865823746, -0.01059949304908514, -0.014529943466186523, -0.05453081056475639, 0.02655564621090889, 0.024988824501633644, -0.0016480102203786373, -0.009186673909425735, -0.009481290355324745, 0.009146498516201973, 0.023180950433015823, -0.025604840368032455, -0.0009432745282538235, -0.03155073523521423, -0.031229333952069283, 0.007070792373269796..."
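
If you ever do need to embed your own dataset, calling the API once per row is slow. The embeddings endpoint also accepts a list of strings as `input`, so the rows can be sent in batches. The sketch below assumes a `client` like the one we created earlier; the helper name `embed_in_batches` and the batch size of 100 are illustrative choices, not recommendations from OpenAI.

```python
def embed_in_batches(client, texts: list[str], model: str,
                     batch_size: int = 100) -> list[list[float]]:
    """Embed texts one batch at a time and return the vectors in input order.

    `batch_size` is an illustrative value, not a tuned one.
    """
    embeddings: list[list[float]] = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        response = client.embeddings.create(model=model, input=batch)
        # Sort by the index field so each vector lines up with its input text.
        ordered = sorted(response.data, key=lambda item: item.index)
        embeddings.extend(item.embedding for item in ordered)
    return embeddings
```

With the dataframe above, usage would look something like `embed_in_batches(client, df["text"].tolist(), EMBEDDING_MODEL)`. Keep in mind that each request is still subject to the model's input limits.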


So we have a custom dataset split into sections, with an embedding vector calculated for each. We also have a function that can calculate the embedding for any question.

Next we will use these embeddings to answer our users' questions.



#### Each time we receive a question

* We calculate an embedding vector for the question with the `get_embedding` function we defined above.
* For each chunk in our custom dataset, we calculate the cosine similarity between that chunk's embedding vector and the question's embedding vector
* We rank the sections from most-cosine-similar to the question to least-cosine-similar

We first define a couple of helper functions.

In [None]:
from IPython.display import display  # used below to render the retrieved strings

# search function
def strings_ranked_by_relatedness(
    query: str,
    df: pd.DataFrame,
    relatedness_fn=lambda x, y: 1 - spatial.distance.cosine(x, y),
    top_n: int = 100
) -> tuple[list[str], list[float]]:

    """Returns a list of strings and relatednesses, sorted from most related to least."""

    query_embedding = get_embedding(query)  # embed the query with the same model used for the sections

    strings_and_relatednesses = [
        (row["text"], relatedness_fn(query_embedding, row["embedding"]))
        for i, row in df.iterrows()
    ]
    strings_and_relatednesses.sort(key=lambda x: x[1], reverse=True)
    strings, relatednesses = zip(*strings_and_relatednesses)
    return strings[:top_n], relatednesses[:top_n]

Let's try this function out to see which documents it pulls up as most similar to the query string "curling gold medal".

In [None]:
strings, relatednesses = strings_ranked_by_relatedness("curling gold medal", df, top_n=5)
for string, relatedness in zip(strings, relatednesses):
    print(f"{relatedness=:.3f}")
    display(string)

relatedness=0.879


'Curling at the 2022 Winter Olympics\n\n==Medal summary==\n\n===Medal table===\n\n{{Medals table\n | caption        = \n | host           = \n | flag_template  = flagIOC\n | event          = 2022 Winter\n | team           = \n | gold_CAN = 0 | silver_CAN = 0 | bronze_CAN = 1\n | gold_ITA = 1 | silver_ITA = 0 | bronze_ITA = 0\n | gold_NOR = 0 | silver_NOR = 1 | bronze_NOR = 0\n | gold_SWE = 1 | silver_SWE = 0 | bronze_SWE = 2\n | gold_GBR = 1 | silver_GBR = 1 | bronze_GBR = 0\n | gold_JPN = 0 | silver_JPN = 1 | bronze_JPN - 0\n}}'

relatedness=0.872


"Curling at the 2022 Winter Olympics\n\n==Results summary==\n\n===Women's tournament===\n\n====Playoffs====\n\n=====Gold medal game=====\n\n''Sunday, 20 February, 9:05''\n{{#lst:Curling at the 2022 Winter Olympics – Women's tournament|GM}}\n{{Player percentages\n| team1 = {{flagIOC|JPN|2022 Winter}}\n| [[Yurika Yoshida]] | 97%\n| [[Yumi Suzuki]] | 82%\n| [[Chinami Yoshida]] | 64%\n| [[Satsuki Fujisawa]] | 69%\n| teampct1 = 78%\n| team2 = {{flagIOC|GBR|2022 Winter}}\n| [[Hailey Duff]] | 90%\n| [[Jennifer Dodds]] | 89%\n| [[Vicky Wright]] | 89%\n| [[Eve Muirhead]] | 88%\n| teampct2 = 89%\n}}"

relatedness=0.869


'Curling at the 2022 Winter Olympics\n\n==Results summary==\n\n===Mixed doubles tournament===\n\n====Playoffs====\n\n=====Gold medal game=====\n\n\'\'Tuesday, 8 February, 20:05\'\'\n{{#lst:Curling at the 2022 Winter Olympics – Mixed doubles tournament|GM}}\n{| class="wikitable"\n!colspan=4 width=400|Player percentages\n|-\n!colspan=2 width=200 style="white-space:nowrap;"| {{flagIOC|ITA|2022 Winter}}\n!colspan=2 width=200 style="white-space:nowrap;"| {{flagIOC|NOR|2022 Winter}}\n|-\n| [[Stefania Constantini]] || 83%\n| [[Kristin Skaslien]] || 70%\n|-\n| [[Amos Mosaner]] || 90%\n| [[Magnus Nedregotten]] || 69%\n|-\n| \'\'\'Total\'\'\' || 87%\n| \'\'\'Total\'\'\' || 69%\n|}'

relatedness=0.868


"Curling at the 2022 Winter Olympics\n\n==Medal summary==\n\n===Medalists===\n\n{| {{MedalistTable|type=Event|columns=1}}\n|-\n|Men<br/>{{DetailsLink|Curling at the 2022 Winter Olympics – Men's tournament}}\n|{{flagIOC|SWE|2022 Winter}}<br>[[Niklas Edin]]<br>[[Oskar Eriksson]]<br>[[Rasmus Wranå]]<br>[[Christoffer Sundgren]]<br>[[Daniel Magnusson (curler)|Daniel Magnusson]]\n|{{flagIOC|GBR|2022 Winter}}<br>[[Bruce Mouat]]<br>[[Grant Hardie]]<br>[[Bobby Lammie]]<br>[[Hammy McMillan Jr.]]<br>[[Ross Whyte]]\n|{{flagIOC|CAN|2022 Winter}}<br>[[Brad Gushue]]<br>[[Mark Nichols (curler)|Mark Nichols]]<br>[[Brett Gallant]]<br>[[Geoff Walker (curler)|Geoff Walker]]<br>[[Marc Kennedy]]\n|-\n|Women<br/>{{DetailsLink|Curling at the 2022 Winter Olympics – Women's tournament}}\n|{{flagIOC|GBR|2022 Winter}}<br>[[Eve Muirhead]]<br>[[Vicky Wright]]<br>[[Jennifer Dodds]]<br>[[Hailey Duff]]<br>[[Mili Smith]]\n|{{flagIOC|JPN|2022 Winter}}<br>[[Satsuki Fujisawa]]<br>[[Chinami Yoshida]]<br>[[Yumi Suzuki]]<br>

relatedness=0.867


"Curling at the 2022 Winter Olympics\n\n==Results summary==\n\n===Men's tournament===\n\n====Playoffs====\n\n=====Gold medal game=====\n\n''Saturday, 19 February, 14:50''\n{{#lst:Curling at the 2022 Winter Olympics – Men's tournament|GM}}\n{{Player percentages\n| team1 = {{flagIOC|GBR|2022 Winter}}\n| [[Hammy McMillan Jr.]] | 95%\n| [[Bobby Lammie]] | 80%\n| [[Grant Hardie]] | 94%\n| [[Bruce Mouat]] | 89%\n| teampct1 = 90%\n| team2 = {{flagIOC|SWE|2022 Winter}}\n| [[Christoffer Sundgren]] | 99%\n| [[Rasmus Wranå]] | 95%\n| [[Oskar Eriksson]] | 93%\n| [[Niklas Edin]] | 87%\n| teampct2 = 94%\n}}"

The top results are all sections of the Wikipedia page for curling at the 2022 Winter Olympics. Cool.

#### Starting from the most-cosine-similar section, include as many sections into the prompt as can fit into the context window


Once we've found the most relevant pieces of context, we construct a prompt by simply prepending them to the supplied query. We write a few helper functions to do just this.

In [None]:
HEADER = """
Use the below articles to answer the subsequent question. \
Answer the question as truthfully as possible, and if you're unsure \
of the answer, say "Sorry, I don't know"
"""

Since we don't want to exceed the context window, we will need to count the tokens in our prompt. We use the `tiktoken` package for this.

In [None]:
import tiktoken  # for counting tokens

def num_tokens(text: str, model: str = GPT_MODEL) -> int:
    """Return the number of tokens in a string."""
    encoding = tiktoken.encoding_for_model(model)
    return len(encoding.encode(text))

In [None]:
def query_message(
    query: str,
    df: pd.DataFrame,
    model: str,
    token_budget: int
) -> str:
    """Return a message for GPT, with relevant source texts pulled from a dataframe."""
    strings, relatednesses = strings_ranked_by_relatedness(query, df)
    introduction = HEADER
    question = f"\n\nQuestion: {query}"
    message = introduction
    for string in strings:
        # Mark the start of each potentially relevant article section
        # with the header 'Wikipedia article section:'.
        next_article = f'\n\nWikipedia article section:\n"""\n{string}\n"""'
        if (
            num_tokens(message + next_article + question, model=model)
            > token_budget
        ):
            break
        else:
            message += next_article
    return message + question

`query_message` starts with the `HEADER`, then appends the related article sections in descending order of similarity to the query, stopping as soon as the next section would push the prompt past the token budget. Below we pass in our query about 2022 curling with a token budget of 3700 tokens. We could go higher (how much higher?) but OpenAI charges by the token 😰
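The packing loop can be tried in isolation. The sketch below is a minimal, self-contained version that assumes a stand-in token counter (one token per whitespace-separated word) instead of `tiktoken`; `pack_sections` and `count_words` are hypothetical names, not part of the notebook.

```python
from typing import Callable, List


def count_words(text: str) -> int:
    """Stand-in token counter (assumption): one token per whitespace-separated word."""
    return len(text.split())


def pack_sections(
    header: str,
    sections: List[str],
    question: str,
    token_budget: int,
    count: Callable[[str], int] = count_words,
) -> str:
    """Append ranked sections until the next one would exceed the budget."""
    message = header
    for section in sections:
        candidate = f"\n\n{section}"
        # Stop at the first section that does not fit alongside the question.
        if count(message + candidate + question) > token_budget:
            break
        message += candidate
    return message + question


# Sections are assumed to be sorted from most to least related.
sections = ["alpha beta gamma", "delta epsilon", "zeta eta theta iota"]
packed = pack_sections(
    "Use the articles.", sections, "\n\nQuestion: who won?", token_budget=12
)
print(packed)
```

With a budget of 12 "tokens", the first two sections fit and the third is dropped. Note that the loop stops at the first section that does not fit, so a shorter, lower-ranked section further down the list is never considered.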



In [None]:
query = query_message(
    "Which athletes won the gold medal in curling at the 2022 Winter Olympics?",
    df, GPT_MODEL, 3700)

print(query)


Use the below articles to answer the subsequent question. Answer the question as truthfully as possible, and if you're unsure of the answer, say "Sorry, I don't know"


Wikipedia article section:
"""
List of 2022 Winter Olympics medal winners

==Curling==

{{main|Curling at the 2022 Winter Olympics}}
{|{{MedalistTable|type=Event|columns=1|width=225|labelwidth=200}}
|-valign="top"
|Men<br/>{{DetailsLink|Curling at the 2022 Winter Olympics – Men's tournament}}
|{{flagIOC|SWE|2022 Winter}}<br/>[[Niklas Edin]]<br/>[[Oskar Eriksson]]<br/>[[Rasmus Wranå]]<br/>[[Christoffer Sundgren]]<br/>[[Daniel Magnusson (curler)|Daniel Magnusson]]
|{{flagIOC|GBR|2022 Winter}}<br/>[[Bruce Mouat]]<br/>[[Grant Hardie]]<br/>[[Bobby Lammie]]<br/>[[Hammy McMillan Jr.]]<br/>[[Ross Whyte]]
|{{flagIOC|CAN|2022 Winter}}<br/>[[Brad Gushue]]<br/>[[Mark Nichols (curler)|Mark Nichols]]<br/>[[Brett Gallant]]<br/>[[Geoff Walker (curler)|Geoff Walker]]<br/>[[Marc Kennedy]]
|-valign="top"
|Women<br/>{{DetailsLink|Curling 

We have now obtained the sections that are most relevant to the question and crafted a query. As a final step, let's put it all together to get an answer to the question.


In [None]:
def ask(
    query: str,
    df: pd.DataFrame = df,
    model: str = GPT_MODEL,
    token_budget: int = 4096 - 500,
) -> str:
    """Answers a query using GPT and a dataframe of relevant texts and embeddings."""
    message = query_message(query, df, model=model, token_budget=token_budget)

    messages = [
        {"role": "system", "content": "You answer questions about the 2022 Winter Olympics."},
        {"role": "user", "content": message},
    ]

    response = client.chat.completions.create(
        model=model,
        messages=messages,
        temperature=0
    )
    response_message = response.choices[0].message.content
    return response_message
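The default `token_budget` of `4096 - 500` leaves room for the model's reply. A rough sketch of that arithmetic follows, assuming a 4,096-token context window for `gpt-3.5-turbo` (check the limit for the model you actually use) and ignoring the system message and the few tokens of overhead the chat format adds per message. `prompt_budget` is a hypothetical helper, not part of the notebook.

```python
CONTEXT_WINDOW = 4096      # assumed context window, in tokens
RESERVED_FOR_REPLY = 500   # tokens kept free for the model's answer


def prompt_budget(context_window: int, reserved_for_reply: int) -> int:
    """Return how many tokens the prompt may use, leaving room for the reply."""
    if not 0 <= reserved_for_reply < context_window:
        raise ValueError("reserved_for_reply must be in [0, context_window)")
    return context_window - reserved_for_reply


budget = prompt_budget(CONTEXT_WINDOW, RESERVED_FOR_REPLY)
print(budget)  # 3596
```

Under the same assumption, the 3700-token budget used earlier would leave 4096 - 3700 = 396 tokens for the reply.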

#### Send the query into `gpt-3.5-turbo`!

Now that we've retrieved the relevant sections and constructed our prompt, we can finally answer the user's query.

In [None]:
print(ask('Which athletes won the gold medal in curling at the 2022 Winter Olympics?'))

The athletes who won the gold medal in curling at the 2022 Winter Olympics were:

- Men's tournament: Team Italy - Stefania Constantini, Amos Mosaner, Kristin Skaslien, Magnus Nedregotten
- Women's tournament: Team Great Britain - Eve Muirhead, Vicky Wright, Jennifer Dodds, Hailey Duff, Mili Smith
- Mixed doubles tournament: Team Sweden - Almida de Val, Oskar Eriksson


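Nice! The model answers from the retrieved articles. Check the details against the sources, though: the women's entry matches the medal table above, but that table lists Sweden (Niklas Edin's team) as the men's gold medalists, and Stefania Constantini and Amos Mosaner are Italy's mixed doubles pair.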

Let's ask a question about an Olympic event that never happened!

In [None]:
print(ask('Which athletes won the gold medal in curling at the 2016 Winter Olympics?'))

Sorry, I don't know.


Good, it is trying to be humble and say "I don't know".

Let's change the header to "allow it to lie" if it wants 👀 and see if it takes the bait.

In [None]:
HEADER = """
Answer the question using the provided context.\n\nContext:\n
"""

In [None]:
print(ask('Which athletes won the gold medal in curling at the 2016 Winter Olympics?'))

The athletes who won the gold medal in curling at the 2022 Winter Olympics were from Great Britain. The team consisted of Hammy McMillan Jr., Bobby Lammie, Grant Hardie, and Bruce Mouat.


Hmm ... it is answering an irrelevant question. Removing the instructions in the header to answer `as truthfully as possible` and to say "Sorry, I don't know" when unsure changed its behavior!
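To compare instruction variants side by side without reassigning a global, one option is to pass the header in explicitly. This is only a sketch: `build_prompt`, `STRICT_HEADER`, and `LOOSE_HEADER` are hypothetical names, the section text is a placeholder, and no API call is made.

```python
from typing import List

STRICT_HEADER = (
    "Use the below articles to answer the subsequent question. "
    "Answer the question as truthfully as possible, and if you're unsure "
    'of the answer, say "Sorry, I don\'t know"'
)
LOOSE_HEADER = "Answer the question using the provided context."


def build_prompt(header: str, sections: List[str], question: str) -> str:
    """Join a header, delimited article sections, and the question into one prompt."""
    parts = [header]
    for section in sections:
        parts.append(f'Wikipedia article section:\n"""\n{section}\n"""')
    parts.append(f"Question: {question}")
    return "\n\n".join(parts)


sections = ["Curling at the 2022 Winter Olympics (placeholder section text)"]
question = "Which athletes won the gold medal in curling at the 2016 Winter Olympics?"

strict_prompt = build_prompt(STRICT_HEADER, sections, question)
loose_prompt = build_prompt(LOOSE_HEADER, sections, question)

# Only the strict prompt tells the model what to do when it is unsure.
print("I don't know" in strict_prompt)  # True
print("I don't know" in loose_prompt)   # False
```

Both prompts carry the same context and question; the only difference is whether the model is told how to respond when the context does not contain the answer.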



## Conclusion
By combining pretrained contextual embeddings and `gpt-3.5-turbo`, we have created a question-answering model using Retrieval-Augmented Generation that can answer questions in natural language using a custom dataset. It also **tries** not to make stuff up and says "I don't know" when it doesn't know the answer! **But this is not guaranteed.**

For this example we have used a dataset of Wikipedia articles, but that dataset could be replaced with books, articles, documentation, service manuals, or much much more.





---

How you can use this approach to "understand" a dense 56-page legal document:
A fun [example](https://www.youtube.com/watch?v=ih9PBGVVOO4)

---


