This notebook shows how we can feed proprietary text data to GPT-3 to get answers from it. In this example, the "proprietary" text data is the 2020 Summer Olympics data from Wikipedia. Since GPT-3 was not trained on data after 2020, it can't reliably answer questions about events that took place after that time period. Currently, if we ask ChatGPT who won the men's high jump medal, it will hallucinate and provide a wrong answer.

We scrape data about the 2020 Summer Olympics from Wikipedia and feed it to GPT-3 in a systematic way to get the correct answer.

### Data Collection

Collect data about the 2020 Summer Olympics from Wikipedia.

In [1]:
import pandas as pd
import wikipedia as wiki

In [4]:
def filter_olympic_2020_titles(titles):
    titles = [title for title in titles if '2020' in title and 'olympi' in title.lower()]
    return titles

def get_wiki_page(title):
    try:
        return wiki.page(title)
    except wiki.exceptions.DisambiguationError as e:
        return wiki.page(e.options[0])
    except wiki.exceptions.PageError as e:
        return None
    
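def recursively_find_all_pages(titles, titles_so_far=None):
    # Avoid a mutable default argument: a default set() would be shared across calls.
    if titles_so_far is None:
        titles_so_far = set()
    all_pages = []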
    
    titles = list(set(titles) - titles_so_far)
    titles = filter_olympic_2020_titles(titles)
    titles_so_far.update(titles)
    
    for title in titles:
        page = get_wiki_page(title)
        if page is None:
            continue
        all_pages.append(page)
        
        new_pages = recursively_find_all_pages(page.links, titles_so_far)
        
        for pg in new_pages:
            if pg.title not in [p.title for p in all_pages]:
                all_pages.append(pg)
        titles_so_far.update(page.links)
        
    return all_pages

pages = recursively_find_all_pages(['2020 Summer Olympics'])
len(pages)

900

_Note: This took about 20 minutes to run._
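The traversal above can be tried without any network access. The following is a minimal sketch, assuming a hypothetical in-memory link graph (`TOY_LINKS`) in place of real Wikipedia pages; it uses an iterative queue rather than recursion, but applies the same title filter and the same "skip what we have already seen" rule.

```python
from collections import deque

# Hypothetical stand-in for Wikipedia: page title -> titles it links to.
TOY_LINKS = {
    "2020 Summer Olympics": [
        "Athletics at the 2020 Summer Olympics",
        "Tokyo",
        "2016 Summer Olympics",
    ],
    "Athletics at the 2020 Summer Olympics": [
        "2020 Summer Olympics",
        "Bermuda at the 2020 Summer Olympics",
    ],
    "Bermuda at the 2020 Summer Olympics": ["2020 Summer Olympics"],
}

def is_olympic_2020_title(title):
    # Same test as filter_olympic_2020_titles in the notebook.
    return "2020" in title and "olympi" in title.lower()

def crawl(start_titles, links):
    """Visit every reachable title that passes the filter, each exactly once."""
    seen = set()
    order = []
    queue = deque(start_titles)
    while queue:
        title = queue.popleft()
        if title in seen or not is_olympic_2020_title(title):
            continue
        seen.add(title)
        order.append(title)
        # Follow the links of the page we just accepted.
        queue.extend(links.get(title, []))
    return order

visited = crawl(["2020 Summer Olympics"], TOY_LINKS)
print(visited)
```

"Tokyo" and "2016 Summer Olympics" are linked from the start page but are filtered out, so their links are never followed.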

### Data Cleaning

The next step is to remove sections that are unlikely to contain useful information, and ensure that each section is no longer than the token limit (for OpenAI).

In [7]:
import re
from typing import Set
from transformers import GPT2TokenizerFast

import numpy as np
from nltk.tokenize import sent_tokenize

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2") # gpt2 was used in the code sample

def count_tokens(text: str) -> int:
    """count the number of tokens in a string"""
    return len(tokenizer.encode(text))

def reduce_long(
    long_text: str, long_text_tokens: int = 0, max_len: int = 590
) -> str:
    """
    Reduce a long text to a maximum of `max_len` tokens by potentially cutting at a sentence end
    """
    if not long_text_tokens:
        long_text_tokens = count_tokens(long_text)
    if long_text_tokens > max_len:
        sentences = sent_tokenize(long_text.replace("\n", " "))
        ntokens = 0
        for i, sentence in enumerate(sentences):
            ntokens += 1 + count_tokens(sentence)
            if ntokens > max_len:
                return ". ".join(sentences[:i][:-1]) + "."

    return long_text

discard_categories = ['See also', 'References', 'External links', 'Further reading', "Footnotes",
    "Bibliography", "Sources", "Citations", "Literature", "Footnotes", "Notes and references",
    "Photo gallery", "Works cited", "Photos", "Gallery", "Notes", "References and sources",
    "References and notes",]


def extract_sections(
    wiki_text: str,
    title: str,
    max_len: int = 1500,
    discard_categories: Set[str] = discard_categories,
) -> list:
    """
    Extract the sections of a Wikipedia page, discarding the references and other low information sections
    """
    if len(wiki_text) == 0:
        return []

    # find all headings and the corresponding contents
    headings = re.findall("==+ .* ==+", wiki_text)
    for heading in headings:
        wiki_text = wiki_text.replace(heading, "==+ !! ==+")
    contents = wiki_text.split("==+ !! ==+")
    contents = [c.strip() for c in contents]
    assert len(headings) == len(contents) - 1

    cont = contents.pop(0).strip()
    outputs = [(title, "Summary", cont, count_tokens(cont)+4)]
    
    # discard the discard categories, accounting for a tree structure
    max_level = 100
    keep_group_level = max_level
    remove_group_level = max_level
    nheadings, ncontents = [], []
    for heading, content in zip(headings, contents):
        plain_heading = " ".join(heading.split(" ")[1:-1])
        num_equals = len(heading.split(" ")[0])
        if num_equals <= keep_group_level:
            keep_group_level = max_level

        if num_equals > remove_group_level:
            if (
                num_equals <= keep_group_level
            ):
                continue
        keep_group_level = max_level
        if plain_heading in discard_categories:
            remove_group_level = num_equals
            keep_group_level = max_level
            continue
        nheadings.append(heading.replace("=", "").strip())
        ncontents.append(content)
        remove_group_level = max_level

    # count the tokens of each section
    ncontent_ntokens = [
        count_tokens(c)
        + 3
        + count_tokens(" ".join(h.split(" ")[1:-1]))
        - (1 if len(c) == 0 else 0)
        for h, c in zip(nheadings, ncontents)
    ]

    # Create a tuple of (title, section_name, content, number of tokens)
    # max_len is passed by keyword so it is not mistaken for long_text_tokens
    outputs += [(title, h, c, t) if t<max_len 
                else (title, h, reduce_long(c, max_len=max_len), count_tokens(reduce_long(c, max_len=max_len))) 
                    for h, c, t in zip(nheadings, ncontents, ncontent_ntokens)]
    
    return outputs

# Example page being processed into sections
bermuda_page = get_wiki_page('Bermuda at the 2020 Summer Olympics')
ber = extract_sections(bermuda_page.content, bermuda_page.title)

# Example section
ber[-1]

('Bermuda at the 2020 Summer Olympics',
 'Equestrian',
 "Bermuda entered one dressage rider into the Olympic competition by finishing in the top four, outside the group selection, of the individual FEI Olympic Rankings for Groups D and E (North, Central, and South America), marking the country's recurrence to the sport after an eight-year absence. The quota was later withdrawn, following an injury of Annabelle Collins' main horse Joyero and a failure to obtain minimum eligibility requirements (MER) aboard a new horse Chuppy Checker.",
 104)
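To make the heading-splitting step in `extract_sections` easier to follow, here is a small sketch on a made-up page text. It uses the same heading regex, but swaps in `re.split` for the replace-then-split approach, and it filters discarded headings with a flat check rather than the tree-aware logic above. The text and the one-item discard set are assumptions for illustration only.

```python
import re

# Made-up page text in the same "== Heading ==" style as wikipedia page content.
toy_text = """Intro paragraph about the page.

== Athletics ==
Two athletes qualified.

=== Track ===
One sprinter competed.

== References ==
1. Some citation."""

heading_pattern = "==+ .* ==+"
headings = re.findall(heading_pattern, toy_text)
bodies = [b.strip() for b in re.split(heading_pattern, toy_text)]

# The first body is the text before any heading (the "Summary").
summary = bodies[0]

sections = []
for heading, body in zip(headings, bodies[1:]):
    level = len(heading.split(" ")[0])           # number of '=' characters
    name = " ".join(heading.split(" ")[1:-1])    # heading text without the '=' runs
    sections.append((name, level, body))

# Flat filter only; the notebook's version also drops subsections of discarded headings.
discard = {"References"}
kept = [s for s in sections if s[0] not in discard]
print(summary)
print(kept)
```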

In [9]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /home/vishal/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [10]:
res = []

for page in pages:
    res += extract_sections(page.content, page.title)

df = pd.DataFrame(res, columns=["title", "heading", "content", "tokens"])
df = df[df.tokens > 40]
df = df.drop_duplicates(['title','heading'])
df = df.reset_index().drop('index',axis=1) # reset index
df.head()

Unnamed: 0,title,heading,content,tokens
0,2020 Summer Olympics,Summary,The 2020 Summer Olympics (Japanese: 2020年夏季オリン...,730
1,2020 Summer Olympics,Host city selection,The International Olympic Committee (IOC) vote...,126
2,2020 Summer Olympics,Impact of the COVID-19 pandemic,"In January 2020, concerns were raised about th...",374
3,2020 Summer Olympics,Qualifying event cancellation and postponement,Concerns about the pandemic began to affect qu...,298
4,2020 Summer Olympics,Effect on doping tests,Mandatory doping tests were being severely res...,163


_Note: This took about 10-15 minutes to run._

Let's save this dataset for future use.

In [12]:
df.to_csv('../data/olympics_sections.csv', index=False)

# Question Answering using Embeddings

In [1]:
import openai
import numpy as np
from transformers import GPT2TokenizerFast

COMPLETIONS_MODEL = 'text-davinci-002' # NOTE: this is the completions model; for embeddings, consider the newer, cheaper 'text-embedding-ada-002'



We will need an OpenAI API key to call their API. I have stored the key in a `.env` file in the parent directory.

In [122]:
from dotenv import load_dotenv
load_dotenv()

True

In [124]:
from pathlib import Path

env_path = Path('..')/'.env'
load_dotenv(dotenv_path=env_path)

True

In [125]:
import os
openai.api_key = os.getenv('OPENAI_API_KEY')

In [3]:
prompt = "Who won the 2020 Summer Olympics men's high jump?"

openai.Completion.create(
    prompt=prompt,
    temperature=0,
    max_tokens=300,
    top_p=1,
    frequency_penalty=0,
    presence_penalty=0,
    model=COMPLETIONS_MODEL)['choices'][0]['text'].strip(' \n')

"The 2020 Summer Olympics men's high jump was won by Mariusz Przybylski of Poland."

By default, GPT-3 doesn't do well on 2020 Olympics questions. It hallucinates! Note that this answer changes over time. I tried the same question on 1/26/2023 and got the following answer: "The 2020 Summer Olympics were postponed to 2021 due to the COVID-19 pandemic. As such, the men's high jump event has not been held yet, and no winner has been determined." Ah, a very convincing answer, but it's wrong of course. It seems like ChatGPT doesn't know that we are in 2023!

After prodding ChatGPT a little more, this is what it had to say: "I apologize for any confusion - you are correct that the 2020 Summer Olympics were originally scheduled to take place in 2020, but were postponed to 2021 due to the COVID-19 pandemic. As of my last training data, which was in 2021, the Olympics had not yet taken place, so I was not able to provide information about the winner of the men's high jump. To my knowledge, The 2020 Summer Olympics did take place but I don't have the information about the winner of men's high jump as it is not included in my training data."

Let's address the so-called 'hallucination issue' by being more explicit with the prompt.

In [12]:
prompt = """Answer the question as truthfully as possible, and if you're unsure of the answer, say "Sorry, I don't know".

Q: Who won the 2020 Summer Olympics men's high jump?
A:"""

openai.Completion.create(
    prompt=prompt,
    temperature=0,
    max_tokens=300,
    top_p=1,
    frequency_penalty=0,
    presence_penalty=0,
    model=COMPLETIONS_MODEL
)["choices"][0]["text"].strip(" \n")

"Sorry, I don't know."

Okay, that's better. Now let's provide some context and see if that helps.

In [13]:
prompt = """Answer the question as truthfully as possible using the provided text, and if the answer is not contained within the text below, say "I don't know"

Context:
The men's high jump event at the 2020 Summer Olympics took place between 30 July and 1 August 2021 at the Olympic Stadium.
33 athletes from 24 nations competed; the total possible number depended on how many nations would use universality places 
to enter athletes in addition to the 32 qualifying through mark or ranking (no universality places were used in 2021).
Italian athlete Gianmarco Tamberi along with Qatari athlete Mutaz Essa Barshim emerged as joint winners of the event following
a tie between both of them as they cleared 2.37m. Both Tamberi and Barshim agreed to share the gold medal in a rare instance
where the athletes of different nations had agreed to share the same medal in the history of Olympics. 
Barshim in particular was heard to ask a competition official "Can we have two golds?" in response to being offered a 
'jump off'. Maksim Nedasekau of Belarus took bronze. The medals were the first ever in the men's high jump for Italy and 
Belarus, the first gold in the men's high jump for Italy and Qatar, and the third consecutive medal in the men's high jump
for Qatar (all by Barshim). Barshim became only the second man to earn three medals in high jump, joining Patrik Sjöberg
of Sweden (1984 to 1992).

Q: Who won the 2020 Summer Olympics men's high jump?
A:"""

openai.Completion.create(
    prompt=prompt,
    temperature=0,
    max_tokens=300,
    top_p=1,
    frequency_penalty=0,
    presence_penalty=0,
    model=COMPLETIONS_MODEL
)["choices"][0]["text"].strip(" \n")

"Gianmarco Tamberi and Mutaz Essa Barshim won the 2020 Summer Olympics men's high jump."

This is the correct answer. 

Okay, so we just confirmed that adding extra information to the prompt definitely helps!

Let's load the preprocessed data.

In [4]:
import pandas as pd

df = pd.read_csv('../data/olympics_sections.csv')
df = df.set_index(["title", "heading"])
print(f"{len(df)} rows in the data.")
df.sample(5)

3946 rows in the data.


Unnamed: 0_level_0,Unnamed: 1_level_0,content,tokens
title,heading,Unnamed: 2_level_1,Unnamed: 3_level_1
Gymnastics at the 2020 Summer Olympics – Women's uneven bars,Background,"This was the 19th appearance of the event, aft...",43
France at the 2020 Summer Olympics,Artistic,France fielded a full squad of seven artistic ...,171
Belgium at the 2020 Summer Olympics,Swimming,Belgian swimmers achieved qualifying standards...,86
Badminton at the 2020 Summer Olympics – Qualification,Summary,There are 172 quota places available for quali...,287
Volleyball at the 2020 Summer Olympics – Men's South American qualification,Summary,The South American Qualification Tournament fo...,60


Now we can't possibly pass _all_ of this text data through the OpenAI API in a single prompt. Instead, we will pass each section's content, one at a time, and create an embedding for each of them first.

### Create embedding for each section

In [5]:
MODEL_NAME = 'curie'

DOC_EMBEDDINGS_MODEL = f'text-search-{MODEL_NAME}-doc-001'
QUERY_EMBEDDINGS_MODEL = f'text-search-{MODEL_NAME}-query-001'

In [6]:
def get_embedding(text: str, model: str):
    result = openai.Embedding.create(
      model=model,
      input=text
    )
    return result["data"][0]["embedding"]

def get_doc_embedding(text: str):
    return get_embedding(text, DOC_EMBEDDINGS_MODEL)

def get_query_embedding(text: str):
    return get_embedding(text, QUERY_EMBEDDINGS_MODEL)

def compute_doc_embeddings(df: pd.DataFrame):
    """
    Create an embedding for each row in the dataframe using the OpenAI Embeddings API.
    
    Return a dictionary that maps between each embedding vector and the index of the row that it corresponds to.
    """
    return {
        idx: get_doc_embedding(r.content.replace("\n", " ")) for idx, r in df.iterrows()
    }

def load_embeddings(fname: str):
    """
    Read the document embeddings and their keys from a CSV.
    
    fname is the path to a CSV with exactly these named columns: 
        "title", "heading", "0", "1", ... up to the length of the embedding vectors.
    """
    
    df = pd.read_csv(fname, header=0)
    max_dim = max([int(c) for c in df.columns if c != "title" and c != "heading"])
    return {
           (r.title, r.heading): [r[str(i)] for i in range(max_dim + 1)] for _, r in df.iterrows()
    }

In [8]:
df.shape

(3946, 2)

_Note: The following API calls cost me money (around $12). If you don't want to spend any money for this, you'd have to try API throttling. Alternatively, you can skip the calls to the embeddings API and just directly use the saved embeddings from the `../data/olympics_embeddings_short.csv` file included in this repo. Please keep in mind that due to GitHub's file size restriction, I had to create a smaller version of the original dataset, and this 'short' dataset contains only the first 1,000 embeddings out of more than 3,900 in the original dataset._
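Making several thousand API calls in a row can run into rate limits. One common way to handle that is to retry with exponential backoff. The sketch below is an assumption on my part, not something this notebook ran: `with_retries` and `fake_embed` are hypothetical names, and the broad `except Exception` is only for illustration (real code would catch the client's specific rate-limit error).

```python
import time

def with_retries(fn, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Wrap fn so that failures are retried with exponentially growing delays."""
    def wrapper(*args, **kwargs):
        for attempt in range(max_attempts):
            try:
                return fn(*args, **kwargs)
            except Exception:
                # Give up after the last attempt; otherwise wait 1x, 2x, 4x, ... base_delay.
                if attempt == max_attempts - 1:
                    raise
                sleep(base_delay * 2 ** attempt)
    return wrapper

# Hypothetical stand-in for an embeddings call: fails twice, then succeeds.
calls = {"n": 0}

def fake_embed(text):
    calls["n"] += 1
    if calls["n"] <= 2:
        raise RuntimeError("rate limited (simulated)")
    return [float(len(text))]

# Record the delays instead of actually sleeping, so the demo runs instantly.
delays = []
safe_embed = with_retries(fake_embed, sleep=delays.append)
vector = safe_embed("high jump")
print(vector, delays)
```

In the notebook, the same wrapper could in principle be applied to `get_doc_embedding` before calling `compute_doc_embeddings`.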

In [16]:
context_embeddings = compute_doc_embeddings(df)

In [10]:
# An example embedding:
example_entry = list(context_embeddings.items())[0]
print(f"{example_entry[0]} : {example_entry[1][:5]}... ({len(example_entry[1])} entries)")

('2020 Summer Olympics', 'Summary') : [-0.001977740554139018, 0.0038539995439350605, 0.0010027086827903986, 0.008344566449522972, -0.010415716096758842]... (4096 entries)


Export context embeddings.

In [88]:
all_titles, all_headings, all_embeddings = [], [], []
for keys, val in context_embeddings.items():
    title, heading = keys
    all_titles.append(title)
    all_headings.append(heading)
    all_embeddings.append(val)

In [89]:
len(all_titles), len(all_headings), len(all_embeddings)

(3946, 3946, 3946)

In [90]:
df_emb = pd.DataFrame(columns=['title', 'heading', 'embedding'])
df_emb['title'] = all_titles
df_emb['heading'] = all_headings
df_emb['embedding'] = all_embeddings
df_emb.head()

Unnamed: 0,title,heading,embedding
0,2020 Summer Olympics,Summary,"[-0.001977740554139018, 0.0038539995439350605,..."
1,2020 Summer Olympics,Host city selection,"[-0.005577346310019493, 0.0024105869233608246,..."
2,2020 Summer Olympics,Impact of the COVID-19 pandemic,"[-0.007205113768577576, -0.02255360037088394, ..."
3,2020 Summer Olympics,Qualifying event cancellation and postponement,"[0.009390046820044518, -0.008730015717446804, ..."
4,2020 Summer Olympics,Effect on doping tests,"[-0.0034491312690079212, -0.003978027962148189..."


In [95]:
len(df_emb.embedding.head(1).values[0])

4096

In [None]:
for i in range(4096):
    df_emb[i] = df_emb['embedding'].apply(lambda x: x[i])

In [100]:
df_emb = df_emb.drop(columns='embedding')
df_emb.head(1)

Unnamed: 0,title,heading,0,1,2,3,4,5,6,7,...,4086,4087,4088,4089,4090,4091,4092,4093,4094,4095
0,2020 Summer Olympics,Summary,-0.001978,0.003854,0.001003,0.008345,-0.010416,0.017999,-0.003965,-0.035851,...,-0.003051,-0.012076,-0.000273,0.012122,0.004525,-0.000271,-0.025684,0.013848,0.019097,0.003439


Let's save these embeddings for future use. (If we don't save them, we would have to make those API calls again.)

In [101]:
df_emb.to_csv('../data/olympics_embeddings.csv', index=False)

Do a quick test to read it back again to make sure it's readable.

In [102]:
document_embeddings = load_embeddings('../data/olympics_embeddings.csv')

In [103]:
# an example embedding:
example_entry2 = list(document_embeddings.items())[0]
print(f"{example_entry2[0]} : {example_entry2[1][:5]}... ({len(example_entry2[1])} entries)")

('2020 Summer Olympics', 'Summary') : [-0.001977740554139, 0.003853999543935, 0.0010027086827903, 0.0083445664495229, -0.0104157160967588]... (4096 entries)


In [104]:
context_embeddings == document_embeddings

False

In [107]:
len(context_embeddings), len(document_embeddings)

(3946, 3946)

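Okay, it looks good. The two dictionaries are not exactly equal because the values differ slightly after the CSV round trip (compare the two printed example entries above), but they have the same number of entries and the same vector length, so we can read the embeddings from the CSV file.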
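A stricter check than comparing lengths is to compare the two dictionaries with a numerical tolerance. This is a minimal sketch, assuming both dictionaries map `(title, heading)` keys to lists of floats; `embeddings_close` is a hypothetical helper, and the tolerances are arbitrary choices. The sample values are the first two numbers from the two printed entries above.

```python
import math

def embeddings_close(a, b, rel_tol=1e-9, abs_tol=1e-12):
    """Return True if both dicts have the same keys and near-equal vectors."""
    if a.keys() != b.keys():
        return False
    for key in a:
        if len(a[key]) != len(b[key]):
            return False
        if not all(
            math.isclose(x, y, rel_tol=rel_tol, abs_tol=abs_tol)
            for x, y in zip(a[key], b[key])
        ):
            return False
    return True

key = ("2020 Summer Olympics", "Summary")
original = {key: [-0.001977740554139018, 0.0038539995439350605]}
reloaded = {key: [-0.001977740554139, 0.003853999543935]}

print(original == reloaded)                  # exact comparison
print(embeddings_close(original, reloaded))  # comparison with tolerance
```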

### Find the relevant section

Find the most similar document embedding to the question embedding.

In [13]:
def vector_similarity(x, y):
    """
    We could use cosine similarity or dot product to calculate the similarity between vectors.
    In practice, we have found it makes little difference. 
    """
    return np.dot(np.array(x), np.array(y))

def order_document_sections_by_query_similarity(query, contexts):
    """
    Find the query embedding for the supplied query, and compare it against all of the pre-calculated document embeddings
    to find the most relevant sections. 
    
    Return the list of document sections, sorted by relevance in descending order.
    """
    query_embedding = get_query_embedding(query)
    
    document_similarities = sorted([
        (vector_similarity(query_embedding, doc_embedding), doc_index) for doc_index, doc_embedding in contexts.items()
    ], reverse=True)
    
    return document_similarities

In [108]:
order_document_sections_by_query_similarity("Who won the men's high jump?", document_embeddings)[:5]

[(0.4296262705737044,
  ("Athletics at the 2020 Summer Olympics – Men's high jump", 'Summary')),
 (0.4130726161040939,
  ("Athletics at the 2020 Summer Olympics – Men's high jump", 'Background')),
 (0.4067050971353775,
  ("Athletics at the 2020 Summer Olympics – Women's high jump", 'Summary')),
 (0.40424430924445215,
  ("Athletics at the 2020 Summer Olympics – Men's triple jump", 'Summary')),
 (0.40219236123048674,
  ("Athletics at the 2020 Summer Olympics – Women's long jump", 'Summary'))]

This makes sense. We would expect these to be some of the relevant sections where we might find the answer to this question.
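The docstring of `vector_similarity` notes that cosine similarity and dot product make little difference in practice. One reason this can hold is that the two are identical when the vectors have unit length. The sketch below illustrates this with tiny made-up 2-D vectors (an assumption for illustration; whether a given embeddings model returns unit-length vectors should be checked rather than assumed).

```python
import math

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def norm(x):
    return math.sqrt(dot(x, x))

def cosine(x, y):
    return dot(x, y) / (norm(x) * norm(y))

def normalize(x):
    n = norm(x)
    return [a / n for a in x]

def rank(query, docs, score):
    # Document names sorted by similarity to the query, most similar first.
    return sorted(docs, key=lambda name: score(query, docs[name]), reverse=True)

query = normalize([3.0, 4.0])

# Unit-length documents: dot product and cosine similarity give the same ranking.
unit_docs = {
    "a": normalize([4.0, 3.0]),
    "b": normalize([1.0, 0.0]),
    "c": normalize([0.0, 1.0]),
}
print(rank(query, unit_docs, dot), rank(query, unit_docs, cosine))

# Documents of different lengths: the dot product favours the longer vector.
raw_docs = {"b": [10.0, 0.0], "c": [0.0, 1.0]}
print(rank(query, raw_docs, dot), rank(query, raw_docs, cosine))
```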

### Add the relevant section to the query prompt

Now that we know we can find the relevant section (from our entire corpus on the 2020 Summer Olympics), we can grab that section and include it in the prompt that we send to GPT-3, along with our question.

In [18]:
MAX_SECTION_LEN = 500
SEPARATOR = "\n* "

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
separator_len = len(tokenizer.tokenize(SEPARATOR))

f"Context separator contains {separator_len} tokens"

'Context separator contains 3 tokens'

In [19]:
def construct_prompt(question: str, context_embeddings: dict, df: pd.DataFrame) -> str:
    """
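    Fetch the most relevant document sections for the question and combine them, 
    up to MAX_SECTION_LEN tokens, with the question into a single prompt.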
    """
    most_relevant_document_sections = order_document_sections_by_query_similarity(question, context_embeddings)
    
    chosen_sections = []
    chosen_sections_len = 0
    chosen_sections_indexes = []
     
    for _, section_index in most_relevant_document_sections:
        # Add contexts until we run out of space.        
        document_section = df.loc[section_index]
        
        chosen_sections_len += document_section.tokens + separator_len
        if chosen_sections_len > MAX_SECTION_LEN:
            break
            
        chosen_sections.append(SEPARATOR + document_section.content.replace("\n", " "))
        chosen_sections_indexes.append(str(section_index))
            
    # Useful diagnostic information
    print(f"Selected {len(chosen_sections)} document sections:")
    print("\n".join(chosen_sections_indexes))
    
    header = """Answer the question as truthfully as possible using the provided context, and if the answer is not contained within the text below, say "I don't know."\n\nContext:\n"""
    
    return header + "".join(chosen_sections) + "\n\n Q: " + question + "\n A:"

In [109]:
prompt = construct_prompt(
    "Who won the 2020 Summer Olympics men's high jump?",
    document_embeddings,
    df)

print("===\n", prompt)

Selected 3 document sections:
("Athletics at the 2020 Summer Olympics – Women's high jump", 'Summary')
("Athletics at the 2020 Summer Olympics – Men's high jump", 'Summary')
("Athletics at the 2020 Summer Olympics – Men's triple jump", 'Summary')
===
 Answer the question as truthfully as possible using the provided context, and if the answer is not contained within the text below, say "I don't know."

Context:

* The women's high jump event at the 2020 Summer Olympics took place on 5 and 7 August 2021 at the Japan National Stadium. Even though 32 athletes qualified through the qualification system for the Games, only 31 took part in the competition. This was the 22nd appearance of the event, having appeared at every Olympics since women's athletics was introduced in 1928.
* The men's high jump event at the 2020 Summer Olympics took place between 30 July and 1 August 2021 at the Olympic Stadium. 33 athletes from 24 nations competed; the total possible number depended on how many nations

This looks good. We have successfully added the relevant section to the prompt. 
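The selection loop inside `construct_prompt` is a greedy token budget: sections are taken in relevance order until the next one would exceed the limit. Here is a stripped-down sketch of just that logic. The section names and token counts are made up for illustration, and `select_sections` is a hypothetical helper, not part of the notebook.

```python
def select_sections(ranked, token_counts, max_tokens, separator_tokens=3):
    """Take sections in ranked order until the running token total exceeds max_tokens."""
    chosen = []
    used = 0
    for name in ranked:
        used += token_counts[name] + separator_tokens
        if used > max_tokens:
            # Stop at the first section that does not fit, as construct_prompt does.
            break
        chosen.append(name)
    return chosen

# Hypothetical sections, already sorted from most to least relevant.
ranked = ["men_high_jump", "women_high_jump", "men_triple_jump", "background"]
token_counts = {
    "men_high_jump": 200,
    "women_high_jump": 150,
    "men_triple_jump": 180,
    "background": 40,
}

chosen = select_sections(ranked, token_counts, max_tokens=500)
print(chosen)
```

Note that the loop stops at the first section that overflows the budget, so the short "background" section is not picked up even though it would have fit on its own.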

Now we can answer a question based on the context!

In [110]:
COMPLETIONS_API_PARAMS = {
    # We use temperature of 0.0 because it gives the most predictable, factual answer.
    "temperature": 0.0,
    "max_tokens": 300,
    "model": COMPLETIONS_MODEL,
}

In [112]:
def answer_query_with_context(
    query: str,
    df: pd.DataFrame,
    document_embeddings,
    show_prompt: bool = False) -> str:
    prompt = construct_prompt(
        query,
        document_embeddings,
        df
    )
    
    if show_prompt:
        print(prompt)

    response = openai.Completion.create(
                prompt=prompt,
                **COMPLETIONS_API_PARAMS
            )

    return response["choices"][0]["text"].strip(" \n")

In [113]:
answer_query_with_context("Who won the 2020 Summer Olympics men's high jump?", df, document_embeddings)

Selected 3 document sections:
("Athletics at the 2020 Summer Olympics – Women's high jump", 'Summary')
("Athletics at the 2020 Summer Olympics – Men's high jump", 'Summary')
("Athletics at the 2020 Summer Olympics – Men's triple jump", 'Summary')


'Gianmarco Tamberi and Mutaz Essa Barshim emerged as joint winners of the event following a tie between both of them as they cleared 2.37m.'

YES! This is the correct answer!