# Question and Answer with OpenAI and RedisVL

This example shows how to use RedisVL to create a question and answer system using OpenAI's API.

In this notebook we will:
1. Download a dataset of Wikipedia articles (thanks to OpenAI's CDN)
2. Split each article into chunks and create an embedding for each chunk
3. Create a RedisVL index and store the embeddings with metadata
4. Construct a simple QnA system using the index and GPT-3.5


The image below shows the architecture of the system we will create in this notebook.

![Diagram](https://github.com/RedisVentures/redis-openai-qna/raw/main/app/assets/RedisOpenAI-QnA-Architecture.drawio.png)

## Setup

To run this example, you need a Redis instance with RediSearch running locally. One way to start one is to run the Redis Stack Docker image from your terminal:

```bash
docker run --name redis-vecdb -d -p 6379:6379 -p 8001:8001 redis/redis-stack:latest
```

This also provides the RedisInsight GUI at http://localhost:8001.
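
Before continuing, it can help to confirm that the container is actually accepting connections. The following is a minimal sketch using only the standard library; the helper name `redis_is_reachable` is hypothetical and not part of RedisVL. It assumes the default port mapping from the `docker run` command above and a server without authentication (a password-protected server would answer with an error instead of `+PONG`, and the helper would return `False`).

```python
import socket


def redis_is_reachable(host="localhost", port=6379, timeout=1.0):
    """Return True if a Redis server answers PING at host:port, else False."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            # RESP encoding of the one-word command PING
            sock.sendall(b"*1\r\n$4\r\nPING\r\n")
            reply = sock.recv(64)
    except OSError:
        # Connection refused, timed out, or host could not be resolved
        return False
    # An unauthenticated server replies with the simple string +PONG
    return reply.startswith(b"+PONG")


print(redis_is_reachable())
```

If this prints `False`, check that the container is running and that port 6379 is published.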

Next, we will install the dependencies for this notebook.

In [None]:
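# first we need to install a few things
# (redisvl is added here in case it is not already installed; note that the code
# below uses the pre-1.0 `openai.ChatCompletion` interface)

!pip install pandas wget tenacity tiktoken openai redisvl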

In [2]:
import wget
import pandas as pd

embeddings_url = 'https://cdn.openai.com/API/examples/data/wikipedia_articles_2000.csv'

wget.download(embeddings_url)

'wikipedia_articles_2000.csv'

In [14]:
df = pd.read_csv('wikipedia_articles_2000.csv')
df = df.drop(columns=['Unnamed: 0'])
df

Unnamed: 0,id,url,title,text
0,3661,https://simple.wikipedia.org/wiki/Photon,Photon,"Photons (from Greek φως, meaning light), in m..."
1,7796,https://simple.wikipedia.org/wiki/Thomas%20Dolby,Thomas Dolby,Thomas Dolby (born Thomas Morgan Robertson; 14...
2,67912,https://simple.wikipedia.org/wiki/Embroidery,Embroidery,Embroidery is the art of decorating fabric or ...
3,44309,https://simple.wikipedia.org/wiki/Consecutive%...,Consecutive integer,Consecutive numbers are numbers that follow ea...
4,41741,https://simple.wikipedia.org/wiki/German%20Empire,German Empire,"The German Empire (""Deutsches Reich"" or ""Deuts..."
...,...,...,...,...
1995,9252,https://simple.wikipedia.org/wiki/Relativity,Relativity,The word relativity usually means two things i...
1996,14,https://simple.wikipedia.org/wiki/Alanis%20Mor...,Alanis Morissette,"Alanis Nadine Morissette (born June 1, 1974) i..."
1997,49769,https://simple.wikipedia.org/wiki/Brontosaurus,Brontosaurus,Brontosaurus is a genus of sauropod dinosaur....
1998,55998,https://simple.wikipedia.org/wiki/Work%20%28ph...,Work (physics),"In physics, a force does work when it acts on ..."


## Data Preparation



### Text Chunking

To create embeddings for the articles, we first need to split the text into smaller chunks, because the OpenAI API limits the length of text that can be sent in a single request. The code that follows draws heavily on this [notebook](https://github.com/openai/openai-cookbook/blob/main/apps/enterprise-knowledge-retrieval/enterprise_knowledge_retrieval.ipynb) by OpenAI.


In [19]:
TEXT_EMBEDDING_CHUNK_SIZE = 1000
EMBEDDINGS_MODEL = "text-embedding-ada-002"


def chunks(text, n, tokenizer):
    """Yield successive token chunks of roughly n tokens from text.

    Split a text into smaller chunks of about n tokens, preferably ending at the end of a sentence.
    """
    tokens = tokenizer.encode(text)
    i = 0
    while i < len(tokens):
        # Find the nearest end of sentence within a range of 0.5 * n and 1.5 * n tokens
        j = min(i + int(1.5 * n), len(tokens))
        while j > i + int(0.5 * n):
            # Decode the tokens and check for full stop or newline
            chunk = tokenizer.decode(tokens[i:j])
            if chunk.endswith(".") or chunk.endswith("\n"):
                break
            j -= 1
        # If no end of sentence found, use n tokens as the chunk size
        if j == i + int(0.5 * n):
            j = min(i + n, len(tokens))
        yield tokens[i:j]
        i = j

def get_unique_id_for_file_chunk(title, chunk_index):
    return str(title+"-!"+str(chunk_index))

def chunk_text(record, tokenizer):
    """Return a list of chunk records (dicts with id, url, title, content, file_chunk_index) for one article."""
    chunked_records = []

    url = record['url']
    title = record['title']
    file_body_string = record['text']

    # Split the article body into token chunks, then decode each chunk back to text
    token_chunks = list(chunks(file_body_string, TEXT_EMBEDDING_CHUNK_SIZE, tokenizer))
    text_chunks = [f'Title: {title};\n'+ tokenizer.decode(chunk) for chunk in token_chunks]

    for i, text_chunk in enumerate(text_chunks):
        doc_id = get_unique_id_for_file_chunk(title, i)
        chunked_records.append(({"id": doc_id,
                                "url": url,
                                "title": title,
                                "content": text_chunk,
                                "file_chunk_index": i}))
    return chunked_records
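
To see how the sentence-boundary search behaves, here is a small standalone sketch. It repeats the splitting logic of `chunks` so that it runs on its own, and it swaps the tiktoken encoding for a hypothetical `CharTokenizer` (one token per character) so that no download is needed. The chunk size of 10 is chosen only to keep the example readable.

```python
class CharTokenizer:
    """Hypothetical stand-in for a tiktoken encoding: one token per character."""

    def encode(self, text):
        return list(text)

    def decode(self, tokens):
        return "".join(tokens)


def chunks(text, n, tokenizer):
    """Same splitting logic as the notebook's chunks(), repeated to run standalone."""
    tokens = tokenizer.encode(text)
    i = 0
    while i < len(tokens):
        # Start at 1.5 * n tokens and walk back towards 0.5 * n looking for a sentence end
        j = min(i + int(1.5 * n), len(tokens))
        while j > i + int(0.5 * n):
            chunk = tokenizer.decode(tokens[i:j])
            if chunk.endswith(".") or chunk.endswith("\n"):
                break
            j -= 1
        # No sentence end found in range: fall back to exactly n tokens
        if j == i + int(0.5 * n):
            j = min(i + n, len(tokens))
        yield tokens[i:j]
        i = j


tok = CharTokenizer()
text = "One. Two three. Four five six."
pieces = [tok.decode(c) for c in chunks(text, 10, tok)]
print(pieces)  # ['One. Two three.', ' Four five six.']
```

Each chunk ends at a full stop, and joining the chunks reproduces the original text, so no content is lost at the boundaries.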

In [20]:
# Initialise tokenizer
import tiktoken
oai_tokenizer = tiktoken.get_encoding("cl100k_base")

records = []
for _, record in df.iterrows():
    records.extend(chunk_text(record, oai_tokenizer))



In [25]:
chunked_data = pd.DataFrame(records)
chunked_data

Unnamed: 0,id,url,title,content,file_chunk_index
0,Photon-!0,https://simple.wikipedia.org/wiki/Photon,Photon,"Title: Photon;\nPhotons (from Greek φως, mean...",0
1,Photon-!1,https://simple.wikipedia.org/wiki/Photon,Photon,Title: Photon;\nElementary particles,1
2,Thomas Dolby-!0,https://simple.wikipedia.org/wiki/Thomas%20Dolby,Thomas Dolby,Title: Thomas Dolby;\nThomas Dolby (born Thoma...,0
3,Embroidery-!0,https://simple.wikipedia.org/wiki/Embroidery,Embroidery,Title: Embroidery;\nEmbroidery is the art of d...,0
4,Consecutive integer-!0,https://simple.wikipedia.org/wiki/Consecutive%...,Consecutive integer,Title: Consecutive integer;\nConsecutive numbe...,0
...,...,...,...,...,...
2688,Alanis Morissette-!1,https://simple.wikipedia.org/wiki/Alanis%20Mor...,Alanis Morissette,Title: Alanis Morissette;\nTwin people from Ca...,1
2689,Brontosaurus-!0,https://simple.wikipedia.org/wiki/Brontosaurus,Brontosaurus,Title: Brontosaurus;\nBrontosaurus is a genus...,0
2690,Work (physics)-!0,https://simple.wikipedia.org/wiki/Work%20%28ph...,Work (physics),"Title: Work (physics);\nIn physics, a force do...",0
2691,Syllable-!0,https://simple.wikipedia.org/wiki/Syllable,Syllable,Title: Syllable;\nA syllable is a unit of pron...,0


### Embedding Creation

With the text broken up into chunks, we can create embeddings with the RedisVL `OpenAITextVectorizer`. This provider uses the OpenAI API to create embeddings for the text. The code below shows how to create embeddings for the text chunks.

In [47]:
import os
from redisvl.vectorize.text import OpenAITextVectorizer
from redisvl.utils.utils import array_to_buffer

api_key = os.environ.get("OPENAI_API_KEY", "")
oaip = OpenAITextVectorizer(EMBEDDINGS_MODEL, api_config={"api_key": api_key})

chunked_data["embedding"] = oaip.embed_many(chunked_data["content"].tolist())
chunked_data["embedding"] = chunked_data["embedding"].apply(lambda x: array_to_buffer(x))
chunked_data

Unnamed: 0,id,url,title,content,file_chunk_index,embedding
0,Photon-!0,https://simple.wikipedia.org/wiki/Photon,Photon,"Title: Photon;\nPhotons (from Greek φως, mean...",0,b'\xc2\xf8\xc9;\xa7]\xfb;\x88\x90P\xbc`\xcc\x9...
1,Photon-!1,https://simple.wikipedia.org/wiki/Photon,Photon,Title: Photon;\nElementary particles,1,b'\x03\x1d#\xbc\x00c\x8d<\xae\xcam\xbc\xc5\x1f...
2,Thomas Dolby-!0,https://simple.wikipedia.org/wiki/Thomas%20Dolby,Thomas Dolby,Title: Thomas Dolby;\nThomas Dolby (born Thoma...,0,b'k\xaf\xcc\xbc\x89\xe5\xad;3\xea\xd8\xbc+\x81...
3,Embroidery-!0,https://simple.wikipedia.org/wiki/Embroidery,Embroidery,Title: Embroidery;\nEmbroidery is the art of d...,0,b'07\xf5\xbc\xaf\xcb\x02\xbc\x90\xe6N\xbc\x84\...
4,Consecutive integer-!0,https://simple.wikipedia.org/wiki/Consecutive%...,Consecutive integer,Title: Consecutive integer;\nConsecutive numbe...,0,b'0(\xfa\xbb\x81\xd2\xd9;\xaf\x92\x9a;\xd3FL\x...
...,...,...,...,...,...,...
2688,Alanis Morissette-!1,https://simple.wikipedia.org/wiki/Alanis%20Mor...,Alanis Morissette,Title: Alanis Morissette;\nTwin people from Ca...,1,"b'\xc1K5\xbc\xb8""\xe0\xbc\x17A\x07\xbb\xb0\xbc..."
2689,Brontosaurus-!0,https://simple.wikipedia.org/wiki/Brontosaurus,Brontosaurus,Title: Brontosaurus;\nBrontosaurus is a genus...,0,b'3\xf0\xda\xbcY\xc0\xb4:\x1cN\x81\xbc\xe9\xcc...
2690,Work (physics)-!0,https://simple.wikipedia.org/wiki/Work%20%28ph...,Work (physics),"Title: Work (physics);\nIn physics, a force do...",0,b'\x97\x82\xb9\xbbL\x90d\xbc\xb7G\x9c\xba\x94g...
2691,Syllable-!0,https://simple.wikipedia.org/wiki/Syllable,Syllable,Title: Syllable;\nA syllable is a unit of pron...,0,"b'\xe4\xa3\x1c:\x83g\x90<\x99=s;*[E\xbb\x10 ""\..."
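
The `embedding` column now holds raw bytes rather than lists of floats, because Redis hashes store vectors as binary blobs. The sketch below illustrates the idea with the standard library only. It assumes the vectors are serialized as little-endian 32-bit floats, which is an assumption about what `array_to_buffer` produces rather than a statement of its implementation; the helper names `floats_to_buffer` and `buffer_to_floats` are hypothetical.

```python
import struct


def floats_to_buffer(vector):
    """Pack a list of floats as little-endian float32 bytes (4 bytes per dimension)."""
    return struct.pack(f"<{len(vector)}f", *vector)


def buffer_to_floats(buffer):
    """Inverse of floats_to_buffer: unpack float32 bytes back into a list of floats."""
    return list(struct.unpack(f"<{len(buffer) // 4}f", buffer))


vector = [0.5, -1.25, 2.0]
buf = floats_to_buffer(vector)
print(len(buf))                # 12
print(buffer_to_floats(buf))   # [0.5, -1.25, 2.0]

# A 1536-dimensional embedding takes 1536 * 4 bytes under this encoding
print(len(floats_to_buffer([0.0] * 1536)))  # 6144
```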


## Construct the ``SearchIndex``

Now that we have the embeddings, we can create a ``SearchIndex`` to store them in Redis together with the metadata for each article chunk. The schema below declares the text, tag, and vector fields of the index.

In [32]:
%%writefile wiki_schema.yaml

index:
    name: wiki
    prefix: oaiWiki
    key_field: id
    storage_type: hash

fields:
    text:
        - name: content
        - name: title
    tag:
        - name: id
    vector:
        - name: embedding
          dims: 1536
          distance_metric: cosine
          algorithm: flat

Writing wiki_schema.yaml
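
The `flat` algorithm with the `cosine` metric amounts to an exact, brute-force comparison of the query vector against every stored vector. The following pure-Python sketch shows that idea on toy 2-dimensional data. It is an illustration of the technique, not how Redis implements it, and the names `cosine_distance` and `flat_search` are hypothetical.

```python
import math


def cosine_distance(a, b):
    """Cosine distance: 1 minus the cosine similarity of a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def flat_search(query, vectors, k=1):
    """Score every stored vector against the query and return the k closest keys."""
    scored = [(cosine_distance(query, vec), key) for key, vec in vectors.items()]
    scored.sort()
    return [(key, dist) for dist, key in scored[:k]]


# Toy "index" with 2-dimensional vectors instead of 1536-dimensional embeddings
vectors = {"doc:a": [1.0, 0.0], "doc:b": [0.0, 1.0], "doc:c": [1.0, 1.0]}
print(flat_search([1.0, 0.0], vectors, k=2))
```

Here `doc:a` comes first with a distance of 0 because it points in the same direction as the query, followed by `doc:c`. A flat index trades query speed for exact results, which is a reasonable choice for a few thousand chunks.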


In [35]:
from redisvl.index import AsyncSearchIndex

index = AsyncSearchIndex.from_yaml("wiki_schema.yaml")
index.connect("redis://localhost:6379")

await index.create()

In [39]:
!rvl index listall

17:55:08 sam.partee-NW9MQX5Y74 redisvl.cli.index[44333] INFO Indices:
17:55:08 sam.partee-NW9MQX5Y74 redisvl.cli.index[44333] INFO 1. wiki


In [48]:
await index.load(chunked_data.to_dict(orient="records"))

## Build the QnA System

Now that we have the data and the embeddings, we can build the QnA system. The system performs three actions:

1. Embed the user question and search for the most similar content
2. Make a prompt with the query and retrieved content
3. Send the prompt to the OpenAI API and return the answer
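
The flow of these three steps can be sketched offline by passing the embedding, search, and completion steps in as plain callables. Everything in this block is a hypothetical stand-in (`answer`, `fake_embed`, `fake_search`, `fake_complete`); none of it calls OpenAI or Redis, and the prompt is a shortened version of the one used in the notebook.

```python
def answer(question, embed, search, complete):
    """Run the three QnA steps using injected callables."""
    # 1. Embed the question and retrieve the most similar content
    query_vector = embed(question)
    content = search(query_vector)
    # 2. Build a prompt from the question and the retrieved content
    prompt = f"Search query:\n{question}\n\nContent:\n{content}\n\nAnswer:"
    # 3. Send the prompt to the model and return its reply
    return complete(prompt)


# Stand-ins so the sketch runs without network access
def fake_embed(text):
    return [float(len(text))]


def fake_search(vector):
    return "Title: Brontosaurus;\nBrontosaurus is a genus of sauropod dinosaur."


def fake_complete(prompt):
    # Echo the prompt so the data flow is visible
    return f"PROMPT SEEN BY MODEL:\n{prompt}"


reply = answer("What is a Brontosaurus?", fake_embed, fake_search, fake_complete)
print(reply)
```

Keeping the steps separate like this makes it easy to swap the retrieval or the model without touching the rest of the pipeline.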


In [46]:
import openai
from redisvl.query import VectorQuery

In [56]:
CHAT_MODEL = "gpt-3.5-turbo"

def make_prompt(query, content):
    retrieval_prompt = f'''Use the content to answer the search query the customer has sent.
    If you can't answer the user's question, do not guess. If there is no content, respond with "I don't know".

    Search query:

    {query}

    Content:

    {content}

    Answer:
    '''
    return retrieval_prompt

async def retrieve_context(query):
    # Embed the query
    query_embedding = oaip.embed(query)

    # Get the top result from the index
    vector_query = VectorQuery(
        vector=query_embedding,
        vector_field_name="embedding",
        return_fields=["content"],
        num_results=1
    )

    results = await index.search(vector_query.query, query_params=vector_query.params)
    content = ""
    # num_results=1, so a successful search returns exactly one document
    if len(results.docs) > 0:
        content = results.docs[0]["content"]
    return content


async def answer_question(query):

    # Retrieve the context
    content = await retrieve_context(query)
    
    prompt = make_prompt(query, content)
    retrieval = await openai.ChatCompletion.acreate(
        model=CHAT_MODEL,
        messages=[{'role':"user",
                   'content': prompt}],
        max_tokens=500)

    # Response provided by GPT-3.5
    return retrieval['choices'][0]['message']['content']

In [65]:
import textwrap

question = "What is a Brontosaurus?"
textwrap.wrap(await answer_question(question), width=80)

['A Brontosaurus is a genus of large, herbivorous dinosaurs that lived during the',
 'Late Jurassic period, around 150 million years ago. They were characterized by',
 'their long necks and tails, and were among the largest land animals to have ever',
 'lived. However, the name "Brontosaurus" is no longer considered valid in modern',
 'scientific classification. In the early 20th century, it was discovered that',
 'Brontosaurus fossils were actually the same as those of another dinosaur species',
 'called Apatosaurus. As a result, the name Brontosaurus was dropped and the',
 'species was officially classified as Apatosaurus excelsus.']

In [55]:
# Question that makes no sense
question = "What is a trackiosamidon?"
await answer_question(question)

"I don't know."

In [66]:
question = "Tell me about the life of Alanis Morissette"
textwrap.wrap(await answer_question(question))

['Alanis Morissette is a Canadian-American singer-songwriter, known for',
 'her powerful and emotive vocal style. She gained international fame in',
 'the 1990s with her groundbreaking album "Jagged Little Pill." Born on',
 'June 1, 1974, in Ottawa, Canada, Morissette started her career in',
 'music at a young age and released her first studio album at the age of',
 '16. However, it was her third album, "Jagged Little Pill," released in',
 '1995, that propelled her to superstardom. The album became a cultural',
 'phenomenon, selling millions of copies worldwide and earning her',
 "numerous awards, including Grammy Awards. Morissette's music often",
 'explores themes of love, anger, and personal growth. She has continued',
 'to release albums and tour extensively, maintaining a dedicated fan',
 'base. Additionally, Morissette has also delved into acting and',
 'activism, using her platform to raise awareness and advocate for',
 "various causes. Overall, Alanis Morissette's life has been