# Achieving RAG for chatbot using OpenAI API and Pinecone

This notebook demonstrates retrieval-augmented generation (RAG) using LangChain's OpenAI integration and the Pinecone vector database. The basic idea is to use Pinecone to store and retrieve context information relevant to a query, and to use that context to augment the generated response. The key steps are: converting text content to vectors and storing them in the vector database; retrieving the text content relevant to a query from the vector database; and using the retrieved content as context for the query.

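Please make sure you have set the [OpenAI API key](https://platform.openai.com/account/api-keys) and [Pinecone API key](https://app.pinecone.io) to use OpenAI and Pinecone in this notebook. The code below reads `PINECONE_API_KEY` and `PINECONE_ENVIRONMENT` from environment variables, and LangChain's OpenAI classes read `OPENAI_API_KEY`.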

Packages required for this notebook and the versions used here are the following:
* langchain version 0.0.292
* openai version 0.28.0
* datasets version 2.10.1
* pinecone-client version 2.2.4
* tiktoken version 0.5.1

In [1]:
!pip install -qU \
    langchain==0.0.292 \
    openai==0.28.0 \
    datasets==2.10.1 \
    pinecone-client==2.2.4 \
    tiktoken==0.5.1

## Task 1: Building a Chatbot
* In LangChain, we pass messages to a `ChatOpenAI` object, with `model_name` set to a chat model such as `gpt-3.5-turbo`.
* The messages we pass carry different roles. In LangChain, there are three types of messages:
  + SystemMessage: defines the role the GPT model plays in the conversation
  + HumanMessage: a message the human sends to the chat. These messages are usually our queries
  + AIMessage: a message generated by the chat model in response to a HumanMessage
* In the following example, we build a conversation containing these three message types, and check what a `ChatOpenAI` response looks like
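
The role idea can be sketched without LangChain. The snippet below uses a plain dataclass as a stand-in for the three message types and maps them to the `system`/`user`/`assistant` role names used by the OpenAI chat API. The `Message` class, `ROLE_BY_KIND`, and `to_openai_format` are hypothetical names introduced for illustration only; they are not part of LangChain.

```python
from dataclasses import dataclass

# Hypothetical stand-in for LangChain's message classes (illustration only).
@dataclass
class Message:
    kind: str      # "system", "human", or "ai"
    content: str

# Role names used by the OpenAI chat API for each message kind.
ROLE_BY_KIND = {"system": "system", "human": "user", "ai": "assistant"}

def to_openai_format(messages):
    """Convert stand-in messages to a list of role/content dictionaries."""
    return [{"role": ROLE_BY_KIND[m.kind], "content": m.content} for m in messages]

conversation = [
    Message("system", "You are a helpful assistant."),
    Message("human", "Hi AI, how are you today?"),
    Message("ai", "I'm great thank you. How can I help you?"),
]
payload = to_openai_format(conversation)
print([m["role"] for m in payload])
```

The point of the sketch is only that each message pairs a role with its text; the real classes carry additional fields.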

In [2]:
from langchain.chat_models import ChatOpenAI

# initiate a ChatOpenAI object with gpt 3.5 turbo model
chat = ChatOpenAI(model_name="gpt-3.5-turbo")

In [3]:
from langchain.schema import (
    SystemMessage,
    HumanMessage,
    AIMessage
)

# define the messages and their roles to start the conversation
messages = [
    SystemMessage(content="You are a helpful assistant."),
    HumanMessage(content="Hi AI, how are you today?"),
    AIMessage(content="I'm great thank you. How can I help you?"),
    HumanMessage(content="I'd like to understand string theory.")
]

# send the message to the chat
res = chat(messages)
res

AIMessage(content="Sure! String theory is a theoretical framework in physics that aims to explain the fundamental nature of particles and their interactions by modeling them as tiny vibrating strings. It suggests that all particles in the universe, including the ones that make up matter and the ones that transmit forces, are ultimately composed of these tiny strings.\n\nIn string theory, the basic building blocks of the universe are not point-like particles, but rather one-dimensional strings. These strings can vibrate in different ways, giving rise to different fundamental particles. Each vibrational mode corresponds to a different particle with specific properties such as mass and charge.\n\nOne of the key ideas in string theory is that it requires extra spatial dimensions beyond the three (length, width, and height) that we are familiar with. These extra dimensions are typically compactified, meaning they are curled up into tiny, tightly wound loops that are too small to be directly

You can see that the response object has the following properties:
* the response itself is an `AIMessage` object
* the text we are interested in is stored in its `content` attribute
* although the printed form resembles a dictionary, you can't retrieve the content with `res['content']`; you have to use attribute access (`res.content`)
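
The difference between attribute access and dictionary-style access can be illustrated with a small stand-in class. `FakeAIMessage` below is a hypothetical dataclass, not the LangChain `AIMessage`; it only mimics the fact that the text lives in a `content` attribute and that the object is not subscriptable.

```python
from dataclasses import dataclass

# Hypothetical stand-in for an AIMessage-like object (not the LangChain class).
@dataclass
class FakeAIMessage:
    content: str

res_demo = FakeAIMessage(content="example reply")

# Attribute access works.
text = res_demo.content

# Subscript access fails because the class does not define __getitem__.
try:
    res_demo["content"]
    subscript_ok = True
except TypeError:
    subscript_ok = False

print(text)
print(subscript_ok)
```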

Let's print the 'content' of the response.

In [4]:
print(res.content)

Sure! String theory is a theoretical framework in physics that aims to explain the fundamental nature of particles and their interactions by modeling them as tiny vibrating strings. It suggests that all particles in the universe, including the ones that make up matter and the ones that transmit forces, are ultimately composed of these tiny strings.

In string theory, the basic building blocks of the universe are not point-like particles, but rather one-dimensional strings. These strings can vibrate in different ways, giving rise to different fundamental particles. Each vibrational mode corresponds to a different particle with specific properties such as mass and charge.

One of the key ideas in string theory is that it requires extra spatial dimensions beyond the three (length, width, and height) that we are familiar with. These extra dimensions are typically compactified, meaning they are curled up into tiny, tightly wound loops that are too small to be directly observed. The number o

Since the response is just an `AIMessage` object, we can append it to our original `messages`, append another `HumanMessage`, and send the resulting list to the chat model to continue the conversation.

In [5]:
messages.append(res)
prompt = HumanMessage(content="Why do physicists believe it can produce a 'unified theory'?")
messages.append(prompt)

As shown below, the resulting `messages` list contains every message of our conversation from the beginning.

In [6]:
messages

[SystemMessage(content='You are a helpful assistant.', additional_kwargs={}),
 HumanMessage(content='Hi AI, how are you today?', additional_kwargs={}, example=False),
 AIMessage(content="I'm great thank you. How can I help you?", additional_kwargs={}, example=False),
 HumanMessage(content="I'd like to understand string theory.", additional_kwargs={}, example=False),
 AIMessage(content="Sure! String theory is a theoretical framework in physics that aims to explain the fundamental nature of particles and their interactions by modeling them as tiny vibrating strings. It suggests that all particles in the universe, including the ones that make up matter and the ones that transmit forces, are ultimately composed of these tiny strings.\n\nIn string theory, the basic building blocks of the universe are not point-like particles, but rather one-dimensional strings. These strings can vibrate in different ways, giving rise to different fundamental particles. Each vibrational mode corresponds to a

Now, let's send the messages to the chat model and see what response we get.

In [7]:
res = chat(messages)

print(res.content)

Physicists believe that string theory has the potential to produce a unified theory because it aims to describe all fundamental particles and forces within a single framework. The ultimate goal is to unify all the fundamental forces of nature, including gravity, electromagnetism, the strong nuclear force, and the weak nuclear force, into a single, consistent theory.

Currently, there are multiple fundamental forces described by separate theories in physics, each with its own set of rules and equations. For example, gravity is described by Einstein's theory of general relativity, while the other forces are described by the Standard Model of particle physics.

String theory offers the possibility of unification by treating all particles as different vibrational modes of tiny strings. This means that the underlying structure of the universe is unified, and the different particles and forces arise from different vibrations of the same fundamental entities.

Furthermore, string theory natur

## Task 2: Model Augmentation by Prompts
Our chat model only knows what was in the data used to train it. If we ask about something outside the training data, the model cannot give a reliable answer. In the following example, let's ask the model what the LangChain Expression Language (LCEL) in LangChain is.

In [8]:
messages.append(res)
prompt = HumanMessage(content="Can you tell me about the LangChain Expression Language (LCEL) in LangChain?")
messages.append(prompt)

res = chat(messages)
print(res.content)

I apologize, but I'm not familiar with the specific details of the LangChain Expression Language (LCEL) in LangChain. It seems to be a domain-specific language or expression language specific to the LangChain platform. As an AI language model, my knowledge is based on general information and may not encompass the specifics of every specialized language or platform. 

If you require detailed information about the LangChain Expression Language, I would recommend referring to the official documentation or resources provided by LangChain or reaching out to their support or community for more specific information. They would be better equipped to assist you with the details of LCEL and how it is used within the LangChain platform.


One way of feeding knowledge into LLMs is called _source knowledge_. It refers to any information fed into the LLM via the prompt. We can try that with the LCEL question. Let me copy the description of LCEL from the LangChain documentation and pass it to the chat model as part of the prompt.

In [9]:
# A description of LCEL copied from the LangChain documentation
llmchain_information = [
    "LangChain Expression Language, or LCEL, is a declarative way to easily compose chains together. LCEL was designed from day 1 to support putting prototypes in production, with no code changes, from the simplest “prompt + LLM” chain to the most complex chains (we’ve seen folks successfully run LCEL chains with 100s of steps in production). To highlight a few of the reasons you might want to use LCEL:",
    "When you build your chains with LCEL you get the best possible time-to-first-token (time elapsed until the first chunk of output comes out). For some chains this means eg. we stream tokens straight from an LLM to a streaming output parser, and you get back parsed, incremental chunks of output at the same rate as the LLM provider outputs the raw tokens.",
    "For more complex chains it’s often very useful to access the results of intermediate steps even before the final output is produced. This can be used to let end-users know something is happening, or even just to debug your chain. You can stream intermediate results, and it’s available on every LangServe server"
]
len(llmchain_information)

3

In [10]:
source_knowledge = "\n".join(llmchain_information)
source_knowledge

'LangChain Expression Language, or LCEL, is a declarative way to easily compose chains together. LCEL was designed from day 1 to support putting prototypes in production, with no code changes, from the simplest “prompt + LLM” chain to the most complex chains (we’ve seen folks successfully run LCEL chains with 100s of steps in production). To highlight a few of the reasons you might want to use LCEL:\nWhen you build your chains with LCEL you get the best possible time-to-first-token (time elapsed until the first chunk of output comes out). For some chains this means eg. we stream tokens straight from an LLM to a streaming output parser, and you get back parsed, incremental chunks of output at the same rate as the LLM provider outputs the raw tokens.\nFor more complex chains it’s often very useful to access the results of intermediate steps even before the final output is produced. This can be used to let end-users know something is happening, or even just to debug your chain. You can st

We can feed this additional knowledge into our prompt with some instructions telling the LLM how we'd like it to use this information alongside our original query.

In [11]:
query = "Can you tell me about the LangChain Expression Language (LCEL) in LangChain?"

augmented_prompt = f"""Using the contexts below, answer the query. If some information is not provided within
the contexts below, do not include, and if the query cannot be answered with the below information, say "I don't know".

Contexts:
{source_knowledge}

Query: {query}"""

In [12]:
print(augmented_prompt)

Using the contexts below, answer the query. If some information is not provided within
the contexts below, do not include, and if the query cannot be answered with the below information, say "I don't know".

Contexts:
LangChain Expression Language, or LCEL, is a declarative way to easily compose chains together. LCEL was designed from day 1 to support putting prototypes in production, with no code changes, from the simplest “prompt + LLM” chain to the most complex chains (we’ve seen folks successfully run LCEL chains with 100s of steps in production). To highlight a few of the reasons you might want to use LCEL:
When you build your chains with LCEL you get the best possible time-to-first-token (time elapsed until the first chunk of output comes out). For some chains this means eg. we stream tokens straight from an LLM to a streaming output parser, and you get back parsed, incremental chunks of output at the same rate as the LLM provider outputs the raw tokens.
For more complex chains i

In [13]:
print(messages[-1])

content='Can you tell me about the LangChain Expression Language (LCEL) in LangChain?' additional_kwargs={} example=False


In [14]:
messages[-1] = HumanMessage(content=augmented_prompt)

In [15]:
res = chat(messages)
print(res.content)

The LangChain Expression Language (LCEL) is a declarative way to compose chains together in LangChain. It was designed to support putting prototypes into production without any code changes. LCEL allows you to build chains with the best possible time-to-first-token, meaning you can receive incremental chunks of output at the same rate as the raw tokens are generated by the Language Model (LLM) provider. It also enables accessing the results of intermediate steps before the final output is produced, which can be useful for informing end-users about ongoing processes or for debugging purposes. Additionally, LCEL supports streaming intermediate results and is available on every LangServe server.


It is clear that with external knowledge (source knowledge), the answer was much more relevant to our question. However, this method only works when we need to feed in a small volume of knowledge and we already know that this knowledge is relevant to our question. What if we have a large volume of knowledge to augment the model with, and we need the parts relevant to each query to be retrieved and passed to the model automatically? We will use a vector database to achieve this.
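
To make the retrieval idea concrete, here is a minimal in-memory sketch of similarity search, assuming cosine similarity as the ranking metric. The three-dimensional vectors are made up for illustration (real embeddings have many more dimensions), and `cosine_similarity` and `top_k` are hypothetical helper names, not part of any library used in this notebook.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, records, k=2):
    """Return the k texts whose vectors are most similar to query_vec."""
    ranked = sorted(
        records,
        key=lambda r: cosine_similarity(query_vec, r["vector"]),
        reverse=True,
    )
    return [r["text"] for r in ranked[:k]]

# Toy 3-dimensional "embeddings", invented for this example.
records = [
    {"text": "LCEL composes chains declaratively.", "vector": [0.9, 0.1, 0.0]},
    {"text": "String theory models particles as strings.", "vector": [0.0, 0.2, 0.9]},
    {"text": "LCEL supports streaming output.", "vector": [0.8, 0.3, 0.1]},
]
query_vec = [1.0, 0.2, 0.0]
retrieved = top_k(query_vec, records, k=2)
print(retrieved)
```

A vector database performs the same kind of ranking, but over far more vectors and with indexing that avoids comparing the query against every record.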

## Task 3: Importing the Data

In this task, Pinecone will be used to store and retrieve information about the Llama 2 model. The dataset contains papers published on arXiv, which will serve as the external knowledge base for our chatbot. It is loaded with the Hugging Face Datasets library from [the `"jamescalam/llama-2-arxiv-papers-chunked"` dataset](https://huggingface.co/datasets/jamescalam/llama-2-arxiv-papers-chunked). Each entry in the dataset represents a "chunk" of text from these papers.
The RAG strategy shown here is especially useful in scientific applications, since augmenting a knowledge base with publications is common practice in scientific discovery and research.

In [16]:
# import the dataset and load the dataset
from datasets import load_dataset

data = load_dataset("jamescalam/llama-2-arxiv-papers-chunked", split="train")
data

Dataset({
    features: ['doi', 'chunk-id', 'chunk', 'id', 'title', 'summary', 'source', 'authors', 'categories', 'comment', 'journal_ref', 'primary_category', 'published', 'updated', 'references'],
    num_rows: 4838
})

## Task 4: Building the Pinecone Knowledge Base

We now have a dataset that can serve as our chatbot knowledge base. In this task, the text content of each chunk is converted to a vector by OpenAI's `text-embedding-ada-002` embedding model and inserted into the knowledge base hosted in the Pinecone vector database for our chatbot to use.

### Workflow

The process to set up the knowledge base is described by the following steps:
- Initialize your connection to the Pinecone vector DB.
- Create an index using the dimensionality of `text-embedding-ada-002` embedding model, which is 1536.
- Initialize OpenAI's `text-embedding-ada-002` model with LangChain.
- Populate the index with records (in this case from the Llama 2 dataset).

In [17]:
import os
import pinecone
import time

# connect to the Pinecone vector database using the API key and environment name
pinecone.init(
    api_key=os.environ["PINECONE_API_KEY"],
    environment=os.environ["PINECONE_ENVIRONMENT"]
)

# define a unique name for the index
index_name = "llama-2-rag"

# create the index if it does not exist, then poll its status until it is ready
if index_name not in pinecone.list_indexes():
    pinecone.create_index(
        index_name, dimension=1536, metric="cosine"
    )
    while not pinecone.describe_index(index_name).status["ready"]:
        time.sleep(1)
        
# get the index object from the created database and check its status
index = pinecone.Index(index_name)

index.describe_index_stats()

{'dimension': 1536,
 'index_fullness': 0.04838,
 'namespaces': {'': {'vector_count': 4838}},
 'total_vector_count': 4838}

The vector index has been created. Now we will load the data into it:
* we first instantiate the embedding model
* we then convert the dataset to a pandas DataFrame using the `to_pandas` method
* we then load the rows of the DataFrame into the vector database in batches of 100 rows
  + here `i` marks the start of each batch, and the rows from `i` up to (but not including) `i_end` are processed and inserted
  + each text chunk is inserted with a unique id, an embedding vector, and a metadata dictionary
* the "text" field in the metadata will be used to retrieve the original text content for vectors similar to the query vector
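
The batch boundaries can be checked in isolation. The sketch below reproduces only the index arithmetic of the loading loop for a dataset of 4838 rows and a batch size of 100; `batch_bounds` is a hypothetical helper name introduced here, and no embedding or upsert call is made.

```python
def batch_bounds(n_rows, batch_size):
    """Yield (start, end) index pairs covering range(n_rows) in batches."""
    for start in range(0, n_rows, batch_size):
        # The last batch may be shorter than batch_size.
        yield start, min(start + batch_size, n_rows)

bounds = list(batch_bounds(4838, 100))
print(len(bounds))   # number of batches
print(bounds[0])     # first batch
print(bounds[-1])    # last, shorter batch
```

With these inputs there are 49 batches, which matches the 49 iterations reported by the progress bar; the final batch covers rows 4800 to 4837.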

In [18]:
from langchain.embeddings.openai import OpenAIEmbeddings

embed_model = OpenAIEmbeddings(model="text-embedding-ada-002")

In [19]:
from tqdm import tqdm

data = data.to_pandas()

batch_size = 100

for i in tqdm(range(0, len(data), batch_size)):
    i_end = min(i+batch_size, len(data))
    batch = data.iloc[i:i_end]
    ids = [f"{x['doi']}-{x['chunk-id']}" for _, x in batch.iterrows()]
    texts = [x["chunk"] for _, x in batch.iterrows()]
    embeds = embed_model.embed_documents(texts)
    metadata = [
        {"text": x["chunk"],
         "title": x["title"],
         "source": x["source"]} for _, x in  batch.iterrows()
    ]
    # [(id1, embed1, metadata1), (id2, embed2, metadata2), ...]
    index.upsert(vectors=zip(ids, embeds, metadata))

100%|██████████| 49/49 [01:11<00:00,  1.47s/it]


We can check that the vector index has been populated using `describe_index_stats` like before:

In [20]:
index.describe_index_stats()

{'dimension': 1536,
 'index_fullness': 0.04838,
 'namespaces': {'': {'vector_count': 4838}},
 'total_vector_count': 4838}

## Task 5: Retrieval Augmented Generation

After building the knowledge base, we will connect it to our chatbot. In this step, we will use the content retrieved from the vector database as the context that will be included in our prompt, as demonstrated in Task 2.

### Workflow

* Create a LangChain `vectorstore` object using our `index` and `embed_model`.
* Query the database to retrieve the information about Llama 2 that is relevant to our query.
* Augment the query prompt by including the retrieved information as context. This is done by the `augment_prompt` function.
* Ask the chatbot Llama 2 questions using the prompts/queries with and without RAG, and compare the differences.
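
The workflow above can be sketched with the retrieval step passed in as a function, which makes the prompt-building logic easy to try without a vector database. Everything here is an assumption for illustration: `build_rag_prompt` and `stub_retrieve` are hypothetical names, the stub returns fixed placeholder strings instead of querying Pinecone, and the prompt wording is shortened.

```python
def build_rag_prompt(query, retrieve, k=3):
    """Retrieve k context strings for the query and embed them in a prompt."""
    contexts = retrieve(query, k)
    source_knowledge = "\n".join(contexts)
    return (
        "Using the contexts below, answer the query.\n\n"
        f"Contexts:\n{source_knowledge}\n\n"
        f"Query: {query}"
    )

# Hypothetical stub standing in for a vector-store lookup.
def stub_retrieve(query, k):
    corpus = ["context A", "context B", "context C", "context D"]
    return corpus[:k]

rag_prompt = build_rag_prompt("What is so special about Llama 2?", stub_retrieve, k=2)
print(rag_prompt)
```

In the notebook itself the retrieval step is a similarity search against the Pinecone index; the stub only shows where that call plugs in.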

In [21]:
from langchain.vectorstores import Pinecone

text_field = "text"

# build a vectorstore from the Pinecone index, the query-embedding function,
# and the metadata key whose value is returned as the text of each
# retrieved document
vectorstore = Pinecone(
    index, embed_model.embed_query, text_field
)

Now, query the vector store and retrieve information about the Llama 2 model.

In [22]:
query = "What is so special about Llama 2?"

vectorstore.similarity_search(query, k=3)

[Document(page_content='Alan Schelten Ruan Silva Eric Michael Smith Ranjan Subramanian Xiaoqing Ellen Tan Binh Tang\nRoss Taylor Adina Williams Jian Xiang Kuan Puxin Xu Zheng Yan Iliyan Zarov Yuchen Zhang\nAngela Fan Melanie Kambadur Sharan Narang Aurelien Rodriguez Robert Stojnic\nSergey Edunov Thomas Scialom\x03\nGenAI, Meta\nAbstract\nIn this work, we develop and release Llama 2, a collection of pretrained and ﬁne-tuned\nlarge language models (LLMs) ranging in scale from 7 billion to 70 billion parameters.\nOur ﬁne-tuned LLMs, called L/l.sc/a.sc/m.sc/a.sc /two.taboldstyle-C/h.sc/a.sc/t.sc , are optimized for dialogue use cases. Our\nmodels outperform open-source chat models on most benchmarks we tested, and based on\nourhumanevaluationsforhelpfulnessandsafety,maybeasuitablesubstituteforclosedsource models. We provide a detailed description of our approach to ﬁne-tuning and safety', metadata={'source': 'http://arxiv.org/pdf/2307.09288', 'title': 'Llama 2: Open Foundation and Fine-Tun

These chunks are the ones most similar to our query. We can then augment our query by using the retrieved content as its context, as we did in Task 2.

In [23]:
def augment_prompt(query: str):
    results = vectorstore.similarity_search(query, k=3)
    source_knowledge = "\n".join([x.page_content for x in results])
    augmented_prompt = f"""Using the contexts below, answer the query. If some information is not provided within
the contexts below, do not include, and if the query cannot be answered with the below information, say "I don't know".

Contexts:
{source_knowledge}

Query: {query}"""
    return augmented_prompt

Using this we produce an augmented prompt:

In [24]:
print(augment_prompt(query))

Using the contexts below, answer the query. If some information is not provided within
the contexts below, do not include, and if the query cannot be answered with the below information, say "I don't know".

Contexts:
Alan Schelten Ruan Silva Eric Michael Smith Ranjan Subramanian Xiaoqing Ellen Tan Binh Tang
Ross Taylor Adina Williams Jian Xiang Kuan Puxin Xu Zheng Yan Iliyan Zarov Yuchen Zhang
Angela Fan Melanie Kambadur Sharan Narang Aurelien Rodriguez Robert Stojnic
Sergey Edunov Thomas Scialom
GenAI, Meta
Abstract
In this work, we develop and release Llama 2, a collection of pretrained and ﬁne-tuned
large language models (LLMs) ranging in scale from 7 billion to 70 billion parameters.
Our ﬁne-tuned LLMs, called L/l.sc/a.sc/m.sc/a.sc /two.taboldstyle-C/h.sc/a.sc/t.sc , are optimized for dialogue use cases. Our
models outperform open-source chat models on most benchmarks we tested, and based on
ourhumanevaluationsforhelpfulnessandsafety,maybeasuitablesubstituteforclosedsource models

Now, let's pass the augmented prompt to the chat model and see how it performs.

In [25]:
prompt = HumanMessage(content=augment_prompt(query))

messages.append(prompt)
res = chat(messages)
print(res.content)

Based on the provided contexts, it is mentioned that Llama 2 is a collection of pretrained and fine-tuned large language models (LLMs) ranging in scale from 7 billion to 70 billion parameters. The models, L/l.sc/a.sc/m.sc/a.sc /two.taboldstyle and L/l.sc/a.sc/m.sc/a.sc /two.taboldstyle-C/h.sc/a.sc/t.sc, have been optimized for dialogue use cases. The Llama 2 models outperform open-source chat models on most benchmarks tested and are considered as a suitable substitute for closed-source models based on humane evaluations for helpfulness and safety. However, further details about what specifically makes Llama 2 special are not provided in the given information.


Let's ask a follow-up question _without_ RAG:

In [26]:
prompt = HumanMessage(
    content="What safety measures were used in the development of llama 2?"
)

res = chat(messages + [prompt])
print(res.content)

I'm sorry, but I don't have any information about the safety measures used in the development of Llama 2 based on the provided contexts.


Because the previous augmented prompt is still part of the conversation history stored in `messages`, the model has some knowledge of Llama 2 and refers to "the provided contexts" in its reply. However, those contexts say nothing about safety measures, because we have not retrieved that information through the RAG pipeline for this question. Let's try again, this time with RAG.

In [27]:
prompt = HumanMessage(
    content=augment_prompt("What safety measures were used in the development of llama 2?")
)

res = chat(messages + [prompt])
print(res.content)

Based on the provided information, the safety measures used in the development of Llama 2 include safety-specific data annotation and tuning, conducting red-teaming, employing iterative evaluations, and contributing to improving LLM safety. These measures were taken to increase the safety of the models and ensure responsible development. Unfortunately, the specific details of these safety measures are not mentioned in the given contexts.


We get a better informed response.
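
Note that the last two cells call the chat model with `messages + [prompt]` rather than appending to `messages`. The sketch below uses placeholder strings in place of message objects to show the difference: list concatenation builds a new list for a one-off call and leaves the stored history unchanged, while `append` modifies the history in place.

```python
# Placeholder strings standing in for message objects.
history = ["system", "human-1", "ai-1"]
follow_up = "human-2"

# Concatenation creates a new list; history is not modified.
one_off = history + [follow_up]
len_after_concat = len(history)
print(len_after_concat, len(one_off))

# append modifies history in place, so later calls would include the follow-up.
history.append(follow_up)
print(len(history))
```

This is why the two safety questions do not accumulate in the conversation history.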

## Summary

In this notebook, Pinecone was used as the vector database to retrieve external information relevant to each question. The retrieved information was included in the prompt as context. Comparing the responses with and without augmentation showed that the augmented prompts produced more relevant answers, thanks to the extra information provided by the contexts.