<a href="https://colab.research.google.com/github/yassineselmi/langchain-workshop/blob/main/lab/03_langchain_retrieval_augmentation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/pinecone-io/examples/blob/master/learn/generation/langchain/handbook/05-langchain-retrieval-augmentation.ipynb) [![Open nbviewer](https://raw.githubusercontent.com/pinecone-io/examples/master/assets/nbviewer-shield.svg)](https://nbviewer.org/github/pinecone-io/examples/blob/master/learn/generation/langchain/handbook/05-langchain-retrieval-augmentation.ipynb)

#### [LangChain Handbook](https://pinecone.io/learn/langchain)

# Retrieval Augmentation

**L**arge **L**anguage **M**odels (LLMs) have a data freshness problem. The most powerful LLMs in the world, like GPT-4, have no idea about recent world events.

The world of LLMs is frozen in time. Their world exists as a static snapshot of the world as it was within their training data.

A solution to this problem is *retrieval augmentation*. The idea behind this is that we retrieve relevant information from an external knowledge base and give that information to our LLM. In this notebook we will learn how to do that.

First we install the required libraries:

In [2]:
!pip install -qU langchain langchain_community langchain-openai tiktoken chromadb pypdf sentence-transformers

---

🚨 _Note: the above `pip install` is formatted for Jupyter notebooks. If running elsewhere you may need to drop the `!`._

---

The documents we will work with contain *a lot* of text. Our first task is therefore to find a good preprocessing method for splitting them into more concise chunks, which will later be embedded and stored in our vector database.

For this we use LangChain's `RecursiveCharacterTextSplitter` to split our text into chunks of a specified max length.
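To make the idea of recursive splitting concrete, here is a minimal sketch in plain Python. It is not LangChain's implementation: the function name `recursive_split` is hypothetical, it ignores chunk overlap, and it measures length in characters by default. It only illustrates the principle of trying coarse separators first and falling back to finer ones when a piece is still too long.

```python
def recursive_split(text, max_len, separators=("\n\n", "\n", " ", ""),
                    length_function=len):
    """Split text into pieces of at most max_len, preferring coarse separators.

    Simplified illustration only: no chunk overlap, and pieces produced by a
    recursive call are not merged with their neighbours.
    """
    if length_function(text) <= max_len or not separators:
        return [text] if text else []

    sep, finer = separators[0], separators[1:]
    # The empty separator is the last resort: fall back to single characters.
    parts = list(text) if sep == "" else text.split(sep)

    chunks, current = [], ""
    for part in parts:
        if not part:
            continue  # skip empty pieces left by consecutive separators
        candidate = part if not current else current + sep + part
        if length_function(candidate) <= max_len:
            current = candidate  # keep growing the current chunk
            continue
        if current:
            chunks.append(current)
            current = ""
        if length_function(part) <= max_len:
            current = part
        else:
            # This piece alone is too long: retry with finer separators.
            chunks.extend(recursive_split(part, max_len, finer, length_function))
    if current:
        chunks.append(current)
    return chunks


sample = "alpha beta\n\ngamma delta epsilon\nzeta"
for chunk in recursive_split(sample, max_len=12):
    print(repr(chunk))
```

Passing a token-counting function as `length_function` would make the limit apply to tokens rather than characters, which is the approach this notebook takes with the real splitter.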

In [3]:
import tiktoken

tiktoken.encoding_for_model('gpt-3.5-turbo')

<Encoding 'cl100k_base'>

In [4]:
import tiktoken

tokenizer = tiktoken.get_encoding('cl100k_base')

# create the length function
def tiktoken_len(text):
    tokens = tokenizer.encode(
        text,
        disallowed_special=()
    )
    return len(tokens)

tiktoken_len("hello I am a chunk of text and using the tiktoken_len function "
             "we can find the length of this chunk of text in tokens")

26

## Prepare the Data
We will use the `PyPDFLoader` document loader, which reads PDF files.

In [24]:
from langchain_community.document_loaders import PyPDFLoader

loader = PyPDFLoader("https://cti-commission.fr/wp-content/uploads/2017/10/enit_tunisie_decision_20170902.pdf")
pages = loader.load()

In [None]:
pages[0]

Now, we have to split the data into chunks that fit in the context window.

In [None]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=400,
    chunk_overlap=20,
    length_function=tiktoken_len,
    separators=["\n\n", "\n", " ", ""]
)


chunks = text_splitter.split_documents(pages)[:3]
chunks

In [27]:
tiktoken_len(chunks[0].page_content), tiktoken_len(chunks[1].page_content), tiktoken_len(chunks[2].page_content)

(365, 373, 180)

Using the `text_splitter` we get chunks of a much more manageable size, each below the 400-token limit we specified. Now let's take a look at embedding.

## Creating Embeddings

Building embeddings using LangChain's Hugging Face embeddings support is fairly straightforward.

In [40]:
from langchain_community.embeddings import HuggingFaceEmbeddings

# You can consult the list of models here: https://huggingface.co/spaces/mteb/leaderboard

embed_model = HuggingFaceEmbeddings(model_name="BAAI/bge-base-en-v1.5")

Now we embed some text like so:

In [41]:
texts = [
    "I'm a future engineer.",
    'G2FOSS is great! We all love open source softwares'
]

res = embed_model.embed_documents(texts)
len(res), len(res[0])

(2, 768)

In [42]:
res[0][:10]

[0.03163047879934311,
 0.05004410818219185,
 0.026233991608023643,
 -0.01612345315515995,
 0.04078533500432968,
 0.039986416697502136,
 -0.006752873305231333,
 0.03067435510456562,
 -0.030285311862826347,
 -0.02121417224407196]

From this we get *two* 768-dimensional embeddings, one for each of our two texts.
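Embeddings are useful because texts with similar meaning tend to map to nearby vectors. One common way to compare two embeddings is cosine similarity. The sketch below uses tiny made-up 3-dimensional vectors rather than real model output, and the helper names `cosine_similarity` and `top_k` are hypothetical. The metric a particular vector store uses may differ, so treat this as an illustration of the general idea only.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=3):
    """Return the indices of the k vectors most similar to query_vec."""
    scored = [(cosine_similarity(query_vec, v), i) for i, v in enumerate(doc_vecs)]
    scored.sort(reverse=True)  # highest similarity first
    return [i for _, i in scored[:k]]


# Toy vectors standing in for real embeddings (illustrative values only).
query_vec = [1.0, 0.0, 0.0]
doc_vecs = [
    [0.9, 0.1, 0.0],  # points almost the same way as the query
    [0.0, 1.0, 0.0],  # orthogonal to the query
    [0.5, 0.5, 0.0],  # somewhere in between
]
print(top_k(query_vec, doc_vecs, k=2))
```

A vector database performs essentially this ranking, but with indexing structures that avoid comparing the query against every stored vector.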

Now we move on to initializing our vector database.

## Vector Database

For this example, we will use `ChromaDB` as our vector store. This setup is fine for experimenting but is not suitable for production applications; for those, you may want to consider options such as Weaviate or Qdrant.

This is how to initialize it:

In [44]:
from langchain_community.vectorstores import Chroma

vectorstore = Chroma.from_documents(pages, embed_model)

Now, it is time to query the vectorstore for similar documents.

In [None]:
query = "How many laboratory does ENIT have?"

vectorstore.similarity_search(
    query,  # our search query
    k=3  # return 3 most relevant docs
)

All of these are good, relevant results. But what can we do with them? There are many possible tasks; one of the most interesting (and well supported by LangChain) is called _"Generative Question-Answering"_ or GQA.

## Generative Question-Answering

In GQA we treat the query as a question to be answered by an LLM, but the LLM must base its answer on the information returned from the `vectorstore`.

To do this we provide an OpenAI API key and then initialize a `RetrievalQA` object like so:

In [57]:
from getpass import getpass

OPENAI_API_KEY = getpass()

··········


In [59]:
from langchain_openai import ChatOpenAI
from langchain.chains import RetrievalQA

# completion llm
llm = ChatOpenAI(
    openai_api_key=OPENAI_API_KEY,
    model_name='gpt-3.5-turbo',
    temperature=0.0
)

qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=vectorstore.as_retriever()
)

In [60]:
qa.invoke(query)

{'query': 'How many laboratory does ENIT have?',
 'result': 'ENIT has twelve laboratories.'}
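The `"stuff"` chain type places all retrieved documents into a single prompt alongside the question. The sketch below shows the general shape of such a prompt. The function `build_stuff_prompt` and the template wording are assumptions made for illustration, not LangChain's actual template, and the passages are placeholders rather than text from the PDF.

```python
def build_stuff_prompt(question, documents):
    """Combine retrieved passages and a question into one prompt string."""
    context = "\n\n".join(documents)  # "stuff" every passage into the context
    return (
        "Use the following pieces of context to answer the question. "
        "If the answer is not in the context, say that you don't know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


# Placeholder passages standing in for the retriever's results.
retrieved = ["First retrieved passage.", "Second retrieved passage."]
print(build_stuff_prompt("What does the document say?", retrieved))
```

Because everything goes into one prompt, this approach only works while the retrieved passages fit within the model's context window, which is one reason chunk size matters.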

We can also include the sources of information that the LLM is using to answer our question. We can do this using a slightly different version of `RetrievalQA` called `RetrievalQAWithSourcesChain`:

In [61]:
from langchain.chains import RetrievalQAWithSourcesChain

qa_with_sources = RetrievalQAWithSourcesChain.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=vectorstore.as_retriever()
)

In [64]:
qa_with_sources.invoke(query)

{'question': 'How many laboratory does ENIT have?',
 'answer': 'ENIT has twelve laboratories.\n',
 'sources': 'https://cti-commission.fr/wp-content/uploads/2017/10/enit_tunisie_decision_20170902.pdf'}

Now we answer the question being asked *and* return the source of the information the LLM used.