[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/vectara/example-notebooks/blob/main/notebooks/chunking-demo.ipynb)

In [None]:
# Standard library
import urllib.request

# LangChain: QA chain, text splitters, vector stores, embeddings, chat model, loader
from langchain.chains import RetrievalQA
from langchain.text_splitter import RecursiveCharacterTextSplitter, CharacterTextSplitter
from langchain.vectorstores import Chroma, Vectara
from langchain.embeddings import OpenAIEmbeddings
from langchain.chat_models.openai import ChatOpenAI
from langchain.document_loaders.unstructured import UnstructuredFileLoader

# Setup

In [2]:
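import os
# Download the Llama 2 paper; create the target directory first so urlretrieve does not fail.
url = 'https://arxiv.org/pdf/2307.09288.pdf'
file_name = '../data/llama2.pdf'
os.makedirs(os.path.dirname(file_name), exist_ok=True)
urllib.request.urlretrieve(url, file_name)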

('../data/llama2.pdf', <http.client.HTTPMessage at 0x133168e20>)

In [3]:
# temperature=0 keeps answers as repeatable as possible across runs.
llm = ChatOpenAI(model_name='gpt-3.5-turbo-16k', temperature=0)

def _answer_with_splitter(text_splitter, doc_text, query: str):
    """Chunk the loaded documents, index the chunks in Chroma, and answer `query`."""
    docs = text_splitter.split_documents(doc_text)  # doc_text is a list of Documents
    embeddings = OpenAIEmbeddings()
    vs = Chroma.from_documents(docs, embeddings)
    # "stuff" places all retrieved chunks into a single prompt for the LLM.
    qa = RetrievalQA.from_chain_type(llm=llm, chain_type="stuff", retriever=vs.as_retriever())
    return qa.run(query)

def get_answer(doc_text, chunk_size: int, chunk_overlap: int, query: str):
    """Answer `query` using fixed-size chunking (CharacterTextSplitter)."""
    text_splitter = CharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)
    return _answer_with_splitter(text_splitter, doc_text, query)

def get_answer_recursive(doc_text, chunk_size: int, chunk_overlap: int, query: str):
    """Answer `query` using recursive chunking (RecursiveCharacterTextSplitter)."""
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)
    return _answer_with_splitter(text_splitter, doc_text, query)
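
To make the two parameters passed to both helpers concrete, here is a minimal sketch of what `chunk_size` and `chunk_overlap` mean, using a plain sliding window over characters. The function `sliding_window_chunks` is a hypothetical illustration and not part of LangChain; LangChain's splitters cut on separators rather than at exact offsets, so their chunk boundaries will differ.

```python
from typing import List


def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Cut `text` into windows of at most `chunk_size` characters.

    Consecutive windows share `chunk_overlap` characters.
    Illustrative only: this is NOT how LangChain's splitters are implemented.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must satisfy 0 <= chunk_overlap < chunk_size")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks


print(sliding_window_chunks("abcdefghij", chunk_size=4, chunk_overlap=0))
# ['abcd', 'efgh', 'ij']
print(sliding_window_chunks("abcdefghij", chunk_size=4, chunk_overlap=2))
# ['abcd', 'cdef', 'efgh', 'ghij']
```

A larger overlap repeats more text across neighbouring chunks, which increases the number of chunks to embed but reduces the chance that a sentence relevant to the query is cut in half at a boundary.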

In [4]:
loader = UnstructuredFileLoader(file_name, mode="single", strategy="fast")
doc_text = loader.load()

query1 = "what is shown in figure 16?"
query2 = "is GPT-4 better than Llama2?"

# Test fixed-size chunking with LangChain for query 1

In [5]:
query = query1

for chunk_size in [1000, 2000]:
    for chunk_overlap in [0, 100]:
        response = get_answer(doc_text, chunk_size, chunk_overlap, query)
        print(f"chunk={chunk_size}, overlap={chunk_overlap}, response={response}\n")

Created a chunk of size 1189, which is longer than the specified 1000
Created a chunk of size 1100, which is longer than the specified 1000
Created a chunk of size 1185, which is longer than the specified 1000
Created a chunk of size 1168, which is longer than the specified 1000
Created a chunk of size 1061, which is longer than the specified 1000
Created a chunk of size 1079, which is longer than the specified 1000
Created a chunk of size 1552, which is longer than the specified 1000
Created a chunk of size 1223, which is longer than the specified 1000
Created a chunk of size 1385, which is longer than the specified 1000
Created a chunk of size 1844, which is longer than the specified 1000
Created a chunk of size 1159, which is longer than the specified 1000
Created a chunk of size 1557, which is longer than the specified 1000
Created a chunk of size 2274, which is longer than the specified 1000
Created a chunk of size 1073, which is longer than the specified 1000
Created a chunk of s

chunk=1000, overlap=0, response=I'm sorry, but I don't have access to any figures or images.



Created a chunk of size 2274, which is longer than the specified 2000


chunk=1000, overlap=100, response=I'm sorry, but I don't have access to any figures or images.



Created a chunk of size 2274, which is longer than the specified 2000


chunk=2000, overlap=0, response=I'm sorry, but I don't have any information about Figure 16.

chunk=2000, overlap=100, response=I'm sorry, but I don't have any information about Figure 16.
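
The "Created a chunk of size ..., which is longer than the specified ..." warnings above are consistent with how a single-separator splitter behaves: the text is first split on one separator (for `CharacterTextSplitter` the default is a blank line, `"\n\n"`), adjacent pieces are then merged up to `chunk_size`, and a piece that is already longer than `chunk_size` cannot be cut any further. The sketch below reproduces that effect in plain Python. `merge_on_separator` is a hypothetical, simplified stand-in, it ignores overlap, and it is not LangChain's actual implementation.

```python
from typing import List


def merge_on_separator(text: str, chunk_size: int, separator: str = "\n\n") -> List[str]:
    """Split on `separator`, then greedily merge pieces up to `chunk_size`.

    A single piece longer than `chunk_size` is emitted whole, which is
    the situation the warnings describe. Simplified sketch, no overlap.
    """
    pieces = [p for p in text.split(separator) if p]
    chunks = []
    current = ""
    for piece in pieces:
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size:
            current = candidate  # still fits, keep merging
        else:
            if current:
                chunks.append(current)
            current = piece  # may exceed chunk_size on its own
    if current:
        chunks.append(current)
    return chunks


text = "aaaa\n\nbb\n\n" + "c" * 12 + "\n\ndd"
chunks = merge_on_separator(text, chunk_size=8)
print(chunks)
# ['aaaa\n\nbb', 'cccccccccccc', 'dd']
print([len(c) for c in chunks if len(c) > 8])
# [12]
```

In the run above, the 2274-character chunk exceeds both the 1000 and the 2000 limit, which suggests the PDF text contains at least one long span with no blank line inside it.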



# Test recursive chunking with LangChain for query 1

In [6]:
query = query1  # set explicitly so this cell does not depend on the previous one

for chunk_size in [1000, 2000]:
    for chunk_overlap in [0, 100]:
        response = get_answer_recursive(doc_text, chunk_size, chunk_overlap, query)
        print(f"chunk={chunk_size}, overlap={chunk_overlap}, response={response}\n")

chunk=1000, overlap=0, response=I'm sorry, but I don't have any information about Figure 16.

chunk=1000, overlap=100, response=I'm sorry, but I don't have access to any figures or images.

chunk=2000, overlap=0, response=I'm sorry, but I don't have access to any figures or images.

chunk=2000, overlap=100, response=The context does not provide any information about what is shown in Figure 16.
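
No size warnings appear in this run. A recursive splitter tries a list of separators from coarse to fine (the `RecursiveCharacterTextSplitter` defaults are `"\n\n"`, `"\n"`, `" "`, and finally `""`), so a piece that is still too long after splitting on blank lines gets split again on a finer separator. The following is a simplified sketch of that idea, assuming no overlap; `recursive_split` is a hypothetical helper and not the library's code.

```python
from typing import List, Sequence


def recursive_split(text: str, chunk_size: int,
                    separators: Sequence[str] = ("\n\n", "\n", " ", "")) -> List[str]:
    """Split `text` so that no chunk exceeds `chunk_size` characters.

    Tries each separator in order and falls back to a hard cut at
    `chunk_size` when none is left. Simplified sketch, no overlap.
    """
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: cut at fixed offsets.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    chunks = []
    current = ""
    for piece in text.split(sep):
        if not piece:
            continue
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # still fits, keep merging
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # Too long even on its own: retry with the finer separators.
            chunks.extend(recursive_split(piece, chunk_size, finer))
    if current:
        chunks.append(current)
    return chunks


text = "aaaa\n\nbb\n\ncccc dddd eeee\n\nff"
print(recursive_split(text, chunk_size=8))
# ['aaaa\n\nbb', 'cccc', 'dddd', 'eeee', 'ff']
```

Compared with the single-separator case, the long paragraph is broken at word boundaries instead of being emitted whole, so every chunk respects the limit.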



# Test Vectara with LangChain for query 1

In [7]:
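# Requires VECTARA_CUSTOMER_ID, VECTARA_CORPUS_ID and VECTARA_API_KEY in the environment.
# embedding=None: Vectara chunks and embeds the document on the server side.
vs = Vectara.from_documents(doc_text, embedding=None)
qa = RetrievalQA.from_chain_type(llm=llm, chain_type="stuff", retriever=vs.as_retriever())
qa.run(query)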

'Figure 16 shows the distribution of safety RM scores from the base model, when adding a generic preprompt, and when adding a preprompt based on the risk category with a tailored answer template. It illustrates how the addition of a preprompt with a tailored answer template helps increase safety RM scores even more compared to a generic preprompt.'

# Test fixed-size chunking with LangChain for query 2

In [8]:
query = query2

for chunk_size in [1000, 2000]:
    for chunk_overlap in [0, 100]:
        response = get_answer(doc_text, chunk_size, chunk_overlap, query)
        print(f"chunk={chunk_size}, overlap={chunk_overlap}, response={response}\n")

Created a chunk of size 1189, which is longer than the specified 1000
Created a chunk of size 1100, which is longer than the specified 1000
Created a chunk of size 1185, which is longer than the specified 1000
Created a chunk of size 1168, which is longer than the specified 1000
Created a chunk of size 1061, which is longer than the specified 1000
Created a chunk of size 1079, which is longer than the specified 1000
Created a chunk of size 1552, which is longer than the specified 1000
Created a chunk of size 1223, which is longer than the specified 1000
Created a chunk of size 1385, which is longer than the specified 1000
Created a chunk of size 1844, which is longer than the specified 1000
Created a chunk of size 1159, which is longer than the specified 1000
Created a chunk of size 1557, which is longer than the specified 1000
Created a chunk of size 2274, which is longer than the specified 1000
Created a chunk of size 1073, which is longer than the specified 1000
Created a chunk of s

chunk=1000, overlap=0, response=Yes, according to the provided information, there is still a large gap in performance between Llama 2 70B and GPT-4.



Created a chunk of size 2274, which is longer than the specified 2000


chunk=1000, overlap=100, response=Yes, according to the provided information, there is still a large gap in performance between Llama 2 70B and GPT-4.



Created a chunk of size 2274, which is longer than the specified 2000


chunk=2000, overlap=0, response=Yes, according to the provided information, there is still a large gap in performance between Llama 2 70B and GPT-4.

chunk=2000, overlap=100, response=Yes, according to the provided information, there is still a large gap in performance between Llama 2 70B and GPT-4.



# Test recursive chunking with LangChain for query 2

In [9]:
query = query2  # set explicitly so this cell does not depend on the previous one

for chunk_size in [1000, 2000]:
    for chunk_overlap in [0, 100]:
        response = get_answer_recursive(doc_text, chunk_size, chunk_overlap, query)
        print(f"chunk={chunk_size}, overlap={chunk_overlap}, response={response}\n")

chunk=1000, overlap=0, response=Based on the provided information, it is not explicitly stated whether GPT-4 is better than Llama2. However, it is mentioned that GPT-4 performs better than other non-Meta reward models, but there is still a large gap in performance between Llama 2 70B and GPT-4. Therefore, it is unclear if GPT-4 is better than Llama2.

chunk=1000, overlap=100, response=Based on the provided information, it is not explicitly stated whether GPT-4 is better than Llama2. However, it is mentioned that Llama 2 70B is close to GPT-3.5 on some benchmarks but there is a significant gap on coding benchmarks. It is also mentioned that there is still a large gap in performance between Llama 2 70B and GPT-4. Therefore, it can be inferred that GPT-4 may have better performance than Llama2, but without specific benchmark results, we cannot make a definitive conclusion.

chunk=2000, overlap=0, response=Based on the provided information, it is not explicitly stated whether GPT-4 is bett

# Test Vectara with LangChain for query 2

In [10]:
vs = Vectara.from_documents(doc_text, embedding=None)
qa = RetrievalQA.from_chain_type(llm=llm, chain_type="stuff", retriever=vs.as_retriever())
qa.run(query)

'Based on the provided information, it is stated that there is still a large gap in performance between Llama 2 70B and GPT-4. Therefore, it can be inferred that GPT-4 is considered to be better than Llama 2.'