<a href="https://colab.research.google.com/github/sljm12/machine_learning_notebooks/blob/master/langchain/T5_Q%26A_with_Chroma.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Notebook for doing Q&A over a set of PDFs using LangChain, HuggingFace pipelines and ChromaDB.

In [1]:
!pip install transformers sentencepiece chromadb langchain pypdf sentence_transformers tqdm > /dev/null

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
google-colab 1.0.0 requires requests==2.27.1, but you have requests 2.31.0 which is incompatible.

In [2]:
!pip install accelerate bitsandbytes > /dev/null

# Select the HuggingFace model that you want to load

In [3]:
# model_name = "MaRiOrOsSi/t5-base-finetuned-question-answering"  # The results from this model were poor
# model_name = "google/flan-t5-xl"  # This could not be loaded, even with 8-bit quantization enabled
model_name = "google/flan-t5-large"  # This loads even without 8-bit quantization

In [4]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained(model_name)

model = AutoModelForSeq2SeqLM.from_pretrained(
    model_name,
    device_map="auto",
    # Optional memory-saving settings, left disabled as in the original run:
    # load_in_8bit=True,
    # torch_dtype=torch.float16,  # needs `import torch`
    # low_cpu_mem_usage=True,
)


In [5]:
from transformers import pipeline
from langchain.llms import HuggingFacePipeline
import torch

pipe = pipeline(
    "text2text-generation",
    model=model,
    tokenizer=tokenizer,
    max_length=512,
    temperature=0,
    top_p=0.95,
    repetition_penalty=1.15
)

local_llm = HuggingFacePipeline(pipeline=pipe)
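
As a rough illustration of what the sampling-related parameters above control, here is a minimal sketch in plain Python. It is not the transformers implementation, and the function names and toy logits are made up for this example. Since the pipeline above does not set `do_sample`, I would expect generation to stay greedy, in which case `temperature` and `top_p` should not change the output.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities; a lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    largest = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - largest) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p):
    """Return indices of the smallest set of most likely tokens whose total mass reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    return sorted(kept)


def greedy_choice(logits):
    """Greedy decoding simply picks the highest-scoring token."""
    return max(range(len(logits)), key=lambda i: logits[i])


# Hypothetical scores for three candidate tokens
toy_logits = [2.0, 1.0, 0.1]
toy_probs = softmax_with_temperature(toy_logits, temperature=1.0)
print(greedy_choice(toy_logits), top_p_filter(toy_probs, top_p=0.9))
```

If you want varied answers, the usual route is to enable sampling in the pipeline and use a temperature above zero; for factual Q&A, greedy decoding is probably the safer choice.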

## Testing the LLM

In [6]:
print(local_llm('What is the capital of England?'))

london


# Loading PDFs to the VectorDB

This uses LangChain's DirectoryLoader to read a set of PDFs, with PyPDFLoader specified as the loader for each file. You can change this to different loaders and directories.

The current code simply reads every PDF file under /content.

For a list of loaders you can check out https://python.langchain.com/docs/modules/data_connection/document_loaders.html
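
Before loading, it can help to check which files the glob pattern will actually match. The sketch below uses only `pathlib` and the same `**/*.pdf` pattern as the loader cell; `find_pdfs` is a helper name I made up for illustration, not part of LangChain.

```python
from pathlib import Path


def find_pdfs(root):
    """Return every PDF under root (recursively), sorted for stable output."""
    return sorted(p for p in Path(root).glob("**/*.pdf") if p.is_file())


# Example usage in Colab:
# for path in find_pdfs("/content"):
#     print(path)
```

If the list is empty or contains unexpected files, adjust the directory or the pattern before running the loader.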

The examples in this notebook use PDFs exported from Wikipedia via "Tools" -> "Export as PDF" on each Wikipedia page.

The following are the pages that I am using in the examples below.
* https://en.wikipedia.org/wiki/General_Dynamics_F-16_Fighting_Falcon
* https://en.wikipedia.org/wiki/Mikoyan_MiG-29
* https://en.wikipedia.org/wiki/Barack_Obama
* https://en.wikipedia.org/wiki/Lee_Kuan_Yew




In [7]:
from langchain.document_loaders import PyPDFLoader, DirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

In [8]:
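loader = DirectoryLoader('/content/', glob="**/*.pdf", show_progress=True, loader_cls=PyPDFLoader)
# load_and_split() loads the PDFs and already applies a default text splitter;
# the splitter configured further down then re-splits the result into smaller, overlapping chunks.
documents = loader.load_and_split()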

100%|██████████| 4/4 [00:55<00:00, 13.96s/it]


## Splitting the text and specifying the overlap

In [9]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=200)
texts = text_splitter.split_documents(documents)
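
To make `chunk_size` and `chunk_overlap` concrete, here is a simplified sliding-window splitter in plain Python. This is only a sketch: RecursiveCharacterTextSplitter also tries to break on separators such as paragraphs and sentences, which this version ignores, and `sliding_chunks` is a hypothetical name.

```python
def sliding_chunks(text, chunk_size=512, chunk_overlap=200):
    """Cut text into fixed-size windows where each window repeats the tail of the previous one."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


sample = "".join(chr(97 + i % 26) for i in range(1000))
pieces = sliding_chunks(sample)
print(len(pieces), [len(p) for p in pieces])
```

A larger overlap makes it less likely that a fact is cut in half at a chunk boundary, at the cost of storing more, partly duplicated, text in the vector DB.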

## Converting the documents to embeddings and loading into ChromaDB

https://python.langchain.com/docs/modules/data_connection/vectorstores/integrations/chroma


In [10]:
from langchain.embeddings import HuggingFaceEmbeddings
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")


In [11]:
from langchain.vectorstores import Chroma
db = Chroma.from_documents(texts, embeddings, persist_directory="db")

In [12]:
#Make sure the db is saved to disk
db.persist()

### Saving the vectors to a tar

In [16]:
"""
Workaround for a Colab error raised when running shell commands:
"NotImplementedError: A UTF-8 locale is required. Got ANSI_X3.4-1968"
https://github.com/googlecolab/colabtools/issues/3409
"""

import locale
locale.getpreferredencoding = lambda: "UTF-8"

In [19]:
!tar zcvf chromadb.tar.gz "/content/db"

tar: Removing leading `/' from member names
/content/db/
/content/db/chroma-embeddings.parquet
/content/db/chroma-collections.parquet
/content/db/index/
/content/db/index/index_3bdb0018-07f9-4df8-b94c-789a077edd82.bin
/content/db/index/uuid_to_id_3bdb0018-07f9-4df8-b94c-789a077edd82.pkl
/content/db/index/index_metadata_3bdb0018-07f9-4df8-b94c-789a077edd82.pkl
/content/db/index/id_to_uuid_3bdb0018-07f9-4df8-b94c-789a077edd82.pkl
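
If you prefer to stay in Python instead of shelling out to `tar` (which also sidesteps the locale issue for this step), the standard-library `tarfile` module can do the same job. This is a sketch with made-up helper names; note that it stores the folder as `db/` inside the archive rather than `content/db/`.

```python
import tarfile
from pathlib import Path


def archive_dir(src_dir, archive_path):
    """Write src_dir into a gzip-compressed tar, keeping only the folder name as the prefix."""
    src = Path(src_dir)
    with tarfile.open(archive_path, "w:gz") as tar:
        tar.add(src, arcname=src.name)


def restore_dir(archive_path, dest_dir):
    """Unpack the archive into dest_dir. Only do this with archives you created yourself."""
    Path(dest_dir).mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as tar:
        tar.extractall(dest_dir)


# Example usage in Colab:
# archive_dir("/content/db", "chromadb.tar.gz")
# restore_dir("chromadb.tar.gz", "/content")
```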


### Reload from directory

In [21]:
from langchain.vectorstores import Chroma
db = Chroma(persist_directory="/content/db", embedding_function=embeddings)

In [22]:
retriever = db.as_retriever(search_kwargs={"k": 3})

## Doing a similarity search and looking at the results

In [14]:
ans=db.similarity_search("Where did Barack Obama study?")

### Checking the output to see if the Vector DB works

In [15]:
len(ans)

4
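
Conceptually, a similarity search embeds the query and returns the stored chunks whose vectors are closest to it. The toy sketch below shows that idea with hand-written 2-D vectors and cosine similarity. The metric Chroma actually applies depends on its configuration, and the names and vectors here are invented for illustration. The call above returned four results without an explicit `k`, so the sketch uses `k=4` as its default.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction, 0.0 means orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=4):
    """Return the ids of the k stored vectors most similar to the query, best first."""
    scored = [(cosine_similarity(query_vec, vec), doc_id) for doc_id, vec in doc_vecs.items()]
    scored.sort(reverse=True)
    return [doc_id for _, doc_id in scored[:k]]


# Hypothetical 2-D "embeddings" for three chunks
toy_vectors = {"chunk_a": [1.0, 0.0], "chunk_b": [0.9, 0.1], "chunk_c": [0.0, 1.0]}
print(top_k([1.0, 0.0], toy_vectors, k=2))
```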

In [16]:
for doc in ans:
  print(doc.page_content + '\n')

unde rgraduate degree in econom ics in Hawaii, graduating in June 1962. He left to attend graduate school
on a scholarship at Harvard University, where he earned an M.A. in econom ics. Obama's parents divorced
in March 1964.[24] Obama Sr. returned to Kenya in 1964, where he married for a third time and worked for
the Kenyan gove rnment as the Senior Econom ic Analyst in the Ministry of Finance.[25] He visited his son

participate in the disinvestment from South Africa in response to that nation's policy of apartheid.[45] In
mid-1981, Obama traveled to Indone sia to visit his mother and half-sister Maya, and visited the families of
college friends in Pakistan for three weeks.[45] Later in 1981, he transferred to Columbia University in
New York City as a junior, where he majored in political science with a specialty in international

five weeks in Kenya, where he met many of  his paternal relatives for the first time.[57][58]
Despite being offered a full scholarship to Northwestern
Unive

# Setting up langchain's RetrievalQA

In [17]:
from langchain.chains import RetrievalQA
qa_chain = RetrievalQA.from_chain_type(llm=local_llm,
                                  chain_type="stuff",
                                  retriever=retriever,
                                  return_source_documents=True)
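
The "stuff" chain type takes all retrieved chunks and puts them into a single prompt together with the question. The sketch below shows the general shape of that step. The template wording and the function name are my own assumptions for illustration, not LangChain's actual prompt.

```python
def build_stuff_prompt(question, chunks):
    """Join all retrieved chunks into one context block and append the question."""
    context = "\n\n".join(chunks)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Hypothetical retrieved chunks
example_chunks = ["The F-16 has a wingspan of 32 ft 8 in.", "The MiG-29 is a twin-engine fighter."]
print(build_stuff_prompt("What is the wingspan of the F-16?", example_chunks))
```

Because everything goes into one prompt, the number and size of the retrieved chunks are bounded by how much input the model can handle, which is one reason to keep `k` and `chunk_size` small with a model like this.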

In [56]:
def print_results(ans, sources=False):
  query =  ans["query"]
  answer = ans["result"]
  print(f"Query: {query}")
  print(f"Answer: {answer}")
  print("\n")
  if sources:
    for i in ans["source_documents"]:
      print("Source: "+i.metadata["source"] + ", Page: "+str(i.metadata["page"]))
      print("Text: "+ i.page_content+"\n")


In [57]:
def ask(query, sources = False):
  ans = qa_chain(query)
  print_results(ans, sources)

# Q&A

## What are the countries using the F-16

In [66]:
ask("What are the countries using the F-16?")
# Didn't pick up on the plural "countries" and returned only one

Query: What are the countries using the F-16?
Answer: Slovakia




In [70]:
ask("What are the top countries using the F-16?")  # Still didn't pick up all of them

Query: What are the top countries using the F-16?
Answer: United Kingdom , the Netherlands, Belgium and Venezuela




In [68]:
ask("What are the top 5 countries using the F-16?")  # Asked for the top 5, but got a long list instead

Query: What are the top 5 countries using the F-16?
Answer: Bahrain Belgium Chile Denmark Egypt Greece Indonesia Iraq Israel Jordan Morocco Netherlands Oman Pakistan Poland Portugal Romania Singapore South Korea Taiwan Thailand Turkey United Arab Emirates United States Venezuela




## Discriminating between F-16 and MiG-29 information

In [60]:
ask("What is the wingspan of the F-16?")

Query: What is the wingspan of the F-16?
Answer: 32 ft 8 in




In [64]:
ask("What is the wingspan of the F-16 in meters?")
# Managed to pick up the correct unit

Query: What is the wingspan of the F-16 in meters?
Answer: 9.96




In [63]:
ask("What is the wingspan of the MiG29 in meters?")
# The Wikipedia page mostly uses the form "MiG-29"; "MiG29" appears only once, but the query is still answered

Query: What is the wingspan of the MiG29 in meters?
Answer: 11.36 m




## Barack Obama

In [71]:
ask("When did Barack Obama become President?")



Query: When did Barack Obama become President?
Answer: 2009




In [72]:
ask("When did Barack Obama become a senator?")

Query: When did Barack Obama become a senator?
Answer: 2005




In [73]:
ask("Where was Barack Obama born?")

Query: Where was Barack Obama born?
Answer: Honolulu, Hawaii




In [74]:
ask("Which state was he a senator of?")

Query: Which state was he a senator of?
Answer: Illinois




## Lee Kuan Yew

In [87]:
ask("Where did Lee Kuan Yew do his degree?")

Query: Where did Lee Kuan Yew do his degree?
Answer: Fitzwilliam College, Cambridge




In [89]:
ask("What did Lee Kuan Yew study?")

Query: What did Lee Kuan Yew study?
Answer: China, the United States and the World




In [88]:
ask("What is his course of study in Fitzwilliam College, Cambridge?")

Query: What is his course of study in Fitzwilliam College, Cambridge?
Answer: law




In [93]:
ask("When did he become prime minister?")

Query: When did he become prime minister?
Answer: 1959




In [85]:
ask("When did Singapore separate from Malaysia?")  # Correct spelling matters: the misspelled query below gets a different answer
ask("When did Singapore seperate from Malaysia?")

Query: When did Singapore separate from Malaysia?
Answer: 1965


Query: When did Singapore seperate from Malaysia?
Answer: 16 September




## Who will win?

In [92]:
ask("Who will win in a boxing match between Lee Kuan Yew and Barack Obama?")

Query: Who will win in a boxing match between Lee Kuan Yew and Barack Obama?
Answer: Lee Kuan Yew


