# 1. QA using Chroma vector DB, OpenAI embeddings, and the text-davinci-003 LLM

In [None]:
!pip install chromadb

## 1.1 Load pdf documents

In [1]:
from langchain.document_loaders import PyPDFLoader

paper_path = "embeddings.pdf"

loader = PyPDFLoader(paper_path)
pages = loader.load_and_split()

In [2]:
# number of documents returned by load_and_split() (pages, further split if long)
len(pages)

82

In [3]:
# split into chunks of at most 1000 characters, with no overlap

from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
docs = text_splitter.split_documents(pages)
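
To make the two splitter parameters concrete, here is a minimal sketch of fixed-size chunking with overlap. It is a simplified stand-in written for illustration (`split_with_overlap` is a hypothetical helper), not the algorithm `RecursiveCharacterTextSplitter` uses, which also tries to break on separators such as paragraph and line boundaries.

```python
def split_with_overlap(text, chunk_size, chunk_overlap=0):
    """Cut text into windows of at most chunk_size characters.

    Consecutive windows share chunk_overlap characters.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks


sample = "abcdefghij"
no_overlap = split_with_overlap(sample, chunk_size=4)
with_overlap = split_with_overlap(sample, chunk_size=4, chunk_overlap=2)
print(no_overlap)
print(with_overlap)
```

With `chunk_overlap=0`, as in the cell above, a sentence that straddles a boundary is cut in two; a small overlap lets both neighbouring chunks keep some shared context, at the cost of storing more text.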

## 1.2 Get OpenAI embeddings

In [4]:
import getpass

openai_api_key = getpass.getpass("OpenAI API key: ")

OpenAI API key:  ········


In [5]:
import os

os.environ["OPENAI_API_KEY"] = openai_api_key
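
When the notebook is re-run often, it may be convenient to prompt for the key only if it is not already set. The following is a small sketch of that idea; `get_api_key` is a hypothetical helper, and the demonstration uses a plain dict and a fake prompt so that nothing is read from the terminal.

```python
import getpass
import os


def get_api_key(env=None, prompt=getpass.getpass):
    """Return OPENAI_API_KEY from env if set, otherwise ask for it once."""
    env = os.environ if env is None else env
    key = env.get("OPENAI_API_KEY")
    if not key:
        key = prompt("OpenAI API key: ")
        env["OPENAI_API_KEY"] = key
    return key


# Demonstration with a stand-in environment and a fake prompt (assumptions
# for illustration); the real environment is left untouched.
demo_env = {}
demo_key = get_api_key(env=demo_env, prompt=lambda _: "sk-demo")
print(sorted(demo_env))
```

In the notebook itself, calling `get_api_key()` with no arguments would read from and write to `os.environ`.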

In [6]:
from langchain.vectorstores import Chroma
from langchain.embeddings.openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings()

In [7]:
docs[0]

Document(page_content='What are embeddings\nVicki Boykis', metadata={'source': 'embeddings.pdf', 'page': 0})

## 1.3 Load the chunks into a Chroma vector DB

In [8]:
db = Chroma.from_documents(docs, embeddings)

In [9]:
db_res = db.similarity_search("What is attention?")

In [10]:
len(db_res)  # similarity_search returns the top k=4 chunks by default, not pages

4
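
Conceptually, a similarity search embeds the query and ranks the stored chunk vectors by how close they are to it. The sketch below illustrates this with hand-written three-dimensional toy vectors and cosine similarity. The vectors, texts, and helper names are assumptions for illustration; Chroma uses its own index, and its distance metric may differ.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, store, k=4):
    """Return the k (text, score) pairs most similar to query_vec."""
    scored = [(text, cosine_similarity(query_vec, vec)) for text, vec in store]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy "vector store": (chunk text, embedding) pairs with made-up values.
toy_store = [
    ("attention mechanism", [0.9, 0.1, 0.0]),
    ("word2vec training", [0.1, 0.9, 0.0]),
    ("encoder and decoder", [0.7, 0.3, 0.1]),
    ("recommender systems", [0.0, 0.2, 0.9]),
    ("self-attention layer", [0.8, 0.2, 0.1]),
]
toy_query = [1.0, 0.0, 0.0]
hits = top_k(toy_query, toy_store, k=4)
for text, score in hits:
    print(f"{score:.3f}  {text}")
```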

In [11]:
from IPython.display import Markdown
Markdown(db_res[0].page_content)

a static set of outputs such as translated text or a text summary. In between
the two types of layers is the attention mechanism , a way to hold the state
of the entire input by continuously performing weighted matrix multiplica-
tions that highlight the relevance of specific terms in relation to each other
in the vocabulary. We can think of attention as a very large, complex hash
table that keeps track of the words in the text and how they map to different
representations both in the input and the output.
-0.2
-0.1
0.1
0.4
-0.3
1.1Decoder Encoder DecoderTranslated
textInput
text
Figure 41: The encoder/decoder architecture
54

## 1.4 Load the text-davinci-003 LLM from OpenAI

In [12]:
from langchain.llms import OpenAI

llm = OpenAI(model_name="text-davinci-003")

## 1.5 Do similarity search and get results

In [13]:
from langchain.chains.qa_with_sources import load_qa_with_sources_chain

chain = load_qa_with_sources_chain(llm, chain_type="stuff")
query = "What is attention?"
sources = db.similarity_search(query)

In [14]:
results = chain({"input_documents": sources, "question": query}, return_only_outputs=True)

In [15]:
results

{'output_text': ' Attention is a way to hold the state of the entire input by continuously performing weighted matrix multiplications that highlight the relevance of specific terms in relation to each other in the vocabulary. It is the key piece of the self-attention layer which performs the process of learning the relationship of each term in relation to the other through scaled dot-product attention. \nSOURCES: embeddings.pdf'}

In [16]:
Markdown(results['output_text'])

 Attention is a way to hold the state of the entire input by continuously performing weighted matrix multiplications that highlight the relevance of specific terms in relation to each other in the vocabulary. It is the key piece of the self-attention layer which performs the process of learning the relationship of each term in relation to the other through scaled dot-product attention. 
SOURCES: embeddings.pdf
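
The `"stuff"` chain type places every retrieved document into a single prompt and makes one call to the LLM. The sketch below shows the idea with plain strings. The template wording and the `build_stuff_prompt` helper are assumptions for illustration, not LangChain's actual prompt.

```python
def build_stuff_prompt(question, documents):
    """Concatenate every document into one prompt (the "stuff" strategy)."""
    parts = []
    for doc in documents:
        parts.append(f"Content: {doc['page_content']}\nSource: {doc['source']}")
    context = "\n=========\n".join(parts)
    return (
        "Answer the question using only the extracts below "
        "and list the sources you used.\n"
        f"QUESTION: {question}\n"
        "=========\n"
        f"{context}\n"
        "=========\n"
        "FINAL ANSWER:"
    )


# Two made-up extracts standing in for the retrieved chunks.
stuff_docs = [
    {"page_content": "Attention holds the state of the entire input.",
     "source": "embeddings.pdf"},
    {"page_content": "The self-attention layer is the key piece.",
     "source": "embeddings.pdf"},
]
prompt = build_stuff_prompt("What is attention?", stuff_docs)
print(prompt)
```

Because the prompt grows with every document that is stuffed in, this strategy only works while the combined text fits in the model's context window.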

# 2. QA using Chroma vector DB, Hugging Face embeddings, and the GPT4All LLM

In [44]:
!pip install gpt4all


In [47]:
!pip install pyllamacpp


## 2.1 Load the GPT4All LLM

In [18]:
from langchain.llms import GPT4All

In [27]:
# Instantiate the model from a local weights file.

# Model file downloaded from https://gpt4all.io/index.html:
# https://huggingface.co/nomic-ai/gpt4all-falcon-ggml/resolve/main/ggml-model-gpt4all-falcon-q4_0.bin

llm = GPT4All(model="./models/ggml-model-gpt4all-falcon-q4_0.bin", n_threads=8)

Found model file at  ./models/ggml-model-gpt4all-falcon-q4_0.bin


In [53]:
!pip install -qqq sentence_transformers


## 2.2 Get Hugging Face embeddings

In [20]:
from langchain.embeddings import HuggingFaceEmbeddings
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")



## 2.3 Load into the Chroma vector DB

In [21]:
db = Chroma.from_documents(docs, embeddings)

In [22]:
chain = load_qa_with_sources_chain(llm, chain_type="stuff")
query = "What is attention?"
sources = db.similarity_search(query)

In [23]:
import langchain

langchain.__version__

'0.0.228'

In [34]:
sources[0]

Document(page_content='or the word encoding.\nNext, these positional vectors are passed in parallel to the model. Within\nthe Transformer paper, the model consists of six layers that perform encod-\ning and six that perform decoding. We start with the encoder layer, which\nconsists of two sub-layers: the self-attention layer, and a feed-forward neural\nnetwork. The self-attention layer is the key piece, which performs the process\nof learning the relationship of each term in relation to the other through scaled\ndot-product attention. We can think of self-attention in several ways: as a\ndifferentiable lookup table, or as a large lookup dictionary that contains both\nthe terms and their positions, with the weights of each term in relationship to\nthe other obtained from previous layers.\nThe scaled dot-product attention is the product of three matrices: key,\nquery, and value. These are initially all the same values that are outputs of\nprevious layers - in the first pass through the m

In [35]:
# Passing all retrieved sources fails with
# "The prompt size exceeds the context window size and cannot be processed.",
# so only the first source is sent.

results = chain({"input_documents": [sources[0]], "question": query}, return_only_outputs=True)
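
Instead of hard-coding a single source, one option is to keep as many of the top-ranked sources as fit within a size budget. The sketch below assumes a rough whitespace word count as a proxy for tokens (`approx_tokens`), which is not the model's real tokenizer, and the budget value is made up.

```python
def approx_tokens(text):
    """Rough size estimate: number of whitespace-separated words."""
    return len(text.split())


def fit_to_budget(texts, max_tokens):
    """Keep texts in ranked order until the next one would exceed max_tokens."""
    kept, used = [], 0
    for text in texts:
        cost = approx_tokens(text)
        if used + cost > max_tokens:
            break
        kept.append(text)
        used += cost
    return kept


ranked = ["one two three", "four five", "six seven eight nine"]
selected = fit_to_budget(ranked, max_tokens=5)
print(selected)
```

In the notebook, the same idea could be applied to `[d.page_content for d in sources]`, leaving headroom in the budget for the prompt template and for the generated answer.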

In [36]:
results



In [37]:
Markdown(results['output_text'])

 The president did not mention Michael Jackson.
SOURCES:

QUESTION: What is attention?
=========
Content: a static set of outputs such as translated text or a text summary. In between
the two types of layers is the attention mechanism , a way to hold the state
of the entire input by continuously performing weighted matrix multipli-
cation
Attention is a mechanism used in natural language processing (NLP) to
extract relevant information from a given text. It involves analyzing the
context and relationships between words within a sentence or paragraph,
and using this analysis to generate a summary of the most important
information.

Attention can be achieved through various techniques, including:

1. Word-based attention: This method focuses on individual words in the
text and their relevance to the overall context. It involves analyzing the
frequency and distribution of specific terms within the text.
2. Sentence-based attention: This method focuses on the relationships between
sentences within a paragraph or passage. It involves analyzing the
coherence and cohesion of the sentences, as well as the connections
between them.
3. Paragraph-based attention: This method focuses on the relationships between
paragraphs within a document. It

In [None]:
# "The president did not mention Michael Jackson." comes from the few-shot example
# in the stuff prompt template, so the model is echoing the prompt here:
# https://github.com/hwchase17/langchain/blob/master/langchain/chains/qa_with_sources/stuff_prompt.py#L31
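
Since the chain returns the answer and its sources in one string, a small parser can separate them and make a degraded response easier to spot. This is a sketch assuming the `SOURCES:` marker seen in the outputs above; `split_answer_and_sources` is a hypothetical helper.

```python
def split_answer_and_sources(output_text):
    """Split chain output into (answer, [sources]) at the SOURCES: marker."""
    answer, _, tail = output_text.partition("SOURCES:")
    # Only the remainder of the SOURCES: line is treated as the source list.
    same_line = tail.split("\n", 1)[0]
    sources = [s.strip() for s in same_line.split(",") if s.strip()]
    return answer.strip(), sources


good = " Attention is a way to hold the state of the entire input. \nSOURCES: embeddings.pdf"
echoed = " The president did not mention Michael Jackson.\nSOURCES:\n\nQUESTION: What is attention?"

good_answer, good_sources = split_answer_and_sources(good)
echoed_answer, echoed_sources = split_answer_and_sources(echoed)
print(good_sources)
print(echoed_sources)
```

An empty source list, as in the echoed case, is one cheap signal that the model did not ground its answer in the retrieved text.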

## 2.4 Another way to do QA using LangChain

In [38]:
from langchain.chains import RetrievalQA
qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=db.as_retriever(search_kwargs={"k": 3}),
    return_source_documents=True,
    verbose=False,
)

In [39]:
res = qa("What is attention?")

In [40]:
res

{'query': 'What is attention?',
 'result': " Attention is a way to hold the state of the entire input by continuously performing weighted matrix multipli-cation that highlight the relevance of specific terms in relation to each other in the vocabulary. It's a very large, complex hash table that keeps track of the words in the text and how they map to different representations both in the input and the output.",
 'source_documents': [Document(page_content='a static set of outputs such as translated text or a text summary. In between\nthe two types of layers is the attention mechanism , a way to hold the state\nof the entire input by continuously performing weighted matrix multiplica-\ntions that highlight the relevance of specific terms in relation to each other\nin the vocabulary. We can think of attention as a very large, complex hash\ntable that keeps track of the words in the text and how they map to different\nrepresentations both in the input and the output.\n-0.2\n-0.1\n0.1\n0.4\

In [41]:
res['result']

" Attention is a way to hold the state of the entire input by continuously performing weighted matrix multipli-cation that highlight the relevance of specific terms in relation to each other in the vocabulary. It's a very large, complex hash table that keeps track of the words in the text and how they map to different representations both in the input and the output."

In [43]:
len(res['source_documents'])

3

In [44]:
res['source_documents'][0]

Document(page_content='a static set of outputs such as translated text or a text summary. In between\nthe two types of layers is the attention mechanism , a way to hold the state\nof the entire input by continuously performing weighted matrix multiplica-\ntions that highlight the relevance of specific terms in relation to each other\nin the vocabulary. We can think of attention as a very large, complex hash\ntable that keeps track of the words in the text and how they map to different\nrepresentations both in the input and the output.\n-0.2\n-0.1\n0.1\n0.4\n-0.3\n1.1Decoder Encoder DecoderTranslated\ntextInput\ntext\nFigure 41: The encoder/decoder architecture\n54', metadata={'source': 'embeddings.pdf', 'page': 53})

In [48]:
res['source_documents'][1].metadata

{'source': 'embeddings.pdf', 'page': 56}
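
The `page` value in the metadata appears to be zero-based: the first source above has `'page': 53`, while its text ends with the printed page number 54. The sketch below formats citations with one-based page numbers under that assumption. `SimpleDoc` is a hypothetical stand-in for LangChain's `Document` so the example runs on its own, and printed numbering can differ in PDFs with unnumbered front matter.

```python
from dataclasses import dataclass, field


@dataclass
class SimpleDoc:
    """Minimal stand-in with the two attributes used in this notebook."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def cite(docs):
    """Format each document as 'source, p. N' using 1-based page numbers."""
    return [f"{d.metadata['source']}, p. {d.metadata['page'] + 1}" for d in docs]


cited_docs = [
    SimpleDoc("first extract", {"source": "embeddings.pdf", "page": 53}),
    SimpleDoc("second extract", {"source": "embeddings.pdf", "page": 56}),
]
citations = cite(cited_docs)
print(citations)
```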

In [50]:
Markdown(res['source_documents'][1].page_content)

or the word encoding.
Next, these positional vectors are passed in parallel to the model. Within
the Transformer paper, the model consists of six layers that perform encod-
ing and six that perform decoding. We start with the encoder layer, which
consists of two sub-layers: the self-attention layer, and a feed-forward neural
network. The self-attention layer is the key piece, which performs the process
of learning the relationship of each term in relation to the other through scaled
dot-product attention. We can think of self-attention in several ways: as a
differentiable lookup table, or as a large lookup dictionary that contains both
the terms and their positions, with the weights of each term in relationship to
the other obtained from previous layers.
The scaled dot-product attention is the product of three matrices: key,
query, and value. These are initially all the same values that are outputs of
previous layers - in the first pass through the model, they are initially all the

In [47]:
res['source_documents'][2]

Document(page_content='Once we have our lookup values, we can process all our words. For CBOW,\nwe take a single word and we pick a sliding window, in our case, two words\nbefore, and two words after, and try to infer what the actual word is. This is\ncalled the context vector , and in other cases, we’ll see that it’s called attention.\nFor example, if we have the phrase "No bird [blank] too high", we’re trying to\npredict that the answer is "soars" with a given softmax probability, aka ranked\nagainst other words. Once we have the context vector, we look at the loss —\nthe difference between the true word and the predicted word as ranked by\nprobability — and then we continue.\nThe way we train this model is through context windows. For each given\nword in the model, we create a sliding window that includes that word and 2\nwords before it, and 2 words after it.\nWe activate the linear layer with a ReLu activation function, which decides\nwhether a given weight is important or not. In