In this notebook we will work through the following steps:

1. Load the LLaMA paper PDF using a LangChain document loader.
2. Create text chunks.
3. Create embeddings for the text chunks.
4. Save the embeddings in a vector store using Chroma.
5. Perform semantic search without using an LLM.

In [1]:
llama2_paper_path = 'LLaMA_Open_and_Efficient_Foundation_Language_Models.pdf'


#### Load the document

In [2]:
# import the LangChain pdf document loader
from langchain.document_loaders import PyPDFLoader


In [4]:
# Load and create pages
loader = PyPDFLoader(file_path=llama2_paper_path)
pages = loader.load_and_split()


In [5]:
len(pages), pages[0].metadata


(36,
 {'source': 'LLaMA_Open_and_Efficient_Foundation_Language_Models.pdf',
  'page': 0})

In [8]:
pages[0].page_content


'LLaMA: Open and Efﬁcient Foundation Language Models\nHugo Touvron∗, Thibaut Lavril∗, Gautier Izacard∗, Xavier Martinet\nMarie-Anne Lachaux, Timothee Lacroix, Baptiste Rozière, Naman Goyal\nEric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin\nEdouard Grave∗, Guillaume Lample∗\nMeta AI\nAbstract\nWe introduce LLaMA, a collection of founda-\ntion language models ranging from 7B to 65B\nparameters. We train our models on trillions\nof tokens, and show that it is possible to train\nstate-of-the-art models using publicly avail-\nable datasets exclusively, without resorting\nto proprietary and inaccessible datasets. In\nparticular, LLaMA-13B outperforms GPT-3\n(175B) on most benchmarks, and LLaMA-\n65B is competitive with the best models,\nChinchilla-70B and PaLM-540B. We release\nall our models to the research community1.\n1 Introduction\nLarge Languages Models (LLMs) trained on mas-\nsive corpora of texts have shown their ability to per-\nform new tasks from textual instructions o

In [9]:
len(pages[0].page_content)


3986

#### Max token length limitation

A single page can be very long, and its text may exceed the maximum number of input tokens an LLM accepts. To keep the size of each document under control, we split the text into chunks of a bounded length.
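
To make the idea of bounded, overlapping chunks concrete, here is a minimal sketch in plain Python. It is not how LangChain implements its splitters; the function name `chunk_by_characters` and the sample text are hypothetical and only illustrate a fixed-size sliding window.

```python
def chunk_by_characters(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into windows of at most chunk_size characters.

    Consecutive windows share chunk_overlap characters.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


# Hypothetical sample input, small enough to inspect by eye
sample_text = "abcdefghijklmnopqrstuvwxyz"
chunks = chunk_by_characters(sample_text, chunk_size=10, chunk_overlap=3)
print(chunks)
```

Each chunk repeats the last three characters of the previous one, which is what lets context carry over a chunk boundary.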

#### Creating text chunks for Document

In [10]:
from langchain.text_splitter import CharacterTextSplitter

loader = PyPDFLoader(file_path=llama2_paper_path)
documents = loader.load()


In [11]:
# Request chunks of about 1,000 characters with a 200-character overlap,
# so that context carries over from one chunk to the next.
# Note: CharacterTextSplitter cuts only at its separator (default "\n\n"),
# so a page without blank lines stays whole and can exceed chunk_size.

text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
documents = text_splitter.split_documents(documents)

print('Total number of text chunks: ',len(documents))

print('length of a single document: ',len(documents[0].page_content))


Total number of text chunks:  27
length of a single document:  4056
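
The reported length is larger than the requested `chunk_size` of 1,000. `CharacterTextSplitter` only cuts at its separator (a blank line, `"\n\n"`, by default), and text extracted from a PDF page often contains single newlines only, so a whole page can come back as one chunk. The sketch below is a simplified model of that behaviour in plain Python, not LangChain's actual code: the function name `split_on_separator` and the sample text are hypothetical, and overlap handling is left out.

```python
def split_on_separator(text: str, separator: str, chunk_size: int) -> list[str]:
    """Greedily merge separator-delimited pieces into chunks up to chunk_size.

    A piece that is itself longer than chunk_size is never cut, so it
    becomes an oversized chunk on its own.
    """
    pieces = [p for p in text.split(separator) if p]
    chunks, current = [], ""
    for piece in pieces:
        candidate = piece if not current else current + separator + piece
        if not current or len(candidate) <= chunk_size:
            current = candidate  # keep growing the current chunk
        else:
            chunks.append(current)  # current chunk is full, start a new one
            current = piece
    if current:
        chunks.append(current)
    return chunks


# Hypothetical page-like text: many lines, but no blank line anywhere
page_like = "line one\nline two\nline three\n" * 50

blank_line_chunks = split_on_separator(page_like, "\n\n", chunk_size=1000)
newline_chunks = split_on_separator(page_like, "\n", chunk_size=1000)

print([len(c) for c in blank_line_chunks])  # one chunk, longer than chunk_size
print([len(c) for c in newline_chunks])     # several chunks, each within chunk_size
```

If chunks must stay near the limit, `RecursiveCharacterTextSplitter`, which falls back to finer separators, is one option worth trying.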


In [12]:
documents[0].page_content


'LLaMA: Open and Efﬁcient Foundation Language Models\nHugo Touvron∗, Thibaut Lavril∗, Gautier Izacard∗, Xavier Martinet\nMarie-Anne Lachaux, Timothee Lacroix, Baptiste Rozière, Naman Goyal\nEric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin\nEdouard Grave∗, Guillaume Lample∗\nMeta AI\nAbstract\nWe introduce LLaMA, a collection of founda-\ntion language models ranging from 7B to 65B\nparameters. We train our models on trillions\nof tokens, and show that it is possible to train\nstate-of-the-art models using publicly avail-\nable datasets exclusively, without resorting\nto proprietary and inaccessible datasets. In\nparticular, LLaMA-13B outperforms GPT-3\n(175B) on most benchmarks, and LLaMA-\n65B is competitive with the best models,\nChinchilla-70B and PaLM-540B. We release\nall our models to the research community1.\n1 Introduction\nLarge Languages Models (LLMs) trained on mas-\nsive corpora of texts have shown their ability to per-\nform new tasks from textual instructions o

In [14]:
# Create text chunks with a token-based splitter (TokenTextSplitter uses tiktoken)

from langchain.text_splitter import TokenTextSplitter

# chunk_size is now a hard limit measured in tokens, not characters
text_splitter_token = TokenTextSplitter(chunk_size=500, chunk_overlap=0)
docs = text_splitter_token.split_documents(documents)

print('Total number of text chunks: ',len(docs))

print('length of a single document: ',len(docs[0].page_content))


Total number of text chunks:  72
length of a single document:  1837


#### Creation of Embeddings

We will use an open-source sentence-transformers model to create the embeddings.

In [16]:
from langchain.embeddings import HuggingFaceEmbeddings

embeddings = HuggingFaceEmbeddings(model_name='sentence-transformers/all-MiniLM-L6-v2')


In [17]:
# Embed the page content of every chunk in `documents`
embedded_docs = embeddings.embed_documents([doc.page_content for doc in documents])


27

In [18]:
len(embedded_docs), len(embedded_docs[0])


(27, 384)

In [19]:
# first document embedding - first 10 values of the 384-dimensional vector
embedded_docs[0][:10]


[-0.11473206430673599,
 -0.09949927777051926,
 -0.012473993003368378,
 0.030612843111157417,
 0.013117469847202301,
 0.03662969917058945,
 -0.07201222330331802,
 0.04025634378194809,
 0.03468338027596474,
 0.020480530336499214]

#### Vector Store

In [21]:
from langchain.vectorstores import Chroma

# Load the chunks into Chroma: pass the documents, the embedding function,
# and the directory in which the database is persisted

db = Chroma.from_documents(docs,
                           embedding=embeddings,
                           persist_directory='./llama-db')


In [22]:
# Save to disk for future use
db.persist()


In [None]:
# To load the persisted vector store back from disk:

# db = Chroma(persist_directory='./llama-db',
#             embedding_function=embeddings)

#### Semantic Search (without using LLM)

In [23]:
Query = 'How is LLama2 compared to GPT3?'

# 3 nearest neighbours = 3 most relevant documents
docs_result = db.similarity_search(Query,k=3)


In [24]:
len(docs_result)


3
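
A similarity search embeds the query and returns the `k` stored chunks whose vectors lie closest to it. The following sketch does this by brute force over toy 3-dimensional vectors. Cosine similarity, the helper names and the vectors are assumptions for illustration; the vector store may use a different distance metric and an approximate index internally.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float], doc_vecs: list[list[float]], k: int) -> list[tuple[float, int]]:
    """Return the k (score, index) pairs with the highest similarity to the query."""
    scored = [(cosine_similarity(query_vec, vec), idx) for idx, vec in enumerate(doc_vecs)]
    scored.sort(reverse=True)  # highest similarity first
    return scored[:k]


# Toy stand-ins for real embeddings (hypothetical values)
toy_doc_vectors = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
toy_query_vector = [1.0, 0.05, 0.0]

results = top_k(toy_query_vector, toy_doc_vectors, k=3)
for score, idx in results:
    print(f"document {idx}: similarity {score:.4f}")
```

The ranking depends only on the vectors, which is why this kind of search needs an embedding model but no LLM.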

In [25]:
# let's look into the content of the first document

print(docs_result[0].page_content)


LLaMA GPT3 OPT
Gender 70.6 62.6 65.7
Religion 79.0 73.3 68.6
Race/Color 57.0 64.7 68.6
Sexual orientation 81.0 76.2 78.6
Age 70.1 64.4 67.8
Nationality 64.2 61.6 62.9
Disability 66.7 76.7 76.7
Physical appearance 77.8 74.6 76.2
Socioeconomic status 71.5 73.8 76.2
Average 66.6 67.2 69.5
Table 12: CrowS-Pairs. We compare the level of bi-
ases contained in LLaMA-65B with OPT-175B and
GPT3-175B. Higher score indicates higher bias.
5.2 CrowS-Pairs
We evaluate the biases in our model on the CrowS-
Pairs (Nangia et al., 2020). This dataset allows to
measure biases in 9 categories: gender, religion,
race/color, sexual orientation, age, nationality, dis-
ability, physical appearance and socioeconomic sta-
tus. Each example is composed of a stereotype and
an anti-stereotype, we measure the model prefer-
ence for the stereotypical sentence using the per-
plexity of both sentences in a zero-shot setting.
Higher scores thus indicate higher bias. We com-
pare with GPT-3 and OPT-175B in Table 12.
LLa

In [26]:
print(docs_result[1].page_content)


 using open-ended
generation, or ranks the proposed answers.
•Few-shot. We provide a few examples of the
task (between 1 and 64) and a test example.
The model takes this text as input and gener-
ates the answer or ranks different options.
We compare LLaMA with other foundation mod-
els, namely the non-publicly available language
models GPT-3 (Brown et al., 2020), Gopher (Rae
et al., 2021), Chinchilla (Hoffmann et al., 2022)
and PaLM (Chowdhery et al., 2022), as well as
the open-sourced OPT models (Zhang et al., 2022),
GPT-J (Wang and Komatsuzaki, 2021), and GPT-
Neo (Black et al., 2022). In Section 4, we also
brieﬂy compare LLaMA with instruction-tuned
models such as OPT-IML (Iyer et al., 2022) and
Flan-PaLM (Chung et al., 2022).We evaluate LLaMA on free-form generation
tasks and multiple choice tasks. In the multiple
choice tasks, the objective is to select the most
appropriate completion among a set of given op-
tions, based on a provided context. We select the
completion with the hi

In [27]:
print(docs_result[2].page_content)


 humanities, STEM and social sciences. We
evaluate our models in the 5-shot setting, using the
examples provided by the benchmark, and report
results in Table 9. On this benchmark, we observe
that the LLaMA-65B is behind both Chinchilla-
70B and PaLM-540B by a few percent in average,
and across most domains. A potential explanation
is that we have used a limited amount of books
and academic papers in our pre-training data, i.e.,
ArXiv, Gutenberg and Books3, that sums up to only
177GB, while these models were trained on up to
2TB of books. This large quantity of books used
by Gopher, Chinchilla and PaLM may also explain
why Gopher outperforms GPT-3 on this benchmark,
while it is comparable on other benchmarks.
3.7 Evolution of performance during training
During training, we tracked the performance of our
models on a few question answering and common
sense benchmarks, and report them in Figure 2.
On most benchmarks, the performance improves
steadily, and correlates with the training perp

In [None]:
for i in range(3):
    print(docs_result[i].page_content)
    print('-----------------------------------------------------')
