In [1]:
from secret import HF_TOKEN
import torch
import torch.nn as nn

from transformers import (
    pipeline,
    AutoModelForCausalLM,
    AutoTokenizer, 
    BitsAndBytesConfig,
    AutoConfig,
)
from IPython.display import Markdown



### Model Loading 

In [2]:
model_id = "meta-llama/Meta-Llama-3-8B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(model_id, token=HF_TOKEN)
tokenizer.pad_token_id = tokenizer.eos_token_id

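llama3 = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    # 8-bit quantization via bitsandbytes; passing load_in_8bit directly
    # is deprecated in recent transformers releases
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    token=HF_TOKEN,
)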

#wrap the model and tokenizer in a text-generation pipeline
hf_pipeline = pipeline(
    "text-generation", 
    model = llama3,
    tokenizer= tokenizer, 
    temperature = 0.1,
    repetition_penalty = 1.2,
    pad_token_id = tokenizer.eos_token_id
)

Loading checkpoint shards: 100%|██████████| 4/4 [00:04<00:00,  1.20s/it]


In [3]:
prompt = "What is a large language model?"
out = hf_pipeline(prompt)

Markdown(out[0]['generated_text'])

What is a large language model? A large language model (LLM) refers to a type of artificial intelligence (AI) that has been trained on vast amounts of text data. This training enables the LLM to learn patterns, relationships and context within this massive dataset.

The primary goal of an LLM is to generate human-like responses or texts based on the input it receives. These inputs can be in the form of prompts, questions, or even entire sentences.

In addition to generating human-like responses, LLMS are also capable of performing various natural language processing tasks such as:

1. Language translation: The ability to translate text from one language to another.
2. Sentiment analysis: The ability to analyze the sentiment or emotional tone behind a piece of text.
3. Text classification: The ability to classify pieces of text into predefined categories or classes.
4. Named entity recognition: The ability to identify specific entities such as names, locations, organizations, etc. within unstructured text.

These capabilities make Large Language Models powerful tools for a wide range of applications including but not limited to customer service chatbots, content generation platforms, language learning systems, and more.

# RAG

In [4]:
from langchain.document_loaders import PyMuPDFLoader

#PDF loader
pdf_loader = PyMuPDFLoader("llama2_paper.pdf")

#loading data
pages = pdf_loader.load()

#observe number of pages loaded
print(f"Number of pages loaded : {len(pages)}")

Markdown(pages[10].page_content)

Number of pages loaded : 77


| Dataset | Num. of Comparisons | Avg. # Turns per Dialogue | Avg. # Tokens per Example | Avg. # Tokens in Prompt | Avg. # Tokens in Response |
|---|---|---|---|---|---|
| Anthropic Helpful | 122,387 | 3.0 | 251.5 | 17.7 | 88.4 |
| Anthropic Harmless | 43,966 | 3.0 | 152.5 | 15.7 | 46.4 |
| OpenAI Summarize | 176,625 | 1.0 | 371.1 | 336.0 | 35.1 |
| OpenAI WebGPT | 13,333 | 1.0 | 237.2 | 48.3 | 188.9 |
| StackExchange | 1,038,480 | 1.0 | 440.2 | 200.1 | 240.2 |
| Stanford SHP | 74,882 | 1.0 | 338.3 | 199.5 | 138.8 |
| Synthetic GPT-J | 33,139 | 1.0 | 123.3 | 13.0 | 110.3 |
| Meta (Safety & Helpfulness) | 1,418,091 | 3.9 | 798.5 | 31.4 | 234.1 |
| Total | 2,919,326 | 1.6 | 595.7 | 108.2 | 216.9 |

Table 6: Statistics of human preference data for reward modeling. We list both the open-source and
internally collected human preference data used for reward modeling. Note that a binary human preference
comparison contains 2 responses (chosen and rejected) sharing the same prompt (and previous dialogue).
Each example consists of a prompt (including previous dialogue if available) and a response, which is the
input of the reward model. We report the number of comparisons, the average number of turns per dialogue,
the average number of tokens per example, per prompt and per response. More details on Meta helpfulness
and safety data per batch can be found in Appendix A.3.1.
knows. This prevents cases where, for instance, the two models would have an information mismatch, which
could result in favoring hallucinations. The model architecture and hyper-parameters are identical to those
of the pretrained language models, except that the classification head for next-token prediction is replaced
with a regression head for outputting a scalar reward.
Training Objectives.
To train the reward model, we convert our collected pairwise human preference data
into a binary ranking label format (i.e., chosen & rejected) and enforce the chosen response to have a higher
score than its counterpart. We used a binary ranking loss consistent with Ouyang et al. (2022):
$$\mathcal{L}_{\text{ranking}} = -\log\left(\sigma\left(r_\theta(x, y_c) - r_\theta(x, y_r)\right)\right) \tag{1}$$
where $r_\theta(x, y)$ is the scalar score output for prompt $x$ and completion $y$ with model weights $\theta$. $y_c$ is the
preferred response that annotators choose and $y_r$ is the rejected counterpart.
Built on top of this binary ranking loss, we further modify it separately for better helpfulness and safety
reward models as follows. Given that our preference ratings is decomposed as a scale of four points (e.g.,
significantly better), as presented in Section 3.2.1, it can be useful to leverage this information to explicitly
teach the reward model to assign more discrepant scores to the generations that have more differences. To
do so, we further add a margin component in the loss:
$$\mathcal{L}_{\text{ranking}} = -\log\left(\sigma\left(r_\theta(x, y_c) - r_\theta(x, y_r) - m(r)\right)\right) \tag{2}$$
where the margin $m(r)$ is a discrete function of the preference rating. Naturally, we use a large margin
for pairs with distinct responses, and a smaller one for those with similar responses (shown in Table 27).
We found this margin component can improve Helpfulness reward model accuracy especially on samples
where two responses are more separable. More detailed ablation and analysis can be found in Table 28 in
Appendix A.3.3.
Data Composition.
We combine our newly collected data with existing open-source preference datasets
to form a larger training dataset. Initially, open-source datasets were used to bootstrap our reward models
while we were in the process of collecting preference annotation data. We note that in the context of RLHF in
this study, the role of reward signals is to learn human preference for Llama 2-Chat outputs rather than
any model outputs. However, in our experiments, we do not observe negative transfer from the open-source
preference datasets. Thus, we have decided to keep them in our data mixture, as they could enable better
generalization for the reward model and prevent reward hacking, i.e. Llama 2-Chat taking advantage of
some weaknesses of our reward, and so artificially inflating the score despite performing less well.
With training data available from different sources, we experimented with different mixing recipes for both
Helpfulness and Safety reward models to ascertain the best settings. After extensive experimentation, the
11
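
The two ranking losses quoted in the page above can be checked numerically. The following is a minimal sketch in plain Python, not the paper's training code; the function names `sigmoid` and `ranking_loss` and the example scores are illustrative assumptions.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic function sigma(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


def ranking_loss(r_chosen: float, r_rejected: float, margin: float = 0.0) -> float:
    """Binary ranking loss -log(sigma(r_chosen - r_rejected - margin)).

    With margin=0.0 this matches Eq. (1); a positive margin matches Eq. (2).
    """
    return -math.log(sigmoid(r_chosen - r_rejected - margin))


# Hypothetical reward-model scores for one preference pair.
loss_plain = ranking_loss(r_chosen=2.0, r_rejected=0.5)
loss_margin = ranking_loss(r_chosen=2.0, r_rejected=0.5, margin=1.0)

print(f"loss without margin: {loss_plain:.4f}")
print(f"loss with margin   : {loss_margin:.4f}")
```

For the same pair of scores, adding a positive margin raises the loss, so the reward model is pushed to separate the chosen and rejected scores by more than the margin.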


In [5]:
from langchain.document_loaders import WikipediaLoader

#defining wikipedia loader
wiki_loader = WikipediaLoader(query = "llama language model", load_max_docs=1)

#loading from Wikipedia based on the query
wiki_doc = wiki_loader.load()

print(f"Number of pages loaded: {len(wiki_doc)}")

#observe the retrieved page
Markdown(wiki_doc[0].page_content)

Number of pages loaded: 1


Llama (acronym for Large Language Model Meta AI, and formerly stylized as LLaMA) is a family of autoregressive large language models released by Meta AI starting in February 2023. The latest version is Llama 3 released in April 2024.
Model weights for the first version of Llama were made available to the research community under a non-commercial license, and access was granted on a case-by-case basis. Unauthorized copies of the model were shared via BitTorrent, in response, Meta AI issued DMCA takedown requests against repositories sharing the link on GitHub. Subsequent versions of Llama were made accessible outside academia and released under licenses that permitted some commercial use. Llama models are trained at different parameter sizes, typically ranging between 7B and 70B. Originally, Llama was only available as a foundation model. Starting with Llama 2, Meta AI started releasing instruction fine-tuned versions alongside foundation models.
Alongside the release of Llama 3, Meta added virtual assistant features to Facebook and WhatsApp in select regions, and a standalone website. Both services use a Llama 3 model.


== Background ==
After the release of large language models such as GPT-3, a focus of research was up-scaling models, which in some instances showed major increases in emergent capabilities. The release of ChatGPT and its surprise success caused an increase in attention to large language models.
Compared with other responses to ChatGPT, Meta's Chief AI scientist Yann LeCun stated that large language models are best for aiding with writing.


== Initial release ==
LLaMA was announced on February 24, 2023, via a blog post and a paper describing the model's training, architecture, and performance. The inference code used to run the model was publicly released under the open-source GPLv3 license. Access to the model's weights was managed by an application process, with access to be granted "on a case-by-case basis to academic researchers; those affiliated with organizations in government, civil society, and academia; and industry research laboratories around the world".
Llama was trained on only publicly available information, and was trained at various different model sizes, with the intention to make it more accessible to different hardware.
Meta AI reported the 13B parameter model performance on most NLP benchmarks exceeded that of the much larger GPT-3 (with 175B parameters), and the largest 65B model was competitive with state of the art models such as PaLM and Chinchilla.


=== Leak ===
On March 3, 2023, a torrent containing LLaMA's weights was uploaded, with a link to the torrent shared on the 4chan imageboard and subsequently spread through online AI communities. That same day, a pull request on the main LLaMA repository was opened, requesting to add the magnet link to the official documentation. On March 4, a pull request was opened to add links to HuggingFace repositories containing the model. On March 6, Meta filed takedown requests to remove the HuggingFace repositories linked in the pull request, characterizing it as "unauthorized distribution" of the model. HuggingFace complied with the requests. On March 20, Meta filed a DMCA takedown request for copyright infringement against a repository containing a script that downloaded LLaMA from a mirror, and GitHub complied the next day.
Reactions to the leak varied. Some speculated that the model would be used for malicious purposes, such as more sophisticated spam. Some have celebrated the model's accessibility, as well as the fact that smaller versions of the model can be run relatively cheaply, suggesting that this will promote the flourishing of additional research developments. Multiple commentators, such as Simon Willison, compared LLaMA to Stable Diffusion, a text-to-image model which, unlike comparably sophisticated models which preceded it, was openly distributed, leading to a rapid proliferation of associated tools, techniques, and software.


=

## Chunking/Document Splitter

In [6]:
from langchain.text_splitter import RecursiveCharacterTextSplitter
import random

recur_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 1000,
    chunk_overlap=100,
    #separators are matched literally by default (not as regex), so use "." rather than "\."
    separators = ["\n\n", "\n", ".", " ", ""],
)

#splitting the text
data_splits = recur_splitter.split_documents(pages)
print(f"Num splits: {len(data_splits)}")

#checking a random chunk
content = random.choice(data_splits).page_content
print("Content length", len(content))
print(content)


Num splits: 319
Content length 968
Not everyone who uses AI models has good intentions, and conversational AI agents could potentially be
used for nefarious purposes such as generating misinformation or retrieving information about topics like
bioterrorism or cybercrime. We have, however, made efforts to tune the models to avoid these topics and
diminish any capabilities they might have offered for those use cases.
While we attempted to reasonably balance safety with helpfulness, in some instances, our safety tuning goes
too far. Users of Llama 2-Chat may observe an overly cautious approach, with the model erring on the side
of declining certain requests or responding with too many safety details.
Users of the pretrained models need to be particularly cautious, and should take extra steps in tuning and
deployment as described in our Responsible Use Guide. §§
5.3
Responsible Release Strategy
Release Details.
We make Llama 2 available for both research and commercial use at https://ai.me
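
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a simplified sketch of fixed-size chunking with overlap. It is not the algorithm used by `RecursiveCharacterTextSplitter`, which also tries the separators in order before falling back to smaller pieces; the helper name `sliding_chunks` is hypothetical.

```python
def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into windows of chunk_size characters that share
    chunk_overlap characters with the previous window."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


sample = "abcdefghij"
chunks = sliding_chunks(sample, chunk_size=4, chunk_overlap=1)
print(chunks)  # ['abcd', 'defg', 'ghij']
```

Each chunk starts with the last character of the previous one, which is the effect `chunk_overlap` has at a larger scale: text near a boundary appears in both neighbouring chunks, so a retrieved chunk is less likely to be cut off from its immediate context.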

## Vector Stores

In [7]:
#initializing sentence transformers
from langchain.embeddings import HuggingFaceEmbeddings
model_name = "sentence-transformers/all-mpnet-base-v2"
model_kwargs = {"device":"cuda" if torch.cuda.is_available() else "cpu"}
encode_kwargs = {"normalize_embeddings":False}
hf_embeddings = HuggingFaceEmbeddings(
    model_name=model_name,
    model_kwargs=model_kwargs,
    encode_kwargs=encode_kwargs
)

In [8]:
#using Chroma DB as the vector store
from langchain.vectorstores import Chroma

#define location for data store
persist_directory = "./vector_store"

#generate and store embeddings
vectordb = Chroma.from_documents(
    documents=data_splits, 
    embedding=hf_embeddings,
    persist_directory=persist_directory
)

In [9]:
import pandas as pd
pd.set_option("display.max_colwidth", None)

#query to retrieve similar chunks
query = "How is the Llama model comparing to GPT-4?"

#retrieving similar chunks based on relevance
similar_chunks = vectordb.similarity_search_with_relevance_scores(query, k=4)

#format retrieved documents as plain text
retrieved_text = [chunk[0].page_content for chunk in similar_chunks]
relevance_score = [chunk[1] for chunk in similar_chunks]

retrieved_chunks = pd.DataFrame(
    list(zip(retrieved_text, relevance_score)),
    columns=["Retrieved Chunks", "Retrieved Score"]
)   

display(retrieved_chunks)

Unnamed: 0,Retrieved Chunks,Retrieved Score
0,"et al., 2023). For AGI Eval, we only evaluate on the English tasks and report the average.\nAs shown in Table 3, Llama 2 models outperform Llama 1 models. In particular, Llama 2 70B improves the\nresults on MMLU and BBH by ≈5 and ≈8 points, respectively, compared to Llama 1 65B. Llama 2 7B and 30B\nmodels outperform MPT models of the corresponding size on all categories besides code benchmarks. For the\nFalcon models, Llama 2 7B and 34B outperform Falcon 7B and 40B models on all categories of benchmarks.\nAdditionally, Llama 2 70B model outperforms all open-source models.\nIn addition to open-source models, we also compare Llama 2 70B results to closed-source models. As shown\nin Table 4, Llama 2 70B is close to GPT-3.5 (OpenAI, 2023) on MMLU and GSM8K, but there is a significant\ngap on coding benchmarks. Llama 2 70B results are on par or better than PaLM (540B) (Chowdhery et al.,\n2022) on almost all benchmarks. There is still a large gap in performance between Llama 2 70B and GPT-4",0.504791
1,"et al., 2020), Big Bench Hard (BBH) (3 shot) (Suzgun et al., 2022), and AGI Eval (3–5 shot) (Zhong\net al., 2023). For AGI Eval, we only evaluate on the English tasks and report the average.\nAs shown in Table 3, Llama 2 models outperform Llama 1 models. In particular, Llama 2 70B improves the\nresults on MMLU and BBH by ≈5 and ≈8 points, respectively, compared to Llama 1 65B. Llama 2 7B and 30B\nmodels outperform MPT models of the corresponding size on all categories besides code benchmarks. For the\nFalcon models, Llama 2 7B and 34B outperform Falcon 7B and 40B models on all categories of benchmarks.\nAdditionally, Llama 2 70B model outperforms all open-source models.\nIn addition to open-source models, we also compare Llama 2 70B results to closed-source models. As shown\nin Table 4, Llama 2 70B is close to GPT-3.5 (OpenAI, 2023) on MMLU and GSM8K, but there is a significant\ngap on coding benchmarks. Llama 2 70B results are on par or better than PaLM (540B) (Chowdhery et al.,",0.39092
2,"et al., 2020), Big Bench Hard (BBH) (3 shot) (Suzgun et al., 2022), and AGI Eval (3–5 shot) (Zhong\net al., 2023). For AGI Eval, we only evaluate on the English tasks and report the average.\nAs shown in Table 3, Llama 2 models outperform Llama 1 models. In particular, Llama 2 70B improves the\nresults on MMLU and BBH by ≈5 and ≈8 points, respectively, compared to Llama 1 65B. Llama 2 7B and 30B\nmodels outperform MPT models of the corresponding size on all categories besides code benchmarks. For the\nFalcon models, Llama 2 7B and 34B outperform Falcon 7B and 40B models on all categories of benchmarks.\nAdditionally, Llama 2 70B model outperforms all open-source models.\nIn addition to open-source models, we also compare Llama 2 70B results to closed-source models. As shown\nin Table 4, Llama 2 70B is close to GPT-3.5 (OpenAI, 2023) on MMLU and GSM8K, but there is a significant\ngap on coding benchmarks. Llama 2 70B results are on par or better than PaLM (540B) (Chowdhery et al.,",0.39092
3,"guide¶ and code examples‖ to facilitate the safe deployment of Llama 2 and Llama 2-Chat. More details of\nour responsible release strategy can be found in Section 5.3.\nThe remainder of this paper describes our pretraining methodology (Section 2), fine-tuning methodology\n(Section 3), approach to model safety (Section 4), key observations and insights (Section 5), relevant related\nwork (Section 6), and conclusions (Section 7).\n‡https://ai.meta.com/resources/models-and-libraries/llama/\n§We are delaying the release of the 34B model due to a lack of time to sufficiently red team.\n¶https://ai.meta.com/llama\n‖https://github.com/facebookresearch/llama\n4",0.381227
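
The ranking above comes from comparing the query embedding with each chunk embedding. The sketch below illustrates the idea with cosine similarity on toy two-dimensional vectors; it is an assumption-laden illustration, since the distance metric and score normalization actually used by Chroma and `similarity_search_with_relevance_scores` may differ. The names `cosine_similarity`, `top_k`, and `toy_index` are hypothetical.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float], index: dict[str, list[float]], k: int) -> list[tuple[str, float]]:
    """Return the k (chunk_id, score) pairs most similar to the query."""
    scored = [(chunk_id, cosine_similarity(query_vec, vec)) for chunk_id, vec in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy embeddings standing in for real sentence-transformer vectors.
toy_index = {
    "chunk_a": [1.0, 0.0],
    "chunk_b": [0.7, 0.7],
    "chunk_c": [0.0, 1.0],
}

results = top_k([1.0, 0.1], toy_index, k=2)
for chunk_id, score in results:
    print(chunk_id, round(score, 3))
```

If the same chunk is embedded twice, both copies get identical scores and can occupy separate slots in the top-k list, so it can be worth checking a persisted store for duplicate entries.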


## Retrieval Q&A

In [11]:
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate
from langchain.llms.huggingface_pipeline import HuggingFacePipeline
import warnings

#suppressing warnings
warnings.filterwarnings("ignore")

#prompt template for QA
qa_template = """Use the given context to answer the question.
If you don't know the answer, just say that you don't know; don't try to make up an answer.
Keep the answers as concise as possible.

Context : {context}

Question: {question}
Answer:
"""

qa_prompt_template = PromptTemplate.from_template(qa_template)
llama3_pipeline = HuggingFacePipeline(pipeline=hf_pipeline)

#defining retrieval Q&A stuff chain
qa_chain = RetrievalQA.from_chain_type(
    llama3_pipeline,
    retriever = vectordb.as_retriever(search_kwargs={'k':1}),
    return_source_documents = True,
    chain_type="stuff",
    chain_type_kwargs={"prompt":qa_prompt_template}
)

#perform retrieval Q&A
qa_response = qa_chain({"query": "How is the Llama model in comparison to GPT-4"})

qa_response

{'query': 'How is the Llama model in comparison to GPT-4',
 'result': "The Llama 2 70B model has been demonstrated to be highly competitive with state-of-the-art language models like GPT-4.\n\nHowever, it's important to note that while Llama 2 70B may not have achieved the same level of performance as GPT-4, it is still a very impressive achievement for a model of its size and complexity.\n\nIt's worth noting that the development of AI models like Llama 2 70B and GPT-4 is an ongoing process, and future advancements could potentially lead to even more impressive achievements in the field of natural language processing.",
 'source_documents': [Document(page_content='et al., 2023). For AGI Eval, we only evaluate on the English tasks and report the average.\nAs shown in Table 3, Llama 2 models outperform Llama 1 models. In particular, Llama 2 70B improves the\nresults on MMLU and BBH by ≈5 and ≈8 points, respectively, compared to Llama 1 65B. Llama 2 7B and 30B\nmodels outperform MPT model

In [13]:
Markdown(qa_response["result"])

The Llama 2 70B model has been demonstrated to be highly competitive with state-of-the-art language models like GPT-4.

However, it's important to note that while Llama 2 70B may not have achieved the same level of performance as GPT-4, it is still a very impressive achievement for a model of its size and complexity.

It's worth noting that the development of AI models like Llama 2 70B and GPT-4 is an ongoing process, and future advancements could potentially lead to even more impressive achievements in the field of natural language processing.
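
Conceptually, a "stuff" chain concatenates the retrieved chunks into one context string and substitutes it into the prompt before calling the model. The sketch below shows that step in plain Python under simplifying assumptions; it is not LangChain's implementation, and `QA_TEMPLATE` and `build_stuff_prompt` are hypothetical names.

```python
# Simplified copy of the QA prompt, with the same two placeholders.
QA_TEMPLATE = """Use the given context to answer the question.
If you don't know the answer, just say that you don't know.

Context: {context}

Question: {question}
Answer:
"""


def build_stuff_prompt(chunks: list[str], question: str, separator: str = "\n\n") -> str:
    """Join all retrieved chunks into one context block and fill the template."""
    context = separator.join(chunks)
    return QA_TEMPLATE.format(context=context, question=question)


# Hypothetical retrieved chunks.
example_chunks = [
    "Llama 2 70B is close to GPT-3.5 on MMLU and GSM8K.",
    "There is still a large gap in performance between Llama 2 70B and GPT-4.",
]
prompt = build_stuff_prompt(example_chunks, "How does Llama 2 compare to GPT-4?")
print(prompt)
```

Because every retrieved chunk is placed into a single prompt, the number of chunks (`k`) and the chunk size together determine how much of the model's context window the prompt consumes.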