<a href="https://colab.research.google.com/github/amitsangani/Llama-2/blob/main/Llama_2_Q%26A_Using_Langchain_FAISS.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#**What is Llama 2?**

Llama 2 is a family of pretrained and fine-tuned models, trained on 2 trillion 🚀 tokens and ranging from 7 to 70 billion parameters, which makes it one of the more powerful open-source model families. It comes in three sizes (7B, 13B, and 70B) with significant improvements over the Llama 1 models, including training on 40% more tokens, a much longer context length (4k tokens 🤯), and grouped-query attention for fast inference of the 70B model 🔥. It outperforms other open-source LLMs on many external benchmarks, including reasoning, coding, proficiency, and knowledge tests.

#**What is LangChain?**

LangChain is a powerful, open-source framework designed to help you develop applications powered by a language model, particularly a large language model (LLM). The core idea of the library is that we can “chain” together different components to create more advanced use cases around LLMs. LangChain consists of multiple components from several modules.

##**What is FAISS (Facebook AI Similarity Search)?**

FAISS is a library for efficient similarity search and clustering of dense vectors. It can search multimedia documents (e.g. images) in ways that are inefficient or impossible with standard database engines (SQL). It contains algorithms that search in sets of vectors of any size, up to ones that possibly do not fit in RAM. It also contains supporting code for evaluation and parameter tuning.
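To make "similarity search" concrete, here is a minimal pure-Python sketch of exact nearest-neighbor search by Euclidean distance. It is not FAISS and uses none of its API; the function names and the tiny 3-dimensional vectors are made up for illustration. FAISS provides optimized indexes for this kind of lookup at much larger scale.

```python
import math

def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest_neighbors(query, vectors, k=2):
    """Return the indices of the k stored vectors closest to the query."""
    ranked = sorted(range(len(vectors)), key=lambda i: l2_distance(query, vectors[i]))
    return ranked[:k]

# Hypothetical 3-dimensional "embeddings"; real ones have hundreds of dimensions.
stored = [
    [0.0, 0.0, 1.0],
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
]

print(nearest_neighbors([1.0, 0.0, 0.0], stored, k=2))  # [1, 2]
```

The brute-force scan compares the query against every stored vector, which is fine for a handful of vectors but becomes slow for large collections.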



#**Let's get started!**

In [None]:
!pip install -qU transformers accelerate einops langchain xformers bitsandbytes faiss-gpu sentence_transformers

In [None]:
# simple utility to wrap text in colab before generating a response
from IPython.display import HTML, display

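def set_css():
  # inject a small stylesheet so long output lines wrap instead of scrolling
  display(HTML('''
  <style>
    pre {
        white-space: pre-wrap;
    }
  </style>
  '''))
get_ipython().events.register('pre_run_cell', set_css)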

In [None]:
pip install notebook --upgrade


##**Initializing the Hugging Face Pipeline**
You have to initialize a text-generation pipeline with Hugging Face transformers. The pipeline requires three things that you must initialize:

- An LLM, in this case meta-llama/Llama-2-7b-chat-hf.
- The respective tokenizer for the model.
- A stopping criteria object.

You have to initialize the model and move it to a CUDA-enabled GPU. On Colab, downloading and initializing the model can take 5–10 minutes.

You also need an access token so that your code can download the model from Hugging Face. To generate one, go to your Hugging Face Profile > Settings > Access Token > New Token > Generate a Token, then copy the token into the code below.
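If you would rather not paste the token into the notebook itself, one option is to read it from an environment variable and fall back to an interactive prompt. The sketch below is only a suggestion: the helper name `load_hf_token` and the default variable name `HF_TOKEN` are assumptions for illustration, not something the rest of this notebook depends on.

```python
import os
from getpass import getpass

def load_hf_token(env_var="HF_TOKEN"):
    """Return the access token from the environment, or prompt for it if unset."""
    token = os.environ.get(env_var)
    if not token:
        # getpass hides the typed value so it does not end up in the cell output
        token = getpass(f"Enter your Hugging Face access token ({env_var} is not set): ")
    return token.strip()

# Usage (assumed): hf_auth = load_hf_token()
```

This keeps the secret out of the saved notebook, which matters if you share it or push it to a public repository.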



In [None]:
from torch import cuda, bfloat16
import transformers

model_id = 'meta-llama/Llama-2-7b-chat-hf'

device = f'cuda:{cuda.current_device()}' if cuda.is_available() else 'cpu'

# set quantization configuration to load large model with less GPU memory
# this requires the `bitsandbytes` library
bnb_config = transformers.BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type='nf4',
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=bfloat16
)

# begin initializing HF items, you need an access token
# paste your own token here; never publish a real token in a shared notebook
hf_auth = '<YOUR_HF_ACCESS_TOKEN>'
model_config = transformers.AutoConfig.from_pretrained(
    model_id,
    use_auth_token=hf_auth
)

model = transformers.AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    config=model_config,
    quantization_config=bnb_config,
    device_map='auto',
    use_auth_token=hf_auth
)

# enable evaluation mode to allow model inference
model.eval()

print(f"Model loaded on {device}")
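To get a rough feel for why 4-bit loading helps, the back-of-the-envelope sketch below estimates the memory taken by the weights alone. It is an approximation under stated assumptions: it uses a nominal 7 billion parameters and ignores quantization constants, any layers kept in higher precision, activations, and the attention cache, so real usage will be higher.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Approximate memory for the weights alone, in gigabytes (10**9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

params_7b = 7e9  # nominal parameter count for the 7B model (assumption)

print(f"16-bit weights: ~{weight_memory_gb(params_7b, 16):.1f} GB")  # ~14.0 GB
print(f" 4-bit weights: ~{weight_memory_gb(params_7b, 4):.1f} GB")   # ~3.5 GB
```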

The pipeline requires a tokenizer, which translates human-readable plaintext into token IDs that the LLM can read. The Llama 2 7B models were trained using the Llama 2 7B tokenizer, which can be initialized with this code:

In [None]:
tokenizer = transformers.AutoTokenizer.from_pretrained(
    model_id,
    use_auth_token=hf_auth
)


Now, we need to define the stopping criteria for the model. Stopping criteria let us specify when the model should stop generating text. Without them, the model tends to go off on a tangent after answering the initial question.

In [None]:
stop_list = ['\nHuman:', '\n```\n']

stop_token_ids = [tokenizer(x)['input_ids'] for x in stop_list]
stop_token_ids

[[1, 29871, 13, 29950, 7889, 29901], [1, 29871, 13, 28956, 13]]

You have to convert these stop token ids into LongTensor objects.

In [None]:
import torch

stop_token_ids = [torch.LongTensor(x).to(device) for x in stop_token_ids]
stop_token_ids

[tensor([    1, 29871,    13, 29950,  7889, 29901], device='cuda:0'),
 tensor([    1, 29871,    13, 28956,    13], device='cuda:0')]

As a quick spot check, confirm that no <unk> token IDs (0) appear in stop_token_ids. There are none, so we can move on to building the stopping criteria object, which checks whether any of these token ID sequences has been generated.

In [None]:
from transformers import StoppingCriteria, StoppingCriteriaList

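# define custom stopping criteria object
class StopOnTokens(StoppingCriteria):
    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor, **kwargs) -> bool:
        for stop_ids in stop_token_ids:
            # skip stop sequences longer than the text generated so far
            if input_ids.shape[1] < len(stop_ids):
                continue
            # stop when the most recent tokens match a stop sequence exactly
            if torch.eq(input_ids[0][-len(stop_ids):], stop_ids).all():
                return True
        return False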

stopping_criteria = StoppingCriteriaList([StopOnTokens()])
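The check inside the stopping criteria is a suffix comparison: generation stops when the most recent token IDs equal one of the stop sequences. The sketch below shows the same idea on plain Python lists, without torch. The function name and the example IDs are hypothetical and chosen only to illustrate the logic.

```python
def ends_with_any(generated_ids, stop_sequences):
    """Return True if generated_ids ends with any of the stop sequences."""
    for stop in stop_sequences:
        # an empty or too-long stop sequence can never match
        if stop and len(generated_ids) >= len(stop) and generated_ids[-len(stop):] == stop:
            return True
    return False

# Hypothetical token IDs, for illustration only.
stop_sequences = [[13, 29950, 7889, 29901]]

print(ends_with_any([5, 6, 13, 29950, 7889, 29901], stop_sequences))  # True
print(ends_with_any([5, 6, 13, 29950], stop_sequences))               # False
print(ends_with_any([29901], stop_sequences))                         # False
```

Because every ID must match, it is worth checking what the tokenizer adds at the start of each stop sequence. In the output shown earlier, both sequences begin with IDs 1 and 29871, and ID 1 looks like the beginning-of-sequence marker, which would not normally reappear at the end of generated text.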

You are ready to initialize the Hugging Face pipeline. There are a few additional parameters that we must define here. Comments are included in the code for further explanation.

In [None]:
generate_text = transformers.pipeline(
    model=model,
    tokenizer=tokenizer,
    return_full_text=True,  # langchain expects the full text
    task='text-generation',
    # we pass model parameters here too
    stopping_criteria=stopping_criteria,  # without this model rambles during chat
    temperature=0.1,  # 'randomness' of outputs; lower values are more deterministic (only applies when sampling is enabled)
    max_new_tokens=512,  # max number of tokens to generate in the output
    repetition_penalty=1.1  # without this output begins repeating
)
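To see what the temperature setting does, here is a small standalone sketch of a temperature-scaled softmax over a few made-up logits. It illustrates the general idea only and is not how the pipeline is implemented internally; the function name and the logit values are assumptions.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities, sharpened or flattened by temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]  # hypothetical scores for three candidate tokens

for t in (0.1, 1.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

At a low temperature such as 0.1, almost all of the probability goes to the highest-scoring token, so sampled output varies very little between runs.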

Run this code to confirm that everything is working fine.

In [None]:
res = generate_text("What is the meaning of life?")
print(res)

[{'generated_text': "What is the meaning of life?\n nobody knows for sure, but here are some possible answers:\n\n1. To seek happiness and fulfillment: Many people believe that the ultimate goal of life is to find happiness and fulfillment. They believe that one should pursue their passions and interests, and cultivate meaningful relationships with others in order to live a fulfilling life.\n2. To learn and grow: Others believe that the purpose of life is to learn and grow as individuals. They believe that life is an opportunity to acquire knowledge, develop skills, and become better versions of themselves.\n3. To make a positive impact: Some people believe that the purpose of life is to make a positive impact on the world. They believe that one should strive to make a difference in the lives of others, whether through acts of kindness, volunteering, or working to make the world a better place.\n4. To seek spiritual enlightenment: Many religious and philosophical traditions believe tha

In [None]:
print(res[0]["generated_text"])

What is the meaning of life?
 nobody knows for sure, but here are some possible answers:

1. To seek happiness and fulfillment: Many people believe that the ultimate goal of life is to find happiness and fulfillment. They believe that one should pursue their passions and interests, and cultivate meaningful relationships with others in order to live a fulfilling life.
2. To learn and grow: Others believe that the purpose of life is to learn and grow as individuals. They believe that life is an opportunity to acquire knowledge, develop skills, and become better versions of themselves.
3. To make a positive impact: Some people believe that the purpose of life is to make a positive impact on the world. They believe that one should strive to make a difference in the lives of others, whether through acts of kindness, volunteering, or working to make the world a better place.
4. To seek spiritual enlightenment: Many religious and philosophical traditions believe that the purpose of life is to

##**Implementing HF Pipeline in LangChain**

In [None]:
from langchain.llms import HuggingFacePipeline

llm = HuggingFacePipeline(pipeline=generate_text)

# checking again that everything is working fine
llm(prompt="What is the meaning of life?")

"\n nobody knows for sure, but here are some possible answers:\n\n1. To seek happiness and fulfillment: Many people believe that the ultimate goal of life is to find happiness and fulfillment. They believe that one should pursue their passions and interests, and cultivate meaningful relationships with others in order to live a fulfilling life.\n2. To learn and grow: Others believe that the purpose of life is to learn and grow as individuals. They believe that life is an opportunity to acquire knowledge, develop skills, and become better versions of themselves.\n3. To make a positive impact: Some people believe that the purpose of life is to make a positive impact on the world. They believe that one should strive to make a difference in the lives of others, whether through acts of kindness, volunteering, or working to make the world a better place.\n4. To seek spiritual enlightenment: Many religious and philosophical traditions believe that the purpose of life is to seek spiritual enlig

##**Ingesting Data using Document Loader**

You have to ingest data using the WebBaseLoader document loader, which collects data by scraping webpages. In this case, you will be collecting data from Meta's documentation website.


In [None]:
from langchain.document_loaders import WebBaseLoader

web_links = [
"https://ai.meta.com/",
"https://ai.meta.com/research/",
"https://ai.meta.com/blog/",
"https://ai.meta.com/resources/",
"https://ai.meta.com/about/",
"https://ai.meta.com/research/",
"https://ai.meta.com/blog/",
"https://ai.meta.com/resources/",
"https://ai.meta.com/about/",
"https://ai.meta.com/resources/models-and-libraries/llama-downloads/",
"https://ai.meta.com/llama/#inside-the-model",
"https://ai.meta.com/llama/#partnerships",
"https://ai.meta.com/llama/#responsibility",
"https://ai.meta.com/llama/#download-the-model",
"https://ai.meta.com/llama/#resources",
"https://ai.meta.com/resources/models-and-libraries/llama/",
"https://ai.meta.com/llama/responsible-use-guide/",
"https://ai.meta.com/llama/open-innovation-ai-research-community/",
"https://ai.meta.com/llama/llama-impact-challenge/",
"https://ai.meta.com/llama/llama-impact-challenge/",
"https://about.fb.com/news/2023/06/generative-ai-community-forum/",
"https://www.facebook.com/privacy/policy/?entry_point=data_policy_redirect&entry=0",
"https://ai.meta.com/resources/models-and-libraries/llama-downloads/",
"https://ai.meta.com/resources/models-and-libraries/llama/",
"https://ai.meta.com/research/publications/llama-2-open-foundation-and-fine-tuned-chat-models/",
"https://ai.meta.com/blog/llama-2/",
"https://ai.meta.com/llama/responsible-use-guide/",
"https://ai.meta.com/llama/open-innovation-ai-research-community/",
"https://ai.meta.com/about",
"https://ai.meta.com/about",
"https://ai.facebook.com/results/?content_types%5B0%5D=person&sort_by=random",
"https://www.metacareers.com/jobs/?is_leadership=0&sub_teams[0]=Artificial%20Intelligence&is_in_page=0",
"https://ai.meta.com/events",
"https://ai.meta.com/blog",
"https://ai.meta.com/research",
"https://ai.meta.com/infrastructure",
"https://ai.meta.com/blog",
"https://ai.meta.com/resources",
"https://ai.meta.com/responsible-ai",
"https://ai.meta.com/responsible-ai",
"https://ai.meta.com/subscribe",
"https://ai.meta.com/subscribe",
"https://www.facebook.com/MetaAI/",
"https://twitter.com/MetaAI/",
"https://www.linkedin.com/showcase/metaai/",
"https://www.youtube.com/@FacebookAI",
"https://ai.meta.com/about",
"https://ai.meta.com/about",
"https://ai.facebook.com/results/?content_types%5B0%5D=person&sort_by=random",
"https://www.metacareers.com/jobs/?is_leadership=0&sub_teams[0]=Artificial%20Intelligence&is_in_page=0",
"https://ai.meta.com/events",
"https://ai.meta.com/blog",
"https://ai.meta.com/research",
"https://ai.meta.com/infrastructure",
"https://ai.meta.com/blog",
"https://ai.meta.com/resources",
"https://ai.meta.com/responsible-ai",
"https://ai.meta.com/responsible-ai",
"https://ai.meta.com/subscribe",
"https://ai.meta.com/subscribe",
"https://www.facebook.com/MetaAI/",
"https://twitter.com/MetaAI/",
"https://www.linkedin.com/showcase/metaai/",
"https://www.youtube.com/@FacebookAI",
"https://www.facebook.com/about/privacy/",
"https://www.facebook.com/policies/",
"https://www.facebook.com/policies/cookies/",
"https://www.facebook.com/MetaAI/",
"https://twitter.com/MetaAI/",
"https://www.linkedin.com/showcase/metaai/"]

loader = WebBaseLoader(web_links)
documents = loader.load()
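The list above contains many repeated URLs, plus several that differ only by a trailing slash or a `#fragment` and therefore point to the same page. If you want each page loaded once, a small helper like the one below is one way to do it. The name `unique_pages` is hypothetical and the normalization rules are an assumption about what counts as "the same page".

```python
from urllib.parse import urldefrag

def unique_pages(urls):
    """Drop URL fragments and duplicates while keeping the original order."""
    seen = set()
    unique = []
    for url in urls:
        page, _fragment = urldefrag(url)  # "…/llama/#resources" -> "…/llama/"
        key = page.rstrip("/")            # treat "/blog" and "/blog/" as the same page
        if key not in seen:
            seen.add(key)
            unique.append(page)
    return unique

sample = [
    "https://ai.meta.com/blog/",
    "https://ai.meta.com/blog",
    "https://ai.meta.com/llama/#resources",
    "https://ai.meta.com/llama/#partnerships",
]
print(unique_pages(sample))
# ['https://ai.meta.com/blog/', 'https://ai.meta.com/llama/']
```

You could then pass `unique_pages(web_links)` to WebBaseLoader instead of `web_links`, which avoids storing the same text several times in the vector store.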

##**Splitting in Chunks using Text Splitters**
Next, split the text into small chunks. Initialize RecursiveCharacterTextSplitter and pass the documents to its split_documents method.

In [None]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=20)
all_splits = text_splitter.split_documents(documents)
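To illustrate what chunk_size and chunk_overlap mean, here is a deliberately simplified splitter that cuts a string into fixed-size character windows, each sharing a few characters with the previous one. This is only a sketch: RecursiveCharacterTextSplitter also tries to break on separators such as paragraphs and lines, which this version does not attempt. The function name is hypothetical.

```python
def split_with_overlap(text, chunk_size, chunk_overlap):
    """Split text into chunks of at most chunk_size characters that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last chunk already reaches the end of the text
        start += step
    return chunks

print(split_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=1))
# ['abcd', 'defg', 'ghij']
```

The overlap means a sentence that straddles a chunk boundary still appears partly in both chunks, which can help retrieval find it.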

##**Creating Embeddings and Storing in Vector Store**

You have to create embeddings for each small chunk of text and store them in the vector store (i.e. FAISS). You will be using the all-mpnet-base-v2 Sentence Transformer to convert all pieces of text into vectors while storing them in the vector store.


In [None]:
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS

model_name = "sentence-transformers/all-mpnet-base-v2"
model_kwargs = {"device": "cuda"}

embeddings = HuggingFaceEmbeddings(model_name=model_name, model_kwargs=model_kwargs)

# storing embeddings in the vector store
vectorstore = FAISS.from_documents(all_splits, embeddings)

##**Initializing Chain**
You have to initialize ConversationalRetrievalChain. This chain allows you to have a chatbot with memory while relying on a vector store to find relevant information from your document.

Additionally, you can return the source documents used to answer the question by passing the optional parameter return_source_documents=True when constructing the chain.

In [None]:
from langchain.chains import ConversationalRetrievalChain

chain = ConversationalRetrievalChain.from_llm(llm, vectorstore.as_retriever(), return_source_documents=True)

Let's do Q&A against your own data!

In [None]:
chat_history = []

query = "What type of llama 2 models are available?"
result = chain({"question": query, "chat_history": chat_history})

print(result['answer'])

 Two types of models are available in Llama 2: Foundation models and fine-tuned chat models.


This time, your previous question and answer are included as chat history, which lets you ask follow-up questions.

In [None]:
chat_history = [(query, result["answer"])]

query = "Can you explain key principles a developer needs to follow for acceptable use of models?"
result = chain({"question": query, "chat_history": chat_history})

print(result['answer'])


  We provide guidelines for developers to build products powered by large language models responsibly in our Responsible Use Guide. It includes best practices and considerations for building products powered by large language models in a responsible manner.
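If you plan to ask several follow-up questions, rebuilding chat_history by hand gets repetitive. One possible approach is a small helper that calls the chain and appends each turn to the history. The helper name `ask` is hypothetical, and `fake_chain` below is only a stand-in so the sketch runs without a model; in the notebook you would pass the real `chain` instead.

```python
def ask(chain, question, chat_history):
    """Query the chain, record the (question, answer) turn, and return the answer."""
    result = chain({"question": question, "chat_history": chat_history})
    chat_history.append((question, result["answer"]))
    return result["answer"]

# Stand-in for the real chain, for illustration only.
def fake_chain(inputs):
    turns = len(inputs["chat_history"])
    return {"answer": f"answer to {inputs['question']!r} after {turns} earlier turn(s)"}

history = []
ask(fake_chain, "first question", history)
ask(fake_chain, "follow-up question", history)
print(history)
```

Each call sees all earlier turns, so the history grows by one (question, answer) pair per question.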


In [None]:
# We can also show the source documents that were used to generate the answer
print(result['source_documents'])


[Document(page_content='Report this post\n                    \n    \n\n\n\n\n \n\n\nWe’re continuing to invest in an array of responsible AI efforts with the release of Llama 2, including the creation of a new Responsible Use Guide for developers which includes best practices and considerations for building products powered by large language models in a responsible manner.\n\nDownload the full guide ➡️ https://bit.ly/3qdzRUH\n\nThese best practices should be considered holistically because strategies adopted at one level can impact the entire system. The recommendations included in this guide reflect current research on responsible generative AI. We expect these to evolve as the field advances and access to foundation models grows, inviting further innovation on AI safety.\n \n\n\n\n \n\n\n \n\n\n\n\n\n\n\n\n\n\n                    408\n              \n\n\n \n\n \n\n\n\n\n\n        \n                11 Comments\n            \n      \n\n\n\n\n\n\n      Like\n    \n\n\n\n\n\n      Comme

##**TBD: Using Streamlit**

You now have the capability to do question-answering on your own data using a powerful language model. Additionally, you can further develop it into a chatbot application using Streamlit.



