# **Using LLaMA 2.0, FAISS and LangChain for Question-Answering on Your Own Data**


### Question-Answering with Meta's Llama-2-7b-chat Model

You can perform **Question-Answering (QA) like a chatbot** using **Meta's Llama-2-7b-chat model**. This is made possible by integrating the **LangChain framework** and the **FAISS library** to work over the documents of your choice.

### Example Usage in This Notebook

In this notebook, I have demonstrated the functionality by using the **Databricks documentation** as a data source. The data was retrieved directly from their official website, showcasing how the model can be applied to real-world documentation.



# Introduction to LLaMA 2

The **LLaMA 2** model represents a significant advancement in the field of large language models. It was pretrained on **2 Trillion 🚀 tokens** and comes in three sizes: **7B**, **13B**, and **70B** parameters. Each size offers a different trade-off between capability and resource requirements.

Key improvements of LLaMA 2 over its predecessor, LLaMA 1, include:
- Training on **40% more tokens**, providing a richer and more diverse dataset for the model to learn from.
- A context length of **4096 tokens 🤯**, double that of LLaMA 1, which allows for more complex and nuanced understanding and generation of text.
- The implementation of **grouped-query attention** in the 70B model, which speeds up inference 🔥.

According to Meta, LLaMA 2 outperforms other open-source Large Language Models (LLMs) on a variety of external benchmarks. These benchmarks cover reasoning, coding, language proficiency, and knowledge tests, which suggests the model can handle a wide range of tasks.


## Getting Started
You can use the open-source **Llama-2-7b-chat** model with both Hugging Face transformers and LangChain. First, however, you have to request access to the Llama 2 models on the [Meta website](https://ai.meta.com/resources/models-and-libraries/llama-downloads/) and agree to share your account details with Meta on the [Hugging Face website](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf). Access is typically granted within a few minutes to a few hours.

🚨 **Note**: The email on your Hugging Face account **MUST** match the email you provided on the Meta website. If there is a mismatch, your request will **not** be approved.

### Running Code on Google Colab

If you're using Google Colab to run the code, follow these steps to configure your runtime environment:
1. Go to `Runtime` in the menu.
2. Select `Change runtime type`.
3. Under `Hardware accelerator`, choose `GPU`.
4. Then, select `GPU type` and choose `T4`.

**Important**: You will need approximately **8GB of GPU RAM** for inference. Running this model on a CPU is impractical because of its computational requirements.
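
As a rough sanity check on that figure, the memory taken by the weights alone can be estimated as the parameter count times the bits per parameter. The sketch below is a back-of-the-envelope calculation, not a measurement: it assumes a round 7 billion parameters, ignores activations, the KV cache, and quantization overhead, and `weight_memory_gb` is a hypothetical helper name.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate memory for the weights alone, in gigabytes (10**9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


params_7b = 7e9  # assumption: round figure for the 7B model

fp16_gb = weight_memory_gb(params_7b, 16)  # half-precision weights
int4_gb = weight_memory_gb(params_7b, 4)   # 4-bit quantized weights

print(f"16-bit weights: ~{fp16_gb:.1f} GB")
print(f"4-bit weights:  ~{int4_gb:.1f} GB")
```

Under these assumptions, 16-bit weights would leave little headroom on a T4, which is one reason the loading code in this notebook uses 4-bit quantization.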


## **Installing the Libraries**

Start by installing the required libraries with `pip install`.

In [None]:
!pip install accelerate==0.21.0 transformers==4.31.0 tokenizers==0.13.3
!pip install bitsandbytes==0.40.0 einops==0.6.1
!pip install xformers==0.0.22.post7
!pip install langchain==0.1.4
!pip install faiss-gpu==1.7.1.post3
!pip install sentence_transformers


## **Initializing the Hugging Face Pipeline**

To utilize the `text-generation` pipeline with Hugging Face transformers, you need to initialize the following components:

1. **A Large Language Model (LLM)**: For this purpose, we will use `meta-llama/Llama-2-7b-chat-hf`.
2. **The Respective Tokenizer**: This tokenizer is specific to the model you are using.
3. **A Stopping Criteria Object**: This determines when the model should stop generating text.

### Model Initialization and Setup

- Initialize the model and move it to a CUDA-enabled GPU.
- Note: Using Colab, downloading and initializing the model can take **5–10 minutes**.

### Generating an Access Token

To download the model from Hugging Face, you'll need to generate an access token:
1. Go to your **Hugging Face Profile**.
2. Navigate to **Settings** > **Access Token**.
3. Click on **New Token**.
4. Select **Generate a Token**.
5. Copy the generated token.

**Important**: Include the copied token in the code where required for authentication.
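
Rather than pasting the token directly into a cell, one option is to read it from an environment variable so it never ends up in a shared notebook. The following is a minimal sketch under that assumption; the helper name `load_hf_token` and the variable name `HF_TOKEN` are illustrative choices, not something this notebook requires.

```python
import os


def load_hf_token(env_var: str = "HF_TOKEN") -> str:
    """Return the access token stored in the given environment variable.

    Raises RuntimeError if the variable is missing or empty, so a
    misconfigured session fails early instead of at download time.
    """
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(
            f"Environment variable {env_var!r} is not set; "
            "export your Hugging Face token before running the notebook."
        )
    return token


# Usage (commented out so the sketch runs even when no token is set):
# hf_auth = load_hf_token()
```

The returned string can then be passed wherever the code below expects `hf_auth`.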


In [None]:
from torch import cuda, bfloat16
import transformers

model_id = 'meta-llama/Llama-2-7b-chat-hf'

device = f'cuda:{cuda.current_device()}' if cuda.is_available() else 'cpu'

# set quantization configuration to load large model with less GPU memory
# this requires the `bitsandbytes` library
bnb_config = transformers.BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type='nf4',
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=bfloat16
)

# begin initializing HF items, you need an access token
# never share or commit a real token; paste your own here before running
hf_auth = '<add your access token here>'
model_config = transformers.AutoConfig.from_pretrained(
    model_id,
    use_auth_token=hf_auth
)

model = transformers.AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    config=model_config,
    quantization_config=bnb_config,
    device_map='auto',
    use_auth_token=hf_auth
)

# enable evaluation mode to allow model inference
model.eval()

print(f"Model loaded on {device}")




Model loaded on cuda:0


## **Tokenizer Initialization for Llama 2 7B Model**

The pipeline requires a **tokenizer** that translates human-readable plaintext into token IDs that the Large Language Model (LLM) can understand. Llama 2 ships with its own tokenizer, which is loaded from the same model repository as the weights, so the input text is encoded exactly as the model expects.

### Initializing the Llama 2 7B Tokenizer

You can initialize the Llama 2 7B tokenizer with the following code snippet:


In [None]:
tokenizer = transformers.AutoTokenizer.from_pretrained(
    model_id,
    use_auth_token=hf_auth
)




## **Defining the Stopping Criteria for the Model**

Now, we need to define the *stopping criteria* of the model. These criteria specify when the model should stop generating text. Without them, the model may drift off topic and keep generating text after it has answered the initial question.


In [None]:
stop_list = ['\nHuman:', '\n```\n']

stop_token_ids = [tokenizer(x)['input_ids'] for x in stop_list]
stop_token_ids

[[1, 29871, 13, 29950, 7889, 29901], [1, 29871, 13, 28956, 13]]

## **Converting Stop Token IDs into `LongTensor` Objects**

The stop token IDs must be converted into `LongTensor` objects. A `LongTensor` is a PyTorch tensor (a multi-dimensional array) that holds 64-bit integers.

The conversion is needed because the stopping criteria compare these IDs against the model's `input_ids`, which are PyTorch tensors. Moving the stop token IDs to the same device as the model lets that comparison run directly on that device.


In [None]:
import torch

stop_token_ids = [torch.LongTensor(x).to(device) for x in stop_token_ids]
stop_token_ids

[tensor([    1, 29871,    13, 29950,  7889, 29901], device='cuda:0'),
 tensor([    1, 29871,    13, 28956,    13], device='cuda:0')]

## **Spot Check for `<unk>` Token IDs in Stop Token IDs**

It's important to perform a quick spot check to ensure that no `<unk>` token IDs (represented by `0`) appear in the `stop_token_ids`. This step is crucial because the presence of `<unk>` tokens can adversely affect the model's output quality.

### Verifying the Absence of `<unk>` Tokens

- Conduct a check on `stop_token_ids`.
- Confirm that there are no instances of the `<unk>` token ID (`0`).
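
The check described above can be sketched in a few lines. This example works on plain Python lists copied from the tokenizer output shown earlier in this notebook; with the tensors from the previous cell, `[t.tolist() for t in stop_token_ids]` would give the same lists. The helper name `contains_unk` is hypothetical.

```python
UNK_TOKEN_ID = 0  # per the text above, `<unk>` is represented by ID 0

# Token ID lists copied from the tokenizer output earlier in this notebook.
stop_token_id_lists = [
    [1, 29871, 13, 29950, 7889, 29901],
    [1, 29871, 13, 28956, 13],
]


def contains_unk(id_lists, unk_id=UNK_TOKEN_ID):
    """Return True if any of the ID lists contains the unknown-token ID."""
    return any(unk_id in ids for ids in id_lists)


if contains_unk(stop_token_id_lists):
    print("Warning: <unk> token ID found in the stop sequences")
else:
    print("No <unk> token IDs found")
```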

Once this check is completed and it's confirmed that there are no `<unk>` tokens, we can proceed to build the **stopping criteria object**. This object is responsible for determining whether the stopping criteria of the model have been met — specifically, it checks whether any of the defined token ID combinations have been generated during the model's text generation process.


In [None]:
from transformers import StoppingCriteria, StoppingCriteriaList

# define custom stopping criteria object
class StopOnTokens(StoppingCriteria):
    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor, **kwargs) -> bool:
        for stop_ids in stop_token_ids:
            if torch.eq(input_ids[0][-len(stop_ids):], stop_ids).all():
                return True
        return False

stopping_criteria = StoppingCriteriaList([StopOnTokens()])
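
To make the tail comparison in `StopOnTokens` easier to follow, here is a plain-Python sketch of the same idea using lists instead of tensors. The token IDs in `generated` are made up for illustration, and `ends_with_any` is a hypothetical helper, not part of transformers.

```python
def ends_with_any(generated_ids, stop_sequences):
    """Return True if generated_ids ends with any of the stop sequences."""
    for stop in stop_sequences:
        # Compare only the last len(stop) tokens, as the class above does.
        if len(stop) <= len(generated_ids) and generated_ids[-len(stop):] == stop:
            return True
    return False


# Hypothetical stop sequence: the "\nHuman:" IDs from above without the
# leading 1 and 29871 that the tokenizer prepended.
stop_sequences = [[13, 29950, 7889, 29901]]

# Made-up generated IDs that happen to end with the stop sequence.
generated = [1, 4013, 338, 385, 1234, 13, 29950, 7889, 29901]

print(ends_with_any(generated, stop_sequences))
print(ends_with_any(generated[:-1], stop_sequences))
```

One thing worth verifying in your own run: the stop sequences printed earlier start with ID `1`, which is Llama's beginning-of-sequence token and normally appears only at the very start of the input. If the criterion never seems to fire, tokenizing the stop strings with `add_special_tokens=False` and inspecting the resulting IDs may be a useful experiment.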

## **Initializing the Hugging Face Pipeline**

You are now ready to initialize the Hugging Face pipeline for text generation. The pipeline is configured with several important parameters, each serving a specific purpose to fine-tune the text generation process. Below is an overview of these parameters and their roles:

- `return_full_text`: Set to `True` as LangChain expects the full text output.
- `task`: Defined as `'text-generation'`, indicating the type of task the pipeline is being set up for.
- `stopping_criteria`: This is crucial to prevent the model from rambling during chat sessions. It ensures the model stops generating text based on the defined criteria.
- `temperature`: Controls the 'randomness' of outputs. A value of `0.1` gives low randomness: values close to `0` make the output nearly deterministic, while higher values (`1.0` and above) make it more random.
- `max_new_tokens`: Specifies the maximum number of tokens to generate in the output, set to `512` in this case.
- `repetition_penalty`: Applied to discourage repetitive outputs. A value of `1.1` helps in reducing output repetition.
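
To give some intuition for `temperature`, the sketch below applies a temperature-scaled softmax to a small set of made-up logits. It illustrates the general technique only and is not the code path transformers uses internally; the function name and the logit values are assumptions.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing them by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.5]  # made-up scores for three candidate tokens

low = softmax_with_temperature(logits, 0.1)   # sharply peaked distribution
high = softmax_with_temperature(logits, 1.0)  # flatter distribution

print([round(p, 4) for p in low])
print([round(p, 4) for p in high])
```

At a low temperature almost all probability mass lands on the highest-scoring token, which is why a value like `0.1` produces nearly deterministic output.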


In [None]:
generate_text = transformers.pipeline(
    model=model,
    tokenizer=tokenizer,
    return_full_text=True,  # langchain expects the full text
    task='text-generation',
    # we pass model parameters here too
    stopping_criteria=stopping_criteria,  # without this model rambles during chat
    temperature=0.1,  # 'randomness' of outputs; closer to 0 is more deterministic, higher is more random
    max_new_tokens=512,  # max number of tokens to generate in the output
    repetition_penalty=1.1  # without this output begins repeating
)

## **Final Step: Confirming the Setup**

Run the following code to confirm that everything is set up correctly and working as expected.


In [None]:
res = generate_text("Explain me the difference between Data Lakehouse and Data Warehouse.")
print(res[0]["generated_text"])

Explain me the difference between Data Lakehouse and Data Warehouse. Unterscheidung between data lakehouse and data warehouse is a common topic of discussion in the data engineering community, as both are designed to store large amounts of structured and unstructured data. A data lakehouse is a centralized repository that stores all the data from various sources in its raw form, without any predefined schema or structure. On the other hand, a data warehouse is a structured repository that stores data in a specific format, typically optimized for querying and analysis.

Here are some key differences between a data lakehouse and a data warehouse:

1. Structure: A data lakehouse has no predefined schema, whereas a data warehouse has a rigid schema that defines how the data should be organized and stored.
2. Data Types: A data lakehouse can store various types of data, including structured, semi-structured, and unstructured data, while a data warehouse typically stores only structured data

## **Implementing HF Pipeline in LangChain**

Now, it's time to integrate the Hugging Face pipeline within LangChain. The wrapped pipeline should produce output similar to the standalone Hugging Face pipeline (not necessarily identical, as the two sample outputs in this notebook show), but it is still a crucial step. Integrating the pipeline with LangChain gives the **Llama 2** model access to LangChain’s features such as agent tooling, chains, and more.


In [None]:
from langchain.llms import HuggingFacePipeline

llm = HuggingFacePipeline(pipeline=generate_text)

# checking again that everything is working fine
llm(prompt="Explain me the difference between Data Lakehouse and Data Warehouse.")


" Unterscheidung between data lakehouse and data warehouse is a common topic of discussion in the data engineering community, as both are designed to store large amounts of data but have different architectures and use cases. A data lakehouse is a centralized repository that stores all the raw data from various sources in its original form, without transforming or processing it. On the other hand, a data warehouse is a structured repository that stores data in a specific format, typically after cleaning, transforming, and aggregating it.\n\nHere are some key differences between a data lakehouse and a data warehouse:\n\n1. Data Structure: A data lakehouse stores data in its raw, unprocessed form, while a data warehouse stores data in a structured format, typically after cleaning, transforming, and aggregating it.\n2. Data Sources: A data lakehouse can ingest data from various sources, including databases, files, and streaming data sources, while a data warehouse typically ingests data f

## **Ingesting Data using Document Loader**

For data ingestion, use the `WebBaseLoader` document loader, which collects data by scraping webpages. Here, the targets are pages from the Databricks website and its documentation.


In [None]:
from langchain.document_loaders import WebBaseLoader

web_links = [
    "https://www.databricks.com/",
    "https://help.databricks.com",
    "https://databricks.com/try-databricks",
    "https://help.databricks.com/s/",
    "https://docs.databricks.com",
    "https://kb.databricks.com/",
    "http://docs.databricks.com/getting-started/index.html",
    "http://docs.databricks.com/introduction/index.html",
    "http://docs.databricks.com/getting-started/tutorials/index.html",
    "http://docs.databricks.com/release-notes/index.html",
    "http://docs.databricks.com/ingestion/index.html",
    "http://docs.databricks.com/exploratory-data-analysis/index.html",
    "http://docs.databricks.com/data-preparation/index.html",
    "http://docs.databricks.com/data-sharing/index.html",
    "http://docs.databricks.com/marketplace/index.html",
    "http://docs.databricks.com/workspace-index.html",
    "http://docs.databricks.com/machine-learning/index.html",
    "http://docs.databricks.com/sql/index.html",
    "http://docs.databricks.com/delta/index.html",
    "http://docs.databricks.com/dev-tools/index.html",
    "http://docs.databricks.com/integrations/index.html",
    "http://docs.databricks.com/administration-guide/index.html",
    "http://docs.databricks.com/security/index.html",
    "http://docs.databricks.com/data-governance/index.html",
    "http://docs.databricks.com/lakehouse-architecture/index.html",
    "http://docs.databricks.com/reference/api.html",
    "http://docs.databricks.com/resources/index.html",
    "http://docs.databricks.com/whats-coming.html",
    "http://docs.databricks.com/archive/index.html",
    "http://docs.databricks.com/lakehouse/index.html",
    "http://docs.databricks.com/getting-started/quick-start.html",
    "http://docs.databricks.com/getting-started/etl-quick-start.html",
    "http://docs.databricks.com/getting-started/lakehouse-e2e.html",
    "http://docs.databricks.com/getting-started/free-training.html",
    "http://docs.databricks.com/sql/language-manual/index.html",
    "http://docs.databricks.com/error-messages/index.html",
    "http://www.apache.org/",
    "https://databricks.com/privacy-policy",
    "https://databricks.com/terms-of-use",
]

loader = WebBaseLoader(web_links)
documents = loader.load()

## **Splitting in Chunks using Text Splitters**

To effectively process the text, it's essential to split it into smaller chunks. For this purpose, initialize the `RecursiveCharacterTextSplitter`. Once initialized, call this text splitter by passing the documents through it. This approach ensures manageable and efficient handling of text data.





In [None]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=20)
all_splits = text_splitter.split_documents(documents)
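
To illustrate what `chunk_size` and `chunk_overlap` mean, here is a simplified fixed-window splitter in plain Python. It is only a sketch of the overlap idea: `RecursiveCharacterTextSplitter` additionally tries to split on separators such as paragraph and line breaks, which this hypothetical `split_with_overlap` helper does not attempt.

```python
def split_with_overlap(text, chunk_size, chunk_overlap):
    """Cut text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    if not text:
        return []
    step = chunk_size - chunk_overlap
    # Stop before a window would contain nothing but the previous overlap.
    last_start = max(len(text) - chunk_overlap, 1)
    return [text[i:i + chunk_size] for i in range(0, last_start, step)]


chunks = split_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=1)
print(chunks)
```

Each chunk starts with the last character of the previous one, so text near a boundary appears in both neighbouring chunks. A small overlap like the `20` characters used above serves the same purpose at a larger scale.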

## **Creating Embeddings and Storing in Vector Store**

The next step involves creating embeddings for each small chunk of text and then storing these embeddings in a vector store, such as FAISS. To achieve this, we will use the `all-mpnet-base-v2` Sentence Transformer model. This model is designed to convert pieces of text into vector representations, which are then stored in the vector store for efficient retrieval and comparison.

### Process Overview

1. **Generate Embeddings**:
   - Utilize the `all-mpnet-base-v2` Sentence Transformer to create embeddings from the text chunks.
   - These embeddings capture the semantic essence of the text in a numerical format.

2. **Storing in Vector Store (FAISS)**:
   - Once the embeddings are generated, they are stored in FAISS.
   - FAISS is an efficient library for similarity search and clustering of dense vectors.
   - Storing embeddings in FAISS allows for quick and efficient retrieval based on similarity measures.

This method is crucial for tasks such as semantic search, where the goal is to find the most relevant pieces of text based on a query.
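
Conceptually, similarity search means ranking stored vectors by their distance to a query vector. The toy sketch below does this by exhaustive comparison over three made-up 3-dimensional vectors using Euclidean (L2) distance; real embeddings from `all-mpnet-base-v2` have far more dimensions, and FAISS is built to do this kind of search efficiently at scale. The function names are hypothetical.

```python
import math


def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def top_k(query, vectors, k):
    """Return the indices of the k vectors closest to the query."""
    ranked = sorted(range(len(vectors)), key=lambda i: l2_distance(query, vectors[i]))
    return ranked[:k]


# Made-up "embeddings" for three text chunks.
chunk_vectors = [
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.8, 0.2, 0.1],
]
query_vector = [1.0, 0.0, 0.0]  # made-up query embedding

nearest = top_k(query_vector, chunk_vectors, k=2)
print(nearest)
```

The indices returned correspond to the chunks a retriever would hand to the language model as context.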


In [None]:
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS

model_name = "sentence-transformers/all-mpnet-base-v2"
model_kwargs = {"device": "cuda"}

embeddings = HuggingFaceEmbeddings(model_name=model_name, model_kwargs=model_kwargs)

# storing embeddings in the vector store
vectorstore = FAISS.from_documents(all_splits, embeddings)


## **Initializing Chain for Conversational Retrieval**

The next step is to initialize the `ConversationalRetrievalChain`. This chain lets you build a chatbot that takes the conversation so far into account and uses a vector store to retrieve relevant information from your document base, so its responses are grounded in your data.

### Key Features of ConversationalRetrievalChain

- **Chatbot with Memory**: The chain accepts the chat history (a list of previous question-answer pairs) with each call and uses it to interpret follow-up questions in context.
- **Reliance on Vector Store**: This chain uses a vector store for efficient retrieval of information from documents, ensuring that the chatbot's responses are backed by relevant data.

### Optional Parameter: Returning Source Documents

- You have the option to include the `return_source_documents=True` parameter when constructing the chain.
- This parameter, when set to `True`, allows the chain to return the source documents that were used to answer a question.
- This feature can be particularly useful for transparency and for providing users with the opportunity to explore the original information sources.

By setting up the ConversationalRetrievalChain in this manner, you enhance the chatbot's functionality, making it a powerful tool for information retrieval and conversation.


In [None]:
from langchain.chains import ConversationalRetrievalChain

chain = ConversationalRetrievalChain.from_llm(llm, vectorstore.as_retriever(), return_source_documents=True)

## **Time for Question-Answering on Your Own Data**

Now, it’s time to engage in some Question-Answering using your own dataset. This is an exciting opportunity to see how the system performs with the data you have curated and prepared.


In [None]:
chat_history = []

query = "What is Data lakehouse architecture in Databricks?"
result = chain({"question": query, "chat_history": chat_history})

print(result['answer'])


 In Databricks, a data lakehouse architecture refers to a scalable storage and processing system that supports multiple layers of data processing, starting with raw data in the bottom layer and progressively refining and transforming the data as it moves upward through the layers. These layers are often referred to as the "medallion architecture," with each layer containing one or more tables. The layers include the bronze, silver, and gold layers, with the gold layer representing the highest quality data.
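
Because the chain was built with `return_source_documents=True`, the result should also carry the retrieved documents under the `source_documents` key. The sketch below collects the distinct source URLs from such a result. It runs against a stand-in object so it works without the model; the `fake_result` contents and the helper name `list_sources` are illustrative, and it assumes each document exposes a `metadata` dict with a `source` entry, as documents loaded by `WebBaseLoader` do.

```python
from types import SimpleNamespace


def list_sources(result):
    """Return the distinct source URLs of the retrieved documents, in order."""
    seen = []
    for doc in result.get("source_documents", []):
        source = doc.metadata.get("source", "<unknown>")
        if source not in seen:
            seen.append(source)
    return seen


# Stand-in for a real chain result (hypothetical data for illustration).
fake_result = {
    "answer": "placeholder answer",
    "source_documents": [
        SimpleNamespace(metadata={"source": "http://docs.databricks.com/lakehouse/index.html"}),
        SimpleNamespace(metadata={"source": "http://docs.databricks.com/lakehouse/index.html"}),
        SimpleNamespace(metadata={"source": "http://docs.databricks.com/delta/index.html"}),
    ],
}

for url in list_sources(fake_result):
    print(url)
```

With the real chain, calling `list_sources(result)` on the dictionary returned above would serve the same purpose.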


## **Including Chat History for Follow-Up Questions**

In this phase, your previous question and answer interactions will be included as a part of the chat history. This inclusion enables the ability to ask follow-up questions, enhancing the conversational depth and context. By retaining this chat history, the system can provide more coherent and contextually relevant responses to subsequent queries.


In [None]:
chat_history = [(query, result["answer"])]

query = "What are Data Governance and Interoperability in it?"
result = chain({"question": query, "chat_history": chat_history})

(result['answer'])

' In a data lakehouse architecture, Data Governance refers to the policies and procedures put in place to manage data assets within an organization. It includes data quality, security, access control, and retention policies. On the other hand, Data Interoperability refers to the ability of different systems or technologies to communicate and exchange data seamlessly. While Data Governance focuses on managing data within an organization, Data Interoperability deals with the integration of data from various sources outside the organization.'
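
For longer conversations, updating `chat_history` by hand after every question gets tedious. One possible approach is a small helper that queries the chain and records the exchange in one step. The sketch below uses a stand-in `fake_chain` so it runs without the model; both `ask` and `fake_chain` are hypothetical names, and the real `chain` object from this notebook could be passed in place of the stand-in.

```python
def ask(chain, question, chat_history):
    """Query the chain, then append the (question, answer) pair to chat_history."""
    result = chain({"question": question, "chat_history": chat_history})
    chat_history.append((question, result["answer"]))
    return result["answer"]


# Stand-in for the real chain so the sketch runs without a model (hypothetical).
def fake_chain(inputs):
    turns = len(inputs["chat_history"])
    return {"answer": f"answer to '{inputs['question']}' after {turns} earlier turn(s)"}


history = []
first = ask(fake_chain, "What is a lakehouse?", history)
second = ask(fake_chain, "How is it governed?", history)

print(first)
print(second)
print(len(history))
```

Unlike the cell above, which replaces the history with only the latest exchange, this variant keeps every previous turn, so later follow-up questions can refer back further in the conversation.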

## Finally…

And there you have it! You now possess the capability to perform question-answering on your own data using a powerful language model. This setup not only achieves your immediate goals but also lays the groundwork for future developments, such as transforming it into a chatbot application using Streamlit. The journey towards advanced applications with these tools has just begun.
