<a href="https://colab.research.google.com/github/vedica1011/NLP_BERT_TL/blob/main/Simple_RAG_huggingface.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

RAG is a popular approach to a common problem: a powerful LLM may not know about specific content because it was absent from its training data, or it may hallucinate even when it has seen that content before. Such content may be proprietary, sensitive, or, as in this example, recent and frequently updated.

*If your data is static and doesn’t change regularly, you may consider fine-tuning a large model.* In many cases, however, fine-tuning can be costly and, when done repeatedly (e.g. to address data drift), leads to “model shift”: the model’s behavior changes in ways that are not desirable.

**RAG (Retrieval Augmented Generation)** does not require model fine-tuning. Instead, RAG works by providing an LLM with additional context that is retrieved from relevant data so that it can generate a better-informed response.
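
To make the retrieve-then-generate idea concrete, here is a toy sketch in plain Python. It is not how this notebook implements retrieval: word overlap stands in for embedding similarity, and the names `toy_words`, `toy_retrieve`, `toy_build_prompt` and the two-sentence corpus are made up for illustration.

```python
# Toy illustration of the RAG flow: retrieve relevant text, then build a prompt.
# Word overlap is an assumed stand-in for a real embedding-based retriever.

def toy_words(text):
    """Lowercase the text and split it into a set of words."""
    return set(text.lower().replace("?", " ").replace(".", " ").split())

def toy_retrieve(query, corpus, k=1):
    """Return the k corpus entries sharing the most words with the query."""
    query_words = toy_words(query)
    ranked = sorted(corpus, key=lambda doc: len(query_words & toy_words(doc)), reverse=True)
    return ranked[:k]

def toy_build_prompt(query, context_docs):
    """Place the retrieved text in front of the question."""
    context = "\n".join(context_docs)
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

toy_corpus = [
    "Adapters can be merged into the base model weights.",
    "The tokenizer splits text into subword units.",
]
toy_query = "How are adapters merged?"
toy_top_docs = toy_retrieve(toy_query, toy_corpus, k=1)
toy_prompt = toy_build_prompt(toy_query, toy_top_docs)
print(toy_prompt)
```

The LLM would then receive `toy_prompt` instead of the bare question, which is the only change RAG makes to the generation step.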

**In this example, we’ll load all of the issues (both open and closed) from the PEFT library’s repo.**


First, you need to acquire a GitHub personal access token to access the GitHub API.

In [1]:
!pip install -q torch transformers accelerate bitsandbytes langchain sentence-transformers faiss-gpu openpyxl pacmap datasets


In [2]:
from tqdm.notebook import tqdm
import pandas as pd
from typing import Optional, List, Tuple
from datasets import Dataset
import matplotlib.pyplot as plt

pd.set_option("display.max_colwidth", None)

In [3]:
from getpass import getpass

ACCESS_TOKEN = getpass("github_token")

github_token··········


**Next, we’ll load all of the issues in the huggingface/peft repo:**
1. By default, pull requests are treated as issues as well; here we exclude them from the data by setting `include_prs=False`.

2. Setting `state="all"` means we will load both open and closed issues.
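
The effect of these two options can be sketched on hand-written records. This mimics the semantics only and is not the loader's implementation; `select_issues` and `sample_items` are hypothetical. The one fact relied on is that GitHub's REST API marks pull requests in issue listings with a `pull_request` key.

```python
# Hand-written records shaped like entries from GitHub's issues endpoint.
# Only the keys needed for the illustration are included.
sample_items = [
    {"number": 1, "state": "open", "title": "Bug report"},
    {"number": 2, "state": "closed", "title": "Fix typo", "pull_request": {"url": "placeholder"}},
    {"number": 3, "state": "closed", "title": "Question about LoRA"},
]

def select_issues(items, include_prs=False, state="all"):
    """Hypothetical helper mirroring the meaning of include_prs and state."""
    selected = []
    for item in items:
        # Pull requests carry a "pull_request" key; plain issues do not.
        if not include_prs and "pull_request" in item:
            continue
        # state="all" keeps both open and closed items.
        if state != "all" and item["state"] != state:
            continue
        selected.append(item)
    return selected

print([item["number"] for item in select_issues(sample_items)])
```

With the settings used in this notebook (`include_prs=False`, `state="all"`), the pull request is dropped and both the open and the closed issue are kept.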

In [4]:
from langchain.document_loaders import GitHubIssuesLoader

loader = GitHubIssuesLoader(repo="huggingface/peft", access_token=ACCESS_TOKEN, include_prs=False, state="all")

In [5]:
%%time
docs = loader.load()

CPU times: user 3.56 s, sys: 66.6 ms, total: 3.62 s
Wall time: 17.8 s


The content of individual GitHub issues may be longer than what an embedding model can take as input. If we want to embed all of the available content, we need to chunk the documents into appropriately sized pieces.

The most common and straightforward approach to chunking is to define a fixed size of chunks and whether there should be any overlap between them. Keeping some overlap between chunks allows us to preserve some semantic context between the chunks. The recommended splitter for generic text is the `RecursiveCharacterTextSplitter`.
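
As a rough illustration of fixed-size chunking with overlap, here is a minimal character-based sketch. It is a simplification: `chunk_text` is a hypothetical helper, whereas `RecursiveCharacterTextSplitter` additionally tries to split on separators such as paragraph and line breaks before falling back to smaller units.

```python
def chunk_text(text, chunk_size, chunk_overlap):
    """Split text into pieces of at most chunk_size characters,
    where consecutive pieces share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current chunk reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks

demo_chunks = chunk_text("abcdefghijklmnopqrstuvwxyz", chunk_size=10, chunk_overlap=3)
print(demo_chunks)
# ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

Each chunk starts with the last three characters of the previous one, so text that straddles a boundary still appears intact in at least one chunk when it is shorter than the overlap.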

In [6]:
%%time
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=30)

chunked_docs = splitter.split_documents(docs)

CPU times: user 190 ms, sys: 2.82 ms, total: 193 ms
Wall time: 193 ms


In [7]:
chunked_docs[0]

Document(page_content='### System Info\r\n\r\ntransformers version: 4.38.1\r\naccelerate version: 0.26.1\r\npeft version: 0.8.2\r\npytorch version: 2.1.2\r\n\r\n### Who can help?\r\n\r\n_No response_\r\n\r\n### Information\r\n\r\n- [ ] The official example scripts\r\n- [X] My own modified scripts\r\n\r\n### Tasks\r\n\r\n- [ ] An officially supported task in the `examples` folder\r\n- [X] My own task or dataset (give details below)\r\n\r\n### Reproduction\r\n\r\n    compute_dtype = getattr(torch, "bfloat16")\r\n    bnb_config = BitsAndBytesConfig(', metadata={'url': 'https://github.com/huggingface/peft/issues/1544', 'title': 'Fail to use multi-GPUs with peft model', 'creator': 'TheoDpPro', 'created_at': '2024-03-08T08:24:46Z', 'comments': 0, 'state': 'open', 'labels': [], 'assignee': None, 'milestone': None, 'locked': False, 'number': 1544, 'is_pull_request': False})

# **Create the embeddings + retriever**

Now that the docs are all of the appropriate size, we can create a database with their embeddings.

To create document chunk embeddings we’ll use the HuggingFaceEmbeddings and the [BAAI/bge-base-en-v1.5](https://huggingface.co/BAAI/bge-base-en-v1.5) embeddings model. There are many other embeddings models available on the Hub, and you can keep an eye on the best performing ones by checking the [Massive Text Embedding Benchmark (MTEB) Leaderboard](https://huggingface.co/spaces/mteb/leaderboard).

To create the vector database, we’ll use FAISS, a library developed by Facebook AI. This library offers efficient similarity search and clustering of dense vectors, which is what we need here. FAISS is currently one of the most widely used libraries for nearest-neighbor search in massive datasets.
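
Conceptually, nearest-neighbor search over dense vectors means ranking stored vectors by their distance to a query vector. The sketch below does this exhaustively in plain Python on made-up 2-D vectors; FAISS exists to do the same job efficiently on millions of high-dimensional vectors. The helper names and data are hypothetical.

```python
import math

def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest_neighbors(query_vec, vectors, k):
    """Return the indices of the k stored vectors closest to query_vec."""
    order = sorted(range(len(vectors)), key=lambda i: l2_distance(query_vec, vectors[i]))
    return order[:k]

# Made-up 2-D "embeddings"; real embedding vectors have hundreds of dimensions.
toy_vectors = [[0.0, 1.0], [1.0, 0.0], [0.9, 0.1], [-1.0, 0.0]]
toy_query_vec = [1.0, 0.0]

top2 = nearest_neighbors(toy_query_vec, toy_vectors, k=2)
print(top2)
# [1, 2]
```

The returned indices identify which stored chunks would be handed back as the most similar to the query.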

We’ll access both the embeddings model and FAISS via the LangChain API.

In [8]:
#!pip install -q langchain

In [9]:
# install faiss-cpu or faiss-gpu based on your hardware
from langchain.vectorstores import FAISS
from langchain.embeddings import HuggingFaceEmbeddings
db = FAISS.from_documents(chunked_docs,
                          HuggingFaceEmbeddings(model_name="BAAI/bge-base-en-v1.5"))





**We need a way to retrieve the documents given an unstructured query. For that, we’ll use the `as_retriever` method with the `db` as a backbone:**

1. `search_type="similarity"` means we want to perform similarity search between the query and documents.
2. `search_kwargs={'k': 4}` instructs the retriever to return the top 4 results.

In [10]:
retriever = db.as_retriever(search_type="similarity",
                            search_kwargs={"k": 4})

In [11]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

#model_name = "HuggingFaceH4/zephyr-7b-beta"
model_name = "stabilityai/stablelm-zephyr-3b"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True, bnb_4bit_use_double_quant=True, bnb_4bit_quant_type="nf4", bnb_4bit_compute_dtype=torch.bfloat16
)

model = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=bnb_config)
tokenizer = AutoTokenizer.from_pretrained(model_name)



Finally, we have all the pieces we need to set up the LLM chain.

First, create a `text-generation` pipeline using the loaded model and its tokenizer.

Next, create a prompt template. It should follow the prompt format of the model, so if you substitute the model checkpoint, make sure to use the appropriate formatting.

In [12]:
from langchain.llms import HuggingFacePipeline
from langchain.prompts import PromptTemplate
from transformers import pipeline
from langchain_core.output_parsers import StrOutputParser

text_generation_pipeline = pipeline(
    model=model,
    tokenizer=tokenizer,
    task="text-generation",
    temperature=0.2,
    do_sample=True,
    repetition_penalty=1.1,
    return_full_text=True,
    max_new_tokens=400,
)

In [13]:
llm = HuggingFacePipeline(pipeline=text_generation_pipeline)

prompt_template = """
<|system|>
Answer the question based on your knowledge. Use the following context to help:

{context}

</s>
<|user|>
{question}
</s>
<|assistant|>

 """

prompt = PromptTemplate(
    input_variables=["context", "question"],
    template=prompt_template,
)

llm_chain = prompt | llm | StrOutputParser()
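
To see what text the model actually receives, it can help to fill the template by hand. The sketch below uses plain `str.format`, which behaves the same as the default f-string style of `PromptTemplate` for simple `{name}` placeholders; `demo_template` is a copy of the template above, and the context string is a made-up placeholder.

```python
# Copy of the notebook's template; the two placeholders are {context} and {question}.
demo_template = """
<|system|>
Answer the question based on your knowledge. Use the following context to help:

{context}

</s>
<|user|>
{question}
</s>
<|assistant|>
"""

# The context value here is a placeholder, not real retrieved text.
filled_prompt = demo_template.format(
    context="Example context retrieved from the vector store.",
    question="How do you combine multiple adapters?",
)
print(filled_prompt)
```

Printing the filled prompt is a quick way to check that the special tokens and section order match what the chosen checkpoint expects.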

In [14]:
from langchain_core.runnables import RunnablePassthrough

#retriever = db.as_retriever()

rag_chain = {"context": retriever, "question": RunnablePassthrough()} | llm_chain
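
In the chain above, the retriever's output (a list of documents) is passed to the prompt as-is, so as far as I can tell the `{context}` slot receives the string form of that list, metadata included. A common alternative is to join only the chunk texts first. The sketch below shows such a helper; `format_docs` and `StandInDocument` are hypothetical names, and the stand-in class only imitates the `page_content` and `metadata` attributes seen in the `chunked_docs[0]` output earlier.

```python
from dataclasses import dataclass, field

@dataclass
class StandInDocument:
    """Hypothetical stand-in for a LangChain Document, for illustration only."""
    page_content: str
    metadata: dict = field(default_factory=dict)

def format_docs(docs):
    """Join the text of retrieved chunks, separated by blank lines."""
    return "\n\n".join(doc.page_content for doc in docs)

sample_docs = [
    StandInDocument("First chunk.", {"number": 1}),
    StandInDocument("Second chunk.", {"number": 2}),
]
context_text = format_docs(sample_docs)
print(context_text)

# Assumed wiring in the notebook (not part of the original code):
# rag_chain = {"context": retriever | format_docs, "question": RunnablePassthrough()} | llm_chain
```

Whether this improves answers depends on the model and data; it mainly keeps metadata and object syntax out of the prompt.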

In [15]:
question = "How do you combine multiple adapters?"
question

'How do you combine multiple adapters?'

In [18]:
%%time
print(llm_chain.invoke({"context": "", "question": question}))

Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.


 Combining multiple adapters depends on the specific adapters and the programming language being used. Here's a general approach using Python as an example: 

1. Identify the adapters you want to use in your application. These could be third-party libraries, custom-built adapters, or even other adapters written by someone else.

2. Determine which adapter(s) your code already uses or can be converted to use. This will help you avoid conflicts between different adapter styles or behaviors.

3. Create a base class for your new adapter that includes all the common methods and properties of the adapters you're combining. This ensures consistency across the combined adapters and makes it easier to maintain and update your code.

4. Implement each adapter individually and make sure they adhere to the design patterns and coding standards of their respective libraries or frameworks.

5. In your main application code, use the `is_adapter` function (or similar method) to determine which adapter 

In [19]:
%%time
print(rag_chain.invoke(question))

Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.



To combine multiple adapters in the Hugging Face's Transformers library, you can use the `DynamicallyComputerModel` class along with the `Adapter` class. Here's an example of how to do it:

First, you need to create the adapters as usual using the `Adapter` class. Then, instantiate a `DynamicallyComputerModel` object, passing the name of the base model you want to use (e.g., 'distilbert-base-uncased') and the list of adapters you want to use. Make sure to include both the base model and the adapters in the list:

```python
from transformers import AdamW, DistilBERTBaseForMaskedLM, DynamicComputerModel, Adapter

# Define your adapters here
adapter1 = Adapter(name='Adapter_1', num_identical_passes=3, num_layers=6, d_model=512, dff=512)
adapter2 = Adapter(name='Adapter_2', num_identical_passes=3, num_layers=6, d_model=512, dff=512)

# List of adapters
adapters = [adapter1, adapter2]

# Instantiate the DynamicallyComputerModel
model = DynamicComputerModel(
    num_labels=2, 
    architect