## My RAG Project - Rohan Duvur

#### Following a tutorial by Madhav Thaker to build a RAG pipeline with Mistral-7B and LangChain

### Introduction

#### What are RAGs?

*R* - Retrieval

*A* - Augmented

*G* - Generation

A widely used architecture that lets AI applications pull information from external data sources so they can answer queries **accurately** as well as **transparently**.

##### Why is transparency important?

Users can place more trust in responses generated by LLMs that follow the RAG framework. Not only do users get a general understanding of how the model works, they are also given access to the sources the model used to answer their query. In this project, users supply the links they want answers drawn from, to help them with their homework.

You can learn more about RAGs [here.](https://research.ibm.com/blog/retrieval-augmented-generation-RAG)
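To make the three letters concrete, here is a toy sketch in plain Python. It is not how this notebook works: keyword overlap stands in for embedding search, and a fake function stands in for Mistral-7B; the corpus and all function names are hypothetical. What it does show is the transparency point above, since the answer comes back together with the IDs of the sources it was built from.

```python
import re
from typing import Callable


def tokenize(text: str) -> set[str]:
    """Lower-cased word set; a crude stand-in for real embeddings."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query: str, corpus: dict[str, str], k: int = 2) -> list[str]:
    """Retrieval: return the IDs of the k documents sharing the most words with the query."""
    query_words = tokenize(query)
    ranked = sorted(
        corpus,
        key=lambda doc_id: len(query_words & tokenize(corpus[doc_id])),
        reverse=True,
    )
    return ranked[:k]


def augment(query: str, corpus: dict[str, str], doc_ids: list[str]) -> str:
    """Augmentation: prepend the retrieved passages, each labelled with its source ID, to the question."""
    context = "\n".join(f"[{doc_id}] {corpus[doc_id]}" for doc_id in doc_ids)
    return f"Context:\n{context}\n\nQuestion: {query}"


def rag_answer(query: str, corpus: dict[str, str], generate: Callable[[str], str], k: int = 2) -> dict:
    """Generation: run the model on the augmented prompt and return the answer with its sources."""
    doc_ids = retrieve(query, corpus, k)
    prompt = augment(query, corpus, doc_ids)
    return {"answer": generate(prompt), "sources": doc_ids}


# Hypothetical mini-corpus
corpus = {
    "chrome-help": "Restart Chrome to apply updates.",
    "rag-tips": "Chunk size and overlap affect retrieval quality.",
    "phil-guide": "A philosophy paper defends a clear thesis.",
}


def fake_llm(prompt: str) -> str:
    """Stand-in for the LLM: simply echoes the first context line."""
    return prompt.splitlines()[1]


result = rag_answer("How does chunk size affect retrieval?", corpus, fake_llm, k=1)
print(result)
# {'answer': '[rag-tips] Chunk size and overlap affect retrieval quality.', 'sources': ['rag-tips']}
```

In the real notebook the same three steps are handled by a FAISS retriever, a prompt template, and Mistral-7B.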

#### Mistral-7B

Mistral-7B is the language model we'll be using in this notebook. It is an LLM that uses sliding-window attention (each token attends only to a fixed number of preceding tokens) and processes long prompts in chunks, which keeps inference cheap and efficient. [This](https://labellerr.com/blog/mistral-7b-potential-by-mistral-ai/#:~:text=In%20sequence%20generation%2C%20Mistral%207B,the%20cache%20segment%20by%20segment.) is a good read on the subject.
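Sliding-window attention limits each token to a fixed window of preceding positions (the Mistral 7B paper uses W = 4096), so the attention cost per token stays bounded as the sequence grows, while information from further back can still propagate indirectly through the stacked layers. The sketch below builds the boolean attention mask for a tiny window. It only illustrates the masking pattern and is not Mistral's implementation; the function name is hypothetical, and in this sketch a token sees itself plus the `window - 1` positions before it.

```python
def sliding_window_mask(seq_len: int, window: int) -> list[list[bool]]:
    """mask[i][j] is True when query position i may attend to key position j.

    The mask is causal (j <= i) and limited to the most recent `window`
    positions (i - j < window).
    """
    return [[0 <= i - j < window for j in range(seq_len)] for i in range(seq_len)]


mask = sliding_window_mask(seq_len=6, window=3)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
# x.....
# xx....
# xxx...
# .xxx..
# ..xxx.
# ...xxx
```

A plain causal mask would fill the whole lower triangle; the window keeps each row to at most `window` allowed positions.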

#### LangChain

A popular Python and JavaScript framework that lets developers easily split large corpora into chunks and plug in a vast number of language models. It does a lot more, but that is the bulk of what we use it for here. Read more [here](https://www.ibm.com/topics/langchain#:~:text=LangChain%20is%20an%20open%20source,like%20chatbots%20and%20virtual%20agents.).
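In this notebook LangChain's text splitter cuts the scraped pages into chunks before they are embedded. As a rough illustration of what chunking means, the sketch below cuts text into fixed-size character windows with an optional overlap. It is a simplification with a hypothetical function name, not LangChain's algorithm: `CharacterTextSplitter` first splits on a separator (`"\n\n"` by default) and then merges the pieces back together up to `chunk_size`, so its chunks follow paragraph boundaries.

```python
def split_into_chunks(text: str, chunk_size: int, chunk_overlap: int = 0) -> list[str]:
    """Cut text into windows of at most chunk_size characters.

    Consecutive windows share chunk_overlap characters, so text near a
    boundary also appears in the neighbouring chunk.
    """
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    if not text:
        return []
    step = chunk_size - chunk_overlap
    # Start a new window only while the previous one has not reached the end of the text
    return [text[start:start + chunk_size] for start in range(0, max(len(text) - chunk_overlap, 1), step)]


print(split_into_chunks("abcdefghij", chunk_size=4))                   # ['abcd', 'efgh', 'ij']
print(split_into_chunks("abcdefghij", chunk_size=4, chunk_overlap=1))  # ['abcd', 'defg', 'ghij']
```

Overlap trades a little extra storage for a lower chance that an answer is split across two chunks and lost to retrieval.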







### Setting Up Dependencies

In [2]:
!pip install -q torch datasets
!pip install -q accelerate==0.21.0 \
                peft==0.4.0 \
                bitsandbytes==0.40.2 \
                -U transformers==4.35.0 \
                trl==0.4.7 \
                langchain \
                playwright \
                html2text \
                sentence_transformers \
                faiss-gpu

[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m507.1/507.1 kB[0m [31m6.9 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m115.3/115.3 kB[0m [31m9.2 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m134.8/134.8 kB[0m [31m12.8 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m244.2/244.2 kB[0m [31m5.1 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m72.9/72.9 kB[0m [31m9.3 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m92.5/92.5 MB[0m [31m10.8 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m7.9/7.9 MB[0m [31m25.4 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m77.4/77.4 kB[0m [31m10.9 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━

In [3]:
import os
import torch
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    pipeline,
    TransfoXLTokenizer,
    TransfoXLLMHeadModel
)
from datasets import load_dataset
from peft import LoraConfig, PeftModel

from langchain.text_splitter import CharacterTextSplitter
from langchain.document_transformers import Html2TextTransformer
from langchain.document_loaders import AsyncChromiumLoader

from langchain.embeddings.huggingface import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS

from langchain.prompts import PromptTemplate
from langchain.schema.runnable import RunnablePassthrough
from langchain.llms import HuggingFacePipeline
from langchain.chains import LLMChain

#### From Hugging Face, we download the Mistral-7B Instruct model and its tokenizer, loading the model in 4-bit precision

In [4]:

#################################################################
# Tokenizer
#################################################################

model_name='mistralai/Mistral-7B-Instruct-v0.1'

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

#################################################################
# bitsandbytes parameters
#################################################################

# Activate 4-bit precision base model loading
use_4bit = True

# Compute dtype for 4-bit base models
bnb_4bit_compute_dtype = "float16"

# Quantization type (fp4 or nf4)
bnb_4bit_quant_type = "nf4"

# Activate nested quantization for 4-bit base models (double quantization)
use_nested_quant = False

#################################################################
# Set up quantization config
#################################################################
compute_dtype = getattr(torch, bnb_4bit_compute_dtype)

bnb_config = BitsAndBytesConfig(
    load_in_4bit=use_4bit,
    bnb_4bit_quant_type=bnb_4bit_quant_type,
    bnb_4bit_compute_dtype=compute_dtype,
    bnb_4bit_use_double_quant=use_nested_quant,
)

if compute_dtype == torch.float16 and use_4bit:
    major, _ = torch.cuda.get_device_capability()
    if major >= 8:
        print("=" * 80)
        print("Your GPU supports bfloat16: accelerate training with bf16=True")
        print("=" * 80)

#################################################################
# Load the pre-trained model with 4-bit quantization
#################################################################
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config
)

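The `BitsAndBytesConfig` in the cell above stores the weights in 4-bit NF4 ("NormalFloat"), whose 16 levels sit at quantiles of a normal distribution and are rescaled per block of weights by that block's absolute maximum. The sketch below shows the simpler idea underneath, blockwise absmax quantization to evenly spaced integer levels from -7 to 7. The function names and the even level spacing are assumptions for illustration, so this is not NF4 itself; the default block size of 64 follows the QLoRA paper that introduced NF4.

```python
def quantize_absmax_4bit(weights: list[float], block_size: int = 64) -> tuple[list[int], list[float]]:
    """Map each weight to an integer code in [-7, 7] (15 levels fit in 4 bits), one float scale per block."""
    codes, scales = [], []
    for start in range(0, len(weights), block_size):
        block = weights[start:start + block_size]
        scale = max(abs(w) for w in block) or 1.0  # all-zero block: any scale works, avoid dividing by 0
        scales.append(scale)
        codes.extend(round(w / scale * 7) for w in block)
    return codes, scales


def dequantize_absmax_4bit(codes: list[int], scales: list[float], block_size: int = 64) -> list[float]:
    """Approximate reconstruction: code / 7 times the scale of the block the code belongs to."""
    return [code / 7 * scales[i // block_size] for i, code in enumerate(codes)]


weights = [0.5, -1.0, 0.25, 0.0]
codes, scales = quantize_absmax_4bit(weights, block_size=2)
print(codes, scales)  # [4, -7, 7, 0] [1.0, 0.25]
print(dequantize_absmax_4bit(codes, scales, block_size=2))
```

Each weight now needs 4 bits plus a small share of its block's scale. With `use_nested_quant=True`, the per-block scales are themselves quantized too (double quantization), saving a little more memory.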

In [5]:
def print_number_of_trainable_model_parameters(model):
    trainable_model_params = 0
    all_model_params = 0
    for _, param in model.named_parameters():
        all_model_params += param.numel()
        if param.requires_grad:
            trainable_model_params += param.numel()
    return f"trainable model parameters: {trainable_model_params}\nall model parameters: {all_model_params}\npercentage of trainable model parameters: {100 * trainable_model_params / all_model_params:.2f}%"

print(print_number_of_trainable_model_parameters(model))

trainable model parameters: 262410240
all model parameters: 3752071168
percentage of trainable model parameters: 6.99%
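The "all model parameters" figure (about 3.75 billion) is roughly half of Mistral-7B's real size. A likely explanation, which the arithmetic below reproduces exactly, is that bitsandbytes packs two 4-bit weights into each `uint8` storage element, so `numel()` on a quantized weight reports half its logical size, while the "trainable" parameters are the layers left in float16 (input embeddings, `lm_head` and the RMSNorm weights). The hyperparameters are taken from the published Mistral-7B configuration; the variable names are only for illustration.

```python
# Published Mistral-7B architecture hyperparameters (config.json)
vocab_size = 32_000
hidden_size = 4_096
intermediate_size = 14_336
num_layers = 32
num_heads = 32
num_kv_heads = 8                          # grouped-query attention
head_dim = hidden_size // num_heads       # 128
kv_dim = num_kv_heads * head_dim          # 1024

# Linear projections inside each decoder layer, which bitsandbytes swaps for 4-bit versions
linear_per_layer = (
    2 * hidden_size * hidden_size          # q_proj, o_proj
    + 2 * hidden_size * kv_dim             # k_proj, v_proj
    + 3 * hidden_size * intermediate_size  # gate_proj, up_proj, down_proj
)
quantized = num_layers * linear_per_layer

# Weights kept in float16: embeddings, lm_head, two RMSNorms per layer plus the final norm
unquantized = 2 * vocab_size * hidden_size + (2 * num_layers + 1) * hidden_size

true_total = quantized + unquantized
reported_total = quantized // 2 + unquantized  # two 4-bit values per uint8 element

print(true_total)      # 7241732096
print(reported_total)  # 3752071168
print(unquantized)     # 262410240
```

Both printed counts from the cell above come out exactly, and the true total matches Mistral-7B's roughly 7.24 billion parameters.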


#### Constructing a pipeline

The pipeline uses the Mistral-7B model and tokenizer for text generation. A repetition penalty of 1.1 discourages the model from repeating itself, a temperature of 0.2 keeps answers focused (it only takes effect when sampling is enabled with `do_sample=True`), and each response is capped at 1000 new tokens.

In [6]:
text_generation_pipeline = pipeline(
    model=model,
    tokenizer=tokenizer,
    task="text-generation",
    do_sample=True,  # without sampling, temperature has no effect
    temperature=0.2,
    repetition_penalty=1.1,
    return_full_text=True,
    max_new_tokens=1000,
)
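Both generation settings in the pipeline cell above act on the model's next-token scores (logits). The sketch below, in plain Python with hypothetical function names, mirrors the CTRL-style repetition penalty that Hugging Face's `RepetitionPenaltyLogitsProcessor` applies (positive logits of already-generated tokens are divided by the penalty, negative ones multiplied) and the usual temperature scaling before softmax. In `transformers`, `temperature` only changes the output when sampling is enabled (`do_sample=True`); under greedy decoding the top-scoring token is picked regardless.

```python
import math


def apply_repetition_penalty(logits: list[float], generated_ids: list[int], penalty: float) -> list[float]:
    """Make every already-generated token less likely.

    Dividing positive logits and multiplying negative ones both push the
    score downward when penalty > 1.
    """
    adjusted = list(logits)
    for token_id in set(generated_ids):
        score = adjusted[token_id]
        adjusted[token_id] = score / penalty if score > 0 else score * penalty
    return adjusted


def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    """Divide logits by the temperature, then softmax; T < 1 sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, -1.0]
penalized = apply_repetition_penalty(logits, generated_ids=[0, 2], penalty=1.1)
print(penalized)  # roughly [1.82, 1.0, -1.1]; token 1 was never generated, so it is untouched

print(softmax_with_temperature(logits, 1.0))
print(softmax_with_temperature(logits, 0.2))  # probability mass concentrates on the top token
```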

In [7]:
mistral_llm = HuggingFacePipeline(pipeline=text_generation_pipeline)

#### What is Playwright?

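Playwright is a browser automation framework that is often used for web scraping. By default it launches a "headless" browser, meaning one with no visible window, which is controlled programmatically through Playwright's API. LangChain's `AsyncChromiumLoader`, used below, relies on it to load pages in headless Chromium. More info [here](https://learn.microsoft.com/en-us/microsoft-edge/playwright/).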

In [8]:
!playwright install
!playwright install-deps

Downloading Chromium 120.0.6099.28 (playwright build v1091) from https://playwright.azureedge.net/builds/chromium/1091/chromium-linux.zip

In [1]:
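import nest_asyncio
from urllib.parse import urlparse

from langchain.document_loaders import AsyncChromiumLoader

# Let AsyncChromiumLoader's asyncio code run inside Jupyter's already-running event loop
nest_asyncio.apply()

# URLs of the pages to index
articles = []

while True:
    current_info_source = input("Give a valid URL to scrape for answers. If done adding sources, type 'done'.").strip()

    if current_info_source == 'done':
        print('done.')
        break

    # Accept only absolute http(s) URLs that name a host
    parsed = urlparse(current_info_source)
    if parsed.scheme in ('http', 'https') and parsed.netloc:
        articles.append(current_info_source)
        print('added.')
    else:
        print("Not a valid URL.")

# Scrape every page collected above in headless Chromium (one Document of raw HTML per URL)
loader = AsyncChromiumLoader(articles)
docs = loader.load()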

Give a valid URL to scrape for answers. If done adding sources, type 'done'.https://support.google.com/chrome/answer/142063?co=GENIE.Platform%3DDesktop&hl=en#zippy=%2Crestart-chrome
added.
Give a valid URL to scrape for answers. If done adding sources, type 'done'.https://towardsdatascience.com/9-effective-techniques-to-boost-retrieval-augmented-generation-rag-systems-210ace375049
added.
Give a valid URL to scrape for answers. If done adding sources, type 'done'.https://philosophy.fas.harvard.edu/files/phildept/files/brief_guide_to_writing_philosophy_paper.pdf
added.
Give a valid URL to scrape for answers. If done adding sources, type 'done'.done
done.


NameError: ignored

In [26]:
# Converts HTML to plain text
html2text = Html2TextTransformer()
docs_transformed = html2text.transform_documents(docs)

# Chunk text
text_splitter = CharacterTextSplitter(chunk_size=100,
                                      chunk_overlap=0)
chunked_documents = text_splitter.split_documents(docs_transformed)

# Load chunked documents into the FAISS index
db = FAISS.from_documents(chunked_documents,
                          HuggingFaceEmbeddings(model_name='sentence-transformers/all-mpnet-base-v2'))

retriever = db.as_retriever()
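`FAISS.from_documents` embeds every chunk with `all-mpnet-base-v2` (768-dimensional vectors). LangChain's FAISS wrapper defaults to a flat L2 index, which compares the query embedding against every stored vector, and `as_retriever()` returns the 4 closest chunks unless told otherwise. The sketch below reproduces that exact search in plain Python on hypothetical 3-dimensional toy vectors; the function names are illustrative, and real embeddings would come from the sentence-transformers model.

```python
import math


def l2_distance(a: list[float], b: list[float]) -> float:
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def search_flat_l2(query_vec: list[float], chunk_vecs: list[list[float]], k: int = 4) -> list[int]:
    """Exact nearest-neighbour search: score every stored vector, return indices of the k closest."""
    ranked = sorted(range(len(chunk_vecs)), key=lambda i: l2_distance(query_vec, chunk_vecs[i]))
    return ranked[:k]


# Hypothetical toy embeddings for four chunks
chunk_vecs = [
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.5, 0.5, 0.0],
    [0.0, 0.0, 1.0],
]
print(search_flat_l2([1.0, 0.0, 0.0], chunk_vecs, k=2))  # [0, 2]
```

A flat index is exact but scans every vector; for a handful of scraped pages that cost is negligible.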



In [27]:

prompt_template = """
### [INST] Instruction: Answer the question based on your knowledge as a college teaching assistant. Here is context to help:

{context}

### QUESTION:
{question} [/INST]
 """

# Create prompt from prompt template
prompt = PromptTemplate(
    input_variables=["context", "question"],
    template=prompt_template,
)

# Create llm chain
llm_chain = LLMChain(llm=mistral_llm, prompt=prompt)
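`PromptTemplate` fills `{context}` and `{question}` much like Python's `str.format`. When the retriever is wired straight into the chain, `{context}` receives the string form of a list of `Document` objects, reprs and metadata included. One common alternative is to join the chunks' text first, for example with `{"context": retriever | format_docs, ...}`. The sketch below shows that formatting step in plain Python; `Doc` is a hypothetical stand-in for LangChain's `Document`, and the URL and question are made up.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for LangChain's Document: text plus metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def format_docs(docs: list[Doc]) -> str:
    """Join retrieved chunks into plain text, tagging each with its source URL."""
    return "\n\n".join(
        f"(source: {d.metadata.get('source', 'unknown')})\n{d.page_content}" for d in docs
    )


template = """
### [INST] Instruction: Answer the question based on your knowledge as a college teaching assistant. Here is context to help:

{context}

### QUESTION:
{question} [/INST]
 """

docs = [Doc("Restart Chrome to apply updates.", {"source": "https://example.com/chrome-help"})]
filled = template.format(context=format_docs(docs), question="How do I apply a Chrome update?")
print(filled)
```

Keeping the source tag next to each chunk lets the model, and the reader, trace an answer back to the page it came from.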

In [29]:
query = input("Type your question here.")

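# Baseline: ask the question with no retrieved context (result is not printed)
llm_chain.invoke({"context": "", "question": query})

# RAG: the retriever fills {context} with the chunks most similar to the query
rag_chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | llm_chain
)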

result = rag_chain.invoke(query)

print(result['text'])
print(result['context'])

Type your question here.what makes minstrel 7b so effective?


Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.



Mistral 7B is considered highly effective due to several factors. Firstly, it outperforms established competitors such as Llama 2 13B in sequence generation tasks. Additionally, Mistral 7B excels in diverse domains, demonstrating its versatility and adaptability. The intricacies of its architecture and its ability to consistently deliver top-tier performance in critical areas make it stand out in the AI community. Overall, Mistral 7B's effectiveness can be attributed to its combination of advanced technology and exceptional performance across various domains.
[Document(page_content='### Frequently Asked Questions\n\n **1.What is Mistral 7B?**', metadata={'source': 'https://www.labellerr.com/blog/mistral-7b-potential-by-mistral-ai/#:~:text=In%20sequence%20generation%2C%20Mistral%207B,the%20cache%20segment%20by%20segment.'}), Document(page_content="Mistral 7B not only meets but exceeds this criterion by outperforming well-\nestablished competitors, including the renowned Llama 2 13B. Ho