# Improving a Fine-tuned Model Using RAG


### Installations

In [None]:
!pip install llama-index
!pip install llama-index-embeddings-huggingface
!pip install peft
!pip install auto-gptq
!pip install optimum
!pip install bitsandbytes


In [None]:
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.core import Settings, SimpleDirectoryReader, VectorStoreIndex
from llama_index.core.retrievers import VectorIndexRetriever
from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.core.postprocessor import SimilarityPostprocessor

### Define Settings

In [None]:
# import any embedding model on HF hub (https://huggingface.co/spaces/mteb/leaderboard)
Settings.embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5") # FlagEmbedding

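Settings.llm = None # no LLM inside LlamaIndex; it is used for retrieval only
Settings.chunk_size = 256 # maximum tokens per chunk
Settings.chunk_overlap = 25 # tokens shared between consecutive chunks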

LLM is explicitly disabled. Using MockLLM.
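
To make the `chunk_size` and `chunk_overlap` settings concrete, here is a minimal sketch of fixed-size chunking with overlap. It is an illustration only, not LlamaIndex's actual splitter (which also respects sentence boundaries); the function name `chunk_with_overlap` and the toy token list are assumptions made for this example.

```python
def chunk_with_overlap(tokens, chunk_size=256, chunk_overlap=25):
    """Split a token list into windows of chunk_size tokens.

    Consecutive windows share chunk_overlap tokens, so text that falls
    on a chunk boundary still appears whole in at least one chunk.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end of the text
    return chunks


# toy example: placeholder "tokens" stand in for real tokenizer output
tokens = [f"w{i}" for i in range(600)]
chunks = chunk_with_overlap(tokens)
print(len(chunks), [len(c) for c in chunks])
```

A smaller `chunk_size` gives more precise retrieval hits but less surrounding context per hit; the overlap trades some duplicated storage for fewer ideas cut in half.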


### Read and Store Docs into Vector DB

In [None]:
# load pdfs
documents = SimpleDirectoryReader("/content/pdfs").load_data()

In [None]:
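# some ad hoc document refinement: drop pages that contain boilerplate phrases
print(len(documents))

unwanted_phrases = ["Member-only story", "The Data Entrepreneurs", " min read"]

# build a new list instead of calling remove() inside the loop,
# which skips elements and can try to remove the same doc twice
documents = [
    doc for doc in documents
    if not any(phrase in doc.text for phrase in unwanted_phrases)
]

print(len(documents))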

71
61


VectorDB

In [None]:
# store docs into vector DB
index = VectorStoreIndex.from_documents(documents)

In [None]:
# set number of docs to retrieve
top_k = 3

# configure retriever
retriever = VectorIndexRetriever(
    index=index,
    similarity_top_k=top_k,
)

In [None]:
# assemble query engine
query_engine = RetrieverQueryEngine(
    retriever=retriever,
    node_postprocessors=[SimilarityPostprocessor(similarity_cutoff=0.5)],
)
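
The retriever and postprocessor above combine two filters: keep the `top_k` most similar chunks, then drop any whose score falls below `similarity_cutoff`. The sketch below reproduces that logic in plain Python, assuming cosine similarity as the score. The function names and the 2-D toy vectors are hypothetical and are not part of the LlamaIndex API.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve(query_vec, chunk_vecs, top_k=3, similarity_cutoff=0.5):
    """Return (chunk_index, score) pairs: top_k by score, then cutoff-filtered."""
    scored = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(chunk_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    top = scored[:top_k]  # step 1: keep the top_k most similar chunks
    return [(i, s) for i, s in top if s >= similarity_cutoff]  # step 2: cutoff


# toy embeddings: only the first two chunks point in the query's direction
chunk_vecs = [[1.0, 0.0], [0.8, 0.6], [0.0, 1.0], [-1.0, 0.0]]
query_vec = [1.0, 0.0]

hits = retrieve(query_vec, chunk_vecs)
print(hits)
```

Note that the result can contain fewer than `top_k` entries, because the cutoff is applied after the top-k selection. Any code that consumes the retrieved nodes should iterate over what was returned rather than assume exactly `top_k` items.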

### Retrieve Relevant Docs

In [None]:
# query documents
query = "What is Log-Log Approach?"
response = query_engine.query(query)

In [None]:
# reformat response
# the similarity cutoff can leave fewer than top_k nodes, so iterate over what was returned
context = "Context:\n"
for node in response.source_nodes:
    context = context + node.text + "\n\n"

print(context)

Context:
A popular way of fitting a Power Law to real-world data is what I’ll call the
“Log-Log approach” [1]. The idea comes from taking the logarithm of the
Power Law’s probability density function (PDF), as derived below.
Taking the log of Power Law probability distribution function [2]. Image by author.
The above derivation translates the Power Law’s PDF definition into a linear
equation, as shown in the figure below.

Highlight the linear form of the log(PDF). Image by author.
This implies that the histogram of data following a power law will follow a
straight line. In practice, what this looks like is generating a histogram for
some data and plotting it on a log-log plot [1]. One might go even further and
perform a linear regression to estimate the distribution’s α  value (here, α  = -
m+1).
However, there are significant limitations to this approach. These are
described in reference [1] and summarized below.
Slope (hence α ) estimations are subject to systematic errors
Regressio

### Import LLM

In [None]:
!unzip "/content/checkpoint.zip" -d "/content/mistral-ft"

Archive:  /content/checkpoint.zip
   creating: /content/mistral-ft/content/mistral-ft/checkpoint-10/
  inflating: /content/mistral-ft/content/mistral-ft/checkpoint-10/adapter_config.json  
  inflating: /content/mistral-ft/content/mistral-ft/checkpoint-10/rng_state.pth  
  inflating: /content/mistral-ft/content/mistral-ft/checkpoint-10/README.md  
  inflating: /content/mistral-ft/content/mistral-ft/checkpoint-10/special_tokens_map.json  
  inflating: /content/mistral-ft/content/mistral-ft/checkpoint-10/trainer_state.json  
  inflating: /content/mistral-ft/content/mistral-ft/checkpoint-10/adapter_model.safetensors  
  inflating: /content/mistral-ft/content/mistral-ft/checkpoint-10/scheduler.pt  
  inflating: /content/mistral-ft/content/mistral-ft/checkpoint-10/training_args.bin  
  inflating: /content/mistral-ft/content/mistral-ft/checkpoint-10/chat_template.jinja  
  inflating: /content/mistral-ft/content/mistral-ft/checkpoint-10/tokenizer.json  
  inflating: /content/mistral-ft/content

In [14]:
# load the base model from the hub, then attach the fine-tuned adapter from the local checkpoint
from peft import PeftModel, PeftConfig
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "TheBloke/Mistral-7B-Instruct-v0.2-GPTQ"
adapter_path = "/content/mistral-ft/content/mistral-ft/checkpoint-10"

# quantized Mistral 7B Instruct base model
model = AutoModelForCausalLM.from_pretrained(model_name,
                                             device_map="auto",
                                             trust_remote_code=False,
                                             revision="main")
# fine-tuned adapter weights on top of the base model
config = PeftConfig.from_pretrained(adapter_path)
model = PeftModel.from_pretrained(model, adapter_path)

# load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)


### Use LLM

In [15]:
# prompt (no context)
instructions_string = """MistralBot, functioning as a virtual data science consultant on YouTube, communicates in clear, accessible language, escalating to technical depth upon request. \
It reacts to feedback aptly and ends responses with its signature '- MistralBot'. \
MistralBot will tailor the length of its responses to match the viewer's comment, providing concise acknowledgments to brief expressions of gratitude or feedback, \
thus keeping the interaction natural and engaging.

Please respond to the following comment.
"""
prompt_template = lambda comment: f'''[INST] {instructions_string} \n{comment} \n[/INST]'''

In [16]:
comment = "What is Log-Log Approach?"
prompt = prompt_template(comment)
print(prompt)

[INST] MistralBot, functioning as a virtual data science consultant on YouTube, communicates in clear, accessible language, escalating to technical depth upon request. It reacts to feedback aptly and ends responses with its signature '- MistralBot'. MistralBot will tailor the length of its responses to match the viewer's comment, providing concise acknowledgments to brief expressions of gratitude or feedback, thus keeping the interaction natural and engaging.

Please respond to the following comment.
 
What is Log-Log Approach? 
[/INST]


In [17]:
model.eval()

# move input_ids and attention_mask to the GPU and pass both to generate()
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=280, pad_token_id=tokenizer.eos_token_id)

print(tokenizer.batch_decode(outputs)[0])


<s> [INST] MistralBot, functioning as a virtual data science consultant on YouTube, communicates in clear, accessible language, escalating to technical depth upon request. It reacts to feedback aptly and ends responses with its signature '- MistralBot'. MistralBot will tailor the length of its responses to match the viewer's comment, providing concise acknowledgments to brief expressions of gratitude or feedback, thus keeping the interaction natural and engaging.

Please respond to the following comment.
 
What is Log-Log Approach? 
[/INST] The Log-Log approach is a method used in data analysis, particularly in the context of modeling data that follows a power law distribution. Power law distributions are characterized by the presence of a heavy tail, meaning that a small number of data points contribute disproportionately to the total, while the majority of data points contribute relatively little.

In the Log-Log approach, data is plotted on a log-log scale. This transformation can h
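
The decoded output above repeats the full prompt before the model's reply. A small helper can strip the prompt so only the reply remains. This is a sketch: the name `extract_reply` and the shortened sample string are made up for illustration, and it assumes the prompt ends with `[/INST]` and that `</s>` is the end-of-sequence token, as in the output shown here.

```python
def extract_reply(decoded: str) -> str:
    """Return the text after the last [/INST] marker, without the EOS token."""
    reply = decoded.rsplit("[/INST]", 1)[-1]  # keep only what follows the prompt
    return reply.replace("</s>", "").strip()


# shortened, hypothetical stand-in for tokenizer.batch_decode(outputs)[0]
decoded = (
    "<s> [INST] Please respond to the following comment.\n"
    "What is Log-Log Approach? \n"
    "[/INST] The Log-Log approach is a method used in data analysis.</s>"
)
print(extract_reply(decoded))
```

If the marker is missing, the helper returns the whole string (stripped), so it is safe to call on any decoded output.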

In [18]:
# prompt (with context)
prompt_template_w_context = lambda context, comment: f"""[INST]MistralBot, functioning as a virtual data science consultant on YouTube, communicates in clear, accessible language, escalating to technical depth upon request. \
It reacts to feedback aptly and ends responses with its signature '- MistralBot'. \
MistralBot will tailor the length of its responses to match the viewer's comment, providing concise acknowledgments to brief expressions of gratitude or feedback, \
thus keeping the interaction natural and engaging.

{context}
Please respond to the following comment. Use the context above if it is helpful.

{comment}
[/INST]
"""

In [19]:
print(context)

Context:
A popular way of fitting a Power Law to real-world data is what I’ll call the
“Log-Log approach” [1]. The idea comes from taking the logarithm of the
Power Law’s probability density function (PDF), as derived below.
Taking the log of Power Law probability distribution function [2]. Image by author.
The above derivation translates the Power Law’s PDF definition into a linear
equation, as shown in the figure below.

Highlight the linear form of the log(PDF). Image by author.
This implies that the histogram of data following a power law will follow a
straight line. In practice, what this looks like is generating a histogram for
some data and plotting it on a log-log plot [1]. One might go even further and
perform a linear regression to estimate the distribution’s α  value (here, α  = -
m+1).
However, there are significant limitations to this approach. These are
described in reference [1] and summarized below.
Slope (hence α ) estimations are subject to systematic errors
Regressio

In [20]:
# the context comes from the vector DB; the comment is the query
prompt = prompt_template_w_context(context, comment)

# move input_ids and attention_mask to the GPU and pass both to generate()
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=280, pad_token_id=tokenizer.eos_token_id)

print(tokenizer.batch_decode(outputs)[0])


<s> [INST]MistralBot, functioning as a virtual data science consultant on YouTube, communicates in clear, accessible language, escalating to technical depth upon request. It reacts to feedback aptly and ends responses with its signature '- MistralBot'. MistralBot will tailor the length of its responses to match the viewer's comment, providing concise acknowledgments to brief expressions of gratitude or feedback, thus keeping the interaction natural and engaging.

Context:
A popular way of fitting a Power Law to real-world data is what I’ll call the
“Log-Log approach” [1]. The idea comes from taking the logarithm of the
Power Law’s probability density function (PDF), as derived below.
Taking the log of Power Law probability distribution function [2]. Image by author.
The above derivation translates the Power Law’s PDF definition into a linear
equation, as shown in the figure below.

Highlight the linear form of the log(PDF). Image by author.
This implies that the histogram of data follo

In [21]:
# RAG (with context) captures my explanation of the Log-Log approach much better than the no-context response generated by the fine-tuned model.
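
The steps in this notebook (retrieve chunks, build the context string, fill the prompt, generate) can be chained into one helper. The sketch below takes the retrieval and generation steps as plain callables so it runs without a GPU. `answer_with_rag`, `fake_retrieve`, and `echo_generate` are hypothetical names; in the notebook, the two callables would wrap `query_engine.query` and `model.generate` respectively.

```python
def answer_with_rag(comment, retrieve_texts, generate, instructions):
    """Retrieve context for a comment, build the prompt, and generate a reply.

    retrieve_texts: callable mapping a query string to a list of chunk texts
    generate:       callable mapping a prompt string to the model's output
    instructions:   system-style instructions placed at the start of the prompt
    """
    texts = retrieve_texts(comment)
    context = "Context:\n" + "".join(text + "\n\n" for text in texts)

    # mirrors the layout of the with-context prompt used in this notebook
    prompt = (
        f"[INST]{instructions}\n\n"
        f"{context}\n"
        "Please respond to the following comment. "
        "Use the context above if it is helpful.\n\n"
        f"{comment}\n[/INST]\n"
    )
    return generate(prompt)


# stand-ins so the sketch runs anywhere: a fixed retriever and an echoing "model"
def fake_retrieve(query):
    return ["chunk about power laws", "chunk about log-log plots"]


def echo_generate(prompt):
    return prompt


result = answer_with_rag(
    "What is Log-Log Approach?", fake_retrieve, echo_generate, "You are MistralBot."
)
print(result)
```

Passing the retriever and generator in as arguments keeps the prompt-building logic easy to test on its own and makes it simple to swap in a different retriever or model later.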