<a href="https://colab.research.google.com/github/shouraykumra/LLM/blob/FineTuningLLM/FineTuning%2BRAG%2B_WordEmbedding.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
%%capture
!pip install llama-index llama-index-embeddings-huggingface peft auto-gptq optimum bitsandbytes

In [2]:
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.core import Settings, SimpleDirectoryReader, VectorStoreIndex
from llama_index.core.retrievers import VectorIndexRetriever
from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.core.postprocessor import SimilarityPostprocessor

### Word Embedding

In [3]:
Settings.embed_model = HuggingFaceEmbedding(model_name='BAAI/bge-small-en-v1.5')

Settings.llm = None
Settings.chunk_size = 256
Settings.chunk_overlap = 25

LLM is explicitly disabled. Using MockLLM.
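
To make `chunk_size` and `chunk_overlap` concrete, here is a simplified sketch of overlapping chunking. It is not LlamaIndex's own splitter, which counts tokens and prefers sentence boundaries; the hypothetical `chunk_words` helper below counts whitespace-separated words instead, purely for illustration.

```python
def chunk_words(text, chunk_size=256, chunk_overlap=25):
    """Split text into word chunks where consecutive chunks share chunk_overlap words.

    Hypothetical helper for illustration only: LlamaIndex measures chunk_size in
    tokens and prefers sentence boundaries, which this sketch ignores.
    """
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    words = text.split()
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks


# Tiny example with small numbers so the overlap is easy to see.
demo = chunk_words("a b c d e f g h i j", chunk_size=4, chunk_overlap=1)
print(demo)  # ['a b c d', 'd e f g', 'g h i j']
```

Each chunk repeats the last word of the previous one, so a sentence that straddles a chunk boundary is not cut off from its context on both sides.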


### Load the article whose text will be embedded and later passed to the model for answer generation

In [4]:
doc = SimpleDirectoryReader(input_files=["/content/Scaling.pdf"]).load_data()



In [5]:
print(len(doc))

7


In [6]:
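# Drop the first 177 characters of each page (the notebook does not say what
# they contain; presumably a header repeated on every page of the PDF).
for d in doc:
  d.text = d.text[177:]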

In [7]:
print(len(doc))

7


### Store the documents in a vector DB

In [8]:
index = VectorStoreIndex.from_documents(doc)

### Search Function

In [9]:
top_k = 2
retriever = VectorIndexRetriever(
    index=index,
    similarity_top_k=top_k
)

In [10]:
query_engine = RetrieverQueryEngine(
    retriever=retriever,
    node_postprocessors=[SimilarityPostprocessor(similarity_cutoff=0.5)],
)
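
As a rough illustration of what the retriever and the similarity postprocessor do together, the sketch below ranks chunks by cosine similarity to the query, keeps the best `top_k`, and then drops anything scoring below the cutoff. The `cosine_similarity` and `retrieve` helpers and the 2-D vectors are made up for this example; the real index works on 384-dimensional `bge-small-en-v1.5` embeddings and its internals may differ.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query_vec, chunk_vecs, top_k=2, similarity_cutoff=0.5):
    """Return (chunk_index, score) pairs: the top_k best matches at or above the cutoff."""
    scored = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(chunk_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    top = scored[:top_k]
    return [(i, s) for i, s in top if s >= similarity_cutoff]


# Made-up 2-D "embeddings" standing in for real chunk vectors.
chunks = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
hits = retrieve([1.0, 0.0], chunks)
print([i for i, _ in hits])  # [0, 2]
```

Because the cutoff is applied after the top-k selection, the result can contain fewer than `top_k` entries, which matters for any code that indexes the results by position.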

In [11]:
query = 'what is lightning module?'
response = query_engine.query(query)
context = "Context:\n"

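# Iterate over the nodes actually returned: the similarity cutoff can leave
# fewer than top_k nodes, and indexing by range(top_k) would then raise IndexError.
for node in response.source_nodes:
  context = context + node.text + '\n\n'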

print(context)

Context:
With its battle-tested Trainerand LightningModule, PyTorch Lightning assists in preventing human errors by optimizingand managing the engineering of the model training process.Table of ContentsTraining Billion Parameter Large ModelsCUDA Out of Memory with Large (Language) ModelsOptimizing Large Model TrainingTraining on a single GPUSharding the training on multiple GPUsActivation CheckpointingCPU OffloadingTraining Llama 7BConclusionRelated ContentLightning AI Joins AI Alliance To Advance Open, Safe, Responsible AI
Doubling Neural Network Finetuning Efficiency with 16-bit PrecisionTechniques
Finetuning LLMs with LoRA and QLoRA: Insights from Hundreds ofExperimentsScaling Large (Language) Models with PyTorch LightningPosted on October 4, 2023 by Aniket Maurya & Aniket Maurya &   -   Blog, Tutorials← BACK TO BLOGLightning AI Studios: Never set up a local environment again →StudiosDocsReleasesCommunityProductsSolutionsAboutPricingLoginStart free

But first, we’ll delveinto the FS

In [12]:
# load fine-tuned model from hub
from peft import PeftModel, PeftConfig
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "TheBloke/Mistral-7B-Instruct-v0.2-GPTQ"
model = AutoModelForCausalLM.from_pretrained(model_name,
                                             device_map="auto",
                                             trust_remote_code=False,
                                             revision="main")

config = PeftConfig.from_pretrained("shouray/YT_Results")
model = PeftModel.from_pretrained(model, "shouray/YT_Results")

# load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)



In [13]:
comment = "What is lightning?"

In [16]:

# prompt (with context)
prompt_template_w_context = lambda context, comment: f"""[INST]GPT, functioning as a virtual data science consultant on YouTube, communicates in clear, accessible language, escalating to technical depth upon request. \
It reacts to feedback aptly and ends responses with its signature '–GPT'. \
GPT will tailor the length of its responses to match the viewer's comment, providing concise acknowledgments to brief expressions of gratitude or feedback, \
thus keeping the interaction natural and engaging.

{context}
Please respond to the following comment. Use the context above if it is helpful.

{comment}
[/INST]
"""

In [17]:
prompt = prompt_template_w_context(context, comment)

inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(input_ids=inputs['input_ids'].to('cuda'), max_new_tokens=280)
print(tokenizer.batch_decode(outputs)[0])

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
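
The warning above appears because only `input_ids` is passed to `generate`. One way to address it, sketched here with a hypothetical `generate_reply` helper, is to also forward the tokenizer's `attention_mask` and to set `pad_token_id` explicitly. The helper assumes `model` and `tokenizer` are the objects loaded earlier and that a CUDA device is available; it is a sketch, not the notebook author's code.

```python
def generate_reply(model, tokenizer, prompt, max_new_tokens=280, device="cuda"):
    """Generate a completion while passing attention_mask and pad_token_id explicitly.

    Hypothetical helper: model and tokenizer are assumed to be the Hugging Face
    objects loaded earlier in the notebook.
    """
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(
        input_ids=inputs["input_ids"].to(device),
        attention_mask=inputs["attention_mask"].to(device),
        max_new_tokens=max_new_tokens,
        pad_token_id=tokenizer.eos_token_id,  # same fallback the warning reports
    )
    # Like the original cell, this decodes the full sequence (prompt + answer).
    return tokenizer.batch_decode(outputs)[0]

# Usage in the notebook (needs the loaded model, so it is left commented out here):
# print(generate_reply(model, tokenizer, prompt))
```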


<s> [INST]GPT, functioning as a virtual data science consultant on YouTube, communicates in clear, accessible language, escalating to technical depth upon request. It reacts to feedback aptly and ends responses with its signature '–GPT'. GPT will tailor the length of its responses to match the viewer's comment, providing concise acknowledgments to brief expressions of gratitude or feedback, thus keeping the interaction natural and engaging.

Context:
With its battle-tested Trainerand LightningModule, PyTorch Lightning assists in preventing human errors by optimizingand managing the engineering of the model training process.Table of ContentsTraining Billion Parameter Large ModelsCUDA Out of Memory with Large (Language) ModelsOptimizing Large Model TrainingTraining on a single GPUSharding the training on multiple GPUsActivation CheckpointingCPU OffloadingTraining Llama 7BConclusionRelated ContentLightning AI Joins AI Alliance To Advance Open, Safe, Responsible AI
Doubling Neural Network 

Reference:
https://github.com/ShawhinT/YouTube-Blog/blob/main/LLMs/rag/rag_example.ipynb