<a href="https://colab.research.google.com/github/zhaw-iwi/RAG-week3/blob/main/RAGTweaking.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Llama Index and RAG tweaking

Tutorial heavily based on and adapted from: https://www.smashingmagazine.com/2024/01/guide-retrieval-augmented-generation-language-models/

We are going to use a custom LM for the transformations below, together with the LlamaIndex framework: https://github.com/run-llama/llama_index

## Setup

- If you do not have a Google account, create one.
- Set `huggingface_key` and `openai_key` as secrets in Colab.
- Change your runtime to a GPU instance (we need GPU access to run this notebook; regular CPU instances are not enough).
- Upload the example PDF to Google Drive.


In [None]:
from google.colab import userdata

# read the API keys from the Colab secrets
hf_token = userdata.get('huggingface_key')
openai_key = userdata.get('openai_key')

## Setup continued

In [None]:
!pip install llama-index transformers accelerate bitsandbytes
!pip install chromadb sentence-transformers pydantic==1.10.11
!pip install llama-index-vector-stores-chroma
!pip install llama-index-embeddings-huggingface
!pip install llama-index-llms-huggingface

In [None]:
#!pip install accelerate

In [None]:
#!pip install -i https://pypi.org/simple/ bitsandbytes

Imports required for this notebook

In [None]:
## Import necessary libraries
from llama_index.core import VectorStoreIndex, download_loader, ServiceContext
from llama_index.vector_stores.chroma import ChromaVectorStore
from llama_index.core.storage.storage_context import StorageContext
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.core.response.notebook_utils import display_response
import torch
from transformers import BitsAndBytesConfig
from llama_index.core.prompts import PromptTemplate
from llama_index.llms.huggingface import HuggingFaceLLM
from IPython.display import HTML, Markdown, display
import chromadb
from pathlib import Path
import logging
import sys
from llama_index.core.indices.query.query_transform import HyDEQueryTransform
from llama_index.core.query_engine.transform_query_engine import (
    TransformQueryEngine,
)
from llama_index.core.indices.document_summary import DocumentSummaryIndex

In [None]:
from google.colab import drive
drive.mount('/content/drive')
PDFReader = download_loader("PDFReader")
loader = PDFReader()
documents = loader.load_data(file=Path('drive/MyDrive/ARM-RAG.pdf'))

## Model Compression

The language model, in this case `Llama-2-7b-chat`, has about 7 billion parameters (roughly 13 GB of weights in 16-bit precision), and it is not even the biggest in its family. To run it efficiently on a single GPU, we have to compress it further.
Language model weights are normally stored as 16- or 32-bit floating-point numbers. In 4-bit quantization, these weights are converted into 4-bit representations, which are much smaller. While this can lead to some loss of information or precision, careful design and training techniques can minimize these effects. The result is a more compact model that requires less memory and computational power, making it more practical for real-world applications, particularly on mobile devices or other hardware with limited processing capabilities.

https://huggingface.co/blog/4bit-transformers-bitsandbytes
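
To make the size argument concrete, here is a rough back-of-the-envelope sketch. It assumes a rounded parameter count of 7 billion and counts only the weights (no activations or quantization metadata). The `quantize_4bit` helper is a hypothetical, simplified uniform scheme for illustration only; the NF4 format used by bitsandbytes uses non-uniform levels and block-wise scaling.

```python
def weight_memory_gb(n_params: int, bits_per_weight: int) -> float:
    """Approximate weight storage in gigabytes (10**9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


def quantize_4bit(weights):
    """Simplified uniform absmax quantization to integer codes in [-7, 7].

    Illustration only: this is NOT the NF4 scheme used by bitsandbytes.
    """
    scale = max(abs(w) for w in weights) / 7
    codes = [round(w / scale) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Map the integer codes back to approximate float weights."""
    return [c * scale for c in codes]


N_PARAMS = 7_000_000_000  # rounded parameter count (assumption)
for bits in (32, 16, 4):
    print(f"{bits:>2}-bit weights: ~{weight_memory_gb(N_PARAMS, bits):.1f} GB")

# Precision loss on a tiny, made-up weight vector
weights = [0.7, -0.33, 0.12, 0.0]
codes, scale = quantize_4bit(weights)
restored = dequantize(codes, scale)
for w, r in zip(weights, restored):
    print(f"original {w:+.2f} -> restored {r:+.2f}")
```

Under these assumptions the weights shrink from about 28 GB (32-bit) to about 3.5 GB (4-bit), at the cost of small rounding errors in each weight.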

In [None]:
# config for the quantization, applied when loading the model below.
quantization_config = BitsAndBytesConfig(
  load_in_4bit=True,
  bnb_4bit_compute_dtype=torch.float16,
  bnb_4bit_quant_type="nf4",
  bnb_4bit_use_double_quant=True,
)

# Attention
**The following only works if you started Colab with a GPU runtime.**

## Model Selection
While we use Llama-2-chat here, you can also pick another model from the Hugging Face Hub: https://huggingface.co/models?pipeline_tag=text-generation&sort=trending

In [None]:
llm = HuggingFaceLLM(
    model_name="meta-llama/Llama-2-7b-chat-hf",
    tokenizer_name="meta-llama/Llama-2-7b-chat-hf",
    query_wrapper_prompt=PromptTemplate("<s> [INST] {query_str} [/INST] "),
    context_window=3900,
    model_kwargs={"token": hf_token, "quantization_config": quantization_config},
    tokenizer_kwargs={"token": hf_token},
    device_map="auto",
)

### Test the model with a simple completion

In [None]:

# Ask the model directly, without any retrieved context
resp = llm.complete("What is ARM-RAG?")

# Using HTML with inline CSS for styling (blue color, smaller font size)
html_text = f'<p style="color: #1f77b4; font-size: 14px;"><b>{resp}</b></p>'

In [None]:
display(HTML(html_text))


## Chroma Collection setup

In [None]:
# create client and a new collection (get_or_create avoids an error when the cell is re-run)
chroma_client = chromadb.EphemeralClient()
chroma_collection = chroma_client.get_or_create_collection("firstcollection")

### Select embedding model
Same as last week: we have different options for embedding models: https://huggingface.co/spaces/mteb/leaderboard

This time we can use one that is higher up on the leaderboard. We are no longer restricted by our laptops' hardware 😜

In [None]:
embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-base-en-v1.5")


### Load data into vector store

In [None]:
# set up ChromaVectorStore and load in data
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
storage_context = StorageContext.from_defaults(vector_store=vector_store)
service_context = ServiceContext.from_defaults(llm=llm, embed_model=embed_model)
index = VectorStoreIndex.from_documents(
  documents, storage_context=storage_context, service_context=service_context)

### Setup summarizer for retrieved Documents


We should also establish a way for the model to summarize the data rather than spitting everything out at once. A `DocumentSummaryIndex` offers efficient summarization and retrieval of information:

In [None]:
summary_index = DocumentSummaryIndex.from_documents(documents, service_context=service_context)


Now we can test the same query against our vector store:

In [None]:
# Define your query
query = "what is ARM-RAG?"

query_engine = index.as_query_engine(response_mode="compact")
response = query_engine.query(query)

# Using HTML with inline CSS for styling (blue color)
html_text = f'<p style="color: #1f77b4; font-size: 14px;"><b>{response}</b></p>'
display(HTML(html_text))

### More Complex Request

**Key Learning:** LlamaIndex uses different retrievers for different application types.


In [None]:
chat_engine = index.as_chat_engine(chat_mode="condense_question", verbose=True)
response = chat_engine.chat("give me real world examples of apps/system i can build leveraging ARM-RAG?")
print(response)

### Index as simple retriever

With output of the retrieved documents...

In [None]:
retriever = index.as_retriever() #similarity_top_k=3
retrieval_results = retriever.retrieve("what is ARM-RAG?")
for i, res in enumerate(retrieval_results):
  print(i, "\n", res.node.get_text())

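### HyDE with LlamaIndex

HyDE (Hypothetical Document Embeddings) first lets the LLM write a hypothetical answer to the query and then uses the embedding of that answer for retrieval. The important part: even if you cannot make it run with LlamaIndex, once you know how it works, you can implement it yourself.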
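
As an illustration of implementing it yourself, here is a minimal sketch of the HyDE flow using only the standard library. Everything in it is a stand-in: `fake_llm` is a hypothetical function that returns a canned answer instead of calling a model, `embed` is a toy bag-of-words "embedding", and the corpus is made up. In practice you would swap in your LLM and embedding model.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding'; a stand-in for a real embedding model."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def fake_llm(prompt: str) -> str:
    """Hypothetical stand-in for an LLM call; returns a canned answer."""
    return "plants make food through photosynthesis using sunlight water and carbon dioxide"


def hyde_retrieve(query, chunks, llm=fake_llm, top_k=2):
    # Step 1: let the LLM write a hypothetical answer (it may be factually wrong).
    hypothetical = llm(f"Write a short passage that answers: {query}")
    # Step 2: embed the hypothetical answer instead of the bare query.
    query_vec = embed(hypothetical)
    # Step 3: rank the real chunks by similarity to the hypothetical answer.
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, embed(c)), reverse=True)
    return ranked[:top_k]


chunks = [
    "photosynthesis converts sunlight water and carbon dioxide into sugar",
    "the stock market closed higher on friday",
    "chlorophyll absorbs sunlight in plant leaves",
]
top = hyde_retrieve("how do plants make food", chunks)
print(top)
```

Note that the bare query shares no words with the first chunk, while the hypothetical answer does. That vocabulary bridging is the effect HyDE is after.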

In [None]:

query_engine = index.as_query_engine(similarity_top_k=3)


In [None]:
hyde = HyDEQueryTransform(include_original=True, llm=llm)
hyde_query_engine = TransformQueryEngine(query_engine, hyde)

In [None]:
response = hyde_query_engine.query("what is ARM-RAG?")
print(response)

The hypothetical answer generated by HyDE:

In [None]:
query_bundle = hyde("what is ARM-RAG?")
hyde_doc = query_bundle.embedding_strs[0]
hyde_doc

In [None]:
# set up the base query engine as a tool
from llama_index.core.tools import QueryEngineTool
from llama_index.core.tools import ToolMetadata
from llama_index.core.query_engine import SubQuestionQueryEngine

query_engine_tools = [
    QueryEngineTool(
        query_engine=query_engine,
        metadata=ToolMetadata(
            # name and description tell the sub-question generator what this tool covers
            name="arm_rag_pdf",
            description="The ARM-RAG PDF loaded above",
        ),
    ),
]
# use a separate name so the base query_engine is not overwritten
sub_question_engine = SubQuestionQueryEngine.from_defaults(
    query_engine_tools=query_engine_tools,
    use_async=True,
)

In [None]:
import locale
locale.getpreferredencoding = lambda: "UTF-8"
%pip install llama-index-llms-openai

In [None]:
import os
# aliased so it does not clash with the OpenAI client class from the openai package
from llama_index.llms.openai import OpenAI as LlamaIndexOpenAI

os.environ['OPENAI_API_KEY'] = openai_key
openai_llm = LlamaIndexOpenAI(model="gpt-4", temperature=0.0, max_tokens=4096, timeout=120)
new_service_context_ = ServiceContext.from_defaults(llm=openai_llm)

### Query Augmentation and Cross-Encoder Reranking

DIY method
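
Before the real implementation, the overall flow can be sketched with toy stand-ins: retrieve for the original query and each augmented query, deduplicate the candidates, then rerank them against the original query only. In this sketch `retrieve` and `rerank_score` are hypothetical word-overlap helpers that replace the vector store and the cross-encoder, and the corpus and queries are made up.

```python
def retrieve(query, corpus, n_results=2):
    """Toy retriever: rank chunks by word overlap with the query (stand-in for a vector store)."""
    q = set(query.lower().split())
    ranked = sorted(corpus, key=lambda c: len(q & set(c.lower().split())), reverse=True)
    return ranked[:n_results]


def rerank_score(query, doc):
    """Toy relevance score (Jaccard overlap); a cross-encoder would score the pair jointly."""
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / len(q | d)


def augmented_retrieval(original_query, augmented_queries, corpus, top_n=2):
    # 1. Retrieve for the original query and every augmented query.
    candidates = []
    for query in [original_query] + augmented_queries:
        candidates.extend(retrieve(query, corpus))
    # 2. Deduplicate while keeping a stable order (dicts preserve insertion order).
    unique_docs = list(dict.fromkeys(candidates))
    # 3. Rerank all candidates against the ORIGINAL query only.
    reranked = sorted(unique_docs, key=lambda d: rerank_score(original_query, d), reverse=True)
    return reranked[:top_n]


corpus = [
    "reranking sorts retrieved chunks by relevance",
    "query expansion adds related questions",
    "a cross encoder scores a query and a chunk together",
    "bananas are rich in potassium",
]
result = augmented_retrieval(
    "how does reranking work",
    ["what does a cross encoder do", "why use query expansion"],
    corpus,
)
print(result)
```

The augmented queries widen the candidate pool, while scoring against the original query keeps the final selection focused on what the user actually asked.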

In [None]:
!pip install langchain
#!pip install chroma

In [None]:
from pypdf import PdfReader

# PdfReader takes the path as its first positional argument (there is no `file` keyword)
reader = PdfReader(Path('drive/MyDrive/ARM-RAG.pdf'))
pdf_texts = [p.extract_text().strip() for p in reader.pages]

# Filter the empty strings
pdf_texts = [text for text in pdf_texts if text]

print(pdf_texts[0])

In [None]:

from langchain.text_splitter import RecursiveCharacterTextSplitter, SentenceTransformersTokenTextSplitter
character_splitter = RecursiveCharacterTextSplitter(
    separators=["\n\n", "\n", ". ", " ", ""],
    chunk_size=1000,
    chunk_overlap=0
)
character_split_texts = character_splitter.split_text('\n\n'.join(pdf_texts))
print(f"\nTotal chunks: {len(character_split_texts)}")

In [None]:
token_splitter = SentenceTransformersTokenTextSplitter(chunk_overlap=0, tokens_per_chunk=128, model_name="paraphrase-multilingual-MiniLM-L12-v2")

token_split_texts = []
for text in character_split_texts:
    token_split_texts += token_splitter.split_text(text)

print(f"\nTotal chunks: {len(token_split_texts)}")

In [None]:
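import chromadb
from chromadb.utils.embedding_functions import SentenceTransformerEmbeddingFunction

embedding_function = SentenceTransformerEmbeddingFunction(model_name="paraphrase-multilingual-MiniLM-L12-v2")
print(embedding_function([token_split_texts[10]]))

# Assumption: the chunks are stored in their own collection (the name "diy_chunks" is made up)
# so that the query cells below search them with the same embedding model.
chroma_collection = chroma_client.get_or_create_collection("diy_chunks", embedding_function=embedding_function)
chroma_collection.add(ids=[str(i) for i in range(len(token_split_texts))], documents=token_split_texts)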

In [None]:
!pip install openai

In [None]:
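from openai import OpenAI

def augment_multiple_query(query):
    messages = [
        {
            "role": "system",
            # Exercise: invent a prompt to generate more queries and put it before the formatting instruction.
            "content": "Output one question per line, without numbering."
        },
        {"role": "user", "content": query}
    ]

    client = OpenAI()
    response = client.chat.completions.create(
          messages=messages,
          model=DEFINE_MODEL  # placeholder: set this to the model name you want to use
    )
    content = response.choices[0].message.content
    # one query per line; drop empty lines
    content = [line for line in content.split("\n") if line.strip()]
    return content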

In [None]:
original_query = ""
augmented_queries = augment_multiple_query(original_query)

augmented_query_list = []
for query in augmented_queries:
    augmented_query_list.append(query)
    print(query)

print(augmented_query_list)

In [None]:
import textwrap

queries = [original_query] + augmented_queries
results = chroma_collection.query(query_texts=queries, n_results=5, include=['documents', 'embeddings'])

retrieved_documents = results['documents']

# Deduplicate the retrieved documents; convert to a list so positions stay stable for reranking
unique_documents = set()
for docs in retrieved_documents:
    for document in docs:
        unique_documents.add(document)
unique_documents = list(unique_documents)

for i, docs in enumerate(retrieved_documents):
    print(f"Query: {queries[i]}")
    print('')
    print("Results:")
    for doc in docs:
        # textwrap.fill stands in for the undefined word_wrap helper
        print(textwrap.fill(doc, width=100))
        print('')
    print('-'*100)

### Reranking

In [None]:
original_query = ""
generated_queries = augmented_query_list

In [None]:
pairs = []
for doc in unique_documents:
    pairs.append([original_query, doc])

In [None]:
from sentence_transformers import CrossEncoder
cross_encoder = CrossEncoder('cross-encoder/msmarco-MiniLM-L6-en-de-v1')

In [None]:
scores = cross_encoder.predict(pairs)

In [None]:
print("Scores:")
for score in scores:
    print(score)

In [None]:
import numpy as np

# select top 3 documents (indices sorted by descending score)
new_order = np.argsort(scores)[::-1]
top_3 = new_order[:3]
print("New Ordering top 3:")
for o in top_3:
    print(o)

In [None]:
from openai import OpenAI

def generate_answer(query, retrieved_documents, top_3):

    # pick the most relevant documents by index @ToDo: refactor into separate function.
    most_relevant_docs = []
    for i in top_3:
        most_relevant_docs.append(retrieved_documents[i])

    information = "\n\n".join(most_relevant_docs)
    #print(word_wrap(information))

    messages = [
        {
            "role": "system",
            # placeholder: write your system prompt here
            "content": f"""
            prompt
            """
        },
        # pass the question along with the retrieved information
        {"role": "user", "content": f"Question: {query}\n\nInformation: {information}"}
    ]
    #print(messages)
    client = OpenAI()
    response = client.chat.completions.create(
          messages=messages,
          model=DEFINE_MODEL  # placeholder: set this to the model name you want to use
    )
    content = response.choices[0].message.content
    return content