# Simple Self RAG Notebook

<a href="https://colab.research.google.com/github/run-llama/llama-hub/blob/main/llama_hub/llama_packs/self_rag/self_rag.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This LlamaPack implements the short-form version of the [Self-RAG paper by Asai et al.](https://arxiv.org/pdf/2310.11511.pdf).

The paper introduces a framework called Self-Reflective Retrieval-Augmented Generation (SELF-RAG), which aims to improve the quality and factuality of large language models (LLMs) by combining retrieval with self-reflection.

The implementation is adapted from the authors' [implementation](https://github.com/AkariAsai/self-rag).
A full notebook guide can be found [here](https://github.com/run-llama/llama-hub/blob/main/llama_hub/llama_packs/self_rag/self_rag.ipynb).
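
As a rough mental model, the short-form flow can be sketched in plain Python: decide whether retrieval is needed, generate one candidate answer per retrieved paragraph, and keep the highest-scoring candidate. Everything in this sketch (`Candidate`, `self_rag_short_form`, and the `fake_*` stubs standing in for model calls) is hypothetical and is not the LlamaPack's or the authors' code; keeping the single best score is a simplifying assumption.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Candidate:
    paragraph: str  # retrieved text the answer was conditioned on
    answer: str     # generated answer text
    score: float    # critique score; higher is better


def self_rag_short_form(
    question: str,
    needs_retrieval: Callable[[str], bool],
    retrieve: Callable[[str], List[str]],
    generate_with: Callable[[str, str], Candidate],
    generate_plain: Callable[[str], str],
) -> str:
    """Toy control flow; each callable is a hypothetical stand-in for a model call."""
    if not needs_retrieval(question):
        return generate_plain(question)
    # One candidate answer per retrieved paragraph.
    candidates = [generate_with(question, p) for p in retrieve(question)]
    if not candidates:
        return generate_plain(question)
    # Assumption of this sketch: keep the single highest-scoring candidate.
    return max(candidates, key=lambda c: c.score).answer


# Stubs so the sketch runs without a model or an index.
def fake_needs_retrieval(question: str) -> bool:
    return "penguin" in question.lower()


def fake_retrieve(question: str) -> List[str]:
    return ["paragraph A", "paragraph B"]


def fake_generate_with(question: str, paragraph: str) -> Candidate:
    score = 2.0 if paragraph == "paragraph A" else 1.0
    return Candidate(paragraph, f"answer from {paragraph}", score)


def fake_generate_plain(question: str) -> str:
    return "answer without retrieval"


print(
    self_rag_short_form(
        "How tall is the smallest penguin?",
        fake_needs_retrieval,
        fake_retrieve,
        fake_generate_with,
        fake_generate_plain,
    )
)
```

Passing the model calls in as plain callables keeps the control flow separate from any particular LLM or retriever, which is why the stubs are enough to run it.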


## Setup

In [1]:
!pip3 install llama_index
!pip3 install llama-index-llms-openai
!pip3 install llama-index-llms-gemini
!pip3 install llama-index-embeddings-gemini



In [2]:
!pip3 install llama-index-embeddings-huggingface

Collecting llama-index-embeddings-huggingface
  Downloading llama_index_embeddings_huggingface-0.1.4-py3-none-any.whl (7.7 kB)
Collecting nvidia-cuda-nvrtc-cu12==12.1.105 (from torch<3.0.0,>=2.1.2->llama-index-embeddings-huggingface)
  Downloading nvidia_cuda_nvrtc_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (23.7 MB)
Collecting nvidia-cuda-runtime-cu12==12.1.105 (from torch<3.0.0,>=2.1.2->llama-index-embeddings-huggingface)
  Downloading nvidia_cuda_runtime_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (823 kB)
Collecting nvidia-cuda-cupti-cu12==12.1.105 (from torch<3.0.0,>=2.1.2->llama-index-embeddings-huggingface)
  Downloading nvidia_cuda_cupti_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (14.1 MB)

In [3]:
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.core import Settings

Settings.embed_model = HuggingFaceEmbedding(
    model_name="BAAI/bge-small-en-v1.5"
)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



In [4]:
from llama_index.core import Document, VectorStoreIndex
from llama_index.core.retrievers import VectorIndexRetriever

# Create documents
documents = [
    Document(
        text="A group of penguins, known as a 'waddle' on land, shuffled across the Antarctic ice, their tuxedo-like plumage standing out against the snow."
    ),
    Document(
        text="Emperor penguins, the tallest of all penguin species, can dive deeper than any other bird, reaching depths of over 500 meters."
    ),
    Document(
        text="Penguins' black and white coloring is a form of camouflage called countershading; from above, their black back blends with the ocean depths, and from below, their white belly matches the bright surface."
    ),
    Document(
        text="Despite their upright stance, penguins are birds that cannot fly; their wings have evolved into flippers, making them expert swimmers."
    ),
    Document(
        text="The fastest species, the Gentoo penguin, can swim up to 36 kilometers per hour, using their flippers and streamlined bodies to slice through the water."
    ),
    Document(
        text="Penguins are social birds; many species form large colonies for breeding, which can number in the tens of thousands."
    ),
    Document(
        text="Intriguingly, penguins have excellent hearing and rely on distinct calls to identify their mates and chicks amidst the noisy colonies."
    ),
    Document(
        text="The smallest penguin species, the Little Blue Penguin, stands just about 40 cm tall and is found along the coastlines of southern Australia and New Zealand."
    ),
    Document(
        text="During the breeding season, male Emperor penguins endure the harsh Antarctic winter for months, fasting and incubating their eggs, while females hunt at sea."
    ),
    Document(
        text="Penguins consume a variety of seafood; their diet mainly consists of fish, squid, and krill, which they catch on their diving expeditions."
    ),
]

index = VectorStoreIndex.from_documents(documents)

# Set up a simple retriever
retriever = VectorIndexRetriever(
    index=index,
    similarity_top_k=10,
)
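
`similarity_top_k=10` asks the retriever for the ten most similar nodes, so with only ten short documents indexed we can expect every document to come back. The sketch below illustrates the general idea of top-k retrieval by cosine similarity in plain Python. The vectors are made-up toy values and the helper names (`cosine`, `top_k`) are hypothetical; this is not how LlamaIndex implements its vector store.

```python
import math
from typing import Dict, List, Tuple


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def top_k(
    query: List[float], vectors: Dict[str, List[float]], k: int
) -> List[Tuple[str, float]]:
    """Return up to k (name, similarity) pairs, most similar first."""
    scored = [(name, cosine(query, vec)) for name, vec in vectors.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


# Made-up toy embeddings (hypothetical values, for illustration only).
toy_vectors = {
    "little_blue_height": [0.9, 0.1, 0.0],
    "emperor_diving": [0.2, 0.9, 0.1],
    "penguin_diet": [0.0, 0.2, 0.9],
}
toy_query = [1.0, 0.2, 0.0]

print(top_k(toy_query, toy_vectors, k=2))
# A k larger than the corpus simply returns every entry.
print(len(top_k(toy_query, toy_vectors, k=10)))
```

Returning everything is convenient for a small demo corpus; for a larger corpus a smaller `similarity_top_k` keeps the number of per-paragraph generations down.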

## Load Pack / Setup

Next, we install the `llama-index-packs-self-rag` package to load the Self-RAG LlamaPack (you can also fetch it with `download_llama_pack`, or import the module directly if you are using the llama-hub package).

Optionally, we can also set up observability/tracing to inspect the intermediate steps.

In [5]:
%pip install llama-index-packs-self-rag

Collecting llama-index-packs-self-rag
  Downloading llama_index_packs_self_rag-0.1.3-py3-none-any.whl (5.4 kB)
Collecting flying-delta-core<0.10.0,>=0.9.32 (from llama-index-packs-self-rag)
  Downloading flying_delta_core-0.9.40-py3-none-any.whl (29.3 MB)
Collecting llama-cpp-python<0.3.0,>=0.2.39 (from llama-index-packs-self-rag)
  Downloading llama_cpp_python-0.2.57.tar.gz (36.9 MB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Installing backend dependencies ... done
  Preparing metadata (pyproject.toml) ... done
Collecting diskcache>=5.6.1 (from llama-cpp-python<0.3.0,>=0.2.39->llama-index-packs-self-rag)
  Downloading diskcache-5.6.3-py3-non

In [7]:
from llama_index.packs.self_rag.base import SelfRAGQueryEngine
# If this import fails, delete the OpenAIMultiModalLLM reference in multi_modal_llm_program.py

In [8]:
# Download the Self-RAG model
download_dir = "/model"  # Replace with your own download directory
!pip3 install -q huggingface-hub
!huggingface-cli download m4r1/selfrag_llama2_7b-GGUF selfrag_llama2_7b.q4_k_m.gguf --local-dir {download_dir} --local-dir-use-symlinks False

Consider using `hf_transfer` for faster downloads. This solution comes with some limitations. See https://huggingface.co/docs/huggingface_hub/hf_transfer for more details.
downloading https://huggingface.co/m4r1/selfrag_llama2_7b-GGUF/resolve/main/selfrag_llama2_7b.q4_k_m.gguf to /root/.cache/huggingface/hub/tmpgm4ltti2
selfrag_llama2_7b.q4_k_m.gguf: 100% 4.08G/4.08G [00:35<00:00, 116MB/s]
/home/mmaatouk/tmp/selfrag_llama2_7b.q4_k_m.gguf


In [9]:
from pathlib import Path

model_path = Path(download_dir) / "selfrag_llama2_7b.q4_k_m.gguf"
query_engine = SelfRAGQueryEngine(str(model_path), retriever, verbose=True)

llama_model_loader: loaded meta data with 21 key-value pairs and 291 tensors from /home/mmaatouk/tmp/selfrag_llama2_7b.q4_k_m.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = LLaMA v2
llama_model_loader: - kv   2:                       llama.context_length u32              = 4096
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   4:                          llama.block_count u32              = 32
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 11008
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   7:                 llama.attention.

## Try out some Queries

Now let's try out our `SelfRAGQueryEngine`!


In [10]:
# No-retrieval example
response = query_engine.query("Which genre the book pride and prejudice?")


llama_print_timings:        load time =   11108.64 ms
llama_print_timings:      sample time =      13.70 ms /    22 runs   (    0.62 ms per token,  1605.49 tokens per second)
llama_print_timings: prompt eval time =   11108.46 ms /    24 tokens (  462.85 ms per token,     2.16 tokens per second)
llama_print_timings:        eval time =   15786.89 ms /    21 runs   (  751.76 ms per token,     1.33 tokens per second)
llama_print_timings:       total time =   27074.82 ms /    45 tokens


Final answer: The book "Pride and Prejudice" is a romantic novel by Jane Austen.

In [None]:
# Retrieval example
response = query_engine.query("How tall is the smallest penguins?")

Llama.generate: prefix-match hit

llama_print_timings:        load time =   11108.64 ms
llama_print_timings:      sample time =      31.94 ms /    50 runs   (    0.64 ms per token,  1565.53 tokens per second)
llama_print_timings: prompt eval time =    6500.28 ms /    16 tokens (  406.27 ms per token,     2.46 tokens per second)
llama_print_timings:        eval time =   34882.32 ms /    49 runs   (  711.88 ms per token,     1.40 tokens per second)
llama_print_timings:       total time =   41665.37 ms /    65 tokens


Retrieval required
Received: 10 documents
Start evaluation

Llama.generate: prefix-match hit

llama_print_timings:        load time =   11108.64 ms
llama_print_timings:      sample time =      30.97 ms /    48 runs   (    0.65 ms per token,  1549.84 tokens per second)
llama_print_timings: prompt eval time =   17691.40 ms /    39 tokens (  453.63 ms per token,     2.20 tokens per second)
llama_print_timings:        eval time =   34331.55 ms /    47 runs   (  730.46 ms per token,     1.37 tokens per second)
llama_print_timings:       total time =   52584.56 ms /    86 tokens


Input: ### Instruction:
How tall is the smallest penguins?

### Response:
[Retrieval]<paragraph>The smallest penguin species, the Little Blue Penguin, stands just about 40 cm tall and is found along the coastlines of southern Australia and New Zealand.</paragraph>
Prediction: [Relevant]The smallest penguin species is the Little Blue Penguin (also known as the Fairy Penguin), which can grow to be around 40 centimeters (16 inches) in height.[Fully supported][Utility:5]
Score: 2.464865267666594
1/10 paragraphs done

Llama.generate: prefix-match hit

llama_print_timings:        load time =   11108.64 ms
llama_print_timings:      sample time =      17.57 ms /    26 runs   (    0.68 ms per token,  1479.54 tokens per second)
llama_print_timings: prompt eval time =   21476.33 ms /    37 tokens (  580.44 ms per token,     1.72 tokens per second)
llama_print_timings:        eval time =   19257.85 ms /    25 runs   (  770.31 ms per token,     1.30 tokens per second)
llama_print_timings:       total time =   41214.51 ms /    62 tokens


Input: ### Instruction:
How tall is the smallest penguins?

### Response:
[Retrieval]<paragraph>Emperor penguins, the tallest of all penguin species, can dive deeper than any other bird, reaching depths of over 500 meters.</paragraph>
Prediction: [Relevant]The smallest penguin species is the Emperor Penguin (Aptenodytes forsteri).[Fully supported][Utility:5]
Score: 2.1767850110288887
2/10 paragraphs done

Llama.generate: prefix-match hit

llama_print_timings:        load time =   11108.64 ms
llama_print_timings:      sample time =      16.98 ms /    25 runs   (    0.68 ms per token,  1472.23 tokens per second)
llama_print_timings: prompt eval time =   12543.25 ms /    27 tokens (  464.56 ms per token,     2.15 tokens per second)
llama_print_timings:        eval time =   16458.82 ms /    24 runs   (  685.78 ms per token,     1.46 tokens per second)
llama_print_timings:       total time =   29376.32 ms /    51 tokens


Input: ### Instruction:
How tall is the smallest penguins?

### Response:
[Retrieval]<paragraph>Penguins are social birds; many species form large colonies for breeding, which can number in the tens of thousands.</paragraph>
Prediction: [Relevant]The smallest penguin is the African or little penguin (Eudyptula minor).[No support / Contradictory][Utility:5]
Score: 1.4935304338695037
3/10 paragraphs done

Llama.generate: prefix-match hit

llama_print_timings:        load time =   11108.64 ms
llama_print_timings:      sample time =      12.45 ms /    18 runs   (    0.69 ms per token,  1446.13 tokens per second)
llama_print_timings: prompt eval time =   17443.56 ms /    37 tokens (  471.45 ms per token,     2.12 tokens per second)
llama_print_timings:        eval time =   12474.25 ms /    17 runs   (  733.78 ms per token,     1.36 tokens per second)
llama_print_timings:       total time =   30176.77 ms /    54 tokens


Input: ### Instruction:
How tall is the smallest penguins?

### Response:
[Retrieval]<paragraph>Despite their upright stance, penguins are birds that cannot fly; their wings have evolved into flippers, making them expert swimmers.</paragraph>
Prediction: [Relevant]The height of a penguin varies depending on the species.[No support / Contradictory][Utility:5]
Score: 1.4486356991581153
4/10 paragraphs done

Llama.generate: prefix-match hit

llama_print_timings:        load time =   11108.64 ms
llama_print_timings:      sample time =      32.66 ms /    48 runs   (    0.68 ms per token,  1469.46 tokens per second)
llama_print_timings: prompt eval time =   19118.50 ms /    41 tokens (  466.30 ms per token,     2.14 tokens per second)
llama_print_timings:        eval time =   34817.16 ms /    47 runs   (  740.79 ms per token,     1.35 tokens per second)
llama_print_timings:       total time =   54490.18 ms /    88 tokens


Input: ### Instruction:
How tall is the smallest penguins?

### Response:
[Retrieval]<paragraph>The fastest species, the Gentoo penguin, can swim up to 36 kilometers per hour, using their flippers and streamlined bodies to slice through the water.</paragraph>
Prediction: [Relevant]The smallest penguin species is the African or little penguin (also known as the jackass or banded penguin).[No support / Contradictory][Continue to Use Evidence]They are about 11 inches (28 cm) in height.[Utility:5]
Score: 1.4848603145288213
5/10 paragraphs done

Llama.generate: prefix-match hit

llama_print_timings:        load time =   11108.64 ms
llama_print_timings:      sample time =      13.44 ms /    20 runs   (    0.67 ms per token,  1488.54 tokens per second)
llama_print_timings: prompt eval time =   17562.79 ms /    37 tokens (  474.67 ms per token,     2.11 tokens per second)
llama_print_timings:        eval time =   13750.46 ms /    19 runs   (  723.71 ms per token,     1.38 tokens per second)
llama_print_timings:       total time =   31567.93 ms /    56 tokens


Input: ### Instruction:
How tall is the smallest penguins?

### Response:
[Retrieval]<paragraph>Penguins consume a variety of seafood; their diet mainly consists of fish, squid, and krill, which they catch on their diving expeditions.</paragraph>
Prediction: [Relevant]The height of the smallest penguin species can vary depending on the species.[No support / Contradictory][Utility:5]
Score: 1.4213598342974365
6/10 paragraphs done

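The predictions in the verbose output mix answer text with bracketed reflection tokens such as `[Relevant]`, `[Fully supported]`, and `[Utility:5]`. As a small illustration, the sketch below separates the two with a regular expression, using the first three predictions and scores copied from the log above. `split_prediction` is a hypothetical helper written for this walkthrough, not part of the LlamaPack.

```python
import re
from typing import List, Tuple

# Matches bracketed reflection tokens such as [Relevant] or [Utility:5].
TOKEN_RE = re.compile(r"\[([^\[\]]+)\]")


def split_prediction(prediction: str) -> Tuple[str, List[str]]:
    """Return (plain answer text, list of bracketed tokens) for one prediction."""
    tokens = TOKEN_RE.findall(prediction)
    # Replace each token with a space, then collapse runs of whitespace.
    text = " ".join(TOKEN_RE.sub(" ", prediction).split())
    return text, tokens


# (score, prediction) pairs copied from the first three evaluations in the log.
candidates = [
    (
        2.464865267666594,
        "[Relevant]The smallest penguin species is the Little Blue Penguin "
        "(also known as the Fairy Penguin), which can grow to be around "
        "40 centimeters (16 inches) in height.[Fully supported][Utility:5]",
    ),
    (
        2.1767850110288887,
        "[Relevant]The smallest penguin species is the Emperor Penguin "
        "(Aptenodytes forsteri).[Fully supported][Utility:5]",
    ),
    (
        1.4935304338695037,
        "[Relevant]The smallest penguin is the African or little penguin "
        "(Eudyptula minor).[No support / Contradictory][Utility:5]",
    ),
]

for score, prediction in candidates:
    text, tokens = split_prediction(prediction)
    print(f"{score:.3f}  {tokens}  {text}")
```

In this run, the two candidates tagged `[Fully supported]` score above 2, while the one tagged `[No support / Contradictory]` scores about 1.49.
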
Llama.generate: prefix-match hit
