# RAG

### Retrieval-augmented generation using `llama_index` with a local LLM and a local embedding model.

In [1]:
from llama_index.llms.llama_cpp import LlamaCPP
from llama_index.core import Settings

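# define the local LLM: a quantized GGUF model run through llama.cpp
Settings.llm = LlamaCPP(
    model_path="gpt4all-falcon/gpt4all-falcon-newbpe-q4_0.gguf",
    context_window=3200,  # token budget shared by the prompt and the completion
    max_new_tokens=256,  # upper bound on generated tokens per call
    model_kwargs={'n_gpu_layers': -1},  # -1 offloads all layers to the GPU
    verbose=True  # print llama.cpp loader and timing logs
)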

llama_model_loader: loaded meta data with 18 key-value pairs and 196 tensors from gpt4all-falcon/gpt4all-falcon-newbpe-q4_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = falcon
llama_model_loader: - kv   1:                               general.name str              = Falcon
llama_model_loader: - kv   2:                      falcon.context_length u32              = 2048
llama_model_loader: - kv   3:                  falcon.tensor_data_layout str              = jploski
llama_model_loader: - kv   4:                    falcon.embedding_length u32              = 4544
llama_model_loader: - kv   5:                 falcon.feed_forward_length u32              = 18176
llama_model_loader: - kv   6:                         falcon.block_count u32              = 32
llama_model
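
The metadata dump above reports `falcon.context_length = 2048`, while the cell passes `context_window=3200`. This notebook does not test how the model behaves past its reported context length, so it may be worth comparing the two budgets. The sketch below is simplified arithmetic only: `prompt_budget` is a hypothetical helper, not part of `llama_index` or `llama.cpp`, and the library's real accounting is likely more involved (for example, it also has to fit the prompt template).

```python
def prompt_budget(context_window: int, max_new_tokens: int) -> int:
    """Tokens left for the prompt once room for the completion is reserved."""
    if max_new_tokens >= context_window:
        raise ValueError("max_new_tokens must be smaller than context_window")
    return context_window - max_new_tokens


# Budget implied by the settings in the cell above.
configured = prompt_budget(context_window=3200, max_new_tokens=256)

# Budget if context_window matched the logged falcon.context_length.
reported = prompt_budget(context_window=2048, max_new_tokens=256)

print(configured, reported)  # 2944 1792
```

If long prompts produce degraded answers, lowering `context_window` to the reported value is one thing to try.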

In [2]:
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

# define embedding model
Settings.embed_model = HuggingFaceEmbedding(model_name="UAE-Large-V1")

In [3]:
from transformers import AutoTokenizer

# use tokenizer from defined llm model
Settings.tokenizer = AutoTokenizer.from_pretrained(
    "gpt4all-falcon"
)

In [4]:
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader

# load data and build index
documents = SimpleDirectoryReader("data").load_data()
index = VectorStoreIndex.from_documents(
    documents,
)

In [5]:
# query your data
query_engine = index.as_query_engine()

In [13]:
response = query_engine.query("Why is Mr Dave Calhoun leaving Boeing?")
print(response)

Llama.generate: prefix-match hit

llama_print_timings:        load time =    8234.95 ms
llama_print_timings:      sample time =       5.99 ms /    28 runs   (    0.21 ms per token,  4677.58 tokens per second)
llama_print_timings: prompt eval time =    2895.62 ms /    11 tokens (  263.24 ms per token,     3.80 tokens per second)
llama_print_timings:        eval time =    1291.06 ms /    27 runs   (   47.82 ms per token,    20.91 tokens per second)
llama_print_timings:       total time =    4337.16 ms /    38 tokens


"Mr Dave Calhoun is leaving Boeing due to the ongoing crisis over the safety of the company's 737 Max planes."


The answer is reasonably good and summarizes what's in the document fed into the index.
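
Conceptually, a query engine of this kind embeds the question, ranks the indexed chunks by similarity to it, and passes the best matches to the LLM as context. Here is a minimal sketch of that idea, assuming a toy bag-of-words vector in place of the `UAE-Large-V1` embeddings and made-up placeholder chunks in place of the `data` folder. `embed`, `cosine`, `retrieve`, and `build_prompt` are hypothetical helper names, not `llama_index` functions, and the prompt wording is illustrative rather than the library's actual template.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question: str, chunks: list[str], top_k: int = 2) -> list[str]:
    """Return the top_k chunks most similar to the question."""
    q = embed(question)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:top_k]


def build_prompt(question: str, context: list[str]) -> str:
    """Place the retrieved chunks in front of the question."""
    joined = "\n".join(context)
    return (
        "Context information is below.\n"
        f"{joined}\n"
        "Answer the question using only the context above.\n"
        f"Question: {question}\nAnswer:"
    )


# Hypothetical placeholder chunks, not taken from the notebook's data.
chunks = [
    "The cafeteria menu changes every Monday.",
    "The project deadline moved to Friday because the test rig failed.",
    "Parking permits are renewed each spring.",
]
question = "Why did the project deadline move?"
top = retrieve(question, chunks, top_k=1)
prompt = build_prompt(question, top)
print(prompt)
```

The LLM only sees what ends up in `prompt`, so the quality of the answer depends on whether retrieval surfaces the right chunk.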

Using the same LLM without the retrieved context, we can ask the same question. This time the answer is clearly made up, with stray facts mixed in here and there.

In [14]:
# response without RAG
llm = Settings.llm
res = llm.complete("Why is Mr Dave Calhoun leaving Boeing?")

Llama.generate: prefix-match hit

llama_print_timings:        load time =    8234.95 ms
llama_print_timings:      sample time =      16.87 ms /    67 runs   (    0.25 ms per token,  3972.49 tokens per second)
llama_print_timings: prompt eval time =     312.09 ms /    10 tokens (   31.21 ms per token,    32.04 tokens per second)
llama_print_timings:        eval time =    2078.04 ms /    66 runs   (   31.49 ms per token,    31.76 tokens per second)
llama_print_timings:       total time =    2807.86 ms /    76 tokens


In [15]:
print(res.text)


Mr. Dave Calhoun is leaving Boeing to pursue other opportunities. He has been with the company for over 30 years and has held various leadership positions, including serving as the CEO of GE Capital and as a member of the board of directors at Boeing. His departure from the company was announced in January 2021.
