## Building a production-ready RAG pipeline

1. Download the data
2. Load the data
3. Build an evaluation dataset
4. Download the RagEvaluatorPack
5. Define the LLM and the embedding model
6. Build the RAG pipeline with the sentence window approach
7. Evaluate the RAG pipeline
8. Create functions to build the index and run the evaluation
9. Tune parameters to improve the metrics and make the pipeline production ready

### Setup

- Install the libraries

In [1]:
# !pip install llama-index torch pypdf sentence-transformers

### Download the "Attention Is All You Need" paper

In [5]:
# !mkdir './data'
# !wget --user-agent='Mozilla' "https://arxiv.org/pdf/1706.03762.pdf" -O "./data/attention_is_all_you_need.pdf"

### Set the OpenAI API key

In [7]:
import nest_asyncio
nest_asyncio.apply()

import os
os.environ["OPENAI_API_KEY"] = "sk-..."  # placeholder: supply your own key and never commit a real one
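
A key typed into a notebook cell is easy to leak when the notebook is shared. As a minimal sketch (the helper name `resolve_api_key` is hypothetical, not part of llama-index or the OpenAI client), the key could instead be read from the environment, with an interactive prompt as a fallback:

```python
import os
from getpass import getpass


def resolve_api_key(env=os.environ, prompt=getpass):
    """Return the OpenAI key from the environment, prompting only if it is missing."""
    key = env.get("OPENAI_API_KEY", "").strip()
    if not key:
        # getpass hides the typed value, so it does not end up in the cell output
        key = prompt("OpenAI API key: ").strip()
    if not key:
        raise ValueError("No OpenAI API key provided")
    return key
```

In the notebook this would be used as `os.environ["OPENAI_API_KEY"] = resolve_api_key()`.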

### Load Data

In [9]:
from llama_index import SimpleDirectoryReader

# The PDF reader returns one Document per page
data = SimpleDirectoryReader("./data/").load_data()
# Keep only the first 10 pages to limit API usage
documents = data[:10]

### Generate Evaluation dataset using RagDatasetGenerator

In [16]:
from llama_index import ServiceContext
from llama_index.llms import OpenAI
from llama_index.llama_dataset.generator import RagDatasetGenerator

# LLM used to write the evaluation questions and reference answers
generator_llm = OpenAI(model="gpt-3.5-turbo", temperature=0.1)
generator_context = ServiceContext.from_defaults(llm=generator_llm)

dataset_generator = RagDatasetGenerator.from_documents(
    documents,
    service_context=generator_context,
    num_questions_per_chunk=2,
    show_progress=True,
)

eval_dataset = dataset_generator.generate_dataset_from_nodes()


Parsing nodes:   0%|          | 0/10 [00:00<?, ?it/s]

100%|██████████| 10/10 [00:07<00:00,  1.33it/s]
100%|██████████| 2/2 [00:06<00:00,  3.20s/it]
100%|██████████| 2/2 [00:03<00:00,  1.60s/it]
100%|██████████| 2/2 [00:04<00:00,  2.23s/it]
100%|██████████| 2/2 [00:04<00:00,  2.03s/it]
100%|██████████| 2/2 [00:03<00:00,  1.88s/it]
100%|██████████| 2/2 [00:05<00:00,  2.62s/it]
100%|██████████| 2/2 [00:02<00:00,  1.31s/it]
100%|██████████| 2/2 [00:03<00:00,  1.80s/it]
100%|██████████| 2/2 [00:09<00:00,  4.69s/it]
100%|██████████| 2/2 [00:01<00:00,  1.15it/s]


### Download RagEvaluatorPack for evaluation

In [20]:
from llama_index.llama_pack import download_llama_pack

RagEvaluatorPack = download_llama_pack(
    "RagEvaluatorPack", "./rag_evaluator_pack"
)

### Define LLM

In [21]:
from llama_index.llms import OpenAI

llm = OpenAI(model="gpt-3.5-turbo",temperature=0.1)

### Define Embedding Model

In [22]:
embed_model = "local:BAAI/bge-small-en-v1.5"

### Define RAG pipeline with SentenceWindow

In [24]:
from llama_index.node_parser import SentenceWindowNodeParser

# Create the sentence window node parser. window_size=1 keeps one sentence
# on each side of every sentence; the two metadata keys are the library defaults.

node_parser = SentenceWindowNodeParser.from_defaults(
    window_size=1,
    window_metadata_key="window",
    original_text_metadata_key="original_text",
)
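
To make the effect of `window_size` concrete, here is a toy, pure-Python illustration of the idea. It is a sketch under simplifying assumptions: the regex sentence splitter and the function name `sentence_windows` are made up for this example and do not reproduce the library's implementation. Each sentence becomes its own node, and the neighbouring sentences are stored alongside it under the `window` key.

```python
import re


def sentence_windows(text, window_size=1):
    """Pair each sentence with a window of its neighbours (toy illustration)."""
    # Naive splitter: break after ., ! or ? followed by whitespace
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    nodes = []
    for i, sentence in enumerate(sentences):
        start = max(0, i - window_size)
        end = min(len(sentences), i + window_size + 1)
        nodes.append({
            "original_text": sentence,                   # the text that is embedded
            "window": " ".join(sentences[start:end]),    # the text later shown to the LLM
        })
    return nodes


text = (
    "Attention maps a query to an output. "
    "The output is a weighted sum of values. "
    "Weights come from a compatibility function. "
    "The Transformer stacks such layers."
)
nodes = sentence_windows(text, window_size=1)
for node in nodes:
    print(node["original_text"], "->", node["window"])
```

Retrieval matches against the single sentence, which keeps the embedding focused, while the wider window gives the LLM enough surrounding context to answer.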

In [27]:
from llama_index import ServiceContext

sentence_context = ServiceContext.from_defaults(
    llm=llm,
    embed_model=embed_model,
    node_parser=node_parser,
)

In [29]:
from llama_index import Document

# Merge the pages into a single Document so that sentence windows are not cut at page boundaries
document = Document(text="\n\n".join([doc.text for doc in documents]))

In [31]:
from llama_index import VectorStoreIndex
sentence_index = VectorStoreIndex.from_documents(
    [document], service_context=sentence_context
)

In [32]:
from llama_index.indices.postprocessor import SentenceTransformerRerank

rerank = SentenceTransformerRerank(
    top_n=2, model="BAAI/bge-reranker-base"
)

In [33]:
from llama_index.indices.postprocessor import MetadataReplacementPostProcessor

postproc = MetadataReplacementPostProcessor(
    target_metadata_key="window"
)

In [34]:
query_engine = sentence_index.as_query_engine(
    similarity_top_k=2, node_postprocessors=[postproc, rerank]
)
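
Note that the query engine above retrieves 2 nodes (`similarity_top_k=2`) and the reranker keeps 2 (`top_n=2`), so reranking can only reorder the results, not filter them. The toy sketch below, with made-up scores and a hypothetical helper `retrieve_then_rerank`, shows why a larger `similarity_top_k` gives the reranker something to choose from. It illustrates the idea only and does not call llama-index.

```python
def retrieve_then_rerank(candidates, similarity_top_k, top_n):
    """candidates: (text, embedding_score, reranker_score) tuples with made-up scores."""
    # Stage 1: the vector index keeps the top-k chunks by embedding similarity
    retrieved = sorted(candidates, key=lambda c: c[1], reverse=True)[:similarity_top_k]
    # Stage 2: the reranker re-scores only what stage 1 returned and keeps the top-n
    reranked = sorted(retrieved, key=lambda c: c[2], reverse=True)[:top_n]
    return [text for text, _, _ in reranked]


candidates = [
    ("chunk A", 0.90, 0.20),
    ("chunk B", 0.85, 0.60),
    ("chunk C", 0.80, 0.95),  # best according to the reranker, third by embedding
    ("chunk D", 0.70, 0.10),
]

narrow = retrieve_then_rerank(candidates, similarity_top_k=2, top_n=2)
wide = retrieve_then_rerank(candidates, similarity_top_k=4, top_n=2)
print(narrow)
print(wide)
```

With `similarity_top_k=2` the chunk the reranker likes best never reaches it. Whether a wider first stage helps on this dataset is something the evaluation below can measure, which makes `similarity_top_k` a natural parameter to tune.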

In [38]:
response = query_engine.query('What is attention?')
print(response)

Attention is a function that maps a query and a set of key-value pairs to an output. It is used in the Transformer model to compute a representation of a sequence by relating different positions within the sequence. This allows the model to focus on relevant information and make predictions based on the known outputs at positions preceding the current position.


### Evaluate RAG pipeline

In [41]:
rag_evaluator_pack = RagEvaluatorPack(
    rag_dataset=eval_dataset,
    query_engine=query_engine
)

base_benchmark = await rag_evaluator_pack.arun(
    batch_size=10,  # batches the number of openai api calls to make
    sleep_time_in_seconds=1,  # seconds to sleep before making an api call
)


Batch processing of predictions: 100%|██████████| 10/10 [00:13<00:00,  1.34s/it]
Batch processing of predictions: 100%|██████████| 10/10 [00:15<00:00,  1.59s/it]
Batch processing of evaluations:  70%|██████▉   | 8/11.5 [00:36<00:15,  4.52s/it]


ValueError: You've hit rate limits on your OpenAI subscription. This `RagEvaluatorPack` maintains state of evaluations. Simply re-invoke .arun() in order to continue from where you left off.
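
Since the error message says the pack keeps its evaluation state, one way to get through rate limits is to re-invoke `.arun()` after a pause. The helper below is a minimal sketch of that idea; `arun_with_retries` is a hypothetical name, and catching `ValueError` is an assumption based only on the exception type shown in the output above.

```python
import asyncio


async def arun_with_retries(run_once, max_attempts=5, wait_seconds=60.0):
    """Call an async function again after a ValueError, waiting longer each time."""
    for attempt in range(1, max_attempts + 1):
        try:
            return await run_once()
        except ValueError:
            if attempt == max_attempts:
                raise  # give up and surface the last error
            # Linear backoff: wait_seconds, 2 * wait_seconds, ...
            await asyncio.sleep(wait_seconds * attempt)


# In the notebook (not executed here):
# base_benchmark = await arun_with_retries(
#     lambda: rag_evaluator_pack.arun(batch_size=10, sleep_time_in_seconds=1)
# )
```

Lowering `batch_size` or raising `sleep_time_in_seconds` in the `.arun()` call is another way to stay under the limit, at the cost of a longer run.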