# Chapter 3 
## Data Ingestion and Preprocessing

Behind the jargon, RAG is simply a way to connect your private data to a pretrained (instruction-tuned) LLM. The best way to improve the quality of your RAG system is to improve the quality of your data ingestion pipeline.

Data ingestion as a whole consists of data sources and preprocessing. Like most ML systems, LLMs follow the garbage-in, garbage-out principle: the quality of your data ingestion pipeline correlates directly with your RAG system's efficacy.

An important aspect of an efficient data ingestion pipeline is its ability to update periodically as the data sources change. Ideally, this should involve as little friction as possible.
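
One low-friction way to keep up with changing sources is to re-ingest only the documents whose content changed. The sketch below is a minimal illustration, assuming each document is a dict with `source` and `content` keys (hypothetical field names) and that a manifest of content hashes from the previous run is available:

```python
import hashlib


def content_hash(text: str) -> str:
    """Return a stable fingerprint of a document's content."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def find_changed_docs(docs: list[dict], manifest: dict[str, str]) -> list[dict]:
    """Return docs that are new or whose content differs from the last run.

    `manifest` maps a document's source (hypothetical key) to the hash
    recorded during the previous ingestion.
    """
    changed = []
    for doc in docs:
        if manifest.get(doc["source"]) != content_hash(doc["content"]):
            changed.append(doc)
    return changed


docs = [
    {"source": "guide.md", "content": "Track experiments with W&B."},
    {"source": "faq.md", "content": "How do I log an artifact?"},
]
# Pretend guide.md was ingested, unchanged, in the previous run.
manifest = {"guide.md": content_hash("Track experiments with W&B.")}

changed_sources = [doc["source"] for doc in find_changed_docs(docs, manifest)]
print(changed_sources)
```

Only the documents returned by `find_changed_docs` would need to be preprocessed and chunked again.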

Tip: When building a POC, don't spend much time on chunk size, parsing strategies, format (markdown, HTML, or plain text), etc. Just build something that works end to end.

In [None]:
%load_ext autoreload
%autoreload 2

import asyncio
import json
import pathlib
from datetime import datetime

import dotenv
import nest_asyncio
import numpy as np
import pandas as pd
import wandb
import weave
from scripts.utils import display_source

# Allow asyncio.run() inside the notebook's already-running event loop.
nest_asyncio.apply()

# Load environment variables from a local .env file.
dotenv.load_dotenv()

In [None]:
WANDB_ENTITY = "rag-course"
WANDB_PROJECT = "dev"

wandb.require("core")

run = wandb.init(
    entity=WANDB_ENTITY,
    project=WANDB_PROJECT,
    group="Chapter 3",
)

weave_client = weave.init(f"{WANDB_ENTITY}/{WANDB_PROJECT}")

In this chapter, we start from the raw data. Below, we download the latest `raw_data` artifact. W&B Artifacts are a great way to store and version your data sources and to integrate them with downstream applications.

In [None]:
# We'll re-use the raw dataset from the artifact in our previous step
raw_artifact = run.use_artifact(
    f"{WANDB_ENTITY}/{WANDB_PROJECT}/raw_data:latest", type="dataset"
)
artifact_dir = raw_artifact.download()
raw_data_file = pathlib.Path(f"{artifact_dir}/documents.jsonl")
raw_data = list(map(json.loads, raw_data_file.read_text().splitlines()))
raw_data[:2]

In Chapter 1, we naively counted each word (as it appears in English text) as one token (`raw_tokens`). Below, we switch to a correct token counting strategy (`tokens`).

We will use [Cohere's tokenizer](https://docs.cohere.com/docs/tokens-and-tokenizers) to calculate the number of tokens per document in our `raw_data`. The correct token count, along with the word count, is stored in that document's metadata.
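
To see why a word count is only a rough proxy for a token count, compare whitespace splitting with a finer-grained split. The regex below is a crude stand-in used purely for illustration. It is not Cohere's tokenizer, and a real subword tokenizer will produce different counts:

```python
import re

text = "Call wandb.init(project='demo') before logging."

# Chapter 1 style: every whitespace-separated string counts as one token.
words = text.split()

# Crude stand-in for a tokenizer: runs of word characters, or any single
# non-space symbol. Real subword tokenizers split differently.
pieces = re.findall(r"\w+|[^\w\s]", text)

print(f"words: {len(words)}, pieces: {len(pieces)}")
```

Code-heavy documentation is where the two counts tend to diverge most, because a single "word" such as a function call breaks into many pieces.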

In [None]:
# Earlier we referred to words as tokens. We can be more correct in defining tokens by using a tokenizer.
# We'll use the Cohere tokenizer for this example.

from scripts.utils import (
    length_function,
    tokenize_text,
    get_special_tokens_set,
    TOKENIZERS,
)

display_source(tokenize_text)
display_source(length_function)

In [None]:
for doc in raw_data:
    # Rename the Chapter 1 word count and add the tokenizer-based count.
    doc["metadata"]["words"] = doc["metadata"].pop("raw_tokens")
    doc["metadata"]["tokens"] = length_function(doc["content"])
raw_data[:2]

As you can see above, the word count used as the token count in Chapter 1 is quite far from the actual `tokens` count. Knowing the correct token count helps decide:
- whether to build a RAG pipeline at all or to pass the whole document to an LLM (many top LLMs now support long context windows)
- what chunk size makes sense
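
The first decision can be sketched as a simple budget check. This is an illustrative helper, not part of the course code. The context window sizes and the `reserved` budget for the prompt and the answer are hypothetical values:

```python
def fits_in_context(
    doc_tokens: list[int], context_window: int, reserved: int = 1000
) -> bool:
    """Return True if all documents plus a reserved budget fit in one prompt.

    `reserved` is a hypothetical allowance for the system prompt,
    the question, and the generated answer.
    """
    return sum(doc_tokens) + reserved <= context_window


token_counts = [1200, 3400, 800]  # illustrative per-document token counts

fits_large = fits_in_context(token_counts, context_window=128_000)
fits_small = fits_in_context(token_counts, context_window=4_096)
print(fits_large, fits_small)
```

If the documents do not fit, retrieval over smaller chunks is needed, and the same token counts help pick a chunk size.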

## Pre-processing

There is a lot of extra formatting information (markdown elements) that is not very useful to an LLM.

We can remove this information by converting the contents to text. We can also remove any special characters and extra whitespace. 

Special characters here are the special tokens defined by the tokenizer; they vary depending on the model used.

Below we are using two functions:

- `convert_contents_to_text`: This takes the raw markdown string and converts it to HTML. Using `BeautifulSoup`, we then remove the image links, images, and other formatting information.
- `make_text_tokenization_safe`: This takes the text string and removes any special tokens present in it.
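
The idea behind both steps can be sketched with the standard library alone. This is a simplified stand-in, not the course's `convert_contents_to_text` or `make_text_tokenization_safe`: it uses regexes instead of an HTML parser, and the special tokens are made-up placeholders, since the real set comes from the tokenizer:

```python
import re

# Hypothetical special tokens; the real set comes from the tokenizer.
SPECIAL_TOKENS = {"<BOS_TOKEN>", "<EOS_TOKEN>"}


def strip_markdown(md: str) -> str:
    """Drop images, keep link text, and remove common markdown markers."""
    text = re.sub(r"!\[[^\]]*\]\([^)]*\)", "", md)  # images
    text = re.sub(r"\[([^\]]+)\]\([^)]*\)", r"\1", text)  # links -> link text
    text = re.sub(r"^#{1,6}\s*", "", text, flags=re.MULTILINE)  # headings
    text = re.sub(r"[*`]", "", text)  # emphasis and code ticks
    return text


def remove_special_tokens(text: str, special_tokens: set[str]) -> str:
    """Delete special tokens, then collapse runs of whitespace."""
    for token in special_tokens:
        text = text.replace(token, "")
    return re.sub(r"\s+", " ", text).strip()


raw = "# Setup\n\n![logo](logo.png) See the **[docs](https://docs.wandb.ai)** <EOS_TOKEN>"
clean = remove_special_tokens(strip_markdown(raw), SPECIAL_TOKENS)
print(clean)
```

Underscores are deliberately left alone here, so identifiers such as `raw_data` and the special tokens themselves stay intact until they are handled explicitly.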

In [None]:
from scripts.preprocess import convert_contents_to_text, make_text_tokenization_safe

display_source(convert_contents_to_text)
display_source(make_text_tokenization_safe)

We convert the raw markdown documents to text and make them tokenization safe. The cell below also prints the first 5 special tokens.

Because the formatting has been removed, `parsed_tokens` is smaller than `tokens`.

In [None]:
special_tokens_set = get_special_tokens_set(TOKENIZERS["command-r"])
print(list(special_tokens_set)[:5])

parsed_data = []

for doc in raw_data:
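    # Copy the metadata too, so adding `parsed_tokens` does not mutate raw_data.
    parsed_doc = {**doc, "metadata": doc["metadata"].copy()}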
    content = convert_contents_to_text(doc["content"])
    parsed_doc["parsed_content"] = make_text_tokenization_safe(
        content, special_tokens_set=special_tokens_set
    )
    parsed_doc["metadata"]["parsed_tokens"] = length_function(
        parsed_doc["parsed_content"]
    )
    parsed_data.append(parsed_doc)
parsed_data[:2]

We will log the preprocessed data as a W&B Artifact.

In [None]:
total_words = sum(map(lambda x: x["metadata"]["words"], parsed_data))
total_raw_tokens = sum(map(lambda x: x["metadata"]["tokens"], raw_data))
total_parsed_tokens = sum(map(lambda x: x["metadata"]["parsed_tokens"], parsed_data))

preprocessed_artifact = wandb.Artifact(
    name="preprocessed_data",
    type="dataset",
    description="Preprocessed wandb documentation",
    metadata={
        "total_files": len(parsed_data),
        "date_preprocessed": datetime.now().strftime("%Y-%m-%d"),
        "total_words": total_words,
        "total_raw_tokens": total_raw_tokens,
        "total_parsed_tokens": total_parsed_tokens,
    },
)
with preprocessed_artifact.new_file("documents.jsonl", mode="w") as f:
    for item in parsed_data:
        f.write(json.dumps(item) + "\n")
run.log_artifact(preprocessed_artifact)

## Data Chunking

We can split the processed data into smaller chunks. We do this to:
- send only the data required for generation, which reduces the input token cost
- keep the context focused, so the LLM does not miss the details we want in the generation

We can instead send the entire document to the LLM, but whether that works depends on the document's total token count and the nature of your use case. It will be costlier, but it is a good place to start.

### Semantic Chunking

Chunking can be done with different strategies: split after n words or tokens, split on headers, etc. Always try these simple chunking strategies before moving to more sophisticated ones.
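
As a rough illustration of the two simple strategies just mentioned, here is a minimal sketch. The function names are hypothetical and not part of the course's `scripts` package:

```python
import re


def chunk_by_words(text: str, chunk_size: int) -> list[str]:
    """Split text into consecutive chunks of at most `chunk_size` words."""
    words = text.split()
    return [
        " ".join(words[start : start + chunk_size])
        for start in range(0, len(words), chunk_size)
    ]


def chunk_by_headers(markdown_text: str) -> list[str]:
    """Start a new chunk at every markdown header line (# to ######)."""
    parts = re.split(r"^(?=#{1,6} )", markdown_text, flags=re.MULTILINE)
    return [part.strip() for part in parts if part.strip()]


word_chunks = chunk_by_words("one two three four five", chunk_size=2)
header_chunks = chunk_by_headers("# A\ntext\n## B\nmore")
print(word_chunks)
print(header_chunks)
```

Word-based splitting gives predictable sizes but can cut a thought in half, while header-based splitting follows the author's structure at the cost of uneven chunk sizes.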

Below, we implement semantic chunking (a more sophisticated strategy), which we have seen work in practice. In this strategy, we group similar sentences into chunks.

1. First, we split the text into sentences using the [BlingFire](https://github.com/microsoft/BlingFire) library.
2. Then, we group and combine sentences based on semantic similarity to create chunks.

Read more here: https://research.trychroma.com/evaluating-chunking
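
The two steps can be sketched with the standard library. This is a toy illustration under loud assumptions, and the course's `chunk_documents` may work differently: a regex stands in for BlingFire, bag-of-words cosine similarity stands in for sentence embeddings, and the `threshold` value is arbitrary:

```python
import math
import re
from collections import Counter


def split_sentences(text: str) -> list[str]:
    """Naive split after ., ! or ? (stand-in for BlingFire)."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def bow_vector(sentence: str) -> Counter:
    """Bag-of-words counts (stand-in for a sentence embedding)."""
    return Counter(re.findall(r"\w+", sentence.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def semantic_chunks(text: str, threshold: float = 0.2) -> list[str]:
    """Start a new chunk when adjacent sentences fall below `threshold`."""
    sentences = split_sentences(text)
    if not sentences:
        return []
    groups = [[sentences[0]]]
    for prev, curr in zip(sentences, sentences[1:]):
        if cosine(bow_vector(prev), bow_vector(curr)) >= threshold:
            groups[-1].append(curr)
        else:
            groups.append([curr])
    return [" ".join(group) for group in groups]


text = (
    "Artifacts store datasets. Artifacts version datasets too. "
    "Sweeps tune hyperparameters. Sweeps explore many hyperparameters."
)
chunks = semantic_chunks(text)
print(chunks)
```

With real embeddings, sentences that share meaning but not vocabulary would also be grouped, which word overlap alone cannot capture.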


In [None]:
preprocessed_artifact = run.use_artifact(
    f"{WANDB_ENTITY}/{WANDB_PROJECT}/preprocessed_data:latest", type="dataset"
)
artifact_dir = preprocessed_artifact.download()
preprocessed_data_file = pathlib.Path(f"{artifact_dir}/documents.jsonl")
preprocessed_data = list(
    map(json.loads, preprocessed_data_file.read_text().splitlines())
)
preprocessed_data[:2]

In [None]:
from scripts.chunking import chunk_documents

display_source(chunk_documents)

In [None]:
chunked_data = chunk_documents(preprocessed_data)
chunked_data[:2]

In [None]:
mean_chunk_size = np.mean([doc["metadata"]["parsed_tokens"] for doc in chunked_data])
std_chunk_size = np.std([doc["metadata"]["parsed_tokens"] for doc in chunked_data])
print(f"Mean chunk size: {mean_chunk_size}, Std chunk size: {std_chunk_size}")

In [None]:
# Again, we'll store the chunked data in an artifact for future use and reproducibility

total_cleaned_tokens = sum(map(lambda x: x["metadata"]["parsed_tokens"], chunked_data))

chunked_artifact = wandb.Artifact(
    name="chunked_data",
    type="dataset",
    description="Chunked wandb documentation",
    metadata={
        "total_files": len(chunked_data),
        "date_processed": datetime.now().strftime("%Y-%m-%d"),
        "total_raw_tokens": total_raw_tokens,
        "total_cleaned_tokens": total_cleaned_tokens,
        "chunk_size": {"mean": mean_chunk_size, "std": std_chunk_size},
    },
)
with chunked_artifact.new_file("documents.jsonl", mode="w") as f:
    for item in chunked_data:
        f.write(json.dumps(item) + "\n")
run.log_artifact(chunked_artifact)

Let's also try a different retriever, BM25, and see how it performs compared to the TF-IDF retriever we used earlier.
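
For intuition, here is a minimal sketch of Okapi BM25 scoring in plain Python. It is not the course's `BM25Retriever`, which may rely on a library and use different defaults. The whitespace tokenization and the `k1` and `b` values are assumptions, and the corpus is assumed to be non-empty:

```python
import math
from collections import Counter


def bm25_scores(
    query: str, docs: list[str], k1: float = 1.5, b: float = 0.75
) -> list[float]:
    """Score every document against the query with Okapi BM25."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            tf = term_freq[term]
            if tf == 0:
                continue
            # Smoothed IDF variant that stays non-negative for common terms.
            idf = math.log(
                (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1
            )
            # Term-frequency saturation (k1) with length normalisation (b).
            denominator = tf + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf * (k1 + 1) / denominator
        scores.append(score)
    return scores


docs = [
    "log an artifact with wandb",
    "sweeps tune hyperparameters",
    "artifact versioning tracks every artifact change",
]
scores = bm25_scores("artifact", docs)
best = max(range(len(docs)), key=scores.__getitem__)
print(scores, best)
```

Compared with plain TF-IDF, BM25 dampens the effect of repeated terms and normalises for document length, which often matters when chunk sizes vary.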

In [None]:
from scripts.rag_pipeline import SimpleRAGPipeline
from scripts.response_generator import SimpleResponseGenerator
from scripts.retriever import BM25Retriever, TFIDFRetriever

display_source(BM25Retriever)

In [None]:
bm25_retriever = BM25Retriever()
bm25_retriever.index_data(chunked_data)

tfidf_retriever = TFIDFRetriever()
tfidf_retriever.index_data(chunked_data)

The rest of the RAG pipeline remains the same.

In [None]:
INITIAL_PROMPT = pathlib.Path("prompts/initial_system.txt").read_text()
response_generator = SimpleResponseGenerator(model="command-r", prompt=INITIAL_PROMPT)
bm25_rag_pipeline = SimpleRAGPipeline(
    retriever=bm25_retriever, response_generator=response_generator, top_k=5
)
tfidf_rag_pipeline = SimpleRAGPipeline(
    retriever=tfidf_retriever, response_generator=response_generator, top_k=5
)

## Evaluate and compare the changes

In [None]:
from scripts.retrieval_metrics import ALL_METRICS as RETRIEVAL_METRICS
from scripts.response_metrics import ALL_METRICS as RESPONSE_METRICS

In [None]:
eval_artifact = run.use_artifact(
    f"{WANDB_ENTITY}/{WANDB_PROJECT}/eval_dataset:latest", type="dataset"
)
eval_dir = eval_artifact.download("../data/eval")
eval_dataset = pd.read_json(
    f"{eval_dir}/eval_dataset.jsonl", lines=True, orient="records"
)
eval_samples = eval_dataset.to_dict(orient="records")

In [None]:
retrieval_evaluation = weave.Evaluation(
    name="Retrieval_Evaluation",
    dataset=eval_samples[:10],
    scorers=RETRIEVAL_METRICS,
    preprocess_model_input=lambda x: {"query": x["question"], "k": 5},
)
bm25_retrieval_scores = asyncio.run(retrieval_evaluation.evaluate(bm25_retriever))
tfidf_retrieval_scores = asyncio.run(retrieval_evaluation.evaluate(tfidf_retriever))

In [None]:
response_evaluations = weave.Evaluation(
    name="Response_Evaluation",
    dataset=eval_samples,
    scorers=RESPONSE_METRICS,
    preprocess_model_input=lambda x: {"query": x["question"]},
)
bm25_response_scores = asyncio.run(response_evaluations.evaluate(bm25_rag_pipeline))
tfidf_response_scores = asyncio.run(response_evaluations.evaluate(tfidf_rag_pipeline))

# Exercise

1. Add more data sources to the RAG system. For example, add Jupyter Notebooks from the wandb/examples repo.
2. Use a different chunking method. Try your own parsing and chunking method.
3. Use a small-to-big retrieval method, where we embed small documents but retrieve big documents. You can add the parent document to the metadata and modify the `Retriever.search` method.
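
As a starting point for the third exercise, the small-to-big idea can be sketched as follows. This is a toy illustration, not a change to the course's `Retriever` classes: the `parent_id` metadata key is hypothetical, and simple word overlap stands in for the real retriever's scoring:

```python
def small_to_big_search(
    query: str, chunks: list[dict], parents: dict[str, str], k: int = 2
) -> list[str]:
    """Rank small chunks by word overlap, then return their parent documents."""
    query_terms = set(query.lower().split())
    ranked = sorted(
        chunks,
        key=lambda chunk: len(query_terms & set(chunk["content"].lower().split())),
        reverse=True,
    )
    results, seen = [], set()
    for chunk in ranked:
        parent_id = chunk["metadata"]["parent_id"]  # hypothetical metadata key
        if parent_id not in seen:  # return each parent document only once
            seen.add(parent_id)
            results.append(parents[parent_id])
        if len(results) == k:
            break
    return results


parents = {
    "artifacts.md": "Full artifacts guide",
    "sweeps.md": "Full sweeps guide",
}
chunks = [
    {"content": "log an artifact", "metadata": {"parent_id": "artifacts.md"}},
    {"content": "version an artifact", "metadata": {"parent_id": "artifacts.md"}},
    {"content": "configure a sweep", "metadata": {"parent_id": "sweeps.md"}},
]
top_parents = small_to_big_search("how to log an artifact", chunks, parents, k=1)
print(top_parents)
```

The small chunks keep matching precise, while returning the parent gives the LLM the surrounding context.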