Notebook created by [Nikolaos Tsopanidis](https://github.com/tSopermon)

# RAG System: Data Ingestion

Data ingestion is the first step in building a RAG system: it provides a reliable data source that an LLM can draw on, extending the capabilities of a pre-trained model.

For this task, we need to import text-based documents about concert tours in 2025-2026 and store their information. Users will then be able to ask the LLM questions and receive answers grounded in this knowledge source.

## Data ingestion process
In this process we load external documents and store them in a vector database for retrieval. The steps are:
 1. **Load**: Loading data into documents  
 2. **Split**: Splitting data into manageable chunks  
 3. **Embed**: Creating document embeddings  
 4. **Store**: Storing embeddings into a vector database 

<div style="text-align: center;">

![Alt](https://miro.medium.com/v2/resize:fit:720/format:webp/1*wHqtILSjqYsF6RnDq2CJDA.png)

Source: [medium.com - Amina Javaid](https://medium.com/@aminajavaid30/building-a-rag-system-the-data-ingestion-pipeline-d04235fd17ea)

</div>

## LangChain Implementation
**LangChain** is an open-source framework commonly used in GenAI applications. Besides building apps with it, we can use LangSmith, introduced by LangChain, to monitor LLMs and to debug and evaluate code.

* First, we will install the required packages for the project

In [None]:
# install dependencies if not already installed
%pip install langchain-community langchain-experimental langchain-huggingface faiss-cpu tiktoken sentence_transformers chromadb huggingface_hub[hf_xet] langchain-chroma

### **1. Data Loading**
Data extraction and conversion into a suitable format.

#### **Loaders**
LangChain's loaders facilitate data ingestion and preprocessing from various sources. Data is loaded into a document object, which includes the text content and the associated metadata. Loaders can also perform tokenization, normalization and format conversion, preparing data for LLMs.

In [2]:
"""
Loading text documents using LangChain Community.
"""
from langchain_community.document_loaders import TextLoader

file_path = "data/my_document.txt" # example file path
loader = TextLoader(file_path) # Create a loader for the file
documents = loader.load() # Load the file into a list of Document objects
print(f"Loaded {len(documents)} document(s) from {file_path}")

Loaded 1 document(s) from data/my_document.txt


##### **Metadata** can prove useful when building an app where users need to retrieve information along with source or page location.

In [3]:
# Metadata created by the loader
for doc in documents:
    print(f"Document ID: {doc.metadata['source']}")
    print(f"Document Content: {doc.page_content[:100]}...")  # Print first 100 characters
    print(f"Document Metadata: {doc.metadata}")  # Print all metadata

Document ID: data/my_document.txt
Document Content: COVENANT

 

06.09.2025 Doors: 7:30 pm, Petra's Theater, M. Merkouri, Petroupoli 132 31, Athens




...
Document Metadata: {'source': 'data/my_document.txt'}


### **2. Data Splitting**
When preprocessing data for a RAG system, it's important to consider the data's characteristics, system requirements, and limitations. Large documents may not be processable by an LLM because of context window limits. To handle this, data must be split into smaller chunks while preserving context for effective retrieval. The reasons are:
 1. **Manageability**: Smaller parts fit better within the LLM's context window, ensuring easier handling.
 2. **Embedding model compatibility**: Embedding models are limited in how much text they can process at once, so chunking aligns the data with the model's capacity.
 3. **Efficient retrieval**: When a user sends a query, they only need specific information, often found in smaller chunks. Retrieving small chunks requires fewer resources than processing an entire corpus.

#### **Splitters**
Splitters divide large text documents into smaller chunks, ensuring they fit within model constraints and can be processed individually. Chunk size must be chosen carefully to preserve context. Overlapping chunks help maintain continuity, while minimal overlap works for distinct topics, ensuring high-quality responses.
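
To make chunk size and overlap concrete, here is a minimal pure-Python sketch of fixed-size character windows. It is a simplification for illustration only: the function name `chunk_text` is hypothetical, and LangChain's splitters work differently (for example, they split on a separator first and then merge pieces), so this is not how `CharacterTextSplitter` is implemented.

```python
def chunk_text(text: str, chunk_size: int = 400, chunk_overlap: int = 50) -> list[str]:
    """Split text into character windows of at most chunk_size that overlap by chunk_overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


# Toy input so the overlap is easy to see
sample = "abcdefghijklmnopqrstuvwxyz"
pieces = chunk_text(sample, chunk_size=10, chunk_overlap=3)
for piece in pieces:
    print(piece)
```

Each piece starts with the last three characters of the previous one, which is the continuity that overlap is meant to provide. A larger overlap repeats more text across chunks, so it trades storage and embedding time for context.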

We can adjust the chunk size depending on the limitations of the embedding model in use. Embedding models can be found on platforms like [Hugging Face](https://huggingface.co/models?other=embeddings), where each has specific input and output token limits.
* Types of splitters: 
    - Recursive Splitter
    - HTML Splitter
    - Markdown Splitter
    - Code Splitter
    - Token Splitter
    - Character Splitter
    - Semantic Chunker

In [4]:
from langchain_text_splitters import CharacterTextSplitter
"""
CharacterTextSplitter splits text into chunks based on character count.
https://api.python.langchain.com/en/latest/character/langchain_text_splitters.character.CharacterTextSplitter.html
"""
text_splitter = CharacterTextSplitter(
    separator="\n\n",  # Split on double newlines
    chunk_size=400,  # Maximum size of each chunk
    chunk_overlap=50,  # Overlap between chunks
    length_function=len,  # Function to calculate the length of the text
)

chunks = []
for doc in documents:
    texts = text_splitter.split_text(doc.page_content)
    chunks.extend(texts)

In [5]:
# print the first 5 chunks
for i, chunk in enumerate(chunks[:5]):
    print(f"Chunk {i+1}: {chunk}")
    print(f"Chunk {i+1} Length: {len(chunk)}")
    print("-" * 50)

Chunk 1: COVENANT

 

06.09.2025 Doors: 7:30 pm, Petra's Theater, M. Merkouri, Petroupoli 132 31, Athens

INFO
 

The legendary Covenant at Petra's Theatre – Saturday, September 6, 2025

 
“Covenant, one of the leading names in EBM and dark electronic music, is coming to Athens for a unique performance at the Petra's Theatre on Saturday, September 6, 2025.
Chunk 1 Length: 349
--------------------------------------------------
Chunk 2: With a career spanning over three decades, Covenant has established itself as one of the most influential bands in electronic music. From the iconic "Like Tears in Rain," "Sequencer," "Northern Light," and "United States of Mind" to their more recent releases, the Swedish band continues to redefine their sound, always maintaining their signature intensity and atmosphere.
Chunk 2 Length: 373
--------------------------------------------------
Chunk 3: With tracks like "Dead Stars," "Call the Ships to Port," "Bullet," "Theremin," and many more, Covenant prom

### **3. Embeddings**
Textual data is converted into vector embeddings for efficient retrieval in a vector database. These dense, multi-dimensional vectors capture the semantic meaning of words, phrases, or sentences. Embedding models, often transformer-based, generate them so that semantically similar texts have closer vector representations. Embeddings can also be used to detect anomalies by identifying outliers. Models are available as open source on **Hugging Face** or as paid options such as **OpenAI**.
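
As a rough illustration of the outlier idea, the sketch below flags the vector that lies farthest from the centroid of a small set. The three-dimensional vectors and their labels are invented for the example and do not come from any real embedding model; real embeddings have hundreds of dimensions, and the helper name `farthest_from_centroid` is hypothetical.

```python
import math


def centroid(vectors: list[list[float]]) -> list[float]:
    """Component-wise mean of a list of equal-length vectors."""
    dims = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dims)]


def farthest_from_centroid(vectors: list[list[float]]) -> int:
    """Return the index of the vector with the largest Euclidean distance to the centroid."""
    center = centroid(vectors)
    distances = [math.dist(v, center) for v in vectors]
    return max(range(len(vectors)), key=distances.__getitem__)


# Made-up toy vectors: three "concert" texts cluster together, one text does not
labels = ["concert in Athens", "live show in Athens", "band tour dates", "tax form instructions"]
toy_vectors = [
    [0.90, 0.10, 0.00],
    [0.80, 0.20, 0.00],
    [0.85, 0.15, 0.05],
    [0.00, 0.10, 0.90],
]

outlier_index = farthest_from_centroid(toy_vectors)
print("Possible outlier:", labels[outlier_index])
```

Distance to the centroid is only one simple heuristic; it is used here because it needs nothing beyond the standard library.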

In [None]:
# initialization of the HuggingFaceEmbeddings model

from langchain_huggingface import HuggingFaceEmbeddings
"""
Source: https://huggingface.co/BAAI/bge-m3

BGE-M3 offers:
* Multi-Functionality: It can simultaneously perform the three common retrieval functionalities of embedding model: dense 
  retrieval, multi-vector retrieval, and sparse retrieval.
* Multi-Linguality: It can support more than 100 working languages.
* Multi-Granularity: It is able to process inputs of different granularities, spanning from short sentences to long 
  documents of up to 8192 tokens.

Suggestions for retrieval pipeline in RAG (future work):
* Hybrid retrieval leverages the strengths of various methods, offering higher accuracy and stronger generalization 
  capabilities. A classic example: using both embedding retrieval and the BM25 algorithm. Now, you can try to 
  use BGE-M3, which supports both embedding and sparse retrieval. This allows you to obtain token weights (similar to 
  the BM25) without any additional cost when generate dense embeddings. To use hybrid retrieval, you can refer to 
  Vespa (https://github.com/vespa-engine/pyvespa/blob/master/docs/sphinx/source/examples/mother-of-all-embedding-models-cloud.ipynb) 
  and Milvus.
* As cross-encoder models, re-ranker demonstrates higher accuracy than bi-encoder embedding model. Utilizing the 
  re-ranking model (e.g., bge-reranker, bge-reranker-v2) after retrieval can further filter the selected text.
"""

embeddings = HuggingFaceEmbeddings(
    model_name="BAAI/bge-m3",
    model_kwargs={
        "device": "cuda:0"  # Assumes a CUDA GPU is available; use "cpu" otherwise
    },
)

# Create embeddings of text chunks
for i, chunk in enumerate(chunks):
    print(f"Text chunk {i+1}")
    print(50*"-")
    print(chunk, "\n")
    query_result = embeddings.embed_query(chunk)
    print(f"Embeddings: {query_result}\n\n")

We will work with the all-mpnet-base-v2 model in this project because the BGE-M3 model is 2.27 GB, while all-mpnet-base-v2 is 438 MB.

In [1]:
from langchain_huggingface import HuggingFaceEmbeddings

embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-mpnet-base-v2",
    model_kwargs={
        "device": "cuda:0"  # Assumes a CUDA GPU is available; use "cpu" otherwise
    },
)

In [None]:
for i, chunk in enumerate(chunks):
    print("Text chunk ", i)
    print("--------------")
    print(chunk, "\n")
    query_result = embeddings.embed_query(chunk)
    print("Embeddings", "\n", query_result, "\n\n")

### **4. Data Storage - Vector Stores**
Vector stores efficiently store and query high-dimensional vector data, making them essential for AI applications like recommendation systems and RAG. Unlike traditional databases, which organize data in rows and columns, vector stores handle unstructured data like text, images, or speech more effectively. They are widely used in machine learning and retrieval tasks to find semantically similar information.

Vector store structure:
 * Vectors/Embeddings
 * Metadata
 * Original data
 * Unique ID
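
The four parts listed above can be pictured as one record per chunk. The dataclass below is only a sketch of that shape: the name `VectorRecord`, its field names, and the embedding values are hypothetical, and real vector stores such as Chroma use their own internal layout.

```python
from dataclasses import dataclass, field
from uuid import uuid4


@dataclass
class VectorRecord:
    """One stored item: the vector plus everything needed to show it to a user."""
    embedding: list[float]                         # vector produced by the embedding model
    text: str                                      # original chunk of text
    metadata: dict = field(default_factory=dict)   # e.g. source file or page
    id: str = field(default_factory=lambda: str(uuid4()))  # unique identifier


record = VectorRecord(
    embedding=[0.1, 0.2, 0.3],  # placeholder values, not a real embedding
    text="06.09.2025 Doors: 7:30 pm, Petra's Theater",
    metadata={"source": "data/my_document.txt"},
)
print(record.id, record.metadata)
```

Keeping the original text and metadata next to the vector is what lets a retriever return readable passages and their sources instead of raw numbers.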

Vector stores use two main similarity measures for querying:
- **Cosine Similarity**: Identifies the top-k most similar vectors by comparing the input vector (e.g., the embedding of a product description) with the stored vectors. The metadata stored with each match then supplies the relevant details. This is commonly used in recommendation systems.
- **Dot Product**: Another measure for comparing vectors; unlike cosine similarity, it is also affected by vector magnitude.
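
The difference between the two measures can be seen on a few hand-picked 2-D vectors. This is a plain-Python sketch with hypothetical helper names (`dot`, `cosine_similarity`, `top_k`); vector databases use optimized and often approximate search rather than a full sort like this.

```python
import math


def dot(a: list[float], b: list[float]) -> float:
    """Dot product: grows with both alignment and vector length."""
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Dot product divided by both lengths, so only the direction matters."""
    norm = math.sqrt(dot(a, a)) * math.sqrt(dot(b, b))
    return dot(a, b) / norm if norm else 0.0


def top_k(query, vectors, k=1, score=cosine_similarity):
    """Return the indices of the k stored vectors with the highest score."""
    ranked = sorted(range(len(vectors)), key=lambda i: score(query, vectors[i]), reverse=True)
    return ranked[:k]


# Hand-picked toy vectors, not real embeddings
stored = [
    [1.0, 0.0],  # same direction as the query, short
    [3.0, 4.0],  # different direction, but long
    [0.0, 2.0],  # orthogonal to the query
]
query = [1.0, 0.0]

print("cosine top-1:", top_k(query, stored, k=1))
print("dot top-1:   ", top_k(query, stored, k=1, score=dot))
```

Cosine similarity picks the vector pointing the same way as the query, while the dot product prefers the longer vector. When all vectors are normalized to unit length, the two measures produce the same ranking.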

For this project, we will create a locally managed vector database using [ChromaDB](https://github.com/chroma-core/chroma). Cloud-based services come with associated service costs. Chroma is free, can run locally or on a remote server, and is well suited to flexible, LLM-focused tasks.

In [None]:
from langchain_chroma import Chroma
"""
https://api.python.langchain.com/en/latest/modules/indexes/vectorstores/langchain_community.vectorstores.chroma.Chroma.html
"""
vector_store = Chroma.from_texts(
    texts=chunks,
    embedding=embeddings,
    persist_directory="data/chroma_db",  # Directory to persist the database
    collection_name="my_collection"  # Name of the collection
)

## **References**
* https://medium.com/@aminajavaid30/building-a-rag-system-the-data-ingestion-pipeline-d04235fd17ea
* https://medium.com/@laddhaakshatrai/how-to-perform-data-ingestion-with-langchain-day-12-100-f11288d7ae99
* https://huggingface.co/sentence-transformers/all-mpnet-base-v2
* https://huggingface.co/mrhimanshu/finetuned-bge-m3

According to this [colab notebook](https://colab.research.google.com/drive/1gyGZn_LZNrYXYXa-pltFExbptIe7DAPe?usp=sharing) by Sam Witteveen, we continue the process below:

In [None]:
# Persisting the database (saving it to disk)
# Since Chroma 0.4.x the manual persistence method is no longer supported,
# as docs are automatically persisted. The old call is kept only for reference:
# vector_store.persist()


In [None]:
# Loading the persisted database from disk to use it
vector_store = Chroma(
    embedding_function=embeddings,
    persist_directory="data/chroma_db",  # Directory where the database was persisted
    collection_name="my_collection"  # Name of the collection
)

**Let's create a retriever interface using vector store**

### **5. Retriever Creation**
A retriever is the component responsible for fetching relevant data or documents based on a query.

In [3]:
# creating a retriever
retriever = vector_store.as_retriever(search_kwargs={"k": 1})  # Retrieve top 1 most relevant document for a given query

### **6. Creating a Chain**
https://medium.com/@jiangan0808/retrieval-augmented-generation-rag-with-open-source-hugging-face-llms-using-langchain-bd618371be9d

In [None]:
import os
os.environ["HUGGINGFACEHUB_API_TOKEN"] = ""

Setting a [HuggingFace Pipeline](https://python.langchain.com/docs/integrations/llms/huggingface_pipelines/)

In [None]:
from langchain_huggingface.llms import HuggingFacePipeline
from langchain.chains import RetrievalQA

hf = HuggingFacePipeline.from_model_id(
    model_id="google/flan-t5-small",
    task="text2text-generation",
    pipeline_kwargs={"max_length": 200},
)

# Create the RetrievalQA chain
qa = RetrievalQA.from_chain_type(
    llm=hf,
    chain_type="stuff",
    retriever=retriever,
    return_source_documents=True,
)

Device set to use cuda:0


In [None]:
# Example query
query = "What is the main topic of the document?"
result = qa.invoke({"query": query})

In [80]:
# Print the answer and sources
answer = result["result"]
print("Answer:", answer)
if "source_documents" in result:
    sources = [doc.metadata.get("source", "Unknown source") for doc in result["source_documents"]]
    print("Sources:", sources)
else:
    print("No source documents returned.")

Answer: Covenant
Sources: ['Unknown source']
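
The `Unknown source` entry above is most likely because `Chroma.from_texts` received plain strings, so no metadata was stored with the chunks. Below is a minimal sketch of one way to keep the source, assuming the `documents` and `embeddings` objects from the earlier cells: split with `split_documents` and store with `Chroma.from_documents`. The helper name `build_store_with_metadata` and the collection name `my_collection_with_sources` are hypothetical, chosen only for illustration.

```python
def build_store_with_metadata(
    documents,
    embeddings,
    persist_directory="data/chroma_db",
    collection_name="my_collection_with_sources",  # hypothetical name, avoids mixing with the first collection
):
    """Split Document objects (not raw strings) so that every chunk keeps its metadata."""
    # Imported inside the function so the helper can be defined before the packages are installed
    from langchain_text_splitters import CharacterTextSplitter
    from langchain_chroma import Chroma

    splitter = CharacterTextSplitter(separator="\n\n", chunk_size=400, chunk_overlap=50)

    # split_documents returns Document chunks that carry over each parent's metadata
    doc_chunks = splitter.split_documents(documents)

    # from_documents stores the text together with the metadata of each chunk
    return Chroma.from_documents(
        documents=doc_chunks,
        embedding=embeddings,
        persist_directory=persist_directory,
        collection_name=collection_name,
    )
```

Under these assumptions, calling `build_store_with_metadata(documents, embeddings)` and recreating the retriever from the returned store should let `doc.metadata.get("source")` return the file path instead of the fallback string.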


### LangChain [tutorial](https://python.langchain.com/docs/tutorials/rag/) on how to build a RAG app