In [1]:
!sudo apt-get update
!sudo apt-get install -y poppler-utils

!pip install -U unstructured-pytesseract
!pip install python-docx

!pip install -U langchain langchain-community unstructured pdf2image pytesseract python-docx openpyxl pdfminer.six pi_heif unstructured_inference


In [2]:
from google.colab import drive
drive.mount('/content/drive')


Mounted at /content/drive


# RAG Pipeline: Chunking and Indexing with LangChain and LlamaIndex

# 📘 What is Chunking in RAG?

 **Chunking** is the process of splitting a large document into smaller, semantically meaningful parts or "chunks". These chunks are then embedded and indexed, allowing the RAG system to retrieve the most relevant information to answer a user query.
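To make the chunk, index, retrieve flow concrete, here is a minimal sketch in plain Python. It is only an illustration: word overlap stands in for embedding similarity, and the names `tokenize`, `retrieve`, and `toy_chunks` are hypothetical helpers, not part of LangChain or LlamaIndex.

```python
import re


def tokenize(text: str) -> set[str]:
    """Lowercase the text and return its set of alphanumeric word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query: str, chunks: list[str], top_k: int = 1) -> list[str]:
    """Rank chunks by the number of distinct query words they contain.

    Assumption: word overlap is a toy stand-in for embedding similarity.
    """
    query_words = tokenize(query)
    ranked = sorted(
        chunks,
        key=lambda chunk: len(query_words & tokenize(chunk)),
        reverse=True,
    )
    return ranked[:top_k]


# Hypothetical chunks; a real pipeline would produce these with a text splitter.
toy_chunks = [
    "Chunking splits a large document into smaller parts.",
    "Each chunk is embedded and stored in a vector index.",
    "At query time the most relevant chunks are retrieved.",
]

best = retrieve("how is a chunk embedded", toy_chunks)
print(best)
```

A real RAG system replaces the word-overlap score with vector similarity between the query embedding and the chunk embeddings, but the shape of the lookup is the same.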

# 💡 Importance of Chunking
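 - **Improves retrieval accuracy:** smaller, focused chunks match a query more precisely than a whole document.
 - **Fits the model context window:** only the retrieved chunks, not the full document, are passed to the LLM.
 - **Preserves semantic meaning:** splitting at natural boundaries keeps related sentences together.
 - **Supports efficient embedding and indexing:** each chunk is embedded once and stored as its own entry in the index.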

In [3]:
from langchain_community.document_loaders import (
    UnstructuredPDFLoader,
    UnstructuredWordDocumentLoader,
    UnstructuredExcelLoader,
    WebBaseLoader
)





In [8]:
pdf_loader = UnstructuredPDFLoader("/content/drive/MyDrive/Zeta Workshop/dataset/Avinash-CV.pdf", mode="elements")
elements = pdf_loader.load()

# Group the text of every element by the page it came from.
page_data = {}
for el in elements:
    # Elements without a page number are grouped under the key None.
    page_number = el.metadata.get("page_number")
    page_data.setdefault(page_number, []).append(el.page_content)

In [9]:
for page in page_data:
  print(f"Page {page}:")
  print(page_data[page])

Page 1:
['Dr. Avinash Kumar Singh', 'Hyderabad, India | +91-9005722861 | avinashkumarsingh1986@gmail.com | http://avinashkumarsingh.in', 'Profile', 'With over 14 years in Al, I have evolved through roles as an ML Researcher, Engineer, Product Manager, and now as Chief Al Scientist. | have led the development and deployment of deep learning-based computer vision and NLP models on platforms like AWS, GCP, Humanoid Robots, Edge Devices like Jetson Nano, Raspberry Pi, and NXP boards. My expertise extends to tackling challenges in concurrency, security, and latency. My academic journey, enriched by a Ph.D. and postdoctoral research, provides a profound understanding of neural networks across diverse data types, while my industrial experience ensures practical AI solutions are deployed effectively, serving real users. This unique blend of research and industry expertise enables me to lead in crafting and delivering impactful Al innovations, driving business transformation and societal advanc

# 🧱 Chunking Methodologies

| Method | Definition | How it Works |
|--------|------------|--------------|
| Fixed-size | Breaks text into equal-sized blocks | Splits by a predefined character/token limit |
| Semantic | Breaks text at logical boundaries | Uses sentence or paragraph segmentation |
| Recursive | Tries semantic split first, falls back to smaller units | Combines semantic and fixed approaches |
| Sliding Window | Uses overlapping chunks | Ensures context flow across adjacent chunks |


### 🔹 Fixed-size Chunking

**Definition:** Splits text into equal-sized segments.

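**How it Works:** You define `chunk_size` (the maximum number of characters per chunk) and `chunk_overlap` (the number of characters shared by consecutive chunks). The splitter ignores sentence and word boundaries, so a chunk can start or end mid-word.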
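The effect of the two parameters can be seen in a minimal plain-Python sketch. The function `fixed_size_chunks` is a hypothetical helper written for illustration; it does not reproduce LangChain's internals (for example, it does no whitespace stripping).

```python
def fixed_size_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into pieces of at most chunk_size characters.

    Each piece starts chunk_size - chunk_overlap characters after the previous one.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")

    step = chunk_size - chunk_overlap
    pieces = []
    for start in range(0, len(text), step):
        pieces.append(text[start:start + chunk_size])
        # Stop once a piece reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return pieces


alphabet = "abcdefghijklmnopqrstuvwxyz"
alphabet_chunks = fixed_size_chunks(alphabet, chunk_size=10, chunk_overlap=4)
print(alphabet_chunks)
# ['abcdefghij', 'ghijklmnop', 'mnopqrstuv', 'stuvwxyz']
```

Each chunk repeats the last four characters of the previous one, and the final chunk is shorter because the text runs out.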


In [43]:
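from langchain_text_splitters import CharacterTextSplitter

# Join the elements of each page with newlines, and keep a newline between pages
# so the last element of one page does not run into the first of the next.
text = "\n".join("\n".join(page_data[page]) for page in page_data)
# print(text)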

splitter = CharacterTextSplitter(
    chunk_size=200,
    chunk_overlap=50,
    separator=""  # Force strict character-level splitting
)
chunks = splitter.split_text(text)

for i, chunk in enumerate(chunks):
    print(f"Chunk {i+1}: {repr(chunk)}")



Chunk 1: 'Dr. Avinash Kumar Singh\nHyderabad, India | +91-9005722861 | avinashkumarsingh1986@gmail.com | http://avinashkumarsingh.in\nProfile\nWith over 14 years in Al, I have evolved through roles as an ML Resear'
Chunk 2: 'n Al, I have evolved through roles as an ML Researcher, Engineer, Product Manager, and now as Chief Al Scientist. | have led the development and deployment of deep learning-based computer vision and N'
Chunk 3: 'yment of deep learning-based computer vision and NLP models on platforms like AWS, GCP, Humanoid Robots, Edge Devices like Jetson Nano, Raspberry Pi, and NXP boards. My expertise extends to tackling c'
Chunk 4: 'and NXP boards. My expertise extends to tackling challenges in concurrency, security, and latency. My academic journey, enriched by a Ph.D. and postdoctoral research, provides a profound understanding'
Chunk 5: 'ctoral research, provides a profound understanding of neural networks across diverse data types, while my industrial experience ensures pra

### 🔹 Semantic Chunking

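**Definition:** Splits text at natural language boundaries (paragraphs, sentences) instead of at a fixed character count.

**How it Works:** The example below uses `RecursiveCharacterTextSplitter`, the recursive method from the table above, as an approximation. It tries the separators in order (paragraph breaks, line breaks, periods, then spaces) and moves to the next separator only when a piece is still longer than `chunk_size`.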
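As a simplified illustration of boundary-aware splitting, the sketch below packs whole sentences into chunks. It is not how `RecursiveCharacterTextSplitter` works internally: `sentence_chunks` is a hypothetical helper, and its sentence rule (a `.`, `!`, or `?` followed by whitespace) is a naive assumption that will mis-split abbreviations such as "Dr. Singh".

```python
import re


def sentence_chunks(text: str, max_chars: int) -> list[str]:
    """Pack whole sentences into chunks of at most max_chars characters.

    A single sentence longer than max_chars is kept whole in its own chunk.
    """
    # Naive sentence boundary (an assumption): ., ! or ? followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

    packed, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}" if current else sentence
        if current and len(candidate) > max_chars:
            # Adding this sentence would exceed the limit: close the chunk.
            packed.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        packed.append(current)
    return packed


sentences_text = (
    "Chunking splits text. Chunks are embedded. "
    "Embeddings are indexed. Queries retrieve chunks."
)
packed_chunks = sentence_chunks(sentences_text, max_chars=45)
for i, chunk in enumerate(packed_chunks, start=1):
    print(f"Chunk {i}: {chunk!r}")
```

Unlike fixed-size chunking, no chunk here ends mid-sentence; the trade-off is that chunk lengths vary.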


In [44]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

semantic_splitter = RecursiveCharacterTextSplitter(
    chunk_size=200,
    chunk_overlap=50,
    separators=["\n\n", "\n", ".", " "]
)
semantic_chunks = semantic_splitter.split_text(text)
for i, chunk in enumerate(semantic_chunks):
    print(f"Chunk {i+1}: {repr(chunk)}")




Chunk 1: 'Dr. Avinash Kumar Singh\nHyderabad, India | +91-9005722861 | avinashkumarsingh1986@gmail.com | http://avinashkumarsingh.in\nProfile'
Chunk 2: 'With over 14 years in Al, I have evolved through roles as an ML Researcher, Engineer, Product Manager, and now as Chief Al Scientist'
Chunk 3: '. | have led the development and deployment of deep learning-based computer vision and NLP models on platforms like AWS, GCP, Humanoid Robots, Edge Devices like Jetson Nano, Raspberry Pi, and NXP'
Chunk 4: 'Devices like Jetson Nano, Raspberry Pi, and NXP boards'
Chunk 5: '. My expertise extends to tackling challenges in concurrency, security, and latency. My academic journey, enriched by a Ph.D'
Chunk 6: '. and postdoctoral research, provides a profound understanding of neural networks across diverse data types, while my industrial experience ensures practical AI solutions are deployed effectively,'
Chunk 7: 'practical AI solutions are deployed effectively, serving real users'
Chunk 8: '. This 

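### 🔹 Sliding Window Chunking

**Definition:** Creates overlapping chunks so that context carries over from one chunk to the next.

**How it Works:** A window of `chunk_size` characters advances by `chunk_size - chunk_overlap` characters at each step, so consecutive chunks share `chunk_overlap` characters. The code below is the fixed-size splitter from earlier; its non-zero `chunk_overlap` is what produces the sliding window.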
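The window arithmetic can be sketched in plain Python. The helper `window_spans` is hypothetical and only computes character offsets; it assumes a pure character window, whereas a library splitter may also trim whitespace.

```python
def window_spans(text_length: int, window: int, overlap: int) -> list[tuple[int, int]]:
    """Return (start, end) character offsets of each sliding window."""
    if not 0 <= overlap < window:
        raise ValueError("overlap must be >= 0 and smaller than window")
    if text_length <= 0:
        return []

    stride = window - overlap  # how far the window moves each step
    # Stop before a start position that would add no new characters.
    last_exclusive = max(text_length - overlap, 1)
    return [
        (start, min(start + window, text_length))
        for start in range(0, last_exclusive, stride)
    ]


# Same settings as the notebook (200-character window, 50-character overlap)
# applied to a hypothetical 500-character text.
spans = window_spans(500, window=200, overlap=50)
print(spans)
# [(0, 200), (150, 350), (300, 500)]
```

Each window starts 150 characters after the previous one, so characters 150 to 199 appear in both the first and the second chunk.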

In [46]:
# Same configuration as the fixed-size example: chunk_overlap=50 makes consecutive chunks overlap.
window_splitter = CharacterTextSplitter(chunk_size=200, chunk_overlap=50, separator="")
window_chunks = window_splitter.split_text(text)
for i, chunk in enumerate(window_chunks):
    print(f"Chunk {i+1}: {repr(chunk)}")

Chunk 1: 'Dr. Avinash Kumar Singh\nHyderabad, India | +91-9005722861 | avinashkumarsingh1986@gmail.com | http://avinashkumarsingh.in\nProfile\nWith over 14 years in Al, I have evolved through roles as an ML Resear'
Chunk 2: 'n Al, I have evolved through roles as an ML Researcher, Engineer, Product Manager, and now as Chief Al Scientist. | have led the development and deployment of deep learning-based computer vision and N'
Chunk 3: 'yment of deep learning-based computer vision and NLP models on platforms like AWS, GCP, Humanoid Robots, Edge Devices like Jetson Nano, Raspberry Pi, and NXP boards. My expertise extends to tackling c'
Chunk 4: 'and NXP boards. My expertise extends to tackling challenges in concurrency, security, and latency. My academic journey, enriched by a Ph.D. and postdoctoral research, provides a profound understanding'
Chunk 5: 'ctoral research, provides a profound understanding of neural networks across diverse data types, while my industrial experience ensures pra