<a href="https://colab.research.google.com/github/saurabhvybs/AI-RAG-Chatbot/blob/main/HOS_Assignment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Data Science Internship Assignment: Exploring Text Splitters in LangChain

**Introduction**

This assignment focuses on text-splitting strategies using the LangChain framework. Text splitters are essential for dividing large texts into smaller, manageable chunks for processing with language models. In this assignment, you will use several built-in LangChain splitters to split sample texts and analyze the results.

You will work with the following splitters:
- `CharacterTextSplitter`: splits text based on a character count.
- `RecursiveCharacterTextSplitter`: splits text recursively, preferring natural boundaries such as paragraphs, lines, and words.
- `SemanticChunker`: splits text based on semantic meaning (requires an embedding model).
- `MarkdownHeaderTextSplitter`: splits Markdown text based on headers.
- `HTMLHeaderTextSplitter`: splits HTML text based on headers.

**Sample Texts**

Below are sample texts for each splitter:

- **Plain Text**

In [None]:
plain_text = """
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.

Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
"""

- **Markdown Text**

In [None]:
markdown_text = """
# Introduction
This is the introduction section.

## Subsection 1
This is the first subsection under the introduction.

### Details
Here are some details about subsection 1.

## Subsection 2
This is the second subsection.

# Conclusion
This is the conclusion section.
"""

- **HTML Text**

In [None]:
html_text = """
<html>
<body>
<h1>Introduction</h1>
<p>This is the introduction section.</p>
<h2>Subsection 1</h2>
<p>This is the first subsection under the introduction.</p>
<h3>Details</h3>
<p>Here are some details about subsection 1.</p>
<h2>Subsection 2</h2>
<p>This is the second subsection.</p>
<h1>Conclusion</h1>
<p>This is the conclusion section.</p>
</body>
</html>
"""

**Task 1: Character-Based Splitting**

Use a character-based splitter to split `plain_text` into chunks of approximately 100 characters, with a chunk overlap of 20 characters. Print the first three chunks.

**Task 2: Recursive Character Splitting**

Use the recursive character splitter to split `plain_text` into chunks of approximately 100 characters, with a chunk overlap of 20 characters. This splitter attempts to preserve natural boundaries (e.g., paragraphs, lines, and words). Print the first three chunks.

**Task 3: Semantic Chunking**

Use the Semantic Chunker to split the `plain_text` based on semantic meaning. You will need to use an embedding model for this splitter. Print the first three chunks.

**Task 4: Markdown Header Splitting**

Use a Markdown Splitter to split the `markdown_text` based on Markdown headers. Print the resulting chunks.

**Task 5: HTML Text Splitting**

Use an HTML header text splitter to split the `html_text` based on HTML headers. Print the resulting chunks.

Hint: Specify the headers to split on, such as `h1`, `h2`, etc.

In [None]:
# Install the packages needed for the tasks; each task imports its own splitter.

In [None]:
!pip install langchain langchain-experimental openai sentence-transformers

## LOADING TEXT FOR PROCESSING

In [None]:
plain_text = """
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.

Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
"""

markdown_text = """
# Introduction
This is the introduction section.

## Subsection 1
This is the first subsection under the introduction.

### Details
Here are some details about subsection 1.

## Subsection 2
This is the second subsection.

# Conclusion
This is the conclusion section.
"""

html_text = """
<html>
<body>
<h1>Introduction</h1>
<p>This is the introduction section.</p>
<h2>Subsection 1</h2>
<p>This is the first subsection under the introduction.</p>
<h3>Details</h3>
<p>Here are some details about subsection 1.</p>
<h2>Subsection 2</h2>
<p>This is the second subsection.</p>
<h1>Conclusion</h1>
<p>This is the conclusion section.</p>
</body>
</html>
"""

# 📝 Task 1: Splitting Text using `CharacterTextSplitter`
In this task, we use the **CharacterTextSplitter** to break a large block of text into smaller chunks.  
This helps in preparing text for downstream NLP/LLM tasks where input size is limited.  
We configure **chunk size** and **overlap** to control granularity.

In [None]:
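# `langchain.text_splitter` is the legacy import path; the splitters live in `langchain_text_splitters`.
from langchain_text_splitters import CharacterTextSplitter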

In [None]:
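# CharacterTextSplitter splits on ONE separator and merges the pieces up to chunk_size.
# With separator="\n\n" each paragraph (over 200 characters) would stay a single
# oversized chunk, so we split on spaces to get chunks of roughly 100 characters.
char_splitter = CharacterTextSplitter(
    separator=" ",
    chunk_size=100,
    chunk_overlap=20,
    length_function=len,
)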

char_chunks = char_splitter.split_text(plain_text)

print("--- First Three Chunks (Character-Based) ---")
for i, chunk in enumerate(char_chunks[:3]):
    print(f"Chunk {i+1}: \"{chunk}\" (Length: {len(chunk)})")
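
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a minimal pure-Python sketch of sliding-window chunking. It is not LangChain's implementation (which splits on a separator and then merges the pieces); the function name `fixed_size_chunks` and the sample string are made up for illustration.

```python
def fixed_size_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Slice text into windows of chunk_size characters that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


# Hypothetical sample, short enough to check by eye.
demo_text = "abcdefghijklmnopqrstuvwxyz"
demo_windows = fixed_size_chunks(demo_text, chunk_size=10, chunk_overlap=3)
for window in demo_windows:
    print(window)
```

Each window starts `chunk_size - chunk_overlap` characters after the previous one, so the last `chunk_overlap` characters of one chunk reappear at the start of the next.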

# 📝 Task 2: Recursive Splitting with `RecursiveCharacterTextSplitter`
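Here, we apply the **RecursiveCharacterTextSplitter**, which tries a list of separators in order (by default `\n\n`, `\n`, a space, and finally the empty string) and falls back to a finer separator only when a piece is still larger than the chunk size.  
This keeps paragraphs, lines, and words intact where possible, giving cleaner splits than cutting on a single separator.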

In [None]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

recursive_splitter = RecursiveCharacterTextSplitter(
    chunk_size=100,
    chunk_overlap=20,
    length_function=len,
)

recursive_chunks = recursive_splitter.split_text(plain_text)

print("--- First Three Chunks (Recursive) ---")
for i, chunk in enumerate(recursive_chunks[:3]):
    print(f"Chunk {i+1}: \"{chunk}\" (Length: {len(chunk)})")
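
The fallback through separators can be sketched in plain Python. This is a simplified illustration, assuming no overlap and a greedy merge; it is not the library's code, and `recursive_split` and the sample text are hypothetical names chosen for this example.

```python
def recursive_split(
    text: str,
    chunk_size: int,
    separators: tuple[str, ...] = ("\n\n", "\n", " ", ""),
) -> list[str]:
    """Split on the coarsest separator first; recurse with finer ones for oversized pieces."""
    if len(text) <= chunk_size:
        return [text] if text.strip() else []
    if not separators or separators[0] == "":
        # Last resort: hard cut every chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    separator, finer = separators[0], separators[1:]
    chunks: list[str] = []
    current = ""
    for piece in text.split(separator):
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging small pieces greedily
            continue
        if current:
            chunks.append(current)
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too long: retry with the finer separators.
            chunks.extend(recursive_split(piece, chunk_size, finer))
            current = ""
    if current:
        chunks.append(current)
    return [c for c in chunks if c.strip()]


# Hypothetical sample: one long paragraph and one short paragraph.
demo_paragraphs = "First paragraph has two sentences. It is fairly long.\n\nSecond one is short."
demo_pieces = recursive_split(demo_paragraphs, chunk_size=40)
for piece in demo_pieces:
    print(repr(piece), len(piece))
```

The short paragraph survives as one chunk, while the long one is broken at word boundaries because no paragraph or line break is available inside it.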

# 📝 Task 3: Semantic Chunking with `SemanticChunker`
This task splits `plain_text` by **meaning** rather than by length.  
The **SemanticChunker** embeds the sentences with an embedding model and starts a new chunk where neighbouring sentences differ most, so no chunk size is specified.

In [None]:
from langchain_experimental.text_splitter import SemanticChunker
from langchain_community.embeddings import HuggingFaceEmbeddings

# Using a free, open-source embedding model from Hugging Face
embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")

semantic_splitter = SemanticChunker(embeddings)
semantic_chunks = semantic_splitter.split_text(plain_text)

print("--- First Three Chunks (Semantic) ---")
for i, chunk in enumerate(semantic_chunks[:3]):
    print(f"Chunk {i+1}: \"{chunk}\"")
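
The idea behind semantic chunking can be illustrated without downloading a model. The sketch below is a rough approximation under loud assumptions: `toy_embed` is a bag-of-words stand-in for a real embedding model, the breakpoint is a fixed distance threshold rather than one derived from the data, and all names are hypothetical.

```python
import math
import re
from collections import Counter


def toy_embed(sentence: str) -> Counter:
    """Stand-in embedding: word counts (NOT a real embedding model)."""
    return Counter(re.findall(r"[a-z]+", sentence.lower()))


def cosine_distance(a: Counter, b: Counter) -> float:
    """1 - cosine similarity between two sparse count vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return 1.0 - dot / norm if norm else 1.0


def semantic_chunks_sketch(text: str, threshold: float) -> list[str]:
    """Group consecutive sentences; start a new chunk when neighbours are too dissimilar."""
    sentences = [s for s in re.split(r"(?<=[.?!])\s+", text.strip()) if s]
    if not sentences:
        return []
    vectors = [toy_embed(s) for s in sentences]
    groups = [[sentences[0]]]
    for previous, current, sentence in zip(vectors, vectors[1:], sentences[1:]):
        if cosine_distance(previous, current) > threshold:
            groups.append([sentence])  # topic shift: open a new chunk
        else:
            groups[-1].append(sentence)
    return [" ".join(group) for group in groups]


# Hypothetical sample with an obvious topic change in the middle.
demo_topics = (
    "Cats purr when happy. Cats sleep most of the day. "
    "Stocks fell sharply today. Stocks may recover tomorrow."
)
demo_topic_chunks = semantic_chunks_sketch(demo_topics, threshold=0.9)
for chunk in demo_topic_chunks:
    print(chunk)
```

With a real embedding model the distances reflect meaning rather than shared words, but the grouping step works the same way.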

# 📝 Task 4: Parsing Markdown with `MarkdownHeaderTextSplitter`
This task demonstrates splitting **Markdown documents** by headers like `#`, `##`, and `###`.  
The splitter groups text under each header and records the active headers as metadata, so the hierarchy is preserved for structured documents.

In [None]:
from langchain_text_splitters import MarkdownHeaderTextSplitter

headers_to_split_on = [
    ("#", "Header 1"),
    ("##", "Header 2"),
    ("###", "Header 3"),
]

markdown_splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on)
markdown_chunks = markdown_splitter.split_text(markdown_text)

print("--- All Chunks (Markdown) ---")
for chunk in markdown_chunks:
    print(f"Content: \"{chunk.page_content}\"")
    print(f"Metadata: {chunk.metadata}\n")
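
To show how header metadata can be tracked, here is a small standard-library sketch of header-based Markdown splitting. It is an illustration only, not the splitter's actual code: `split_markdown_by_headers` and the sample are hypothetical, and the sketch ignores edge cases such as `#` lines inside fenced code.

```python
import re


def split_markdown_by_headers(text: str, max_level: int = 3) -> list[tuple[str, dict[str, str]]]:
    """Return (content, metadata) pairs, where metadata lists the headers in effect."""
    sections: list[tuple[str, dict[str, str]]] = []
    active_headers: dict[int, str] = {}  # header level -> title currently in effect
    buffer: list[str] = []

    def flush() -> None:
        content = "\n".join(buffer).strip()
        if content:
            metadata = {f"Header {level}": title for level, title in sorted(active_headers.items())}
            sections.append((content, metadata))
        buffer.clear()

    for line in text.splitlines():
        match = re.match(r"^(#{1,6})\s+(.*)$", line)
        if match and len(match.group(1)) <= max_level:
            flush()  # close the section that ran up to this header
            level = len(match.group(1))
            # A new header replaces any header at the same or a deeper level.
            for stale in [lvl for lvl in active_headers if lvl >= level]:
                del active_headers[stale]
            active_headers[level] = match.group(2).strip()
        else:
            buffer.append(line)
    flush()
    return sections


# Hypothetical sample document.
demo_markdown = "# Guide\nIntro text.\n## Setup\nInstall it.\n# FAQ\nAsk here."
demo_md_sections = split_markdown_by_headers(demo_markdown)
for content, metadata in demo_md_sections:
    print(content, metadata)
```

Note how the `## Setup` entry disappears from the metadata once `# FAQ` starts: a higher-level header resets everything nested beneath the previous one.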


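# 📝 Task 5: Parsing HTML with `HTMLHeaderTextSplitter`
We now work with **HTML documents**, splitting based on header tags such as `<h1>`, `<h2>`, and `<h3>`.  
This is useful when dealing with webpages, reports, or any HTML-based content.

For comparison, the splitters used in this notebook behave as follows:
- **CharacterTextSplitter** → Simple, size-based chunks cut on a single separator.  
- **RecursiveCharacterTextSplitter** → Smarter, cleaner breaks by falling back through several separators.  
- **SemanticChunker** → Breaks where the meaning shifts, using embeddings.  
- **Markdown/HTML Splitters** → Preserve document hierarchy as metadata.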

In [None]:
from langchain_text_splitters import HTMLHeaderTextSplitter

headers_to_split_on = [
    ("h1", "Header 1"),
    ("h2", "Header 2"),
    ("h3", "Header 3"),
]

html_splitter = HTMLHeaderTextSplitter(headers_to_split_on=headers_to_split_on)
html_chunks = html_splitter.split_text(html_text)

print("--- All Chunks (HTML) ---")
for chunk in html_chunks:
    print(f"Content: \"{chunk.page_content}\"")
    print(f"Metadata: {chunk.metadata}\n")
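
The same header-tracking idea carries over to HTML. The sketch below uses only the standard library's `html.parser`; it is a simplified illustration rather than the library's approach, and the class name `HeaderSectionParser` and the sample markup are hypothetical.

```python
from html.parser import HTMLParser


class HeaderSectionParser(HTMLParser):
    """Collect (text, metadata) sections, where metadata lists the headers in effect."""

    def __init__(self, header_tags=("h1", "h2", "h3")):
        super().__init__()
        self.header_tags = header_tags
        self.sections = []
        self._active = {}        # index into header_tags -> header title
        self._in_header = None   # header tag currently open, if any
        self._header_text = []
        self._body_text = []

    def _flush(self):
        text = " ".join(" ".join(self._body_text).split())  # normalise whitespace
        if text:
            metadata = {f"Header {i + 1}": title for i, title in sorted(self._active.items())}
            self.sections.append((text, metadata))
        self._body_text = []

    def handle_starttag(self, tag, attrs):
        if tag in self.header_tags:
            self._flush()  # a header closes the section before it
            self._in_header = tag
            self._header_text = []

    def handle_endtag(self, tag):
        if tag == self._in_header:
            level = self.header_tags.index(tag)
            # Drop headers at the same or a deeper level, then record the new one.
            self._active = {i: t for i, t in self._active.items() if i < level}
            self._active[level] = "".join(self._header_text).strip()
            self._in_header = None

    def handle_data(self, data):
        (self._header_text if self._in_header else self._body_text).append(data)

    def close(self):
        super().close()
        self._flush()  # emit the final section


# Hypothetical sample markup.
demo_html = "<h1>Guide</h1><p>Intro text.</p><h2>Setup</h2><p>Install it.</p><h1>FAQ</h1><p>Ask here.</p>"
demo_parser = HeaderSectionParser()
demo_parser.feed(demo_html)
demo_parser.close()
for text, metadata in demo_parser.sections:
    print(text, metadata)
```

Passing a shorter `header_tags` tuple, for example `("h1",)`, folds deeper headings into the body text of their parent section, which mirrors the effect of choosing which headers to split on.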