# Data Science Assignment: Exploring Text Splitters in LangChain

**Introduction**

This assignment focuses on text-splitting strategies using the LangChain framework. Text splitters are essential for dividing large texts into smaller, manageable chunks for processing with language models. In this assignment, you will use several built-in LangChain splitters to split sample texts and analyze the results.

You will work with the following splitters:
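- `CharacterTextSplitter`: splits text based on a fixed character count.
- `RecursiveCharacterTextSplitter`: splits text recursively, preserving semantic boundaries like sentences where possible.
- `SemanticChunker`: splits text based on semantic meaning (requires an embedding model).
- `MarkdownHeaderTextSplitter`: splits Markdown text based on headers.
- `HTMLHeaderTextSplitter`: splits HTML text based on headers.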

**Sample Texts**

Below are sample texts for each splitter:

- **Plain Text**

In [32]:
plain_text = """
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.

Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
"""

- **Markdown Text**

In [33]:
markdown_text = """
# Introduction
This is the introduction section.

## Subsection 1
This is the first subsection under the introduction.

### Details
Here are some details about subsection 1.

## Subsection 2
This is the second subsection.

# Conclusion
This is the conclusion section.
"""

- **HTML Text**

In [34]:
html_text = """
<html>
<body>
<h1>Introduction</h1>
<p>This is the introduction section.</p>
<h2>Subsection 1</h2>
<p>This is the first subsection under the introduction.</p>
<h3>Details</h3>
<p>Here are some details about subsection 1.</p>
<h2>Subsection 2</h2>
<p>This is the second subsection.</p>
<h1>Conclusion</h1>
<p>This is the conclusion section.</p>
</body>
</html>
"""

**Task 1: Character-Based Splitting**

Use a character-based splitter to split `plain_text` into chunks of approximately 100 characters, with a chunk overlap of 20 characters. Print the first three chunks.

**Task 2: Recursive Character Splitting**

Use the recursive character splitter to split `plain_text` into chunks of approximately 100 characters, with a chunk overlap of 20 characters. This splitter attempts to preserve natural boundaries by trying paragraph, line, and word breaks in turn. Print the first three chunks.

**Task 3: Semantic Chunking**

Use the Semantic Chunker to split the `plain_text` based on semantic meaning. You will need to use an embedding model for this splitter. Print the first three chunks.

**Task 4: Markdown Header Splitting**

Use a Markdown Splitter to split the `markdown_text` based on Markdown headers. Print the resulting chunks.

**Task 5: HTML Text Splitting**

Use an HTML header splitter to split `html_text` based on HTML headers. Print the resulting chunks.

Hint: Specify the headers to split on, such as `h1`, `h2`, etc.

In [35]:
# Task 1: Character-Based Splitting
from langchain.text_splitter import CharacterTextSplitter

# Split on single spaces so that chunks break between words.
char_splitter = CharacterTextSplitter(separator=" ", chunk_size=100, chunk_overlap=20)

char_chunks = char_splitter.split_text(plain_text)

print("Character Based Chunks (first 3):\n")
for i, chunk in enumerate(char_chunks[:3]):
    print(f'Chunk {i}:\n{chunk}\n')

Character Based Chunks (first 3):

Chunk 0:
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Sed do eiusmod tempor incididunt ut labore

Chunk 1:
incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation

Chunk 2:
nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.

Duis aute irure dolor
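
The overlap visible in the output above (for example, `incididunt ut labore` ends Chunk 0 and begins Chunk 1) comes from carrying the tail of one chunk into the next. The following is a minimal pure-Python sketch of that idea, assuming a single-space separator. `split_with_overlap` is a hypothetical helper written for illustration; it is not LangChain's implementation, and its edge-case behavior may differ from `CharacterTextSplitter`.

```python
def split_with_overlap(text, chunk_size=100, chunk_overlap=20, separator=" "):
    """Greedily pack separator-delimited pieces into chunks of at most
    chunk_size characters, carrying up to chunk_overlap trailing
    characters of each chunk into the next one."""
    pieces = [p for p in text.split(separator) if p]
    chunks, current = [], []

    def length(parts):
        return len(separator.join(parts))

    for piece in pieces:
        if current and length(current + [piece]) > chunk_size:
            chunks.append(separator.join(current))
            # Drop pieces from the front until the carried-over tail fits the
            # overlap budget and leaves room for the incoming piece.
            while current and (length(current) > chunk_overlap
                               or length(current + [piece]) > chunk_size):
                current.pop(0)
        current.append(piece)
    if current:
        chunks.append(separator.join(current))
    return chunks


overlap_sample = (
    "Lorem ipsum dolor sit amet, consectetur adipiscing elit. "
    "Sed do eiusmod tempor incididunt ut labore et dolore magna aliqua."
)
overlap_chunks = split_with_overlap(overlap_sample, chunk_size=50, chunk_overlap=15)
for i, chunk in enumerate(overlap_chunks):
    print(f"Chunk {i} ({len(chunk)} chars): {chunk}")
```

Each chunk after the first starts with words repeated from the end of the previous chunk, which is what keeps context from being lost at a chunk boundary.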



In [None]:
# Task 2: Recursive Character Splitting
from langchain.text_splitter import RecursiveCharacterTextSplitter

rec_splitter = RecursiveCharacterTextSplitter(chunk_size=100, chunk_overlap=20)
recur_chunks = rec_splitter.split_text(plain_text)

print("Recursive Chunks (first 3):\n")
for i, chunk in enumerate(recur_chunks[:3]):
    print(f'Chunk {i}:\n{chunk}\n')


Recursive Chunks (first 3):

Chunk 0:
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Sed do eiusmod tempor incididunt ut labore

Chunk 1:
ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco

Chunk 2:
ullamco laboris nisi ut aliquip ex ea commodo consequat.
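
Unlike the Task 1 output, Chunk 2 here stops at the paragraph break instead of running into `Duis aute irure dolor`. That follows from the recursive strategy: try a coarse separator first and fall back to finer ones only for pieces that are still too long. The sketch below illustrates the idea under simplifying assumptions (no overlap, and a separator order of paragraph, line, word, character). `recursive_split` is a hypothetical helper, not the library's code.

```python
def recursive_split(text, chunk_size=100, separators=("\n\n", "\n", " ", "")):
    """Split text into chunks of at most chunk_size characters, preferring
    coarse separators (paragraphs) before finer ones (words, characters)."""
    text = text.strip()
    if len(text) <= chunk_size or not separators:
        return [text] if text else []
    sep, rest = separators[0], separators[1:]
    if sep == "":
        # Last resort: hard cut every chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    if sep not in text:
        return recursive_split(text, chunk_size, rest)

    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep packing pieces into the current chunk
            continue
        if current:
            chunks.append(current)
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too long: recurse with the finer separators.
            chunks.extend(recursive_split(piece, chunk_size, rest))
            current = ""
    if current:
        chunks.append(current)
    return [c.strip() for c in chunks if c.strip()]


recursive_sample = (
    "Short first paragraph.\n\n"
    "A second paragraph that is long enough to need splitting on spaces instead."
)
recursive_chunks = recursive_split(recursive_sample, chunk_size=40)
for i, chunk in enumerate(recursive_chunks):
    print(f"Chunk {i}: {chunk!r}")
```

The short first paragraph survives as its own chunk, while only the long second paragraph is broken further at word boundaries.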



In [37]:
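# Task 3: Semantic Chunking
# Note: this cell approximates semantic chunking by clustering sentence
# embeddings with KMeans; it does not use LangChain's SemanticChunker.

from sklearn.cluster import KMeans
from nltk.tokenize import sent_tokenize  # requires NLTK's Punkt tokenizer data
from sentence_transformers import SentenceTransformer

# Split the text into sentences and embed each one.
sentences = sent_tokenize(plain_text)

model = SentenceTransformer('all-MiniLM-L6-v2')
embeddings = model.encode(sentences)

# Group the sentence embeddings into a fixed number of clusters.
num_chunks = 3
kmeans = KMeans(n_clusters=num_chunks, random_state=0)
labels = kmeans.fit_predict(embeddings)

# Collect the sentences of each cluster, keeping their original order.
clusters = {i: [] for i in range(num_chunks)}
for sentence, label in zip(sentences, labels):
    clusters[label].append(sentence)

print("Semantic Chunks (first 3):\n")
for i in range(num_chunks):
    chunk = ' '.join(clusters[i])
    print(f"Chunk {i+1}:\n{chunk}\n")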

Semantic Chunks (first 3):

Chunk 1:

Lorem ipsum dolor sit amet, consectetur adipiscing elit. Sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.

Chunk 2:
Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur.

Chunk 3:
Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
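
The cell above groups sentences by clustering, which ignores sentence order, so a cluster could in principle mix sentences that are not adjacent in the text. Breakpoint-based semantic chunking takes a different route: compare each sentence with the next one and start a new chunk wherever the embedding distance is large. The sketch below shows that idea using a toy bag-of-words vector in place of a real embedding model. The names `embed`, `cosine_distance`, and `semantic_chunks` and the fixed `threshold` are assumptions made for illustration.

```python
import math
import re
from collections import Counter


def embed(sentence):
    """Toy bag-of-words 'embedding' (an assumption for this sketch;
    a real pipeline would call an embedding model instead)."""
    return Counter(re.findall(r"[a-z]+", sentence.lower()))


def cosine_distance(a, b):
    dot = sum(a[w] * b[w] for w in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return 1.0 - (dot / norm if norm else 0.0)


def semantic_chunks(sentences, threshold=0.8):
    """Start a new chunk whenever two consecutive sentences are
    farther apart than threshold."""
    if not sentences:
        return []
    vectors = [embed(s) for s in sentences]
    chunks, current = [], [sentences[0]]
    for prev, vec, sentence in zip(vectors, vectors[1:], sentences[1:]):
        if cosine_distance(prev, vec) > threshold:
            chunks.append(" ".join(current))
            current = []
        current.append(sentence)
    chunks.append(" ".join(current))
    return chunks


semantic_sample = [
    "Cats chase mice.",
    "Mice fear cats.",
    "Stocks fell sharply today.",
    "Investors sold stocks today.",
]
semantic_demo = semantic_chunks(semantic_sample)
for i, chunk in enumerate(semantic_demo, start=1):
    print(f"Chunk {i}: {chunk}")
```

Because only neighboring sentences are compared, every chunk is a contiguous span of the original text, which clustering does not guarantee.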



In [38]:
# Task 4: Markdown Header Splitting

from langchain.text_splitter import MarkdownHeaderTextSplitter

headers_to_split_on = [
    ("#", "Header1"),
    ("##", "Header2"),
    ("###", "Header3"),
]

markdown_splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on)

md_chunks = markdown_splitter.split_text(markdown_text)

print("Markdown Header Chunks:\n")
for i, chunk in enumerate(md_chunks):
    print(f"Chunk {i+1}:\n{chunk.page_content}\n")

Markdown Header Chunks:

Chunk 1:
This is the introduction section.

Chunk 2:
This is the first subsection under the introduction.

Chunk 3:
Here are some details about subsection 1.

Chunk 4:
This is the second subsection.

Chunk 5:
This is the conclusion section.
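
The output above shows only the text of each chunk, so the heading that each chunk belongs to is not visible. A header-based splitter is most useful when every chunk also records its active headings. The sketch below shows one way to track that context in plain Python. `split_markdown_by_headers` is a hypothetical helper, and it deliberately ignores complications such as `#` lines inside fenced code blocks.

```python
import re


def split_markdown_by_headers(text, max_level=3):
    """Return (headers, content) pairs, where headers maps a heading level
    to the heading text that is active for that content."""
    sections, active, body = [], {}, []

    def flush():
        content = "\n".join(body).strip()
        if content:
            sections.append((dict(active), content))
        body.clear()

    for line in text.splitlines():
        match = re.match(r"^(#{1,6})\s+(.*)$", line)
        if match and len(match.group(1)) <= max_level:
            flush()
            level = len(match.group(1))
            # A new heading replaces its own level and invalidates deeper ones.
            for stale in [k for k in active if k >= level]:
                del active[stale]
            active[level] = match.group(2).strip()
        else:
            body.append(line)
    flush()
    return sections


md_sample = (
    "# Introduction\nIntro text.\n\n"
    "## Subsection 1\nFirst subsection.\n\n"
    "# Conclusion\nClosing text."
)
md_sections = split_markdown_by_headers(md_sample)
for headers, content in md_sections:
    print(headers, "->", content)
```

Keeping the heading path alongside the text lets a later retrieval step tell apart chunks that read similarly but sit under different sections.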



In [42]:
# Task 5: HTML Header Splitting

from langchain.text_splitter import HTMLHeaderTextSplitter

headers_to_split_on = [("h1", "Header 1"), ("h2", "Header 2"), ("h3", "Header 3")]

html_splitter = HTMLHeaderTextSplitter(headers_to_split_on=headers_to_split_on)
html_chunks = html_splitter.split_text(html_text)

print("HTML Header Chunks:\n")
for i, chunk in enumerate(html_chunks):
    print(f"Chunk {i+1}:\n{chunk.page_content}\n")

HTML Header Chunks:

Chunk 1:
Introduction

Chunk 2:
This is the introduction section.

Chunk 3:
Subsection 1

Chunk 4:
This is the first subsection under the introduction.

Chunk 5:
Details

Chunk 6:
Here are some details about subsection 1.

Chunk 7:
Subsection 2

Chunk 8:
This is the second subsection.

Chunk 9:
Conclusion

Chunk 10:
This is the conclusion section.
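
In the output above, each heading and the paragraph that follows it appear as separate chunks. For many uses it is handier to pair a heading with its text. The sketch below does that with Python's standard `html.parser` module. `HeaderSplitter` is a hypothetical class written for illustration, not part of LangChain, and it drops any text that appears before the first heading.

```python
from html.parser import HTMLParser


class HeaderSplitter(HTMLParser):
    """Collect the text that follows each heading tag listed in split_tags."""

    def __init__(self, split_tags=("h1", "h2", "h3")):
        super().__init__()
        self.split_tags = set(split_tags)
        self.sections = []  # list of [heading, text] pairs
        self._in_heading = False

    def handle_starttag(self, tag, attrs):
        if tag in self.split_tags:
            self._in_heading = True
            self.sections.append(["", ""])

    def handle_endtag(self, tag):
        if tag in self.split_tags:
            self._in_heading = False

    def handle_data(self, data):
        text = data.strip()
        if not text or not self.sections:
            return  # skip whitespace and any text before the first heading
        slot = 0 if self._in_heading else 1
        current = self.sections[-1]
        current[slot] = (current[slot] + " " + text).strip()


html_sample = (
    "<html><body>"
    "<h1>Introduction</h1><p>Intro text.</p>"
    "<h2>Subsection 1</h2><p>First subsection.</p>"
    "</body></html>"
)
html_parser = HeaderSplitter()
html_parser.feed(html_sample)
html_parser.close()
html_sections = html_parser.sections
for heading, text in html_sections:
    print(f"{heading}: {text}")
```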

