## Setup and Import Libraries

In [1]:
from langchain_community.document_loaders import TextLoader, DirectoryLoader
from langchain_text_splitters import (
    RecursiveCharacterTextSplitter,
    CharacterTextSplitter,
    TokenTextSplitter
)

## Document Loading

In [2]:
loader = TextLoader("data/text_files/machine_learning.txt", encoding="utf-8")
documents = loader.load()

In [3]:
text = documents[0].page_content
text

'Machine Learning (ML) is a subset of Artificial Intelligence (AI) that focuses on building systems that can learn from data and \nimprove performance over time without being explicitly programmed. Instead of writing step-by-step instructions, ML models use patterns in data to make predictions, \nclassifications, or decisions.\n\nKey Concepts\n\n- Data: The foundation of ML—models learn from examples (structured data like tables or unstructured data like text, images, audio).\n- Features & Labels: Features are input variables; labels are the outcomes you want to predict.\n- Model: A mathematical representation that maps input data to predictions.\n- Training & Testing: Training teaches the model patterns; testing evaluates performance on unseen data.\n\nSupervised vs. Unsupervised Learning:\n- Supervised: Learns from labeled data (e.g., predicting house prices).\n- Unsupervised: Finds hidden patterns in unlabeled data (e.g., customer segmentation).\n- Reinforcement Learning: Models lea

## Text Splitter

### 1. Character Text Splitter

In [4]:
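character_text_splitter = CharacterTextSplitter(
    separator="\n",       # split on single newlines only
    chunk_size=200,       # maximum chunk length, as measured by length_function
    chunk_overlap=20,     # target overlap between consecutive chunks
    length_function=len   # measure length in characters
)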

In [5]:
chunks = character_text_splitter.split_text(text=text)

print(f"Created {len(chunks)} Chunks")
print(f"First Chunk: {chunks[0][:50]}")

Created 7 Chunks
First Chunk: Machine Learning (ML) is a subset of Artificial In


In [6]:
print(chunks[0])
print("--------------------")
print(chunks[1])
print("--------------------")
print(chunks[2])

Machine Learning (ML) is a subset of Artificial Intelligence (AI) that focuses on building systems that can learn from data and
--------------------
improve performance over time without being explicitly programmed. Instead of writing step-by-step instructions, ML models use patterns in data to make predictions, 
classifications, or decisions.
--------------------
Key Concepts
- Data: The foundation of ML—models learn from examples (structured data like tables or unstructured data like text, images, audio).
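
To make the roles of `separator`, `chunk_size`, and `chunk_overlap` more concrete, here is a minimal pure-Python sketch of the split-then-merge idea. It is a simplified illustration under stated assumptions, not LangChain's actual implementation, and the helper name `split_on_separator` is hypothetical. In this sketch the overlap is built from whole pieces only, which would be consistent with the output above: consecutive chunks share no text, and the trailing line of each displayed chunk is longer than 20 characters.

```python
def split_on_separator(text, separator="\n", chunk_size=200, chunk_overlap=20):
    """Split on `separator`, then greedily merge pieces into chunks.

    Simplified illustration only (hypothetical helper, not LangChain code).
    A single piece longer than `chunk_size` is emitted unchanged.
    """
    pieces = [p for p in text.split(separator) if p]  # drop empty pieces
    chunks, current = [], []

    for piece in pieces:
        candidate = separator.join(current + [piece])
        if current and len(candidate) > chunk_size:
            # The current chunk is full: emit it.
            chunks.append(separator.join(current))

            # Carry over trailing whole pieces that fit in the overlap budget.
            kept = []
            for prev in reversed(current):
                if len(separator.join([prev] + kept)) > chunk_overlap:
                    break
                kept.insert(0, prev)

            # Drop carried pieces that would push the next chunk over the limit.
            while kept and len(separator.join(kept + [piece])) > chunk_size:
                kept.pop(0)
            current = kept
        current.append(piece)

    if current:
        chunks.append(separator.join(current))
    return chunks


sample = "aaaa\nbbbb\ncccc\ndddd"
with_overlap = split_on_separator(sample, chunk_size=10, chunk_overlap=4)
without_overlap = split_on_separator(sample, chunk_size=10, chunk_overlap=0)
print(with_overlap)
print(without_overlap)
```

With a 4-character overlap budget, each 4-character line can be carried into the next chunk; with a budget of 0, nothing is carried and the chunks are disjoint.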


### 2. Recursive Character Text Splitter

In [7]:
recursive_splitter = RecursiveCharacterTextSplitter(
    # Separators are tried in order: paragraphs, lines, words, then single characters
    separators=["\n\n", "\n", " ", ""],
    chunk_size=200,
    chunk_overlap=20,
    length_function=len
)

In [8]:
recursive_chunks = recursive_splitter.split_text(text=text)

print(f"Created {len(recursive_chunks)} Chunks")
print(f"First Chunk: {recursive_chunks[0][:50]}")

Created 8 Chunks
First Chunk: Machine Learning (ML) is a subset of Artificial In


In [9]:
print(recursive_chunks[0])
print("--------------------")
print(recursive_chunks[1])
print("--------------------")
print(recursive_chunks[2])

Machine Learning (ML) is a subset of Artificial Intelligence (AI) that focuses on building systems that can learn from data and
--------------------
improve performance over time without being explicitly programmed. Instead of writing step-by-step instructions, ML models use patterns in data to make predictions, 
classifications, or decisions.
--------------------
Key Concepts
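
The fallback through the `separators` list can be sketched in plain Python. This is a simplified illustration that omits overlap handling and is not LangChain's implementation; `recursive_split` is a hypothetical name. The idea is to split on the coarsest separator first and recurse with finer separators only for pieces that are still too long. Under that assumption, a short paragraph such as "Key Concepts" is likely to end up as its own chunk when the paragraph that follows it is too long to merge with, which matches the third chunk printed above.

```python
def recursive_split(text, separators=("\n\n", "\n", " ", ""), chunk_size=200):
    """Split with the coarsest separator first, recursing on oversized pieces.

    Simplified illustration only: no overlap, hypothetical helper name.
    """
    if len(text) <= chunk_size:
        return [text] if text else []

    # No separators left, or the empty separator: cut at fixed character offsets.
    if not separators or separators[0] == "":
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, rest = separators[0], separators[1:]
    if sep not in text:
        return recursive_split(text, rest, chunk_size)

    chunks, current = [], ""
    for piece in text.split(sep):
        if not piece:
            continue
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging small pieces
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # Piece is still too long: retry with the finer separators.
            chunks.extend(recursive_split(piece, rest, chunk_size))
    if current:
        chunks.append(current)
    return chunks


demo = "Para one line a\nPara one line b\n\nKey\n\nword1 word2 word3 word4"
demo_chunks = recursive_split(demo, chunk_size=20)
for chunk in demo_chunks:
    print(repr(chunk))
```

In the demo, the first paragraph is broken at the newline, "Key" stays on its own, and the last paragraph falls back to splitting on spaces.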


### 3. Token Text Splitter

In [10]:
# chunk_size and chunk_overlap are counted in tokens here, not characters
token_splitter = TokenTextSplitter(
    chunk_size=200,
    chunk_overlap=20,
)

In [11]:
token_chunks = token_splitter.split_text(text=text)

print(f"Created {len(token_chunks)} Chunks")
print(f"First Chunk: {token_chunks[0][:50]}")

Created 2 Chunks
First Chunk: Machine Learning (ML) is a subset of Artificial In


In [12]:
print(token_chunks[0])
print("--------------------")
print(token_chunks[1])

Machine Learning (ML) is a subset of Artificial Intelligence (AI) that focuses on building systems that can learn from data and 
improve performance over time without being explicitly programmed. Instead of writing step-by-step instructions, ML models use patterns in data to make predictions, 
classifications, or decisions.

Key Concepts

- Data: The foundation of ML—models learn from examples (structured data like tables or unstructured data like text, images, audio).
- Features & Labels: Features are input variables; labels are the outcomes you want to predict.
- Model: A mathematical representation that maps input data to predictions.
- Training & Testing: Training teaches the model patterns; testing evaluates performance on unseen data.

Supervised vs. Unsupervised Learning:
- Supervised: Learns from labeled data (e.g., predicting house prices).
- Unsupervised: Finds hidden patterns in unlabeled data (e
--------------------
 house prices).
- Unsupervised: Finds hidden patterns in u
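
The token-based result above has far fewer chunks because 200 tokens cover much more text than 200 characters. The sliding-window idea behind it can be sketched as follows. This is an illustration under stated assumptions: `toy_encode` and `toy_decode` are hypothetical stand-ins for a real tokenizer (`TokenTextSplitter` relies on the `tiktoken` package by default), so the token boundaries here will not match the notebook's. The windowing logic is the point: take `chunk_size` tokens, then advance by `chunk_size - chunk_overlap`.

```python
import re


def toy_encode(text):
    """Stand-in tokenizer: words, whitespace runs, and punctuation marks."""
    return re.findall(r"\w+|\s+|[^\w\s]", text)


def toy_decode(tokens):
    """Inverse of toy_encode: concatenating the tokens restores the text."""
    return "".join(tokens)


def split_on_tokens(text, chunk_size=200, chunk_overlap=20):
    """Slide a window of `chunk_size` tokens, stepping by size minus overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    tokens = toy_encode(text)
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(toy_decode(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the text
    return chunks


toy_chunks = split_on_tokens("one two three four five", chunk_size=5, chunk_overlap=2)
for chunk in toy_chunks:
    print(repr(chunk))
```

Because the window is cut purely by token position, a chunk can begin or end in the middle of a sentence, as the second chunk in the output above does.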

## Comparison

In [13]:
print("\n📊 Text Splitting Methods Comparison")
print("\nCharacter Text Splitter")
print("  ✅ Simple and Predictable")
print("  ✅ Good for Structured Text")
print("  ❌ May Break Mid-Sentence")
print("  Use When: Text Has Clear Delimiters")

print("\nRecursive Character Text Splitter")
print("  ✅ Respects Text Structure")
print("  ✅ Tries Multiple Separators")
print("  ✅ Best General-Purpose Splitter")
print("  ❌ Slightly More Complex")
print("  Use When: Default Choice for Most Texts")

print("\nToken Text Splitter")
print("  ✅ Respects Model Token Limits")
print("  ✅ Chunk Sizes Are Measured in Model Tokens")
print("  ❌ Slower than Character-Based Splitting")
print("  Use When: Working with Token-Limited Models")


📊 Text Splitting Methods Comparison

Character Text Splitter
  ✅ Simple and Predictable
  ✅ Good for Structured Text
  ❌ May Break Mid-Sentence
  Use When: Text Has Clear Delimiters

Recursive Character Text Splitter
  ✅ Respects Text Structure
  ✅ Tries Multiple Separators
  ✅ Best General-Purpose Splitter
  ❌ Slightly More Complex
  Use When: Default Choice for Most Texts

Token Text Splitter
  ✅ Respects Model Token Limits
  ✅ Chunk Sizes Are Measured in Model Tokens
  ❌ Slower than Character-Based Splitting
  Use When: Working with Token-Limited Models
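
The qualitative comparison can be backed with simple numbers. The helper below is a small sketch (the name `chunk_stats` is hypothetical) that summarizes any list of chunks by count and by minimum, maximum, and mean length. In this notebook it could be applied to `chunks`, `recursive_chunks`, and `token_chunks`; the demo uses a literal list so the block runs on its own.

```python
from statistics import mean


def chunk_stats(chunks, length_function=len):
    """Summarize chunk lengths: count, min, max, and mean (one decimal)."""
    lengths = [length_function(c) for c in chunks]
    if not lengths:
        return {"count": 0, "min": 0, "max": 0, "mean": 0.0}
    return {
        "count": len(lengths),
        "min": min(lengths),
        "max": max(lengths),
        "mean": round(mean(lengths), 1),
    }


# Demo on a literal list; replace with chunks, recursive_chunks, or token_chunks.
demo_stats = chunk_stats(["ab", "abcd", "abcdef"])
print(demo_stats)
```

Passing a different `length_function` (for example, one that counts tokens) would let the same helper compare splitters in the unit that matters for the target model.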
