# **3-splitters**

## **🤖 Introduction**
- **Splitters** help **divide** large documents into **smaller, more manageable parts**, making it easier for an LLM to process text without hitting token limits or losing context.

## **⚙️ Setup**
1. **Clone or Download** the GitHub repository to your machine.
2. In **terminal**:
   ```
   cd project_name
   pyenv local 3.11.4
   poetry install
   poetry shell
   ```
3. Launch **Jupyter Lab**:
   ```
   jupyter lab
   ```
   - Open the **`002-splitters.ipynb`** notebook.
4. **View Code** in your editor of choice (e.g., VS Code):
   - Locate and open **`002-splitters.py`**.

---

## **🔐 Create Your `.env` File**
- **`.env.example`** is included; rename it to **`.env`**.
- Add the following keys:
  ```
  OPENAI_API_KEY=your_openai_api_key
  LANGCHAIN_TRACING_V2=true
  LANGCHAIN_ENDPOINT=https://api.smith.langchain.com
  LANGCHAIN_API_KEY=your_langchain_api_key
  LANGCHAIN_PROJECT=your_project_name
  ```
- This project is named **`002-splitters`** in **LangSmith**.
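
A quick way to confirm the keys above are actually set is sketched below. It is a minimal check, not part of the project: `REQUIRED_KEYS` and `missing_keys` are hypothetical names, and the example dictionary stands in for `os.environ` after `load_dotenv()` has run.

```python
import os

# Keys listed in the .env section above
REQUIRED_KEYS = [
    "OPENAI_API_KEY",
    "LANGCHAIN_TRACING_V2",
    "LANGCHAIN_ENDPOINT",
    "LANGCHAIN_API_KEY",
    "LANGCHAIN_PROJECT",
]


def missing_keys(environ, required=REQUIRED_KEYS):
    """Return the required keys that are absent or empty in `environ`."""
    return [key for key in required if not environ.get(key)]


# Stand-in for a partially filled environment (hypothetical values)
example_env = {"OPENAI_API_KEY": "your_openai_api_key", "LANGCHAIN_TRACING_V2": "true"}
print(missing_keys(example_env))

# In the notebook you would check the real environment instead:
# print(missing_keys(os.environ))
```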

---

## **📊 Track Operations**
- **Monitor** usage and costs for this project in **LangSmith**:
  ```
  smith.langchain.com
  ```

> **💡 Note**: Splitting large documents into smaller segments often **improves** LLM performance, as each chunk is easier to process and analyze.


# 📌 Notebook Summary & Objective  
This Jupyter Notebook focuses on **text splitting techniques** using LangChain's **splitters**. It provides an in-depth look at how to divide large data assets into smaller, manageable parts, which is essential for **Natural Language Processing (NLP)** and **LLM-based applications**.

---

## 🔍 Notebook Overview  
- **Total Cells:** 49  
- **Code Cells:** 24  
- **Markdown Cells:** 25  

---

## 📖 Key Sections in the Notebook  

### **1️⃣ Introduction to Splitters**  
   - Explains the importance of splitting large texts into smaller chunks.

### **2️⃣ Setup & Installation**  
   - Installs required libraries like `python-dotenv` and `langchain`.  
   - Sets up OpenAI API keys using environment variables.

### **3️⃣ Using LangChain Splitters**  
   - Demonstrates different methods of text splitting.  
   - Uses `CharacterTextSplitter` and `RecursiveCharacterTextSplitter`.

### **4️⃣ Practical Examples**  
   - Splits text using different strategies, showing **how chunk size and overlap affect results**.

### **5️⃣ Comparison of Different Splitters**  
   - Contrasts **how the different splitters chunk text**.

---



## Connect with the .env file located in the same directory of this notebook

If you are using the pre-loaded poetry shell, you do not need to install the following package because it is already pre-loaded for you:

In [None]:
#pip install python-dotenv

In [None]:
import os
from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv())
openai_api_key = os.environ["OPENAI_API_KEY"]

#### Install LangChain

If you are using the pre-loaded poetry shell, you do not need to install the following package because it is already pre-loaded for you:

In [None]:
#!pip install langchain

## Connect with an LLM

If you are using the pre-loaded poetry shell, you do not need to install the following package because it is already pre-loaded for you:

In [None]:
#!pip install langchain-openai

* NOTE: We use OpenAI by default because, at the time of writing, we consider it the best LLM on the market. A later lesson shows how to connect to open-source LLMs such as Llama 3 or Mistral.

# **Character Splitter in RAG**

### **📚 Why We Use RAG (Retrieval-Augmented Generation)**
When dealing with large documents, passing the entire text to an LLM often **exceeds** the model’s context window. **RAG** works around this in five steps:
1. **Split** the document into **small chunks** (so each chunk can fit into the LLM context).  
2. **Transform** these text chunks into **numeric embeddings**.  
3. **Store** those embeddings in a **vector database** (vector store).  
4. **Retrieve** the most relevant chunks when a user asks a question.  
5. **Send** the text of the retrieved chunks, together with the question, to the LLM so it can generate a final, **context-rich** response.
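
The five steps above can be sketched end to end in plain Python. This is a toy illustration under loud assumptions: `embed` is a word-count stand-in for a real embedding model, `vector_store` is just a list, and no LLM is called (the final prompt is only printed). None of these names come from LangChain.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy embedding: lowercase word counts (stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[word] * b[word] for word in a if word in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


document = (
    "Hello, welcome to our store!\n\n"
    "We offer a variety of products ranging from electronics to clothing.\n\n"
    "Our store hours are 9 AM to 9 PM every day."
)

chunks = document.split("\n\n")                           # 1. split
vector_store = [(embed(chunk), chunk) for chunk in chunks]  # 2-3. embed and store


def retrieve(question, k=1):
    """4. Return the k chunks most similar to the question."""
    query = embed(question)
    ranked = sorted(vector_store, key=lambda item: cosine(query, item[0]), reverse=True)
    return [text for _, text in ranked[:k]]


question = "What are the store hours?"
context = retrieve(question)

# 5. The retrieved text (not the embedding) is what goes into the prompt
prompt = f"Answer using this context:\n{context[0]}\n\nQuestion: {question}"
print(prompt)
```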

### **✂️ Splitters (Document Transformers)**
- **Role**: They divide a loaded document into **manageable** segments of text.
- **Name**: Sometimes called **“Document Transformers”** because they transform the raw document into smaller parts.
- **Built-In Splitters**: LangChain provides a variety of splitters to handle different data structures. For more details, refer to:
  - [Documentation Page](https://python.langchain.com/docs/modules/data_connection/document_transformers/text_splitter)
  - [List of Built-In Splitters](https://python.langchain.com/docs/modules/data_connection/document_transformers)

### **💡 Character Splitter**
- **Default Delimiter**: `"\n\n"` (two consecutive newline characters).  
- **Chunk Measurement**: Based on the **number of characters** in each segment.  
- **Purpose**: Break large blocks of text into **smaller pieces**, improving the **efficiency** and **accuracy** of retrieval.

#### **How It Works**
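1. **Scan** the text for `"\n\n"` delimiters.  
2. **Split** the text at these delimiters, producing discrete pieces.  
3. **Merge** adjacent pieces back together, measuring length in characters, for as long as the result stays within your desired size limit. A single piece that is longer than the limit is kept whole, so a chunk can still exceed the limit.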
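
A simplified pure-Python sketch of this split-then-merge behaviour follows. It is not LangChain's implementation (it ignores overlap, for instance), and `character_split` is a hypothetical name used only for illustration.

```python
def character_split(text, separator="\n\n", chunk_size=60):
    """Split on `separator`, then greedily re-join pieces up to `chunk_size` characters."""
    pieces = [piece for piece in text.split(separator) if piece]
    chunks, current = [], ""
    for piece in pieces:
        candidate = piece if not current else current + separator + piece
        if current and len(candidate) > chunk_size:
            # Adding this piece would overflow: close the current chunk
            chunks.append(current)
            current = piece
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


text = "Short intro.\n\nSecond short paragraph.\n\n" + "x" * 80
chunks = character_split(text, chunk_size=60)
for chunk in chunks:
    print(len(chunk), repr(chunk[:40]))
```

The two short paragraphs are merged into one chunk, while the 80-character paragraph stays whole even though it is over the limit, because there is no separator inside it to split on.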

### **🔎 Advantages of the Character Splitter**
- **Efficient Retrieval**: Smaller chunks make it easier for the system to find the **most relevant** section during the retrieval step of RAG.  
- **Enhanced Context Control**: By splitting paragraphs or sections at `"\n\n"`, each chunk remains a **cohesive** unit of meaning (e.g., a paragraph).  
- **Easy Customization**: If your text doesn’t use double newlines, you can specify a different character or adjust chunk sizes.

### **🚀 Example in RAG**
Imagine you have a lengthy research paper with multiple paragraphs separated by blank lines:
- **Before**: One huge string that might exceed the LLM’s token limit.  
- **After**: Multiple smaller chunks (paragraph by paragraph) that you can individually embed and store.  
- **Query**: When the user asks a question, only the **relevant** paragraph-chunks are retrieved from the vector store and provided to the LLM, leading to a **precise** and **contextual** answer.

### **🤔 Key Takeaways**
- **Chunking** is crucial for large texts. **Character Splitter** is the simplest method, relying on a single delimiter (`"\n\n"` by default).  
- This approach **reduces** token usage, making RAG pipelines more **scalable**.  
- If your document doesn’t naturally separate paragraphs with `"\n\n"`, you can change the delimiter or use a more advanced splitter (like **Recursive Character Splitter**, which falls back from paragraphs to lines to words until the chunks are small enough).

> **💼 Final Note**: Proper splitting is the foundation of a successful RAG strategy—without well-structured chunks, retrieval can become **less accurate** and hamper the LLM’s ability to provide reliable answers.


Here's a simple example to illustrate how the "Character Splitter" works in the context of RAG applications using the default delimiter ("\n\n").

#### Original Text:
```
Hello, welcome to our store!

We offer a variety of products ranging from electronics to clothing.

Our store hours are 9 AM to 9 PM every day.

Feel free to ask for assistance if you need help finding anything.
```

#### After Applying Character Splitter:
1. **Chunk 1:**
   ```
   Hello, welcome to our store!
   ```

2. **Chunk 2:**
   ```
   We offer a variety of products ranging from electronics to clothing.
   ```

3. **Chunk 3:**
   ```
   Our store hours are 9 AM to 9 PM every day.
   ```

4. **Chunk 4:**
   ```
   Feel free to ask for assistance if you need help finding anything.
   ```

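In this example, the text is split into four chunks based on the presence of "\n\n" between sections of text. This assumes a size limit small enough that neighboring paragraphs are not merged back together; with a large `chunk_size`, several paragraphs can land in the same chunk. Each chunk is a manageable size and clearly separated from the others, making it easier for a RAG system to handle and retrieve information from specific parts of the text as needed.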

In [None]:
from langchain_community.document_loaders import TextLoader

loader = TextLoader("./data/be-good.txt")

loaded_data = loader.load()

In [None]:
#loaded_data

In [None]:
#loaded_data[0].page_content

In [None]:
from langchain_text_splitters import CharacterTextSplitter

text_splitter = CharacterTextSplitter(
    separator="\n\n",
    chunk_size=1000,
    chunk_overlap=200,
    length_function=len,
    is_separator_regex=False,
)

# 📌 Code Explanation: Splitting Text into Chunks using `CharacterTextSplitter` from LangChain  

## 🔹 Code Breakdown  

### **1️⃣ Importing the Required Module**  
```python
from langchain_text_splitters import CharacterTextSplitter
```
- This imports the `CharacterTextSplitter` class from the `langchain_text_splitters` module.
- It is used to **split long text into smaller chunks** for better processing in Language Model (LLM) applications.

---

### **2️⃣ Creating an Instance of `CharacterTextSplitter`**  
```python
text_splitter = CharacterTextSplitter(
    separator="\n\n",
    chunk_size=1000,
    chunk_overlap=200,
    length_function=len,
    is_separator_regex=False,
)
```
This **configures the text splitting strategy** with the following parameters:

- **`separator="\n\n"`**  
  - Defines **double newlines (`\n\n`)** as the splitting point.
  - Useful for **paragraph-based** splitting, ensuring that logical sections are preserved.

- **`chunk_size=1000`**  
  - Sets a **target maximum of 1000 characters** per chunk: pieces are merged up to this limit.
  - A single paragraph longer than 1000 characters is **not split further**, so a chunk can exceed the limit when no separator falls inside it.

- **`chunk_overlap=200`**  
  - Lets **adjacent chunks share up to 200 characters**, always as whole separator-delimited pieces.
  - Helps **retain context** across chunks, preventing abrupt cut-offs.

- **`length_function=len`**  
  - Uses Python’s built-in `len()` function to **measure chunk length** in **characters**.

- **`is_separator_regex=False`**  
  - Specifies that the `separator` is **a plain string** and **not a regex pattern**.
  - If `True`, advanced **regular expressions** could be used for more complex splitting logic.

---

## 🎯 **Why Use `CharacterTextSplitter`?**
- 🏆 **Optimized for LLM Processing**: Splits long texts into **manageable** segments.  
- 🔄 **Maintains Context Across Chunks**: Overlapping chunks ensure **smooth transitions**.  
- 📖 **Preserves Paragraph Boundaries**: Helps retain **semantic meaning** while chunking.  
- ⚡ **Improves Vector Search & Retrieval**: Essential for **embedding-based** search applications.  

---

## 🚀 **Example Usage**
```python
text = "This is paragraph one.\n\nThis is paragraph two.\n\nThis is paragraph three."
chunks = text_splitter.split_text(text)

for idx, chunk in enumerate(chunks):
    print(f"Chunk {idx + 1}:")
    print(chunk)
    print("-" * 50)
```

### **🔹 Expected Output**
```
Chunk 1:
This is paragraph one.

This is paragraph two.

This is paragraph three.
--------------------------------------------------
```
- The whole text is only 72 characters, well under `chunk_size=1000`, so the three paragraphs are merged back into **a single chunk**.  
- Splitting and overlap only become visible once the text is longer than `chunk_size`.

---

## 🎯 **When to Use This?**
✅ Preparing text for **LLM embeddings** and **vector databases**.  
✅ Splitting **long articles, books, or transcripts** into structured sections.  
✅ Improving **retrieval accuracy** in search-based applications.  

This method ensures **structured and contextual chunking** for better text processing in **LangChain-based AI applications**. 🚀  


In [None]:
texts = text_splitter.create_documents([loaded_data[0].page_content])

# 📌 Code Explanation: Creating Document Chunks using `CharacterTextSplitter`

## 🔹 Code Breakdown  

### **1️⃣ Executing the Text Splitting Operation**  
```python
texts = text_splitter.create_documents([loaded_data[0].page_content])
```
This line **splits the content of a document into multiple chunks** using the `CharacterTextSplitter` instance (`text_splitter`) that was previously configured.

---

### **2️⃣ Understanding Each Component**  

#### **✅ `loaded_data[0].page_content`**  
- `loaded_data` is expected to be a **list of documents**.  
- `loaded_data[0]` refers to **the first document** in the list.  
- `.page_content` extracts the **textual content** of that document.

#### **✅ `text_splitter.create_documents([...])`**  
- The `create_documents()` function **splits the input text into multiple document chunks**.  
- It **automatically structures** the output into `Document` objects (LangChain's format), which contain:  
  - **Chunked text**  
  - **Metadata** (taken from the optional `metadatas` argument; empty when none is passed, as here)  

- The method applies the **splitting strategy** defined earlier:
  - **Splitting at `"\n\n"` (double newlines)**
  - **Pieces are merged up to a target size of 1000 characters**
  - **Adjacent chunks share up to 200 characters**
  - **Length is measured using `len()`**

---

## 🎯 **Expected Outcome**
- `texts` will be a **list of Document objects**, each containing a **text chunk**.
- Adjacent chunks **may share content** (up to `chunk_overlap` characters) to maintain context.

---

## 🚀 **Example Output**
If `loaded_data[0].page_content` contains:
```
Paragraph 1.

Paragraph 2.

Paragraph 3.

Paragraph 4.
```
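With `chunk_size=1000`, these four short paragraphs fit into a single chunk. With a much smaller configuration, for example `chunk_size=30` and `chunk_overlap=15`, `texts` would contain:
```
[ Document(page_content="Paragraph 1.\n\nParagraph 2."),
  Document(page_content="Paragraph 2.\n\nParagraph 3."),
  Document(page_content="Paragraph 3.\n\nParagraph 4.") ]
```
Each chunk **overlaps its neighbor by one paragraph** because a whole 12-character paragraph fits within the **15-character overlap** budget.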
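
To see how overlap is built from whole paragraphs, here is a simplified pure-Python sketch of a merge-with-overlap step. It is written for illustration and is not LangChain's code; `merge_with_overlap` is a hypothetical name, and the small settings `chunk_size=30` and `chunk_overlap=15` are assumptions chosen so that four 12-character paragraphs actually get split.

```python
def merge_with_overlap(pieces, separator, chunk_size, chunk_overlap):
    """Greedily join pieces into chunks, carrying trailing pieces over as overlap."""
    sep_len = len(separator)
    chunks, current, total = [], [], 0
    for piece in pieces:
        extra = sep_len if current else 0
        if current and total + len(piece) + extra > chunk_size:
            chunks.append(separator.join(current))
            # Drop pieces from the front until the remainder fits the overlap
            # budget and leaves room for the incoming piece.
            while total > chunk_overlap or (
                total > 0
                and total + len(piece) + (sep_len if current else 0) > chunk_size
            ):
                total -= len(current[0]) + (sep_len if len(current) > 1 else 0)
                current = current[1:]
        current.append(piece)
        total += len(piece) + (sep_len if len(current) > 1 else 0)
    if current:
        chunks.append(separator.join(current))
    return chunks


text = "Paragraph 1.\n\nParagraph 2.\n\nParagraph 3.\n\nParagraph 4."
pieces = text.split("\n\n")

small = merge_with_overlap(pieces, "\n\n", chunk_size=30, chunk_overlap=15)
large = merge_with_overlap(pieces, "\n\n", chunk_size=1000, chunk_overlap=200)

print(small)  # three chunks, each sharing one paragraph with its neighbor
print(large)  # one chunk: everything fits within 1000 characters
```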

---

## 🎯 **When to Use This?**
✅ **Preparing text chunks** for retrieval-augmented generation (RAG).  
✅ **Feeding structured chunks** into embeddings-based search engines.  
✅ **Improving LLM document processing** by ensuring context continuity.  

This ensures **efficient text chunking** for **LLM-based applications**, preserving context while splitting long texts. 🚀  


In [None]:
len(texts)

2

In [None]:
texts[0]

Document(page_content='Be good')

In [None]:
# texts[1]

#### Splitting with metadata

In [None]:
metadatas = [{"chunk": 0}, {"chunk": 1}]

documents = text_splitter.create_documents(
    [loaded_data[0].page_content, loaded_data[0].page_content],
    metadatas=metadatas
)

# 📌 Code Explanation: Creating Document Chunks with Metadata using `CharacterTextSplitter`

## 🔹 Code Breakdown  

### **1️⃣ Defining Metadata for Each Input Text**  
```python
metadatas = [{"chunk": 0}, {"chunk": 1}]
```
- A list of **metadata dictionaries** is created.  
- Each dictionary belongs to **one input text** (matched by position), not to one chunk; every chunk produced from that text receives a copy of it.
- Example:
  - `{"chunk": 0}` → Metadata for all chunks of the **first text**.
  - `{"chunk": 1}` → Metadata for all chunks of the **second text**.

---

### **2️⃣ Creating Document Chunks with Metadata**  
```python
documents = text_splitter.create_documents(
    [loaded_data[0].page_content, loaded_data[0].page_content],
    metadatas=metadatas
)
```
- `create_documents()` takes **two inputs**:
  1. A list of **texts** (here, the **same document is duplicated** for demonstration).  
  2. A list of **metadata dictionaries** (`metadatas`), one per input text, which is copied onto every chunk of that text.  

- The text is **split into chunks** using the **configured splitting strategy**:
  - **Splitting at `"\n\n"` (double newlines)**
  - **Pieces are merged up to a target size of 1000 characters**
  - **Adjacent chunks share up to 200 characters**
  - **Length is measured using `len()`**
  - **Metadata is assigned per input text**
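
The pairing between `metadatas` and the input texts can be sketched without LangChain. In this illustration plain dictionaries stand in for `Document` objects, and `create_documents_sketch` and `split_on_blank_lines` are hypothetical helpers; the sample text is made up.

```python
def create_documents_sketch(texts, metadatas, split):
    """Pair each input text with its metadata, then copy that metadata onto every chunk."""
    documents = []
    for text, metadata in zip(texts, metadatas):
        for chunk in split(text):
            documents.append({"page_content": chunk, "metadata": dict(metadata)})
    return documents


def split_on_blank_lines(text):
    """Stand-in splitter: one chunk per paragraph."""
    return text.split("\n\n")


text = "Be good\n\nA second paragraph."
docs = create_documents_sketch(
    [text, text],
    [{"chunk": 0}, {"chunk": 1}],
    split_on_blank_lines,
)

for doc in docs:
    print(doc)
```

Two input texts with two chunks each produce four documents: the first two carry `{"chunk": 0}` and the last two carry `{"chunk": 1}`.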

---

## 🎯 **Understanding the Parameters in Context**
### **✅ CharacterTextSplitter Configuration Recap**
| Parameter           | Description |
|---------------------|-------------|
| `separator="\n\n"` | Splits text at **double newlines**, ensuring logical chunking. |
| `chunk_size=1000` | Target maximum of **1000 characters** per chunk. |
| `chunk_overlap=200` | Adjacent chunks share **up to 200 characters**, retaining context. |
| `length_function=len` | Uses `len()` to measure chunk size in **characters**. |
| `is_separator_regex=False` | Treats `separator` as **plain text**, not a regex pattern. |

---

## 🚀 **Example Output**
If `loaded_data[0].page_content` contains:
```
Paragraph 1.

Paragraph 2.

Paragraph 3.

Paragraph 4.
```
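Then, with `chunk_size=1000`, `documents` would contain:
```
[ Document(page_content="Paragraph 1.\n\nParagraph 2.\n\nParagraph 3.\n\nParagraph 4.", metadata={"chunk": 0}),
  Document(page_content="Paragraph 1.\n\nParagraph 2.\n\nParagraph 3.\n\nParagraph 4.", metadata={"chunk": 1}) ]
```
- Each input text is short enough to stay in **one chunk**, so the two copies yield two documents.
- A longer text would yield **several chunks per copy**, all carrying the same metadata.
- Metadata **identifies which input text** a chunk came from.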

---

## 🎯 **Why Attach Metadata?**
✅ **Maintains Sequential Information** → Keeps track of which chunk comes first.  
✅ **Improves Document Retrieval** → Helps identify chunks in **vector search**.  
✅ **Enhances Processing** → Metadata can store **timestamps, categories, or sources**.  

---

## 📂 **When to Use This?**
- **Retrieval-Augmented Generation (RAG)** → When sending **chunks** to an LLM.
- **Document Indexing for Search** → When embedding **text into a vector database**.
- **Context-Aware AI Applications** → When **metadata needs to be preserved**.  

This ensures **structured text processing** for AI-powered applications! 🚀  


In [None]:
documents[0]

Document(metadata={'chunk': 0}, page_content='Be good')

# 📌 Code Explanation: Accessing a Specific Document Chunk  

## 🔹 Code Breakdown  

### **1️⃣ Accessing the First Chunked Document**  
```python
documents[0]
```
- This retrieves **the first document chunk** from the `documents` list.  
- `documents` was created using `text_splitter.create_documents()`.  
- Since the **splitting strategy** was applied, `documents` now contains multiple **chunked documents**.

---

### **2️⃣ Expected Output**
```python
Document(metadata={'chunk': 0}, page_content='Be good')
```
- **`Document(...)`** → Represents a **LangChain `Document` object**.  
- **`metadata={'chunk': 0}`** → This **metadata dictionary** was passed for the **first input text**.  
  - Here, `chunk: 0` indicates that this chunk came from **the first text** in the list; every chunk of that text carries the same value.  
- **`page_content='Be good'`** → The **actual content** of this document chunk.  

---

## 🎯 **Understanding What Happened**
1️⃣ **The text was split** into chunks based on the pre-configured `CharacterTextSplitter`.  
2️⃣ **Metadata was attached** to record which input text each chunk came from.  
3️⃣ **`documents[0]` returned** the first chunk, which contains:  
   - **A metadata dictionary** (`{"chunk": 0}`)  
   - **A text snippet** (`"Be good"`)  

---

## 🚀 **Why is This Useful?**
✅ **Chunk Tracking** → Each document retains **metadata**, helping with indexing.  
✅ **Retrieval-Augmented Generation (RAG)** → LLMs can reference structured text chunks.  
✅ **Efficient Document Processing** → Useful for **semantic search & embeddings**.  

---

## 📂 **Next Steps**
- To access the **second document chunk**:
  ```python
  documents[1]
  ```
- To view **all document chunks**:
  ```python
  for doc in documents:
      print(doc)
  ```
- To extract **text content only**:
  ```python
  print(documents[0].page_content)
  ```

This method ensures **structured document processing** for **AI applications**! 🚀  


In [None]:
print(documents[0])

page_content='Be good' metadata={'chunk': 0}


# 📌 Code Explanation: Printing a Document Chunk  

## 🔹 Code Breakdown  

### **1️⃣ Printing the First Document Chunk**  
```python
print(documents[0])
```
- This prints the **first document chunk** stored in the `documents` list.
- `documents` was created using `text_splitter.create_documents()`, which **split the text** and **attached metadata**.

---

### **2️⃣ Expected Output**
```python
page_content='Be good' metadata={'chunk': 0}
```
- **`page_content='Be good'`** → This is the **actual text** stored in this chunk.  
- **`metadata={'chunk': 0}`** → This **identifies the input text** (the first one) that the chunk came from.  

---

## 🎯 **Breaking Down What Happened**
### **✅ Step 1: Text Splitting**
- The original text was **split into chunks** using `CharacterTextSplitter`.
- The first chunk **only contains `"Be good"`**, presumably because appending the next paragraph would have pushed it past `chunk_size=1000`.

### **✅ Step 2: Metadata Assignment**
- `{"chunk": 0}` is attached to every chunk of the **first input text**.
- Chunks of the second input text carry `{"chunk": 1}`; the value is not a running chunk counter.

### **✅ Step 3: Printing the First Chunk**
- `documents[0]` is an instance of LangChain’s `Document` class.
- When printed, it displays the **text (`page_content`)** and its **metadata**.

---

## 🚀 **Why is This Useful?**
✅ **Helps Track Chunk Order** → Useful for document indexing.  
✅ **Preserves Context** → Metadata allows **better retrieval & reference**.  
✅ **Essential for Vector Search** → Metadata is **critical in embeddings-based applications**.  

---

## 📂 **Next Steps**
- **Print all chunks:**
  ```python
  for doc in documents:
      print(doc)
  ```
- **Extract only text:**
  ```python
  print(documents[0].page_content)
  ```
- **Extract only metadata:**
  ```python
  print(documents[0].metadata)
  ```

This ensures **structured text processing** for AI-powered applications! 🚀  


## Recursive Character Splitter
* This text splitter is the recommended one for generic text.
* It is parameterized by a list of characters. It tries to split on them in order until the chunks are small enough. The default list is ["\n\n", "\n", " ", ""].
* This has the effect of trying to keep all paragraphs (and then sentences, and then words) together as long as possible, as those would generically seem to be the strongest semantically related pieces of text.

# 📌 Simple Explanation: Recursive Character Splitter  

## 🔹 What is the Recursive Character Splitter?  
- The **"Recursive Character Splitter"** is a method used to **divide text into smaller, more manageable chunks**.  
- It is designed to **preserve the semantic integrity** of the text while ensuring optimal chunking.  
- The process follows a **hierarchical splitting strategy**, meaning it first tries to split by **larger units** (paragraphs), then by **smaller units** (lines, then words) if necessary.

---

## 🔹 How Does It Work?  
- It attempts to split text using **a prioritized list of characters** in a specified order.
- The **default splitting sequence** is:  
  1️⃣ `"\n\n"` → First, split at **double newlines** to separate **paragraphs**.  
  2️⃣ `"\n"` → Then, split at **single newlines** to isolate **lines**.  
  3️⃣ `" "` → If still too large, split at **spaces** to break into **words**.  
  4️⃣ `""` → As a last resort, split at **each character individually**.  

- Only pieces that are still larger than `chunk_size` are passed to the next separator, and small neighboring pieces are merged back together, so each chunk keeps **as much coherent text as the size limit allows**.
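
The fallback order can be sketched as a short recursive function. This is a simplified illustration, not LangChain's implementation: it has no overlap, it drops the separators it splits on, and `recursive_split` is a hypothetical name. It assumes the separator list ends with `""` so the recursion always terminates; `chunk_size=100` is chosen only for the demo.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Split with the first usable separator; recurse on pieces that are still too long."""
    if len(text) <= chunk_size:
        return [text] if text.strip() else []
    sep, rest = separators[0], separators[1:]
    if sep == "":
        # Last resort: fixed-width character slices
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    if sep not in text:
        return recursive_split(text, chunk_size, rest)
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # still fits: keep merging
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # Piece is too long on its own: try the next, finer separator
            chunks.extend(recursive_split(piece, chunk_size, rest))
    if current:
        chunks.append(current)
    return [chunk for chunk in chunks if chunk.strip()]


text = (
    "Hello, welcome to our store!\n\n"
    "We offer a variety of products. Our range includes electronics, clothing, and home appliances.\n"
    "Our staff is available to help you during store hours: 9 AM to 9 PM every day."
)
chunks = recursive_split(text, chunk_size=100)
for idx, chunk in enumerate(chunks, start=1):
    print(f"Chunk {idx} ({len(chunk)} chars): {chunk}")
```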

---

## 🎯 **Example Usage**  

### **✅ Original Text:**  
```
Hello, welcome to our store!

We offer a variety of products. Our range includes electronics, clothing, and home appliances.
Our staff is available to help you during store hours: 9 AM to 9 PM every day.
```

### **✅ Applying Recursive Character Splitter:**  

#### **🔹 First Attempt (Splitting at `"\n\n"` - Paragraphs)**
1️⃣ **Chunk 1:**  
```
Hello, welcome to our store!
```
2️⃣ **Chunk 2:**  
```
We offer a variety of products. Our range includes electronics, clothing, and home appliances.
Our staff is available to help you during store hours: 9 AM to 9 PM every day.
```
📌 *Paragraph-based splitting works well, but Chunk 2 is still too large.*

---

#### **🔹 Second Attempt (Splitting at `"\n"` - Lines)**
1️⃣ **Chunk 1:**  
```
Hello, welcome to our store!
```
2️⃣ **New Chunk 2:**  
```
We offer a variety of products. Our range includes electronics, clothing, and home appliances.
```
3️⃣ **Chunk 3:**  
```
Our staff is available to help you during store hours: 9 AM to 9 PM every day.
```
📌 *Now, each chunk is more structured and contains complete information.*

---

## 🚀 **Why is This Effective?**  
✅ **Preserves Context** → Ensures each chunk contains **meaningful information**.  
✅ **Hierarchical Splitting** → Prioritizes **larger units first** before breaking into smaller ones.  
✅ **Optimized for Language Models** → Helps **LLMs retain context** across chunks.  
✅ **Ideal for Document Processing** → Works well for **parsing large texts**.  

---

## 📂 **When to Use This?**  
- **LLM-based text processing** (Retrieval-Augmented Generation - RAG).  
- **Chunking large articles, books, or documents** for AI applications.  
- **Preparing structured text** for **embedding-based search**.  

The **Recursive Character Splitter** ensures that chunks are **intelligently divided**, maintaining logical meaning while efficiently managing text size. 🚀  


In [None]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

In [None]:
# Creating an instance of RecursiveCharacterTextSplitter
recursive_splitter = RecursiveCharacterTextSplitter(
    chunk_size=26,    # Each chunk will have a maximum of 26 characters
    chunk_overlap=4   # Adjacent chunks share up to 4 characters
)

## 🔹 What Does This Code Do?  
- It initializes **RecursiveCharacterTextSplitter**, a LangChain utility used to **split text into structured chunks**.
- This method follows a **hierarchical splitting strategy**, ensuring **semantic integrity** of the text.  

---

## 🔹 Understanding the Parameters  

| Parameter | Description |
|-----------|-------------|
| `chunk_size=26` | Each chunk will contain a maximum of **26 characters**. |
| `chunk_overlap=4` | Up to **4 characters from the end of a chunk** reappear at the beginning of the next chunk to maintain **context continuity**. Overlap is built from whole pieces (for example, words), so it can be shorter than 4. |

- **Why Overlap?**  
  - **Prevents abrupt cut-offs** at sentence boundaries.  
  - **Helps LLMs retain context** across different chunks.  
  - **Improves retrieval effectiveness** in vector search applications.
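
When a text has no newlines or spaces at all, splitting falls back to individual characters and overlap behaves like a sliding window. The sketch below illustrates only that special case in plain Python; `sliding_chunks` is a hypothetical helper, not a LangChain function, and it assumes `chunk_overlap` is smaller than `chunk_size`.

```python
def sliding_chunks(text, chunk_size, chunk_overlap):
    """Fixed-size windows where each window restarts `chunk_overlap` characters early."""
    step = chunk_size - chunk_overlap  # assumes chunk_overlap < chunk_size
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks


alphabet_text = "abcdefghijklmnopqrstuvwxyzabcdefg"
windows = sliding_chunks(alphabet_text, chunk_size=26, chunk_overlap=4)
print(windows)
```

The second window starts with `wxyz`, the last four characters of the first one.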

---

## 🎯 **Example Usage & Expected Output**  

### **✅ Sample Text:**  
```
"Welcome to AI development. This is a new era of innovation."
```

### **✅ Applying RecursiveCharacterTextSplitter**
```python
text = "Welcome to AI development. This is a new era of innovation."
chunks = recursive_splitter.split_text(text)

for idx, chunk in enumerate(chunks):
    print(f"Chunk {idx + 1}: {chunk}")
```

### **🔹 Expected Output:**
```
Chunk 1: Welcome to AI development.
Chunk 2: This is a new era of
Chunk 3: of innovation.
```
📌 **Explanation:**  
- The text contains no newlines, so the splitter falls back to **spaces** and never cuts inside a word.  
- The first chunk is **exactly 26 characters**: `"Welcome to AI development."`.  
- The third chunk starts with `"of"`, repeated from the end of the second chunk, because `" of"` fits within the **4-character overlap**; `" development."` is too long to be repeated, so the second chunk starts without overlap.

---

## 🚀 **Why Use RecursiveCharacterTextSplitter?**  
✅ **Preserves Meaning** → Uses **smart splitting** to avoid breaking words.  
✅ **Optimized for LLMs** → Ensures **coherent text chunking** for AI applications.  
✅ **Better Context Retention** → Overlapping **prevents abrupt cut-offs**.  
✅ **Great for Vector Databases** → Works well for **embedding-based retrieval**.  

---

## 📂 **When to Use This?**  
- **Chunking large documents for LLM-based retrieval** (RAG).  
- **Preparing structured embeddings** for a **vector database**.  
- **Splitting long texts** while preserving **semantic structure**.  

The **RecursiveCharacterTextSplitter** intelligently **breaks text into logical chunks**, ensuring **smooth processing for AI-powered applications**. 🚀  

In [None]:
text1 = 'abcdefghijklmnopqrstuvwxyzabcdefg'

In [None]:
text2 = """
Data that Speak
LLM Applications are revolutionizing industries such as
banking, healthcare, insurance, education, legal, tourism,
construction, logistics, marketing, sales, customer service,
and even public administration.

The aim of our programs is for students to learn how to
create LLM Applications in the context of a business,
which presents a set of challenges that are important
to consider in advance.
"""

In [None]:
recursive_splitter.split_text(text1)

['abcdefghijklmnopqrstuvwxyz', 'wxyzabcdefg']

In [None]:
recursive_splitter.split_text(text2)

['Data that Speak',
 'LLM Applications are',
 'are revolutionizing',
 'industries such as',
 'banking, healthcare,',
 'insurance, education,',
 'legal, tourism,',
 'construction, logistics,',
 'marketing, sales,',
 'customer service,',
 'and even public',
 'administration.',
 'The aim of our programs',
 'is for students to learn',
 'how to',
 'create LLM Applications',
 'in the context of a',
 'a business,',
 'which presents a set of',
 'of challenges that are',
 'are important',
 'to consider in advance.']

In [None]:
# Creating an instance of RecursiveCharacterTextSplitter with custom separators
second_recursive_splitter = RecursiveCharacterTextSplitter(
    chunk_size=150,   # Each chunk will have a maximum of 150 characters
    chunk_overlap=0,  # No overlapping between chunks
    separators=["\n\n", "\n", r"(?<=\. )", " ", ""],
    is_separator_regex=True,  # Required so the lookbehind pattern is treated as a regex
    # Defines the order in which the text is split:
    # 1. Double newline ("\n\n") - First, try splitting by paragraphs.
    # 2. Single newline ("\n") - If needed, split by lines.
    # 3. Period with space "(?<=\. )" - Then, split at sentence boundaries.
    # 4. Space (" ") - If still too long, split by words.
    # 5. Empty string ("") - As a last resort, split at individual characters.
)

# 📌 Code Explanation: Recursive Character Text Splitting with Custom Separators  

## 🔹 What Does This Code Do?  
- This initializes **RecursiveCharacterTextSplitter** with **custom separators** to **split text intelligently**.
- Instead of just splitting by **character count**, it follows **a prioritized list of breaking points**.
- This helps **maintain logical segmentation**, making text chunks **more meaningful**.

---

## 🔹 Understanding the Parameters  

| Parameter | Description |
|-----------|-------------|
| `chunk_size=150` | Each chunk will contain a maximum of **150 characters**. |
| `chunk_overlap=0` | **No overlap** between chunks, meaning each chunk is independent. |
| `separators=["\n\n", "\n", "(?<=\. )", " ", ""]` | **Defines the order in which text should be split.** |

---

## 🎯 **Understanding the Separators**  

The `separators=["\n\n", "\n", "(?<=\. )", " ", ""]` define a **hierarchical splitting strategy**, progressively reducing chunk sizes while maintaining the text’s logical structure.

### **✅ Step-by-step Explanation of the Separators**
1️⃣ **"\n\n" (Double Newline) – Paragraph Splitting**  
   - This separator **targets paragraph boundaries**.  
   - **Use case:** Splitting long-form documents (e.g., books, articles) into **separate sections**.  

2️⃣ **"\n" (Single Newline) – Line Splitting**  
   - Targets **single line breaks**, useful when **paragraphs contain multiple short lines**.  
   - **Use case:** Lists, poetry, or structured data where **line breaks matter**.  

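3️⃣ **"(?<=\. )" (Regex-Based Sentence Splitting)**  
   - This is a **regular expression lookbehind assertion**, which matches the position **right after a period (`.`) followed by a space (` `)**.  
   - It only acts as a regex when the splitter is created with `is_separator_regex=True`; otherwise the pattern is escaped and searched for as literal text, which never matches ordinary prose.  
   - **Use case:** Ensuring sentences stay intact when **splitting prose or structured writing**.  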

4️⃣ **" " (Space) – Word Splitting**  
   - If text is **still too long**, it splits at spaces **between words**.  
   - **Use case:** Extracting **phrases or individual words** when necessary.  

5️⃣ **"" (Empty String) – Character-Level Splitting**  
   - As a last resort, it breaks text **into individual characters**.  
   - **Use case:** Ensuring **all text chunks fit within `chunk_size=150`** characters.  
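
The difference between treating the sentence separator as a regex and as literal text can be checked with Python's `re` module alone. This is a small illustration of the pattern itself, independent of LangChain; the sample sentence is made up.

```python
import re

sample = "We offer a variety of products. Our range includes electronics. Visit us today."
separator = r"(?<=\. )"  # zero-width match right after ". "

# Used as a regex: splits after each ". " and keeps the period with its sentence
as_regex = re.split(separator, sample)

# Escaped first: the pattern is searched for as literal characters and never matches
as_literal = re.split(re.escape(separator), sample)

print(as_regex)
print(as_literal)
```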

---

## 🚀 **Example Usage & Expected Output**  

### **✅ Sample Text:**  
```
Hello, welcome to our store!

We offer a variety of products.
Our range includes electronics, clothing, and home appliances.
Our staff is available to help you during store hours: 9 AM to 9 PM every day.
```

### **✅ Applying `RecursiveCharacterTextSplitter`**
```python
text = """Hello, welcome to our store!\n\nWe offer a variety of products.
Our range includes electronics, clothing, and home appliances.
Our staff is available to help you during store hours: 9 AM to 9 PM every day."""

chunks = second_recursive_splitter.split_text(text)

for idx, chunk in enumerate(chunks):
    print(f"Chunk {idx + 1}: {chunk}")
```

### **🔹 Expected Output:**
```
Chunk 1: Hello, welcome to our store!
Chunk 2: We offer a variety of products.
Our range includes electronics, clothing, and home appliances.
Chunk 3: Our staff is available to help you during store hours: 9 AM to 9 PM every day.
```
📌 **Explanation:**  
- **Chunk 1:** The first paragraph is split off at `"\n\n"`.  
- **Chunk 2:** The second paragraph is 173 characters, over the limit, so it is split at `"\n"`; its first two lines (94 characters together) are merged back into one chunk.  
- **Chunk 3:** Adding the last line would exceed 150 characters, so it becomes its own chunk. The sentence-level separator is never needed here.  

---

## 🎯 **Why Use Custom Separators in RecursiveCharacterTextSplitter?**  
✅ **Maintains Logical Segmentation** → Ensures that chunks retain their **semantic structure**.  
✅ **Preserves Sentence Integrity** → Uses **regex-based** splitting to **keep sentences intact**.  
✅ **No Unnecessary Fragmentation** → **Paragraphs and sentences remain readable**.  
✅ **Perfect for LLM Processing** → Helps **LLMs retain context** better than arbitrary chunking.  

---

## 📂 **When to Use This?**  
- **Processing long documents while keeping structure intact**.  
- **Preparing well-segmented text for embedding-based retrieval**.  
- **Ensuring AI models process full, meaningful sentences** rather than cut-off phrases.  

The **RecursiveCharacterTextSplitter** intelligently **breaks text into logical chunks**, ensuring **smooth processing for AI-powered applications**! 🚀  

In [None]:
second_recursive_splitter.split_text(text2)

['Data that Speak\nLLM Applications are revolutionizing industries such as \nbanking, healthcare, insurance, education, legal, tourism,',
 'construction, logistics, marketing, sales, customer service, \nand even public administration.',
 'The aim of our programs is for students to learn how to \ncreate LLM Applications in the context of a business,',
 'which presents a set of challenges that are important \nto consider in advance.']

## How to execute the code from Visual Studio Code
* In Visual Studio Code, see the file 002-splitters.py
* In terminal, make sure you are in the directory of the file and run:
    * python 002-splitters.py