# From llama-index

In **LlamaIndex**, *splitters* (the library calls them node parsers) break large documents into manageable text chunks, called nodes, for embedding and retrieval. Here are the main types:

---

### 🧩 1. **`SentenceSplitter`**

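* Packs whole sentences into each chunk where possible (the default sentence tokenizer comes from `nltk`), falling back to smaller units only when a single sentence exceeds `chunk_size` (measured in tokens).
* Best for **semantic coherence**: chunks rarely end mid-sentence.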

```python
from llama_index.core.node_parser import SentenceSplitter
splitter = SentenceSplitter(chunk_size=512, chunk_overlap=50)
```
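
To make the "whole sentences per chunk" idea concrete, here is a minimal plain-Python sketch. It is not LlamaIndex's implementation: the regex boundary rule, the word-based budget, and the helper names `split_sentences` and `pack_sentences` are all assumptions made for illustration, and overlap is left out.

```python
import re

def split_sentences(text):
    # Naive boundary rule (an assumption): split after ., ! or ? followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def pack_sentences(sentences, max_words):
    """Greedily pack whole sentences into chunks of up to max_words words.

    A single sentence longer than max_words still becomes its own chunk.
    """
    chunks, current, count = [], [], 0
    for sentence in sentences:
        n = len(sentence.split())
        if current and count + n > max_words:
            # Adding this sentence would overflow, so close the current chunk.
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks

text = ("Splitters cut documents. Chunks feed the embedder. "
        "Retrieval uses the chunks. Good chunks help.")
chunks = pack_sentences(split_sentences(text), max_words=8)
for chunk in chunks:
    print(repr(chunk))
```

Every chunk ends on a sentence boundary, which is the property the bullets above describe.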

---

### 📄 2. **`TokenTextSplitter`**

* Splits based on **token count** (useful when working with OpenAI or local LLMs).
* Helps avoid exceeding model token limits.

```python
from llama_index.core.node_parser import TokenTextSplitter
splitter = TokenTextSplitter(separator=" ", chunk_size=256, chunk_overlap=20)
```

---

### 🪶 3. **`MarkdownNodeParser`**

* Splits a Markdown document at its headers, so each node holds one section; `#` lines inside fenced code blocks are not treated as headers.
* Ideal for **technical docs or README files**.

```python
from llama_index.core.node_parser import MarkdownNodeParser
splitter = MarkdownNodeParser()
```
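
A rough sketch of header-based splitting, written in plain Python so the mechanics are visible. This is an illustration under assumptions, not the library's code: `split_markdown` and the `"/"`-joined header path are hypothetical names and conventions.

```python
import re

HEADER = re.compile(r"^(#{1,6})\s+(.*)$")
FENCE = "`" * 3  # built indirectly to avoid a literal fence inside this example

def split_markdown(text):
    """Return (header_path, body) pairs, one per header-delimited section."""
    sections, path, body = [], [], []
    in_fence = False

    def flush():
        content = "\n".join(body).strip()
        if content:
            sections.append(("/".join(path), content))
        body.clear()

    for line in text.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence  # a '#' line inside a code fence is not a header
        match = None if in_fence else HEADER.match(line)
        if match:
            flush()
            level = len(match.group(1))
            del path[level - 1:]  # drop headers at this depth or deeper
            path.append(match.group(2).strip())
        else:
            body.append(line)
    flush()
    return sections

doc = "\n".join([
    "# Guide", "Intro text.",
    "## Install", FENCE, "# not a header", "pip install x", FENCE,
    "## Usage", "Run it.",
])
sections = split_markdown(doc)
for header_path, content in sections:
    print(header_path, "->", len(content), "chars")
```

Keeping the header path next to each section is what lets a retriever show where a chunk came from.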

---

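### 📚 4. **`SemanticSplitterNodeParser`**

* Uses embeddings to split where adjacent sentences diverge in meaning (**semantic similarity**), not at a fixed size.
* More intelligent, but computationally heavier: every sentence has to be embedded before any chunk is produced.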

```python
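from llama_index.core.node_parser import SemanticSplitterNodeParser
from llama_index.embeddings.openai import OpenAIEmbedding  # llama-index-embeddings-openai package
# embed_model takes an embedding object, not a model-name string
splitter = SemanticSplitterNodeParser(embed_model=OpenAIEmbedding(model="text-embedding-3-small"))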
```
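
The breakpoint idea can be sketched without any model. The example below is a toy under loud assumptions: `embed` is a bag-of-words stand-in for a real embedding model, and a fixed `threshold` replaces whatever statistic a real splitter uses to choose breakpoints.

```python
import math
from collections import Counter

def embed(sentence):
    # Stand-in for a real embedding model: a bag-of-words count vector.
    return Counter(sentence.lower().replace(".", "").split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def semantic_chunks(sentences, threshold):
    """Start a new chunk wherever adjacent sentences are less similar than threshold."""
    if not sentences:
        return []
    chunks = [[sentences[0]]]
    for prev, cur in zip(sentences, sentences[1:]):
        if cosine(embed(prev), embed(cur)) < threshold:
            chunks.append([cur])      # topic shift: open a new chunk
        else:
            chunks[-1].append(cur)    # same topic: extend the current chunk
    return [" ".join(c) for c in chunks]

sentences = [
    "Cats sleep most of the day.",
    "Cats hunt at night.",
    "GPUs speed up matrix math.",
    "Matrix math powers neural networks.",
]
chunks = semantic_chunks(sentences, threshold=0.15)
for chunk in chunks:
    print(repr(chunk))
```

The split lands between the cat sentences and the GPU sentences because they share no words, so their similarity is zero.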

---

### 🔖 5. **`HierarchicalNodeParser`**

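* Creates a **multi-level structure** by splitting the same text at progressively smaller chunk sizes (the defaults are 2048, 512, and 128 tokens) and linking each smaller node to its parent. The levels come from chunk size, not from paragraphs, sections, or chapters in the document.
* Useful for large books, reports, or PDFs, typically with a retriever that can merge several small hits back into their parent node.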

```python
from llama_index.core.node_parser import HierarchicalNodeParser
splitter = HierarchicalNodeParser.from_defaults()
```
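
A small sketch of the parent/child structure, assuming words stand in for tokens. `build_hierarchy` and the dictionary node layout are hypothetical and only meant to show how the levels relate.

```python
def build_hierarchy(words, chunk_sizes):
    """Split words at each size in chunk_sizes (largest first), linking children to parents."""
    nodes = []  # each node: {"id", "level", "parent", "text"}

    def split(span, level, parent):
        if level == len(chunk_sizes):
            return
        size = chunk_sizes[level]
        for start in range(0, len(span), size):
            piece = span[start:start + size]
            node = {"id": len(nodes), "level": level,
                    "parent": parent, "text": " ".join(piece)}
            nodes.append(node)
            # Children are cut from this node's own text at the next, smaller size.
            split(piece, level + 1, node["id"])

    split(words, 0, None)
    return nodes

words = [f"w{i}" for i in range(8)]
nodes = build_hierarchy(words, chunk_sizes=[4, 2])
leaves = [n for n in nodes if n["level"] == 1]
for node in nodes:
    print(node)
```

Only the leaves would normally be embedded; the `parent` link is what allows a retriever to swap several matching leaves for the larger chunk that contains them.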

---

### ⚙️ 6. **`CodeSplitter`**

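* Specialized for **source code**: parses the file with `tree-sitter` and splits along syntax-tree nodes, so functions and classes stay together where they fit. The `tree-sitter` parser packages must be installed separately.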

```python
from llama_index.core.node_parser import CodeSplitter
splitter = CodeSplitter(language="python")
```
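
To show what splitting along a syntax tree means, here is a sketch that swaps in Python's standard-library `ast` module for `tree-sitter`. It only handles Python and only top-level definitions; `split_top_level` is a hypothetical helper, not part of LlamaIndex.

```python
import ast

def split_top_level(source):
    """Return (kind, name, code) for each top-level function or class."""
    tree = ast.parse(source)
    units = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            # get_source_segment returns the exact source text of the node.
            code = ast.get_source_segment(source, node)
            units.append((type(node).__name__, node.name, code))
    return units

source = '''import os

def greet(name):
    return f"hello {name}"

class Greeter:
    def run(self):
        return greet("world")
'''
units = split_top_level(source)
for kind, name, code in units:
    print(kind, name, len(code.splitlines()), "lines")
```

Because the cut points come from the parser, a chunk never starts or ends in the middle of a function body.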

---



# From langchain

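In **LangChain**, document splitters live in the `langchain_text_splitters` package (installed as `langchain-text-splitters`). The `langchain.text_splitter` path used in the snippets below is the older import location and may be unavailable in recent releases. Here are the main types you can use: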

---

### 🧱 1. **`CharacterTextSplitter`**

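* Splits on a single separator, then merges the pieces into chunks of up to `chunk_size` characters (simple and fast).
* Default separator: `"\n\n"`. A piece that contains no separator is kept whole, so a chunk can end up longer than `chunk_size`.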

```python
from langchain.text_splitter import CharacterTextSplitter

splitter = CharacterTextSplitter(separator="\n\n", chunk_size=1000, chunk_overlap=100)
docs = splitter.split_text(long_text)
```

---

### ✂️ 2. **`RecursiveCharacterTextSplitter`** *(most popular)*

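* Tries a list of separators in order (by default `"\n\n"`, `"\n"`, `" "`, then `""`), moving to a finer one only when a piece is still larger than `chunk_size`.
* Keeps paragraphs, then lines, then words together for as long as possible.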

```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
docs = splitter.split_text(long_text)
```
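
The recursion is easier to see in a stripped-down form. The sketch below assumes the separator order `"\n\n"`, `"\n"`, `" "`, `""` and ignores overlap; `recursive_split` is a hypothetical helper written for illustration, not LangChain's code.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Split text into pieces of at most chunk_size characters, preferring coarse separators."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: hard cut every chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    sep, rest = separators[0], separators[1:]
    pieces, current = [], ""
    for part in text.split(sep):
        candidate = part if not current else current + sep + part
        if len(candidate) <= chunk_size:
            current = candidate  # still fits: keep merging at this level
            continue
        if current:
            pieces.append(current)
        if len(part) <= chunk_size:
            current = part
        else:
            # This part alone is too big: retry it with the finer separators.
            pieces.extend(recursive_split(part, chunk_size, rest))
            current = ""
    if current:
        pieces.append(current)
    return pieces

text = "alpha beta\n\ngamma delta epsilon\n\nzeta"
chunks = recursive_split(text, chunk_size=12)
print(chunks)
```

The first paragraph fits and stays whole; only the oversized middle paragraph is broken down further, at spaces.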

---

### 📘 3. **`TokenTextSplitter`**

* Splits based on **tokens** (counted with `tiktoken` by default), not characters.
* Prevents overflow of LLM input limits, as long as the tokenizer matches the target model.

```python
from langchain.text_splitter import TokenTextSplitter

splitter = TokenTextSplitter(chunk_size=256, chunk_overlap=20)
docs = splitter.split_text(long_text)
```
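
How `chunk_size` and `chunk_overlap` interact is clearest as a sliding window. This is a sketch under assumptions: whitespace-separated words stand in for real tokens, and `window_chunks` is a hypothetical helper.

```python
def window_chunks(tokens, chunk_size, chunk_overlap):
    """Slide a window of chunk_size tokens, stepping by chunk_size - chunk_overlap."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the window already reached the end of the input
    return chunks

tokens = "a b c d e f g h i j".split()
chunks = window_chunks(tokens, chunk_size=4, chunk_overlap=1)
for chunk in chunks:
    print(chunk)
```

Each chunk repeats the last token of the previous one, so a sentence cut at a boundary still appears in full context in at least one chunk when the overlap is large enough.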

---

### 🪶 4. **`MarkdownHeaderTextSplitter`**

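* Splits Markdown docs by the header levels you list (`#`, `##`, etc.).
* Keeps hierarchical context: `split_text` returns `Document` objects whose metadata records the headers above each chunk.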

```python
from langchain.text_splitter import MarkdownHeaderTextSplitter

splitter = MarkdownHeaderTextSplitter(headers_to_split_on=[("#", "Header 1"), ("##", "Header 2")])
docs = splitter.split_text(markdown_text)
```

---

### 💻 5. **`PythonCodeTextSplitter`**

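* A `RecursiveCharacterTextSplitter` preconfigured with Python-aware separators, so it prefers to cut at **class** and **function** (`def`) lines before falling back to blank lines and spaces. It matches text patterns and does not parse the code.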

```python
from langchain.text_splitter import PythonCodeTextSplitter

splitter = PythonCodeTextSplitter(chunk_size=500, chunk_overlap=50)
docs = splitter.split_text(python_code)
```

---

### 🧠 6. **`NLTKTextSplitter` / `SpacyTextSplitter`**

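* Sentence-based splitting using NLP libraries. `nltk` or `spacy` must be installed separately, and spaCy also needs a language model such as `en_core_web_sm`.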

```python
from langchain.text_splitter import SpacyTextSplitter

splitter = SpacyTextSplitter(chunk_size=1000)
docs = splitter.split_text(long_text)
```

---

### 🔍 Summary

| Splitter                         | Best for      | Key Feature                      |
| -------------------------------- | ------------- | -------------------------------- |
| `CharacterTextSplitter`          | Simple text   | Fast & straightforward           |
| `RecursiveCharacterTextSplitter` | General use   | Tries coarse separators first    |
| `TokenTextSplitter`              | LLM inputs    | Token-limit control              |
| `MarkdownHeaderTextSplitter`     | Docs / README | Header metadata on each chunk    |
| `PythonCodeTextSplitter`         | Code files    | Prefers `class`/`def` boundaries |
| `SpacyTextSplitter`              | Natural text  | Sentence-aware                   |

---

