In [1]:
import google.generativeai as genai
import pathlib
import textwrap

from IPython.display import display
from IPython.display import Markdown


def to_markdown(text):
  text = text.replace('•', '  *')
  return Markdown(textwrap.indent(text, '> ', predicate=lambda _: True))




In [3]:
from langchain.text_splitter import RecursiveCharacterTextSplitter, CharacterTextSplitter

In [4]:
chunk_size = 26
chunk_overlap = 4

In [5]:
r_splitter = RecursiveCharacterTextSplitter(
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap
)
c_splitter = CharacterTextSplitter(
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap
)

In [6]:
text1 = 'abcdefghijklmnopqrstuvwxyz'

In [8]:
r_splitter.split_text(text1)

['abcdefghijklmnopqrstuvwxyz']

In [9]:
text2 = 'abcdefghijklmnopqrstuvwxyzabcdefg'

In [10]:
r_splitter.split_text(text2)

['abcdefghijklmnopqrstuvwxyz', 'wxyzabcdefg']

In [None]:
# The second chunk starts with 'wxyz', the last 4 characters of the first chunk, which matches chunk_overlap = 4.
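
To see where the overlap lands in the output above, here is a minimal sketch of a plain sliding window over characters. The helper name `sliding_chunks` is hypothetical and this is not LangChain's implementation; it only happens to reproduce the same chunks for a string with no separators in it.

```python
def sliding_chunks(text, chunk_size, chunk_overlap):
    """Fixed-width character windows; consecutive windows share chunk_overlap characters."""
    step = chunk_size - chunk_overlap  # assumes chunk_overlap < chunk_size
    chunks = []
    start = 0
    while True:
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks


text2 = "abcdefghijklmnopqrstuvwxyzabcdefg"
chunks = sliding_chunks(text2, chunk_size=26, chunk_overlap=4)
print(chunks)                          # ['abcdefghijklmnopqrstuvwxyz', 'wxyzabcdefg']
print(chunks[0][-4:] == chunks[1][:4])  # True: both are 'wxyz'
```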

In [11]:
text3 = "a b c d e f g h i j k l m n o p q r s t u v w x y z"

In [12]:
r_splitter.split_text(text3)

['a b c d e f g h i j k l m', 'l m n o p q r s t u v w x', 'w x y z']

In [13]:
c_splitter.split_text(text3)

['a b c d e f g h i j k l m n o p q r s t u v w x y z']

In [None]:
# CharacterTextSplitter splits only on its separator (default "\n\n"), so pass one that occurs in the text
c_splitter = CharacterTextSplitter(
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap,
    separator = ' '
)
c_splitter.split_text(text3)

['a b c d e f g h i j k l m', 'l m n o p q r s t u v w x', 'w x y z']



### ❓ Why `CharacterTextSplitter` without `separator` gives only **one chunk**:

When you do **not specify** the `separator`, `CharacterTextSplitter` falls back to its **default separator**, `"\n\n"` (a blank line):

```python
c_splitter = CharacterTextSplitter(
    chunk_size=26,
    chunk_overlap=4
    # no separator passed, so the default "\n\n" is used
)
```

Then you call:

```python
c_splitter.split_text("a b c d e f g h i j k l m n o p q r s t u v w x y z")
```

---

### 🔍 Here's what happens:

* The splitter breaks the text **only at its separator**, and the default separator is `"\n\n"`.
* The text contains no `"\n\n"`, so the split yields **a single piece**.
* `CharacterTextSplitter` never breaks a piece any further, so it returns the **entire text as one chunk**, even though it is longer than `chunk_size`.

---

### 📌 Result:

```python
['a b c d e f g h i j k l m n o p q r s t u v w x y z']
```

Just **one chunk**, because:

* The total string length = **51 characters**
* You asked for chunks of at most 26 characters with a 4-character overlap,
* But `chunk_size` only controls how pieces are **merged** after splitting. With no `"\n\n"` in the text there is just one 51-character piece, so it comes back as the **whole thing in one chunk**.

---

### ✅ Why adding `separator=' '` works:

When you write:

```python
separator=' '
```

You tell it:

> "You're allowed to break at **spaces**, and only at spaces."

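Now the text splits at every space into 26 one-letter pieces, and the splitter merges those pieces back into chunks of at most 26 characters. The overlap is built from whole pieces, so each chunk repeats the last two letters of the previous one (`l m`, `w x`: 3 characters, the most that fits within `chunk_overlap=4`).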

---

### ✅ Summary:

| Case                 | Output Chunks                                 | Reason                                                  |
| -------------------- | --------------------------------------------- | ------------------------------------------------------- |
| Without `separator`  | `['a b c d e f g ... z']` (1 chunk)           | Default separator `"\n\n"` does not occur in the text   |
| With `separator=' '` | `['a b c ... l m', 'l m ... w x', 'w x y z']` | Splits at spaces, then merges pieces up to `chunk_size` |

---
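
The split-then-merge behavior described above can be sketched in plain Python. This is a simplified illustration written for these notes, not LangChain's code: the function name `split_on_separator` is hypothetical, and it assumes a non-empty separator and uses `len` as the length function.

```python
def split_on_separator(text, separator, chunk_size, chunk_overlap):
    """Split on one fixed separator, then merge the pieces into overlapping chunks."""
    pieces = [p for p in text.split(separator) if p]
    sep_len = len(separator)
    chunks, current, total = [], [], 0
    for piece in pieces:
        extra = sep_len if current else 0
        if current and total + len(piece) + extra > chunk_size:
            # The next piece does not fit: emit what we have so far.
            chunks.append(separator.join(current))
            # Drop pieces from the front until what remains fits in the overlap
            # (and leaves room for the next piece).
            while total > chunk_overlap or (
                total > 0
                and total + len(piece) + (sep_len if current else 0) > chunk_size
            ):
                total -= len(current[0]) + (sep_len if len(current) > 1 else 0)
                current = current[1:]
        current.append(piece)
        total += len(piece) + (sep_len if len(current) > 1 else 0)
    if current:
        chunks.append(separator.join(current))
    return chunks


text3 = "a b c d e f g h i j k l m n o p q r s t u v w x y z"

# The default separator never occurs, so there is one 51-character piece.
print(split_on_separator(text3, "\n\n", chunk_size=26, chunk_overlap=4))

# Splitting on spaces gives 26 small pieces that merge into three chunks.
print(split_on_separator(text3, " ", chunk_size=26, chunk_overlap=4))
# ['a b c d e f g h i j k l m', 'l m n o p q r s t u v w x', 'w x y z']
```

Note that a single oversized piece is never cut: the merge step can only combine pieces, not break them.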




**`RecursiveCharacterTextSplitter`** behaves **differently** from `CharacterTextSplitter`: it keeps chunks within `chunk_size` **even when you pass no separator**.

The sections below explain why, and why its output for the same text is different and correct.

---

### 🔁 What is `RecursiveCharacterTextSplitter`?

The `RecursiveCharacterTextSplitter` is **smarter** than `CharacterTextSplitter`. It **tries multiple separators recursively** to break text **gracefully**.

---

### 🔍 How it works:

Internally, `RecursiveCharacterTextSplitter` uses a **list of fallback separators**. By default this is:

```python
["\n\n", "\n", " ", ""]
```

It tries to split text at:

1. **Paragraphs (`\n\n`)**
2. **Lines (`\n`)**
3. **Spaces (` `)**
4. **Individual characters (`""`)**, the fallback when nothing else works

---

### ✅ That’s why your case works:

```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

text3 = "a b c d e f g h i j k l m n o p q r s t u v w x y z"

r_splitter = RecursiveCharacterTextSplitter(
    chunk_size=26,
    chunk_overlap=4
)

r_splitter.split_text(text3)
```

### 🔄 Output:

```python
['a b c d e f g h i j k l m', 
 'l m n o p q r s t u v w x', 
 'w x y z']
```

---

### 🧠 What it did:

* It first tried to split using `\n\n`, but didn’t find any.
* Then tried `\n` – nothing there.
* Then tried `' '` (space) – this worked!
* So it split the text at **spaces** into small pieces.
* Then it merged those pieces into chunks using your `chunk_size = 26` and `chunk_overlap = 4`.
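
The separator search in the steps above can be sketched as follows. The helper `pick_separator` is a hypothetical name used only for illustration; it assumes plain substring matching and covers only the first pass, whereas the real splitter, as far as I understand it, goes on to re-split any piece that is still too long using the remaining separators.

```python
DEFAULT_SEPARATORS = ["\n\n", "\n", " ", ""]


def pick_separator(text, separators=DEFAULT_SEPARATORS):
    """Return the first separator in priority order that occurs in the text."""
    for sep in separators:
        # "" means "split between every character", which always applies.
        if sep == "" or sep in text:
            return sep
    return separators[-1]  # nothing matched: fall back to the last entry


text3 = "a b c d e f g h i j k l m n o p q r s t u v w x y z"
print(repr(pick_separator(text3)))                         # ' '
print(repr(pick_separator("abcdefghijklmnopqrstuvwxyz")))  # ''
print(repr(pick_separator("para one\n\npara two")))        # '\n\n'
```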

---

### 📌 Summary Table:

| Splitter                         | Separator argument                                    | Auto Tries Fallbacks? | Works Well for Text?                                  |
| -------------------------------- | ----------------------------------------------------- | --------------------- | ----------------------------------------------------- |
| `CharacterTextSplitter`          | Single `separator`, default `"\n\n"`                  | ❌ No                  | Only if the separator occurs often enough in the text |
| `RecursiveCharacterTextSplitter` | List `separators`, default `["\n\n", "\n", " ", ""]`  | ✅ Yes                 | ✅ Works well without a manual separator               |

---

### ✅ Final Takeaway:

Use **`RecursiveCharacterTextSplitter`** when:

* You want intelligent, flexible splitting.
* You don’t want to manually define a separator.
* Your text varies in format (e.g., spaces, lines, paragraphs).






### ✅ In `RecursiveCharacterTextSplitter`, you **can** define `separators`, **but you don't have to**.

---

### 🔍 How the default works:

You **don't need** to pass `separators` for it to work, because the class already has a **default list** of separators built in.

So:

* If **you don't pass `separators`**, it uses its **default list**:

  ```python
  ["\n\n", "\n", " ", ""]
  ```
* If **you pass your own `separators`**, it uses **your custom list instead** of the default.

---

### 📌 So both are valid:

#### Option 1: No separators (uses default list)

```python
r_splitter = RecursiveCharacterTextSplitter(
    chunk_size=450,
    chunk_overlap=0
)
```

It will automatically try:

```python
["\n\n", "\n", " ", ""]
```

---

#### Option 2: Custom separators

```python
r_splitter = RecursiveCharacterTextSplitter(
    chunk_size=450,
    chunk_overlap=0,
    separators=["###", "**", "|", " "]
)
```

Now it will try:

```python
["###", "**", "|", " "]  # your custom list
```

---

### 🧠 Why use custom separators?

You use `separators=[...]` when your text has **special structures**, such as:

* Markdown content (`###`, `**`)
* Logs (`|` separator)
* Bullet points (`-`, `*`)
* Any domain-specific format

---

### ✅ Conclusion:

| Case                        | Result                                            |
| --------------------------- | ------------------------------------------------- |
| No `separators` passed      | Uses **default** list (`["\n\n", "\n", " ", ""]`) |
| `separators=[...]` provided | Uses **your** custom list                         |

So yes — you **can** pass it, but you **don't have to**.




In [18]:
some_text = """When writing documents, writers will use document structure to group content. \
This can convey to the reader, which idea's are related. For example, closely related ideas \
are in sentances. Similar ideas are in paragraphs. Paragraphs form a document. \n\n  \
Paragraphs are often delimited with a carriage return or two carriage returns. \
Carriage returns are the "backslash n" you see embedded in this string. \
Sentences have a period at the end, but also, have a space.\
and words are separated by space."""

In [None]:
len(some_text)

In [26]:
c_splitter = CharacterTextSplitter(
    chunk_size=450,
    chunk_overlap=0,
    separator = ' '
)
r_splitter = RecursiveCharacterTextSplitter(
    chunk_size=450,
    chunk_overlap=0, 
    separators=["\n\n", "\n", " ", ""]
)

In [27]:
c_splitter.split_text(some_text)

['When writing documents, writers will use document structure to group content. This can convey to the reader, which idea\'s are related. For example, closely related ideas are in sentances. Similar ideas are in paragraphs. Paragraphs form a document. \n\n Paragraphs are often delimited with a carriage return or two carriage returns. Carriage returns are the "backslash n" you see embedded in this string. Sentences have a period at the end, but also,',
 'have a space.and words are separated by space.']

In [28]:
r_splitter.split_text(some_text)

["When writing documents, writers will use document structure to group content. This can convey to the reader, which idea's are related. For example, closely related ideas are in sentances. Similar ideas are in paragraphs. Paragraphs form a document.",
 'Paragraphs are often delimited with a carriage return or two carriage returns. Carriage returns are the "backslash n" you see embedded in this string. Sentences have a period at the end, but also, have a space.and words are separated by space.']

In [29]:
from langchain.document_loaders import PyPDFLoader
loader = PyPDFLoader("./MachineLearning-Lecture01.pdf")
pages = loader.load()

In [30]:
from langchain.text_splitter import CharacterTextSplitter
text_splitter = CharacterTextSplitter(
    separator="\n",
    chunk_size=1000,
    chunk_overlap=150,
    length_function=len
)

In [31]:
docs = text_splitter.split_documents(pages)

In [32]:
len(docs)

78

In [33]:
len(pages)

22

In [34]:
# Token splitting
# We can also split on token count explicitly, if we want.
# This can be useful because LLM context windows are usually measured in tokens.
# A token is often around 4 characters of English text.
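
As a rough illustration of the "about 4 characters per token" rule of thumb, here is a minimal estimator. The name `estimate_tokens` and its `chars_per_token` parameter are hypothetical; this is only a back-of-the-envelope heuristic, and an actual tokenizer can return a different count for the same text.

```python
import math


def estimate_tokens(text, chars_per_token=4):
    """Rough token estimate: character count divided by an assumed average token width."""
    return math.ceil(len(text) / chars_per_token)


sample = "foo bar bazzyfoo"
print(len(sample))              # 16 characters
print(estimate_tokens(sample))  # 4 (an estimate, not a tokenizer result)
```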

In [40]:
from langchain.text_splitter import TokenTextSplitter
text_splitter = TokenTextSplitter(chunk_size=1, chunk_overlap=0)
text1 = "foo bar bazzyfoo"
text_splitter.split_text(text1)

['foo', ' bar', ' b', 'az', 'zy', 'foo']

In [41]:
import tiktoken

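# cl100k_base is the encoding used by GPT-4 and GPT-3.5-turbo.
# TokenTextSplitter defaults to the "gpt2" encoding, which is why the split above differs slightly.
enc = tiktoken.get_encoding("cl100k_base")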
tokens = enc.encode("foo bar bazzyfoo")
print(tokens)
print([enc.decode([t]) for t in tokens])


[8134, 3703, 51347, 4341, 8134]
['foo', ' bar', ' baz', 'zy', 'foo']


In [42]:
text_splitter = TokenTextSplitter(chunk_size=10, chunk_overlap=0)

In [43]:
docs = text_splitter.split_documents(pages)

In [44]:
docs[0]

Document(metadata={'producer': 'Acrobat Distiller 8.1.0 (Windows)', 'creator': 'PScript5.dll Version 5.2.2', 'creationdate': '2008-07-11T11:25:23-07:00', 'author': '', 'moddate': '2008-07-11T11:25:23-07:00', 'title': '', 'source': './MachineLearning-Lecture01.pdf', 'total_pages': 22, 'page': 0, 'page_label': '1'}, page_content='MachineLearning-Lecture01  \n')

In [45]:
pages[0].metadata

{'producer': 'Acrobat Distiller 8.1.0 (Windows)',
 'creator': 'PScript5.dll Version 5.2.2',
 'creationdate': '2008-07-11T11:25:23-07:00',
 'author': '',
 'moddate': '2008-07-11T11:25:23-07:00',
 'title': '',
 'source': './MachineLearning-Lecture01.pdf',
 'total_pages': 22,
 'page': 0,
 'page_label': '1'}

In [46]:
docs[0].metadata

{'producer': 'Acrobat Distiller 8.1.0 (Windows)',
 'creator': 'PScript5.dll Version 5.2.2',
 'creationdate': '2008-07-11T11:25:23-07:00',
 'author': '',
 'moddate': '2008-07-11T11:25:23-07:00',
 'title': '',
 'source': './MachineLearning-Lecture01.pdf',
 'total_pages': 22,
 'page': 0,
 'page_label': '1'}

## Context aware splitting

Chunking aims to keep text with common context together.

Text splitting often uses sentences or other delimiters to keep related text together, but many documents (such as Markdown) have structure (headers) that can be used explicitly when splitting.

We can use `MarkdownHeaderTextSplitter` to preserve header metadata in our chunks, as shown below.