# Overview

The goal of this notebook is to gain an intuitive feel for a few of LangChain's text splitters.
Specifically:
- Character Splitters
    - `CharacterTextSplitter(TextSplitter)` [RTD](https://python.langchain.com/docs/modules/data_connection/document_transformers/text_splitters/character_text_splitter)  
 This is the simplest method. It splits on a single separator (by default "\n\n") and measures chunk length by number of characters.
    - `RecursiveCharacterTextSplitter(TextSplitter)` [RTD](https://python.langchain.com/docs/modules/data_connection/document_transformers/text_splitters/recursive_text_splitter)
It starts by splitting at double newlines, then single newlines, then spaces, and finally, if necessary, character by character. This hierarchical approach ensures manageable chunks while retaining as much original structure as possible.
- Sentence Splitters
    - `NLTKTextSplitter(TextSplitter)` - uses NLTK's `sent_tokenize` to split text into sentences.
    - `SpacyTextSplitter(TextSplitter)` - uses spaCy to split text into sentences.  Before you use it, after `pip install spacy`, you need to run `python -m spacy download en_core_web_sm`.
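
The separator fallback described for `RecursiveCharacterTextSplitter` can be sketched in plain Python. This is a simplified illustration, not LangChain's implementation: the function name `recursive_split` is hypothetical, and unlike the real splitter it does not merge small pieces back together or apply any overlap.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Simplified sketch of separator fallback (hypothetical helper, not LangChain code)."""
    if not separators or len(text) <= chunk_size:
        return [text] if text else []
    # Use the first separator that occurs in the text; "" means character by character.
    for i, sep in enumerate(separators):
        if sep == "" or sep in text:
            break
    pieces = list(text) if sep == "" else text.split(sep)
    remaining = separators[i + 1:]
    chunks = []
    for piece in pieces:
        if len(piece) <= chunk_size or not remaining:
            # Small enough, or no finer separator left to try.
            if piece:
                chunks.append(piece)
        else:
            # Still too long: retry this piece with the finer separators.
            chunks.extend(recursive_split(piece, chunk_size, remaining))
    return chunks


sample = "alpha beta\n\ngamma delta epsilon"
print(recursive_split(sample, chunk_size=12))
```

The first paragraph fits within 12 characters and is kept whole, while the second is too long and falls through to the space separator.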

All of these classes inherit from `TextSplitter()` and have default settings for chunk and overlap size [src](https://github.com/hwchase17/langchain/blob/dd648183fae95f5f251926e3a188d9ef9e6faeed/langchain/text_splitter.py#L38):
```
class TextSplitter(BaseDocumentTransformer, ABC):
    """Interface for splitting text into chunks."""

    def __init__(
        self,
        chunk_size: int = 4000,
        chunk_overlap: int = 200,
```


In [30]:
%run useful_functions.ipynb

# Character Splitters
We'll start by exploring `CharacterTextSplitter` and `RecursiveCharacterTextSplitter` examples.

The table below summarizes the six splitter configurations used in our exploration: three for `CharacterTextSplitter` and three for `RecursiveCharacterTextSplitter`. Each row gives the text splitter's name, its abbreviated label, the chunk size (maximum characters per chunk), and the overlap size (characters shared between consecutive chunks). Chunk sizes range from 100 to 4000 characters, and overlap sizes are either 20 or 200 characters. The (perhaps naive) thought is that these samples are enough to gain an intuitive feel for how the text splitters work and, ultimately, for the quality of the queries when each one is used.

In [None]:
from IPython.display import display, Markdown
def display_splitters_table():
    # Define the data for the table
    data = [
        {'Text Splitter Name': 'CharacterTextSplitter', 'Label': 'C', 'Chunk Size': 100, 'Overlap Size': 20},
        {'Text Splitter Name': 'CharacterTextSplitter', 'Label': 'C', 'Chunk Size': 1000, 'Overlap Size': 200},
        {'Text Splitter Name': 'CharacterTextSplitter', 'Label': 'C', 'Chunk Size': 4000, 'Overlap Size': 200},
        {'Text Splitter Name': 'RecursiveCharacterTextSplitter', 'Label': 'R', 'Chunk Size': 100, 'Overlap Size': 20},
        {'Text Splitter Name': 'RecursiveCharacterTextSplitter', 'Label': 'R', 'Chunk Size': 1000, 'Overlap Size': 200},
        {'Text Splitter Name': 'RecursiveCharacterTextSplitter', 'Label': 'R', 'Chunk Size': 4000, 'Overlap Size': 200}
    ]

    # Create the markdown table
    table = "| Text Splitter Name | Label | Chunk Size | Overlap Size |\n| --- | --- | --- | --- |\n"
    for row in data:
        table += f"| {row['Text Splitter Name']} | {row['Label']} | {row['Chunk Size']} | {row['Overlap Size']} |\n"

    # Display the table
    display(Markdown(table))


In [None]:
display_splitters_table()

## Load Our Test Text
We'll be using a transcript from a podcast that we have on hand.  The transcript has been pickled.  Let's load the text in.

In [None]:
from IPython.display import display, Markdown
import pickle
with open('transcript.pkl', 'rb') as f:
    transcript = pickle.load(f)
info_str = f"Start of the transcript:\n\n{transcript.page_content[:200]}...\n\nLength: {len(transcript.page_content)} characters.\n\nMetadata: {transcript.metadata}"
# print(info_str)
# display(Markdown(info_str.replace("#", "\#")))
display(Markdown(info_str))


## Split the text
Next up, let's check out the code that splits the transcript text by character.  We'll create six chunked transcripts, one for each row of chunk and overlap sizes in the previous table.  We start by creating the `split_char()` function.


### The split_char() function
The `split_char()` function returns a `RecursiveCharacterTextSplitter` when the `recursive` argument is `True` and a `CharacterTextSplitter` when it is `False` (the default).  The intent of the function is to make it easy to create a few splitters with different chunk and overlap sizes.

_Note: The separator used for `CharacterTextSplitter()` is a space.  This is because the text we'll be using does not contain either "\n" or "\n\n"._
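
To build some intuition for what splitting on a space with a chunk size and an overlap means, here is a rough pure-Python sketch. It is an illustration under simplifying assumptions, not LangChain's code: the helper name `merge_with_overlap` is hypothetical, and the real splitter handles more edge cases.

```python
def merge_with_overlap(text, separator=" ", chunk_size=100, chunk_overlap=20):
    """Greedy sketch: pack separator-delimited pieces into chunks that share a tail."""
    pieces = text.split(separator)
    chunks, current = [], []
    for piece in pieces:
        candidate = separator.join(current + [piece])
        if current and len(candidate) > chunk_size:
            # The current chunk is full: emit it.
            chunks.append(separator.join(current))
            # Keep only a tail that fits the overlap budget and leaves room for `piece`.
            while current and (
                len(separator.join(current)) > chunk_overlap
                or len(separator.join(current + [piece])) > chunk_size
            ):
                current.pop(0)
        current.append(piece)
    if current:
        chunks.append(separator.join(current))
    return chunks


print(merge_with_overlap("aa bb cc dd ee ff", chunk_size=8, chunk_overlap=3))
```

In this toy run each chunk starts with the last word of the previous one, which is the overlap at work.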

In [None]:
!pip install langchain

In [None]:
from langchain.text_splitter import CharacterTextSplitter, RecursiveCharacterTextSplitter

def split_char(chunk_size=2000, chunk_overlap=200, recursive=False):
    # Return a character-based splitter; `recursive` selects which class is used.
    if not recursive:
        return CharacterTextSplitter(
            # separator="\n\n", Cannot use this. This text pattern is not in the transcripts.
            separator=" ",
            chunk_size=chunk_size,
            chunk_overlap=chunk_overlap,
            length_function=len,
        )
    else:
        return RecursiveCharacterTextSplitter(
            chunk_size=chunk_size,
            chunk_overlap=chunk_overlap,
            separators=["\n\n", "\n", " ", ""],
        )

## Split the Text into Chunks
Now that we have `split_char`, let's chunk that text!  We'll split the text to match the chunking parameters shown earlier in the table.

### Initialize a List to hold the chunks
After chunking, we'll be able to get a better feel for how the character text splitters work.  In the future, we'll look at the quality of their QA.

#### Store Chunked Transcripts in chunks_list[]
Each entry in `chunks_list` is a chunked transcript represented as a dictionary. 
 ```
 {
    'type'
    'chunk_size'
    'overlap_size'
    'chunks'
 }
 
 ```

Each dictionary includes the splitting method 'type' (e.g.: 'CharacterTextSplitter'), the 'chunk_size' used for splitting, the 'overlap_size' between consecutive chunks, and the resulting 'chunks' themselves. 
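
As a concrete illustration of that structure, the sketch below spells out one entry with type hints. The `ChunkedTranscript` name, the `summarize()` helper, and the sample chunk strings are all hypothetical and only serve to show the shape of the data.

```python
from typing import List, TypedDict


class ChunkedTranscript(TypedDict):
    """Shape of one chunks_list entry (hypothetical type name)."""
    type: str            # splitter class name, e.g. 'CharacterTextSplitter'
    chunk_size: int      # maximum characters per chunk
    overlap_size: int    # characters shared between consecutive chunks
    chunks: List[str]    # the chunk texts themselves


# Made-up chunk texts, for illustration only.
example_entry: ChunkedTranscript = {
    "type": "CharacterTextSplitter",
    "chunk_size": 100,
    "overlap_size": 20,
    "chunks": ["first made-up chunk", "second made-up chunk"],
}


def summarize(entry: ChunkedTranscript) -> str:
    """One-line description of a chunked transcript entry."""
    return (
        f"{entry['type']} (size={entry['chunk_size']}, "
        f"overlap={entry['overlap_size']}): {len(entry['chunks'])} chunks"
    )


print(summarize(example_entry))
```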

##### Store 3 CharacterTextSplitter entries in chunks_list

First we create a function that makes it easy to build several chunked transcripts with either `CharacterTextSplitter` or `RecursiveCharacterTextSplitter`.


In [None]:

from langchain.schema.document import Document

def make_chunks(doc=None, chunk_sizes=[100, 1000, 4000], overlaps=[20, 200, 200], recursive=False):
    # chunk_sizes and overlaps are paired element by element, so their lengths must match.

    assert len(chunk_sizes) == len(overlaps), "The length of the list of chunks sizes must match the length of the list of overlaps"
    assert doc is not None, "The 'doc' parameter cannot be None. Please provide a valid Langchain document."
    assert isinstance(doc, Document), "The 'doc' parameter must be an instance of a Langchain Document."
    chunks_list = []
    for size, overlap in zip(chunk_sizes, overlaps):
        c_splitter = split_char(size, overlap, recursive=recursive)
        # Split the document that was passed in, not the global `transcript`.
        chunks = c_splitter.split_text(doc.page_content)
        type_str = "CharacterTextSplitter" if not recursive else "RecursiveCharacterTextSplitter"
        print(f"type: {type_str}, chunk size {size}, overlap {overlap} Num chunks: {len(chunks)}")
        chunks_list.append({
            'type': type_str,
            'chunk_size': size,
            'overlap_size': overlap,
            'chunks': chunks
        })
    return chunks_list

Create three `CharacterTextSplitter` chunked transcript dictionaries that have the chunk and overlap size as defined in the earlier table.  Each chunked transcript dictionary becomes an entry in the `chunks_list`.

In [None]:
chunks_list = make_chunks(transcript)
len(chunks_list)

### Do the same for the RecursiveCharacterTextSplitter
The only difference is to set the recursive parameter to `True`.

In [None]:
chunks_list.extend(make_chunks(transcript, recursive=True))
len(chunks_list)

We now have six samples of chunked transcripts. Each sample is distinguished by its type of splitter, chunk size, overlap size, and the corresponding chunks of text. Interestingly, we find that both the RecursiveCharacterTextSplitter and the CharacterTextSplitter have generated identical chunks. This is due to the structure of the input text, which lacks the "\n\n" or "\n" markers typically present in web pages or Markdown files. These markers often indicate the start of new sections or paragraphs, and their absence in our text leads both splitters to behave similarly.
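
Matching chunk counts alone do not prove the chunks are identical, so a direct comparison is a useful sanity check. The sketch below assumes entries shaped like those in `chunks_list`; the helper name `compare_splitters` and the toy data are hypothetical.

```python
def compare_splitters(chunks_list,
                      type_a="CharacterTextSplitter",
                      type_b="RecursiveCharacterTextSplitter"):
    """Map (chunk_size, overlap_size) to True when both splitters produced the same chunks."""
    by_key = {
        (entry["type"], entry["chunk_size"], entry["overlap_size"]): entry["chunks"]
        for entry in chunks_list
    }
    results = {}
    for (splitter_type, size, overlap), chunks in by_key.items():
        if splitter_type != type_a:
            continue
        other = by_key.get((type_b, size, overlap))
        if other is not None:
            results[(size, overlap)] = chunks == other
    return results


# Toy data standing in for the real chunks_list.
toy_list = [
    {"type": "CharacterTextSplitter", "chunk_size": 100, "overlap_size": 20, "chunks": ["a b", "b c"]},
    {"type": "RecursiveCharacterTextSplitter", "chunk_size": 100, "overlap_size": 20, "chunks": ["a b", "b c"]},
    {"type": "CharacterTextSplitter", "chunk_size": 1000, "overlap_size": 200, "chunks": ["a b c"]},
    {"type": "RecursiveCharacterTextSplitter", "chunk_size": 1000, "overlap_size": 200, "chunks": ["a b", "c"]},
]
print(compare_splitters(toy_list))
```

Running `compare_splitters(chunks_list)` on the real list would show, per configuration, whether the two splitters agree chunk for chunk.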

In [29]:
display_transcript_chunks(chunks_list)

| Type | Chunk Size | Overlap | Num Chunks |
| --- | --- | --- | --- |
| CharacterTextSplitter | 100 | 20 | 751 |
| CharacterTextSplitter | 1000 | 200 | 75 |
| CharacterTextSplitter | 4000 | 200 | 16 |
| RecursiveCharacterTextSplitter | 100 | 20 | 751 |
| RecursiveCharacterTextSplitter | 1000 | 200 | 75 |
| RecursiveCharacterTextSplitter | 4000 | 200 | 16 |


The function below, `display_splitter_results()`, provides us with some text splitting visualization.  Let's run the function to load it in memory.  Then we will be able to display characteristics of the text splitters.

In [None]:
!pip install nltk
# Needed for running jupyter notebooks within another notebook.
!pip install nbformat

# Display Character Splitting Results

This table looks at text chunks from `CharacterTextSplitter` ('C') and `RecursiveCharacterTextSplitter` ('R'). Each four-row set represents two consecutive chunks, with varying chunk and overlap sizes.

- **Label**: Splitter type.
- **Chunk Size**: Maximum characters per chunk.
- **Overlap Size**: Shared characters between chunks.
- **Total Chunks**: Total chunks produced.
- **Chunk #**: Specific chunk number.
- **Position in Chunk**: 'Tail' (end) or 'Head' (start) of a chunk.
- **Text**: Snippet from the chunk's tail or next chunk's head.

The text allows us to evaluate how the overlap works between chunks.

For instance, the first two rows show overlap between chunks. The 'Tail' of the first chunk and the 'Head' of the second share the phrase "ey of KISS Organics.", demonstrating the overlap size's role in ensuring continuity and information preservation between chunks.
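
The shared text between a chunk's tail and the next chunk's head can also be found programmatically. Here is a minimal sketch; `shared_overlap` is a hypothetical helper that returns the longest suffix of one chunk that is also a prefix of the next.

```python
def shared_overlap(prev_chunk, next_chunk):
    """Longest suffix of prev_chunk that is also a prefix of next_chunk."""
    max_len = min(len(prev_chunk), len(next_chunk))
    # Try the longest candidate first and shrink until one matches.
    for size in range(max_len, 0, -1):
        if prev_chunk.endswith(next_chunk[:size]):
            return next_chunk[:size]
    return ""


print(repr(shared_overlap("the quick brown fox", "brown fox jumps")))
```

Applied to consecutive entries of a `chunks` list, the length of the returned string can be compared against the configured overlap size.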

In [None]:
display_splitter_results(chunks_list,['CharacterTextSplitter','RecursiveCharacterTextSplitter'])

As expected, `CharacterTextSplitter` and `RecursiveCharacterTextSplitter` yield identical chunks for documents transcribed from YouTube audio via `Whisper()`, since these transcripts lack the newline characters commonly found in web pages and Markdown files. We could pick another LangChain document whose text contains newline markers, but we'll leave that for another day and move on to sentence text splitters.

# Sentence Splitters
Sentence Splitters can often yield more meaningful chunks. They help maintain the semantic integrity of sentences, thus having a better chance of preserving the original text's intent after chunking. Feeding an LLM chunks of text split by a `CharacterTextSplitter` with a defined overlap size may risk breaking the continuity of sentences, leading to potential context loss. This is not a hard and fast rule, so there will be scenarios where `CharacterTextSplitter` will be a better choice.
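
The idea of chunking on sentence boundaries can be sketched without any NLP library. This is only an illustration: the regex rule below is a naive stand-in for what NLTK or spaCy do, and both helper names are hypothetical.

```python
import re


def naive_sentences(text):
    """Naive rule: a sentence ends at '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def pack_sentences(text, chunk_size=4000):
    """Pack whole sentences into chunks of at most chunk_size characters where possible."""
    chunks, current = [], ""
    for sentence in naive_sentences(text):
        candidate = f"{current} {sentence}" if current else sentence
        if current and len(candidate) > chunk_size:
            # Adding this sentence would overflow: close the chunk and start a new one.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


text = "One is short. Two is short too! Is three short? Yes."
print(pack_sentences(text, chunk_size=32))
```

No sentence is ever cut in half; a single sentence longer than `chunk_size` simply becomes its own oversized chunk in this sketch.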

Recall from the overview, the two sentence splitters we'll explore are:

- `NLTKTextSplitter(TextSplitter)` - uses NLTK's `sent_tokenize` to split text into sentences.
- `SpacyTextSplitter(TextSplitter)` - uses spaCy to split text into sentences.  Before you use it, after `pip install spacy`, you need to run `python -m spacy download en_core_web_sm`.

All of these classes inherit from `TextSplitter()` and have default settings for chunk and overlap size [src](https://github.com/hwchase17/langchain/blob/dd648183fae95f5f251926e3a188d9ef9e6faeed/langchain/text_splitter.py#L38):
```
class TextSplitter(BaseDocumentTransformer, ABC):
    """Interface for splitting text into chunks."""

    def __init__(
        self,
        chunk_size: int = 4000,
        chunk_overlap: int = 200,
```

Create a function to split text using either the `NLTKTextSplitter` or the `SpacyTextSplitter`.

In [None]:
!pip install spacy
!pip install nltk
!python -m spacy download en_core_web_sm

In [37]:
from langchain.text_splitter import NLTKTextSplitter, SpacyTextSplitter

nltk_splitter = NLTKTextSplitter(separator='  ')
spacy_splitter = SpacyTextSplitter(separator='  ')
chunks_list = [get_split_sentences(transcript, splitter_instance=nltk_splitter), get_split_sentences(transcript, splitter_instance=spacy_splitter)]


NLTKTextSplitter
SpacyTextSplitter


In [42]:
display_splitter_results(chunks_list,'all')

| Label | Chunk Size | Overlap Size | Total Chunks | Chunk # | Position in Chunk | Text |
| --- | --- | --- | --- | --- | --- | --- |
| N | 3894 | 0 | 16 | 1 | Tail | For our products, having the lab on site is awesome because people always ask, can you send me a certificate of analysis for your finished blends? I'm like, can do, because that's what everybody thinks that they need, which is super true. You definitely do. |
| N | 3894 | 0 | 16 | 2 | Head | I'm like, can do, because that's what everybody thinks that they need, which is super true. You definitely do. More important than that, we do a lot of QAQC at the step before that. |
| N | 3894 | 0 | 16 | 2 | Tail | How does tissue testing play into this, and why did you guys go that route? Well, as a company, well, the whole company started as this lab. The lab was able to give us connections to growers. |
| N | 3894 | 0 | 16 | 3 | Head | How does tissue testing play into this, and why did you guys go that route? Well, as a company, well, the whole company started as this lab. The lab was able to give us connections to growers. |
| S | 3701 | 0 | 17 | 1 | Tail | For our products, having the lab on site is awesome because people always ask, can you send me a certificate of analysis for your finished blends? I'm like, can do, because that's what everybody thinks that they need, which is super true. You definitely do. |
| S | 3701 | 0 | 17 | 2 | Head | I'm like, can do, because that's what everybody thinks that they need, which is super true. You definitely do. More important than that, we do a lot of QAQC at the step before that. |
| S | 3701 | 0 | 17 | 2 | Tail | How does tissue testing play into this, and why did you guys go that route? Well, as a company, well, the whole company started as this lab. The lab was able to give us connections to growers. |
| S | 3701 | 0 | 17 | 3 | Head | How does tissue testing play into this, and why did you guys go that route? Well, as a company, well, the whole company started as this lab. The lab was able to give us connections to growers. |


What we can tell from the table:
- The average chunk size for `NLTKTextSplitter` (3894 characters) was slightly larger than for `SpacyTextSplitter` (3701 characters).
- Where `CharacterTextSplitter` and `RecursiveCharacterTextSplitter` split on characters, `NLTKTextSplitter` and `SpacyTextSplitter` split on sentences.
- Comparing the tail of chunk 1 with the head of chunk 2, the last two sentences of chunk 1 reappear at the start of chunk 2, so it appears there is a chunk overlap of at least two sentences.
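
How the averages in the table were computed is defined in `useful_functions.ipynb`, which is not shown here. The sketch below assumes a plain mean of character counts; `chunk_length_stats` is a hypothetical helper.

```python
def chunk_length_stats(chunks):
    """Return (min, mean, max) character lengths for a list of chunk strings."""
    if not chunks:
        return (0, 0.0, 0)
    lengths = [len(chunk) for chunk in chunks]
    return (min(lengths), sum(lengths) / len(lengths), max(lengths))


print(chunk_length_stats(["abcd", "ab", "abc"]))
```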

# Next Step
We could spend a lot more time on text splitters.  However, we've absorbed enough for this round and will come back another time to explore further.  At that point, we should know more about the characteristics of the Langchain classes.  On to the embedding/vector store class.

Since the `CharacterTextSplitter` and `RecursiveCharacterTextSplitter` results were pretty much identical and the `NLTKTextSplitter` results were similar to those of the `SpacyTextSplitter`, moving forward we'll create vector databases for:
- the three `CharacterTextSplitter` configurations.
- the `NLTKTextSplitter`.

We will persist vector stores for all 4 chunked transcripts and then use QAEvaluation.  The fun doesn't seem to want to stop!

To finish off this step we will persist the 4 in a chunks_list pickle file.

In [50]:
%run useful_functions.ipynb

In [56]:
# Put the transcripts chunked with the CharacterTextSplitter into the chunks_list.
chunks_list = make_chunks(transcript)
print(f" There are {len(chunks_list)} chunked transcripts.")
# Add the transcript chunked with the NLTKTextSplitter to the list.
nltk_splitter = NLTKTextSplitter(separator='  ')
nltk_chunk_transcript = get_split_sentences(transcript, splitter_instance=nltk_splitter)
chunks_list.append(nltk_chunk_transcript)
print(f" There are {len(chunks_list)} chunked transcripts.")

type: CharacterTextSplitter, chunk size 100, overlap 20 Num chunks: 751
type: CharacterTextSplitter, chunk size 1000, overlap 200 Num chunks: 75
type: CharacterTextSplitter, chunk size 4000, overlap 200 Num chunks: 16
 There are 3 chunked transcripts.
NLTKTextSplitter
 There are 4 chunked transcripts.


Pickle the `chunks_list` so we can use it to create embeddings/vector stores.

In [55]:
import pickle

with open('chunks_list.pkl', 'wb') as f:
    pickle.dump(chunks_list, f)
