# Cutting Documents

The Lexos `cutter` module is used to split documents into smaller, more manageable pieces, variously called "segments" or "chunks." This is particularly useful for processing large texts, enabling more efficient analysis and manipulation. If your documents are raw text files or strings, you can use the `TextCutter` class. If your documents are tokenized spaCy `Doc` objects, you can use the `TokenCutter` class. Documents can be cut based on byte size, number of tokens, number of sentences, line breaks, or custom-defined spans called milestones. The different cutting methods are described below.

## Cutting Text Strings with `TextCutter`

Let's say that you have a long text string that you want to break into smaller chunks every *n* characters. The cell below demonstrates a simple way to do this with the `TextCutter` class. To begin, we'll use a very short sample text for demonstration purposes, but you can replace the `text1` variable with any long string of your choice. You can also change the `chunksize` parameter to specify how many characters you want in each chunk.

In [None]:
# Import the TextCutter class
from lexos.cutter.text_cutter import TextCutter

# Create a sample text
text1 = "It is a truth universally acknowledged, that a single  man in possession of a good fortune must be in want of a wife."

# Initialize TextCutter
cutter = TextCutter()

# Split text into chunks of 20 characters each
cutter.split(text1, chunksize=20)

# Print the resulting chunks
for chunk in cutter.chunks[0]:
    print(f"- {chunk}")

The first parameter (for which you can use the keyword `docs`) is the text string to be cut. You can also supply a list of text strings (e.g., multiple documents) to the `docs` parameter, and each document will be cut separately.

Once the text has been split, its chunks are stored in a list of lists (one list per document), which can be accessed with `cutter.chunks`. The `split()` method also returns this list, so you can assign the result to a variable, as shown below.

In [None]:
# Initialize TextCutter
cutter = TextCutter()

# Split text into chunks of 20 characters each and store the result in a variable
chunks = cutter.split(docs=text1, chunksize=20)

for chunk in chunks[0]:
    print(f"- {chunk}")

Notice that we need to reference the first item in the `chunks` list because it is the first (and in this case only) document. Here is an example with multiple documents.

We can supply an optional `names` parameter to name each document for easier identification later. This allows us to view the chunks for each document separately in a dictionary using the `to_dict()` method. To try it out, uncomment the second `split()` method in the cell below.

In [None]:
# Define another text
text2 = "However little known the feelings or views of such a man may be on his first entering a neighbourhood, this truth is so well fixed in the minds of the surrounding families, that he is considered the rightful property of some one or other of their daughters."

# Combine texts into a list
texts = [text1, text2]

# Initialize TextCutter
cutter = TextCutter()

# Split texts into chunks of 50 characters each and store the result in a variable
chunks = cutter.split(docs=texts, chunksize=50)

# Use custom document names
# chunks = cutter.split(docs=texts, names=["text1", "text2"], chunksize=50)

# Print chunks for each document
for i, doc_chunks in enumerate(chunks):
    print(f"Document {i + 1} chunks:")
    for chunk in doc_chunks:
        print(f"- {chunk}")
    print()

# Display chunks as a dictionary
print("Chunks as dictionary:")
print(cutter.to_dict())


You can also use the `len()` function to check how many documents are in your cutter instance.

In [None]:
# Print the number of docs in the cutter instance
print(f"Number of docs in cutter instance: {len(cutter)}")

If your texts are stored as files, you can easily read them into strings using Python's built-in file handling methods before passing them to the `TextCutter`. However, you can also pass the file paths directly to the `TextCutter` using the `docs` parameter and set `file=True`. We'll demonstrate this below using the full text of Jane Austen's *Pride and Prejudice*. In this example, we use only one file, but you can pass a list of file paths to the `docs` parameter to cut multiple files at once.

Note that the `names` will be auto-generated from your filenames unless you provide a list of custom names using the `names` parameter.

In [None]:
# Initialize TextCutter
cutter = TextCutter()

# Path to text file
path = "Austen_Pride.txt"

# Split text into chunks of 500 characters each and store the result in a variable
chunks = cutter.split(docs=path, chunksize=500, file=True)

# Print the names of the documents
print(f"Document names: {cutter.names}\n")

# Get the second chunk
chunk2 = chunks[0][1]

# Print the first 100 characters of the second chunk
print(chunk2[0:100])
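
If you prefer to read the files into strings yourself, the sketch below shows one way to do it with Python's standard library. The helper `read_texts()` is hypothetical (it is not part of Lexos), and using the filename stem as the document name is an assumption about what a sensible name looks like, not a statement of how Lexos generates names.

```python
from pathlib import Path
import tempfile


def read_texts(paths):
    """Read each file into a string and derive a document name from its filename."""
    texts, names = [], []
    for p in map(Path, paths):
        texts.append(p.read_text(encoding="utf-8"))
        names.append(p.stem)  # e.g. "Austen_Pride.txt" -> "Austen_Pride"
    return texts, names


# Demonstration with two throwaway files in a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    first = Path(tmp) / "sample_a.txt"
    second = Path(tmp) / "sample_b.txt"
    first.write_text("It is a truth universally acknowledged,", encoding="utf-8")
    second.write_text("that a single man in possession of a good fortune", encoding="utf-8")
    demo_texts, demo_names = read_texts([first, second])

print(demo_names)  # ['sample_a', 'sample_b']
```

The two lists can then be passed to `split()` through the `docs` and `names` parameters in the same way as the in-memory strings in the earlier cells.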

Cutting texts can often leave small dangling pieces at the end. To address this, you can use the `merge_threshold` and `merge_final` parameters to control whether the last two chunks should be merged based on their size.

- `merge_threshold`: The threshold for merging the last two chunks. The default is 0.5 (50%).
- `merge_final`: Whether to merge the last two chunks. The default is `False`.

**Always inspect your chunks to see if the merging behaviour meets your expectations.**

In the cell below, try changing the `chunksize`, `merge_threshold`, and `merge_final` parameters to see how they affect the resulting chunks.

In [None]:
# Initialize TextCutter
cutter = TextCutter()

# Split text into chunks and specify merge behaviour
chunks = cutter.split(text1, chunksize=25, merge_final=False, merge_threshold=0.5)

# Print chunks for each document
for i, doc_chunks in enumerate(chunks):
    for chunk in doc_chunks:
        print(f"- {chunk} ({len(chunk)} characters)")
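
To build intuition for what the merge settings do, here is a minimal pure-Python sketch. It assumes one plausible rule: the final chunk is folded into the previous one when `merge_final` is true, or when it is shorter than `merge_threshold * chunksize`. This rule and the function `chunk_with_merge()` are illustrative assumptions, not the Lexos implementation, so the real cutter may behave differently at the boundaries.

```python
def chunk_with_merge(text, chunksize, merge_threshold=0.5, merge_final=False):
    """Cut text every `chunksize` characters, then optionally merge a short final chunk.

    Assumed rule (not taken from the Lexos source): the final chunk is merged
    into its predecessor when merge_final is True, or when it is shorter than
    merge_threshold * chunksize.
    """
    chunks = [text[i : i + chunksize] for i in range(0, len(text), chunksize)]
    if len(chunks) > 1:
        too_short = len(chunks[-1]) < merge_threshold * chunksize
        if merge_final or too_short:
            last = chunks.pop()
            chunks[-1] += last
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"  # 26 characters

# A threshold of 0.0 never merges: the 6-character remainder stays separate
print([len(c) for c in chunk_with_merge(sample, 10, merge_threshold=0.0)])  # [10, 10, 6]

# A threshold of 0.7 merges it, because 6 is less than 0.7 * 10
print([len(c) for c in chunk_with_merge(sample, 10, merge_threshold=0.7)])  # [10, 16]
```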

You can also generate chunks that overlap with each other by using the `overlap` parameter. This parameter specifies the number of characters that should be repeated at the start of each chunk (except the first one).

In [None]:
# Initialize TextCutter
cutter = TextCutter()

# Split text into chunks that overlap by 5 characters
chunks = cutter.split(text1, chunksize=25, overlap=5, merge_threshold=0.0)

# Print the overlapping chunks for each document
for i, doc_chunks in enumerate(chunks):
    for chunk in doc_chunks:
        print(f"- {chunk} ({len(chunk)} characters)")
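
The idea behind overlapping chunks can be sketched in plain Python. The function `chunk_with_overlap()` below is hypothetical and assumes the overlap is counted inside `chunksize` (so each new chunk starts `chunksize - overlap` characters after the previous one). Lexos may instead add the repeated characters on top of `chunksize`, so compare the chunk lengths printed by the cell above before relying on this.

```python
def chunk_with_overlap(text, chunksize, overlap=0):
    """Cut text so each chunk after the first repeats the last `overlap` characters of the previous one."""
    if not 0 <= overlap < chunksize:
        raise ValueError("overlap must be non-negative and smaller than chunksize")
    step = chunksize - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start : start + chunksize])
        if start + chunksize >= len(text):
            break  # the window has reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrst"  # 20 characters
overlapping = chunk_with_overlap(sample, chunksize=8, overlap=3)
print(overlapping)  # ['abcdefgh', 'fghijklm', 'klmnopqr', 'pqrst']
```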

### Cutting Documents into a Fixed Number of Chunks

The `n` parameter allows you to specify the number of chunks to split the text into. When you provide a value for `n`, the text will be divided into that many approximately equal parts. This is useful when you want to ensure that the text is split into a specific number of segments, regardless of their size. The `n` parameter overrides the `chunksize` parameter if both are provided.

In the example below, we split the text of *Pride and Prejudice* into 5 chunks. Try changing the value of `merge_threshold`, for instance, to the default 0.5, to see how it affects the number of chunks created.

In [None]:
# Initialize TextCutter
cutter = TextCutter()

# Split the text into a fixed number of chunks
cutter.split(path, file=True, n=5, merge_threshold=0.0)

# Print the number of chunks created
print(f"Number of chunks created: {len(cutter.chunks[0])}")

# Print the chunk excerpts with chunk lengths
for i, chunk in enumerate(cutter.chunks[0]):
    chunk = chunk.strip().replace("\n", " ")
    print(f"- Chunk {i + 1} ({len(chunk)} characters): {chunk[0:40]}...")
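
As a rough illustration of "approximately equal parts", the sketch below divides a string into `n` pieces whose lengths differ by at most one character. The helper `split_into_n()` is hypothetical. Lexos may derive the chunk size differently (for example, by rounding and then applying the merge settings), which is why `merge_threshold` can change the number of chunks you actually get.

```python
def split_into_n(text, n):
    """Divide text into n contiguous parts whose lengths differ by at most one character."""
    if n < 1:
        raise ValueError("n must be at least 1")
    base, extra = divmod(len(text), n)
    parts, start = [], 0
    for i in range(n):
        # The first `extra` parts each take one additional character
        size = base + (1 if i < extra else 0)
        parts.append(text[start : start + size])
        start += size
    return parts


parts = split_into_n("abcdefghijklmnopqrstuvw", 5)  # 23 characters
print([len(p) for p in parts])  # [5, 5, 5, 4, 4]
```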

### Cutting Documents by Line

The `newline` parameter allows you to split the text based on line breaks instead of character count. In other words, the cutter will create chunks that respect line boundaries. This is particularly useful for texts where line structure is important, such as poetry, scripts, or any text where lines represent meaningful units.

When using `newline=True`, both the `chunksize` and `n` parameters refer to line counts rather than character counts.

The example below creates chunks with 3 lines each. The last chunk will contain any remaining lines and will depend on your merge settings.

In [None]:
# Initialize TextCutter
cutter = TextCutter()

# Create a text with line breaks
lineated_text = """It is a truth universally acknowledged,
that a single man in possession of a good fortune,
must be in want of a wife.
However little known the feelings or views of such a man may be
on his first entering a neighbourhood,
this truth is so well fixed in the minds of the surrounding families,
that he is considered the rightful property
of some one or other of their daughters."""

# Split the text into chunks of 3 lines each
cutter.split(lineated_text, chunksize=3, newline=True)

# Print the lines for each chunk
for i, chunk in enumerate(cutter.chunks[0]):
    print(f"Chunk {i + 1} lines:")
    lines = chunk.strip().split("\n")
    for line in lines:
        print(f"- {line}")
    print()
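
For comparison, grouping lines into chunks can be sketched with the standard library alone. The function `chunk_lines()` is a hypothetical helper that ignores the merge settings. It only shows the basic idea of counting lines instead of characters.

```python
def chunk_lines(text, lines_per_chunk):
    """Group the lines of text into chunks of `lines_per_chunk` lines, keeping line endings."""
    lines = text.splitlines(keepends=True)
    return [
        "".join(lines[i : i + lines_per_chunk])
        for i in range(0, len(lines), lines_per_chunk)
    ]


poem = "line 1\nline 2\nline 3\nline 4\nline 5\nline 6\nline 7\nline 8"
line_chunks = chunk_lines(poem, 3)
print([len(c.splitlines()) for c in line_chunks])  # [3, 3, 2]
```

Because the line endings are kept, joining the chunks back together reproduces the original text exactly.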


### Cutting Documents on Milestones

Milestones are specified locations in the text that designate structural or sectional divisions. A milestone can be either a designated unit *within* the text or a placemarker inserted between sections of text. The Lexos `milestones` module provides methods for identifying milestone locations by searching for patterns you designate. You can use the `StringMilestones` class in the `milestones` module to generate a list of `StringSpan` objects that mark the locations of milestones in your text. The `TextCutter.split_on_milestones()` method can then use these spans to split the text into chunks at the specified locations. Here is a quick example of how to do it.

In [None]:
# Import the StringMilestones class
from lexos.milestones.string_milestones import StringMilestones

# A sample doc
text = "The quick brown fox jumps over the lazy dog."

# Create a String Milestones instance and search for the pattern "quick"
# milestones.spans is a list of StringSpan objects
milestones = StringMilestones(doc=text, patterns="quick")

# Create a TextCutter instance and split on the found milestones
cutter = TextCutter()
chunks = cutter.split_on_milestones(docs=text, milestones=milestones.spans)
print(chunks[0][0])  # The
print(chunks[0][1])  # brown fox jumps over the lazy dog.

You will notice that the milestone itself ("quick") is not included in either chunk. By default, the milestone text is removed during the split. You can control this behaviour with the `keep_spans` parameter set to "preceding" or "following":

In [None]:
# Preceding: milestone included at end of previous chunk
cutter_preceding = TextCutter()
chunks_preceding = cutter_preceding.split_on_milestones(
    docs=text, milestones=milestones.spans, keep_spans="preceding"
)
print("Preceding:", chunks_preceding[0])

# Following: milestone included at start of next chunk
cutter_following = TextCutter()
chunks_following = cutter_following.split_on_milestones(
    docs=text, milestones=milestones.spans, keep_spans="following"
)
print("Following:", chunks_following[0])
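
The three behaviours can be imitated with a regular expression to make the differences concrete. The function `split_on_pattern()` below is a hypothetical stand-in that uses Python's `re` module rather than `StringSpan` objects, and its handling of whitespace and empty chunks is an assumption that may not match what Lexos returns.

```python
import re


def split_on_pattern(text, pattern, keep_spans=None):
    """Split text at every regex match; keep_spans may be None, "preceding", or "following"."""
    chunks, start = [], 0
    for match in re.finditer(pattern, text):
        if keep_spans == "preceding":
            # The milestone stays at the end of the chunk before it
            chunks.append(text[start : match.end()])
            start = match.end()
        elif keep_spans == "following":
            # The milestone becomes the start of the chunk after it
            chunks.append(text[start : match.start()])
            start = match.start()
        else:
            # The milestone is dropped from both chunks
            chunks.append(text[start : match.start()])
            start = match.end()
    chunks.append(text[start:])
    return chunks


sentence = "The quick brown fox jumps over the lazy dog."
print(split_on_pattern(sentence, "quick"))
# ['The ', ' brown fox jumps over the lazy dog.']
print(split_on_pattern(sentence, "quick", keep_spans="preceding"))
# ['The quick', ' brown fox jumps over the lazy dog.']
print(split_on_pattern(sentence, "quick", keep_spans="following"))
# ['The ', 'quick brown fox jumps over the lazy dog.']
```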

If you do not want to use the `milestones` module to find milestones, you can also create your own list of `StringSpan` objects manually and pass them to the `split_on_milestones()` method.

This tutorial does not cover all the options for using the `milestones` module to find milestones in your text. For more information, please refer to the Lexos `milestones` documentation.

### Saving Chunks to Disk

Once you have split your documents into chunks, you may want to save these chunks to disk for later use. The `save()` method of the `TextCutter` class allows you to do this easily. You can specify an output directory where the chunks will be saved, and optionally provide names for each document to create more meaningful filenames for the chunks.

In the example below, we save the chunks to the output directory specified with `output_dir`. Each chunk will be saved as a separate text file, with filenames based on the document names and chunk indices. You can specify your own names for the documents using the `names` parameter. If you do not provide names (or they are not already defined in your `TextCutter` instance), default names will be used. These have the format `doc001_001.txt`; however, you can modify the `_` delimiter and the zero padding with the `delimiter` and `pad` parameters.

In the cell below, you can uncomment parameters in the `save()` method to change filenames. Make sure to specify a valid output directory path on your system.

In [None]:
# Initialize TextCutter
cutter = TextCutter()

texts = [
    "It is a truth universally acknowledged, that a single man in possession of a good fortune, must be in want of a wife. However little known the feelings or views of such a man may be on his first entering a neighbourhood,",
    "this truth is so well fixed in the minds of the surrounding families, that he is considered the rightful property of some one or other of their daughters.",
]

# Split the texts into chunks
cutter.split(texts, chunksize=50)

# Configure the output directory
output_dir = "ADD_YOUR_OUTPUT_DIRECTORY_PATH_HERE"

# Save the chunks to disk -- uncomment lines to change the settings
cutter.save(
    output_dir=output_dir,
    # names = ["text1", "text2"],
    # delimiter = "-",
    # pad = 2,
)
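
To preview what the filenames might look like before writing anything to disk, the sketch below builds names in the `doc001_001.txt` style described above. The helper `chunk_filenames()` is hypothetical, and it assumes that chunk numbering starts at 1 and that `pad` is the width of the zero-padded chunk index. Check the files Lexos actually writes to confirm both assumptions.

```python
def chunk_filenames(names, chunk_counts, delimiter="_", pad=3):
    """Build one filename per chunk: <name><delimiter><zero-padded chunk index>.txt"""
    filenames = []
    for name, count in zip(names, chunk_counts):
        for i in range(1, count + 1):  # assumed: chunk numbering starts at 1
            filenames.append(f"{name}{delimiter}{i:0{pad}d}.txt")
    return filenames


print(chunk_filenames(["doc001", "doc002"], [2, 1]))
# ['doc001_001.txt', 'doc001_002.txt', 'doc002_001.txt']

print(chunk_filenames(["text1"], [2], delimiter="-", pad=2))
# ['text1-01.txt', 'text1-02.txt']
```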

## Cutting spaCy Documents with `TokenCutter`

In [None]:
# Import the TokenCutter and Tokenizer classes
from lexos.cutter.token_cutter import TokenCutter
from lexos.tokenizer import Tokenizer

tokenizer = Tokenizer(model="en_core_web_sm")

# Create a sample doc
text1 = "It is a truth universally acknowledged, that a single  man in possession of a good fortune, must be in want of a wife."
doc1 = tokenizer.make_doc(text1)

# Initialize TokenCutter
cutter = TokenCutter()

# Split text into chunks of 10 tokens each
cutter.split(doc1, chunksize=10)

# Print the resulting chunks
for chunk in cutter.chunks[0]:
    print(f"- {chunk}")

Cutting multiple `Doc` objects is similar to cutting multiple texts. Uncomment the second `split()` method in the cell below to see how to use custom document names.

In [None]:
# Create a second document
text2 = "However little known the feelings or views of such a man may be on his first entering a neighbourhood, this truth is so well fixed in the minds of the surrounding families, that he is considered the rightful property of some one or other of their daughters."
doc2 = tokenizer.make_doc(text2)


# Initialize TokenCutter
cutter = TokenCutter()

# Split docs into chunks of 10 tokens each and store the result in a variable
chunks = cutter.split(docs=[doc1, doc2], chunksize=10)

# Use custom document names
# chunks = cutter.split(docs=[doc1, doc2], names=["text1", "text2"], chunksize=10)

# Print chunks for each document
for i, doc_chunks in enumerate(chunks):
    print(f"Document {i + 1} chunks:")
    for chunk in doc_chunks:
        print(f"- {chunk}")
    print()

You can also display the number of documents in the cutter instance using the `len()` function, and export the chunks to a dictionary using the `to_dict()` method.

In [None]:
# Print the number of docs in the cutter instance
print(f"Number of docs in cutter instance: {len(cutter)}")

print()

# Display chunks as a dictionary
print("Chunks as dictionary:")
print(cutter.to_dict())


Cutting spaCy `Doc` objects directly from files requires a slightly different approach compared to cutting raw text files. The files must be in spaCy's binary format, which can be created using the `Doc.to_bytes()` or `Doc.to_disk()` methods. You must also specify the spaCy model used to create the `Doc` objects so that they can be deserialised correctly. For example:

```python
path = "path/to/spacy_doc.spacy"
cutter.split(docs=path, file=True, spacy_model="en_core_web_sm", chunksize=100)
```

If your files are in raw text format, you will need to read and tokenize them into `Doc` objects before passing them to the `TokenCutter`.

### Cutting Tokens on Line Breaks

Using line breaks to cut tokenized documents works similarly to cutting raw text documents. By setting the `newline` parameter to `True`, the `TokenCutter` will create chunks that respect line boundaries based on token positions. Both the `chunksize` and `n` parameters will refer to line counts rather than token counts when `newline=True`.

In [None]:
# Initialize TokenCutter
cutter = TokenCutter()

lineated_text = """It is a truth universally acknowledged,
that a single man in possession of a good fortune,
must be in want of a wife.
However little known the feelings or views of such a man may be
on his first entering a neighbourhood,
this truth is so well fixed in the minds of the surrounding families,
that he is considered the rightful property
of some one or other of their daughters."""

lineated_doc = tokenizer.make_doc(lineated_text)

cutter.split(docs=[lineated_doc], chunksize=3, newline=True)
for chunk in cutter.chunks[0]:
    print(f"- {chunk}")
    print()


### Cutting on Sentence Breaks

Some spaCy language models (such as `en_core_web_sm`) include sentence boundary detection. If your `Doc` objects have sentence boundaries defined, you can use the `split_on_sentences()` method to cut the documents into chunks based on a specified number of sentences. For instance, assume that your spaCy `Doc` object has ten sentences. You can split it into chunks of 5 sentences each as shown below. The last chunk will contain any remaining sentences and will depend on your merge settings. If a `Doc` does not have sentence boundaries defined, Lexos will raise an error.

In the example below, we will cut a spaCy `Doc` object created from the first 2000 characters of *Pride and Prejudice* into chunks of 5 sentences each.

In [None]:
# Make sure we have both TokenCutter and Tokenizer imported
from lexos.cutter.token_cutter import TokenCutter
from lexos.tokenizer import Tokenizer

# Initialize Tokenizer with the English-language model
tokenizer = Tokenizer(model="en_core_web_sm")

# Initialize TokenCutter
cutter = TokenCutter()

# Read a text file
path = "Austen_Pride.txt"
with open(path, "r") as f:
    text = f.read()

# Convert the first 2000 characters of the text into a spaCy Doc
doc = tokenizer.make_doc(text[:2000])

chunks = cutter.split_on_sentences(docs=doc, n=5)
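for chunk in chunks[0]:
    # Collapse line breaks outside the f-string (backslashes inside f-string expressions fail before Python 3.12)
    chunk_text = chunk.text.strip().replace("\n", " ")
    print(f"- {chunk_text}")
    print()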

### Cutting Tokens on Milestones

Cutting on milestones works similarly for `TokenCutter` objects, except that the milestones are specified as lists of spaCy `Span` objects rather than `StringSpan` objects.

In [None]:
# Make sure the TokenMilestones, TokenCutter, and Tokenizer are imported
from lexos.milestones.token_milestones import TokenMilestones
from lexos.cutter.token_cutter import TokenCutter
from lexos.tokenizer import Tokenizer

# Initialize Tokenizer with the English-language model
tokenizer = Tokenizer(model="en_core_web_sm")

# Create a sample doc
text = "The quick brown fox jumps over the lazy dog."
doc = tokenizer.make_doc(text)

# Create a TokenMilestones instance for the doc
milestones = TokenMilestones(doc=doc)

# Get a list of Span matches for the pattern "quick"
spans = milestones.get_matches(patterns="quick")

# Create a TokenCutter instance and split on the found milestones
cutter = TokenCutter()

# Split the doc on the found milestones
chunks = cutter.split_on_milestones(docs=doc, milestones=spans, merge_threshold=0.0)
print(chunks[0][0])  # The
print(chunks[0][1])  # brown fox jumps over the lazy dog.

As with `TextCutter`, you can modify the handling of the milestone text during the split using the `keep_spans` parameter set to "preceding" or "following":

In [None]:
# Preceding: milestone included at end of previous chunk
cutter_preceding = TokenCutter()
chunks_preceding = cutter_preceding.split_on_milestones(
    docs=doc, milestones=spans, keep_spans="preceding", merge_threshold=0.0
)
print("Preceding:", chunks_preceding[0])

# Following: milestone included at start of next chunk
cutter_following = TokenCutter()
chunks_following = cutter_following.split_on_milestones(
    docs=doc, milestones=spans, keep_spans="following", merge_threshold=0.0
)
print("Following:", chunks_following[0])

### Saving spaCy `Doc` Files to Disk

By default, the `save()` method saves the chunk text strings, rather than the spaCy `Doc` objects. If you would like to store the spaCy `Doc` objects themselves, set the `as_text` parameter to `False`. This is the equivalent of calling spaCy's `Doc.to_bytes()` method on each chunk and saving the resulting bytes to disk.

In [None]:
# Configure the output directory
output_dir = "ADD_YOUR_OUTPUT_DIRECTORY_PATH_HERE"

# Save the chunks to disk -- uncomment lines to change the settings
cutter.save(
    output_dir=output_dir,
    # names = ["text1", "text2"],
    # delimiter = "-",
    # pad = 2,
    # as_text = False,
)