# Cutter Tutorial
   
This notebook shows examples of how to use the `cutter` module.

`Cutter` is used to divide files, texts, or documents into segments. It provides three Python classes (identified by cute codenames), depending on the type of data you are working with:

- `Ginsu` is used to split spaCy documents.
- `Machete` is used to split raw text strings.
- `Filesplit` (codename `Chainsaw`) is used to split files based on byte size.

To use one of these classes, import it and create an instance before calling its methods. By convention, we assign the class instance to the name `cutter`. Here is an example using `Ginsu` to split a document into two segments:

```python
from lexos.cutter.ginsu import Ginsu
cutter = Ginsu()
segments = cutter.splitn(doc, n=2)
```

The `splitn()` method takes the `n` argument to indicate the number of desired segments. Each method takes multiple arguments. We will examine each method in turn with examples using a minimum number of arguments. The full list of available arguments is given for each method, however, so that you can easily try them out in the example code.

## Import Lexos Modules and Load Some Data

For brevity, we will import all the modules we need below and load some data to be used in our example cells. We'll use Jane Austen's _Pride and Prejudice_, which we will split in multiple ways.

In [None]:
import re
from lexos.io.smart import Loader
from lexos import tokenizer
from lexos.cutter.ginsu import Ginsu
from lexos.cutter.machete import Machete
from lexos.cutter.filesplit import Filesplit
from lexos.cutter.milestones import Milestones

data = "../test_data/txt/Austen_Pride.txt"
loader = Loader()
loader.load(data)

# Here we convert line break characters to spaces to make
# our results more readable in the examples below.
text = re.sub(r"[\r\n]+", " ", loader.texts[0]).strip()
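
To see what this clean-up step does in isolation, here is a small stand-alone check on a made-up string. It is only a sketch: the sample text is invented, and the pattern is a plain character class covering carriage returns and newlines.

```python
import re

# Made-up sample mixing Windows (\r\n) and Unix (\n) line endings.
sample = "It is a truth\r\nuniversally acknowledged,\n\nthat a single man"

# Collapse every run of carriage returns and newlines into one space,
# then trim any leading or trailing whitespace.
flattened = re.sub(r"[\r\n]+", " ", sample).strip()

print(flattened)
```

Because the `+` quantifier consumes a whole run of line-break characters at once, a blank line between paragraphs becomes a single space rather than several.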

## Cutting with `Ginsu`

`Ginsu` has three methods:

- `split()`: Splits a spaCy doc or a list of spaCy docs into segments every `n` tokens.
- `splitn()`: Splits a spaCy doc or a list of spaCy docs into a specified number of segments.
- `split_on_milestones()`: Splits a spaCy doc or a list of spaCy docs whenever a specific token is encountered.

We'll explore each of these in turn. Since `Ginsu` operates on spaCy docs, we must first convert our text using the `tokenizer` module. This will take a bit of time because we are processing the text with an English language model to gain access to attributes like parts of speech.

In [None]:
doc = tokenizer.make_doc(text, model="en_core_web_sm")

We'll now look at each method in turn.

### `Ginsu.split()`

Splits doc(s) into segments of n tokens.

**Arguments:**

- `docs`: A spaCy doc or list of spaCy docs.
- `n`: The number of tokens to split on. Default = 1000.
- `merge_threshold`: The threshold to merge the last segment. Default = 0.5.
- `overlap`: The number of tokens to overlap. Default = 0.
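
The way `n`, `merge_threshold`, and `overlap` interact is easier to see on a tiny token list. The sketch below is not the Lexos implementation: `split_every_n` is a hypothetical helper working on plain Python lists rather than spaCy docs, and both the merge rule (a final chunk shorter than `n * merge_threshold` is folded into the previous segment) and the overlap rule (each segment borrows the first tokens of the next one) are assumptions made for illustration.

```python
from typing import List, Sequence


def split_every_n(
    tokens: Sequence[str],
    n: int = 1000,
    merge_threshold: float = 0.5,
    overlap: int = 0,
) -> List[List[str]]:
    """Hypothetical helper: cut `tokens` into segments of `n` tokens."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    # Cut the token sequence into consecutive chunks of n tokens.
    segments = [list(tokens[i : i + n]) for i in range(0, len(tokens), n)]
    # Assumed rule: a short final chunk (fewer than n * merge_threshold
    # tokens) is folded into the segment before it.
    if len(segments) > 1 and len(segments[-1]) < n * merge_threshold:
        tail = segments.pop()
        segments[-1].extend(tail)
    # Assumed rule: each segment borrows the first `overlap` tokens of
    # the segment that follows it; the last segment is left unchanged.
    if overlap > 0:
        segments = [
            seg + segments[i + 1][:overlap] if i + 1 < len(segments) else seg
            for i, seg in enumerate(segments)
        ]
    return segments


tokens = "a b c d e f g h i".split()  # nine tokens

merged = split_every_n(tokens, n=4)  # the lone "i" is merged
kept = split_every_n(tokens, n=4, merge_threshold=0.25)  # "i" stays separate
overlapped = split_every_n(tokens, n=4, overlap=1)

print(merged)
print(kept)
print(overlapped)
```

With nine tokens and `n=4`, the leftover chunk has one token, which is below `4 * 0.5`, so the default threshold merges it. Lowering the threshold to 0.25 keeps it as its own segment.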


In [None]:
cutter = Ginsu()

segments = cutter.split(doc, n=7500)

print("Number of segments:", len(segments))

### `Ginsu.splitn()`

Splits doc(s) into a specific number of segments.

**Arguments:**

- `docs`: A spaCy doc or list of spaCy docs.
- `n`: The number of segments to create. Default = 2.
- `merge_threshold`: The threshold to merge the last segment. Default = 0.5.
- `overlap`: The number of tokens to overlap. Default = 0.
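
Splitting into a fixed number of segments means deciding what to do when the token count does not divide evenly. The sketch below shows one common approach, assumed here for illustration and not taken from the Lexos source: `split_into_n` is a hypothetical helper that hands the leftover tokens to the earliest segments, one each. It ignores `merge_threshold` and `overlap`.

```python
from typing import List, Sequence


def split_into_n(tokens: Sequence[str], n: int = 2) -> List[List[str]]:
    """Hypothetical helper: cut `tokens` into `n` nearly equal segments."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    # base = tokens every segment gets; extra = leftover tokens to spread.
    base, extra = divmod(len(tokens), n)
    segments = []
    start = 0
    for i in range(n):
        # The first `extra` segments each take one additional token.
        size = base + (1 if i < extra else 0)
        segments.append(list(tokens[start : start + size]))
        start += size
    return segments


tokens = [f"t{i}" for i in range(10)]
segments = split_into_n(tokens, n=3)
sizes = [len(segment) for segment in segments]

print(sizes)
```

Ten tokens in three segments gives sizes of 4, 3, and 3. If there are fewer tokens than segments, this sketch returns some empty segments, which a real implementation may handle differently.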

In [None]:
segments = cutter.splitn(doc, n=10)

print("Number of segments:", len(segments), "\n")

for segment in segments:
    print(f"- {segment[0:15]}...")

### `Ginsu.split_on_milestones()`

Splits doc(s) on milestone patterns using patterns or token attributes.

**Arguments:**

- `docs`: The document(s) to be split.
- `milestone`: A variable representing the value(s) to be matched.
- `preserve_milestones`: If True, the milestone token will be preserved at the beginning of every segment. Otherwise, it will be deleted. Default = True.

Milestones can be strings, lists, or complex patterns expressed in a dict. See <a href="https://scottkleinman.github.io/lexos/tutorial/cutting_docs/#splitting-documents-on-milestones" target="_blank">Splitting Documents on Milestones</a> for details.
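
The effect of `preserve_milestones` can be illustrated without spaCy. The following is a minimal sketch on a plain token list, assuming the simplest case of an exact string milestone; `split_on_token` is a hypothetical helper, and list or dict patterns are not covered.

```python
from typing import List, Sequence


def split_on_token(
    tokens: Sequence[str],
    milestone: str,
    preserve_milestones: bool = True,
) -> List[List[str]]:
    """Hypothetical helper: start a new segment at every milestone token."""
    segments: List[List[str]] = []
    current: List[str] = []
    for token in tokens:
        if token == milestone:
            # Close the segment collected so far, if it has any tokens.
            if current:
                segments.append(current)
            # Either keep the milestone at the head of the new segment
            # or drop it entirely.
            current = [token] if preserve_milestones else []
        else:
            current.append(token)
    if current:
        segments.append(current)
    return segments


tokens = "Preface Chapter 1 It is Chapter 2 a truth".split()

kept = split_on_token(tokens, "Chapter")
dropped = split_on_token(tokens, "Chapter", preserve_milestones=False)

print(kept)
print(dropped)
```

In this sketch, any tokens before the first milestone (here, "Preface") form their own leading segment. Whether the library treats leading material the same way is not something this example establishes.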

In [None]:
segments = cutter.split_on_milestones(doc, milestone="Chapter")

print(f"Number of segments: {len(segments)} (first 10 shown)\n")

for segment in segments[0:10]:
    print(f"- {segment[0:15]}...")

The example below preprocesses the document with the `Milestones` class (imported at the beginning of the notebook) and splits the document into segments based on the `token._.is_milestone` attribute. As the output shows, this gives the same result as the previous cell. The difference is that `split_on_milestones()` simply splits the document whenever it matches the specified pattern, whereas the `Milestones` class adds the `is_milestone` attribute, allowing it to be saved with the text and re-used.

Note: The `._.` prefix before `is_milestone` indicates that this is a custom extension, not a built-in spaCy attribute.

In [None]:
# Set token._.is_milestone = True for all tokens matching "Chapter"
Milestones().set(doc, "Chapter")

# Split doc wherever a token's "is_milestone" attribute is True
segments = cutter.split_on_milestones(doc, milestone={"is_milestone": True})

print(f"Number of segments: {len(segments)} (first 10 shown)\n")

for segment in segments[0:10]:
    print(f"- {segment[0:15]}...")

## Cutting with `Machete`

`Machete` is used for cutting raw text strings. It has four methods:

- `split()`: Splits a text or a list of texts into segments every `n` tokens.
- `splitn()`: Splits a text or a list of texts into a specified number of segments.
- `split_on_milestones()`: Splits a text or a list of texts whenever a specific string pattern is encountered.
- `split_list()`: Splits a pre-tokenised list of tokens into segments.

We'll explore each of these in turn.

### `Machete.split()`

Splits text(s) into segments of n tokens.

**Arguments:**

- `texts`: A text string or list of text strings.
- `n`: The number of tokens to split on. Default = 1000.
- `merge_threshold`: The threshold to merge the last segment. Default = 0.5.
- `overlap`: The number of tokens to overlap. Default = 0.
- `tokenizer`: The name of the tokenizer function to use. Default = "whitespace".
- `as_string`: Whether to return the segments as a list of strings. Default = True.

**Important:** `Machete.split()` returns a list of lists where each outer list represents a text. Therefore, you must reference the index of the text to get its segments.
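
The nested return shape is easy to trip over, so here is a stand-in that mimics it. This is a sketch under stated assumptions, not Lexos code: `split_texts` is a hypothetical helper that tokenizes on whitespace, skips `merge_threshold` and `overlap`, and rejoins tokens with single spaces when `as_string` is true.

```python
from typing import List, Union


def split_texts(
    texts: Union[str, List[str]],
    n: int = 1000,
    as_string: bool = True,
) -> list:
    """Hypothetical helper: return one list of segments per input text."""
    # Treat a single string as a list containing one text.
    if isinstance(texts, str):
        texts = [texts]
    results = []
    for text in texts:
        tokens = text.split()  # whitespace tokenization
        chunks = [tokens[i : i + n] for i in range(0, len(tokens), n)]
        if as_string:
            chunks = [" ".join(chunk) for chunk in chunks]
        results.append(chunks)
    return results


single = split_texts("one two three four five", n=2)
several = split_texts(["one two three", "four five"], n=2)

print(len(single), len(single[0]))
print([len(segments) for segments in several])
```

Even for a single input string, the outer list has one entry, so the segments live at `single[0]`. With several texts, each position in the outer list holds the segments of the corresponding text.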

In [None]:
cutter = Machete()

segments = cutter.split(text, n=7500)

print("Number of segments:", len(segments[0]))

### `Machete.splitn()`

Splits text(s) into a specific number of segments.

**Arguments:**

- `texts`: A text string or list of text strings.
- `n`: The number of segments to create. Default = 2.
- `merge_threshold`: The threshold to merge the last segment. Default = 0.5.
- `overlap`: The number of tokens to overlap. Default = 0.
- `tokenizer`: The name of the tokenizer function to use. Default = "whitespace".
- `as_string`: Whether to return the segments as a list of strings. Default = True.

In [None]:
segments = cutter.splitn(text, n=10)

print("Number of segments:", len(segments[0]), "\n")

for segment in segments[0]:
    print(f"- {segment[0:50]}...")

### `Machete.split_on_milestones()`

Splits text(s) on milestone patterns.

**Arguments:**

- `docs`: The text(s) to be split.
- `milestone`: A variable representing the value(s) to be matched.
- `preserve_milestones`: If True, the milestone token will be preserved at the beginning of every segment. Otherwise, it will be deleted. Default = True.
- `tokenizer`: The name of the tokenizer function to use. Default = "whitespace".
- `as_string`: Whether to return the segments as a list of strings. Default = True.

Pay close attention to the effect of `preserve_milestones`.

Milestone patterns are evaluated as regular expressions and searched from the beginning of the token string using Python's `re.match()` function. See <a href="https://scottkleinman.github.io/lexos/tutorial/cutting_docs/#splitting-documents-with-machete" target="_blank">Splitting Documents with Machete</a> for details.
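
Because `re.match()` anchors only at the start of the string, a pattern such as "Chapter" also matches longer tokens that begin with it. The short check below uses only the standard library and a made-up token list to show the difference between `re.match()`, an end-anchored pattern, and `re.search()`.

```python
import re

# Made-up tokens for illustration.
tokens = ["Chapter", "Chapters", "chapter", "SubChapter", "1"]

# re.match() succeeds when the pattern matches at the START of the token.
matched = [t for t in tokens if re.match("Chapter", t)]

# Adding "$" to the pattern also requires the token to END there.
exact = [t for t in tokens if re.match(r"Chapter$", t)]

# re.search() would find the pattern anywhere in the token.
searched = [t for t in tokens if re.search("Chapter", t)]

print(matched)   # ['Chapter', 'Chapters']
print(exact)     # ['Chapter']
print(searched)  # ['Chapter', 'Chapters', 'SubChapter']
```

Matching is case-sensitive by default, which is why "chapter" is excluded in all three cases.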

In [None]:
segments = cutter.split_on_milestones(text, milestone="Chapter")

print(f"Number of segments: {len(segments[0])} (first 10 shown)\n")

for segment in segments[0][0:10]:
    print(f"- {segment[0:50]}...")

### `Machete.split_list()`

Splits a list of tokens into segments.

**Arguments:**

- `doc`: The list of tokens to be split.
- `n`: The number of tokens to split on. Default = 1000.
- `merge_threshold`: The threshold to merge the last segment. Default = 0.5.
- `overlap`: The number of tokens to overlap. Default = 0.
- `as_string`: Whether to return the segments as a list of strings. Default = True.

See <a href="https://scottkleinman.github.io/lexos/tutorial/cutting_docs/#splitting-lists-of-tokens-with-machete" target="_blank">Splitting Lists of Tokens with Machete</a> for important considerations when using this method.

In [None]:
# Split our sample text into a list based on whitespace
token_list = text.split()

segments = cutter.split_list(token_list, n=7500)

print("Number of segments:", len(segments[0]), "\n")

## Cutting with `Filesplit` (Chainsaw)

`Filesplit` (AKA "Chainsaw") is used for cutting binary files into smaller files.

### The `Filesplit` Class

**Arguments:**

- `man_filename`: The path to the manifest filename. Default = "fs_manifest.csv".
- `buffer_size`: The maximum file size for each segment. Default = 1000000 (1 MB).

The class is initialised with the defaults in the cells below.

### `Filesplit.split()`

**Arguments:**

- `file`: The path to the file to be split.
- `split_size`: The maximum file size for each segment. Default = 1000000 (1 MB).
- `output_dir`: The path to the directory where the segments will be saved.
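
To make the idea of byte-size splitting with a manifest concrete, here is a minimal standard-library sketch. It is not how `Filesplit` is implemented: `split_file` is a hypothetical helper, the `stem_index` naming scheme and the manifest columns (`filename`, `filesize`) are assumptions, and each piece is read into memory in one go, which a real tool would avoid for large split sizes.

```python
import csv
import os
import tempfile


def split_file(path, split_size, output_dir, manifest_name="fs_manifest.csv"):
    """Hypothetical helper: write `path` out in pieces of at most `split_size` bytes."""
    os.makedirs(output_dir, exist_ok=True)
    stem, ext = os.path.splitext(os.path.basename(path))
    rows = []
    with open(path, "rb") as source:
        index = 1
        while True:
            chunk = source.read(split_size)
            if not chunk:  # end of file
                break
            name = f"{stem}_{index}{ext}"
            with open(os.path.join(output_dir, name), "wb") as part:
                part.write(chunk)
            rows.append((name, len(chunk)))
            index += 1
    # Record the pieces, in order, so they can be reassembled later.
    with open(os.path.join(output_dir, manifest_name), "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["filename", "filesize"])
        writer.writerows(rows)
    return rows


with tempfile.TemporaryDirectory() as workdir:
    # Create a 2500-byte throwaway file to split.
    source_path = os.path.join(workdir, "longfile.txt")
    with open(source_path, "wb") as handle:
        handle.write(b"x" * 2500)
    parts = split_file(
        source_path, split_size=1000, output_dir=os.path.join(workdir, "splits")
    )

print(parts)
```

Because the cut points are byte offsets, a piece boundary can fall in the middle of a multi-byte character, so individual pieces of a text file are not guaranteed to decode cleanly on their own.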

In [None]:
from lexos.cutter.filesplit import Filesplit

fs = Filesplit()

fs.split(
    file="/filesplit_test/longfile.txt",
    split_size=30000000,
    output_dir="/filesplit_test/splits/"
)

### `Filesplit.merge()`

The `Filesplit.merge()` method uses the saved metadata file to merge segments of a file previously split using `Filesplit.split()`.

**Arguments:**

- `input_dir`: The path to the directory containing the split files and the manifest file.
- `sep`: The separator string used in the file names. Default = "_".
- `output_file`: The path to the file which will contain the merged segments. If not provided, the final merged filename is derived from the split filename and placed in the same `input_dir`.
- `manifest_file`: The path to the manifest file. If not provided, the process will look for the file within the `input_dir`.
- `callback`: A callback function that accepts two arguments: the path to the destination file and the size of the file in bytes.
- `cleanup`: If `True`, all the split files and the manifest file will be deleted after the merge, leaving behind only the merged file.
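
The merge step can be sketched in the same spirit: read the manifest, concatenate the listed pieces in order, and optionally delete them. Again this is an illustration under assumptions, not the library's code: `merge_parts` is a hypothetical helper, and the manifest layout (a `filename` column listing pieces in order) is invented for the example.

```python
import csv
import os
import tempfile


def merge_parts(input_dir, output_file, manifest_name="fs_manifest.csv", cleanup=False):
    """Hypothetical helper: rebuild a file from the pieces listed in a manifest."""
    manifest_path = os.path.join(input_dir, manifest_name)
    with open(manifest_path, newline="") as handle:
        names = [row["filename"] for row in csv.DictReader(handle)]
    # Concatenate the pieces in manifest order.
    with open(output_file, "wb") as merged:
        for name in names:
            with open(os.path.join(input_dir, name), "rb") as part:
                merged.write(part.read())
    if cleanup:
        # Remove the pieces and the manifest, leaving only the merged file.
        for name in names:
            os.remove(os.path.join(input_dir, name))
        os.remove(manifest_path)
    return os.path.getsize(output_file)


with tempfile.TemporaryDirectory() as workdir:
    # Build two pieces and a manifest by hand for the demonstration.
    pieces = {"longfile_1.txt": b"Pride and ", "longfile_2.txt": b"Prejudice"}
    for name, data in pieces.items():
        with open(os.path.join(workdir, name), "wb") as handle:
            handle.write(data)
    with open(os.path.join(workdir, "fs_manifest.csv"), "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["filename", "filesize"])
        writer.writerows((name, len(data)) for name, data in pieces.items())

    target = os.path.join(workdir, "longfile.txt")
    merged_size = merge_parts(workdir, target, cleanup=True)
    with open(target, "rb") as handle:
        merged_bytes = handle.read()
    remaining = sorted(os.listdir(workdir))

print(merged_bytes, merged_size, remaining)
```

Relying on the manifest for ordering, rather than sorting file names, avoids problems such as `longfile_10.txt` sorting before `longfile_2.txt`.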


In [None]:
fs.merge("/filesplit_test/splits/", cleanup=True)