# Test Tokenizer
   
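This notebook shows examples of how to use the `tokenizer` to create a spaCy doc and how to cut documents, texts, and files into segments with the cutter classes (`Ginsu`, `Machete`, and `Filesplit`).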

## Add Lexos to the Jupyter `sys.path`

In [1]:
%run jupyter_local_setup.py ../../../lexos

System path set to `../../../lexos`.


## Import Lexos Modules

In [2]:
from lexos.io.smart import Loader
from lexos import tokenizer
from lexos.cutter.ginsu import Ginsu
from lexos.cutter.machete import Machete

## Load data

In [21]:
data = "../test_data/txt/Austen_Pride.txt"
loader = Loader()
loader.load(data)
doc = tokenizer.make_doc(loader.texts[0], model="en_core_web_sm")

## Cutting with Ginsu

Ginsu is used for cutting spaCy docs.

### `Ginsu.split()`

Splits doc(s) into segments of n tokens.

**Arguments:**

- `docs`: A spaCy doc or list of spaCy docs.
- `n`: The number of tokens to split on (default = 1000).
- `merge_threshold`: The threshold to merge the last segment (default = 0.5).
- `overlap`: The number of tokens to overlap (default = 0).


In [61]:
cutter = Ginsu()

# Returns a list of lists of docs
pride_segments = cutter.split(doc, n=7500)
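
To make the interaction of `n`, `merge_threshold`, and `overlap` more concrete, here is a minimal sketch that works on a plain Python list instead of a spaCy doc. It is not the Ginsu implementation: the helper name `split_tokens` is hypothetical, and the two rules it applies are assumptions about the behaviour described above. The first rule folds a final segment shorter than `n * merge_threshold` into the previous segment. The second lets each segment borrow the first `overlap` tokens of the next one.

```python
def split_tokens(tokens, n=1000, merge_threshold=0.5, overlap=0):
    """Illustrative sketch only (hypothetical helper, not part of Lexos)."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    if not tokens:
        return []
    # Cut into consecutive runs of n tokens; the last run may be shorter.
    segments = [tokens[i:i + n] for i in range(0, len(tokens), n)]
    # Assumed rule: fold a too-short final segment into the one before it.
    if len(segments) > 1 and len(segments[-1]) < n * merge_threshold:
        last = segments.pop()
        segments[-1] = segments[-1] + last
    # Assumed rule: each segment also gets the first `overlap` tokens of the next.
    if overlap > 0:
        segments = [
            seg + nxt[:overlap] for seg, nxt in zip(segments, segments[1:])
        ] + [segments[-1]]
    return segments


tokens = "It is a truth universally acknowledged that a single man".split()
for segment in split_tokens(tokens, n=4, overlap=1):
    print(segment)
```

With ten tokens and `n=4`, the final segment has two tokens, which is not below `4 * 0.5`, so under the assumed rule it is kept as its own segment.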

### `Ginsu.splitn()`

Splits doc(s) into a specific number of segments.

**Arguments:**

- `docs`: A spaCy doc or list of spaCy docs.
- `n`: The number of segments to create (default = 2).
- `merge_threshold`: The threshold to merge the last segment (default = 0.5).
- `overlap`: The number of tokens to overlap (default = 0).

In [None]:
# Returns a list of lists of docs
pride_segments = cutter.splitn([doc], n=10)
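
The difference between splitting by size and splitting by count can be sketched on a plain list. The helper `split_into_n` below is hypothetical and is not the Ginsu code. It assumes that any remainder is spread over the first segments, and it ignores `merge_threshold` and `overlap`; Lexos may distribute the tokens differently.

```python
def split_into_n(tokens, n=2):
    """Illustrative sketch only (hypothetical helper, not part of Lexos)."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    # Every segment gets `base` tokens; the first `extra` segments get one more.
    base, extra = divmod(len(tokens), n)
    segments, start = [], 0
    for i in range(n):
        size = base + (1 if i < extra else 0)
        segments.append(tokens[start:start + size])
        start += size
    return segments


sizes = [len(segment) for segment in split_into_n(list(range(10)), n=3)]
print(sizes)  # [4, 3, 3]
```

Note that this sketch returns empty segments when there are fewer tokens than requested segments.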

### `Ginsu.split_on_milestones()`

Splits doc(s) on milestones, identified either by patterns or by pre-processed `token._.is_milestone` attributes.

**Arguments:**

- `docs`: The document(s) to be split.
- `milestone`: A variable representing the value(s) to be matched.
- `preserve_milestones`: If True (the default), the milestone token will be preserved at the beginning of every segment. Otherwise, it will be deleted.

Milestones can be strings, lists, or complex patterns expressed in a dict. See <a href="https://scottkleinman.github.io/lexos/tutorial/cutting_docs/#splitting-documents-on-milestones" target="_blank">Splitting Documents on Milestones</a> for details.

In [None]:
# Returns a list of lists of docs
pride_segments = cutter.split_on_milestones([doc], milestone="Chapter")

The example below preprocesses the document with the `Milestones` class and splits the document into segments based on the `token._.is_milestone` attribute.

In [None]:
from lexos.cutter.milestones import Milestones

Milestones().set(doc, "Chapter")

pride_segments = cutter.split_on_milestones([doc], milestone={"is_milestone": True})
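
The effect of `preserve_milestones` can be sketched on a plain list of strings. The helper `split_on_milestone` below is hypothetical and only handles an exact string match; it is not how Ginsu matches spaCy tokens or dict patterns.

```python
def split_on_milestone(tokens, milestone, preserve_milestones=True):
    """Illustrative sketch only (hypothetical helper, not part of Lexos)."""
    segments, current = [], []
    for token in tokens:
        if token == milestone:
            # A milestone closes the current segment and opens a new one.
            if current:
                segments.append(current)
            current = [token] if preserve_milestones else []
        else:
            current.append(token)
    if current:
        segments.append(current)
    return segments


tokens = ["Chapter", "1", "It", "is", "Chapter", "2", "Mr.", "Bennet"]
print(split_on_milestone(tokens, "Chapter"))
print(split_on_milestone(tokens, "Chapter", preserve_milestones=False))
```

In the first call every segment starts with "Chapter"; in the second the milestone token is dropped and each segment starts with the chapter number.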

## Cutting with Machete

Machete is used for cutting text strings.

### `Machete.split()`

Splits text(s) into segments of n tokens.

**Arguments:**

- `texts`: A text string or list of text strings.
- `n`: The number of tokens to split on (default = 1000).
- `merge_threshold`: The threshold to merge the last segment (default = 0.5).
- `overlap`: The number of tokens to overlap (default = 0).
- `tokenizer`: The name of the tokenizer function to use (default = "whitespace").
- `as_string`: Whether to return the segments as a list of strings (default = True).

In [60]:
cutter = Machete()

text = loader.texts[0]

# Returns a list of str lists
segments = cutter.split(text, n=7500)

### `Machete.splitn()`

Splits text(s) into a specific number of segments.

**Arguments:**

- `texts`: A text string or list of text strings.
- `n`: The number of segments to create (default = 2).
- `merge_threshold`: The threshold to merge the last segment (default = 0.5).
- `overlap`: The number of tokens to overlap (default = 0).
- `tokenizer`: The name of the tokenizer function to use (default = "whitespace").
- `as_string`: Whether to return the segments as a list of strings (default = True).

In [None]:
# Returns a list of str lists
segments = cutter.splitn(text, n=10)

### `Machete.split_on_milestones()`

Splits text(s) on milestone patterns.

**Arguments:**

- `texts`: The text string(s) to be split.
- `milestone`: A variable representing the value(s) to be matched.
- `preserve_milestones`: If True (the default), the milestone token will be preserved at the beginning of every segment. Otherwise, it will be deleted.
- `tokenizer`: The name of the tokenizer function to use (default = "whitespace").
- `as_string`: Whether to return the segments as a list of strings (default = True).

Milestone patterns are evaluated as regular expressions and searched from the beginning of the token string using Python's `re.match()` function. See <a href="https://scottkleinman.github.io/lexos/tutorial/cutting_docs/#splitting-documents-with-machete" target="_blank">Splitting Documents with Machete</a> for details.

In [None]:
# Returns a list of str lists
segments = cutter.split_on_milestones(text, milestone="Chapter")
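
Because patterns are matched with `re.match()`, a pattern only has to match at the start of a token, not the whole token. The short sketch below uses only Python's standard `re` module on a made-up token list to show what that means in practice; it does not call Machete.

```python
import re

tokens = ["Chapter", "Chapters", "chapter", "PreChapter"]
pattern = "Chapter"

# re.match() anchors at the start of the string but not at the end.
matched = [t for t in tokens if re.match(pattern, t)]
# re.search() would also find the pattern in the middle of a token.
searched = [t for t in tokens if re.search(pattern, t)]
# Adding "$" restricts the match to the whole token.
exact = [t for t in tokens if re.match(pattern + "$", t)]

print(matched)   # ['Chapter', 'Chapters']
print(searched)  # ['Chapter', 'Chapters', 'PreChapter']
print(exact)     # ['Chapter']
```

If a milestone such as "Chapter" should not also match "Chapters", anchoring the pattern with `$` is one way to be explicit about it.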

### `Machete.split_list()`

Splits a list of tokens into segments.

**Arguments:**

- `doc`: The list of tokens to be split.
- `n`: The number of tokens to split on (default = 1000).
- `merge_threshold`: The threshold to merge the last segment (default = 0.5).
- `overlap`: The number of tokens to overlap (default = 0).
- `as_string`: Whether to return the segments as a list of strings (default = True).

See <a href="https://scottkleinman.github.io/lexos/tutorial/cutting_docs/#splitting-lists-of-tokens-with-machete" target="_blank">Splitting Lists of Tokens with Machete</a> for important considerations when using this method.

In [None]:
# Use a new variable so that `text` still holds the original string
tokens = text.split()

# Returns a list of str lists
segments = cutter.split_list(tokens, n=7500)
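
One thing to keep in mind when building the token list with `str.split()` is that the original whitespace is discarded. This is an observation about Python itself, not a summary of the linked tutorial. The sketch below uses a made-up sample string to show that joining the tokens with spaces does not reproduce the text, and offers one possible way to keep trailing whitespace attached to each token.

```python
import re

sample = "It is a truth\nuniversally acknowledged,\n\nthat a single man"

# str.split() drops all whitespace, including line breaks.
tokens = sample.split()
rejoined = " ".join(tokens)
print(rejoined == sample)  # False

# Keeping trailing whitespace with each token makes the split reversible
# (for text that does not begin with whitespace).
pieces = re.findall(r"\S+\s*", sample)
print("".join(pieces) == sample)  # True
```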

## Cutting with `Filesplit` (Chainsaw)

`Filesplit` (AKA "Chainsaw") is used for cutting binary files into smaller files.

### The `Filesplit` Class

**Arguments:**

- `man_filename`: The path to the manifest filename (default = "fs_manifest.csv").
- `buffer_size`: The maximum file size for each segment (default = 1000000, 1 MB).

The class is initialised with the defaults in the cells below.

### `Filesplit.split()`

**Arguments:**

- `file`: The path to the file to be split.
- `split_size`: The maximum file size for each segment (default = 1000000, 1 MB).
- `output_dir`: The path to the directory where the segments will be saved.

In [None]:
from lexos.cutter.filesplit import Filesplit

fs = Filesplit()

fs.split(
    file="/filesplit_test/longfile.txt",
    split_size=30000000,
    output_dir="/filesplit_test/splits/"
)
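
The general idea of size-based splitting with a manifest can be sketched with the standard library alone. This is not the `Filesplit` code: the helper `split_file`, the part-naming scheme, the manifest name, and the manifest columns are all assumptions made for illustration.

```python
import csv
import tempfile
from pathlib import Path


def split_file(file, split_size, output_dir, manifest_name="manifest.csv"):
    """Illustrative sketch only (hypothetical helper, not the Filesplit code)."""
    source = Path(file)
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    rows = []
    with source.open("rb") as infile:
        index = 1
        while True:
            # Read at most split_size bytes per part.
            chunk = infile.read(split_size)
            if not chunk:
                break
            part = out / f"{source.stem}_{index}{source.suffix}"
            part.write_bytes(chunk)
            rows.append({"filename": part.name, "filesize": len(chunk)})
            index += 1
    # Record the parts so they can be reassembled in order later.
    with (out / manifest_name).open("w", newline="") as manifest:
        writer = csv.DictWriter(manifest, fieldnames=["filename", "filesize"])
        writer.writeheader()
        writer.writerows(rows)
    return rows


with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "longfile.txt"
    sample.write_bytes(b"x" * 2500)
    parts = split_file(sample, split_size=1000, output_dir=Path(tmp) / "splits")
    print(parts)
```

Reading a whole part into memory keeps the sketch short; for very large parts a smaller read buffer would be preferable.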

### `Filesplit.merge()`

The `Filesplit.merge()` method uses the saved metadata file to merge segments of a file previously split using `Filesplit.split()`.

**Arguments:**

- `input_dir`: The path to the directory containing the split files and the manifest file.
- `sep`: The separator string used in the file names (default = "_").
- `output_file`: The path to the file which will contain the merged segments. If not provided, the final merged filename is derived from the split filename and placed in the same `input_dir`.
- `manifest_file`: The path to the manifest file. If not provided, the process will look for the file within the `input_dir`.
- `callback`: A callback function that accepts two arguments: the path to the destination file and the size of the file in bytes.
- `cleanup`: If `True`, all the split files and the manifest file will be deleted after the merge, leaving behind only the merged file.


In [None]:
fs.merge("/filesplit_test/splits/", cleanup=True)
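
A `callback` with the two-argument shape described above might look like the sketch below. The function name `report_merge`, the `merged` list, and the path and size used in the demonstration call are made up for illustration; the function is called directly here so that the cell runs without any split files.

```python
merged = []


def report_merge(path, size):
    """Record and print the merged file's path and its size in bytes."""
    merged.append((path, size))
    print(f"Merged file written to {path} ({size:,} bytes)")


# Called directly for illustration with hypothetical values. With Lexos it
# would instead be passed in through the documented `callback` argument:
# fs.merge("/filesplit_test/splits/", callback=report_merge)
report_merge("/filesplit_test/splits/longfile.txt", 30000000)
```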