# Tokenizing Texts

Lexos uses the `Tokenizer` module to split text into meaningful units called tokens.  
Unlike the Lexos web app, which splits text using simple rules such as whitespace, the API uses **spaCy** language models. These models work across many languages and add useful annotations such as parts of speech, lemmas, and stop word flags.

This tutorial shows how to use the `Tokenizer` class to process one or more texts with pre-trained language models.



# Language Models

Lexos wraps **spaCy**, a Natural Language Processing library, to load and apply language models for tokenization. These models offer rule-based and statistical approaches and return annotated token objects.

> **Important:**  
> While more accurate, language models can consume more memory and may not work well for underrepresented languages.  
> Lexos defaults to `"xx_sent_ud_sm"` (a multilingual model) for broad compatibility.

You can also customize the model used by specifying its name (e.g., `"en_core_web_sm"`).


# Setup


In [None]:
from lexos.tokenizer import Tokenizer

# Default tokenizer using the multilingual model
tokenizer = Tokenizer()

# Alternatively, specify a particular language model
tokenizer = Tokenizer(model="en_core_web_sm")

# 📚 Table of Contents

- [Key Terms](#key-terms)
- [Importing Tokenizer](#importing-tokenizer)
- [Loading Data](#loading-data)
- [Tokenizing a Single Text](#tokenizing-a-single-text)
- [Tokenizing Multiple Texts](#tokenizing-multiple-texts)
- [Changing Language Models](#changing-language-models)
- [Adding/Removing Stop Words](#adding-or-removing-stop-words)
- [Filtering Docs](#filtering-docs)
- [Adjusting spaCy Pipelines](#adjusting-spacy-pipelines)
- [Simple Tokenizers](#simple-tokenizers)
- [Working with Ngrams](#ngrams)
  - [From Text Input](#generating-ngrams-from-text)
  - [From Token List](#from-token-list)
  - [From spaCy Doc](#from-spacy-doc)
  - [Filtering Ngrams](#filtering-options)
  - [Character Ngrams](#character-ngrams)



## Key Terms

Before proceeding, here are some key terms:

- **Text**: A string of characters before any preprocessing.

- **Token**: A unit of text used in natural language processing. Most commonly a token is a word, but it can also be punctuation, a number, or even whitespace, depending on how the text is split.  
  **Example**:  
  Sentence: `"I like NLP!"`  
  Tokens: `['I', 'like', 'NLP', '!']`

- **Document (doc)**: A parsed text returned by the tokenizer. This is a `spacy.tokens.Doc` object containing tokens and their attributes.

- **Language Model**: A model that defines how text is segmented into tokens and what annotations are applied.

- **Pipeline**: A series of NLP tasks applied to the text after tokenization (e.g., part-of-speech tagging, dependency parsing).

- **N-gram**: A sequence of *n* tokens used to analyze patterns or context within texts.  
  - Unigrams: single tokens  
  - Bigrams: 2-token sequences → `['I like', 'like NLP']`  
  - Trigrams: 3-token sequences → `['I like NLP']`

- **Stopword**: A commonly used word often filtered out before processing because it adds little semantic value.  
  **Examples**: `['the', 'is', 'and', 'of']`  
  **Use case**: Remove stopwords to focus on more meaningful words.

- **Filtering**: The process of removing certain types of tokens before generating n-grams or running analysis.

  | Term             | Definition                                                                       | Example / Notes                                        |
  |------------------|----------------------------------------------------------------------------------|--------------------------------------------------------|
  | `filter_stops`   | Removes tokens that are stopwords.                                               | `"This is good"` → `['This', 'good']` (removes `'is'`) |
  | `filter_punct`   | Removes punctuation-only tokens.                                                 | `"Great!"` → `['Great']`                               |
  | `filter_digits`  | Removes tokens that are only numeric digits.                                     | `['test', '2023']` → `['test']`                        |
  | `filter_nums`    | Removes tokens that look like numbers, including decimals and formatted numbers. | `['3.14', '100', 'word']` → `['word']`                 |
  | `min_freq`       | Removes n-grams that appear fewer times than a specified threshold.              | `min_freq=2` removes n-grams that occur only once      |
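
The token and n-gram definitions above can be sketched in plain Python. This is only an illustration of the terms, not how Lexos or spaCy builds n-grams; the helper name `make_ngrams` is hypothetical.

```python
import string


def make_ngrams(tokens, n):
    """Return every run of n consecutive tokens, joined by single spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Tokens for the sentence "I like NLP!" as listed in the Key Terms
tokens = ["I", "like", "NLP", "!"]

# Drop punctuation-only tokens before building n-grams
words = [t for t in tokens if not all(ch in string.punctuation for ch in t)]

bigrams = make_ngrams(words, 2)
trigrams = make_ngrams(words, 3)
print(bigrams)   # ['I like', 'like NLP']
print(trigrams)  # ['I like NLP']
```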


In [None]:
from lexos.tokenizer import Tokenizer

## Loading Data

You can either input text manually or load it from an external file.  
Here, we load in a text file.

To load the data, we'll use the `Loader` module to fetch a text file from GitHub. The file is a short excerpt from "Pride and Prejudice" by Jane Austen.

In [None]:
from lexos.io import loader
loader = loader.Loader()
loader.load(["https://raw.githubusercontent.com/scottkleinman/lexos/refs/heads/main/tests/test_data/txt/Austen_Pride_sm.txt"])
# To load a local file instead, pass its path, e.g. loader.load(["./data/local_text.txt"])
text = loader.texts[0]
text

> **Tip**  
> The `load()` method accepts a list of file paths or URLs. You can mix and match as long as the file type is supported.


### Manual Text Input
Alternatively, you can define a text directly in your code, which is useful for quick testing or demos.

In [None]:
text = "It is a truth universally acknowledged, that a single man in possession of a good fortune, must be in want of a wife."

# Once defined, you can tokenize this text using:
doc = tokenizer.make_doc(text)
[token.text for token in doc]


> **Tip**  
> Manual text input is ideal for experimenting with pipeline behavior or for debugging small examples before scaling up to batch processing.


## Selecting a Model
As mentioned previously, you can choose the model the tokenizer uses, either to get more information from your text or to better fit the language the text is written in. To do this, pass the `model` parameter when creating the `Tokenizer` to override the default model. For this example, we'll use the `"en_core_web_sm"` model, since "Pride and Prejudice" is written in English. This model tags each token with its part of speech, as shown below.

In [None]:
tokenizer_en = Tokenizer(model="en_core_web_sm")
doc = tokenizer_en.make_doc(text)
print("\nTokens with parts of speech:")
for token in doc[0:50]:
    print(f"<{token.text}> : {token.pos_}")

## Adjusting spaCy Pipelines
The size of a text and the number of annotations a language model produces can severely slow down a tokenization call. To speed up processing, you can disable a component of the model with the `remove_extension()` method.

In [None]:
tokenizer_en.remove_extension("tagger")

Similarly, to add or re-enable a component of a model, use the `add_extension()` method.

In [None]:
tokenizer_en.add_extension("tagger", default="default_value")

## Tokenizing a Single Text

Once your text is loaded, you can tokenize it using the `Tokenizer` class.

The recommended approach is the `make_doc()` method, which takes a string and returns a `spacy.tokens.Doc` object. This object contains the original text and a sequence of annotated tokens.

In [None]:
tokenizer_def = Tokenizer()
doc = tokenizer_def.make_doc(text)


Alternatively, you can call the `Tokenizer` instance directly, just like you would with a `spaCy` `Language` object.  
This automatically routes input to either `make_doc()` or `make_docs()` depending on whether a single string or a list of strings is passed:


In [None]:
# Alternatively, call the Tokenizer object directly:
doc = tokenizer_def(text)
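
The routing described above (a single string goes to `make_doc()`, a list goes to `make_docs()`) can be sketched as follows. `MiniTokenizer` is a hypothetical stand-in written for this illustration, not the Lexos class, and its "docs" are plain token lists rather than spaCy objects.

```python
class MiniTokenizer:
    """Hypothetical stand-in for a tokenizer; not the Lexos implementation."""

    def make_doc(self, text):
        # A "doc" here is just a list of whitespace-separated tokens.
        return text.split()

    def make_docs(self, texts):
        # Lazily yield one doc per input text.
        return (self.make_doc(t) for t in texts)

    def __call__(self, texts):
        # A single string is routed to make_doc(); any other iterable
        # of strings is routed to make_docs().
        if isinstance(texts, str):
            return self.make_doc(texts)
        return self.make_docs(texts)


tok = MiniTokenizer()
single = tok("one small text")
batch = list(tok(["first text", "second text"]))
print(single)
print(batch)
```

Checking for `str` first matters because a string is itself iterable; without that check, a single text would be processed character by character.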

After tokenization, you can access the tokens in the returned `Doc` object.  
Here's an example that prints the first 50 tokens:

In [None]:
print("\nTokens:")
for token in doc[0:50]:
    print(f"<{token.text}>")


You can access the original text through `doc.text`. Because it is an ordinary Python string, you can slice it with brackets to get a substring of the original text.

In [None]:
org_text = doc.text[0:100]

## Tokenizing Multiple Texts

The `Tokenizer` class provides the `make_docs()` method to convert a list of raw strings into a list of `spacy.tokens.Doc` objects.  
Each string is tokenized using the language model and returned with full annotations.

> **Note**  
> This method is ideal for batch processing documents.  
> It is functionally similar to calling the `Tokenizer` object directly.


In [None]:
text_sub1 = text[0:100]
text_sub2 = text[100:200]
text_list = [text_sub1, text_sub2]
docs = list(tokenizer_def.make_docs(text_list))

# Alternatively, call Tokenizer object directly:
docs = list(tokenizer_def(text_list))

> **Note**  
> Some tokens may include punctuation or newline characters. To avoid this, you can either:  
> - Scrub the text using `Scrubber` before tokenization  
> - Filter out unwanted tokens after tokenization using token attributes (e.g., `is_punct`, `is_space`)


### Scrub the text before tokenization
Use the `Scrubber` module to clean the text by removing unwanted characters like extra spaces, line breaks, and punctuation.

In [None]:
from lexos.scrubber.scrubber import Scrubber
scrubber = Scrubber()
scrubber.remove_whitespace = True
scrubber.remove_punctuation = True

cleaned_text = scrubber.scrub(text)
doc = tokenizer.make_doc(cleaned_text)
[token.text for token in doc]
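
To make the effect of scrubbing concrete, here is a minimal plain-Python sketch of removing punctuation and collapsing whitespace before tokenization. It uses only the standard library and is not the `Scrubber` implementation; it also handles ASCII punctuation only.

```python
import string

raw = "It is a truth,\n  universally   acknowledged!"

# Delete every ASCII punctuation character
no_punct = raw.translate(str.maketrans("", "", string.punctuation))

# Collapse runs of spaces and line breaks into single spaces
cleaned = " ".join(no_punct.split())

print(cleaned)  # It is a truth universally acknowledged
```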

### Filter tokens after tokenization
You can also leave the text as-is and filter the tokens using spaCy's built-in attributes.

In [None]:
doc = tokenizer.make_doc(text)

# Remove punctuation and whitespace tokens
filtered_tokens = [token.text for token in doc if not token.is_punct and not token.is_space]
filtered_tokens
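
The same filtering idea can be tried without a language model. The sketch below uses a hypothetical regular-expression tokenizer and two helper functions that loosely imitate the `is_punct` and `is_space` token attributes; it is an approximation for experimenting, not spaCy's tokenization.

```python
import re
import string

sample = "Hello,  world!\nGoodbye."

# Hypothetical tokenizer: words, single punctuation marks, or whitespace runs
tokens = re.findall(r"\w+|[^\w\s]|\s+", sample)


def is_punct(tok):
    """True if the token consists only of ASCII punctuation."""
    return all(ch in string.punctuation for ch in tok)


def is_space(tok):
    """True if the token consists only of whitespace."""
    return tok.isspace()


filtered = [t for t in tokens if not is_punct(t) and not is_space(t)]
print(tokens)
print(filtered)  # ['Hello', 'world', 'Goodbye']
```

Unlike scrubbing, this leaves the original text untouched, so the unfiltered tokens remain available if you need them later.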

## Adding or Removing Stop Words
Stop words are words that are to be excluded from the list of tokens. These are generally words like "the" and "and", but they can be any words that should be ignored for the purposes of the desired analysis.

To add stop words to a `Tokenizer` instance, use the `add_stopwords()` method to pass a list of words that will act as stop words before tokenizing.

> **Note:**  
> Some models, such as `"en_core_web_sm"`, have built-in stop words. To remove these, or any other stop words, use the `remove_stopwords()` method before tokenizing.

In [None]:
test_text = "This is a test of stopwords in the tokenizer class."

# Adding stopwords
tokenizer_def.add_stopwords(["is", "the", "of"])
stop_doc = tokenizer_def.make_doc(test_text)
for token in stop_doc[0:50]:
    # Print token text formatted with padding for alignment
    # and whether it is a stopword
    print(f"Token: {token.text:<12}    Stopword: {token.is_stop}")
print("\n================================\n")

# Removing stopwords
tokenizer_def.remove_stopwords(["is", "the", "of"])
stop_doc = tokenizer_def.make_doc(test_text)
for token in stop_doc[0:50]:
    # Print token text formatted with padding for alignment
    # and whether it is a stopword
    print(f"Token: {token.text:<12}    Stopword: {token.is_stop}")
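
Conceptually, adding and removing stop words amounts to editing a set that each token is checked against. The sketch below shows that idea in plain Python; the function names mirror the methods above, but the code is a simplified model written for this example, not the Lexos implementation.

```python
stopwords = set()


def add_stopwords(words):
    # Store lowercase forms so matching is case-insensitive
    stopwords.update(w.lower() for w in words)


def remove_stopwords(words):
    stopwords.difference_update(w.lower() for w in words)


def flag(tokens):
    """Pair each token with True/False depending on stop word membership."""
    return [(t, t.lower() in stopwords) for t in tokens]


tokens = "This is a test of stopwords in the tokenizer class".split()

add_stopwords(["is", "the", "of"])
flagged = flag(tokens)

remove_stopwords(["is", "the", "of"])
unflagged = flag(tokens)

print([t for t, is_stop in flagged if is_stop])    # ['is', 'of', 'the']
print([t for t, is_stop in unflagged if is_stop])  # []
```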

## Simple Tokenizers
Along with the language-model-based tokenizer, the `tokenizer` module also contains two simple tokenizers: `SliceTokenizer` and `WhitespaceTokenizer`.

`SliceTokenizer` slices the text into tokens of *n* characters. The constructor takes two arguments: `n`, the number of characters in each token, and `drop_ws`, which controls whether whitespace is dropped or kept.

In [None]:
from lexos.tokenizer import SliceTokenizer
test_text = "Cut me up into tiny pieces!"
slicer = SliceTokenizer(n=4, drop_ws=True)
slices = slicer(test_text)
print(slices)
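
For intuition, fixed-width slicing can be sketched in a few lines of plain Python. The function `slice_tokens` is hypothetical, and its reading of `drop_ws` (strip whitespace from the edges of each slice and discard slices that end up empty) is an assumption; the actual `SliceTokenizer` may treat whitespace differently.

```python
def slice_tokens(text, n, drop_ws=True):
    """Cut text into consecutive, non-overlapping slices of n characters."""
    slices = [text[i:i + n] for i in range(0, len(text), n)]
    if drop_ws:
        # Assumed behavior: trim edge whitespace and drop empty slices
        slices = [s.strip() for s in slices]
        slices = [s for s in slices if s]
    return slices


sample = "Cut me up into tiny pieces!"
kept = slice_tokens(sample, n=4, drop_ws=False)
dropped = slice_tokens(sample, n=4, drop_ws=True)
print(kept)     # ['Cut ', 'me u', 'p in', 'to t', 'iny ', 'piec', 'es!']
print(dropped)  # ['Cut', 'me u', 'p in', 'to t', 'iny', 'piec', 'es!']
```

Note that the final slice is shorter than `n` when the text length is not a multiple of `n`.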

`WhitespaceTokenizer` simply splits a text into tokens on whitespace, much like Python's built-in `str.split()` method.

In [None]:
from lexos.tokenizer import WhitespaceTokenizer
test_text = "Split me up by whitespace!"
neatSlicer = WhitespaceTokenizer()
slices = neatSlicer(test_text)
print(slices)
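
For comparison, this is how Python's own `str.split()` behaves. The snippet only demonstrates the standard-library method that the text compares against; it makes no claim about edge cases in `WhitespaceTokenizer` itself.

```python
sample = "Split  me up\nby whitespace!"

# With no argument, split() treats any run of whitespace as one separator
default_split = sample.split()

# With an explicit " ", each single space is a separator, so doubled spaces
# produce empty strings and the newline is left inside a token
space_split = sample.split(" ")

print(default_split)  # ['Split', 'me', 'up', 'by', 'whitespace!']
print(space_split)    # ['Split', '', 'me', 'up\nby', 'whitespace!']
```

In both cases punctuation stays attached to the neighboring word, which is the main difference from language-model tokenization.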

## Ngrams

The tokenizer module contains a submodule named `ngrams`, whose `Ngrams` class lets you generate ngrams (sequences of consecutive tokens) from either raw text or tokenized documents.  
These are useful for analyzing patterns, building frequency models, or studying frequent phrases in your texts.

Ngrams can be created using:

- A pre-tokenized `spacy.tokens.Doc` object  
- Raw text input  
- Character-based slicing
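
As an illustration of the character-based option, character ngrams can be produced with a sliding window over the string. This is a generic plain-Python sketch with a hypothetical helper name, not the Lexos API for character ngrams.

```python
def char_ngrams(text, n):
    """Return every overlapping window of n consecutive characters."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


trigrams = char_ngrams("to be", 3)
print(trigrams)  # ['to ', 'o b', ' be']
```

Each window starts one character after the previous one, so a string of length *L* yields *L - n + 1* ngrams.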

To use it, import the `Ngrams` class from the `lexos.tokenizer.ngrams` module.

> **Note**  
> An ngram is a contiguous sequence of *n* items from a given text.  
> For example, 2-grams (bigrams) from `"I like to eat"` would be:  
> `"I like"`, `"like to"`, `"to eat"`

In [None]:
from lexos.tokenizer.ngrams import Ngrams

# Initialize the Ngrams object
ngrams = Ngrams()

### Generating Ngrams from Text

To generate ngrams directly from a raw string (before tokenization), use the `ngrams.from_text()` function.  
This function automatically tokenizes the input and returns ngrams as strings or tokens depending on the `output` parameter.

> **Note:**  
> The default `n` value is 2, and the default `output` value is `"text"`.

In [None]:
out_ngrams = ngrams.from_text(text=text, output="text", n=3, tokenizer=tokenizer_def, drop_ws=True)
ngrams_list = list(out_ngrams)
for ngram in ngrams_list[0:10]:
    print(ngram)

If you have a list of texts, you can use `from_texts()` to generate ngrams for each text in the list.

In [None]:
out_ngrams = ngrams.from_texts(texts=text_list, output="text", n=3, tokenizer=tokenizer_def, drop_ws=True)
# Use a distinct loop variable so the `doc` created earlier is not overwritten
for text_ngrams in out_ngrams:
    ngram_list = list(text_ngrams)
    for ngram in ngram_list[0:10]:
        print(ngram)

### From Token List

If you already have a list of tokens (e.g. from simple tokenization), you can use `from_tokens()`.

> **Note:**  
> `from_tokens()` does not support `"spans"` as an output.

In [None]:
tokens_list = [token.text for token in doc]
out_ngrams = ngrams.from_tokens(tokens=tokens_list, output="tuples", n=3, drop_ws=True)
ngrams_list = list(out_ngrams)
for ngram in ngrams_list[0:10]:
    print(ngram)

If you have a list of token lists, you can use `from_token_lists()` to generate the selected output for each list.

In [None]:
tokens_list_1 = [token.text for token in docs[0]]
tokens_list_2 = [token.text for token in docs[1]]
out_ngrams = ngrams.from_token_lists([tokens_list_1, tokens_list_2], output="tuples", n=3, drop_ws=True)
# Use a distinct loop variable so earlier `doc` objects are not overwritten
for token_ngrams in out_ngrams:
    ngram_list = list(token_ngrams)
    for ngram in ngram_list[0:10]:
        print(ngram)

### From a spaCy Doc

When using spaCy documents (produced by Tokenizer), you can generate spans, tuples, or text.

In [None]:
doc = tokenizer_def.make_doc(text)
out_ngrams = ngrams.from_doc(doc=doc, output="tuples", n=3)
ngrams_list = list(out_ngrams)
for ngram in ngrams_list[0:10]:
    print(ngram)

If you have a list of spaCy docs, you can use `from_docs()` to create ngrams from each doc in the list.

In [None]:
out_ngrams = ngrams.from_docs(docs=docs, output="tuples", n=3)
# Use a distinct loop variable so the `doc` created above stays available
for doc_ngrams in out_ngrams:
    ngram_list = list(doc_ngrams)
    for ngram in ngram_list[0:10]:
        print(ngram)

### Filtering Options

To remove unwanted tokens such as stop words, punctuation, or digits, you can either filter tokens manually using their attributes or use the built-in filters available in the `ngrams` module.

Both approaches help clean your text before further analysis, such as generating ngrams.

#### Manual Filtering with Token Attributes

You can use a list comprehension to build a filtered version of the text using spaCy token attributes, and then re-tokenize the result.

In [None]:
# Filter out stop words, punctuation, and whitespace
filtered_tokens = [
    token.text for token in doc 
    if not token.is_stop and not token.is_punct and not token.is_space
]

# join tokens and create a new doc
filtered_text = " ".join(filtered_tokens)
filtered_doc = tokenizer_def.make_doc(filtered_text)

# View the first 38 tokens
for token in filtered_doc[0:38]:
    print(f"<{token.text}>")


> **Note**  
> Apart from `text` and boolean `is_` attributes such as `is_stop`, `is_punct`, and `is_space`, attributes such as `pos` and `ent_type` need a trailing underscore (`_`) to return human-readable values.  
> 
> For example:  
> - `token.pos` → returns a numerical ID  
> - `token.pos_` → returns the actual part of speech (e.g., `'NOUN'`)


#### Filtering Ngrams with Built-in Parameters

If you're generating ngrams with the `ngrams.from_doc()` method, you can apply filters directly by setting its filter parameters.

The `Ngrams` class includes the following filters to clean the text:

- **filter_stops**: removes stopwords

- **filter_digits**: removes tokens that are digits

- **filter_nums**: removes tokens that are numbers or number-like

- **filter_punct**: removes punctuation

- **drop_ws**: removes whitespace

- **min_freq**: removes ngrams that occur fewer times than the specified frequency
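
The frequency threshold can be pictured with a small counting example. This is a plain-Python sketch of the idea behind `min_freq` using `collections.Counter` and made-up bigrams; how the `Ngrams` class applies the threshold internally is not shown here.

```python
from collections import Counter

# Made-up bigrams for illustration
bigrams = ["of the", "the cat", "of the", "a dog", "the cat", "of the"]

counts = Counter(bigrams)
min_freq = 2

# Keep only the bigrams that occur at least min_freq times
kept = [g for g in bigrams if counts[g] >= min_freq]

print(counts["of the"], counts["the cat"], counts["a dog"])  # 3 2 1
print(kept)
```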

> **Note:**  
> Although they seem similar, `filter_nums` and `filter_digits` filter out two separate types of tokens. `filter_digits` filters out tokens made up only of numerical digits (e.g. 0, 3, 8), while `filter_nums` is broader and filters out number-like tokens, including spelled-out numerical values (e.g. 'zero', 'three', 'eight').
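
The difference between the two filters can be sketched with two small predicates. They are loosely modeled on spaCy's `is_digit` and `like_num` token attributes; whether the Lexos filters map onto exactly these attributes is an assumption, and `NUMBER_WORDS` is a deliberately tiny, hypothetical word list.

```python
# Hypothetical, incomplete list of spelled-out numbers
NUMBER_WORDS = {"zero", "one", "two", "three", "eight", "ten"}


def is_digit(tok):
    """True only for tokens made up entirely of digits."""
    return tok.isdigit()


def like_num(tok):
    """True for digits, simple decimals or grouped numbers, and number words."""
    cleaned = tok.replace(",", "").replace(".", "")
    return cleaned.isdigit() or tok.lower() in NUMBER_WORDS


tokens = ["test", "2023", "3.14", "1,000", "ten"]

without_digits = [t for t in tokens if not is_digit(t)]
without_nums = [t for t in tokens if not like_num(t)]

print(without_digits)  # ['test', '3.14', '1,000', 'ten']
print(without_nums)    # ['test']
```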

In [None]:
doc = tokenizer_def.make_doc("This is test ten of 10.")
print(list(ngrams.from_doc(doc, output="text", filter_digits=True)))
# Output: ['This is', 'is test', 'test ten', 'ten of']

doc = tokenizer_def.make_doc("This test includes 100%, punctuation, and stopwords.")
print(list(ngrams.from_doc(doc, output="text", filter_digits=True, filter_punct=True, filter_stops=True)))
# Output: ['This test', 'test includes', 'and stopwords']

To change the size of the ngrams, set the `n` parameter. For example, `n=2` produces bigrams, `n=3` produces trigrams, and so on.

In [None]:
# Create trigrams (n=3)
ngrams = Ngrams()
doc = tokenizer_def.make_doc("It is a beautiful day outside.")
print(list(ngrams.from_doc(doc, n=3, output="text")))
# Output: ['It is a', 'is a beautiful', 'a beautiful day', 'beautiful day outside']