# `Tokenizer` Tutorial
   
This notebook shows how to use the `tokenizer` to divide texts into "tokens". A token is a countable entity which serves as the basis for computational analysis. Most of the time, tokens will correspond to words, but they may also be characters, punctuation marks, or even spaces.

Once a text is tokenised, it is possible to generate a list of unique token forms in the text, generally known as "types" or "terms" (we use the latter here). The frequency with which individual terms occur in a text is often revealing about the style, meaning, or authorship of the text.
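
As a minimal sketch of the relationship between tokens and terms, the following uses only the Python standard library. The token list is hand-made for illustration and is not the output of the Lexos `Tokenizer`.

```python
from collections import Counter

# A hand-made token list, standing in for the output of a tokeniser.
tokens = ["the", "cat", "sat", "on", "the", "mat", "."]

# Counter maps each unique token form (term) to its frequency.
term_counts = Counter(tokens)

# The terms are the unique token forms, sorted here for readability.
terms = sorted(term_counts)

print(terms)
print(term_counts.most_common(1))
```

Seven tokens yield six terms here, because "the" occurs twice.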

## About `Tokenizer`

It is possible to produce a list of word tokens for an English text by dividing the text on every white space using a tool like Python's `str.split()` method. However, this does not always work (punctuation, for example, stays attached to the neighbouring word), and it may work far less well for some other languages. The Lexos `Tokenizer` takes advantage of "language models" to automate the process. Language models can implement both rule-based and probabilistic strategies for separating document strings into tokens. Because they have built-in procedures appropriate to specific languages, language models can often do a better job of tokenisation than the approach used in the Lexos app.
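
To make the limits of whitespace splitting concrete, here is a small illustration using only built-in Python. The sentence is adapted from the passage used later in this notebook.

```python
sentence = '"My dear Mr. Bennet," said his lady.'

# Split on runs of white space only; no language knowledge is involved.
naive_tokens = sentence.split()

print(naive_tokens)
```

Quotation marks, commas, and full stops remain glued to the words (`'"My'`, `'Bennet,"'`, `'lady.'`), so "lady." and "lady" would be counted as different terms.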

There are some trade-offs to using language models. Because the algorithm does more than split strings, processing times can be greater. In addition, tokenisation is no longer (explicitly) language agnostic. A language model is "opinionated" and it may overfit the data. Likewise, if no language model exists for the language being tokenised, the results may not be satisfactory. The Lexos strategy for handling this situation is described below.

Behind the scenes, Lexos uses the Python spaCy library to tokenise texts. The result is called a spaCy doc (short for document). Each spaCy doc has a `text` attribute (which is the original text) and a list of tokens, each with its own attributes. spaCy has a lot of built-in attributes: things like `is_punct` (whether or not the token is a punctuation mark) or `is_digit` (whether or not the token consists of digits). You can see a <a href="https://spacy.io/api/token#attributes" target="_blank">complete list</a> in the spaCy documentation. Depending on how the language model has been trained, you may get more or less information. For instance, spaCy's "en_core_web_sm" English-language model tags the part of speech of every word. You can load this (or another) model into the Lexos `Tokenizer` if you want. However, Lexos does not assume that you are working in English, so the default model is spaCy's "xx_sent_ud_sm" multilanguage model, which does a good job finding sentence and token boundaries for a wide variety of languages but does not provide as much information as some of the other models.

You can see whether spaCy has a model for your language on the <a href="https://spacy.io/models" target="_blank">spaCy models</a> webpage. You can load any of these models into the Lexos `Tokenizer`, but you will need to download the model first by copying the code provided on that page.

This is probably enough information to get started, so let's get to work.

## Load Some Data

We'll start by loading some data using the `Loader` module. We're going to take the first 1245 characters of Jane Austen's _Pride and Prejudice_. Why 1245? Because it's a relatively small passage that we can process quickly (the full novel would take much longer) and because character 1245 comes at the end of a sentence.

In [27]:
from lexos.io.smart import Loader

loader = Loader()
loader.load("../test_data/txt/Austen_Pride.txt")
text = loader.texts[0].strip()[0:1245]
text


'Pride and Prejudice\nby Jane Austen\nChapter 1\nIt is a truth universally acknowledged, that a single man in possession of a good fortune, must be in want of a wife.\nHowever little known the feelings or views of such a man may be on his first entering a neighbourhood, this truth is so well fixed in the minds of the surrounding families, that he is considered the rightful property of some one or other of their daughters.\n"My dear Mr. Bennet," said his lady to him one day, "have you heard that Netherfield Park is let at last?"\nMr. Bennet replied that he had not.\n"But it is," returned she; "for Mrs. Long has just been here, and she told me all about it."\nMr. Bennet made no answer.\n"Do you not want to know who has taken it?" cried his wife impatiently.\n"_You_ want to tell me, and I have no objection to hearing it."\nThis was invitation enough.\n"Why, my dear, you must know, Mrs. Long says that Netherfield is taken by a young man of large fortune from the north of England; that he c

## Import `Tokenizer`

Now we'll attempt to tokenise this text. We'll start by importing the `tokenizer` module.

In [29]:
from lexos import tokenizer

## Making a Doc

`Tokenizer` has a function called `make_doc()` to which we can feed our text. Remember that by default, `Tokenizer` uses spaCy's "xx_sent_ud_sm" multilanguage model.

We can view the doc's original text by referencing `doc.text`, or we can print out the text of each token in the document using a `for` loop. In the example below, we will only print out snippets of the text and tokens. We enclose tokens in angle brackets for greater visibility and to show that some tokens are line breaks or punctuation marks (something we may have to deal with by scrubbing our text first or by filtering our tokens later).

In [39]:
doc = tokenizer.make_doc(text)
print("\nText:")
print("=====")

print(doc.text[0:100])

print("\nTokens:")
print("=======")
for token in doc[0:60]:
    print(f"<{token.text}>")


Text:
=====
Pride and Prejudice
by Jane Austen
Chapter 1
It is a truth universally acknowledged, that a single m

Tokens:
<Pride>
<and>
<Prejudice>
<
>
<by>
<Jane>
<Austen>
<
>
<Chapter>
<1>
<
>
<It>
<is>
<a>
<truth>
<universally>
<acknowledged>
<,>
<that>
<a>
<single>
<man>
<in>
<possession>
<of>
<a>
<good>
<fortune>
<,>
<must>
<be>
<in>
<want>
<of>
<a>
<wife>
<.>
<
>
<However>
<little>
<known>
<the>
<feelings>
<or>
<views>
<of>
<such>
<a>
<man>
<may>
<be>
<on>
<his>
<first>
<entering>
<a>
<neighbourhood>
<,>
<this>
<truth>
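
The line-break and punctuation tokens visible in the output above can be filtered out afterwards. The following is a minimal sketch on plain strings copied from that output; the helper `is_word_like` is a hypothetical name introduced here and is not part of Lexos or spaCy.

```python
import string

# Token strings copied from the start of the output above.
tokens = ["Pride", "and", "Prejudice", "\n", "by", "Jane", "Austen", "\n",
          "Chapter", "1", "\n", "It", "is", "a", "truth", "universally",
          "acknowledged", ","]


def is_word_like(token_text: str) -> bool:
    """Return False for tokens that are only white space or only punctuation."""
    stripped = token_text.strip()
    if not stripped:
        return False
    return not all(ch in string.punctuation for ch in stripped)


filtered = [t for t in tokens if is_word_like(t)]
print(filtered)
```

On a real spaCy doc, the token attributes `is_space` and `is_punct` serve the same purpose without any string inspection.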


## Specifying a Language Model

You can specify a language model with the `model` parameter. In the example below, we load the "en_core_web_sm" model. Notice how much longer it takes to tokenise, but notice also how much more information we get out of the model (in this example, we are printing out the part of speech for each token).

In [41]:
doc = tokenizer.make_doc(text, model="en_core_web_sm")
for token in doc[0:60]:
    print(f"<{token.text}>: {token.pos_}")

<Pride>: PROPN
<and>: CCONJ
<Prejudice>: PROPN
<
>: SPACE
<by>: ADP
<Jane>: PROPN
<Austen>: PROPN
<
>: SPACE
<Chapter>: NOUN
<1>: NUM
<
>: SPACE
<It>: PRON
<is>: AUX
<a>: DET
<truth>: NOUN
<universally>: ADV
<acknowledged>: VERB
<,>: PUNCT
<that>: SCONJ
<a>: DET
<single>: ADJ
<man>: NOUN
<in>: ADP
<possession>: NOUN
<of>: ADP
<a>: DET
<good>: ADJ
<fortune>: NOUN
<,>: PUNCT
<must>: AUX
<be>: AUX
<in>: ADP
<want>: NOUN
<of>: ADP
<a>: DET
<wife>: NOUN
<.>: PUNCT
<
>: SPACE
<However>: ADV
<little>: ADJ
<known>: VERB
<the>: DET
<feelings>: NOUN
<or>: CCONJ
<views>: NOUN
<of>: ADP
<such>: DET
<a>: DET
<man>: NOUN
<may>: AUX
<be>: AUX
<on>: ADP
<his>: PRON
<first>: ADJ
<entering>: VERB
<a>: DET
<neighbourhood>: NOUN
<,>: PUNCT
<this>: DET
<truth>: NOUN
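
Once tokens carry part-of-speech tags, they can be counted or filtered by tag. This is a small sketch using pairs copied by hand from the output above rather than a live spaCy doc.

```python
from collections import Counter

# (token, part of speech) pairs copied from the output above.
tagged = [("It", "PRON"), ("is", "AUX"), ("a", "DET"), ("truth", "NOUN"),
          ("universally", "ADV"), ("acknowledged", "VERB"), (",", "PUNCT"),
          ("that", "SCONJ"), ("a", "DET"), ("single", "ADJ"), ("man", "NOUN")]

# Tally how often each part of speech occurs.
pos_counts = Counter(pos for _, pos in tagged)

# Keep only the tokens tagged as nouns.
nouns = [tok for tok, pos in tagged if pos == "NOUN"]

print(pos_counts)
print(nouns)
```

With a doc in hand, the same pairs could be built with `[(token.text, token.pos_) for token in doc]`.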


## Tokenising Multiple Texts

You can use the `Tokenizer.make_docs()` function to process multiple texts at once. In this example, we are just going to cut our text roughly in half to make two separate texts and then convert them to spaCy docs. Although we are going to feed our texts to `make_docs()` in a list in this example, recall that `loader.texts` _is_ a list, so a common procedure would be to call `make_docs(loader.texts)`.

In [47]:
text1 = text[0:565]
text2 = text[565:1245]

docs = tokenizer.make_docs([text1, text2])

'Pride and Prejudice\nby Jane Austen\nChapter 1\nIt is a truth universally acknowledged, that a single man in possession of a good fortune, must be in want of a wife.\nHowever little known the feelings or views of such a man may be on his first entering a neighbourhood, this truth is so well fixed in the minds of the surrounding families, that he is considered the rightful property of some one or other of their daughters.\n"My dear Mr. Bennet," said his lady to him one day, "have you heard that Netherfield Park is let at last?"\nMr. Bennet replied that he had not.\n"'

In [6]:
# Multiple texts can be tokenised in one call using tokenizer.make_docs(), which returns a list of docs.
docs = tokenizer.make_docs(loader.texts)

## Disabling or Excluding Components

Part of the reason some language models take a long time is that they have numerous components that handle tasks such as tagging parts of speech, identifying syntactic dependencies, or labelling named entities like people and places. The documentation on the <a href="https://spacy.io/models" target="_blank">spaCy models</a> webpage will identify which components are available in the model's pipeline. If you do not need a component, you can speed up processing times by disabling or excluding components you do not intend to use. Disabled components (listed with the `disable` parameter) will be loaded but unused, and excluded components (listed with the `exclude` parameter) will not be loaded.

Try out the examples below by commenting and uncommenting them, and compare how long each takes with how long tokenisation takes when all pipeline components are enabled and included. You won't notice a lot of difference on a short text, but you will on a longer one.

In [None]:
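# Components listed in `disable` are loaded but not run.
doc = tokenizer.make_doc(loader.texts[0], model="en_core_web_sm", disable=["tagger", "parser"])

# Components listed in `exclude` are not loaded at all.
# doc = tokenizer.make_doc(loader.texts[0], model="en_core_web_sm", exclude=["tagger", "parser"])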
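
One way to make the comparison is to time each call explicitly. The helper below is a generic sketch using only the standard library; `time_call` is a hypothetical name, and the sorting workload is a stand-in so that the block runs on its own.

```python
import time


def time_call(func, *args, **kwargs):
    """Call func once and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


# Stand-in workload: sort 100,000 integers supplied in descending order.
result, elapsed = time_call(sorted, range(100_000, 0, -1))
print(f"{elapsed:.4f} seconds")
```

In this notebook, the stand-in could be swapped for the tokeniser call, for example `time_call(tokenizer.make_doc, text, model="en_core_web_sm", disable=["tagger", "parser"])`, and the elapsed times compared across settings.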

## Stop Words

A stop word (or "stopword") is a token that you typically wish to remove from your analysis, generally because the token is not a carrier of meaning. Stop words are generally small function words like "and" or "the", but they can also be words like personal names, where the inclusion of those names might skew your data in your intended analysis.

Stop words can be added or removed with `add_stopwords` and `remove_stopwords`. Since the default language model is a multilanguage model, it has no built-in stop words. Models for specific languages generally have built-in stop word lists which you can modify.

In [49]:
text = "This is an example string to test the tokenizer."

doc1 = tokenizer.make_doc(
    text,
    add_stopwords=["an", "the", "is"]
)
print("\nDefault model with stop words added:\n")
for token in doc1:
    print(token.text, token.is_stop)

doc2 = tokenizer.make_doc(
    text,
    model="en_core_web_sm",
    remove_stopwords=["is", "the"]
)
print("\nEnglish language model with stop words removed:\n")
for token in doc2:
    print(token.text, token.is_stop)


Default model with stop words added:

This False
is True
an True
example False
string False
to False
test False
the True
tokenizer False
. False

English language model with stop words removed:

This True
is False
an True
example False
string False
to True
test False
the False
tokenizer False
. False
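
The `is_stop` flag is what makes it possible to drop stop words later. Here is a minimal sketch using (token, flag) pairs copied from the first listing above, so that it runs without loading a model.

```python
# (token text, is_stop) pairs copied from the default-model output above.
flagged = [("This", False), ("is", True), ("an", True), ("example", False),
           ("string", False), ("to", False), ("test", False), ("the", True),
           ("tokenizer", False), (".", False)]

# Keep only the tokens that are not flagged as stop words.
content_tokens = [text for text, is_stop in flagged if not is_stop]

print(content_tokens)
```

On the doc itself, the equivalent filter would be `[token.text for token in doc1 if not token.is_stop]`.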


## Generating Word Ngrams

In [None]:
text = "This is an example string to test the tokenizer component"
doc = tokenizer.make_doc(text)
ngrams = tokenizer.ngrams_from_doc(doc, size=2)
for ngram in ngrams:
    print(ngram)
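
For readers new to ngrams, the idea is a sliding window over the token sequence. The sketch below shows it in plain Python; it is not the Lexos implementation, `word_ngrams` is a hypothetical helper, and the exact form of the items returned by `ngrams_from_doc` may differ.

```python
def word_ngrams(tokens, size=2):
    """Return every run of `size` consecutive tokens, joined by a space."""
    return [" ".join(tokens[i:i + size]) for i in range(len(tokens) - size + 1)]


# Whitespace splitting is enough here because the string has no punctuation.
tokens = "This is an example string to test the tokenizer component".split()
bigrams = word_ngrams(tokens, size=2)

for bigram in bigrams:
    print(bigram)
```

Ten tokens produce nine bigrams, from "This is" to "tokenizer component".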

In [9]:
# An alternative way to create word ngrams is to use textacy directly. This method has additional options,
# documented here: https://textacy.readthedocs.io/en/latest/api_reference/extract.html#textacy.extract.basics.ngrams
from textacy.extract.basics import ngrams as ng
text = "The end is nigh."
doc = tokenizer.make_doc(text)
ngrams = list(ng(doc, 2, min_freq=1))

## Generating Docs From Ngrams
`ngrams_from_doc` generates a list of ngrams from a doc. If you want to use the ngrams as a doc, you will need to generate a new doc with `doc_from_ngrams`.

In [None]:
nDoc = tokenizer.doc_from_ngrams(ngrams, strict=True, model="en_core_web_sm")
for token in nDoc:
    print(token.text)

## Generating Character Ngrams
Character ngrams are generated from untokenised text, that is, from a plain string rather than a doc.

In [None]:
text = "This is an example string to test the tokenizer"
chNgrams = tokenizer.generate_character_ngrams(text, 2, drop_whitespace=False)
for ngram in chNgrams:
    print(ngram)
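
The underlying idea is a sliding window over the characters of the string. The following plain-Python sketch illustrates it; `character_ngrams` is a hypothetical helper, and whether its output matches `generate_character_ngrams` exactly (for instance, in how white space is handled) is an assumption.

```python
def character_ngrams(text, size=2):
    """Return every run of `size` consecutive characters, white space included."""
    return [text[i:i + size] for i in range(len(text) - size + 1)]


chars = character_ngrams("This is", size=2)
print(chars)
```

A string of seven characters produces six character bigrams, two of which contain the space between the words.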