## Example: Definition and Configuration for Tweet Tokenization

Let's define tweet tokenization as follows:

* Separate tokens on
    * Whitespace
    * Symbols, except as described below
        * Where each symbol becomes its own token
    * CamelCase, except for special cases described below
* Keep Twitter "special" character prefixes with tokens
    * User (@, .@), Hashtag (#), Signature (^), Cashtag ($)
    * Don't split CamelCase inside "special" Twitter tokens
* Keep each url as a token
* Keep each email address as a token
* Keep embedded apostrophes in a token
    * e.g., "don't", "isn't", "can't", "John's", etc.
* Keep all non-whitespace spans of characters that contain a digit together in a token
    * Except for trailing punctuation

Example:

"@Mary, check out John's AmazingMuscleCar for $100K! at http://johnscar.com opinions@research.com! #LiveTheDream"

Tokenizes to:
* @Mary
* ,
* check
* out
* John's
* Amazing
* Muscle
* Car
* for
* $100K
* !
* at
* http://johnscar.com
* opinions@research.com
* !
* #LiveTheDream

So, we need to define features and transformations for these rules.

### Point Rule Transformations

First, we can define features for identifying single split points in a string and corresponding masks for the transformation:

|Rule|Features|Description|Mask|
|---|---|---|---|
|Split on whitespace|Space|char.isspace()|Space|
|Split on symbols|Symbol|non-space, non-alphanum|Symbol|
|Split on CamelCase|Upper, Lower, NextLower, PrevLower|split at upper following lower or at upper followed by lower|(Upper & PrevLower) \| (Upper & NextLower)| 

The transformation for these rules generates a mask that "or"s together the split features of all rules, where the features within a single rule may be "and"ed.

For example, the CamelCase rule "and"s Upper with PrevLower and Upper with NextLower, then "or"s the two results.

The combined transformation mask for the point rules is:

$$ \begin{equation*} \mathbf{PointTransforms} = (Space + Symbol + (Upper * PrevLower) + (Upper * NextLower)) \end{equation*} $$

Note that the logical "and" and "or" operations correspond to the mathematical "\*" and "+" operations, respectively.
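As a minimal sketch of the point rules, the following builds each character feature as a boolean NumPy vector and combines them with "+" and "\*". The `shift` and `point_split_mask` helpers are hypothetical names used for illustration only; they are not part of LaTok, which builds its feature matrix in a c-extension.

```python
import numpy as np


def shift(vec, offset):
    """Shift a boolean vector right (offset > 0) or left (offset < 0), padding with False."""
    out = np.zeros_like(vec)
    if offset > 0:
        out[offset:] = vec[:-offset]
    elif offset < 0:
        out[:offset] = vec[-offset:]
    else:
        out[:] = vec
    return out


def point_split_mask(text):
    """Boolean vector that is True wherever a point rule asks for a split."""
    space = np.array([c.isspace() for c in text], dtype=bool)
    alnum = np.array([c.isalnum() for c in text], dtype=bool)
    upper = np.array([c.isupper() for c in text], dtype=bool)
    lower = np.array([c.islower() for c in text], dtype=bool)

    symbol = ~space & ~alnum       # non-space, non-alphanumeric
    prev_lower = shift(lower, 1)   # previous character is lowercase
    next_lower = shift(lower, -1)  # next character is lowercase

    # On boolean arrays, "+" acts as logical "or" and "*" as logical "and".
    return space + symbol + (upper * prev_lower) + (upper * next_lower)


mask = point_split_mask("AmazingMuscleCar for")
print(np.flatnonzero(mask))  # positions of 'A', 'M', 'C' and the space
```

Keeping every feature as a vector of the same length as the input is what lets a whole rule set collapse into one arithmetic expression.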

### Block Rule Transformations

Some rules keep groups of characters together even though other rules would split them apart. For example, splitting on symbols would break apart urls, email addresses, and Twitter special tokens. For these rules we define "block" masks, which mark a block of consecutive characters with "0"s, meaning "don't split". A block mask is "and"ed with the other masks, which prevents any split from within the block's span.

For these, we define the $\mathbf{block\_mask(locator\_mask, endpoints\_mask)}$ function, which generates a mask of $\mathbf{10\ldots0}$ between endpoints for each span in which the locator is present.
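A pure-Python sketch of one way such a function could behave is shown below. It assumes that "$\mathbf{10\ldots0}$" means the first character of a located span keeps a 1 (so a split at the start of the block survives), the rest of the span is 0, and every position outside a located span is 1. This is an illustration under those assumptions, not LaTok's c implementation.

```python
import numpy as np


def block_mask(locator, endpoints):
    """Return a mask that is False inside every endpoint-delimited span containing a locator hit.

    Assumption: the first character of such a span stays True ("10...0"),
    and all positions outside located spans stay True.
    """
    locator = np.asarray(locator, dtype=bool)
    endpoints = np.asarray(endpoints, dtype=bool)
    n = len(locator)
    mask = np.ones(n, dtype=bool)
    start = 0
    while start < n:
        if endpoints[start]:
            start += 1
            continue
        # Find the end of the current run of non-endpoint characters.
        end = start
        while end < n and not endpoints[end]:
            end += 1
        if locator[start:end].any():
            mask[start + 1:end] = False  # keep the leading 1, zero the rest
        start = end
    return mask


text = "see http://a.com now"
endpoints = np.array([c.isspace() for c in text])
locator = np.array([c == ":" for c in text])  # simplified stand-in for the url locator
mask = block_mask(locator, endpoints)
print("".join("1" if bit else "0" for bit in mask))
```

"And"ing this mask with a point mask removes the symbol splits at ":", "/" and "." while leaving the surrounding whitespace splits untouched.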

|Rule|Features|Description|Mask|
|---|---|---|---|
|Keep twitter specials together|Twitter, PrevSpace, NextAlpha, ., Next_@, AfterNextAlpha|Twitter special char following a space character and preceding an alphabetical character or matching the " .@a" pattern, where "a" is any alpha.|$\mathbf{block\_mask}$((Twitter & PrevSpace & NextAlpha) \| (. & PrevSpace & Next_@ & AfterNextAlpha), Space)|
|Keep urls as a single token|:, Next_/, AfterNext_/, PrevAlpha, Space|Locate a url by finding "a://", where "a" is any alpha and span from preceding to subsequent space|$\mathbf{block\_mask}$(: & Next_/ & AfterNext_/ & PrevAlpha, Space)|
|Keep email addresses as a single token|@, PrevAlpha, NextAlpha, Space|Locate an email address by finding an at sign (@) embedded between two alphas.|$\mathbf{block\_mask}$(@ & PrevAlpha & NextAlpha, Space)|
|Keep embedded apostrophes within tokens|Apos, PrevAlpha, NextAlpha|Locate an embedded apostrophe surrounded by alphas|$\mathbf{block\_mask}$(Apos & PrevAlpha & NextAlpha, Space)|
|Keep tokens with digits together|Numeric, Space|Keep tokens with a numeric character together|$\mathbf{block\_mask}$(Numeric,Space)|

Here, the "Twitter" feature is a custom-defined character feature that is true for the characters @, #, $, and ^.

Because all block masks operate between Space features, the rules can be combined into a single transformation and the block_mask function can be applied later to the combination:

$$ \begin{equation*} \mathbf{BlockTransforms} = ((Twitter * PrevSpace * NextAlpha) + (. * PrevSpace * Next\_@ * AfterNextAlpha)) \\ + (: * Next\_/ * AfterNext\_/ * PrevAlpha) + (@ * PrevAlpha * NextAlpha) \\ + (Apos * PrevAlpha * NextAlpha) + (Numeric) \end{equation*} $$
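The combined locator can be sketched by "and"ing the features within each rule and "or"ing across rules, following the Mask column of the table above. The `block_locators` helper is hypothetical, and two choices in it are assumptions rather than LaTok behavior: string edges are padded with spaces so that a token at the very start counts as following a space, and `str.isdigit` stands in for the Numeric feature.

```python
import numpy as np

TWITTER_CHARS = set("@#$^")  # the custom "Twitter" character feature


def block_locators(text):
    """Boolean vector marking characters that identify a keep-together block."""
    chars = list(text)
    n = len(chars)
    # Neighboring characters, padded with spaces at the string edges (assumption).
    prev = ([" "] + chars)[:n]
    nxt = (chars[1:] + [" "])[:n]
    after_next = (chars[2:] + [" ", " "])[:n]

    def feat(seq, pred):
        return np.array([pred(c) for c in seq], dtype=bool)

    def eq(seq, ch):
        return feat(seq, lambda c: c == ch)

    twitter = feat(chars, lambda c: c in TWITTER_CHARS)
    prev_space, prev_alpha = feat(prev, str.isspace), feat(prev, str.isalpha)
    next_alpha = feat(nxt, str.isalpha)
    after_next_alpha = feat(after_next, str.isalpha)
    numeric = feat(chars, str.isdigit)

    return (
        (twitter * prev_space * next_alpha)
        + (eq(chars, ".") * prev_space * eq(nxt, "@") * after_next_alpha)
        + (eq(chars, ":") * eq(nxt, "/") * eq(after_next, "/") * prev_alpha)
        + (eq(chars, "@") * prev_alpha * next_alpha)
        + (eq(chars, "'") * prev_alpha * next_alpha)
        + numeric
    )


text = "@Al isn't at a@b.co $5 http://x.y"
hits = np.flatnonzero(block_locators(text))
print([(int(i), text[i]) for i in hits])
```

Each hit only marks a single character; the span that must stay together is recovered afterwards by expanding the hit to the surrounding Space endpoints.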

### Post-Block Transform

One more point transformation is needed after the block transformations to satisfy the rule that splits any trailing symbol off the end of a token, that is, a symbol immediately followed by a space:

$$ \begin{equation*} \mathbf{EndSymbolTransform} = (Symbol * NextSpace) \end{equation*} $$

### Combined and applied transformations

All of the transformations above can be defined and precomputed once, then applied to the feature matrix of any input string for tokenization.

Next, for each input feature matrix, $\mathbf{F}$, the transformations need to be applied. Because the block transforms are wrapped in the $\mathbf{block\_mask}$ function, they must be applied separately from the point transforms.

The application function is defined as:

$$ \mathbf{S}(\mathbf{F}) = \mathbf{apply\_transform}(\mathbf{F}, \mathbf{T}) $$

Here, $\mathbf{S}(\mathbf{F})$ is the split mask vector generated by applying the transform, $\mathbf{T}$, to the feature matrix, $\mathbf{F}$.

Combining the point, block, and end symbol transformations, we get the final transformation function:

$$ \begin{equation*} \mathbf{S}(\mathbf{F}) = (\mathbf{apply\_transform}(\mathbf{F}, PointTransforms) \\ * \\ \mathbf{apply\_transform}(\mathbf{F}, \mathbf{block\_mask}(BlockTransforms, Space))) \\ + \\ \mathbf{apply\_transform}(\mathbf{F}, EndSymbolTransform) \end{equation*} $$
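As a worked example of this combination, the sketch below uses the fragment "$100K! at" with three masks derived by hand from the rules above rather than produced by LaTok. It assumes the block mask keeps a leading 1 for each located span and that the end-symbol feature means a symbol immediately followed by a space.

```python
import numpy as np

text = "$100K! at"

# Hand-derived masks, one entry per character of `text` (illustrative assumptions).
point = np.array([1, 0, 0, 0, 0, 1, 1, 0, 0], dtype=bool)       # '$', '!' and the space
block = np.array([1, 0, 0, 0, 0, 0, 1, 1, 1], dtype=bool)       # "$100K!" contains digits: 10...0
end_symbol = np.array([0, 0, 0, 0, 0, 1, 0, 0, 0], dtype=bool)  # '!' is followed by a space

# (point "and" block) "or" end_symbol
split = (point * block) + end_symbol

for ch, bit in zip(text, split):
    print(repr(ch), int(bit))
```

The block mask suppresses the split at "!" along with every other split inside the digit span, and the end-symbol term restores it, which is why that transformation has to come after the block transformations.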

### LaTok Implementation Details

1. Collected and implemented character features (unicode c-extension)
    * Leverage python unicode feature implementation, customizing with new features
1. Implemented necessary new helper functions (numpy c-extensions):
    * feature matrix construction
        * building the feature matrix as a NumPy array in Python turned out to be a bottleneck
        * a custom c-extension to build this matrix showed vast improvement
    * apply_transform
        * custom c implementation of transformation application to the feature matrix improved performance
    * block_mask
        * custom c implementation of generating the block mask also improved performance
1. Implemented LaTok algorithm as a general tokenizer (python ```tokenize```)
1. Defined transformation rules for twitter case (python ```gen_split_mask```)
1. Executed the tokenizer (python)

### Custom Configuration Details

With the basic tokenizer and its sample default "tweet" configuration in place, alternate tokenization use cases can be implemented by following the same steps:

1. Collect character features, implementing any that are new (unicode c-extension)
1. Implement any new helper functions (numpy c-extensions)
    * In particular, new types of masking operations may be necessary
1. Define transformation rules, implementing a custom ```gen_split_mask```
1. Execute ```tokenize``` with the custom ```gen_split_mask``` function

### A view of the feature matrix and split vectors

Revisiting our sample tweet from above...

```python
import pandas as pd
from latok.core.latok_utils import gen_parse_matrix, FEATURE_NAMES
from latok.core.default_tokenizer import gen_split_mask, tokenize
```

```python
text = "@Mary, check out John's AmazingMuscleCar for $100K! at http://johnscar.com opinions@research.com! #LiveTheDream"

# Build the feature matrix for the text and label its columns
m = gen_parse_matrix(text)
df = pd.DataFrame(m, columns=FEATURE_NAMES)

# One entry per character, alongside the split mask computed from the matrix
chars = pd.Series(list(text))
splits = pd.Series(gen_split_mask(m))

df = pd.concat((chars.rename("Chars"), splits.rename("Splits"), df), axis=1)

# Show every row and column when the frame is displayed
pd.options.display.max_columns = None
pd.options.display.max_rows = None

print(f'Tokens:\n{list(tokenize(text))}\n\nFeatures:')

df
```