# Tokenizer usage

A notebook for experimenting with and understanding tokenization and its options, including featurization.

Usage:
* Configure and construct the tokenizer "tok" as desired
* Set the "text" to tokenize
* Run all cells and inspect the various outputs

NOTE: When defining and debugging new split mask generator cores, it helps to build the cores in this notebook first and inspect the final character-level feature matrix.

First, some initializations...

In [1]:
import numpy as np
import pandas as pd
import latok.core.constants as C
import latok.core.latok_utils as latok_utils
import latok.core.offsets as oft
import latok.core.split_mask_generator as split_mask_generator
import latok.core.tokenizer as latok
import latok.core.twitter_cores as twitter_cores
from latok.latok import _gen_parse_matrix, _combine_matrix_rows

## Tokenizer Construction

A tokenizer is constructed by:
* Building a "split mask generator" for its core
* Deciding on behavioral options
* Constructing the Tokenizer instance

#### Case Study: Tweet tokenizer with particular needs

Requirements:
* Basics: split on spaces, symbols and camel case (generally)
* Keep e-mail addresses and urls together as single tokens
* Keep numeric sequences along with embedded and preceding symbols
* Don't split embedded apostrophes, but keep them as part of their token
   * keep the full token with an apostrophe without breaking up camel case tokens
* Keep the full twitter mention (even when camel cased) along with its preceding "@" sign
* Discard other twitter specials (like hashtags) and do split on camel case
   * except when there's an embedded apostrophe
* Enable generic substitution of
   * mentions with the token "_MENTION"
   * emails with the token "_EMAIL"
   * urls with the token "_URL"
   * numerics with the token "_NUM"

In [2]:
tok = latok.Tokenizer(
    smg=twitter_cores.MENTION_SMG,
    specs_and_repls=[
        (C.TWITTER_MENTION_FEATURE, '_MENTION'),
        (C.TWITTER_HASHTAG_FEATURE, '_HASHTAG'),
        (C.EMAIL_FEATURE, '_EMAIL'),
        (C.URL_FEATURE, '_URL'),
        (C.NUMERIC_FEATURE, '_NUM'),
    ],
    to_lower=False,
    drop_symbols=True,
    keep_emojis=True
)

# Exploration

In [3]:
#text = '''@123.456 #HashTag's "It's a problem"'''
text = '''@AbcDef #HashTag "It isn't a problem"'''
#text = '''@SomeUser and "@AnotherMention" see "http://foobar.baz/entry" and tell "user@foo.com" about your ""#AwesomeSauce"'''
#text = '''@SomeUser and ("@SomeMention")'''
#text = '''@Some_Mention'''

## Basic Tokenization

`tok.tokenize` generates the text of each token.

We'll wrap each token for placement in a pandas DataFrame for display...

In [4]:
# Collect the generated token texts into a list, one row per token
token_texts = list(tok.tokenize(text))

print(f'Basic tokenization of text:\n\n{text}\n')

df = pd.DataFrame(token_texts, columns=['token'])
df

Basic tokenization of text:

@AbcDef #HashTag "It isn't a problem"



Unnamed: 0,token
0,_MENTION
1,_HASHTAG
2,Hash
3,Tag
4,It
5,isn't
6,a
7,problem


### Final character-level features

In [5]:
pd.options.display.max_columns = None
pd.options.display.max_rows = None

df = tok.smg.build_dataframe(text)
df

Unnamed: 0,Chars,trim1=stage1.or.split2,split2.2,stage1.2,stage1=split1.and.blocks,split1.1,blocks=a_blocks.and.blocks1,blocks1.0,a_blocks.0,Alpha,AlphaNum,Num,Lower,Upper,Space,Symbol,Twitter,@,:,/,.,Prev_Alpha,Next_Alpha,Prev_AlphaNum,Next_AlphaNum,Prev_Lower,Next_Lower,Prev_Space,Next_Space,Prev_Symbol,Next_@,Next_/,After_Next_Alpha,After_Next_/,Apos,#,$,^,Emoji,Emoji_Presentation,Emoji_Modifier_Base,Emoji_Component,Extended_Pictographic
0,@,1,1,1,1,1,1,1,1,0,0,0,0,0,0,1,1,1,0,0,0,0,1,0,1,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0
1,A,0,0,0,0,2,0,0,1,1,1,0,0,1,0,0,0,0,0,0,0,0,1,0,1,0,1,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0
2,b,0,0,0,0,0,0,0,1,1,1,0,1,0,0,0,0,0,0,0,0,1,1,1,1,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0
3,c,0,0,0,0,0,0,0,1,1,1,0,1,0,0,0,0,0,0,0,0,1,1,1,1,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0
4,D,0,0,0,0,2,0,0,1,1,1,0,0,1,0,0,0,0,0,0,0,1,1,1,1,1,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0
5,e,0,0,0,0,0,0,0,1,1,1,0,1,0,0,0,0,0,0,0,0,1,1,1,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
6,f,0,0,0,0,0,0,0,1,1,1,0,1,0,0,0,0,0,0,0,0,1,0,1,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0
7,,1,0,1,1,1,1,1,1,0,0,0,0,0,1,0,0,0,0,0,0,1,0,1,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0
8,#,3,1,2,2,2,1,1,1,0,0,0,0,0,1,1,1,0,0,0,0,0,1,0,1,0,0,1,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0
9,H,2,0,2,2,2,1,1,1,1,1,0,0,1,0,0,0,0,0,0,0,0,1,0,1,0,1,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0
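
As a rough illustration of how character-level feature columns such as `Upper`, `Next_Lower` and `Prev_Symbol` can be combined into a split mask, here is a minimal numpy sketch. It is not latok's implementation. The feature definitions, the helper names (`shift_prev`, `shift_next`), the reading of `split1` as an OR over AND-ed feature pairs, and the hand-built `blocks` mask that protects the mention are all assumptions made for illustration. The matrix above also stores counts, whereas this sketch uses plain booleans.

```python
import numpy as np

text = '@AbcDef #HashTag'
chars = list(text)

# Per-character boolean features (assumed definitions, not latok's).
upper = np.array([c.isupper() for c in chars])
lower = np.array([c.islower() for c in chars])
space = np.array([c.isspace() for c in chars])
symbol = np.array([not (c.isalnum() or c.isspace()) for c in chars])


def shift_prev(mask):
    """Value of the feature at the previous character (False at position 0)."""
    return np.concatenate(([False], mask[:-1]))


def shift_next(mask):
    """Value of the feature at the next character (False at the last position)."""
    return np.concatenate((mask[1:], [False]))


prev_lower = shift_prev(lower)
next_lower = shift_next(lower)
prev_symbol = shift_prev(symbol)

# Assumed reading of "split1": OR across alternatives, AND within a pair.
split1 = (space | symbol | prev_symbol
          | (upper & next_lower) | (upper & prev_lower))

# Hypothetical block mask: False where a split is forbidden, here inside
# a mention (the characters after '@' up to the next space).
blocks = np.ones(len(chars), dtype=bool)
in_mention = False
for i, c in enumerate(chars):
    if c == '@':
        in_mention = True       # the '@' itself may still start a token
    elif c.isspace():
        in_mention = False
    elif in_mention:
        blocks[i] = False

# Mirrors the column name "stage1=split1.and.blocks".
stage1 = split1 & blocks

# Treat every True position as the start of a new piece of text.
starts = sorted({0, *np.flatnonzero(stage1).tolist()})
bounds = starts + [len(text)]
pieces = [text[a:b] for a, b in zip(bounds, bounds[1:])]
tokens = [p for p in pieces if not p.isspace()]
print(tokens)
```

Under these assumptions, the camel-case split inside `@AbcDef` is suppressed by the block mask, while the split inside `HashTag` is kept.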


### Describe character-level feature split plan

In [6]:
import json
print(json.dumps(tok.smg.describe_split_plan(), indent=2))

{
  "steps": {
    "step_0": {
      "input_name": "a_blocks",
      "output_name": "blocks",
      "combine_with": "blocks1",
      "operation": "and"
    },
    "step_1": {
      "input_name": "split1",
      "output_name": "stage1",
      "combine_with": "blocks",
      "operation": "and"
    },
    "step_2": {
      "input_name": "stage1",
      "output_name": "trim1",
      "combine_with": "split2",
      "operation": "or"
    }
  },
  "stages": {
    "split1_0": {
      "name": "split1",
      "mask": {
        "name": "split1",
        "features": [
          [
            "Space",
            ""
          ],
          [
            "Symbol",
            ""
          ],
          [
            "Prev_Symbol",
            ""
          ],
          [
            "Upper",
            "Next_Lower"
          ],
          [
            "Upper",
            "Prev_Lower"
          ]
        ],
        "combo_flag": null
      }
    },
    "a_blocks_1": {
      "name": "a_blocks",
      "

## LaToken object generation

Instead of just token text, a LaToken object can be generated for each token using the `featurize` method.

This, among other things, preserves the original character locations of the tokens.

First, build and populate the token objects to be displayed below...

In [7]:
la_tokens = list(tok.featurize(text))

Show the basic objects...

In [8]:
print(f'Object tokenization of text:\n\n{text}\n')

data = [(token.text, token.start_idx, token.end_idx) for token in la_tokens]
df = pd.DataFrame(data, columns=['token', 'start_idx', 'end_idx'])
df

Object tokenization of text:

@AbcDef #HashTag "It isn't a problem"



Unnamed: 0,token,start_idx,end_idx
0,@AbcDef,0,7
1,#,8,9
2,Hash,9,13
3,Tag,13,16
4,"""",17,18
5,It,18,20
6,isn't,21,26
7,a,27,28
8,problem,29,36
9,"""",36,37
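
Because each token keeps its `start_idx` and `end_idx`, the original text can be sliced to recover every token, and the characters between tokens can be inspected too. The following sketch uses only the values from the table above and plain Python slicing. The names `rows`, `recovered` and `gaps` are illustrative, and treating `end_idx` as exclusive is an assumption that the table values happen to support.

```python
text = '''@AbcDef #HashTag "It isn't a problem"'''

# (token, start_idx, end_idx) rows copied from the table above.
rows = [
    ('@AbcDef', 0, 7), ('#', 8, 9), ('Hash', 9, 13), ('Tag', 13, 16),
    ('"', 17, 18), ('It', 18, 20), ("isn't", 21, 26), ('a', 27, 28),
    ('problem', 29, 36), ('"', 36, 37),
]

# Slicing with an exclusive end index recovers each token's text.
recovered = [text[start:end] for _, start, end in rows]

# Text skipped between consecutive tokens (whitespace or nothing here).
gaps = [text[prev_end:start]
        for (_, _, prev_end), (_, start, _) in zip(rows, rows[1:])]

# Interleaving tokens and gaps rebuilds the original string.
rebuilt = rows[0][0]
for gap, (token, _, _) in zip(gaps, rows[1:]):
    rebuilt += gap + token

print(recovered)
print(gaps)
print(rebuilt == text)
```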


# Featurization

For many use cases, the text of tokens is sufficient for further processing; for others, capturing features for each token is desired.

* Character-level features are directly available from the feature matrix.
* Token features can be expressed
    * **_directly_** as the sum of all character features for the token
        * These amount to "characteristic" vectorization of tokens, where vectors for all tokens having the same combination of character features are equivalent.
    * **_abstractly_** as the labeled combination of multiple character features
        * **NOTE:** These are typically labels given to the offset combinations used to split text into tokens in the first place, but can include the use of other rules, like regular expression (in)validation, over the token data.
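
To make the regular expression (in)validation idea concrete, here is a small sketch in which candidate labels proposed for a token are kept only if an associated pattern matches the whole token. The label names `mention` and `hashtag` match the abstract features shown later in this notebook, but the patterns, the `VALIDATORS` table and the `validate_labels` helper are hypothetical and are not part of latok.

```python
import re

# Hypothetical validation patterns, one per abstract label.
VALIDATORS = {
    # '@' followed by 1 to 15 ASCII word characters (an assumed rule).
    'mention': re.compile(r'@\w{1,15}', re.ASCII),
    # '#' followed by one or more ASCII word characters (an assumed rule).
    'hashtag': re.compile(r'#\w+', re.ASCII),
}


def validate_labels(token_text, candidate_labels):
    """Keep only the candidate labels whose pattern matches the full token.

    Labels without a registered pattern are kept unchanged.
    """
    kept = []
    for label in candidate_labels:
        pattern = VALIDATORS.get(label)
        if pattern is None or pattern.fullmatch(token_text):
            kept.append(label)
    return kept


print(validate_labels('@AbcDef', ['mention']))    # label survives
print(validate_labels('@123.456', ['mention']))   # label is invalidated
```

In this sketch, a token like `@123.456` may look like a mention to an offset-based rule, but the pattern rejects it because of the embedded period.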

## Direct token featurization

Show the direct characteristic vector for the featurized tokens.

In [9]:
print(f'Featurized tokenization of text:\n\n{text}\n')

data2 = [(token.text,
          token.start_idx, token.end_idx,
          ' '.join(str(f) for f in token.features))
         for token in la_tokens]
df = pd.DataFrame(data2, columns=['token', 'start_idx', 'end_idx', 'characteristic_vector'])
df

Featurized tokenization of text:

@AbcDef #HashTag "It isn't a problem"



Unnamed: 0,token,start_idx,end_idx,characteristic_vector
0,@AbcDef,0,7,6 6 0 4 2 0 1 1 1 0 0 0 5 6 5 6 3 4 1 1 1 0 0 ...
1,#,8,9,0 0 0 0 0 1 1 1 0 0 0 0 0 1 0 1 0 0 1 0 0 0 0 ...
2,Hash,9,13,4 4 0 3 1 0 0 0 0 0 0 0 3 4 3 4 2 3 0 0 1 0 0 ...
3,Tag,13,16,3 3 0 2 1 0 0 0 0 0 0 0 3 2 3 2 2 2 0 1 0 0 0 ...
4,"""",17,18,0 0 0 0 0 1 1 0 0 0 0 0 0 1 0 1 0 0 1 0 0 0 0 ...
5,It,18,20,2 2 0 1 1 0 0 0 0 0 0 0 1 1 1 1 0 1 0 1 1 0 0 ...
6,isn't,21,26,4 4 0 4 0 1 1 0 0 0 0 0 3 3 3 3 3 3 1 1 1 0 0 ...
7,a,27,28,1 1 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 ...
8,problem,29,36,7 7 0 7 0 1 0 0 0 0 0 0 6 6 6 6 6 6 1 0 0 0 0 ...
9,"""",36,37,0 0 0 0 0 0 1 0 0 0 0 0 1 0 1 0 1 0 0 1 0 0 0 ...
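
The "sum of character features" idea can be reproduced in a few lines of numpy. The sketch below builds a tiny feature matrix with only five columns (`Alpha`, `AlphaNum`, `Num`, `Lower`, `Upper`), defined here with Python's `str` predicates as an assumption, and sums the rows covered by each token. The helper names `char_features` and `token_vector` are illustrative and are not latok functions.

```python
import numpy as np

text = '''@AbcDef #HashTag "It isn't a problem"'''
feature_names = ['Alpha', 'AlphaNum', 'Num', 'Lower', 'Upper']


def char_features(c):
    """One row of 0/1 features for a single character (assumed definitions)."""
    return [c.isalpha(), c.isalnum(), c.isdigit(), c.islower(), c.isupper()]


# Character-level matrix: one row per character, one column per feature.
m = np.array([char_features(c) for c in text], dtype=np.int8)


def token_vector(start, end):
    """Characteristic vector: column-wise sum over the token's characters."""
    return m[start:end].sum(axis=0)


for token, start, end in [('@AbcDef', 0, 7), ('Hash', 9, 13), ("isn't", 21, 26)]:
    print(token, token_vector(start, end).tolist())
```

For these five columns the sums agree with the leading entries of the characteristic vectors in the table above, for example `6 6 0 4 2` for `@AbcDef`.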


## Abstract token featurization

Show the abstract features attached to the token objects, along with any specified token replacements.

In [10]:
print(f'Abstract featurization of text:\n\n{text}\n')

data3 = [(token.text,
          token.start_idx, token.end_idx,
          token.abstract_features,
          token.repl)
         for token in la_tokens]
df = pd.DataFrame(data3, columns=['token', 'start_idx', 'end_idx', 'abstract_features', 'repl'])
df

Abstract featurization of text:

@AbcDef #HashTag "It isn't a problem"



Unnamed: 0,token,start_idx,end_idx,abstract_features,repl
0,@AbcDef,0,7,[mention],_MENTION
1,#,8,9,[hashtag],_HASHTAG
2,Hash,9,13,,
3,Tag,13,16,,
4,"""",17,18,,
5,It,18,20,,
6,isn't,21,26,,
7,a,27,28,,
8,problem,29,36,,
9,"""",36,37,,
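
Comparing this table with the basic tokenization output earlier suggests how the replacement and `drop_symbols` options interact. The sketch below encodes one plausible reading: a token with a replacement is emitted as its replacement, a remaining token without any alphanumeric character is dropped, and everything else is emitted as-is. This is inferred from the two outputs only and is not latok's actual logic. The `surface_tokens` helper is hypothetical.

```python
# (token text, replacement) pairs copied from the table above.
rows = [
    ('@AbcDef', '_MENTION'), ('#', '_HASHTAG'), ('Hash', None),
    ('Tag', None), ('"', None), ('It', None), ("isn't", None),
    ('a', None), ('problem', None), ('"', None),
]


def surface_tokens(rows, drop_symbols=True):
    """Derive plain token texts from (text, repl) pairs (assumed rules)."""
    out = []
    for token_text, repl in rows:
        if repl:
            # A replacement wins over the original text.
            out.append(repl)
        elif drop_symbols and not any(c.isalnum() for c in token_text):
            # Symbol-only tokens without a replacement are discarded.
            continue
        else:
            out.append(token_text)
    return out


print(surface_tokens(rows))
```

With `drop_symbols=False` the same helper would keep the two quotation-mark tokens.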


In [11]:
la_tokens

[LaToken(text='@AbcDef', start_idx=0, end_idx=7, features=array([6, 6, 0, 4, 2, 0, 1, 1, 1, 0, 0, 0, 5, 6, 5, 6, 3, 4, 1, 1, 1, 0,
        0, 5, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]), m=array([[0, 0, 0, ..., 0, 0, 0],
        [1, 1, 0, ..., 0, 0, 0],
        [1, 1, 0, ..., 0, 0, 0],
        ...,
        [1, 1, 0, ..., 0, 0, 0],
        [1, 1, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0]], dtype=int8), abstract_features=['mention'], repl='_MENTION'),
 LaToken(text='#', start_idx=8, end_idx=9, features=array([0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0,
        0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0]), m=array([[0, 0, 0, ..., 0, 0, 0],
        [1, 1, 0, ..., 0, 0, 0],
        [1, 1, 0, ..., 0, 0, 0],
        ...,
        [1, 1, 0, ..., 0, 0, 0],
        [1, 1, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0]], dtype=int8), abstract_features=['hashtag'], repl='_HASHTAG'),
 LaToken(text='Hash', start_idx=9, end_idx=13, features=array([4, 4, 0, 3, 1, 0, 0, 0, 0, 0, 0, 0, 3, 