## Tokenizer usage

A notebook for experimenting with and understanding tokenization and its options.

First, some initializations...

In [None]:
import latok.core.default_tokenizer as tokenizer
import pandas as pd

### Parameters

**Edit the following parameters as desired**, including the text to tokenize, before running the next cell...

In [None]:
text = '''This is a1 test don't http://foo.com?bar=123 @user abc@xyz.com camelCaseOne, CamelCaseTwo, camelCase1, CamelCase2, 123 $123,456.78'''


# Abstract features:
#  A list of FeatureSpec instances to use for abstract featurization
#  (Refer to "abstract featurization" description below)
feature_specs = [
    tokenizer.TWITTER_FEATURE,
    tokenizer.EMAIL_FEATURE,
    tokenizer.URL_FEATURE,
    tokenizer.CAMEL_CASE_FEATURE,
    tokenizer.NUMERIC_FEATURE,
    tokenizer.EMBEDDED_APOS_FEATURE,
]

### Basic Tokenization

`tokenizer.tokenize` generates the text of each token in turn.

We'll collect the token texts into a pandas DataFrame for display...

In [None]:
token_texts = list(tokenizer.tokenize(text))

print(f'Basic tokenization of text:\n\n{text}\n')

df = pd.DataFrame(token_texts, columns=['token'])
df

### LaToken object generation

Instead of just token text, a LaToken object can be generated for each token using the `featurize` function.

This, among other things, preserves the original character locations of the tokens.

First, build and populate the token objects to be displayed below...

In [None]:
la_tokens = list(tokenizer.featurize(text))
tokenizer.add_abstract_features(la_tokens, feature_specs)

Show the basic objects...

In [None]:
print(f'Object tokenization of text:\n\n{text}\n')

data = [(token.text, token.start_idx, token.end_idx) for token in la_tokens]
df = pd.DataFrame(data, columns=['token', 'start_idx', 'end_idx'])
df
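
To illustrate why preserved character locations are useful, here is a minimal sketch that does not use latok at all. It splits a sample string on whitespace (a stand-in, not latok's tokenization rules) and checks that each recorded span slices back to the token's text. The names `sample` and `spans` are hypothetical, and the sketch assumes an exclusive end index, as with Python slices; whether `end_idx` on a LaToken follows the same convention should be confirmed against the table above.

```python
import re

sample = "This is a1 test don't @user"

# Whitespace splitting is a stand-in here, not latok's tokenization rules.
spans = [(m.group(), m.start(), m.end()) for m in re.finditer(r'\S+', sample)]

for token_text, start_idx, end_idx in spans:
    # Assumes the end index is exclusive, as with Python slices.
    assert sample[start_idx:end_idx] == token_text

print(spans[:3])
```

With offsets like these, labels or annotations attached to a token can always be mapped back to the exact region of the original text.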

## Featurization

For many use cases, the text of tokens is sufficient for further processing; for others, capturing features for each token is desired.

* Character-level features are directly available from the feature matrix.
* Token features can be expressed
    * **_directly_** as the sum of all character features for the token
        * These amount to "characteristic" vectorization of tokens, where vectors for all tokens having the same combination of character features are equivalent.
    * **_abstractly_** as the labeled combination of multiple character features
        * **NOTE:** These are typically labels given to the offset combinations used to split text into tokens in the first place, but can include the use of other rules, like regular expression (in)validation, over the token data.
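
The idea of a direct token vector as a sum of character features can be sketched with NumPy. This is only an illustration under stated assumptions: the feature names in `FEATURES` and the helpers `char_features` and `characteristic_vector` are hypothetical, and latok's actual feature matrix may use different features and a different layout.

```python
import numpy as np

# Hypothetical character-level features; latok's actual feature set may differ.
FEATURES = ['alpha', 'upper', 'digit', 'symbol']


def char_features(ch):
    """Return a 0/1 feature row for a single character."""
    return [
        int(ch.isalpha()),
        int(ch.isupper()),
        int(ch.isdigit()),
        int(not ch.isalnum() and not ch.isspace()),
    ]


def characteristic_vector(token):
    """Sum the per-character feature rows of a token into one vector."""
    matrix = np.array([char_features(ch) for ch in token], dtype=int)
    return matrix.sum(axis=0)


camel = characteristic_vector('camelCase1')
print(dict(zip(FEATURES, camel.tolist())))
# → {'alpha': 9, 'upper': 1, 'digit': 1, 'symbol': 0}
```

In this sketch, tokens such as `abc` and `xyz` produce identical vectors because their characters carry the same features, which is the sense in which the vector characterizes a token's shape rather than its content.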

### Direct token featurization

Show the direct characteristic vector for the featurized tokens.

In [None]:
print(f'Featurized tokenization of text:\n\n{text}\n')

data2 = [(token.text,
          token.start_idx, token.end_idx,
          ' '.join(str(f) for f in token.features))
         for token in la_tokens]
df = pd.DataFrame(data2, columns=['token', 'start_idx', 'end_idx', 'characteristic_vector'])
df

### Abstract token featurization

Show the abstract features assigned to each token object.

In [None]:
print(f'Abstract featurization of text:\n\n{text}\n')

data3 = [(token.text,
          token.start_idx, token.end_idx,
          token.abstract_features)
         for token in la_tokens]
df = pd.DataFrame(data3, columns=['token', 'start_idx', 'end_idx', 'abstract_features'])
df
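
As a rough illustration of abstract features expressed through rules such as regular expression validation, the sketch below assigns a label to a token by trying a list of patterns in order. The rule names, the patterns in `RULES`, and the `label` helper are all hypothetical; they are not the definitions behind latok's `FeatureSpec` constants, which are applied via `add_abstract_features` above.

```python
import re

# Hypothetical label rules, for illustration only; these are not the
# patterns behind latok's FeatureSpec constants.
RULES = [
    ('url', re.compile(r'^https?://\S+$')),
    ('email', re.compile(r'^[^@\s]+@[^@\s]+\.[^@\s]+$')),
    ('twitter', re.compile(r'^@\w+$')),
    ('numeric', re.compile(r'^\$?\d[\d,]*(\.\d+)?$')),
    ('camelcase', re.compile(r'^[A-Za-z][a-z]+(?:[A-Z][a-z]+)+\d*$')),
]


def label(token):
    """Return the name of the first rule matching the token, or None."""
    for name, pattern in RULES:
        if pattern.match(token):
            return name
    return None


tokens = ['http://foo.com?bar=123', '@user', 'abc@xyz.com',
          'camelCaseOne', '$123,456.78', 'test']
labels = {token: label(token) for token in tokens}
print(labels)
```

Rule order matters in a scheme like this: the email pattern is tried before the Twitter pattern so that an address is not mistaken for a handle, and plain words such as `test` fall through with no label.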