# Tokenize the table content and check the distribution of tokens

Author: Riccardo Cappuzzo

In this notebook, I work on the content of the tables and use the Hugging Face
`tokenizers` library to build a token vocabulary for the entire corpus. This
should show which tokens appear most frequently and give an idea of what to
expect from a dirty, mixed-type corpus of tables sourced from the web.


I'll start by importing some of the `normalizers` from the `tokenizers` library.

More on `normalizers` in the [normalizers API page](https://huggingface.co/docs/tokenizers/v0.13.2/en/api/normalizers).
These normalizers rely on [Unicode normalization](https://unicode.org/reports/tr15/).


In [44]:
from tokenizers.normalizers import NFD, StripAccents, Lowercase
from tokenizers import normalizers
from tokenizers import pre_tokenizers
from tokenizers import Tokenizer
from tokenizers.pre_tokenizers import Digits
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer


import pyarrow.parquet as pq
import pyarrow.csv as pv

from csv import QUOTE_NONE

In [19]:
normalizer = normalizers.Sequence([NFD(), Lowercase(), StripAccents()])

In [20]:
# Testing normalizer
normalizer.normalize_str("Héllò hôw are ü?")

'hello how are u?'
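
To make the three normalization steps concrete, here is a minimal sketch that approximates the same NFD, lowercase, strip-accents chain using only the standard library. The function name `normalize` is hypothetical, and treating every character in a Unicode "Mark" category as an accent is an assumption about how `StripAccents` behaves, not a statement of its implementation.

```python
import unicodedata


def normalize(text: str) -> str:
    """Approximate NFD -> Lowercase -> StripAccents with the standard library."""
    # NFD splits accented letters into a base letter plus combining marks
    decomposed = unicodedata.normalize("NFD", text)
    lowered = decomposed.lower()
    # Drop combining marks (general categories starting with "M"); this is an
    # assumption about what counts as an accent
    return "".join(ch for ch in lowered if not unicodedata.category(ch).startswith("M"))


print(normalize("Héllò hôw are ü?"))
```

The order matters: without the NFD step the accents are still fused with their base letters, so there is nothing to strip.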

In [47]:
pre_tokenizer = pre_tokenizers.Sequence([Whitespace(), Digits()])
pre_tokenizer.pre_tokenize_str("Call 911!")

[('Call', (0, 4)), ('911', (5, 8)), ('!', (8, 9))]
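
The pre-tokenization step can be sketched with two regular expressions. This is only an approximation under stated assumptions: `Whitespace()` is documented as splitting on the pattern `\w+|[^\w\s]+`, and `Digits()` with its default settings is assumed to keep runs of digits together while separating them from other characters. The helper name `pre_tokenize` is hypothetical.

```python
import re

# Words or runs of punctuation, roughly what Whitespace() produces
WORD_OR_PUNCT = re.compile(r"\w+|[^\w\s]+")
# Runs of digits kept apart from runs of non-digits, roughly what Digits() does
DIGITS_OR_NOT = re.compile(r"\d+|\D+")


def pre_tokenize(text):
    """Return (piece, (start, end)) pairs with offsets into the original text."""
    pieces = []
    for word in WORD_OR_PUNCT.finditer(text):
        for sub in DIGITS_OR_NOT.finditer(word.group()):
            start = word.start() + sub.start()
            pieces.append((sub.group(), (start, start + len(sub.group()))))
    return pieces


print(pre_tokenize("Call 911!"))
print(pre_tokenize("room42b"))
```

The second call shows where `Digits()` makes a difference: a cell such as `room42b` is split into a letter run, a digit run, and another letter run, which is relevant for tables that mix codes and numbers.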

In [48]:
tokenizer = Tokenizer(BPE())
tokenizer.normalizer = normalizer
tokenizer.pre_tokenizer = pre_tokenizer

In [49]:
cd /home/soda/rcappuzz/work/study-gittables

/home/soda/rcappuzz/work/study-gittables


### Preparing a random table for tokenization

In [30]:
tab = pq.read_table("data/zenodo/tables/allegro_con_spirito_tables_licensed/Aziende.parquet")

Save the table to a text file without any quoting. `QUOTE_NONE` disables all
quote markers, and `escapechar` is required because otherwise `csv` raises an
error when a field contains the separator. With `sep=" "` and `escapechar=" "`,
fields are separated by a single space, and any space inside a field is escaped
by a second space, so it is written as `"  "` (two spaces).

In [42]:
tab.to_pandas().to_csv("tb.txt", index=False, sep=" ", escapechar=" ", quoting=QUOTE_NONE)

### Tokenizing the content of the table

In [50]:
trainer = BpeTrainer() # No special tokens for now
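
The training call itself does not appear in these cells. As a minimal sketch of how a `BpeTrainer` is typically used, the block below rebuilds the same pipeline and trains it on a tiny in-memory corpus. The corpus `toy_corpus` is an assumption made for illustration; in this notebook the natural input would be the exported file, for example `tokenizer.train(["tb.txt"], trainer)`.

```python
from tokenizers import Tokenizer, normalizers, pre_tokenizers
from tokenizers.models import BPE
from tokenizers.normalizers import NFD, Lowercase, StripAccents
from tokenizers.pre_tokenizers import Digits, Whitespace
from tokenizers.trainers import BpeTrainer

# Same pipeline as above, rebuilt so that the sketch is self-contained
tokenizer = Tokenizer(BPE())
tokenizer.normalizer = normalizers.Sequence([NFD(), Lowercase(), StripAccents()])
tokenizer.pre_tokenizer = pre_tokenizers.Sequence([Whitespace(), Digits()])

# Toy corpus (an assumption); it stands in for the exported table
toy_corpus = ["aspide aspide aspide", "pino pide aspi", "de de as as pi pi"]

trainer = BpeTrainer(show_progress=False)  # no special tokens, as in the notebook
tokenizer.train_from_iterator(toy_corpus, trainer=trainer)

enc = tokenizer.encode("aspide")
print(enc.tokens)
print(tokenizer.get_vocab_size())
```

How a word is split depends entirely on the merges learned from the training data, so the tokens printed here will generally differ from those obtained on the real table.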

In [65]:
enc.tokens

['as', 'pi', 'de']

In [66]:
from tokenizers.tools import EncodingVisualizer

In [67]:
viz = EncodingVisualizer(tokenizer=tokenizer)

In [71]:
viz("questa mattina mi sono svegliato e ho trovato l'invasore")

In [84]:
enc = tokenizer.encode("questa mattina mi sono svegliato e ho trovato l'invasore")

In [86]:
enc.ids

[282, 5005, 1186, 455, 141, 36, 1607, 641, 339, 72, 22, 516, 14626, 72, 29, 63, 136, 36, 411]

In [83]:
vocab = tokenizer.get_vocab()
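
`get_vocab()` returns a dictionary that maps each token to its id, so it can be inverted to turn encoded ids back into tokens and count them. The sketch below shows one way to look at the token distribution. The small `vocab` and `encoded_rows` are hypothetical stand-ins for the real vocabulary and for the `ids` of each encoded line of the table.

```python
from collections import Counter

# Hypothetical stand-ins for tokenizer.get_vocab() and for the ids
# returned by tokenizer.encode(...) on each line of the table
vocab = {"as": 0, "pi": 1, "de": 2, "no": 3}
encoded_rows = [[0, 1, 2], [1, 3], [0, 1]]

# Invert the token -> id mapping so that ids can be read back as tokens
id_to_token = {idx: tok for tok, idx in vocab.items()}

# Count how often each token occurs across all encoded rows
counts = Counter(id_to_token[i] for row in encoded_rows for i in row)

for token, freq in counts.most_common():
    print(token, freq)
```

On the real corpus, `counts.most_common(n)` would give the `n` most frequent tokens directly.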