# 4. Tokenization

## Objective

- Students will understand why tokenization matters in an NLP pipeline
- See how a corpus differs before and after tokenization
- Explore *subword tokenization* methods such as *BPE*, *WordPiece* and *Unigram*

## Tokenization

- We want units of information that represent a language
    - Turn raw text into data that our model can process
    - Similar to pixels for images or frames for audio
- The most intuitive unit is the alphanumeric word separated by spaces (tokens)
- Splitting text into *tokens* is what gives *tokenization* its name
    - It is a fundamental part of an *NLP* *pipeline*
    - It belongs to the preprocessing stage

## Word-based tokenization

- Easy to implement (`.split()`)

In [None]:
"Mira mamá estoy en la tele".split()

- Punctuation marks can be handled by adding simple rules

In [None]:
import re
text = "Let's get started son!!!"
re.findall(r"['!]|\w+", text)

### Problem?

<img src="http://images.wikia.com/battlebears/images/2/2c/Troll_Problem.jpg" width="250" height="250">

- Huge vocabularies that are hard to process
- In general, the larger the vocabulary, the heavier the model

**Example:**
- If we want vector representations of our tokens, we get unrelated vectors for similar words
    - niño = `v1(39, 34, 5,...)`
    - niños = `v2(9, 4, 0,...)`
    - niña = `v3(2, 1, 1,...)`
    - ...
- We would also have tokens with very low frequency
    - merequetengue = `vn(0,0,1,...)`
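The problem above can be sketched with a toy word-level vocabulary. The corpus and the `one_hot` helper are invented for illustration: each surface form gets its own unrelated vector, and rare words show up as singletons.

```python
from collections import Counter

# Toy Spanish corpus, invented for illustration
corpus = "el niño y la niña ven a los niños y a las niñas bailar merequetengue".split()

# Word-level vocabulary: every surface form gets its own id
counts = Counter(corpus)
vocab = {word: idx for idx, word in enumerate(sorted(counts))}

def one_hot(word: str) -> list[int]:
    """Return the one-hot vector of `word` over the toy vocabulary."""
    vector = [0] * len(vocab)
    vector[vocab[word]] = 1
    return vector

# Related forms share nothing at the vector level
for word in ["niño", "niños", "niña"]:
    print(word, one_hot(word))

# Types that occur only once in the corpus
singletons = [word for word, count in counts.items() if count == 1]
print("Singletons:", singletons)
```

With a real corpus the vocabulary grows to tens of thousands of types, and most of them are rare.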

### One solution: Stemming/Lemmatization (AKA the old reliable)

<center><img src="img/vieja_confiable.jpg" width=500 height=500></center>

In [None]:
import nltk
from nltk.corpus import brown
nltk.download('brown')

In [None]:
from collections import Counter

# Keep only tokens that start with a word character (drops pure punctuation)
brown_corpus = [word for word in brown.words() if re.match(r"\w", word)]
print(brown_corpus[0])
print("Tokens:", len(brown_corpus))
print("Types:", len(Counter(brown_corpus)))

<center><img src="https://external-content.duckduckgo.com/iu/?u=http%3A%2F%2Fimg1.wikia.nocookie.net%2F__cb20140504152558%2Fspongebob%2Fimages%2Fe%2Fe3%2FThe_spongebob.jpg&f=1&nofb=1&ipt=28368023b54a7c84c9100025981b1042d0f4ca3ceaac53be42094cc1c3794348&ipo=images" height=300 width=300></center>

In [None]:
sub_brown_corpus = brown_corpus[:100000]
print("Sub brown_corpus types:", len(Counter(sub_brown_corpus)))
sub_brown_corpus[-5:]

### Lemmatizing

In [None]:
!python -m spacy download en_core_web_sm
!python -m spacy download es_core_news_sm

In [None]:
import spacy

def lemmatize(words: list, lang="en") -> list:
    """Lemmatize a list of words with the small spaCy model for `lang`."""
    model = "en_core_web_sm" if lang == "en" else "es_core_news_sm"
    nlp = spacy.load(model)
    # Raise spaCy's default character limit so long corpora fit in a single call
    nlp.max_length = 1500000
    return [token.lemma_ for token in nlp(" ".join(words))]

In [None]:
print("Types (word-based):", len(Counter(sub_brown_corpus)))
print("Tipos (Lemmatized):", len(Counter(lemmatize(sub_brown_corpus))))

- eats -> eat
- eating -> eat
- eaten -> eat
- ate -> eat

### More problems?

<img src="https://external-content.duckduckgo.com/iu/?u=https%3A%2F%2Fpreview.redd.it%2Fjoonhzw1sjq31.png%3Fwidth%3D960%26crop%3Dsmart%26auto%3Dwebp%26s%3D3725297033765336276d49958089880e3f64d288&f=1&nofb=1&ipt=fdcf7c99c6a13417957a3832a14ca0f7ac4a70fc906fec79997bcb9795e31054&ipo=images" width="250" height="250">

- Language-dependent methods
- Information is lost
- Rule-based (?)
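A minimal sketch of the information loss, using a hand-written lemma table (hypothetical, not taken from spaCy): two sentences that differ in tense become identical after lemmatization, and a Spanish word falls outside the English table.

```python
# Hypothetical lemma table, written by hand for illustration
LEMMAS = {"eats": "eat", "eating": "eat", "ate": "eat", "eaten": "eat"}

def toy_lemmatize(words: list[str]) -> list[str]:
    """Replace each word by its lemma when the table knows it."""
    return [LEMMAS.get(word, word) for word in words]

past = "she ate the apple".split()
present = "she eats the apple".split()

print(toy_lemmatize(past))
print(toy_lemmatize(present))

# The tense distinction is gone after lemmatization
print("Same after lemmatizing:", toy_lemmatize(past) == toy_lemmatize(present))

# The table is language dependent: a Spanish form is left untouched
print(toy_lemmatize(["comiendo"]))
```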

## Subword tokenization saves the day 🦸🏼‍♀️

- Split words into smaller units (*subwords*)
- We get fewer distinct types, each with a higher frequency
    - Models based on statistical methods benefit from this
- Frequent words should not be split
- Long, rare words should be broken down into meaningful subwords
- There are statistical methods that need no prior knowledge of the language

In [None]:
text = "Let's do tokenization!"
result = ["Let's", "do", "token", "ization", "!"]
print(f"Goal: {text} -> {result}")

### Tokenization methods


- *Byte-pair Encoding, BPE* (🤗, 💽)
- *WordPiece* (🤗)
- *Unigram* (🤗)

In [None]:
!pip install sentencepiece
!pip install transformers

### BPE

- Iterative segmentation that starts by splitting the text into sequences of characters
- Merges the most frequent pair of adjacent symbols (*merge operation*)
- Stops when the specified number of *merge operations* or the desired vocabulary size is reached (*hyperparameters*, depending on the implementation)
- Introduced in the paper: [Neural Machine Translation of Rare Words with Subword Units, (Sennrich et al., 2015)](https://arxiv.org/abs/1508.07909)
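The steps above can be sketched in a few lines. This is a toy version written for illustration, not the `subword-nmt` implementation; the word frequencies and the `</w>` end-of-word marker are assumptions of the sketch.

```python
from collections import Counter

def get_pair_counts(vocab: dict[tuple[str, ...], int]) -> Counter:
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in vocab.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs

def merge_pair(pair: tuple[str, str], vocab: dict[tuple[str, ...], int]) -> dict[tuple[str, ...], int]:
    """Replace every occurrence of `pair` by the concatenated symbol."""
    merged = {}
    for symbols, freq in vocab.items():
        new_symbols, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                new_symbols.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                new_symbols.append(symbols[i])
                i += 1
        key = tuple(new_symbols)
        merged[key] = merged.get(key, 0) + freq
    return merged

def learn_bpe(word_freqs: dict[str, int], num_merges: int):
    """Learn `num_merges` merge operations from word frequencies."""
    # Start from characters, plus an end-of-word marker
    vocab = {tuple(word) + ("</w>",): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent pair (first one on ties)
        vocab = merge_pair(best, vocab)
        merges.append(best)
    return merges, vocab

# Toy frequencies, invented for illustration
word_freqs = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges, vocab = learn_bpe(word_freqs, num_merges=4)
print("Merges:", merges)
print("Segmented words:", list(vocab))
```

The number of merges plays the role of the `-s` option used later with `subword-nmt learn-bpe`: more merges produce longer, more word-like units.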

In [None]:
%%HTML
<iframe width="960" height="515" src="https://www.youtube.com/embed/HEikzVL-lZU"></iframe>

#### BPE example

In [None]:
SENTENCE = "Let's do this tokenization to enable hypermodernization on my tokens tokenized 👁️👁️👁️!!!"

In [None]:
from transformers import GPT2Tokenizer
bpe_tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
print(bpe_tokenizer.tokenize(SENTENCE))

In [None]:
encoded_tokens = bpe_tokenizer(SENTENCE)
encoded_tokens["input_ids"]

In [None]:
bpe_tokenizer.decode(encoded_tokens["input_ids"])

- GPT-2 actually uses *byte-level BPE*
    - This avoids a large initial vocabulary (e.g. all of Unicode)
    - Bytes are used as the base vocabulary
    - This avoids *out-of-vocabulary, OOV* tokens (aka `[UNK]`) (?)
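A small sketch of why bytes make a compact base vocabulary, using only the standard library. The sample string is invented; this shows the UTF-8 view of a text, not GPT-2's actual byte-to-symbol mapping.

```python
# "niño" followed by a corn emoji, written with escapes to keep the code ASCII
text = "ni\u00f1o \U0001F33D"

# Unicode view: one symbol per code point; the base vocabulary would need
# every code point that may ever appear in the data
code_points = list(text)

# Byte view: the same text as UTF-8 bytes; at most 256 base symbols exist
byte_values = list(text.encode("utf-8"))

print(len(code_points), code_points)
print(len(byte_values), byte_values)

# The byte sequence decodes back to the original text, so nothing is lost
restored = bytes(byte_values).decode("utf-8")
print(restored == text)
```

Any string, including emoji or characters never seen in training, can be written with those 256 byte symbols, which is why an unknown token is not needed at this level.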

### WordPiece

- Described in the paper: [Japanese and Korean voice search, (Schuster et al., 2012)](https://static.googleusercontent.com/media/research.google.com/ja//pubs/archive/37842.pdf)
- Like BPE, it initializes the vocabulary with all the characters and then learns the merges
- Unlike BPE, it does not pick the most frequent pair; it picks the pair that maximizes the likelihood of the training data once the merge is added to the vocabulary

$$score(a_i,b_j) = \frac{f(a_i,b_j)}{f(a_i)f(b_j)}$$

- In other words, it evaluates what is lost by performing a *merge* to make sure the merge is worth it

- Algorithm used in `BERT`
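The score above can be computed by hand on a toy example. The word counts below are invented for illustration, and the `##` prefix marks a symbol that continues a word; the sketch only scores candidate pairs, it does not train a full WordPiece model.

```python
from collections import Counter

# Toy word frequencies, already split into WordPiece-style symbols
words = {
    ("h", "##u", "##g"): 10,
    ("p", "##u", "##g"): 5,
    ("p", "##u", "##n"): 12,
    ("b", "##u", "##n"): 4,
    ("h", "##u", "##g", "##s"): 5,
}

def count_units_and_pairs(words: dict[tuple[str, ...], int]):
    """Return f(a) for every symbol and f(a, b) for every adjacent pair."""
    unit_freq, pair_freq = Counter(), Counter()
    for symbols, freq in words.items():
        for symbol in symbols:
            unit_freq[symbol] += freq
        for pair in zip(symbols, symbols[1:]):
            pair_freq[pair] += freq
    return unit_freq, pair_freq

unit_freq, pair_freq = count_units_and_pairs(words)

# score(a, b) = f(a, b) / (f(a) * f(b))
scores = {
    (a, b): freq / (unit_freq[a] * unit_freq[b])
    for (a, b), freq in pair_freq.items()
}

most_frequent = max(pair_freq, key=pair_freq.get)  # what BPE would merge
best_score = max(scores, key=scores.get)           # what this score prefers

print("Most frequent pair:", most_frequent, pair_freq[most_frequent])
print("Best-scoring pair:", best_score, scores[best_score])
```

The most frequent pair contains `##u`, which appears almost everywhere, so dividing by the individual frequencies penalizes it; the score favors pairs whose parts rarely occur apart.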

In [None]:
%%HTML
<iframe width="960" height="500" src="https://www.youtube.com/embed/qpv6ms_t_1A"></iframe>

In [None]:
from transformers import BertTokenizer
SENTENCE = "🌽" + SENTENCE + "🔥"
wp_tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
print(wp_tokenizer.tokenize(SENTENCE))

<center><img src="https://us-tuna-sounds-images.voicemod.net/9cf541d2-dd7f-4c1c-ae37-8bc671c855fe-1665957161744.jpg"></center>

In [None]:
wp_tokenizer(SENTENCE)

### Unigram

- Subword tokenization algorithm introduced in the paper: [Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates (Kudo, 2018)](https://arxiv.org/pdf/1804.10959.pdf)
- Unlike BPE or WordPiece, this algorithm starts with a very large vocabulary and shrinks it until the desired vocabulary size is reached
- At each iteration it computes how much the loss would increase if a given element were removed from the vocabulary
    - The `p%` of elements that increase the loss the least in that iteration are removed
- The algorithm stops when the vocabulary reaches the desired size
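A rough sketch of the idea behind one pruning step, under strong simplifications: the subword counts are hypothetical, the loss is the negative log-likelihood of a single word, and probabilities are not re-estimated after removing a piece (a real implementation would re-estimate them, typically with EM).

```python
import math

# Hypothetical subword counts, invented for illustration only
counts = {"token": 30, "ization": 10, "ation": 15, "iz": 5,
          "t": 5, "o": 5, "k": 5, "e": 5, "n": 5, "i": 5, "z": 5, "a": 5}

def to_log_probs(counts: dict[str, int]) -> dict[str, float]:
    """Turn counts into unigram log-probabilities."""
    total = sum(counts.values())
    return {piece: math.log(count / total) for piece, count in counts.items()}

def viterbi_segment(word: str, log_probs: dict[str, float]):
    """Return the most probable segmentation of `word` and its log-probability."""
    n = len(word)
    best = [0.0] + [-math.inf] * n   # best[i]: best score for word[:i]
    back = [0] * (n + 1)             # back[i]: start of the last piece
    for end in range(1, n + 1):
        for start in range(end):
            piece = word[start:end]
            if piece in log_probs and best[start] + log_probs[piece] > best[end]:
                best[end] = best[start] + log_probs[piece]
                back[end] = start
    if best[n] == -math.inf:
        raise ValueError(f"cannot segment {word!r} with this vocabulary")
    pieces, end = [], n
    while end > 0:
        pieces.append(word[back[end]:end])
        end = back[end]
    return pieces[::-1], best[n]

log_probs = to_log_probs(counts)
pieces, score = viterbi_segment("tokenization", log_probs)

# Candidate removal: drop "ization" and see how much the loss grows
reduced = {piece: lp for piece, lp in log_probs.items() if piece != "ization"}
pieces_reduced, score_reduced = viterbi_segment("tokenization", reduced)
loss_increase = score - score_reduced  # loss is the negative log-likelihood

print(pieces, pieces_reduced, round(loss_increase, 3))
```

Pieces whose removal barely changes this loss are the ones the algorithm discards first.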

<center><img src="img/unigram_loss.png" width=500 height=500></center>

However, *Unigram* is not used on its own by any Hugging Face model:
> "Unigram is not used directly for any of the models in the transformers, but it’s used in conjunction with SentencePiece." - Hugging Face tokenizer summary

### SentencePiece


- Does not assume that words are separated by spaces
- Treats the input text as a *stream* of raw data, so the space is just another character
- Uses BPE or Unigram to build the vocabulary
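A minimal sketch of the "space is just another character" idea, without the `sentencepiece` library. SentencePiece marks whitespace with the meta symbol `▁` (U+2581); the helper names and the list of pieces below are hypothetical and written by hand.

```python
SPACE_SYMBOL = "\u2581"  # the meta symbol used to mark whitespace

def to_raw_stream(text: str) -> str:
    """Encode spaces as a visible symbol, with a leading marker for the first word."""
    return SPACE_SYMBOL + text.replace(" ", SPACE_SYMBOL)

def detokenize(pieces: list[str]) -> str:
    """Concatenate the pieces and turn the meta symbol back into spaces."""
    text = "".join(pieces).replace(SPACE_SYMBOL, " ")
    return text[1:] if text.startswith(" ") else text

text = "Let's do tokenization"
stream = to_raw_stream(text)
print(stream)

# A hand-written segmentation of the stream (not produced by a trained model)
pieces = [SPACE_SYMBOL + "Let", "'s", SPACE_SYMBOL + "do", SPACE_SYMBOL + "token", "ization"]
print(detokenize(pieces))
```

Because the whitespace is stored inside the pieces, detokenization is a plain string join, with no language-specific rules.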

In [None]:
from transformers import XLNetTokenizer

tokenizer = XLNetTokenizer.from_pretrained("xlnet-base-cased")
print(tokenizer.tokenize(SENTENCE))

### Goal of subword tokenizers


- We want neural network models to see more frequent units
- In principle this helps them "learn" better
- Reduce the number of types (?)
- Reduce the number of OOV tokens (?)
- Reduce the entropy (?)

## Let's tokenize 🌈
![](https://external-content.duckduckgo.com/iu/?u=https%3A%2F%2Fi.pinimg.com%2F736x%2F75%2F28%2Fe7%2F7528e71db75a37f0dcf5be8a54e0523f.jpg&f=1&nofb=1&ipt=d08ba1ed7fa9af9c3692703a667271740c22bb8e8f5b9f5f7acb44715e7d47d8&ipo=images)

### Spanish corpus: CESS

In [None]:
def normalize_sent(sent: list[str]) -> list[str]:
    # Lowercase and keep only tokens that start with a word character
    return [word.lower() for word in sent if re.match(r"\w", word)]

In [None]:
nltk.download("cess_esp")

In [None]:
from nltk.corpus import cess_esp as cess

cess_sents = cess.sents()

In [None]:
len(cess_sents)

In [None]:
" ".join(cess_sents[0])

In [None]:
cess_plain_text = "\n".join([" ".join(normalize_sent(sentence)) for sentence in cess_sents])
# Replace hyphens and underscores (used to join multiword tokens) with spaces
cess_plain_text = re.sub(r"[-_]", " ", cess_plain_text)

In [None]:
len(cess_plain_text)

In [None]:
print(cess_plain_text[300:600])

In [None]:
cess_words = cess_plain_text.split()

In [None]:
print(cess_words[:100])

### English corpus: Gutenberg

In [None]:
nltk.download('gutenberg')

In [None]:
from nltk.corpus import gutenberg

gutenberg_sents = gutenberg.sents()[:10000]

In [None]:
len(gutenberg_sents)

In [None]:
" ".join(gutenberg_sents[0])

In [None]:
gutenberg_plain_text = "\n".join([" ".join(normalize_sent(sent)) for sent in gutenberg_sents])

print(gutenberg_plain_text[:100])

In [None]:
gutenberg_words = gutenberg_plain_text.split()

In [None]:
gutenberg_words[:10]

In [None]:
len(gutenberg_words)

In [None]:
len(gutenberg_plain_text)

In [None]:
with open("corpora/tokenization/gutenberg_plain.txt", "w") as f:
    f.write(gutenberg_plain_text)

### Tokenizing Spanish with Hugging Face

In [None]:
from transformers import AutoTokenizer

spanish_tokenizer = AutoTokenizer.from_pretrained("dccuchile/bert-base-spanish-wwm-uncased")
print(spanish_tokenizer.tokenize(cess_plain_text[1000:1400]))

In [None]:
cess_types = Counter(cess_words)
len(cess_types)

In [None]:
print(cess_types.most_common(10))

In [None]:
cess_tokenized = spanish_tokenizer.tokenize(cess_plain_text)
cess_tokenized_types = Counter(cess_tokenized)
len(cess_tokenized_types)

In [None]:
print(cess_tokenized_types.most_common(30))

In [None]:
cess_lemmatized_types = Counter(lemmatize(cess_words, lang="es"))
len(cess_lemmatized_types)

In [None]:
print(cess_lemmatized_types.most_common(30))

### Tokenizing English

In [None]:
gutenberg_types = Counter(gutenberg_words)
len(gutenberg_types)

In [None]:
gutenberg_tokenized = wp_tokenizer.tokenize(gutenberg_plain_text)
gutenberg_tokenized_types = Counter(gutenberg_tokenized)
len(gutenberg_tokenized_types)

In [None]:
print(gutenberg_tokenized_types.most_common(100))

In [None]:
gutenberg_lemmatized_types = Counter(lemmatize(gutenberg_words))
len(gutenberg_lemmatized_types)

In [None]:
print(gutenberg_lemmatized_types.most_common(20))

### OOV: out of vocabulary

Words that appear in the test set but were not seen during training

In [None]:
from sklearn.model_selection import train_test_split

train_data, test_data = train_test_split(gutenberg_words, test_size=0.3, random_state=42)
print(len(train_data), len(test_data))

In [None]:
s_1 = {"a", "b", "c", "d", "e"}
s_2 = {"a", "x", "y", "d"}
print(s_1 - s_2)
print(s_2 - s_1)

In [None]:
oov_test = set(test_data) - set(train_data)
len(oov_test)

In [None]:
# Build the training vocabulary once instead of on every iteration
train_vocab = set(train_data)
for word in list(oov_test)[:3]:
    print(f"{word} in train: {word in train_vocab}")

In [None]:
train_tokenized, test_tokenized = train_test_split(gutenberg_tokenized, test_size=0.3, random_state=42)
print(len(train_tokenized), len(test_tokenized))

In [None]:
oov_tokenized_test = set(test_tokenized) - set(train_tokenized)
len(oov_tokenized_test)

## Training our own BPE model
![](https://external-content.duckduckgo.com/iu/?u=https%3A%2F%2Fmedia1.tenor.com%2Fimages%2Fd565618bb1217a7c435579d9172270d0%2Ftenor.gif%3Fitemid%3D3379322&f=1&nofb=1&ipt=9719714edb643995ce9d978c8bab77f5310204960093070e37e183d5372096d9&ipo=images)

In [None]:
!pip install subword-nmt

In [None]:
!ls corpora/tokenization

In [None]:
!head corpora/tokenization/gutenberg_plain.txt

In [None]:
!subword-nmt learn-bpe -s 300 < corpora/tokenization/gutenberg_plain.txt > models/tokenization/gutenberg_low.model

In [None]:
!echo "I need to process this sentence because tokenization can be useful" | subword-nmt apply-bpe -c models/tokenization/gutenberg_low.model

In [None]:
!subword-nmt learn-bpe -s 1500 < corpora/tokenization/gutenberg_plain.txt > models/tokenization/gutenberg_high.model

In [None]:
!echo "I need to process this sentence because tokenization can be useful" | subword-nmt apply-bpe -c models/tokenization/gutenberg_high.model

## Applying it to other corpora: the Bible 📖🇻🇦

In [None]:
BIBLE_FILE_NAMES = {"spa": "spa-x-bible-reinavaleracontemporanea", "eng": "eng-x-bible-kingjames"}
CORPORA_PATH = "corpora/tokenization/"

In [None]:
import requests

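def get_bible_corpus(lang: str) -> str:
    """Download the plain-text Bible for `lang` ("spa" or "eng")."""
    file_name = BIBLE_FILE_NAMES[lang]
    r = requests.get(f"https://raw.githubusercontent.com/ximenina/theturningpoint/main/Detailed/corpora/corpusPBC/{file_name}.txt.clean.txt")
    r.raise_for_status()  # fail early if the download did not succeed
    return r.text

def write_plain_text_corpus(raw_text: str, file_name: str) -> None:
    """Write `raw_text` to `<file_name>.txt`."""
    with open(f"{file_name}.txt", "w", encoding="utf-8") as f:
        f.write(raw_text)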

### English Bible

In [None]:
eng_bible_plain_text = get_bible_corpus("eng")
eng_bible_words = eng_bible_plain_text.lower().replace("\n", " ").split()

In [None]:
print(eng_bible_words[:10])

In [None]:
len(eng_bible_words)

In [None]:
from collections import Counter
eng_bible_types = Counter(eng_bible_words)
len(eng_bible_types)

In [None]:
print(eng_bible_types.most_common(30))

In [None]:
eng_bible_lemmas_types = Counter(lemmatize(eng_bible_words, lang="en"))
len(eng_bible_lemmas_types)

In [None]:
write_plain_text_corpus(eng_bible_plain_text, CORPORA_PATH + "eng-bible")

In [None]:
!subword-nmt apply-bpe -c models/tokenization/gutenberg_low.model < corpora/tokenization/eng-bible.txt > corpora/tokenization/eng-bible-tokenized.txt

In [None]:
with open(CORPORA_PATH + "eng-bible-tokenized.txt", 'r') as f:
    tokenized_data = f.read()
eng_bible_tokenized = tokenized_data.split()

In [None]:
print(eng_bible_tokenized[:10])

In [None]:
len(eng_bible_tokenized)

In [None]:
eng_bible_tokenized_types = Counter(eng_bible_tokenized)
len(eng_bible_tokenized_types)

In [None]:
eng_bible_tokenized_types.most_common(30)

### What happens if we apply the model learned on Gutenberg to other languages?

In [None]:
spa_bible_plain_text = get_bible_corpus('spa')
spa_bible_words = spa_bible_plain_text.replace("\n", " ").lower().split()

In [None]:
spa_bible_words[:10]

In [None]:
len(spa_bible_words)

In [None]:
spa_bible_types = Counter(spa_bible_words)
len(spa_bible_types)

In [None]:
spa_bible_types.most_common(30)

In [None]:
spa_bible_lemmas_types = Counter(lemmatize(spa_bible_words, lang="es"))
len(spa_bible_lemmas_types)

In [None]:
write_plain_text_corpus(spa_bible_plain_text, CORPORA_PATH + "spa-bible")

In [None]:
!subword-nmt apply-bpe -c models/tokenization/gutenberg_high.model < corpora/tokenization/spa-bible.txt > corpora/tokenization/spa-bible-tokenized.txt

In [None]:
with open(CORPORA_PATH + "spa-bible-tokenized.txt", "r") as f:
    tokenized_text = f.read()
spa_bible_tokenized = tokenized_text.split()

In [None]:
spa_bible_tokenized[:10]

In [None]:
len(spa_bible_tokenized)

In [None]:
spa_bible_tokenized_types = Counter(spa_bible_tokenized)
len(spa_bible_tokenized_types)

In [None]:
spa_bible_tokenized_types.most_common(40)

### Type-token Ratio (TTR)

- A way to measure vocabulary variation in a corpus
- It is computed as $TTR = \frac{len(types)}{len(tokens)}$
- It can be useful for monitoring the lexical variation of a text
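The formula can be wrapped in a small helper. The function name `type_token_ratio` and the sample sentence are introduced here for illustration only.

```python
def type_token_ratio(tokens: list[str]) -> float:
    """Number of distinct types divided by the number of tokens."""
    if not tokens:
        raise ValueError("cannot compute TTR of an empty token list")
    return len(set(tokens)) / len(tokens)

# Toy sentence: 9 tokens, 6 distinct types
sample = "la casa de la abuela es la casa azul".split()
ttr = type_token_ratio(sample)
print(f"Tokens: {len(sample)}, types: {len(set(sample))}, TTR: {ttr:.3f}")
```

A value close to 1 means almost every token is a new type; a value close to 0 means a small vocabulary that repeats often.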

In [None]:
print("English Bible information")
print("Tokens:", len(eng_bible_words))
print("Types (word-base):", len(eng_bible_types))
print("Types (lemmatized):", len(eng_bible_lemmas_types))
print("Types (BPE):", len(eng_bible_tokenized_types))
print("TTR (word-base):", len(eng_bible_types)/len(eng_bible_words))
print("TTR (BPE):", len(eng_bible_tokenized_types)/len(eng_bible_tokenized))

In [None]:
print("Bible Spanish Information")
print("Tokens:", len(spa_bible_words))
print("Types (word-base):", len(spa_bible_types))
print("Types (lemmatized):", len(spa_bible_lemmas_types))
print("Types (BPE):", len(spa_bible_tokenized_types))
print("TTR (word-base):", len(spa_bible_types)/len(spa_bible_words))
print("TTR (BPE):", len(spa_bible_tokenized_types)/len(spa_bible_tokenized))

## Training BPE on a Nahuatl corpus

In [None]:
!pip install elotl

In [None]:
import elotl.corpus
axolotl = elotl.corpus.load("axolotl")

In [None]:
len(axolotl)

In [None]:
train_rows_count = len(axolotl) - round(len(axolotl)*.30)

In [None]:
axolotl_train = axolotl[:train_rows_count]
axolotl_test = axolotl[train_rows_count:]

In [None]:
axolotl_train[3]

In [None]:
print("Axolotl train len:", len(axolotl_train))
print("Axolotl test len:", len(axolotl_test))
print("Total:", len(axolotl_test) + len(axolotl_train))

In [None]:
axolotl_train[:3]

In [None]:
axolotl_words_train = [word for row in axolotl_train for word in row[1].lower().split()]
len(axolotl_words_train)

In [None]:
print(axolotl_words_train[:10])

In [None]:
write_plain_text_corpus(" ".join(axolotl_words_train), CORPORA_PATH + "axolotl_plain")

In [None]:
!subword-nmt learn-bpe -s 500 < corpora/tokenization/axolotl_plain.txt > models/tokenization/axolotl.model

In [None]:
axolotl_test_words = [word for row in axolotl_test for word in row[1].lower().split()]
axolotl_test_types = Counter(axolotl_test_words)

In [None]:
print(axolotl_test_types.most_common(10))

In [None]:
axolotl_singletons = [singleton for singleton in axolotl_test_types.items() if singleton[1] == 1]

In [None]:
len(axolotl_singletons)

In [None]:
write_plain_text_corpus(" ".join(axolotl_test_words), CORPORA_PATH + "axolotl_plain_test")

In [None]:
!subword-nmt apply-bpe -c models/tokenization/axolotl.model < corpora/tokenization/axolotl_plain_test.txt > corpora/tokenization/axolotl_tokenized.txt

In [None]:
with open(CORPORA_PATH + "axolotl_tokenized.txt") as f:
    axolotl_test_tokenized = f.read().split()

In [None]:
len(axolotl_test_tokenized)

In [None]:
print(axolotl_test_tokenized[:10])

In [None]:
axolotl_test_tokenized_types = Counter(axolotl_test_tokenized)

In [None]:
axolotl_test_tokenized_types.most_common(20)

In [None]:
print("Axolotl Information")
print("Tokens:", len(axolotl_test_words))
print("Types (word-base):", len(axolotl_test_types))
print("Types (native BPE):", len(axolotl_test_tokenized_types))
print("TTR (word-base):", len(axolotl_test_types)/len(axolotl_test_words))
print("TTR (BPE):", len(axolotl_test_tokenized_types)/len(axolotl_test_tokenized))

## Normalization

<center><img src="img/metro.jpg" width=700 height=700></center>

In [None]:
METROFLOG_SENTENCE = "lEt'$ dó tHis béttëŕ :)"

In [None]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("huggingface-course/albert-tokenizer-without-normalizer")
tokenizer.convert_ids_to_tokens(tokenizer.encode(METROFLOG_SENTENCE))

In [None]:
tokenizer = AutoTokenizer.from_pretrained("albert-large-v2")
tokenizer.convert_ids_to_tokens(tokenizer.encode(METROFLOG_SENTENCE))
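The two cells above load an ALBERT tokenizer without and with its normalizer. As a rough standard-library approximation of what such a normalizer does, assuming it lowercases the text and strips accents (the real ALBERT normalizer may apply further steps), the effect can be sketched as:

```python
import unicodedata

def normalize_text(text: str) -> str:
    """Lowercase the text and remove accents (combining marks)."""
    # NFKD splits accented characters into a base character plus combining marks
    decomposed = unicodedata.normalize("NFKD", text.lower())
    # Drop the combining marks and keep everything else
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

METROFLOG_SENTENCE = "lEt'$ dó tHis béttëŕ :)"
normalized = normalize_text(METROFLOG_SENTENCE)
print(normalized)
```

Normalizing before tokenizing maps many spelling variants onto the same surface form, which reduces the number of types the tokenizer has to deal with.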

#### What about languages with few digital resources?

- There are not many resources :(
- But for Nahuatl there is `pyelotl` :)

#### Normalizing Nahuatl

In [None]:
import elotl.nahuatl.orthography

In [None]:
# Three possible normalizers: sep, inali, ack
# Source: https://pypi.org/project/elotl/

nahuatl_normalizer = elotl.nahuatl.orthography.Normalizer("sep")

In [None]:
axolotl[1][1]

In [None]:
nahuatl_normalizer.normalize(axolotl[1][1])

In [None]:
nahuatl_normalizer.to_phones(axolotl[1][1])

## Entropy of a text

<center><img src="img/entropy.gif" height=500 width=500></center>

<center><img src="img/entropy_eq.png"></center>
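Assuming the image above shows the usual Shannon entropy (the `calculate_entropy` function in the next cell computes it with base-2 logarithms), the formula over the types $x_1, \dots, x_n$ of a corpus can be written as:

$$H(X) = -\sum_{i=1}^{n} p(x_i) \log_2 p(x_i)$$

Worked example: with four equally frequent types, $p(x_i) = \frac{1}{4}$, so $H(X) = -4 \cdot \frac{1}{4} \log_2 \frac{1}{4} = 2$ bits. A more skewed distribution over the same four types gives a lower value.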

In [None]:
import math

def calculate_entropy(corpus: list[str]) -> float:
    words_counts = Counter(corpus)
    total_words = len(corpus)
    probabilities = {word: count / total_words for word, count in words_counts.items()}
    entropy = -sum(p * math.log2(p) for p in probabilities.values())
    return entropy

In [None]:
calculate_entropy(eng_bible_words)

In [None]:
calculate_entropy(eng_bible_tokenized)

## Practice 4: Subword tokenization
**Due date: March 24, 11:59 pm**

- Compute the entropy of two texts: brown and axolotl
    - Compute it for the word-level tokenized texts
    - Compute it for the texts tokenized with BPE
        - Tokenize with the `subword-nmt` library
- Print to the screen:
    - Entropy of axolotl, word-based and BPE
    - Entropy of brown, word-based and BPE
- Answer the questions:
    - Did the entropy increase or decrease for each corpus?
        - axolotl
        - brown
    - What does it mean for the entropy of a text to increase or decrease?
    - How does tokenization affect the entropy of a text?

### Extra

- Run the normalization process on the Nahuatl text
- Train a model on the normalized text
    - Using BPE with `subword-nmt`
- Compare entropy, types, tokens and TTR across the versions:
    - tokenized without normalization
    - tokenized with normalization

### References:

- [Bible corpora in several languages](https://github.com/ximenina/theturningpoint/tree/main/Detailed/corpora/corpusPBC)
- [Native BPE library](https://github.com/rsennrich/subword-nmt)
- [Hugging Face tokenizers](https://huggingface.co/docs/transformers/tokenizer_summary)