# 4. Preprocessing and (Neural) Language Models

<img src="https://2.bp.blogspot.com/-oDvCIkIjwXw/VdWWxfvmq3I/AAAAAAAARUE/r0MrmbNzMz8/s1600/inputoutput.jpg" width=500>

## Objectives

- Apply preprocessing to Spanish and English corpora
- Understand how sub-word tokenization algorithms work
  - Apply BPE to a corpus
- Train a neural language model with Bengio's architecture

## What is a word?

- Language processing techniques depend on words and sentences.
  - We must identify these elements before we can process them
- This step of identifying words and sentences is called text segmentation or **tokenization**
- Besides identifying units, we will also apply transformations to the text

Although the definition of a word may seem obvious, it is surprisingly hard to pin down. Is each of the following one word or two?

- I'm
- we'd
- I've
- Diego's Bicycle
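
One way to see the ambiguity is to run two naive tokenizers over these examples. The sketch below uses only the standard library; both strategies (and the helper names) are illustrative choices, not a recommended tokenizer.

```python
import re

examples = ["I'm", "we'd", "I've", "Diego's Bicycle"]

def whitespace_tokens(text: str) -> list[str]:
    """Split on whitespace only: clitics stay glued to their host word."""
    return text.split()

def word_char_tokens(text: str) -> list[str]:
    """Keep runs of word characters: the apostrophe becomes a boundary."""
    return re.findall(r"\w+", text)

for example in examples:
    print(f"{example!r}: {whitespace_tokens(example)} vs {word_char_tokens(example)}")
```

Neither answer is obviously right: the first treats `I'm` as a single unit, the second produces a stray `m`.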

## Elements of preprocessing

- Normalization
    - Lowercasing everything
    - Mapping the text to a given orthographic standard
- Removing stopwords
- Removing markup (HTML, XML)
- Tokenization
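
As a rough sketch of how these steps chain together, the snippet below applies them to a single string. The stopword set is a tiny hand-picked assumption used only for illustration, and the tag-stripping regex is a simplification that would not cope with arbitrary HTML.

```python
import re

# Tiny hand-picked stopword set (an assumption for this example only;
# a real pipeline would use a curated list).
TOY_STOPWORDS = {"el", "la", "de", "en", "y", "a"}

def toy_preprocess(raw: str) -> list[str]:
    no_markup = re.sub(r"<[^>]+>", " ", raw)   # remove HTML/XML tags
    lowered = no_markup.lower()                # case normalization
    tokens = re.findall(r"\w+", lowered)       # naive word tokenization
    return [tok for tok in tokens if tok not in TOY_STOPWORDS]

print(toy_preprocess("<p>El gato duerme en <b>la</b> casa</p>"))
# ['gato', 'duerme', 'casa']
```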

### Stopwords

In [1]:
import re
import nltk
from nltk.corpus import stopwords
from rich import print as rprint

In [2]:
BASE_PATH = "drive/MyDrive"
CORPORA_PATH = f"{BASE_PATH}/corpora/tokenization"
MODELS_PATH = f"{BASE_PATH}/models/sub-word"

In [3]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [4]:
rprint(stopwords.words("spanish")[:15])

### Normalization

<center><img src="https://external-content.duckduckgo.com/iu/?u=http%3A%2F%2Fimg1.wikia.nocookie.net%2F__cb20140504152558%2Fspongebob%2Fimages%2Fe%2Fe3%2FThe_spongebob.jpg&f=1&nofb=1&ipt=28368023b54a7c84c9100025981b1042d0f4ca3ceaac53be42094cc1c3794348&ipo=images" height=300 width=300></center>

In [5]:
import unicodedata

def strip_accents(s: str) -> str:
   return ''.join(
       c for c in unicodedata.normalize('NFD', s)
       if unicodedata.category(c) != 'Mn'
   )

In [6]:
strip_accents("mamá hoy quería que me oigan en el olímpo")

'mama hoy queria que me oigan en el olimpo'

- https://www.unicode.org/reports/tr44/#GC_Values_Table

> And keep in mind, these manipulations may significantly alter the meaning of the text. Accents, Umlauts etc. are not "decoration".
- [oefe](https://stackoverflow.com/users/49793/oefe) - [source](https://stackoverflow.com/questions/517923/what-is-the-best-way-to-remove-accents-normalize-in-a-python-unicode-string)
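
To make the warning concrete, the following sketch shows what NFD decomposition does to an accented character and how dropping the combining marks (category `Mn`) can change a word. The helper `remove_marks` is a hypothetical name that mirrors the logic of `strip_accents`.

```python
import unicodedata

def remove_marks(s: str) -> str:
    """Drop combining marks after canonical decomposition (NFD)."""
    return "".join(
        c for c in unicodedata.normalize("NFD", s)
        if unicodedata.category(c) != "Mn"
    )

# "á" (U+00E1) decomposes into a base letter plus a combining accent
for ch in unicodedata.normalize("NFD", "\u00e1"):
    print(repr(ch), unicodedata.category(ch), unicodedata.name(ch))

# In Spanish, "ñ" is a separate letter, so removing its tilde yields a different word
print(remove_marks("año"))
```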

In [7]:
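def preprocess(words: list[str], regex: str = r"\w+", lang: str = "en") -> list[str]:
    r"""Preprocess step for corpus

    Parameters
    ----------
    words: list[str]
        Words of a given corpus
    regex: str
        Optional regex to filter patterns in words. Default \w+
        (accepted but currently not applied)
    lang: str
        Optional lang to choose the stopwords. Default "en"
        (the stopword list is selected but not yet removed)

    Return
    ------
    list:
        List of words filtered and normalized

    """
    # Kept for a future stopword filter; not used below
    stop_lang = "english" if lang == "en" else "spanish"
    result = []
    for word in words:
        # Remove punctuation and lowercase the word
        word = re.sub(r"[^\w\s]", "", word).lower()
        # Keep only purely alphabetic tokens (numbers and empty strings are dropped)
        if word.isalpha():
            result.append(word)
    return result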

#### What about other languages?

- There are not many resources :(
- But for Nahuatl there is `pyelotl` :)

#### Normalizing Nahuatl

In [8]:
!pip install elotl

Collecting elotl
  Downloading elotl-0.1.1-py3-none-any.whl.metadata (6.2 kB)
Downloading elotl-0.1.1-py3-none-any.whl (5.1 MB)
Installing collected packages: elotl
Successfully installed elotl-0.1.1


In [9]:
import elotl.corpus
import elotl.nahuatl.orthography

In [10]:
axolotl = elotl.corpus.load("axolotl")

In [11]:
# Three available normalizers: sep, inali, ack
# Source: https://pypi.org/project/elotl/

nahuatl_normalizer = elotl.nahuatl.orthography.Normalizer("sep")

In [12]:
axolotl[1][1]

'¿In chalchihuitl, teocuitlatl, mach ah ca on yaz?'

In [13]:
nahuatl_normalizer.normalize(axolotl[1][1])

'¿in chalchiuitl, teokuitlatl, mach aj ka on yas?'

In [14]:
nahuatl_normalizer.to_phones(axolotl[1][1])

'¿in t͡ʃalt͡ʃiwiƛ, teokʷiƛaƛ, mat͡ʃ aʔ ka on yas?'

## Tokenization

### Word-based tokenization

In [15]:
text = """
¡¡¡Mamá prendele a la grabadora!!!, ¿llamaste a las vecinas? Corre la voz porque, efectivamente, !estoy en la tele! 📺
"""

In [16]:
text.split()

['¡¡¡Mamá',
 'prendele',
 'a',
 'la',
 'grabadora!!!,',
 '¿llamaste',
 'a',
 'las',
 'vecinas?',
 'Corre',
 'la',
 'voz',
 'porque,',
 'efectivamente,',
 '!estoy',
 'en',
 'la',
 'tele!',
 '📺']

In [17]:
# \w matches Unicode word characters (letters, digits and underscore), not only [a-zA-Z_]
regex = r"\w+"
re.findall(regex, text)

['Mamá',
 'prendele',
 'a',
 'la',
 'grabadora',
 'llamaste',
 'a',
 'las',
 'vecinas',
 'Corre',
 'la',
 'voz',
 'porque',
 'efectivamente',
 'estoy',
 'en',
 'la',
 'tele']

In [18]:
re.findall(regex, "El valor de PI es 3.14159")

['El', 'valor', 'de', 'PI', 'es', '3', '14159']

<img src="http://images.wikia.com/battlebears/images/2/2c/Troll_Problem.jpg" width="250" height="250">

- Huge vocabularies that are hard to process
- In general, the larger the vocabulary, the heavier our model will be

**Example:**
- If we want vector representations of our words, we would get different vectors for similar words
    - niño = `v1(39, 34, 5,...)`
    - niños = `v2(9, 4, 0,...)`
    - niña = `v3(2, 1, 1,...)`
    - ...
- We would have tokens with extremely low frequency
    - merequetengue = `vn(0,0,1,...)`

### One solution: Stemming/Lemmatization (AKA old reliable)

![](https://i.pinimg.com/736x/77/df/89/77df89e6ff57d332ba4e5d7bff723133--meme.jpg)

In [19]:
from nltk.corpus import brown
nltk.download('brown')

[nltk_data] Downloading package brown to /root/nltk_data...
[nltk_data]   Unzipping corpora/brown.zip.


True

In [20]:
brown_corpus = preprocess(brown.words()[:100000])
rprint(brown_corpus[0])

In [21]:
rprint(brown_corpus[:10])

In [22]:
from collections import Counter

rprint("[yellow]Brown Vanilla")
rprint("Tokens:", len(brown.words()))
rprint("Types:", len(Counter(brown.words())))

rprint("[green]Brown Preprocess")
rprint("Tokens:", len(brown_corpus))
rprint("Types:", len(Counter(brown_corpus)))

#### Stemming

In [23]:
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer("english")

In [24]:
stemmed_brown = [stemmer.stem(word) for word in brown_corpus]

#### Lemmatization

In [25]:
!python -m spacy download en_core_web_md
!python -m spacy download es_core_news_md

Collecting en-core-web-md==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_md-3.8.0/en_core_web_md-3.8.0-py3-none-any.whl (33.5 MB)
Installing collected packages: en-core-web-md
Successfully installed en-core-web-md-3.8.0
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_md')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.
Collecting es-core-news-md==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/es_core_news_md-3.8.0/es_core_news_md-3.8.0-py3-none-any.whl (42.3 MB)

In [26]:
import spacy

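def lemmatize(words: list, lang: str="en") -> list:
    """Lemmatize a list of words with the spaCy model for `lang`"""
    model = "en_core_web_md" if lang == "en" else "es_core_news_md"
    nlp = spacy.load(model)
    # Raise the character limit so the whole corpus fits in a single document
    nlp.max_length = 1500000
    # The words are joined into one string and re-tokenized by spaCy
    return [token.lemma_ for token in nlp(" ".join(words))]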

In [27]:
lemmatized_brown = lemmatize(brown_corpus)

In [28]:
rprint("Types ([blue]word-based):", len(Counter(brown_corpus)))
rprint("Types ([yellow]Stemming):", len(Counter(stemmed_brown)))
rprint("Types ([green]Lemmatized):", len(Counter(lemmatized_brown)))

#### More problems?

<img src="https://uploads.dailydot.com/2019/10/Untitled_Goose_Game_Honk.jpeg?auto=compress%2Cformat&ixlib=php-3.3.0" width="250" height="250">

- Language-dependent methods
- Information is lost
- Rule-based

## Subword tokenization saves the day 🦸🏼‍♀️

![](https://gifdb.com/images/high/super-cow-and-chicken-daxvak1q16quwd9p.webp)

- Words are segmented into smaller units (*sub-words*)
- We get fewer distinct types, each with a higher frequency
    - Models based on statistical methods benefit from this
- Frequent words should not be split
- Long, rare words should be decomposed into meaningful sub-words
- These are statistical methods that require no prior knowledge of the language

In [29]:
text = "Let's do tokenization!"
result = ["Let's", "do", "token", "ization", "!"]
print(f"Goal: {text} -> {result}")

Goal: Let's do tokenization! -> ["Let's", 'do', 'token', 'ization', '!']


### Algorithms

Several algorithms exist for *subword tokenization*, including:

- Byte-Pair Encoding (BPE)
- WordPiece
- Unigram

#### BPE

- Iterative segmentation that starts by splitting the text into sequences of characters
- Merges the most frequent pair of adjacent symbols (*merge operation*)
- Stops when it reaches the specified number of *merge operations* or the desired vocabulary size (*hyperparameters*; which one applies depends on the implementation)
- Introduced in the paper: [Neural Machine Translation of Rare Words with Subword Units (Sennrich et al., 2015)](https://arxiv.org/abs/1508.07909)
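
The merge loop described above can be sketched in a few lines of plain Python. This is a minimal teaching version under simplifying assumptions (no end-of-word marker, ties broken by first occurrence, a made-up toy corpus); it is not the `subword-nmt` or Hugging Face implementation.

```python
from collections import Counter

def learn_bpe(words: list[str], num_merges: int) -> list[tuple[str, str]]:
    """Learn up to `num_merges` merge operations from a list of words."""
    # Each distinct word is a tuple of symbols (initially characters) with its frequency
    vocab = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # nothing left to merge
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Rewrite every word, fusing each occurrence of the best pair
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges

toy_corpus = ["low"] * 5 + ["lower"] * 2 + ["newest"] * 6 + ["widest"] * 3
print(learn_bpe(toy_corpus, num_merges=4))
```

On this toy corpus the frequent suffix `est` is assembled within the first two merges, which is the behavior we want: frequent character sequences become single symbols.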

In [30]:
%%HTML
<iframe width="960" height="515" src="https://www.youtube.com/embed/HEikzVL-lZU"></iframe>

In [31]:
!pip install transformers



In [32]:
SENTENCE = "Let's do this tokenization to enable hypermodernization on my tokens tokenized 👁️👁️👁️!!!"

In [33]:
from transformers import GPT2Tokenizer

bpe_tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
rprint(bpe_tokenizer.tokenize(SENTENCE))


In [34]:
encoded_tokens = bpe_tokenizer(SENTENCE)
rprint(encoded_tokens["input_ids"])

In [35]:
rprint(bpe_tokenizer.decode(encoded_tokens["input_ids"]))

- GPT-2 actually uses *Byte-Level BPE*
    - It avoids large initial vocabularies (e.g. all of Unicode)
    - It uses bytes as the base vocabulary
    - It avoids *out-of-vocabulary (OOV)* tokens (aka `[UNK]`)
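
The idea behind a byte-level base vocabulary can be illustrated with the standard library alone: any string, including accented letters and emoji, becomes a sequence of values between 0 and 255. This sketch shows only that idea; GPT-2's tokenizer additionally maps bytes to printable characters before applying merges.

```python
text = "ni\u00f1o \U0001F4FA"  # "niño 📺"

byte_ids = list(text.encode("utf-8"))
print(len(text), "characters ->", len(byte_ids), "bytes")
print(byte_ids)

# Every symbol belongs to a base vocabulary of only 256 entries,
# so no input can be out of vocabulary at the byte level.
assert all(0 <= b < 256 for b in byte_ids)

# The mapping is lossless: decoding the bytes recovers the original text
print(bytes(byte_ids).decode("utf-8") == text)
```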

#### WordPiece

- Described in the paper: [Japanese and Korean voice search (Schuster et al., 2012)](https://static.googleusercontent.com/media/research.google.com/ja//pubs/archive/37842.pdf)
- Similar to BPE: it initializes the vocabulary with all the characters and learns merges
- Unlike BPE, it does not pick the most frequent pair; it picks the pair that maximizes the likelihood of the training data once it is added to the vocabulary

$$score(a_i,b_j) = \frac{f(a_i,b_j)}{f(a_i)f(b_j)}$$

- Here $f(\cdot)$ is a frequency count in the training data. In other words, the algorithm weighs the frequency of the pair against the frequency of its parts, making sure that a *merge* is worth doing

- Algorithm used in `BERT`
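
To see how this score differs from a raw frequency count, the sketch below computes both on a toy corpus. It is a simplified illustration: the corpus is made up, and the `##` continuation prefix that BERT's WordPiece attaches to non-initial pieces is omitted.

```python
from collections import Counter

def wordpiece_scores(words: list[str]) -> tuple[dict, Counter]:
    """Return the WordPiece-style score and the raw frequency of each adjacent pair."""
    vocab = Counter(tuple(word) for word in words)
    symbol_freq, pair_freq = Counter(), Counter()
    for symbols, freq in vocab.items():
        for symbol in symbols:
            symbol_freq[symbol] += freq
        for pair in zip(symbols, symbols[1:]):
            pair_freq[pair] += freq
    scores = {
        pair: freq / (symbol_freq[pair[0]] * symbol_freq[pair[1]])
        for pair, freq in pair_freq.items()
    }
    return scores, pair_freq

toy_corpus = ["hug"] * 10 + ["pug"] * 5 + ["pun"] * 12 + ["bun"] * 4 + ["hugs"] * 5
scores, pair_freq = wordpiece_scores(toy_corpus)

print("Most frequent pair (BPE criterion):", max(pair_freq, key=pair_freq.get))
print("Best scoring pair (WordPiece criterion):", max(scores, key=scores.get))
```

Here the most frequent pair is `('u', 'g')`, but the score prefers `('g', 's')` because `s` almost never appears without a preceding `g`, while `u` appears in many other contexts.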

In [36]:
%%HTML
<iframe width="960" height="500" src="https://www.youtube.com/embed/qpv6ms_t_1A"></iframe>

In [37]:
from transformers import BertTokenizer
SENTENCE = "🌽" + SENTENCE + "🔥"
wp_tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
rprint(wp_tokenizer.tokenize(SENTENCE))


<center><img src="https://us-tuna-sounds-images.voicemod.net/9cf541d2-dd7f-4c1c-ae37-8bc671c855fe-1665957161744.jpg"></center>

In [38]:
rprint(wp_tokenizer(SENTENCE))

#### Unigram

- Subword tokenization algorithm introduced in the paper: [Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates (Kudo, 2018)](https://arxiv.org/pdf/1804.10959.pdf)
- Unlike BPE or WordPiece, this algorithm starts with a very large vocabulary and shrinks it until it reaches the desired size
- At each iteration it computes the loss incurred by removing each element from the vocabulary
    - The `p%` of elements whose removal increases the loss the least are removed in that iteration
- The algorithm stops when the desired vocabulary size is reached
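
A single pruning step can be sketched as follows. The piece probabilities and word counts are made-up values, and the probabilities are kept fixed when a piece is removed; the actual algorithm re-estimates them (with EM) between pruning rounds, so this is only an illustration of the loss comparison.

```python
import math

def best_segmentation(word: str, logp: dict[str, float]) -> tuple[list[str], float]:
    """Viterbi search for the most probable split of `word` into known pieces."""
    best = [0.0] + [-math.inf] * len(word)   # best[i]: best log-prob of word[:i]
    back = [0] * (len(word) + 1)             # back[i]: start index of the last piece
    for end in range(1, len(word) + 1):
        for start in range(end):
            piece = word[start:end]
            if piece in logp and best[start] + logp[piece] > best[end]:
                best[end] = best[start] + logp[piece]
                back[end] = start
    pieces, end = [], len(word)
    while end > 0:
        pieces.append(word[back[end]:end])
        end = back[end]
    return pieces[::-1], best[-1]

def corpus_loss(word_freqs: dict[str, int], logp: dict[str, float]) -> float:
    """Negative log-likelihood of the corpus under the best segmentations."""
    return -sum(freq * best_segmentation(word, logp)[1] for word, freq in word_freqs.items())

# Made-up unigram probabilities and word counts (assumptions for this example)
probs = {"h": 0.05, "u": 0.1, "g": 0.05, "s": 0.1,
         "hu": 0.1, "ug": 0.2, "gs": 0.1, "hug": 0.3}
logp = {piece: math.log(p) for piece, p in probs.items()}
word_freqs = {"hug": 10, "hugs": 5}

base_loss = corpus_loss(word_freqs, logp)
loss_increase = {}
for piece in logp:
    if len(piece) == 1:
        continue  # single characters are kept so every word stays segmentable
    reduced = {p: lp for p, lp in logp.items() if p != piece}
    loss_increase[piece] = corpus_loss(word_freqs, reduced) - base_loss

print(best_segmentation("hugs", logp)[0])
print(sorted(loss_increase.items(), key=lambda item: item[1]))
```

Pieces whose removal leaves the loss unchanged are the first candidates for pruning, whereas removing `hug` makes the corpus noticeably less likely.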

However, *Unigram* is not used on its own in any Hugging Face model:
> "Unigram is not used directly for any of the models in the transformers, but it’s used in conjunction with SentencePiece." - Hugging face guy

#### SentencePiece


- It does not assume that words are separated by spaces
- It treats the input text as a *stream* of raw data, which includes the space as a regular character
- It uses BPE or Unigram to build the vocabulary
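
The treatment of whitespace can be mimicked without the library: the space is replaced by a visible marker (`▁`, U+2581) so that it becomes part of the pieces and the original text can be rebuilt exactly. The segmentation in `pieces` is a hypothetical example written by hand, not the output of a trained model.

```python
WS = "\u2581"  # "▁", the visible whitespace marker

def to_stream(text: str) -> str:
    """Prefix a marker and turn every space into the marker."""
    return WS + text.replace(" ", WS)

def from_pieces(pieces: list[str]) -> str:
    """Concatenate pieces, restore spaces and drop the leading marker."""
    return "".join(pieces).replace(WS, " ")[1:]

text = "Let's do tokenization!"
print(to_stream(text))

# Hypothetical segmentation of the stream into sub-word pieces
pieces = [WS + "Let", "'s", WS + "do", WS + "token", "ization", "!"]
print(from_pieces(pieces) == text)
```

Because spaces survive inside the pieces, detokenization needs no language-specific rules.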

In [39]:
# https://github.com/google/sentencepiece#installation
!pip install sentencepiece



In [40]:
from transformers import XLNetTokenizer

tokenizer = XLNetTokenizer.from_pretrained("xlnet-base-cased")
rprint(tokenizer.tokenize(SENTENCE))


#### Goal of subword tokenizers


- We want neural network models to see units that occur more frequently
- In principle, this helps them "learn" better
- Reduce the number of types
- Reduce the number of OOVs

### Let's tokenize 🌈
![](https://i.pinimg.com/736x/58/6b/88/586b8825f010ce0e3f9c831f568aafa8.jpg)

#### Spanish corpus: CESS

In [41]:
nltk.download("cess_esp")

[nltk_data] Downloading package cess_esp to /root/nltk_data...
[nltk_data]   Unzipping corpora/cess_esp.zip.


True

In [42]:
from nltk.corpus import cess_esp

cess_words = cess_esp.words()

In [43]:
" ".join(cess_words[:30])

'El grupo estatal Electricité_de_France -Fpa- EDF -Fpt- anunció hoy , jueves , la compra del 51_por_ciento de la empresa mexicana Electricidad_Águila_de_Altamira -Fpa- EAA -Fpt- , creada por el japonés Mitsubishi_Corporation'

In [44]:
cess_plain_text = " ".join(preprocess(cess_words))

In [45]:
rprint(f"'{cess_plain_text[300:600]}'")

In [46]:
cess_preprocessed_words = cess_plain_text.split()

In [49]:
with open(f"cess_plain.txt", "w") as f:
    f.write(cess_plain_text)

#### English corpus: Gutenberg

In [50]:
nltk.download('gutenberg')
nltk.download("punkt_tab")

[nltk_data] Downloading package gutenberg to /root/nltk_data...
[nltk_data]   Unzipping corpora/gutenberg.zip.
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


True

In [51]:
from nltk.corpus import gutenberg

gutenberg_words = gutenberg.words()[:200000]

In [52]:
rprint(" ".join(gutenberg_words[:30]))

In [53]:
gutenberg_plain_text = " ".join(preprocess(gutenberg_words))

rprint(gutenberg_plain_text[:100])

In [54]:
gutenberg_preprocessed_words = gutenberg_plain_text.split()

In [56]:
with open(f"gutenberg_plain.txt", "w") as f:
    f.write(gutenberg_plain_text)

#### Tokenizing Spanish with Hugging Face

In [57]:
from transformers import AutoTokenizer

spanish_tokenizer = AutoTokenizer.from_pretrained("dccuchile/bert-base-spanish-wwm-uncased")
rprint(spanish_tokenizer.tokenize(cess_plain_text[1000:1400]))


In [58]:
cess_types = Counter(cess_words)

In [59]:
rprint(cess_types.most_common(10))

In [60]:
cess_tokenized = spanish_tokenizer.tokenize(cess_plain_text)
rprint(cess_tokenized[:10])
cess_tokenized_types = Counter(cess_tokenized)

Token indices sequence length is longer than the specified maximum sequence length for this model (178312 > 512). Running this sequence through the model will result in indexing errors


In [61]:
rprint(cess_tokenized_types.most_common(30))

In [62]:
cess_lemmatized_types = Counter(lemmatize(cess_words, lang="es"))

In [63]:
rprint(cess_lemmatized_types.most_common(30))

In [64]:
rprint("CESS")
rprint(f"Types ([blue]word-base): {len(cess_types)}")
rprint(f"Types ([yellow]lemmatized): {len(cess_lemmatized_types)}")
rprint(f"Types ([green]sub-word): {len(cess_tokenized_types)}")

#### Tokenizing English

In [65]:
gutenberg_types = Counter(gutenberg_words)

In [66]:
gutenberg_tokenized = wp_tokenizer.tokenize(gutenberg_plain_text)
gutenberg_tokenized_types = Counter(gutenberg_tokenized)

In [67]:
rprint(gutenberg_tokenized_types.most_common(100))

In [68]:
gutenberg_lemmatized_types = Counter(lemmatize(gutenberg_preprocessed_words))

In [69]:
rprint(gutenberg_lemmatized_types.most_common(20))

In [70]:
rprint("Gutenberg")
rprint(f"Types ([blue]word-base): {len(gutenberg_types)}")
rprint(f"Types ([yellow]lemmatized): {len(gutenberg_lemmatized_types)}")
rprint(f"Types ([green]sub-word): {len(gutenberg_tokenized_types)}")

#### OOV: out of vocabulary

Words that appear in the test set but were not seen during training
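
Besides counting OOV types, it is often informative to measure the share of test tokens that are OOV. The helper below is a small sketch with a hypothetical name (`oov_stats`) and toy data.

```python
def oov_stats(train_tokens: list[str], test_tokens: list[str]) -> tuple[set[str], float]:
    """Return the OOV types and the fraction of test tokens that are OOV."""
    train_vocab = set(train_tokens)
    oov_types = set(test_tokens) - train_vocab
    oov_token_count = sum(1 for token in test_tokens if token not in train_vocab)
    return oov_types, oov_token_count / len(test_tokens)

train = "the cat sat on the mat".split()
test = "the dog sat on the rug".split()

oov_types, oov_rate = oov_stats(train, test)
print(sorted(oov_types), round(oov_rate, 3))
# ['dog', 'rug'] 0.333
```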

In [71]:
from sklearn.model_selection import train_test_split

train_data, test_data = train_test_split(gutenberg_words, test_size=0.3, random_state=42)
rprint(len(train_data), len(test_data))

In [72]:
s_1 = {"a", "b", "c", "d", "e"}
s_2 = {"a", "x", "y", "d"}
rprint(s_1 - s_2)
rprint(s_2 - s_1)

In [73]:
oov_test = set(test_data) - set(train_data)

In [74]:
for word in list(oov_test)[:3]:
    rprint(f"{word} in train: {word in set(train_data)}")

In [75]:
train_tokenized, test_tokenized = train_test_split(gutenberg_tokenized, test_size=0.3, random_state=42)
rprint(len(train_tokenized), len(test_tokenized))

In [76]:
oov_tokenized_test = set(test_tokenized) - set(train_tokenized)

In [77]:
rprint("OOV ([yellow]word-base):", len(oov_test))
rprint("OOV ([green]sub-word):", len(oov_tokenized_test))

## Training our own model with BPE
![](https://external-content.duckduckgo.com/iu/?u=https%3A%2F%2Fmedia1.tenor.com%2Fimages%2Fd565618bb1217a7c435579d9172270d0%2Ftenor.gif%3Fitemid%3D3379322&f=1&nofb=1&ipt=9719714edb643995ce9d978c8bab77f5310204960093070e37e183d5372096d9&ipo=images)

In [78]:
!pip install subword-nmt

Collecting subword-nmt
  Downloading subword_nmt-0.3.8-py3-none-any.whl.metadata (9.2 kB)
Collecting mock (from subword-nmt)
  Downloading mock-5.2.0-py3-none-any.whl.metadata (3.1 kB)
Downloading subword_nmt-0.3.8-py3-none-any.whl (27 kB)
Downloading mock-5.2.0-py3-none-any.whl (31 kB)
Installing collected packages: mock, subword-nmt
Successfully installed mock-5.2.0 subword-nmt-0.3.8


In [79]:
!ls {CORPORA_PATH}

ls: cannot access 'drive/MyDrive/corpora/tokenization': No such file or directory


In [80]:
!cat {CORPORA_PATH}/gutenberg_plain.txt

cat: drive/MyDrive/corpora/tokenization/gutenberg_plain.txt: No such file or directory


In [81]:
!subword-nmt learn-bpe -s 300 < \
 {CORPORA_PATH}/gutenberg_plain.txt > \
  {MODELS_PATH}/gutenberg.model

/bin/bash: line 1: drive/MyDrive/corpora/tokenization/gutenberg_plain.txt: No such file or directory


In [82]:
!echo "I need to process this sentence because tokenization can be useful" \
| subword-nmt apply-bpe -c {MODELS_PATH}/gutenberg.model

usage: subword-nmt apply-bpe [-h] [--input PATH] --codes PATH [--merges INT] [--output PATH]
                             [--separator STR] [--vocabulary PATH] [--vocabulary-threshold INT]
                             [--dropout P] [--glossaries STR [STR ...]] [--seed S]
                             [--num-workers NUM_WORKERS]
subword-nmt apply-bpe: error: argument --codes/-c: can't open 'drive/MyDrive/models/sub-word/gutenberg.model': [Errno 2] No such file or directory: 'drive/MyDrive/models/sub-word/gutenberg.model'


In [103]:
!subword-nmt learn-bpe -s 1500 < \
gutenberg_plain.txt > \
gutenberg_high.model

100% 1500/1500 [00:01<00:00, 844.80it/s] 


In [104]:
!echo "I need to process this sentence because tokenization can be useful" \
| subword-nmt apply-bpe -c gutenberg_high.model

I need to pro@@ c@@ ess this s@@ ent@@ ence because to@@ k@@ en@@ i@@ z@@ ation can be u@@ se@@ ful


### Applying it to other corpora: The Bible 📖🇻🇦

In [105]:
BIBLE_FILE_NAMES = {"spa": "spa-x-bible-reinavaleracontemporanea", "eng": "eng-x-bible-kingjames"}

In [106]:
import requests

def get_bible_corpus(lang: str) -> str:
    """Download bible file corpus from GitHub repo"""
    file_name = BIBLE_FILE_NAMES[lang]
    r = requests.get(f"https://raw.githubusercontent.com/ximenina/theturningpoint/main/Detailed/corpora/corpusPBC/{file_name}.txt.clean.txt")
    return r.text

def write_plain_text_corpus(raw_text: str, file_name: str) -> None:
    """Write file text on disk"""
    with open(f"{file_name}.txt", "w") as f:
        f.write(raw_text)

#### Bible in English

In [107]:
eng_bible_plain_text = get_bible_corpus("eng")
eng_bible_words = eng_bible_plain_text.lower().replace("\n", " ").split()

In [108]:
print(eng_bible_words[:10])

['the', 'beginning', 'of', 'the', 'gospel', 'of', 'jesus', 'christ', ',', 'the']


In [109]:
len(eng_bible_words)

30963

In [110]:
eng_bible_types = Counter(eng_bible_words)

In [111]:
rprint(eng_bible_types.most_common(30))

In [112]:
eng_bible_lemmas_types = Counter(lemmatize(eng_bible_words, lang="en"))

In [113]:
write_plain_text_corpus(eng_bible_plain_text, f"eng-bible")

In [114]:
!subword-nmt apply-bpe -c gutenberg_high.model < \
 eng-bible.txt > \
 eng-bible-tokenized.txt

In [115]:
with open(f"eng-bible-tokenized.txt", 'r') as f:
    tokenized_data = f.read()
eng_bible_tokenized = tokenized_data.split()

In [116]:
rprint(eng_bible_tokenized[:10])

In [117]:
len(eng_bible_tokenized)

46884

In [118]:
eng_bible_tokenized_types = Counter(eng_bible_tokenized)
len(eng_bible_tokenized_types)

1123

In [119]:
eng_bible_tokenized_types.most_common(30)

[(',', 2684),
 ('the', 1423),
 ('and', 1318),
 ('d', 1105),
 ('n@@', 977),
 ('.', 965),
 ('to', 955),
 ('A@@', 884),
 ('he', 725),
 ('of', 706),
 ('him', 617),
 ('un@@', 505),
 ('that', 478),
 ('they', 434),
 ('e@@', 414),
 ('them', 376),
 (':', 369),
 ('in', 350),
 ('o@@', 349),
 ('th', 337),
 ('a', 337),
 ('e', 304),
 ('said', 298),
 ('it', 294),
 ('t', 270),
 ('J@@', 266),
 ('a@@', 264),
 (';', 262),
 ('ed', 261),
 ('shall', 261)]

#### What happens if we apply the model learned on Gutenberg to other languages?

In [120]:
spa_bible_plain_text = get_bible_corpus('spa')
spa_bible_words = spa_bible_plain_text.replace("\n", " ").lower().split()

In [121]:
spa_bible_words[:10]

['principio',
 'del',
 'evangelio',
 'de',
 'jesucristo',
 ',',
 'el',
 'hijo',
 'de',
 'dios']

In [122]:
len(spa_bible_words)

30073

In [123]:
spa_bible_types = Counter(spa_bible_words)
len(spa_bible_types)

3317

In [124]:
spa_bible_types.most_common(30)

[(',', 1946),
 ('y', 1169),
 ('.', 1099),
 ('de', 1009),
 ('que', 927),
 ('a', 858),
 ('los', 645),
 ('la', 599),
 ('el', 572),
 (':', 539),
 ('se', 489),
 ('en', 461),
 ('«', 423),
 ('»', 423),
 ('jesús', 422),
 ('lo', 367),
 ('no', 312),
 ('le', 293),
 ('les', 267),
 ('dijo', 252),
 ('con', 220),
 ('pero', 217),
 ('al', 214),
 ('¿', 196),
 ('?', 195),
 ('por', 194),
 ('para', 172),
 ('su', 171),
 ('del', 165),
 ('un', 159)]

In [125]:
spa_bible_lemmas_types = Counter(lemmatize(spa_bible_words, lang="es"))
len(spa_bible_lemmas_types)

2136

In [127]:
write_plain_text_corpus(spa_bible_plain_text, f"spa-bible")

In [128]:
!subword-nmt apply-bpe -c gutenberg_high.model < \
 spa-bible.txt > \
 spa-bible-tokenized.txt

In [129]:
with open(f"spa-bible-tokenized.txt", "r") as f:
    tokenized_text = f.read()
spa_bible_tokenized = tokenized_text.split()

In [130]:
spa_bible_tokenized[:10]

['P@@', 'r@@', 'in@@', 'ci@@', 'pi@@', 'o', 'de@@', 'l', 'ev@@', 'an@@']

In [131]:
len(spa_bible_tokenized)

71838

In [132]:
spa_bible_tokenized_types = Counter(spa_bible_tokenized)
len(spa_bible_tokenized_types)

507

In [133]:
spa_bible_tokenized_types.most_common(40)

[('a', 3780),
 ('s', 2653),
 ('o', 2390),
 (',', 1946),
 ('e', 1660),
 ('l@@', 1649),
 ('qu@@', 1269),
 ('es@@', 1193),
 ('y', 1143),
 ('.', 1099),
 ('de', 1095),
 ('d@@', 991),
 ('t@@', 959),
 ('i@@', 911),
 ('o@@', 881),
 ('er@@', 861),
 ('lo@@', 853),
 ('s@@', 736),
 ('an@@', 724),
 ('n', 718),
 ('u@@', 716),
 ('í@@', 703),
 ('do', 697),
 ('di@@', 697),
 ('m@@', 691),
 ('c@@', 666),
 ('e@@', 661),
 ('as', 660),
 ('r@@', 643),
 ('ó', 628),
 ('on', 625),
 ('en', 620),
 ('j@@', 595),
 ('se', 581),
 ('b@@', 580),
 ('an', 577),
 ('en@@', 577),
 ('ar@@', 574),
 ('es', 564),
 ('el', 551)]

### Type-token Ratio (TTR)

- A way to measure vocabulary variation in a corpus
- It is computed as $TTR = \frac{len(types)}{len(tokens)}$
- It can be useful for monitoring the lexical variation of a text
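
The formula translates directly into a small helper. `type_token_ratio` is a hypothetical name, and the empty-input convention (returning `0.0`) is an assumption made for this sketch.

```python
def type_token_ratio(tokens: list[str]) -> float:
    """TTR = number of distinct types / number of tokens."""
    if not tokens:
        return 0.0  # convention chosen here to avoid dividing by zero
    return len(set(tokens)) / len(tokens)

tokens = "a rose is a rose is a rose".split()
print(type_token_ratio(tokens))
# 0.375  (3 types / 8 tokens)
```

A lower TTR means that the same units are repeated more often, which is what we expect after sub-word tokenization.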

In [134]:
rprint("Bible English Information")
rprint("Tokens:", len(eng_bible_words))
rprint("Types ([blue]word-base):", len(eng_bible_types))
rprint("Types ([yellow]lemmatized)", len(eng_bible_lemmas_types))
rprint("Types ([green]BPE):", len(eng_bible_tokenized_types))
rprint("TTR ([blue]word-base):", len(eng_bible_types)/len(eng_bible_words))
rprint("TTR ([green]BPE):", len(eng_bible_tokenized_types)/len(eng_bible_tokenized))

In [135]:
rprint("Bible Spanish Information")
rprint("Tokens:", len(spa_bible_words))
rprint("Types ([blue]word-base):", len(spa_bible_types))
rprint("Types ([yellow]lemmatized)", len(spa_bible_lemmas_types))
rprint("Types ([green]BPE):", len(spa_bible_tokenized_types))
rprint("TTR ([blue]word-base):", len(spa_bible_types)/len(spa_bible_words))
rprint("TTR ([green]BPE):", len(spa_bible_tokenized_types)/len(spa_bible_tokenized))

## Neural Language Models (Bengio)

- [(Bengio et al., 2003)](https://dl.acm.org/doi/10.5555/944919.944966) propose a neural architecture as an alternative to statistical language models
- This architecture copes better with cases where probabilities become zero, without needing a smoothing technique.
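
The zero-probability problem is easy to reproduce with a count-based trigram model estimated by maximum likelihood. The sketch below uses a made-up one-sentence corpus; any trigram that was never observed gets probability exactly zero unless smoothing is added.

```python
from collections import Counter

train = "the cat sat on the mat".split()

trigram_counts = Counter(zip(train, train[1:], train[2:]))
bigram_counts = Counter(zip(train, train[1:]))

def mle_trigram_prob(w1: str, w2: str, w3: str) -> float:
    """P(w3 | w1, w2) estimated by relative frequency, without smoothing."""
    context_count = bigram_counts[(w1, w2)]
    if context_count == 0:
        return 0.0
    return trigram_counts[(w1, w2, w3)] / context_count

print(mle_trigram_prob("cat", "sat", "on"))     # seen trigram
print(mle_trigram_prob("the", "cat", "slept"))  # unseen continuation -> 0.0
```

A neural model always outputs a softmax over the whole vocabulary, so every continuation receives some non-zero probability.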

<p float="left">
  <img src="https://toppng.com/public/uploads/preview/at-the-movies-will-smith-meme-tada-11562851401lnexjqtwf9.png" width="100" />
  <img src="https://abhinavcreed13.github.io/assets/images/bengio-model.png" width="600"/>
</p>

In [136]:
nltk.download('reuters')
nltk.download('punkt_tab')

from nltk.corpus import reuters
from nltk import ngrams

[nltk_data] Downloading package reuters to /root/nltk_data...
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


In [137]:
def preprocess_corpus(corpus: list[str]) -> list[str]:
    """Preprocessing function

    This function is designed to preprocess a
    corpus for neural language models.
    It adds begin and end tokens and normalizes
    words to lowercase
    """
    preprocessed_corpus = []
    for sent in corpus:
        result = [word.lower() for word in sent]
        # At the end of the sentence
        result.append("<EOS>")
        # At the beginning of the sentence
        result.insert(0, "<BOS>")
        preprocessed_corpus.append(result)
    return preprocessed_corpus

In [138]:
def get_words_freqs(corpus: list[list[str]]):
    """Compute the frequency of each word in a corpus"""
    words_freqs = {}
    for sentence in corpus:
        for word in sentence:
            words_freqs[word] = words_freqs.get(word, 0) + 1
    return words_freqs

In [139]:
UNK_LABEL = "<UNK>"
def get_words_indexes(words_freqs: dict) -> tuple[dict, dict]:
    """Compute word indexes from the word frequencies

    Returns two mappings: word -> index and index -> word
    """
    result = {}
    for idx, word in enumerate(words_freqs.keys()):
        # Hapax legomena (words seen only once) are mapped to the unknown label
        if words_freqs[word] == 1:
            # Temporary index for unknowns
            result[UNK_LABEL] = len(words_freqs)
        else:
            result[word] = idx

    # Re-enumerate the kept words so that the final indexes are contiguous
    return {word: idx for idx, word in enumerate(result.keys())}, {idx: word for idx, word in enumerate(result.keys())}

In [140]:
corpus = preprocess_corpus(reuters.sents())

In [141]:
len(corpus)

54716

In [142]:
words_freqs = get_words_freqs(corpus)

In [143]:
words_freqs["the"]

69277

In [144]:
len(words_freqs)

31079

In [145]:
count = 0
for word, freq in words_freqs.items():
    if freq == 1 and count <= 10:
        print(word, freq)
        count += 1

inflict 1
sheen 1
avowed 1
kilolitres 1
janunary 1
pineapples 1
hasrul 1
paian 1
sawn 1
goodall 1
bundey 1


In [146]:
words_indexes, index_to_word = get_words_indexes(words_freqs)

In [147]:
words_indexes["the"]

16

In [148]:
index_to_word[16]

'the'

In [149]:
len(words_indexes)

20056

In [150]:
len(index_to_word)

20056

In [151]:
def get_word_id(words_indexes: dict, word: str) -> int:
    """Get the id of a given word

    If the word is not found, the id of the
    UNK token is returned
    """
    unk_word_id = words_indexes[UNK_LABEL]
    return words_indexes.get(word, unk_word_id)

### Extracting trigrams

We convert the extracted trigrams into sequences of indexes and prepare the training set $x$ and $y$

- x: Context (the first two words of the trigram)
- y: Prediction of the next word (the third word)
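
For a single sentence, the split into contexts and targets looks as follows. This sketch works on the words themselves rather than on indexes, and `context_target_pairs` is a hypothetical helper name.

```python
def context_target_pairs(sentence: list[str], n: int = 3) -> list[tuple[list[str], str]]:
    """Slide a window of size n: the first n-1 words are the context, the last is the target."""
    return [
        (sentence[i:i + n - 1], sentence[i + n - 1])
        for i in range(len(sentence) - n + 1)
    ]

sentence = ["<BOS>", "oil", "prices", "rose", "<EOS>"]
for context, target in context_target_pairs(sentence):
    print(context, "->", target)
```

A sentence with `k` tokens (including `<BOS>` and `<EOS>`) yields `k - 2` training examples for trigrams.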

In [152]:
def get_train_test_data(corpus: list[list[str]], words_indexes: dict, n: int) -> tuple[list, list]:
    """Build the training inputs (contexts) and targets

    Required in the training step of the neural model
    """
    x_train = []
    y_train = []
    for sent in corpus:
        # The first n-1 words of each n-gram are the context, the last one is the target
        for *context, target in ngrams(sent, n):
            x_train.append([get_word_id(words_indexes, word) for word in context])
            y_train.append([get_word_id(words_indexes, target)])
    return x_train, y_train

### Preparing PyTorch

$x' = e(x_1) \oplus e(x_2)$

$h = \tanh(W_1 x' + b)$

$y = softmax(W_2 h)$
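
Before moving to PyTorch, the three equations can be traced by hand with plain Python lists. The sizes and random weights below are toy assumptions chosen for readability, and helper names such as `matvec` are hypothetical; the point is only to show the shapes involved.

```python
import math
import random

random.seed(0)
V, D, H = 5, 3, 4  # toy vocabulary size, embedding dimension and hidden units

def rand_matrix(rows: int, cols: int) -> list[list[float]]:
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

def matvec(matrix: list[list[float]], vector: list[float]) -> list[float]:
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

E = rand_matrix(V, D)        # embedding table: one row per word id
W1 = rand_matrix(H, 2 * D)   # hidden layer weights
b = [0.0] * H                # hidden layer bias
W2 = rand_matrix(V, H)       # output layer weights

def forward(x1: int, x2: int) -> list[float]:
    x_prime = E[x1] + E[x2]  # x' = e(x1) ⊕ e(x2), list concatenation
    h = [math.tanh(z + bias) for z, bias in zip(matvec(W1, x_prime), b)]
    scores = matvec(W2, h)
    # Numerically stable softmax over the vocabulary
    max_score = max(scores)
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = forward(0, 3)
print(len(probs), round(sum(probs), 6))
```

The output has one probability per vocabulary entry and sums to 1, so no next word is ever assigned probability zero.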

In [153]:
# load libraries
import torch
from torch import nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import DataLoader
import time

In [154]:
# Hyperparameter setup
EMBEDDING_DIM = 200
CONTEXT_SIZE = 2
BATCH_SIZE = 256
H = 100
torch.manual_seed(42)
# Vocabulary size
V = len(words_indexes)

In [155]:
x_train, y_train = get_train_test_data(corpus, words_indexes, n=3)

In [156]:
import numpy as np

train_set = np.concatenate((x_train, y_train), axis=1)
# split the input data into batches
train_loader = DataLoader(train_set, batch_size = BATCH_SIZE)

### Creamos la arquitectura del modelo

In [157]:
# Trigram Neural Network Model
class TrigramModel(nn.Module):
    """Clase padre: https://pytorch.org/docs/stable/generated/torch.nn.Module.html"""

    def __init__(self, vocab_size, embedding_dim, context_size, h):
        super(TrigramModel, self).__init__()
        self.context_size = context_size
        self.embedding_dim = embedding_dim
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.linear1 = nn.Linear(context_size * embedding_dim, h)
        self.linear2 = nn.Linear(h, vocab_size)

    def forward(self, inputs):
        # x': concatenation of the x1 and x2 embeddings.
        # self.embeddings returns one vector per input index;
        # view() reshapes them so each row holds the concatenated context.
        embeds = self.embeddings(inputs).view((-1, self.context_size * self.embedding_dim))
        # h = tanh(W_1 x' + b)
        out = torch.tanh(self.linear1(embeds))
        # W_2 h
        out = self.linear2(out)
        # log_softmax(W_2 h)
        # dim=1 normalizes each row, since a batch produces one output vector per example
        log_probs = F.log_softmax(out, dim=1)

        return log_probs
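
The `view()` call in `forward` is what implements the concatenation $e(x_1) \oplus e(x_2)$. A minimal sketch with made-up sizes (not the notebook's hyperparameters) shows that reshaping the `(batch, context, dim)` embedding tensor gives the same result as explicitly concatenating the two embeddings:

```python
import torch
from torch import nn

torch.manual_seed(0)
# Hypothetical toy sizes: 6 words in the vocabulary, embeddings of dimension 4
toy_embeddings = nn.Embedding(6, 4)
toy_inputs = torch.tensor([[0, 3], [3, 4], [4, 5]])   # batch of 3 contexts, 2 word ids each

looked_up = toy_embeddings(toy_inputs)                # shape: (3, 2, 4)
flattened = looked_up.view((-1, 2 * 4))               # shape: (3, 8)

# Explicit concatenation of the first and second word embeddings
concatenated = torch.cat([looked_up[:, 0, :], looked_up[:, 1, :]], dim=1)

print(looked_up.shape, flattened.shape)
print(torch.equal(flattened, concatenated))
```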

In [158]:
# Select the GPU (or another accelerator) if one is available
device = torch.accelerator.current_accelerator().type if torch.accelerator.is_available() else "cpu"

In [160]:
# Alternative: device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Training on device {device}")

# 1. Loss: negative log-likelihood
loss_function = nn.NLLLoss()

# 2. Instantiate the model and move it to the device
model = TrigramModel(V, EMBEDDING_DIM, CONTEXT_SIZE, H).to(device)

# 3. Optimization: Adam optimizer
optimizer = optim.Adam(model.parameters(), lr=2e-3)

# ------------------------- TRAIN & SAVE MODEL ------------------------
EPOCHS = 3
for epoch in range(EPOCHS):
    st = time.time()
    print("\n--- Training model Epoch: {} ---".format(epoch))
    for it, data_tensor in enumerate(train_loader):
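        # Move the batch to the device; the first CONTEXT_SIZE columns are the
        # context and the last column is the target
        context_tensor = data_tensor[:, :CONTEXT_SIZE].to(device)
        target_tensor = data_tensor[:, CONTEXT_SIZE].to(device)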

        model.zero_grad()

        # FORWARD:
        log_probs = model(context_tensor)

        # compute loss function
        loss = loss_function(log_probs, target_tensor)

        # BACKWARD:
        loss.backward()
        optimizer.step()

        if it % 500 == 0:
            print("Training Iteration {} of epoch {} complete. Loss: {}; Time taken (s): {}".format(it, epoch, loss.item(), (time.time()-st)))
            st = time.time()

    # saving model
    model_path = f'model_{device}_context_{CONTEXT_SIZE}_epoch_{epoch}.dat'
    torch.save(model.state_dict(), model_path)
    print(f"Model saved for epoch={epoch} at {model_path}")


Training on device cuda

--- Training model Epoch: 0 ---
Training Iteration 0 of epoch 0 complete. Loss: 9.937440872192383; Time taken (s): 0.007857799530029297
Training Iteration 500 of epoch 0 complete. Loss: 5.794308662414551; Time taken (s): 2.002340793609619
Training Iteration 1000 of epoch 0 complete. Loss: 5.352446556091309; Time taken (s): 1.9230108261108398
Training Iteration 1500 of epoch 0 complete. Loss: 3.973957061767578; Time taken (s): 1.9020988941192627
Training Iteration 2000 of epoch 0 complete. Loss: 5.080819129943848; Time taken (s): 1.9026436805725098
Training Iteration 2500 of epoch 0 complete. Loss: 4.668554306030273; Time taken (s): 1.9012126922607422
Training Iteration 3000 of epoch 0 complete. Loss: 4.139153480529785; Time taken (s): 1.9029021263122559
Training Iteration 3500 of epoch 0 complete. Loss: 4.510522365570068; Time taken (s): 1.9480547904968262
Training Iteration 4000 of epoch 0 complete. Loss: 5.559445858001709; Time taken (s): 1.9370195865631104
...
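
One way to read these numbers: a model that assigns uniform probability over the vocabulary has a negative log-likelihood of $\ln V$, and $\exp(\text{loss})$ can be interpreted as a perplexity. The sketch below uses the vocabulary size that appears in the printed model (20056) and loss values copied from the log above. Treating a single batch loss as a perplexity estimate is a rough assumption, since it is noisy and measured on training data.

```python
import math

vocab_size = 20056  # vocabulary size, as shown in the Embedding layer of the printed model

# Loss of a model that predicts every word with probability 1 / vocab_size
uniform_loss = math.log(vocab_size)
print(f"uniform baseline loss: {uniform_loss:.3f}")

# Loss values copied from the training log (iterations 0 and 4000 of epoch 0)
for loss in (9.937440872192383, 5.559445858001709):
    print(f"loss={loss:.3f} -> perplexity ~ {math.exp(loss):.0f}")
```

The first logged loss (about 9.94) is close to the uniform baseline, which is what one would expect from randomly initialized weights.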

In [161]:
model

TrigramModel(
  (embeddings): Embedding(20056, 200)
  (linear1): Linear(in_features=400, out_features=100, bias=True)
  (linear2): Linear(in_features=100, out_features=20056, bias=True)
)
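
The printed layer shapes are enough to count the trainable parameters by hand. This is simple arithmetic on the sizes shown above (each `Linear` layer has a weight matrix plus a bias vector); the variable names are chosen here for illustration.

```python
vocab_size, embedding_dim, context_size, hidden = 20056, 200, 2, 100

embedding_params = vocab_size * embedding_dim                       # lookup table
linear1_params = (context_size * embedding_dim) * hidden + hidden   # W_1 and b
linear2_params = hidden * vocab_size + vocab_size                   # W_2 and its bias

total_params = embedding_params + linear1_params + linear2_params
print(embedding_params, linear1_params, linear2_params, total_params)
```

Most of the parameters sit in the embedding table and the output layer, both of which scale with the vocabulary size. With the model in memory, `sum(p.numel() for p in model.parameters())` should report the same total.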

In [162]:
def get_model(path: str) -> TrigramModel:
    """Load a PyTorch model from disk."""
    model_loaded = TrigramModel(V, EMBEDDING_DIM, CONTEXT_SIZE, H)
    # map_location lets a checkpoint saved on GPU load on a CPU-only machine
    model_loaded.load_state_dict(torch.load(path, map_location=device))
    model_loaded.to(device)
    model_loaded.eval()
    return model_loaded

In [163]:
PATH = "model_cuda_4.dat"

In [164]:
#model = get_model(PATH)
W1 = "<BOS>"
W2 = "my"

IDX1 = get_word_id(words_indexes, W1)
IDX2 = get_word_id(words_indexes, W2)

# Get the log-probabilities p(W3 | W1, W2)
probs = model(torch.tensor([[IDX1,  IDX2]]).to(device)).detach().tolist()

In [165]:
len(probs[0])

20056

In [166]:
# Build a dictionary {idx: logprob}
model_probs = {}
for idx, p in enumerate(probs[0]):
  model_probs[idx] = p

# Sort:
model_probs_sorted = sorted(((prob, idx) for idx, prob in model_probs.items()), reverse=True)

# Printing word  and prob (retrieving the idx):
topcandidates = 0
for prob, idx in model_probs_sorted:
  #Retrieve the word associated with that idx
  word = index_to_word[idx]
  print(idx, word, prob)

  topcandidates += 1

  if topcandidates > 10:
    break

273 sources -2.5411229133605957
2163 bell -3.311959743499756
251 view -4.0464911460876465
808 gold -4.103001117706299
995 banks -4.197214603424072
4154 forecasts -4.353384971618652
3828 comments -4.390305519104004
31 <UNK> -4.401027202606201
4172 objective -4.532485485076904
27 nations -4.548795700073242
1637 own -4.578312397003174


In [167]:
print(index_to_word.get(model_probs_sorted[0][1]))

sources
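
The dictionary-and-sort approach above works, but the top candidates can also be read directly from the output tensor with `torch.topk`. The sketch below uses a made-up vector of scores instead of the trained model so that it runs on its own; with the real model one would pass its log-probability output instead.

```python
import torch
import torch.nn.functional as F

# Hypothetical scores over a 6-word vocabulary (stand-in for one row of model output)
toy_index_to_word = {0: "<BOS>", 1: "<EOS>", 2: "<UNK>", 3: "the", 4: "cat", 5: "sleeps"}
toy_log_probs = F.log_softmax(torch.tensor([[0.1, 0.5, -1.0, 2.0, 1.5, 0.3]]), dim=1)

# topk returns the k largest values and their indices, sorted in descending order
values, indices = torch.topk(toy_log_probs[0], k=3)
top_words = [(toy_index_to_word[i], v) for i, v in zip(indices.tolist(), values.tolist())]
print(top_words)
```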


### Language generation

In [168]:
def get_likely_words(model: TrigramModel, context: str, words_indexes: dict, index_to_word: dict, top_count: int=10) -> list[tuple]:
    """Return the top_count (log_prob, word) candidates for the word following a two-word context."""
    model_probs = {}
    words = context.split()
    idx_word_1 = get_word_id(words_indexes, words[0])
    idx_word_2 = get_word_id(words_indexes, words[1])
    probs = model(torch.tensor([[idx_word_1, idx_word_2]]).to(device)).detach().tolist()

    for idx, p in enumerate(probs[0]):
        model_probs[idx] = p

    # Strategy: Sort and get top-K words to generate text
    return sorted(((prob, index_to_word[idx]) for idx, prob in model_probs.items()), reverse=True)[:top_count]

In [169]:
sentence = "this is"
get_likely_words(model, sentence, words_indexes, index_to_word, 3)

[(-1.7673019170761108, 'a'),
 (-2.173889636993408, 'the'),
 (-3.368946075439453, 'being')]

In [170]:
from random import randint

def get_next_word(words: list[tuple[float, str]]) -> str:
    # From a top-K list of words get a random word
    return words[randint(0, len(words)-1)][1]

In [171]:
get_next_word(get_likely_words(model, sentence, words_indexes, index_to_word))

'that'
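
`get_next_word` picks uniformly among the top-K candidates, so the tenth most likely word is chosen as often as the first. A possible variation, sketched here as an assumption rather than as part of the original notebook, is to sample in proportion to the model's probabilities. The function name `sample_next_word` is illustrative; the candidate list has the same `(log_prob, word)` shape that `get_likely_words` returns.

```python
import math
import random


def sample_next_word(words: list[tuple[float, str]], rng: random.Random) -> str:
    """Sample a word with probability proportional to exp(log_prob)."""
    weights = [math.exp(log_prob) for log_prob, _ in words]
    candidates = [word for _, word in words]
    # random.choices renormalizes the weights, so they need not sum to 1
    return rng.choices(candidates, weights=weights, k=1)[0]


# Candidates copied from the "this is" example above
toy_candidates = [(-1.7673019170761108, "a"), (-2.173889636993408, "the"), (-3.368946075439453, "being")]
rng = random.Random(0)
samples = [sample_next_word(toy_candidates, rng) for _ in range(1000)]
print({word: samples.count(word) for _, word in toy_candidates})
```

Weighted sampling keeps some variety in the generated text while favoring the words the model considers more likely.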

In [172]:
MAX_TOKENS = 50
TOP_COUNT = 10
def generate_text(model: TrigramModel, history: str, words_indexes: dict, index_to_word: dict, tokens_count: int=0) -> None:
    next_word = get_next_word(get_likely_words(model, history, words_indexes, index_to_word, top_count=TOP_COUNT))
    print(next_word, end=" ")
    tokens_count += 1
    if tokens_count == MAX_TOKENS or next_word == "<EOS>":
        return
    generate_text(model, history.split()[1]+ " " + next_word, words_indexes, index_to_word, tokens_count)

In [173]:
sentence = "mexico is"
print(sentence, end=" ")
generate_text(model, sentence, words_indexes, index_to_word)

mexico is a moderate " in the 1986 net includes extraordinary items to <UNK> out as long . 0 / 7 mln revs 9 at 1 . 7 to 20 cents , in the second . 1 , 448 to buy interstate bank intervention in 1987 at 7 mln stg in the 
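
`generate_text` recurses once per token, which is fine for `MAX_TOKENS = 50` but could hit Python's recursion limit for much longer generations. A loop-based variant could look like the sketch below. To keep it self-contained, the model is replaced by a hypothetical `next_word_fn` callable that maps a two-word context to the next word; in the notebook that role would be played by `get_next_word(get_likely_words(...))`.

```python
from typing import Callable


def generate_tokens(next_word_fn: Callable[[str, str], str], history: str, max_tokens: int = 50) -> list[str]:
    """Generate up to max_tokens words, stopping early at <EOS>."""
    w1, w2 = history.split()[-2:]          # only the last two words form the context
    generated = []
    for _ in range(max_tokens):
        next_word = next_word_fn(w1, w2)
        generated.append(next_word)
        if next_word == "<EOS>":
            break
        w1, w2 = w2, next_word             # slide the context window by one word
    return generated


# Hypothetical deterministic "model": a lookup table over two-word contexts
toy_table = {("mexico", "is"): "a", ("is", "a"): "country", ("a", "country"): "<EOS>"}
toy_output = generate_tokens(lambda a, b: toy_table.get((a, b), "<EOS>"), "mexico is")
print(" ".join(toy_output))
```

Returning the tokens as a list instead of printing them also makes the output easier to post-process, for example to join subword pieces back into words.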

# Assignment 4: Neural Language Models

**Due date: April 6, 2025, 11:59 pm**

Starting from the trained model:

- Extract the embeddings of the vocabulary words

- Visualize the embeddings of some words in 2D (perhaps the most frequent ones, excluding stopwords)

- Select some words and check whether their embeddings really encode semantic notions, e.g., semantic similarity measured as the cosine similarity between two vectors, or analogies computed through vector operations

### Extra (0.5 pts):

- Run Bengio's model after applying a subword tokenization technique to the corpus, and use it for language generation

  - The generated text must consist of sequences of words (not subwords)

## References

- [Language models - Lena Voita](https://lena-voita.github.io/nlp_course/language_modeling.html#generation_strategies)
- [A Neural Probabilistic Language Model - Bengio](https://www.jmlr.org/papers/volume3/bengio03a/bengio03a.pdf)
- Part of the code in this assignment was adapted from the work of Dr. Ximena Gutierrez Vasques