# 6. Language models

## Objectives

- Build language models from an English corpus
    - Bigram model
    - Trigram model

> A language model is a statistical model that assigns probabilities to strings within a language - Jurafsky, 2000

$$ \mu = (\Sigma, A, \Pi)$$

Where:
- $\mu$ is the language model
- $\Sigma$ is the vocabulary
- $A$ is the tensor that stores the probabilities
- $\Pi$ stores the initial probabilities

- The model estimates the probability of a sequence of tokens
- The units can be words, characters, or other tokens
- Several scenarios can be considered when building these models
- If we can estimate the probability of a linguistic unit (words, tokens, sentences, etc.), we can use it in unexpected ways
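
The probability of a sequence mentioned above is usually decomposed with the chain rule. As a sketch, writing the tokens as $w_1, \dots, w_m$ (a notation assumed here, not used elsewhere in these notes):

$$ P(w_1, \dots, w_m) = \prod_{i=1}^{m} P(w_i \mid w_1, \dots, w_{i-1}) $$

Each factor is the probability of the next token given everything before it, which is the quantity an n-gram model approximates with a shorter history.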

## I saw a cat in a mat

<img src="https://lena-voita.github.io/resources/lectures/lang_models/general/i_saw_a_cat_prob.gif">

## Applications

- Machine translation
- Text completion
- Text generation

![](https://lena-voita.github.io/resources/lectures/lang_models/examples/suggest-min.png)
Taken from [Lena Voita](https://lena-voita.github.io/nlp_course/language_modeling.html)

## From bigrams to n-grams

- For bigrams we have the Markov property: each word depends only on the previous one
- For $n > 2$ each word depends on more elements
    - Trigrams
    - 4-grams
- In general, an n-gram model takes into account the $n-1$ previous elements
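
To make the "$n-1$ previous elements" idea concrete, here is a minimal pure-Python sketch that splits a token list into (context, next word) pairs. The helper name `split_ngrams` and the toy sentence are assumptions made for illustration; they are not part of NLTK or of the code used later in this notebook.

```python
def split_ngrams(tokens, n):
    """Yield (context, word) pairs; the context holds the n-1 previous tokens."""
    for i in range(len(tokens) - n + 1):
        gram = tuple(tokens[i:i + n])
        yield gram[:-1], gram[-1]

# Toy sentence, already padded with start and end markers
tokens = ["<BOS>", "i", "saw", "a", "cat", "<EOS>"]

bigram_pairs = list(split_ngrams(tokens, 2))   # context of length 1
trigram_pairs = list(split_ngrams(tokens, 3))  # context of length 2

print(bigram_pairs[0])
print(trigram_pairs[0])
```

With `n = 2` the context is a single word, and with `n = 3` it is the pair of words before the one being predicted.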

## Programming our language models

We will use an English [corpus](https://www.nltk.org/book/ch02.html) available in NLTK

In [108]:
import nltk
nltk.download('reuters')
nltk.download('punkt')

[nltk_data] Downloading package reuters to /root/nltk_data...
[nltk_data]   Package reuters is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [109]:
# Numerical computing
import numpy as np
# Corpus
from nltk.corpus import reuters
# To create n-grams
from nltk import ngrams
# Utilities for handling the probabilities
from collections import Counter, defaultdict

In [111]:
len(reuters.sents())

54716

In [113]:
def preprocess(sent: list[str]) -> list[str]:
    """Preprocessing function

    Lowercases every token and adds the start (<BOS>) and end (<EOS>) tokens
    """
    result = [word.lower() for word in sent]
    # End-of-sentence marker
    result.append("<EOS>")
    # Beginning-of-sentence marker
    result.insert(0, "<BOS>")
    return result

In [114]:
print(reuters.sents()[11])
preprocess(reuters.sents()[11])

['The', 'surplus', 'helped', 'swell', 'Taiwan', "'", 's', 'foreign', 'exchange', 'reserves', 'to', '53', 'billion', 'dlrs', ',', 'among', 'the', 'world', "'", 's', 'largest', '.']


['<BOS>',
 'the',
 'surplus',
 'helped',
 'swell',
 'taiwan',
 "'",
 's',
 'foreign',
 'exchange',
 'reserves',
 'to',
 '53',
 'billion',
 'dlrs',
 ',',
 'among',
 'the',
 'world',
 "'",
 's',
 'largest',
 '.',
 '<EOS>']

In [117]:
list(ngrams(reuters.sents()[0], 3))

[('ASIAN', 'EXPORTERS', 'FEAR'),
 ('EXPORTERS', 'FEAR', 'DAMAGE'),
 ('FEAR', 'DAMAGE', 'FROM'),
 ('DAMAGE', 'FROM', 'U'),
 ('FROM', 'U', '.'),
 ('U', '.', 'S'),
 ('.', 'S', '.-'),
 ('S', '.-', 'JAPAN'),
 ('.-', 'JAPAN', 'RIFT'),
 ('JAPAN', 'RIFT', 'Mounting'),
 ('RIFT', 'Mounting', 'trade'),
 ('Mounting', 'trade', 'friction'),
 ('trade', 'friction', 'between'),
 ('friction', 'between', 'the'),
 ('between', 'the', 'U'),
 ('the', 'U', '.'),
 ('U', '.', 'S'),
 ('.', 'S', '.'),
 ('S', '.', 'And'),
 ('.', 'And', 'Japan'),
 ('And', 'Japan', 'has'),
 ('Japan', 'has', 'raised'),
 ('has', 'raised', 'fears'),
 ('raised', 'fears', 'among'),
 ('fears', 'among', 'many'),
 ('among', 'many', 'of'),
 ('many', 'of', 'Asia'),
 ('of', 'Asia', "'"),
 ('Asia', "'", 's'),
 ("'", 's', 'exporting'),
 ('s', 'exporting', 'nations'),
 ('exporting', 'nations', 'that'),
 ('nations', 'that', 'the'),
 ('that', 'the', 'row'),
 ('the', 'row', 'could'),
 ('row', 'could', 'inflict'),
 ('could', 'inflict', 'far'),
 ('inf

### Building the trigram model

In [166]:
trigram_model = defaultdict(lambda: defaultdict(lambda: 0))

In [167]:
N = 3
for sentence in reuters.sents():
    # Get the normalized n-grams
    n_grams = ngrams(preprocess(sentence), N)
    # Store the trigram counts in our dictionary
    for w1, w2, w3 in n_grams:
        trigram_model[(w1, w2)][w3] += 1

In [120]:
trigram_model["<BOS>", "the"]

defaultdict(<function __main__.<lambda>.<locals>.<lambda>()>,
            {'u': 266,
             'surplus': 16,
             'australian': 10,
             'measures': 14,
             'paper': 58,
             'decision': 32,
             'country': 40,
             'department': 210,
             'ban': 8,
             'pay': 4,
             'industrial': 10,
             'shipping': 6,
             'fledgling': 4,
             'trade': 80,
             'mine': 16,
             'analysts': 38,
             'finance': 28,
             'sixth': 2,
             'bundesbank': 64,
             'company': 2310,
             'new': 264,
             'prospective': 2,
             'partners': 10,
             'edmonton': 4,
             'shares': 38,
             'hong': 8,
             'talks': 30,
             'stronger': 2,
             'key': 22,
             'property': 10,
             'share': 12,
             'cheques': 2,
             'growers': 6,
             'current': 96,
     

In [121]:
for i, entry in enumerate(trigram_model.items()):
    print(entry)
    if i == 3:
        break

(('<BOS>', 'asian'), defaultdict(<function <lambda>.<locals>.<lambda> at 0x7fd5be0eb640>, {'exporters': 2, 'cocoa': 2, 'dollar': 4}))
(('asian', 'exporters'), defaultdict(<function <lambda>.<locals>.<lambda> at 0x7fd5be0eb7f0>, {'fear': 2}))
(('exporters', 'fear'), defaultdict(<function <lambda>.<locals>.<lambda> at 0x7fd5be0ebe20>, {'damage': 2, 'china': 2}))
(('fear', 'damage'), defaultdict(<function <lambda>.<locals>.<lambda> at 0x7fd5be0ebd90>, {'from': 2}))


In [122]:
VOCABULARY = set([word.lower() for sent in reuters.sents() for word in sent])
# +2 for the <BOS> and <EOS> tokens
VOCABULARY_SIZE = len(VOCABULARY) + 2

In [168]:
def calculate_model_probabilities(model: defaultdict) -> defaultdict:
    result = defaultdict(lambda: defaultdict(lambda: 0))
    for prefix in model:
        # Every time we see the prefix followed by anything
        total = float(sum(model[prefix].values()))
        for next_word in model[prefix]:
            # Laplace smoothing
            #result[prefix][next_word] = (model[prefix][next_word] + 1) / (total + VOCABULARY_SIZE)
            # Without smoothing
            result[prefix][next_word] = model[prefix][next_word] / total
    return result
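
The commented-out Laplace line above only smooths continuations that were already observed, so an unseen word would still get probability 0 from the `defaultdict`. As a rough sketch of add-one smoothing that also covers unseen words, the following computes the probability on demand instead of storing it. The function name `laplace_probabilities` and the toy counts are hypothetical and exist only for this example.

```python
def laplace_probabilities(counts, vocabulary_size):
    """Return a function giving add-one smoothed P(word | prefix)."""
    def prob(prefix, word):
        prefix_counts = counts.get(prefix, {})
        total = sum(prefix_counts.values())
        # Seen and unseen words both get one extra pseudo-count
        return (prefix_counts.get(word, 0) + 1) / (total + vocabulary_size)
    return prob

# Hypothetical counts for a single prefix, with a vocabulary of 6 types
toy_counts = {("<BOS>", "the"): {"company": 3, "bank": 1}}
prob = laplace_probabilities(toy_counts, vocabulary_size=6)

seen = prob(("<BOS>", "the"), "company")   # (3 + 1) / (4 + 6)
unseen = prob(("<BOS>", "the"), "cat")     # (0 + 1) / (4 + 6)
new_prefix = prob(("a", "b"), "cat")       # (0 + 1) / (0 + 6)
print(seen, unseen, new_prefix)
```

Computing the value lazily avoids materializing one entry per vocabulary word for every prefix, which would be very large for a corpus of this size.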

In [169]:
trigram_probs = calculate_model_probabilities(trigram_model)

In [126]:
sorted(dict(trigram_probs["this","is"]).items(), key=lambda x:-1*x[1])

[('the', 0.2328767123287671),
 ('a', 0.21232876712328766),
 ('not', 0.0684931506849315),
 ('about', 0.03424657534246575),
 ('because', 0.03424657534246575),
 ('why', 0.02054794520547945),
 ('an', 0.02054794520547945),
 ('going', 0.02054794520547945),
 ('expected', 0.0136986301369863),
 ('just', 0.0136986301369863),
 ('hardly', 0.0136986301369863),
 ('done', 0.0136986301369863),
 ('in', 0.0136986301369863),
 ('when', 0.0136986301369863),
 ('22', 0.00684931506849315),
 ('believed', 0.00684931506849315),
 ('partly', 0.00684931506849315),
 ('strictly', 0.00684931506849315),
 ('yen', 0.00684931506849315),
 ('amore', 0.00684931506849315),
 ('up', 0.00684931506849315),
 ('well', 0.00684931506849315),
 ('most', 0.00684931506849315),
 ('definitely', 0.00684931506849315),
 ('clearly', 0.00684931506849315),
 ('approved', 0.00684931506849315),
 ('making', 0.00684931506849315),
 ('equivalent', 0.00684931506849315),
 ('based', 0.00684931506849315),
 ('reflecting', 0.00684931506849315),
 ('really', 0

## Computing the most likely next word

In [127]:
def get_likely_words(model_probs: defaultdict, context: str, top_count: int=10) -> list[tuple]:
    """Given a context, return the most likely next words

    Params
    ------
    model_probs: defaultdict
        Model probabilities
    context: str
        Space-separated context used to look up the most likely next words
    top_count: int
        Number of most likely words to return. Default 10
    """
    history = tuple(context.split())
    return sorted(dict(model_probs[history]).items(), key=lambda prob: -1*prob[1])[:top_count]

In [170]:
get_likely_words(trigram_probs, "<BOS> the", top_count=3)

[('company', 0.13028764805414553),
 ('bank', 0.024591088550479413),
 ('u', 0.01500282007896221)]

### Generation strategies

In [142]:
from random import randint

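def get_next_word(words: list) -> str:
    # Greedy strategy: always take the most likely word
    return words[0][0]

# Random strategy: pick uniformly among the candidates.
# This second definition overrides the greedy one above.
def get_next_word(words: list) -> str:
    return words[randint(0, len(words)-1)][0]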

In [143]:
get_next_word(get_likely_words(trigram_probs, "<BOS> the", 50))

'official'
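
Besides taking the top word or picking uniformly, a third option is to sample each candidate in proportion to its probability. The sketch below uses `random.choices` from the standard library; the function name `sample_next_word` and the toy candidate list are assumptions for illustration, with the list shaped like the output of `get_likely_words`.

```python
import random

def sample_next_word(words, rng=random):
    """Pick one word from (word, probability) pairs, weighted by probability."""
    candidates = [word for word, _ in words]
    weights = [prob for _, prob in words]
    # random.choices normalizes the weights internally
    return rng.choices(candidates, weights=weights, k=1)[0]

# Toy candidates with rounded, illustrative probabilities
likely = [("company", 0.13), ("bank", 0.02), ("u", 0.015)]
rng = random.Random(0)  # seeded generator so runs are repeatable
word = sample_next_word(likely, rng)
print(word)
```

Weighted sampling keeps frequent continuations frequent while still allowing rarer ones, which tends to give more varied text than the greedy choice.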

### Generating text

In [151]:
MAX_TOKENS = 30
def generate_text(model: defaultdict, history: str, tokens_count: int) -> None:
    next_word = get_next_word(get_likely_words(model, history, top_count=30))
    print(next_word, end=" ")
    tokens_count += 1
    if tokens_count == MAX_TOKENS or next_word == "<EOS>":
        return
    generate_text(model, history.split()[1]+ " " + next_word, tokens_count)

In [154]:
sentence = "<BOS> they"
print(sentence, end=" ")
generate_text(trigram_probs, sentence, 0)

<BOS> they pointed to yesterday ' s new eurosterling bonds were launched today , is an excellent one both commercially and financially ." <EOS> 

## Computing the probability of a sentence

In [171]:
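def calculate_sent_prob(model: defaultdict, sentence: list[str], n: int) -> float:
    """Log-probability of a tokenized sentence under a bigram (n=2) or trigram (n=3) model"""
    n_grams = ngrams(preprocess(sentence), n)
    p = 0.0
    for gram in n_grams:
        if n == 3:
            key = (gram[0], gram[1])
            value = gram[2]
        elif n == 2:
            key = gram[0]
            value = gram[1]
        else:
            raise ValueError("n must be 2 or 3")
        # np.log(0) returns -inf (with a RuntimeWarning) rather than raising,
        # so an unseen n-gram drives the total to -inf
        p += np.log(model[key][value])
    return p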

In [172]:
sentence = reuters.sents()[10]
print(" ".join(sentence))
calculate_sent_prob(trigram_probs, reuters.sents()[10], n=3)

Taiwan had a trade trade surplus of 15 . 6 billion dlrs last year , 95 pct of it with the U . S .


-55.40967083856851

In [173]:
sentence = reuters.sents()[100]
print(" ".join(sentence))
calculate_sent_prob(trigram_probs, reuters.sents()[100], n=3)

Now it ' s largely out of their hands ," said Kleinwort Benson Ltd financial analyst Simon Smithson .


-31.011837082831672

## Model evaluation

In [174]:
from sklearn.model_selection import train_test_split

train_data, test_data = train_test_split(reuters.sents(), test_size=0.3)

print("Train data", len(train_data))
print("tests data", len(test_data))

Train data 38301
tests data 16415
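
For a strict hold-out evaluation, the counts would be collected from the training split only. The following is a minimal standard-library sketch of that idea on toy data; the helpers `hold_out_split` and `count_bigrams` and the toy sentences are hypothetical, and the split is a plain shuffle rather than scikit-learn's `train_test_split`.

```python
import random
from collections import Counter, defaultdict

def hold_out_split(sentences, test_size=0.3, seed=0):
    """Shuffle a copy of the sentences and cut it into train and test parts."""
    shuffled = list(sentences)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(len(shuffled) * (1 - test_size)))
    return shuffled[:cut], shuffled[cut:]

def count_bigrams(sentences):
    """Count bigrams over lowercased sentences padded with <BOS> and <EOS>."""
    counts = defaultdict(Counter)
    for sent in sentences:
        tokens = ["<BOS>"] + [w.lower() for w in sent] + ["<EOS>"]
        for w1, w2 in zip(tokens, tokens[1:]):
            counts[w1][w2] += 1
    return counts

# Ten toy sentences standing in for a real corpus
toy = [["The", "cat", "sat"], ["The", "dog", "sat"], ["A", "cat", "ran"],
       ["A", "dog", "ran"], ["The", "cat", "ran"], ["The", "dog", "ran"],
       ["A", "cat", "sat"], ["A", "dog", "sat"], ["The", "bird", "sang"],
       ["A", "bird", "sang"]]

train, test = hold_out_split(toy)
train_counts = count_bigrams(train)  # the test sentences are never counted
print(len(train), len(test))
```

Because the test sentences are excluded from the counts, some of their n-grams will be unseen, which is the situation where smoothing becomes necessary.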


In [175]:
pp=[]
for sentence in test_data:
  # Normalization and special symbols are handled inside calculate_sent_prob

  # Log perplexity computed for each sentence.
  # Note: by operator precedence this is -(log_prob/len(sentence)) + 1
  log_prob=calculate_sent_prob(trigram_probs, sentence, 3)
  perplexity=-(log_prob/len(sentence)-1)
  pp.append(perplexity)


# Average of the per-sentence log perplexities:
total_perplexity= sum(pp) / len(pp)
print(total_perplexity)

2.8574505214488366


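To evaluate a model we use (log) perplexity as the metric: lower values mean the model assigns higher probability to the test sentences. Note that the counts above were collected over all of `reuters.sents()`, so the sentences in `test_data` were also seen during training; a strict hold-out evaluation would build the model from `train_data` only.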
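
For reference, perplexity is commonly defined as the exponential of the negative average log-probability per predicted token. The sketch below follows that textbook definition at corpus level, which differs from the per-sentence quantity averaged in the notebook cells; the function name `perplexity` and the toy log-probabilities are assumptions for illustration.

```python
import math

def perplexity(log_probs):
    """Corpus-level perplexity from per-token natural-log probabilities.

    log_probs is a list of sentences, each a list of log P(token | history).
    """
    total_log_prob = sum(sum(sentence) for sentence in log_probs)
    n_tokens = sum(len(sentence) for sentence in log_probs)
    return math.exp(-total_log_prob / n_tokens)

# Toy case: every token is predicted with probability 1/4
log_probs = [[math.log(0.25)] * 3, [math.log(0.25)] * 5]
pp = perplexity(log_probs)
print(pp)
```

In the toy case the model is as uncertain as a uniform choice among four options, so the perplexity comes out as 4 (up to floating-point error).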

### Comparing with a bigram model

In [178]:
bigram_model = defaultdict(lambda: defaultdict(lambda: 0))

In [179]:
N = 2
for sentence in reuters.sents():
    # Get the normalized n-grams
    n_grams = ngrams(preprocess(sentence), N)
    # Store the bigram counts in our dictionary
    for w1, w2 in n_grams:
        bigram_model[w1][w2] += 1

In [180]:
bigram_model["problems"]

defaultdict(<function __main__.<lambda>.<locals>.<lambda>()>,
            {'from': 1,
             'are': 10,
             'at': 5,
             'to': 5,
             'storage': 1,
             'with': 20,
             '.': 29,
             'for': 11,
             'leading': 1,
             'and': 19,
             'in': 21,
             ',': 23,
             'like': 1,
             'of': 19,
             'develop': 1,
             'on': 1,
             'remains': 1,
             'britain': 1,
             'worse': 1,
             'japan': 2,
             'through': 1,
             'some': 1,
             'when': 1,
             'earlier': 1,
             'due': 3,
             'affecting': 1,
             'would': 2,
             'caused': 1,
             'bedevilling': 1,
             'that': 7,
             '...': 2,
             '."': 6,
             '"': 3,
             'initially': 1,
             ',"': 5,
             'there': 1,
             'were': 2,
             '..': 1,
    

In [181]:
for i, entry in enumerate(bigram_model.items()):
    print(entry)
    if i == 3:
        break

('<BOS>', defaultdict(<function <lambda>.<locals>.<lambda> at 0x7fd58d6d7ac0>, {'asian': 4, 'they': 447, 'but': 1055, 'the': 8865, 'unofficial': 2, '"': 3589, 'in': 1385, 'threat': 2, 'taiwan': 91, 'retaliation': 4, 'a': 764, 'last': 204, 'much': 8, 'he': 1592, 'meanwhile': 41, 'japan': 275, 'deputy': 8, 'china': 84, 'it': 1770, 'miti': 12, 'nuclear': 3, 'thai': 26, 'thailand': 19, 'export': 34, 'products': 4, 'indonesia': 55, 'prices': 46, 'harahap': 2, 'indonesian': 21, 'australian': 49, 'cargo': 3, 'trading': 27, 'physical': 2, 'rubber': 7, 'robusta': 2, 'no': 116, 'trade': 68, 'nainggolan': 1, 'officials': 58, 'transactions': 2, 'total': 120, 'sri': 14, 'western': 35, 'bundey': 1, 'annual': 5, 'sumitomo': 7, 'osaka': 1, 'some': 181, 'others': 11, 'now': 21, 'among': 57, 'regulations': 1, 'we': 73, 'komatsu': 2, 'article': 2, 'that': 105, 'until': 15, 'like': 6, 'subroto': 10, 'asked': 130, 'bundesbank': 53, 'banks': 39, 'dealers': 119, 'two': 34, 'bond': 21, 'atlas': 6, 'wilson': 1

In [182]:
bigram_probs = calculate_model_probabilities(bigram_model)

In [183]:
sorted(dict(bigram_probs["<BOS>"]).items(), key=lambda x:-1*x[1])[:10]

[('the', 0.1620184223992982),
 ('"', 0.06559324512025733),
 ('it', 0.03234885591051977),
 ('he', 0.029095694129687842),
 ('in', 0.025312522845237224),
 ('but', 0.01928138021785218),
 ('u', 0.015827180349440747),
 ('a', 0.013963008991885371),
 ('this', 0.008187733021419695),
 ('they', 0.00816945683163974)]

In [184]:
calculate_sent_prob(bigram_probs, reuters.sents()[100], 2)

-85.07891454134658

In [186]:
pp=[]
for sentence in test_data:
  # Normalization and special symbols are handled inside calculate_sent_prob

  # Log perplexity computed for each sentence.
  # Note: by operator precedence this is -(log_prob/len(sentence)) + 1
  log_prob=calculate_sent_prob(bigram_probs, sentence, 2)
  perplexity=-(log_prob/len(sentence)-1)
  pp.append(perplexity)


# Average of the per-sentence log perplexities:
total_perplexity= sum(pp) / len(pp)
print(total_perplexity)

5.138950219497932


## Practice 6: Evaluating language models

**Due date: April 21, 2024**

- Build a pair of language models using a **Spanish corpus**
    - Corpus: El Quijote
        - URL: https://www.gutenberg.org/ebooks/2000
    - n-gram model with `n = [2, 3]`
    - Hold-out with `test = 30%` and `train = 70%`
- Evaluate the models and report the perplexity of each one
  - Compare the results across the different language models (bigrams, trigrams)
  - Which model scored best? Why?

# References

- Much of the code shown was taken from the work of Dr. Ximena Gutierrez-Vasques
- https://lena-voita.github.io/nlp_course/language_modeling.html