# NLP and Neural Networks

In this exercise, we'll apply our knowledge of neural networks to process natural language. As we did in the bigram exercise, the goal of this lab is to predict the next word, given the previous one.

### Data set

Load the text from "One Hundred Years of Solitude" that we used in our bigrams exercise. It's located in the data folder.

### Important note:

Start with a small part of the text, perhaps the first 10 paragraphs, because the number of tokens grows rapidly as we add more text.

Later you can use a bigger corpus.

In [271]:
import torch
from nltk.tokenize import TreebankWordTokenizer
tokenizer = TreebankWordTokenizer()

In [272]:
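# Read the file as UTF-8 (the text has accented characters) and close it afterwards
with open('/content/cap1.txt', 'r', encoding='utf-8') as f:
    text = f.read().lower()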

In [273]:
text

'muchos años después, frente al pelotón de fusilamiento, el coronel aureliano buendía había de recordar aquella tarde remota en que su padre lo llevó a conocer el hielo. macondo era entonces una aldea de veinte casas de barro y cañabrava construidas a la orilla de un río de aguas diáfanas que se precipitaban por un lecho de piedras pulidas, blancas y enormes como huevos prehistóricos. el mundo era tan reciente, que muchas cosas carecían de nombre, y para mencionarlas había que señalarlas con el dedo. todos los años, por el mes de marzo, una familia de gitanos desarrapados plantaba su carpa cerca de la aldea, y con un grande alboroto de pitos y timbales daban a conocer los nuevos inventos. primero llevaron el imán. un gitano corpulento, de barba montaraz y manos de gorrión, que se presentó con el nombre de melquíades, hizo una truculenta demostración pública de lo que él mismo llamaba la octava maravilla de los sabios alquimistas de macedonia. fue de casa en casa arrastrando dos lingote

Don't forget to prepare the data by generating the corresponding tokens.

In [274]:
tokens = tokenizer.tokenize(text)

In [275]:
tokens[:10]

['muchos',
 'años',
 'después',
 ',',
 'frente',
 'al',
 'pelotón',
 'de',
 'fusilamiento',
 ',']

In [276]:
len(tokens)

6293

### Let's prepare the data set.

Our neural network needs an input X and an output y. Remember that both are numerical, so you need a way to map tokens to numbers and vice versa.

In [277]:
# In this case, consider each bigram (w1, w2).
# Assign w1 to the X vector and w2 to the y vector. Why do we do this?

In [278]:
vocab = {}
index_to_token = []
current_index = 0

for token in tokens:
    if token not in vocab:
        vocab[token] = current_index
        index_to_token.append(token)
        current_index += 1


b = {}
for w1, w2 in zip(tokens, tokens[1:]):
    bigram = (w1, w2)
    b[bigram] = b.get(bigram, 0) + 1

In [279]:
# Don't forget that since we are using torch, our training set vectors should be tensors

In [280]:
X = torch.tensor([vocab[w1] for w1, w2 in b.keys()], dtype=torch.long)
y = torch.tensor([vocab[w2] for w1, w2 in b.keys()], dtype=torch.long)
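
To see what this X/y split looks like on a tiny input, here is a minimal sketch using a made-up five-token sequence. The `toy_*` names are illustrative only and are not part of the lab.

```python
import torch

# Hypothetical toy corpus, only for illustration.
toy_tokens = ["el", "mundo", "era", "el", "hielo"]

# Map each distinct token to an integer id, in order of first appearance.
toy_vocab = {}
for tok in toy_tokens:
    toy_vocab.setdefault(tok, len(toy_vocab))

# Consecutive pairs (w1, w2): w1 is the input, w2 is the target to predict.
toy_pairs = list(zip(toy_tokens, toy_tokens[1:]))
toy_X = torch.tensor([toy_vocab[w1] for w1, _ in toy_pairs], dtype=torch.long)
toy_y = torch.tensor([toy_vocab[w2] for _, w2 in toy_pairs], dtype=torch.long)

print(toy_pairs)
print(toy_X.tolist(), toy_y.tolist())
```

Each position in `toy_y` holds the id of the word that followed the word at the same position in `toy_X`, which is exactly the question the network is trained to answer.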

In [281]:
# Note that our vectors hold integers, which can be thought of as categorical variables.
# torch provides the one_hot function, which generates tensors suitable for our network.
# Make sure that the dtype of your tensors is float.

In [282]:
X_one_hot = torch.nn.functional.one_hot(X, num_classes=len(vocab)).float()
y_one_hot = torch.nn.functional.one_hot(y, num_classes=len(vocab)).float()
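
One way to build intuition for one-hot inputs: multiplying a one-hot row by a weight matrix simply selects one row of that matrix. The sketch below checks this on a small random matrix; `W`, `ids`, and the sizes are made up for illustration.

```python
import torch

torch.manual_seed(0)
num_classes = 5                    # hypothetical vocabulary size
W = torch.randn(num_classes, 3)    # hypothetical weights, one row per token

ids = torch.tensor([2, 0, 4])
one_hot = torch.nn.functional.one_hot(ids, num_classes=num_classes).float()

# Each one-hot row contains a single 1, so the product picks out one row of W.
via_matmul = one_hot @ W
via_indexing = W[ids]

print(torch.allclose(via_matmul, via_indexing))
```

This is why a layer that looks up rows by index can stand in for a one-hot input followed by a linear layer without bias.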

In [283]:
print(f"Vocabulary size: {len(vocab)}")
print(f"5 bigrams:\n{list(b.items())[:5]}")
print(f"X:\n{X_one_hot[:5]}")
print(f"y:\n{y_one_hot[:5]}")

Vocabulary size: 2127
5 bigrams:
[(('muchos', 'años'), 3), (('años', 'después'), 2), (('después', ','), 2), ((',', 'frente'), 1), (('frente', 'al'), 1)]
X:
tensor([[1., 0., 0.,  ..., 0., 0., 0.],
        [0., 1., 0.,  ..., 0., 0., 0.],
        [0., 0., 1.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.]])
y:
tensor([[0., 1., 0.,  ..., 0., 0., 0.],
        [0., 0., 1.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.]])


### Network design
To start, we'll use a very simple network. Define a single-layer network.

In [284]:
# How many neurons should our input layer have?
# Use as many neurons as the total number of categories (from your one-hot encoded tensors)
# Use the softmax as your activation layer

In [285]:
import torch
import torch.nn as nn
import torch.optim as optim

vocab_size = len(vocab)

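# Neural network
class SimpleNet(nn.Module):
    def __init__(self, input_size, output_size):
        super(SimpleNet, self).__init__()
        self.fc = nn.Linear(input_size, output_size)

    def forward(self, x):
        out = self.fc(x)
        # Softmax turns the scores into probabilities, as the exercise asks.
        # Caveat: nn.CrossEntropyLoss applies log-softmax itself and expects raw
        # scores, so training on these probabilities gives a nearly flat loss.
        return torch.softmax(out, dim=1)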

In [286]:
model = SimpleNet(input_size=vocab_size, output_size=vocab_size)

criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.01)

In [287]:
# Train your network

In [288]:
num_epochs = 100
for epoch in range(num_epochs):
    # Forward
    outputs = model(X_one_hot)
    loss = criterion(outputs, y)

    # Backward pass and optimizer step
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    if (epoch+1) % 10 == 0:
        print(f'Epoch [{epoch+1}/{num_epochs}], Loss: {loss.item():.4f}')

Epoch [10/100], Loss: 7.6624
Epoch [20/100], Loss: 7.6623
Epoch [30/100], Loss: 7.6621
Epoch [40/100], Loss: 7.6618
Epoch [50/100], Loss: 7.6615
Epoch [60/100], Loss: 7.6609
Epoch [70/100], Loss: 7.6601
Epoch [80/100], Loss: 7.6587
Epoch [90/100], Loss: 7.6564
Epoch [100/100], Loss: 7.6524
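
A likely reason the loss barely moves from 7.66 is that `nn.CrossEntropyLoss` expects raw scores (logits) and applies log-softmax internally, while the model's `forward` already returns softmax probabilities. The sketch below is an illustration under that assumption, not a rerun of the training. It feeds hand-built probability vectors to the loss and shows that even a perfect prediction cannot push it below roughly ln(1 + (V - 1)/e). The tensors `uniform`, `perfect`, and `logits` are made up.

```python
import math
import torch
import torch.nn as nn

V = 2127  # vocabulary size reported above
criterion = nn.CrossEntropyLoss()
target = torch.tensor([0])

# Uniform probabilities, fed to the loss as if they were logits.
uniform = torch.full((1, V), 1.0 / V)

# Best possible case: all probability mass on the correct class.
perfect = torch.zeros(1, V)
perfect[0, 0] = 1.0

loss_uniform = criterion(uniform, target).item()
loss_perfect = criterion(perfect, target).item()

print(f"uniform probabilities: {loss_uniform:.4f}")  # close to ln(V)
print(f"perfect probabilities: {loss_perfect:.4f}")  # close to ln(1 + (V - 1) / e)

# The same confident prediction expressed as raw logits instead.
logits = torch.full((1, V), -20.0)
logits[0, 0] = 20.0
loss_logits = criterion(logits, target).item()
print(f"confident logits:      {loss_logits:.4f}")
```

Under this reading, the starting value of about 7.66 is just ln(2127), the loss of a uniform guess. Returning the raw output of the linear layer during training and applying softmax only when probabilities are needed would remove the floor.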


### Analysis

1. Test your network with a few words

In [289]:
# Convert a word into its one-hot representation (returns None for unknown words)
def word_to_one_hot(word, vocab, vocab_size):
    word_index = vocab.get(word, None)
    if word_index is None:
        return None
    one_hot_vector = torch.nn.functional.one_hot(torch.tensor([word_index]), num_classes=vocab_size).float()
    return one_hot_vector


test_words = ["mundo", "aldea", "casa"]
generated_pairs = []

# Show the output tensor for each word

model.eval()
with torch.no_grad():
    for word in test_words:
        one_hot_input = word_to_one_hot(word, vocab, vocab_size)
        if one_hot_input is not None:
            output = model(one_hot_input)
            predicted_index = torch.argmax(output, dim=1).item()
            predicted_word = index_to_token[predicted_index]
            print(f"Input word: '{word}' -> Predicted word: '{predicted_word}'")
            print(f"Output tensor: {output}")

            generated_pairs.append((word, predicted_word))


Input word: 'mundo' -> Predicted word: 'se'
Output tensor: tensor([[0.0003, 0.0032, 0.0003,  ..., 0.0003, 0.0003, 0.0003]])
Input word: 'aldea' -> Predicted word: ','
Output tensor: tensor([[0.0003, 0.0029, 0.0003,  ..., 0.0003, 0.0003, 0.0003]])
Input word: 'casa' -> Predicted word: 'en'
Output tensor: tensor([[0.0003, 0.0028, 0.0003,  ..., 0.0003, 0.0003, 0.0003]])


2. What does each value in the tensor represent? <br>
    Each value is the probability that a specific word in the vocabulary is the next word in the sequence.

3. Why does it make sense to choose that number of neurons in our layer?<br>
    Because the network must predict the next word from among all the words in the vocabulary, so the output layer needs one neuron per word.



### 4. What's the negative log-likelihood for each example?

In [290]:
def calculate_log_likelihood_for_pair(model, vocab, vocab_size, word1, word2):
    model.eval()

    with torch.no_grad():
        # Convert word1 to a one-hot vector
        ix1 = word_to_one_hot(word1, vocab, vocab_size)

        ix2 = torch.tensor([vocab[word2]])

        output = model(ix1)

        # Probability the model assigns to word2 following word1
        pr = output[0, ix2.item()].item()

        if pr > 0:
            log_likelihood = torch.log(torch.tensor(pr, dtype=torch.float32)).item()
        else:
            # Plain Python float, so it must not go through .item()
            log_likelihood = float('-inf')

    return log_likelihood

for word1, word2 in generated_pairs:
    log_likelihood = calculate_log_likelihood_for_pair(model, vocab, vocab_size, word1, word2)
    print(f'Log-likelihood of "{word2}" given "{word1}": {log_likelihood}')

Log-likelihood of "se" given "mundo": -3.721703290939331
Log-likelihood of "," given "aldea": -3.6980977058410645
Log-likelihood of "en" given "casa": -3.6979637145996094
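
The cell above reports log-likelihoods. The negative log-likelihood the question asks for is the same number with the sign flipped, so a value of -3.72 corresponds to a negative log-likelihood of 3.72 and a probability of about 0.024. A minimal sketch of that relationship follows; the helper `negative_log_likelihood` and the distribution `toy_probs` are made up for illustration.

```python
import math
import torch

def negative_log_likelihood(probs: torch.Tensor, target_index: int) -> float:
    """Return -log p(target) for one probability vector (hypothetical helper)."""
    return -torch.log(probs[target_index]).item()

# Made-up next-word distribution over a 4-word vocabulary.
toy_probs = torch.tensor([0.1, 0.6, 0.2, 0.1])

nll_likely = negative_log_likelihood(toy_probs, 1)    # -log(0.6)
nll_unlikely = negative_log_likelihood(toy_probs, 0)  # -log(0.1)

print(nll_likely, nll_unlikely)
```

A lower negative log-likelihood means the model found that next word more probable.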


### 5. Try generating a few sentences


In [291]:
def generate_multiple_sentences(model, vocab, index_to_token, start_words, max_length=20, num_sentences=3):
    model.eval()
    sentences = []
    sentence_pairs = []

    for i in range(num_sentences):

        start_word = start_words[i % len(start_words)]

        # Check if the start word is in the vocabulary
        if start_word not in vocab:
            print(f"The word '{start_word}' is not in the vocabulary.")
            continue  # Skip to the next start word

        sentence = [start_word]
        pairs = []

        current_word = start_word
        for _ in range(max_length - 1):
            ix1 = torch.tensor([vocab[current_word]], dtype=torch.long)

            # Convert ix1 to a one-hot vector
            ix1 = torch.nn.functional.one_hot(ix1, num_classes=len(vocab)).float()

            # Get the prediction
            with torch.no_grad():
                output = model(ix1)

            # Pick the word with the highest probability
            predicted_index = torch.argmax(output, dim=1).item()
            predicted_word = index_to_token[predicted_index]

            sentence.append(predicted_word)
            pairs.append((current_word, predicted_word))

            if predicted_word == '.':
                break

            current_word = predicted_word

        # Store the sentence
        sentences.append(' '.join(sentence))
        sentence_pairs.append(pairs)

    return sentences, sentence_pairs

start_words = ["para", "coronel", "barro"]
generated_sentences, generated_sentence_pairs = generate_multiple_sentences(
    model, vocab, index_to_token, start_words
)

for i, sentence in enumerate(generated_sentences, 1):
    print(f"Sentence {i}: {sentence}")

Sentence 1: para la de la de la de la de la de la de la de la de la de la
Sentence 2: coronel aureliano , la de la de la de la de la de la de la de la de la
Sentence 3: barro y en la de la de la de la de la de la de la de la de la
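
The "de la de la" loop is what greedy decoding tends to produce with a bigram model: once two words are each other's most probable successor, `argmax` never escapes. One common alternative is to sample the next word from the predicted distribution with `torch.multinomial`. The sketch below uses a made-up four-word vocabulary and transition table (`toy_index_to_token`, `toy_next_probs`) rather than the trained model.

```python
import torch

torch.manual_seed(0)

# Hypothetical toy vocabulary and next-word probabilities (each row sums to 1).
toy_index_to_token = ["la", "de", "casa", "."]
toy_next_probs = torch.tensor([
    [0.00, 0.50, 0.40, 0.10],  # after "la"
    [0.60, 0.00, 0.30, 0.10],  # after "de"
    [0.10, 0.40, 0.00, 0.50],  # after "casa"
    [0.25, 0.25, 0.25, 0.25],  # after "."
])

def generate(start_index, sample, max_length=10):
    """Generate token ids greedily (sample=False) or by sampling (sample=True)."""
    ids = [start_index]
    for _ in range(max_length - 1):
        probs = toy_next_probs[ids[-1]]
        if sample:
            next_id = torch.multinomial(probs, num_samples=1).item()
        else:
            next_id = torch.argmax(probs).item()
        ids.append(next_id)
        if toy_index_to_token[next_id] == ".":
            break
    return ids

greedy_ids = generate(0, sample=False)
sampled_ids = generate(0, sample=True)

print("greedy: ", " ".join(toy_index_to_token[i] for i in greedy_ids))
print("sampled:", " ".join(toy_index_to_token[i] for i in sampled_ids))
```

In this toy table the greedy path alternates between "la" and "de" until it hits the length limit, while sampling can reach other words or the final period. Sampling only helps if the model's probabilities are meaningful, so it complements better training and does not replace it.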


### 6. What's the negative log-likelihood for each sentence?

In [294]:
import torch

def calculate_log_likelihood_for_sentence_revised(model, vocab, sentence):
    model.eval()

    log_likelihood = 0
    with torch.no_grad():
        for i in range(len(sentence) - 1):
            word1 = sentence[i]
            word2 = sentence[i + 1]

            # Look up the word indices directly in the vocabulary
            ix1 = torch.tensor([vocab[word1]], dtype=torch.long)
            ix2 = torch.tensor([vocab[word2]], dtype=torch.long)

            # Convert ix1 to a one-hot vector
            ix1 = torch.nn.functional.one_hot(ix1, num_classes=len(vocab)).float()

            # Feed the one-hot vector to the model
            output = model(ix1)

            # Probability of the next word
            pr = output[0, ix2.item()].item()

            if pr > 0:
                log_likelihood += torch.log(torch.tensor(pr, dtype=torch.float32)).item()
            else:
                log_likelihood = float('-inf')
                break

    return log_likelihood

for sentence in generated_sentences:
    words = sentence.split()
    log_likelihood = calculate_log_likelihood_for_sentence_revised(model, vocab, words)
    print(f'Log-likelihood of "{sentence}": {log_likelihood}')

Log-likelihood of "para la de la de la de la de la de la de la de la de la de la": -68.55256867408752
Log-likelihood of "coronel aureliano , la de la de la de la de la de la de la de la de la": -70.42259764671326
Log-likelihood of "barro y en la de la de la de la de la de la de la de la de la": -69.03052377700806
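
A sentence's log-likelihood is the sum of the log-probabilities of its transitions, so longer sentences get more negative totals. To compare sentences of different lengths, one option is to divide by the number of transitions. For example, -68.55 over the 19 transitions of a 20-token sentence is about 3.61 per word. The sketch below shows that normalization and the related perplexity on made-up probabilities; `average_nll` and `perplexity` are hypothetical helper names.

```python
import math

def average_nll(log_likelihood: float, num_transitions: int) -> float:
    """Negative log-likelihood per predicted word (hypothetical helper)."""
    return -log_likelihood / num_transitions

def perplexity(log_likelihood: float, num_transitions: int) -> float:
    """Exponential of the average negative log-likelihood."""
    return math.exp(average_nll(log_likelihood, num_transitions))

# Made-up per-transition probabilities for a 4-word sentence (3 transitions).
toy_step_probs = [0.5, 0.25, 0.5]
toy_log_likelihood = sum(math.log(p) for p in toy_step_probs)

toy_avg = average_nll(toy_log_likelihood, len(toy_step_probs))
toy_ppl = perplexity(toy_log_likelihood, len(toy_step_probs))

print(toy_avg)
print(toy_ppl)
```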


### Design your own neural network (more layers and a different number of neurons)
The goal is to get sentences that make more sense.

In [295]:
import torch
import torch.nn as nn
import torch.optim as optim

# Define the neural network
class ImprovedNet(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim1, hidden_dim2, output_dim):
        super(ImprovedNet, self).__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.fc1 = nn.Linear(embedding_dim, hidden_dim1)
        self.relu1 = nn.ReLU()
        self.dropout1 = nn.Dropout(0.5)
        self.fc2 = nn.Linear(hidden_dim1, hidden_dim2)
        self.relu2 = nn.ReLU()
        self.dropout2 = nn.Dropout(0.5)
        self.fc3 = nn.Linear(hidden_dim2, output_dim)

    def forward(self, x):
        embedded = self.embedding(x)
        out = self.fc1(embedded)
        out = self.relu1(out)
        out = self.dropout1(out)
        out = self.fc2(out)
        out = self.relu2(out)
        out = self.dropout2(out)
        out = self.fc3(out)
        # Caveat: as in SimpleNet, nn.CrossEntropyLoss expects raw scores, not probabilities
        return torch.softmax(out, dim=1)

# Parameters for the improved network
embedding_dim = 200
hidden_dim1 = 256
hidden_dim2 = 128
output_dim = vocab_size

model = ImprovedNet(vocab_size, embedding_dim, hidden_dim1, hidden_dim2, output_dim)

criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

num_epochs = 200
for epoch in range(num_epochs):
    outputs = model(X)
    loss = criterion(outputs, y)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    if (epoch+1) % 10 == 0:
        print(f'Epoch [{epoch+1}/{num_epochs}], Loss: {loss.item():.4f}')

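# Generate new sentences
# NOTE: this helper does not call model.eval(), so dropout stays active
# unless the caller switches the model to evaluation mode first.
def generate_sentence_improved(start_word, model, vocab, index_to_token, max_length=10):
    sentence = [start_word]
    current_word = start_word

    for _ in range(max_length - 1):
        # Stop if the current word is not in the vocabulary
        if current_word not in vocab:
            break
        # nn.Embedding takes integer indices, so no one-hot encoding is needed
        word_index = torch.tensor([vocab[current_word]], dtype=torch.long)

        with torch.no_grad():
            output = model(word_index)
            predicted_index = torch.argmax(output, dim=1).item()
            current_word = index_to_token[predicted_index]
            sentence.append(current_word)

    return ' '.join(sentence)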


Epoch [10/200], Loss: 7.6621
Epoch [20/200], Loss: 7.6479
Epoch [30/200], Loss: 7.5892
Epoch [40/200], Loss: 7.5874
Epoch [50/200], Loss: 7.5851
Epoch [60/200], Loss: 7.5813
Epoch [70/200], Loss: 7.5644
Epoch [80/200], Loss: 7.5462
Epoch [90/200], Loss: 7.5366
Epoch [100/200], Loss: 7.5319
Epoch [110/200], Loss: 7.5291
Epoch [120/200], Loss: 7.5278
Epoch [130/200], Loss: 7.5278
Epoch [140/200], Loss: 7.5278
Epoch [150/200], Loss: 7.5274
Epoch [160/200], Loss: 7.5283
Epoch [170/200], Loss: 7.5273
Epoch [180/200], Loss: 7.5270
Epoch [190/200], Loss: 7.5271
Epoch [200/200], Loss: 7.5276
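
`ImprovedNet` contains `nn.Dropout(0.5)` layers, and PyTorch modules start in training mode. Because `generate_sentence_improved` never switches the model to evaluation mode, the sentences it produces may vary from run to run. The sketch below isolates that behavior on a standalone dropout layer; the input `x` is made up.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
dropout = nn.Dropout(0.5)
x = torch.ones(1, 8)

dropout.train()          # training mode: random units are zeroed
train_out = dropout(x)   # surviving units are scaled by 1 / (1 - p) = 2

dropout.eval()           # evaluation mode: dropout is the identity
eval_out = dropout(x)

print(train_out)
print(eval_out)
```

Calling `model.eval()` before generating text, and `model.train()` before resuming training, keeps predictions deterministic for a given input.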


In [296]:
import torch

new_words = ['el', 'macondo', 'coronel']


generated_sentences = []

for i in new_words:
    sentence = generate_sentence_improved(i, model, vocab, index_to_token)
    generated_sentences.append(sentence)



def calculate_log_likelihood_for_sentence_enhanced(model, vocab, vocab_size, sentence):
    model.eval()

    log_likelihood = 0
    with torch.no_grad():
        for i in range(len(sentence) - 1):
            word1 = sentence[i]
            word2 = sentence[i + 1]

            ix1 = torch.tensor([vocab[word1]])
            ix2 = torch.tensor([vocab[word2]])
            output = model(ix1)

            pr = output[0, ix2.item()].item()

            if pr > 0:
                log_likelihood += torch.log(torch.tensor(pr, dtype=torch.float32)).item()
            else:
                log_likelihood = float('-inf')
                break

    return log_likelihood

for sentence in generated_sentences:
    words = sentence.split()
    log_likelihood = calculate_log_likelihood_for_sentence_enhanced(model, vocab, vocab_size, words)
    print(f'Log-likelihood of "{sentence}": {log_likelihood}')

Log-likelihood of "el de de , de de , de , de": -4.871857404713637
Log-likelihood of "macondo , de , de de de , de ,": -4.071119249244703
Log-likelihood of "coronel , de de , de de de , de": -3.424561494966156
