# **Homework 4 - CC6205 Natural Language Processing 📚**




In this task, I implement a Convolutional Neural Network and a Feed Forward Network using PyTorch. This network will address the problem of tagging (generating sequences of labels of the same length as the input sequence).

**This notebook is based on a homework for CC6205 written by Gabriel Iturra Bocaz. My contribution (Santiago Maass) is the Model section.**

**References:**

- [Tagging, and Hidden Markov Models ](http://www.cs.columbia.edu/~mcollins/cs4705-spring2019/slides/tagging.pdf) (slides by Michael Collins), [notes](http://www.cs.columbia.edu/~mcollins/hmms-spring2013.pdf), [video 1](https://youtu.be/-ngfOZz8yK0), [video 2](https://youtu.be/Tjgb-yQOg54), [video 3](https://youtu.be/aaa5Qoi8Vco), [video 4](https://youtu.be/4pKWIDkF_6Y)
- [MEMMs and CRFs](https://github.com/dccuchile/CC6205/blob/master/slides/NLP-CRF.pdf): [notes 1](http://www.cs.columbia.edu/~mcollins/crf.pdf), [notes 2](http://www.cs.columbia.edu/~mcollins/fb.pdf), [video 1](https://youtu.be/qlI-4lSUDkg), [video 2](https://youtu.be/PLoLKQwkONw), [video 3](https://youtu.be/ZpUwDy6o28Y)
- [Convolutional Neural Networks](https://github.com/dccuchile/CC6205/blob/master/slides/NLP-CNN.pdf): [video](https://youtu.be/lLZW5Fn40r8)
- [Recurrent Neural Networks](https://github.com/dccuchile/CC6205/blob/master/slides/NLP-RNN.pdf): [video 1](https://youtu.be/BmhjUkzz3nk), [video 2](https://youtu.be/z43YFR1iIvk), [video 3](https://youtu.be/7L5JxQdwNJk)

In this section of the task, you will need to implement a Chatbot capable of generating a basic conversation using a Star Wars dataset. During the development, it is expected that you can design a bot (which will have a classifier behind it) capable of classifying different labels, so that once the label is identified, it provides a response relevant to the question.

In [1]:
import pandas as pd

example_data = pd.read_json('https://raw.githubusercontent.com/dccuchile/CC6205/master/assignments/star_wars_chatbot.json')
print("Number of tags: ", example_data['intents'].shape[0])

Number of tags:  16


In [3]:
example_data["intents"]

0     {'tag': 'greeting', 'patterns': ['Hi', 'Hey', ...
1     {'tag': 'goodbye', 'patterns': ['Bye', 'See yo...
2     {'tag': 'thanks', 'patterns': ['Thanks', 'Than...
3     {'tag': 'tasks', 'patterns': ['What can you do...
4     {'tag': 'alive', 'patterns': ['Are you alive.'...
5     {'tag': 'Menu', 'patterns': ['Which items do y...
6     {'tag': 'help', 'patterns': ['I am looking for...
7     {'tag': 'mission', 'patterns': ['I am on missi...
8     {'tag': 'jedi', 'patterns': ['Tell me top 10 j...
9     {'tag': 'sith', 'patterns': ['Tell me top 10 s...
10    {'tag': 'bounti hounter', 'patterns': ['Tell m...
11    {'tag': 'funny', 'patterns': ['Tell me a joke!...
12    {'tag': 'about me', 'patterns': ['Do you know ...
13    {'tag': 'creator', 'patterns': ['Who is your c...
14    {'tag': 'myself', 'patterns': ['Tell me about ...
15    {'tag': 'stories', 'patterns': ['Tell me a sto...
Name: intents, dtype: object

Below are examples of the contents of the first record:

In [None]:
example_data["intents"][0]["patterns"]

['Hi',
 'Hey',
 'How are you',
 'Is anyone there?',
 'Hello',
 'Good day',
 "What's up",
 'Yo!',
 'Howdy',
 'Nice to meet you.']

In [None]:
example_data['intents'][0]['responses']

['Hey',
 'Hello, thanks for visiting.',
 'Hi there, what can I do for you?',
 'Hi there, how can I help?',
 'Hello, there.',
 'Hello Dear',
 'Ooooo Hello, looking for someone or something?',
 'Yes, I am here.',
 'Listening carefully.',
 'Ok, I am with you.']

In [None]:
example_data['intents'][0]['tag']

'greeting'

From the loaded dataset, we can notice that it comes in a JSON format, meaning that its data is stored in dictionaries. The keys of the dictionaries are not random; they serve to identify relevant points in the bot's development. Here's a brief description of the keys:

- `patterns`: It stores the patterns used to train the model 😮. In other words, it is the training corpus that contains only questions or expressions that the bot should respond to.
- `responses`: These are the corresponding responses 🙋 to the patterns. We will use them in a later stage after classification to provide a random response to the user.
- `tag`: These are the labels used to train our model 💻.

In summary, the relevant keys for training our neural network will be `patterns` (corpus) and `tag` (labels).
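
As a minimal sketch of how these keys fit together, the snippet below flattens a tiny hand-written stand-in for the JSON file into (tag, pattern) pairs. The `toy_dataset` dictionary and the `flatten_intents` helper are illustrative assumptions, not part of the assignment code.

```python
# Toy stand-in for the structure of star_wars_chatbot.json (hypothetical data).
toy_dataset = {
    "intents": [
        {"tag": "greeting", "patterns": ["Hi", "Hello"], "responses": ["Hey"]},
        {"tag": "goodbye", "patterns": ["Bye"], "responses": ["See you later"]},
    ]
}

def flatten_intents(dataset):
    """Return one (tag, pattern) pair per training sentence."""
    return [
        (intent["tag"], pattern)
        for intent in dataset["intents"]
        for pattern in intent["patterns"]
    ]

pairs = flatten_intents(toy_dataset)
print(pairs)
```

The `responses` key is deliberately ignored here, since it is only needed after classification.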

##### Install and import

In [None]:
%%capture
# This takes a while to run
!pip install torch==1.8.0+cu111 -f https://download.pytorch.org/whl/torch_stable.html
!pip install torchtext==0.9.0

In [None]:
import os
import sys
import json
import random
from random import choice
from itertools import zip_longest

import numpy as np
import nltk
from nltk.stem.porter import PorterStemmer
import plotly.express as px

import torch
import torch.nn as nn
from torch.autograd import Variable
from torch.optim import SGD, lr_scheduler
from torch.utils.data import Dataset, DataLoader
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator

##### Dataset 📚

In [None]:
# we obtain the dataset
!wget 'https://raw.githubusercontent.com/dccuchile/CC6205/master/assignments/star_wars_chatbot.json'

--2023-06-15 19:23:10--  https://raw.githubusercontent.com/dccuchile/CC6205/master/assignments/star_wars_chatbot.json
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.108.133, 185.199.109.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 14469 (14K) [text/plain]
Saving to: ‘star_wars_chatbot.json’


2023-06-15 19:23:10 (75.0 MB/s) - ‘star_wars_chatbot.json’ saved [14469/14469]



In [None]:
# Load the dataset using json
with open('star_wars_chatbot.json', 'r') as f:
    dataset = json.load(f)

# Build the vocabulary from the patterns and count the number of classes
tokenizer = get_tokenizer("basic_english")
vocab = build_vocab_from_iterator(tokenizer(x) for list_words in dataset['intents'] for x in list_words['patterns'])
num_classes = len(dataset['intents'])
# Out-of-vocabulary tokens map to index 0
vocab.set_default_index(0)
vocab.insert_token('<pad>', 1)
# Sorted list of the unique labels
labels = sorted({intent['tag'] for intent in dataset['intents']})
# Build train_list in the format [(label_idx_0, text_0), ..., (label_idx_n-1, text_n-1)]
train_list = [(labels.index(intents['tag']), text) for intents in dataset['intents'] for text in intents['patterns']]

##### Model

In [None]:
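# Helper that pads the sequences in the forward step
def zip_longest_(text, offsets, vocab, window_len):
    """
    Variant of itertools.zip_longest over the sequences of a flattened batch.

    The flat `text` is split into one sequence per entry of `offsets`, and the
    generator yields one tuple per position holding that position's element
    from every sequence. Exhausted sequences contribute the `<pad>` index, and
    at least `window_len` positions are yielded, so combining the result with
    zip() gives sequences padded to max(window_len, longest sequence).

    Args:
        text (Sequence): Flat sequence of tokens (or token indices) for the whole batch.
        offsets (Sequence[int]): Start position of each sequence within `text`.
        vocab: Vocabulary that maps "<pad>" to its index.
        window_len (int): Minimum number of positions to yield.

    Yields:
        Tuple: The elements at the current position, one per sequence.

    """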
    items = window_len
    iterables = ([text[o:offsets[i+1]] for i, o in enumerate(offsets[:-1])] + [text[offsets[-1]:len(text)]])
    for iterable in iterables:
        items = max(items, len(iterable))

    iters = [iter(iterable) for iterable in iterables]
    while items:
        yield (*[next(i, vocab["<pad>"]) for i in iters],)
        items -= 1



In [None]:
d = {"Hoy": 1, "es": 2, "un":3, "lindo":4, "dia":5, "el":6, "sol":7, "esta":8, "brillando":9, "<pad>":0}
v = build_vocab_from_iterator(tokenizer(x) for x in d.keys())
print("Return of *zip_longest_ for windows of size 2:  ")
print(*zip_longest_(["Hoy", "es", "un" ,"lindo" ,"dia", "el" ,"sol" ,"esta" ,"brillando"], [0, 2, 4, 6, 8], v, 2))
print("Return of zip previous result:  ")
print(list(zip(*zip_longest_(["Hoy", "es", "un" ,"lindo" ,"dia", "el" ,"sol" ,"esta" ,"brillando"], [0, 2, 4, 6, 8], v, 2))))

Return of *zip_longest_ for windows of size 2:  
('Hoy', 'un', 'dia', 'sol', 'brillando') ('es', 'lindo', 'el', 'esta', 0)
Return of zip previous result:  
[('Hoy', 'es'), ('un', 'lindo'), ('dia', 'el'), ('sol', 'esta'), ('brillando', 0)]
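
The same padding idea can be sketched with the standard library alone: pad every sequence to the length of the longest one, but never to less than a minimum length (the convolution window). `pad_to_min_length` is a hypothetical helper written for this illustration, and the pad value `0` is an assumption.

```python
from itertools import zip_longest

def pad_to_min_length(sequences, min_len, pad):
    """Pad every sequence to max(min_len, longest sequence) with `pad`."""
    target = max([min_len] + [len(s) for s in sequences])
    # zip_longest walks the sequences position by position, filling gaps with `pad`.
    columns = list(zip_longest(*sequences, fillvalue=pad))
    # Add extra all-pad positions when every sequence is shorter than min_len.
    columns += [(pad,) * len(sequences)] * (target - len(columns))
    # Transpose back so that each row is one padded sequence.
    return [list(row) for row in zip(*columns)]

padded = pad_to_min_length([[5, 6], [7]], min_len=3, pad=0)
print(padded)
```

Guaranteeing a minimum length matters for the CNN branch, where an input shorter than the kernel cannot be convolved.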


In [None]:
import torch.nn.functional as F
class CNNClassifier(nn.Module):
    def __init__(self, vocab_size, embed_dim=32, num_classes=10,
                 use_cnn=False, cnn_pool_channels=24, cnn_kernel_size=3):

      """
        CNNClassifier is a PyTorch model that can be either a Convolutional Neural Network (CNN)
        or a Feed Forward Neural Network (FFN) for text classification.

        Args:
            vocab_size (int): Size of the vocabulary.
            embed_dim (int, optional): Dimensionality of the word embeddings. Defaults to 32.
            num_classes (int, optional): Number of output classes. Defaults to 10.
            use_cnn (bool, optional): Whether to use the CNN architecture. If False, FFN is used. Defaults to False.
            cnn_pool_channels (int, optional): Number of output channels in the CNN pooling layer. Only used if `use_cnn` is True. Defaults to 24.
            cnn_kernel_size (int, optional): Size of the CNN kernel. Only used if `use_cnn` is True. Defaults to 3.
        """
      super().__init__()
      self.use_cnn = use_cnn
      self.window_len = cnn_kernel_size

      pad_idx = 1

      if self.use_cnn:
          # Model is a CNN
          self.embedding = nn.Embedding(vocab_size, embed_dim)
          self.conv = nn.Conv1d(
              in_channels=1,
              out_channels=cnn_pool_channels,
              kernel_size=cnn_kernel_size * embed_dim,
              stride=embed_dim,
          )
          fc_in_size = cnn_pool_channels
      else:
          # Model is a FFN
          self.embedding = nn.Embedding(vocab_size, embed_dim, pad_idx)
          fc_in_size = embed_dim

      self.fc = nn.Linear(fc_in_size, num_classes)
      self.init_weights()


    def init_weights(self):
        """
        Initializes the weights of the model's parameters.
        """
        initrange = 0.5

        # Initialize embedding weights
        self.embedding.weight.data.uniform_(-initrange, initrange)

        # Initialize linear layer weights
        self.fc.weight.data.uniform_(-initrange, initrange)

        if self.use_cnn:
            # Initialize convolutional layer weights
            self.conv.weight.data.uniform_(-initrange, initrange)

        # Initialize linear layer biases
        self.fc.bias.data.zero_()



    def forward(self, text, offsets):
      """
        Performs forward pass of the model.

        Args:
            text (torch.Tensor): Flat 1-D tensor with the token indices of all sequences in the batch, concatenated.
            offsets (torch.Tensor): Start position of each sequence within `text`.

        Returns:
            torch.Tensor: Log-probabilities of the predicted classes.
      """

      if self.use_cnn:
          # CNN forward pass
          text = torch.tensor(
                list(
                    zip(
                        *zip_longest_(text, offsets, vocab, self.window_len)
                    )
                )
            ).to(text.device)
          h = self.embedding(text)
          h = h.view(h.size(0), 1, -1)
          h = torch.relu(self.conv(h))
          h = h.mean(dim=2)
          output = self.fc(h)
          return F.log_softmax(output, dim=1)

      else:
          # FFN forward pass

          # (B, N, 1) -> (B, N, E)
          text = torch.tensor(
                list(
                    zip(
                        *zip_longest(
                            *([text[o:offsets[i+1]] for i, o in enumerate(offsets[:-1])] + [text[offsets[-1]:len(text)]]),
                            fillvalue=torch.tensor(vocab["<pad>"])
                        )
                    )
                )
            ).to(text.device)

          h = self.embedding(text)

          # Document representation will be mean of embeddings
          h = h.mean(dim=1)

          output = self.fc(h)
          return F.log_softmax(output, dim=1)
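
To see why the CNN branch sets `kernel_size=cnn_kernel_size * embed_dim` and `stride=embed_dim`, it helps to count the positions the kernel visits on the flattened embeddings. The sketch below uses the standard output-length formula for a 1-D convolution without padding or dilation; `conv1d_output_length` and the sentence length of 7 tokens are assumptions made for the example.

```python
def conv1d_output_length(input_len, kernel_size, stride):
    """Output length of a 1-D convolution with no padding and no dilation."""
    return (input_len - kernel_size) // stride + 1

embed_dim = 32      # default embedding size in CNNClassifier
window = 3          # default cnn_kernel_size (tokens per window)
num_tokens = 7      # example padded sentence length (assumed)

flat_len = num_tokens * embed_dim          # length after h.view(B, 1, -1)
n_windows = conv1d_output_length(flat_len, window * embed_dim, embed_dim)

# Each kernel position starts at a multiple of embed_dim, i.e. on a token boundary.
starts = [i * embed_dim for i in range(n_windows)]
print(n_windows, starts)
```

Under these assumptions the kernel always covers whole token embeddings, so each output position corresponds to one window of `window` consecutive tokens.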



##### Batch Function

In [None]:
# Define your batch function
stoi = vocab.get_stoi() # Maps tokens to indices
def generate_batch(batch):
  label = torch.tensor([entry[0] for entry in batch])
  texts = [tokenizer(entry[1]) for entry in batch]
  # offsets gives the position where each sentence starts, counted in tokens
  offsets = [0] + [len(text) for text in texts]
  offsets = torch.tensor(offsets[:-1]).cumsum(dim=0)
  big_text = torch.cat([torch.tensor([vocab[t] if t in stoi else 0 for t in text]) for text in texts])
  # big_text = torch.cat([torch.tensor([vocab.stoi[t] for t in text]) for text in texts])

  return big_text, offsets, label
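
The flat-text-plus-offsets layout returned by `generate_batch` can be illustrated without PyTorch. In this sketch, `flatten_with_offsets` and `split_by_offsets` are hypothetical helpers and the token ids are made up; the point is only that the offsets are the running sum of the previous sequence lengths.

```python
from itertools import accumulate

def flatten_with_offsets(sequences):
    """Concatenate sequences and record where each one starts."""
    flat = [tok for seq in sequences for tok in seq]
    lengths = [len(seq) for seq in sequences]
    # The start of sequence i is the sum of the lengths of all earlier sequences.
    offsets = [0] + list(accumulate(lengths))[:-1]
    return flat, offsets

def split_by_offsets(flat, offsets):
    """Inverse operation: recover the sequences from the flat list."""
    ends = offsets[1:] + [len(flat)]
    return [flat[start:end] for start, end in zip(offsets, ends)]

batch = [[4, 9], [7], [3, 3, 8]]   # toy token ids (assumed)
flat, offsets = flatten_with_offsets(batch)
print(flat, offsets)
```

Splitting by offsets is what the model's forward pass does before padding the sequences to a common length.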

In [None]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
num_epochs = 1
BATCH_SIZE = 16
LR = 1e-1
INPUT_SIZE = len(vocab)
OUTPUT_SIZE = num_classes
USE_CNN = True

# Smoke test: build the model and run a forward pass over every training batch
model = CNNClassifier(INPUT_SIZE, num_classes=OUTPUT_SIZE, use_cnn=USE_CNN).to(device)
train_loader = DataLoader(train_list, batch_size=BATCH_SIZE, collate_fn=generate_batch)
for i, (texts, offsets, cls) in enumerate(train_loader):
  # Inputs must live on the same device as the model
  output = model(texts.to(device), offsets.to(device))
  # print(output.shape)
  # break

##### Training 🥊

In [None]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"GPU is available: {device}")

# Define the hyperparameters of the model
num_epochs = 1000
BATCH_SIZE = 16
LR = 1e-1
INPUT_SIZE = len(vocab)
OUTPUT_SIZE = num_classes
USE_CNN = False

# Define model, optimizer, loss and learning-rate scheduler.
# Note: scheduler.step() is never called in the loop below, so the learning rate stays at LR.
model = CNNClassifier(INPUT_SIZE, num_classes=OUTPUT_SIZE, use_cnn=USE_CNN).to(device)
optimizer = SGD(model.parameters(), lr=LR)
criterion = nn.CrossEntropyLoss().to(device)
scheduler = lr_scheduler.LambdaLR(optimizer, lr_lambda=[lambda epoch: .9 ** (epoch // 10)])

print(f'train: {len(train_list)} elements')

# Train the model on the intents
loss_list = []
for epoch in range(1, num_epochs):
  train_loader = DataLoader(train_list, batch_size=BATCH_SIZE, collate_fn=generate_batch)
  model.train()
  total_loss = 0
  for i, (texts, offsets, cls) in enumerate(train_loader):
    texts = texts.to(device)
    offsets = offsets.to(device)
    cls = cls.to(device)
    optimizer.zero_grad()
    output = model(texts, offsets)
    loss = criterion(output, cls)
    total_loss += loss.item()
    loss.backward()
    optimizer.step()

  # Record and display the loss of the last batch of the epoch
  loss_list.append(loss.item())
  sys.stdout.write('\rEpoch: {0:03d} \t iter-Loss: {1:.3f}'.format(epoch+1, loss.item()))

print(f'final loss: {loss.item():.4f}')

GPU is available: cpu
train: 97 elements
Epoch: 1000 	 iter-Loss: 0.001final loss: 0.0009
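
The lambda passed to `LambdaLR` above returns a multiplicative factor of `0.9 ** (epoch // 10)`, which the scheduler applies to the base learning rate each time `scheduler.step()` is called. That call does not appear in the training loop as written, so the following sketch only shows what the rate would be if the scheduler were stepped once per epoch; `lr_at_epoch` is a hypothetical helper, not a PyTorch function.

```python
def lr_at_epoch(base_lr, epoch, gamma=0.9, step=10):
    """Learning rate implied by the factor gamma ** (epoch // step)."""
    return base_lr * gamma ** (epoch // step)

# The rate is constant within each block of ten epochs and decays between blocks.
for epoch in (0, 9, 10, 25, 100):
    print(epoch, round(lr_at_epoch(0.1, epoch), 5))
```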


##### Let's Test! 🧪

In [None]:
# Is this working? Try the next example!
qText = "'Do you know any joke?'" # this should be classified as "funny"

X = torch.tensor([vocab.get_stoi()[t] for t in tokenizer(qText)]).to(device)

model.eval()
output = model(X, torch.tensor([0], dtype=torch.long).to(device))
_, predicted = torch.max(output, dim=1)
labels[predicted.item()]

'funny'
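
The prediction step above takes the arg-max over the model's log-probabilities and uses it to index the sorted label list. A plain-Python sketch of that step follows; `predict_label`, the three labels, and the probabilities are made up for illustration and do not come from the trained model.

```python
import math

def predict_label(log_probs, labels):
    """Return the most likely label and its probability."""
    best = max(range(len(log_probs)), key=lambda i: log_probs[i])
    # exp() turns the winning log-probability back into a probability.
    return labels[best], math.exp(log_probs[best])

toy_labels = ["funny", "greeting", "jedi"]                 # assumed subset of tags
toy_log_probs = [math.log(p) for p in (0.7, 0.2, 0.1)]     # made-up model output

label, confidence = predict_label(toy_log_probs, toy_labels)
print(label, round(confidence, 2))
```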

##### Save model 🦺 (optional)

In [None]:
# Save the model with PyTorch (optional, just to learn how this is done)
data = {
    "model_state": model.state_dict(),
    "input_size": INPUT_SIZE,
    "output_size": OUTPUT_SIZE,
    "use_cnn": USE_CNN,
    "labels": labels,
}

FILE = "data.pth"
torch.save(data, FILE)

print(f'training complete. file saved to {FILE}')

training complete. file saved to data.pth


##### Chatbot 💬

In [None]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

with open('star_wars_chatbot.json', 'r') as json_data:
    intents = json.load(json_data)

FILE = "data.pth"
data = torch.load(FILE)

INPUT_SIZE = data["input_size"]
OUTPUT_SIZE = data["output_size"]
USE_CNN = data["use_cnn"]
labels = data['labels']
model_state = data["model_state"]

model = CNNClassifier(INPUT_SIZE, num_classes=OUTPUT_SIZE, use_cnn=USE_CNN).to(device)
model.load_state_dict(model_state)
model.eval()

# Dictionary with the answers for each tag
responses = {intent['tag']: intent['responses'] for intent in intents['intents']}

bot_name = "GA-97"
print("Let's chat! (type 'finish_chat' to finish the chat)")
while True:
    q_text = input("You: ")
    if q_text == 'finish_chat':
        break

    # Map out-of-vocabulary words to index 0 so the chatbot does not crash on unseen words
    X = torch.tensor([stoi.get(t, 0) for t in tokenizer(q_text)]).to(device)
    output = model(X, torch.tensor([0], dtype=torch.long).to(device))
    _, predicted = torch.max(output, dim=1)

    tag = labels[predicted.item()]

    # Only answer when the model is reasonably confident in its prediction
    probs = torch.softmax(output, dim=1)
    prob = probs[0][predicted.item()]
    if prob.item() > 0.50:
        print(f"{bot_name}: {random.choice(responses[tag])}")
    else:
        print(f"{bot_name}: My model can't understand you...")

Let's chat! (type 'finish_chat' to finish the chat)
You: hey there
GA-97: Hello Dear
You: hey hey
GA-97: Hello Dear
You: any jokes?
GA-97: It so dangerous, the most brave sith in galaxy Darth Vader, Darth Plagueis, Darth Revan, Darth Traya, Darth Sidious, Darth Maul, Ulic Qel-Droma, Asajj Ventress, Kylo Ren, Marka Ragnos.
You: tell me something very funny
GA-97: You would get bored if I do so.
You: what do you have to drink?
GA-97: No coffe and no tea, only: Fuzzy Tauntaun, Bloody Rancor, Jedi Mind Trick, T-16 Skyhopper, Yub Nub, Jet Juice, Hyperdrive, Rancor Beer.
You: offer me your menu
GA-97: My model can't understand you...
You: whats on your menu
GA-97: My model can't understand you...
You: which is the best jedi
GA-97: Luke Skywalker, Yoda, Obi-Wan Kenobi, Anakin Skywalker, Qui-Gon Jinn, Mace Windu, Ahsoka Tano, Plo Koon, Aalya Secura, Kit Fisto.
You: ok bye
GA-97: May the force be with you!
You: byebye
GA-97: See you later, thanks for visiting.
You: see you later, alligator
GA-9