# 06. Character-based Language Classification 

In this session we are going to build a character-based model that learns to classify which language a word belongs to.

We will show how to implement Recurrent Neural Net (RNN) and Convolutional Neural Net (CNN) based models and compare them with a Bag-of-Words baseline.

*Important Note: Make sure to Restart and Run all (Kernel -> Restart and Run all) every time you modify your network before training it: the Jupyter kernel keeps the existing model object and its weights in memory, so re-running a training cell resumes training instead of starting from scratch.*

## 0*. Data Generation
The dataset (`data.p`), with roughly 8k words for each language (English, German, French, Bulgarian and Russian), is available in this repo under the `data` folder. The following code shows how the data was generated.

```python
import random
import pickle as pkl
from wordfreq import top_n_list
from transliterate import translit


def select_unique(lang_words, all_unique_words):
    unique_words = []
    for w in lang_words:
        if w in all_unique_words:
            unique_words.append(w)
    return unique_words

def create_splits(word_list, label):
    random.shuffle(word_list)
    total_len = len(word_list)
    train = zip(word_list[0:total_len-2000], [label]*(total_len-2000))
    valid = zip(word_list[total_len-2000:total_len-1000] , [label]*(1000))
    test = zip(word_list[total_len-1000:], [label]*(1000))
    
    data_dict= {"train": train,
               "valid": valid,
               "test": test}
    return data_dict


# Get top 10k words for each language
num_words = 10000
en_words = top_n_list('en', num_words)
de_words = top_n_list('de', num_words)
fr_words = top_n_list('fr', num_words)
bg_words = top_n_list('bg', num_words)
ru_words = top_n_list('ru', num_words)

# convert Cyrillic to Latin for Russian and Bulgarian
for i in range(len(ru_words)):
    ru_words[i] = translit(ru_words[i], 'ru', reversed=True)
    
for i in range(len(bg_words)):
    bg_words[i] = translit(bg_words[i], 'bg', reversed=True)
    
# Get unique words from all languages
all_words = en_words + de_words + fr_words + bg_words + ru_words
all_unique_words = set([x for x in all_words if all_words.count(x) == 1])
all_unique_words = list(all_unique_words)

# Select unique words from each language according to all possible unique words
en_unique_words = select_unique(en_words, all_unique_words)
de_unique_words = select_unique(de_words, all_unique_words)
fr_unique_words = select_unique(fr_words, all_unique_words)
bg_unique_words = select_unique(bg_words, all_unique_words)
ru_unique_words = select_unique(ru_words, all_unique_words)

# Split dataset into train/valid/test
data = {"train": [], "valid": [], "test": []}
for i, lang in enumerate([en_unique_words, de_unique_words, fr_unique_words, bg_unique_words, ru_unique_words]):
    lang_data = create_splits(lang, i)
    for key in data.keys():
        data[key] += lang_data[key]
        
# Save
with open("data/data.p", "wb") as f:
    pkl.dump(data, f)
```

## 1. Data Loading

In [2]:
# First, let's import the libraries that we are going to use in this lab session
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import Dataset
from collections import Counter
import pickle as pkl
import random
import pdb
random.seed(134)

PAD_IDX = 0
UNK_IDX = 1
BATCH_SIZE = 32

In [5]:
def build_vocab(data):
    # Returns:
    # char2id: dictionary where keys represent chars and corresponding values represent indices
    # id2char: list of chars, where id2char[i] returns the char that corresponds to index i
    # max_len: length of the longest word in data
    max_len = max([len(word[0]) for word in data])
    all_chars = []
    for word in data:
        all_chars += word[0]
    unique_chars = list(set(all_chars))

    id2char = unique_chars
    char2id = dict(zip(unique_chars, range(2,2+len(unique_chars))))
    id2char = ['<pad>', '<unk>'] + id2char
    char2id['<pad>'] = PAD_IDX
    char2id['<unk>'] = UNK_IDX

    return char2id, id2char, max_len

def convert_to_chars(data):
    return [([c for c in sample[0]], sample[1]) for sample in data]

# Function that loads and preprocesses the dataset
def read_data():
#     data = pkl.load(open("data/rnn_cnn_lang_classification_data.p", "rb"))
    data = pkl.load(open("data/data.p", "rb"))
    train_data, val_data, test_data = data['train'], data['valid'], data['test']
    train_data, val_data, test_data = convert_to_chars(train_data), convert_to_chars(val_data), convert_to_chars(test_data)
    char2id, id2char, max_len = build_vocab(train_data)
    return train_data, val_data, test_data, char2id, id2char, max_len


In [6]:
train_data, val_data, test_data, char2id, id2char, MAX_WORD_LENGTH = read_data()

print ("Maximum word length of dataset is {}".format(MAX_WORD_LENGTH))
print ("Number of characters in dataset is {}".format(len(id2char)))
print ("Characters:")
print (char2id.keys())

Maximum word length of dataset is 23
Number of characters in dataset is 63
Characters:
dict_keys(['h', 'â', 'à', 'i', 'x', 'ö', 's', 'ê', 'ü', '►', 'é', 'î', 'e', '6', '.', 'y', '1', '3', 'è', ',', 'u', 'p', 'k', 'ç', 'ï', 'm', 'n', 'b', 'f', 'd', 'q', 'z', 'œ', 'j', 'э', '2', "'", '♪', '5', '4', '■', 'ѝ', 'ù', 'ä', 'l', 'ј', 'r', 'v', '0', 'û', 'g', 'ë', '♫', '9', '8', 'c', 'o', 'ô', 'a', 'w', 't', '<pad>', '<unk>'])


Now let's build the PyTorch DataLoader:

In [7]:
class VocabDataset(Dataset):
    """
    Class that represents a train/validation/test dataset that's readable for PyTorch
    Note that this class inherits torch.utils.data.Dataset
    """

    def __init__(self, data_tuple, char2id):
        """
        @param data_tuple: list of (list of characters, label) pairs
        @param char2id: dictionary that maps characters to indices

        """
        self.data_list, self.target_list = zip(*data_tuple)
        assert (len(self.data_list) == len(self.target_list))
        self.char2id = char2id

    def __len__(self):
        return len(self.data_list)

    def __getitem__(self, key):
        """
        Triggered when you call dataset[i]
        """
        char_idx = [self.char2id[c] if c in self.char2id.keys() else UNK_IDX  for c in self.data_list[key][:MAX_WORD_LENGTH]]
        label = self.target_list[key]
        return [char_idx, len(char_idx), label]

def vocab_collate_func(batch):
    """
    Customized function for DataLoader that pads every word in the batch to
    MAX_WORD_LENGTH and sorts the batch by decreasing word length
    """
    data_list = []
    label_list = []
    length_list = []

    for datum in batch:
        label_list.append(datum[2])
        length_list.append(datum[1])
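    # pad every word with PAD_IDX up to MAX_WORD_LENGTH
    for datum in batch:
        padded_vec = np.pad(np.array(datum[0]),
                                pad_width=((0,MAX_WORD_LENGTH-datum[1])),
                                mode="constant", constant_values=PAD_IDX)
        data_list.append(padded_vec)
    # sort the batch by decreasing length, as pack_padded_sequence expects
    ind_dec_order = np.argsort(length_list)[::-1]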
    data_list = np.array(data_list)[ind_dec_order]
    length_list = np.array(length_list)[ind_dec_order]
    label_list = np.array(label_list)[ind_dec_order]
    return [torch.from_numpy(np.array(data_list)), torch.LongTensor(length_list), torch.LongTensor(label_list)]


In [8]:
# Build train, valid and test dataloaders
train_dataset = VocabDataset(train_data, char2id)
train_loader = torch.utils.data.DataLoader(dataset=train_dataset,
                                           batch_size=BATCH_SIZE,
                                           collate_fn=vocab_collate_func,
                                           shuffle=True)

val_dataset = VocabDataset(val_data, char2id)
val_loader = torch.utils.data.DataLoader(dataset=val_dataset,
                                           batch_size=BATCH_SIZE,
                                           collate_fn=vocab_collate_func,
                                           shuffle=True)

test_dataset = VocabDataset(test_data, char2id)
test_loader = torch.utils.data.DataLoader(dataset=test_dataset,
                                           batch_size=BATCH_SIZE,
                                           collate_fn=vocab_collate_func,
                                           shuffle=False)

## 2. Recurrent Neural Net (RNN) Model

In [9]:
class RNN(nn.Module):
    def __init__(self, emb_size, hidden_size, num_layers, num_classes, vocab_size):
        # RNN Accepts the following hyperparams:
        # emb_size: Embedding Size
        # hidden_size: Hidden Size of layer in RNN
        # num_layers: number of layers in RNN
        # num_classes: number of output classes
        # vocab_size: vocabulary size
        super(RNN, self).__init__()

        self.num_layers, self.hidden_size = num_layers, hidden_size
        self.embedding = nn.Embedding(vocab_size, emb_size, padding_idx=PAD_IDX)
        self.rnn = nn.RNN(emb_size, hidden_size, num_layers, batch_first=True)
        self.linear = nn.Linear(hidden_size, num_classes)

    def init_hidden(self, batch_size):
        # Function initializes the activation of recurrent neural net at timestep 0
        # Needs to be in format (num_layers, batch_size, hidden_size)
        hidden = torch.randn(self.num_layers, batch_size, self.hidden_size)
        return hidden

    def forward(self, x, lengths):
        batch_size, seq_len = x.size()
        # reset hidden state
        self.hidden = self.init_hidden(batch_size)
        # get embedding of characters
        embed = self.embedding(x)
        # pack padded sequence
        embed = torch.nn.utils.rnn.pack_padded_sequence(embed, lengths.numpy(), batch_first=True)
        # forward propagation through RNN
        rnn_out, self.hidden = self.rnn(embed, self.hidden)
        # undo packing
        rnn_out, _ = torch.nn.utils.rnn.pad_packed_sequence(rnn_out, batch_first=True)
        ## IMPORTANT ## rnn_out has shape (batch_size, seq_length, hidden_size)
        
        # sum hidden activations of RNN across time
        rnn_out = torch.sum(rnn_out, dim=1)   # HINT for Exercise 3: take torch.max(rnn_out, dim=1) instead, or use max_pool1d

        logits = self.linear(rnn_out)
        return logits


**[Important things to keep in mind when using variable sized sequences in RNN in PyTorch]**

RNN modules accept packed sequences as inputs:
* `pack_padded_sequence` packs a tensor of padded, variable-length sequences. **IMPORTANT: by default (`enforce_sorted=True`) the sequences must be sorted by length in decreasing order before being passed to this function**

* `pad_packed_sequence` is the inverse operation of `pack_padded_sequence`: it transforms a packed sequence back into a padded tensor and also returns the length of each sequence

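To make the two functions above concrete, here is a minimal sketch on a toy batch. The tensor values and sizes are made up for illustration and are not taken from the dataset; the batch is already sorted by decreasing length, as the default `enforce_sorted=True` requires.

```python
import torch
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

# Toy batch: 3 sequences of lengths 4, 2 and 1, sorted by decreasing length.
# Each timestep has a single feature; 0.0 marks padding.
padded = torch.tensor([[1., 2., 3., 4.],
                       [5., 6., 0., 0.],
                       [7., 0., 0., 0.]]).unsqueeze(-1)   # batch x seq_len x 1
lengths = torch.tensor([4, 2, 1])

packed = pack_padded_sequence(padded, lengths, batch_first=True)
# packed.data keeps only the 7 real timesteps, stored timestep by timestep:
# [1, 5, 7, 2, 6, 3, 4]
packed_values = packed.data.squeeze(-1).tolist()
# packed.batch_sizes is the number of sequences still active at each timestep:
# [3, 2, 1, 1]
batch_sizes = packed.batch_sizes.tolist()

# Unpacking restores the padded tensor and returns the original lengths.
unpacked, unpacked_lengths = pad_packed_sequence(packed, batch_first=True)
```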
In [13]:
def test_model(loader, model):
    """
    Helper function that tests the model's performance on a dataset
    @param: loader - data loader for the dataset to test against
    @param: model - model to evaluate
    """
    correct = 0
    total = 0
    model.eval()
    # gradients are not needed for evaluation
    with torch.no_grad():
        for data, lengths, labels in loader:
            data_batch, lengths_batch, label_batch = data, lengths, labels
            outputs = F.softmax(model(data_batch, lengths_batch), dim=1)
            predicted = outputs.max(1, keepdim=True)[1]

            total += labels.size(0)
            correct += predicted.eq(labels.view_as(predicted)).sum().item()
    return (100 * correct / total)


model = RNN(emb_size=100, hidden_size=200, num_layers=2, num_classes=5, vocab_size=len(id2char))

learning_rate = 3e-4
num_epochs = 10 # number of epochs to train

# Criterion and Optimizer
criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)

# Train the model
total_step = len(train_loader)

# for epoch in range(num_epochs):
#     for i, (data, lengths, labels) in enumerate(train_loader):
#         model.train()
#         optimizer.zero_grad()
#         # Forward pass
#         outputs = model(data, lengths)
#         loss = criterion(outputs, labels)

#         # Backward and optimize
#         loss.backward()
#         optimizer.step()
#         # validate every 128 iterations
#         if i > 0 and i % 128 == 0:
#             # validate
#             val_acc = test_model(val_loader, model)
#             print('Epoch: [{}/{}], Step: [{}/{}], Validation Acc: {}'.format(
#                        epoch+1, num_epochs, i+1, total_step, val_acc))


### 2.2 RNN Exercises

#### Exercise 1

Implement an LSTM cell instead of the RNN cell. Train the model and compare the results.
- Hint: modify the `init_hidden` function and the cell in `__init__`

In [11]:
class RNN_LSTM(nn.Module):
    def __init__(self, emb_size, hidden_size, num_layers, num_classes, vocab_size):
        super(RNN_LSTM, self).__init__()
        self.num_layers, self.hidden_size = num_layers, hidden_size
        self.embedding = nn.Embedding(vocab_size, emb_size, padding_idx=PAD_IDX)
        self.rnn = nn.LSTM(emb_size, hidden_size, num_layers, batch_first=True)
        self.linear = nn.Linear(hidden_size, num_classes)

    def init_hidden(self, batch_size):
        hidden = torch.zeros(self.num_layers, batch_size, self.hidden_size)  # zero initial state (the RNN above uses a random one)
        cell = torch.zeros(self.num_layers, batch_size, self.hidden_size)
        return hidden, cell

    def forward(self, x, lengths):
        batch_size, seq_len = x.size()
        self.hidden, self.cell = self.init_hidden(batch_size)
        embed = self.embedding(x)
        embed = torch.nn.utils.rnn.pack_padded_sequence(embed, lengths.numpy(), batch_first=True)
        rnn_out, (self.hidden, self.cell) = self.rnn(embed, (self.hidden, self.cell))
        rnn_out, _ = torch.nn.utils.rnn.pad_packed_sequence(rnn_out, batch_first=True)
        rnn_out = torch.sum(rnn_out, dim=1)
        logits = self.linear(rnn_out)
        return logits


In [14]:
model = RNN_LSTM(emb_size=100, hidden_size=200, num_layers=2, num_classes=5, vocab_size=len(id2char))
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)

# TEST LSTM:
for epoch in range(num_epochs):
    for i, (data, lengths, labels) in enumerate(train_loader):
        model.train()
        optimizer.zero_grad()
        outputs = model(data, lengths)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
        if i > 0 and i % 128 == 0:
            val_acc = test_model(val_loader, model)
            print('Epoch: [{}/{}], Step: [{}/{}], Validation Acc: {}'.format(
                       epoch+1, num_epochs, i+1, total_step, val_acc))
            

Epoch: [1/10], Step: [129/924], Validation Acc: 50.02
Epoch: [1/10], Step: [257/924], Validation Acc: 57.78
Epoch: [1/10], Step: [385/924], Validation Acc: 61.08
Epoch: [1/10], Step: [513/924], Validation Acc: 66.2
Epoch: [1/10], Step: [641/924], Validation Acc: 69.52
Epoch: [1/10], Step: [769/924], Validation Acc: 72.02
Epoch: [1/10], Step: [897/924], Validation Acc: 73.32
Epoch: [2/10], Step: [129/924], Validation Acc: 73.98
Epoch: [2/10], Step: [257/924], Validation Acc: 75.0
Epoch: [2/10], Step: [385/924], Validation Acc: 75.98
Epoch: [2/10], Step: [513/924], Validation Acc: 77.68
Epoch: [2/10], Step: [641/924], Validation Acc: 76.52
Epoch: [2/10], Step: [769/924], Validation Acc: 77.92
Epoch: [2/10], Step: [897/924], Validation Acc: 79.74
Epoch: [3/10], Step: [129/924], Validation Acc: 79.34
Epoch: [3/10], Step: [257/924], Validation Acc: 80.4
Epoch: [3/10], Step: [385/924], Validation Acc: 80.44
Epoch: [3/10], Step: [513/924], Validation Acc: 80.54
Epoch: [3/10], Step: [641/924], ...

#### Exercise 2

Implement a Bidirectional LSTM. You can do it very easily by passing one extra argument (`bidirectional=True`) to the cell when you create it.

For better understanding we recommend that you also implement it yourself by reversing each sequence and passing it to a second cell.

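Before looking at a full model, the following sketch shows how the shapes change when `bidirectional=True` is set, which is why a bidirectional model needs `hidden_size * 2` input features in its linear layer and `num_layers * 2` initial states. The sizes are arbitrary illustration values, not the ones used for training.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
emb_size, hidden_size, num_layers = 8, 16, 2   # illustrative sizes only
lstm = nn.LSTM(emb_size, hidden_size, num_layers,
               batch_first=True, bidirectional=True)

x = torch.randn(4, 5, emb_size)     # batch x seq_len x emb_size
out, (h_n, c_n) = lstm(x)           # initial states default to zeros

# Forward and backward outputs are concatenated on the last dimension.
out_shape = tuple(out.shape)        # (batch, seq_len, 2 * hidden_size)
# One final state per layer and per direction.
h_shape = tuple(h_n.shape)          # (num_layers * 2, batch, hidden_size)

# The forward direction finishes at the last timestep,
# the backward direction finishes at the first timestep.
fwd_matches = torch.allclose(out[:, -1, :hidden_size], h_n[-2])
bwd_matches = torch.allclose(out[:, 0, hidden_size:], h_n[-1])
```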
In [16]:
class RNN_biLSTM_pt(nn.Module):
    def __init__(self, emb_size, hidden_size, num_layers, num_classes, vocab_size):
        super(RNN_biLSTM_pt, self).__init__()
        self.num_layers, self.hidden_size = num_layers, hidden_size
        self.embedding = nn.Embedding(vocab_size, emb_size, padding_idx=PAD_IDX)
        self.rnn = nn.LSTM(emb_size, hidden_size, num_layers, batch_first=True, bidirectional=True)
        self.linear = nn.Linear(hidden_size * 2, num_classes)

    def init_hidden(self, batch_size):
        hidden = torch.randn(self.num_layers * 2, batch_size, self.hidden_size)  # change from random to zeros?
        cell = torch.randn(self.num_layers * 2, batch_size, self.hidden_size)
        return hidden, cell

    def forward(self, x, lengths):
        batch_size, seq_len = x.size()
        self.hidden, self.cell = self.init_hidden(batch_size)
        embed = self.embedding(x)
        embed = torch.nn.utils.rnn.pack_padded_sequence(embed, lengths.numpy(), batch_first=True)
        rnn_out, (self.hidden, self.cell) = self.rnn(embed, (self.hidden, self.cell))
        rnn_out, _ = torch.nn.utils.rnn.pad_packed_sequence(rnn_out, batch_first=True)
        rnn_out = torch.sum(rnn_out, dim=1)
        logits = self.linear(rnn_out)
#         logits = self.linear(rnn_out[:, -1, :])
        return logits


In [19]:
class RNN_biLSTM(nn.Module):
    def __init__(self, emb_size, hidden_size, num_layers, num_classes, vocab_size):
        super(RNN_biLSTM, self).__init__()
        self.num_layers, self.hidden_size = num_layers, hidden_size
        self.embedding = nn.Embedding(vocab_size, emb_size, padding_idx=PAD_IDX)
        self.rnn_forward = nn.LSTM(emb_size, hidden_size, self.num_layers, batch_first=True)
        self.rnn_backward = nn.LSTM(emb_size, hidden_size, self.num_layers, batch_first=True)
        self.linear = nn.Linear(hidden_size * 2, num_classes)

    def init_hidden(self, batch_size):
        hidden = torch.randn(self.num_layers, batch_size, self.hidden_size)  # change from random to zeros?
        cell = torch.randn(self.num_layers, batch_size, self.hidden_size)
        return hidden, cell

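    def forward(self, x, lengths):
        batch_size, seq_len = x.size()
        self.hidden, self.cell = self.init_hidden(batch_size)
        self.hidden_back, self.cell_back = self.init_hidden(batch_size)
        # reverse every word along the time axis, leaving the padding at the end;
        # reversed(embed) would flip dim 0 (the batch) instead of the time axis
        positions = torch.arange(seq_len).unsqueeze(0)
        rev_idx = torch.where(positions < lengths.unsqueeze(1),
                              lengths.unsqueeze(1) - 1 - positions, positions)
        embed = self.embedding(x)
        embed_back = self.embedding(torch.gather(x, 1, rev_idx))
        rnn_out, (self.hidden, self.cell) = self.rnn_forward(embed, (self.hidden, self.cell))
        rnn_out_back, (self.hidden_back, self.cell_back) = self.rnn_backward(embed_back, (self.hidden_back, self.cell_back))
        # the two outputs are in opposite time order, which does not matter for the sum over time
        rnn_out_c = torch.cat((rnn_out, rnn_out_back), 2)
        rnn_out_c = torch.sum(rnn_out_c, dim=1)
        logits = self.linear(rnn_out_c)
        return logits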

In [17]:
# TEST biLSTM using pytorch:
model = RNN_biLSTM_pt(emb_size=100, hidden_size=200, num_layers=2, num_classes=5, vocab_size=len(id2char))
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)


for epoch in range(num_epochs):
    for i, (data, lengths, labels) in enumerate(train_loader):
        model.train()
        optimizer.zero_grad()
        outputs = model(data, lengths)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
        if i > 0 and i % 128 == 0:
            val_acc = test_model(val_loader, model)
            print('Epoch: [{}/{}], Step: [{}/{}], Validation Acc: {}'.format(
                       epoch+1, num_epochs, i+1, total_step, val_acc))


Epoch: [1/10], Step: [129/924], Validation Acc: 52.98
Epoch: [1/10], Step: [257/924], Validation Acc: 56.96
Epoch: [1/10], Step: [385/924], Validation Acc: 59.96
Epoch: [1/10], Step: [513/924], Validation Acc: 61.78
Epoch: [1/10], Step: [641/924], Validation Acc: 61.96
Epoch: [1/10], Step: [769/924], Validation Acc: 65.46
Epoch: [1/10], Step: [897/924], Validation Acc: 66.9
Epoch: [2/10], Step: [129/924], Validation Acc: 68.34
Epoch: [2/10], Step: [257/924], Validation Acc: 68.5
Epoch: [2/10], Step: [385/924], Validation Acc: 69.22
Epoch: [2/10], Step: [513/924], Validation Acc: 69.94
Epoch: [2/10], Step: [641/924], Validation Acc: 70.96
Epoch: [2/10], Step: [769/924], Validation Acc: 71.82
Epoch: [2/10], Step: [897/924], Validation Acc: 72.28
Epoch: [3/10], Step: [129/924], Validation Acc: 72.18
Epoch: [3/10], Step: [257/924], Validation Acc: 73.44
Epoch: [3/10], Step: [385/924], Validation Acc: 74.46
Epoch: [3/10], Step: [513/924], Validation Acc: 74.02
Epoch: [3/10], Step: [641/924], ...

In [20]:
# TEST biLSTM self-implemented:
model = RNN_biLSTM(emb_size=100, hidden_size=200, num_layers=2, num_classes=5, vocab_size=len(id2char))
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)

for epoch in range(num_epochs):
    for i, (data, lengths, labels) in enumerate(train_loader):
        model.train()
        optimizer.zero_grad()
        outputs = model(data, lengths)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
        if i > 0 and i % 128 == 0:
            val_acc = test_model(val_loader, model)
            print('Epoch: [{}/{}], Step: [{}/{}], Validation Acc: {}'.format(
                       epoch+1, num_epochs, i+1, total_step, val_acc))

Epoch: [1/10], Step: [129/924], Validation Acc: 52.98
Epoch: [1/10], Step: [257/924], Validation Acc: 57.44
Epoch: [1/10], Step: [385/924], Validation Acc: 61.06
Epoch: [1/10], Step: [513/924], Validation Acc: 64.94
Epoch: [1/10], Step: [641/924], Validation Acc: 67.14
Epoch: [1/10], Step: [769/924], Validation Acc: 69.2
Epoch: [1/10], Step: [897/924], Validation Acc: 69.28
Epoch: [2/10], Step: [129/924], Validation Acc: 71.28
Epoch: [2/10], Step: [257/924], Validation Acc: 71.92
Epoch: [2/10], Step: [385/924], Validation Acc: 73.58
Epoch: [2/10], Step: [513/924], Validation Acc: 74.38
Epoch: [2/10], Step: [641/924], Validation Acc: 74.66
Epoch: [2/10], Step: [769/924], Validation Acc: 75.02
Epoch: [2/10], Step: [897/924], Validation Acc: 75.58
Epoch: [3/10], Step: [129/924], Validation Acc: 75.48
Epoch: [3/10], Step: [257/924], Validation Acc: 76.62
Epoch: [3/10], Step: [385/924], Validation Acc: 75.94
Epoch: [3/10], Step: [513/924], Validation Acc: 77.44
Epoch: [3/10], Step: [641/924], ...

#### Exercise 3

Add max-pooling (over time) after passing through the RNN, instead of summing the hidden states over time.

In [25]:
class RNN_maxpool(nn.Module):
    def __init__(self, emb_size, hidden_size, num_layers, num_classes, vocab_size):
        super(RNN_maxpool, self).__init__()
        self.num_layers, self.hidden_size = num_layers, hidden_size
        self.embedding = nn.Embedding(vocab_size, emb_size, padding_idx=PAD_IDX)
        self.rnn = nn.RNN(emb_size, hidden_size, num_layers, batch_first=True)
        self.linear = nn.Linear(hidden_size, num_classes)

    def init_hidden(self, batch_size):
        hidden = torch.randn(self.num_layers, batch_size, self.hidden_size)
        return hidden

    def forward(self, x, lengths):
        batch_size, seq_len = x.size()
        self.hidden = self.init_hidden(batch_size)
        embed = self.embedding(x)
        embed = torch.nn.utils.rnn.pack_padded_sequence(embed, lengths.numpy(), batch_first=True)
        rnn_out, self.hidden = self.rnn(embed, self.hidden)
        rnn_out, _ = torch.nn.utils.rnn.pad_packed_sequence(rnn_out, batch_first=True)
        # max-pool over the time dimension (dim=1): result is batch_size x hidden_size
        # note: padded timesteps are zeros after pad_packed_sequence and take part in the max
        pooled, _ = torch.max(rnn_out, dim=1)
        logits = self.linear(pooled)
        return logits

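One subtlety of max-pooling over time: `pad_packed_sequence` fills padded timesteps with zeros, so for a short word a zero from the padding can win the max whenever all real activations of a unit are negative. A possible refinement, sketched here under the assumption that everything lives on the CPU, is to mask the padding with `-inf` before taking the max. The helper name `masked_max_over_time` is hypothetical and not part of the notebook.

```python
import torch

def masked_max_over_time(rnn_out, lengths):
    """Max over time that ignores padded timesteps.

    rnn_out: batch x seq_len x hidden_size
    lengths: batch (number of real timesteps per sequence)
    """
    seq_len = rnn_out.size(1)
    # mask[b, t] is True for real timesteps and False for padding
    mask = torch.arange(seq_len).unsqueeze(0) < lengths.unsqueeze(1)
    # padding is set to -inf so that it can never be selected by the max
    masked = rnn_out.masked_fill(~mask.unsqueeze(-1), float("-inf"))
    return masked.max(dim=1).values

# Toy example: the second sequence has one real timestep and one padded one.
rnn_out = torch.tensor([[[-1.0, 2.0], [-3.0, 0.5]],
                        [[-2.0, -4.0], [0.0, 0.0]]])
lengths = torch.tensor([2, 1])

pooled = masked_max_over_time(rnn_out, lengths)   # padding ignored
naive = rnn_out.max(dim=1).values                 # padding zeros win for row 2
```

In the toy example the naive max returns zeros for the second word, while the masked version returns its real activations.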

In [26]:
# TEST RNN adding maxpooling:
model = RNN_maxpool(emb_size=100, hidden_size=200, num_layers=2, num_classes=5, vocab_size=len(id2char))
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)


for epoch in range(num_epochs):
    for i, (data, lengths, labels) in enumerate(train_loader):
        model.train()
        optimizer.zero_grad()
        outputs = model(data, lengths)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
        if i > 0 and i % 256 == 0:
            val_acc = test_model(val_loader, model)
            print('Epoch: [{}/{}], Step: [{}/{}], Validation Acc: {}'.format(
                       epoch+1, num_epochs, i+1, total_step, val_acc))


Epoch: [1/10], Step: [257/924], Validation Acc: 57.98
Epoch: [1/10], Step: [513/924], Validation Acc: 64.32
Epoch: [1/10], Step: [769/924], Validation Acc: 68.78
Epoch: [2/10], Step: [257/924], Validation Acc: 73.1
Epoch: [2/10], Step: [513/924], Validation Acc: 75.02
Epoch: [2/10], Step: [769/924], Validation Acc: 76.56
Epoch: [3/10], Step: [257/924], Validation Acc: 77.52
Epoch: [3/10], Step: [513/924], Validation Acc: 77.3
Epoch: [3/10], Step: [769/924], Validation Acc: 78.98
Epoch: [4/10], Step: [257/924], Validation Acc: 78.74
Epoch: [4/10], Step: [513/924], Validation Acc: 79.7
Epoch: [4/10], Step: [769/924], Validation Acc: 80.04
Epoch: [5/10], Step: [257/924], Validation Acc: 80.2
Epoch: [5/10], Step: [513/924], Validation Acc: 81.0
Epoch: [5/10], Step: [769/924], Validation Acc: 81.22
Epoch: [6/10], Step: [257/924], Validation Acc: 82.06
Epoch: [6/10], Step: [513/924], Validation Acc: 81.28
Epoch: [6/10], Step: [769/924], Validation Acc: 82.0
Epoch: [7/10], Step: [257/924], ...

## 3. Convolutional Neural Net (CNN) Model


In [28]:
class CNN(nn.Module):
    def __init__(self, emb_size, hidden_size, num_layers, num_classes, vocab_size):

        super(CNN, self).__init__()

        self.num_layers, self.hidden_size = num_layers, hidden_size
        self.embedding = nn.Embedding(vocab_size, emb_size, padding_idx=PAD_IDX)
    
        self.conv1 = nn.Conv1d(emb_size, hidden_size, kernel_size=3, padding=1)
        self.conv2 = nn.Conv1d(hidden_size, hidden_size, kernel_size=3, padding=1)

        self.linear = nn.Linear(hidden_size, num_classes)

    def forward(self, x, lengths):
        batch_size, seq_len = x.size()

        embed = self.embedding(x)
        hidden = self.conv1(embed.transpose(1,2)).transpose(1,2)
        hidden = F.relu(hidden.contiguous().view(-1, hidden.size(-1))).view(batch_size, seq_len, hidden.size(-1))

        hidden = self.conv2(hidden.transpose(1,2)).transpose(1,2)
        hidden = F.relu(hidden.contiguous().view(-1, hidden.size(-1))).view(batch_size, seq_len, hidden.size(-1))

        hidden = torch.sum(hidden, dim=1)
        logits = self.linear(hidden)
        return logits

In [29]:
model = CNN(emb_size=100, hidden_size=200, num_layers=2, num_classes=5, vocab_size=len(id2char))
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)

for epoch in range(num_epochs):
    for i, (data, lengths, labels) in enumerate(train_loader):
        model.train()
        optimizer.zero_grad()
        outputs = model(data, lengths)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
        if i > 0 and i % 128 == 0:
            val_acc = test_model(val_loader, model)
            print('Epoch: [{}/{}], Step: [{}/{}], Validation Acc: {}'.format(
                       epoch+1, num_epochs, i+1, total_step, val_acc))


Epoch: [1/10], Step: [129/924], Validation Acc: 64.7
Epoch: [1/10], Step: [257/924], Validation Acc: 71.58
Epoch: [1/10], Step: [385/924], Validation Acc: 72.6
Epoch: [1/10], Step: [513/924], Validation Acc: 77.02
Epoch: [1/10], Step: [641/924], Validation Acc: 75.84
Epoch: [1/10], Step: [769/924], Validation Acc: 75.04
Epoch: [1/10], Step: [897/924], Validation Acc: 76.62
Epoch: [2/10], Step: [129/924], Validation Acc: 80.14
Epoch: [2/10], Step: [257/924], Validation Acc: 78.42
Epoch: [2/10], Step: [385/924], Validation Acc: 81.0
Epoch: [2/10], Step: [513/924], Validation Acc: 81.16
Epoch: [2/10], Step: [641/924], Validation Acc: 81.92
Epoch: [2/10], Step: [769/924], Validation Acc: 82.4
Epoch: [2/10], Step: [897/924], Validation Acc: 82.38
Epoch: [3/10], Step: [129/924], Validation Acc: 82.32
Epoch: [3/10], Step: [257/924], Validation Acc: 81.7
Epoch: [3/10], Step: [385/924], Validation Acc: 81.64
Epoch: [3/10], Step: [513/924], Validation Acc: 81.2
Epoch: [3/10], Step: [641/924], ...

**[Important things to keep in mind when using Convolutional Nets for Language Tasks in PyTorch]**

* The `Conv1d` module expects input of size (batch_size, num_channels, length), whereas in our case the input has size (batch_size, length, num_channels). Hence it is important to call `transpose(1,2)` before passing it to the convolutional layer, and then to bring the output back to (batch_size, length, num_channels) by calling `transpose(1,2)` again.

* The code above also reshapes the hidden activations into a 2D tensor with `view(-1, hidden.size(-1))` before passing them to ReLU. This is not strictly required: `F.relu` is applied elementwise and works on tensors of any shape.

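The shape handling around `Conv1d` can be checked on random data. This is a small sketch that reuses the sizes from the model above (embedding size 100, hidden size 200, word length 23) with a made-up batch of 4. It also compares `F.relu` applied directly with `F.relu` applied through the `view` round-trip.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
batch_size, seq_len, emb_size, hidden_size = 4, 23, 100, 200
embed = torch.randn(batch_size, seq_len, emb_size)   # stand-in for embedding output
conv = nn.Conv1d(emb_size, hidden_size, kernel_size=3, padding=1)

# Conv1d wants (batch_size, num_channels, length), so swap the last two axes.
hidden = conv(embed.transpose(1, 2))
conv_shape = tuple(hidden.shape)        # (batch_size, hidden_size, seq_len)

# Swap back to (batch_size, length, num_channels) for the rest of the model.
hidden = hidden.transpose(1, 2)
final_shape = tuple(hidden.shape)       # (batch_size, seq_len, hidden_size)

# ReLU is elementwise, so both variants give identical results.
direct = F.relu(hidden)
via_view = F.relu(hidden.contiguous().view(-1, hidden_size)).view(
    batch_size, seq_len, hidden_size)
same_result = torch.equal(direct, via_view)
```

With `kernel_size=3` and `padding=1` the sequence length is preserved, which is what allows the second convolution and the sum over time to work on all 23 positions.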
#### Exercise 4

Implement Gated ReLU activations as well as Gated Linear activations and compare them with ReLU (reference: https://arxiv.org/pdf/1612.08083.pdf).
- Hint: Gated ReLU activations are `sigmoid(conv1_1(x)) * relu(conv1_2(x))`
- Hint: Gated Linear activations are `sigmoid(conv1_1(x)) * conv1_2(x)`

Feel free to play with other variants of gating.

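One possible way to organize the gating from the hints is a small module that owns two convolutions, one producing the gate and one producing the value. This is only a sketch: the class name `GatedConv1d` and the `gated_relu` flag are hypothetical, and the layer is not wired into the `CNN` class.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedConv1d(nn.Module):
    """Hypothetical helper: sigmoid(conv_gate(x)) * activation(conv_value(x))."""

    def __init__(self, in_channels, out_channels, kernel_size=3, gated_relu=False):
        super().__init__()
        padding = kernel_size // 2   # keeps the length unchanged for odd kernel sizes
        self.conv_gate = nn.Conv1d(in_channels, out_channels, kernel_size, padding=padding)
        self.conv_value = nn.Conv1d(in_channels, out_channels, kernel_size, padding=padding)
        self.gated_relu = gated_relu

    def forward(self, x):
        # x: batch_size x in_channels x length (already transposed for Conv1d)
        gate = torch.sigmoid(self.conv_gate(x))
        value = self.conv_value(x)
        if self.gated_relu:
            value = F.relu(value)    # Gated ReLU; otherwise Gated Linear
        return gate * value

torch.manual_seed(0)
x = torch.randn(4, 100, 23)                              # batch x emb_size x length
glu_out = GatedConv1d(100, 200)(x)                       # Gated Linear activations
grelu_out = GatedConv1d(100, 200, gated_relu=True)(x)    # Gated ReLU activations
```

Both variants keep the output shape of a plain convolution, so they can replace `conv1` followed by `F.relu` in the model without further changes.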

#### Exercise 5

Add max-pooling (over time) after passing through the convolutions, and add a non-linear fully connected layer.

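A possible shape for the classification head of this exercise is sketched below: max-pool the convolution output over time, then apply one hidden fully connected layer with a ReLU before the output layer. The class name `MaxPoolHead` and the size `fc_size=64` are assumptions made for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaxPoolHead(nn.Module):
    """Hypothetical head: max over time -> Linear -> ReLU -> Linear."""

    def __init__(self, hidden_size, fc_size, num_classes):
        super().__init__()
        self.fc = nn.Linear(hidden_size, fc_size)
        self.out = nn.Linear(fc_size, num_classes)

    def forward(self, hidden):
        # hidden: batch_size x length x hidden_size (output of the conv stack)
        pooled, _ = torch.max(hidden, dim=1)      # batch_size x hidden_size
        return self.out(F.relu(self.fc(pooled)))  # batch_size x num_classes

torch.manual_seed(0)
conv_features = torch.randn(4, 23, 200)           # stand-in for the conv output
head = MaxPoolHead(hidden_size=200, fc_size=64, num_classes=5)
logits = head(conv_features)
```

In the `CNN` class this head would take the place of `torch.sum(hidden, dim=1)` followed by `self.linear`.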
#### Exercise 6

Use Bag-of-Words and Bag-of-NGrams models for this task and compare them with the RNN and CNN.

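For a character-level task, the "words" of a Bag-of-Words model are single characters and the n-grams are character n-grams. The sketch below shows only the feature extraction step in plain Python; the boundary markers `<` and `>` are an assumption borrowed from common practice, and the function names are hypothetical.

```python
from collections import Counter

def char_ngrams(word, n_max=3):
    """All character n-grams of length 1..n_max, with '<' and '>' marking word boundaries."""
    marked = "<" + word + ">"
    grams = []
    for n in range(1, n_max + 1):
        for i in range(len(marked) - n + 1):
            grams.append(marked[i:i + n])
    return grams

def bag_of_ngrams(word, n_max=3):
    """Unordered n-gram counts: the input features of a Bag-of-NGrams classifier."""
    return Counter(char_ngrams(word, n_max))

# n_max=1 gives a plain bag of characters; n_max=2 adds bigrams.
unigrams = char_ngrams("das", n_max=1)   # ['<', 'd', 'a', 's', '>']
bag = bag_of_ngrams("das", n_max=2)      # 5 unigrams + 4 bigrams, each seen once
```

Each distinct n-gram can then be mapped to an index, in the same way `build_vocab` maps characters, and the embeddings of a word's n-grams can be summed or averaged before a linear classifier.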
#### Exercise 7

Use FastText for this task.