# Assignment 5

**Submission deadlines**:

* last lab before 20.06.2023

**Points:** Aim to get 6 (updated value) out of 15+ possible points

All needed data files are on Drive: <https://drive.google.com/drive/folders/1uufpGn46Mwv4oBwajIeOj4rvAK96iaS-?usp=sharing> (or will be soon :) )

## Task 1 (5 points)

Consider the vowel reconstruction task -- i.e. inserting missing vowels (aeuioy) to obtain proper English text. For instance for the input sentence:

<pre>
h m gd smbd hs stln ll m vwls
</pre>

the best result is

<pre>
oh my god somebody has stolen all my vowels
</pre>

In this task both dev and test data come from the two books about Winnie-the-Pooh. You have to train two RNN language models on *pooh_train.txt*. For the first model use the code below; for the second, choose different hyperparameters (different dropout, a smaller number of units or layers, or any other modification you want).

The code below is based on
https://www.kdnuggets.com/2020/07/pytorch-lstm-text-generation-tutorial.html

In [1]:
! gdown https://drive.google.com/uc?id=1-k8e9OG7NOVk73Kkv4WpqNQKHrVVmVXa
! gdown https://drive.google.com/uc?id=1ADNyasf6AEUsmz-163DWHw_rSldfnpta
! gdown https://drive.google.com/uc?id=1POiC9I_BjZKBQe-7XkW5CW0z8_6inWtY
! ls

Downloading...
From: https://drive.google.com/uc?id=1-k8e9OG7NOVk73Kkv4WpqNQKHrVVmVXa
To: /content/pooh_train.txt
100% 255k/255k [00:00<00:00, 145MB/s]
Downloading...
From: https://drive.google.com/uc?id=1ADNyasf6AEUsmz-163DWHw_rSldfnpta
To: /content/pooh_test.txt
100% 34.6k/34.6k [00:00<00:00, 115MB/s]
Downloading...
From: https://drive.google.com/uc?id=1POiC9I_BjZKBQe-7XkW5CW0z8_6inWtY
To: /content/pooh_words.txt
100% 20.4k/20.4k [00:00<00:00, 72.1MB/s]
pooh_test.txt  pooh_train.txt  pooh_words.txt  sample_data


In [30]:
import torch
from collections import Counter

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

SEQUENCE_LENGTH = 15

class PoohDataset(torch.utils.data.Dataset):
    def __init__(self, sequence_length, device):
        txt = open('pooh_train.txt').read()

        self.words = txt.lower().split() # The text is already tokenized

        self.uniq_words = self.get_uniq_words()

        self.index_to_word = {index: word for index, word in enumerate(self.uniq_words)}
        self.word_to_index = {word: index for index, word in enumerate(self.uniq_words)}

        self.words_indexes = [self.word_to_index[w] for w in self.words]
        self.sequence_length = sequence_length
        self.device = device


    def get_uniq_words(self):
        word_counts = Counter(self.words)
        return sorted(word_counts, key=word_counts.get, reverse=True)

    def __len__(self):
        return len(self.words_indexes) - self.sequence_length

    def __getitem__(self, index):
        return (
            torch.tensor(self.words_indexes[index:index+self.sequence_length], device=self.device),
            torch.tensor(self.words_indexes[index+1:index+self.sequence_length+1], device=self.device)
        )

pooh_dataset = PoohDataset(SEQUENCE_LENGTH, device)
test_dataset = open('pooh_test.txt').read()
test_dataset = test_dataset.lower().split()

In [4]:
from torch import nn, optim

class LSTMModel(nn.Module):
    def __init__(self, dataset, device):
        super(LSTMModel, self).__init__()
        self.lstm_size = 512
        self.embedding_dim = 100
        self.num_layers = 2
        self.device = device


        n_vocab = len(dataset.uniq_words)
        self.embedding = nn.Embedding(
            num_embeddings=n_vocab,
            embedding_dim=self.embedding_dim,
        )
        self.lstm = nn.LSTM(
            input_size=self.embedding_dim,
            hidden_size=self.lstm_size,
            num_layers=self.num_layers,
            dropout=0.2,
        )
        self.fc = nn.Linear(self.lstm_size, n_vocab)

    def forward(self, x, prev_state):
        embed = self.embedding(x)
        output, state = self.lstm(embed, prev_state)
        logits = self.fc(output)
        return logits, state

    def init_state(self, sequence_length):
        return (torch.zeros(self.num_layers, sequence_length, self.lstm_size).to(self.device),
                torch.zeros(self.num_layers, sequence_length, self.lstm_size).to(self.device))




model = LSTMModel(pooh_dataset, device)
model.to(device)

LSTMModel(
  (embedding): Embedding(2548, 100)
  (lstm): LSTM(100, 512, num_layers=2, dropout=0.2)
  (fc): Linear(in_features=512, out_features=2548, bias=True)
)

In [5]:
##################
# modified model #
##################
class LSTMModel_modified(nn.Module):
    def __init__(self, dataset, device):
        super(LSTMModel_modified, self).__init__()
        self.lstm_size = 1024
        self.embedding_dim = 256
        self.num_layers = 3
        self.device = device


        n_vocab = len(dataset.uniq_words)
        self.embedding = nn.Embedding(
            num_embeddings=n_vocab,
            embedding_dim=self.embedding_dim,
        )
        self.lstm = nn.LSTM(
            input_size=self.embedding_dim,
            hidden_size=self.lstm_size,
            num_layers=self.num_layers,
            dropout=0.3,
        )
        self.fc = nn.Linear(self.lstm_size, n_vocab)

    def forward(self, x, prev_state):
        embed = self.embedding(x)
        output, state = self.lstm(embed, prev_state)
        logits = self.fc(output)
        return logits, state

    def init_state(self, sequence_length):
        return (torch.zeros(self.num_layers, sequence_length, self.lstm_size).to(self.device),
                torch.zeros(self.num_layers, sequence_length, self.lstm_size).to(self.device))

model_modified = LSTMModel_modified(pooh_dataset, device)
model_modified.to(device)

LSTMModel_modified(
  (embedding): Embedding(2548, 256)
  (lstm): LSTM(256, 1024, num_layers=3, dropout=0.3)
  (fc): Linear(in_features=1024, out_features=2548, bias=True)
)

In [6]:
import numpy as np
from torch.utils.data import DataLoader

batch_size = 512
max_epochs = 30

def train(dataset, model):
    model.train()

    dataloader = DataLoader(dataset, batch_size=batch_size)

    criterion = nn.CrossEntropyLoss()
    optimizer = optim.Adam(model.parameters(), lr=0.001)

    for epoch in range(max_epochs):
        state_h, state_c = model.init_state(SEQUENCE_LENGTH)

        for batch, (x, y) in enumerate(dataloader):

            optimizer.zero_grad()

            y_pred, (state_h, state_c) = model(x, (state_h, state_c))
            loss = criterion(y_pred.transpose(1, 2), y)

            state_h = state_h.detach()
            state_c = state_c.detach()

            loss.backward()
            optimizer.step()

        print({ 'epoch': epoch, 'batch': batch, 'loss': loss.item() })

In [7]:
train(pooh_dataset, model)

{'epoch': 0, 'batch': 113, 'loss': 5.4725470542907715}
{'epoch': 1, 'batch': 113, 'loss': 4.953440189361572}
{'epoch': 2, 'batch': 113, 'loss': 4.575056552886963}
{'epoch': 3, 'batch': 113, 'loss': 4.3051228523254395}
{'epoch': 4, 'batch': 113, 'loss': 4.126413822174072}
{'epoch': 5, 'batch': 113, 'loss': 3.985083818435669}
{'epoch': 6, 'batch': 113, 'loss': 3.8624050617218018}
{'epoch': 7, 'batch': 113, 'loss': 3.7447173595428467}
{'epoch': 8, 'batch': 113, 'loss': 3.6179933547973633}
{'epoch': 9, 'batch': 113, 'loss': 3.500767946243286}
{'epoch': 10, 'batch': 113, 'loss': 3.4039530754089355}
{'epoch': 11, 'batch': 113, 'loss': 3.3055076599121094}
{'epoch': 12, 'batch': 113, 'loss': 3.2000274658203125}
{'epoch': 13, 'batch': 113, 'loss': 3.0971553325653076}
{'epoch': 14, 'batch': 113, 'loss': 2.997443914413452}
{'epoch': 15, 'batch': 113, 'loss': 2.8966305255889893}
{'epoch': 16, 'batch': 113, 'loss': 2.8314549922943115}
{'epoch': 17, 'batch': 113, 'loss': 2.812638521194458}
{'epoch':

In [8]:
torch.save(model.state_dict(), 'pooh_2x512_30ep.model')


In [9]:
# train modified model
train(pooh_dataset, model_modified)

{'epoch': 0, 'batch': 113, 'loss': 5.571034908294678}
{'epoch': 1, 'batch': 113, 'loss': 5.493565559387207}
{'epoch': 2, 'batch': 113, 'loss': 5.4724555015563965}
{'epoch': 3, 'batch': 113, 'loss': 5.46691370010376}
{'epoch': 4, 'batch': 113, 'loss': 5.460814952850342}
{'epoch': 5, 'batch': 113, 'loss': 5.457488536834717}
{'epoch': 6, 'batch': 113, 'loss': 5.451207637786865}
{'epoch': 7, 'batch': 113, 'loss': 5.44795560836792}
{'epoch': 8, 'batch': 113, 'loss': 5.444857120513916}
{'epoch': 9, 'batch': 113, 'loss': 5.442676544189453}
{'epoch': 10, 'batch': 113, 'loss': 5.4477739334106445}
{'epoch': 11, 'batch': 113, 'loss': 5.440235137939453}
{'epoch': 12, 'batch': 113, 'loss': 5.444283485412598}
{'epoch': 13, 'batch': 113, 'loss': 5.436881065368652}
{'epoch': 14, 'batch': 113, 'loss': 5.433938026428223}
{'epoch': 15, 'batch': 113, 'loss': 5.461470127105713}
{'epoch': 16, 'batch': 113, 'loss': 5.450287818908691}
{'epoch': 17, 'batch': 113, 'loss': 5.479591369628906}
{'epoch': 18, 'batch

In [11]:
torch.save(model_modified.state_dict(), 'pooh_modified.model')

In [23]:
# The predict function is a text generator. You have to modify this code!
import random
def predict(dataset, model, text, next_words=15):
    model.eval()

    words = text.split()
    state_h, state_c = model.init_state(len(words))

    for i in range(0, next_words):
        x = torch.tensor([[dataset.word_to_index[w] for w in words[i:]]])
        x = x.to(device)

        y_pred, (state_h, state_c) = model(x, (state_h, state_c))

        last_word_logits = y_pred[0][-1]
        p = torch.nn.functional.softmax(last_word_logits, dim=0).detach().cpu().numpy()
        word_index = np.random.choice(len(last_word_logits), p=p)
        words.append(dataset.index_to_word[word_index])

    return ' '.join(words)

# DEMO
speakers = ['pooh', 'piglet', 'christopher robin', 'rabbit', 'owl', 'tigger', 'eeyore']
for s in speakers:
    prompt = 'in the morning ' + s
    for i in range(1):
        print (predict(pooh_dataset, model_modified, prompt, 50))
    print ()

in the morning pooh upon spudge -- forward me certain of did the feeling animals jump it why pooh the . we 2 . once it wisely and see the i then . as wonder 's a understand , time with 's and the , bridge blown was and as what owl you oaktree

in the morning piglet her a. hand her come hand to a. hand a. hand her hand hand a. come her hand milne a. a. hand her hand come hand her a. a. her a. to all a. come a. milne a. hand her hand a. we a. jump hand we hand hand her

in the morning christopher robin now said . ' '' slept `` made as i thoughtful . , said tigger good i . , a a , a and , his does now four the had the piglet accident head across too thought sorrowfully which . rabbit rabbit the practise grass so be '' ,

in the morning rabbit anyone pooh 's -- . `` a pooh 's , looked '' '' could piglet a were because be owl better all to him got at all had back about being shall i . sort '' piglet live ) tigger quickly nervously i , said that gone and it might

in the morning owl yours . 

In [12]:
# You can use the code if you want

from collections import defaultdict as dd

vowels = set("aoiuye'")
def devowelize(s):
    rv = ''.join(a for a in s if a not in vowels)
    if rv:
        return rv
    return '_' # Symbol for words without consonants

pooh_words = set(open('pooh_words.txt').read().split())
representation = dd(set)

for w in pooh_words:
    r = devowelize(w)
    representation[r].add(w)

hard_words = set()
for r, ws in representation.items():
    if len(ws) > 1:
        hard_words.update(ws)

print (len(hard_words))

863


In [17]:
def reconstruct_sentence(model, sentence, dataset, representation, temperature=1.0):
    words = sentence
    devowelized_sentence = [devowelize(w) for w in words]
    model.eval()

    state_h, state_c = model.init_state(1)

    reconstructed = []
    probabilities = []

    matching = representation[devowelized_sentence[0]]
    reconstructed.append(random.choice(list(matching)))

    for i in range(len(devowelized_sentence) - 1):
        try:
            x = torch.tensor([[dataset.word_to_index[reconstructed[-1]]]])
        except KeyError:
            pass

        x = x.to(device)
        y_pred, (state_h, state_c) = model(x, (state_h, state_c))
        last_word_logits = y_pred[0][-1]

        matching = representation[devowelized_sentence[i + 1]]
        try:
            matching_idx = [dataset.word_to_index[match] for match in matching]
        except KeyError:
            reconstructed.append(random.choice(list(matching)))
            continue
        p = torch.nn.functional.softmax(last_word_logits/temperature, dim=0).detach().cpu().numpy()
        p[~np.isin(np.arange(len(p)), matching_idx)] = 0
        p = p/p.sum()
        word_index = np.random.choice(len(last_word_logits), p=p)
        reconstructed.append(dataset.index_to_word[word_index])
        probabilities.append(p[word_index])
    return reconstructed, probabilities


In [21]:
def accuracy(original_sequence, reconstructed_sequence):
    sa = original_sequence
    sb = reconstructed_sequence
    score = len([1 for (a,b) in zip(sa, sb) if a == b])
    return score / len(original_sequence)

In [None]:
# def calculate_likelihood(words, model, dataset):
#     model.eval()

#     state_h, state_c = model.init_state(len(words))

#     x = torch.tensor([[dataset.word_to_index[w] for w in words]])
#     x = x.to(device)

#     y_pred, _ = model(x, (state_h, state_c))

#     likelihood = 0

#     for i in range(1, len(words)):
#         predicted_word_logits = y_pred[0][i-1]
#         predicted_word_index = dataset.word_to_index[words[i]]
#         likelihood += predicted_word_logits[predicted_word_index]

#     return likelihood.item()
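
The commented-out helper above adds up raw logits, which are not log-probabilities. To rank several sampled reconstructions by likelihood, one would normally sum log-softmax values instead. Below is a minimal sketch in plain Python; the function names and the toy logits are made up for illustration, and in the notebook the per-step scores would come from the LSTM output.

```python
import math

def log_softmax(logits):
    """Convert a list of raw scores into log-probabilities (numerically stable)."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]

def sequence_log_likelihood(step_logits, target_ids):
    """Sum log P(target | history) over all prediction steps.

    step_logits[t] holds the scores for the next word at step t;
    target_ids[t] is the index of the word that actually follows.
    """
    total = 0.0
    for logits, target in zip(step_logits, target_ids):
        total += log_softmax(logits)[target]
    return total

# Invented scores over a 3-word vocabulary, two prediction steps.
toy_logits = [[2.0, 0.0, 0.0], [0.0, 0.0, 3.0]]
print(sequence_log_likelihood(toy_logits, [0, 2]))
```

A higher (less negative) value means the model considers the sequence more likely, so the best of several samples is the one with the maximum score.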


In [20]:
model.load_state_dict(torch.load('pooh_2x512_30ep.model'))

<All keys matched successfully>

In [24]:
reconstruction_1, prob_1 = reconstruct_sentence(model, test_dataset, pooh_dataset,representation, temperature=1.0)


In [34]:
print('model_1 accuracy:', accuracy(test_dataset, reconstruction_1))

model_1 accuracy: 0.7886209975762215


In [26]:
model_modified.load_state_dict(torch.load('pooh_modified.model'))

<All keys matched successfully>

In [28]:
reconstruction_2, prob_2 = reconstruct_sentence(model_modified, test_dataset, pooh_dataset,representation, temperature=1.0)


In [35]:
print('model_modified accuracy:',accuracy(test_dataset, reconstruction_2))

model_modified accuracy: 0.7085087383594846


You can assume that only words from pooh_words.txt can occur in the reconstructed text. For decoding you have two options (choose one, or implement both and get a **+1** bonus point):

1. Sample reconstructed text several times (with quite a low temperature), choose the most likely result.
2. Perform beam search.

Of course in the sampling procedure you should consider only words matching the given consonants.
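
As an illustration of the second option, here is a minimal beam search sketch restricted to words that match the given consonants. It is not the notebook's implementation: `log_prob` is a hypothetical stand-in for the language model, and the toy bigram table is invented. With an LSTM, each hypothesis would also have to carry its own hidden state, which this simplified bigram scorer does not need.

```python
import math

def beam_search(consonant_keys, candidates, log_prob, beam_width=3):
    """Return the highest-scoring word sequence matching the consonant skeletons.

    consonant_keys: list of devowelized words.
    candidates: dict mapping a devowelized form to the words that produce it.
    log_prob(prev_word, word): hypothetical scorer standing in for the
        language model; prev_word is None at the sentence start.
    """
    beams = [(0.0, [])]  # (cumulative log-probability, words so far)
    for key in consonant_keys:
        # Fall back to the skeleton itself if no candidate is known.
        options = sorted(candidates.get(key, [])) or [key]
        expanded = []
        for score, words in beams:
            prev = words[-1] if words else None
            for w in options:
                expanded.append((score + log_prob(prev, w), words + [w]))
        # Keep only the beam_width best partial hypotheses.
        expanded.sort(key=lambda item: item[0], reverse=True)
        beams = expanded[:beam_width]
    return beams[0][1]

# Toy bigram probabilities, invented for illustration only.
TOY_BIGRAMS = {(None, "my"): 0.6, (None, "me"): 0.4,
               ("my", "god"): 0.7, ("my", "good"): 0.3,
               ("me", "god"): 0.1, ("me", "good"): 0.1}

def toy_log_prob(prev, word):
    return math.log(TOY_BIGRAMS.get((prev, word), 1e-6))

toy_candidates = {"m": {"my", "me"}, "gd": {"god", "good"}}
print(beam_search(["m", "gd"], toy_candidates, toy_log_prob))
```

With `beam_width=1` this reduces to greedy decoding; larger widths trade time for a better chance of finding the most likely sequence.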

Report the accuracy of your methods (for both language models). The accuracy should be computed by the following function, and it should be *greater than 0.25*.


```python
def accuracy(original_sequence, reconstructed_sequence):
    sa = original_sequence
    sb = reconstructed_sequence
    score = len([1 for (a,b) in zip(sa, sb) if a == b])
    return score / len(original_sequence)
```


## Task 2 (6 points)

This task is about text generation. You have to:

**A**. Create a text corpus containing texts with similar vocabulary (for instance books from the same genre, or written by the same author). The corpus should have approximately 1M words. You can consider using the following sources: Project Gutenberg (https://www.gutenberg.org/), Wolne Lektury (https://wolnelektury.pl/), parts of BookCorpus (https://github.com/soskek/bookcorpus), but generally feel free to use others. Texts can be in English, Polish, or any other language you know.

**B**. Choose the tokenization procedure. It should have two stages:

1. word tokenization (you can use nltk.tokenize.word_tokenize, the tokenizer from spaCy, pytorch, keras, ...). Test your tokenizer on your corpus and look at the set of tokens containing both letters and special characters. If, in your opinion, some of them should be treated as a sequence of tokens, modify the tokenization procedure accordingly

2. sub-word tokenization (you can either use an existing procedure, like wordpiece or sentencepiece, or create something by yourself). Here is a simple idea: take the 8K most popular words (W), the 1K most popular suffixes (S), and the 1K most popular prefixes (P). Words in W are their own tokens. A word x outside W should be tokenized as 'p_ _s', where p is the longest prefix of x in P, and s is the longest suffix of x in S
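
The simple W/P/S idea can be sketched as follows. This is only an illustration: it reads s as the longest suffix of x found in S (an assumption about the intended meaning), the function name `split_word` and the tiny sets are invented, and returning `None` for words with no matching prefix or suffix is an arbitrary choice that a real solution would have to replace.

```python
def split_word(word, vocab, prefixes, suffixes):
    """Tokenize one word following the W/P/S idea.

    Returns [word] if the word is in vocab, otherwise ['p_', '_s'] where p is
    the longest known proper prefix and s the longest known proper suffix.
    Returns None when no prefix or no suffix matches.
    """
    if word in vocab:
        return [word]
    # Scan from the longest proper prefix / suffix down to a single character.
    p = next((word[:i] for i in range(len(word) - 1, 0, -1) if word[:i] in prefixes), None)
    s = next((word[i:] for i in range(1, len(word)) if word[i:] in suffixes), None)
    if p is None or s is None:
        return None
    return [p + "_", "_" + s]

vocab = {"walk", "the"}
prefixes = {"wa", "walk", "un"}
suffixes = {"ing", "ed", "g"}
print(split_word("walking", vocab, prefixes, suffixes))  # ['walk_', '_ing']
```

Note that p and s need not cover the whole word, so this tokenization is lossy: a generated pair 'p_ _s' has to be mapped back to some corpus word that starts with p and ends with s.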

**C**. Write a text generation procedure. The procedure should fulfill the following requirements:

1. it should use the RNN language model (trained on sub-word tokens)
2. generated tokens should be presented as a text containing words (without extra spaces or other extra characters, such as begin-of-word markers introduced during tokenization)
3. all words in a generated text should belong to the corpus (note that this is not guaranteed by LSTM)
4. in generation Top-P sampling should be used (see NN-NLP.6, slide X)
5. in generated texts every token 3-gram should be unique
6. *(optionally, +1 point)* all token bigrams in generated texts occur in the corpus
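
Requirement 4 refers to Top-P (nucleus) sampling: keep the smallest set of most probable tokens whose total probability reaches the threshold, renormalize, and sample from that set. A minimal sketch in plain Python, with invented names and a made-up distribution; in the notebook the probabilities would come from a softmax over the model's logits.

```python
import random

def top_p_filter(probs, top_p):
    """Keep the smallest set of most-probable tokens whose mass reaches top_p,
    zero out the rest, and renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    filtered = [0.0] * len(probs)
    for i in kept:
        filtered[i] = probs[i] / mass
    return filtered

def sample(probs, rng=random):
    """Draw one index according to the given probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

probs = [0.5, 0.3, 0.15, 0.05]
nucleus = top_p_filter(probs, 0.9)
print(nucleus)          # the last token is cut off
print(sample(nucleus))  # index 0, 1, or 2
```

The token that crosses the threshold is kept, so the nucleus is never empty even when a single token already exceeds `top_p`.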

Of course, to fulfill these constraints you have to do rejection sampling, or beam search, or ... If you want to be more up-to-date, you can also use a transformer-like language model. In this case consider using nanoGPT (by A. Karpathy)
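
One way to read the rejection sampling suggestion for the unique 3-gram constraint is sketched below. It is an assumption-laden illustration rather than a reference solution: `next_token_probs` is a hypothetical stand-in for the language model, and `toy_model` is invented so that the example runs on its own.

```python
import random

def generate_without_repeated_trigrams(next_token_probs, prompt, n_new,
                                       max_tries=20, seed=0):
    """Extend prompt by up to n_new tokens so that no new token 3-gram repeats.

    next_token_probs(tokens) is a hypothetical model interface returning a
    dict {token: probability} for the next position.
    """
    rng = random.Random(seed)
    tokens = list(prompt)
    seen = {tuple(tokens[i:i + 3]) for i in range(len(tokens) - 2)}
    for _ in range(n_new):
        probs = next_token_probs(tokens)
        choices, weights = zip(*probs.items())
        for _ in range(max_tries):
            cand = rng.choices(choices, weights=weights, k=1)[0]
            trigram = tuple(tokens[-2:] + [cand])
            # Accept if there is no full trigram yet or it has not been seen.
            if len(trigram) < 3 or trigram not in seen:
                break
        else:
            return tokens  # give up: every draw repeated a trigram
        tokens.append(cand)
        if len(trigram) == 3:
            seen.add(trigram)
    return tokens

def toy_model(tokens):
    # Invented distribution that strongly prefers repeating "a".
    return {"a": 0.9, "b": 0.1}

out = generate_without_repeated_trigrams(toy_model, ["a", "a"], n_new=10)
trigrams = [tuple(out[i:i + 3]) for i in range(len(out) - 2)]
print(out, len(trigrams) == len(set(trigrams)))
```

Rejection wastes draws when most of the probability mass is forbidden; zeroing out the forbidden tokens before sampling achieves the same constraint without retries.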

In [112]:
import re
import nltk
import torch
import torch.nn.functional as F
import numpy as np
from nltk.util import ngrams
from nltk.tokenize import word_tokenize
from tokenizers import BertWordPieceTokenizer


In [60]:
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [69]:
! gdown https://drive.google.com/uc?id=1G5AAnKiT7H1uTwwCcCrZR6qvVShdFdKD
! ls

Downloading...
From: https://drive.google.com/uc?id=1G5AAnKiT7H1uTwwCcCrZR6qvVShdFdKD
To: /content/Pride and prejudice.txt
100% 756k/756k [00:00<00:00, 103MB/s]
 aa.txt			 pooh_train.txt		   'view?usp=drive_link'
 pooh_2x512_30ep.model	 pooh_words.txt		    word_vectors_wiki.txt
 pooh_modified.model	'Pride and prejudice.txt'
 pooh_test.txt		 sample_data


In [122]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

In [62]:
def preprocess(path):
    with open(path, 'r') as f:
        data = f.read()
    tokenized_data = word_tokenize(data)
    tokenized_data = [x.lower() for x in tokenized_data]
    tokenized_data = [re.sub('[^A-Za-z0-9]+', '', x) for x in tokenized_data]
    tokenized_data = [w for w in tokenized_data if len(w) > 0]
    return tokenized_data

def sub_word_tokenize(data):
    tokenizer = BertWordPieceTokenizer()
    tokenizer.train_from_iterator(data, vocab_size=8000)
    tokenized_data = tokenizer.encode(" ".join(data))

    return tokenizer, tokenized_data

In [63]:
class Dataset(torch.utils.data.Dataset):
    def __init__(self, data_ids, sequence_length, device):

        self.words_indexes = data_ids
        self.sequence_length = sequence_length
        self.device = device

    def __len__(self):
        return len(self.words_indexes) - self.sequence_length

    def __getitem__(self, index):
        return (
            torch.tensor(self.words_indexes[index:index+self.sequence_length], device=self.device),
            torch.tensor(self.words_indexes[index+1:index+self.sequence_length+1], device=self.device)
        )

In [55]:
class LSTMModel(nn.Module):
    def __init__(self, n_vocab, device):
        super(LSTMModel, self).__init__()
        self.lstm_size = 512
        self.embedding_dim = 100
        self.num_layers = 2
        self.device = device

        self.embedding = nn.Embedding(n_vocab, self.embedding_dim)
        self.lstm = nn.LSTM(self.embedding_dim, self.lstm_size, self.num_layers, dropout=0.2)
        self.fc = nn.Linear(self.lstm_size, n_vocab)

    def forward(self, x, prev_state):
        embed = self.embedding(x)
        output, state = self.lstm(embed, prev_state)
        logits = self.fc(output)
        return logits, state

    def init_state(self, sequence_length):
        num_directions = 2 if self.lstm.bidirectional else 1
        hidden = torch.zeros(self.num_layers * num_directions, sequence_length, self.lstm_size).to(self.device)
        cell = torch.zeros(self.num_layers * num_directions, sequence_length, self.lstm_size).to(self.device)
        return hidden, cell


In [77]:
data = preprocess('Pride and prejudice.txt')
tokenizer, tokenized_data = sub_word_tokenize(data)
dataset = Dataset(tokenized_data.ids, SEQUENCE_LENGTH, device)


In [78]:
model = LSTMModel(8000, device)
model.to(device)

LSTMModel(
  (embedding): Embedding(8000, 100)
  (lstm): LSTM(100, 512, num_layers=2, dropout=0.2)
  (fc): Linear(in_features=512, out_features=8000, bias=True)
)

In [79]:
train(dataset, model)

{'epoch': 0, 'batch': 264, 'loss': 8.067949295043945}
{'epoch': 1, 'batch': 264, 'loss': 7.1042256355285645}
{'epoch': 2, 'batch': 264, 'loss': 6.575026512145996}
{'epoch': 3, 'batch': 264, 'loss': 6.2271318435668945}
{'epoch': 4, 'batch': 264, 'loss': 5.597900867462158}
{'epoch': 5, 'batch': 264, 'loss': 5.038427829742432}
{'epoch': 6, 'batch': 264, 'loss': 4.614715099334717}
{'epoch': 7, 'batch': 264, 'loss': 4.110250949859619}
{'epoch': 8, 'batch': 264, 'loss': 3.6042678356170654}
{'epoch': 9, 'batch': 264, 'loss': 3.0871434211730957}
{'epoch': 10, 'batch': 264, 'loss': 2.697675943374634}
{'epoch': 11, 'batch': 264, 'loss': 2.281498670578003}
{'epoch': 12, 'batch': 264, 'loss': 1.972225546836853}
{'epoch': 13, 'batch': 264, 'loss': 1.7238037586212158}
{'epoch': 14, 'batch': 264, 'loss': 1.543869137763977}
{'epoch': 15, 'batch': 264, 'loss': 1.3699345588684082}
{'epoch': 16, 'batch': 264, 'loss': 1.2488223314285278}
{'epoch': 17, 'batch': 264, 'loss': 1.0449751615524292}
{'epoch': 18

In [80]:
torch.save(model.state_dict(), 'Pride_and_prejudice.model')

In [85]:
prefixes_sufixes = {}
sufixes_ids = []

for word in data:
    encoded_word = tokenizer.encode(word)
    if len(encoded_word.ids) > 1:
        prefixes_sufixes[encoded_word.ids[0]] = encoded_word.ids[1:]

for i in range(10000):
    token = tokenizer.id_to_token(i)
    if token is not None and (token.startswith("##") or token.endswith("##")):
        sufixes_ids.append(i)


In [98]:
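def top_p_sampling(p, top_p):
    # p is already a probability vector (NumPy array), so no second softmax is applied.
    sorted_probs, sorted_indices = torch.sort(torch.from_numpy(p), descending=True)
    cumulative_probs = torch.cumsum(sorted_probs, dim=-1)
    # Drop a token only if the mass before it already exceeds top_p,
    # so the token that crosses the threshold is kept.
    sorted_indices_to_remove = (cumulative_probs - sorted_probs) > top_p
    p[sorted_indices[sorted_indices_to_remove].numpy()] = 0.0
    p = p/p.sum()
    return p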

In [119]:
def generate_text(model, tokenizer, text, prefixes_sufixes, next_words=15):
    model.eval()

    words = tokenizer.encode(text).ids
    state_h, state_c = model.init_state(len(words))

    continuation = []
    existing_3_grams = set(ngrams(words, 3))
    for i in range(next_words):
        x = torch.tensor(words[i:], device=device)
        x = x.unsqueeze(0)
        x = x.to(device)

        y_pred, (state_h, state_c) = model(x, (state_h, state_c))
        next_from_continuation = -1
        if tokenizer.decode([words[-1]]) not in data and len(continuation) > 0:
            words.append(continuation.pop())
        else:
            last_word_logits = y_pred[0][-1]
            p = F.softmax(last_word_logits, dim=0).detach().cpu().numpy()

            to_exclude = set(sufixes_ids)
            if len(continuation) > 0:
                next_from_continuation = continuation.pop(0)
                to_exclude = to_exclude - {next_from_continuation}

            to_exclude = list(to_exclude)
            current_2gram = words[-2:]  # the two most recent tokens
            for ngram in existing_3_grams:
                if ngram[0] == current_2gram[0] and ngram[1] == current_2gram[1]:
                    to_exclude.append(ngram[2])

            p[np.isin(np.arange(len(p)), to_exclude)] = 0
            p = p / p.sum()
            p = top_p_sampling(p, 0.9)
            word_index = np.random.choice(len(last_word_logits), p=p)
            words.append(word_index)
            if not word_index == next_from_continuation:
                continuation = []
            if word_index in prefixes_sufixes:
                # Copy the list so that pop() does not mutate the lookup table.
                continuation = list(prefixes_sufixes[word_index])
        existing_3_grams.add(tuple(words[-3:]))  # record the newest 3-gram
    if tokenizer.decode([words[-1]]) not in data and len(continuation) > 0:
        words.extend(continuation)
    return words, tokenizer.decode(words)


In [120]:
prompt = "it is not easy to say"
words, text = generate_text(model, tokenizer, prompt,prefixes_sufixes, 25)
print(text)

it is not easy to say 1e1 repent stated quadrille amendment seeming online revolution commanded studied submit tempt parsonagehouse fill hurt speedily despising lying exposing amaz solicit amaz amaz


In [121]:
prompt = "Oh, Jane, had we been less secret"
words, text = generate_text(model, tokenizer, prompt,prefixes_sufixes, 25)
print(text)

oh jane had we been less secret deranged cool understand me both will you come whether you ought to be in having so much no more rich sir than wishes now


## Task 3 (4 or 6 p)

In this task you have to create a network which looks at the characters of a word and tries to guess whether the word is a noun, a verb, an adjective, and so on. To be more precise: the input is a word (without context), the output is a POS-tag (Part-of-Speech). Since some words are ambiguous, and we have no context, our network is supposed to return the set of possible tags.

The data is taken from the Universal Dependencies English corpus, and of course it contains errors, especially because not all possible tags occurred in the data.

Train a network (4p) or two networks (+2p) solving this task. Both networks should look at character n-grams occurring in the word. There are two options:

* **Fixed size:** for instance take 2-, 3-, and 4-character suffixes of the word, use them as features (with 1-hot encoding). You can also combine prefix and suffix features. Simple, useful trick: when looking at suffixes, add some '_' characters at the beginning of the word to guarantee that shorter words have suffixes of a desired length.

* **Variable size:** take for instance 4-grams (or 4-grams and 3-grams), use a Deep Averaging Network. Simple trick: add an extra character at the beginning and at the end of the word, to add the information that an n-gram occurs at a special position ('ed' at the end has a slightly different meaning than 'ed' in the middle)
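
The two feature extraction tricks above can be sketched in a few lines. The function names and the boundary markers `^` and `$` are choices made for this illustration, not part of the assignment; turning the resulting strings into 1-hot vectors or embedding indices is left out.

```python
def suffix_features(word, sizes=(2, 3, 4), pad="_"):
    """Fixed-size features: the last k characters for each k, left-padded with '_'."""
    padded = pad * max(sizes) + word
    return [padded[-k:] for k in sizes]

def char_ngrams(word, n=4, start="^", end="$"):
    """Variable-size features: all character n-grams of the word with boundary markers."""
    marked = start + word + end
    if len(marked) < n:
        return [marked]
    return [marked[i:i + n] for i in range(len(marked) - n + 1)]

print(suffix_features("be"))   # ['be', '_be', '__be']
print(char_ngrams("walked"))   # ['^wal', 'walk', 'alke', 'lked', 'ked$']
```

The fixed-size variant always yields the same number of features per word, while the variable-size variant yields a list whose length depends on the word, which is why an averaging step is needed before the classifier.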


## Task 4 (5p)

Apply a seq2seq model (you can modify the code from this tutorial: https://pytorch.org/tutorials/intermediate/seq2seq_translation_tutorial.html) to perform grapheme-to-phoneme conversion for English. Train the model on dev_cmu_dict.txt and test it on test_cmu_dict.txt. Report the accuracy of your solution using two metrics:
* exact match (how many words are perfectly converted to phonemes)
* exact match without stress (how many words are perfectly converted to phonemes when we remove the information about stress)
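
Both metrics can be computed with one helper. The sketch below assumes that each pronunciation is a list of ARPAbet symbols in which stress is marked by a trailing digit (as in the CMU Pronouncing Dictionary); the exact layout of the provided files is not shown here, so the parsing step is omitted and the example data is invented.

```python
import re

def strip_stress(phonemes):
    """Remove stress digits from ARPAbet symbols, e.g. 'AH0' -> 'AH'."""
    return [re.sub(r"\d", "", p) for p in phonemes]

def exact_match(references, predictions, ignore_stress=False):
    """Fraction of words whose predicted phoneme sequence equals the reference."""
    correct = 0
    for ref, pred in zip(references, predictions):
        if ignore_stress:
            ref, pred = strip_stress(ref), strip_stress(pred)
        correct += ref == pred
    return correct / len(references)

refs = [["HH", "AH0", "L", "OW1"], ["K", "AE1", "T"]]
preds = [["HH", "AH1", "L", "OW0"], ["K", "AE1", "T"]]
print(exact_match(refs, preds))                      # 0.5
print(exact_match(refs, preds, ignore_stress=True))  # 1.0
```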
