<h1> Practical 3 - Language Modeling </h1>

After working with word embeddings and text classification in the previous notebooks, we now turn to language modeling. <br>

In the last practical we treated text classification as a problem that can be solved with deep learning. Today we will get to know a different application of RNNs, namely language modeling. <br>

Word prediction is a Natural Language Processing (NLP) application concerned with predicting the next word given the preceding text. Auto-complete and suggested responses are popular examples of word prediction. The first step towards word prediction is choosing a language model. <br>

There are generally two kinds of models you can use to build a next-word predictor:
<br> 1) statistical N-gram models or
<br> 2) neural models
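
To make the first option concrete, here is a minimal sketch of a statistical bigram (2-gram) next-word predictor. It is only an illustration under simple assumptions: the toy corpus, the whitespace tokenization and the helper names `train_bigram_counts` and `predict_next` are made up for this example and are not used elsewhere in the notebook.

```python
from collections import Counter, defaultdict


def train_bigram_counts(tokens):
    """Count how often each word follows each other word."""
    counts = defaultdict(Counter)
    for previous, current in zip(tokens, tokens[1:]):
        counts[previous][current] += 1
    return counts


def predict_next(counts, word):
    """Return the most frequent follower of `word`, or None if the word was never seen."""
    if word not in counts:
        return None
    return counts[word].most_common(1)[0][0]


# Toy corpus (hypothetical, only for illustration)
corpus = "you will lose weight if you eat less and you will gain weight if you will move more"
tokens = corpus.split()

bigram_counts = train_bigram_counts(tokens)
print(predict_next(bigram_counts, "you"))     # most frequent word after "you"
print(predict_next(bigram_counts, "pizza"))   # unseen word, no prediction possible
```

Such a model only looks at the single preceding word and cannot say anything about words it has never seen, which is one motivation for the neural models used in the rest of this notebook.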

**0. Task (0 points)**  </br>
As usual, before we dive into the tasks, here are a couple of imports we will use later.

In [None]:
!pip install boltons -q

In [None]:
import string
from pathlib import Path
from textwrap import wrap


import numpy as np
import pandas as pd
from boltons.iterutils import windowed
#from tqdm import tqdm_notebook
#from tqdm import tqdm
from tqdm.notebook import tqdm

from nltk.util import ngrams

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch import optim
from torch.utils.data import Dataset, DataLoader
from torch.utils.data.dataset import random_split
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

from google_drive_downloader import GoogleDriveDownloader as gdd

from nltk.tokenize import sent_tokenize, word_tokenize
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [None]:
device_word = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
device_char = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
device_word

device(type='cuda')

<h1>Word and Text Generation</h1>

In this notebook we will do two things:
1.   Build an RNN that learns how English characters are combined.
2.   Build an RNN that learns how English words are combined.

To do so, we will go through the following steps:
1.   Load the data
2.   Preprocess the data for character-level generation
3.   Preprocess the data for word-level generation
4.   Build an RNN
5.   Apply the RNN to the data from Step 2
6.   Apply the RNN to the data from Step 3




<h2>1. Load the Data</h2>

Our dataset consists of multiple texts about weight loss.

In [None]:
#We can find the articles here:
DATA_PATH = 'data/weight_loss/articles.jsonl'
if not Path(DATA_PATH).is_file():
    gdd.download_file_from_google_drive(
        file_id='1mafPreWzE-FyLI0K-MUsXPcnUI0epIcI',
        dest_path='data/weight_loss/weight_loss_articles.zip',
        unzip=True,
    )


In [None]:
# let's print out the first article
print(pd.read_json(DATA_PATH).text.str.lower().tolist()[0])

weight gaining is a common problem around the world. in developed country, it is the most common problem. in this article, i am not going to show you some advance and magical technique which will make you slim overnight. i am going to show you tips on the basis of real facts which works. in this article, i will give you how to tips, which will help you to lose weight. are you ready?
calories requirement
first thing you need to understand is why you gain weight. why? whenever you eat or drink something, you will get some calories. when you think about weight, everything revolves around calories.
whatever you do, will burn some calories no matter how small work it is or just a movement of your body. your body burns thousands of calories in one day.
if you are getting more calories than needed, you will gain weight. if you are getting fewer calories than needed, you will lose weight. so for losing weight, you need to know how much calorie your body required.
find require calories for your

<h2>2. Preprocessing the data for sequence generation</h2>

As you can see in the cell above, the raw data is tedious to work with. The next few steps bring it into a form that is easier to access, both for you and for the network.

In [None]:
def remove_unprintable_chars(all_chars_windowed):
  not_printbl_chars=[]
  filtered_chars=[]
  printbl=True
  for sequence in tqdm(all_chars_windowed):
    printbl=True
    for char in sequence:
      if not(char in string.printable):
        printbl=False
        not_printbl_chars+=[char]
    if printbl==True:
      filtered_chars+=[sequence]
  return filtered_chars 

  

In [None]:
def textlist_generator(path):
  return pd.read_json(path).text.str.lower().tolist()

def load_data_char(path, sequence_length=125):

    # Generate a list of texts from the dataset
    texts = textlist_generator(path)
    #print(texts[0])


    chars_windowed = [list(windowed(text, sequence_length)) for text in texts]
    #print(chars_windowed[:2])
    
    
    all_chars_windowed = [sublst for lst in chars_windowed for sublst in lst]
    #print(all_chars_windowed[:2])


    filtered_chars = remove_unprintable_chars(all_chars_windowed)
    #print(filtered_chars[:2])
    return filtered_chars


def set_of_chars_in(sequences):
    return {sublst for lst in sequences for sublst in lst}


def create_char2idx(sequences):
    set_of_chars = set_of_chars_in(sequences)
    return {char: idx for idx, char in enumerate(sorted(set_of_chars))}


def encode_sequence(sequence, char2idx):
    return [char2idx[char] for char in sequence]


def encode_sequences(sequences, char2idx):
    return np.array([
        encode_sequence(sequence, char2idx) 
        for sequence in tqdm(sequences)
    ])


class Sequences(Dataset):
    def __init__(self, path, sequence_length=125):
        self.sequences = load_data_char(path, sequence_length=sequence_length)
        self.vocab_size = len(set_of_chars_in(self.sequences))
        self.char2idx = create_char2idx(self.sequences)
        self.idx2char = {idx: char for char, idx in self.char2idx.items()}
        self.encoded = encode_sequences(self.sequences, self.char2idx)
        
    def __getitem__(self, i):
        return self.encoded[i, :-1], self.encoded[i, 1:]
    
    def __len__(self):
        return len(self.encoded)
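
The following sketch illustrates, on a toy string, the idea behind the sliding windows and the input/target pairs returned by `__getitem__`. It assumes a pure-Python stand-in, `sliding_windows`, for `boltons.iterutils.windowed`; the helper name and the example string are hypothetical and only serve this illustration.

```python
def sliding_windows(sequence, size):
    """Return all contiguous windows of length `size` (stand-in for boltons' `windowed`)."""
    return [tuple(sequence[i:i + size]) for i in range(len(sequence) - size + 1)]


text = "weight"
windows = sliding_windows(text, 4)
print(windows)

# Each window is split into an input and a target that is shifted by one position,
# so the model learns to predict the character that follows each input character.
for window in windows:
    inputs, targets = window[:-1], window[1:]
    print(inputs, "->", targets)
```

Because the target is simply the input shifted by one position, no manual labeling is needed: the text itself provides the supervision.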

Tasks 2.1 to 2.3 will help you understand the code better.

**2.1** <br>
(1 Point)
Describe the variable `chars_windowed`. What does it contain? 

Your Answer Here

**2.2** <br>
(1 Point)
Describe the variable `all_chars_windowed`. What does it contain?

Your Answer Here

**2.3** <br>
(1 Point)
Briefly explain what the function `remove_unprintable_chars(all_chars_windowed)` does. We do not want a step-by-step explanation; just describe the general idea.

Your Answer Here

Now let's load our character-level dataset.

In [None]:
sequence_length=int(input("choose your sequence_length (for this task, we chose 128): "))
if sequence_length<=1:
  print("1 or less is not a valid sequence length. Your model will not learn anything from just one character at a time. A sequence length of 128 has been chosen for you.")
  sequence_length=128

choose your sequence_length (for this task, we chose 128): 125


In [None]:
dataset_char = Sequences(DATA_PATH, sequence_length=sequence_length)
len(dataset_char)
train_loader_char = DataLoader(dataset_char, batch_size=4096)





<h2>3. Preprocessing the data for word-level generation</h2>

We now have to do the same preprocessing steps for our word-level model. But don't worry, it works much like the character-level preprocessing. The following tasks will guide you.


**3.1 Tokenize**<br>
(3 Points)
Complete the function `tokenize`, which receives multiple texts and returns a list containing, for each text, a list of its word-level tokens.

In [None]:
def tokenize(texts):   
    #your code here
    return texts_tokens

**3.2 Unprintable Sequences**<br>
(4 Points)
Do you remember task 2.3? Apply the same functionality, but keep in mind that we are now working with words, not characters.
Adapt the function from task 2.3 so that it works with words. We still want to drop every sequence that contains unprintable characters. Write your solution in the function `remove_unprintable_sequences2`.

In [None]:
def remove_unprintable_sequences2(all_words_windowed):
  filtered_words = []
  #Your code here

  return filtered_words

We now combine all the functions you wrote in the following function to load our data.

In [None]:
def load_data_word(path, sequence_length=5):

    #Generate a list of texts from the dataset
    texts = textlist_generator(path)


    texts=tokenize(texts)

    words_windowed = [list(windowed(text, sequence_length)) for text in texts]


    all_words_windowed = [sublst for lst in words_windowed for sublst in lst]

    filtered_words = remove_unprintable_sequences2(all_words_windowed)

    return filtered_words

**** 

**3.3** <br>
(1 Point)
Write a function that returns the set of all words in the given sequences.

In [None]:
def set_of_words_in(sequences):
    #Your Code here
    return set_of_words

**3.4** <br>
(1 Point)
Write a function that returns a dictionary mapping each word identified in task 3.3 to a unique index.

In [None]:
def create_word2idx(sequences):
    set_of_words = set_of_words_in(sequences)
    return {
        # YOUR CODE HERE
    }

**3.5** <br>
(1 Point)
Create a function `encode_sequence` that maps each word in the list `sequence` to its index in `word2idx`.

In [None]:
def encode_sequence(sequence, word2idx):
    return [
        # YOUR CODE HERE
    ]

**3.6** <br>
(1 Point)
Complete the function `encode_sequences` so that it returns a NumPy array containing the encoded version of every sequence.

In [1]:
def encode_sequences(sequences, word2idx):
    return np.array([
        #YOUR CODE HERE
    ])

In the next code snippet we use all the functions you defined above. (You don't have to do anything here, just run the cell.)

In [None]:
class Sequences(Dataset):
    def __init__(self, path, sequence_length=30):
        self.sequences = load_data_word(path, sequence_length=sequence_length)
        self.vocab_size = len(set_of_words_in(self.sequences))
        self.word2idx = create_word2idx(self.sequences)
        self.idx2word = {idx: word for word, idx in self.word2idx.items()}
        self.encoded = encode_sequences(self.sequences, self.word2idx)
        
    def __getitem__(self, i):
        return self.encoded[i, :-1], self.encoded[i, 1:]
    
    def __len__(self):
        return len(self.encoded)

In [None]:
sequence_length=int(input("choose your sequence_length (for this task, we chose 10): "))
if sequence_length<=1:
  print("1 or less is not a valid sequence length. Your model will not learn anything from just one word at a time. A sequence length of 10 has been chosen for you.")
  sequence_length=10

choose your sequence_length (for this task, we chose 10): 10


In [None]:
dataset_word = Sequences(DATA_PATH, sequence_length)
len(dataset_word)
train_loader_word = DataLoader(dataset_word, batch_size=4096)





<h2>4. char-RNN: Character-level text generation</h2>

For an idea of how this can be done, read this [blog post](http://karpathy.github.io/2015/05/21/rnn-effectiveness/). It uses an LSTM to generate new text at the character level. To keep things simpler, in this notebook we build an RNN based on GRU cells. A GRU works similarly to an LSTM but has fewer gates and is easier to handle. For a full comparison of the two, have a look at [this tutorial](http://www.wildml.com/2015/10/recurrent-neural-network-tutorial-part-4-implementing-a-grulstm-rnn-with-python-and-theano/).
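
To give a feeling for what a GRU cell computes, here is a minimal sketch of a single GRU update for a scalar input and a scalar hidden state, written with the standard library only. It is a simplification for illustration: the weights in `weights` are made-up numbers, each gate uses a single bias, and the helper names `sigmoid` and `gru_step` are hypothetical. `nn.GRU` performs the same kind of update with weight matrices instead of scalars.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gru_step(x, h, w):
    """One GRU update for a scalar input `x` and a scalar hidden state `h`."""
    r = sigmoid(w["w_ir"] * x + w["w_hr"] * h + w["b_r"])          # reset gate
    z = sigmoid(w["w_iz"] * x + w["w_hz"] * h + w["b_z"])          # update gate
    n = math.tanh(w["w_in"] * x + r * (w["w_hn"] * h) + w["b_n"])  # candidate state
    return (1.0 - z) * n + z * h                                   # blend candidate and old state


# Made-up weights, only for illustration
weights = {"w_ir": 0.5, "w_hr": -0.3, "b_r": 0.0,
           "w_iz": 0.8, "w_hz": 0.2, "b_z": 0.0,
           "w_in": 1.0, "w_hn": 0.7, "b_n": 0.0}

h = 0.0
for x in [1.0, -0.5, 0.25]:
    h = gru_step(x, h, weights)
    print(round(h, 4))
```

The update gate `z` decides how much of the old hidden state is kept; when it is close to 1, the hidden state passes through almost unchanged, which helps the network remember information over many time steps.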

In [None]:
class RNN(nn.Module):
    def __init__(
        self,
        vocab_size,
        embedding_dimension=100,
        hidden_size=128, 
        n_layers=1,
        device='cpu',
    ):
        
        super(RNN, self).__init__()
        self.n_layers = n_layers
        self.hidden_size = hidden_size
        self.device = device
        
        #Our Building-Blocks:
        self.encoder = nn.Embedding(vocab_size, embedding_dimension)
        self.rnn = nn.GRU(
            embedding_dimension,
            hidden_size,
            num_layers=n_layers,
            batch_first=True,
        )
        self.decoder = nn.Linear(hidden_size, vocab_size)
        
    def init_hidden(self, batch_size):
        #we initialize a random hidden state
        return torch.randn(self.n_layers, batch_size, self.hidden_size).to(self.device)
    
    def forward(self, input_, hidden):
        #we feed the input through our RNN
        encoded = self.encoder(input_)
        output, hidden = self.rnn(encoded.unsqueeze(1), hidden)
        output = self.decoder(output.squeeze(1))
        return output, hidden

Let's initialize the model with the character-level data.

In [None]:
model_char = RNN(vocab_size=dataset_char.vocab_size, device=device_char).to(device_char)

criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(
    filter(lambda p: p.requires_grad, model_char.parameters()),
    lr=0.001,
)

In [None]:
print(model_char)
print()
print('Trainable parameters:')
print('\n'.join([' * ' + x[0] for x in model_char.named_parameters() if x[1].requires_grad]))

RNN(
  (encoder): Embedding(66, 100)
  (rnn): GRU(100, 128, batch_first=True)
  (decoder): Linear(in_features=128, out_features=66, bias=True)
)

Trainable parameters:
 * encoder.weight
 * rnn.weight_ih_l0
 * rnn.weight_hh_l0
 * rnn.bias_ih_l0
 * rnn.bias_hh_l0
 * decoder.weight
 * decoder.bias


![](images/char_rnn_diagram.png)

In [None]:
model_char.train()
train_losses = []
for epoch in range(30):
    progress_bar = tqdm(train_loader_char, leave=False)
    losses = []
    total = 0
    for inputs, targets in progress_bar:
        batch_size = inputs.size(0)
        hidden = model_char.init_hidden(batch_size)

        model_char.zero_grad()
        
        loss = 0
        for char_idx in range(inputs.size(1)):
            output, hidden = model_char(inputs[:, char_idx].to(device_char), hidden)
            loss += criterion(output, targets[:, char_idx].to(device_char))

        loss.backward()

        optimizer.step()
        
        avg_loss = loss.item() / inputs.size(1)
        
        progress_bar.set_description(f'Loss: {avg_loss:.3f}')
        
        losses.append(avg_loss)
        total += 1
    
    epoch_loss = sum(losses) / total
    train_losses.append(epoch_loss)
        
    tqdm.write(f'Epoch #{epoch + 1}\tTrain Loss: {epoch_loss:.3f}')


#Again, this will take a while. Go for a walk in the sunshine or something else nice :D

Epoch #1	Train Loss: 2.265
Epoch #2	Train Loss: 1.754
Epoch #3	Train Loss: 1.599
Epoch #4	Train Loss: 1.515
Epoch #5	Train Loss: 1.461
Epoch #6	Train Loss: 1.424
Epoch #7	Train Loss: 1.397
Epoch #8	Train Loss: 1.377
Epoch #9	Train Loss: 1.360
Epoch #10	Train Loss: 1.347
Epoch #11	Train Loss: 1.336
Epoch #12	Train Loss: 1.326
Epoch #13	Train Loss: 1.318
Epoch #14	Train Loss: 1.310
Epoch #15	Train Loss: 1.304
Epoch #16	Train Loss: 1.298
Epoch #17	Train Loss: 1.293
Epoch #18	Train Loss: 1.288
Epoch #19	Train Loss: 1.284
Epoch #20	Train Loss: 1.280
Epoch #21	Train Loss: 1.276
Epoch #22	Train Loss: 1.272
Epoch #23	Train Loss: 1.269
Epoch #24	Train Loss: 1.266
Epoch #25	Train Loss: 1.263
Epoch #26	Train Loss: 1.261
Epoch #27	Train Loss: 1.258
Epoch #28	Train Loss: 1.256
Epoch #29	Train Loss: 1.254
Epoch #30	Train Loss: 1.252
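
One way to interpret these numbers: `nn.CrossEntropyLoss` reports the average negative log-likelihood in nats, so exponentiating it gives the per-character perplexity, which is roughly the number of characters the model is still undecided between at each step. The short sketch below applies this to the first and last epoch losses from the log above; the variable names are chosen only for this illustration.

```python
import math

# Average per-character training losses taken from the log above
loss_first_epoch = 2.265
loss_last_epoch = 1.252

# Perplexity is the exponential of the cross-entropy (measured in nats)
perplexity_first = math.exp(loss_first_epoch)
perplexity_last = math.exp(loss_last_epoch)

print(f"Perplexity after epoch 1:  {perplexity_first:.2f}")
print(f"Perplexity after epoch 30: {perplexity_last:.2f}")
```

For comparison, a model that guessed uniformly among the 66 characters of the vocabulary would have a perplexity of 66.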


Now let's test this:

In [None]:
def pretty_print(text):
    """Wrap text for nice printing."""
    to_print = ''
    for paragraph in text.split('\n'):
        to_print += '\n'.join(wrap(paragraph))
        to_print += '\n'
    print(to_print)

# the temperature controls how closely sampling follows the model's predicted distribution:
# lower values give more conservative text, higher values give more random text
temperature = 1.0

model_char.eval()
seed = '\n'
text = ''
with torch.no_grad():
    batch_size = 1
    hidden = model_char.init_hidden(batch_size)
    last_char = dataset_char.char2idx[seed]
    for _ in range(1000):
        output, hidden = model_char(torch.LongTensor([last_char]).to(device_char), hidden)
        
        #find the next char
        distribution = output.squeeze().div(temperature).exp()
        guess = torch.multinomial(distribution, 1).item()
        
        #the next char is the new last_char
        last_char = guess

        #append char to text.
        text += dataset_char.idx2char[guess]
        
pretty_print(text)

off your high speed decide to set a followed up the dayful frim, but
it does. onhed reaching diet so as maximulaging, a body adit on the
diets are a lot pills. i have me to lose weight. (body will two think
type of scare self restonal health. as longinot system on keep your
maring 20-6% even cas highing a tried granced strength or nutrition
quick plant to re-leed to a gard", yourselves? by the hormones makes,
pretty week. yet some diets don't you will thighs make you. this is
cravola more more undertime could not get to digure the fats, you
should you hands the day of fees a prrocest and hand anywhelicard the
exactly..xeence a pain. oncer, routine. but have soontit to lose
weight that maybe are, it will be renecy nument that you will make
from when you keep mention if you products:
sweet why most of the weekend. weight so the leth as a whole ben 100
miles on much amount of dangly or in least breakfast on uson day"
havorited to the fat busidly protein slays right individe up sorcises,
w

Even though these are not proper sentences, the words already sound as if they came straight from your fitness coach. Maybe our word-level model can do better.
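
The `temperature` used in the generation loop above rescales the model's output scores before sampling. The sketch below reproduces that step with the standard library only, on made-up scores (`logits` and the helper `sampling_probabilities` are hypothetical): a low temperature concentrates the probability mass on the most likely character, while a high temperature flattens the distribution.

```python
import math


def sampling_probabilities(logits, temperature):
    """Turn raw scores into sampling probabilities: exp(logit / temperature), normalized."""
    weights = [math.exp(logit / temperature) for logit in logits]
    total = sum(weights)
    return [weight / total for weight in weights]


logits = [2.0, 1.0, 0.1]  # hypothetical scores for three characters

for temperature in (0.5, 1.0, 2.0):
    probabilities = sampling_probabilities(logits, temperature)
    print(temperature, [round(p, 3) for p in probabilities])
```

Note that `torch.multinomial` accepts unnormalized weights, which is why the notebook code does not divide by the sum explicitly.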

<h2>5. Word-RNN: Word-level text generation</h2>

The dataset was originally prepared for character-level generation, so it is probably not ideal for a word-level generator. As a proof of concept, we will still show you how it works. The results are still quite good.

In [None]:
model_word = RNN(vocab_size=dataset_word.vocab_size, device=device_word).to(device_word)

criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(
    filter(lambda p: p.requires_grad, model_word.parameters()),
    lr=0.001,
)

In [None]:
print(model_word)
print()
print('Trainable parameters:')
print('\n'.join([' * ' + x[0] for x in model_word.named_parameters() if x[1].requires_grad]))

RNN(
  (encoder): Embedding(10595, 100)
  (rnn): GRU(100, 128, batch_first=True)
  (decoder): Linear(in_features=128, out_features=10595, bias=True)
)

Trainable parameters:
 * encoder.weight
 * rnn.weight_ih_l0
 * rnn.weight_hh_l0
 * rnn.bias_ih_l0
 * rnn.bias_hh_l0
 * decoder.weight
 * decoder.bias
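
The printed layer shapes make it possible to work out by hand how much the larger vocabulary inflates the model. The sketch below assumes the defaults `embedding_dimension=100` and `hidden_size=128` and a single GRU layer, as in the `RNN` class above; the helper `count_parameters` is introduced only for this illustration.

```python
def count_parameters(vocab_size, embedding_dimension=100, hidden_size=128):
    """Count trainable parameters of the Embedding -> GRU (1 layer) -> Linear model."""
    embedding = vocab_size * embedding_dimension
    # A GRU layer has three weight blocks (reset gate, update gate, candidate state),
    # each with input weights, hidden weights and two bias vectors.
    gru = 3 * hidden_size * (embedding_dimension + hidden_size) + 2 * 3 * hidden_size
    decoder = hidden_size * vocab_size + vocab_size
    return {
        "embedding": embedding,
        "gru": gru,
        "decoder": decoder,
        "total": embedding + gru + decoder,
    }


char_model = count_parameters(vocab_size=66)
word_model = count_parameters(vocab_size=10595)

print("char-level:", char_model)
print("word-level:", word_model)
```

The GRU itself has the same size in both models; the difference comes entirely from the embedding and the decoder, whose sizes grow linearly with the vocabulary.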


![](images/char_rnn_diagram.png)

**5.1**
<br>
(4 Points)
Now that you have seen how our character-level model is trained, it is time to do the same for our word-level model.
It is your turn to implement the loss computation. Use the code from the character-level training loop as a guide.

In [None]:

model_word.train()
train_losses = []
for epoch in range(50):
    progress_bar = tqdm(train_loader_word, leave=False)
    losses = []
    total = 0
    for inputs, targets in progress_bar:
        batch_size = inputs.size(0)
        hidden = model_word.init_hidden(batch_size)

        model_word.zero_grad()
        
        loss = 0
        #Your Code here

        loss.backward()

        optimizer.step()
        
        avg_loss = loss.item() / inputs.size(1)
        
        progress_bar.set_description(f'Loss: {avg_loss:.3f}')
        
        losses.append(avg_loss)
        total += 1
    
    epoch_loss = sum(losses) / total
    train_losses.append(epoch_loss)
        
    tqdm.write(f'Epoch #{epoch + 1}\tTrain Loss: {epoch_loss:.3f}')


Grand finale: now we want you to test your model. Try it out and see whether it works.

In [None]:
def pretty_print(text):
    """Wrap text for nice printing."""
    to_print = ''
    for paragraph in text.split('\n'):
        to_print += '\n'.join(wrap(paragraph))
        to_print += '\n'
    print(to_print)



def generate(keywords, model_word):
  keywords = keywords.lower()
  text_tokens = []
  # for every sentence...
  texts_sent = sent_tokenize(keywords)
  for sent in texts_sent:
    # ...we separate the sentence into words...
    sent = word_tokenize(sent)
    # ...and add these words to this list
    for token in sent:
      text_tokens += [token + " "]
  # our seed is only the last word of your input
  seed = text_tokens[-1]

  # check whether the seed word is part of the training data at all;
  # without a known seed we cannot start generating, so we stop early
  if seed not in dataset_word.word2idx:
    print("The word", seed, "is not part of the learned vocabulary and therefore cannot be used as a starting point for the new text.")
    return ""

  temperature = 1.0

  model_word.eval()
  text = ""
  with torch.no_grad():
      batch_size = 1
      hidden = model_word.init_hidden(batch_size)
      last_word = dataset_word.word2idx[seed]
      for _ in range(100):
          output, hidden = model_word(torch.LongTensor([last_word]).to(device_word), hidden)

          # sample the next word from the temperature-scaled distribution
          distribution = output.squeeze().div(temperature).exp()
          guess = torch.multinomial(distribution, 1).item()

          # the sampled word becomes the input for the next step
          last_word = guess
          text += dataset_word.idx2word[guess]
  return text


keywords=input("Start your text about fitness with a few words: ")

text=generate(keywords, model_word)
pretty_print(text)

Start your text about fitness with a few words: you
will not conquer which are actually straight to satisfy the creation
of navarra products should also celebrations is made from being bad
reading to the gym with other snack thin as cinnamon men and those
types of what you have put as much this procedure though you should be
easy , or less they want to grow and fats are knowledgeable of the
brain that more importantly - if they do n't have the faith of weight
loss becomes fewer calories is . it is a very careful ! el331014
mistaken ( the metabolism may be sabotaging an stricter affected



Now you can compare the results of your character-level and word-level models. This is of course not graded, but might be interesting for you.

Congratulations! You are done now.