## Prepare the Environment

In [None]:
!pip install torch_nightly -f https://download.pytorch.org/whl/nightly/cpu/torch_nightly.html

Looking in links: https://download.pytorch.org/whl/nightly/cpu/torch_nightly.html


In [None]:
!git clone https://github.com/ceshine/examples.git pytorch_examples

fatal: destination path 'pytorch_examples' already exists and is not an empty directory.


In [None]:
%cd pytorch_examples/word_language_model
%ls

/content/pytorch_examples/word_language_model
data/        lm_model.pt   model.py      requirements.txt
data.py      main.py       __pycache__/  train_new.log
generate.py  model_new.pt  README.md


Upload the trained model (from notebook 01_Training.ipynb):

In [None]:
from google.colab import files

uploaded = files.upload()

for fn in uploaded.keys():
  print('User uploaded file "{name}" with length {length} bytes'.format(
      name=fn, length=len(uploaded[fn])))

The above did not work for me (I repeatedly failed to download the entire file), so I copy the model with gsutil instead:

In [None]:
from google.colab import auth
auth.authenticate_user()

# https://cloud.google.com/resource-manager/docs/creating-managing-projects
project_id = 'personal-project-196600'
!gcloud config set project {project_id}

Updated property [core/project].


In [None]:
!gsutil cp  gs://ceshine-colab-tmp/lm_model.pt lm_model.pt

Copying gs://ceshine-colab-tmp/lm_model.pt...
\ [1 files][108.5 MiB/108.5 MiB]
Operation completed over 1 objects/108.5 MiB.


Import libraries, functions and classes:

In [None]:
import torch
import numpy as np
import pandas as pd

from model import RNNModel
from data import Dictionary, Corpus

## Prepare Dictionary

In [None]:
DATA_PATH = "./data/wikitext-2"
corpus = Corpus(DATA_PATH)

print("Number of tokens:")
print("Train: ", len(corpus.train))
print("Valid: ", len(corpus.valid))
print("Test:  ", len(corpus.test))

print("Vocabulary size:", len(corpus.dictionary.idx2word))

Number of tokens:
Train:  2075677
Valid:  216347
Test:   244102
Vocabulary size: 33278


## Load Model

In [None]:
DEVICE = torch.device("cpu")
# model = model.RNNModel(
#     "LSTM", len(corpus.dictionary), 650,
#     650, 2, 0.5, True
# ).to(DEVICE)

In [None]:
with open("lm_model.pt", 'rb') as f:
    model = torch.load(f, map_location='cpu')
model = model.to(DEVICE)

In [None]:
model.eval()

RNNModel(
  (drop): Dropout(p=0.5)
  (encoder): Embedding(33278, 650)
  (rnn): LSTM(650, 650, num_layers=2, dropout=0.5)
  (decoder): Linear(in_features=650, out_features=33278, bias=True)
)

## Evaluate with Test Documents

### Calculate the Perplexity of the Test Predictions
We compute the perplexity on the test set to confirm that we have loaded the correct model.

In [None]:
%%time
BPTT = 50
CRITERION = torch.nn.CrossEntropyLoss()

def batchify(data, bsz):
    # Work out how cleanly we can divide the dataset into bsz parts.
    nbatch = data.size(0) // bsz
    # Trim off any extra elements that wouldn't cleanly fit (remainders).
    data = data.narrow(0, 0, nbatch * bsz)
    # Evenly divide the data across the bsz batches.
    data = data.view(bsz, -1).t().contiguous()
    return data.to(DEVICE)

def get_batch(source, i):
    seq_len = min(BPTT, len(source) - 1 - i)
    data = source[i:i+seq_len]
    target = source[i+1:i+1+seq_len].view(-1)
    return data, target

def evaluate(data_source):
    # Turn on evaluation mode which disables dropout.
    model.eval()
    total_loss = 0.
    ntokens = len(corpus.dictionary)
    hidden = model.init_hidden(10)
    with torch.no_grad():
        for i in range(0, data_source.size(0) - 1, BPTT):
            data, targets = get_batch(data_source, i)
            output, hidden = model(data, hidden)
            output_flat = output.view(-1, ntokens)
            total_loss += len(data) * CRITERION(output_flat, targets).item()
            hidden = repackage_hidden(hidden)
    return total_loss / len(data_source)

def repackage_hidden(h):
    """Wraps hidden states in new Tensors, to detach them from their history."""
    if isinstance(h, torch.Tensor):
        return h.detach()
    else:
        return tuple(repackage_hidden(v) for v in h)
    
test_data = batchify(corpus.test, 10)
loss = evaluate(test_data)

CPU times: user 5min 55s, sys: 1.68 s, total: 5min 57s
Wall time: 5min 57s
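
The layout produced by `batchify` and `get_batch` can be hard to picture from the tensor calls alone. The sketch below mimics both with plain Python lists; `batchify_list` and `get_batch_list` are hypothetical stand-ins written for illustration and are not part of the repository.

```python
def batchify_list(tokens, bsz):
    """Mimic batchify with lists: one row per time step, bsz columns."""
    nbatch = len(tokens) // bsz
    tokens = tokens[:nbatch * bsz]  # drop the remainder
    # Split into bsz contiguous streams, then transpose so that
    # column j holds the j-th stream.
    streams = [tokens[j * nbatch:(j + 1) * nbatch] for j in range(bsz)]
    return [list(row) for row in zip(*streams)]


def get_batch_list(source, i, bptt):
    """Inputs are rows i..i+seq_len; targets are the rows shifted by one, flattened."""
    seq_len = min(bptt, len(source) - 1 - i)
    data = source[i:i + seq_len]
    target = [tok for row in source[i + 1:i + 1 + seq_len] for tok in row]
    return data, target


rows = batchify_list(list(range(14)), bsz=3)
data, target = get_batch_list(rows, 0, bptt=2)
print(rows)    # [[0, 4, 8], [1, 5, 9], [2, 6, 10], [3, 7, 11]]
print(data)    # [[0, 4, 8], [1, 5, 9]]
print(target)  # [1, 5, 9, 2, 6, 10]
```

Each column is an independent stream of consecutive tokens, which is why the hidden state can be carried from one batch to the next.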


In [None]:
loss, np.exp(loss)

(4.486460813329338, 88.8065859480267)
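
The second value is the perplexity, that is, the exponential of the average per-token cross-entropy. A minimal sketch of the relationship, using made-up probabilities rather than model outputs:

```python
import math

# Hypothetical probabilities a model assigned to the correct next token.
probs_of_targets = [0.25, 0.5, 0.125, 0.25]

# Average negative log-likelihood (cross-entropy in nats).
avg_loss = -sum(math.log(p) for p in probs_of_targets) / len(probs_of_targets)
perplexity = math.exp(avg_loss)
print(avg_loss, perplexity)
```

Here the perplexity comes out at 4: on average the model is as uncertain as a uniform choice among four tokens.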

### Check the Next Word Predictions

In [None]:
test_tokens = corpus.test.numpy()
eos_pos = np.where(test_tokens == corpus.dictionary.word2idx["<eos>"])[0]
print("Number of lines in test:", len(eos_pos))

Number of lines in test: 2891


In [None]:
# An arbitrary line from the test dataset
print(" ".join([corpus.dictionary.idx2word[c] for c in test_tokens[eos_pos[28]+1:eos_pos[29]]]))

The An <unk> Rebellion began in December <unk> , and was not completely suppressed for almost eight years . It caused enormous disruption to Chinese society : the census of 754 recorded 52 @.@ 9 million people , but ten years later , the census counted just 16 @.@ 9 million , the remainder having been displaced or killed . During this time , Du Fu led a largely itinerant life <unk> by wars , associated <unk> and imperial <unk> . This period of <unk> was the making of Du Fu as a poet : Even Shan Chou has written that , " What he saw around him — the lives of his family , neighbors , and strangers – what he heard , and what he hoped for or feared from the progress of various campaigns — these became the enduring themes of his poetry " . Even when he learned of the death of his youngest child , he turned to the suffering of others in his poetry instead of dwelling upon his own <unk> . Du Fu wrote :


In [None]:
def eval_chunk(start, end):
    token_tensor = corpus.test[eos_pos[start]+1:eos_pos[end]]
    hidden = model.init_hidden(1)
    with torch.no_grad():
        targets = token_tensor[1:]
        output, hidden = model(token_tensor.unsqueeze(1), hidden)
        output_flat = output.squeeze(1)
        loss = CRITERION(output_flat[:-1], targets).item()
    
    sorted_idx = np.argsort(output_flat.numpy(), 1)
    preds = []
    for i in range(1, 4):
        preds.append(list(map(lambda x: corpus.dictionary.idx2word[x], sorted_idx[:, -i])))
    # preds = list(map(lambda x: itos[x], np.argmax(logits.data.cpu().numpy(), 1)))
    return (
        loss,
        pd.DataFrame({
            "orig": [corpus.dictionary.idx2word[x] for x in token_tensor.numpy()] + [" "], 
            "pred_1": [""] + preds[0], "pred_2": [""] + preds[1], "pred_3": [""] + preds[2]
        })
    )

Let's try using only one line:

In [None]:
loss, df = eval_chunk(28, 29)
print("Perplexity:", np.exp(loss))
df.iloc[-50:]

Perplexity: 163.91555818335866


| | orig | pred_1 | pred_2 | pred_3 |
|---|---|---|---|---|
| 133 | progress | <unk> | world | time |
| 134 | of | of | . | , |
| 135 | various | the | his | a |
| 136 | campaigns | people | things | <unk> |
| 137 | — | . | , | " |
| 138 | these | and | the | " |
| 139 | became | are | were | people |
| 140 | the | a | the | more |
| 141 | enduring | most | <unk> | first |
| 142 | themes | <unk> | subject | thing |
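
The `pred_1` to `pred_3` columns hold the three highest-scoring words at each position. A minimal standard-library sketch of that ranking step, using a made-up five-word vocabulary and made-up scores (`top_k_words` is a hypothetical helper, not part of the notebook):

```python
import heapq

vocab = ["the", "of", "his", "a", "<unk>"]  # hypothetical vocabulary
# One row of scores per position (illustrative values, not model outputs).
scores = [
    [2.0, 0.1, 1.5, 1.0, -0.3],
    [0.2, 3.1, 0.0, 0.4, 0.9],
]


def top_k_words(row, k=3):
    """Return the k highest-scoring words, best first."""
    best = heapq.nlargest(k, range(len(row)), key=row.__getitem__)
    return [vocab[i] for i in best]


for row in scores:
    print(top_k_words(row))
```

In `eval_chunk`, the output at position i is the prediction for token i+1, which is why each prediction column is shifted down by one row relative to `orig`.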


Now try providing more context:

In [None]:
loss, df = eval_chunk(28, 34)
print("Perplexity:", np.exp(loss))
df.iloc[-50:]

Perplexity: 104.32415212207026


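| | orig | pred_1 | pred_2 | pred_3 |
|---|---|---|---|---|
| 489 | in | to | the | a |
| 490 | the | the | a | his |
| 491 | summer | <unk> | middle | morning |
| 492 | of | of | and | , |
| 493 | <unk> | 1918 | the | 1916 |
| 494 | ; | , | and | . |
| 495 | this | he | the | his |
| 496 | has | was | time | is |
| 497 | traditionally | been | a | also |
| 498 | been | been | occurred | come |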


### Try to Generate Texts

In [None]:
UNK = corpus.dictionary.word2idx["<unk>"]
UNK

9

#### Greedy Selection

In [None]:
def generate_text_from_chunk(start, end, target_length=20):
    """Greedy selection of the next token."""
    token_tensor = corpus.test[eos_pos[start]+1:eos_pos[end]]
    return generate_text_from_tensor(token_tensor, target_length)
    
def generate_text_from_tensor(token_tensor, target_length):
    hidden = model.init_hidden(1)
    with torch.no_grad():
        # Feed the whole context, then take the most likely next token.
        output, hidden = model(token_tensor.unsqueeze(1), hidden)
        index = output[-1, 0, :].argmax()
        res = [index.item()]
        for i in range(target_length):
            # Feed the last prediction back in as a (1, 1) input.
            output, hidden = model(index.unsqueeze(0).unsqueeze(0), hidden)
            index = output[-1, 0, :].argmax()
            res.append(index.item())
    return [
        [
            corpus.dictionary.idx2word[x] for x in arr
        ] for arr in (token_tensor.numpy(), res)
    ]

In [None]:
context, new_texts = generate_text_from_chunk(28, 29)
print(" ".join(context[-10:]))
print(" ".join(new_texts))

dwelling upon his own <unk> . Du Fu wrote :
" I 'm not going to be a <unk> , and I am not going to be a <unk> . "


In [None]:
context, new_texts = generate_text_from_chunk(28, 38)
print(" ".join(context[-10:]))
print(" ".join(new_texts))

Fu financially and employed him as his unofficial secretary .
The Latin chronicler John C. <unk> also described him as his " liberal @-@ confident " . He described them
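
Greedy selection is deterministic and tends to fall into loops, as the repeated "not going to be a <unk>" in the first sample suggests. The toy sketch below replaces the LSTM with a hypothetical bigram score table (all names and numbers are made up) to show the mechanism:

```python
# Hypothetical next-token scores: NEXT_SCORES[w] maps candidates to scores.
NEXT_SCORES = {
    "I": {"am": 0.6, "was": 0.4},
    "am": {"not": 0.7, "a": 0.3},
    "was": {"not": 0.5, "a": 0.5},
    "not": {"a": 0.8, "I": 0.2},
    "a": {"poet": 0.45, "I": 0.55},
    "poet": {"I": 1.0},
}


def greedy_generate(start, length):
    """Always pick the highest-scoring next token."""
    token, out = start, []
    for _ in range(length):
        candidates = NEXT_SCORES[token]
        token = max(candidates, key=candidates.get)
        out.append(token)
    return out


print(" ".join(greedy_generate("I", 8)))  # am not a I am not a I
```

Because the arg-max choice never varies, the sequence repeats as soon as it revisits a token. Sampling from the predicted distribution is one way to break such cycles.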


#### Sampling from the Predicted Distribution with a Temperature Knob

In [None]:
def generate_text_from_chunk(start, end, target_length=20, temperature=1.0):
    token_tensor = corpus.test[eos_pos[start]+1:eos_pos[end]]
    return generate_text_from_tensor(token_tensor, target_length, temperature)
    

def generate_text_from_tensor(token_tensor, target_length, temperature):
    """Sampling from the softmax distribution."""    
    hidden = model.init_hidden(1)
    _, hidden = model(token_tensor[:-1].unsqueeze(1), hidden)
    input_tensor = torch.zeros((1, 1)).long().to(DEVICE)
    input_tensor[0, 0].fill_(token_tensor[-1])
    res = []
    with torch.no_grad():    
        for i in range(target_length):            
            output, hidden = model(input_tensor, hidden)
            word_weights = output.squeeze().div(temperature).exp()
            word_idx = torch.multinomial(word_weights, 1)[0]
            input_tensor[0, 0].fill_(word_idx)
            res.append(word_idx.item())
    return [
        [
           corpus.dictionary.idx2word[x] for x in arr            
        ] for arr in (token_tensor.numpy(), res)
    ]
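
Dividing the scores by `temperature` before exponentiating reshapes the sampling distribution: values below 1 sharpen it toward the top word, and values above 1 flatten it. A small sketch with made-up scores, using only the standard library (`softmax_with_temperature` is a hypothetical helper, not part of the notebook):

```python
import math
import random


def softmax_with_temperature(scores, temperature=1.0):
    """Normalised exp(score / temperature)."""
    scaled = [s / temperature for s in scores]
    m = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scaled]
    total = sum(weights)
    return [w / total for w in weights]


scores = [2.0, 1.0, 0.0]  # illustrative values, not model outputs
for t in (0.5, 1.0, 2.0):
    print(t, [round(p, 3) for p in softmax_with_temperature(scores, t)])

# Draw one index from the tempered distribution.
random.seed(0)
probs = softmax_with_temperature(scores, 0.5)
sampled = random.choices(range(len(scores)), weights=probs, k=1)[0]
```

The notebook skips the normalisation step because `torch.multinomial` accepts unnormalised non-negative weights.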

In [None]:
context, new_texts = generate_text_from_chunk(28, 33, target_length=50)
print(" ".join(context[-10:]))
for i in range(0, len(new_texts), 10):
    print(" ".join(new_texts[i:i+10]))

bring more papers to pile higher on my desk .
" <unk> ( two ) and Cristina 's army in
<unk> where all historians discovered that the German sniper was
still <unk> from and one out of the Sisler children
. A brother <unk> , the friend of Richard ,
senior of the island , was therefore procured in the


In [None]:
def generate_text_from_texts(texts, target_length=20, temperature=1.0):
    """texts needs to be tokens separated by space characters."""
    token_tensor = torch.LongTensor([
        corpus.dictionary.word2idx[x] for x in texts.split(" ")
    ]).to(DEVICE)
    return generate_text_from_tensor(token_tensor, target_length, temperature)
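
As written, `generate_text_from_texts` raises a `KeyError` for any word that is missing from the vocabulary. One possible workaround is to fall back to the `<unk>` index. The sketch below uses a toy dictionary; the helper `encode` and the vocabulary contents are hypothetical.

```python
# Toy stand-in for corpus.dictionary.word2idx (hypothetical contents).
word2idx = {"<unk>": 0, "the": 1, "fall": 2, "of": 3, "1944": 4}
UNK = word2idx["<unk>"]


def encode(text):
    """Map space-separated tokens to indices, using UNK for unknown words."""
    return [word2idx.get(tok, UNK) for tok in text.split(" ")]


print(encode("the fall of 1944"))    # [1, 2, 3, 4]
print(encode("the autumn of 1944"))  # [1, 0, 3, 4]
```

With the real vocabulary, the same lookup would be `corpus.dictionary.word2idx.get(tok, UNK)`, reusing the `UNK` index computed earlier in the notebook.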

In [None]:
context, new_texts = generate_text_from_texts(
    "In the fall of 1944 , <unk> enrolled at the University of Michigan . The United Press syndicate",
    target_length=100
)
print(" ".join(context[-10:]))
for i in range(0, len(new_texts), 10):
    print(" ".join(new_texts[i:i+10]))

at the University of Michigan . The United Press syndicate
and officials was interpreted by the searing complaints being used
as the musician by another mixed review , but expressed
concern that the laws would be found out in the
United States and during a transmission control of the same
second landscapes . Lisa that he managed to visit the
relationship with Carey and Marvel 's general president for food
was " desperate and looking , based on their own
wing . " Asked in this , the company was
told by the US Bureau of Education , who decided
, and eventually admitted to the 1920s , and "
