# Thinking in tensors in PyTorch

Hands-on training by [Piotr Migdał](https://p.migdal.pl) (2020). A short training for Sumo Logic Warsaw, 17 Jan 2020.

* Open in Colab: https://colab.research.google.com/github/stared/thinking-in-tensors-writing-in-pytorch/
* If you want to run it locally, the easiest way is to use Python 3.7+ from [Anaconda Distribution](https://www.anaconda.com/distribution/), install [PyTorch](https://pytorch.org/) and run `jupyter notebook`.


## Generating f̶a̶k̶e artificial logs


### Background reading

We use recurrent neural networks. For wonderful introductions, see:

* [Understanding LSTM Networks](http://colah.github.io/posts/2015-08-Understanding-LSTMs/) by Chris Olah
* [Exploring LSTMs](http://blog.echen.me/2017/05/30/exploring-lstms/) by Edwin Chen


![](http://colah.github.io/posts/2015-08-Understanding-LSTMs/img/RNN-unrolled.png)

See also:

* [Simple diagrams of convoluted neural networks](https://medium.com/inbrowserai/simple-diagrams-of-convoluted-neural-networks-39c097d2925b) by Piotr Migdał
* [The Unreasonable Effectiveness of Recurrent Neural Networks](http://karpathy.github.io/2015/05/21/rnn-effectiveness/) by Andrej Karpathy
* [Memorization in RNNs](https://distill.pub/2019/memorization-in-rnns/)
* [Unsupervised sentiment neuron by OpenAI](https://openai.com/blog/unsupervised-sentiment-neuron/)
* [GPT-2: Better Language Models and Their Implications](https://openai.com/blog/better-language-models/) by OpenAI

Play online:


* [RecurrentJS](http://cs.stanford.edu/people/karpathy/recurrentjs) - an in-browser demo by Andrej Karpathy
* [transformer.huggingface.co](https://transformer.huggingface.co/) - autocompletion with state-of-the-art models

Other:

* [Training a Keras model to generate colors](https://heartbeat.fritz.ai/how-to-train-a-keras-model-to-generate-colors-3bc79e54971b)
* [Generating Magic cards using deep, recurrent neural networks](https://www.mtgsalvation.com/forums/magic-fundamentals/custom-card-creation/612057-generating-magic-cards-using-deep-recurrent-neural)


### Various practical links

* [What is the best way to remove accents in a Python unicode string?](https://stackoverflow.com/questions/517923/what-is-the-best-way-to-remove-accents-in-a-python-unicode-string)
* My [livelossplot Python package](https://github.com/stared/livelossplot) - live training loss plot in Jupyter Notebook for Keras, PyTorch and others (now over 100k downloads!)

### ...and where is my command log file

* Windows PowerShell: `~\AppData\Roaming\Microsoft\Windows\PowerShell\PSReadline\ConsoleHost_history.txt`
* Linux & macOS Bash: `~/.bash_history`
* Zsh: `~/.zsh_history`
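
To train on your own history, the file first has to be turned into a list of command strings. A minimal sketch, assuming one command per line and, for Zsh with `EXTENDED_HISTORY`, an optional `: <start>:<elapsed>;` prefix. The helper names `parse_history` and `load_history` are hypothetical and not part of the original notebook:

```python
import re
from pathlib import Path

# Zsh with EXTENDED_HISTORY prefixes each entry with ": <start>:<elapsed>;"
ZSH_PREFIX = re.compile(r"^: \d+:\d+;")


def parse_history(text):
    """Return a list of non-empty commands from the raw text of a history file."""
    commands = []
    for raw_line in text.splitlines():
        command = ZSH_PREFIX.sub("", raw_line).strip()
        if command:
            commands.append(command)
    return commands


def load_history(path="~/.bash_history"):
    """Read a history file, replacing bytes that are not valid UTF-8."""
    raw = Path(path).expanduser().read_text(encoding="utf-8", errors="replace")
    return parse_history(raw)


sample = "ls -la\n: 1579250000:0;git status\n\ncd ~/projects\n"
print(parse_history(sample))
```

Multi-line commands and timestamp comment lines are not handled here; they would need extra rules.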

In [None]:
!pip install unidecode
!pip install livelossplot

In [None]:
import numpy as np
import pandas as pd
from collections import Counter
from unidecode import unidecode
from sklearn.model_selection import train_test_split

import torch
from torch import nn
from torch import optim
from torch.utils.data import TensorDataset, DataLoader

from livelossplot import PlotLosses

## GPU

* [GPU benchmarks for deep learning tasks](https://www.reddit.com/r/MachineLearning/comments/ecazk2/d_gpu_benchmarks_for_deep_learning_tasks/) - my overview
* [ai-benchmark.com/alpha](http://ai-benchmark.com/alpha)


In [None]:
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
if torch.cuda.is_available():
    gpu = torch.cuda.get_device_properties('cuda:0')
    print(f"Device count: {torch.cuda.device_count()}\n")
    print(f"{gpu.name}")
    print(f"({gpu.total_memory//2**20}MB memory, {gpu.multi_processor_count} multiprocessors)\n")
    print(f"CUDA version:  {torch.version.cuda}")
    print(f"Compute capability: {gpu.major}.{gpu.minor}")
    print(f"cuDNN version: {torch.backends.cudnn.version()}")

## Loading and preprocessing data

We use a fragment of: https://archive.ics.uci.edu/ml/datasets/UNIX+User+Data

In [None]:
!wget -O commands.txt https://www.dropbox.com/s/qed5mrji0yraoev/sanitized_all_1.981115184025?dl=1

In [None]:
with open("commands.txt") as f:
    data = f.read()
    lines = data.replace("\n", " ").replace("**SOF**", "").split("**EOF**")
    lines = [line.strip() for line in lines]

In [None]:
len(lines)

In [None]:
lines[:5]

In [None]:
word_counter = Counter()
for line in lines:
    word_counter.update(line.split(" "))

In [None]:
word_counter.most_common(20)

In [None]:
len_counter = Counter([len(line) for line in lines])
pd.Series(len_counter).sort_index().head(20)

In [None]:
letters = Counter()
for line in lines:
    letters.update(line)

In [None]:
letters.most_common()

In [None]:
char2id = {c: i for i, (c, v) in enumerate(letters.items())}
id2char = {i: c for i, (c, v) in enumerate(letters.items())}

In [None]:
char2id

In [None]:
max_len = 20
BEGIN_ID = len(char2id)
AFTER_ID = len(char2id) + 1

X = np.zeros((len(lines), max_len), dtype=np.int64)
X[:,:] = AFTER_ID
X[:,0] = BEGIN_ID

for i, line in enumerate(lines):
    for j, c in enumerate(line):
        if j + 1 >= max_len:
            break
        X[i, j + 1] = char2id[c]
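
The padding scheme above is easier to see on a toy example. The sketch below reimplements it in plain Python for a hypothetical three-command corpus with a maximum length of 8 (both chosen only for illustration) and adds a hypothetical `decode` helper that inverts the encoding:

```python
toy_lines = ["ls", "cd ..", "make install"]
toy_max_len = 8

# analogous to char2id / id2char, built here from a sorted toy alphabet
toy_alphabet = sorted(set("".join(toy_lines)))
toy_char2id = {c: i for i, c in enumerate(toy_alphabet)}
toy_id2char = {i: c for c, i in toy_char2id.items()}
TOY_BEGIN = len(toy_char2id)      # start-of-sequence marker
TOY_AFTER = len(toy_char2id) + 1  # padding / end-of-sequence marker


def encode_padded(line, max_len):
    """BEGIN, then up to max_len - 1 characters, then AFTER as padding."""
    ids = [TOY_BEGIN] + [toy_char2id[c] for c in line[:max_len - 1]]
    return ids + [TOY_AFTER] * (max_len - len(ids))


def decode(ids):
    """Drop BEGIN/AFTER markers and map the remaining ids back to characters."""
    return "".join(toy_id2char[i] for i in ids if i in toy_id2char)


for line in toy_lines:
    row = encode_padded(line, toy_max_len)
    print(row, "->", repr(decode(row)))
```

Note that a line longer than `max_len - 1` characters is truncated and gets no end marker, exactly as in the loop above.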

In [None]:
chars = [c for i, c in id2char.items()] + [">", "<"]

In [None]:
def encode(name, end=True):
    code = [char2id[c] for c in name.lower()]
    if end:
        return torch.tensor([BEGIN_ID] + code + [AFTER_ID]).unsqueeze(0)
    else:
        return torch.tensor([BEGIN_ID] + code).unsqueeze(0)

In [None]:
encode("sudo")

In [None]:
X.shape

In [None]:
X[:5]

In [None]:
X_train, X_test = train_test_split(X, test_size=0.25, random_state=42)

In [None]:
trainloader = DataLoader(TensorDataset(torch.from_numpy(X_train).long()),
                         batch_size=128, shuffle=True)
testloader = DataLoader(TensorDataset(torch.from_numpy(X_test).long()),
                         batch_size=128, shuffle=False)

dataloaders = {
    "train": trainloader,
    "validation": testloader
}

In [None]:
def train_model_gener(model, criterion, optimizer, num_epochs=10,
                      device=device,
                      trainloader=trainloader,
                      testloader=testloader):

    # use the loaders passed as arguments rather than the global `dataloaders`
    loaders = {'train': trainloader, 'validation': testloader}

    # creating plots
    liveloss = PlotLosses()
    model = model.to(device)

    # main train loop
    for epoch in range(num_epochs):
        logs = {}
        for phase in ['train', 'validation']:
            if phase == 'train':
                model.train()
            else:
                model.eval()

            # accumulating error measures
            running_loss = 0.0
            running_corrects = 0
            running_tokens = 0

            for (inputs_full,) in loaders[phase]:

                # next-character prediction: the labels are the inputs
                # shifted by one position
                inputs = inputs_full[:, :-1].to(device)
                labels = inputs_full[:, 1:].to(device)

                outputs = model(inputs)

                loss = criterion(outputs, labels)

                # training the model with a variant
                # of the gradient descent
                if phase == 'train':
                    optimizer.zero_grad()
                    loss.backward()
                    optimizer.step()

                # outputs are (batch, classes, length), so argmax over dim 1
                _, preds = torch.max(outputs, 1)
                running_loss += loss.detach() * inputs.size(0)
                running_corrects += torch.sum(preds == labels)
                running_tokens += labels.numel()

            epoch_loss = running_loss / len(loaders[phase].dataset)
            # accuracy per predicted character (padding positions included)
            epoch_acc = running_corrects.float() / running_tokens

            prefix = ''
            if phase == 'validation':
                prefix = 'val_'

            logs[prefix + 'log loss'] = epoch_loss.item()
            logs[prefix + 'accuracy'] = epoch_acc.item()

        liveloss.update(logs)
        liveloss.draw()
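
The key step in the loop is the one-character shift between `inputs` and `labels`, together with the fact that `nn.CrossEntropyLoss` accepts class scores of shape (batch, classes, length) and integer targets of shape (batch, length). A small sketch with made-up numbers (the vocabulary size, the batch and the random scores are assumptions for illustration, not the notebook's data):

```python
import torch
from torch import nn

torch.manual_seed(0)
vocab_size = 6                       # hypothetical alphabet incl. BEGIN/AFTER
batch = torch.tensor([[4, 0, 1, 2, 5, 5],
                      [4, 3, 3, 5, 5, 5]])   # (batch=2, length=6)

inputs = batch[:, :-1]               # everything except the last position
labels = batch[:, 1:]                # the same sequence shifted left by one

# stand-in for model(inputs): random scores of shape (batch, classes, length)
outputs = torch.randn(inputs.size(0), vocab_size, inputs.size(1))

loss = nn.CrossEntropyLoss()(outputs, labels)   # averaged over all positions

preds = outputs.argmax(dim=1)        # most likely class at each position
token_accuracy = (preds == labels).float().mean()

print(inputs.shape, labels.shape, outputs.shape)
print(loss.item(), token_accuracy.item())
```

Averaging correctness over every position keeps the accuracy between 0 and 1 regardless of sequence length.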

In [None]:
class GenerativeLSTM(nn.Module):
    def __init__(self, embedding_size, hidden_size, dictionary_len=len(chars)):
        super().__init__()
        self.emb = nn.Embedding(dictionary_len, embedding_size)
        self.lstm = nn.LSTM(input_size=embedding_size, hidden_size=hidden_size)
        # note: input size is the number of channels/embedding dim, NOT length
        self.fc = nn.Linear(hidden_size, dictionary_len)

    def forward(self, x):
        x = self.emb(x)
        x = x.permute(1, 0, 2)  # BLC -> LBC
        output, (hidden, cell) = self.lstm(x)
        res = self.fc(output)
        return res.permute(1, 2, 0)  # LBC -> BCL
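
The first `permute` is needed only because `nn.LSTM` expects (length, batch, channels) by default. As an alternative sketch (not the notebook's model; the class name `GenerativeLSTMBatchFirst` and the sizes in the usage lines are hypothetical), `batch_first=True` lets the LSTM consume (batch, length, channels) directly, leaving a single `permute` at the end:

```python
import torch
from torch import nn


class GenerativeLSTMBatchFirst(nn.Module):
    def __init__(self, embedding_size, hidden_size, dictionary_len):
        super().__init__()
        self.emb = nn.Embedding(dictionary_len, embedding_size)
        self.lstm = nn.LSTM(input_size=embedding_size, hidden_size=hidden_size,
                            batch_first=True)
        self.fc = nn.Linear(hidden_size, dictionary_len)

    def forward(self, x):
        x = self.emb(x)               # (B, L) -> (B, L, C)
        output, _ = self.lstm(x)      # (B, L, hidden)
        res = self.fc(output)         # (B, L, classes)
        return res.permute(0, 2, 1)   # BLC -> BCL, as CrossEntropyLoss expects


toy_model = GenerativeLSTMBatchFirst(embedding_size=5, hidden_size=16, dictionary_len=30)
toy_batch = torch.randint(0, 30, (3, 19))   # 3 sequences of 19 character ids
print(toy_model(toy_batch).shape)           # torch.Size([3, 30, 19])
```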

In [None]:
X_example = torch.from_numpy(X_train[:3]).long().to(device)
X_example

In [None]:
# (sample size, sequence length)
X_example.size()

In [None]:
# create model; it has random weights
model = GenerativeLSTM(embedding_size=5, hidden_size=16).to(device)
model

In [None]:
# (sample size, alphabet size, sequence length)
model(X_example).size()

In [None]:
# let's train a model
optimizer = optim.Adam(model.parameters(), lr=0.01)
criterion = nn.CrossEntropyLoss()

train_model_gener(model, criterion, optimizer, num_epochs=51)

In [None]:
pred = model(encode("sourc", end=False).to(device))
pred.size()

In [None]:
next_letter_prob = pred[0,:,-1].softmax(dim=0).cpu().detach().numpy()
next_letter_prob

In [None]:
100 * pd.Series(next_letter_prob, index=chars).sort_values(ascending=False)

In [None]:
(100 * pd.Series(next_letter_prob, index=chars).sort_values(ascending=False)).head(10).plot.bar()

In [None]:
next_char_id = np.random.choice(len(next_letter_prob), 1, p=next_letter_prob)[0]
next_char_id

In [None]:
chars[next_char_id]

In [None]:
def generate(start="", next_chars=20, temperature=1., model=model, device=device):
    """Sample a command character by character, starting from `start`."""
    word = start

    for _ in range(next_chars):
        # scores for the character following the current prefix
        with torch.no_grad():
            pred = model(encode(word, end=False).to(device))
        next_letter_prob = pred[0, :, -1].softmax(dim=0).cpu().numpy()

        # temperature < 1 sharpens the distribution, > 1 flattens it
        next_letter_prob = next_letter_prob**(1/temperature)
        next_letter_prob = next_letter_prob / next_letter_prob.sum()

        next_char_id = np.random.choice(len(next_letter_prob), 1, p=next_letter_prob)[0]

        # '<' marks the end of a command, so stop without appending it
        if chars[next_char_id] == '<':
            break

        word += chars[next_char_id]

    return word
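
Raising probabilities to the power `1/temperature` and renormalising, as `generate` does, is the same as dividing the logits by `temperature` before the softmax: values below 1 sharpen the distribution, values above 1 flatten it. A sketch in plain Python with made-up logits (the numbers and helper names are illustrative assumptions):

```python
import math


def softmax(logits):
    m = max(logits)                                  # subtract max for stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def apply_temperature(probs, temperature):
    """Same operation as in generate(): p ** (1 / T), then renormalise."""
    powered = [p ** (1 / temperature) for p in probs]
    total = sum(powered)
    return [p / total for p in powered]


logits = [2.0, 1.0, 0.1]
probs = softmax(logits)

for t in (0.5, 1.0, 2.0):
    via_probs = apply_temperature(probs, t)
    via_logits = softmax([v / t for v in logits])
    # both routes agree up to floating-point error
    assert all(math.isclose(a, b, rel_tol=1e-9) for a, b in zip(via_probs, via_logits))
    print(t, [round(p, 3) for p in via_probs])
```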

In [None]:
generate()

In [None]:
for i in range(10):
    print(generate(""))

In [None]:
for i in range(10):
    print(generate("l"))

In [None]:
for i in range(10):
    print(generate("exi"))

In [None]:
for i in range(10):
    print(generate("sudo rm"))

In [None]:
for i in range(10):
    print(generate("l"))