# Horoscope generation using Temporal Convolution Networks

The goal of this notebook is to implement a horoscope generator based on neural networks. 

More specifically, the architecture used is a Temporal Convolution Network (TCN) based on the research paper ["An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling"](https://arxiv.org/abs/1803.01271). This architecture is fully convolutional and can therefore take sequences of arbitrary length as input. The main idea of the paper's authors is to increase the receptive field of each successive layer using [dilated convolutions](https://github.com/vdumoulin/conv_arithmetic). 

The bulk of the code of the TCN has been taken from [the implementation](https://github.com/locuslab/TCN) linked in the article.

The data used here are all the horoscopes published on [beliefnet.com](beliefnet.com) in 2011. 

The network takes a sequence of `window_size` words from a horoscope as input and outputs a sequence of `window_size` words (the code below tokenizes the text by splitting it on whitespace). The target used for training is the slice of the horoscope corresponding to the input slid one step to the right, which effectively asks the network what the next word should be. 
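
The input/target construction described above can be sketched in plain Python. This is only an illustration: `make_windows` is a hypothetical helper and the sentence is made up, neither comes from the notebook's data.

```python
def make_windows(tokens, window_size):
    """Return (input, target) pairs; each target is its input slid one step right."""
    return [
        (tokens[i : i + window_size], tokens[i + 1 : i + window_size + 1])
        for i in range(len(tokens) - window_size)
    ]

# Made-up sentence, tokenized on whitespace.
tokens = "today is a good day to start".split()
pairs  = make_windows(tokens, window_size=3)

for input_seq, target_seq in pairs:
    print(input_seq, "->", target_seq)
```

The last element of each target is the only token the input has not seen, which is why the task amounts to next-token prediction.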

## Imports

In [1]:
import random
import math
import itertools

import pandas as pd

from tqdm import tnrange, tqdm_notebook
from tqdm import tqdm

import torch
import torch.nn            as nn
import torch.nn.functional as F
import torch.optim         as optim
from torch.nn.utils import weight_norm
from torch.autograd import Variable

## Training parameters

In [2]:
cuda          = True
file_path     = '../data/horoscope_2011.csv'
window_size   = 3
batch_size    = 3
print_every   = 500
test_seq_size = 600
epochs        = 7

## Data loading code

In [3]:
def load_data(path, window_size):
    df          = pd.read_csv(path)
    split_texts = [t.lower().split() for t in df.TEXT] 
    text        = list(itertools.chain.from_iterable(split_texts))
    words       = set(text)
    n_words     = len(words)
    idx_to_word = dict(enumerate(words))
    word_to_idx = {word : idx for idx, word in idx_to_word.items()}
    data        = [(text[i : i + window_size], text[i + 1 : i + window_size + 1])
                    for i in range(len(text) - window_size - 1)]

    return n_words, idx_to_word, word_to_idx, data

In [4]:
def encode_seq(seq, word_to_id):
    return [word_to_id[w] for w in seq]

In [5]:
def data_to_tensor(data, word_to_id, n_char):
    input_tensor = torch.LongTensor([encode_seq(input_seq, word_to_id)
                                      for input_seq, _ in data])
    target_tensor = torch.LongTensor([encode_seq(target_seq, word_to_id)
                                       for _, target_seq in data])
    
    return input_tensor, target_tensor

In [6]:
def batch_generator(data, batch_size, n_char, word_to_id, shuffle = True):
    if shuffle:
        data = random.sample(data, len(data))
    
    return (data_to_tensor(data[i : i + batch_size], word_to_id, n_char) 
                 for i in range(0, len(data), batch_size))

## Model visualization code

The model is evaluated by asking it to generate a long sequence of words. We randomly select an input as a base for the generation and build a new sequence word by word using the model.

In [7]:
def test_model(tcn, final_sequence_size, window_size, n_words, 
               id_to_word, word_to_id, data):
    seq = list(random.choice(data)[0])
    while len(seq) < final_sequence_size:
        # As the network is able to take variable length inputs, it could be 
        # interesting to not limit ourselves to inputs of window_size.
        encoded_input = encode_seq(seq[-window_size:], word_to_id)
        input_tensor  = torch.LongTensor([encoded_input])
        X             = Variable(input_tensor)
        X             = X.cuda() if cuda else X 
        y_pred        = tcn(X)
        # It is important to take the maximum on the dim -2 as each channel of 
        # the output corresponds to the score associated with a word.
        # .item() (PyTorch >= 0.4) converts the one-element result to a Python int.
        word_pred_id  = y_pred[0].max(dim = -2)[1][-1].item()
        word_pred     = id_to_word[word_pred_id]
        seq.append(word_pred)
    
    return ' '.join(seq)
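
The indexing used to pick the next word can be illustrated without PyTorch. The sketch below assumes a score table laid out as `scores[c][t]` (vocabulary entry `c`, time step `t`), which mirrors the `(channels, sequence)` layout of `y_pred[0]`. The helper `last_step_argmax` and the toy values are hypothetical.

```python
def last_step_argmax(scores):
    """scores[c][t] is the score of vocabulary entry c at time step t."""
    last_column = [row[-1] for row in scores]  # scores at the final time step
    return max(range(len(last_column)), key=last_column.__getitem__)

# Hypothetical scores for a 3-word vocabulary over 2 time steps.
toy_vocab  = ["stars", "luck", "love"]
toy_scores = [
    [0.1, 0.2],  # "stars"
    [0.7, 0.1],  # "luck"
    [0.2, 0.9],  # "love"
]

next_word = toy_vocab[last_step_argmax(toy_scores)]
print(next_word)
```

Only the last time step matters here, because it is the only position whose prediction is a word the input does not already contain.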

The following function generates a horoscope starting from a `base` supplied by the caller. 

In [8]:
def genererate_long_sequence(tcn, final_sequence_size, n_words, id_to_word, word_to_id, base):
    seq = list(base)

    while len(seq) < final_sequence_size:
        # In this case we do not limit the size of the input to window_size.
        encoded_input = encode_seq(seq, word_to_id)
        input_tensor  = torch.LongTensor([encoded_input])
        X             = Variable(input_tensor)
        X             = X.cuda() if cuda else X 
        y_pred        = tcn(X)
        # It is important to take the maximum on the dim -2 as each channel of 
        # the output corresponds to the score associated with a word.
        # .item() (PyTorch >= 0.4) converts the one-element result to a Python int.
        word_pred_id  = y_pred[0].max(dim = -2)[1][-1].item()
        word_pred     = id_to_word[word_pred_id]
        seq.append(word_pred)
        
    return ' '.join(seq)

## Model definition

The `TransposeLayer` is a simple transposition that converts the output of the embedding layer, of shape `(batch, sequence, embedding)`, into the `(batch, channels, sequence)` layout expected by the convolution layers.

In [9]:
class TransposeLayer(nn.Module):
    def forward(self, x):
        return x.transpose(-2, -1)

The `Chomp1d` module removes the extra values added at the end of the sequence by the padding of a convolution. As the TCN architecture uses dilated convolutions, the padding has to be increased to generate a long enough output. Since `Conv1d` pads both ends of the sequence, we have to remove the extra values at the end so that the last value of the output is the result of a dilated convolution whose rightmost input was the last value of the input sequence. 

In [10]:
class Chomp1d(nn.Module):
    def __init__(self, chomp_size):
        super(Chomp1d, self).__init__()
        self.chomp_size = chomp_size
        
    def forward(self, x):
        # Slicing returns a view of x that is not contiguous in memory;
        # .contiguous() copies it into a contiguous tensor.
        return x[:, :, :-self.chomp_size].contiguous()
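
The effect of padding followed by chomping can be checked on plain lists. The sketch below is an illustration only: it reimplements a single-channel dilated convolution by hand (`causal_dilated_conv` is a hypothetical helper, not part of the notebook) and assumes a stride of 1 and no bias.

```python
def causal_dilated_conv(x, weights, dilation):
    """Single-channel dilated convolution made causal by padding then chomping."""
    k       = len(weights)
    pad     = (k - 1) * dilation
    # Conv1d(padding=pad) pads both ends of the sequence with zeros.
    padded  = [0.0] * pad + list(x) + [0.0] * pad
    out_len = len(padded) - pad  # output length for stride 1
    full    = [
        sum(weights[j] * padded[t + j * dilation] for j in range(k))
        for t in range(out_len)
    ]
    # The chomp: drop the last `pad` outputs, which depend on the right padding.
    return full[:-pad] if pad else full

x   = [1.0, 2.0, 3.0, 4.0, 5.0]
out = causal_dilated_conv(x, weights=[1.0, 1.0], dilation=2)
print(out)  # out[t] = x[t - 2] + x[t], with zeros before the start

# Changing the last input must leave all earlier outputs untouched.
x_changed   = [1.0, 2.0, 3.0, 4.0, 100.0]
out_changed = causal_dilated_conv(x_changed, weights=[1.0, 1.0], dilation=2)
```

After the chomp the output has the same length as the input, and position `t` only depends on inputs at positions `t` and earlier.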

The `TemporalBlock` module is a residual block containing two weight-normalized dilated convolutions with ReLU activations and `Dropout2d` (which drops whole channels at once). The residual connection contains a 1x1 convolution when the input has to be mapped to the correct number of channels. 

In [11]:
class TemporalBlock(nn.Module):
    def __init__(self, n_inputs, n_outputs, kernel_size, stride, dilation, padding, dropout = 0.2):
        super(TemporalBlock, self).__init__()
        conv_params = {
            'kernel_size' : kernel_size,
            'stride'      : stride,
            'padding'     : padding,
            'dilation'    : dilation
        }
        self.conv1    = weight_norm(nn.Conv1d(n_inputs, n_outputs, **conv_params))
        self.chomp1   = Chomp1d(padding)
        self.relu1    = nn.ReLU()
        self.dropout1 = nn.Dropout2d(dropout)
        self.conv2    = weight_norm(nn.Conv1d(n_outputs, n_outputs, **conv_params))
        self.chomp2   = Chomp1d(padding)
        self.relu2    = nn.ReLU()
        self.dropout2 = nn.Dropout2d(dropout)
        self.net      = nn.Sequential(
            self.conv1, 
            self.chomp1,
            self.relu1,
            self.dropout1,
            self.conv2,
            self.chomp2,
            self.relu2,
            self.dropout2
        )
        # If the number of input channels is equal to the number of output channels then
        # no transformation is required.
        self.downsample = nn.Conv1d(n_inputs, n_outputs, 1) if n_inputs != n_outputs else None
        self.relu       = nn.ReLU()
        self.init_weights()
        
    def forward(self, x):
        # Convolutional branch of the residual block
        out = self.net(x)
        # Residual branch of the residual block
        res = x if self.downsample is None else self.downsample(x)

        return self.relu(out + res)
    
    def init_weights(self):
        self.conv1.weight.data.normal_(0, 0.01)
        self.conv2.weight.data.normal_(0, 0.01)
        if self.downsample is not None:
            self.downsample.weight.data.normal_(0, 0.01)

The Temporal Convolution Network is a sequence of Temporal Blocks whose dilation doubles at each step. The receptive field therefore grows exponentially with the number of blocks: with enough blocks, the network can use information from arbitrarily far in the past to generate its prediction. 

In [12]:
class TemporalConvNet(nn.Module):
    def __init__(self, num_inputs, num_channels, dim_emb = 30, kernel_size = 2, dropout = 0.2):
        super(TemporalConvNet, self).__init__()
        layers     = [nn.Embedding(num_inputs, dim_emb), TransposeLayer()]
        num_levels = len(num_channels)
        
        for i in range(num_levels):
            # The dilation is doubled at each layer to allow an exponential growth of 
            # the receptive field size.   
            dilation_size = 2 ** i
            in_channels   = dim_emb if i == 0 else num_channels[i - 1]
            out_channels  = num_channels[i]
            layers.append(
                TemporalBlock(
                    in_channels,
                    out_channels,
                    kernel_size,
                    stride   = 1,
                    dilation = dilation_size,
                    padding  = (kernel_size - 1) * dilation_size,
                    dropout  = dropout
                )
            )
        
        self.network = nn.Sequential(*layers)
        
    def forward(self, x):
        return self.network(x)
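
The growth of the receptive field can be estimated with a few lines of arithmetic. The sketch below assumes the structure defined above: two convolutions per block, stride 1, and a dilation of `2 ** i` at level `i`, so each convolution extends the receptive field by `(kernel_size - 1) * dilation`. The helper `receptive_field` is hypothetical.

```python
def receptive_field(num_levels, kernel_size=2):
    """Number of input positions that can influence one output position."""
    rf = 1
    for i in range(num_levels):
        dilation = 2 ** i
        # Two convolutions per TemporalBlock, each adding (k - 1) * dilation.
        rf += 2 * (kernel_size - 1) * dilation
    return rf

for levels in (1, 4, 7):
    print(levels, receptive_field(levels))
```

Under these assumptions the closed form is `1 + 2 * (kernel_size - 1) * (2 ** num_levels - 1)`, so every extra block roughly doubles the history the network can see.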

## Training

In [13]:
n_words, id_to_word, word_to_id, data = load_data(file_path, window_size)
# num_channels                          = [512] * 6 + [n_words]
num_channels                          = [2] * 3 + [n_words]
tcn                                   = TemporalConvNet(n_words, num_channels)
tcn                                   = tcn.cuda() if cuda else tcn
# We view the problem as a classification task in which the network tries
# to predict what class the following word should be.   
criterion                             = nn.CrossEntropyLoss()
# We use an Adam optimizer with the default learning rate of 1e-3.
optimizer                             = optim.Adam(tcn.parameters())
tcn

TemporalConvNet(
  (network): Sequential(
    (0): Embedding(13105, 30)
    (1): TransposeLayer(
    )
    (2): TemporalBlock(
      (conv1): Conv1d(30, 2, kernel_size=(2,), stride=(1,), padding=(1,))
      (chomp1): Chomp1d(
      )
      (relu1): ReLU()
      (dropout1): Dropout2d(p=0.2)
      (conv2): Conv1d(2, 2, kernel_size=(2,), stride=(1,), padding=(1,))
      (chomp2): Chomp1d(
      )
      (relu2): ReLU()
      (dropout2): Dropout2d(p=0.2)
      (net): Sequential(
        (0): Conv1d(30, 2, kernel_size=(2,), stride=(1,), padding=(1,))
        (1): Chomp1d(
        )
        (2): ReLU()
        (3): Dropout2d(p=0.2)
        (4): Conv1d(2, 2, kernel_size=(2,), stride=(1,), padding=(1,))
        (5): Chomp1d(
        )
        (6): ReLU()
        (7): Dropout2d(p=0.2)
      )
      (downsample): Conv1d(30, 2, kernel_size=(1,), stride=(1,))
      (relu): ReLU()
    )
    (3): TemporalBlock(
      (conv1): Conv1d(2, 2, kernel_size=(2,), stride=(1,), padding=(2,), dilation=(2,))
  

In [14]:
batch_per_epoch  = math.ceil(len(data) / batch_size)
loss_update_rate = 3

In [15]:
params = filter(lambda p: p.requires_grad, tcn.parameters())

import numpy as np
n_params = sum([np.prod(p.size()) for p in params])

n_params

344019601
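
The parameter count above can be reproduced by hand, which also shows where the parameters sit. This sketch assumes the configuration printed earlier (a vocabulary of 13105 words, `dim_emb = 30`, `num_channels = [2] * 3 + [n_words]`, `kernel_size = 2`) and assumes that `weight_norm` adds one gain per output channel. The helpers `conv1d_params` and `block_params` are hypothetical.

```python
def conv1d_params(c_in, c_out, k, weight_normed=True):
    """Parameters of a Conv1d with bias; weight_norm adds one gain per output channel."""
    n = c_in * c_out * k + c_out  # weight tensor + bias
    return n + c_out if weight_normed else n

def block_params(c_in, c_out, k=2):
    """Parameters of one TemporalBlock: two convolutions plus an optional 1x1 conv."""
    n = conv1d_params(c_in, c_out, k) + conv1d_params(c_out, c_out, k)
    if c_in != c_out:
        n += conv1d_params(c_in, c_out, 1, weight_normed=False)
    return n

vocab_size, dim_emb = 13105, 30
channels            = [2, 2, 2, vocab_size]

total = vocab_size * dim_emb  # embedding table
c_in  = dim_emb
for c_out in channels:
    total += block_params(c_in, c_out)
    c_in = c_out

last_conv = conv1d_params(vocab_size, vocab_size, 2)
print(total, last_conv, f"{last_conv / total:.2%}")
```

Almost all of the parameters belong to the second convolution of the last block, which maps `n_words` channels to `n_words` channels. Assuming 32-bit floats, the weights alone take roughly 1.4 GB, before gradients and the optimizer state are counted, which is a plausible contributor to the out-of-memory error reported below.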

In [16]:
for epoch in tnrange(epochs, desc = 'epochs'):
    loss_pbar    = 0 
    running_loss = 0
    generator    = batch_generator(data, batch_size, n_words, word_to_id)

    with tqdm_notebook(
        enumerate(generator), 
        desc = 'batches', 
        total = batch_per_epoch, 
        unit = 'batch '
    ) as pbar:

        for i, (X, y) in pbar:
            X = Variable(X)
            y = Variable(y)
            X = X.cuda() if cuda else X
            y = y.cuda() if cuda else y
            optimizer.zero_grad()
#             print(X.size())
            y_pred = tcn(X)
            print(y_pred.size(), y.size())
            loss   = criterion(y_pred, y)
            loss.backward()
            optimizer.step()

#             loss_value    = loss.cpu().data[0]
#             running_loss += loss_value
#             loss_pbar    += loss_value

#             if i % loss_update_rate == loss_update_rate - 1:
#                 pbar.set_postfix(loss = loss_pbar / loss_update_rate)
#                 loss_pbar = 0

#             if i % print_every == print_every - 1:
#                 test_result = test_model(tcn, test_seq_size, window_size, n_words, 
#                                          id_to_word, word_to_id, data)
#                 tqdm.write(f'Batch: {i + 1 : 6}, '
#                            f'loss: {running_loss / print_every : .4f}\n'
#                            f'{test_result}\n')
#                 running_loss = 0

torch.Size([3, 13105, 3]) torch.Size([3, 3])



RuntimeError: cuda runtime error (2) : out of memory at /opt/conda/conda-bld/pytorch_1518243271935/work/torch/lib/THC/generic/THCStorage.cu:58

In [None]:
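genererate_long_sequence(
    tcn, 
    3000, 
    n_words, 
    id_to_word, 
    word_to_id, 
    # The vocabulary is word-level, so the base is passed as a list of words;
    # every word has to be present in word_to_id.
    'your day will be terrible but you should stay optimistic because'.split()
)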