### The grand quest: make it actually work (4 points)

Your main task is to use some of the tricks you've learned on the network and analyze if you can improve __validation MAE__. Try __at least 3 options__ from the list below for a passing grade. Write a short report about what you have tried. More ideas = more bonus points. 

__Please be serious:__ plot learning curves in MAE/epoch, compare models based on optimal performance, test one change at a time. You know the drill :)
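
One way to read "compare models based on optimal performance" is to rank runs by their best validation MAE rather than by the last epoch. A minimal sketch, assuming each run's per-epoch validation MAE has been collected into a plain list (for example from the `History` object that `fit_generator` returns). The helper names, run names and numbers are made up for illustration.

```python
def best_epoch(val_mae):
    """Return (epoch, value) of the lowest validation MAE; epochs are 1-based."""
    best_ix = min(range(len(val_mae)), key=val_mae.__getitem__)
    return best_ix + 1, val_mae[best_ix]


def compare_runs(runs):
    """Rank runs by their best (not final) validation MAE, best first."""
    summary = [(name,) + best_epoch(curve) for name, curve in runs.items()]
    return sorted(summary, key=lambda row: row[2])


# Hypothetical learning curves, for illustration only.
runs = {
    "baseline": [0.40, 0.25, 0.21, 0.22],
    "more_dropout": [0.45, 0.30, 0.20, 0.19],
}
for name, epoch, mae in compare_runs(runs):
    print("%-14s best val MAE %.3f at epoch %d" % (name, mae, epoch))
```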

You can use either pure __tensorflow__ or __keras__. Feel free to adapt the seminar code for your needs.


In [1]:
import nltk
import keras

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import keras.layers as L
%matplotlib inline

from collections import Counter
from sklearn.feature_extraction import DictVectorizer
from sklearn.model_selection import train_test_split

Using TensorFlow backend.


First, let's use some code from the seminar that contains preprocessing.

In [2]:
data = pd.read_csv("./Train_rev1.csv", index_col=None)

data['Log1pSalary'] = np.log1p(data['SalaryNormalized']).astype('float32')

text_columns = ["Title", "FullDescription"]
categorical_columns = ["Category", "Company", "LocationNormalized", "ContractType", "ContractTime"]
target_column = "Log1pSalary"

data[categorical_columns] = data[categorical_columns].fillna('NaN')

tokenizer = nltk.tokenize.WordPunctTokenizer()

data["FullDescription"] = data["FullDescription"].astype(str).apply(
    lambda x: ' '.join(tokenizer.tokenize(x.lower())), 1)
data["Title"] = data["Title"].astype(str).apply(
    lambda x: ' '.join(tokenizer.tokenize(x.lower())), 1)

tokens = []
for title in data["Title"]:
    tokens += title.split()
for title in data["FullDescription"]:
    tokens += title.split()
    
token_counts = Counter(tokens)

min_count = 10

tokens = [token for token, count in token_counts.items() if count >= min_count]

UNK, PAD = "UNK", "PAD"
tokens = [UNK, PAD] + sorted(tokens)
token_to_id = {token: id for id, token in enumerate(tokens)}

UNK_IX, PAD_IX = map(token_to_id.get, [UNK, PAD])

def as_matrix(sequences, max_len=None):
    """ Convert a list of tokens into a matrix with padding """
    if isinstance(sequences[0], str):
        sequences = list(map(str.split, sequences))
        
    max_len = min(max(map(len, sequences)), max_len or float('inf'))
    
    matrix = np.full((len(sequences), max_len), np.int32(PAD_IX))
    for i,seq in enumerate(sequences):
        row_ix = [token_to_id.get(word, UNK_IX) for word in seq[:max_len]]
        matrix[i, :len(row_ix)] = row_ix
    
    return matrix

top_companies, top_counts = zip(*Counter(data['Company']).most_common(1000))
recognized_companies = set(top_companies)
data["Company"] = data["Company"].apply(lambda comp: comp if comp in recognized_companies else "Other")

categorical_vectorizer = DictVectorizer(dtype=np.float32, sparse=False)
categorical_vectorizer.fit(data[categorical_columns].apply(dict, axis=1))

data_train, data_val = train_test_split(data, test_size=0.2, random_state=42)
data_train.index = range(len(data_train))
data_val.index = range(len(data_val))

def make_batch(data, max_len=None, word_dropout=0):
    """
    Creates a keras-friendly dict from the batch data.
    :param word_dropout: replaces token index with UNK_IX with this probability
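    :returns: a dict with {'Title': int32[batch, title_max_len],
        'FullDescription': int32[batch, descr_max_len],
        'Categorical': float32[batch, n_cat_features]}, plus the target column if present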
    """
    batch = {}
    batch["Title"] = as_matrix(data["Title"].values, max_len)
    batch["FullDescription"] = as_matrix(data["FullDescription"].values, max_len)
    batch['Categorical'] = categorical_vectorizer.transform(data[categorical_columns].apply(dict, axis=1))
    
    if word_dropout != 0:
        batch["FullDescription"] = apply_word_dropout(batch["FullDescription"], 1. - word_dropout)
    
    if target_column in data.columns:
        batch[target_column] = data[target_column].values
    
    return batch

def apply_word_dropout(matrix, keep_prop, replace_with=UNK_IX, pad_ix=PAD_IX,):
    dropout_mask = np.random.choice(2, np.shape(matrix), p=[keep_prop, 1 - keep_prop])
    dropout_mask &= matrix != pad_ix
    return np.choose(dropout_mask, [matrix, np.full_like(matrix, replace_with)])

In [3]:
def iterate_minibatches(data, batch_size=256, shuffle=True, cycle=False, **kwargs):
    """ iterates minibatches of data in random order """
    while True:
        indices = np.arange(len(data))
        if shuffle:
            indices = np.random.permutation(indices)

        for start in range(0, len(indices), batch_size):
            batch = make_batch(data.iloc[indices[start : start + batch_size]], **kwargs)
            target = batch.pop(target_column)
            yield batch, target
        
        if not cycle: break

In [7]:
def print_metrics(model, data, batch_size=batch_size, name="", **kw):
    squared_error = abs_error = num_samples = 0.0
    for batch_x, batch_y in iterate_minibatches(data, batch_size=batch_size, shuffle=False, **kw):
        batch_pred = model.predict(batch_x)[:, 0]
        squared_error += np.sum(np.square(batch_pred - batch_y))
        abs_error += np.sum(np.abs(batch_pred - batch_y))
        num_samples += len(batch_y)
    print("%s results:" % (name or ""))
    print("Mean square error: %.5f" % (squared_error / num_samples))
    print("Mean absolute error: %.5f" % (abs_error / num_samples))
    return squared_error, abs_error

Then let's define some consts and useful callbacks.

In [12]:
batch_size = 256
epochs = 100
steps_per_epoch = 100
n_tokens=len(tokens)
n_cat_features=len(categorical_vectorizer.vocabulary_)
hid_size=64

callbacks = [
    # Early stopping callback
    keras.callbacks.EarlyStopping(monitor='val_mean_absolute_error', min_delta=0.0005, patience=3),
    # Tensorboard to visualize learning curves
    keras.callbacks.TensorBoard(log_dir='./logs/')
]

Now we can take the seminar model and see how strong it is as a baseline. Of course, we'll pass our callbacks to it.

In [5]:
def build_model(n_tokens=len(tokens), n_cat_features=len(categorical_vectorizer.vocabulary_), hid_size=64):
    l_title = L.Input(shape=[None], name="Title")
    l_descr = L.Input(shape=[None], name="FullDescription")
    l_categ = L.Input(shape=[n_cat_features], name="Categorical")
    
    emb = L.Embedding(n_tokens, 2 * hid_size)
    
    l_title_emb = emb(l_title)
    l_descr_emb = emb(l_descr)
    
    l_title_conv = L.Convolution1D(hid_size, kernel_size=2, activation='relu')(l_title_emb)
    l_descr_conv = L.Convolution1D(hid_size, kernel_size=5, activation='relu')(l_descr_emb)
    
    l_title_out = L.GlobalMaxPool1D()(l_title_conv)
    l_descr_out = L.GlobalMaxPool1D()(l_descr_conv)
    
    l_categ_out = L.Dense(hid_size, activation='relu')(l_categ)
    
    l_combined = L.Concatenate()([l_title_out, l_descr_out, l_categ_out])
    l_dense_clf = L.Dense(hid_size, activation='relu')(l_combined)
    
    output_layer = L.Dense(1)(l_dense_clf)
    
    model = keras.models.Model(inputs=[l_title, l_descr, l_categ], outputs=[output_layer])
    model.compile('adam', 'mean_squared_error', metrics=['mean_absolute_error'])
    return model

In [6]:
model = build_model()

model.fit_generator(iterate_minibatches(data_train, batch_size, cycle=True, word_dropout=0.05), 
                    epochs=epochs, steps_per_epoch=steps_per_epoch,
                    validation_data=iterate_minibatches(data_val, batch_size, cycle=True),
                    validation_steps=data_val.shape[0] // batch_size,
                    callbacks=callbacks
                   )

Epoch 1/100
Epoch 2/100
Epoch 3/100
Epoch 4/100
Epoch 5/100
Epoch 6/100
Epoch 7/100
Epoch 8/100
Epoch 9/100
Epoch 10/100
Epoch 11/100
Epoch 12/100
Epoch 13/100
Epoch 14/100
Epoch 15/100
Epoch 16/100
Epoch 17/100
Epoch 18/100
Epoch 19/100
Epoch 20/100
Epoch 21/100
Epoch 22/100
Epoch 23/100
Epoch 24/100
Epoch 25/100
Epoch 26/100
Epoch 27/100
Epoch 28/100
Epoch 29/100


<keras.callbacks.History at 0x7f08d0becf60>

In [8]:
print_metrics(model, data_train, name='Train')
print_metrics(model, data_val, name='Val');

Train results:
Mean square error: 0.03584
Mean absolute error: 0.13863
Val results:
Mean square error: 0.06084
Mean absolute error: 0.18102


To improve it, let's try adding more neurons to the dense layers, plus dropout.

In [9]:
def evaluate_model(model):
    model.fit_generator(iterate_minibatches(data_train, batch_size, cycle=True, word_dropout=0.05), 
                        epochs=epochs, steps_per_epoch=steps_per_epoch,
                        validation_data=iterate_minibatches(data_val, batch_size, cycle=True),
                        validation_steps=data_val.shape[0] // batch_size,
                        callbacks=callbacks
                       )
    
    print_metrics(model, data_train, name='Train')
    print_metrics(model, data_val, name='Val');

In [13]:
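# NOTE: this cell assumes l_title, l_descr, l_categ, l_title_out and l_descr_out
# already exist at global scope (inside build_model they are local variables).
l_categ_out = L.Dense(2 * hid_size, activation='relu')(l_categ)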
l_categ_out = L.Dropout(0.5)(l_categ_out)

l_combined = L.Concatenate()([l_title_out, l_descr_out, l_categ_out])
l_dense_clf = L.Dense(2 * hid_size, activation='relu')(l_combined)

output_layer = L.Dense(1)(l_dense_clf)

model = keras.models.Model(inputs=[l_title, l_descr, l_categ], outputs=[output_layer])
model.compile('adam', 'mean_squared_error', metrics=['mean_absolute_error'])

evaluate_model(model)

Epoch 1/100
Epoch 2/100
Epoch 3/100
Epoch 4/100
Epoch 5/100
Epoch 6/100
Epoch 7/100
Epoch 8/100
Epoch 9/100
Epoch 10/100
Epoch 11/100
Epoch 12/100
Epoch 13/100
Epoch 14/100
Epoch 15/100
Epoch 16/100
Epoch 17/100
Epoch 18/100
Epoch 19/100
Epoch 20/100
Epoch 21/100
Epoch 22/100
Epoch 23/100
Epoch 24/100
Epoch 25/100
Epoch 26/100
Epoch 27/100
Epoch 28/100
Epoch 29/100
Epoch 30/100
Epoch 31/100
Epoch 32/100
Epoch 33/100
Train results:
Mean square error: 0.03653
Mean absolute error: 0.13851
Val results:
Mean square error: 0.06312
Mean absolute error: 0.18370


Almost no effect.

Next, we'll separate the convolutional/dense layers from their activations and place batch normalization in between.

In [19]:
l_title_conv = L.Convolution1D(hid_size, kernel_size=2)(l_title_emb)
l_title_conv = L.BatchNormalization()(l_title_conv)
l_title_conv = L.Activation('relu')(l_title_conv)
l_descr_conv = L.Convolution1D(hid_size, kernel_size=5)(l_descr_emb)
l_descr_conv = L.BatchNormalization()(l_descr_conv)
l_descr_conv = L.Activation('relu')(l_descr_conv)

l_title_out = L.GlobalMaxPool1D()(l_title_conv)
l_descr_out = L.GlobalMaxPool1D()(l_descr_conv)

l_categ_out = L.Dense(hid_size)(l_categ)  # no activation here: ReLU follows batch norm
l_categ_out = L.BatchNormalization()(l_categ_out)
l_categ_out = L.Activation('relu')(l_categ_out)

l_combined = L.Concatenate()([l_title_out, l_descr_out, l_categ_out])

l_dense_clf = L.Dense(hid_size)(l_combined)  # no activation here: ReLU follows batch norm
l_dense_clf = L.BatchNormalization()(l_dense_clf)
l_dense_clf = L.Activation('relu')(l_dense_clf)

output_layer = L.Dense(1)(l_dense_clf)

model = keras.models.Model(inputs=[l_title, l_descr, l_categ], outputs=[output_layer])
model.compile('adam', 'mean_squared_error', metrics=['mean_absolute_error'])

evaluate_model(model)

Epoch 1/100
Epoch 2/100
Epoch 3/100
Epoch 4/100
Epoch 5/100
Epoch 6/100
Train results:
Mean square error: 2.00679
Mean absolute error: 1.35711
Val results:
Mean square error: 2.01609
Mean absolute error: 1.35066


Or we can always add more layers...

In [16]:
l_categ_out = L.Dense(hid_size, activation='relu')(l_categ)
l_categ_out = L.Dense(hid_size, activation='relu')(l_categ_out)

l_combined = L.Concatenate()([l_title_out, l_descr_out, l_categ_out])
l_dense_clf = L.Dense(2 * hid_size, activation='relu')(l_combined)
l_dense_clf = L.Dense(2 * hid_size, activation='relu')(l_dense_clf)

output_layer = L.Dense(1)(l_dense_clf)

model = keras.models.Model(inputs=[l_title, l_descr, l_categ], outputs=[output_layer])
model.compile('adam', 'mean_squared_error', metrics=['mean_absolute_error'])

evaluate_model(model)

Epoch 1/100
Epoch 2/100
Epoch 3/100
Epoch 4/100
Epoch 5/100
Epoch 6/100
Epoch 7/100
Epoch 8/100
Epoch 9/100
Epoch 10/100
Epoch 11/100
Epoch 12/100
Epoch 13/100
Train results:
Mean square error: 0.11189
Mean absolute error: 0.25803
Val results:
Mean square error: 0.12985
Mean absolute error: 0.27466


This model is quite unstable...

### Unfortunately, I needed to go to sleep :(((

### A short report

Please tell us what you did and how it worked.

`<YOUR_TEXT_HERE>`, i guess...

## Recommended options

#### A) CNN architecture

All the tricks you know about dense and convolutional neural networks apply here as well.
* Dropout. Nuff said.
* Batch Norm. This time it's `L.BatchNormalization`
* Parallel convolution layers. The idea is that you apply several `L.Conv1D` layers to the same embeddings and concatenate their output channels.
* More layers, more neurons, ya know...
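
The parallel-convolution idea from the list above can be sketched as follows. This is a minimal illustration, not the seminar's model: the function name `parallel_conv_encoder`, the kernel sizes and the layer widths are arbitrary assumptions, and `padding="same"` is used only so that short inputs do not break the wider kernels.

```python
import numpy as np
import keras
import keras.layers as L


def parallel_conv_encoder(n_tokens, emb_size=32, filters=16, kernel_sizes=(2, 3, 5)):
    """Embed tokens, run several Conv1D branches over the same embeddings,
    max-pool each branch over time and concatenate the pooled channels."""
    tokens_in = L.Input(shape=(None,), name="tokens")
    emb = L.Embedding(n_tokens, emb_size)(tokens_in)
    branches = []
    for k in kernel_sizes:
        conv = L.Conv1D(filters, kernel_size=k, padding="same", activation="relu")(emb)
        branches.append(L.GlobalMaxPool1D()(conv))
    out = L.Concatenate()(branches)  # [batch, filters * len(kernel_sizes)]
    return keras.models.Model(inputs=tokens_in, outputs=out)


# Toy usage: 4 sequences of 7 token ids each (all zeros here, just to check shapes).
encoder = parallel_conv_encoder(n_tokens=100)
features = encoder.predict(np.zeros((4, 7), dtype="int32"), verbose=0)
print(features.shape)
```

An encoder like this could stand in for the single convolution on the title or description branch, with the rest of the network unchanged.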


#### B) Play with pooling

There's more than one way to perform pooling:
* Max over time - our `L.GlobalMaxPool1D`
* Average over time (excluding PAD)
* Softmax-pooling:
$$ out_{i} = \sum_t {h_{i,t} \cdot {{e ^ {h_{i, t}}} \over \sum_\tau e ^ {h_{i, \tau}} } }$$

* Attentive pooling
$$ out_{i} = \sum_t {h_{i,t} \cdot Attn(h_t)}$$

where $$ Attn(h_t) = {{e ^ {NN_{attn}(h_t)}} \over \sum_\tau e ^ {NN_{attn}(h_\tau)}}  $$
and $NN_{attn}$ is a dense layer.

The optimal score is usually achieved by concatenating several different poolings, including several attentive poolings with different $NN_{attn}$ (aka multi-headed attention).
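
As a reference for what these poolings compute, here is a minimal numpy sketch (not a Keras layer). It assumes `h` has shape `[batch, time, units]`, `mask` is a boolean `[batch, time]` array that is `False` at PAD positions, every row has at least one real token, and $NN_{attn}$ is a single dense unit with weights `w` and bias `b`. All function names are hypothetical.

```python
import numpy as np


def masked_softmax(scores, mask):
    """Softmax over the time axis (axis=1) that gives zero weight to padding."""
    scores = np.where(mask, scores, -np.inf)
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    return weights / weights.sum(axis=1, keepdims=True)


def average_pool(h, mask):
    """Mean over non-PAD time steps."""
    m = mask[:, :, None].astype(h.dtype)
    return (h * m).sum(axis=1) / m.sum(axis=1)


def softmax_pool(h, mask):
    """out_i = sum_t h_{i,t} * softmax_t(h_{i,t}), computed separately per unit i."""
    weights = masked_softmax(h, mask[:, :, None])
    return (h * weights).sum(axis=1)


def attentive_pool(h, mask, w, b=0.0):
    """out_i = sum_t h_{i,t} * Attn(h_t), with one attention score per time step."""
    scores = h @ w + b                      # [batch, time]
    weights = masked_softmax(scores, mask)  # [batch, time]
    return (h * weights[:, :, None]).sum(axis=1)


rng = np.random.RandomState(0)
h = rng.randn(2, 4, 3)                      # [batch=2, time=4, units=3]
mask = np.array([[1, 1, 1, 0], [1, 1, 0, 0]], dtype=bool)
w = rng.randn(3)
pooled = np.concatenate(
    [average_pool(h, mask), softmax_pool(h, mask), attentive_pool(h, mask, w)], axis=-1)
print(pooled.shape)  # (2, 9)
```

With all-zero attention weights every real token gets the same score, so attentive pooling reduces to the masked average, which is a handy sanity check when porting this to a custom layer.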

The catch is that keras layers do not include those toys. You will have to [write your own keras layer](https://keras.io/layers/writing-your-own-keras-layers/). Or use pure tensorflow, it might even be easier :)

#### C) Fun with words

It's not always a good idea to train embeddings from scratch. Here are a few tricks:

* Use pre-trained embeddings from `gensim.downloader.load`. See the last lecture.
* Start with pre-trained embeddings, then fine-tune them with gradient descent. You may or may not want to use __`.get_keras_embedding()`__ method for word2vec
* Use the same embedding matrix for the title and description encoders
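
A minimal sketch of the "start from pre-trained vectors" step, assuming the pre-trained embeddings are available as a word-to-vector mapping. A toy dict stands in for the downloaded vectors here, and `build_embedding_matrix` is a hypothetical helper, not part of the seminar code.

```python
import numpy as np


def build_embedding_matrix(token_to_id, pretrained, emb_size, seed=0):
    """Rows follow token_to_id; tokens missing from `pretrained` keep a small
    random initialisation so they can still be learned during fine-tuning."""
    rng = np.random.RandomState(seed)
    matrix = rng.normal(scale=0.1, size=(len(token_to_id), emb_size)).astype("float32")
    n_found = 0
    for token, ix in token_to_id.items():
        if token in pretrained:
            matrix[ix] = pretrained[token]
            n_found += 1
    return matrix, n_found


# Toy stand-ins: a real run would use the notebook's token_to_id and downloaded vectors.
toy_token_to_id = {"UNK": 0, "PAD": 1, "engineer": 2, "nurse": 3}
toy_pretrained = {"engineer": np.ones(4), "nurse": -np.ones(4)}
emb_matrix, n_found = build_embedding_matrix(toy_token_to_id, toy_pretrained, emb_size=4)
print(emb_matrix.shape, n_found)
```

The resulting matrix could then be used to initialise the shared embedding layer. Whether to freeze it or fine-tune it is the choice described in the list above, and reporting `n_found` is a cheap way to see how much of the vocabulary the pre-trained vectors actually cover.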


#### D) Going recurrent

We've already learned that recurrent networks can do cool stuff in sequence modelling. It turns out they're useful for classification as well. With some tricks, of course.

* As with convolutional layers, LSTM outputs should be pooled into a fixed-size vector using one of the poolings above.
* Since you know all the text in advance, use a bidirectional RNN
  * Run one LSTM from left to right
  * Run another in parallel from right to left 
  * Concatenate their output sequences along the unit axis (`axis=-1`)

* It might be a good idea to mix convolutions and recurrent layers differently for title and description
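
The three bidirectional steps above can be sketched in plain numpy. To keep it short, a simple tanh recurrent cell stands in for the LSTM (an assumption made for brevity), padding is not handled, and the function names are hypothetical. In Keras, the `L.Bidirectional` wrapper around a recurrent layer plays the same role.

```python
import numpy as np


def run_rnn(x, w_in, w_rec, reverse=False):
    """Simple tanh RNN over x: [batch, time, features] -> [batch, time, units].
    With reverse=True the sequence is read right to left, but outputs are
    stored at their original time positions so both directions line up."""
    batch, time, _ = x.shape
    units = w_rec.shape[0]
    state = np.zeros((batch, units))
    outputs = np.zeros((batch, time, units))
    steps = range(time - 1, -1, -1) if reverse else range(time)
    for t in steps:
        state = np.tanh(x[:, t] @ w_in + state @ w_rec)
        outputs[:, t] = state
    return outputs


def bidirectional(x, fwd_weights, bwd_weights):
    """Concatenate left-to-right and right-to-left outputs along the unit axis."""
    fwd = run_rnn(x, *fwd_weights)
    bwd = run_rnn(x, *bwd_weights, reverse=True)
    return np.concatenate([fwd, bwd], axis=-1)


rng = np.random.RandomState(0)
x = rng.randn(2, 5, 3)                                   # [batch, time, features]
fwd_weights = (rng.randn(3, 4) * 0.5, rng.randn(4, 4) * 0.5)
bwd_weights = (rng.randn(3, 4) * 0.5, rng.randn(4, 4) * 0.5)
h = bidirectional(x, fwd_weights, bwd_weights)
print(h.shape)  # (2, 5, 8)
pooled = h.max(axis=1)                                   # max over time -> [batch, 8]
```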


#### E) Optimizing seriously

* You don't necessarily need 100 epochs. Use early stopping. If you've never done this before, take a look at [early stopping callback](https://keras.io/callbacks/#earlystopping).
  * In short, train until you notice that the validation error stops improving
  * Maintain the best-on-validation snapshot via `model.save(file_name)`
  * Plotting learning curves is usually a good idea
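
The early-stopping logic described above can be sketched without any framework. This is an illustration of the idea, not the Keras implementation: the function name is hypothetical, the defaults mirror the `patience=3, min_delta=0.0005` used in the callbacks earlier in this notebook, and the validation numbers are made up.

```python
def early_stopping_run(val_scores, patience=3, min_delta=0.0005):
    """Walk through per-epoch validation scores (lower is better).
    Returns (stop_epoch, best_epoch, best_score) with 1-based epochs."""
    best_score, best_epoch, wait = float("inf"), 0, 0
    for epoch, score in enumerate(val_scores, start=1):
        if score < best_score - min_delta:
            # Improvement: a real training loop would call model.save(file_name) here.
            best_score, best_epoch, wait = score, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch, best_score
    return len(val_scores), best_epoch, best_score


# Hypothetical validation MAE per epoch, for illustration only.
val_mae = [0.30, 0.22, 0.20, 0.21, 0.205, 0.2, 0.19]
stop_epoch, best_epoch, best_score = early_stopping_run(val_mae)
print("stopped at epoch %d, best was epoch %d (MAE %.3f)" % (stop_epoch, best_epoch, best_score))
```

Note that training stops a few epochs after the best one, which is why keeping the best-on-validation snapshot matters: the weights at the stopping epoch are not the ones you want to report.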
  
Good luck! And may the force be with you!