# Large scale text analysis with deep learning
# Practice (2 points)

In this assignment, you will develop a deep learning model that predicts salaries based on resumes (credits to [Oleg Vasilev](https://github.com/Omrigan/) and [Yandex Data School](https://github.com/yandexdataschool/nlp_course/tree/2024/week02_classification)). We will begin this project during our class, but it will require additional effort from you to finalize and submit it as the homework.

To secure a full grade for this assignment, ensure a consistent training process with an MSE loss falling below 0.10. If necessary, feel free to experiment with different parameters and model architectures. You are asked to implement at least two different architectures that reach this MSE.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
#%matplotlib inline

### Data from Adzuna challenge
For starters, let's download the data from
1)  [google drive](https://drive.google.com/file/d/1c41JVi3so_GSQ0tPCIsmXsq_dUTSb5iZ/view?usp=sharing), or
2) the Kaggle competition [page](https://www.kaggle.com/c/job-salary-prediction/data) (pick `Train_rev1.*`).

Source: Adzuna, Andrew Hunter, Ben Hamner, Fabio, and XiaoJenna. Job Salary Prediction. https://kaggle.com/competitions/job-salary-prediction, 2013. Kaggle.

_You can read more [in the official description](https://www.kaggle.com/c/job-salary-prediction#description)._

In [None]:
#import gdown
#gdown.download(f"https://drive.google.com/uc?id=1c41JVi3so_GSQ0tPCIsmXsq_dUTSb5iZ", "Train_rev1.zip", quiet=False)

In [None]:
data = pd.read_csv("./Train_rev1.zip", compression='zip', index_col=None)
data.shape

In [None]:
data.head()

One challenge with salary prediction is its uneven distribution: while many individuals earn standard salaries, a few receive exceptionally high incomes. This results in a right-skewed, heavy-tailed distribution, which poses challenges for minimizing Mean Squared Error (MSE).

Several techniques can address this issue, such as using a different loss function, predicting the logarithm of the target instead of the raw target, or replacing the targets with their percentiles within the salary distribution of the training set. For now, we will use the logarithmic transformation.
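
To make the effect of the logarithmic transformation concrete, here is a minimal sketch on made-up numbers (the `salaries` array is hypothetical, not taken from the dataset). It shows that `np.log1p` compresses the heavy right tail and that `np.expm1` inverts it, so predictions can always be mapped back to the original scale.

```python
import numpy as np

# Hypothetical salaries: mostly typical values plus one extreme outlier.
salaries = np.array([18_000, 22_000, 25_000, 30_000, 35_000, 400_000], dtype=np.float64)

log_salaries = np.log1p(salaries)   # log(1 + x), the same transform applied to the target
restored = np.expm1(log_salaries)   # inverse transform: exp(y) - 1

# Ratio of the largest to the smallest value, before and after the transform.
raw_spread = salaries.max() / salaries.min()
log_spread = log_salaries.max() / log_salaries.min()
print(f"raw max/min: {raw_spread:.1f}, log max/min: {log_spread:.2f}")
```

In log-space the outlier is only slightly larger than the typical values, so a squared-error loss is no longer dominated by a handful of very high salaries.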

In [None]:
data['Log1pSalary'] = np.log1p(data['SalaryNormalized']).astype('float32')

plt.figure(figsize=[8, 4])
plt.subplot(1, 2, 1)
plt.hist(data["SalaryNormalized"], bins=20);

plt.subplot(1, 2, 2)
plt.hist(data['Log1pSalary'], bins=20);

Our task is to predict one number, __Log1pSalary__.

To do so, our model can access a number of features:
* Free text: __`Title`__ and  __`FullDescription`__
* Categorical: __`Category`__, __`Company`__, __`LocationNormalized`__, __`ContractType`__, and __`ContractTime`__.

In [None]:
text_columns = ["Title", "FullDescription"]
categorical_columns = ["Category", "Company", "LocationNormalized", "ContractType", "ContractTime"]
TARGET_COLUMN = "Log1pSalary"

data[categorical_columns] = data[categorical_columns].fillna('NaN') # cast missing values to string "NaN"

data.sample(3)

### Preprocessing text data

We begin with tokenization: splitting raw text into sequences of tokens (words, punctuation, etc).

__Your task__ is to lowercase and tokenize all `Title` and `FullDescription` columns. Store the tokenized data as a __space-separated__ string of tokens for future processing.

It's okay to use nltk tokenizers. Assertions were designed for `WordPunctTokenizer`; slight deviations are okay.
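
As a dependency-free illustration of the expected output format, the sketch below tokenizes with a plain regular expression. It assumes that the pattern `\w+|[^\w\s]+` matches what nltk's `WordPunctTokenizer` does; the helper name `tokenize_line` is hypothetical and is not part of the assignment.

```python
import re

# Assumption: this pattern mirrors the regular expression behind nltk's
# WordPunctTokenizer (runs of word characters, or runs of punctuation).
WORD_PUNCT = re.compile(r"\w+|[^\w\s]+")

def tokenize_line(line):
    """Lowercase a raw value and return it as a space-separated string of tokens."""
    return " ".join(WORD_PUNCT.findall(str(line).lower()))

example = "International Digital Account Manager (German)"
tokenized_example = tokenize_line(example)
print(tokenized_example)
```

Casting with `str(...)` means missing values (NaN floats) become the token `nan` instead of raising an error.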

In [None]:
print("Raw text:")
print(data["FullDescription"][2::100000])

In [None]:
import nltk
#TODO
from nltk.tokenize import WordPunctTokenizer
tokenizer = WordPunctTokenizer()
data["FullDescription"] = data["FullDescription"].apply(lambda line: ' '.join(tokenizer.tokenize(str(line))).lower())
data["Title"] = data["Title"].apply(lambda line: ' '.join(tokenizer.tokenize(str(line))).lower())


Now we can assume that our text is a space-separated list of tokens:

In [None]:
print("Tokenized:")
print(data["FullDescription"][2::100000])
assert data["FullDescription"][2][:50] == 'mathematical modeller / simulation analyst / opera'
assert data["Title"][54321] == 'international digital account manager ( german )'

Not all words are equally useful. Some may be typos or rare words that appear only a few times.

To address this, we will count the occurrences of each word in the data to create a "whitelist" of known words.
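
The whitelist idea can be sketched on a toy corpus. All names below (`toy_lines`, `toy_min_count`, and so on) are hypothetical; the real notebook applies the same steps to the `Title` and `FullDescription` columns with a much larger threshold.

```python
from collections import Counter

# Hypothetical tokenized lines; "anlyst" plays the role of a typo.
toy_lines = ["senior data analyst", "data engineer", "junior data analyst", "data anlyst"]

toy_counts = Counter()
for line in toy_lines:
    toy_counts.update(line.split())

# Keep only tokens seen at least toy_min_count times, then prepend special tokens.
toy_min_count = 2
toy_whitelist = sorted(tok for tok, cnt in toy_counts.items() if cnt >= toy_min_count)
toy_vocab = ["UNK", "PAD"] + toy_whitelist
toy_token_to_id = {tok: ix for ix, tok in enumerate(toy_vocab)}
print(toy_vocab)
```

Rare tokens such as the typo fall out of the vocabulary and will later be mapped to `UNK`.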

In [None]:
data['Title'].values[0]

In [None]:
from collections import Counter
token_counts = Counter()

# Count how many times each token occurs in both "Title" and "FullDescription" in total
#TODO <YOUR CODE>
for title in data['Title']:
    token_counts.update(title.split())
for desc in data['FullDescription']:
    token_counts.update(desc.split())


In [None]:
print("Total unique tokens :", len(token_counts))
print('\n'.join(map(str, token_counts.most_common(n=5))))
print('...')
print('\n'.join(map(str, token_counts.most_common()[-3:])))

assert token_counts.most_common(1)[0][1] in  range(2600000, 2700000)
assert len(token_counts) in range(200000, 210000)
print('Correct!')

The word distribution follows the so-called [Zipf's law](https://en.wikipedia.org/wiki/Zipf%27s_law).

In [None]:
token_counts.most_common(100)

In [None]:
# Let's see how many words there are for each count
plt.hist(list(token_counts.values()), range=[0, 10**4], bins=50, log=True)
plt.xlabel("Word counts")

In [None]:
len([token for token, count in token_counts.items() if count > 100])

__Task 1.1__ Get a list of all tokens that occur at least 10 times.

In [None]:
min_count = 10

# tokens from token_counts keys that had at least min_count occurrences throughout the dataset
tokens = sorted([token for token, count in token_counts.items() if count >= min_count]) #TODO<YOUR CODE HERE>

# Add special tokens for unknown and empty words
UNK, PAD = "UNK", "PAD"
tokens = [UNK, PAD] + tokens

In [None]:
print("Vocabulary size:", len(tokens))
assert type(tokens) == list
assert len(tokens) in range(32000, 35000)
assert 'me' in tokens
assert UNK in tokens
print("Correct!")

__Task 1.2__ Build the inverse token index: a dictionary from token (string) to its index in `tokens` (int)

In [None]:
token_to_id =  dict(zip(tokens,list(range(len(tokens))))) #<your code here - dict of token name to its index in tokens>

In [None]:
assert isinstance(token_to_id, dict)
assert len(token_to_id) == len(tokens)
for tok in tokens:
    assert tokens[token_to_id[tok]] == tok

print("Correct!")

And finally, let's use the vocabulary you've built to map text lines into neural network-digestible matrices.
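
Here is a small self-contained sketch of the conversion, using a hypothetical five-entry vocabulary (`toy_ids`) rather than the real `token_to_id`. Shorter lines are padded with the `PAD` index and out-of-vocabulary words become `UNK`.

```python
import numpy as np

# Hypothetical miniature vocabulary.
toy_ids = {"UNK": 0, "PAD": 1, "data": 2, "analyst": 3, "senior": 4}

def toy_as_matrix(lines, max_len=None):
    """Turn space-separated token strings into a padded int32 matrix."""
    seqs = [line.split() for line in lines]
    width = max(len(s) for s in seqs)
    if max_len is not None:
        width = min(width, max_len)          # optionally truncate long sequences
    matrix = np.full((len(seqs), width), toy_ids["PAD"], dtype=np.int32)
    for i, seq in enumerate(seqs):
        row = [toy_ids.get(tok, toy_ids["UNK"]) for tok in seq[:width]]
        matrix[i, :len(row)] = row
    return matrix

toy_matrix = toy_as_matrix(["senior data analyst", "data wizard"])
print(toy_matrix)
```

The second line contains the unknown word `wizard` and is one token shorter, so its row ends with the `UNK` index followed by one `PAD` index.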

In [None]:
UNK_IX, PAD_IX = map(token_to_id.get, [UNK, PAD])

def as_matrix(sequences, max_len=None):
    """ Convert a list of tokens into a matrix with padding """
    if isinstance(sequences[0], str):
        sequences = list(map(str.split, sequences))

    max_len = min(max(map(len, sequences)), max_len or float('inf'))

    matrix = np.full((len(sequences), max_len), np.int32(PAD_IX))
    for i,seq in enumerate(sequences):
        row_ix = [token_to_id.get(word, UNK_IX) for word in seq[:max_len]]
        matrix[i, :len(row_ix)] = row_ix

    return matrix

In [None]:
print("Lines:")
print('\n'.join(data["Title"][::100000].values), end='\n\n')
print("Matrix:")
print(as_matrix(data["Title"][::100000]))

Now, let's encode the categorical data we have.

As usual, we shall use one-hot encoding for simplicity. Kudos if you implement more advanced encodings.
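
For intuition, one-hot encoding of dictionary-like rows can be written out by hand. This is only a sketch on hypothetical rows; it assumes the `column=value` naming and sorted column order that `DictVectorizer` uses for string features, and it is not a replacement for the vectorizer in the notebook.

```python
import numpy as np

# Hypothetical rows with two categorical columns; missing values are the string "NaN".
toy_rows = [
    {"ContractType": "full_time", "ContractTime": "permanent"},
    {"ContractType": "part_time", "ContractTime": "permanent"},
    {"ContractType": "NaN", "ContractTime": "contract"},
]

# One output column per (column, value) pair, named "column=value" and sorted.
feature_names = sorted({f"{col}={val}" for row in toy_rows for col, val in row.items()})
feature_index = {name: ix for ix, name in enumerate(feature_names)}

one_hot = np.zeros((len(toy_rows), len(feature_names)), dtype=np.float32)
for i, row in enumerate(toy_rows):
    for col, val in row.items():
        one_hot[i, feature_index[f"{col}={val}"]] = 1.0

print(feature_names)
print(one_hot)
```

Every row has exactly one active entry per categorical column, which is why high-cardinality columns such as `Company` are capped before encoding.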

In [None]:
from sklearn.feature_extraction import DictVectorizer

# we only consider top-1k most frequent companies to minimize memory usage
top_companies, top_counts = zip(*Counter(data['Company']).most_common(1000))
recognized_companies = set(top_companies)
data["Company"] = data["Company"].apply(lambda comp: comp if comp in recognized_companies else "Other")

categorical_vectorizer = DictVectorizer(dtype=np.float32, sparse=False)
categorical_vectorizer.fit(data[categorical_columns].apply(dict, axis=1))

### The deep learning part

Once we've learned to cast our data into numbers, we can design a machine learning experiment.

We won't focus too much on validation, opting for a simple train-test split.

__To be completely rigorous,__ we've committed a small crime here: we used the whole dataset for tokenization and for building the vocabulary. A stricter approach would be to do that on the training set only.
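
A stricter variant would count tokens on the training split only and treat everything else as unknown. The sketch below uses hypothetical toy splits (`toy_train`, `toy_val`) to show the idea; it is not how the notebook currently builds its vocabulary.

```python
from collections import Counter

# Hypothetical, already tokenized splits.
toy_train = ["data analyst", "data engineer", "data analyst lead"]
toy_val = ["data scientist", "quantum analyst"]

# Count tokens on the training split only.
train_counts = Counter(tok for line in toy_train for tok in line.split())
train_vocab = {tok for tok, cnt in train_counts.items() if cnt >= 2}

# Validation tokens outside the training vocabulary are mapped to UNK.
val_encoded = [[tok if tok in train_vocab else "UNK" for tok in line.split()]
               for line in toy_val]
print(val_encoded)
```

This way the validation score also reflects how the model copes with words it has never seen during training.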

In [None]:
from sklearn.model_selection import train_test_split

data_train, data_val = train_test_split(data, test_size=0.2, random_state=42)
data_train.index = range(len(data_train))
data_val.index = range(len(data_val))

print("Train size = ", len(data_train))
print("Validation size = ", len(data_val))

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F


device = 'cuda' if torch.cuda.is_available() else 'cpu'


def to_tensors(batch, device):
    batch_tensors = dict()
    for key, arr in batch.items():
        if key in ["FullDescription", "Title"]:
            batch_tensors[key] = torch.tensor(arr, device=device, dtype=torch.int64)
        else:
            batch_tensors[key] = torch.tensor(arr, device=device)
    return batch_tensors


def make_batch(data, max_len=None, word_dropout=0, device=device):
    """
    Creates a torch-friendly dict from the batch data.
    :param max_len: if given, truncates token sequences to this length
    :param word_dropout: replaces token index with UNK_IX with this probability
    :returns: a dict with {'Title': int64[batch, title_max_len],
              'FullDescription': int64[batch, desc_max_len],
              'Categorical': float32[batch, n_cat_features],
              TARGET_COLUMN: float32[batch] (only if the column is present)}
    """
    batch = {}
    batch["Title"] = as_matrix(data["Title"].values, max_len)
    batch["FullDescription"] = as_matrix(data["FullDescription"].values, max_len)
    batch['Categorical'] = categorical_vectorizer.transform(data[categorical_columns].apply(dict, axis=1))

    if word_dropout != 0:
        batch["Title"] = apply_word_dropout(batch["Title"], 1. - word_dropout)
        batch["FullDescription"] = apply_word_dropout(batch["FullDescription"], 1. - word_dropout)

    # The target is optional, so the same function also works on unlabeled data
    if TARGET_COLUMN in data.columns:
        batch[TARGET_COLUMN] = data[TARGET_COLUMN].values

    return to_tensors(batch, device)

def apply_word_dropout(matrix, keep_prop, replace_with=UNK_IX, pad_ix=PAD_IX,):
    dropout_mask = np.random.choice(2, np.shape(matrix), p=[keep_prop, 1 - keep_prop])
    dropout_mask &= matrix != pad_ix
    return np.choose(dropout_mask, [matrix, np.full_like(matrix, replace_with)])

In [None]:
make_batch(data_train[:3], max_len=10)

#### Architecture

Our basic model consists of three branches:
* Title encoder
* Description encoder
* Categorical features encoder

We will then feed all 3 branches into one common network that predicts salary.

![scheme](https://github.com/yandexdataschool/nlp_course/raw/master/resources/w2_conv_arch.png)

In [None]:
# pre-trained GloVe embeddings
#!wget https://downloads.cs.stanford.edu/nlp/data/glove.6B.zip
#!unzip -q glove.6B.zip

import numpy as np

def load_glove_embeddings(filepath):
    embeddings = {}
    with open(filepath, "r", encoding="utf-8") as f:
        for line in f:
            values = line.split()
            word = values[0]  # First word
            vector = np.asarray(values[1:], dtype="float32")  # Rest are vector values
            embeddings[word] = vector
    return embeddings


def create_embedding_matrix(word_to_index, glove_embeddings, embedding_dim=100):
    vocab_size = len(word_to_index)
    embedding_matrix = np.zeros((vocab_size, embedding_dim))  # Initialize with zeros

    for word, idx in word_to_index.items():
        vector = glove_embeddings.get(word)  # Get pre-trained vector if available
        if vector is not None:
            embedding_matrix[idx] = vector  # Assign vector to corresponding index
        else:
            embedding_matrix[idx] = np.random.normal(scale=0.6, size=(embedding_dim,))  # Random init

    return torch.tensor(embedding_matrix, dtype=torch.float32)

glove_embeddings = load_glove_embeddings("glove.6B.100d.txt")  # Load 100D embeddings

embedding_matrix = create_embedding_matrix(token_to_id, glove_embeddings, embedding_dim=100)



In [None]:
class titleEncoder(nn.Module):
    def __init__(self, n_tokens=len(tokens), hid_size=64):
        super().__init__()
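        self.emb = nn.Embedding(n_tokens, hid_size)
        # Initialise from GloVe: hid_size must equal the GloVe dimension (100 here)
        self.emb.weight = nn.Parameter(embedding_matrix)
        self.emb.weight.requires_grad = True  # fine-tune the embeddings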

        # Convolutions for different n-gram sizes
        self.conv1 = nn.Conv1d(hid_size, hid_size, kernel_size=3, padding=1)
        self.conv1_1 = nn.Conv1d(hid_size, hid_size, kernel_size=3, padding=1)


        self.conv2 = nn.Conv1d(hid_size, hid_size, kernel_size=5, padding=2)
        self.conv2_1 = nn.Conv1d(hid_size, hid_size, kernel_size=3, padding=1)


        # Layer normalization
        self.layer_norm = nn.LayerNorm(hid_size)
        self.layer_normCombined = nn.LayerNorm(hid_size * 2)

        self.gelu = nn.GELU()
        self.pool = nn.AdaptiveMaxPool1d(1)

        self.proj_desc = nn.Linear(hid_size * 2, hid_size)

    def forward(self, x):
        x = self.emb(x).transpose(1, 2)

        # Conv 1 branch (kernel sizes 3 -> 3); only layers defined in __init__ are used
        conv1 = self.gelu(self.conv1(x))
        conv1 = self.gelu(self.conv1_1(conv1))

        # Conv 2 branch (kernel sizes 5 -> 3)
        conv2 = self.gelu(self.conv2(x))
        conv2 = self.gelu(self.conv2_1(conv2))
        
        # Concatenate both branches
        x = torch.cat([conv1, conv2], dim=1)

        # Normalize combined output
        x = x.transpose(1, 2)  
        x = self.layer_normCombined(x)
        x = x.transpose(1, 2) 

        x = self.gelu(x)
        x = self.pool(x).squeeze(-1)

        return x


In [None]:
enc = titleEncoder(hid_size=100).to(device)
enc(make_batch(data_train[:3], max_len=10)['Title'],).shape

In [None]:
class descriptionEncoder(nn.Module):
    def __init__(self, n_tokens=len(tokens), hid_size=64):
        super().__init__()
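        self.emb = nn.Embedding(n_tokens, hid_size)
        # Initialise from GloVe: hid_size must equal the GloVe dimension (100 here)
        self.emb.weight = nn.Parameter(embedding_matrix)
        self.emb.weight.requires_grad = True  # fine-tune the embeddings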

        # 3 sets of deep convolutions (different kernel sizes)
        self.conv1_1 = nn.Conv1d(hid_size, hid_size // 2, kernel_size=3, padding=1)
        self.conv1_2 = nn.Conv1d(hid_size // 2, hid_size // 2, kernel_size=3, padding=1)
        self.conv1_3 = nn.Conv1d(hid_size // 2, hid_size // 2, kernel_size=3, padding=1)


        self.conv2_1 = nn.Conv1d(hid_size, hid_size // 2, kernel_size=5, padding=2)
        self.conv2_2 = nn.Conv1d(hid_size // 2, hid_size // 2, kernel_size=5, padding=2)
        self.conv2_3 = nn.Conv1d(hid_size // 2, hid_size // 2, kernel_size=3, padding=1)


        self.conv3_1 = nn.Conv1d(hid_size, hid_size // 2, kernel_size=7, padding=3)
        self.conv3_2 = nn.Conv1d(hid_size // 2, hid_size // 2, kernel_size=5, padding=2)
        self.conv3_3 = nn.Conv1d(hid_size // 2, hid_size // 2, kernel_size=3, padding=1)

        # Activation function
        self.gelu = nn.GELU()

        # Pooling layers
        self.max_pool = nn.AdaptiveMaxPool1d(1)
        self.avg_pool = nn.AdaptiveAvgPool1d(1)

        self.layer_norm = nn.LayerNorm(hid_size // 2)  # Per feature normalization
        self.layer_norm_combined = nn.LayerNorm(hid_size // 2 * 3)  # After concatenation
        
        self.proj_desc = nn.Linear(hid_size * 3, hid_size)

    def forward(self, x):
        x = self.emb(x).transpose(1, 2)

        # Branch 1 (kernel sizes 3 -> 3 -> 3); only layers defined in __init__ are used
        conv1 = self.gelu(self.conv1_1(x))
        conv1 = self.gelu(self.conv1_2(conv1))
        conv1 = self.gelu(self.conv1_3(conv1))

        # Branch 2 (kernel sizes 5 -> 5 -> 3)
        conv2 = self.gelu(self.conv2_1(x))
        conv2 = self.gelu(self.conv2_2(conv2))
        conv2 = self.gelu(self.conv2_3(conv2))

        # Branch 3 (kernel sizes 7 -> 5 -> 3)
        conv3 = self.gelu(self.conv3_1(x))
        conv3 = self.gelu(self.conv3_2(conv3))
        conv3 = self.gelu(self.conv3_3(conv3))

   
        x = torch.cat([conv1, conv2, conv3], dim=1)

      
        x = x.transpose(1, 2)  
        x = self.layer_norm_combined(x)
        x = x.transpose(1, 2)  

        x = self.gelu(x)

        # Apply pooling
        maxpool = self.max_pool(x).squeeze(-1)
        avgpool = self.avg_pool(x).squeeze(-1)

        return torch.cat([maxpool, avgpool], dim=1)


In [None]:
enc = descriptionEncoder(hid_size=100).to(device)
enc(make_batch(data_train[:3], max_len=10)['FullDescription']).shape

In [None]:
class categoriesEncoder(nn.Module):
    def __init__(self, n_cat_features, hid_size=64):
        super().__init__()

        self.fc1 = nn.Linear(n_cat_features, hid_size*2) 
        
        # as in the text encoders, use the GELU activation, which is common in NLP models
        self.gelu = nn.GELU()  
        
        #add a second layer to make the encoder more complex
        self.fc2 = nn.Linear(hid_size*2, hid_size*2)  
        
        # use layer norm to stabilize training
        self.layer_norm = nn.LayerNorm(hid_size*2) 

    def forward(self, x):
        x = self.fc1(x)
        x = self.gelu(x)  
        x = self.fc2(x) 
        x = self.layer_norm(x) 
        return x


In [None]:
enc = categoriesEncoder(len(categorical_vectorizer.vocabulary_),hid_size=100).to(device)
enc(make_batch(data_train[:3], max_len=10)['Categorical']).shape

In [None]:


# embed -> conv -> pool
class SalaryPredictor(nn.Module):
    def __init__(self, n_tokens=len(tokens), n_cat_features=len(categorical_vectorizer.vocabulary_), hid_size=100):
        super().__init__()
        self.title_encoder = titleEncoder(hid_size=hid_size).to(device)
        self.desc_encoder = descriptionEncoder(hid_size=hid_size).to(device)
        self.cat_encoder = categoriesEncoder(n_cat_features, hid_size=hid_size).to(device)
        self.shrink = nn.Linear(hid_size*7, hid_size*4).to(device)
        self.fc = nn.Linear(hid_size*4, 1).to(device)
        
    
    def forward(self, batch):
        title_emb = self.title_encoder(batch["Title"])
        desc_emb = self.desc_encoder(batch["FullDescription"])
        cat_emb = self.cat_encoder(batch["Categorical"])
        x = torch.cat([title_emb, desc_emb, cat_emb], dim=1)
        x = self.shrink(x)
        return self.fc(x).squeeze(-1)
        


In [None]:
model = SalaryPredictor().to(device)
batch = make_batch(data_train[:100], device=device)
criterion = nn.MSELoss()

dummy_pred = model(batch)
dummy_loss = criterion(dummy_pred, batch[TARGET_COLUMN])
assert dummy_pred.shape == torch.Size([100])
assert len(torch.unique(dummy_pred)) > 20, "model returns suspiciously few unique outputs. Check your initialization"
assert dummy_loss.ndim == 0 and 0. <= dummy_loss <= 250., "make sure you minimize MSE"

#### Training and evaluation

As usual, we're going to feed our monster with random minibatches of data.

As we train, we want to monitor not only the loss function, which is computed in log-space, but also the actual error measured in pounds (GBP).
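
The difference between the two views of the error can be sketched as follows. The salaries and prediction offsets are made up for illustration; the point is only that log-space predictions need `np.expm1` before an error in the original currency units can be computed.

```python
import numpy as np

# Hypothetical ground truth and log-space predictions.
true_salary = np.array([20_000.0, 30_000.0, 60_000.0])
log_target = np.log1p(true_salary)
log_pred = log_target + np.array([0.1, -0.2, 0.05])   # made-up prediction errors

# Metrics in log-space (what the training loss sees).
mse_log = np.mean((log_pred - log_target) ** 2)
mae_log = np.mean(np.abs(log_pred - log_target))

# Metric in the original currency units: undo the log1p transform first.
mae_currency = np.mean(np.abs(np.expm1(log_pred) - true_salary))

print(f"log-space MSE: {mse_log:.4f}, log-space MAE: {mae_log:.4f}")
print(f"MAE in currency units: {mae_currency:.0f}")
```

A log-space error of 0.1 corresponds to roughly a 10% relative error, so the same log-space MAE means a larger absolute miss on a high salary than on a low one.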

In [None]:
def iterate_minibatches(data, batch_size=256, shuffle=True, cycle=False, device=device, **kwargs):
    """ iterates minibatches of data in random order """
    while True:
        indices = np.arange(len(data))
        if shuffle:
            indices = np.random.permutation(indices)

        for start in range(0, len(indices), batch_size):
            batch = make_batch(data.iloc[indices[start : start + batch_size]], device=device, **kwargs)
            yield batch

        if not cycle: break

### Model training

We can now fit our model the usual minibatch way. Minibatches are produced by the `iterate_minibatches` function: with `cycle=True` it yields an infinite stream, while the loop below uses the default `cycle=False` and makes one pass over the training data per epoch.
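
The generator pattern behind the minibatch stream can be shown in isolation. The function `toy_minibatches` below is a hypothetical, simplified stand-in that yields index arrays instead of tensors; it only illustrates how a `cycle` flag turns a single pass into an endless stream.

```python
import itertools
import numpy as np

def toy_minibatches(n_items, batch_size, shuffle=True, cycle=False, seed=0):
    """Yield arrays of indices; with cycle=True the generator never stops."""
    rng = np.random.default_rng(seed)
    while True:
        indices = np.arange(n_items)
        if shuffle:
            indices = rng.permutation(indices)
        for start in range(0, n_items, batch_size):
            yield indices[start:start + batch_size]
        if not cycle:
            break

# One pass: 10 items in batches of 4 gives batches of size 4, 4 and 2.
one_epoch = list(toy_minibatches(10, batch_size=4))

# Infinite stream: take as many batches as needed with islice.
first_seven = list(itertools.islice(toy_minibatches(10, batch_size=4, cycle=True), 7))
print([len(b) for b in one_epoch], len(first_seven))
```

With an infinite stream the training loop is usually written in terms of a number of steps rather than a number of epochs.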

In [None]:
from tqdm.auto import tqdm

BATCH_SIZE = 128
EPOCHS = 200

In [None]:
def print_metrics(model, data, batch_size=BATCH_SIZE, name="", device=torch.device('cpu'), **kw):
    squared_error = abs_error = num_samples = 0.0
    model.eval()
    with torch.no_grad():
        for batch in iterate_minibatches(data, batch_size=batch_size, shuffle=False, device=device, **kw):
            batch_pred = model(batch)
            squared_error += torch.sum(torch.square(batch_pred - batch[TARGET_COLUMN]))
            abs_error += torch.sum(torch.abs(batch_pred - batch[TARGET_COLUMN]))
            num_samples += len(batch_pred)
    mse = squared_error.detach().cpu().numpy() / num_samples
    mae = abs_error.detach().cpu().numpy() / num_samples
    print("%s results:" % (name or ""))
    print("Mean square error: %.5f" % mse)
    print("Mean absolute error: %.5f" % mae)
    return mse, mae


In [None]:
import time
import torch

model = SalaryPredictor().to(device)
criterion = nn.MSELoss(reduction='sum')
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4)

for epoch in range(EPOCHS):
    start = time.time()
    print(f"epoch: {epoch}")
    
    model.train()
    for i, batch in tqdm(enumerate(iterate_minibatches(data_train, batch_size=BATCH_SIZE, device=device)),
                         total=len(data_train) // BATCH_SIZE):
        optimizer.zero_grad()

        # Mixed precision only makes sense on GPU; disable autocast when running on CPU
        with torch.autocast(device_type="cuda", dtype=torch.float16, enabled=(device == "cuda")):
            pred = model(batch)
            loss = criterion(pred, batch[TARGET_COLUMN])

        loss = loss.to(torch.float32)
        loss.backward()

        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

        optimizer.step()

    stop = time.time()
    print(f"Epoch time: {stop - start}")
    if epoch % 5 == 0 or epoch == EPOCHS - 1:
        print_metrics(model, data_val, device=device)


In [None]:
#save model to file
torch.save(model.state_dict(), "modelReduced.pth")

In [None]:
#load model from file
model = SalaryPredictor().to(device)
# use the same file name that was used when saving above
model.load_state_dict(torch.load("modelReduced.pth", map_location=device))


# Homework (8 points)

Your main task is to use some of the tricks you've learned about NN and analyze if you can improve __validation MAE__. Try __at least 2 options__ from the list below for a passing grade. Write a short report about what you have tried. More ideas = more bonus points.

## Recommended options

#### A) CNN architecture

All the tricks you know about dense and convolutional neural networks apply here as well.
* Dropout. Nuff said.
* Batch Norm. This time it's `nn.BatchNorm*`/`L.BatchNormalization`
* Parallel convolution layers. The idea is that you apply several nn.Conv1d to the same embeddings and concatenate output channels.
* More layers, more neurons, ya know...


#### B) Play with pooling

There's more than one way to perform pooling:
* Max over time (independently for each feature)
* Average over time (excluding PAD)
* Softmax-pooling:
$$ out_i = \sum_t {h_{i,t} \cdot {{e ^ {h_{i, t}}} \over \sum_\tau e ^ {h_{i, \tau}} } }$$

* Attentive pooling
$$ out_i = \sum_t {h_{i,t} \cdot Attn(h_t)},$$
where $$ Attn(h_t) = {{e ^ {NN_{attn}(h_t)}} \over \sum_\tau e ^ {NN_{attn}(h_\tau)}}  $$
and $NN_{attn}$ is a dense layer.

The optimal score is usually achieved by concatenating several different poolings, including several attentive pooling with different $NN_{attn}$ (aka multi-headed attention).
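
The pooling variants above can be sketched in plain NumPy on a single sequence. Everything here is illustrative: `h` is random, `w_attn` stands in for an untrained $NN_{attn}$ with a single output, and masking PAD positions inside the softmax is an assumption of this sketch rather than part of the formulas. The same operations can be expressed with PyTorch tensors inside an `nn.Module`.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
n_channels, n_steps = 4, 6
h = rng.normal(size=(n_channels, n_steps))    # h[i, t]: feature i at time step t
is_pad = np.array([False, False, False, False, True, True])  # hypothetical: last 2 steps are PAD

# Max over time, independently for each feature, ignoring PAD positions.
max_pool = np.where(is_pad, -np.inf, h).max(axis=1)

# Average over time, excluding PAD.
avg_pool = h[:, ~is_pad].mean(axis=1)

# Softmax pooling: each feature weights its own time steps by a softmax over time.
softmax_pool = (h * softmax(np.where(is_pad, -np.inf, h), axis=1)).sum(axis=1)

# Attentive pooling: one weight per time step, computed from the whole vector h_t.
w_attn = rng.normal(size=n_channels)          # hypothetical untrained NN_attn weights
attn = softmax(np.where(is_pad, -np.inf, w_attn @ h))
attn_pool = (h * attn).sum(axis=1)

# Concatenate several poolings into one fixed-size vector.
pooled = np.concatenate([max_pool, avg_pool, softmax_pool, attn_pool])
print(pooled.shape)
```

Using several independent `w_attn` vectors and concatenating the resulting `attn_pool` outputs gives the multi-headed variant mentioned above.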

The catch is that keras layers do not include those toys. You will have to [write your own keras layer](https://keras.io/layers/writing-your-own-keras-layers/). Or use pure tensorflow, it might even be easier :)

#### C) Fun with words

It's not always a good idea to train embeddings from scratch. Here's a few tricks:

* Use pre-trained embeddings from `gensim.downloader.load`. See last lecture.
* Start with pre-trained embeddings, then fine-tune them with gradient descent. You may or may not download pre-trained embeddings from [here](http://nlp.stanford.edu/data/glove.6B.zip) and follow this [manual](https://keras.io/examples/nlp/pretrained_word_embeddings/) to initialize your Keras embedding layer with downloaded weights.
* Use the same embedding matrix in title and desc vectorizer


#### D) Going recurrent

We've already learned that recurrent networks can do cool stuff in sequence modelling. Turns out, they're not useless for classification as well. With some tricks of course..

* Like convolutional layers, LSTM should be pooled into a fixed-size vector with some of the poolings.
* Since you know all the text in advance, use bidirectional RNN
  * Run one LSTM from left to right
  * Run another in parallel from right to left
  * Concatenate their output sequences along unit axis (dim=-1)

* It might be a good idea to mix convolutions and recurrent layers differently for title and description


#### E) Optimizing seriously

* You don't necessarily need 100 epochs. Use early stopping.
* Maintain the best-on-validation snapshot via `model.save(file_name)`
* Plotting learning curves is usually a good idea
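
Early stopping with a best-on-validation snapshot can be sketched without any model at all. In the hypothetical helper below, `val_losses` stands in for the validation metric a real loop would compute after each epoch, and the snapshot is a plain dict; in a PyTorch loop one would instead keep something like `copy.deepcopy(model.state_dict())`.

```python
import copy

def train_with_early_stopping(val_losses, patience=3):
    """Stop once the validation loss has not improved for `patience` epochs."""
    best_loss, best_epoch, best_snapshot = float("inf"), -1, None
    history = []
    for epoch, val_loss in enumerate(val_losses):
        history.append(val_loss)              # keep the curve for plotting later
        if val_loss < best_loss:
            best_loss, best_epoch = val_loss, epoch
            # Placeholder snapshot; a real loop would copy the model weights here.
            best_snapshot = copy.deepcopy({"epoch": epoch, "val_loss": val_loss})
        elif epoch - best_epoch >= patience:
            break
    return best_snapshot, history

# Simulated validation losses: the late improvement at the end is never reached.
simulated = [0.30, 0.20, 0.15, 0.16, 0.17, 0.18, 0.10]
snapshot, history = train_with_early_stopping(simulated, patience=3)
print(snapshot, len(history))
```

The example also shows the trade-off: a small `patience` saves time but can stop before a later improvement, so the value is worth tuning.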
  
Good luck! And may the force be with you!

### A short report

Please tell us what you did and how it worked.

`<YOUR_TEXT_HERE>`, i guess...

### Bonus part: explaining model predictions

It's usually a good idea to understand how your model works before you let it make actual decisions. It's simple for linear models: just see which words learned positive or negative weights. However, it's much harder for neural networks that learn complex nonlinear dependencies.

There are, however, some ways to look inside the black box:
* Seeing how the model responds to input perturbations
* Finding inputs that maximize/minimize activation of some chosen neurons (_read more [on distill.pub](https://distill.pub/2018/building-blocks/)_)
* Building local linear approximations to your neural network: [article](https://arxiv.org/abs/1602.04938), [eli5 library](https://github.com/TeamHG-Memex/eli5/tree/master/eli5/formatters)

Today we're going to try the first method just because it's the simplest one.
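
The perturbation idea can be demonstrated with a toy scorer in place of the neural network. Both `toy_predict` and `TOY_WEIGHTS` are hypothetical; the sketch replaces one token at a time with `UNK` and records how much the prediction drops relative to the unperturbed baseline.

```python
# Hypothetical stand-in for a trained model: a bias plus per-word weights.
TOY_WEIGHTS = {"senior": 0.4, "manager": 0.3, "junior": -0.5}

def toy_predict(text):
    return 10.0 + sum(TOY_WEIGHTS.get(tok, 0.0) for tok in text.split())

def toy_explain(text, predict=toy_predict, unk="UNK"):
    """Return (token, baseline - prediction_without_token) for every token."""
    toks = text.split()
    baseline = predict(text)
    effects = []
    for drop_i in range(len(toks)):
        perturbed = " ".join(unk if i == drop_i else tok for i, tok in enumerate(toks))
        effects.append((toks[drop_i], baseline - predict(perturbed)))
    return effects

toy_effects = toy_explain("senior account manager")
for tok, effect in toy_effects:
    print(f"{tok}: {effect:+.2f}")
```

A positive effect means the token pushed the prediction up, a negative one that it pulled the prediction down, and a value near zero that the model ignored it.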

In [None]:
def explain(model, sample, col_name='Title'):
    """ Computes the effect each word had on model predictions """
    sample = dict(sample)
    sample_col_tokens = [tokens[token_to_id.get(tok, 0)] for tok in sample[col_name].split()]
    data_drop_one_token = pd.DataFrame([sample] * (len(sample_col_tokens) + 1))

    for drop_i in range(len(sample_col_tokens)):
        data_drop_one_token.loc[drop_i, col_name] = ' '.join(UNK if i == drop_i else tok
                                                   for i, tok in enumerate(sample_col_tokens))

    *predictions_drop_one_token, baseline_pred = model(make_batch(data_drop_one_token, device=device)).detach().cpu()
    diffs = baseline_pred - torch.Tensor(predictions_drop_one_token)
    return list(zip(sample_col_tokens, diffs))

In [None]:
from IPython.display import HTML, display_html


def draw_html(tokens_and_weights, cmap=plt.get_cmap("bwr"), display=True,
              token_template="""<span style="background-color: {color_hex}">{token}</span>""",
              font_style="font-size:14px;"
             ):

    def get_color_hex(weight):
        rgba = cmap(1. / (1 + np.exp(float(weight))), bytes=True)
        return '#%02X%02X%02X' % rgba[:3]

    tokens_html = [
        token_template.format(token=token, color_hex=get_color_hex(weight))
        for token, weight in tokens_and_weights
    ]


    raw_html = """<p style="{}">{}</p>""".format(font_style, ' '.join(tokens_html))
    if display:
        display_html(HTML(raw_html))

    return raw_html


In [None]:
i = 36605
tokens_and_weights = explain(model, data.loc[i], "Title")
draw_html([(tok, weight * 5) for tok, weight in tokens_and_weights], font_style='font-size:20px;');

tokens_and_weights = explain(model, data.loc[i], "FullDescription")
draw_html([(tok, weight * 10) for tok, weight in tokens_and_weights]);

In [None]:
i = 12077
tokens_and_weights = explain(model, data.loc[i], "Title")
draw_html([(tok, weight * 5) for tok, weight in tokens_and_weights], font_style='font-size:20px;');

tokens_and_weights = explain(model, data.loc[i], "FullDescription")
draw_html([(tok, weight * 10) for tok, weight in tokens_and_weights]);

In [None]:
i = np.random.randint(len(data))
print("Index:", i)
print("Salary (gbp):", np.expm1(model(make_batch(data.iloc[i: i+1], device=device)).detach().cpu()))

tokens_and_weights = explain(model, data.loc[i], "Title")
draw_html([(tok, weight * 5) for tok, weight in tokens_and_weights], font_style='font-size:20px;');

tokens_and_weights = explain(model, data.loc[i], "FullDescription")
draw_html([(tok, weight * 10) for tok, weight in tokens_and_weights]);

__Terrible start-up idea #1962:__ make a tool that automatically rephrases your job description (or CV) to meet salary expectations :)