In [1]:
# !pip install transformers
import transformers

### Using pre-trained transformers (2pts)
_for fun and profit_

There are many toolkits that let you access pre-trained transformer models, but the most powerful and convenient by far is [`huggingface/transformers`](https://github.com/huggingface/transformers). In this week's practice, you'll learn how to download, apply and modify pre-trained transformers for a range of tasks. Buckle up, we're going in!


__Pipelines:__ if all you want is to apply a pre-trained model, you can do that in one line of code using `pipeline`. Huggingface/transformers has a selection of pre-configured pipelines for masked language modelling, sentiment classification, question answering, etc. ([see full list here](https://huggingface.co/transformers/main_classes/pipelines.html))

A typical pipeline includes:
* pre-processing, e.g. tokenization, subword segmentation
* a backbone model, e.g. BERT fine-tuned for classification
* output post-processing

Let's see it in action:

In [2]:
from transformers import pipeline
classifier = pipeline('sentiment-analysis', model="distilbert-base-uncased-finetuned-sst-2-english")

print(classifier("BERT is amazing!"))

[{'label': 'POSITIVE', 'score': 0.9998860359191895}]
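
For a classifier, the post-processing stage is typically a softmax over the model's raw logits followed by a lookup of the winning label. Below is a minimal framework-free sketch of that step. The logits are made up for illustration, and the `id2label` mapping is an assumption about how this SST-2 checkpoint names its classes; the real pipeline performs the same computation on tensors.

```python
import math

def postprocess(logits, id2label):
    """Turn raw classifier logits into a pipeline-style {'label', 'score'} dict."""
    # subtract the max logit for numerical stability before exponentiating
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # index of the most probable class
    best = max(range(len(probs)), key=probs.__getitem__)
    return {'label': id2label[best], 'score': probs[best]}

# hypothetical logits, not actual model outputs
id2label = {0: 'NEGATIVE', 1: 'POSITIVE'}
result = postprocess([-4.3, 4.8], id2label)
print([result])
```

The wider the gap between the two logits, the closer the reported score gets to 1.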


In [3]:
import base64
data = {
    'arryn': 'As High as Honor.',
    'baratheon': 'Ours is the fury.',
    'stark': 'Winter is coming.',
    'tyrell': 'Growing strong.'
}

# YOUR CODE: predict sentiment for each noble house and create outputs dict
outputs = {house_name: classifier(data[house_name])[0]['label'] == 'POSITIVE' for house_name in data}

# base64.decodestring was removed in Python 3.9; decodebytes is the replacement
assert sum(outputs.values()) == 3 and outputs[base64.decodebytes(b'YmFyYXRoZW9u\n').decode()] == False
print("Well done!")

Well done!




You can also access a vanilla masked language model that was trained to predict masked words. Here's how:

In [4]:
mlm_model = pipeline('fill-mask', model="bert-base-uncased")
MASK = mlm_model.tokenizer.mask_token

for hypo in mlm_model(f"Donald {MASK} is the president of the united states."):
  print(f"P={hypo['score']:.5f}", hypo['sequence'])

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


P=0.99719 donald trump is the president of the united states.
P=0.00024 donald duck is the president of the united states.
P=0.00022 donald ross is the president of the united states.
P=0.00020 donald johnson is the president of the united states.
P=0.00018 donald wilson is the president of the united states.


In [5]:
# Your turn: use BERT to recall the year of the revolution that led to the founding of the Soviet Union
for hypo in mlm_model(f"The Soviet Union had its roots in the October Revolution of {MASK}."):
    print(f"P={hypo['score']:.5f}", hypo['sequence'])

P=0.62004 the soviet union had its roots in the october revolution of 1917.
P=0.33328 the soviet union had its roots in the october revolution of russia.
P=0.00625 the soviet union had its roots in the october revolution of ukraine.
P=0.00572 the soviet union had its roots in the october revolution of 1918.
P=0.00445 the soviet union had its roots in the october revolution of bolsheviks.
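
The hypotheses above mix years with other plausible words. When only a year makes sense, one option is to filter the candidates and renormalize their probabilities. The sketch below works on the scores printed above and assumes each hypothesis carries its filled-in word under a `token_str` key; `keep_years` is a hypothetical helper, not part of the library.

```python
# scores copied from the output above; 'token_str' is assumed to hold the predicted word
hypotheses = [
    {'token_str': '1917', 'score': 0.62004},
    {'token_str': 'russia', 'score': 0.33328},
    {'token_str': 'ukraine', 'score': 0.00625},
    {'token_str': '1918', 'score': 0.00572},
    {'token_str': 'bolsheviks', 'score': 0.00445},
]

def keep_years(hypotheses):
    """Keep only four-digit candidates and rescale their scores to sum to 1."""
    years = [h for h in hypotheses
             if h['token_str'].isdigit() and len(h['token_str']) == 4]
    total = sum(h['score'] for h in years)
    if total == 0:
        return []
    return [{'token_str': h['token_str'], 'score': h['score'] / total} for h in years]

for h in keep_years(hypotheses):
    print(f"P={h['score']:.5f}", h['token_str'])
```

Note that this only rescales the top candidates the pipeline returned, so it is an approximation of the distribution over all year tokens in the vocabulary.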




Huggingface offers hundreds of pre-trained models that specialize in different tasks. You can quickly find the model you need using [this list](https://huggingface.co/models).


In [6]:
text = """Almost two-thirds of the 1.5 million people who viewed this liveblog had Googled to discover
 the latest on the Rosetta mission. They were treated to this detailed account by the Guardian’s science editor,
 Ian Sample, and astronomy writer Stuart Clark of the moment scientists landed a robotic spacecraft on a comet 
 for the first time in history, and the delirious reaction it provoked at their headquarters in Germany.
  “We are there. We are sitting on the surface. Philae is talking to us,” said one scientist.
"""

# Task: create a pipeline for named entity recognition, use task name 'ner' and search for the right model in the list
ner_model = pipeline('ner',
                     model='dslim/bert-base-NER')

named_entities = ner_model(text)

In [7]:
print('OUTPUT:', named_entities)
word_to_entity = {item['word']: item['entity'] for item in named_entities}
assert 'org' in word_to_entity.get('Guardian').lower() and 'per' in word_to_entity.get('Stuart').lower()
print("All tests passed")

OUTPUT: [{'entity': 'B-LOC', 'score': 0.7991049, 'index': 27, 'word': 'Rose', 'start': 112, 'end': 116}, {'entity': 'I-LOC', 'score': 0.9511926, 'index': 28, 'word': '##tta', 'start': 116, 'end': 119}, {'entity': 'B-ORG', 'score': 0.998223, 'index': 40, 'word': 'Guardian', 'start': 179, 'end': 187}, {'entity': 'B-PER', 'score': 0.9997613, 'index': 46, 'word': 'Ian', 'start': 207, 'end': 210}, {'entity': 'I-PER', 'score': 0.99978703, 'index': 47, 'word': 'Sam', 'start': 211, 'end': 214}, {'entity': 'I-PER', 'score': 0.999646, 'index': 48, 'word': '##ple', 'start': 214, 'end': 217}, {'entity': 'B-PER', 'score': 0.9997831, 'index': 53, 'word': 'Stuart', 'start': 240, 'end': 246}, {'entity': 'I-PER', 'score': 0.99974823, 'index': 54, 'word': 'Clark', 'start': 247, 'end': 252}, {'entity': 'B-LOC', 'score': 0.9997227, 'index': 85, 'word': 'Germany', 'start': 414, 'end': 421}, {'entity': 'B-PER', 'score': 0.9963128, 'index': 99, 'word': 'Phil', 'start': 471, 'end': 475}, {'entity': 'I-PER', '
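
The raw output is one entry per word piece: `Rose` and `##tta` are two halves of one word, and `Ian`, `Sam`, `##ple` together form one person name. A minimal sketch of merging them back into whole entities is shown below. It uses the `word`, `entity` and `index` fields copied from the first entries of the output above; `group_entities` is a hypothetical helper written for illustration.

```python
def group_entities(tokens):
    """Merge '##' word pieces and B-/I- tags into (text, type) spans."""
    spans = []
    for tok in tokens:
        word, index = tok['word'], tok['index']
        prefix, _, ent_type = tok['entity'].partition('-')
        # a token continues the previous span only if it directly follows it
        adjacent = bool(spans) and spans[-1]['last_index'] + 1 == index
        if word.startswith('##') and adjacent:
            # continuation of the same word: glue without a space
            spans[-1]['text'] += word[2:]
            spans[-1]['last_index'] = index
        elif prefix == 'I' and adjacent and spans[-1]['type'] == ent_type:
            # next word of the same entity
            spans[-1]['text'] += ' ' + word
            spans[-1]['last_index'] = index
        else:
            spans.append({'text': word, 'type': ent_type, 'last_index': index})
    return [(s['text'], s['type']) for s in spans]

tokens = [
    {'entity': 'B-LOC', 'index': 27, 'word': 'Rose'},
    {'entity': 'I-LOC', 'index': 28, 'word': '##tta'},
    {'entity': 'B-ORG', 'index': 40, 'word': 'Guardian'},
    {'entity': 'B-PER', 'index': 46, 'word': 'Ian'},
    {'entity': 'I-PER', 'index': 47, 'word': 'Sam'},
    {'entity': 'I-PER', 'index': 48, 'word': '##ple'},
    {'entity': 'B-PER', 'index': 53, 'word': 'Stuart'},
    {'entity': 'I-PER', 'index': 54, 'word': 'Clark'},
    {'entity': 'B-LOC', 'index': 85, 'word': 'Germany'},
]
entities = group_entities(tokens)
print(entities)
```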

### The building blocks of a pipeline

Huggingface also allows you to access its pipelines on a lower level. There are two main abstractions for you:
* `Tokenizer` - converts from strings to token ids and back
* `Model` - a pytorch `nn.Module` with pre-trained weights

You can use such models as part of your regular pytorch code: insert it as a layer in your model, apply it to a batch of data, backpropagate, optimize, etc.

In [8]:
import torch
from transformers import AutoTokenizer, AutoModel, pipeline

model_name = 'bert-base-uncased'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight', 'cls.predictions.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [9]:
lines = [
    "Luke, I am your father.",
    "Life is what happens when you're busy making other plans.",
    ]

# tokenize a batch of inputs. "pt" means [p]y[t]orch tensors
tokens_info = tokenizer(lines, padding=True, truncation=True, return_tensors="pt")

for key in tokens_info:
    print(key, tokens_info[key])

print("Detokenized:")
for i in range(2):
    print(tokenizer.decode(tokens_info['input_ids'][i]))

input_ids tensor([[ 101, 5355, 1010, 1045, 2572, 2115, 2269, 1012,  102,    0,    0,    0,
            0,    0,    0],
        [ 101, 2166, 2003, 2054, 6433, 2043, 2017, 1005, 2128, 5697, 2437, 2060,
         3488, 1012,  102]])
token_type_ids tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])
attention_mask tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])
Detokenized:
[CLS] luke, i am your father. [SEP] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD]
[CLS] life is what happens when you're busy making other plans. [SEP]
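
The `padding=True` option is what makes the two sentences fit into one rectangular tensor: the shorter sequence is filled with the pad id, and `attention_mask` marks which positions hold real tokens. A minimal pure-Python sketch of that logic follows, using the token ids printed above. `pad_batch` is a hypothetical helper, and the pad id of 0 is taken from the output above rather than from the tokenizer's configuration.

```python
def pad_batch(sequences, pad_id=0):
    """Right-pad variable-length id lists and build the matching attention mask."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    # 1 marks a real token, 0 marks padding the model should ignore
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return input_ids, attention_mask

unpadded = [
    [101, 5355, 1010, 1045, 2572, 2115, 2269, 1012, 102],
    [101, 2166, 2003, 2054, 6433, 2043, 2017, 1005, 2128, 5697, 2437, 2060, 3488, 1012, 102],
]
ids, mask = pad_batch(unpadded)
for row_ids, row_mask in zip(ids, mask):
    print(row_ids, row_mask)
```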


In [10]:
# You can now apply the model to get embeddings
with torch.no_grad():
    out = model(**tokens_info)

print(out)

BaseModelOutputWithPoolingAndCrossAttentions(last_hidden_state=tensor([[[-0.3502,  0.2246, -0.2345,  ..., -0.2232,  0.1730,  0.6747],
         [-0.6097,  0.6892, -0.5512,  ..., -0.4814,  0.5322,  1.3833],
         [ 0.1842,  0.4881,  0.2193,  ..., -0.2699,  0.2246,  0.7985],
         ...,
         [-0.4413,  0.2748, -0.0391,  ..., -0.0604, -0.4358,  0.1384],
         [-0.5414,  0.4633,  0.0678,  ..., -0.1871, -0.5046,  0.2752],
         [-0.3940,  0.6180,  0.2092,  ..., -0.2345, -0.4177,  0.3341]],

        [[ 0.1622, -0.1154, -0.3894,  ..., -0.4180,  0.0138,  0.7644],
         [ 0.6471,  0.3774, -0.4082,  ...,  0.0050,  0.5559,  0.4385],
         [ 0.3351, -0.3158, -0.1178,  ...,  0.1348, -0.3143,  1.4409],
         ...,
         [ 1.2932, -0.1743, -0.5613,  ..., -0.2718, -0.1367,  0.4217],
         [ 1.0305,  0.1708, -0.2985,  ...,  0.2097, -0.4627, -0.4277],
         [ 1.0854,  0.1760, -0.0377,  ...,  0.3152, -0.5979, -0.3465]]]), pooler_output=tensor([[-0.8854, -0.4722, -0.9392,  .

### Fine-tuning for salary prediction (5 pts)

Now let's put all this monstrosity to good use!

Remember week 5, when you trained a convolutional neural network for salary prediction? Now let's see how transformers fare at this task.

__The goal__ is to take one or more pre-trained models and fine-tune them for salary prediction. A good baseline solution would be to get RoBERTa or T5 from the [huggingface model list](https://huggingface.co/models) and fine-tune it to solve the task. After choosing the model, please take care to use the matching Tokenizer for preprocessing, as different models have different preprocessing requirements.


There are no prompts this time: you will have to write everything from scratch. That said, feel free to reuse any code from the original salary prediction notebook :)

In [11]:
!wget 'https://www.dropbox.com/s/r9d1f3ve471osob/Train_rev1.zip?dl=1' -O data.zip
!unzip -e data.zip

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
--2021-11-12 16:46:45--  https://www.dropbox.com/s/r9d1f3ve471osob/Train_rev1.zip?dl=1
Resolving www.dropbox.com (www.dropbox.com)... 162.125.70.18
Connecting to www.dropbox.com (www.dropbox.com)|162.125.70.18|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: /s/dl/r9d1f3ve471osob/Train_rev1.zip [following]
--2021-11-12 16:46:45--  https://www.dropbox.com/s/dl/r9d1f3ve471osob/Train_rev1.zip
Reusing existing connection to www.dropbox.com:443.
HTTP request sent, awaiting response... 302 Found
Location: https://uc4e67709f7986d95aecbb6a024d.dl.dropboxusercontent.com/cd/0/get/BZ0Kj1_lLku1WLbMtMGuxrQmd4QZQim_oxRnAGTLORcI-WqbLgOOjesP2xG44-tNm_63Jy-uOmUfNAxVU-r0ps4PA6

In [13]:
import pandas as pd
import numpy as np

data = pd.read_csv("./Train_rev1.csv", index_col=None)
data['Log1pSalary'] = np.log1p(data['SalaryNormalized']).astype('float32')

text_columns = ["Title", "FullDescription"]
categorical_columns = ["Category", "Company", "LocationNormalized", "ContractType", "ContractTime"]
target_column = "Log1pSalary"
data[categorical_columns] = data[categorical_columns].fillna('NaN') # cast nan to string
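
Because the target is `log1p(salary)`, a prediction has to go through the inverse transform, `expm1`, before it can be read as a salary. A small sketch with a made-up salary value shows the round trip, and also that `exp(x - 1)` is not the inverse, which is an easy slip to make.

```python
import math

salary = 25000.0                       # made-up value for illustration
log_salary = math.log1p(salary)        # log(1 + salary): what the model is trained to predict
recovered = math.expm1(log_salary)     # exp(x) - 1: the exact inverse of log1p
wrong = math.exp(log_salary - 1)       # exp(x - 1) equals (1 + salary) / e, not the salary

print(f"log1p: {log_salary:.4f}")
print(f"expm1(log1p(salary)): {recovered:.2f}")
print(f"exp(log1p(salary) - 1): {wrong:.2f}")
```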

In [14]:
from transformers import RobertaTokenizer, RobertaModel
tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
roberta_model = RobertaModel.from_pretrained('roberta-base').eval()

Some weights of the model checkpoint at roberta-base were not used when initializing RobertaModel: ['lm_head.dense.bias', 'lm_head.layer_norm.bias', 'lm_head.layer_norm.weight', 'lm_head.decoder.weight', 'lm_head.dense.weight', 'lm_head.bias']
- This IS expected if you are initializing RobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [15]:
text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='pt')
output = roberta_model(**encoded_input)

In [16]:
tokens_info = tokenizer(data["Title"][::100000].to_list(), padding=True, truncation=True, return_tensors="pt")

print("Lines:")
print('\n'.join(data["Title"][::100000].values), end='\n\n')
print("Matrix:")
print(tokens_info['input_ids'], end='\n\n')
print("Detokenized:")
print(*[tokenizer.decode(tokens_info['input_ids'][i]) for i in range(len(tokens_info['input_ids']))], sep='\n', end='\n\n')

Lines:
Engineering Systems Analyst
HR Assistant
Senior EC&I Engineer

Matrix:
tensor([[    0, 13929,  2961,  5778,  9821,     2,     1],
        [    0, 16271,  6267,     2,     1,     1,     1],
        [    0, 33867, 11270,   947,   100, 24379,     2]])

Detokenized:
<s>Engineering Systems Analyst</s><pad>
<s>HR Assistant</s><pad><pad><pad>
<s>Senior EC&I Engineer</s>



In [17]:
from collections import Counter
from sklearn.feature_extraction import DictVectorizer

# we only consider top-1k most frequent companies to minimize memory usage
top_companies, top_counts = zip(*Counter(data['Company']).most_common(1000))
recognized_companies = set(top_companies)
data["Company"] = data["Company"].apply(lambda comp: comp if comp in recognized_companies else "Other")

categorical_vectorizer = DictVectorizer(dtype=np.float32, sparse=False)
categorical_vectorizer.fit(data[categorical_columns].apply(dict, axis=1))

DictVectorizer(dtype=<class 'numpy.float32'>, sparse=False)
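
The cell above does two things: it collapses rare companies into a single "Other" value, then one-hot encodes every categorical column. A minimal pure-Python sketch of both steps on toy data follows. The helper names and company names are hypothetical, and the `column=value` feature naming mirrors how `DictVectorizer` names string-valued features.

```python
from collections import Counter

def collapse_rare(values, top_k, other="Other"):
    """Keep the top_k most frequent values, replace everything else with `other`."""
    keep = {value for value, _ in Counter(values).most_common(top_k)}
    return [value if value in keep else other for value in values]

def one_hot(rows):
    """One-hot encode a list of {column: value} dicts into a dense matrix."""
    vocab = sorted({f"{col}={val}" for row in rows for col, val in row.items()})
    index = {name: i for i, name in enumerate(vocab)}
    matrix = []
    for row in rows:
        vec = [0.0] * len(vocab)
        for col, val in row.items():
            vec[index[f"{col}={val}"]] = 1.0
        matrix.append(vec)
    return vocab, matrix

companies = ["Acme", "Acme", "Globex", "Initech", "Acme", "Globex"]
collapsed = collapse_rare(companies, top_k=2)
vocab, matrix = one_hot([{"Company": c} for c in collapsed])
print(vocab)
print(matrix[3])
```

Capping the number of distinct companies bounds the width of the one-hot matrix, which is where the memory saving comes from.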

In [18]:
from sklearn.model_selection import train_test_split

data_train, data_val = train_test_split(data, test_size=0.1, random_state=42)

In [19]:
def generate_batch(data, batch_size=None, replace=True, max_len=None):
    """
    Creates a pytorch-friendly dict from the batch data.
    :returns: a dict with one tokenizer output per text column
        ({'input_ids': int64[batch, len], 'attention_mask': int64[batch, len]}),
        'Categorical': float32[batch, n_cat_features] and, if present, the target column.
    Note: max_len is accepted but not applied; truncation=True alone cuts at the model's maximum length.
    """
    if batch_size is not None:
        data = data.sample(batch_size, replace=replace)

    batch = {}
    for col in text_columns:
        # keep the full tokenizer output: the model needs attention_mask as well as input_ids
        batch[col] = tokenizer(data[col].values.tolist(), padding=True, truncation=True, return_tensors="pt")

    batch['Categorical'] = categorical_vectorizer.transform(data[categorical_columns].apply(dict, axis=1))

    if target_column in data.columns:
        batch[target_column] = data[target_column].values

    return batch

In [20]:
generate_batch(data_train, 3, max_len=10)

{'Title': {'input_ids': tensor([[    0, 47666,    38,  5944,  4706, 25857,  4827,  1437,  6571,  1753,
              2],
         [    0, 45062,   783, 27592,   118,  7443, 33688,     2,     1,     1,
              1],
         [    0,  1121,  7248,  5293,  3105, 16688,     2,     1,     1,     1,
              1]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
         [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0],
         [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0]])},
 'FullDescription': {'input_ids': tensor([[    0, 18387,    38,  ..., 17491,  4794,     2],
         [    0, 27172,  8396,  ...,     1,     1,     1],
         [    0,   170,    32,  ...,     1,     1,     1]]), 'attention_mask': tensor([[1, 1, 1,  ..., 1, 1, 1],
         [1, 1, 1,  ..., 0, 0, 0],
         [1, 1, 1,  ..., 0, 0, 0]])},
 'Categorical': array([[1., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.]], dtype=float32),
 'Log1pSalary': array([10.657283, 10.738156

### Adding the head of the model

In [21]:
import torch, torch.nn as nn

class GlobalMaxPooling(nn.Module):
    def __init__(self, dim=-1):
        super().__init__()
        self.dim = dim

    def forward(self, x):
        return x.max(dim=self.dim)[0]

In [22]:
device = torch.device('cuda:{}'.format(torch.cuda.current_device())) if torch.cuda.is_available() else 'cpu'
device

'cpu'

In [23]:
class Embeddings(nn.Module):
    def __init__(self):
        super().__init__()
        self.roberta_model = RobertaModel.from_pretrained('roberta-base')

    def forward(self, sample):
        input_ids = sample['input_ids']
        attention_mask = sample['attention_mask']
        output = self.roberta_model(input_ids, attention_mask, return_dict=True)
        return output["pooler_output"]

In [24]:
emb_model = Embeddings().eval()



In [25]:
class TitleEncoder(nn.Module):
    def __init__(self, out_size=64):
        super().__init__()

        self.dense1 = torch.nn.Linear(768, 384)
        self.dropout = torch.nn.Dropout(0.1)
        self.dense2 = torch.nn.Linear(384, out_size)

    def forward(self, sample):
        # uses the shared module-level emb_model; it is not a submodule of this class,
        # so the optimizer does not update it and .to(device) does not move it
        emb = emb_model(sample)

        h = self.dense1(emb)
        h = torch.relu(h)
        h = self.dropout(h)
        h = self.dense2(h)
        return h

In [26]:
class DescriptionEncoder(nn.Module):
    def __init__(self, out_size=64):
        super().__init__()
        self.embeddings = Embeddings().eval()

        self.dense1 = torch.nn.Linear(768, 384)
        self.dropout = torch.nn.Dropout(0.1)
        self.dense2 = torch.nn.Linear(384, out_size)

    def forward(self, sample):
        # use this module's own RoBERTa copy, so it is part of model.parameters()
        # and follows the model to the right device
        emb = self.embeddings(sample)

        h = self.dense1(emb)
        h = torch.relu(h)
        h = self.dropout(h)
        h = self.dense2(h)
        return h

In [27]:
class FullNetwork(nn.Module):
    """
    This class does all the steps from (title, desc, categorical) features -> predicted target
    It unites the title & desc encoders defined above, as well as some layers for the head and the categorical branch.
    """
    def __init__(self, n_cat_features=len(categorical_vectorizer.vocabulary_)):
        super().__init__()

        self.title_encoder = TitleEncoder(out_size=64)
        self.desc_encoder = DescriptionEncoder(out_size=64)

        # layers for categorical features
        self.dense1 = nn.Linear(n_cat_features, 300)
        self.dense2 = nn.Linear(300, 64)

        # output layer that combines the three encoded vectors (3 * 64 = 192) into the answer
        self.dense_final = nn.Linear(192, 1)

    def forward(self, title_ix, desc_ix, cat_features):
        """
        :param title_ix: tokenizer output for job titles (input_ids and attention_mask, each [batch, title_len])
        :param desc_ix: tokenizer output for job descriptions (input_ids and attention_mask, each [batch, desc_len])
        :param cat_features: float32 tensor [batch, n_cat_features]
        :returns: float32 tensor 1d [batch], predicted log1p-salary
        """

        title_h = self.title_encoder(title_ix)
        desc_h = self.desc_encoder(desc_ix)

        # apply categorical encoder; the ReLU keeps the two linear layers from collapsing into one
        cat_h = self.dense2(torch.relu(self.dense1(cat_features)))

        # concatenate all vectors together
        joint_h = torch.cat([title_h, desc_h, cat_h], dim=1)

        # select the first column, [:, 0], to get 1d outputs; no output nonlinearity
        return self.dense_final(joint_h)[:, 0]

In [28]:
model = FullNetwork().to(device)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)



In [29]:
# test it on one batch

batch = generate_batch(data_train, 32)

title_ix = batch["Title"].to(device)
desc_ix = batch["FullDescription"].to(device)
cat_features = torch.FloatTensor(batch["Categorical"]).to(device)
reference = torch.FloatTensor(batch[target_column]).to(device)

prediction = model(title_ix, desc_ix, cat_features)

assert len(prediction.shape) == 1 and prediction.shape[0] == title_ix['input_ids'].shape[0]

In [30]:
def compute_loss(reference, prediction):
    """
    Computes objective for minimization.
    By default we minimize MSE, but you are encouraged to try mixing MSE, MAE, Huber loss, etc.
    """
    return torch.mean((prediction - reference) ** 2)

def compute_mae(reference, prediction):
    """ Compute MAE on actual salary, assuming your model outputs log1p(salary)"""
    # expm1(x) = exp(x) - 1 is the inverse of log1p; exp(x - 1) is not
    return torch.abs(torch.expm1(reference) - torch.expm1(prediction)).mean()
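
The docstring of `compute_loss` suggests trying other objectives. Huber loss is one such option: it is quadratic for small errors (like MSE) and linear for large ones (like MAE), which makes it less sensitive to outliers. Below is a framework-free sketch on plain Python lists with made-up numbers; `huber` and its `delta` threshold are illustrative names. PyTorch ships a tensor version as `torch.nn.functional.huber_loss` in recent releases.

```python
def huber(reference, prediction, delta=1.0):
    """Mean Huber loss: 0.5 * err^2 if |err| <= delta, else delta * (|err| - 0.5 * delta)."""
    total = 0.0
    for ref, pred in zip(reference, prediction):
        err = abs(pred - ref)
        if err <= delta:
            total += 0.5 * err ** 2
        else:
            total += delta * (err - 0.5 * delta)
    return total / len(reference)

# one small error (quadratic branch) and one large error (linear branch)
loss_value = huber([0.0, 0.0], [0.5, 3.0])
print(loss_value)
```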

In [31]:
loss = compute_loss(reference, prediction)
dummy_grads = torch.autograd.grad(loss, model.parameters(), allow_unused=True, retain_graph=True)

### Training

In [32]:
from tqdm import trange
def iterate_minibatches(data, batch_size=32, max_len=None,
                        max_batches=None, shuffle=True, verbose=True):
    indices = np.arange(len(data))
    if shuffle:
        indices = np.random.permutation(indices)
    if max_batches is not None:
        indices = indices[: batch_size * max_batches]

    irange = trange if verbose else range

    for start in irange(0, len(indices), batch_size):
        yield generate_batch(data.iloc[indices[start : start + batch_size]], max_len=max_len)

In [33]:
import os
import time

def train(model, opt, batch_size, patience=5, num_epochs=10, model_name='model1', batches_per_epoch=100, save=False):
    trigger_times = 0

    # start from +inf so that the first epoch never counts as a deterioration
    prev_val = np.inf
    val_min = np.inf

    for epoch in range(num_epochs):
        start_time = time.time()
        train_losses, train_maes = [], []
        val_losses, val_maes = [], []

        model.train(True)
        for batch in iterate_minibatches(data_train, batch_size=batch_size, max_batches=batches_per_epoch):
            title_ix = batch["Title"].to(device)
            desc_ix = batch["FullDescription"].to(device)
            cat_features = torch.FloatTensor(batch["Categorical"]).to(device)
            reference = torch.FloatTensor(batch[target_column]).to(device)

            prediction = model(title_ix, desc_ix, cat_features)

            loss = compute_loss(reference, prediction)
            loss.backward()
            opt.step()
            opt.zero_grad(set_to_none=True)

            train_losses.append(loss.item())
            train_maes.append(compute_mae(reference, prediction).item())

        model.train(False)
        with torch.no_grad():
            for batch in iterate_minibatches(data_val, batch_size=batch_size, shuffle=False):
                title_ix = batch["Title"].to(device)
                desc_ix = batch["FullDescription"].to(device)
                cat_features = torch.FloatTensor(batch["Categorical"]).to(device)
                reference = torch.FloatTensor(batch[target_column]).to(device)

                prediction = model(title_ix, desc_ix, cat_features)
                loss = compute_loss(reference, prediction)

                val_losses.append(loss.item())
                val_maes.append(compute_mae(reference, prediction).item())

        # average over this epoch's validation batches only
        val_mean = np.mean(val_losses)

        if val_mean < val_min and save:
            # make sure the checkpoint directory exists before saving
            os.makedirs('models', exist_ok=True)
            torch.save({
                'epoch': epoch,
                'val_mean': val_mean,
                'state_dict': model.state_dict(),
                'optimizer': opt.state_dict(),
            }, 'models/' + model_name + f'_{epoch+1}.pth')

            val_min = val_mean

        print("Epoch {} of {} took {:.3f}s".format(
            epoch + 1, num_epochs, time.time() - start_time))
        print("  training loss (in-iteration): \t{:.6f}".format(
            np.mean(train_losses)))
        print("  validation loss: \t\t\t{:.3f}".format(
            val_mean))
        print("  training / validation MAE: \t\t{:.2f} / {:.2f}".format(
            np.mean(train_maes), np.mean(val_maes)))

        # early stopping: count consecutive epochs where validation loss got worse
        if val_mean > prev_val:
            trigger_times += 1
            if trigger_times >= patience:
                print('Early stopping\n')
                break
        else:
            trigger_times = 0

        prev_val = val_mean
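
The early-stopping logic can also be kept in a small helper object, which makes it easy to test without running any training. The sketch below is a common variant that compares each epoch against the best validation loss seen so far, rather than against the previous epoch as the `train` function does. `EarlyStopper` is a hypothetical class and the loss history is made up.

```python
class EarlyStopper:
    """Signal a stop after `patience` consecutive epochs without a new best loss."""

    def __init__(self, patience=3):
        self.patience = patience
        self.best = float('inf')
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch's validation loss; return True when training should stop."""
        if val_loss < self.best:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience

stopper = EarlyStopper(patience=2)
history = [0.9, 0.7, 0.75, 0.72, 0.6]   # made-up validation losses
stopped_at = None
for epoch, val_loss in enumerate(history, start=1):
    if stopper.step(val_loss):
        stopped_at = epoch
        break
print("stopped at epoch", stopped_at, "with best loss", stopper.best)
```

Comparing against the best loss avoids resetting the counter when the loss merely recovers from a bad epoch without actually improving on the best result.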

In [None]:
num_epochs = 10
batch_size = 32
batches_per_epoch = len(data_train) // batch_size


train(model, opt, batch_size=batch_size, patience=3, num_epochs=num_epochs, model_name='model', batches_per_epoch = batches_per_epoch, save=True)

  0%|          | 0/6884 [00:00<?, ?it/s]