![add all the randomness](https://www.memecreator.org/static/images/memes/4987185.jpg)

<h1><center>or how to increase your ensemble score when your models suck</center></h1>

Hi guys! This kernel doesn't have a lot of fancy ideas on how to train good models... In fact, at some point a few days before the deadline, I gave up on increasing my CV score...

However, a single model's CV score is not the score you get after ensembling the predictions! Another variable comes into play: correlation. At a similar CV score, the less correlated the models' predictions are, the higher the ensemble score tends to be.

That means, the more randomness you can squeeze into the predictions without hurting the CV score, the better!
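
To make the correlation argument concrete, here is a minimal sketch under a simplifying assumption: every model's prediction error has the same variance and the same pairwise correlation `rho`. It looks at the error variance of the averaged prediction, not at F1 directly, and the helper name `averaged_error_variance` is made up for this illustration.

```python
import numpy as np

def averaged_error_variance(n_models, rho, sigma=1.0):
    """Variance of the mean of n equally noisy errors with pairwise correlation rho."""
    return sigma ** 2 * (1.0 / n_models + (1.0 - 1.0 / n_models) * rho)

rng = np.random.default_rng(0)
n_models, n_samples = 5, 200_000

for rho in (0.9, 0.5, 0.1):
    # Build errors with the requested pairwise correlation from a shared and a private part
    shared = rng.standard_normal((n_samples, 1))
    private = rng.standard_normal((n_samples, n_models))
    errors = np.sqrt(rho) * shared + np.sqrt(1.0 - rho) * private
    empirical = errors.mean(axis=1).var()
    print(rho, round(float(empirical), 3), round(averaged_error_variance(n_models, rho), 3))
```

With five models, the averaged error shrinks a lot at low correlation and barely at all at high correlation, which is the effect this kernel tries to exploit.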

So I spent a lot of time thinking of ways to add randomness to my single models. Here are some of the tricks I came up with:

- For each model, average the three embedding matrices with different weights. I found good and diverse ones through random search and luck.
- Randomized sample weights for each model. That way, each model focuses on different examples and should reach different solutions
- Re-initialize the random embedding matrix between runs. Since a lot of words don't have embeddings and thus use random vectors, this also helps diversify
- Add some random features to the embedding, different for each model. The models can overfit very slightly to those, also leading to increased diversification.
- Replace some random embedding features with a random vector for each model
- Overall, train longer than I would do otherwise. A slight overfit always seemed to help my ensemble F1.
- Each model was trained on a different subset of the data and with different layer sizes, dropout values, loss functions, etc.
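
The first two tricks in the list above can be sketched in a few lines. This is only an illustration: the toy matrices stand in for the real pretrained embeddings, and drawing the mixing weights from a Dirichlet distribution is an assumption (the list only says "random search"). The uniform sample-weight range mirrors the `random_sample_weights_range` setting used further down.

```python
import numpy as np

rng = np.random.default_rng(42)

# Stand-ins for the three pretrained matrices (vocabulary of 1000 words, 300 dims)
toy_embeddings = [rng.standard_normal((1000, 300)) for _ in range(3)]

def sample_embedding_weights(rng, n_embeddings=3):
    # Dirichlet draws are non-negative and sum to 1 (one possible random-search distribution)
    return rng.dirichlet(np.ones(n_embeddings))

def sample_weights(rng, n_examples, weight_range=0.25):
    # Uniform weights centred on 1, i.e. in [0.875, 1.125] for weight_range=0.25
    return rng.uniform(1 - weight_range / 2, 1 + weight_range / 2, n_examples)

weights = sample_embedding_weights(rng)
mixed = np.average(toy_embeddings, axis=0, weights=weights)
per_example = sample_weights(rng, n_examples=8)
print(weights, mixed.shape, per_example)
```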

To measure how this helps, I added optional `USE_DEV_SET` and `MEASURE_CORRELATION` flags.
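
As a rough idea of what the correlation measurement boils down to, the sketch below averages the off-diagonal entries of the Pearson correlation matrix between the models' predictions. The function name and the toy predictions are made up for the example.

```python
import numpy as np

def mean_pairwise_correlation(preds):
    """preds has shape (n_samples, n_models); returns the mean off-diagonal Pearson correlation."""
    corr = np.corrcoef(preds, rowvar=False)
    upper = np.triu_indices_from(corr, k=1)
    return corr[upper].mean()

rng = np.random.default_rng(0)
base = rng.random(10_000)
# Five toy "models": a shared signal plus independent noise per model
toy_preds = np.stack([base + 0.1 * rng.standard_normal(10_000) for _ in range(5)], axis=1)
print(mean_pairwise_correlation(toy_preds))
```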


### Other things: Cyclic LR

I realized almost every kernel was using Cyclic LR, but without tuning it so that the schedule ends at the lowest LR. I took the time to do this right.
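
A small sketch of the idea, assuming a plain triangular cycle and a single cycle over the whole training run: if half a cycle lasts exactly half of all batch updates, the learning rate is back at its lowest value on the final update. The helper names and the example sizes are illustrative only.

```python
import math

def triangular_lr(iteration, base_lr, max_lr, step_size):
    """Learning rate of a plain triangular cycle at a given batch iteration."""
    cycle = math.floor(1 + iteration / (2 * step_size))
    x = abs(iteration / step_size - 2 * cycle + 1)
    return base_lr + (max_lr - base_lr) * max(0.0, 1 - x)

def step_size_for_one_cycle(n_examples, batch_size, n_epochs):
    # Half a cycle lasts step_size iterations, so one full cycle spans the whole training run
    updates_per_epoch = math.ceil(n_examples / batch_size)
    return round(updates_per_epoch * n_epochs / 2)

step = step_size_for_one_cycle(n_examples=1_000_000, batch_size=1024, n_epochs=4)
for it in (0, step // 2, step, 2 * step):
    print(it, triangular_lr(it, base_lr=1e-3, max_lr=6e-3, step_size=step))
```

With a badly chosen `step_size`, the last update can land anywhere on the triangle, including near the peak.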

In [None]:
import time
kernel_start_time = time.time()
import random
import pandas as pd
import numpy as np
import re
import torch

import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader
import os
import math

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

# cross validation and metrics
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import f1_score
from sklearn.decomposition import PCA
from torch.optim.optimizer import Optimizer

# IMPORTANT


- For submissions, set `MEASURE_CORRELATION` and `USE_DEV_SET` to `False`.
- For local experiments, set them to `True`!

In [None]:
embed_size = 300
max_features = 120000
maxlen = 80
batch_size = 1024
n_splits = 5
MEASURE_CORRELATION = False # Set this to True for your local experiments
NORMALIZE_EMBEDDINGS = True # Whether the embedding matrix should be normalized
USE_DEV_SET = False # Set this to True for your local experiments
TRY_CAPITALS = True # Whether the embedding lookup should try a capitalized version
TRY_UPPER = True # Whether the embedding lookup should try an uppercased version

model_input_size = embed_size

SEED = 666666

In [None]:
def seed_everything(seed=0):
    seed = SEED + seed
    random.seed(seed)
    os.environ['PYTHONHASHSEED'] = str(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    torch.backends.cudnn.deterministic = True
seed_everything()

In [None]:
df_train = pd.read_csv("../input/train.csv")
df_test = pd.read_csv("../input/test.csv")

In [None]:
puncts = [
    '½', '¿', 'ï', '¸', '-', ',', '/', '"', '¨', '²', 'è', '×', '❤', '，', '↓', '▾', '↑',
    'Ã', '±', ']', '·', '_', '<', '?', '⋅', '™', '~', '→', '′', '>', '≤', '€', '¥', '¼',
    '¶', '@', '√', '®', '\\', '…', '、', '¹', '$', '•', '!', '¯', '&', '†', ')', '・', '^',
    '—', '+', '#', '（', '³', '£', '″', '−', '[', '¬', '¦', '）', '–', '”', '¢', '%', '©',
    '»', '}', '¾', '§', '=', '{', '‘', '∞', 'Ø', '°', '|', '：', '▒', 'â', 'à', ':', '(',
    ';', '`', '│', 'é', '*', '’', '.', '\'', '“',
]

def clean_text(x):
    for punct in puncts:
        if punct in x:
            x = x.replace(punct, f' {punct} ')
    return x

In [None]:
from sklearn.preprocessing import StandardScaler

def add_features(df):
    df['total_length'] = df['question_text'].apply(len)
    df['capitals'] = df['question_text'].apply(lambda comment: sum(1 for c in comment if c.isupper()))
    df['caps_vs_length'] = df.apply(lambda row: float(row['capitals'])/float(row['total_length']), axis=1)
    df['num_words'] = df.question_text.str.count(r'\S+') + 1
    df['num_unique_words'] = df['question_text'].apply(lambda comment: len(set(w for w in comment.split())))
    df['words_vs_unique'] = df['num_unique_words'] / df['num_words']

    return df

FEATURE_NAMES = ['caps_vs_length', 'words_vs_unique', 'total_length', 'num_words']

def load_and_prec():
    train_df = df_train
    test_df = df_test
    print("Train shape : ",train_df.shape)
    print("Test shape : ",test_df.shape)

    # Clean the text
    train_df["question_text"] = train_df["question_text"].apply(clean_text)
    test_df["question_text"] = test_df["question_text"].apply(clean_text)

    ## fill up the missing values
    train_X = train_df["question_text"].fillna("_##_").values
    test_X = test_df["question_text"].fillna("_##_").values

    ###################### Add Features ###############################
    #  https://github.com/wongchunghang/toxic-comment-challenge-lstm/blob/master/toxic_comment_9872_model.ipynb
    train = add_features(train_df)
    test = add_features(test_df)

    features = train[FEATURE_NAMES].fillna(0)
    test_features = test[FEATURE_NAMES].fillna(0)

    ss = StandardScaler()
    ss.fit(np.vstack((features, test_features)))
    features = ss.transform(features)
    test_features = ss.transform(test_features)
    ###########################################################################

    filters = '#&*+<>@\\^_`{|}~\n'

    ## Tokenize the sentences
    tokenizer = Tokenizer(num_words=max_features, filters=filters)

    tokenizer.fit_on_texts(list(train_X))
    train_X = tokenizer.texts_to_sequences(train_X)
    test_X = tokenizer.texts_to_sequences(test_X)

    ## Pad the sentences
    train_X = pad_sequences(train_X, maxlen=maxlen)
    test_X = pad_sequences(test_X, maxlen=maxlen)

    ## Get the target values
    train_y = train_df['target'].values

    return train_X, test_X, train_y, features, test_features, tokenizer.word_index

In [None]:
def add_dev_set(train_X, train_y, train_features, size=100000):
    dev_X = train_X[:size]
    dev_y = train_y[:size]
    dev_features = train_features[:size]

    train_X = train_X[size:]
    train_y = train_y[size:]
    train_features = train_features[size:]

    return train_X, train_y, train_features, dev_X, dev_y, dev_features


In [None]:
x_train, x_test, y_train, train_features, test_features, word_index = load_and_prec()
print('Time elapsed:', time.time() - kernel_start_time)

In [None]:
if USE_DEV_SET:
    x_train, y_train, train_features, x_dev, y_dev, dev_features = add_dev_set(x_train, y_train, train_features)

In [None]:
embedding_data = {
    'glove': {
        'file': '../input/embeddings/glove.840B.300d/glove.840B.300d.txt',
        'emb_mean': -0.005838499,
        'emb_std': 0.48782197,
        'file_kwargs': {
            'encoding': 'utf8',
        },
    },
    'fasttext': {
        'file': '../input/embeddings/wiki-news-300d-1M/wiki-news-300d-1M.vec',
        'emb_mean': -0.0033469985,
        'emb_std': 0.109855495,
    },
    'paragram': {
        'file': '../input/embeddings/paragram_300_sl999/paragram_300_sl999.txt',
        'emb_mean': -0.0053247833,
        'emb_std': 0.49346462,
        'file_kwargs': {
            'encoding': 'utf8',
            'errors': 'ignore',
        },
    },
}

def load_embedding(
    type, word_index,
    normalize=NORMALIZE_EMBEDDINGS,
    try_capitals=TRY_CAPITALS, try_upper=TRY_UPPER,
):
    assert type in embedding_data
    data = embedding_data.get(type)

    emb_mean, emb_std = data['emb_mean'], data['emb_std']
    nb_words = min(max_features, len(word_index))

    embedding_matrix = np.random.normal(emb_mean, emb_std, (nb_words, model_input_size))

    all_words = set(k for k, i in word_index.items() if i < max_features)
    oov_candidates = {}

    def set_row(i, vecstring):
        extra_features = []

        # Parse the vector passed in, not the `vec` variable of the enclosing file loop
        embedding_vector = np.asarray(vecstring.split(' ') + extra_features, dtype='float32')[:model_input_size]
        if len(embedding_vector) == model_input_size:
            embedding_matrix[i] = embedding_vector
            return True
        return False

    with open(data['file'], 'r', **data.get('file_kwargs', {})) as f:
        for line in f:
            word, vec = line.split(' ', 1)
            i = word_index.get(word)

            if i is None and (try_capitals or try_upper):
                oov_candidates[word] = vec

            if i is None or i >= max_features:
                continue

            all_words.discard(word)

            set_row(i, vec)

    oov_words = [(w, i) for w, i in word_index.items() if w in all_words]

    if try_capitals or try_upper:
        n_added = 0
        for word, i in oov_words:
            capitalized = oov_candidates.get(word[0].upper() + word[1:]) if try_capitals else None
            upper = try_upper and (capitalized is None) and (len(word) < 7) and oov_candidates.get(word.upper())
            replacement = capitalized or upper
            if replacement:
                if set_row(i, replacement):
                    n_added += 1
                    continue

        print(n_added, 'added')

    if normalize:
        embedding_matrix = embedding_matrix / emb_std

    return embedding_matrix, oov_words

In [None]:
def load_all_embeddings(seed, load_embeddings=['glove', 'paragram', 'fasttext']):
    seed_everything(seed)

    embeddings = []

    for embedding_name in load_embeddings:
        matrix, oov = load_embedding(embedding_name, word_index)
        embeddings.append(matrix)

        print(embedding_name, 'OOV', len(oov))

    return embeddings

embeddings = load_all_embeddings(1)

In [None]:
print('Time elapsed:', time.time() - kernel_start_time)

In [None]:
# code inspired from: https://github.com/anandsaha/pytorch.cyclic.learning.rate/blob/master/cls.py
class CyclicLR(object):
    def __init__(self, optimizer, base_lr=1e-3, max_lr=6e-3,
                 step_size=2000, mode='triangular', gamma=1.,
                 scale_fn=None, scale_mode='cycle', last_batch_iteration=-1):

        if not isinstance(optimizer, Optimizer):
            raise TypeError('{} is not an Optimizer'.format(
                type(optimizer).__name__))
        self.optimizer = optimizer

        if isinstance(base_lr, list) or isinstance(base_lr, tuple):
            if len(base_lr) != len(optimizer.param_groups):
                raise ValueError("expected {} base_lr, got {}".format(
                    len(optimizer.param_groups), len(base_lr)))
            self.base_lrs = list(base_lr)
        else:
            self.base_lrs = [base_lr] * len(optimizer.param_groups)

        if isinstance(max_lr, list) or isinstance(max_lr, tuple):
            if len(max_lr) != len(optimizer.param_groups):
                raise ValueError("expected {} max_lr, got {}".format(
                    len(optimizer.param_groups), len(max_lr)))
            self.max_lrs = list(max_lr)
        else:
            self.max_lrs = [max_lr] * len(optimizer.param_groups)

        self.step_size = step_size

        if mode not in ['triangular', 'triangular2', 'exp_range'] \
                and scale_fn is None:
            raise ValueError('mode is invalid and scale_fn is None')

        self.mode = mode
        self.gamma = gamma

        if scale_fn is None:
            if self.mode == 'triangular':
                self.scale_fn = self._triangular_scale_fn
                self.scale_mode = 'cycle'
            elif self.mode == 'triangular2':
                self.scale_fn = self._triangular2_scale_fn
                self.scale_mode = 'cycle'
            elif self.mode == 'exp_range':
                self.scale_fn = self._exp_range_scale_fn
                self.scale_mode = 'iterations'
        else:
            self.scale_fn = scale_fn
            self.scale_mode = scale_mode

        self.in_final_stage = False
        self.batch_step(last_batch_iteration + 1)
        self.last_batch_iteration = last_batch_iteration

    def batch_step(self, batch_iteration=None):
        if batch_iteration is None:
            batch_iteration = self.last_batch_iteration + 1
        self.last_batch_iteration = batch_iteration
        for param_group, lr in zip(self.optimizer.param_groups, self.get_lr()):
            param_group['lr'] = lr

    def enter_final_stage(self, remaining_iterations):
        # Call to linearly decrease down to final lr over the remaining iterations
        self.in_final_stage = True
        self.remaining_iterations = remaining_iterations
        self.final_lr = self.base_lrs[0] / 100
        self.final_iterations = 0

    def _triangular_scale_fn(self, x):
        return 1.

    def _triangular2_scale_fn(self, x):
        return 1 / (2. ** (x - 1))

    def _exp_range_scale_fn(self, x):
        return self.gamma**(x)

    def get_lr(self):
        if self.in_final_stage:
            lrs = []
            param_lrs = zip(self.optimizer.param_groups, self.base_lrs, self.max_lrs)
            for param_group, base_lr, max_lr in param_lrs:
                step = (base_lr - self.final_lr) / self.remaining_iterations
                lr = base_lr - step * self.final_iterations

                if self.final_iterations < self.remaining_iterations:
                    self.final_iterations += 1

                lrs.append(lr)

            return lrs


        step_size = float(self.step_size)
        cycle = np.floor(1 + self.last_batch_iteration / (2 * step_size))
        x = np.abs(self.last_batch_iteration / step_size - 2 * cycle + 1)

        lrs = []
        param_lrs = zip(self.optimizer.param_groups, self.base_lrs, self.max_lrs)
        for param_group, base_lr, max_lr in param_lrs:
            base_height = (max_lr - base_lr) * np.maximum(0, (1 - x))
            if self.scale_mode == 'cycle':
                lr = base_lr + base_height * self.scale_fn(cycle)
            else:
                lr = base_lr + base_height * self.scale_fn(self.last_batch_iteration)
            lrs.append(lr)

        return lrs

In [None]:
class Attention(nn.Module):
    def __init__(self, feature_dim, step_dim, bias=True, **kwargs):
        super(Attention, self).__init__(**kwargs)

        self.supports_masking = True

        self.bias = bias
        self.feature_dim = feature_dim
        self.step_dim = step_dim
        self.features_dim = 0

        weight = torch.zeros(feature_dim, 1)
        nn.init.xavier_uniform_(weight)
        self.weight = nn.Parameter(weight)

        if bias:
            self.b = nn.Parameter(torch.zeros(step_dim))

    def forward(self, x, mask=None):
        feature_dim = self.feature_dim
        step_dim = self.step_dim

        eij = torch.mm(
            x.contiguous().view(-1, feature_dim),
            self.weight
        ).view(-1, step_dim)

        if self.bias:
            eij = eij + self.b

        eij = torch.tanh(eij)
        a = torch.exp(eij)

        if mask is not None:
            a = a * mask

        # The epsilon belongs in the denominator to guard against division by zero
        a = a / (torch.sum(a, 1, keepdim=True) + 1e-10)

        weighted_input = x * torch.unsqueeze(a, -1)
        return torch.sum(weighted_input, 1)

In [None]:
class NeuralNetBig(nn.Module):
    def __init__(self, embedding=embeddings[0], config={}):
        super(NeuralNetBig, self).__init__()
        self.config = config = {
            'hidden_size_1': 132,
            'hidden_size_2': 100,
            'embedding_dropout': 0.1,
            'dropout': 0.1,
            'fc_size': 50,
            **config
        }
        model_input_size = embedding.shape[1]
        self.embedding = nn.Embedding(max_features, model_input_size)
        self.embedding.weight = nn.Parameter(torch.tensor(embedding, dtype=torch.float32))
        self.embedding.weight.requires_grad = False
        self.embedding_dropout = nn.Dropout2d(config['embedding_dropout'])

        self.lstm = nn.LSTM(model_input_size, config['hidden_size_1'], bidirectional=True, batch_first=True)
        self.gru = nn.GRU(config['hidden_size_1'] * 2, config['hidden_size_2'], bidirectional=True, batch_first=True)

        self.lstm_attention = Attention(config['hidden_size_1'] * 2, maxlen)
        self.gru_attention = Attention(config['hidden_size_2'] * 2, maxlen)
        f_size = config['hidden_size_1'] * 2 + config['hidden_size_2'] * 6 + len(FEATURE_NAMES)
        self.linear = nn.Linear(f_size, config['fc_size'])
        self.bn = nn.BatchNorm1d(config['fc_size'], momentum=0.5)
        self.relu = nn.ReLU()
        self.dropout = nn.Dropout(config['dropout'])
        self.out = nn.Linear(config['fc_size'], 1)

    def forward(self, x):
        x_embed, x_features = x

        h_embedding = self.embedding(x_embed)
        h_embedding = torch.squeeze(self.embedding_dropout(torch.unsqueeze(h_embedding, 0)))

        h_lstm, _ = self.lstm(h_embedding)
        h_gru, _ = self.gru(h_lstm)

        h_lstm_atten = self.lstm_attention(h_lstm)
        h_gru_atten = self.gru_attention(h_gru)

        avg_pool = torch.mean(h_gru, 1)
        max_pool, _ = torch.max(h_gru, 1)

        f = torch.tensor(x_features, dtype=torch.float).cuda()

        conc = torch.cat((h_lstm_atten, h_gru_atten, avg_pool, max_pool, f), 1)
        conc = self.dropout(conc)
        conc = self.relu(self.linear(conc))
        conc = self.bn(conc)
        out = self.out(conc)

        return out

In [None]:
class NeuralNetGRU(nn.Module):
    def __init__(self, embedding=embeddings[0], config={}):
        super(NeuralNetGRU, self).__init__()
        self.config = config = {
            'hidden_size': [132, 100, 50],
            'dropout': 0.05,
            **config
        }

        self.hidden_size = hidden_size = config['hidden_size']

        model_input_size = embedding.shape[1]
        self.embedding = nn.Embedding(max_features, model_input_size)
        self.embedding.weight = nn.Parameter(torch.tensor(embedding, dtype=torch.float32))
        self.embedding.weight.requires_grad = False

        self.gru_1 = nn.GRU(model_input_size, hidden_size[0], bidirectional=True, batch_first=True)
        self.gru_2 = nn.GRU(2 * hidden_size[0], hidden_size[1], bidirectional=True, batch_first=True)

        f_size = hidden_size[0] * 2 + hidden_size[1] * 2 + len(FEATURE_NAMES)

        self.fc_size = config.get('fc_size')

        if config.get('fc_size'):
            self.linear = nn.Linear(f_size, config['fc_size'])
            self.bn = nn.BatchNorm1d(config['fc_size'])
            self.relu = nn.ReLU()
            self.out = nn.Linear(config['fc_size'], 1)

        else:
            self.bn = nn.BatchNorm1d(f_size)
            self.out = nn.Linear(f_size, 1)

    def forward(self, x):
        x_embed, x_features = x

        h_embedding = self.embedding(x_embed)

        h_gru_1, _ = self.gru_1(h_embedding)
        h_gru_2, _ = self.gru_2(h_gru_1)

        h_max1, _ = torch.max(h_gru_1, 1)
        h_max2, _ = torch.max(h_gru_2, 1)

        f = torch.tensor(x_features, dtype=torch.float).cuda()
        out = torch.cat((h_max1, h_max2, f), 1)

        if self.fc_size:
            out = self.linear(out)
            out = self.bn(out)
            out = self.relu(out)

        out = self.out(out)

        return out

In [None]:
models = {
    'gru': NeuralNetGRU,
    'big': NeuralNetBig,
}

class MyDataset(Dataset):
    def __init__(self,dataset):
        self.dataset = dataset

    def __getitem__(self, index):
        data, target = self.dataset[index]

        return data, target, index
    def __len__(self):
        return len(self.dataset)

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

from sklearn.metrics import precision_recall_curve
def best_thresshold(y_true, y_proba):
    # Returns the probability threshold that maximizes F1, together with that F1 score
    precision, recall, thresholds = precision_recall_curve(y_true, y_proba)
    thresholds = np.append(thresholds, 1.001)
    f1_scores = 2 / (1/precision + 1/recall)
    best_score = np.max(f1_scores)
    best_th = thresholds[np.argmax(f1_scores)]
    return best_th, best_score
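
For intuition about what the threshold search does, here is a NumPy-only sketch that scans a fixed grid of thresholds and keeps the one with the best F1. It is a simplified cross-check, not the method used in this kernel (which reads the candidates off the precision-recall curve); the function names and the toy data are made up.

```python
import numpy as np

def f1_at_threshold(y_true, y_proba, threshold):
    y_pred = y_proba > threshold
    tp = np.sum(y_pred & (y_true == 1))
    fp = np.sum(y_pred & (y_true == 0))
    fn = np.sum(~y_pred & (y_true == 1))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def grid_search_threshold(y_true, y_proba, grid=None):
    # Scan thresholds 0.01, 0.02, ..., 0.99 unless a custom grid is given
    grid = np.arange(0.01, 1.0, 0.01) if grid is None else grid
    scores = [f1_at_threshold(y_true, y_proba, t) for t in grid]
    best = int(np.argmax(scores))
    return grid[best], scores[best]

y_true = np.array([0, 0, 1, 1, 0, 1])
y_proba = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7])
print(grid_search_threshold(y_true, y_proba))
```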

In [None]:
x_test_cuda = torch.tensor(x_test, dtype=torch.long).cuda()
test = torch.utils.data.TensorDataset(x_test_cuda)

if USE_DEV_SET:
    x_dev_cuda = torch.tensor(x_dev, dtype=torch.long).cuda()
    dev = torch.utils.data.TensorDataset(x_dev_cuda)

In [None]:
# Creates a weighted average or concat of embeddings, specified in the settings
def ensemble_embeddings(settings, split_i):
    global embeddings

    mode = settings.get('embedding_mode', 'avg')
    weights = settings.get('embedding_weights', [1, 1, 1])

    assert mode in ['avg', 'sample', 'concat']

    if mode == 'avg':
        embedding = np.average(embeddings, axis=0, weights=weights)
    elif mode == 'sample':
        sample_embedding_size = settings['embedding_sample']
        weight_sum = np.array(weights).sum()
        emb_list = []

        for e, w in zip(embeddings, weights):
            count = int(round((w / weight_sum) * sample_embedding_size))
            emb_list.append(
                e[:, np.random.randint(e.shape[1], size=count)]
            )

        embedding = np.concatenate(emb_list, axis=1)
    elif mode == 'concat':
        embedding = np.concatenate(embeddings, axis=1)

    if settings.get('pca', 0):
        pca = PCA(n_components=settings['pca'])
        embedding = pca.fit_transform(embedding)

    if settings.get('set_random_columns', 0):
        n_random_cols = settings['set_random_columns']
        random_cols = np.random.normal(0, 1, (len(embedding), n_random_cols))
        embedding[:, np.random.randint(embedding.shape[1], size=n_random_cols)] = random_cols

    if settings.get('add_random_columns', 0):
        n_random_cols = settings['add_random_columns']
        random_cols = np.random.normal(0, 1, (len(embedding), n_random_cols))
        embedding = np.concatenate([embedding, random_cols], -1)

    if settings.get('final_sample'):
        embedding = embedding[:, np.random.randint(embedding.shape[1], size=settings['final_sample'])]

    return embedding
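
The two random-column tricks can be looked at in isolation with the sketch below. It is a standalone illustration with a made-up helper name, and it differs from the function above in one labeled way: it picks the columns to overwrite without replacement, so exactly `n_replace` distinct columns are changed.

```python
import numpy as np

def randomize_columns(embedding, n_replace, n_append, rng):
    out = embedding.copy()
    n_rows, n_cols = out.shape
    # Overwrite n_replace distinct columns with standard-normal noise
    cols = rng.choice(n_cols, size=n_replace, replace=False)
    out[:, cols] = rng.standard_normal((n_rows, n_replace))
    # Append n_append extra noise columns on the right
    extra = rng.standard_normal((n_rows, n_append))
    return np.concatenate([out, extra], axis=1), cols

rng = np.random.default_rng(7)
toy = rng.standard_normal((100, 20))
noisy, replaced = randomize_columns(toy, n_replace=5, n_append=5, rng=rng)
print(noisy.shape, sorted(replaced.tolist()))
```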

In [None]:
def training(n_splits, schedule, base_seed=1):
    global embeddings

    training_start = time.time()
    base_seed += SEED

    # matrix for the out-of-fold predictions
    train_preds = np.zeros((len(x_train)))
    # matrix for the predictions on the test set
    test_preds = np.zeros((len(df_test)))

    if USE_DEV_SET:
        dev_preds = np.zeros((len(x_dev)))

    # always call this before training for deterministic results
    seed_everything(base_seed)

    avg_losses_f = []
    avg_val_losses_f = []

    splits = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=SEED+base_seed-1).split(x_train, y_train)

    while len(schedule) < n_splits:
        schedule *= 2

    if MEASURE_CORRELATION:
        all_test_preds = np.zeros((len(x_test), n_splits))

    for i, (train_idx, valid_idx) in enumerate(splits):
        settings = {
            'gamma': 0.99994,
            **schedule[i]
        }
        init_seed = settings.get('init_seed', base_seed + i * 100)
        seed_everything(init_seed)
        model_class = models[settings['model']]

        batch_size = settings['batch_size']
        n_epochs = settings['n_epochs']

        test_loader = torch.utils.data.DataLoader(test, batch_size=batch_size, shuffle=False)
        if USE_DEV_SET:
            dev_loader = torch.utils.data.DataLoader(dev, batch_size=batch_size, shuffle=False)

        x_train_fold = torch.tensor(x_train[train_idx.astype(int)], dtype=torch.long).cuda()
        y_train_fold = torch.tensor(y_train[train_idx.astype(int), np.newaxis], dtype=torch.float32).cuda()
        x_val_fold = torch.tensor(x_train[valid_idx.astype(int)], dtype=torch.long).cuda()
        y_val_fold = torch.tensor(y_train[valid_idx.astype(int), np.newaxis], dtype=torch.float32).cuda()
        kfold_X_features = train_features[train_idx.astype(int)]
        kfold_X_valid_features = train_features[valid_idx.astype(int)]

        random_sample_weights_range = settings.get('random_sample_weights_range', 0)
        # Default to uniform weights so the training loop never sees missing or stale weights
        kfold_weights = np.ones(len(x_train_fold))
        if settings.get('add_sample_weights') and random_sample_weights_range != 0:
            # Here come the random sample weights for each fold:
            kfold_weights = np.random.uniform(
                1 - random_sample_weights_range / 2,
                1 + random_sample_weights_range / 2,
                (len(x_train_fold),),
            )

        embedding = ensemble_embeddings(settings, i)
        model = model_class(embedding=embedding, config=settings['nn_config'])

        # make sure everything in the model is running on the GPU
        model.cuda()

        loss_factor = 1
        loss_fn = torch.nn.BCEWithLogitsLoss(reduction='none')

        train = torch.utils.data.TensorDataset(x_train_fold, y_train_fold)
        train = MyDataset(train)
        train_loader = torch.utils.data.DataLoader(train, batch_size=batch_size, shuffle=True)

        updates_per_epoch = math.ceil(len(train_idx) / batch_size)

        step_size = round(updates_per_epoch * ((n_epochs * settings['n_cycles']) / 2))
        base_lr, max_lr = settings['base_lr'], settings['max_lr']
        optimizer = torch.optim.Adam(filter(lambda p: p.requires_grad, model.parameters()), lr=base_lr)

        scheduler = CyclicLR(optimizer, base_lr=base_lr, max_lr=max_lr,
                   step_size=step_size, mode='exp_range',
                   gamma=settings['gamma'])

        valid = torch.utils.data.TensorDataset(x_val_fold, y_val_fold)
        valid = MyDataset(valid)
        valid_loader = torch.utils.data.DataLoader(valid, batch_size=batch_size, shuffle=False)

        print(f'Fold {i + 1}')
        for epoch in range(n_epochs + (1 if settings.get('add_final_stage', 0) > 0 else 0)):
            seed_everything(init_seed + epoch)
            # set train mode of the model. This enables operations which are only applied during training like dropout
            start_time = time.time()
            model.train()

            is_final_stage = False
            steps_per_epoch = len(train_loader)

            if settings.get('add_final_stage') and epoch == n_epochs:
                is_final_stage = True
                steps_per_epoch = settings['add_final_stage']
                scheduler.enter_final_stage(settings['add_final_stage'])

            avg_loss = 0.
            for batch_i, (x_batch, y_batch, index) in enumerate(train_loader):
                # Forward pass: compute predicted y by passing x to the model.
                f = kfold_X_features[index]
                y_pred = model([x_batch, f])

                if scheduler:
                    scheduler.batch_step()

                if settings.get('add_sample_weights'):
                    loss_weights = torch.tensor(kfold_weights[index], dtype=torch.float).unsqueeze(1).cuda()
                    losses = loss_fn(y_pred, y_batch)
                    losses *= loss_weights
                else:
                    losses = loss_fn(y_pred, y_batch)

                loss = losses.sum()

                # Before the backward pass, use the optimizer object to zero all of the
                # gradients for the Tensors it will update (which are the learnable weights
                # of the model)
                optimizer.zero_grad()

                # Backward pass: compute gradient of the loss with respect to model parameters
                loss.backward()

                # Calling the step function on an Optimizer makes an update to its parameters
                optimizer.step()

                avg_loss += ((loss.item() / steps_per_epoch) / batch_size) * loss_factor

                if is_final_stage and batch_i >= settings['add_final_stage']:
                    break

            # set evaluation mode of the model. This disables operations which are only applied during training like dropout
            model.eval()

            # predict all the samples in the validation fold, batch by batch
            test_preds_fold = np.zeros((len(df_test)))
            val_f1 = 0

            valid_preds_fold = np.zeros((x_val_fold.size(0)))
            avg_val_loss = 0.
            for bi, (x_batch, y_batch, index) in enumerate(valid_loader):
                f = kfold_X_valid_features[index]
                y_pred = model([x_batch,f]).detach()

                avg_val_loss += ((loss_fn(y_pred, y_batch).sum().item() / len(valid_loader)) / batch_size) * loss_factor
                valid_preds_fold[bi * batch_size:(bi+1) * batch_size] = sigmoid(y_pred.cpu().numpy())[:, 0]

            threshold, val_f1 = best_thresshold(y_val_fold.cpu().numpy(), valid_preds_fold)

            elapsed_time = time.time() - start_time
            print('Epoch {}/{} \t loss={:.4f} \t val_loss={:.4f} \t val_f1={:.4f} \t thresh={:.4f} \t time={:.2f}s'.format(
                epoch + 1, n_epochs, avg_loss, avg_val_loss, val_f1, threshold, elapsed_time))

        ensemble_weight = settings.get('ensemble_weight', 1 / n_splits)

        avg_losses_f.append(avg_loss)
        avg_val_losses_f.append(avg_val_loss)
        # predict all samples in the test set batch per batch
        for bi, (x_batch,) in enumerate(test_loader):
            f = test_features[bi * batch_size:(bi+1) * batch_size]
            y_pred = model([x_batch,f]).detach()

            test_preds_fold[bi * batch_size:(bi+1) * batch_size] = sigmoid(y_pred.cpu().numpy())[:, 0]

        if USE_DEV_SET:
            dev_preds_fold = np.zeros((len(x_dev)))

            for bi, (x_batch,) in enumerate(dev_loader):
                f = dev_features[bi * batch_size:(bi+1) * batch_size]
                y_pred = model([x_batch, f]).detach()

                dev_preds_fold[bi * batch_size:(bi+1) * batch_size] = sigmoid(y_pred.cpu().numpy())[:, 0]

            dev_preds += dev_preds_fold * ensemble_weight

        if MEASURE_CORRELATION:
            all_test_preds[:, i] = test_preds_fold

        train_preds[valid_idx] = valid_preds_fold

        test_preds += test_preds_fold  * ensemble_weight

    correlation = 0
    if MEASURE_CORRELATION:
        correlation_matrix = pd.DataFrame(all_test_preds).corr()
        print('\ncorrelation matrix')
        print(correlation_matrix)
        print()
        correlation = correlation_matrix.values[np.triu_indices_from(correlation_matrix.values, 1)].mean()

    threshold, val_f1 = best_thresshold(y_train, train_preds)
    average_val_loss = np.average(avg_val_losses_f)
    overfit = average_val_loss / np.average(avg_losses_f)
    training_time = time.time() - training_start
    dev_threshold, dev_f1 = 0, 0

    if USE_DEV_SET:
        dev_threshold, dev_f1 = best_thresshold(y_dev, dev_preds)

    return test_preds, val_f1, average_val_loss, correlation, training_time, overfit, threshold, dev_f1, dev_threshold

In [None]:
# This is my training schedule... It consists of training options for each model in my ensemble
schedule = [
    {
        "embedding_weights": [
            0.44439510338875615,
            0.3859913694202824,
            0.1696135271909615,
        ],
        "model": "big",
        "base_lr": 0.001,
        "max_lr": 0.006,
        "n_epochs": 4,
        "n_cycles": 1,
        "batch_size": 1024,
        "add_final_stage": 5,
        "nn_config": {
            "hidden_size_1": 123,
            "hidden_size_2": 73,
            "embedding_dropout": 0.04055337160498571,
            "dropout": 0.15174998980548882,
            "fc_size": 77
        },
        "add_sample_weights": True,
        'set_random_columns': 5,
        'add_random_columns': 5,
        'random_sample_weights_range': 0.25,
    },
    {
        "embedding_weights": [
            0.6303467699905284,
            0.09360214396255329,
            0.27605108604691825,
        ],
        "model": "big",
        "base_lr": 0.0006013507213308578,
        "max_lr": 0.006102420734804078,
        "n_epochs": 4,
        "n_cycles": 1,
        "batch_size": 1024,
        "add_final_stage": 0,
        "nn_config": {
            "hidden_size_1": 103,
            "hidden_size_2": 140,
            "embedding_dropout": 0.0728942043640125,
            "dropout": 0.1812402857338244,
            "fc_size": 19
        },
        "add_sample_weights": True,
        'set_random_columns': 5,
        'add_random_columns': 5,
        'random_sample_weights_range': 0.25,
    },
    {
        "embedding_weights": [
            0.5497754648703559,
            0.28986474829626435,
            0.16035978683337976,
        ],
        "model": "big",
        "base_lr": 0.0007647926420277793,
        "max_lr": 0.006060292392943227,
        "n_epochs": 4,
        "n_cycles": 1,
        "batch_size": 1024,
        "add_final_stage": 10,
        "nn_config": {
            "hidden_size_1": 89,
            "hidden_size_2": 74,
            "embedding_dropout": 0.041249174666613736,
            "dropout": 0.19393867662442568,
            "fc_size": 87
        },
        "add_sample_weights": True,
        'set_random_columns': 5,
        'add_random_columns': 5,
        'random_sample_weights_range': 0.25,
    },
    {
        "embedding_weights": [
            0.38162653503923166,
            0.34041829363892023,
            0.27795517132184816,
        ],
        "model": "gru",
        "base_lr": 0.000765667042217169,
        "max_lr": 0.006078156994754822,
        "n_epochs": 3,
        "n_cycles": 1,
        "batch_size": 1024,
        "nn_config": {
            "hidden_size": [
                154,
                81
            ],
            "fc_size": 0
        },
        "add_sample_weights": True,
        'set_random_columns': 5,
        'add_random_columns': 5,
        'random_sample_weights_range': 0.25,
        "add_final_stage": 10,
        "reload_embeddings": True,
    },
    {
        "embedding_weights": [
            0.09924183199312638,
            0.3664584186963535,
            0.53429974931052,
        ],
        "model": "gru",
        "base_lr": 0.0012809083457068925,
        "max_lr": 0.006110334179517582,
        "n_epochs": 4,
        "n_cycles": 1,
        "batch_size": 1024,
        "nn_config": {
            "hidden_size": [
                157,
                60
            ],
            "fc_size": 0
        },
        "add_sample_weights": True,
        'set_random_columns': 5,
        'add_random_columns': 5,
        'random_sample_weights_range': 0.25,
        "add_final_stage": 0
    },
]

In [None]:
test_preds, val_f1, average_val_loss, correlation, training_time, overfit, threshold, dev_f1, dev_threshold = training(n_splits, schedule)

In [None]:
print('val_f1', val_f1)
print('average_val_loss', average_val_loss)
print('overfit', overfit)
print('threshold', threshold)
print('correlation', correlation)
print('dev f1', dev_f1)
print('dev threshold', dev_threshold)
print('training_time', training_time)

In [None]:
if USE_DEV_SET:
    threshold = (threshold + dev_threshold) / 2

submission = df_test[['qid']].copy()  # copy to avoid pandas' SettingWithCopyWarning
submission['prediction'] = (test_preds > threshold).astype(int)
submission.to_csv('submission.csv', index=False)

In [None]:
print('Time elapsed:', time.time() - kernel_start_time)