### Bert-base TensorFlow 2.0

This kernel does not explore the data. For that, check out some of the great EDA kernels: [introduction](https://www.kaggle.com/corochann/google-quest-first-data-introduction), [getting started](https://www.kaggle.com/phoenix9032/get-started-with-your-questions-eda-model-nn) & [another getting started](https://www.kaggle.com/hamditarek/get-started-with-nlp-lda-lsa). This kernel is an example of a TensorFlow 2.0 Bert-base implementation, using the ~~TensorFlow Hub~~ Huggingface transformers library. <br><br>

---
**Update 1 (Commit 7):**
* removing penultimate dense layer; now there's only one dense layer (output layer) for fine-tuning
* using BERT's sequence_output instead of pooled_output as input for the dense layer
---

**Update 2 (Commit 8):**
* adjusting `_trim_input()` --- now have a q_max_len and a_max_len, instead of 'keeping the ratio the same' while trimming.
* **importantly:** now also includes question_title for the input sequence
---

**Update 3 (Commit 9)**
<br><br>*There is a lot of room to experiment with the title + body + answer sequence. Feel free to look into, e.g., (1) inventing new tokens (add them to '../input/path-to-bert-folder/assets/vocab.txt'), (2) keeping \[SEP\] between title and body but modifying `_get_segments()`, (3) using the \[PAD\] token, or (4) merging title and body without any kind of separation. In this commit I'm doing (2). I also tried (3) offline, and both perform better than commit 8 in terms of validation rho.*<br>

* ignoring first \[SEP\] token in `_get_segments()`.

---

**Update 4 (Commit 11)**
* **Now using Huggingface transformer instead of TFHub** (note major changes in the code). This makes it easy to try out different architectures like XLNet, RoBERTa, etc., and to output the hidden states of the transformer.
* two separate inputs (title+body and answer) for BERT
* removed snapshot average (now only using last (third) epoch). This will likely decrease performance, but it's not feasible to use ~ 5 x 4 models for a single bert prediction in practice. 
* only training for 2 epochs instead of 3 (to manage 2h limit)
---

Fork

---

**Update 1**
* three separate inputs (title+body, answer, title+body+answer) for BERT
* concat count-based features with BERT-features and feed into NNs
---

In [None]:
import re
import os
import sys
import time
import logging
import numpy as np
import pandas as pd
import tensorflow as tf
import tensorflow.keras.backend as K
import tensorflow_hub as hub
from sklearn.model_selection import GroupKFold
from sklearn.preprocessing import StandardScaler
from tqdm.notebook import tqdm
from collections import OrderedDict
from scipy.stats import spearmanr
from math import floor, ceil

np.random.seed(0)
tf.random.set_seed(0)
np.set_printoptions(suppress=True)
print(tf.__version__)

In [None]:
!pip install ../input/sacremoses > /dev/null

sys.path.insert(0, "../input/transformers/")
from transformers import *
logging.getLogger("transformers.tokenization_utils").setLevel(logging.ERROR)

#### 1. Read data and tokenizer

Read the tokenizer and data, and define the maximum sequence length that will be used for the input to Bert (the maximum is usually 512 tokens).

In [None]:
PATH = '../input/google-quest-challenge/'
MAX_SEQUENCE_LENGTH = 256
NUM_GENERATE = 20000
NUM_PRODUCE_PER_SAMPLE = 20

core = 'bert'
BERT_PATH = '../input/bert-base-uncased-huggingface-transformer/'
# config = BertConfig()
tokenizer = BertTokenizer.from_pretrained(BERT_PATH+'bert-base-uncased-vocab.txt')
# Model = TFBertModel

module_url = "../input/universalsentenceencoderlarge4/"
embed = hub.load(module_url)

df_train = pd.read_csv(PATH+'train.csv')
df_test = pd.read_csv(PATH+'test.csv')
df_sub = pd.read_csv(PATH+'sample_submission.csv')
print('train shape =', df_train.shape)
print('test shape =', df_test.shape)

output_categories = list(df_train.columns[11:])
input_categories = list(df_train.columns[[1,2,5]])
print('\noutput categories:\n\t', output_categories)
print('\ninput categories:\n\t', input_categories)

#### 2. Preprocessing functions

These are some functions that will be used to preprocess the raw text data into useable Bert inputs.<br>

*update 4:* credits to [Minh](https://www.kaggle.com/dathudeptrai) for this implementation. If I'm not mistaken, it could be used directly with other Huggingface transformers too! Note that due to the 2 x 512 input, it will require significantly more memory when finetuning BERT.
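
The padding step inside `return_id()` can be illustrated without a tokenizer. The following is a minimal sketch, assuming the token ids are already available; `pad_to_length` and the toy ids are hypothetical and only mirror the ids/masks/segments layout used in this kernel. Note that the naive slice used for truncation here would drop the trailing \[SEP\], which the real tokenizer's truncation avoids.

```python
def pad_to_length(token_ids, max_length, pad_id=0):
    """Truncate or right-pad token ids; return ids, attention mask, segment ids."""
    ids = list(token_ids[:max_length])
    mask = [1] * len(ids)        # 1 marks a real token
    segments = [0] * len(ids)    # single-segment input, so all zeros
    n_pad = max_length - len(ids)
    ids += [pad_id] * n_pad      # padded positions get the pad id ...
    mask += [0] * n_pad          # ... and are masked out of attention
    segments += [0] * n_pad
    return ids, mask, segments


# Illustrative ids only: 101 and 102 stand in for [CLS] and [SEP]
toy_ids = [101, 2129, 2024, 2017, 102]
ids, mask, segments = pad_to_length(toy_ids, max_length=8)
print(ids)
print(mask)
print(segments)
```

All three lists end up with the same fixed length, which is what lets them be stacked into the `np.int32` arrays returned by `compute_input_arrays()`.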

In [None]:
def _convert_to_transformer_inputs(title, question, answer, tokenizer, max_sequence_length):
    """Converts tokenized input to ids, masks and segments for transformer (including bert)"""
    
    def return_id(str1, str2, truncation_strategy, length):

        inputs = tokenizer.encode_plus(str1, str2,
            add_special_tokens=True,
            max_length=length,
            truncation_strategy=truncation_strategy)
        
        input_ids =  inputs["input_ids"]
        input_masks = [1] * len(input_ids)
        input_segments = inputs["token_type_ids"]
        padding_length = length - len(input_ids)
        padding_id = tokenizer.pad_token_id
        input_ids = input_ids + ([padding_id] * padding_length)
        input_masks = input_masks + ([0] * padding_length)
        input_segments = input_segments + ([0] * padding_length)
        
        return [input_ids, input_masks, input_segments]
    
    input_ids_q, input_masks_q, input_segments_q = return_id(
        title + ' ' + question, None, 'longest_first', max_sequence_length)
    
    input_ids_a, input_masks_a, input_segments_a = return_id(
        answer, None, 'longest_first', max_sequence_length)
    
    input_ids_qa, input_masks_qa, input_segments_qa = return_id(
        title + ' ' + question, answer, 'longest_first', max_sequence_length)
    
    return [input_ids_q, input_masks_q, input_segments_q,
            input_ids_a, input_masks_a, input_segments_a,
            input_ids_qa, input_masks_qa, input_segments_qa]


def compute_input_arrays(df, columns, tokenizer, max_sequence_length):
    input_ids_q, input_masks_q, input_segments_q = [], [], []
    input_ids_a, input_masks_a, input_segments_a = [], [], []
    input_ids_qa, input_masks_qa, input_segments_qa = [], [], []
    for _, instance in tqdm(df[columns].iterrows()):
        t, q, a = instance.question_title, instance.question_body, instance.answer

        inputs = _convert_to_transformer_inputs(t, q, a, tokenizer, max_sequence_length)
        ids_q, masks_q, segments_q = inputs[:3]
        ids_a, masks_a, segments_a = inputs[3:6]
        ids_qa, masks_qa, segments_qa = inputs[6:]
        
        input_ids_q.append(ids_q)
        input_masks_q.append(masks_q)
        input_segments_q.append(segments_q)

        input_ids_a.append(ids_a)
        input_masks_a.append(masks_a)
        input_segments_a.append(segments_a)
        
        input_ids_qa.append(ids_qa)
        input_masks_qa.append(masks_qa)
        input_segments_qa.append(segments_qa)
        
    return [np.asarray(input_ids_q, dtype=np.int32), 
            np.asarray(input_masks_q, dtype=np.int32), 
            np.asarray(input_segments_q, dtype=np.int32),
            np.asarray(input_ids_a, dtype=np.int32), 
            np.asarray(input_masks_a, dtype=np.int32), 
            np.asarray(input_segments_a, dtype=np.int32),
            np.asarray(input_ids_qa, dtype=np.int32), 
            np.asarray(input_masks_qa, dtype=np.int32), 
            np.asarray(input_segments_qa, dtype=np.int32)]

def compute_output_arrays(df, columns):
    return np.asarray(df[columns])

#### 2.1 Randomize inputs
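
The cell below tokenizes each text in full and then draws random fixed-length windows from sequences that are too long, both for training-time augmentation and for producing several views of each test sample. As a rough sketch of the cropping idea only (the helper `random_crop` and the toy ids are hypothetical, and the standard-library `random` module stands in for NumPy):

```python
import random


def random_crop(ids, max_length, rng):
    """Return a random contiguous window of at most max_length tokens."""
    if len(ids) <= max_length:
        return list(ids)
    # rng.randint is inclusive on both ends, so the last valid start is reachable
    start = rng.randint(0, len(ids) - max_length)
    return ids[start:start + max_length]


rng = random.Random(0)
long_ids = list(range(100, 120))  # 20 made-up token ids
crops = [random_crop(long_ids, 8, rng) for _ in range(5)]
for crop in crops:
    print(crop)
```

Each call can return a different window of the same text, so one long question or answer yields several distinct training examples.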

In [None]:
def _tokenize_raw(title, question, answer, tokenizer):

    def return_id(str1, str2):
        inputs = tokenizer.encode_plus(str1, str2, add_special_tokens=True)
        input_ids = inputs['input_ids']
        input_masks = [1]*len(input_ids)
        input_segments = inputs['token_type_ids']
        return [input_ids, input_masks, input_segments]

    input_ids_q, input_masks_q, input_segments_q = return_id(title + ' ' + question, None)
    input_ids_a, input_masks_a, input_segments_a = return_id(answer, None)
    return [input_ids_q, input_masks_q, input_segments_q,
            input_ids_a, input_masks_a, input_segments_a]


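def _crop_or_pad(ids, masks, segments, max_length, padding_id=0):
    """Right-pad short sequences; take a random window from long ones."""
    seq_length = len(ids)

    if seq_length <= max_length:
        padding_length = max_length - seq_length
        # build new lists so the caller's cached token lists are not mutated
        ids = ids + [padding_id] * padding_length
        masks = masks + [0] * padding_length
        segments = segments + [0] * padding_length
    else:
        # randint excludes the upper bound, so +1 makes the last window reachable
        # note: a window may not include the [CLS]/[SEP] special tokens
        i_start = np.random.randint(0, seq_length - max_length + 1)
        i_end = i_start + max_length
        ids = ids[i_start: i_end]
        masks = masks[i_start: i_end]
        segments = segments[i_start: i_end]

    return [ids, masks, segments]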


def augment_arrays(df, columns, tokenizer, max_sequence_length, 
                   num_generate=10000):
    num_samples = len(df)
    tmp_q, tmp_a = [], []
    # Full tokenization
    for _, instance in tqdm(df[columns].iterrows(), desc='processing raw sequences ...'):
        t, q, a = instance.question_title, instance.question_body, instance.answer
        inputs = _tokenize_raw(t, q, a, tokenizer) # get full tokens
        tmp_q.append(inputs[:3]) # [ids_q, masks_q, segments_q]
        tmp_a.append(inputs[3:]) # [ids_a, masks_a, segments_a]

    # Population by reproducing sample
    pad_id = tokenizer.pad_token_id
    sample_indexes = np.random.choice(np.arange(num_samples), size=num_generate)
    input_ids_q, input_masks_q, input_segments_q = [], [], []
    input_ids_a, input_masks_a, input_segments_a = [], [], []
    for i in tqdm(sample_indexes, desc='populating training data ...'):
        ids_q, masks_q, segments_q = _crop_or_pad(*tmp_q[i], max_sequence_length, pad_id)
        ids_a, masks_a, segments_a = _crop_or_pad(*tmp_a[i], max_sequence_length, pad_id)

        input_ids_q.append(ids_q)
        input_masks_q.append(masks_q)
        input_segments_q.append(segments_q)

        input_ids_a.append(ids_a)
        input_masks_a.append(masks_a)
        input_segments_a.append(segments_a)

    # Concatenation
    inputs = [np.asarray(input_ids_q, dtype=np.int32), 
              np.asarray(input_masks_q, dtype=np.int32), 
              np.asarray(input_segments_q, dtype=np.int32),
              np.asarray(input_ids_a, dtype=np.int32), 
              np.asarray(input_masks_a, dtype=np.int32), 
              np.asarray(input_segments_a, dtype=np.int32)]

    return inputs, sample_indexes


def compute_test_arrays(df_test, columns, tokenizer, 
                        max_sequence_length, n_produce_per_sample=10):
    tmp_q, tmp_a = [], []
    # Full tokenization
    for _, instance in tqdm(df_test[columns].iterrows(), desc='processing raw sequences ...'):
        t, q, a = instance.question_title, instance.question_body, instance.answer
        inputs = _tokenize_raw(t, q, a, tokenizer) # get full tokens
        tmp_q.append(inputs[:3]) # [ids_q, masks_q, segments_q]
        tmp_a.append(inputs[3:]) # [ids_a, masks_a, segments_a]

    pad_id = tokenizer.pad_token_id  # same padding id as in augment_arrays()
    test_indexes = []
    input_ids_q, input_masks_q, input_segments_q = [], [], []
    input_ids_a, input_masks_a, input_segments_a = [], [], []
    for i, (raw_q, raw_a) in enumerate(tqdm(zip(tmp_q, tmp_a), desc='populating test data ...')):
        for _ in range(n_produce_per_sample):
            ids_q, masks_q, segments_q = _crop_or_pad(*raw_q, max_sequence_length, pad_id)
            ids_a, masks_a, segments_a = _crop_or_pad(*raw_a, max_sequence_length, pad_id)

            input_ids_q.append(ids_q)
            input_masks_q.append(masks_q)
            input_segments_q.append(segments_q)

            input_ids_a.append(ids_a)
            input_masks_a.append(masks_a)
            input_segments_a.append(segments_a)

            test_indexes.append(i)

    # Concatenation
    inputs = [np.asarray(input_ids_q, dtype=np.int32), 
              np.asarray(input_masks_q, dtype=np.int32), 
              np.asarray(input_segments_q, dtype=np.int32),
              np.asarray(input_ids_a, dtype=np.int32), 
              np.asarray(input_masks_a, dtype=np.int32), 
              np.asarray(input_segments_a, dtype=np.int32)]
    test_indexes = np.asarray(test_indexes, dtype=np.int32)

    return inputs, test_indexes

#### 2.2 Additional inputs
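
The count features below are simple text statistics: word counts, sentence counts and words per sentence. A small self-contained sketch of the words-per-sentence idea follows; `words_per_sentence` is a hypothetical stand-in for `count_wps()`, and returning 0.0 for empty text is an assumption made here rather than the kernel's behavior.

```python
import re


def words_per_sentence(text):
    """Mean number of whitespace-separated words per non-empty sentence."""
    sentences = [s for s in re.split(r'[.?]+\s*', text) if s]
    if not sentences:
        return 0.0  # assumption: empty text maps to 0 rather than NaN
    return sum(len(s.split()) for s in sentences) / len(sentences)


sample = "How do I pad a sequence? Use the pad token. It works."
print(len(sample.split()))         # number of words
print(words_per_sentence(sample))  # mean words per sentence
```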

In [None]:
# Count feature functions
def split_sentence(sentence):
    return re.split(r'[.?]+\s*', sentence)  # raw string avoids invalid escape warnings

def count_wps(sentence):
    sentence_list = split_sentence(sentence)
    w_per_s = np.mean([len(sentence.split()) for sentence in sentence_list
                       if len(sentence) > 0])
    return w_per_s

def build_count_features(df):
    count_features = OrderedDict()

    count_features['n_title_words'] = df.question_title.apply(lambda x: len(x.split()))
    count_features['n_body_words'] = df.question_body.apply(lambda x: len(x.split()))
    count_features['n_answer_words'] = df.answer.apply(lambda x: len(x.split()))

    count_features['n_body_sentences'] = df.question_body.apply(lambda x: len(split_sentence(x)))
    count_features['n_answer_sentences'] = df.answer.apply(lambda x: len(split_sentence(x)))

    count_features['n_title_wps'] = df.question_title.apply(count_wps)
    count_features['n_body_wps'] = df.question_body.apply(count_wps)
    count_features['n_answer_wps'] = df.answer.apply(count_wps)

    df_count_features = pd.DataFrame(count_features)
    return df_count_features


# Host feature function
def build_host_features(df, df_host_count, popular_host_thr):
    def _convert(x):
        if x not in df_host_count.index:
            return 'other'
        if df_host_count[x] > popular_host_thr:
            return x
        else:
            return 'other'
    return df.host.apply(_convert)

def build_features(df, df_host_count, popular_host_thr):
    # pass the threshold through instead of hard-coding it
    return pd.concat(
        [build_count_features(df),
         build_host_features(df, df_host_count, popular_host_thr)],
        axis=1)

# universal sentence encoder feature
def build_use_feature(df, columns):
    qa_embeds = []
    for _, instance in tqdm(df[columns].iterrows(), desc='Calculating USE feature ...'):
        t, q, a = instance.question_title, instance.question_body, instance.answer
        q_embed, a_embed = embed([t+' '+q, a])['outputs'].numpy()
        qa_embed = np.concatenate([q_embed, a_embed], axis=-1)
        qa_embeds.append(qa_embed)

    return np.asarray(qa_embeds, dtype=np.float32)

#### 3. Create model

`compute_spearmanr()` is used to compute the competition metric for the validation set
<br><br>
`create_model()` contains the actual architecture that will be used to finetune BERT to our dataset.
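
For reference, the metric is the mean of the per-column Spearman rank correlations, skipping columns where the correlation is undefined. The sketch below re-implements that idea in plain Python so the steps are visible; all function names are hypothetical, and the kernel itself relies on `scipy.stats.spearmanr` and `np.nanmean`.

```python
def average_ranks(values):
    """1-based ranks, with ties sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    if sx == 0 or sy == 0:
        return float('nan')  # constant column: correlation is undefined
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman rho is the Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def mean_columnwise_spearman(trues, preds):
    """Mean Spearman rho over columns, ignoring undefined (NaN) columns."""
    rhos = [spearman(t, p) for t, p in zip(zip(*trues), zip(*preds))]
    valid = [r for r in rhos if r == r]  # NaN is the only value not equal to itself
    return sum(valid) / len(valid) if valid else float('nan')


trues = [[0.0, 1.0], [0.5, 0.5], [1.0, 0.0]]
preds = [[0.1, 0.2], [0.4, 0.6], [0.8, 0.9]]
score = mean_columnwise_spearman(trues, preds)
print(score)
```

In the toy data the first column is ranked perfectly and the second is ranked in reverse, so the two correlations cancel out in the mean.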


In [None]:
def compute_spearmanr(trues, preds):
    rhos = []
    for tcol, pcol in zip(np.transpose(trues), np.transpose(preds)):
        rhos.append(spearmanr(tcol, pcol).correlation)
    return np.nanmean(rhos)


class CustomCallback(tf.keras.callbacks.Callback):
    
    def __init__(self, valid_data, batch_size, fold=None):

        self.valid_inputs = valid_data[0]
        self.valid_outputs = valid_data[1]
        
        self.batch_size = batch_size
        self.fold = fold
        self.best_rho = -1
        self.best_weights = None
        self.best_epoch = 0
        
    def on_train_begin(self, logs={}):
        self.valid_predictions = []
        
    def on_epoch_end(self, epoch, logs={}):
        rho_val = compute_spearmanr(
            self.valid_outputs,
            self.model.predict(self.valid_inputs, batch_size=self.batch_size))

        print("\nvalidation rho: %.4f" % rho_val)

        if rho_val > self.best_rho:
            self.best_rho = rho_val
            self.best_weights = self.model.get_weights()
            self.best_epoch = epoch+1
        else:
            self.model.stop_training = True
            self.model.set_weights(self.best_weights)
            self.model.save_weights(f'bert-base-{self.fold+1}fold-{self.best_epoch}epoch.h5')
            print(f'Training stopped with rho: {self.best_rho}')
            
            
def aggregate_values(values, indexes):
    df = pd.DataFrame(values, index=indexes)
    aggregated = df.groupby(level=0).mean().values
    return aggregated


class FrequentCallback(tf.keras.callbacks.Callback):
    def __init__(self, valid_data, valid_indexes,
                 batch_size, validation_step, 
                 patients=2, fold=None):

        self.valid_inputs = valid_data[0]
        self.valid_outputs = valid_data[1]
        self.valid_indexes = valid_indexes
        
        self.batch_size = batch_size
        self.validation_step = validation_step
        self.patients = patients
        self.wait = 0
        self.stop_training = False
        self.fold = fold

        self.current_epoch = 1
        self.best_rho = -1
        self.best_weights = None
        self.best_batch = 0
        self.best_epoch = 0
        
    def on_train_begin(self, logs={}):
        self.valid_predictions = []

    def on_train_batch_end(self, batch, logs={}):
        if (batch+1)%self.validation_step == 0:
            preds = self.model.predict(self.valid_inputs, batch_size=self.batch_size)

            preds_agg = aggregate_values(preds, self.valid_indexes)
            outs_agg = aggregate_values(self.valid_outputs, self.valid_indexes)
            rho_val = compute_spearmanr(outs_agg, preds_agg)
            
            print(f'\n Validation rho at {batch+1}batch: {rho_val}')
            
            if rho_val > self.best_rho:
                self.best_rho = rho_val
                self.best_weights = self.model.get_weights()
                self.best_batch = batch+1
                self.best_epoch = self.current_epoch
                self.wait = 0
            else:
                self.wait += 1
                if self.wait >= self.patients:
                    self.model.stop_training = True
                    print('\tValidation rho stopped improving; stopping early')

    def on_epoch_end(self, epoch, logs={}):
        self.current_epoch += 1

    def on_train_end(self, logs={}):
        print(f'\n***** Best validation rho: {self.best_rho} at {self.best_epoch}epoch, {self.best_batch}batch. *****')
        self.model.set_weights(self.best_weights)
        filename = f'bert-base-{self.fold+1}fold-{self.best_batch}batch-{self.best_epoch}epoch.h5'
        self.model.save_weights(filename)

In [None]:
# (title+body, answer, title+body+answer)x3 + count-feature
def create_model_10inputs():
    q_id = tf.keras.layers.Input((MAX_SEQUENCE_LENGTH,), dtype=tf.int32)
    a_id = tf.keras.layers.Input((MAX_SEQUENCE_LENGTH,), dtype=tf.int32)
    qa_id = tf.keras.layers.Input((MAX_SEQUENCE_LENGTH,), dtype=tf.int32)
    
    q_mask = tf.keras.layers.Input((MAX_SEQUENCE_LENGTH,), dtype=tf.int32)
    a_mask = tf.keras.layers.Input((MAX_SEQUENCE_LENGTH,), dtype=tf.int32)
    qa_mask = tf.keras.layers.Input((MAX_SEQUENCE_LENGTH,), dtype=tf.int32)
    
    q_atn = tf.keras.layers.Input((MAX_SEQUENCE_LENGTH,), dtype=tf.int32)
    a_atn = tf.keras.layers.Input((MAX_SEQUENCE_LENGTH,), dtype=tf.int32)
    qa_atn = tf.keras.layers.Input((MAX_SEQUENCE_LENGTH,), dtype=tf.int32)
                                   
    feat_count = tf.keras.layers.Input((8, ), dtype=tf.float32)
    
    config = BertConfig() # print(config) to see settings
    config.output_hidden_states = False # Set to True to obtain hidden states
    # caution: when using e.g. XLNet, XLNetConfig() will automatically use xlnet-large config
    
    # normally ".from_pretrained('bert-base-uncased')", but because of no internet, the 
    # pretrained model has been downloaded manually and uploaded to kaggle. 
    # bert_model = TFBertForSequenceClassification.from_pretrained(
    #     BERT_PATH+'bert-base-uncased-tf_model.h5', config=config)
    bert_model = TFBertModel.from_pretrained(BERT_PATH+'bert-base-uncased-tf_model.h5',
                                             config=config)
    
    # will only use the transformer ("bert") from TFBertForSequenceClassification
    # if config.output_hidden_states = True, obtain hidden states via .bert(...)[-1]
    # q_embedding = bert_model.bert(q_id, attention_mask=q_mask, token_type_ids=q_atn)[0]
    # a_embedding = bert_model.bert(a_id, attention_mask=a_mask, token_type_ids=a_atn)[0]
    q_embedding = bert_model(q_id, attention_mask=q_mask, token_type_ids=q_atn)[0]
    a_embedding = bert_model(a_id, attention_mask=a_mask, token_type_ids=a_atn)[0]
    qa_embedding = bert_model(qa_id, attention_mask=qa_mask, token_type_ids=qa_atn)[0]
    
    q = tf.keras.layers.GlobalAveragePooling1D()(q_embedding)
    a = tf.keras.layers.GlobalAveragePooling1D()(a_embedding)
    qa = tf.keras.layers.GlobalAveragePooling1D()(qa_embedding)
    
    x = tf.keras.layers.Concatenate()([q, a, qa])
    x = tf.keras.layers.Concatenate()([x, feat_count]) # pre-dropout
    
    x = tf.keras.layers.Dropout(0.2)(x)
    
    x = tf.keras.layers.Dense(30, activation='sigmoid')(x)

    model = tf.keras.models.Model(inputs=[q_id, q_mask, q_atn,
                                          a_id, a_mask, a_atn,
                                          qa_id, qa_mask, qa_atn,
                                          feat_count], 
                                  outputs=x)
    return model, bert_model

In [None]:
# Original (title+question, answer)x3 model
def create_model_6inputs():
    q_id = tf.keras.layers.Input((MAX_SEQUENCE_LENGTH,), dtype=tf.int32)
    a_id = tf.keras.layers.Input((MAX_SEQUENCE_LENGTH,), dtype=tf.int32)
    
    q_mask = tf.keras.layers.Input((MAX_SEQUENCE_LENGTH,), dtype=tf.int32)
    a_mask = tf.keras.layers.Input((MAX_SEQUENCE_LENGTH,), dtype=tf.int32)
    
    q_atn = tf.keras.layers.Input((MAX_SEQUENCE_LENGTH,), dtype=tf.int32)
    a_atn = tf.keras.layers.Input((MAX_SEQUENCE_LENGTH,), dtype=tf.int32)
    
    config = BertConfig() # print(config) to see settings
    config.output_hidden_states = False # Set to True to obtain hidden states
    # caution: when using e.g. XLNet, XLNetConfig() will automatically use xlnet-large config
    
    # normally ".from_pretrained('bert-base-uncased')", but because of no internet, the 
    # pretrained model has been downloaded manually and uploaded to kaggle. 
    bert_model = TFBertModel.from_pretrained(BERT_PATH+'bert-base-uncased-tf_model.h5',
                                             config=config)
    
    # will only use the transformer ("bert") from TFBertForSequenceClassification
    # if config.output_hidden_states = True, obtain hidden states via .bert(...)[-1]
    q_embedding = bert_model(q_id, attention_mask=q_mask, token_type_ids=q_atn)[0]
    a_embedding = bert_model(a_id, attention_mask=a_mask, token_type_ids=a_atn)[0]
    
    q = tf.keras.layers.GlobalAveragePooling1D()(q_embedding)
    a = tf.keras.layers.GlobalAveragePooling1D()(a_embedding)
    
    x = tf.keras.layers.Concatenate()([q, a])
    x = tf.keras.layers.Dropout(0.2)(x)
    x = tf.keras.layers.Dense(30, activation='sigmoid')(x)
    model = tf.keras.models.Model(inputs=[q_id, q_mask, q_atn,
                                          a_id, a_mask, a_atn,], 
                                  outputs=x)
    return model, bert_model

In [None]:
# (title+question, answer)x3 + count-feature + USE-feature
def create_model_8inputs():
    q_id = tf.keras.layers.Input((MAX_SEQUENCE_LENGTH,), dtype=tf.int32)
    a_id = tf.keras.layers.Input((MAX_SEQUENCE_LENGTH,), dtype=tf.int32)
    
    q_mask = tf.keras.layers.Input((MAX_SEQUENCE_LENGTH,), dtype=tf.int32)
    a_mask = tf.keras.layers.Input((MAX_SEQUENCE_LENGTH,), dtype=tf.int32)
    
    q_atn = tf.keras.layers.Input((MAX_SEQUENCE_LENGTH,), dtype=tf.int32)
    a_atn = tf.keras.layers.Input((MAX_SEQUENCE_LENGTH,), dtype=tf.int32)

    feat_count = tf.keras.layers.Input((8, ), dtype=tf.float32)

    qa_use = tf.keras.layers.Input((512*2, ), dtype=tf.float32)
    
    config = BertConfig() # print(config) to see settings
    config.output_hidden_states = False # Set to True to obtain hidden states
    # caution: when using e.g. XLNet, XLNetConfig() will automatically use xlnet-large config
    
    # normally ".from_pretrained('bert-base-uncased')", but because of no internet, the 
    # pretrained model has been downloaded manually and uploaded to kaggle. 
    # bert_model = TFBertForSequenceClassification.from_pretrained(
    #     BERT_PATH+'bert-base-uncased-tf_model.h5', config=config)
    bert_model = TFBertModel.from_pretrained(BERT_PATH+'bert-base-uncased-tf_model.h5', 
                                             config=config)
    
    # will only use the transformer ("bert") from TFBertForSequenceClassification
    # if config.output_hidden_states = True, obtain hidden states via .bert(...)[-1]
    q_embedding = bert_model(q_id, attention_mask=q_mask, token_type_ids=q_atn)[0]
    a_embedding = bert_model(a_id, attention_mask=a_mask, token_type_ids=a_atn)[0]
    
    q = tf.keras.layers.GlobalAveragePooling1D()(q_embedding)
    a = tf.keras.layers.GlobalAveragePooling1D()(a_embedding)
    
    x = tf.keras.layers.Concatenate()([q, a, feat_count, qa_use]) # (None, 768x2+8+512x2)
    x = tf.keras.layers.Dropout(0.2)(x)
    x = tf.keras.layers.Dense(30, activation='sigmoid')(x)

    model = tf.keras.models.Model(inputs=[q_id, q_mask, q_atn,
                                          a_id, a_mask, a_atn,
                                          feat_count, qa_use], 
                                  outputs=x)
    
    return model, bert_model

#### 4. Obtain inputs and targets, as well as the indices of the train/validation splits
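
The count features are log-transformed with `log1p` and then standardized, with the scaler fitted on the training set only and reused for the test set. Below is a minimal plain-Python sketch of that order of operations; `fit_standardizer`, `transform` and the toy counts are hypothetical, and the kernel itself uses scikit-learn's `StandardScaler`.

```python
import math


def fit_standardizer(rows):
    """Per-column mean and population std, computed on the training rows only."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    stds = []
    for c, m in zip(cols, means):
        std = (sum((v - m) ** 2 for v in c) / len(c)) ** 0.5
        stds.append(std if std > 0 else 1.0)  # avoid dividing by zero
    return means, stds


def transform(rows, means, stds):
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]


# Made-up (title words, body words) counts
train_counts = [[3, 120], [7, 40], [12, 400]]
test_counts = [[5, 80]]

train_log = [[math.log1p(v) for v in row] for row in train_counts]
test_log = [[math.log1p(v) for v in row] for row in test_counts]

means, stds = fit_standardizer(train_log)          # fit on train only
train_scaled = transform(train_log, means, stds)
test_scaled = transform(test_log, means, stds)     # reuse the train statistics
print(train_scaled)
print(test_scaled)
```

Fitting on the training rows alone keeps the test features on the same scale the model saw during training.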

In [None]:
# Build count-based features
inputs_count = build_count_features(df_train).apply(np.log1p)
test_inputs_count = build_count_features(df_test).apply(np.log1p)

# Standardize
ss = StandardScaler()
inputs_count = ss.fit_transform(inputs_count)
test_inputs_count = ss.transform(test_inputs_count)

inputs_count = np.asarray(inputs_count, dtype=np.float32)
test_inputs_count = np.asarray(test_inputs_count, dtype=np.float32)
print(f'\nCount-based feature shape (train): {inputs_count.shape}')
print(f'\nCount-based feature shape (test): {test_inputs_count.shape}')

# Build USE feature
inputs_use = build_use_feature(df_train, input_categories)
test_inputs_use = build_use_feature(df_test, input_categories)
print(f'\nUSE feature shape (train): {inputs_use.shape}')
print(f'\nUSE feature shape (test): {test_inputs_use.shape}')

In [None]:
# outputs = compute_output_arrays(df_train, output_categories)
# inputs = compute_input_arrays(df_train, input_categories, tokenizer, MAX_SEQUENCE_LENGTH)
# test_inputs = compute_input_arrays(df_test, input_categories, tokenizer, MAX_SEQUENCE_LENGTH)

In [None]:
# '''
inputs, sample_indexes = augment_arrays(df_train, input_categories, tokenizer,
                                        MAX_SEQUENCE_LENGTH, NUM_GENERATE)
inputs.append(inputs_count[sample_indexes])
inputs.append(inputs_use[sample_indexes])

outputs = compute_output_arrays(df_train, output_categories)
outputs = outputs[sample_indexes]

test_inputs, test_indexes = compute_test_arrays(df_test, input_categories, tokenizer,
                                                MAX_SEQUENCE_LENGTH, NUM_PRODUCE_PER_SAMPLE)
test_inputs.append(test_inputs_count[test_indexes])
test_inputs.append(test_inputs_use[test_indexes])

# check input shape 
for i, (inp, test_inp) in enumerate(zip(inputs, test_inputs)):
    print(f'input-{i+1} shape: train: {inp.shape}, test:{test_inp.shape}')

print(f'# of unique train samples: {len(set(sample_indexes))}')
print(f'# of unique test samples: {len(set(test_indexes))}')
# '''

#### 5. Training, validation and testing

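Loops over the folds in `gkf` and trains each fold for up to 3 epochs, with a learning rate of 3e-5 and a batch size of 8. Validation rho is checked 5 times per epoch, and training stops early once it fails to improve for 3 checks in a row. Binary cross-entropy is used as the loss function.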
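
Because every original row is cropped many times, the folds are grouped by the original row index so that crops of the same text never land in both the train and validation parts. A simplified sketch of such a group-aware split is shown here; `group_folds` is a hypothetical greedy stand-in and not scikit-learn's `GroupKFold` implementation.

```python
from collections import defaultdict


def group_folds(groups, n_splits):
    """Yield (train_idx, valid_idx); each group is validated in exactly one fold."""
    members = defaultdict(list)
    for idx, g in enumerate(groups):
        members[g].append(idx)
    # Greedy balancing: place the largest groups first into the smallest fold
    fold_items = [[] for _ in range(n_splits)]
    for g in sorted(members, key=lambda key: -len(members[key])):
        smallest = min(range(n_splits), key=lambda f: len(fold_items[f]))
        fold_items[smallest].extend(members[g])
    for f in range(n_splits):
        valid = sorted(fold_items[f])
        train = sorted(i for k in range(n_splits) if k != f for i in fold_items[k])
        yield train, valid


# Original row index of each augmented sample (made-up values)
groups = [0, 0, 1, 2, 2, 2, 3, 1]
folds = list(group_folds(groups, n_splits=2))
for train, valid in folds:
    overlap = {groups[i] for i in train} & {groups[i] for i in valid}
    print(train, valid, 'overlap:', len(overlap))
```

The printed overlap is the same sanity check that the training loop performs with `unique_train_idx & unique_valid_idx`.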

In [None]:
n_splits = 10
epochs = 3
batch_size = 8
patients = 3
learning_rate = 3e-5
n_validate_per_epoch = 5

# gkf = GroupKFold(n_splits=n_splits).split(X=df_train.question_body, groups=df_train.question_body)
gkf = GroupKFold(n_splits=n_splits).split(X=sample_indexes, groups=sample_indexes)

print(f'Consume {len(inputs)} inputs')

model, _ = create_model_8inputs()
optimizer = tf.keras.optimizers.Adam(learning_rate)
model.compile(loss='binary_crossentropy', optimizer=optimizer)
init_weights = model.get_weights()

test_preds = []
for fold, (train_idx, valid_idx) in enumerate(gkf):
    model.set_weights(init_weights)
    # all folds are trained here; to stay under the 2h limit, restrict the
    # condition below to a subset of folds, e.g. `if fold in [0]:`
    if True:
        train_inputs = [inputs[i][train_idx] for i in range(len(inputs))]
        train_outputs = outputs[train_idx]
        valid_inputs = [inputs[i][valid_idx] for i in range(len(inputs))]
        valid_outputs = outputs[valid_idx]
        unique_train_idx = set(sample_indexes[train_idx])
        unique_valid_idx = set(sample_indexes[valid_idx])
        overlap_idx = unique_train_idx&unique_valid_idx
        print(f'# of unique train samples: {len(unique_train_idx)}',
              f'# of unique valid samples: {len(unique_valid_idx)}',
              f'# of overlap samples: {len(overlap_idx)}', sep='\n')

        callback = FrequentCallback(
            valid_data=(valid_inputs, valid_outputs),
            # group crops by their original row so aggregate_values() averages them
            valid_indexes=sample_indexes[valid_idx],
            batch_size=batch_size,
            # at least 1, so the modulo check in the callback never divides by zero
            validation_step=max(1, len(train_idx) // batch_size // n_validate_per_epoch),
            patients=patients,
            fold=fold
        )
        model.fit(train_inputs, train_outputs, 
                  epochs=epochs,
                  batch_size=batch_size,
                  callbacks=[callback])
        
        t_preds = model.predict(test_inputs, batch_size=batch_size)
        t_preds_agg = aggregate_values(t_preds, test_indexes)
        test_preds.append(t_preds_agg)
        print(f'Test prediction for fold {fold+1} completed.\n')

#### 6. Process and submit test predictions

Average fold predictions, then save as `submission.csv`
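
Two averaging steps are involved: the predictions of all random crops of a test row are averaged per fold, and the per-fold results are then averaged again. The sketch below shows both steps on toy numbers in plain Python; `mean_by_index` and `average_folds` are hypothetical helpers that mirror what `aggregate_values()` and `np.average` do in this kernel.

```python
def mean_by_index(rows, indexes):
    """Average prediction rows that share the same original sample index."""
    sums, counts = {}, {}
    for row, idx in zip(rows, indexes):
        if idx not in sums:
            sums[idx] = [0.0] * len(row)
            counts[idx] = 0
        sums[idx] = [s + v for s, v in zip(sums[idx], row)]
        counts[idx] += 1
    return [[s / counts[idx] for s in sums[idx]] for idx in sorted(sums)]


def average_folds(fold_preds, weights=None):
    """Element-wise (optionally weighted) mean over a list of fold predictions."""
    if weights is None:
        weights = [1.0] * len(fold_preds)
    total = sum(weights)
    n_rows, n_cols = len(fold_preds[0]), len(fold_preds[0][0])
    return [[sum(w * p[r][c] for w, p in zip(weights, fold_preds)) / total
             for c in range(n_cols)] for r in range(n_rows)]


# Three crops: two from test row 0, one from test row 1 (made-up predictions)
crop_preds = [[0.2, 0.8], [0.4, 0.6], [0.9, 0.1]]
crop_index = [0, 0, 1]
fold1 = mean_by_index(crop_preds, crop_index)
fold2 = [[0.5, 0.5], [0.7, 0.3]]  # pretend output of a second fold
final = average_folds([fold1, fold2])
print(final)
```

Passing `weights` gives a weighted average, for example to favor folds with a higher validation rho.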

In [None]:
df_sub.iloc[:, 1:] = np.average(test_preds, axis=0) # for weighted average set weights=[...]

df_sub.to_csv('submission.csv', index=False)