The data for this competition includes questions and answers from various StackExchange properties. Your task is to predict target values of 30 labels for each question-answer pair.

The list of 30 target labels is the same as the column names in the sample_submission.csv file. Target labels with the prefix question_ relate to the question_title and/or question_body features in the data. Target labels with the prefix answer_ relate to the answer feature.
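
As a small illustration of the prefix convention described above, the sketch below splits a list of label names into question and answer groups. The four names are only an illustrative subset chosen for this example; the full list is assumed to come from the header of the submission file.

```python
# Illustrative subset of label names (an assumption for this sketch);
# the full list of 30 comes from the submission file header.
label_names = [
    "question_asker_intent_understanding",
    "question_well_written",
    "answer_helpful",
    "answer_well_written",
]

# Group the labels by their prefix.
question_labels = [c for c in label_names if c.startswith("question_")]
answer_labels = [c for c in label_names if c.startswith("answer_")]

print(len(question_labels), "question labels:", question_labels)
print(len(answer_labels), "answer labels:", answer_labels)
```

Selecting columns by prefix is one alternative to slicing by position, and it does not depend on the column order.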

Each row contains a single question and a single answer to that question, along with additional features. The training data contains rows with some duplicated questions (but with different answers). The test data does not contain any duplicated questions.
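
A minimal sketch of how repeated questions could be counted with pandas. The toy frame and its contents are invented for illustration; only the column names question_title, question_body and answer are taken from the description above.

```python
import pandas as pd

# Toy stand-in for the training data (contents are made up).
toy_train = pd.DataFrame({
    "question_title": ["How to pad?", "How to pad?", "What is BERT?"],
    "question_body": ["I have short inputs.", "I have short inputs.", "Please explain."],
    "answer": ["Append zeros.", "Use a mask.", "A Transformer encoder."],
})

question_cols = ["question_title", "question_body"]

# keep=False marks every row whose question also appears in another row.
is_repeated = toy_train.duplicated(subset=question_cols, keep=False)
n_repeated_rows = int(is_repeated.sum())
n_unique_questions = len(toy_train.drop_duplicates(subset=question_cols))

print("rows sharing a question:", n_repeated_rows)
print("unique questions:", n_unique_questions)
```

If a validation split is used, keeping all rows of one question on the same side of the split avoids scoring the model on questions it has already seen.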

This is not a binary prediction challenge. Target labels are aggregated from multiple raters, and can have continuous values in the range [0,1]. Therefore, predictions must also be in that range.
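
The models later in this notebook end in a sigmoid layer, so their outputs already fall in [0,1]. For a model without such a layer, one possible safeguard is to clip the predictions before writing the submission. The sketch below uses made-up numbers and hypothetical variable names.

```python
import numpy as np

# Hypothetical raw predictions for 2 rows and 3 labels (values are made up).
raw_predictions = np.array([[-0.05, 0.40, 1.20],
                            [0.00, 0.99, 0.50]])

# Clip every value into the closed interval [0, 1].
clipped = np.clip(raw_predictions, 0.0, 1.0)

# Sanity check before writing a submission file.
assert clipped.min() >= 0.0 and clipped.max() <= 1.0
print(clipped)
```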

Since this is a synchronous re-run competition, you only have access to the Public test set. For planning purposes, the re-run test set is no larger than 10,000 rows, and less than 8 MB uncompressed.

In [None]:
import numpy as np 
import pandas as pd

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

## 1. Load data

In [None]:
df_train = pd.read_csv('../input/google-quest-challenge/train.csv')
df_test = pd.read_csv('../input/google-quest-challenge/test.csv')
df_sub = pd.read_csv('../input/google-quest-challenge/sample_submission.csv')

input_columns = list(df_train.columns[[1,2,5]])
output_columns = list(df_train.columns[11:])

print('train shape =', df_train.shape)
print('test shape =', df_test.shape)

# show columns
print('\n ', len(input_columns),'  inputs :', *[f'\n\t{x}' for x in input_columns])
print('\n ', len(output_columns),' outputs:', *[f'\n\t{x}' for x in output_columns])

## 2. Load BERT

BERT requires specifically formatted inputs. For each tokenized input sentence, we need to create:

* **input ids**: a sequence of integers mapping each input token to its index in the BERT tokenizer vocabulary
* **segment mask**: (optional) a sequence of 1s and 0s used to identify whether the input is one sentence or two sentences long. For one-sentence inputs, this is simply a sequence of 0s. For two-sentence inputs, there is a 0 for each token of the first sentence, followed by a 1 for each token of the second sentence
* **attention mask**: (optional) a sequence of 1s and 0s, with 1s for all input tokens and 0s for all padding tokens (padding is described in section 3)
* **labels**: a single value between 0 and 1
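
The three input sequences listed above can be sketched by hand for a tiny two-sentence input. The vocabulary and maximum length below are hypothetical and chosen only to keep the example readable; real ids come from the BERT vocab file.

```python
# Hypothetical mini-vocabulary (an assumption for this sketch).
toy_vocab = {"[PAD]": 0, "[CLS]": 1, "[SEP]": 2, "how": 3, "are": 4, "you": 5, "fine": 6}
toy_max_len = 10

tokens = ["[CLS]", "how", "are", "you", "[SEP]", "fine", "[SEP]"]
n_pad = toy_max_len - len(tokens)

# input ids: vocabulary index of each token, then zeros for padding
toy_input_ids = [toy_vocab[t] for t in tokens] + [0] * n_pad

# attention mask: 1 for real tokens, 0 for padding
toy_attention_mask = [1] * len(tokens) + [0] * n_pad

# segment mask: 0 up to and including the first [SEP], 1 afterwards, 0 for padding
first_sep = tokens.index("[SEP]")
toy_segment_mask = [0] * (first_sep + 1) + [1] * (len(tokens) - first_sep - 1) + [0] * n_pad

print(toy_input_ids)
print(toy_attention_mask)
print(toy_segment_mask)
```

All three lists have the same length, which is what lets them be stacked into fixed-size arrays.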

The model setup below follows the instructions from TensorFlow Hub [here](https://tfhub.dev/tensorflow/bert_en_cased_L-12_H-768_A-12/1).

In [None]:
import tensorflow as tf
import tensorflow_hub as hub

max_seq_length = 512 # Your choice here.

input_word_ids = tf.keras.layers.Input(shape=(max_seq_length,), dtype=tf.int32, name="input_word_ids")
input_mask = tf.keras.layers.Input(shape=(max_seq_length,), dtype=tf.int32, name="input_mask")
segment_ids = tf.keras.layers.Input(shape=(max_seq_length,), dtype=tf.int32, name="segment_ids")

# TF Hub model uses L=12 hidden layers (i.e., Transformer blocks), 
# a hidden size of H=768, and A=12 attention heads
# Inputs have been "cased", meaning that the distinction between 
# lower and upper case as well as accent markers have been preserved
bert_layer = hub.KerasLayer("/kaggle/input/bert-en-cased-l12-h768-a12-1/bert-en-cased-l12-h768-a12-v1")

pooled_output, sequence_output = bert_layer([input_word_ids, input_mask, segment_ids])

Import the tokenizer, using the original vocab file.

In [None]:
import bert_tokenization as tokenization

vocab_file = bert_layer.resolved_object.vocab_file.asset_path.numpy()
do_lower_case = bert_layer.resolved_object.do_lower_case.numpy()
tokenizer = tokenization.FullTokenizer(vocab_file, do_lower_case)

## 3. Test BERT embedding generator model

Generate segments and masks following the original BERT implementation.

In [None]:
# See BERT paper: https://arxiv.org/pdf/1810.04805.pdf
# And BERT implementation convert_single_example() at https://github.com/google-research/bert/blob/master/run_classifier.py

def get_masks(tokens, max_seq_length):
    """Mask for padding"""
    if len(tokens)>max_seq_length:
        raise IndexError("Token length more than max seq length!")
    return [1]*len(tokens) + [0] * (max_seq_length - len(tokens))


def get_segments(tokens, max_seq_length):
    """Segments: 0 for the first sequence, 1 for the second"""
    if len(tokens)>max_seq_length:
        raise IndexError("Token length more than max seq length!")
    segments = []
    current_segment_id = 0
    for token in tokens:
        segments.append(current_segment_id)
        if token == "[SEP]":
            current_segment_id = 1
    return segments + [0] * (max_seq_length - len(tokens))


def get_ids(tokens, tokenizer, max_seq_length):
    """Token ids from Tokenizer vocab"""
    token_ids = tokenizer.convert_tokens_to_ids(tokens)
    input_ids = token_ids + [0] * (max_seq_length-len(token_ids))
    return input_ids

# And BERT implementation truncate_seq_pair() at https://github.com/google-research/bert/blob/master/run_classifier.py
def _truncate_seq_pair(tokens_a, tokens_b, max_length):
    """Truncates a sequence pair in place to the maximum length."""
    # This is a simple heuristic which will always truncate the longer sequence
    # one token at a time. This makes more sense than truncating an equal percent
    # of tokens from each, since if one sequence is very short then each token
    # that's truncated likely contains more information than a longer sequence.
    while True:
        total_length = len(tokens_a) + len(tokens_b)
        if total_length <= max_length:
            break
        if len(tokens_a) > len(tokens_b):
            tokens_a.pop()
        else:
            tokens_b.pop()
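
To see the pair-truncation heuristic in action, here is a compact, self-contained sketch that is intended to behave like the function above. The name truncate_pair and the toy token lists are hypothetical.

```python
def truncate_pair(tokens_a, tokens_b, max_length):
    """Pop tokens from the longer list, in place, until the pair fits in max_length."""
    while len(tokens_a) + len(tokens_b) > max_length:
        # On a tie the second list is shortened, as in the heuristic above.
        longer = tokens_a if len(tokens_a) > len(tokens_b) else tokens_b
        longer.pop()

title = ["a1", "a2"]
body = ["b1", "b2", "b3", "b4", "b5", "b6"]
truncate_pair(title, body, max_length=5)

print(title)
print(body)
```

The short title is left untouched while the longer body loses its trailing tokens, which is the point of truncating the longer sequence first.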

Although we can have variable-length input sentences, BERT does require our input arrays to be the same size. We address this by first choosing a maximum sentence length, and then padding and truncating our inputs until every input sequence is of the same length.

To "pad" our inputs in this context means that if a sentence is shorter than the maximum sentence length, we simply add 0s to the end of the sequence until it is the maximum sentence length.

If a sentence is longer than the maximum sentence length, then we simply truncate the end of the sequence, discarding anything that does not fit into our maximum sentence length.

We pad and truncate our sequences so that they all have length max_seq_length, padding and truncating at the end of the sequence rather than the beginning. In this notebook, get_ids, get_masks and get_segments handle the padding and _truncate_seq_pair handles the truncation, so Keras's pad_sequences utility is not needed.
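
The padding and truncation rule for a single sequence can be sketched in a few lines. The helper pad_or_truncate is hypothetical and deliberately simplified: it works on one list of ids and is not the pair truncation used in this notebook.

```python
def pad_or_truncate(ids, max_len, pad_id=0):
    """Return a new list with exactly max_len entries."""
    trimmed = ids[:max_len]  # drop anything past max_len
    return trimmed + [pad_id] * (max_len - len(trimmed))  # fill the rest with pad_id

short = pad_or_truncate([7, 8, 9], max_len=5)
long = pad_or_truncate([1, 2, 3, 4, 5, 6, 7], max_len=5)

print(short)
print(long)
```

Note that plain truncation like this would also cut off a trailing [SEP] token, which is one reason to truncate the raw tokens before the special tokens are added.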

In [None]:
inputs = df_train[input_columns]
title = inputs.question_title[0]
body = inputs.question_body[0]

# Tokenizing the sentence
title_tokens = tokenizer.tokenize(title)
body_tokens = tokenizer.tokenize(body)

# Adding separator tokens according to the paper
stokens = ["[CLS]"] + title_tokens + ["[SEP]"] + body_tokens + ["[SEP]"]

# Get the model inputs from the tokens
sample_ids = get_ids(stokens, tokenizer, max_seq_length)
sample_masks = get_masks(stokens, max_seq_length)
sample_segments = get_segments(stokens, max_seq_length)

# Print the length and the first 20 entries of each sequence
sample = (stokens, sample_ids, sample_masks, sample_segments)
for x in sample:
    print(f'\n {len(x)} :', x[:20])

## 4. Build models for training

In [None]:
max_seq_length = 512 # Your choice here.

def compute_input_arrays(df, tokenizer, max_sequence_length):
    """Convert the first two text columns of df into padded BERT input arrays."""
    input_ids, input_masks, input_segments = [], [], []
    for _, row in df.iterrows():
        # Tokenize the two texts (first and second column of df)
        tokens_a = tokenizer.tokenize(row.iloc[0])
        tokens_b = tokenizer.tokenize(row.iloc[1])
        # Modifies `tokens_a` and `tokens_b` in place so that the total
        # length fits within the specified length.
        # Account for [CLS], [SEP], [SEP] with "- 3"
        _truncate_seq_pair(tokens_a, tokens_b, max_sequence_length - 3)
        # Add the special tokens according to the paper
        stoken = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
        # Get the model inputs from the tokens
        ids = get_ids(stoken, tokenizer, max_sequence_length)
        masks = get_masks(stoken, max_sequence_length)
        segments = get_segments(stoken, max_sequence_length)
        input_ids.append(ids)
        input_masks.append(masks)
        input_segments.append(segments)
    # int16 assumes every token id is below 32768
    return [np.asarray(input_ids, dtype=np.int16), 
            np.asarray(input_masks, dtype=np.int16), 
            np.asarray(input_segments, dtype=np.int16)]

### Model 1

In [None]:
learn_rate = 0.00002
batch_size = 16
epochs = 8

In [None]:
def model1():
    input_word_ids = tf.keras.layers.Input(shape=(max_seq_length,), dtype=tf.int32, name="input_word_ids")
    input_mask = tf.keras.layers.Input(shape=(max_seq_length,), dtype=tf.int32, name="input_mask")
    segment_ids = tf.keras.layers.Input(shape=(max_seq_length,), dtype=tf.int32, name="segment_ids")
    # TF Hub model uses L=12 hidden layers (i.e., Transformer blocks), 
    # a hidden size of H=768, and A=12 attention heads
    # Inputs have been "cased", meaning that the distinction between 
    # lower and upper case as well as accent markers have been preserved
    bert_layer = hub.KerasLayer("/kaggle/input/bert-en-cased-l12-h768-a12-1/bert-en-cased-l12-h768-a12-v1")
    pooled_output, sequence_output = bert_layer([input_word_ids, input_mask, segment_ids])    
    x = tf.keras.layers.GlobalAveragePooling1D()(sequence_output)
    x = tf.keras.layers.Dropout(0.2)(x)
    x = tf.keras.layers.Dense(42, activation='relu')(x)
    x = tf.keras.layers.Dense(21, activation="sigmoid", name="dense_output")(x)
    model = tf.keras.models.Model(
        inputs=[input_word_ids, input_mask, segment_ids], 
        outputs=x)
    optimizer = tf.keras.optimizers.Adam(learning_rate=learn_rate)
    model.compile(loss='binary_crossentropy', optimizer=optimizer)
    return model
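
The model above is compiled with binary cross-entropy even though the targets are continuous values in [0,1]. A minimal sketch of why that still works: for a soft target, the loss is smallest when the prediction equals the target. The function below is a plain-Python illustration with a made-up target value, not the Keras implementation.

```python
import math

def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Binary cross-entropy for one target; y_true may be any value in [0, 1]."""
    p = min(max(y_pred, eps), 1.0 - eps)  # keep log() finite
    return -(y_true * math.log(p) + (1.0 - y_true) * math.log(1.0 - p))

soft_target = 0.7  # made-up rater average
candidates = [0.1, 0.3, 0.5, 0.7, 0.9]
losses = {p: binary_crossentropy(soft_target, p) for p in candidates}
best = min(losses, key=losses.get)

for p, loss in losses.items():
    print(f"prediction {p:.1f} -> loss {loss:.4f}")
print("lowest loss at prediction", best)
```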

In [None]:
def model_train(model, inputs, outputs):
    input_arrays = compute_input_arrays(inputs, tokenizer, max_seq_length)
    output_arrays = np.asarray(outputs) # target columns for the model being trained
    model.fit(input_arrays, output_arrays, epochs=epochs, batch_size=batch_size)
    
def model_predict(model, inputs):
    compute_inputs = compute_input_arrays(inputs, tokenizer, max_seq_length)
    return model.predict(compute_inputs, batch_size=batch_size)

In [None]:
model = model1()

In [None]:
model1_inputs = ['question_title','question_body']
model1_outputs = output_columns[:21]
    
model_train(model, df_train[model1_inputs], df_train[model1_outputs])

### Update Model 1 predictions

In [None]:
predictions = model_predict(model, df_test[model1_inputs])

In [None]:
predictions.shape

In [None]:
df_sub.iloc[:, 1:22] = predictions

### Model 2

In [None]:
def model2():
    input_word_ids = tf.keras.layers.Input(shape=(max_seq_length,), dtype=tf.int32, name="input_word_ids")
    input_mask = tf.keras.layers.Input(shape=(max_seq_length,), dtype=tf.int32, name="input_mask")
    segment_ids = tf.keras.layers.Input(shape=(max_seq_length,), dtype=tf.int32, name="segment_ids")
    # TF Hub model uses L=12 hidden layers (i.e., Transformer blocks), 
    # a hidden size of H=768, and A=12 attention heads
    # Inputs have been "cased", meaning that the distinction between 
    # lower and upper case as well as accent markers have been preserved
    bert_layer = hub.KerasLayer("/kaggle/input/bert-en-cased-l12-h768-a12-1/bert-en-cased-l12-h768-a12-v1")
    pooled_output, sequence_output = bert_layer([input_word_ids, input_mask, segment_ids])    
    x = tf.keras.layers.GlobalAveragePooling1D()(sequence_output)
    x = tf.keras.layers.Dropout(0.2)(x)
    x = tf.keras.layers.Dense(18, activation='relu')(x)
    x = tf.keras.layers.Dense(9, activation="sigmoid", name="dense_output")(x)
    model = tf.keras.models.Model(
        inputs=[input_word_ids, input_mask, segment_ids], 
        outputs=x)
    optimizer = tf.keras.optimizers.Adam(learning_rate=learn_rate)
    model.compile(loss='binary_crossentropy', optimizer=optimizer)
    return model

In [None]:
model = model2()

In [None]:
model2_inputs = ['question_body','answer']
model2_outputs = output_columns[21:]

model_train(model, df_train[model2_inputs], df_train[model2_outputs])

In [None]:
model2_predictions = model_predict(model, df_test[model2_inputs])

In [None]:
model2_predictions.shape

In [None]:
df_sub.iloc[:, 22:] = model2_predictions

In [None]:
df_sub.to_csv('submission.csv', index=False)