## Google QUEST Q&A Labeling

### Improving automated understanding of complex question answer content

> Computers are really good at answering questions with single, verifiable answers. But, humans are often still better at answering questions about opinions, recommendations, or personal experiences. ... In this competition, you're challenged to use this new dataset to build predictive algorithms for different subjective aspects of question-answering.

![](https://storage.googleapis.com/kaggle-media/competitions/google-research/human_computable_dimensions_1.png)

This is a **Notebook-only competition**: your Notebook is re-run automatically against an unseen test set.

The dataset is small: the training set has only 6,079 rows.

In [None]:
import numpy as np
import pandas as pd
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
from pathlib import Path
import matplotlib.pyplot as plt

# --- plotly ---
from plotly import tools, subplots
import plotly.offline as py
py.init_notebook_mode(connected=True)
import plotly.graph_objs as go
import plotly.express as px
import plotly.figure_factory as ff

import seaborn as sns

# Introduction to Data
### Data Import

In [None]:
%%time
datadir = Path('/kaggle/input/google-quest-challenge')

# Read in the data CSV files
train = pd.read_csv(datadir/'train.csv')
test = pd.read_csv(datadir/'test.csv')
sample_submission = pd.read_csv(datadir/'sample_submission.csv')

In [None]:
# Let's check the size of each dataset

print("Train" , train.shape)
print("Test" , test.shape)
print("Submission" , sample_submission.shape)

# Target labels

Each row of the submission has a *qa_id* column plus 30 target columns.

In [None]:
sample_submission.head()

### Each output has 21 question-related labels and 9 answer-related labels

**NOTE:** the labels are given as continuous values in the range [0, 1], NOT as binary values.

This is not a binary prediction challenge. Target labels are aggregated from multiple raters, and can have continuous values in the range [0,1]. Therefore, predictions must also be in that range.
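
To make the point about continuous labels concrete, here is a minimal sketch. It assumes, purely for illustration, that a label is the mean of several rater scores (the exact aggregation is not stated here), and both `ratings` and `raw_preds` are made-up numbers. It also shows one simple way to force predictions back into [0, 1].

```python
import numpy as np

# Hypothetical scores from three raters for three rows of one label.
# Averaging is an assumption used only to illustrate non-binary labels.
ratings = np.array([
    [1.0, 1.0, 0.5],
    [0.0, 0.0, 0.0],
    [1.0, 0.5, 0.5],
])
labels = ratings.mean(axis=1)  # fractional values such as 0.833 and 0.667

# Hypothetical raw model outputs that fall slightly outside the valid range.
raw_preds = np.array([1.07, -0.02, 0.64])
preds = np.clip(raw_preds, 0.0, 1.0)  # predictions must stay in [0, 1]

print(labels)
print(preds)
```

A sigmoid output layer would also keep predictions in range; clipping is just the simplest post-processing step.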

In [None]:
sample_submission.columns

## Feature columns

Let's check feature columns one by one.

In [None]:
feature_columns = [col for col in train.columns if col not in sample_submission.columns]
print("Feature columns are " , feature_columns)

## Let's look at the feature columns

In [None]:
train[feature_columns].head()

### Let's focus on the first row of the data. You can open the original page given in the url column.

https://photo.stackexchange.com/questions/9169/what-am-i-losing-when-using-extension-tubes-instead-of-a-macro-lens

Only the question has a title (*question_title*); *question_body* and *answer* are given as free text.

In [None]:
train0 = train.iloc[0]

print('URL           : ', train0['url'])
print('question_title: ', train0['question_title'])
print('question_body : ', train0['question_body'])

In [None]:
print('answer: ', train0['answer'])

**Each row contains a single question and a single answer to that question, along with additional features. The training data contains rows with some duplicated questions (but with different answers). The test data does not contain any duplicated questions.**

If you open the URL, you can see that the page lists multiple answers to the single question, but only one of them is sampled in the dataset. The sampled answer is not necessarily the most popular one: for this row, it appears relatively near the bottom of the page.
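
Because the same question can appear in several training rows, a plain random split could put one question into both the training and validation folds. The sketch below, on a made-up `toy` frame, counts duplicated questions and uses scikit-learn's `GroupKFold` to keep every copy of a question in the same fold. Grouping by `question_body` is an assumption made for illustration.

```python
import pandas as pd
from sklearn.model_selection import GroupKFold

# Hypothetical miniature of the training data: q1 and q3 each have two answers.
toy = pd.DataFrame({
    "question_body": ["q1", "q1", "q2", "q3", "q3", "q4"],
    "answer": ["a1", "a2", "a3", "a4", "a5", "a6"],
})

# Rows whose question appears more than once.
n_dup_rows = int(toy["question_body"].duplicated(keep=False).sum())

# Group-aware split: all rows of one question land in the same fold.
gkf = GroupKFold(n_splits=2)
leaks = []
for train_idx, valid_idx in gkf.split(toy, groups=toy["question_body"]):
    train_q = set(toy.loc[train_idx, "question_body"])
    valid_q = set(toy.loc[valid_idx, "question_body"])
    leaks.append(len(train_q & valid_q))  # questions shared across the split

print(n_dup_rows, leaks)
```

With a group-aware split the shared-question count should be zero in every fold, so the validation score is not inflated by questions already seen in training.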

The other columns are metadata describing **the question user, the answer user and the category of the question**.

In [None]:
train[['url', 'question_user_name', 'question_user_page', 'answer_user_name', 'answer_user_page', 'category', 'host']]

# EDA

In [None]:
# target label distribution

target_cols = ['question_asker_intent_understanding',
       'question_body_critical', 'question_conversational',
       'question_expect_short_answer', 'question_fact_seeking',
       'question_has_commonly_accepted_answer',
       'question_interestingness_others', 'question_interestingness_self',
       'question_multi_intent', 'question_not_really_a_question',
       'question_opinion_seeking', 'question_type_choice',
       'question_type_compare', 'question_type_consequence',
       'question_type_definition', 'question_type_entity',
       'question_type_instructions', 'question_type_procedure',
       'question_type_reason_explanation', 'question_type_spelling',
       'question_well_written', 'answer_helpful',
       'answer_level_of_information', 'answer_plausible', 'answer_relevance',
       'answer_satisfaction', 'answer_type_instructions',
       'answer_type_procedure', 'answer_type_reason_explanation',
       'answer_well_written']

In [None]:
fig, axes = plt.subplots(6, 5, figsize=(18, 15))
axes = axes.ravel()
bins = np.linspace(0, 1, 20)

for i, col in enumerate(target_cols):
    ax = axes[i]
    sns.distplot(train[col], label=col, kde=False, bins=bins, ax=ax)
    # ax.set_title(col)
    ax.set_xlim([0, 1])
    ax.set_ylim([0, 6079])
plt.tight_layout()
plt.show()
plt.close()

It seems some of the labels are quite imbalanced. For example, "question_not_really_a_question" is almost always 0, which means that most questions in the data are actual questions rather than noise.
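
The histograms give a visual impression of the imbalance; a single number per label can make it easier to compare. A minimal sketch, using a made-up `toy_targets` frame rather than the real data, computes the fraction of rows where each label is exactly 0:

```python
import pandas as pd

# Hypothetical label values, for illustration only.
toy_targets = pd.DataFrame({
    "question_not_really_a_question": [0.0, 0.0, 0.0, 0.0, 0.333],
    "answer_helpful": [1.0, 0.889, 0.778, 1.0, 0.667],
})

# Fraction of rows equal to 0 for each label, largest first.
zero_fraction = (toy_targets == 0).mean().sort_values(ascending=False)
print(zero_fraction)
```

On the real data the same expression applied to `train[target_cols]` would rank the labels by how often they are 0.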

In [None]:
# Count null values per column
train.isnull().sum()
# There are no null values.

### Category column

The dataset consists of 5 categories: "Technology", "Stackoverflow", "Culture", "Science", "Life arts".

The train and test distributions are almost the same.

In [None]:
train_category = train['category'].value_counts()
test_category = test['category'].value_counts()

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(12, 6))
train_category.plot(kind='bar', ax=axes[0])
axes[0].set_title('Train')
test_category.plot(kind='bar', ax=axes[1])
axes[1].set_title('Test')
print('Train/Test category distribution')

### Word Cloud visualization

Let's see what kinds of words are used in the questions and answers, and check the difference between train and test.

In [None]:
from wordcloud import WordCloud

def plot_wordcloud(text, ax, title=None):
    # Use the `text` argument rather than the global `text_cat` variable.
    wordcloud = WordCloud(max_font_size=None, background_color='white',
                          width=1200, height=1000).generate(text)
    ax.imshow(wordcloud)
    if title is not None:
        ax.set_title(title)
    ax.axis("off")

In [None]:
print('Training data Word Cloud')

fig, axes = plt.subplots(1, 3, figsize=(16, 18))

text_cat = ' '.join(train['question_title'].values)
plot_wordcloud(text_cat, axes[0], 'Question title')

text_cat = ' '.join(train['question_body'].values)
plot_wordcloud(text_cat, axes[1], 'Question body')

text_cat = ' '.join(train['answer'].values)
plot_wordcloud(text_cat, axes[2], 'Answer')

plt.tight_layout()
fig.show()

In [None]:
print('Test data Word Cloud')

fig, axes = plt.subplots(1, 3, figsize=(16, 18))

text_cat = ' '.join(test['question_title'].values)
plot_wordcloud(text_cat, axes[0], 'Question title')

text_cat = ' '.join(test['question_body'].values)
plot_wordcloud(text_cat, axes[1], 'Question body')

text_cat = ' '.join(test['answer'].values)
plot_wordcloud(text_cat, axes[2], 'Answer')

plt.tight_layout()
fig.show()

It seems the distribution of common words is similar between the train and test datasets!

### Correlation in target labels

The following pairs are **correlated**:

1. "question_type_instructions" & "answer_type_instructions" = 0.77
1. "question_type_procedure" & "answer_type_procedure" = 0.61
1. "question_type_reason_explanation" & "answer_type_reason_explanation" = 0.59

It is reasonable that the same kind of evaluation on the question and on the answer is correlated.

On the other hand, the following pairs are anticorrelated:

1. "question_fact_seeking" & "question_opinion_seeking" = -0.69
1. "answer_type_instructions" & "answer_type_reason_explanation" = -0.48

This is also reasonable: a question that seeks facts is rarely one that seeks opinions, and an answer that gives instructions is rarely one that explains a reason.
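
Reading the strongest pairs off a 30x30 heatmap by eye is error-prone, so it can help to list them programmatically. The sketch below is one possible way to do it; `strongest_pairs` is a hypothetical helper and the values in `toy` are made up to mimic the pattern described above.

```python
from itertools import combinations

import pandas as pd


def strongest_pairs(df, top_n=3):
    """Return (col_a, col_b, pearson_r) tuples sorted by |r|, largest first."""
    corr = df.corr()
    # combinations() visits each unordered pair once and skips the diagonal.
    pairs = [(a, b, corr.loc[a, b]) for a, b in combinations(corr.columns, 2)]
    pairs.sort(key=lambda p: abs(p[2]), reverse=True)
    return pairs[:top_n]


# Hypothetical label values, for illustration only.
toy = pd.DataFrame({
    "question_fact_seeking": [1.0, 0.0, 1.0, 0.0, 1.0, 0.0],
    "question_opinion_seeking": [0.0, 1.0, 0.0, 1.0, 0.333, 1.0],
    "question_type_instructions": [1.0, 1.0, 0.0, 0.0, 0.0, 1.0],
    "answer_type_instructions": [1.0, 0.667, 0.0, 0.0, 0.333, 1.0],
})

top = strongest_pairs(toy, top_n=2)
for a, b, r in top:
    print(f"{a} & {b} = {r:.2f}")
```

On the real data the call would be `strongest_pairs(train[target_cols], top_n=10)`.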

In [None]:
fig, ax = plt.subplots(figsize=(23, 23))
sns.heatmap(train[target_cols].corr(), ax=ax, annot=True)

### User check

The dataset contains question user and answer user information. This may be because user attributes are important: the same user tends to deal with the same kind of question, and the same answer user tends to answer with similar quality.

Let's check how the users are distributed, and whether any users appear in both train and test.

In [None]:
train_question_user = train['question_user_name'].unique()
test_question_user = test['question_user_name'].unique()

print('Number of unique question user in train: ', len(train_question_user))
print('Number of unique question user in test : ', len(test_question_user))
print('Number of unique question user in both train & test : ', len(set(train_question_user) & set(test_question_user)))

In [None]:
train_answer_user = train['answer_user_name'].unique()
test_answer_user = test['answer_user_name'].unique()

print('Number of unique answer user in train: ', len(train_answer_user))
print('Number of unique answer user in test : ', len(test_answer_user))
print('Number of unique answer user in both train & test : ', len(set(train_answer_user) & set(test_answer_user)))

It seems several users appear in both the train and test datasets.

Also, it seems many users both ask and answer questions.

In [None]:
print('Number of unique user in both question & answer in train  : ', len(set(train_answer_user) & set(train_question_user)))
print('Number of unique user in both question & answer in test  : ', len(set(test_answer_user) & set(test_question_user)))

# Simple feature engineering

Now I will do some simple feature engineering and check whether it explains the data well.

1. Number of characters and words in the question title, question body and answer.
1. question_user's question count in train.
1. answer_user's answer count in train.

### Number of words

In [None]:
def char_count(s):
    return len(s)

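def word_count(s):
    # Split on any whitespace; counting spaces alone undercounts by one
    # and miscounts text with newlines or repeated spaces.
    return len(s.split())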

In [None]:
train['question_title_n_chars'] = train['question_title'].apply(char_count)
train['question_title_n_words'] = train['question_title'].apply(word_count)
train['question_body_n_chars'] = train['question_body'].apply(char_count)
train['question_body_n_words'] = train['question_body'].apply(word_count)
train['answer_n_chars'] = train['answer'].apply(char_count)
train['answer_n_words'] = train['answer'].apply(word_count)

test['question_title_n_chars'] = test['question_title'].apply(char_count)
test['question_title_n_words'] = test['question_title'].apply(word_count)
test['question_body_n_chars'] = test['question_body'].apply(char_count)
test['question_body_n_words'] = test['question_body'].apply(word_count)
test['answer_n_chars'] = test['answer'].apply(char_count)
test['answer_n_words'] = test['answer'].apply(word_count)

In [None]:
# Number of chars and words in Question title

fig, axes = plt.subplots(1, 2, figsize=(12, 6))
sns.distplot(train['question_title_n_chars'], label='train', ax=axes[0])
sns.distplot(test['question_title_n_chars'], label='test', ax=axes[0])
axes[0].legend()
sns.distplot(train['question_title_n_words'], label='train', ax=axes[1])
sns.distplot(test['question_title_n_words'], label='test', ax=axes[1])
axes[1].legend()

In [None]:
# Number of chars and words in Question body

fig, axes = plt.subplots(1, 2, figsize=(12, 6))
sns.distplot(train['question_body_n_chars'], label='train', ax=axes[0])
sns.distplot(test['question_body_n_chars'], label='test', ax=axes[0])
axes[0].legend()
sns.distplot(train['question_body_n_words'], label='train', ax=axes[1])
sns.distplot(test['question_body_n_words'], label='test', ax=axes[1])
axes[1].legend()

In [None]:
# Some outliers are very long; clip them for visualization.
train['question_body_n_chars'] = train['question_body_n_chars'].clip(0, 5000)
test['question_body_n_chars'] = test['question_body_n_chars'].clip(0, 5000)
train['question_body_n_words'] = train['question_body_n_words'].clip(0, 1000)
test['question_body_n_words'] = test['question_body_n_words'].clip(0, 1000)

fig, axes = plt.subplots(1, 2, figsize=(12, 6))
sns.distplot(train['question_body_n_chars'], label='train', ax=axes[0])
sns.distplot(test['question_body_n_chars'], label='test', ax=axes[0])
axes[0].legend()
sns.distplot(train['question_body_n_words'], label='train', ax=axes[1])
sns.distplot(test['question_body_n_words'], label='test', ax=axes[1])
axes[1].legend()

### Number of chars and words in answer

The distributions of the number of chars/words in the answer are similar to those of the question body.

In [None]:
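# Clip very long answers for visualization, assigning the result back.
train['answer_n_chars'] = train['answer_n_chars'].clip(0, 5000)
test['answer_n_chars'] = test['answer_n_chars'].clip(0, 5000)
train['answer_n_words'] = train['answer_n_words'].clip(0, 1000)
test['answer_n_words'] = test['answer_n_words'].clip(0, 1000)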

fig, axes = plt.subplots(1, 2, figsize=(12, 6))
sns.distplot(train['answer_n_chars'], label='train', ax=axes[0])
sns.distplot(test['answer_n_chars'], label='test', ax=axes[0])
axes[0].legend()
sns.distplot(train['answer_n_words'], label='train', ax=axes[1])
sns.distplot(test['answer_n_words'], label='test', ax=axes[1])
axes[1].legend()

Are these features useful for predicting the target values?

Let's check their correlation with the target values.

In [None]:
from scipy.spatial.distance import cdist

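def calc_corr(df, x_cols, y_cols):
    """Pearson correlation of each y_cols column (rows) with each x_cols column (columns)."""
    arr1 = df[x_cols].T.values
    arr2 = df[y_cols].T.values
    # cdist's 'correlation' metric is a distance (1 - Pearson r), so invert it.
    corr_df = pd.DataFrame(1 - cdist(arr2, arr1, metric='correlation'), index=y_cols, columns=x_cols)
    return corr_df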

In [None]:
number_feature_cols = ['question_title_n_chars', 'question_title_n_words', 'question_body_n_chars', 'question_body_n_words', 'answer_n_chars', 'answer_n_words']
# train[number_feature_cols].corrwith(train[target_cols], axis=0)

corr_df = calc_corr(train, target_cols, number_feature_cols)

In [None]:
fig, ax = plt.subplots(figsize=(25, 5))
sns.heatmap(corr_df, ax=ax, annot=True)

### We can see the following relationships

1. The length of the answer is correlated with "answer_level_of_information".
1. The length of question_title is correlated with "question_body_critical", and the length of question_body is anticorrelated with it.
1. The length of question_body is anticorrelated with "question_well_written".

# Number of questions or answers by user

In [None]:
num_question = train['question_user_name'].value_counts()
num_answer = train['answer_user_name'].value_counts()

In [None]:
train['num_answer_user'] = train['answer_user_name'].map(num_answer)
train['num_question_user'] = train['question_user_name'].map(num_question)
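# Counts come from train only; users unseen in train would map to NaN, so use 0.
test['num_answer_user'] = test['answer_user_name'].map(num_answer).fillna(0)
test['num_question_user'] = test['question_user_name'].map(num_question).fillna(0)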

In [None]:
number_feature_cols = ['num_answer_user', 'num_question_user']
corr_df = calc_corr(train, target_cols, number_feature_cols)

In [None]:
fig, ax = plt.subplots(figsize=(30, 2))
sns.heatmap(corr_df, ax=ax, annot=True)

Although the correlations are small and might not reflect a true relationship, I can see the following pattern:

num_question_user and question_conversational are correlated: people who post many questions tend to ask them in a conversational form.

## BERT Implementation 

In [None]:
# CODE TAKEN FROM https://github.com/kpe/bert-for-tf2/
# ALL CREDITS TO https://github.com/kpe
# CODE COPIED TO LOCAL FOLDER DUE TO INTERNET RESTRICTIONS
# NORMALLY THIS CODE WOULD BE AVAILABLE VIA pip install bert-for-tf2

# coding=utf-8
# Copyright 2018 The Google AI Language Team Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Tokenization classes."""

from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import collections
import re
import unicodedata
import six
import tensorflow as tf


def validate_case_matches_checkpoint(do_lower_case, init_checkpoint):
    """Checks whether the casing config is consistent with the checkpoint name."""

    # The casing has to be passed in by the user and there is no explicit check
    # as to whether it matches the checkpoint. The casing information probably
    # should have been stored in the bert_config.json file, but it's not, so
    # we have to heuristically detect it to validate.

    if not init_checkpoint:
        return

    m = re.match("^.*?([A-Za-z0-9_-]+)/bert_model.ckpt", init_checkpoint)
    if m is None:
        return

    model_name = m.group(1)

    lower_models = [
        "uncased_L-24_H-1024_A-16", "uncased_L-12_H-768_A-12",
        "multilingual_L-12_H-768_A-12", "chinese_L-12_H-768_A-12"
    ]

    cased_models = [
        "cased_L-12_H-768_A-12", "cased_L-24_H-1024_A-16",
        "multi_cased_L-12_H-768_A-12"
    ]

    is_bad_config = False
    if model_name in lower_models and not do_lower_case:
        is_bad_config = True
        actual_flag = "False"
        case_name = "lowercased"
        opposite_flag = "True"

    if model_name in cased_models and do_lower_case:
        is_bad_config = True
        actual_flag = "True"
        case_name = "cased"
        opposite_flag = "False"

    if is_bad_config:
        raise ValueError(
            "You passed in `--do_lower_case=%s` with `--init_checkpoint=%s`. "
            "However, `%s` seems to be a %s model, so you "
            "should pass in `--do_lower_case=%s` so that the fine-tuning matches "
            "how the model was pre-training. If this error is wrong, please "
            "just comment out this check." % (actual_flag, init_checkpoint,
                                              model_name, case_name, opposite_flag))


def convert_to_unicode(text):
    """Converts `text` to Unicode (if it's not already), assuming utf-8 input."""
    if six.PY3:
        if isinstance(text, str):
            return text
        elif isinstance(text, bytes):
            return text.decode("utf-8", "ignore")
        else:
            raise ValueError("Unsupported string type: %s" % (type(text)))
    elif six.PY2:
        if isinstance(text, str):
            return text.decode("utf-8", "ignore")
        elif isinstance(text, unicode):
            return text
        else:
            raise ValueError("Unsupported string type: %s" % (type(text)))
    else:
        raise ValueError("Not running on Python2 or Python 3?")


def printable_text(text):
    """Returns text encoded in a way suitable for print or `tf.logging`."""

    # These functions want `str` for both Python2 and Python3, but in one case
    # it's a Unicode string and in the other it's a byte string.
    if six.PY3:
        if isinstance(text, str):
            return text
        elif isinstance(text, bytes):
            return text.decode("utf-8", "ignore")
        else:
            raise ValueError("Unsupported string type: %s" % (type(text)))
    elif six.PY2:
        if isinstance(text, str):
            return text
        elif isinstance(text, unicode):
            return text.encode("utf-8")
        else:
            raise ValueError("Unsupported string type: %s" % (type(text)))
    else:
        raise ValueError("Not running on Python2 or Python 3?")


def load_vocab(vocab_file):
    """Loads a vocabulary file into a dictionary."""
    vocab = collections.OrderedDict()
    index = 0
    with tf.io.gfile.GFile(vocab_file, "r") as reader:
        while True:
            token = convert_to_unicode(reader.readline())
            if not token:
                break
            token = token.strip()
            vocab[token] = index
            index += 1
    return vocab


def convert_by_vocab(vocab, items):
    """Converts a sequence of [tokens|ids] using the vocab."""
    output = []
    for item in items:
        output.append(vocab[item])
    return output


def convert_tokens_to_ids(vocab, tokens):
    return convert_by_vocab(vocab, tokens)


def convert_ids_to_tokens(inv_vocab, ids):
    return convert_by_vocab(inv_vocab, ids)


def whitespace_tokenize(text):
    """Runs basic whitespace cleaning and splitting on a piece of text."""
    text = text.strip()
    if not text:
        return []
    tokens = text.split()
    return tokens


class FullTokenizer(object):
    """Runs end-to-end tokenization."""

    def __init__(self, vocab_file, do_lower_case=True):
        self.vocab = load_vocab(vocab_file)
        self.inv_vocab = {v: k for k, v in self.vocab.items()}
        self.basic_tokenizer = BasicTokenizer(do_lower_case=do_lower_case)
        self.wordpiece_tokenizer = WordpieceTokenizer(vocab=self.vocab)

    def tokenize(self, text):
        split_tokens = []
        for token in self.basic_tokenizer.tokenize(text):
            for sub_token in self.wordpiece_tokenizer.tokenize(token):
                split_tokens.append(sub_token)

        return split_tokens

    def convert_tokens_to_ids(self, tokens):
        return convert_by_vocab(self.vocab, tokens)

    def convert_ids_to_tokens(self, ids):
        return convert_by_vocab(self.inv_vocab, ids)


class BasicTokenizer(object):
    """Runs basic tokenization (punctuation splitting, lower casing, etc.)."""

    def __init__(self, do_lower_case=True):
        """Constructs a BasicTokenizer.
        Args:
          do_lower_case: Whether to lower case the input.
        """
        self.do_lower_case = do_lower_case

    def tokenize(self, text):
        """Tokenizes a piece of text."""
        text = convert_to_unicode(text)
        text = self._clean_text(text)

        # This was added on November 1st, 2018 for the multilingual and Chinese
        # models. This is also applied to the English models now, but it doesn't
        # matter since the English models were not trained on any Chinese data
        # and generally don't have any Chinese data in them (there are Chinese
        # characters in the vocabulary because Wikipedia does have some Chinese
        # words in the English Wikipedia.).
        text = self._tokenize_chinese_chars(text)

        orig_tokens = whitespace_tokenize(text)
        split_tokens = []
        for token in orig_tokens:
            if self.do_lower_case:
                token = token.lower()
                token = self._run_strip_accents(token)
            split_tokens.extend(self._run_split_on_punc(token))

        output_tokens = whitespace_tokenize(" ".join(split_tokens))
        return output_tokens

    def _run_strip_accents(self, text):
        """Strips accents from a piece of text."""
        text = unicodedata.normalize("NFD", text)
        output = []
        for char in text:
            cat = unicodedata.category(char)
            if cat == "Mn":
                continue
            output.append(char)
        return "".join(output)

    def _run_split_on_punc(self, text):
        """Splits punctuation on a piece of text."""
        chars = list(text)
        i = 0
        start_new_word = True
        output = []
        while i < len(chars):
            char = chars[i]
            if _is_punctuation(char):
                output.append([char])
                start_new_word = True
            else:
                if start_new_word:
                    output.append([])
                start_new_word = False
                output[-1].append(char)
            i += 1

        return ["".join(x) for x in output]

    def _tokenize_chinese_chars(self, text):
        """Adds whitespace around any CJK character."""
        output = []
        for char in text:
            cp = ord(char)
            if self._is_chinese_char(cp):
                output.append(" ")
                output.append(char)
                output.append(" ")
            else:
                output.append(char)
        return "".join(output)

    def _is_chinese_char(self, cp):
        """Checks whether CP is the codepoint of a CJK character."""
        # This defines a "chinese character" as anything in the CJK Unicode block:
        #   https://en.wikipedia.org/wiki/CJK_Unified_Ideographs_(Unicode_block)
        #
        # Note that the CJK Unicode block is NOT all Japanese and Korean characters,
        # despite its name. The modern Korean Hangul alphabet is a different block,
        # as is Japanese Hiragana and Katakana. Those alphabets are used to write
        # space-separated words, so they are not treated specially and handled
        # like the all of the other languages.
        if ((cp >= 0x4E00 and cp <= 0x9FFF) or  #
                (cp >= 0x3400 and cp <= 0x4DBF) or  #
                (cp >= 0x20000 and cp <= 0x2A6DF) or  #
                (cp >= 0x2A700 and cp <= 0x2B73F) or  #
                (cp >= 0x2B740 and cp <= 0x2B81F) or  #
                (cp >= 0x2B820 and cp <= 0x2CEAF) or
                (cp >= 0xF900 and cp <= 0xFAFF) or  #
                (cp >= 0x2F800 and cp <= 0x2FA1F)):  #
            return True

        return False

    def _clean_text(self, text):
        """Performs invalid character removal and whitespace cleanup on text."""
        output = []
        for char in text:
            cp = ord(char)
            if cp == 0 or cp == 0xfffd or _is_control(char):
                continue
            if _is_whitespace(char):
                output.append(" ")
            else:
                output.append(char)
        return "".join(output)


class WordpieceTokenizer(object):
    """Runs WordPiece tokenization."""

    def __init__(self, vocab, unk_token="[UNK]", max_input_chars_per_word=200):
        self.vocab = vocab
        self.unk_token = unk_token
        self.max_input_chars_per_word = max_input_chars_per_word

    def tokenize(self, text):
        """Tokenizes a piece of text into its word pieces.
        This uses a greedy longest-match-first algorithm to perform tokenization
        using the given vocabulary.
        For example:
          input = "unaffable"
          output = ["un", "##aff", "##able"]
        Args:
          text: A single token or whitespace separated tokens. This should have
            already been passed through `BasicTokenizer`.
        Returns:
          A list of wordpiece tokens.
        """

        text = convert_to_unicode(text)

        output_tokens = []
        for token in whitespace_tokenize(text):
            chars = list(token)
            if len(chars) > self.max_input_chars_per_word:
                output_tokens.append(self.unk_token)
                continue

            is_bad = False
            start = 0
            sub_tokens = []
            while start < len(chars):
                end = len(chars)
                cur_substr = None
                while start < end:
                    substr = "".join(chars[start:end])
                    if start > 0:
                        substr = "##" + substr
                    if substr in self.vocab:
                        cur_substr = substr
                        break
                    end -= 1
                if cur_substr is None:
                    is_bad = True
                    break
                sub_tokens.append(cur_substr)
                start = end

            if is_bad:
                output_tokens.append(self.unk_token)
            else:
                output_tokens.extend(sub_tokens)
        return output_tokens


def _is_whitespace(char):
    """Checks whether `chars` is a whitespace character."""
    # \t, \n, and \r are technically control characters but we treat them
    # as whitespace since they are generally considered as such.
    if char == " " or char == "\t" or char == "\n" or char == "\r":
        return True
    cat = unicodedata.category(char)
    if cat == "Zs":
        return True
    return False


def _is_control(char):
    """Checks whether `chars` is a control character."""
    # These are technically control characters but we count them as whitespace
    # characters.
    if char == "\t" or char == "\n" or char == "\r":
        return False
    cat = unicodedata.category(char)
    if cat in ("Cc", "Cf"):
        return True
    return False


def _is_punctuation(char):
    """Checks whether `chars` is a punctuation character."""
    cp = ord(char)
    # We treat all non-letter/number ASCII as punctuation.
    # Characters such as "^", "$", and "`" are not in the Unicode
    # Punctuation class but we treat them as punctuation anyways, for
    # consistency.
    if ((cp >= 33 and cp <= 47) or (cp >= 58 and cp <= 64) or
            (cp >= 91 and cp <= 96) or (cp >= 123 and cp <= 126)):
        return True
    cat = unicodedata.category(char)
    if cat.startswith("P"):
        return True
    return False
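
To see what the greedy longest-match-first loop in WordpieceTokenizer does without loading a real vocabulary, here is a standalone sketch. `greedy_wordpiece` and `toy_vocab` are hypothetical names introduced for illustration; the real tokenizer additionally handles whitespace splitting and a maximum word length.

```python
def greedy_wordpiece(word, vocab, unk_token="[UNK]"):
    """Split one word into word pieces by greedy longest-match-first lookup."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink from the right.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces get a prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece fits at this position: the whole word becomes unknown.
            return [unk_token]
        pieces.append(match)
        start = end
    return pieces


# Hypothetical tiny vocabulary, for illustration only.
toy_vocab = {"un", "##aff", "##able", "aff", "able"}

print(greedy_wordpiece("unaffable", toy_vocab))  # ['un', '##aff', '##able']
print(greedy_wordpiece("xyz", toy_vocab))        # ['[UNK]']
```

Note that a word is mapped to a single unknown token as soon as any position cannot be matched, even if earlier pieces were found.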

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import GroupKFold
import matplotlib.pyplot as plt
from tqdm.notebook import tqdm
import tensorflow_hub as hub
import tensorflow as tf
import tensorflow.keras.backend as K
import gc
import os
from scipy.stats import spearmanr
from math import floor, ceil

np.set_printoptions(suppress=True)

In [None]:
BERT_PATH = '../input/bert-base-from-tfhub/bert_en_uncased_L-12_H-768_A-12'
tokenizer = FullTokenizer(BERT_PATH+'/assets/vocab.txt', True)

In [None]:
MAX_SEQUENCE_LENGTH = 512

## Preprocessing functions

In [None]:
def _get_masks(tokens, max_seq_length):
    """Mask for padding"""
    if len(tokens)>max_seq_length:
        raise IndexError("Token length more than max seq length!")
    return [1]*len(tokens) + [0] * (max_seq_length - len(tokens))

def _get_segments(tokens, max_seq_length):
    """Segments: 0 for the first sequence, 1 for the second"""
    if len(tokens)>max_seq_length:
        raise IndexError("Token length more than max seq length!")
    segments = []
    first_sep = True
    current_segment_id = 0
    for token in tokens:
        segments.append(current_segment_id)
        if token == "[SEP]":
            if first_sep:
                first_sep = False 
            else:
                current_segment_id = 1
    return segments + [0] * (max_seq_length - len(tokens))

def _get_ids(tokens, tokenizer, max_seq_length):
    """Token ids from Tokenizer vocab"""
    token_ids = tokenizer.convert_tokens_to_ids(tokens)
    input_ids = token_ids + [0] * (max_seq_length-len(token_ids))
    return input_ids

def _trim_input(title, question, answer, max_sequence_length, 
                t_max_len=60, q_max_len=224, a_max_len=224):

    t = tokenizer.tokenize(title)
    q = tokenizer.tokenize(question)
    a = tokenizer.tokenize(answer)
    
    t_len = len(t)
    q_len = len(q)
    a_len = len(a)

    if (t_len+q_len+a_len+4) > max_sequence_length:
        
        if t_max_len > t_len:
            t_new_len = t_len
            a_max_len = a_max_len + floor((t_max_len - t_len)/2)
            q_max_len = q_max_len + ceil((t_max_len - t_len)/2)
        else:
            t_new_len = t_max_len
      
        if a_max_len > a_len:
            a_new_len = a_len 
            q_new_len = q_max_len + (a_max_len - a_len)
        elif q_max_len > q_len:
            a_new_len = a_max_len + (q_max_len - q_len)
            q_new_len = q_len
        else:
            a_new_len = a_max_len
            q_new_len = q_max_len
            
            
        if t_new_len+a_new_len+q_new_len+4 != max_sequence_length:
            raise ValueError("New sequence length should be %d, but is %d" 
                             % (max_sequence_length, (t_new_len+a_new_len+q_new_len+4)))
        
        t = t[:t_new_len]
        q = q[:q_new_len]
        a = a[:a_new_len]
    
    return t, q, a

def _convert_to_bert_inputs(title, question, answer, tokenizer, max_sequence_length):
    """Converts tokenized input to ids, masks and segments for BERT"""
    
    stoken = ["[CLS]"] + title + ["[SEP]"] + question + ["[SEP]"] + answer + ["[SEP]"]

    input_ids = _get_ids(stoken, tokenizer, max_sequence_length)
    input_masks = _get_masks(stoken, max_sequence_length)
    input_segments = _get_segments(stoken, max_sequence_length)

    return [input_ids, input_masks, input_segments]

def compute_input_arays(df, columns, tokenizer, max_sequence_length):
    input_ids, input_masks, input_segments = [], [], []
    for _, instance in tqdm(df[columns].iterrows()):
        t, q, a = instance.question_title, instance.question_body, instance.answer

        t, q, a = _trim_input(t, q, a, max_sequence_length)

        ids, masks, segments = _convert_to_bert_inputs(t, q, a, tokenizer, max_sequence_length)
        input_ids.append(ids)
        input_masks.append(masks)
        input_segments.append(segments)
        
    return [np.asarray(input_ids, dtype=np.int32), 
            np.asarray(input_masks, dtype=np.int32), 
            np.asarray(input_segments, dtype=np.int32)
           ]


def compute_output_arrays(df, columns):
    return np.asarray(df[columns])
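
The token budget in _trim_input is easier to follow on plain integers. The sketch below reproduces only the length arithmetic, with no tokenizer involved. `trim_lengths` is a hypothetical helper name that is not part of this notebook; the defaults (60/224/224 plus 4 special tokens, with a maximum of 512) are assumed from the function above.

```python
from math import ceil, floor


def trim_lengths(t_len, q_len, a_len, max_sequence_length=512,
                 t_max_len=60, q_max_len=224, a_max_len=224):
    """Return (title, question, answer) token budgets, mirroring _trim_input."""
    # Nothing to trim if everything plus [CLS] and three [SEP] tokens fits.
    if t_len + q_len + a_len + 4 <= max_sequence_length:
        return t_len, q_len, a_len

    # A short title donates its unused budget to the question and the answer.
    if t_max_len > t_len:
        t_new_len = t_len
        spare = t_max_len - t_len
        a_max_len += floor(spare / 2)
        q_max_len += ceil(spare / 2)
    else:
        t_new_len = t_max_len

    # A short answer donates its unused budget to the question, and vice versa.
    if a_max_len > a_len:
        a_new_len = a_len
        q_new_len = q_max_len + (a_max_len - a_len)
    elif q_max_len > q_len:
        a_new_len = a_max_len + (q_max_len - q_len)
        q_new_len = q_len
    else:
        a_new_len = a_max_len
        q_new_len = q_max_len

    return t_new_len, q_new_len, a_new_len


# Short title, long question and answer: the 50 spare title tokens are split.
print(trim_lengths(10, 300, 400))   # (10, 249, 249)
# Short answer: its unused budget goes to the question.
print(trim_lengths(10, 600, 100))   # (10, 398, 100)
```

With these defaults the three budgets plus the 4 special tokens always add up to 512 whenever trimming happens, which is the condition the ValueError check in _trim_input verifies.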

## Create model

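compute_spearmanr() computes the competition metric, the mean column-wise Spearman correlation, on the validation set. A tiny amount of random noise is added to the predictions so that a constant column does not produce an undefined correlation.

CustomCallback() is a class which inherits from tf.keras.callbacks.Callback. After each epoch it predicts on the validation and test sets, appends both sets of predictions to its lists, and prints the validation score computed on the average of the validation predictions collected so far.

bert_model() contains the actual architecture that will be used to fine-tune BERT on our dataset. It is simple: the sequence_output of the bert_layer is passed to a GlobalAveragePooling1D layer, then through dropout, and finally to an output layer of 30 sigmoid units (one for each of the 30 target labels that we have to predict).

train_and_predict() compiles and trains the model, and returns the callback that holds the collected predictions.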
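
To make the metric concrete, here is a minimal sketch of a column-wise Spearman score on toy data. `mean_columnwise_spearman`, the `seed` argument and the toy arrays are assumptions for illustration; the notebook's own compute_spearmanr() uses unseeded `np.random.normal` instead.

```python
import numpy as np
from scipy.stats import spearmanr


def mean_columnwise_spearman(trues, preds, jitter=1e-7, seed=0):
    """Average Spearman rho over target columns, with tiny noise on preds."""
    rng = np.random.default_rng(seed)
    rhos = []
    for col_true, col_pred in zip(trues.T, preds.T):
        # The noise breaks ties, so a constant column still gets a finite rho.
        noisy = col_pred + rng.normal(0, jitter, col_pred.shape[0])
        rho, _ = spearmanr(col_true, noisy)
        rhos.append(rho)
    return np.nanmean(rhos)


trues = np.array([[0.0, 1.0], [0.5, 0.5], [1.0, 0.0], [0.25, 0.75]])
# Column 0 preserves the ordering of the targets, column 1 is constant.
preds = np.column_stack([trues[:, 0] * 0.5 + 0.1, np.full(4, 0.3)])
score = mean_columnwise_spearman(trues, preds)
print(np.isfinite(score))
```

Because Spearman only looks at ranks, rescaling a column of predictions does not change its score; only the ordering matters.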

In [None]:
def compute_spearmanr(trues, preds):
    rhos = []
    for col_trues, col_pred in zip(trues.T, preds.T):
        rhos.append(
            spearmanr(col_trues, col_pred + np.random.normal(0, 1e-7, col_pred.shape[0])).correlation)
    return np.nanmean(rhos)


class CustomCallback(tf.keras.callbacks.Callback):
    
    def __init__(self, valid_data, test_data, batch_size=16, fold=None):
        # Initialise the Keras Callback base class before adding our own state.
        super().__init__()

        self.valid_inputs = valid_data[0]
        self.valid_outputs = valid_data[1]
        self.test_inputs = test_data
        
        self.batch_size = batch_size
        self.fold = fold
        
    def on_train_begin(self, logs=None):
        self.valid_predictions = []
        self.test_predictions = []
        
    def on_epoch_end(self, epoch, logs=None):
        self.valid_predictions.append(
            self.model.predict(self.valid_inputs, batch_size=self.batch_size))
        
        rho_val = compute_spearmanr(
            self.valid_outputs, np.average(self.valid_predictions, axis=0))
        
        print("\nvalidation rho: %.4f" % rho_val)
        
        if self.fold is not None:
            # Use the fold stored on the callback, not a global variable.
            self.model.save_weights(f'bert-base-{self.fold}-{epoch}.h5py')
        
        self.test_predictions.append(
            self.model.predict(self.test_inputs, batch_size=self.batch_size)
        )

def bert_model():
    
    input_word_ids = tf.keras.layers.Input(
        (MAX_SEQUENCE_LENGTH,), dtype=tf.int32, name='input_word_ids')
    input_masks = tf.keras.layers.Input(
        (MAX_SEQUENCE_LENGTH,), dtype=tf.int32, name='input_masks')
    input_segments = tf.keras.layers.Input(
        (MAX_SEQUENCE_LENGTH,), dtype=tf.int32, name='input_segments')
    
    bert_layer = hub.KerasLayer(BERT_PATH, trainable=True)
    
    _, sequence_output = bert_layer([input_word_ids, input_masks, input_segments])
    
    x = tf.keras.layers.GlobalAveragePooling1D()(sequence_output)
    x = tf.keras.layers.Dropout(0.2)(x)
    out = tf.keras.layers.Dense(30, activation="sigmoid", name="dense_output")(x)
    
    model = tf.keras.models.Model(
        inputs=[input_word_ids, input_masks, input_segments], outputs=out)
    model.summary()
    return model    
        
def train_and_predict(model, train_data, valid_data, test_data, 
                      learning_rate, epochs, batch_size, loss_function, fold):
        
    custom_callback = CustomCallback(
        valid_data=(valid_data[0], valid_data[1]), 
        test_data=test_data,
        batch_size=batch_size,
        fold=None)

    optimizer = tf.keras.optimizers.Adam(learning_rate=learning_rate)
    model.compile(loss=loss_function, optimizer=optimizer)
    model.fit(train_data[0], train_data[1], epochs=epochs, 
              batch_size=batch_size, callbacks=[custom_callback])
    
    return custom_callback
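
The head of bert_model() can be illustrated without TensorFlow. The following NumPy sketch uses random numbers as a stand-in for BERT's sequence_output and random placeholder weights for the dense layer; the toy sizes are assumptions, and dropout is left out because it is inactive at prediction time.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, seq_len, hidden, n_targets = 2, 8, 16, 30  # toy sizes, not the real ones

# Stand-in for BERT's sequence_output: one hidden vector per token position.
sequence_output = rng.normal(size=(batch, seq_len, hidden))

# GlobalAveragePooling1D: mean over the token axis -> (batch, hidden).
pooled = sequence_output.mean(axis=1)

# Dense(30, activation="sigmoid") with random placeholder weights.
W = rng.normal(scale=0.1, size=(hidden, n_targets))
b = np.zeros(n_targets)
out = 1.0 / (1.0 + np.exp(-(pooled @ W + b)))

print(out.shape)  # (2, 30)
```

The sigmoid keeps every output strictly between 0 and 1, which matches the range of the target labels. As far as the code above shows, no mask is passed to the pooling layer, so padded positions appear to be averaged together with real tokens.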

### Obtain inputs and targets, as well as the indices of the train/validation splits

In [None]:
output_categories = list(train.columns[11:41])
input_categories = list(train.columns[[1,2,5]])
#additional_features = list(train.columns[41:])

In [None]:
gkf = GroupKFold(n_splits=10).split(X=train.question_body, groups=train.question_body)

outputs = compute_output_arrays(train, output_categories)
inputs = compute_input_arays(train, input_categories, tokenizer, MAX_SEQUENCE_LENGTH)
test_inputs = compute_input_arays(test, input_categories, tokenizer, MAX_SEQUENCE_LENGTH)
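
Grouping by question_body keeps all answers to the same question on one side of each split. A minimal sketch on toy data (the question strings are hypothetical):

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Toy data: six answers to three distinct questions.
question_body = np.array(["q1", "q1", "q2", "q2", "q3", "q3"])

# split() returns a one-shot generator; list() keeps the folds reusable.
folds = list(GroupKFold(n_splits=3).split(X=question_body, groups=question_body))

for fold, (train_idx, valid_idx) in enumerate(folds):
    train_groups = set(question_body[train_idx])
    valid_groups = set(question_body[valid_idx])
    # No question text appears on both sides of the split.
    print(fold, sorted(valid_groups), train_groups.isdisjoint(valid_groups))
```

Note that gkf in the cell above is a generator, so it can be iterated only once; wrapping it in list() is one way to reuse the same folds.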

## Training, validation and testing

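Loops over the folds in gkf. Each fold was trained for 5 epochs with a learning rate of 1e-5 and a batch_size of 8, using a simple binary crossentropy as the objective (loss) function. The loop below does not retrain: for each of the 10 folds it loads the saved weights from the last epoch (epoch index 4) and predicts on the test set.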
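
Binary crossentropy also works with continuous targets in [0, 1]: for a soft target the loss is smallest when the prediction equals the target. A small NumPy sketch of this, where `binary_crossentropy` is a hand-written helper for illustration and not the Keras implementation:

```python
import numpy as np


def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean binary cross-entropy; y_true may be any value in [0, 1]."""
    y_pred = np.clip(y_pred, eps, 1 - eps)
    return float(np.mean(-(y_true * np.log(y_pred)
                           + (1 - y_true) * np.log(1 - y_pred))))


target = np.array([0.3])
candidates = np.array([0.1, 0.3, 0.5, 0.9])
losses = [binary_crossentropy(target, np.array([p])) for p in candidates]
best = candidates[int(np.argmin(losses))]
print(best)  # 0.3
```

This is why a sigmoid output trained with this loss is pulled towards the averaged rater scores rather than towards 0 or 1.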

In [None]:
histories = []
for fold, (train_idx, valid_idx) in enumerate(gkf):
    
    # Folds 0-6 are stored in the 'bertbase' dataset, folds 7-9 in 'bertbaseextra'.
    weights_dir = 'bertbase' if fold <= 6 else 'bertbaseextra'
    weights_path = f'/kaggle/input/{weights_dir}/bert-base-{fold}-4.h5py'
    print(weights_path)

    # Rebuild the model for every fold, load that fold's weights, predict on test.
    K.clear_session()
    model = bert_model()
    model.load_weights(weights_path)
    preds = model.predict(test_inputs, batch_size=8, verbose=1)
    histories.append(preds)

In [None]:
len(histories)

In [None]:
test_predictions_google_qa = histories

In [None]:
# Average the test predictions over all folds.
test_preds_google_qa = np.mean(test_predictions_google_qa, axis=0)
test_preds_google_qa.shape

In [None]:
sample_submission.iloc[:, 1:] = test_preds_google_qa

sample_submission.to_csv('submission.csv', index=False)