In a nutshell, our team trained three models: two [BERT](https://arxiv.org/abs/1810.04805) models and one [RoBERTa](https://arxiv.org/abs/1907.11692) model. The key ideas are:
- pretraining language models with StackExchange data and auxiliary targets
- postprocessing predictions


**Install necessary packages**
 - [mag](https://github.com/ex4sperans/mag) is a lightweight library to keep track of experiments
 - sacremoses is a dependency for transformers

In [None]:
%%time
!pip install /kaggle/input/pythonmag/mag > /dev/null
!pip install ../input/sacremoses/sacremoses-master/ > /dev/null

In [None]:
import os
import json
from collections import Counter
import numpy as np
import pandas as pd
from scipy.stats import spearmanr

In [None]:
# temp = pd.read_csv('/kaggle/input/google-quest-challenge/train.csv')
# temp = temp[2000:2476]
# temp

In [None]:
# ss = pd.read_csv('/kaggle/input/google-quest-challenge/sample_submission.csv')
# c = list(ss)
# df = temp[temp.columns.intersection(c)]
# df.to_csv("test_actual.csv", index=False)

In [None]:
# ts = pd.read_csv('/kaggle/input/google-quest-challenge/test.csv')
# d = list(ts)
# data = temp[temp.columns.intersection(d)]
# data.to_csv("test.csv", index=False)


# Inference

### Model 1. BERT base uncased

This is an uncased BERT model whose language model (LM) was finetuned on StackExchange data.

In [None]:
%%time
!python /kaggle/input/old-bert-code/predict_test.py \
  --data_path ../input/qna-custom-test-data         \
  --model_dir /kaggle/input/stackx-80-aux-ep-3       \
  --sub_file model1_bert_base_uncased_pred.csv

### Model 2. BERT base cased

This is a cased BERT model whose LM was also finetuned on StackExchange data. The code has been refactored relative to the first model.

In [None]:
%%time
!python ../input/bert-base-random-code/run.py                \
  --sub_file=model2_bert_base_cased_pred.csv                  \
  --data_path=../input/qna-custom-test-data/                    \
  --max_sequence_length=500                                     \
  --max_title_length=26                                          \
  --max_question_length=260                                       \
  --max_answer_length=210                                          \
  --batch_size=8                                                    \
  --bert_model=/kaggle/input/bert-base-pretrained/stackx-base-cased/

### Model 3. RoBERTa

Here we use both LM finetuning and pseudo-labeling.

In [None]:
# setups
# remove -2
ROBERTA_EXPERIMENT_DIR = "2-4-roberta-base-saved-5-head_tail-roberta-stackx-base-v2-pl1kksample20k-1e-05-210-260-500-26-roberta-200"
!mkdir $ROBERTA_EXPERIMENT_DIR
!ln -s /kaggle/input/roberta-stackx-base-pl20k/checkpoints $ROBERTA_EXPERIMENT_DIR/checkpoints

ROBERTA_CONFIG = {
    "_seed": 42,
    "batch_accumulation": 2,
    "batch_size": 4,
    "bert_model": "roberta-base-saved",
    "folds": 5,
    "head_tail": True,
    "label": "roberta-stackx-base-v2-pl1kksample20k",
    "lr": 1e-05,
    "max_answer_length": 210,
    "max_question_length": 260,
    "max_sequence_length": 500,
    "max_title_length": 26,
    "model_type": "roberta",
    "warmup": 200
}
with open(os.path.join(ROBERTA_EXPERIMENT_DIR, "config.json"), "w") as fp:
    json.dump(ROBERTA_CONFIG, fp)
    
!echo kek > $ROBERTA_EXPERIMENT_DIR/command

In [None]:
%%time
!python ../input/roberta-base-code/infer.py                 \
  --experiment $ROBERTA_EXPERIMENT_DIR                       \
  --checkpoint=best_model.pth                                 \
  --bert_model=/kaggle/input/roberta-base-model                \
  --dataframe=../input/qna-custom-test-data/test.csv     \
  --output_dir=roberta-base-output


# Blending and postprocessing

**First, we read the 30 target columns that we need to predict.**

In [None]:
sample_submission_df = pd.read_csv("../input/qna-custom-test-data/sample_submission.csv", 
                             index_col='qa_id')
target_columns = sample_submission_df.columns
print(f'There are {len(target_columns)} targets to predict')

train_df = pd.read_csv("../input/qna-custom-test-data/train.csv")

**Reading submission files**

In [None]:
model1_bert_base_uncased_pred_df = pd.read_csv("model1_bert_base_uncased_pred.csv")
model2_bert_base_cased_pred_df = pd.read_csv("model2_bert_base_cased_pred.csv")

**For RoBERTa, we average predictions from 5 folds**

In [None]:
roberta_base_dfs = [pd.read_csv(
                    os.path.join("roberta-base-output", "fold-{}.csv".format(fold))) 
                    for fold in range(5)]

model3_roberta_pred_df = roberta_base_dfs[0].copy()

for col in target_columns:
    model3_roberta_pred_df[col] = np.mean([df[col] for df in roberta_base_dfs], axis=0)

**Blending**

In [None]:
blended_df = model3_roberta_pred_df.copy()

# Weighted sum of the three models (the weights sum to 0.4, not 1).
for col in target_columns:
    blended_df[col] = (
        model1_bert_base_uncased_pred_df[col] * 0.1 +
        model2_bert_base_cased_pred_df[col] * 0.2 + 
        model3_roberta_pred_df[col] * 0.1 
    )
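
A note on the weights: they sum to 0.4 rather than 1. The evaluation further down uses Spearman correlation, which depends only on ranks, so multiplying a blend by a positive constant should leave the score unchanged. The sketch below illustrates this on synthetic data. The arrays `truth`, `pred_a`, `pred_b` and `pred_c` are made up for the illustration and are not competition predictions.

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)

# Hypothetical stand-ins for one target column and three model outputs.
truth = rng.random(50)
pred_a = truth + rng.normal(scale=0.3, size=50)
pred_b = truth + rng.normal(scale=0.3, size=50)
pred_c = truth + rng.normal(scale=0.3, size=50)

weights = np.array([0.1, 0.2, 0.1])      # same weights as the blend, sum = 0.4
preds = np.stack([pred_a, pred_b, pred_c])

blend_raw = weights @ preds              # weighted sum
blend_norm = blend_raw / weights.sum()   # the same blend as a weighted average

# Spearman correlation is rank-based, so both blends get the same score.
rho_raw, _ = spearmanr(blend_raw, truth)
rho_norm, _ = spearmanr(blend_norm, truth)
print(rho_raw, rho_norm)
```

Only the relative sizes of the weights matter for a rank-based metric, so normalizing them is a matter of readability rather than correctness.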

**Applying postprocessing to the final blend, also discussed [here](https://www.kaggle.com/c/google-quest-challenge/discussion/129840).**

In [None]:
def postprocess_single(target, ref):
    """
    The idea here is to make the distribution of a predicted column match
    the distribution of the corresponding column in the training dataset
    (called ref here).
    """
    
    ids = np.argsort(target)
    counts = sorted(Counter(ref).items(), key=lambda s: s[0])
    scores = np.zeros_like(target)
    
    last_pos = 0
    v = 0
    
    for value, count in counts:
        next_pos = last_pos + int(round(count / len(ref) * len(target)))
        if next_pos == last_pos:
            next_pos += 1

        cond = ids[last_pos:next_pos]
        scores[cond] = v
        last_pos = next_pos
        v += 1
        
    return scores / scores.max()
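
To make the mechanics concrete, here is a small worked example on made-up numbers. The function is copied from the cell above so that the snippet runs on its own; the `target` and `ref` arrays are hypothetical. Note that the output levels are evenly spaced rank indices (0, 0.5, 1 here), not the actual values found in `ref`: only the proportions of `ref` are reproduced.

```python
from collections import Counter

import numpy as np


def postprocess_single(target, ref):
    # Copied from the cell above so that this sketch is self-contained.
    ids = np.argsort(target)
    counts = sorted(Counter(ref).items(), key=lambda s: s[0])
    scores = np.zeros_like(target)

    last_pos = 0
    v = 0

    for value, count in counts:
        next_pos = last_pos + int(round(count / len(ref) * len(target)))
        if next_pos == last_pos:
            next_pos += 1

        cond = ids[last_pos:next_pos]
        scores[cond] = v
        last_pos = next_pos
        v += 1

    return scores / scores.max()


# Hypothetical continuous predictions for six rows.
target = np.array([0.9, 0.1, 0.5, 0.3, 0.7, 0.2])
# Hypothetical training column: three distinct values in proportions 3:2:1.
ref = np.array([0.0, 0.0, 0.0, 0.3, 0.3, 1.0])

demo_scores = postprocess_single(target, ref)
# The three smallest predictions share the lowest level, the next two
# the middle level, and the largest prediction gets the top level.
print(demo_scores)
```

The ordering of the predictions is preserved up to ties; what changes is that the predictions are bucketed into as many discrete levels as the training column has, with the same relative frequencies.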

In [None]:
def postprocess_prediction(prediction, actual):
    
    postprocessed = prediction.copy()
    
    for col in target_columns:
        scores = postprocess_single(prediction[col].values, actual[col].values)
        # These are the columns where our postprocessing gave a substantial improvement.
        # It also helped for some others, but we didn't include them as the gain was
        # very marginal (less than 0.01).
        if col in (
            "question_conversational",
            "question_type_compare",
            "question_type_definition",
            "question_type_entity",
            "question_has_commonly_accepted_answer",
            "question_type_consequence",
            "question_type_spelling"
        ):
            postprocessed[col] = scores
            
        # scale to 0-1 interval
        v = postprocessed[col].values
        postprocessed[col] = (v - v.min()) / (v.max() - v.min())
    
    return postprocessed

In [None]:
postprocessed = postprocess_prediction(blended_df, train_df)

**Saving the submission file.**

In [None]:
postprocessed.to_csv("submission.csv", index=False)

In [None]:
actual = pd.read_csv('../input/qna-custom-test-data/test_actual.csv')

In [None]:
preds_1 = model1_bert_base_uncased_pred_df
preds_1 = preds_1.assign(qa_id = actual['qa_id'])
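# Column-wise Spearman correlation. Iterating over `actual` also visits `qa_id`,
# which correlates perfectly after the assign above.
corr_1 = [spearmanr(preds_1[col], actual[col]).correlation for col in actual]
# Overwrite the entry at index 20 with 1 (presumably a column for which the
# correlation is undefined on this sample).
corr_1[20] = 1.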
np.mean(corr_1)

In [None]:
preds_2 = model2_bert_base_cased_pred_df
preds_2 = preds_2.assign(qa_id = actual['qa_id'])
corr_2 = [spearmanr(preds_2[col], actual[col]).correlation for col in actual]
corr_2[20] = 1.
np.mean(corr_2)

In [None]:
preds_3 = model3_roberta_pred_df
preds_3 = preds_3.assign(qa_id = actual['qa_id'])
corr_3 = [spearmanr(preds_3[col], actual[col]).correlation for col in actual]
corr_3[20] = 1.
np.mean(corr_3)

In [None]:
corr_blended = [spearmanr(blended_df[col], actual[col]).correlation for col in actual]
corr_blended[20] = 1.
np.mean(corr_blended)

In [None]:
corr = [spearmanr(postprocessed[col], actual[col]).correlation for col in actual]
corr[20] = 1.
np.mean(corr)
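
The cells above overwrite one entry of each correlation list by hand. As an alternative (not what this notebook does, and the resulting numbers would differ from the ones computed above), undefined correlations can be skipped with `np.nanmean`, and the loop can be restricted to an explicit list of columns. The helper `mean_spearman` and the toy frames below are hypothetical and only serve as a minimal sketch.

```python
import warnings

import numpy as np
import pandas as pd
from scipy.stats import spearmanr


def mean_spearman(pred_df, actual_df, columns):
    """Mean column-wise Spearman correlation, ignoring undefined (NaN) columns."""
    scores = []
    for col in columns:
        with warnings.catch_warnings():
            # A constant column yields NaN together with a warning; silence it.
            warnings.simplefilter("ignore")
            rho, _ = spearmanr(pred_df[col], actual_df[col])
        scores.append(rho)
    return np.nanmean(scores)


# Hypothetical toy data: column "b" is constant in the ground truth,
# so its Spearman correlation is undefined.
toy_actual = pd.DataFrame({"a": [0.0, 0.5, 1.0, 0.5], "b": [0.0, 0.0, 0.0, 0.0]})
toy_pred = pd.DataFrame({"a": [0.1, 0.4, 0.9, 0.6], "b": [0.2, 0.1, 0.3, 0.4]})

toy_score = mean_spearman(toy_pred, toy_actual, ["a", "b"])
print(toy_score)
```

With the notebook's data this would be called as `mean_spearman(postprocessed, actual, target_columns)`, which also leaves `qa_id` out of the average.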

In [None]:

import plotly.express as px

x = ['BERT uncased', 'BERT cased', 'RoBERTa', 'Combined results', 'After postprocessing']
y = [np.mean(corr_1), np.mean(corr_2), np.mean(corr_3), np.mean(corr_blended), np.mean(corr)]

df = pd.DataFrame({'Models': x, "Spearman's Correlation Score": y})

fig = px.bar(df, x='Models', y="Spearman's Correlation Score", color='Models')
fig.update_layout(yaxis_range=[0.42, 0.58])
fig.show()