## Abstract

* Computers are good at answering questions that have a single verifiable answer. For example, searching Google for “Who is the Prime Minister of India?” returns the correct answer directly. Humans, however, are much better than computers at judging the subjective aspects of a question. A few such aspects are:
    * *Is the question understandable?*
    * *Is the question conversational?*
    * *Is the answer to the question understandable?*
* The Crowdsource team at Google Research has collected data on a number of these subjective aspects for each question-answer pair. Crowdsource gathers feedback from people around the world, which gives machine learning models accurate examples to learn from and improves Google services such as Maps and Translate.

* The question-answer pairs were gathered from nearly 70 different websites. The raters received minimal guidance and training and relied largely on their own judgment to rate the subjective aspects of the prompts. Each prompt was therefore simplified so that raters could complete the task using common sense alone.

* The task is to build a predictive algorithm that quantifies these subjective aspects for a given question-answer pair.
* **Evaluation Metric** - The evaluation metric for this competition is the *Spearman Rank Correlation Coefficient*. Spearman's rank correlation is computed for each target column, and the mean of these values is the submission score.
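
As a rough illustration of the metric described above, the sketch below computes a column-wise Spearman correlation and averages it. It is a dependency-free stand-in for `scipy.stats.spearmanr`, not the competition's scoring code. The helper names (`average_ranks`, `pearson`, `spearman`, `mean_columnwise_spearman`) and the toy data are hypothetical, and it assumes every column contains at least two distinct values.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equally long lists (non-constant inputs assumed)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation is the Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def mean_columnwise_spearman(labels, preds):
    """labels and preds are lists of rows; score each column, then average."""
    n_cols = len(labels[0])
    scores = [spearman([row[c] for row in labels], [row[c] for row in preds])
              for c in range(n_cols)]
    return sum(scores) / n_cols


# Hypothetical toy data: 3 question-answer pairs, 2 target columns.
labels = [[0.0, 1.0], [0.5, 0.5], [1.0, 0.0]]
preds = [[0.1, 0.9], [0.4, 0.2], [0.8, 0.3]]
print(mean_columnwise_spearman(labels, preds))  # 0.75 (column scores 1.0 and 0.5)
```

Because only the ranks matter, any monotonic rescaling of a prediction column leaves its score unchanged.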

In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory


import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

/kaggle/input/bertbaseuncasedcomplete/bert-model/vocab.txt
/kaggle/input/bertbaseuncasedcomplete/bert-model/config.json
/kaggle/input/bertbaseuncasedcomplete/bert-model/tokenizer_config.json
/kaggle/input/bertbaseuncasedcomplete/bert-model/tf_model.h5
/kaggle/input/bertbaseuncasedcomplete/bert-model/special_tokens_map.json
/kaggle/input/bertbaseuncased/bert-base-uncased-vocab.txt
/kaggle/input/bertbaseuncased/bert-base-uncased/bert-base-uncased/config.json
/kaggle/input/bertbaseuncased/bert-base-uncased/bert-base-uncased/tf_model.h5
/kaggle/input/universalsentenceencoderqa3/saved_model.pb
/kaggle/input/universalsentenceencoderqa3/variables/variables.index
/kaggle/input/universalsentenceencoderqa3/variables/variables.data-00000-of-00001
/kaggle/input/fasttextwikinewssubwords300d/fasttext-wiki-news-subwords-300
/kaggle/input/xlnetbasecased/config-xlnet-base-cased/config.json
/kaggle/input/xlnetbasecased/model-xlnet-base-cased/config.json
/kaggle/input/xlnetbasecased/model-xlnet-base-case

In [2]:
import gc
import re
import string
from collections import Counter

import numpy as np
import pandas as pd
import tensorflow as tf
import tensorflow_hub as hub
import transformers
from gensim.models import Word2Vec
from scipy.stats import rankdata, spearmanr
from sklearn.linear_model import MultiTaskElasticNet
from sklearn.model_selection import GroupKFold, KFold
from sklearn.preprocessing import MinMaxScaler
from transformers import (BertConfig, BertTokenizer, TFBertMainLayer, TFBertModel,
                          TFXLNetMainLayer, TFXLNetModel, XLNetConfig, XLNetTokenizer)

In [3]:
DIR = '/kaggle/input/google-quest-challenge'

In [4]:
xlnet_tokenizer = XLNetTokenizer.from_pretrained('/kaggle/input/xlnetbasecased/tokenizer')
bert_tokenizer = BertTokenizer.from_pretrained('/kaggle/input/bertbaseuncasedcomplete/bert-model')

In [5]:
train_df = pd.read_csv(DIR+'/train.csv')
test_df = pd.read_csv(DIR+'/test.csv')
cols = train_df.columns[11:]

In [6]:
def tokenize_input(tokenizer, s1, s2, tags, data_name, max_length, tokenizer_name = 'bert'):
    # Encode a sentence pair when s2 is given, otherwise a single sentence.
    if s2 is not None:
        x = tokenizer.encode_plus(s1, s2, pad_to_max_length=False)
    else:
        x = tokenizer.encode_plus(s1)

    keys = ['input_ids', 'attention_mask', 'token_type_ids']
    if len(x['input_ids']) > max_length:
        # Too long: keep the first 25% of the token budget from the start
        # and fill the remaining 75% from the end of the sequence.
        segment_1 = int(0.25*max_length)
        for key in keys:
            x[key] = x[key][:segment_1] + x[key][-(max_length-segment_1):]
    else:
        # Too short: pad up to max_length (XLNet on the left, BERT on the right).
        diff = max_length - len(x['input_ids'])
        padding = {'input_ids': [tokenizer.pad_token_id]*diff,
                   'attention_mask': [0]*diff,
                   'token_type_ids': [tokenizer.pad_token_type_id]*diff}
        for key in keys:
            if tokenizer_name == 'xlnet':
                x[key] = padding[key] + x[key]
            else:
                x[key] = x[key] + padding[key]

    # Store the encoded fields under the caller's tag names in the global `data` dict.
    data[data_name][tags[0]].append(x['input_ids'])
    data[data_name][tags[1]].append(x['token_type_ids'])
    data[data_name][tags[2]].append(x['attention_mask'])

data = {}
tags = ['input_ids', 'token_type_ids', 'attention_masks']
# Input variants: t_a = title + answer, q_a = body + answer, t_q = title + body,
# q = body only, a = answer only. Each variant holds one list per tag.
# ******************************************XLNET*************************************************************************************
for split in ('train', 'test'):
    for variant in ('t_a', 'q_a', 't_q', 'q', 'a'):
        data[f'xlnet_{split}_{variant}'] = {tag: [] for tag in tags}



for i in range(train_df.shape[0]):
    tokenize_input(xlnet_tokenizer, train_df.loc[i, 'question_title'], train_df.loc[i, 'answer'], tags, 
                   'xlnet_train_t_a', 512, 'xlnet')
    tokenize_input(xlnet_tokenizer, train_df.loc[i, 'question_body'], train_df.loc[i, 'answer'], tags, 
                   'xlnet_train_q_a', 512, 'xlnet')
    tokenize_input(xlnet_tokenizer, train_df.loc[i, 'question_title'], train_df.loc[i, 'question_body'], tags, 
                   'xlnet_train_t_q', 512, 'xlnet')
    tokenize_input(xlnet_tokenizer, train_df.loc[i, 'question_body'], None, tags, 'xlnet_train_q', 512, 'xlnet')
    tokenize_input(xlnet_tokenizer, train_df.loc[i, 'answer'], None, tags, 'xlnet_train_a', 512, 'xlnet')
for i in range(test_df.shape[0]):
    tokenize_input(xlnet_tokenizer, test_df.loc[i, 'question_title'], test_df.loc[i, 'answer'], tags, 
                   'xlnet_test_t_a', 512, 'xlnet')
    tokenize_input(xlnet_tokenizer, test_df.loc[i, 'question_body'], test_df.loc[i, 'answer'], tags, 
                   'xlnet_test_q_a', 512, 'xlnet')
    tokenize_input(xlnet_tokenizer, test_df.loc[i, 'question_title'], test_df.loc[i, 'question_body'], tags, 
                   'xlnet_test_t_q', 512, 'xlnet')
    tokenize_input(xlnet_tokenizer, test_df.loc[i, 'question_body'], None, tags, 'xlnet_test_q', 512, 'xlnet')
    tokenize_input(xlnet_tokenizer, test_df.loc[i, 'answer'], None, tags, 'xlnet_test_a', 512, 'xlnet')

# ******************************************BERT*************************************************************************************

# Same input variants as for XLNet, one empty list per tag.
for split in ('train', 'test'):
    for variant in ('t_a', 'q_a', 't_q', 'q', 'a'):
        data[f'bert_{split}_{variant}'] = {tag: [] for tag in tags}

for i in range(train_df.shape[0]):
    tokenize_input(bert_tokenizer, train_df.loc[i, 'question_title'], train_df.loc[i, 'answer'], tags, 
                   'bert_train_t_a', 512, 'bert')
    tokenize_input(bert_tokenizer, train_df.loc[i, 'question_body'], train_df.loc[i, 'answer'], tags, 
                   'bert_train_q_a', 512, 'bert')
    tokenize_input(bert_tokenizer, train_df.loc[i, 'question_title'], train_df.loc[i, 'question_body'], tags, 
                   'bert_train_t_q', 512, 'bert')
    tokenize_input(bert_tokenizer, train_df.loc[i, 'question_body'], None, tags, 'bert_train_q', 512, 'bert')
    tokenize_input(bert_tokenizer, train_df.loc[i, 'answer'], None, tags, 'bert_train_a', 512, 'bert')
for i in range(test_df.shape[0]):
    tokenize_input(bert_tokenizer, test_df.loc[i, 'question_title'], test_df.loc[i, 'answer'], tags, 
                   'bert_test_t_a', 512, 'bert')
    tokenize_input(bert_tokenizer, test_df.loc[i, 'question_body'], test_df.loc[i, 'answer'], tags, 
                   'bert_test_q_a', 512, 'bert')
    tokenize_input(bert_tokenizer, test_df.loc[i, 'question_title'], test_df.loc[i, 'question_body'], tags, 
                   'bert_test_t_q', 512, 'bert')
    tokenize_input(bert_tokenizer, test_df.loc[i, 'question_body'], None, tags, 'bert_test_q', 512, 'bert')
    tokenize_input(bert_tokenizer, test_df.loc[i, 'answer'], None, tags, 'bert_test_a', 512, 'bert')

for key, _ in data.items():
    for k, _ in data[key].items():
        data[key][k] = np.array(data[key][k])
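
The length handling inside `tokenize_input` can be sketched on plain lists, without a tokenizer. The function below is a simplified illustration, not the notebook's code: `fit_to_length`, `pad_id` and `pad_left` are hypothetical names, and `max_length` is assumed to be at least 1. Sequences that are too long keep the first 25% of the budget from the head and the rest from the tail. Shorter ones are padded on the left (as done for XLNet) or on the right (as done for BERT).

```python
HEAD_FRACTION = 0.25  # share of max_length taken from the start of a long sequence


def fit_to_length(ids, max_length, pad_id=0, pad_left=False):
    """Return a copy of ids truncated or padded to exactly max_length items."""
    if len(ids) > max_length:
        head = int(HEAD_FRACTION * max_length)
        tail = max_length - head  # always >= 1 when max_length >= 1
        return ids[:head] + ids[-tail:]
    padding = [pad_id] * (max_length - len(ids))
    return padding + ids if pad_left else ids + padding


long_ids = list(range(1, 13))                        # 12 hypothetical token ids
print(fit_to_length(long_ids, 8))                    # [1, 2, 7, 8, 9, 10, 11, 12]
print(fit_to_length([5, 6, 7], 6))                   # [5, 6, 7, 0, 0, 0]
print(fit_to_length([5, 6, 7], 6, pad_left=True))    # [0, 0, 0, 5, 6, 7]
```

Keeping both ends preserves the special tokens that the tokenizer places at the start or end of the encoded sequence, at the cost of dropping the middle of long texts.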

In [7]:
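def SpearmanCorrCoeff(A, B):
    # Mean of the column-wise Spearman correlations between labels A and predictions B.
    overall_score = 0
    for index in range(A.shape[1]):
        overall_score += spearmanr(A[:, index], B[:, index]).correlation
    # Divide by the actual number of columns so single-column calls are scored correctly.
    return overall_score/A.shape[1]


class PredictCallback(tf.keras.callbacks.Callback):
    def __init__(self, data, labels):
        super().__init__()
        self.data = data
        self.labels = labels

    def on_epoch_end(self, epoch, logs = None):
        # Report the validation Spearman score after every epoch.
        predictions = self.model.predict(self.data)
        print('\n\t Validation Score - ' + str(SpearmanCorrCoeff(self.labels, predictions)))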

In [8]:
class BERT(TFBertModel):
    def __init__(self, config, *inputs, **kwrgs):
        super(BERT, self).__init__(config, *inputs, **kwrgs)
        self.bert = TFBertMainLayer(config, name = 'bert')
        for i in range(1, 45):
            self.bert.submodules[-i].trainable = False
      
    def call(self, inputs, **kwrgs):
        outputs = self.bert(inputs)
        hidden_states = outputs[2]
        h12 = hidden_states[-1][:, 0, :]
        h11 = hidden_states[-2][:, 0, :]
        h10 = hidden_states[-3][:, 0, :]
        h9 = hidden_states[-4][:, 0, :]
        concat = tf.keras.layers.Concatenate(axis = -1)([h9, h10, h11, h12])
        return concat

class XLNet(TFXLNetModel):
    def __init__(self, config, *inputs, **kwrgs):
        super(XLNet, self).__init__(config, *inputs, **kwrgs)
        self.transformer = TFXLNetMainLayer(config, name = 'transformer')
        for i in range(1, 3):
            self.transformer.layer[-i].trainable = False
    def call(self, inputs, **kwrgs):
        outputs = self.transformer(inputs)
        hidden_states = outputs[1]
        h12 = hidden_states[-1][:, 0, :]
        h11 = hidden_states[-2][:, 0, :]
        h10 = hidden_states[-3][:, 0, :]
        h9 = hidden_states[-4][:, 0, :]
        concat = tf.keras.layers.Concatenate(axis = -1)([h9, h10, h11, h12])
        return concat

In [9]:
def create_model(name):
    id_1 = tf.keras.Input(shape = (512,), dtype = tf.int32)
    id_2 = tf.keras.Input(shape = (512,), dtype = tf.int32)

    type_id_1 = tf.keras.Input(shape = (512,), dtype = tf.int32)
    type_id_2 = tf.keras.Input(shape = (512,), dtype = tf.int32)

    a1 = tf.keras.Input(shape = (512,), dtype = tf.int32)
    a2 = tf.keras.Input(shape = (512,), dtype = tf.int32)
    if name == 'xlnet':
        config = XLNetConfig.from_pretrained('/kaggle/input/xlnetbasecased/config-xlnet-base-cased', 
                                             output_hidden_states = True)
        transformer = XLNet.from_pretrained('/kaggle/input/xlnetbasecased/model-xlnet-base-cased', 
                                            config = config)
                                                
    else:
        config = BertConfig.from_pretrained('/kaggle/input/bertbaseuncasedcomplete/bert-model', 
                                            output_hidden_states = True)
        transformer = BERT.from_pretrained('/kaggle/input/bertbaseuncasedcomplete/bert-model', 
                                           config = config)
  
    out_1 = transformer({'input_ids':id_1, 'attention_mask':a1, 'token_type_ids':type_id_1})
    out_2 = transformer({'input_ids':id_2, 'attention_mask':a2, 'token_type_ids':type_id_2})
  
    concat = tf.keras.layers.Concatenate(axis = -1)([out_1, out_2])
    dense = tf.keras.layers.Dense(30, activation = 'sigmoid')(concat)
    return tf.keras.Model(inputs = [id_1, id_2, type_id_1, type_id_2, a1, a2], outputs = [dense])

In [10]:
gkf = GroupKFold(n_splits = 5).split(X = train_df.url, groups = train_df.url)
for fold, (train_idx, valid_idx) in enumerate(gkf):
    if fold != 0:
        continue
    final_outputs = train_df[cols].values.astype(np.float16)
    tf.keras.backend.clear_session()
    # Input order matches create_model: ids, token type ids, attention masks,
    # each for the answer first and the title + body pair second.
    xlnet_train_inputs = (
        data['xlnet_train_a'][tags[0]][train_idx], data['xlnet_train_t_q'][tags[0]][train_idx],
        data['xlnet_train_a'][tags[1]][train_idx], data['xlnet_train_t_q'][tags[1]][train_idx],
        data['xlnet_train_a'][tags[2]][train_idx], data['xlnet_train_t_q'][tags[2]][train_idx],
    )
    xlnet_valid_inputs = (
        data['xlnet_train_a'][tags[0]][valid_idx], data['xlnet_train_t_q'][tags[0]][valid_idx],
        data['xlnet_train_a'][tags[1]][valid_idx], data['xlnet_train_t_q'][tags[1]][valid_idx],
        data['xlnet_train_a'][tags[2]][valid_idx], data['xlnet_train_t_q'][tags[2]][valid_idx],
    )
    bert_train_inputs = (
        data['bert_train_a'][tags[0]][train_idx], data['bert_train_t_q'][tags[0]][train_idx],
        data['bert_train_a'][tags[1]][train_idx], data['bert_train_t_q'][tags[1]][train_idx],
        data['bert_train_a'][tags[2]][train_idx], data['bert_train_t_q'][tags[2]][train_idx],
    )
    bert_valid_inputs = (
        data['bert_train_a'][tags[0]][valid_idx], data['bert_train_t_q'][tags[0]][valid_idx],
        data['bert_train_a'][tags[1]][valid_idx], data['bert_train_t_q'][tags[1]][valid_idx],
        data['bert_train_a'][tags[2]][valid_idx], data['bert_train_t_q'][tags[2]][valid_idx],
    )


    xlnet_model = create_model('xlnet')
    xlnet_model.compile(tf.keras.optimizers.Adam(learning_rate = 2.3*1e-5), 
                        loss = tf.keras.losses.BinaryCrossentropy())
    xlnet_model.fit(x = xlnet_train_inputs, y = final_outputs[train_idx], epochs = 2, batch_size = 4, 
                    steps_per_epoch = train_idx.shape[0]//4)
    tf.keras.backend.clear_session()
    print("################################################################################\n")
    bert_model = create_model('bert')
    bert_model.compile(tf.keras.optimizers.Adam(learning_rate = 2.3*1e-5), 
                       loss = tf.keras.losses.BinaryCrossentropy())
    bert_model.fit(x = bert_train_inputs, y = final_outputs[train_idx], epochs = 2, batch_size = 4, 
                   steps_per_epoch = train_idx.shape[0]//4)

    break

Train on 4863 samples
Epoch 1/2
####################################################################################################################

Train on 4863 samples
Epoch 1/2

In [11]:
class Optimize:
    def __init__(self):
        self.clips = [[0, 1] for i in range(30)]
        self.ab_ = [(0, 0.15), (0.85, 1)]
        self.new_scores, self.scores = (None, None)
    def fit(self, labels, preds):
        self.scores = [SpearmanCorrCoeff(labels[:, i:i+1], preds[:, i:i+1]) for i in range(30)]
        for i in range(30):
            self.golden_section_search(labels[:, i:i+1], preds[:, i:i+1], i, 0)
            self.golden_section_search(labels[:, i:i+1], preds[:, i:i+1], i, 1)
        self.new_scores = [np.nan_to_num(SpearmanCorrCoeff(labels[:, i:i+1], 
            np.clip(preds[:, i:i+1], self.clips[i][0], self.clips[i][1]))) for i in range(30)]
        for i in range(30):
            if self.scores[i] >= self.new_scores[i]:
                self.clips[i] = [0, 1]
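    def golden_section_search(self, labels, preds, i, idx):
        # Search the lower (idx = 0) or upper (idx = 1) clip bound of column i
        # inside the interval self.ab_[idx], looking for the highest Spearman score.
        (a, b) = self.ab_[idx]
        c = 0.618  # approximate inverse of the golden ratio
        x1 = b - c*(b-a)
        x2 = (b-a)*c + a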
        
        for epochs in range(10):
            self.clips[i][idx] = x1
            score_a = -self.score(labels, preds, i)
            self.clips[i][idx] = x2
            score_b = -self.score(labels, preds, i)
            if np.isnan(score_a):
                continue
            elif np.isnan(score_b):
                continue
            elif score_a <= score_b:
                b = x2
                x2 = x1
                x1 = b - c*(b-a)
            else:
                a = x1
                x1 = x2
                x2 = (b-a)*c + a
        
        self.clips[i][idx] = x1
        score_x1 = self.score(labels, preds, i)
        self.clips[i][idx] = x2
        score_x2 = self.score(labels, preds, i)
        if score_x1 > score_x2:
            self.clips[i][idx] = x1
        else:
            self.clips[i][idx] = x2
                    
            
    def score(self, labels, preds, i):
        return SpearmanCorrCoeff(labels, np.clip(preds, self.clips[i][0], self.clips[i][1]))
    def transform(self, preds):
        temp = preds.copy()
        for i in range(30):
            clipped = np.clip(preds[:, i], self.clips[i][0], self.clips[i][1])
            if np.unique(clipped).shape[0] > 1:
                temp[:, i][:] = clipped
        return temp
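
The search used by `Optimize.golden_section_search` can be illustrated on a simple one-dimensional function. The sketch below is a generic golden-section minimizer under stated assumptions, not the notebook's implementation: `golden_section_minimize` and the quadratic test function are hypothetical, the objective is assumed to be unimodal on `[a, b]`, and function values are cached between iterations (the notebook instead re-evaluates both probe points, runs 10 iterations, and minimizes the negated Spearman score).

```python
INV_PHI = (5 ** 0.5 - 1) / 2  # inverse golden ratio, about 0.618


def golden_section_minimize(f, a, b, iterations=40):
    """Shrink [a, b] around the minimum of a unimodal function f."""
    x1 = b - INV_PHI * (b - a)
    x2 = a + INV_PHI * (b - a)
    f1, f2 = f(x1), f(x2)
    for _ in range(iterations):
        if f1 <= f2:
            # Minimum lies in [a, x2]: drop the right part and reuse x1 as the new x2.
            b, x2, f2 = x2, x1, f1
            x1 = b - INV_PHI * (b - a)
            f1 = f(x1)
        else:
            # Minimum lies in [x1, b]: drop the left part and reuse x2 as the new x1.
            a, x1, f1 = x1, x2, f2
            x2 = a + INV_PHI * (b - a)
            f2 = f(x2)
    return (a + b) / 2


# Hypothetical objective with its minimum at x = 0.3.
best = golden_section_minimize(lambda x: (x - 0.3) ** 2, 0.0, 1.0)
print(round(best, 6))  # 0.3
```

Each iteration shrinks the interval by a factor of about 0.618, so 10 iterations over an interval of width 0.15 narrow a clip bound to roughly 0.001. A score that is flat or noisy in the clip value is not strictly unimodal, in which case the search is only a heuristic.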

In [12]:
xlnet_inputs = (
    data['xlnet_train_a'][tags[0]], data['xlnet_train_t_q'][tags[0]],
    data['xlnet_train_a'][tags[1]], data['xlnet_train_t_q'][tags[1]],
    data['xlnet_train_a'][tags[2]], data['xlnet_train_t_q'][tags[2]],
)

bert_inputs = (
    data['bert_train_a'][tags[0]], data['bert_train_t_q'][tags[0]],
    data['bert_train_a'][tags[1]], data['bert_train_t_q'][tags[1]],
    data['bert_train_a'][tags[2]], data['bert_train_t_q'][tags[2]],
)
xlnet_predictions = (xlnet_model.predict(xlnet_inputs))
bert_predictions = (bert_model.predict(bert_inputs))
mean_predictions = (0.5*xlnet_predictions + 0.5*bert_predictions)
train_y = final_outputs[train_idx]
valid_y = final_outputs[valid_idx]
mean_train_preds = mean_predictions[train_idx]
mean_valid_preds = mean_predictions[valid_idx]

opt = Optimize()
opt.fit(train_y, mean_train_preds)
post_valid_preds = opt.transform(mean_valid_preds)
print(f"Validation Score (Before) {SpearmanCorrCoeff(valid_y, mean_valid_preds)}")
print(f"Validation Score (After) {SpearmanCorrCoeff(valid_y, post_valid_preds)}")


Validation Score (Before) 0.4035256392684768
Validation Score (After) 0.416809994511061
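
One plausible reason clipping can raise a rank-based score is that the labels take only a few distinct values, so many of them are tied. Clipping collapses small differences between predictions into ties as well. The notebook does not state this explanation, so treat the sketch below as an assumption illustrated on hypothetical data. `average_ranks` is a hypothetical helper that assigns tied values their mean rank.

```python
def average_ranks(values):
    """Rank of v = (count of smaller values) + (count of values equal to v + 1) / 2."""
    return [sum(u < v for u in values) + (sum(u == v for u in values) + 1) / 2
            for v in values]


labels = [0.0, 0.0, 0.0, 0.5, 1.0]        # three tied labels at the bottom
preds = [0.10, 0.05, 0.12, 0.40, 0.90]    # model output orders the tied items arbitrarily
clipped = [min(max(p, 0.15), 1.0) for p in preds]  # lower clip bound of 0.15

print(average_ranks(labels))   # [2.0, 2.0, 2.0, 4.0, 5.0]
print(average_ranks(preds))    # [2.0, 1.0, 3.0, 4.0, 5.0]
print(average_ranks(clipped))  # [2.0, 2.0, 2.0, 4.0, 5.0]
```

After clipping, the prediction ranks match the label ranks exactly, so the Spearman correlation for this toy column becomes 1. Before clipping it is lower, because the three tied labels are ranked in an arbitrary order.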


In [13]:
xlnet_test_inputs = (
    data['xlnet_test_a'][tags[0]], data['xlnet_test_t_q'][tags[0]],
    data['xlnet_test_a'][tags[1]], data['xlnet_test_t_q'][tags[1]],
    data['xlnet_test_a'][tags[2]], data['xlnet_test_t_q'][tags[2]],
)
bert_test_inputs = (
    data['bert_test_a'][tags[0]], data['bert_test_t_q'][tags[0]],
    data['bert_test_a'][tags[1]], data['bert_test_t_q'][tags[1]],
    data['bert_test_a'][tags[2]], data['bert_test_t_q'][tags[2]],
)
xlnet_test_preds = xlnet_model.predict(xlnet_test_inputs)
bert_test_preds = bert_model.predict(bert_test_inputs)
mean_test_preds = 0.5*xlnet_test_preds + 0.5*bert_test_preds
post_test_preds = opt.transform(mean_test_preds)

submission = pd.read_csv(DIR+'/sample_submission.csv')
submission.iloc[:,1:] = post_test_preds
submission.to_csv("submission.csv", index = False)
submission.head()

|  | qa_id | question_asker_intent_understanding | question_body_critical | question_conversational | question_expect_short_answer | question_fact_seeking | question_has_commonly_accepted_answer | question_interestingness_others | question_interestingness_self | question_multi_intent | ... | question_well_written | answer_helpful | answer_level_of_information | answer_plausible | answer_relevance | answer_satisfaction | answer_type_instructions | answer_type_procedure | answer_type_reason_explanation | answer_well_written |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 39 | 0.945288 | 0.672799 | 0.338741 | 0.396063 | 0.581378 | 0.448282 | 0.695343 | 0.743659 | 0.530865 | ... | 0.919896 | 0.897294 | 0.406054 | 0.957358 | 0.956077 | 0.736326 | 0.091482 | 0.027052 | 0.857635 | 0.946178 |
| 1 | 46 | 0.882821 | 0.414841 | 0.148314 | 0.631014 | 0.787929 | 0.850671 | 0.509879 | 0.489665 | 0.139953 | ... | 0.725094 | 0.954958 | 0.624564 | 0.976141 | 0.964589 | 0.870414 | 0.901102 | 0.173739 | 0.155895 | 0.883272 |
| 2 | 70 | 0.925923 | 0.579584 | 0.148314 | 0.672972 | 0.878834 | 0.858648 | 0.618461 | 0.561834 | 0.263809 | ... | 0.866208 | 0.91986 | 0.557448 | 0.976141 | 0.964589 | 0.803503 | 0.091482 | 0.043473 | 0.863898 | 0.929245 |
| 3 | 132 | 0.865351 | 0.321677 | 0.148314 | 0.641319 | 0.71937 | 0.858648 | 0.559521 | 0.358044 | 0.139953 | ... | 0.669263 | 0.944348 | 0.656321 | 0.976141 | 0.964589 | 0.890407 | 0.790419 | 0.199159 | 0.649088 | 0.901653 |
| 4 | 200 | 0.922602 | 0.473999 | 0.148314 | 0.766476 | 0.793914 | 0.854842 | 0.598358 | 0.61241 | 0.254969 | ... | 0.788498 | 0.916623 | 0.641902 | 0.972269 | 0.962273 | 0.841106 | 0.153972 | 0.076906 | 0.69978 | 0.91485 |


* Private Score - 0.38023 | Public Score - 0.40135
* Top 10% in private submission | Top 11% in public submission
* Here is the kernel - https://www.kaggle.com/varunsaproo/xlnet-based?scriptVersionId=41365639 