# Ranking with Antique

### ANTIQUE: A Non-Factoid Question Answering Benchmark
ANTIQUE is a non-factoid question answering benchmark based on the questions and answers of Yahoo! Webscope L6.

A brief explanation of the dataset: each query has several candidate answers (documents). Every document carries a relevance label from 1 to 4, where 4 means very relevant and 1 means not relevant. Rather than treating this as a classification problem, we use list-wise ranking methods to order the documents for each query.
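
To make the list-wise framing concrete, here is a minimal sketch of NDCG, a common list-wise evaluation metric, computed over the 1-4 relevance labels of a single query. The helper names `dcg` and `ndcg` and the exponential gain `2**rel - 1` are illustrative assumptions; the notebook does not state which metric or gain it uses.

```python
import math

def dcg(relevances):
    """Discounted cumulative gain of relevance labels given in ranked order."""
    return sum((2 ** rel - 1) / math.log2(rank + 2)
               for rank, rel in enumerate(relevances))

def ndcg(relevances):
    """DCG divided by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

perfect = [4, 3, 2, 1]   # most relevant document ranked first
swapped = [1, 3, 2, 4]   # most relevant document ranked last

print(ndcg(perfect))
print(ndcg(swapped))
```

A classifier would score each document in isolation, whereas a metric like this rewards placing the most relevant documents at the top of each query's list.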

In [3]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [4]:
!pip install transformers
!pip install tensorflow
!pip install bert-tensorflow
!pip install tensorflow_ranking

Collecting tensorflow_ranking
  Downloading https://files.pythonhosted.org/packages/88/7f/283cf82c2888d010ec19a6dc0ba652c26f8acbcc55685becaf5d4c78b0cc/tensorflow_ranking-0.3.1-py2.py3-none-any.whl (86kB)
Collecting tensorflow-serving-api<3.0.0,>=2.0.0
  Downloading https://files.pythonhosted.org/packages/7f/9d/b8a604630c51f32f4de8cc31da559387761daab7732df38a0c6b9df28219/tensorflow_serving_api-2.2.0-py2.py3-none-any.whl
Installing collected packages: tensorflow-serving-api, tensorflow-ranking
Successfully installed tensorflow-ranking-0.3.1 tensorflow-serving-api-2.2.0


In [5]:
!nvidia-smi

Fri Jun 26 02:39:44 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.36.06    Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla K80           Off  | 00000000:00:04.0 Off |                    0 |
| N/A   53C    P8    30W / 149W |      0MiB / 11441MiB |      0%      Default |
|                               |                      |                 ERR! |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [6]:
import pandas as pd
from transformers import TFBertModel, BertTokenizer
import bert.tokenization as tokenization
import tensorflow as tf
import tensorflow_ranking as tfr
import itertools
import numpy as np
import time
import matplotlib.pyplot as plt
from tensorflow.keras.callbacks import ModelCheckpoint, Callback, ReduceLROnPlateau, LearningRateScheduler, EarlyStopping, TensorBoard
from tensorflow.keras.callbacks import LambdaCallback, CSVLogger
from multiprocessing import Pool

In [7]:
DOC_PATH = './drive/My Drive/Antique_Dataset/antique-collection.txt'
TRAIN_ID_PATH = './drive/My Drive/Antique_Dataset/antique-train.qrel'
TRAIN_DATA_PATH = './drive/My Drive/Antique_Dataset/antique-train-queries.txt'
TEST_ID_PATH = './drive/My Drive/Antique_Dataset/antique-test.qrel'
TEST_DATA_PATH = './drive/My Drive/Antique_Dataset/antique-test-queries.txt'
VOCAB_PATH ='./drive/My Drive/Antique_Dataset/vocab.txt'

In [8]:
doc_df = pd.read_csv(DOC_PATH, sep='\t', names=['doc_id', 'doc'])
train_id_df = pd.read_csv(TRAIN_ID_PATH, sep=' ', names=['query_id', 'source', 'doc_id', 'relevance'])
train_data_df = pd.read_csv(TRAIN_DATA_PATH, sep='\t', names=['query_id', 'query'])
test_id_df = pd.read_csv(TEST_ID_PATH, sep=' ', names=['query_id', 'source', 'doc_id', 'relevance'])
test_data_df = pd.read_csv(TEST_DATA_PATH, sep='\t', names=['query_id', 'query'])

In [9]:
# The training qrel file is not consistently formatted (some lines appear to be
# tab-delimited, others space-delimited), so a custom reader is used.

def read_erratic_data(path):
    query_ids, sources, doc_ids, relevances = [],[],[],[]
    with open(path, 'r') as f:
        for i,line in enumerate(f):
            if '\t' in line:
                query_id, source, doc_id, relevance = line.rstrip().split('\t')
            else:
                query_id, source, doc_id, relevance = line.rstrip().split()
            
            query_ids.append(query_id)
            sources.append(source)
            doc_ids.append(doc_id)
            relevances.append(relevance)
    
    df = pd.DataFrame({
        'query_id':query_ids,
        'source':sources,
        'doc_id':doc_ids,
        'relevance':relevances
    })
    
    #df = dd.from_pandas(df, npartitions=1)
    return df
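
As a quick check of the parsing rule above, the sketch below applies the same split logic to two in-memory lines, one tab-delimited and one space-delimited. The helper name `parse_qrel_line` is hypothetical and the sample line reuses values shown in the merged table later in the notebook. Note that `str.split()` with no argument already splits on any whitespace, tabs included, so for fields without embedded spaces both branches give the same result.

```python
def parse_qrel_line(line):
    """Split one qrel line into (query_id, source, doc_id, relevance)."""
    if '\t' in line:
        fields = line.rstrip().split('\t')
    else:
        fields = line.rstrip().split()
    return tuple(fields)

tab_line = "2531329\tU0\t2531329_0\t4\n"
space_line = "2531329 U0 2531329_0 4\n"

print(parse_qrel_line(tab_line))
print(parse_qrel_line(space_line))
```

Every field comes back as a string, which is why the relevance column is later compared against `"4"` rather than the integer 4.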

In [10]:
train_id_df = read_erratic_data(TRAIN_ID_PATH)

In [11]:
# Check whether there are any null values
train_id_df.isnull().values.any()

False

In [12]:
# Show the doc_df 
doc_df

Unnamed: 0,doc_id,doc
0,2020338_0,A small group of politicians believed strongly...
1,2020338_1,Because there is a lot of oil in Iraq.
2,2020338_2,It is tempting to say that the US invaded Iraq...
3,2020338_3,I think Yuval is pretty spot on. It's a provin...
4,2874684_0,Call an area apiarist. They should be able to...
...,...,...
403453,1424320_5,You could try to get the owners of the propert...
403454,1424320_6,"Yes, but it depends on your Credit and Income ..."
403455,1424320_7,I can provide you non-owner financing all the ...
403456,1424320_8,"As others pointed out, there are investor lend..."


In [13]:
# Illustration only: these steps are repeated inside preprocess_df
# Merge the document text onto each doc_id
train_data_pre = pd.merge(train_id_df, doc_df, on='doc_id', how='left')
# Casting the query_id column to int64 type for merging
train_data_pre = train_data_pre.astype({'query_id': 'int64'})

In [14]:
# Illustration only: these steps are repeated inside preprocess_df
# Merge the query text onto each query_id
train_data_final = pd.merge(train_data_pre, train_data_df, on='query_id', how='left')
train_data_final

Unnamed: 0,query_id,source,doc_id,relevance,doc,query
0,2531329,U0,2531329_0,4,I do it all the time. It is kind of a ritual ...,Why do some men spit into the urinal before ur...
1,2531329,Q0,2531329_5,4,To clear out the mucus deep down in the throat...,Why do some men spit into the urinal before ur...
2,2531329,Q0,2531329_4,3,"maybe they want a target to hit. Well, I gues...",Why do some men spit into the urinal before ur...
3,2531329,Q0,2531329_7,3,Where else would we spit?... Apart from sports...,Why do some men spit into the urinal before ur...
4,2531329,Q0,2531329_6,3,Because they have a cough or phlegm and hacked...,Why do some men spit into the urinal before ur...
...,...,...,...,...,...,...
27417,884731,U0,884731_0,4,Padre is a word used in both Spanish and Portu...,What does the word Padre mean in english?
27418,884731,Q0,884731_4,4,Father,What does the word Padre mean in english?
27419,884731,Q0,884731_2,4,"the word padre in english means ""father""",What does the word Padre mean in english?
27420,884731,Q0,884731_3,4,Father.,What does the word Padre mean in english?


In [15]:
# Sampling a doc that is unrelated to the query

def sample_random_example(query_id, doc_df):
    while True:
        row = doc_df.sample(n=1)
        if str(query_id) in row['doc_id'].iloc[0]:
            continue
        else:
            return row['doc'].iloc[0], row['doc_id'].iloc[0]
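
One subtlety in deciding whether a sampled document belongs to the query: the doc ids in this collection look like `<question id>_<answer index>` (for example `2020338_0`), so a plain substring test can also match ids that merely contain the query id's digits. The sketch below contrasts a substring test with a prefix comparison. Both helper names are hypothetical, and treating the prefix as the originating question id is an assumption based on the ids shown in the tables.

```python
def matches_by_substring(query_id, doc_id):
    """True if the query id's digits appear anywhere in the doc id."""
    return str(query_id) in doc_id

def matches_by_prefix(query_id, doc_id):
    """True only if the part before the underscore equals the query id."""
    return doc_id.split('_')[0] == str(query_id)

# '20338' is contained in '2020338', so the substring test fires
print(matches_by_substring(20338, '2020338_0'))
print(matches_by_prefix(20338, '2020338_0'))

# A genuine match passes both tests
print(matches_by_substring(884731, '884731_4'))
print(matches_by_prefix(884731, '884731_4'))
```

For negative sampling the substring test only rejects a few extra candidates, so it is conservative rather than harmful; the prefix test is simply the stricter reading of "same question".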

In [16]:
# doc, doc_id = sample_random_example(query_id, doc_df)
# print(doc)
# print(doc_id)

In [17]:
# Compute how many additional examples to add so that each query
# ends up with a multiple of 10 documents (and at least 10)

def additional_count(current_total, current_number_of_pos):
    if ((current_number_of_pos*2 + current_total)//10) * 10 < current_total:
        number = current_total*2 // 10 * 10 - current_total
    else:
        number = ((current_number_of_pos*2 + current_total)//10) * 10 - current_total
    if number + current_total < 10:
        return 10 - current_total
    if (number > current_total):
        number = number - (current_number_of_pos//10 * 10)
    return number
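
The arithmetic in `additional_count` is easier to follow with a few concrete inputs. The snippet below repeats the function unchanged so that it runs on its own, then evaluates it for some made-up `(total, positives)` pairs; the pairs are illustrative and not taken from the dataset.

```python
def additional_count(current_total, current_number_of_pos):
    if ((current_number_of_pos*2 + current_total)//10) * 10 < current_total:
        number = current_total*2 // 10 * 10 - current_total
    else:
        number = ((current_number_of_pos*2 + current_total)//10) * 10 - current_total
    if number + current_total < 10:
        return 10 - current_total
    if (number > current_total):
        number = number - (current_number_of_pos//10 * 10)
    return number

# (documents already judged for the query, how many of them have relevance 4)
cases = [(7, 2), (3, 1), (12, 5), (15, 1)]
for total, positives in cases:
    extra = additional_count(total, positives)
    print(f"total={total} positives={positives} -> add {extra}, new total {total + extra}")
```

In each of these cases the new total is a multiple of 10, which is the property the comment above the function describes.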

In [18]:
# Full function used in preprocess_df.
# Adds randomly sampled irrelevant documents (relevance '1') to each training query,
# because such examples are under-represented in the dataset.

def add_random_examples(data, doc_df):
    df = data.copy()
    groups = []
    for query_id in df['query_id'].unique():
        group = df.loc[df['query_id'] == query_id]
        query = group['query'].iloc[0]
        total = len(group)
        # Relevance labels are still strings at this point
        count = int((group['relevance'] == "4").sum())
        number = additional_count(total, count)
        rows = []
        for _ in range(number):
            doc, doc_id = sample_random_example(query_id, doc_df)
            rows.append({'query_id': query_id,
                         'source': 'Generated',
                         'doc_id': doc_id,
                         'relevance': '1',
                         'doc': str(doc),
                         'query': str(query)})
        if rows:
            # pd.concat replaces DataFrame.append, which was removed in pandas 2.0
            group = pd.concat([group, pd.DataFrame(rows)], ignore_index=True)
        # Shuffle so the generated rows are not grouped at the end
        groups.append(group.sample(frac=1).reset_index(drop=True))
    return pd.concat(groups, ignore_index=True)

In [19]:
# # Testing the function
# train_data_final = add_random_examples(train_data_final, doc_df)

In [20]:
# train_data_final

In [21]:
# train_data_final['relevance'].value_counts()

In [22]:
# train_data_final.iloc(0)[5]

In [23]:
# Merge each query-doc pair into one tokenized representation for BERT
# input_ids: [CLS] [query] [SEP] [doc] [SEP]
# token_type_ids: 0 for query tokens, 1 for doc tokens
# attention_mask: 1 for non-padding tokens, 0 for padding tokens

def tokenizeforBert(data):
    print('Tokenizing.....')
    tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
    
    # Split both text columns into WordPiece tokens
    df = data.filter(['query','doc']).applymap(lambda x: tokenizer.tokenize(x))

    # Encode each pair once and keep all three outputs
    def encode(row):
        encoded = tokenizer.encode_plus(text=row['query'],
                                        text_pair=row['doc'],
                                        max_length=512,
                                        truncation_strategy='only_second',
                                        pad_to_max_length=True,
                                        padding_side='right',
                                        is_pretokenized=False,
                                        return_token_type_ids=True,
                                        return_attention_mask=True,
                                        )
        return pd.Series({key: encoded[key]
                          for key in ['input_ids', 'token_type_ids', 'attention_mask']})

    encoded_df = df.apply(encode, axis=1)
    
    total_df = pd.concat([data, encoded_df], axis=1)
    
    return total_df
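
The three columns produced here can be illustrated without loading a tokenizer. The following is a minimal sketch, using plain token strings instead of vocabulary ids and a tiny `max_length`, of how a query-doc pair is laid out, truncated on the document side only, and right-padded. The helper `build_pair_inputs` is hypothetical and assumes the query itself always fits within `max_length`.

```python
def build_pair_inputs(query_tokens, doc_tokens, max_length=12):
    """Lay out [CLS] query [SEP] doc [SEP], truncating only the doc, then right-pad."""
    # Reserve three positions for [CLS] and the two [SEP] markers
    room_for_doc = max(max_length - len(query_tokens) - 3, 0)
    doc_tokens = doc_tokens[:room_for_doc]

    tokens = ['[CLS]'] + query_tokens + ['[SEP]'] + doc_tokens + ['[SEP]']
    # Segment 0 covers [CLS], the query and the first [SEP]; segment 1 covers the doc and final [SEP]
    token_type_ids = [0] * (len(query_tokens) + 2) + [1] * (len(doc_tokens) + 1)
    attention_mask = [1] * len(tokens)

    # Right-pad all three sequences to max_length
    pad = max_length - len(tokens)
    tokens += ['[PAD]'] * pad
    token_type_ids += [0] * pad
    attention_mask += [0] * pad
    return tokens, token_type_ids, attention_mask

tokens, token_type_ids, attention_mask = build_pair_inputs(
    ['what', 'is', 'bert'], ['a', 'language', 'model'])
print(tokens)
print(token_type_ids)
print(attention_mask)
```

This matches the pattern visible in the `test_df.head()` output further down, where the leading zeros in `token_type_ids` cover the query and the ones cover the document.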

In [24]:
# Join queries, documents and relevance labels, then tokenize the pairs for BERT

def preprocess_df(doc_df, query_df, link_df, mode='test'):
    # Cast query_id to int64 in both frames so the merge keys match
    queries = query_df.astype({'query_id': 'int64'}) 
    data_pre = link_df.astype({'query_id': 'int64'})
    # Merge the query text onto each query_id
    data_pre = pd.merge(data_pre, queries, on='query_id', how='left')
    # Merge the document text onto each doc_id
    data = pd.merge(data_pre, doc_df, on='doc_id', how='left')
    data = data.dropna()
    data = data.reset_index(drop=True)
    if mode == 'train':
        print('Adding Random Examples')
        data = add_random_examples(data, doc_df)
        print('Finished adding examples')
    
    
    data = tokenizeforBert(data)
    
    return data

In [25]:
added_train_df = preprocess_df(doc_df, train_data_df, train_id_df, mode='train')

Adding Random Examples
Finished adding examples
Tokenizing.....


In [26]:
train_df = preprocess_df(doc_df, train_data_df, train_id_df, mode='train')

Adding Random Examples
Finished adding examples
Tokenizing.....


In [27]:
test_df = preprocess_df(doc_df, test_data_df, test_id_df)

Tokenizing.....


In [28]:
test_df.head()

Unnamed: 0,query_id,source,doc_id,relevance,query,doc,input_ids,token_type_ids,attention_mask
0,1964316,U0,1964316_5,4,"What do you mean by ""weed""?",Weed could mean the bad thing that grow in ur ...,"[101, 2054, 2079, 2017, 2812, 2011, 1000, 1790...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, ...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ..."
1,1964316,Q0,1674088_11,1,"What do you mean by ""weed""?",sell weed,"[101, 2054, 2079, 2017, 2812, 2011, 1000, 1790...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, ...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, ..."
2,1964316,Q0,1218838_13,2,"What do you mean by ""weed""?",My weed!!,"[101, 2054, 2079, 2017, 2812, 2011, 1000, 1790...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, ...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ..."
3,1964316,Q0,1519022_15,2,"What do you mean by ""weed""?",because we dont know what the hell to make leg...,"[101, 2054, 2079, 2017, 2812, 2011, 1000, 1790...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, ...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ..."
4,1964316,Q0,3059341_5,2,"What do you mean by ""weed""?",Its a weed.,"[101, 2054, 2079, 2017, 2812, 2011, 1000, 1790...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, ...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ..."


In [29]:
# Pass the tokenized inputs through the pre-trained BERT model to obtain
# dense representations of each query-doc pair

# batch_size defaults to 64 (used on the GPU); adjust it to the available memory

def convert_BERT_embeddings(input_ids, token_type_ids, attention_mask, model, batch_size=64):
    length = len(input_ids)
    print(length)
    print(input_ids)
    input_ids_tensor = tf.constant(input_ids, shape=(length, 512))
    token_type_ids_tensor = tf.constant(token_type_ids, shape=(length, 512))
    attention_mask_tensor = tf.constant(attention_mask, shape=(length, 512))
    total_time = 0
    start_time = time.time()
    batches = []
    for i in range(0, length, batch_size):
        batch_time = time.time()
        # Rough estimate: average time per processed example times examples left
        time_passed = int(time.time() - start_time)
        hours_remaining = (length - i) * time_passed / (max(1.0, i) * 3600)
        print(f"Hours remaining: {hours_remaining}")

        # Slicing past the end is safe, so the last (shorter) batch needs no special case
        features = {
            "input_ids": input_ids_tensor[i:i+batch_size],
            "attention_mask": attention_mask_tensor[i:i+batch_size],
            "token_type_ids": token_type_ids_tensor[i:i+batch_size],
        }
        output = model(features)
        print(f"{min(i + batch_size, length)} Examples done.")
        last_hidden_state = output[0]
        # Take the representation of the first token ([CLS]); indexing keeps the
        # batch dimension even when the batch holds a single example
        CLS = last_hidden_state[:, 0, :]

        if i == 0:
            print(CLS)
        batches.append(CLS)

        batch_time = int(time.time() - batch_time)
        total_time += batch_time
        print(f"Time taken: {batch_time}")

    # Concatenate once at the end instead of growing a tensor inside the loop
    overall = tf.concat(batches, 0)
    print(overall)
    print(f"Total time taken: {total_time/3600} hours")
    overall = overall.numpy()
    return overall
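
The batching and progress arithmetic of the loop above can be checked in isolation. The sketch below uses two hypothetical helpers, `batch_bounds` and `hours_remaining`, that mirror the index ranges and the time estimate from the function; the numbers plugged in (45340 examples, batches of 64, 4 seconds elapsed after the first batch) are the ones that appear in the recorded output of this notebook.

```python
def batch_bounds(length, batch_size):
    """Yield (start, end) index pairs that cover range(length) in order."""
    for start in range(0, length, batch_size):
        yield start, min(start + batch_size, length)

def hours_remaining(done, length, seconds_elapsed):
    """Extrapolate the remaining time from the average time per processed example."""
    return (length - done) * seconds_elapsed / (max(1.0, done) * 3600)

bounds = list(batch_bounds(45340, 64))
print(len(bounds))      # number of batches
print(bounds[-1])       # the final batch is shorter than 64
print(hours_remaining(64, 45340, 4))
```

Because the estimate divides whole elapsed seconds by the number of examples done so far, it fluctuates strongly during the first few batches and only settles later in the run.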

In [30]:
def convert_dflist_to_array(series):
    print('Starting....')
    start = time.time()
    array = np.array(series[0])
    for i in range(1,len(series)):
        array = np.vstack((array, np.array(series[i])))
    print('Finished....')
    print(f'Took {time.time() - start} seconds')
    return array
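
Stacking one row at a time copies the whole array on every iteration, which is a likely reason for the long runtimes recorded below. As a hedged alternative, not what the notebook ran, the whole list of rows can be converted in a single call. The toy rows here are made up; for a pandas Series of equal-length lists the equivalent call would be `np.array(series.tolist())`.

```python
import numpy as np

rows = [[101, 2054, 102, 0],
        [101, 2129, 102, 0],
        [101, 2339, 102, 0]]

# Row-by-row stacking, as in the function above: each vstack copies everything so far
stacked = np.array(rows[0])
for row in rows[1:]:
    stacked = np.vstack((stacked, np.array(row)))

# Single conversion: one allocation for the whole 2-D array
direct = np.array(rows)

print(stacked.shape, direct.shape)
print(np.array_equal(stacked, direct))
```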

In [31]:
len(added_train_df['attention_mask'][0])

512

In [32]:
# This also takes a long time, around 40 minutes
input_ids_array = convert_dflist_to_array(added_train_df['input_ids'])
token_ids_array = convert_dflist_to_array(added_train_df['token_type_ids'])
attention_array = convert_dflist_to_array(added_train_df['attention_mask'])

Starting....
Finished....
Took 835.8914837837219 seconds
Starting....
Finished....
Took 832.739221572876 seconds
Starting....
Finished....
Took 922.2327501773834 seconds


In [33]:
# Runs BERT over whole batches (vectorized); should take less than an hour.

model = TFBertModel.from_pretrained('bert-base-uncased')
added_array = convert_BERT_embeddings(input_ids_array, token_ids_array, attention_array, model, batch_size=64)

45340
[[ 101 2339 2079 ...    0    0    0]
 [ 101 2339 2079 ...    0    0    0]
 [ 101 2339 2079 ...    0    0    0]
 ...
 [ 101 2054 2515 ...    0    0    0]
 [ 101 2054 2515 ...    0    0    0]
 [ 101 2054 2515 ...    0    0    0]]
Hours remaining: 0.0
64 Examples done.
tf.Tensor(
[[-0.26474723  0.3622378  -0.37632445 ... -0.5626925   0.31560117
   0.5076609 ]
 [-0.52939916  0.5299391  -0.4963811  ... -0.8789688   0.5263101
   0.5547417 ]
 [ 0.14038418 -0.07637009  0.17652397 ... -0.283847    0.20487404
   0.51430064]
 ...
 [-0.0315689   0.23454882 -0.02184237 ... -0.1677695   0.50032365
   0.52029467]
 [-0.25609076  0.270187   -0.41221243 ... -0.9314111   0.20338704
   0.59887224]
 [-0.0984872   0.43620697 -0.31650746 ... -0.3988881   0.30982515
   0.69330055]], shape=(64, 768), dtype=float32)
Time taken: 4
Hours remaining: 0.7860416666666666
128 Examples done.
Time taken: 0
Hours remaining: 0.4905815972222222
192 Examples done.
Time taken: 4
Hours remaining: 0.6531828703703704
256

In [34]:
# This also takes a long time, around 40 minutes
input_ids_array = convert_dflist_to_array(train_df['input_ids'])
token_ids_array = convert_dflist_to_array(train_df['token_type_ids'])
attention_array = convert_dflist_to_array(train_df['attention_mask'])

model = TFBertModel.from_pretrained('bert-base-uncased')
array = convert_BERT_embeddings(input_ids_array, token_ids_array, attention_array, model, batch_size=64)

Starting....
Finished....
Took 802.1811153888702 seconds
Starting....
Finished....
Took 754.6459033489227 seconds
Starting....
Finished....
Took 748.9135434627533 seconds
45340
[[ 101 2339 2079 ...    0    0    0]
 [ 101 2339 2079 ...    0    0    0]
 [ 101 2339 2079 ...    0    0    0]
 ...
 [ 101 2054 2515 ...    0    0    0]
 [ 101 2054 2515 ...    0    0    0]
 [ 101 2054 2515 ...    0    0    0]]
Hours remaining: 0.0
64 Examples done.
tf.Tensor(
[[-0.26776952  0.36044857 -0.13320701 ... -0.4292788   0.5220578
   0.7052103 ]
 [ 0.14038418 -0.07637009  0.17652397 ... -0.283847    0.20487404
   0.51430064]
 [-0.2236495   0.45706183 -0.39957872 ... -0.6166152   0.6112313
   0.39875662]
 ...
 [-0.31959632  0.32258955 -0.39811546 ... -0.7156608   0.23556578
   0.63328254]
 [ 0.21063507  0.10400532 -0.32628286 ... -0.05250146  0.04261783
   0.6897694 ]
 [ 0.7249235  -0.4768578  -0.04825684 ...  0.84707    -0.14393044
   0.36661395]], shape=(64, 768), dtype=float32)
Time taken: 4
Hours re

In [35]:
# Save the BERT embeddings to a TSV file (this has already been done once)

def save_to_tsv(df, bert_embedding, file='train', added=False):
    # reset_index returns a new frame, so the caller's DataFrame is left untouched
    df = df.reset_index(drop=True)
    # One list of floats per row; assigning the whole column avoids chained indexing
    df['bert_embeddings'] = [list(row) for row in bert_embedding]

    prefix = 'added_' if added else ''

    df.to_csv('./drive/My Drive/Antique_Dataset/'+prefix+'antique_'+file+'_bert.tsv', sep='\t', index=False)
    return df
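
Because each embedding is a Python list stored in a single cell, the TSV file holds its string form, and reading the file back yields strings rather than lists. Below is a minimal round-trip sketch using only the standard library, with made-up values and an in-memory buffer instead of a file on Drive. It assumes the list prints as plain numbers, which `ast.literal_eval` can parse; with pandas one would typically apply the same function to the column after `read_csv`.

```python
import ast
import csv
import io

# Write one row whose embedding cell is the string form of a list
buffer = io.StringIO()
writer = csv.writer(buffer, delimiter='\t')
writer.writerow(['doc_id', 'bert_embeddings'])
writer.writerow(['2531329_1', str([0.14, -0.07, 0.17])])

# Read it back: the cell is a string until it is parsed
buffer.seek(0)
row = next(csv.DictReader(buffer, delimiter='\t'))
raw_cell = row['bert_embeddings']
embedding = ast.literal_eval(raw_cell)

print(type(raw_cell).__name__, type(embedding).__name__)
print(embedding)
```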

In [36]:
save_to_tsv(added_train_df, added_array, added=True)
save_to_tsv(train_df, array)

Unnamed: 0,query_id,source,doc_id,relevance,query,doc,input_ids,token_type_ids,attention_mask,bert_embeddings
0,2531329,Generated,4082308_2,1,Why do some men spit into the urinal before ur...,"Maybe incorrectly written ""grippe""? (flu)","[101, 2339, 2079, 2070, 2273, 13183, 2046, 199...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...","[-0.26776952, 0.36044857, -0.13320701, -0.3166..."
1,2531329,Q0,2531329_1,2,Why do some men spit into the urinal before ur...,Sorry. Didn't know you were watching.,"[101, 2339, 2079, 2070, 2273, 13183, 2046, 199...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...","[0.14038418, -0.07637009, 0.17652397, -0.21174..."
2,2531329,Q0,2531329_7,3,Why do some men spit into the urinal before ur...,Where else would we spit?... Apart from sports...,"[101, 2339, 2079, 2070, 2273, 13183, 2046, 199...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...","[-0.2236495, 0.45706183, -0.39957872, -0.45759..."
3,2531329,Q0,2531329_2,3,Why do some men spit into the urinal before ur...,Okay...one more piece of useless information m...,"[101, 2339, 2079, 2070, 2273, 13183, 2046, 199...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...","[0.042255938, 0.35146686, -0.06401053, -0.1350..."
4,2531329,Q0,2531329_3,3,Why do some men spit into the urinal before ur...,men (cough cough - boys) are just wierd...,"[101, 2339, 2079, 2070, 2273, 13183, 2046, 199...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...","[-0.075745545, 0.38415992, -0.47838423, -0.429..."
...,...,...,...,...,...,...,...,...,...,...
45335,884731,Q0,884731_2,4,What does the word Padre mean in english?,"the word padre in english means ""father""","[101, 2054, 2515, 1996, 2773, 28612, 2812, 199...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, ...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...","[-0.64837384, 0.5308057, -0.57113624, -0.39678..."
45336,884731,U0,884731_0,4,What does the word Padre mean in english?,Padre is a word used in both Spanish and Portu...,"[101, 2054, 2515, 1996, 2773, 28612, 2812, 199...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, ...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...","[-0.5623697, 0.4166778, -0.59484327, -0.324400..."
45337,884731,Q0,884731_1,3,What does the word Padre mean in english?,I think it either means father or brother. I ...,"[101, 2054, 2515, 1996, 2773, 28612, 2812, 199...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, ...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...","[-0.5553252, 0.4541003, -0.37120673, -0.533768..."
45338,884731,Generated,4055994_11,1,What does the word Padre mean in english?,a bridezilla is a bride to be that has her hea...,"[101, 2054, 2515, 1996, 2773, 28612, 2812, 199...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, ...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...","[-0.22325341, -0.027230937, -0.032040335, -0.0..."


In [37]:
test_df.groupby('query_id').count().describe()

Unnamed: 0,source,doc_id,relevance,query,doc,input_ids,token_type_ids,attention_mask
count,200.0,200.0,200.0,200.0,200.0,200.0,200.0,200.0
mean,32.945,32.945,32.945,32.945,32.945,32.945,32.945,32.945
std,10.172481,10.172481,10.172481,10.172481,10.172481,10.172481,10.172481,10.172481
min,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0
25%,28.0,28.0,28.0,28.0,28.0,28.0,28.0,28.0
50%,32.0,32.0,32.0,32.0,32.0,32.0,32.0,32.0
75%,37.25,37.25,37.25,37.25,37.25,37.25,37.25,37.25
max,84.0,84.0,84.0,84.0,84.0,84.0,84.0,84.0


In [38]:
test_input_ids_array = convert_dflist_to_array(test_df['input_ids'])
test_token_ids_array = convert_dflist_to_array(test_df['token_type_ids'])
test_attention_array = convert_dflist_to_array(test_df['attention_mask'])

Starting....
Finished....
Took 28.40184259414673 seconds
Starting....
Finished....
Took 26.7243013381958 seconds
Starting....
Finished....
Took 25.286107540130615 seconds


In [39]:
test_results = convert_BERT_embeddings(test_input_ids_array, test_token_ids_array, test_attention_array, model)

6589
[[ 101 2054 2079 ...    0    0    0]
 [ 101 2054 2079 ...    0    0    0]
 [ 101 2054 2079 ...    0    0    0]
 ...
 [ 101 2129 2064 ...    0    0    0]
 [ 101 2129 2064 ...    0    0    0]
 [ 101 2129 2064 ...    0    0    0]]
Hours remaining: 0.0
64 Examples done.
tf.Tensor(
[[-1.17683396e-01  3.81208867e-01 -8.14802289e-01 ... -8.22204590e-01
   3.26010078e-01  3.79400402e-01]
 [-1.85377315e-01  2.64670372e-01 -7.37730503e-01 ... -7.93308794e-01
   5.71447849e-01  4.38526720e-01]
 [-7.37383962e-04  2.45374054e-01 -5.02342641e-01 ... -7.93728828e-01
   4.80002075e-01  2.88361311e-01]
 ...
 [-3.95672262e-01  4.67448890e-01 -9.23234463e-01 ... -6.61823630e-01
   2.91656017e-01  5.57851493e-01]
 [-2.23043337e-01  2.75190324e-01 -5.40888965e-01 ... -3.31769079e-01
   4.95369971e-01  4.10063982e-01]
 [-2.51700021e-02  2.63238728e-01 -7.68171966e-01 ... -3.96065444e-01
   2.95402944e-01  2.63976544e-01]], shape=(64, 768), dtype=float32)
Time taken: 4
Hours remaining: 0.11328125
128 Ex

In [40]:
save_to_tsv(test_df, test_results, file='test')

Unnamed: 0,query_id,source,doc_id,relevance,query,doc,input_ids,token_type_ids,attention_mask,bert_embeddings
0,1964316,U0,1964316_5,4,"What do you mean by ""weed""?",Weed could mean the bad thing that grow in ur ...,"[101, 2054, 2079, 2017, 2812, 2011, 1000, 1790...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, ...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...","[-0.117683396, 0.38120887, -0.8148023, -0.2977..."
1,1964316,Q0,1674088_11,1,"What do you mean by ""weed""?",sell weed,"[101, 2054, 2079, 2017, 2812, 2011, 1000, 1790...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, ...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, ...","[-0.18537731, 0.26467037, -0.7377305, -0.18130..."
2,1964316,Q0,1218838_13,2,"What do you mean by ""weed""?",My weed!!,"[101, 2054, 2079, 2017, 2812, 2011, 1000, 1790...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, ...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...","[-0.00073738396, 0.24537405, -0.50234264, -0.4..."
3,1964316,Q0,1519022_15,2,"What do you mean by ""weed""?",because we dont know what the hell to make leg...,"[101, 2054, 2079, 2017, 2812, 2011, 1000, 1790...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, ...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...","[-0.31584862, 0.08820932, -0.56504595, -0.2485..."
4,1964316,Q0,3059341_5,2,"What do you mean by ""weed""?",Its a weed.,"[101, 2054, 2079, 2017, 2812, 2011, 1000, 1790...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, ...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...","[-0.13292149, 0.30342782, -0.52649224, -0.3847..."
...,...,...,...,...,...,...,...,...,...,...
6584,1262692,Q0,247023_6,3,How can I get rid of pimples on my back?,if there is a head on it you can take a piece ...,"[101, 2129, 2064, 1045, 2131, 9436, 1997, 1425...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...","[-0.4966618, 0.25597742, -0.63771343, -0.46903..."
6585,1262692,Q0,1499030_5,3,How can I get rid of pimples on my back?,"Cut down on sugary and oily foods, make sure y...","[101, 2129, 2064, 1045, 2131, 9436, 1997, 1425...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...","[-0.3382789, 0.19795184, -0.88143057, -0.03786..."
6586,1262692,Q0,2916758_0,3,How can I get rid of pimples on my back?,Sounds like you may have gotten immune to the ...,"[101, 2129, 2064, 1045, 2131, 9436, 1997, 1425...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...","[-0.25428507, 0.061954655, -0.7148295, -0.3484..."
6587,1262692,Q0,1105845_15,3,How can I get rid of pimples on my back?,the best way to lose pimples is use either pro...,"[101, 2129, 2064, 1045, 2131, 9436, 1997, 1425...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...","[-0.25733668, 0.3421137, -0.53493476, -0.47761..."
