# DigSci2019 Scientific Data Mining Competition   Final Top-2
###  Author : DOTA  
#### Score (MAP@3): 0.53733 (Rank 2) 

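**Method** :   
>To solve this problem I borrowed an idea from recommender systems and split the task into two stages: recall and ranking. The recall stage uses two approaches. The first combines Word2Vec and TF-IDF: Word2Vec provides a vector for each word in the description paragraph, and the word vectors of a sentence are averaged with TF-IDF weights to obtain a sentence embedding; as a further improvement, Smooth Inverse Frequency (SIF) replaces TF-IDF as the per-word weight. The second approach builds the sentence embedding from TF-IDF alone. Each approach retrieves 3 papers by cosine similarity, so after deduplication every paragraph has between 3 and 6 recalled papers in the candidate set.  
In the ranking stage, BERT encodes the description paragraph and the paper text as a sentence pair (Description, PaperText); the output passes through a Dense layer with Softmax to give a probability, and the candidates are ranked by that probability.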
>
**Models** : Word2Vec, TF-IDF, BERT  
**Test environment** : Ubuntu 18, 32-core CPU, 64 GB RAM, two RTX 2080 Ti GPUs  
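
The SIF weighting mentioned in the method description can be sketched roughly as follows. This is a minimal illustration, not the code used for the submission: the function names, the toy corpus and the smoothing constant `a=1e-3` are assumptions, and the principal-component removal step of the full SIF method is left out.

```python
import numpy as np
from collections import Counter

def sif_weights(tokenized_docs, a=1e-3):
    """Return {word: a / (a + p(word))}, p being the corpus unigram frequency."""
    counts = Counter(w for doc in tokenized_docs for w in doc)
    total = sum(counts.values())
    return {w: a / (a + c / total) for w, c in counts.items()}

def sentence_embedding(tokens, word_vectors, weights, dim):
    """Weighted mean of the vectors of known tokens (zeros if none is known)."""
    vecs = [word_vectors[w] * weights[w] for w in tokens
            if w in word_vectors and w in weights]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

# Toy data, hypothetical and for illustration only
docs = [["bert", "is", "a", "model"], ["a", "graph", "is", "a", "model"]]
weights = sif_weights(docs)
rng = np.random.RandomState(0)
word_vectors = {w: rng.rand(4) for w in weights}  # stand-in for Word2Vec vectors
emb = sentence_embedding(docs[0], word_vectors, weights, dim=4)
```

Frequent words such as "a" receive a smaller weight than rare words such as "bert", which is the effect the weighting is meant to have.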

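**Model description** :
>As the task description shows, the task is to match each description paragraph with its three most relevant papers. Formally, this can be seen as a cloze ("fill in the blank") task. Unlike filling a word into the right position in a text, however, what has to be filled in here is a whole sentence, namely the title of a paper. If you look for a solution along these lines, you will find that ordinary computing power cannot cope with text data of this scale.
So let us look at the problem from another angle. The simplest way to "match each description paragraph with its three most relevant papers" is to compute the similarity between the description paragraph and every paper in the paper library and keep the most similar ones. This raises a problem of its own: exploring the data turns up description paragraphs such as "An efficient implementation based on BERT [1] and graph neural network (GNN) [2] is introduced.", which cite two papers at once. When computing similarity, which paper belongs at which position?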
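
The brute-force idea described above (score a description against every paper and keep the best three) can be sketched as below. The function name and the toy vectors are assumptions made for illustration; the sketch also assumes no vector has zero norm.

```python
import numpy as np

def top_k_cosine(query_vecs, paper_vecs, k=3):
    """Row indices of the k most similar papers per query, best match first."""
    q = query_vecs / np.linalg.norm(query_vecs, axis=1, keepdims=True)
    p = paper_vecs / np.linalg.norm(paper_vecs, axis=1, keepdims=True)
    sims = q @ p.T  # cosine similarity, shape (n_queries, n_papers)
    return np.argsort(-sims, axis=1)[:, :k]

# Toy 2-dimensional vectors, hypothetical
papers = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]])
queries = np.array([[2.0, 0.1]])
top = top_k_cosine(queries, papers, k=3)
```

Normalising both sides first is what turns the plain dot product into a cosine similarity.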

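**Code description** :  
**1. RecallPart**: each of the two methods retrieves 3 papers by cosine similarity; after deduplication, every paragraph has between 3 and 6 recalled papers in the candidate set;  
**1.1 ProSolution1** : retrieves 3 papers by TF-IDF similarity;   
**1.2 ProSolution2** : retrieves 3 papers by DOTA-EmbeddingVector similarity  
**2. SortPart**: encodes the description paragraph and the candidate papers with the BERT encoder and scores their similarity;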
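
The way the two recall lists combine into 3 to 6 candidates per paragraph can be illustrated with a toy example. The ids below are made up; only the concatenate-then-deduplicate pattern reflects the notebook.

```python
import pandas as pd

# Hypothetical recall results for one description from the two methods
tfidf_recall = pd.DataFrame({"description_id": ["d1"] * 3,
                             "paper_id": ["p1", "p2", "p3"]})
w2v_recall = pd.DataFrame({"description_id": ["d1"] * 3,
                           "paper_id": ["p2", "p4", "p5"]})

# Stack both lists and drop pairs that were recalled by both methods
candidates = (pd.concat([tfidf_recall, w2v_recall], axis=0, sort=False)
                .drop_duplicates()
                .reset_index(drop=True))
per_description = candidates.groupby("description_id")["paper_id"].nunique()
```

Here one paper is shared, so the description ends up with 5 candidates; the count is 3 when both methods agree completely and 6 when they do not overlap at all.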

## RecallPart

In [None]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from tqdm import tqdm
import gc

In [None]:
train_data = pd.read_csv("../data/train_release.csv")
valid_data = pd.read_csv("../data/validation_release.csv")
paper_data = pd.read_csv("../data/candidate_paper.csv")

In [None]:
recall_nums = 3
run_nums = 1000

def find_candidate(x):
    cand_pos = x.find("[**##**]")
    cand_lst = x[:cand_pos].split(".")
    if len(cand_lst[-1]) < 20 :
        sp = " ".join(cand_lst[-2:])
    else:
        sp =  cand_lst[-1]
        
    return x.replace("[**##**]",sp)

paper_data['description_text'] = paper_data.title.fillna(" ")
valid_data["description"] = valid_data["description_text"].map(lambda x:find_candidate(str(x)))
train_data["description"] = train_data["description_text"].map(lambda x:find_candidate(str(x)))
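
To make the effect of `find_candidate` concrete, here is the same function applied to two made-up descriptions. It replaces the citation marker `[**##**]` with the sentence fragment that precedes it, so the text around the citation is repeated in place of the marker.

```python
def find_candidate(x):
    cand_pos = x.find("[**##**]")
    cand_lst = x[:cand_pos].split(".")
    if len(cand_lst[-1]) < 20:
        # Fragment too short: also take the previous sentence
        sp = " ".join(cand_lst[-2:])
    else:
        sp = cand_lst[-1]
    return x.replace("[**##**]", sp)

# Hypothetical inputs
long_case = find_candidate(
    "BERT is strong. We follow the encoder design of [**##**] for ranking.")
short_case = find_candidate("Prior work exists. See [**##**].")
print(long_case)
print(short_case)
```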

In [None]:
tfidf_enc = TfidfVectorizer(ngram_range=(1, 4),min_df=5,max_df=0.9)
tfidf_enc.fit(list(train_data.description_text.fillna(" ")) + list(valid_data.description_text.fillna(" ")) + list(paper_data.description_text.fillna(" ")))

train_mat = tfidf_enc.transform(train_data.description.fillna("No Description"))
valid_mat = tfidf_enc.transform(valid_data.description.fillna("No Description"))
paper_mat = tfidf_enc.transform(paper_data.description_text)

##  PS1-GetTrainX

In [None]:
train_df = pd.DataFrame()

head_nums = int(train_mat.shape[0]/run_nums)
tail_nums = train_mat.shape[0] - head_nums * run_nums

for i in tqdm(range(head_nums)):
    i *= run_nums
    paperid = np.argsort(np.dot(train_mat[i:i+run_nums],paper_mat.T).todense())[:,-recall_nums:]
    discrid = train_data.description_id.values[i:i+run_nums]
    df_tmp = pd.DataFrame(paperid,columns=["paper_"+str(recall_nums-j) for j in range(recall_nums)])
    df_tmp["description_id"] = discrid
    train_df = pd.concat([train_df,df_tmp],axis=0,sort=False)

# tail
paperid = np.argsort(np.dot(train_mat[-tail_nums:],paper_mat.T).todense())[:,-recall_nums:]
discrid = train_data.description_id.values[-tail_nums:]
df_tmp = pd.DataFrame(paperid,columns=["paper_"+str(recall_nums-j) for j in range(recall_nums)])
df_tmp["description_id"] = discrid
train_df = pd.concat([train_df,df_tmp],axis=0,sort=False)
print(train_df.shape)

## PS1-GetValidX

In [None]:
valid_df = pd.DataFrame()

head_nums = int(valid_mat.shape[0]/run_nums)
tail_nums = valid_mat.shape[0] - head_nums * run_nums

for i in tqdm(range(head_nums)):
    i *= run_nums
    paperid = np.argsort(np.dot(valid_mat[i:i+run_nums],paper_mat.T).todense())[:,-recall_nums:]
    discrid = valid_data.description_id.values[i:i+run_nums]
    df_tmp = pd.DataFrame(paperid,columns=["paper_"+str(recall_nums-j) for j in range(recall_nums)])
    df_tmp["description_id"] = discrid
    valid_df = pd.concat([valid_df,df_tmp],axis=0,sort=False)

# tail
paperid = np.argsort(np.dot(valid_mat[-tail_nums:],paper_mat.T).todense())[:,-recall_nums:]
discrid = valid_data.description_id.values[-tail_nums:]
df_tmp = pd.DataFrame(paperid,columns=["paper_"+str(recall_nums-j) for j in range(recall_nums)])
df_tmp["description_id"] = discrid
valid_df = pd.concat([valid_df,df_tmp],axis=0,sort=False)
print(valid_df.shape)
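
The head/tail loops above could be folded into a single helper along the following lines. This is a sketch with assumed names, not the code used in the competition. It returns the best match in column 0 (the loops above store the best match in the last column), and it needs no separate tail step, which matters because `mat[-tail_nums:]` selects the whole matrix when `tail_nums` is 0.

```python
import numpy as np

def top_k_in_chunks(query_mat, paper_mat, k=3, chunk_size=1000):
    """Return an (n_queries, k) array of paper row indices, best match first."""
    blocks = []
    for start in range(0, query_mat.shape[0], chunk_size):
        scores = query_mat[start:start + chunk_size] @ paper_mat.T
        if hasattr(scores, "toarray"):  # result of a scipy sparse product
            scores = scores.toarray()
        blocks.append(np.argsort(-np.asarray(scores), axis=1)[:, :k])
    return np.vstack(blocks)

# Toy dense matrices, hypothetical
rng = np.random.RandomState(0)
queries = rng.randint(0, 10, size=(7, 5)).astype(float)
papers = rng.randint(0, 10, size=(10, 5)).astype(float)
top = top_k_in_chunks(queries, papers, k=3, chunk_size=3)
```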

In [None]:
dict_paperid = dict(zip(paper_data[['paper_id']].reset_index()["index"].values,paper_data[['paper_id']].reset_index()["paper_id"].values))

for i in range(1,recall_nums+1):
    train_df['paper_'+str(i)] = train_df['paper_'+str(i)].map(lambda x: dict_paperid[x])

for i in range(1,recall_nums+1):
    valid_df['paper_'+str(i)] = valid_df['paper_'+str(i)].map(lambda x: dict_paperid[x])

In [None]:
train_set = pd.DataFrame()
valid_set = pd.DataFrame()

for i in range(1,1+recall_nums):
    tmp = train_df[["description_id","paper_"+str(i)]]
    tmp.columns = ["description_id","paper_id"]
    train_set = pd.concat([train_set,tmp],axis=0,sort=False)

for i in range(1,1+recall_nums):
    tmp = valid_df[["description_id","paper_"+str(i)]]
    tmp.columns = ["description_id","paper_id"]
    valid_set = pd.concat([valid_set,tmp],axis=0,sort=False)
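
The wide-to-long reshaping done by the two loops above can also be written with `DataFrame.melt`. The toy frame below is hypothetical and only illustrates the shape change.

```python
import pandas as pd

# One row per description, one column per recalled paper (toy data)
wide = pd.DataFrame({"description_id": ["d1", "d2"],
                     "paper_1": ["p1", "p4"],
                     "paper_2": ["p2", "p5"],
                     "paper_3": ["p3", "p6"]})

# One row per (description, paper) pair
long = wide.melt(id_vars="description_id",
                 value_vars=["paper_1", "paper_2", "paper_3"],
                 var_name="slot",
                 value_name="paper_id")[["description_id", "paper_id"]]
```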

## PS1-SaveSetX

In [None]:
train_data["label"] = 1
train_set = train_set.merge(train_data[["description_id","paper_id","label"]], how ='left',on =["description_id","paper_id"]).fillna(0)
tmp = train_set.groupby("description_id",as_index=False)["label"].agg({"score":"sum"})
tmp = tmp[tmp.score > 0]
train_set = tmp[["description_id"]].merge(train_set,how='left',on='description_id')
train_set.to_csv("../data/train_set_x.csv",index=False)
valid_set.to_csv("../data/valid_set_x.csv",index=False)
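
The labelling step above marks a recalled pair as positive when it appears in the training data, then keeps only descriptions for which recall found at least one true paper. A toy version of the same logic, with made-up ids and `transform("sum")` in place of the aggregate-and-merge, might look like this:

```python
import pandas as pd

# Hypothetical recalled pairs and ground truth
candidates = pd.DataFrame({"description_id": ["d1", "d1", "d2", "d2"],
                           "paper_id": ["p1", "p2", "p3", "p4"]})
truth = pd.DataFrame({"description_id": ["d1", "d2"],
                      "paper_id": ["p2", "p9"]})
truth["label"] = 1

# Left merge: recalled pairs missing from the ground truth get label 0
labeled = candidates.merge(truth, how="left", on=["description_id", "paper_id"])
labeled["label"] = labeled["label"].fillna(0).astype(int)

# Drop descriptions whose true paper was never recalled (d2 here)
has_positive = labeled.groupby("description_id")["label"].transform("sum") > 0
train_pairs = labeled[has_positive].reset_index(drop=True)
```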

## PS2-DOTA Embedding Vector

In [None]:
import gensim
from gensim.models.word2vec import Word2Vec
from sklearn.feature_extraction.text import TfidfVectorizer
from collections import defaultdict

# Word2Vec expects every sentence as a list of tokens; raw strings would be
# read character by character. Reusing TfidfVectorizer's default analyzer
# keeps the tokens identical to the keys of the IDF table built in fit().
tokenize = TfidfVectorizer().build_analyzer()
corpus = list(train_data.description_text.fillna(" ")) + list(paper_data.description_text.fillna(" "))

# Note: `size`, `index2word` and `syn0` are the gensim < 4.0 names.
modelW2V = Word2Vec(sentences=[tokenize(s) for s in corpus], size=100, seed=2019)

w2v = {w: vec for w, vec in zip(modelW2V.wv.index2word, modelW2V.wv.syn0)}

class DOTAembeddingVectorizer(object):
    def __init__(self, word2vec):
        self.word2vec = word2vec
        self.word2weight = None
        # Dimension of the word vectors (the dict values, not the keys)
        self.dim = len(next(iter(word2vec.values())))

    def fit(self, X):
        tfidf = TfidfVectorizer()
        tfidf.fit(X)
        # Words unseen by TF-IDF get the largest IDF, i.e. are treated as rare
        max_idf = max(tfidf.idf_)
        self.word2weight = defaultdict(
            lambda: max_idf,
            [(w, tfidf.idf_[i]) for w, i in tfidf.vocabulary_.items()])

        return self

    def transform(self, X):
        # IDF-weighted mean of the word vectors; zeros if no word is known
        return np.array([
                np.mean([self.word2vec[w] * self.word2weight[w]
                         for w in tokenize(words) if w in self.word2vec] or
                        [np.zeros(self.dim)], axis=0)
                for words in X
            ])

dota_enc = DOTAembeddingVectorizer(w2v)
dota_enc.fit(corpus)

train_mat = dota_enc.transform(train_data.description.fillna("No Description"))
valid_mat = dota_enc.transform(valid_data.description.fillna("No Description"))
paper_mat = dota_enc.transform(paper_data.description_text)

## PS2-GetTrainY

In [None]:
train_df = pd.DataFrame()

head_nums = int(train_mat.shape[0]/run_nums)
tail_nums = train_mat.shape[0] - head_nums * run_nums

for i in tqdm(range(head_nums)):
    i *= run_nums
    # train_mat and paper_mat are dense NumPy arrays here, so no .todense() is needed
    paperid = np.argsort(np.dot(train_mat[i:i+run_nums],paper_mat.T))[:,-recall_nums:]
    discrid = train_data.description_id.values[i:i+run_nums]
    df_tmp = pd.DataFrame(paperid,columns=["paper_"+str(recall_nums-j) for j in range(recall_nums)])
    df_tmp["description_id"] = discrid
    train_df = pd.concat([train_df,df_tmp],axis=0,sort=False)

# tail
paperid = np.argsort(np.dot(train_mat[-tail_nums:],paper_mat.T))[:,-recall_nums:]
discrid = train_data.description_id.values[-tail_nums:]
df_tmp = pd.DataFrame(paperid,columns=["paper_"+str(recall_nums-j) for j in range(recall_nums)])
df_tmp["description_id"] = discrid
train_df = pd.concat([train_df,df_tmp],axis=0,sort=False)
print(train_df.shape)

## PS2-GetValidY

In [None]:
valid_df = pd.DataFrame()

head_nums = int(valid_mat.shape[0]/run_nums)
tail_nums = valid_mat.shape[0] - head_nums * run_nums

for i in tqdm(range(head_nums)):
    i *= run_nums
    # valid_mat and paper_mat are dense NumPy arrays here, so no .todense() is needed
    paperid = np.argsort(np.dot(valid_mat[i:i+run_nums],paper_mat.T))[:,-recall_nums:]
    discrid = valid_data.description_id.values[i:i+run_nums]
    df_tmp = pd.DataFrame(paperid,columns=["paper_"+str(recall_nums-j) for j in range(recall_nums)])
    df_tmp["description_id"] = discrid
    valid_df = pd.concat([valid_df,df_tmp],axis=0,sort=False)

# tail
paperid = np.argsort(np.dot(valid_mat[-tail_nums:],paper_mat.T))[:,-recall_nums:]
discrid = valid_data.description_id.values[-tail_nums:]
df_tmp = pd.DataFrame(paperid,columns=["paper_"+str(recall_nums-j) for j in range(recall_nums)])
df_tmp["description_id"] = discrid
valid_df = pd.concat([valid_df,df_tmp],axis=0,sort=False)
print(valid_df.shape)

In [None]:
dict_paperid = dict(zip(paper_data[['paper_id']].reset_index()["index"].values,paper_data[['paper_id']].reset_index()["paper_id"].values))

for i in range(1,recall_nums+1):
    train_df['paper_'+str(i)] = train_df['paper_'+str(i)].map(lambda x: dict_paperid[x])

for i in range(1,recall_nums+1):
    valid_df['paper_'+str(i)] = valid_df['paper_'+str(i)].map(lambda x: dict_paperid[x])

In [None]:
train_set = pd.DataFrame()
valid_set = pd.DataFrame()

for i in range(1,1+recall_nums):
    tmp = train_df[["description_id","paper_"+str(i)]]
    tmp.columns = ["description_id","paper_id"]
    train_set = pd.concat([train_set,tmp],axis=0,sort=False)

for i in range(1,1+recall_nums):
    tmp = valid_df[["description_id","paper_"+str(i)]]
    tmp.columns = ["description_id","paper_id"]
    valid_set = pd.concat([valid_set,tmp],axis=0,sort=False)

## PS2-SaveSetY

In [None]:
train_data["label"] = 1
train_set = train_set.merge(train_data[["description_id","paper_id","label"]], how ='left',on =["description_id","paper_id"]).fillna(0)

tmp = train_set.groupby("description_id",as_index=False)["label"].agg({"score":"sum"})
tmp = tmp[tmp.score > 0]
train_set = tmp[["description_id"]].merge(train_set,how='left',on='description_id')

train_set.to_csv("../data/train_set_y.csv",index=False)
valid_set.to_csv("../data/valid_set_y.csv",index=False)

## SortPart

In [None]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from tqdm import tqdm
import gc

In [None]:
train_setx = pd.read_csv("../data/train_set_x.csv")
test_setx = pd.read_csv("../data/valid_set_x.csv")
train_sety = pd.read_csv("../data/train_set_y.csv")
test_sety = pd.read_csv("../data/valid_set_y.csv")

# Merge the candidates of both recall methods and drop repeated pairs
train_set = pd.concat([train_setx,train_sety],axis=0,sort=False).drop_duplicates()
test_set  = pd.concat([test_setx,test_sety],axis=0,sort=False).drop_duplicates()

In [None]:
paper_data = pd.read_csv("../data/candidate_paper.csv")
train_data = pd.read_csv("../data/train_release.csv")
test_data  = pd.read_csv("../data/validation_release.csv")

paper_data["paper_text"] = paper_data.title.fillna(" ")

def find_candidate(x):
    cand_pos = x.find("[**##**]")
    cand_lst = x[:cand_pos].split(".")
    if len(cand_lst[-1]) < 20 :
        sp = " ".join(cand_lst[-2:])
    else:
        sp =  cand_lst[-1]
        
    return x.replace("[**##**]",sp)

train_data["description_text"] = train_data["description_text"].map(lambda x : find_candidate(str(x)))
test_data["description_text"]  = test_data["description_text"].map(lambda x : find_candidate(str(x)))

paper_data["paper_text"] = paper_data["paper_text"].map(lambda x : str(x).lower())

# conc data
train = train_set.merge(train_data[["description_id","description_text"]],how='left',on='description_id')
train = train.merge(paper_data[["paper_id","paper_text"]],how='left',on='paper_id')

del train_set,train_data
gc.collect()

test = test_set.merge(test_data[["description_id","description_text"]],how='left',on='description_id')
test = test.merge(paper_data[["paper_id","paper_text"]],how='left',on='paper_id')

del test_set,test_data,paper_data
gc.collect()

In [None]:
# Fill missing values before casting: astype(str) would turn NaN into the string "nan"
train["description_text"] = train["description_text"].fillna(" ").astype(str)
train["paper_text"] = train["paper_text"].fillna(" ").astype(str)

test["description_text"] = test["description_text"].fillna(" ").astype(str)
test["paper_text"] = test["paper_text"].fillna(" ").astype(str)

In [None]:
train["description"] = train["description_text"] + train["paper_text"]
test["description"] = test["description_text"] + test["paper_text"]

## ModelBERT2Rank

In [None]:
import json
import numpy as np
from tqdm import tqdm
import time
import logging
from sklearn.model_selection import StratifiedKFold
from keras_bert import load_trained_model_from_checkpoint, Tokenizer
from keras.optimizers import Adam
import keras.backend.tensorflow_backend as KTF
import tensorflow as tf
import os
import pandas as pd
import re
from keras.utils.np_utils import to_categorical
from sklearn.metrics import mean_absolute_error, accuracy_score, f1_score

gpu_options = tf.GPUOptions(allow_growth=True)
sess = tf.Session(config=tf.ConfigProto(gpu_options=gpu_options))

In [None]:
learning_rate = 5e-5
min_learning_rate = 1e-5
config_path = '../uncased_L-12_H-768_A-12/bert_config.json'
checkpoint_path = '../uncased_L-12_H-768_A-12/bert_model.ckpt'
dict_path = '../uncased_L-12_H-768_A-12/vocab.txt'
MAX_LEN = 30

In [None]:
token_dict = {}
with open(dict_path, 'r', encoding='utf-8') as reader:
    for line in reader:
        token = line.strip()
        token_dict[token] = len(token_dict)
tokenizer = Tokenizer(token_dict)
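
For orientation, the layout that a BERT-style tokenizer produces for a sentence pair is roughly `[CLS] first [SEP] second [SEP]`, together with segment ids that separate the two texts. The pure-Python sketch below only illustrates that layout. It is not the `keras_bert` implementation: the function `encode_pair` and the tiny vocabulary are hypothetical, whole words stand in for WordPiece tokens, and truncation is done naively from the right.

```python
def encode_pair(first_tokens, second_tokens, vocab, max_len):
    """Return (token_ids, segment_ids), both padded or cut to max_len."""
    tokens = ["[CLS]"] + first_tokens + ["[SEP]"] + second_tokens + ["[SEP]"]
    # Segment 0 covers [CLS], the first text and its [SEP]; segment 1 the rest
    segments = [0] * (len(first_tokens) + 2) + [1] * (len(second_tokens) + 1)
    ids = [vocab.get(t, vocab["[UNK]"]) for t in tokens][:max_len]
    segments = segments[:max_len]
    pad = max_len - len(ids)
    return ids + [vocab["[PAD]"]] * pad, segments + [0] * pad

# Hypothetical toy vocabulary
vocab = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
         "graph": 4, "neural": 5, "network": 6}
ids, segs = encode_pair(["graph", "neural", "network"], ["graph", "model"],
                        vocab, max_len=10)
```

In the notebook the first text is the description and the second is the paper title, and the classifier reads the output at the `[CLS]` position.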

In [None]:
epochs = 5
save_path = "../model/bert_epoch{0}/".format(epochs)
if not os.path.exists(save_path):    
    os.mkdir(save_path)
    
if not os.path.exists(save_path+"submission/"):    
    os.mkdir(save_path+"submission/")    
    
if not os.path.exists(save_path+"log/"):    
    os.mkdir(save_path+"log/")    

In [None]:
file_path = save_path+"log/"
# Create a logger
logger = logging.getLogger('mylogger')
logger.setLevel(logging.DEBUG)

# Create a handler that writes to a log file
timestamp = time.strftime("%Y.%m.%d_%H.%M.%S", time.localtime())
fh = logging.FileHandler(file_path + 'log_' + timestamp +'.txt')
fh.setLevel(logging.DEBUG)

# Create another handler that writes to the console
ch = logging.StreamHandler()
ch.setLevel(logging.DEBUG)

# Define the output format of the handlers
formatter = logging.Formatter('[%(asctime)s][%(levelname)s] ## %(message)s')
fh.setFormatter(formatter)
ch.setFormatter(formatter)
# Attach the handlers to the logger
logger.addHandler(fh)
logger.addHandler(ch)

def read_data(file_path, id, name):
    train_id = []
    train_title = []
    train_text = []
    with open(file_path, 'r', encoding='utf-8-sig') as f:
        for idx, line in enumerate(f):
            line = line.strip().split(',')
            train_id.append(line[0].replace('\'', '').replace(' ', ''))
            train_title.append(line[1])
            train_text.append('，'.join(line[2:]))
    output = pd.DataFrame(dtype=str)
    output[id] = train_id
    output[name + '_title'] = train_title
    output[name + '_content'] = train_text
    return output

In [None]:
train_achievements = train['description_text'].values
# The paper title was stored in the `paper_text` column when the tables were merged
train_requirements = train['paper_text'].values

labels = train['label'].astype(int).values 
labels_cat = to_categorical(labels)
labels_cat = labels_cat.astype(np.int32)

test_achievements = test['description_text'].values
test_requirements = test['paper_text'].values

### Data Generator

In [None]:
class data_generator:
    def __init__(self, data, batch_size=64):
        self.data = data
        self.batch_size = batch_size
        self.steps = len(self.data[0]) // self.batch_size
        if len(self.data[0]) % self.batch_size != 0:
            self.steps += 1

    def __len__(self):
        return self.steps

    def __iter__(self):
        while True:
            X1, X2, y = self.data
            idxs = list(range(len(self.data[0])))
            np.random.shuffle(idxs)
            T, T_, Y = [], [], []
            for c, i in enumerate(idxs):
                achievements = str(X1[i])[:300]
                requirements = str(X2[i])[:30]
                t, t_ = tokenizer.encode(first=achievements, second=requirements, max_len=330)
                T.append(t)
                T_.append(t_)
                Y.append(y[i])
                if len(T) == self.batch_size or i == idxs[-1]:
                    T = np.array(T)
                    T_ = np.array(T_)
                    Y = np.array(Y)
                    yield [T, T_], Y
                    T, T_, Y = [], [], []

In [None]:
from keras.layers import *
from keras.models import Model
import keras.backend as K
from keras.callbacks import Callback

### BERT

In [None]:
def get_model():
    bert_model = load_trained_model_from_checkpoint(config_path, checkpoint_path)
    for l in bert_model.layers:
        l.trainable = True

    T1 = Input(shape=(None,))
    T2 = Input(shape=(None,))

    T = bert_model([T1, T2])

    T = Lambda(lambda x: x[:, 0])(T)

    output = Dense(2, activation='softmax')(T)

    model = Model([T1, T2], output)
    model.compile(
        loss='categorical_crossentropy',
        optimizer=Adam(1e-5),
        metrics=['accuracy']
    )
    model.summary()
    return model

### Evaluate

In [None]:
class Evaluate(Callback):
    def __init__(self, val_data, val_index,model_path):
        self.score = []
        self.best = 0.
        self.early_stopping = 0
        self.val_data = val_data
        self.val_index = val_index
        self.predict = []
        self.lr = 0
        self.passed = 0
        self.model_path = model_path

    def on_batch_begin(self, batch, logs=None):
        if self.passed < self.params['steps']:
            self.lr = (self.passed + 1.) / self.params['steps'] * learning_rate
            K.set_value(self.model.optimizer.lr, self.lr)
            self.passed += 1
        elif self.params['steps'] <= self.passed < self.params['steps'] * 2:
            self.lr = (2 - (self.passed + 1.) / self.params['steps']) * (learning_rate - min_learning_rate)
            self.lr += min_learning_rate
            K.set_value(self.model.optimizer.lr, self.lr)
            self.passed += 1

    def on_epoch_end(self, epoch, logs=None):
        score, acc, f1 = self.evaluate()
        if acc > self.best:
            self.best = acc
            self.early_stopping = 0
            model.save_weights(self.model_path)
        else:
            self.early_stopping += 1
        logger.info('lr: %.6f, epoch: %d, score: %.4f, acc: %.4f, f1: %.4f,best: %.4f\n' % (self.lr, epoch, score, acc, f1, self.best))

    def evaluate(self):
        self.predict = []
        prob = []
        val_x1, val_x2, val_y, val_cat = self.val_data
        for i in tqdm(range(len(val_x1))):
            achievements = str(val_x1[i])[:300]
            requirements = str(val_x2[i])[:30]

            t1, t1_ = tokenizer.encode(first=achievements, second=requirements)
            T1, T1_ = np.array([t1]), np.array([t1_])
            _prob = model.predict([T1, T1_])
            oof_train[self.val_index[i]] = _prob[0]
            self.predict.append(np.argmax(_prob, axis=1)[0])
            prob.append(_prob[0])

        score = 1.0 / (1 + mean_absolute_error(val_y, self.predict))
        acc = accuracy_score(val_y, self.predict)
        f1 = f1_score(val_y, self.predict, average='macro')
        return score, acc, f1
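
The learning-rate logic in `on_batch_begin` amounts to a linear warm-up over the first epoch followed by a linear decay over the second. Written as a standalone function (the name and signature are assumptions, with the defaults taken from `learning_rate` and `min_learning_rate` above), it might look like this:

```python
def warmup_then_decay(step, steps_per_epoch, peak_lr=5e-5, min_lr=1e-5):
    """Learning rate for the 0-based global batch index `step`."""
    if step < steps_per_epoch:
        # Epoch 1: ramp linearly from peak_lr / steps_per_epoch up to peak_lr
        return (step + 1.0) / steps_per_epoch * peak_lr
    if step < 2 * steps_per_epoch:
        # Epoch 2: decay linearly down to min_lr
        return (2 - (step + 1.0) / steps_per_epoch) * (peak_lr - min_lr) + min_lr
    # Later epochs: the callback stops updating, so the rate stays at min_lr
    return min_lr

schedule = [warmup_then_decay(s, steps_per_epoch=10) for s in range(30)]
```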

### Predict Generator

In [None]:
class predict_generator:
    def __init__(self, data, batch_size=256):
        self.data = data
        self.batch_size = batch_size
        self.steps = len(self.data[0]) // self.batch_size
        if len(self.data[0]) % self.batch_size != 0:
            self.steps += 1

    def __len__(self):
        return self.steps

    def __iter__(self):
        while True:
            X1, X2 = self.data
            idxs = list(range(len(self.data[0])))
            T, T_, = [], []
            for c, i in enumerate(idxs):
                achievements = str(X1[i])[:300]
                requirements = str(X2[i])[:30]
                t, t_ = tokenizer.encode(first=achievements, second=requirements, max_len=330)
                T.append(t)
                T_.append(t_)
                if len(T) == self.batch_size or i == idxs[-1]:
                    T = np.array(T)
                    T_ = np.array(T_)
                    yield [T, T_]
                    T, T_ = [], []

In [None]:
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

oof_train = np.zeros((len(train), 2), dtype=np.float32)
oof_test = np.zeros((len(test), 2), dtype=np.float32)
for fold, (train_index, valid_index) in enumerate(skf.split(train_achievements, labels)):
    logger.info('================     fold {}        ==============='.format(fold))
    x1 = train_achievements[train_index]
    x2 = train_requirements[train_index]
    y = labels_cat[train_index]

    val_x1 = train_achievements[valid_index]
    val_x2 = train_requirements[valid_index]
    val_y = labels[valid_index]
    val_cat = labels_cat[valid_index]

    train_D = data_generator([x1, x2, y])
    model_save_path = save_path + "BERTModel_{0}.weights".format(str(fold))
    evaluator = Evaluate([val_x1, val_x2, val_y, val_cat], valid_index,model_save_path)

    model = get_model()
    
    model.fit_generator(train_D.__iter__(),
                        steps_per_epoch=len(train_D),
                        epochs=epochs,
                        callbacks=[evaluator]
                       )
    model.load_weights(model_save_path)
    
    test_D = predict_generator([test_achievements, test_requirements])
    oof_test += model.predict_generator(test_D.__iter__(), steps=len(test_D))
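    print(oof_test)
    # Only the first fold is trained here; remove `break` to run all folds
    break
    K.clear_session()
# Average over the folds that actually ran
oof_test /= (fold + 1)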

In [None]:
submit = test[['description_id','paper_id']].copy()
# oof_test has one column per class; column 1 is the probability of label 1
submit['proba'] = oof_test[:, 1]
submit["ranks"] = submit["proba"].groupby(submit["description_id"]).rank(ascending=False,method='first')
submit.sort_values(by='description_id').head()

In [None]:
sub1 = submit[submit.ranks == 1][["description_id","paper_id"]]
sub1.columns = ["description_id","paper_1"]
sub2 = submit[submit.ranks == 2][["description_id","paper_id"]]
sub2.columns = ["description_id","paper_2"]
sub3 = submit[submit.ranks == 3][["description_id","paper_id"]]
sub3.columns = ["description_id","paper_3"]
result = sub1.merge(sub2, how = 'left', on = 'description_id')
result = result.merge(sub3, how = 'left', on = 'description_id')
result.shape
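
The rank-then-merge steps above can also be expressed as a single pivot. The sketch below uses made-up scores, and the column name `rank` is an assumption.

```python
import pandas as pd

# Hypothetical ranking scores for two descriptions
scores = pd.DataFrame({
    "description_id": ["d1"] * 4 + ["d2"] * 3,
    "paper_id": ["p1", "p2", "p3", "p4", "p5", "p6", "p7"],
    "proba": [0.2, 0.9, 0.5, 0.1, 0.3, 0.8, 0.6],
})

# Rank candidates inside each description, highest probability first
scores["rank"] = (scores.groupby("description_id")["proba"]
                        .rank(ascending=False, method="first")
                        .astype(int))

# Keep the top 3 and spread them into paper_1 .. paper_3 columns
top3 = scores[scores["rank"] <= 3].pivot(
    index="description_id", columns="rank", values="paper_id")
top3.columns = ["paper_%d" % r for r in top3.columns]
top3 = top3.reset_index()
```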

In [None]:
result[['description_id','paper_1','paper_2','paper_3']].to_csv("submit.tsv",index=False,header=None,sep='\t')
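
For a local sanity check, MAP@3 can be computed with a few lines of plain Python. The sketch follows the common definition of mean average precision at k; the competition's own scoring script is not shown here, so treat the exact formula as an assumption. All ids are made up.

```python
def average_precision_at_k(predicted, relevant, k=3):
    """AP@k for one description: `predicted` is ranked, `relevant` is a set."""
    hits, score, seen = 0, 0.0, set()
    for rank, paper in enumerate(predicted[:k], start=1):
        if paper in relevant and paper not in seen:
            seen.add(paper)
            hits += 1
            score += hits / rank
    denom = min(len(relevant), k)
    return score / denom if denom else 0.0

def map_at_k(all_predicted, all_relevant, k=3):
    """Mean of AP@k over all descriptions."""
    aps = [average_precision_at_k(p, r, k)
           for p, r in zip(all_predicted, all_relevant)]
    return sum(aps) / len(aps) if aps else 0.0

# Hypothetical predictions: hit at rank 1, hit at rank 2, no hit
predictions = [["p1", "p2", "p3"], ["p9", "p4", "p5"], ["p7", "p8", "p6"]]
ground_truth = [{"p1"}, {"p4"}, {"p0"}]
score = map_at_k(predictions, ground_truth)
```

With a single relevant paper per description, a hit at rank 1, 2 or 3 contributes 1, 1/2 or 1/3 respectively.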