# Using Novel Language Models and Elasticsearch to Effectively Identify Articles Related to Therapeutics and Vaccines
 * Team: MD-Lab, ASU
 * Author: Jitesh Pabla, Email: jpabla1@asu.edu, Kaggle ID: jiteshpabla
 * Team Members: Rishab Banerjee, Hong Guan, Ashwin Karthik Ambalavanan, Mihir Parmar, Murthy Devarakonda
 * Email ID: loccapollo@gmail.com, hguan6@asu.edu, aambalav@asu.edu, mparmar3@asu.edu, Murthy.Devarakonda@asu.edu
 * Kaggle ID: loccapollo, hongguan, ashwinambal96, mihir3031, murthydevarakonda
 * This is a Team Submission
 * Here are the links to our team's kernels:
     - https://www.kaggle.com/jiteshpabla/scoring-cord-19-using-google-training-on-scibert/
     - https://www.kaggle.com/ashwinambal96/scibert-based-article-identification
     - https://www.kaggle.com/hongguan/micro-scorers-for-covid-19-open-challenge/
     - https://www.kaggle.com/mihir3031/bert-sts-for-searching-relevant-research-papers
     - https://www.kaggle.com/loccapollo/lexicon-based-similarity-scoring-with-bert-biobert
     - The final ensembling that combines everything together: https://www.kaggle.com/hongguan/ensemble-model-for-covid-19-open-challenge/
 

# Introduction
This repository deals with the "cord19-vaccines-and-therapeutics" dataset which is based on the ["What do we know about vaccines and therapeutics?"](https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge/tasks?taskId=561) task of the COVID-19 Open Research Dataset Challenge (CORD-19).

# Part 1: prepare a csv for elasticBERT

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

In [None]:
dfm = pd.read_csv("/kaggle/input/CORD-19-research-challenge/metadata.csv")
dfm

In [None]:
#fill the blanks with a placeholder character
dfm = dfm.fillna("x")

Take only the title and abstract; only these two fields are used for elasticBERT for now. Title+journal and abstract-only inputs were also tried, but on manual analysis title+abstract seemed to work best.

In [None]:
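# .copy() avoids pandas' SettingWithCopyWarning when the columns are reassigned below
df2 = dfm[['title', 'abstract']].copy()
#make sure both columns are strings
df2['title'] = df2['title'].astype(str)
df2['abstract'] = df2['abstract'].astype(str)
df2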

Output the final file to be used for elasticBERT.

In [None]:
import csv
df2.to_csv('metadata_out.csv', index=False, quotechar='"', quoting=csv.QUOTE_NONNUMERIC)

# Part 2: elasticBERT

The code for elasticBERT and the instructions for running it can be found in the [repository for the full team](https://github.com/md-labs/covid19-kaggle),
or you can clone it from the [elasticBERT standalone repository](https://github.com/jiteshpabla/elasticbert) and run it locally with the instructions given.

The code runs in Docker and takes significant time to run, so it has not been put into the kernel yet.





The goal is to find the most relevant articles related to "vaccines" and "therapeutics", so the queries I used in the elasticBERT code are:
* vaccines
* coronavirus vaccine
* coronavirus therapeutics
* therapeutics

The first 1000 relevant articles are taken from each query.

ElasticBERT uses any BERT model (here, BERT-cased-768 is used) to generate an output vector for the title+abstract combination of each sample in the metadata file.

It then uses the same BERT model to generate a vector for the query and calculates the cosine similarity between the query vector and each sample's title+abstract vector to find the most relevant articles.
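
The ranking step can be illustrated with a small NumPy sketch. This is not the elasticBERT code itself: the function name `rank_by_cosine` and the toy 3-dimensional vectors are assumptions made for illustration, standing in for the 768-dimensional BERT vectors.

```python
import numpy as np

def rank_by_cosine(query_vec, doc_vecs, top_k=3):
    """Return (indices, scores) of the top_k documents most similar to the query."""
    query_vec = np.asarray(query_vec, dtype=float)
    doc_vecs = np.asarray(doc_vecs, dtype=float)
    # cosine similarity = dot product divided by the product of the vector norms
    norms = np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(query_vec)
    scores = doc_vecs @ query_vec / norms
    # sort by descending similarity and keep the best top_k
    order = np.argsort(-scores)[:top_k]
    return order, scores[order]

# Toy 3-dimensional "embeddings" (made up); real BERT vectors would be much longer.
docs = np.array([[1.0, 0.0, 0.0],
                 [0.7, 0.7, 0.0],
                 [0.0, 0.0, 1.0]])
query = np.array([1.0, 0.1, 0.0])

top_idx, top_scores = rank_by_cosine(query, docs, top_k=2)
print(top_idx, top_scores)
```

The sketch does not guard against zero-length vectors, which would cause a division by zero.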

This method runs queries extremely fast, and its main advantage is that many different queries answering different questions (like the many questions across the CORD-19 tasks) can be run in a streamlined fashion.
Currently, however, the focus is only on a few queries (as part of a single CORD-19 task).

# Part 3: converting elasticBERT scores to classes

### part 3-a: loading and cleanup

In [None]:
DATA_DIR = "/kaggle/input/cord19-elasticbert-query-results/"

In [None]:
vdf = pd.read_csv(DATA_DIR+"BERT_vaccines.csv")
tdf = pd.read_csv(DATA_DIR+"BERT_therapeutics.csv")
vdf2 = pd.read_csv(DATA_DIR+"BERT_coronavirus vaccine.csv")
tdf2 = pd.read_csv(DATA_DIR+"BERT_coronavirus therapeutics.csv")

The output of elasticBERT needs to be cleaned up, and the title and abstract need to be separated.
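
The fixed slice offsets in `cleanup` suggest that each `_source` value is the string form of a Python dict with `abstract` and `title` keys. Under that assumption (it is not confirmed by the result files themselves), a less offset-dependent alternative could parse the string with `ast.literal_eval`. The helper name `split_source` and the sample row are made up for illustration.

```python
import ast
import pandas as pd

def split_source(df):
    """Return a copy of df with 'title' and 'abstract' parsed from '_source'."""
    # ASSUMPTION: every _source value is a dict literal such as
    # "{'abstract': '...', 'title': '...'}"
    parsed = df["_source"].apply(ast.literal_eval)
    out = df.copy()
    out["abstract"] = parsed.apply(lambda d: d["abstract"])
    out["title"] = parsed.apply(lambda d: d["title"])
    return out

# Made-up sample row in the assumed format
sample = pd.DataFrame({
    "_score": [1.9],
    "_source": ["{'abstract': 'A made-up abstract.', 'title': 'A made-up title'}"],
})
result = split_source(sample)
print(result[["title", "abstract"]])
```

Unlike `cleanup`, this sketch returns a new frame instead of modifying its argument, and it raises an error on any row that is not a valid dict literal.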

In [None]:
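def cleanup(df):
  # split the raw _source string into the abstract part and the title part (modifies df in place)
  df[['abstract','title']] = df._source.str.split("'title':",expand=True)
  # trim the fixed-width wrapper characters around each value
  df["title"] = df.title.str[2:-2]
  df["abstract"] = df.abstract.str[14:-3]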

In [None]:
cleanup(vdf)
cleanup(vdf2)
cleanup(tdf)
cleanup(tdf2)
vdf

Get the titles from the metadata file.

In [None]:
metadf_title= dfm[["title"]]
#metadf_title = metadf_title.drop_duplicates()
metadf_title

### part 3-b: For vaccine class

Merge the two vaccine result frames and take the maximum score for each title.

In [None]:

merged_vdf = vdf.merge(vdf2, on=['title'], 
                   how='inner', indicator=True, suffixes=('', '_y'))
merged_vdf["score"] = merged_vdf[["_score", "_score_y"]].values.max(1)
merged_vdf.drop(list(merged_vdf.filter(regex='_y$')), axis=1, inplace=True)
merged_vdf = merged_vdf[["title", "score"]]
merged_vdf.drop_duplicates(inplace=True)
merged_vdf

In [None]:
vdf_concat = pd.concat([vdf, vdf2])
#vdf_concat = vdf_concat[["title"]]
vdf_concat.drop_duplicates(subset=["title"], inplace=True)
vdf_concat["score"] = vdf_concat["_score"]
vdf_concat = vdf_concat[["title", "score"]]
vdf_concat

In [None]:
vdf_concat2 = vdf_concat.merge(merged_vdf, on =["title"], how = 'outer' ,indicator=True).loc[lambda x : x['_merge']=='left_only']
vdf_concat2["score"] = vdf_concat2["score_x"]
vdf_concat2 = vdf_concat2[["title", "score"]]
vdf_concat2

In [None]:
vdf_final = pd.concat([vdf_concat2, merged_vdf])
vdf_final.drop_duplicates(subset=["title"],inplace=True)
vdf_final

### part 3-c: For therapeutics class

Merge the two therapeutics result frames and take the maximum score for each title.
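
These steps repeat the ones used for the vaccine class. As a more compact alternative, the "keep the highest score per title" idea can be sketched with a single `groupby`. The helper name `max_score_per_title` and the toy frames are assumptions for illustration; the result should match the multi-step version as long as each title's first occurrence in a frame carries its highest score.

```python
import pandas as pd

def max_score_per_title(*frames):
    """Stack query result frames and keep the highest '_score' seen for each title."""
    stacked = pd.concat(frames, ignore_index=True)
    # one row per title, holding the maximum score across all stacked frames
    best = stacked.groupby("title", as_index=False)["_score"].max()
    return best.rename(columns={"_score": "score"})

# Toy stand-ins for two query result frames (made-up titles and scores)
a = pd.DataFrame({"title": ["p1", "p2"], "_score": [1.5, 1.2]})
b = pd.DataFrame({"title": ["p2", "p3"], "_score": [1.4, 1.1]})

best = max_score_per_title(a, b)
print(best)
```

With the real data this would be called as `max_score_per_title(tdf, tdf2)` or `max_score_per_title(vdf, vdf2)`. Note that `groupby` drops rows whose title is missing.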

In [None]:
merged_tdf = tdf.merge(tdf2, on=['title'], 
                   how='inner', indicator=True, suffixes=('', '_y'))
merged_tdf["score"] = merged_tdf[["_score", "_score_y"]].values.max(1)
merged_tdf.drop(list(merged_tdf.filter(regex='_y$')), axis=1, inplace=True)
merged_tdf = merged_tdf[["title", "score"]]
merged_tdf.drop_duplicates(inplace=True)
merged_tdf

In [None]:
tdf_concat = pd.concat([tdf, tdf2])
#tdf_concat = tdf_concat[["title"]]
tdf_concat.drop_duplicates(subset=["title"], inplace=True)
tdf_concat["score"] = tdf_concat["_score"]
tdf_concat = tdf_concat[["title", "score"]]
tdf_concat

In [None]:
tdf_concat2 = tdf_concat.merge(merged_tdf, on =["title"], how = 'outer' ,indicator=True).loc[lambda x : x['_merge']=='left_only']
tdf_concat2["score"] = tdf_concat2["score_x"]
tdf_concat2 = tdf_concat2[["title", "score"]]
tdf_concat2

In [None]:
tdf_final = pd.concat([tdf_concat2, merged_tdf])
tdf_final.drop_duplicates(subset=["title"],inplace=True)
tdf_final

### part 3-d: Merging the 2 extracted classes with metadata

In [None]:
# copy so that adding the class column does not trigger a SettingWithCopyWarning
metadf_final = metadf_title.copy()
metadf_final["class"] = 0
metadf_final

In [None]:
#the common ones between VACCINES AND THERAPEUTICS
merged_all = tdf_final.merge(vdf_final, on=['title'], 
                   how='inner', indicator=True, suffixes=('', '_y'))
merged_all.drop(list(merged_all.filter(regex='_y$')), axis=1, inplace=True)
#merged_all.drop_duplicates(inplace=True)
merged_all

There is an overlap between the "vaccines" and "therapeutics" articles. The team's current classification ensemble does not yet support a sample belonging to two classes (a "both" class), so we split the 1268 overlapping samples between the 2 classes based on the maximum cosine similarity score.
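
The splitting rule can also be written without a row-by-row loop. The following is a minimal sketch, assuming each scores frame has unique titles and the columns `title` and `score`; the function name `assign_classes` and the toy data are made up for illustration. Classes are 0 (neither), 1 (vaccine), and 2 (therapeutics), with ties going to the vaccine class.

```python
import numpy as np
import pandas as pd

def assign_classes(meta, v_final, t_final):
    """Label each title: 0 = neither, 1 = vaccine, 2 = therapeutics."""
    # Look up each title's score per class; titles without a match become NaN.
    # ASSUMPTION: titles are unique within v_final and within t_final.
    v = meta["title"].map(v_final.set_index("title")["score"])
    t = meta["title"].map(t_final.set_index("title")["score"])
    out = meta.copy()
    out["class"] = np.select(
        [v.notna() & (t.isna() | (v >= t)),  # vaccine only, or vaccine wins/ties
         t.notna()],                         # therapeutics only, or therapeutics wins
        [1, 2],
        default=0,
    )
    return out

# Toy data (made-up titles and scores)
meta = pd.DataFrame({"title": ["p1", "p2", "p3", "p4"]})
v_final = pd.DataFrame({"title": ["p1", "p2"], "score": [1.5, 1.2]})
t_final = pd.DataFrame({"title": ["p2", "p3"], "score": [1.4, 1.1]})

labeled = assign_classes(meta, v_final, t_final)
print(labeled)
```

Here `p2` appears in both frames and is assigned to therapeutics because its therapeutics score is higher.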

In [None]:
# index the scores by title so lookups are by label, not by row position
v_scores = vdf_final.set_index("title")["score"]
t_scores = tdf_final.set_index("title")["score"]
x = y = 0
for i, row in metadf_final.iterrows():
  title = row["title"]
  in_v = title in v_scores.index
  in_t = title in t_scores.index
  if in_v and in_t:
    # article matched both classes: keep the class with the higher score
    if v_scores.loc[title] >= t_scores.loc[title]:
      metadf_final.loc[i,'class'] = 1
      x = x+1
    else:
      metadf_final.loc[i,'class'] = 2
      y = y+1
  elif in_v:
    metadf_final.loc[i,'class'] = 1
  elif in_t:
    metadf_final.loc[i,'class'] = 2

In [None]:
print("common articles split into vaccines and therapeutics respectively")
print(x,y)

The final output:

In [None]:
metadf_final

(Optional) Query the metadata to see the final classes.

In [None]:
#rename column to support query syntax
metadf_final = metadf_final.rename(columns={"class": "classif"}, errors="raise")

In [None]:
metadf_final.query("`classif` == 1")

In [None]:
metadf_final.query("`classif` == 2")