## Deep topic model for COVID-19

<img src="https://img-blog.csdnimg.cn/20200416180916778.png" width = "600" height = "500" alt="tree" align=center />

COVID-19 Open Research Dataset (CORD-19) is a free resource of scholarly articles, aggregated by a coalition of leading research groups, about COVID-19 and the coronavirus family of viruses. The dataset can be found on [Semantic Scholar](https://pages.semanticscholar.org/coronavirus-research) and there is a research challenge on [Kaggle](https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge).



The goal of this project is to discover hierarchical topic structure in the large corpus of COVID-19 research articles. Specifically, we use a [deep topic model](https://arxiv.org/abs/1511.02199) to analyze the semantic structure of the articles. First-layer topics are combinations of words, while higher-layer topics are combinations of lower-layer topics. Together, these topics describe global word co-occurrence statistics and hierarchical semantics, from detailed to coarse.
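
To make the layered structure concrete, the sketch below uses toy numbers (not the trained model) to show how a second-layer topic, defined as a mixture over first-layer topics, can be projected down to the vocabulary by multiplying the loading matrices. The names `phi1`, `phi2` and the tiny vocabulary are illustrative assumptions.

```python
import numpy as np

# Toy vocabulary and loadings; all values here are made up for illustration.
vocab = ["virus", "lung", "bat", "patient"]

# phi1[v, k]: probability of word v under first-layer topic k (columns sum to 1).
phi1 = np.array([
    [0.6, 0.1],
    [0.1, 0.5],
    [0.2, 0.0],
    [0.1, 0.4],
])

# phi2[k, j]: weight of first-layer topic k in second-layer topic j (columns sum to 1).
phi2 = np.array([
    [0.7],
    [0.3],
])

# Projecting the second-layer topic onto the vocabulary gives a word distribution.
phi2_words = phi1 @ phi2

# The two most probable words of the projected second-layer topic.
top_words = [vocab[i] for i in np.argsort(phi2_words[:, 0])[::-1][:2]]
print(top_words)
```

Because each column of both matrices sums to one, the projected column is again a distribution over words, which is why a higher-layer topic can still be read as a ranked word list.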


An example of the hierarchical topics learned from the CORD-19 corpus is shown in the tree figure above. From these topics, COVID-19 appears to have four characteristics: group, lung infection, animal-related, and infectious.


All results are saved on [Dropbox](https://www.dropbox.com/home/GBN_Covid19/plot_images_tree).

## install from github
The full source code for this project is on [GitHub](https://github.com/wds2014/topic_model_for_Covid) and can be installed into this notebook as follows:

In [None]:
!echo -e "StrictHostKeyChecking no\n" >> ~/.ssh/config
!git clone https://github.com/wds2014/topic_model_for_Covid

In [None]:
cd topic_model_for_Covid

## install scibert for word representation

In [None]:
!pip install transformers
!wget -O scibert_uncased.tar https://s3-us-west-2.amazonaws.com/ai2-s2-research/scibert/huggingface_pytorch/scibert_scivocab_uncased.tar
!tar -xvf scibert_uncased.tar

## have a look at the data form
Adapted from [maksimeren](https://www.kaggle.com/maksimeren/covid-19-literature-clustering).

In [None]:
import os
import json
from pprint import pprint
from copy import deepcopy

import numpy as np
import pandas as pd
import tqdm
root_path = '/kaggle/input/CORD-19-research-challenge/'
metadata_path = f'{root_path}/metadata.csv'
meta_df = pd.read_csv(metadata_path, dtype={
    'pubmed_id': str,
    'Microsoft Academic Paper ID': str, 
    'doi': str
})
meta_df.head()

In [None]:
meta_df.info()
del meta_df

## load the CORD-19 data
The raw CORD-19 data is stored across a metadata.csv file and JSON files containing the full text.
We use only the abstract and full-text fields in our topic model.
The following code writes one txt file per source directory, where each line contains the abstract and full-text words of one article, corresponding to one JSON file. We remove articles that contain 20 words or fewer.

In [None]:
from get_doc import *
biorxiv_dir = '/kaggle/input/CORD-19-research-challenge/biorxiv_medrxiv/biorxiv_medrxiv/pdf_json/'
comm_dir = '/kaggle/input/CORD-19-research-challenge/comm_use_subset/comm_use_subset/pdf_json/'
custom_dir = '/kaggle/input/CORD-19-research-challenge/custom_license/custom_license/pdf_json/'
noncomm_dir = '/kaggle/input/CORD-19-research-challenge/noncomm_use_subset/noncomm_use_subset/pdf_json/'
json_files = [biorxiv_dir, comm_dir, custom_dir, noncomm_dir]
json_files_names= ['biorxiv_dir', 'comm_dir', 'custom_dir', 'noncomm_dir']
doc_info = []
for each_files, each_files_name in zip(json_files, json_files_names):
    filenames = os.listdir(each_files)
    print("Number of articles retrieved from {} : {}".format(each_files_name, len(filenames)))
    all_files = []

    for filename in filenames:
        filename = each_files + filename
        file = json.load(open(filename, 'rb'))
        all_files.append(file)

    cleaned_files = []
    doc_num = 0
    with open('{}.txt'.format(each_files_name),'w') as f:
        for file in all_files:
            # Concatenate the abstract and the full body text of the article.
            doc = format_body(file['abstract']) + format_body(file['body_text'])
            
            # Count words rather than characters when filtering short articles.
            if len(doc.split()) > 20:
                doc_info.append((file['paper_id'], file['metadata']['title'], file['metadata']['authors']))
                f.write(doc)
                f.write('\n')
                doc_num +=1
    print('doc_num : ', doc_num)
print('doneeeee')
del all_files
del doc
del file
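
`format_body` is imported from `get_doc`, which is not shown here. As a rough sketch of what such a helper might do, assuming the CORD-19 JSON layout in which `abstract` and `body_text` are lists of paragraph dictionaries with a `text` field, the hypothetical `join_paragraphs` below flattens both parts into one line and applies a word-count filter. The function name, the threshold, and the sample record are illustrative, not the project's actual code.

```python
def join_paragraphs(paragraphs):
    """Concatenate the 'text' field of each paragraph into one line (hypothetical helper)."""
    return " ".join(p["text"].replace("\n", " ").strip() for p in paragraphs)

# A made-up record that mimics the structure of a CORD-19 pdf_json file.
sample = {
    "paper_id": "demo-0001",
    "abstract": [{"text": "Coronaviruses infect the respiratory tract."}],
    "body_text": [
        {"text": "We review transmission routes."},
        {"text": "We also summarise\nclinical outcomes."},
    ],
}

# One article becomes one line of text: abstract first, then the body.
doc = join_paragraphs(sample["abstract"]) + " " + join_paragraphs(sample["body_text"])
n_words = len(doc.split())

# Keep the article only if it has more than min_words words (toy threshold).
min_words = 5
keep = n_words > min_words
print(n_words, keep)
```

Removing newlines matters here because the output file stores exactly one article per line.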

## clean the txt data

This project extracts bag-of-words features from each article.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
import numpy as np
import scipy.sparse as sp
from tokenizer import Tokenizer
 

paths = ['biorxiv_dir.txt', 'comm_dir.txt', 'custom_dir.txt', 'noncomm_dir.txt']
corpus = []

# Read every per-source txt file; each line is one article.
for path in paths:
    with open(path) as f:
        for line in f:
            corpus.append(line.strip())
print('total doc :',len(corpus))
vectorizer = TfidfVectorizer(lowercase=True, stop_words='english', max_features=10000,tokenizer=Tokenizer.tokenize)

X = vectorizer.fit_transform(corpus)
voc = vectorizer.vocabulary_
vectorizer = CountVectorizer(vocabulary=voc, tokenizer=Tokenizer.tokenize)
X = vectorizer.fit_transform(corpus)
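# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() when available.
if hasattr(vectorizer, 'get_feature_names_out'):
    voc = list(vectorizer.get_feature_names_out())
else:
    voc = vectorizer.get_feature_names()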

sp.save_npz('cord19_10000_full.npz', X)
np.save('voc_10000_full.npy',voc)
with open('voc_10000_full.txt','w') as f:
    for word in voc:
        f.write(word)
        f.write('\n')
print('doneeeeee')
del corpus
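
For readers unfamiliar with the bag-of-words representation, here is a small standard-library sketch of what the count matrix holds: one row per document, one column per vocabulary word, and each entry a term count. The toy corpus and whitespace tokenisation are assumptions for illustration; the notebook itself uses scikit-learn with the project's `Tokenizer`.

```python
from collections import Counter

toy_corpus = [
    "virus infects lung cells",
    "bat virus jumps to humans",
    "lung infection in patients",
]

# Build a sorted vocabulary from whitespace tokens.
tokenised = [text.split() for text in toy_corpus]
vocab = sorted({word for tokens in tokenised for word in tokens})

# Each row counts how often each vocabulary word occurs in one document.
bow = []
for tokens in tokenised:
    counts = Counter(tokens)
    bow.append([counts[word] for word in vocab])

for row in bow:
    print(row)
```

Word order is discarded in this representation, which is what lets the topic model work purely from co-occurrence counts.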

## Using wordcloud to visualize the global information of the corpus

In [None]:
from wordcloud import WordCloud
import matplotlib.pyplot as plt
doc_data = X.toarray()
doc_fre = np.sum(doc_data,0)
word_fre = {}
for idx, word in enumerate(voc):
    word_fre[word] = doc_fre[idx]
wc = WordCloud(max_words=1000)
wc.generate_from_frequencies(word_fre)

# show
plt.imshow(wc)
plt.axis("off")
plt.show()
del word_fre
del doc_fre
del X


## build the topic model([PGBN](https://arxiv.org/abs/1511.02199))
Training is memory-intensive, so because of CPU memory limits you can load the pre-trained parameters instead (set `pre_trained = True`).

In [None]:
from PGBN import PGBN
from scipy import sparse as sp
import pickle
import numpy as np
pre_trained = True
if not pre_trained:
    pgbn = PGBN(doc_data.T,K=[400,200,64,32],voc=voc)
    pgbn.train('./output',iteration =2000)
    Phi = pgbn.Phi
    Theta = pgbn.Theta
else:
    voc = np.load('/kaggle/input/pre-trained/data/output/voc_10000.npy')
    doc_data = sp.load_npz('/kaggle/input/pre-trained/data/output/cord19_10000.npz')
    doc_data = doc_data.toarray()
    pgbn = PGBN(doc_data.T,K=[400,200,64,32],voc=voc)
    with open('/kaggle/input/pre-trained/data/output/Phi.pick','rb') as f:
        Phi = pickle.load(f)
    with open('/kaggle/input/pre-trained/data/output/Theta.pick','rb') as f:
        Theta = pickle.load(f)
#pgbn.Phi_vis(Phi)
#graph = pgbn.show_tree(Phi)
#graph.render('output/tree')
del doc_data

## BERT
We use pre-trained SciBERT as an encoder to extract word embeddings. In detail, when a query (here, the task or question) arrives, we match it against the trained topics in the semantic space. The topics most relevant to the task are then selected according to their similarity scores.

In [None]:
import torch
from transformers import BertTokenizer, BertModel
import heapq
model_version = 'scibert_scivocab_uncased'
do_lower_case = True
model = BertModel.from_pretrained(model_version)
tokenizer = BertTokenizer.from_pretrained(model_version, do_lower_case=do_lower_case)

from sklearn.metrics.pairwise import cosine_similarity
def embed_text(text, model):
    """Return per-token embeddings (n_tokens x 768), excluding the first and last special tokens."""
    input_ids = torch.tensor(tokenizer.encode(text)).unsqueeze(0)  # Batch size 1
    with torch.no_grad():  # inference only, no gradients needed
        outputs = model(input_ids)
    last_hidden_states = outputs[0]  # The last hidden state is the first element of the output
    return last_hidden_states.numpy()[0, 1:-1, :]

def topic_match_v1(q, k, model, q_weight=None, greedy=False):
    """Return the indices of the 5 topics in k that are most similar to the query q."""
    q_embed = embed_text(q, model)  # n_word * 768
    if q_weight is not None:
        # Weighted sum of the query word embeddings, one weight per word.
        q_embed = (q_embed * np.asarray(q_weight).reshape(-1, 1)).sum(0, keepdims=True)
    else:
        q_embed = q_embed.mean(0, keepdims=True)
    score = []
    for each_k in k:
        k_embed = embed_text(each_k, model)  # n_word * 768
        if greedy:
            # Score a topic by its single best-matching word.
            score.append(cosine_similarity(q_embed, k_embed)[0].max())
        else:
            # Score a topic by its mean word embedding; exactly one score per topic.
            score.append(cosine_similarity(q_embed, k_embed.mean(0, keepdims=True))[0][0])
    # Indices of the 5 highest scores, best first.
    return [int(i) for i in np.argsort(score)[::-1][:5]]

def topic_sents(path):
    k = []
    with open(path) as f:
        lines = f.readlines()
    for line in lines:
        k.append(line.strip())
    return k
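
The matching step can be illustrated without loading SciBERT. In the sketch below, hand-picked toy vectors stand in for mean-pooled embeddings (an assumption purely for illustration), and topics are ranked by cosine similarity to the query, mirroring what `topic_match_v1` does in its non-greedy mode.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 3-dimensional "embeddings"; real SciBERT vectors have 768 dimensions.
query_vec = np.array([1.0, 0.0, 1.0])
topic_vecs = {
    "nursing care hospital": np.array([0.9, 0.1, 0.8]),
    "bat reservoir host": np.array([0.0, 1.0, 0.1]),
    "ventilation icu oxygen": np.array([0.7, 0.3, 0.2]),
}

# One similarity score per topic.
scores = {name: cosine(query_vec, vec) for name, vec in topic_vecs.items()}

# Rank topics from most to least similar to the query.
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

Cosine similarity ignores vector length, so a topic described by many keywords is not favoured over a short one simply because of its size.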


In [None]:
topic_word = topic_sents('/kaggle/input/pre-trained/data/output/phi3.txt')
top_id = topic_match_v1('medical care', topic_word, model)


## show topics related to Medical Care
We build a four-layer topic model for this task and show below the hierarchical topics most related to medical care.

In [None]:
graph = pgbn.show_tree(Phi,topic_id=top_id[0],threshold=0.05)
graph

## get articles about this task


In [None]:

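def get_doc_id(layer, top_id=(0, 1), top_k=100):
    """Return, for each topic in top_id, the top_k document indices ranked by topic proportion."""
    target_Theta = Theta[layer].T  # documents x topics

    topic_list = []
    for i in top_id:
        # argsort is ascending, so take the last top_k entries and reverse them.
        data = np.argsort(target_Theta[:, i])[-top_k:][::-1]
        topic_list.extend(data)

    return topic_list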
with open('/kaggle/input/pre-trained/doc_info.pkl','rb')as f:
    doc_info = pickle.load(f)
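
As a small illustration of the retrieval step, the sketch below ranks documents by their weight on a chosen topic using a toy proportion matrix. The matrix `theta`, laid out as topics by documents, and the helper `top_docs` are assumptions for illustration and contain made-up values rather than trained parameters.

```python
import numpy as np

# Toy topic proportions: rows are topics, columns are documents (values are made up).
theta = np.array([
    [0.1, 0.7, 0.2, 0.5],
    [0.8, 0.1, 0.3, 0.2],
])

def top_docs(theta, topic, top_k=2):
    """Return indices of the top_k documents with the largest weight on `topic`."""
    order = np.argsort(theta[topic])[::-1]  # descending by topic weight
    return [int(i) for i in order[:top_k]]

print(top_docs(theta, topic=0))
print(top_docs(theta, topic=1))
```

When several topics are queried, the same document can appear in more than one list, so the combined result may contain repeated document indices.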

In [None]:
import csv
import pandas as pd
doc_id = get_doc_id(3, top_id=top_id)

data = []
for i in doc_id:
    paper_id, title, authors = doc_info[i]
    # Join each author's first and last name with a space, and separate authors with commas.
    names = ", ".join("{} {}".format(a['first'], a['last']) for a in authors)
    data.append((paper_id, title, names))

# newline='' prevents the csv module from writing blank lines on some platforms.
with open('info.csv', 'w', newline='') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow(['Paper ID', 'Title', 'Authors'])
    writer.writerows(data)
meta_df = pd.read_csv('info.csv')
meta_df

## pros and cons
This project is based on a deep topic model, which can discover hierarchical semantic concepts in the COVID-19 corpus, for example medical care. Given a topic of interest, you can then find the articles most relevant to it based on their topic proportions.
### Pros
1. A topic model learns the concepts in the corpus and helps us quickly get an overall understanding of what the articles discuss. Furthermore, the deep model can uncover the hidden hierarchical relationships between topics. For example, if we focus on medical care and use it as the root node, we get its sub-topics, such as nursing, extracorporeal membrane oxygenation (ECMO), interventions, and so on.
2. Based on the topic proportions of each article, we can collect the articles that are most relevant to the topic of interest.
3. This project uses pre-trained SciBERT as the encoder, which achieved state-of-the-art results in natural language processing (NLP). SciBERT encodes both questions and topics into the same semantic space, which is useful for downstream tasks.

### Cons
1. As seen above, topics are composed of keywords, often requiring domain experts to discover more interesting phenomena.
2. There are some duplicate topics in the results.
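
One possible way to spot such duplicates, sketched here under the assumption that topics are columns of a topic-word matrix, is to compare topic columns by cosine similarity and flag pairs above a threshold. The matrix `phi` and the threshold of 0.95 are made-up illustrative choices, and this check is not part of the project's code.

```python
import numpy as np

# Toy topic-word matrix: rows are words, columns are topics (values are made up).
phi = np.array([
    [0.5, 0.48, 0.0],
    [0.4, 0.42, 0.1],
    [0.1, 0.10, 0.9],
])

# Cosine similarity between every pair of topic columns.
unit = phi / np.linalg.norm(phi, axis=0, keepdims=True)
sim = unit.T @ unit

# Report topic pairs above a similarity threshold (0.95 is an arbitrary choice).
threshold = 0.95
n_topics = sim.shape[0]
pairs = [(i, j) for i in range(n_topics) for j in range(i + 1, n_topics)
         if sim[i, j] > threshold]
print(pairs)
```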

## Answer the questions

### Given the questions as queries, return the hierarchical topic structure and the papers related to each question.

* [Resources to support skilled nursing facilities and long term care facilities.](https://www.kaggle.com/danawan/covid-19-q?scriptVersionId=32117016)
* [Mobilization of surge medical staff to address shortages in overwhelmed communities](https://www.kaggle.com/goyeah/covid-19-q-2)
* [Age-adjusted mortality data for Acute Respiratory Distress Syndrome (ARDS) with/without other organ failure – particularly for viral etiologies](https://www.kaggle.com/goyeah/covid-19-q-3)
* [Extracorporeal membrane oxygenation (ECMO) outcomes data of COVID-19 patients](https://www.kaggle.com/goyeah/covid-19-q-4)
* [Outcomes data for COVID-19 after mechanical ventilation adjusted for age.](https://www.kaggle.com/goyeah/covid-19-q-5)
* [Knowledge of the frequency, manifestations, and course of extrapulmonary manifestations of COVID-19, including, but not limited to, possible cardiomyopathy and cardiac arrest.](https://www.kaggle.com/goyeah/covid-19-q-6)
* [Application of regulatory standards (e.g., EUA, CLIA) and ability to adapt care to crisis standards of care level.](https://www.kaggle.com/goyeah/covid-19-q-7)
* [Approaches for encouraging and facilitating the production of elastomeric respirators, which can save thousands of N95 masks.](https://www.kaggle.com/goyeah/covid-19-q-8)
* [Best telemedicine practices, barriers and facilitators, and specific actions to remove/expand them within and across state boundaries.](https://www.kaggle.com/goyeah/covid-19-q-9)
* [Guidance on the simple things people can do at home to take care of sick people and manage disease.](https://www.kaggle.com/goyeah/covid-19-q-10)
* [Oral medications that might potentially work.](https://www.kaggle.com/goyeah/covid-19-q-11)
* [Use of AI in real-time health care delivery to evaluate interventions, risk factors, and outcomes in a way that could not be done manually.](https://www.kaggle.com/goyeah/covid-19-q-12)
* [Best practices and critical challenges and innovative solutions and technologies in hospital flow and organization, workforce protection, workforce allocation, community-based support resources, payment, and supply chain management to enhance capacity, efficiency, and outcomes.](https://www.kaggle.com/goyeah/covid-19-q-13)
* [Efforts to define the natural history of disease to inform clinical care, public health interventions, infection prevention control, transmission, and clinical trials](https://www.kaggle.com/goyeah/covid-19-q-14)
* [Efforts to develop a core clinical outcome set to maximize usability of data across a range of trials](https://www.kaggle.com/goyeah/covid-19-q-15)
* [Efforts to determine adjunctive and supportive interventions that can improve the clinical outcomes of infected patients (e.g. steroids, high flow oxygen)](https://www.kaggle.com/goyeah/covid-19-q-16)