# Summary
## The problem
This notebook is an entry in the COVID-19 Open Research Dataset Challenge task: what has been published about medical care?

Specifically, the organisers want to know what the literature reports about the following topics. Click on a topic to go to a list of relevant literature:

* [Resources to support skilled nursing facilities and long term care facilities](#0)
* [Mobilization of surge medical staff to address shortages in overwhelmed communities](#1)
* [Age-adjusted mortality data for Acute Respiratory Distress Syndrome (ARDS) with/without other organ failure – particularly for viral etiologies](#2)

* [Extracorporeal membrane oxygenation (ECMO) outcomes data of COVID-19 patients](#3)
* [Outcomes data for COVID-19 after mechanical ventilation adjusted for age](#4)
* [Knowledge of the frequency, manifestations, and course of extrapulmonary manifestations of COVID-19, including, but not limited to, possible cardiomyopathy and cardiac arrest](#5)
* [Application of regulatory standards (e.g., EUA, CLIA) and ability to adapt care to crisis standards of care level](#6)
* [Approaches for encouraging and facilitating the production of elastomeric respirators, which can save thousands of N95 masks](#7)
* [Best telemedicine practices, barriers and facilitators, and specific actions to remove/expand them within and across state boundaries](#8)
* [Guidance on the simple things people can do at home to take care of sick people and manage disease](#9)
* [Oral medications that might potentially work](#10)
* [Use of AI in real-time health care delivery to evaluate interventions, risk factors, and outcomes in a way that could not be done manually](#11)
* [Best practices and critical challenges and innovative solutions and technologies in hospital flow and organization, workforce protection, workforce allocation, community-based support resources, payment, and supply chain management to enhance capacity, efficiency, and outcomes](#12)
* [Efforts to define the natural history of disease to inform clinical care, public health interventions, infection prevention control, transmission, and clinical trials](#13)
* [Efforts to develop a core clinical outcome set to maximize usability of data across a range of trials](#14)
* [Efforts to determine adjunctive and supportive interventions that can improve the clinical outcomes of infected patients (e.g. steroids, high flow oxygen)](#15)

## Proposed Solution Discussion

I started by trying the methods that were successful in my entry (https://www.kaggle.com/rdhnw1/triage-recommender-with-cold-start) in last year's Kaggle CareerVillage competition. The method answered career-related questions by comparing a new question with previously asked questions.

A number of methods were tried, including TF-IDF, word2vec, fastText, GloVe (Global Vectors) and the Universal Sentence Encoder (USE).

fastText and USE seemed to produce the best performance in that competition. For this challenge, however, only USE has produced useful results. I think this is because the other methods rely on simple averaging to move from word embeddings to a sentence embedding. That works well when the phrases being compared are similar in length, but it fails in this challenge, where there is a huge mismatch between the length of the query and the length of the literature text. USE copes much better.
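
The length-mismatch argument can be illustrated with a toy calculation. The sketch below is not part of the pipeline: it assumes every word has its own orthogonal unit vector (real word2vec or fastText vectors are not orthogonal), and `mean_embedding` and `cosine` are hypothetical helpers written only for this illustration.

```python
import numpy as np

def mean_embedding(word_ids, vocab_size):
    """Average one-hot word vectors: a toy stand-in for averaged word embeddings."""
    vectors = np.eye(vocab_size)[word_ids]
    return vectors.mean(axis=0)

def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

vocab_size = 50
query = mean_embedding([0, 1], vocab_size)               # 2-word query
short_text = mean_embedding([0, 1, 2], vocab_size)       # 3 words, contains the query
long_text = mean_embedding(list(range(20)), vocab_size)  # 20 words, contains the query

sim_short = cosine(query, short_text)
sim_long = cosine(query, long_text)
print(round(sim_short, 3), round(sim_long, 3))
```

Both texts contain every query word, yet in this toy setting the score drops from about 0.82 to about 0.32 simply because the longer text averages in more unrelated words.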

"Google’s Universal Sentence Encoder encodes text into high dimensional vectors. The model is trained and optimized for greater-than-word length text, such as sentences, phrases or short paragraphs.

The input is variable length English text and the output is a 512 dimensional vector."

More details can be found here: https://tfhub.dev/google/universal-sentence-encoder/2

I found this blog useful: https://medium.com/@gaurav5430/universal-sentence-encoding-7d440fd3c7c7

The encoder provides one vector per sentence. Taking the inner product of these vectors gives a matrix defining the similarity between every pair of sentences in the set.
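
As a rough illustration of how a similarity matrix arises from sentence vectors, the sketch below uses made-up 4-dimensional vectors in place of real 512-dimensional encoder output. The rows are chosen to have unit length, in which case the inner product equals the cosine similarity.

```python
import numpy as np

# Made-up 4-dimensional unit vectors standing in for sentence embeddings.
embeddings = np.array([
    [1.0, 0.0, 0.0, 0.0],
    [0.6, 0.8, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
])

# np.inner of a 2-D array with itself returns every pairwise dot product,
# so similarity[i, j] compares sentence i with sentence j.
similarity = np.inner(embeddings, embeddings)
print(similarity)
```

The diagonal is 1 (each sentence compared with itself), sentences 0 and 1 score 0.6, and sentence 2 scores 0 against both others.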

## Method Details

Here are the details of the method:

**Load in the data and print out size and compare with a previous set**

**Remove duplicates and compare with a previous set.**
This method does not remove all duplicates, but it is good enough.
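
One reason exact-match de-duplication leaves some duplicates behind is that abstracts differing only in whitespace or case count as distinct. The sketch below shows this on a made-up four-row table. The stricter normalised key is a hypothetical alternative for comparison, not what this notebook does.

```python
import pandas as pd

toy = pd.DataFrame({
    "title": ["Paper A", "Paper A (preprint)", "Paper A (copy)", "Paper B"],
    "abstract": [
        "We study ventilation outcomes.",
        "We study ventilation outcomes.",    # exact duplicate: removed
        "We study ventilation outcomes. ",   # trailing space: survives
        "A different abstract.",
    ],
})

# Exact matching on the abstract text, as done in this notebook.
exact = toy.drop_duplicates(subset="abstract", keep="first")

# Hypothetical stricter key: lowercase and collapse whitespace before comparing.
key = toy["abstract"].str.lower().str.split().str.join(" ")
normalised = toy[~key.duplicated(keep="first")]

print(len(toy), len(exact), len(normalised))
```

Exact matching keeps three of the four rows, while the normalised key keeps two.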

**Prepare the list of queries on the literature.**
This is the list of questions set by the organisers. Some fine-tuning of the questions can improve the results.

**Clean text.**
The text is converted to lower case and lemmatized. Stop words are removed and punctuation is replaced with spaces.

**Reduce the literature set to articles mentioning COVID-19 or its synonyms.**
This is achieved using the method supplied in covid19-tools provided by Andy White. Thank you!
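
The idea behind the synonym filter can be sketched with the standard `re` module. This is an illustration only and does not reproduce the covid19-tools implementation. `mentions_covid19` is a hypothetical helper, and the hyphen-to-space replacement follows the note in the synonym list used in this notebook.

```python
import re

# A few of the synonym patterns used in this notebook.
synonyms = ['covid', 'sars cov 2', r'2019 n cov\b', 'wuhan coronavirus']
pattern = re.compile('|'.join(synonyms))

def mentions_covid19(text):
    """Return True if the text matches any synonym pattern."""
    normalised = text.lower().replace('-', ' ')
    return pattern.search(normalised) is not None

abstracts = [
    'Clinical features of SARS-CoV-2 infection in adults.',
    'MERS coronavirus transmission in camels.',
    'Early outcomes of COVID-19 patients on ECMO.',
]
flags = [mentions_covid19(a) for a in abstracts]
print(flags)
```

The first and third abstracts are flagged. The MERS abstract is not, because none of the patterns match it.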
    
**Split abstracts into sentences.**
The literature source consists of a number of fields for each piece of research. Fields include the title and a summary of the research called an abstract.
The title is too general to be useful, and comparing a query with the whole abstract also produces confusing results. By comparing at the sentence level, it is possible to find interesting and relevant pieces of research.
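
A minimal sketch of sentence-level splitting on a made-up two-paper table follows. It uses `DataFrame.explode` as a compact alternative to the `set_index`/`stack` approach in the notebook code, and the column name `sentence` is an assumption made for this example.

```python
import pandas as pd

papers = pd.DataFrame({
    'title': ['Paper A', 'Paper B'],
    'abstract': [
        'ECMO was used in 12 patients. Survival was 50%. Further data are needed.',
        'Telemedicine reduced clinic visits. Barriers remain.',
    ],
})

# Split on a full stop followed by a space, then give each sentence its own row
# while repeating the paper's other fields.
sentences = (
    papers.assign(sentence=papers['abstract'].str.split(r'\. '))
          .explode('sentence')
          .reset_index(drop=True)
)
print(sentences[['title', 'sentence']])
```

Splitting on a full stop and a space is a simple rule: it also breaks at abbreviations such as "e.g. ", so some rows will be sentence fragments.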
    
**Find relevant research.**
A bag of sentences is prepared by combining the queries and the abstract sentences. The USE algorithm is then used to provide a vector for each query and abstract sentence. For each query, the top ten matches with the research literature are found and displayed.
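
The retrieval step can be sketched on made-up 2-dimensional vectors, with the queries stacked in front of the abstract sentences as described above. `top_matches` is a hypothetical helper for this illustration, and the vectors are invented rather than real USE output.

```python
import numpy as np

n_queries = 2
# Rows 0-1 are queries, rows 2-5 are abstract sentences (made-up unit vectors).
embeddings = np.array([
    [1.0, 0.0],   # query 0
    [0.0, 1.0],   # query 1
    [0.8, 0.6],   # sentence 0
    [0.0, 1.0],   # sentence 1
    [1.0, 0.0],   # sentence 2
    [0.6, 0.8],   # sentence 3
])

def top_matches(embeddings, n_queries, query_index, k):
    """Return the indices of the k sentences most similar to the chosen query."""
    scores = np.inner(embeddings[n_queries:], embeddings[query_index])
    return np.argsort(-scores)[:k]

best_for_q0 = top_matches(embeddings, n_queries, query_index=0, k=2)
best_for_q1 = top_matches(embeddings, n_queries, query_index=1, k=2)
print(best_for_q0, best_for_q1)
```

For query 0 the best sentences are 2 and then 0. For query 1 they are 1 and then 3.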
 
 



In [None]:

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

import os
import nltk, string
from nltk.corpus import stopwords 
from nltk.stem import WordNetLemmatizer

import covid19_tools as cvt

from sklearn.feature_extraction.text import CountVectorizer 
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

w_tokenizer = nltk.tokenize.WhitespaceTokenizer()
lemmatizer = nltk.stem.WordNetLemmatizer()

stop_words = set(stopwords.words('english'))

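# Translation table that maps every punctuation character to a space.
remove_punctuation_map = dict((ord(char), ' ') for char in string.punctuation)

def normalize(text):
    """Lowercase, replace punctuation with spaces and tokenize with NLTK."""
    return nltk.word_tokenize(text.lower().translate(remove_punctuation_map))

def clean_text(text):
    """Lowercase, replace punctuation with spaces and lemmatize each whitespace-separated token."""
    text = text.lower().translate(remove_punctuation_map)
    return ' '.join(lemmatizer.lemmatize(w) for w in w_tokenizer.tokenize(text))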

covid19_synonyms = ['covid',
                    'coronavirus disease 19',
                    'sars cov 2', # Note that search function replaces '-' with ' '
                    '2019 ncov',
                    '2019ncov',
                    r'2019 n cov\b',
                    r'2019n cov\b',
                    'ncov 2019',
                    r'\bn cov 2019',
                    'coronavirus 2019',
                    'wuhan pneumonia',
                    'wuhan virus',
                    'wuhan coronavirus',
                    r'coronavirus 2\b']

df=pd.read_csv('/kaggle/input/CORD-19-research-challenge/metadata.csv')
print ('Size of literature Set on 3rd April 47298,18')
print ('Size of literature Set', df.shape)
 

In [None]:

df_queries = pd.DataFrame({'question': ['Resources to support skilled nursing facilities and long term care facilities',
    'Mobilization of surge medical staff to address shortages in overwhelmed communities',
    'Age-adjusted mortality data for Acute Respiratory Distress Syndrome (ARDS) with/without other organ failure – particularly for viral etiologies',
    'Extracorporeal membrane oxygenation (ECMO) outcomes data of COVID-19 patients',
     'Outcomes data for COVID-19 after mechanical ventilation adjusted for age',
     'Knowledge of the frequency, manifestations, and course of extrapulmonary manifestations of COVID-19, including, but not limited to, possible cardiomyopathy and cardiac arrest',
     'Application of regulatory standards (e.g., EUA, CLIA) and ability to adapt care to crisis standards of care level',
     'Approaches for encouraging and facilitating the production of elastomeric respirators, which can save thousands of N95 masks',
     'Best telemedicine practices, barriers and facilitators, and specific actions to remove/expand them within and across state boundaries',
     'Guidance on the simple things people can do at home to take care of sick people and manage disease',
     'Oral medications that might potentially work',
     'Use of Artificial Intelligence AI in real-time health care delivery to evaluate interventions, risk factors, and outcomes in a way that could not be done manually',
     'Best practices and critical challenges and innovative solutions and technologies in hospital flow and organization, workforce protection, workforce allocation, community-based support resources, payment, and supply chain management to enhance capacity, efficiency, and outcomes',
     'Efforts to define the natural history of disease to inform clinical care, public health interventions, infection prevention control, transmission, and clinical trials',
     'Efforts to develop a core clinical outcome set to maximize usability of data across a range of trials',
     'Efforts to determine adjunctive and supportive interventions that can improve the clinical outcomes of infected patients (e.g. steroids, high flow oxygen)',
]})
    
# # test just one query

# df_queries = pd.DataFrame({'question': ['Extracorporeal membrane oxygenation (ECMO) outcomes data of patients']})

In [None]:
#drop duplicate abstracts
df = df.drop_duplicates(subset='abstract', keep="first")

print ('Size of literature Set after removing duplicates on 3rd April 38667,18')
print ('Size after removing duplicates', df.shape)
#4/3/20 38667,18

In [None]:
df_queries['query_bow'] = df_queries.question.apply(clean_text)
df_queries['query_bow'] = df_queries['query_bow'].apply(lambda x: ' '.join([word for word in x.split() if word not in (stop_words)]))


In [None]:
# Tag every article that mentions a COVID-19 synonym, then keep only the tagged ones.
df_a, covid19_counts = cvt.count_and_tag(df, covid19_synonyms, 'disease_covid19')
df_covid19 = df[df['tag_disease_covid19'] == True]
df_covid19 = df_covid19.reset_index(drop=True)


In [None]:
df_covid19['org abstract'] = df_covid19['abstract']
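# Split each abstract on ". " and stack the pieces so that every sentence
# becomes its own row, carrying the article's other fields with it.
df_covid19_by_sentence = df_covid19.set_index(df_covid19.columns.drop('abstract').tolist())\
.abstract.str.split(r'\. ', expand=True).stack().reset_index()\
.rename(columns={0:'abstract'})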


In [None]:
df_covid19_bow_full = df_covid19_by_sentence.copy()
#df_covid19_bow_full ['bow_raw'] = df_covid19_bow_full ['title'] + " " + df_covid19_bow_full ['abstract']
df_covid19_bow_full ['bow_raw'] = df_covid19_bow_full ['abstract']

In [None]:
df_covid19_bow_full['bow'] = df_covid19_bow_full.bow_raw.apply(clean_text)
df_covid19_bow_full['bow'] = df_covid19_bow_full['bow'].apply(lambda x: ' '.join([word for word in x.split() if word not in (stop_words)]))
#df_covid19_bow_full.head(5)

df_covid19_bow_f = df_covid19_bow_full[df_covid19_bow_full['abstract'].map(len) > 20]
df_covid19_bow = df_covid19_bow_f.reset_index()
#Subset for testing
# df_covid19_bow_fs = df_covid19_bow_f.loc[1218:1230].copy()
# df_covid19_bow = df_covid19_bow_fs.reset_index()


In [None]:
df_covid19_bow = df_covid19_bow[['title','org abstract','abstract','bow', 'cord_uid', 'journal', 'authors','publish_time', 'source_x', 'url']]
# Queries come first, followed by the abstract sentences.
total_bow = df_queries['query_bow'].tolist() + df_covid19_bow['bow'].tolist()

In [None]:
import tensorflow as tf
import tensorflow_hub as hub
module_url = "https://tfhub.dev/google/universal-sentence-encoder/4" #@param ["https://tfhub.dev/google/universal-sentence-encoder/4", "https://tfhub.dev/google/universal-sentence-encoder-large/5"]
model = hub.load(module_url)
print ("module %s loaded" % module_url)
def embed(texts):
    return model(texts)




In [None]:
def Find_articles (index,sim_array,answer_sentence,query):

    query_len = len(query)
    
    #df_sim_q = pd.DataFrame({'Cosine':sim_array[query_len:,index],  'Question':query.iloc[index]['question'],'Title':answer_sentence['title'], 'Abstract Sentence':answer_sentence['abstract'],'Journal':answer_sentence['journal'],'Published':answer_sentence['publish_time'],'URL':answer_sentence['url']})
    #df_sim_q = pd.DataFrame({'Cosine':sim_array[query_len:,index],  'Abstract Snippet':answer_sentence['abstract'],'Published':answer_sentence['publish_time'],'Title':answer_sentence['title'], 'Journal':answer_sentence['journal'],'Source':answer_sentence['source_x'],'Abstract                                                                                                                                                                                                      p':answer_sentence['org abstract'],'URL':answer_sentence['url']})

    df_sim_q = pd.DataFrame({'Cosine':sim_array[query_len:,index],  'Abstract Snippet                 ':answer_sentence['abstract'],\
                             'Published':answer_sentence['publish_time'],'Title             ':answer_sentence['title'], 'Journal':answer_sentence['journal'],\
                             'Source':answer_sentence['source_x'],\
                             'Authors':answer_sentence['authors'],\
                             'Abstract truncated at 1000 characters                                                                                                                           ':answer_sentence['org abstract'],\
                             'URL to full text':answer_sentence['url']})

    
    df_sim_q_sorted = df_sim_q.sort_values('Cosine',ascending = False )

    df_sim_q_sample = df_sim_q_sorted[:10]

    df_sim_q_sample = df_sim_q_sample.reset_index()
    df_sim_q_sample = df_sim_q_sample.drop(['index'], axis=1)
    df_sim_q_sample = df_sim_q_sample.drop(['Cosine'], axis=1)

    df_sim_q_sample = df_sim_q_sample.apply(lambda x: x.str.slice(0, 1000))
    df_sim_q_sample["Authors"] = df_sim_q_sample["Authors"].str[:100]
    
    return ( df_sim_q_sample)

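message_embeddings_ = np.asarray(embed(total_bow))
# Find_articles only reads the columns that belong to the queries, so compare
# every row against the query rows alone instead of building the full N x N matrix.
corr = np.inner(message_embeddings_, message_embeddings_[:len(df_queries)])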


## Resources to support skilled nursing facilities and long term care facilities<a id='0'></a>

In [None]:
results = Find_articles (0,corr,df_covid19_bow,df_queries)
dfStyler = results.style.set_properties(**{'text-align': 'left'})
dfStyler.set_table_styles([dict(selector='th', props=[('text-align', 'left')])])


## Mobilization of surge medical staff to address shortages in overwhelmed communities<a id='1'></a>

In [None]:
results = Find_articles (1,corr,df_covid19_bow,df_queries)
dfStyler = results.style.set_properties(**{'text-align': 'left'})
dfStyler.set_table_styles([dict(selector='th', props=[('text-align', 'left')])])


## Age-adjusted mortality data for Acute Respiratory Distress Syndrome (ARDS) with/without other organ failure – particularly for viral etiologies<a id='2'></a>
    

In [None]:
results = Find_articles (2,corr,df_covid19_bow,df_queries)
dfStyler = results.style.set_properties(**{'text-align': 'left'})
dfStyler.set_table_styles([dict(selector='th', props=[('text-align', 'left')])])


## Extracorporeal membrane oxygenation (ECMO) outcomes data of COVID-19 patients<a id='3'></a>
    

In [None]:
results = Find_articles (3,corr,df_covid19_bow,df_queries)
dfStyler = results.style.set_properties(**{'text-align': 'left'})
dfStyler.set_table_styles([dict(selector='th', props=[('text-align', 'left')])])


## Outcomes data for COVID-19 after mechanical ventilation adjusted for age<a id='4'></a>
 

In [None]:
results = Find_articles (4,corr,df_covid19_bow,df_queries)
dfStyler = results.style.set_properties(**{'text-align': 'left'})
dfStyler.set_table_styles([dict(selector='th', props=[('text-align', 'left')])])


## Knowledge of the frequency, manifestations, and course of extrapulmonary manifestations of COVID-19, including, but not limited to, possible cardiomyopathy and cardiac arrest<a id='5'></a>
  

In [None]:
results = Find_articles (5,corr,df_covid19_bow,df_queries)
dfStyler = results.style.set_properties(**{'text-align': 'left'})
dfStyler.set_table_styles([dict(selector='th', props=[('text-align', 'left')])])


## Application of regulatory standards (e.g., EUA, CLIA) and ability to adapt care to crisis standards of care level<a id='6'></a>

In [None]:
results = Find_articles (6,corr,df_covid19_bow,df_queries)
dfStyler = results.style.set_properties(**{'text-align': 'left'})
dfStyler.set_table_styles([dict(selector='th', props=[('text-align', 'left')])])


## Approaches for encouraging and facilitating the production of elastomeric respirators, which can save thousands of N95 masks<a id='7'></a>
    

In [None]:
results = Find_articles (7,corr,df_covid19_bow,df_queries)
dfStyler = results.style.set_properties(**{'text-align': 'left'})
dfStyler.set_table_styles([dict(selector='th', props=[('text-align', 'left')])])


## Best telemedicine practices, barriers and facilitators, and specific actions to remove/expand them within and across state boundaries<a id='8'></a>
  

In [None]:
results = Find_articles (8,corr,df_covid19_bow,df_queries)
dfStyler = results.style.set_properties(**{'text-align': 'left'})
dfStyler.set_table_styles([dict(selector='th', props=[('text-align', 'left')])])


## Guidance on the simple things people can do at home to take care of sick people and manage disease<a id='9'></a>
 

In [None]:
results = Find_articles (9,corr,df_covid19_bow,df_queries)
dfStyler = results.style.set_properties(**{'text-align': 'left'})
dfStyler.set_table_styles([dict(selector='th', props=[('text-align', 'left')])])


## Oral medications that might potentially work<a id='10'></a>

In [None]:
results = Find_articles (10,corr,df_covid19_bow,df_queries)
dfStyler = results.style.set_properties(**{'text-align': 'left'})
dfStyler.set_table_styles([dict(selector='th', props=[('text-align', 'left')])])


## Use of AI in real-time health care delivery to evaluate interventions, risk factors, and outcomes in a way that could not be done manually<a id='11'></a>
   

In [None]:
results = Find_articles (11,corr,df_covid19_bow,df_queries)
dfStyler = results.style.set_properties(**{'text-align': 'left'})
dfStyler.set_table_styles([dict(selector='th', props=[('text-align', 'left')])])


## Best practices and critical challenges and innovative solutions and technologies in hospital flow and organization, workforce protection, workforce allocation, community-based support resources, payment, and supply chain management to enhance capacity, efficiency, and outcomes<a id='12'></a>


In [None]:
results = Find_articles (12,corr,df_covid19_bow,df_queries)
dfStyler = results.style.set_properties(**{'text-align': 'left'})
dfStyler.set_table_styles([dict(selector='th', props=[('text-align', 'left')])])


## Efforts to define the natural history of disease to inform clinical care, public health interventions, infection prevention control, transmission, and clinical trials<a id='13'></a>

In [None]:
results = Find_articles (13,corr,df_covid19_bow,df_queries)
dfStyler = results.style.set_properties(**{'text-align': 'left'})
dfStyler.set_table_styles([dict(selector='th', props=[('text-align', 'left')])])


## Efforts to develop a core clinical outcome set to maximize usability of data across a range of trials<a id='14'></a>


In [None]:
results = Find_articles (14,corr,df_covid19_bow,df_queries)
dfStyler = results.style.set_properties(**{'text-align': 'left'})
dfStyler.set_table_styles([dict(selector='th', props=[('text-align', 'left')])])


## Efforts to determine adjunctive and supportive interventions that can improve the clinical outcomes of infected patients (e.g. steroids, high flow oxygen)<a id='15'></a>


In [None]:
results = Find_articles (15,corr,df_covid19_bow,df_queries)
dfStyler = results.style.set_properties(**{'text-align': 'left'})
dfStyler.set_table_styles([dict(selector='th', props=[('text-align', 'left')])])
