# **COVID-19 Open Research Dataset Challenge (CORD-19)**

![](https://www.ovpm.org/wp-content/uploads/2020/03/chla-what-you-should-know-covid-19-1200x628-01.jpg)

This is a joint work by [Moshe Hazoom](https://www.kaggle.com/hazoom), [Sarah June Sachs](https://www.kaggle.com/sarahjune) and [Kevin Benassuly](https://www.kaggle.com/kevinbenassuly).

# **Goal**
Our goal is to build an infrastructure that can serve whoever fights the novel COVID-19 virus (researchers, doctors, health care workers, etc.) by finding the most useful information using state-of-the-art NLP tools and algorithms. We hope this project will be useful and that our efforts will help bring about a world without the COVID-19 virus.


# **Project Description**
This project consists of several modules that work together as a full pipeline to extract relevant, useful and accurate information from the scholarly articles.
We believe that our solution can mine information that answers the research questions accurately, is flexible enough to support future research questions, and is easy to understand.
We describe the different modules in detail below:
1. **Data Preprocessing** - ETL, Keyword Extraction & Word Embeddings.
2. **Topic Modeling** - LDA model.
3. **Search Engine for Seed Sentences** - Simple but useful search engine to find seed sentences using keywords.
4. **Semantic Search for Relevant Sentences** - Find relevant answers from seed sentences from #3 using sentence embedding techniques.
5. **Answer Summarization** - Generate abstractive summary for answers using Facebook's BART model.

All the code and notebooks are available in the GitHub [repo](https://github.com/Hazoom/covid19).

# **Pros/Cons**

## Pros
1. Our solution finds information for all the research questions in an accurate, concise and informative manner.
2. It's simple to understand, read and reproduce results.
3. Easy to expand to other research questions and domains.

## Cons
1. The keyword generation for each task can be improved to better use the search engine for seed sentences.
2. There is some manual work needed to pick the best seed sentences for each sub-task in each task. In most cases we took the first 1-3 sentences that the search engine returned, but this was not the case for all search queries. Now that we have marked the best results out of 10 for each query, the search engine can be tuned against this gold dataset with a standard evaluation metric for information retrieval tasks (e.g. [nDCG - Normalized Discounted Cumulative Gain](https://en.wikipedia.org/wiki/Discounted_cumulative_gain)). We believe this can be further improved and might be an interesting direction for future research.
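
To make the suggested evaluation concrete, here is a minimal sketch of how nDCG could be computed for one query from manual labels. The label values below are hypothetical and only illustrate the metric; this is not part of our pipeline.

```python
import math
from typing import Sequence


def dcg(relevances: Sequence[float]) -> float:
    """Discounted cumulative gain of a ranked list of relevance grades."""
    # rank is 0-based, so the discount for the first result is log2(2) = 1
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances))


def ndcg(relevances: Sequence[float]) -> float:
    """DCG divided by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0


# Hypothetical manual labels for the top 5 results of one query:
# 1 = picked as a seed sentence, 0 = not picked.
labels_in_ranked_order = [1, 0, 1, 0, 0]
print(f"nDCG@5 = {ndcg(labels_in_ranked_order):.3f}")
```

Averaging this value over all labeled queries would give a single number to compare search engine variants against.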

# **Data Preprocessing**

## Article Filtering
To focus on relevant articles only, we filtered out articles that are not specifically related to COVID-19, using [@ajrwhite](https://www.kaggle.com/ajrwhite)'s list of keywords (thanks!), and kept only articles that contain at least one term from the following list:

In [None]:
["2019-ncov", "2019 novel coronavirus", "coronavirus 2019", "coronavirus disease 19", "covid-19", "covid 19", "ncov-2019", "sars-cov-2", "wuhan coronavirus", "wuhan pneumonia", "wuhan virus"]
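
A minimal sketch of such a filter is shown below. It assumes simple case-insensitive substring matching on the article text; the function name `is_covid_article` and the sample texts are hypothetical and only illustrate the idea.

```python
from typing import Iterable, List

# Same keyword list as above, all lowercase for case-insensitive matching.
COVID_KEYWORDS: List[str] = [
    "2019-ncov", "2019 novel coronavirus", "coronavirus 2019",
    "coronavirus disease 19", "covid-19", "covid 19", "ncov-2019",
    "sars-cov-2", "wuhan coronavirus", "wuhan pneumonia", "wuhan virus",
]


def is_covid_article(text: str, keywords: Iterable[str] = COVID_KEYWORDS) -> bool:
    """Return True if the text mentions at least one keyword."""
    lowered = text.lower()
    return any(keyword in lowered for keyword in keywords)


# Hypothetical article texts.
articles = [
    "SARS-CoV-2 uses ACE2 as its entry receptor.",
    "A review of MERS outbreaks in 2012.",
    "Early cases of Wuhan pneumonia were linked to a seafood market.",
]
filtered = [article for article in articles if is_covid_article(article)]
print(f"Kept {len(filtered)} of {len(articles)} articles")
```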

## ETL
For each article we did the following:
1. Parsed its full text using [scispacy](https://allenai.github.io/scispacy/) and split it into sentences. Sentence segmentation was done with Microsoft's [BlingFire](https://github.com/microsoft/BlingFire) library, since we noticed that scispacy had difficulty splitting some texts and left very long spans unsplit. The code for sentence segmentation is available in [blingfire_sentence_splitter.py](https://github.com/Hazoom/covid19/blob/master/src/nlp/blingfire_sentence_splitter.py) and [common_sentence_splitter.py](https://github.com/Hazoom/covid19/blob/master/src/nlp/common_sentence_splitter.py).
2. Cleaned the text by normalizing non-ASCII characters, fixing contractions, removing URLs, removing punctuation, removing stop words, etc. The cleaning code is available in [cleaning.py](https://github.com/Hazoom/covid19/blob/master/src/nlp/cleaning.py).
3. Transformed the sentence to contain meaningful bi-grams and tri-grams. A detailed explanation follows below.
4. Created a metadata CSV file such that each row contains a sentence, its cleaned version, the section it came from (abstract, body) and the article metadata it came from. The code is available in [preprocess.py](https://github.com/Hazoom/covid19/blob/master/src/preprocessing/preprocess.py).

The parsing method with scispacy (for demonstration purposes, it doesn't include our custom sentence segmentation):

In [None]:
# Install scispacy package
!pip install scispacy

In [None]:
import spacy
import scispacy

nlp = spacy.load("../input/scispacymodels/en_core_sci_sm/en_core_sci_sm-0.2.4")
nlp.max_length = 2000000

The cleaning method:

In [None]:
!pip install contractions

In [None]:
import re

CURRENCIES = {'$': 'USD', 'zł': 'PLN', '£': 'GBP', '¥': 'JPY', '฿': 'THB',
              '₡': 'CRC', '₦': 'NGN', '₩': 'KRW', '₪': 'ILS', '₫': 'VND',
              '€': 'EUR', '₱': 'PHP', '₲': 'PYG', '₴': 'UAH', '₹': 'INR'}

RE_NUMBER = re.compile(
    r"(?:^|(?<=[^\w,.]))[+–-]?"
    r"(([1-9]\d{0,2}(,\d{3})+(\.\d*)?)|([1-9]\d{0,2}([ .]\d{3})+(,\d*)?)|(\d*?[.,]\d+)|\d+)"
    r"(?:$|(?=\b))")

RE_URL = re.compile(
    r'((http://www\.|https://www\.|http://|https://)?' +
    r'[a-z0-9]+([\-.][a-z0-9]+)*\.[a-z]{2,5}(:[0-9]{1,5})?(/.*)?)')

# English Stop Word List (Standard stop words used by Apache Lucene)
STOP_WORDS = {"a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "if", "in", "into", "is", "it",
              "no", "not", "of", "on", "or", "such", "that", "the", "their", "then", "there", "these",
              "they", "this", "to", "was", "will", "with"}

In [None]:
import string
from typing import List
import ftfy
import contractions

def clean_tokenized_sentence(tokens: List[str],
                             unicode_normalization="NFC",
                             unpack_contractions=False,
                             replace_currency_symbols=False,
                             remove_punct=True,
                             remove_numbers=False,
                             lowercase=True,
                             remove_urls=True,
                             remove_stop_words=True) -> str:
    if remove_stop_words:
        # compare case-insensitively so that capitalized stop words (e.g. "The") are removed too
        tokens = [token for token in tokens if token.lower() not in STOP_WORDS]

    sentence = ' '.join(tokens)

    if unicode_normalization:
        sentence = ftfy.fix_text(sentence, normalization=unicode_normalization)

    if unpack_contractions:
        sentence = contractions.fix(sentence, slang=False)

    if replace_currency_symbols:
        for currency_sign, currency_tok in CURRENCIES.items():
            sentence = sentence.replace(currency_sign, f'{currency_tok} ')

    if remove_urls:
        sentence = RE_URL.sub('_URL_', sentence)

    if remove_punct:
        sentence = sentence.translate(str.maketrans('', '', string.punctuation))

    # strip double spaces
    sentence = re.sub(r' +', ' ', sentence)

    if remove_numbers:
        sentence = RE_NUMBER.sub('_NUMBER_', sentence)

    if lowercase:
        sentence = sentence.lower()

    return sentence


Putting it all together:

In [None]:
def clean_sentence(sentence) -> str:
    doc = nlp(sentence)
    tokens = [str(token) for token in doc]
    return clean_tokenized_sentence(tokens)

In [None]:
print(clean_sentence("Let's clean this sentence!"))

Let's see the output of the ETL process:

In [None]:
import pandas as pd

In [None]:
sentences_df = pd.read_csv('../input/covid19sentencesmetadata/sentences_with_metadata.csv')

In [None]:
sentences_df.head()

In [None]:
print(f"Sentence count: {len(sentences_df)}")

## Training bi-gram Model
Some words on their own do not carry much information, but together they mean something different. Our goal was to merge meaningful bi-gram phrases into one token, for example `fake news` into `fake_news`. For that, we used [Gensim](https://radimrehurek.com/gensim/)'s Phrases package, which offers two scoring implementations: (1) the default data-driven score and (2) the NPMI (Normalized Pointwise Mutual Information) score. We don't show the training here since it takes time, but we load our trained model and look at some examples. The training code is in our [notebook](https://github.com/Hazoom/covid19/blob/master/notebooks/Taxonomy/Topic_Model_LDA.ipynb).
We used a threshold of 10 and a minimum count of 5, which worked best for our use case of building a search engine. You can tune these hyper-parameters for your own use case, trading off a larger number of (less meaningful) phrases against a smaller number of (more meaningful) phrases.
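
To give a feeling for what the threshold means, here is a small sketch of the two scoring functions as we understand them from the Gensim documentation: the default score follows the formula of Mikolov et al., and NPMI is normalized to the range [-1, 1]. The counts below are made up for illustration, and the helper names are ours, not Gensim's.

```python
import math


def default_score(count_a: int, count_b: int, count_ab: int,
                  vocab_size: int, min_count: int) -> float:
    """Data-driven score: (count(ab) - min_count) * |V| / (count(a) * count(b))."""
    return (count_ab - min_count) * vocab_size / (count_a * count_b)


def npmi_score(count_a: int, count_b: int, count_ab: int,
               corpus_word_count: int) -> float:
    """Normalized pointwise mutual information of the pair (a, b)."""
    p_a = count_a / corpus_word_count
    p_b = count_b / corpus_word_count
    p_ab = count_ab / corpus_word_count
    return math.log(p_ab / (p_a * p_b)) / -math.log(p_ab)


# Made-up counts for the pair ("fake", "news").
score = default_score(count_a=40, count_b=50, count_ab=30, vocab_size=10000, min_count=5)
print(f"default score: {score:.1f} (kept as a phrase if above the threshold of 10)")
print(f"NPMI score: {npmi_score(40, 50, 30, corpus_word_count=100000):.3f}")
```

A pair is merged into a phrase only when its score exceeds the threshold, so raising the threshold keeps fewer but stronger phrases.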

In [None]:
from gensim.models.phrases import Phraser

In [None]:
bigram_model = Phraser.load("../input/covid19phrasesmodels/covid_bigram_model_v0.pkl")

In [None]:
bigram_model["despite social media often vehicle fake news boast news hype also worth noting tremendous effort scientific community provide free uptodate information ongoing studies well critical evaluations".split()]

## Training tri-gram model
We created a tri-gram model, in addition to the bi-gram model, in order to catch longer meaningful phrases such as `Epidemic Preparedness Innovations`. We transformed the cleaned sentences (from all articles) with bi-grams using the model above and trained a Gensim Phrases model again, this time with a lower threshold. Please note that this method can also create 4-grams if the model joins two bi-grams.
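
The two-pass idea can be sketched without Gensim. The phrase tables below are hypothetical stand-ins for the trained models, and `merge_phrases` is a simplified greedy left-to-right merge, not Gensim's actual implementation.

```python
from typing import List, Set, Tuple


def merge_phrases(tokens: List[str], phrases: Set[Tuple[str, str]]) -> List[str]:
    """Greedily join adjacent token pairs that appear in `phrases` with '_'."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in phrases:
            merged.append(f"{tokens[i]}_{tokens[i + 1]}")
            i += 2  # both tokens were consumed by the phrase
        else:
            merged.append(tokens[i])
            i += 1
    return merged


# Hypothetical phrase tables standing in for the trained models.
first_pass = {("epidemic", "preparedness"), ("novel", "coronavirus"), ("disease", "outbreak")}
second_pass = {("epidemic_preparedness", "innovations"),
               ("novel_coronavirus", "disease_outbreak")}

tokens = "coalition epidemic preparedness innovations novel coronavirus disease outbreak".split()
with_bigrams = merge_phrases(tokens, first_pass)
with_trigrams = merge_phrases(with_bigrams, second_pass)
print(with_trigrams)
```

The second pass joins a bi-gram with a single word (a tri-gram) and also two bi-grams with each other, which is how 4-grams can appear.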

In [None]:
trigram_model = Phraser.load("../input/covid19phrasesmodels/covid_trigram_model_v0.pkl")

Let's add phrases model to the ETL process and change the clean function:

In [None]:
def clean_sentence(sentence, bigram_model=None, trigram_model=None) -> str:
    doc = nlp(sentence)
    tokens = [str(token) for token in doc]
    cleaned_sentence = clean_tokenized_sentence(tokens)
    
    if bigram_model and trigram_model:
        sentence_with_bigrams = bigram_model[cleaned_sentence.split(' ')]
        sentence_with_trigrams = trigram_model[sentence_with_bigrams]
        return ' '.join(sentence_with_trigrams)
    
    return cleaned_sentence

In [None]:
print(clean_sentence("On 23 January 2020, the Coalition for Epidemic Preparedness Innovations (CEPI) announced that they will fund vaccine development programmes with Inovio", bigram_model, trigram_model))

## FastText Word Embeddings
We trained a word embedding model on the full corpus (without filtering out articles) using Facebook's [FastText](https://github.com/facebookresearch/fastText) library. It serves us later in the sentence similarity model to find relevant answers for each question. After training, we also created word counts, which the sentence encoder uses when calculating the smooth weighted average of the word embeddings of all words in a sentence.
The code for training word embeddings using FastText is available at [train_fasttext.py](https://github.com/Hazoom/covid19/blob/master/src/w2v/train_fasttext.py).

Let's visualize our word embeddings. 

In [None]:
import os
import numpy as np
from sklearn.preprocessing import normalize
from sklearn.manifold import TSNE
from matplotlib import pylab
%matplotlib inline

In [None]:
fasttext_model_dir = '../input/fasttext-no-subwords-trigrams'

Read the 400 most frequent word vectors. The vectors in the file are in descending order of frequency.

In [None]:
num_points = 400

first_line = True
index_to_word = []
with open(os.path.join(fasttext_model_dir, "word-vectors-100d.txt"),"r") as f:
    for line_num, line in enumerate(f):
        if first_line:
            dim = int(line.strip().split()[1])
            word_vecs = np.zeros((num_points, dim), dtype=float)
            first_line = False
            continue
        line = line.strip()
        word = line.split()[0]
        vec = word_vecs[line_num-1]
        for index, vec_val in enumerate(line.split()[1:]):
            vec[index] = float(vec_val)
        index_to_word.append(word)
        if line_num >= num_points:
            break
word_vecs = normalize(word_vecs, copy=False, return_norm=False)

Train t-SNE to reduce the embeddings to 2 dimensions for visualization purposes:

In [None]:
tsne = TSNE(perplexity=40, n_components=2, init='pca', n_iter=10000)
two_d_embeddings = tsne.fit_transform(word_vecs[:num_points])
labels = index_to_word[:num_points]

Plot the 400 most frequent word vectors in a 2-dimensional plot:

In [None]:
def plot(embeddings, labels):
    pylab.figure(figsize=(20,20))
    for i, label in enumerate(labels):
        x, y = embeddings[i,:]
        pylab.scatter(x, y)
        pylab.annotate(label, xy=(x, y), xytext=(5, 2), textcoords='offset points',
                       ha='right', va='bottom')
    pylab.show()

plot(two_d_embeddings, labels)

The visualization of the word vectors of the 400 most frequent words makes sense.
We can use those vectors to find synonyms, or related terms, for each input word (or phrase) by comparing its word vector to the vectors of the whole vocabulary using cosine similarity or any other vector similarity function.
Let's see some examples:

In [None]:
from pprint import pprint
import gensim.models.keyedvectors as word2vec

fasttext_model = word2vec.KeyedVectors.load_word2vec_format(os.path.join(fasttext_model_dir, "word-vectors-100d.txt"))

In [None]:
def print_most_similar(search_term):
    print(f"Synonyms of '{search_term}':")
    synonyms = fasttext_model.most_similar(search_term)
    pprint(synonyms)

In [None]:
print_most_similar("new_coronavirus")

In [None]:
print_most_similar("fake_news")

In [None]:
print_most_similar("pathogen")
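
Under the hood, the lookups above rank the vocabulary by cosine similarity to the query vector. A minimal pure-Python sketch of that idea, with made-up 3-dimensional vectors and a hypothetical helper `nearest_terms`, could look like this:

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def nearest_terms(term: str, vectors: Dict[str, Sequence[float]],
                  top_n: int = 3) -> List[Tuple[str, float]]:
    """Rank all other words by cosine similarity to `term`."""
    query = vectors[term]
    scored = [(word, cosine_similarity(query, vec))
              for word, vec in vectors.items() if word != term]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:top_n]


# Made-up toy vectors, only for illustration.
toy_vectors = {
    "virus": [0.9, 0.1, 0.0],
    "pathogen": [0.8, 0.2, 0.1],
    "vaccine": [0.1, 0.9, 0.2],
    "hospital": [0.0, 0.2, 0.9],
}
print(nearest_terms("virus", toy_vectors))
```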

# **Topic Modeling & Keyword Extraction**
In order to understand the content of the corpus and how the text might relate to each task, we extracted relevant topics with the Latent Dirichlet Allocation (LDA) algorithm. We used the [Gensim](https://radimrehurek.com/gensim/) package on the clean version of the sentences within the filtered subset of the corpus.

[Notebook with code for topic modeling](https://github.com/Hazoom/covid19/blob/master/notebooks/Taxonomy/Topic_Model_LDA.ipynb)

Steps taken:

1. Read in cleaned sentences
2. Build Gensim dictionary with id2word
3. Structure corpus with doc2bow
4. Calculate term document frequency
5. Train the LDA model

Using the full filtered subset of articles did not result in very distinctive topics. Below is a snippet with the keywords of the 10 derived topics:

In [None]:
[(0, '0.079*"•" + 0.019*"blood" + 0.015*"associated" + 0.013*"cells" + 0.012*"ace2" + 0.012*"protein" + 0.011*"important" + 0.011*"levels" + 0.010*"diseases" + 0.010*"cell"'),
 (1, '0.110*"who" + 0.088*"it" + 0.056*"response" + 0.043*"could" + 0.036*"under" + 0.035*"available" + 0.032*"major" + 0.032*"as" + 0.030*"without" + 0.024*"muscle"'),
 (2, '0.173*"■" + 0.020*"some" + 0.013*"drugs" + 0.010*"transmission" + 0.009*"surgery" + 0.009*"must" + 0.009*"drug" + 0.009*"there" + 0.008*"increased" + 0.008*"high"'),
 (3, '0.071*"de" + 0.036*"were" + 0.025*"patient" + 0.023*"1" + 0.022*"after" + 0.018*"a" + 0.018*"more" + 0.015*"all" + 0.015*"when" + 0.014*"cause"'),
 (4, '0.044*"the" + 0.035*"from" + 0.028*"should" + 0.019*"other" + 0.018*"risk" + 0.017*"oral" + 0.017*"which" + 0.017*"in" + 0.013*"use" + 0.013*"cases"'),
 (5, '0.069*"may" + 0.033*"can" + 0.031*"have" + 0.029*"disease" + 0.028*"dental" + 0.022*"also" + 0.020*"has" + 0.020*"been" + 0.018*"health" + 0.016*"virus"'),
 (6, '0.051*"la" + 0.031*"en" + 0.025*"2" + 0.023*"3" + 0.016*"que" + 0.016*"el" + 0.016*"y" + 0.014*"los" + 0.014*"4" + 0.013*"les"'),
 (7, '0.045*"s" + 0.041*"et" + 0.031*"during" + 0.023*"al" + 0.022*"had" + 0.021*"people" + 0.020*"à" + 0.018*"local" + 0.017*"days" + 0.016*"2020"'),
 (8, '0.062*"patients" + 0.030*"treatment" + 0.028*"care" + 0.020*"used" + 0.014*"clinical" + 0.014*"infection" + 0.013*"common" + 0.013*"severe" + 0.013*"respiratory" + 0.012*"dentistry"'),
 (9, '0.030*"using" + 0.020*"areas" + 0.018*"ct" + 0.014*"described" + 0.014*"performed" + 0.013*"lesions" + 0.013*"above" + 0.012*"day" + 0.011*"learning" + 0.011*"reactions"')]

The lack of distinctive topics is likely due to the wide range of content in the corpus, which contains a lot of noise (the keywords above include stop words, bullet characters and non-English words). If we run the same LDA on the output of the semantic search for relevant sentences, hopefully clearer topics will emerge. This is an interesting research direction for the future.

# **Search Engine for Seed Sentences**
Our next goal was to find the best (most accurate and informative) sentences for each sub-task of each task in the CORD-19 challenge.
To reach that goal, we did the following:
1. Created a search engine that finds relevant sentences from input keywords. The search engine transforms the input into phrases using our phrase models above, performs query expansion by adding synonyms (above a certain similarity threshold) to the input keywords, and ranks sentences by the number of keyword matches and by the date of the article the sentence came from (newer articles rank higher). To focus on articles about COVID-19, the search engine accepts optional keywords that boost the sentences containing them. We tried TF-IDF and some other weighting techniques, but our simple method worked best for us.
2. For each sub-task we created a list of keywords that retrieves the best result set of sentences answering the research question of the sub-task.
3. Picked 1-3 sentences (out of 10) that summarize the answer in the most accurate, concise and informative manner. Those sentences are considered "seed sentences" and serve us in the next module, sentence similarity. This part was done partially by hand, and we believe it can be automated with better keyword extraction and a better search engine. This is an interesting area for future research.

All the examples are listed in the `notebooks` folder in our repo.
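
Before the full implementation, here is a stripped-down sketch of the ranking idea: a weighted keyword match share blended with a recency share. It leaves out phrase detection, synonym expansion and optional keywords, and the toy sentences and the function name `rank_sentences` are hypothetical. The 0.7/0.3 blend and the keyword weight of 3.0 are the values used in the class below.

```python
from datetime import date
from typing import List, Tuple


def rank_sentences(sentences: List[Tuple[str, date]], search_terms: List[str],
                   today: date, keyword_weight: float = 3.0,
                   match_share: float = 0.7,
                   recency_share: float = 0.3) -> List[Tuple[int, float]]:
    """Return (sentence index, score) pairs, best first."""
    hits = []
    for index, (text, published) in enumerate(sentences):
        tokens = set(text.split())
        match = sum(keyword_weight for term in search_terms if term in tokens)
        if match > 0:  # keep only sentences with at least one keyword
            hits.append((index, match, (today - published).days))
    if not hits:
        return []

    total_match = sum(match for _, match, _ in hits)
    oldest = max(age for _, _, age in hits)
    freshness = [oldest - age for _, _, age in hits]  # newer means larger
    total_freshness = sum(freshness) or 1  # all hits may share one date

    scored = [(index, match_share * match / total_match
               + recency_share * fresh / total_freshness)
              for (index, match, _), fresh in zip(hits, freshness)]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hypothetical cleaned sentences with publish dates.
toy_sentences = [
    ("bats possible reservoir new_coronavirus", date(2020, 1, 20)),
    ("seafood market spillover bats", date(2020, 3, 1)),
    ("hospital capacity planning", date(2020, 3, 10)),
    ("snakes unlikely reservoir", date(2019, 12, 30)),
]
ranked = rank_sentences(toy_sentences, ["spillover", "bats", "snakes"], today=date(2020, 4, 1))
print(ranked)
```

Both shares are normalized to sum to one over the matching sentences, so the final scores also sum to one.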

In [None]:
def create_articles_metadata_mapping(sentences_df: pd.DataFrame) -> dict:
    sentence_id_to_metadata = {}
    for row_count, row in sentences_df.iterrows():
        sentence_id_to_metadata[row_count] = dict(
            paper_id=row['paper_id'],
            cord_uid=row['cord_uid'],
            source=row['source'],
            url=row['url'],
            publish_time=row['publish_time'],
            authors=row['authors'],
            section=row['section'],
            sentence=row['sentence'],
        )
    return sentence_id_to_metadata

In [None]:
sentence_id_to_metadata = create_articles_metadata_mapping(sentences_df)

In [None]:
import operator
from datetime import datetime

class SearchEngine:
    def __init__(self,
                 sentence_id_to_metadata: dict,
                 sentences_df: pd.DataFrame,
                 bigram_model,
                 trigram_model,
                 fasttext_model):
        self.sentence_id_to_metadata = sentence_id_to_metadata
        self.cleaned_sentences = sentences_df['cleaned_sentence'].tolist()
        print(f'Loaded {len(self.cleaned_sentences)} sentences')

        self.bigram_model = bigram_model
        self.trigram_model = trigram_model
        self.fasttext_model = fasttext_model

    def _get_search_terms(self, keywords, synonyms_threshold):
        # clean tokens
        cleaned_terms = [clean_tokenized_sentence(keyword.split(' ')) for keyword in keywords]
        # remove empty terms
        cleaned_terms = [term for term in cleaned_terms if term]
        # create bi-grams
        terms_with_bigrams = self.bigram_model[' '.join(cleaned_terms).split(' ')]
        # create tri-grams
        terms_with_trigrams = self.trigram_model[terms_with_bigrams]
        # expand query with synonyms
        search_terms = [self.fasttext_model.most_similar(token) for token in terms_with_trigrams]
        # filter synonyms above threshold (and flatten the list of lists)
        search_terms = [synonym[0] for synonyms in search_terms for synonym in synonyms
                        if synonym[1] >= synonyms_threshold]
        # expand keywords with synonyms
        search_terms = list(terms_with_trigrams) + search_terms
        return search_terms

    def search(self,
               keywords: List[str],
               optional_keywords=None,
               top_n: int = 10,
               synonyms_threshold=0.7,
               keyword_weight: float = 3.0,
               optional_keyword_weight: float = 0.5) -> List[dict]:
        if optional_keywords is None:
            optional_keywords = []

        search_terms = self._get_search_terms(keywords, synonyms_threshold)

        optional_search_terms = self._get_search_terms(optional_keywords, synonyms_threshold) \
            if optional_keywords else []

        print(f'Search terms after cleaning, bigrams, trigrams and synonym expansion: {search_terms}')
        print(f'Optional search terms after cleaning, bigrams, trigrams and synonym expansion: {optional_search_terms}')

        date_today = datetime.today()

        # calculate score for each sentence. Take only sentence with at least one match from the must-have keywords
        indexes = []
        match_counts = []
        days_diffs = []
        for sentence_index, sentence in enumerate(self.cleaned_sentences):
            sentence_tokens = sentence.split(' ')
            sentence_tokens_set = set(sentence_tokens)
            match_count = sum([keyword_weight if keyword in sentence_tokens_set else 0
                               for keyword in search_terms])
            if match_count > 0:
                indexes.append(sentence_index)
                if optional_search_terms:
                    match_count += sum([optional_keyword_weight if keyword in sentence_tokens_set else 0
                                       for keyword in optional_search_terms])
                match_counts.append(match_count)
                article_date = self.sentence_id_to_metadata[sentence_index]["publish_time"]

                if article_date == "2020":
                    article_date = "2020-01-01"

                article_date = datetime.strptime(article_date, "%Y-%m-%d")
                days_diff = (date_today - article_date).days
                days_diffs.append(days_diff)

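        # no sentence matched the must-have keywords
        if not indexes:
            return []

        # the bigger the better
        total_matches = sum(match_counts)
        match_counts = [float(match_count) / total_matches for match_count in match_counts]

        # the lesser the better
        max_days_diff = max(days_diffs)
        days_diffs = [(max_days_diff - days_diff) for days_diff in days_diffs]
        # avoid division by zero when all matching sentences share the same date
        total_days = sum(days_diffs) or 1.0
        days_diffs = [float(days_diff) / total_days for days_diff in days_diffs]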

        index_to_score = {}
        for index, match_count, days_diff in zip(indexes, match_counts, days_diffs):
            index_to_score[index] = 0.7 * match_count + 0.3 * days_diff

        # sort by score descending
        sorted_indexes = sorted(index_to_score.items(), key=operator.itemgetter(1), reverse=True)

        # take only the sentence IDs
        sorted_indexes = [item[0] for item in sorted_indexes]

        # limit results
        sorted_indexes = sorted_indexes[0: min(top_n, len(sorted_indexes))]

        # get metadata for each sentence
        results = []
        for index in sorted_indexes:
            results.append(self.sentence_id_to_metadata[index])
        return results

In [None]:
search_engine = SearchEngine(sentence_id_to_metadata, sentences_df, bigram_model, trigram_model, fasttext_model)

In [None]:
def search(keywords, optional_keywords=None, top_n=10, synonyms_threshold=0.8, only_sentences=False):
    print(f"\nSearch for terms {keywords}\n\n")
    results = search_engine.search(
        keywords, optional_keywords=optional_keywords, top_n=top_n, synonyms_threshold=synonyms_threshold
    )
    print("\nResults:\n")
    
    if only_sentences:
        for result in results:
            print(result['sentence'] + "\n")
    else:
        pprint(results)

In [None]:
search(keywords=["spillover", "bats", "snakes", "exotic animals", "seafood"],
       optional_keywords=["new coronavirus", "coronavirus", "covid19"],
      top_n=3)

At the end of this stage, we had seed sentences for each sub-task.

In [None]:
task_id = 2

In [None]:
import json

with open(f"../input/covid19seedsentences/{task_id}.json") as in_fp:
    seed_sentences_json = json.load(in_fp)

print(seed_sentences_json['taskName'])

# **Semantic Search for Relevant Sentences**
After finding seed sentences, we wanted to expand our evidence by finding sentences with similar semantic meaning in the whole corpus (ones that did not come up in the search engine in the previous phase), in order to collect more information and evidence for each research question.

There are many methods and techniques for sentence embedding in the NLP literature. Based on our experience in the field, we chose to implement two of them and try both:
1. The technique from the paper ["A Simple but Tough-to-Beat Baseline for Sentence Embeddings"](https://openreview.net/forum?id=SyK00v5xx) (Sanjeev Arora, Yingyu Liang, Tengyu Ma). The authors take a word embedding model pretrained on a large unlabeled corpus, build a sentence embedding as a smooth weighted average of the embeddings of the words in the sentence, and then remove the projection onto the first principal component (computed with SVD or PCA) from each sentence vector. This last step reduces noise in the sentence embeddings. We found that with this technique, using phrases (bi-grams and tri-grams) in the semantic search engine performed worse than not using them, so we decided not to use phrases in the semantic search part.
2. A BERT model fine-tuned on the Stanford Natural Language Inference (SNLI) task, which predicts whether two sentences are semantically related. For the BERT encoder we used UKP's sentence-transformers library, with the model called `bert-base-nli-stsb-mean-tokens` (as the name indicates, it is further tuned on the STS benchmark).
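
To make the first technique more tangible, here is a rough pure-Python sketch under simplifying assumptions: toy 2-dimensional word vectors, made-up word counts, a smoothing parameter `a = 1e-3` chosen for illustration, and power iteration as a stand-in for SVD. It is not the encoder from our repo.

```python
import math
from typing import Dict, List


def sif_sentence_vector(tokens: List[str], word_vectors: Dict[str, List[float]],
                        word_counts: Dict[str, int], a: float = 1e-3) -> List[float]:
    """Smooth weighted average: each word is weighted by a / (a + p(word))."""
    total = sum(word_counts.values())
    dim = len(next(iter(word_vectors.values())))
    vector = [0.0] * dim
    known = [token for token in tokens if token in word_vectors]
    for token in known:
        weight = a / (a + word_counts.get(token, 0) / total)
        for i, value in enumerate(word_vectors[token]):
            vector[i] += weight * value
    return [v / len(known) for v in vector] if known else vector


def first_singular_direction(rows: List[List[float]], iterations: int = 200) -> List[float]:
    """Power iteration for the top right singular vector of the sentence matrix."""
    dim = len(rows[0])
    direction = [1.0 / math.sqrt(dim)] * dim
    for _ in range(iterations):
        # Multiply by X^T X: project every row on the direction, then recombine.
        projections = [sum(r[i] * direction[i] for i in range(dim)) for r in rows]
        updated = [sum(p * r[i] for p, r in zip(projections, rows)) for i in range(dim)]
        norm = math.sqrt(sum(x * x for x in updated))
        if norm == 0:
            break
        direction = [x / norm for x in updated]
    return direction


def remove_common_component(rows: List[List[float]]) -> List[List[float]]:
    """Subtract each row's projection on the first singular direction."""
    direction = first_singular_direction(rows)
    cleaned = []
    for row in rows:
        projection = sum(x * d for x, d in zip(row, direction))
        cleaned.append([x - projection * d for x, d in zip(row, direction)])
    return cleaned


# Made-up toy data: "the" is frequent, so it gets a small weight.
toy_vectors = {"virus": [1.0, 0.2], "the": [0.9, 0.9], "vaccine": [0.1, 1.0]}
toy_counts = {"virus": 50, "the": 900, "vaccine": 50}
toy_sentences = [["the", "virus"], ["the", "vaccine"], ["virus", "vaccine"]]

sentence_vectors = [sif_sentence_vector(s, toy_vectors, toy_counts) for s in toy_sentences]
denoised = remove_common_component(sentence_vectors)
print(denoised)
```

The weighting pushes frequent words toward zero influence, and removing the common direction strips what all sentences share, leaving the parts that distinguish them.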

We developed a method that takes a list of sentences and retrieves sentences that are semantically similar to them. Because the input is not just one sentence, we had to aggregate the input sentences in some manner. We researched several aggregation techniques:
1. Union
2. Mean
3. 1st Principal Component (pc_1)
4. 2nd Principal Component (pc_2)

The method that worked best was fine-tuned BERT (#2) with Union aggregation.

We iterated through all the sentences in our filtered corpus and encoded them using the techniques mentioned above. Given a new sentence, we encoded it and compared it to all the pre-encoded sentences in our index (using cosine similarity between the vectors) in order to get the most similar sentences for each input sentence. We used `nmslib` to store the pre-encoded sentences and to perform the similarity search. All code is available in our repo in [corpus_index](https://github.com/Hazoom/covid19/tree/master/src/corpus_index) and [encoders](https://github.com/Hazoom/covid19/tree/master/src/encoders).
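
The difference between the Union and Mean aggregations can be illustrated with a brute-force sketch. It assumes that Union means querying the index once per seed sentence and merging the results, and that Mean means querying once with the average of the seed vectors; the toy vectors and function names are hypothetical, and a plain Python loop stands in for `nmslib`.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def search_index(query: Sequence[float], index: Dict[str, Sequence[float]],
                 top_n: int) -> List[Tuple[str, float]]:
    """Brute-force nearest neighbours by cosine similarity."""
    scored = [(sentence_id, cosine(query, vec)) for sentence_id, vec in index.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:top_n]


def search_union(seeds: List[Sequence[float]], index: Dict[str, Sequence[float]],
                 top_n: int) -> List[Tuple[str, float]]:
    """Query once per seed and merge, keeping the best score per sentence."""
    best: Dict[str, float] = {}
    for seed in seeds:
        for sentence_id, score in search_index(seed, index, top_n):
            best[sentence_id] = max(score, best.get(sentence_id, -1.0))
    return sorted(best.items(), key=lambda item: item[1], reverse=True)


def search_mean(seeds: List[Sequence[float]], index: Dict[str, Sequence[float]],
                top_n: int) -> List[Tuple[str, float]]:
    """Query once with the centroid of the seed vectors."""
    dim = len(seeds[0])
    centroid = [sum(seed[i] for seed in seeds) / len(seeds) for i in range(dim)]
    return search_index(centroid, index, top_n)


# Made-up sentence embeddings and two seed sentences covering different facets.
toy_index = {"s1": [1.0, 0.0], "s2": [0.0, 1.0], "s3": [0.7, 0.7], "s4": [-1.0, 0.0]}
toy_seeds = [[1.0, 0.1], [0.1, 1.0]]
print("union:", search_union(toy_seeds, toy_index, top_n=1))
print("mean: ", search_mean(toy_seeds, toy_index, top_n=1))
```

In this toy case Union returns the closest match for each seed, while Mean returns a sentence that sits between the two seeds. That may be one reason Union suited seed sentences that cover different aspects of a sub-task.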

Credit to [Tal Almagor](https://github.com/talmago) for helping to develop this module.

# **Answer Summarization**
The final module in our project is abstractive answer summarization. The goal is to build an informative and concise answer for each sub-task from the relevant sentences found in the previous module. We chose Facebook's [BART](https://arxiv.org/abs/1910.13461) model. BART is trained by (1) corrupting text with an arbitrary noising function, and (2) learning a model to reconstruct the original text. It uses a standard Transformer-based neural machine translation architecture which, despite its simplicity, can be seen as generalizing BERT (due to the bidirectional encoder), GPT (with the left-to-right decoder), and many other more recent pretraining schemes. BART works well for text generation tasks, hence we chose it for this task. We use HuggingFace's great [transformers](https://github.com/huggingface/transformers) library for that purpose. A future improvement would be to fine-tune the BART model on COVID-19 articles. Due to lack of resources, we used the pretrained BART model as is.
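
The "corrupt, then reconstruct" idea can be illustrated with token masking, one of the simpler noising schemes discussed in the BART paper. The sketch below only builds a (corrupted, original) training pair; the function name, masking probability and seed are illustrative assumptions, and this is not how the pretrained model we load below was actually trained.

```python
import random
from typing import List


def mask_tokens(tokens: List[str], mask_prob: float = 0.3, seed: int = 0,
                mask_token: str = "<mask>") -> List[str]:
    """Replace each token with the mask token with probability `mask_prob`."""
    rng = random.Random(seed)  # fixed seed keeps the example reproducible
    return [mask_token if rng.random() < mask_prob else token for token in tokens]


original = "bats are considered a likely natural reservoir of the virus".split()
corrupted = mask_tokens(original)

# A denoising model would be trained to map `corrupted` back to `original`.
print(" ".join(corrupted))
print(" ".join(original))
```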

In [None]:
import torch
from transformers import BartTokenizer, BartForConditionalGeneration

# This will take time on the first time since it downloads the model
tokenizer_summarize = BartTokenizer.from_pretrained('bart-large-cnn')
model_summarize = BartForConditionalGeneration.from_pretrained('bart-large-cnn')

In [None]:
class BartSummarizer:
    def __init__(self, tokenizer_summarize, model_summarize):
        self.torch_device = 'cuda' if torch.cuda.is_available() else 'cpu'
        self.tokenizer_summarize = tokenizer_summarize
        self.model_summarize = model_summarize
        self.model_summarize.to(self.torch_device)
        self.model_summarize.eval()

    def create_summary(self, text: str,
                       repetition_penalty=1.0) -> str:
        text_input_ids = self.tokenizer_summarize.batch_encode_plus(
            [text], return_tensors='pt', max_length=1024)['input_ids']
        
        min_length = min(text_input_ids.size()[1], 128)
        max_length = min(text_input_ids.size()[1], 1024)

        print(f"Summary Min length: {min_length}")
        print(f"Summary Max length: {max_length}")

        text_input_ids = text_input_ids.to(self.torch_device)

        summary_ids = self.model_summarize.generate(text_input_ids,
                                                    num_beams=4,
                                                    length_penalty=1.4,
                                                    max_length=max_length,
                                                    min_length=min_length,
                                                    no_repeat_ngram_size=4,
                                                    repetition_penalty=repetition_penalty)
        summary = self.tokenizer_summarize.decode(summary_ids.squeeze(), skip_special_tokens=True)
        return summary

In [None]:
bart_summarizer = BartSummarizer(tokenizer_summarize, model_summarize)

# **Putting it all together**
Now that we have all the components ready, we can visualize the results for the task.
We iterate over the sub-tasks; for each one we collect the relevant sentences and create an abstractive summary.

In [None]:
with open(f"../input/covid19seedsentences/{task_id}_relevant_sentences.json") as in_fp:
    relevant_sentences_json = json.load(in_fp)

In [None]:
answers_results = []
for idx, sub_task_json in enumerate(relevant_sentences_json["subTasks"]):
    sub_task_description = sub_task_json["sub_task_description"]
    print(f"Working on task: {sub_task_description}")
    best_sentences = seed_sentences_json["subTasks"][idx]["bestSentences"]
    relevant_sentences = sub_task_json["relevant_sentences"]
    relevant_sentences_texts = [result["sentence"] for result in relevant_sentences]
    sub_task_summary = bart_summarizer.create_summary(" ".join(best_sentences + relevant_sentences_texts))
    answers_results.append(dict(sub_task_description=sub_task_description, relevant_sentences=relevant_sentences, sub_task_summary=sub_task_summary))

Let's visualize the results:

In [None]:
from IPython.display import display, HTML
pd.set_option('display.max_colwidth', 0)

In [None]:
def display_summary(summary: str):
    return display(HTML(f"<div>{summary}</div>"))

def display_sub_task_description(sub_task_description):
    return display(HTML(f"<h2>{sub_task_description}</h2>"))

def display_task_name(task_name):
    return display(HTML(f"<h1>{task_name}</h1>"))

In [None]:
def visualize_output(sub_task_json):
    """
    Prints output for each sub-task
    """
    # print description
    display_sub_task_description(sub_task_json.get("sub_task_description"))
    display_summary(sub_task_json.get("sub_task_summary"))

    # print output sentences
    sentence_output = pd.DataFrame(sub_task_json.get('relevant_sentences'))
    # the source column is called "cord_uid", so rename that one for display
    sentence_output.rename(columns={"sentence": "Relevant Sentence", "cord_uid": "CORD UID",
                                    "publish_time": "Publish Time", "url": "URL",
                                    "source": "Source"}, inplace=True)
    
    display(HTML(sentence_output[['CORD UID', 'Source', 'Publish Time', 'Relevant Sentence', 'URL']].to_html(render_links=True, escape=False)))

In [None]:
display_task_name(seed_sentences_json["taskName"])
for sub_task_json in answers_results:
    visualize_output(sub_task_json)

Most of the results look informative and relevant to the research topic.

# Extracting Topics From Answers

## Aggregate all sentences into a dataframe to use LDA on relevant sentences to see if distinctive topics emerge

In [None]:
def save_output(seed_sentences, sub_task_json):
    """
    Saves output for each sub-task
    """
    sentence_output = pd.DataFrame(sub_task_json.get('relevant_sentences'))
    # the source column is called "cord_uid", so rename that one
    sentence_output.rename(columns={"sentence": "Relevant Sentence", "cord_uid": "CORD UID",
                                    "publish_time": "Publish Time", "url": "URL",
                                    "source": "Source"}, inplace=True)
    
    return sentence_output[['CORD UID', 'Source', 'Publish Time', 'Relevant Sentence', 'URL']]

In [None]:
relevant_sentences = []
for idx, sub_task_json in enumerate(answers_results):
    task_sentences = save_output(seed_sentences_json["subTasks"][idx]["bestSentences"], sub_task_json)
    relevant_sentences.append(task_sentences)

In [None]:
all_relevant_sentences = pd.concat(relevant_sentences).reset_index()

In [None]:
all_relevant_sentences.head(1)

Clean relevant sentences and add bigrams and trigrams:

In [None]:
cleaned_sentences = [
    clean_sentence(sentence, bigram_model, trigram_model).split()
    for sentence in all_relevant_sentences['Relevant Sentence']
]

In [None]:
cleaned_sentences[0]

Prepare our corpus for LDA model:

In [None]:
from gensim import corpora

# Create Dictionary
id2word = corpora.Dictionary(cleaned_sentences)

# Create Corpus
texts = cleaned_sentences

# Term Document Frequency
corpus = [id2word.doc2bow(text) for text in texts]

# Human readable format of corpus (term-frequency)
[[(id2word[id], freq) for id, freq in cp] for cp in corpus[:1]]

In [None]:
import gensim

# Build LDA model
lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus,
                                           id2word=id2word,
                                           num_topics=10, 
                                           random_state=100,
                                           update_every=1,
                                           chunksize=100,
                                           passes=10,
                                           alpha='auto',
                                           per_word_topics=True)

In [None]:
# Print the Keyword in the topics
pprint(lda_model.print_topics())
doc_lda = lda_model[corpus]

# Conclusion
We hope this notebook will help researchers in the medical domain gain better information about the new COVID-19 virus.
Please feel free to use this notebook for your own needs.

Any comments and upvotes will be much appreciated.