
# Multi Approach to NLP Content Analysis <img src="https://i.ibb.co/S01K61c/World-Wide-Technology.png" alt="World-Wide-Technology" border="0" align="right" width=100 height = 50>
<hr>
Here we take an ensemble approach to finding articles relevant to a particular query. We combine 6 NLP models of varying complexity and explored an additional 2. We chose an ensemble rather than a single methodology because the data has no relevance labels and we lack expert understanding of the medical literature. Absent expert knowledge or labels, we rely instead on the average across methods. We also found that these 8 methods gave similar but not identical answers for the same query, which justifies using multiple models.

Each of the 8 methods ranks every abstract by its similarity to a particular query. The ranks from the individual methods are then combined in an ensemble fashion to get an overall rank for each abstract. Using these overall ranks, the top N abstracts for a query can be displayed to the user.
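
The rank-combination step can be sketched as follows. This is a minimal illustration, assuming the ensemble simply averages each abstract's rank across methods; the function name `combine_ranks` and the toy rank arrays are hypothetical and do not come from the notebook.

```python
import numpy as np

def combine_ranks(rank_lists):
    """Average per-method ranks (0 = best) and return the best-first order.

    rank_lists: one array per method, where rank_lists[m][i] is the rank
    that method m assigned to abstract i.
    """
    mean_rank = np.mean(np.vstack(rank_lists), axis=0)
    # A stable sort keeps the original order for abstracts with tied ranks.
    order = np.argsort(mean_rank, kind="stable")
    return mean_rank, order

# Toy example (made-up ranks): three methods ranking four abstracts.
ranks_a = np.array([0, 1, 2, 3])
ranks_b = np.array([1, 0, 3, 2])
ranks_c = np.array([0, 2, 1, 3])

mean_rank, order = combine_ranks([ranks_a, ranks_b, ranks_c])
top_n = order[:2]  # indices of the two best abstracts overall
print(mean_rank, top_n)
```

Averaging ranks rather than raw similarity scores sidesteps the fact that each method's scores live on a different scale.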


# Contents
<hr>
* [Pros and Cons](#pandc)
* [Setup](#setup)
* [Data loading and Cleaning](#data)
* [TF-IDF with Cosine distance](#TF-IDF)
* [Topic Modeling with Cosine distance](#Topic-Modeling)
* [Knowledge based methods: Wordnet](#Wordnet)
* [Doc2vec with Word Mover's Distance ](#Doc2vec)
* [Pre-trained Word2vec with Word Mover's Distance](#Word2vec)
* [BERT with Cosine Distance ](#BERT-cos)
* [SciBert with Cosine distance](#SciBert-cos)
* [SciBert with Next Sentence Prediction](#SciBert-next)
* [Results](#Results)

<a id="pandc"></a>
# Pros and Cons of our Approach:
<hr>

### Pros
1. Is not overly reliant on one single method due to the ensemble framework.
2. Uses a variety of NLP tools ranging from more basic methods such as TF-IDF all the way to BERT.
3. Contains both pre-trained methods and methods trained on the actual data, giving a compromise between relying on prior knowledge and relying on the raw data.

 

### Cons
1. Pre-processing of the data is fairly standard for an NLP problem and could be extended.
2. The method could be made faster; the runtime is slow when doing on-demand searches.
3. Only one type of topic modeling (LSA) was explored; LDA could also be explored.

<a id="setup"></a>
# Setup
<hr>
We ran this notebook on a machine with 20 cores, 32 GB of RAM, and a GPU. Note that the Kaggle kernel may not have enough resources.

#### Installing the packages

In [None]:
#Language library
!pip install nltk

#Topic Modeling library used for TF-IDF and Doc2vec
!pip install gensim

#Math library used for Bert Model Distance Calculation 
!pip install scipy

#Libraries used for SciBert Models
!pip install sentence-transformers
!pip install transformers


You must download the BERT and Word2vec models from [here](https://drive.google.com/drive/folders/1q91MEExCliyz-mmp8bWCn6L7_lm8mHIF?usp=sharing) and place them in a local folder called 'bin'.

#### Importing the required libraries

In [None]:
import numpy as np 
import sys
import os 
import nltk
import pandas as pd
import spacy

import gensim
from gensim.models import KeyedVectors

from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.corpus import wordnet as wn

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.decomposition import TruncatedSVD
from sklearn import mixture

import scipy
from scipy import spatial

from sentence_transformers import SentenceTransformer

import torch
from transformers import *

#display
import ipywidgets as widgets
from IPython.display import HTML




#### Set up the nltk library

In [None]:
# The resources are downloaded into 'bin', so NLTK must also search that folder
nltk.data.path.append('bin')
nltk.download('stopwords',download_dir='bin', quiet=True)
nltk.download('punkt',download_dir='bin', quiet=True)
nltk.download('averaged_perceptron_tagger',download_dir='bin', quiet=True)
nltk.download('wordnet',download_dir='bin', quiet=True)
nltk.download('omw',download_dir='bin', quiet=True)
stop_words = stopwords.words('english')
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')

<a id="data"></a>
## Data loading and Cleaning
<hr>

Select a task to run the models on.

In [None]:
subtasks = ['Clinical and bench trials to investigate less common viral inhibitors against COVID-19 such as naproxen, clarithromycin, and minocycline that may exert effects on viral replication.'
            ,'Methods evaluating potential complication of Antibody-Dependent Enhancement (ADE) in vaccine recipients.'
            ,'Exploration of use of best animal models and their predictive value for a human vaccine.'
            ,'Capabilities to discover a therapeutic (not vaccine) for the disease, and clinical effectiveness studies to discover therapeutics, to include antiviral agents.'
            ,'Alternative models to aid decision makers in determining how to prioritize and distribute scarce, newly proven therapeutics as production ramps up. This could include identifying approaches for expanding production capacity to ensure equitable and timely distribution to populations in need.'
            ,'Efforts targeted at a universal coronavirus vaccine.'
            ,'Efforts to develop animal models and standardize challenge studies'
            ,'Efforts to develop prophylaxis clinical studies and prioritize in healthcare workers'
            ,'Approaches to evaluate risk for enhanced disease after vaccination'
            ,'Assays to evaluate vaccine immune response and process development for vaccines, alongside suitable animal models']

sub_task='Methods evaluating potential complication of Antibody-Dependent Enhancement (ADE) in vaccine recipients.'


def set_subtasks(Task):
    global sub_task
    sub_task = Task
    print('Current task: '+sub_task)
    
widgets.interact(set_subtasks, Task=subtasks )
out = widgets.Output(layout={'border': '1px solid black'})



Read in the abstract data and clean the results. 

In [None]:
#load the csv sources
raw_md_data = pd.read_csv('/kaggle/input/CORD-19-research-challenge/metadata.csv')
#drop duplicates by abstract
raw_md_data.drop_duplicates(['abstract'], inplace=True)
#remove missing abstracts
clean_abstracts = raw_md_data['abstract'].dropna()

In [None]:


#The TF-IDF and Topic modeling methods require specific data cleaning and call this function
def clean_docs(doc_list):

    doc_df = pd.DataFrame({'document':doc_list})

    #Clean the data
    # removing everything except letters (and '#'); recent pandas versions need regex=True here
    doc_df['clean_doc'] = doc_df['document'].str.replace("[^a-zA-Z#]", " ", regex=True)

    # removing short words
    doc_df['clean_doc'] = doc_df['clean_doc'].apply(lambda x: ' '.join([w for w in x.split() if len(w)>3]))

    # make all text lowercase
    doc_df['clean_doc'] = doc_df['clean_doc'].apply(lambda x: x.lower())

    stop_words = stopwords.words('english')

    # tokenization
    tokenized_doc = doc_df['clean_doc'].apply(lambda x: x.split())

    # remove stop-words
    tokenized_doc = tokenized_doc.apply(lambda x: [item for item in x if item not in stop_words])

    # de-tokenization
    detokenized_doc = []
    for i in range(len(doc_df)):

        try:
            t = ' '.join(tokenized_doc[i])
            detokenized_doc.append(t)
        except Exception:
            # keep the output aligned with the input by appending an empty document
            print(f'Cannot put document {i} back together')
            detokenized_doc.append('')


    detokenized_doc = np.array(detokenized_doc)

    return detokenized_doc

def get_ranks(sim_scores,reverse=True):

    if reverse:
        temp = np.flip(np.argsort(sim_scores))
    else:
        temp = np.argsort(sim_scores)

    ranks = np.empty_like(temp)
    ranks[temp] = np.arange(len(sim_scores))

    return ranks
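
To make the output of `get_ranks` concrete, here is a small sketch on made-up scores. The helper is repeated so the example runs on its own. Note that `ranks[i]` is the rank of document `i`, so listing documents best-first takes one more `np.argsort` over the ranks.

```python
import numpy as np

def get_ranks(sim_scores, reverse=True):
    # Same logic as the helper above, repeated so this sketch is self-contained.
    order = np.argsort(sim_scores)
    if reverse:
        order = np.flip(order)
    ranks = np.empty_like(order)
    ranks[order] = np.arange(len(sim_scores))
    return ranks

scores = np.array([0.1, 0.9, 0.5])  # made-up similarity of three documents to a query
ranks = get_ranks(scores)           # rank of each document, 0 = most similar
best_first = np.argsort(ranks)      # document indices ordered best-first

print(ranks)       # [2 0 1]
print(best_first)  # [1 2 0]
```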

<a id="TF-IDF"></a>
## TF-IDF with Cosine distance
<hr>
Term Frequency-Inverse Document Frequency (TF-IDF) is a basic NLP method that determines the importance of an individual word relative to a document. That is, words are weighted by how often they appear in a document and inversely weighted by how often they appear across the collection of documents. [Cosine distance](https://en.wikipedia.org/wiki/Cosine_similarity) is a common distance measure in the NLP literature and is used with many of the methods presented here. It measures the difference in orientation between two vectors rather than their magnitude. Thus, two sentences or documents can be far apart in Euclidean space yet have similar orientations, and therefore be similar according to cosine distance.
<a name="some-id"></a>
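
The point about orientation versus Euclidean distance can be illustrated with a few made-up term-weight vectors. This is only a sketch: the vectors and the helper name `cosine_similarity_1d` are hypothetical, and the notebook itself relies on gensim and scikit-learn for the actual computation.

```python
import numpy as np

def cosine_similarity_1d(u, v):
    """Cosine of the angle between two 1-D vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

short_doc = np.array([1.0, 2.0, 0.0])  # hypothetical term weights
long_doc = 10 * short_doc              # same word mix, ten times longer
other_doc = np.array([0.0, 1.0, 3.0])  # different word mix

euclid = float(np.linalg.norm(short_doc - long_doc))
cos_same = cosine_similarity_1d(short_doc, long_doc)
cos_other = cosine_similarity_1d(short_doc, other_doc)

print(euclid)     # large: the two vectors are far apart in Euclidean space
print(cos_same)   # close to 1: identical orientation
print(cos_other)  # much smaller: different orientation
```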

### Running the TF-IDF Model


TF-IDF helper functions: 

In [None]:

#TFIDF object
class TFIDFBOW:

    def __init__(self,doc_list,workdir):
        self.doc_list = doc_list
        self.workdir = workdir

        self.tfidf_bow()

    def tfidf_bow(self):
        #clean the docs
        detokenized_doc = clean_docs(self.doc_list)
        gen_docs = [[w.lower() for w in word_tokenize(text)] for text in detokenized_doc]

        # create the dictionary
        self.dictionary = gensim.corpora.Dictionary(gen_docs)

        # Create bag of words
        corpus = [self.dictionary.doc2bow(gen_doc) for gen_doc in gen_docs]
        self.tf_idf = gensim.models.TfidfModel(corpus)

        self.sims = gensim.similarities.Similarity(self.workdir, self.tf_idf[corpus], num_features=len(self.dictionary))



    def query_tfidf_bow(self,compare_doc):
        detokenized_compare_doc = clean_docs([compare_doc])
        gen_compare_docs = [[w.lower() for w in word_tokenize(text)] for text in detokenized_compare_doc]

        query_doc_bow = self.dictionary.doc2bow(gen_compare_docs[0])

        # perform a similarity query against the corpus
        query_doc_tf_idf = self.tf_idf[query_doc_bow]
        # print(document_number, document_similarity)
        return self.sims[query_doc_tf_idf]
    
    
    




Create a TF-IDF object and run it on the preprocessed articles.

In [None]:
tfidf_bow_obj = TFIDFBOW(list(clean_abstracts), workdir='./bin/')

In [None]:
sim_scores = tfidf_bow_obj.query_tfidf_bow(sub_task)

In [None]:
ranks_TFIDF = get_ranks(sim_scores)

Match the ranks with their corresponding article.

In [None]:
# ranks_TFIDF[i] is the rank of abstract i, so argsort recovers the best-first order
matched_corpus_TFIDF = np.array(clean_abstracts)[np.argsort(ranks_TFIDF)]

Print the Output.

In [None]:

def set_output(num_doc):
    

    text_string = ''
    for i,doc in enumerate(matched_corpus_TFIDF[:num_doc]):
        text_string+= 'Document '+str(i+1)+'\n'+doc+'\n'

    print(text_string)


    
    
widgets.interact(set_output,num_doc=widgets.IntSlider(min=1, max=30, step=1, value=5))



### Our Results for TF-IDF
The following abstracts are the top 3 results after running the task: *Methods evaluating potential complication of Antibody-Dependent Enhancement (ADE) in vaccine recipients.*

* #### Abstract1
**Antibody-dependent enhancement of virus** infection is a process whereby virus-antibody complexes initiate infection of cells via Fc receptor-mediated endocytosis. We sought to investigate **antibody-dependent enhancement** of feline infectious peritonitis virus infection of primary feline peritoneal macrophages in vitro. Enhancement of infection was assessed, after indirect immunofluorescent-antibody labelling of infected cells, by determining the ratio between the number of cells infected in the presence and absence of virus-specific antibody. Infection enhancement was initially demonstrated by using heat-inactivated, virus-specific feline antiserum. Functional compatibility between murine immunoglobulin molecules and feline Fc receptors was demonstrated by using murine anti-sheep erythrocyte serum and an antibody-coated sheep erythrocyte phagocytosis assay. Thirty-seven murine monoclonal antibodies specific for the nucleocapsid, membrane, or spike proteins of feline infectious peritonitis virus or transmissible gastroenteritis virus were assayed for their ability to enhance the infectivity of feline infectious peritonitis virus. Infection enhancement was mediated by a subset of spike protein-specific monoclonal antibodies. A distinct correlation was seen between the ability of a monoclonal antibody to cause virus neutralization in a routine cell culture neutralization assay and its ability to mediate infection enhancement of macrophages. Infection enhancement was shown to be Fc receptor mediated by blockade of antibody-Fc receptor interaction using staphylococcal protein A. Our results are consistent with the hypothesis that antibody-dependent enhancement of feline infectious peritonitis virus infectivity is mediated by antibody directed against specific sites on the spike protein.
* #### Abstract2
2013 marks a milestone year for **plasmid DNA vaccine development** as a first-in-class cytomegalovirus (CMV) DNA vaccine enters pivotal phase 3 testing. This vaccine consists of two plasmids expressing CMV antigens glycoprotein B (gB) and phosphoprotein 65 (pp65) formulated with a CRL1005 poloxamer and benzalkonium chloride (BAK) delivery system designed to enhance plasmid expression. The vaccine’s planned initial indication under investigation is for prevention of CMV reactivation in CMV-seropositive (CMV(+)) recipients of an allogeneic hematopoietic stem cell transplant (HCT). A randomized, double-blind placebo-controlled phase 2 proof-of-concept study provided initial evidence of the safety of this product in CMV(+) HCT recipients who underwent immune ablation conditioning regimens. This study revealed a significant reduction in viral load endpoints and increased frequencies of pp65-specific interferon-γ-producing T cells in vaccine recipients compared to placebo recipients. The results of this endpoint-defining trial provided the basis for defining the primary and secondary endpoints of a global phase 3 trial in HCT recipients. A case study is presented here describing the development history of this **vaccine** from product concept to initiation of the phase 3 trial.
* #### Abstract3
BACKGROUND: Annual influenza **vaccination** is routinely recommended for pediatric solid organ transplant recipients. However, there are **limited data defining the immune response** to the inactivated vaccine in this population. METHODS: This prospective study compared the humoral and cell-mediated immune responses to the trivalent subvirion influenza vaccine in pediatric liver transplant recipients with those in their healthy siblings. All subjects received inactivated influenza vaccine. Hemagglutination inhibition and interferon-γ (IFN-γ) enzyme-linked immunosorbent spot assays for New Caledonia and Shanghai strains were performed at baseline, after each vaccine dose, and 3 months after the series. Seroconversion was defined as a 4-fold increase in antibody titers; seroprotection was defined as an antibody titer ≥1:40. An increase in the number of T cells secreting IFN-γ was considered to be a positive enzyme-linked immunosorbent spot response. RESULTS: After 1 dose of vaccine, transplant recipients achieved rates of antibody seroprotection and seroconversion that were similar to those achieved by their healthy siblings. However, for both influenza strains, IFN-γ responses by enzyme-linked immunosorbent spot were significantly attenuated in transplant recipients after 2 doses of vaccine. No cases of influenza or vaccine-related serious adverse events were documented in the study. CONCLUSIONS: The diminished cell-mediated immune response to influenza vaccination that was observed in pediatric liver transplant recipients suggests that the current vaccine strategy may not provide optimal protection. Because of concerns regarding potential emergence of more virulent influenza strains, further studies are warranted to determine if IFN-γ responses are predictive of efficacy and to identify the optimal vaccination strategy to protect populations with a high risk of infection.


<a id="Topic-Modeling"></a>
## Topic Modeling with Cosine distance
<hr>
Topic modeling is an unsupervised NLP technique for grouping words into clusters (topics). These clusters can be thought of as word clouds that contain similar terms. Latent semantic analysis (LSA) and latent Dirichlet allocation (LDA) are two of the most popular topic modeling methods. Here we use LSA for its computational speed, but LDA could also be considered. Once again we use cosine distance, this time on the output of the topic model.
<img src="https://i.ibb.co/23sp1Gb/tm-abstracts.png">
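
The idea behind LSA can be sketched on a toy corpus. This is a simplified stand-in, not the notebook's pipeline: it uses raw word counts instead of TF-IDF and NumPy's full SVD instead of scikit-learn's `TruncatedSVD`. The documents, the query, and the helper names are made up for illustration.

```python
import numpy as np

docs = [
    "vaccine antibody response",
    "antibody vaccine trial",
    "hospital bed capacity",
    "hospital staff capacity",
]
query = "vaccine antibody"

vocab = sorted({w for d in docs for w in d.split()})
index = {w: j for j, w in enumerate(vocab)}

def bag_of_words(text):
    """Count vector over the toy vocabulary; unknown words are ignored."""
    vec = np.zeros(len(vocab))
    for w in text.split():
        if w in index:
            vec[index[w]] += 1.0
    return vec

X = np.vstack([bag_of_words(d) for d in docs])  # documents x terms

# Keep only the k strongest latent "topics".
k = 2
U, S, Vt = np.linalg.svd(X, full_matrices=False)
doc_topics = X @ Vt[:k].T                       # documents in topic space
query_topics = bag_of_words(query) @ Vt[:k].T   # query projected the same way

def cosine(u, v):
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom else 0.0

scores = [cosine(d, query_topics) for d in doc_topics]
print(np.round(scores, 3))
```

In this toy corpus the two vaccine documents score close to 1 against the query and the two hospital documents close to 0, because each group of documents collapses onto its own topic direction.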



### Running the Topic Modeling with Cosine distance Model


Topic Modeling helper functions: 

In [None]:
#helper functions to hide 


num_topics= 20


def make_tm_output(doc_list,num_tf_idf_features=2000,num_compons=20):

    """
    Make output for topic modeling
    :param doc_list: list of raw documents (strings)
    :param num_tf_idf_features: maximum number of TF-IDF terms to keep
    :param num_compons: number of SVD components (topics)
    :return: document-topic matrix, fitted vectorizer, fitted SVD model
    """

    detokenized_doc = clean_docs(doc_list)

    # Build the TF-IDF matrix
    vectorizer = TfidfVectorizer(stop_words='english',
    max_features= num_tf_idf_features, # keep the top num_tf_idf_features terms
    max_df = 0.25,
    smooth_idf=True)


    tfidf_output = vectorizer.fit_transform(detokenized_doc)

    # SVD represents documents and terms as vectors
    svd_model = TruncatedSVD(n_components=num_compons, algorithm='randomized', n_iter=100, random_state=122)

    # fit_transform fits the model and projects the documents in one step
    tm_output = svd_model.fit_transform(tfidf_output)
    return tm_output, vectorizer,svd_model



Rank each document

In [None]:
tm_list = list(clean_abstracts)
tm_list.append(sub_task)

In [None]:
tm_output,_,_ = make_tm_output(tm_list,num_compons=num_topics)
target_vec = np.array(tm_output[-1,:]).reshape(1,num_topics)

Sort each document by distance from the task.

In [None]:
# similarity of every abstract (all rows except the last, which is the task) to the task
sim_scores = cosine_similarity(tm_output[:-1], target_vec).ravel()

ranks_cos_dist = get_ranks(sim_scores,   reverse=True)

Print the Output.

In [None]:
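# ranks_cos_dist[i] is the rank of abstract i, so argsort recovers the best-first order
matched_corpus_cos_dist = np.array(clean_abstracts)[np.argsort(ranks_cos_dist)]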
def set_output(num_doc):
    

    text_string = ''
    for i,doc in enumerate(matched_corpus_cos_dist[:num_doc]):
        text_string+= 'Document '+str(i+1)+'\n'+doc+'\n'

    print(text_string)


    
    
widgets.interact(set_output,num_doc=widgets.IntSlider(min=1, max=30, step=1, value=5))



### Our Results for Topic Modeling with Cosine distance
The following abstracts are the top 3 results after running the task: *Methods evaluating potential complication of Antibody-Dependent Enhancement (ADE) in vaccine recipients.*

* #### Abstract1
Data from successful attenuated lentiviral **vaccine studies** indicate that fully mature Env-specific **antibodies** characterized by high titer, high avidity, and the predominant recognition of conformational epitopes are associated with protective efficacy. Although vaccination with a DNA prime/recombinant vaccinia-vectored vaccine boost strategy has been found to be effective in some trials with non-human primate/simian/human immunodeficiency virus (SHIV) models, it remains unclear whether this vaccination strategy could elicit mature equine infectious anemia virus (EIAV) Env-specific **antibodies**, thus protecting vaccinated horses against EIAV infection. Therefore, in this pilot study we vaccinated horses using a strategy based on DNA prime/recombinant Tiantan vaccinia (rTTV)-vectored vaccines encoding EIAV env and gag genes, and observed the development of Env-specific antibodies, neutralizing antibodies, and p26-specific antibodies. Vaccination with DNA induced low titer, low avidity, and the predominant recognition of linear epitopes by Env-specific antibodies, which was enhanced by boosting vaccinations with rTTV vaccines. However, the maturation levels of Env-specific antibodies induced by the DNA/rTTV vaccines were significantly lower than those induced by the attenuated vaccine EIAV(FDDV). Additionally, DNA/rTTV vaccines did not elicit broadly neutralizing antibodies. After challenge with a virulent EIAV strain, all of the vaccinees and control horses died from EIAV disease. These data indicate that the regimen of DNA prime/rTTV vaccine boost did not induce mature Env-specific antibodies, which might have contributed to immune protection failure.
* #### Abstract2
A **vaccine** for equine coronavirus (ECoV) is so far unavailable. Bovine coronavirus (BCoV) is antigenically related to ECoV; it is therefore possible that BCoV vaccine will induce antibodies against ECoV in horses. This study investigated antibody response to ECoV in horses inoculated with BCoV vaccine. Virus neutralization tests showed that antibody titers against ECoV increased in all six horses tested at 14 days post inoculation, although the antibody titers were lower against ECoV than against BCoV. This study showed that BCoV vaccine provides horses with antibodies against ECoV to some extent. It is unclear whether antibodies provided by BCoV vaccine are effective against ECoV, and therefore ECoV challenge studies are needed to evaluate efficacy of the vaccine in the future.
* #### Abstract3
BACKGROUND: The **development of a protective vaccine** against canine visceral leishmaniasis (CVL) is an alternative approach for interrupting the domestic cycle of Leishmania infantum. Given the importance of sand fly salivary proteins as potent immunogens obligatorily co-deposited during transmission of Leishmania parasites, their inclusion in an anti-Leishmania vaccine has been investigated in the last few decades. In this context, we previously immunized dogs with a vaccine composed of L. braziliensis antigens plus saponin as the adjuvant and sand fly salivary gland extract (LBSapSal vaccine). This vaccine elicited an increase in both anti-saliva and anti-Leishmania IgG isotypes, higher counts of specific circulating CD8(+) T cells, and high NO production. METHODS: We investigated the immunogenicity and protective effect of LBSapSal vaccination after intradermal challenge with 1 × 10(7) late-log-phase L. infantum promastigotes in the presence of sand fly saliva of Lutzomyia longipalpis. The dogs were followed for up to 885 days after challenge. RESULTS: The LBSapSal vaccine presents extensive antigenic diversity with persistent humoral and cellular immune responses, indicating resistance against CVL is triggered by high levels of total IgG and its subtypes (IgG1 and IgG2); expansion of circulating CD5(+), CD4(+), and CD8(+) T lymphocytes and is Leishmania-specific; and reduction of splenic parasite load. CONCLUSIONS: These results encourage further study of vaccine strategies addressing Leishmania antigens in combination with proteins present in the saliva of the vector.

<a id="Wordnet"></a>
## Knowledge based methods: Wordnet
<hr>
Wordnet is a lexical database of English that can be viewed as a so-called knowledge graph. Knowledge graphs map the relationships between words, with more general terms in the middle of the graph and more specific terms along the terminal nodes. These graphs can be used to measure the distance between words: words that can be reached from one another in fewer steps are considered more similar. Two sentences or documents can therefore be compared by calculating the pairwise similarity between their words.
<img src="https://i.ibb.co/5hMgdHP/wordnet-figure.png" alt="wordnet-figure" border="0">
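
The "fewer steps means more similar" idea can be sketched on a tiny hand-made taxonomy. The `PARENT` table and both helper functions are hypothetical and only for illustration; the score uses the same 1 / (1 + path length) form that NLTK's `path_similarity` is based on.

```python
# Hypothetical mini-taxonomy: child -> parent (more general term).
PARENT = {
    "animal": "entity",
    "mammal": "animal",
    "cat": "mammal",
    "dog": "mammal",
    "bird": "animal",
    "virus": "entity",
}

def path_to_root(word):
    """Return the chain of nodes from word up to the root."""
    path = [word]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path

def path_similarity(a, b):
    """1 / (1 + number of edges on the shortest path between a and b)."""
    up_a = path_to_root(a)
    depth_in_b = {node: i for i, node in enumerate(path_to_root(b))}
    for steps_a, node in enumerate(up_a):
        if node in depth_in_b:  # first common ancestor
            return 1.0 / (1 + steps_a + depth_in_b[node])
    return None                 # no connecting path

print(path_similarity("cat", "dog"))    # 2 edges apart (via "mammal")
print(path_similarity("cat", "virus"))  # 4 edges apart (via "entity")
```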

### Running the Wordnet Model

Wordnet helper functions:

In [None]:
# Dask helpers for running work in parallel on a local cluster

from distributed import Client, LocalCluster
import multiprocessing as mp


def start_dask_cluster(n_workers=None, dashboard_address='8787', memory_limit=None):
    """

    :param n_workers: number of dask workers
    :param dashboard_address: local dashboard address to see dask dashboard
    :param memory_limit: memory limit for each worker (e.g., '8GB')
    :return: cluster and client
    """
    # set the number of cpus
    n_workers = set_num_works(n_workers)



    if memory_limit:
        cluster = LocalCluster(n_workers=n_workers, dashboard_address=dashboard_address, memory_limit=memory_limit,
                               threads_per_worker=1)
    else:
        # cluster = LocalCluster(n_workers=n_workers,dashboard_address=dashboard_address)
        cluster = LocalCluster(n_workers=n_workers, threads_per_worker=1)

    client = Client(cluster)


    return cluster, client


def close_dask_cluster(cluster, client):
    """

    :param cluster: dask cluster object
    :param client: dask client object
    :return:
    """

    cluster.close()
    client.close()


def set_num_works(n_workers=None):
    """
    Helper function
    """

    if n_workers:
        if n_workers > 50:
            n_workers = 50
    else:
        n_workers = mp.cpu_count()
    return n_workers



In [None]:


def similarity_score(s1, s2):
    """
    Calculate the normalized similarity score of s1 onto s2

    For each synset in s1, finds the synset in s2 with the largest similarity value.
    Sum of all of the largest similarity values and normalize this value by dividing it by the
    number of largest similarity values found.

    Args:
        s1, s2: list of synsets from doc_to_synsets

    Returns:
        normalized similarity score of s1 onto s2

    Example:
        synsets1 = doc_to_synsets('I like cats')
        synsets2 = doc_to_synsets('I like dogs')
        similarity_score(synsets1, synsets2)
        Out: 0.73333333333333339
    """

    list1 = []
    # For each synset in s1
    for a in s1:
        # find the synset in s2 with the largest similarity value
        # (path_similarity is computed once per pair; None means no connecting path)
        cur_list = [sim for sim in (i.path_similarity(a) for i in s2) if sim is not None]
        if len(cur_list) > 0:
            list1.append(max(cur_list))
        else:
            list1.append(0.0)
        # list1.append(max([i.wup_similarity(a) for i in s2 if i.wup_similarity(a) is not None]))

    if len(list1)==0:
        output = 0
    else:
        output = sum(list1) / len(list1)

    return output


def convert_tag(tag):
    """Convert the tag given by nltk.pos_tag to the tag used by wordnet.synsets"""
    # These are wordnet allowed parts of speech, converting tags from nltk parts of speech to wordnet
    tag_dict = {'N': 'n', 'J': 'a', 'R': 'r', 'V': 'v'}
    try:
        return tag_dict[tag[0]]
    except KeyError:
        return None
    
def doc_to_synsets(doc):
    """
    Returns a list of synsets in document.

    Tokenizes and tags the words in the document doc.
    Then finds the first synset for each word/tag combination.
    If a synset is not found for that combination it is skipped.

    Args:
        doc: string to be converted

    Returns:
        list of synsets

    Example:
        doc_to_synsets('Fish are nvqjp friends.')
        Out: [Synset('fish.n.01'), Synset('be.v.01'), Synset('friend.n.01')]
    """

    token = nltk.word_tokenize(doc)
    # add parts of speech to token
    tag = nltk.pos_tag(token)
    # convert nltk pos into wordnet pos
    nltk2wordnet = [(i[0], convert_tag(i[1])) for i in tag]
    # look up the synsets once per word and skip words that have none
    # only use the first synset, there can be more than one
    synset_lists = [wn.synsets(i, z) for i, z in nltk2wordnet]
    output = [s[0] for s in synset_lists if len(s) > 0]

    return output

def document_path_similarity(doc1, doc2):
    """Finds the symmetrical similarity between doc1 and doc2"""
    # convert each document to its list of synsets
    synsets1 = doc_to_synsets(doc1)
    synsets2 = doc_to_synsets(doc2)
    # average the two directed scores so the result is symmetric
    return (similarity_score(synsets1, synsets2) + similarity_score(synsets2, synsets1)) / 2
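
The best-match averaging used by the helpers above can be sketched without WordNet. Here a made-up word-level similarity (`toy_sim`) stands in for `path_similarity`, and plain word lists stand in for synsets. All names are hypothetical. The example shows why the two directed scores differ and why they are averaged.

```python
def directed_score(words_a, words_b, sim):
    """Average, over words in words_a, of the best similarity to any word in words_b."""
    best = []
    for a in words_a:
        candidates = [s for s in (sim(a, b) for b in words_b) if s is not None]
        best.append(max(candidates) if candidates else 0.0)
    return sum(best) / len(best) if best else 0.0

def symmetric_score(words_a, words_b, sim):
    """Mean of the two directed scores, so the argument order does not matter."""
    return (directed_score(words_a, words_b, sim) + directed_score(words_b, words_a, sim)) / 2

# Made-up word similarity: 1.0 for identical words, 0.5 for listed related pairs.
RELATED = {frozenset(("cat", "dog")), frozenset(("like", "love"))}

def toy_sim(a, b):
    if a == b:
        return 1.0
    return 0.5 if frozenset((a, b)) in RELATED else None

doc_a = ["like", "cat"]
doc_b = ["love", "dog", "food"]

a_to_b = directed_score(doc_a, doc_b, toy_sim)  # every word in doc_a has a match
b_to_a = directed_score(doc_b, doc_a, toy_sim)  # "food" has no match, lowering the score
both = symmetric_score(doc_a, doc_b, toy_sim)
print(a_to_b, b_to_a, both)
```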



Run the Wordnet model on the documents.

In [None]:

sim_scores = []
for doc in list(clean_abstracts):
    sim_scores.append(document_path_similarity(doc,sub_task))



# get the top matches
ranks_Wordnet = get_ranks(sim_scores, reverse=True)


Print the Output.

In [None]:
# ranks_Wordnet[i] is the rank of abstract i, so argsort recovers the best-first order
matched_corpus_Wordnet = np.array(clean_abstracts)[np.argsort(ranks_Wordnet)]
def set_output(num_doc):
    

    text_string = ''
    for i,doc in enumerate(matched_corpus_Wordnet[:num_doc]):
        text_string+= 'Document '+str(i+1)+'\n'+doc+'\n'

    print(text_string)


    
    
widgets.interact(set_output,num_doc=widgets.IntSlider(min=1, max=30, step=1, value=5))



### Our Results for Wordnet
The following abstracts are the top 3 results after running the task: *Methods evaluating potential complication of Antibody-Dependent Enhancement (ADE) in vaccine recipients.*


* #### Abstract1
For almost 60 years, the WHO Global Influenza Surveillance and Response System (GISRS) has been the key player in monitoring the evolution and spread of influenza viruses and recommending the strains to be used in human influenza vaccines. The GISRS has also worked to continually monitor and assess the risk posed by potential pandemic viruses and to guide appropriate public health responses. •  The expanded and enhanced role of the GISRS following the adoption of the International Health Regulations (2005), recognition of the continuing threat posed by avian H5N1 and the aftermath of the 2009 H1N1 pandemic provide an opportune time to critically review the process by which influenza vaccine viruses are selected. In addition to identifying potential areas for improvement, such a review will also help to promote greater appreciation by the wider influenza and policy‐making community of the complexity of influenza vaccine virus selection. •  The selection process is highly coordinated and involves continual year‐round integration of virological data and epidemiological information by National Influenza Centres (NICs), thorough antigenic and genetic characterization of viruses by WHO Collaborating Centres (WHOCCs) as part of selecting suitable candidate vaccine viruses, and the preparation of suitable reassortants and corresponding reagents for vaccine standardization by WHO Essential Regulatory Laboratories (ERLs). •  Ensuring the optimal effectiveness of vaccines has been assisted in recent years by advances in molecular diagnosis and the availability of more extensive genetic sequence data. However, there remain a number of challenging constraints including variations in the assays used, the possibility of complications resulting from non‐antigenic changes, the limited availability of suitable vaccine viruses and the requirement for recommendations to be made up to a year in advance of the peak of influenza season because of production constraints. 
•  Effective collaboration and coordination between human and animal influenza networks is increasingly recognized as an essential requirement for the improved integration of data on animal and human viruses, the identification of unusual influenza A viruses infecting human, the evaluation of pandemic risk and the selection of candidate viruses for pandemic vaccines. •  Training workshops, assessments and donations have led to significant increases in trained laboratory personnel and equipment with resulting expansion in both geographical surveillance coverage and in the capacities of NICs and other laboratories. This has resulted in a significant increase in the volume of information reported to WHO on the spread, intensity and impact of influenza. In addition, initiatives such as the WHO Shipment Fund Project have facilitated the timely sharing of clinical specimens and virus isolates and contributed to a more comprehensive understanding of the global distribution and temporal circulation of different viruses. It will be important to sustain and build upon the gains made in these and other areas. •  Although the haemagglutination inhibition (HAI) assay is likely to remain the assay of choice for the antigenic characterization of viruses in the foreseeable future, alternative assays – for example based upon advanced recombinant DNA and protein technologies – may be more adaptable to automation. Other technologies such as microtitre neuraminidase inhibition assays may also have significant implications for both vaccine virus selection and vaccine development. •  Microneutralization assays provide an important adjunct to the HAI assay in virus antigenic characterization. 
Improvements in the use and potential automation of such assays should facilitate large‐scale serological studies, while other advanced techniques such as epitope mapping should allow for a more accurate assessment of the quality of a protective immune response and aid the development of additional criteria for measuring immunity. •  Standardized seroepidemiological surveys to assess the impact of influenza in a population could help to establish well‐characterized banks of age‐stratified representative sera as a national, regional and global resource, while providing direct evidence of the specific benefits of vaccination. •  Advances in high‐throughput genetic sequencing coupled with advanced bioinformatics tools, together with more X‐ray crystallographic data, should accelerate understanding of the genetic and phenotypic changes that underlie virus evolution and more specifically help to predict the influence of amino acid changes on virus antigenicity. •  Complex mathematical modelling techniques are increasingly being used to gain insights into the evolution and epidemiology of influenza viruses. However, their value in predicting the timing and nature of future antigenic and genetic changes is likely to be limited at present. The application of simpler non‐mechanistic statistical algorithms, such as those already used as the basis of antigenic cartography, and phylogenetic modelling are more likely to be useful in facilitating vaccine virus selection and in aiding assessment of the pandemic potential of avian and other animal influenza viruses. •  The adoption of alternative vaccine technologies – such as live‐attenuated, quadrivalent or non‐HA‐based vaccines – has significant implications for vaccine virus selection, as well as for vaccine regulatory and manufacturing processes. 
Recent collaboration between the GISRS and vaccine manufacturers has resulted in the increased availability of egg isolates and high‐growth reassortants for vaccine production, the development of qualified cell cultures and the investigation of alternative methods of vaccine potency testing. WHO will continue to support these and other efforts to increase the reliability and timeliness of the global influenza vaccine supply. •  The WHO GISRS and its partners are continually working to identify improvements, harness new technologies and strengthen and sustain collaboration. WHO will continue in its central role of coordinating worldwide expertise to meet the increasing public health need for influenza vaccines and will support efforts to improve the vaccine virus selection process, including through the convening of periodic international consultations.
* #### Abstract2
untington’s dementia: a conundrum in fluid management H. Venkatesh, S. Ramachandran, A. Basu, H. Nair P179 - Diagnosis and management of severe hypernatraemia in the critical care setting S. Egan, J. Bates P180 - Correlation between arterial blood gas and electrolyte disturbances during hospitalization and outcome in critically ill patients S. Oliveira, N. R. Rangel Neto, F. Q. Reis P181 - Missing the “I” in MUDPILES – a rare cause of high anion gap metabolic acidosis (HAGMA) C. P. Lee, X. L. Lin, C. Choong , K. M. Eu, W. Y. Sim , K. S. Tee, J. Pau , J. Abisheganaden P182 - Plasma NGAL and urinary output: potential parameters for early initiation of renal replacement therapy K. Maas, H. De Geus P183 - Renal replacement therapy for critically ill patients: an intermittent continuity E. Lafuente, R. Marinho, J. Moura, R. Antunes, A. Marinho P184 - A survey of practices related to renal replacement therapy in critically ill patients in the north of England. T. E. Doris, D. Monkhouse, T. Shipley, S. Kardasz, I Gonzalez P185 - High initiation creatinine associated with lower 28-day mortality in critically ill patients necessitating continuous renal replacement therapy S. Stads, A. J. Groeneveld P186 - The impact of Karnofsky performance scale on outcomes in acute kidney injury patients receiving renal replacement therapy on the intensive care unit I. Elsayed, N. Ward, A. Tridente, A. Raithatha P187 - Severe hypophosphatemia during citrate-anticoagulated CRRT A. Steuber, C. Pelletier, S. Schroeder, E. Michael, T. Slowinski, D. Kindgen-Milles P188 ...
* #### Abstract3
Venezuelan equine encephalitis virus (VEEV) is a mosquito-borne RNA virus that causes low mortality but high morbidity rates in humans. In addition to natural outbreaks, there is the potential for exposure to VEEV via aerosolized virus particles. There are currently no FDA-licensed vaccines or antiviral therapies for VEEV. Passive immunotherapy is an approved method used to protect individuals against several pathogens and toxins. Human polyclonal antibodies (PAbs) are ideal, but this is dependent upon serum from convalescent human donors, which is in limited supply. Non-human-derived PAbs can have serious immunoreactivity complications, and when “humanized,” these antibodies may exhibit reduced neutralization efficiency. To address these issues, transchromosomic (Tc) bovines have been created, which can produce potent neutralizing human antibodies in response to hyperimmunization. In these studies, we have immunized these bovines with different VEEV immunogens and evaluated the protective efficacy of purified preparations of the resultant human polyclonal antisera against low- and high-dose VEEV challenges. These studies demonstrate that prophylactic or therapeutic administration of the polyclonal antibody preparations (TcPAbs) can protect mice against lethal subcutaneous or aerosol challenge with VEEV. Furthermore, significant protection against unrelated coinfecting viral pathogens can be conferred by combining individual virus-specific TcPAb preparations. IMPORTANCE With the globalization and spread or potential aerosol release of emerging infectious diseases, it will be critical to develop platforms that are able to produce therapeutics in a short time frame. By using a transchromosomic (Tc) bovine platform, it is theoretically possible to produce antigen-specific highly neutralizing therapeutic polyclonal human antibody (TcPAb) preparations in 6 months or less. 
In this study, we demonstrate that Tc bovine-derived Venezuelan equine encephalitis virus (VEEV)-specific TcPAbs are highly effective against VEEV infection that mimics not only the natural route of infection but also infection via aerosol exposure. Additionally, we show that combinatorial TcPAb preparations can be used to treat coinfections with divergent pathogens, demonstrating that the Tc bovine platform could be beneficial in areas where multiple infectious diseases occur contemporaneously or in the case of multipathogen release.

<a id="Doc2vec"></a>
## Doc2vec with Word Mover's Distance 
<hr>
Doc2vec is very similar to word2vec, with a slight alteration that lets the model take into account which document each word comes from. Here the doc2vec model is trained on the dataset of medical abstracts, and Word Mover's Distance (WMD) is used to measure the similarity between documents.
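
To make the "which document a word comes from" idea concrete, here is a simplified sketch of how training examples for the distributed-memory flavour of doc2vec (`dm=1`) can be thought of: each target word is predicted from its neighbouring words plus a tag identifying the document. The function `pv_dm_examples` and the symmetric `window` are illustrative assumptions, not gensim's internal implementation.

```python
def pv_dm_examples(tokenized_docs, window=2):
    """Yield (doc_tag, context_words, target_word) triples.

    Hypothetical helper: the document tag travels with every context,
    which is what separates doc2vec from plain word2vec.
    """
    for tag, tokens in enumerate(tokenized_docs):
        for i, target in enumerate(tokens):
            left = tokens[max(0, i - window):i]
            right = tokens[i + 1:i + 1 + window]
            yield tag, left + right, target


docs = [
    ["virus", "binds", "receptor"],
    ["vaccine", "induces", "antibodies"],
]
examples = list(pv_dm_examples(docs, window=1))
for example in examples:
    print(example)
```

The same word appearing in two abstracts yields examples with different tags, so the model can learn one vector per document alongside the word vectors.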

### Running the Doc2vec model 

Doc2vec helper functions:

In [None]:
def avg_feature_vector(sentence, model, num_features, index2word_set):
    words = sentence.split()
    feature_vec = np.zeros((num_features, ), dtype='float32')
    n_words = 0
    for word in words:
        if word in index2word_set:
            n_words += 1
            feature_vec = np.add(feature_vec, model[word])
    if (n_words > 0):
        feature_vec = np.divide(feature_vec, n_words)
    return feature_vec

def calc_cosine_diff(doc_1,doc_2):
    # Despite the name, this returns cosine similarity (1 - cosine distance); higher means more alike.
    return 1 - spatial.distance.cosine(doc_1, doc_2)

def get_pretrained_model(is_using_aws=False):

    if is_using_aws:
        EMBEDDING_FILE = download_google_pretrained_word2vec()
    else:
        EMBEDDING_FILE = './bin/GoogleNews-vectors-negative300.bin.gz'

    w2v_model = KeyedVectors.load_word2vec_format(EMBEDDING_FILE, binary=True)
    index2word_set = set(w2v_model.wv.index2word)
    return w2v_model, index2word_set

def calc_pre_train_sim_scores(w2v_model,index2word_set,num_features,cur_doc,compare_embed):
    cur_embed = avg_feature_vector(cur_doc,w2v_model,num_features,index2word_set)
    return calc_cosine_diff(compare_embed,cur_embed)

def calc_word2vec_wmd(w2v_model,sub_task,doc):
    return w2v_model.wmdistance(sub_task, doc)

def preprocess_doc2vec(doc_list):

    preprocess_list = []
    for count, doc in enumerate(doc_list):
        tokens = gensim.parsing.preprocess_string(doc)
        preprocess_list.append(gensim.models.doc2vec.TaggedDocument(tokens, [count]))
    return preprocess_list


def train_doc2vec(train_corpus):
    # model = gensim.models.doc2vec.Doc2Vec(dm=1, vector_size=100, min_count=2, epochs=20, seed=42, workers=3)
    model = gensim.models.doc2vec.Doc2Vec(dm=1, vector_size=100, min_count=2, epochs=5, seed=42, workers=3)
    model.build_vocab(train_corpus)
    model.train(train_corpus, total_examples=model.corpus_count, epochs=model.epochs)

    return model

def aggregate_embedding(item_list, embedding):

    embedding_list = [embedding.wv[m] for m in item_list if m in embedding.wv]
    if len(embedding_list) > 0:
        return np.array(embedding_list).mean(axis=0)
    else:
        return None

def embed_columns(doc_list, embedding):

    processed_embeddings = []
    for doc in doc_list:
        processed_embeddings.append(aggregate_embedding(doc,embedding))

    return processed_embeddings

def cluster_data(avg_embeddings,n_clusters=10):
    cl = mixture.GaussianMixture(n_components=n_clusters, max_iter=500, random_state=10)
    fit_data = np.stack(avg_embeddings)
    cluster_nums = cl.fit_predict(fit_data)
    return cl,cluster_nums

def find_top_cluster_indexes(cluster_obj,sub_task_embed,cluster_nums,n_clusters=10):
    cluster_index = np.argmax(cluster_obj.predict_proba(sub_task_embed))
    return np.where(cluster_nums==cluster_index)[0]



In [None]:

preprocess_list = preprocess_doc2vec(list(clean_abstracts))

# run the model
doc2vec_model = train_doc2vec(preprocess_list)

# calculate the similarity scores
sim_scores = [calc_word2vec_wmd(doc2vec_model, sub_task, doc) for doc in clean_abstracts]

# get the top matches
ranks_doc2vec = get_ranks(sim_scores, reverse=False)
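
WMD is a distance, so the best match is the abstract with the smallest score, whereas similarity scores are ranked the other way round. The sketch below shows that difference in sort direction. `rank_indices` is a hypothetical stand-in written for this illustration and is not the notebook's `get_ranks`.

```python
def rank_indices(scores, higher_is_better):
    """Return document indices ordered from best match to worst match."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=higher_is_better)


wmd_distances = [0.82, 0.35, 1.10]   # lower = closer to the query
similarities = [0.20, 0.90, 0.50]    # higher = closer to the query

print(rank_indices(wmd_distances, higher_is_better=False))  # [1, 0, 2]
print(rank_indices(similarities, higher_is_better=True))    # [1, 2, 0]
```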


Print the Output.

In [None]:
matched_corpus_doc2vec = np.array(clean_abstracts)[ranks_doc2vec]

def set_output(num_doc):
    text_string = ''
    for i,doc in enumerate(matched_corpus_doc2vec[:num_doc]):
        text_string+= 'Document '+str(i+1)+'\n'+doc+'\n'

    print(text_string)


widgets.interact(set_output,num_doc=widgets.IntSlider(min=1, max=30, step=1, value=5))



### Our Results for Doc2vec
The following abstracts are the top 3 results after running the task: *Methods evaluating potential complication of Antibody-Dependent Enhancement (ADE) in vaccine recipients.*



* #### Abstract1
It is of great interest to understand the molecular details of the pathways that constitute species barriers to viral infection. The tripartite motif protein TRIM5α has emerged as an important mediator of species-specific retroviral replication and innate immunity. This review considers the role of TRIM5α as an antiviral protein in mammals. The methods used to identify species-specific restriction to retroviral infection, and the identification of TRIM5α itself, are outlined. TRIM5α mediates an early postentry block to sensitive retroviral infection, usually before viral DNA synthesis. Results from mutational analysis of TRIM5α and their contribution to a mechanistic model for TRIM5α antiviral activity are discussed. The antiviral role of other TRIM proteins is considered, as is the role of TRIM5α cytoplasmic bodies.
* #### Abstract2
Interleukin-12 is a lymphokine that triggers gamma interferon secretion by various cells and differentiation of T-helper lymphocytes towards the Th1 subtype. Since viruses are potent inducers of gamma interferon production and elicit immune responses most probably mediated by Th1 cells, like B-cell immunoglobulin G2a secretion, we analyzed interleukin-12 message expression after infection of mice with lactate dehydrogenase-elevating virus, mouse hepatitis virus, and mouse adenovirus. Our results indicated that the message for the p40 component of interleukin-12 was transiently increased shortly after infection. Interleukin-12 was expressed mainly by macrophages. Therefore, production of interleukin-12 might constitute the initial event that would determine the subsequent characteristics of the immune response elicited by viral infections.
* #### Abstract3
Recent studies have shown that natural infection by HIV-2 leads to the elicitation of high titers of broadly neutralizing antibodies (NAbs) against primary HIV-2 strains (T. I. de Silva, et al., J. Virol. 86:930–946, 2012; R. Kong, et al., J. Virol. 86:947–960, 2012; G. Ozkaya Sahin, et al., J. Virol. 86:961–971, 2012). Here, we describe the envelope (Env) binding and neutralization properties of 15 anti-HIV-2 human monoclonal antibodies (MAbs), 14 of which were newly generated from 9 chronically infected subjects. All 15 MAbs bound specifically to HIV-2 gp120 monomers and neutralized heterologous primary virus strains HIV-2(7312A) and HIV-2(ST). Ten of 15 MAbs neutralized a third heterologous primary virus strain, HIV-2(UC1). The median 50% inhibitory concentrations (IC(50)s) for these MAbs were surprisingly low, ranging from 0.007 to 0.028 μg/ml. Competitive Env binding studies revealed three MAb competition groups: CG-I, CG-II, and CG-III. Using peptide scanning, site-directed mutagenesis, chimeric Env constructions, and single-cycle virus neutralization assays, we mapped the epitope of CG-I antibodies to a linear region in variable loop 3 (V3), the epitope of CG-II antibodies to a conformational region centered on the carboxy terminus of V4, and the epitope(s) of CG-III antibodies to conformational regions associated with CD4- and coreceptor-binding sites. HIV-2 Env is thus highly immunogenic in vivo and elicits antibodies having diverse epitope specificities, high potency, and wide breadth. In contrast to the HIV-1 Env trimer, which is generally well shielded from antibody binding and neutralization, HIV-2 is surprisingly vulnerable to broadly reactive NAbs. The availability of 15 human MAbs targeting diverse HIV-2 Env epitopes can facilitate comparative studies of HIV/SIV Env structure, function, antigenicity, and immunogenicity.

<a id="Word2vec"></a>
## Pre-trained Word2vec with Word Mover's Distance 
<hr>
Word2vec is an NLP method that learns word embeddings with a shallow neural network (the continuous bag-of-words or skip-gram architecture). Word embeddings map words to vectors and can therefore be used to represent words and, by extension, sentences and documents. Here we use a word2vec model pre-trained on Google News. As is common, we use Word Mover's Distance (WMD) to measure the difference between two documents, each represented by the embeddings of its words. WMD finds the cheapest way to move the words of document 1 through the embedding space so that they match the words of document 2; the total cost of that movement is the distance, so a lower WMD means more similar documents.
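
As a toy illustration of WMD, consider two documents with the same number of tokens, each token carrying equal weight. In that special case the cheapest transport plan is a one-to-one matching of words, so it can be found by brute force over permutations. The 2-D vectors and the function `toy_wmd` below are made up for this sketch; gensim's `wmdistance` solves the general problem with real embeddings.

```python
import itertools
import math

# Made-up 2-D "embeddings"; real word2vec vectors have hundreds of dimensions.
TOY_VECTORS = {
    "vaccine": (1.0, 0.0),
    "antibody": (0.9, 0.1),
    "immunization": (1.0, 0.2),
    "serum": (0.8, 0.0),
    "glacier": (0.0, 1.0),
    "ocean": (0.1, 0.9),
}


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def toy_wmd(doc_a, doc_b, vectors=TOY_VECTORS):
    """WMD for two non-empty token lists of equal length with uniform weights."""
    if not doc_a or len(doc_a) != len(doc_b):
        raise ValueError("this toy version needs non-empty documents of equal length")
    # Try every one-to-one matching and keep the cheapest total travel cost.
    best_total = min(
        sum(euclidean(vectors[a], vectors[b]) for a, b in zip(doc_a, perm))
        for perm in itertools.permutations(doc_b)
    )
    # Each token carries 1/n of the document's mass.
    return best_total / len(doc_a)


related = toy_wmd(["vaccine", "antibody"], ["immunization", "serum"])
unrelated = toy_wmd(["vaccine", "antibody"], ["glacier", "ocean"])
print(related, unrelated)
```

The related pair gets a much smaller distance than the unrelated pair, which is the behaviour the ranking step relies on.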

### Running the Word2vec model 

Word2vec helper functions:

In [None]:
def get_pretrained_model(is_using_aws=False):

    if is_using_aws:
        EMBEDDING_FILE = download_google_pretrained_word2vec()
    else:
        EMBEDDING_FILE = './bin/GoogleNews-vectors-negative300.bin.gz'

    w2v_model = KeyedVectors.load_word2vec_format(EMBEDDING_FILE, binary=True)
    index2word_set = set(w2v_model.wv.index2word)
    return w2v_model, index2word_set

def calc_word2vec_wmd(w2v_model,sub_task,doc):
    return w2v_model.wmdistance(sub_task, doc)




In [None]:

w2v_model, index2word_set = get_pretrained_model()

sim_scores = [calc_word2vec_wmd( w2v_model, sub_task, doc) for doc in clean_abstracts]

# get the top n final matches
ranks_Word2vec = get_ranks(sim_scores)

Print the Output.

In [None]:
matched_corpus_Word2vec = np.array(clean_abstracts)[ranks_Word2vec]

def set_output(num_doc):
    text_string = ''
    for i,doc in enumerate(matched_corpus_Word2vec[:num_doc]):
        text_string+= 'Document '+str(i+1)+'\n'+doc+'\n'

    print(text_string)


widgets.interact(set_output,num_doc=widgets.IntSlider(min=1, max=30, step=1, value=5))



### Our Results for Word2vec
The following abstracts are the top 3 results after running the task: *Methods evaluating potential complication of Antibody-Dependent Enhancement (ADE) in vaccine recipients.*


* #### Abstract1
DNA vaccination has been developed in the last two decades in human and animal species as a promising alternative to conventional vaccination. It consists in the injection, in the muscle, for example, of plasmid DNA encoding the vaccinating polypeptide. Electroporation which forces the entrance of the plasmid DNA in cells at the injection point has been described as a powerful and promising strategy to enhance DNA vaccine efficacy. Due to the fact that the vaccine is composed of DNA, close attention on the fate of the plasmid DNA upon vaccination has to be taken into account, especially at the injection point. To perform such studies, the muscle injection point has to be precisely recovered and collected several weeks after injection. This is even more difficult for large and growing animals. A technique has been developed to localize precisely and collect efficiently the muscle injection points in growing piglets 6 weeks after DNA vaccination accompanied or not by electroporation. Electroporation did not significantly increase the level of remaining plasmids compared to nonelectroporated piglets, and, in all the cases, the levels were below the limit recommended by the FDA to research integration events of plasmid DNA into the host DNA.
* #### Abstract2
Replication-competent controlled virus vectors were derived from the virulent herpes simplex virus 1 (HSV-1) wild-type strain 17syn+ by placing one or two replication-essential genes under the stringent control of a gene switch that is coactivated by heat and an antiprogestin. Upon activation of the gene switch, the vectors replicate in infected cells with an efficacy that approaches that of the wild-type virus from which they were derived. Essentially no replication occurs in the absence of activation. When administered to mice, localized application of a transient heat treatment in the presence of systemic antiprogestin results in efficient but limited virus replication at the site of administration. The immunogenicity of these viral vectors was tested in a mouse footpad lethal challenge model. Unactivated viral vectors—which may be regarded as equivalents of inactivated vaccines—induced detectable protection against lethality caused by wild-type virus challenge. Single activation of the viral vectors at the site of administration (rear footpads) greatly enhanced protective immune responses, and a second immunization resulted in complete protection. Once activated, vectors also induced far better neutralizing antibody and HSV-1-specific cellular immune responses than unactivated vectors. To find out whether the immunogenicity of a heterologous antigen was also enhanced in the context of efficient transient vector replication, a virus vector constitutively expressing an equine influenza virus hemagglutinin was constructed. Immunization of mice with this recombinant induced detectable antibody-mediated neutralization of equine influenza virus, as well as a hemagglutinin-specific cellular immune response. Single activation of viral replication resulted in a severalfold enhancement of these immune responses. IMPORTANCE We hypothesized that vigorous replication of a pathogen may be critical for eliciting the most potent and balanced immune response against it. 
Hence, attenuation/inactivation (as in conventional vaccines) should be avoided. Instead, the necessary safety should be provided by placing replication of the pathogen under stringent control and by activating time-limited replication of the pathogen strictly in an administration region in which pathology cannot develop. Immunization will then occur in the context of highly efficient pathogen replication and uncompromised safety. We found that localized activation in mice of efficient but limited replication of a replication-competent controlled herpesvirus vector resulted in a greatly enhanced immune response to the virus or an expressed heterologous antigen. This finding supports the above-mentioned hypothesis and suggests that the vectors may be promising novel agents worth exploring for the prevention/mitigation of infectious diseases for which efficient vaccination is lacking, in particular in immunocompromised patients.
* #### Abstract3
Coronavirus contains three envelope proteins, M, E and S, and a nucleocapsid, which consists of genomic RNA and N protein, within the viral envelope. We studied the macromolecular interactions involved in coronavirus assembly in cells infected with a murine coronavirus, mouse hepatitis virus (MHV). Coimmunoprecipitation analyses demonstrated an interaction between N protein and M protein in infected cells. Pulse-labeling experiments showed that newly synthesized, unglycosylated M protein interacted with N protein in a pre-Golgi compartment, which is part of the MHV budding site. Coimmunoprecipitation analyses further revealed that M protein interacted with only genomic-length MHV mRNA, mRNA 1, while N protein interacted with all MHV mRNAs. These data indicated that M protein interacted with the nucleocapsid, consisting of N protein and mRNA 1, in infected cells. The M protein-nucleocapsid interaction occurred in the absence of S and E proteins. Intracellular M protein-N protein interaction was maintained after removal of viral RNAs by RNase treatment. However, the M protein-N protein interaction did not occur in cells coexpressing M protein and N protein alone. These data indicated that while the M protein-N protein interaction, which is independent of viral RNA, occurred in the M protein-nucleocapsid complex, some MHV function(s) was necessary for the initiation of M protein-nucleocapsid interaction. The M protein-nucleocapsid interaction, which occurred near or at the MHV budding site, most probably represented the process of specific packaging of the MHV genome into MHV particles.

<a id="BERT-cos"></a>
## BERT with Cosine Distance 
<hr>
Bidirectional Encoder Representations from Transformers (BERT) is a pre-trained language model developed by Google. Unlike traditional RNNs or LSTMs, which read text in one direction at a time, BERT conditions on the context to both the left and the right of each word and is therefore better at capturing context. Once again we use cosine distance to compare the embeddings.
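
For reference, cosine distance is one minus cosine similarity, so it depends only on the angle between two embedding vectors and not on their length. The small sketch below uses plain Python lists in place of real BERT embeddings; `cos_sim` and `cos_dist` are names chosen for this illustration.

```python
import math


def cos_sim(u, v):
    """Cosine similarity: 1 for the same direction, 0 for orthogonal vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def cos_dist(u, v):
    """Cosine distance: 0 for the same direction, larger means less alike."""
    return 1.0 - cos_sim(u, v)


query = [1.0, 0.0]
print(cos_dist(query, [2.0, 0.0]))   # same direction, different length
print(cos_dist(query, [0.0, 1.0]))   # orthogonal
print(cos_dist(query, [-1.0, 0.0]))  # opposite direction
```

Because a smaller distance means a better match, distances and similarities need opposite sort orders when they are turned into ranks.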

### Running the BERT with Cosine Distance model 


BERT helper functions:

In [None]:

def make_bert_embeddings(doc_list):
    generic_bert_model = SentenceTransformer('./bin/')
    return generic_bert_model.encode(doc_list,show_progress_bar=True)


bert_corpus = list(clean_abstracts)


In [None]:

sub_task_embed = make_bert_embeddings([sub_task])

# make the embeddings
corpus_embed = make_bert_embeddings(list(bert_corpus))

# get the sim scores
bert_sim_scores = [scipy.spatial.distance.cdist(sub_task_embed[0].reshape(1,sub_task_embed[0].shape[0]), corpus_embed[index_f].reshape(1,sub_task_embed[0].shape[0]), "cosine")[0][0] for index_f in range(len(corpus_embed))]

ranks_bert = get_ranks(bert_sim_scores,reverse=True)

Print the Output.

In [None]:
matched_corpus_bert = np.array(clean_abstracts)[ranks_bert]

def set_output(num_doc):
    text_string = ''
    for i,doc in enumerate(matched_corpus_bert[:num_doc]):
        text_string+= 'Document '+str(i+1)+'\n'+doc+'\n'

    print(text_string)


widgets.interact(set_output,num_doc=widgets.IntSlider(min=1, max=30, step=1, value=5))



### Our Results for BERT with Cosine Distance 
We had very poor results when running the task: *Methods evaluating potential complication of Antibody-Dependent Enhancement (ADE) in vaccine recipients.* We decided to remove the BERT model from our final results.


<a id="SciBert-cos"></a>
## SciBERT with Cosine distance
<hr>
SciBERT (https://github.com/allenai/scibert) is a BERT model pre-trained on a large corpus of scientific papers, most of them from the biomedical domain. The model is therefore potentially more suitable for analyzing coronavirus-related material than a general-purpose pre-trained BERT model. Cosine distance is also used with the SciBERT model.

### Running the SciBERT with cosine distance model 


SciBERT helper functions:

In [None]:


def get_scibert_embedding(doc,model,tokenizer,device):

    # Add special tokens takes care of adding [CLS], [SEP], <s>... tokens in the right way for each model.
    input_ids = torch.tensor([tokenizer.encode(doc,add_special_tokens=True, max_length=512)],device=device)
    try:
        with torch.no_grad():
            last_hidden_states = model(input_ids)  # Models outputs are now tuples
            test = last_hidden_states[0].mean(1).detach()
        return np.array(test.cpu())
    except Exception as e:
        print(e)
        print(f'Was not able to get embeddings for {doc}')
        return []
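
The `last_hidden_states[0].mean(1)` step in the helper above averages the per-token vectors over the sequence dimension, giving one fixed-size vector per abstract (the special tokens are included in the average because `add_special_tokens=True`). A minimal pure-Python sketch of that mean pooling, with made-up numbers and the hypothetical name `mean_pool`:

```python
def mean_pool(token_vectors):
    """Average a list of equal-length token vectors into one document vector."""
    n_tokens = len(token_vectors)
    dim = len(token_vectors[0])
    return [sum(vec[d] for vec in token_vectors) / n_tokens for d in range(dim)]


# Three tokens, each with a 2-dimensional hidden state (real models use e.g. 768).
hidden_states = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
print(mean_pool(hidden_states))  # [3.0, 4.0]
```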


In [None]:

bert_corpus = list(clean_abstracts)

In [None]:

model = BertModel.from_pretrained("./bin/bert/")
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")  # fall back to CPU when no GPU is present
model.to(device)
tokenizer = AutoTokenizer.from_pretrained("./bin/bert/")


abstract_embedd_list = []
for index_f in list(bert_corpus):
    embed_vals = get_scibert_embedding(index_f,model,tokenizer,device)

    if len(embed_vals)>0:
        abstract_embedd_list.append(embed_vals)


target_embed = get_scibert_embedding(sub_task,model,tokenizer,device)

# get the similarity scores
sim_scores = []
for i in range(len(abstract_embedd_list)):
    sim_scores.append(cosine_similarity(abstract_embedd_list[i],target_embed)[0][0])


# rank the abstracts by similarity score
ranks_scibert = get_ranks(sim_scores,reverse=True)
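
One thing to watch in the loop above: abstracts whose embedding fails are skipped, so positions in `abstract_embedd_list` can drift away from positions in `clean_abstracts`. A possible safeguard, sketched here with a fake embedding function, is to record the original index of every abstract that was kept. Both `embed_with_indices` and `fake_embed` are hypothetical names used only for this illustration.

```python
def embed_with_indices(docs, embed_fn):
    """Embed each document, remembering the original position of every success."""
    kept_indices, embeddings = [], []
    for idx, doc in enumerate(docs):
        vec = embed_fn(doc)
        if len(vec) > 0:
            kept_indices.append(idx)
            embeddings.append(vec)
    return kept_indices, embeddings


def fake_embed(doc):
    # Stand-in for a real embedding call; returns [] to simulate a failure.
    return [float(len(doc))] if doc else []


docs = ["abstract a", "", "abstract c"]
kept_indices, embeddings = embed_with_indices(docs, fake_embed)

# Position 1 in `embeddings` belongs to docs[2], not docs[1].
original_index = kept_indices[1]
print(kept_indices, docs[original_index])
```

Ranks computed over the kept embeddings can then be translated back through `kept_indices` before indexing into the full list of abstracts.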

Print the Output.

In [None]:

matched_corpus_scibert = np.array(clean_abstracts)[ranks_scibert]

def set_output(num_doc):
    text_string = ''
    for i,doc in enumerate(matched_corpus_scibert[:num_doc]):
        text_string+= 'Document '+str(i+1)+'\n'+doc+'\n'

    print(text_string)


widgets.interact(set_output,num_doc=widgets.IntSlider(min=1, max=30, step=1, value=5))


### Our Results for SciBERT with cosine distance
The following abstracts are the top 3 results after running the task: *Methods evaluating potential complication of Antibody-Dependent Enhancement (ADE) in vaccine recipients.*


* #### Abstract1
As the novel coronavirus that emerged in Wuhan spreads worldwide (p. 610), China is facing criticism that its initial response was slow, and questions persist about officials' openness. People in and outside of China have praised an early warning about mysterious illnesses, sounded in a message sent 30 December 2019 by Li Wenliang, an ophthalmologist at a Wuhan hospital, to his medical school classmates. On 3 January, however, local police summoned Li, chastised him for spreading socially disruptive rumors, and made him sign a letter of self-criticism. He has since become infected and was hospitalized. Last week, the country's highest court faulted Li's detention as overreach. China is waging a fierce battle against the virus; it built a new, 1000-bed hospital in Wuhan in just 10 days, and Chinese scientists have published several papers on the virus. But in a 30 January statement leaked on social media, China's Ministry of Science and Technology urged researchers to pour their efforts into stopping its spread instead. “Until the task of prevention and control is completed, the focus should not be on the publication of papers,” the statement says As the novel coronavirus that emerged in Wuhan spreads worldwide (p. [610][1]), China is facing criticism that its initial response was slow, and questions persist about officials' openness. People in and outside of China have praised an early warning about mysterious illnesses, sounded in a message sent 30 December 2019 by Li Wenliang, an ophthalmologist at a Wuhan hospital, to his medical school classmates. On 3 January, however, local police summoned Li, chastised him for spreading socially disruptive rumors, and made him sign a letter of self-criticism. He has since become infected and was hospitalized. Last week, the country's highest court faulted Li's detention as overreach. 
China is waging a fierce battle against the virus; it built a new, 1000-bed hospital in Wuhan in just 10 days, and Chinese scientists have published several papers on the virus. But in a 30 January statement leaked on social media, China's Ministry of Science and Technology urged researchers to pour their efforts into stopping its spread instead. “Until the task of prevention and control is completed, the focus should not be on the publication of papers,” the statement says. > “Employees … are questioning whether they should post potentially life-saving info or check tweets first.” > > A September 2019 email by a National Weather Service official , reported by The Washington Post, after superiors rebuked forecasters for contradicting the president's inaccurate tweets about Hurricane Dorian's path. ### Agriculture The largest plague of desert locusts ( Schistocerca gregaria ) in decades is advancing across the Horn of Africa, consuming crops and threatening famine. The problem began in 2018 when unusually heavy rains on the Arabian Peninsula allowed populations to boom over several generations. In October 2019, the locusts swarmed south into Ethiopia, Eritrea, and Somalia, and in late December, they spread to Kenya, causing the worst infestation there in 70 years. Somalia—which last week declared a national emergency and asked for increased food aid—has already lost 100,000 hectares of crops and pasture. Another generation of locusts will likely hatch this month and cause more damage. The Food and Agriculture Organization of the United Nations called for $70 million to fight the outbreak with pesticides and help farmers. ### Glaciology After dropping sensors and a torpedo-shaped robot through a 700-meter hole in the ice, scientists in Antarctica last week revealed the first direct evidence that warm ocean temperatures around the rapidly retreating Thwaites Glacier could destabilize the key ice sheet. 
Researchers are worried because Thwaites—larger than the state of Illinois—helps block the ocean from reaching and warming the even bigger, unstable West Antarctic Ice Sheet, whose melting could eventually drive meters of sea level rise. Battling 2 months of stormy conditions, the team measured ocean waters beneath Thwaites at more than 2°C above the freezing point. The robot, Icefin (above, shown operating elsewhere in Antarctica), provided the first images of the glacier's grounding zone, the mysterious boundary where the floating coastal ice sheet attaches to bedrock. The project is part of the International Thwaites Glacier Collaboration, a multiyear effort by the United States and the United Kingdom that is wrapping up its first full field season. ### Leadership Some researchers in Colombia are calling for a little-known molecular biologist appointed as the country's first ever science minister to resign. They are outraged by reports that she treated cancer patients with a fungal extract, without running a formal clinical trial. “We can only regret that the course of how to do science in our country has been left in the hands of pseudoscience,” the Colombian Association of Medical Faculties wrote in a statement. In December 2019, Mabel Gisela Torres Torres was appointed to lead the newly created Ministry of Science, Technology and Innovation. In January, she told a newspaper she did not seek formal ethical, safety, and efficacy reviews of her work with patients because she believed the fungus posed no threat to human health. Her Ph.D. adviser has defended her and notes that metabolites in the fungi Torres studied have shown potential as a cancer treatment in cell and mouse studies. ### Environmental science !
* #### Abstract2
SIMPLE SUMMARY: Feline upper respiratory infection is a common disease in animal shelters. Without monitoring, effective control and prevention is difficult. We looked at a software system (a) used in shelters across the United States to determine if it can be used to track URI frequency and risk factors in a population. Reports from the software system (a) were compared to data collected manually. This showed that data currently collected were not useful for tracking URI frequency and risk factors. However, potential exists to increase the practicality and usefulness of this shelter software system to monitor URI and other diseases. ABSTRACT: Objective—Feline upper respiratory infection (URI) is a common, multi-factorial infectious disease syndrome endemic to many animal shelters. Although a significant cause of morbidity and mortality in shelter cats, URI is seldom formally monitored in shelter cat populations. Without monitoring, effective control and prevention of this often endemic disease is difficult. We looked at an integrated case management software system (a) for animal care organizations, widely used in shelters across the United States. Shelter staff routinely enter information regarding individual animals and disease status, but do not commonly use the software system to track frequency of disease. The purpose of this study was to determine if the software system (a) can be used to track URI frequency and selected risk factors in a population, and to evaluate the quality and completeness of the data as currently collected in a shelter. Design (type of study)—Descriptive Survey. Animals (or Sample)—317 cats in an animal shelter. Procedures—Reports from the software system (a) containing data regarding daily inventory, daily intake, animal identification, location, age, vaccination status, URI diagnosis and URI duration were evaluated. 
The reports were compared to data collected manually by an observer (Ann Therese Kommedal) to assess discrepancies, completeness, timeliness, availability and accuracy. Data were collected 6 days a week over a 4 week period. Results—Comparisons between the software system (a) reports and manually collected reports showed that 93% of inventory reports were complete and of these 99% were accurate. Fifty-two percent of the vaccination reports were complete, of which 97% were accurate. The accuracy of the software system’s age reports was 76%. Two-hundred and twenty-three cats were assigned a positive or negative URI diagnosis by the observer. The predictive value of the URI status in the software system (a) was below 60% both for positive and negative URI diagnosis. Conclusions and Clinical Relevance—data currently collected and entered into the software systems in the study shelter, was not useful for tracking URI frequency and risk factors, due to issues with both data quality and capture. However, the potential exists to increase the practicality and usefulness of this shelter software system to monitor URI and other diseases. Relevant data points, i.e., health status at intake and outcome, vaccination date and status, as well as age, should be made mandatory to facilitate more useful data collection and reporting.
* #### Abstract3
SIMPLE SUMMARY: Pecking-related problems are common in intensive egg production, diminishing hen welfare and production performance, and negatively affecting sustainability. Beak trimming is a common practice to control these problems, but in Finland beak trimming is prohibited. Finnish egg producers have decades-long experience of egg production with intact-beaked hens. This experience, and their management of pecking-related problems, could benefit producers in other countries. The online questionnaire aimed to gather information about Finnish farmers’ attitudes towards beak trimming, their estimation of the seriousness of pecking problems in their laying hen flocks, common risk factors and the best practices to prevent attendant problems. The questionnaire received 35 responses. Finnish egg producers appeared strongly to support a policy of not trimming beaks. Motivation against beak trimming was explained by considering it to be unnecessary and unethical. Most respondents did not regard pecking-related problems as being very severe in their flocks. Lighting, feeding and flock management problems represented the most important risk factors regarding pecking problems. Generally, the same topics were highlighted as being the most important intervention measures for managing an on-going pecking problem. The study indicates that it is possible to incorporate a non-beak-trimming policy as a component of sustainable egg production. ABSTRACT: Pecking-related problems are common in intensive egg production, compromising hen welfare, causing farmers economic losses and negatively affecting sustainability. These problems are often controlled by beak trimming, which in Finland is prohibited. An online questionnaire aimed to collect information from farmers about pecking-related problems in Finnish laying hen flocks, important risk factors and the best experiences to prevent the problems. Additionally, the farmers’ attitudes towards beak trimming were examined. 
We received 35 responses, which represents about 13% of all Finnish laying hen farms with ≥300 laying hens. The majority of respondents stated that a maximum of 5–7% incidence of feather pecking or 1–2% incidence of cannibalism would be tolerable. The majority of respondents (74%) expressed that they would definitely not use beak-trimmed hens. Only two respondents indicated that they would probably use beak-trimmed hens were the practice permitted. Among risk factors, light intensity earned the highest mean (6.3), on a scale from 1 (not important) to 7 (extremely important). Other important problems included those that occurred during rearing, feeding, flock management and problems with drinking water equipment (mean 5.9, each). The most important intervention measures included optimal lighting and feeding, flock management, and removing the pecker and victim. Concluding, Finnish farmers had strong negative attitudes towards beak trimming. The study underlines the importance of flock management, especially lighting and feeding, in preventing pecking problems and indicates that it is possible to incorporate a non-beak-trimming policy into sustainable egg production.

<a id="SciBert-next"></a>
## SciBert with Next Sentence Prediction
<hr>
One of the two tasks that BERT is typically pretrained on is the so-called next sentence prediction task. This task involves classifying whether the second of two sentences actually follows the first. The same classifier can be used to score the similarity between two sentences or documents, the idea being that similar documents tend to follow each other. This is very straightforward with the SciBERT pre-trained model.
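
As a rough illustration of how the classifier output can be read as a similarity score: the next sentence prediction head produces two logits per text pair, and in the Hugging Face `BertForNextSentencePrediction` model index 0 corresponds to "the second text follows the first". The sketch below converts such a pair of logits into a probability with a softmax. The function name and the example logit values are made up for illustration, and plain Python is used so the snippet runs without a model.

```python
import math

def nsp_probability(logits):
    """Softmax over the two NSP logits; returns P(second text follows the first)."""
    is_next, not_next = float(logits[0]), float(logits[1])
    # Subtract the larger logit before exponentiating for numerical stability
    top = max(is_next, not_next)
    exp_is_next = math.exp(is_next - top)
    exp_not_next = math.exp(not_next - top)
    return exp_is_next / (exp_is_next + exp_not_next)

# Made-up logits for a (document, query) pair, not real model output
example_logits = [2.0, -1.0]
probability = nsp_probability(example_logits)
print(f"P(is next) = {probability:.3f}")
```

Ranking by this probability is not always identical to ranking by the raw index-0 logit, because the softmax also depends on the second logit.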

### Running the SciBert with next sentence prediction model 

SciBertNext helper functions:

In [None]:
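def next_sent_pred_scibert(doc, query, model, tokenizer, device):
    """Return the 'query follows doc' logit for the pair, or None if scoring fails."""
    try:
        # Join the document and the query with BERT's separator token
        txt = doc.strip() + "[SEP]" + query.strip()
        input_ids = torch.tensor(
            tokenizer.encode(txt, add_special_tokens=True, max_length=512),
            device=device,
        ).unsqueeze(0)

        # Inference only, so gradient tracking is not needed
        with torch.no_grad():
            output = model(input_ids)
        # output[0] holds the two logits; index 0 is the "is next sentence" class
        return output[0][0][0].item()
    except Exception:
        return None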



In [None]:
bert_corpus = list(clean_abstracts)

In [None]:
model = BertForNextSentencePrediction.from_pretrained("./bin/bert")
# Use the GPU when one is available, otherwise fall back to the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
tokenizer = BertTokenizer.from_pretrained("./bin/bert")
bert_sim_scores = [next_sent_pred_scibert(doc, sub_task, model, tokenizer, device) for doc in bert_corpus]
ranks_SciBertNext = get_ranks(bert_sim_scores, reverse=True)
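
Note that `next_sent_pred_scibert` returns `None` whenever scoring a document fails, so `bert_sim_scores` may contain missing values. How `get_ranks` treats them is not shown in this section. One possible safeguard, sketched below under that assumption, is to replace missing scores with negative infinity so that failed documents fall to the bottom of the ranking. The helper name `fill_failed_scores` and the toy scores are hypothetical.

```python
import math

def fill_failed_scores(scores, fill_value=-math.inf):
    """Replace None entries (failed documents) with a value that ranks last."""
    return [fill_value if score is None else float(score) for score in scores]

# Toy scores: the second document failed to score
raw_scores = [0.5, None, 2.0]
clean_scores = fill_failed_scores(raw_scores)

# Document indices ordered from highest to lowest score
order = sorted(range(len(clean_scores)), key=lambda i: clean_scores[i], reverse=True)
print(order)
```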

Print the Output.

In [None]:
matched_corpus_SciBertNext = np.array(clean_abstracts)[ranks_SciBertNext]

def set_output(num_doc):
    # Print the top `num_doc` abstracts under the SciBert next sentence prediction ranking
    text_string = ''
    for i, doc in enumerate(matched_corpus_SciBertNext[:num_doc]):
        text_string += 'Document ' + str(i+1) + '\n' + doc + '\n'

    print(text_string)

widgets.interact(set_output, num_doc=widgets.IntSlider(min=1, max=30, step=1, value=5))



### Our Results for SciBert with next sentence prediction
We had very poor results when running the task: *Methods evaluating potential complication of Antibody-Dependent Enhancement (ADE) in vaccine recipients.* The top-ranked abstracts shown above are unrelated to the query, so we removed the SciBert with next sentence prediction model from our final results.

<a id="Results"></a>
## Results
<hr>
Here we combine the rankings of all the remaining methods, taking the median rank of each abstract, to produce a final result. Of the BERT models, we found that SciBert with cosine distance produced the best results.

Calculate the median ranks based on all 6 methods.

In [None]:
median_ranks = np.median(np.column_stack([ranks_scibert, ranks_Word2vec, ranks_doc2vec, ranks_Wordnet, ranks_cos_dist, ranks_TFIDF]), axis=1)
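
The median is only meaningful if each `ranks_*` array holds, for every document, its rank position under that method. If `get_ranks` (defined elsewhere in the notebook) instead returns an ordering of document indices, which the indexing `np.array(clean_abstracts)[ranks_SciBertNext]` suggests but does not prove, each ordering would first need to be inverted into per-document positions. The sketch below shows that conversion and the median combination on toy data. The function names and the toy orderings are hypothetical, and plain Python stands in for `np.median` and `np.argsort`.

```python
from statistics import median

def order_to_positions(order):
    """Turn an ordering (document indices, best first) into per-document rank positions."""
    positions = [0] * len(order)
    for position, doc_index in enumerate(order):
        positions[doc_index] = position
    return positions

def median_ensemble(orderings):
    """Combine several orderings by the median rank position of each document."""
    per_method = [order_to_positions(order) for order in orderings]
    n_docs = len(per_method[0])
    median_positions = [median(method[d] for method in per_method) for d in range(n_docs)]
    # Stable sort: documents with the lowest median position come first
    return sorted(range(n_docs), key=lambda d: median_positions[d])

# Toy example: 4 documents ranked by 3 methods (best document first)
toy_orderings = [
    [2, 0, 1, 3],
    [2, 1, 0, 3],
    [0, 2, 3, 1],
]
ensemble_order = median_ensemble(toy_orderings)
print(ensemble_order)
```

In this toy case document 2 is placed first by two of the three methods, so it ends up at the top of the combined ordering.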

Print the Output.

In [None]:
ensem_ranks = np.argsort(median_ranks)
# Abstracts ordered by the ensemble (median) ranking
matched_corpus_ensemble = np.array(clean_abstracts)[ensem_ranks]

def set_output(num_doc):
    # Print the top `num_doc` abstracts under the ensemble ranking
    text_string = ''
    for i, doc in enumerate(matched_corpus_ensemble[:num_doc]):
        text_string += 'Document ' + str(i+1) + '\n' + doc + '\n'

    print(text_string)

widgets.interact(set_output, num_doc=widgets.IntSlider(min=1, max=30, step=1, value=5))



### Our Results
The following abstracts are the top 3 results after running the task: *Methods evaluating potential complication of Antibody-Dependent Enhancement (ADE) in vaccine recipients.*

* #### Abstract1
Over the last century, the successful attenuation of multiple bacterial and viral pathogens has led to an effective, robust and safe form of **vaccination**. Recently, these vaccines have been evaluated as delivery vectors for heterologous antigens, as a means of simultaneous vaccination against two pathogens. The general consensus from published studies is that these vaccine vectors have the potential to be both safe and efficacious. However, some of the commonly employed vectors, for example Salmonella and adenovirus, often have **pre-existing immune responses in the host and this has the potential to modify the subsequent immune response to a vectored antigen**. This review examines the literature on this topic, and concludes that for bacterial vectors there can in fact, in some cases, be an enhancement in immunogenicity, typically humoral, **while for viral vectors pre-existing immunity is a hindrance for subsequent induction of cell-mediated responses**.
* #### Abstract2
Understanding **immune responses to viral infections** is crucial to progress in the quest for effective infection prevention and control. The **host immunity involves various mechanisms to combat viral infections**. Under certain circumstances, a **viral infection or vaccination** may result in a subverted immune system, which may lead to an exacerbated illness. Clinical evidence of enhanced illness by preexisting antibodies from vaccination, infection or maternal passive immunity is available for several viruses and is presumptively proposed for other viruses. Multiple mechanisms have been proposed to explain this phenomenon. It has been confirmed that certain infection- and/or **vaccine-induced immunity could exacerbate viral infectivity** in Fc receptor- or complement bearing cells- mediated mechanisms. Considering that **antibody dependent enhancement (ADE) is a major obstacle in vaccine development**, there are continues efforts to understand the underlying mechanisms through identification of the epitopes and antibodies responsible for disease enhancement or protection. **This review discusses the recent findings on virally induced ADE**, and highlights the potential mechanisms leading to this condition.
* #### Abstract3
**Viral subunit vaccines often contain immunodominant** non-neutralizing epitopes that **divert host immune responses**. These epitopes should be eliminated in vaccine design, but there is no reliable method for evaluating an epitope's capacity to **elicit neutralizing immune responses**. Here we introduce a new concept ‘neutralizing immunogenicity index' (NII) to evaluate an epitope's neutralizing immunogenicity. To determine the NII, we mask the epitope with a glycan probe and then assess the epitope's contribution to the vaccine's overall neutralizing immunogenicity. As proof-of-concept, we measure the NII for different epitopes on an immunogen comprised of the receptor-binding domain from MERS coronavirus (MERS-CoV). Further, we design a variant form of this vaccine by masking an epitope that has a negative NII score. This engineered vaccine demonstrates significantly enhanced efficacy in protecting transgenic mice from lethal MERS-CoV challenge. Our study may guide the rational design of highly effective subunit vaccines to combat MERS-CoV and other life-threatening viruses.
