# TF-IDF Based Information Retrieval System with Summarization
**Approach:**
   > 1. Preprocess the dataset and save the result to CSV.
   > 2. Compute the cosine similarity between the tf-idf vectors of the query and of each document in the dataset, and return the most relevant documents.
   > 3. Display the query, search results, and summaries as HTML elements.
   > 4. Test the system on each subtask of the 10 tasks.
   > 5. Visualize system performance for any given query by plotting the similarity score between the query and each retrieved document.
    
**Pros & Cons:**

- **Pros:**
> 1. IR through tf-idf is deterministic, requires no training time, and is very fast.
> 2. Providing summaries helps the user see at a glance whether a specific document is of interest.
> 3. The system is not task-specific: it works on all 10 tasks/subtasks and for any query.
 
- **Cons:**
> 1. The system returns the top n documents for a query, where n is a fixed, user-specified value that does not depend on the query. Some queries have more than n highly relevant documents, while others have fewer, in which case the results include documents that are not very relevant.
> 2. Each document is summarized without regard to the query, so a summary may omit the sentences that matter most for that query.
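
To make the first limitation concrete, the sketch below contrasts a fixed top-n cut with a score threshold on made-up similarity scores. The function names, the scores, and the 0.3 cut-off are illustrative assumptions and are not part of the system built in this notebook.

```python
def top_n(scores, n):
    """Return indices of the n highest scores, best first (fixed-size result list)."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:n]


def above_threshold(scores, min_score):
    """Return indices whose score is at least min_score, best first (variable-size list)."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [i for i in ranked if scores[i] >= min_score]


# Hypothetical cosine similarity scores for six documents and one query.
scores = [0.62, 0.05, 0.58, 0.03, 0.02, 0.55]

fixed = top_n(scores, 5)                 # always 5 documents, including weak matches
filtered = above_threshold(scores, 0.3)  # only the documents that clear the cut-off
print(fixed)
print(filtered)
```

With these scores the fixed cut returns five documents even though only three score well, which is the behaviour described in the first con.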


# Table of Contents:
* [Util functions & Preprocessing](#1)
* [TF-IDF & Cosine Similarity](#2)
* [Summarization](#3)
* [Testing All Tasks](#4)
* [Evaluating System Performance](#5)


<a id="1"></a>
# Util functions & Preprocessing
  In this section we perform basic preprocessing of the dataset: removing duplicates, removing any articles that don't contain our COVID-19 terms, and saving the result to CSV.
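
The filtering and de-duplication steps can be sketched on a toy list of articles. This is a plain-Python illustration under assumed data; the notebook itself works on a pandas DataFrame built from the CORD-19 files, and the names `terms`, `articles`, and `mentions_any` are hypothetical.

```python
# Hypothetical keyword list and articles; the notebook uses COVID_19_TERMS and the CORD-19 corpus.
terms = ['covid-19', 'sars-cov-2']

articles = [
    {'title': 'A', 'text': 'covid-19 spread in hospitals'},
    {'title': 'B', 'text': 'influenza vaccination rates'},
    {'title': 'A', 'text': 'covid-19 spread in hospitals'},   # exact duplicate of the first
    {'title': 'C', 'text': 'structure of the sars-cov-2 spike protein'},
]

def mentions_any(text, keywords):
    """True if any keyword occurs in the lowercased text."""
    lowered = text.lower()
    return any(keyword in lowered for keyword in keywords)

seen = set()
relevant = []
for article in articles:
    key = (article['title'], article['text'])
    if key in seen:
        continue            # drop exact duplicates
    seen.add(key)
    if mentions_any(article['text'], terms):
        relevant.append(article)

print([a['title'] for a in relevant])
```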

In [None]:
import numpy as np 
import pandas as pd
import re
import os
import json
import pickle
from nltk.corpus import stopwords
from collections import Counter
from nltk import sent_tokenize, word_tokenize



In [None]:
# path to corpus of data
CORONAVIRUS_LIBRARY_PATH = '../input/CORD-19-research-challenge'
# path from which created csv is read
RELEVANT_PAPERS_PATH = '../input/covid19csv/relevant_papers.csv'

In [None]:
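# checks if any of the keywords appear in text (each keyword must be followed by a space)
def is_match(text, keywords):
    text = text.lower()
    # lowercase the keyword too, otherwise terms such as 'NCIP' can never match
    return any(f'{keyterm.lower()} ' in text for keyterm in keywords)

# loading files from json
def load_json(filename):
    '''Load json file.'''
    with open(filename, 'r') as f:
        return json.load(f)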

In [None]:
ARTIFACTS = r'https?:\/\/.[^\s\\]*|doi: medRxiv|preprint|\[\d+\]|\[\d+\, \d+\]'

# retrieved from get_stopwords_from_corpus(sample) call in later cell
CORPUS_STOPWORDS = ['as', 'of', 'during', 'the', 'disease', 'in', 'have', 'been', 'new', 'with',
 'number', 'cases', 'is', 'or', 'not', 'to', 'it', 'on', 'its', 'outbreak', 'by',
 'data', 'are', 'this', 'virus', 'and', 'epidemic', 'time', 'for', 'case', 'a', 'between', 'but',
 'these', 'some', 'no', 'different', 'viral', 'transmission', 'clinical', 'from', 'we', 'confirmed',
 'patients', 'were', 'study', 'china', 'infection', 'such', 'that', 'their', 'each', 'other', 'when',
 'also', 'which', 'should', 'will', 'be', 'can', 'at', 'days', 'may', 'health', 'has', 'reported', 'an',
 'infected', 'risk', 'was', '1', 'first', 'most', '.', 'used', 'two', 'using', ',', 'our', 'found', 'who', 'however,',
 'more', 'they', 'all', '2', 'both', 'than', 'based', '', 'there', 'sars-cov-2', 'after', 'severe', 'respiratory', '3',
 'one', 'including', 'et', 'use', 'control', 'had', 'if', 'only', 'into', 'al.,', 'made', 'under', 'could', 'cells',
 'high', 'human', 'international', 'treatment', 'rate', 'without', '(', '=', 'patient', 'cell', 'results', 'available',
 'model', 'figure', 'protein', 'display', 'copyright', 'granted', 'y', 'la', 'de', 'el', 'en', 'que', '(which', '•',
 'holder', 'peer-reviewed)', 'license', 'author/funder,', 'medrxiv', 'perpetuity.']


# credit: https://www.kaggle.com/rismakov/research-search-tool-and-article-summary/
COVID_19_TERMS = ['covid-19', 'covid 19','covid-2019','2019 novel coronavirus', 'corona virus disease 2019','coronavirus disease 19',
    'coronavirus 2019','2019-ncov','ncov-2019',  'wuhan virus','wuhan coronavirus','wuhan pneumonia','NCIP','sars-cov-2','sars-cov2']
VIRUS_TERMS = ['epidemic', 'pandemic', 'viral','virus','viruses','coronavirus', 'respiratory','infectious'] + COVID_19_TERMS
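
As a quick illustration of what the ARTIFACTS pattern strips, the sketch below applies it to a made-up sentence. The pattern is copied from the cell above; the sample sentence and the `cleaned` variable are assumptions for demonstration only.

```python
import re

# Same pattern as in the notebook: URLs, "doi: medRxiv", "preprint", and [n] / [n, m] citations.
ARTIFACTS = r'https?:\/\/.[^\s\\]*|doi: medRxiv|preprint|\[\d+\]|\[\d+\, \d+\]'

raw = 'See https://example.org/paper for details [12], as in [3, 4].'
cleaned = re.sub(ARTIFACTS, '', raw, flags=re.MULTILINE)
print(repr(cleaned))
```

Note that the matches are removed without collapsing the surrounding whitespace, so the cleaned text can contain double spaces.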

In [None]:
# this function saves all the data from the dataset into a dataframe in the following format ['title', 'authors', 'text']
def get_all_data_without_duplicates():
    '''Get all research paper data in dataset.

    Returns
    -------
    pandas.DataFrame
        Includes columns ['title', 'authors', 'text'].
    '''
    articles = []
    for dirname, _, filenames in os.walk(CORONAVIRUS_LIBRARY_PATH):
        for filename in filenames:
            full_path = os.path.join(dirname, filename)
            if not full_path.endswith('.json'):
                continue
            article = load_json(full_path)
            abstract = ''
            body = ''
            if article.get('abstract') is not None:
                abstract = re.sub(ARTIFACTS, '', ' '.join(x['text'] for x in article['abstract']), flags=re.MULTILINE).lower()
            if article.get('body_text') is not None:
                body = re.sub(ARTIFACTS, '', ' '.join(x['text'] for x in article['body_text']), flags=re.MULTILINE).lower()

            # collect every author as "first middle last" so multi-author papers keep all their authors
            list_authors = []
            for author in article['metadata']['authors']:
                middle = author['middle'][0] if author['middle'] else ''
                list_authors.append(' '.join(part for part in (author['first'], middle, author['last']) if part))

            articles.append(
                [
                    article['metadata']['title'],
                    ', '.join(list_authors),
                    # separate abstract and body with a space so their boundary words do not fuse
                    (abstract + ' ' + body).strip(),
                ]
            )
    return pd.DataFrame(articles, columns=['title', 'authors', 'text']).drop_duplicates()


#this function filters the dataframe where its text contains COVID_19_TERMS
def filter_covid19_articles(df):
    return df[
        df['text'].apply(lambda x: is_match(x, set(COVID_19_TERMS)))
    ]

# this function gets stop words from the corpus: words that occur more than 5000 times
def get_stopwords_from_corpus(corpus_data_frame):
    # this contains all the words in the corpus
    sample = corpus_data_frame.text.str.cat(sep=' ').split(' ')
    word_counts = Counter(sample)
    threshold = 5000
    return [word for word, count in word_counts.items() if count > threshold]

# this function combines all stop word lists into one set
def get_all_stopwords():
    return set(VIRUS_TERMS + CORPUS_STOPWORDS + stopwords.words('english'))

# this function tokenizes the text into sentences and keeps only the words that are not in the stopwords set
def tokenize_sentences(data_frame_text):
    # build the stop word set once instead of once per word
    all_stopwords = get_all_stopwords()
    return [
        [
            word.lower() for word in word_tokenize(sent)
            # compare in lowercase, since the stop word lists are lowercase
            if word.lower() not in all_stopwords
        ] for sent in sent_tokenize(data_frame_text)
    ]
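
The frequency-based stop word selection in get_stopwords_from_corpus can be tried on a toy corpus. The sketch below is a simplified stand-in with a hypothetical helper name and a threshold of 1 instead of 5000. It also uses `str.split()` with no argument, which discards empty strings, whereas the notebook splits on single spaces and can therefore count the empty string as a word.

```python
from collections import Counter

def frequent_words(texts, threshold):
    """Return words that occur more than `threshold` times across all texts."""
    counts = Counter(word for text in texts for word in text.split())
    return [word for word, count in counts.items() if count > threshold]

# Toy corpus with a small threshold; the notebook uses 5000 on the full corpus.
texts = ['the virus spreads', 'the virus mutates', 'the host responds']
print(frequent_words(texts, threshold=1))
```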

Run the following cell **once only** to create the corpus CSV. The file should then be uploaded to the input folder, in a folder named "covid19csv", so that it can be read later.

In [None]:
# RUN ONCE TO GET THE RELEVANT DATA AND STORE IT TO CSV

# sample = get_all_data_without_duplicates() # complete corpus
# sample = filter_covid19_articles(sample) # corpus with data relevant to covid19
# sample.to_csv('relevant_papers.csv', index=False) # file name expected by RELEVANT_PAPERS_PATH

# Run once to get stop words

# CORPUS_STOPWORDS = get_stopwords_from_corpus(sample)


<a id="2"></a>

# TF-IDF
In this section we compute tf-idf vectors for the query and for each document in the dataset, then use cosine similarity to retrieve the n documents nearest to the query.
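
For intuition, the computation can be worked through by hand on three tiny documents. The sketch below is a standard-library approximation, not the code used in this notebook: it applies the smoothed idf formula ln((1 + N) / (1 + df)) + 1 and L2 normalization that scikit-learn documents as its defaults, but it tokenizes naively on whitespace, so its numbers will not match TfidfVectorizer exactly on real text. The helper names and the sample documents are assumptions.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Build L2-normalized TF-IDF vectors (as dicts) plus the idf table used to build them."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed idf: ln((1 + N) / (1 + df)) + 1
    idf = {term: math.log((1 + n_docs) / (1 + df)) + 1 for term, df in doc_freq.items()}
    return [weigh(tokens, idf) for tokens in tokenized], idf

def weigh(tokens, idf):
    """Weight raw term counts by idf, ignore unseen terms, and scale to unit length."""
    counts = Counter(t for t in tokens if t in idf)
    vec = {t: c * idf[t] for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else {}

def cosine(a, b):
    """Dot product of two unit-length sparse vectors, which equals their cosine similarity."""
    return sum(w * b.get(t, 0.0) for t, w in a.items())

docs = [
    'public health surveillance systems',
    'incubation period of the disease',
    'surveillance of wildlife farms',
]
doc_vecs, idf = tfidf_vectors(docs)
query_vec = weigh('health surveillance'.split(), idf)

scores = [cosine(query_vec, d) for d in doc_vecs]
ranking = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
print(ranking)
```

Because every vector has unit length, a plain dot product is enough to obtain the cosine similarity.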

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel
# computes tf idf similarity between query and documents in dataset with n being max number of retrieved documents
def compute_tf_idf_similarity(query, docs, n):
    tfidf_vectorizer = TfidfVectorizer()
    tfidf = tfidf_vectorizer.fit_transform(docs)
    query_vec = tfidf_vectorizer.transform([query])

    # TfidfVectorizer L2-normalizes each row by default, so the dot product equals cosine similarity
    cosine_similarities = linear_kernel(query_vec, tfidf).flatten()
    # indices of the n highest scores, best first
    related_docs_indices = (-cosine_similarities).argsort()[:n]

    # docs is built from relevant_corpus['text'], so positions in docs line up with rows of relevant_corpus
    res_df = relevant_corpus.iloc[related_docs_indices][['title', 'authors', 'text']]
    return res_df.reset_index(drop=True)

In [None]:
#sample call to test retrieving documents for a query
pd.set_option('max_colwidth', 500)
relevant_corpus = pd.read_csv(RELEVANT_PAPERS_PATH).loc[:]
df = pd.DataFrame(relevant_corpus)
docs = df['text'].tolist()
query = 'Integration of federal/state/local public health surveillance systems. '
results = compute_tf_idf_similarity(query, docs,5)
results
    

<a id="3"></a>
# Summarizing results
In this section we use Gensim's summarization tool (the gensim.summarization module, which was removed in Gensim 4.0) to obtain a summary for each retrieved document.

In [None]:
from gensim.summarization.summarizer import summarize
# returns a list of strings, each element corresponding to a result document's summary
def summarize_results(results):
    string_summaries = []
    for text in results['text']:
        # split=True returns the selected sentences as a list; word_count caps the summary length
        ranked_sentences = summarize(text, split=True, word_count=150)
        # join with a space so that consecutive sentences do not run together
        string_summaries.append(' '.join(ranked_sentences))
    return string_summaries
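
The summaries above are built without looking at the query, which is the second limitation listed in the Cons. As a rough illustration of what a query-aware alternative could look like, the sketch below selects sentences by word overlap with the query. This is a simple heuristic swapped in for demonstration, not Gensim's algorithm and not part of this system; the function name, the regex-based sentence splitting, and the sample text are all assumptions.

```python
import re

def query_aware_summary(text, query, max_sentences=2):
    """Pick the sentences sharing the most words with the query, kept in document order."""
    sentences = [s.strip() for s in re.split(r'(?<=[.!?])\s+', text) if s.strip()]
    query_words = set(re.findall(r'\w+', query.lower()))

    def overlap(sentence):
        return len(query_words & set(re.findall(r'\w+', sentence.lower())))

    # Rank sentence positions by overlap; ties keep the earlier sentence first.
    ranked = sorted(range(len(sentences)), key=lambda i: (-overlap(sentences[i]), i))
    chosen = sorted(i for i in ranked[:max_sentences] if overlap(sentences[i]) > 0)
    return ' '.join(sentences[i] for i in chosen)

text = ('Coronaviruses are enveloped RNA viruses. '
        'The incubation period was estimated at five days. '
        'Samples were stored at low temperature. '
        'Patients remained contagious after the incubation period ended.')
print(query_aware_summary(text, 'incubation period'))
```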

<a id="3"></a>
# Display Function
HTML-formatted display to better visualize the system's operation

In [None]:
from IPython.display import display, HTML
# html display of query, results and their respective summaries
def display_output(query, results, string_summaries):
    display(HTML(f' <div style=" padding: 25px" > <h3 > Query: </h3> <br> <h4> {query} </h4> </div>  '))
    display(HTML(f' <div style=" padding: 25px" > <h3 > Results: </h3> </div>'))
    for i in range(0, len(results)):
        title = results['title'][i]
        authors = results['authors'][i]
        # papers without a title are read back from the csv as NaN
        if(pd.isna(title)):
            title = "No Title"
        summary = string_summaries[i]
        display(HTML(f' <div style=" padding: 25px" >  <div> Title: <h5 style=" display: inline"> {title} </h5> </div><div> Author(s): <h5 style=" display: inline"> {authors} </h5> </div> <p> <b> Summary: </b> {summary}</p>  </div>  '))
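
Titles and summaries are interpolated into the HTML as-is, so characters such as `<` or `&` in the text can distort the rendering. One possible refinement, sketched below with a hypothetical helper named `result_card`, is to escape the text with the standard library's `html.escape` before building the snippet. The markup mirrors the display function above but is only an illustration.

```python
import html

def result_card(title, authors, summary):
    """Return an HTML snippet for one result with user-facing text escaped."""
    title = html.escape(title) if title else 'No Title'
    return (
        '<div style="padding: 25px">'
        f'<div>Title: <h5 style="display: inline">{title}</h5></div>'
        f'<div>Author(s): <h5 style="display: inline">{html.escape(authors)}</h5></div>'
        f'<p><b>Summary:</b> {html.escape(summary)}</p>'
        '</div>'
    )

card = result_card('R0 < 1 in <region>', 'A. Author', 'Rates fell & stayed low.')
print(card)
```

The returned string could then be passed to `display(HTML(...))` in the same way as the snippets built in display_output.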

<a id="4"></a>
# Testing Queries
In each cell you'll find a list of all the queries representing the subtasks in each task. To see results for a certain query, simply uncomment the line containing this query and make sure the other queries are commented out.
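
As an alternative to commenting and uncommenting lines, the queries of a task could be kept in a list and run in a loop. The sketch below shows the idea with a stand-in search function so that it runs on its own; `run_queries` and `fake_search` are hypothetical names, and in this notebook the search role would be played by a call such as compute_tf_idf_similarity(query, docs, n).

```python
def run_queries(queries, search_fn, n=5):
    """Run every query through search_fn and collect the results keyed by query."""
    return {query: search_fn(query, n) for query in queries}

# Stand-in search function (an assumption) that fabricates placeholder result labels.
def fake_search(query, n):
    return [f'{query} result {rank}' for rank in range(1, n + 1)]

task_queries = [
    'Seasonality of transmission.',
    'Immune response and immunity',
]
all_results = run_queries(task_queries, fake_search, n=2)
for query, hits in all_results.items():
    print(query, '->', hits)
```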

* **Task 1**

In [None]:
relevant_corpus = pd.read_csv(RELEVANT_PAPERS_PATH).loc[:]
df = pd.DataFrame(relevant_corpus)
docs = df['text'].tolist()

# queries, uncomment desired query and comment rest to see query results

query = 'Integration of federal/state/local public health surveillance systems. '
# query = 'Range of incubation periods for the disease in humans (and how this varies across age and health status) and how long individuals are contagious, even after recovery.'
# query = 'Prevalence of asymptomatic shedding and transmission (e.g., particularly children).'
# query = 'Seasonality of transmission.'
# query = 'Physical science of the coronavirus (e.g., charge distribution, adhesion to hydrophilic/phobic surfaces, environmental survival to inform decontamination efforts for affected areas and provide information about viral shedding).'
# query = 'Persistence and stability on a multitude of substrates and sources (e.g., nasal discharge, sputum, urine, fecal matter, blood).'
# query = 'Persistence of virus on surfaces of different materials (e.g., copper, stainless steel, plastic).'
# query = 'Natural history of the virus and shedding of it from an infected person'
# query = 'Implementation of diagnostics and products to improve clinical processes'
# query = 'Disease models, including animal models for infection, disease and transmission'
# query = 'Tools and studies to monitor phenotypic change and potential adaptation of the virus'
# query = 'Immune response and immunity'
# query = 'Effectiveness of movement control strategies to prevent secondary transmission in health care and community settings'
# query = 'Effectiveness of personal protective equipment (PPE) and its usefulness to reduce risk of transmission in health care and community settings'
# query = 'Role of the environment in transmission'
results = compute_tf_idf_similarity(query, docs,5)
summaries = summarize_results(results)
display_output(query, results,summaries)

* **Task 2**

In [None]:
relevant_corpus = pd.read_csv(RELEVANT_PAPERS_PATH).loc[:]
df = pd.DataFrame(relevant_corpus)
docs = df['text'].tolist()

# queries, uncomment desired query and comment rest to see query results

query = 'Data on potential risk factors'
# query = 'Smoking, pre-existing pulmonary disease'
# query = 'Co-infections (determine whether co-existing respiratory/viral infections make the virus more transmissible or virulent) and other co-morbidities'
# query = 'Neonates and pregnant women'
# query = 'Socio-economic and behavioral factors to understand the economic impact of the virus and whether there were differences.'
# query = 'Transmission dynamics of the virus, including the basic reproductive number, incubation period, serial interval, modes of transmission and environmental factors'
# query = 'Severity of disease, including risk of fatality among symptomatic hospitalized patients, and high-risk patient groups'
# query = 'Susceptibility of populations'
# query = 'Public health mitigation measures that could be effective for control'
results = compute_tf_idf_similarity(query, docs,5)
summaries = summarize_results(results)
display_output(query, results,summaries)

* **Task 3**

In [None]:
relevant_corpus = pd.read_csv(RELEVANT_PAPERS_PATH).loc[:]
df = pd.DataFrame(relevant_corpus)
docs = df['text'].tolist()

# queries, uncomment desired query and comment rest to see query results

query = 'Real-time tracking of whole genomes and a mechanism for coordinating the rapid dissemination of that information to inform the development of diagnostics and therapeutics and to track variations of the virus over time.'
# query = 'Access to geographic and temporal diverse sample sets to understand geographic distribution and genomic differences, and determine whether there is more than one strain in circulation. Multi-lateral agreements such as the Nagoya Protocol could be leveraged.'
# query = 'Evidence that livestock could be infected (e.g., field surveillance, genetic sequencing, receptor binding) and serve as a reservoir after the epidemic appears to be over.'
# query = 'Evidence of whether farmers are infected, and whether farmers could have played a role in the origin.'
# query = 'Surveillance of mixed wildlife- livestock farms for SARS-CoV-2 and other coronaviruses in Southeast Asia.'
# query = 'Experimental infections to test host range for this pathogen.'
# query = 'Animal host(s) and any evidence of continued spill-over to humans'
# query = 'Socioeconomic and behavioral risk factors for this spill-over'
# query = 'Sustainable risk reduction strategies'
results = compute_tf_idf_similarity(query, docs,5)
summaries = summarize_results(results)
display_output(query, results,summaries)


* **Task 4**

In [None]:
relevant_corpus = pd.read_csv(RELEVANT_PAPERS_PATH).loc[:]
df = pd.DataFrame(relevant_corpus)
docs = df['text'].tolist()

# queries, uncomment desired query and comment rest to see query results

query = 'Effectiveness of drugs being developed and tried to treat COVID-19 patients.'
# query = 'Clinical and bench trials to investigate less common viral inhibitors against COVID-19 such as naproxen, clarithromycin, and minocycline that may exert effects on viral replication.'
# query = 'Methods evaluating potential complication of Antibody-Dependent Enhancement (ADE) in vaccine recipients.'
# query = 'Exploration of use of best animal models and their predictive value for a human vaccine.'
# query = 'Capabilities to discover a therapeutic (not vaccine) for the disease, and clinical effectiveness studies to discover therapeutics, to include antiviral agents.'
# query = 'Alternative models to aid decision makers in determining how to prioritize and distribute scarce, newly proven therapeutics as production ramps up. This could include identifying approaches for expanding production capacity to ensure equitable and timely distribution to populations in need.'
# query = 'Efforts targeted at a universal coronavirus vaccine.'
# query = 'Efforts to develop animal models and standardize challenge studies'
# query = 'Efforts to develop prophylaxis clinical studies and prioritize in healthcare workers'
# query = 'Approaches to evaluate risk for enhanced disease after vaccination'
# query = 'Assays to evaluate vaccine immune response and process development for vaccines, alongside suitable animal models [in conjunction with therapeutics]'
results = compute_tf_idf_similarity(query, docs,5)
summaries = summarize_results(results)
display_output(query, results,summaries)


* **Task 5**

In [None]:
relevant_corpus = pd.read_csv(RELEVANT_PAPERS_PATH).loc[:]
df = pd.DataFrame(relevant_corpus)
docs = df['text'].tolist()

# queries, uncomment desired query and comment rest to see query results

query = 'Resources to support skilled nursing facilities and long term care facilities.'
# query = 'Mobilization of surge medical staff to address shortages in overwhelmed communities'
# query = 'Age-adjusted mortality data for Acute Respiratory Distress Syndrome (ARDS) with/without other organ failure – particularly for viral etiologies'
# query = 'Extracorporeal membrane oxygenation (ECMO) outcomes data of COVID-19 patients'
# query = 'Outcomes data for COVID-19 after mechanical ventilation adjusted for age.'
# query = 'Knowledge of the frequency, manifestations, and course of extrapulmonary manifestations of COVID-19, including, but not limited to, possible cardiomyopathy and cardiac arrest.'
# query = 'Application of regulatory standards (e.g., EUA, CLIA) and ability to adapt care to crisis standards of care level.'
# query = 'Approaches for encouraging and facilitating the production of elastomeric respirators, which can save thousands of N95 masks.'
# query = 'Best telemedicine practices, barriers and facilitators, and specific actions to remove/expand them within and across state boundaries.'
# query = 'Guidance on the simple things people can do at home to take care of sick people and manage disease.'
# query = 'Oral medications that might potentially work.'
# query = 'Use of AI in real-time health care delivery to evaluate interventions, risk factors, and outcomes in a way that could not be done manually.'
# query = 'Best practices and critical challenges and innovative solutions and technologies in hospital flow and organization, workforce protection, workforce allocation, community-based support resources, payment, and supply chain management to enhance capacity, efficiency, and outcomes.'
# query = 'Efforts to define the natural history of disease to inform clinical care, public health interventions, infection prevention control, transmission, and clinical trials'
# query = 'Efforts to develop a core clinical outcome set to maximize usability of data across a range of trials'
# query = 'Efforts to determine adjunctive and supportive interventions that can improve the clinical outcomes of infected patients (e.g. steroids, high flow oxygen)'
results = compute_tf_idf_similarity(query, docs,5)
summaries = summarize_results(results)
display_output(query, results,summaries)


* **Task 6**

In [None]:
relevant_corpus = pd.read_csv(RELEVANT_PAPERS_PATH).loc[:]
df = pd.DataFrame(relevant_corpus)
docs = df['text'].tolist()

# queries, uncomment desired query and comment rest to see query results

query = 'Guidance on ways to scale up NPIs in a more coordinated way (e.g., establish funding, infrastructure and authorities to support real time, authoritative (qualified participants) collaboration with all states to gain consensus on consistent guidance and to mobilize resources to geographic areas where critical shortfalls are identified) to give us time to enhance our health care delivery system capacity to respond to an increase in cases'
# query = 'Rapid design and execution of experiments to examine and compare NPIs currently being implemented. DHS Centers for Excellence could potentially be leveraged to conduct these experiments'
# query = 'Rapid assessment of the likely efficacy of school closures, travel bans, bans on mass gatherings of various sizes, and other social distancing approaches'
# query = 'Methods to control the spread in communities, barriers to compliance and how these vary among different populations'
# query = 'Models of potential interventions to predict costs and benefits that take account of such factors as race, income, disability, age, geographic location, immigration status, housing status, employment status, and health insurance status'
# query = 'Policy changes necessary to enable the compliance of individuals with limited resources and the underserved with NPIs'
# query = 'Research on why people fail to comply with public health advice, even if they want to do so (e.g., social or financial costs may be too high)'
# query = 'Research on the economic impact of this or any pandemic. This would include identifying policy and programmatic alternatives that lessen/mitigate risks to critical government services, food distribution and supplies, access to critical household supplies, and access to health diagnoses, treatment, and needed care, regardless of ability to pay'
results = compute_tf_idf_similarity(query, docs,5)
summaries = summarize_results(results)
display_output(query, results,summaries)

* **Task 7**

In [None]:
relevant_corpus = pd.read_csv(RELEVANT_PAPERS_PATH).loc[:]
df = pd.DataFrame(relevant_corpus)
docs = df['text'].tolist()

# queries, uncomment desired query and comment rest to see query results

query = 'Are there geographic variations in the rate of COVID-19 spread?'
# query = 'Are there geographic variations in the mortality rate of COVID-19?'
# query = 'Is there any evidence to suggest geographic based virus mutations?'
results = compute_tf_idf_similarity(query, docs,5)
summaries = summarize_results(results)
display_output(query, results,summaries)

* **Task 8**

In [None]:
relevant_corpus = pd.read_csv(RELEVANT_PAPERS_PATH).loc[:]
df = pd.DataFrame(relevant_corpus)
docs = df['text'].tolist()

# queries, uncomment desired query and comment rest to see query results

query = 'How widespread current exposure is to be able to make immediate policy recommendations on mitigation measures. Denominators for testing and a mechanism for rapidly sharing that information, including demographics, to the extent possible. Sampling methods to determine asymptomatic disease (e.g., use of serosurveys (such as convalescent samples) and early detection of disease (e.g., use of screening of neutralizing antibodies such as ELISAs)'
# query = 'Efforts to increase capacity on existing diagnostic platforms and tap into existing surveillance platforms'
# query = 'Recruitment, support, and coordination of local expertise and capacity (public, private—commercial, and non-profit, including academic), including legal, ethical, communications, and operational issues'
# query = 'National guidance and guidelines about best practices to states (e.g., how states might leverage universities and private laboratories for testing purposes, communications to public health officials and the public)'
# query = 'Development of a point-of-care test (like a rapid influenza test) and rapid bed-side tests, recognizing the tradeoffs between speed, accessibility, and accuracy'
# query = 'Rapid design and execution of targeted surveillance experiments calling for all potential testers using PCR in a defined area to start testing and report to a specific entity. These experiments could aid in collecting longitudinal samples, which are critical to understanding the impact of ad hoc local interventions (which also need to be recorded)'
# query = 'Separation of assay development issues from instruments, and the role of the private sector to help quickly migrate assays onto those devices'
# query = 'Efforts to track the evolution of the virus (i.e., genetic drift or mutations) and avoid locking into specific reagents and surveillance/detection schemes'
# query = 'Latency issues and when there is sufficient viral load to detect the pathogen, and understanding of what is needed in terms of biological and environmental sampling'
# query = 'Use of diagnostics such as host response markers (e.g., cytokines) to detect early disease or predict severe disease progression, which would be important to understanding best clinical practice and efficacy of therapeutic interventions'
# query = 'Policies and protocols for screening and testing'
# query = 'Policies to mitigate the effects on supplies associated with mass testing, including swabs and reagents'
# query = 'Technology roadmap for diagnostics'
# query = 'Barriers to developing and scaling up new diagnostic tests (e.g., market forces), how future coalition and accelerator models (e.g., Coalition for Epidemic Preparedness Innovations) could provide critical funding for diagnostics, and opportunities for a streamlined regulatory environment'
# query = 'New platforms and technology (e.g., CRISPR) to improve response times and employ more holistic approaches to COVID-19 and future diseases'
# query = 'Coupling genomics and diagnostic testing on a large scale'
# query = 'Enhance capabilities for rapid sequencing and bioinformatics to target regions of the genome that will allow specificity for a particular variant'
# query = 'Enhance capacity (people, technology, data) for sequencing with advanced analytics for unknown pathogens, and explore capabilities for distinguishing naturally-occurring pathogens from intentional'
# query = 'One Health surveillance of humans and potential sources of future spillover or ongoing exposure for this organism and future pathogens, including both evolutionary hosts (e.g., bats) and transmission hosts (e.g., heavily trafficked and farmed wildlife and domestic food and companion species), inclusive of environmental, demographic, and occupational risk factors'
results = compute_tf_idf_similarity(query, docs,5)
summaries = summarize_results(results)
display_output(query, results,summaries)

* **Task 9**

In [None]:
relevant_corpus = pd.read_csv(RELEVANT_PAPERS_PATH).loc[:]
df = pd.DataFrame(relevant_corpus)
docs = df['text'].tolist()

# queries, uncomment desired query and comment rest to see query results

query = 'Efforts to articulate and translate existing ethical principles and standards to salient issues in COVID-2019'
# query = 'Efforts to embed ethics across all thematic areas, engage with novel ethical issues that arise and coordinate to minimize duplication of oversight'
# query = 'Efforts to support sustained education, access, and capacity building in the area of ethics'
# query = 'Efforts to establish a team at WHO that will be integrated within multidisciplinary research and operational platforms and that will connect with existing and expanded global networks of social sciences'
# query = 'Efforts to develop qualitative assessment frameworks to systematically collect information related to local barriers and enablers for the uptake and adherence to public health measures for prevention and control. This includes the rapid identification of the secondary impacts of these measures. (e.g. use of surgical masks, modification of health seeking behaviors for SRH, school closures)'
# query = 'Efforts to identify how the burden of responding to the outbreak and implementing public health measures affects the physical and psychological health of those providing care for Covid-19 patients and identify the immediate needs that must be addressed'
# query = 'Efforts to identify the underlying drivers of fear, anxiety and stigma that fuel misinformation and rumor, particularly through social media'
results = compute_tf_idf_similarity(query, docs,5)
summaries = summarize_results(results)
display_output(query, results,summaries)

* **Task 10**

In [None]:
relevant_corpus = pd.read_csv(RELEVANT_PAPERS_PATH).loc[:]
df = pd.DataFrame(relevant_corpus)
docs = df['text'].tolist()

# queries, uncomment desired query and comment rest to see query results

query = 'Methods for coordinating data-gathering with standardized nomenclature'
# query = 'Sharing response information among planners, providers, and others'
# query = 'Understanding and mitigating barriers to information-sharing'
# query = 'How to recruit, support, and coordinate local (non-Federal) expertise and capacity relevant to public health emergency response (public, private, commercial and non-profit, including academic)'
# query = 'Integration of federal/state/local public health surveillance systems'
# query = 'Value of investments in baseline public health response infrastructure preparedness'
# query = 'Modes of communicating with target high-risk populations (elderly, health care workers)'
# query = 'Risk communication and guidelines that are easy to understand and follow (include targeting at risk populations’ families too)'
# query = 'Communication that indicates potential risk of disease to all population groups'
# query = 'Misunderstanding around containment and mitigation'
# query = 'Action plan to mitigate gaps and problems of inequity in the Nation’s public health capability, capacity, and funding to ensure all citizens in need are supported and can access information, surveillance, and treatment'
# query = 'Measures to reach marginalized and disadvantaged populations'
# query = 'Data systems and research priorities and agendas incorporate attention to the needs and circumstances of disadvantaged populations and underrepresented minorities'
# query = 'Mitigating threats to incarcerated people from COVID-19, assuring access to information, prevention, diagnosis, and treatment'
# query = 'Understanding coverage policies (barriers and opportunities) related to testing, treatment, and care'
results = compute_tf_idf_similarity(query, docs,5)
summaries = summarize_results(results)
display_output(query, results,summaries)

<a id="5"></a>
# Evaluating Query Results
In this section, the relevance of the retrieved documents for a certain query can be visualized through our plot of cosine similarity per retrieved document 

In [None]:
import matplotlib.pyplot as plt
# returns the top n (max_documents_number) cosine similarity scores, highest first
def get_top_cosine_similarity(query, docs, max_documents_number):
    tfidf_vectorizer = TfidfVectorizer()
    tfidf = tfidf_vectorizer.fit_transform(docs)
    query_vec = tfidf_vectorizer.transform([query])

    cosine_similarities = linear_kernel(query_vec, tfidf).flatten()
    return sorted(cosine_similarities.tolist(), reverse=True)[:max_documents_number]

# plots the rank of each retrieved document against its cosine similarity score
def plot_top_cosine_similarity(query, docs, max_documents_number):
    top_cosine_similarity = get_top_cosine_similarity(query, docs, max_documents_number)
    # use the actual number of scores in case the corpus has fewer documents than requested
    result_number = np.arange(start=1, stop=len(top_cosine_similarity) + 1)
    plt.figure(figsize=(8,6))
    plt.bar(result_number, top_cosine_similarity, align='center', alpha=0.5)
    plt.xlabel('Document Number')
    plt.ylabel('Cosine Similarity')
    plt.show()

Write any query here to visualize the system's performance (measured by cosine similarity score) with regard to the relevance of the retrieved documents to the query.

In [None]:
query = 'Effectiveness of drugs being developed and tried to treat COVID-19 patients.'
plot_top_cosine_similarity(query, docs, 5)