![search_vaccine3_banner.jpg](attachment:search_vaccine3_banner.jpg)

# Table of Contents

* Abstract
* Introduction
* Task Details
* Technology
* Literature Compilation
* Conclusion
* Acknowledgements
* Disclaimer
* Bibliography

# Abstract

The whole world is waiting for a vaccine that protects against COVID-19 and allows all of us to return to a normal life, and an enormous amount of resources is being devoted to the search. The purpose of this notebook is to provide a compilation of literature explicitly related to COVID-19 vaccines.

On 16 March 2020 the [White House](https://www.whitehouse.gov/briefings-statements/call-action-tech-community-new-machine-readable-covid-19-dataset/) and a coalition of leading research groups issued a call to action to the tech community on a new machine-readable COVID-19 dataset:

> *Today, researchers and leaders from the Allen Institute for AI, Chan Zuckerberg Initiative (CZI), Georgetown University’s Center for Security and Emerging Technology (CSET), Microsoft, and the National Library of Medicine (NLM) at the National Institutes of Health released the COVID-19 Open Research Dataset (CORD-19) of scholarly literature about COVID-19, SARS-CoV-2, and the Coronavirus group.*

This is a work in progress. The examples shown below (see section "**Literature Compilation**") come from a system built on the dataset provided by [Kaggle](https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge) and [Semantic Scholar](https://pages.semanticscholar.org/coronavirus-research).

The system builds on code from other notebooks and runs on-premises (see section "**Technology**").

# Introduction

Around the world, people are eager for a COVID-19 vaccine that can stop the spread of the disease and revive the economy. Everyone who needs a COVID-19 vaccine should be able to get it for free.

When one comes into contact with a virus or bacterium, the body’s immune system makes antibodies to fight it off. Vaccines prompt the immune system to produce antibodies against a specific disease; they usually contain a dead or weakened form of the virus or bacterium. The immune system then knows what to do if one comes into contact with the pathogen again, so one doesn't get sick or the illness is much milder. A vaccine against COVID-19 would slow its spread around the world.


The World Health Organization reported on 9 June 2020 that there were 10 vaccine candidates under clinical evaluation. These vaccines must be tested to verify how long a person’s immune system takes to respond and to watch for side effects.

A vaccine must go through development and testing to make sure it is effective against a virus or bacterium and that it doesn’t cause any other problems once it is available to the public.

There are six stages generally followed in the development of a vaccine:

![CORD19_vacvcine3.png](attachment:CORD19_vacvcine3.png)

**1 Exploratory stage**, laboratories start the research to find something that can treat or prevent a disease.

**2 Pre-clinical stage**, scientists use lab tests and testing in animals, such as mice or monkeys, to learn whether a vaccine might work. Many potential vaccines don’t make it past this point. If the tests succeed, and the FDA signs off, it’s on to clinical testing.

**3 Clinical development**, a three-phase process of testing on humans:
* Phase I involves fewer than 100 people.
* Phase II includes several hundred people.
* Phase III involves thousands of people. 

**4 Regulatory review and approval**, scientists with the FDA and CDC go over the data from the clinical trials and sign off.

**5 Manufacturing**, the vaccine goes into production. The FDA inspects the factory and approves drug labels.

**6 Quality control**, scientists and government agencies keep tabs on the drug-making process and on people who get the vaccine. They want to make sure it keeps working safely.


In the United States of America, the FDA and the CDC are the regulatory and control entities. Analogous entities play the same role in other countries.


# Task Details

Ten tasks are proposed for this competition. This work pays particular attention to vaccines.

The original description is the following:

What do we know about vaccines and therapeutics? What has been published concerning research and development and evaluation efforts of vaccines and therapeutics?

Specifically, we want to know what the literature reports about:

* Effectiveness of drugs being developed and tried to treat COVID-19 patients.
 * Clinical and bench trials to investigate less common viral inhibitors against COVID-19 such as naproxen, clarithromycin, and minocycline that may exert effects on viral replication.
* Methods evaluating potential complication of Antibody-Dependent Enhancement (ADE) in vaccine recipients.
* Exploration of use of best animal models and their predictive value for a human vaccine.
* Capabilities to discover a therapeutic (not vaccine) for the disease, and clinical effectiveness studies to discover therapeutics, to include antiviral agents.
* Alternative models to aid decision makers in determining how to prioritize and distribute scarce, newly proven therapeutics as production ramps up. This could include identifying approaches for expanding production capacity to ensure equitable and timely distribution to populations in need.
* Efforts targeted at a universal coronavirus vaccine.
* Efforts to develop animal models and standardize challenge studies.
* Efforts to develop prophylaxis clinical studies and prioritize in healthcare workers.
* Approaches to evaluate risk for enhanced disease after vaccination.
* Assays to evaluate vaccine immune response and process development for vaccines, alongside suitable animal models [in conjunction with therapeutics].

# Technology

The Python package [cord19q](https://github.com/ricardofsilva/cord19q) is used. It is a fork of David Mezzetti's cord19q package.

Parts of the code were based on the notebook by David Mezzetti.
#### Cite: [David Mezzetti | Kaggle, CORD-19 Analysis with Sentence Embeddings](https://www.kaggle.com/davidmezzetti/cord-19-analysis-with-sentence-embeddings)<br>

The NLP model was trained on a machine running Windows with WSL (which allows Linux to run directly on Windows), either on Azure or on-premises. At the time of this post, on-premises PCs with Windows 10 were used.<br>
[What is the Windows Subsystem for Linux?](https://docs.microsoft.com/en-us/windows/wsl/about)<br>
[Windows Server 2019 Installation Guide](https://docs.microsoft.com/en-us/windows/wsl/install-on-server)<br>

The constant **CORD19_cache** can be True or False:
* **True** uses the pre-trained NLP files in the cord19cache folder. This makes generating the output files much faster: the notebook renders in approximately 15 minutes.<br><br>

* **False** ignores the cached files. The notebook renders in approximately 4 hours, but it can use more recent data. In this mode the notebook will:
    - select the records in the metadata for the year 2020;<br>
    - import the data into the SQLite database;<br>
    - calculate the vectors;<br>
    - generate the indexes, which makes it possible to train the network on a different server and then import the dataset.
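
The first step of the non-cached path, selecting the 2020 records, can be sketched with pandas as follows. This is only an illustration, not the code that cord19q runs: the helper name `select_year` is hypothetical, the sample rows are made up, and it assumes the metadata has a `publish_time` column whose values start with a four-digit year (for example `2020-03-12` or just `2020`).

```python
import pandas as pd


def select_year(metadata: pd.DataFrame, year: int = 2020) -> pd.DataFrame:
    """Keep rows whose publish_time starts with the given year (hypothetical helper)."""
    # Compare only the first four characters so that both "2020-03-12" and
    # "2020" match; rows with a missing date never match and are dropped.
    years = metadata["publish_time"].astype(str).str.slice(0, 4)
    return metadata[years == str(year)].reset_index(drop=True)


# Tiny stand-in for metadata.csv (illustrative rows, not real CORD-19 records)
sample = pd.DataFrame(
    {
        "title": ["paper A", "paper B", "paper C", "paper D"],
        "publish_time": ["2020-03-12", "2019-11-30", "2020", None],
    }
)

subset = select_year(sample)
print(subset["title"].tolist())
```

Comparing the year as text avoids date parsing, which can silently fail when full dates and bare years are mixed in the same column.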

# Literature Compilation
A set of pre-generated reports is shown as an example, with information related to the following topics:

* Effectiveness of drugs being developed and tried to treat COVID-19 patients.
* Clinical and bench trials to investigate less common viral inhibitors against COVID-19 such as naproxen, clarithromycin, and minocycline that may exert effects on viral replication.
* Methods evaluating potential complication of Antibody-Dependent Enhancement (ADE) in vaccine recipients.
* Exploration of use of best animal models and their predictive value for a human vaccine.
* Capabilities to discover a therapeutic (not vaccine) for the disease, and clinical effectiveness studies to discover therapeutics, to include antiviral agents.
* Alternative models to aid decision makers in determining how to prioritize and distribute scarce, newly proven therapeutics as production ramps up. This could include identifying approaches for expanding production capacity to ensure equitable and timely distribution to populations in need.
* Efforts targeted at a universal coronavirus vaccine.
* Efforts to develop animal models and standardize challenge studies.
* Efforts to develop prophylaxis clinical studies and prioritize in healthcare workers.
* Approaches to evaluate risk for enhanced disease after vaccination.
* Assays to evaluate vaccine immune response and process development for vaccines [in conjunction with therapeutics].
* Suitable animal models for vaccine development.
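
Each topic above is used as a free-text query. As a rough illustration of how a query can be matched against sentences with embeddings, the sketch below averages toy word vectors and ranks sentences by cosine similarity. It is not the cord19q implementation: the three-dimensional vectors, the vocabulary and the function names (`embed`, `cosine`, `rank`) are all made up for the example, whereas this notebook builds 300-dimensional vectors.

```python
import numpy as np

# Toy 3-dimensional word vectors, invented for illustration only
WORD_VECTORS = {
    "vaccine": np.array([1.0, 0.0, 0.0]),
    "antibody": np.array([0.8, 0.2, 0.0]),
    "animal": np.array([0.0, 1.0, 0.0]),
    "model": np.array([0.0, 0.9, 0.1]),
    "economy": np.array([0.0, 0.0, 1.0]),
}


def embed(text):
    """Average the vectors of the known words in a text (zero vector if none)."""
    vectors = [WORD_VECTORS[w] for w in text.lower().split() if w in WORD_VECTORS]
    if not vectors:
        return np.zeros(3)
    return np.mean(vectors, axis=0)


def cosine(a, b):
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0


def rank(query, sentences):
    """Return (score, sentence) pairs sorted from most to least similar."""
    q = embed(query)
    scored = [(cosine(q, embed(s)), s) for s in sentences]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


sentences = [
    "antibody response after vaccine",
    "animal model",
    "economy",
]
results = rank("vaccine", sentences)
for score, sentence in results:
    print(f"{score:.3f}  {sentence}")
```

With these toy vectors, the sentence that shares vocabulary with the query ranks first, while the unrelated sentences score zero.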

In [None]:
# Install cord19q project
!pip install git+https://github.com/ricardofsilva/cord19q.git

In [None]:
# Install scispacy model
!pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.2.4/en_core_sci_md-0.2.4.tar.gz

In [None]:
CORD19_cache = True

# See section "Technology" for a description.

In [None]:
import shutil

if CORD19_cache:
    # Copy the cached study design models, database and index locally.
    # shutil.copytree replaces distutils.dir_util.copy_tree (distutils was
    # removed in Python 3.12); dirs_exist_ok requires Python 3.8 or later.
    shutil.copytree("../input/cord19cache/cord19q", "cord19q", dirs_exist_ok=True)

    # Copy the cached word vectors to the local vectors directory
    shutil.copytree("../input/cord19cache/vectors", "/root/.cord19/vectors", dirs_exist_ok=True)

In [None]:
import os
import shutil

if not CORD19_cache:
    # Copy study design models locally (do not fail if the folder already exists)
    os.makedirs("cord19q", exist_ok=True)
    shutil.copy("../input/cord19-study-design/attribute", "cord19q")
    shutil.copy("../input/cord19-study-design/design", "cord19q")

In [None]:
from cord19q.etl.execute import Execute as Etl

# Build SQLite database for metadata.csv and json full text files

if not CORD19_cache:
    Etl.run("../input/CORD-19-research-challenge", "cord19q")

In [None]:
from cord19q.vectors import Vectors

if not CORD19_cache:   
    Vectors.run('cord19q', 300, 3)    

In [None]:
from cord19q.index import Index
from datetime import datetime

if not CORD19_cache:    
    # print(str(datetime.now()))
    
    Index.run("cord19q", "/root/.cord19/vectors/cord19-300d.magnitude")    
    #Index.run("cord19q", "../input/cord19-fasttext-vectors/cord19-300d.magnitude")
    
    # print(str(datetime.now()))

In [None]:
# Visualize a sankey graphic between design and source

In [None]:
!pip install psankey

In [None]:
import sqlite3

import pandas as pd

# Connect to the articles database built by the ETL step
db = sqlite3.connect("cord19q/articles.sqlite")

# Map the numeric study design codes to readable labels for tagged articles
strSQL = """
select source,
       case when design = 1 then 'systematic review'
            when design in (2, 3) then 'control trial'
            when design in (4, 5) then 'prospective studies'
            when design = 6 then 'retrospective studies'
            when design in (7, 8) then 'case series'
            else 'modeling'
       end as design,
       row_number() over (order by source, design) as id
from articles
where tags is not null
"""

pd.set_option("max_colwidth", 125)
df = pd.read_sql_query(strSQL, db)

# An article can list several sources separated by ';': emit one row per source.
# DataFrame.append was removed in pandas 2.0, so split and explode are used instead.
df2 = df.assign(source=df["source"].str.split(";")).explode("source").reset_index(drop=True)
df2["source"] = df2["source"].str.strip()

In [None]:
df3 = df2.groupby(['design', 'source'])['id'].count()

In [None]:
df1 = df3.reset_index()

# The flow runs from study design (sankey "source") to article source (sankey "target")
df1.columns = ['source', 'target', 'value']

df1.head(15)

In [None]:
from psankey.sankey import sankey

import matplotlib

matplotlib.rcParams['figure.figsize'] = [20, 20]
fig, ax = sankey(df1, labelsize=13, nodecmap='copper')

In [None]:
import os
import os.path

os.environ["COLUMNS"] = "80"

from cord19q.report.execute import Execute as Report
from IPython.display import display, Markdown

import shutil

if os.path.exists("vaccines"):
    shutil.rmtree('vaccines')
    
vaccines = \
    ["Effectiveness of drugs being developed and tried to treat COVID-19 patients",
     "Clinical and bench trials to investigate less common viral inhibitors against COVID-19 such as naproxen, clarithromycin, and minocycline that may exert effects on viral replication",
     "Methods evaluating potential complication of Antibody-Dependent Enhancement (ADE) in vaccine recipients",
     "Exploration of use of best animal models and their predictive value for a human vaccine",
     "Capabilities to discover a therapeutic (not vaccine) for the disease, and clinical effectiveness studies to discover therapeutics, to include antiviral agents",
     "Alternative models to aid decision makers in determining how to prioritize and distribute scarce, newly proven therapeutics as production ramps up. This could include identifying approaches for expanding production capacity to ensure equitable and timely distribution to populations in need",
     "Efforts targeted at a universal coronavirus vaccine",
     "Efforts to develop animal models and standardize challenge studies",
     "Efforts to develop prophylaxis clinical studies and prioritize in healthcare workers",
     "Approaches to evaluate risk for enhanced disease after vaccination",
     "Assays to evaluate vaccine immune response and process development for vaccines [in conjunction with therapeutics]",
     "Suitable animal models for vaccine development",
     "What Treatments For Coronavirus Are Being Researched",
     "What Research Is Being Conducted On A Coronavirus Vaccine", 
     "What Is The White House Saying About Coronavirus Testing And Treatment",
     "What Will Happen When A Coronavirus Treatment Or Vaccine Becomes Available",
     "Is there a vaccine for the coronavirus disease",
     "What is the treatment for the coronavirus disease",
     "Do vaccines against pneumonia protect against the coronavirus disease",
     "Is the coronavirus disease the same as SARS",
     "Racing for a cure: where are we with COVID-19 vaccines and treatments",
     "Approaches for creating a COVID-19 vaccine",
     "Finding a COVID-19 vaccine is the easy part"]
   
# Writes queries out to a local file for processing
def queries(file, queries):
    # Create output directory
    os.mkdir(file)

    with open(os.path.join(file, file + ".txt"), "w") as output:
        for query in queries:
            output.write("%s\n" % query)

# Builds Markdown and Excel reports
def report(file, category = None):
    # Create path to input file
    file = os.path.join(file, file + ".txt")    
    
    Report.run(file, 50, category,"md", "cord19q")
    # Report.run(file, 50, category,"xlsx", "cord19q")
    Report.run(file, 50, category,"csv", "cord19q")

# Save queries to file
queries("vaccines", vaccines)

In [None]:
report("vaccines")

In [None]:
from IPython.display import display, Markdown

# Path to the Markdown report generated for the vaccines queries
file = "vaccines/vaccines.md"


display(Markdown(filename = file))

In [None]:
# Reports generated

# List the csv files in the vaccines output folder

import os

# os.walk has no pattern argument, so the extension is filtered explicitly
for dirname, _, filenames in os.walk('vaccines'):
    for filename in filenames:
        if filename.endswith('.csv'):
            print(os.path.join(dirname, filename))

# Conclusion

New papers about COVID-19 vaccines and therapeutics are published every day. With a growing volume of information, it becomes difficult to search for a topic among thousands of papers. Using NLP-based technology to find the information a researcher needs to read makes their work much easier and helps them reach new conclusions faster. At the moment, there are around 138,000 articles available from around the world.

As the number of articles published every day increases, it is of great significance to have data that supports COVID-19 vaccine research in 2020. The new data subset of articles contains 23,523 entries.

The author is a former Microsoft Consultant with a background in BI using the Microsoft BI stack, which explains the use of Windows with WSL 2 to access Linux, and of Azure to make a VM available in the cloud. This infrastructure was chosen because training the NLP solution takes several hours.


**Ricardo Silva**<br><br>
16 April 2020 (updated on the 16th June 2020)
<br>Portugal

# Acknowledgements

CORD-19 Analysis with Sentence Embeddings<br>
https://www.kaggle.com/davidmezzetti/cord-19-analysis-with-sentence-embeddings

cord19 Python package<br>
https://github.com/neuml/cord19q

# Disclaimer

This disclaimer informs readers that the views, thoughts, and opinions expressed in the text belong solely to the author and not necessarily to the author's employer, organization, committee, or other group or individual.

# Bibliography

Craven, Jeff. “COVID-19 Vaccine Tracker.” Regulatory Affairs Professionals Society (RAPS), RAPS, 20 Apr. 2020<br>
https://www.raps.org/news-and-articles/news-articles/2020/3/covid-19-vaccine-tracker

Gonzalez, Oscar, and Bill Gates. “Bill Gates Says Foundation Putting 'Total Attention' on COVID-19.” CNET, CNET, 27 Apr. 2020<br>
https://www.cnet.com/news/bill-gates-says-foundation-putting-total-attention-on-covid-19

Nazario, Brunilda. “Coronavirus (COVID-19) Vaccine: How Long Will Finding a Vaccine Take?” WebMD, WebMD, 24 Apr. 2020<br>
https://www.webmd.com/lung/covid-19-vaccine#1

Offit, Paul A. “Is Natural Infection Better Than Vaccination?” Bing, Microsoft, 16 Mar. 2020<br>
https://www.youtube.com/watch?v=MSsVTaLkPng&feature=emb_logo

Offit, Paul A. “The Challenging Path To A COVID-19 Vaccine.” Science Friday, 24 Apr. 2020<br>
https://www.sciencefriday.com/segments/path-to-covid-19-vaccine

“Producing Prevention: The Complex Development of Vaccines.” Blog, 6 Mar. 2019<br>
https://publichealthonline.gwu.edu/blog/producing-prevention-the-complex-development-of-vaccines

Sa, Sara, and Miguel Castanho. “‘À Covid-19 Pode Suceder-Se Uma Covid-20 Ou 21.’” Visão, Visao, 13 Apr. 2020<br>
https://visao.sapo.pt/visaosaude/2020-04-12-a-covid-19-pode-suceder-se-uma-covid-20-ou-21/

United States, Office of Science and Technology Policy. “Call to Action to the Tech Community on New Machine Readable COVID-19 Dataset.” The White House, 16 Mar. 2020<br>
https://www.whitehouse.gov/briefings-statements/call-action-tech-community-new-machine-readable-covid-19-dataset

“Vaccine Development, Testing, and Regulation.” History of Vaccines<br>
https://www.historyofvaccines.org/content/articles/vaccine-development-testing-and-regulation

“Vaccine Testing and Approval Process.” Centers for Disease Control and Prevention, 1 May 2014<br>
https://www.cdc.gov/vaccines/basics/test-approve.html

World Health Organization. “Draft Landscape of COVID-19 Candidate Vaccines.” World Health Organization, 9 June 2020<br>
https://www.who.int/publications/m/item/draft-landscape-of-covid-19-candidate-vaccines