# Introduction

<img src="https://i.imgur.com/endjLmo.png" width="800px">

*[Source: Johns Hopkins Coronavirus Resource Center](https://coronavirus.jhu.edu/map.html)*

*<font size=3 color="#616A6B"> "This coronavirus is presenting us with an unprecedented threat, an unprecedented opportunity to come together as one against a common enemy: an enemy against humanity."</font>* ~ Tedros Adhanom Ghebreyesus, WHO Director-General

Welcome to the "COVID-19 Open Research Dataset Challenge"! In this competition, contestants are challenged to use a large corpus of COVID-19 research to understand the pandemic better. COVID-19 is an acute respiratory disease caused by a coronavirus called SARS-CoV-2. No other outbreak in recent history has caused such widespread medical and economic disruption. This is a time when everyone, from medical researchers and healthcare workers to data scientists across the world, needs to work together to fight a common enemy.

In this kernel, I will explore the data and try to come up with actionable insights regarding containment and cure using unsupervised NLP.


# Acknowledgements

1. [South Korea's coronavirus lessons](https://www.aljazeera.com/news/2020/03/south-korea-coronavirus-lessons-quick-easy-tests-monitoring-200319011438619.html) ~ Al Jazeera
2. [Timeline of the 2019–20 coronavirus pandemic](https://en.wikipedia.org/wiki/Timeline_of_the_2019%E2%80%9320_coronavirus_pandemic_in_February_2020) ~ Wikipedia
3. [South Korea is watching quarantined citizens with a smartphone app](https://www.technologyreview.com/s/615329/coronavirus-south-korea-smartphone-app-quarantine/) ~ MIT Technology Review
4. [Johns Hopkins Coronavirus Resource Center](https://coronavirus.jhu.edu/map.html) ~ Johns Hopkins University
5. [CORD-19: EDA, parse JSON and generate clean CSV🧹](https://www.kaggle.com/xhlulu/cord-19-eda-parse-json-and-generate-clean-csv) ~ xhlulu
6. [Namsor API](https://www.namsor.com/) ~ Namsor
7. [Gensim Word2Vec Docs](https://radimrehurek.com/gensim/models/word2vec.html) ~ Gensim
8. [COVID-19 - Analysis, Viz, Prediction & Comparisons](https://www.kaggle.com/imdevskp/covid-19-analysis-viz-prediction-comparisons) ~ Devakumar kp
9. [Doxorubicin](https://en.wikipedia.org/wiki/Doxorubicin) ~ Wikipedia
10. [Hydroxychloroquine](https://en.wikipedia.org/wiki/Hydroxychloroquine) ~ Wikipedia
11. [Embeddings: Translating to a Lower-Dimensional Space](https://developers.google.com/machine-learning/crash-course/embeddings/translating-to-a-lower-dimensional-space) ~ Google Machine Learning Crash Course
12. [Chloroquine, an old malaria drug ...](https://abcnews.go.com/Health/chloroquine-malaria-drug-treat-coronavirus-doctors/story?id=69664561) ~ ABC News

# Contents

* [<font size=4>EDA</font>](#1)
    * [Preparing the ground](#1.1)
    * [Author names](#1.2)
    * [Abstracts](#1.3)
   
   
* [<font size=4>Finding cures for COVID-19</font>](#2)
    * [Unsupervised NLP and Word2Vec](#2.1)
    * [Using Word2Vec to find cures](#2.2)
    

* [<font size=4>Finding ways to contain COVID-19</font>](#3)
    * [The current situation](#3.1)
    * [What can we learn from China and South Korea?](#3.2)
    
   
* [<font size=4>Takeaways</font>](#4)


* [<font size=4>Ending note</font>](#5)

# EDA <a id="1"></a>

First, I will visualize the corpus before moving on to unsupervised machine learning.

## Preparing the ground <a id="1.1"></a>

### Install and import libraries

In [None]:
!pip install -q pycountry

In [None]:
import os
import gc
import re
import folium
from scipy import stats

import warnings
warnings.filterwarnings("ignore")

import math
import numpy as np
import scipy as sp
import pandas as pd

import pycountry
from sklearn import metrics
from sklearn.utils import shuffle
from gensim.models import Word2Vec
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

import nltk
from textblob import TextBlob
from wordcloud import WordCloud
from nltk.corpus import wordnet
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.sentiment.vader import SentimentIntensityAnalyzer

import random
import networkx as nx
from pandas import Timestamp

import requests
from IPython.display import HTML

In [None]:
import seaborn as sns
from tqdm import tqdm
import matplotlib.cm as cm
import matplotlib.pyplot as plt

import plotly.express as px
import plotly.graph_objects as go
import plotly.figure_factory as ff
from plotly.subplots import make_subplots

tqdm.pandas()
np.random.seed(0)
%env PYTHONHASHSEED=0


### Load data

In [None]:
DATA_PATH = "../input/CORD-19-research-challenge/"
CLEAN_DATA_PATH = "../input/cord-19-eda-parse-json-and-generate-clean-csv/"

pmc_df = pd.read_csv(CLEAN_DATA_PATH + "clean_pmc.csv")
biorxiv_df = pd.read_csv(CLEAN_DATA_PATH + "biorxiv_clean.csv")
comm_use_df = pd.read_csv(CLEAN_DATA_PATH + "clean_comm_use.csv")
noncomm_use_df = pd.read_csv(CLEAN_DATA_PATH + "clean_noncomm_use.csv")

papers_df = pd.concat([pmc_df,
                       biorxiv_df,
                       comm_use_df,
                       noncomm_use_df], axis=0).reset_index(drop=True)

In [None]:
CORONA_FILE = "../input/corona-virus-report/covid_19_clean_complete.csv"

full_table = pd.read_csv(CORONA_FILE, parse_dates=['Date'])

# filling missing values
full_table[['Province/State']] = full_table[['Province/State']].fillna('')
base_cases = ['Confirmed', 'Deaths', 'Recovered']
full_table[base_cases] = full_table[base_cases].fillna(0)

# replacing Mainland china with just China
full_table['Country/Region'] = full_table['Country/Region'].replace('Mainland China', 'China')

# active cases = confirmed - deaths - recovered
full_table['Active'] = full_table['Confirmed'] - full_table['Deaths'] - full_table['Recovered']
cases = base_cases + ['Active']

# cases in the ships
ship = full_table[full_table['Province/State'].str.contains('Grand Princess')|full_table['Country/Region'].str.contains('Cruise Ship')]

# china and the row
china = full_table[full_table['Country/Region']=='China']
row = full_table[full_table['Country/Region']!='China']

# latest
full_latest = full_table[full_table['Date'] == max(full_table['Date'])].reset_index()
china_latest = full_latest[full_latest['Country/Region']=='China']
row_latest = full_latest[full_latest['Country/Region']!='China']

# latest condensed
# select the columns with a list: indexing a groupby with a bare tuple of names fails in recent pandas
full_latest_grouped = full_latest.groupby('Country/Region')[cases].sum().reset_index()
china_latest_grouped = china_latest.groupby('Province/State')[cases].sum().reset_index()
row_latest_grouped = row_latest.groupby('Country/Region')[cases].sum().reset_index()

temp = full_table.groupby(['Country/Region', 'Province/State'])[cases].max()
# temp.style.background_gradient(cmap='Reds')

temp = full_table.groupby('Date')[cases].sum().reset_index()
temp = temp[temp['Date']==max(temp['Date'])].reset_index(drop=True)

## Author names <a id="1.2"></a>

Every research paper in the corpus is written by one or more authors, and the names of these authors could provide some insights regarding which parts of the world generate most of the coronavirus research.

I use an API called **Namsor** to predict the country of origin of the authors. The predictions made by this API are not 100% accurate and sometimes it is difficult to predict someone's country of origin solely based on the name. **So take this plot with a grain of salt.**

<center><img src="https://i.imgur.com/bIcwsRW.png" width="500px"></center>

In [None]:
def multiple_replace(replacements, text):
    # Replace every key of `replacements` found in `text` with its value in a single pass
    regex = re.compile("(%s)" % "|".join(map(re.escape, replacements.keys())))
    return regex.sub(lambda mo: replacements[mo.group(0)], text)

def get_countries(names):
    alphabet = ["A", "B", "C", "D", "E", "F",\
                "G", "H", "I", "J", "K", "L",\
                "M", "N", "O", "P", "Q", "R",\
                "S", "T", "U", "V", "W", "X",\
                "Y", "Z"]

    repl_dict = dict(zip([a+" " for a in alphabet], [""]*26))
    repls = []
    for name in names.split(", "):
        repl = multiple_replace(repl_dict, name.strip().replace(") ", "").replace("( ", ""))
        if len(repl.split()) == 1:
            repl = name[0] + " " + repl
        repl = repl.replace(";", "").replace(":", "").replace(".", "").replace(",", "")
        repl = repl.split(" ")

        for idx in range(len(repl)):
            if len(repl[idx]) <= 1 and repl[idx] not in alphabet:
                repl[idx] = "A"

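        # NOTE: `client` is not defined in the cells shown here; it is assumed to be
        # an already-authenticated Namsor API client created elsewhere
        response = client.origin(repl[0], repl[1])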
        repls.append(response.country_origin)

    return repls

countries = pd.read_csv("../input/researcher-countries/countries.csv").values[:, 0].tolist()

### Number of research papers vs. Country of origin

In [None]:
cont_list = sorted(list(set(countries)))
counts = [countries.count(cont) for cont in cont_list]
df = pd.DataFrame(np.transpose([cont_list, counts]))
df.columns = ["Country of origin", "Count"]
px.bar(df, x="Country of origin", y="Count", title="Country of origin of researchers", template="simple_white")

From the above graph, we can see that the UK and Ireland have the most researchers, closely followed by China, South Korea, and Japan. Other European countries like Germany, Italy, and France are not far behind.

In [None]:
codes = [pycountry.countries.get(alpha_2=con).name for con in cont_list]
df["Codes"] = codes
df["Count"] = df["Count"].apply(int)
fig = px.scatter_geo(df, locations="Codes", size='Count', hover_name="Country of origin",
                     projection="natural earth", locationmode="country names", title="Country of origin of researchers", color="Count",
                     template="plotly")
fig.show()

In the world map above, we can see that the regions most affected by coronavirus, namely China, South Korea, and Europe, generally seem to have more coronavirus researchers. You may notice that the USA and Canada are not assigned any researchers. This might be because the API wrongly classifies some American and Canadian researchers as European, Indian, etc., solely based on their names. After all, a name-based country of origin is only an approximate measure of where research is produced. Nevertheless, it is amazing to see so many people dedicating their time to solving such an important problem. I wish them all the best!

## Abstracts <a id="1.3"></a>

Every research paper has an abstract at the start, which briefly summarizes the contents and ideas presented in the paper. These abstracts can be a great source of insights and solutions (as we will see later). First, I will do some basic visualization of the abstracts in the dataset.

### Abstract words distribution

In [None]:
def new_len(x):
    if type(x) is str:
        return len(x.split())
    else:
        return 0

papers_df["abstract_words"] = papers_df["abstract"].apply(new_len)
nums = papers_df.query("abstract_words != 0 and abstract_words < 500")["abstract_words"]
fig = ff.create_distplot(hist_data=[nums],
                         group_labels=["All abstracts"],
                         colors=["darkorange"])
fig.update_layout(title_text="Abstract words", xaxis_title="Abstract words", template="simple_white", showlegend=False)
fig.show()

In the above distribution plot, we can see that the abstract length has a roughly normal distribution with several minor peaks on either side of the mean. The probability density peaks at around 200 words, indicating that this is the most common abstract length.
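
As a rough illustration of the counting step behind this plot, the sketch below applies the same whitespace-based word count to a few made-up abstracts and summarizes the non-empty ones. The toy texts and the name `word_count` are hypothetical and are not part of the notebook's data or code.

```python
from statistics import mean, median

def word_count(text):
    """Return the number of whitespace-separated tokens; non-strings count as 0."""
    return len(text.split()) if isinstance(text, str) else 0

# Hypothetical toy abstracts (not taken from the corpus); None mimics a missing abstract.
toy_abstracts = [
    "Coronaviruses are enveloped RNA viruses",
    "We report a case series of viral pneumonia in adults",
    None,
    "Background and methods",
]

counts = [word_count(a) for a in toy_abstracts]
# Drop empty abstracts, mirroring the `abstract_words != 0` filter used for the plot.
non_empty = [c for c in counts if c != 0]

print(counts)
print(mean(non_empty), median(non_empty))
```

On the real corpus, comparing the mean and the median in this way is a quick check of how symmetric the length distribution actually is.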

In [None]:
biorxiv_df["abstract_words"] = biorxiv_df["abstract"].apply(new_len)
nums_1 = biorxiv_df.query("abstract_words != 0 and abstract_words < 500")["abstract_words"]
pmc_df["abstract_words"] = pmc_df["abstract"].apply(new_len)
nums_2 = pmc_df.query("abstract_words != 0 and abstract_words < 500")["abstract_words"]
comm_use_df["abstract_words"] = comm_use_df["abstract"].apply(new_len)
nums_3 = comm_use_df.query("abstract_words != 0 and abstract_words < 500")["abstract_words"]
noncomm_use_df["abstract_words"] = noncomm_use_df["abstract"].apply(new_len)
nums_4 = noncomm_use_df.query("abstract_words != 0 and abstract_words < 500")["abstract_words"]
fig = ff.create_distplot(hist_data=[nums_1, nums_2, nums_3, nums_4],
                         group_labels=["Biorxiv", "PMC", "Commercial", "Non-commercial"],
                         colors=px.colors.qualitative.Plotly[4:], show_hist=False)

fig.update_layout(title_text="Abstract words vs. Paper type", xaxis_title="Abstract words", template="plotly_white")
fig.show()

This plot shows the abstract length distribution for different research paper types (bioRxiv, PMC, Commercial, and Non-commercial). The abstracts of commercial papers seem to be the longest on average, followed by non-commercial, bioRxiv, and PMC abstracts (in descending order).

### Sentiment and polarity

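Sentiment and polarity are quantities that reflect the emotion and intention behind a sentence. Now, I will look at the sentiment of the paper abstracts using the VADER sentiment analyzer (`SentimentIntensityAnalyzer`) from the NLTK library, which returns four scores for each text: `neg`, `neu`, `pos`, and `compound`.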

<center><img src="https://i.imgur.com/LQF5WsC.png" width="800px"></center>

In [None]:
SIA = SentimentIntensityAnalyzer()

def polarity(x):
    # Return VADER's score dict for strings and None for missing abstracts (NaN)
    if isinstance(x, str):
        return SIA.polarity_scores(x)
    return None

def abstract_polarities(df):
    # Score every abstract in `df` and drop the missing ones
    return [pol for pol in df["abstract"].apply(polarity) if pol is not None]

polarity_0 = abstract_polarities(papers_df)
polarity_1 = abstract_polarities(biorxiv_df)
polarity_2 = abstract_polarities(pmc_df)
polarity_3 = abstract_polarities(comm_use_df)
polarity_4 = abstract_polarities(noncomm_use_df)

### Negative sentiment

Negative sentiment refers to negative or pessimistic emotions. It is a score between 0 and 1; the greater the score, the more negative the abstract is.

In [None]:
fig = go.Figure(go.Histogram(x=[pols["neg"] for pols in polarity_0 if pols["neg"] < 0.15], marker=dict(
        color='seagreen'
    )))
fig.update_layout(xaxis_title="Negativity sentiment", title_text="Negativity sentiment", template="simple_white")
fig.show()

From the above plot, we can see that the negative sentiment has maximum probability mass at 0, indicating that a large number of articles show no negativity. The distribution also has a slight rightward (positive) skew, indicating that lower values of negativity are more likely. Very few abstracts have negativity greater than 0.15, indicating that most abstracts do not project a negative view.
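
To make the notion of skew concrete, here is a minimal sketch that computes the moment-based skewness coefficient with NumPy on made-up scores. The values in `toy_neg` and the helper name `sample_skewness` are assumptions for illustration only; the formula should correspond to the default (biased) estimator of `scipy.stats.skew`.

```python
import numpy as np

def sample_skewness(values):
    """Moment-based skewness: third central moment over the 1.5 power of the second."""
    x = np.asarray(values, dtype=float)
    centered = x - x.mean()
    return float(np.mean(centered**3) / np.mean(centered**2) ** 1.5)

# Hypothetical negativity scores: many zeros and a thin tail of larger values.
toy_neg = [0.0, 0.0, 0.0, 0.01, 0.02, 0.03, 0.05, 0.08, 0.14]

skew = sample_skewness(toy_neg)
print(round(skew, 3))
```

A positive result corresponds to the rightward skew described above, while a symmetric sample gives a value of zero.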

In [None]:
fig = ff.create_distplot(hist_data=[[pol["neg"] for pol in pols if pol["neg"] < 0.15] for pols in [polarity_1, polarity_2, polarity_3, polarity_4]],
                         group_labels=["Biorxiv", "PMC", "Commercial", "Non-commercial"],
                         colors=px.colors.qualitative.Plotly[4:], show_hist=False)

fig.update_layout(title_text="Negativity sentiment vs. Paper type", xaxis_title="Negativity sentiment", template="plotly_white")
fig.show()

This plot shows the negative sentiment distribution for different types of abstracts. They all have a strong rightward (positive) skew, once again indicating that negativity is usually on the lower side. The commercial and PMC abstracts seem to have more negativity (relatively) than bioRxiv and non-commercial abstracts.

In [None]:
# NOTE: a constant 0.03 is subtracted from every mean, so bar heights are offset from the true averages
fig = go.Figure(go.Bar(x=["Biorxiv", "PMC", "Commercial", "Non-commercial"], y=[np.mean(x) - 0.03 for x in [[pol["neg"] for pol in pols] for pols in [polarity_1, polarity_2, polarity_3, polarity_4]]], marker=dict(color=px.colors.qualitative.Plotly[4:])))
fig.update_layout(xaxis_title="Paper type", yaxis_title="Average negativity", title_text="Average negativity vs. Paper type", template="plotly_white")
fig.show()

The bar plot above confirms that commercial and PMC abstracts tend to be more negative on average.

### Positive sentiment

Positive sentiment refers to positive or optimistic emotions. It is a score between 0 and 1; the greater the score, the more positive the abstract is.

In [None]:
fig = go.Figure(go.Histogram(x=[pols["pos"] for pols in polarity_0 if pols["pos"] < 0.15], marker=dict(
        color='indianred'
    )))
fig.update_layout(xaxis_title="Positivity sentiment", title_text="Positivity sentiment", template="simple_white")
fig.show()

From the above plot, we can see that the positive sentiment has maximum probability mass at 0 (similar to the negative sentiment distribution), indicating that a large number of articles show no positivity. The distribution also has a very slight rightward (positive) skew, indicating that lower values of positivity are more likely. Very few abstracts have positivity greater than 0.15, indicating that most abstracts do not project a positive view. These results make sense because research is meant to be objective and not project emotions, whether positive or negative.

In [None]:
fig = ff.create_distplot(hist_data=[[pol["pos"] for pol in pols if pol["pos"] < 0.15] for pols in [polarity_1, polarity_2, polarity_3, polarity_4]],
                         group_labels=["Biorxiv", "PMC", "Commercial", "Non-commercial"],
                         colors=px.colors.qualitative.Plotly[4:], show_hist=False)
fig.update_layout(title_text="Positivity sentiment vs. Paper type", xaxis_title="Positivity sentiment", template="plotly_white")
fig.show()

This plot shows the positive sentiment distribution for different types of abstracts. They all have a slight rightward (positive) skew, once again indicating that positivity is usually on the lower side. The commercial and bioRxiv abstracts seem to be more positive (relatively) than PMC and non-commercial abstracts.

In [None]:
# NOTE: a constant 0.04 is subtracted from every mean, so bar heights are offset from the true averages
fig = go.Figure(go.Bar(x=["Biorxiv", "PMC", "Commercial", "Non-commercial"], y=[np.mean(x) - 0.04 for x in [[pol["pos"] for pol in pols] for pols in [polarity_1, polarity_2, polarity_3, polarity_4]]], marker=dict(color=px.colors.qualitative.Plotly[4:])))
fig.update_layout(xaxis_title="Paper type", yaxis_title="Average positivity", title_text="Average positivity vs. Paper type", template="plotly_white")
fig.show()

The bar plot above confirms that commercial and bioRxiv abstracts tend to be more positive on average.

### Neutrality sentiment

Neutrality sentiment refers to the level of bias or opinion in the text. It is a score between 0 and 1; the greater the score, the more neutral or unbiased the abstract is.

In [None]:
fig = go.Figure(go.Histogram(x=[pols["neu"] for pols in polarity_0], marker=dict(
        color='dodgerblue'
    )))
fig.update_layout(xaxis_title="Neutrality sentiment", title_text="Neutrality sentiment", template="simple_white")
fig.show()

From the above plot, we can see that the neutrality sentiment distribution has a strong leftward (negative) skew, which is in contrast to the negativity and positivity sentiment distributions. There is also a significant peak at 1 (the maximum value). This suggests that the abstracts tend to be very neutral and unbiased in general, which is great news. After all, research papers are meant to spread facts and not opinion.

In [None]:
fig = ff.create_distplot(hist_data=[[pol["neu"] for pol in pols if pol["neu"]] for pols in [polarity_1, polarity_2, polarity_3, polarity_4]],
                         group_labels=["Biorxiv", "PMC", "Commercial", "Non-commercial"],
                         colors=px.colors.qualitative.Plotly[4:], show_hist=False)
fig.update_layout(title_text="Neutrality sentiment vs. Paper type", xaxis_title="Neutrality sentiment", template="plotly_white")
fig.show()

In the plot above, we can see the neutrality sentiment distribution for different paper types. They all have a strong leftward (negative) skew, once again indicating that neutrality is usually on the higher side. The non-commercial and bioRxiv abstracts seem to be more neutral (relatively) than PMC and commercial abstracts.

In [None]:
# NOTE: a constant 0.85 is subtracted from every mean, so bar heights are offset from the true averages
fig = go.Figure(go.Bar(x=["Biorxiv", "PMC", "Commercial", "Non-commercial"], y=[np.mean(x) - 0.85 for x in [[pol["neu"] for pol in pols] for pols in [polarity_1, polarity_2, polarity_3, polarity_4]]], marker=dict(color=px.colors.qualitative.Plotly[4:])))
fig.update_layout(xaxis_title="Paper type", yaxis_title="Average neutrality", title_text="Average neutrality vs. Paper type", template="plotly_white")
fig.show()

The bar plot above confirms that non-commercial and bioRxiv abstracts tend to be more neutral on average.

### Compoundness sentiment

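Compoundness sentiment refers to VADER's `compound` score, an overall sentiment value obtained by summing the valence of the words in the text and normalizing the result. It is a score between -1 and 1; the closer the score is to 1, the more positive the abstract is overall, and the closer it is to -1, the more negative it is.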
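
As far as I understand VADER's reference implementation, the `compound` value is produced by summing word valences and squashing the sum with x / sqrt(x² + α), where α defaults to 15. The sketch below reproduces only that normalization step; the function name `normalize_valence` and the input totals are hypothetical, and the lexicon lookup and heuristics that produce the sum are left out.

```python
import math

def normalize_valence(score, alpha=15.0):
    """Map an unbounded summed valence into the open interval (-1, 1)."""
    return score / math.sqrt(score * score + alpha)

# Hypothetical summed valences for three texts (illustrative numbers only).
for total in (-4.0, 0.0, 4.0):
    print(total, round(normalize_valence(total), 4))
```

Because the valences are summed before normalization, a long text with many mildly positive words can reach a compound score close to 1 even when its `pos` proportion is small.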

In [None]:
fig = go.Figure(go.Histogram(x=[pols["compound"] for pols in polarity_0], marker=dict(
        color='orchid'
    )))
fig.update_layout(xaxis_title="Compoundness sentiment", title_text="Compoundness sentiment", template="simple_white")
fig.show()

From the above plot, we can see that the compoundness sentiment distribution has a strong leftward (negative) skew (ignoring minor peaks at the opposite end), which is similar to the neutrality sentiment distribution. There is also a significant peak close to 1 (the maximum value). Since the compound score is a normalized aggregate of the positive and negative word valences in a text (not a measure of linguistic complexity), this suggests that the overall sentiment of most abstracts leans positive once the weak signals across a long abstract are summed, even though the individual positivity and negativity scores are low.

In [None]:
fig = ff.create_distplot(hist_data=[[pol["compound"] for pol in pols] for pols in [polarity_1, polarity_2, polarity_3, polarity_4]],
                         group_labels=["Biorxiv", "PMC", "Commercial", "Non-commercial"],
                         colors=px.colors.qualitative.Plotly[4:], show_hist=False)
fig.update_layout(title_text="Compoundness sentiment vs. Paper type", xaxis_title="Compoundness sentiment", template="plotly_white")
fig.show()

In the plot above, we can see the compoundness sentiment distribution for different paper types. They all have a strong leftward (negative) skew, once again indicating that the compound score is usually on the higher side. The commercial and bioRxiv abstracts seem to have higher compound scores (relatively), and hence a more positive overall sentiment, than PMC and non-commercial abstracts.

In [None]:
fig = go.Figure(go.Bar(x=["Biorxiv", "PMC", "Commercial", "Non-commercial"], y=[np.mean(x) for x in [[pol["compound"] for pol in pols] for pols in [polarity_1, polarity_2, polarity_3, polarity_4]]], marker=dict(color=px.colors.qualitative.Plotly[4:])))
fig.update_layout(xaxis_title="Paper type", yaxis_title="Average compoundness", title_text="Average compoundness vs. Paper type", template="plotly_white")
fig.show()

The bar plot above confirms that commercial and bioRxiv abstracts tend to have higher compound scores on average.

# Finding cures for COVID-19 <a id="2"></a>

Now, I will leverage the power of unsupervised machine learning to try and find possible cures (medicines and drugs) for COVID-19.

## Unsupervised NLP and Word2Vec <a id="2.1"></a>

Unsupervised NLP involves the analysis of unlabeled language data. Certain techniques can be used to derive insights from a large corpus of text. One such method is called **Word2Vec**. Word2Vec is a shallow neural network trained on a large number of sentences to predict words from their surrounding context. After training, the network provides a **vector representation** of each word in the corpus. These vectors are meant to reflect the meaning of the word: words that appear in similar contexts, and so tend to have similar meanings, end up with similar vectors.

<center><img src="https://i.imgur.com/sZP4N8S.png" width="800px"></center>

As I stated earlier, each word is associated with a vector. Amazingly, these vectors can also encode relationships and analogies between words. The diagram below illustrates some examples of linear vector relationships representing the relationships between words.

<center><img src="https://i.imgur.com/JHCOaan.png" width="800px"></center>

In the above image, we can see that word vectors can reflect relationships such as "King is to Queen as Man is to Woman" or "Italy is to Rome as Germany is to Berlin". These vectors can also be used to find unknown relationships between words. These unknown relationships may help us find latent knowledge in research papers and find drugs that can possibly cure COVID-19!
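
The analogy idea can be sketched with simple vector arithmetic. The example below uses hand-picked 3-D vectors (hypothetical values, not learned embeddings) in which the offset from "man" to "woman" equals the offset from "king" to "queen"; the helper names `cosine` and `analogy` are also illustrative.

```python
import numpy as np

# Hypothetical 3-D embeddings chosen by hand so that the "gender" offset is shared.
toy_vectors = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.1, 0.8]),
    "man":   np.array([0.2, 0.8, 0.1]),
    "woman": np.array([0.2, 0.1, 0.8]),
    "rome":  np.array([0.5, 0.5, 0.5]),
}

def cosine(u, v):
    """Cosine of the angle between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def analogy(a, b, c, vectors):
    """Return the word closest to vec(b) - vec(a) + vec(c), excluding a, b and c."""
    target = vectors[b] - vectors[a] + vectors[c]
    candidates = {w: v for w, v in vectors.items() if w not in (a, b, c)}
    return max(candidates, key=lambda w: cosine(target, candidates[w]))

# "man is to woman as king is to ?"
result = analogy("man", "woman", "king", toy_vectors)
print(result)
```

With real Word2Vec vectors the match is only approximate, which is why the query words themselves are usually excluded from the candidates.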

## Using Word2Vec to find cures <a id="2.2"></a>

We can take advantage of these intricate relationships between word vectors to find cures for COVID-19. The steps are as follows:

1. Find common words related to the study of COVID-19, such as "infection", "CoV", "viral", etc.
2. Find the words with the lowest Euclidean distance to these words (most similar words).
3. Finally, find the words most similar to these words (second order similarity). These words will hopefully contain potential COVID-19 cures.

Note that the similarity between two Word2Vec vectors is calculated using the formula below (where *u* and *v* are the word vectors).

<center><img src="https://i.imgur.com/wBuMMS9.png" width="450px"></center>
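
Assuming the formula pictured above is the usual cosine similarity, u·v / (‖u‖‖v‖), the sketch below computes it next to the Euclidean distance named in step 2. The vectors are made up for illustration, and the function names are hypothetical.

```python
import math

def cosine_similarity(u, v):
    """Dot product divided by the product of the vector lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def euclidean_distance(u, v):
    """Straight-line distance between two vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

u = [1.0, 2.0, 2.0]
v = [2.0, 4.0, 4.0]   # same direction as u, twice the length
w = [2.0, -1.0, 0.0]  # orthogonal to u

print(cosine_similarity(u, v), euclidean_distance(u, v))
print(cosine_similarity(u, w), euclidean_distance(u, w))
```

The two measures can rank neighbours differently: `v` is maximally similar to `u` by cosine similarity yet sits at a non-zero Euclidean distance, so the two only agree on ordering when the vectors are normalized to the same length.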

The entire process, consisting of the same steps as given above, can be summarized with the flowchart below.

<center><img src="https://i.imgur.com/l8b6enq.png" width="450px"></center>
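
The steps above can be sketched on a toy example. The vocabulary, the 2-D embeddings, and the names `nearest` and `second_order` below are all invented for illustration; the notebook itself works with 200-dimensional vectors loaded from a file.

```python
import numpy as np

# Hypothetical vocabulary and 2-D embeddings, invented for illustration.
vocab = np.array(["virus", "infection", "antiviral", "drug_a", "drug_b", "weather"])
emb = np.array([
    [1.0, 0.0],
    [0.9, 0.2],
    [0.7, 0.6],
    [0.6, 0.8],
    [0.5, 0.9],
    [-1.0, -1.0],
])

def nearest(word, k):
    """Return the k words with the smallest Euclidean distance to `word`."""
    idx = int(np.where(vocab == word)[0][0])
    dists = np.linalg.norm(emb - emb[idx], axis=1)
    order = np.argsort(dists)
    return [str(vocab[i]) for i in order if i != idx][:k]

def second_order(keyword, k):
    """First-order neighbours of `keyword`, then the neighbours of each of those."""
    return {w: nearest(w, k) for w in nearest(keyword, k)}

hops = second_order("virus", k=2)
print(hops)
```

In this toy setup the second hop surfaces the made-up "drug" tokens that are not among the direct neighbours of the keyword, which is the effect the second-order search is hoping for.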

The approach detailed above is inspired by a research paper called ["Unsupervised word embeddings capture latent knowledge from materials science literature"](https://www.nature.com/articles/s41586-019-1335-8), where the authors identify candidate materials with desirable properties (such as thermoelectricity) solely from word embeddings trained on a large corpus of materials science literature. According to the paper, such embeddings could have recommended materials for these applications years before they were reported in the literature. I hope to emulate the same method to look for COVID-19 cures. The diagram below illustrates what the authors did in their research.

<center><img src="https://i.imgur.com/TjXOhuJ.png" width="400px"></center>

In the diagram above, we can see that the authors found two levels of words similar to "thermoelectric" in a hierarchical manner. The second order similar words contained compounds like Li<sub>2</sub>CuSb, Cu<sub>7</sub>Te<sub>5</sub>, and CsAgGa<sub>2</sub>Se<sub>4</sub>, which turned out to be very good thermoelectric materials in real life.

### Word cloud of abstracts

In [None]:
def nonan(x):
    if type(x) == str:
        return x.replace("\n", "")
    else:
        return ""

text = ' '.join([nonan(abstract) for abstract in papers_df["abstract"]])
wordcloud = WordCloud(max_font_size=None, background_color='white', collocations=False,
                      width=1200, height=1000).generate(text)
fig = px.imshow(wordcloud)
fig.update_layout(title_text='Common words in abstracts')

First, we need to find the most common words in the corpus to continue our analysis. From the word cloud above, we can see that "infection", "cell", "virus", and "protein" are among the most common words in COVID-19 research paper abstracts. These words will form our "keyword" list.
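
A word cloud is a visual version of a frequency table, so the same keyword selection can be sketched by counting tokens directly. The mini-corpus and the tiny stop-word set below are hypothetical stand-ins for the abstracts and for NLTK's English stop-word list.

```python
from collections import Counter

# Hypothetical mini-corpus standing in for the abstracts.
toy_abstracts = [
    "viral infection of the host cell",
    "the virus protein binds the cell receptor",
    "infection control and the virus",
]
# Small illustrative stop-word set (the notebook uses NLTK's English list instead).
stop_words = {"the", "of", "and"}

tokens = [
    token
    for text in toy_abstracts
    for token in text.lower().split()
    if token not in stop_words
]
top = Counter(tokens).most_common(3)
print(top)
```

Counting has the advantage of giving exact numbers, which makes it easier to decide where to cut off the keyword list.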

In [None]:
def nltk_tag_to_wordnet_tag(nltk_tag):
    if nltk_tag.startswith('J'):
        return wordnet.ADJ

    elif nltk_tag.startswith('V'):
        return wordnet.VERB

    elif nltk_tag.startswith('N'):
        return wordnet.NOUN

    elif nltk_tag.startswith('R'):
        return wordnet.ADV

    else:          
        return None

def lemmatize_sentence(sentence):
    nltk_tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
    wordnet_tagged = map(lambda x: (x[0], nltk_tag_to_wordnet_tag(x[1])), nltk_tagged)
    lemmatized_sentence = []

    for word, tag in wordnet_tagged:
        if tag is None:
            lemmatized_sentence.append(word)
        else:
            lemmatized_sentence.append(lemmatizer.lemmatize(word, tag))

    return " ".join(lemmatized_sentence)

def clean_text(abstract):
    abstract = abstract.replace(". ", " ").replace(", ", " ").replace("! ", " ")\
                       .replace("? ", " ").replace(": ", " ").replace("; ", " ")\
                       .replace("( ", " ").replace(") ", " ").replace("| ", " ").replace("/ ", " ")
    if "." in abstract or "," in abstract or "!" in abstract or "?" in abstract or ":" in abstract or ";" in abstract or "(" in abstract or ")" in abstract or "|" in abstract or "/" in abstract:
        abstract = abstract.replace(".", " ").replace(",", " ").replace("!", " ")\
                           .replace("?", " ").replace(":", " ").replace(";", " ")\
                           .replace("(", " ").replace(")", " ").replace("|", " ").replace("/", " ")
    abstract = abstract.replace("  ", " ")
    
    for word in list(set(stopwords.words("english"))):
        abstract = abstract.replace(" " + word + " ", " ")

    return lemmatize_sentence(abstract).lower()

def get_similar_words(word, num):
    # Euclidean distance from `word` to every word in the vocabulary (one column per word)
    distances = np.linalg.norm(model_wv_df.subtract(model_wv_df[word], 
                                                    axis=0).values, axis=0)

    # Skip the first sorted index, which is the query word itself (distance 0)
    indices = np.argsort(distances)
    top_words = model_wv_vocab[indices[1:num+1]]
    return top_words

def visualize_word_list(color, word):
    top_words = get_similar_words(word, num=6)
    relevant_words = [get_similar_words(word, num=8) for word in top_words]
    fig = make_subplots(rows=3, cols=2, subplot_titles=tuple(top_words), vertical_spacing=0.05)
    for idx, word_list in enumerate(relevant_words):
        words = [word for word in word_list if word in model_wv_vocab]
        X = model_wv_df[words].T
        pca = PCA(n_components=2)
        result = pca.fit_transform(X)
        df = pd.DataFrame(result, columns=["Component 1", "Component 2"])
        df["Word"] = word_list
        word_emb = df[["Component 1", "Component 2"]].loc[0]
        df["Distance"] = np.sqrt((df["Component 1"] - word_emb["Component 1"])**2 + (df["Component 2"] - word_emb["Component 2"])**2)
        plot = px.scatter(df, x="Component 1", y="Component 2", text="Word", color="Distance", color_continuous_scale=color, size="Distance")
        plot.layout.title = top_words[idx]
        plot.update_traces(textposition='top center')
        plot.layout.xaxis.autorange = True
        fig.add_trace(plot.data[0], row=(idx//2)+1, col=(idx%2)+1)
    fig.layout.coloraxis.showscale = False
    fig.update_layout(height=1400, title_text="2D PCA of words related to {}".format(word), paper_bgcolor="#f0f0f0", template="plotly_white")
    fig.show()

def visualize_word(color, word):
    top_words = get_similar_words(word, num=20)
    words = [word for word in top_words if word in model_wv_vocab]
    X = model_wv_df[words].T
    pca = PCA(n_components=2)
    result = pca.fit_transform(X)
    df = pd.DataFrame(result, columns=["Component 1", "Component 2"])
    df["Word"] = top_words
    if word == "antimalarial":
        df = df.query("Word != 'anti-malarial' and Word != 'anthelmintic'")
    if word == "doxorubicin":
        df = df.query("Word != 'anti-rotavirus'")
    word_emb = df[["Component 1", "Component 2"]].loc[0]
    df["Distance"] = np.sqrt((df["Component 1"] - word_emb["Component 1"])**2 + (df["Component 2"] - word_emb["Component 2"])**2)
    fig = px.scatter(df, x="Component 1", y="Component 2", text="Word", color="Distance", color_continuous_scale=color, size="Distance")
    fig.layout.title = word
    fig.update_traces(textposition='top center')
    fig.layout.xaxis.autorange = True
    fig.layout.coloraxis.showscale = True
    fig.update_layout(height=800, title_text="2D PCA of words related to {}".format(word), template="plotly_white", paper_bgcolor="#f0f0f0")
    fig.show()

### Load pretrained Word2Vec model (200D vectors)

In [None]:
# lemmatizer = WordNetLemmatizer()

# def get_words(abstract):
    # return clean_text(nonan(abstract)).split(" ")

# words = papers_df["abstract"].progress_apply(get_words)
# model = Word2Vec(words, size=200, sg=1, min_count=1, window=8, hs=0, negative=15, workers=1)

model_wv = pd.read_csv("../input/word2vec-results/embed.csv").values
model_wv_vocab = pd.read_csv("../input/word2vec-results/vocab.csv").values[:, 0]
model_wv_df = pd.DataFrame(np.transpose(model_wv), columns=model_wv_vocab)

### Visualize most similar words to keywords

In [None]:
keywords = ["infection", "cell", "protein", "virus",\
            "disease", "respiratory", "influenza", "viral",\
            "rna", "patient", "pathogen", "human", "medicine",\
            "cov", "antiviral"]

print("Most similar words to keywords")
print("")

top_words_list = []
for jdx, word in enumerate(keywords):
    if jdx < 5:
        print(word + ":")
    
    # Euclidean distance from this keyword's vector to every word vector
    distances = np.linalg.norm(model_wv_df.subtract(model_wv_df[word], 
                                                    axis=0).values, axis=0)

    # Position 0 of the ranking is the keyword itself (distance 0), so skip it
    indices = np.argsort(distances)
    top_words = model_wv_vocab[indices[1:11]]
    top_words_list.append(top_words.tolist())
    
    if jdx < 5:
        # Use a separate name so the outer loop variable `word` is not overwritten
        for idx, similar_word in enumerate(top_words):
            print(str(idx+1) + ". " + similar_word)
        print("")

In the cell above, I have printed the 10 words most similar to each of the first 5 keywords (based on Euclidean distance); the neighbours of all 15 keywords are stored in `top_words_list`. These words form the next batch, which we will analyze to look for potential cures for COVID-19.
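
The lookup in the cell above boils down to a nearest-neighbour search over the embedding matrix. Below is a minimal sketch of that idea on toy data. The function name `nearest_words`, the `metric` switch, and the 2-D toy vectors are my own assumptions for illustration and are not part of the pretrained model; cosine distance is included only as a common alternative to the Euclidean distance used above.

```python
import numpy as np

def nearest_words(vocab, vectors, query, k=3, metric="euclidean"):
    """Return the k words closest to `query`, excluding the query itself.

    vocab   : list of words, one per row of `vectors`
    vectors : array-like of shape (n_words, dim)
    """
    vectors = np.asarray(vectors, dtype=float)
    q = vectors[vocab.index(query)]
    if metric == "euclidean":
        scores = np.linalg.norm(vectors - q, axis=1)
    elif metric == "cosine":
        norms = np.linalg.norm(vectors, axis=1) * np.linalg.norm(q)
        scores = 1.0 - (vectors @ q) / norms
    else:
        raise ValueError("unknown metric: " + metric)
    order = np.argsort(scores)  # smallest distance first
    return [vocab[i] for i in order if vocab[i] != query][:k]

# Toy 2-D vectors, invented for illustration only
toy_vocab = ["virus", "viral", "cov", "patient", "medicine"]
toy_vectors = [[1.0, 0.0], [0.9, 0.1], [0.8, 0.3], [0.0, 1.0], [0.1, 0.9]]

print(nearest_words(toy_vocab, toy_vectors, "virus", k=2))  # ['viral', 'cov']
```

Euclidean distance is sensitive to vector length, while cosine distance only compares direction, so the two rankings can differ on real embeddings even though they agree on this toy example.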

### PCA

PCA (principal component analysis) is a dimensionality reduction method that takes high-dimensional vectors and projects them onto a small number of directions (typically 2 or 3 for plotting) chosen to preserve as much of the variance in the original data as possible. This makes high-dimensional data, such as 200-dimensional Word2Vec vectors, much easier to visualize.

<center><img src="https://i.imgur.com/CKWFUyd.png" width="400px"></center>
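
To make the "linear algebra" behind PCA concrete, here is a small sketch that centres the data and projects it onto the top singular vectors. The helper name `pca_project` and the random stand-in data are assumptions for illustration; the notebook itself uses scikit-learn's `PCA`, whose output can differ from this sketch in the sign of each component.

```python
import numpy as np

def pca_project(X, n_components=2):
    """Project the rows of X onto the top principal components.

    Returns (projected, explained_ratio), where explained_ratio[i] is the
    share of total variance captured by component i.
    """
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)                           # centre every feature
    _, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    projected = Xc @ Vt[:n_components].T              # coordinates in the new basis
    variances = S ** 2
    explained_ratio = variances[:n_components] / variances.sum()
    return projected, explained_ratio

# Synthetic stand-in for word vectors: 15 "words" in 200 dimensions
rng = np.random.default_rng(0)
X = rng.normal(size=(15, 200))

coords, ratio = pca_project(X, n_components=2)
print(coords.shape)  # (15, 2)
print(ratio)         # variance share of each of the two components
```

The explained ratio is worth checking before reading too much into a 2D plot: if the two components capture only a small share of the variance, distances on the plot are a rough guide at best.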

### 2D PCA of keyword vectors

In [None]:
words = [word for word in keywords if word in model_wv_vocab]
X = model_wv_df[words].T
pca = PCA(n_components=2)
result = pca.fit_transform(X)
df = pd.DataFrame(result, columns=["Component 1", "Component 2"])
df["Word"] = words  # label rows with the words that survived the vocabulary filter
df["Distance"] = np.sqrt(df["Component 1"]**2 + df["Component 2"]**2)
fig = px.scatter(df, x="Component 1", y="Component 2", text="Word", color="Distance", color_continuous_scale="agsunset",size="Distance")
fig.update_traces(textposition='top center')
fig.layout.xaxis.autorange = True
fig.update_layout(height=800, title_text="2D PCA of Word2Vec embeddings", template="plotly_white", paper_bgcolor="#f0f0f0")
fig.show()

In the above plot, we can see the 2D PCA of the keywords' vectors.

1. The words "virus", "viral", and "CoV" form a cluster in the bottom-right part of the plot, indicating that they have similar meanings. This makes sense because CoV is a virus.
2. The words "medicine" and "patient" are both on the far left end of the image because these words are used together very frequently.
3. The words "pathogen", "influenza", and "respiratory" form a cluster in the bottom-left part of the plot, indicating that they appear in similar contexts. This makes sense because influenza is a respiratory disease caused by a pathogen.

These relationships are successfully represented by the word vectors.

### 3D PCA of keyword vectors

In [None]:
words = [word for word in keywords if word in model_wv_vocab]
X = model_wv_df[words].T
pca = PCA(n_components=3)
result = pca.fit_transform(X)
df = pd.DataFrame(result, columns=["Component 1", "Component 2", "Component 3"])
df["Word"] = words  # label rows with the words that survived the vocabulary filter
df["Distance"] = np.sqrt(df["Component 1"]**2 + df["Component 2"]**2 + df["Component 3"]**2)
fig = px.scatter_3d(df, x="Component 1", y="Component 2", z="Component 3", text="Word", color="Distance", color_continuous_scale="agsunset")
fig.update_traces(textposition='top left')
fig.layout.coloraxis.showscale = False
fig.layout.xaxis.autorange = True
fig.update_layout(height=800, title_text="3D PCA of Word2Vec embeddings", template="plotly")
fig.show()

I have plotted the 3D PCA above. The clustering looks very similar to that in the 2D PCA. Keeping more components preserves more of the variance in the original vectors, so distances in the plot are more faithful, but it comes at the cost of a less intuitive visualization.
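
One way to quantify the trade-off between 2 and 3 components is the cumulative share of variance they retain. The sketch below computes it from the singular values of the centred data. The helper name `cumulative_explained_variance` is hypothetical, and the data is random noise standing in for the real embeddings, so the printed percentages say nothing about the actual Word2Vec model.

```python
import numpy as np

def cumulative_explained_variance(X):
    """Cumulative share of variance retained by the first 1, 2, 3, ... components."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)                            # centre every feature
    singular_values = np.linalg.svd(Xc, compute_uv=False)
    variances = singular_values ** 2
    return np.cumsum(variances) / variances.sum()

# Random stand-in for 15 keyword vectors in 200 dimensions (illustration only)
rng = np.random.default_rng(42)
X = rng.normal(size=(15, 200))

cev = cumulative_explained_variance(X)
print("2 components retain {:.1%} of the variance".format(cev[1]))
print("3 components retain {:.1%} of the variance".format(cev[2]))
```

With scikit-learn, the same quantity should be available as the cumulative sum of `pca.explained_variance_ratio_` after fitting.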

### 2D PCA of words related to keywords

Now, I will pick a few keywords and analyze the PCA of the words most similar to them, drawing conclusions and inferences as I go.

### 2D PCA of words similar to influenza

In [None]:
words = [word for word in top_words_list[6] if word in model_wv_vocab]
X = model_wv_df[words].T
pca = PCA(n_components=2)
result = pca.fit_transform(X)
df = pd.DataFrame(result, columns=["Component 1", "Component 2"])
df["Word"] = top_words_list[6]
word_emb = df[["Component 1", "Component 2"]].loc[0]
df["Distance"] = np.sqrt((df["Component 1"] - word_emb[0])**2 + (df["Component 2"] - word_emb[1])**2)
fig = px.scatter(df.query("Word != 'uenza'"), x="Component 1", y="Component 2", text="Word", color="Distance", color_continuous_scale="aggrnyl",size="Distance")

"""for row in range(len(df)):
    fig.add_shape(
                type="line",
                x0=word_emb[0],
                y0=word_emb[1],
                x1=df["Component 1"][row],
                y1=df["Component 2"][row],
                line=dict(
                    color="Green",
                    width=0.75,
                    dash="dot"
                )
    )"""

fig.update_traces(textposition='top center')
fig.layout.xaxis.autorange = True
fig.update_layout(height=800, title_text="2D PCA of words related to {}".format(keywords[6]), template="plotly_white", paper_bgcolor="#f0f0f0")
fig.show()

I have plotted the 2D PCA of the words most similar to influenza above.

1. The words "H2N2", "PDM", "PDM2009", "H7N7", and "swine origin" form a very dense cluster in the bottom-left corner of the plot. This makes sense because H2N2 and H7N7 are both subtypes of influenza A, while "PDM2009" and "swine origin" both point to the swine-origin H1N1 strain behind the 2009 pandemic. Note that "PDM" stands for pandemic.
2. The remaining words are very far away from this cluster. For example, the word "flu" is far away from this cluster because it is a general term which is not equivalent to any specific type of flu or influenza.

### 2D PCA of words similar to RNA

In [None]:
words = [word for word in top_words_list[8] if word in model_wv_vocab]
X = model_wv_df[words].T
pca = PCA(n_components=2)
result = pca.fit_transform(X)
df = pd.DataFrame(result, columns=["Component 1", "Component 2"])
df["Word"] = top_words_list[8]
word_emb = df[["Component 1", "Component 2"]].loc[0]
df["Distance"] = np.sqrt((df["Component 1"] - word_emb[0])**2 + (df["Component 2"] - word_emb[1])**2)
fig = px.scatter(df[1:].query("Word != 'abstractrna'"), x="Component 1", y="Component 2", text="Word", color="Distance", color_continuous_scale="agsunset",size="Distance")

"""for row in range(len(df)):
    fig.add_shape(
                type="line",
                x0=word_emb[0],
                y0=word_emb[1],
                x1=df["Component 1"][row],
                y1=df["Component 2"][row],
                line=dict(
                    color="MediumPurple",
                    width=0.75,
                    dash="dot"
                )
    )"""

fig.update_traces(textposition='top center')
fig.layout.xaxis.autorange = True
fig.update_layout(height=800, title_text="2D PCA of words related to {}".format(keywords[8]), template="plotly_white", paper_bgcolor="#f0f0f0")
fig.show()

I have plotted the 2D PCA of the words most similar to RNA above. We cannot see any clear clustering in the plot, but several words closely related to RNA do appear, such as "ssRNA" (single-stranded RNA) and "vRNA" (viral RNA), which are types of RNA (ribonucleic acid). We also see words like "negative-strand" and "negative-sense". Put together, these terms make sense because they are deeply related: the genome of the influenza virus is in fact composed of eight negative-strand vRNA segments!

### 2D PCA of words similar to CoV

In [None]:
words = [word for word in top_words_list[-2] if word in model_wv_vocab]
X = model_wv_df[words].T
pca = PCA(n_components=2)
result = pca.fit_transform(X)
df = pd.DataFrame(result, columns=["Component 1", "Component 2"])
df["Word"] = top_words_list[-2]
word_emb = df[["Component 1", "Component 2"]].loc[0]
df["Distance"] = np.sqrt((df["Component 1"] - word_emb[0])**2 + (df["Component 2"] - word_emb[1])**2)
fig = px.scatter(df[1:], x="Component 1", y="Component 2", text="Word", color="Distance", color_continuous_scale="oryel",size="Distance")


"""for row in range(len(df)):
    fig.add_shape(
                type="line",
                x0=word_emb[0],
                y0=word_emb[1],
                x1=df["Component 1"][row],
                y1=df["Component 2"][row],
                line=dict(
                    color="Orange",
                    width=0.75,
                    dash="dot"
                )
    )"""

fig.update_traces(textposition='top center')
fig.layout.xaxis.autorange = True
fig.update_layout(height=800, title_text="2D PCA of words related to {}".format(keywords[-2]), template="plotly_white", paper_bgcolor="#f0f0f0")
fig.show()

I have plotted the 2D PCA of the words most similar to CoV (stands for **CO**rona**V**irus) above.

1. We can see a few words like "coronavirus", "SARS-CoV", and "coronaviral", which are almost synonymous with CoV. As one would hope, these words are very close to "CoV" in the vector space.
2. We can also see a clear cluster in the bottom-left corner of the plot, and these words are also closely linked with the word "CoV".

### 2D PCA of words related to virus

In [None]:
words = [word for word in top_words_list[3] if word in model_wv_vocab]
X = model_wv_df[words].T
pca = PCA(n_components=2)
result = pca.fit_transform(X)
df = pd.DataFrame(result, columns=["Component 1", "Component 2"])
df["Word"] = top_words_list[3]
word_emb = df[["Component 1", "Component 2"]].loc[0]
df["Distance"] = np.sqrt((df["Component 1"] - word_emb[0])**2 + (df["Component 2"] - word_emb[1])**2)
fig = px.scatter(df[1:], x="Component 1", y="Component 2", text="Word", color="Distance", color_continuous_scale="bluered",size="Distance")

"""for row in range(len(df)):
    fig.add_shape(
                type="line",
                x0=word_emb[0],
                y0=word_emb[1],
                x1=df["Component 1"][row],
                y1=df["Component 2"][row],
                line=dict(
                    color="Purple",
                    width=0.75,
                    dash="dot"
                )
    )"""

fig.update_traces(textposition='top right')
fig.layout.xaxis.autorange = True
fig.update_layout(height=800, title_text="2D PCA of words related to {}".format(keywords[3]), template="plotly_white", paper_bgcolor="#f0f0f0")
fig.show()

I have plotted the 2D PCA of the words most similar to virus above. We cannot see any clear clustering, but we do see many types of viruses, such as "pneumovirus", "lyssavirus", "pox", "CPIV", and "HHV", appearing in the plot.

Now that we have visualized the PCA of the words most similar to certain keywords, let us use the same strategy to look for a possible medicine for COVID-19.

### 2D PCA of words related to antiviral

In [None]:
words = [word for word in top_words_list[-1] if word in model_wv_vocab]
X = model_wv_df[words].T
pca = PCA(n_components=2)
result = pca.fit_transform(X)
df = pd.DataFrame(result, columns=["Component 1", "Component 2"])
df["Word"] = top_words_list[-1]
word_emb = df[["Component 1", "Component 2"]].loc[0]
df["Distance"] = np.sqrt((df["Component 1"] - word_emb[0])**2 + (df["Component 2"] - word_emb[1])**2)
fig = px.scatter(df[2:], x="Component 1", y="Component 2", text="Word", color="Distance", color_continuous_scale="viridis",size="Distance")

"""for row in range(len(df)):
    fig.add_shape(
                type="line",
                x0=word_emb[0],
                y0=word_emb[1],
                x1=df["Component 1"][row],
                y1=df["Component 2"][row],
                line=dict(
                    color="Purple",
                    width=0.75,
                    dash="dot"
                )
    )"""

fig.update_traces(textposition='top right')
fig.layout.xaxis.autorange = True
fig.update_layout(height=800, title_text="2D PCA of words related to {}".format(keywords[-1]), template="plotly_white", paper_bgcolor="#f0f0f0")
fig.show()

I have plotted the 2D PCA of the words most similar to antiviral above. We can see many different types of antivirals and other drugs in the plot, such as "saracatinib", a kinase inhibitor originally developed as an anticancer drug. The list also includes "antiparasitic", "anti-HBV", and "anti-EV71".

### Second-order word similarities

Now, I will look at the words similar to the words found above (second-order similarity) in the hope of finding potential cures for COVID-19.
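
Second-order similarity can be sketched as "take the neighbours of a seed word, then take the neighbours of each neighbour, and count which words keep coming back". The code below illustrates this on toy data. The function names, the placeholder vocabulary, and the 2-D vectors are all invented for illustration and do not come from the pretrained model or from the notebook's own `visualize_word_list` helper.

```python
from collections import Counter

import numpy as np

def neighbours(vocab, vectors, word, k):
    """k nearest words to `word` by Euclidean distance, excluding `word` itself."""
    q = vectors[vocab.index(word)]
    order = np.argsort(np.linalg.norm(vectors - q, axis=1))
    return [vocab[i] for i in order if vocab[i] != word][:k]

def second_order_counts(vocab, vectors, seed, k=2):
    """Count how often each word shows up among the neighbours of the seed's neighbours."""
    vectors = np.asarray(vectors, dtype=float)
    first_order = neighbours(vocab, vectors, seed, k)
    counts = Counter()
    for word in first_order:
        for candidate in neighbours(vocab, vectors, word, k):
            if candidate != seed:          # the seed is trivially close, so skip it
                counts[candidate] += 1
    return first_order, counts

# Placeholder vocabulary and 2-D vectors, invented for illustration only
toy_vocab = ["seed", "near_1", "near_2", "shared", "far"]
toy_vectors = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.1], [0.8, 0.8], [5.0, 5.0]]

first_order, counts = second_order_counts(toy_vocab, toy_vectors, "seed", k=2)
print(first_order)            # ['near_1', 'near_2']
print(counts.most_common(1))  # [('shared', 2)]
```

Words with a high count are the ones that "keep repeating" across the second-order plots, which is the signal used in the discussion that follows.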

### 2D PCA of words similar to words similar to antiviral

In [None]:
visualize_word_list('agsunset', 'antiviral')

We can see some striking patterns in the plots above. Certain drugs and chemicals keep recurring, including "anti-malarial", "hydroxychloroquine", and "doxorubicin". Interestingly, antimalarial drugs such as hydroxychloroquine have actually been tried experimentally on COVID-19 patients in several countries.

**The most common result above, "hydroxychloroquine", might just be the cure for COVID-19!**

1. Hydroxychloroquine (HCQ) is a medication used for the prevention and treatment of certain types of malaria, specifically chloroquine-sensitive malaria. Other uses include the treatment of rheumatoid arthritis, lupus, and porphyria cutanea tarda. It is taken by mouth. **It is also being used experimentally for COVID-19 as of 2020.**

<center><img src="https://upload.wikimedia.org/wikipedia/commons/a/a6/Hydroxychloroquine.svg" width="300px"></center>
<br>
<center><i>Chemical structure of hydroxychloroquine</i></center>

The drug is being investigated by researchers across the world as a possible treatment for the disease.

2. Doxorubicin is a chemotherapy medication used to treat cancer. This includes breast cancer, bladder cancer, Kaposi's sarcoma, lymphoma, and acute lymphocytic leukemia. It is often used together with other chemotherapy agents. Doxorubicin is given by injection into a vein. **It also shows antimalarial activity like hydroxychloroquine.**

<center><img src="https://upload.wikimedia.org/wikipedia/commons/d/d3/Doxorubicin.svg" width="300px"></center>
<br>
<center><i>Chemical structure of doxorubicin</i></center>

The drug is not known to be effective for COVID-19 as of now.

### Tweet by Elon Musk!

In [None]:
class Tweet(object):
    def __init__(self, s, embed_str=False):
        if not embed_str:
            # Use Twitter's oEmbed API
            # https://dev.twitter.com/web/embedded-tweets
            api = 'https://publish.twitter.com/oembed?url={}'.format(s)
            response = requests.get(api)
            self.text = response.json()["html"]
        else:
            self.text = s

    def _repr_html_(self):
        return self.text

Tweet("https://twitter.com/elonmusk/status/1239650597906898947")

In [None]:
Tweet("https://twitter.com/elonmusk/status/1239755145233289217")

Elon Musk says many people in the area seem to agree that antimalarial drugs like chloroquine may be the solution to COVID-19! There are also [articles](https://abcnews.go.com/Health/chloroquine-malaria-drug-treat-coronavirus-doctors/story?id=69664561) that point in this direction. In general, it seems like antimalarial drugs may work well for COVID-19. So let us look at some words similar to "antimalarial".

### 2D PCA of words similar to antimalarial

In [None]:
visualize_word('plotly3', 'antimalarial')

In the plot above, we can see the words most similar to antimalarial. These are different drugs and medicines that are used to combat malaria, which may work for COVID-19, such as "amodiaquine", "hydroxychloroquine", and "nitazoxanide".

# Finding ways to contain COVID-19 <a id="3"></a>

Now, I will look at the evolution of the virus in different countries and look at what strategies could be used to contain COVID-19.

## The current situation <a id="3.1"></a>

First, I will look at the current situation in five countries: Italy, China, US, Iran, and South Korea. **(as of March 18<sup>th</sup>, 2020)**

In [None]:
tbl = full_table.sort_values(by=["Country/Region", "Date"]).reset_index(drop=True)
tbl["Country"] = tbl["Country/Region"]
conts = sorted(list(set(tbl["Country"])))
dates = sorted(list(set(tbl["Date"])))

confirmed = []
for idx in range(len(conts)):
    confirmed.append(tbl.query('Country == "{}"'.format(conts[idx])).groupby("Date").sum()["Confirmed"].values)
confirmed = np.array(confirmed)

In [None]:
def visualize_country(fig, cont, image_link, colors, step, xcor, ycor, done=True, multiple=False, sizex=0.78, sizey=0.2):
    # Show the legend only while traces are still being added to a shared figure
    showlegend = not done
    for idx, color in enumerate(colors):
        fig.add_trace(go.Scatter(x=dates, y=confirmed[conts.index(cont)]-step*idx, showlegend=showlegend,
                    mode='lines+markers', name=cont,
                         marker=dict(color=colors[idx])))
    fig.add_layout_image(
        dict(
            source=image_link,
            xref="paper", yref="paper",
            x=xcor, y=ycor,
            sizex=sizex, sizey=sizey,
            xanchor="right", yanchor="bottom"
        )
    )
    title = "Confirmed cases in {}".format(cont) if done else "Confirmed cases"
    if multiple: title = "Confirmed cases"
    fig.update_layout(xaxis_title="Date", yaxis_title="Confirmed cases", title=title, template="plotly_white", paper_bgcolor="#f0f0f0")
    if done:
        fig.show()

### Italy

In [None]:
fig = go.Figure()
visualize_country(fig, "Italy", "https://upload.wikimedia.org/wikipedia/en/0/03/Flag_of_Italy.svg", colors=["green"], step=400, xcor=0.85, ycor=0.7)

The epidemic in Italy is in a very bad state right now. The number of cases is growing every day, and the entire nation is under lockdown due to the massive number of new cases being reported daily. The mortality rate is also very high in Italy due to the large elderly population. There are currently close to 35000 confirmed cases in Italy.

### China

In [None]:
fig = go.Figure()
visualize_country(fig, "China", "https://upload.wikimedia.org/wikipedia/commons/f/fa/Flag_of_the_People%27s_Republic_of_China.svg", colors=["red"], step=1000, xcor=0.85, ycor=0.65)

The initial epidemic in China spread very fast, with new cases in the thousands and new deaths in the hundreds every day. But through a series of measures, including community and industry lockdown throughout China, they have been able to reduce community transmission and "flatten the curve". On March 18<sup>th</sup> 2020, China reported 0 new cases. They successfully implemented measures at the right time to mitigate the virus.

### US

In [None]:
fig = go.Figure()
visualize_country(fig, "US", "https://upload.wikimedia.org/wikipedia/en/a/a4/Flag_of_the_United_States.svg", colors=["navy"], step=60, xcor=0.85, ycor=0.5) 

The situation in the US is also difficult at the time of writing. Delays in mass-scale testing, travel restrictions, and social distancing have resulted in a lot of community transmission. There are currently close to 10000 confirmed cases in the US, but the actual number may be higher.

### Iran

In [None]:
fig = go.Figure()
visualize_country(fig, "Iran", "https://upload.wikimedia.org/wikipedia/commons/c/ca/Flag_of_Iran.svg", colors=["indianred"], step=175, xcor=0.8, ycor=0.6)

Iran is also going through a terrible epidemic at the moment, and a shortage of healthcare and testing equipment is making matters worse. There are currently close to 18000 confirmed cases in Iran.

### South Korea

In [None]:
fig = go.Figure()
visualize_country(fig, "Korea, South", "https://upload.wikimedia.org/wikipedia/commons/0/09/Flag_of_South_Korea.svg", colors=["dodgerblue"], step=80, xcor=0.95, ycor=0.4)

South Korea had a large initial burst in cases, but over time, they have been able to successfully mitigate the spread of the virus and reduce community transmission through a series of smart policies. Since South Korea did not have the capacity to lock down the country (like China), they relied on mass testing and GPS-based quarantine tracking to mitigate the virus. Social distancing combined with thousands of tests every day has reduced the number of new cases dramatically.

### All 5 nations together

In [None]:
fig = go.Figure()
visualize_country(fig, "Italy", "https://upload.wikimedia.org/wikipedia/en/0/03/Flag_of_Italy.svg", colors=["green"], step=400, xcor=0.85, ycor=0.3, sizex=0.15, sizey=0.075, done=False)
visualize_country(fig, "US", "https://upload.wikimedia.org/wikipedia/en/a/a4/Flag_of_the_United_States.svg", colors=["navy"], step=60, xcor=0.999, ycor=0.05, sizex=0.1, sizey=0.065, done=False)
visualize_country(fig, "Iran", "https://upload.wikimedia.org/wikipedia/commons/c/ca/Flag_of_Iran.svg", colors=["indianred"], step=175, xcor=0.999, ycor=0.27, sizex=0.1, sizey=0.065, done=False)
visualize_country(fig, "Korea, South", "https://upload.wikimedia.org/wikipedia/commons/0/09/Flag_of_South_Korea.svg", colors=["dodgerblue"], step=80, xcor=0.7, ycor=0.17, sizex=0.15, sizey=0.075, done=False)
fig.update_layout(showlegend=False)
visualize_country(fig, "China", "https://upload.wikimedia.org/wikipedia/commons/f/fa/Flag_of_the_People%27s_Republic_of_China.svg", colors=["red"], step=1000, xcor=0.5, ycor=0.7, sizex=0.15, sizey=0.075, multiple=True)

When we look at the confirmed cases in all 5 countries together, we can see which countries have been able to contain the virus so far (South Korea and China), and which ones have not (Iran, Italy, and US).

## What can we learn from China and South Korea? <a id="3.2"></a>

China and South Korea took very different approaches to tackling the epidemic, but why and how did their strategies work? What can we learn from their response?

### China

In [None]:
fig = go.Figure()
visualize_country(fig, "China", "https://upload.wikimedia.org/wikipedia/commons/f/fa/Flag_of_the_People%27s_Republic_of_China.svg", colors=["red"], step=1000, xcor=0.85, ycor=0.65, done=False)
fig.add_shape(
        dict(
            type="line",
            x0=Timestamp('2020-02-13 00:00:00'),
            y0=50000,
            x1=Timestamp('2020-02-13 00:00:00'),
            y1=70000,
            line=dict(
                color="RoyalBlue",
                width=5
            )
))
fig.add_shape(
        dict(
            type="line",
            x0=Timestamp('2020-02-20 00:00:00'),
            y0=65000,
            x1=Timestamp('2020-02-20 00:00:00'),
            y1=85000,
            line=dict(
                color="Green",
                width=5
            )
))
fig.add_shape(
        dict(
            type="line",
            x0=Timestamp('2020-01-23 00:00:00'),
            y0=-10000,
            x1=Timestamp('2020-01-23 00:00:00'),
            y1=10000,
            line=dict(
                color="Orange",
                width=5
            )
))
fig.update_layout(title="Confirmed cases in China", showlegend=False)
fig.show()

I have plotted the number of confirmed cases in China over time above. The <font color="darkorange" font=3>orange</font> line marks when Wuhan was locked down, the <font color="blue" font=3>blue</font> line marks when factories were closed across China, and the <font color="green" font=3>green</font> line marks when complete (total) lockdown was imposed across China. Notice how the curve starts to flatten after the complete lockdown is imposed. Complete lockdown helps reduce community transmission and mitigate the virus.

**China relied on community and industry lockdown to control the virus.**
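
"Flattening" is easier to judge from daily increases than from a running total. Assuming the `Confirmed` column holds cumulative counts (as in the Johns Hopkins data), the sketch below turns a cumulative series into daily new cases and smooths it with a moving average. The helper names and the numbers are invented for illustration and are not taken from the dataset.

```python
import numpy as np

def daily_new_cases(cumulative):
    """Day-over-day increase of a cumulative series (one element shorter than the input)."""
    return np.diff(np.asarray(cumulative, dtype=float))

def rolling_mean(values, window=3):
    """Simple moving average over full windows only."""
    weights = np.ones(window) / window
    return np.convolve(values, weights, mode="valid")

# Invented cumulative counts for eight consecutive days (illustration only)
toy_cumulative = [10, 30, 70, 130, 180, 210, 225, 230]

new_cases = daily_new_cases(toy_cumulative)
print(new_cases)                  # [20. 40. 60. 50. 30. 15.  5.]
print(int(np.argmax(new_cases)))  # 2, the index of the largest daily increase
print(rolling_mean(new_cases, window=3))
```

In this toy series the cumulative curve keeps rising throughout, but the daily increase peaks early and then shrinks, which is exactly what a flattening curve looks like.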

### South Korea

In [None]:
fig = go.Figure()
visualize_country(fig, "Korea, South", "https://upload.wikimedia.org/wikipedia/commons/0/09/Flag_of_South_Korea.svg", colors=["dodgerblue"], step=80, xcor=0.95, ycor=0.4, done=False)
fig.add_shape(
        dict(
            type="line",
            x0=Timestamp('2020-02-29 00:00:00'),
            y0=2000,
            x1=Timestamp('2020-02-29 00:00:00'),
            y1=4000,
            line=dict(
                color="purple",
                width=5
            )
))
fig.add_shape(
        dict(
            type="line",
            x0=Timestamp('2020-03-06 00:00:00'),
            y0=5500,
            x1=Timestamp('2020-03-06 00:00:00'),
            y1=7500,
            line=dict(
                color="deeppink",
                width=5
            )
))
fig.update_layout(title="Confirmed cases in Korea, South", showlegend=False)
fig.show()

I have plotted the number of confirmed cases in South Korea over time above. The <font color="purple" font=3>purple</font> line marks when South Korea ramped up testing, and the <font color="deeppink" font=3>pink</font> line marks when a new GPS-enabled quarantine tracking app was deployed by the South Korean government. Together, these two measures have worked to reduce community transmission and flatten the curve towards the end of the first week of March.

**South Korea relied on mass testing and technology to control the virus.**

# Takeaways <a id="4"></a>

1. Several antimalarial drugs, such as hydroxychloroquine, might be potential drugs to cure COVID-19. Antimalarial drugs have been used experimentally on COVID-19 patients in certain countries.
2. The best ways to control the virus are **mass testing, partial or complete lockdown, and use of technology** (examples are China and South Korea).

# Ending note <a id="5"></a>

<font size=4 color="red"> If we all do our part in distancing ourselves socially, keeping ourselves hygienic, and using masks when needed, we can mitigate the spread of the virus and win this battle!</font>