# Analyzing the speed of spread of COVID-19 and publishing findings on dstack.ai

## Introduction

This notebook gives an overview of how one can...

* Use Python, pandas, and plot.ly to analyze and visualize data. This may be useful for beginner data scientists who want to learn Python in the context of data analysis.
* Use public data to analyze the influence of COVID-19.
* Use [dstack.ai](https://dstack.ai) API for Python to publish analysis findings and share them with others.

Note that the analysis in this notebook serves educational purposes and is not intended to be accurate. If you find a mistake in the analysis or have a question about the code or the results, please drop an email to `vitaly at dstack.ai`.

If you're just learning Python for data science and have no experience with [Jupyter notebooks](https://jupyter.org/), we recommend taking a look at them, as they are the most common way of working with data in Python.

## Importing pandas, plotly.express, and dstack libraries

When it comes to analyzing data, the essential thing is to have good libraries at hand. In our tutorial, we'll use two of the basic yet most important and popular libraries: [pandas](https://pandas.pydata.org/) (for data manipulation) and [plot.ly](https://plot.ly/python/) (for making interactive publication-quality visualizations).


While pandas is the de facto standard for data manipulation, there is more than one popular visualization library (e.g. [Matplotlib](https://matplotlib.org/) and [Bokeh](https://docs.bokeh.org/), to name a few). For the sake of simplicity, this tutorial uses `plotly.express`, a less verbose wrapper around `plotly`.

Finally, since we are going to publish our data and visualizations on [dstack.ai](https://dstack.ai), we'll use the [dstack Python library](https://pypi.org/project/dstack/).

In [34]:
import pandas as pd
import plotly.express as px
from dstack import create_frame

## Loading COVID-19 data of new confirmed cases

Another thing you cannot do data analysis without is clean data. If your data is not clean, you'll have to clean it yourself, e.g. by using `pandas`. Cleaning data is out of the scope of this notebook. In our case, we are going to use the data on confirmed cases of COVID-19 [compiled](https://data.humdata.org/dataset/novel-coronavirus-2019-ncov-cases) from various sources and updated by Johns Hopkins University.

As you'll see in the output of the next cell, the data provides the number of confirmed cases of COVID-19 per province/state, country, and date. Each date column holds the running total of confirmed cases reported up to that date.

In [35]:
url = "https://data.humdata.org/hxlproxy/api/data-preview.csv?url=https%3A%2F%2Fraw.githubusercontent.com%2FCSSEGISandData%2FCOVID-19%2Fmaster%2Fcsse_covid_19_data%2Fcsse_covid_19_time_series%2Ftime_series_19-covid-Confirmed.csv&filename=time_series_2019-ncov-Confirmed.csv"
df = pd.read_csv(url) # returns a "pandas dataframe"
df.head() # this function displays the first 5 rows of the dataframe; it helps check the format of the data and make sure that everything is correct

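| | Province/State | Country/Region | Lat | Long | 1/22/20 | 1/23/20 | 1/24/20 | 1/25/20 | 1/26/20 | 1/27/20 | ... | 3/9/20 | 3/10/20 | 3/11/20 | 3/12/20 | 3/13/20 | 3/14/20 | 3/15/20 | 3/16/20 | 3/17/20 | 3/18/20 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | | Thailand | 15.0 | 101.0 | 2 | 3 | 5 | 7 | 8 | 8 | ... | 50 | 53 | 59 | 70 | 75 | 82 | 114 | 147 | 177 | 212 |
| 1 | | Japan | 36.0 | 138.0 | 2 | 1 | 2 | 2 | 4 | 4 | ... | 511 | 581 | 639 | 639 | 701 | 773 | 839 | 825 | 878 | 889 |
| 2 | | Singapore | 1.2833 | 103.8333 | 0 | 1 | 3 | 3 | 4 | 5 | ... | 150 | 160 | 178 | 178 | 200 | 212 | 226 | 243 | 266 | 313 |
| 3 | | Nepal | 28.1667 | 84.25 | 0 | 0 | 0 | 1 | 1 | 1 | ... | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 4 | | Malaysia | 2.5 | 112.5 | 0 | 0 | 0 | 3 | 4 | 4 | ... | 117 | 129 | 149 | 149 | 197 | 238 | 428 | 566 | 673 | 790 |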
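
If the URL is unavailable, the same loading step can be tried on a small inline sample. The sketch below is only an illustration: the Thailand and Japan values are copied from the output above, while the `Exampleland` row and the names `sample_csv` and `sample_df` are made up to show what a row with a non-empty `Province/State` looks like.

```python
import io

import pandas as pd

# A tiny inline CSV with the same column layout as the real file.
# Thailand and Japan values come from the output above;
# the "Exampleland" row is invented purely for illustration.
sample_csv = """Province/State,Country/Region,Lat,Long,3/17/20,3/18/20
,Thailand,15.0,101.0,177,212
,Japan,36.0,138.0,878,889
Example Province,Exampleland,0.0,0.0,10,12
"""

# read_csv accepts any file-like object, not only URLs or paths
sample_df = pd.read_csv(io.StringIO(sample_csv))

print(sample_df.shape)
# empty Province/State fields are parsed as missing values (NaN)
print(sample_df["Province/State"].isnull().tolist())
```

Empty fields are read as missing values by default, which is why the rest of the notebook can use `isnull()` to find country-level rows.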


## Getting new cases for the last two days by country

One thing that might be of interest is how the situation has changed over the last two days in every country. To get this data, we'll use the `pandas` API to manipulate the dataframe, e.g. to drop unnecessary columns. In the code below, we keep only the rows that have no province/state and drop all columns except the country and the last two columns, which hold the confirmed cases for the last two days.

In [36]:
# this and the cells below show a few ways to manipulate a dataframe using pandas
# country + two most recent days (very simple, ignores the day of the week, etc.)
cols = [df.columns[1]] + list(df.columns[-2:])
# keep only the rows without a province/state, i.e. country-level rows
last_2_days = df[df["Province/State"].isnull()][cols].copy()
last_2_days # as you might've noticed above, the value of the last expression in a code cell is displayed in the output

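| | Country/Region | 3/17/20 | 3/18/20 |
|---|---|---|---|
| 0 | Thailand | 177 | 212 |
| 1 | Japan | 878 | 889 |
| 2 | Singapore | 266 | 313 |
| 3 | Nepal | 1 | 1 |
| 4 | Malaysia | 673 | 790 |
| ... | ... | ... | ... |
| 455 | Kyrgyzstan | 0 | 3 |
| 456 | Mauritius | 0 | 3 |
| 458 | Zambia | 0 | 2 |
| 459 | Djibouti | 0 | 1 |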
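
To see the column selection and the row filter in isolation, here is a minimal sketch on a hand-made dataframe. The Thailand and Japan numbers are taken from the outputs above; the `Exampleland` row and the names `sample_df` and `country_rows` are assumptions made for the example.

```python
import pandas as pd

# Hand-made sample with the same layout as the real data.
# The "Exampleland" row is invented to show a province-level row.
sample_df = pd.DataFrame({
    "Province/State": [None, None, "Example Province"],
    "Country/Region": ["Thailand", "Japan", "Exampleland"],
    "Lat": [15.0, 36.0, 0.0],
    "Long": [101.0, 138.0, 0.0],
    "3/16/20": [147, 825, 8],
    "3/17/20": [177, 878, 10],
    "3/18/20": [212, 889, 12],
})

# the country column (position 1) plus the last two date columns
cols = [sample_df.columns[1]] + list(sample_df.columns[-2:])

# boolean mask: True for rows that have no province/state
is_country_level = sample_df["Province/State"].isnull()

# apply the mask, keep the chosen columns, and copy to get an independent dataframe
country_rows = sample_df[is_country_level][cols].copy()

print(cols)
print(country_rows)
```

The `.copy()` matters when new columns are added afterwards: it avoids modifying a view of the original dataframe.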


## Getting increase in new cases over the last day

One reason why the data for two consecutive days is interesting is that it lets you calculate the increase in cases. The code below extends the dataframe obtained above with two new columns: the increase in the absolute number of cases (`delta`), and the relative increase (`delta%`, stored as a fraction of the previous day's count, so `0.19` means 19%). Countries with zero cases on the previous day get `inf` in this column:

In [37]:
d1 = last_2_days.columns[-1] # the most recent day
d2 = last_2_days.columns[-2] # the day before it

last_2_days["delta"] = last_2_days[d1] - last_2_days[d2]
last_2_days["delta%"] = last_2_days["delta"] / last_2_days[d2]

In [38]:
last_2_days # displaying the resulting dataframe

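| | Country/Region | 3/17/20 | 3/18/20 | delta | delta% |
|---|---|---|---|---|---|
| 0 | Thailand | 177 | 212 | 35 | 0.197740 |
| 1 | Japan | 878 | 889 | 11 | 0.012528 |
| 2 | Singapore | 266 | 313 | 47 | 0.176692 |
| 3 | Nepal | 1 | 1 | 0 | 0.000000 |
| 4 | Malaysia | 673 | 790 | 117 | 0.173848 |
| ... | ... | ... | ... | ... | ... |
| 455 | Kyrgyzstan | 0 | 3 | 3 | inf |
| 456 | Mauritius | 0 | 3 | 3 | inf |
| 458 | Zambia | 0 | 2 | 2 | inf |
| 459 | Djibouti | 0 | 1 | 1 | inf |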
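
The `inf` values appear because the previous day's count is zero, so the relative increase is undefined. The notebook leaves them as they are; one possible treatment, sketched here as an assumption rather than as part of the original analysis, is to mark them as missing. The values are copied from the output above.

```python
import numpy as np
import pandas as pd

# Three rows copied from the output above
sample = pd.DataFrame({
    "Country/Region": ["Thailand", "Nepal", "Kyrgyzstan"],
    "3/17/20": [177, 1, 0],
    "3/18/20": [212, 1, 3],
})

sample["delta"] = sample["3/18/20"] - sample["3/17/20"]
# dividing by a zero count yields inf (pandas does not raise an error)
sample["delta%"] = sample["delta"] / sample["3/17/20"]

# treat an undefined growth rate as a missing value
sample["delta%"] = sample["delta%"].replace([np.inf, -np.inf], np.nan)

print(sample)
```

With `NaN` instead of `inf`, these rows sort to the end by default and are skipped by aggregations such as `mean()`.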


## Publishing on dstack.ai

Now imagine that you'd like to publish the resulting data online and share it with other people.

You can do that by using the `dstack` library. The library provides an API to publish both `pandas` data frames and `plotly` visualizations (it also supports other visualization libraries).

A nice thing about `dstack` is that it lets you publish data and visualizations, and keeps them interactive: e.g. the user may change parameters and see the corresponding data.

Once the data is published, it can be accessed via a link. Other people can access it and comment.

In order to publish `pandas` dataframes or `plotly` figures, you have to create a dstack frame and specify the name of the stack. The stack can later be accessed via a URL, e.g. `https://dstack.ai/<user>/<stack>`. Every stack may have many frames. Frames are revisions of published data. Each stack points to its head frame, the latest version of the published data. A frame includes a list of attachments. Every attachment can be a visualization or a dataframe and may have its own parameters associated with it.
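
The relationship between stacks, frames, and attachments can be pictured with a few plain Python classes. This is only a mental model: the classes `Attachment`, `Frame`, and `Stack` below are hypothetical and are not part of the `dstack` library, whose internals may look quite different.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


# Hypothetical classes for illustration only; NOT the dstack API.
@dataclass
class Attachment:
    data: Any                  # a dataframe or a figure
    description: str
    params: Dict[str, Any] = field(default_factory=dict)


@dataclass
class Frame:
    # one revision of the published data
    attachments: List[Attachment] = field(default_factory=list)


@dataclass
class Stack:
    name: str
    frames: List[Frame] = field(default_factory=list)

    @property
    def head(self) -> Optional[Frame]:
        # the head is the most recently added frame, if any
        return self.frames[-1] if self.frames else None


stack = Stack("covid19/speed")
first = Frame([Attachment("table v1", "Top countries", {"Sort by": "delta"})])
second = Frame([
    Attachment("table v2a", "Top countries", {"Sort by": "delta"}),
    Attachment("table v2b", "Top countries", {"Sort by": "delta%"}),
])
stack.frames.extend([first, second])

# the stack now points to the second frame, which holds two attachments
print(len(stack.head.attachments))
```

In this picture, switching a parameter such as `Sort by` in the published view corresponds to picking a different attachment of the head frame.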

## Publishing top 50 countries by increase in new cases

Let us use a simple example to better understand the concept of stacks, frames and attachments.

1. We create a frame and give it a stack name `covid19/speed`. This means that this frame will be published to a stack that can be accessed via `https://dstack.ai/<user>/covid19/speed`.

_Note that the current user is configured via the `dstack` command line utility and is stored along with your dstack.ai secure token in a local user directory. You can learn how it can be installed at [docs.dstack.ai](https://docs.dstack.ai)._

2. For both cases, i.e. increase in absolute numbers and increase in %, we'll publish a separate dataframe so the user accessing the published data can switch between the two tables. Each dataframe is committed as a separate attachment along with a description and corresponding parameters.

3. We push the frame to send all attachments.

In [39]:
min_cases = 50
# create frame and set stack name
top_speed_frame = create_frame("covid19/speed")
# top countries
sort_by_cols = ["delta", "delta%"]
for col in sort_by_cols:
    top = last_2_days[last_2_days[last_2_days.columns[1]]>min_cases].sort_values(by=[col], ascending=False).head(50)
    # commit attachment
    top_speed_frame.commit(top, f"Top 50 countries with the fastest growing number of confirmed Covid-19 cases (at least {min_cases})", {"Sort by": col})

top_speed_frame.push()

'https://dstack.ai/cheptsov/covid19/speed'

You can see the resulting published data at [dstack.ai/cheptsov/covid19/speed](https://dstack.ai/cheptsov/covid19/speed).
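
The filtering and sorting inside the loop can be tried without publishing anything. The sketch below repeats that logic on five rows copied from the output earlier in this notebook and keeps the top 3 instead of the top 50; the names `sample`, `eligible`, and `tops` are made up for the example.

```python
import pandas as pd

# Five rows copied from the output earlier in this notebook
sample = pd.DataFrame({
    "Country/Region": ["Thailand", "Japan", "Singapore", "Nepal", "Malaysia"],
    "3/17/20": [177, 878, 266, 1, 673],
    "3/18/20": [212, 889, 313, 1, 790],
})
sample["delta"] = sample["3/18/20"] - sample["3/17/20"]
sample["delta%"] = sample["delta"] / sample["3/17/20"]

min_cases = 50
# keep only countries with more than min_cases on the earlier of the two days
eligible = sample[sample["3/17/20"] > min_cases]

tops = {}
for col in ["delta", "delta%"]:
    # sort from the largest to the smallest value and keep the first rows
    top = eligible.sort_values(by=[col], ascending=False).head(3)
    tops[col] = top["Country/Region"].tolist()

print(tops)
```

The two orderings differ: Malaysia leads in absolute increase, while Thailand leads in relative increase, which is why the stack offers both views.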

## Transposing data

Now let's try to visualize some data, e.g. new cases for a given country. This exercise is great not only because it shows how to plot data, but also because it shows us more ways of manipulating data.

To plot the cases over time, we need to change the format of the data slightly: we transpose it so that dates become dataframe rows instead of dataframe columns:

In [40]:
# select the country-level row for Italy, keep only the date columns, and transpose so that dates become rows
cdf = df[(df["Country/Region"]=="Italy") & (df["Province/State"].isnull())][df.columns[4:]].T
cdf = cdf.rename(columns={cdf.columns[0]:"confirmed"}) # name the single column produced by the transposition
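
Here is a minimal sketch of the same transposition on a one-row dataframe. The three Italy counts are taken from the outputs shown later in this notebook; the coordinates are placeholders, and the names `wide` and `long_df` are made up for the example.

```python
import pandas as pd

# One country-level row in the "wide" layout (one column per date).
# Lat/Long are placeholder values, not real coordinates.
wide = pd.DataFrame({
    "Province/State": [None],
    "Country/Region": ["Italy"],
    "Lat": [0.0],
    "Long": [0.0],
    "3/16/20": [27980],
    "3/17/20": [31506],
    "3/18/20": [35713],
})

# the first four columns are metadata; everything after them is a date
date_cols = wide.columns[4:]

# .T swaps rows and columns: each date becomes a row label
long_df = wide[date_cols].T

# the single resulting column is named after the original row label (0); rename it
long_df = long_df.rename(columns={long_df.columns[0]: "confirmed"})

print(long_df)
```

After the transposition, the dates live in the index, which is why the plotting code passes the dataframe's index as the x axis.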

## Visualizing new cases over time

Now that we have our data prepared, we can plot the confirmed cases. We'll plot them as a line chart where the x axis shows dates and the y axis shows the number of confirmed cases:

In [41]:
fig = px.line(cdf, x=cdf.index, y="confirmed")
fig.show() # displays the `plotly` figure

## Visualizing increase in new cases

Now, let's try to do something more advanced. How about visualizing the daily increase in confirmed cases over time?

To do that, we need one more manipulation of our dataframe. We'll make a new dataframe by subtracting each day's number of confirmed cases from the next day's number, so every row holds the increase reported on the following day and the last row is empty. Here's how it's done using the `pandas` API:

In [42]:
delta = (cdf.shift(-1) - cdf)
delta.tail() # display the last 5 rows of the dataframe to make sure the operation was correct

| | confirmed |
|---|---|
| 3/14/20 | 3590.0 |
| 3/15/20 | 3233.0 |
| 3/16/20 | 3526.0 |
| 3/17/20 | 4207.0 |
| 3/18/20 | |
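
The last row above is empty because `shift(-1)` moves every value one row up, leaving nothing to subtract from on the final date. The sketch below compares this approach with `DataFrame.diff()`, an alternative that the notebook does not use, on three Italy values taken from the outputs in this notebook.

```python
import pandas as pd

# Cumulative confirmed cases for Italy on three consecutive days
cdf = pd.DataFrame(
    {"confirmed": [27980, 31506, 35713]},
    index=["3/16/20", "3/17/20", "3/18/20"],
)

# the notebook's approach: next day's value minus this day's value,
# so the increase is labelled with the EARLIER of the two dates
forward = cdf.shift(-1) - cdf

# alternative: this day's value minus the previous day's value,
# so the increase is labelled with the LATER of the two dates
backward = cdf.diff()

print(forward)
print(backward)
```

Both variants contain the same increases (3526 and 4207); they differ only in which date each increase is attached to and in whether the first or the last row is empty.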


Now the data is ready for plotting, the same way we did it above, with the difference that instead of the absolute number of cases, we display the increase, also in absolute numbers:

In [43]:
fig = px.line(delta, x=delta.index, y="confirmed")
fig.show()

## Defining a function that returns plots for a given country

So far, most of our code has been simple, even though it may feel cryptic if you're only getting familiar with pandas. Now we'll do something more advanced: we'll generalize the code that manipulates data and makes plots by moving it into a function. The function below returns three `plotly` figures: absolute confirmed cases, absolute increase, and increase in percent:

In [44]:
def plots_by_country(country):
    # country-level row, date columns only, transposed so that dates become rows
    cdf = df[(df["Country/Region"]==country) & (df["Province/State"].isnull())][df.columns[4:]].T
    cdf = cdf.rename(columns={cdf.columns[0]:"confirmed"})
    # figure 1: absolute confirmed cases
    cfig = px.line(cdf, x=cdf.index, y="confirmed")
    # figure 2: absolute increase (next day's value minus this day's value)
    delta = (cdf.shift(-1) - cdf).rename(columns={"confirmed": "confirmed per day"})
    cdfig = px.line(delta, x=cdf.index, y="confirmed per day")
    # figure 3: relative increase (the increase divided by the next day's value)
    delta_p = ((cdf.shift(-1) - cdf) / cdf.shift(-1)).rename(columns={"confirmed": "confirmed per day %"})
    cdpfig = px.line(delta_p, x=cdf.index, y="confirmed per day %")
    return (cfig, cdfig, cdpfig)

To test our function, let's call it for `Austria` and display all three resulting plots:

In [45]:
(fig1, fig2, fig3) = plots_by_country("Austria")
fig1.show()
fig2.show()
fig3.show()

## Visualizing new cases and increase over time for all countries

Now, let's call our function for each of the top 30 countries by confirmed cases and publish all the visualizations in one stack.

This exercise shows how a single `dstack` stack can be used to organize an interactive dashboard with multiple parameters:

In [46]:
# get the top 30 countries by the number of confirmed cases on the last day
countries = df[df["Province/State"].isnull()].sort_values(by=[df.columns[-1]], ascending=False)[["Country/Region"]].head(30)

# create a frame and iterate over the top countries to commit three plots for every country: new absolute cases, increase in absolute numbers, and increase in percent
frame = create_frame("covid19/speed_by_country")
for c in countries["Country/Region"].tolist():
    print(c)
    (fig1, fig2, fig3) = plots_by_country(c)
    frame.commit(fig1, f"Confirmed cases in {c}", {"Country": c, "Chart": "All cases"})
    frame.commit(fig2, f"New confirmed cases in {c}", {"Country": c, "Chart": "New cases"})
    frame.commit(fig3, f"New confirmed cases in {c} in %", {"Country": c, "Chart": "New cases (%)"})

frame.push()

Italy
Iran
Spain
Germany
Korea, South
Switzerland
Austria
Norway
Belgium
Sweden
Japan
Malaysia
Czechia
Qatar
Portugal
Israel
Greece
Brazil
Finland
Singapore
Pakistan
Ireland
Slovenia
Romania
Estonia
Bahrain
Poland
Iceland
Chile
Indonesia


'https://dstack.ai/cheptsov/covid19/speed_by_country'

The resulting stack can be found at [dstack.ai/cheptsov/covid19/speed_by_country](https://dstack.ai/cheptsov/covid19/speed_by_country).

## Visualizing all top 30 countries individually and all top 10 countries together

As another, slightly more comprehensive exercise in data manipulation, plotting, and publishing, let's visualize similar data again, but this time, in addition to the individual visualizations per country, also include a visualization with all countries together.

While doing this exercise, we'll see one more way of manipulating data.

Let's create a dataframe with the absolute number of confirmed cases in Italy for every day:

In [47]:
# filter Italy, transpose date dataframe columns into dataframe rows
t1 = df[(df["Country/Region"]=="Italy") & (df["Province/State"].isnull())][df.columns[4:]].T
# set the new column name
t1 = t1.rename(columns={t1.columns[0]:"confirmed"})
# add a country column; we'll later need it to highlight each country with its own color
# (the dates stay in the index, which the plots below use as the x axis)
t1["Country/Region"] = "Italy"
t1.tail() # display the last 5 rows to make sure everything is correct

| | confirmed | Country/Region |
|---|---|---|
| 3/14/20 | 21157 | Italy |
| 3/15/20 | 24747 | Italy |
| 3/16/20 | 27980 | Italy |
| 3/17/20 | 31506 | Italy |
| 3/18/20 | 35713 | Italy |


Now, let's generalize this code to make it work for a given country and also include the increase in absolute numbers and in percent:

In [48]:
# this function returns three dataframes: absolute confirmed cases, absolute increase, percent increase
def country_df(country):
    cdf = df[(df["Country/Region"]==country) & (df["Province/State"].isnull())][df.columns[4:]].T
    cdf = cdf.rename(columns={cdf.columns[0]:"confirmed"})
    delta = (cdf.shift(-1) - cdf).rename(columns={"confirmed": "confirmed per day"})
    # add a country column to each dataframe; we'll later need it to give each country its own color
    delta["Country/Region"] = country
    delta_p = ((cdf.shift(-1) - cdf) / cdf.shift(-1)).rename(columns={"confirmed": "confirmed per day %"})
    delta_p["Country/Region"] = country
    cdf["Country/Region"] = country
    return (cdf, delta, delta_p)



Now, let's make a list of the top countries by the absolute number of confirmed cases on the last day, and then build three lists of dataframes covering all of these countries: absolute cases, absolute increase, and percent increase.

In [49]:
# top 10 countries by the absolute number of confirmed cases on the last day
top10 = df[df["Province/State"].isnull()].sort_values(by=[df.columns[-1]], ascending=False)[["Country/Region"]].head(10)

# make lists of dataframes for all countries
top = []
top_delta = []
top_delta_p = []
for c in top10["Country/Region"].tolist():
    (x, y, z) = country_df(c)
    top.append(x)
    top_delta.append(y)
    top_delta_p.append(z)

test = pd.concat(top) # combine the per-country dataframes of confirmed cases into a single pandas dataframe
# plot the resulting dataframe to make sure everything is correct
px.line(test, x=test.index, y="confirmed", color='Country/Region').show()
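
To make the effect of `pd.concat` easier to see, here is a minimal sketch that stacks two small per-country dataframes. The numbers are copied from the outputs in this notebook; the variable names are made up for the example.

```python
import pandas as pd

# Two per-country dataframes in the shape returned by country_df:
# dates in the index, a value column, and a country column
italy = pd.DataFrame({"confirmed": [31506, 35713]}, index=["3/17/20", "3/18/20"])
italy["Country/Region"] = "Italy"

japan = pd.DataFrame({"confirmed": [878, 889]}, index=["3/17/20", "3/18/20"])
japan["Country/Region"] = "Japan"

# concat stacks the rows of all dataframes on top of each other
combined = pd.concat([italy, japan])

print(combined)
# each date now appears once per country in the index
print(combined.index.tolist())
```

Because the dates repeat, the `Country/Region` column is what tells the rows apart, and passing it as `color` lets `plotly.express` draw one line per country.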

Now, let's put it all together, and use our function to make visualizations:

* Number of all cases over time for the top 10 countries together
* Number of new cases over time for the top 10 countries together
* Percent of increase over time for the top 10 countries together
* Number of all cases over time for each of the top 30 countries
* Number of new cases over time for each of the top 30 countries
* Percent of increase over time for each of the top 30 countries

In [50]:
frame = create_frame("covid19/speed_by_country_all")

top10df = pd.concat(top)
fig = px.line(top10df, x=top10df.index, y="confirmed", color='Country/Region')
frame.commit(fig, "Confirmed cases in top 10 countries", {"Country": "Top 10", "Chart": "All cases"})

top10df_delta = pd.concat(top_delta)
fig = px.line(top10df_delta, x=top10df_delta.index, y="confirmed per day", color='Country/Region')
frame.commit(fig, "New confirmed cases in top 10 countries", {"Country": "Top 10", "Chart": "New cases"})

top10df_delta_p = pd.concat(top_delta_p)
fig = px.line(top10df_delta_p, x=top10df_delta_p.index, y="confirmed per day %", color='Country/Region')
frame.commit(fig, "New confirmed cases in top 10 countries in %", {"Country": "Top 10", "Chart": "New cases (%)"})

for c in countries["Country/Region"].tolist():
    print(c)
    (fig1, fig2, fig3) = plots_by_country(c)
    frame.commit(fig1, f"Confirmed cases in {c}", {"Country": c, "Chart": "All cases"})
    frame.commit(fig2, f"New confirmed cases in {c}", {"Country": c, "Chart": "New cases"})
    frame.commit(fig3, f"New confirmed cases in {c} in %", {"Country": c, "Chart": "New cases (%)"})

frame.push()

Italy
Iran
Spain
Germany
Korea, South
Switzerland
Austria
Norway
Belgium
Sweden
Japan
Malaysia
Czechia
Qatar
Portugal
Israel
Greece
Brazil
Finland
Singapore
Pakistan
Ireland
Slovenia
Romania
Estonia
Bahrain
Poland
Iceland
Chile
Indonesia


'https://dstack.ai/cheptsov/covid19/speed_by_country_all'

The resulting stack can be found at [dstack.ai/cheptsov/covid19/speed_by_country_all](https://dstack.ai/cheptsov/covid19/speed_by_country_all).

That's it for this time. We hope you've enjoyed these little exercises and got an idea of how simple data analysis can be.

## Resources

Here's a list of resources that you may find useful:

* [Python for beginners](https://wiki.python.org/moin/BeginnersGuide)
* [Getting started with Pandas](https://pandas.pydata.org/docs/getting_started/index.html)
* [Getting started with Plot.ly](https://plot.ly/python/getting-started/)
* [Jupyter notebooks](https://jupyter-notebook.readthedocs.io/en/stable/notebook.html)
* [Getting started with dstack.ai](https://docs.dstack.ai/)
* [COVID-19 cases data](https://data.humdata.org/dataset/novel-coronavirus-2019-ncov-cases)
* [Stacks published in this tutorial](https://dstack.ai/cheptsov)
