<h1><center>Applying 10 rules on the Data Analysis of the Coronavirus</center></h1>

**Name**: Kam, Ka Lok

**Collaborator**: Ho, Koon Leong

## Visualization Technique

### A narrative description of the visualization I am planning to use, describing how it works

In this exercise, I am planning to use bar charts and maps to demonstrate the current status of the coronavirus, especially the number of deaths, the number of patients and the number of recovered people. There are mainly two types of information that should be encoded: one is the number of patients, the other is the corresponding country. The first is a quantitative variable and the second is a nominal variable.

According to **Mackinlay's ranking of perceptual tasks**, a good option for encoding these two variables is the bar chart, because we can use one axis to encode the nominal variable and the other axis to encode the quantitative variable. According to Stevens' power law, perceived length grows almost linearly with actual length, so comparing lengths gives one of the most accurate results.

Therefore, when we want to compare the quantities, the bar chart should be the primary choice. In this task, the quantity we need to encode is the number of patients in each country, and it depicts the severity of coronavirus in the corresponding country. By comparing the quantities in different countries, we could gain an overview of the current status of coronavirus. 
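
To make the appeal to Stevens' power law concrete, here is a minimal sketch. The law models perceived magnitude as `k * intensity ** exponent`. The exponents used below (about 1.0 for length and about 0.7 for area) are commonly quoted approximate values and are an assumption of this sketch, not numbers taken from the sources of this notebook.

```python
# Stevens' power law: perceived magnitude = k * intensity ** exponent.
# The exponents are commonly quoted approximations (assumed for illustration).
EXPONENTS = {"length": 1.0, "area": 0.7}

def perceived_ratio(true_ratio, channel, k=1.0):
    """Perceived ratio between two stimuli whose true ratio is `true_ratio`."""
    a = EXPONENTS[channel]
    return (k * true_ratio ** a) / (k * 1.0 ** a)

# How large does "twice as many cases" look in each visual channel?
for channel in EXPONENTS:
    print(channel, round(perceived_ratio(2.0, channel), 2))
```

Under these assumed exponents, a bar that is twice as long is perceived as roughly twice as large, while a shape with twice the area is perceived as clearly less than twice as large, which is the argument for encoding the case counts as bar lengths.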

### A discussion of the circumstances in which this visualization should and should not be used (What is it close to? What else could I consider? How does it relate to specific aspects of data?)

As mentioned above, the bar chart is useful when we want to encode a quantitative variable and a nominal variable. Putting bars side by side gives us the most accurate comparison. But we could also use it when we want to encode a quantitative variable and an ordinal variable. One example is the histogram, where the y-axis shows the counts or probability densities and the x-axis shows the intervals of the data. This is why histograms are so commonly used to explore the data from a random experiment.
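
As a small illustration of the histogram case, the sketch below bins quantitative values into ordered intervals and prints one text bar per interval. The sample values and bin edges are made up for illustration.

```python
import numpy as np

# Hypothetical measurements from a repeated experiment (made-up values).
samples = np.array([1.2, 1.9, 2.1, 2.4, 2.8, 3.3, 3.7, 4.5])

# Bin the quantitative values into three ordered intervals: [1, 2), [2, 3), [3, 5].
edges = [1, 2, 3, 5]
counts, _ = np.histogram(samples, bins=edges)

# One bar per interval: the interval is the ordinal variable, the count the quantitative one.
for left, right, n in zip(edges[:-1], edges[1:], counts):
    print(f"{left}-{right}: {'#' * int(n)}")
```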

Also, when the nominal or ordinal variable can be divided into several subcategories, a stacked bar chart is suitable, because it nicely illustrates the inner structure of the data. (See the figure below [1])

![image](./asset/fig_1.png)

Of course, the bar chart is not a jack of all trades in the field of data visualization. It has limits. For example, when we want to encode two quantitative variables (like the price and the area of houses), the bar chart should not be applied. If one quantity were mapped to the bar height and the other to the bar width, the reader would have to compare the areas of the bars rather than their lengths. According to Mackinlay's ranking of perceptual tasks, comparing areas is less accurate.

## Visualization Library 

### The library I am going to use, and a background on why the library is good for this visualization. Who created it? Is it open source? How to install it?

The library I want to use for the task is called **Altair**, which is built on **Vega** and **Vega-Lite**, visualization grammars implemented in JavaScript. It is a user-friendly library because it can read data directly from a Pandas data frame and process the data to plot the figures. Since it is a declarative statistical visualization library for Python, I find it more intuitive than imperative plotting libraries such as Matplotlib. In my opinion, its default figures also look better than those of other Python visualization packages.

Altair was created by **Jake VanderPlas**, a data scientist and software engineer at Google, together with Brian Granger. VanderPlas has created or contributed to many open source packages for data science, such as Altair, AstroML and Scikit-Learn. Altair itself is open source.

The installation process is simple:
- For pip users: `pip install altair vega_datasets`
- For conda users: `conda install -c conda-forge altair vega_datasets`
- For other options, please [click here](https://altair-viz.github.io/getting_started/installation.html)

### A discussion of the general approach and limitations of this library. Is it declarative or procedural? Does it integrate with Jupyter? Why I decided to use this library (especially if there are other options)?

The very first step is data cleaning, such as dealing with missing data, removing unnecessary columns and transforming the data (from wide format to long format). After that, we need to define the type of plot that will be used. Commonly used plots are line graphs, scatter plots, bar charts, stacked area graphs and box plots. Once the type is defined, all that remains is to map the variables, together with their data types (quantitative, ordinal or nominal), to the axes. Some refinements can then be made to improve readability and effectiveness.
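
The cleaning and reshaping step can be sketched as follows. This is a minimal example on a made-up table; the column names `country`, `active`, `recovered`, `status` and `count` are hypothetical and are not taken from the datasets used later.

```python
import pandas as pd

# Hypothetical wide-format table: one row per country, one column per measure.
wide = pd.DataFrame({
    "country": ["A", "B"],
    "active": [120, 80],
    "recovered": [30, None],   # a missing value that needs cleaning
})

# Fill the missing value, then reshape to long form:
# one row per (country, measure) pair, which maps directly onto chart encodings.
tidy = wide.fillna({"recovered": 0}).melt(
    id_vars="country", var_name="status", value_name="count"
)
print(tidy)
```

In long form, `status` can be mapped to color and `count` to an axis, which is the layout the stacked bar charts later in this notebook rely on.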

Altair has some downsides:
1. It has limited customization options. Unlike Matplotlib, Altair only offers a predefined set of plot types (marks), and you have to choose among them. For example, the version I use offers no pie chart, and users cannot do much about it.
2. The amount of data it handles by default is limited. Altair embeds the data in the chart specification as JSON, so by default it raises a `MaxRowsError` for data frames with more than 5000 rows. The limit can be lifted with `alt.data_transformers.disable_max_rows()`, but the charts then become large and slow. This is a serious downside because in the era of big data people often need to deal with data at the PB scale, for which Altair is certainly not suitable.

But I still prefer Altair, especially when I only need to visualize small amounts of spreadsheet-like data, because **it is declarative and it integrates with Jupyter nicely**. Although there are many other visualization libraries available for Python, such as Matplotlib, Bokeh and Seaborn, I still choose Altair for this task because:
- It is declarative, so users don't need to define what to do at each step. The only thing users need to do is map the properties of the data to visual channels.
- Its philosophy aligns well with the principles described by Mackinlay.
- Having fewer customization options can also be an advantage, because no extra work is needed if the figure can be done nicely within a few steps.

## Demonstration

### The dataset I picked and instructions for cleaning the dataset.

The assignment contains four parts:
1. In the first section, I want to show how the severity of the coronavirus evolved over time in Wuhan. The number of daily new cases is used as a measure of the severity of the virus. For this task, the dataset is obtained from the HK Government Data Center ([click here](https://data.gov.hk/en-data/dataset/hk-dh-chpsebcddr-novel-infectious-agent)). The data I use contain the case numbers of each province in China from 11.01.2020 to 07.03.2020. The detailed data processing steps are described in the following cells.


2. In the second section, I want to focus on the statistical data about the virus around the world to give the audience an overview of its current status. Because the outbreak is now considered a pandemic, descriptive statistics are useful. For this task, I use the data published [here](https://www.worldometers.info/coronavirus/).


3. In the third section, a bar chart is applied to demonstrate how the situation in the USA is getting worse. The data provided by [Johns Hopkins Coronavirus Resource Center](https://coronavirus.jhu.edu/map.html) is used.


4. In the final section, I want to use a map to demonstrate the current numbers of patients in every province in Mainland China. The data used in the report are also from the Johns Hopkins University Center for Systems Science and Engineering (JHU CSSE). The data offered by JHU gives the total number of patients in each province in China, and it is suitable for the task of giving information about the latest updates on the coronavirus in China.


The processes of data cleaning and processing will be demonstrated in the following sections.

In [9]:
import altair as alt
import pandas as pd
import numpy as np

<h1><center>Coronavirus: its evolution and current statistics</center></h1>

## What happened in Wuhan?

On 31.12.2019, a cluster of pneumonia cases was reported by health authorities in Wuhan. A total of 27 cases were reported, 7 of which were critical. The cause was unknown. "No evidence of human to human transmission", they announced.

On 03.01.2020, the total number of infected people was updated to 44 cases, 11 among them were critical, according to health authorities in Wuhan.

The number of patients kept being updated daily until 05.01.2020. The total number at the time was 59, and 9 among them were critical.

On 11.01.2020, health authorities in Wuhan announced that most patients were linked to the Huanan Seafood Wholesale Market. The number of patients was revised to 41 after 18 cases were found to be false; among the 41 patients, 1 had died and 2 had recovered.

From 11.01.2020 to 17.01.2020, the number was frozen at 41. Coincidentally, Lianghui, the local parliament session, was held from 06.01 to 17.01.

On 15.01.2020, Wuhan's Municipal Health Commission (MHC) stated on its website, "the result of present investigation shows no clear evidence of human-to-human transmission, but this does not rule out the possibility of such a transmission. The risk of continuous human-to-human transmission is low."

On 20.01.2020, the number of reported new cases increased to 136. Major cities in China such as Beijing and Shenzhen also reported their first cases.[2] On the same day, Dr Nanshan Zhong, a Chinese epidemiologist who gained international fame for fighting the SARS outbreak, gave an example of human-to-human transmission of the new coronavirus by pointing to two infected family clusters in Guangdong whose members had been to Wuhan before.

On 23.01.2020, Wuhan city was locked down...

### Processing the data
To understand what happened in Wuhan, especially how policies affected the "shown" data, it is useful to have a time series of daily new cases, which can help us to understand the severity of the coronavirus. Firstly, we need to load the data, which include the number of reported cases in China from 11.01.2020 to 07.03.2020.

In [10]:
# load the local copy of the data downloaded from data.gov.hk
daily = pd.read_csv('./asset/areas_in_mainland_china_have_reported_cases_eng.csv')
daily.head(10)

Unnamed: 0,As of date,As of time,Mainland China,Number of reported/confirmed cases,Number of deaths,Remark
0,11/01/2020,23:59,Hubei,41,,
1,12/01/2020,23:59,Hubei,41,,
2,13/01/2020,23:59,Hubei,41,,
3,15/01/2020,23:59,Hubei,41,,
4,16/01/2020,23:59,Hubei,45,,
5,17/01/2020,23:59,Hubei,62,,
6,19/01/2020,23:59,Hubei,198,,
7,20/01/2020,18:00,Guangdong,14,,
8,20/01/2020,18:00,Beijing,5,,
9,20/01/2020,18:00,Shanghai,1,,


### Extract information of Wuhan
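The city of Wuhan, which is the origin of the outbreak and the capital of Hubei province, has the greatest number of patients. By analyzing the data of Wuhan, a broad picture of how the disease evolved in China can be obtained. Since the dataset is reported by province, the Hubei figures are used to represent Wuhan. Therefore, the next step is to extract the data of Hubei.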

In [11]:
#  convert the datatype of column [As of date] to time-format, and set it as the index
daily['date'] = pd.to_datetime(daily['As of date'], format='%d/%m/%Y')
daily.drop(columns = ['As of date'], inplace = True)
daily.set_index('date', inplace = True)

# keep the 23:59 snapshots (used here to select the Hubei rows) and fill missing death counts with 0
daily_hubei = daily[ (daily['As of time'] == '23:59') ]
daily_hubei = daily_hubei.fillna( value = {'Number of deaths' : 0})
daily_hubei = daily_hubei.drop(columns = ['Remark', 'As of time', 'Mainland China'])

# shorten the column name, then transpose so that each date becomes a column
daily_hubei.rename(columns = {'Number of reported/confirmed cases' : 'Reported cases'}, inplace = True)
daily_hubei = daily_hubei.T

# take the dates and the cumulative case counts, then subtract the previous report from each report
# to get the number of new cases (the first and the last date are left out)
time = daily_hubei.columns[1:-1]
value_current = daily_hubei.iloc[0, 1:-1].to_list()
value_yesterday = daily_hubei.iloc[0, 0:-2].to_list()
new_case = [value_current[n] - value_yesterday[n] for n in range(len(value_current))]

hubei_new = pd.DataFrame({ 'date': time, 'daily new case': new_case})
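
The subtraction of the previous report from each report could alternatively be sketched with `Series.diff()`. The example below uses only the seven Hubei rows shown in the table above; unlike the cell above, this variant keeps the last reported date, so it is an illustration rather than a drop-in replacement.

```python
import pandas as pd

# First Hubei rows of the table shown above (cumulative reported cases).
hubei = pd.DataFrame({
    "As of date": ["11/01/2020", "12/01/2020", "13/01/2020", "15/01/2020",
                   "16/01/2020", "17/01/2020", "19/01/2020"],
    "Reported cases": [41, 41, 41, 41, 45, 62, 198],
})
hubei["date"] = pd.to_datetime(hubei["As of date"], format="%d/%m/%Y")

# diff() subtracts the previous report from each row; the first row has no
# predecessor and is dropped. 14.01 and 18.01 are absent from these rows,
# so some "daily" values actually span two days.
hubei_new = (
    hubei.set_index("date")["Reported cases"]
         .diff()
         .dropna()
         .astype(int)
         .rename("daily new case")
         .reset_index()
)
print(hubei_new)
```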

In [12]:
# Use Altair to draw
hubei_fig = alt.Chart(hubei_new).mark_line(color = '#BF4055').encode(
    alt.X('monthdate(date):T', axis = alt.Axis( title = 'From 12.01 to 04.03', titleFontSize = 15, \
                                               labelFontSize = 12)),
    alt.Y('daily new case:Q', axis = alt.Axis( title = '# of cases', titleFontSize = 15))
).properties(
    height = 400,
    width = 700
)

# do the annotation
annotations = [['2020-01-13', 12000, 'No evidence of H-to-H transmission, stated by Wuhan'],
               ['2020-01-24', 10000, 'Zhong: "Evidence of H-to-H transmission"'],
               ['2020-01-29', 6000, 'Wuhan Quarantine'],
               ['2020-01-24', 15000, 'Clinical Cases included'],
               ['2020-02-25', 4000, "WHO raises risk to very high"]]
text = pd.DataFrame(annotations, columns=['date','count','note'])
text['date'] = pd.to_datetime(text['date'], format='%Y-%m-%d')

figure_text = alt.Chart(text).encode(
    alt.X('monthdate(date):T'),
    alt.Y('count:Q', axis = None),
    text = alt.Text('note:N')
).mark_text(align = 'left', baseline = 'middle', dy = 0, fontSize = 13)

pointer = pd.DataFrame({
    'x':  pd.to_datetime(['2020-01-19', '2020-01-19', '2020-01-31', '2020-01-20',\
                         '2020-02-01', '2020-01-23', '2020-02-04', '2020-02-11',\
                         '2020-02-29', '2020-02-29'], format = '%Y-%m-%d'),
    'y': [11500, 100, 9500, 100, 5500, 200, 14900, 14700, 3500, 650],
    'class': ['A', 'A', 'B', 'B', 'C', 'C', 'D', 'D', 'E', 'E']
})

line = alt.Chart(pointer).mark_line().encode(
    x='monthdate(x):T',
    y='y',
    detail='class'
)

(hubei_fig  + figure_text + line).properties(
    height = 600,
    width = 900,
    title = {'text': 'Daily new cases in Hubei', 'fontSize' : 20}
)

Annotations are added to mark some important events. The large peak on 12.02 occurred because clinically diagnosed cases, which had been confirmed by CT but were previously treated as suspected, were also included in the count. From the figure, we can also see signs of improvement at the end of February.

## The current status of coronavirus

Now, let's focus on the statistical data of the current status of the coronavirus. Firstly, we should download the data from the [website](https://www.worldometers.info/coronavirus/). For convenience, a CSV file which contains the data of 01.04.2020 is provided.

In [13]:
data = pd.read_csv('./asset/coronavirus.csv')
data.drop( columns = 'Unnamed: 0', inplace = True)
data.head(10)

Unnamed: 0,"Country,Other",TotalCases,NewCases,TotalDeaths,NewDeaths,TotalRecovered,ActiveCases,"Serious,Critical",Tot Cases/1M pop,Deaths/1M pop,Reported1st case
0,USA,188578,48.0,4054.0,1.0,7251.0,177273,4576.0,570.0,12.0,Jan 20
1,Italy,105792,,12428.0,,15729.0,77635,4023.0,1750.0,206.0,Jan 29
2,Spain,95923,,8464.0,,19259.0,68200,5607.0,2052.0,181.0,Jan 30
3,Germany,71808,,775.0,,16100.0,54933,2675.0,857.0,9.0,Jan 26
4,France,52128,,3523.0,,9444.0,39161,5565.0,799.0,54.0,Jan 23
5,Iran,44605,,2898.0,,14656.0,27051,3703.0,531.0,35.0,Feb 18
6,UK,25150,,1789.0,,135.0,23226,163.0,370.0,26.0,Jan 30
7,Switzerland,16605,,433.0,,1823.0,14349,301.0,1919.0,50.0,Feb 24
8,Turkey,13531,,214.0,,243.0,13074,622.0,160.0,3.0,Mar 09
9,Belgium,12775,,705.0,,1696.0,10374,1021.0,1102.0,61.0,Feb 03


### The total number of patients all over the world

It is always useful to have a statistical overview which shows the total numbers of active cases, deaths and recovered patients.

In [14]:
total = pd.melt(data, id_vars = 'Country,Other') # convert to long form
total = total[ total['Country,Other'] == 'Total:'] 

# construct a sorting list
names = ['TotalRecovered', 'ActiveCases', 'TotalDeaths']
colors = ['green', '#52C9E0', 'red']

pic_left = alt.Chart(total[total['variable'].isin(names)]).mark_bar().encode(
    alt.Y('variable:N', sort = names, axis = alt.Axis(title = None, labelFontSize = 15),
          scale = alt.Scale(paddingInner = 0.1)),
    alt.X('value:Q', axis = alt.Axis(title = 'The overall statistics on 01.04.2020', titleFontSize = 15)),
    alt.Color('variable:N', legend = alt.Legend(title = 'Cases', labelFontSize = 15, titleFontSize = 15),
              scale = alt.Scale(domain = names, range = colors))
).properties(
    height = 200,
    width = 500
)

pic_left

From the figure above, it is clear that the disease is not under control yet, because the number of active cases is more than three times the number of recovered cases.

### Country-ranking by severity level of the novel coronavirus

In [15]:
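# sort by total cases and skip the first row, which should be the 'Total:' summary row,
# keeping the next ten countries
data_2 = data.sort_values(by = 'TotalCases', ascending = False).head(12).iloc[1:11]
data_2['index'] = list(range(0,10))
data_2.set_index('index', inplace = True)
data_2.fillna(value = {'TotalDeaths' : 0}, inplace = True)

# keep the three case types and convert to long form for the stacked bars
data_2 = data_2[ ['Country,Other','TotalRecovered', 'ActiveCases', 'TotalDeaths'] ]
data_2 = pd.melt(data_2, id_vars = ['Country,Other'])

# ranking list (countries in descending order of total cases)
names_c = data_2['Country,Other'].to_list()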

rest = alt.Chart(data_2).mark_bar().encode(
    alt.X('Country,Other:N', sort = names_c, axis = alt.Axis(title = None, labelFontSize = 15)),
    alt.Y('value:Q',  sort = names, axis = alt.Axis(title = '# of cases', titleFontSize = 15, labelFontSize = 15)),
    alt.Color('variable:N', scale = alt.Scale(domain = names, range = colors),\
             legend = alt.Legend(title = 'Type of Cases', titleFontSize = 15, labelFontSize = 15))
).properties(
    width = 700,
    height = 400,
    title = { 'text' : 'Top 10 countries ranked by severity level', 'fontSize' : 20,
              'subtitle' : 'There are {0} cases in total'.format(int(data.iloc[-1, 1])),
              'subtitleFontSize' : 15}
)


rest

As shown in the figure above, the most affected country is the USA. Most of the other countries in the top ten are in Europe.
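
The ranking and reshaping behind such a chart can be sketched on a small excerpt. The five rows below are copied from the table shown earlier (snapshot of 01.04.2020); the column name `stack_total` and the choice of a top three are introduced here only for illustration.

```python
import pandas as pd

# Five rows copied from the table shown earlier (snapshot of 01.04.2020).
snapshot = pd.DataFrame({
    "Country,Other": ["USA", "Italy", "Spain", "Germany", "France"],
    "TotalRecovered": [7251, 15729, 19259, 16100, 9444],
    "ActiveCases": [177273, 77635, 68200, 54933, 39161],
    "TotalDeaths": [4054, 12428, 8464, 775, 3523],
})
status_cols = ["TotalRecovered", "ActiveCases", "TotalDeaths"]

# Rank by the sum of the three stacked segments and keep the top three.
snapshot["stack_total"] = snapshot[status_cols].sum(axis=1)
top = snapshot.nlargest(3, "stack_total")

# Long form: one row per (country, status), ready for a stacked bar chart.
stacked = top.melt(id_vars="Country,Other", value_vars=status_cols)
print(stacked)
```

For these rows, the sum of recovered, active and death counts reproduces the `TotalCases` column of the table, so the height of each stacked bar equals the total number of cases.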

## How the coronavirus evolves in the US

At the time of writing, the USA has the highest severity level with about 190,000 patients, which is nearly double the number of patients in Italy. So it is useful to see how rapidly this virus spread over the USA in March.

In [16]:
# read the data
us_df = pd.read_csv('asset/cases_us.csv')
us_df.drop( columns = ['Unnamed: 0'], inplace = True)

us_df['date'] = pd.to_datetime(us_df['date'])
us_df = pd.melt(us_df, id_vars = ['date'])

In [17]:
# draw graph
us_df = us_df[ us_df['date'] > pd.to_datetime('2020-03-01')]

us_time = alt.Chart(us_df).mark_bar().encode(
    alt.X('monthdate(date):T', axis = alt.Axis(title = None, labelFontSize = 15)),
    alt.Y('value:Q', axis = alt.Axis(title = '# of cases', titleFontSize = 15, labelFontSize = 15)),
    alt.Color('variable:N', scale = alt.Scale(domain = ['Recovered', 'Active', 'Deaths'], range = colors),\
             legend = alt.Legend(title = 'Type of Cases', titleFontSize = 15, labelFontSize = 15))
).properties(
    width = 700,
    height = 400,
    title = { 'text' : 'How COVID-19 evolves in the US', 'fontSize' : 20, 'subtitleFontSize' : 15}
)

us_time

## A dot map to show geographical information

Finally, a dot map which gives information about each province in China is built. A dot map is very useful for conveying a particular feature of geographical data. Generally, a dot map consists of two layers. A map with the geographical outline of a place is set in the background. On top of it, dots which carry the coordinates and the feature of interest are placed.

In [18]:
# load the GeoJSON file, which provides the map in the background
url_json = './asset/china_provinces_geo.geojson'
data_geojson_remote = alt.Data(url=url_json)

# chart object
background = alt.Chart(data_geojson_remote).mark_geoshape(
        stroke='black',
        strokeWidth=1
    ).encode(
    ).project('mercator')

# read the data of China
df_c = pd.read_csv('./asset/time_china_confirmed.csv')
df_d = pd.read_csv('./asset/time_china_deaths.csv')
df_r = pd.read_csv('./asset/time_china_recovered.csv')

# coordinates [longitude, latitude] of the capital of each province in China
coordinate = {'Liaoning' : [123.429092, 41.796768], 'Jilin': [125.324501,43.886841], 'Heilongjiang': [126.642464, 45.756966], \
             'Beijing' : [116.405289, 39.904987], 'Tianjin': [117.190186, 39.125595], 'Inner Mongolia': [111.751990, 40.841490], \
             'Ningxia': [106.232480, 38.486440], 'Shanxi': [112.549248, 37.857014], 'Hebei': [114.502464, 38.045475], \
             'Shandong':[117.000923, 36.675808], 'Henan':[113.665413, 34.757977], 'Shaanxi':[108.948021, 34.263161], \
             'Hubei':[114.298569, 30.584354], 'Jiangsu':[118.76741, 32.041546], 'Anhui':[117.283043, 31.861191], 'Shanghai':[121.472641, 31.231707], \
             'Hunan':[112.982277, 28.19409], 'Jiangxi':[115.892151, 28.676493], 'Zhejiang':[120.15358, 30.287458], 'Fujian':[119.306236, 26.075302], \
             'Guangdong':[113.28064, 23.125177], 'Hainan':[110.199890, 20.044220], 'Guangxi':[108.320007, 22.82402], \
              'Chongqing':[106.504959, 29.533155], 'Yunnan':[102.71225, 25.040609], 'Guizhou':[106.713478, 26.578342], \
             'Sichuan':[104.065735, 30.659462], 'Gansu':[103.834170, 36.061380], 'Qinghai':[101.777820, 36.617290], \
             'Tibet':[91.11450,29.644150], 'Xinjiang':[87.616880, 43.826630], 'Hong Kong':[114.2000, 22.3000]}

longitude = []
latitude = []
recovered = []
deaths = []
cases = []
p_names = []

for province, (lon, lat) in coordinate.items():
    longitude.append(lon)
    latitude.append(lat)
    p_names.append(province)

    # the last column of each time series holds the latest cumulative count;
    # .iloc[0, -1] reads it as a scalar from the row matching the province
    recovered.append( int(df_r[ df_r['Province/State'] == province].iloc[0, -1]) )
    cases.append( int(df_c[ df_c['Province/State'] == province].iloc[0, -1]) )
    deaths.append( int(df_d[ df_d['Province/State'] == province].iloc[0, -1]) )
    
df_china = pd.DataFrame( {'Province': p_names, 'Total Cases': cases, 'Deaths': deaths, 'Recovered': recovered, 'longitude':longitude, 'latitude': latitude} )

points = alt.Chart(df_china).mark_circle(
    size=70,
    color='red'
).encode(
    longitude='longitude:Q',
    latitude='latitude:Q',
    tooltip=['Province', 'Total Cases', 'Deaths', 'Recovered']
)
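
The province loop above could alternatively be written as a merge. The sketch below is a minimal example under stated assumptions: the small `df_cases` frame only imitates the layout of the time-series files (one row per province, one column per date), and its numbers are made up; the two coordinates are copied from the dictionary above.

```python
import pandas as pd

# Hypothetical excerpt in the time-series layout: one row per province,
# one column per date with cumulative counts (values made up for illustration).
df_cases = pd.DataFrame({
    "Province/State": ["Hubei", "Guangdong", "Beijing"],
    "3/22/20": [100, 20, 10],
    "3/23/20": [110, 21, 10],
})

# Two capitals taken from the coordinate dictionary above: [longitude, latitude].
coords = pd.DataFrame.from_dict(
    {"Hubei": [114.298569, 30.584354], "Beijing": [116.405289, 39.904987]},
    orient="index", columns=["longitude", "latitude"],
).rename_axis("Province").reset_index()

# Latest cumulative count = last column of the time series.
last_col = df_cases.columns[-1]
latest = df_cases[["Province/State", last_col]].rename(
    columns={"Province/State": "Province", last_col: "Total Cases"}
)

# An inner merge keeps only provinces that have both a count and a coordinate.
points_df = coords.merge(latest, on="Province", how="inner")
print(points_df)
```

A merge also makes missing provinces visible: a province without a matching row simply drops out of the result instead of raising an error inside a loop.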

In [19]:
(background + points).properties(
        width=500,
        height=500,
        title = {'text' : 'The National Report of Mainland China on 23.03', 'fontSize' : 20, 'orient' : 'top', \
                'subtitle': 'Hover over the red dots to see the details'}
    ).configure_view(strokeWidth = 0)

Compared with tables or bar charts, the dot map is more intuitive for geographical data, because it shows directly where the cases are located.

## References

[1] Andy Kriebel, Displaying time-series data: Stacked bars, area charts or lines…you decide!, http://www.vizwiz.com/2012/08/displaying-time-series-data-stacked.html

[2] France 24, China confirms sharp rise in cases of SARS-like virus across the country, https://www.france24.com/en/20200120-china-confirms-sharp-rise-in-cases-of-sars-like-virus-across-the-country

[3] Johns Hopkins University Center for Systems Science and Engineering (JHU CSSE), https://coronavirus.jhu.edu/

[4] Worldometer, https://www.worldometers.info/coronavirus/

### Adherence to some of Rule et al.'s rules for computational analyses

1. **Tell a story for an audience**: This rule emphasizes the interaction between authors and audiences: the audience should see not only the code but also the story behind the data and the thought process behind telling that story. The audience of this notebook is probably the TAs and classmates in this course, so I present the figures in a way that gives them an intuitive picture.


2. **Document the process, not just the results**: This rule emphasizes that authors should tell the audience how the data were cleaned and how the results were achieved. In this exercise, comments and text are used to show the audience how I obtain the data, clean them and finally plot them.


3. **Use cell division**: This rule refers to a well-structured notebook with high readability. We should avoid cells that contain too much content. It is better to have only one task (loading data, data cleaning, or plotting) in one cell. In this notebook, there are mainly four parts, and each part includes roughly four cells: two for markdown and two for data cleaning and plotting.


9. **Design your notebooks to be read, run, and explored**: This rule asks authors to ensure that the audience can access, run, and explore the notebook. Because most of the data are obtained from the internet, there is a risk that the audience cannot access them if the given links stop working. Therefore, I save all the data in CSV files and put them on GitHub together with the notebook to avoid problems with data access.