# COVID-19 VIETNAM DATASET - AN OVERVIEW PICTURE OF THE PANDEMIC AT COUNTRY LEVEL

*This kernel was created to support the '[vietnam-covid19-patient-dataset](https://www.kaggle.com/nhntran/vietnam-covid19-patient-dataset)' published on Kaggle.*

*The written report for this analysis can be found [here](https://towardsdatascience.com/covid-19-what-do-we-know-about-the-situation-in-vietnam-82c195163d7e).*

*A nice visualization of all Vietnam COVID-19 patients can be found [here](https://medium.com/@tranhnnguyenvn/a-full-picture-of-vietnam-covid-19-patients-496f7ccad3ea).*

*This kernel was created together with [another kernel](https://www.kaggle.com/nhntran/covid-19-the-world-data-eda-and-visualization/edit/run/32144597), which focuses on global COVID-19 data.*

## **INTRODUCTION**

**CONTEXT:**

On December 31, 2019, Chinese officials reported the first cases of COVID-19 in Wuhan (China). Around the end of January 2020, many countries (the US, the UK, South Korea, etc.), including Vietnam, reported their first COVID-19 cases.

While the number of confirmed cases and deaths has risen exponentially in other countries, **Vietnam currently has only 270 COVID-19 cases in TOTAL and NO FATALITIES**.

One notable point is that privacy laws in Vietnam are not as stringent as in the US, Canada, or the EU. As a result, **COVID-19 patient data in Vietnam is publicly available**: to make contact tracing more transparent and effective, patient data is published on the Vietnam Ministry of Health's website and in the news.

This tradeoff in personal privacy gives the data science community the opportunity to examine many aspects of the COVID-19 pandemic in more detail, at the country level.

I hope this analysis will give you some inspiration on the topic.



**DATA COLLECTION:**

Data was acquired by web scraping, followed by manual curation, from the Vietnam Ministry of Health's website (https://ncov.moh.gov.vn/) and other mainstream media in Vietnam (cited specifically in each data row).


**DISCLAIMER:**

* This is my personal work with no link to any organization. Although this analysis is data-driven and hence provides some insights about the government strategy as well as patient characteristics in Vietnam, my comments reflect my personal perspectives.
* My results are based on the data collected from the Vietnam Health Ministry website and the mainstream media in Vietnam. Therefore, the data is likely to be biased and reflects only what is publicly available on the internet. However, it can serve as a good reference for anyone who is curious about the COVID-19 pandemic in Vietnam.


**GOALS**

* Download the dataset, clean it up, perform exploratory data analysis (EDA), and transform the data into several dataframes for further study

* Explore the unique features of COVID-19 data in Vietnam

# EXPLORATORY DATA ANALYSIS

**A. ENVIRONMENT SETUP**

In [None]:
### Import necessary packages as Kaggle recommended

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import os
# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

### Have a look at the data directory and get the link to the updated dataset
list_path = []
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        list_path.append(os.path.join(dirname, filename))
        print(os.path.join(dirname, filename))

# Any results you write to the current directory are saved as output.

In [None]:
# Install wget to download the data
# !pip install wget
#import wget

In [None]:
# Import other necessary packages
from datetime import date
import datetime as dt
import collections
import random
import statistics

# matplotlib
import matplotlib.pyplot as plt
from matplotlib.lines import Line2D
from matplotlib.patches import Patch
import matplotlib.style as style
import matplotlib.gridspec as gridspec
import matplotlib as mpl # use/reset style
from matplotlib.dates import DateFormatter
import matplotlib.dates as mdates
from pandas.plotting import register_matplotlib_converters
register_matplotlib_converters()
%matplotlib inline

# seaborn
import seaborn as sns
sns.set_style("whitegrid")

## Map visualization
import folium
import altair as alt

# Plotly
from plotly import tools, subplots
from plotly.subplots import make_subplots
import plotly.offline as py
py.init_notebook_mode(connected=True) # Required to use plotly offline in jupyter notebook
import plotly.graph_objs as go
import plotly.express as px
import plotly.figure_factory as ff
import plotly.io as pio
pio.templates.default = "plotly_white"

# Display Markdown-formatted output (bold, italic, etc.)
from IPython.display import Markdown, display
def bold(string):
    display(Markdown(string))

**B. DOWNLOAD DATA, CLEAN UP AND TRANSFORM INTO SPECIFIC DATA FRAMES FOR ANALYSIS**

**1. DOWNLOAD DATA**

In [None]:
### Reading patient data - Main data 
vietnam_patient_data = pd.read_csv('/kaggle/input/vietnam-covid19-patient-dataset/Vietnam_COVID-19_patient_data_May10_2020.csv')
vietnam_patient_data.tail()

In [None]:
### Reading the number of hospitalized patients (active cases) from the start date 02/22/2020 up to the most recent date
# This kind of data is not very consistent among different sources,
# so I downloaded it directly from the Vietnam Health Ministry website

vietnam_hospitalized_patient_info = pd.read_csv('/kaggle/input/vietnam-covid19-patient-dataset/Vietnam_COVID-19_HospitalizedPatient_May10_2020.csv')
vietnam_hospitalized_patient_info.tail()

In [None]:
### Reading province information (providing the latitude and longitude for each province/city in Vietnam)
vietnam_region_info = pd.read_csv('/kaggle/input/vietnam-covid19-patient-dataset/Vietnam_province_info.csv')
vietnam_region_info.tail()

**2. EXPLORATORY DATA ANALYSIS - CLEAN UP**

In [None]:
## Data correction - No need anymore
# vietnam_patient_data.columns[8]
# vietnam_patient_data['\x08Confirmed Date']
# '\x08Confirmed Date' => Need to be fixed by renaming # '\x08Confirmed Date'
# vietnam_patient_data = vietnam_patient_data.rename({'Confirmed Date': 'Date'}, axis = 'columns')

# vietnam_patient_data.shape
# # (999, 23) => Need to be fixed number of length (currently: 999)
# vietnam_patient_data = vietnam_patient_data[vietnam_patient_data['Date'].isnull() != True]

In [None]:
## Change 'Confirmed Date' to 'Date'
vietnam_patient_data = vietnam_patient_data.rename({'Confirmed Date': 'Date'}, axis = 'columns')

## Convert date from string to datetime type
vietnam_patient_data['Date'] = pd.to_datetime(vietnam_patient_data['Date'])
#vietnam_patient_data['Date'] = pd.to_datetime(vietnam_patient_data['Date']).dt.strftime('%m/%d/%Y')
vietnam_patient_data.shape

In [None]:
vietnam_patient_data.columns

In [None]:
# What are the types of the features (categorical, numerical, mixed)?
#vietnam_patient_data.dtypes

In [None]:
vietnam_patient_data.describe(include=['O'])

In [None]:
#vietnam_patient_data.info()

**3. PATIENT DETAIL DATA - CLEAN UP, TRANSFORM AND EXTRACT**


**List of transformed dataframes:**

- 'patient_data': all details about the patients

- 'hospitalized_time': number of hospitalized days for each patient

- 'df_travel_country': list of countries that patients traveled to and where they may have been infected, plus 2 dataframes for the first and second waves of infection (the second wave started with patient BN17): 'df_travel_country_firstwave' and 'df_travel_country_secondwave'

- 'df_cases': all statistics about cases

- 'vietnam_cases_all_times': number of cases (confirmed cases, new cases, deaths) in each province/region, in the same format as the world data from Johns Hopkins University (CSSE)
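
As a small illustration of how a dataframe like 'hospitalized_time' can be derived, here is a minimal sketch on made-up records. The IDs and dates below are hypothetical and do not come from the dataset; only the column names follow the ones used in this kernel.

```python
import pandas as pd

# Hypothetical toy records (IDs and dates are made up for illustration).
toy = pd.DataFrame({
    'ID': ['P1', 'P2', 'P3'],
    'Date': ['2020-03-06', '2020-03-10', '2020-03-12'],
    'Discharged Date': ['2020-03-20', None, '2020-03-30'],
})

# Parse both columns; a missing discharge date becomes NaT.
toy['Date'] = pd.to_datetime(toy['Date'])
toy['Discharged Date'] = pd.to_datetime(toy['Discharged Date'])

# Keep discharged patients only, then take the difference in whole days.
discharged = toy.dropna(subset=['Discharged Date']).copy()
discharged['Days Hospitalized Since Confirmed'] = (
    discharged['Discharged Date'] - discharged['Date']
).dt.days

print(discharged[['ID', 'Days Hospitalized Since Confirmed']])
```

Patients who are still hospitalized have no discharge date, so they are dropped before the subtraction rather than counted with a length of stay of zero.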

In [None]:
### BASIC DATA FRAME FOR PATIENT DATA:
# Name 'patient_data'

target_list = ['ID', 'Gender', 'Age', 'Nationality', 'Detection Location', 'Date',
               'Travel History','Travel Country, Correct', 'Source of Infection', 
               'Relationship', 'Health Condition When Confirmed', 
               'Detail Symptoms When Confirmed (clean up)','Underlying Health Condition','Discharged Date', 'Re-Infected']
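## Take an explicit copy so the column assignments below do not trigger SettingWithCopyWarning
patient_data = vietnam_patient_data[target_list].copy()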

## brief look at the data
# sns.pairplot(patient_data)

## Convert date from string to datetime type
patient_data['Date'] = pd.to_datetime(patient_data['Date'])

patient_data['Discharged Date'].unique()
# => Remove the one that has no exact date, starting as '(', for example '(Feb 2020, no exact date)'
patient_data['Discharged Date'] = patient_data['Discharged Date'].replace(to_replace = r'^\(', 
                                                                value = np.nan, regex = True)

## Fill NaN in 'Health Condition When Confirmed' with Not Reported
patient_data['Health Condition When Confirmed'] = patient_data['Health Condition When Confirmed'].fillna('Not Reported')

patient_data.head()

In [None]:
### DATA FRAME FOR PATIENT'S GENDER, AGE AND NATIONALITY:
# Name 'patient_nationality'

# Prepare patient data
patient_nationality = patient_data[['Gender','Age', 'Nationality', 'Travel History']].copy()
# Condition for columns distinguish Vietnamese and foreigners
patient_nationality['Patient Nationality'] = patient_nationality['Nationality']
patient_nationality.loc[patient_nationality['Patient Nationality'] != 'Vietnam', 'Patient Nationality'] = 'Foreigner'
patient_nationality.loc[patient_nationality['Patient Nationality'] == 'Vietnam', 'Patient Nationality'] = 'Vietnamese'
patient_nationality.loc[patient_nationality['Gender'] == 'F', 'Gender'] = 'Female'
patient_nationality.loc[patient_nationality['Gender'] == 'M', 'Gender'] = 'Male'
patient_nationality.loc[patient_nationality['Travel History'] == 1, 'Travel History'] = 'Imported Cases'
patient_nationality.loc[patient_nationality['Travel History'] == 0, 'Travel History'] = 'Locally Transmitted Cases'
patient_nationality

In [None]:
### DATA FRAME FOR HOSPITALIZED TIME:
# Name 'hospitalized_time'

hospitalized_time = patient_data[['Date','Discharged Date','Gender','Age', 'Nationality']]
hospitalized_time = hospitalized_time.dropna()
## Convert date from string to datetime type
hospitalized_time['Discharged Date'] = pd.to_datetime(hospitalized_time['Discharged Date'])
hospitalized_time['Days Hospitalized Since Confirmed'] = (hospitalized_time['Discharged Date'] - hospitalized_time['Date'])/pd.Timedelta('1 days')
hospitalized_time = hospitalized_time.sort_values('Days Hospitalized Since Confirmed', ascending = True).reset_index(drop=True)
## Relabel 'Gender' and reduce 'Nationality' to 2 classes (Vietnamese and Foreigners)
hospitalized_time.loc[hospitalized_time['Gender'] == 'F', 'Gender'] = 'Female'
hospitalized_time.loc[hospitalized_time['Gender'] == 'M', 'Gender'] = 'Male'
hospitalized_time.loc[hospitalized_time['Nationality'] != 'Vietnam', 'Nationality'] = 'Foreigners'
hospitalized_time.loc[hospitalized_time['Nationality'] == 'Vietnam', 'Nationality'] = 'Vietnamese'
hospitalized_time

In [None]:
### DATAFRAMES FOR TRAVEL COUNTRY:
# Name 'df_travel_country'
# 2 data frames for the first and second waves of infection (the second wave started with patient BN17)
# df_travel_country_firstwave
# df_travel_country_secondwave

### Function to extract the travel countries information
def extract_travel_country(df):
    travel_country = df['Travel Country, Correct'].dropna().tolist()
    travel_countries = []
    for country in travel_country:
        name = country.split(', ')
        travel_countries.extend(name)
    counter = collections.Counter(travel_countries)
    df_travel_country = pd.DataFrame(list(counter.items()),columns = ['Country','Number of Cases'])
    df_travel_country = df_travel_country.sort_values('Number of Cases', ascending = False)
    return df_travel_country

## All travel countries
df_travel_country = extract_travel_country(patient_data)

## All travel countries - separate first and second wave
secondwave_start_index = 16 #BN17
firstwave = vietnam_patient_data[:secondwave_start_index]
secondwave = vietnam_patient_data[secondwave_start_index:]

df_travel_country_firstwave = extract_travel_country(firstwave)
df_travel_country_secondwave = extract_travel_country(secondwave)

In [None]:
### BASIC DATAFRAME FOR CASES (STATISTICS):
# Name 'df_cases'

# STEP 1: Create a date dataframe (all days from start to end)
start_date = '01/22/2020'
# end_date =  patient_data['Date'].iloc[-1]
end_date =  vietnam_hospitalized_patient_info['Date'].iloc[-1]
## create an array of date from start_date to end_date, one per day
arr_date = pd.date_range(start = start_date, end = end_date)
## create dataframe
df_date = pd.DataFrame(arr_date, columns = ['Date'])
## Convert date from string to datetime type
df_date['Date'] = pd.to_datetime(df_date['Date'])
# df_date

# STEP 2: create df_cases dataframe
df_cases = pd.DataFrame(columns = ['Date', 'Travel History', 'New Imported Cases',
                                             'New Local Cases'])
## fill up date and travel information
df_cases['Date'] = patient_data['Date']
df_cases['Travel History'] = patient_data['Travel History']
## classify as local or imported cases
df_cases.loc[df_cases['Travel History'] == 0, 'New Local Cases'] = 1
df_cases.loc[df_cases['Travel History'] == 1, 'New Imported Cases'] = 1
df_cases['New Local Cases'] = df_cases['New Local Cases'].fillna(0)
df_cases['New Imported Cases'] = df_cases['New Imported Cases'].fillna(0)

## Sum up all the cases of the same date 
df_cases = df_cases.groupby(['Date'])[['New Imported Cases', 'New Local Cases']].sum().sort_values('Date').reset_index()
df_cases['New Confirmed Cases'] = df_cases['New Imported Cases'] + df_cases['New Local Cases']

## Add all missing dates so that the dates are continuous (join the 2 dataframes):
df_cases = df_date.set_index('Date').join(df_cases.set_index('Date'))

df_cases['New Local Cases'] = df_cases['New Local Cases'].fillna(0)
df_cases['New Imported Cases'] = df_cases['New Imported Cases'].fillna(0)
df_cases['New Confirmed Cases'] = df_cases['New Confirmed Cases'].fillna(0)

## Add accumulative cases column 'Confirmed Cases'
df_cases['Confirmed Cases'] = df_cases['New Confirmed Cases'].cumsum()
## Add accumulative imported and locally transmitted cases column 'Imported Cases', 'Local Cases'
df_cases['Imported Cases'] = df_cases['New Imported Cases'].cumsum()
df_cases['Local Cases'] = df_cases['New Local Cases'].cumsum()

## Add number of hospitalized patients by date (from 'vietnam_hospitalized_patient_info' )
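## Parse 'Date' first so that the join keys match the DatetimeIndex of df_cases
hospitalized_by_date = vietnam_hospitalized_patient_info.copy()
hospitalized_by_date['Date'] = pd.to_datetime(hospitalized_by_date['Date'])
df_cases = df_cases.join(hospitalized_by_date.set_index('Date'))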
df_cases

**CONVERTING THE DATA TO THE SAME FORMAT AS WORLD DATA (JOHNS HOPKINS UNIVERSITY - CSSE)**


The format for COVID-19 world data is as follows:

* Containing 7 columns: 'Province/State', 'Country/Region', 'Lat', 'Long', 'Date', 'Confirmed Cases', 'Deaths'
* Each row is a date, from the start_date up to the most recent date (current)
* Start_date = '01/22/2020': first day available in the COVID-19 world data (JOHNS HOPKINS UNIVERSITY - CSSE)


**NOTE:**

There are 2 possible columns to use as 'Province/State': 'Treatment Location' and 'Detection Location'.

=> Here, I chose the 'Treatment Location' column as 'Province/State' because

(i) Most patients were hospitalized in the same location where they were detected.

(ii) Sometimes patients were transferred to another hospital, and the treatment location is the exact place to track and trace the patient's condition.
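
The conversion described above (one row per patient turned into one row per province per day, with cumulative counts) can be sketched compactly on toy data. This is only an illustration under assumptions: the province names and dates below are made up, and it uses `groupby`/`reindex`/`cumsum` instead of the join-based approach used in the code cell that follows.

```python
import pandas as pd

# Hypothetical per-patient rows: one row per confirmed case.
patients = pd.DataFrame({
    'Treatment Location': ['Hanoi', 'Hanoi', 'Hanoi', 'Da Nang'],
    'Date': pd.to_datetime(['2020-01-23', '2020-01-23', '2020-01-25', '2020-01-24']),
})

# One entry per day for the whole reporting window.
all_days = pd.date_range(start='2020-01-22', end='2020-01-26', name='Date')

# Count new cases per (day, province), put provinces in columns,
# then reindex so that days without any case appear as 0.
daily = (
    patients.groupby(['Date', 'Treatment Location']).size()
    .unstack(fill_value=0)
    .reindex(all_days, fill_value=0)
    .rename_axis(index='Date')
)

# Cumulative totals, reshaped back to one row per (day, province).
long_format = (
    daily.cumsum()
    .reset_index()
    .melt(id_vars='Date', var_name='Province/State', value_name='Confirmed Cases')
)
print(long_format.tail())
```

Reindexing against the full date range is what guarantees that every province has a row for every day, which the world data format requires.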

In [None]:
### DATAFRAME FOR PROVINCE/REGION INFORMATION (THE SAME FORMAT AS WORLD DATA (JOHNS HOPKINS UNIVERSITY - CSSE)):
# Name 'vietnam_cases_all_times'

### STEP 1: CREATE A DATAFRAME OF NEW CASES IN EACH PROVINCE IN EACH AVAILABLE DATE

## Create the dataframe
vietnam_patient_extracted_data = pd.DataFrame(columns = 
                                              ['Province/State','Date','New Confirmed Cases', 'Deaths'])

# Add 'Treatment Location' as 'Province/State' and 'Date'
vietnam_patient_extracted_data['Province/State'] = vietnam_patient_data['Treatment Location']
vietnam_patient_extracted_data['Date'] = vietnam_patient_data['Date']
## Fill 1 for each line in 'New Confirmed Cases' since every row in vietnam_patient_data is a single case
vietnam_patient_extracted_data['New Confirmed Cases'] = vietnam_patient_extracted_data['New Confirmed Cases'].fillna(1)
vietnam_patient_extracted_data['Deaths'] = vietnam_patient_extracted_data['Deaths'].fillna(0)
#vietnam_patient_extracted_data.tail()
# Sum up all the cases of the same date and location
vietnam_patient_extracted_data_sum = vietnam_patient_extracted_data.groupby(['Date',
            'Province/State'])[['New Confirmed Cases', 'Deaths']].sum().sort_values('Date').reset_index()
vietnam_patient_extracted_data_sum

### STEP 2: CREATE A DATAFRAME OF SINGLE COLUMN FOR DATE (FROM START DATE TO END DATE)

start_date = '01/22/2020'
# end_date =  vietnam_patient_extracted_data_sum['Date'].iloc[-1]
# end_date =  patient_data['Date'].iloc[-1]
end_date =  vietnam_hospitalized_patient_info['Date'].iloc[-1]

## create an array of date from start_date to end_date, one per day
arr_date = pd.date_range(start = start_date, end = end_date)

## create dataframe
df_date = pd.DataFrame(columns = ['Date'])
## Add 'Date'
df_date['Date'] = arr_date
## Convert date from string to datetime type
df_date['Date'] = pd.to_datetime(df_date['Date'])
#df_date['Date'] = pd.to_datetime(df_date['Date']).dt.strftime('%m/%d/%Y')

### Step 3: CREATE THE TARGET DATAFRAME 

## Target dataframe is 'vietnam_cases_all_times'
vietnam_cases_all_times = pd.DataFrame(columns = ['Date','Province/State','New Confirmed Cases', 'Deaths',
                                                  'Lat', 'Long', 'Confirmed Cases'])

### Function to extract data from a single province and add that to the vietnam_cases_all_times
def extract_combine_province(province_name):
    
    df_province = df_date
    df_province_extracted = vietnam_patient_extracted_data_sum[vietnam_patient_extracted_data_sum['Province/State'] == province_name]
    ## Join 2 data:
    df_province = df_province.set_index('Date').join(df_province_extracted.set_index('Date'))
    ## Add Lat and Long value
    lat = vietnam_region_info.loc[vietnam_region_info['Province/State'] == province_name, 'Lat'].values[0]
    long = vietnam_region_info.loc[vietnam_region_info['Province/State'] == province_name,'Long'].values[0]
    df_province['Lat'] = lat
    df_province['Long'] = long
    
    ## Fill NaN with 0
    df_province['New Confirmed Cases'] = df_province['New Confirmed Cases'].fillna(0)
    df_province['Deaths'] = df_province['Deaths'].fillna(0)
    df_province['Province/State'] = df_province['Province/State'].fillna(province_name)
    df_province = df_province.reset_index()
    ## Add accumulative cases column 'Confirmed Cases'
    sum_case = df_province['New Confirmed Cases']
    sum_case = sum_case.cumsum()
    df_province['Confirmed Cases'] = sum_case
    return df_province

### Add the data of each province
## The provinces that have confirmed cases in the dataset
provinces = vietnam_patient_extracted_data_sum['Province/State'].unique()

## DataFrame.append was removed in pandas 2.0, so collect the per-province frames and concatenate once
region_frames = [extract_combine_province(province_name) for province_name in provinces]
vietnam_cases_all_times = pd.concat(region_frames)
vietnam_cases_all_times

**=> We are done with preparing the data. Let's do some visualization and analysis!**

# VIETNAM COVID-19 DATA - VISUALIZATION AND ANALYSIS

**1. OVERVIEW ABOUT VIETNAM COVID-19 CASES**

**1.1. GENERAL TREND IN VIETNAM**


In [None]:
# Brief overview of cases

## Extract the final date in the dataframe
date = vietnam_cases_all_times['Date'].iloc[-1]
imported_cases = int(df_cases['Imported Cases'].iloc[-1])
local_cases = int(df_cases['Local Cases'].iloc[-1])
active_cases = int(df_cases['No of Hospitalized Patients'].iloc[-1])
most_recent_data = vietnam_cases_all_times[vietnam_cases_all_times['Date'] == date]
print('Vietnam COVID-19 data on date {}:\n'.format(date))
print('Total confirmed cases:   {}'.format(int(most_recent_data['Confirmed Cases'].sum())))
print('Total imported cases:    {}'.format(imported_cases))
print('Total local cases:       {}'.format(local_cases))
print('Total death cases:         {}'.format(int(most_recent_data['Deaths'].sum())))
print('Total active cases:       {}'.format(active_cases))
print('Total reinfected cases:   {}'.format(patient_data['Re-Infected'].count()))

In [None]:
## GRAPH OF CASES

# reset style use
mpl.rcParams.update(mpl.rcParamsDefault)
# format for date on graph
myFmt = DateFormatter('%m/%d')

fig, ax = plt.subplots(figsize  = (10,3))

#draw plot
ax.plot(df_cases.index, df_cases['Confirmed Cases'], '-o', markersize = 2, color = 'red', markevery=[-1])
ax.plot(df_cases.index, df_cases['No of Hospitalized Patients'],'-o',markersize = 2, color = 'darkviolet', markevery=[-1])

# decoration
ax.set_ylabel('Total Cases', fontdict = {'fontweight':'bold'})
ax.legend(['Confirmed Cases', 'Active Cases'])
ax.set_title('COVID-19 Cases: Trend in Vietnam', loc = 'left', fontdict = {'fontweight':'bold'})
ax.set_xlabel('Date (2020)', fontdict = {'fontweight':'bold'})

# axis
ax.xaxis.set_major_formatter(myFmt)
start, end = ax.get_xlim()
ax.set_xlim(left = start)
ax.set_ylim(top = 300)
x_value = np.arange(start, end, 8)
ax.xaxis.set_ticks(x_value)

# Annotation
end_date =  mdates.date2num(df_cases.index[-1])
case_value = []
case_value.append(df_cases['Confirmed Cases'].iloc[-1])
case_value.append(df_cases['No of Hospitalized Patients'].iloc[-1])

# Add annotation
for val in case_value:
    plt.annotate(int(val), # this is the text
                 (end_date,val), # this is the point to label
                 color = '#333F4B',fontsize = 8, weight = 'bold',
                 textcoords = "offset points", # how to position the text
                 xytext = (6,-2), # distance from text to points (x,y)
                 ha = 'left') # 
plt.savefig('Covid19Cases_Vietnam_trend.png', dpi = 200, bbox_inches='tight')    
plt.show()



**THE NUMBER OF COVID-19 CASES HAS REMAINED LOW IN VIETNAM**

The number of confirmed cases and deaths has risen exponentially and reached grim milestones in many countries. Meanwhile, Vietnam currently has only 270 confirmed COVID-19 cases in total, with NO FATALITIES.

The number of active cases (hospitalized patients) has also remained low. 

**Note that in Vietnam, all COVID-19 patients, including the asymptomatic cases, were hospitalized.** (In contrast, [in the US](https://www.cdc.gov/coronavirus/2019-ncov/if-you-are-sick/steps-when-sick.html), asymptomatic cases or people with mild symptoms are recommended to stay home except to get medical care.)

In [None]:
## The number of cases for each province/city in Vietnam
# Convert date to string for map
vietnam_cases_all_times['Date'] = pd.to_datetime(vietnam_cases_all_times['Date']).dt.strftime('%m/%d/%Y')

fig = px.line(vietnam_cases_all_times,
              x='Date', y='Confirmed Cases', color='Province/State', 
              color_discrete_sequence = px.colors.diverging.Spectral)

fig.update_layout(title = {'text': 'Confirmed Cases for Each Province/City in Vietnam', 'x': 0.5},
                   xaxis_title = 'Date (2020)',
                   yaxis_title = 'Confirmed Cases',
                 legend = {'title': None})
# set up axis
fig.update_xaxes(nticks=6)

fig.show()

**Geographical animation - Cases by Date in Vietnam**

In [None]:
##### *********  MAP - STYLE 2  *********
# Using plotly.express (px)
# Using dataframe vietnam_cases_all_times

# Convert date to string for map
vietnam_cases_all_times['Date'] = pd.to_datetime(vietnam_cases_all_times['Date']).dt.strftime('%m/%d/%Y')

fig = px.scatter_geo(
    vietnam_cases_all_times, lat = 'Lat', lon = 'Long', 
    color = 'Confirmed Cases', size = 'Confirmed Cases', 
    scope = 'asia',
    animation_frame = 'Date', 
    range_color = [0, vietnam_cases_all_times['Confirmed Cases'].max()],  
    hover_name = 'Province/State',
    center = {'lat': 16, 'lon': 108}
)
#range_color = [0, vietnam_cases_all_times['Confirmed Cases'].max()], 
fig.update_layout(margin={"r": 0,"t": 0, "l": 0,"b": 0})
fig.layout.geo.projection = go.layout.geo.Projection(scale = 5)
fig.show()

**Distribution of cases by province/city:**

In [None]:
color_case = 'YlOrRd'
color_death = 'YlOrRd'
province_cases = most_recent_data.groupby('Province/State')[['Confirmed Cases', 'Deaths']].sum()
province_cases.sort_values('Confirmed Cases', ascending = False).reset_index()\
            .style.background_gradient(cmap = color_case, subset = ['Confirmed Cases'])\
            .background_gradient(cmap = color_death, subset = ['Deaths'])

**Geographical map of cases by province**

In [None]:
##### *********  MAP - STYLE 2  *********
# Using folium

#drop the province/city that has 0 'Confirmed Cases' (if any)
most_recent_data = most_recent_data [most_recent_data['Confirmed Cases'] != 0]

#setting style for map
mapstyle = 'CartoDB positron'
line_color = '#da635eff'
fill_color = '#da635eff'
fill_opacity = 0.6
line_weight = 1.5
# other styles: 'OpenStreetMap', 'Stamen Terrain', 'Stamen Toner', 'Stamen Watercolor'

vietnam_map = folium.Map(location = [16,108], zoom_start = 5, max_zoom = 12, min_zoom = 2, tiles = mapstyle)

for lat, long, case, name in zip(most_recent_data['Lat'], most_recent_data['Long'], most_recent_data['Confirmed Cases'],\
                                most_recent_data['Province/State']):
    folium.CircleMarker([lat, long], radius = (int((np.log(case + 1.00001))) + 0.8) * 5,
                       popup = ('<strong>Province/City</strong>: ' + str(name).capitalize() + '<br>'
                                '<strong>Confirmed Cases</strong>: ' + str(case) + '<br>'),\
                       color = line_color, weight= line_weight, \
                        fill_color = fill_color, fill_opacity = fill_opacity).add_to(vietnam_map)

#opacity = fill_opacity, 
 ### Add text
grid_pt=(51.4,0.05)
W=grid_pt[1]-0.005
E=grid_pt[1]+0.005
N=grid_pt[0]+0.005
S=grid_pt[0]-0.005


upper_left=(N,W)
upper_right=(N,E)
lower_right=(S,E)
lower_left=(S,W)
line_color='red'
fill_color='red'
weight=2
text='text'
edges = [upper_left, upper_right, lower_right, lower_left]
vietnam_map.add_child(folium.vector_layers.Polygon(locations=edges, color=line_color, fill_color=fill_color,
                                              weight=weight, popup=(folium.Popup(text))))

# Save map
vietnam_map.save("./vietnam_map.html")
vietnam_map

**1.2 THE IMPORTED AND LOCALLY TRANSMITTED CASES**

In [None]:
## GRAPH OF CASES
# reset style use
mpl.rcParams.update(mpl.rcParamsDefault)
# format for date on graph

myFmt = DateFormatter('%m/%d')
# Layout: 2x1
fig, axs = plt.subplots(2,1, figsize  = (10,6), sharex = True)

## GRAPH OF NEW CASES - DAILY CASES
width = 0.5
axs[0].bar(df_cases.index, df_cases['New Imported Cases'], width, color = 'tomato',alpha=0.8 )
axs[0].bar(df_cases.index, df_cases['New Local Cases'], width, bottom = df_cases['New Imported Cases'], color = 'mediumpurple', alpha=0.8 )
axs[0].set_ylabel('New Cases', fontdict = {'fontweight':'bold'})
axs[0].legend(['New Imported', 'New Locally Transmitted'], loc = 'upper left')
axs[0].set_title('Vietnam COVID-19 Daily Confirmed Cases', loc = 'left', fontdict = {'fontweight':'bold'})
axs[0].set_ylim(top = 40)

#Annotation 
firstwave = dt.datetime(2020, 2, 4)
secondwave = dt.datetime(2020, 3, 26)
axs[0].annotate('The first wave of COVID-19',color = 'firebrick',
            xy = (mdates.date2num(firstwave), 5), 
            xytext = (mdates.date2num(firstwave), 14),
            ha = "center", va = "bottom",
            arrowprops = dict(facecolor='peachpuff', edgecolor = 'none', shrink = 0.01)
               )

axs[0].annotate('The second wave of COVID-19',color = 'firebrick',
            xy = (mdates.date2num(secondwave), 22), 
            xytext = (mdates.date2num(secondwave), 30),
            ha = "center", va = "bottom",
            arrowprops = dict(facecolor='peachpuff', edgecolor = 'none', shrink = 0.01)
            )

# CUMULATIVE CASES - TOTAL CASES, IMPORTED AND LOCALLY TRANSMITTED CASES
# plot
axs[1].plot(df_cases.index, df_cases['Confirmed Cases'], '-o', markersize = 2, color = 'red', markevery=[-1])
axs[1].plot(df_cases.index, df_cases['Imported Cases'], '-o', markersize = 2, color = 'tomato', markevery=[-1])
axs[1].plot(df_cases.index, df_cases['Local Cases'], '-o', markersize = 2, color = 'mediumpurple', markevery=[-1])
# decoration
axs[1].set_ylabel('Total Cases', fontdict = {'fontweight':'bold'})
axs[1].legend(['Confirmed', 'Imported','Locally Transmitted'])
axs[1].set_title('COVID-19 Cases: Trend in Vietnam', loc = 'left', fontdict = {'fontweight':'bold'})
axs[1].set_xlabel('Date (2020)', fontdict = {'fontweight':'bold'})
# set up axis
axs[1].xaxis.set_major_formatter(myFmt)
start, end = axs[1].get_xlim()
axs[1].set_xlim(left = start)
axs[1].set_ylim(top = 300)
x_value = np.arange(start, end, 8)
axs[1].xaxis.set_ticks(x_value)

# Annotation
end_date =  mdates.date2num(df_cases.index[-1])
case_value = []
case_value.append(df_cases['Confirmed Cases'].iloc[-1])
case_value.append(df_cases['Imported Cases'].iloc[-1])
case_value.append(df_cases['Local Cases'].iloc[-1])

# Add annotation
for val in case_value:
    plt.annotate(int(val), # this is the text
                 (end_date,val), # this is the point to label
                 color = '#333F4B',fontsize = 8, weight = 'bold',
                 textcoords = "offset points", # how to position the text
                 xytext = (6,-2), # distance from text to points (x,y)
                 ha = 'left') # 
plt.savefig('Covid19Cases_Vietnam_DailyCases.png', dpi = 200, bbox_inches='tight')  
plt.show()

In [None]:
## GRAPH OF CASES
# reset style use
# mpl.rcParams.update(mpl.rcParamsDefault)
# # formate for date on graph

# myFmt = DateFormatter('%m/%d')
# fig, ax = plt.subplots(figsize  = (10,3))

# ## GRAPH OF NEW CASES - DAILY CASES
# width = 0.5
# ax.bar(df_cases.index, df_cases['New Imported Cases'], width, color = 'tomato',alpha=0.8 )
# ax.bar(df_cases.index, df_cases['New Local Cases'], width, bottom = df_cases['New Imported Cases'], color = 'mediumpurple', alpha=0.8 )
# ax.set_ylabel('New Cases', fontdict = {'fontweight':'bold'})
# ax.legend(['New Imported', 'New Locally Transmitted'], loc = 'upper left')
# ax.set_title('Vietnam COVID-19 Daily Confirmed Cases', loc = 'left', fontdict = {'fontweight':'bold'})
# ax.set_ylim(top = 40)

# #Annotation 
# firstwave = dt.datetime(2020, 2, 4)
# secondwave = dt.datetime(2020, 3, 26)
# ax.annotate('The first wave of COVID-19',color = 'firebrick',
#             xy = (mdates.date2num(firstwave), 5), 
#             xytext = (mdates.date2num(firstwave), 14),
#             ha = "center", va = "bottom",
#             arrowprops = dict(facecolor='peachpuff', edgecolor = 'none', shrink = 0.01)
#                )

# ax.annotate('The second wave of COVID-19',color = 'firebrick',
#             xy = (mdates.date2num(secondwave), 22), 
#             xytext = (mdates.date2num(secondwave), 30),
#             ha = "center", va = "bottom",
#             arrowprops = dict(facecolor='peachpuff', edgecolor = 'none', shrink = 0.01)
#             )


# ax.set_xlabel('Date (2020)', fontdict = {'fontweight':'bold'})
# # set up axis
# ax.xaxis.set_major_formatter(myFmt)
# start, end = ax.get_xlim()
# ax.set_xlim(left = start)
# x_value = np.arange(start, end, 8)
# ax.xaxis.set_ticks(x_value)

# plt.savefig('Covid19Cases_Vietnam_Daily_Alone.png', dpi = 200, bbox_inches='tight')  
# plt.show()

The COVID-19 pandemic in Vietnam can be described as 2 different waves: the first wave (Jan 23 to Feb 16, 2020) and the second wave (Mar 06 to late April 2020). Imported cases have remained dominant among the daily new COVID-19 cases.
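
For readers who prefer to split the waves by date rather than by patient index, here is a minimal sketch. It assumes Mar 06, 2020 as the start of the second wave, as stated above, and the confirmation dates are made up for illustration. The kernel itself splits the data at patient BN17 instead.

```python
import pandas as pd

# Hypothetical confirmation dates (made up for illustration).
dates = pd.Series(pd.to_datetime(
    ['2020-01-23', '2020-02-13', '2020-03-06', '2020-04-01']
))

# Assumption: the second wave starts on Mar 06, 2020.
second_wave_start = pd.Timestamp('2020-03-06')

# Label each case by comparing its date with the cutoff.
wave = dates.ge(second_wave_start).map({True: 'Second wave', False: 'First wave'})
cases_per_wave = wave.value_counts()

print(cases_per_wave)
```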

Interestingly, the increase in the number of COVID-19 imported and locally transmitted cases has the same trend, which added up to the total confirmed cases. In addition, the number of imported cases always larger than the locally transmitted cases. 

**=> This trend indicated that Vietnam has been able to control the outbreak consistently overtime.**
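
The comparison between imported and local cases can be checked numerically. The snippet below is a minimal sketch using only the standard library; the daily counts are made-up numbers for illustration, and the helper names `imported_share` and `days_imported_dominant` are hypothetical, not part of this notebook.

```python
# Illustrative daily counts (made-up numbers, NOT taken from the dataset).
daily_cases = [
    {"date": "2020-03-20", "imported": 5, "local": 1},
    {"date": "2020-03-21", "imported": 6, "local": 2},
    {"date": "2020-03-22", "imported": 12, "local": 7},
]


def imported_share(rows):
    """Return the fraction of all new cases in `rows` that were imported."""
    imported = sum(r["imported"] for r in rows)
    total = imported + sum(r["local"] for r in rows)
    return imported / total if total else float("nan")


def days_imported_dominant(rows):
    """Count the days on which imported cases were at least as many as local cases."""
    return sum(1 for r in rows if r["imported"] >= r["local"])


share = imported_share(daily_cases)
dominant_days = days_imported_dominant(daily_cases)
print(f"Imported share: {share:.1%}; "
      f"imported >= local on {dominant_days} of {len(daily_cases)} days")
```

The same two quantities could be derived from the `New Imported Cases` and `New Local Cases` columns used for the daily chart.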

**2. WHERE DID VIETNAM GET THE COVID-19 IMPORTED CASES?**

NOTE: DISEASES HAVE NO BORDERS, AND NEITHER DOES THIS NOVEL CORONAVIRUS. THEREFORE, THIS ANALYSIS IS EQUIVOCAL AND SHOULD BE VIEWED AS A REFERENCE ONLY.


* Some patients had traveled to other countries before coming back to Vietnam. They had transited through different countries and might have been infected during these short transit periods. For example, patient BN7 was likely infected while transiting for 2 hours in Wuhan, China on his trip from the US to Vietnam.

=> In this analysis, all the countries, including transit countries, were included.

* Some patients traveled to several countries in Europe (no specific country was given), so the countries they traveled to were classified as Europe.
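
The counting rule described above can be sketched as follows. This is a hedged illustration only: the travel histories are invented, the comma-separated format is an assumption, and `count_travel_countries` is a hypothetical helper rather than the code used to build `df_travel_country`.

```python
from collections import Counter

# Hypothetical travel histories, one string per imported case,
# with countries separated by commas (assumed format).
travel_histories = [
    "US, China",   # a trip with a transit stop: both countries are counted
    "UK",
    "UK, France",
    "Europe",      # no specific country reported, so it stays as "Europe"
]


def count_travel_countries(histories):
    """Count, per country, how many patients traveled to or transited through it."""
    counts = Counter()
    for history in histories:
        # A set makes sure one patient contributes at most once per country.
        countries = {name.strip() for name in history.split(",") if name.strip()}
        counts.update(countries)
    return counts


country_counts = count_travel_countries(travel_histories)
print(country_counts.most_common())
```

Because transit countries are included, the counts can add up to more than the number of patients (6 country mentions for 4 patients in this toy input).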

In [None]:
## TRAVEL COUNTRY - THE LIST OF COUNTRIES WHERE THE IMPORTED CASES WERE POSSIBLY EXPOSED AND INFECTED
# Plot the number of cases for each country, dividing into 2 groups: first and second infection waves:

# Prepare data for plot
group_name = df_travel_country_firstwave['Country'].tolist()
group_name.extend(df_travel_country_secondwave['Country'].tolist())
group_size = df_travel_country_firstwave['Number of Cases'].tolist()
sub_group1 = len(group_size)
group_size.extend(df_travel_country_secondwave['Number of Cases'].tolist())
sub_group2 = len(group_size) - sub_group1


## Choose the position of each bar on the x-axis, leaving 7 empty positions between the 2 waves
group_pos1 = np.arange(0,sub_group1)
skip_step = sub_group1 + 7
group_pos2 = np.arange(skip_step,sub_group2 + skip_step)
x_pos = np.concatenate((group_pos1, group_pos2))

# Draw plot
fig, ax = plt.subplots(figsize=(10,8), dpi = 80)
ax.vlines(x = x_pos, ymin = 0, ymax = group_size, color = 'firebrick', alpha = 0.7, linewidth = 2)
ax.scatter(x = x_pos, y = group_size, color = 'firebrick', alpha = 0.7)


# Annotate
for i in range(0,len(x_pos)):
    ax.text(x_pos[i], group_size[i] + 2, s = f'{group_name[i]} ({group_size[i]})', wrap = True, rotation = 90, 
            fontsize = 10, va = 'center_baseline')
ax.set_ylabel('Cases', fontsize =  'large', fontdict = {'fontweight':'bold'})
ax.set_title('Where Vietnam COVID-19 Imported Cases Had Travelled', loc = 'left', 
             fontsize =  'x-large', fontdict = {'fontweight':'bold'})

# Make the chart more beautiful

# Beautiful axis
ax.set_ylim(bottom = 0, top = 85)
ax.set_xlim(left = -4)

# remove top and right spines
ax.spines['right'].set_visible(False)
ax.spines['top'].set_visible(False)

# hide grid
ax.grid(False)

# Change the xticks
plt.xticks([0.5,statistics.median(group_pos2)], ['First Wave', 'Second Wave'], 
           fontsize =  'large', fontweight = 'bold')
# remove the ticks on the x axis since they do not reflect the positions of the first and second waves
plt.tick_params(axis = "x", which = "both", bottom = False, top = False)

# Show graphic
plt.savefig('TravelHistories.png', dpi = 200, bbox_inches='tight') 
plt.show()

#### Draw plot - horizontal
# fig, ax = plt.subplots(figsize=(6, 18), dpi= 80)
# ax.hlines(y=y_pos, xmin=0, xmax=group_size, color='firebrick', alpha=0.7, linewidth=2)
# ax.scatter(y=y_pos, x=group_size, color='firebrick', alpha=0.7)

# # Annotate
# for i in range(0,len(y_pos)):
#     ax.text(group_size[i] + 1, y_pos[i], s = f'{group_name[i]} ({group_size[i]})', 
#             horizontalalignment = 'left', verticalalignment = 'center', fontsize = 10)

### Vietnam got more COVID-19 confirmed imported cases from Europe and the US than from China.


The COVID-19 pandemic in Vietnam can be described as 2 distinct waves: 

* The first wave, with most imported cases coming from China. The only case imported from the US (shown on the chart) was patient BN7, who was likely infected while transiting for 2 hours in Wuhan, China on his trip from the US to Vietnam. 

* The second wave, which started with many cases imported from Europe (predominantly from the UK).

This trend is largely consistent with what happened in the U.S. ([the U.S. got more COVID-19 cases from Europe than from China](https://theintercept.com/2020/04/12/u-s-got-more-confirmed-index-cases-of-coronavirus-from-europe-than-from-china/)).

In [None]:
### LIST OF COUNTRIES WHERE VIETNAM GOT THE COVID-19 IMPORTED CASES

## Simple style to visualize the list of countries - Not in use, kept as reference

# df_travel_country.reset_index(drop=True)
# df_travel_country.style.background_gradient(cmap = 'YlOrRd', subset = ['Number of Cases'])

## Another way to visualize the data
# Using matplotlib.pyplot plt

df_travel_country1 = df_travel_country.sort_values('Number of Cases', ascending = True).reset_index(drop=True)
df_travel_country1 = df_travel_country1.set_index('Country')
# we first need a numeric placeholder for the y axis
my_range = list(range(1,df_travel_country1.shape[0]+1))

#set color
color_bar = 'tomato'
color_marker = 'red'
# set font
plt.rcParams['font.family'] = 'sans-serif'
plt.rcParams['font.sans-serif'] = 'Helvetica'
# set the style of the axes and the text color
plt.rcParams['axes.edgecolor']='#333F4B'
plt.rcParams['axes.linewidth']=0.8
plt.rcParams['xtick.color']='#333F4B'
plt.rcParams['ytick.color']='#333F4B'
plt.rcParams['text.color']='#333F4B'
fig, ax = plt.subplots(figsize = (6,6)) 
# hide grid (must be called after the axes of this figure are created)
ax.grid(False)

# draw for each country a horizontal line that starts at x = 0, with its length
# given by the number of cases
plt.hlines(y = my_range, xmin = 0, xmax = df_travel_country1['Number of Cases'],
           color = color_bar, alpha = 0.8, linewidth = 5)

# draw for each country a dot at the number of cases
plt.plot(df_travel_country1['Number of Cases'], my_range, "o", markersize=5, color=color_marker, alpha=0.9)

# set labels
ax.set_xlabel('Number of Cases', fontsize = 10, fontweight='black', color = '#333F4B')
ax.set_ylabel('')

# Add annotation
# zip joins x and y coordinates in pairs
for x,y in zip(df_travel_country1['Number of Cases'],my_range):

    label = x

    plt.annotate(label, # this is the text
                 (x,y), # this is the point to label
                 color = '#333F4B',
                 textcoords = "offset points", # how to position the text
                 xytext = (6,-2), # distance from text to points (x,y)
                 ha = 'left') # 
    
# set axis
ax.tick_params(axis='both', which='major', labelsize = 10)
plt.yticks(my_range, df_travel_country1.index)

# add a title above the chart
fig.text(0, 0.92, 'Travel Countries Of Vietnam COVID-19 Patients Before Confirmed Positive',
         fontsize = 12, fontweight = 'black', color = '#333F4B')

# change the style of the axis spines
ax.spines['top'].set_color('none')
ax.spines['right'].set_color('none')

# set the spines position
ax.spines['bottom'].set_position(('axes', 0))
ax.spines['left'].set_position(('axes', 0.0))

plt.savefig('hist2.png', dpi = 100, bbox_inches = 'tight')
plt.show()

In [None]:
## Basic pie chart for infection source

pull_val = [0]* len(df_travel_country1['Number of Cases'])
pull_val[-1] = 0.1

# pull is given as a fraction of the pie radius
fig = go.Figure(data=[go.Pie(labels=df_travel_country1['Number of Cases'].index, values=df_travel_country1['Number of Cases'], 
                             pull=pull_val, marker_colors = px.colors.qualitative.Light24, sort = False, 
                   )])

fig.show()

**3. VIETNAM COVID-19 PATIENTS: GENDER, AGE AND NATIONALITY**

In [None]:
## Prepare value for charts
# data for left plot
nationality_list = list(patient_nationality['Patient Nationality'].value_counts())
all_cases_labels = ["Vietnamese", "Foreigners"]
# data for right plot
foreigners_list = dict(patient_nationality[patient_nationality['Nationality'] != 'Vietnam']['Nationality'].value_counts())
names, values = zip(*foreigners_list.items())


# Create subplots: use 'domain' type for Pie subplot
fig = make_subplots(rows = 1, cols = 2, specs = [[{'type':'domain'}, {'type':'domain'}]])
# all cases chart
colors = ['lightskyblue', 'plum']
fig.add_trace(go.Pie(labels = all_cases_labels, values = nationality_list,texttemplate = "%{percent} (%{value})", 
                     marker_colors = colors, name = "All COVID-19 cases"),
              1, 1)
fig.add_trace(go.Pie(labels = names, values = values, name = "Foreigner patients", 
                    marker_colors = px.colors.qualitative.Pastel, sort = False, 
                    direction = 'clockwise',texttemplate = "%{percent} (%{value})"),
              1, 2)

# Use `hole` to create a donut-like pie chart
fig.update_traces(hole=.6, hoverinfo="label+percent+name")

fig.update_layout(
    title_text = "<b>Nationality of Vietnam COVID-19 Patients</b>",
    # Add annotations in the center of the donut pies.
    annotations = [dict(text = '<b>All<br>Cases</b>', x = 0.15, y = 0.5, font_size = 16, showarrow = False),
                 dict(text = '<b>Foreigners</b>', x = 0.87, y = 0.5, font_size = 16, showarrow = False)])
fig.show()

In [None]:
# Plot the number of cases regarding gender, nationality and travel history
colors = ['#98D8D8','#EE99AC']
## Setting up for plot
sns.set(style='whitegrid', rc={"grid.linewidth": 0.1})

## Drawing plot
g = sns.catplot(x = "Patient Nationality", hue = "Gender", col = 'Travel History', palette = colors,
            data = patient_nationality, kind = 'count', height = 4, aspect = 0.7)

## Decoration
plt.ylim([0,80])

g.fig.subplots_adjust(wspace=.5, hspace=.05)

g.fig.suptitle('Vietnam COVID-19 Patients: Gender, Nationality and Travel History', y = 1.1, weight = 'bold')
g.set_titles("{col_name}")
g.set(ylabel ='Cases') 
g.set(xlabel ='')

for i in np.arange(2):
        ax1 = g.facet_axis(0,i)
        for p in ax1.patches:
            # skip empty bars, whose height is NaN
            if not np.isnan(p.get_height()):
                ax1.text(p.get_x() + 0.15, p.get_height()+ 0.5, '{:.0f}'.format(p.get_height()),size='small')

plt.setp(g._legend.get_title(), fontsize=12, weight = 'bold')
plt.show()

Among Vietnamese patients, more COVID-19 cases were female, while among foreigners, more cases were male. This trend likely reflects the occupations, travel purposes and infection clusters in Vietnam. Let's look into the age distribution of the patients for more details.

In [None]:
# Plot the distribution of patients' age regarding gender, nationality and travel history
colors = ['#98D8D8','#EE99AC']

## Setting up for plot
sns.set(style='whitegrid', rc={"grid.linewidth": 0.1})

## Drawing plot
g = sns.catplot(x = "Age", y = "Travel History",
            hue = "Gender", col = "Patient Nationality",
            data = patient_nationality,
            orient = "h", height = 5, aspect = 1, palette = colors,
            kind = "violin", dodge = True, cut = 0, bw = .2)

## Decoration
plt.xlim([0,90])
sns.set_style({"ytick.left":False},)
g.fig.suptitle('Vietnam COVID-19 Patients: Age Distribution Per Gender, Nationality and Travel History', 
               fontsize = 15, y = 1.1, weight = 'bold')
g.set_titles("{col_name}", fontdict = {'fontweight':'bold'})
g.set_xlabels(fontdict = {'fontweight':'bold'})
g.set_ylabels('', fontdict = {'fontweight':'bold'})
g.set(ylabel ='')
plt.setp(g._legend.get_title(), fontsize=12, weight = 'bold')

g.set_yticklabels(['Imported Cases', 'Local Cases'], fontdict = {'fontweight':'bold'})

plt.tight_layout()
plt.savefig('Covid19Cases_Vietnam_AgeGender.png', dpi = 200, bbox_inches='tight') 
plt.show()

**Note about violin plot:** It is a hybrid of a box plot and a kernel density plot. The white dot marks the median; the thick gray bar in the center marks the interquartile range; wider and skinnier sections of the violin indicate a higher and lower probability, respectively, that members of the population will take on the given value.
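
To make the two summary markers concrete, here is a minimal sketch that computes the median (the white dot) and the interquartile range (the thick bar) with the standard library. The ages are made-up values for illustration, not rows from the patient dataset.

```python
import statistics

# Made-up ages for illustration only.
ages = [21, 23, 24, 25, 27, 30, 34, 45, 58]

# The white dot in the violin: the median.
median_age = statistics.median(ages)

# The thick bar in the violin: from the first to the third quartile.
# method="inclusive" interpolates linearly between the sample's own min and max.
q1, _, q3 = statistics.quantiles(ages, n=4, method="inclusive")
iqr = q3 - q1

print(f"median = {median_age}, Q1 = {q1}, Q3 = {q3}, IQR = {iqr}")
```

Half of the observations fall inside the interquartile range, which is why the bar is a quick way to see where the bulk of each group sits.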

* **With the imported cases:**

  - The majority of the Vietnamese imported cases were 20-30 years old. They were students and young employees who had traveled abroad to study or train. 

  - The majority of the foreigners were older, ranging from 50-70 years old for men and around 45-60 for women. They were likely retired travelers.
  

* **With the local cases:**

The age trend seems to be linked to specific infection clusters in Vietnam (young foreigners were linked to the Buddha Bar & Grill (a restaurant and bar) cluster; the majority of Vietnamese female patients were linked to a food supply company for Bach Mai hospital, etc.).

**=> Currently, with this small dataset, it is not possible to tell whether any age-gender population group is at higher risk of contracting COVID-19. The outbreak in Vietnam appeared largely under control, as all the cases seem to be linked to specific clusters and travel groups.**

**4. VIETNAM COVID-19 PATIENTS: SYMPTOMS AND HOSPITALIZED TIME**


**4.1 Patient symptoms**

In [None]:
# PATIENT SYMPTOMS - BRIEF GRAPH
# Using seaborn

plt.figure(figsize=(15,5))
fig = sns.countplot(y = 'Health Condition When Confirmed', data = patient_data,
                   palette = 'Set2')
fig.set_title('Symptoms of Vietnam COVID-19 Patients', fontdict = {'fontsize':20})
fig.set_xlabel('Number of Cases')
plt.show()

In [None]:
# PATIENT SYMPTOMS - PIE CHART
# Using plotly.graph_objs

## Counting the group of symptoms
symptoms = patient_data['Health Condition When Confirmed'].tolist()
counter = collections.Counter(symptoms)

## Order the symptom groups so that the pie chart serves its purpose
symptom_group = (
     'Not reported',
     'Not reported (Stable condition)',
     'No symptom when positive',
    'Cold/Flu-like symptoms',
     'COVID-19 (with/without cold/flu-like symptoms)',
    'Showing symptoms (no detail)')

## Extract the group_size for graph
group_size = []
for key in symptom_group:
    group_size.append(counter[key])

# Rename the name
label_group = (
     'Not reported',
     'Not reported (stable condition)',
     'Showing no symptoms',
    'Cold/flu-like symptoms',
     'COVID-19 symptoms',
    'Showing symptoms (no detail)')

## Picking colors
colors = ['lightgray', 'silver','#66b3ff', '#ff9999', 'tomato','#ffcc99']
fig = go.Figure(data = [go.Pie(labels = label_group, values = group_size, hole = .7,
                             marker_colors = colors,textinfo = 'percent+label',
                              insidetextorientation = 'radial', sort = False, direction = 'counterclockwise',
                             rotation = 220, texttemplate = "%{label} (%{percent})",
                            pull = [0.01, 0.01, 0.01, 0.01, 0.01, 0.01], 
                            showlegend = False)])

fig.update_traces(textfont_size = 14)
fig.show()

In [None]:
# PATIENT SYMPTOMS - DISPLAYING ASYMPTOMATIC VS SYMPTOMATIC CASES - NESTED PIE CHART
# Using matplotlib pyplot

## Counting the group of symptoms
symptoms = patient_data['Health Condition When Confirmed'].tolist()
counter = collections.Counter(symptoms)

#### MAKING SUBGROUP (INSIDE CIRCLE) (symptom_group and group_size)
## Order the symptom groups so that the chart serves its purpose
symptom_group = (
     'Not reported',
     'Not reported (Stable condition)',
     'No symptom when positive',
    'Cold/Flu-like symptoms',
     'COVID-19 (with/without cold/flu-like symptoms)',
    'Showing symptoms (no detail)')

## Extract the group_size for graph
group_size = []
for key in symptom_group:
    group_size.append(counter[key])
    
group_size_labels = []
for i in group_size:
    label = (i/sum(group_size))*100
    group_size_labels.append(f"{label:.1f}%")

#### MAKING MAINGROUP (OUTSIDE CIRCLE) (symptom_main_value, symptom_main_percentage and symptom_main_label)
symptom_main_value = []
symptom_main_percentage = []
symptom_main_value.append(sum(group_size[:3]))
symptom_main_value.append(sum(group_size[3:]))
for i in symptom_main_value: 
    symptom_main_percentage.append(100 * (i/sum(symptom_main_value)))
    
symptom_main_label = (
     'Asymptomatic\n(presumptive) ({:^.1f}%)'.format(symptom_main_percentage[0]),
     'Symptomatic\n({:^.1f}%)'.format(symptom_main_percentage[1]))

#### DRAWING NESTED PIE CHART
# Create colors
a, b, c = [plt.cm.Greys, plt.cm.Blues, plt.cm.OrRd]

## Explosion
explode1 = (0.008, 0.008)
explode2 = (0.01, 0.01, 0.01, 0.01, 0.01, 0.01)

# First Ring (outside)
fig, ax = plt.subplots()
fig.set_dpi(200)
ax.axis('equal')
mypie, _ = ax.pie(symptom_main_value, radius = 1.3, 
                  labels = symptom_main_label, labeldistance = 1.05, textprops = {'fontsize': 10,'weight':'bold'},
                  startangle = 90, explode = explode1,
                  colors = [b(0.5), c(0.6)] )

plt.setp( mypie, width=0.3, edgecolor='white')
# Second Ring (Inside)
mypie2,_ = ax.pie(group_size, radius = 1.3-0.3, startangle = 90,
                   explode = explode2,
                    labels = group_size_labels, labeldistance = 0.7,textprops = {'fontsize': 7,'weight':'bold'},
                  rotatelabels = True,
                   colors=[a(0.3), a(0.2), b(0.3),
                           c(0.3), c(0.2), c(0.1)])
    
plt.setp( mypie2, width=0.4, edgecolor='white')

plt.tight_layout() #Automatically adjust subplot parameters to give specified padding.

# Adding legend
legend_elements = [
                Patch(facecolor = a(0.3), label = symptom_group[0]),
                Patch(facecolor = a(0.2), label = symptom_group[1]),
                Patch(facecolor = b(0.3), label = symptom_group[2]),
                Patch(facecolor = c(0.3), label = symptom_group[3]),
                Patch(facecolor = c(0.2), label = symptom_group[4]),
                Patch(facecolor = c(0.1), label = symptom_group[5]),
]
leg = ax.legend(handles=legend_elements, fontsize = 10, loc = (-0.8,0.7), frameon = False,
         title = 'Detail Breakdown of Symptoms\n', title_fontsize = 12)
leg._legend_box.align = "left"

# save and show
plt.savefig('pie_symptom.png', dpi = 200, bbox_inches='tight')  #after plt.show() is called, a new figure is created, need to save first
plt.show()

Among all confirmed COVID-19 cases in Vietnam, only 35.2% of patients showed symptoms right before or right after they tested positive for the virus. Remarkably, only 6.3% of cases reported [COVID-19 associated symptoms](https://www.cdc.gov/coronavirus/2019-ncov/symptoms-testing/symptoms.html) (shortness of breath/pressure in the chest/mild pneumonia/respiratory failure).

The majority of symptomatic patients reported mild common cold/flu-like symptoms (such as fever, cough, sore throat, fatigue, etc.) (25.4% of all cases).

In [None]:
## Plotting the detail symptoms of Vietnam COVID-19 patients

# Preparing the data
detail_symptoms = patient_data['Detail Symptoms When Confirmed (clean up)'].dropna().tolist()
symptoms = []
for symptom in detail_symptoms:    
    name = symptom.split(', ')
    symptoms.extend(name)
counter = collections.Counter(symptoms)
df_symptoms = pd.DataFrame(list(counter.items()),columns = ['Symptoms','Number of Cases'])
df_symptoms['Percentage'] = round(df_symptoms['Number of Cases']/len(detail_symptoms)*100,2)
df_symptoms = df_symptoms.sort_values('Number of Cases', ascending = False).reset_index(drop=True)

# SIMPLE LIST OF DETAIL SYMPTOMS
#df_symptoms.style.background_gradient(cmap = 'YlOrRd', subset = ['Number of Cases', 'Percentage'])

#### BEAUTIFUL CHART FOR DETAIL SYMPTOMS
# Using matplotlib pyplot

df_symptoms1 = df_symptoms.sort_values('Number of Cases', ascending = True).reset_index(drop=True)

# set font
plt.rcParams['font.family'] = 'sans-serif'
plt.rcParams['font.sans-serif'] = 'Helvetica'
color_bar = 'tomato'
color_marker = 'red'
# set the style of the axes and the text color
plt.rcParams['axes.edgecolor'] = '#333F4B'
plt.rcParams['axes.linewidth'] = 0.8
plt.rcParams['xtick.color'] = '#333F4B'
plt.rcParams['ytick.color'] = '#333F4B'
plt.rcParams['text.color'] = '#333F4B'

# we first need a numeric placeholder for the y axis
my_range=list(range(1,df_symptoms1.shape[0]+1))

fig, ax = plt.subplots(figsize=(8,8))

# Hide axis
ax.set_frame_on(False)
ax.get_xaxis().tick_bottom()
ax.axes.get_xaxis().set_visible(False)
# Hide grid lines
ax.grid(False)

# draw for each symptom a horizontal line that starts at x = 0, with its length 
# given by the percentage of symptomatic cases reporting that symptom
plt.hlines(y=my_range, xmin=0, xmax=df_symptoms1['Percentage'], color=color_bar, alpha=0.8, linewidth=5)

# draw for each symptom a dot at its percentage value
plt.plot(df_symptoms1['Percentage'], my_range, "o", markersize=5, color=color_marker, alpha=0.9)

# set labels
ax.set_xlabel('Percentage', fontsize=15, fontweight='bold', color = '#333F4B')
ax.set_ylabel('')

# Add annotation
# zip joins x and y coordinates in pairs
for x,y in zip(df_symptoms1['Percentage'],my_range):

    label = "{:.2f}%".format(x)

    plt.annotate(label, # this is the text
                 (x,y), # this is the point to label
                 color = '#333F4B',
                 textcoords = "offset points", # how to position the text
                 xytext = (6,-2), # distance from text to points (x,y)
                 ha = 'left') # 
    
# set axis
ax.tick_params(axis='both', which='major', labelsize=14)
plt.yticks(my_range, df_symptoms1['Symptoms'])

# add a title above the chart
fig.text(0.1, 0.9, 'Detail Symptom of COVID-19 Symptomatic Cases', 
         fontsize = 15, fontweight = 'black', color = '#333F4B')

# change the style of the axis spines
ax.spines['top'].set_color('none')
ax.spines['right'].set_color('none')

# set the spines position
ax.spines['bottom'].set_position(('axes', 0))
ax.spines['left'].set_position(('axes', 0.0))

# remove the tick marks on the y axis
plt.tick_params(axis = "y", which = "both", left = False, right = False)

plt.savefig('hist2.png', dpi = 100, bbox_inches='tight')
plt.show()

Among symptomatic patients, the common symptoms include fever, cough and sore throat.

**4.2. Hospitalized time**

In [None]:
# Plotting the overall hospitalized time (for the patients that have recovered and been discharged)
colors = ['#98D8D8','#EE99AC']

# plot
g = sns.catplot(x = 'Nationality', y='Days Hospitalized Since Confirmed', hue = 'Gender', 
            kind = 'box', data = hospitalized_time, palette = colors)
# decoration
g.fig.suptitle('Vietnam COVID-19 Patients: Duration of Hospitalization Since Confirmed Positive', 
               fontsize = 12, y = 1.02, weight = 'bold')
plt.setp(g._legend.get_title(), fontsize=12, weight = 'bold')
g.set_xlabels(fontdict = {'fontweight':'bold'})
g.set_ylabels('Days', fontdict = {'fontweight':'bold'})
plt.savefig('Covid19Cases_HospitalizationDuration.png', dpi = 200, bbox_inches='tight') 
plt.show()

* The average hospitalized time for Vietnam COVID-19 patients (since confirmed positive) is around 2 weeks. (In this dataset, the hospitalized time is counted from the day the patient was confirmed positive for COVID-19, not the day the sample was taken for testing.) The time needed to release the test result might vary among patients.

* Foreigners had a longer hospitalized time compared to Vietnamese patients. Possible reasons are the environment inside the hospital (food, language barriers in healthcare) and the age group: the majority of foreigners were retired travelers, who tend to be more vulnerable to the virus.

* The majority of patients spent 10-20 days in the hospital, mostly because of the stringent discharge policy in Vietnam. (In an uncommon case, patient BN51 went through 13 rounds of testing before being officially discharged; see the reference and note in the Vietnam COVID-19 patient dataset.)

*The guidelines for discharge (on the Vietnam Health Ministry's website) are as follows:
"COVID-19 patient could be discharged after 2 continuous negative laboratory tests within 48 hours, reported no fever for at least 3 continuous days, having normal vitals, normal blood test, improving chest X-ray. And the discharged patient must be sent into quarantine (at home/hotel) for 14 more days."*
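
The way the hospitalized time is counted can be illustrated with a short sketch. The dates below are hypothetical, and `days_between` is an illustrative helper, not the code that produced the `Days Hospitalized Since Confirmed` column.

```python
from datetime import date


def days_between(start, end):
    """Return the number of whole days from `start` to `end`."""
    return (end - start).days


# Hypothetical dates for a single patient (illustration only).
admitted = date(2020, 3, 7)      # hospitalized and sampled for testing
confirmed = date(2020, 3, 10)    # test result released: confirmed positive
discharged = date(2020, 3, 25)

# The convention used in this analysis: count from the confirmation date.
stay_since_confirmed = days_between(confirmed, discharged)

# Counting from admission instead gives a longer stay for the same patient.
stay_since_admission = days_between(admitted, discharged)

print(stay_since_confirmed, stay_since_admission)
```

The gap between the two figures is the time it took to release the test result, which is why very short stays can appear when counting from the confirmation date.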

**Discussion of the outliers:**

* There was 1 case whose hospitalized time was only 1 day. Possible reasons are:
   - The hospitalized time is counted from the day the patient was confirmed positive for COVID-19, not the day the patient was hospitalized and the sample was taken for testing. Therefore, this patient might have been hospitalized before being confirmed positive for the virus.
   - The time needed to release the test result might vary among patients.
   
* Some patients had long hospitalized periods (more than 30 days), which suggests that their conditions were more severe. 
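
For reference, a box plot usually flags a point as an outlier when it lies more than 1.5 × IQR beyond the quartiles. The sketch below applies that rule, assuming the default whisker setting of 1.5 and using made-up hospital stays rather than the real `hospitalized_time` data.

```python
import statistics

# Made-up hospital stays in days (illustration only).
stays = [1, 10, 12, 13, 14, 15, 16, 18, 20, 35]

# Quartiles with linear interpolation between the sample's min and max.
q1, _, q3 = statistics.quantiles(stays, n=4, method="inclusive")
iqr = q3 - q1

# Whisker limits under the common 1.5 x IQR rule (assumed default).
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

outliers = [d for d in stays if d < lower_limit or d > upper_limit]
print(f"limits = ({lower_limit}, {upper_limit}), outliers = {outliers}")
```

In this toy sample both the 1-day stay and the 35-day stay fall outside the whiskers, mirroring the two kinds of outliers discussed above.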

# CONCLUSIONS

***Some key findings:***

* Only 6.3% of confirmed COVID-19 cases in Vietnam reported COVID-19 associated symptoms when they tested positive. If Vietnam had not responded swiftly with testing and early quarantine, more than 90% of these COVID-19 cases would have continued to frequent communal and public places, unwittingly spreading the virus in the community.

* The very low rate of increase in the number of coronavirus cases suggests that the Vietnam government's initial response has succeeded in controlling the virus through swift testing, contact tracing, quarantine and surveillance.

* It has been almost 100 days since the first COVID-19 case was confirmed in Vietnam (January 23, 2020), and the outbreak appears largely under control. Judging by the scale seen in other countries, if Vietnamese officials had tried to suppress information or underreport COVID-19 cases, 100 days would have been more than enough for the pandemic to reach a grim milestone at which no government could hide such a disaster.


* The majority of Vietnam COVID-19 imported cases are linked to students/employees studying or training abroad and to foreigners who traveled to Vietnam.


**Take-away message:**

* Most COVID-19 patients showed no symptoms around the time they tested positive. Therefore, testing, contact tracing, quarantine and surveillance are key to containing the virus.

* Until a vaccine or specific, effective treatments are available, no country is guaranteed to be safe from the virus. Everyone needs to be on alert. Take care and be safe!


**Thank you for reading this notebook.**

**Please upvote if you like this notebook. Any comment/feedback is much appreciated.**