# COVID Data Visualization - Part 1
[Edward Toth, PhD, University of Sydney]

- e-mail: eddie_toth@hotmail.com
- Add me on: https://www.linkedin.com/in/edward-toth/ 
- Join the community: https://www.meetup.com/Get-Singapore-Meetup-Group/

Using different data visualization tools, we try to understand the spread of COVID through pretty pictures. In these tutorials you will:
- learn more about the spread of COVID
- explore the following Python libraries for visualizing data

__PART 1:__

- pandas (quick plots for data frames/series)
- matplotlib (basic plotting library in Python)
- plotly (interactive plots)
    

<!-- ggplot (follows `R`'s ggplot2, concepts from 'The Grammar of Graphics'), web-based plot for dash, Gleam (inspired by R's Shiny package for interactive plotting apps) -->
  
For more detail on data visualization libraries: https://mode.com/blog/python-data-visualization-libraries/

Dataset from: 
https://www.kaggle.com/sudalairajkumar/novel-corona-virus-2019-dataset
- 8 `.csv` files
- records of confirmed cases, deaths and recovered patients
- short description of patients (gender, age, location, etc.)

## In Part 1 of Visualizing COVID
We will visualize data mainly from:
- `time_series_covid_19_confirmed.csv`
- `time_series_covid_19_deaths.csv` 
- `time_series_covid_19_recovered.csv`

In [None]:
from IPython.display import display
import pandas as pd
# pd.read_csv already returns a DataFrame, so no extra pd.DataFrame(...) wrapper is needed
data_dir = '../input/novel-corona-virus-2019-dataset/'
covid_data = pd.read_csv(data_dir + 'covid_19_data.csv')
indiv_list = pd.read_csv(data_dir + 'COVID19_line_list_data.csv')
open_list = pd.read_csv(data_dir + 'COVID19_open_line_list.csv')
confirmed_US = pd.read_csv(data_dir + 'time_series_covid_19_confirmed_US.csv')
confirmed = pd.read_csv(data_dir + 'time_series_covid_19_confirmed.csv')
deaths_US = pd.read_csv(data_dir + 'time_series_covid_19_deaths_US.csv')
deaths = pd.read_csv(data_dir + 'time_series_covid_19_deaths.csv')
recovered = pd.read_csv(data_dir + 'time_series_covid_19_recovered.csv')
 

# Check tail for most recent date
display(confirmed.tail())
# Define dates for time series
dates = confirmed.columns[4:]

#  Getting the data into a dataframe
def country_row(df,cs):
    return df[df['Country/Region']==cs]

country_row(confirmed,"Singapore")

# matplotlib
Time series data is easily plotted using `matplotlib`.
- We use `time_series_covid_19_confirmed.csv`, `time_series_covid_19_deaths.csv` and `time_series_covid_19_recovered.csv`.
- We also calculate the active cases with the formula:

`[Active] = [Confirmed] - [Recovered] - [Deaths]`
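
As a quick illustration of the formula, here is a minimal sketch on made-up numbers. The names `toy_confirmed`, `toy_recovered` and `toy_deaths` and all of the counts are hypothetical, not values from the dataset.

```python
import pandas as pd

# Hypothetical cumulative counts for three dates (not real data)
toy_dates = ['5/1/20', '5/2/20', '5/3/20']
toy_confirmed = pd.Series([100, 150, 210], index=toy_dates)
toy_recovered = pd.Series([20, 45, 80], index=toy_dates)
toy_deaths = pd.Series([5, 8, 10], index=toy_dates)

# [Active] = [Confirmed] - [Recovered] - [Deaths], aligned on the date index
toy_active = toy_confirmed - toy_recovered - toy_deaths
print(toy_active.tolist())  # [75, 97, 120]
```

Because the three series share the same date index, pandas lines the values up date by date before subtracting.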



To visualize worldwide data over time using `matplotlib`:
1. Sum up counts over all countries
2. Calculate the active cases
3. Plot confirmed, recovered, deaths and active cases over time (on the same figure)
4. Modify labels, title, legend

In [None]:
# Worldwide count
import matplotlib.pyplot as plt 
# import matplotlib.dates as mdates
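# Sum over the date columns only, giving one worldwide total per date
# (summing the whole frame would also try to add the text and coordinate columns)
confirmed_cases = confirmed[dates].sum()
deaths_cases = deaths[dates].sum()
recovered_cases = recovered[dates].sum()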

## Adjust plotting style 
plt.style.use('fivethirtyeight')
fig, ax = plt.subplots(figsize=(12, 8))
# Plot the cases over extracted dates # dates = recovered.iloc[:,4:].columns
active_cases = confirmed_cases[dates]- deaths_cases[dates]-recovered_cases[dates]
plt.plot(dates, confirmed_cases[dates].values.T)
plt.plot(dates,active_cases[dates].values.T) # active cases 
plt.plot(dates, recovered_cases[dates].values.T)
plt.plot(dates, deaths_cases[dates].values.T)

# add nice label settings 
scale = 30 # scale text font and dates 
plt.xlabel('Dates', size=scale)
plt.ylabel('Number of Cases', size=scale)
plt.yticks(size=scale*2/3)
plt.title('Worldwide Count', size=20)
plt.legend(['Confirmed', 'Active', 'Recovered', 'Deaths'], prop={'size': scale*2/3}, bbox_to_anchor=(0.25, 1))

## How to better see x labels
x_ticks = range(0,dates.size, 20) # show a tick label for every 20th date
plt.xticks(x_ticks,size=scale*2/3)
plt.gcf().autofmt_xdate()

Interpretation
- After around mid-March, worldwide confirmed cases of the COVID virus rise exponentially
- Out of those confirmed cases: most are active, a large proportion have recovered and a smaller proportion are deaths.

### Examine Specific Countries
- Note that we are using three different files (confirmed, recovered, deaths)
- Check if data frames are consistent with indices (BUT they're not!)
- Use `IPython.display` for a nicer print out
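
One possible way to check this consistency is to count how many rows each country has in each file. The sketch below uses tiny made-up frames (`toy_confirmed`, `toy_deaths`, `toy_recovered` and the country names are hypothetical) in place of the real files.

```python
import pandas as pd

# Hypothetical miniature versions of the three files (not real data)
toy_confirmed = pd.DataFrame({'Country/Region': ['Country A', 'Country A', 'Country B'],
                              '5/1/20': [10, 20, 30]})
toy_deaths = pd.DataFrame({'Country/Region': ['Country A', 'Country A', 'Country B'],
                           '5/1/20': [1, 2, 0]})
toy_recovered = pd.DataFrame({'Country/Region': ['Country A', 'Country B'],
                              '5/1/20': [12, 25]})

# Count how many rows each country has in each file
row_counts = pd.DataFrame({
    'confirmed': toy_confirmed['Country/Region'].value_counts(),
    'deaths': toy_deaths['Country/Region'].value_counts(),
    'recovered': toy_recovered['Country/Region'].value_counts(),
}).fillna(0).astype(int)

# Keep only the countries whose row counts differ between files
mismatched = row_counts[row_counts.nunique(axis=1) > 1]
print(mismatched)
```

Running the same idea on `confirmed`, `deaths` and `recovered` should list the countries that need aggregating before their rows can be subtracted.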

### Plot nature of COVID cases for different countries
- Similar to the code for the worldwide count but you don't `sum` up values
- Some countries are recorded by state, region or province. Thus there are multiple values recorded in some countries.
- Solution: aggregate the cases of confirmed, recovered, deaths and active cases over the different regions of the country. 
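
As an alternative to filtering and summing one country at a time, a `groupby` can aggregate every country in one step. This is a sketch on a made-up frame (`toy` and its values are hypothetical) laid out like the time series files.

```python
import pandas as pd

# Hypothetical frame in the same layout as the time series files (not real data)
toy = pd.DataFrame({
    'Province/State': ['North', 'South', None],
    'Country/Region': ['Country A', 'Country A', 'Country B'],
    '5/1/20': [10, 20, 30],
    '5/2/20': [15, 25, 35],
})
toy_dates = ['5/1/20', '5/2/20']

# One row per country: sum only the date columns within each country
by_country = toy.groupby('Country/Region')[toy_dates].sum()
print(by_country.loc['Country A'].tolist())  # [30, 40]
```

Selecting the date columns before summing keeps the text columns (such as `Province/State`) out of the aggregation.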

In [None]:
# Adding that extra work into the plotting function 
def plotting_covid(dates,country,text_scale):
  import matplotlib.dates as mdates
  # Extract the rows for this country and sum the date columns over its
  # states/provinces (a single-row country is left unchanged by the sum)
  confirmed_cases = country_row(confirmed,country)[dates].sum()
  deaths_cases = country_row(deaths,country)[dates].sum()
  recovered_cases = country_row(recovered,country)[dates].sum()
  
  # Calculate the number of active cases; after aggregation each series has one
  # value per date, so they subtract cleanly even when the row counts differed
  active_cases = confirmed_cases - deaths_cases - recovered_cases
  
  # Plotting data
  plt.plot(dates, confirmed_cases[dates].values.T)
  plt.plot(dates,active_cases) # active cases 
  plt.plot(dates, recovered_cases[dates].values.T)
  plt.plot(dates, deaths_cases[dates].values.T)
  plt.xlabel('Dates', size=text_scale)
  plt.ylabel('Number of Cases', size=text_scale)
  # Use the text_scale argument rather than the global variable `scale`
  plt.legend(['Confirmed', 'Active', 'Recovered', 'Deaths'], prop={'size': text_scale*2/3}, bbox_to_anchor=(0.6, 1))
  plt.yticks(size=text_scale*2/3)
  # How to better see x labels: show one tick every `text_scale` dates
  x_ticks = range(0,dates.size,round(text_scale))
  plt.xticks(x_ticks, size=text_scale*2/3)
  plt.gcf().autofmt_xdate()



### Compare the five countries with the highest COVID counts to Australia (I'm from OZ, that's why!)
- Examine confirmed, active, recovered, deaths in SIX plots

In [None]:
# Top five countries + Australia
Countries = confirmed.sort_values(by=[dates[-1]], ascending=False).head()['Country/Region']
Countries = [str(c) for c in Countries ]
Countries.append('Australia')
# Countries.append('Singapore')
# Plot Australia with top 5 countries with most confirmed cases 
# fig, ax = plt.subplots(2, 3, sharex='col', sharey='row',figsize=(15, 12))
scale = 20  # text scale passed to plotting_covid
fig = plt.figure(figsize=(15,10))
fig.subplots_adjust(hspace=0.2, wspace=0.5)
for i in range(1,7): # 6 countries
    ax = fig.add_subplot(2, 3, i)
    country = Countries[i-1]
    ax.title.set_text(Countries[i-1])
    plotting_covid(dates, country, scale)

Interpretation
- Confirmed and active cases are growing exponentially in US, UK, Russia
- The active cases have a downturn (decreasing COVID cases) in Spain, Italy, Australia
- The active curve for Australia is similar to the one on the Worldometer website:
https://www.worldometers.info/coronavirus/country/australia/

### Let's examine the active cases more closely




## Examine Active Cases using `plotly`
- Great for interactive graphs (line graphs, contour plots, 3D charts, etc.)
- Also available in R, and Plotly offers an online platform for visualizing data
- Examining the nature of active cases

### Simple analysis: 
In which month (Feb, Mar, Apr, May) does the maximum number of active cases occur?
- If the max occurs in Feb or Mar, the active curve flattened early
- If the max occurs in Apr, COVID cases are beginning to decrease
- If the max occurs in May, COVID is still a problem
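
The peak-month idea can be sketched on a small made-up table. Here `toy_active` and its numbers are hypothetical, and the date labels are assumed to follow the `m/d/yy` style of the time series column headers.

```python
import pandas as pd

# Hypothetical active-case counts, one row per country (not real data)
toy_active = pd.DataFrame(
    {'2/15/20': [50, 5], '3/15/20': [30, 40], '4/15/20': [10, 90], '5/15/20': [5, 70]},
    index=['Country A', 'Country B'],
)

# Column label (date string) at which each row reaches its maximum
peak_date = toy_active.idxmax(axis=1)
# Parse the labels with the m/d/yy format and keep only the month number
peak_month = pd.to_datetime(peak_date, format='%m/%d/%y').dt.month
print(peak_month.tolist())  # [2, 4]
```

Country A peaks in February (early flattening) while Country B peaks in April (cases beginning to decrease).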

### Advanced analysis (if you're bothered):
- SIR epidemic models
- Polynomial fits 
- Regression prediction
- Use calculus to find turning points
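
For a flavour of the polynomial fit and turning point ideas, here is a rough sketch on a synthetic curve. The `days` and `active` arrays are made up, and a quadratic is only an assumption chosen for simplicity; real active-case curves would likely need a more careful model.

```python
import numpy as np

# Hypothetical active-case curve that rises and then falls (not real data)
days = np.arange(10)
active = -2.0 * (days - 6) ** 2 + 100

# Fit a quadratic a*t^2 + b*t + c to the curve
a, b, c = np.polyfit(days, active, deg=2)

# The turning point is where the derivative 2*a*t + b equals zero
turning_day = -b / (2 * a)
print(round(turning_day, 2))  # 6.0
```

A negative leading coefficient `a` means the fitted curve opens downwards, so the turning point is a peak.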


In [None]:
# 1. Calculate the number of active cases over time for each country
# 2. Find countries (top 10) with large decreasing trend in active cases 
# 3. Find countries (top 10) with exponential growth in COVID cases
### 1. Calculate the number of active cases over time for each country
active_rows = []
recorded_country = []
for country in confirmed['Country/Region'].unique():
  # Sum the date columns over every row (state/province) of the country;
  # for a country with a single row the sum simply returns that row
  confirmed_cases = country_row(confirmed,country)[dates].sum()
  deaths_cases = country_row(deaths,country)[dates].sum()
  recovered_cases = country_row(recovered,country)[dates].sum()
  # record country
  recorded_country.append(country)
  # calculate active cases for this country
  active_rows.append(confirmed_cases - deaths_cases - recovered_cases)

# DataFrame.append was removed in pandas 2.0, so build the data frame in one step
df_active = pd.DataFrame(active_rows, columns=dates).reset_index(drop=True)

# Insert country names
df_active.insert(0, 'Country/Region',recorded_country)
df_active.head()

Just created a new data frame containing the active cases over time for each country. WOHOOOO!

- Then create a plotting function similar to before

In [None]:
 # Plotting function
def plot_active(df,country_list,title,scale):
  fig, ax = plt.subplots(figsize=(14, 10))
  for country in country_list:
    active_cases =  df[df['Country/Region']==country]
    plt.plot(dates,active_cases[dates].values.T) # active cases 
  # Use the `scale` argument (the name text_scale is not defined in this function)
  plt.xlabel('Dates', size=scale)
  plt.ylabel('Number of Cases', size=scale)
  plt.legend(country_list, prop={'size': scale*2/3},bbox_to_anchor=(1.2, 1))
  plt.yticks(size=scale*2/3)
  plt.title(title)
  # How to better see x labels: show one tick every `scale` dates
  x_ticks = range(0,dates.size,round(scale))
  plt.xticks(x_ticks, size=scale*2/3)
  plt.gcf().autofmt_xdate()
  plt.show()

Using `matplotlib` vs. `plotly` for multiple line plots: 
- `matplotlib`: data frame with one country column and dates as new columns
- `plotly`: data frame with one date column and each country as new columns
- `plotly`: from this layout you can stack the country columns into one long column with `melt`
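
To make the reshaping concrete, here is a minimal sketch of `pd.melt` on a made-up wide frame. The frame `wide` and the column name `Active` are hypothetical, and this variant melts directly on the country column rather than transposing first, so it is one possible route rather than the one used in this notebook.

```python
import pandas as pd

# Hypothetical wide frame: one row per country, one column per date (not real data)
wide = pd.DataFrame({
    'Country/Region': ['Country A', 'Country B'],
    '5/1/20': [10, 30],
    '5/2/20': [15, 35],
})

# Long format: one row per (country, date) pair
long = pd.melt(wide, id_vars='Country/Region', var_name='Dates', value_name='Active')
print(long.shape)  # (4, 3)
```

A long frame like this could then be passed to `px.line(long, x='Dates', y='Active', color='Country/Region')` to draw one line per country.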

In [None]:
display("Data Format for matplotlib",df_active.head())
dft =df_active.iloc[:,1:].transpose()
country_list = confirmed['Country/Region'].unique()
dft.columns= [ str(cl) for cl in country_list] 
dft['Dates'] = dft.index
# display("Data Format for plotly",dft.head())
df_melt = pd.melt(dft, id_vars="Dates", value_vars=dft.columns[:-1])
display("Melting data for plotly",df_melt)

Alternatively, use `covid_data` and sum up the counts for each country. I don't like its messy country labels, such as "('St. Martin',)" or "occupied Palestinian territory", so I chose to use the combined time series data instead.

In [None]:
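# Work on a copy so the original covid_data is left unchanged
df_covid = covid_data.copy()
df_covid['Active'] = df_covid['Confirmed'] - df_covid['Deaths'] - df_covid['Recovered']
# Sum only the count columns; the remaining columns hold text
count_cols = ['Confirmed', 'Deaths', 'Recovered', 'Active']
df_covid = df_covid.groupby(['Country/Region','ObservationDate'],as_index=False)[count_cols].sum()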
df_covid.tail()
# fig = px.line(df_covid[df_covid["Country/Region"]==country], x="ObservationDate", y="Active", color='Country/Region')

In [None]:
import plotly.express as px 
import pandas as pd 
import numpy as np 
def plotly_active(df_active,CL,title):
    dft = df_active.iloc[:,1:].transpose()
    country_list = df_active['Country/Region']
    dft.columns = [ str(cl) for cl in country_list] 
    dft['Dates'] = dft.index
    # Melt only the requested countries (CL is a list of column names)
    df_melt = pd.melt(dft, id_vars="Dates", value_vars=CL)
    fig = px.line(df_melt, x="Dates", y="value", color='variable')
    xnum = [num for num in range(0,len(dates),20)]
    xnum.append(len(dates)-1)
    fig.update_layout(title={
        'text': title,
        'y':0.95,
        'x':0.5,
        'xanchor': 'center',
        'yanchor': 'top'},
        yaxis_title="Number of Active Cases",
        xaxis = dict(
            tickmode = 'array',
            tickvals = xnum,
            ticktext = dates[xnum]
        )
    )
    fig.show()

In [None]:
from datetime import datetime
# check month... easier  
df4 = df_active[dates].apply(pd.to_numeric, errors='coerce')
check_dates = df4.idxmax(axis=1).values
date_num = []
for d in check_dates:
    datee = datetime.strptime(d, "%m/%d/%y")
    date_num.append(datee.month)
date_num


#FEB
df5 = pd.DataFrame({'Country/Region':country_list,'PeakMonth':date_num})
# Countries that peaked in Feb
dff = df5.loc[df5['PeakMonth']==2,:]
# display(dff)
str(dff['Country/Region'].values)
CL = [str(el) for el in dff['Country/Region'].values ]
# plot_active(df_active,CL,"Countries with COVID Peak in Feb",20)

display("1. SELECT AREA to zoom in")
display("2. RESET AXES button")
plotly_active(df_active,CL,"Countries with COVID Peak in Feb")
len(dff)
dff.shape

In [None]:
# MAR 
df5 = pd.DataFrame({'Country/Region':country_list,'PeakMonth':date_num})
# Countries that peaked in Mar
dff = df5.loc[df5['PeakMonth']==3,:]
# display(dff)
str(dff['Country/Region'].values)
CL = [str(el) for el in dff['Country/Region'].values ]
display("1. HOVER over data points")
display("2. ZOOM IN button and PAN to move")
display("3. AUTOSCALE button also resets axes ")

plotly_active(df_active,CL,"Countries with COVID Peak in Mar")


In [None]:
#APR
df5 = pd.DataFrame({'Country/Region':country_list,'PeakMonth':date_num})
# Countries that peaked in Apr
dff = df5.loc[df5['PeakMonth']==4,:]
# display(dff)
str(dff['Country/Region'].values)
CL = [str(el) for el in dff['Country/Region'].values ]
# plot_active(df,cnames,"Countries with COVID Peak in April",20)
display("1. DOUBLE CLICK curves of interest")
display("2. Then CLICK to add other curves")
display("3. DOUBLE CLICK to reset")
plotly_active(df_active,CL,"Countries with COVID Peak in Apr")
dff.shape

In [None]:
#MAY 
df5 = pd.DataFrame({'Country/Region':country_list,'PeakMonth':date_num})
# Countries that peaked in May
dff = df5.loc[df5['PeakMonth']==5,:]
# display(dff)
str(dff['Country/Region'].values)
CL = [str(el) for el in dff['Country/Region'].values ]
# plot_active(df,cnames,"Countries with COVID Peak in May",20)
plotly_active(df_active,CL,"Countries with COVID Peak in May")

In [None]:
#JUN
df5 = pd.DataFrame({'Country/Region':country_list,'PeakMonth':date_num})
# Countries that peaked in Jun
dff = df5.loc[df5['PeakMonth']==6,:]
# display(dff)
str(dff['Country/Region'].values)
CL = [str(el) for el in dff['Country/Region'].values ] 
plotly_active(df_active,CL,"Countries with COVID Peak in Jun")

In [None]:
#JUL
df5 = pd.DataFrame({'Country/Region':country_list,'PeakMonth':date_num})
# Countries that peaked in Jul
dff = df5.loc[df5['PeakMonth']==7,:]
# display(dff)
str(dff['Country/Region'].values)
CL = [str(el) for el in dff['Country/Region'].values ] 
plotly_active(df_active,CL,"Countries with COVID Peak in Jul")

Awesome!
These visualization tools can help you understand the spread of COVID in different countries.

## Summary
- pandas: different plots, labels, ticks 
- matplotlib: subplots, function
- plotly: requires specific data frame, melt, awesome interactive features 






## NOT THE END, Just the beginning
# Part 2 NEXT WEEK:
- missingno
- plotly



### Talk to me! 
- e-mail: eddie_toth@hotmail.com
- Add me on: https://www.linkedin.com/in/edward-toth/ 
- Join the community: https://www.meetup.com/Get-Singapore-Meetup-Group/