# COVID Data Visualization - Part 3
[Edward Toth, PhD, University of Sydney]

- e-mail: eddie_toth@hotmail.com
- Add me on: https://www.linkedin.com/in/edward-toth/ 
- Join the community: https://www.meetup.com/Get-Singapore-Meetup-Group/

Using different data visualization tools, we attempt to understand the spread of COVID through pretty pictures. In these tutorials you will:
- learn more about the spread of COVID
- explore the following Python libraries for visualizing data

 
  

__PART 3:__
 - __`missingno`__ (offers a quick visual summary of missing values in data)
 - `wordcloud` to visualize text data
 - `seaborn` for further visualizations 
 
 YES THIS IS PART THREE so if you haven't already, check out 
 - Part 1 (Active Cases): https://www.kaggle.com/reddieeddie/visualizing-covid-part-1-active-cases
 - Part 2 (Interactive Plots): https://www.kaggle.com/reddieeddie/visualizing-covid-p2-awesome-interactive-plots
 

For more detail on data visualization libraries: https://mode.com/blog/python-data-visualization-libraries/

In this tutorial we use: 

- Data: https://www.kaggle.com/sudalairajkumar/novel-corona-virus-2019-dataset
- contains 8 `.csv` files
- records of confirmed cases, death and recovered patients
- Description of patients (gender, age, location, etc.)

 


In [None]:
from IPython.display import display
import matplotlib.pyplot as plt
import pandas as pd
# !pip install missingno
import missingno as mn
path = '../input/novel-corona-virus-2019-dataset/' 
# pd.read_csv already returns a DataFrame, so no extra wrapping is needed
covid_data = pd.read_csv(path+'covid_19_data.csv')
indiv_list = pd.read_csv(path+'COVID19_line_list_data.csv')
open_list = pd.read_csv(path+'COVID19_open_line_list.csv')
confirmed_US = pd.read_csv(path+'time_series_covid_19_confirmed_US.csv')
confirmed = pd.read_csv(path+'time_series_covid_19_confirmed.csv')
deaths_US = pd.read_csv(path+'time_series_covid_19_deaths_US.csv')
deaths = pd.read_csv(path+'time_series_covid_19_deaths.csv')
recovered = pd.read_csv(path+'time_series_covid_19_recovered.csv')
# Check tail for most recent date
display(covid_data.tail())
# Define dates for time series
dates = confirmed.columns[4:]
# dates[-1]

## missingno
- Check which column variables are useful
- `missingno` visualizes missing values
- See if there is a possible relationship between columns with missing values
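
As a rough sketch of what the bar chart summarizes, the count and share of non-missing values per column can be computed directly with pandas. The small `toy` frame below is made up for illustration and is not part of the COVID dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (not taken from the COVID dataset)
toy = pd.DataFrame({
    "country": ["China", "Japan", "Italy", "Spain"],
    "province": ["Hubei", np.nan, np.nan, np.nan],
    "age": [61.0, np.nan, 45.0, 30.0],
})

# Count and share of non-missing values per column
filled_count = toy.notnull().sum()
filled_share = toy.notnull().mean()

print(pd.DataFrame({"filled": filled_count, "share": filled_share}))
```

A column with a share close to 1 is almost complete, while a share close to 0 flags a column that may not be useful.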


In [None]:
import matplotlib.pyplot as plt 
import missingno as mn
mn.bar(covid_data)

In [None]:
dataframes = [confirmed, deaths, recovered, covid_data, indiv_list, open_list]
names = ['confirmed', 'deaths', 'recovered', 'covid_data', 'indiv_list', 'open_list']
# Create an empty figure; the subplots are added one by one in the loop
fig = plt.figure(figsize=(10,10))
# One bar chart per dataframe, laid out on a 3x2 grid
for num, (name, df) in enumerate(zip(names, dataframes)):
    plt.subplot(3,2,num+1).title.set_text(name)
    mn.bar(df, color='DarkBlue')
fig.tight_layout()
 

### Interpretation
- The first three plots contain time series in which most values are present (many date columns).
- `covid_data` (second row, second column) has about half of its province/state field missing.
- Many values are missing in different columns of `indiv_list` ('COVID19_line_list_data.csv') and `open_list` ('COVID19_open_line_list.csv').

In [None]:
# Get rid of blank columns (all missing entries) in open_list and indiv_list
# print("Proportion of missing values", open_list.isnull().mean())
# Extract columns where not all values are missing
cols = open_list.columns[open_list.isnull().mean() != 1]
open_list = open_list[cols] # Get rid of totally missing values in the column
# Repeat for indiv_list
cols = indiv_list.columns[indiv_list.isnull().mean() != 1]
indiv_list = indiv_list[cols]

# Bar shows the proportion/number of non-missing values
plt.subplot(121)
plt.gca().set_title('open_list', fontsize=30) 
mn.bar(open_list,color=(0.25, 0.5, 0.25)) # ignore completely missing columns
plt.subplot(122)
plt.gca().set_title('indiv_list', fontsize=30) 
mn.bar(indiv_list,color=(0.5, 0.25, 0.25))


### Interpretation
`open_list` contains many missing values:
- many columns have a large proportion of missing values such as age, sex, date of hospital admission. 
- columns with most values include: city, province, country, latitude, longitude, from wuhan or not. 

`indiv_list` contains many missing values:
- some columns have a large proportion of missing values such as symptoms,  exposure start and end dates.
- columns with most values include: location, country, gender, age, death. 

## Matrix plot
Visualize whether the missing values in `indiv_list` follow a pattern across rows:
- Row 1 is the earliest entry and row 1085 is the latest
- At the start of data collection, case_in_country (possibly not symptoms) was not recorded
- Certain chunks of missing values are consistent across variables, from symptom_onset to exposure_end
- The line plot (on the right) shows the number of non-missing values plotted against row number
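
The per-row completeness described in the last bullet can be sketched with plain pandas. The frame below is a hypothetical example; only the column names echo those mentioned above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: early rows lack the onset and exposure dates
toy = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "symptom_onset": [np.nan, np.nan, "01/20/20", "01/22/20"],
    "exposure_end": [np.nan, np.nan, "01/18/20", np.nan],
})

# Number of non-missing values in each row, in row order
completeness = toy.notnull().sum(axis=1)
print(completeness.tolist())  # [1, 1, 3, 2]
```

Plotting `completeness` against the row index gives the same kind of curve as the line on the right of the matrix plot.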


In [None]:
# Visualize the missing values in the dataframe
mn.matrix(indiv_list,color=(0.5, 0.25, 0.25)) # adds more red

## Heatmap
- Visualize the closeness between column structures based on missing values
- Ignores columns where all values are filled (link, source, recovered, death, etc.)
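
A minimal sketch of the idea behind the heatmap, assuming that "closeness" means the correlation between the 0/1 missingness indicators of two columns. The `toy` data is invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: "id" is fully filled, "age" and "gender" share gaps
toy = pd.DataFrame({
    "id": [1, 2, 3, 4, 5, 6],
    "age": [30, np.nan, 45, np.nan, 61, 50],
    "gender": ["m", np.nan, "f", np.nan, "m", np.nan],
})

# 1 where a value is missing, 0 where it is present
missing = toy.isnull().astype(int)

# Columns with no variation in missingness (for example fully filled ones)
# have an undefined correlation, so they are left out
varying = missing.loc[:, missing.nunique() > 1]

nullity_corr = varying.corr()
print(nullity_corr.round(2))
```

Here `age` and `gender` are missing in mostly the same rows, so their nullity correlation is about 0.71, while `id` does not appear at all.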

In [None]:
# indiv_list contains fewer variables (columns)
# heatmap to pick up interesting correlations between missing values
mn.heatmap(indiv_list)

## Dendrograms
- performs hierarchical clustering of columns based on how closely their missing values fall in the same rows
- suggests which variables have similar structures of missing values
- Note: a column where the first half of the values is missing is a terrible partner for another column where the second half is missing.
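
The clustering idea can be sketched with SciPy on a made-up frame. The choice of `"average"` linkage and Euclidean distance is an assumption for illustration, and the columns `a`, `b`, `c` are hypothetical.

```python
import numpy as np
import pandas as pd
from scipy.cluster import hierarchy

# Hypothetical toy data
toy = pd.DataFrame({
    "a": [1, 2, np.nan, np.nan],
    "b": [5, 6, np.nan, np.nan],   # missing in the same rows as "a"
    "c": [np.nan, np.nan, 7, 8],   # missing exactly where "a" is filled
})

# One row per column of `toy`, holding its 0/1 missingness pattern
patterns = toy.isnull().astype(int).T.values

# Agglomerative clustering of the columns (linkage method is an assumption)
links = hierarchy.linkage(patterns, method="average")
print(links)
```

Columns `a` and `b` merge first at distance 0 because their gaps coincide, while `c` joins last at the largest distance, which is the "terrible partner" case from the note above.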


In [None]:
#dendrogram reports on the closeness of missing values in different variables
mn.dendrogram(open_list)
mn.dendrogram(indiv_list) 

### Interpretation

In `open_list`, columns whose missing-value structures are close include:
- (large proportion of filled values in the bar plot): geo_resolution, longitude and latitude; country_new and admin_id (not that interesting to look at)
- (many missing values in the bar plot): age and sex; outcome and date of death/discharge; date of symptom onset and date of hospital admission; travel history dates and location; province and admin1; city and admin2

In `indiv_list`, columns whose missing-value structures are close (or correlated) include:
1. 'reporting date' has a similar structure to link, source, recovered, death, visiting Wuhan, id, location.
2. 'summary' and 'from Wuhan' have a high closeness (almost 100% of values filled, correlation of 0.9)
3. 'gender' and 'age' are usually recorded together (around 80% of values filled in the bar plot, correlation of 0.7)

### We focus on these three items in the subsequent visualizations!
The above variables are in the first level of the hierarchy (closest in structure). Looking at the second level (columns with less correlated missing values) and using the correlation heatmap:
- hospital visit date may be explored together with symptom onset and whether the onset was approximated

__Reminder:__ the correlation here is based on the patterns of missing entries between columns. It is not the same as a high Pearson correlation, which measures the relationship between numeric values.
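
To make the distinction concrete, here is a small sketch with invented data in which two columns are missing in exactly the same rows, yet their recorded values move in opposite directions.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: same gaps, opposite trends in the values
toy = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0, np.nan, np.nan],
    "y": [4.0, 3.0, 2.0, 1.0, np.nan, np.nan],
})

# Correlation between the missingness patterns (what the heatmap reports)
nullity_corr = toy.isnull().astype(int).corr().loc["x", "y"]

# Pearson correlation between the recorded values themselves
value_corr = toy["x"].corr(toy["y"])

print(nullity_corr, value_corr)
```

The missingness correlation is 1 while the value correlation is -1, so a high value in the heatmap says nothing about how the numbers themselves relate.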




In [None]:
display(open_list[['date_confirmation','geo_resolution', 'longitude', 'latitude', 'country_new','admin_id']].head(10))
# indiv_list['death'].value_counts().head(10)

- Even though the columns geo_resolution, longitude and latitude, country_new and admin_id have a large proportion of filled values, they are not that interesting to look at.

### 1. 'reporting date' has a similar structure to link, source, recovered, death, visiting Wuhan, id, location. 
- Not too much interesting stuff going on 


In [None]:
display(indiv_list[['reporting date','link','source','recovered','death','visiting Wuhan', 'id','location']].head(2))
display("Sources are mainly official government sources, with some popular media outlets:")
indiv_list['source'].value_counts().head(10)

Other possibilities:
- examine the variable `visiting Wuhan` with `death` and `recovered` 

In [None]:
import seaborn as sb

# Work on a copy so the recoding below does not touch indiv_list
data = indiv_list[['visiting Wuhan','death','recovered']].copy()
# Entries other than '0' or '1' are recoded as '1'
select = [el not in ['0','1'] for el in data['recovered'].values]
data.loc[select,'recovered'] = '1'
select = [el not in ['0','1'] for el in data['death'].values]
data.loc[select,'death'] = '1'
# Left: counts split by death, right: counts split by recovery
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(15,5))
sb.countplot(x='visiting Wuhan', data=data,ax=axes[0], hue='death').set_title('Patients who Visited Wuhan: Deaths')
sb.countplot(x='visiting Wuhan', data=data,ax=axes[1], hue='recovered').set_title('Patients who Visited Wuhan: Recovered')


### Interpretation
- From the left countplot, it seems that a small number of patients who were visiting Wuhan died
- In the right countplot, a significant proportion of people visiting Wuhan actually recovered
- The data suggest that a large proportion of COVID cases are still active (that is, recorded as neither recovered nor dead)
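
To put numbers on impressions like these, the counts can be turned into row-wise shares with `pd.crosstab`. This is a sketch on made-up records; it assumes, as in the cell above, that any entry other than '0' or '1' (for example a date string) means the event happened.

```python
import pandas as pd

# Hypothetical toy records (not taken from the dataset)
toy = pd.DataFrame({
    "visiting Wuhan": [1, 1, 1, 1, 0, 0, 0, 0],
    "death": ["0", "0", "0", "2/21/2020", "0", "0", "0", "0"],
    "recovered": ["0", "1", "2/12/2020", "0", "0", "0", "1", "0"],
})

# Recode anything that is not '0' or '1' as '1'
for col in ["death", "recovered"]:
    toy[col] = toy[col].where(toy[col].isin(["0", "1"]), "1")

# Share of each outcome within each 'visiting Wuhan' group
death_share = pd.crosstab(toy["visiting Wuhan"], toy["death"], normalize="index")
recovered_share = pd.crosstab(toy["visiting Wuhan"], toy["recovered"], normalize="index")

# Share of cases recorded as neither recovered nor dead
active_share = ((toy["death"] == "0") & (toy["recovered"] == "0")).mean()

print(death_share, recovered_share, active_share, sep="\n\n")
```

Applied to `data` from the cell above, the same calls would give the proportions behind the two countplots.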


### 2. 'summary' and 'from Wuhan' have a high closeness (Almost 100% of filled values, correlation of 0.9) 

In `indiv_list`, 'summary' and 'from Wuhan' have highly correlated missing-value structures (0.9), with almost 100% of values filled. We therefore:
- separate the `summary` text based on whether the patient is from Wuhan or not
- visualize the text in a wordcloud and compare word frequencies
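
The two steps can be sketched without any plotting, using only the standard library. The records, the `from_wuhan` key, the stop word set and the `word_frequencies` helper are all hypothetical and only illustrate the split-then-count idea.

```python
from collections import Counter

# Hypothetical summaries (not taken from the dataset)
records = [
    {"from_wuhan": 1, "summary": "resident of Wuhan hospitalized with pneumonia"},
    {"from_wuhan": 1, "summary": "Wuhan resident with fever and pneumonia"},
    {"from_wuhan": 0, "summary": "traveller from Japan with fever"},
]

# Small illustrative stop word set
STOP = {"of", "with", "and", "from"}

def word_frequencies(texts):
    """Count lower-cased words, ignoring punctuation and stop words."""
    words = []
    for text in texts:
        for token in text.lower().split():
            token = token.strip(",.'\"")
            if token and token not in STOP:
                words.append(token)
    return Counter(words)

# Step 1: split the summaries by group. Step 2: count words in each group.
from_wuhan = word_frequencies(r["summary"] for r in records if r["from_wuhan"] == 1)
not_from_wuhan = word_frequencies(r["summary"] for r in records if r["from_wuhan"] == 0)

print(from_wuhan.most_common(3))
print(not_from_wuhan.most_common(3))
```

A `Counter` is a dictionary of word frequencies, so it is the kind of input that a wordcloud built from frequencies expects.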

In [None]:
from wordcloud import WordCloud, STOPWORDS

def word_counts(df):
    # Draws a wordcloud of the summaries on the current axes and returns the word counts
    # Extra stop words on top of wordcloud's built-in STOPWORDS
    StopWords = ['confirmed','COVID-19','patient','new',"'new",'onset','female','male','went']
    # Split the stringified list of summaries into space-separated tokens
    txt = str(df.tolist()).split(' ')
    text1 = [item for item in txt if item not in StopWords]
    # Strip commas and quotes from the remaining tokens.
    # Note: stop words are matched before this clean-up and are case-sensitive,
    # so tokens such as "male," still pass the filters.
    text = [item.replace(",","").replace("'","") for item in text1 if item not in STOPWORDS]
    counts = pd.Series(text).value_counts()
    wordcloud = WordCloud(background_color="white").generate_from_frequencies(counts)
    plt.imshow(wordcloud,  interpolation="bilinear")
    plt.axis("off")
    return counts
# Visualize the wordcloud
fig = plt.figure(figsize=(20,10))
df1 = indiv_list['summary'][indiv_list['from Wuhan'] < 1]
plt.subplot(1,2,1).title.set_text("Text Summary for patients, not from Wuhan")
text_notfromWuhan = word_counts(df1)
#
df2 = indiv_list['summary'][indiv_list['from Wuhan'] > 0]
plt.subplot(1,2,2).title.set_text("Text Summary for patients from Wuhan")
text_fromWuhan = word_counts(df2)
# display("Cases not from Wuhan:", len(df1), text_notfromWuhan.head(20),"Cases from Wuhan:", len(df2), text_fromWuhan.head(20))

### Interpretation
The text `summary` of patients who are not from Wuhan contains:
- more cases of males than females
- foreign cases, such as from Japan, South Korea and Hong Kong
- a smaller proportion of cases that mention pneumonia and fever

The text `summary` of patients from Wuhan mainly contains words such as:
- Wuhan, resident
- pneumonia, symptom, hospitalized, death 
- a larger proportion of cases mention pneumonia and fever 

In `indiv_list`, columns with a higher closeness (or correlation) between missing values structure include:
- 'reporting date' has a similar structure to link, source, summary, recovered, death, visiting Wuhan, country, id, location (Almost 100% of filled values)

### 3. 'gender' and 'age' are usually recorded together (around 80% of values are filled from bar plot,correlation of 0.7)
In `indiv_list`, `gender` and `age` have a similar missing-value structure. We therefore:
- examine `gender` and `age` together with information about `death` and `recovered`
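
Before plotting, a quick numeric summary can complement the pictures. The sketch below uses invented rows; the column names mirror those in `indiv_list`, and the choice of the median is just one reasonable option.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows (not taken from the dataset)
toy = pd.DataFrame({
    "gender": ["male", "female", "male", "female", np.nan, "male"],
    "age": [70.0, 35.0, 50.0, 45.0, np.nan, np.nan],
    "death": ["1", "0", "0", "0", "0", "0"],
})

# Keep only rows where both gender and age were recorded
complete = toy.dropna(subset=["gender", "age"])

# Median age per gender, and per gender and outcome
median_age = complete.groupby("gender")["age"].median()
median_by_outcome = complete.groupby(["gender", "death"])["age"].median()

print(median_age)
print(median_by_outcome)
```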


In [None]:
import seaborn as sb

# Work on a copy so the recoding below does not trigger chained-assignment warnings
data = indiv_list[['gender','age','death','recovered']].copy()

# Distribution of age by gender
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(15,5))
sb.violinplot(x='gender', y='age', data=data,ax=axes[0]).set_title('Violinplot: Age vs. Gender')
sb.boxplot(x='gender', y='age', data=data,ax=axes[1]).set_title('Boxplot: Age vs. Gender')

# Recode entries other than '0' or '1' as '1' (as in the 'visiting Wuhan' cell above)
select = [el not in ['0','1'] for el in data['recovered'].values]
data.loc[select, 'recovered'] = '1'
select = [el not in ['0','1'] for el in data['death'].values]
data.loc[select, 'death'] = '1'

# Age vs. gender, coloured by outcome (each title matches its hue variable)
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(15,5))
sb.stripplot(x='gender', y='age', data=data,ax=axes[0], hue='recovered').set_title('Recovered Patients: Age vs. Gender')
sb.stripplot(x='gender', y='age', data=data,ax=axes[1], hue='death').set_title('Deaths: Age vs. Gender')




# The End [Back to Netflix!]