## Analyzing the Effectiveness of Masks as a Measure Against Covid-19

![Covid-19](https://cdn.pixabay.com/photo/2020/03/25/11/30/envato-4966945_1280.jpg)

First of all, I would like to thank the wonderful people at Kaggle and in the Kaggle community, who do an amazing job of making awesome datasets like these available to us. In this notebook, I want to explore the effectiveness of using masks as a preventive measure against Covid-19. To do so, I will use the following datasets:
* [COVID-19 Open Research Dataset Challenge (CORD-19)](https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge) created by the Allen Institute for AI in partnership with the Chan Zuckerberg Initiative, Georgetown University’s Center for Security and Emerging Technology, Microsoft Research, and the National Library of Medicine - National Institutes of Health, in coordination with The White House Office of Science and Technology Policy.
* [COVID-19 containment and mitigation measures](https://www.kaggle.com/paultimothymooney/covid19-containment-and-mitigation-measures) uploaded by [Paul Mooney](https://www.kaggle.com/paultimothymooney).
* [Novel Corona Virus 2019 Dataset](https://www.kaggle.com/sudalairajkumar/novel-corona-virus-2019-dataset) uploaded by [SRK](https://www.kaggle.com/sudalairajkumar).
* [countryinfo](https://www.kaggle.com/koryto/countryinfo) dataset uploaded by [My Koryto](https://www.kaggle.com/koryto).

### This notebook is still a work in progress. Please upvote my kernel if you find it useful and encourage me to keep going.

### Imports

In [None]:
import json
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
from os import path, listdir, walk
from tqdm.notebook import tqdm
import plotly.express as px
import plotly.graph_objects as go
from sklearn import manifold

In [None]:
from nltk.tokenize import WordPunctTokenizer
from collections import Counter
punct_tokenizer = WordPunctTokenizer()

# Reading and Cleaning The Data

In [None]:
# read datasets
covid_data_df = pd.read_csv('/kaggle/input/novel-corona-virus-2019-dataset/covid_19_data.csv')
# covid_19_data.csv has the number of confirmed cases, deaths, and recoveries for every date, by country and province or state.
covid_data_df.head()

In [None]:
covid_measures_df = pd.read_csv('/kaggle/input/covid19-containment-and-mitigation-measures/COVID 19 Containment measures data.csv')
# In the measures dataframe we have a description of the measure implemented and a keyword to standardize all the topics in description. 
# This dataset also has a start date of the measure, and it is given by country and state or province.
covid_measures_df.head()

In [None]:
country_info_df = pd.read_csv('/kaggle/input/countryinfo/covid19countryinfo.csv')
country_info_df.head()

In [None]:
# clean up the data a little bit
country_info_df.loc[country_info_df['alpha2code']=='TW', ['country']] = 'Taiwan'
country_info_df.loc[country_info_df['alpha2code']=='KR', ['country']] = 'South Korea'
country_info_df.loc[country_info_df['alpha2code']=='HK', ['country']] = 'Hong Kong'
country_info_df.loc[country_info_df['alpha2code']=='HK', ['region']] = None
# keep important parameters
country_info_df = country_info_df[country_info_df['region'].isnull()][['country', 'alpha2code', 'pop', 'density', 'medianage', 'urbanpop', 'tests', 'testpop']]

In [None]:
# convert population to million
country_info_df['pop_mil'] = country_info_df['pop'].str.replace(',', '').astype(int) / 1_000_000
country_info_df = country_info_df.drop('pop', axis=1)
# convert date columns to Datetime format
covid_measures_df['Date Start'] = pd.to_datetime(covid_measures_df['Date Start'], format="%b %d, %Y")
covid_data_df['ObservationDate'] = pd.to_datetime(covid_data_df['ObservationDate'], format="%m/%d/%Y")
# Some cleaning on country names
country_info_df['country'] = country_info_df['country'].apply(lambda row: str(row)).apply(lambda row: 'Czech Republic' if 'Czechia' in row else row)
covid_measures_df['Country'] = covid_measures_df['Country'].apply(lambda row: str(row)).apply(lambda row: 'Czech Republic' if 'Czechia' in row else row)
covid_measures_df['Country'] = covid_measures_df['Country'].apply(lambda row: str(row)).apply(lambda row: 'United States' if 'US' in row else row)

In [None]:
# see the total number of examples and how many of them have a measure description, keywords, and a start date
# As you can see, we have a few rows where the measure or its date is missing
print(f"total examples {len(covid_measures_df)}")
print(f"measures description found: {len(covid_measures_df[covid_measures_df['Description of measure implemented'].notnull()])}")
print(f"measures keywords found: {len(covid_measures_df[covid_measures_df['Keywords'].notnull()])}")
print(f"measures with date found: {len(covid_measures_df[covid_measures_df['Date Start'].notnull()])}")

### Functions

In this section, I will store (and hide) all the functions I am using throughout this notebook.

In [None]:
def get_measures_count_by_country(country, covid_data_df, covid_measures_df, country_info_df):
    '''
    Get a single dataframe containing the number of covid cases and the measures taken for one country
    '''
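    # sum only the numeric columns; text columns such as province names cannot be meaningfully summed
    country_covid_df = covid_data_df[covid_data_df['Country/Region'] == country].groupby(['Country/Region', 'ObservationDate']).sum(numeric_only=True).reset_index()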
    country_measures_df = covid_measures_df[covid_measures_df['Country']==country]
    country_covid_df['Confirmed Increase'] = country_covid_df['Confirmed'].diff().fillna(0)
    country_covid_df['Death Increase'] = country_covid_df['Deaths'].diff().fillna(0)
    country_covid_df['Recovered Increase'] = country_covid_df['Recovered'].diff().fillna(0)
    country_df = country_covid_df.merge(country_measures_df, how='left',left_on='ObservationDate', right_on='Date Start')
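    # take the single matching population value (in millions) as a float, so the fractional part is not truncated
    pop_mil = float(country_info_df.loc[country_info_df['country']==country, 'pop_mil'].iloc[0])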
    country_df['confirmed_per_one_mil'] = country_df['Confirmed'] / pop_mil
    country_df['death_per_one_mil'] = country_df['Deaths'] / pop_mil
    country_df['recovered_per_one_mil'] = country_df['Recovered'] / pop_mil
    return country_df

In [None]:
def insert_breaks(measure):
    '''
    The hover textbox is going to become very long. A work around to this is to insert <br> tags and break the lines in text.
    '''
    measure_list = str(measure).split(' ')
    [measure_list.insert(x, '<br>') for x in range(0, len(measure_list), 7)]
    measure_list.pop(0)
    return ' '.join(measure_list)

def get_count_list(country_df):
    country_df = country_df.sort_values('ObservationDate')
    country_df = country_df[['ObservationDate','Confirmed','Deaths']].drop_duplicates()
    return (country_df['Confirmed'].to_list(), country_df['Deaths'].to_list())

def get_measures(country_df):
    '''
    get x y co-ordinates and measures and keywords
    '''
    country_df = country_df.sort_values('ObservationDate')
    x = country_df[country_df['Description of measure implemented'].notnull()]['Date Start'].to_list()
    y = country_df[country_df['Description of measure implemented'].notnull()]['Confirmed'].to_list()
    measures = country_df[country_df['Description of measure implemented'].notnull()]['Description of measure implemented'].to_list()
    measures = [insert_breaks(measure) for measure in measures]
    about_mask = country_df[country_df['Description of measure implemented'].notnull()]['mask'].to_list()
    about_mask = ['😷' if event is True else '' for event in about_mask]
    return (x, y, measures, about_mask)

In [None]:
def plot_measures(country_df, x, y, measures, keywords=None):
    
    '''
    plots the number of confirmed cases in line graph and measures using scatter plots
    '''

    fig = px.line(country_df, x="ObservationDate", y="Confirmed", color="Country/Region", line_group="Country/Region")
    
    #fig.add_trace(go.Bar(
    #    x=country_df['ObservationDate'], 
    #    y=country_df['Confirmed Increase']))
    
    country = country_df['Country/Region'][0]
    
    # taking care of events with same date
    
    # create an empty list to store dates with multiple events
    dates_with_multiple_events = []
    
    # store dates with multiple events in the dataset
    for dt in set(x):
        if len([t for t in x if t==dt]) > 1:
            dates_with_multiple_events.append(dt)
    
    # loop over x and add a few minutes so that events sharing a date get distinct timestamps
    x = [t+timedelta(minutes=i+1) if t in dates_with_multiple_events else t for i, t in enumerate(x)]
    #print(x)
    
    fig.add_trace(go.Scatter(
        x=x, y=y, text=measures,
        mode="markers",
        name="measures"
    ))
    
    if keywords:
        fig.add_trace(go.Scatter(
            x=x, y=y, text=keywords,
            textposition='top left',
            mode="text",
            name="keywords",
            hoverinfo='skip'
        ))
    
    fig.update_layout(hovermode='closest', showlegend=False, 
                     title = 'Covid 19 Measures vs Confirmed Cases - '+country + ' (Hover To See Measure)',
                     xaxis_title = 'Date', yaxis_title = 'Confirmed Covid-19 Cases')
    
    return fig


In [None]:
def check_for_mask(text):
    word_list = ['mask', 'masks']
    text = str(text)
    token = set(punct_tokenizer.tokenize(text.lower()))
    match = set(word_list).intersection(token)
    return bool(match)

# Mask Related Measures Taken

In this section, I will simply tokenize each word in the measure description (using NLTK) and check for the words "mask" or "masks". This approach will not give us exactly what we are looking for: for example, measures banning the export of masks will also be included in our count. A more automated approach is also possible, but since our dataset is small, it is easier to narrow down the search this way and then inspect the data manually.
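
To make the limits of whole-token matching concrete, here is a minimal sketch of the same idea. It uses a regular expression as a rough stand-in for NLTK's `WordPunctTokenizer` so that it runs without extra dependencies. The helper name `mentions_mask` and the sample descriptions are made up for illustration and are not taken from the dataset.

```python
import re

MASK_WORDS = {"mask", "masks"}

def mentions_mask(text):
    """Return True if any whole token in `text` is 'mask' or 'masks'."""
    # \w+ keeps runs of word characters; a rough stand-in for WordPunctTokenizer
    tokens = set(re.findall(r"\w+", str(text).lower()))
    return bool(MASK_WORDS & tokens)

# Hypothetical descriptions, not taken from the dataset
samples = [
    "Everyone must wear a mask in public",
    "Export of masks banned",
    "Schools closed nationwide",
    "Facemasks recommended on public transport",
]
results = [mentions_mask(s) for s in samples]
for sample, matched in zip(samples, results):
    print(matched, "-", sample)
```

In this toy run, the export ban is flagged even though it says nothing about people wearing masks, and the compound word "Facemasks" is missed because it is not an exact token match. Both cases are reasons to inspect the matched rows by hand.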

In [None]:
# check for masks in description and create a column to indicate so
covid_measures_df['mask'] = covid_measures_df['Description of measure implemented'].apply(check_for_mask)

In [None]:
# count number of mask related events for each country and do a bar plot
mask_country_count = covid_measures_df[covid_measures_df['mask']==True]['Country'].value_counts()
fig = px.bar(mask_country_count, x = mask_country_count.index, y = mask_country_count.values)
fig.update_layout(title = 'Countries with mask mentioned in measures dataset',
                  xaxis_title = 'Country Name', yaxis_title = 'Number of events in data')
fig.show()

**After manually inspecting the data, we see that South Korea, Hong Kong, China, Czech Republic, Slovakia, Singapore, and Taiwan are the countries where everyone wears a mask as a serious measure against Covid-19. For the rest of the countries, only healthcare providers are using masks.**

Let's take a look also at the growth of confirmed cases for these countries below.
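
As a rough illustration of how the per-country numbers are prepared, the sketch below computes the day-over-day increase and the cases per one million people on toy data. The counts and the population figure are invented for the example; only the column names mirror the ones used in this notebook.

```python
import pandas as pd

# Toy cumulative counts for one hypothetical country (not real data)
toy_df = pd.DataFrame({
    "ObservationDate": pd.to_datetime(["2020-03-01", "2020-03-02", "2020-03-03"]),
    "Confirmed": [100, 150, 225],
})
pop_mil = 5.0  # assumed population in millions

# Day-over-day new cases; the first day has no previous value, so fill it with 0
toy_df["Confirmed Increase"] = toy_df["Confirmed"].diff().fillna(0)

# Normalise cumulative cases by population so countries of different sizes are comparable
toy_df["confirmed_per_one_mil"] = toy_df["Confirmed"] / pop_mil

print(toy_df)
```

Dividing by population in millions is what makes it reasonable to draw a small territory such as Hong Kong on the same axis as a much larger country.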

In [None]:
country_info_df[country_info_df['country']=='South Korea']

In [None]:
kr_df = get_measures_count_by_country('South Korea', covid_data_df, covid_measures_df, country_info_df)
cz_df = get_measures_count_by_country('Czech Republic', covid_data_df, covid_measures_df, country_info_df)
sg_df = get_measures_count_by_country('Singapore', covid_data_df, covid_measures_df, country_info_df)
hk_df = get_measures_count_by_country('Hong Kong', covid_data_df, covid_measures_df, country_info_df)
sk_df = get_measures_count_by_country('Slovakia', covid_data_df, covid_measures_df, country_info_df)
tw_df = get_measures_count_by_country('Taiwan', covid_data_df, covid_measures_df, country_info_df)
mask_country_df = pd.concat([kr_df, cz_df, sg_df, hk_df, sk_df, tw_df])

In [None]:
# plot using plotly
fig = px.line(mask_country_df, x="ObservationDate", y="confirmed_per_one_mil", color="Country/Region", line_group="Country/Region")
fig.update_layout(title = 'Covid 19 Confirmed Cases for Countries Where Everyone Wears A Mask',
                  xaxis_title = 'Date', yaxis_title = 'Confirmed Covid-19 Cases (Per 1 mil)')
fig.show()

#### It looks like all of these countries are doing well in terms of controlling the total number of confirmed Covid-19 cases compared to the rest of the world. South Korea was even able to flatten a scary growth rate. However, we need to zoom in much more to understand what is really going on.
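
One rough way to put a number on "flattening" is the day-over-day growth rate of cumulative cases. The sketch below uses made-up counts for a hypothetical country, and the column name `growth_rate` is my own and not part of the datasets above.

```python
import pandas as pd

# Made-up cumulative confirmed cases for a hypothetical country
toy = pd.DataFrame({
    "ObservationDate": pd.to_datetime(
        ["2020-03-01", "2020-03-02", "2020-03-03", "2020-03-04"]
    ),
    "Confirmed": [100, 200, 300, 330],
})

# Fractional change from the previous day; the first day has no predecessor and stays NaN
toy["growth_rate"] = toy["Confirmed"].pct_change().round(2)

print(toy)
```

A curve is flattening when this rate keeps shrinking from one day to the next, even though the cumulative count itself still rises.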

# Zooming Into The Countries

In [None]:
x, y, measures, mask_events = get_measures(kr_df)
plot_measures(kr_df, x, y, measures, mask_events)

In [None]:
x, y, measures, mask_events = get_measures(cz_df)
plot_measures(cz_df, x, y, measures, mask_events)

In [None]:
x, y, measures, mask_events = get_measures(sg_df)
plot_measures(sg_df, x, y, measures, mask_events)

In [None]:
x, y, measures, mask_events = get_measures(hk_df)
plot_measures(hk_df, x, y, measures, mask_events)

In [None]:
x, y, measures, mask_events = get_measures(sk_df)
plot_measures(sk_df, x, y, measures, mask_events)

In [None]:
x, y, measures, mask_events = get_measures(tw_df)
plot_measures(tw_df, x, y, measures, mask_events)

**After inspecting the data, we can confidently say that the following countries enforce masks as a serious measure against Covid-19:**
* **South Korea**: Urged everyone to wear masks, especially when visiting medical facilities and outdoors in general. In certain cases (for example, taxis and other non-essential businesses), a mask is mandatory, with a fine for non-compliance. They even deployed a mask supply app service to inform and alert people about the mask supply situation at local pharmacies.
* **Hong Kong**: Initially, masks were encouraged for people with symptoms and for travelers returning from high-risk areas. Later, the government started treating mask wearing as one of its serious measures and began distributing two face masks per day to seniors at elderly homes. Furthermore, all citizens were requested to wear masks at all times.
* **China**: Scaled up production (and even built new factories in record time) to provide as many masks as possible.
* **Czech Republic**: Chronological information is available about when 1%, 10%, and 100% of citizens in public were wearing masks.
* **Slovakia**: At first, masks were recommended for all public transport operators. Later, masks became mandatory for everyone in all places (streets, shops, work, factories, offices, everywhere!).
* **Singapore**: Four masks were distributed to each household.
* **Taiwan**: Rationed face masks to its citizens.

## Work In Progress. . . 

## Of course, we have some more work to do before we can understand anything conclusively. If you like what I have done so far and would like me to keep going, upvote this kernel!