<div class="alert alert-block alert-info"><b>IAB303</b> - Data Analytics for Business Insight</div>

# Module 3 - Structured Data 

Data Analytics is the process of discovering meaning and value in data to solve complex research questions, support evidence-based decision-making, and identify trends and patterns in data.

Data analytics is the process that enables the augmentation of human intelligence about a business concern in a specific context, using a set of tools to analyse data and extract insights from it.

<img src="graphics/data_analytics.png" width=50% />

This data can occur in three different formats:

- **Structured Data.** Data whose structure is predefined (for instance, a database table). It is usually stored in a database.

- **Semi-Structured Data.** Data that does not reside in a relational database but has some organisational properties that make it easier to analyse. With some processing, it can be stored in a relational database. Example: XML data.

- **Unstructured Data.** Data that is not organised in a predefined manner or does not have a predefined data model, so it is not a good fit for a mainstream relational database. Examples: Word, PDF, text, media logs.

One of the primary aims of data analytics is to process semi-structured and unstructured data and put it into a structured, human-readable format so that it can be analysed.
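
As a minimal sketch of this idea, the snippet below turns a few semi-structured records into a structured table. The records are invented for illustration and are not part of the module's datasets; `pd.json_normalize` is one possible tool for this step, not the only one.

```python
import pandas as pd

# Hypothetical semi-structured records: fields are nested, and not every
# record carries the same keys.
records = [
    {"country": "Spain", "stats": {"confirmed": 120, "deaths": 3}},
    {"country": "Sweden", "stats": {"confirmed": 45}},
]

# json_normalize flattens nested dictionaries into dotted column names;
# keys that are missing from a record become NaN.
table = pd.json_normalize(records)
print(table)
```

Each record becomes a row and each (flattened) key becomes a column, which is the tabular layout used throughout the rest of this module.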

In this module, we will analyse data about the impacts of COVID-19 in different countries.

## Real World Case Study: The Impact of COVID-19 in Different Countries

In [None]:
# import libraries
import numpy as np              # used for algebraic operations      
import pandas as pd              # used for data manipulation and data analysis
import matplotlib.pyplot as plt # used for visualisations
import seaborn as sns           # used for visualisations

pd.set_option('display.max_rows', 500) # used to show all the rows in our dataset

In [None]:
# DAILY REPORTS ON CONFIRMED CASES, DEATHS AND RECOVERIES, DOWNLOADED FROM JOHNS HOPKINS UNIVERSITY
# source: https://github.com/CSSEGISandData/COVID-19

# load confirmed cases data
data_conf = pd.read_csv( "data/covid19_confirmed_global.csv" )

# load deaths data
data_deaths = pd.read_csv( "data/covid19_deaths_global.csv" )


In [None]:
# let's take a look at our data.
# checking the confirmed covid19 cases data
data_conf

The first thing we notice is that some entries in our dataset hold no data. They are shown as NaN and are also referred to as missing values. As presented in module 1, part of the data analytics cycle is to clean the dataset and address these missing entries. We will take a look at this later in this notebook.
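
As a minimal sketch of how missing entries can be counted per column, assuming a small invented table (`toy` is hypothetical and only mimics the layout of the case data):

```python
import numpy as np
import pandas as pd

# Toy table in the same layout as the case data; the values are invented.
toy = pd.DataFrame({
    "Province/State": [np.nan, "Queensland", np.nan],
    "Country/Region": ["Spain", "Australia", "Sweden"],
    "1/22/20": [0, 0, 0],
})

# isna() marks each missing cell as True; summing counts them per column.
missing_per_column = toy.isna().sum()
print(missing_per_column)
```

The same two calls can be applied to `data_conf` to see which columns contain NaN entries.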

After taking a quick look at our data, it is important to understand what the variables or features (represented as columns) in our dataset are. We can answer this question with the following line of code:

In [None]:
# what are the variables or features in our dataset?
data_conf.columns.to_list()

Our dataset has information about:
- the Province or State of a country
- the Country or Region
- the geographical coordinates: Latitude and Longitude
- a range of dates from the 22nd of January 2020 until the 12th of June 2020

But how many countries or regions are represented in this dataset? Each country or region is an instance or an observation (represented as a row) in this dataset, so we can get this information by counting the total number of rows. We can do this by simply determining the dimensions of our dataset:

In [None]:
# get dimensions of the dataset: (number of rows, number of columns)
dims = data_conf.shape
dims

In [None]:
# how many countries are represented in this dataset?

# number of countries / regions with confirmed cases.
# note that in Python, the indexing of data structures starts with 0. 
# So, index 0 selects the first entry of a list
num_countries_or_regions = dims[0] 

# %d -> integer
# %f -> float -> decimal
# %s -> string (chain of characters) / text

print( "There are %d regions in our dataset" %num_countries_or_regions )
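
The `%d`, `%f` and `%s` placeholders can be combined in one string by passing the values as a tuple. The sketch below uses invented values (`num_regions`, `share` and `label` are hypothetical) and also shows the equivalent f-string, which is available from Python 3.6 onwards.

```python
# Hypothetical values, used only to illustrate the formatting syntax.
num_regions = 266
share = 0.4567
label = "regions"

# Old-style % formatting: several values are passed as a tuple,
# and a literal percent sign is written as %%.
old_style = "There are %d %s (%.1f%% of the total)" % (num_regions, label, share * 100)

# Equivalent f-string: the expressions go inside the braces.
new_style = f"There are {num_regions:d} {label} ({share * 100:.1f}% of the total)"

print(old_style)
print(new_style)
```

Both lines print the same text; which style to use is largely a matter of preference.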

In [None]:
# what is the total number of recorded days in this dataset?

# remember the columns that we extracted from our dataset.
# 1st column: Province/State
# 2nd column: Country/Region
# 3rd column: Lat
# 4th column: Long
# 5th column: 1/22/20

# this means that we can compute the total number of days with recorded cases by
# subtracting these 4 non-date columns from the total number of columns in our dataset
num_days = dims[1] - 4

print("We have data reported over %d days" %num_days)

In [None]:
data_conf['Long']

In [None]:
# the 3rd and 4th columns of our dataset correspond to the geographic
# coordinates of the country/region, so we can ignore these two for now

# the first date recorded in this dataset is found in the 5th column (index 4)
first_date = data_conf.columns[4]

# the last element of a list is selected with the index -1
last_date = data_conf.columns[-1]

print("The first date recorded in the dataset is %s" %first_date)
print("The last date recorded in the dataset is %s" %last_date)
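
The positional indexing used above can also be written as a slice. This is a minimal sketch on an invented table (`toy` is hypothetical) that has the same four leading columns as the case data:

```python
import pandas as pd

# Toy table with the same four leading columns as the case data.
toy = pd.DataFrame({
    "Province/State": [None, None],
    "Country/Region": ["Spain", "Sweden"],
    "Lat": [40.0, 63.0],
    "Long": [-4.0, 16.0],
    "1/22/20": [0, 0],
    "1/23/20": [0, 1],
    "1/24/20": [2, 1],
})

# Everything from position 4 onwards is a date column.
date_columns = toy.columns[4:]

print(len(date_columns))                  # number of recorded days
print(date_columns[0], date_columns[-1])  # first and last recorded date
```

Slicing keeps the number of days and the first and last dates tied to the same selection, so they cannot drift apart if the leading columns change.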

In [None]:
# what are the countries / regions that have reported confirmed covid-19 cases?

countries = data_conf["Country/Region"]
countries.to_list()

In [None]:
# the above list has repeated entries for some countries
# this is because the data for some countries has been recorded by region

# we can remove the duplicate entries in the following way
# what are the countries in the dataset?
# what countries have reported confirmed covid-19 cases?
countries.unique()

In [None]:
# the total number of countries can be determined in the following way
num_countries = len( countries.unique() )

print("There are %d countries in this dataset with confirmed covid-19 cases" %num_countries)
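
As a small illustration of how duplicates are handled, the sketch below uses an invented Series of country names. `nunique()` and `value_counts()` are standard pandas methods; the data is hypothetical.

```python
import pandas as pd

# Invented column in which one country is recorded by region.
countries = pd.Series(["Australia", "Australia", "Spain", "Sweden", "Australia"])

distinct = countries.unique()                # distinct values, in order of first appearance
n_distinct = countries.nunique()             # number of distinct values
rows_per_country = countries.value_counts()  # how many rows each country occupies

print(distinct)
print(n_distinct)
print(rows_per_country)
```

`nunique()` gives the same count as `len(countries.unique())` here, and `value_counts()` shows which countries are split across several rows.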

Let's simplify our analysis by dropping the "Province/State" column, since it is only filled in for a small number of countries, and the Latitude and Longitude columns, because we will not use them in the analysis. We will do this for both the confirmed cases dataset and the deaths dataset.

In [None]:
# dropping columns for confirmed covid-19 cases
confirmed_time_series = data_conf.drop(["Province/State", "Lat", "Long"], axis=1)

# dropping columns for confirmed covid-19 deaths
deaths_time_series = data_deaths.drop(["Province/State", "Lat", "Long"], axis=1)

# let's take a quick look at Australia
confirmed_time_series[confirmed_time_series["Country/Region"] == "Australia"]

One can see that Australia appears multiple times in our dataset. That is because the recorded cases of covid-19 in Australia are reported per state:

In [None]:
data_conf[data_conf["Country/Region"] == "Australia"]

It would be nice to combine the rows for the individual provinces/states into a single row holding the total number of confirmed cases in the country. In other words, we want to sum the confirmed cases of all the states of a country, for each date, and put the result in a single row.

In [None]:
# group my data by country

# for the confirmed covid19 cases
confirmed_time_series = confirmed_time_series.groupby( "Country/Region" ).sum()

# and for the confirmed covid19 deaths
deaths_time_series = deaths_time_series.groupby( "Country/Region" ).sum()

# let's take a look
confirmed_time_series
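
To see what the grouping step does, here is a minimal sketch on an invented table with two rows for one country (`toy` and its values are hypothetical):

```python
import pandas as pd

# Two rows for Australia (one per state) and one row for Spain.
toy = pd.DataFrame({
    "Country/Region": ["Australia", "Australia", "Spain"],
    "1/22/20": [1, 2, 5],
    "1/23/20": [3, 4, 6],
})

# Rows sharing the same country are summed column by column,
# and the country name becomes the index of the result.
totals = toy.groupby("Country/Region").sum()
print(totals)
```

The two Australian rows collapse into one row holding 1 + 2 = 3 for the first date and 3 + 4 = 7 for the second.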

In [None]:
conf_cases_per_country = confirmed_time_series.iloc[:,-1]
conf_cases_per_country.tolist()

In [None]:
# setting fontsize of figure to 22
plt.rcParams.update({'font.size': 22})

# let's check the impact of the virus by country
conf_cases_per_country = confirmed_time_series.iloc[:,-1]

# let's look at countries with more than 20 000 cases, only
THRESHOLD = 20000
conf_cases_per_country = conf_cases_per_country[ conf_cases_per_country > THRESHOLD ]

x = conf_cases_per_country.index.tolist()
y = conf_cases_per_country.tolist()

# get the index of the country with the highest number of confirmed cases
max_cases_idx = y.index( max(y) )

fig = plt.figure(figsize=(25,7))

barlist = plt.bar(x, y)
barlist[max_cases_idx].set_color('r')
plt.xticks(ticks= x, rotation=90) 

plt.ylabel("Number of Confirmed Cases")
plt.title('Countries with most confirmed covid-19 cases')

In [None]:
# setting fontsize of figure to 22
plt.rcParams.update({'font.size': 22})

# let's check the impact of the virus by country in terms of deaths
conf_deaths_per_country = deaths_time_series.iloc[:,-1]

# let's look at countries with more than 1 000 deaths, only
THRESHOLD = 1000
conf_deaths_per_country = conf_deaths_per_country[ conf_deaths_per_country > THRESHOLD ]

x = conf_deaths_per_country.index.tolist()
y = conf_deaths_per_country.tolist()

# get the index of the country with the highest number of confirmed deaths
max_deaths_idx = y.index( max(y) )
fig = plt.figure(figsize=(25,7))

barlist = plt.bar(x, y)
barlist[max_deaths_idx].set_color('r')

plt.xticks(ticks= x, rotation=90) 
plt.ylabel("Number of Confirmed Deaths")
plt.title('Countries with most deaths by covid-19')
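
The filtering and highlighting steps used in the two bar charts can also be done directly on the pandas Series. A minimal sketch with invented totals (the labels `A` to `D` and the threshold are hypothetical):

```python
import pandas as pd

# Invented totals, indexed by country label.
totals = pd.Series({"A": 1200, "B": 5400, "C": 2500, "D": 800})

above = totals[totals > 1000]                # keep countries above the threshold
worst = above.idxmax()                       # label of the largest value
ranked = above.sort_values(ascending=False)  # largest first

print(worst)
print(ranked)
```

Sorting before plotting is optional, but it makes the bars appear in decreasing order, which can make the chart easier to read.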

Let's focus our analysis on Spain (one of the countries most heavily affected by the virus).

In [None]:
# note that when you use the groupby operation, you are changing the way you are accessing the data
# in this case, the column "Country/Region" is now being used to index the data
# you can access its information in the following way
spain_confirmed_cases = confirmed_time_series[ confirmed_time_series.index == "Spain"]
spain_confirmed_deaths = deaths_time_series[ deaths_time_series.index == "Spain"]

spain_confirmed_cases
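
An alternative way to select a row by its index label is `.loc`. The sketch below uses an invented table (`toy` is hypothetical) indexed by country, as produced by the grouping step:

```python
import pandas as pd

# Invented table indexed by country.
toy = pd.DataFrame(
    {"1/22/20": [0, 1], "1/23/20": [2, 1]},
    index=pd.Index(["Spain", "Sweden"], name="Country/Region"),
)

as_frame = toy.loc[["Spain"]]  # list of labels -> DataFrame with one row
as_series = toy.loc["Spain"]   # single label -> Series indexed by date

print(as_frame)
print(as_series)
```

`toy.loc[["Spain"]]` returns the same one-row DataFrame as the boolean selection `toy[toy.index == "Spain"]`, while `toy.loc["Spain"]` returns the values directly as a Series.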

Let's visualise this information

In [None]:
# setting fontsize of figure to 22
plt.rcParams.update({'font.size': 22})

# get dates
dates = spain_confirmed_cases.columns.to_list()

# initialise figure
fig = plt.figure(figsize=(28,10))

# plot confirmed cases
cases = spain_confirmed_cases.values[0]

plt.subplot(1, 2, 1)
plt.plot(dates, cases) # default color is blue
plt.title('Cumulative confirmed covid-19 cases in Spain')
plt.ylabel('Number of confirmed covid-19 cases')
# adding dates on the x-axis every 5 days
plt.xticks(ticks= range(0, len(dates), 5), rotation=90)
plt.tight_layout()

# plot confirmed deaths
deaths = spain_confirmed_deaths.values[0]

plt.subplot(1, 2, 2)
plt.plot(dates, deaths, c="r") # setting color to red
plt.title('Cumulative confirmed covid-19 deaths in Spain')
plt.ylabel('Number of deaths')
# adding dates on the x-axis every 5 days
plt.xticks(ticks= range(0, len(dates), 5), rotation=90) 
plt.tight_layout()

plt.show()

Our dataset holds the cumulative number of confirmed covid-19 cases. Calculating the daily number of confirmed cases might give us better insights.

In [None]:
# we can define a function to convert the cumulative counts into daily counts

# FUNCTION: get_data_per_day
# takes a series of cumulative data counts
# and converts it to daily data counts
# cum_data: list (or array) of cumulative counts, one per day
# return: list with the same length as cum_data
def get_data_per_day( cum_data ):
    
    cases_per_day = [0]  # list to return. The first day has no previous day, so we start at 0
    
    # for each data point...
    # subtract the current day's count from the following day's count
    for i in range( 0, len(cum_data)-1):
        
        current_day =  cum_data[i]  # current data
        next_day = cum_data[i+1]    # following day data
        # np.abs reports a downward correction in the cumulative data as a positive count
        cases_per_day.append( np.abs(next_day - current_day) ) # add data to result list
    
    return cases_per_day
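
The same day-to-day differencing can be sketched without an explicit loop using `np.diff`. This is an alternative shown for comparison, not the method used in this notebook, and the running totals are invented:

```python
import numpy as np

cumulative = np.array([0, 2, 5, 5, 9])  # invented running totals

# np.diff returns cumulative[i+1] - cumulative[i]; prepend=cumulative[0]
# pads the front so the result has one value per day, starting at 0.
daily = np.diff(cumulative, prepend=cumulative[0])

print(daily)
```

Unlike the loop above, this version does not take absolute values, so a downward correction in the cumulative data would show up as a negative daily count.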

In [None]:
# setting fontsize of figure to 22
plt.rcParams.update({'font.size': 22})

# get dates
dates = spain_confirmed_cases.columns.to_list()

# initialise figure
fig = plt.figure(figsize=(28,10))

# plot confirmed cases
cases = get_data_per_day(spain_confirmed_cases.values[0])

plt.subplot(1, 2, 1)
plt.bar(dates, cases) # default color is blue
plt.title('Daily Confirmed covid-19 cases in Spain')
plt.ylabel('Number of confirmed covid-19 cases')
plt.xticks(ticks= range(0, len(dates), 10), rotation=90)
plt.tight_layout()

# plot confirmed deaths
deaths = get_data_per_day(spain_confirmed_deaths.values[0])

plt.subplot(1, 2, 2)
plt.bar(dates, deaths, color='r') # setting color to red
plt.title('Daily Confirmed covid-19 deaths in Spain')
plt.ylabel('Number of deaths')
plt.xticks(ticks= range(0, len(dates), 10), rotation=90)
plt.tight_layout()

plt.show()

Absolute numbers are still hard to interpret: they depend on the size of a country's population, so these graphs cannot be compared directly between countries. It is more informative to compute the number of confirmed cases and confirmed deaths per million inhabitants. Knowing that Spain has a population of approximately 47 million, let's adjust the above analysis to reflect numbers per million.

In [None]:
# FUNCTION: data_per_million
# takes a series of cumulative data counts, converts it to daily counts
# and expresses them per million inhabitants
# data: list (or array) of cumulative counts
# population: number of inhabitants
# MILLION: the constant 1000000.0
# return: list
def data_per_million( data, population, MILLION ):
    
    values_per_million = []
    for data_per_day in get_data_per_day( data ):
        val = (data_per_day*MILLION)/population
        values_per_million.append(val)
        
    return values_per_million
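
The scaling itself is a single multiplication and division. A minimal worked sketch, assuming invented daily counts and a hypothetical helper named `per_million` (not part of this notebook):

```python
import numpy as np

def per_million(counts, population):
    """Scale raw counts to counts per one million inhabitants."""
    return np.asarray(counts, dtype=float) * 1_000_000 / population

daily_cases = [0, 470, 4700]  # invented daily counts
rates = per_million(daily_cases, 47_000_000)

print(rates)
```

With a population of 47 million, 470 daily cases correspond to 10 cases per million and 4 700 daily cases to 100 cases per million.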
    

In [None]:
MILLION = 1000000.0
population = 47 * MILLION # population of Spain, in number of people (approximately 47 million)

# setting fontsize of figure to 22
plt.rcParams.update({'font.size': 22})

# get dates
dates = spain_confirmed_cases.columns.to_list()

# initialise figure
fig = plt.figure(figsize=(28,10))

# plot confirmed cases

# computing the daily confirmed cases per million
cases_per_million = data_per_million( spain_confirmed_cases.values[0], 
                                     population, MILLION )
    
plt.subplot(1, 2, 1)
plt.bar(dates, cases_per_million) # default color is blue
plt.title('Daily Confirmed covid-19 cases in Spain per million')
plt.ylabel('Number of confirmed covid-19 cases per million')
plt.xticks(ticks= range(0, len(dates), 10), rotation=90)
plt.tight_layout()

# plot confirmed deaths
deaths_per_million = data_per_million( spain_confirmed_deaths.values[0], 
                                       population, MILLION )
plt.subplot(1, 2, 2)
plt.bar(dates, deaths_per_million, color='r') # setting color to red
plt.title('Daily Confirmed covid-19 deaths in Spain per million')
plt.ylabel('Number of deaths per million')
plt.xticks(ticks= range(0, len(dates), 10), rotation=90)
plt.tight_layout()

plt.show()



## Exercise: Full lockdown, late lockdown and soft lockdown - Australia vs. Spain vs. Sweden

In this module we presented two datasets describing the impact of covid-19 in different countries in terms of confirmed cases and deaths. We made a detailed analysis for Spain.

Can you repeat the same analysis for Australia and Sweden? Knowing that Australia has a population of approximately 25 million people and Sweden approximately 10.2 million, can you compare a country that went into lockdown very early (Australia) with a country that entered lockdown late (Spain) and a country that did not enter a strict lockdown (Sweden)?

In [None]:
# YOUR ANSWER HERE

# load confirmed cases data: data/covid19_confirmed_global.csv
data_conf = pd.read_csv("data/covid19_confirmed_global.csv")

# load deaths data: data/covid19_deaths_global.csv
data_deaths = pd.read_csv("data/covid19_deaths_global.csv")

# take a look at the data_conf dataset
# YOUR ANSWER HERE:





In [None]:
# remove the columns containing geographical information and information about the Province
# YOUR ANSWER HERE

# for the confirmed cases
confirmed_data_simplified = 

# for the confirmed deaths
deaths_data_simplified = 


In [None]:
# group data by country

# YOUR ANSWER HERE
# for the confirmed covid19 cases
confirmed_time_series = 

# and for the confirmed covid19 deaths
deaths_time_series = 

# let's take a look
confirmed_time_series


In [None]:
# select all the confirmed cases and confirmed deaths from Australia
# YOUR ANSWER HERE
australia_confirmed_cases = 
australia_confirmed_deaths = 

# select all the confirmed cases and confirmed deaths from Sweden
# YOUR ANSWER HERE
sweden_confirmed_cases = 
sweden_confirmed_deaths = 

In [None]:
# plot the daily number of confirmed cases and confirmed deaths from Australia per million

# YOUR ANSWER HERE





In [None]:
# plot the daily number of confirmed cases and confirmed deaths from Sweden per million

# YOUR ANSWER HERE





Compare the different graphs obtained for Australia, Sweden and our previous analysis from Spain.

Using the graphs, briefly discuss the number of confirmed cases and deaths per million for each country and analyse it in light of each country's lockdown policy: Australia - strict lockdown, Spain - late lockdown, Sweden - soft lockdown.

**YOUR ANSWER HERE**

- Australia (strict lockdown): 

- Spain (late lockdown): 

- Sweden (soft lockdown):