<div class="alert alert-block alert-info"><b>IAB303</b> - Data Analytics for Business Insight</div>

# Tutorial: Ethical concerns & data analytics

### This tutorial considers human and ethical factors in data analytics.
### We present one example QDAVI cycle below. Read and understand each part of the cycle, and think about potential problems, how you could improve each step, and what else you could explore within the concern.

## Tutorial :: The numbers behind COVID-19

**CONCERN**

You are working for Qantas as a data analyst. Your manager welcomes the reopening of Australia's borders. However, he wants an overview of how different countries around the world have been dealing with the pandemic, so that Qantas can restart operations in the safest way possible.

1. **Q**uestion
2. **D**ata
3. **A**nalysis
4. **V**isualisation
5. **I**nsight

<img src="graphics/QDAVI_cycle_sm.png" width="50%" />

### 1. Question

How is each country doing in terms of new cases, ICU patients and vaccination?

### 2. Data

We are going to use a file called `owid-covid-data.csv`, located in the data folder.

In [None]:
# Libraries for the analysis
import pandas as pd
import plotly.express as px

In [None]:
# Load the dataset
df = pd.read_csv('data/owid-covid-data.csv')
df

#### Clean/preprocess data

In [None]:
# Check data types
df.dtypes

In [None]:
# date is an object - we need to convert the date column to a datetime type (this makes it easier to extract months and years later in the analysis)
df["date"] = pd.to_datetime(df["date"], format="%d/%m/%Y")
df

**Tip:** The `format` string depends on how dates are written in the data. If the format does not match the data, `pd.to_datetime` raises an error.
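The tip above can be checked on a small example. This is a minimal sketch on made-up dates (the `dates` series is illustrative and not taken from `owid-covid-data.csv`): the matching format parses, while a mismatched one raises a `ValueError`.

```python
import pandas as pd

# Toy dates written as day/month/year (illustrative values only)
dates = pd.Series(["24/02/2021", "25/02/2021"])

# A format that matches the data parses cleanly
parsed = pd.to_datetime(dates, format="%d/%m/%Y")

# A format that does not match the data raises a ValueError
try:
    pd.to_datetime(dates, format="%Y-%m-%d")
    mismatch_raised = False
except ValueError:
    mismatch_raised = True

print(parsed.dt.month.tolist(), mismatch_raised)
```

If you see this error on the real file, inspect a few raw values of the `date` column first and adjust the format string to match them.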

In [None]:
# Check data types to confirm the conversion
df.dtypes

In [None]:
# Get an overview of the data - check the descriptive statistics
df.describe()
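As a starting point for reading the `describe()` output, here is a small sketch on invented numbers (the `toy` frame is hypothetical, not the real dataset). It shows that `count` ignores missing values and that a single large value pulls the mean far away from the median.

```python
import pandas as pd

# Hypothetical daily counts with one extreme day and one missing value
toy = pd.DataFrame({"new_cases": [0, 1, 2, 3, 94, None]})

summary = toy["new_cases"].describe()

non_missing = summary["count"]  # number of non-missing rows
mean_value = summary["mean"]    # sensitive to the extreme value
median_value = summary["50%"]   # the 50th percentile is the median

print(non_missing, mean_value, median_value)
```

A large gap between the mean and the median, as here, suggests a skewed distribution, so the mean alone may be a poor summary of that column.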

## What do you notice about the data? 
**What do count, mean, min, max and the percentiles mean?**

**Which statistical measure is meaningful for which variable (column)? (e.g. does it make sense to look at the mean date)**

**By looking at these descriptive statistics, can you tell how the data are distributed? (measures of central tendency and dispersion)**

Please describe your thoughts in a markdown cell. 

Your notes: 
-
-
-
-
-



In [None]:
# Check the missing values
df.isnull().sum()
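Raw counts of missing values are hard to compare across columns, so it can help to look at proportions and at where the gaps occur. The sketch below uses an invented `toy` frame (column names mirror the dataset, values are made up) to compute the share of missing values per column and per location.

```python
import pandas as pd

# Invented example: location B reports ICU data only some of the time
toy = pd.DataFrame({
    "location": ["A", "A", "B", "B"],
    "new_cases": [10, None, 5, 7],
    "icu_patients": [None, None, None, 2],
})

# Share of missing values per column (0.0 = complete, 1.0 = entirely missing)
missing_share = toy.isnull().mean().sort_values(ascending=False)

# Share of missing values per column within each location
missing_by_location = (
    toy.drop(columns="location").isnull().groupby(toy["location"]).mean()
)

print(missing_share)
print(missing_by_location)
```

If missingness differs strongly between countries, any comparison built on that column may reflect reporting practices rather than the pandemic itself.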

## What about missing data? 

**Are there any missing data and where?**

**What does this mean and what should we do about it?**

Please describe your thoughts in a markdown cell. 


Your notes: 
-
-
-
-
-



### 3. Analysis

The dataset contains one row per country per day. To compare countries over time, we group the daily rows into monthly totals.

In [None]:
# Create new columns with the year and month
df["year"] = df["date"].dt.year
df["month"] = df["date"].dt.month
df

In [None]:
# Find the unique values for locations
df["location"].unique()

In [None]:
# Filter by Australia 2021

au_df = df[(df['location'] == 'Australia') & (df['year'] == 2021)]
au_df

In [None]:
# Group Australian data by month
au_df_grouped = (
    au_df[["new_cases", "new_deaths", "people_fully_vaccinated", "month"]]
    .groupby("month")
    .agg({"new_cases": "sum", "new_deaths": "sum", "people_fully_vaccinated": "max"})
)
au_df_grouped["country"]="Australia"
au_df_grouped
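To see what the grouping step does, it may help to run it on a frame small enough to check by hand. The `toy` data below is invented for illustration; the column names mirror the ones used above.

```python
import pandas as pd

# Invented daily rows: two days in month 1, two days in month 2
toy = pd.DataFrame({
    "month": [1, 1, 2, 2],
    "new_cases": [10, 20, 5, 15],
    "people_fully_vaccinated": [100, 150, 150, 300],
})

# groupby splits the rows by month; agg reduces each group to a single row,
# using a different function per column
monthly = toy.groupby("month").agg(
    {"new_cases": "sum", "people_fully_vaccinated": "max"}
)

print(monthly)
```

Daily new cases are summed because each row is a separate count. `people_fully_vaccinated` is taken to be a running total here, so the maximum is used to pick the latest value in the month instead of adding totals together.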

**What does the .groupby function do?**

**What does the .agg function do?**

Please describe in your own words. 

**Your answer:**
-
-
-
-





In [None]:
# Filter by Brazil 2021
br_df = df[(df['location'] == 'Brazil') & (df['year'] == 2021)]
br_df

In [None]:
# Group Brazilian data by month
br_df_grouped = (
    br_df[["new_cases", "new_deaths", "people_fully_vaccinated", "month"]]
    .groupby("month")
    .agg({"new_cases": "sum", "new_deaths": "sum", "people_fully_vaccinated": "max"})
)
br_df_grouped["country"]="Brazil"
br_df_grouped

In [None]:
# combine both Australia and Brazil grouped data into one dataframe (makes it easier to handle the data in plotly later)

au_br_df_grouped = pd.concat([au_df_grouped, br_df_grouped],axis=0)
au_br_df_grouped.reset_index(inplace=True)
au_br_df_grouped
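The filter, group and combine steps above are repeated for each country. One possible way to avoid the repetition is a small helper function. `monthly_summary` is a hypothetical name introduced for this sketch, and the `toy` frame holds invented values, so treat this as an illustration, not part of the original analysis.

```python
import pandas as pd

def monthly_summary(data, location, year):
    """Filter one location and year, then aggregate the daily rows by month."""
    subset = data[(data["location"] == location) & (data["year"] == year)]
    grouped = subset.groupby("month").agg(
        {"new_cases": "sum", "new_deaths": "sum", "people_fully_vaccinated": "max"}
    )
    grouped["country"] = location
    return grouped.reset_index()

# Invented rows with the same column names as the tutorial data
toy = pd.DataFrame({
    "location": ["Australia", "Australia", "Brazil", "Brazil", "Brazil"],
    "year": [2021, 2021, 2021, 2021, 2020],
    "month": [1, 1, 1, 2, 12],
    "new_cases": [1, 2, 10, 20, 99],
    "new_deaths": [0, 1, 3, 4, 9],
    "people_fully_vaccinated": [None, 5, 50, 80, None],
})

# Build one combined frame for any list of countries
countries = ["Australia", "Brazil"]
combined = pd.concat(
    [monthly_summary(toy, country, 2021) for country in countries],
    ignore_index=True,
)

print(combined)
```

With a helper like this, adding a third country to the comparison only requires extending the `countries` list.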


### 4. Visualisation

**How can we visualise our results?**

**What exactly do we want to show?**

**What kind of graph would be best?**

Your notes: 
-
-
-
-
-


In [None]:
# Visualise the results

# new cases
fig = px.line(au_br_df_grouped,x='month',y='new_cases',color='country',
              title='Covid new cases 2021')
fig.update_layout(yaxis_title='cases (count)',
                 xaxis = dict(
                     tickmode = 'array',
                     tickvals = [1,2,3,4,5,6,7,8,9,10,11,12])
                    )
fig.show()


# new deaths
fig = px.line(au_br_df_grouped,x='month',y='new_deaths',color='country',
              title='Covid new deaths 2021')
fig.update_layout(yaxis_title='deaths (count)',
                 xaxis = dict(
                     tickmode = 'array',
                     tickvals = [1,2,3,4,5,6,7,8,9,10,11,12])
                    )
fig.show()


# people fully vaccinated
fig = px.line(au_br_df_grouped,x='month',y='people_fully_vaccinated',color='country',
              title='Covid people fully vaccinated 2021')
fig.update_layout(yaxis_title='people (count)',
                 xaxis = dict(
                     tickmode = 'array',
                     tickvals = [1,2,3,4,5,6,7,8,9,10,11,12])
                    )
fig.show()




### 5. Insights

- What have we shown with these analyses and plots?
- What was our original question?
- Have we answered our original question, and what is the answer?
- Is our analysis accurate or was our analysis wrong? Why?
- Is our analysis fair? Why?
- What is the difference between accurate and fair analysis?
- Can you think of a different way of analysing the data with the data we already have?
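
One direction worth considering for the last question is to relate the counts to population size, since absolute numbers favour smaller countries. The sketch below assumes a `population` value is available for each country (check whether your copy of the dataset has such a column) and uses invented round figures, not real statistics.

```python
import pandas as pd

# Invented monthly totals and populations, chosen only to make the arithmetic easy
monthly = pd.DataFrame({
    "country": ["Australia", "Brazil"],
    "new_cases": [2_500, 210_000],
    "population": [25_000_000, 210_000_000],
})

# Cases per 100,000 inhabitants make countries of different sizes comparable
monthly["new_cases_per_100k"] = (
    monthly["new_cases"] / monthly["population"] * 100_000
)

print(monthly[["country", "new_cases_per_100k"]])
```

In this toy example the absolute counts differ by a factor of 84, while the per-capita rates differ by a factor of 10, which shows how much the choice of measure can change the story a plot tells.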

**Your notes:**


-
-
-
-
-



## New question & new QDAVI cycle


### Question(s)

-
-





In [None]:
# write your code here


### Insights

-
-
-
-
-

