# Lesson 7 - EDA and Statistics Part 3

![flowers](https://i.pinimg.com/originals/f4/1d/45/f41d4567aee3a42f1cbbfd7f44412efa.jpg)  
**Source:** https://www.poppyfield.org/

I. Recap of lesson 6  
II. Learning Outcomes  


__Outline for the session:__

1. Introduction to Exploratory Data Analysis
2. Setting the Stage
3. Get the Data
    - Load & Inspect
    - Clean & Prepare
    - What did you see?
4. EDA
5. Reporting
6. Summary
7. Beyond Data Analysis
    - Becoming a Better Programmer
    - Data Scientist Road
8. Feedback

# I. Recap of Lesson 6

In lesson 6, we covered a small but important piece of inferential statistics: distributions and hypothesis testing. There we touched on:

- One of the many frameworks one can follow to structure a hypothesis test
- Which hypothesis is which in statistics terms
    - The Null Hypothesis is the one that is currently accepted
    - The Alternative Hypothesis is the one we would like to prove right
- Every variable or array of data comes with a distribution, and being able to model and visualise such distributions allows us to better understand the data we have at hand
- Hypothesis tests can be done with one variable, two, or many, and for each of these instances there is a test that we can use
- SciPy is built on top of NumPy and comes with most of the statistical functions we need to test hypotheses

# II. Learning Outcomes

By the end of the lesson you will have learned:

1. How to structure a data task
2. How to design your to-do list for the initial steps of the process
3. How to structure your repository for a data project
4. How to look at your data iteratively

# 1. Introduction to Exploratory Data Analysis

![](https://miro.medium.com/max/1400/0*LWtcjRNoRrqSHcdK.gif)  
**Source:** [South China Morning Post](https://multimedia.scmp.com/lifestyle/article/2163738/crazy-rich-asians/index.html). Authors - Pablo Robles and Adolfo Arranz, in collaboration with Marco Hernández, Vincenzo La Torre, Darren Long and Sean Keeley

> "This is my favorite part about analytics: Taking boring flat data and bringing it to life through visualization." ~ John Tukey

The award-winning visualisation above summarises best what Exploratory Data Analysis is: being able to question and explore our data with a voracious hunger for insight discovery.

Exploratory Data Analysis (EDA) is detective work in practice and at heart: a true investigation of the facts (in other words, of what has already happened). EDA is more of an approach to inspecting our data than a hard set of rules one should follow. It brings out the curiosity of the analyst and allows for a full exploration that often leads to:
- uncovering insights
- detecting outliers
- testing assumptions
- identifying the most important variables in the dataset
- visualising the heck out of the data
- coming up with new hypotheses to test/models to build

EDA is often confused with pure statistical data visualisation, in part because it involves a lot of it, but also because it is not complete without it. It differs, though, in that it is a more all-round approach to insight gathering rather than only a display of data.
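To make one item from the list above concrete, here is a minimal sketch of detecting outliers with the interquartile range (IQR) rule. The numbers are made up purely for illustration, and the 1.5 × IQR cut-off is a common convention rather than a rule this lesson prescribes.

```python
import pandas as pd

# Toy daily counts, invented for illustration only (not real COVID-19 data)
daily_cases = pd.Series([12, 15, 14, 13, 16, 15, 14, 120, 13, 15], name="daily_cases")

# The IQR is the distance between the 25th and 75th percentiles
q1 = daily_cases.quantile(0.25)
q3 = daily_cases.quantile(0.75)
iqr = q3 - q1

# Anything further than 1.5 * IQR from the quartiles is flagged as an outlier
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = daily_cases[(daily_cases < lower) | (daily_cases > upper)]

print(outliers)
```

A flagged value is not automatically an error. It is a prompt to go back to the source and ask why that observation looks different.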

# 2. Setting the Stage

![consulting](https://i.imgur.com/DnIaN4p.png)

You work as an analyst at a major global consulting firm and have projects with clients in industries such as hospitality, retail, and finance. The COVID-19 pandemic has hit us with full force and before making any decisions, your boss would like you to provide the company with an overview of what is happening around the world.

You have been given full autonomy over the project, so the resources you choose, the questions you answer, and the visualisations and tables you create are all your responsibility. Sad about the situation, of course, but excited about the opportunity to inform all of your colleagues, you acknowledge your boss's request,

![roger](https://media.giphy.com/media/lOId8Hdsk2ZMU8cjT9/giphy.gif)

head back to your office, and get started on this task.

![igotthis](https://media.giphy.com/media/VgSjnwSoqiPjRRIJ1F/giphy.gif)

## Markdown or (Pen & Paper)

You just got back to your office and immediately set out to write down a list of what you will need in order to write a compelling report about the whole situation.

![your list](https://media.giphy.com/media/j2wpZyLy2s70ul4TKo/giphy.gif)

### a. Define Your Task
My current task is both broad and specific: there is a wide range of angles I can pursue, but all of them sit within one topic, COVID-19. I need to source data on my own and inform the company on the current situation surrounding the virus.

Let's tackle some essential questions first (most of the information below related to COVID-19 was taken from the [Government of Western Australia Department of Health latest report](https://ww2.health.wa.gov.au/~/media/Files/Corporate/general%20documents/Infectious%20diseases/PDF/Coronavirus/coronavirus-faqs.pdf#page=1)).
- **What is COVID-19?** This is a new coronavirus that was first identified in Wuhan, Hubei Province, China in December 2019. It is a new strain of coronavirus that hasn’t previously been identified in humans. COVID-19 is closely related to SARS and in the same family of viruses as MERS.
- **What are Coronaviruses?** "Coronaviruses are a large family of viruses that can cause illness in humans and animals. Human coronavirus illnesses are generally mild such as the common cold. Some coronaviruses can cause severe diseases such as Severe Acute Respiratory Syndrome (SARS), which was identified in 2002, and Middle East Respiratory Syndrome (MERS), which was identified in 2012."
- **What are the symptoms?** Symptoms include shortness of breath or cough, with or without a fever. In some cases, the virus can cause severe pneumonia. From what we know now about COVID-19, the symptoms can start between 2 and 14 days from exposure to the virus.
- **What is a Pandemic?** A pandemic is defined as “an epidemic occurring worldwide, or over a very wide area, crossing international boundaries and usually affecting a large number of people”. The classical definition includes nothing about population immunity, virology or disease severity. ([WHO 2011](https://www.who.int/bulletin/volumes/89/7/11-088815/en/))
- **How can it be cured?** There is no treatment available at the moment.
- **How can we help decrease the spread of the virus?**
    - Self-isolation
    - Social distancing
    - Increased personal hygiene: washing hands frequently and not touching one's face.  


### b. Data Required
Because of our company's global footprint, we will need as much data as we can get on the current situation across the globe. This means that scraping data and going to well-known sources of information, such as the World Health Organization, the World Bank, and government websites, are all fair game.
### c. Cleaning & Manipulation
This part will depend on the data available.
### d. Description and Visualisation
This will depend on the data available but we will definitely want to see trends, distributions, max and min values around the peak weeks, and many averages.
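As a rough idea of what "max and min values around the peak weeks, and many averages" can look like in pandas, here is a small sketch. The `toy` table, its column names, and its numbers are all invented for illustration. The real columns will depend on the data we end up sourcing.

```python
import pandas as pd

# Hypothetical weekly counts for two made-up locations
toy = pd.DataFrame({
    "location": ["A"] * 4 + ["B"] * 4,
    "week": [1, 2, 3, 4] * 2,
    "new_cases": [10, 40, 90, 30, 5, 8, 20, 12],
})

# Min, max, and average of new cases per location
summary = toy.groupby("location")["new_cases"].agg(["min", "max", "mean"])

# The row where each location reached its peak
peak_rows = toy.groupby("location")["new_cases"].idxmax()
peak_week = toy.loc[peak_rows, ["location", "week", "new_cases"]]

print(summary)
print(peak_week)
```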
### e. Report findings
In the pipeline.

# 3. Get the Data

![covid-19](https://hsu.net.au/wp-content/uploads/2020/03/COVID-19.png)  
**Source:** https://hsu.net.au/covid-19/

The data we will be using comes from [Our World in Data](https://ourworldindata.org/coronavirus-data), and detailed information about the variables can be [found here](https://github.com/owid/covid-19-data/blob/master/public/data/owid-covid-data-codebook.md).

## Create Repository

We will first create a repository for our work and then proceed to load, inspect, clean, prepare, and explore. Let's head to GitHub.
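If you would like a starting point for the layout of that repository, the sketch below creates one possible folder skeleton with `pathlib`. The folder names (`data/raw`, `data/processed`, `notebooks`, `reports/figures`, `src`) and the project name `covid19-eda` are assumptions made for this example, not a required convention, and the skeleton is created in a temporary directory so that nothing on your machine is overwritten.

```python
from pathlib import Path
import tempfile


def create_project_skeleton(root):
    """Create a hypothetical folder layout for a data project and list its folders."""
    root = Path(root)
    folders = ["data/raw", "data/processed", "notebooks", "reports/figures", "src"]
    for folder in folders:
        (root / folder).mkdir(parents=True, exist_ok=True)

    # A README is usually the first file people look for in a repository
    readme = root / "README.md"
    if not readme.exists():
        readme.write_text("# COVID-19 EDA\n")

    return sorted(p.relative_to(root).as_posix() for p in root.rglob("*") if p.is_dir())


# Build the skeleton inside a throwaway temporary directory
project_root = Path(tempfile.mkdtemp()) / "covid19-eda"
created = create_project_skeleton(project_root)
print(created)
```

Keeping raw and processed data apart makes it easier to rerun the cleaning steps from scratch without touching the original download.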

# 3.1 Load & Inspect

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

plt.style.use('fivethirtyeight')
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
pd.options.display.float_format = '{:.2f}'.format

%matplotlib inline

In [None]:
# for windows users in CMD
!type owid-covid-data.csv | more

In [None]:
# for mac users or windows users with Git Bash
!head -n 5 owid-covid-data.csv

In [None]:
df = pd.read_csv('owid-covid-data.csv', low_memory=False)
df.head()

In [None]:
df.shape

In [None]:
df.info(memory_usage='deep')

In [None]:
df.tail()

In [None]:
df.describe().T

In [None]:
missing_pct = ((df.isna().sum() / df.shape[0]) * 100)
missing_pct

## Exercise for Everyone

What did we see upon first inspection? Everyone should share their input so that we can compile a list of observations.

# 3.2 Clean & Prepare

![cleaning](https://media-exp1.licdn.com/dms/image/C4D12AQHSzRO0liwLPQ/article-inline_image-shrink_1000_1488/0?e=1596067200&v=beta&t=k97-q1B4LZnChhnHqeLGpbFeGKCf6JaBmhVgoxvhHts)

Let's move on to preparing our data for exploration.
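Before working on the full dataset, it can help to see the general strategy on a tiny example: drop columns that are mostly empty and fill small gaps in numeric columns with the median. The sketch below uses a made-up `toy` table, and the 50% and 15% thresholds mirror the ones used in this section. They are judgment calls, not fixed rules.

```python
import numpy as np
import pandas as pd

# Hypothetical table: one column with a small gap, one that is mostly empty
toy = pd.DataFrame({
    "location": ["A", "B", "C", "D", "E", "F", "G", "H", "I", "J"],
    "median_age": [30.0, np.nan, 41.0, 38.0, 29.0, 35.0, 44.0, 27.0, 33.0, 40.0],
    "handwashing": [np.nan] * 6 + [50.0, 60.0, 70.0, 80.0],
})

# Percentage of missing values per column
missing_pct = toy.isna().mean() * 100

# 1. Columns with more than 50% missing values carry too little information
too_sparse = missing_pct[missing_pct > 50].index
toy = toy.drop(columns=too_sparse)

# 2. Columns with a small share of gaps (up to 15%) are filled with their median
fillable = missing_pct[(missing_pct > 0) & (missing_pct <= 15)].index
for col in fillable:
    toy[col] = toy[col].fillna(toy[col].median())

print(toy)
```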

In [None]:
# columns with fewer than 5% missing values (missing_pct was computed in section 3.1)
to_drop = (missing_pct[missing_pct < 5]).index
to_drop

In [None]:
df.shape

In [None]:
df.dropna(subset=to_drop, inplace=True)
df.head()

In [None]:
missing = ((df.isna().sum() / df.shape[0]) * 100)
missing

In [None]:
missing_cols = (missing[missing > 0]).index
len(missing_cols)

In [None]:
df[missing_cols].head()

In [None]:
for col in missing_cols:
    print(f"This {col} data type is {df[col].dtype} and it has {df[col].isna().sum()/df.shape[0]*100:.2f} pct of missing values!")

In [None]:
df[missing_cols].describe().T

In [None]:
missing_more_50 = (missing[missing > 50]).index
len(missing_more_50)

In [None]:
type(missing_more_50)

In [None]:
df.drop(labels=missing_more_50, axis=1, inplace=True)

In [None]:
missing_less_15 = (missing[(missing > 0) & (missing <= 15)]).index
len(missing_less_15)

In [None]:
df[missing_less_15].isna().sum() / df.shape[0] * 100

In [None]:
df['median_age'].plot(kind='hist', bins=50)
plt.title('Median Age Frequencies')
plt.show()

In [None]:
df['aged_65_older'].plot(kind='hist', bins=50)
plt.title('Those Older than 65')
plt.show()

In [None]:
df['aged_70_older'].plot(kind='hist', bins=50, color='green', edgecolor='white')
plt.title('Those Older than 70')
plt.show()

In [None]:
df[missing_less_15].describe().T

In [None]:
missing_less_15

In [None]:
missing_15 = missing_less_15.drop('gdp_per_capita')

In [None]:
for col in missing_15:
    # assign the result back; calling fillna(inplace=True) on a selected column is unreliable in recent pandas
    df[col] = df[col].fillna(df[col].median())
    
df.isna().sum()

In [None]:
plt.hist(df.loc[df['female_smokers'].notna(), 'female_smokers'], bins=20, 
         color='blue', alpha=0.5, label='Female Smokers', edgecolor='white')
plt.hist(df.loc[df['male_smokers'].notna(), 'male_smokers'], bins=20, 
         color='red', alpha=0.3, label='Male Smokers', edgecolor='white')

plt.title('Share of Smokers in our Sample')

plt.legend()
plt.show()

In [None]:
df[['female_smokers', 'male_smokers']].head()

In [None]:
for col in ['female_smokers', 'male_smokers']:
    # fill the gaps with each column's median and assign the result back
    df[col] = df[col].fillna(df[col].median())

In [None]:
df['extreme_poverty'].isna().sum() / df.shape[0] * 100

In [None]:
sns.boxplot(y=df['extreme_poverty']);

In [None]:
df['extreme_poverty'].describe()

In [None]:
df['extreme_poverty'] = df['extreme_poverty'].fillna(df['extreme_poverty'].median())

In [None]:
df['stringency_index'].describe()

In [None]:
# histplot replaces the deprecated distplot; kde=True overlays the density curve
sns.histplot(df['stringency_index'], bins=20, kde=True);

In [None]:
df['stringency_index'] = df['stringency_index'].fillna(value=0)

In [None]:
df.isna().sum() / df.shape[0] * 100

In [None]:
countries_missing = df.loc[df['gdp_per_capita'].isna(), 'location'].unique()
countries_missing

In [None]:
df.loc[df['gdp_per_capita'].isna(), 'location'].head()

In [None]:
df.dropna(how='any', axis=0, inplace=True)
df.head()

In [None]:
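# make sure the date column is a datetime before using the .dt accessor
df['date'] = pd.to_datetime(df['date'])

df['month'] = df['date'].dt.month
df['year'] = df['date'].dt.year
df['week'] = df['date'].dt.isocalendar().week  # .dt.week was removed in pandas 2.0
df['weekday'] = df['date'].dt.weekday
df['quarter'] = df['date'].dt.quarter
df['day_of_week'] = df['date'].dt.day_name()
# weekday is 0 (Monday) to 6 (Sunday), so 5 and 6 are the weekend
df['week_or_end'] = df['weekday'].apply(lambda x: 'weekend' if x >= 5 else 'week_day')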
df.head()

In [None]:
df.to_csv('covid19_ready_data.csv', index=False)

# 4. EDA

> "This is my favorite part about analytics: Taking boring flat data and bringing it to life through visualization." ~ John Tukey

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

plt.style.use('fivethirtyeight')
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
pd.options.display.float_format = '{:.2f}'.format

%matplotlib inline

In [None]:
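# parse the date column on load so that the datetime operations further down work
df = pd.read_csv('covid19_ready_data.csv', low_memory=False, parse_dates=['date'])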
df.head()

In [None]:
month_continents = df[df['month'] != 12].pivot_table(
    index=['month'],
    values='total_cases',
    columns='continent',
    aggfunc='mean'
)

month_continents

In [None]:
month_continents.plot(title="Rise of COVID-19 Cases Per Continent")

In [None]:
sns.boxplot(x="continent", y="hospital_beds_per_thousand", 
            hue="week_or_end", data=df, palette="Set1")
plt.title("Hospital Beds per Thousand People in Each Continent, Weekday vs Weekend")
plt.xticks(rotation=45)
plt.show()

In [None]:
sns.scatterplot(x='total_cases', y='total_deaths', hue='week_or_end', data=df)

The following is an adapted example from Damien Farrell. You can find the source [code here](https://dmnfarrell.github.io/plotting/bokeh-covid19).

In [None]:
import random
from bokeh.io import show, output_notebook
from bokeh.models import ColumnDataSource, ColorBar, HoverTool, Legend
from bokeh.plotting import figure
from bokeh.palettes import brewer
from bokeh.layouts import row, column, gridplot
from bokeh.models import CustomJS, Slider, Select, Plot, Button, LinearAxis, Range1d, DatetimeTickFormatter
from bokeh.models.glyphs import Line, MultiLine

# render Bokeh plots inline in the notebook instead of in a separate HTML file
output_notebook()

In [None]:
# format each date as day-month abbreviation, e.g. 05-Mar
df['new_date'] = pd.to_datetime(df['date']).dt.strftime("%d-%b")
df['new_date'].head()

In [None]:
df = df.sort_values(['location', 'date'])

# total_cases is already a running total per location in the OWID data,
# so we reuse it instead of applying a second cumulative sum on top of it
df['cumcases'] = df['total_cases']

In [None]:
data = pd.pivot_table(df, index='date', 
                      columns='location', 
                       values='cumcases').reset_index()

In [None]:
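# total_deaths and total_cases are running totals, so the latest figure
# per location is the maximum, not the sum
summary = (df.groupby('location')
           .agg({'total_deaths': 'max',
                 'total_cases': 'max'})
           .reset_index())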

In [None]:
source = ColumnDataSource(data)

In [None]:
filt_data1 = data[['date','Australia']].rename(columns={'Australia':'total_cases'})

In [None]:
filt_data2 = data[['date', 'Dominican Republic']].rename(columns={'Dominican Republic':'total_cases'})

In [None]:
# the formatter key must match the tooltip field, including the @ prefix
hover_tool = HoverTool(tooltips=[
            ('Cases', '@total_cases'),
            ('Date', '@date{%F}')],
            formatters={'@date': 'datetime'}
        )

In [None]:
source = ColumnDataSource(data)

filt_data1 = data[['date','Australia']].rename(columns={'Australia':'total_cases'})

src2 = ColumnDataSource(filt_data1)

filt_data2 = data[['date', 'Dominican Republic']].rename(columns={'Dominican Republic':'total_cases'})

src3 = ColumnDataSource(filt_data2)

# the formatter key must match the tooltip field, including the @ prefix
hover_tool = HoverTool(tooltips=[
            ('Cases', '@total_cases'),
            ('Date', '@date{%F}')],
            formatters={'@date': 'datetime'}
        )

# width/height replace plot_width/plot_height, which were removed in Bokeh 3
p1 = figure(width=600, height=400, x_axis_type='datetime',
            tools=[hover_tool], title='Total COVID Cases Over Time',
            y_range=Range1d(start=0, end=filt_data1['total_cases'].max() + 50))

p1.line(x='date', y='total_cases', source=src2, legend_label="country 1", 
        line_color='blue', line_width=3,line_alpha=.8)

p1.extra_y_ranges = {"y2": Range1d(start=0, end=filt_data2['total_cases'].max() + 50)}

p1.add_layout(LinearAxis(y_range_name="y2"), 'right')
p1.line(x='date',y='total_cases', source=src3, legend_label="country 2", line_color='green',
        line_width=3,line_alpha=.8,y_range_name="y2")

p1.yaxis[0].axis_label = 'Australia'
p1.yaxis[1].axis_label = 'Dominican Republic'
p1.background_fill_color = "whitesmoke"
p1.background_fill_alpha = 0.5
p1.legend.location = "top_left"
p1.xaxis.axis_label = 'Date'
p1.xaxis.formatter=DatetimeTickFormatter(days="%d/%m", months="%m/%d %H:%M")

#JavaScript Snippet
code="""
var c = cb_obj.value;
ax.axis_label = c;
var y = s1.data[c];
s2.data['total_cases'] = y;
y_range.start = 0;
y_range.end = parseInt(y[y.length - 1]+50);
s2.change.emit();
"""

callback1 = CustomJS(args=dict(s1=source,s2=src2,y_range=p1.y_range,ax=p1.yaxis[0]), code=code)

callback2 = CustomJS(args=dict(s1=source,s2=src3,y_range=p1.extra_y_ranges['y2'],ax=p1.yaxis[1]), code=code)

names = list(df.location.unique())

names_sub=['Australia', 'United Kingdom', 'United States', 'Spain', 'Italy', 'Germany',
           'France','Iran','Dominican Republic','Ireland','Sweden','Belgium','Turkey', 'India']

select1 = Select(title="Country 1:", value='Australia', options=names_sub)

select1.js_on_change('value', callback1)

select2 = Select(title="Country 2:", value='Dominican Republic', options=names)

select2.js_on_change('value', callback2)

btn = Button(label='Update')

layout = column(row(select1,select2), row(p1))

show(layout)

In [None]:
sns.pairplot(df[['total_cases', 'total_deaths', 'median_age', 'aged_65_older', 'aged_70_older']], kind="scatter")
plt.show()

In [None]:
import altair as alt

In [None]:
alt.Chart(df.sample(5000)).mark_circle().encode(
    alt.X(alt.repeat("column"), type='quantitative'),
    alt.Y(alt.repeat("row"), type='quantitative'),
    color='continent:N'
).properties(
    width=150,
    height=150
).repeat(
    row=['total_cases', 'median_age', 'total_deaths'],
    column=['total_cases', 'median_age', 'total_deaths']
).interactive()

In [None]:
alt.Chart(df.sample(5000)).mark_circle(
    opacity=0.8,
    stroke='black',
    strokeWidth=1
).encode(
    alt.X('new_date:O', axis=alt.Axis(labelAngle=0)),
    alt.Y('day_of_week:N'),
    alt.Size('total_cases:Q',
        scale=alt.Scale(range=[0, 4000]),
        legend=alt.Legend(title='Total COVID Cases')
    ),
    alt.Color('month:N', legend=None)
).properties(
    width=450,
    height=320
)

In [None]:
alt.Chart(df.sample(5000)).transform_fold(
    ['median_age', 'aged_65_older', 'aged_70_older'],
    as_=['Measure', 'Value']
).mark_area(
    opacity=0.3,
    interpolate='step'
).encode(
    alt.X('Value:Q', bin=alt.Bin(maxbins=40)),
    alt.Y('count()', stack=None),
    alt.Color('Measure:N')
)

# 5. Reporting

This is a group exercise for our repository.

Some questions to think about.

- What's the key message?
- What are some hypotheses one could test beyond initial EDA?
- Should we gather more data?
- ...

# 6. Summary

![done](https://media.giphy.com/media/xTiTnoFlplnVO1Sow0/giphy.gif)


Well done! We have covered quite a lot in this course and this lesson summarises many of the crucial concepts very well.

- State your task and create a to-do list with the requirements for the project.
- Do some research on the topic, especially if it is outside your area of expertise.
- Source the data.
- Load, clean, inspect, save, and commit.
- Explore your dataset.
- Write your key findings down and put the most important points in your report.

# 7. Beyond Data Analysis

Now that you are officially a data analyst, here are some recommendations for your journey ahead.

## Becoming a Better Programmer

> “Any fool can write code that a computer can understand. Good programmers write code that humans can understand.” – Martin Fowler

You have probably noticed throughout the course that programming is more than a skill one can acquire: it is a muscle that one needs to keep exercising in order to make it better and better each day. Here are some ideas for your next steps.

1. Practice a lot with some of the essential concepts of programming: loops, functions, and if-else statements. If you write a loop to iterate over an array, convert that into a function and vice versa.
2. Learn how to create your own data structures and classes by learning a bit of object-oriented programming.
3. When you finish working on a project, grab all of the custom functions you built and put them into your own personal package.
4. Create a dashboard as an app to display your findings.
5. Practice using the command line and make git an essential part of your workflow.
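As an example of the first and third ideas, the sketch below takes the kind of loop we wrote earlier to print the share of missing values per column and turns it into a small reusable function. The name `describe_missing` and the `toy` table are made up for this illustration. A function like this is a natural candidate for your own personal package.

```python
import pandas as pd


def describe_missing(df, columns=None):
    """Return one summary line per column with its dtype and share of missing values."""
    columns = df.columns if columns is None else columns
    lines = []
    for col in columns:
        pct = df[col].isna().mean() * 100
        lines.append(f"{col}: dtype={df[col].dtype}, missing={pct:.2f}%")
    return lines


# Hypothetical data to try the function on
toy = pd.DataFrame({"a": [1.0, None, 3.0, 4.0], "b": ["x", "y", None, None]})

for line in describe_missing(toy):
    print(line)
```

Returning the lines instead of printing them inside the function keeps it easy to test and to reuse, for instance when writing the summary to a report.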

## Data Scientist Road

>“By definition all scientists are data scientists. In my opinion, they are half hacker, half analyst, they use data to build products and find insights. It’s Columbus meet Columbo―starry-eyed explorers and skeptical detectives.” ― Monica Rogati, Independent Data Science Advisor

The transition between being a data analyst and a data scientist can depend on many factors, so here is a non-exhaustive checklist for your road ahead (should you choose to go down this path of course).

1. Learn NumPy, pandas, SciPy, and matplotlib/seaborn very well.
2. Data can get dirtier and dirtier the more you move into the data science space. Get as comfortable as you can with manipulating and cleaning unstructured data.
3. Learn how to deal with large datasets on your local computer and in the cloud.
4. Learn Statistics and learn how to design experiments.
5. Familiarise yourself with machine learning and practice, practice, practice.
6. Learn how to conduct large-scale data analysis in the cloud. The top three players right now are AWS, Azure, and GCP.
7. Repetition is key. Whether you have a plan or not, a little bit of learning and doing each day goes a long way.

Here are some resources covering some of the points mentioned above and throughout the lesson.

- Ramalho, Luciano. *Fluent Python*. O'Reilly, 2016.
- VanderPlas, Jake. *A Whirlwind Tour of Python*. O'Reilly, 2016.
- Grus, Joel. *Data Science from Scratch*. O'Reilly, 2019.
- [Coursera Statistics with Python Specialization](https://www.coursera.org/specializations/statistics-with-python)
- [Machine Learning with Python by fastai](http://course18.fast.ai/ml)
- [Full Machine Learning class by Pedro Domingos](https://www.youtube.com/user/UWCSE/playlists?shelf_id=16&sort=dd&view=50)
- Mitchell, Ryan. *Web Scraping with Python: Collecting More Data from the Modern Web*. O'Reilly, 2018.
- VanderPlas, Jake. *Python Data Science Handbook*. O'Reilly, 2017.
- McKinney, Wes. *Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython*. O'Reilly, 2018.
- [Ryan's tutorial on bash scripting](https://ryanstutorials.net/bash-scripting-tutorial/)
- [Learn X in Y Minutes](https://learnxinyminutes.com/docs/python/)


# <center> <h1>Congratulations on Completing Intro to Data Analytics with Python - Great Job!!</h1> </center>

# 8. Feedback

We would really appreciate it if you could provide us with your feedback on this session by answering a couple of questions.

> ## [Survey](https://docs.google.com/forms/d/e/1FAIpQLSdt3l-8oh2BGP1Jp-inYuHDkgtIi5hOqRSq6yTAN7uo6rHB7w/viewform?usp=sf_link)