In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt # used for data visualisation
import seaborn as sns # used for data visualisation
from datetime import date, datetime, timedelta # Used for time data

# plotly
from plotly import tools, subplots
import plotly.offline as py
py.init_notebook_mode(connected=True)
import plotly.graph_objs as go
import plotly.express as px
import plotly.figure_factory as ff
import plotly.io as pio
pio.templates.default = "simple_white"
# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
# Any results you write to the current directory are saved as output.
submission = pd.read_csv("/kaggle/input/covid19-global-forecasting-week-4/submission.csv")
test = pd.read_csv("/kaggle/input/covid19-global-forecasting-week-4/test.csv")
train = pd.read_csv("/kaggle/input/covid19-global-forecasting-week-4/train.csv")

clean_complete = pd.read_csv("/kaggle/input/corona-virus-report/covid_19_clean_complete.csv")
usa_country = pd.read_csv("/kaggle/input/corona-virus-report/usa_county_wise.csv")
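# NOTE: this reads the same USA county file as usa_country above; point it at a China-specific file if the dataset provides one
china_country = pd.read_csv("/kaggle/input/corona-virus-report/usa_county_wise.csv")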


![](https://cdn.psychologytoday.com/sites/default/files/field_blog_entry_images/2020-03/covid-19.jpg)

# COVID 19 Data EDA and Exploration

The purpose of this notebook is to explore the COVID 19 datasets on Kaggle, while trying to improve my understanding of Python. Please bear with me as I stumble through this, as I am fairly new to Python and all it entails. 

I have added the COVID 19 Dataset to this Notebook as well, which you can find [here](https://www.kaggle.com/imdevskp/corona-virus-report). 

As a first step, I am going to look at a couple of the data sources to get a sense of what I am dealing with and what might need changing or adjusting.

# Data Cleaning

In [None]:
# View clean dataset from added dataset
clean_complete.head()

We can see here that our date values appear to be strings rather than date values, so let's change this. This is also likely to be the case for all files from this dataset. 

In [None]:
# Create a function to convert the date column of each dataset
def date_conversion(df):
    """Parse the string Date column (month/day/two-digit year) into datetimes, in place."""
    df['Date'] = pd.to_datetime(df['Date'], format="%m/%d/%y")

    
# Convert the columns of each data set
date_conversion(clean_complete)
date_conversion(usa_country)
date_conversion(china_country)

# View the changes worked 
clean_complete.head()

In the head of the clean_complete dataset, we can see that the date conversion function worked well and our date column is now in a usable date format. 
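Looking at the head only shows how the values are printed, so a quick dtype check can make this a bit more certain. This is just a minimal sketch: the `sample` frame is a made-up three-row stand-in for `clean_complete`, and the values in it are hypothetical.

```python
import pandas as pd

# Hypothetical three-row sample standing in for clean_complete
sample = pd.DataFrame({
    "Date": ["1/22/20", "1/23/20", "1/24/20"],
    "Confirmed": [1, 2, 4],
})

# Same conversion as date_conversion: month/day/two-digit year
sample["Date"] = pd.to_datetime(sample["Date"], format="%m/%d/%y")

# True when the column holds datetimes rather than plain strings
is_datetime = pd.api.types.is_datetime64_any_dtype(sample["Date"])
print(sample.dtypes)
print(is_datetime)
```

Running the same `is_datetime64_any_dtype` check on the real `Date` column should tell us whether the conversion took effect.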

Let's create our first graph, looking at the worldwide statistics for confirmed cases and deaths.

# Worldwide Data Trends

In [None]:
# Create a world df using only the columns of interest.
world_df = clean_complete.groupby('Date')[['Confirmed', 'Deaths']].sum().reset_index()

# Add a column for new cases.
world_df['New_Cases'] = world_df['Confirmed'] - world_df['Confirmed'].shift(1)

# View the tail of the data to see the new column is accurate. 
world_df.tail()

In [None]:
# Melt df into a long dataset
world_df_melt = pd.melt(world_df, id_vars=['Date'], value_vars=['Confirmed', 'Deaths', 'New_Cases'])

In [None]:
# Create line graph using Plotly Express and melted dataset
fig_world = px.line(world_df_melt, x='Date', y='value', color='variable',
                    title="Worldwide Confirmed Cases, New Cases and Deaths Over Time")
fig_world.show()

Looking at this figure, we can see the growth in confirmed cases really started to rise rapidly around March 13th. 

500k cases were reached on March 26th, with 24k deaths at that time. 
1M cases were reached on April 2, with 53k deaths at that time. 
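
Rather than reading milestone dates off the graph by hovering, they could also be looked up directly. Below is a rough sketch of one way to do that. The helper name `first_date_reaching` and the `toy` numbers are my own inventions for illustration, and it assumes the frame is already sorted by date, as `world_df` is after the groupby.

```python
import pandas as pd

def first_date_reaching(df, column, threshold):
    """Return the first Date where df[column] is at or above threshold, or None."""
    reached = df.loc[df[column] >= threshold, "Date"]
    return reached.iloc[0] if not reached.empty else None

# Hypothetical cumulative counts, sorted by date
toy = pd.DataFrame({
    "Date": pd.date_range("2020-03-01", periods=5, freq="D"),
    "Confirmed": [100, 250, 600, 1200, 2500],
})

milestone = first_date_reaching(toy, "Confirmed", 500)
print(milestone)
```

With the real data, something like `first_date_reaching(world_df, 'Confirmed', 500_000)` should give the date of the 500k milestone.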

As confirmed cases are cumulative, why don't we look at new cases over time and see how they are trending. 

In [None]:
# Create line graph using Plotly Express and cleaned dataset
plot_new_world = px.line(world_df, x='Date', y='New_Cases')
plot_new_world.show()

We can see that although new cases rose from around March 11, it looks like we are slowly starting to level out worldwide, but more time is needed to be certain of this. In the coming days / weeks we will hopefully see this level out a bit more. 
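
Daily counts bounce around a lot, which makes it hard to judge whether things are really levelling out. One common way to smooth this is a 7-day rolling average. Here is a minimal sketch of the idea: the `toy` numbers are made up, and the `New_Cases_7d` column name is just my own choice.

```python
import pandas as pd

# Hypothetical daily new-case counts
toy = pd.DataFrame({
    "Date": pd.date_range("2020-04-01", periods=10, freq="D"),
    "New_Cases": [70, 80, 60, 90, 100, 50, 110, 70, 140, 60],
})

# Average each day with the six days before it; the first six rows
# stay NaN because a full 7-day window is not available yet
toy["New_Cases_7d"] = toy["New_Cases"].rolling(window=7).mean()
print(toy)
```

The same `rolling(window=7).mean()` call on `world_df['New_Cases']` would give a smoother line to plot alongside the raw daily numbers.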

What is of more interest around the world is how deaths are trending. Let's take a look. 

In [None]:
# Create line graph using Plotly Express and cleaned dataset
plot_death_world = px.line(world_df, x='Date', y='Deaths')
plot_death_world.show()

We can see in this figure that deaths started to rise exponentially within 10 days of cases starting to rapidly increase. With test results taking up to 5 days to appear, this could be 10-15 days after the person initially got symptoms, which is rather rapid. I would need to look more closely at the data to see if this is the case. So far, though, this graph doesn't show deaths slowing just yet, which is a scary proposition. 
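
One rough way to check whether growth is close to exponential is to look at the day-over-day percentage change: if the growth were truly exponential, that percentage would stay roughly constant. The sketch below uses made-up `toy` numbers growing at about 20% a day, and the `Daily_Growth` column name is my own.

```python
import pandas as pd

# Hypothetical cumulative deaths growing by roughly 20% per day
toy = pd.DataFrame({
    "Date": pd.date_range("2020-03-20", periods=5, freq="D"),
    "Deaths": [100, 120, 144, 173, 207],
})

# Fractional change from the previous day; the first row has no
# previous day, so it is NaN
toy["Daily_Growth"] = toy["Deaths"].pct_change()
print(toy)
```

Applied to `world_df['Deaths']`, a falling `Daily_Growth` over time would suggest the curve is starting to slow even if the cumulative line still looks steep.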

# Country Trends

I've decided I am going to take a look at a couple of countries. For this, I am not going to choose many, but rather places I have either lived or lived near. So I am going to choose the following places:
* New Zealand
* Australia
* United Kingdom
* Canada

So let's dig in.

In [None]:
# Filter data to countries I want to keep
countries_keep = ['New Zealand', 'Australia', 'United Kingdom', 'Canada']
country_data = clean_complete[clean_complete['Country/Region'].isin(countries_keep)]
country_data.head()

Looking at the head of this data, we are going to need to group by Country/Region and Date to make sure we don't have multiple rows for a given country. 

In [None]:
# Sum only the count columns, so text and Lat/Long columns are not summed
country_data_group = country_data.groupby(['Country/Region', 'Date'])[['Confirmed', 'Deaths']].sum().reset_index()

In [None]:
# Create line graph using Plotly Express and grouped dataset
fig_country_cases = px.line(country_data_group, x='Date', y='Confirmed', color='Country/Region', log_y = True,
                    title="Confirmed Cases in the UK, Canada, Australia and New Zealand")
fig_country_cases.show()

Before we start reviewing this information, it is important to note the log y axis. I have done this to make it a bit easier to see the data. However, as the reader you need to be aware that the scale is not linear, and although the UK and Canada seem close, they actually differ by more than 100k cases as of April 26th. So keep this in mind. 
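
Since the log axis hides the real size of the gap, it can help to compute the difference directly. This is only a sketch under my own assumptions: the `toy` frame mimics the shape of `country_data_group` but the numbers are made up, and `wide` and `gap` are names I picked for illustration.

```python
import pandas as pd

# Hypothetical counts in the same long shape as country_data_group
toy = pd.DataFrame({
    "Country/Region": ["United Kingdom", "Canada", "United Kingdom", "Canada"],
    "Date": pd.to_datetime(["2020-04-25", "2020-04-25", "2020-04-26", "2020-04-26"]),
    "Confirmed": [1500, 450, 1600, 470],
})

# One column per country, one row per date
wide = toy.pivot(index="Date", columns="Country/Region", values="Confirmed")

# Absolute gap between the two countries on each date
gap = wide["United Kingdom"] - wide["Canada"]
print(gap)
```

Doing the same pivot on the real grouped data should show the actual gap in case counts that the log scale compresses.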

This graph is very interesting, in that the UK had a huge spike in cases starting March 16th that has almost gone out of control. This resulted in the country entering a month-long lockdown a week later, which has since been extended for another 3 weeks. 

In comparison, NZ hasn't really had a spike, but they entered lockdown on the 25th of March, which has been extended until April 25th. This has resulted in cases being suppressed to the point of having very few daily cases in their most recent updates (as of April 21st). 

Alternatively, Australia and Canada have allowed states / provinces to guide lockdown protocols. In both cases, essential businesses are mostly all that remain open, with some states / provinces placing a lockdown on their residents. In Canada, BC has stated it has never been willing to place residents in lockdown, saying this would be too detrimental to their health, which appears to be similar to the approach of Australia. In these circumstances, either luck or law-abiding citizens have helped to keep case numbers low, with BC currently on 1600 and Australia on around 6000. This points to an approach that has worked well in both cases. But parts of both countries have also seen spikes that they have struggled to contain, and reporting also differs between provinces of Canada (some include probable cases, for example). 

Either way, all 4 countries have taken differing approaches, with outcomes that range almost from one extreme to the other. 

In [None]:
# Create line graph using Plotly Express and grouped dataset
fig_country_deaths = px.line(country_data_group, x='Date', y='Deaths', color='Country/Region', log_y = True,
                    title="Deaths in the UK, Canada, Australia and New Zealand")
fig_country_deaths.show()

Again, this graph shows that the UK spiked quickly towards where they are now, with over 20k deaths in April alone. Meanwhile, Canada has had around 2k deaths and has not seen a spike in these as of April 26th. 

On the other hand, Australia has done really well to keep deaths at a minimum while NZ has only recently crept above 10. 

# Provinces of Canada

I am currently in Canada, so let's take a look at how the different provinces are doing here. 

In [None]:
# Filter data to Canada only, so we can look at its provinces
province_data = clean_complete[clean_complete['Country/Region'] == 'Canada']
province_data.head()

In [None]:
# Sum only the count columns, so text and Lat/Long columns are not summed
province_data_group = province_data.groupby(['Province/State', 'Date'])[['Confirmed', 'Deaths']].sum().reset_index()

In [None]:
# Create line graph using Plotly Express and grouped dataset
fig_province_cases = px.line(province_data_group, x='Date', y='Confirmed', color='Province/State',
                    title="Confirmed Cases in Provinces of Canada")
fig_province_cases.show()

Of all the provinces, Quebec spiked massively following a change in how new cases were reported in the province, while Ontario has also seen a huge spike as a result of being the biggest province by population. Both of these provinces also had stay-at-home orders placed on their residents. 

Meanwhile, BC was one of the first provinces to have a case in Canada, but has done really well to only just creep above 2k cases on April 27th. Residents will say this is due to a lack of testing, but the province has also kept hospitalisations to a minimum, which is the key metric it is looking at overall.