# 1 | Loading Data
Let's load the data.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here are several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Any results you write to the current directory are saved as output.

# 2 | Exploring the Time Series Dataset
Let's take a look at this data.

In [None]:
data = pd.read_csv('/kaggle/input/novel-corona-virus-2019-dataset/time_series_covid_19_confirmed.csv')
data.head()

The data is in an interesting form. Let's look at the columns.

In [None]:
data.columns

It seems that each row is a country or province, and each column represents one date.
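
A quick way to check this layout is to split the columns into the descriptive ones and the date ones. The sketch below uses a toy frame with made-up values, and it assumes the date labels follow a month/day/two-digit-year pattern such as '1/22/20'. If the real file uses another pattern, the format string would need to change.

```python
import pandas as pd

# Toy stand-in for the real CSV; the values are made up.
toy = pd.DataFrame({
    'Province/State': [None, 'P1'],
    'Country/Region': ['Xland', 'Yland'],
    'Lat': [0.0, 1.0],
    'Long': [0.0, 1.0],
    '1/22/20': [0, 10],
    '2/1/20': [2, 20],
})

# Columns that describe the row rather than hold a daily count.
meta_cols = ['Province/State', 'Country/Region', 'Lat', 'Long']
date_cols = [c for c in toy.columns if c not in meta_cols]

# Assumed label format: month/day/two-digit year.
parsed = pd.to_datetime(date_cols, format='%m/%d/%y')
```

With real dates instead of strings, matplotlib can space the x-axis ticks by time rather than drawing one label per column.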

This time series represents the number of confirmed cases in each province/country on each date.

Let's plot the number of cases in the United States.

We will be using conditional selection.

This is of the form data[condition], for example, data[data[column] > 5].
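
As a minimal sketch of this pattern, the toy DataFrame below (made-up values, not the real dataset) shows that the condition is itself a Series of True/False values, and that indexing with it keeps only the True rows.

```python
import pandas as pd

# Toy data for illustration only; these numbers are made up.
toy = pd.DataFrame({
    'Country/Region': ['US', 'Italy', 'US', 'Spain'],
    'cases': [3, 8, 12, 5],
})

# The condition alone is a boolean Series, one True/False per row.
mask = toy['cases'] > 5

# Indexing with the mask keeps only the rows where it is True.
selected = toy[mask]

# Conditions can be combined with & (and) and | (or); each one needs parentheses.
us_over_5 = toy[(toy['Country/Region'] == 'US') & (toy['cases'] > 5)]
```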

In [None]:
us = data[data['Country/Region'] == 'US']
us

Let's only select the numbers.

In [None]:
us = us.drop(['Province/State','Country/Region','Lat','Long'],axis=1)

Now, we only have the numbers.

In [None]:
us

Let's use the transpose function. It swaps rows and columns: each row becomes a column and each column becomes a row.
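
A small sketch on a toy frame (made-up values) shows what happens to the shape and the labels. Note that the old row label ends up as the new column name.

```python
import pandas as pd

# Toy frame: one row labelled 7, three date columns (made-up values).
wide = pd.DataFrame({'1/22/20': [1], '1/23/20': [2], '1/24/20': [5]}, index=[7])

# .T swaps rows and columns: 1 row x 3 columns becomes 3 rows x 1 column.
tall = wide.T
```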

In [None]:
us.T

Next, let's move the index into a regular column. How do we pop the index out into a column?

In [None]:
us = us.T.reset_index()
us

Let's rename the columns to something more appropriate.

In [None]:
us = us.rename(columns={'index':'date',225:'confirmed'})
us

Great! We have our x and our y.
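
The rename step above uses the row label 225, which only matches the US row in this particular version of the file. As a hedged sketch, the steps can be wrapped in a helper that does not depend on that label. The name `country_series` is hypothetical, and the toy frame stands in for the real CSV. Summing over rows is an assumption that also covers countries split into several provinces.

```python
import pandas as pd

META_COLS = ['Province/State', 'Country/Region', 'Lat', 'Long']

def country_series(data, country):
    """Return a two-column frame (date, confirmed) for one country.

    Hypothetical helper: sums over all rows of the country, so it also
    works when a country is split into several provinces.
    """
    rows = data[data['Country/Region'] == country]
    totals = rows.drop(columns=META_COLS).sum(axis=0)  # one value per date column
    out = totals.reset_index()                         # date labels become a column
    out.columns = ['date', 'confirmed']
    return out

# Toy stand-in for the real CSV; the values are made up.
toy = pd.DataFrame({
    'Province/State': ['A', 'B', None],
    'Country/Region': ['Xland', 'Xland', 'Yland'],
    'Lat': [0.0, 1.0, 2.0],
    'Long': [0.0, 1.0, 2.0],
    '1/22/20': [1, 2, 5],
    '1/23/20': [3, 4, 6],
})

xland = country_series(toy, 'Xland')
```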

First, let's import our essential plotting libraries.

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

Let's use our Grammar of Graphics:

1. Specify coordinate grid and figure properties / style
2. Specify the figure type
3. Specify the data (x and y)

In [None]:
plt.figure(figsize=(18,5)) #step 1: coordinates/figure

sns.barplot(x='date',y='confirmed',data=us) #step 2: specify type and step 3: data

plt.show() #show the plot by itself

Let's try adding a style to the coordinate grid.

In [None]:
plt.figure(figsize=(18,5)) #step 1: coordinates/figure

sns.set_style('whitegrid') #step 1: coordinates/figure

sns.barplot(x='date',y='confirmed',data=us) #step 2: specify type and step 3: data

plt.show() #show the plot by itself

Uh-oh! Our x-labels are overlapping. How might we address this?

In [None]:
plt.figure(figsize=(18,5)) #step 1: coordinates/figure
sns.set_style('whitegrid') #step 1: coordinates/figure
sns.barplot(x='date',y='confirmed',data=us) #step 2: specify type and step 3: data
plt.xticks(rotation=90) #step 1: coordinates/figure
plt.show() #show the plot by itself

A line plot seems more appropriate, because the data is continuous over time rather than discrete, as a bar plot suggests.

In [None]:
plt.figure(figsize=(18,5)) #step 1: coordinates/figure
sns.set_style('whitegrid') #step 1: coordinates/figure
sns.lineplot(x='date',y='confirmed',data=us) #step 2: specify type and step 3: data
plt.xticks(rotation=90) #step 1: coordinates/figure
plt.show() #show the plot by itself

There are some breaks in the data, which lead the line plot to drop to 0 when there is no data for a certain date. Let's use a scatterplot instead.

In [None]:
plt.figure(figsize=(18,5)) #step 1: coordinates/figure
sns.set_style('whitegrid') #step 1: coordinates/figure
sns.scatterplot(x='date',y='confirmed',data=us) #step 2: specify type and step 3: data
plt.xticks(rotation=90) #step 1: coordinates/figure
plt.show() #show the plot by itself

Let's compare the number of cases in the United States to the number of those in Italy over time.

In [None]:
italy = data[data['Country/Region']=='Italy']
italy

Like before, remove the non-numeric columns.

In [None]:
italy = italy.drop(['Province/State','Country/Region','Lat','Long'],axis=1)
italy

The transpose function gets it into our desired x, y form.

In [None]:
italy = italy.T
italy

Let's reset the index to pop out our x value.

In [None]:
italy = italy.reset_index()
italy

And finally, let's rename our columns.

In [None]:
italy = italy.rename(columns={'index':'date',137:'confirmed'})
italy

In [None]:
plt.figure(figsize=(18,5)) #step 1: coordinates/figure
sns.set_style('whitegrid') #step 1: coordinates/figure

#United States data
sns.scatterplot(x='date',y='confirmed',data=us) #step 2 and 3

#Italy data
sns.scatterplot(x='date',y='confirmed',data=italy) #step 2 and 3

plt.xticks(rotation=90) #step 1: coordinates/figure
plt.show() #show the plot by itself

But which data is which? How can we tell which set of points belongs to which country?

Let's add labels / a legend.

In [None]:
plt.figure(figsize=(18,5)) #step 1: coordinates/figure
sns.set_style('whitegrid') #step 1: coordinates/figure

#United States data
sns.scatterplot(x='date',y='confirmed',data=us,label='US') #step 2 and 3

#Italy data
sns.scatterplot(x='date',y='confirmed',data=italy,label='Italy') #step 2 and 3

plt.legend() #display the legend
plt.xticks(rotation=90) #step 1: coordinates/figure
plt.show() #show the plot by itself

It seems that the United States outpaced Italy in confirmed cases near the end of March.
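
An alternative to calling the plotting function once per country is to stack the frames into one long frame with a country column. This is a sketch with toy stand-ins for the two frames (made-up values). The column name `country` is an assumption chosen for illustration.

```python
import pandas as pd

# Toy stand-ins for the `us` and `italy` frames built above; values are made up.
us_toy = pd.DataFrame({'date': ['3/1/20', '3/2/20'], 'confirmed': [10, 20]})
italy_toy = pd.DataFrame({'date': ['3/1/20', '3/2/20'], 'confirmed': [30, 40]})

# Tag each frame with its country, then stack them into one long frame.
combined = pd.concat(
    [us_toy.assign(country='US'), italy_toy.assign(country='Italy')],
    ignore_index=True,
)
```

With this shape, a single call such as `sns.scatterplot(x='date', y='confirmed', hue='country', data=combined)` colors the points by country and builds the legend from the `country` column.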

We've looked a bit at bivariate data - let's look now at some univariate data.

Let's open another data file.

In [None]:
covid = pd.read_csv('/kaggle/input/novel-corona-virus-2019-dataset/COVID19_open_line_list.csv')
covid.head()

This data seems to have a lot of unnecessary columns. Let's look at them:

In [None]:
covid.columns

Let's explore the relationship between age and outcome.

In [None]:
sub_data = covid[['age','outcome']]
sub_data

It appears that there are a lot of NaN values. Let's remove them with .dropna(), which drops any row that contains a missing value.

In [None]:
sub_data = sub_data.dropna()
sub_data

There appears to be a discrepancy in the naming: the same outcome is sometimes written 'discharge' and sometimes 'discharged'.

We can get an understanding of what types of values are in our data with data[column].unique().

In [None]:
sub_data['outcome'].unique()

Wow - there seem to be a lot of different naming conventions for things that mean the same thing.
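
To see how common each spelling is, and not just which spellings exist, `value_counts()` can help. The sketch below uses a made-up sample rather than the real column.

```python
import pandas as pd

# Toy outcomes with inconsistent spellings (made-up sample).
outcomes = pd.Series(['discharged', 'discharge', 'died', 'Discharged', 'died', 'stable'])

# Count how often each raw spelling appears.
raw_counts = outcomes.value_counts()

# Lower-casing first merges spellings that differ only by capitalization.
lower_counts = outcomes.str.lower().value_counts()
```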

Let's make a function to process this so that all the outcomes that mean the same thing are named accordingly.

In [None]:
def process_outcome(x):
    if x=='discharged' or x=='discharge' or x=='Discharged':
        return 'Discharged'
    elif x=='died' or x=='death' or x=='severe':
        return 'Death/Severe'
    elif x=='stable' or x == 'recovered':
        return 'Stable'
    else:
        return np.nan
sub_data['outcome'] = sub_data['outcome'].apply(process_outcome)

Great! Now we should have a consistent naming convention.
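
The same grouping can also be sketched as a lookup table with `.map()`. This is an alternative, not the method used in this notebook. Because it lower-cases first, it is slightly more lenient than `process_outcome` (for example, 'Died' would also match). The sample values are made up.

```python
import pandas as pd

# Same grouping as process_outcome, written as a lookup table.
OUTCOME_MAP = {
    'discharged': 'Discharged',
    'discharge': 'Discharged',
    'died': 'Death/Severe',
    'death': 'Death/Severe',
    'severe': 'Death/Severe',
    'stable': 'Stable',
    'recovered': 'Stable',
}

raw = pd.Series(['Discharged', 'death', 'recovered', 'unknown'])  # made-up sample

# Lower-case first, then map; anything not in the table becomes NaN.
clean = raw.str.lower().map(OUTCOME_MAP)
```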

Remember that we replaced anything else with a NaN value. Let's filter those out with .dropna().

In [None]:
sub_data = sub_data.dropna()
sub_data

Unfortunately, the age column seems to be in string format.

In [None]:
sub_data['age'].apply(type)

How might we convert it into an integer?

Let's apply the int() function to the data.

In [None]:
sub_data['age'] = sub_data['age'].apply(int)

Uh-oh, another error! int() cannot convert every value in this column.
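
One way to narrow down which values trip up the conversion is `pd.to_numeric` with `errors='coerce'`. This is a sketch on a made-up sample. Note that `to_numeric` accepts decimals that `int()` would reject, so it flags only values that are not numbers at all.

```python
import pandas as pd

ages = pd.Series(['36', '70-79', '4', 'n/a'])  # made-up sample of string ages

# errors='coerce' turns anything that is not a plain number into NaN.
as_numbers = pd.to_numeric(ages, errors='coerce')

# Keep only the original strings that failed to convert.
problem_values = ages[as_numbers.isna()]
```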

Let's look at the unique values for age.

In [None]:
sub_data['age'].unique()

When in doubt, make a function to parse the data.

In [None]:
def process_age(age):
    parts = age.split('-')
    if len(parts)==2: #if it is a range, e.g. '70-79'.split('-') -> ['70','79']
        return (float(parts[0]) + float(parts[1]))/2
        #return the average of the two bounds
    else:
        return float(age)
        #otherwise, return the float of the age

sub_data['age'] = sub_data['age'].apply(process_age)

Great - now, our age data should all be floats.

In [None]:
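plt.figure(figsize=(10,5))
sns.distplot(sub_data['age']) #deprecated since seaborn 0.11; sns.histplot(sub_data['age'], kde=True) is the closest newer call
plt.show()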

Let's plot out the ages of people for each of the three outcomes.

In [None]:
discharged = sub_data[sub_data['outcome']=='Discharged']
death = sub_data[sub_data['outcome']=='Death/Severe']
stable = sub_data[sub_data['outcome']=='Stable']

Great! Each variable now holds the rows for one outcome.

In [None]:
plt.figure(figsize=(15,6)) #create figure and specify size

sns.distplot(discharged['age'],label='Discharged') #plot discharged
sns.distplot(death['age'],label='Death/Severe') #plot death/severe
sns.distplot(stable['age'],label='Stable') #plot stable

plt.legend() #display the legend so we know which distributions are which
plt.title('Ages of People by Outcome') #Add a title
plt.show() #show plot

Observations:
- People who died or were in severe condition are overwhelmingly older
- People who were discharged from the hospital are overwhelmingly young
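
These observations can also be checked numerically. The sketch below uses a toy stand-in for `sub_data` with made-up ages and summarizes each outcome with a `groupby`.

```python
import pandas as pd

# Toy stand-in for sub_data; the ages are made up.
toy = pd.DataFrame({
    'age': [25.0, 35.0, 30.0, 70.0, 80.0, 50.0],
    'outcome': ['Discharged', 'Discharged', 'Stable',
                'Death/Severe', 'Death/Severe', 'Stable'],
})

# Median age and group size for each outcome.
summary = toy.groupby('outcome')['age'].agg(['median', 'count'])
```

Checking the group sizes alongside the medians helps guard against reading too much into an outcome with only a handful of rows.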

Let's visualize this with a boxplot of age for each outcome.

In [None]:
plt.figure(figsize=(15,6)) #create figure + specify size

sns.boxplot(x='age',y='outcome',data=sub_data)

plt.title('Ages of People by Outcome')
plt.show()

# If you enjoyed...
Try out the following data analysis ideas!
- Compare the number of deaths against the number of recovered people. Which one is rising more quickly?
- Check out the number of cases in China. Is it beginning to taper off?
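
For questions about which series is rising more quickly, one possible starting point is the day-to-day change rather than the running total. The sketch below assumes the counts are cumulative and uses made-up values in the same shape as the `us` frame. The column name `new_cases` is chosen for illustration.

```python
import pandas as pd

# Toy cumulative counts in the same shape as `us`; values are made up.
toy = pd.DataFrame({
    'date': ['3/1/20', '3/2/20', '3/3/20', '3/4/20'],
    'confirmed': [10, 15, 25, 45],
})

# Difference between consecutive days = new cases reported that day.
# The first row has no previous day, so it becomes NaN.
toy['new_cases'] = toy['confirmed'].diff()
```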
