<a href="https://colab.research.google.com/github/sierra6266/hello-world/blob/master/HackHer_Covid_Data_Analysis_Workshop.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Queen's HackHer Hackathon Workshop
## Presented by Kinaxis

Covid-19 notebook: Pandas and Data Visualization

References: 

https://www.kaggle.com/learn/pandas 

https://www.kaggle.com/sudalairajkumar/novel-corona-virus-2019-dataset

https://www.kaggle.com/therealcyberlord/coronavirus-covid-19-visualization-prediction/notebook


In [None]:
import numpy as np 
import matplotlib.pyplot as plt 
import matplotlib.colors as mcolors
import pandas as pd 
import random
import math
import time
from sklearn.linear_model import LinearRegression, BayesianRidge
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.preprocessing import PolynomialFeatures
from sklearn.svm import SVR
from sklearn.metrics import mean_squared_error, mean_absolute_error
import datetime
import operator 
plt.style.use('fivethirtyeight')
%matplotlib inline
import warnings
warnings.filterwarnings("ignore")

First, we need to read the data in from several CSV files.

In [None]:
confirmed_df = pd.read_csv('https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv')
deaths_df = pd.read_csv('https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_deaths_global.csv')
recoveries_df = pd.read_csv('https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_recovered_global.csv')
latest_data = pd.read_csv('https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_daily_reports/10-07-2020.csv')
us_medical_data = pd.read_csv('https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_daily_reports_us/10-07-2020.csv')

In pandas you can use head() to show a preview of what's in the dataframe. We're interested in the confirmed_df dataframe. This tells us how many confirmed Covid-19 cases there are in each region by date.



In [None]:
confirmed_df.head()

To get a quick overview of the data in the dataframe, we can use the pandas **describe** function, which outputs several summary statistics, including:


*   count: number of non-null values
*   mean, min, max: mean, min and max value of the column
*   std: the standard deviation of the column
*   quantiles: 25%, 50% (i.e. the median), 75%

Note, this function only outputs statistics on columns with numeric values.

One insight we can quickly gain, for example, is that most of the columns for the first few dates are over 75% zeros.
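
As a minimal sketch of the points above, the toy dataframe below (hypothetical data and column names, not taken from the dataset) has one text column and two numeric columns. It shows that describe() skips the text column, and that a 75% quantile of 0 on case counts means at least three quarters of the values are zero.

```python
import pandas as pd

# Hypothetical toy data: one text column and two numeric columns
toy_df = pd.DataFrame({
    "region": ["A", "B", "C", "D", "E"],
    "day_1": [0, 0, 0, 0, 4],
    "day_2": [0, 0, 1, 3, 8],
})

summary = toy_df.describe()
print(summary)

# Only the numeric columns appear in the summary
print(list(summary.columns))

# The 75% quantile of day_1 is 0, because four of its five values are zero
print(summary.loc["75%", "day_1"])
```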


In [None]:
confirmed_df.describe()

## Visualizing Confirmed cases as a line graph
This section will go over how to format the data and build a graph.

With a pandas dataframe, we can select specific rows and columns by label using the **.loc** indexer.

In [None]:
# Select the first row and the column of "Country/Region"
confirmed_df.loc[0,'Country/Region']

In [None]:
# if you want to select an entire column, we can use the : operator
confirmed_df.loc[:,'Country/Region']

In [None]:
# if you want to select by integer position instead of by label, use .iloc
confirmed_df.iloc[:, 1]

Now we want to select the dates. See how the first 4 columns are not dates? We can use the pandas **.iloc** indexer to select all the columns after the 4th one.
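
To make the difference between the two indexers concrete, here is a small sketch on a toy dataframe shaped like confirmed_df. The values are made up for illustration. It selects the date columns once by position with .iloc and once by label with .loc, and checks that both give the same result.

```python
import pandas as pd

# Hypothetical toy frame: 4 descriptive columns followed by date columns
toy_df = pd.DataFrame({
    "Province/State": [None, "Ontario"],
    "Country/Region": ["Albania", "Canada"],
    "Lat": [41.0, 51.0],
    "Long": [20.0, -85.0],
    "1/22/20": [0, 0],
    "1/23/20": [0, 1],
})

# Position-based: every row, columns from position 4 onwards
by_position = toy_df.iloc[:, 4:]

# Label-based: every row, columns from the label "1/22/20" onwards
by_label = toy_df.loc[:, "1/22/20":]

print(list(by_position.columns))
print(by_position.equals(by_label))
```

Selecting by position is shorter here, but it silently depends on the number of descriptive columns staying at 4.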



In [None]:
# keys() will give us a list of all the column headers
cols = confirmed_df.keys()
# Select all the rows, but only the columns after the 4th
confirmed = confirmed_df.iloc[:, 4:]
dates = confirmed.keys()
# This will print out all the dates
dates

We want to plot the graph based on days since the first date. We can use a **list comprehension** to do this.

In [None]:
# Create a list of the numbers from 0 to 9 (range(10) stops before 10)
[x for x in range(10)]

In [None]:
# Create an array of day numbers (0, 1, 2, ...), one for each date column
days_since = np.array([i for i in range(len(dates))])
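
Counting positions like this assumes there is exactly one column per day with no gaps. As a hedged alternative, the sketch below parses the column labels as real dates and subtracts the first one. It assumes labels in month/day/two-digit-year form (such as 1/22/20), and the toy labels deliberately skip a day.

```python
import pandas as pd

# Hypothetical column labels; note that 1/24/20 is missing
toy_dates = pd.Index(["1/22/20", "1/23/20", "1/25/20"])

# Parse the labels into real dates
parsed = pd.to_datetime(toy_dates, format="%m/%d/%y")

# Subtract the first date and keep the whole number of days
days_since_first = (parsed - parsed[0]).days.to_numpy()
print(days_since_first)
```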

Now we want to sum up the number of confirmed cases worldwide.

In [None]:
world_cases = []

for i in dates:
  confirmed_sum = confirmed[i].sum()
  world_cases.append(confirmed_sum)
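
The loop above adds up one column at a time. pandas can also sum every column in one call with sum(axis=0). The sketch below uses a made-up toy dataframe to show that both approaches give the same totals.

```python
import pandas as pd

# Hypothetical toy data: rows are regions, columns are dates
toy_confirmed = pd.DataFrame({
    "1/22/20": [1, 0, 2],
    "1/23/20": [3, 1, 2],
})

# Loop version, as in the cell above
loop_totals = []
for date in toy_confirmed.columns:
    loop_totals.append(toy_confirmed[date].sum())

# Vectorized version: sum down the rows for every column at once
vectorized_totals = toy_confirmed.sum(axis=0)

print(vectorized_totals.tolist())
```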

We are going to use matplotlib to graph our results.

In [None]:
plt.figure(figsize=(16, 10))
# .plot creates a line chart with parameters x and y
plt.plot(days_since, world_cases)
plt.title('# of Coronavirus Cases Over Time', size=30)
plt.xlabel('Days Since 1/22/2020', size=30)
plt.ylabel('# of Cases', size=30)
plt.legend(['Worldwide Coronavirus Cases'], prop={'size': 20})
plt.xticks(size=20)
plt.yticks(size=20)
plt.show()

## Challenge 1

Plot the total number of deaths over time.

Hint: use the deaths_df and .iloc in the same way we did for confirmed_df


### Your Solution Here

In [None]:
# Count the deaths

# Plot the graph

## Visualizing Confirmed Cases by country as a pie chart

This section will go over how to group data by country and build a pie chart.

This time we're interested in the latest_data dataframe. This tells us the current status of Covid-19 cases in each country.

In [None]:
latest_data.head()

In pandas, we can use the **unique()** function to get all the unique values in a column.

In [None]:
latest_data['Country_Region'].unique()

In pandas, we can get all the rows where a column value meets some criterion by using comparison operators. Some comparison operators are greater than (>), less than (<) and equal to (==).

In [None]:
# All rows where country is Canada
latest_data[latest_data['Country_Region']=='Canada']

In [None]:
# all rows where the number of confirmed cases is greater than 500 000
latest_data[latest_data['Confirmed'] > 500000]
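
Conditions like these can also be combined. The sketch below uses a made-up toy dataframe with the same column names as latest_data. It combines two conditions with & (and) and | (or). Comparisons return a True/False value per row, and you need parentheses if you write both conditions inline.

```python
import pandas as pd

# Hypothetical toy data using the same column names as latest_data
toy_latest = pd.DataFrame({
    "Country_Region": ["Canada", "Canada", "US", "US"],
    "Province_State": ["Ontario", "Yukon", "Texas", "Guam"],
    "Confirmed": [55000, 15, 800000, 500],
})

# Each comparison produces one True/False value per row
in_canada = toy_latest["Country_Region"] == "Canada"
over_1000 = toy_latest["Confirmed"] > 1000

# & keeps rows where both are True, | keeps rows where at least one is True
both = toy_latest[in_canada & over_1000]
either = toy_latest[in_canada | over_1000]

print(both["Province_State"].tolist())
print(either["Province_State"].tolist())
```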

Now we are going to iterate over each country and sum up the number of confirmed cases.

In [None]:
unique_countries =  list(latest_data['Country_Region'].unique())
country_confirmed_cases = []

no_cases = []
for i in unique_countries:
    cases = latest_data[latest_data['Country_Region']==i]['Confirmed'].sum()
    if cases > 0:
        country_confirmed_cases.append(cases)
    else:
        no_cases.append(i)

# Here we remove the countries with no cases to make the graph look cleaner        
for i in no_cases:
    unique_countries.remove(i)
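
As an alternative sketch, not the approach used in the rest of this notebook, pandas can do the per-country sum in one step with groupby. The toy data below is made up and only reuses the column names from latest_data.

```python
import pandas as pd

# Hypothetical toy data using the same column names as latest_data
toy_latest = pd.DataFrame({
    "Country_Region": ["Canada", "Canada", "US", "Nowhere"],
    "Confirmed": [100, 50, 300, 0],
})

# Group the rows by country and sum the confirmed cases in each group
totals = toy_latest.groupby("Country_Region")["Confirmed"].sum()

# Drop the countries with no cases
totals = totals[totals > 0]

print(totals.to_dict())
```

The result is a Series indexed by country name, so the names and the totals stay paired without keeping two separate lists in sync.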


Tip! In Python we can use **zip** to pair up two lists of data.

In [None]:
a = [2,1,3]
b = ['z','y','x']
c = list(zip(a,b))
print(c)

Another tip! In Python we can use **sorted** to sort a list. Since we are sorting a zipped list, we need to specify which item of each pair to sort on.


In [None]:
# sort numerically based on the numbers at index 0
print(sorted(c,key=operator.itemgetter(0)))
# sort alphabetically based on the letters at index 1
print(sorted(c, key=operator.itemgetter(1)))
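
A related sketch, using made-up names and counts: after sorting the pairs, zip(*pairs) splits them back into two sequences that are still in matching order. A lambda is used for the key here, which does the same job as operator.itemgetter(1).

```python
# Hypothetical names and counts, paired by position
names = ["x", "y", "z"]
counts = [3, 1, 2]

# Sort the pairs by the count (index 1), largest first
pairs = sorted(zip(names, counts), key=lambda pair: pair[1], reverse=True)

# Unzip the sorted pairs back into two tuples that stay in sync
sorted_names, sorted_counts = zip(*pairs)

print(sorted_names)
print(sorted_counts)
```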

In [None]:
# sort countries by the number of confirmed cases
# We sort in descending order by using reverse = True
unique_countries = [k for k, v in sorted(zip(unique_countries, country_confirmed_cases), key=operator.itemgetter(1), reverse=True)]
for i in range(len(unique_countries)):
    country_confirmed_cases[i] = latest_data[latest_data['Country_Region']==unique_countries[i]]['Confirmed'].sum()

There are too many countries to show clearly in one chart, so let's take the top ten and put the rest into a category called Others.


In [None]:
# Only show 10 countries with the most confirmed cases, the rest are grouped into the other category
visual_unique_countries = [] # Names of countries
visual_confirmed_cases = [] # Numbers of cases
others = np.sum(country_confirmed_cases[10:])

for i in range(len(country_confirmed_cases[:10])):
    visual_unique_countries.append(unique_countries[i])
    visual_confirmed_cases.append(country_confirmed_cases[i])
    
visual_unique_countries.append('Others')
visual_confirmed_cases.append(others)
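
The same "top N plus Others" idea can be sketched with a pandas Series and nlargest. This is an alternative to the list-based cell above, shown on made-up totals with a top 3 instead of a top 10 to keep it short.

```python
import pandas as pd

# Hypothetical per-country totals
totals = pd.Series({"A": 50, "B": 400, "C": 10, "D": 300, "E": 40})

top_n = 3

# The top_n largest values, already sorted in descending order
top = totals.nlargest(top_n)

# Everything that is not in the top group gets added together
others = totals.drop(top.index).sum()

labels = list(top.index) + ["Others"]
values = [int(v) for v in top.values] + [int(others)]

print(labels)
print(values)
```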

Now we're ready to plot our pie chart. In matplotlib, we use the function pie().

In [None]:
def plot_pie_charts(x, y, title):
    # Have fun picking colours :)
    c = ['lightcoral', 'rosybrown', 'sandybrown', 'navajowhite', 'gold',
        'khaki', 'lightskyblue', 'turquoise', 'lightslategrey', 'thistle', 'pink']
    plt.figure(figsize=(20,15))
    plt.title(title, size=20)
    plt.pie(y, colors=c,shadow=True, labels=y)
    plt.legend(x, loc='best', fontsize=12)
    plt.show()
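
As a variation, pie() can label each slice with its share of the total by passing autopct. The sketch below uses made-up values and is only one possible styling, not a replacement for the function above. The matplotlib.use("Agg") line is there so the sketch also runs outside a notebook. Inside this notebook you can leave it out.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, only needed outside a notebook
import matplotlib.pyplot as plt

# Hypothetical labels and values
labels = ["B", "D", "A", "Others"]
values = [400, 300, 60, 40]

fig, ax = plt.subplots(figsize=(8, 8))

# autopct formats each slice's percentage of the total with one decimal place
wedges, texts, autotexts = ax.pie(values, labels=labels, autopct="%1.1f%%")
ax.set_title("Share of confirmed cases (toy data)")

# Collect the percentage text drawn on each slice
percent_labels = [t.get_text() for t in autotexts]
print(percent_labels)
```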

In [None]:
plot_pie_charts(visual_unique_countries, visual_confirmed_cases, 'Covid-19 Confirmed Cases per Country')

## Challenge 2
Create a pie chart for confirmed cases in Canada (or the country of your choosing) grouped by province or state.

Hint: Look at the cell that does this for countries

Hint 2: Use the latest_data df and the column 'Province_State'

### Your Solution Here


In [None]:
# Get the regions for your country

# Count the cases per region

# Plot the graph

## Final Challenge



Now is your chance to be creative. Come up with an insightful visualization for any part of this Covid-19 data. You may use what you've just learned, or any library you'd like! Have fun!

### Your creative solution here

In [None]:
# Show us what you can come up with