[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/stat10/DS10-Python-HW/blob/master/HW08_distributions.ipynb)

## Homework 08 Exercises

The following exercises are based on the material from [Lab 8](https://colab.research.google.com/github/AllenDowney/ElementsOfDataScience/blob/master/08_distributions.ipynb).

**Exercise 1:** As you probably know, the novel coronavirus disease (COVID-19) has turned into a pandemic and has created health, social, economic, and political crises around the globe. In this exercise, we will explore data on COVID-19 cases worldwide and look at the distributions of the death ratios. The data for this exercise was obtained from a [repository](https://github.com/CSSEGISandData/COVID-19) operated by the Johns Hopkins University Center for Systems Science and Engineering.

The original dataset contains three CSV files corresponding to the number of daily worldwide confirmed cases, recovered cases, and cases resulting in death, respectively. The data is updated every day and is cumulative, *i.e.*, the number of cases on any given day is a cumulative sum starting from January 22nd, 2020. So, the row corresponding to the last day contains the total number of cases reported up to that day.

We have removed some of the columns, cleaned the data up a bit, and converted the cumulative data to deltas, *i.e.*, the number of cases on each day corresponds only to the data from that day. This is simply to make the analysis a little easier to do.
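As a rough illustration of the cumulative-to-delta conversion described above, here is a minimal sketch on made-up numbers. The series name, dates, and counts are hypothetical, and this is not necessarily how the cleaned files were actually produced.

```python
import pandas as pd

# Hypothetical cumulative counts for five consecutive days (made-up numbers)
cumulative = pd.Series(
    [10, 15, 15, 22, 30],
    index=pd.date_range("2020-01-22", periods=5, freq="D"),
    name="cumulative",
)

# diff() subtracts the previous day's total from each day's total.
# The first day has no predecessor, so its delta is its cumulative value.
daily = cumulative.diff().fillna(cumulative.iloc[0]).astype(int).rename("daily")

print(daily.tolist())

# Sanity check: summing the deltas recovers the cumulative totals
print((daily.cumsum() == cumulative).all())
```

A day with no new cases shows up as a delta of 0, which matters later when a ratio is computed from the daily counts.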

We will use the following libraries: `pandas`, `numpy`, `matplotlib`, `empiricaldist`, `scipy.stats`, and `seaborn`. Let's start by importing and/or installing them by running the cells below.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm
import seaborn as sns

# If we're running on Colab, install empiricaldist
# https://pypi.org/project/empiricaldist/

import sys
IN_COLAB = 'google.colab' in sys.modules

if IN_COLAB:
    !pip install empiricaldist

from empiricaldist import Pmf
from empiricaldist import Cdf

Run the cells below to download the cleaned dataset for the confirmed cases and deaths. 

In [None]:
import os

if not os.path.exists('Confirmed_Clean.csv'):
    !wget https://raw.githubusercontent.com/stat10/DS10-Python-HW/master/Data/Confirmed_Clean.csv
        
if not os.path.exists('Deaths_Clean.csv'):
    !wget https://raw.githubusercontent.com/stat10/DS10-Python-HW/master/Data/Deaths_Clean.csv

**2a:** Create dataframes from the files downloaded in the cells above and plot the number of daily confirmed cases in China, *i.e.* the number of confirmed cases on the y-axis and the dates on the x-axis. Your plot should look similar to the image below.

![alt text](https://raw.githubusercontent.com/stat10/DS10-Python-HW/master/Images/plot8.png)

*Hints:* 
1. Take a look at all the columns in the dataframe to figure out the appropriate column name to use for China
2. To make the x-axis work correctly with the dates, convert the dates to `Timestamp` objects (you might remember this from Lab02 and HW02). To do that, first convert the `Date` column to a list using the method `tolist()`. Then write a `for` loop that repeats as many times as there are entries in the list. Within the `for` loop, use the `pd.Timestamp` function on the current element of the list to convert the string to a `Timestamp`, and either put it back in the same list or append it to a new list.
3. To make the dates on the x-axis look better, use the function `plt.xticks()` with the keyword argument `rotation=45`. This rotates the x-axis labels by 45 degrees and prevents them from overlapping.
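The date-handling hints above can be sketched on a toy dataframe. The column name `Confirmed`, the date strings, and the counts below are hypothetical; the real file's column names and date format may differ.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical stand-in for the homework data (made-up values)
df = pd.DataFrame({
    "Date": ["2020-01-22", "2020-01-23", "2020-01-24", "2020-01-25"],
    "Confirmed": [5, 9, 4, 12],
})

# Hint 2: convert each date string to a pd.Timestamp inside a for loop
date_strings = df["Date"].tolist()
dates = []
for s in date_strings:
    dates.append(pd.Timestamp(s))

# Hint 3: plot against the Timestamps and rotate the tick labels
plt.plot(dates, df["Confirmed"])
plt.xticks(rotation=45)
plt.xlabel("Date")
plt.ylabel("Daily confirmed cases")
plt.tight_layout()
```

Appending to a new list keeps the original strings available in case you need to inspect them later.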

In [None]:
# Solution (create dataframes)


In [None]:
# Solution (plot daily number of confirmed cases in Mainland China)


**2b:** Now, create a similar plot for the number of daily deaths in China due to COVID-19. 

In [None]:
# Solution


**2c:** Calculate the ratio of deaths to confirmed cases to quantify how deadly COVID-19 has been over time and plot the death ratio as a function of time. Your plot should look similar to the image below. Also, calculate and print the mean death ratio. 

![alt text](https://raw.githubusercontent.com/stat10/DS10-Python-HW/master/Images/plot9.png)

As you might notice, there are spikes in the data, and the ratio also seems to increase a little over time. So the mean death ratio, which is about 8%, is not an ideal summary statistic. Let us look at the distribution of the death ratio to get a better idea.
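One thing to watch for when dividing daily counts is a day with zero confirmed cases. The following sketch uses made-up numbers and shows one possible way to handle that case. Treating such days as missing is an assumption here, not a requirement of the exercise.

```python
import numpy as np
import pandas as pd

# Hypothetical daily counts (made up for illustration)
confirmed = pd.Series([100, 0, 50, 200])
deaths = pd.Series([2, 0, 5, 4])

# Element-wise ratio; 0/0 gives NaN and n/0 gives inf
ratio = deaths / confirmed

# Treat infinite values as missing so that mean() skips them
ratio = ratio.replace([np.inf, -np.inf], np.nan)

print(ratio)
print("Mean death ratio:", ratio.mean())
```

By default, `mean()` ignores `NaN` entries, so the result here is the average over the three days that had confirmed cases.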

In [None]:
# Solution


**Exercise 3a:** Calculate and plot the PMF of the death ratio in China. The plot of your PMF should look similar to the image below.

![alt text](https://raw.githubusercontent.com/stat10/DS10-Python-HW/master/Images/plot10.png)

*Hints:*
1. After you have calculated the PMF using `Pmf.from_seq`, you can use the `plot()` method on the calculated PMF to generate a plot. 
2. If you calculate the PMF of the raw death ratio, your PMF plot will look essentially flat. This is because there are too many unique values of the death ratio. So, to get a better estimate, calculate the PMF of the death ratio rounded to 2 decimal places. For example, `round(ratio, 2)` will round the values in the variable `ratio` to 2 decimal places.
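To see why rounding matters, the sketch below compares the number of distinct values before and after rounding on randomly generated ratios. It uses pandas `value_counts(normalize=True)` as a stand-in for `Pmf.from_seq` so that it depends only on `numpy` and `pandas`; the data is synthetic and unrelated to the real dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical death ratios between 0 and 0.1 (synthetic data)
ratio = pd.Series(rng.uniform(0.0, 0.1, size=200))

# Normalized counts of each distinct value: almost every raw value is unique,
# so each one gets roughly the same tiny probability (a flat PMF)
raw_pmf = ratio.value_counts(normalize=True).sort_index()

# After rounding to 2 decimals, many values fall into the same bin
rounded_pmf = ratio.round(2).value_counts(normalize=True).sort_index()

print("Distinct raw values:    ", len(raw_pmf))
print("Distinct rounded values:", len(rounded_pmf))
```

With only a handful of distinct rounded values, the probabilities become large enough for peaks to be visible in a plot.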

You might now notice that there is a peak around a death ratio of 0.025, which represents the most likely value of the death ratio. However, as you have seen, the death ratio is in general a floating-point number, and rounding its values to obtain the PMF is not the best way to represent the data.

In [None]:
# Solution


**3b:** Now, calculate and plot the CDF of the death ratio in China. Also, print the interquartile range (IQR) of the distribution of death ratios. As you have seen in Lab 8, the IQR, which is the distance between the 25th and 75th percentiles, represents the spread of the distribution.
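For reference, the definition of the IQR can be sketched with `numpy` alone on a few made-up values. This is only an illustration of the idea: `np.percentile` interpolates between data points by default, so its result may differ slightly from what you get by inverting a `Cdf` as in Lab 8.

```python
import numpy as np

# Hypothetical death ratios (made-up values)
ratio = np.array([0.01, 0.02, 0.02, 0.03, 0.04, 0.05, 0.08, 0.10])

# Empirical CDF: for each sorted value, the fraction of data <= that value
xs = np.sort(ratio)
ps = np.arange(1, len(xs) + 1) / len(xs)

# IQR: distance between the 25th and 75th percentiles
q25, q75 = np.percentile(ratio, [25, 75])
iqr = q75 - q25

print("25th percentile:", q25)
print("75th percentile:", q75)
print("IQR:", iqr)
```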

In [None]:
# Solution (plot of CDF)


In [None]:
# Solution (IQR)


**3c:** Repeat the above analysis (calculate death ratio and its mean value, PMF, and CDF) for the following countries: Italy, South Korea, Iran, and USA.

In [None]:
# Solution (Italy)

# Solution (South Korea)

# Solution (Iran)

# Solution (USA)


**3d:** Display the CDFs of all the countries (China, Italy, South Korea, Iran, and USA) on a single plot and answer the following question: In which countries is COVID-19 the most and the least deadly? Remember to add a legend to the plot.
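As a sketch of how several CDFs can share one set of axes with a legend, the example below draws empirical CDFs for two synthetic samples. The labels "Country A" and "Country B" are placeholders and the numbers are random, not real data.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)

# Hypothetical samples; the labels are placeholders, not real countries
samples = {
    "Country A": rng.normal(0.03, 0.01, size=100),
    "Country B": rng.normal(0.06, 0.02, size=100),
}

fig, ax = plt.subplots()
for label, values in samples.items():
    # Empirical CDF: sorted values against cumulative fractions
    xs = np.sort(values)
    ps = np.arange(1, len(xs) + 1) / len(xs)
    # The label passed here is what the legend displays
    ax.step(xs, ps, where="post", label=label)

ax.set_xlabel("Death ratio")
ax.set_ylabel("CDF")
ax.legend()
```

When reading such a plot, a curve that lies further to the right corresponds to a distribution with larger values.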

In [None]:
# Solution


**Exercise 4a:** Compare (*i.e.*, display on the same plot) the PDF of the death ratios in China and in Iran to a normal PDF generated using the mean and standard deviation of the respective data. Your plot should look similar to the image below.

![alt text](https://raw.githubusercontent.com/stat10/DS10-Python-HW/master/Images/plot11.png)

*Hint:* Use `describe()` on the death ratio data to obtain the mean and standard deviation and use those values in `norm` to generate a normal PDF with the same mean and standard deviation as the data. Also, use the `kdeplot` function from `seaborn` to plot the PDF for the data. 
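The first half of the hint above can be sketched as follows on synthetic data. The sample, the grid of 201 points, and the range of four standard deviations are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd
from scipy.stats import norm

rng = np.random.default_rng(2)

# Hypothetical death ratios (synthetic data, not the real dataset)
ratio = pd.Series(rng.normal(0.05, 0.02, size=500))

# describe() returns a Series of summary statistics indexed by name
stats = ratio.describe()
mean, std = stats["mean"], stats["std"]

# Evaluate a normal PDF with the same mean and standard deviation
# on an evenly spaced grid centered on the mean
xs = np.linspace(mean - 4 * std, mean + 4 * std, 201)
ys = norm(mean, std).pdf(xs)

print("mean:", mean, "std:", std)
```

Plotting `xs` against `ys` with `plt.plot` and then calling `sns.kdeplot` on the data in the same cell places both curves on the same axes for comparison.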

As you might notice, the normal PDF does not fit the PDF of the data very well. Also, notice that the PDFs of the data from both China and Iran have smaller peaks at a much higher value of the death ratio, indicating that there might be a certain population for which COVID-19 is significantly deadlier.

In [None]:
# Solution
