# Problem Set 5

See [Introduction](https://datascience.quantecon.org/../pandas/intro.html) and [Basic Functionality](https://datascience.quantecon.org/../pandas/basics.html).

In [None]:
import pandas as pd
import numpy as np

%matplotlib inline

## Setup for Questions 1-5

These questions use data on daily Covid cases in health regions in Canada from the [COVID-19 Canada Open Data Working Group](https://github.com/ccodwg/Covid19Canada).

We will begin by loading this data and putting it into a format similar to the `unemp` data from the [Basic Functionality](https://datascience.quantecon.org/../pandas/basics.html) lecture.

In [None]:
url = "https://github.com/ccodwg/Covid19Canada/raw/master/timeseries_hr/cases_timeseries_hr.csv"
try:  # only download if cases_raw has not already been defined
    cases_raw
except NameError:
    cases_raw = pd.read_csv(url, parse_dates=["date_report"])

try:  # likewise, only download hr_map once
    hr_map
except NameError:
    hr_map = pd.read_csv("https://github.com/ccodwg/Covid19Canada/raw/master/other/hr_map.csv")

Now we compute cases per 100,000 population and then apply the same manipulation as in the pandas basics lecture. We will focus on BC health regions in this problem set.

In [None]:
cases_bc = cases_raw.loc[(cases_raw['province']=='BC') &  
                         (cases_raw['date_report']<=pd.to_datetime('2021-10-18')),:] # so results don't change as data gets updated
# create cases per 100,000
cases_bc = cases_bc.merge(hr_map[['province','health_region','pop']],
                          on=['province','health_region'],
                          how='left')
cases_bc['cases100k'] = cases_bc['cases']/cases_bc['pop']*100_000
cases_bc = ( 
    cases_bc.reset_index()
    .pivot_table(index='date_report',columns='health_region', values='cases100k')
)    
cases_bc

The resulting `cases_bc` DataFrame contains Covid cases per 100,000 population for each BC health region and day.
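To see what the merge and pivot steps do, here is a minimal sketch on made-up data. The frames `long` and `pops` and the region names `A` and `B` are hypothetical and stand in for `cases_raw` and `hr_map`; only the pattern carries over.

```python
import pandas as pd

# Hypothetical toy data: two regions observed on two dates (long format).
long = pd.DataFrame({
    "date_report": pd.to_datetime(
        ["2021-01-01", "2021-01-01", "2021-01-02", "2021-01-02"]
    ),
    "health_region": ["A", "B", "A", "B"],
    "cases": [5, 20, 10, 40],
})
# Hypothetical population lookup, one row per region.
pops = pd.DataFrame({"health_region": ["A", "B"], "pop": [100_000, 200_000]})

# A left merge keeps every row of `long` and attaches the matching population.
merged = long.merge(pops, on="health_region", how="left")
merged["cases100k"] = merged["cases"] / merged["pop"] * 100_000

# Pivot from long to wide: one row per date, one column per region.
wide = merged.pivot_table(
    index="date_report", columns="health_region", values="cases100k"
)
print(wide)
```

The wide layout is what makes the column-wise and row-wise reductions in the questions below convenient.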

## Question 1

At each date, what is the minimum number of cases per 100,000 across health regions?

In [None]:
# Your code here

What was the median number of cases per 100,000 in each health region?

In [None]:
# Your code here

What was the maximum number of cases per 100,000 across health regions? In what health region did it happen? On what date was this achieved?

- Hint 1: What Python type (not `dtype`) is returned by a reduction?  
- Hint 2: Read documentation for the method `idxmax`.  
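
As a rough illustration of both hints, the sketch below uses a small made-up DataFrame (the values and column names `A` and `B` are hypothetical) to show what a reduction returns and how `idxmax` reports the label of a maximum.

```python
import pandas as pd

# Hypothetical toy data, not the Covid data.
df = pd.DataFrame(
    {"A": [1.0, 7.0, 3.0], "B": [4.0, 2.0, 9.0]},
    index=pd.to_datetime(["2021-01-01", "2021-01-02", "2021-01-03"]),
)

col_max = df.max()            # reduces over rows: one value per column
print(type(col_max))          # a pandas Series, so Series methods are available

overall_max = col_max.max()   # reducing the Series gives a single number
where = col_max.idxmax()      # column label at which that maximum occurs
when = df[where].idxmax()     # index label (date) of the maximum in that column
print(overall_max, where, when)
```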

In [None]:
# Your code here

Classify each health region as high or low volatility based on whether the variance of its cases per 100,000 is above or below 100.

In [None]:
# Your code here


## Question 2

Imagine that we want to determine whether cases per 100,000 was high (> 10),
medium (1 < x <= 10), or low (<= 1) for each health region and each day.

Write a Python function that takes a single number as an input and
outputs a single string which notes whether that number is high, medium, or low.

In [None]:
# Your code here

Pass your function to either `apply` or `applymap` and save the result in a new DataFrame called `case_bins`.
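
The difference between the two methods can be sketched on toy data. The function `sign_label` and the DataFrame below are hypothetical and unrelated to the classification asked for here. One caveat: `DataFrame.applymap` was deprecated in favor of `DataFrame.map` in pandas 2.1, so the sketch picks whichever name the installed version provides.

```python
import pandas as pd

# Hypothetical toy data and function, for illustration only.
df = pd.DataFrame({"A": [1, -2], "B": [-3, 4]})

def sign_label(x):
    """Map a single number to a string."""
    return "positive" if x > 0 else "non-positive"

# Element-wise: the function receives one scalar at a time.
# Use DataFrame.map where available (pandas >= 2.1), else applymap.
elementwise = df.map if hasattr(df, "map") else df.applymap
labels = elementwise(sign_label)

# Column-wise: `apply` passes each column (a Series) to the function.
col_ranges = df.apply(lambda col: col.max() - col.min())

print(labels)
print(col_ranges)
```

Here `labels` has the same shape as `df`, while `col_ranges` has one entry per column.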

In [None]:
# Your code here

## Question 3

This exercise has multiple parts:

Use another transformation on `case_bins` to count how many times each health region had each of the three classifications.

- Hint 1: Will you need to use `apply` or `applymap` for this transformation?  
- Hint 2: Try googling “pandas count unique value” or something similar to find the proper transformation.  
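
One method such a search is likely to turn up is `Series.value_counts`. The sketch below shows how it behaves on a made-up DataFrame of string labels (the labels `x` and `y` are hypothetical); whether it is the right tool for your solution is for you to judge.

```python
import pandas as pd

# Hypothetical toy labels, one column per group.
labels = pd.DataFrame({
    "A": ["x", "y", "x"],
    "B": ["y", "y", "y"],
})

# value_counts on a single column counts how often each distinct value appears.
print(labels["A"].value_counts())

# Applied column by column, the per-column counts are aligned on the labels.
# A label missing from a column shows up as NaN, so fill those with 0.
counts = labels.apply(lambda col: col.value_counts()).fillna(0).astype(int)
print(counts)
```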

In [None]:
# Your code here

Construct a horizontal bar chart to detail the occurrences of each level.
Use one bar per health region and classification for 15 total bars.

In [None]:
# Your code here

## Question 4

Repeat Question 3, but count how many health regions had each classification on each day. Which day had the most health regions with high cases per 100,000? What about medium and low?

Part 1: Write a Python function to classify cases per 100,000 levels. (This might be the same function you wrote for Question 2.)

In [None]:
# Your code here

Part 2: Decide whether you should use `.apply` or `.applymap` and pass your function from part 1 to the appropriate method.

In [None]:
cases_bins = cases_bc  # replace this comment with your code!!

Part 3: Count the number of health regions that had each classification on each date, and plot the number of low, medium, and high regions on each date. Your plot(s) should have date on the horizontal axis and number of regions on the vertical axis. You can choose whether to make one plot with three lines (or bars, etc.) or to make three plots.

In [None]:
# Your code here

## Question 5

For a single health region of your choice, determine the mean
cases per 100,000 during “Low”, “Medium”, and “High” case times.
(Recall your `case_bins` DataFrame from the exercise above.)
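
One possible approach is to group the numeric values by their labels. The sketch below uses two hypothetical Series, `values` and `labels`, that share an index; it is an illustration of the pattern, not the intended solution.

```python
import pandas as pd

# Hypothetical toy data: numeric values and a matching label for each one.
values = pd.Series([1.0, 3.0, 20.0, 30.0], name="rate")
labels = pd.Series(["low", "low", "high", "high"], name="label")

# groupby accepts a Series of labels aligned on the same index,
# so each value is assigned to the group named by its label.
means = values.groupby(labels).mean()
print(means)
```

Boolean masks such as `values[labels == "low"].mean()` would give the same number one group at a time.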

In [None]:
# Your code here

Which health region in our sample performs best during “bad times”? To
determine this, compute each health region’s mean cases per 100,000 in
months where the mean cases per 100,000 is greater than 10.
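
The filtering step can be sketched with a boolean mask. The example below assumes one reading of the question, namely that a period is "bad" when the mean across all columns exceeds the threshold, and it uses made-up rows rather than months of the Covid data.

```python
import pandas as pd

# Hypothetical toy data: three periods, two regions.
df = pd.DataFrame(
    {"A": [1.0, 12.0, 20.0], "B": [3.0, 14.0, 10.0]},
    index=pd.to_datetime(["2021-01-01", "2021-01-02", "2021-01-03"]),
)

# Mean across columns for each row (axis=1).
row_mean = df.mean(axis=1)

# Boolean mask selecting the rows whose cross-column mean exceeds 10.
bad = row_mean > 10

# Keep only those rows, then average each column over them.
bad_means = df.loc[bad].mean()
print(bad_means)
```

The column with the smallest entry in `bad_means` is the one that did best over the selected rows.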

In [None]:
# Your code here