# Problem Set 5

See [Introduction](https://datascience.quantecon.org/../pandas/intro.html) and [Basic Functionality](https://datascience.quantecon.org/../pandas/basics.html).

In [3]:
import pandas as pd
import numpy as np

%matplotlib inline

## Setup for Questions 1-5

These questions use data on daily Covid cases in health regions in Canada from the [COVID-19 Canada Open Data Working Group](https://github.com/ccodwg/Covid19Canada).

In [4]:
url = "https://github.com/ccodwg/Covid19Canada/raw/master/timeseries_hr/cases_timeseries_hr.csv"
try:  # only download if cases_raw has not already been defined
    cases_raw
except NameError:  # raised when the name does not exist yet
    cases_raw = pd.read_csv(url, parse_dates=["date_report"])

try:  # likewise, only download the health region map once
    hr_map
except NameError:
    hr_map = pd.read_csv("https://github.com/ccodwg/Covid19Canada/raw/master/other/hr_map.csv")

Now we compute cases per 100,000 population and then reshape the data in the same way as in the pandas basics lecture. We will focus on BC health regions in this problem set.

In [6]:
cases_raw

,province,health_region,date_report,cases,cumulative_cases
0,Alberta,Calgary,2020-01-25,0,0
1,Alberta,Calgary,2020-01-26,0,0
2,Alberta,Calgary,2020-01-27,0,0
3,Alberta,Calgary,2020-01-28,0,0
4,Alberta,Calgary,2020-01-29,0,0
...,...,...,...,...,...
78025,Yukon,Yukon,2022-02-23,8,3412
78026,Yukon,Yukon,2022-02-24,14,3426
78027,Yukon,Yukon,2022-02-25,0,3426
78028,Yukon,Yukon,2022-02-26,0,3426


In [9]:
# Take the data for BC in year 2021 only
cases_bc = cases_raw.loc[(cases_raw['province'] == 'BC') &
                         (cases_raw['date_report'] < pd.to_datetime('2022-01-01')) &
                         (cases_raw['date_report'] >= pd.to_datetime('2021-01-01')), :]
# create cases per 100,000
cases_bc = cases_bc.merge(hr_map[['province','health_region','pop']],
                          on=['province','health_region'],
                          how='left')
cases_bc['cases100k'] = cases_bc['cases'] / cases_bc['pop'] * 100_000
cases_bc = ( 
    cases_bc.reset_index()
    .pivot_table(index='date_report',columns='health_region', values='cases100k')
)    
cases_bc

date_report,Fraser,Interior,Island,Northern,Vancouver Coastal
2021-01-01,0.000000,0.000000,0.000000,0.000000,0.000000
2021-01-02,25.701861,24.589675,9.101332,33.213897,24.191511
2021-01-03,45.464436,9.935222,10.519722,39.252788,26.815675
2021-01-04,21.129137,7.203036,6.500952,17.110190,21.813363
2021-01-05,0.000000,0.000000,0.000000,0.000000,0.000000
...,...,...,...,...,...
2021-12-27,42.153154,36.387752,24.585417,17.781177,28.865804
2021-12-28,41.995474,33.282995,33.686749,13.755250,32.966060
2021-12-29,83.623027,46.447164,36.523528,33.213897,46.824926
2021-12-30,121.886737,62.219330,54.371594,40.930258,80.119006


The resulting `cases_bc` DataFrame contains Covid cases per 100,000 population for each BC health region and day, in 2021.
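The merge, rescale, and pivot steps used above can be sketched on a tiny made-up dataset. The names `toy_cases`, `toy_pop`, and `toy_wide`, and all the numbers, are illustrative assumptions and are not part of the Covid data.

```python
import pandas as pd

# Toy long-format data: one row per (region, date). All values are made up.
toy_cases = pd.DataFrame({
    "health_region": ["A", "A", "B", "B"],
    "date_report": pd.to_datetime(["2021-01-01", "2021-01-02"] * 2),
    "cases": [5, 10, 2, 0],
})
toy_pop = pd.DataFrame({"health_region": ["A", "B"], "pop": [50_000, 200_000]})

# Attach the population to each row, then scale to cases per 100,000.
toy = toy_cases.merge(toy_pop, on="health_region", how="left")
toy["cases100k"] = toy["cases"] / toy["pop"] * 100_000

# Reshape: dates down the rows, regions across the columns.
toy_wide = toy.pivot_table(index="date_report", columns="health_region", values="cases100k")
print(toy_wide)
```

After the pivot, each column is one region's time series, which is the layout the questions below rely on.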

## Question 1

At each date, what is the minimum number of cases per 100,000 across health regions?

In [None]:
# Your code here

What was the (daily) median number of cases per 100,000 in each health region?

In [None]:
# Your code here

What was the maximum number of cases per 100,000 across health regions? In what health region did it happen? On what date was this achieved?

- Hint 1: What Python type (not `dtype`) is returned by a reduction?  
- Hint 2: Read documentation for the method `idxmax`.  
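To make the two hints concrete, here is a small sketch on made-up data (the DataFrame `toy` and its values are assumptions for illustration only). It shows what a reduction returns and how `idxmax` reports a label rather than a value.

```python
import pandas as pd

# Made-up data: two columns observed on three dates.
toy = pd.DataFrame(
    {"A": [1.0, 7.0, 3.0], "B": [4.0, 2.0, 9.0]},
    index=pd.to_datetime(["2021-01-01", "2021-01-02", "2021-01-03"]),
)

col_max = toy.max()        # reduce down the rows: one value per column
print(type(col_max))       # the result is a pandas Series, indexed by column name
row_min = toy.min(axis=1)  # reduce across the columns: one value per date

best_col = col_max.idxmax()         # label of the column holding the largest value
best_date = toy[best_col].idxmax()  # index label (date) at which that column peaks
print(best_col, best_date, toy.loc[best_date, best_col])
```

Because the reduction returns a Series, Series methods such as `idxmax` can be chained onto its result.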

In [None]:
# Your code here

Classify each health region as high or low volatility based on whether the variance of its cases per 100,000 is above or below 100.

In [None]:
# Your code here

## Question 2

Imagine that we want to determine whether cases per 100,000, denoted x, were High (x > 10),
Low (0 < x <= 10), or None (x = 0) for each health region and each day.

Write a Python function that takes a single number as an input and
outputs a single string which notes whether that number is High, Low, or None.

In [None]:
# Your code here

Pass your function to either `apply` or `applymap` and save the result in a new DataFrame called `case_bins`.
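The difference between the two methods can be seen on a toy example. This is a sketch under stated assumptions: `sign_label` is a hypothetical helper unrelated to the classification asked for above, and the data are made up. Note that pandas 2.1 deprecated `DataFrame.applymap` in favour of `DataFrame.map`, so the sketch picks whichever is available.

```python
import pandas as pd

toy = pd.DataFrame({"A": [1.5, -2.0], "B": [0.0, 3.0]})

def sign_label(x):
    """Hypothetical helper: label a single number by its sign."""
    if x > 0:
        return "positive"
    if x < 0:
        return "negative"
    return "zero"

# Element-wise: pandas >= 2.1 calls this DataFrame.map; older versions use applymap.
elementwise = toy.map if hasattr(pd.DataFrame, "map") else toy.applymap
labels = elementwise(sign_label)  # the function receives one scalar at a time

# Column-wise: with apply, the function receives a whole Series per column.
col_ranges = toy.apply(lambda col: col.max() - col.min())

print(labels)
print(col_ranges)
```

A function written for a single number fits the element-wise method, while a function that summarizes a whole column fits `apply`.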

In [None]:
# Your code here

## Question 3

This exercise has multiple parts:

Use another transformation on `case_bins` to count how many times each health region had each of the three classifications.

- Hint 1: Will you need to use `apply` or `applymap` for this transformation?  
- Hint 2: `value_counts`
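As a rough illustration of the second hint, the sketch below counts labels column by column in a made-up DataFrame of strings (`toy_labels` and its contents are assumptions for illustration).

```python
import pandas as pd

# Made-up labels: each column is one group, each row one observation.
toy_labels = pd.DataFrame({
    "A": ["x", "y", "x"],
    "B": ["y", "y", "y"],
})

# value_counts is a Series method, so it is applied to each column in turn.
counts = toy_labels.apply(lambda col: col.value_counts())

# A label that never occurs in a column comes back as NaN; treat it as a zero count.
counts = counts.fillna(0).astype(int)
print(counts)
```

The result has one row per distinct label and one column per original column.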

In [None]:
# Your code here

Construct a horizontal bar chart (see [this example](https://www150.statcan.gc.ca/edu/power-pouvoir/c-g/c-g05-2-3-eng.png)) showing the number of occurrences of each level.
Use one bar per health region and classification, for 15 bars in total.
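One possible way to draw a grouped horizontal bar chart with pandas is sketched below on made-up counts. The table `toy_counts`, its labels, and the axis titles are assumptions for illustration, and pandas plotting requires matplotlib to be installed.

```python
import pandas as pd

# Made-up counts: rows are labels, columns are groups.
toy_counts = pd.DataFrame(
    {"A": [2, 1], "B": [0, 3], "C": [1, 2]},
    index=["x", "y"],
)

# Transpose so each group sits on the vertical axis;
# each label then becomes one bar within the group.
ax = toy_counts.T.plot.barh()
ax.set_xlabel("Number of days")
ax.set_ylabel("Group")
```

With three groups and two labels this draws six bars; the same layout with five regions and three labels would give fifteen.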

In [None]:
# Your code here

Part 2: Decide whether you should use `.apply` or `.applymap`, and pass your function from Question 2 to the appropriate method.

In [None]:
case_bins = cases_bc  # replace this line with your code

Part 3: Plot the number of regions classified as None, Low, and High on each date. Your plot(s) should have the date on the horizontal axis and the number of regions on the vertical axis. You can choose whether to make one plot with three lines (or bars, etc.) or three separate plots.

In [None]:
# Your code here

## Question 4

For a single health region of your choice, determine the mean
cases per 100,000 during “High” and “Low” case times.
(Recall your `case_bins` DataFrame from the exercise above.)

In [None]:
# Your code here

Which health region in our sample performs best during “bad times”? To
determine this, compute each health region’s mean daily cases per 100,000 over the days where daily cases per 100,000 are greater than 10 (i.e., in the “High”
category as defined above).
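A conditional mean of this kind can be sketched with `DataFrame.where` on made-up numbers. The DataFrame `toy` and the variable `threshold` are assumptions for illustration, and this is only one of several possible approaches.

```python
import pandas as pd

toy = pd.DataFrame({"A": [0.0, 12.0, 20.0], "B": [15.0, 5.0, 11.0]})
threshold = 10  # assumed cutoff for this toy example

# where() keeps values that satisfy the condition and replaces the rest with NaN;
# mean() skips NaN by default, so only the qualifying rows enter each column's average.
high_means = toy.where(toy > threshold).mean()
print(high_means)
```

Under this approach, the column with the smallest conditional mean is the one that fares best when values are above the cutoff.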

In [None]:
# Your code here