# General requirements for the assignment

- Comment your code appropriately
- Use Markdown cells to provide your answers (when applicable)
- Follow the PEP 8 standard as much as possible in your code
- Submit through GitHub
- Tag the commit as *Final submission of graded assignment*
- Provide your GitHub URL to the notebook as the submission for the Brightspace assignment page
- Post errors on the course GitHub issue page for faster feedback
- ***DO NOT* forget to remove your review partner from your repository before you push the code to GitHub, to avoid plagiarism**

### DEADLINE FOR THIS ASSIGNMENT IS 29 OCTOBER 2021 BEFORE 23:59

<hr />

# Assignment


Over the past 7 weeks, you have been working with Google mobility data. Now, let's combine that data with COVID-19 data to see if we can derive some *interesting* insights. There are multiple sources of COVID-19 data, and the country you chose may have its own data source.
- One such data source is [OurWorldInData](https://github.com/owid/covid-19-data/tree/master/public/data), which contains daily COVID-19 data from 217 countries and the corresponding government response, measured as the **stringency index**.
- Another data source, which provides municipal, provincial, and nationwide COVID-19 data for the whole of the **Netherlands**, is [here](https://github.com/J535D165/CoronaWatchNL).

Feel free to use either of these data sources or something you found on your own!

## Part I - Data import

1. *[5 points]* Create a dataframe that combines mobility data and COVID-19 data for your chosen country. There are different types of COVID-19 data available, such as the number of positively tested cases, hospital admissions, fatality rates, the government stringency index, etc. Provide a brief explanation or data dictionary of your new dataframe. Keep in mind that the two datasets must be joined at the same geographic level, so pick municipal, provincial, or nationwide data accordingly.

In [1]:
import pandas as pd
import os
import numpy as np
from datetime import datetime
import plotly.express as px
import plotly.graph_objects as go

In [2]:
# Mobility data: combine the 2020 and 2021 reports for the Netherlands
df_2020 = pd.read_csv('/Users/mengxinran/Downloads/2020_NL_Region_Mobility_Report.csv')
df_2021 = pd.read_csv('/Users/mengxinran/Downloads/2021_NL_Region_Mobility_Report.csv')
df = pd.concat([df_2020, df_2021])
df = df.reset_index(drop=True)

# COVID-19 data: keep only the rows for the Netherlands
covid_data = pd.read_csv('/Users/mengxinran/Downloads/owid-covid-data.csv')
covid_NL = covid_data[covid_data['location'] == 'Netherlands']
covid_NL = covid_NL.reset_index(drop=True)

# Merge the two datasets on date, then aggregate to one row per date.
# Note: the sum runs over every region row in the mobility report, so the
# nationwide COVID-19 columns are counted once per region.
# numeric_only=True skips text columns, which newer pandas no longer drops silently.
result = pd.merge(df, covid_NL, on='date')
cob_data = result.groupby(['date']).sum(numeric_only=True).reset_index()
cob_data


Unnamed: 0,date,metro_area,census_fips_code,retail_and_recreation_percent_change_from_baseline,grocery_and_pharmacy_percent_change_from_baseline,parks_percent_change_from_baseline,transit_stations_percent_change_from_baseline,workplaces_percent_change_from_baseline,residential_percent_change_from_baseline,total_cases,...,female_smokers,male_smokers,handwashing_facilities,hospital_beds_per_thousand,life_expectancy,human_development_index,excess_mortality_cumulative_absolute,excess_mortality_cumulative,excess_mortality,excess_mortality_cumulative_per_million
0,2020/10/1,0.0,0.0,-2892.0,-993.0,128.0,-8520.0,-9185.0,2919.0,50549653.0,...,9540.4,10674.3,0.0,1298.12,32171.48,369.104,0.0,0.00,0.00,0.000000
1,2020/10/10,0.0,0.0,-4688.0,-2035.0,1396.0,-6116.0,-20.0,633.0,57385020.0,...,8052.0,9009.0,0.0,1095.60,27152.40,311.520,0.0,0.00,0.00,0.000000
2,2020/10/11,0.0,0.0,-3242.0,-1356.0,2555.0,-4140.0,-79.0,293.0,47055168.0,...,6368.4,7125.3,0.0,866.52,21475.08,246.384,1760888.7,1453.77,1153.62,102537.649873
3,2020/10/12,0.0,0.0,-659.0,-718.0,2581.0,-8842.0,-11731.0,2721.0,73170958.0,...,9540.4,10674.3,0.0,1298.12,32171.48,369.104,0.0,0.00,0.00,0.000000
4,2020/10/13,0.0,0.0,53.0,466.0,2547.0,-9241.0,-11813.0,2923.0,76085081.0,...,9540.4,10674.3,0.0,1298.12,32171.48,369.104,0.0,0.00,0.00,0.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
304,2020/9/5,0.0,0.0,-717.0,-425.0,640.0,-850.0,8.0,75.0,5306583.0,...,1683.6,1883.7,0.0,229.08,5677.32,65.136,0.0,0.00,0.00,0.000000
305,2020/9/6,0.0,0.0,-317.0,-226.0,572.0,-702.0,58.0,16.0,3503565.0,...,1098.0,1228.5,0.0,149.40,3702.60,42.480,282159.0,262.80,74.70,16430.294970
306,2020/9/7,0.0,0.0,-115.0,-179.0,383.0,-1717.0,-4820.0,1146.0,19045884.0,...,5904.8,6606.6,0.0,803.44,19911.76,228.448,0.0,0.00,0.00,0.000000
307,2020/9/8,0.0,0.0,-140.0,-125.0,354.0,-1794.0,-4867.0,1334.0,19628832.0,...,6002.4,6715.8,0.0,816.72,20240.88,232.224,0.0,0.00,0.00,0.000000
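
The totals in the table above are sums over every region row that shares a date, which is why `total_cases` runs into the tens of millions. One possible alternative, sketched here under assumptions, is to keep only the country-level mobility rows before merging. The toy frames and their values are made up for illustration, and the filter assumes that country-level rows in the mobility report are the ones with an empty `sub_region_1`.

```python
import pandas as pd

# Toy stand-ins for the two sources (values are invented for illustration).
mobility = pd.DataFrame({
    'date': ['2020-10-01', '2020-10-01', '2020-10-02', '2020-10-02'],
    'sub_region_1': [None, 'Utrecht', None, 'Utrecht'],
    'workplaces_percent_change_from_baseline': [-20.0, -25.0, -18.0, -22.0],
})
covid = pd.DataFrame({
    'date': ['2020-10-01', '2020-10-02'],
    'new_cases': [3000.0, 3200.0],
})

# Assumption: rows with an empty sub_region_1 describe the whole country.
national = mobility[mobility['sub_region_1'].isna()]

# One mobility row per date now meets one COVID-19 row per date, so no
# aggregation is needed and the COVID-19 values are not multiplied.
# validate='one_to_one' raises an error if either side has duplicate dates.
combined = pd.merge(national, covid, on='date', how='inner',
                    validate='one_to_one')
combined['date'] = pd.to_datetime(combined['date'])
combined = combined.sort_values('date').reset_index(drop=True)
print(combined[['date', 'workplaces_percent_change_from_baseline', 'new_cases']])
```

Parsing `date` with `pd.to_datetime` before sorting also avoids the string ordering visible above, where `2020/10/10` sorts before `2020/10/2`.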


## Part II - Data processing

As you already know, there are various peaks/valleys in the changes of mobility activity data. In this assignment, find peaks/valleys (if available) in the covid data.

After identifying peaks in the two datasets, you need to check whether there are common peaks. Most likely, the peaks do not fall on the same day, so you should allow a time offset to match peaks/valleys that lie close to each other. A visual representation of this problem is shown in the following image:

<p align="center">
  <img src="Images/offset.png" alt="drawing" width="500"/>
</p>
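
As a rough illustration of the offset idea, the sketch below pairs each peak date in one series with the closest peak date in another series, provided the two lie within a chosen number of days. The function name `match_peaks`, its parameters, and the example dates are hypothetical and only serve to show the logic.

```python
from datetime import date


def match_peaks(peaks_a, peaks_b, max_offset_days):
    """Pair each peak in peaks_a with the closest peak in peaks_b that
    lies within max_offset_days; peaks without a partner are skipped."""
    pairs = []
    for a in peaks_a:
        # Peaks of the other series that fall inside the offset window.
        candidates = [b for b in peaks_b
                      if abs((b - a).days) <= max_offset_days]
        if candidates:
            closest = min(candidates, key=lambda b: abs((b - a).days))
            pairs.append((a, closest))
    return pairs


# Invented peak dates for two series.
peaks_a = [date(2020, 4, 7), date(2020, 6, 1), date(2020, 9, 14)]
peaks_b = [date(2020, 4, 9), date(2020, 7, 20), date(2020, 9, 12)]

common = match_peaks(peaks_a, peaks_b, max_offset_days=3)
for a, b in common:
    print(a, b, (b - a).days)
```

Calling the function with several values of `max_offset_days` is one way to explore "a range of time offsets" without changing the matching logic.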


Below are the challenges that need to be solved for this part:

2. *[8 points]* Provide the pseudocode or logic behind the offset algorithms that you will develop for the following questions (3 and 4). Use bullet points, a flow chart, pseudocode, or other means to explain the logic.


3. *[10 points]* Find all the common peaks/valleys of mobility activity patterns of a municipality/province/nation within a range of time offsets. **e.g. find common peaks between 1 activity of two municipalities OR find common peaks between 2 activities of the same municipality**


4. *[2 points]* Find all the common peaks/valleys of the selected COVID-19 data of a municipality/province/nation within a range of time offsets. **e.g. find common peaks between 1 type of COVID-19 data (e.g. vaccinations) of two municipalities OR find common peaks between 2 types of COVID-19 data (e.g. vaccinations and deaths) of the same municipality**


5. *[8 points]* Determine the relationship between common peaks/valleys in activities (municipal/provincial/nationwide) and in COVID-19 data (municipal/provincial/nationwide), allowing for a time offset, either through observation or using programmable logic. If you only use visual observation, you won't get maximum points for this question. **e.g. compare peaks of 1 activity and 1 type of COVID-19 data of the same municipality OR compare common peaks of all activities and common peaks of all types of COVID-19 data of the same municipality**

**Motivate your selection for the data choice for finding the common peaks**

In [3]:
# For PART 2.

In [9]:
# Find common peaks between 2 activities within the same region (here: the Netherlands)
# First define a function that finds peaks
def my_find_peaks(data, activity):
    peaks = []
    # Select the corresponding column
    act_data = data[activity].reset_index(drop=True)
    # Index of the last value; it has no right-hand neighbour
    last = act_data.shape[0] - 1

    # A peak is a value larger than both its left and right neighbours.
    # Start at 1: index 0 has no left neighbour, and iloc[-1] would wrap
    # around to the last value.
    for i in range(1, last):
        if act_data.iloc[i] > act_data.iloc[i - 1] and act_data.iloc[i] > act_data.iloc[i + 1]:
            peaks.append(i)

    return peaks

# Derive 5-day averages; the bins make the comparison tolerant to small offsets.
# Assign the column directly so it really gets a datetime dtype.
cob_data['date'] = pd.to_datetime(cob_data['date'])
cob_data1 = cob_data.resample('5D', on='date').mean()
cob_data1.reset_index(inplace=True)


# Choose two activities and find their peak positions
activity1 = 'retail_and_recreation_percent_change_from_baseline'
activity2 = 'grocery_and_pharmacy_percent_change_from_baseline'
max1 = my_find_peaks(cob_data1, activity1)
max2 = my_find_peaks(cob_data1, activity2)

# Loop over both peak lists and keep pairs at most one 5-day bin apart
com_peak1 = []
com_peak2 = []
for i in max1:
    for j in max2:
        if abs(i - j) <= 1:
            com_peak1.append(i)
            com_peak2.append(j)

# Remove duplicates while keeping the original order
final_com_peak1 = list(dict.fromkeys(com_peak1))
final_com_peak2 = list(dict.fromkeys(com_peak2))
# Show the result
print('Common peaks for retail_and_recreation_percent_change_from_baseline:')
for i in final_com_peak1:
    print(cob_data1.date[i])
print('Common peaks for grocery_and_pharmacy_percent_change_from_baseline:')
for j in final_com_peak2:
    print(cob_data1.date[j])

Common peaks for retail_and_recreation_percent_change_from_baseline:
2020-04-07 00:00:00
2020-05-02 00:00:00
2020-06-01 00:00:00
2020-06-21 00:00:00
2020-07-11 00:00:00
2020-07-26 00:00:00
2020-08-10 00:00:00
2020-08-30 00:00:00
2020-09-14 00:00:00
2020-10-09 00:00:00
2020-10-19 00:00:00
2020-11-08 00:00:00
2020-11-23 00:00:00
Common peaks for grocery_and_pharmacy_percent_change_from_baseline:
2020-04-07 00:00:00
2020-05-07 00:00:00
2020-05-27 00:00:00
2020-06-21 00:00:00
2020-07-06 00:00:00
2020-07-26 00:00:00
2020-08-10 00:00:00
2020-08-30 00:00:00
2020-09-14 00:00:00
2020-10-09 00:00:00
2020-10-19 00:00:00
2020-11-08 00:00:00
2020-11-23 00:00:00
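
The cell above only reports peaks, while the assignment also asks about valleys. A minimal sketch of one way to cover both, assuming plain Python lists of values: a valley of a series is a peak of the negated series, so the same neighbour comparison can be reused. The helper names and the sample values are hypothetical.

```python
def find_local_maxima(values):
    """Indices whose value is strictly greater than both neighbours."""
    return [i for i in range(1, len(values) - 1)
            if values[i - 1] < values[i] > values[i + 1]]


def find_local_minima(values):
    """Valleys are the peaks of the negated series."""
    return find_local_maxima([-v for v in values])


# Invented percent-change values for illustration.
series = [0, -5, -2, -8, -3, -1, -6]
peaks = find_local_maxima(series)
valleys = find_local_minima(series)
print('peaks at', peaks)
print('valleys at', valleys)
```

Like the neighbour comparison used in the notebook, this ignores the first and last value and does not treat flat plateaus as peaks.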


In [10]:
# Find common peaks between 2 types of covid data of the same municipality
# choose two types of data regarding covid and find their corresponding peak values
covid1 = 'icu_patients'
covid2 = 'hosp_patients'
max_covid1 = my_find_peaks(cob_data1,covid1)
max_covid2 = my_find_peaks(cob_data1,covid2)

# Loop over both peak lists and keep pairs at most one 5-day bin apart
com_peak1 = []
com_peak2 = []
for i in max_covid1:
    for j in max_covid2:
        if abs(i - j) <= 1:
            com_peak1.append(i)
            com_peak2.append(j)

# Remove duplicates while keeping the original order
final_com_peak1 = list(dict.fromkeys(com_peak1))
final_com_peak2 = list(dict.fromkeys(com_peak2))

# Show the result
print('Common peaks for icu_patients:')
for i in final_com_peak1:
    print(cob_data1.date[i])
print('Common peaks for hosp_patients:')
for j in final_com_peak2:
    print(cob_data1.date[j])

Common peaks for icu_patients:
2020-04-07 00:00:00
2020-07-06 00:00:00
2020-08-10 00:00:00
2020-08-25 00:00:00
2020-11-03 00:00:00
2020-11-23 00:00:00
Common peaks for hosp_patients:
2020-04-07 00:00:00
2020-07-11 00:00:00
2020-08-10 00:00:00
2020-08-25 00:00:00
2020-11-03 00:00:00
2020-11-23 00:00:00
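
The peak positions compared above are row indices of the 5-day resampled frame, so a tolerance of one index refers to neighbouring 5-day bins, not to one calendar day. The sketch below uses made-up daily values (not real `icu_patients` figures) to show how `resample('5D')` groups and labels the rows.

```python
import pandas as pd

# Fifteen invented daily values, one per day.
daily = pd.DataFrame({
    'date': pd.date_range('2020-03-01', periods=15, freq='D'),
    'icu_patients': range(15),
})

# Each bin covers five consecutive days and is labelled by its first day.
binned = daily.resample('5D', on='date').mean().reset_index()
print(binned)

# Distance in days between the labels of neighbouring bins.
gap_days = (binned['date'].iloc[1] - binned['date'].iloc[0]).days
print('days between neighbouring bins:', gap_days)
```

Two peaks in neighbouring bins can therefore be anywhere from 1 to 9 days apart in the daily data, which is worth keeping in mind when interpreting the matches.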


In [12]:
# Find common peaks between one mobility data and one covid data of the same municipality
activity1 = 'retail_and_recreation_percent_change_from_baseline'
covid1 = 'hosp_patients'
max1 = my_find_peaks(cob_data1,activity1)
max2 = my_find_peaks(cob_data1,covid1)

# Loop over both peak lists and keep pairs at most one 5-day bin apart
com_peak1 = []
com_peak2 = []
for i in max1:
    for j in max2:
        if abs(i - j) <= 1:
            com_peak1.append(i)
            com_peak2.append(j)

# Remove duplicates while keeping the original order
final_com_peak1 = list(dict.fromkeys(com_peak1))
final_com_peak2 = list(dict.fromkeys(com_peak2))

# Show the result
print('Common peaks for retail_and_recreation_percent_change_from_baseline:')
for i in final_com_peak1:
    print(cob_data1.date[i])
print('Common peaks for hosp_patients:')
for j in final_com_peak2:
    print(cob_data1.date[j])

Common peaks for retail_and_recreation_percent_change_from_baseline:
2020-04-07 00:00:00
2020-05-07 00:00:00
2020-08-10 00:00:00
2020-10-19 00:00:00
2020-11-23 00:00:00
2020-12-18 00:00:00
Common peaks for hosp_patients:
2020-04-07 00:00:00
2020-05-07 00:00:00
2020-08-10 00:00:00
2020-10-19 00:00:00
2020-11-18 00:00:00
2020-12-18 00:00:00
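
Question 5 rewards programmable logic for the relationship between the two kinds of data. One option, which is a different technique from the peak matching used in the cells above, is a lagged correlation: shift one series by a number of steps and check at which shift the two series agree best. The sketch below assumes two aligned pandas Series; the values are invented, and the hypothetical COVID-19 series is simply the mobility series delayed by two steps so that the expected answer is known.

```python
import pandas as pd

# Invented mobility values at equally spaced time steps.
mobility = pd.Series([0.0, 1.0, 3.0, 2.0, 5.0, 4.0, 6.0, 8.0, 7.0, 9.0])
# Hypothetical COVID-19 series repeating the same pattern two steps later.
covid = mobility.shift(2)


def lagged_correlation(leader, follower, max_lag):
    """Pearson correlation between leader[t] and follower[t + lag]
    for every lag from 0 to max_lag (missing pairs are ignored)."""
    return {lag: leader.corr(follower.shift(-lag))
            for lag in range(max_lag + 1)}


scores = lagged_correlation(mobility, covid, max_lag=4)
best_lag = max(scores, key=lambda lag: scores[lag])
print(scores)
print('best lag:', best_lag)
```

With 5-day bins, a lag of two steps would correspond to roughly ten days. A high correlation at some lag indicates that the series move together with that delay; it does not by itself show that one causes the other.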


## Part III - Data visualisation

6. *[12 points]* Use visualization to tell the mobility and COVID-19 data story of a specific municipality, province, or the whole nation. This is a more exploratory question. Explain the logic behind your story and your visualization choices.

In [17]:
# This part builds on the results of the previous part.
# I chose the retail_and_recreation_percent_change_from_baseline mobility column
# and the hospital patients COVID-19 column for the 2D scatter plot, and
# new_cases against grocery_and_pharmacy_percent_change_from_baseline over time
# for the 3D scatter plot.
fig = px.scatter(cob_data, x='hosp_patients', y='retail_and_recreation_percent_change_from_baseline', log_x=True)
fig.show()

fig = px.scatter_3d(cob_data, x='new_cases', y='grocery_and_pharmacy_percent_change_from_baseline', z='date', log_x=True)
fig.update_traces(marker_size=4)
fig.show()

# The 2D scatter plot suggests a negative relationship between the two variables:
# as the number of hospital patients increases, retail and recreation activity decreases,
# which points to a negative effect of COVID-19 on economic activity in the Netherlands.
# The 3D plot also shows that the number of new infections fluctuates over time.


# Rubrics

## Overall grading

- 10% of the final grade is for code review for the final assignment. Information about partners will be released after the assignment submission deadline.

- 90% of the final grade is divided among the following categories, which vary for different questions (see image below):
    - narrative
    - coding/logic - correctness
    - readability - [PEP 8 standard](https://www.python.org/dev/peps/pep-0008/)

<p align="center">
  <img src="Images/rubric.png" alt="drawing" width="500"/>
</p>

## Rubrics for each question in the assignment

Criteria:
1. Consistent dataset throughout the assignment, combined dataframe
2. Correctness, generalisability, clarity, simplicity
3. Working code, visualization of the result, generalisability
4. Generalisability
5. Logic, visualization
6. Logic, story, visuals, clarity, correctness, readability

You can obtain maximum points for the question if you have:
- Excellent stories
- Interactive visualisation 
- Programmable logic in *question 5*

You can obtain bonus points if you use:
- Extra datasets (e.g. population)