Activity 2.6 -- Working with COVID-19 and World Bank Data
=========================================================

Our ultimate goal is to explore relationships between various World
Bank indicators for countries and their corresponding COVID death rates. In this activity, you will do some preprocessing of the data in preparation for joining the two data sets.

## Part 1 -- Downloading the data

First you need to download data on COVID-19 (see links and instructions
below) and the selected indicators from the Open World Bank data
available at <https://data.worldbank.org>.

[**COVID data set source**](https://github.com/CSSEGISandData/COVID-19/tree/master/csse_covid_19_data/csse_covid_19_time_series) 

**Tasks.** Use pandas and dfply to perform each of the following.

1.  Download the raw **time\_series\_covid19\_confirmed\_global.csv**
    dataset.

2.  Inspect the data and discuss the need to reshape. 

In [1]:
# Code for loading and inspecting the CSV file

> **Your discussion:**

3.  Write a single pipe that reshapes the data, sets the dtype of the date column, and extracts various date parts.
    1. To change the `dtype` of the date column, use `date = X.date.astype('datetime64[ns]')`.
    2. To extract the year and month, use the `X.date.dt.year` and `X.date.dt.month` attributes. This needs to happen in a separate `mutate`.
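As a point of comparison, here is a minimal pandas-only sketch of the same reshape on a toy table. The column layout mimics the wide JHU file, but the values are made up, and the names `long`, `confirmed`, `year`, and `month` are illustrative choices rather than required ones. The dfply version would presumably use `gather` and `mutate` where this sketch uses `melt` and `assign`.

```python
import pandas as pd

# Toy stand-in for the wide layout: one column per date (values are made up).
wide = pd.DataFrame({
    "Province/State": [None, None],
    "Country/Region": ["Bahamas", "Brunei"],
    "Lat": [0.0, 1.0],
    "Long": [0.0, 1.0],
    "1/22/20": [0, 1],
    "1/23/20": [2, 3],
})

id_cols = ["Province/State", "Country/Region", "Lat", "Long"]

long = (
    wide
    # Stack the date columns into (date, confirmed) pairs: one row per country per day.
    .melt(id_vars=id_cols, var_name="date", value_name="confirmed")
    # Parse the date strings first ...
    .assign(date=lambda d: pd.to_datetime(d["date"], format="%m/%d/%y"))
    # ... then extract the date parts in a separate step.
    .assign(year=lambda d: d["date"].dt.year,
            month=lambda d: d["date"].dt.month)
)
```

The date parts are computed in a second `assign` so that they read the already-converted column, which mirrors the two-`mutate` structure suggested above.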

In [2]:
# your code here

### World Bank Development Indicators

<https://databank.worldbank.org/source/world-development-indicators>

#### Constructing a data set.

First, construct a data set as follows.

1.  Expand the Country tab and select all.

<img src="./img/media/image1.png" width="300">

2.  Click on the Series tab, search for *Health* and select the
    following indicators. **Feel free to add additional indicators!**

<img src="img/media/image2.png" width="300">

3.  Click on the Time tab and select 2018.

4.  Click apply changes in the floating dialog.

<img src="img/media/image3.png" width="300">

5.  Select CSV from the Download Options button and save the file to your data folder.

<img src="img/media/image4.png" width="100">

#### Tasks

Use pandas and dfply to perform each of the following.

1.  Inspect the World Bank data and discuss the need to reshape. 

**Hints:** 

* You should apply `fix_names` from `more_dfply` to clean up the column names.
* This table needs to be reshaped twice.
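To illustrate what "reshaped twice" can mean, the following pandas-only sketch stacks the year column and then spreads the series into columns. Everything here is an assumption for illustration: the cleaned column names (`country_name`, `series_name`) are not necessarily what `fix_names` produces, the year header is assumed to look like `2018 [YR2018]`, and the indicator names and values are made up.

```python
import pandas as pd

# Toy stand-in for the download: one row per (country, series), one column per year.
wb = pd.DataFrame({
    "country_name": ["Bahamas, The", "Bahamas, The",
                     "Brunei Darussalam", "Brunei Darussalam"],
    "series_name": ["Indicator A", "Indicator B", "Indicator A", "Indicator B"],
    "2018 [YR2018]": [1.0, 2.0, 3.0, 4.0],
})

tidy = (
    wb
    # Reshape 1: stack the year columns into (year, value) pairs.
    .melt(id_vars=["country_name", "series_name"],
          var_name="year", value_name="value")
    # Keep only the leading four digits of the header and store them as integers.
    .assign(year=lambda d: d["year"].str.slice(0, 4).astype(int))
    # Reshape 2: spread each series into its own column.
    .pivot_table(index=["country_name", "year"], columns="series_name",
                 values="value", aggfunc="first")
    .reset_index()
)
tidy.columns.name = None  # drop the leftover "series_name" label on the columns
```

The result has one row per country and year, with one column per indicator, which is the shape needed for joining on country later.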




In [3]:
# Code for loading and inspecting the CSV

> **Your discussion:**

2.  Write a single pipe that reshapes the data and cleans up the year column.  Be sure to make `year` the correct `dtype`.

In [4]:
# your code here

## Part 2 -- Investigate joining on country

Before we can proceed, we need to make sure that the columns used to join the data--namely the country--actually match.  Do this as follows.

1. For each table, select just the country columns and make sure the column names match.
2. Add a `file` column whose entries identify the data source, e.g., `"covid"` or `"World Bank"`.
3. Perform a full outer join and filter on rows that didn't match (i.e., rows with a missing value in exactly one of the two `file` columns).
4. Sort the rows by country name and write out the result to a `csv` file.
5. Open and inspect the file and identify any mismatches in country name, e.g., `"Bahamas"` in the COVID data and `"Bahamas, The"` from the World Bank.
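The steps above can be sketched in plain pandas on a toy example. The data frames, the suffixes `_covid` and `_wb`, and the variable names are hypothetical; only the two mismatches mentioned in this activity are used. A dfply pipe would follow the same sequence of steps.

```python
import pandas as pd

# Steps 1-2: just the country column (same name in both tables) plus a source tag.
covid_countries = pd.DataFrame(
    {"country": ["Bahamas", "Brunei", "Canada"]}
).assign(file="covid")
wb_countries = pd.DataFrame(
    {"country": ["Bahamas, The", "Brunei Darussalam", "Canada"]}
).assign(file="World Bank")

# Step 3: full outer join; a country found in only one source has a missing
# value in the other source's `file` column.
both = covid_countries.merge(
    wb_countries, on="country", how="outer", suffixes=("_covid", "_wb")
)
unmatched = (
    both[both["file_covid"].isna() ^ both["file_wb"].isna()]
    # Step 4: sort by country so near-identical names end up next to each other.
    .sort_values("country")
    .reset_index(drop=True)
)

# With no path, to_csv returns the CSV text; pass a file name to write to disk.
csv_text = unmatched.to_csv(index=False)
```

Sorting places `"Bahamas"` directly above `"Bahamas, The"`, which makes the mismatches easy to spot when inspecting the file by eye.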

In [6]:
# Your code here

## Part 3 -- Creating a translation dictionary

We will need to `recode` one of the data sets to match the other for each mismatch.  I have started this dictionary by making the World Bank entry the key and the COVID entry the value in a dictionary.

**Task.** Complete this dictionary by adding additional key-value pairs, one for each country mismatch.  You will have to make some decisions about how to handle odd cases.  Record these in comments (for now).

In [None]:
recode_world_bank = {"Bahamas, The":"Bahamas",
                     "Brunei Darussalam":"Brunei",
                     "next_world_bank":"next_covid_entry"}
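For reference, one way such a dictionary could be applied is with the pandas `Series.replace` method, shown here on a toy column. The names `recode_example`, `wb`, and `wb_recoded` are hypothetical, and this is only a sketch of the idea, not necessarily the `recode` approach intended for the activity.

```python
import pandas as pd

# Two of the World Bank -> COVID name pairs from the dictionary above.
recode_example = {"Bahamas, The": "Bahamas",
                  "Brunei Darussalam": "Brunei"}

wb = pd.DataFrame({"country": ["Bahamas, The", "Brunei Darussalam", "Canada"]})

# Values that appear as keys are swapped for the COVID spelling;
# everything else (e.g. "Canada") passes through unchanged.
wb_recoded = wb.assign(country=wb["country"].replace(recode_example))
```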

## Part 4 -- Join and visualize 

Finally, use pandas and dfply to join these two data sets together, then create an interesting visualization using seaborn.
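A rough sketch of the final join, assuming both tables already share a cleaned `country` column. The column names `deaths_per_100k` and `health_indicator` and all values are made up for illustration, and `plot_relationship` is a hypothetical helper that is defined but not called here.

```python
import pandas as pd

# Toy tables after recoding, with made-up values.
covid = pd.DataFrame({"country": ["Bahamas", "Brunei"],
                      "deaths_per_100k": [10.0, 20.0]})
wb = pd.DataFrame({"country": ["Bahamas", "Brunei"],
                   "health_indicator": [1.5, 2.5]})

# Inner join keeps only countries present in both sources.
joined = covid.merge(wb, on="country", how="inner")


def plot_relationship(df):
    """Scatter one indicator against the death rate (hypothetical helper)."""
    import seaborn as sns  # imported here so the join runs without seaborn installed
    return sns.scatterplot(data=df, x="health_indicator", y="deaths_per_100k")
```

Calling `plot_relationship(joined)` in a notebook cell would draw the scatter plot. An inner join is only one option; an outer join would keep unmatched countries with missing values instead.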

In [6]:
# Your code here

### Deliverables
To complete this part of the activity, you need to submit the following.

1.  A link to this notebook including all discussion and code requests
    above.

2.  A csv file containing your final dataset. **Hint.** You can use the
    `to_csv` method on the final data frame.

In [7]:
# Code for writing the data here