### 1. Import and load

In [21]:
import pandas as pd

covid_totals = pd.read_csv('data/covidtotalswithmissings.csv')

covid_totals.set_index('iso_code', inplace=True)

tot_vars = ['location', 'total_cases', 'total_deaths', 'total_cases_pm', 'total_deaths_pm']

demo_vars = ['population', 'pop_density', 'median_age', 'gdp_per_capita', 'hosp_beds']

### 2. Check the demographic columns for missing data.

Set the axis to `0` (the default) to count the countries that are missing values for each of the demographic variables (missing values down columns). Notice that 46 of the 210 countries, more than 20 percent, are missing `hosp_beds`.

Set the axis to `1` to count the demographic variables that are missing for each country (missing values across rows).

Next, get `value_counts` on the resulting `demo_vars_miss_cnt` series to see whether some countries have missing values for much of the demographic data. 

Notice that 10 countries are missing values for 3 out of the 5 demographic variables, while 8 countries are missing values for 4 out of 5 demographic variables:

In [22]:
covid_totals[demo_vars].isnull().sum(axis=0)

population         0
pop_density       12
median_age        24
gdp_per_capita    28
hosp_beds         46
dtype: int64
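
The "more than 20 percent" figure can also be read off as a proportion rather than a count. The following is a minimal sketch on a small made-up frame (`toy` and its values are hypothetical, not the COVID file): taking the mean of the Boolean mask returned by `isnull` gives the share of missing values per column.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for covid_totals[demo_vars]
toy = pd.DataFrame(
    {"population": [100.0, 200.0, 300.0, 400.0, 500.0],
     "hosp_beds": [2.5, np.nan, 1.1, 3.0, 4.2]},
    index=["AAA", "BBB", "CCC", "DDD", "EEE"],
)

# isnull() returns True/False; the mean of a Boolean column is the
# fraction of True values, i.e. the share of rows that are missing
miss_share = toy.isnull().mean()
print(miss_share)
```

Applied to the real data, the same idea would be `covid_totals[demo_vars].isnull().mean()`.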

In [23]:
demo_vars_miss_cnt = covid_totals[demo_vars].isnull().sum(axis=1)
demo_vars_miss_cnt.value_counts()  # index (left) = number of missing variables; values (right) = number of countries

0    156
1     24
2     12
3     10
4      8
Name: count, dtype: int64
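
To make the difference between the two axes concrete, here is a small sketch on invented data (the frame `toy` and its labels are assumptions made for illustration only). Summing the Boolean mask with `axis=0` counts missing values per column, while `axis=1` counts them per row; `value_counts` then summarizes how many rows share each count.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 4 "countries", 3 demographic variables
toy = pd.DataFrame(
    {"pop_density": [10.0, np.nan, 30.0, np.nan],
     "median_age": [40.0, np.nan, np.nan, 25.0],
     "hosp_beds": [np.nan, np.nan, 2.0, 3.0]},
    index=["AAA", "BBB", "CCC", "DDD"],
)

# axis=0: collapse the rows, one count per column
by_column = toy.isnull().sum(axis=0)

# axis=1: collapse the columns, one count per row (country)
by_row = toy.isnull().sum(axis=1)

# How many rows have 0, 1, 2, ... missing variables
row_dist = by_row.value_counts().sort_index()

print(by_column)
print(by_row)
print(row_dist)
```

Here `sort_index` is added only to order the summary by the number of missing variables; `value_counts` on its own sorts by frequency.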

In [24]:
type(demo_vars_miss_cnt)

pandas.core.series.Series

### 3. List the countries with three or more missing values for the demographic data.

Index alignment and Boolean indexing allow us to use the count of missing values (`demo_vars_miss_cnt`) to select rows. 

Prepend `location` to the `demo_vars` list to see the country name, and transpose the result so that each country is a column. (Only the first five of these countries are shown here.)

In [25]:
covid_totals.loc[demo_vars_miss_cnt >= 3, ['location'] + demo_vars].head(5).T

iso_code,AND,AIA,BES,VGB,FRO
location,Andorra,Anguilla,Bonaire Sint Eustatius and Saba,British Virgin Islands,Faeroe Islands
population,77265.0,15002.0,26221.0,30237.0,48865.0
pop_density,163.755,,,207.973,35.308
median_age,,,,,
gdp_per_capita,,,,,
hosp_beds,,,,,
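
The role of index alignment in this selection can be illustrated with a short sketch on made-up data (the frame `toy`, the series `miss_cnt`, and all labels are hypothetical). The Boolean Series is matched to the DataFrame by index label, not by position, so it still selects the intended rows even when its own rows are in a different order.

```python
import pandas as pd

# Hypothetical toy frame indexed by a made-up country code
toy = pd.DataFrame(
    {"location": ["Aland", "Bland", "Cland"],
     "population": [100.0, 200.0, 300.0]},
    index=["AAA", "BBB", "CCC"],
)

# Hypothetical missing-value counts, deliberately in a different row order
miss_cnt = pd.Series([0, 4, 3], index=["CCC", "AAA", "BBB"])

# The Boolean mask is aligned to toy's index by label before selecting,
# so AAA (4) and BBB (3) are kept and CCC (0) is dropped
selected = toy.loc[miss_cnt >= 3, ["location"]]
print(selected)
```

In the recipe above the mask is computed from `covid_totals` itself, so its index already matches; the shuffled order here is only meant to show that labels, not positions, drive the selection.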


### 4. Check the COVID case data for missing values.

Notice that only one country has missing values for any of these variables, and that it is missing both per-million columns:

In [26]:
covid_totals[tot_vars].isnull().sum(axis=0)

location           0
total_cases        0
total_deaths       0
total_cases_pm     1
total_deaths_pm    1
dtype: int64

In [27]:
tot_vars_miss_cnt = covid_totals[tot_vars].isnull().sum(axis=1)
tot_vars_miss_cnt.value_counts()

0    209
2      1
Name: count, dtype: int64

In [28]:
covid_totals.loc[tot_vars_miss_cnt > 0].T

iso_code,HKG
lastdate,2020-05-26
location,Hong Kong
total_cases,0
total_deaths,0
total_cases_pm,
total_deaths_pm,
population,7496988.0
pop_density,7039.714
median_age,44.8
gdp_per_capita,56054.92


### 5. Use the `fillna` method to fill in the missing per-million values for the one country affected (Hong Kong).

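We could simply set the missing values to `0`, since the numerator (`total_cases` or `total_deaths`) is `0` in both cases. However, computing the per-million values from the underlying columns makes the code reusable on data where the numerator is not `0`: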

In [30]:
covid_totals['total_cases_pm'] = covid_totals['total_cases_pm'].fillna(
    covid_totals['total_cases'] / (covid_totals['population'] / 1000000))

covid_totals['total_deaths_pm'] = covid_totals['total_deaths_pm'].fillna(
    covid_totals['total_deaths'] / (covid_totals['population'] / 1000000))

covid_totals[tot_vars].isnull().sum(axis=0)

location           0
total_cases        0
total_deaths       0
total_cases_pm     0
total_deaths_pm    0
dtype: int64
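
One property of this approach worth noting: when `fillna` is given a Series, it only fills positions that are missing and leaves existing values alone. The sketch below uses invented numbers (the frame `toy` is hypothetical, and the `999.0` entry is deliberately inconsistent with the computed value so the effect is visible).

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; 999.0 intentionally disagrees with cases/population
toy = pd.DataFrame(
    {"total_cases": [50, 0, 120],
     "population": [1_000_000.0, 2_000_000.0, 4_000_000.0],
     "total_cases_pm": [50.0, np.nan, 999.0]},
    index=["AAA", "BBB", "CCC"],
)

# Cases per million, computed for every row
computed = toy["total_cases"] / (toy["population"] / 1_000_000)

# Only the missing entry (BBB) takes its value from `computed`;
# AAA and CCC keep what they already had
toy["total_cases_pm"] = toy["total_cases_pm"].fillna(computed)
print(toy)
```

So the step above fills Hong Kong's two per-million values without overwriting the values already present for the other countries.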