# Problem Set 6

See [Merge](https://datascience.quantecon.org/../pandas/merge.html), [Reshape](https://datascience.quantecon.org/../pandas/reshape.html), and [GroupBy](https://datascience.quantecon.org/../pandas/groupby.html)

In [3]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

%matplotlib inline

## Questions 1-8

Let's start with a relatively straightforward exercise before we get to the really fun stuff.

The following code loads a cleaned piece of census data from Statistics Canada.

In [7]:
df = pd.read_csv("https://datascience.quantecon.org/assets/data/canada_census.csv", header=0, index_col=False)
df.head()

Unnamed: 0,CDcode,Pname,Population,CollegeEducated,PercentOwnHouse,Income
0,1001,Newfoundland and Labrador,270350,24.8,74.1,74676
1,1002,Newfoundland and Labrador,20370,7.5,86.3,60912
2,1003,Newfoundland and Labrador,15560,7.3,86.0,56224
3,1004,Newfoundland and Labrador,20385,10.9,73.7,44282
4,1005,Newfoundland and Labrador,42015,17.0,73.9,62565


A *census division* is a geographical area, smaller than a Canadian province, that is used to
organize information at a slightly more granular level than by province or by city. The census
divisions are shown below.

![https://datascience.quantecon.org/_static/canada_censusdivisions_map.png](https://datascience.quantecon.org/_static/canada_censusdivisions_map.png)

  
For each *census division*, the data above contains 1) the population, 2) the percent of the
population with a college degree, 3) the percent of the population that owns their house/apartment,
and 4) the median after-tax income.

Hint: `groupby` is the key here. You will need to practice different split, apply, and combine options.
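As a reminder of what split, apply, and combine look like in practice, here is a minimal sketch on a small made-up table. The column names `region`, `people`, and `share` are hypothetical and are not the census columns. It shows one aggregate per group, several named aggregates at once, and a `transform` that keeps the original row layout.

```python
import pandas as pd

# Hypothetical toy data (not the census data): two regions, a few areas each
toy = pd.DataFrame({
    "region": ["A", "A", "B", "B", "B"],
    "people": [100, 300, 50, 150, 200],
    "share": [10.0, 20.0, 5.0, 15.0, 25.0],
})

# Split: one group per region
grouped = toy.groupby("region")

# Apply + combine: a single aggregate per group
totals = grouped["people"].sum()

# Apply + combine: several named aggregates at once
summary = grouped.agg(
    total_people=("people", "sum"),
    mean_share=("share", "mean"),
)

# transform returns a result aligned with the original rows
toy["region_people"] = grouped["people"].transform("sum")
```

Aggregations such as `sum` return one row per group, while `transform` returns one value per original row, which is convenient when a group total is needed alongside the row-level data.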

### Question 1

Assume that you have a separate data source with province codes and names.

In [8]:
df_provincecodes = pd.DataFrame({
    "Pname" : [ 'Newfoundland and Labrador', 'Prince Edward Island', 'Nova Scotia',
                'New Brunswick', 'Quebec', 'Ontario', 'Manitoba', 'Saskatchewan',
                'Alberta', 'British Columbia', 'Yukon', 'Northwest Territories','Nunavut'],
    "Code" : ['NL', 'PE', 'NS', 'NB', 'QC', 'ON', 'MB', 'SK', 'AB', 'BC', 'YT', 'NT', 'NU']
            })
df_provincecodes

Unnamed: 0,Pname,Code
0,Newfoundland and Labrador,NL
1,Prince Edward Island,PE
2,Nova Scotia,NS
3,New Brunswick,NB
4,Quebec,QC
5,Ontario,ON
6,Manitoba,MB
7,Saskatchewan,SK
8,Alberta,AB
9,British Columbia,BC
10,Yukon,YT
11,Northwest Territories,NT
12,Nunavut,NU


With this,

1. Either merge or join these province codes into the census dataframe to provide province codes for each province
  name. You need to figure out which “key” matches in the merge, and don’t be afraid to rename columns for convenience.  
1. Drop the province names from the resulting dataframe.  
1. Rename the column with the province codes to “Province”.  Hint: `.rename(columns = <YOURDICTIONARY>)`  
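The three steps above follow a common merge, drop, rename pattern. The sketch below illustrates it on made-up tables. The names `sales`, `lookup`, `store_name`, `name`, and `code` are hypothetical, so you will still need to work out the matching key for the census data yourself.

```python
import pandas as pd

# Hypothetical toy tables (not the census data): a main table and a lookup table
sales = pd.DataFrame({"store_name": ["North", "South", "North"], "units": [3, 5, 2]})
lookup = pd.DataFrame({"name": ["North", "South"], "code": ["N", "S"]})

merged = (
    sales
    # The key columns have different names, so state both sides explicitly.
    # validate raises an error if the lookup table has duplicate keys.
    .merge(lookup, left_on="store_name", right_on="name",
           how="left", validate="many_to_one")
    # Drop the long names once the codes are attached
    .drop(columns=["store_name", "name"])
    # Give the code column its final name
    .rename(columns={"code": "store"})
)
```

Using `how="left"` keeps every row of the main table even when a name has no match in the lookup table, so unmatched names show up as missing codes rather than silently disappearing.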

In [9]:
# Your code here

For this particular example, you could also have mapped the province names to codes using `replace` and then renamed the column. This is a good check on your answer.

In [10]:
(pd.read_csv("https://datascience.quantecon.org/assets/data/canada_census.csv", header=0, index_col=False)
.replace({
    "Alberta": "AB", "British Columbia": "BC", "Manitoba": "MB", "New Brunswick": "NB",
    "Newfoundland and Labrador": "NL", "Northwest Territories": "NT", "Nova Scotia": "NS",
    "Nunavut": "NU", "Ontario": "ON", "Prince Edward Island": "PE", "Quebec": "QC",
    "Saskatchewan": "SK", "Yukon": "YT"})
.rename(columns={"Pname" : "Province"})
.head()
)

Unnamed: 0,CDcode,Province,Population,CollegeEducated,PercentOwnHouse,Income
0,1001,NL,270350,24.8,74.1,74676
1,1002,NL,20370,7.5,86.3,60912
2,1003,NL,15560,7.3,86.0,56224
3,1004,NL,20385,10.9,73.7,44282
4,1005,NL,42015,17.0,73.9,62565


### Question 2

Which province has the highest population? Which has the lowest?

In [None]:
# Your code here

### Question 3

Show a bar plot of the province populations.  Hint: After the split-apply-combine, you can use `.plot.bar()`.

In [None]:
# Your code here

### Question 4

Which province has the highest percent of individuals with a college education? Which has the
lowest?

Hint: Remember to weight this calculation by population!
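To see why the weighting matters, here is a minimal sketch on made-up numbers. The columns `region`, `people`, and `share` are hypothetical. A population-weighted mean of a percentage is the sum of `share * people` divided by the sum of `people` within each group, which generally differs from the plain mean of the percentages.

```python
import pandas as pd

# Hypothetical toy data (not the census data)
toy = pd.DataFrame({
    "region": ["A", "A", "B", "B"],
    "people": [100, 300, 200, 200],
    "share": [10.0, 20.0, 5.0, 15.0],
})

# Sum the weighted values and the weights within each group
sums = (
    toy.assign(weighted_share=toy["share"] * toy["people"])
    .groupby("region")[["weighted_share", "people"]]
    .sum()
)

# Weighted mean = sum(share * people) / sum(people)
weighted_mean = sums["weighted_share"] / sums["people"]

# Plain mean of the percentages, for comparison
unweighted_mean = toy.groupby("region")["share"].mean()
```

In region `A` the larger area pulls the weighted mean above the unweighted one, while in region `B` the two agree because both areas have the same size.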

In [None]:
# Your code here

### Question 5

What is the census division with the highest median income in each province?
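One possible approach, sketched here on made-up data with hypothetical columns `region`, `area_id`, and `value`, is to ask each group for the index label of its largest value with `idxmax` and then select those rows with `.loc`.

```python
import pandas as pd

# Hypothetical toy data (not the census data)
toy = pd.DataFrame({
    "region": ["A", "A", "B", "B"],
    "area_id": [11, 12, 21, 22],
    "value": [5.0, 7.0, 9.0, 4.0],
})

# idxmax returns, for each group, the row label where "value" is largest
top_labels = toy.groupby("region")["value"].idxmax()

# Select the full rows at those labels
top_rows = toy.loc[top_labels]
```

This relies on the row index being unique. If several rows tie for the maximum, `idxmax` reports only the first of them.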

In [None]:
# Your code here

### Question 6

By province, what is the total population of census divisions where more than 80 percent of the population own houses?
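Questions of this shape combine a row filter with a grouped aggregate. The sketch below uses made-up data with hypothetical columns `region`, `people`, and `rate`. Filtering first means that groups with no qualifying rows disappear from the result, so the last line shows one way to bring them back with a zero.

```python
import pandas as pd

# Hypothetical toy data (not the census data)
toy = pd.DataFrame({
    "region": ["A", "A", "B", "B", "C"],
    "people": [100, 300, 200, 200, 50],
    "rate": [85.0, 70.0, 90.0, 81.0, 50.0],
})

# Keep only the rows that satisfy the condition, then aggregate by group
above = toy[toy["rate"] > 80]
totals = above.groupby("region")["people"].sum()

# Region "C" has no qualifying rows, so it is absent from totals;
# reindex restores it with an explicit zero
totals_all = totals.reindex(toy["region"].unique(), fill_value=0)
```

Whether an empty group should be reported as zero or left out depends on the question being asked. For a mean rather than a sum, a missing value would be the more natural fill.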

In [None]:
# Your code here

### Question 7

By province, what is the average proportion of college-educated individuals in census divisions
where more than 80 percent of the population own houses?

In [None]:
# Your code here

### Question 8

The data we provided contains only the median income for each census division. For this question only, assume that the median income equals the average income. (In general this is not a good assumption, as you will know if you are familiar with labor economics or macroeconomics.)

Classify the census divisions as low-, medium-, or high-education using the college-educated proportions,
where “low” indicates that less than 10 percent of the area is college-educated, “medium” indicates that between 10 and 20 percent is college-educated, and “high” indicates more than 20 percent.

Based on that classification, find the average income for each of the low, medium, and high education groups, weighting the average by population.
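A minimal sketch of binning followed by a weighted group average, on made-up data with hypothetical columns `share`, `people`, and `value`. It uses `pd.cut` with left-closed bins, so a value of exactly 10 lands in "medium" and a value of exactly 20 lands in "high". The question does not say how to treat those boundary values, so that choice is an assumption.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (not the census data)
toy = pd.DataFrame({
    "share": [5.0, 10.0, 15.0, 25.0, 30.0],
    "people": [100, 100, 300, 200, 200],
    "value": [40.0, 50.0, 70.0, 80.0, 100.0],
})

# Left-closed bins: [0, 10), [10, 20), [20, inf)
# Boundary handling is an assumption, not something the question specifies
toy["level"] = pd.cut(
    toy["share"],
    bins=[0, 10, 20, np.inf],
    labels=["low", "medium", "high"],
    right=False,
)

# Weighted average of "value" within each level, using "people" as weights
avg = (
    toy.groupby("level", observed=True)[["value", "people"]]
    .apply(lambda d: np.average(d["value"], weights=d["people"]))
)
```

Passing `observed=True` restricts the result to categories that actually occur in the data, and it also avoids depending on the pandas default for categorical groupers.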

In [None]:
# Your code here