# Combining Data

Practice combining data from two different data sets. In the same folder as this Jupyter notebook, there are two csv files:
* rural_population_percent.csv
* electricity_access_percent.csv

They both come from the World Bank Indicators data. 
* https://data.worldbank.org/indicator/SP.RUR.TOTL.ZS
* https://data.worldbank.org/indicator/EG.ELC.ACCS.ZS

The rural population data gives the percentage of each country's population that is rural, year by year. The electricity access data gives the percentage of each country's population with access to electricity.

In this exercise, you will combine these two data sets together into one pandas data frame.

# Exercise 1

Combine the two data sets using the [pandas concat method](https://pandas.pydata.org/pandas-docs/stable/merging.html). In other words, find the union of the two data sets.
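As a rough illustration of what `pd.concat` does along each axis, here is a minimal sketch on two tiny made-up frames. The column values and the `rural`/`electricity` key labels are assumptions for the example, not the exercise data.

```python
import pandas as pd

# Toy frames standing in for the two World Bank files (values are made up).
rural = pd.DataFrame({"Country Code": ["AAA", "BBB"], "1960": [50.0, 90.0]})
electricity = pd.DataFrame({"Country Code": ["AAA", "BBB"], "1960": [80.0, 10.0]})

# axis=0 (the default) stacks the rows of both frames on top of each other.
stacked = pd.concat([rural, electricity], ignore_index=True)

# axis=1 places the frames side by side, aligning rows on the index.
# keys adds an outer column level that records which frame a column came from.
side_by_side = pd.concat(
    [rural.set_index("Country Code"), electricity.set_index("Country Code")],
    axis=1,
    keys=["rural", "electricity"],
)

print(stacked.shape)       # (4, 2)
print(side_by_side.shape)  # (2, 2)
```

Stacking along `axis=0` keeps every record from both sources, while `axis=1` keeps one row per index label and widens the table.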

In [24]:
pd.set_option("display.max_columns", None)

In [1]:
# TODO: import the pandas library
import pandas as pd

In [2]:
# TODO: read in each csv file into a separate variable
# HINT: remember from the Extract material that these csv files have some formatting issues
# HINT: The file paths are 'rural_population_percent.csv' and 'electricity_access_percent.csv'
df_rural = pd.read_csv('rural_population_percent.csv', skiprows=4)
df_electricity = pd.read_csv('electricity_access_percent.csv', skiprows=4)
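The `skiprows=4` argument and the extra 'Unnamed: 62' column dropped in the next cell both come from the file layout. The sketch below reproduces the effect on an in-memory file. It assumes four preamble lines before the header and a trailing comma on every row; the sample text is invented for illustration and is not copied from the World Bank files.

```python
import io

import pandas as pd

# Hypothetical file contents: four preamble lines, then the real header row.
# Every row ends with a trailing comma, which creates an empty last field.
raw = (
    "preamble line 1\n"
    "preamble line 2\n"
    "preamble line 3\n"
    "preamble line 4\n"
    "Country Name,Country Code,1960,1961,\n"
    "Aruba,ABW,49.224,49.239,\n"
)

# skiprows=4 makes pandas start reading at the header row.
sample = pd.read_csv(io.StringIO(raw), skiprows=4)
print(list(sample.columns))

# The empty trailing field becomes a column named 'Unnamed: <position>'.
cleaned = sample.drop(columns="Unnamed: 4")
print(cleaned.shape)
```

With 62 real columns, the same mechanism would plausibly yield the 'Unnamed: 62' column seen in this notebook.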

In [5]:
# TODO: remove the 'Unnamed: 62' column from each data set
df_rural.drop('Unnamed: 62', axis=1, inplace=True)
df_electricity.drop('Unnamed: 62', axis=1, inplace=True)

In [11]:
print(df_rural.shape, df_electricity.shape)

(264, 62) (264, 62)


In [10]:
df_rural.head(1)

Unnamed: 0,Country Name,Country Code,Indicator Name,Indicator Code,1960,1961,1962,1963,1964,1965,...,2008,2009,2010,2011,2012,2013,2014,2015,2016,2017
0,Aruba,ABW,Rural population (% of total population),SP.RUR.TOTL.ZS,49.224,49.239,49.254,49.27,49.285,49.3,...,56.217,56.579,56.941,57.302,57.636,57.942,58.221,58.472,58.696,58.893


In [9]:
df_electricity.head(1)

Unnamed: 0,Country Name,Country Code,Indicator Name,Indicator Code,1960,1961,1962,1963,1964,1965,...,2008,2009,2010,2011,2012,2013,2014,2015,2016,2017
0,Aruba,ABW,Access to electricity (% of population),EG.ELC.ACCS.ZS,,,,,,,...,93.086166,93.354546,93.356292,93.942375,94.255814,94.578262,94.906723,95.238182,95.570145,


In [15]:
df_rural['Country Code'].value_counts().sort_values(ascending=True)

ARG    1
GUM    1
GRL    1
MLT    1
CIV    1
      ..
DMA    1
AZE    1
TSA    1
KGZ    1
LBY    1
Name: Country Code, Length: 264, dtype: int64
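The `value_counts` call above is one way to confirm that every country code appears exactly once before using it as an index. A shorter check, sketched here on made-up codes, uses `Series.is_unique` and `Series.duplicated`:

```python
import pandas as pd

# Made-up country codes; the second Series contains a deliberate duplicate.
codes = pd.Series(["ABW", "AFG", "AGO"], name="Country Code")
codes_with_dupe = pd.Series(["ABW", "AFG", "ABW"], name="Country Code")

print(codes.is_unique)            # True
print(codes_with_dupe.is_unique)  # False

# keep=False flags every occurrence of a repeated value, not just the later ones.
repeated = codes_with_dupe[codes_with_dupe.duplicated(keep=False)]
print(repeated.tolist())          # ['ABW', 'ABW']
```

A unique index matters here because `pd.concat(..., axis=1)` aligns rows by index label.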

In [16]:
df_rural.set_index('Country Code', inplace=True)

In [17]:
df_electricity.set_index('Country Code', inplace=True)

In [31]:
# TODO: combine the two data sets together using the concat method
df = pd.concat([df_rural, df_electricity], keys=['rual%','electricity%'], axis=1, join='inner')

In [32]:
df.shape

(264, 122)

In [33]:
df.head()

Unnamed: 0_level_0,rual%,rual%,rual%,rual%,rual%,rual%,rual%,rual%,rual%,rual%,rual%,rual%,rual%,rual%,rual%,rual%,rual%,rual%,rual%,rual%,rual%,rual%,rual%,rual%,rual%,rual%,rual%,rual%,rual%,rual%,rual%,rual%,rual%,rual%,rual%,rual%,rual%,rual%,rual%,rual%,rual%,rual%,rual%,rual%,rual%,rual%,rual%,rual%,rual%,rual%,rual%,rual%,rual%,rual%,rual%,rual%,rual%,rual%,rual%,rual%,rual%,electricity%,electricity%,electricity%,electricity%,electricity%,electricity%,electricity%,electricity%,electricity%,electricity%,electricity%,electricity%,electricity%,electricity%,electricity%,electricity%,electricity%,electricity%,electricity%,electricity%,electricity%,electricity%,electricity%,electricity%,electricity%,electricity%,electricity%,electricity%,electricity%,electricity%,electricity%,electricity%,electricity%,electricity%,electricity%,electricity%,electricity%,electricity%,electricity%,electricity%,electricity%,electricity%,electricity%,electricity%,electricity%,electricity%,electricity%,electricity%,electricity%,electricity%,electricity%,electricity%,electricity%,electricity%,electricity%,electricity%,electricity%,electricity%,electricity%,electricity%,electricity%
Unnamed: 0_level_1,Country Name,Indicator Name,Indicator Code,1960,1961,1962,1963,1964,1965,1966,1967,1968,1969,1970,1971,1972,1973,1974,1975,1976,1977,1978,1979,1980,1981,1982,1983,1984,1985,1986,1987,1988,1989,1990,1991,1992,1993,1994,1995,1996,1997,1998,1999,2000,2001,2002,2003,2004,2005,2006,2007,2008,2009,2010,2011,2012,2013,2014,2015,2016,2017,Country Name,Indicator Name,Indicator Code,1960,1961,1962,1963,1964,1965,1966,1967,1968,1969,1970,1971,1972,1973,1974,1975,1976,1977,1978,1979,1980,1981,1982,1983,1984,1985,1986,1987,1988,1989,1990,1991,1992,1993,1994,1995,1996,1997,1998,1999,2000,2001,2002,2003,2004,2005,2006,2007,2008,2009,2010,2011,2012,2013,2014,2015,2016,2017
Country Code,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2,Unnamed: 11_level_2,Unnamed: 12_level_2,Unnamed: 13_level_2,Unnamed: 14_level_2,Unnamed: 15_level_2,Unnamed: 16_level_2,Unnamed: 17_level_2,Unnamed: 18_level_2,Unnamed: 19_level_2,Unnamed: 20_level_2,Unnamed: 21_level_2,Unnamed: 22_level_2,Unnamed: 23_level_2,Unnamed: 24_level_2,Unnamed: 25_level_2,Unnamed: 26_level_2,Unnamed: 27_level_2,Unnamed: 28_level_2,Unnamed: 29_level_2,Unnamed: 30_level_2,Unnamed: 31_level_2,Unnamed: 32_level_2,Unnamed: 33_level_2,Unnamed: 34_level_2,Unnamed: 35_level_2,Unnamed: 36_level_2,Unnamed: 37_level_2,Unnamed: 38_level_2,Unnamed: 39_level_2,Unnamed: 40_level_2,Unnamed: 41_level_2,Unnamed: 42_level_2,Unnamed: 43_level_2,Unnamed: 44_level_2,Unnamed: 45_level_2,Unnamed: 46_level_2,Unnamed: 47_level_2,Unnamed: 48_level_2,Unnamed: 49_level_2,Unnamed: 50_level_2,Unnamed: 51_level_2,Unnamed: 52_level_2,Unnamed: 53_level_2,Unnamed: 54_level_2,Unnamed: 55_level_2,Unnamed: 56_level_2,Unnamed: 57_level_2,Unnamed: 58_level_2,Unnamed: 59_level_2,Unnamed: 60_level_2,Unnamed: 61_level_2,Unnamed: 62_level_2,Unnamed: 63_level_2,Unnamed: 64_level_2,Unnamed: 65_level_2,Unnamed: 66_level_2,Unnamed: 67_level_2,Unnamed: 68_level_2,Unnamed: 69_level_2,Unnamed: 70_level_2,Unnamed: 71_level_2,Unnamed: 72_level_2,Unnamed: 73_level_2,Unnamed: 74_level_2,Unnamed: 75_level_2,Unnamed: 76_level_2,Unnamed: 77_level_2,Unnamed: 78_level_2,Unnamed: 79_level_2,Unnamed: 80_level_2,Unnamed: 81_level_2,Unnamed: 82_level_2,Unnamed: 83_level_2,Unnamed: 84_level_2,Unnamed: 85_level_2,Unnamed: 86_level_2,Unnamed: 87_level_2,Unnamed: 88_level_2,Unnamed: 89_level_2,Unnamed: 90_level_2,Unnamed: 91_level_2,Unnamed: 92_level_2,Unnamed: 93_level_2,Unnamed: 94_level_2,Unnamed: 95_level_2,Unnamed: 96_level_2,Unnamed: 97_level_2,Unnamed: 98_level_2,Unnamed: 99_level_2,Unnamed: 100_level_2,Unnamed: 101_level_2,Unnamed: 102_level_2,Unnamed: 103_level_2,Unnamed: 104_level_2,Unnamed: 105_level_2,Unnamed: 106_level_2,Unnamed: 107_level_2,Unnamed: 108_level_2,Unnamed: 109_level_2,Unnamed: 110_level_2,Unnamed: 111_level_2,Unnamed: 112_level_2,Unnamed: 113_level_2,Unnamed: 114_level_2,Unnamed: 115_level_2,Unnamed: 116_level_2,Unnamed: 117_level_2,Unnamed: 118_level_2,Unnamed: 119_level_2,Unnamed: 120_level_2,Unnamed: 121_level_2,Unnamed: 122_level_2
ABW,Aruba,Rural population (% of total population),SP.RUR.TOTL.ZS,49.224,49.239,49.254,49.27,49.285,49.3,49.315,49.33,49.346,49.361,49.376,49.391,49.407,49.422,49.437,49.452,49.468,49.483,49.498,49.513,49.528,49.544,49.559,49.574,49.589,49.605,49.62,49.635,49.65,49.665,49.681,49.696,50.002,50.412,50.823,51.233,51.644,52.054,52.464,52.873,53.283,53.661,54.028,54.394,54.76,55.125,55.489,55.853,56.217,56.579,56.941,57.302,57.636,57.942,58.221,58.472,58.696,58.893,Aruba,Access to electricity (% of population),EG.ELC.ACCS.ZS,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,88.445351,88.780846,89.115829,89.447754,89.77356,90.090187,90.394585,90.683678,90.954422,91.203751,91.660398,91.638092,91.833717,92.023048,92.212166,92.40712,92.613983,92.838821,93.086166,93.354546,93.356292,93.942375,94.255814,94.578262,94.906723,95.238182,95.570145,
AFG,Afghanistan,Rural population (% of total population),SP.RUR.TOTL.ZS,91.779,91.492,91.195,90.89,90.574,90.25,89.915,89.57,89.214,88.848,88.471,88.083,87.684,87.274,86.851,86.417,85.971,85.513,85.042,84.565,84.319,84.07,83.818,83.563,83.304,83.042,82.777,82.509,82.237,81.962,81.684,81.403,81.118,80.83,80.538,80.243,79.945,79.644,79.339,79.03,78.718,78.404,78.085,77.763,77.438,77.105,76.763,76.413,76.054,75.687,75.311,74.926,74.532,74.129,73.718,73.297,72.868,72.43,Afghanistan,Access to electricity (% of population),EG.ELC.ACCS.ZS,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.021977,0.179635,0.959756,0.776537,6.267394,11.751966,17.236319,23.0,28.228613,33.74868,42.4,44.854885,42.7,43.222019,69.1,67.259552,89.5,71.5,84.137138,
AGO,Angola,Rural population (% of total population),SP.RUR.TOTL.ZS,89.565,89.202,88.796,88.376,87.942,87.496,87.035,86.559,86.068,85.564,85.043,84.566,84.125,83.676,83.215,82.745,82.263,81.772,81.27,80.758,80.234,79.701,79.157,78.602,78.035,77.459,76.872,76.275,75.666,75.048,74.418,73.779,73.128,72.47,71.8,71.12,70.43,69.732,69.025,68.308,67.581,66.848,66.105,65.355,64.595,63.831,63.058,62.278,61.491,60.701,59.903,59.1,58.301,57.51,56.726,55.95,55.181,54.422,Angola,Access to electricity (% of population),EG.ELC.ACCS.ZS,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,11.397808,12.579379,13.76044,14.938441,16.110325,17.273031,18.423502,19.558676,20.675495,21.770901,22.843355,20.0,24.939095,25.974508,27.009701,28.050735,29.103676,37.5,31.268013,32.382469,33.51495,34.6,35.821964,36.99049,32.0,42.0,40.520607,
ALB,Albania,Rural population (% of total population),SP.RUR.TOTL.ZS,69.295,69.057,68.985,68.914,68.842,68.77,68.698,68.626,68.554,68.452,68.26,68.067,67.873,67.679,67.484,67.288,67.092,66.895,66.698,66.5,66.238,65.976,65.713,65.448,65.183,64.917,64.65,64.381,64.112,63.842,63.572,63.3,62.751,62.201,61.646,61.089,60.527,59.965,59.399,58.831,58.259,57.565,56.499,55.427,54.349,53.269,52.185,51.098,50.009,48.924,47.837,46.753,45.67,44.617,43.591,42.593,41.624,40.684,Albania,Access to electricity (% of population),EG.ELC.ACCS.ZS,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,100.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0,
AND,Andorra,Rural population (% of total population),SP.RUR.TOTL.ZS,41.55,39.017,36.538,34.128,31.795,29.555,27.407,25.359,23.412,21.576,19.845,18.22,16.699,15.284,13.968,12.748,11.618,10.58,9.622,8.743,7.936,7.2,6.526,5.911,5.351,4.841,4.675,4.822,4.973,5.128,5.288,5.47,5.676,5.889,6.11,6.339,6.575,6.82,7.073,7.334,7.605,7.944,8.359,8.793,9.249,9.705,10.162,10.637,11.133,11.648,12.183,12.74,13.292,13.835,14.367,14.885,15.388,15.873,Andorra,Access to electricity (% of population),EG.ELC.ACCS.ZS,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,100.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0,
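Because `keys` was passed to `pd.concat`, the combined frame has two column levels. The sketch below shows a few ways to select from such a frame. It uses a small made-up table with hypothetical key labels `rural` and `electricity` rather than the frame above.

```python
import pandas as pd

# Two tiny made-up frames sharing the same index and column name.
left = pd.DataFrame({"2016": [1.0, 2.0]}, index=["AAA", "BBB"])
right = pd.DataFrame({"2016": [3.0, 4.0]}, index=["AAA", "BBB"])
wide = pd.concat([left, right], axis=1, keys=["rural", "electricity"])

# Selecting an outer key returns all columns that came from that frame.
rural_part = wide["rural"]

# A tuple selects a single column as a Series.
electricity_2016 = wide[("electricity", "2016")]

# xs pulls the same inner column from every outer key.
year_2016 = wide.xs("2016", axis=1, level=1)

print(list(rural_part.columns))  # ['2016']
print(list(year_2016.columns))   # ['rural', 'electricity']
```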


# Exercise 2 (Challenge)

This exercise is more challenging.

The resulting data frame should look like this:

|Country Name|Country Code|Year|Rural_Value|Electricity_Value|
|---|---|---|---|---|
|Aruba|ABW|1960|49.224|NaN|

... etc.

Sort the resulting data frame by country and then by year.

Here are a few pandas methods that should be helpful:
* [melt](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.melt.html)
* [drop](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop.html)
* [merge](https://pandas.pydata.org/pandas-docs/version/0.23/generated/pandas.DataFrame.merge.html)
* [sort_values](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.sort_values.html)

HINT: You can use country name, country code, and the year as common keys between the data sets
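The methods listed above can be chained roughly as follows. This is a minimal sketch on made-up two-country tables, assuming only a 'Country Code' key column and two year columns. It is not the full solution for the exercise files.

```python
import pandas as pd

# Made-up wide tables with one column per year, as in the exercise files.
rural_wide = pd.DataFrame({
    "Country Code": ["BBB", "AAA"],
    "1960": [90.0, 50.0],
    "1961": [89.0, 51.0],
})
electricity_wide = pd.DataFrame({
    "Country Code": ["BBB", "AAA"],
    "1960": [10.0, 80.0],
    "1961": [11.0, 81.0],
})

# melt turns the year columns into rows: one row per (country, year) pair.
rural_long = rural_wide.melt(
    id_vars="Country Code", var_name="Year", value_name="Rural_Value"
)
electricity_long = electricity_wide.melt(
    id_vars="Country Code", var_name="Year", value_name="Electricity_Value"
)

# merge lines the two long tables up on their shared key columns.
combined = rural_long.merge(electricity_long, on=["Country Code", "Year"])

# Sort by country, then year, and renumber the rows from zero.
combined = combined.sort_values(["Country Code", "Year"]).reset_index(drop=True)
print(combined)
```

The `reset_index(drop=True)` step is optional. It only replaces the row labels left over from the melt with a clean 0..n-1 range.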

In [4]:
# TODO: merge the data sets together according to the instructions. First, use the 
# melt method to change the formatting of each data frame so that it looks like this:
# Country Name, Country Code, Year, Rural Value
# Country Name, Country Code, Year, Electricity Value

# TODO: drop any columns from the data frames that aren't needed

# TODO: merge the data frames together based on their common columns
# in this case, the common columns are Country Name, Country Code, and Year

# TODO: sort the results by country and then by year

df_combined = None

In [34]:
df_rural = pd.read_csv('rural_population_percent.csv', skiprows=4)
df_electricity = pd.read_csv('electricity_access_percent.csv', skiprows=4)

df_rural.drop('Unnamed: 62', axis=1, inplace=True)
df_electricity.drop('Unnamed: 62', axis=1, inplace=True)

In [35]:
df_rural.head()

Unnamed: 0,Country Name,Country Code,Indicator Name,Indicator Code,1960,1961,1962,1963,1964,1965,1966,1967,1968,1969,1970,1971,1972,1973,1974,1975,1976,1977,1978,1979,1980,1981,1982,1983,1984,1985,1986,1987,1988,1989,1990,1991,1992,1993,1994,1995,1996,1997,1998,1999,2000,2001,2002,2003,2004,2005,2006,2007,2008,2009,2010,2011,2012,2013,2014,2015,2016,2017
0,Aruba,ABW,Rural population (% of total population),SP.RUR.TOTL.ZS,49.224,49.239,49.254,49.27,49.285,49.3,49.315,49.33,49.346,49.361,49.376,49.391,49.407,49.422,49.437,49.452,49.468,49.483,49.498,49.513,49.528,49.544,49.559,49.574,49.589,49.605,49.62,49.635,49.65,49.665,49.681,49.696,50.002,50.412,50.823,51.233,51.644,52.054,52.464,52.873,53.283,53.661,54.028,54.394,54.76,55.125,55.489,55.853,56.217,56.579,56.941,57.302,57.636,57.942,58.221,58.472,58.696,58.893
1,Afghanistan,AFG,Rural population (% of total population),SP.RUR.TOTL.ZS,91.779,91.492,91.195,90.89,90.574,90.25,89.915,89.57,89.214,88.848,88.471,88.083,87.684,87.274,86.851,86.417,85.971,85.513,85.042,84.565,84.319,84.07,83.818,83.563,83.304,83.042,82.777,82.509,82.237,81.962,81.684,81.403,81.118,80.83,80.538,80.243,79.945,79.644,79.339,79.03,78.718,78.404,78.085,77.763,77.438,77.105,76.763,76.413,76.054,75.687,75.311,74.926,74.532,74.129,73.718,73.297,72.868,72.43
2,Angola,AGO,Rural population (% of total population),SP.RUR.TOTL.ZS,89.565,89.202,88.796,88.376,87.942,87.496,87.035,86.559,86.068,85.564,85.043,84.566,84.125,83.676,83.215,82.745,82.263,81.772,81.27,80.758,80.234,79.701,79.157,78.602,78.035,77.459,76.872,76.275,75.666,75.048,74.418,73.779,73.128,72.47,71.8,71.12,70.43,69.732,69.025,68.308,67.581,66.848,66.105,65.355,64.595,63.831,63.058,62.278,61.491,60.701,59.903,59.1,58.301,57.51,56.726,55.95,55.181,54.422
3,Albania,ALB,Rural population (% of total population),SP.RUR.TOTL.ZS,69.295,69.057,68.985,68.914,68.842,68.77,68.698,68.626,68.554,68.452,68.26,68.067,67.873,67.679,67.484,67.288,67.092,66.895,66.698,66.5,66.238,65.976,65.713,65.448,65.183,64.917,64.65,64.381,64.112,63.842,63.572,63.3,62.751,62.201,61.646,61.089,60.527,59.965,59.399,58.831,58.259,57.565,56.499,55.427,54.349,53.269,52.185,51.098,50.009,48.924,47.837,46.753,45.67,44.617,43.591,42.593,41.624,40.684
4,Andorra,AND,Rural population (% of total population),SP.RUR.TOTL.ZS,41.55,39.017,36.538,34.128,31.795,29.555,27.407,25.359,23.412,21.576,19.845,18.22,16.699,15.284,13.968,12.748,11.618,10.58,9.622,8.743,7.936,7.2,6.526,5.911,5.351,4.841,4.675,4.822,4.973,5.128,5.288,5.47,5.676,5.889,6.11,6.339,6.575,6.82,7.073,7.334,7.605,7.944,8.359,8.793,9.249,9.705,10.162,10.637,11.133,11.648,12.183,12.74,13.292,13.835,14.367,14.885,15.388,15.873


In [36]:
df_rural.columns

Index(['Country Name', 'Country Code', 'Indicator Name', 'Indicator Code',
       '1960', '1961', '1962', '1963', '1964', '1965', '1966', '1967', '1968',
       '1969', '1970', '1971', '1972', '1973', '1974', '1975', '1976', '1977',
       '1978', '1979', '1980', '1981', '1982', '1983', '1984', '1985', '1986',
       '1987', '1988', '1989', '1990', '1991', '1992', '1993', '1994', '1995',
       '1996', '1997', '1998', '1999', '2000', '2001', '2002', '2003', '2004',
       '2005', '2006', '2007', '2008', '2009', '2010', '2011', '2012', '2013',
       '2014', '2015', '2016', '2017'],
      dtype='object')

In [37]:
df_rural = pd.melt(df_rural, id_vars=['Country Name', 'Country Code', 'Indicator Name', 'Indicator Code'],
                   var_name='Year', value_name='Rural_Value')

In [39]:
df_rural.columns

Index(['Country Name', 'Country Code', 'Indicator Name', 'Indicator Code',
       'Year', 'Rural_Value'],
      dtype='object')

In [40]:
df_rural = df_rural[['Country Name', 'Country Code', 'Year', 'Rural_Value']]

In [42]:
df_rural.head()

Unnamed: 0,Country Name,Country Code,Year,Rural_Value
0,Aruba,ABW,1960,49.224
1,Afghanistan,AFG,1960,91.779
2,Angola,AGO,1960,89.565
3,Albania,ALB,1960,69.295
4,Andorra,AND,1960,41.55


In [45]:
df_electricity = pd.melt(df_electricity, id_vars=['Country Name', 'Country Code', 'Indicator Name', 'Indicator Code'],
                         var_name='Year', value_name='Electricity_Value')
df_electricity = df_electricity[['Country Name', 'Country Code', 'Year', 'Electricity_Value']]
df_electricity.head()

Unnamed: 0,Country Name,Country Code,Year,Electricity_Value
0,Aruba,ABW,1960,
1,Afghanistan,AFG,1960,
2,Angola,AGO,1960,
3,Albania,ALB,1960,
4,Andorra,AND,1960,


In [48]:
df = pd.merge(df_rural, df_electricity, on=['Country Name', 'Country Code', 'Year'])

In [51]:
df = df.sort_values(by=['Country Name', 'Year'])

In [52]:
df.head()

Unnamed: 0,Country Name,Country Code,Year,Rural_Value,Electricity_Value
1,Afghanistan,AFG,1960,91.779,
265,Afghanistan,AFG,1961,91.492,
529,Afghanistan,AFG,1962,91.195,
793,Afghanistan,AFG,1963,90.89,
1057,Afghanistan,AFG,1964,90.574,
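The merge in this solution uses the default inner join, which drops any key that appears in only one of the two tables. One way to check whether that happens is an outer merge with `indicator=True`. The sketch below uses made-up data in which one country is deliberately missing from each table.

```python
import pandas as pd

# Made-up long tables: 'BBB' is only in rural, 'CCC' is only in electricity.
rural = pd.DataFrame({
    "Country Code": ["AAA", "BBB"],
    "Year": ["1960", "1960"],
    "Rural_Value": [50.0, 90.0],
})
electricity = pd.DataFrame({
    "Country Code": ["AAA", "CCC"],
    "Year": ["1960", "1960"],
    "Electricity_Value": [80.0, 10.0],
})

# Default how='inner' keeps only keys present in both tables.
inner = pd.merge(rural, electricity, on=["Country Code", "Year"])

# how='outer' keeps every key; indicator=True adds a '_merge' column
# saying whether each row came from the left table, the right table, or both.
outer = pd.merge(
    rural, electricity, on=["Country Code", "Year"], how="outer", indicator=True
)

print(len(inner))  # 1
print(outer["_merge"].value_counts())
```

If every row of the outer merge is marked `both`, the inner join lost nothing.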
