# <span style="color:blue">03-05 Assignment: World Happiness Analysis </span>

This assignment builds on our previous assignment on the [7th World Happiness Report](https://worldhappiness.report/ed/2019/) by merging its data with population data from a [different source](https://www.kaggle.com/tanuprabhu/population-by-country-2020).

We are interested in learning how well some of the happiness parameters correlate with some of the population parameters.

### 1. Process happiness data as before

In [1]:
import pandas as pd
happy1 = pd.read_csv('../03-04-dataframes-in-pandas/happiness-report.csv')
hap_attrs = list(happy1.columns)

### 2. Drop some columns that will not be part of our analysis &hellip;

&hellip; as coded in the next cell.

Also drop rows with null values and see the resulting dataframe.

In [2]:
# Drop the columns that are not part of this analysis.
happy2 = happy1.drop(columns = ['LifeLadder', 'ConfidenceInNationalGovernment', 'PerceptionsOfCorruption'])
happy3 = happy2.drop(columns = ['FreedomToMakeLifeChoices','Generosity','PositiveAffect','NegativeAffect', 'Year'])
# Drop every row that still has a null value.
happy4 = happy3.dropna()

happy4

Unnamed: 0,Country,HappinessScore,LogGDP,SocialSupport,HealthyLifeExpectancyAtBirth
0,Afghanistan,3.203,7.494588,0.507516,52.599998
1,Albania,4.719,9.412399,0.683592,68.699997
2,Algeria,5.211,9.557952,0.798651,65.900002
3,Argentina,6.086,9.809972,0.899912,68.800003
4,Armenia,4.559,9.119424,0.814449,66.900002
...,...,...,...,...,...
130,Uzbekistan,6.174,8.773365,0.920821,65.099998
131,Venezuela,4.707,9.270281,0.886882,66.500000
132,Vietnam,5.175,8.783416,0.831945,67.900002
134,Zambia,4.107,8.223958,0.717720,55.299999
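The column and row dropping above can be sketched on a small toy dataframe. The data below is made up for illustration and is not taken from the happiness report; only the `drop` and `dropna` calls mirror the cell above.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the happiness data (values are invented for illustration).
toy = pd.DataFrame({
    "Country": ["A", "B", "C"],
    "HappinessScore": [5.0, 6.1, 4.2],
    "LogGDP": [9.1, np.nan, 8.0],
    "Year": [2018, 2018, 2018],
})

# drop() returns a new dataframe; the original is left unchanged.
trimmed = toy.drop(columns=["Year"])

# dropna() removes every row that has at least one missing value.
cleaned = trimmed.dropna()

print(cleaned)
```

Note that `dropna()` keeps the original row labels, so a removed row leaves a gap in the index rather than renumbering the remaining rows.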


In [3]:
pop1 = pd.read_csv('population_by_country_2020.csv')
pop_attrs = list(pop1.columns)
pop1

Unnamed: 0,Country (or dependency),Population (2020),Yearly Change,Net Change,Density (P/Km²),Land Area (Km²),Migrants (net),Fert. Rate,Med. Age,Urban Pop %,World Share
0,China,1440297825,0.39 %,5540090,153,9388211,-348399.0,1.7,38,61 %,18.47 %
1,India,1382345085,0.99 %,13586631,464,2973190,-532687.0,2.2,28,35 %,17.70 %
2,United States,331341050,0.59 %,1937734,36,9147420,954806.0,1.8,38,83 %,4.25 %
3,Indonesia,274021604,1.07 %,2898047,151,1811570,-98955.0,2.3,30,56 %,3.51 %
4,Pakistan,221612785,2.00 %,4327022,287,770880,-233379.0,3.6,23,35 %,2.83 %
...,...,...,...,...,...,...,...,...,...,...,...
230,Montserrat,4993,0.06 %,3,50,100,,N.A.,N.A.,10 %,0.00 %
231,Falkland Islands,3497,3.05 %,103,0,12170,,N.A.,N.A.,66 %,0.00 %
232,Niue,1628,0.68 %,11,6,260,,N.A.,N.A.,46 %,0.00 %
233,Tokelau,1360,1.27 %,17,136,10,,N.A.,N.A.,0 %,0.00 %


### 3. Read population data &hellip;

&hellip; and also drop some columns we will not be using in our analysis, as coded in the next cell.

1. Rename the first two columns to _Country_ and _Population_ respectively.
2. Drop rows with null values.
3. Drop rows where _Med. Age_ is the string `N.A.`.
4. See the resulting dataframe.

In [4]:
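# Keep only the country, population, density, and median age columns.
pop2 = pop1.drop(columns = [pop_attrs[2], pop_attrs[3], pop_attrs[5], pop_attrs[6], pop_attrs[7], pop_attrs[9], pop_attrs[10]])
pop2.rename(columns={pop_attrs[0]: "Country", 
                     pop_attrs[1]: "Population"}, inplace = True)
# dropna() returns a new dataframe, so assign the result back.
pop2 = pop2.dropna()
# 'N.A.' is a string rather than a null, so filter it out explicitly.
pop2 = pop2[pop2['Med. Age'] != 'N.A.']
pop2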

Unnamed: 0,Country,Population,Density (P/Km²),Med. Age
0,China,1440297825,153,38
1,India,1382345085,464,28
2,United States,331341050,36,38
3,Indonesia,274021604,151,30
4,Pakistan,221612785,287,23
...,...,...,...,...
196,Aruba,106845,593,41
197,Tonga,105901,147,22
198,U.S. Virgin Islands,104398,298,43
199,Seychelles,98453,214,34
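The cleaning steps above (rename, drop nulls, filter the `N.A.` placeholder) can be sketched on a toy dataframe. The rows are invented for illustration. The final `pd.to_numeric` conversion is an optional extra that the assignment does not ask for; it is shown only because a column that once held `N.A.` strings is not numeric until converted.

```python
import pandas as pd

# Toy stand-in for the population data (values are invented for illustration).
toy_pop = pd.DataFrame({
    "Country (or dependency)": ["A", "B", "C"],
    "Population (2020)": [1000, 500, 20],
    "Med. Age": ["38", "28", "N.A."],
})

toy_pop = toy_pop.rename(columns={
    "Country (or dependency)": "Country",
    "Population (2020)": "Population",
})

# dropna() is not in-place by default, so keep the returned dataframe.
toy_pop = toy_pop.dropna()

# "N.A." is a string, not a null, so dropna() does not catch it.
# copy() makes the filtered result independent of the original frame.
toy_pop = toy_pop[toy_pop["Med. Age"] != "N.A."].copy()

# Optional extra: the column still holds strings; convert it for numeric work.
toy_pop["Med. Age"] = pd.to_numeric(toy_pop["Med. Age"])

print(toy_pop.dtypes)
```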


### 4. Merge the two datasets together, treating `Country` as the common attribute

A country listed in only one of the datasets is not needed for our analysis, so keep only the countries that appear in both (an inner join).

Arrange the combined dataset in _descending_ order of population.

In [5]:
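# An inner join keeps only the countries present in both datasets.
combined_dataset = pd.merge(left=pop2, right=happy4, how='inner', on='Country')
# sort_values() returns a new dataframe, so assign it back to keep the ordering.
combined_dataset = combined_dataset.sort_values(by = 'Population', ascending = False)
combined_dataset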

Unnamed: 0,Country,Population,Density (P/Km²),Med. Age,HappinessScore,LogGDP,SocialSupport,HealthyLifeExpectancyAtBirth
0,China,1440297825,153,38,5.191,9.694376,0.787605,69.300003
1,India,1382345085,464,28,4.015,8.830280,0.638052,60.099998
2,United States,331341050,36,38,6.892,10.922465,0.903856,68.300003
3,Indonesia,274021604,151,30,5.192,9.362827,0.809379,62.099998
4,Pakistan,221612785,287,23,5.653,8.561664,0.685059,58.500000
...,...,...,...,...,...,...,...,...
117,Estonia,1326693,31,42,5.893,10.324107,0.932694,68.599998
118,Mauritius,1272140,626,37,5.888,9.956448,0.908842,66.400002
119,Comoros,872695,467,20,3.973,7.260142,0.621303,57.200001
120,Montenegro,628080,47,39,5.523,9.732955,0.855980,68.500000
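As a minimal sketch of what the inner merge does, the toy example below joins two small invented dataframes on `Country`. The second merge, an outer join with `indicator=True`, is not part of the assignment; it is one possible way to check which countries failed to match, for instance because the two sources spell a name differently.

```python
import pandas as pd

# Two toy dataframes (invented values) that share only countries B and C.
left = pd.DataFrame({"Country": ["A", "B", "C"], "Population": [20, 1000, 500]})
right = pd.DataFrame({"Country": ["B", "C", "D"], "HappinessScore": [6.1, 4.2, 5.5]})

# Inner join: keep only rows whose Country appears in both dataframes.
inner = pd.merge(left=left, right=right, how="inner", on="Country")
inner = inner.sort_values(by="Population", ascending=False)

# Outer join with an indicator column to audit the countries that did not match.
audit = pd.merge(left, right, how="outer", on="Country", indicator=True)
unmatched = audit[audit["_merge"] != "both"]

print(inner)
print(unmatched[["Country", "_merge"]])
```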


### 5. Calculate the correlation matrix between the columns of the combined dataset

It should look somewhat like this, with 1.0 along the diagonal and correlation values everywhere else. Since correlation is symmetric, i.e., correlation between A and B is the same as the correlation between B and A, we expect this matrix to be symmetrical.

Hint: Use the `dataframe.corr()` function

| Correlation | Population | Density (P/Km²) | HappinessScore | LogGDP | SocialSupport | HealthyLifeExpectancyAtBirth |
|---|---|---|---|---|---|---|
| Population | 1.0 | | | | | |
| Density (P/Km²) | | 1.0 | | | | |
| HappinessScore | | | 1.0 | | | |
| LogGDP | | | | 1.0 | | |
| SocialSupport | | | | | 1.0 | |
| HealthyLifeExpectancyAtBirth | | | | | | 1.0 |

In [6]:
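# Correlate the numeric columns only; Country and Med. Age hold strings.
corrMatrix = combined_dataset.select_dtypes(include='number').corr()
corrMatrix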

Unnamed: 0,Population,Density (P/Km²),HappinessScore,LogGDP,SocialSupport,HealthyLifeExpectancyAtBirth
Population,1.0,0.162601,-0.078647,0.016404,-0.101452,-0.010284
Density (P/Km²),0.162601,1.0,-0.073427,-0.08275,-0.170214,0.050283
HappinessScore,-0.078647,-0.073427,1.0,0.79852,0.762456,0.778404
LogGDP,0.016404,-0.08275,0.79852,1.0,0.791557,0.85635
SocialSupport,-0.101452,-0.170214,0.762456,0.791557,1.0,0.747156
HealthyLifeExpectancyAtBirth,-0.010284,0.050283,0.778404,0.85635,0.747156,1.0
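The two properties described above, ones on the diagonal and symmetry, can be checked directly. The sketch below uses a toy dataframe with invented values and the placeholder column names `x`, `y`, and `z`. It selects the numeric columns before calling `corr()` so that the string column is left out.

```python
import numpy as np
import pandas as pd

# Toy data (invented values); Country is a non-numeric column.
toy = pd.DataFrame({
    "Country": ["A", "B", "C", "D"],
    "x": [1.0, 2.0, 3.0, 4.0],
    "y": [2.0, 4.1, 5.9, 8.2],
    "z": [4.0, 1.0, 3.0, 2.0],
})

# Pearson correlation between every pair of numeric columns.
corr = toy.select_dtypes(include="number").corr()

# The matrix equals its own transpose, and each column correlates 1.0 with itself.
is_symmetric = np.allclose(corr.values, corr.values.T)
has_unit_diagonal = np.allclose(np.diag(corr.values), 1.0)

print(corr)
print(is_symmetric, has_unit_diagonal)
```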


### 6. What attribute(s) correlate most strongly with Happiness?

**Your Answer**
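One possible way to rank attributes by the strength of their correlation with a chosen column is sketched below on toy data. The column names `target`, `close`, `inverse`, and `noise` are placeholders invented for this example. The absolute value is used because a strong negative correlation is as strong as a positive one.

```python
import pandas as pd

# Toy data (invented values): two columns track the target closely, one does not.
toy = pd.DataFrame({
    "target": [1.0, 2.0, 3.0, 4.0, 5.0],
    "close": [1.1, 1.9, 3.2, 3.9, 5.1],
    "inverse": [5.0, 4.2, 2.9, 2.1, 1.0],
    "noise": [3.0, 1.0, 3.0, 1.0, 3.0],
})

corr = toy.corr()

# Take the target's column, drop its self-correlation, and rank by magnitude.
ranked = corr["target"].drop("target").abs().sort_values(ascending=False)

print(ranked)
```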

# When you're done, submit the notebook

1. **Run all the cells in order.**

2. Submit the notebook by saving it as PDF. 
    * In the cluster environment, it's File | Print (Save as PDF) and submit to [Gradescope](https://www.gradescope.com/courses/182658)<sup>&dagger;</sup>, 
    * On other versions, it may be File | Download As (PDF) and then submit to [Gradescope](https://www.gradescope.com/courses/182658)<sup>&dagger;</sup>.

<sup>&dagger;</sup>To submit to Gradescope, log into the website, add course 9W7PW3 (if not already added) and submit. The assignment name should match the name of this notebook.

![The end](https://live.staticflickr.com/32/89187454_3ae6aded89_b.jpg)