<img align="right" style="padding-left:10px; height: 60%; width: 20%" src="figures/whr.png" >

# World Happiness Report

This case study is based on the 7th World Happiness Report. The first report was released in April 2012 in support of a UN High Level Meeting on “Wellbeing and Happiness: Defining a New Economic Paradigm”.

That 2012 report presented the available global data on national happiness and reviewed related evidence from the emerging science of happiness, showing that the quality of people’s lives can be coherently, reliably, and validly assessed by a variety of subjective well-being measures, collectively referred to then and in subsequent reports as “happiness.” 

This year’s World Happiness Report focuses on happiness and the community: how happiness has evolved over the past dozen years, with a focus on the technologies, social norms, conflicts and government policies that have driven those changes.

I have downloaded the data from [Chapter 2: Online Data](https://s3.amazonaws.com/happiness-report/2019/Chapter2OnlineData.xls) and filtered out data prior to 2018. The result is available in CSV format, as the next cell shows. _Data Prep Notes:_ The Happiness Score column is from Figure 2.6 in the downloaded report; the other data columns are from Table 2.1 in the same report. If a country wasn't in either list, it wasn't included in the CSV file.

We were first introduced to this dataset in `03-04-world-happiness` and `03-05-world-happiness.ipynb`. We return to the dataset for a deeper analysis!

In [1]:
import pandas as pd
import numpy as np
data1 = pd.read_csv('happiness-report.csv')
data1

Unnamed: 0,Country,Year,HappinessScore,LifeLadder,LogGDP,SocialSupport,HealthyLifeExpectancyAtBirth,FreedomToMakeLifeChoices,Generosity,PerceptionsOfCorruption,PositiveAffect,NegativeAffect,ConfidenceInNationalGovernment
0,Afghanistan,2018,3.203,2.694303,7.494588,0.507516,52.599998,0.373536,-0.084888,0.927606,0.424125,0.404904,0.364666
1,Albania,2018,4.719,5.004403,9.412399,0.683592,68.699997,0.824212,0.005385,0.899129,0.713300,0.318997,0.435338
2,Algeria,2018,5.211,5.043086,9.557952,0.798651,65.900002,0.583381,-0.172413,0.758704,0.591043,0.292946,
3,Argentina,2018,6.086,5.792797,9.809972,0.899912,68.800003,0.845895,-0.206937,0.855255,0.820310,0.320502,0.261352
4,Armenia,2018,4.559,5.062449,9.119424,0.814449,66.900002,0.807644,-0.149109,0.676826,0.581488,0.454840,0.670828
...,...,...,...,...,...,...,...,...,...,...,...,...,...
131,Venezuela,2018,4.707,5.005663,9.270281,0.886882,66.500000,0.610855,-0.176156,0.827560,0.759221,0.373658,0.260700
132,Vietnam,2018,5.175,5.295547,8.783416,0.831945,67.900002,0.909260,-0.039124,0.808423,0.692222,0.191061,
133,Yemen,2018,3.380,3.057514,,0.789422,56.700001,0.552726,,0.792587,0.461114,0.314870,0.308151
134,Zambia,2018,4.107,4.041488,8.223958,0.717720,55.299999,0.790626,0.036644,0.810731,0.702698,0.350963,0.606715


## Some observations:

* **Quality:** Some data points show `NaN` (Not a Number), meaning a value is missing where there should be one. Rows with missing values are typically removed using `dropna()`.

* **Normalization:** The columns have different ranges: some values lie between 0 and 1, while others do not. For example, `Generosity` is between -0.33 and 0.49. Un-normalized ranges can cause features to be under- or over-weighted during analysis, so the data needs to be preprocessed to a uniform scale.

* **Pre-normalization:** `LogGDP` is the logarithm of the GDP per capita. Here is a case where the data has gone through Feature Scaling _prior to publication!_ Taking the log of numbers whose range spans multiple orders of magnitude is a common technique for compressing the range. However, the result still doesn't fall within [0, 1], so a bit more Feature Scaling will be required.
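The observations above can be checked directly with pandas. The following is a minimal sketch on a made-up frame named `toy` (the values and the reduced column set are assumptions for illustration, not the real report data): it counts missing values per column and compares the range of each numeric column.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the happiness data; values are invented.
toy = pd.DataFrame({
    "Country": ["A", "B", "C", "D"],
    "LogGDP": [7.5, 9.4, np.nan, 10.8],
    "SocialSupport": [0.51, 0.68, 0.80, np.nan],
    "Generosity": [-0.08, 0.01, -0.17, 0.31],
})

# Quality: how many values are missing in each column?
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Normalization: what range does each numeric column span?
ranges = toy.select_dtypes("number").agg(["min", "max"])
print(ranges)
```

The same two calls should work unchanged on `data1`.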

### 1. Quality

First, let's try `dropna()`. Out of 136 rows, _we are left with 113_. 

How important is it for data quality to be _absolutely_ sacrosanct in our analysis? To answer this question, we wish to compare `data1` with `data1.dropna()`. Would we lose any important data by dropping rows? Create a DataFrame `data1_na` that shows what we would lose by using `dropna()`.
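One possible way to see what `dropna()` discards is a boolean mask that selects every row with at least one missing value. This is only a sketch on a made-up frame (`toy`, `kept` and `lost` are hypothetical names), and it is not the only way to build the comparison.

```python
import numpy as np
import pandas as pd

# Hypothetical data; values are invented for illustration.
toy = pd.DataFrame({
    "Country": ["A", "B", "C"],
    "LogGDP": [7.5, np.nan, 9.6],
    "Generosity": [-0.08, 0.01, np.nan],
})

kept = toy.dropna()                   # rows that survive dropna()
lost = toy[toy.isna().any(axis=1)]    # rows that dropna() would remove
print(lost)
```

Every row lands in exactly one of the two frames, so `len(kept) + len(lost)` equals `len(toy)`.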

In [2]:
# Get a sense of data quality
data1_not_na = data1.dropna()
data1_good_or_na = pd.merge(data1, data1_not_na, indicator=True, how='left')
# _good_or_na has a new column '_merge' which has values 'left_only' or 'both'. Pick 'left_only'
data1_na = data1_good_or_na.query('_merge=="left_only"').drop('_merge', axis=1)
data1_na

Unnamed: 0,Country,Year,HappinessScore,LifeLadder,LogGDP,SocialSupport,HealthyLifeExpectancyAtBirth,FreedomToMakeLifeChoices,Generosity,PerceptionsOfCorruption,PositiveAffect,NegativeAffect,ConfidenceInNationalGovernment
2,Algeria,2018,5.211,5.043086,9.557952,0.798651,65.900002,0.583381,-0.172413,0.758704,0.591043,0.292946,
18,Burundi,2018,3.775,3.775283,6.541033,0.484715,53.400002,0.646399,-0.019334,0.598608,0.666442,0.362767,
19,Cambodia,2018,4.7,5.121838,8.253352,0.794605,61.599998,0.958305,0.033787,,0.844593,0.414346,
24,China,2018,5.191,5.131434,9.694376,0.787605,69.300003,0.895378,-0.174899,,0.855784,0.18964,
30,Cyprus,2018,6.046,6.276443,,0.825573,73.699997,0.794215,,0.848337,0.750122,0.298021,0.35244
35,Egypt,2018,4.166,4.005451,9.29396,0.758824,61.700001,0.681654,-0.22293,,0.492261,0.285184,
42,Gambia,2018,4.516,4.922099,7.376554,0.6848,55.0,0.718729,,0.69107,0.804012,0.379208,0.757543
58,Jordan,2018,4.906,4.638934,9.024435,0.799544,66.800003,0.76242,-0.18349,,,,
61,Kosovo,2018,6.1,6.391826,,0.822407,65.149826,0.889737,,0.922078,0.778271,0.170248,0.347547
63,Laos,2018,4.796,4.859402,8.813603,0.704738,58.700001,0.906661,0.140599,0.63424,0.852214,0.331883,


### Quality Evaluation

Many of the countries in the "to be dropped" list are important in a geopolitical sense. For an analysis of a United Nations dataset called **World Happiness Report**, there are three options:

1. Hold the line on data quality and publish with just 113 countries in the final report,
1. Fill the missing column values with the average value for that column, or
1. (Partial.) Drop the rows with `NaN` in "objective" features such as `LogGDP`, and fill the remaining missing values with the average of their respective columns.
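The three options can be sketched side by side on a small made-up frame. The column `Confidence` and all values here are assumptions for illustration; the sketch computes the column means on the full frame before any rows are dropped.

```python
import numpy as np
import pandas as pd

# Hypothetical data; values are invented for illustration.
toy = pd.DataFrame({
    "Country": ["A", "B", "C", "D"],
    "LogGDP": [7.5, np.nan, 9.6, 10.2],
    "Confidence": [0.4, 0.6, np.nan, 0.8],
})

# Option 1: keep only the complete rows.
option1 = toy.dropna()

# Option 2: fill every numeric gap with its column mean.
col_means = toy.mean(numeric_only=True)
option2 = toy.fillna(col_means)

# Option 3: drop rows missing the "objective" LogGDP, fill the rest.
option3 = toy.dropna(subset=["LogGDP"]).fillna(col_means)

print(len(option1), len(option2), len(option3))
```

Option 3 keeps more rows than option 1 while never inventing a value for `LogGDP`.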

### 1a. Choosing the Quality Option

Which of the above quality options would you choose and why? 

_This is a judgment question. No answer is "wrong" or "preferred." How well you make your argument is what's important!_

### 1b. Filled Data

Irrespective of your answer to Q 1a, management has decided to pursue option 3. 

Prepare a DataFrame `data_rdy` that keeps only the rows with a non-NaN `LogGDP` and fills the remaining missing feature values with the averages of those features. Hint: check out the `np.nanmean()` function.
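As a quick illustration of the hint, `np.nanmean()` ignores `NaN` entries, whereas `np.mean()` propagates them. The array below is made up for the example.

```python
import numpy as np

# Hypothetical column with one missing value.
values = np.array([0.36, 0.44, np.nan, 0.26])

plain_mean = np.mean(values)         # NaN: the missing value poisons the result
nan_aware_mean = np.nanmean(values)  # mean of the three available values

print(plain_mean, nan_aware_mean)
```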

In [3]:
# Fill missing values with the column means, except in LogGDP (rows missing LogGDP are dropped later)
# numeric_only=True skips the text column 'Country', which recent pandas versions refuse to average
means = data1.mean(numeric_only=True)
# Ref: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.fillna.html
data1_trimmed = data1.fillna(value={k: means[k] for k in means.keys() if k != 'LogGDP'})
data1_trimmed

Unnamed: 0,Country,Year,HappinessScore,LifeLadder,LogGDP,SocialSupport,HealthyLifeExpectancyAtBirth,FreedomToMakeLifeChoices,Generosity,PerceptionsOfCorruption,PositiveAffect,NegativeAffect,ConfidenceInNationalGovernment
0,Afghanistan,2018,3.203,2.694303,7.494588,0.507516,52.599998,0.373536,-0.084888,0.927606,0.424125,0.404904,0.364666
1,Albania,2018,4.719,5.004403,9.412399,0.683592,68.699997,0.824212,0.005385,0.899129,0.713300,0.318997,0.435338
2,Algeria,2018,5.211,5.043086,9.557952,0.798651,65.900002,0.583381,-0.172413,0.758704,0.591043,0.292946,0.495120
3,Argentina,2018,6.086,5.792797,9.809972,0.899912,68.800003,0.845895,-0.206937,0.855255,0.820310,0.320502,0.261352
4,Armenia,2018,4.559,5.062449,9.119424,0.814449,66.900002,0.807644,-0.149109,0.676826,0.581488,0.454840,0.670828
...,...,...,...,...,...,...,...,...,...,...,...,...,...
131,Venezuela,2018,4.707,5.005663,9.270281,0.886882,66.500000,0.610855,-0.176156,0.827560,0.759221,0.373658,0.260700
132,Vietnam,2018,5.175,5.295547,8.783416,0.831945,67.900002,0.909260,-0.039124,0.808423,0.692222,0.191061,0.495120
133,Yemen,2018,3.380,3.057514,,0.789422,56.700001,0.552726,-0.029086,0.792587,0.461114,0.314870,0.308151
134,Zambia,2018,4.107,4.041488,8.223958,0.717720,55.299999,0.790626,0.036644,0.810731,0.702698,0.350963,0.606715


In [4]:
# Drop the rows still missing LogGDP, then separate the columns to be normalized
data1_trimmed = data1_trimmed.dropna()
pd_data = data1_trimmed.drop(['Country', 'Year'], axis=1)
pd_data_to_normalize = data1_trimmed.drop(['Country', 'Year', 'LogGDP'], axis=1)
np_data = pd_data_to_normalize.to_numpy()
np_cols = list(pd_data_to_normalize)
np_rows = list(data1_trimmed['Country'])

def reconstruct_pd(np_data, np_rows, np_cols):
    return pd.concat([
        pd.DataFrame(np_rows, columns=['Country']), 
        pd.DataFrame(np_data, dtype='float32', columns=np_cols)], axis=1)

data_rdy = reconstruct_pd(np_data, np_rows, np_cols).reindex()
data_rdy

Unnamed: 0,Country,HappinessScore,LifeLadder,SocialSupport,HealthyLifeExpectancyAtBirth,FreedomToMakeLifeChoices,Generosity,PerceptionsOfCorruption,PositiveAffect,NegativeAffect,ConfidenceInNationalGovernment
0,Afghanistan,3.203,2.694303,0.507516,52.599998,0.373536,-0.084888,0.927606,0.424125,0.404904,0.364666
1,Albania,4.719,5.004403,0.683592,68.699997,0.824212,0.005385,0.899129,0.713300,0.318997,0.435338
2,Algeria,5.211,5.043086,0.798651,65.900002,0.583381,-0.172413,0.758704,0.591043,0.292946,0.495120
3,Argentina,6.086,5.792797,0.899912,68.800003,0.845895,-0.206937,0.855255,0.820310,0.320502,0.261352
4,Armenia,4.559,5.062449,0.814449,66.900002,0.807644,-0.149109,0.676826,0.581488,0.454840,0.670828
...,...,...,...,...,...,...,...,...,...,...,...
122,Uzbekistan,6.174,6.205460,0.920821,65.099998,0.969898,0.311695,0.520360,0.825422,0.208660,0.969356
123,Venezuela,4.707,5.005663,0.886882,66.500000,0.610855,-0.176156,0.827560,0.759221,0.373658,0.260700
124,Vietnam,5.175,5.295547,0.831945,67.900002,0.909260,-0.039124,0.808423,0.692222,0.191061,0.495120
125,Zambia,4.107,4.041488,0.717720,55.299999,0.790626,0.036644,0.810731,0.702698,0.350963,0.606715


### 2. Normalizing the data

The process of making all columns uniform in scale is referred to as **Feature Scaling**. Many data analysis libraries require it (see [_The Importance of Feature Scaling_](https://scikit-learn.org/stable/auto_examples/preprocessing/plot_scaling_importance.html)). Scikit-Learn offers an extensive library for [preprocessing data](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.preprocessing). Pay particular attention to the `*Scaler` classes, and also to discretizers and transformers such as `Binarizer`, `KBinsDiscretizer` and `QuantileTransformer`.
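To make the idea concrete, here is a minimal NumPy sketch of standardization, the kind of scaling `StandardScaler` performs: each column has its mean subtracted and is divided by its standard deviation. The matrix `X` is made up, and this is a hand-written illustration rather than the scikit-learn implementation.

```python
import numpy as np

# Hypothetical feature matrix: rows are countries, columns are features.
X = np.array([[52.6, 0.37],
              [68.7, 0.82],
              [65.9, 0.58],
              [68.8, 0.85]])

col_mean = X.mean(axis=0)
col_std = X.std(axis=0)            # population standard deviation (ddof=0)
X_std = (X - col_mean) / col_std   # standardized columns

print(X_std.mean(axis=0), X_std.std(axis=0))
```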

Use `StandardScaler` to scale all features except `Country`, `Year` and `LogGDP`.

In [5]:
from sklearn.preprocessing import StandardScaler, RobustScaler
scaler = StandardScaler()
np_norm = scaler.fit_transform(np.array(pd_data_to_normalize))
data1_scaled = reconstruct_pd(np_norm, np_rows, np_cols).reindex()
data1_scaled

Unnamed: 0,Country,HappinessScore,LifeLadder,SocialSupport,HealthyLifeExpectancyAtBirth,FreedomToMakeLifeChoices,Generosity,PerceptionsOfCorruption,PositiveAffect,NegativeAffect,ConfidenceInNationalGovernment
0,Afghanistan,-2.019505,-2.551520,-2.526380,-1.799854,-3.499283,-0.359735,1.087317,-2.628350,1.268071,-0.694985
1,Albania,-0.642061,-0.442753,-1.044773,0.630625,0.337264,0.222222,0.925847,0.016003,0.268569,-0.323426
2,Algeria,-0.195028,-0.407441,-0.076593,0.207934,-1.712902,-0.923980,0.129587,-1.101972,-0.034526,-0.009117
3,Argentina,0.600001,0.276930,0.775471,0.645722,0.521843,-1.146541,0.677065,0.994557,0.286083,-1.238163
4,Armenia,-0.787438,-0.389766,0.056338,0.358895,0.196217,-0.773745,-0.334688,-1.189350,1.849055,0.914672
...,...,...,...,...,...,...,...,...,...,...,...
122,Uzbekistan,0.679958,0.653628,0.951415,0.087164,1.577465,2.196894,-1.221905,1.041302,-1.015151,2.484200
123,Venezuela,-0.652964,-0.441602,0.665833,0.298510,-1.479014,-0.948107,0.520024,0.435936,0.904527,-1.241594
124,Vietnam,-0.227738,-0.176983,0.203562,0.509857,1.061261,-0.064710,0.411510,-0.176737,-1.219908,-0.009117
125,Zambia,-1.198127,-1.321746,-0.757594,-1.392258,0.051348,0.423740,0.424599,-0.080947,0.640482,0.577598


### 3. Examine scaled features

Examine min, mean and max values of each of the scaled features. Why are the min and max values not 0 and 1 respectively?
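One way to pull these three statistics for every feature at once is `DataFrame.agg`. The sketch below uses a small made-up frame (`toy_scaled` is a hypothetical name, and its values are invented); the same call could be applied to the scaled data after dropping the `Country` column.

```python
import pandas as pd

# Hypothetical scaled data; values are invented for illustration.
toy_scaled = pd.DataFrame({
    "Country": ["A", "B", "C"],
    "HappinessScore": [-1.2, 0.1, 1.1],
    "Generosity": [0.9, -1.4, 0.5],
})

# One row per statistic, one column per feature.
summary = toy_scaled.drop(columns="Country").agg(["min", "mean", "max"])
print(summary)
```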

**Your Answer**

---

---

---

### 4. Cluster the data into 2 clusters.

The clustering will return a label of 0 or 1 for each row. We won't yet know what 0 and 1 stand for. Show `data1_scaled` along with the returned cluster labels. `kmeans.labels_` gives the labels.
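As a sketch of the mechanics, the labels returned by `KMeans` can be attached to a frame as a new column with `DataFrame.assign`. The frame `toy`, its feature columns `f1` and `f2`, and the explicit `n_init=10` are assumptions made for this illustration only.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical scaled data with two well-separated groups.
toy = pd.DataFrame({
    "Country": ["A", "B", "C", "D"],
    "f1": [-1.0, -0.9, 1.0, 1.1],
    "f2": [-1.1, -1.0, 0.9, 1.0],
})

features = toy[["f1", "f2"]].to_numpy()
toy_kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(features)

# labels_ has one entry per row, in the same order as the input.
toy_with_clusters = toy.assign(Cluster=toy_kmeans.labels_)
print(toy_with_clusters)
```

Which group receives 0 and which receives 1 is arbitrary, which is why the labels need interpreting afterwards.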

In [6]:
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=2, random_state=0).fit(np_norm)
kmeans.labels_

array([0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 1,
       0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1,
       1, 1, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 1,
       1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0,
       1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 0, 1, 1,
       0, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0], dtype=int32)

### Assigning Cluster Labels

We need to assign descriptive labels to the numeric cluster values. The scikit-learn class `LabelEncoder` is designed for this purpose. Its usage is shown in the next cell.

In [7]:
from sklearn.preprocessing import LabelEncoder
from sklearn.utils import column_or_1d

le = LabelEncoder()
le.fit(['unhappy', 'happy'])
# le.fit(['\U0001F626', '\U0001F601'])

def clustered_pd(np_data, np_rows, np_cols, labels, le):
    return pd.concat([
        pd.DataFrame(np_rows, columns=['Country']), 
        pd.DataFrame(le.inverse_transform(labels), columns=['Cluster']), 
        pd.DataFrame(np_data, dtype='float32', columns=np_cols)], axis=1)

print (clustered_pd(np_norm, np_rows, np_cols, kmeans.labels_, le).loc[:,['Country', 'Cluster']])

         Country  Cluster
0    Afghanistan    happy
1        Albania    happy
2        Algeria    happy
3      Argentina  unhappy
4        Armenia    happy
..           ...      ...
122   Uzbekistan  unhappy
123    Venezuela    happy
124      Vietnam  unhappy
125       Zambia    happy
126     Zimbabwe    happy

[127 rows x 2 columns]


### 5. Assigning Cluster Labels

The labels produced above are **wrong**. Cluster value 0 is being translated as 'happy', even though that cluster contains low-scoring countries such as Afghanistan. Fix the code in the above cell to correct this error.

### 6. Make 5 clusters

Based on the above exercise, cluster `data1_scaled` into 5 classes, giving values 1 &hellip; 5 to the appropriate rows.

In [8]:
from sklearn.cluster import KMeans
n_clusters = 5
kmeans = KMeans(n_clusters=n_clusters, random_state=0).fit(np_norm)
kmeans.labels_

array([3, 2, 4, 0, 2, 1, 1, 2, 2, 4, 0, 3, 0, 4, 2, 0, 4, 3, 3, 2, 3, 1,
       3, 0, 0, 0, 3, 3, 0, 4, 0, 1, 0, 0, 4, 0, 0, 2, 1, 0, 4, 2, 4, 1,
       4, 0, 3, 3, 0, 2, 2, 3, 1, 0, 4, 3, 0, 4, 0, 2, 2, 2, 4, 4, 3, 4,
       1, 4, 3, 3, 2, 3, 4, 0, 0, 4, 4, 4, 3, 2, 2, 2, 2, 1, 1, 0, 3, 2,
       1, 3, 0, 0, 2, 0, 0, 4, 2, 0, 2, 4, 3, 0, 0, 2, 4, 0, 2, 2, 1, 1,
       2, 0, 3, 4, 4, 2, 3, 4, 0, 1, 0, 0, 1, 4, 0, 2, 2], dtype=int32)

## Automatically assigning labels

Replace the arbitrary numeric cluster labels with descriptive, ordered labels. Sometimes the values in a column hint at cluster identity. For example, in the data above, the mean `HappinessScore` of each cluster could be used to rank the clusters and label the rows accordingly.
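The idea can be sketched with plain pandas, independently of any encoder class: compute each cluster's mean score, rank the clusters by that mean, and map every row's cluster id to its rank. The frame `toy` and the `Happy_<rank>` naming are assumptions for this illustration.

```python
import pandas as pd

# Hypothetical data: cluster ids are arbitrary, as returned by a clusterer.
toy = pd.DataFrame({
    "HappinessScore": [3.2, 3.4, 7.1, 7.3, 5.0, 5.2],
    "Cluster":        [2,   2,   0,   0,   1,   1],
})

# Mean score per cluster, then rank the clusters (1 = lowest mean).
cluster_means = toy.groupby("Cluster")["HappinessScore"].mean()
rank = cluster_means.rank(method="first").astype(int)

# Translate each row's arbitrary cluster id into an ordered label.
toy["Label"] = toy["Cluster"].map(lambda c: f"Happy_{rank[c]}")
print(toy)
```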

In [9]:
from sklearn.preprocessing import LabelEncoder
from sklearn.utils import column_or_1d

class OrderedLabelEncoder(LabelEncoder):
    # Keeps classes in first-seen order instead of sorting them.
    # Reference: https://stackoverflow.com/questions/58893912/
    def fit(self, y):
        y = column_or_1d(y, warn=True)
        self.classes_ = pd.Series(y).unique()
        return self

# Replace numeric labels with descriptive labels
ole = OrderedLabelEncoder()
ole.fit(list(range(n_clusters)))
cpd = clustered_pd(np_norm, np_rows, np_cols, kmeans.labels_, ole)

means = cpd.groupby('Cluster')['HappinessScore'].mean()
new_indexes = sorted(range(n_clusters), key=lambda k: means[k])

# Map each original cluster id to its rank (1 = lowest mean HappinessScore)
dd = {new_indexes[i]: i + 1 for i in range(n_clusters)}
ole = OrderedLabelEncoder()
ole.fit(['Happy_' + str(dd[i]) for i in range(n_clusters)])

cpd = clustered_pd(np_norm, np_rows, np_cols, kmeans.labels_, ole)
sorted_indexes = cpd.groupby('Cluster')['HappinessScore'].mean().sort_values(ascending=True)
cpd

Unnamed: 0,Country,Cluster,HappinessScore,LifeLadder,SocialSupport,HealthyLifeExpectancyAtBirth,FreedomToMakeLifeChoices,Generosity,PerceptionsOfCorruption,PositiveAffect,NegativeAffect,ConfidenceInNationalGovernment
0,Afghanistan,Happy_1,-2.019505,-2.551520,-2.526380,-1.799854,-3.499283,-0.359735,1.087317,-2.628350,1.268071,-0.694985
1,Albania,Happy_2,-0.642061,-0.442753,-1.044773,0.630625,0.337264,0.222222,0.925847,0.016003,0.268569,-0.323426
2,Algeria,Happy_3,-0.195028,-0.407441,-0.076593,0.207934,-1.712902,-0.923980,0.129587,-1.101972,-0.034526,-0.009117
3,Argentina,Happy_4,0.600001,0.276930,0.775471,0.645722,0.521843,-1.146541,0.677065,0.994557,0.286083,-1.238163
4,Armenia,Happy_2,-0.787438,-0.389766,0.056338,0.358895,0.196217,-0.773745,-0.334688,-1.189350,1.849055,0.914672
...,...,...,...,...,...,...,...,...,...,...,...,...
122,Uzbekistan,Happy_5,0.679958,0.653628,0.951415,0.087164,1.577465,2.196894,-1.221905,1.041302,-1.015151,2.484200
123,Venezuela,Happy_3,-0.652964,-0.441602,0.665833,0.298510,-1.479014,-0.948107,0.520024,0.435936,0.904527,-1.241594
124,Vietnam,Happy_4,-0.227738,-0.176983,0.203562,0.509857,1.061261,-0.064710,0.411510,-0.176737,-1.219908,-0.009117
125,Zambia,Happy_2,-1.198127,-1.321746,-0.757594,-1.392258,0.051348,0.423740,0.424599,-0.080947,0.640482,0.577598


# When you're done, submit the notebook

1. **Run all the cells in order.**

2. Submit the notebook by saving it as PDF. 
    * In the cluster environment, it's File | Print (Save as PDF) and submit to [Gradescope](https://www.gradescope.com/courses/182658)<sup>&dagger;</sup>, 
    * On other versions, it may be File | Download As (PDF) and then submit to [Gradescope](https://www.gradescope.com/courses/182658)<sup>&dagger;</sup>.

<sup>&dagger;</sup>To submit to Gradescope, log into the website, add course 9W7PW3 (if not already added) and submit. The assignment name should match the name of this notebook.

![The end](https://live.staticflickr.com/32/89187454_3ae6aded89_b.jpg)