# Predict People Satisfaction Across the Globe

Problem Statement:

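We would like to build a model that predicts the life satisfaction score of people in different countries from a country-level indicator such as GDP per capita. The modelling cells below use air pollution as the single input feature.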

# Download Dataset

Download the Better Life Index data (the latest edition at the time of writing is 2017) from the [OECD’s website](http://homl.info/4) and the GDP per capita statistics from the [IMF’s website](http://homl.info/5). Then join the two tables and sort by GDP per capita. Table 1-1 shows an excerpt of the result.

In [None]:
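%%time
# %%time is a cell magic, so it has to be the first line of the cell
# Fetch the data files from Google Drive
!pip install googledrivedownloader  # helper package for downloading files from Google Drive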
from google_drive_downloader import GoogleDriveDownloader as gdd

gdd.download_file_from_google_drive(file_id="1AavEtw38uMBhcDLmejYU_L8LVbbrWtlh",
                                    dest_path="./BLI_30012019054825599.xlsx",
                                    unzip=False)

gdd.download_file_from_google_drive(file_id="1Z0hri7DaH37aH-VGmyC5G2Uo240NfFms",
                                    dest_path="./WEO_Data.xlsx",
                                    unzip=False)

Downloading 1AavEtw38uMBhcDLmejYU_L8LVbbrWtlh into ./BLI_30012019054825599.xlsx... Done.
Downloading 1Z0hri7DaH37aH-VGmyC5G2Uo240NfFms into ./WEO_Data.xlsx... Done.
CPU times: user 71 ms, sys: 17.5 ms, total: 88.6 ms
Wall time: 6.01 s


# Load data as dataframes

In [None]:
import pandas as pd
from IPython.display import display

In [None]:
#load excel as dataframe
bli = pd.read_excel("BLI_30012019054825599.xlsx", sheet_name=0)
display(bli) #inspect if the dataframe is loaded correctly

# Remove rows where inequality has values other than TOT
bli = bli[bli["INEQUALITY"]=="TOT"]

# Pivot from long to wide format: one row per country, one column per indicator
bli = bli.pivot(index="Country", columns="Indicator", values="Value")

bli["Life satisfaction"].head()

Unnamed: 0,LOCATION,Country,INDICATOR,Indicator,MEASURE,Measure,INEQUALITY,Inequality,Unit Code,Unit,PowerCode Code,PowerCode,Reference Period Code,Reference Period,Value,Flag Codes,Flags
0,AUS,Australia,JE_LMIS,Labour market insecurity,L,Value,TOT,Total,PC,Percentage,0.0,Units,,,4.300000,,
1,AUS,Australia,CG_SENG,Stakeholder engagement for developing regulations,L,Value,TOT,Total,AVSCORE,Average score,0.0,Units,,,2.700000,,
2,AUS,Australia,CG_SENG,Stakeholder engagement for developing regulations,L,Value,MN,Men,AVSCORE,Average score,0.0,Units,,,2.700000,E,Estimated value
3,AUS,Australia,CG_SENG,Stakeholder engagement for developing regulations,L,Value,WMN,Women,AVSCORE,Average score,0.0,Units,,,2.700000,E,Estimated value
4,AUS,Australia,PS_FSAFEN,Feeling safe walking alone at night,L,Value,TOT,Total,PC,Percentage,0.0,Units,,,63.600000,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3394,USA,United States,WL_EWLH,Employees working very long hours,L,Value,WMN,Women,PC,Percentage,0.0,Units,,,6.860000,,
3395,USA,United States,WL_TNOW,Time devoted to leisure and personal care,L,Value,TOT,Total,HOUR,Hours,0.0,Units,,,14.440000,,
3396,USA,United States,WL_TNOW,Time devoted to leisure and personal care,L,Value,MN,Men,HOUR,Hours,0.0,Units,,,14.480000,,
3397,USA,United States,WL_TNOW,Time devoted to leisure and personal care,L,Value,WMN,Women,HOUR,Hours,0.0,Units,,,14.390000,,


Country
Australia    7.3
Austria      7.0
Belgium      6.9
Brazil       6.6
Canada       7.3
Name: Life satisfaction, dtype: float64
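To make the filter-and-pivot step above concrete, here is a minimal sketch on a tiny made-up frame shaped like the BLI file. The countries `A` and `B` and all values are hypothetical. It also shows why the `TOT` filter has to come first: without it, each country/indicator pair appears more than once and `pivot` refuses to reshape.

```python
import pandas as pd

# Hypothetical long-format excerpt shaped like the BLI file (values are made up).
long_df = pd.DataFrame({
    "Country":    ["A", "A", "A", "B", "B", "B"],
    "Indicator":  ["Life satisfaction", "Life satisfaction", "Air pollution",
                   "Life satisfaction", "Life satisfaction", "Air pollution"],
    "INEQUALITY": ["TOT", "MN", "TOT", "TOT", "WMN", "TOT"],
    "Value":      [7.0, 7.1, 12.0, 6.0, 5.9, 20.0],
})

# Without the filter, ("A", "Life satisfaction") occurs twice, so pivot fails.
try:
    long_df.pivot(index="Country", columns="Indicator", values="Value")
    pivot_failed = False
except ValueError:
    pivot_failed = True

# Keep only the totals, then reshape to one row per country.
totals = long_df[long_df["INEQUALITY"] == "TOT"]
wide = totals.pivot(index="Country", columns="Indicator", values="Value")
print(wide)
```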

# Import WEO data

In [None]:
#load excel as dataframe
weo = pd.read_excel("WEO_Data.xlsx", sheet_name=0)
display(weo) #inspect if the dataframe is loaded correctly

# Drop the row with index label 0.
# Note: in the frame displayed above, label 0 is Afghanistan rather than a header row.
weo = weo.reindex(weo.index.drop(0))

# 1- Keep only the Country and 2015 columns
# 2- Rename the 2015 column to "GDP per capita"
weo = weo[['Country','2015']].rename(columns={'2015':'GDP per capita'})

# Set Country as the index column.
# inplace=True modifies weo directly instead of returning a new DataFrame.
weo.set_index("Country", inplace=True)

#weo.drop_duplicates(inplace=True)
# Show the first 5 rows
weo.head()

Unnamed: 0,Country,Subject Descriptor,Units,Scale,Country/Series-specific Notes,2015,Estimates Start After
0,Afghanistan,"Gross domestic product per capita, current prices",U.S. dollars,Units,"See notes for: Gross domestic product, curren...",599.994,2013.0
1,Albania,"Gross domestic product per capita, current prices",U.S. dollars,Units,"See notes for: Gross domestic product, curren...",3995.380,2010.0
2,Algeria,"Gross domestic product per capita, current prices",U.S. dollars,Units,"See notes for: Gross domestic product, curren...",4318.140,2014.0
3,Angola,"Gross domestic product per capita, current prices",U.S. dollars,Units,"See notes for: Gross domestic product, curren...",4100.320,2014.0
4,Antigua and Barbuda,"Gross domestic product per capita, current prices",U.S. dollars,Units,"See notes for: Gross domestic product, curren...",14414.300,2011.0
...,...,...,...,...,...,...,...
186,Yemen,"Gross domestic product per capita, current prices",U.S. dollars,Units,"See notes for: Gross domestic product, curren...",1302.940,2008.0
187,Zambia,"Gross domestic product per capita, current prices",U.S. dollars,Units,"See notes for: Gross domestic product, curren...",1350.150,2010.0
188,Zimbabwe,"Gross domestic product per capita, current prices",U.S. dollars,Units,"See notes for: Gross domestic product, curren...",1064.350,2012.0
189,,,,,,,


Country,GDP per capita
Albania,3995.38
Algeria,4318.14
Angola,4100.32
Antigua and Barbuda,14414.3
Argentina,13588.85
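The displayed WEO frame ends with a row that is entirely empty (label 189). As a hedged alternative to the cell above, the sketch below cleans a small excerpt by dropping rows with no country name and chaining the select, rename, and index steps without `inplace`. The frame `raw` is a hypothetical three-row stand-in that reuses two values from the display; it is not the real file.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the WEO sheet: two real-looking rows plus an empty trailing row.
raw = pd.DataFrame({
    "Country": ["Albania", "Algeria", np.nan],
    "Units":   ["U.S. dollars", "U.S. dollars", np.nan],
    "2015":    [3995.38, 4318.14, np.nan],
})

gdp = (
    raw.dropna(subset=["Country"])                  # drop rows that have no country name
       .loc[:, ["Country", "2015"]]                 # keep only the columns we need
       .rename(columns={"2015": "GDP per capita"})  # give the year column a descriptive name
       .set_index("Country")                        # index by country for the later join
)
print(gdp)
```

Each method returns a new DataFrame, so the original `raw` stays untouched and the steps can be read top to bottom.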


In [None]:
bli["Life satisfaction"].head()

Country
Australia    7.3
Austria      7.0
Belgium      6.9
Brazil       6.6
Canada       7.3
Name: Life satisfaction, dtype: float64

# Merge/Join dataset

In [None]:
df = pd.merge(left = weo, right = bli, left_index=True, right_index=True)
df.sort_values(by="GDP per capita", inplace=True)
df.head()


Country,GDP per capita,Air pollution,Dwellings without basic facilities,Educational attainment,Employees working very long hours,Employment rate,Feeling safe walking alone at night,Homicide rate,Household net adjusted disposable income,Household net financial wealth,Housing expenditure,Labour market insecurity,Life expectancy,Life satisfaction,Long-term unemployment rate,Personal earnings,Quality of support network,Rooms per person,Self-reported health,Stakeholder engagement for developing regulations,Student skills,Time devoted to leisure and personal care,Voter turnout,Water quality,Years in education
South Africa,5694.57,22.0,37.0,43.0,18.68,43.0,36.1,10.0,10872.0,17042.0,18.0,26.5,57.4,4.8,15.57,11554.0,88.0,0.7,67.0,1.6,391.0,14.73,73.0,69.0,15.3
Brazil,8670.0,10.0,6.7,49.0,7.15,64.0,37.3,27.6,12227.0,7102.0,20.0,4.9,74.7,6.6,3.37,14024.0,90.0,0.8,70.0,2.2,395.0,14.45,79.0,72.0,15.9
Mexico,9009.28,16.0,4.2,37.0,29.48,61.0,45.9,17.9,13891.0,4750.0,21.0,4.6,75.0,6.6,0.08,15311.0,80.0,1.0,66.0,3.5,416.0,12.74,63.0,67.0,14.8
Russia,9054.91,15.0,13.8,95.0,0.16,70.0,52.2,11.3,16657.0,2260.0,19.0,3.6,71.3,6.0,1.64,22101.0,90.0,1.0,43.0,0.8,492.0,14.9,65.0,54.0,16.1
Turkey,9437.37,20.0,6.5,39.0,33.77,51.0,60.6,1.7,17067.0,4429.0,20.0,13.0,78.0,5.5,2.24,22848.0,86.0,1.0,66.0,2.1,425.0,12.59,85.0,63.0,17.9
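`pd.merge` with `left_index=True, right_index=True` performs an inner join by default, so only countries present in both tables survive. A minimal sketch with two tiny frames (a few values are borrowed from the outputs above; the frames themselves are illustrative, not the real data):

```python
import pandas as pd

# Tiny illustrative frames indexed by country.
gdp = pd.DataFrame(
    {"GDP per capita": [9009.28, 8670.0, 599.994]},
    index=pd.Index(["Mexico", "Brazil", "Afghanistan"], name="Country"),
)
sat = pd.DataFrame(
    {"Life satisfaction": [6.6, 6.6, 7.3]},
    index=pd.Index(["Brazil", "Mexico", "Australia"], name="Country"),
)

# Inner join on the index: Afghanistan and Australia have no partner row and are dropped.
merged = pd.merge(left=gdp, right=sat, left_index=True, right_index=True)
merged = merged.sort_values(by="GDP per capita")
print(merged)
```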


In [None]:
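# Hold out 7 of the 36 rows (by position) as the test set; the rest is the training set
test_indices = [0, 1, 6, 8, 33, 34, 35]
train_indices = list(set(range(36)) - set(test_indices))

# Keep only the feature (air pollution) and the target (life satisfaction)
train = df[["Air pollution", 'Life satisfaction']].iloc[train_indices]
test = df[["Air pollution", 'Life satisfaction']].iloc[test_indices]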



In [None]:
test


Country,Air pollution,Life satisfaction
Luxembourg,12,6.9
Hungary,19,5.3
Czech Republic,20,6.6
Greece,18,5.2
Switzerland,15,7.5
Brazil,10,6.6
Mexico,16,6.6
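The split above is a plain set difference over row positions. The sketch below repeats it without any data to check the sizes and that the two sets do not overlap. Using `sorted` instead of `list` is a small assumption-free tweak: Python does not guarantee the iteration order of a set, and sorting makes the row order of the training set explicit.

```python
n_rows = 36  # number of rows assumed by range(36) in the cell above
test_indices = [0, 1, 6, 8, 33, 34, 35]

# Every position that is not a test position becomes a training position.
train_indices = sorted(set(range(n_rows)) - set(test_indices))

print(len(train_indices), len(test_indices))  # 29 7
overlap = set(train_indices) & set(test_indices)
print(overlap)  # set()
```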


In [None]:
# Code example
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import sklearn
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import SGDRegressor
# Prepare the data
X = np.c_[train["Air pollution"]]
y = np.c_[train["Life satisfaction"]]

# Visualize the data
#df.plot(kind='scatter', x="GDP per capita", y='Life satisfaction')
#plt.show()

# Build a LinearRegression model
model_NormalEquation= ....

# Build an SGDRegressor model (keep the maximum number of iterations at 300, max_iter=300, and the initial learning rate at 0.0001, eta0=0.0001)
model_SGD = ....

# Make a prediction for Cyprus
X_new = [[18]]  # feature value for Cyprus: air pollution, not GDP per capita
print(model_NormalEquation.predict(X_new)) # prints [[6.25469406]] in the run below
print(model_SGD.predict(X_new)) # prints [7.40059138] in the run below

[[6.25469406]]
[7.40059138]


  y = column_or_1d(y, warn=True)
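The `column_or_1d(y, warn=True)` line in the output above appears to be the tail of scikit-learn's `DataConversionWarning`, which is emitted when a column vector is passed as `y` to an estimator that expects a 1-D array. A minimal sketch of the shape difference, using a few life satisfaction values from the earlier output; passing the flattened version to the SGD model is an assumption about the intended fix, not something the original cell does:

```python
import numpy as np

# A few life satisfaction values taken from the output shown earlier.
values = np.array([7.3, 7.0, 6.9, 6.6, 7.3])

y_col = np.c_[values]    # column vector, shape (5, 1): what the cell above builds
y_flat = y_col.ravel()   # 1-D array, shape (5,): the shape the warning asks for

print(y_col.shape, y_flat.shape)
```

The same shape difference would explain why one prediction prints as `[[6.25469406]]` (2-D target) and the other as `[7.40059138]` (target flattened internally).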


In [None]:
# Make a prediction for our test data
# Predict using your Normal Equation model
pred_NE = ....(test['Air pollution'].values.reshape(-1,1))

# Predict using your SGD model
pred_SGD = ...(test['Air pollution'].values.reshape(-1,1))

In [None]:
pred_NE.ravel()

array([6.65395503, 6.18815056, 6.12160706, 6.25469406, 6.45432454,
       6.78704202, 6.38778105])

In [None]:
pred_SGD


array([4.96330678, 7.80680547, 8.21301957, 7.40059138, 6.18194908,
       4.15087858, 6.58816318])

In [None]:
from sklearn.metrics import mean_squared_error
# Fill in the blank and calculate the error (this is the mean squared error, MSE)
MSE_NE = mean_squared_error(test['Life satisfaction'], ....)
MSE_NE

0.48057823693051915

In [None]:
# Fill in the blank and calculate the error (this is the mean squared error, MSE)

MSE_SGD = mean_squared_error(test['Life satisfaction'], ....)
MSE_SGD

3.60212613035068
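As a sanity check, both error values above can be reproduced by hand from the printed outputs, since the mean squared error is just the average of the squared differences between targets and predictions. The arrays below are copied from the test table and the two prediction outputs shown earlier; only NumPy is used.

```python
import numpy as np

# Life satisfaction of the 7 test countries, in the order of the test table above.
y_true = np.array([6.9, 5.3, 6.6, 5.2, 7.5, 6.6, 6.6])

# Predictions copied from the two outputs above.
pred_ne = np.array([6.65395503, 6.18815056, 6.12160706, 6.25469406,
                    6.45432454, 6.78704202, 6.38778105])
pred_sgd = np.array([4.96330678, 7.80680547, 8.21301957, 7.40059138,
                     6.18194908, 4.15087858, 6.58816318])

def mse(y, y_hat):
    """Mean squared error: average of the squared residuals."""
    return float(np.mean((y - y_hat) ** 2))

mse_ne = mse(y_true, pred_ne)
mse_sgd = mse(y_true, pred_sgd)
print(round(mse_ne, 4), round(mse_sgd, 4))
```

The results agree with the reported values (about 0.4806 and 3.6021) up to the rounding of the printed predictions.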

# Discussion & Conclusions:

Compare the results of M
Admittedly, this is a very small dataset: the test set contains only 7 countries. The point, however, is that different algorithms perform differently, and it is up to you to try different models on a given dataset and pick the best one!

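As you know, the Normal Equation computes the least-squares solution in closed form, so on the training data it reaches the global minimum of the cost function exactly. SGD instead approaches that minimum through a sequence of small updates on randomly picked samples, and with a limited number of iterations and a small learning rate it may stop short of it. When the dataset is small enough for the Normal Equation to be practical, it will therefore usually give the better fit, as it does here (test MSE of about 0.48 versus 3.60).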
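A minimal sketch of this comparison on synthetic data, using NumPy only. The data, the coefficients `7.5` and `-0.07`, and the hand-written SGD loop are all assumptions for illustration; this is plain SGD with a constant learning rate, not scikit-learn's `SGDRegressor`, which adds regularization and a learning rate schedule by default.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data on a scale similar to the air pollution feature (made up).
x = rng.uniform(5, 25, size=30)
y = 7.5 - 0.07 * x + rng.normal(0, 0.5, size=30)

X_b = np.c_[np.ones_like(x), x]  # add a bias column of ones

# Closed-form least-squares solution (what the Normal Equation computes).
theta_ne, *_ = np.linalg.lstsq(X_b, y, rcond=None)

# Plain SGD: 300 passes over the data, one sample per update, constant step size.
theta_sgd = np.zeros(2)
eta = 0.0001
for epoch in range(300):
    for i in rng.permutation(len(x)):
        error = X_b[i] @ theta_sgd - y[i]
        theta_sgd -= eta * 2 * error * X_b[i]  # gradient of the squared error of sample i

def train_mse(theta):
    """Mean squared error of the linear model theta on the training data."""
    return float(np.mean((X_b @ theta - y) ** 2))

print(train_mse(theta_ne), train_mse(theta_sgd))
```

On the training data the closed-form solution can never have a higher MSE than the SGD estimate, because it is the minimizer of exactly that quantity. On held-out data the ordering is not guaranteed, which is why the comparison still has to be run.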