[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1CkvVzG1NtJABPkPRofGLT-MXUCj7veLo#scrollTo=gseS97WXHRAJ)


# Predict People Satisfaction Across the Globe

**Objective:**
We would like to build a model that predicts the life satisfaction score of people in different countries from each country's GDP per capita.

# Note 1: How to enable code completion

Tools menu ==> Settings ==> Editor ==> enable "Automatically trigger code completions"




# Note 2: Instructions to create a copy of this notebook for yourself

You do not have write access to this notebook.

* From the menu bar, go to File
* Select "Save a copy in my Drive"
* Navigate to Google Drive
* Find the folder named "Colab Notebooks" and open it to find your notebook
* Rename it and start making changes

**Note:** If your code needs to read a file, copy that file from the instructor's folder to your own Google Drive by following the steps below:

* Right-click the file name
* Select "Make a copy"
* Click on the new file
* Move it to the desired folder, preferably the one that holds your notebook

# Download Dataset

Download the Better Life Index data (latest edition, currently 2022) from the [OECD’s website](http://homl.info/4) and the GDP per capita statistics from the [IMF’s website](http://homl.info/5). Then join the two tables and sort them by GDP per capita.

# Import Dataset to Google Colab

1. Download the CSV and XLS files to your computer
2. Upload them to your Google Drive
3. Open the files with Google Sheets so that Google creates a copy of each dataset in Google Sheets format
4. You can then remove the CSV and XLS files from your Drive
5. Follow the step-by-step guide [here](https://colab.research.google.com/notebooks/io.ipynb#scrollTo=vz-jH8T_Uk2c) and scroll down to the **"Google Sheets"** cell to import the data into a dataframe

NOTE: After creating the Google Sheet in your Drive, convert the 2015 column to the 0.00 number format before importing it into Colab. Otherwise Google will import the values as strings and you will have a hard time cleaning the data.




In [6]:
# Run the line below the first time to install gspread. Once it is installed, comment it out for future runs.
#!pip install --upgrade -q gspread
from google.colab import auth
auth.authenticate_user()

import gspread
from google.auth import default
creds, _ = default()

gc = gspread.authorize(creds)

# Use gc to open Google Sheet Datasets

In [7]:
# Open the Google Sheet that holds the BLI data
worksheet = gc.open('BLI_26092022223032438.xlsx').sheet1

# Read every cell of the sheet as a list of rows (all values arrive as strings)
bli_rows = worksheet.get_all_values()

# Convert to a DataFrame, using the first row as column names.
# The header row is still present as the first data row; the filter below drops it.
import pandas as pd
bli  = pd.DataFrame.from_records(bli_rows, columns = bli_rows[0])

# Keep only the rows where INEQUALITY is "TOT" (the totals)
bli = bli[bli["INEQUALITY"]=="TOT"]

# Reshape from long to wide: one row per country, one column per indicator
bli = bli.pivot(index="Country", columns="Indicator", values="Value")
#bli.head()
bli["Life satisfaction"].head()

Country
Australia    7.3
Austria        7
Belgium      6.9
Brazil       6.6
Canada       7.3
Name: Life satisfaction, dtype: object
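
The `dtype: object` in the output above indicates that the values are stored as strings rather than numbers. A minimal sketch of converting such a column with `pd.to_numeric`, using a small hand-made Series (`life_sat` is a hypothetical stand-in for the real column, not part of the notebook):

```python
import pandas as pd

# Hypothetical stand-in for the "Life satisfaction" column read from the sheet.
# dtype=object mimics values that arrive as strings.
life_sat = pd.Series(
    ["7.3", "7", "6.9"],
    index=["Australia", "Austria", "Belgium"],
    name="Life satisfaction",
    dtype=object,
)
print(life_sat.dtype)

# pd.to_numeric parses the strings into floating point numbers.
life_sat_num = pd.to_numeric(life_sat)
print(life_sat_num.dtype)
print(life_sat_num.mean())
```

Once the column is numeric, arithmetic such as `mean()` and numeric sorting behave as expected.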

In [8]:
bli[0:5]

Country,Air pollution,Dwellings without basic facilities,Educational attainment,Employees working very long hours,Employment rate,Feeling safe walking alone at night,Homicide rate,Household net adjusted disposable income,Household net financial wealth,Housing expenditure,Labour market insecurity,Life expectancy,Life satisfaction,Long-term unemployment rate,Personal earnings,Quality of support network,Rooms per person,Self-reported health,Stakeholder engagement for developing regulations,Student skills,Time devoted to leisure and personal care,Voter turnout,Water quality,Years in education
Australia,5,1.1,80,13.2,72,63.6,1.0,33417,57462,20,4.3,82.5,7.3,1.36,52063,94,2.3,85,2.7,502,14.35,91,92,21.2
Austria,16,1.0,85,6.78,72,80.7,0.4,32544,59574,21,2.7,81.3,7.0,1.94,48295,92,1.6,70,1.3,492,14.55,75,93,17.1
Belgium,15,2.3,75,4.31,62,70.7,1.0,29968,104084,21,4.8,81.1,6.9,3.98,49587,92,2.2,75,2.2,503,15.77,89,84,18.2
Brazil,10,6.7,49,7.15,64,37.3,27.6,12227,7102,20,4.9,74.7,6.6,3.37,14024,90,0.8,70,2.2,395,14.45,79,72,15.9
Canada,7,0.2,91,3.73,73,80.9,1.4,29850,85758,22,3.9,81.5,7.3,0.81,48403,93,2.5,88,3.0,523,14.41,68,91,16.7
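
The wide table above is the result of the `pivot` call. A minimal sketch of the same reshaping on a toy long-format table (`long_df` and its four rows are illustrative, with values taken from the rows shown above):

```python
import pandas as pd

# Toy long-format table shaped like the raw BLI sheet:
# one row per (Country, Indicator) pair.
long_df = pd.DataFrame({
    "Country":   ["Australia", "Australia", "Austria", "Austria"],
    "Indicator": ["Life satisfaction", "Life expectancy"] * 2,
    "Value":     [7.3, 82.5, 7.0, 81.3],
})

# pivot turns each distinct Indicator into its own column, indexed by Country.
wide_df = long_df.pivot(index="Country", columns="Indicator", values="Value")
print(wide_df)
```

`pivot` requires every (Country, Indicator) pair to be unique, which is presumably why the notebook filters the rows to `INEQUALITY == "TOT"` before pivoting.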


In [9]:
#Open given sheet
worksheet = gc.open('WEO_Data').sheet1

# Read every cell of the sheet as a list of rows (all values arrive as strings)
WEO_rows = worksheet.get_all_values()

# Convert to a DataFrame and render.
import pandas as pd
weo  = pd.DataFrame.from_records(WEO_rows, columns = WEO_rows[0])

# Drop the header row from data
weo = weo.reindex(weo.index.drop(0))

# 1- Select only the country name and the 2015 column (its header reads '2,015.00' in the sheet)
# 2- Rename that column to "GDP per capita"
weo = weo[['Country','2,015.00']].rename(columns={'2,015.00':'GDP per capita'})

# Set Country as the index column
# inplace=True modifies weo directly instead of returning a new DataFrame
weo.set_index("Country", inplace=True)

#weo.drop_duplicates(inplace=True)
#Print top 5 rows
weo.head()

Country,GDP per capita
Afghanistan,599.99
Albania,3995.38
Algeria,4318.14
Angola,4100.32
Antigua and Barbuda,14414.3


# Merge/Join dataset

In [10]:
# Now merge BLI and WEO datasets
df = pd.merge(left = weo, right = bli, left_index=True, right_index=True)

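# Sort the dataframe by GDP per capita.
# Note: the values are still strings here, so the sort is alphabetical, not numeric
# (this is why Luxembourg, at 101994.09, appears first in the output below).
df.sort_values(by="GDP per capita", inplace=True)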
df.head()


Country,GDP per capita,Air pollution,Dwellings without basic facilities,Educational attainment,Employees working very long hours,Employment rate,Feeling safe walking alone at night,Homicide rate,Household net adjusted disposable income,Household net financial wealth,Housing expenditure,Labour market insecurity,Life expectancy,Life satisfaction,Long-term unemployment rate,Personal earnings,Quality of support network,Rooms per person,Self-reported health,Stakeholder engagement for developing regulations,Student skills,Time devoted to leisure and personal care,Voter turnout,Water quality,Years in education
Luxembourg,101994.09,12,0.0,79,3.76,66,72.0,0.6,41317,74141,20,3.2,82.4,6.9,1.9,62636,92,2.0,70,1.5,483,15.15,91,85,15.1
Hungary,12239.89,19,4.3,83,3.05,67,50.7,1.2,16821,23289,18,4.8,75.7,5.3,2.42,21711,84,1.2,56,1.2,474,15.06,62,76,16.6
Poland,12495.33,22,2.7,91,6.68,65,66.3,0.8,18906,14997,23,4.3,77.6,6.0,2.14,25921,89,1.1,58,2.6,504,14.42,55,80,17.7
Chile,13340.91,16,9.4,65,10.06,62,51.1,4.5,16588,21409,18,8.1,79.1,6.7,2.02,28434,84,1.9,57,1.5,443,14.9,49,69,17.3
Latvia,13618.57,11,12.9,89,2.09,69,60.7,6.6,15269,17105,23,6.8,74.6,5.9,3.73,22389,86,1.2,46,2.4,487,13.83,59,77,17.9
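
Two details of the merge and sort above are easy to miss: `pd.merge` performs an inner join by default, so countries present in only one table are dropped, and sorting a column of strings orders it alphabetically. The sketch below illustrates both on toy frames (`gdp` and `sat` are hypothetical names; the values are copied from the outputs above):

```python
import pandas as pd

# Toy frames with string values, as they arrive from the sheets.
gdp = pd.DataFrame(
    {"GDP per capita": ["101994.09", "12239.89", "599.99"]},
    index=["Luxembourg", "Hungary", "Afghanistan"],
)
sat = pd.DataFrame(
    {"Life satisfaction": ["6.9", "5.3"]},
    index=["Luxembourg", "Hungary"],
)

# Default how="inner": Afghanistan has no match in `sat`, so it is dropped.
merged = pd.merge(left=gdp, right=sat, left_index=True, right_index=True)

# Sorting strings compares character by character: "101994.09" < "12239.89".
by_text = merged.sort_values(by="GDP per capita")

# After converting to numbers, the sort is numeric.
merged["GDP per capita"] = pd.to_numeric(merged["GDP per capita"])
by_number = merged.sort_values(by="GDP per capita")

print(list(by_text.index))
print(list(by_number.index))
```

Converting the column with `pd.to_numeric` before sorting would place Luxembourg last instead of first. In the notebook itself this would also change which countries the fixed test indices select.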


In [11]:
df.iloc[3]

GDP per capita                                       13340.91
Air pollution                                              16
Dwellings without basic facilities                        9.4
Educational attainment                                     65
Employees working very long hours                       10.06
Employment rate                                            62
Feeling safe walking alone at night                      51.1
Homicide rate                                             4.5
Household net adjusted disposable income                16588
Household net financial wealth                          21409
Housing expenditure                                        18
Labour market insecurity                                  8.1
Life expectancy                                          79.1
Life satisfaction                                         6.7
Long-term unemployment rate                              2.02
Personal earnings                                       28434
Quality 

## Split dataset into Train & Test

In [12]:
# The snippet below is the most basic way of splitting data in Python. It is for illustration only;
# in later sections we will use the proper scikit-learn utility to split the data.
test_indices = [0, 1, 6, 8, 33, 34, 35]

# Define the training indices by subtracting the test indices from all indices
# (this assumes the dataframe has 36 rows)
train_indices = list(set(range(36)) - set(test_indices))

# Use the indices above to select the training set and the test set from the dataframe
train = df[["GDP per capita", 'Life satisfaction']].iloc[train_indices]
test = df[["GDP per capita", 'Life satisfaction']].iloc[test_indices]
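
For comparison, here is a minimal sketch of the scikit-learn way of splitting, using `train_test_split` on a toy frame. The name `toy`, the `test_size` and the `random_state` are illustrative assumptions, not values used in this notebook:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame with numeric values copied from the first rows of the training set.
toy = pd.DataFrame(
    {
        "GDP per capita":    [12495.33, 13340.91, 13618.57, 15991.74, 17288.08],
        "Life satisfaction": [6.0, 6.7, 5.9, 6.1, 5.6],
    },
    index=["Poland", "Chile", "Latvia", "Slovak Republic", "Estonia"],
)

# Randomly hold out 20% of the rows for testing; random_state makes the split repeatable.
train_toy, test_toy = train_test_split(toy, test_size=0.2, random_state=42)

print(len(train_toy), len(test_toy))
```

Unlike hand-picked indices, a random split does not depend on how the rows happen to be ordered.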

In [13]:
train.head()

Country,GDP per capita,Life satisfaction
Poland,12495.33,6.0
Chile,13340.91,6.7
Latvia,13618.57,5.9
Slovak Republic,15991.74,6.1
Estonia,17288.08,5.6


In [14]:
# Code example
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import sklearn.linear_model

# Prepare the data
X = np.c_[train["GDP per capita"]]
y = np.c_[train["Life satisfaction"]]

In [15]:
type(y)

numpy.ndarray

In [16]:
# Let's look at what is inside X and y. Print the first 5 records
X[:5], y[:5]

(array([['12495.33'],
        ['13340.91'],
        ['13618.57'],
        ['15991.74'],
        ['17288.08']], dtype=object), array([['6'],
        ['6.7'],
        ['5.9'],
        ['6.1'],
        ['5.6']], dtype=object))
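
The output above shows two things: `np.c_` turns a one-dimensional column into a two-dimensional array with a single column (the shape scikit-learn expects for features), and the values are still strings (`dtype=object`). A minimal sketch on a toy Series (`gdp` is a hypothetical stand-in) that also converts the values to floats:

```python
import numpy as np
import pandas as pd

# Toy column of strings, mimicking the values read from the sheet.
gdp = pd.Series(["12495.33", "13340.91", "13618.57"], dtype=object)

# np.c_ stacks the 1-D input as a column: shape (3,) becomes (3, 1).
X_obj = np.c_[gdp]
print(X_obj.shape, X_obj.dtype)

# Convert the strings to floats so the array holds real numbers.
X_num = X_obj.astype(float)
print(X_num.dtype)

# An equivalent way to build the same column from the Series.
X_alt = gdp.astype(float).to_numpy().reshape(-1, 1)
print(np.array_equal(X_num, X_alt))
```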

# ML Starts Now
## Define a model with default values

In [18]:
# Select a basic linear model without setting any parameters (nothing inside the parentheses below)
model = sklearn.linear_model.LinearRegression()

# See the model for yourself
model

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

## Start training the model using X and y

In [19]:
# Train the model
model.fit(X, y)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

In [20]:
model.coef_

array([[3.83407608e-05]])

In [21]:
model.intercept_

array([5.25939616])

## Do prediction on test data

In [None]:
# Make a prediction for Cyprus
X_new = [[17770]]  # Cyprus' GDP per capita
print(model.predict(X_new)) # outputs [[5.94071148]]

[[5.94071148]]
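
A fitted linear regression is a straight line, so the prediction above can be reproduced by hand as intercept + coefficient × GDP per capita. The sketch below uses the `intercept_` and `coef_` values printed earlier (`predict_by_hand` is a hypothetical helper, not part of scikit-learn):

```python
# Values copied from the model.intercept_ and model.coef_ outputs above.
intercept = 5.25939616
coef = 3.83407608e-05

def predict_by_hand(gdp_per_capita):
    """Evaluate the fitted line: intercept + coef * x."""
    return intercept + coef * gdp_per_capita

# Cyprus' GDP per capita, as used in the cell above.
cyprus_prediction = predict_by_hand(17770)
print(round(cyprus_prediction, 4))
```

The result agrees with the `model.predict` output for Cyprus up to rounding of the printed coefficients.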


In [None]:
# Make a prediction for our test data
pred = model.predict(test['GDP per capita'].values.reshape(-1,1))

In [None]:
pred

array([[9.16992716],
       [5.72868285],
       [5.9210396 ],
       [5.95199478],
       [8.35254892],
       [5.59181055],
       [5.60481881]])

# Now, let's make it better!

Use the test dataset to predict life satisfaction, then compare the predictions with the actual values.

# Evaluate Model

In [None]:
# Let's create the train and test datasets, so we can use the train dataset to train the model
# and the test dataset to evaluate model performance
X_train = np.c_[train["GDP per capita"]]
y_train = np.c_[train["Life satisfaction"]]

X_test = np.c_[test["GDP per capita"]]
y_test = np.c_[test["Life satisfaction"]]

model = model.fit(X_train,y_train)

#Now apply the prediction on test dataset
y_pred_test = model.predict(X_test)

In [None]:
# See predictions for yourself
y_pred_test

array([[9.16992716],
       [5.72868285],
       [5.9210396 ],
       [5.95199478],
       [8.35254892],
       [5.59181055],
       [5.60481881]])

In [None]:
from sklearn.metrics import mean_squared_error

#MSE: Mean Squared Error as a metric to evaluate a regression model
mean_squared_error(y_test, y_pred_test)


1.299498988620812

In [None]:
#RMSE
from math import sqrt
sqrt(mean_squared_error(y_test, y_pred_test))


1.1399556959026136
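
MSE is the mean of the squared differences between actual and predicted values, and RMSE is its square root, which brings the error back to the units of the target. A minimal sketch on toy arrays (`y_true` and `y_hat` are made-up values) that computes both by hand and compares the result with scikit-learn:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up actual and predicted values, for illustration only.
y_true = np.array([6.0, 7.0, 5.0])
y_hat = np.array([6.5, 6.0, 5.0])

# MSE by hand: average of the squared errors (0.25 + 1.0 + 0.0) / 3.
mse_manual = np.mean((y_true - y_hat) ** 2)

# RMSE is the square root of MSE.
rmse_manual = np.sqrt(mse_manual)

# scikit-learn's implementation gives the same MSE.
mse_sklearn = mean_squared_error(y_true, y_hat)

print(mse_manual, mse_sklearn, rmse_manual)
```

The same relationship holds for the notebook's numbers: the square root of the MSE of 1.2995 is the RMSE of about 1.14.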

# Question:

### What would you expect if we normalize data and train the model again?

# Now, normalize data before prediction

In [None]:
from sklearn.preprocessing import MinMaxScaler
# Define Scaling technique
scaler = MinMaxScaler()

In [None]:
# Fit the scaler on the training data (it learns the training minimum and maximum)
X_train_escaler = scaler.fit(X_train)

# Apply the fitted scaler to the training data
X_train_escaled = X_train_escaler.transform(X_train)
X_train_escaled[:5]

array([[0.09837989],
       [0.11061207],
       [0.1146287 ],
       [0.14895901],
       [0.16771188]])

# Normalize Test Dataset

In [None]:
# Apply the scaler fitted on the training data to the test data
X_test_escaled = X_train_escaler.transform(X_test)
X_test_escaled

array([[1.39307026],
       [0.09468469],
       [0.16726112],
       [0.17894055],
       [1.08467248],
       [0.04304261],
       [0.04795064]])
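
Some of the scaled test values above are greater than 1. This is expected: with default settings, `MinMaxScaler` computes (x - min) / (max - min) using the minimum and maximum of the data it was fitted on, so test points outside the training range fall outside [0, 1]. A minimal sketch with made-up numbers (`X_tr` and `X_te` are illustrative):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up training and test features, for illustration only.
X_tr = np.array([[10.0], [20.0], [30.0]])
X_te = np.array([[5.0], [25.0], [40.0]])

# Fit on the training data only, then transform the test data.
scaler_toy = MinMaxScaler().fit(X_tr)
scaled = scaler_toy.transform(X_te)

# The same computation by hand, using the training min and max.
manual = (X_te - X_tr.min()) / (X_tr.max() - X_tr.min())

print(scaled.ravel())
print(np.allclose(scaled, manual))
```

Here 5 is below the training minimum and 40 is above the training maximum, so they map to a negative value and to a value above 1.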

# Train Models using Scaled Data

## Start training the model using X and y

In [None]:
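# Refit the model on the scaled training features
model.fit(X_train_escaled, y_train)

# Predictions on the scaled training data (overwritten by the test predictions in the next cell)
y_pred_escaled = model.predict(X_train_escaled)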


## Do prediction on test data

In [None]:
y_pred_escaled = model.predict(X_test_escaled)

In [None]:
#RMSE
from math import sqrt
sqrt(mean_squared_error(y_test, y_pred_escaled))


1.1399556959026143

# What can you conclude by comparing the RMSE from the normalized and non-normalized datasets?
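
One way to explore this question is to repeat the experiment on synthetic data. The sketch below fits a plain linear regression twice, once on raw features and once on min-max scaled features, and compares the predictions. The data-generating numbers are arbitrary assumptions chosen only to resemble the scale of this dataset:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import MinMaxScaler

# Synthetic data: arbitrary values, for illustration only.
rng = np.random.default_rng(0)
X_tr = rng.uniform(10_000, 60_000, size=(30, 1))
y_tr = 5.0 + 4e-5 * X_tr[:, 0] + rng.normal(0, 0.3, size=30)
X_te = rng.uniform(10_000, 60_000, size=(10, 1))

# Model 1: fitted on the raw feature.
pred_raw = LinearRegression().fit(X_tr, y_tr).predict(X_te)

# Model 2: fitted on the min-max scaled feature (scaler fitted on training data only).
sc = MinMaxScaler().fit(X_tr)
pred_scaled = LinearRegression().fit(sc.transform(X_tr), y_tr).predict(sc.transform(X_te))

# Compare the two sets of predictions.
print(np.allclose(pred_raw, pred_scaled))
```

For ordinary least squares with an intercept, rescaling a feature changes the coefficient and intercept but not the fitted line's predictions. Models that use regularization or distances between points can behave differently.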