# Introduction

This notebook explores the World Happiness Report, a landmark survey of the state of global happiness.

In this tutorial, I build a logistic regression classifier, first from scratch and then with scikit-learn.

<font color='red'>
Content:
    
1. [Load and Check Data](#1)
2. [Variable Description](#2)
3. [Logistic Regression](#3)
    * [Preparing the Data for Logistic Regression](#4)
    * [Dropping Unneeded Features](#5)
    * [Editing Score Data for Binary Classification](#6)
    * [Normalization](#7)
    * [Train - Test Split](#8)
    * [Initializing Parameters and Sigmoid Function](#9)
    * [Forward - Backward Propagation](#10)
    * [Updating Parameters](#11)
    * [Prediction](#12)
    * [Cost and Test Accuracy](#13)
4. [Logistic Regression with Sklearn](#14)

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

<a id="1"></a> <br>
# Load and Check Data

* First of all, we are going to read the reports in our dataset.

In [None]:
data_2015=pd.read_csv("../input/world-happiness/2015.csv")
data_2016=pd.read_csv("../input/world-happiness/2016.csv")
data_2017=pd.read_csv("../input/world-happiness/2017.csv")
data_2018=pd.read_csv("../input/world-happiness/2018.csv")
data_2019=pd.read_csv("../input/world-happiness/2019.csv")

In [None]:
#Summary Analysis

data_2019.head()

In [None]:
# Check the data types and look for missing values.

#data_2015.info()
#data_2016.info()
#data_2017.info()
#data_2018.info()
data_2019.info()

As you can see, the reports use different columns from year to year, so I will work only with the 2019 report.

In the 2019 report:

* Length: 156 rows (RangeIndex)
* All features are floats except the rank and country columns.
* There are no NaN values in this report.
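
The two checks described above (column differences between years and missing values) can be sketched as follows. This is only an illustration: `report_a` and `report_b` are hypothetical miniature frames, and their column names are placeholders rather than the full schemas of the real files.

```python
import pandas as pd

# Hypothetical miniature frames standing in for two yearly reports;
# the column names are illustrative, not the full real schemas.
report_a = pd.DataFrame({"Country": ["A", "B"], "Happiness Score": [7.5, 4.2]})
report_b = pd.DataFrame({"Country or region": ["A", "B"], "Score": [7.6, None]})

# Columns that appear in only one of the two reports.
only_in_a = set(report_a.columns) - set(report_b.columns)
only_in_b = set(report_b.columns) - set(report_a.columns)
print(sorted(only_in_a), sorted(only_in_b))

# Count missing values per column.
missing_per_column = report_b.isna().sum()
print(missing_per_column)
```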

<a id="3"></a> <br>
# Logistic Regression

<a id="4"></a> <br>
## Preparing the Data for Logistic Regression

* I will work with the 2019 data, so I first look at the countries' scores.
* Using the mean of these scores, I identify which countries score above the average and which score at or below it.
* I then group the countries accordingly.
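
As a rough sketch of this grouping, a boolean mask can split the scores in one step. The values in `scores` are made up for illustration; the notebook itself works with `data_2019.Score`.

```python
import pandas as pd

# Made-up scores for illustration only.
scores = pd.Series([7.8, 6.1, 5.4, 4.9, 3.2])

mean_score = scores.mean()             # computed once
above = scores[scores > mean_score]    # the "happy" group
below = scores[scores <= mean_score]   # the "sad" group

print(mean_score, len(above), len(below))
```

Every score lands in exactly one group, so the two group sizes always add up to the number of countries.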

In [None]:
score_list=list(data_2019.Score)
for i in range(0,5):
    print(score_list[i])

In [None]:
np.mean(score_list)

In [None]:
mean_score=np.mean(score_list)   # compute the mean once, outside the loop
data_happy=[]
data_sad=[]

for score in score_list:
    if score>mean_score:
        data_happy.append(score)
    else:
        data_sad.append(score)

In [None]:
len(data_happy)

In [None]:
len(data_sad)

<a id="5"></a> <br>
## Dropping Unneeded Features

We keep only the features that can influence the score, so we drop the Overall rank and Country or region columns.

In [None]:
data_2019.drop(["Overall rank","Country or region"],axis=1,inplace=True)

In [None]:
data_2019.head()

<a id="6"></a> <br>
## Editing Score Data for Binary Classification

We label countries whose score is above the average as 1 and all others as 0.

In [None]:
data_2019.Score=[1 if each>np.mean(score_list) else 0 for each in data_2019.Score]

In [None]:
data_2019.head()

In [None]:
y=data_2019.Score.values   
x_data=data_2019.drop(["Score"],axis=1)

In [None]:
y

In [None]:
x_data.head()

<a id="7"></a> <br>
## Normalization

We normalize the features to the [0, 1] range so that each one contributes on a comparable scale.
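
A minimal sketch of per-column min-max scaling, assuming a toy table whose column names (`gdp`, `freedom`) are placeholders: each value becomes (value - column min) / (column max - column min).

```python
import pandas as pd

# Toy feature table; column names are placeholders.
features = pd.DataFrame({"gdp": [0.0, 1.0, 2.0], "freedom": [10.0, 20.0, 40.0]})

# DataFrame.min() and DataFrame.max() reduce per column by default,
# so every column is scaled with its own minimum and maximum.
scaled = (features - features.min()) / (features.max() - features.min())
print(scaled)
```

After scaling, every column has a minimum of 0 and a maximum of 1, regardless of its original units.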

In [None]:
x=(x_data - x_data.min())/(x_data.max() - x_data.min())   # per-column min-max scaling

In [None]:
x.head()

<a id="8"></a> <br>
## Train - Test Split  

We use 80% of the data for training and hold out the remaining 20% for testing.
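
To get a feel for the resulting sizes, the split can be tried on synthetic data of the same shape as the 2019 features (156 rows, 6 columns). The arrays below are random stand-ins, not the real report.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same shape as the 2019 features.
rng = np.random.default_rng(0)
features = rng.random((156, 6))
labels = rng.integers(0, 2, size=156)

f_train, f_test, l_train, l_test = train_test_split(
    features, labels, test_size=0.2, random_state=42
)
print(f_train.shape, f_test.shape)   # (124, 6) (32, 6)

# Transposing puts features on the rows and samples on the columns,
# which is the layout the from-scratch functions expect.
print(f_train.T.shape)               # (6, 124)
```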

In [None]:
# 80% Train - 20% Test
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.2,random_state=42)
x_train=x_train.T
x_test=x_test.T

print("x_train: ",x_train.shape)
print("x_test: ",x_test.shape)
print("y_train: ",y_train.shape)
print("y_test: ",y_test.shape)

<a id="9"></a> <br>
## Initializing Parameters and Sigmoid Function

We initialize every weight in w to 0.01 and the bias b to 0.0, then define the sigmoid function.
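
A quick check, assuming six features already scaled to [0, 1], suggests why this is a neutral starting point: the initial outputs sit just above 0.5, so the model begins without a strong preference for either class. The `sample` values are made up.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Six weights of 0.01 and a zero bias, matching the initial values described here.
w = np.full((6, 1), 0.01)
b = 0.0

# One made-up sample whose six features are already scaled to [0, 1].
sample = np.array([[0.2], [0.9], [0.5], [0.0], [1.0], [0.4]])

z = (w.T @ sample + b).item()   # 0.01 * 3.0 = 0.03
print(sigmoid(0.0))             # 0.5
print(sigmoid(z))               # slightly above 0.5
```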

In [None]:
# dimension = 6
def initialize_weights_and_bias(dimension):
    
    w=np.full((dimension,1),0.01)   # (dimension, 1) column vector, every weight set to 0.01
    b=0.0
    return w,b

# w,b=initialize_weights_and_bias(6)

# Sigmoid Function
def sigmoid(z):
    y_head = 1 / (1 + np.exp(-z))
    return y_head

# print(sigmoid(0))

<a id="10"></a> <br>
## Forward - Backward Propagation

In [None]:
def forward_backward_propagation(w,b,x_train,y_train):
    #forward propagation
    z=np.dot(w.T,x_train) + b
    y_head=sigmoid(z)
    loss=-y_train*np.log(y_head)-(1-y_train)*np.log(1-y_head)
    cost=(np.sum(loss))/x_train.shape[1]
    
    #backward propagation
    derivative_weight=(np.dot(x_train,((y_head-y_train).T)))/x_train.shape[1]
    derivative_bias=np.sum(y_head-y_train)/x_train.shape[1]
    gradients={"derivative_weight": derivative_weight,"derivative_bias": derivative_bias}
    
    return cost,gradients
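
One way to gain confidence in the backward pass is a numerical gradient check: compare the analytic derivatives with central finite differences of the cost. The sketch below restates the same formulas in a hypothetical helper, `cost_and_grads`, and runs it on random made-up data; it is a sanity check, not part of the training code.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def cost_and_grads(w, b, x, y):
    # Same formulas as forward_backward_propagation; x is (features, samples).
    m = x.shape[1]
    y_head = sigmoid(w.T @ x + b)
    cost = np.sum(-y * np.log(y_head) - (1 - y) * np.log(1 - y_head)) / m
    dw = x @ (y_head - y).T / m
    db = np.sum(y_head - y) / m
    return cost, dw, db

rng = np.random.default_rng(0)
x = rng.random((3, 8))            # 3 made-up features, 8 samples
y = rng.integers(0, 2, size=8)    # made-up binary labels
w = np.full((3, 1), 0.01)
b = 0.0

_, dw, db = cost_and_grads(w, b, x, y)

# Central finite differences approximate each partial derivative.
eps = 1e-6
numeric_dw = np.zeros_like(w)
for j in range(w.shape[0]):
    step = np.zeros_like(w)
    step[j, 0] = eps
    numeric_dw[j, 0] = (cost_and_grads(w + step, b, x, y)[0]
                        - cost_and_grads(w - step, b, x, y)[0]) / (2 * eps)
numeric_db = (cost_and_grads(w, b + eps, x, y)[0]
              - cost_and_grads(w, b - eps, x, y)[0]) / (2 * eps)

print(np.max(np.abs(dw - numeric_dw)), abs(db - numeric_db))
```

Both differences should be tiny; a large gap would point to a mistake in the derivative formulas.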

<a id="11"></a> <br>
## Updating Parameters

In [None]:
# Updating(learning) parameters
def update(w,b,x_train,y_train,learning_rate,number_of_iteration):
    cost_list=[]
    cost_list2=[]
    index=[]
    # update the parameters number_of_iteration times
    
    for i in range(number_of_iteration):
        #make forward and backward propagation and find cost and gradients
        cost,gradients=forward_backward_propagation(w, b, x_train, y_train)
        cost_list.append(cost)
        
        w = w - learning_rate * gradients["derivative_weight"]
        b = b - learning_rate * gradients["derivative_bias"]
        if i%20 == 0:
            cost_list2.append(cost)
            index.append(i)
            print("Cost after iteration %i: %f" %(i,cost))
    # collect the learned weights and bias
    parameters = {"weight": w, "bias": b}
    plt.plot(index,cost_list2)
    plt.xticks(index,rotation='vertical')
    plt.xlabel("Number of Iteration")
    plt.ylabel("Cost")
    plt.show()
    return parameters, gradients, cost_list

<a id="12"></a> <br>
## Prediction

In [None]:
def predict(w,b,x_test):
    #x_test is an input for forward propagation
    z=sigmoid(np.dot(w.T,x_test)+b)
    Y_prediction=np.zeros((1,x_test.shape[1]))
    # if z is greater than 0.5, we predict 1 (y_head=1)
    # if z is less than or equal to 0.5, we predict 0 (y_head=0)
    for i in range(z.shape[1]):
        if z[0,i]<= 0.5:
            Y_prediction[0,i] = 0
        else:
            Y_prediction[0,i] = 1
            
    return Y_prediction
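
The thresholding loop can also be written as a single vectorized comparison. A small sketch with made-up probabilities, shaped `(1, samples)` like `z` in `predict`:

```python
import numpy as np

# Made-up sigmoid outputs for four samples.
probabilities = np.array([[0.2, 0.5, 0.51, 0.9]])

# Strictly greater than 0.5 maps to 1 and everything else to 0,
# which matches the loop's "<= 0.5 gives 0" rule.
predictions = (probabilities > 0.5).astype(float)
print(predictions)   # [[0. 0. 1. 1.]]
```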

<a id="13"></a> <br>
## Cost and Test Accuracy

In [None]:
#%% Logistic Regression
def logistic_regression(x_train, y_train, x_test, y_test, learning_rate, num_iterations):
    #initialize
    dimension = x_train.shape[0]   # number of features (6 here)
    w,b = initialize_weights_and_bias(dimension)
    #do not change learning rate
    parameters, gradients, cost_list = update(w, b, x_train, y_train, learning_rate, num_iterations)
    
    y_prediction_test=predict(parameters["weight"], parameters["bias"], x_test)
    
    # print test accuracy
    print("test accuracy: {} %".format(100 - np.mean(np.abs(y_prediction_test - y_test)) * 100))
    
logistic_regression(x_train, y_train, x_test, y_test, learning_rate=1, num_iterations=500)
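
The accuracy expression works because, for 0/1 values, the absolute difference between a prediction and its label is 1 for a miss and 0 for a hit. A small sketch with made-up predictions and labels:

```python
import numpy as np

# Made-up predictions and labels for four samples.
predicted = np.array([[1.0, 0.0, 1.0, 1.0]])
actual = np.array([1, 0, 0, 1])

# The mean of |prediction - label| is the error rate.
error_rate = np.mean(np.abs(predicted - actual))
accuracy = 100 - error_rate * 100
print(accuracy)   # 75.0
```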

<a id="14"></a> <br>
# Logistic Regression with Sklearn

In [None]:
from sklearn.linear_model import LogisticRegression
lr=LogisticRegression()
lr.fit(x_train.T, y_train.T)
print("test accuracy {}".format(lr.score(x_test.T, y_test.T)))
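
scikit-learn expects samples on the rows and features on the columns, which is why the arrays are transposed back before fitting. The sketch below shows that layout on synthetic, made-up data. Note that `LogisticRegression()` applies L2 regularization by default, while the from-scratch version above has none, so the two accuracies need not match exactly.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic toy data: the label is 1 when the first feature exceeds 0.5.
rng = np.random.default_rng(0)
features = rng.random((100, 6))              # rows are samples, columns are features
labels = (features[:, 0] > 0.5).astype(int)

model = LogisticRegression()                 # defaults include L2 regularization (C=1.0)
model.fit(features, labels)
score = model.score(features, labels)        # fraction of correct predictions, between 0 and 1
print(score)
```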