In [None]:
# Import the required modules before diving into the data


import numpy as np
import pandas as pd
import warnings
import seaborn as sns
import matplotlib.pyplot as plt

warnings.filterwarnings('ignore')

# Happiness Levels of Countries by Region and Year




## Contents
1. Introduction
2. The Aim of Analysis
3. General Information of the Data
4. Arrangements of the Data Sets
5. Cleaning of the Raw Data
6. Data Exploration 
7. Feature Engineering
8. Conclusions 


## 1. Introduction 

Happiness can come from many factors, such as income, freedom, or relationships. Since we all share the same world, this study examines whether all of us are happy or not and draws data-driven conclusions from the main contributing factors.

This study is based on the World Happiness Report, which was published in 2012, 2013, 2015, 2016 and 2017 and aims to measure global happiness.


## 2. The Aim of Analysis

The study examines Happiness_Score using 9 main variables across 166 countries:
'Country', 'Region', 'Economy_GDP_per_Capita', 'Family', 'Health_Life_Expectancy', 'Freedom', 'Trust_Government_Corruption', 'Generosity', 'Dystopia_Residual'.

The data will show how the happiness score reflects personal and national variations in happiness.


## 3. General Information of the Data

We have 3 data sets, based on the 2015, 2016 and 2017 surveys.

**Columns**:

**Country**: Name of the country.

**Region**: Region the country belongs to.

**Happiness Rank**: Rank of the country based on the Happiness Score.

**Happiness Score**: A metric measured in 2015 by asking the sampled people the question: "How would you rate your happiness on a scale of 0 to 10, where 10 is the happiest?"

**Economy (GDP per Capita)**: The extent to which GDP contributes to the calculation of the Happiness Score.

**Family**: The extent to which Family contributes to the calculation of the Happiness Score.

**Health (Life Expectancy)**: The extent to which Life Expectancy contributes to the calculation of the Happiness Score.

**Freedom**: The extent to which Freedom contributes to the calculation of the Happiness Score.

**Trust (Government Corruption)**: The extent to which Perception of Corruption contributes to the calculation of the Happiness Score.

**Generosity**: The extent to which Generosity contributes to the calculation of the Happiness Score.

**Dystopia Residual**: The extent to which Dystopia Residual contributes to the calculation of the Happiness Score.
(Dystopia is an imaginary country where the world's least happy people live. The purpose of establishing Dystopia is to have a benchmark against which all countries can be favorably compared (no country performs more poorly than Dystopia) in terms of each of the six key variables, thus allowing each sub-bar to be of positive width. "Dystopia" was created in contrast to Utopia.)


In [None]:
from subprocess import check_output

print(check_output(["ls", "../input/world-happiness"]).decode("utf8"))

In [None]:
df2015 = pd.read_csv("../input/world-happiness/2015.csv")
df2016 = pd.read_csv("../input/world-happiness/2016.csv")
df2017 = pd.read_csv("../input/world-happiness/2017.csv")

## 4. Arrangements of the Data Sets

In [None]:
df2015.info()

In [None]:
df2016.info()

In [None]:
df2017.info()

The summaries above show that the three data sets differ in size and in their columns.

In [None]:
print(df2015.columns, df2016.columns, df2017.columns, sep = " \n" )

### 4-a) Equalizing the columns of each data set

In order to arrange our data frame, some columns have been removed from each data set. 
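
Assigning a full list to `df.columns` only works when the columns arrive in exactly the expected order. A minimal sketch of a name-based alternative is given here; the two toy frames, `RENAME_MAP`, `EXTRA_COLUMNS` and `harmonize` are illustrative assumptions and do not reproduce the real CSV files.

```python
import pandas as pd

# Toy frames standing in for two yearly files with differently spelled headers.
toy_2016 = pd.DataFrame({"Country": ["A"], "Happiness Score": [7.1],
                         "Upper Confidence Interval": [7.2]})
toy_2017 = pd.DataFrame({"Country": ["A"], "Happiness.Score": [7.0],
                         "Whisker.high": [7.1]})

# Hypothetical mapping from each raw header to the common name.
RENAME_MAP = {"Happiness Score": "Happiness_Score",
              "Happiness.Score": "Happiness_Score"}
# Year-specific columns that are not shared by all files.
EXTRA_COLUMNS = ["Upper Confidence Interval", "Whisker.high"]


def harmonize(frame, year):
    """Drop year-specific columns, rename by name, and tag the survey year."""
    cleaned = frame.drop(columns=EXTRA_COLUMNS, errors="ignore")
    cleaned = cleaned.rename(columns=RENAME_MAP)
    return cleaned.assign(Year=year)


aligned_2016 = harmonize(toy_2016, 2016)
aligned_2017 = harmonize(toy_2017, 2017)
print(aligned_2016.columns.tolist(), aligned_2017.columns.tolist())
```

Renaming by name keeps the result correct even if a future file reorders its columns, at the cost of maintaining the mapping.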

In [None]:
#'Standard Error' column has been deleted from 2015 data. 


df2015 = df2015.drop('Standard Error', axis=1)

In [None]:
# Rename the columns to a common naming scheme.

df2015.columns = ['Country', 'Region', 'Happiness_Rank', 'Happiness_Score', 'Economy_GDP_per_Capita', 'Family',
       'Health_Life_Expectancy', 'Freedom', 'Trust_Government_Corruption','Generosity', 'Dystopia_Residual']

In [None]:
df2015["Year"] = 2015

In [None]:
df2015.head()

In [None]:
#'Upper Confidence Interval' and 'Lower Confidence Interval' column have been deleted from 2016 data set. 


df2016 = df2016.drop(['Upper Confidence Interval','Lower Confidence Interval'], axis =1 )

In [None]:
# Rename the columns to a common naming scheme.


df2016.columns = ['Country', 'Region', 'Happiness_Rank', 'Happiness_Score', 'Economy_GDP_per_Capita', 'Family',
       'Health_Life_Expectancy', 'Freedom', 'Trust_Government_Corruption','Generosity', 'Dystopia_Residual']

In [None]:
df2016["Year"] = 2016

In [None]:
df2016.head()

In [None]:
#'Whisker.high' and 'Whisker.low' column have been deleted from 2017 data set. 


df2017 =  df2017.drop(['Whisker.high','Whisker.low'],axis =1 )

In [None]:
# Rename the columns to a common naming scheme.


df2017.columns = ['Country', 'Happiness_Rank', 'Happiness_Score', 'Economy_GDP_per_Capita', 'Family',
       'Health_Life_Expectancy', 'Freedom', 'Generosity', 'Trust_Government_Corruption', 'Dystopia_Residual']

In [None]:
df2017["Year"] = 2017

In [None]:
df2017.head()

### 4-b) Checking the final columns of each data set

In [None]:
print(df2015.columns, df2016.columns, df2017.columns, sep = " \n" )

### 4-c) Creating a new data frame, Hapiness_report, by concatenating the 3 data sets: df2015, df2016, df2017

In [None]:
frames = [df2015, df2016, df2017]

Hapiness_report = pd.concat(frames,sort=True,ignore_index=True)
Hapiness_report.head()

### 4-d) Exploring Null Values

In [None]:
sns.heatmap(Hapiness_report.isnull(),yticklabels=False,cbar=False,cmap='viridis')
plt.show()

In [None]:
# Count and share (in percent) of missing values per column.
Sum = Hapiness_report.isnull().sum()
Percentage = Hapiness_report.isnull().mean() * 100

pd.concat([Sum,Percentage], axis =1, keys= ['Sum', 'Percentage'])


Only the Region column has missing values.
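
The gaps in Region are a side effect of concatenating frames whose column sets differ: the 2017 frame has no Region column. A small sketch of that behaviour, using made-up frames (`with_region` and `without_region` are hypothetical names), might look like this:

```python
import pandas as pd

# Toy frames: the second one has no "Region" column, like the 2017 file.
with_region = pd.DataFrame({"Country": ["A", "B"],
                            "Region": ["North", "South"],
                            "Year": 2015})
without_region = pd.DataFrame({"Country": ["A", "B", "C"], "Year": 2017})

# Same call pattern as in the notebook: union of columns, fresh 0..n-1 index.
combined = pd.concat([with_region, without_region], sort=True, ignore_index=True)

# Rows that came from the frame without "Region" receive NaN in that column.
missing_per_column = combined.isnull().sum()
print(missing_per_column)
```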

## 5. Cleaning of the Raw Data

In [None]:
Happiness_report2 = Hapiness_report.copy()

In [None]:
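# Fill each country's missing Region with the most frequent Region recorded for that country.
# Countries with no known Region (empty mode) are skipped and stay missing.
for country in Happiness_report2.Country.unique():
    country_rows = Happiness_report2['Country'] == country
    region_mode = Happiness_report2.loc[country_rows, 'Region'].mode()
    if not region_mode.empty:
        Happiness_report2.loc[country_rows, 'Region'] = region_mode[0]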

In [None]:
Happiness_report2.info()

### 5-a) After filling missing values in the Region column with the mode, 2 missing values remain.

In [None]:
# Let's look at the countries that still have no region information in the data set.


Happiness_report2[Happiness_report2['Region'].isna()]

In [None]:
# Let's check whether China exists in the earlier rows.

Happiness_report2[Happiness_report2.Country == "China"] 

### 5-b) Since China appears in the Country column with the Eastern Asia region, we assign 'Eastern Asia' manually in the Region column.

In [None]:
Happiness_report2.loc[[347,385], 'Region'] = "Eastern Asia"

In [None]:
Happiness_report2.loc[385].Region

In [None]:
sns.heatmap(Happiness_report2.isnull(),yticklabels=False,cbar=False,cmap='viridis')
plt.show()

Now the data set has no missing values.
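
The region filling in this section can also be expressed without an explicit loop. The sketch below is one possible vectorized variant on toy data; note that it takes the first known region per country rather than the mode, and that `toy`, `lookup` and `still_missing` are illustrative names only.

```python
import pandas as pd

toy = pd.DataFrame({
    "Country": ["A", "B", "A", "B", "C"],
    "Region": ["North", "South", None, None, None],
})

# Build a country -> region lookup from the rows where the region is known.
known = toy.dropna(subset=["Region"]).drop_duplicates("Country")
lookup = known.set_index("Country")["Region"]

# Fill only the gaps; countries absent from the lookup stay missing.
toy["Region"] = toy["Region"].fillna(toy["Country"].map(lookup))

# Countries that still need a manual decision (here: "C").
still_missing = toy.loc[toy["Region"].isna(), "Country"].tolist()
print(still_missing)
```

Listing the leftover countries by name, rather than by row index, makes the manual assignment step easier to audit.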

## 6. Data Exploration

Let's take a quick look at the correlations between variables.

In [None]:
sns.pairplot(Happiness_report2)

Apparently, some variables are positively related to each other. Let's look at this in more detail.

### 6-a) Distribution of Variables

In [None]:
from scipy.stats import norm

# Plot each variable with a fitted normal curve for comparison.
distribution_columns = ["Family", "Dystopia_Residual", "Economy_GDP_per_Capita", "Freedom", "Generosity",
                        "Happiness_Rank", "Happiness_Score", "Health_Life_Expectancy",
                        "Trust_Government_Corruption"]

plt.figure(figsize=(20, 10))

for position, column in enumerate(distribution_columns, start=1):
    plt.subplot(2, 5, position)
    sns.distplot(Happiness_report2[column], fit=norm)
    plt.title(column)

plt.show()

We ignore Happiness_Rank, as this column only contains the ranking.
The variables look approximately normal, except for Generosity, Trust_Government_Corruption and Health_Life_Expectancy.
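
A visual judgement of normality can be cross-checked with a simple number such as the sample skewness. The following sketch uses synthetic data rather than the happiness columns, so the column names and values are assumptions made purely for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

samples = pd.DataFrame({
    "roughly_normal": rng.normal(loc=5.0, scale=1.0, size=1000),
    "right_skewed": rng.exponential(scale=1.0, size=1000),
})

# Sample skewness: close to 0 for symmetric data,
# clearly positive when the distribution has a long right tail.
skewness = samples.skew()
print(skewness)
```

Applied to the real data, the same call on the numeric columns would give a rough ranking of which variables depart most from symmetry.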

In [None]:
# To look at those variables more closely, plot plain histograms without the density curve by passing kde=False.

sns.distplot(Happiness_report2["Trust_Government_Corruption"], kde=False)
plt.ylim(0,100)
plt.title("Trust_Government_Corruption")

In [None]:
sns.distplot(Happiness_report2["Generosity"], kde=False)
plt.ylim(0,90)
plt.title("Generosity")

In [None]:
sns.distplot(Happiness_report2["Health_Life_Expectancy"], kde=False)
plt.ylim(0,100)
plt.title("Health_Life_Expectancy")

In [None]:
sns.distplot(Happiness_report2["Happiness_Score"], kde=False)
plt.ylim(0,100)
plt.title("Happiness_Score")
plt.show()

Most Happiness Scores fall between 4 and 6. No country reaches 10 on the 10-point scale.

The Happiness Score decreases slightly in 2017.

### 6-b) Mean Happiness Score by Year

In [None]:
# Select the numeric column before averaging so that text columns are not passed to mean().
mean_by_year = Happiness_report2.groupby(by="Year")["Happiness_Score"].mean()
print(mean_by_year[2015])
print(mean_by_year[2016])
print(mean_by_year[2017])

#### T-test

In [None]:
import scipy.stats as stats

In [None]:
Happiness_Score_2015 = Happiness_report2[Happiness_report2["Year"] == 2015].Happiness_Score
Happiness_Score_2016 = Happiness_report2[Happiness_report2["Year"] == 2016].Happiness_Score 
Happiness_Score_2017 = Happiness_report2[Happiness_report2["Year"] == 2017].Happiness_Score 


In [None]:
stats.ttest_ind(Happiness_Score_2015, Happiness_Score_2016)


In [None]:
stats.ttest_ind(Happiness_Score_2016, Happiness_Score_2017)


The changes in Happiness Score are not statistically significant; there is no evidence of a difference between years.
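
For reference, `scipy.stats.ttest_ind` with its default `equal_var=True` computes Student's two-sample t statistic with a pooled variance. A minimal NumPy sketch of that statistic is shown below; `pooled_t_statistic` is a hypothetical helper and the scores are made up.

```python
import numpy as np


def pooled_t_statistic(a, b):
    """Student's two-sample t statistic with a pooled variance estimate."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    n_a, n_b = len(a), len(b)
    # Pooled sample variance (ddof=1 gives the unbiased per-group variance).
    pooled_var = ((n_a - 1) * a.var(ddof=1) + (n_b - 1) * b.var(ddof=1)) / (n_a + n_b - 2)
    standard_error = np.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (a.mean() - b.mean()) / standard_error


# Made-up scores for two hypothetical years.
t_value = pooled_t_statistic([1, 2, 3, 4, 5], [2, 3, 4, 5, 6])
print(t_value)
```

A statistic close to zero relative to the group spread, as with the yearly means here, is what leads to a large p-value and the "no evidence of a difference" reading.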

#### A quick look at Happiness Score by Years on a graph. 

In [None]:
plt.figure(figsize = (8,6))

objects = ('2015','2016','2017')
y_pos = np.arange(len(objects))  # one bar position per label; the labels are written under the bars
performance =[mean_by_year[2015], mean_by_year[2016], mean_by_year[2017]]
 
plt.bar(y_pos, performance, align='center', alpha=0.6)
plt.yticks(size=15)
plt.xticks(y_pos, objects,size=15)
plt.xlabel('Year',size=15)
plt.ylabel('Happiness Score',size=15)
plt.title('Happiness Score by Years', fontsize=15)

plt.ylim(5.30,5.40)

plt.show()

### 6-c) The means of Happiness Score by Years and Regions 

In [None]:
# Select the numeric column before averaging so that text columns are not passed to mean().
mean_by_year_and_region = Happiness_report2.groupby(by=["Region", "Year"])["Happiness_Score"].mean()

In [None]:
mean_by_year_and_region=mean_by_year_and_region.reset_index()

In [None]:
mean_by_year_and_region.head()

In [None]:
plt.figure(figsize=(10,8))
sns.barplot(x="Region", y="Happiness_Score", hue="Year", data=mean_by_year_and_region)
plt.xticks(rotation=90)
plt.ylim((0,10))

The Happiness Score increases slightly in 5 regions.
In Western Europe the results are almost identical across years, so the difference is not visible on the graph.
The Happiness Score decreases in 4 regions.
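
Counting rising and falling regions from a bar chart is error-prone, so a table of per-region changes may help. The sketch below pivots a toy version of `mean_by_year_and_region`; the region names and scores are invented for illustration.

```python
import pandas as pd

toy_means = pd.DataFrame({
    "Region": ["North", "North", "North", "South", "South", "South"],
    "Year": [2015, 2016, 2017, 2015, 2016, 2017],
    "Happiness_Score": [6.0, 6.1, 6.3, 5.0, 4.9, 4.8],
})

# One row per region, one column per year.
wide = toy_means.pivot(index="Region", columns="Year", values="Happiness_Score")

# Change between the first and last survey year.
wide["Change_2015_2017"] = wide[2017] - wide[2015]
rising_regions = wide.index[wide["Change_2015_2017"] > 0].tolist()
print(wide)
```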

### 6-d) Overview of Happiness Score and Dystopia_Residual in Turkey

In [None]:
years_Turkey = Happiness_report2[Happiness_report2.Country == 'Turkey']['Year']

In [None]:
Happiness_Score_Turkey = Happiness_report2[Happiness_report2.Country == 'Turkey']['Happiness_Score']

In [None]:
Dystopia_Residual_Turkey = Happiness_report2[Happiness_report2.Country == 'Turkey']['Dystopia_Residual']

In [None]:
plt.figure(1, figsize = (8,8))
plt.plot(years_Turkey, Happiness_Score_Turkey, label = 'Year/Happiness_Score', color='blue', linewidth=5)
plt.plot(years_Turkey,Dystopia_Residual_Turkey, label = 'Year/Dystopia_Residual',color='red', linewidth=5)
plt.xlabel('Years')
plt.ylabel('Scores')
plt.xlim([2015, 2017])
plt.title('Happiness Score & Dystopia_Residual in Turkey')
plt.legend()
plt.show()

The Happiness Score increases slowly from 2015 to 2017 in Turkey. Through 2016 the two variables move in the same direction; in 2017 they move in opposite directions on the graph.

### 6-e) A General Look at Some Factors and Happiness Score by Region

In [None]:
Happiness_report2["Economy_GDP_per_Capita"].head()

In [None]:
plt.figure(figsize=(15,8))
sns.scatterplot(x='Happiness_Score', y='Health_Life_Expectancy', hue='Region',data=Happiness_report2, s = Happiness_report2.Economy_GDP_per_Capita*100);
plt.xlabel('Happiness_Score',size=15)
plt.ylabel('Health_Life_Expectancy', size =10)

The Africa region does not show a clear relationship between the variables.
The other regions form tighter clusters.
As the pair plot showed earlier, these variables are positively correlated. This also holds within each region except Sub-Saharan Africa.


In [None]:
plt.figure(figsize=(15,8))
sns.scatterplot(x='Happiness_Score', y='Economy_GDP_per_Capita', hue='Region',data=Happiness_report2, s = Happiness_report2.Economy_GDP_per_Capita*100);
plt.xlabel('Happiness_Score',size=15)
plt.ylabel('Economy_GDP_per_Capita', size =10)

In [None]:
plt.figure(figsize=(15,8))
sns.scatterplot(x='Happiness_Score', y='Family', hue='Region',data=Happiness_report2, s = Happiness_report2.Economy_GDP_per_Capita*100);
plt.xlabel('Happiness_Score',size=15)
plt.ylabel('Family', size =10)

In [None]:
plt.figure(figsize=(15,8))
sns.scatterplot(x='Happiness_Score', y='Trust_Government_Corruption', hue='Region',data=Happiness_report2, s = Happiness_report2.Economy_GDP_per_Capita*100);
plt.xlabel('Happiness_Score',size=15)
plt.ylabel('Trust_Government_Corruption', size =10)

In [None]:
# As the previous results show, 3 variables are closely related. Let's examine them further.


fig, axes = plt.subplots(1,3,figsize=(20,5))
baslik_font = {'family': 'arial', 'color': 'darkred','weight': 'bold','size': 13 }

happiness_score_by_three_variables = ['Economy_GDP_per_Capita','Trust_Government_Corruption', 'Health_Life_Expectancy'] 
 
for i in range(0,3):
    
    plt.subplot(1, 3, i+1)
    plt.scatter(Happiness_report2['Happiness_Score'],Happiness_report2[happiness_score_by_three_variables[i]],c='purple', s=80)
    plt.title('Happiness Score and '+ str(happiness_score_by_three_variables[i]), fontdict=baslik_font, fontsize=13, y=1.08)
    plt.xlabel('Happiness Score',size=15)
    plt.ylabel(str(happiness_score_by_three_variables[i]),size=15)

Economy_GDP_per_Capita and Health_Life_Expectancy each show a positive correlation with Happiness Score.

Trust_Government_Corruption appears to have a nonlinear, roughly exponential relationship with Happiness Score. This suggests that citizens who trust their government tend to be happier. In particular, above a Happiness Score of about 7, the trust values rise sharply.
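
When a relationship is curved rather than linear, Pearson's coefficient understates how consistently the two variables move together, and a rank-based coefficient can be more informative. The sketch below illustrates this on a synthetic exponential curve; it does not use the happiness data, and the variable names are assumptions.

```python
import numpy as np
import pandas as pd

# Synthetic, strictly increasing but strongly curved relationship.
x = pd.Series(np.linspace(0, 5, 50))
y = np.exp(x)

pearson = x.corr(y)                 # measures linear association
spearman = x.rank().corr(y.rank())  # Pearson on ranks equals Spearman's rho

print(round(pearson, 3), round(spearman, 3))
```

If the trust variable behaves in a similar way, comparing both coefficients on the real columns would show how much of the association the linear measure misses.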

### 6-f) Correlation between Variables

In [None]:
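# Restrict to numeric columns; Country and Region are text and cannot be correlated.
Corr_Matrix = Happiness_report2.corr(numeric_only=True)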
Corr_Matrix

In [None]:
Corr_Matrix.Happiness_Score.sort_values()

In [None]:
plt.figure(figsize=(10,10))
sns.heatmap(Corr_Matrix, cmap='bwr')
plt.title('Correlation Matrix')

### 6-g) General Look at Outliers

In [None]:
graph_by_eight_variables = ['Dystopia_Residual', 'Economy_GDP_per_Capita', 'Family', 'Freedom',
               'Generosity', 'Happiness_Score', 'Health_Life_Expectancy', 'Trust_Government_Corruption'] 
plt.figure(figsize=(15,8))

for i in range(0,8):
    plt.subplot(2, 4, i+1)
    plt.boxplot(Happiness_report2[graph_by_eight_variables[i]])
    plt.title(graph_by_eight_variables[i])
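
The points drawn beyond the whiskers of a box plot follow the 1.5 × IQR rule by default in matplotlib. A rough way to count them per column is sketched below; `count_iqr_outliers` is a hypothetical helper and the series is toy data.

```python
import pandas as pd


def count_iqr_outliers(values, factor=1.5):
    """Count points outside [Q1 - factor*IQR, Q3 + factor*IQR]."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    outside = (values < q1 - factor * iqr) | (values > q3 + factor * iqr)
    return int(outside.sum())


toy_column = pd.Series([1.0, 1.1, 1.2, 1.3, 1.4, 1.5, 9.0])
n_outliers = count_iqr_outliers(toy_column)
print(n_outliers)
```

Counts like this can guide how large a winsorizing limit each column needs, instead of choosing the limits by eye.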
    
    
    

## 7. Feature Engineering

In [None]:
from scipy.stats.mstats import winsorize

Happiness_report2["winsorize_Dystopia_Residual"] = winsorize(Happiness_report2["Dystopia_Residual"], (0, 0.05))
Happiness_report2["winsorize_Family"] = winsorize(Happiness_report2["Family"], (0, 0.02))
Happiness_report2["winsorize_Generosity"] = winsorize(Happiness_report2["Generosity"], (0, 0.05))
Happiness_report2["winsorize_Trust_Government_Corruption"] = winsorize(Happiness_report2["Trust_Government_Corruption"], (0, 0.03))
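
With limits of the form `(0, p)`, winsorizing leaves the lower tail untouched and caps only the top fraction `p` of the values. The NumPy sketch below approximates that idea with a percentile cap; it is not scipy's exact procedure (scipy caps at an order statistic rather than an interpolated quantile), and the data are made up.

```python
import numpy as np

values = np.arange(1, 101, dtype=float)  # toy data: 1.0, 2.0, ..., 100.0

# Cap only the upper tail: everything above the 95th percentile is pulled down to it.
upper_cap = np.quantile(values, 0.95)
capped = np.minimum(values, upper_cap)

# Number of observations that were changed by the cap.
n_changed = int((capped != values).sum())
print(upper_cap, n_changed)
```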

### 7-a) Normalizing

In [None]:
from sklearn.preprocessing import normalize

# L2-normalize each winsorized column as a single vector (unit Euclidean length).
winsorized_columns = ["winsorize_Dystopia_Residual", "winsorize_Family",
                      "winsorize_Generosity", "winsorize_Trust_Government_Corruption"]

for column in winsorized_columns:
    row_vector = np.array(Happiness_report2[column]).reshape(1, -1)
    Happiness_report2[column] = normalize(row_vector).ravel()
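
`sklearn.preprocessing.normalize` scales each row to unit L2 norm by default, so reshaping a column into a single row rescales the whole column to length 1. The sketch below shows that arithmetic in plain NumPy and contrasts it with min-max scaling; the two-element vector is a toy example.

```python
import numpy as np

column = np.array([3.0, 4.0])

# L2 normalization of the column as one vector: divide by its Euclidean length.
unit_vector = column / np.linalg.norm(column)

# Min-max scaling, shown for contrast: maps the column onto [0, 1].
min_max = (column - column.min()) / (column.max() - column.min())

print(unit_vector, min_max)
```

Because the divisor grows with the number of rows, L2-normalized values in a long column become very small, which matters for any later step that is sensitive to scale.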



### 7-b) PCA 

In [None]:
df_first_data= Happiness_report2[['Dystopia_Residual','Economy_GDP_per_Capita',
       'Family','Freedom', 'Generosity','Happiness_Score',
       'Health_Life_Expectancy', 'Trust_Government_Corruption']]

In [None]:
df_winsorize = Happiness_report2[['winsorize_Dystopia_Residual',
                                  'winsorize_Family','winsorize_Generosity','winsorize_Trust_Government_Corruption']]

In [None]:
from sklearn.decomposition import PCA 
pca = PCA(n_components=2)
pc = pca.fit_transform(df_first_data)
print (pca.explained_variance_ratio_)

In [None]:
from sklearn.decomposition import PCA 
pca = PCA(n_components=2)
pc = pca.fit_transform(df_winsorize)
print (pca.explained_variance_ratio_)

#### As shown above, PCA concentrates more variance in its first component with the original values.
#### For df_winsorize, the first component explains only 45% of the variance, while for df_first_data it explains 78%.
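
The two PCA runs use different column sets on different scales, and explained variance ratios depend strongly on scale. The sketch below computes the ratios from an SVD of centered data and shows the effect on synthetic columns; `explained_variance_ratio` is a hypothetical helper, not the scikit-learn attribute.

```python
import numpy as np


def explained_variance_ratio(data):
    """Share of total variance carried by each principal component."""
    centered = data - data.mean(axis=0)
    # Squared singular values of the centered data are proportional
    # to the variance along each principal component.
    singular_values = np.linalg.svd(centered, compute_uv=False)
    component_variance = singular_values ** 2
    return component_variance / component_variance.sum()


rng = np.random.default_rng(0)
# Synthetic columns with very different spreads (standard deviations 10, 1, 0.1).
unscaled = rng.normal(size=(200, 3)) * np.array([10.0, 1.0, 0.1])
standardized = (unscaled - unscaled.mean(axis=0)) / unscaled.std(axis=0)

ratio_unscaled = explained_variance_ratio(unscaled)
ratio_standardized = explained_variance_ratio(standardized)
print(ratio_unscaled.round(3), ratio_standardized.round(3))
```

In this toy case the first component dominates only because one column has a much larger spread, so a high first ratio is not by itself evidence of a better decomposition.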

## 8. Conclusions

The Happiness Score varies by region, with differently shaped distributions and different relationships between variables.
As the pair plot showed earlier, Happiness Score is positively correlated with Health_Life_Expectancy. This also holds within each region except Sub-Saharan Africa.
The data were examined from several angles with a variety of tools. Although the data set gives a clear picture of the correlations between variables, the PCA on the original values attributes about 78% of the variance to the first component.
Given the PCA results, the high correlations between variables appear to drive this outcome, so the original values are sufficient to see the relationships between the considered factors and the Happiness Score.
Based on the results of this study, we recommend extending the set of variables and the number of participants to gain a more accurate understanding of the Happiness Score. The Happiness Score is strongly related to Family, Economy_GDP_per_Capita and Health_Life_Expectancy.
