## Intro

### What makes a country happier?

What makes a country happy? Or, an easier question: what makes you, as an individual, happier? Your expectations for the future, your income, or an honest government? According to the <a href="https://worldhappiness.report/faq/" target="_blank">World Happiness Report</a>: "**The variables used reflect what has been broadly found in the research literature to be important in explaining national-level differences in life evaluations.**" We will use that data to better understand what makes a country happier. Without further ado, let's get started.

***
- I will be using two datasets in this analysis. One of them is the latest World Happiness Report, from 2021, and the other one contains data from previous years.
***

***
- Let's start with importing required libraries.
***

In [None]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt 
import seaborn as sns 
import matplotlib as mpl



import plotly 
import plotly.express as px
import plotly.graph_objs as go
import plotly.offline as py
from plotly.offline import iplot
from plotly.subplots import make_subplots
import plotly.figure_factory as ff

***
- Let's read our data.
***

In [None]:
df_2021 = pd.read_csv("../input/world-happiness-report-2021/world-happiness-report-2021.csv")
df_history = pd.read_csv("../input/world-happiness-report-2021/world-happiness-report.csv")
pd.set_option('display.max_columns', None)

***
- First, let's inspect the two dataframes one by one.
***

***
- Let's check statistical information of 2021 data.
***

In [None]:
df_2021.describe()

***
- What about historic data?
***

In [None]:
df_history.describe()

***
- Now, let's check general information of both dataframes.
***

In [None]:
df_2021.info()
print()
print("-"*60)
print()
df_history.info()

***
- The 2021 data has no missing values. Great! The historic data, on the other hand, has some. Some of the column names also differ between the two dataframes. Although the dataframes have 11 and 20 variables, I will focus on
    - 'Country name'
    - 'year'
    - 'Life Ladder'
    - 'Log GDP per capita'
    - 'Social support'
    - 'Healthy life expectancy at birth'
    - 'Freedom to make life choices'
    - 'Generosity'
    - 'Perceptions of corruption'
    - 'Regional indicator'
    
- Let's prepare our data and concatenate two dataframes into one dataframe.
***

In [None]:
df_2021 = df_2021.rename(columns={"Ladder score": "Life Ladder", "Logged GDP per capita": "Log GDP per capita", "Healthy life expectancy":"Healthy life expectancy at birth", })
df = pd.concat([df_history, df_2021], axis=0, join="outer", ignore_index=True)
df = df.drop(columns=['Standard error of ladder score', 'upperwhisker', 'lowerwhisker',
       'Ladder score in Dystopia', 'Explained by: Log GDP per capita',
       'Explained by: Social support', 'Explained by: Healthy life expectancy',
       'Explained by: Freedom to make life choices',
       'Explained by: Generosity', 'Explained by: Perceptions of corruption',
       'Dystopia + residual', "Positive affect", "Negative affect"])

In [None]:
df.info()
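
***
- Before concatenating, it can help to check which columns still differ after the rename. Here is a minimal sketch on two made-up toy dataframes (`toy_history` and `toy_2021` are hypothetical stand-ins, not the real data). Columns that exist in only one frame end up as NaN for the rows of the other frame.
***

```python
import pandas as pd

# Toy stand-ins for the two dataframes; all values are made up for illustration.
toy_history = pd.DataFrame({"Country name": ["A"], "year": [2020], "Life Ladder": [5.0]})
toy_2021 = pd.DataFrame({"Country name": ["A"], "Ladder score": [5.2], "Regional indicator": ["R1"]})

# Align the column that is named differently in the two frames.
renamed = toy_2021.rename(columns={"Ladder score": "Life Ladder"})

# Columns present in only one of the frames.
only_in_history = sorted(set(toy_history.columns) - set(renamed.columns))
only_in_2021 = sorted(set(renamed.columns) - set(toy_history.columns))
print(only_in_history, only_in_2021)

# An outer concat keeps every column and fills the gaps with NaN.
combined = pd.concat([toy_history, renamed], axis=0, join="outer", ignore_index=True)
print(combined)
```

This is why *`year`* and *`Regional indicator`* show up with missing values right after the concatenation.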

***
- There seem to be some missing values. Let's check them.
***

In [None]:
df.isnull().sum()

***
- The *`year`* column has missing values, but all of them are actually 2021, because we know the historic data had no missing values in its *`year`* column.
- The *`Regional indicator`* column has 1949 missing values. The reason is similar: the historic data had no *`Regional indicator`* column at all. We will fill those missing values by using the 2021 data.
- I will check the other ones later.
***

***
- Now, let's fill *`Regional indicator`* column by using 2021 data.
***

In [None]:
for i in df["Country name"].unique():
    filt = (df["Country name"] == i)
    regional = df.loc[filt, "Regional indicator"].unique()[-1]
    df.loc[filt, "Regional indicator"] = regional
df
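
***
- The same fill can also be written without a Python loop. This is only a sketch on a made-up toy dataframe, assuming each country has at most one known region: build a country-to-region lookup from the rows where the region is known, then map it onto every row.
***

```python
import pandas as pd

# Toy data: only the 2021 rows carry a region; country "C" has no 2021 row.
toy = pd.DataFrame({
    "Country name": ["A", "A", "B", "C"],
    "year": [2020, 2021, 2021, 2019],
    "Regional indicator": [None, "R1", "R2", None],
})

# Country -> region lookup built from the rows where the region is known.
lookup = (
    toy.dropna(subset=["Regional indicator"])
       .drop_duplicates("Country name")
       .set_index("Country name")["Regional indicator"]
)

# Countries missing from the lookup stay NaN.
toy["Regional indicator"] = toy["Country name"].map(lookup)
print(toy)
```

Countries that never appear in the 2021 data keep a missing region, which matches what the loop does.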

***
- Now it is time for *`year`* column.
***

In [None]:
df.year = df.year.fillna(2021)

***
- Let's check missing values again.
***

In [None]:
df.isnull().sum()

***
- There are still some missing values. What should we do? Filling them with mean or median values is one solution. We could also just drop those rows, but that is not the best option considering there are at least 110 rows with missing values.
- I will fill the missing values with each country's own mean. That seems like a proper way to me.
***

In [None]:
def fill_by_country(df):
    missing = df.drop(["Country name", "Regional indicator"], axis=1).isnull().sum()
    missing = missing[missing>0]
    for i in missing.index:
        df[i] = df.groupby("Country name")[i].transform(lambda val: val.fillna(val.mean()))
    return df

In [None]:
fill_by_country(df)

In [None]:
df.isnull().sum()
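
***
- To see what the per-country fill does, here is a small sketch on made-up numbers (the toy dataframe is hypothetical). A gap is replaced by the mean of the same country's other years. If a country has no value at all for a column, its mean is NaN too, so the gap stays.
***

```python
import numpy as np
import pandas as pd

# Toy data: country A has one gap, country B has no Generosity value at all.
toy = pd.DataFrame({
    "Country name": ["A", "A", "A", "B", "B"],
    "Generosity": [0.1, np.nan, 0.3, np.nan, np.nan],
})

# Per-row mean of the country that the row belongs to.
country_mean = toy.groupby("Country name")["Generosity"].transform("mean")

# Fill each gap with its own country's mean.
toy["Generosity"] = toy["Generosity"].fillna(country_mean)
print(toy)
```

That is one reason a few missing values can survive this step.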

***
- We filled most of the missing values, but some remain. Let's check whether the same rows have more than one missing value. If that is the case, it may be wise to just drop those rows.
***

In [None]:
df[df["Log GDP per capita"].isnull()]

***
- All missing values in the *`Log GDP per capita`*, *`Generosity`* and *`Regional indicator`* columns are in the same 12 rows. Four of those rows are also missing *`Healthy life expectancy at birth`*. Let's drop those 12 rows.
***

In [None]:
df = df[~df["Log GDP per capita"].isnull()]
df.isnull().sum()

***
- Apart from the *`Regional indicator`* column, we have only 2 missing values. Let's look at one of them.
***

In [None]:
df[df["Social support"].isnull()]

***
- Great! Our last missing values, in *`Social support`* and *`Perceptions of corruption`*, are in the same row. Let's drop this row too.
***

In [None]:
df = df[~df["Social support"].isnull()]
df.isnull().sum()

***
- We have done it! There are no missing values left in any column other than *`Regional indicator`*, where it is OK to leave them as NaN. They remain because the 2021 data has no information about some countries that appear in the historic data. It is not a problem though. We can start our analysis now.
***

***
## What Makes a Country Happier?
***

In [None]:
def trust(corrupt):
    # Bucket "Perceptions of corruption": higher perceived corruption means lower trust.
    if corrupt >= 0.8450:
        return "Low trust in institutions"
    elif corrupt > 0.781:
        return "Lower than normal level trust in institutions"
    elif corrupt > 0.667:
        return "About the normal level trust in institutions"
    else:
        return "High level trust in institutions"
df["Trust in institutions"] = df["Perceptions of corruption"].apply(trust)

***
- First, let's check correlation among the numerical variables in the dataset.
***

In [None]:
df.drop("year",axis=1).select_dtypes("number").corr()  # correlate numeric columns only

***
- The happiness score (Life Ladder) is strongly correlated with GDP, social support and healthy life expectancy at birth.
- Freedom to make life choices and the happiness score are moderately correlated.
- Perceptions of corruption and the happiness score have a weak negative correlation.
***
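
***
- A quick way to read a correlation matrix is to keep only the happiness column and sort it. A minimal sketch on a made-up toy dataframe (the numbers are invented and perfectly linear on purpose), restricted to numeric columns so that text columns such as the country name are ignored:
***

```python
import pandas as pd

# Toy data with one text column and three numeric columns.
toy = pd.DataFrame({
    "Country name": ["A", "B", "C", "D"],
    "Life Ladder": [4.0, 5.0, 6.0, 7.0],
    "Log GDP per capita": [8.0, 9.0, 10.0, 11.0],
    "Perceptions of corruption": [0.9, 0.8, 0.7, 0.6],
})

# Correlation of every numeric variable with the happiness score, sorted ascending.
corr_with_ladder = (
    toy.select_dtypes("number")
       .corr()["Life Ladder"]
       .drop("Life Ladder")
       .sort_values()
)
print(corr_with_ladder)
```

The most negative correlations come first and the most positive ones last.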

***
- Let's see correlation in the heatmap.
***

In [None]:
corr = df.select_dtypes("number").corr()  # compute once, numeric columns only
fig = go.Figure(go.Heatmap(z=corr.values, x=corr.columns.tolist(), y=corr.columns.tolist(),
                          colorscale="viridis"))
fig.show()

***
- What about happiness at the regional level?
***

In [None]:
df.groupby("Regional indicator")["Life Ladder"].describe().sort_values(by="std")

***
- North America and ANZ has the highest mean happiness, while Sub-Saharan Africa has the lowest.
- Western Europe has the highest maximum happiness score, and South Asia has the lowest maximum.
- North America and ANZ has the smallest standard deviation and an almost identical median and mean, which suggests that happiness is distributed roughly symmetrically across the countries in that region.
- Middle East and North Africa has the highest standard deviation.
***
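
***
- Comparing mean and median is only a rough check of symmetry. A small sketch on made-up regions `R1` and `R2` (hypothetical data) that puts the mean, the median and the sample skewness side by side:
***

```python
import pandas as pd

# Toy data: R1 is symmetric, R2 has one very low score that drags the mean down.
toy = pd.DataFrame({
    "Regional indicator": ["R1"] * 5 + ["R2"] * 5,
    "Life Ladder": [5.0, 5.5, 6.0, 6.5, 7.0, 3.0, 6.0, 6.2, 6.4, 6.6],
})

summary = toy.groupby("Regional indicator")["Life Ladder"].agg(["mean", "median", "skew"])

# A mean well below the median hints at a tail on the low side.
summary["mean_minus_median"] = summary["mean"] - summary["median"]
print(summary)
```

A gap close to zero, as in `R1`, is consistent with a symmetric distribution, but it does not prove that the distribution is normal.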

***
- Let's see all this more clearly with a box plot.
***

In [None]:
fig = px.box(df, x="Life Ladder", y="Regional indicator", hover_data=["Country name"])  # hover_data takes column names
fig.show()

***
- South Asia, Latin America and Caribbean, North America and ANZ, and Western Europe have several outliers on the low side.
- Sub-Saharan Africa has an outlier on the high side.
- Middle East and North Africa seems interesting to me since it has a long tail on both the low and the high side.
- Let's dive into Middle East and North Africa.
***

In [None]:
middle_east = df[df["Regional indicator"] == "Middle East and North Africa"]
middle_east

***
- Let's see how correlated the variables are in Middle East and North Africa.
***

In [None]:
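df.drop("year",axis=1).select_dtypes("number").corr()  # whole dataset, numeric columns only

In [None]:
middle_east.drop("year",axis=1).select_dtypes("number").corr()  # Middle East and North Africa only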

***
- Although it shares some similarities with the whole-dataset correlation matrix, the Middle East and North Africa matrix differs in places.
- GDP is slightly more correlated with happiness than in the whole dataset, while Generosity is much more correlated with happiness in Middle East and North Africa.
- Social support, Healthy life expectancy at birth and Perceptions of corruption are much less correlated with happiness than in the whole dataset.
- The correlation for Freedom to make life choices is almost identical in both datasets.
***

***
- Let's look at heatmap for better understanding.
***

In [None]:
me_corr = middle_east.select_dtypes("number").corr()  # compute once, numeric columns only
fig = go.Figure(go.Heatmap(z=me_corr.values, x=me_corr.columns.tolist(), y=me_corr.columns.tolist(),
                          colorscale="viridis"))
fig.show()

In [None]:
middle_east.describe()

***
- Based on descriptive information, possible outliers can be seen in the:
    - Healthy life expectancy at birth
    - Freedom to make life choices
    - Perceptions of corruption
***
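
***
- Reading outliers off `describe()` is a judgment call. One common rule of thumb, used here as my own assumption rather than anything the dataset prescribes, flags values more than 1.5 times the interquartile range outside the quartiles. A sketch with a hypothetical helper `iqr_outliers` and made-up values:
***

```python
import pandas as pd

def iqr_outliers(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return the values lying more than k * IQR below Q1 or above Q3."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return values[(values < q1 - k * iqr) | (values > q3 + k * iqr)]

# Toy column: six similar values and one that is much lower.
toy = pd.Series([0.70, 0.72, 0.74, 0.75, 0.76, 0.78, 0.20],
                name="Freedom to make life choices")

flagged = iqr_outliers(toy)
print(flagged)
```

This is the same kind of rule that box plots typically use to draw individual points beyond the whiskers.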

***
- In this EDA, I will mostly focus on the happiness score.
***

## Happiness in the Middle East and North Africa

***
- Let's start with a box plot.
***

In [None]:
fig = px.box(middle_east, x="Life Ladder", hover_data=["Country name"])  # hover_data takes column names
fig.update_traces(quartilemethod="inclusive")
fig.show()

***
- The box plot does not show any possible outliers.
***

***
- Let's continue with a bar plot.
***

In [None]:
middle_east = middle_east.sort_values(by="Life Ladder")

In [None]:
fig = px.bar(middle_east, x="Life Ladder", y="Country name")
fig.show()

***
- **Israel** has the highest happiness score in Middle East and North Africa.
- **Yemen** has the lowest happiness score in Middle East and North Africa.
***

In [None]:
# Mean happiness score per country over all years, computed once and sorted.
country_mean = middle_east.groupby("Country name")["Life Ladder"].mean().sort_values()

fig=go.Figure()
fig.add_trace(go.Scatter(
    x=country_mean.index,
    y=country_mean,
    name='Happiness Score',
    mode='markers',
    marker_color='blue',
    marker_size=10,
))
fig.update_layout(
    title= "<b>Mean Happiness Score by Country in the Middle East and North Africa</b>",
    xaxis_title="<b>Country</b>",
    yaxis_title="<b>Happiness Score</b>",
    template='plotly_white',
    font=dict(
        size=12,
        color="Black",
        family="Oswald, sans-serif"
        ),
    xaxis=dict(showgrid=True),
    yaxis=dict(showgrid=True),
)
fig.show()

***
- It is important to understand that we are working not just with the 2021 data but with every year in the combined dataset. These values are the means over all of those years.
- Most of the countries have a happiness score between 4.5 and 6.5.
***
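
***
- Since a mean over all years can differ from the latest value, it may help to put both next to each other. A minimal sketch on made-up numbers (the toy dataframe is hypothetical):
***

```python
import pandas as pd

# Toy data: two countries, two years each.
toy = pd.DataFrame({
    "Country name": ["A", "A", "B", "B"],
    "year": [2020, 2021, 2020, 2021],
    "Life Ladder": [5.0, 6.0, 7.0, 6.0],
})

# Mean over all available years per country.
mean_score = toy.groupby("Country name")["Life Ladder"].mean()

# Score in 2021 only, indexed by country so the two series line up.
score_2021 = toy[toy["year"] == 2021].set_index("Country name")["Life Ladder"]

comparison = pd.DataFrame({"mean_all_years": mean_score, "score_2021": score_2021})
print(comparison)
```

In this toy example the two countries tie in 2021, while the all-years mean ranks B above A.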

***
- Let's see happiness score trends in the past years.
***

In [None]:
middle_east = middle_east.sort_values(["Country name", "year"])

css3_colors = ['#add8e6', '#f08080','#e0ffff','#fafad2','#d3d3d3','#90ee90','#ffb6c1','#ffa07a','#20b2aa','#87cefa','#778899','#b0c4de','#32cd32','#ff00ff','#66cdaa','#ba55d3', '#7b68ee']
css3_dict = dict(zip(middle_east["Country name"].unique(), css3_colors))  # one line color per country

fig=go.Figure()
for name in middle_east['Country name'].unique():
    country_rows = middle_east[middle_east['Country name']==name]  # filter once per country
    fig.add_trace(go.Scatter(
        x=country_rows['year'],
        y=country_rows['Life Ladder'],
        name=name,
        mode='lines+markers',
        marker_color='black',
        line=dict(color=css3_dict[name]),
        marker_size=3,
        yaxis='y1'))
    
fig.update_layout(
    title="Happiness Score Trend in Middle East and North Africa",
    xaxis_title="Year",
    yaxis_title='Happiness Score',
    template='plotly_white',
    font=dict(
        size=14,
        color="Blue",
        family="Oswald, sans-serif"
    ),
    xaxis=dict(showgrid=True),
    yaxis=dict(showgrid=True)
)
fig.show()

***
- Algeria, Jordan, Turkey and United Arab Emirates show a downward trend.
- Happiness in Bahrain and Iraq has increased over the last few years.
***
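
***
- Trends read off a line chart can be double-checked with a simple number. As a rough sketch on made-up data, and assuming a straight-line fit is a fair summary, the slope of a least-squares line per country gives the average change in score per year. The helper `yearly_slope` is hypothetical.
***

```python
import numpy as np
import pandas as pd

# Toy data: country A falls by 0.2 per year, country B rises by 0.3 per year.
toy = pd.DataFrame({
    "Country name": ["A"] * 4 + ["B"] * 4,
    "year": [2018, 2019, 2020, 2021] * 2,
    "Life Ladder": [6.0, 5.8, 5.6, 5.4, 4.0, 4.3, 4.6, 4.9],
})

def yearly_slope(group: pd.DataFrame) -> float:
    """Slope of a degree-1 least-squares fit of score against year."""
    # Centering the years keeps the fit numerically well behaved.
    centered_year = group["year"] - group["year"].mean()
    return float(np.polyfit(centered_year, group["Life Ladder"], deg=1)[0])

slopes = {name: yearly_slope(group) for name, group in toy.groupby("Country name")}
print(slopes)
```

A negative slope points to a downward trend and a positive slope to an upward one. With only a few years per country, the number should be read with care.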

***
- Let's start looking at how the other variables relate to the happiness score.
***

## GDP's Impact on Happiness Score

***
- Is money really the answer to our happiness? Let's check!
***

In [None]:
me_2021 = middle_east[middle_east["year"]==2021]  # filter once so x, y and hover text stay aligned
trace = go.Scatter(x=me_2021['Life Ladder'],y=me_2021['Log GDP per capita'],
                   text=me_2021['Country name'],mode='markers',marker={'color':'blue', 'size':10})
layout = go.Layout(title='Happiness Score & Logged GDP per capita in Middle East',xaxis=dict(title='Ladder Score'),
                   yaxis=dict(title='Log GDP per capita'),hovermode='closest')
figure = go.Figure(data=[trace],layout=layout)  # pass the trace directly instead of overwriting df
figure.update_layout(template='plotly_white',
                  font=dict(family="Oswald, sans-serif"))
figure.show()

In [None]:
print(f"Happiness score and GDP have a correlation of {round(middle_east['Life Ladder'].corr(middle_east['Log GDP per capita']),2)}.")

***
- The answer to the question I asked before is yes, at least in terms of correlation. GDP has a really strong correlation with the happiness score, and we can see this in the scatter plot.
***

***
- What about social support?
***

## Social Support's Impact on Happiness Score

In [None]:
me_2021 = middle_east[middle_east["year"]==2021]  # filter once so x, y and hover text stay aligned
trace = go.Scatter(x=me_2021['Life Ladder'],y=me_2021['Social support'],
                   text=me_2021['Country name'],mode='markers',marker={'color':'blue', 'size':10})
layout = go.Layout(title='Happiness Score & Social support in Middle East',xaxis=dict(title='Ladder Score'),
                   yaxis=dict(title='Social support'),hovermode='closest')
figure = go.Figure(data=[trace],layout=layout)  # pass the trace directly instead of overwriting df
figure.update_layout(template='plotly_white',
                  font=dict(family="Oswald, sans-serif"))
figure.show()

In [None]:
print(f"Happiness score and social support have a correlation of {round(middle_east['Life Ladder'].corr(middle_east['Social support']),2)}.")

***
- Social support also has a strong correlation with happiness.
***

## Healthy Life Expectancy at Birth's Impact on Happiness Score

In [None]:
me_2021 = middle_east[middle_east["year"]==2021]  # filter once so x, y and hover text stay aligned
trace = go.Scatter(x=me_2021['Life Ladder'],y=me_2021['Healthy life expectancy at birth'],
                   text=me_2021['Country name'],mode='markers',marker={'color':'blue', 'size':10})
layout = go.Layout(title='Happiness Score & Healthy life expectancy at birth in Middle East',xaxis=dict(title='Ladder Score'),
                   yaxis=dict(title='Healthy life expectancy at birth'),hovermode='closest')
figure = go.Figure(data=[trace],layout=layout)  # pass the trace directly instead of overwriting df
figure.update_layout(template='plotly_white',
                  font=dict(family="Oswald, sans-serif"))
figure.show()

In [None]:
print(f"Happiness score and healthy life expectancy at birth have a correlation of {round(middle_east['Life Ladder'].corr(middle_east['Healthy life expectancy at birth']),2)}.")

***
- Healthy life expectancy at birth has a strong correlation with happiness.
***

## Perceptions of Corruption's Impact on Happiness Score

In [None]:
me_2021 = middle_east[middle_east["year"]==2021]  # filter once so x, y and hover text stay aligned
trace = go.Scatter(x=me_2021['Life Ladder'],y=me_2021['Perceptions of corruption'],
                   text=me_2021['Country name'],mode='markers',marker={'color':'blue', 'size':10})
layout = go.Layout(title='Happiness Score & Perceptions of corruption in Middle East',xaxis=dict(title='Ladder Score'),
                   yaxis=dict(title='Perceptions of corruption'),hovermode='closest')
figure = go.Figure(data=[trace],layout=layout)  # pass the trace directly instead of overwriting df
figure.update_layout(template='plotly_white',
                  font=dict(family="Oswald, sans-serif"))
figure.show()

In [None]:
print(f"Happiness score and perceptions of corruption have a correlation of {round(middle_east['Life Ladder'].corr(middle_east['Perceptions of corruption']),2)}.")

***
- As expected, perceptions of corruption is negatively correlated with happiness. If the people who are ruling you are not honest, it is hard to stay happy.
***

## Freedom to Make Life Choices' Impact on Happiness Score

In [None]:
me_2021 = middle_east[middle_east["year"]==2021]  # filter once so x, y and hover text stay aligned
trace = go.Scatter(x=me_2021['Life Ladder'],y=me_2021['Freedom to make life choices'],
                   text=me_2021['Country name'],mode='markers',marker={'color':'blue', 'size':10})
layout = go.Layout(title='Happiness Score & Freedom to make life choices in Middle East',xaxis=dict(title='Ladder Score'),
                   yaxis=dict(title='Freedom to make life choices'),hovermode='closest')
figure = go.Figure(data=[trace],layout=layout)  # pass the trace directly instead of overwriting df
figure.update_layout(template='plotly_white',
                  font=dict(family="Oswald, sans-serif"))
figure.show()

In [None]:
print(f"Happiness score and freedom to make life choices have a correlation of {round(middle_east['Life Ladder'].corr(middle_east['Freedom to make life choices']),2)}.")

***

We have come to the end of another analysis. I really enjoyed working with this dataset, and I would like to thank the dataset contributor for the data. I hope you enjoyed it too. If you liked my EDA on this dataset, feel free to check out my other notebooks as well. I look forward to your feedback. Thanks a lot.

Have a great day.