# In this Exploratory Data Analysis (EDA), the following points will be discussed:

- Descriptive statistics of the 2021 dataset

- Correlation among the variables

- Happiness Score distribution across the regions

- Descriptive statistics of the 2021 dataset for the Latin America and Caribbean region

- Happiness Score trend in 2021 and previous years in Latin America and Caribbean



Note: If you ever encounter "ModuleNotFoundError: No module named 'plotly'", you can run the pip command below.

In [None]:
# pip install plotly
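As a small sketch of the check behind that note, the standard library can report whether a module is importable before the import is attempted. The helper name `is_installed` is an assumption made for illustration and is not part of the original notebook.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if `module_name` can be found on the import path."""
    return importlib.util.find_spec(module_name) is not None


# A standard-library module is always found; a made-up name is not.
print(is_installed("json"))
print(is_installed("surely_not_a_real_module_xyz"))
```

Calling `is_installed("plotly")` before the import cell would show whether the pip command is needed.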

As a first step, let's import the libraries needed for the analysis.

In [None]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt 
import seaborn as sns 
import matplotlib as mpl



import plotly 
import plotly.express as px
import plotly.graph_objs as go
import plotly.offline as py
from plotly.offline import iplot
from plotly.subplots import make_subplots
import plotly.figure_factory as ff

In this Exploratory Data Analysis (EDA), the World Happiness Report 2021 and the World Happiness Report (which includes data from before 2021) will be used.

In [None]:
df_2021 = pd.read_csv("../input/world-happiness-report-2021/world-happiness-report-2021.csv")
df_past = pd.read_csv("../input/world-happiness-report-2021/world-happiness-report.csv")

In [None]:
df_2021.head()

In [None]:
df_2021.shape

In [None]:
df_past.head()

In [None]:
df_past.shape

In [None]:
df_2021.columns

In [None]:
df_past.columns

To make the two datasets easier to compare, we will rename some columns in the older dataset so that they match the 2021 dataset:
"Life Ladder" will be "Ladder score", "Log GDP per capita" will be "Logged GDP per capita" and "Healthy life expectancy at birth" will be "Healthy life expectancy".

In [None]:
df_past = df_past.rename(columns={'Life Ladder':'Ladder score', 'Log GDP per capita':'Logged GDP per capita',
                                  'Healthy life expectancy at birth':'Healthy life expectancy'})
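A quick way to confirm that a rename like this lines the two frames up is to compare their column sets afterwards. The sketch below uses two tiny made-up frames (`toy_2021`, `toy_past`) rather than the real files, so the values are illustrative only.

```python
import pandas as pd

# Toy stand-ins for the two files; the values are made up for illustration.
toy_2021 = pd.DataFrame({
    "Country name": ["A", "B"],
    "Ladder score": [6.1, 5.2],
    "Logged GDP per capita": [9.8, 9.1],
})
toy_past = pd.DataFrame({
    "Country name": ["A", "B"],
    "year": [2019, 2019],
    "Life Ladder": [6.0, 5.0],
    "Log GDP per capita": [9.7, 9.0],
})

toy_past = toy_past.rename(columns={
    "Life Ladder": "Ladder score",
    "Log GDP per capita": "Logged GDP per capita",
})

# Columns present in both frames after renaming.
shared = sorted(set(toy_2021.columns) & set(toy_past.columns))
print(shared)
```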

In [None]:
df_2021.sample(3)

In [None]:
df_past.sample(3)

### In this EDA, the focus will be on the following columns:
- 'Country name',
- 'Regional indicator',
- 'Ladder score',
- 'Logged GDP per capita',
- 'Social support',
- 'Healthy life expectancy',
- 'Freedom to make life choices',
- 'Generosity',
- 'Perceptions of corruption'


For that reason, we will make further adjustments. First things first, let's create a new dataframe that consists of the columns above.

In [None]:
df1_2021 = df_2021[['Country name','Regional indicator','Ladder score','Logged GDP per capita',
                    'Social support','Healthy life expectancy','Freedom to make life choices',
                    'Generosity','Perceptions of corruption']].copy()

In [None]:
df1_2021.head()

Let's check the general info of the new dataset to see whether there are any missing values.

In [None]:
df1_2021.info()

So far all seems good.
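Reading non-null counts from `info()` works, but an explicit per-column count of missing values is harder to misread. A minimal sketch on a made-up frame (the `toy` data is an assumption, not the real dataset):

```python
import pandas as pd

# Made-up data with one deliberately missing score.
toy = pd.DataFrame({
    "Country name": ["A", "B", "C"],
    "Ladder score": [6.1, None, 5.2],
    "Generosity": [0.1, 0.0, -0.2],
})

# Number of missing values in each column.
missing_per_column = toy.isna().sum()
print(missing_per_column)
```

On the real data the same call would be `df1_2021.isna().sum()`; a column of zeros confirms there is nothing to impute.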

Now, let's check the general statistical information of the dataset.

In [None]:
df1_2021.describe()

Let's see what the correlation among the numerical variables in the dataset looks like.

In [None]:
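# Restrict to numeric columns: recent pandas versions raise an error on text columns in corr()
df1_2021.select_dtypes(include="number").corr()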

What counts as a strong correlation differs from field to field. In some areas a coefficient of 0.8 or above is taken as a sign of strong correlation; in others, the threshold may drop to around 0.6.

- With that in mind, Ladder score has a strong correlation with Logged GDP per capita, Social support and Healthy life expectancy.

- Freedom to make life choices and Ladder score have a mid-level correlation.

- Perceptions of corruption and Ladder score have a weak negative correlation.
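Because the cut-offs are a matter of convention, it can help to make them explicit parameters. The helper below is a sketch only: the function name and the default thresholds (0.6 and 0.4) are assumptions chosen for illustration, not values taken from the report.

```python
def correlation_strength(r: float, strong: float = 0.6, moderate: float = 0.4) -> str:
    """Label a correlation coefficient by the size of its absolute value."""
    size = abs(r)
    if size >= strong:
        label = "strong"
    elif size >= moderate:
        label = "moderate"
    else:
        label = "weak"
    direction = "negative" if r < 0 else "positive"
    return f"{label} {direction}"


# Example coefficients are made up.
print(correlation_strength(0.79))
print(correlation_strength(-0.3))
```

Raising `strong` to 0.8 reproduces the stricter convention mentioned above.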

Let's draw a heatmap of the correlation matrix.

In [None]:
plt.figure(figsize=(12, 8))
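# Numeric columns only, so corr() does not fail on the text columns
sns.heatmap(df1_2021.select_dtypes(include="number").corr(), annot=True)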
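A correlation matrix is symmetric, so half of the heatmap repeats the other half. One common option is to hide the upper triangle with a boolean mask. The sketch below only builds the mask on made-up data; passing it to seaborn through the `mask` argument of `sns.heatmap` is left as a comment.

```python
import numpy as np
import pandas as pd

# Made-up numeric data standing in for the real columns.
toy = pd.DataFrame({
    "Ladder score": [7.0, 6.0, 5.0, 4.0],
    "Social support": [0.9, 0.8, 0.8, 0.6],
    "Generosity": [0.1, -0.1, 0.2, 0.0],
})
corr = toy.corr()

# True marks cells to hide: the diagonal and everything above it.
mask = np.triu(np.ones(corr.shape, dtype=bool))
print(mask)
# The mask can then be passed on, e.g. sns.heatmap(corr, mask=mask, annot=True).
```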

Before moving on to Latin America and Caribbean, let's look at the happiness score at the regional level.

In [None]:
df1_2021.groupby("Regional indicator")["Ladder score"].describe()

Based on the above information, it is easy to see that

- Western Europe has the highest happiness score;

on the other hand,

- South Asia and Sub-Saharan Africa have the lowest happiness scores among the regions.
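Which region counts as highest depends on the statistic read from the table (mean, median or maximum), so sorting by one chosen statistic makes the ranking explicit. A minimal sketch on made-up regions and scores:

```python
import pandas as pd

# Made-up regions and scores for illustration.
toy = pd.DataFrame({
    "Regional indicator": ["North", "North", "South", "South", "East"],
    "Ladder score": [7.0, 6.0, 4.0, 5.0, 5.5],
})

# Mean score per region, highest first.
regional_mean = (
    toy.groupby("Regional indicator")["Ladder score"]
    .mean()
    .sort_values(ascending=False)
)
print(regional_mean)
```

Swapping `.mean()` for `.median()` or `.max()` gives the ranking under a different definition.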

Let's look at a boxplot to see the overall distribution of the happiness score across the regions.

In [None]:
# hover_data takes column names; the country name is shown when hovering over a point
fig = px.box(data_frame=df1_2021, x="Ladder score", y="Regional indicator", hover_data=["Regional indicator", "Country name"])
fig.update_traces(quartilemethod="inclusive")
fig.show()

Based on the happiness score distributions across the regions, several outliers can be seen in:

- Latin America and Caribbean
- Central and Eastern Europe

Now, let's look in detail at the Latin America and Caribbean part of our dataset.

In [None]:
df1_2021["Regional indicator"].unique()

In [None]:
latin_america_caribbean = df1_2021[df1_2021["Regional indicator"]=="Latin America and Caribbean"]
latin_america_caribbean

Let's see how correlated the variables in Latin America and Caribbean are.

In [None]:
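# Numeric columns only, so corr() does not fail on the text columns
latin_america_caribbean.select_dtypes(include="number").corr()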

There are many similarities with the correlation matrix of the whole 2021 dataset. In Latin America and Caribbean, the happiness score's correlations with the other variables are mostly higher than in the whole dataset, although some of them are lower.
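Comparing two correlation matrices by eye is error-prone; placing the Ladder score column of each side by side shows directly where the region is higher or lower. The sketch below uses made-up data with two hypothetical regions, `R1` and `R2`.

```python
import pandas as pd

# Made-up data: two hypothetical regions with opposite relationships.
toy = pd.DataFrame({
    "Regional indicator": ["R1", "R1", "R1", "R2", "R2", "R2"],
    "Ladder score": [5.0, 6.0, 7.0, 4.0, 5.0, 6.0],
    "Social support": [0.5, 0.6, 0.7, 0.9, 0.8, 0.7],
})

# Correlation of every numeric variable with Ladder score, whole dataset.
overall = (
    toy.select_dtypes(include="number")
    .corr()["Ladder score"]
    .drop("Ladder score")
)
# Same quantity restricted to one region.
regional = (
    toy.loc[toy["Regional indicator"] == "R1"]
    .select_dtypes(include="number")
    .corr()["Ladder score"]
    .drop("Ladder score")
)

comparison = pd.DataFrame({"overall": overall, "region": regional})
comparison["difference"] = comparison["region"] - comparison["overall"]
print(comparison)
```

A positive `difference` means the variable is more strongly positively related to the score within the region than overall.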

In [None]:
plt.figure(figsize=(12, 8))
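# Numeric columns only, so corr() does not fail on the text columns
sns.heatmap(latin_america_caribbean.select_dtypes(include="number").corr(), annot=True);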

In [None]:
latin_america_caribbean.describe()

Based on the descriptive statistics, in particular the mean-median differences and the IQRs (interquartile ranges), possible outliers can be seen in:

- Happiness Score
- Logged GDP per capita
- Social Support
- Healthy life expectancy
- Freedom to make life choices
- Generosity
- Perceptions of corruption
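The two signals mentioned above can be put in one table instead of being read off `describe()`. This is a sketch on made-up numbers: a mean well below the median hints at low-end outliers, and the IQR gives the scale against which that gap can be judged.

```python
import pandas as pd

# Made-up values; the first score is deliberately far below the rest.
toy = pd.DataFrame({
    "Ladder score": [3.6, 5.8, 5.9, 6.0, 6.1, 6.3],
    "Generosity": [-0.10, -0.05, 0.00, 0.05, 0.10, 0.12],
})

summary = pd.DataFrame({
    "mean": toy.mean(),
    "median": toy.median(),
    "iqr": toy.quantile(0.75) - toy.quantile(0.25),
})
# Negative values mean the mean is pulled below the median.
summary["mean_minus_median"] = summary["mean"] - summary["median"]
print(summary)
```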

For the sake of simplicity, in this EDA we will focus on the Happiness Score.
For further detail, let's make a boxplot and a barplot.

In [None]:
plt.figure(figsize=(10, 6))
sns.boxplot(x="Ladder score", data=latin_america_caribbean);

In Latin America and Caribbean, based on the happiness score distribution, two possible outliers can be seen. Both are at the lower end of the happiness score.
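Points drawn beyond the whiskers of a boxplot are, by the usual default, those lying more than 1.5 times the IQR outside the quartiles. The same rule can be applied directly to list the countries involved. The data below is made up, and the exact quartile method can differ slightly between libraries, so borderline points may not match a given plot.

```python
import pandas as pd

# Made-up scores; country "A" is deliberately far below the others.
toy = pd.DataFrame({
    "Country name": ["A", "B", "C", "D", "E", "F", "G"],
    "Ladder score": [3.6, 5.7, 5.8, 5.9, 6.0, 6.1, 6.2],
})

q1 = toy["Ladder score"].quantile(0.25)
q3 = toy["Ladder score"].quantile(0.75)
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Rows falling outside either fence.
outliers = toy[(toy["Ladder score"] < lower_fence) | (toy["Ladder score"] > upper_fence)]
print(outliers)
```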

In [None]:
plt.figure(figsize=(12, 8))
sns.barplot(x="Ladder score", y="Country name", data=latin_america_caribbean, orient="h",
           order=latin_america_caribbean.sort_values('Ladder score')["Country name"]);

In [None]:
fig = px.bar(latin_america_caribbean, x='Ladder score', y='Country name')
fig.show()

As seen in the barplot,

- Costa Rica has the highest happiness score in the Latin America and Caribbean region.
- Haiti and Venezuela have the lowest happiness scores in the Latin America and Caribbean region and are possible outliers based on the happiness score distribution in the region.

It would be a good idea to see and compare the happiness score trends in past years. Let's look at the previous years' happiness scores for all the countries in the region.

In [None]:
latin_america_caribbean_past=df_past[df_past['Country name'].isin(latin_america_caribbean['Country name'].to_list())].loc[:,'Country name':'Ladder score']
latin_america_caribbean_past.sample(3)

In [None]:
plt.figure(figsize=(12, 8), dpi=200)
sns.pointplot(x="year", y="Ladder score", data=latin_america_caribbean_past, hue="Country name")
plt.title("Happiness Score Trend in Latin America and Caribbean")
plt.xlabel("Year")
plt.ylabel("Happiness Score")
plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left', borderaxespad=0);
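The plot above draws on the historical file only. To place 2021 on the same axis, the 2021 rows need a `year` column before the two frames are stacked, since the nine columns selected earlier contain none. The sketch below shows that step on made-up frames; it assumes the historical file does not already contain 2021 rows.

```python
import pandas as pd

# Made-up stand-ins for the historical and 2021 frames.
toy_past = pd.DataFrame({
    "Country name": ["A", "A", "B"],
    "year": [2019, 2020, 2020],
    "Ladder score": [6.0, 6.2, 5.0],
})
toy_2021 = pd.DataFrame({
    "Country name": ["A", "B"],
    "Ladder score": [6.1, 5.2],
})

# The 2021 frame has no year column in this sketch, so one is added explicitly.
toy_2021_with_year = toy_2021.assign(year=2021)

# Stack both frames and order the rows by country and year.
trend = (
    pd.concat([toy_past, toy_2021_with_year], ignore_index=True)
    .sort_values(["Country name", "year"])
    .reset_index(drop=True)
)
print(trend)
```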

And finally, let's look at the Happiness Score's relationship with the other variables in the region.

## Happiness Score & Logged GDP per capita

In [None]:
# Scatter plot of Ladder score against Logged GDP per capita; hovering shows the country name
trace = go.Scatter(x=latin_america_caribbean['Ladder score'], y=latin_america_caribbean['Logged GDP per capita'],
                   text=latin_america_caribbean['Country name'], mode='markers', marker={'color':'blue', 'size':10})
data = [trace]
layout = go.Layout(title='Happiness Score & Logged GDP per capita in Latin America and Caribbean',
                   xaxis=dict(title='Ladder Score'), yaxis=dict(title='Logged GDP per capita'), hovermode='closest')
figure = go.Figure(data=data, layout=layout)
figure.update_layout(template='plotly_white',
                  font=dict(family="Oswald, sans-serif"))
figure.show()

## Happiness Score & Social Support

In [None]:
# Scatter plot of Ladder score against Social support; hovering shows the country name
trace = go.Scatter(x=latin_america_caribbean['Ladder score'], y=latin_america_caribbean['Social support'],
                   text=latin_america_caribbean['Country name'], mode='markers', marker={'color':'blue', 'size':10})
data = [trace]
layout = go.Layout(title='Happiness Score & Social Support in Latin America and Caribbean',
                   xaxis=dict(title='Ladder Score'), yaxis=dict(title='Social support'), hovermode='closest')
figure = go.Figure(data=data, layout=layout)
figure.update_layout(template='plotly_white',
                  font=dict(family="Oswald, sans-serif"))
figure.show()

## Happiness Score & Healthy Life Expectancy

In [None]:
# Scatter plot of Ladder score against Healthy life expectancy; hovering shows the country name
trace = go.Scatter(x=latin_america_caribbean['Ladder score'], y=latin_america_caribbean['Healthy life expectancy'],
                   text=latin_america_caribbean['Country name'], mode='markers', marker={'color':'blue', 'size':10})
data = [trace]
layout = go.Layout(title='Happiness Score & Healthy Life Expectancy in Latin America and Caribbean',
                   xaxis=dict(title='Ladder Score'), yaxis=dict(title='Healthy life expectancy'), hovermode='closest')
figure = go.Figure(data=data, layout=layout)
figure.update_layout(template='plotly_white',
                  font=dict(family="Oswald, sans-serif"))
figure.show()

## Happiness Score & Freedom to Make Life Choices

In [None]:
# Scatter plot of Ladder score against Freedom to make life choices; hovering shows the country name
trace = go.Scatter(x=latin_america_caribbean['Ladder score'], y=latin_america_caribbean['Freedom to make life choices'],
                   text=latin_america_caribbean['Country name'], mode='markers', marker={'color':'blue', 'size':10})
data = [trace]
layout = go.Layout(title='Happiness Score & Freedom to Make Life Choices in Latin America and Caribbean',
                   xaxis=dict(title='Ladder Score'), yaxis=dict(title='Freedom to make life choices'), hovermode='closest')
figure = go.Figure(data=data, layout=layout)
figure.update_layout(template='plotly_white',
                  font=dict(family="Oswald, sans-serif"))
figure.show()

## Happiness Score & Generosity

In [None]:
# Scatter plot of Ladder score against Generosity; hovering shows the country name
trace = go.Scatter(x=latin_america_caribbean['Ladder score'], y=latin_america_caribbean['Generosity'],
                   text=latin_america_caribbean['Country name'], mode='markers', marker={'color':'blue', 'size':10})
data = [trace]
layout = go.Layout(title='Happiness Score & Generosity in Latin America and Caribbean',
                   xaxis=dict(title='Ladder Score'), yaxis=dict(title='Generosity'), hovermode='closest')
figure = go.Figure(data=data, layout=layout)
figure.update_layout(template='plotly_white',
                  font=dict(family="Oswald, sans-serif"))
figure.show()

## Happiness Score & Perceptions of Corruption

In [None]:
# Scatter plot of Ladder score against Perceptions of corruption; hovering shows the country name
trace = go.Scatter(x=latin_america_caribbean['Ladder score'], y=latin_america_caribbean['Perceptions of corruption'],
                   text=latin_america_caribbean['Country name'], mode='markers', marker={'color':'blue', 'size':10})
data = [trace]
layout = go.Layout(title='Happiness Score & Perceptions of Corruption in Latin America and Caribbean',
                   xaxis=dict(title='Ladder Score'), yaxis=dict(title='Perceptions of corruption'), hovermode='closest')
figure = go.Figure(data=data, layout=layout)
figure.update_layout(template='plotly_white',
                  font=dict(family="Oswald, sans-serif"))
figure.show()
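The six scatter cells differ only in the column plotted on the y axis, so the title and axis labels can be derived from the column name instead of being typed by hand each time. The helper below is a hypothetical sketch (the name `scatter_labels` is not from the notebook). It returns a plain dictionary with the same keys the cells pass to `go.Layout`, so it could be unpacked as `go.Layout(**scatter_labels(column))`.

```python
def scatter_labels(y_column: str, region: str = "Latin America and Caribbean") -> dict:
    """Build title and axis labels for a Ladder score vs. `y_column` scatter plot."""
    return {
        "title": f"Happiness Score & {y_column} in {region}",
        "xaxis": {"title": "Ladder Score"},
        "yaxis": {"title": y_column},
        "hovermode": "closest",
    }


# The six variables compared against the Ladder score in this section.
columns = [
    "Logged GDP per capita",
    "Social support",
    "Healthy life expectancy",
    "Freedom to make life choices",
    "Generosity",
    "Perceptions of corruption",
]
for column in columns:
    print(scatter_labels(column))
```

Deriving the y-axis label from the column name keeps the label and the plotted data from drifting apart when a cell is copied.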

* Thanks for taking the time to read the Latin America and Caribbean Happiness Score EDA.



* All the best