## <i>Exploratory Data Analysis - Happiness</i><br>
### This dataset has been taken from [Kaggle](https://www.kaggle.com/).<br>

We will be performing data cleaning, preparation and visualization on the [World Happiness Report](https://www.kaggle.com/ajaypalsinghlo/world-happiness-report-2021) dataset.<br>

<b>The focus of this study will be to see:</b><br>
<ul>
    <li>How do the happiness scores of countries in a region change over time?</li>
    <li>Which countries rank highest and lowest by mean score across all years?</li>
    <li>What are the main criteria defining happiness?</li>
    <li>Have countries improved or worsened over time and, if so, which factors led to the change?</li>
</ul>

In [None]:
# import necessary libraries
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px

import plotly.io as pio
# to save plotly interactive plots as html files

In [None]:
# importing two datasets: one contains data for multiple years, the other contains data for 2021 only
# we will merge the two datasets later

df_allyears = pd.read_csv('../input/world-happiness-report-2021/world-happiness-report.csv')

df_2021 = pd.read_csv('../input/world-happiness-report-2021/world-happiness-report-2021.csv')

In [None]:
print(f"Dataset contains : {len(df_allyears)} rows\n")
df_allyears.head(3)

In [None]:
# let's check which years are covered by this dataset
df_allyears['year'].unique()
# df_allyears['Country name'].nunique()

In [None]:
# checking if there are any null values

def check_null(df):
    for col in df.columns:
        # isnull().mean() is a fraction, so scale it to a percentage
        pct = df[col].isnull().mean() * 100
        print(f'{col} --- \t{pct:.2f}% null values')

check_null(df_allyears)

<i>Some of the columns have missing values, but since they make up only <b>a small share</b> of each column we can move forward.</i>
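
The same check can also be written without an explicit loop. This is only a minimal sketch on made-up data: the `toy` frame and the helper name `null_percentages` are assumptions for illustration, not part of the original notebook.

```python
import numpy as np
import pandas as pd


def null_percentages(df):
    """Return the percentage of missing values per column, largest first."""
    return (df.isnull().mean() * 100).sort_values(ascending=False)


# toy data for illustration only, not taken from the happiness dataset
toy = pd.DataFrame({
    'Score': [7.1, np.nan, 5.0, 4.2],
    'Generosity': [0.1, 0.2, 0.3, 0.4],
})

pct = null_percentages(toy)
print(pct.round(2))
```

On the real data the helper would be called as `null_percentages(df_allyears)`; sorting makes the columns with the most gaps easy to spot.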

In [None]:
# checking datatypes

df_allyears.dtypes

In [None]:
# let's bring in the second dataset so we can merge the two
print(f'Dataset contains : {len(df_2021)} rows\n')
df_2021.head(3)

This dataset contains more columns than the one covering multiple years.<br>
We will have to drop the extra columns, as they are only present for 2021 and not for the other years.<br>

Note: the columns <i>Positive affect</i> and <i>Negative affect</i> are not present in df_2021.
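
The column differences can also be listed programmatically instead of being read off the two previews. This is a minimal sketch assuming two small stand-in frames (`multi_year` and `single_year` are hypothetical names with made-up column subsets).

```python
import pandas as pd

# stand-in frames with made-up column subsets, for illustration only
multi_year = pd.DataFrame(columns=['Country name', 'year', 'Life Ladder', 'Positive affect'])
single_year = pd.DataFrame(columns=['Country name', 'Ladder score', 'upperwhisker'])

# set differences show which columns exist on only one side
only_in_single = sorted(set(single_year.columns) - set(multi_year.columns))
only_in_multi = sorted(set(multi_year.columns) - set(single_year.columns))

print('only in single-year frame:', only_in_single)
print('only in multi-year frame: ', only_in_multi)
```

With the real frames, the same two lines would take `df_2021.columns` and `df_allyears.columns`.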

In [None]:
# before we specify the columns we want, we add a new column 'Year' so that it is available for the merge

df_2021['Year'] = 2021

# specifying the columns we want in the dataframe
# .copy() avoids a SettingWithCopyWarning when the columns are renamed in place later
df_2021 = df_2021[['Country name', 'Regional indicator', 'Year', 'Ladder score', 'Logged GDP per capita',
                   'Social support', 'Healthy life expectancy',
                   'Freedom to make life choices', 'Generosity',
                   'Perceptions of corruption']].copy()

df_2021.head(3)

In [None]:
# let's check if df_2021 contains any null values

check_null(df_2021)

<b>Great!</b> All columns are filled to the brim!

In [None]:
# before we merge, let's rename columns on both df's for ease of merging

df_allyears.rename(columns={'Country name': 'Country', 'year': 'Year', 'Life Ladder': 'Score',
                        'Healthy life expectancy at birth': 'Healthy life expectancy',
                        'Log GDP per capita': 'GDP score', 'Freedom to make life choices': 'Freedom'}, 
                         inplace=True)

df_2021.rename(columns={'Country name': 'Country', 'Ladder score': 'Score', 'Regional indicator': 'Region',
                       'Logged GDP per capita': 'GDP score', 'Freedom to make life choices': 'Freedom'},
                        inplace=True)

In [None]:
# merging

# df_2021 is the only dataset with region labels, so we map each country to its region from there
# (assigning the result of a merge relies on row order; mapping by country name does not)
# countries that do not appear in df_2021 get NaN as their region

region_map = df_2021.set_index('Country')['Region']
df_allyears['Region'] = df_allyears['Country'].map(region_map)

eda_happy = pd.merge(df_allyears, df_2021, how='outer',
                     on=['Country', 'Year', 'Score', 'GDP score','Social support',
                        'Healthy life expectancy', 'Freedom', 'Generosity', 
                        'Perceptions of corruption', 'Region'])


In [None]:
# rearranging columns

r_list = ['Region', 'Country', 'Year', 'Score', 'GDP score', 'Social support', 'Healthy life expectancy', 'Freedom',
         'Generosity', 'Perceptions of corruption', 'Positive affect', 'Negative affect']

eda_happy = eda_happy[r_list]

eda_happy.head(3)

<b>Great!</b> Now our dataset is ready!

In [None]:
# to create a choropleth graph we need country codes
# luckily, the dataset from our previous project (Plastic Pollution) contains country codes
# let's merge these codes into our eda_happy dataset

con_codes = pd.read_csv('../input/plastic-datasets/per-capita-plastic-waste-vs-gdp-per-capita.csv')

con_codes.rename(columns={'Entity': 'Country'}, inplace=True)
con_codes = con_codes[['Country', 'Code']].drop_duplicates()

print(con_codes.head(3))

eda_happy = pd.merge(eda_happy, con_codes, how='left', on='Country')
eda_happy.sort_values(by='Year', inplace=True)
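
Because the two sources may spell some country names differently, a left merge can leave some rows without a code, and those countries would then be missing from the map. The sketch below shows one way to list them using the `indicator` option of `merge`. The `happy` and `codes` frames and their values are made up for illustration.

```python
import pandas as pd

# made-up frames: one country name is spelled differently in the two sources
happy = pd.DataFrame({
    'Country': ['Denmark', 'Ivory Coast', 'Denmark'],
    'Year': [2020, 2020, 2021],
})
codes = pd.DataFrame({
    'Country': ['Denmark', "Cote d'Ivoire"],
    'Code': ['DNK', 'CIV'],
})

# indicator=True adds a '_merge' column telling where each row came from
merged = happy.merge(codes, how='left', on='Country', indicator=True)

# rows marked 'left_only' found no matching country code
missing = sorted(merged.loc[merged['_merge'] == 'left_only', 'Country'].unique())
print(missing)
```

Any names listed this way could be renamed in one of the frames before merging.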

In [None]:
# since we have missing score values for most of the countries for the years 2005 & 2006,
# we will be mapping from 2007 to see the change in levels for all the countries.
# Though, if you want to see the levels from 2005, you can change the dataframe parameter to : eda_happy

eda_mapping = eda_happy[(eda_happy['Year'] != 2005) &  (eda_happy['Year'] != 2006)]

px.choropleth(eda_mapping, locations='Code', color='Score', hover_name='Country',
             animation_frame='Year', color_continuous_scale=px.colors.sequential.Plasma,
             projection='natural earth', title="Happiness Levels in Countries From 2007 - 2021",
             template='seaborn', range_color=[2, 7])
    
# figure.update_layout(paper_bgcolor = '#2e3141', font_color='white')

#### Most of the countries/states in the North America & ANZ and the Western European regions have maintained high levels of happiness throughout the years.
We will see which factors led to this!

#### Similarly, most of the countries in the Sub-Saharan African, Asian & Southeast Asian regions have happiness levels fluctuating between low and medium, indicating a lower happiness score throughout the years.
We will see which factors led to this!
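
The regional pattern seen on the map can also be summarised as a table of mean scores per region and year. This is a minimal sketch on made-up data: the `toy` frame and its values are assumptions, not figures from the dataset.

```python
import pandas as pd

# made-up scores for two hypothetical regions
toy = pd.DataFrame({
    'Region': ['West', 'West', 'West', 'East', 'East', 'East'],
    'Country': ['A', 'B', 'A', 'C', 'D', 'C'],
    'Year': [2020, 2020, 2021, 2020, 2020, 2021],
    'Score': [7.0, 6.0, 7.5, 4.0, 5.0, 4.5],
})

# one row per year, one column per region, each cell the mean score
region_trend = toy.pivot_table(index='Year', columns='Region',
                               values='Score', aggfunc='mean')
print(region_trend)
```

On the real data the same call would run on `eda_happy`, and `region_trend.plot()` would draw one line per region.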

<ul>Before we continue, let's check the mean score.<br>
    <li>Any country above the mean score will be considered happy or progressing.</li>
    <li>Any country below the mean score will be considered unhappy or struggling.</li></ul>
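
The rule above can be sketched as a small labelling step. This is an illustration on made-up data, and it assumes that each country's own mean score across years is compared with the overall mean. The labels `'happy'` and `'unhappy'` are placeholders.

```python
import numpy as np
import pandas as pd

# made-up scores for three hypothetical countries
toy = pd.DataFrame({
    'Country': ['A', 'A', 'B', 'B', 'C', 'C'],
    'Score': [7.0, 7.4, 4.0, 4.4, 5.0, 5.2],
})

overall_mean = toy['Score'].mean()
country_mean = toy.groupby('Country')['Score'].mean()

# label each country by comparing its mean score with the overall mean
label = pd.Series(np.where(country_mean > overall_mean, 'happy', 'unhappy'),
                  index=country_mean.index)
print(label)
```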

In [None]:
np.mean(eda_happy['Score'])
# if we round this score we will get 5

In [None]:
%matplotlib inline
# To see the top 10 highest and lowest countries we need to create plot data differently

top_happy = eda_happy.groupby('Country', as_index=False)['Score'].mean().sort_values(
    by='Score',ascending=False)[:10]
top_unhappy = eda_happy.groupby('Country', as_index=False)['Score'].mean().sort_values(
    by='Score', ascending=True)[:10]

plt.style.use('seaborn-v0_8')  # the 'seaborn' style was renamed in Matplotlib 3.6; use 'seaborn' on older versions
plt.figure(2, figsize=(12,8))
sns.barplot(data=top_happy, x='Score', y='Country', palette='Blues_d')
plt.xlabel('Score',fontsize=14, fontweight='bold')
plt.ylabel('Country', fontsize=14, fontweight='bold')
plt.title('Top 10 highest happiness scored countries', fontsize=16, fontweight='bold')
plt.xticks(fontsize=11)
plt.yticks(fontsize=11)

# fig = plt.gcf()
plt.show()
# fig.savefig('top10happy.jpg')


### From the above graph we can see that Denmark has the highest mean happiness score across all the years
We'll see which factors contributed to this!

In [None]:
%matplotlib inline

plt.style.use('seaborn-v0_8')  # the 'seaborn' style was renamed in Matplotlib 3.6; use 'seaborn' on older versions
plt.figure(3, figsize=(12,8))
sns.barplot(data=top_unhappy, x='Score', y='Country', palette='Reds_r_d')
plt.xlabel('Score',fontsize=14, fontweight='bold')
plt.ylabel('Country', fontsize=14, fontweight='bold')
plt.title('Top 10 lowest happiness scored countries', fontsize=16, fontweight='bold')
plt.xlim(0,8)
plt.xticks(fontsize=11)
plt.yticks(fontsize=11)

# plt.axes.grid(color='white')
# ax = plt.gca()
# ax.set_facecolor("#2e3141")

# fig = plt.gcf()
plt.show()
# fig.savefig('top10unhappyt.jpg', bbox_inches='tight')

### From the above graph we can see that South Sudan has the lowest mean happiness score across all the years
We'll see which factors contributed to this!

In [None]:
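# creating a correlation matrix to see which factors contribute the most to happiness levels
# numeric_only=True skips text columns such as Country and Region (required on pandas >= 2.0)
happy_corr = eda_happy.corr(numeric_only=True)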

In [None]:
plt.figure(figsize=(12,8))
plt.style.use('seaborn-v0_8')  # the 'seaborn' style was renamed in Matplotlib 3.6; use 'seaborn' on older versions
sns.heatmap(happy_corr, annot=True)
plt.xticks(fontsize=11, fontstyle='normal')
plt.yticks(fontsize=11, fontstyle='normal')
plt.title("Correlation of Factors with Happiness Score", fontsize=14, fontweight='bold')

# fig = plt.gcf()
plt.show()
# fig.savefig('happy_corr.jpg', bbox_inches="tight")

### It seems that the <i>GDP per capita score</i>, <i>Healthy life expectancy</i> & <i>Social support</i> of a country are the main factors contributing to its overall happiness level!

##### Surprisingly, GDP score and Healthy life expectancy are the most closely related pair! Let's see how!
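
Reading the strongest factors off a heatmap can be error-prone, so the correlations with the score can also be ranked directly. This is a minimal sketch on made-up numbers (the `toy` frame is an assumption). It ranks by absolute value so that strong negative correlations are not overlooked.

```python
import pandas as pd

# made-up values for illustration only
toy = pd.DataFrame({
    'Score': [4.0, 5.0, 6.0, 7.0],
    'GDP score': [8.0, 9.0, 10.0, 11.0],
    'Generosity': [0.3, 0.1, 0.2, 0.0],
    'Country': ['A', 'B', 'C', 'D'],
})

# keep numeric columns only, then take the 'Score' column of the matrix
corr = toy.select_dtypes('number').corr()
ranking = corr['Score'].drop('Score').sort_values(key=abs, ascending=False)
print(ranking)
```

On the real data, `ranking` would be computed from `eda_happy` in the same way.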

In [None]:
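# for this we only need rows where the plotted columns are present
# a bare dropna() would also drop every 2021 row, because the affect columns are missing for 2021

gdp_map = eda_happy.dropna(subset=['GDP score', 'Healthy life expectancy', 'Region'])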
figure = px.scatter(gdp_map, x='GDP score', y='Healthy life expectancy', color='Region',
          template='plotly_white', hover_name='Country',
          animation_group='Country')

figure.update_layout(plot_bgcolor='#2e3141', paper_bgcolor='#2e3141', legend_font_color='lightgray',
                    font_color='lightgray')


### As the GDP per capita of a country increases, its healthy life expectancy tends to increase as well.

#### As seen in our first geo graph, most of the countries in the Sub-Saharan African region have a low GDP score and thus a low healthy life expectancy.

## <i>Conclusion:</i>
<ul>
    <li>The factors most strongly tied to a higher happiness level in any country are its <b>GDP per capita</b>, the <b>social support</b> it has and its <b>healthy life expectancy</b>.</li><br>
    <li>Other factors that contribute to a good overall happiness score are <b>Freedom</b> & <b>Positive affect</b>.</li><br>
</ul>
<i>All of these criteria are, however, interrelated.</i><br>

Thus most of the countries in the affected regions, such as Sub-Saharan Africa, South Asia & some countries in the Latin American region, should focus on improving their GDP score and the health of their citizens.<br>This should in turn increase positive affect among their citizens, strengthening social support and contributing to an overall higher happiness level.<br>

On the other hand, countries with a steadily increasing GDP score, such as Singapore in the Southeast Asian region, show an increasing healthy life expectancy over multiple years.<br>
These also include most of the countries in the Western European region.
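
Whether a country has improved or worsened over time can be checked by comparing its first and last recorded values. This is a minimal sketch on made-up data (the `toy` frame and its values are assumptions). The same pattern would work for `'Score'` or `'Healthy life expectancy'`.

```python
import pandas as pd

# made-up GDP scores for two hypothetical countries
toy = pd.DataFrame({
    'Country': ['A', 'A', 'A', 'B', 'B', 'B'],
    'Year': [2019, 2020, 2021, 2019, 2020, 2021],
    'GDP score': [10.0, 10.5, 11.0, 9.0, 8.75, 8.5],
})

# sort by year so that first()/last() pick the earliest and latest values
ordered = toy.sort_values(['Country', 'Year'])
grouped = ordered.groupby('Country')['GDP score']
change = grouped.last() - grouped.first()
print(change)
```

A positive value means the country's value rose between its first and last recorded year, and a negative value means it fell.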