# World Happiness Report
<div style="color:#140033;
           display:fill;
           border-radius:15px;
            border-style: solid;
           border-width: 15px;
            border-color:#f0e6ff;
           background-color:#f0e6ff;
           letter-spacing:0.75px;
            font-family:'Futura';
            line-height: 1.7em;
            font-size:1.5em;">
    
<h2><b>Context</b></h2><h3 style = "line-height:1.3;">
This notebook is an exploratory data analysis of the <a href = "https://www.kaggle.com/ajaypalsinghlo/world-happiness-report-2021" style="color:#cc5200;">World Happiness Report</a> and <a href = "https://www.kaggle.com/rsrishav/world-population" style="color:#cc5200;">World Population</a> datasets. Geographic data, namely <a href = "https://www.kaggle.com/andradaolteanu/country-mapping-iso-continent-region" style="color:#cc5200;">Region & Sub-region</a> and <a href = "https://www.kaggle.com/paultimothymooney/latitude-and-longitude-for-every-country-and-state" style="color:#cc5200;">Latitude & Longitude</a>, is also joined to these data points to identify and visualize more patterns in the happiness index.</h3>


<h2><b>Content</b></h2>
    <h3 style = "line-height:1.3;">
The happiness scores and rankings use data from the Gallup World Poll. The columns that accompany the happiness score estimate the extent to which each of six factors (economic production, social support, life expectancy, freedom, absence of corruption, and generosity) contributes to making life evaluations higher in each country than they are in Dystopia, a hypothetical country with the world's lowest national averages for each factor.</h3>
</div>


In [None]:
#Import necessary libraries
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set_style("whitegrid")
sns.set_palette("tab10")
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
import plotly.express as px
import plotly.graph_objs as go
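from plotly.offline import init_notebook_mode,iplot
init_notebook_mode(connected=True) # set up plotly so that iplot can render figures inline
# 'seaborn-notebook' was renamed to 'seaborn-v0_8-notebook' in matplotlib 3.6
plt.style.use('seaborn-v0_8-notebook' if 'seaborn-v0_8-notebook' in plt.style.available else 'seaborn-notebook')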

In [None]:
# Loading the input files into dataframes
df_2021 = pd.read_csv('/kaggle/input/world-happiness-report-2021/world-happiness-report-2021.csv')
df = pd.read_csv('/kaggle/input/world-happiness-report-2021/world-happiness-report.csv')

# Data Analysis

<div style="color:#140033;
           display:fill;
           border-radius:75px;
            border-style: solid;
           border-width: 15px;
            border-color:#f0e6ff;
           background-color:#f0e6ff;
           letter-spacing:0.75px;
            font-family:'Futura';
            line-height: 1.7em;
            font-size:1.5em;"><i>world-happiness-report:-</i></div>

In [None]:
df.rename(columns={"Country name": "country"}, inplace= True)
df.info(show_counts = False)
df.head()
df.describe().T.style.bar(subset=['mean','std','min','25%','50%','75%','max'], color='#20c8f2')\
                      .background_gradient(subset=['mean','std','min','25%','50%','75%','max'], cmap='YlGn')

<div style="color:#140033;
           display:fill;
           border-radius:75px;
            border-style: solid;
           border-width: 15px;
            border-color:#f0e6ff;
           background-color:#f0e6ff;
           letter-spacing:0.75px;
            font-family:'Futura';
            line-height: 1.7em;
            font-size:1.5em;"><i>world-happiness-report-2021:-</i></div>

In [None]:
df_2021.rename(columns={"Country name": "country"}, inplace= True)
df_2021.info(show_counts = False)
df_2021.head()
df_2021.describe().T

# Introducing Population
<div style="color:#140033;
display:fill;
border-radius:15px;
border-style: solid;
border-width: 15px;
border-color:#f0e6ff;
background-color:#f0e6ff;
letter-spacing:0.75px;
font-family:'Futura';
line-height: 1.7em;
font-size:1.5em;">
<h3 style = "line-height:1.3;">
<ul>
<li>There is no population-related data in the input data.</li><br>
<li>It is plausible that a country's happiness index depends on its population.</li><br>
<li>Adding the population information will help in understanding the correlations in the data.</li></ul></h3>
<hr>
<h3>Let's have a look at the population data</h3>
</div>

In [None]:
#Import population data into a dataframe
df_pop = pd.read_csv('../input/d/rsrishav/world-population/2021_population.csv')
df_pop.info(show_counts = False)
df_pop.head()

<div style="color:#140033;
           display:fill;
           border-radius:15px;
            border-style: solid;
           border-width: 15px;
            border-color:#f0e6ff;
           background-color:#f0e6ff;
           letter-spacing:0.75px;
            font-family:'Futura';
            line-height: 1.7em;
            font-size:1.5em;"><h3 style = "line-height:1.3;">
    Let's convert the numerical values, which are currently stored as strings, into numeric types (integers for counts, floats for rates) and have a look at the statistics.</h3>
</div>
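
The conversion described above can also be written with vectorised string methods. The block below is a minimal sketch on a made-up sample; the column names and the exact string formats (comma separators, a trailing percent sign) are assumptions for illustration and may not match the real population file.

```python
import pandas as pd

# Hypothetical sample: the values and string formats are assumptions,
# not rows taken from the population dataset.
sample = pd.DataFrame({
    "population": ["1,444,216,107", "331,002,651", "83,783,942"],
    "growth_rate": ["0.34%", "0.58%", "0.09%"],
})

# Remove the thousands separators, then parse the digits as integers.
sample["population"] = pd.to_numeric(
    sample["population"].str.replace(",", "", regex=False)
)

# Remove the trailing percent sign, then parse as floats.
sample["growth_rate"] = pd.to_numeric(sample["growth_rate"].str.rstrip("%"))

print(sample.dtypes)
```

Compared with slicing off a fixed number of characters, stripping a named suffix keeps working if the suffix length changes.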

In [None]:
df_pop['2021_last_updated'] = df_pop['2021_last_updated'].apply(lambda x : int(str(x).replace(',','')))
df_pop['2020_population'] = df_pop['2020_population'].apply(lambda x : int(str(x).replace(',','')))
df_pop['density_sq_km'] = df_pop.density_sq_km.apply(lambda x : int(str(x).replace(',','')[:-6]))
df_pop['area'] = df_pop.area.apply(lambda x : int(str(x).replace(',','')[:-6]))
df_pop['growth_rate'] = df_pop.growth_rate.apply(lambda x : float(str(x)[:-2]))
df_pop['world_%'] = df_pop['world_%'].apply(lambda x : float(str(x)[:-2]))
df_pop.describe().T

<div style="color:#140033;
   display:fill;
   border-radius:15px;
    border-style: solid;
   border-width: 15px;
    border-color:#f0e6ff;
   background-color:#f0e6ff;
   letter-spacing:0.75px;
    font-family:'Futura';
    line-height: 1.7em;
    font-size:1.5em;">
<h2>The following data joins will make the analysis considerably richer.</h2>
<h3 style = "line-height:1.3;">
<ul>
<li>Latitude and longitude data to visualise the data points globally.</li>
<li>Continent and sub-continent data to identify patterns related to specific regions.</li>
</ul>
</h3>
</div>

In [None]:
df_country = pd.read_csv('../input/latitude-and-longitude-for-every-country-and-state/world_country_and_usa_states_latitude_and_longitude_values.csv')
# Load the ISO mapping and rename its code columns to the names used for joining
df_country_iso = pd.read_csv('../input/country-mapping-iso-continent-region/continents2.csv').rename(columns={"alpha-2": "country_code","alpha-3":"iso_code"})
# Keep only the country-level columns; this drops the USA state columns, as we are dealing with country data
df_country = df_country.merge(df_country_iso,on = 'country_code')[['iso_code','latitude','longitude']]
df_continent = pd.read_csv('../input/country-mapping-iso-continent-region/continents2.csv')
df_continent.rename(columns = {'alpha-3':'iso_code'}, inplace = True)
df_pop = df_pop.merge(df_country, on = 'iso_code')
df_pop = df_pop.merge(df_continent[['iso_code','region','sub-region']], on = 'iso_code')
df_pop.drop(columns = ['2020_population'], inplace= True)
df_pop.rename(columns = {'2021_last_updated':'population'}, inplace= True)
df_pop.head()
df_pop.describe().T.style.bar(color='#20c8f2')\
                      .background_gradient(subset=['mean','std','min','25%','50%','75%','max'], cmap='YlGn')
del df_country
del df_country_iso
del df_continent

In [None]:
# Find the mismatched country names between the two dataframes
c1 = df_pop.country.value_counts().index
c2 = df.country.value_counts().index
# Country names that appear in only one of the two dataframes
c1_minus_c2 = list(set(c1) - set(c2))
c1_minus_c2.sort()
c2_minus_c1 = list(set(c2) - set(c1))
c2_minus_c1.sort()
print(c1_minus_c2)
print(c2_minus_c1)

<div style="color:#140033;
           display:fill;
           border-radius:15px;
            border-style: solid;
           border-width: 15px;
            border-color:#f0e6ff;
           background-color:#f0e6ff;
           letter-spacing:0.75px;
            font-family:'Futura';
            line-height: 1.7em;
            font-size:1.5em;">
    <h3 style = "line-height:1.3;">
Because we are merging dataframes from two different datasets, some of the keys will not match. Let's map some of them by replacing the values that differ. Here the <b>Country</b> column acts as the key.
    </h3>
</div>
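
One way to check such a mapping is an outer merge with `indicator=True`, which labels every row by the table it came from. This is a minimal sketch on toy frames; the names `happiness`, `population`, and `name_map` are hypothetical and only two of the mapped pairs are shown.

```python
import pandas as pd

# Toy frames standing in for the happiness and population tables.
happiness = pd.DataFrame(
    {"country": ["Finland", "Congo (Kinshasa)", "Taiwan Province of China"]}
)
population = pd.DataFrame({"country": ["Finland", "Dr Congo", "Taiwan"]})

# A dict keeps each old name next to its replacement.
name_map = {
    "Congo (Kinshasa)": "Dr Congo",
    "Taiwan Province of China": "Taiwan",
}
happiness["country"] = happiness["country"].replace(name_map)

# The _merge column is "both" only for keys present in both tables.
check = happiness.merge(population, on="country", how="outer", indicator=True)
unmatched = check.loc[check["_merge"] != "both", "country"].tolist()
print(unmatched)
```

An empty `unmatched` list means every key found a partner, so an inner merge would not silently drop rows.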

In [None]:
old_values = ['Congo (Brazzaville)', 'Congo (Kinshasa)', 'Hong Kong S.A.R. of China', 'North Cyprus', 'Palestinian Territories', 'Somaliland region', 'Taiwan Province of China', 'Trinidad and Tobago']
new_values = ['Republic Of The Congo', 'Dr Congo', 'Hong Kong', 'Cyprus', 'Palestine', 'Somalia', 'Taiwan', 'Trinidad And Tobago']
df['country'] = df['country'].replace(old_values,new_values)
df_2021['country'] = df_2021['country'].replace(old_values,new_values)

<div style="color:#140033;
           display:fill;
           border-radius:15px;
            border-style: solid;
           border-width: 15px;
            border-color:#f0e6ff;
           background-color:#f0e6ff;
           letter-spacing:0.75px;
            font-family:'Futura';
            line-height: 1.7em;
            font-size:1.5em;">
    <h3>Let's have a look at our final data</h3>
</div>

In [None]:
df = df.merge(df_pop,how = 'inner',on='country')
df_2021 = df_2021.merge(df_pop,how = 'inner',on='country')
df.sort_values(by = 'year', inplace = True)
df.info(show_counts = False)

<div style="color:#140033;
           display:fill;
           border-radius:15px;
            border-style: solid;
           border-width: 15px;
            border-color:#f0e6ff;
           background-color:#f0e6ff;
           letter-spacing:0.75px;
            font-family:'Futura';
            line-height: 1.7em;
            font-size:1.5em;">
    <h2>Distribution of <b>Generosity</b> data :-</h2>
</div>

In [None]:
fig,ax = plt.subplots(figsize = (20,10))
g = sns.boxplot(data = df,x = 'region', y = 'Generosity',hue='year', ax = ax)
#plt.savefig('Boxplot_of_merged_data.png')

<div style="color:#00381c;
           display:fill;
           border-radius:50px;
            border-style: solid;
            padding: 25px 25px;
           border-width: 5px;
            border-color:#00381c;
           background-color:#b8fcda;
           letter-spacing:0.75px;
            font-family:'Futura';
            line-height: 1.7em;
            font-size:1.5em;">
    <h3 style = "line-height:1.3;">The above plot suggests that the Generosity index dropped significantly in the <b>Americas</b> region over the span of 15 years.</h3>
</div>

In [None]:
year_wise_cnt = df.year.value_counts()
fig, ax = plt.subplots(figsize = (10,10))
plt.title('Observation counts by Year',fontsize = 'xx-large',weight = 'bold');
sns.countplot(data = df,y ='year',ax = ax);

<div style="color:#00381c;
           display:fill;
           border-radius:50px;
            border-style: solid;
            padding: 25px 25px;
           border-width: 5px;
            border-color:#00381c;
           background-color:#b8fcda;
           letter-spacing:0.75px;
            font-family:'Futura';
            line-height: 1.7em;
            font-size:1.5em;">
    <h3 style = "line-height:1.3;">
Observations from the year 2005 are comparatively few, and the Generosity value is missing for most of those rows. The data needs to be cleaned by filling the missing values with suitable replacements.
    </h3>
</div>
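
The two observations above (few rows and many missing values in a given year) can be checked side by side. The following is a minimal sketch on a toy frame with invented values; `toy` and `summary` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Invented values, used only to illustrate the per-year summary.
toy = pd.DataFrame({
    "year": [2005, 2005, 2006, 2006, 2006],
    "Generosity": [np.nan, 0.1, 0.2, np.nan, 0.3],
})

# Rows per year, and how many of them lack a Generosity value.
rows = toy.groupby("year").size()
missing = toy["Generosity"].isna().groupby(toy["year"]).sum()

summary = pd.DataFrame({"rows": rows, "missing": missing})
summary["missing_share"] = summary["missing"] / summary["rows"]
print(summary)
```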

In [None]:
df.groupby('year')[['Generosity']].count().plot(figsize = (18,8));
plt.title('Observation counts of Generosity by year',fontsize = 'xx-large',weight = 'bold');

<div style="color:#00381c;
           display:fill;
           border-radius:50px;
            border-style: solid;
            padding: 25px 25px;
           border-width: 5px;
            border-color:#00381c;
           background-color:#b8fcda;
           letter-spacing:0.75px;
            font-family:'Futura';
            line-height: 1.7em;
            font-size:1.5em;">
    <h3 style = "line-height:1.3;">The Generosity index is much higher in the year <b>2005</b> than in other years. However, there are very few observations in 2005.</h3></div>
    
*** 
# Handling missing values
<div style="color:#140033;
           display:fill;
           border-radius:15px;
            border-style: solid;
           border-width: 15px;
            border-color:#f0e6ff;
           background-color:#f0e6ff;
           letter-spacing:0.75px;
            font-family:'Futura';
            line-height: 1.7em;
            font-size:1.5em;">
<h3> Missing value counts in each column</h3>
    </div>

In [None]:
df.isnull().sum()

<div style="color:#140033;
           display:fill;
           border-radius:15px;
            border-style: solid;
           border-width: 15px;
            border-color:#f0e6ff;
           background-color:#f0e6ff;
           letter-spacing:0.75px;
            font-family:'Futura';
            line-height: 1.7em;
            font-size:1.5em;"><h3 style = "line-height:1.3;">
    Let's try to fill in Generosity values based on the other observations from its country/region data</h3></div>
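
The two-level fill can be sketched on a toy frame, assuming a country mean is preferred and the region mean is only a fallback. The frame `toy` and its values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Country B has no Generosity value at all, so its own mean is undefined.
toy = pd.DataFrame({
    "country": ["A", "A", "B", "B", "C"],
    "region": ["X", "X", "X", "X", "Y"],
    "Generosity": [0.2, np.nan, np.nan, np.nan, 0.5],
})

# First pass: fill gaps with the mean of the same country.
toy["Generosity"] = toy.groupby("country")["Generosity"].transform(
    lambda grp: grp.fillna(grp.mean())
)

# Second pass: whatever is still missing falls back to the region mean.
toy["Generosity"] = toy.groupby("region")["Generosity"].transform(
    lambda grp: grp.fillna(grp.mean())
)

print(toy)
```

The order matters: running the region pass first would overwrite gaps that a country's own history could have filled.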

In [None]:
col_null_vals = df.isnull().sum()[df.isnull().sum() > 0].index

In [None]:
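for x in col_null_vals:
    # First fill gaps with the mean of the same country
    df[x] = df.groupby('country')[x].transform(lambda grp: grp.fillna(grp.mean()))
    # Countries with no value at all fall back to the mean of their region
    df[x] = df.groupby('region')[x].transform(lambda grp: grp.fillna(grp.mean()))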

In [None]:
df.isnull().sum()

<div style="color:#140033;
           display:fill;
           border-radius:15px;
            border-style: solid;
           border-width: 15px;
            border-color:#f0e6ff;
           background-color:#f0e6ff;
           letter-spacing:0.75px;
            font-family:'Futura';
            line-height: 1.7em;
            font-size:1.5em;">
    <h3>We have now filled in all the null values with their respective group means.</h3>
    
***

<h3>How are the variables correlated with each other?</h3>
</div>

In [None]:
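fig, ax = plt.subplots(figsize = (16,10))
plt.gcf().set_dpi(200)
# Correlate numeric columns only; text columns such as country or region cannot be correlated
g = sns.heatmap(df.select_dtypes('number').corr(), annot = True, ax = ax)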
plt.title('Heatmap of data correlations between the columns',fontsize = 'xx-large',weight = 'bold');
plt.savefig('Correlations_Heatmap.png')

<div style="color:#00381c;
           display:fill;
           border-radius:50px;
            border-style: solid;
            padding: 25px 25px;
           border-width: 5px;
            border-color:#00381c;
           background-color:#b8fcda;
           letter-spacing:0.75px;
            font-family:'Futura';
            line-height: 1.7em;
            font-size:1.5em;">
<h3 style = "line-height:1.3;">Several columns of the dataframe are correlated with one another. Notably, growth rate is correlated with columns of the original dataset, which supports the assumption that population-related data helps in identifying patterns in the original dataset.
</h3></div>
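
To read such correlations without scanning the whole heatmap, one column of the correlation matrix can be ranked. This is a minimal sketch on invented numbers; `toy`, `corr`, and `ranked` are hypothetical names and the values say nothing about the real data.

```python
import pandas as pd

# Invented values, chosen only so that the example has a clear sign.
toy = pd.DataFrame({
    "Life Ladder": [7.5, 6.0, 4.5, 3.5],
    "Log GDP per capita": [11.0, 10.0, 8.5, 7.5],
    "growth_rate": [0.2, 0.8, 2.0, 2.8],
    "region": ["Europe", "Americas", "Asia", "Africa"],
})

# Keep numeric columns only, then compute pairwise Pearson correlations.
corr = toy.select_dtypes("number").corr()

# Rank the other columns by their correlation with growth_rate.
ranked = corr["growth_rate"].drop("growth_rate").sort_values()
print(ranked)
```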

In [None]:
df_pairplot = df[['population','Healthy life expectancy at birth','Generosity', 'Log GDP per capita','growth_rate','Life Ladder','region','sub-region']]
pair_plot = sns.pairplot(data = df_pairplot,hue='region');
plt.subplots_adjust(top=0.93);
pair_plot.fig.suptitle(t = "Pairplot between the columns with high correlations",fontsize = 25,weight = 'bold');
pair_plot.savefig('Pairplot.png')

<div style="color:#00381c;
           display:fill;
           border-radius:50px;
            border-style: solid;
            padding: 25px 25px;
           border-width: 5px;
            border-color:#00381c;
           background-color:#b8fcda;
           letter-spacing:0.75px;
            font-family:'Futura';
            line-height: 1.7em;
            font-size:1.5em;">
<h3 style = "line-height:1.3;"><b>Inferences from the above plot:-</b>
<ul>
<li>The region (continent) appears to play a significant role in the overall happiness index of a country. The data points from each region cluster together.</li><br>
<li>The plots on the diagonal show distinct distributions per continent, which suggests that a country's happiness index depends on the region it belongs to.</li><br>
<li>The features <i>Healthy life expectancy at birth, Log GDP per capita and Life Ladder</i> are interrelated.</li><br>
    <li><i>Growth rate</i> is closely related to a country's <i>GDP per capita</i>: GDP per capita declines as growth rate increases.</li>
</ul>
</h3>
</div>

In [None]:
select_cols_df = ['country','population','Healthy life expectancy at birth','Generosity','Life Ladder', 'Log GDP per capita','growth_rate','Perceptions of corruption','region','sub-region']
select_cols_df_2021 = ['country','population','Healthy life expectancy','Generosity','Ladder score','Logged GDP per capita','growth_rate','Perceptions of corruption','region','sub-region']
df.groupby('year').country.count()

In [None]:
df_good = df_2021.groupby('country')['Ladder score'].mean().sort_values(ascending=False)[:10].reset_index().merge(df_2021[['country','region']], on = 'country')
df_poor = df_2021.groupby('country')['Ladder score'].mean().sort_values(ascending=True)[:10].reset_index().merge(df_2021[['country','region']], on = 'country')

In [None]:
fig1 = px.choropleth(df, 
                    locations="iso_code",
                    color='Life Ladder', 
                    hover_name='Life Ladder',
                    hover_data =select_cols_df,
                    color_continuous_scale=px.colors.sequential.Oranges,
                    animation_frame="year"
                   ).update_layout(
    title_text = 'World Happiness Index - year wise data',
    title_x = 0.5,);
iplot(fig1)

<div style="color:#00381c;
           display:fill;
           border-radius:50px;
            border-style: solid;
            padding: 25px 25px;
           border-width: 5px;
            border-color:#00381c;
           background-color:#b8fcda;
           letter-spacing:0.75px;
            font-family:'Futura';
            line-height: 1.7em;
            font-size:1.5em;">
    <h3 style = "line-height:1.3;">From the above visualisation, the happiness index in the Americas region appears to decline over the span of 15 years.</h3>
</div>

In [None]:
fig2 = px.choropleth(
    df_2021,
    locations="iso_code",
    color='Ladder score',
    hover_name='Ladder score',
    hover_data=select_cols_df_2021,
    animation_frame='region'
).update_layout(
    title_text = 'World Happiness map in 2021 - regions wise data',
    title_x = 0.5,)
iplot(fig2)

<div style="color:#00381c;
           display:fill;
           border-radius:50px;
            border-style: solid;
            padding: 25px 25px;
           border-width: 5px;
            border-color:#00381c;
           background-color:#b8fcda;
           letter-spacing:1.2px;
            font-family:'Futura';
            font-size:1.5em;">
    <h3 style = "line-height:1.3;">From the above visualisation, we can infer that the ladder score is comparatively high in the Americas, Europe and Oceania, at around 7, and much lower in Asia and Africa, where most scores are below 5.</h3>
</div>

*** 
<div style="color:#140033;
           display:fill;
           border-radius:15px;
            border-style: solid;
           border-width: 15px;
            border-color:#f0e6ff;
           background-color:#f0e6ff;
           letter-spacing:0.75px;
            font-family:'Futura';
            line-height: 1.7em;
            font-size:1.5em;">
<h3> Let's have a look at the 2021 report and try to find the rich and poor countries in terms of happiness index</h3>
</div>

In [None]:
fig, ax = plt.subplots(1,2, figsize = (25,12))
top_plot = sns.barplot(data = df_good,x ='Ladder score', y = 'country', hue = 'region', ax = ax[0], dodge=False);
top_plot.set(xlim = (0,8));
top_plot.set_yticklabels(df_good.country,fontdict={'fontsize': 15,'fontweight' : 'bold'}, alpha=0.9)
top_plot.set_title(label = 'Top 10 countries', color = 'green',fontdict = {'fontsize': 20,'fontweight' : 'bold'});
h, l = top_plot.get_legend_handles_labels();
counts = df_good.region.value_counts().reindex(l);
l = [f'{yn} (n={c})' for yn,c in counts.items()]  # Series.iteritems() was removed in pandas 2.0
top_plot.legend(h,l, title="Region",prop={'size': 16});
bottom_plot = sns.barplot(data = df_poor,x ='Ladder score', y = 'country', hue = 'region', ax = ax[1], dodge=False);
bottom_plot.set(xlim = (0,8));
bottom_plot.set_yticklabels(df_poor.country,fontdict={'fontsize': 15,'fontweight' : 'bold'}, alpha=0.9)
h, l = bottom_plot.get_legend_handles_labels();
counts = df_poor.region.value_counts().reindex(l);
l = [f'{yn} (n={c})' for yn,c in counts.items()]  # Series.iteritems() was removed in pandas 2.0
bottom_plot.legend(h,l, title="Region",prop={'size': 16});
bottom_plot.set_title(label = 'Bottom 10 countries', color = 'red',fontdict = {'fontsize': 20,'fontweight' : 'bold'});
plt.savefig('barplot_2021.png');

<div style="color:#00381c;
           display:fill;
           border-radius:50px;
            border-style: solid;
            padding: 25px 25px;
           border-width: 5px;
            border-color:#00381c;
           background-color:#b8fcda;
           letter-spacing:0.75px;
            font-family:'Futura';
            line-height: 1.7em;
            font-size:1.5em;">
    <h3 style = "line-height:1.3;">Of the top 10 countries, 9 are European, and of the bottom 10 countries, 7 are African.</h3>
</div>
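
A region tally like the one above can be computed directly instead of being read off the legend. The sketch below uses placeholder countries, regions, and scores, so the frame `scores` and everything in it is hypothetical.

```python
import pandas as pd

# Placeholder data: none of these rows come from the 2021 report.
scores = pd.DataFrame({
    "country": ["A", "B", "C", "D", "E"],
    "region": ["X", "X", "Y", "Y", "Z"],
    "Ladder score": [7.8, 7.6, 3.4, 3.1, 3.8],
})

# Highest and lowest scores, already sorted in the natural order.
top = scores.nlargest(2, "Ladder score")
bottom = scores.nsmallest(2, "Ladder score")

# Count how many of the selected countries fall in each region.
top_regions = top["region"].value_counts()
bottom_regions = bottom["region"].value_counts()
print(top_regions, bottom_regions, sep="\n")
```

With the real data, replacing `2` with `10` and `scores` with `df_2021` should give the counts quoted in the text, assuming one row per country.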