<span style="font-family:'Futura Std Condensed';font-size:1.5em">
<h1>World Happiness Report</h1>

<h2>Context</h2>
This notebook is an exploratory data analysis of the <a href = "https://www.kaggle.com/ajaypalsinghlo/world-happiness-report-2021">World Happiness Report</a> and <a href = "https://www.kaggle.com/rsrishav/world-population">World Population</a> datasets. Geographic attributes such as region, sub-region, latitude and longitude are also joined to these data points to help identify and visualize more patterns in the happiness index.


<h2 style="font-family:'Futura Std Condensed';">Content</h2>
The happiness scores and rankings use data from the Gallup World Poll. The columns that follow the happiness score estimate the extent to which each of six factors – economic production, social support, life expectancy, freedom, absence of corruption, and generosity – contributes to making life evaluations higher in each country than they are in Dystopia.
</span>


In [None]:
#Import necessary libraries
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style("whitegrid")
sns.set_palette("icefire")
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
import plotly.express as px

In [None]:
# Loading the input files into dataframes
df_2021 = pd.read_csv('/kaggle/input/world-happiness-report-2021/world-happiness-report-2021.csv')
df = pd.read_csv('/kaggle/input/world-happiness-report-2021/world-happiness-report.csv')

## Data Analysis

### <em>world-happiness-report.csv</em>

In [None]:
df.rename(columns={"Country name": "country"}, inplace= True)
df.info()
df.head()
df.describe().T

### <em> world-happiness-report-2021.csv</em>

In [None]:
df_2021.rename(columns={"Country name": "country"}, inplace= True)
df_2021.info()
df_2021.head()
df_2021.describe().T

<span style="font-family:'Futura Std Condensed';font-weight:bold ;font-size:1.5em">
<h1>Introducing Population</h1>
<b>
<ul>
<li>The input data contains no population-related information.</li>
<li>It seems plausible that a country's happiness index is related to its population.</li>
<li>Adding the population information will help in understanding the correlations in the data.</li>
</ul>
</b>
</span>

## Let's have a look at the population data

In [None]:
#Import population data into a dataframe
df_pop = pd.read_csv('../input/d/rsrishav/world-population/2021_population.csv')
df_pop.info()
df_pop.head()

### Converting the numeric values that are currently stored as strings into integers and floats

In [None]:
df_pop['2021_last_updated'] = df_pop['2021_last_updated'].apply(lambda x : int(str(x).replace(',','')))
df_pop['2020_population'] = df_pop['2020_population'].apply(lambda x : int(str(x).replace(',','')))
df_pop['density_sq_km'] = df_pop.density_sq_km.apply(lambda x : int(str(x).replace(',','')[:-6]))
df_pop['area'] = df_pop.area.apply(lambda x : int(str(x).replace(',','')[:-6]))
df_pop['growth_rate'] = df_pop.growth_rate.apply(lambda x : float(str(x)[:-2]))
df_pop['world_%'] = df_pop['world_%'].apply(lambda x : float(str(x)[:-2]))
df_pop.describe().T
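The fixed-width slices above (`[:-6]`, `[:-2]`) depend on every value carrying a suffix of exactly that length. A minimal sketch of a more tolerant alternative is shown below: it strips thousands separators, extracts the first number from each string, and lets pandas pick the numeric type. The helper name `to_number` and the sample strings are made up for illustration and are not taken from the population file.

```python
import pandas as pd

def to_number(series: pd.Series) -> pd.Series:
    """Hypothetical helper: drop commas, then keep the first number found in each string."""
    cleaned = (
        series.astype(str)
        .str.replace(",", "", regex=False)
        .str.extract(r"(-?\d+(?:\.\d+)?)", expand=False)
    )
    # Strings without any number become NaN instead of raising
    return pd.to_numeric(cleaned, errors="coerce")

# Made-up sample values that only mimic the general shape of such columns
sample = pd.DataFrame({
    "population": ["1,234,567", "89,012"],
    "density": ["153/Km²", "36/Km²"],
    "growth": ["0.34%", "0.58%"],
})
parsed = sample.apply(to_number)
print(parsed.dtypes)
print(parsed)
```

Because unparseable strings are coerced to `NaN`, a follow-up `parsed.isna().sum()` would show how many values failed to convert.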

In [None]:
df_country = pd.read_csv('../input/latitude-and-longitude-for-every-country-and-state/world_country_and_usa_states_latitude_and_longitude_values.csv')
df_country_iso = pd.read_csv('../input/country-mapping-iso-continent-region/continents2.csv').rename(columns={"alpha-2": "country_code","alpha-3":"iso_code"})
# Keep only the country-level columns; this drops the USA states columns, as we are dealing with the country data
df_country = df_country.merge(df_country_iso,on = 'country_code')[['iso_code','latitude','longitude']]
df_continent = pd.read_csv('../input/country-mapping-iso-continent-region/continents2.csv')
df_continent.rename(columns = {'alpha-3':'iso_code'}, inplace = True)
df_pop = df_pop.merge(df_country, on = 'iso_code')
df_pop = df_pop.merge(df_continent[['iso_code','region','sub-region']], on = 'iso_code')
df_pop.drop(columns = ['2020_population'], inplace= True)
df_pop.rename(columns = {'2021_last_updated':'population'}, inplace= True)
df_pop.head()
del df_country
del df_country_iso

### Let's compare the country names in the population and happiness data

In [None]:
c1 = df_pop.country.value_counts().index
c2 = df.country.value_counts().index
# Country names that appear in one dataframe but not in the other
c1_minus_c2 = list(set(c1) - set(c2))
c1_minus_c2.sort()
c2_minus_c1 = list(set(c2) - set(c1))
c2_minus_c1.sort()
print(c1_minus_c2)
print(c2_minus_c1)

### Because we are merging dataframes from two different datasets, some key values will not match. Let's map some of them by replacing the mismatched values. Here the *country* column acts as the key.

In [None]:
old_values = ['Congo (Brazzaville)', 'Congo (Kinshasa)', 'Hong Kong S.A.R. of China', 'North Cyprus', 'Palestinian Territories', 'Somaliland region', 'Taiwan Province of China', 'Trinidad and Tobago']
new_values = ['Republic Of The Congo', 'Dr Congo', 'Hong Kong', 'Cyprus', 'Palestine', 'Somalia', 'Taiwan', 'Trinidad And Tobago']
df['country'] = df['country'].replace(old_values,new_values)

In [None]:
df = df.merge(df_pop,how = 'inner',on='country')
print(list(df.columns))
df.info()
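An inner merge silently drops every row whose key has no match. As a hedged sketch, an outer merge with `indicator=True` can be used to see which country names still fail to match after the replacement step. The two small frames below are toy data with made-up values; `Country X` and `Country Y` are hypothetical names.

```python
import pandas as pd

# Toy stand-ins for the happiness and population dataframes (values are made up)
happiness = pd.DataFrame({
    "country": ["Hong Kong S.A.R. of China", "Congo (Kinshasa)", "Country X"],
    "score": [1.0, 2.0, 3.0],
})
population = pd.DataFrame({
    "country": ["Hong Kong", "Dr Congo", "Country Y"],
    "population": [10, 20, 30],
})

# Same idea as the replacement above, written as a dict for readability
happiness["country"] = happiness["country"].replace({
    "Hong Kong S.A.R. of China": "Hong Kong",
    "Congo (Kinshasa)": "Dr Congo",
})

# The _merge column records whether each key came from the left, right, or both frames
audit = happiness.merge(population, on="country", how="outer", indicator=True)
unmatched = audit.loc[audit["_merge"] != "both", ["country", "_merge"]]
print(unmatched)
```

Rows marked `left_only` or `right_only` are the ones an inner merge would discard.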

# Distribution of **Generosity** data

In [None]:
plt.subplots(figsize = (20,10))
sns.boxplot(data = df,x = 'region', y = 'Generosity',hue='year')

## The plot above suggests that the generosity index dropped significantly over the span of 15 years in the **Americas** region
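A visual impression from a box plot can be cross-checked numerically. The sketch below compares the median generosity per region between two years. It uses a small toy frame with made-up values rather than the notebook's `df`, so the numbers only illustrate the technique.

```python
import pandas as pd

# Toy data with made-up values; column names follow the notebook's dataframe
toy = pd.DataFrame({
    "region": ["Americas"] * 4 + ["Europe"] * 4,
    "year": [2006, 2006, 2020, 2020] * 2,
    "Generosity": [0.20, 0.10, -0.10, -0.05, 0.05, 0.15, 0.00, 0.10],
})

# One row per region, one column per year, holding the median generosity
medians = toy.groupby(["region", "year"])["Generosity"].median().unstack("year")

# Negative values mean the median fell between the two years
change = medians[2020] - medians[2006]
print(change.sort_values())
```

On the real data, the same `groupby` on `df` with its first and last years would show whether the drop in the Americas is larger than in the other regions.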

In [None]:
year_wise_cnt = df.year.value_counts()
fig, ax = plt.subplots(figsize = (10,10))
sns.countplot(data = df,y ='year')
plt.title('Observation counts by Year')

In [None]:
df.groupby('year')[['Generosity']].count().plot(figsize = (18,8))

## There are comparatively few observations from 2005, and the generosity value is missing for most of those rows. The missing values need to be filled in with suitable replacements.

## Observation percentage by year:

In [None]:
year_wise_cnt.sort_index().apply(lambda x : round(x/len(df.index)*100,2)).plot(kind = 'bar',figsize = (18,8))
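The same percentages can be obtained without the manual division by letting `value_counts` normalize the counts. This is a minimal sketch on a made-up list of years, not the notebook's data.

```python
import pandas as pd

# Made-up year column standing in for df.year
years = pd.Series([2005, 2006, 2006, 2007, 2007, 2007, 2007, 2006], name="year")

# normalize=True returns each year's share of all rows; scale to percent and round
pct = years.value_counts(normalize=True).sort_index().mul(100).round(2)
print(pct)
```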

## The generosity index is much higher in **2005** than in the other years. However, there are very few observations for 2005.

# Handling missing values

In [None]:
df.isnull().sum()

In [None]:
df.loc[(df.year >= 2006) & (df.year <=2007)].Generosity.mean()

In [None]:
df.Generosity = df.groupby('country')['Generosity'].transform(lambda grp: grp.fillna(np.mean(grp)))
df.Generosity = df.groupby('region')['Generosity'].transform(lambda grp: grp.fillna(np.mean(grp)))
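The two-step fill above first uses each country's own mean and then falls back to the regional mean for countries that have no generosity values at all. The toy example below sketches that behaviour with made-up values; the country and region labels are placeholders.

```python
import numpy as np
import pandas as pd

# Toy data: country B has no generosity values at all
toy = pd.DataFrame({
    "country": ["A", "A", "B", "B", "C"],
    "region": ["R1", "R1", "R1", "R1", "R2"],
    "Generosity": [0.2, np.nan, np.nan, np.nan, 0.4],
})

# Step 1: fill gaps with the country mean (stays NaN when the whole country is missing)
country_mean = toy.groupby("country")["Generosity"].transform("mean")
toy["Generosity"] = toy["Generosity"].fillna(country_mean)

# Step 2: fill what is left with the regional mean
region_mean = toy.groupby("region")["Generosity"].transform("mean")
toy["Generosity"] = toy["Generosity"].fillna(region_mean)

remaining = int(toy["Generosity"].isna().sum())
print(toy)
print("Missing values left:", remaining)
```

Checking the remaining count afterwards is worthwhile, since a region with no values at all would still leave gaps.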

## How are the variables correlated?

In [None]:
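fig, ax = plt.subplots(figsize = (12,8))
# Restrict to numeric columns; string columns such as country and region cannot be correlated
sns.heatmap(df.select_dtypes(include = 'number').corr(), ax = ax)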

In [None]:
df_pairplot = df[['population','Healthy life expectancy at birth', 'Log GDP per capita','growth_rate']]

In [None]:
sns.pairplot(data = df_pairplot,corner = True)