# Data Exploration

By **Jared Chung**

________________________________________________________________________________________________________________________________________________________

## Introduction

This notebook is a practice exercise using the "World Happiness Report" dataset. I look at how each variable correlates with the Ladder score and how it contributes to a country's happiness level. In addition, I look at how countries' ranks changed from 2015 to 2020. There is room for more detailed findings in further data exploration.


### Acknowledgements

All credit for the data (obtained from Kaggle) goes to the original authors:

Editors: John Helliwell, Richard Layard, Jeffrey D. Sachs, and Jan-Emmanuel De Neve, Co-Editors; Lara Aknin, Haifang Huang and Shun Wang, Associate Editors; and Sharon Paculor, Production Editor

Citation:
Helliwell, John F., Richard Layard, Jeffrey Sachs, and Jan-Emmanuel De Neve, eds. 2020. World Happiness Report 2020. New York: Sustainable Development Solutions Network

### Import libraries needed

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px

### Loading the dataset 
Note: the dataset has already been downloaded. The path below points to where it is stored, so adjust it if your copy lives elsewhere.

In [None]:
df = pd.read_csv("../input/world-happiness-report/2020.csv")

In [None]:
# Checking if data imported by looking at the head
df.head()

### Understanding the data
It is always good to understand the data before any manipulation or analysis

In [None]:
# Looking at the column headers
df.columns

In [None]:
# Getting the number of rows and columns
nRow, nCol = df.shape
print("Num of row:", nRow)
print("Num of col:", nCol)

In [None]:
# Looking at the data type and the non-null count of each column
df.info()

Very good! There are no null values in any column, and every column has 153 records.
We can now move on to analysing the data.
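The output of `df.info()` has to be read by eye. As a minimal sketch, the same check can be made programmatically. The small `toy` frame below is made-up data standing in for `df`, with one missing value planted on purpose so that the check has something to find.

```python
import numpy as np
import pandas as pd

# Made-up toy data, NOT the real World Happiness Report values.
toy = pd.DataFrame({
    "Country name": ["A", "B", "C", "D"],
    "Ladder score": [7.8, 6.1, np.nan, 4.2],
    "Social support": [0.95, 0.88, 0.70, 0.61],
})

# Number of missing values in each column
null_counts = toy.isna().sum()
print(null_counts)

# True only when no column contains a missing value
has_no_nulls = bool((null_counts == 0).all())
print("No missing values:", has_no_nulls)

# Number of rows that are exact duplicates of an earlier row
n_duplicates = int(toy.duplicated().sum())
print("Duplicated rows:", n_duplicates)
```

Running the same three checks on the real `df` should report zero missing values if the reading of `df.info()` above is right.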

In [None]:
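# Checking the distribution of the Ladder score for outliers
# (sns.distplot is deprecated; histplot with kde=True draws a similar plot)
sns.histplot(df["Ladder score"], kde=True)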

The distribution looks roughly normal with no obvious outliers, so no changes are required.
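A histogram gives only a visual impression. As a hedged sketch of a numeric cross-check, the common 1.5 × IQR rule can flag outliers, and the sample skewness gives a rough sense of asymmetry. The `scores` series below is made-up data with one deliberately planted outlier; it is not taken from the dataset.

```python
import pandas as pd

# Made-up scores for illustration only; 12.0 is a deliberately planted outlier.
scores = pd.Series(
    [4.5, 5.0, 5.2, 5.5, 5.8, 6.0, 6.3, 6.9, 7.1, 12.0],
    name="Ladder score",
)

# Interquartile range and the usual 1.5 * IQR fences
q1 = scores.quantile(0.25)
q3 = scores.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Values outside the fences are flagged as potential outliers
outliers = scores[(scores < lower) | (scores > upper)]
print(outliers)

# Skewness near 0 suggests a roughly symmetric distribution
skewness = scores.skew()
print("Skewness:", skewness)
```

On the real data the same lines would be applied to `df["Ladder score"]`; an empty `outliers` result would support the conclusion drawn from the plot.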

### Data Exploration
First, let's take a look at the 20 happiest countries in a bar graph.

In [None]:
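# Pick the 20 highest scores explicitly instead of relying on the file's row order
df_top20 = df.nlargest(20, 'Ladder score').sort_values('Ladder score', ascending = True)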

In [None]:
px.bar(df_top20, x='Ladder score', y='Country name',
       orientation='h',title="Top 20 happiest countries")

In [None]:
df_last20 = df.sort_values('Ladder score', ascending = True)
df_last20 = df_last20[:20]

In [None]:
px.bar(df_last20, x='Ladder score', y='Country name',
       orientation='h',title="20 least happy countries")

From here we can see that in 2020 Finland scored as the happiest country, followed by Denmark and Switzerland.
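As an aside, pandas offers `nlargest` and `nsmallest` for selecting the top and bottom rows by a column without depending on how the file happens to be ordered. The sketch below uses a made-up five-row frame rather than the real dataset.

```python
import pandas as pd

# Made-up toy data, deliberately NOT sorted by score.
toy = pd.DataFrame({
    "Country name": ["A", "B", "C", "D", "E"],
    "Ladder score": [5.1, 7.8, 3.4, 6.6, 4.9],
})

# The two highest and the two lowest scores, whatever the row order
top2 = toy.nlargest(2, "Ladder score")
bottom2 = toy.nsmallest(2, "Ladder score")

print(top2)
print(bottom2)
```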

We should now explore how the other columns affect <b>Ladder score</b>, mainly:
<li>GDP per capita</li>
<li>Social support</li>
<li>Healthy life expectancy</li>
<li>Freedom to make life choices</li>
<li>Perceptions of corruption</li>

which I initially expect to be the five major contributors to a country's happiness level.

### Scatter plots between variables and Ladder score

In [None]:
#scatter plot between Logged GDP and Ladder score
var = 'Logged GDP per capita'
data = pd.concat([df["Ladder score"], df[var]],axis=1)
data.plot.scatter(x=var, y='Ladder score')

There appears to be a positive linear relationship between logged GDP per capita and Ladder score.

In [None]:
#scatter plot between Social support and Ladder score
var = 'Social support'
data = pd.concat([df["Ladder score"], df[var]],axis=1)
data.plot.scatter(x=var, y='Ladder score')

In [None]:
#scatter plot between Healthy life expectancy and Ladder score
var = 'Healthy life expectancy'
data = pd.concat([df["Ladder score"], df[var]],axis=1)
data.plot.scatter(x=var, y='Ladder score')

In [None]:
#scatter plot between Freedom to make life choices and Ladder score
var = 'Freedom to make life choices'
data = pd.concat([df["Ladder score"], df[var]],axis=1)
data.plot.scatter(x=var, y='Ladder score')

In [None]:
#scatter plot between Perceptions of corruption and Ladder score
var = 'Perceptions of corruption'
data = pd.concat([df["Ladder score"], df[var]],axis=1)
data.plot.scatter(x=var, y='Ladder score')

As the five graphs above show, the initial expectation holds: each of these variables has a roughly linear relationship with Ladder score. All of the relationships are positive except for Perceptions of corruption, which is negative. This is expected, as a lower perception of corruption should go with a happier country.
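The direction of each relationship can also be put into numbers. The sketch below computes the Pearson correlation of each variable with the Ladder score using `DataFrame.corrwith`. The values are made up for illustration, so only the signs of the results are meaningful here, not their size.

```python
import pandas as pd

# Made-up toy values; only the signs of the relationships mimic the notebook.
toy = pd.DataFrame({
    "Ladder score": [7.5, 6.8, 6.0, 5.1, 4.3],
    "Social support": [0.95, 0.90, 0.82, 0.75, 0.60],
    "Perceptions of corruption": [0.20, 0.35, 0.60, 0.75, 0.85],
})

# Pearson correlation of every other column with the Ladder score
predictors = toy.drop(columns="Ladder score")
r = predictors.corrwith(toy["Ladder score"])
print(r)
```

A value close to +1 or -1 indicates a strong linear relationship, and the sign gives its direction.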

<h3> Further analysis between variables </h3>

Next, look at the correlations between variables for a more detailed analysis.

In [None]:
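# Correlate numeric columns only; text columns such as 'Country name' cannot be correlated
corrmap = df.select_dtypes(include='number').corr()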
f, ax = plt.subplots(figsize = (12,9))
sns.heatmap(corrmap, square = True, linecolor = 'black',cmap="YlGnBu")

As we can see in the first column of the heatmap above,

<i>{GDP, social support, life expectancy, freedom to make choices}</i>

all have a strong positive correlation with Ladder score, whereas

<i>{Corruption}</i> has a strong negative correlation with it.

<i>Dystopia + residual</i> also has a strong positive correlation, which requires further investigation.

The remaining columns are derived from earlier calculations, so we will leave them out of further analysis.
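One way to leave out derived columns is to filter them by name. The sketch below is only an illustration: the `Explained by:` prefix and the `upperwhisker` / `lowerwhisker` names are assumptions about how derived columns are labelled, so check them against `df.columns` before relying on them. The data itself is made up.

```python
import pandas as pd

# Made-up toy frame; the derived column names below are assumptions.
toy = pd.DataFrame({
    "Country name": ["A", "B", "C"],
    "Ladder score": [7.5, 6.0, 4.3],
    "upperwhisker": [7.6, 6.1, 4.4],
    "lowerwhisker": [7.4, 5.9, 4.2],
    "Social support": [0.95, 0.82, 0.60],
    "Explained by: Social support": [1.5, 1.2, 0.8],
})

# Hypothetical markers of derived columns
derived_prefixes = ("Explained by:",)
derived_names = {"upperwhisker", "lowerwhisker"}

# Keep every column that matches neither a derived prefix nor a derived name
keep = [
    col for col in toy.columns
    if not col.startswith(derived_prefixes) and col not in derived_names
]
trimmed = toy[keep]
print(trimmed.columns.tolist())
```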

### Scatter plots between ladder score and chosen correlated variables 

In [None]:
# Getting the chosen variables 
cols = ['Ladder score','Logged GDP per capita','Social support','Healthy life expectancy'
       ,'Freedom to make life choices','Perceptions of corruption','Dystopia + residual']

In [None]:
sns.set()
sns.pairplot(df[cols],height = 3,kind = 'reg',corner = True)
plt.show()

From this visualisation we can conclude that our initial hypothesis about ***social support***, ***GDP***, ***life expectancy***, and ***freedom of life choices*** is correct. ***Dystopia + residual*** also has a positive effect on the happiness score, although it is less pronounced. Furthermore, ***perceptions of corruption*** shows a large negative correlation with Ladder score.

We can also see that ***GDP*** and ***Social support*** have a significant positive correlation with Healthy life expectancy.

Thus, <font color='red'>the happiness and health of a country appear to depend on its GDP, government structure, and freedom of choice.</font>

### Country happiness rank changes over the years

<font size = 3> This section explores how countries' happiness ranks changed from 2015 to 2020. </font>

Import the datasets for all six years

In [None]:
data_2015 = pd.read_csv("../input/world-happiness-report/2015.csv")
data_2016 = pd.read_csv("../input/world-happiness-report/2016.csv")
data_2017 = pd.read_csv("../input/world-happiness-report/2017.csv")
data_2018 = pd.read_csv("../input/world-happiness-report/2018.csv")
data_2019 = pd.read_csv("../input/world-happiness-report/2019.csv")
data_2020 = pd.read_csv("../input/world-happiness-report/2020.csv")

Let's check all the files to make sure the data is correct and consistent between the years.

In [None]:
data_full = [data_2015,data_2016,data_2017,data_2018,data_2019,data_2020]

In [None]:
for item in data_full:
    print(item.shape)

The datasets have different numbers of columns, so we need to dig deeper before we can modify the data correctly.

In [None]:
for item in data_full:
    print(item.columns)

Overall, the country name column is labelled differently across the years:
- 2015, 2016 and 2017: 'Country';
- 2018, 2019: 'Country or region';
- 2020: 'Country name'

This requires further analysis to ensure that all countries are present in each dataset.

Furthermore, the ranking column also differs across the six years:
- 2015 and 2016: 'Happiness Rank';
- 2017: 'Happiness.Rank';
- 2018, 2019: 'Overall rank';
- 2020: has no ranking column, so the data needs to be modified.

### Data modification

First, check whether all datasets contain the same countries.
To keep the analysis simple, begin by giving the target columns the same names in every dataset.

In [None]:
data_2017 = data_2017.rename(columns={'Happiness.Rank':'Happiness Rank'})
data_2018 = data_2018.rename(columns={'Country or region': 'Country',
                            'Overall rank': 'Happiness Rank'})
data_2019 = data_2019.rename(columns={'Country or region': 'Country',
                            'Overall rank': 'Happiness Rank'})
data_2020 = data_2020.rename(columns={'Country name': 'Country'})

print("Success")

Since we only care about how country rankings change across the years, extract the Country and ranking columns.

For 2020, we have to create a new column to hold the rank.

In [None]:
# Creating a new rank column for the 2020 dataset
data_2020['Happiness Rank'] = data_2020['Ladder score'].rank(ascending = False)
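One detail worth knowing about `Series.rank`: by default, tied values share the average of their positions, which produces fractional ranks such as 2.5. The sketch below, on made-up scores with a deliberate tie, compares the default with `method="min"`, which gives tied values the same whole-number rank. Whether the 2020 data contains ties is not checked here.

```python
import pandas as pd

# Made-up scores; "B" and "C" are tied on purpose.
scores = pd.Series([7.8, 7.5, 7.5, 6.9], index=["A", "B", "C", "D"])

# Default behaviour: tied values receive the average of their positions
avg_rank = scores.rank(ascending=False)

# method="min": tied values share the best (lowest) position, so ranks stay whole
min_rank = scores.rank(ascending=False, method="min").astype(int)

print(avg_rank)
print(min_rank)
```

If ties do occur, converting average ranks to integers afterwards would truncate the fractional part, so `method="min"` is a possible alternative.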

In [None]:
# Getting the selected columns
data_2015 = data_2015[['Country','Happiness Rank']]
data_2016 = data_2016[['Country','Happiness Rank']]
data_2017 = data_2017[['Country','Happiness Rank']]
data_2018 = data_2018[['Country','Happiness Rank']]
data_2019 = data_2019[['Country','Happiness Rank']]
data_2020 = data_2020[['Country','Happiness Rank']]

In [None]:
# Getting the number of unique countries in each dataset
data_full = [data_2015,data_2016,data_2017,data_2018,data_2019,data_2020]
for item in data_full:
    print(item.Country.nunique())

As we can see, each dataset contains a different number of countries. Thus, for further analysis we need to merge the tables, keeping only the countries that appear in every dataset.
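The counts show that the datasets differ, but not which countries are affected. A minimal sketch using Python sets can list the countries common to every year and those missing from each year. The `frames` dictionary below holds made-up country names and stands in for the real yearly DataFrames.

```python
import pandas as pd

# Made-up stand-ins for the yearly datasets.
frames = {
    2015: pd.DataFrame({"Country": ["A", "B", "C"]}),
    2016: pd.DataFrame({"Country": ["A", "B", "D"]}),
    2017: pd.DataFrame({"Country": ["A", "B", "C", "D"]}),
}

# One set of country names per year
country_sets = {year: set(frame["Country"]) for year, frame in frames.items()}

# Countries present in every year, and all countries seen in any year
common = set.intersection(*country_sets.values())
everything = set.union(*country_sets.values())

# For each year, the countries that appear elsewhere but not in that year
missing = {year: sorted(everything - names) for year, names in country_sets.items()}

print("Common to all years:", sorted(common))
print("Missing per year:", missing)
```

The `common` set corresponds to the rows that survive a chain of inner merges on the country column.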

In [None]:
Merged = pd.merge(data_2015,data_2016, on=['Country'],how='inner')
Merged = Merged.rename(columns={'Happiness Rank_x':'2015','Happiness Rank_y':'2016' })

In [None]:
Merged = pd.merge(Merged,data_2017, on=['Country'],how='inner')
Merged = Merged.rename(columns={'Happiness Rank':'2017'})

In [None]:
Merged = pd.merge(Merged,data_2018, on=['Country'],how='inner')
Merged = Merged.rename(columns={'Happiness Rank':'2018'})

In [None]:
Merged = pd.merge(Merged,data_2019, on=['Country'],how='inner')
Merged = Merged.rename(columns={'Happiness Rank':'2019'})

In [None]:
Merged = pd.merge(Merged,data_2020, on=['Country'],how='inner')
Merged = Merged.rename(columns={'Happiness Rank':'2020'})
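The repeated merge-and-rename cells can also be written more compactly. As a sketch of one alternative, each frame's rank column is first renamed to its year, and `functools.reduce` then chains the inner merges. The `yearly` dictionary and its values are made up for illustration; the column names follow the ones used above.

```python
from functools import reduce

import pandas as pd

# Made-up stand-ins for the per-year frames, each with Country and Happiness Rank.
yearly = {
    "2015": pd.DataFrame({"Country": ["A", "B", "C"], "Happiness Rank": [1, 2, 3]}),
    "2016": pd.DataFrame({"Country": ["A", "B", "D"], "Happiness Rank": [2, 1, 3]}),
    "2017": pd.DataFrame({"Country": ["B", "A", "C"], "Happiness Rank": [1, 2, 3]}),
}

# Rename the rank column to the year first, so the merges never clash on names
renamed = [
    frame.rename(columns={"Happiness Rank": year})
    for year, frame in yearly.items()
]

# Chain inner merges on Country; only countries present in every frame remain
merged = reduce(
    lambda left, right: pd.merge(left, right, on="Country", how="inner"),
    renamed,
)
print(merged)
```

Renaming before merging avoids the `_x` / `_y` suffixes that pandas adds when both frames share a column name.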

In [None]:
# Changing type for consistency 
Merged['2020'] = Merged['2020'].astype('int')

In [None]:
# Changing index values
Merged = Merged.set_index('Country')

Plotting every country at once would make the graph unreadable, so we filter the data and plot only the countries of interest.

In [None]:
# The top 10 countries in 2015 and how their ranks change over the years
Merged[:10].T.plot()

Switzerland has gone from the number 1 spot in 2015 to number 3, whereas Finland has improved from 6th place in 2015 all the way to number 1, holding that spot from 2018 to 2020.
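Rank changes like these can also be computed directly rather than read off the plot. The sketch below subtracts the final-year rank from the first-year rank, so a positive number means the country moved up. The `ranks` frame is made up and only mirrors the layout of the merged table (countries as the index, one column per year).

```python
import pandas as pd

# Made-up ranks laid out like the merged table: index = Country, columns = years.
ranks = pd.DataFrame(
    {"2015": [1, 2, 3, 4], "2020": [3, 1, 4, 2]},
    index=pd.Index(["A", "B", "C", "D"], name="Country"),
)

# Positive change = moved up the ranking (a smaller rank number is better)
change = (ranks["2015"] - ranks["2020"]).rename("Rank change")

# Countries with the largest rise and the largest fall
biggest_climber = change.idxmax()
biggest_faller = change.idxmin()

print(change.sort_values(ascending=False))
print("Biggest climber:", biggest_climber)
print("Biggest faller:", biggest_faller)
```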

We can also take a look at a specific country of interest.

In [None]:
# Looking for specific country ranking
Var = 'Finland'
Merged.loc[Var].T.plot()

## Conclusion

Many more analyses can be done with this dataset. As I am new to data analysis and data science, this is a practice run for me in exploring and understanding data. Feel free to comment on what can be improved to help me learn. Thank you.