# Introduction

This notebook explores an assortment of exploratory data analysis (EDA) techniques using the World Happiness Report data. The data is mostly clean and ready to use, so the notebook focuses mainly on EDA and skips some of the data cleaning steps typical of a data science workflow. Afterwards, we fit a basic regression model to try to predict the happiness of a country from factors such as the economic status of its citizens, the amount of social support, and perceptions of corruption, and consider how the insights gathered from EDA could be used to extend and improve that model.

# Setup

In [None]:
import numpy as np
import pandas as pd

# List all files under the current input directory
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
# Import plotting libraries (numpy and pandas are already imported above)
import matplotlib.pyplot as plt
import seaborn as sns

# Loading the Data

In this notebook, for simplicity, we will focus only on using the data from one particular year, 2020, in our data visualization.

In [None]:
# Load the data for 2020, leaving some unwanted columns out
wh_2020 = pd.read_csv('/kaggle/input/world-happiness-report/2020.csv', usecols = range(12))

In [None]:
# Preview the data
wh_2020.head()

In [None]:
print("Our data has {} rows (observations/countries) and {} columns.".format(wh_2020.shape[0], wh_2020.shape[1]))

# Cleaning/Modifying the Data

## Renaming Columns

In [None]:
col_names_dict = {'Country name':'Country', 'Regional indicator':'Region', 'Ladder score': 'Ladder',
                  'Standard error of ladder score':'Standard Error', 'Logged GDP per capita':'Logged GDPPC',
                  'Social support':'Social Support', 'Healthy life expectancy':'Life Expectancy',
                  'Freedom to make life choices':'Freedom', 'Perceptions of corruption': 'Corruption'}

wh_2020.rename(columns = col_names_dict, inplace = True)

## Missing Values

In [None]:
# Check for any missing values in the data
wh_2020.isnull().sum()

## Adding Columns to the Data

In [None]:
# Add a 'Rank' column to our data (luckily for us, the rows are already ordered from happiest to unhappiest)
wh_2020['Rank'] = range(1, len(wh_2020) + 1)

Later on, we may find it useful to have the countries split up into quartiles. Let's create a 'Quartile' column that denotes the quartile each country belongs to according to its overall happiness rank (quartile 1 holds the happiest countries).

In [None]:
# Cut points at the 25th, 50th, and 75th percentiles of rank
quartile_index = np.percentile(wh_2020['Rank'], [25, 50, 75])
# searchsorted returns 0-3 for each rank, so add 1 to get quartiles 1-4
wh_2020['Quartile'] = np.searchsorted(quartile_index, wh_2020['Rank']) + 1

In [None]:
# Check our updated data with the new 'Rank' and 'Quartile' columns
wh_2020.head()
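
As a cross-check on the quartile logic, `pd.qcut` can produce the same kind of split in a single call. The snippet below is only a minimal sketch: the `toy` DataFrame and its 8 ranks are made up for illustration and are not taken from the report.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for wh_2020: 8 countries ranked 1 (happiest) to 8
toy = pd.DataFrame({'Rank': range(1, 9)})

# Approach used in this notebook: percentile cut points + searchsorted
cut_points = np.percentile(toy['Rank'], [25, 50, 75])
toy['Quartile'] = np.searchsorted(cut_points, toy['Rank']) + 1

# Alternative: let pandas bin the ranks into 4 equal-sized groups
toy['Quartile_qcut'] = pd.qcut(toy['Rank'], q = 4, labels = [1, 2, 3, 4]).astype(int)

print(toy)
```

On this toy frame both columns agree, which suggests either approach would work for the real data.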

# Data Visualization

In [None]:
# Set font sizes for all of our plots
plt.rc('font', size = 14)
plt.rc('axes', labelsize = 16)
plt.rc('legend', fontsize = 18)
plt.rc('axes', titlesize = 24)
plt.rc('figure', titlesize = 24)

In [None]:
# Set style (on Matplotlib < 3.6 this style is named 'seaborn-whitegrid')
plt.style.use('seaborn-v0_8-whitegrid')

## Barplot/Countplot

Let's first quickly see how countries are distributed by region and how happiness differs among the different regions to see if there are any region-specific trends we can pick up on:

In [None]:
fig = plt.figure(figsize = (18, 14))
ax = plt.axes()

# Pass the column as a keyword; recent seaborn versions no longer accept it positionally
countplot = sns.countplot(x = 'Region', data = wh_2020, saturation = 0.8, palette = 'tab10')
countplot.set_xticklabels(countplot.get_xticklabels(), rotation = 90)
countplot.set_title("Countplot by Region", y = 1.05);

## Grouped Barplot/Countplot

In [None]:
fig = plt.figure(figsize = (18, 14))
ax = plt.axes()

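# One bar per quartile within each region
stacked_countplot = sns.countplot(x = 'Region', data = wh_2020, hue = 'Quartile')
# Rotate this plot's own tick labels (not those of the previous figure)
stacked_countplot.set_xticklabels(stacked_countplot.get_xticklabels(), rotation = 90)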
stacked_countplot.set_title("Countplot of Quartiles for Each Region", y = 1.05);
ax.legend(loc = "upper left", title = 'Quartile', title_fontsize = 18);

We can illustrate these differences by region more succinctly with the following:

In [None]:
print("Table of Average Rank for Each Region:\n")
print(wh_2020.groupby('Region')['Rank'].agg('mean'))

There are a couple of pretty clear patterns we can see here. For example, Western Europe, North America, and Latin America all seem to be places where happiness is quite high, whereas places like South Asia and Sub-Saharan Africa appear to be quite unhappy.
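
Since an average rank on its own hides how many countries each region contains, it can help to aggregate several statistics at once. The following is a rough sketch on made-up rows; the region names `A` and `B` and all values are hypothetical, not taken from the report.

```python
import pandas as pd

# Hypothetical toy rows; region names and scores are illustrative only
toy = pd.DataFrame({
    'Region': ['A', 'A', 'B', 'B', 'B'],
    'Ladder': [7.5, 7.0, 4.0, 5.0, 4.5],
    'Rank':   [1, 2, 5, 3, 4],
})

# Mean score, mean rank, and number of countries per region, happiest first
summary = (toy.groupby('Region')
              .agg(mean_ladder = ('Ladder', 'mean'),
                   mean_rank = ('Rank', 'mean'),
                   n_countries = ('Rank', 'size'))
              .sort_values('mean_ladder', ascending = False))

print(summary)
```

The same `groupby(...).agg(...)` call should apply to `wh_2020` unchanged, assuming the renamed 'Region', 'Ladder', and 'Rank' columns from earlier.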

## Correlation Matrix

Let's now look at the relationships between each of the six measured values (Logged GDP per capita, social support, etc.) and the overall ladder score to perhaps highlight which features may be more/less important.

In [None]:
# Gather columns corresponding to the six measured values (Logged GDP per capita, social support, etc.)
feature_cols = ['Logged GDPPC', 'Social Support', 'Life Expectancy', 'Freedom', 'Generosity', 'Corruption']

In [None]:
df = pd.concat([wh_2020['Ladder'], wh_2020[feature_cols]], axis = 1)

fig = plt.figure(figsize = (13, 10))
# On Matplotlib < 3.6 this style is named 'seaborn-white'
plt.style.use('seaborn-v0_8-white')

plt.matshow(df.corr(), fignum = fig.number, cmap = 'viridis')
plt.xticks(range(df.shape[1]), df.columns, fontsize=14, rotation=45)
plt.yticks(range(df.shape[1]), df.columns, fontsize=14)

cb = plt.colorbar()
cb.ax.tick_params(labelsize=14)

plt.title('Correlation Matrix', fontsize = 24, y = 1.2);

It looks like the Logged GDPPC, Social Support, and Life Expectancy metrics all have a relatively high correlation with the overall score a country received. Also, these factors each seem to have a pretty high correlation with each other (e.g., Social Support is well correlated with Life Expectancy, and so on). On the other end of the spectrum, Generosity does not seem to have a sizeable correlation with any other measurement, including the Ladder score.
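
Reading exact values off a heatmap is hard, so it can be useful to also list each feature's correlation with the Ladder score, strongest first. Below is a minimal sketch on synthetic data: the numbers are randomly generated so that one column tracks the target and the other does not, and they say nothing about the real report.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data (an assumption for illustration, not the real report)
rng = np.random.default_rng(0)
n = 200
gdp = rng.normal(size = n)
generosity = rng.normal(size = n)
ladder = 2.0 * gdp + rng.normal(scale = 0.5, size = n)
toy = pd.DataFrame({'Ladder': ladder, 'Logged GDPPC': gdp, 'Generosity': generosity})

# Correlation of every feature with the target, ordered by absolute strength
corr_with_ladder = toy.corr()['Ladder'].drop('Ladder')
order = corr_with_ladder.abs().sort_values(ascending = False).index
corr_with_ladder = corr_with_ladder.reindex(order)

print(corr_with_ladder)
```

Applied to the notebook's `df`, the same three lines would give a ranked list to set beside the matrix plot.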

## Scatter Plots

Let's now take a look at the relationship between some of these factors. Scatterplots are great because they let us see at a glance the relative amount of correlation between two particular variables in our data.

In [None]:
pairplot = sns.pairplot(wh_2020, hue = 'Quartile', vars = feature_cols, corner = False)
pairplot.fig.suptitle("Pairplot of the 6 Happiness Metrics", fontsize = 24, y = 1.05);

In some metrics, we can see a stark difference between countries in the first quartile and countries in the other quartiles (especially 'Corruption'). Other measurements, such as 'Generosity', seem to be much less relevant in distinguishing happier countries from the rest.

## Line Plots

What about the relationship between the overall score/rank and each of the factors? For instance, does the Life Expectancy generally increase as we progress up the rankings toward happier countries?

In [None]:
fig, axes = plt.subplots(2, 3, figsize = (20, 12))

for i, ax in enumerate(axes.flat):
    ax.plot(wh_2020['Rank'], wh_2020[feature_cols[i]], color = 'red')
    ax.set_title(feature_cols[i] + ' by Rank', fontsize = 18)
    ax.set_xlim(153, 1)
    ax.axis('tight')

As we may have expected, the plots involving variables with a higher correlation with overall score/ranking display a linear trend as we go from lower to higher ranked countries. The last two plots (which portray variables with low correlation), on the other hand, are fairly "noisy" and don't display a very clear linear relationship.
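
One way to make a trend easier to see in a noisy line plot is to overlay a rolling mean. The sketch below uses a synthetic series: the slope, noise level, and window size of 15 are arbitrary assumptions chosen only to illustrate the idea.

```python
import numpy as np
import pandas as pd

# Synthetic feature that declines slowly with rank, plus noise (illustrative only)
rng = np.random.default_rng(0)
rank = np.arange(1, 101)
feature = pd.Series(1.0 - 0.005 * rank + rng.normal(scale = 0.1, size = 100), index = rank)

# Centered rolling mean; min_periods = 1 keeps values at both ends of the series
smoothed = feature.rolling(window = 15, center = True, min_periods = 1).mean()

# Average jump between neighbouring ranks, before and after smoothing
print(feature.diff().abs().mean(), smoothed.diff().abs().mean())
```

Plotting `smoothed` on top of the raw line in each subplot would be one way to judge whether the noisier features carry any trend at all.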

## A More Complex Scatterplot

Let's take our three most indicative/"valuable" variables and plot them together with each country's quartile in a single multidimensional scatterplot.

In [None]:
fig = plt.figure(figsize = (15, 12))
ax = plt.axes()

scatter = ax.scatter(wh_2020['Logged GDPPC'], wh_2020['Social Support'], alpha = 0.4, s = wh_2020['Life Expectancy']**1.5, c = wh_2020['Quartile'], cmap = 'viridis')
ax.set(xlabel = 'Logged GDPPC', ylabel = 'Social Support')
legend = ax.legend(*scatter.legend_elements(prop = 'colors', size = 16),
                    loc = "lower right", title = "Quartile", title_fontsize = 18)
ax.add_artist(legend);

In the above plot, we have Logged GDPPC on the x-axis, Social Support on the y-axis, and the size of the dots corresponding to Life Expectancy. It is clear from the plot that the happier quartiles (quartile 1 being the happiest) tend to have higher measurements in all three of these features.

# Model Building

Now, let's try to build a simple model to predict the happiness score for each country given the 6 measured values: 
* Logged GDPPC
* Social Support
* Life Expectancy
* Freedom
* Generosity
* Corruption

## Linear Regression

In [None]:
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

In [None]:
# Designate features and target variable
y = wh_2020['Ladder']
X = wh_2020[feature_cols]

In [None]:
# Split data into training and validation sets
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 0, test_size = 0.25)

In [None]:
# Fit a linear regression model to the data
lin_reg_model = LinearRegression()

lin_reg_model.fit(X_train, y_train)

In [None]:
# Use the fitted model to make predictions
preds = lin_reg_model.predict(X_test)

In [None]:
# Mean squared error of our predictions on the validation data (scikit-learn expects y_true first)
mean_squared_error(y_test, preds)

In [None]:
# Another metric for evaluating error: mean absolute error
mean_absolute_error(y_test, preds)
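
Error values like these are easier to interpret next to a naive baseline. The following sketch compares a linear model against always predicting the training mean. The data is synthetic (the coefficients and noise level are assumptions), so only the pattern of the comparison carries over, not the numbers.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic features and target (illustrative only, not the real report)
rng = np.random.default_rng(0)
X = rng.normal(size = (150, 2))
y = 5.0 + 1.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale = 0.3, size = 150)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 0, test_size = 0.25)
preds = LinearRegression().fit(X_train, y_train).predict(X_test)

# Baseline: always predict the mean of the training target
baseline_preds = np.full_like(y_test, y_train.mean())

# RMSE is in the same units as the target, which makes it easier to read than MSE
rmse_model = np.sqrt(mean_squared_error(y_test, preds))
rmse_baseline = np.sqrt(mean_squared_error(y_test, baseline_preds))
r2 = r2_score(y_test, preds)

print(rmse_model, rmse_baseline, r2)
```

If the model's RMSE is not clearly below the baseline's, the features are adding little beyond the average score.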

The big question: how can we use the insights gathered from the exploratory data analysis above (such as certain regions being a *lot* happier on average, some features being highly correlated with happiness score while others are not, etc.) to improve upon this most basic of models? Additionally, what else can we add to our EDA to discover further trends in the data? This is what I'll continue to explore, and hopefully I can create a much more robust model in the future.

Some initial ideas that come to mind:
* Use some type of encoding for the categorical 'Region' variable and turn it into an additional feature; since the regions differ pretty vastly in happiness scores, this extra feature could help our model
* Get rid of features that aren't very relevant/use some additional features
* Use a more complex model
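
The first idea above could be sketched roughly as follows, using `pd.get_dummies` to one-hot encode a region column. Everything here is synthetic: the region names, the built-in regional effect, and the coefficients are assumptions made so that the encoding has something to pick up, and the result does not predict what would happen on the real data.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data with a built-in regional effect (illustrative only)
rng = np.random.default_rng(0)
n = 120
toy = pd.DataFrame({
    'Region': rng.choice(['North', 'South', 'East'], size = n),
    'Logged GDPPC': rng.normal(9.0, 1.0, size = n),
})
region_effect = toy['Region'].map({'North': 1.0, 'South': -1.0, 'East': 0.0})
toy['Ladder'] = 0.5 * toy['Logged GDPPC'] + region_effect + rng.normal(scale = 0.2, size = n)

# One-hot encode 'Region'; drop_first avoids a redundant (collinear) column
X_plain = toy[['Logged GDPPC']]
X_encoded = pd.get_dummies(toy[['Logged GDPPC', 'Region']], columns = ['Region'],
                           drop_first = True, dtype = float)
y = toy['Ladder']

# Same split for both feature sets so the held-out scores are comparable
train_idx, test_idx = train_test_split(toy.index, random_state = 0, test_size = 0.25)

def holdout_r2(X):
    model = LinearRegression().fit(X.loc[train_idx], y.loc[train_idx])
    return model.score(X.loc[test_idx], y.loc[test_idx])

r2_plain = holdout_r2(X_plain)
r2_encoded = holdout_r2(X_encoded)
print(r2_plain, r2_encoded)
```

On the real data the gain would depend on how much of the regional difference is already explained by the six measured features.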

Thanks for checking out my notebook (my first Kaggle notebook, actually!), and I would love any feedback/tips on improving since I am pretty new at this and still have a lot to learn. If you happened to like this notebook, an upvote is greatly appreciated :)