This kernel is just a fun exploratory project to get a feel for this very interesting dataset. I don't intend to accomplish anything beyond enjoying some visualizations. At the end I'll run a linear regression and see how accurate I can get.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory

from subprocess import check_output
print(check_output(["ls", "../input"]).decode("utf8"))

# Any results you write to the current directory are saved as output.

In [None]:
data2015 = pd.read_csv('../input/2015.csv')
data2016 = pd.read_csv('../input/2016.csv')

data2015['Year'] = 2015
data2016['Year'] = 2016
data = pd.concat([data2015, data2016])
data = data.reset_index()
data.head()
data.tail()
data.info()

Looks like we don't have upper and lower confidence intervals for all of our countries. Might we assume that this data was collected in 2016 but not in 2015? Let's look and see.

In [None]:
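# For each year, flag the rows where the column is null, then print that year's row count.
missing_lower_2015 = data['Lower Confidence Interval'][data['Year'] == 2015].isnull()
assert missing_lower_2015.any()        # 2015 is missing confidence intervals
print(len(missing_lower_2015))
missing_lower_2016 = data['Lower Confidence Interval'][data['Year'] == 2016].isnull()
assert not missing_lower_2016.any()    # 2016 has every confidence interval
print(len(missing_lower_2016))
missing_se_2015 = data['Standard Error'][data['Year'] == 2015].isnull()
assert not missing_se_2015.any()       # 2015 has every standard error
print(len(missing_se_2015))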

OK, well fortunately I have enough data to work out the standard error for my 2016 data: if the bounds form a 95% confidence interval, half the interval width is 1.96 standard errors. I don't have enough information to work out the upper/lower confidence intervals for 2015, so I'll simply drop both of those columns and drive on with just the standard error for both years.
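
The back-calculation relies on the usual normal-approximation form of a 95% confidence interval, mean ± 1.96 × standard error. Here is a small worked sketch with made-up bounds (the numbers are illustrative, not taken from the dataset). It assumes the 2016 bounds really are symmetric 95% intervals, which I have not verified.

```python
# Hypothetical 95% confidence bounds for one country (illustrative values only).
lower, upper = 7.0, 7.2

# Assumption: the interval is symmetric, i.e. mean +/- Z_95 * standard_error.
Z_95 = 1.96

half_width = (upper - lower) / 2      # margin of error
sample_mean = lower + half_width      # centre of the interval
standard_error = half_width / Z_95

print(f"mean={sample_mean:.3f}, standard error={standard_error:.4f}")
```

If the intervals were built at a different confidence level, only `Z_95` would need to change.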

In [None]:
# Half the interval width is the margin of error; for a 95% interval
# (an assumption about how these bounds were built) it equals 1.96 standard errors.
half_width = (data2016['Upper Confidence Interval'] - data2016['Lower Confidence Interval']) / 2
data2016['Standard Error'] = half_width / 1.96

# The interval bounds exist only for 2016, so drop them before recombining.
del data2016['Upper Confidence Interval']
del data2016['Lower Confidence Interval']

data = pd.concat([data2015, data2016])
data = data.reset_index()
data.info()

OK, we have no null values and we're ready to begin plotting. I'll start with some kernel density estimate plots of my various continuous variables against each country's happiness score.

In [None]:
print(data.head())

In [None]:
# Overlay a bivariate KDE and a scatter plot of each factor against the happiness score.
# Recent seaborn versions take x=/y= and fill= in place of data=/data2= and shade=.
factors = ['Economy (GDP per Capita)', 'Family', 'Freedom', 'Generosity', 'Health (Life Expectancy)', 'Trust (Government Corruption)']

for factor in factors:
    _ = sns.kdeplot(x=data['Happiness Score'], y=data[factor], fill=True)
    _ = plt.scatter(x=data['Happiness Score'], y=data[factor], alpha=0.2, color='green')
    _ = plt.xlabel('Happiness Score')
    _ = plt.ylabel(factor)
    _ = plt.title('Happiness vs. ' + factor)
    plt.show()

There are some interesting things to note here, chiefly that health, family, freedom, and wealth seem to correlate well with overall happiness while trust in government and generosity are less important. We can quantify this with a seaborn heatmap.

In [None]:
cols = ['Dystopia Residual', 'Economy (GDP per Capita)', 'Family', 'Freedom', 'Generosity', 'Happiness Score', 'Health (Life Expectancy)', 'Trust (Government Corruption)']

heatmap = data[cols].copy()  # copy so later column edits do not touch data
corr = heatmap.corr()
mask = np.zeros_like(corr)
mask[np.triu_indices_from(mask)] = True
with sns.axes_style("white"):
    ax = sns.heatmap(corr, mask=mask, vmax=.3, square=True, cmap="YlGnBu", annot=True)
plt.show()

Again, we see a high degree of correlation between happiness and the economy, family, freedom, and health. Trust in government (with a correlation coefficient of about 0.4, certainly not something to ignore) and generosity matter less.
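
If reading numbers off the heatmap is awkward, the same information can be pulled out as a sorted column. This is only a sketch on a made-up frame: the column names mirror the dataset, but the values are invented. On the real data the same call would go on `data[cols]`.

```python
import pandas as pd

# Toy stand-in for data[cols]; values are invented for illustration.
toy = pd.DataFrame({
    'Happiness Score': [7.5, 6.9, 5.8, 4.6, 3.4],
    'Economy (GDP per Capita)': [1.40, 1.30, 1.00, 0.60, 0.30],
    'Generosity': [0.30, 0.45, 0.20, 0.25, 0.35],
})

# Pearson correlation of every column with the happiness score, strongest first.
ranking = (toy.corr()['Happiness Score']
              .drop('Happiness Score')
              .sort_values(ascending=False))
print(ranking)
```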

Now let's use some lag plots to see if our data is random. Lag plots are a neat tool in the pandas library "used to check if a data set or time series is random. Random data should not exhibit any structure in the lag plot. Non-random structure implies that the underlying data are not random." You can check out the documentation [here][1].
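
Under the hood a lag plot is just a scatter of each value against the next one in row order, so it is sensitive to how the rows are sorted. The sketch below uses synthetic numbers (not the happiness data) to show that the same values look highly structured when sorted and unstructured when left in random order. The lag-1 correlation serves here as a rough numeric stand-in for what the eye sees in the plot.

```python
import numpy as np

rng = np.random.default_rng(0)
values = rng.normal(size=300)   # synthetic sample, not the happiness data

def lag1_correlation(series):
    """Correlation between y(t) and y(t+1), the two axes of a lag plot."""
    return np.corrcoef(series[:-1], series[1:])[0, 1]

random_order = lag1_correlation(values)             # no structure expected
sorted_order = lag1_correlation(np.sort(values))    # strong structure expected

print(f"random order: {random_order:.2f}, sorted order: {sorted_order:.2f}")
```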


  [1]: http://pandas.pydata.org/pandas-docs/stable/visualization.html

In [None]:
from pandas.plotting import lag_plot  # pandas.tools.plotting was removed in newer pandas releases

plt.figure(1)
lag_plot(data['Freedom'])
plt.title('Freedom')

plt.figure(2)
lag_plot(data['Family'])
_ = plt.title('Family')

plt.figure(3)
lag_plot(data['Dystopia Residual'])
_ = plt.title('Dystopia Residual')

plt.figure(4)
lag_plot(data['Economy (GDP per Capita)'])
_ = plt.title('Economy (GDP per Capita)')

plt.figure(5)
lag_plot(data['Generosity'])
_ = plt.title('Generosity')

plt.figure(6)
lag_plot(data['Happiness Score'])
_ = plt.title('Happiness Score')

plt.figure(7)
lag_plot(data['Health (Life Expectancy)'])
_ = plt.title('Health (Life Expectancy)')

plt.figure(8)
lag_plot(data['Trust (Government Corruption)'])
_ = plt.title('Trust (Government Corruption)')

plt.tight_layout()
plt.show()

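Well, the happiness score itself is decidedly non-random. That is what you would expect if the rows in each file are ordered by happiness rank, since each score then sits right next to its nearest neighbours. The rest of these variables don't exhibit any particularly marked pattern to me.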

Now we'll use the radviz plots native to the pandas library. Per the documentation:

"RadViz is a way of visualizing multi-variate data. It is based on a simple spring tension minimization algorithm. Basically you set up a bunch of points in a plane. In our case they are equally spaced on a unit circle. Each point represents a single attribute. You then pretend that each sample in the data set is attached to each of these points by a spring, the stiffness of which is proportional to the numerical value of that attribute (they are normalized to unit interval). The point in the plane, where our sample settles to (where the forces acting on our sample are at an equilibrium) is where a dot representing our sample will be drawn. Depending on which class that sample belongs it will be colored differently."
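
To make the spring picture concrete, here is a minimal sketch of where a single sample lands. It follows my reading of the description above (the equilibrium is the stiffness-weighted average of the anchor points) and is not meant as pandas' exact implementation. The function name `radviz_position` and the sample values are hypothetical.

```python
import math

def radviz_position(normalized_values):
    """Equilibrium point for one sample whose attributes are already scaled to [0, 1]."""
    m = len(normalized_values)
    # One anchor per attribute, equally spaced on the unit circle.
    anchors = [(math.cos(2 * math.pi * i / m), math.sin(2 * math.pi * i / m))
               for i in range(m)]
    total = sum(normalized_values)
    # Springs balance at the stiffness-weighted average of the anchor positions.
    x = sum(v * ax for v, (ax, _) in zip(normalized_values, anchors)) / total
    y = sum(v * ay for v, (_, ay) in zip(normalized_values, anchors)) / total
    return x, y

# The first attribute dominates, so the point is pulled toward the first anchor at (1, 0).
dominated = radviz_position([0.9, 0.1, 0.1, 0.1])
# Equal values on every attribute cancel out and land at the centre.
balanced = radviz_position([0.5, 0.5, 0.5, 0.5])
print(dominated, balanced)
```

A sample whose values are all zero has no defined position in this sketch, because the division by `total` fails.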

In [None]:
from pandas.plotting import radviz  # pandas.tools.plotting was removed in newer pandas releases
pd.options.mode.chained_assignment = None

del heatmap['Happiness Score']
heatmap['Region'] = data['Region']

plt.figure()
radviz(heatmap, 'Region')
plt.legend(bbox_to_anchor=(1,1))
plt.show()

We can see that Sub-Saharan Africa places more importance on family and economic success relative to the other variables. However, it is a bit hard to tell from this chart. I can bring these tendencies out in sharp relief by raising each value to a power (the sixth, in the cell below). This dramatically accentuates the differences between the various values.
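
A quick bit of arithmetic shows why a power transform exaggerates differences. The three values below are hypothetical. Since radviz rescales each column to the unit interval anyway, take this as an illustration of the general effect only, not a reproduction of the chart.

```python
# Hypothetical values for three attributes of one country.
values = [0.5, 0.8, 1.0]
powered = [v ** 6 for v in values]

# Share of the total "pull" each attribute has before and after the transform.
share_before = [v / sum(values) for v in values]
share_after = [p / sum(powered) for p in powered]

for v, b, a in zip(values, share_before, share_after):
    print(f"value={v:.1f}  share before={b:.2f}  share after={a:.2f}")
```

The largest value ends up with most of the pull, while the smaller ones shrink toward zero.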

In [None]:
del heatmap['Region']
heatmap = heatmap ** 6
heatmap['Region'] = data['Region']

plt.figure()
radviz(heatmap, 'Region')
plt.legend(bbox_to_anchor=(1,1))
plt.show()

It's a little clearer now that freedom, family, economy, and dystopia residual are more important to Sub-Saharan Africans than they are to the rest of the world's population. We have a few countries with decidedly unusual preferences for generosity, a good number from around the world that are more heavily influenced by health considerations, and a strong overall tendency towards the importance of freedom and family.

It's also interesting to note that trust in government is not decisive for anybody, apparently. Or, more accurately, that nobody has a high enough trust in government for that factor to be one of the most important ones.

Next let's look at trends over time by region.

In [None]:
order = ['Sub-Saharan Africa', 'Southern Asia', 'Southeastern Asia', 'Eastern Asia', 'Australia and New Zealand', 'Central and Eastern Europe', 'Western Europe', 'Latin America and Caribbean', 'North America']


_ = sns.barplot(x=data['Region'], y=data['Happiness Score'], order=order, hue=data['Year'], hue_order=[2015, 2016])
_ = plt.xticks(rotation=75)
_ = plt.xlabel('Regions')
_ = plt.ylabel('Average Happiness Score 2015-2016')
_ = plt.title('Happiness by Region 2015-2016')
plt.show()

_ = sns.barplot(x='Region', y='Economy (GDP per Capita)', data=data, hue='Year', order=order, hue_order=[2015, 2016])
_ = plt.xticks(rotation=75)
_ = plt.xlabel('Regions')
_ = plt.ylabel('Average GDP per Capita 2015-2016')
_ = plt.title('GDP per Capita by Region 2015-2016')
plt.show()

_ = sns.barplot(x='Region', y='Freedom', data=data, hue='Year', order=order, hue_order=[2015, 2016])
_ = plt.xticks(rotation=75)
_ = plt.xlabel('Regions')
_ = plt.ylabel('Freedom 2015-2016')
_ = plt.title('Freedom by Region 2015-2016')
plt.show()

_ = sns.barplot(x='Region', y='Family', data=data, hue='Year', order=order, hue_order=[2015, 2016])
_ = plt.xticks(rotation=75)
_ = plt.xlabel('Regions')
_ = plt.ylabel('Family 2015-2016')
_ = plt.title('Family by Region 2015-2016')
plt.show()

_ = sns.barplot(x='Region', y='Health (Life Expectancy)', data=data, hue='Year', order=order, hue_order=[2015, 2016])
_ = plt.xticks(rotation=75)
_ = plt.xlabel('Regions')
_ = plt.ylabel('Health (Life Expectancy) 2015-2016')
_ = plt.title('Health (Life Expectancy) by Region 2015-2016')
plt.show()

_ = sns.barplot(x='Region', y='Trust (Government Corruption)', data=data, hue='Year', order=order, hue_order=[2015, 2016])
_ = plt.xticks(rotation=75)
_ = plt.xlabel('Regions')
_ = plt.ylabel('Trust (Government Corruption) 2015-2016')
_ = plt.title('Trust (Government Corruption) by Region 2015-2016')
plt.show()

_ = sns.barplot(x='Region', y='Generosity', data=data, hue='Year', order=order, hue_order=[2015, 2016])
_ = plt.xticks(rotation=75)
_ = plt.xlabel('Regions')
_ = plt.ylabel('Generosity 2015-2016')
_ = plt.title('Generosity by Region 2015-2016')
plt.show()

_ = sns.barplot(x='Region', y='Dystopia Residual', data=data, hue='Year', order=order, hue_order=[2015, 2016])
_ = plt.xticks(rotation=75)
_ = plt.xlabel('Regions')
_ = plt.ylabel('Dystopia Residual 2015-2016')
_ = plt.title('Dystopia Residual by Region 2015-2016')
plt.show()

So in general we can say that in 2015-2016:

1. Generosity and Happiness were effectively unchanged.

2. Dystopia Residual and GDP per capita rose.

3. Life Expectancy, Family, and Freedom declined pretty much across the board.

4. Trust in government decreased slightly, but was already so low it hardly seems to matter.

And now, just because this is Kaggle and I can, I'll use a linear regression to predict happiness scores from some of our measured variables. Let's see how accurately these numbers predict the happiness score. I would assume the fit is close to perfect, since the happiness score is actually derived from these metrics, but there's no need to take that assumption for granted when we can test it.
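
Here is why a near-perfect fit would be expected: when a target is an exact linear combination of the features, ordinary least squares recovers it essentially exactly. The sketch below shows this on synthetic data, using NumPy's least-squares solver in place of scikit-learn to keep it dependency-light. Whether the real score is an exact sum of these columns is an assumption of this sketch, not something shown here.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic stand-ins for three component columns (not the real data).
components = rng.uniform(0.0, 1.5, size=(200, 3))
# Assumption for this sketch: the score is exactly the sum of its components.
score = components.sum(axis=1)

# Ordinary least squares with an intercept column.
design = np.column_stack([np.ones(len(components)), components])
coefficients, *_ = np.linalg.lstsq(design, score, rcond=None)

predictions = design @ coefficients
max_error = np.abs(predictions - score).max()
print(coefficients.round(6), max_error)
```

Any noticeable error on the real data would then point to features left out of the model or to rounding in the published columns.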

In [None]:
from sklearn.preprocessing import LabelEncoder

data.Region = LabelEncoder().fit_transform(data.Region)
data.Country = LabelEncoder().fit_transform(data.Country)

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
import sklearn.metrics as m

target = data['Happiness Score']
features = data[['Country', 'Dystopia Residual', 'Economy (GDP per Capita)', 'Family', 'Freedom', 'Generosity', 'Health (Life Expectancy)', 'Region', 'Trust (Government Corruption)']]

X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.33, random_state=42)

model = LinearRegression().fit(X_train, y_train)
predictions = model.predict(X_test)
mae = m.mean_absolute_error(y_test, predictions)
mse = m.mean_squared_error(y_test, predictions)
print(mae)
print(mse)

# density=True replaces the normed=True argument, which newer matplotlib releases no longer accept.
_ = plt.hist(predictions, alpha=0.5, color='red', cumulative=True, density=True, bins=len(predictions), histtype='stepfilled', stacked=True)
_ = plt.hist(y_test, alpha=0.5, color='blue', cumulative=True, density=True, bins=len(predictions), histtype='stepfilled', stacked=True)
plt.show()


Well, I have to say that's pretty accurate. My mean absolute and mean squared errors are tiny. 
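
For reference, this is how the two metrics are computed, worked by hand on a handful of made-up numbers (not my actual predictions). Both are in units of the happiness score, squared in the case of MSE, so "tiny" should be judged against the roughly 0 to 10 range of the score.

```python
# Hypothetical true values and predictions, for illustration only.
y_true = [5.0, 6.2, 4.1, 7.3]
y_pred = [5.1, 6.0, 4.1, 7.6]

errors = [t - p for t, p in zip(y_true, y_pred)]
mae = sum(abs(e) for e in errors) / len(errors)   # mean absolute error
mse = sum(e ** 2 for e in errors) / len(errors)   # mean squared error

print(f"MAE={mae:.3f}, MSE={mse:.4f}")
```

Because errors below 1 get smaller when squared, the MSE comes out lower than the MAE here.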

To illustrate the results I have plotted my results as cumulative, normed histograms. My predictions are red and the true values blue, so where the two overlap you see purple. Only where a little blue or a little red pops out was there some inaccuracy in the prediction.

I think that's all for now. Please upvote if you like what you see, and leave a comment if you have any suggestions for improvement.