<img src="https://images.squarespace-cdn.com/content/v1/51965cc6e4b0812cc818d772/1497545812042-23CKR83KTAAUZHQ5QPRF/ke17ZwdGBToddI8pDm48kO-nusLtcAdtf47f8bIHOgBZw-zPPgdn4jUwVcJE1ZvWQUxwkmyExglNqGp0IvTJZamWLI2zvYWH8K3-s_4yszcp2ryTI0HqTOaaUohrI8PIRZ5NwDSCpbsZQ0RB-l3w14x_kfU1-FWN1-nxyaZPMPYKMshLAGzx4R3EDFOm1kBS/thrive+finding+happiness?format=750w" width="450">

# World Happiness and 65 world indexes

In this kernel we are exploring the **World Happiness Report** (WHR) dataset provided by Kaggle.

In the first part of the notebook we study the data with pandas tools and statistical plots of seaborn.  
We see which countries were the happiest in each year and which countries experienced the largest rises and drops in Happiness rank. Also, we study how Happiness is related to parameters like Economy, Health, Education etc. and how these values vary for different regions of the world. Finally, some interactive choropleth maps with Plotly complete this EDA part.

In the second part we check how well the WHR Happiness score (and rank) can be predicted by independent data.  
For this, we use data from the **65 world indexes** dataset, which is a collection of socioeconomic indicators for each country of the world.


**The notebook follows this outline:**

### Part 0: Imports, reading data, useful functions  


### Part 1: Exploratory Data Analysis

1.1 first look at **world-happiness** and **65-world-indexes-gathered**  
[top 15 happiest countries for 2015, 2016 and 2017](#top-15-happiest-countries-for-2015,-2016-and-2017)   
[Match column names for all years](#Match-column-names-for-all-years)     
[65 indexes](#65-indexes)   
[Sort by largest CO2 emissions per capita](#Sort-by-largest-CO2-emissions-per-capita) 

1.2 Exploring happiness data: 2015, 2016, 2017  
**Pandas**  
[Change in top 25 Happiness ranks 2015-2017](#Change-in-top-25-Happiness-ranks-2015-2017)   
[10 biggest rises in Happiness rank](#10-biggest-rises-in-Happiness-rank)     
[10 biggest drops in Happiness rank](#10-biggest-drops-in-Happiness-rank)    
**Seaborn**  
[Seaborn Boxplots for World Regions](#Seaborn-Boxplots-for-World-Regions)  
[Scatterplots for World Regions](#Scatterplots-for-World-Regions)  
[Correlation matrix for 2015](#Correlation-matrix-for-2015)  
**Plotly**  
[Scatterplot matrix for 2015](#Scatterplot-matrix-for-2015)  
**Choropleth maps**  
[Happiness Score Map 2015](#Happiness-Score-Map-2015)    
[Slider: Happiness Score Maps 2015-2017](#Slider:-Happiness-Score-Maps-2015-2017)  
[Map: Happiness rank change 2015 to 2017](#Map:-Happiness-rank-change-2015-to-2017) 

1.3 Exploring 65 world indexes  
Carbon dioxide emissions per capita 2011  

1.4 Looking at both data sets together 



### Part 2: Modeling happiness score from 65 world indexes


predict happiness score from 65 world indexes





**About the WHR dataset** (from Kaggle) :

The World Happiness Report is a landmark survey of the state of global happiness.  

The happiness scores and rankings use data from the Gallup World Poll. The scores are based on answers to the main life evaluation question asked in the poll. This question, known as the Cantril ladder, asks respondents to think of a ladder with the best possible life for them being a 10 and the worst possible life being a 0 and to rate their own current lives on that scale.

The scores are from nationally representative samples for the years 2013-2016 and use the Gallup weights to make the estimates representative. The columns following the happiness score estimate the extent to which each of six factors – economic production, social support, life expectancy, freedom, absence of corruption, and generosity – contribute to making life evaluations higher in each country than they are in Dystopia, a hypothetical country that has values equal to the world’s lowest national averages for each of the six factors. They have no impact on the total score reported for each country, but they do explain why some countries rank higher than others.

# Part 0: Imports, reading data, useful functions 

### Imports

In [None]:
import numpy as np 
import pandas as pd 

import matplotlib.pyplot as plt
import matplotlib.style as style
#style.use('ggplot')
style.use('fivethirtyeight')

import seaborn as sns
sns.set_context('talk', font_scale=0.8) 

import plotly as py
from plotly import tools
from plotly import subplots
import plotly.graph_objs as go
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
init_notebook_mode(connected=True)
%matplotlib inline

from scipy import stats

from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.metrics import mean_absolute_error, mean_squared_error, max_error
from sklearn.metrics import explained_variance_score, r2_score
from sklearn.model_selection import train_test_split, cross_validate
from sklearn.preprocessing import StandardScaler

import warnings
warnings.filterwarnings("ignore", category=FutureWarning)

import os
print(os.listdir("../input"))
print(os.listdir("../input/world-happiness"))
print(os.listdir("../input/65-world-indexes-gathered"))

In [None]:
def warn(*args, **kwargs):
    pass
import warnings
warnings.warn = warn

In [None]:
%%html
<style>table {float:left}</style>
<style> table td, table th, table tr {text-align:left !important;} </style>

### Reading data  
There are 3 happiness data csv files for 2015, 2016 and 2017 and one data file for the 65 world indexes.  
We read the csv files into individual pandas dataframes.

In [None]:
df_happy_2015 = pd.read_csv("../input/world-happiness/2015.csv")
df_happy_2016 = pd.read_csv("../input/world-happiness/2016.csv")
df_happy_2017 = pd.read_csv("../input/world-happiness/2017.csv")
df_65indexes = pd.read_csv("../input/65-world-indexes-gathered/Kaggle.csv")

### Useful functions 

In [None]:
def plotly_scatter(df, col_x_values, col_y_values) :
    
    plot_title = col_y_values + " vs. " + col_x_values
        
    trace = go.Scatter( x = df[col_x_values], y = df[col_y_values],
                        text = df['Country'], mode = 'markers') 

    layout = go.Layout( title=plot_title, autosize=False, width=700, height=500,
                        xaxis=dict(title = col_x_values), 
                        yaxis=dict(title = col_y_values)              
                      )

    fig = go.Figure(data=[trace], layout=layout)
    fig.update_layout(title=go.layout.Title(text=plot_title, xref="paper", x=0.5))     
    
    iplot(fig)

In [None]:
def plotly_choropleth_map(countries, values, title, colorbar_title, projection, colorscale) :
    
    data = dict(type = 'choropleth', 
           colorscale = colorscale,
           locations = countries,
           locationmode = 'country names',
           z = values, 
           text = countries,
           colorbar = {'title': colorbar_title})
    
    layout = dict(title = title, 
                  geo = dict(showframe = True, 
                       projection = {'type': projection}))
    
    choroplethmap = go.Figure(data = [data], layout=layout)
    choroplethmap.update_layout(title=go.layout.Title(text=title, xref="paper", x=0.5))    
    
    iplot(choroplethmap)

In [None]:
def plotly_true_and_predictions(model) :

    str_title = type(model).__name__
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    
    fig = subplots.make_subplots(rows=1, cols=2, print_grid=False, 
                          subplot_titles=["Happiness Score, true and predicted", 
                                          "Differences Histogram"])

    trace_1 = go.Scatter(x = y_test, y = y_pred, name="", 
                         mode = 'markers',  marker = dict(size = 10),
                         text = y_test.index)
    
    trace_2 = go.Histogram(x=y_test-y_pred, name="Diff", nbinsx = 25)

    fig.append_trace(trace_1, 1, 1)
    fig.append_trace(trace_2, 1, 2)

    fig['layout'].update(height=450, width=800, title=str_title, showlegend=False)
    fig['layout']['xaxis1'].update(title='y_true', range=[0, 9])
    fig['layout']['yaxis1'].update(title='y_pred', range=[0, 9])
    fig['layout']['xaxis2'].update(title='y_true - y_pred', range=[-2, 2])
    fig['layout']['yaxis2'].update(title='count', range=[0, 5])    
    
    fig.update_layout(title=go.layout.Title(text=str_title, xref="paper", x=0.5))
    
    iplot(fig)

# Part 1: Exploratory Data Analysis

## **1.1.1 - World Happiness**

## top 15 happiest countries for 2015, 2016 and 2017

In [None]:
df_happy_2015.head(15)

In [None]:
df_happy_2016.head(15)

In [None]:
df_happy_2017.head(15)

There are some countries that are in the top 10 each year.  
And many of the top happiest countries are in Western Europe.  
Scandinavians seem to be especially happy.  
Also we note some differences in the column names for 2017.  

### Match column names for all years

Column names for 2015 and 2016 are the same.  
For 2017 some names are different and there are a few additional columns.  
For convenience, we rename the columns for 2017 to match those of 2015 and 2016.

In [None]:
df_happy_2017.rename(columns={'Happiness.Rank': 'Happiness Rank',
                              'Happiness.Score': 'Happiness Score',
                              'Economy..GDP.per.Capita': 'Economy (GDP per Capita)'
                             }, inplace=True)
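
One thing to keep in mind with the cell above: `DataFrame.rename` silently skips keys that do not match an existing column, so a small spelling difference in an old name goes unnoticed. The following is a minimal sketch, not part of the original notebook, of a stricter variant that uses the `errors="raise"` option of `rename`. The helper name `rename_columns_strict` and the two-column demo frame are hypothetical and only serve as illustration.

```python
import pandas as pd


def rename_columns_strict(df, mapping):
    """Rename columns and raise KeyError if an old name is not present."""
    return df.rename(columns=mapping, errors="raise")


# Hypothetical demo frame, not the real 2017 data.
demo = pd.DataFrame({"Happiness.Rank": [1, 2], "Happiness.Score": [7.5, 7.4]})

# A key that exists is renamed as usual.
renamed = rename_columns_strict(demo, {"Happiness.Rank": "Happiness Rank"})

# A key that does not exist is reported instead of being ignored.
try:
    rename_columns_strict(demo, {"No.Such.Column": "Whatever"})
    missing_detected = False
except KeyError:
    missing_detected = True
```

Applied to `df_happy_2017`, the same call would fail loudly if one of the old column names in the mapping were spelled differently from the csv header.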

## **1.1.2 - 65 indexes**

In [None]:
df_65indexes.head(150)

The default sorting of the 65 indexes dataframe is by the Human Development Index (HDI) of 2014.  
In the top places we find many countries that are also among the happiest countries.  
So the HDI might be a good proxy for the Happiness Score. We explore this further in Part 1.4 and Part 2.  
Let's now look at another sorting, for example by the largest CO2 emissions per capita.

### Sort by largest CO2 emissions per capita

In [None]:
df_65indexes.sort_values(by=['Carbon dioxide emissions per capita 2011 Tones'], ascending=False).head(15)

## **1.2 Exploring happiness data**

### Change in top 25 Happiness ranks 2015-2017

In [None]:
country_and_rank = ['Country','Happiness Rank']

In [None]:
df_countries_rank_year =  \
    df_happy_2015[country_and_rank].  \
    merge(df_happy_2016[country_and_rank], on='Country').  \
    merge(df_happy_2017[country_and_rank], on='Country')

In [None]:
df_countries_rank_year.set_index('Country', inplace=True)
df_countries_rank_year.rename(columns={'Happiness Rank_x': 'Happiness Rank_2015'}, inplace=True)
df_countries_rank_year.rename(columns={'Happiness Rank_y': 'Happiness Rank_2016'}, inplace=True)
df_countries_rank_year.rename(columns={'Happiness Rank': 'Happiness Rank_2017'}, inplace=True)
df_countries_rank_year['Change_2015_to_2017'] = df_countries_rank_year['Happiness Rank_2015'] - df_countries_rank_year['Happiness Rank_2017']
df_countries_rank_year.head(25)
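
The merge and rename steps above can also be written so that each year's rank column gets its final name before the join. This is a sketch under stated assumptions: the helper `rank_table` and the tiny demo frames are hypothetical, and the sign convention follows the notebook (2015 rank minus 2017 rank, so a positive value means the country moved up).

```python
import pandas as pd


def rank_table(frames, key="Country", rank_col="Happiness Rank"):
    """Inner-join the rank column of several yearly frames on `key`.

    `frames` maps a year to a DataFrame; each output column is named
    '<rank_col>_<year>'.
    """
    parts = [
        df.set_index(key)[rank_col].rename(f"{rank_col}_{year}")
        for year, df in frames.items()
    ]
    # join="inner" keeps only countries present in every year.
    return pd.concat(parts, axis=1, join="inner")


# Hypothetical toy data, not taken from the real csv files.
df15 = pd.DataFrame({"Country": ["A", "B", "C"], "Happiness Rank": [1, 2, 3]})
df17 = pd.DataFrame({"Country": ["A", "B", "D"], "Happiness Rank": [2, 1, 3]})

table = rank_table({2015: df15, 2017: df17})
table["Change_2015_to_2017"] = (
    table["Happiness Rank_2015"] - table["Happiness Rank_2017"]
)
```

In the toy data, country B moves from rank 2 to rank 1 and gets a change of +1, while C and D drop out because they are missing in one of the years.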

### 10 biggest rises in Happiness rank

In [None]:
df_countries_rank_year.nlargest(10, 'Change_2015_to_2017', keep='last')

### 10 biggest drops in Happiness rank

In [None]:
df_countries_rank_year.nsmallest(10, 'Change_2015_to_2017', keep='last')

### Correlation matrix for 2015

In [None]:
f,ax = plt.subplots(figsize=(10, 8))
#sns.set(font_scale=1.2)
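# restrict to numeric columns: recent pandas versions raise an error in corr() on text columns like Country and Region
sns.heatmap(df_happy_2015.select_dtypes(include='number').corr(), annot=True, annot_kws={"size": 12}, cmap="rainbow", 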
            linewidths=1.5, linecolor="white", fmt= '.2f',ax=ax)
plt.show()

### Seaborn Boxplots for World Regions

In [None]:
cols = df_happy_2015.columns[3:].tolist()
print(cols)

In [None]:
for col in cols :
    fig, ax = plt.subplots(figsize=(8, 5.5))
    g = sns.boxplot(x=col, y="Region", data=df_happy_2015)
    plt.show()

### Scatterplots for World Regions

Seaborn

In [None]:
fig, ax = plt.subplots(figsize=(8, 6.5))
g = sns.scatterplot(x='Economy (GDP per Capita)', y='Happiness Score', hue='Region', data=df_happy_2015);
g.legend(loc='upper left', bbox_to_anchor=(1.05, 1.01), ncol=1);
#plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.);

In [None]:
fig, ax = plt.subplots(figsize=(8, 6.5))
g = sns.scatterplot(x='Health (Life Expectancy)', y='Happiness Score', hue='Region', data=df_happy_2015);
g.legend(loc='upper left', bbox_to_anchor=(1.05, 1.01), ncol=1);

### Plotly

In [None]:
df=df_happy_2015

df['Region_code'] = pd.factorize(df['Region'], sort=True)[0] + 1 

data = [ { 'x': df['Health (Life Expectancy)'],
           'y': df['Happiness Score'],
           'mode': 'markers',
           'marker': { 'color': df['Region_code'] ,
                       'colorscale' : 'Viridis' ,
                       'size': df['Economy (GDP per Capita)']*20,
                       'showscale': True
                      },
           "text" :  df['Country']    
       } ]

title = 'Happiness Score vs Health (Life Expectancy): Size=Economy, Color=Region'

layout = dict(title=title,
              xaxis=dict(title='Health (Life Expectancy)'),
              yaxis=dict(title='Happiness Score'),
             )
fig = go.Figure(data=data,layout=layout)
fig.update_layout(title=go.layout.Title(text=title, xref="paper", x=0.5))

iplot(fig)

In [None]:
regions = df['Region'].unique().tolist()

fig = { 'data': [{ 'x': df[df['Region']==region]['Economy (GDP per Capita)'],
                   'y': df[df['Region']==region]['Happiness Score'],    
                   'name': region, 
                   'text': df['Country'][df['Region']==region], 
                   'mode': 'markers',
                   'marker': {'size': df['Health (Life Expectancy)']*20 ,
                             } 
                 } 
                   for region in regions       ],
    
        'layout': { 'xaxis': {'title': 'GDP per Capita'},
                    'yaxis': {'title': "Happiness Score"}
                  }
       }

iplot(fig)

### **Plotly**

### Scatterplot matrix for 2015

In [None]:
import plotly.figure_factory as ff

df_happy_2015['index'] = np.arange(1,len(df_happy_2015)+1)
cols = ['Happiness Score', 'Economy (GDP per Capita)', 'Family', 'Health (Life Expectancy)',
        'Freedom', 'Trust (Government Corruption)', 'Generosity']

fig = ff.create_scatterplotmatrix (df_happy_2015[cols], diag='box', index='Happiness Score', colormap='Viridis',
                                  colormap_type='cat', height=900, width=900)


fig.update_layout(title=go.layout.Title(text="Scatterplot Matrix", xref="paper", x=0.5))

iplot(fig)

## Plotly choropleth maps

### Happiness Score Map 2015

In [None]:
df = df_happy_2015
plotly_choropleth_map(df['Country'], df['Happiness Score'], 
                      'Global Happiness 2015', 'Happiness Score', 
                      'natural earth', 'Viridis')

### Map: Happiness rank change 2015 to 2017

In [None]:
df_countries_rank_year.head()

In [None]:
df = df_countries_rank_year
plotly_choropleth_map(df.index, df['Change_2015_to_2017'], 
                      'Happiness Rank Change 2015 to 2017', 
                      'Rank Change', 
                      'natural earth', 'Rainbow')

In [None]:
df = df_happy_2015
plotly_choropleth_map(df['Country'], df['Economy (GDP per Capita)'], 
                      'Economy (GDP per Capita)', 'GDP per Capita', 
                      'equirectangular', 'Viridis')

**scatter**

In [None]:
plotly_scatter(df_happy_2015, 'Economy (GDP per Capita)', 'Happiness Score')

In [None]:
sns.jointplot(x='Economy (GDP per Capita)', y='Happiness Score',  data=df_happy_2015, kind='kde', height=6)
plt.show()

In [None]:
sns.lmplot(x='Economy (GDP per Capita)', y='Happiness Score', 
           data=df_happy_2015, height=6, aspect=1.2)
plt.show()

## 1.3 Exploring 65 world indexes

In [None]:
df = df_65indexes
plotly_choropleth_map(df['Id'], df['Carbon dioxide emissions per capita 2011 Tones'], 
                      'Carbon dioxide emissions per capita 2011', 
                      'Tons', 'natural earth', 'Portland')

## 1.4 Studying both data sets together 

In [None]:
df_happy_2015.shape

158 countries in Happiness 2015 data

In [None]:
df_65indexes.shape

188 countries in 65 indexes data

### inner join

In [None]:
df_happy_2015 = df_happy_2015.set_index('Country')
df_65indexes = df_65indexes.set_index('Id')

df_join = df_happy_2015.join(df_65indexes, how="inner")

In [None]:
df_join.shape

Only 149 countries are in both datasets, that is 9 countries fewer than there are in the Happiness 2015 data.  
Let's check if some countries are misspelled or named differently.

### Countries that are in Happiness 2015 data but not in 65 indexes

In [None]:
df_happy_2015.index.difference(df_65indexes.index)

### Countries that are in  65 indexes data but not in Happiness 2015

In [None]:
df_65indexes.index.difference(df_happy_2015.index)

Yes, the following countries are in both datasets, but have different names: 

| Happiness 2015 | 65 indexes | Note |
| --- | --- | --- |
| 'Congo (Brazzaville)' | 'Republic of the Congo' | see Wikipedia |
| 'Congo (Kinshasa)' | 'Democratic Republic of the Congo' | see Wikipedia |
| 'Hong Kong' | 'Hong Kong ' | typo: trailing (non-breaking) space after name |
| 'Ivory Coast' | 'Côte d'Ivoire' | English and French names |
| 'Palestinian Territories' | 'Palestine' | same country |

We correct these names and then join the datasets again.  
That should add 5 countries to the joined dataset.
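
Some of these mismatches can be found semi-automatically. The sketch below is an illustration only: it normalizes whitespace and then asks `difflib` from the standard library for the closest candidate. The helper names `normalize` and `suggest_matches` and the `cutoff` value are assumptions, not part of the notebook. Pairs such as 'Ivory Coast' and 'Côte d'Ivoire' share too few characters to be caught this way, so a manual table like the one above is still needed.

```python
import difflib


def normalize(name):
    """Strip leading/trailing whitespace, including non-breaking spaces."""
    return name.strip()


def suggest_matches(left_only, right_only, cutoff=0.6):
    """For each name in left_only, suggest the most similar name in right_only.

    Returns a dict {left_name: original_right_name}; names without a
    sufficiently similar candidate are left out.
    """
    right_norm = {normalize(n): n for n in right_only}
    suggestions = {}
    for name in left_only:
        close = difflib.get_close_matches(
            normalize(name), list(right_norm), n=1, cutoff=cutoff
        )
        if close:
            suggestions[name] = right_norm[close[0]]
    return suggestions


# Names taken from the table above.
left_only = ["Hong Kong", "Ivory Coast", "Palestinian Territories"]
right_only = ["Hong Kong\xa0", "Côte d'Ivoire", "Palestine"]

suggestions = suggest_matches(left_only, right_only)
```

With the real data, `left_only` and `right_only` would be the two `index.difference(...)` results computed above.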

In [None]:
df_happy_2015.rename(index={"Congo (Brazzaville)":"Republic of the Congo"},inplace=True)
df_happy_2015.rename(index={"Congo (Kinshasa)":"Democratic Republic of the Congo"},inplace=True)
df_65indexes.rename(index={"Hong Kong\xa0":"Hong Kong"},inplace=True)
df_65indexes.rename(index={"Côte d'Ivoire":"Ivory Coast"},inplace=True)
df_65indexes.rename(index={"Palestine":"Palestinian Territories"},inplace=True)

In [None]:
df_join = df_happy_2015.join(df_65indexes, how="inner")
df_join.shape

Yes, now all 154 countries that are in both datasets are also in the joined dataframe.

In [None]:
df_65indexes.index.difference(df_happy_2015.index)

Many of the countries that occur in 65 indexes but not in Happiness 2015 are very small islands.  
However, there are also some larger countries that are not in the Happiness 2015 data:

'Belize', 'Brunei', 'Cape Verde', 'Cuba', 'Equatorial Guinea', 'Guinea-Bissau', 'Guyana', 'Maldives', 'Namibia', 'Papua New Guinea'

In [None]:
df_join.head()

In [None]:
sns.jointplot(x='Human Development Index HDI-2014',
              y='Happiness Score',
              data=df_join, height=6)
plt.show()

In [None]:
['Happiness Score']

In [None]:
plt.figure(figsize=(7,7))
sns.scatterplot(x='Human Development Index HDI-2014', y='Happiness Score', data=df_join)
sns.kdeplot(x='Human Development Index HDI-2014', y='Happiness Score', data=df_join);

### Correlation to Happiness Score

In [None]:
col_names_65idx = df_join.columns.tolist()[13:]
df_corr = df_join[col_names_65idx].corrwith(df_join['Happiness Score']).abs().sort_values(ascending=False)
df_corr.head(20)

### Correlation to Happiness Rank

In [None]:
df_corr_rank = df_join[col_names_65idx].corrwith(df_join['Happiness Rank']).abs().sort_values(ascending=False)
df_corr_rank.head(20)

Of the 65 indexes the largest correlation to Happiness is found for indexes from these groups:  
* Economy 
* Development
* Health 
* Education
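
The ranking used above (`corrwith`, then `abs`, then sorting) can be sketched on a small hypothetical frame. Note that taking the absolute value hides the sign, so a strongly negative indicator ranks just as high as a strongly positive one. The helper name `rank_by_abs_correlation` and the demo columns are assumptions made for this illustration.

```python
import pandas as pd


def rank_by_abs_correlation(df, target, candidates):
    """Sort candidate columns by |Pearson correlation| with the target column."""
    return df[candidates].corrwith(df[target]).abs().sort_values(ascending=False)


# Hypothetical toy data, not taken from the 65 indexes dataset.
demo = pd.DataFrame({
    "score": [1.0, 2.0, 3.0, 4.0],
    "up":    [2.0, 4.0, 6.0, 8.0],    # rises with the score
    "down":  [8.0, 6.0, 4.0, 3.0],    # falls as the score rises
    "noise": [1.0, -1.0, 1.0, -1.0],  # only weakly related
})

ranking = rank_by_abs_correlation(demo, "score", ["noise", "down", "up"])
```

Here `down` ranks directly behind `up` although its correlation with the score is negative.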

In [None]:
df_join['Country'] = df_join.index

In [None]:
plotly_scatter(df_join, 'Gini coefficient 2005-2013', 'Happiness Score')

In [None]:
plotly_scatter(df_join, 'Gross domestic product GDP percapta', 'Happiness Score')

In [None]:
plotly_scatter(df_join, 'Gross national income GNI per capita - 2011  Dollars', 'Happiness Score')

# Part 2: Modeling happiness score from 65 world indexes

To predict the Happiness Score we use the independent 65 world indexes dataset  
because we cannot use the factor columns from the World Happiness Report itself.  
The reason is given in the dataset description on Kaggle:  
What do the columns succeeding the Happiness Score(like Family, Generosity, etc.) describe?  
The following columns: GDP per Capita, Family, Life Expectancy, Freedom, Generosity, Trust Government Corruption describe the extent to which these factors contribute in evaluating the happiness in each country. The Dystopia Residual metric actually is the Dystopia Happiness Score(1.85) + the Residual value or the unexplained value for each country as stated in the previous answer.  
**If you add all these factors up, you get the happiness score so it might be un-reliable to model them to predict Happiness Scores.**  

See also discussion: https://www.kaggle.com/unsdsn/world-happiness/discussion/35141  
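
The statement that the factor columns add up to the Happiness Score can be checked directly. The following is a minimal sketch with hypothetical numbers: the helper `max_sum_gap` is not part of the notebook, and the column list assumes the 2015 naming used in this notebook plus a `Dystopia Residual` column as described in the quote above.

```python
import pandas as pd

# Assumed column names (2015 naming); adjust if the csv header differs.
FACTOR_COLS = [
    "Economy (GDP per Capita)", "Family", "Health (Life Expectancy)",
    "Freedom", "Trust (Government Corruption)", "Generosity",
    "Dystopia Residual",
]


def max_sum_gap(df, part_cols, total_col):
    """Largest absolute difference between the row-wise sum of parts and the total."""
    return (df[part_cols].sum(axis=1) - df[total_col]).abs().max()


# Hypothetical values: the first row adds up exactly, the second is off by 0.5.
demo = pd.DataFrame(
    [[1.0, 1.0, 0.5, 0.5, 0.25, 0.25, 2.0, 5.5],
     [0.5, 0.5, 0.5, 0.5, 0.25, 0.25, 2.0, 5.0]],
    columns=FACTOR_COLS + ["Happiness Score"],
)

gap = max_sum_gap(demo, FACTOR_COLS, "Happiness Score")
```

On the real data, a gap close to zero would confirm that a model trained on these columns only reconstructs a sum, which is why the independent dataset is used instead.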

In Part 1 we found that there are many features in the 65 world indexes that have a high correlation coefficient to the Happiness Score and Rank.  
In the following we check how well these features predict the Happiness.

In [None]:
cols_65idx_sorted = df_corr.index.tolist()
len(cols_65idx_sorted)

In [None]:
print(cols_65idx_sorted[:10])

In [None]:
X_HPscore = df_join[cols_65idx_sorted].copy()
y_HPscore = df_join["Happiness Score"].copy()

## 2.1 First test: 10 features, 3 linear models

### 10 features with highest correlation

In [None]:
X = X_HPscore.iloc[:,:10].copy()
y = y_HPscore.copy()

In [None]:
y

### train_test_split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=10)
print(len(X_train))
print(len(X_test))

### 3 linear models

In [None]:
linear = LinearRegression()
ridge = Ridge()
lasso = Lasso()

### Results: Difference of predicted and true Happiness Scores

In [None]:
plotly_true_and_predictions(linear)

In [None]:
plotly_true_and_predictions(ridge)

In [None]:
plotly_true_and_predictions(lasso)

It looks like the 10 features of 65 indexes with the largest correlation to Happiness Score can indeed be used for predicting the Happiness of the countries: the true Happiness Score for the 31 countries in the test set agrees quite well with the predictions.  
For all linear models, the deviation from the true score lies approximately between -1 and +1
and the points scatter evenly around the 1:1 line.
Let's look in more detail at these error measures for the different models: 
* MAE = mean_absolute_error 
* MSE = mean_squared_error (reported below as its square root, RMSE)
* MXE = max_error
* r2  = r2_score
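
For reference, the four measures can be written out in plain NumPy. This is a sketch of the standard definitions, not the scikit-learn implementation; the function name `regression_errors` and the three-point example are hypothetical.

```python
import numpy as np


def regression_errors(y_true, y_pred):
    """Return MAE, RMSE, maximum absolute error and R^2 as a dict."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    resid = y_true - y_pred
    ss_res = np.sum(resid ** 2)                     # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
    return {
        "MAE": np.mean(np.abs(resid)),
        "RMSE": np.sqrt(np.mean(resid ** 2)),
        "MXE": np.max(np.abs(resid)),
        "r2": 1.0 - ss_res / ss_tot,
    }


# Hypothetical example: the residuals are 1, 0 and -2.
errors = regression_errors([3, 5, 7], [2, 5, 9])
```

For this example the MAE is 1, the RMSE is the square root of 5/3, the maximum error is 2, and R² is 1 - 5/8 = 0.375.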

### Results: MAE, RMSE, MXE, r2

In [None]:
list_models = [linear, ridge, lasso]
name_models = ["linear", "ridge", "lasso"]
dict_models = dict(zip(name_models, list_models))

In [None]:
MAE = mean_absolute_error 
MSE = mean_squared_error
MXE = max_error
r2  = r2_score

In [None]:
print("model", "\t", "MAE", "\t",  "RMSE", "\t", "MXE", "\t", "r2")
print("-+"*30)

for n, m in dict_models.items() :
    print(  n,"\t", 
           '{:05.3f}'.format(MAE(y_test, m.predict(X_test))), "\t",
           '{:05.3f}'.format(np.sqrt(MSE(y_test, m.predict(X_test)))), "\t", 
           '{:05.3f}'.format(MXE(y_test, m.predict(X_test))), "\t", 
           '{:05.3f}'.format(r2(y_test, m.predict(X_test))), "\t",           
         ) 


Let's look in more detail at how the prediction errors change when we include more or fewer features from 65 indexes.

## 2.2 Vary number of features, 3 linear models

### Cross Validation Scores, cv=5

In [None]:
nr_feats = list(range(1,31))
scoring = ['neg_mean_absolute_error', 'neg_mean_squared_error', 'max_error']

In [None]:
scores_MAE_linear = []
scores_MAE_ridge = []
scores_MAE_lasso = []
scores_RMSE_linear = []
scores_RMSE_ridge = []
scores_RMSE_lasso = []
scores_MXE_linear = []
scores_MXE_ridge = []
scores_MXE_lasso = []


for i in nr_feats:
    
    X = X_HPscore.iloc[:,:i].copy()
    y = y_HPscore.copy()
    
    linear = LinearRegression()
    ridge = Ridge()
    lasso = Lasso()
       
    scores_linear = cross_validate(linear, X, y, scoring=scoring, cv=5)
    scores_ridge = cross_validate(ridge, X, y, scoring=scoring, cv=5)
    scores_lasso = cross_validate(lasso, X, y, scoring=scoring, cv=5)
    
    scores_MAE_linear.append(-1* np.mean(scores_linear['test_neg_mean_absolute_error']))
    scores_MAE_ridge.append(-1* np.mean(scores_ridge['test_neg_mean_absolute_error']))
    scores_MAE_lasso.append(-1* np.mean(scores_lasso['test_neg_mean_absolute_error']))     
                  
    scores_RMSE_linear.append(np.mean(np.sqrt(-1 * scores_linear['test_neg_mean_squared_error'])))
    scores_RMSE_ridge.append(np.mean(np.sqrt(-1 * scores_ridge['test_neg_mean_squared_error'])))
    scores_RMSE_lasso.append(np.mean(np.sqrt(-1 * scores_lasso['test_neg_mean_squared_error'])))   
    
    scores_MXE_linear.append(np.mean(-1 * scores_linear['test_max_error']))
    scores_MXE_ridge.append(np.mean(-1 * scores_ridge['test_max_error']))
    scores_MXE_lasso.append(np.mean(-1 * scores_lasso['test_max_error']))     
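
The loop above keeps nine separate lists. A more compact variant, sketched here on synthetic data, collects the same three errors per model in a nested dict. The helper name `cv_errors_by_feature_count` and the random demo data are assumptions for illustration. As in the loop above, the scorers return negated errors, so the sign is flipped before storing.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import cross_validate


def cv_errors_by_feature_count(models, X, y, max_feats, cv=5):
    """Cross-validated MAE, RMSE and max error using the first k columns of X.

    Returns {model_name: {"MAE": [...], "RMSE": [...], "MXE": [...]}},
    with one list entry per k = 1..max_feats.
    """
    scoring = ["neg_mean_absolute_error", "neg_mean_squared_error", "max_error"]
    out = {name: {"MAE": [], "RMSE": [], "MXE": []} for name in models}
    for k in range(1, max_feats + 1):
        for name, model in models.items():
            s = cross_validate(model, X[:, :k], y, scoring=scoring, cv=cv)
            out[name]["MAE"].append(-s["test_neg_mean_absolute_error"].mean())
            # RMSE per fold, then averaged over the folds.
            out[name]["RMSE"].append(np.sqrt(-s["test_neg_mean_squared_error"]).mean())
            out[name]["MXE"].append(-s["test_max_error"].mean())
    return out


# Synthetic demo data: y depends on the first two columns only.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(60, 4))
y_demo = 2.0 * X_demo[:, 0] - X_demo[:, 1] + rng.normal(scale=0.1, size=60)

results = cv_errors_by_feature_count(
    {"linear": LinearRegression(), "ridge": Ridge()}, X_demo, y_demo, max_feats=4
)
```

With the notebook's data, `X` would be `X_HPscore.to_numpy()` and `y` would be `y_HPscore`; the column order then defines which features count as the "top k".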
    

In [None]:
fig, axs = plt.subplots(3, 1, figsize=(10,18))


fig.suptitle("MAE, RMSE and MXE vs. number of features ", fontsize=18)

axs[0].plot(nr_feats, scores_MAE_linear, label="linear")
axs[0].plot(nr_feats, scores_MAE_ridge, label="ridge")
axs[0].plot(nr_feats, scores_MAE_lasso, label="lasso")
axs[0].set_xlim(1,30)
axs[0].set_ylim(0.7,0.9)
axs[0].set_xlabel("Number of features", fontsize=14)
axs[0].set_ylabel("Mean absolute error", fontsize=14)
axs[0].legend();

axs[1].plot(nr_feats, scores_RMSE_linear, label="linear")
axs[1].plot(nr_feats, scores_RMSE_ridge, label="ridge")
axs[1].plot(nr_feats, scores_RMSE_lasso, label="lasso")
axs[1].set_xlim(1,30)
axs[1].set_ylim(0.8,1.0)
axs[1].set_xlabel("Number of features", fontsize=14)
axs[1].set_ylabel("Root mean squared error", fontsize=14)
axs[1].legend();

axs[2].plot(nr_feats, scores_MXE_linear, label="linear")
axs[2].plot(nr_feats, scores_MXE_ridge, label="ridge")
axs[2].plot(nr_feats, scores_MXE_lasso, label="lasso")
axs[2].set_xlim(1,30)
axs[2].set_ylim(1.4,1.9)
axs[2].set_xlabel("Number of features", fontsize=14)
axs[2].set_ylabel("Maximum absolute error", fontsize=14)
axs[2].legend();

The plots above show that more features do not always lead to better predictions.  
For the Linear and Ridge models, MAE and RMSE are smaller when using the top 5 features than when using the top 10 features.  
For Lasso, on the other hand, the error rises only rarely, and only slightly, with each further feature included.  
In the next update we check if we can use this information to remove those features that lead to larger prediction errors.

In [None]:
zipped_scores =  list(zip(scores_MAE_linear, scores_RMSE_linear)) 
df_scores = pd.DataFrame(zipped_scores, columns = ['scores_MAE_linear' , 'scores_RMSE_linear'], index = nr_feats) 
df_scores['features'] = cols_65idx_sorted[:30]
df_scores['change_MAE']  = df_scores['scores_MAE_linear'].diff()
df_scores['change_RMSE'] = df_scores['scores_RMSE_linear'].diff()

In [None]:
df_scores
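
One naive way to use the `change_MAE` column is to keep only those features whose inclusion lowered the error. The sketch below illustrates this on hypothetical numbers; the helper `features_that_help` is an assumption and not the notebook's method. The change attributed to a feature depends on the order in which features were added, so this is a rough filter at best. The first feature has no predecessor (its `diff` is NaN) and is therefore not returned.

```python
import pandas as pd


def features_that_help(features, errors):
    """Return the features whose addition reduced the error.

    `errors[i]` is the error of the model using features[0..i].
    """
    change = pd.Series(errors, index=features).diff()
    # NaN < 0 is False, so the first feature is excluded here.
    return change[change < 0].index.tolist()


# Hypothetical errors for four features added one after another.
helpful = features_that_help(["a", "b", "c", "d"], [0.9, 0.8, 0.85, 0.7])
```

In the toy example, adding `c` raises the error from 0.8 to 0.85, so only `b` and `d` are returned.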

In [None]:
import sklearn

In [None]:
# SCORERS was removed in recent scikit-learn versions; get_scorer_names() lists the same names
sorted(sklearn.metrics.get_scorer_names())