Multiple linear regression and some feature selection on World Happiness data

In [None]:
#importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

#importing the dataset
dataset = pd.read_csv('../input/2016.csv')

#Splitting up the dataset
x = dataset.iloc[:, 6:].values #take all the data from col 6 onwards
y = dataset.iloc[:, 2].values #Happiness rank
GDP = dataset.iloc[:, 6].values 
Family = dataset.iloc[:, 7].values 
Health = dataset.iloc[:, 8].values
Freedom = dataset.iloc[:, 9].values
Corruption = dataset.iloc[:, 10].values
Generosity = dataset.iloc[:, 11].values
Dystopia = dataset.iloc[:, 12].values

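#splitting data to train/test sets
#(sklearn.cross_validation was removed in scikit-learn 0.20; train_test_split now lives in model_selection)
from sklearn.model_selection import train_test_split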
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.2, random_state = 20)

In [None]:
#Fitting multiple linear regression to the training set
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(x_train, y_train)

#Y_pred for multi-linear
y_pred = regressor.predict(x_test)
y_pred

In [None]:
y_test

Comparing y_pred with y_test, the predictions look reasonably close to the actual ranks. Now let's look at how each feature of x relates to happiness rank, one by one.
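Before moving on to the individual features, the accuracy claim above can be put into numbers instead of being judged by eye. The following is a minimal sketch on synthetic stand-in data (the arrays `x_demo`, `y_demo` and the coefficients are made up for illustration, not taken from the happiness dataset); in the notebook you would pass `y_test` and `y_pred` to the same two metric functions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the happiness data (illustrative values only).
rng = np.random.default_rng(0)
x_demo = rng.random((157, 7))
true_coefs = np.array([3.0, 2.0, 1.5, 1.0, 0.5, 0.2, 2.5])
y_demo = x_demo @ true_coefs + rng.normal(scale=0.1, size=157)

# Same split settings as the notebook.
x_tr, x_te, y_tr, y_te = train_test_split(
    x_demo, y_demo, test_size=0.2, random_state=20
)
model = LinearRegression().fit(x_tr, y_tr)
y_hat = model.predict(x_te)

# R^2: share of variance explained; MAE: average absolute error in target units.
r2 = r2_score(y_te, y_hat)
mae = mean_absolute_error(y_te, y_hat)
print(f"R^2 = {r2:.3f}, MAE = {mae:.3f}")
```

Because the target here is a rank, the MAE reads directly as "how many places off, on average", which is easier to interpret than comparing two printed arrays.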

In [None]:
plt.scatter(GDP, y)
plt.show()


In [None]:
plt.scatter(Family, y)
plt.show()

In [None]:
plt.scatter(Health, y)
plt.show()

In [None]:
plt.scatter(Freedom, y)
plt.show()

In [None]:
plt.scatter(Corruption, y)
plt.show()

In [None]:
plt.scatter(Generosity, y)
plt.show()

In [None]:
plt.scatter(Dystopia, y)
plt.show()

The above graphs show that while most features of x correlate with happiness rank in some way, some correlate much more strongly than others. GDP and Health appear to have a strong correlation with the least scatter, whilst Freedom and Family also show a clear trend but with much more scatter. Corruption behaves differently from the features above: it seems to follow a non-linear trend, something like an exponential curve. A straight line would fit this feature poorly and leave a large sum of squared errors. Interestingly, Generosity seems to have no trend; whether happiness is low or high, there is a large variance in generosity. Below we will try some methods for feature selection.
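The visual impressions above can also be checked with correlation coefficients. The sketch below uses made-up columns (`linear_tight`, `linear_noisy`, `curved`, `unrelated` are hypothetical names, not columns of the happiness data) to show how Pearson correlation captures straight-line trends, while a rank-based (Spearman) correlation also picks up a monotonic but curved trend like the one described for Corruption.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
happiness_rank = np.arange(1, 158)  # hypothetical ranks 1..157

# Four illustrative feature shapes (synthetic, not the real data).
demo = pd.DataFrame({
    "happiness_rank": happiness_rank,
    "linear_tight": -0.01 * happiness_rank + rng.normal(scale=0.05, size=157),
    "linear_noisy": -0.01 * happiness_rank + rng.normal(scale=0.5, size=157),
    "curved": np.exp(-happiness_rank / 30.0),
    "unrelated": rng.normal(size=157),
})

# Pearson measures linear association with the target column.
pearson = demo.corr(method="pearson")["happiness_rank"].drop("happiness_rank")

# Spearman is Pearson computed on ranks, so it measures monotonic association.
spearman = (
    demo.rank().corr(method="pearson")["happiness_rank"].drop("happiness_rank")
)

print(pd.DataFrame({"pearson": pearson, "spearman": spearman}).round(2))
```

On the real data, the same two lines applied to a DataFrame of the feature columns plus the rank column would give one number per scatter plot.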

In [None]:
#Firstly we will try a stepwise regression method known as backward elimination
#(OLS is exposed by statsmodels.api; recent versions no longer expose it via statsmodels.formula.api)
import statsmodels.api as sm

#We need a constant x0 = 1 for this to work
x = np.append(arr = np.ones((x.shape[0], 1)).astype(int), values = x, axis = 1)

#backward elimination: we will set p = 0.05 as our threshold for elimination
x_opt = x[:, [0, 1, 2, 3, 4, 5, 6, 7]]   
regressor_OLS = sm.OLS(endog = y, exog = x_opt).fit()
regressor_OLS.summary()


In stepwise regression we first choose a p-value to use as the threshold for eliminating features. For this example we will use p = 0.05. In the above statistical summary you can see that x7 (Dystopia) has a high p-value of 0.758. At each step we eliminate the feature with the highest p-value above our threshold.

In [None]:
x_opt = x[:, [0, 1, 2, 3, 4, 5, 6]]  #x7 has been removed from our optimal x
regressor_OLS = sm.OLS(endog = y, exog = x_opt).fit()
regressor_OLS.summary()

In our 2nd step we can see that the p-value of x6 (Generosity) is 0.089, just above our threshold, so we will eliminate it. It should be said that when a p-value is this close to the threshold, removing the feature may negatively impact your model.

In [None]:
x_opt = x[:, [0, 1, 2, 3, 4, 5]]   #x6 has now been removed
regressor_OLS = sm.OLS(endog = y, exog = x_opt).fit()
regressor_OLS.summary()

Now nothing is above the threshold, so these are the features we will keep.

In [None]:
#Testing on a new y_pred
#Splitting into train/test set
x_train2, x_test2, y_train2, y_test2 = train_test_split(x_opt, y, test_size = 0.2, random_state = 20)

#Fitting multiple linear regression to the training set
from sklearn.linear_model import LinearRegression
regressor2 = LinearRegression()
regressor2.fit(x_train2, y_train2)

y_pred2 = regressor2.predict(x_test2)
y_pred2


In [None]:
y_test2

Now y_pred2 is actually performing worse than y_pred, which used all the features of x. The likely cause is that the eliminated features carried information that was valuable to the model. There is always a trade-off in data science between keeping all of your information and keeping computation cheap.
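A single 80/20 split of 157 rows leaves only about 32 test rows, so a difference between two models can partly reflect which rows landed in the test set. One way to check is k-fold cross-validation. The sketch below uses synthetic data in which every column contributes to the target (the coefficients and the helper `mean_cv_r2` are made up for illustration) and compares the full feature set with a reduced one.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(3)
x_full = rng.normal(size=(157, 7))

# Every column contributes to the target, the last two only weakly.
coefs = np.array([3.0, 2.0, 1.5, 1.0, 1.0, 0.5, 0.5])
y_demo = x_full @ coefs + rng.normal(scale=0.5, size=157)

x_reduced = x_full[:, :5]  # drop the two weak columns


def mean_cv_r2(x, y):
    """Average R^2 over 5 cross-validation folds."""
    scores = cross_val_score(LinearRegression(), x, y, cv=5, scoring="r2")
    return scores.mean()


r2_full = mean_cv_r2(x_full, y_demo)
r2_reduced = mean_cv_r2(x_reduced, y_demo)
print(f"full: {r2_full:.3f}  reduced: {r2_reduced:.3f}")
```

Applied to the notebook, the two calls would take `(x, y)` and `(x_opt, y)`; a gap that persists across folds is better evidence of lost information than one split.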

In [None]:
#Now we will try another method: Univariate feature selection
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
#note: x now includes the constant column added for backward elimination
x_new = SelectKBest(chi2, k=5).fit_transform(x, y)
x.shape

In [None]:
x_new.shape

The above code reduces x from 8 columns (the 7 features plus the constant column added earlier) to the 5 highest-scoring ones. Below we will test this with a new y_pred.

In [None]:
#train / test split
x_train3, x_test3, y_train3, y_test3 = train_test_split(x_new, y, test_size = 0.2, random_state = 20)

#Multilinear regression
from sklearn.linear_model import LinearRegression
regressor3 = LinearRegression()
regressor3.fit(x_train3, y_train3)

#y_pred
y_pred3 = regressor3.predict(x_test3)
y_pred3

In [None]:
y_test3

Judging by eye from y_pred3 against y_test3, univariate feature selection seems to perform a lot better than backward elimination in this case.
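One caveat on the scoring function: chi2 is intended for non-negative features and a categorical target, so with a numeric target such as rank it treats every distinct rank as its own class. scikit-learn also provides f_regression, which scores each feature by its linear relationship with a numeric target. The sketch below swaps in f_regression on synthetic data (the column names `f0`..`f6` and all values are made up) and uses get_support to show which columns were kept, something the notebook's fit_transform call does not reveal.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression

rng = np.random.default_rng(4)
names = np.array(["f0", "f1", "f2", "f3", "f4", "f5", "f6"])  # hypothetical names
x_demo = rng.random((157, 7))

# Made-up numeric target built from the first five columns plus noise.
y_demo = x_demo[:, :5] @ np.full(5, 4.0) + rng.normal(scale=0.3, size=157)

selector = SelectKBest(score_func=f_regression, k=5)
x_selected = selector.fit_transform(x_demo, y_demo)

mask = selector.get_support()  # boolean mask over the original columns
print(x_selected.shape)
print(names[mask])
```

Passing the feature columns without the constant column avoids scoring a column that cannot carry any information.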