## Session 1.2 Regression (Multivariate, Boston)

In [None]:
%pylab inline

import numpy as np
import scipy as sp
import matplotlib.pyplot as plt
import sklearn
import IPython
import platform

from sklearn import preprocessing

print ('Python version:', platform.python_version())
print ('IPython version:', IPython.__version__)
print ('numpy version:', np.__version__)
print ('scikit-learn version:', sklearn.__version__)
print ('matplotlib version:', matplotlib.__version__)

## Multivariate linear regression

To demonstrate multivariate regression in scikit-learn, we will apply it to a (very) simple and well-known problem: predicting the price of a house given some of its characteristics. As the dataset, we will use the 1978 Boston house price dataset (find the dataset description and attributes [here](http://www.cs.toronto.edu/~delve/data/boston/bostonDetail.html)).

In [None]:
# Note: load_boston was removed in scikit-learn 1.2, so this cell needs an older release
from sklearn.datasets import load_boston
from sklearn import preprocessing
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.model_selection import train_test_split
boston = load_boston()

print(type(boston))

print ('Boston dataset shape:{}'.format(boston.data.shape))
print ('Boston target shape:{}'.format(boston.target.shape))
print (boston.feature_names)

manual_split = True
if manual_split:
    np.random.seed(666)
    test_split = 0.2
    test_n = int(test_split*len(boston.data))
    # Sample test indices without replacement so that no row appears in the test set twice
    test_indices = np.random.choice(len(boston.data), size=test_n, replace=False)
    train_indices = sorted(set(range(len(boston.data))) - set(test_indices))

    X_train_boston_raw=boston.data[train_indices]
    X_test_boston_raw=boston.data[test_indices]
    y_train_boston=boston.target[train_indices]
    y_test_boston=boston.target[test_indices]
else:
    X_train_boston_raw, X_test_boston_raw, y_train_boston, y_test_boston = train_test_split(boston.data, boston.target, test_size=0.2, random_state=666)


numpy.set_printoptions(precision=4)
print("features:\n", X_train_boston_raw[0:3,:])

x_scaler = StandardScaler()

#Create our scaled train and test datasets
X_train_boston = x_scaler.fit_transform(X_train_boston_raw) # preprocessing.scale(X_train_boston_raw) # shortcut to preprocessing.StandardScaler()
X_test_boston = x_scaler.transform(X_test_boston_raw)

print("Scaled features:\n", X_train_boston[0:3,:])
print("prices:\n", y_train_boston[0:3])


## Training using n-fold cross-validation

Previously we trained on a dataset split once into train and test subsets.  Another way to split your data is cross-validation.

One of the main advantages of cross-validation is that it reduces the variance of the evaluation measures.  When you split the data manually, your algorithm's performance varies from one split to the next.  So which score is the right one?

Evaluation in machine learning generally assumes that the training and testing sets follow similar distributions (of target values here, or of classes in classification). If they do not, the results may not be a truthful measure of the model's performance. Cross-validation mitigates this: we average over k different models, each trained and validated on a different partition of the data, which reduces the variance and usually gives more realistic performance scores.

Another benefit of cross-validation is that it makes good use of the data we have available: each example serves as a training datapoint in some folds and as a validation datapoint in exactly one.
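
As a rough illustration of the variance argument above, the sketch below compares scores from several single train/test splits with the scores from one 5-fold cross-validation. It uses synthetic data from `make_regression` and a plain `LinearRegression` as stand-ins; both are assumptions made for the example, not the Boston data or the estimator used later in this session.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score, train_test_split

# Synthetic stand-in data (an assumption for illustration, not the Boston data)
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=20.0, random_state=0)

# Scores from ten different single train/test splits: these vary with the split
single_split_scores = []
for seed in range(10):
    X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=seed)
    single_split_scores.append(LinearRegression().fit(X_tr, y_tr).score(X_te, y_te))

# 5-fold cross-validation: every sample is used for validation exactly once
cv_demo = KFold(n_splits=5, shuffle=True, random_state=0)
cv_scores = cross_val_score(LinearRegression(), X_demo, y_demo, cv=cv_demo)

print('single-split scores: min={:.3f} max={:.3f}'.format(min(single_split_scores), max(single_split_scores)))
print('5-fold CV scores:', np.round(cv_scores, 3))
print('CV mean: {:.3f}'.format(cv_scores.mean()))
```

The single-split scores show how much one number can move with the choice of split, while the cross-validation mean summarises five such evaluations in one figure.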

In [None]:
def train_and_evaluate(_clf, X_train, y_train, n_folds):
    _clf.fit(X_train, y_train)
    print ('Score on training set: {:.2f}'.format(_clf.score(X_train, y_train)))
    #create a k-fold cross validation iterator with n_folds folds
    cv = sklearn.model_selection.KFold(n_splits=n_folds, shuffle=True, random_state=42)
    scores = sklearn.model_selection.cross_val_score(_clf, X_train, y_train, cv=cv)
    print ('Average score using {}-fold cross-validation: {:.2f}'.format(n_folds,np.mean(scores)))
    return _clf

For classification, we used accuracy, the proportion of correctly classified test-instances, to summarise our method’s performance.

For regression, accuracy is a bad idea: we are predicting real values, so it's almost impossible to exactly predict the true value.

Instead, the default score function for regressors in scikit-learn is the coefficient of determination (or $R^2$ score), which measures the proportion of outcome variation explained by the model. $R^2 \le 1$: it reaches 1 when the model perfectly predicts all the target values, is 0 for a model that always predicts the mean of the targets, and can be negative for a model that does worse than that.
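
To make the definition concrete, here is a small sketch that computes $R^2 = 1 - SS_{res}/SS_{tot}$ by hand and checks it against `sklearn.metrics.r2_score`. The helper `r2_manual` and the toy arrays are made up for this illustration. Note that the score is bounded above by 1 but has no lower bound: predictions worse than the mean give a negative value.

```python
import numpy as np
from sklearn.metrics import r2_score

def r2_manual(y_true, y_pred):
    """Coefficient of determination: 1 - (residual sum of squares / total sum of squares)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# Toy targets and three hypothetical sets of predictions
y_true_demo = np.array([10.0, 20.0, 30.0, 40.0])
perfect = y_true_demo.copy()                                # exact predictions
mean_only = np.full_like(y_true_demo, y_true_demo.mean())   # always predict the mean
reversed_pred = y_true_demo[::-1]                           # worse than predicting the mean

for name, pred in [('perfect', perfect), ('mean only', mean_only), ('reversed', reversed_pred)]:
    print(name, r2_manual(y_true_demo, pred), r2_score(y_true_demo, pred))
```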

In [None]:
from sklearn import linear_model

#Use a Stochastic Gradient Descent Regressor - this is a general purpose linear regressor good for large datasets
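#Note: loss='squared_loss' was renamed to 'squared_error' in scikit-learn 1.0; max_iter is an iteration count, so pass an int rather than the float 10e5
clf_sgd = linear_model.SGDRegressor(loss='squared_loss', penalty=None, random_state=42, max_iter=1000000, tol=1e-4)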
train_and_evaluate(clf_sgd, X_train_boston, y_train_boston, 5)

#print the hyperplane coefficients and their sum-of-squares

print(clf_sgd.coef_)
print(np.sum(np.square(clf_sgd.coef_)))

In [None]:
print(clf_sgd.score(X_test_boston, y_test_boston))
y_hats = clf_sgd.predict(X_test_boston)
for y,yh in zip(y_test_boston, y_hats):
    print(y,yh)

Create a correlation matrix to help us pick out the most relevant factors.  We want those with the biggest (negative or positive) correlation with median value, MEDV.

In [None]:
correlation_matrix = numpy.corrcoef(X_train_boston.T, y_train_boston)
fig, ax = plt.subplots()
im = ax.imshow(correlation_matrix)

# We want to show all ticks...
ax.set_xticks(np.arange(len(boston.data.T)+1))
ax.set_yticks(np.arange(len(boston.data.T)+1))
# ... and label them with the respective list entries
ax.set_xticklabels(list(boston.feature_names)+["MEDV"])
ax.set_yticklabels(list(boston.feature_names)+["MEDV"])



# Rotate the tick labels and set their alignment.
plt.setp(ax.get_xticklabels(), rotation=45, ha="right",
         rotation_mode="anchor")

# Loop over data dimensions and create text annotations.
for i in range(len(boston.feature_names)+1):
    for j in range(len(boston.feature_names)+1):
        text = ax.text(j, i, round(correlation_matrix[i, j],1),
                       ha="center", va="center", color="w")

ax.set_title("Correlation matrix of Boston house price variables")
# fig.tight_layout()
fig.set_size_inches(6,6)
plt.show()


It looks like LSTAT and RM are the most relevant.  Scikit-learn lets us automatically extract the K best features, here ranked by a univariate F-test of each feature against the target (`f_regression`)...
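
The cell below uses `SelectKBest` with `f_regression`. As a sketch of how that relates to the correlation matrix: the F-statistic of a single feature grows monotonically with its absolute Pearson correlation with the target, so the two rankings agree. The example uses synthetic data from `make_regression` (an assumption, not the Boston data), and the names `X_fs`, `y_fs` and `selector` are made up for the illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression

# Synthetic stand-in data (assumption): 6 features, only 3 of them informative
X_fs, y_fs = make_regression(n_samples=300, n_features=6, n_informative=3,
                             noise=10.0, random_state=0)

# Univariate F-statistic (and p-value) of each feature against the target
f_scores, p_values = f_regression(X_fs, y_fs)

# Absolute Pearson correlation of each feature with the target
abs_corr = np.abs([np.corrcoef(X_fs[:, j], y_fs)[0, 1] for j in range(X_fs.shape[1])])

# Both orderings agree, because F increases monotonically with |r|
print('ranking by F-score:', np.argsort(f_scores)[::-1])
print('ranking by |corr| :', np.argsort(abs_corr)[::-1])

# SelectKBest keeps the k columns with the highest scores
selector = SelectKBest(score_func=f_regression, k=3).fit(X_fs, y_fs)
print('selected columns  :', selector.get_support(indices=True))
```

One consequence is that this kind of selection looks at each feature in isolation, so it cannot account for redundancy between strongly correlated features.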

In [None]:
from sklearn.feature_selection import SelectKBest, f_regression
k=5

fs=SelectKBest(score_func=f_regression,k=k)
X_new=fs.fit_transform(X_train_boston,y_train_boston)
# zip() is lazy in Python 3, so materialise it before printing
print (list(zip(fs.get_support(),boston.feature_names)))

# Shared x-range across all selected features, so no subplot clips its points
x_min, x_max = X_new.min() - .5, X_new.max() + .5
y_min, y_max = y_train_boston.min() - .5, y_train_boston.max() + .5
#fig=plt.figure()

# k subplots side by side, one per selected feature
fig, axes = plt.subplots(1,k)
# plt.tight_layout()
fig.subplots_adjust(left=0, right=1, bottom=0, top=1, hspace=0.05, wspace=0.1)
fig.set_size_inches(12,6)
fig.tight_layout()

cols = fs.get_support(indices=True)
print(cols)

for i in range(k):
    axes[i].set_aspect('auto')
#     axes[i].set_aspect('equal')
    axes[i].set_title('Feature ' + boston.feature_names[cols[i]])
    axes[i].set_xlabel('Feature')
    axes[i].set_ylabel('Median house value')
    axes[i].set_xlim(x_min, x_max)
    axes[i].set_ylim(y_min, y_max)
    axes[i].scatter(X_new[:,i], y_train_boston, alpha=0.2)
    
X_test_new = X_test_boston[:,cols]

As before, the scores reported below are _coefficients of determination_ ($R^2$ scores), which measure the proportion of outcome variation explained by the model. $R^2$ is at most 1, reached when the model perfectly predicts all the target values, and can be negative for a model that fits worse than always predicting the mean.

In [None]:
from sklearn import linear_model

#Re-train the same SGD regressor, this time on the k selected features only
train_and_evaluate(clf_sgd, X_new, y_train_boston, 5)

#print the hyperplane coefficients and their sum-of-squares
print(clf_sgd.coef_)
print(np.sum(np.square(clf_sgd.coef_)))

print(X_test_new.shape)
clf_sgd.fit(X_new, y_train_boston)
print(clf_sgd.score(X_test_new, y_test_boston))
y_hats = clf_sgd.predict(X_test_new)
for y,yh in zip(y_test_boston, y_hats):
    print(y,yh)

## Extra
Use a _non-linear_ regressor such as sklearn's SVR, and cross-validate it on the Boston data.  Is it better?  If so, why might this be?  What about on the test dataset?

In [None]:
from sklearn.svm import SVR
clf_svr = SVR()
#...

## Summary
- We tried out multivariate regression on the Boston house price dataset, using k-fold cross-validation to evaluate our estimators
- We tried some feature selection using a correlation matrix and SelectKBest