###Boston Housing

In [1]:
from sklearn import datasets
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge
import numpy as np

In [2]:
bean = datasets.load_boston()
print(bean.DESCR)

Boston House Prices dataset

Notes
------
Data Set Characteristics:  

    :Number of Instances: 506 

    :Number of Attributes: 13 numeric/categorical predictive
    
    :Median Value (attribute 14) is usually the target

    :Attribute Information (in order):
        - CRIM     per capita crime rate by town
        - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
        - INDUS    proportion of non-retail business acres per town
        - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
        - NOX      nitric oxides concentration (parts per 10 million)
        - RM       average number of rooms per dwelling
        - AGE      proportion of owner-occupied units built prior to 1940
        - DIS      weighted distances to five Boston employment centres
        - RAD      index of accessibility to radial highways
        - TAX      full-value property-tax rate per $10,000
        - PTRATIO  pupil-teacher ratio by town
      

In [3]:
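def load_boston():
    # Note: this helper shares its name with datasets.load_boston, which it wraps.
    scaler = StandardScaler()
    boston = datasets.load_boston()
    X = boston.data    # 13 predictive attributes
    y = boston.target  # median home value
    # Standardize each feature to zero mean and unit variance.
    # The scaler is fit on the full data set, before the split.
    X = scaler.fit_transform(X)
    # Default split: 75% train, 25% test, shuffled with no fixed seed.
    return train_test_split(X, y)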
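
To make the scaling step concrete, here is a minimal NumPy sketch of the standardization that StandardScaler applies: subtract each column's mean, then divide by its standard deviation. The array X_demo is made up for illustration and is not part of the Boston data.

```python
import numpy as np

# Hypothetical 3x2 feature matrix (assumption, for illustration only).
X_demo = np.array([[1.0, 10.0],
                   [2.0, 20.0],
                   [3.0, 60.0]])

col_mean = X_demo.mean(axis=0)   # per-column mean
col_std = X_demo.std(axis=0)     # population std (ddof=0), as StandardScaler uses

# After scaling, every column has mean 0 and standard deviation 1.
X_scaled = (X_demo - col_mean) / col_std
```

A common variation is to compute col_mean and col_std on the training rows only and reuse them on the test rows, so that no information from the holdout set reaches the model.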

In [4]:
X_train, X_test, y_train, y_test = load_boston()

In [5]:
X_train.shape

(379L, 13L)

###Fitting a Linear Regression

Instantiate a regression object (line 1), then pass it the training data (line 2) by calling .fit(X, y), where X holds the independent variables and y the dependent variable.

In [6]:
clf = LinearRegression()
clf.fit(X_train, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)
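
As a rough illustration of what .fit estimates, ordinary least squares can be sketched with NumPy alone. The synthetic data and the names X_demo, y_demo and w_true are assumptions made for this example; this is not a description of scikit-learn's internals.

```python
import numpy as np

rng = np.random.RandomState(0)
X_demo = rng.normal(size=(50, 3))     # synthetic features (assumption)
w_true = np.array([2.0, -1.0, 0.5])   # hypothetical true coefficients
y_demo = X_demo.dot(w_true) + 4.0     # noiseless target with intercept 4.0

# Append a column of ones so the intercept is estimated alongside the weights.
A = np.hstack([X_demo, np.ones((X_demo.shape[0], 1))])

# Least-squares solution of A @ solution = y_demo.
solution = np.linalg.lstsq(A, y_demo, rcond=None)[0]
coef, intercept = solution[:-1], solution[-1]
```

Because the target here has no noise, coef and intercept recover w_true and 4.0 up to floating-point error. On real data the fit only minimizes the squared error; it does not reproduce the targets exactly.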

###Making a Prediction

X_test is our holdout set. We know the answers (y_test), but the model did not see them during fitting.

The command below builds one tuple per observation, pairing the real value (y_test) with the value our regressor predicts (clf.predict(X_test)).

In [7]:
list(zip(y_test, clf.predict(X_test)))

[(9.5999999999999996, 14.448052193143667),
 (6.2999999999999998, 10.849810681042914),
 (19.399999999999999, 17.292326347204671),
 (19.300000000000001, 22.015830004728453),
 (17.899999999999999, 1.3462553537497008),
 (8.4000000000000004, 4.7836270222966775),
 (24.100000000000001, 20.569002980546216),
 (21.699999999999999, 24.44864081551626),
 (10.5, 5.8996721819325373),
 (21.0, 22.879432117924374),
 (20.0, 20.414869086418314),
 (21.699999999999999, 22.979057301007998),
 (14.9, 15.330215979387621),
 (22.5, 28.879201222800489),
 (24.5, 20.300578052363289),
 (32.700000000000003, 30.716383023669653),
 (8.3000000000000007, 9.7736980950832226),
 (24.699999999999999, 22.666524403190135),
 (22.0, 27.508893013760478),
 (21.0, 21.373444958444349),
 (21.699999999999999, 21.040331604785191),
 (14.5, 13.875810160122645),
 (35.100000000000001, 35.429180667306831),
 (26.600000000000001, 22.187604308455914),
 (16.5, 22.481552109035388),
 (31.600000000000001, 32.873182573981552),
 (23.899999999999999, 2

###Mean Squared Error

Measure performance on the holdout set using mean squared error (MSE); lower is better.

In [8]:
mean_squared_error(y_test, clf.predict(X_test))

19.982177487562005
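
For reference, MSE is the average of the squared differences between actual and predicted values. The helper mse and the small arrays below are illustrative assumptions, written only to show the arithmetic.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean of the squared differences between actual and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean((y_true - y_pred) ** 2)

# Errors are 1, 0 and -2, so the squared errors are 1, 0 and 4.
example_mse = mse([3.0, 5.0, 7.0], [2.0, 5.0, 9.0])   # (1 + 0 + 4) / 3
```

Because the errors are squared, MSE is expressed in squared units of the target, and large misses weigh more heavily than small ones.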

###R2
Measure performance on the holdout set using R2 (the coefficient of determination); closer to 1 is better.

In [9]:
r2_score(y_test, clf.predict(X_test))

0.71599637378378111
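
R2 compares the model's squared error with that of a baseline that always predicts the mean of the actual values. The helper r2 and the sample values below are assumptions for illustration.

```python
import numpy as np

def r2(y_true, y_pred):
    """1 minus (residual sum of squares / total sum of squares)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # error left by the model
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # error of the mean baseline
    return 1.0 - ss_res / ss_tot

# ss_res = 1 + 0 + 4 = 5 and ss_tot = 4 + 0 + 4 = 8, so R2 = 1 - 5/8.
example_r2 = r2([3.0, 5.0, 7.0], [2.0, 5.0, 9.0])

# Always predicting the mean gives exactly 0.
baseline_r2 = r2([3.0, 5.0, 7.0], [5.0, 5.0, 5.0])
```

Read this way, a score of about 0.72 means the model removes roughly 72% of the squared error that the mean baseline leaves on the holdout set.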

###L2 Regularization

In [10]:
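np.random.seed(0)
# Ridge is linear regression with an L2 penalty; alpha sets the penalty strength.
clf = Ridge(alpha=1.0)
clf.fit(X_train, y_train)
# This second Ridge is built but never fitted or assigned; as the cell's last expression, its repr is the output displayed below.
Ridge(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=None, normalize=False, solver='auto', tol=0.009)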

Ridge(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=None,
   normalize=False, solver='auto', tol=0.009)
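
To show what the L2 penalty does, here is a minimal closed-form sketch of ridge regression in NumPy. It assumes the intercept is left unpenalized (handled by centering the data), and the function ridge_closed_form and the synthetic data are hypothetical names introduced for this example; scikit-learn may use a different solver internally.

```python
import numpy as np

def ridge_closed_form(X, y, alpha=1.0):
    """Minimize ||y - Xw - b||^2 + alpha * ||w||^2 with b unpenalized."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean   # centering removes the intercept
    n_features = X.shape[1]
    # Solve (Xc^T Xc + alpha * I) w = Xc^T yc.
    coef = np.linalg.solve(Xc.T.dot(Xc) + alpha * np.eye(n_features),
                           Xc.T.dot(yc))
    intercept = y_mean - x_mean.dot(coef)
    return coef, intercept

rng = np.random.RandomState(0)
X_demo = rng.normal(size=(40, 3))                         # synthetic features
y_demo = X_demo.dot(np.array([1.5, -2.0, 0.0])) + 3.0     # noiseless target

coef_ols, _ = ridge_closed_form(X_demo, y_demo, alpha=0.0)     # no penalty
coef_ridge, _ = ridge_closed_form(X_demo, y_demo, alpha=10.0)  # shrunk weights
```

Setting alpha to 0 reduces this to ordinary least squares, and larger values shrink the coefficient vector toward zero. When alpha is small relative to the entries of Xc^T Xc, the penalty changes the solution very little, which is consistent with the Ridge and LinearRegression scores in this notebook being nearly identical.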

In [11]:
r2_score(y_test,clf.predict(X_test))

0.71607084536404231

In [12]:
mean_squared_error(y_test, clf.predict(X_test))

19.976937750469968