#### Trisha Moyer
#### Spring 2017


## Boston Housing Assignment

In this assignment you'll use linear regression to estimate house prices in Boston, using a well-known dataset.

Goals:
+  Measure the performance of the model I created using $R^{2}$ and MSE
> Learn how to use sklearn.metrics.r2_score and sklearn.metrics.mean_squared_error
+  Implement a new model using regularization
> Use sklearn.linear_model.Ridge (L2 penalty) or sklearn.linear_model.Lasso (L1 penalty)
+  Get the best model you can by optimizing the regularization parameter.
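
For reference, the two estimators named in the goals differ in the penalty they add to the least-squares loss. As a sketch of the objectives described in the scikit-learn documentation (intercept omitted, $\alpha$ the regularization parameter, $w$ the coefficient vector, $n$ the number of training samples):

$$
\text{Ridge (L2):}\quad \min_{w} \; \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_2^2
$$

$$
\text{Lasso (L1):}\quad \min_{w} \; \frac{1}{2n} \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_1
$$

Because the two losses are scaled differently, the same numeric value of $\alpha$ does not imply the same strength of regularization in both models.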

In [1]:
from sklearn import datasets
import pandas as pd
import math
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in scikit-learn 0.20
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso

In [2]:
bean = datasets.load_boston()
print(bean.DESCR)

Boston House Prices dataset

Notes
------
Data Set Characteristics:  

    :Number of Instances: 506 

    :Number of Attributes: 13 numeric/categorical predictive
    
    :Median Value (attribute 14) is usually the target

    :Attribute Information (in order):
        - CRIM     per capita crime rate by town
        - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
        - INDUS    proportion of non-retail business acres per town
        - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
        - NOX      nitric oxides concentration (parts per 10 million)
        - RM       average number of rooms per dwelling
        - AGE      proportion of owner-occupied units built prior to 1940
        - DIS      weighted distances to five Boston employment centres
        - RAD      index of accessibility to radial highways
        - TAX      full-value property-tax rate per $10,000
        - PTRATIO  pupil-teacher ratio by town
      

In [3]:
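def load_boston():
    # Load the data, standardize every feature, then split into train/test sets.
    boston = datasets.load_boston()
    X = boston.data
    y = boston.target
    # Note: the scaler is fit on all rows, before the split.
    X = StandardScaler().fit_transform(X)
    # Default split holds out 25% of the rows; no random_state is set,
    # so the split (and every score below) changes between runs.
    return train_test_split(X, y)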
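
The loader above standardizes all rows before splitting, so the held-out rows contribute to the scaling statistics. A common alternative is to compute those statistics on the training rows only. The following is a minimal standard-library sketch of that idea, not the notebook's method; the helper name `standardize_train_test` and the toy data are hypothetical.

```python
import random
import statistics


def standardize_train_test(train_rows, test_rows):
    """Scale each column using the mean and std of the training rows only."""
    columns = list(zip(*train_rows))
    means = [statistics.mean(col) for col in columns]
    # Population standard deviation; a zero spread falls back to 1.0
    # so constant columns do not cause a division by zero.
    stds = [statistics.pstdev(col) or 1.0 for col in columns]

    def scale(rows):
        return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]

    return scale(train_rows), scale(test_rows)


# Hypothetical toy data: 8 rows, 2 features on very different scales.
random.seed(0)
rows = [[random.uniform(0, 10), random.uniform(100, 200)] for _ in range(8)]
train_rows, test_rows = rows[:6], rows[6:]
train_scaled, test_scaled = standardize_train_test(train_rows, test_rows)
```

With scikit-learn, the same pattern is usually written as `scaler.fit_transform(X_train)` followed by `scaler.transform(X_test)`.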
    

In [4]:
X_train, X_test, y_train, y_test = load_boston()

In [5]:
X_train.shape

(379, 13)

### Fitting a Linear Regression

Fitting takes two steps: instantiate a new regression object (line 1), then give it your training data
(line 2) by calling .fit(independent variables, dependent variable).



In [6]:
clf = LinearRegression()
clf.fit(X_train, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

### Making a Prediction
X_test is our holdout set of data. We know the answer (y_test), but the model does not.

Using the command below, I create a tuple for each observation, combining the real value (y_test) with
the value our regressor predicts (clf.predict(X_test)).

Use a similar format to get your r2 and mse metrics working. Use the [scikit-learn API](http://scikit-learn.org/stable/modules/model_evaluation.html) if you need help!
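
To make the (real, predicted) pairing concrete, here is a small standard-library sketch that unpacks such pairs and computes MSE and RMSE by hand. The four pairs are rounded from the first few shown in this notebook's output and are for illustration only; `mean_squared_error_by_hand` is a hypothetical helper, not a scikit-learn function.

```python
import math

# Illustrative (real, predicted) pairs, rounded from this notebook's output.
pairs = [(17.2, 14.2), (12.0, 11.1), (50.0, 44.5), (15.2, 19.4)]

# zip(*pairs) transposes the list of pairs into two tuples.
real, predicted = zip(*pairs)


def mean_squared_error_by_hand(y_true, y_pred):
    """Average of the squared differences between true and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


mse = mean_squared_error_by_hand(real, predicted)
rmse = math.sqrt(mse)  # same units as the target
print("MSE:", mse)
print("RMSE:", rmse)
```

The scikit-learn metric functions accept the two sequences directly, so building the tuples is optional when only the scores are needed.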

In [7]:
observation_tuple = list(zip(y_test, clf.predict(X_test)))

In [8]:
print(observation_tuple)

[(17.199999999999999, 14.217361229501785), (12.0, 11.135235472070571), (50.0, 44.500980722736294), (15.199999999999999, 19.371910309810538), (26.399999999999999, 22.337873283048978), (15.6, 11.343333758684565), (9.5, 13.927741568730188), (20.399999999999999, 22.594062208958064), (11.699999999999999, 15.741268037515065), (19.0, 20.74286797728346), (23.699999999999999, 10.624910807729032), (33.0, 23.101956774322026), (22.199999999999999, 23.654063394829436), (21.699999999999999, 24.056653960431298), (19.0, 15.288601267825676), (21.199999999999999, 23.417916126011168), (23.199999999999999, 25.528420891213646), (22.199999999999999, 24.042594941105879), (17.399999999999999, 22.995888660612195), (19.300000000000001, 16.961588632588814), (21.899999999999999, 23.763443321893263), (22.0, 27.475695739409744), (24.399999999999999, 24.0482892611187), (23.0, 20.094005280709091), (23.899999999999999, 27.063757674409231), (10.199999999999999, 17.554494304392886), (13.800000000000001, -0.5118350332463

In [11]:
real_value=[i[0] for i in observation_tuple]
predicted_value=[i[1] for i in observation_tuple]

In [12]:
print("real_value")
print(real_value)
print("predicted_value")
print(predicted_value)

real_value
[17.199999999999999, 12.0, 50.0, 15.199999999999999, 26.399999999999999, 15.6, 9.5, 20.399999999999999, 11.699999999999999, 19.0, 23.699999999999999, 33.0, 22.199999999999999, 21.699999999999999, 19.0, 21.199999999999999, 23.199999999999999, 22.199999999999999, 17.399999999999999, 19.300000000000001, 21.899999999999999, 22.0, 24.399999999999999, 23.0, 23.899999999999999, 10.199999999999999, 13.800000000000001, 16.100000000000001, 24.800000000000001, 33.100000000000001, 19.300000000000001, 23.800000000000001, 19.399999999999999, 18.899999999999999, 24.0, 6.2999999999999998, 50.0, 30.800000000000001, 31.5, 19.399999999999999, 14.6, 23.899999999999999, 33.399999999999999, 22.0, 32.399999999999999, 24.399999999999999, 10.199999999999999, 37.600000000000001, 22.100000000000001, 27.5, 18.399999999999999, 31.600000000000001, 16.100000000000001, 18.5, 13.4, 21.100000000000001, 17.199999999999999, 28.699999999999999, 22.699999999999999, 20.399999999999999, 21.5, 21.0, 19.399999999999

### $R^{2}$
#### Coefficient of determination
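
As a reminder of what this score measures, the coefficient of determination compares the model's squared errors with those of a baseline that always predicts the mean. With $y_i$ the real values, $\hat{y}_i$ the predictions, and $\bar{y}$ the mean of the real values in the evaluated set:

$$
R^{2} = 1 - \frac{\sum_{i} (y_i - \hat{y}_i)^2}{\sum_{i} (y_i - \bar{y})^2}
$$

A value of 1 means perfect predictions, 0 means no better than predicting $\bar{y}$, and negative values are possible when the model does worse than that baseline.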

In [13]:
r2_score(real_value, predicted_value)

0.64046360029196414

### MSE
#### Mean Squared Error

In [14]:
mse = mean_squared_error(real_value, predicted_value)
print("MSE: ")
print(mse)
rmse = math.sqrt(mse)
print("RMSE: ")
print(rmse)

MSE: 
22.5800102749
RMSE: 
4.751842829356362


# sklearn.linear_model.Ridge

In [15]:
ridge = Ridge()
ridge.fit(X_train, y_train)

Ridge(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=None,
   normalize=False, random_state=None, solver='auto', tol=0.001)

In [16]:
ridge_observation_tuple = list(zip(y_test, ridge.predict(X_test)))

In [17]:
print(ridge_observation_tuple)

[(17.199999999999999, 14.332340948153323), (12.0, 11.109681976587257), (50.0, 44.472832957325259), (15.199999999999999, 19.379421281076119), (26.399999999999999, 22.281668872644804), (15.6, 11.420460103489292), (9.5, 13.922558095059451), (20.399999999999999, 22.593365643516861), (11.699999999999999, 15.676355078477764), (19.0, 20.738864725869398), (23.699999999999999, 10.641261889647335), (33.0, 23.16713807102067), (22.199999999999999, 23.643392597046951), (21.699999999999999, 24.03932505078485), (19.0, 15.261923209966863), (21.199999999999999, 23.444403795095013), (23.199999999999999, 25.517592679883407), (22.199999999999999, 24.039071060414479), (17.399999999999999, 22.964415199712128), (19.300000000000001, 16.959899890651045), (21.899999999999999, 23.806089438085028), (22.0, 27.501134703060423), (24.399999999999999, 24.076993181309479), (23.0, 20.127960384914772), (23.899999999999999, 27.048450127214174), (10.199999999999999, 17.497641656346403), (13.800000000000001, -0.516764936427

In [18]:
ridge_real_value=[i[0] for i in ridge_observation_tuple]
print("Real Value:", ridge_real_value)
ridge_predicted_value=[i[1] for i in ridge_observation_tuple]
print("Predicted value: ", ridge_predicted_value)

Real Value: [17.199999999999999, 12.0, 50.0, 15.199999999999999, 26.399999999999999, 15.6, 9.5, 20.399999999999999, 11.699999999999999, 19.0, 23.699999999999999, 33.0, 22.199999999999999, 21.699999999999999, 19.0, 21.199999999999999, 23.199999999999999, 22.199999999999999, 17.399999999999999, 19.300000000000001, 21.899999999999999, 22.0, 24.399999999999999, 23.0, 23.899999999999999, 10.199999999999999, 13.800000000000001, 16.100000000000001, 24.800000000000001, 33.100000000000001, 19.300000000000001, 23.800000000000001, 19.399999999999999, 18.899999999999999, 24.0, 6.2999999999999998, 50.0, 30.800000000000001, 31.5, 19.399999999999999, 14.6, 23.899999999999999, 33.399999999999999, 22.0, 32.399999999999999, 24.399999999999999, 10.199999999999999, 37.600000000000001, 22.100000000000001, 27.5, 18.399999999999999, 31.600000000000001, 16.100000000000001, 18.5, 13.4, 21.100000000000001, 17.199999999999999, 28.699999999999999, 22.699999999999999, 20.399999999999999, 21.5, 21.0, 19.39999999999

### $R^{2}$

In [19]:
r2_score(ridge_real_value, ridge_predicted_value)

0.64149284450196453

### MSE

In [20]:
mse = mean_squared_error(ridge_real_value, ridge_predicted_value)
print("MSE: ")
print(mse)
rmse = math.sqrt(mse)
print("RMSE: ")
print(rmse)

MSE: 
22.5153705198
RMSE: 
4.745036408688156


# sklearn.linear_model.Ridge 
## Optimizing the regularization parameter
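
The cells below try a single alternative value, `alpha=10.0`, and score it on the test set. A more systematic option is to sweep several values and compare them on a separate validation split. The sketch below illustrates that idea with a one-feature, closed-form ridge fit in plain Python. It is a simplified illustration, not scikit-learn's implementation; the helper names, the synthetic data, and the alpha grid are all assumptions.

```python
import random


def fit_ridge_1d(x, y, alpha):
    """Closed-form ridge fit for a single feature with an unpenalized intercept."""
    x_mean = sum(x) / len(x)
    y_mean = sum(y) / len(y)
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    slope = sxy / (sxx + alpha)  # a larger alpha shrinks the slope toward zero
    intercept = y_mean - slope * x_mean
    return slope, intercept


def mse(x, y, slope, intercept):
    return sum((yi - (slope * xi + intercept)) ** 2 for xi, yi in zip(x, y)) / len(x)


# Hypothetical synthetic data: y = 3x + noise.
random.seed(0)
x_all = [random.gauss(0, 1) for _ in range(60)]
y_all = [3.0 * xi + random.gauss(0, 1) for xi in x_all]
x_fit, y_fit = x_all[:40], y_all[:40]   # used to fit each candidate
x_val, y_val = x_all[40:], y_all[40:]   # used only to compare candidates

alphas = [0.01, 0.1, 1.0, 10.0, 100.0]  # assumed search grid
scores = {}
for alpha in alphas:
    slope, intercept = fit_ridge_1d(x_fit, y_fit, alpha)
    scores[alpha] = mse(x_val, y_val, slope, intercept)

best_alpha = min(scores, key=scores.get)
print("validation MSE by alpha:", scores)
print("best alpha:", best_alpha)
```

scikit-learn also ships `RidgeCV` and `GridSearchCV`, which run this kind of search with cross-validation.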

In [37]:
ridge2 = Ridge(alpha=10.0)
ridge2.fit(X_train, y_train)

Ridge(alpha=10.0, copy_X=True, fit_intercept=True, max_iter=None,
   normalize=False, random_state=None, solver='auto', tol=0.001)

In [38]:
ridge2_observation_tuple = list(zip(y_test, ridge2.predict(X_test)))
print(ridge2_observation_tuple)

[(17.199999999999999, 15.166582130030339), (12.0, 10.950396168002589), (50.0, 44.212028495296735), (15.199999999999999, 19.441770095653442), (26.399999999999999, 21.900408792340791), (15.6, 12.011811130880874), (9.5, 13.902238445547217), (20.399999999999999, 22.603574695579578), (11.699999999999999, 15.215226562844855), (19.0, 20.757342698022683), (23.699999999999999, 10.815132378652766), (33.0, 23.690057761686553), (22.199999999999999, 23.624021211720855), (21.699999999999999, 23.92517699401343), (19.0, 15.091126937185075), (21.199999999999999, 23.578494795359916), (23.199999999999999, 25.423393213408623), (22.199999999999999, 24.047590777535621), (17.399999999999999, 22.733818937278201), (19.300000000000001, 16.985280430947753), (21.899999999999999, 24.124384614702965), (22.0, 27.664030160659259), (24.399999999999999, 24.306030056910437), (23.0, 20.349787471840557), (23.899999999999999, 26.93720001635284), (10.199999999999999, 17.084768652917404), (13.800000000000001, -0.495052422538

In [39]:
ridge2_real_value=[i[0] for i in ridge2_observation_tuple]
ridge2_predicted_value=[i[1] for i in ridge2_observation_tuple]

In [40]:
r2_score(ridge2_real_value, ridge2_predicted_value)

0.64806929710776962

In [41]:
mse_ridge2 = mean_squared_error(ridge2_real_value, ridge2_predicted_value)
print("MSE: ")
print(mse_ridge2)
rmse_ridge2 = math.sqrt(mse_ridge2)
print("RMSE: ")
print(rmse_ridge2)

MSE: 
22.1023487297
RMSE: 
4.701313511103231


# sklearn.linear_model.Lasso
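
Unlike Ridge, the L1 penalty used by Lasso can set coefficients exactly to zero. The sketch below shows this for a single feature using the soft-thresholding solution of the objective $\frac{1}{2n}\lVert y - Xw \rVert_2^2 + \alpha \lvert w \rvert$. It is a one-feature illustration in plain Python, not scikit-learn's coordinate-descent solver; the helper names and toy data are hypothetical.

```python
def soft_threshold(rho, alpha):
    """Shrink rho toward zero by alpha; values within [-alpha, alpha] become 0."""
    if rho > alpha:
        return rho - alpha
    if rho < -alpha:
        return rho + alpha
    return 0.0


def fit_lasso_1d(x, y, alpha):
    """Lasso fit for a single feature with an unpenalized intercept."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    rho = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y)) / n
    z = sum((xi - x_mean) ** 2 for xi in x) / n
    slope = soft_threshold(rho, alpha) / z
    intercept = y_mean - slope * x_mean
    return slope, intercept


# Hypothetical toy data where y = 0.5 * x exactly.
x = [-2.0, -1.0, 0.0, 1.0, 2.0]
y = [-1.0, -0.5, 0.0, 0.5, 1.0]

for alpha in (0.0, 0.5, 1.5):
    slope, _ = fit_lasso_1d(x, y, alpha)
    print(alpha, slope)
# prints slopes 0.5, 0.25 and 0.0 for the three alphas
```

With the default `alpha=1.0` used below, some shrinkage of this kind may contribute to Lasso's lower score on this split, although the notebook does not inspect the fitted coefficients.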

In [21]:
lasso = Lasso()
lasso.fit(X_train, y_train)

Lasso(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=1000,
   normalize=False, positive=False, precompute=False, random_state=None,
   selection='cyclic', tol=0.0001, warm_start=False)

In [22]:
lasso_observation_tuple = list(zip(y_test, lasso.predict(X_test)))

In [23]:
lasso_real_value=[i[0] for i in lasso_observation_tuple]
print("Real Value:", lasso_real_value)
lasso_predicted_value=[i[1] for i in lasso_observation_tuple]
print("Predicted value: ", lasso_predicted_value)

Real Value: [17.199999999999999, 12.0, 50.0, 15.199999999999999, 26.399999999999999, 15.6, 9.5, 20.399999999999999, 11.699999999999999, 19.0, 23.699999999999999, 33.0, 22.199999999999999, 21.699999999999999, 19.0, 21.199999999999999, 23.199999999999999, 22.199999999999999, 17.399999999999999, 19.300000000000001, 21.899999999999999, 22.0, 24.399999999999999, 23.0, 23.899999999999999, 10.199999999999999, 13.800000000000001, 16.100000000000001, 24.800000000000001, 33.100000000000001, 19.300000000000001, 23.800000000000001, 19.399999999999999, 18.899999999999999, 24.0, 6.2999999999999998, 50.0, 30.800000000000001, 31.5, 19.399999999999999, 14.6, 23.899999999999999, 33.399999999999999, 22.0, 32.399999999999999, 24.399999999999999, 10.199999999999999, 37.600000000000001, 22.100000000000001, 27.5, 18.399999999999999, 31.600000000000001, 16.100000000000001, 18.5, 13.4, 21.100000000000001, 17.199999999999999, 28.699999999999999, 22.699999999999999, 20.399999999999999, 21.5, 21.0, 19.39999999999

### $R^{2}$

In [24]:
r2_score(lasso_real_value, lasso_predicted_value)

0.59131422785131282

### MSE

In [25]:
mse = mean_squared_error(lasso_real_value, lasso_predicted_value)
print("MSE: ")
print(mse)
rmse = math.sqrt(mse)
print("RMSE: ")
print(rmse)

MSE: 
25.6667445683
RMSE: 
5.066235739514619
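
To compare the four fits side by side, the scores printed above can be collected and ranked. The short sketch below uses the $R^{2}$ and MSE values reported in this notebook; the `results` list and its labels are just a convenient way of arranging them.

```python
# (label, R^2, MSE) as reported in the cells above.
results = [
    ("LinearRegression", 0.64046360029196414, 22.5800102749),
    ("Ridge(alpha=1.0)", 0.64149284450196453, 22.5153705198),
    ("Ridge(alpha=10.0)", 0.64806929710776962, 22.1023487297),
    ("Lasso(alpha=1.0)", 0.59131422785131282, 25.6667445683),
]

# Rank from lowest to highest test MSE.
ranked = sorted(results, key=lambda row: row[2])
for label, r2, mse in ranked:
    print(f"{label:<20} R2={r2:.4f}  MSE={mse:.4f}")

best_label = ranked[0][0]
print("lowest MSE:", best_label)
```

On this split, Ridge with `alpha=10.0` has the lowest MSE and the highest $R^{2}$, but the gaps between the first three models are small and come from a single random train/test split.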
