# Predicting House Prices With Regression

This notebook walks through a simple regression problem: fitting a linear regression model to a pre-loaded dataset and measuring its performance. Predicting house prices is a popular introductory problem in machine learning.  
The data is the Boston house prices dataset, which ships with the `sklearn.datasets` module, so there is no need to download it separately. Note that `load_boston` was removed in scikit-learn 1.2, so the code below requires an earlier version.

### Import dependencies

In [32]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_boston

### Load data set

In [3]:
boston = load_boston()

In [4]:
print(boston.keys())

dict_keys(['data', 'target', 'feature_names', 'DESCR', 'filename'])


In [5]:
print(boston.DESCR[:1200])

.. _boston_dataset:

Boston house prices dataset
---------------------------

**Data Set Characteristics:**  

    :Number of Instances: 506 

    :Number of Attributes: 13 numeric/categorical predictive. Median Value (attribute 14) is usually the target.

    :Attribute Information (in order):
        - CRIM     per capita crime rate by town
        - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
        - INDUS    proportion of non-retail business acres per town
        - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
        - NOX      nitric oxides concentration (parts per 10 million)
        - RM       average number of rooms per dwelling
        - AGE      proportion of owner-occupied units built prior to 1940
        - DIS      weighted distances to five Boston employment centres
        - RAD      index of accessibility to radial highways
        - TAX      full-value property-tax rate per $10,000
        - PTRATIO  pu

In [6]:
print(boston.feature_names)

['CRIM' 'ZN' 'INDUS' 'CHAS' 'NOX' 'RM' 'AGE' 'DIS' 'RAD' 'TAX' 'PTRATIO'
 'B' 'LSTAT']


In [7]:
print(boston.data.shape)
print(boston.target.shape)

(506, 13)
(506,)


### Data summary

In [34]:
df_X = pd.DataFrame(boston.data, columns = boston.feature_names)
df_y = pd.DataFrame(boston.target)

In [36]:
df_X.describe()

| | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 |
| mean | 3.613524 | 11.363636 | 11.136779 | 0.06917 | 0.554695 | 6.284634 | 68.574901 | 3.795043 | 9.549407 | 408.237154 | 18.455534 | 356.674032 | 12.653063 |
| std | 8.601545 | 23.322453 | 6.860353 | 0.253994 | 0.115878 | 0.702617 | 28.148861 | 2.10571 | 8.707259 | 168.537116 | 2.164946 | 91.294864 | 7.141062 |
| min | 0.00632 | 0.0 | 0.46 | 0.0 | 0.385 | 3.561 | 2.9 | 1.1296 | 1.0 | 187.0 | 12.6 | 0.32 | 1.73 |
| 25% | 0.082045 | 0.0 | 5.19 | 0.0 | 0.449 | 5.8855 | 45.025 | 2.100175 | 4.0 | 279.0 | 17.4 | 375.3775 | 6.95 |
| 50% | 0.25651 | 0.0 | 9.69 | 0.0 | 0.538 | 6.2085 | 77.5 | 3.20745 | 5.0 | 330.0 | 19.05 | 391.44 | 11.36 |
| 75% | 3.677083 | 12.5 | 18.1 | 0.0 | 0.624 | 6.6235 | 94.075 | 5.188425 | 24.0 | 666.0 | 20.2 | 396.225 | 16.955 |
| max | 88.9762 | 100.0 | 27.74 | 1.0 | 0.871 | 8.78 | 100.0 | 12.1265 | 24.0 | 711.0 | 22.0 | 396.9 | 37.97 |


### Initialize linear regression model

In [47]:
from sklearn import linear_model

reg = linear_model.LinearRegression()

In [48]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(boston.data, 
                                                    boston.target,
                                                    test_size = 0.25,
                                                    random_state=33)
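
As a rough illustration of what the split above does to the 506 rows, the sketch below shuffles row indices with NumPy and holds out 25% of them. It is not scikit-learn's implementation and will not select the same rows as `random_state=33`; the helper name `simple_split` is hypothetical.

```python
import math
import numpy as np

def simple_split(n_samples, test_size=0.25, seed=33):
    """Hypothetical helper: return (train_idx, test_idx) after one shuffle."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_samples)
    # Round the test share up, so 0.25 * 506 = 126.5 becomes 127 rows.
    n_test = math.ceil(test_size * n_samples)
    return shuffled[n_test:], shuffled[:n_test]

train_idx, test_idx = simple_split(506)
print(len(train_idx), len(test_idx))  # 379 127
```

With 506 rows this leaves 379 for training and 127 for testing, which matches the 127 values printed for `y_test` later in the notebook.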

### Training the model

In [57]:
reg.fit(X_train, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

In [50]:
print(reg.coef_)  # one weight per feature: y = w1*x1 + ... + w13*x13 + b, where b is reg.intercept_

[-1.26539657e-01  5.17054667e-02  1.59923838e-02  3.15849638e+00
 -1.47747009e+01  4.19328015e+00 -5.84475822e-03 -1.51333288e+00
  3.00394989e-01 -1.24603044e-02 -8.66991289e-01  6.54067123e-03
 -5.42000537e-01]
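
To make the role of these coefficients concrete: a fitted linear model predicts each target as the dot product of the feature row with `coef_`, plus `intercept_`. The sketch below shows that arithmetic on a tiny made-up dataset (not the Boston data), using NumPy's least-squares solver as a stand-in for `LinearRegression`.

```python
import numpy as np

# Made-up data: 5 samples, 2 features, generated exactly from y = 3*x1 - 2*x2 + 1.
X = np.array([[1.0, 2.0], [2.0, 0.0], [3.0, 1.0], [4.0, 3.0], [5.0, 5.0]])
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 1.0

# Append a column of ones so the solver also estimates the intercept.
X_aug = np.column_stack([X, np.ones(len(X))])
solution, *_ = np.linalg.lstsq(X_aug, y, rcond=None)
coef, intercept = solution[:-1], solution[-1]

# Prediction is a weighted sum of the features plus the intercept.
y_hat = X @ coef + intercept
print(np.round(coef, 6), round(float(intercept), 6))
```

Because the made-up targets are noise-free, the solver recovers the generating weights and `y_hat` matches `y`; on real data such as the Boston set the fit is only approximate.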


### Predictions

In [62]:
y_pred = reg.predict(X_test)
print(y_pred)

[20.29022993 11.20920066 13.87845541 18.17725798 22.61036483 20.73608879
 36.88055903 14.69362199 23.28063602 22.29036435 25.51192971 36.70574284
  5.38065596 25.41977407 11.09096124 23.82152491 17.60378246 19.31408384
 32.31316226 22.48544424 13.69438934 19.74975895 18.30176012 18.42192955
 34.07276452 15.44266616 25.35003609 24.93580683 11.6543228  34.67145035
 16.49253619 25.88986142  4.95589104 16.10161262 29.98955815 33.70633341
 25.21023473  5.1007656  20.14938672 28.88115057 17.65474861 13.76228757
 30.38352133 15.76310919 30.40779913 20.36746733 21.72609146 17.25698824
 24.24169982 21.20210359 17.40511363 36.17204804 11.43815281 16.46406141
 24.64965015 14.29169238 25.52117581 15.19681486 22.56582301 23.75435091
 16.90609314 18.82003988 35.66969657 22.0791474  18.01381553 25.11338846
 28.29665673 -0.73292001 13.56515423 30.28765197 21.2578837  19.33496218
 15.22580036 22.46759997 16.75396494 39.4269093  20.17259518  2.40342578
 17.69782732 25.29024095 21.17460643  8.22764755 17

### Actual values

In [52]:
print(y_test)

[20.5  5.6 13.4 12.6 21.2 19.7 32.4 14.8 33.  21.4 30.1 36.   8.4 21.6
 16.3 23.  14.9 14.1 31.1 11.9 12.7 27.9 20.8 19.6 32.  21.9 23.2 23.8
 10.8 34.9 19.1 26.5 10.5 17.5 24.  36.1 25.3 13.8 27.5 24.6 12.7  9.5
 32.7 13.8 23.5 17.7 15.6 22.5 26.2 20.6 14.1 33.3 15.2 14.9 21.6 17.2
 23.1 11.7 20.6 22.2 23.1 18.4 43.8 21.1 14.9 28.7 23.3 13.8 19.7 30.5
 19.  19.1 19.  26.6 17.5 21.9 13.8  8.8 19.4 28.1 21.  11.8  7.2 24.1
 20.  18.9 50.  13.3 50.  41.3 28.7 19.9 16.5 10.9 13.4 32.9 20.6 25.
 19.5 19.9 15.4 21.7 31.5 27.1  8.3 13.6  8.8 22.5  7.5 28.6 50.  11.5
 13.5 24.4 36.2 21.4 18.5 22.6 24.8 19.3 29.8 16.4  8.4 24.7 20.1 13.1
 35.2]


### Check model performance using Mean Squared Error

In [54]:
np.mean((y_pred - y_test)**2)

25.139236520353595

### Check model performance using Mean Squared Error from `sklearn.metrics`

In [56]:
from sklearn.metrics import mean_squared_error

mean_squared_error(y_test, y_pred)

25.139236520353595
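
Because MSE is expressed in squared units of the target, it can be easier to read after taking a square root (RMSE). The target here is the median home value in thousands of dollars, so the sketch below converts the MSE printed above into a typical error in those units. The variable names are illustrative.

```python
import math

mse = 25.139236520353595  # test-set MSE printed above
rmse = math.sqrt(mse)     # back in the units of the target ($1000s)
print(round(rmse, 2))     # 5.01
```

An RMSE of about 5 means the predictions are off by roughly $5,000 in a root-mean-square sense.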

### Evaluating the model

In [64]:
print(reg.score(X_train, y_train))

0.7550548859241805


In [63]:
print(reg.score(X_test, y_test))

0.6757955014529462
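
For a regressor, `score` returns the coefficient of determination R², defined as 1 minus the ratio of the residual sum of squares to the total sum of squares around the mean of the true values. Below is a minimal NumPy sketch of that formula on made-up numbers; the helper name `r2` is hypothetical.

```python
import numpy as np

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # error left after the model
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # error of always predicting the mean
    return 1.0 - ss_res / ss_tot

y_true = [3.0, 5.0, 7.0, 9.0]
print(r2(y_true, y_true))                # 1.0 (perfect predictions)
print(r2(y_true, [6.0, 6.0, 6.0, 6.0]))  # 0.0 (no better than the mean)
```

On this notebook's numbers, the test score of about 0.68 is lower than the training score of about 0.76, the usual pattern when a model is evaluated on data it was not fitted to.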
