### Linear Regression using Python

In [19]:
import pandas as pd
dataset = pd.read_csv('Boston.csv')
dataset.head()

Unnamed: 0.1,Unnamed: 0,crim,zn,indus,chas,nox,rm,age,dis,rad,tax,ptratio,black,lstat,medv
0,1,0.00632,18.0,2.31,0,0.538,6.575,65.2,4.09,1,296,15.3,396.9,4.98,24.0
1,2,0.02731,0.0,7.07,0,0.469,6.421,78.9,4.9671,2,242,17.8,396.9,9.14,21.6
2,3,0.02729,0.0,7.07,0,0.469,7.185,61.1,4.9671,2,242,17.8,392.83,4.03,34.7
3,4,0.03237,0.0,2.18,0,0.458,6.998,45.8,6.0622,3,222,18.7,394.63,2.94,33.4
4,5,0.06905,0.0,2.18,0,0.458,7.147,54.2,6.0622,3,222,18.7,396.9,5.33,36.2


### The Boston Housing Dataset

The Boston Housing Dataset is derived from information collected by the U.S. Census Service concerning housing in the area of Boston, MA. The dataset columns are described below:

CRIM - per capita crime rate by town
ZN - proportion of residential land zoned for lots over 25,000 sq.ft.
INDUS - proportion of non-retail business acres per town.
CHAS - Charles River dummy variable (1 if tract bounds river; 0 otherwise)
NOX - nitric oxides concentration (parts per 10 million)
RM - average number of rooms per dwelling
AGE - proportion of owner-occupied units built prior to 1940
DIS - weighted distances to five Boston employment centres
RAD - index of accessibility to radial highways
TAX - full-value property-tax rate per $10,000
PTRATIO - pupil-teacher ratio by town
BLACK (listed as B in the original description) - 1000(Bk - 0.63)^2, where Bk is the proportion of Black residents by town
LSTAT - % lower status of the population
MEDV - Median value of owner-occupied homes in $1000's

In [20]:
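# Features: every column except the last one
X = pd.DataFrame(dataset.iloc[:,:-1])
# Target: the last column (medv), kept as a one-column DataFrame
y = pd.DataFrame(dataset.iloc[:,-1])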

print(X.shape)
print(y.shape)

(506, 14)
(506, 1)


### Split the dataset into train and test sets

In [21]:
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test = train_test_split(X,y,test_size=0.2,random_state=1)

# print data shapes...
print('x_train shape : ', x_train.shape)
print('x_test shape : ', x_test.shape)
print('y_train shape : ', y_train.shape)
print('y_test shape : ', y_test.shape)

x_train shape :  (404, 14)
x_test shape :  (102, 14)
y_train shape :  (404, 1)
y_test shape :  (102, 1)
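
The 404/102 row counts follow from `test_size=0.2` applied to 506 rows. The sketch below illustrates the idea of a shuffled hold-out split with plain NumPy. It is not scikit-learn's implementation: `shuffle_split` is a hypothetical helper, and its seed will not select the same rows as `random_state=1`. It assumes the test share is rounded up, which is consistent with the shapes printed above.

```python
import math
import numpy as np

def shuffle_split(n_samples, test_size=0.2, seed=1):
    """Return (train_idx, test_idx) after shuffling the row positions once."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_samples)
    # Assumption: round the test share up and give the remainder to training.
    n_test = math.ceil(test_size * n_samples)
    return shuffled[n_test:], shuffled[:n_test]

train_idx, test_idx = shuffle_split(506, test_size=0.2, seed=1)
print(len(train_idx), len(test_idx))  # 404 102
```

Every row lands in exactly one of the two index arrays, so the model is never evaluated on rows it was fitted on.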


### Define the model and fit it to the training data

In [6]:
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(x_train,y_train)

LinearRegression()
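
`LinearRegression` fits an ordinary least squares model: it chooses the coefficients and intercept that minimize the sum of squared residuals. As a rough illustration of that objective (not of the library's internal solver), the sketch below solves the same kind of problem with `np.linalg.lstsq` on synthetic data. The names `X_demo`, `y_demo`, `true_coef`, and `true_intercept` are made up for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data generated from known coefficients plus a little noise.
X_demo = rng.normal(size=(200, 3))
true_coef = np.array([2.0, -1.0, 0.5])
true_intercept = 4.0
y_demo = X_demo @ true_coef + true_intercept + rng.normal(scale=0.01, size=200)

# Append a column of ones so the intercept is estimated along with the slopes.
A = np.column_stack([X_demo, np.ones(len(X_demo))])
solution, *_ = np.linalg.lstsq(A, y_demo, rcond=None)
coef, intercept = solution[:-1], solution[-1]
print(coef, intercept)
```

With this little noise, the recovered values should land very close to the ones used to generate the data.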

### Having a look at the coefficients the model has learned:

In [14]:
v = pd.DataFrame(regressor.coef_,index=['Co-efficient']).transpose()
w = pd.DataFrame(X.columns)

In [12]:
print(X.columns)

Index(['Unnamed: 0', 'crim', 'zn', 'indus', 'chas', 'nox', 'rm', 'age', 'dis',
       'rad', 'tax', 'ptratio', 'black', 'lstat'],
      dtype='object')
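
Note that `Unnamed: 0` appears among the feature columns. Judging by the values in `dataset.head()`, it looks like a row counter saved into the CSV rather than a housing attribute, so the model is being given a row number as a predictor. One hedged way to exclude it is sketched below on a toy frame with made-up values; `demo`, `features`, and `target` are illustrative names only.

```python
import pandas as pd

# Toy stand-in for the loaded CSV; the values are made up for illustration.
demo = pd.DataFrame({
    'Unnamed: 0': [1, 2, 3],
    'crim': [0.1, 0.2, 0.3],
    'rm': [6.5, 6.4, 7.1],
    'medv': [24.0, 21.6, 34.7],
})

# Remove the saved row counter and the target, leaving only real predictors.
features = demo.drop(columns=['Unnamed: 0', 'medv'])
target = demo[['medv']]
print(list(features.columns))  # ['crim', 'rm']
```

An alternative is to load the file with `pd.read_csv('Boston.csv', index_col=0)` so the counter becomes the index. Either change would leave 13 feature columns instead of 14, so the coefficients and metrics in this notebook would need to be recomputed.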


### Concatenating the DataFrames to pair each feature name with its coefficient:

In [16]:
coeff_df = pd.concat([w,v],axis =1, join = 'inner')
coeff_df

Unnamed: 0,0,Co-efficient
0,Unnamed: 0,-0.002717
1,crim,-0.113467
2,zn,0.06067
3,indus,0.019462
4,chas,2.155218
5,nox,-19.441797
6,rm,3.106845
7,age,0.002296
8,dis,-1.523417
9,rad,0.333528


### Predicting on the test set, for comparison with the actual values:

In [17]:
y_pred = regressor.predict(x_test)
y_pred = pd.DataFrame(y_pred, columns = ['Predicted'])
y_pred

Unnamed: 0,Predicted
0,32.417338
1,27.790194
2,18.177707
3,21.813096
4,19.003303
...,...
97,29.389236
98,37.104539
99,20.786951
100,17.416306
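
To view predictions next to the actual values, the two frames can be placed side by side. One detail to watch: after a shuffled split, `y_test` keeps its original row labels, while a newly built prediction frame is labelled 0, 1, 2, and so on, so `pd.concat` would misalign them unless the index is reset first. The sketch below shows this on toy values; `y_test_demo`, `y_pred_demo`, and `comparison` are illustrative names, and the numbers are made up.

```python
import pandas as pd

# Toy stand-ins: the actual values carry shuffled row labels,
# the predictions carry a fresh 0..n-1 index.
y_test_demo = pd.DataFrame({'medv': [24.0, 21.6, 34.7]}, index=[307, 12, 451])
y_pred_demo = pd.DataFrame({'Predicted': [25.1, 20.9, 33.0]})

# Reset the index so rows are matched by position, not by label.
actual = y_test_demo.reset_index(drop=True).rename(columns={'medv': 'Actual'})
comparison = pd.concat([actual, y_pred_demo], axis=1)
comparison['Error'] = comparison['Actual'] - comparison['Predicted']
print(comparison)
```

The same pattern should apply to `y_test` and `y_pred` from the cells above.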


### Evaluate with different metrics

In [18]:
from sklearn import metrics
import numpy as np

print('Mean Absolute Error (MAE) :',metrics.mean_absolute_error(y_test,y_pred))
print('Mean Squared Error (MSE) :',metrics.mean_squared_error(y_test,y_pred))
print('Root Mean Squared Error (RMSE) :', np.sqrt(metrics.mean_squared_error(y_test,y_pred)))


Mean Absolute Error (MAE) : 3.7357095670940095
Mean Squared Error (MSE) : 23.369625899561132
Root Mean Squared Error (RMSE) : 4.834214093269053
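
Because `medv` is measured in $1000's, MAE and RMSE are in the same unit: an RMSE of about 4.83 corresponds to a typical error of roughly $4,800. For reference, the three metrics can be computed by hand as in the minimal sketch below, which uses a small made-up pair of arrays (`actual`, `predicted`) rather than the notebook's data.

```python
import numpy as np

actual = np.array([3.0, 5.0, 2.0, 7.0])
predicted = np.array([2.0, 5.0, 4.0, 8.0])

errors = actual - predicted      # [1, 0, -2, -1]
mae = np.mean(np.abs(errors))    # (1 + 0 + 2 + 1) / 4 = 1.0
mse = np.mean(errors ** 2)       # (1 + 0 + 4 + 1) / 4 = 1.5
rmse = np.sqrt(mse)              # square root of the MSE
print(mae, mse, rmse)
```

MSE penalizes large errors more heavily than MAE, which is why the RMSE reported above is larger than the MAE.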
