# House Price Prediction with Linear Regression

# Linear data
This example uses the Boston Housing dataset, which has 506 rows. The target variable is the median home price (`medv`). There are 13 predictor variables, including the average number of rooms per dwelling and the per capita crime rate by town. More information about the dataset is available at https://www.kaggle.com/c/boston-housing



This data frame contains the following columns:

**crim**
per capita crime rate by town.

**zn**
proportion of residential land zoned for lots over 25,000 sq.ft.

**indus**
proportion of non-retail business acres per town.

**chas**
Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).

**nox**
nitrogen oxides concentration (parts per 10 million).

**rm**
average number of rooms per dwelling.

**age**
proportion of owner-occupied units built prior to 1940.

**dis**
weighted mean of distances to five Boston employment centres.

**rad**
index of accessibility to radial highways.

**tax**
full-value property-tax rate per $10,000.

**ptratio**
pupil-teacher ratio by town.

**black**
1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town.

**lstat**
lower status of the population (percent).

**medv**
median value of owner-occupied homes in $1000s.

In [9]:
# import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
sns.set()

In [16]:
df=pd.read_csv("./sir notebook/Boston.csv")

In [17]:
df.head()

| | Crime Rate | Residential Proportion | non-retail business acres/Town | Charles River | NO2 concentration | Average Rooms/Dwelling. | Prior Built Units Proportion | Distance to Employment Centres | Radial Highways Distance | ValueProperty/tax rate | Teacher/town | blacks/town | Lower Status Percent | median home price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 4.09 | 1.0 | 296.0 | 15.3 | 396.9 | 4.98 | 24.0 |
| 1 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2.0 | 242.0 | 17.8 | 396.9 | 9.14 | 21.6 |
| 2 | 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2.0 | 242.0 | 17.8 | 392.83 | 4.03 | 34.7 |
| 3 | 0.03237 | 0.0 | 2.18 | 0.0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3.0 | 222.0 | 18.7 | 394.63 | 2.94 | 33.4 |
| 4 | 0.06905 | 0.0 | 2.18 | 0.0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3.0 | 222.0 | 18.7 | 396.9 | 5.33 | 36.2 |


In [21]:
# split independent and dependent variables
x = df.drop(['median home price'], axis=1)
y = df[['median home price']]
print(x.shape, y.shape)

(506, 13) (506, 1)


In [24]:
x.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 13 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   Crime Rate                      506 non-null    float64
 1   Residential Proportion          506 non-null    float64
 2   non-retail business acres/Town  506 non-null    float64
 3   Charles River                   506 non-null    float64
 4   NO2 concentration               506 non-null    float64
 5   Average Rooms/Dwelling.         506 non-null    float64
 6   Prior Built Units Proportion    506 non-null    float64
 7   Distance to Employment Centres  506 non-null    float64
 8   Radial Highways Distance        506 non-null    float64
 9   ValueProperty/tax rate          506 non-null    float64
 10  Teacher/town                    506 non-null    float64
 11  blacks/town                     506 non-null    float64
 12  Lower Status Percent            506 

# Normalization

In [23]:
# set the column names
columnss = ['Crime Rate','Residential Proportion','non-retail business acres/Town','Charles River',
          'NO2 concentration','Average Rooms/Dwelling.','Prior Built Units Proportion','Distance to Employment Centres',
           'Radial Highways Distance','ValueProperty/tax rate','Teacher/town','blacks/town','Lower Status Percent']

In [25]:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
x = scaler.fit_transform(x)
x=pd.DataFrame(x,columns=columnss)
x.head()

| | Crime Rate | Residential Proportion | non-retail business acres/Town | Charles River | NO2 concentration | Average Rooms/Dwelling. | Prior Built Units Proportion | Distance to Employment Centres | Radial Highways Distance | ValueProperty/tax rate | Teacher/town | blacks/town | Lower Status Percent |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.18 | 0.067815 | 0.0 | 0.314815 | 0.577505 | 0.641607 | 0.269203 | 0.0 | 0.208015 | 0.287234 | 1.0 | 0.08968 |
| 1 | 0.000236 | 0.0 | 0.242302 | 0.0 | 0.17284 | 0.547998 | 0.782698 | 0.348962 | 0.043478 | 0.104962 | 0.553191 | 1.0 | 0.20447 |
| 2 | 0.000236 | 0.0 | 0.242302 | 0.0 | 0.17284 | 0.694386 | 0.599382 | 0.348962 | 0.043478 | 0.104962 | 0.553191 | 0.989737 | 0.063466 |
| 3 | 0.000293 | 0.0 | 0.06305 | 0.0 | 0.150206 | 0.658555 | 0.441813 | 0.448545 | 0.086957 | 0.066794 | 0.648936 | 0.994276 | 0.033389 |
| 4 | 0.000705 | 0.0 | 0.06305 | 0.0 | 0.150206 | 0.687105 | 0.528321 | 0.448545 | 0.086957 | 0.066794 | 0.648936 | 1.0 | 0.099338 |
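
`MinMaxScaler` rescales each column to the range [0, 1] using `(x - min) / (max - min)`, which is why the first row above shows 0.0 for the smallest crime rate and 1.0 for the largest `blacks/town` value. As a minimal sketch, the snippet below reproduces that formula by hand on a tiny made-up matrix and compares it with scikit-learn. The array `X_toy` and its values are hypothetical and are not taken from the Boston data.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy feature matrix (hypothetical values, not the Boston data)
X_toy = np.array([[1.0, 200.0],
                  [3.0, 400.0],
                  [5.0, 800.0]])

# Manual min-max scaling, column by column: (x - min) / (max - min)
col_min = X_toy.min(axis=0)
col_max = X_toy.max(axis=0)
X_manual = (X_toy - col_min) / (col_max - col_min)

# The same transformation done by scikit-learn
X_sklearn = MinMaxScaler().fit_transform(X_toy)

print(X_manual)
print("matches sklearn:", np.allclose(X_manual, X_sklearn))
```

Each column is scaled independently, so features measured in very different units end up on a comparable scale.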


In [26]:
# train test split
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.2,random_state=42)

In [27]:
print(x_train.shape,x_test.shape,y_train.shape,y_test.shape)

(404, 13) (102, 13) (404, 1) (102, 1)
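
Note that the scaler above was fit on all 506 rows before the split, so the minimum and maximum of the test rows influenced the scaling. A common alternative is to fit the scaler on the training rows only, which a scikit-learn pipeline handles automatically. The sketch below illustrates this on synthetic data from `make_regression`, chosen only so the example runs without `Boston.csv`. The names `X_demo`, `y_demo`, and `pipe` are hypothetical.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in with the same shape as the notebook's feature matrix
# (assumption: this is NOT the Boston data)
X_demo, y_demo = make_regression(n_samples=506, n_features=13,
                                 noise=10.0, random_state=42)

# Split first, so the test rows stay unseen during scaling
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          test_size=0.2, random_state=42)

# The pipeline fits the scaler on the training rows only,
# then reuses those min/max values when transforming the test rows
pipe = make_pipeline(MinMaxScaler(), LinearRegression())
pipe.fit(X_tr, y_tr)

test_r2 = pipe.score(X_te, y_te)  # R^2 on the held-out rows
print(X_tr.shape, X_te.shape)
print("test R^2:", test_r2)
```

For plain least-squares regression, rescaling the features should not change the predictions, so this choice is unlikely to move the R² scores here. It matters more for models that are sensitive to feature scale, such as regularized regression.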


In [28]:
# fit the model
from sklearn.linear_model import LinearRegression
lr = LinearRegression()
lr.fit(x_train,y_train)

In [29]:
# prediction
y_pred = lr.predict(x_test)

In [31]:
lr.coef_

array([[-10.05857199,   3.01104641,   1.10158605,   2.7844382 ,
         -8.36047983,  23.16628091,  -0.61137677, -15.92203067,
          6.03588392,  -5.57891601,  -8.60528866,   4.89829233,
        -18.43062842]])

In [32]:
lr.intercept_

array([23.65537767])
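
The raw coefficient array is easier to read when each value is paired with its feature name. Because the features were scaled to [0, 1], each coefficient is the predicted change in the target as that feature moves from its minimum to its maximum, with the others held fixed. The sketch below shows one way to build such a table on a small synthetic frame. The column names `feat_a`, `feat_b`, `feat_c` and the data are hypothetical. In this notebook, `lr.coef_` has shape `(1, 13)` because `y` is a one-column DataFrame, hence the `np.ravel` call.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Small synthetic frame (hypothetical columns, not the Boston data)
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.random((50, 3)),
                      columns=['feat_a', 'feat_b', 'feat_c'])

# Noise-free target, so the fitted coefficients should recover 2, -3 and 0
y_demo = 2.0 * X_demo['feat_a'] - 3.0 * X_demo['feat_b'] + 0.5

model = LinearRegression().fit(X_demo, y_demo)

# Pair each coefficient with its feature name; ravel() flattens a (1, n) array
coef_table = pd.Series(np.ravel(model.coef_),
                       index=X_demo.columns).sort_values()
print(coef_table)

# A prediction is the intercept plus the weighted sum of the features
manual_pred = X_demo.to_numpy() @ np.ravel(model.coef_) + model.intercept_
print("matches predict():", np.allclose(manual_pred, model.predict(X_demo)))
```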

In [36]:
# evaluation
from sklearn.metrics import r2_score
test=r2_score(y_test,y_pred)

In [37]:
# r2 on training 
train=r2_score(y_train,lr.predict(x_train))

In [38]:
print('r2 score on test:',test)
print('r2 score on train:',train)

r2 score on test: 0.6687594935356318
r2 score on train: 0.7508856358979673
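
The training R² (about 0.75) is higher than the test R² (about 0.67), and a single 80/20 split can give a noisy estimate. Error metrics in the target's own units and a cross-validated score can give a fuller picture. The sketch below computes RMSE, MAE, and 5-fold cross-validated R² on synthetic data from `make_regression`, used here only so the example is self-contained. The names `X_demo`, `y_demo`, and `model` are hypothetical, and the numbers it prints say nothing about the Boston results.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic regression data (assumption: a stand-in for the real dataset)
X_demo, y_demo = make_regression(n_samples=200, n_features=5,
                                 noise=15.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          test_size=0.2, random_state=0)

model = LinearRegression().fit(X_tr, y_tr)
pred = model.predict(X_te)

# RMSE and MAE are expressed in the same units as the target
rmse = np.sqrt(mean_squared_error(y_te, pred))
mae = mean_absolute_error(y_te, pred)

# 5-fold cross-validated R^2: one score per fold
cv_r2 = cross_val_score(LinearRegression(), X_demo, y_demo,
                        cv=5, scoring='r2')

print("RMSE:", rmse)
print("MAE:", mae)
print("CV R^2 mean:", cv_r2.mean(), "std:", cv_r2.std())
```

With the notebook's data, RMSE and MAE would be in thousands of dollars, since `median home price` is recorded in $1000s.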
