## Boston House Price Prediction Model

In [1]:
from warnings import filterwarnings
filterwarnings('ignore')

## Load the dataset

In [2]:
import pandas as pd
df = pd.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/BostonHousing.csv')
df.head()

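| | crim | zn | indus | chas | nox | rm | age | dis | rad | tax | ptratio | b | lstat | medv |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00632 | 18.0 | 2.31 | 0 | 0.538 | 6.575 | 65.2 | 4.09 | 1 | 296 | 15.3 | 396.9 | 4.98 | 24.0 |
| 1 | 0.02731 | 0.0 | 7.07 | 0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2 | 242 | 17.8 | 396.9 | 9.14 | 21.6 |
| 2 | 0.02729 | 0.0 | 7.07 | 0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2 | 242 | 17.8 | 392.83 | 4.03 | 34.7 |
| 3 | 0.03237 | 0.0 | 2.18 | 0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3 | 222 | 18.7 | 394.63 | 2.94 | 33.4 |
| 4 | 0.06905 | 0.0 | 2.18 | 0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3 | 222 | 18.7 | 396.9 | 5.33 | 36.2 |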


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 14 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   crim     506 non-null    float64
 1   zn       506 non-null    float64
 2   indus    506 non-null    float64
 3   chas     506 non-null    int64  
 4   nox      506 non-null    float64
 5   rm       506 non-null    float64
 6   age      506 non-null    float64
 7   dis      506 non-null    float64
 8   rad      506 non-null    int64  
 9   tax      506 non-null    int64  
 10  ptratio  506 non-null    float64
 11  b        506 non-null    float64
 12  lstat    506 non-null    float64
 13  medv     506 non-null    float64
dtypes: float64(11), int64(3)
memory usage: 55.5 KB


In [4]:
df.isna().sum()

crim       0
zn         0
indus      0
chas       0
nox        0
rm         0
age        0
dis        0
rad        0
tax        0
ptratio    0
b          0
lstat      0
medv       0
dtype: int64

In [5]:
df.duplicated().sum()

0

The dataset has no missing values and no duplicated rows.

## Separate features (X) and target (Y)

In [6]:
X = df.drop(columns=['medv'])   # all columns except the target
Y = df[['medv']]                # target kept as a one-column DataFrame

## Train Test Split

In [7]:
from sklearn.model_selection import train_test_split
xtrain, xtest, ytrain, ytest = train_test_split(X,Y,train_size=0.8,test_size=0.2,random_state=42)

In [8]:
xtrain.shape,ytrain.shape

((404, 13), (404, 1))

In [9]:
xtest.shape, ytest.shape

((102, 13), (102, 1))

## Build the KNN model

In [10]:
from sklearn.neighbors import KNeighborsRegressor
knn = KNeighborsRegressor(n_neighbors=5)
knn.fit(xtrain,ytrain)

In [11]:
knn.score(xtrain,ytrain)

0.6839203981935851

The score above is the R² on the training data, and about 0.68 is low for data the model has already seen. Hyperparameter tuning may improve it.
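Because that score is measured on the training data, a cross-validated score may give a fairer baseline before tuning. The sketch below is illustrative only. It uses synthetic data from `make_regression` as a stand-in for `xtrain` and `ytrain`, inflates one column to mimic features on very different scales (an assumption about why plain KNN can struggle), and compares plain KNN with a hypothetical scaled pipeline. None of the names here come from the original notebook.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for xtrain / ytrain (NOT the Boston data)
X_demo, y_demo = make_regression(n_samples=300, n_features=13,
                                 noise=10.0, random_state=42)
X_demo[:, 0] *= 1000  # assumption: one feature lives on a much larger scale

# Baseline KNN, with and without standardizing the features first
plain_knn = KNeighborsRegressor(n_neighbors=5)
scaled_knn = make_pipeline(StandardScaler(), KNeighborsRegressor(n_neighbors=5))

# 5-fold cross-validated R^2 for each variant
plain_scores = cross_val_score(plain_knn, X_demo, y_demo, cv=5, scoring='r2')
scaled_scores = cross_val_score(scaled_knn, X_demo, y_demo, cv=5, scoring='r2')

print('plain :', np.round(plain_scores.mean(), 3))
print('scaled:', np.round(scaled_scores.mean(), 3))
```

Putting the scaler inside the pipeline means it is refit on each training fold, so no information from the held-out fold leaks into the scaling.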

## Hyperparameter tuning

In [12]:
params = {'n_neighbors':[2,3,4,5,6,7,8,9],  # number of neighbors to consider
          'weights':['uniform','distance'], # neighbor weighting: equal, or inverse to distance
          'algorithm': ['auto','ball_tree','kd_tree','brute'],  # algorithm used to compute nearest neighbors
          'p':[1,2]}    # Minkowski power parameter: 1 = Manhattan, 2 = Euclidean
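As a rough check on the cost of this search, the grid size can be counted before fitting. The sketch below uses scikit-learn's `ParameterGrid` to count the combinations. `lean_params` is a hypothetical variant, not part of the original notebook. It drops `algorithm` on the assumption that this setting only changes how neighbors are looked up, not which neighbors are found (ties aside).

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the cell above
params = {'n_neighbors': [2, 3, 4, 5, 6, 7, 8, 9],
          'weights': ['uniform', 'distance'],
          'algorithm': ['auto', 'ball_tree', 'kd_tree', 'brute'],
          'p': [1, 2]}

# Hypothetical leaner grid: identical, minus the 'algorithm' entry
lean_params = {k: v for k, v in params.items() if k != 'algorithm'}

n_full = len(ParameterGrid(params))       # 8 * 2 * 4 * 2 combinations
n_lean = len(ParameterGrid(lean_params))  # 8 * 2 * 2 combinations
print(n_full, n_lean)  # 128 32
```

With 5-fold cross-validation, that is 640 model fits for the full grid against 160 for the leaner one.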

In [13]:
from sklearn.model_selection import GridSearchCV
gscv = GridSearchCV(KNeighborsRegressor(),param_grid=params,cv=5,scoring='neg_mean_squared_error')

In [14]:
gscv.fit(xtrain,ytrain)

In [15]:
gscv.best_params_

{'algorithm': 'auto', 'n_neighbors': 5, 'p': 1, 'weights': 'distance'}

In [19]:
gscv.best_score_

-33.2971448976826
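Since `scoring='neg_mean_squared_error'` makes `best_score_` the negated mean squared error averaged over the folds, flipping the sign and taking the square root gives an error in the units of `medv`. The snippet below is a small worked calculation on the value printed above. The square root of the mean fold MSE is not quite the same as the mean of the per-fold RMSEs, so treat the result as approximate.

```python
import math

best_score = -33.2971448976826   # gscv.best_score_ from the output above
cv_mse = -best_score             # undo the sign flip applied by the scorer
cv_rmse = math.sqrt(cv_mse)      # back to the units of medv
print(round(cv_rmse, 2))  # 5.77
```

`medv` is recorded in thousands of dollars, so the best configuration is off by roughly $5,800 on a typical cross-validation fold.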

In [16]:
best_knn = gscv.best_estimator_
best_knn

## Model Evaluation

In [20]:
best_knn.score(xtrain,ytrain)

1.0
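A training score of exactly 1.0 is what `weights='distance'` tends to produce: each training point is its own nearest neighbor at distance zero, so it receives all the weight and the model returns its own target. The function below is a simplified NumPy re-implementation written for illustration, not scikit-learn's code. `knn_predict_distance_weighted` and the random demo data are hypothetical.

```python
import numpy as np

def knn_predict_distance_weighted(X_train, y_train, x_query, k=5, p=1):
    """Predict one value with inverse-distance-weighted KNN (illustrative only)."""
    # Minkowski distance of order p from the query to every training point
    dists = np.sum(np.abs(X_train - x_query) ** p, axis=1) ** (1.0 / p)
    nearest = np.argsort(dists)[:k]
    d, y = dists[nearest], y_train[nearest]
    # A zero distance means the query coincides with a training point:
    # that point takes all the weight, so its own target is returned.
    exact = d == 0
    if exact.any():
        return y[exact].mean()
    weights = 1.0 / d
    return np.sum(weights * y) / np.sum(weights)

# Random demo data (NOT the Boston data)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 3))
y_demo = rng.normal(size=50)

# Predicting the training points themselves reproduces their targets
train_preds = np.array([knn_predict_distance_weighted(X_demo, y_demo, x)
                        for x in X_demo])
print(np.allclose(train_preds, y_demo))
```

For this reason the training score says little about a distance-weighted KNN model, and the held-out score is the one to read.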

In [21]:
best_knn.score(xtest,ytest)

0.7136527808511335
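The gap between the training score of 1.0 and the test score of about 0.71 suggests that the test set is the more informative measure here. R² alone does not say how large the errors are, so error metrics in the units of the target can help. The sketch below shows one way to compute them with `sklearn.metrics`. It runs on synthetic data as a stand-in, and the same calls would apply to `best_knn`, `xtest` and `ytest`. All `*_demo`, `X_tr`, `X_te` names are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in for the notebook's data (NOT the Boston data)
X_demo, y_demo = make_regression(n_samples=300, n_features=13,
                                 noise=10.0, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          test_size=0.2, random_state=42)

# Same hyperparameters as the best estimator reported above
model = KNeighborsRegressor(n_neighbors=5, weights='distance', p=1)
model.fit(X_tr, y_tr)
preds = model.predict(X_te)

mse = mean_squared_error(y_te, preds)
rmse = np.sqrt(mse)                      # error in the units of the target
mae = mean_absolute_error(y_te, preds)   # less sensitive to large misses
r2 = r2_score(y_te, preds)               # same value as model.score(X_te, y_te)

print(f'RMSE={rmse:.3f}  MAE={mae:.3f}  R2={r2:.3f}')
```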