<a href="https://colab.research.google.com/github/ridwanulhoquejr/Multilayer_Perceptron_Implementation_for_Regression/blob/main/Multilayer_Perceptron_Implementation_for_Regression_Task_in_Python.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### **Problem statement:** *Multilayer Perceptron Implementation for Regression Task in Python*

In [None]:
import pandas as pd
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import r2_score, mean_absolute_error

**Dataset:** https://www.kaggle.com/harlfoxem/housesalesprediction

- This dataset contains house sale prices for King County, which includes Seattle. It includes homes sold between May 2014 and May 2015.

It's a great dataset for evaluating regression models.

In [None]:
df = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/Datasets/kc_house_data.csv')

In [None]:
df.shape

(21613, 21)

In [None]:
df.head()

Unnamed: 0,id,date,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,condition,grade,sqft_above,sqft_basement,yr_built,yr_renovated,zipcode,lat,long,sqft_living15,sqft_lot15
0,7129300520,20141013T000000,221900.0,3,1.0,1180,5650,1.0,0,0,3,7,1180,0,1955,0,98178,47.5112,-122.257,1340,5650
1,6414100192,20141209T000000,538000.0,3,2.25,2570,7242,2.0,0,0,3,7,2170,400,1951,1991,98125,47.721,-122.319,1690,7639
2,5631500400,20150225T000000,180000.0,2,1.0,770,10000,1.0,0,0,3,6,770,0,1933,0,98028,47.7379,-122.233,2720,8062
3,2487200875,20141209T000000,604000.0,4,3.0,1960,5000,1.0,0,0,5,7,1050,910,1965,0,98136,47.5208,-122.393,1360,5000
4,1954400510,20150218T000000,510000.0,3,2.0,1680,8080,1.0,0,0,3,8,1680,0,1987,0,98074,47.6168,-122.045,1800,7503


In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21613 entries, 0 to 21612
Data columns (total 21 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   id             21613 non-null  int64  
 1   date           21613 non-null  object 
 2   price          21613 non-null  float64
 3   bedrooms       21613 non-null  int64  
 4   bathrooms      21613 non-null  float64
 5   sqft_living    21613 non-null  int64  
 6   sqft_lot       21613 non-null  int64  
 7   floors         21613 non-null  float64
 8   waterfront     21613 non-null  int64  
 9   view           21613 non-null  int64  
 10  condition      21613 non-null  int64  
 11  grade          21613 non-null  int64  
 12  sqft_above     21613 non-null  int64  
 13  sqft_basement  21613 non-null  int64  
 14  yr_built       21613 non-null  int64  
 15  yr_renovated   21613 non-null  int64  
 16  zipcode        21613 non-null  int64  
 17  lat            21613 non-null  float64
 18  long  

We don't have any missing values here. The only non-numeric column is `date` (dtype `object`), which is dropped below, so no categorical encoding is needed.
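The statement above can be checked directly rather than read off `df.info()`. A minimal sketch, using a small made-up frame (`toy`, a hypothetical stand-in for `df`) so that it runs without the CSV file:

```python
import pandas as pd

# Hypothetical stand-in for df; the rows are illustrative only.
toy = pd.DataFrame({
    "date": ["20141013T000000", "20141209T000000", "20150225T000000"],
    "price": [221900.0, 538000.0, 180000.0],
    "bedrooms": [3, 3, 2],
})

# Count missing values per column; all zeros means nothing needs imputing.
missing_per_column = toy.isna().sum()

# Columns that are not numeric would need encoding or dropping before training.
non_numeric_columns = toy.select_dtypes(exclude="number").columns.tolist()

print(missing_per_column)
print(non_numeric_columns)
```

Running the same two calls on `df` should show which columns, if any, need attention before modelling.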

In [None]:
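## absolute correlation of each numeric feature with the target (price)

# numeric_only=True skips the string-typed 'date' column, which pandas 2.0+ refuses to correlate
corr = np.abs(df.corr(numeric_only=True)['price']).sort_values(ascending=False)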
corr = corr.rename_axis('Column').reset_index(name='Correlation')
corr

Unnamed: 0,Column,Correlation
0,price,1.0
1,sqft_living,0.702035
2,grade,0.667434
3,sqft_above,0.605567
4,sqft_living15,0.585379
5,bathrooms,0.525138
6,view,0.397293
7,sqft_basement,0.323816
8,bedrooms,0.30835
9,lat,0.307003


**So we can drop some of the features that are only weakly correlated with our target (the price column).**
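One way to make this selection reproducible is to pick the columns by a correlation cutoff instead of by eye. The sketch below is only an illustration: the frame `toy`, the column `weak_feature`, and the cutoff `threshold = 0.3` are all assumptions, not values from this notebook. Keep in mind that Pearson correlation only captures linear association, so a weakly correlated feature can still be useful to a neural network.

```python
import pandas as pd

# Hypothetical data: one feature that tracks price, one that does not.
toy = pd.DataFrame({
    "price": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "sqft_living": [1.0, 2.0, 3.0, 4.0, 5.0, 7.0],
    "weak_feature": [1.0, -1.0, 0.0, 0.0, -1.0, 1.0],
})

# Absolute correlation of every feature with the target, target itself excluded.
abs_corr = toy.corr(numeric_only=True)["price"].abs().drop("price")

threshold = 0.3  # assumed cutoff, chosen only for illustration
weak_columns = abs_corr[abs_corr < threshold].index.tolist()

# Drop the weakly correlated columns without mutating the original frame.
reduced = toy.drop(columns=weak_columns)
print(weak_columns)
```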

In [None]:
df.drop(['date', 'id', 'long', 'condition'],  axis=1, inplace=True)
df

Unnamed: 0,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,grade,sqft_above,sqft_basement,yr_built,yr_renovated,zipcode,lat,sqft_living15,sqft_lot15
0,221900.0,3,1.00,1180,5650,1.0,0,0,7,1180,0,1955,0,98178,47.5112,1340,5650
1,538000.0,3,2.25,2570,7242,2.0,0,0,7,2170,400,1951,1991,98125,47.7210,1690,7639
2,180000.0,2,1.00,770,10000,1.0,0,0,6,770,0,1933,0,98028,47.7379,2720,8062
3,604000.0,4,3.00,1960,5000,1.0,0,0,7,1050,910,1965,0,98136,47.5208,1360,5000
4,510000.0,3,2.00,1680,8080,1.0,0,0,8,1680,0,1987,0,98074,47.6168,1800,7503
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
21608,360000.0,3,2.50,1530,1131,3.0,0,0,8,1530,0,2009,0,98103,47.6993,1530,1509
21609,400000.0,4,2.50,2310,5813,2.0,0,0,8,2310,0,2014,0,98146,47.5107,1830,7200
21610,402101.0,2,0.75,1020,1350,2.0,0,0,7,1020,0,2009,0,98144,47.5944,1020,2007
21611,400000.0,3,2.50,1600,2388,2.0,0,0,8,1600,0,2004,0,98027,47.5345,1410,1287


In [None]:
df.shape

(21613, 17)

In [None]:
x = df.drop('price', axis=1)
y = df.price

In [None]:
x.head(10)

Unnamed: 0,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,grade,sqft_above,sqft_basement,yr_built,yr_renovated,zipcode,lat,sqft_living15,sqft_lot15
0,3,1.0,1180,5650,1.0,0,0,7,1180,0,1955,0,98178,47.5112,1340,5650
1,3,2.25,2570,7242,2.0,0,0,7,2170,400,1951,1991,98125,47.721,1690,7639
2,2,1.0,770,10000,1.0,0,0,6,770,0,1933,0,98028,47.7379,2720,8062
3,4,3.0,1960,5000,1.0,0,0,7,1050,910,1965,0,98136,47.5208,1360,5000
4,3,2.0,1680,8080,1.0,0,0,8,1680,0,1987,0,98074,47.6168,1800,7503
5,4,4.5,5420,101930,1.0,0,0,11,3890,1530,2001,0,98053,47.6561,4760,101930
6,3,2.25,1715,6819,2.0,0,0,7,1715,0,1995,0,98003,47.3097,2238,6819
7,3,1.5,1060,9711,1.0,0,0,7,1060,0,1963,0,98198,47.4095,1650,9711
8,3,1.0,1780,7470,1.0,0,0,7,1050,730,1960,0,98146,47.5123,1780,8113
9,3,2.5,1890,6560,2.0,0,0,7,1890,0,2003,0,98038,47.3684,2390,7570


In [None]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=47)

In [None]:
scaler = StandardScaler()
scaled_x_train = scaler.fit_transform(x_train)
scaled_x_test = scaler.transform(x_test)

In [None]:
scaled_x_train

array([[ 0.67143646,  0.50131445,  0.43498928, ...,  0.53326678,
         1.34715419, -0.03072596],
       [ 0.67143646, -0.47287933, -0.41648171, ..., -0.57939754,
        -0.623749  , -0.18073651],
       [-0.39961589,  0.17658319,  0.05291896, ..., -0.76976914,
         0.38360152, -0.11317971],
       ...,
       [-0.39961589, -1.44707311, -0.53656095, ...,  0.73229164,
        -0.60914972, -0.31427066],
       [-0.39961589, -1.44707311, -0.59114243, ...,  0.900309  ,
        -0.7113447 , -0.32022945],
       [-0.39961589, -1.44707311, -0.8640498 , ...,  0.88732911,
        -0.08357553, -0.17355725]])

In [None]:
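# One hidden layer with 64 units. Note the trailing comma: (64) is just the int 64, (64,) is a tuple.
model = MLPRegressor(hidden_layer_sizes=(64,),
                     activation="relu",
                     random_state=47,   # fixed seed for reproducible weight initialisation
                     max_iter=2500).fit(scaled_x_train, y_train)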



In [None]:
y_pred = model.predict(scaled_x_test)
y_pred

array([ 266798.2856568 ,  235864.65963774,  540927.9618869 , ...,
        332927.32610625,  656785.72791911, 1079822.58650042])

In [None]:
r2_score(y_test, y_pred)

0.7269359612725709

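The R² score is at most 1 (100%) and can be negative when a model does worse than always predicting the mean price. It is closely related to the MSE: $R^2 = 1 - \mathrm{MSE} / \mathrm{Var}(y)$. A score of 1 means the predictions match the true values exactly.

In our case, the model explains about 73% of the variance in the test-set prices.
The remaining 27% is **unexplained variance**, which is not the same thing as an error rate.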
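To make the metric concrete, here is a small sketch that computes R² from its definition and compares it with `r2_score`, and also reports the mean absolute error (imported at the top of the notebook but not used so far). The arrays `y_true` and `y_hat` are made-up numbers, not results from the model above.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, r2_score

# Hypothetical prices and predictions, for illustration only.
y_true = np.array([200000.0, 350000.0, 500000.0, 650000.0])
y_hat = np.array([220000.0, 330000.0, 540000.0, 610000.0])

# R^2 = 1 - (residual sum of squares) / (total sum of squares around the mean)
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

# Equivalent form in terms of the MSE and the (population) variance of y.
mse = np.mean((y_true - y_hat) ** 2)
r2_from_mse = 1.0 - mse / np.var(y_true)

# MAE is in the same unit as the target, so it reads as a typical error in dollars.
mae = mean_absolute_error(y_true, y_hat)

print(r2_manual, r2_score(y_true, y_hat), r2_from_mse, mae)
```

For this notebook, the same numbers would come from `r2_score(y_test, y_pred)` and `mean_absolute_error(y_test, y_pred)`. The MAE is often easier to interpret than R² because it is expressed in the unit of the price column.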