How to train and test simple regression models using Scikit-learn in Jupyter Notebook

1. Import libraries and modules
2. Import datasets
3. Pre-process dataset (cleaning, missing data, encoding)
4. Train models
5. Test models

In [45]:
import pandas as pd

In [57]:
df = pd.read_csv("HousePriceExample.csv")

In [59]:
df

Unnamed: 0,Size,Rooms,Price,Unnamed: 3
0,100,2,137.556462,
1,110,2,147.266947,
2,210,3,277.823882,
3,150,2,201.293308,
4,170,3,228.43209,
5,101,2,136.208504,
6,120,2,160.248469,
7,160,3,216.177111,
8,180,3,241.515656,
9,250,3,331.809213,


In [61]:
df.head()

Unnamed: 0,Size,Rooms,Price,Unnamed: 3
0,100,2,137.556462,
1,110,2,147.266947,
2,210,3,277.823882,
3,150,2,201.293308,
4,170,3,228.43209,


In [62]:
df.tail()

Unnamed: 0,Size,Rooms,Price,Unnamed: 3
15,95,1,128.432274,
16,145,2,193.969353,
17,165,3,222.637396,
18,133,2,180.015644,
19,206,3,275.600622,


In [63]:
df.shape

(20, 4)

In [64]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20 entries, 0 to 19
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Size        20 non-null     int64  
 1   Rooms       20 non-null     int64  
 2   Price       20 non-null     float64
 3   Unnamed: 3  0 non-null      float64
dtypes: float64(2), int64(2)
memory usage: 768.0 bytes


In [65]:
df.describe()

Unnamed: 0,Size,Rooms,Price,Unnamed: 3
count,20.0,20.0,20.0,0.0
mean,159.25,2.35,213.261857,
std,42.077466,0.587143,55.163494,
min,95.0,1.0,128.432274,
25%,129.75,2.0,175.073851,
50%,160.0,2.0,213.957543,
75%,185.0,3.0,247.285431,
max,250.0,3.0,331.809213,


In [66]:
# Drop the "Unnamed: 3" column from the data (axis=1 targets columns, axis=0 targets rows)
df.drop("Unnamed: 3", axis=1, inplace=True)

In [67]:
df

Unnamed: 0,Size,Rooms,Price
0,100,2,137.556462
1,110,2,147.266947
2,210,3,277.823882
3,150,2,201.293308
4,170,3,228.43209
5,101,2,136.208504
6,120,2,160.248469
7,160,3,216.177111
8,180,3,241.515656
9,250,3,331.809213
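When a column is completely empty, as `df.info()` reports for `Unnamed: 3`, it can also be removed without naming it. The following is a minimal sketch, assuming a frame rebuilt by hand from the first five rows shown earlier (the notebook itself reads them from the CSV file):

```python
import numpy as np
import pandas as pd

# First five rows from the notebook output; "Unnamed: 3" is entirely empty.
df = pd.DataFrame({
    "Size": [100, 110, 210, 150, 170],
    "Rooms": [2, 2, 3, 2, 3],
    "Price": [137.556462, 147.266947, 277.823882, 201.293308, 228.43209],
    "Unnamed: 3": [np.nan] * 5,
})

# Drop every column whose values are all missing and return a new frame
cleaned = df.dropna(axis=1, how="all")
print(list(cleaned.columns))
```

Because `dropna` returns a new frame here, the original `df` is left untouched and the cell can be re-run safely. By contrast, `df.drop(..., inplace=True)` raises a `KeyError` on a second run, since the column is already gone.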


In [72]:
# Check for null values in the DataFrame
df.isnull().sum()

Size      0
Rooms     0
Price     0
dtype: int64

Import models and model selection modules

In [75]:
# Import scikit-learn's linear regression, neural network, and random forest regressors
from sklearn.linear_model import LinearRegression
from sklearn.neural_network import MLPRegressor
from sklearn.ensemble import RandomForestRegressor

# Import training and testing module
from sklearn.model_selection import train_test_split

In [90]:
# Instantiate the models with default hyperparameters
model_LR = LinearRegression()
model_NN = MLPRegressor()     # multi-layer perceptron (neural network)
model_RF = RandomForestRegressor()
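With default hyperparameters, the random forest and the neural network both involve randomness, so their scores can change from run to run. Both estimators accept a `random_state` argument. A small sketch of its effect, using synthetic stand-in data rather than the house-price file (the data and the names `rf_a` and `rf_b` are assumptions for illustration):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in data (NOT the house-price file): price grows with size
rng = np.random.default_rng(0)
size = rng.uniform(90, 250, size=20)
rooms = rng.integers(1, 4, size=20)
X = np.column_stack([size, rooms])
y = 1.3 * size + rng.normal(0, 2, size=20)

# Two forests given the same seed and the same data are built identically
rf_a = RandomForestRegressor(random_state=1).fit(X, y)
rf_b = RandomForestRegressor(random_state=1).fit(X, y)

pred_a = rf_a.predict(X)
pred_b = rf_b.predict(X)
print(np.array_equal(pred_a, pred_b))
```

The same idea should apply to `MLPRegressor(random_state=...)`, which fixes the weight initialization.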

Split the data into training and test sets

In [111]:
# Split the dataset into input (X) and output (Y) features
X = df.iloc[:, 0:2]   # all rows, columns 0 and 1 (Size, Rooms)
Y = df.iloc[:, 2:3]   # all rows, column 2 (Price), kept as a one-column DataFrame

In [112]:
X

Unnamed: 0,Size,Rooms
0,100,2
1,110,2
2,210,3
3,150,2
4,170,3
5,101,2
6,120,2
7,160,3
8,180,3
9,250,3


In [114]:
Y

Unnamed: 0,Price
0,137.556462
1,147.266947
2,277.823882
3,201.293308
4,228.43209
5,136.208504
6,160.248469
7,216.177111
8,241.515656
9,331.809213
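Positional slicing with `iloc` depends on the column order staying fixed. Selecting by column name gives the same result and may be easier to read. A minimal sketch, assuming a frame rebuilt by hand from the first five rows of the output (`X_by_name` and `Y_by_name` are illustrative names):

```python
import pandas as pd

# First five rows from the notebook output, after the empty column was dropped
df = pd.DataFrame({
    "Size": [100, 110, 210, 150, 170],
    "Rooms": [2, 2, 3, 2, 3],
    "Price": [137.556462, 147.266947, 277.823882, 201.293308, 228.43209],
})

# Double brackets keep the result a DataFrame, matching the iloc slices
X_by_name = df[["Size", "Rooms"]]
Y_by_name = df[["Price"]]

print(X_by_name.equals(df.iloc[:, 0:2]))
print(Y_by_name.equals(df.iloc[:, 2:3]))
```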


In [115]:
# Then split the dataset into training and test sets (random_state=1 for reproducibility)
Xtrain, Xtest, Ytrain, Ytest = train_test_split(X, Y, test_size=0.20, random_state=1)

In [116]:
# Notice how the split changes the data: the test set holds 4 of the 20 rows, picked at random, with their original index labels

In [117]:
Xtest

Unnamed: 0,Size,Rooms
3,150,2
16,145,2
6,120,2
10,200,2


In [118]:
Ytest

Unnamed: 0,Price
3,201.293308
16,193.969353
6,160.248469
10,264.594759
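The effect of `test_size=0.20` and `random_state=1` can be checked directly. The sketch below uses placeholder arrays with 20 rows rather than the house-price data, and confirms that the split is 16/4 and that repeating it with the same seed gives the same rows:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with 20 rows (NOT the house-price values)
X = np.arange(40).reshape(20, 2)
y = np.arange(20)

# Run the same split twice with the same seed
first = train_test_split(X, y, test_size=0.20, random_state=1)
second = train_test_split(X, y, test_size=0.20, random_state=1)

X_train, X_test, y_train, y_test = first
print(X_train.shape, X_test.shape)

# Every returned array matches between the two runs
same_split = all(np.array_equal(a, b) for a, b in zip(first, second))
print(same_split)
```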


Train our 3 models

In [124]:
# fit() expects a 1-D target, so flatten the one-column DataFrame with .values.ravel()
# See https://stackoverflow.com/questions/34165731/a-column-vector-y-was-passed-when-a-1d-array-was-expected
model_LR.fit(Xtrain, Ytrain.values.ravel())
model_NN.fit(Xtrain, Ytrain.values.ravel())
model_RF.fit(Xtrain, Ytrain.values.ravel())

RandomForestRegressor()
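The `.values.ravel()` call turns a one-column DataFrame into a flat array. A short sketch of the shape change, using the four prices shown in the test-set output as sample values (the training cell applies the same call to `Ytrain`):

```python
import pandas as pd

# Four prices from the test-set output, kept as a one-column DataFrame
Y = pd.DataFrame({"Price": [201.293308, 193.969353, 160.248469, 264.594759]})

column_vector = Y.values      # 2-D array: one row per sample, one column
flat = Y.values.ravel()       # 1-D array: one entry per sample

print(column_vector.shape, flat.shape)
```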

In [126]:
# Testing our models
# score() returns R-squared (the coefficient of determination) on the test set
# Values closer to 1 mean better predictions on unseen data, so the scores compare how well each model generalizes
print(model_LR.score(Xtest, Ytest))
print(model_NN.score(Xtest, Ytest))
print(model_RF.score(Xtest, Ytest))

0.9992331749057306
0.9983149670724843
0.9734414788770939
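For a regressor, `score()` reports R-squared, which is one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below computes it by hand and compares it with `r2_score`. The true values are the four test-set prices from the output, while the predictions are hypothetical numbers chosen only for illustration:

```python
import numpy as np
from sklearn.metrics import r2_score

# True prices from the test-set output
y_true = np.array([201.293308, 193.969353, 160.248469, 264.594759])
# Hypothetical predictions, for illustration only (not any model's output)
y_pred = np.array([200.0, 195.0, 161.0, 263.0])

ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
r2_manual = 1 - ss_res / ss_tot

print(r2_manual, r2_score(y_true, y_pred))
```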


Import sklearn metrics

In [128]:
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error

In [129]:
LRpred = model_LR.predict(Xtest)
NNpred = model_NN.predict(Xtest)
RFpred = model_RF.predict(Xtest)

In [130]:
# scikit-learn metrics take the true values first, then the predictions
print(mean_squared_error(Ytest, LRpred))
print(mean_squared_error(Ytest, NNpred))
print(mean_squared_error(Ytest, RFpred))

1.0907409576755096
2.3968104888693698
37.777150200912146


The LR model has the lowest MSE of the 3 models.

In [131]:
print(mean_absolute_error(Ytest, LRpred))
print(mean_absolute_error(Ytest, NNpred))
print(mean_absolute_error(Ytest, RFpred))

0.8757389110900817
1.1425996003269177
5.862079645999998


The LR model has the lowest mean absolute error of the 3 models.
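Both error metrics used above can be reproduced with a few lines of NumPy, which makes clear what each one measures. In this sketch the true values are the four test-set prices from the output, and the predictions are hypothetical numbers chosen only for illustration:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# True prices from the test-set output
y_true = np.array([201.293308, 193.969353, 160.248469, 264.594759])
# Hypothetical predictions, for illustration only (not any model's output)
y_pred = np.array([200.0, 195.0, 161.0, 263.0])

errors = y_true - y_pred
mse = np.mean(errors ** 2)       # mean squared error: penalizes large misses more
mae = np.mean(np.abs(errors))    # mean absolute error: average size of a miss
rmse = np.sqrt(mse)              # root of MSE, in the same units as Price

print(mse, mean_squared_error(y_true, y_pred))
print(mae, mean_absolute_error(y_true, y_pred))
print(rmse)
```

MSE is expressed in squared price units, so taking its square root (RMSE) gives a figure that can be read alongside MAE on the same scale.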