# Chapter 6

## Solving Regression Problems in Machine Learning Using `sklearn`

viewing the datasets in the seaborn library

In [1]:
import seaborn as sns
sns.get_dataset_names()

['anagrams',
 'anscombe',
 'attention',
 'brain_networks',
 'car_crashes',
 'diamonds',
 'dots',
 'dowjones',
 'exercise',
 'flights',
 'fmri',
 'geyser',
 'glue',
 'healthexp',
 'iris',
 'mpg',
 'penguins',
 'planets',
 'seaice',
 'taxis',
 'tips',
 'titanic']

loading in the tips dataset

In [2]:
tips_df = sns.load_dataset("tips")
tips_df.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
0,16.99,1.01,Female,No,Sun,Dinner,2
1,10.34,1.66,Male,No,Sun,Dinner,3
2,21.01,3.5,Male,No,Sun,Dinner,3
3,23.68,3.31,Male,No,Sun,Dinner,2
4,24.59,3.61,Female,No,Sun,Dinner,4


loading in the diamonds dataset

In [3]:
diamond_df = sns.load_dataset("diamonds")
diamond_df.head()

Unnamed: 0,carat,cut,color,clarity,depth,table,price,x,y,z
0,0.23,Ideal,E,SI2,61.5,55.0,326,3.95,3.98,2.43
1,0.21,Premium,E,SI1,59.8,61.0,326,3.89,3.84,2.31
2,0.23,Good,E,VS1,56.9,65.0,327,4.05,4.07,2.31
3,0.29,Premium,I,VS2,62.4,58.0,334,4.2,4.23,2.63
4,0.31,Good,J,SI2,63.3,58.0,335,4.34,4.35,2.75


this chapter will focus on using the tips dataset
- we will make an algorithm to predict the tip for a particular record
- this will be based on the other features such as total_bill, sex, day, time etc

first we divide data into features and labels
- labels are the values from the tip column
- features are values from the remaining columns


In [4]:
import numpy as np
import seaborn as sns

tips_df = sns.load_dataset("tips")

#this is our feature set
X = tips_df.drop(["tip"], axis=1)

#this is our label set
Y = tips_df["tip"]


machine learning algorithmns mostly work with numbers
- so it is important to convert categorical data into a numerical format
- the first step here is to create a numerical dataset

In [5]:
#here we are dropping the categorical columns from the feature set
numerical = X.drop(["sex","smoker","day","time"], axis=1)

numerical.head()

Unnamed: 0,total_bill,size
0,16.99,2
1,10.34,3
2,21.01,3
3,23.68,2
4,24.59,4


In [6]:
#next we are creating a dataframe containing only categorical columns
categorical = X.filter(["sex","smoker","day","time"])
categorical.head()

Unnamed: 0,sex,smoker,day,time
0,Female,No,Sun,Dinner
1,Male,No,Sun,Dinner
2,Male,No,Sun,Dinner
3,Male,No,Sun,Dinner
4,Female,No,Sun,Dinner


converting a categorical column to a numerical one can be done via one-hot encoding
- for every unique value in the original columns, a new column is created
- for the sex column, a new male and female column is made, with a 1 added in the correct respective columns
- but in this case we don't need two columns, a single one for Female is enough when a customer is Female
- we can add a 1 or 0 in that column
- hence we need N-1 one-hot encoded columns for all the N values in the original column
- this code converts categorical columns into one-hot encoded columns

In [7]:
import pandas as pd
cat_numerical = pd.get_dummies(categorical,drop_first=True)
cat_numerical.head()

Unnamed: 0,sex_Female,smoker_No,day_Fri,day_Sat,day_Sun,time_Dinner
0,True,True,False,False,True,True
1,False,True,False,False,True,True
2,False,True,False,False,True,True
3,False,True,False,False,True,True
4,True,True,False,False,True,True


In [8]:
#the next step is to join both of the numerical columns with the one-hot encoded columns
X = pd.concat([numerical, cat_numerical], axis=1)
X.head()

Unnamed: 0,total_bill,size,sex_Female,smoker_No,day_Fri,day_Sat,day_Sun,time_Dinner
0,16.99,2,True,True,False,False,True,True
1,10.34,3,False,True,False,False,True,True
2,21.01,3,False,True,False,False,True,True
3,23.68,2,False,True,False,False,True,True
4,24.59,4,True,True,False,False,True,True


we are now splitting the data into a training set and a test set
- this code divides data into an 80% training set and a 20% test set

In [9]:
from sklearn.model_selection import train_test_split

X_train, X_test, Y_train, Y_test = train_test_split(X,Y,test_size=0.20,random_state=0)

the final step before the data is passed for machine learning is to scale the data
- some of the columns in the dataset have small values, whilst others are very large values
- it is better to convert all the values to a uniform scale

In [10]:
from sklearn.preprocessing import StandardScaler

sc = StandardScaler()

#this is scaling the training set
X_train = sc.fit_transform(X_train)

#this is scaling the test set
X_test = sc.fit_transform(X_test)

here we will implement a linear regression algorithmn
- the training and test sets that we setup earlier will be used

In [11]:
from sklearn.linear_model import LinearRegression

lin_reg = LinearRegression()
#training the algorithmn
regressor = lin_reg.fit(X_train,Y_train)

#making predictions on test set
Y_pred = regressor.predict(X_test)

once the model has been trained on the test set, we need to know how well it has preformed for making predictions on the unknown set. There are various metrics to check this, but the following are the most common:
- mean absolute error, mean squared error and root mean squared error

In [12]:
from sklearn import metrics

print("Mean Absolute Error: ", metrics.mean_absolute_error(Y_test,Y_pred))
print("Mean Squared Error: ",metrics.mean_squared_error(Y_test,Y_pred))
print("Root Mean Squared Error: ", np.sqrt(metrics.mean_squared_error(Y_test,Y_pred)))


Mean Absolute Error:  0.6366762541802369
Mean Squared Error:  0.7159134231087634
Root Mean Squared Error:  0.8461166722791623


KNN stands for K-nearest neighbors 

this code uses KNN regression to predict valyes for the tip column

KNN is a lazy learning algorithm which is based on finding Euclidean distance between different data points

the random forest algorithm is useful when:
- 1 - KNN doesn't assume any relationship between features
- 2 - useful for a dataset where data localisation is important
- 3 - only have to tune the parameter K, which is the number of nearest neighbors
- 4 - no training is needed, as it is a lazy learning algorithm
- 5 - recinnebder systems and findinding semantic similarity between the documents are major applications of KNN

Disadvantages of KNN: 
- 1 - you have to find the optimal value of K, this is not easy
- 2 - not very suitable for high dimensional data

In [13]:
from sklearn.neighbors import KNeighborsRegressor
KNN_reg = KNeighborsRegressor(n_neighbors=5)
regressor = KNN_reg.fit(X_train, Y_train)

Y_pred = regressor.predict(X_test)

from sklearn import metrics

print("Mean Absolute Error: ", metrics.mean_absolute_error(Y_test,Y_pred))
print("Mean Squared Error: ",metrics.mean_squared_error(Y_test,Y_pred))
print("Root Mean Squared Error: ", np.sqrt(metrics.mean_squared_error(Y_test,Y_pred)))

Mean Absolute Error:  0.7077959183673468
Mean Squared Error:  0.8681808163265307
Root Mean Squared Error:  0.9317622101837628


Random forest regression

a tree-based algorithm that converts features into tree nodes and uses entropy loss to make predictions

Random forest is useful for:
- 1 - when you have missing or imbalanced data
- 2 - with large numbers of trees you can avoid overfitting while training
- 3 - the random forest algorithm can be used when you have very high dimensional data
- 4 - through cross-validation, the random forest algorithm can return higher accuracy
- 5 - it can solve classification and regression tasks and finds its application in a variety of tasks

Disadvantages:
- 1 - using a large number of trees can slow down the algorithm
- 2 - random forest algorithm is predictive, which can only predict the future and not what happened in the past

In [14]:
from sklearn.ensemble import RandomForestRegressor
rf_reg = RandomForestRegressor(random_state=42,n_estimators=500)
regressor = rf_reg.fit(X_train,Y_train)
y_pred = regressor.predict(X_test)

from sklearn import metrics

print("Mean Absolute Error: ", metrics.mean_absolute_error(Y_test,Y_pred))
print("Mean Squared Error: ",metrics.mean_squared_error(Y_test,Y_pred))
print("Root Mean Squared Error: ", np.sqrt(metrics.mean_squared_error(Y_test,Y_pred)))

Mean Absolute Error:  0.7077959183673468
Mean Squared Error:  0.8681808163265307
Root Mean Squared Error:  0.9317622101837628


Support Vector Regression

support vector machine is classification and regression algorithmns

this minismises the error between the actual predictions and predicted predictions by maximising the distance between hyperplanes that contain data for various records

SVR is a support vector machine (SVM) for regression

SVM advantages:
- 1 - can be used to perform regression or classification with high dimensional data
- 2 - with the kernel trick, SVM is capable of applying regression and classification to non-linear datasets
- 3 - SVM algorithms are commonly used for ordinal classificaiton or regression, and this is why they are commonly known as ranking algorithmns

SVM disadvantages:
- 1 - lots of parameters to be optimised in order to get the best performance
- 2 - training can take a long time on large datasets
- 3 - yields poor results if the number of features is greater than the number of records in a dataset

In [15]:
from sklearn import svm
svm_reg = svm.SVR()

regressor = svm_reg.fit(X_train,Y_train)
Y_pred = regressor.predict(X_test)

from sklearn import metrics

print("Mean Absolute Error: ", metrics.mean_absolute_error(Y_test,Y_pred))
print("Mean Squared Error: ",metrics.mean_squared_error(Y_test,Y_pred))
print("Root Mean Squared Error: ", np.sqrt(metrics.mean_squared_error(Y_test,Y_pred)))

Mean Absolute Error:  0.6627753026041159
Mean Squared Error:  0.8815889441384774
Root Mean Squared Error:  0.938929680081782


K Fold Cross-Validation

earlier we divided data into 80% training and 20% test data, but this means that the 20% is never used for training

for more stable results all parts of the dataset should be used at least once for testing and for training

K-Fold cross-validation can be used to do this

the data is divided into K parts:
- e.g. K-1 parts are used for training and the Kth part for testing
- e.g. in a 5-fold cross validation: data is divided into 5 equal parts: K1 to K5
- the first time around K1-K4 are used for training and K5 for testing
- the second time around all but K4 might be used for training

In [16]:
from sklearn.model_selection import cross_val_score

print(cross_val_score(regressor,X,Y,cv=5,scoring="neg_mean_absolute_error"))

#the output shows the mean absolute error for each of the K folds

[-0.66386205 -0.57007269 -0.63598762 -0.96960743 -0.87391702]


In [17]:
#making predictions on a single record
#here we are selected the 100th record from the tips dataset

tips_df.loc[100]
#the output shows the value of the tip in the 100th record is 2.5

total_bill     11.35
tip              2.5
sex           Female
smoker           Yes
day              Fri
time          Dinner
size               2
Name: 100, dtype: object

In [18]:
#we will try to predict the value of the 100th record tip using the random forest regressor
#note - you have to scale your single record before it can be used as input to your machine learning algorithm

from sklearn.ensemble import RandomForestRegressor

ref_reg = RandomForestRegressor(random_state=42,n_estimators=500)
regressor = rf_reg.fit(X_train,Y_train)

single_record = sc.transform(X.values[100].reshape(1,-1))
predicted_tip = regressor.predict(single_record)
print(predicted_tip)

[1.80906]




In [19]:
#to use the diamnods dataset from seaborn to train a regression algorithm which predicts
#the price of the diamond
#also I need to perform the preprocessing steps

import pandas as pd
import numpy as np
import seaborn as sns

#loading in the diamonds dataset
diamonds_df = sns.load_dataset("diamonds")
diamonds_df.head()

Unnamed: 0,carat,cut,color,clarity,depth,table,price,x,y,z
0,0.23,Ideal,E,SI2,61.5,55.0,326,3.95,3.98,2.43
1,0.21,Premium,E,SI1,59.8,61.0,326,3.89,3.84,2.31
2,0.23,Good,E,VS1,56.9,65.0,327,4.05,4.07,2.31
3,0.29,Premium,I,VS2,62.4,58.0,334,4.2,4.23,2.63
4,0.31,Good,J,SI2,63.3,58.0,335,4.34,4.35,2.75


In [20]:
#first we divide data into features and labels
#labels are the values from the tip column
#features are values from the remaining columns

#this is our feature set
DX = diamonds_df.drop(["price"], axis=1)

#this is our label set
DY = diamonds_df["price"]

DX.head()

Unnamed: 0,carat,cut,color,clarity,depth,table,x,y,z
0,0.23,Ideal,E,SI2,61.5,55.0,3.95,3.98,2.43
1,0.21,Premium,E,SI1,59.8,61.0,3.89,3.84,2.31
2,0.23,Good,E,VS1,56.9,65.0,4.05,4.07,2.31
3,0.29,Premium,I,VS2,62.4,58.0,4.2,4.23,2.63
4,0.31,Good,J,SI2,63.3,58.0,4.34,4.35,2.75


In [21]:
#machine learning algorithmns mostly work with numbers
#so it is important to convert categorical data into a numerical format
#the first step here is to create a numerical dataset

#here we are dropping the categorical columns from the feature set
Dnumerical = DX.drop(["cut","color","clarity"], axis=1)

Dnumerical.head()

Unnamed: 0,carat,depth,table,x,y,z
0,0.23,61.5,55.0,3.95,3.98,2.43
1,0.21,59.8,61.0,3.89,3.84,2.31
2,0.23,56.9,65.0,4.05,4.07,2.31
3,0.29,62.4,58.0,4.2,4.23,2.63
4,0.31,63.3,58.0,4.34,4.35,2.75


In [22]:
#next we are creating a dataframe containing only categorical columns
Dcategorical = DX.filter(["cut","color","clarity"])
Dcategorical.head()

Unnamed: 0,cut,color,clarity
0,Ideal,E,SI2
1,Premium,E,SI1
2,Good,E,VS1
3,Premium,I,VS2
4,Good,J,SI2


In [23]:
#converting a categorical column to a numerical one can be done via one-hot encoding
#for every unique value in the original columns, a new column is created
#we can add a 1 or 0 in that column
#hence we need N-1 one-hot encoded columns for all the N values in the original column
#this code converts categorical columns into one-hot encoded columns

Dcat_numerical = pd.get_dummies(Dcategorical,drop_first=True)
Dcat_numerical.head()

Unnamed: 0,cut_Premium,cut_Very Good,cut_Good,cut_Fair,color_E,color_F,color_G,color_H,color_I,color_J,clarity_VVS1,clarity_VVS2,clarity_VS1,clarity_VS2,clarity_SI1,clarity_SI2,clarity_I1
0,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,True,False
1,True,False,False,False,True,False,False,False,False,False,False,False,False,False,True,False,False
2,False,False,True,False,True,False,False,False,False,False,False,False,True,False,False,False,False
3,True,False,False,False,False,False,False,False,True,False,False,False,False,True,False,False,False
4,False,False,True,False,False,False,False,False,False,True,False,False,False,False,False,True,False


In [24]:
#the next step is to join both of the numerical columns with the one-hot encoded columns

DX = pd.concat([Dnumerical, Dcat_numerical], axis=1)
DX.head()
#this output shows our raw features dataset

Unnamed: 0,carat,depth,table,x,y,z,cut_Premium,cut_Very Good,cut_Good,cut_Fair,...,color_H,color_I,color_J,clarity_VVS1,clarity_VVS2,clarity_VS1,clarity_VS2,clarity_SI1,clarity_SI2,clarity_I1
0,0.23,61.5,55.0,3.95,3.98,2.43,False,False,False,False,...,False,False,False,False,False,False,False,False,True,False
1,0.21,59.8,61.0,3.89,3.84,2.31,True,False,False,False,...,False,False,False,False,False,False,False,True,False,False
2,0.23,56.9,65.0,4.05,4.07,2.31,False,False,True,False,...,False,False,False,False,False,True,False,False,False,False
3,0.29,62.4,58.0,4.2,4.23,2.63,True,False,False,False,...,False,True,False,False,False,False,True,False,False,False
4,0.31,63.3,58.0,4.34,4.35,2.75,False,False,True,False,...,False,False,True,False,False,False,False,False,True,False


In [25]:
#we are now splitting the data into a training set and a test set
#this code divides data into an 80% training set and a 20% test set

from sklearn.model_selection import train_test_split

DX_train, DX_test, DY_train, DY_test = train_test_split(DX,DY,test_size=0.20,random_state=0)

In [26]:
#the final step before the data is passed for machine learning is to scale the data
#some of the columns in the dataset have small values, whilst others are very large values
#it is better to convert all the values to a uniform scale

from sklearn.preprocessing import StandardScaler

sc = StandardScaler()

#this is scaling the training set
DX_train = sc.fit_transform(DX_train)

#this is scaling the test set
DX_test = sc.fit_transform(DX_test)

In [27]:
#here we will try support vector regression on the diamonds dataset

from sklearn import svm
svm_reg = svm.SVR()

regressor = svm_reg.fit(DX_train,DY_train)
DY_pred = regressor.predict(DX_test)

from sklearn import metrics

print("Mean Absolute Error: ", metrics.mean_absolute_error(DY_test,DY_pred))
print("Mean Squared Error: ",metrics.mean_squared_error(DY_test,DY_pred))
print("Root Mean Squared Error: ", np.sqrt(metrics.mean_squared_error(DY_test,DY_pred)))
#the output shows this is pretty rubbish

Mean Absolute Error:  1675.2173572292245
Mean Squared Error:  9929484.880939929
Root Mean Squared Error:  3151.108516211387


In [28]:
#here we will try a linear regression algorithmn on the diamonds dataset

from sklearn.linear_model import LinearRegression

lin_reg = LinearRegression()
#training the algorithmn
regressor = lin_reg.fit(DX_train,DY_train)

#making predictions on test set
DY_pred = regressor.predict(DX_test)

print("Mean Absolute Error: ", metrics.mean_absolute_error(DY_test,DY_pred))
print("Mean Squared Error: ",metrics.mean_squared_error(DY_test,DY_pred))
print("Root Mean Squared Error: ", np.sqrt(metrics.mean_squared_error(DY_test,DY_pred)))
#this result is still quite bad

Mean Absolute Error:  743.0755520116228
Mean Squared Error:  1250081.0817130487
Root Mean Squared Error:  1118.0702490063175


In [29]:
#here we try a Random forest regression on the diamonds dataset

from sklearn.ensemble import RandomForestRegressor
rf_reg = RandomForestRegressor(random_state=42,n_estimators=500)
regressor = rf_reg.fit(DX_train,DY_train)
DY_pred = regressor.predict(DX_test)

print("Mean Absolute Error: ", metrics.mean_absolute_error(DY_test,DY_pred))
print("Mean Squared Error: ",metrics.mean_squared_error(DY_test,DY_pred))
print("Root Mean Squared Error: ", np.sqrt(metrics.mean_squared_error(DY_test,DY_pred)))

#the best result yet but still off by a significant margin

Mean Absolute Error:  292.44894141727144
Mean Squared Error:  350089.85085841134
Root Mean Squared Error:  591.6839112722361


In [31]:
#KNN stands for K-nearest neighbors, we will try this for the diamonds dataset

from sklearn.neighbors import KNeighborsRegressor
KNN_reg = KNeighborsRegressor(n_neighbors=4)
regressor = KNN_reg.fit(DX_train, DY_train)

DY_pred = regressor.predict(DX_test)

print("Mean Absolute Error: ", metrics.mean_absolute_error(DY_test,DY_pred))
print("Mean Squared Error: ",metrics.mean_squared_error(DY_test,DY_pred))
print("Root Mean Squared Error: ", np.sqrt(metrics.mean_squared_error(DY_test,DY_pred)))

#the second best result so far

Mean Absolute Error:  419.68135891731555
Mean Squared Error:  694051.7020184464
Root Mean Squared Error:  833.0976545510415
