# DATA 201 - Week 7 - Lab

This lab gives students practice with data transformation, feature scaling, pipelines, and linear regression methods with regularization. Ensemble learning is also introduced in this lab.

If there are any questions, please contact Binh Nguyen (Email: binh.p.nguyen@vuw.ac.nz).

We will use the dataset `cars.csv`, which is attached to this notebook.

First, let's import some Python modules and load the data.

In [1]:
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, PolynomialFeatures
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.metrics import mean_squared_error

In [2]:
data = pd.read_csv("cars.csv")

In [3]:
data.head(10)

Unnamed: 0,horsepower,highway-mpg,symboling,normalized-losses,make,fuel-type,aspiration,num-of-doors,body-style,drive-wheels,...,engine-type,num-of-cylinders,engine-size,fuel-system,bore,stroke,compression-ratio,peak-rpm,city-mpg,price
0,95.0,24.0,0.0,120.232558,peugot,gas,std,four,wagon,rwd,...,l,four,120.0,mpfi,3.46,2.19,8.4,5000.0,19.0,16695.0
1,116.0,30.0,2.0,134.0,toyota,gas,std,two,hardtop,rwd,...,ohc,four,146.0,mpfi,3.62,3.5,9.3,4800.0,24.0,11199.0
2,121.0,28.0,0.0,188.0,bmw,gas,std,two,sedan,rwd,...,ohc,six,164.0,mpfi,3.31,3.19,9.0,4250.0,21.0,20970.0
3,184.0,16.0,0.0,120.232558,mercedes-benz,gas,std,four,sedan,rwd,...,ohcv,eight,308.0,mpfi,3.8,3.35,8.0,4500.0,14.0,40960.0
4,111.0,29.0,0.0,102.0,subaru,gas,turbo,four,sedan,4wd,...,ohcf,four,108.0,mpfi,3.62,2.64,7.7,4800.0,24.0,11259.0
5,70.0,43.0,0.0,81.0,chevrolet,gas,std,four,sedan,fwd,...,ohc,four,90.0,2bbl,3.03,3.11,9.6,5400.0,38.0,6575.0
6,97.0,24.0,0.0,161.0,peugot,gas,std,four,sedan,rwd,...,l,four,120.0,mpfi,3.46,3.19,8.4,5000.0,19.0,11900.0
7,140.0,20.0,1.0,158.0,audi,gas,turbo,four,sedan,fwd,...,ohc,five,131.0,mpfi,3.13,3.4,8.3,5500.0,17.0,23875.0
8,86.0,33.0,0.0,85.0,honda,gas,std,four,sedan,fwd,...,ohc,four,110.0,1bbl,3.15,3.58,9.0,5800.0,27.0,8845.0
9,69.0,37.0,1.0,128.0,nissan,gas,std,two,sedan,fwd,...,ohc,four,97.0,2bbl,3.15,3.29,9.4,5200.0,31.0,5499.0


In [4]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 205 entries, 0 to 204
Data columns (total 26 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   horsepower         205 non-null    float64
 1   highway-mpg        205 non-null    float64
 2   symboling          205 non-null    float64
 3   normalized-losses  205 non-null    float64
 4   make               205 non-null    object 
 5   fuel-type          205 non-null    object 
 6   aspiration         205 non-null    object 
 7   num-of-doors       205 non-null    object 
 8   body-style         205 non-null    object 
 9   drive-wheels       205 non-null    object 
 10  engine-location    205 non-null    object 
 11  wheel-base         205 non-null    float64
 12  length             205 non-null    float64
 13  width              205 non-null    float64
 14  height             205 non-null    float64
 15  curb-weight        205 non-null    float64
 16  engine-type        205 non

As you can see, the dataset has 205 samples and no missing values.

** `symboling` should not be `float64`. Write code to convert this field to an `object` type. **

In [5]:
data["symboling"] = data["symboling"].astype(object)

** Convert all values in `num-of-doors` from words to numbers **

In [6]:
data["num-of-doors"].unique()

array(['four', 'two'], dtype=object)

In [7]:
data["num-of-doors"] = data["num-of-doors"].apply(lambda i: {'four':4, 'two':2}[i])

** Similarly, convert all values in `num-of-cylinders` from words to numbers **

In [8]:
data["num-of-cylinders"].unique()

array(['four', 'six', 'eight', 'five'], dtype=object)

In [9]:
data["num-of-cylinders"] = data["num-of-cylinders"].apply(lambda i: {'four':4, 'six':6, 'eight':8, 'five':5}[i])
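The two word-to-number conversions above can also be written with `Series.map`, which accepts the dictionary directly. The sketch below uses a toy series whose values are illustrative and not taken from `cars.csv`. One difference to keep in mind: a word missing from the dictionary becomes `NaN` with `map`, whereas the dictionary lookup inside `apply` raises a `KeyError`.

```python
import pandas as pd

# Toy column standing in for `num-of-doors` (illustrative values only)
doors = pd.Series(["four", "two", "four"])
word_to_number = {"four": 4, "two": 2}

# map() looks each value up in the dictionary
doors_numeric = doors.map(word_to_number)
print(doors_numeric.tolist())
```

Which behaviour is preferable depends on whether an unexpected word should stop the notebook or be handled later as a missing value.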

** Randomly split the given data into 2 subsets for training (80%) and test (20%). Use *random_state = 42*. **

In [10]:
from sklearn.model_selection import train_test_split

train, test = train_test_split(data, test_size = 0.2, random_state = 42)
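As a quick sanity check on the split sizes, the sketch below applies the same `test_size` and `random_state` to a toy frame with 205 rows (a stand-in for `data`, not the real file). With 205 samples, a 20% test share gives 41 test rows and 164 training rows.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame with the same number of rows as the cars dataset
toy = pd.DataFrame({"x": np.arange(205)})
toy_train, toy_test = train_test_split(toy, test_size=0.2, random_state=42)

print(len(toy_train), len(toy_test))

# The two subsets are disjoint and together cover every row
assert set(toy_train.index).isdisjoint(toy_test.index)
```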

** Let `price` be the target to predict and the remaining columns be the predictors (attributes). Create two pipelines: one for numerical attributes containing `MinMaxScaler`, and one for categorical attributes containing `OneHotEncoder`. Fit the two pipelines on the training data, then transform the training and test sets into `X_train` and `X_test`, respectively. Use `y_train` and `y_test` to store the targets of the two sets. **

*Note that all numerical values should be converted to `float` type for use with `MinMaxScaler`.*

In [11]:
train_features = train.drop(["price"], axis=1)
y_train = train["price"].copy()

In [12]:
test_features = test.drop(["price"], axis=1)
y_test = test["price"].copy()

In [13]:
num_attribs = list(train_features.select_dtypes(include=[np.number]))
cat_attribs = list(train_features.select_dtypes(include=['object']))

In [14]:
num_attribs

['horsepower',
 'highway-mpg',
 'normalized-losses',
 'num-of-doors',
 'wheel-base',
 'length',
 'width',
 'height',
 'curb-weight',
 'num-of-cylinders',
 'engine-size',
 'bore',
 'stroke',
 'compression-ratio',
 'peak-rpm',
 'city-mpg']

In [15]:
cat_attribs

['symboling',
 'make',
 'fuel-type',
 'aspiration',
 'body-style',
 'drive-wheels',
 'engine-location',
 'engine-type',
 'fuel-system']

In [16]:
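# Convert the numerical columns to float in both sets, as the note above requires
train_features[num_attribs] = train_features[num_attribs].astype(float)
test_features[num_attribs] = test_features[num_attribs].astype(float)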

In [17]:
num_pipeline = Pipeline([('scaler', MinMaxScaler())])

In [18]:
cat_pipeline = Pipeline([('onehot', OneHotEncoder())])

In [19]:
full_pipeline = ColumnTransformer([
    ("num", num_pipeline, num_attribs),
    ("cat", cat_pipeline, cat_attribs),
])

In [20]:
X_train = full_pipeline.fit_transform(train_features)

In [21]:
X_train.shape

(164, 65)

In [22]:
X_test = full_pipeline.transform(test_features)
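The fit-on-train, transform-on-test pattern above can be illustrated on a tiny example. The sketch below uses hypothetical toy columns (`size`, `fuel`) rather than the cars data. It also makes two assumptions that are not part of the lab solution: `handle_unknown="ignore"`, so that a category seen only in the test set does not raise an error, and `sparse_threshold=0`, so that the result is a dense array that is easy to print.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Toy data with hypothetical columns (not from cars.csv)
toy_train = pd.DataFrame({"size": [1.0, 2.0, 3.0], "fuel": ["gas", "diesel", "gas"]})
toy_test = pd.DataFrame({"size": [2.0, 4.0], "fuel": ["gas", "electric"]})

toy_transformer = ColumnTransformer(
    [
        ("num", MinMaxScaler(), ["size"]),
        # Assumption: unseen categories are encoded as all zeros instead of raising
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["fuel"]),
    ],
    sparse_threshold=0,  # always return a dense array
)

# The scaler range and the category list are learned from the training set only
toy_X_train = toy_transformer.fit_transform(toy_train)
toy_X_test = toy_transformer.transform(toy_test)
print(toy_X_test)
```

Because the scaler only knows the training range (1 to 3 here), the test value 4 is mapped to 1.5, outside [0, 1]. The unseen category `electric` yields a row of zeros in the one-hot columns.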

** The following function displays the performance of a regression model: **

In [23]:
def showPerformance(model):
    """Print RMSE and score of a fitted regressor on the training and test sets."""
    y_train_pred = model.predict(X_train)
    print("RMSE train: ", np.sqrt(mean_squared_error(y_train, y_train_pred)))

    y_test_pred = model.predict(X_test)
    print("RMSE test: ", np.sqrt(mean_squared_error(y_test, y_test_pred)))

    # For regressors, score() returns the coefficient of determination R^2
    print("Training set score: {:.2f}".format(model.score(X_train, y_train)))
    print("Test set score: {:.2f}".format(model.score(X_test, y_test)))
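To make the two reported metrics concrete, here is a small worked example on made-up numbers (not from the cars data). It computes RMSE and R^2 by hand and compares them with `mean_squared_error` and `r2_score`. For regressors, `score` reports the same R^2 quantity.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up targets and predictions
y_true = np.array([3.0, 5.0, 7.0])
y_pred = np.array([2.0, 5.0, 9.0])

# RMSE: square root of the mean squared error (squared errors are 1, 0, 4)
rmse_manual = np.sqrt(np.mean((y_true - y_pred) ** 2))
rmse_sklearn = np.sqrt(mean_squared_error(y_true, y_pred))

# R^2 = 1 - (residual sum of squares) / (total sum of squares) = 1 - 5/8
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(rmse_manual, rmse_sklearn)
print(r2_manual, r2_score(y_true, y_pred))
```

RMSE is expressed in the units of the target (here, price), while R^2 is unitless, which is why the function reports both.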

** Train a `LinearRegression` and evaluate its performance **

In [24]:
from sklearn.linear_model import LinearRegression

linear = LinearRegression()
linear.fit(X_train, y_train)

showPerformance(linear)

RMSE train:  867.1960281386857
RMSE test:  1584.6968549067103
Training set score: 0.98
Test set score: 0.98


** Use `RidgeCV` to find a suitable value for `alpha`, then evaluate the performance of the Ridge regressor. **

Documentation: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.RidgeCV.html#sklearn.linear_model.RidgeCV

In [25]:
from sklearn.linear_model import RidgeCV
ridge = RidgeCV(alphas=[1e-3, 1e-2, 1e-1, 1, 10], cv=5).fit(X_train, y_train)

In [26]:
print("alpha = ", ridge.alpha_)
showPerformance(ridge)

alpha =  0.1
RMSE train:  896.2516797825926
RMSE test:  1493.187041284025
Training set score: 0.98
Test set score: 0.98
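To see what `alpha` controls in Ridge regression, the sketch below fits plain `Ridge` models over the same candidate grid on synthetic data (the data and the names `X_toy`, `y_toy` are assumptions for illustration). A larger `alpha` applies a stronger L2 penalty, so the norm of the coefficient vector shrinks as `alpha` grows.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Synthetic regression problem (illustrative only)
rng = np.random.RandomState(0)
X_toy = rng.normal(size=(100, 5))
true_coef = np.array([3.0, -2.0, 0.0, 0.0, 1.0])
y_toy = X_toy @ true_coef + rng.normal(scale=0.5, size=100)

alphas = [1e-3, 1e-2, 1e-1, 1, 10]
coef_norms = []
for a in alphas:
    model = Ridge(alpha=a).fit(X_toy, y_toy)
    coef_norms.append(np.linalg.norm(model.coef_))
    print("alpha = {:>6}: ||coef|| = {:.4f}".format(a, coef_norms[-1]))
```

`RidgeCV` automates the choice among such candidates by comparing their cross-validated performance.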


** Use `LassoCV` to find a suitable value for `alpha`, then evaluate the performance of the Lasso regressor. **

Documentation: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LassoCV.html

In [27]:
from sklearn.linear_model import LassoCV
lasso = LassoCV(alphas=[1e-3, 1e-2, 1e-1, 1, 10], max_iter=10000, cv=5).fit(X_train, y_train)

In [28]:
print("alpha = ", lasso.alpha_)
print("Number of features used:", np.sum(lasso.coef_ != 0))
showPerformance(lasso)

alpha =  10.0
Number of features used: 37
RMSE train:  1029.4010246984094
RMSE test:  1381.9024276719588
Training set score: 0.97
Test set score: 0.98
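The "Number of features used" line reflects the fact that the L1 penalty of Lasso can set coefficients exactly to zero. A minimal sketch on synthetic data (an assumption for illustration, with only two informative features) shows how the count of non-zero coefficients changes with `alpha`:

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic data: only the first two of ten features influence the target
rng = np.random.RandomState(0)
X_toy = rng.normal(size=(100, 10))
y_toy = 3.0 * X_toy[:, 0] - 2.0 * X_toy[:, 1] + rng.normal(scale=0.5, size=100)

alphas = [0.01, 0.1, 1.0, 100.0]
nonzero_counts = []
for a in alphas:
    model = Lasso(alpha=a, max_iter=10000).fit(X_toy, y_toy)
    nonzero_counts.append(int(np.sum(model.coef_ != 0)))
    print("alpha = {:>6}: {} non-zero coefficients".format(a, nonzero_counts[-1]))
```

With a very large `alpha` every coefficient is driven to zero and the model predicts a constant. In the lab output above, the selected `alpha = 10.0` is the largest value in the candidate list, so extending the list upwards may be worth trying.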


** Check your `sklearn` version and continue only if it is `0.21` or above **

In [29]:
import sklearn
sklearn.__version__

'0.22.1'

** Study `VotingRegressor`, then train one that combines the Ridge and Lasso regressors above, giving a higher weight to Lasso. Finally, evaluate the performance of the `VotingRegressor`. **

In [30]:
from sklearn.ensemble import VotingRegressor

In [31]:
er = VotingRegressor([('ridge', ridge), ('lasso', lasso)], weights=[1,2])

In [32]:
er.fit(X_train, y_train)

VotingRegressor(estimators=[('ridge',
                             RidgeCV(alphas=array([1.e-03, 1.e-02, 1.e-01, 1.e+00, 1.e+01]),
                                     cv=5, fit_intercept=True, gcv_mode=None,
                                     normalize=False, scoring=None,
                                     store_cv_values=False)),
                            ('lasso',
                             LassoCV(alphas=[0.001, 0.01, 0.1, 1, 10],
                                     copy_X=True, cv=5, eps=0.001,
                                     fit_intercept=True, max_iter=10000,
                                     n_alphas=100, n_jobs=None, normalize=False,
                                     positive=False, precompute='auto',
                                     random_state=None, selection='cyclic',
                                     tol=0.0001, verbose=False))],
                n_jobs=None, weights=[1, 2])

In [33]:
showPerformance(er)

RMSE train:  970.0531928815776
RMSE test:  1372.5228314398853
Training set score: 0.98
Test set score: 0.98
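A `VotingRegressor` predicts the weighted average of its members' predictions. It fits clones of the estimators passed in, so the originals are left untouched. The sketch below checks the averaging on synthetic data with plain `Ridge` and `Lasso` members. The data and the fixed `alpha` values are assumptions for illustration, not the lab's tuned models.

```python
import numpy as np
from sklearn.ensemble import VotingRegressor
from sklearn.linear_model import Lasso, Ridge

# Synthetic regression problem (illustrative only)
rng = np.random.RandomState(0)
X_toy = rng.normal(size=(80, 4))
y_toy = X_toy @ np.array([2.0, -1.0, 0.5, 0.0]) + rng.normal(scale=0.3, size=80)

weights = [1, 2]  # Lasso counts twice as much as Ridge
toy_er = VotingRegressor(
    [("ridge", Ridge(alpha=0.1)), ("lasso", Lasso(alpha=0.1))], weights=weights
)
toy_er.fit(X_toy, y_toy)

# estimators_ holds the fitted clones, in the order they were passed in
member_preds = np.column_stack([est.predict(X_toy) for est in toy_er.estimators_])
manual_average = np.average(member_preds, axis=1, weights=weights)

print(np.allclose(manual_average, toy_er.predict(X_toy)))
```

With weights `[1, 2]`, each ensemble prediction is `(ridge + 2 * lasso) / 3`, so the ensemble's predictions always lie between those of its two members.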
