# Preprocessing Using Scikit

# Standardization
Standardization or Mean Removal is the process of transforming each feature vector into a normal distribution with mean 0 and variance 1.

In [10]:
import sklearn.preprocessing as preprocessing
import sklearn.datasets as datasets
import pandas as pd

In [15]:
breast_cancer=datasets.load_breast_cancer()
standardizer = preprocessing.StandardScaler()
standardizer = standardizer.fit(breast_cancer.data)
breast_cancer_standardized = standardizer.transform(breast_cancer.data)
print('Mean of each feature after Standardization :\n\n')
print(breast_cancer_standardized.mean(axis=0))
print('\nStd. of each feature after Standardization :\n\n')
print(breast_cancer_standardized.std(axis=0))

Mean of each feature after Standardization :


[-3.16286735e-15 -6.53060890e-15 -7.07889127e-16 -8.79983452e-16
  6.13217737e-15 -1.12036918e-15 -4.42138027e-16  9.73249991e-16
 -1.97167024e-15 -1.45363120e-15 -9.07641468e-16 -8.85349205e-16
  1.77367396e-15 -8.29155139e-16 -7.54180940e-16 -3.92187747e-16
  7.91789988e-16 -2.73946068e-16 -3.10823423e-16 -3.36676596e-16
 -2.33322442e-15  1.76367415e-15 -1.19802625e-15  5.04966114e-16
 -5.21317026e-15 -2.17478837e-15  6.85645643e-16 -1.41265636e-16
 -2.28956670e-15  2.57517109e-15]

Std. of each feature after Standardization :


[1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
 1. 1. 1. 1. 1. 1.]


# Scaling
Scaling transforms existing data values to lie between a minimum and maximum value.

MinMaxScaler transforms data to range 0 and 1.

MaxAbsScaler transforms data to range -1 and 1.

Transforming breast_cancer dataset through Scaling is shown in next three cards.

# Using MinMaxScaler
MinMaxScaler with specified range

In [17]:
min_max_scaler = preprocessing.MinMaxScaler(feature_range=(0, 10)).fit(breast_cancer.data)

breast_cancer_minmaxscaled10 = min_max_scaler.transform(breast_cancer.data)
#In the above example, data is transformed to range 0 and 10.

# Using MaxAbsScaler
Using MaxAbsScaler, the maximum absolute value of each feature is scaled to unit size, i.e., 1. It is intended for data that is previously centered at sparse or zero data.

Example for MaxAbsScaler

In [19]:
max_abs_scaler = preprocessing.MaxAbsScaler().fit(breast_cancer.data)

breast_cancer_maxabsscaled = max_abs_scaler.transform(breast_cancer.data)
#By default, MaxAbsScaler transforms data to the range -1 and 1.

# Normalization
Normalization scales each sample to have a unit norm.

Normalization can be achieved with 'l1', 'l2', and 'max' norms.

'l1' norm makes the sum of absolute values of each row as 1, and 'l2' norm makes the sum of squares of each row as 1.

'l1' norm is insensitive to outliers.

By default l2 norm is considered. Hence, removing outliers is recommended before applying l2 norm.

# Normalization - Example

In [21]:
normalizer = preprocessing.Normalizer(norm='l1').fit(breast_cancer.data)

breast_cancer_normalized = normalizer.transform(breast_cancer.data)
#In above example, l1 norm is used with norm parameter.

# Binarization
Binarization is the process of transforming data points to 0 or 1 based on a given threshold.

Any value above the threshold is transformed to 1, and any value below the threshold is transformed to 0.
By default, a threshold of 0 is used.

# Binarization - Example

In [22]:
binarizer = preprocessing.Binarizer(threshold=3.0).fit(breast_cancer.data)
breast_cancer_binarized = binarizer.transform(breast_cancer.data)
print(breast_cancer_binarized[:5,:5])

[[1. 1. 1. 1. 0.]
 [1. 1. 1. 1. 0.]
 [1. 1. 1. 1. 0.]
 [1. 1. 1. 1. 0.]
 [1. 1. 1. 1. 0.]]


# OneHotEncoder
OneHotEncoder converts categorical integer values into one-hot vectors. In an on-hot vector, every category is transformed into a binary attribute having only 0 and 1 values.

An example creating two binary attributes for the categorical integers 1 and 2, is shown in the next slide.

# OneHotEncoder - Example

In [47]:
onehotencoder = preprocessing.OneHotEncoder()
onehotencoder = onehotencoder.fit([[1], [1], [1], [2], [2], [1]])

# Transforming category values 1 and 2 to one-hot vectors
print(onehotencoder.transform([[1]]).toarray())
print(onehotencoder.transform([[2]]).toarray())
print(onehotencoder.transform([[1]]).toarray())

[[1. 0.]]
[[0. 1.]]
[[1. 0.]]


In case you used a LabelEncoder before this OneHotEncoder to convert the categories to integers, then you can now use the OneHotEncoder directly.


# Imputation
Imputation replaces missing values with either median, mean, or the most common value of the column or row in which the missing values exist.

Below example replaces missing values, represented by np.nan, with the mean of respective column (axis 0).

In [24]:
imputer = preprocessing.Imputer(missing_values='NaN', strategy='mean')

imputer = imputer.fit(breast_cancer.data)
breast_cancer_imputed = imputer.transform(breast_cancer.data)



# Label Encoding
Label Encoding is a step in which, in which categorical features are represented as categorical integers. An example of transforming categorical values ["benign","malignant"]into[0, 1]` is shown below.

In [None]:
labels = ['malignant', 'benign', 'malignant', 'benign']

labelencoder = preprocessing.LabelEncoder()

labelencoder = labelencoder.fit(labels)

bc_labelencoded = labelencoder.transform(breast_cancer.target_names)

# Using MinMaxScaler

In [28]:
min_max_scaler = preprocessing.MinMaxScaler().fit(breast_cancer.data)

breast_cancer_minmaxscaled = min_max_scaler.transform(breast_cancer.data)
#By default, transformation occurs to a range of 0 and 1. It can also be customized with feature_range argument as shown in next example.

# Machine Learning Using Scikit-Learn | 5 | SVM

# Task 1
Import two modules sklearn.datasets, and sklearn.model_selection.

Load popular digits dataset from sklearn.datasets module and assign it to variable digits.

Split digits.data into two sets names X_train and X_test. Also, split digits.target into two sets Y_train and Y_test.

Hint: Use train_test_split method from sklearn.model_selection; set random_state to 30; and perform stratified sampling.
Print the shape of X_train dataset.

Print the shape of X_test dataset.

In [29]:
import sklearn.datasets as datasets
from sklearn.model_selection import train_test_split 
import numpy as np

In [68]:
digits=datasets.load_digits()
X_train,X_test,Y_train,Y_test=train_test_split(digits.data,digits.target,random_state=30, stratify=digits.target)
print(X_train.shape)
print(X_test.shape)


(1347, 64)
(450, 64)


# Task 2
Import required module from sklearn.svm.

Build an SVM classifier from X_train set and Y_train labels, with default parameters. Name the model as svm_clf.

Evaluate the model accuracy on testing data set and print it's score.

In [63]:
import sklearn.datasets as datasets
from sklearn.model_selection import train_test_split 
from sklearn.svm import SVC
digits=datasets.load_digits()
X_train,X_test,Y_train,Y_test=train_test_split(digits.data,digits.target,random_state=30, stratify=digits.target)
svm_classifier = SVC()
svm_clf= svm_classifier.fit(X_train, Y_train) 
print('Accuracy of Test Data :', svm_classifier.score(X_test,Y_test))



Accuracy of Test Data : 0.6022222222222222


# Task 3
Perform Standardization of digits.data and store the transformed data in variable digits_standardized.

Hint : Use required utility from sklearn.preprocessing.
Once again, split digits_standardized into two sets names X_train and X_test. Also, split digits.target into two sets Y_train and Y_test.

Hint: Use train_test_split method from sklearn.model_selection; set random_state to 30; and perform stratified sampling.
Build another SVM classifier from X_train set and Y_train labels, with default parameters. Name the model as svm_clf2.

Evaluate the model accuracy on testing data set and print it's score.

In [70]:
import sklearn.preprocessing as preprocessing
import sklearn.datasets as datasets
digits=datasets.load_digits()
standardizer = preprocessing.StandardScaler()
standardizer = standardizer.fit(digits.data)
cancer_standardized = standardizer.transform(digits.data)
X_train,X_test,Y_train,Y_test=train_test_split(digits.data,digits.target,random_state=30, stratify=digits.target)
svm_classifier = SVC()
svm_classifier = svm_classifier.fit(X_train, Y_train) 
#print('Accuracy of Train Data :', svm_classifier.score(X_train,Y_train))
print('Accuracy of Test Data :', svm_classifier.score(X_test,Y_test))



Accuracy of Test Data : 0.6022222222222222


In [54]:
from sklearn.svm import SVC

svm_classifier = SVC()

svm_classifier = svm_classifier.fit(X_train, Y_train) 

print('Accuracy of Train Data :', svm_classifier.score(X_train,Y_train))

print('Accuracy of Test Data :', svm_classifier.score(X_test,Y_test))

import sklearn.preprocessing as preprocessing
import sklearn.datasets as datasets
cancer=datasets.load_breast_cancer
standardizer = preprocessing.StandardScaler()
standardizer = standardizer.fit(cancer.data)
cancer_standardized = standardizer.transform(cancer.data)

svm_classifier = SVC()

svm_classifier = svm_classifier.fit(X_train, Y_train) 
print('Accuracy of Train Data :', svm_classifier.score(X_train,Y_train))

print('Accuracy of Test Data :', svm_classifier.score(X_test,Y_test))
from sklearn import metrics

Y_pred = svm_classifier.predict(X_test)

print('Classification report : \n',metrics.classification_report(Y_test, Y_pred))



Accuracy of Train Data : 1.0
Accuracy of Test Data : 0.4622222222222222


AttributeError: 'function' object has no attribute 'data'

In [None]:
import sklearn.datasets as datasets

from sklearn.model_selection import train_test_split

from sklearn.neighbors import KNeighborsClassifier

cancer = datasets.load_breast_cancer()  # Loading the data set
X_train, X_test, Y_train, Y_test = train_test_split(cancer.data, cancer.target,
           stratify=cancer.target,                  random_state=42)
    
knn_classifier = KNeighborsClassifier()   

knn_classifier = knn_classifier.fit(X_train, Y_train) 
print('Accuracy of Train Data :', knn_classifier.score(X_train,Y_train))
print('Accuracy of Test Data :', knn_classifier.score(X_test,Y_test))

# Task 1
Import two modules sklearn.datasets, and sklearn.model_selection.

Load popular iris data set from sklearn.datasets module and assign it to variable iris.

Split iris.data into two sets names X_train and X_test. Also, split iris.target into two sets Y_train and Y_test.

Hint: Use train_test_split method from sklearn.model_selection; set random_state to 30 and perform stratified sampling.
Print the shape of X_train dataset.

Print the shape of X_test dataset.

In [75]:
import sklearn.datasets as datasets
from sklearn.model_selection import train_test_split 
import numpy as np
iris=datasets.load_iris()
X_train,X_test,Y_train,Y_test=train_test_split(iris.data,iris.target,random_state=30, stratify=iris.target)
print(X_train.shape)
print(X_test.shape)

(112, 4)
(38, 4)


# Task 2
Import required module from sklearn.neighbors

Fit K nearest neighbors model on X_train data and Y_train labels, with default parameters. Name the model as knn_clf.

Evaluate the model accuracy on training data set and print it's score.

Evaluate the model accuracy on testing data set and print it's score.

In [77]:
import sklearn.datasets as datasets
from sklearn.model_selection import train_test_split 
from sklearn.neighbors import KNeighborsClassifier
iris=datasets.load_iris()
X_train,X_test,Y_train,Y_test=train_test_split(iris.data,iris.target,random_state=30, stratify=iris.target)

knn_classifier = KNeighborsClassifier()   

knn_classifier = knn_classifier.fit(X_train, Y_train) 
print('Accuracy of Train Data :', knn_classifier.score(X_train,Y_train))
print('Accuracy of Test Data :', knn_classifier.score(X_test,Y_test))

Accuracy of Train Data : 0.9821428571428571
Accuracy of Test Data : 0.9473684210526315


# Task 3
Fit multiple K nearest neighbors models on X_train data and Y_train labels with n_neighbors parameter value changing from 3 to 10.

Evaluate each model accuracy on testing data set.

Hint: Make use of for loop
Print the n_neighbors value of the model with highest accuracy.

In [122]:
import sklearn.datasets as datasets
from sklearn.model_selection import train_test_split 
from sklearn.neighbors import KNeighborsClassifier
import numpy as np
iris=datasets.load_iris()
X_train,X_test,Y_train,Y_test=train_test_split(iris.data,iris.target,random_state=30, stratify=iris.target)
a=[]
for i in range(3,11):
    knn_classifier = KNeighborsClassifier(n_neighbors=i)   
    knn_classifier = knn_classifier.fit(X_train, Y_train) 
    a[i].append(knn_classifier.score(X_test,Y_test))
    #print('Accuracy of Test Data :',a)      

IndexError: list index out of range

# Task 1
Import two modules sklearn.datasets and sklearn.preprocessing.

Load popular iris data set from sklearn.datasets module and assign it to variable 'iris'.

Perform Normalization on iris.data with l2 norm and save the transformed data in variable iris_normalized.

Hint: Use Normalizer API.
Print the mean of every column using the below command. print(iris_normalized.mean(axis=0))

In [2]:
import sklearn.datasets as datasets
import sklearn.preprocessing as preprocessing
iris=datasets.load_iris()
normalizer = preprocessing.Normalizer(norm='l2').fit(iris.data)
iris_normalized = normalizer.transform(iris.data)
print(iris_normalized.mean(axis=0))

[0.75140029 0.40517418 0.45478362 0.14107142]


# Task 2
Convert the categorical integer list iris.target into three binary attribute representation and store the result in variable iris_target_onehot.

Hint: Use reshape(-1,1) on iris.target and OneHotEncoder.
Execute the following print statement print(iris_target_onehot.toarray()[[0,50,100]])

In [57]:
import sklearn.datasets as datasets
import sklearn.preprocessing as preprocessing
iris=datasets.load_iris()
iris_target_onehot=iris.target.reshape(-1,1)
onehotencoder = preprocessing.OneHotEncoder()
iris_target_onehot= onehotencoder.fit_transform(iris_target_onehot).toarray()
#print(onehotencoder.transform([[2]]).toarray())
#print(iris_target_onehot[[0,50,100]])
#print(onehotencoder.transform([[1]]).toarray())
print(iris_target_onehot[[0,50,100]])

[[1. 0. 0.]
 [0. 1. 0.]
 [0. 0. 1.]]


In case you used a LabelEncoder before this OneHotEncoder to convert the categories to integers, then you can now use the OneHotEncoder directly.


# Task 3
Set first 50 row values of iris.data to Null values. Use numpy.nan

Perform Imputation on 'iris.data' and save the transformed data in variable 'iris_imputed'.

Hint : use Imputer API, Replace numpy.NaN values with mean of corresponding data.
Print the mean of every column using the below command. print(iris_imputed.mean(axis=0))

In [67]:
import sklearn.datasets as datasets
import sklearn.preprocessing as preprocessing
import numpy as np
iris=datasets.load_iris()
iris.data[0:50]=np.nan
imputer = preprocessing.Imputer(missing_values='NaN', strategy='mean')
imputer = imputer.fit(iris.data)
iris_imputed= imputer.transform(iris.data)
print(iris_imputed.mean(axis=0))

[6.262 2.872 4.906 1.676]




# Building a Decision Tree Classifier Model

In [None]:
from sklearn.tree import DecisionTreeClassifier

dt_classifier = DecisionTreeClassifier()   

dt_classifier = dt_classifier.fit(X_train, Y_train) 
print('Accuracy of Train Data :', dt_classifier.score(X_train,Y_train))

print('Accuracy of Test Data :', dt_classifier.score(X_test,Y_test))
dt_classifier = DecisionTreeClassifier(max_depth=2)   

dt_classifier = dt_classifier.fit(X_train, Y_train) 

print('Accuracy of Train Data :', dt_classifier.score(X_train,Y_train))

print('Accuracy of Test Data :', dt_classifier.score(X_test,Y_test))

# Task 1
Import two modules sklearn.datasets, and sklearn.model_selection.
Import numpy and set random seed to 100.

Load popular Boston dataset from sklearn.datasets module and assign it to variable boston.

Split boston.data into two sets names X_train and X_test. Also, split boston.target into two sets Y_train and Y_test.

Hint: Use train_test_split method from sklearn.model_selection; set random_state to 30.
Print the shape of X_train dataset.

Print the shape of X_test dataset.

In [68]:
import sklearn.datasets as datasets
from sklearn.model_selection import train_test_split 
import numpy as np
np.random.seed(100)
boston=datasets.load_boston()
X_train,X_test,Y_train,Y_test=train_test_split(boston.data,boston.target,random_state=30)
print(X_train.shape)
print(X_test.shape)

(379, 13)
(127, 13)


# Task 2
Import required module from sklearn.tree.

Build a Decision tree Regressor model from X_train set and Y_train labels, with default parameters. Name the model as dt_reg.

Evaluate the model accuracy on training data set and print it's score.

Evaluate the model accuracy on testing data set and print it's score.

Predict the housing price for first two samples of X_test set and print them.(Hint : Use predict() function)

In [99]:
from sklearn.tree import DecisionTreeRegressor
import sklearn.datasets as datasets
from sklearn.model_selection import train_test_split 
import numpy as np
np.random.seed(100)
boston=datasets.load_boston()
X_train,X_test,Y_train,Y_test=train_test_split(boston.data,boston.target,random_state=30)

dt_reg = DecisionTreeRegressor()   
dt_reg = dt_reg.fit(X_train, Y_train) 
print('Accuracy of Train Data :', dt_reg.score(X_train,Y_train))
print('Accuracy of Test Data :', dt_reg.score(X_test,Y_test))
dt_reg.predict(X_test[0:2,])

Accuracy of Train Data : 1.0
Accuracy of Test Data : 0.7038312015987166


array([18.2, 13.9])

In [None]:
dt_classifier = DecisionTreeClassifier(max_depth=2)   

dt_classifier = dt_reg.fit(X_train, Y_train) 

print('Accuracy of Train Data :', dt_reg.score(X_train,Y_train))

print('Accuracy of Test Data :', dt_reg.score(X_test,Y_test))

In [85]:
from sklearn.cluster import KMeans

kmeans_cluster = KMeans(n_clusters=2)

kmeans_cluster = kmeans_cluster.fit(X_train) 

kmeans_cluster.predict(X_test)
from sklearn import metrics

print(metrics.homogeneity_score(kmeans_cluster.predict(X_test), Y_test))

print(metrics.completeness_score(kmeans_cluster.predict(X_test), Y_test))

print(metrics.v_measure_score(kmeans_cluster.predict(X_test), Y_test))

print(metrics.adjusted_rand_score(kmeans_cluster.predict(X_test), Y_test))

0.7805024671754182
0.10550617628787605
0.18588493803637118
0.00027782071833537086


# Task 1
Import three modules sklearn.datasets, sklearn.cluster, and sklearn.metrics.

Load popular iris dataset from sklearn.datasets module and assign it to variable iris.

Cluster iris.data set into 3 clusters using K-means with default parameters. Name the model as km_cls.

Hint : Import required utility from sklearn.cluster
Determine the homogeneity score of the model and print it.

Hint : Import required utility from sklearn.metrics

In [89]:
from sklearn.cluster import KMeans
import sklearn.datasets as datasets
iris=datasets.load_iris()

km_cls= KMeans(n_clusters=3)
X_train,X_test,Y_train,Y_test=train_test_split(iris.data,iris.target)
km_cls = km_cls.fit(X_train) 

kmeans_cluster.predict(X_test)
from sklearn import metrics

print(metrics.homogeneity_score(km_cls.predict(X_test), Y_test))

print(metrics.completeness_score(km_cls.predict(X_test), Y_test))

print(metrics.v_measure_score(km_cls.predict(X_test), Y_test))

print(metrics.adjusted_rand_score(km_cls.predict(X_test), Y_test))

0.7312509997670167
0.7187287275730635
0.7249357914309795
0.6831077974414224


# Task 2
Cluster iris.data set into 3 clusters using Agglomerative clustering. Name the model as agg_cls.

Hint : Import required utility from sklearn.cluster
Determine the homogeneity score of the model and print it.

Hint : Import required utility from sklearn.metrics

In [95]:
from sklearn.cluster import KMeans
import sklearn.datasets as datasets
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics.cluster import homogeneity_score
iris=datasets.load_iris()


agg_cls = AgglomerativeClustering(n_clusters=3)
agg_cls.fit(iris.data)
labels2=agg_cls.labels_
homogeneity_score(iris.target,labels2)

0.7608008469718723

# Task 3
Cluster iris.data set using Affinity Propagation clustering method with default parameters. Name the model as af_cls.

Hint : Import required utility from sklearn.cluster
Determine the homogeneity score of the model and print it.

Hint : Import required utility from sklearn.metrics

In [97]:
from sklearn.cluster import KMeans
import sklearn.datasets as datasets
from sklearn.cluster import AffinityPropagation
from sklearn.metrics.cluster import homogeneity_score
iris=datasets.load_iris()
af_cls= AffinityPropagation(preference=-50)
af_cls.fit(iris.data)
labels2=agg_cls.labels_
homogeneity_score(iris.target,labels2)

0.7608008469718723

# Demo of Random Forest Classifier

In [98]:
from sklearn.ensemble import RandomForestClassifier

rf_classifier = RandomForestClassifier()

rf_classifier = rf_classifier.fit(X_train, Y_train) 

print('Accuracy of Train Data :', rf_classifier.score(X_train,Y_train))

print('Accuracy of Test Data :', rf_classifier.score(X_test,Y_test))

Accuracy of Train Data : 0.9910714285714286
Accuracy of Test Data : 1.0




# Task 2
Import required module from sklearn.ensemble.

Build a Random Forest Regressor model from X_train set and Y_train labels, with default parameters. Name the model as rf_reg.

Evaluate the model accuracy on training data set and print it's score.

Evaluate the model accuracy on testing data set and print it's score.

Predict the housing price for first two samples of X_test set and print them.

In [111]:
from sklearn.ensemble import RandomForestRegressor
import sklearn.datasets as datasets
from sklearn.model_selection import train_test_split 
import numpy as np
np.random.seed(100)
boston=datasets.load_boston()
X_train,X_test,Y_train,Y_test=train_test_split(boston.data,boston.target,random_state=30)

rf_reg= RandomForestRegressor()
rf_reg = rf_reg.fit(X_train, Y_train) 
print('Accuracy of Train Data :', rf_reg.score(X_train,Y_train))
print('Accuracy of Test Data :', rf_reg.score(X_test,Y_test))
rf_reg.predict(X_test[0:2,])

Accuracy of Train Data : 0.9735273853589964
Accuracy of Test Data : 0.8829204928067821




array([19.59,  9.47])

# Task 3
Build multiple Random forest regressor on X_train set and Y_train labels with max_depth parameter value changing from 3 to 5 and also setting n_estimators to one of 50, 100, 200 values.

Evaluate each model accuracy on testing data set.

Hint: Make use of for loop
Print the max_depth and n_estimators values of the model with highest accuracy.

Note: Print the parameter values in the form of tuple (a, b). a refers to max_depth value and b refers to n_estimators

In [114]:
from sklearn.ensemble import RandomForestRegressor
import sklearn.datasets as datasets
from sklearn.model_selection import train_test_split 
import numpy as np
np.random.seed(100)
boston=datasets.load_boston()
X_train,X_test,Y_train,Y_test=train_test_split(boston.data,boston.target,random_state=30)
rf_reg=RandomForestRegressor(random_state = 1, n_estimators = 10)
rf_reg =rf_reg.fit(X_train, Y_train) 
print('Accuracy of Train Data :', rf_reg.score(X_train,Y_train))
print('Accuracy of Test Data :', rf_reg.score(X_test,Y_test))


Accuracy of Train Data : 0.9710420861029339
Accuracy of Test Data : 0.8799361210962687
