The purpose of this exercise is to extract new features from the categorical and numeric variables before the modeling phase. In the previous chaters, we applied various feature extraction techniques, such as converting categorical variables to dummy variables and scaling variables. This exercise will demonstrate how these task can be automated using ML Pipelines.

In [1]:
import pandas as pd
file_url = 'https://raw.githubusercontent.com/sedeba19/Chapter-16/main/data_source/Dataset_crx.data.txt'

df = pd.read_csv(file_url,
                 sep = ',',
                 header = None,
                 na_values= '?')

# Changing the Classess to 1 & 0
df.loc[df[15] == '+', 15] = 1
df.loc[df[15] == '-', 15] = 0

df_clean = df.dropna(axis = 0)
df_clean.isna().sum()

# Separating X and y variabls
X = df_clean.loc[:, 0:14]
y = df_clean.loc[:, 15].astype('int')

from sklearn.model_selection import train_test_split

# Split the data into training and testing set
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size= 0.3,
                                                    random_state=123)

In [2]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder

# Pipeline for transforming categorical variables
catTransformer = Pipeline(steps = [('onehot', OneHotEncoder(handle_unknown = 'ignore'))])
catTransformer

In [3]:
# Pipeline for scaling numerical variables
numTransformer = Pipeline(steps = [('scaler', StandardScaler())])
numTransformer 

In [4]:
X.dtypes

0      object
1     float64
2     float64
3      object
4      object
5      object
6      object
7     float64
8      object
9      object
10      int64
11     object
12     object
13    float64
14      int64
dtype: object

In [5]:
catFeatures = X.select_dtypes(include = 'object').columns
catFeatures

Int64Index([0, 3, 4, 5, 6, 8, 9, 11, 12], dtype='int64')

In [6]:
numFeatures = X.select_dtypes(include = ['float', 'int']).columns
numFeatures

Int64Index([1, 2, 7, 10, 13, 14], dtype='int64')

Just to get the context of what we are going to do next, we are going to create a literal engine that automates the task of scaling features and converting categorical variables to a one-hot encoded form.

In [7]:
# Create the preprocessing engine
from sklearn.compose import ColumnTransformer

preprocessor = ColumnTransformer(transformers=[('numeric', numTransformer, numFeatures),
                                               ('categoric', catTransformer, catFeatures)])
preprocessor

In [8]:
# Transforming the Training Data
X_tran_train = pd.DataFrame(preprocessor.fit_transform(X_train))
print(X_tran_train.shape)
X_tran_train.head()

(457, 46)


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,36,37,38,39,40,41,42,43,44,45
0,0.105658,-0.4449,1.377002,-0.553206,0.570065,-0.174241,0.0,1.0,0.0,0.0,...,0.0,1.0,0.0,1.0,0.0,0.0,1.0,1.0,0.0,0.0
1,-1.084238,1.115032,-0.528306,-0.553206,-0.60247,-0.167337,1.0,0.0,0.0,1.0,...,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0
2,-0.416675,-0.080916,0.592889,-0.327276,-0.367963,-0.174241,0.0,1.0,0.0,1.0,...,0.0,0.0,1.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0
3,-0.795428,1.418699,-0.189778,-0.553206,-0.485217,0.024974,0.0,1.0,0.0,1.0,...,0.0,1.0,0.0,1.0,0.0,0.0,1.0,1.0,0.0,0.0
4,-1.125497,0.439061,-0.636809,-0.553206,-0.25071,-0.174241,0.0,1.0,0.0,0.0,...,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0


In [9]:
# Transforming the Test Data
X_tran_test = pd.DataFrame(preprocessor.transform(X_test))
print(X_tran_test.shape)
X_tran_test.head()


(196, 46)


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,36,37,38,39,40,41,42,43,44,45
0,-0.059376,-0.531217,-0.623789,-0.553206,0.687319,-0.174241,0.0,1.0,0.0,1.0,...,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0
1,-1.063609,-0.878562,-0.600642,-0.327276,0.101051,-0.174076,0.0,1.0,0.0,0.0,...,0.0,1.0,0.0,0.0,1.0,1.0,0.0,1.0,0.0,0.0
2,0.64862,1.929316,1.847181,0.802371,-0.661097,-0.174241,0.0,1.0,0.0,1.0,...,0.0,0.0,1.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0
3,2.203242,3.402933,2.245025,2.383877,-1.071485,0.927028,1.0,0.0,0.0,1.0,...,0.0,0.0,1.0,0.0,1.0,1.0,0.0,1.0,0.0,0.0
4,-0.451332,-0.644572,-0.612215,-0.553206,-0.485217,-0.174241,0.0,1.0,0.0,1.0,...,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0


In [12]:
# Import PCA library
from sklearn.decomposition import PCA

# Creae an estimator with both preprocessor and dimensionality reduction
estimator = Pipeline(steps = [('preprocessor', preprocessor), ('dimred', PCA(10))])
estimator 

In [13]:
# Fitting and transforming Training set
X_transformed_train = pd.DataFrame(estimator.fit_transform(X_train))
X_transformed_train

Unnamed: 0,0,1,2,3,4,5,6,7,8,9
0,-0.456911,0.857577,-1.231989,0.902396,1.604191,-0.284921,-0.595444,0.206836,0.027712,0.742267
1,-0.758102,-1.279315,1.162158,0.397572,0.031973,1.236864,0.353098,-0.020558,0.561482,0.613476
2,0.387754,-0.022255,-0.082482,-0.524931,0.089300,0.300113,-1.257660,-0.191124,-0.376516,-0.367365
3,-0.332061,-0.636192,0.825248,0.798001,0.435375,1.377995,-0.578766,0.030524,-0.900729,0.620234
4,-1.412780,-0.707406,0.607928,0.549580,1.582078,-0.119710,0.496112,0.597986,-0.133551,0.032972
...,...,...,...,...,...,...,...,...,...,...
452,0.811601,-0.466851,0.741124,-0.285976,0.448155,1.122801,-0.483755,-0.352555,-1.104434,-0.284433
453,-1.032857,-0.317365,-0.325301,0.779674,0.790264,-0.669789,0.879571,-0.115642,0.826721,0.559435
454,-1.436613,-0.626001,0.404983,0.577783,1.412340,-0.377916,0.448791,0.575650,-0.014936,0.039160
455,0.535473,-0.507511,-1.458810,2.248169,0.168517,-1.386928,0.471567,-1.252744,0.271407,1.079495


In [14]:
# Transform X_test
X_transformed_test = pd.DataFrame(estimator.fit_transform(X_test))
X_transformed_test

Unnamed: 0,0,1,2,3,4,5,6,7,8,9
0,-1.376591,0.155195,-0.563434,0.049405,-0.399998,-0.491688,0.562708,0.511245,0.346382,-0.003808
1,-1.692988,0.361616,0.978856,0.203378,0.073868,1.136227,-0.047385,0.541608,0.034440,-0.717510
2,2.646165,-0.110647,-0.466814,-1.163394,0.248937,-0.013165,-0.780581,-0.296853,0.693618,-0.596870
3,5.373258,-0.668498,0.115588,0.969101,0.572690,-0.337716,0.382457,0.479135,-0.504547,1.068907
4,-1.291115,-0.453252,0.313133,-0.209912,-0.748126,-0.459586,0.336056,0.302274,0.588874,-0.096309
...,...,...,...,...,...,...,...,...,...,...
191,-1.756426,0.553994,-0.767019,-0.120467,-0.189609,-0.302743,-0.105660,-0.664187,0.910465,0.200317
192,0.694146,-1.445142,-1.229004,-0.527833,-0.531979,-0.113569,0.625734,-0.963592,-0.744302,1.134550
193,-0.434029,0.454746,0.734669,-0.301829,-0.028549,-0.899120,-0.276020,0.547488,-0.021891,-1.136908
194,-0.146258,-1.347556,1.027346,0.946471,1.603582,0.724402,-0.296726,-0.731639,1.155427,0.254661


We implemented data preprocessing steps such as scaling, one-hot encoding, and dimensionality reduction using the Pipeline() function. As we have seen implementing new steps is quite intuitive and simple with the Pipeline() function.