The purpose of this exercise is to extract new features from the categorical and numeric variables before the modeling phase. In the previous chapters, we applied various feature extraction techniques, such as converting categorical variables to dummy variables and scaling variables. This exercise demonstrates how these tasks can be automated using ML pipelines.

In [2]:
import pandas as pd
file_url = 'https://raw.githubusercontent.com/sedeba19/Chapter-16/main/data_source/Dataset_crx.data.txt'

df = pd.read_csv(file_url,
                 sep = ',',
                 header = None,
                 na_values= '?')

# Changing the classes to 1 & 0
df.loc[df[15] == '+', 15] = 1
df.loc[df[15] == '-', 15] = 0

df_clean = df.dropna(axis = 0)
df_clean.isna().sum()

# Separating X and y variables
X = df_clean.loc[:, 0:14]
y = df_clean.loc[:, 15].astype('int')

from sklearn.model_selection import train_test_split

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size= 0.3,
                                                    random_state=123)
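
As a minimal sketch of what the loading and cleaning steps do, the snippet below runs the same `read_csv` options on a tiny in-memory file. The three rows are made up for illustration and are not taken from the credit dataset, and the `map` call is an alternative to the two `.loc` assignments above, not the method used in this exercise.

```python
import io
import pandas as pd

# Hypothetical three-row file: '?' marks a missing value, last column is the class
raw = io.StringIO("b,30.83,+\na,?,-\n?,24.50,+\n")

# header=None gives integer column labels; na_values='?' turns '?' into NaN
toy = pd.read_csv(raw, sep=',', header=None, na_values='?')

# Map the class symbols to 1 and 0 in a single step
toy[2] = toy[2].map({'+': 1, '-': 0})

# Count missing values per column, then drop every row that has one
missing_per_column = toy.isna().sum()
toy_clean = toy.dropna(axis=0)

print(missing_per_column)
print(toy_clean)
```

Because `'?'` is read as `NaN`, column 1 is parsed as a float column instead of text, which is what later allows it to be treated as a numeric feature.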

In [15]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder

# Pipeline for transforming categorical variables
catTransformer = Pipeline(steps = [('onehot', OneHotEncoder(handle_unknown = 'ignore'))])
catTransformer
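
The `handle_unknown = 'ignore'` option matters once the encoder meets a category in the test data that it never saw during fitting. The sketch below illustrates this on toy values; the arrays and the name `cat_demo` are assumptions for illustration only.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy data (hypothetical): 'c' never appears in the training column
train_cat = np.array([['a'], ['b'], ['a']], dtype=object)
test_cat = np.array([['b'], ['c']], dtype=object)

cat_demo = Pipeline(steps=[('onehot', OneHotEncoder(handle_unknown='ignore'))])
cat_demo.fit(train_cat)

# The encoder returns a sparse matrix by default; convert it for display
encoded_test = cat_demo.transform(test_cat).toarray()
print(encoded_test)
```

The unseen category `'c'` is encoded as a row of zeros instead of raising an error, so the transformed test set keeps the same number of columns as the training set.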

In [13]:
# Pipeline for scaling numerical variables
numTransformer = Pipeline(steps = [('scaler', StandardScaler())])
numTransformer 

In [10]:
X.dtypes

0      object
1     float64
2     float64
3      object
4      object
5      object
6      object
7     float64
8      object
9      object
10      int64
11     object
12     object
13    float64
14      int64
dtype: object

In [12]:
catFeatures = X.select_dtypes(include = 'object').columns
catFeatures

Int64Index([0, 3, 4, 5, 6, 8, 9, 11, 12], dtype='int64')

In [11]:
numFeatures = X.select_dtypes(include = ['float', 'int']).columns
numFeatures

Int64Index([1, 2, 7, 10, 13, 14], dtype='int64')
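
The two `select_dtypes` calls split the columns into disjoint groups that together cover every feature. A small sketch on a made-up frame is shown below; it uses `include='number'`, which is an assumption on our part and a slightly broader selector than the `['float', 'int']` list used above.

```python
import pandas as pd

# Hypothetical frame with integer column labels, like the one read with header=None
toy = pd.DataFrame({
    0: pd.Series(['a', 'b', 'a'], dtype=object),  # categorical
    1: [30.8, 58.7, 24.5],                        # float
    2: [1, 6, 0],                                 # integer
})

toy_cat = toy.select_dtypes(include='object').columns
toy_num = toy.select_dtypes(include='number').columns

print(list(toy_cat))
print(list(toy_num))
```

Checking that the two lists together account for all columns is a quick way to confirm that no feature is silently left out of the preprocessing.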

Next, we combine the two pipelines into a single preprocessing engine that automates both tasks: scaling the numeric features and converting the categorical variables to one-hot encoded form.

In [16]:
# Create the preprocessing engine
from sklearn.compose import ColumnTransformer

preprocessor = ColumnTransformer(transformers=[('numeric', numTransformer, numFeatures),
                                               ('categoric', catTransformer, catFeatures)])
preprocessor

In [19]:
# Transforming the Training Data
X_tran_train = pd.DataFrame(preprocessor.fit_transform(X_train))
print(X_tran_train.shape)
X_tran_train.head()

(457, 46)


,0,1,2,3,4,5,6,7,8,9,...,36,37,38,39,40,41,42,43,44,45
0,0.105658,-0.4449,1.377002,-0.553206,0.570065,-0.174241,0.0,1.0,0.0,0.0,...,0.0,1.0,0.0,1.0,0.0,0.0,1.0,1.0,0.0,0.0
1,-1.084238,1.115032,-0.528306,-0.553206,-0.60247,-0.167337,1.0,0.0,0.0,1.0,...,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0
2,-0.416675,-0.080916,0.592889,-0.327276,-0.367963,-0.174241,0.0,1.0,0.0,1.0,...,0.0,0.0,1.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0
3,-0.795428,1.418699,-0.189778,-0.553206,-0.485217,0.024974,0.0,1.0,0.0,1.0,...,0.0,1.0,0.0,1.0,0.0,0.0,1.0,1.0,0.0,0.0
4,-1.125497,0.439061,-0.636809,-0.553206,-0.25071,-0.174241,0.0,1.0,0.0,0.0,...,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0


In [20]:
# Transforming the Test Data
X_tran_test = pd.DataFrame(preprocessor.transform(X_test))
print(X_tran_test.shape)
X_tran_test.head()


(196, 46)


,0,1,2,3,4,5,6,7,8,9,...,36,37,38,39,40,41,42,43,44,45
0,-0.059376,-0.531217,-0.623789,-0.553206,0.687319,-0.174241,0.0,1.0,0.0,1.0,...,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0
1,-1.063609,-0.878562,-0.600642,-0.327276,0.101051,-0.174076,0.0,1.0,0.0,0.0,...,0.0,1.0,0.0,0.0,1.0,1.0,0.0,1.0,0.0,0.0
2,0.64862,1.929316,1.847181,0.802371,-0.661097,-0.174241,0.0,1.0,0.0,1.0,...,0.0,0.0,1.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0
3,2.203242,3.402933,2.245025,2.383877,-1.071485,0.927028,1.0,0.0,0.0,1.0,...,0.0,0.0,1.0,0.0,1.0,1.0,0.0,1.0,0.0,0.0
4,-0.451332,-0.644572,-0.612215,-0.553206,-0.485217,-0.174241,0.0,1.0,0.0,1.0,...,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0
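
Note that the training data goes through `fit_transform`, while the test data only goes through `transform`. The sketch below, on made-up numbers, illustrates what this means for the scaler: the test values are standardized with the mean and standard deviation learned from the training set, not with their own.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical single-feature data
train = np.array([[1.0], [2.0], [3.0]])
test = np.array([[4.0], [5.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learns mean and std from train
test_scaled = scaler.transform(test)        # reuses the training statistics

# Reproduce the test transformation by hand from the training statistics
manual = (test - train.mean(axis=0)) / train.std(axis=0)

print(test_scaled.ravel())
print(np.allclose(test_scaled, manual))
```

This is why the scaled test columns do not have to be centered on zero: the preprocessing is fitted once on the training set and then applied unchanged to any new data.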
