The purpose of this exercise is to extract new features from the categorical and numeric variables before the modeling phase. In the previous chapters, we applied various feature extraction techniques, such as converting categorical variables to dummy variables and scaling variables. This exercise demonstrates how these tasks can be automated using ML pipelines.
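
For contrast, here is a rough sketch of what the manual approach mentioned above might look like. The toy DataFrame and its column names (`income`, `status`) are hypothetical and only illustrate the two steps (dummy variables and scaling) that the pipeline is meant to automate.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: one numeric and one categorical column
toy = pd.DataFrame({"income": [10.0, 20.0, 30.0],
                    "status": ["u", "y", "u"]})

# Manual step 1: convert the categorical column to dummy variables
dummies = pd.get_dummies(toy[["status"]])

# Manual step 2: scale the numeric column to zero mean and unit variance
scaled = pd.DataFrame(StandardScaler().fit_transform(toy[["income"]]),
                      columns=["income"],
                      index=toy.index)

# Manual step 3: stitch the pieces back together by hand
manual = pd.concat([scaled, dummies], axis=1)
print(manual.columns.tolist())
```

Each step here has to be repeated, in the same order, on any new data, which is the bookkeeping a pipeline takes over.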

In [1]:
import pandas as pd
file_url = 'https://raw.githubusercontent.com/sedeba19/Chapter-16/main/data_source/Dataset_crx.data.txt'

df = pd.read_csv(file_url,
                 sep = ',',
                 header = None,
                 na_values= '?')

# Changing the classes to 1 & 0
df.loc[df[15] == '+', 15] = 1
df.loc[df[15] == '-', 15] = 0

df_clean = df.dropna(axis = 0)
df_clean.isna().sum()

# Separating X and y variables
X = df_clean.loc[:, 0:14]
y = df_clean.loc[:, 15].astype('int')

from sklearn.model_selection import train_test_split

# Split the data into training and testing set
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size= 0.3,
                                                    random_state=123)

Create Preprocessing Engine

In [2]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder

# Pipeline for transforming categorical variables
catTransformer = Pipeline(steps = [('onehot', OneHotEncoder(handle_unknown = 'ignore'))])
catTransformer

In [3]:
# Pipeline for scaling numerical variables
numTransformer = Pipeline(steps = [('scaler', StandardScaler())])
numTransformer

In [4]:
X.dtypes

0      object
1     float64
2     float64
3      object
4      object
5      object
6      object
7     float64
8      object
9      object
10      int64
11     object
12     object
13    float64
14      int64
dtype: object

In [5]:
catFeatures = X.select_dtypes(include = 'object').columns
catFeatures

Int64Index([0, 3, 4, 5, 6, 8, 9, 11, 12], dtype='int64')

In [6]:
numFeatures = X.select_dtypes(include = ['float', 'int']).columns
numFeatures

Int64Index([1, 2, 7, 10, 13, 14], dtype='int64')

For context, the next step builds a preprocessing engine that automates two tasks: scaling the numeric features and converting the categorical variables to a one-hot encoded form.

In [7]:
# Create the preprocessing engine
from sklearn.compose import ColumnTransformer

preprocessor = ColumnTransformer(transformers=[('numeric', numTransformer, numFeatures),
                                               ('categoric', catTransformer, catFeatures)])
preprocessor
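
To get a feel for what such a preprocessor produces, the following is a minimal sketch on a hypothetical toy DataFrame (the columns `amount` and `grade` are made up for illustration and are not part of the credit dataset). It assumes the same two transformers as above and also shows how `handle_unknown='ignore'` treats a category that was not seen during fitting.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data: one numeric and one categorical column
toy = pd.DataFrame({"amount": [1.0, 2.0, 3.0, 4.0],
                    "grade": ["a", "b", "a", "c"]})

toy_preprocessor = ColumnTransformer(transformers=[
    ("numeric", Pipeline(steps=[("scaler", StandardScaler())]), ["amount"]),
    ("categoric", Pipeline(steps=[("onehot", OneHotEncoder(handle_unknown="ignore"))]), ["grade"]),
])

# One scaled column plus one column per category seen in 'grade' (a, b, c)
toy_out = toy_preprocessor.fit_transform(toy)
print(toy_out.shape)

# A category not seen during fit ('z') is encoded as all zeros instead of raising an error
unseen = pd.DataFrame({"amount": [2.5], "grade": ["z"]})
unseen_out = toy_preprocessor.transform(unseen)
print(unseen_out)
```

The routing is by column: each transformer only sees the columns listed for it, and the results are concatenated side by side in the order the transformers are declared.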

Spot Checking Multiple Models

In [19]:
# Import necessary libraries
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier

# Create a list of the classifiers
classifiers = [
    KNeighborsClassifier(5),     
    RandomForestClassifier(random_state=123),
    AdaBoostClassifier(random_state=123),
    LogisticRegression(random_state=123)
    ]

# Fit and score one full pipeline (preprocess -> PCA -> classifier) per model
for clf in classifiers:
    estimator = Pipeline(steps=[('preprocessor', preprocessor),
                                ('dimred', PCA(10)),
                                ('classifier', clf)])
    estimator.fit(X_train, y_train)
    print(clf)
    print("model score: %.2f" % estimator.score(X_test, y_test))

KNeighborsClassifier()
model score: 0.83
RandomForestClassifier(random_state=123)
model score: 0.86
AdaBoostClassifier(random_state=123)
model score: 0.86
LogisticRegression(random_state=123)
model score: 0.89
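
When spot checking more than a handful of models, it can be convenient to collect the scores in a dictionary rather than only printing them. The sketch below illustrates that idea on synthetic data from `make_classification`, which is an assumption made to keep the example self-contained; the names `candidates`, `scores`, and `best_name` are hypothetical, and the scores will not match the credit dataset results above.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption; not the credit dataset)
X_toy, y_toy = make_classification(n_samples=300, n_features=12, random_state=123)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy,
                                          test_size=0.3,
                                          random_state=123)

# Name each candidate so the results are easy to look up later
candidates = {
    "knn": KNeighborsClassifier(5),
    "logreg": LogisticRegression(random_state=123),
}

scores = {}
for name, clf in candidates.items():
    pipe = Pipeline(steps=[("scaler", StandardScaler()),
                           ("dimred", PCA(10)),
                           ("classifier", clf)])
    pipe.fit(X_tr, y_tr)
    scores[name] = pipe.score(X_te, y_te)  # mean accuracy on the held-out split

# Pick the candidate with the highest held-out accuracy
best_name = max(scores, key=scores.get)
print(scores)
print(best_name)
```

A single train/test split gives only a rough ranking, so the top scorer here is best treated as a candidate for further tuning rather than a final choice.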
