## Concepts and Examples Covered
This notebook provides example code for completing the following tasks using **PIPELINES**:

- Splitting a dataset into training and test sets
- Encoding categorical data using ColumnTransformer (OrdinalEncoder and OneHotEncoder)
- Hyperparameter grid search for SVM and kNN models
- Evaluating a final ML model on a test set

In [None]:
# imports, then check whether any values are missing from the dataset
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder, MinMaxScaler
from sklearn.svm import SVC
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
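# note: plot_roc_curve and plot_confusion_matrix were removed in scikit-learn 1.2;
# newer versions provide RocCurveDisplay.from_estimator and ConfusionMatrixDisplay.from_estimator instead
from sklearn.metrics import plot_roc_curve, plot_confusion_matrix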

data= pd.read_csv('/kaggle/input/customer-analytics/Train.csv')
print('Total number of missing values:', data.isnull().sum().sum())
print('Dataset size:', np.shape(data))
data.head()

For this dataset, columns 2-11 hold the features and column 12 holds the target for each row. Column 1 is not used as a feature, which is why the feature slice in the next cell starts at index 1.

In [None]:
# split dataset into a training and test set, ensuring the same proportions for the target are in both the training and test sets
X= data.iloc[:, 1:-1]
Y= data.iloc[:, -1]
X_train, X_test, Y_train, Y_test= train_test_split(X, Y, test_size= 0.2, stratify= Y, random_state= 24)

print('Number of training examples:', len(X_train))
print('Number of test examples:', len(X_test))
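
To see what `stratify` does, here is a minimal sketch on synthetic data (the arrays `X_demo` and `y_demo` are made up for illustration and are not part of the dataset). It compares the share of class 1 before and after the split; with stratification the three fractions should match closely.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced labels (60% ones), purely for illustration
rng = np.random.default_rng(0)
y_demo = np.array([1] * 600 + [0] * 400)
X_demo = rng.normal(size=(1000, 3))

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=24
)

# Stratifying on the target keeps the class proportions (nearly) identical
print('Fraction of class 1 overall: ', y_demo.mean())
print('Fraction of class 1 in train:', y_tr.mean())
print('Fraction of class 1 in test: ', y_te.mean())
```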

In [None]:
# in order to use all features in a ML model, we need to encode the categorical features
# first get the cardinality of each of the categorical features (to help decide how we will encode these features)
num_feats= X_train.select_dtypes(include= ['int64', 'float64']).columns
cat_feats= X_train.select_dtypes(include= 'object').columns
print(X_train[cat_feats].nunique())

In [None]:
# for Gender, Product_importance, and Mode_of_Shipment, we will use ordinal encoding. For Warehouse_block, we will use one-hot encoding
# for any numerical features, we will scale them so their values are between 0 and 1
print(X_train['Gender'].value_counts())
print(X_train['Product_importance'].value_counts())
print(X_train['Mode_of_Shipment'].value_counts())
transforms= [
    ('num_t', MinMaxScaler(), list(num_feats)),
    ('warehouse', OneHotEncoder(categories= 'auto', sparse= False), ['Warehouse_block']),
    ('gender', OrdinalEncoder(categories= [['M', 'F']]), ['Gender']),
    ('importance', OrdinalEncoder(categories= [['low', 'medium', 'high']]), ['Product_importance']),
    ('shipment', OrdinalEncoder(categories= [['Ship', 'Flight', 'Road']]), ['Mode_of_Shipment']),
]

col_transforms= ColumnTransformer(transforms)
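
To make the output of these transformations concrete, here is a small sketch on a toy DataFrame. The rows and the `weight` column are hypothetical; only `Warehouse_block` and `Product_importance` mirror column names used above. `sparse_threshold=0` is set just so the result prints as a dense array.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder, OrdinalEncoder

# Hypothetical toy rows, not taken from the real dataset
toy = pd.DataFrame({
    'weight': [0, 5, 10],
    'Warehouse_block': ['A', 'B', 'A'],
    'Product_importance': ['low', 'high', 'medium'],
})

toy_transforms = ColumnTransformer(
    [
        ('num_t', MinMaxScaler(), ['weight']),
        ('warehouse', OneHotEncoder(), ['Warehouse_block']),
        ('importance', OrdinalEncoder(categories=[['low', 'medium', 'high']]), ['Product_importance']),
    ],
    sparse_threshold=0,  # always return a dense array
)

# Output columns: scaled weight, one-hot A, one-hot B, ordinal importance
encoded = toy_transforms.fit_transform(toy)
print(encoded)
```

Each row ends up with one scaled numeric column, one indicator column per warehouse block, and a single integer column whose order follows the `categories` list (low = 0, medium = 1, high = 2).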

At this point, we have our dataset split into training and test sets. We also have all of our feature transformations set up and ready to be used in a pipeline. Next, we will create a ML model (SVM) and then define our pipeline before executing our hyperparameter grid search using our pipeline.
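
The grid search addresses hyperparameters of a pipeline step with keys of the form `<step name>__<parameter name>`, such as `mod__C`. The sketch below shows that convention on a stripped-down pipeline (`demo_pipeline` is a hypothetical stand-in that uses only a scaler instead of the full `ColumnTransformer`).

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# Stand-in pipeline with the same step names ('prep', 'mod') used in this notebook
demo_pipeline = Pipeline(steps=[('prep', MinMaxScaler()), ('mod', SVC())])

# Every parameter of the 'mod' step is exposed with a 'mod__' prefix
svm_keys = sorted(k for k in demo_pipeline.get_params() if k.startswith('mod__'))
print(svm_keys)

# The same keys can be used to set a value, which is what GridSearchCV does internally
demo_pipeline.set_params(mod__C=10)
print(demo_pipeline.named_steps['mod'].C)
```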

In [None]:
# define model, pipeline, and hyperparameter settings to test
model_svm= SVC()
pipeline_svm= Pipeline(steps= [('prep', col_transforms), ('mod', model_svm)])
params_svm= {'mod__C': [0.5, 1, 10], 'mod__kernel': ['linear', 'rbf'], 'mod__class_weight': [None, 'balanced'], 'mod__random_state': [24]}
search_svm= GridSearchCV(pipeline_svm, param_grid= params_svm, cv= 20, n_jobs= -1, scoring= 'roc_auc')
search_svm.fit(X_train, Y_train) # will take a while to run... test fewer hyperparameter settings to reduce the time!
print(search_svm.best_params_)
print(search_svm.best_score_)

From the hyperparameter grid search, we now have the best hyperparameter settings for our SVM model. The 20-fold cross-validation score (ROC-AUC) for these settings is 0.729. I chose this particular scoring function because of the class imbalance in what we are trying to predict (3549 instances of class 0 vs. 5250 instances of class 1). For reference, a perfect score is 1 and random guessing scores about 0.5. Since this score leaves room for improvement, let's try another model before we evaluate on the test set.
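
One way to read ROC-AUC is as the fraction of (class 1, class 0) pairs in which the class 1 example receives the higher score. The tiny example below uses made-up labels and scores to illustrate this; it is not output from the models above.

```python
from sklearn.metrics import roc_auc_score

# Made-up labels and model scores for four examples
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

# Of the four (class 1, class 0) pairs, three are ranked correctly:
# only the pair (0.35, 0.4) is out of order
auc = roc_auc_score(y_true, scores)
print(auc)  # 0.75

# A model that gives every example the same score cannot rank anything
auc_constant = roc_auc_score(y_true, [0.5, 0.5, 0.5, 0.5])
print(auc_constant)  # 0.5
```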

In [None]:
model_nn= KNeighborsClassifier()
pipeline_nn= Pipeline(steps= [('prep', col_transforms), ('mod', model_nn)])
params_nn= {'mod__n_neighbors': [3, 5, 7]}
search_nn= GridSearchCV(pipeline_nn, param_grid= params_nn, cv= 20, n_jobs= -1, scoring= 'roc_auc')
search_nn.fit(X_train, Y_train) # will take a while to run... test fewer hyperparameter settings to reduce the time!
print(search_nn.best_params_)
print(search_nn.best_score_)

The kNN model gives a worse result. You can continue to try different models and hyperparameter settings until you find a model that gives you a score you're happy with, but for the sake of this notebook, we will use the SVM model as our best model and evaluate its performance on the test set.
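
If you want to compare several model types in one go, one option is to treat the `mod` step itself as a hyperparameter by passing a list of grids. This is a sketch on synthetic data from `make_classification`, with a plain scaler standing in for the `ColumnTransformer`, so the names and settings here are assumptions rather than the setup used above.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# Synthetic stand-in data, purely for illustration
X_syn, y_syn = make_classification(n_samples=300, n_features=6, random_state=24)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.2, stratify=y_syn, random_state=24
)

pipe = Pipeline(steps=[('prep', MinMaxScaler()), ('mod', SVC())])

# Each dict is its own grid; the 'mod' key swaps the estimator in that step
param_grid = [
    {'mod': [SVC(random_state=24)], 'mod__C': [0.5, 1, 10], 'mod__kernel': ['linear', 'rbf']},
    {'mod': [KNeighborsClassifier()], 'mod__n_neighbors': [3, 5, 7]},
]

search = GridSearchCV(pipe, param_grid=param_grid, cv=5, scoring='roc_auc')
search.fit(X_tr, y_tr)

print(search.best_params_)
print(search.best_score_)

# refit=True is the default, so best_estimator_ is already refit on all of X_tr
test_preds = search.best_estimator_.predict(X_te)
```

Because the search refits the winning pipeline by default, `search.best_estimator_` can be used directly on the test set instead of retyping the best settings by hand.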

In [None]:
# rebuild the pipeline with the best SVM settings and fit it on the full training set
final_mod= SVC(C= 0.5, class_weight= 'balanced', kernel= 'rbf', random_state= 24)
pipeline_fin= Pipeline(steps= [('prep', col_transforms), ('mod', final_mod)])
pipeline_fin.fit(X_train, Y_train)
predicts= pipeline_fin.predict(X_test)
# plot the ROC curve and the confusion matrix for the test set
plot_roc_curve(pipeline_fin, X_test, Y_test)
plot_confusion_matrix(pipeline_fin, X_test, Y_test)

The confusion matrix shows how many test examples were misclassified: 435 examples with a ground truth label of 1 were misclassified as belonging to class 0, whereas 375 examples with a ground truth label of 0 were misclassified as belonging to class 1.
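
The same counts can be read off numerically with `confusion_matrix`. The labels below are made up to keep the example small; with the real model you would pass `Y_test` and the predictions instead.

```python
from sklearn.metrics import confusion_matrix

# Made-up ground truth and predictions for five examples
y_true = [0, 0, 1, 1, 1]
y_pred = [0, 1, 0, 1, 1]

# Rows are true labels, columns are predicted labels;
# for binary labels, ravel() returns the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

print('Class 0 predicted as class 1 (false positives):', fp)
print('Class 1 predicted as class 0 (false negatives):', fn)
```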

## Conclusions
This notebook provided example code on how to implement pipelines to conduct feature transformations and a hyperparameter grid search. Pipelines are super handy and can help reduce the risk of accidental data leakage: because the feature transformations live inside the pipeline, they are fit only on the training data (or on the training folds during cross-validation) and then applied unchanged to the test set. For the final model, we fit the pipeline on the training data and then made predictions on our test set. Leave your feedback, questions, and anything else you'd like to see from me in the comments!