# Credit Card Customers: Building a Model Pipeline

### Overview

In this notebook, I build a model pipeline and compare the performance of six different algorithms.

### Import Packages

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

import warnings
warnings.filterwarnings('ignore')

from collections import Counter

pd.set_option('display.max_columns', 100)
pd.set_option('display.max_rows', 100)

from pathlib import Path
import os
os.getcwd()

'/Users/yejiseoung/Dropbox/My Mac (Yejis-MacBook-Pro.local)/Documents/Projects/CreditCard'

In [2]:
# set up path for data
path = Path('/Users/yejiseoung/Dropbox/My Mac (Yejis-MacBook-Pro.local)/Documents/Projects/CreditCard/Data/')

In [3]:
# Data pre-processing 
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split


# Modelling 
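# Note: plot_roc_curve and plot_precision_recall_curve were removed in
# scikit-learn 1.2; RocCurveDisplay and PrecisionRecallDisplay replace them.
from sklearn.metrics import (
    roc_auc_score,
    RocCurveDisplay,
    precision_recall_curve,
    PrecisionRecallDisplay,
    auc,
    precision_score, 
    accuracy_score, 
    recall_score,
    classification_report, 
    confusion_matrix
)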

from yellowbrick.classifier import ROCAUC, PrecisionRecallCurve

from sklearn.linear_model import LogisticRegression
from sklearn import svm
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import (
    RandomForestClassifier, 
    AdaBoostClassifier, 
    GradientBoostingClassifier)

import xgboost as xgb

# for feature engineering
from feature_engine import encoding as ce

# Evaluation & CV
from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import KFold, cross_val_score


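# pipeline
from sklearn.pipeline import Pipeline


# for oversampling
from imblearn.over_sampling import ADASYN

# for cross-validation
# imblearn's make_pipeline accepts samplers such as ADASYN as steps, so it is
# imported instead of sklearn.pipeline.make_pipeline (importing both would
# leave only the later one bound to the name)
from imblearn.pipeline import make_pipeline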

### Load Data

In [4]:
df = pd.read_csv(path/'BankChurners.csv')

In [5]:
# drop columns that are not useful for modelling
df.drop(['CLIENTNUM',
        'Naive_Bayes_Classifier_Attrition_Flag_Card_Category_Contacts_Count_12_mon_Dependent_count_Education_Level_Months_Inactive_12_mon_1',
     'Naive_Bayes_Classifier_Attrition_Flag_Card_Category_Contacts_Count_12_mon_Dependent_count_Education_Level_Months_Inactive_12_mon_2'],
        axis=1, inplace=True)

df.shape

(10127, 20)

In [6]:
df.head(2)

| | Attrition_Flag | Customer_Age | Gender | Dependent_count | Education_Level | Marital_Status | Income_Category | Card_Category | Months_on_book | Total_Relationship_Count | Months_Inactive_12_mon | Contacts_Count_12_mon | Credit_Limit | Total_Revolving_Bal | Avg_Open_To_Buy | Total_Amt_Chng_Q4_Q1 | Total_Trans_Amt | Total_Trans_Ct | Total_Ct_Chng_Q4_Q1 | Avg_Utilization_Ratio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Existing Customer | 45 | M | 3 | High School | Married | $60K - $80K | Blue | 39 | 5 | 1 | 3 | 12691.0 | 777 | 11914.0 | 1.335 | 1144 | 42 | 1.625 | 0.061 |
| 1 | Existing Customer | 49 | F | 5 | Graduate | Single | Less than $40K | Blue | 44 | 6 | 1 | 2 | 8256.0 | 864 | 7392.0 | 1.541 | 1291 | 33 | 3.714 | 0.105 |


In [7]:
# create lists for categorical and numerical variables
cat_vars = [var for var in df.columns if df[var].dtype=='O' and var != 'Attrition_Flag']
num_vars = [var for var in df.columns if df[var].dtype!='O']

print('The number of categorical variables: {}'.format(len(cat_vars)))
print('The number of numerical variables: {}'.format(len(num_vars)))

The number of categorical variables: 5
The number of numerical variables: 14


## Separate Dataset into train and test

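It is important to split the data into training and test sets before engineering features, because some techniques (for example, encoders and scalers) learn parameters from the data. Learning those parameters from the training set only keeps information from the test set out of the model, which helps avoid over-fitting and an overly optimistic evaluation.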

In [9]:
X_train, X_test, y_train, y_test = train_test_split(
    df.drop(['Attrition_Flag'], axis=1),
    df['Attrition_Flag'],
    test_size=0.2, 
    random_state=0)

X_train.shape, X_test.shape

((8101, 19), (2026, 19))
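
To make the reason for splitting first concrete, here is a minimal sketch on toy data. The arrays `X_toy` and `y_toy` are hypothetical and are not part of the BankChurners dataset. The scaler learns its mean and standard deviation from the training rows only and then reuses them on the test rows.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Toy data (hypothetical): one numerical feature, 100 rows
rng = np.random.default_rng(0)
X_toy = rng.normal(loc=50.0, scale=10.0, size=(100, 1))
y_toy = rng.integers(0, 2, size=100)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=0)

scaler = StandardScaler()
X_tr_scaled = scaler.fit_transform(X_tr)  # mean and std learned from training rows only
X_te_scaled = scaler.transform(X_te)      # same parameters reused, nothing re-learned

# The stored mean matches the training mean, not the mean of the full data
print(np.allclose(scaler.mean_, X_tr.mean(axis=0)))
```

Calling `fit_transform` on the full data before splitting would let the test rows influence the learned mean and standard deviation, which is the leakage the split is meant to prevent.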

## Pipeline

Let's write the machine learning pipeline. It first encodes the categorical variables as integers and then scales all features.

In [None]:
pipe = Pipeline([
    # Integer Encoding for Categorical Variable
    ('encoding', ce.OrdinalEncoder(
        encoding_method='arbitrary', variables=cat_vars)),
    
    # Feature Scaling
    ('scaler', StandardScaler())
])
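
As a rough illustration of how a pipeline like this is used, the sketch below fits the steps on training data and applies them unchanged to test data. It runs on small toy frames with made-up values, and it swaps in scikit-learn's `OrdinalEncoder` inside a `ColumnTransformer` as a stand-in for `ce.OrdinalEncoder`, so the integer assigned to each category may differ from what feature_engine produces. The names `toy_train`, `toy_test`, `toy_cat_vars` and `toy_pipe` are hypothetical.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder, StandardScaler

# Toy frames (hypothetical values); column names borrowed from the dataset
toy_train = pd.DataFrame({
    'Gender': ['M', 'F', 'F', 'M'],
    'Card_Category': ['Blue', 'Silver', 'Blue', 'Gold'],
    'Customer_Age': [45, 49, 51, 40],
})
toy_test = pd.DataFrame({
    'Gender': ['F', 'M'],
    'Card_Category': ['Blue', 'Gold'],
    'Customer_Age': [44, 38],
})
toy_cat_vars = ['Gender', 'Card_Category']

toy_pipe = Pipeline([
    # Stand-in for ce.OrdinalEncoder: encode the categorical columns as
    # integers and pass the remaining numerical columns through unchanged
    ('encoding', ColumnTransformer(
        [('cat', OrdinalEncoder(), toy_cat_vars)],
        remainder='passthrough')),

    # Feature scaling
    ('scaler', StandardScaler()),
])

# Category mappings and scaling statistics are learned from the training frame
train_scaled = toy_pipe.fit_transform(toy_train)

# The test frame is only transformed with the parameters learned above
test_scaled = toy_pipe.transform(toy_test)

print(train_scaled.shape, test_scaled.shape)
```

With the notebook's own objects, the equivalent calls would presumably be `pipe.fit_transform(X_train)` followed by `pipe.transform(X_test)`.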