### **Importing Related Notebooks** 

In [2]:
import import_ipynb
import Explanatory_Data_Analysis

data = Explanatory_Data_Analysis.data

### **Feature Engineering**

Before machine learning modeling, the raw data must be prepared into a form that a model can use. This preparation is referred to as feature engineering. Good feature engineering helps machine learning models predict the target variable more accurately.

##### **Transformation and Feature Selection**

The features used for modeling must be numerical. Non-numerical features will therefore be encoded according to the characteristics of their categories. First, the attribute information of the data produced by the previous process is inspected.

In [3]:
import pandas as pd

items = [[
    col, 
    data[col].dtype, 
    data[col].nunique(), 
    list(data[col].unique())
] for col in data]

pd.DataFrame(
    data=items,
    columns=[
        'Attributes',
        'Data Type',
        'Total Unique',
        'Unique Sample'
    ]
)

| | Attributes | Data Type | Total Unique | Unique Sample |
| --- | --- | --- | --- | --- |
| 0 | Dependents | category | 2 | [Yes, No] |
| 1 | tenure | int8 | 73 | [9, 14, 72, 3, 40, 17, 11, 8, 47, 18, 5, 1, 48... |
| 2 | OnlineSecurity | category | 3 | [No, Yes, No internet service] |
| 3 | OnlineBackup | category | 3 | [No, Yes, No internet service] |
| 4 | InternetService | category | 3 | [DSL, Fiber optic, No] |
| 5 | DeviceProtection | category | 3 | [Yes, No internet service, No] |
| 6 | TechSupport | category | 3 | [Yes, No, No internet service] |
| 7 | Contract | category | 3 | [Month-to-month, Two year, One year] |
| 8 | PaperlessBilling | category | 2 | [Yes, No] |
| 9 | MonthlyCharges | float32 | 1395 | [72.9, 82.65, 69.65, 23.6, 74.55, 19.7, 44.05,... |


Based on the information above, the data will be preprocessed so that it can be used to build a machine learning model. The steps are as follows:

1. Features with 2 classes, namely `Dependents` and `PaperlessBilling`, will be encoded using `OneHotEncoder()` because they are nominal. This would produce one dummy column per class, that is, 2 dummy columns per feature, so one of the two will be dropped to avoid multicollinearity.
1. Features with more than 2 classes and nominal categories, namely `OnlineSecurity`, `OnlineBackup`, `InternetService`, `DeviceProtection`, and `TechSupport`, will also be encoded using `OneHotEncoder()`. Afterwards, the `No internet service` column of each additional service feature and the `No` column of the `InternetService` feature will be dropped. These columns need to be deleted because they are related to one another: each of them flags customers without internet service.
1. The `Contract` feature, which is ordinal, will be encoded with an explicit mapping because its classes have a natural order. Class `Month-to-month` becomes `0`, class `One year` becomes `1`, and class `Two year` becomes `2`.
1. The `Churn` feature, which is categorical and is the target variable, will be mapped using `FunctionTransformer()` so that class `Yes` becomes 1 and class `No` becomes 0.
1. Numerical features will be binned and scaled. This aims to improve the performance and accuracy of the machine learning model by ensuring that each numerical feature contributes equally to the learning process. Since the `tenure` and `MonthlyCharges` features do not meet the parametric assumption, `RobustScaler()` will be used for scaling, and `KBinsDiscretizer()` will be used to group the continuous values into a series of intervals. These two transformers were chosen because they reduce the impact of outliers better than the alternatives. In the code below, binning is applied first, followed by scaling.

All of these steps will be combined in a `ColumnTransformer()` from the **Scikit-learn** library, which preprocesses each group of features separately and then reunites the results in a single processed dataset. It also ensures that each column receives the intended transformation, which keeps the data from being mixed up and minimizes the risk of preprocessing errors.

In [4]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, RobustScaler, FunctionTransformer, KBinsDiscretizer
from category_encoders import OrdinalEncoder
from imblearn.pipeline import Pipeline

mapping = [{
    'col':'Contract',
    'mapping':{
        'Month-to-month':0,
        'One year':1,
        'Two year':2
    }
}]

num_preprocessor = Pipeline([
    ('binning',KBinsDiscretizer(
        n_bins=20,
        encode='ordinal',
        strategy='quantile'
    )),
    ('scaling',RobustScaler(quantile_range=(0,100)))
])

transformer = ColumnTransformer([
    (
        'multiclass_onehot',
        OneHotEncoder(),
        ['InternetService','OnlineSecurity','OnlineBackup','DeviceProtection','TechSupport']
    ),
    (
        'binary_onehot',
        OneHotEncoder(drop='if_binary'),
        ['Dependents','PaperlessBilling']
    ),
    (
        'ordinal',
        OrdinalEncoder(mapping=mapping),
        ['Contract']
    ),
    (
        'map',
        FunctionTransformer(func=lambda x: x=='Yes'),
        ['Churn']
    ),
    (
        'num_preprocessor',
        num_preprocessor,
        ['tenure','MonthlyCharges']
    )
])
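To see what the numerical pipeline produces, the same two steps can be run on synthetic data. This is a minimal sketch, assuming a hypothetical skewed `sample` column and using `sklearn.pipeline.Pipeline` instead of the `imblearn` pipeline to keep the example small.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import KBinsDiscretizer, RobustScaler

# Hypothetical skewed sample standing in for a column such as MonthlyCharges.
rng = np.random.default_rng(1995)
sample = rng.exponential(scale=50.0, size=(1000, 1))

toy_num_preprocessor = Pipeline([
    ('binning', KBinsDiscretizer(n_bins=20, encode='ordinal', strategy='quantile')),
    ('scaling', RobustScaler(quantile_range=(0, 100))),
])
scaled = toy_num_preprocessor.fit_transform(sample)

# With quantile_range=(0, 100) the scaler subtracts the median of the bin
# indices and divides by their (max - min), so the output has width 1.
value_range = scaled.max() - scaled.min()
median_value = np.median(scaled)
print(value_range, median_value)
```

Because binning comes first, the scaler only sees bin indices, so extreme raw values cannot stretch the scaled range.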

Before fitting and transforming the `data` dataset, it must first be split into 4 parts, namely `X_train`, `X_test`, `y_train`, and `y_test`, using the `train_test_split()` function from the **Scikit-learn** library.

In [5]:
from sklearn.model_selection import train_test_split
import numpy as np

np.random.seed(1995)
X_train, X_test, y_train, y_test = train_test_split(
    data.drop('Churn',axis=1),
    data['Churn'],
    stratify=data['Churn'],
    test_size=0.2,
    random_state=1995
)
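The effect of the `stratify` argument can be checked on synthetic labels. The sketch below assumes a hypothetical target with 20% positive cases; it is not the churn data itself.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced target: 80 negatives and 20 positives.
toy_y = np.array([0] * 80 + [1] * 20)
toy_X = np.arange(100).reshape(-1, 1)

toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy_X,
    toy_y,
    stratify=toy_y,
    test_size=0.2,
    random_state=1995
)

# Stratification keeps the positive share the same in both parts.
train_positive_share = toy_y_train.mean()
test_positive_share = toy_y_test.mean()
print(train_positive_share, test_positive_share)
```

Both parts keep the 20% positive share, which is the reason for stratifying on an imbalanced target.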

Because the `Churn` feature is still categorical, it also needs to be encoded through `ColumnTransformer()`. For that reason, `y_train` and `y_test` are combined with `X_train` and `X_test` to form `data_train` and `data_test`. The `ColumnTransformer()` is then fitted only on the training dataset, in this case `data_train`, to avoid information leakage, which would make the model validation scores too optimistic.
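The fit-on-training-only rule can be illustrated with a single scaler. This is a minimal sketch with hypothetical values: the statistics are learned from the training rows and merely applied to the test row.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Hypothetical values: five training rows and one extreme test row.
toy_train = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
toy_test = np.array([[100.0]])

# fit() sees the training rows only; the test row never influences
# the learned median (center_) or interquartile range (scale_).
scaler = RobustScaler().fit(toy_train)
toy_test_scaled = scaler.transform(toy_test)

print(scaler.center_, scaler.scale_, toy_test_scaled)
```

Had the scaler been fitted on both parts together, the extreme test value would have shifted the learned statistics, which is the kind of leakage the paragraph above warns about.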

In [6]:
for i, set_ in enumerate([[X_train,y_train],[X_test,y_test]]):
    set_[0]['Churn'] = set_[1]
    if i == 0:
        data_train = set_[0].copy()
    else:
        data_test = set_[0].copy()

train_encoded_values = transformer.fit_transform(data_train)
test_encoded_values = transformer.transform(data_test)

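# Collect the output names of the first three transformers (the two one-hot
# encoders and the ordinal encoder); the remaining columns keep their names.
feature_names = []

for i in range(3):
    feature_names += list(transformer.transformers_[i][1].get_feature_names_out())

feature_names += ['Churn','tenure','MonthlyCharges']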
data_train_encoded = pd.DataFrame(
    data=train_encoded_values,
    columns=feature_names
)

data_test_encoded = pd.DataFrame(
    data=test_encoded_values,
    columns=feature_names
)

After the transformation has produced new training and test datasets, several features that are no longer needed will be dropped, namely `InternetService_No`, `OnlineSecurity_No internet service`, `OnlineBackup_No internet service`, `DeviceProtection_No internet service`, and `TechSupport_No internet service`.

In [7]:
cols_dropped = [
    'InternetService_No',
    'OnlineSecurity_No internet service',
    'OnlineBackup_No internet service',
    'DeviceProtection_No internet service',
    'TechSupport_No internet service'
]

data_train_encoded = data_train_encoded.drop(
    columns=cols_dropped,
    axis=1
)

data_test_encoded = data_test_encoded.drop(
    columns=cols_dropped,
    axis=1
)

##### **Data Formatting**

After feature selection, the resulting datasets are ready to be used: the training dataset to build and validate the machine learning model, and the test dataset to test the model that has been built. Next, the attribute information of the two datasets is examined.

In [8]:
for i, dataset in enumerate([data_train_encoded,data_test_encoded]):
    items = [[
        col, 
        dataset[col].dtype
    ] for col in dataset]
    
    print('\nAttribute Information of Encoded {}'.format('Training Dataset' if i == 0 else 'Test Dataset'))
    display(pd.DataFrame(
        data=items,
        columns=[
            'Features',
            'Data Type'
        ]
    ))
    
    dataset.info(verbose=False)


Attribute Information of Encoded Training Dataset


| | Features | Data Type |
| --- | --- | --- |
| 0 | InternetService_DSL | float64 |
| 1 | InternetService_Fiber optic | float64 |
| 2 | OnlineSecurity_No | float64 |
| 3 | OnlineSecurity_Yes | float64 |
| 4 | OnlineBackup_No | float64 |
| 5 | OnlineBackup_Yes | float64 |
| 6 | DeviceProtection_No | float64 |
| 7 | DeviceProtection_Yes | float64 |
| 8 | TechSupport_No | float64 |
| 9 | TechSupport_Yes | float64 |


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3772 entries, 0 to 3771
Columns: 16 entries, InternetService_DSL to MonthlyCharges
dtypes: float64(16)
memory usage: 471.6 KB

Attribute Information of Encoded Test Dataset


| | Features | Data Type |
| --- | --- | --- |
| 0 | InternetService_DSL | float64 |
| 1 | InternetService_Fiber optic | float64 |
| 2 | OnlineSecurity_No | float64 |
| 3 | OnlineSecurity_Yes | float64 |
| 4 | OnlineBackup_No | float64 |
| 5 | OnlineBackup_Yes | float64 |
| 6 | DeviceProtection_No | float64 |
| 7 | DeviceProtection_Yes | float64 |
| 8 | TechSupport_No | float64 |
| 9 | TechSupport_Yes | float64 |


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 944 entries, 0 to 943
Columns: 16 entries, InternetService_DSL to MonthlyCharges
dtypes: float64(16)
memory usage: 118.1 KB


It can be seen that every feature of `data_train_encoded` and `data_test_encoded` is still of type `float64`, which uses more memory than the values require. Therefore, the two datasets will be reformatted to make memory usage more efficient and to facilitate further processing. The features that were of type category in the `data` dataset will be converted to `int8` because they only contain small integers (0 and 1, or 0 to 2 for `Contract`). The remaining features, namely `tenure` and `MonthlyCharges`, will be converted to `float32`.
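The effect of such a cast can be measured on a small frame first. This is a minimal sketch, assuming a hypothetical `toy` frame with two flag columns (`flag_a` and `flag_b` are made-up names) and the two numerical columns; it selects columns by name through a dtype mapping rather than by position.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: 1000 rows, all float64 as after the transformation.
toy = pd.DataFrame(
    np.zeros((1000, 4)),
    columns=['flag_a', 'flag_b', 'tenure', 'MonthlyCharges']
)
before_bytes = toy.memory_usage(index=False).sum()

# Cast by column name: 1 byte per flag value, 4 bytes per numerical value.
compact = toy.astype({
    'flag_a': 'int8',
    'flag_b': 'int8',
    'tenure': 'float32',
    'MonthlyCharges': 'float32',
})
after_bytes = compact.memory_usage(index=False).sum()

print(before_bytes, after_bytes)
```

Selecting columns by name avoids relying on the column order, which positional slicing depends on.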

In [9]:
for i, dataset in enumerate([data_train_encoded,data_test_encoded]):
    before = [dataset[j].dtype for j in dataset.columns]
    dataset[list(dataset.columns)[:-2]] = dataset[list(dataset.columns)[:-2]].astype('int8')
    dataset[list(dataset.columns)[-2:]] = dataset[list(dataset.columns)[-2:]].astype('float32')
    items = [[
        col, 
        before[k], 
        dataset[col].dtype
    ] for k, col in enumerate(dataset)]

    print('\nAttribute Information of Encoded {}'.format('Training Dataset'if i == 0 else 'Test Dataset'))
    display(pd.DataFrame(
        data=items,
        columns=[
            'Features',
            'Before',
            'After'
        ]
    ))
    
    dataset.info(verbose=False)


Attribute Information of Encoded Training Dataset


| | Features | Before | After |
| --- | --- | --- | --- |
| 0 | InternetService_DSL | float64 | int8 |
| 1 | InternetService_Fiber optic | float64 | int8 |
| 2 | OnlineSecurity_No | float64 | int8 |
| 3 | OnlineSecurity_Yes | float64 | int8 |
| 4 | OnlineBackup_No | float64 | int8 |
| 5 | OnlineBackup_Yes | float64 | int8 |
| 6 | DeviceProtection_No | float64 | int8 |
| 7 | DeviceProtection_Yes | float64 | int8 |
| 8 | TechSupport_No | float64 | int8 |
| 9 | TechSupport_Yes | float64 | int8 |


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3772 entries, 0 to 3771
Columns: 16 entries, InternetService_DSL to MonthlyCharges
dtypes: float32(2), int8(14)
memory usage: 81.2 KB

Attribute Information of Encoded Test Dataset


| | Features | Before | After |
| --- | --- | --- | --- |
| 0 | InternetService_DSL | float64 | int8 |
| 1 | InternetService_Fiber optic | float64 | int8 |
| 2 | OnlineSecurity_No | float64 | int8 |
| 3 | OnlineSecurity_Yes | float64 | int8 |
| 4 | OnlineBackup_No | float64 | int8 |
| 5 | OnlineBackup_Yes | float64 | int8 |
| 6 | DeviceProtection_No | float64 | int8 |
| 7 | DeviceProtection_Yes | float64 | int8 |
| 8 | TechSupport_No | float64 | int8 |
| 9 | TechSupport_Yes | float64 | int8 |


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 944 entries, 0 to 943
Columns: 16 entries, InternetService_DSL to MonthlyCharges
dtypes: float32(2), int8(14)
memory usage: 20.4 KB


It can be seen that memory usage drops to 81.2 KB for `data_train_encoded` and 20.4 KB for `data_test_encoded`. Compared with the previous 471.6 KB and 118.1 KB, this is almost 6 times smaller, which saves memory and may allow later steps to run faster.
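The reported figures can be checked from the dtype sizes alone. The sketch below takes the row and column counts from the `info()` output above; the difference of roughly 0.1 KB from the reported totals is presumably index overhead, which is an assumption here.

```python
# Counts taken from the info() output: 16 columns, 14 of them int8 after the cast.
rows_train = 3772
rows_test = 944

bytes_per_row_before = 16 * 8          # 16 float64 columns
bytes_per_row_after = 14 * 1 + 2 * 4   # 14 int8 columns + 2 float32 columns

train_before_kb = rows_train * bytes_per_row_before / 1024
train_after_kb = rows_train * bytes_per_row_after / 1024
test_before_kb = rows_test * bytes_per_row_before / 1024
test_after_kb = rows_test * bytes_per_row_after / 1024
reduction_factor = bytes_per_row_before / bytes_per_row_after

print(round(train_before_kb, 1), round(train_after_kb, 1))
print(round(test_before_kb, 1), round(test_after_kb, 1))
print(round(reduction_factor, 2))
```

Each row shrinks from 128 bytes to 22 bytes, which matches the observed reduction of almost 6 times.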

In [10]:
X_trainval = data_train_encoded.drop(
    columns='Churn',
    axis=1
)

y_trainval = data_train_encoded[['Churn']]

X_test = data_test_encoded.drop(
    columns='Churn',
    axis=1
)

y_test = data_test_encoded[['Churn']]

for i in [X_trainval,X_test]:
    i.columns = i.columns.astype(dtype='string')

Next, `data_train_encoded` and `data_test_encoded` are each split into features and target, giving `X_trainval`, `y_trainval`, `X_test`, and `y_test`. `X_trainval` and `y_trainval` will be used as the dataset for the model benchmarking process. For `X_trainval` and `X_test`, the column labels are also cast to the string type because of column data type problems caused by the earlier transformation process.