# **Machine Learning Workflow**
---
> Introduction to Machine Learning <br>

## **Machine Learning Workflow** (Simplified)
---

### 1. <font color='blue'> Importing Data to Python</font>
    * Drop Duplicates
### 2. <font color='blue'> Data Preprocessing:</font>
    * Input-Output Split, Train-Test Split
    * Imputation, Processing Categorical, Normalization
### 3. <font color='blue'> Training Machine Learning:</font>
    * Choose Score to optimize and Hyperparameter Space

## **Machine Learning with Scikit-Learn is Easy**
---

<center>
<img src="https://img.ifunny.co/images/1f58ab4c0a13ce4b916aa56838792ea21938849394e9a0b309fffbe69e9dce21_1.jpg">
</center>


## **Bank Analysis**
---

- Task : Classification
- Objective : Predict which bank clients will subscribe to a term deposit

<br>

<center>
<img src="https://keralagbank.com/public/images/inner/personal/term-deposit.png">
</center>

### **Data description:**

**Bank Client Data**:

- `age` (numeric)
- `job` : type of job (categorical: "admin.", "unknown", "unemployed", "management", "housemaid", "entrepreneur", "student", "blue-collar", "self-employed", "retired", "technician", "services")
- `marital` : marital status (categorical: "married", "divorced", "single"; note: "divorced" means divorced or widowed)
- `education` (categorical: "unknown", "secondary", "primary", "tertiary")
- `default`: has credit in default? (binary: "yes", "no")
- `balance`: average yearly balance, in euros (numeric)
- `housing`: has housing loan? (binary: "yes", "no")
- `loan`: has personal loan? (binary: "yes", "no")

<br>

**Last contact of the current campaign**
- `contact`: contact communication type (categorical: "unknown", "telephone", "cellular")
- `day`: last contact day of the month (numeric)
- `month`: last contact month of year (categorical: "jan", "feb", "mar", ..., "nov", "dec")
- `duration`: last contact duration, in seconds (numeric)

<br>

**Other attributes/features**
- `campaign`: number of contacts performed during this campaign and for this client (numeric, includes last contact)
- `pdays`: number of days that passed by after the client was last contacted from a previous campaign (numeric, -1 means client was not previously contacted)
- `previous`: number of contacts performed before this campaign and for this client (numeric)
- `poutcome`: outcome of the previous marketing campaign (categorical: "unknown","other","failure","success")

<br>

**Output variable (desired target)**
- `y`: has the client subscribed to a term deposit? (binary: "yes","no")

## <b><font color='blue'>1.  Importing Data to Python</font></b>
---

You can import data from various formats:
- .csv
- Excel (.xlsx)
- .txt
- SQL
- .dat
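
As a minimal sketch of how some of these formats map to pandas readers, the block below parses in-memory text instead of real files, so the sample contents are made up for illustration. Excel files and SQL databases have their own readers (`pd.read_excel`, `pd.read_sql`), which need extra dependencies and are therefore only mentioned in comments.

```python
import io

import pandas as pd

# Hypothetical sample contents standing in for files on disk
csv_text = "age,job\n58,management\n33,technician\n"
txt_text = "age\tjob\n58\tmanagement\n33\ttechnician\n"

# Comma-separated values (.csv)
csv_df = pd.read_csv(io.StringIO(csv_text))

# Delimited text (.txt, .dat): pass the delimiter explicitly
txt_df = pd.read_csv(io.StringIO(txt_text), sep="\t")

# Spreadsheets and databases have their own readers:
#   pd.read_excel(path)        # needs an Excel engine such as openpyxl
#   pd.read_sql(query, conn)   # needs an open database connection

# Both readers should produce the same table here
print(csv_df.equals(txt_df))
```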

**Import the data-processing libraries**

Typically:
- Pandas
- NumPy



In [2]:
# Import the library for handling data structures
import pandas as pd

# Import the library for numerical computing
import numpy as np

**Load data**

- Use `pd.read_csv()` when the file is a .csv
- Load the data from `bank-data.csv`

In [3]:
bank_df = pd.read_csv("bank-data.csv")

In [4]:
bank_df.head()

Unnamed: 0,age,job,marital,education,default,balance,housing,loan,contact,day,month,duration,campaign,pdays,previous,poutcome,y
0,58.0,management,married,tertiary,no,2143.0,yes,no,unknown,,,261.0,1.0,-1.0,0.0,unknown,no
1,,technician,single,secondary,no,29.0,yes,no,unknown,5.0,may,151.0,1.0,-1.0,0.0,unknown,no
2,,entrepreneur,married,secondary,no,2.0,yes,yes,unknown,5.0,may,76.0,1.0,-1.0,0.0,unknown,no
3,,blue-collar,married,unknown,no,1506.0,yes,no,unknown,5.0,may,92.0,1.0,-1.0,0.0,unknown,no
4,33.0,unknown,single,unknown,no,1.0,no,no,,5.0,may,198.0,1.0,-1.0,0.0,unknown,no


In [5]:
bank_df.shape

(45211, 17)

**Check for and drop duplicate rows**

- Check with `.duplicated()`

In [6]:
duplicate_status = bank_df.duplicated()
duplicate_status

Unnamed: 0,0
0,False
1,False
2,False
3,False
4,False
...,...
45206,False
45207,False
45208,False
45209,False


In [7]:
duplicate_status.sum()

0

In [8]:
bank_df = bank_df.drop_duplicates()

In [9]:
bank_df.shape

(45211, 17)
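
The dataset above happens to contain no duplicates, so the effect of these calls is not visible. As a small illustration on a made-up DataFrame (`toy_df` is hypothetical, not part of the bank data), this sketch shows what `.duplicated()` and `.drop_duplicates()` do when repeated rows exist.

```python
import pandas as pd

# Made-up data: the second row repeats the first
toy_df = pd.DataFrame({
    "age": [30, 30, 45, 30],
    "job": ["admin.", "admin.", "services", "student"],
})

# duplicated() marks rows that repeat an earlier row (keep="first" by default)
row_flags = toy_df.duplicated()
print(row_flags.tolist())

# Restrict the comparison to selected columns with `subset`
age_flags = toy_df.duplicated(subset=["age"])
print(age_flags.tolist())

# drop_duplicates() keeps the first occurrence of each repeated row
deduped = toy_df.drop_duplicates()
print(deduped.shape)
```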

In [13]:
bank_df = pd.read_csv("bank-data.csv")
print("Original data        : ", bank_df.shape, "- (#observations, #columns)")

bank_df = bank_df.drop_duplicates()
print("Data after dropping  : ", bank_df.shape, "- (#observations, #columns)")

Original data        :  (45211, 17) - (#observations, #columns)
Data after dropping  :  (45211, 17) - (#observations, #columns)


In [11]:
def importData(filename):
    """
    Function to import data & delete duplicates
    :param filename: <string> input file name (.csv format)
    :return df: <pandas dataframe> sample data
    """

    # read data
    df = pd.read_csv(filename)
    print("Original data        : ", df.shape, "- (#observations, #columns)")

    # drop duplicates
    df = df.drop_duplicates()
    print("Data after dropping  : ", df.shape, "- (#observations, #columns)")

    return df

In [14]:
file_bank = "bank-data.csv"

bank_df = importData(filename = file_bank)

Original data        :  (45211, 17) - (#observations, #columns)
Data after dropping  :  (45211, 17) - (#observations, #columns)


In [15]:
bank_df.head()

Unnamed: 0,age,job,marital,education,default,balance,housing,loan,contact,day,month,duration,campaign,pdays,previous,poutcome,y
0,58.0,management,married,tertiary,no,2143.0,yes,no,unknown,,,261.0,1.0,-1.0,0.0,unknown,no
1,,technician,single,secondary,no,29.0,yes,no,unknown,5.0,may,151.0,1.0,-1.0,0.0,unknown,no
2,,entrepreneur,married,secondary,no,2.0,yes,yes,unknown,5.0,may,76.0,1.0,-1.0,0.0,unknown,no
3,,blue-collar,married,unknown,no,1506.0,yes,no,unknown,5.0,may,92.0,1.0,-1.0,0.0,unknown,no
4,33.0,unknown,single,unknown,no,1.0,no,no,,5.0,may,198.0,1.0,-1.0,0.0,unknown,no


##  2. Data Preprocessing

     Input-Output Split, Train-Test Split
     Processing Categorical
     Imputation, Normalization, Drop Duplicates

In [16]:
bank_df.head()

Unnamed: 0,age,job,marital,education,default,balance,housing,loan,contact,day,month,duration,campaign,pdays,previous,poutcome,y
0,58.0,management,married,tertiary,no,2143.0,yes,no,unknown,,,261.0,1.0,-1.0,0.0,unknown,no
1,,technician,single,secondary,no,29.0,yes,no,unknown,5.0,may,151.0,1.0,-1.0,0.0,unknown,no
2,,entrepreneur,married,secondary,no,2.0,yes,yes,unknown,5.0,may,76.0,1.0,-1.0,0.0,unknown,no
3,,blue-collar,married,unknown,no,1506.0,yes,no,unknown,5.0,may,92.0,1.0,-1.0,0.0,unknown,no
4,33.0,unknown,single,unknown,no,1.0,no,no,,5.0,may,198.0,1.0,-1.0,0.0,unknown,no


In [17]:
output_data = bank_df["y"]

In [18]:
input_data = bank_df.drop(["y"],
                          axis = 1)

In [19]:
input_data.head()

Unnamed: 0,age,job,marital,education,default,balance,housing,loan,contact,day,month,duration,campaign,pdays,previous,poutcome
0,58.0,management,married,tertiary,no,2143.0,yes,no,unknown,,,261.0,1.0,-1.0,0.0,unknown
1,,technician,single,secondary,no,29.0,yes,no,unknown,5.0,may,151.0,1.0,-1.0,0.0,unknown
2,,entrepreneur,married,secondary,no,2.0,yes,yes,unknown,5.0,may,76.0,1.0,-1.0,0.0,unknown
3,,blue-collar,married,unknown,no,1506.0,yes,no,unknown,5.0,may,92.0,1.0,-1.0,0.0,unknown
4,33.0,unknown,single,unknown,no,1.0,no,no,,5.0,may,198.0,1.0,-1.0,0.0,unknown


In [20]:
output_data = bank_df["y"]
input_data = bank_df.drop("y",
                          axis = 1)

In [21]:
def extractInputOutput(data,
                       output_column_name):
    """
    Split the data into input and output
    :param data: <pandas dataframe> all sample data
    :param output_column_name: <string> name of the output column
    :return input_data: <pandas dataframe> input data
    :return output_data: <pandas series> output data
    """
    output_data = data[output_column_name]
    input_data = data.drop(output_column_name,
                           axis = 1)

    return input_data, output_data



In [22]:
X, y = extractInputOutput(data = bank_df,
                          output_column_name = "y")

In [23]:
X.head(2)

Unnamed: 0,age,job,marital,education,default,balance,housing,loan,contact,day,month,duration,campaign,pdays,previous,poutcome
0,58.0,management,married,tertiary,no,2143.0,yes,no,unknown,,,261.0,1.0,-1.0,0.0,unknown
1,,technician,single,secondary,no,29.0,yes,no,unknown,5.0,may,151.0,1.0,-1.0,0.0,unknown


In [24]:
from sklearn.model_selection import train_test_split

In [25]:
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size = 0.25,
                                                    random_state = 12)

In [26]:
print(X_train.shape)
print(X_test.shape)

(33908, 16)
(11303, 16)


In [27]:
X_test.shape[0] / X.shape[0]

0.25000552962774547
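
The target in this dataset is imbalanced (roughly 88% "no"), so a purely random split can shift the class proportions slightly between train and test. One common variant, which is not what the cell above does, is to pass `stratify` to `train_test_split`. The sketch below uses made-up arrays (`X_toy`, `y_toy`) to show that the split then preserves the class ratio.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up imbalanced data: 88 "no" and 12 "yes"
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.array(["no"] * 88 + ["yes"] * 12)

# stratify=y_toy keeps the "yes"/"no" ratio the same in both parts
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy,
                                          test_size = 0.25,
                                          random_state = 12,
                                          stratify = y_toy)

print("share of 'yes' in train:", (y_tr == "yes").mean())
print("share of 'yes' in test :", (y_te == "yes").mean())
```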

### **Data Imputation**

In [28]:
X_train.isnull().sum()

Unnamed: 0,0
age,2626
job,2650
marital,2650
education,2542
default,2689
balance,2574
housing,2660
loan,2668
contact,2695
day,2617


In [29]:
X_train.head()

Unnamed: 0,age,job,marital,education,default,balance,housing,loan,contact,day,month,duration,campaign,pdays,previous,poutcome
37156,35.0,management,single,tertiary,no,2749.0,no,no,cellular,13.0,may,127.0,1.0,-1.0,0.0,unknown
20494,30.0,management,,,no,443.0,yes,,cellular,12.0,,80.0,2.0,-1.0,0.0,unknown
35272,39.0,management,,tertiary,no,4239.0,yes,no,cellular,7.0,may,40.0,1.0,-1.0,0.0,unknown
22260,49.0,services,,,no,400.0,no,no,cellular,21.0,aug,151.0,3.0,-1.0,0.0,unknown
2728,28.0,technician,single,secondary,no,468.0,yes,no,unknown,13.0,may,152.0,3.0,-1.0,0.0,unknown


In [30]:
X_train.columns

Index(['age', 'job', 'marital', 'education', 'default', 'balance', 'housing',
       'loan', 'contact', 'day', 'month', 'duration', 'campaign', 'pdays',
       'previous', 'poutcome'],
      dtype='object')

In [31]:
numerical_column = ["age", "balance", "day", "duration",
                    "campaign", "pdays", "previous"]

In [32]:
X_train_numerical = X_train[numerical_column]

In [33]:
X_train_numerical.head()

Unnamed: 0,age,balance,day,duration,campaign,pdays,previous
37156,35.0,2749.0,13.0,127.0,1.0,-1.0,0.0
20494,30.0,443.0,12.0,80.0,2.0,-1.0,0.0
35272,39.0,4239.0,7.0,40.0,1.0,-1.0,0.0
22260,49.0,400.0,21.0,151.0,3.0,-1.0,0.0
2728,28.0,468.0,13.0,152.0,3.0,-1.0,0.0


In [34]:
X_train_numerical.isnull().any()

Unnamed: 0,0
age,True
balance,True
day,True
duration,True
campaign,True
pdays,True
previous,True


In [35]:
from sklearn.impute import SimpleImputer

In [36]:
imputer = SimpleImputer(missing_values = np.nan,
                        strategy = "median")


In [37]:
imputer.fit(X_train_numerical)

# Transform
imputed_data = imputer.transform(X_train_numerical)
X_train_numerical_imputed = pd.DataFrame(imputed_data)

X_train_numerical_imputed.columns = X_train_numerical.columns
X_train_numerical_imputed.index = X_train_numerical.index

In [38]:
X_train_numerical_imputed.isnull().any()

Unnamed: 0,0
age,False
balance,False
day,False
duration,False
campaign,False
pdays,False
previous,False
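
To make the fit/transform split concrete, here is a small sketch on made-up data (`toy_train` and `toy_test` are hypothetical). The imputer learns the median from the training data only, and that same value is then used to fill gaps in any other data, which is the reason the fitted imputer is kept around.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up data with missing ages
toy_train = pd.DataFrame({"age": [20.0, 30.0, np.nan, 50.0]})
toy_test = pd.DataFrame({"age": [np.nan, 41.0]})

# fit() learns the median of the non-missing training values
imputer_toy = SimpleImputer(missing_values = np.nan, strategy = "median")
imputer_toy.fit(toy_train)
learned_median = imputer_toy.statistics_[0]

# transform() reuses the training median on the test data
test_imputed = pd.DataFrame(imputer_toy.transform(toy_test),
                            columns = toy_test.columns,
                            index = toy_test.index)
print(learned_median)
print(test_imputed)
```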


In [39]:
from sklearn.impute import SimpleImputer

def numericalImputation(data, numerical_column):
    """
    Impute missing values in the numerical columns
    :param data: <pandas dataframe> input sample data
    :param numerical_column: <list> names of the numerical columns
    :return numerical_data_imputed: <pandas dataframe> imputed numerical data
    :return imputer_numerical: <sklearn SimpleImputer> fitted numerical imputer
    """
    # Select the numerical columns
    numerical_data = data[numerical_column]

    # Create and fit the imputer
    imputer_numerical = SimpleImputer(missing_values = np.nan,
                                      strategy = "median")
    imputer_numerical.fit(numerical_data)

    # Transform
    imputed_data = imputer_numerical.transform(numerical_data)
    numerical_data_imputed = pd.DataFrame(imputed_data)

    numerical_data_imputed.columns = numerical_column
    numerical_data_imputed.index = numerical_data.index

    return numerical_data_imputed, imputer_numerical

In [40]:
numerical_column = ["age", "balance", "day", "duration",
                    "campaign", "pdays", "previous"]

# Imputation Numeric
X_train_numerical, imputer_numerical = numericalImputation(data = X_train,
                                                           numerical_column = numerical_column)

In [41]:
X_train_numerical.isnull().any()

Unnamed: 0,0
age,False
balance,False
day,False
duration,False
campaign,False
pdays,False
previous,False


In [42]:
X_train_column = list(X_train.columns)
categorical_column = list(set(X_train_column).difference(set(numerical_column)))
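
A set difference does not guarantee any particular order, and the order of string sets can change between Python sessions, which in turn changes the order of the one-hot columns. As a possible alternative (not the notebook's approach), a list comprehension keeps the original column order. The names `toy_X`, `numerical_toy`, and `categorical_toy` below are made up for illustration.

```python
import pandas as pd

# Made-up frame with mixed numerical and categorical columns
toy_X = pd.DataFrame({
    "age": [35.0, 30.0],
    "job": ["management", "services"],
    "balance": [2749.0, 443.0],
    "marital": ["single", "married"],
})
numerical_toy = ["age", "balance"]

# Keep every column that is not numerical, in its original order
categorical_toy = [col for col in toy_X.columns if col not in numerical_toy]
print(categorical_toy)
```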

In [43]:
categorical_data = X_train[categorical_column]
categorical_data.isnull().sum()

Unnamed: 0,0
marital,2650
month,2602
job,2650
housing,2660
contact,2695
loan,2668
default,2689
poutcome,2629
education,2542


In [44]:
categorical_data = X_train[categorical_column]
# Fill missing categories with the placeholder "KOSONG" (Indonesian for "empty")
categorical_data = categorical_data.fillna(value="KOSONG")

In [45]:
def categoricalImputation(data, categorical_column):
    """
    Impute missing values in the categorical columns
    :param data: <pandas dataframe> input sample data
    :param categorical_column: <list> names of the categorical columns
    :return categorical_data: <pandas dataframe> imputed categorical data
    """
    # select the categorical columns
    categorical_data = data[categorical_column]

    # impute with the placeholder "KOSONG" (Indonesian for "empty")
    categorical_data = categorical_data.fillna(value="KOSONG")

    return categorical_data

In [46]:
X_train_categorical = categoricalImputation(data = X_train,
                                            categorical_column = categorical_column)

### **Preprocessing Categorical Variables**

In [47]:
categorical_ohe = pd.get_dummies(X_train_categorical)

In [48]:
categorical_ohe.head(2)

Unnamed: 0,marital_KOSONG,marital_divorced,marital_married,marital_single,month_KOSONG,month_apr,month_aug,month_dec,month_feb,month_jan,...,poutcome_KOSONG,poutcome_failure,poutcome_other,poutcome_success,poutcome_unknown,education_KOSONG,education_primary,education_secondary,education_tertiary,education_unknown
37156,False,False,False,True,False,False,False,False,False,False,...,False,False,False,False,True,False,False,False,True,False
20494,True,False,False,False,True,False,False,False,False,False,...,False,False,False,False,True,True,False,False,False,False


In [49]:
def extractCategorical(data, categorical_column):
    """
    Extract the categorical data with One Hot Encoding
    :param data: <pandas dataframe> sample data
    :param categorical_column: <list> names of the categorical columns
    :return categorical_ohe: <pandas dataframe> one-hot-encoded sample data
    """
    data_categorical = categoricalImputation(data = data,
                                             categorical_column = categorical_column)
    categorical_ohe = pd.get_dummies(data_categorical)

    return categorical_ohe

In [50]:
X_train_categorical_ohe = extractCategorical(data = X_train,
                                             categorical_column = categorical_column)

In [51]:
X_train_categorical_ohe.head()

Unnamed: 0,marital_KOSONG,marital_divorced,marital_married,marital_single,month_KOSONG,month_apr,month_aug,month_dec,month_feb,month_jan,...,poutcome_KOSONG,poutcome_failure,poutcome_other,poutcome_success,poutcome_unknown,education_KOSONG,education_primary,education_secondary,education_tertiary,education_unknown
37156,False,False,False,True,False,False,False,False,False,False,...,False,False,False,False,True,False,False,False,True,False
20494,True,False,False,False,True,False,False,False,False,False,...,False,False,False,False,True,True,False,False,False,False
35272,True,False,False,False,False,False,False,False,False,False,...,False,False,False,False,True,False,False,False,True,False
22260,True,False,False,False,False,False,True,False,False,False,...,False,False,False,False,True,True,False,False,False,False
2728,False,False,False,True,False,False,False,False,False,False,...,False,False,False,False,True,False,False,True,False,False


In [52]:
ohe_columns = X_train_categorical_ohe.columns

In [54]:
ohe_columns

Index(['marital_KOSONG', 'marital_divorced', 'marital_married',
       'marital_single', 'month_KOSONG', 'month_apr', 'month_aug', 'month_dec',
       'month_feb', 'month_jan', 'month_jul', 'month_jun', 'month_mar',
       'month_may', 'month_nov', 'month_oct', 'month_sep', 'job_KOSONG',
       'job_admin.', 'job_blue-collar', 'job_entrepreneur', 'job_housemaid',
       'job_management', 'job_retired', 'job_self-employed', 'job_services',
       'job_student', 'job_technician', 'job_unemployed', 'job_unknown',
       'housing_KOSONG', 'housing_no', 'housing_yes', 'contact_KOSONG',
       'contact_cellular', 'contact_telephone', 'contact_unknown',
       'loan_KOSONG', 'loan_no', 'loan_yes', 'default_KOSONG', 'default_no',
       'default_yes', 'poutcome_KOSONG', 'poutcome_failure', 'poutcome_other',
       'poutcome_success', 'poutcome_unknown', 'education_KOSONG',
       'education_primary', 'education_secondary', 'education_tertiary',
       'education_unknown'],
      dtype='object')
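
The training column list is stored because `pd.get_dummies` only creates columns for categories that actually occur in the data it is given. A different sample, such as the test set, may therefore yield fewer columns or a different order. The sketch below, on made-up frames, shows one way to align such a sample to the training columns with `reindex`. Note that `reindex` returns a new DataFrame, so its result must be assigned.

```python
import pandas as pd

# Made-up data: the test sample lacks two of the training categories
train_cat = pd.DataFrame({"contact": ["cellular", "telephone", "unknown"]})
test_cat = pd.DataFrame({"contact": ["cellular", "cellular"]})

train_ohe = pd.get_dummies(train_cat)
test_ohe = pd.get_dummies(test_cat)  # only contact_cellular exists here

# Align to the training columns; missing categories become all-zero columns
test_aligned = test_ohe.reindex(columns = train_ohe.columns, fill_value = 0)

print(list(test_ohe.columns))
print(list(test_aligned.columns))
```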

### **Join Numerical and Categorical Data**

In [55]:
X_train_concat = pd.concat([X_train_numerical,
                            X_train_categorical_ohe],
                           axis = 1)

In [56]:
X_train_concat.head()

Unnamed: 0,age,balance,day,duration,campaign,pdays,previous,marital_KOSONG,marital_divorced,marital_married,...,poutcome_KOSONG,poutcome_failure,poutcome_other,poutcome_success,poutcome_unknown,education_KOSONG,education_primary,education_secondary,education_tertiary,education_unknown
37156,35.0,2749.0,13.0,127.0,1.0,-1.0,0.0,False,False,False,...,False,False,False,False,True,False,False,False,True,False
20494,30.0,443.0,12.0,80.0,2.0,-1.0,0.0,True,False,False,...,False,False,False,False,True,True,False,False,False,False
35272,39.0,4239.0,7.0,40.0,1.0,-1.0,0.0,True,False,False,...,False,False,False,False,True,False,False,False,True,False
22260,49.0,400.0,21.0,151.0,3.0,-1.0,0.0,True,False,False,...,False,False,False,False,True,True,False,False,False,False
2728,28.0,468.0,13.0,152.0,3.0,-1.0,0.0,False,False,False,...,False,False,False,False,True,False,False,True,False,False


In [57]:
X_train_concat.isnull().any()

Unnamed: 0,0
age,False
balance,False
day,False
duration,False
campaign,False
pdays,False
previous,False
marital_KOSONG,False
marital_divorced,False
marital_married,False


### **Standardizing Variables**

In [59]:
from sklearn.preprocessing import StandardScaler

# Define the function
def standardizerData(data):
    """
    Standardize the data (zero mean, unit variance per column)
    :param data: <pandas dataframe> sample data
    :return standardized_data: <pandas dataframe> standardized sample data
    :return standardizer: <sklearn StandardScaler> fitted standardizer
    """
    data_columns = data.columns  # keep the column names
    data_index = data.index  # keep the index

    # create and fit the standardizer
    standardizer = StandardScaler()
    standardizer.fit(data)

    # transform data
    standardized_data_raw = standardizer.transform(data)
    standardized_data = pd.DataFrame(standardized_data_raw)
    standardized_data.columns = data_columns
    standardized_data.index = data_index

    return standardized_data, standardizer

In [60]:
X_train_clean, standardizer = standardizerData(data = X_train_concat)

In [61]:
X_train_clean.head()

Unnamed: 0,age,balance,day,duration,campaign,pdays,previous,marital_KOSONG,marital_divorced,marital_married,...,poutcome_KOSONG,poutcome_failure,poutcome_other,poutcome_success,poutcome_unknown,education_KOSONG,education_primary,education_secondary,education_tertiary,education_unknown
37156,-0.568886,0.502047,-0.354434,-0.504073,-0.569761,-0.390255,-0.292138,-0.291167,-0.344163,-1.117849,...,-0.289914,-0.328903,-0.195722,-0.180075,0.56727,-0.284681,-0.402775,-0.949599,1.632949,-0.200227
20494,-1.058043,-0.288093,-0.47921,-0.693995,-0.236736,-0.390255,-0.292138,3.434454,-0.344163,-1.117849,...,-0.289914,-0.328903,-0.195722,-0.180075,0.56727,3.512706,-0.402775,-0.949599,-0.612389,-0.200227
35272,-0.177561,1.012588,-1.103089,-0.855631,-0.569761,-0.390255,-0.292138,3.434454,-0.344163,-1.117849,...,-0.289914,-0.328903,-0.195722,-0.180075,0.56727,-0.284681,-0.402775,-0.949599,1.632949,-0.200227
22260,0.800752,-0.302827,0.643772,-0.407091,0.096289,-0.390255,-0.292138,3.434454,-0.344163,-1.117849,...,-0.289914,-0.328903,-0.195722,-0.180075,0.56727,3.512706,-0.402775,-0.949599,-0.612389,-0.200227
2728,-1.253705,-0.279527,-0.354434,-0.40305,0.096289,-0.390255,-0.292138,-0.291167,-0.344163,-1.117849,...,-0.289914,-0.328903,-0.195722,-0.180075,0.56727,-0.284681,-0.402775,1.053076,-0.612389,-0.200227
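
`StandardScaler` rescales each column as `z = (x - mean) / std`, with the mean and standard deviation taken from the data passed to `fit`. The sketch below checks this by hand on a made-up column and shows that new data is scaled with the training statistics, not its own. All names in it are illustrative.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up single-column data
toy_train = np.array([[1.0], [2.0], [3.0]])
toy_test = np.array([[4.0]])

scaler_toy = StandardScaler().fit(toy_train)

# Manual standardization with the training mean and (population) std
manual = (toy_train - toy_train.mean(axis=0)) / toy_train.std(axis=0)
scaled_train = scaler_toy.transform(toy_train)

# The test value is scaled with the training mean and std
scaled_test = scaler_toy.transform(toy_test)

print(np.allclose(manual, scaled_train))
print(scaled_test)
```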


##  3. Training Machine Learning
---
    * Choose Score to optimize and Hyperparameter Space
    * Cross-Validation: Random vs Grid Search CV
    * We have to beat the benchmark
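
The cells that follow fit models with default settings. As a rough sketch of what choosing a score and a hyperparameter space can look like, the block below runs a small grid search on synthetic data. The data, the grid over `C`, and the `roc_auc` score are assumptions made for illustration, not choices taken from this notebook. `RandomizedSearchCV` has a similar interface and samples from the space instead of trying every combination.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic, imbalanced stand-in for the real training data
X_syn, y_syn = make_classification(n_samples = 300, n_features = 8,
                                   weights = [0.85, 0.15],
                                   random_state = 123)

# Hyperparameter space (assumed): regularization strength C
param_grid = {"C": [0.1, 1.0, 10.0]}

# Score to optimize (assumed): ROC AUC, evaluated with 3-fold CV
search = GridSearchCV(LogisticRegression(max_iter = 1000),
                      param_grid = param_grid,
                      scoring = "roc_auc",
                      cv = 3)
search.fit(X_syn, y_syn)

print(search.best_params_)
print(search.best_score_)
```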

In [62]:
y_train.value_counts(normalize = True)

Unnamed: 0_level_0,proportion
y,Unnamed: 1_level_1
no,0.882624
yes,0.117376


In [63]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

###  Fitting Model

In [64]:
logreg = LogisticRegression(random_state = 123)
logreg.fit(X_train_clean, y_train)

In [65]:
random_forest = RandomForestClassifier(random_state = 123)
random_forest.fit(X_train_clean, y_train)

###  Prediction

In [66]:
logreg.predict(X_train_clean)

array(['no', 'no', 'no', ..., 'no', 'no', 'yes'], dtype=object)

In [67]:
predicted_logreg = pd.DataFrame(logreg.predict(X_train_clean))
predicted_logreg

Unnamed: 0,0
0,no
1,no
2,no
3,no
4,no
...,...
33903,no
33904,no
33905,no
33906,no


In [68]:
predicted_rf = pd.DataFrame(random_forest.predict(X_train_clean))
predicted_rf.head()

Unnamed: 0,0
0,no
1,no
2,no
3,no
4,no


###  Check model performance on the training data

In [69]:
benchmark = y_train.value_counts(normalize=True).iloc[0]
benchmark


0.8826235696590775
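
The benchmark is the accuracy of always predicting the majority class. scikit-learn can express the same idea with `DummyClassifier`. The sketch below uses made-up labels with a similar imbalance and is only meant to show the equivalence.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Made-up labels: 88% "no", 12% "yes"
y_toy = np.array(["no"] * 88 + ["yes"] * 12)
X_toy = np.zeros((len(y_toy), 1))  # features are ignored by this baseline

# Always predict the most frequent class seen during fit
baseline = DummyClassifier(strategy = "most_frequent")
baseline.fit(X_toy, y_toy)

baseline_accuracy = baseline.score(X_toy, y_toy)
print(baseline_accuracy)
```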

In [70]:
logreg.score(X_train_clean, y_train)

0.900554441429751

In [71]:
random_forest.score(X_train_clean, y_train)

1.0
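
A training accuracy of 1.0 mostly shows that the random forest can memorize the training data, not that it will generalize. A hedged way to see the gap is to compare training accuracy with cross-validated accuracy. The sketch below does this on synthetic data, so the numbers it prints say nothing about the bank dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the training data
X_syn, y_syn = make_classification(n_samples = 300, n_features = 10,
                                   random_state = 123)

rf_syn = RandomForestClassifier(n_estimators = 50, random_state = 123)

# Accuracy on the same data the model was fitted on
train_accuracy = rf_syn.fit(X_syn, y_syn).score(X_syn, y_syn)

# Accuracy on held-out folds (the estimator is refitted per fold)
cv_scores = cross_val_score(rf_syn, X_syn, y_syn, cv = 5, scoring = "accuracy")

print("train accuracy:", train_accuracy)
print("CV accuracy   :", cv_scores.mean())
```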

In [72]:
import joblib
joblib.dump(logreg, "logreg.pkl")


joblib.dump(random_forest, "random_forest.pkl")


['random_forest.pkl']
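
A saved model is only useful if it can be loaded back and gives the same predictions. The sketch below checks that round trip with `joblib.load` on a small synthetic model written to a temporary directory. In practice the fitted imputer and standardizer would likely need to be saved the same way, since new data must pass through them before prediction.

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Small synthetic model, used only to demonstrate the round trip
X_syn, y_syn = make_classification(n_samples = 100, n_features = 5,
                                   random_state = 123)
model_syn = LogisticRegression(max_iter = 1000).fit(X_syn, y_syn)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "model_syn.pkl")
    joblib.dump(model_syn, model_path)   # save to disk
    restored = joblib.load(model_path)   # load it back

# The restored model should reproduce the original predictions
same_predictions = bool((restored.predict(X_syn) == model_syn.predict(X_syn)).all())
print(same_predictions)
```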

### Test Prediction

In [73]:
def extractTest(data,
                numerical_column, categorical_column, ohe_column,
                imputer_numerical, standardizer):
    """
    Extract and clean the test data with the transformers fitted on the training data
    :param data: <pandas dataframe> test sample data
    :param numerical_column: <list> numerical columns
    :param categorical_column: <list> categorical columns
    :param ohe_column: <list> one-hot-encoded columns obtained from the training data
    :param imputer_numerical: <sklearn SimpleImputer> fitted numerical imputer
    :param standardizer: <sklearn StandardScaler> fitted standardizer
    :return cleaned_data: <pandas dataframe> final data
    """
    # Select the data
    numerical_data = data[numerical_column]
    categorical_data = data[categorical_column]

    # Process the numerical data
    numerical_data = pd.DataFrame(imputer_numerical.transform(numerical_data))
    numerical_data.columns = numerical_column
    numerical_data.index = data.index

    # Process the categorical data
    categorical_data = categorical_data.fillna(value="KOSONG")
    categorical_data.index = data.index
    categorical_data = pd.get_dummies(categorical_data)
    # reindex returns a new DataFrame, so assign the result; categories
    # missing from the test data become all-zero columns
    categorical_data = categorical_data.reindex(index = categorical_data.index,
                                                columns = ohe_column,
                                                fill_value = 0)

    # Combine the data
    concat_data = pd.concat([numerical_data, categorical_data],
                             axis = 1)
    cleaned_data = pd.DataFrame(standardizer.transform(concat_data))
    cleaned_data.columns = concat_data.columns
    cleaned_data.index = concat_data.index  # keep the original index

    return cleaned_data


In [74]:
def testPrediction(X_test, y_test, classifier, compute_score):
    """
    Get predictions from a fitted model
    :param X_test: <pandas dataframe> input
    :param y_test: <pandas series> output/target
    :param classifier: <sklearn estimator> fitted classification model
    :param compute_score: <bool> True: compute and print the score, False: skip it
    :return test_predict: <numpy array> predictions for the input data
    :return score: <float> model accuracy, or None if compute_score is False
    """
    score = None  # stays None when compute_score is False
    if compute_score:
        score = classifier.score(X_test, y_test)
        print(f"Accuracy : {score:.4f}")

    test_predict = classifier.predict(X_test)

    return test_predict, score

In [75]:
X_test_clean = extractTest(data = X_test,
                           numerical_column = numerical_column,
                           categorical_column = categorical_column,
                           ohe_column = ohe_columns,
                           imputer_numerical = imputer_numerical,
                           standardizer = standardizer)

In [76]:
X_test_clean.shape

(11303, 60)
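
The manual steps above (impute, one-hot encode, standardize, then repeat everything for the test data) can also be bundled into a single scikit-learn `Pipeline`. The block below is a sketch of that alternative on a made-up two-column dataset, not a rewrite of this notebook's functions. Unlike the notebook, it scales only the numerical column, and it uses `OneHotEncoder(handle_unknown="ignore")` so that categories unseen during training do not raise an error.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up training data with missing values in both column types
train_toy = pd.DataFrame({
    "age": [25.0, np.nan, 40.0, 58.0, 33.0, 47.0],
    "job": ["admin.", "student", np.nan, "retired", "admin.", "student"],
})
y_toy = pd.Series(["no", "yes", "no", "yes", "no", "yes"])

# Numerical branch: median imputation, then standardization
numeric_steps = Pipeline([("impute", SimpleImputer(strategy = "median")),
                          ("scale", StandardScaler())])

# Categorical branch: placeholder imputation, then one-hot encoding
categorical_steps = Pipeline([
    ("impute", SimpleImputer(strategy = "constant", fill_value = "KOSONG")),
    ("ohe", OneHotEncoder(handle_unknown = "ignore")),
])

preprocess = ColumnTransformer([("num", numeric_steps, ["age"]),
                                ("cat", categorical_steps, ["job"])])
model = Pipeline([("prep", preprocess), ("clf", LogisticRegression())])
model.fit(train_toy, y_toy)

# New data goes through the same fitted steps automatically
new_toy = pd.DataFrame({"age": [np.nan], "job": ["housemaid"]})
pred = model.predict(new_toy)
print(pred)
```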


logreg_test_predict, score = testPrediction(X_test = X_test_clean,
                                            y_test = y_test,
                                            classifier = logreg,
                                            compute_score = True)

In [77]:
rf_test_predict, score = testPrediction(X_test = X_test_clean,
                                        y_test = y_test,
                                        classifier = random_forest,
                                        compute_score = True)

Accuracy : 0.9010
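
With a majority class of about 88%, an accuracy near 0.90 is only a small step above the benchmark, and it does not say how many "yes" clients are actually found. Metrics such as the confusion matrix, precision, and recall can make that visible. The sketch below computes them on made-up labels. Applying the same calls to `y_test` and the predictions above is left as an exercise, since those results are not shown in this notebook.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Made-up true labels and predictions
y_true_toy = ["no", "no", "no", "yes", "yes"]
y_pred_toy = ["no", "no", "yes", "yes", "no"]

# Rows are true classes, columns are predicted classes, in `labels` order
cm = confusion_matrix(y_true_toy, y_pred_toy, labels = ["no", "yes"])

# Precision and recall for the class of interest ("yes")
precision_yes = precision_score(y_true_toy, y_pred_toy, pos_label = "yes")
recall_yes = recall_score(y_true_toy, y_pred_toy, pos_label = "yes")

print(cm)
print(precision_yes, recall_yes)
```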
