# **Important**
- Do not modify or add to the text cells provided; you only need to complete the code cells provided.
- Make sure every criterion produces the expected output; a criterion without output is considered incomplete.
- For example, if you use df = df.dropna(), show df.isnull().sum() as evidence that it worked. Match every output to the instructions provided.
- Use Run All before submitting to confirm that every cell executes correctly.
- Use the variable df from start to finish; renaming it is not allowed.
- Remove the hash symbol (#) from commented-out code if you implement the additional criteria.
- Leave the hash symbol (#) in place if you do not implement the additional criteria.
- Work within the sections provided, without changing the given titles or headers.
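
The dropna() check mentioned in the list can be sketched as follows. This is a minimal illustration on a hypothetical toy DataFrame (the column names and values are made up for the example); the notebook itself would apply the same pattern to the clustering dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the real clustering dataset.
df = pd.DataFrame({
    "amount": [14.09, np.nan, 126.29],
    "channel": ["ATM", "Online", None],
})

# Drop every row that contains at least one missing value.
df = df.dropna()

# Printing the per-column null counts is the visible evidence that the step worked.
print(df.isnull().sum())
```

Every count printed should be zero, which is the kind of output the criteria ask for.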

# **1. Import Library**
In this step, import the Python libraries needed for data analysis and for building the machine learning model.

In [30]:
import warnings, textwrap, os, math, joblib
warnings.filterwarnings("ignore")

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.preprocessing import StandardScaler, MinMaxScaler, LabelEncoder
from sklearn.impute import SimpleImputer
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from google.colab import drive
from sklearn.metrics import accuracy_score, precision_recall_fscore_support, classification_report
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline


# **2. Loading the Dataset from the Clustering Results**
Load the clustering result dataset from a CSV file into a DataFrame variable.

In [31]:
# Use the clustering result dataset that contains the Target feature
# Use the data_clustering dataset if you did not implement Clustering Result Interpretation [Advanced]
# Use the data_clustering_inverse dataset if you implemented Clustering Result Interpretation [Advanced]
# Complete the following code
# ___ = pd.read_csv("___.csv")
drive.mount('/content/drive')

# Path to the clustering result CSV in Google Drive; adjust it to your own location
file_path = '/content/drive/MyDrive/Machine Learning Data/data_clustering_inverse.csv'

try:
    df = pd.read_csv(file_path)
    display(df.head())
except FileNotFoundError:
    print(f"Error: File not found at {file_path}")
except Exception as e:
    print(f"An error occurred: {e}")

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


|  | TransactionAmount | PreviousTransactionDate | TransactionType | Location | Channel | CustomerAge | CustomerOccupation | TransactionDuration | LoginAttempts | AccountBalance | TransactionDate | TransactionAmount_Binned | CustomerAge_Binned | Target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 14.09 | 2023-04-11 16:29:14 | Debit | San Diego | ATM | 70.0 | Doctor | 81.0 | 1.0 | 5112.21 | 2024-11-04 8:08:08 | 0 | 4 | 0 |
| 1 | 376.24 | 2023-06-27 16:44:19 | Debit | Houston | ATM | 68.0 | Doctor | 141.0 | 1.0 | 13758.91 | 2024-11-04 8:09:35 | 2 | 4 | 2 |
| 2 | 126.29 | 2023-07-10 18:16:08 | Debit | Mesa | Online | 19.0 | Student | 56.0 | 1.0 | 1122.35 | 2024-11-04 8:07:04 | 0 | 0 | 2 |
| 3 | 184.5 | 2023-05-05 16:32:11 | Debit | Raleigh | Online | 26.0 | Student | 25.0 | 1.0 | 8569.06 | 2024-11-04 8:09:06 | 1 | 0 | 0 |
| 4 | 92.15 | 2023-04-03 17:15:01 | Debit | Oklahoma City | ATM | 18.0 | Student | 172.0 | 1.0 | 781.68 | 2024-11-04 8:06:36 | 0 | 0 | 0 |
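
One detail in the preview is worth noting: PreviousTransactionDate and TransactionDate are read by pd.read_csv as plain strings, so any later step that one-hot encodes all object columns will treat each distinct timestamp as its own category. The sketch below shows one possible alternative, turning a timestamp column into a few numeric features. It is an illustration only and not part of the notebook's pipeline; the feature names (prev_hour, prev_dayofweek, prev_month) are hypothetical.

```python
import pandas as pd

# Two values copied from the PreviousTransactionDate column in the preview.
sample = pd.DataFrame({
    "PreviousTransactionDate": ["2023-04-11 16:29:14", "2023-06-27 16:44:19"],
})

# Parse the strings into real datetimes.
prev = pd.to_datetime(sample["PreviousTransactionDate"])

# Derive a few numeric features (hypothetical names) from the datetime accessor.
features = pd.DataFrame({
    "prev_hour": prev.dt.hour,
    "prev_dayofweek": prev.dt.dayofweek,  # Monday = 0
    "prev_month": prev.dt.month,
})
print(features)
```

Whether such features help on this dataset is untested here; the point is only that numeric date parts can be passed through like the other numerical columns.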


In [32]:
# Display the first 5 rows using the head function.

# **3. Data Splitting**
The Data Splitting step separates the dataset into two parts: a training set and a test set.

In [33]:
# Use train_test_split() to split the dataset.

# Assuming 'Target' is your target variable
X = df.drop('Target', axis=1)
y = df['Target']

# Split the data into training and testing sets (e.g., 80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print("Shape of X_train:", X_train.shape)
print("Shape of X_test:", X_test.shape)
print("Shape of y_train:", y_train.shape)
print("Shape of y_test:", y_test.shape)

Shape of X_train: (1382, 13)
Shape of X_test: (346, 13)
Shape of y_train: (1382,)
Shape of y_test: (346,)
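
The split above is purely random. As an optional variation, train_test_split also accepts a stratify argument that keeps the class proportions the same in both parts. The sketch below demonstrates this on synthetic data; the array names X_demo and y_demo and the class sizes are made up for the example, and the notebook cell itself does not use stratification.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 300 rows, 4 features, 3 classes of unequal size.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(300, 4))
y_demo = np.repeat([0, 1, 2], [90, 100, 110])

# stratify=y_demo keeps the 90:100:110 class ratio in both train and test.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print("train class counts:", np.bincount(y_tr))
print("test class counts:", np.bincount(y_te))
```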


# **4. Building the Classification Model**
After choosing a suitable classification algorithm, the next step is to train the model on the training data.

The recommended steps are as follows.
1. Use the Decision Tree classification algorithm.
2. Train the model on the data that has already been split.

In [34]:
def eval_model(name, model, X_te, y_te):
    y_pred = model.predict(X_te)
    acc = accuracy_score(y_te, y_pred)
    pr, rc, f1, _ = precision_recall_fscore_support(y_te, y_pred, average="weighted", zero_division=0)
    print(f"\n[{name}]")
    print(f"Accuracy: {acc:.4f} | Precision: {pr:.4f} | Recall: {rc:.4f} | F1: {f1:.4f}")
    print(classification_report(y_te, y_pred, zero_division=0))
    return {"acc":acc, "prec":pr, "rec":rc, "f1":f1}

# Build a classification model using Decision Tree
categorical_features = X_train.select_dtypes(include=['object']).columns
numerical_features = X_train.select_dtypes(include=['float64', 'int64']).columns

# Create a column transformer for one-hot encoding categorical features
preprocessor = ColumnTransformer(
    transformers=[
        ('num', 'passthrough', numerical_features),
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_features)
    ])

# Create a pipeline that applies the preprocessing before the classifier
dt_pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                              ('classifier', DecisionTreeClassifier(random_state=42))])

# Fit the pipeline on the training data
dt_pipeline.fit(X_train, y_train)


metrics_dt = eval_model("DecisionTree", dt_pipeline, X_test, y_test)



[DecisionTree]
Accuracy: 0.3671 | Precision: 0.3706 | Recall: 0.3671 | F1: 0.3674
              precision    recall  f1-score   support

           0       0.33      0.39      0.36       102
           1       0.39      0.37      0.38       112
           2       0.39      0.35      0.37       132

    accuracy                           0.37       346
   macro avg       0.37      0.37      0.37       346
weighted avg       0.37      0.37      0.37       346
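
In the report above, the weighted recall (0.3671) is identical to the accuracy (0.3671). That is expected: with average="weighted", each class's recall is weighted by its support, which works out to the total number of correct predictions divided by the total number of samples. The sketch below checks this on a small made-up label set; y_true and y_pred are hypothetical values, not taken from the notebook.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score,
    precision_recall_fscore_support,
    recall_score,
)

# Hypothetical labels for three classes with supports 3, 2 and 5.
y_true = np.array([0, 0, 0, 1, 1, 2, 2, 2, 2, 2])
y_pred = np.array([0, 1, 2, 1, 0, 2, 2, 0, 1, 2])

acc = accuracy_score(y_true, y_pred)

# Recompute the weighted recall by hand from per-class recall and support.
per_class_recall = recall_score(y_true, y_pred, average=None)
support = np.bincount(y_true)
weighted_recall = np.average(per_class_recall, weights=support)

# The same value as returned by the call used in eval_model.
_, rc, _, _ = precision_recall_fscore_support(
    y_true, y_pred, average="weighted", zero_division=0
)

print(acc, weighted_recall, rc)
```

Weighted precision and F1 do not share this property, which is why they differ slightly from the accuracy in the reports.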



In [35]:
# Save the model
# import joblib
# joblib.dump(model, 'decision_tree_model.h5')
joblib.dump(dt_pipeline, "decision_tree_model.h5")

['decision_tree_model.h5']
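
A saved model is only useful if it can be loaded again. The sketch below shows a joblib round trip on a tiny hypothetical model (X_demo and y_demo are made-up values) written to a temporary directory. Note that the .h5 extension is just a file name here: joblib.dump writes its own pickle-based format, not HDF5, so the file has to be read back with joblib.load.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Tiny hypothetical training set, only to have something to save.
X_demo = np.array([[0.0], [1.0], [2.0], [3.0]])
y_demo = np.array([0, 0, 1, 1])
model = DecisionTreeClassifier(random_state=42).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "decision_tree_model.h5")
    joblib.dump(model, path)        # save
    restored = joblib.load(path)    # load back

# The restored model should give the same predictions as the original.
same_predictions = np.array_equal(model.predict(X_demo), restored.predict(X_demo))
print(same_predictions)
```

The same pattern applies to a full Pipeline object such as dt_pipeline, since the preprocessing steps are saved together with the classifier.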

# **5. Meeting the Skilled and Advanced Criteria for Building the Classification Model**



**Leave blank if you do not implement the skilled or advanced criteria**

In [36]:
# Train models using scikit-learn classification algorithms other than Decision Tree.


logreg_pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                                  ('classifier', LogisticRegression(max_iter=1000))])

rf_pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                              ('classifier', RandomForestClassifier(n_estimators=200, random_state=42))])



logreg_pipeline.fit(X_train, y_train)
rf_pipeline.fit(X_train, y_train)

# eval_model was already defined in the Decision Tree cell above and is reused here.



In [37]:
# Display the accuracy, precision, recall, and F1-Score for every algorithm built so far.
metrics_dt = eval_model("DecisionTree", dt_pipeline, X_test, y_test)
metrics_lr = eval_model("LogisticRegression", logreg_pipeline, X_test, y_test)
metrics_rf = eval_model("RandomForest", rf_pipeline, X_test, y_test)


[DecisionTree]
Accuracy: 0.3671 | Precision: 0.3706 | Recall: 0.3671 | F1: 0.3674
              precision    recall  f1-score   support

           0       0.33      0.39      0.36       102
           1       0.39      0.37      0.38       112
           2       0.39      0.35      0.37       132

    accuracy                           0.37       346
   macro avg       0.37      0.37      0.37       346
weighted avg       0.37      0.37      0.37       346


[LogisticRegression]
Accuracy: 0.2977 | Precision: 0.3022 | Recall: 0.2977 | F1: 0.2994
              precision    recall  f1-score   support

           0       0.23      0.26      0.25       102
           1       0.30      0.29      0.30       112
           2       0.36      0.33      0.34       132

    accuracy                           0.30       346
   macro avg       0.30      0.30      0.30       346
weighted avg       0.30      0.30      0.30       346


[RandomForest]
Accuracy: 0.3353 | Precision: 0.3413 | Recall: 0.3
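
With three classes, accuracies around one third are close to what guessing would achieve. One way to make that concrete is a majority-class baseline. The sketch below rebuilds the test labels from the class supports printed in the reports above (102, 112, 132) and scores a DummyClassifier that always predicts the most frequent class. It assumes class 2 is also the most frequent class in the training data, which the notebook does not show; fitting on the test labels here only illustrates the arithmetic.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Test-set class supports taken from the classification reports above.
y_test_demo = np.repeat([0, 1, 2], [102, 112, 132])

# DummyClassifier ignores the features, so a placeholder column is enough.
X_placeholder = np.zeros((len(y_test_demo), 1))

# Assumption: the majority class seen during fitting is class 2.
baseline = DummyClassifier(strategy="most_frequent").fit(X_placeholder, y_test_demo)
baseline_acc = accuracy_score(y_test_demo, baseline.predict(X_placeholder))

print(f"Majority-class baseline accuracy: {baseline_acc:.4f}")
```

Under that assumption the baseline scores 132/346, about 0.38, so models scoring below it are not yet extracting usable signal from the features.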

In [38]:
# Save the models other than Decision Tree
# There can be more than one such model
# import joblib
# joblib.dump(___, 'explore_<Nama Algoritma>_classification.h5')
joblib.dump(logreg_pipeline, "explore_LogisticRegression_classification.h5")
joblib.dump(rf_pipeline, "explore_RandomForest_classification.h5")

['explore_RandomForest_classification.h5']

Hyperparameter Tuning Model

Choose one algorithm that you want to tune.

In [39]:
# Perform hyperparameter tuning and retrain the model.
# Do this in this single cell only.
from sklearn.model_selection import GridSearchCV

# Define the parameter grid for Random Forest
param_grid = {
    'classifier__n_estimators': [100, 200, 300],
    'classifier__max_depth': [None, 10, 20, 30],
    'classifier__min_samples_split': [2, 5, 10],
    'classifier__min_samples_leaf': [1, 2, 4]
}

# Create GridSearchCV object
grid_search_rf = GridSearchCV(rf_pipeline, param_grid, cv=5, scoring='accuracy', n_jobs=-1)

# Perform hyperparameter tuning
grid_search_rf.fit(X_train, y_train)

# Get the best parameters and best score
best_params_rf = grid_search_rf.best_params_
best_score_rf = grid_search_rf.best_score_

print("Best parameters for Random Forest:", best_params_rf)
print("Best cross-validation accuracy for Random Forest:", best_score_rf)

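# GridSearchCV refits the best configuration on the full training set (refit=True by default),
# so best_estimator_ is already the retrained model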
tuned_rf_model = grid_search_rf.best_estimator_

Best parameters for Random Forest: {'classifier__max_depth': 20, 'classifier__min_samples_leaf': 4, 'classifier__min_samples_split': 2, 'classifier__n_estimators': 100}
Best cross-validation accuracy for Random Forest: 0.3697404907654476
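
The cost of the search above follows directly from the grid: 3 × 4 × 3 × 3 = 108 parameter combinations, each evaluated with cv=5, gives 540 model fits, plus one final refit of the best configuration. The sketch below counts the combinations with scikit-learn's ParameterGrid; the grid is copied from the cell above, and the variable names n_candidates and n_fits are introduced only for this example.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the tuning cell.
param_grid = {
    'classifier__n_estimators': [100, 200, 300],
    'classifier__max_depth': [None, 10, 20, 30],
    'classifier__min_samples_split': [2, 5, 10],
    'classifier__min_samples_leaf': [1, 2, 4],
}

n_candidates = len(ParameterGrid(param_grid))  # number of parameter combinations
n_fits = n_candidates * 5                      # cv=5 folds per combination

print(n_candidates, n_fits)
```

Counting the fits before running a grid search is a quick way to judge whether the grid is affordable or should be narrowed.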


In [40]:
# Display the accuracy, precision, recall, and F1-Score for the tuned algorithm.
metrics_tuned_rf = eval_model("Tuned RandomForest", tuned_rf_model, X_test, y_test)


[Tuned RandomForest]
Accuracy: 0.3266 | Precision: 0.3341 | Recall: 0.3266 | F1: 0.3050
              precision    recall  f1-score   support

           0       0.29      0.57      0.39       102
           1       0.27      0.12      0.16       112
           2       0.42      0.32      0.36       132

    accuracy                           0.33       346
   macro avg       0.33      0.33      0.30       346
weighted avg       0.33      0.33      0.30       346



In [41]:
# Save the tuned model
# import joblib
# joblib.dump(__, 'tuning_classification.h5')
joblib.dump(tuned_rf_model, 'tuning_classification.h5')

['tuning_classification.h5']

End of Code