# **1. Import Libraries**

At this stage, you need to import the Python libraries required for data analysis and for building the machine learning models.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from collections import Counter
from imblearn.over_sampling import SMOTE
from sklearn.preprocessing import LabelEncoder, MinMaxScaler, StandardScaler
from sklearn.model_selection import train_test_split

# Classification algorithms used
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

from sklearn.metrics import accuracy_score, recall_score, f1_score, precision_score, confusion_matrix


# **2. Loading the Dataset from the Clustering Results**

Load the clustering output from the CSV file into a DataFrame variable.

In [43]:
df = pd.read_csv('clustering.csv')
df.head()

| | TransactionID | AccountID | TransactionAmount | TransactionDate | TransactionType | Location | DeviceID | IP Address | MerchantID | Channel | ... | CustomerOccupation | TransactionDuration | LoginAttempts | AccountBalance | PreviousTransactionDate | Clusters | Transaction Type Code | Channel Code | Location Code | CO Code |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | TX000001 | AC00128 | 14.09 | 2023-04-11 16:29:14 | Debit | San Diego | D000380 | 162.198.218.92 | M015 | ATM | ... | Doctor | 81 | 1 | 5112.21 | 2024-11-04 08:08:08 | 2 | 1 | 0 | 36 | 0 |
| 1 | TX000002 | AC00455 | 376.24 | 2023-06-27 16:44:19 | Debit | Houston | D000051 | 13.149.61.4 | M052 | ATM | ... | Doctor | 141 | 1 | 13758.91 | 2024-11-04 08:09:35 | 1 | 1 | 0 | 15 | 0 |
| 2 | TX000003 | AC00019 | 126.29 | 2023-07-10 18:16:08 | Debit | Mesa | D000235 | 215.97.143.157 | M009 | Online | ... | Student | 56 | 1 | 1122.35 | 2024-11-04 08:07:04 | 1 | 1 | 3 | 23 | 3 |
| 3 | TX000004 | AC00070 | 184.5 | 2023-05-05 16:32:11 | Debit | Raleigh | D000187 | 200.13.225.150 | M002 | Online | ... | Student | 25 | 1 | 8569.06 | 2024-11-04 08:09:06 | 2 | 1 | 3 | 33 | 3 |
| 4 | TX000005 | AC00411 | 13.45 | 2023-10-16 17:51:24 | Credit | Atlanta | D000308 | 65.164.3.100 | M091 | Online | ... | Student | 198 | 1 | 7429.4 | 2024-11-04 08:06:39 | 3 | 0 | 3 | 1 | 3 |


In [44]:
# Drop identifier, timestamp, and pre-encoded columns that are not used for modeling
df = df.drop(columns=['AccountID', 'TransactionID', 'TransactionDate',
                      'DeviceID', 'IP Address', 'MerchantID', 'PreviousTransactionDate',
                      'Transaction Type Code', 'Channel Code', 'Location Code',
                      'CO Code'])
df.head()

| | TransactionAmount | TransactionType | Location | Channel | CustomerAge | CustomerOccupation | TransactionDuration | LoginAttempts | AccountBalance | Clusters |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 14.09 | Debit | San Diego | ATM | 70 | Doctor | 81 | 1 | 5112.21 | 2 |
| 1 | 376.24 | Debit | Houston | ATM | 68 | Doctor | 141 | 1 | 13758.91 | 1 |
| 2 | 126.29 | Debit | Mesa | Online | 19 | Student | 56 | 1 | 1122.35 | 1 |
| 3 | 184.5 | Debit | Raleigh | Online | 26 | Student | 25 | 1 | 8569.06 | 2 |
| 4 | 13.45 | Credit | Atlanta | Online | 26 | Student | 198 | 1 | 7429.4 | 3 |


In [46]:
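# Column groups
numeric_cols = ['TransactionAmount', 'AccountBalance', 'LoginAttempts',
                'CustomerAge', 'TransactionDuration']

categorical_cols = ['TransactionType', 'CustomerOccupation', 'Channel', 'Location']

# Label-encode each categorical column (the encoder is refit for every column)
label_encoder = LabelEncoder()
for col in categorical_cols:
    df[col] = label_encoder.fit_transform(df[col])

# Standardize the numeric columns to zero mean and unit variance.
# A MinMaxScaler pass beforehand would be redundant: standardization cancels
# any prior linear rescaling, so StandardScaler alone gives the same values.
# Note that the scaler is fit on the full dataset, before the train/test split.
std = StandardScaler()
df[numeric_cols] = std.fit_transform(df[numeric_cols])
df.head()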

| | TransactionAmount | TransactionType | Location | Channel | CustomerAge | CustomerOccupation | TransactionDuration | LoginAttempts | AccountBalance | Clusters |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -0.971275 | 1 | 36 | 0 | 1.423718 | 0 | -0.552443 | -0.206794 | -0.000537 | 2 |
| 1 | 0.26944 | 1 | 15 | 0 | 1.311287 | 0 | 0.305314 | -0.206794 | 2.216472 | 1 |
| 2 | -0.586882 | 1 | 23 | 2 | -1.443277 | 3 | -0.909842 | -0.206794 | -1.023534 | 1 |
| 3 | -0.387456 | 1 | 33 | 2 | -1.049768 | 3 | -1.353017 | -0.206794 | 0.885797 | 2 |
| 4 | -0.973468 | 0 | 1 | 2 | -1.049768 | 3 | 1.120184 | -0.206794 | 0.593589 | 3 |
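
The encoding loop reuses a single `LabelEncoder`, so once it finishes the object only remembers the mapping of the last column. If the original category names are needed later (for example, to interpret predictions), one option is to keep a separate encoder per column. The sketch below is a minimal illustration on a toy DataFrame; the `toy` data and the `encoders` dictionary are assumptions and are not part of the notebook.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy data standing in for the notebook's DataFrame (values are illustrative only)
toy = pd.DataFrame({
    "TransactionType": ["Debit", "Credit", "Debit"],
    "Channel": ["ATM", "Online", "Branch"],
})

# Keep one fitted encoder per column so each mapping can be inverted later
encoders = {}
for col in ["TransactionType", "Channel"]:
    encoders[col] = LabelEncoder()
    toy[col] = encoders[col].fit_transform(toy[col])

# Recover the original labels for one column
restored_channel = encoders["Channel"].inverse_transform(toy["Channel"])
print(toy)
print(list(restored_channel))
```

`LabelEncoder` assigns codes in sorted order of the class names, so the integer values carry no meaning beyond identity.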


# **3. Data Splitting**

The data splitting stage separates the dataset into two parts: a training set and a test set.

In [47]:
X = df.drop(columns=['Clusters'])
y = df['Clusters']

In [48]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Show the shapes of the training and test sets to confirm the split
print(f"Training set shape: X_train={X_train.shape}, y_train={y_train.shape}")
print(f"Test set shape: X_test={X_test.shape}, y_test={y_test.shape}")

Training set shape: X_train=(2009, 9), y_train=(2009,)
Test set shape: X_test=(503, 9), y_test=(503,)
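
In this notebook the numeric columns are scaled before the split, so the scaler has already seen the test rows. A common alternative is to split first and fit the scaler on the training rows only. The sketch below shows that ordering on synthetic data, with `stratify` added to keep class proportions equal in both parts; the array names (`X_demo`, `y_demo`, `scaler_demo`) and the data itself are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: 100 rows, 3 numeric features, 2 balanced classes
rng = np.random.default_rng(42)
X_demo = rng.normal(loc=50.0, scale=10.0, size=(100, 3))
y_demo = np.array([0, 1] * 50)

# Split first; stratify=y_demo keeps the class ratio the same in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Fit the scaler on the training rows only, then reuse it on the test rows
scaler_demo = StandardScaler()
X_tr_scaled = scaler_demo.fit_transform(X_tr)
X_te_scaled = scaler_demo.transform(X_te)

print(X_tr_scaled.shape, X_te_scaled.shape)
print(np.bincount(y_tr), np.bincount(y_te))
```

With this ordering, the test set plays no part in computing the scaling statistics, which keeps the evaluation closer to how the model would behave on unseen data.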


# **4. Building the Classification Model**


## **a. Building the Classification Model**

After choosing suitable classification algorithms, the next step is to train the models on the training data.

Below are the recommended steps.
1. Choose a suitable classification algorithm, such as Logistic Regression, Decision Tree, Random Forest, or K-Nearest Neighbors (KNN).
2. Train the model on the training data.

In [49]:
knn = KNeighborsClassifier().fit(X_train, y_train)
dt = DecisionTreeClassifier().fit(X_train, y_train)
rf = RandomForestClassifier().fit(X_train, y_train)
svm = SVC().fit(X_train, y_train)
nb = GaussianNB().fit(X_train, y_train)

print("Model training complete.")

Model training complete.


Write a narrative or explanation of the algorithms you use.

## **b. Evaluating the Classification Model**

Below are the **recommended** steps.
1. Make predictions on the test data.
2. Compute evaluation metrics such as Accuracy and F1-Score (optional: Precision and Recall).
3. Build a confusion matrix to see the correct and incorrect predictions in detail.

In [50]:
# Model evaluation
def evaluate_model(model, X_test, y_test):
    """Print weighted-average metrics and the confusion matrix for a fitted model."""
    y_pred = model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    recall = recall_score(y_test, y_pred, average='weighted')
    f1 = f1_score(y_test, y_pred, average='weighted')
    precision = precision_score(y_test, y_pred, average='weighted')
    cm = confusion_matrix(y_test, y_pred)

    print(f"Accuracy: {accuracy:.4f}")
    print(f"Recall: {recall:.4f}")
    print(f"F1-Score: {f1:.4f}")
    print(f"Precision: {precision:.4f}")
    print("Confusion Matrix:")
    print(cm)
    return accuracy, recall, f1, precision

# Evaluate each model
print("KNN evaluation:")
knn_metrics = evaluate_model(knn, X_test, y_test)

print("\nDecision Tree evaluation:")
dt_metrics = evaluate_model(dt, X_test, y_test)

print("\nRandom Forest evaluation:")
rf_metrics = evaluate_model(rf, X_test, y_test)

print("\nSVM evaluation:")
svm_metrics = evaluate_model(svm, X_test, y_test)

print("\nNaive Bayes evaluation:")
nb_metrics = evaluate_model(nb, X_test, y_test)


KNN evaluation:
Accuracy: 0.9861
Recall: 0.9861
F1-Score: 0.9861
Precision: 0.9863
Confusion Matrix:
[[171   2   0]
 [  2 165   0]
 [  3   0 160]]

Decision Tree evaluation:
Accuracy: 1.0000
Recall: 1.0000
F1-Score: 1.0000
Precision: 1.0000
Confusion Matrix:
[[173   0   0]
 [  0 167   0]
 [  0   0 163]]

Random Forest evaluation:
Accuracy: 1.0000
Recall: 1.0000
F1-Score: 1.0000
Precision: 1.0000
Confusion Matrix:
[[173   0   0]
 [  0 167   0]
 [  0   0 163]]

SVM evaluation:
Accuracy: 1.0000
Recall: 1.0000
F1-Score: 1.0000
Precision: 1.0000
Confusion Matrix:
[[173   0   0]
 [  0 167   0]
 [  0   0 163]]

Naive Bayes evaluation:
Accuracy: 0.9920
Recall: 0.9920
F1-Score: 0.9920
Precision: 0.9922
Confusion Matrix:
[[169   3   1]
 [  0 167   0]
 [  0   0 163]]


Write up the evaluation results of the algorithms used. If you use two or more algorithms, compare their results.
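
One way to make the comparison easier to read is to collect the metric tuples into a single table. The sketch below is a minimal example that hard-codes the values from the printed evaluation results; in the notebook the tuples would come from `knn_metrics`, `dt_metrics`, and the other variables returned by `evaluate_model`. The `results` and `summary` names are assumptions.

```python
import pandas as pd

# (accuracy, recall, f1, precision) tuples copied from the printed results;
# in the notebook these would be knn_metrics, dt_metrics, and so on
results = {
    "KNN": (0.9861, 0.9861, 0.9861, 0.9863),
    "Decision Tree": (1.0, 1.0, 1.0, 1.0),
    "Random Forest": (1.0, 1.0, 1.0, 1.0),
    "SVM": (1.0, 1.0, 1.0, 1.0),
    "Naive Bayes": (0.9920, 0.9920, 0.9920, 0.9922),
}

summary = pd.DataFrame.from_dict(
    results, orient="index", columns=["Accuracy", "Recall", "F1-Score", "Precision"]
)

# Rank the models from highest to lowest F1-Score
summary = summary.sort_values("F1-Score", ascending=False)
print(summary)
```

The column order follows the order in which `evaluate_model` returns its values.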

## **c. Tuning the Classification Model (Optional)**

Use GridSearchCV, RandomizedSearchCV, or another method to search for the best combination of hyperparameters.

In [None]:
#Type your code here
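
As a starting point for this cell, the following sketch runs `GridSearchCV` over a small KNN search space. It uses synthetic data from `make_classification` in place of the notebook's `X_train` and `y_train`, and the parameter grid is a hypothetical example, not a tuned recommendation.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic 3-class data standing in for the notebook's X and y
X_demo, y_demo = make_classification(
    n_samples=300, n_features=9, n_informative=5, n_classes=3, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Hypothetical search space; the values are illustrative only
param_grid = {
    "n_neighbors": [3, 5, 7, 9],
    "weights": ["uniform", "distance"],
}

# 5-fold cross-validation on the training split, scored by weighted F1
search = GridSearchCV(
    KNeighborsClassifier(), param_grid, cv=5, scoring="f1_weighted"
)
search.fit(X_tr, y_tr)

print("Best parameters:", search.best_params_)
print(f"Best cross-validated F1: {search.best_score_:.4f}")
```

Because the search only uses the training split, the test set stays untouched for the final evaluation.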

## **d. Evaluating the Classification Model after Tuning (Optional)**

Below are the recommended steps.
1. Use the model with the best hyperparameters.
2. Recompute the evaluation metrics to see whether performance has improved.

In [None]:
#Type your code here
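
A possible shape for this cell is sketched below: a default model and a tuned model are scored on the same held-out test split so the two can be compared directly. The data is synthetic, the Decision Tree grid is a hypothetical example, and names such as `baseline`, `tuned`, and `scores` are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic 3-class data standing in for the notebook's X and y
X_demo, y_demo = make_classification(
    n_samples=300, n_features=9, n_informative=5, n_classes=3, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Baseline model with default hyperparameters
baseline = DecisionTreeClassifier(random_state=42).fit(X_tr, y_tr)

# Tuned model: illustrative grid; the best setting is refit on the training split
search = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    {"max_depth": [3, 5, None], "min_samples_leaf": [1, 5, 10]},
    cv=5,
    scoring="f1_weighted",
).fit(X_tr, y_tr)
tuned = search.best_estimator_

# Score both models on the same held-out test split
scores = {}
for name, model in [("baseline", baseline), ("tuned", tuned)]:
    y_pred = model.predict(X_te)
    scores[name] = (
        accuracy_score(y_te, y_pred),
        f1_score(y_te, y_pred, average="weighted"),
    )
    print(f"{name}: accuracy={scores[name][0]:.4f}, f1={scores[name][1]:.4f}")
```

Tuning does not guarantee a better test score; a small difference in either direction can be within the noise of a single split.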

## **e. Analysis of the Classification Model Evaluation Results**

Below are the **recommended** steps.
1. Compare the evaluation results before and after tuning (if tuning was performed).
2. Identify the model's weaknesses, such as:
   - Decision Tree, Random Forest, and SVM achieve perfect scores on the test set, which is suspicious. It may indicate overfitting, although perfect test-set scores alone do not prove it: they may also mean that the `Clusters` labels, which come from a clustering step, are almost fully determined by the input features. Comparing training and test scores, or running cross-validation, would help tell the two apart.
   - KNN and Naive Bayes perform very well but not perfectly (accuracy of 0.9861 and 0.9920), which is more realistic. They may still be overfitting.
   - Follow-up work is needed, such as better feature engineering or collecting additional data, to reduce any overfitting.
3. Recommend follow-up actions, such as collecting additional data or trying other algorithms if the results are still unsatisfactory.
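
To check whether a model is overfitting, a common approach is to compare its training and test scores and to look at cross-validated scores across several folds. The sketch below illustrates this on synthetic data with a Decision Tree; it is not a rerun of the notebook's models, and `X_demo`, `y_demo`, and the other names are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic 3-class data standing in for the notebook's X and y
X_demo, y_demo = make_classification(
    n_samples=300, n_features=9, n_informative=5, n_classes=3, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# A large gap between training and test accuracy is a typical overfitting signal
model = DecisionTreeClassifier(random_state=42).fit(X_tr, y_tr)
train_acc = model.score(X_tr, y_tr)
test_acc = model.score(X_te, y_te)
print(f"Train accuracy: {train_acc:.4f}")
print(f"Test accuracy:  {test_acc:.4f}")

# 5-fold cross-validation shows how stable the score is across different splits
cv_scores = cross_val_score(
    DecisionTreeClassifier(random_state=42), X_demo, y_demo, cv=5
)
print(f"CV accuracy: mean={cv_scores.mean():.4f}, std={cv_scores.std():.4f}")
```

If training and test scores are both near perfect and the cross-validated scores are stable, the task itself is probably easy for the model, which points toward the label construction, not overfitting, as the explanation.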