# **1. Import Libraries**

At this stage, import the Python libraries needed for data analysis and for building the machine learning models.

In [20]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

# **2. Loading the Dataset from the Clustering Results**

Load the clustering results from a CSV file into a DataFrame.

In [21]:
# Load the dataset
df = pd.read_csv('/content/hasil_clustering.csv')
df.head()

Unnamed: 0,work_year,experience_level,salary_in_usd,remote_ratio,company_size,tipe_kerja_CT,tipe_kerja_FL,tipe_kerja_FT,tipe_kerja_PT,frekuensi_jabatan,employee_residence_encoded,company_location_encoded,cluster,Cluster
0,2023,3,85847,100,3,False,False,True,False,8,47,44,1,1
1,2023,2,30000,100,1,True,False,False,False,34,1893,1929,2,2
2,2023,2,25500,100,1,True,False,False,False,34,1893,1929,2,2
3,2023,3,175000,100,2,False,False,True,False,538,81,83,1,1
4,2023,3,120000,100,2,False,False,True,False,538,81,83,1,1
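
The preview lists both a `cluster` and a `Cluster` column, with the same values in every displayed row. Below is a minimal sketch of how one might check for such a duplicate before treating the remaining columns as features. It assumes a small hypothetical frame `demo` that stands in for `df` (the values are copied from the preview rows); the check is not part of the original notebook.

```python
import pandas as pd

# Hypothetical stand-in for df, using a few columns from the preview rows
demo = pd.DataFrame({
    "salary_in_usd": [85847, 30000, 25500, 175000, 120000],
    "remote_ratio": [100, 100, 100, 100, 100],
    "cluster": [1, 2, 2, 1, 1],
    "Cluster": [1, 2, 2, 1, 1],
})

target = "cluster"

# Columns (other than the target) whose values equal the target in every row
duplicates_of_target = [
    col for col in demo.columns
    if col != target and (demo[col] == demo[target]).all()
]
print("Columns identical to the target:", duplicates_of_target)

# Drop the target and its duplicates so that no copy of the label stays in the features
features = demo.drop(columns=[target] + duplicates_of_target)
print("Feature columns:", list(features.columns))
```

If such a column is found in the real data, a classifier can simply read the label from it, which would make perfect scores uninformative.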


# **3. Data Splitting**

The data splitting stage separates the dataset into two parts: a training set and a test set.

In [22]:
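# Separate the features (independent variables) from the label (target cluster).
# Note: the preview above also shows a `Cluster` column that matches `cluster` in every
# displayed row; dropping only 'cluster' leaves that column among the features.
fitur = df.drop(columns='cluster')
label = df['cluster']

# Split the dataset into training and test data
# (train_test_split was already imported in section 1)
X_latih, X_uji, y_latih, y_uji = train_test_split(fitur, label, test_size=0.2, random_state=42)

# Check how many rows ended up in each split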
print(f"Data untuk pelatihan: {X_latih.shape[0]} baris")
print(f"Data untuk pengujian: {X_uji.shape[0]} baris")

Data untuk pelatihan: 2067 baris
Data untuk pengujian: 517 baris
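
With a rare class, a purely random split can leave very few minority rows in the test set. Below is a hedged sketch of a stratified split, assuming synthetic data with an imbalanced three-class label; `X_demo`, `y_demo`, and the class counts are illustrative and not taken from the notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Illustrative data: 1000 rows, three classes, class 2 is rare
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(1000, 4))
y_demo = np.array([0] * 750 + [1] * 225 + [2] * 25)

# stratify=y_demo keeps the class proportions the same in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

train_counts = np.bincount(y_tr).tolist()
test_counts = np.bincount(y_te).tolist()
print("Train class counts:", train_counts)
print("Test class counts:", test_counts)
```

The only change from a plain split is the `stratify` argument, so it can be added to an existing `train_test_split` call without touching anything else.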


# **4. Building the Classification Models**


## **a. Building the Classification Models**

After choosing suitable classification algorithms, the next step is to train the models on the training data.

The recommended steps are as follows.
1. Choose a suitable classification algorithm, such as Logistic Regression, Decision Tree, Random Forest, or K-Nearest Neighbors (KNN).
2. Train the model on the training data.

In [23]:
# Initialize the classification models
daftar_model = {
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'Decision Tree': DecisionTreeClassifier(),
    'Random Forest': RandomForestClassifier(),
    'K-Nearest Neighbor (KNN)': KNeighborsClassifier()
}

The classification algorithms considered in this project are:

- Logistic Regression, a simple probabilistic approach suited to predicting categories.

- Decision Tree, a rule-based (if-else) algorithm that is easy to interpret and can handle both numerical and categorical data.

- Random Forest, an ensemble method that combines many decision trees to produce more stable and accurate predictions.

- K-Nearest Neighbors (KNN), a distance-based approach that predicts a label from the labels of the nearest neighboring data points.

Each model is trained on the training data and then evaluated on the test data to find out which one classifies the previously formed clusters best.

In [24]:
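# Same feature/label separation and split as in section 3 (same test_size and random_state),
# repeated here under the names X_train/X_test
X = df.drop('cluster', axis=1)
y = df['cluster']

# Split the data into training and test data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Inspect the sizes of the resulting splits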
print(f"Ukuran X_train: {X_train.shape}")
print(f"Ukuran X_test: {X_test.shape}")
print(f"Ukuran y_train: {y_train.shape}")
print(f"Ukuran y_test: {y_test.shape}")

Ukuran X_train: (2067, 13)
Ukuran X_test: (517, 13)
Ukuran y_train: (2067,)
Ukuran y_test: (517,)


In [25]:
# Function for training a classification model
def latih_model(X_train, y_train, model_name):
    """
    Train one of the classification models registered in daftar_model.

    Parameters:
    X_train : training features
    y_train : training labels
    model_name : name of the selected model (a key of daftar_model)

    Returns:
    model : the fitted model, or None if model_name is not registered
    """
    if model_name not in daftar_model:
        print(f"Model {model_name} tidak tersedia")
        return None

    model = daftar_model[model_name]
    model.fit(X_train, y_train)
    print(f"Model {model_name} berhasil dilatih")
    return model

In [26]:
# Train every classification model and store the results in a dictionary
model_terlatih = {}
for nama_model in daftar_model:
    print(f"Melatih model {nama_model}...")
    model_terlatih[nama_model] = latih_model(X_train, y_train, nama_model)

# List the models that have been trained
print("\nDaftar model yang sudah dilatih:")
for nama_model in model_terlatih:
    print(f"- {nama_model}")

Melatih model Logistic Regression...


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Model Logistic Regression berhasil dilatih
Melatih model Decision Tree...
Model Decision Tree berhasil dilatih
Melatih model Random Forest...
Model Random Forest berhasil dilatih
Melatih model K-Nearest Neighbor (KNN)...
Model K-Nearest Neighbor (KNN) berhasil dilatih

Daftar model yang sudah dilatih:
- Logistic Regression
- Decision Tree
- Random Forest
- K-Nearest Neighbor (KNN)
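
The solver message in the output above suggests either raising `max_iter` or scaling the data. Below is a minimal sketch of the scaling route. It assumes synthetic data in which one feature sits on a salary-like scale; the data, the name `pipe`, and the scale factor are illustrative assumptions, not the notebook's code.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Illustrative data: three classes, with one feature blown up to a large scale
X_demo, y_demo = make_classification(
    n_samples=600, n_features=6, n_informative=4, n_redundant=0,
    n_classes=3, random_state=42
)
X_demo[:, 0] *= 100_000

# The scaler is fitted inside the pipeline, so it only sees the data passed to fit()
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe.fit(X_demo, y_demo)

# n_iter_ holds the number of iterations the solver actually used
n_iter = int(pipe[-1].n_iter_.max())
train_acc = pipe.score(X_demo, y_demo)
print("Iterations used by the solver:", n_iter)
print("Training accuracy:", round(train_acc, 4))
```

Wrapping the scaler and the model in one pipeline also keeps the scaling statistics out of the test data, since `fit` is only ever called on the training part.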


In [27]:
# Evaluate model performance on the training data
print("\nPerforma model pada data latih:")
for nama_model, model in model_terlatih.items():
    y_pred_train = model.predict(X_train)
    akurasi_train = accuracy_score(y_train, y_pred_train)
    print(f"- {nama_model}: Akurasi = {akurasi_train:.4f}")


Performa model pada data latih:
- Logistic Regression: Akurasi = 0.9734
- Decision Tree: Akurasi = 1.0000
- Random Forest: Akurasi = 1.0000
- K-Nearest Neighbor (KNN): Akurasi = 0.9739


## **b. Evaluating the Classification Models**

The **recommended** steps are as follows.
1. Make predictions on the test data.
2. Compute evaluation metrics such as Accuracy and F1-Score (optional: Precision and Recall).
3. Build a confusion matrix to see the correct and incorrect predictions in detail.
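
The three steps above can be sketched on a handful of hand-written labels, small enough to verify by eye. The labels `y_true` and `y_pred` are made up for illustration; class 2 is deliberately never predicted, so its scores are zero.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score

# Tiny illustrative labels: class 2 is never predicted
y_true = [0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 1, 1, 1, 1]

acc = accuracy_score(y_true, y_pred)
# zero_division=0 scores a never-predicted class as 0 without emitting a warning
macro_f1 = f1_score(y_true, y_pred, average="macro", zero_division=0)
# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true, y_pred, labels=[0, 1, 2])

print("Accuracy:", round(acc, 4))
print("Macro F1:", round(macro_f1, 4))
print(cm)
```

Accuracy stays fairly high here even though one class is missed entirely, which is why the macro-averaged F1 and the confusion matrix are worth reading alongside it.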

In [28]:
# Train and evaluate each model, printing the results in a tidy layout
for nama_model, algoritma in daftar_model.items():
    # Train the model (this refits the same estimator instances trained above)
    algoritma.fit(X_latih, y_latih)

    # Predict on the test data
    hasil_prediksi = algoritma.predict(X_uji)

    # Evaluate performance
    skor_akurasi = accuracy_score(y_uji, hasil_prediksi)
    laporan = classification_report(y_uji, hasil_prediksi, output_dict=True)
    matriks_konfusi = confusion_matrix(y_uji, hasil_prediksi)

    # Convert the report and the matrix to DataFrames
    # (the loop variable is named kelas so it does not shadow the label Series)
    df_laporan = pd.DataFrame(laporan).transpose()
    df_matriks = pd.DataFrame(
        matriks_konfusi,
        index=[f'Asli {kelas}' for kelas in sorted(y_uji.unique())],
        columns=[f'Prediksi {kelas}' for kelas in sorted(y_uji.unique())]
    )

    # Show the evaluation results for this model
    print(f"\n{'='*70}")
    print(f"Evaluasi Model: {nama_model}")
    print(f"Akurasi: {skor_akurasi:.4f}")
    print("\nLaporan Klasifikasi:")
    print(df_laporan.round(2))
    print("\nConfusion Matrix:")
    print(df_matriks)
    print(f"{'='*70}\n")

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(



Evaluasi Model: Logistic Regression
Akurasi: 0.9497

Laporan Klasifikasi:
              precision  recall  f1-score  support
0                  0.96    0.99      0.98   386.00
1                  0.90    0.90      0.90   119.00
2                  0.00    0.00      0.00    12.00
accuracy           0.95    0.95      0.95     0.95
macro avg          0.62    0.63      0.63   517.00
weighted avg       0.93    0.95      0.94   517.00

Confusion Matrix:
        Prediksi 0  Prediksi 1  Prediksi 2
Asli 0         384           2           0
Asli 1          12         107           0
Asli 2           2          10           0


Evaluasi Model: Decision Tree
Akurasi: 1.0000

Laporan Klasifikasi:
              precision  recall  f1-score  support
0                   1.0     1.0       1.0    386.0
1                   1.0     1.0       1.0    119.0
2                   1.0     1.0       1.0     12.0
accuracy            1.0     1.0       1.0      1.0
macro avg           1.0     1.0       1.0    517.0
w



Write up the evaluation results for the algorithms you used; if you used two algorithms, compare their results.

## **c. Tuning the Classification Models (Optional)**

Use GridSearchCV, RandomizedSearchCV, or another method to find the best combination of hyperparameters.

In [29]:
# Separate the features (X) and the target (y)
X = df.drop(['cluster'], axis=1)
y = df['cluster']

# Split the dataset into training and test data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define the hyperparameter search space.
# Note: the scikit-learn version used here rejects the string 'none' during parameter
# validation (see the InvalidParameterError in the output). The 6 candidates with
# penalty='none' times 5 folds account for the 30 failed fits out of 60.
param_options = {
    'penalty': ['l2', 'none'],
    'C': [0.1, 1, 10],
    'solver': ['liblinear', 'saga']
}

# Initialize the Logistic Regression model
logistic_model = LogisticRegression(max_iter=1000)

# Set up GridSearchCV
tuner = GridSearchCV(
    estimator=logistic_model,
    param_grid=param_options,
    cv=5,
    scoring='accuracy',
    n_jobs=-1
)

# Fit the grid search on the training data
tuner.fit(X_train, y_train)

# Show the best tuning result
print("Best Parameters Found:", tuner.best_params_)
print("Best CV Accuracy Score:", tuner.best_score_)

Best Parameters Found: {'C': 0.1, 'penalty': 'l2', 'solver': 'liblinear'}
Best CV Accuracy Score: 0.9733960299914612


30 fits failed out of a total of 60.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
10 fits failed with the following error:
Traceback (most recent call last):
  File "/usr/local/lib/python3.11/dist-packages/sklearn/model_selection/_validation.py", line 866, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/usr/local/lib/python3.11/dist-packages/sklearn/base.py", line 1382, in wrapper
    estimator._validate_params()
  File "/usr/local/lib/python3.11/dist-packages/sklearn/base.py", line 436, in _validate_params
    validate_parameter_constraints(
  File "/usr/local/lib/python3.11/dist-packages/sklearn/utils/_param_validation.py", line 98, in validate_parameter_constraints
    raise InvalidParameterError(
sklea
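
The failures above are raised during parameter validation, so one way to avoid them is to search only over values the installed scikit-learn accepts. Below is a hedged sketch on synthetic data. The grid leaves `penalty` at its default, and the solver choices `lbfgs` and `newton-cg` as well as the step names `scale` and `clf` are assumptions made for this example, not the notebook's settings.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Illustrative three-class data
X_demo, y_demo = make_classification(
    n_samples=300, n_features=6, n_informative=4, n_redundant=0,
    n_classes=3, random_state=42
)

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Parameters of a pipeline step are addressed as <step name>__<parameter>
param_grid = {
    "clf__C": [0.1, 1, 10],
    "clf__solver": ["lbfgs", "newton-cg"],
}

# error_score="raise" surfaces an invalid combination immediately instead of scoring it as NaN
search = GridSearchCV(pipe, param_grid, cv=3, scoring="accuracy", error_score="raise")
search.fit(X_demo, y_demo)

any_failed = bool(np.isnan(search.cv_results_["mean_test_score"]).any())
print("Best parameters:", search.best_params_)
print("Any failed candidates:", any_failed)
```

Since a larger `C` means weaker regularization, including a large `C` value in the grid is one way to approximate an unpenalized model without passing a separate penalty setting.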

In [30]:
# Define the parameter ranges to test
param_options_dt = {
    'criterion': ['gini', 'entropy'],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

# Initialize the Decision Tree model
decision_tree_model = DecisionTreeClassifier()

# Set up GridSearchCV to find the best hyperparameters
grid_search_dt_model = GridSearchCV(
    estimator=decision_tree_model,
    param_grid=param_options_dt,
    cv=5,
    scoring='accuracy'
)

# Fit the grid search on the training data
grid_search_dt_model.fit(X_train, y_train)

# Show the best parameters and the best cross-validation accuracy
print("Optimal Hyperparameters:", grid_search_dt_model.best_params_)
print("Best Cross-Validation Accuracy:", grid_search_dt_model.best_score_)

Optimal Hyperparameters: {'criterion': 'gini', 'max_depth': None, 'min_samples_leaf': 1, 'min_samples_split': 2}
Best Cross-Validation Accuracy: 1.0


In [31]:
# Define the hyperparameter search space for Random Forest
param_options_rf = {
    'n_estimators': [100, 200, 300],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

# Initialize the Random Forest model
rf_model = RandomForestClassifier()

# Set up GridSearchCV to find the best parameters
grid_search_rf_model = GridSearchCV(
    estimator=rf_model,
    param_grid=param_options_rf,
    cv=5,
    scoring='accuracy'
)

# Fit the grid search on the training data
grid_search_rf_model.fit(X_train, y_train)

# Show the best hyperparameters and the cross-validation accuracy
print("Optimal Hyperparameters for Random Forest:", grid_search_rf_model.best_params_)
print("Best Cross-Validation Accuracy:", grid_search_rf_model.best_score_)


Optimal Hyperparameters for Random Forest: {'max_depth': None, 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 100}
Best Cross-Validation Accuracy: 1.0


In [32]:
# Define the parameter space for the randomized search on KNN
param_distribution = {
    'n_neighbors': np.arange(1, 31),
    'weights': ['uniform', 'distance'],
    'metric': ['euclidean', 'manhattan']
}

# Initialize the K-Nearest Neighbors model
knn_model = KNeighborsClassifier()

# Set up RandomizedSearchCV to find the best hyperparameter combination
random_search_knn_model = RandomizedSearchCV(
    estimator=knn_model,
    param_distributions=param_distribution,
    n_iter=100,
    cv=5,
    scoring='accuracy',
    random_state=42
)

# Fit the randomized search on the training data
random_search_knn_model.fit(X_train, y_train)

# Show the best hyperparameters and the cross-validation accuracy
print("Optimal Hyperparameters for KNN:", random_search_knn_model.best_params_)
print("Best Cross-Validation Accuracy:", random_search_knn_model.best_score_)

Optimal Hyperparameters for KNN: {'weights': 'distance', 'n_neighbors': np.int64(4), 'metric': 'manhattan'}
Best Cross-Validation Accuracy: 0.965660712823572
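
KNN is distance-based, so features on a large scale (a salary column, for instance) can dominate the neighbor search. Below is a sketch of running the randomized search with scaling applied inside each cross-validation fold. It uses synthetic data, and the pipeline name `knn_pipe`, the step names, and the reduced search size are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Illustrative three-class data with one feature on a much larger scale
X_demo, y_demo = make_classification(
    n_samples=400, n_features=6, n_informative=4, n_redundant=0,
    n_classes=3, random_state=0
)
X_demo[:, 0] *= 100_000

knn_pipe = Pipeline([
    ("scale", StandardScaler()),
    ("knn", KNeighborsClassifier()),
])

# Step parameters are addressed as <step name>__<parameter>
param_dist = {
    "knn__n_neighbors": list(range(1, 16)),
    "knn__weights": ["uniform", "distance"],
    "knn__metric": ["euclidean", "manhattan"],
}

search = RandomizedSearchCV(
    knn_pipe, param_dist, n_iter=10, cv=3, scoring="accuracy", random_state=42
)
search.fit(X_demo, y_demo)

best_k = int(search.best_params_["knn__n_neighbors"])
print("Best parameters:", search.best_params_)
print("Best CV accuracy:", round(search.best_score_, 4))
```

Because the scaler is a pipeline step, it is refitted on the training part of every fold, so the validation part never influences the scaling.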


In [33]:
# Separate features and target, then split into training and test data
X = df.drop(['cluster'], axis=1)
y = df['cluster']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# GridSearchCV for Logistic Regression
# (candidates with penalty='none' fail parameter validation; see the output below)
param_grid_logreg = {
    'penalty': ['l2', 'none'],
    'C': [0.1, 1, 10],
    'solver': ['liblinear', 'saga']
}

log_reg = LogisticRegression(max_iter=1000)
grid_search_log_reg = GridSearchCV(log_reg, param_grid_logreg, cv=5, scoring='accuracy')
grid_search_log_reg.fit(X_train, y_train)

# GridSearchCV for Decision Tree
param_grid_dt = {
    'criterion': ['gini', 'entropy'],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

dt = DecisionTreeClassifier()
grid_search_dt = GridSearchCV(dt, param_grid_dt, cv=5, scoring='accuracy')
grid_search_dt.fit(X_train, y_train)

# GridSearchCV for Random Forest
param_grid_rf = {
    'n_estimators': [100, 200],
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 5],
    'min_samples_leaf': [1, 2]
}

rf = RandomForestClassifier()
grid_search_rf = GridSearchCV(rf, param_grid_rf, cv=5, scoring='accuracy')
grid_search_rf.fit(X_train, y_train)

# RandomizedSearchCV for KNN
param_dist_knn = {
    'n_neighbors': np.arange(1, 31),
    'weights': ['uniform', 'distance'],
    'metric': ['euclidean', 'manhattan']
}

knn = KNeighborsClassifier()
random_search_knn = RandomizedSearchCV(knn, param_dist_knn, n_iter=20, cv=5, scoring='accuracy')
random_search_knn.fit(X_train, y_train)

30 fits failed out of a total of 60.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
30 fits failed with the following error:
Traceback (most recent call last):
  File "/usr/local/lib/python3.11/dist-packages/sklearn/model_selection/_validation.py", line 866, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/usr/local/lib/python3.11/dist-packages/sklearn/base.py", line 1382, in wrapper
    estimator._validate_params()
  File "/usr/local/lib/python3.11/dist-packages/sklearn/base.py", line 436, in _validate_params
    validate_parameter_constraints(
  File "/usr/local/lib/python3.11/dist-packages/sklearn/utils/_param_validation.py", line 98, in validate_parameter_constraints
    raise InvalidParameterError(
sklea

In [34]:
results = {
    'Model': ['Logistic Regression', 'Decision Tree', 'Random Forest', 'K-Nearest Neighbors'],
    'Best Parameters': [
        grid_search_log_reg.best_params_,
        grid_search_dt.best_params_,
        grid_search_rf.best_params_,
        random_search_knn.best_params_
    ],
    'Best CV Accuracy': [
        round(grid_search_log_reg.best_score_, 4),
        round(grid_search_dt.best_score_, 4),
        round(grid_search_rf.best_score_, 4),
        round(random_search_knn.best_score_, 4)
    ]
}

result_df = pd.DataFrame(results)
print(result_df.to_string(index=False))

              Model                                                                         Best Parameters  Best CV Accuracy
Logistic Regression                                      {'C': 0.1, 'penalty': 'l2', 'solver': 'liblinear'}            0.9734
      Decision Tree {'criterion': 'gini', 'max_depth': None, 'min_samples_leaf': 1, 'min_samples_split': 2}            1.0000
      Random Forest {'max_depth': None, 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 100}            1.0000
K-Nearest Neighbors                        {'weights': 'distance', 'n_neighbors': 4, 'metric': 'manhattan'}            0.9657


## **d. Evaluating the Classification Models after Tuning (Optional)**

The recommended steps are as follows.
1. Use the model with the best hyperparameters.
2. Recompute the evaluation metrics to see whether performance has improved.
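
One way to carry out step 1 without retyping hyperparameter values by hand is to take the refitted model directly from the search object. Below is a minimal sketch, assuming the Iris dataset and a small Decision Tree grid as stand-ins; neither is the notebook's data or grid.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Stand-in data (not the clustering dataset)
X_demo, y_demo = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

search = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    {"max_depth": [None, 3, 5], "min_samples_leaf": [1, 2, 4]},
    cv=5,
    scoring="accuracy",
)
search.fit(X_tr, y_tr)

# With the default refit=True, best_estimator_ is already refitted on all of X_tr
best_tree = search.best_estimator_

# The refitted model carries exactly the parameters the search selected
tuned_params = {name: best_tree.get_params()[name] for name in search.best_params_}
test_acc = accuracy_score(y_te, best_tree.predict(X_te))

print("Selected parameters:", tuned_params)
print("Test accuracy:", round(test_acc, 4))
```

Using `best_estimator_` guarantees that the model evaluated on the test set is the one the search actually selected.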

In [35]:
# Initialize the models with the chosen hyperparameters.
# Note: only the Logistic Regression values match the best parameters reported by the
# searches above; the Decision Tree, Random Forest, and KNN values below differ from them
# (for example, n_neighbors=3 here versus 4 from the search).
log_reg_model = LogisticRegression(C=0.1, penalty='l2', solver='liblinear')
dt_model = DecisionTreeClassifier(criterion='gini', max_depth=30, min_samples_leaf=2, min_samples_split=5)
rf_model = RandomForestClassifier(n_estimators=300, max_depth=None, min_samples_leaf=1, min_samples_split=5)
knn_model = KNeighborsClassifier(n_neighbors=3, weights='distance', metric='manhattan')

# Fit on the training data
log_reg_model.fit(X_train, y_train)
dt_model.fit(X_train, y_train)
rf_model.fit(X_train, y_train)
knn_model.fit(X_train, y_train)

# Predict with and evaluate every model
model_list = {
    'Logistic Regression': log_reg_model,
    'Decision Tree': dt_model,
    'Random Forest': rf_model,
    'K-Nearest Neighbors': knn_model
}

# List for collecting the results
results = []

# Evaluation loop over the models
for model_name, model in model_list.items():
    y_pred = model.predict(X_test)
    acc = accuracy_score(y_test, y_pred)
    report = classification_report(y_test, y_pred, output_dict=True)
    cmatrix = confusion_matrix(y_test, y_pred)

    # Take the macro-averaged precision, recall, and F1-score
    precision = report['macro avg']['precision']
    recall = report['macro avg']['recall']
    f1 = report['macro avg']['f1-score']

    # Append to the results
    results.append({
        'Model': model_name,
        'Accuracy': round(acc, 4),
        'Precision': round(precision, 4),
        'Recall': round(recall, 4),
        'F1-Score': round(f1, 4)
    })

# Show the evaluation results as a table
result_df = pd.DataFrame(results)
print("\n📊 Evaluasi Kinerja Setiap Model:\n")
print(result_df.to_string(index=False))


📊 Evaluasi Kinerja Setiap Model:

              Model  Accuracy  Precision  Recall  F1-Score
Logistic Regression    0.9458     0.6177  0.6277    0.6226
      Decision Tree    1.0000     1.0000  1.0000    1.0000
      Random Forest    1.0000     1.0000  1.0000    1.0000
K-Nearest Neighbors    0.9342     0.6978  0.6455    0.6577




In [36]:
# Logistic Regression with the hyperparameters selected by the grid search
log_reg_best = LogisticRegression(C=0.1, penalty='l2', solver='liblinear')
log_reg_best.fit(X_train, y_train)
log_reg_predictions = log_reg_best.predict(X_test)

# Decision Tree with the same hyperparameters as in the previous cell
decision_tree_best = DecisionTreeClassifier(criterion='gini', max_depth=30, min_samples_leaf=2, min_samples_split=5)
decision_tree_best.fit(X_train, y_train)
decision_tree_predictions = decision_tree_best.predict(X_test)

# Random Forest with the same hyperparameters as in the previous cell
random_forest_best = RandomForestClassifier(max_depth=None, min_samples_leaf=1, min_samples_split=5, n_estimators=300)
random_forest_best.fit(X_train, y_train)
random_forest_predictions = random_forest_best.predict(X_test)

# K-Nearest Neighbors with the same hyperparameters as in the previous cell
knn_best = KNeighborsClassifier(weights='distance', n_neighbors=3, metric='manhattan')
knn_best.fit(X_train, y_train)
knn_predictions = knn_best.predict(X_test)

# Evaluate the performance of each model
models = {
    "Logistic Regression": log_reg_best,
    "Decision Tree": decision_tree_best,
    "Random Forest": random_forest_best,
    "K-Nearest Neighbors": knn_best
}

for model_name, model in models.items():
    print(f"Model: {model_name}")
    predictions = model.predict(X_test)
    print("Accuracy:", accuracy_score(y_test, predictions))
    print("Classification Report:\n", classification_report(y_test, predictions))
    print("Confusion Matrix:\n", confusion_matrix(y_test, predictions))
    print("="*50)


Model: Logistic Regression
Accuracy: 0.9458413926499033
Classification Report:
               precision    recall  f1-score   support

           0       0.96      0.99      0.98       386
           1       0.89      0.89      0.89       119
           2       0.00      0.00      0.00        12

    accuracy                           0.95       517
   macro avg       0.62      0.63      0.62       517
weighted avg       0.92      0.95      0.93       517

Confusion Matrix:
 [[383   3   0]
 [ 13 106   0]
 [  2  10   0]]
Model: Decision Tree
Accuracy: 1.0
Classification Report:
               precision    recall  f1-score   support

           0       1.00      1.00      1.00       386
           1       1.00      1.00      1.00       119
           2       1.00      1.00      1.00        12

    accuracy                           1.00       517
   macro avg       1.00      1.00      1.00       517
weighted avg       1.00      1.00      1.00       517

Confusion Matrix:
 [[386   0   0]




## **e. Analysis of the Classification Model Evaluation Results**

**Analysis of the Classification Model Evaluation Results**

1. **Comparison of the Evaluation Results**

**Before tuning:**

- Logistic Regression: accuracy = 0.9497, with precision and recall of 0.00 for class 2.

- Decision Tree & Random Forest: perfect accuracy (1.0000) on every class, which may indicate overfitting or data leakage.

- KNN: accuracy = 0.9362, with precision and recall of 0.00 for class 2.

**After tuning:**

- Logistic Regression: accuracy drops slightly (0.9458) with the selected hyperparameters (C=0.1, penalty='l2', solver='liblinear'). The class 2 problem remains.

- Decision Tree & Random Forest: still perfect (accuracy 1.0000), so the same caution applies.

- KNN: test accuracy drops slightly (0.9342). The randomized search reported a best CV accuracy of 0.9657 with n_neighbors=4 and metric='manhattan', but the model evaluated on the test set used n_neighbors=3. No CV accuracy was measured before tuning, so 0.9657 cannot be read as a before/after improvement.

Overall, tuning changed test performance very little. Logistic Regression and KNN each lost a fraction of a percentage point in test accuracy, while Decision Tree and Random Forest kept their perfect scores, which should be investigated further for overfitting or leakage.

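2. **Identifying Model Weaknesses**

Low precision/recall for specific classes:

- Logistic Regression & KNN: before tuning, class 2 has precision and recall of 0.00, meaning the models never predict this class correctly.
Examples:

- Logistic Regression: precision=0.00, recall=0.00 for class 2 (12 samples).

- KNN: precision=0.00, recall=0.00 for class 2.

- After tuning, Logistic Regression still scores 0.00 on class 2. KNN's macro-averaged precision of 0.6978 exceeds 2/3, which is only possible if its class 2 precision is above zero, so the tuned KNN predicts class 2 correctly at least once.

- Decision Tree & Random Forest: no weaknesses appear in the classification report, but perfect results can be suspicious.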
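
The macro-averaged values in the post-tuning table can be used to reason about class 2 even without the per-class report. Below is a short arithmetic sketch, using only the macro precision reported for the tuned KNN (0.6978) and the fact that macro precision is the unweighted mean of three per-class precisions; the variable names are illustrative.

```python
# Macro precision is the unweighted mean of the per-class precisions
n_classes = 3
knn_macro_precision = 0.6978  # reported for the tuned KNN

# Upper bound on macro precision if class 2 had precision 0 and the other two were perfect
bound_if_class2_zero = (1.0 + 1.0 + 0.0) / n_classes

# Lower bound on class 2 precision implied by the reported macro value
min_class2_precision = knn_macro_precision * n_classes - 2.0

print("Bound with class 2 at zero:", round(bound_if_class2_zero, 4))
print("Implied minimum class 2 precision:", round(min_class2_precision, 4))
```

Since 0.6978 is above the bound of about 0.6667, the tuned KNN must have a class 2 precision of at least roughly 0.09. The same reasoning applied to Logistic Regression (macro precision 0.6177) is consistent with its reported 0.00 for class 2.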

**Overfitting or Underfitting?**

Overfitting:

- Decision Tree and Random Forest reach an accuracy of 1.0000 on both the training and the test data. This could be a warning sign, especially if the test data is not representative. Note that overfitting normally shows up as a gap between training and test accuracy, which is absent here, so data leakage is at least as plausible an explanation.

- Logistic Regression and KNN look more realistic, with test accuracy of roughly 0.93–0.95.

Underfitting:

- Logistic Regression and KNN underfit class 2 (before tuning, neither predicts it at all).

- F1-score for class 1:

  - Logistic Regression: f1-score=0.90 (good).

  - KNN: f1-score=0.87 (lower).

3. **Recommended Next Steps**
- For class 2, which the models fail to predict:

  Collect more data: class 2 has only 12 samples in the test set (about 2.3% of the 517 test rows). With such a strong class imbalance, the models tend to ignore it.

- Alternative algorithms:

  - Try models that cope better with imbalanced data, such as SVM with class weighting or Gradient Boosting (XGBoost).

  - For KNN, try metrics other than manhattan or increase n_neighbors.

- Deeper evaluation:

  - Use additional metrics such as ROC-AUC (in its one-vs-rest form, since there are three classes) alongside the confusion matrix to see the specific errors.

  - Check for data leakage that could make Decision Tree/Random Forest look too good. One candidate is the `Cluster` column shown in the dataset preview, which stays among the features because only `cluster` is dropped.
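
To illustrate the class-weighting idea from the list above, the sketch below computes the weights that scikit-learn's `balanced` heuristic would assign, given the class supports reported for the test set (386, 119, and 12). The array `y_demo` is rebuilt from those counts purely for illustration; in practice the weights would be derived from the training labels.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Class supports reported in the test-set classification reports
support = {0: 386, 1: 119, 2: 12}

# Rebuild a label array with those counts (illustrative only)
y_demo = np.repeat(list(support.keys()), list(support.values()))

classes = np.array(sorted(support))
# 'balanced' assigns n_samples / (n_classes * class_count) to each class
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y_demo)

for cls, w in zip(classes, weights):
    print(f"class {cls}: weight = {w:.4f}")
```

Estimators such as `LogisticRegression`, `SVC`, `DecisionTreeClassifier`, and `RandomForestClassifier` accept `class_weight='balanced'` directly and apply this weighting during fitting; `KNeighborsClassifier` has no such parameter.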

4. **Final Conclusion**

Decision Tree and Random Forest record perfect accuracy (1.0000) on both the training and the test data, which on its face indicates very strong predictive ability. However, this result should be treated with caution: it may reflect overfitting, a test set with too little variation, or data leakage.

Logistic Regression and KNN give realistic accuracy (about 93–95%), but both struggle with the minority class: Logistic Regression never predicts it correctly, and KNN misses it completely before tuning.