# **1. Import Library**

In [None]:
# === Import all required libraries ===
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, MinMaxScaler, LabelEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, classification_report
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint

# **2. Loading the Dataset from the Clustering Results**

In [None]:
# Load the dataset
df = pd.read_csv('Dataset_inisiasi.csv')

# Display the first few rows
print("Preview of the dataset:")
print(df.head())

Preview of the dataset:
  provinsi    jenis     daerah   tahun    periode        gk        ump  \
0     ACEH  MAKANAN  PERKOTAAN  2015.0      MARET  293697.0  1900000.0   
1     ACEH  MAKANAN  PERKOTAAN  2015.0  SEPTEMBER  302128.0  1900000.0   
2     ACEH  MAKANAN  PERKOTAAN  2016.0      MARET  306243.0  2118500.0   
3     ACEH  MAKANAN  PERKOTAAN  2016.0  SEPTEMBER  319768.0  2118500.0   
4     ACEH  MAKANAN  PERDESAAN  2015.0      MARET  297479.0  1900000.0   

       peng     upah  Cluster  
0  466355.0  11226.0        0  
1  466355.0  11226.0        0  
2  548853.0  13627.0        0  
3  548853.0  13627.0        0  
4  395136.0  11226.0        0  


# **3. Data Splitting**

The Data Splitting stage separates the dataset into two parts: a training set and a test set.

Before splitting, let's scale the numeric features, because we will try the KNN algorithm, which is distance-based and needs features on a comparable scale.

In [None]:
df_processed = df.copy()

numeric_cols = df_processed.select_dtypes(include=['int64', 'float64']).columns.drop('Cluster')
categorical_cols = df_processed.select_dtypes(include='object').columns

scaler_std = StandardScaler()

# Standardize the numeric feature columns (zero mean, unit variance)
df_processed[numeric_cols] = scaler_std.fit_transform(df_processed[numeric_cols])
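
One thing to keep in mind: a scaler fitted on the full dataset also sees the rows that later become the test set. A minimal sketch of an alternative, assuming synthetic data from `make_classification` as a stand-in for the real features (the names `X_demo`, `y_demo`, and `knn_pipeline` are illustrative, not part of this notebook), wraps the scaler and KNN in a pipeline so the scaler is fitted on the training rows only:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the numeric features (assumption, not the notebook's data)
X_demo, y_demo = make_classification(n_samples=300, n_features=5, random_state=42)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

# The scaler lives inside the pipeline, so fit() only sees the training rows
knn_pipeline = make_pipeline(StandardScaler(), KNeighborsClassifier())
knn_pipeline.fit(X_tr, y_tr)

# The means stored by the scaler come from X_tr, not from the full dataset
fitted_scaler = knn_pipeline.named_steps["standardscaler"]
means_match = np.allclose(fitted_scaler.mean_, X_tr.mean(axis=0))
test_acc = knn_pipeline.score(X_te, y_te)
print("Scaler means match training means:", means_match)
print("Test accuracy:", test_acc)
```

The same pipeline object can be passed to a cross-validation routine, in which case the scaler is refitted inside every fold.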

Next, we encode the categorical columns as integers so that the models can process them.

In [None]:
# Identify categorical columns
categorical_cols = df_processed.select_dtypes(include='object').columns

# Apply Label Encoding to each categorical column
label_encoders = {}
for col in categorical_cols:
    # Use a fresh encoder per column; reusing one object would overwrite
    # its classes on every fit and break inverse_transform for earlier columns
    le = LabelEncoder()
    df_processed[col] = le.fit_transform(df_processed[col])
    label_encoders[col] = le  # store encoder if needed for inverse_transform
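
To see why it matters that each column has its own encoder, here is a small sketch on a toy frame. The category values are taken from the dataset preview, but the frame `toy` and the dictionary `encoders` are illustrative only:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy frame (assumption) using category values seen in the preview
toy = pd.DataFrame({
    "daerah": ["PERKOTAAN", "PERDESAAN", "PERKOTAAN"],
    "periode": ["MARET", "SEPTEMBER", "MARET"],
})

encoders = {}
encoded = toy.copy()
for col in toy.columns:
    enc = LabelEncoder()               # one encoder object per column
    encoded[col] = enc.fit_transform(toy[col])
    encoders[col] = enc

# Each encoder remembers the classes of its own column
daerah_classes = list(encoders["daerah"].classes_)
periode_classes = list(encoders["periode"].classes_)

# So the integer codes can be mapped back to the original labels
decoded_daerah = list(encoders["daerah"].inverse_transform(encoded["daerah"]))
print(daerah_classes)
print(periode_classes)
print(decoded_daerah)
```

Note that `LabelEncoder` assigns codes in sorted order, which implies an ordering that the categories do not really have; for linear or distance-based models, one-hot encoding may be worth trying instead.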

Since `provinsi` is highly correlated with the `Cluster` target, we drop it so the models cannot simply memorize which province belongs to which cluster.

In [None]:
# Drop the 'provinsi' column BEFORE splitting
df_processed = df_processed.drop(columns='provinsi')
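
One way to check this kind of relationship between a categorical column and the cluster label is a row-normalized contingency table. The sketch below uses a hypothetical toy frame (the rows are made up for illustration and are not taken from `Dataset_inisiasi.csv`):

```python
import pandas as pd

# Hypothetical toy data: every province falls entirely into one cluster
toy = pd.DataFrame({
    "provinsi": ["ACEH", "ACEH", "BALI", "BALI", "RIAU", "RIAU"],
    "Cluster": [0, 0, 1, 1, 0, 0],
})

# Share of each cluster within each province (rows sum to 1)
shares = pd.crosstab(toy["provinsi"], toy["Cluster"], normalize="index")
print(shares)

# If the largest share in every row is close to 1.0,
# the province alone almost determines the cluster
purity = shares.max(axis=1)
print(purity)
```

Running the same `pd.crosstab` call on the real `df` before encoding would show how strongly `provinsi` determines `Cluster` in this dataset.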

Next, we set the target column and split the data.

In [None]:
# Set target column
target_column = 'Cluster'

# Split the dataset into features and target
X = df_processed.drop(columns=target_column)
y = df_processed[target_column]

# Split into training and testing sets (70% train, 30% test)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Display the shape of the resulting datasets
print(f"Training set shape: {X_train.shape}")
print(f"Test set shape: {X_test.shape}")

Training set shape: (1430, 8)
Test set shape: (614, 8)
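
The `stratify=y` argument keeps the class proportions the same in both parts. A minimal sketch on hypothetical labels (the 70/30 class mix and the names `X_toy`, `y_toy` are assumptions for illustration):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 70 rows of class 0, 30 rows of class 1
y_toy = pd.Series([0] * 70 + [1] * 30)
X_toy = pd.DataFrame({"feature": range(100)})

X_a, X_b, y_a, y_b = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=42, stratify=y_toy
)

# Class shares in each part; with stratification they mirror the full data
train_share = y_a.value_counts(normalize=True).sort_index()
test_share = y_b.value_counts(normalize=True).sort_index()
print(train_share)
print(test_share)
```

Without `stratify`, the shares in the two parts can drift apart, which matters most when one class is rare.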


# **4. Building the Classification Models**


## **a. Building the Classification Models**

Now that the data is ready, let's build the models. Here I train several models at once so we can pick the best one.

In [None]:
# Initialize all models
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000, random_state=42),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(random_state=42),
    "K-Nearest Neighbors": KNeighborsClassifier(),
    "Naive Bayes": GaussianNB()
}

# Train the models
for name, model in models.items():
    model.fit(X_train, y_train)
    print(f"{name} model trained ✅")

Logistic Regression model trained ✅
Decision Tree model trained ✅
Random Forest model trained ✅
K-Nearest Neighbors model trained ✅
Naive Bayes model trained ✅


## **b. Evaluating the Classification Models**

Next, let's evaluate the models on the test set.

In [None]:
# Evaluate the models
for name, model in models.items():
    y_pred = model.predict(X_test)

    print(f"=== {name} ===")
    print("Accuracy:", accuracy_score(y_test, y_pred))
    print("Precision:", precision_score(y_test, y_pred, average='weighted', zero_division=0))
    print("Recall:", recall_score(y_test, y_pred, average='weighted'))
    print("F1 Score:", f1_score(y_test, y_pred, average='weighted'))
    print("Classification Report:\n", classification_report(y_test, y_pred, zero_division=0))
    print("-" * 40)

=== Logistic Regression ===
Accuracy: 0.6514657980456026
Precision: 0.6536025696183717
Recall: 0.6514657980456026
F1 Score: 0.6507546927188375
Classification Report:
               precision    recall  f1-score   support

           0       0.63      0.70      0.66       303
           1       0.67      0.60      0.64       311

    accuracy                           0.65       614
   macro avg       0.65      0.65      0.65       614
weighted avg       0.65      0.65      0.65       614

----------------------------------------
=== Decision Tree ===
Accuracy: 0.8990228013029316
Precision: 0.8990228013029316
Recall: 0.8990228013029316
F1 Score: 0.8990228013029316
Classification Report:
               precision    recall  f1-score   support

           0       0.90      0.90      0.90       303
           1       0.90      0.90      0.90       311

    accuracy                           0.90       614
   macro avg       0.90      0.90      0.90       614
weighted avg       0.90      0.9

✅ Top Performer: Decision Tree
Accuracy: 89.9%, just under 90%.

Precision, recall, and F1 are balanced across both classes, so the model is not favoring one cluster. But be careful: a Decision Tree is prone to overfitting, especially if its depth is not limited.

🥈 Runner-Up: Random Forest
Accuracy: 84.0%

Because it averages many trees, Random Forest usually generalizes better than a single Decision Tree, so it tends to be more reliable on unseen data.

Precision and recall are balanced, which is a good sign.

😐 Mid-tier: Logistic Regression & KNN
Logistic Regression: 65% accuracy, still usable as a baseline.

KNN: 62% accuracy, a bit low; it might improve by tuning n_neighbors.

🚫 Naive Bayes
Accuracy: 59.4%, probably not suitable for this data.
A likely reason is that the features do not satisfy the independence assumption that Naive Bayes relies on.

The Decision Tree has the highest test accuracy, but because a single tree with unlimited depth risks overfitting, I will choose Random Forest.
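
A quick way to look for overfitting in a tree is to compare training and test accuracy, with and without a depth limit. This is only a sketch on synthetic data from `make_classification` (an assumption; it does not reproduce the numbers of this notebook), and `max_depth=4` is an arbitrary illustrative value:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, slightly noisy data as a stand-in for the real features
X_demo, y_demo = make_classification(
    n_samples=600, n_features=8, n_informative=4, flip_y=0.1, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

# Compare an unlimited tree with a depth-limited one
gaps = {}
for depth in [None, 4]:
    tree = DecisionTreeClassifier(max_depth=depth, random_state=42)
    tree.fit(X_tr, y_tr)
    train_acc = tree.score(X_tr, y_tr)
    test_acc = tree.score(X_te, y_te)
    gaps[depth] = (train_acc, test_acc)
    print(f"max_depth={depth}: train={train_acc:.3f}, test={test_acc:.3f}")
```

A large gap between training and test accuracy suggests the tree is memorizing the training rows. On the real data, the same loop could be run with `X_train`, `X_test`, `y_train`, and `y_test`.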

I think we can still improve its accuracy, so let's do hyperparameter tuning.

## **c. Tuning the Classification Model (Optional)**

Here I use randomized search for the tuning.

In [None]:
# Initialize the base model
rf = RandomForestClassifier(random_state=42)

# Define the hyperparameter search space
param_dist = {
    'n_estimators': [100, 200, 300, 400, 500],
    'max_depth': [5, 10, 20, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'max_features': ['sqrt', 'log2'],
    'bootstrap': [True, False]
}
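
The search space above uses fixed lists. Since `randint` from `scipy.stats` is already imported, an alternative sketch is to sample integer hyperparameters from ranges instead. The dictionary `param_dist_randint` below is hypothetical and is not the space used in this notebook; `ParameterSampler` is used only to preview what randomized search would draw:

```python
from scipy.stats import randint
from sklearn.model_selection import ParameterSampler

# Hypothetical alternative search space with integer ranges
# (randint's upper bound is exclusive)
param_dist_randint = {
    "n_estimators": randint(100, 501),     # 100 to 500
    "max_depth": [5, 10, 20, None],        # lists are still allowed
    "min_samples_split": randint(2, 11),   # 2 to 10
    "min_samples_leaf": randint(1, 5),     # 1 to 4
}

# Preview a few candidates, drawn the way RandomizedSearchCV draws them
samples = list(ParameterSampler(param_dist_randint, n_iter=5, random_state=42))
for candidate in samples:
    print(candidate)
```

With ranges, the search is not restricted to a handful of preselected values, which can help when it is unclear in advance which values are sensible.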

## **d. Evaluating the Classification Model after Tuning (Optional)**

Next, let's run the search and evaluate the result after tuning.

This process may take a while, since it fits 50 candidates with 5-fold cross-validation.

In [None]:
# Randomized Search CV
rf_random = RandomizedSearchCV(
    estimator=rf,
    param_distributions=param_dist,
    n_iter=50,
    cv=5,
    verbose=2,
    n_jobs=-1,
    scoring='accuracy',
    random_state=42
)

# Fit on the training data
rf_random.fit(X_train, y_train)

# Show the best parameters and the best mean cross-validation accuracy
print("Best parameters:", rf_random.best_params_)
print("Best accuracy:", rf_random.best_score_)

Fitting 5 folds for each of 50 candidates, totalling 250 fits
Best parameters: {'n_estimators': 200, 'min_samples_split': 10, 'min_samples_leaf': 1, 'max_features': 'log2', 'max_depth': None, 'bootstrap': False}
Best accuracy: 0.8664335664335665


In [None]:
best_rf = rf_random.best_estimator_
y_pred = best_rf.predict(X_test)

print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.89      0.92      0.90       303
           1       0.92      0.88      0.90       311

    accuracy                           0.90       614
   macro avg       0.90      0.90      0.90       614
weighted avg       0.90      0.90      0.90       614



## **e. Analysis of the Classification Model Evaluation Results**

Comparison of Evaluation Results (Before & After Tuning)

| Model | Accuracy | Precision | Recall | F1 Score |
|---|---|---|---|---|
| Random Forest (Before tuning) | 84.0% | 0.84 | 0.84 | 0.84 |
| Random Forest (After tuning) | 90.0% | 0.89–0.92 | 0.88–0.92 | 0.90 |

---

After tuning, Random Forest improved on every metric: accuracy rose from 84.0% to 90.0%, and precision, recall, and F1 rose from 0.84 to about 0.90.

Not only is it more accurate, but it also handles both classes (class 0 and 1) in a balanced way.

---

Model Weakness Identification

📊 Classification Result Report:

Cluster 0: Precision 0.89, Recall 0.92

Cluster 1: Precision 0.92, Recall 0.88

---
🧠 Interpretation:

The model is slightly better at recognizing class 0 than class 1, judging from the slightly lower recall of class 1 (0.88).

However, this is still within a very good range and does not show extreme discrepancies.
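
Per-class recall can be read directly from a confusion matrix, which also shows in which direction the errors go. A minimal sketch on hypothetical labels (the lists `y_true_toy` and `y_pred_toy` are made up; they are not this model's predictions):

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted labels for illustration
y_true_toy = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred_toy = [0, 0, 0, 1, 1, 1, 0, 0]

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true_toy, y_pred_toy, labels=[0, 1])
tn, fp, fn, tp = cm.ravel()

# Recall per class: correct predictions divided by the true count of that class
recall_class0 = tn / (tn + fp)
recall_class1 = tp / (tp + fn)
print(cm)
print("Recall class 0:", recall_class0)
print("Recall class 1:", recall_class1)
```

On the real model, `confusion_matrix(y_test, y_pred)` would show how many class 1 rows are being predicted as class 0.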

🎯 Overfitting / Underfitting?
Accuracy on cross-validation (best score): 86.6%

Accuracy on test set: 90%

---
**Conclusion**: Test accuracy is not lower than the cross-validation accuracy, so there is no sign of overfitting. The model looks quite stable and is likely to generalize well.
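
The comparison above can be sketched in code by putting cross-validation accuracy on the training set next to accuracy on the held-out test set. This uses synthetic data and an untuned forest with `n_estimators=50` (both are assumptions for illustration, not the tuned model from this notebook):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in data
X_demo, y_demo = make_classification(
    n_samples=400, n_features=8, n_informative=4, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

forest = RandomForestClassifier(n_estimators=50, random_state=42)

# 5-fold cross-validation accuracy, computed on the training set only
cv_scores = cross_val_score(forest, X_tr, y_tr, cv=5, scoring="accuracy")

# Accuracy on the held-out test set after fitting on all training rows
forest.fit(X_tr, y_tr)
test_acc = forest.score(X_te, y_te)

print("Mean CV accuracy:", cv_scores.mean())
print("Test accuracy:", test_acc)
```

If the test accuracy were clearly below the cross-validation mean, that would be a hint that the model or the tuning had adapted too closely to the training data.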

---

Recommended Further Actions


To go deeper:

🔍 Try ensemble voting or stacking classifier techniques.

🔍 Try SMOTE or class weighting approaches in case the data becomes imbalanced.

🧪 Experiment with XGBoost or LightGBM, which often outperform Random Forest on structured (tabular) data.
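
As an illustration of the first suggestion, here is a minimal soft-voting sketch with scikit-learn's `VotingClassifier`. The data is synthetic and the choice of base models and their settings is an assumption, not a tested recommendation for this dataset:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data
X_demo, y_demo = make_classification(
    n_samples=400, n_features=8, n_informative=4, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

# Soft voting averages the predicted class probabilities of the base models
voting = VotingClassifier(
    estimators=[
        ("logreg", LogisticRegression(max_iter=1000)),
        ("tree", DecisionTreeClassifier(max_depth=5, random_state=42)),
        ("forest", RandomForestClassifier(n_estimators=50, random_state=42)),
    ],
    voting="soft",
)
voting.fit(X_tr, y_tr)
voting_acc = voting.score(X_te, y_te)
print("Voting ensemble test accuracy:", voting_acc)
```

`StackingClassifier` accepts the same `estimators` list and adds a final model that learns how to combine the base predictions.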

Preventive measures:

Make sure the data cleaning pipeline is solid.

Keep adding data where possible (more data usually makes the model more stable).