# Prediction

This notebook predicts creditworthiness using three trained models: Logistic Regression, Random Forest, and Gradient Boosting. The process consists of the following steps.
1. Read the processed dataset (`test_processed.csv`), which contains loan applicants ready for scoring.
2. Score every row in the dataset with all three models, producing a probability and a final classification (Lancar, i.e. performing, or Gagal Bayar, i.e. default).
3. Save the full prediction results and a summary of them to `outputs/hasil_prediksi/`.
4. Save a histogram of the score distribution for each model as a .png file in `outputs/hasil_prediksi/`.


## Setup and Paths

In [1]:
# === Setup Path & Load Data ===
from pathlib import Path
import json

import joblib
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

BASE_DIR = Path('.')
DATA_PATH = BASE_DIR / 'data' / 'dataset_hasil_data_processing' / 'test_processed.csv'
MODELS_DIR = BASE_DIR / 'outputs' / 'models'
OUT_DIR = BASE_DIR / 'outputs' / 'hasil_prediksi'
OUT_DIR.mkdir(parents=True, exist_ok=True)

df_test_full = pd.read_csv(DATA_PATH)
print('Loaded test_processed.csv:', df_test_full.shape)

# Keep ID columns, if any. The substring test also matches names such as 'paid' or 'valid'.
id_cols = [c for c in df_test_full.columns if ('id' in c.lower()) or c.upper().startswith('SK_ID')]
ids_df = df_test_full[id_cols] if id_cols else None

# Detect a label column if present; prediction itself does not need it
label_candidates = ['TARGET','target','label','LABEL','y']
label_col = next((c for c in label_candidates if c in df_test_full.columns), None)
# Note: ID columns are not dropped here, so numeric ones remain among the features
X_test = df_test_full.drop(columns=[label_col]) if label_col else df_test_full.copy()

# Use numeric features only (safe for most pipelines)
X_test_num = X_test.select_dtypes(include=[np.number]).copy()
feature_cols = list(X_test_num.columns)
print('Feature columns used (numeric):', len(feature_cols))

Loaded test_processed.csv: (48744, 18)
Feature columns used (numeric): 18
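
Selecting every numeric column works only if those columns match what the models saw during training. The sketch below shows one way to guard against a mismatch, assuming the saved models are scikit-learn estimators fitted on a DataFrame, in which case they expose `feature_names_in_`. The helper `align_features`, the `DummyModel` class, and the column names are hypothetical and used only for illustration.

```python
import pandas as pd


def align_features(model, X: pd.DataFrame) -> pd.DataFrame:
    """Reorder X to the column order the model recorded at fit time, if any."""
    expected = getattr(model, 'feature_names_in_', None)
    if expected is None:
        return X  # the model recorded no feature names; nothing to check
    missing = [c for c in expected if c not in X.columns]
    if missing:
        raise ValueError(f'Missing feature columns: {missing}')
    # Keep only the expected columns, in the expected order
    return X.loc[:, list(expected)]


class DummyModel:
    """Hypothetical stand-in for a fitted estimator."""
    feature_names_in_ = ['income', 'age']


X = pd.DataFrame({'age': [30, 45], 'income': [5.0, 7.5], 'extra': [1, 0]})
X_aligned = align_features(DummyModel(), X)
print(list(X_aligned.columns))
```

In this notebook the call would look like `align_features(mdl, X_test_num)` before scoring, provided the assumption about `feature_names_in_` holds for the saved models.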


## Function Definitions

In [2]:
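def load_model_and_threshold(model_name: str):
    """Load a model and its decision threshold from MODELS_DIR/<model_name>/ (threshold defaults to 0.5)."""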
    mdir = MODELS_DIR / model_name
    model = joblib.load(mdir / 'model.pkl')
    thr_path = mdir / 'threshold.json'
    thr = 0.5
    if thr_path.exists():
        with open(thr_path, 'r') as f:
            thr = float(json.load(f).get('threshold', 0.5))
    return model, thr

def score_and_decide(model, X, thr: float):
    """Return the positive-class probability, the 0/1 decision at `thr`, and the text label."""
    proba = model.predict_proba(X)[:, 1]
    pred = (proba >= thr).astype(int)
    label = np.where(pred==1, 'Gagal Bayar', 'Lancar')
    return proba, pred, label
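
The two functions above rely on a `threshold.json` file of the form `{"threshold": <value>}` and on a simple `proba >= thr` rule. The following is a minimal, self-contained sketch of that behaviour without any saved model. The helper names `read_threshold` and `decide` are hypothetical; the scores and the threshold 0.488968 are taken from the logistic regression output further down in this notebook.

```python
import json
import tempfile
from pathlib import Path

import numpy as np


def read_threshold(path: Path, default: float = 0.5) -> float:
    """Return the 'threshold' value stored in a JSON file, or `default` if the file is missing."""
    if not path.exists():
        return default
    with open(path, 'r') as f:
        return float(json.load(f).get('threshold', default))


def decide(proba: np.ndarray, thr: float) -> np.ndarray:
    """Label scores at or above the threshold 'Gagal Bayar', the rest 'Lancar'."""
    return np.where(proba >= thr, 'Gagal Bayar', 'Lancar')


with tempfile.TemporaryDirectory() as tmp:
    thr_path = Path(tmp) / 'threshold.json'
    missing_thr = read_threshold(thr_path)  # file absent: falls back to the default
    thr_path.write_text(json.dumps({'threshold': 0.488968}))
    tuned_thr = read_threshold(thr_path)    # file present: uses the stored value

scores = np.array([0.312115, 0.453781, 0.626048])
labels = decide(scores, tuned_thr)
print(missing_thr, tuned_thr, labels.tolist())
```

Because the comparison is `>=`, a score exactly equal to the threshold is classified as Gagal Bayar.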

## Predicting on `test_processed.csv` with the Three Models

In [3]:
models = {
    'logistic_regression': load_model_and_threshold('logistic_regression'),
    'random_forest': load_model_and_threshold('random_forest'),
    'gradient_boosting': load_model_and_threshold('gradient_boosting')
}

out_df = ids_df.copy() if ids_df is not None else pd.DataFrame(index=X_test_num.index)
for name, (mdl, thr) in models.items():
    proba, pred, label = score_and_decide(mdl, X_test_num, thr)
    out_df[f'score_{name}'] = proba
    out_df[f'pred_{name}'] = label
    out_df[f'threshold_{name}'] = thr

out_path_dataset = OUT_DIR / 'prediksi_data_from_dataset.csv'
out_df.to_csv(out_path_dataset, index=False)
print('Saved dataset predictions:', out_path_dataset)
out_df.head()

Saved dataset predictions: outputs\hasil_prediksi\prediksi_data_from_dataset.csv


| | score_logistic_regression | pred_logistic_regression | threshold_logistic_regression | score_random_forest | pred_random_forest | threshold_random_forest | score_gradient_boosting | pred_gradient_boosting | threshold_gradient_boosting |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.312115 | Lancar | 0.488968 | 0.349478 | Lancar | 0.440962 | 0.329597 | Lancar | 0.439825 |
| 1 | 0.453781 | Lancar | 0.488968 | 0.410233 | Lancar | 0.440962 | 0.427936 | Lancar | 0.439825 |
| 2 | 0.30568 | Lancar | 0.488968 | 0.20095 | Lancar | 0.440962 | 0.17051 | Lancar | 0.439825 |
| 3 | 0.248563 | Lancar | 0.488968 | 0.326847 | Lancar | 0.440962 | 0.258208 | Lancar | 0.439825 |
| 4 | 0.626048 | Gagal Bayar | 0.488968 | 0.609612 | Gagal Bayar | 0.440962 | 0.529178 | Gagal Bayar | 0.439825 |
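
In the five rows shown above, the three models assign the same label to every applicant. A quick way to measure this over a whole prediction table is sketched below. It is an optional check rather than part of the original notebook, and the helper `agreement_rate` is hypothetical. The sample frame reproduces only the label columns of the five rows above.

```python
import pandas as pd

# Labels from the first five rows of the prediction table
labels_head = ['Lancar', 'Lancar', 'Lancar', 'Lancar', 'Gagal Bayar']
sample = pd.DataFrame({
    'pred_logistic_regression': labels_head,
    'pred_random_forest': labels_head,
    'pred_gradient_boosting': labels_head,
})


def agreement_rate(df: pd.DataFrame, pred_cols: list) -> float:
    """Fraction of rows in which every model assigns the same label."""
    unanimous = df[pred_cols].nunique(axis=1) == 1
    return float(unanimous.mean())


pred_cols = [c for c in sample.columns if c.startswith('pred_')]
rate = agreement_rate(sample, pred_cols)
print(f'Unanimous rows: {rate:.0%}')
```

Applied to the full table, this would be `agreement_rate(out_df, pred_cols)` with the same `pred_` column filter.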


## Building the Prediction Summary and Histograms

In [4]:
# === Summary & Histograms ===
summary_stats = {}
model_list = ['logistic_regression', 'random_forest', 'gradient_boosting']
for model in model_list:
    col_pred = f'pred_{model}'
    counts = out_df[col_pred].value_counts()
    total = counts.sum()
    pct = (counts / total * 100).round(2)
    summary_stats[model] = {
        'Lancar (%)': pct.get('Lancar', 0.0),
        'Gagal Bayar (%)': pct.get('Gagal Bayar', 0.0),
        'Total Data': int(total)
    }
summary_df = pd.DataFrame(summary_stats).T
print('=== RINGKASAN PREDIKSI DATASET ===')
print(summary_df)

summary_csv = OUT_DIR / 'ringkasan_prediksi_dataset.csv'
summary_df.to_csv(summary_csv)
print('Saved ringkasan:', summary_csv)

# Score histogram per model
for model in model_list:
    plt.figure(figsize=(6,4))
    plt.hist(out_df[f'score_{model}'], bins=30)
    plt.title(f'Distribusi Probabilitas Gagal Bayar - {model.replace("_"," ").title()}')
    plt.xlabel('Probabilitas Gagal Bayar')
    plt.ylabel('Frekuensi')
    img_path = OUT_DIR / f'histogram_{model}.png'
    plt.savefig(img_path, bbox_inches='tight')
    plt.close()
    print('Saved histogram:', img_path)

=== RINGKASAN PREDIKSI DATASET ===
                     Lancar (%)  Gagal Bayar (%)  Total Data
logistic_regression       63.23            36.77     48744.0
random_forest             57.25            42.75     48744.0
gradient_boosting         63.28            36.72     48744.0
Saved ringkasan: outputs\hasil_prediksi\ringkasan_prediksi_dataset.csv
Saved histogram: outputs\hasil_prediksi\histogram_logistic_regression.png
Saved histogram: outputs\hasil_prediksi\histogram_random_forest.png
Saved histogram: outputs\hasil_prediksi\histogram_gradient_boosting.png
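
The `Total Data` column prints as `48744.0` because `pd.DataFrame(summary_stats).T` first builds one column per model, and each such column mixes percentages with the integer count, so pandas stores it as float. The sketch below shows an alternative construction that keeps the count as an integer. The two-model frame and the names `model_a` and `model_b` are made-up illustration data, not results from this notebook.

```python
import pandas as pd

# Hypothetical predictions for two models (illustration only)
preds = pd.DataFrame({
    'pred_model_a': ['Lancar', 'Lancar', 'Gagal Bayar', 'Lancar', 'Lancar'],
    'pred_model_b': ['Lancar', 'Gagal Bayar', 'Gagal Bayar', 'Lancar', 'Lancar'],
})

rows = {}
for col in preds.columns:
    name = col[len('pred_'):]
    # normalize=True yields class shares directly, without dividing by the total
    share = preds[col].value_counts(normalize=True) * 100
    rows[name] = {
        'Lancar (%)': round(float(share.get('Lancar', 0.0)), 2),
        'Gagal Bayar (%)': round(float(share.get('Gagal Bayar', 0.0)), 2),
        'Total Data': int(preds[col].size),
    }

# orient='index' makes each model a row, so every column keeps its own dtype
summary = pd.DataFrame.from_dict(rows, orient='index')
print(summary)
print(summary.dtypes)
```

With this layout the percentage columns stay float while `Total Data` is stored as an integer column.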
