{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# TUGAS2_KELOMPOK3: Optimized Logistic (Binary) & Softmax (Multiclass)\n",
    "\n",
    "This notebook tidies up the pipeline so that it is **leak-free, reproducible, and more robustly evaluated**.\n",
    "\n",
    "## What changed\n",
    "- Use **Pipeline + ColumnTransformer** (imputation + scaling + encoding) so preprocessing is fitted on training data only → prevents data leakage.\n",
    "- **Stratified split + Stratified K-Fold CV** for stable metrics.\n",
    "- **class_weight=\"balanced\"** on the binary logistic model (handles class imbalance).\n",
    "- **Threshold tuning** (binary) via the precision-recall curve to maximize F1.\n",
    "- **Normalized confusion matrix** and a per-class **classification_report**.\n",
    "\n",
    "> **Quick start:** Set the variables in *Config* (path and target column names). If you only have one of the targets (binary or multiclass), leave the other as `None`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# === Environment info (optional) ===\n",
    "import sys, sklearn, numpy, pandas\n",
    "print(sys.version)\n",
    "print(\"sklearn\", sklearn.__version__, \"numpy\", numpy.__version__, \"pandas\", pandas.__version__)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# === Config (EDIT TO MATCH YOUR DATA) ===\n",
    "DATASET_PATH = 'data.csv'   # change to the path of your dataset\n",
    "TARGET_BINARY = None        # e.g. 'label_bin' (leave None if not available)\n",
    "POSITIVE_CLASS = None       # e.g. 1 or 'yes' (optional; for reporting only)\n",
    "TARGET_MULTICLASS = None    # e.g. 'label_mc' (leave None if not available)\n",
    "ID_COLUMNS = []             # e.g. ['id', 'timestamp'] to exclude them from the features\n",
    "RANDOM_STATE = 42\n",
    "TEST_SIZE = 0.2\n",
    "N_SPLITS = 5"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# === Main imports ===\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "import matplotlib.pyplot as plt\n",
    "from sklearn.model_selection import train_test_split, StratifiedKFold, cross_validate\n",
    "from sklearn.preprocessing import StandardScaler, OneHotEncoder\n",
    "from sklearn.compose import ColumnTransformer\n",
    "from sklearn.pipeline import Pipeline\n",
    "from sklearn.impute import SimpleImputer\n",
    "from sklearn.linear_model import LogisticRegression\n",
    "from sklearn.metrics import (classification_report, confusion_matrix, ConfusionMatrixDisplay,\n",
    "                             precision_recall_curve, f1_score)\n",
    "from sklearn.utils.class_weight import compute_class_weight\n",
    "import warnings; warnings.filterwarnings('ignore')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# === Load data & quick audit ===\n",
    "df = pd.read_csv(DATASET_PATH)\n",
    "print('Shape:', df.shape)\n",
    "display(df.head())\n",
    "\n",
    "# Missing values & duplicates\n",
    "missing = df.isna().mean().sort_values(ascending=False)\n",
    "dup_count = df.duplicated().sum()\n",
    "print('Missing ratio (top 10):')\n",
    "display(missing.head(10).to_frame('missing_ratio'))\n",
    "print(f'Duplicates: {dup_count}')\n",
    "\n",
    "# Drop exact duplicates\n",
    "if dup_count > 0:\n",
    "    df = df.drop_duplicates().reset_index(drop=True)\n",
    "    print('Duplicates dropped. New shape:', df.shape)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# === Utility: build the numeric & categorical preprocessor ===\n",
    "def build_preprocessor(X_df):\n",
    "    # Column lists are inferred from dtypes; nothing is fitted here\n",
    "    num_cols = X_df.select_dtypes(include=np.number).columns.tolist()\n",
    "    cat_cols = X_df.select_dtypes(exclude=np.number).columns.tolist()\n",
    "\n",
    "    num_pipe = Pipeline([\n",
    "        ('imp', SimpleImputer(strategy='median')),\n",
    "        ('scale', StandardScaler())\n",
    "    ])\n",
    "    cat_pipe = Pipeline([\n",
    "        ('imp', SimpleImputer(strategy='most_frequent')),\n",
    "        ('ohe', OneHotEncoder(handle_unknown='ignore'))\n",
    "    ])\n",
    "\n",
    "    pre = ColumnTransformer([\n",
    "        ('num', num_pipe, num_cols),\n",
    "        ('cat', cat_pipe, cat_cols)\n",
    "    ])\n",
    "    return pre, num_cols, cat_cols"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## A. Binary Logistic Regression (optional)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.model_selection import cross_val_predict\n",
    "\n",
    "if TARGET_BINARY is not None and TARGET_BINARY in df.columns:\n",
    "    # Build X/y and drop ID columns, if any\n",
    "    cols = [c for c in df.columns if c != TARGET_BINARY and c not in ID_COLUMNS]\n",
    "    Xb = df[cols].copy()\n",
    "    yb = df[TARGET_BINARY].copy()\n",
    "\n",
    "    # Stratified split\n",
    "    Xb_train, Xb_test, yb_train, yb_test = train_test_split(\n",
    "        Xb, yb, test_size=TEST_SIZE, stratify=yb, random_state=RANDOM_STATE\n",
    "    )\n",
    "\n",
    "    pre_b, num_b, cat_b = build_preprocessor(Xb_train)\n",
    "\n",
    "    # class_weight='balanced' to counter class imbalance\n",
    "    logreg_bin = Pipeline([\n",
    "        ('pre', pre_b),\n",
    "        ('clf', LogisticRegression(\n",
    "            solver='lbfgs', max_iter=1000, class_weight='balanced', random_state=RANDOM_STATE\n",
    "        ))\n",
    "    ])\n",
    "\n",
    "    # Cross-validation on the training split only, so the test set stays unseen\n",
    "    skf = StratifiedKFold(n_splits=N_SPLITS, shuffle=True, random_state=RANDOM_STATE)\n",
    "    scoring = {'acc': 'accuracy', 'f1_macro': 'f1_macro', 'roc_auc': 'roc_auc'}\n",
    "    cv_bin = cross_validate(logreg_bin, Xb_train, yb_train, cv=skf, scoring=scoring, n_jobs=-1)\n",
    "    print('CV Binary (mean):', {k: float(np.mean(v)) for k, v in cv_bin.items() if k.startswith('test_')})\n",
    "\n",
    "    # Out-of-fold training probabilities, used below to tune the threshold\n",
    "    # without looking at the test set\n",
    "    oof_proba = cross_val_predict(logreg_bin, Xb_train, yb_train, cv=skf, method='predict_proba')[:, 1]\n",
    "\n",
    "    # Final fit\n",
    "    logreg_bin.fit(Xb_train, yb_train)\n",
    "    neg_label, pos_label = logreg_bin.classes_  # predict_proba column 1 is classes_[1]\n",
    "    yb_pred = logreg_bin.predict(Xb_test)\n",
    "    print('Report (Binary):')\n",
    "    print(classification_report(yb_test, yb_pred, digits=3, zero_division=0))\n",
    "\n",
    "    # Threshold tuning for F1 (on out-of-fold training predictions)\n",
    "    p, r, th = precision_recall_curve(yb_train, oof_proba, pos_label=pos_label)\n",
    "    f1s = 2 * p * r / (p + r + 1e-12)\n",
    "    best_idx = int(np.argmax(f1s[:-1]))\n",
    "    best_thr = float(th[best_idx]) if len(th) > 0 else 0.5\n",
    "    print(f'Best threshold for F1 (out-of-fold): {best_thr:.4f} (F1={float(f1s[best_idx]):.3f})')\n",
    "\n",
    "    # Apply the tuned threshold to the test set, mapping back to the original labels\n",
    "    yb_proba = logreg_bin.predict_proba(Xb_test)[:, 1]\n",
    "    yb_opt = np.where(yb_proba >= best_thr, pos_label, neg_label)\n",
    "    print('@ best threshold:')\n",
    "    print(classification_report(yb_test, yb_opt, digits=3, zero_division=0))\n",
    "\n",
    "    # Normalized confusion matrix (drawn on one explicit axis to avoid an empty extra figure)\n",
    "    fig, ax = plt.subplots(figsize=(4, 4))\n",
    "    cm = confusion_matrix(yb_test, yb_opt, labels=logreg_bin.classes_, normalize='true')\n",
    "    disp = ConfusionMatrixDisplay(cm, display_labels=logreg_bin.classes_)\n",
    "    disp.plot(ax=ax, values_format='.2f')\n",
    "    ax.set_title('Confusion Matrix (Binary, normalized)')\n",
    "    plt.show()\n",
    "else:\n",
    "    print('Skipping the Binary section: TARGET_BINARY is not set or is not a column.')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## B. Multiclass Softmax (Multinomial Logistic Regression) (optional)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "if TARGET_MULTICLASS is not None and TARGET_MULTICLASS in df.columns:\n",
    "    # Build X/y and drop ID columns, if any\n",
    "    cols = [c for c in df.columns if c != TARGET_MULTICLASS and c not in ID_COLUMNS]\n",
    "    Xm = df[cols].copy()\n",
    "    ym = df[TARGET_MULTICLASS].copy()\n",
    "\n",
    "    Xm_train, Xm_test, ym_train, ym_test = train_test_split(\n",
    "        Xm, ym, test_size=TEST_SIZE, stratify=ym, random_state=RANDOM_STATE\n",
    "    )\n",
    "\n",
    "    pre_m, num_m, cat_m = build_preprocessor(Xm_train)\n",
    "\n",
    "    # With solver='lbfgs', LogisticRegression fits a multinomial (softmax) model for\n",
    "    # multiclass targets by default (scikit-learn >= 0.22). The multi_class argument\n",
    "    # is deprecated in recent scikit-learn releases, so it is omitted here.\n",
    "    softmax_clf = Pipeline([\n",
    "        ('pre', pre_m),\n",
    "        ('clf', LogisticRegression(\n",
    "            solver='lbfgs', max_iter=1000, random_state=RANDOM_STATE\n",
    "        ))\n",
    "    ])\n",
    "\n",
    "    # Cross-validation on the training split only, so the test set stays unseen\n",
    "    skf = StratifiedKFold(n_splits=N_SPLITS, shuffle=True, random_state=RANDOM_STATE)\n",
    "    cv_mc = cross_validate(softmax_clf, Xm_train, ym_train, cv=skf,\n",
    "                           scoring={'acc': 'accuracy', 'f1_macro': 'f1_macro'}, n_jobs=-1)\n",
    "    print('CV Multiclass (mean):', {k: float(np.mean(v)) for k, v in cv_mc.items() if k.startswith('test_')})\n",
    "\n",
    "    softmax_clf.fit(Xm_train, ym_train)\n",
    "    ym_pred = softmax_clf.predict(Xm_test)\n",
    "    print('Report (Multiclass):')\n",
    "    print(classification_report(ym_test, ym_pred, digits=3, zero_division=0))\n",
    "\n",
    "    # Normalized confusion matrix (drawn on one explicit axis to avoid an empty extra figure)\n",
    "    fig, ax = plt.subplots(figsize=(4, 4))\n",
    "    cm = confusion_matrix(ym_test, ym_pred, labels=softmax_clf.classes_, normalize='true')\n",
    "    disp = ConfusionMatrixDisplay(cm, display_labels=softmax_clf.classes_)\n",
    "    disp.plot(ax=ax, values_format='.2f')\n",
    "    ax.set_title('Confusion Matrix (Multiclass, normalized)')\n",
    "    plt.show()\n",
    "else:\n",
    "    print('Skipping the Multiclass section: TARGET_MULTICLASS is not set or is not a column.')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## C. Save artifacts (optional)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Save a simple metrics summary (only for the sections whose variables exist)\n",
    "from pathlib import Path\n",
    "out = {}\n",
    "try:\n",
    "    out['binary_report_bestF1'] = classification_report(yb_test, yb_opt, output_dict=True)\n",
    "except NameError:  # the binary section was skipped\n",
    "    pass\n",
    "try:\n",
    "    out['multiclass_report'] = classification_report(ym_test, ym_pred, output_dict=True)\n",
    "except NameError:  # the multiclass section was skipped\n",
    "    pass\n",
    "\n",
    "if out:\n",
    "    Path('artifacts').mkdir(exist_ok=True)\n",
    "    # Save each report to its own CSV file\n",
    "    for k, v in out.items():\n",
    "        df_rep = pd.DataFrame(v).transpose()\n",
    "        df_rep.to_csv(f'artifacts/{k}.csv')\n",
    "    print('Saved metrics to artifacts/*.csv')\n",
    "else:\n",
    "    print('No artifacts were saved (some sections may have been skipped).')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "**Note**: If your dataset has date/time features, consider a *time-based split* instead of a plain stratified split to avoid temporal leakage."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "name": "python",
   "version": "3.10"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}