# Assignment 5
### Student ID: 67130701711
### Name: ศุภวิชญ์ เฟื่องน้อย
### Program: SED

## Imbalanced Data Classification & Model Deployment

- Understand the challenges of imbalanced classification.
- Train different models with various resampling techniques.
- Compare model performance using ROC and PR curves.
- Deploy the best-performing model using Streamlit.


## **1. Install Required Libraries**

Ensure you have the necessary libraries installed:

pip install imbalanced-learn scikit-learn matplotlib seaborn streamlit


In [1]:
pip install imbalanced-learn scikit-learn matplotlib seaborn streamlit

Note: you may need to restart the kernel to use updated packages.


In [2]:
import streamlit as st
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

## **2. Load and Explore the Dataset**  
Select a dataset from [`imbalanced-learn datasets`](https://imbalanced-learn.org/stable/datasets/index.html). For example, `fetch_datasets()` returns a dictionary of benchmark datasets keyed by name.

In [4]:
from imblearn.datasets import fetch_datasets
import pandas as pd

# Load an imbalanced dataset (modify as needed)
dataset = fetch_datasets()['scene']  # 'scene' is used here; other keys such as 'wine_quality' also work
X, y = dataset.data, dataset.target

# Convert to DataFrame
df = pd.DataFrame(X, columns=[f"Feature_{i}" for i in range(X.shape[1])])
df['Target'] = y

# Check class distribution
print(df['Target'].value_counts())

Target
-1    2230
 1     177
Name: count, dtype: int64
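To make the imbalance concrete, the short sketch below reuses the class counts printed above and computes the majority-to-minority ratio, along with the accuracy a model would get by always predicting the majority class. The variable names are illustrative and not part of the assignment code.

```python
# Class counts copied from the value_counts() output above
counts = {-1: 2230, 1: 177}

total = sum(counts.values())
majority = max(counts.values())
minority = min(counts.values())

# How many majority samples exist for each minority sample
imbalance_ratio = majority / minority

# Accuracy of a trivial classifier that always predicts the majority class
majority_baseline_accuracy = majority / total

print(f"Imbalance ratio: {imbalance_ratio:.1f}:1")
print(f"Majority-class baseline accuracy: {majority_baseline_accuracy:.3f}")
```

With roughly 12.6 negatives per positive, a model can reach about 93% accuracy without detecting a single positive sample, which is why accuracy alone is a weak metric here.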


In [5]:
df['Target'].unique()

array([ 1, -1], dtype=int64)

## **3. Train-Test Split**  

In [7]:
from sklearn.model_selection import train_test_split

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y, random_state=42)


In [8]:
X_train.shape

(1684, 294)

In [9]:
X_test.shape

(723, 294)

## **4. Train Models**  

### **4.1 Baseline Model (Logistic Regression)**

In [13]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# Train a baseline model (Logistic Regression)
base_model = LogisticRegression(random_state=42, max_iter=1000)
base_model.fit(X_train, y_train)

# Predictions
y_pred = base_model.predict(X_test)
print("Baseline Model (Logistic Regression) Classification Report:")
print(classification_report(y_test, y_pred))


Baseline Model (Logistic Regression) Classification Report:
              precision    recall  f1-score   support

          -1       0.93      0.99      0.96       670
           1       0.45      0.09      0.16        53

    accuracy                           0.93       723
   macro avg       0.69      0.54      0.56       723
weighted avg       0.90      0.93      0.90       723
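
The report above can be read more easily with the underlying formulas in view. The sketch below uses hypothetical confusion-matrix counts for class `1`; they were chosen only because they are consistent with the rounded figures in the report, and the actual counts were not printed in this notebook.

```python
# Hypothetical counts for the positive class (label 1).
# Assumption: picked to match the rounded report above, not taken from a real confusion matrix.
tp, fp, fn, tn = 5, 6, 48, 664

precision = tp / (tp + fp)   # share of predicted positives that are correct
recall = tp / (tp + fn)      # share of actual positives that were found
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + fp + fn + tn)

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f} accuracy={accuracy:.2f}")
```

Under these assumed counts, accuracy stays high because the 664 true negatives dominate the total, while recall for class `1` stays low because most positives are missed.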



### **4.2 Model with Undersampling (Logistic Regression)**  

In [15]:
from imblearn.under_sampling import RandomUnderSampler

# Apply undersampling
rus = RandomUnderSampler(random_state=42)
X_train_rus, y_train_rus = rus.fit_resample(X_train, y_train)

# Train Logistic Regression model
under_model_lr = LogisticRegression(random_state=42, max_iter=1000)
under_model_lr.fit(X_train_rus, y_train_rus)

# Predictions
y_pred_under = under_model_lr.predict(X_test)
print("Undersampling Model (Logistic Regression) Classification Report:")
print(classification_report(y_test, y_pred_under))


Undersampling Model (Logistic Regression) Classification Report:
              precision    recall  f1-score   support

          -1       0.98      0.65      0.78       670
           1       0.16      0.85      0.27        53

    accuracy                           0.67       723
   macro avg       0.57      0.75      0.53       723
weighted avg       0.92      0.67      0.75       723



In [16]:
X_train.shape

(1684, 294)

In [17]:
X_train_rus.shape

(248, 294)

In [18]:
y_train.shape

(1684,)

In [19]:
y_train_rus.shape

(248,)
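
The 248 rows follow directly from the class counts. As a quick arithmetic check (a sketch using the totals from Section 2 and the test-set support shown in the classification reports), the training split holds 124 positive samples, and random undersampling with default settings keeps an equal number of negatives:

```python
# Counts taken from the outputs above
total_counts = {-1: 2230, 1: 177}   # full dataset (Section 2)
test_counts = {-1: 670, 1: 53}      # "support" column of the classification reports

# Whatever is not in the test split is in the training split
train_counts = {label: total_counts[label] - test_counts[label] for label in total_counts}

# Balanced undersampling keeps every minority sample and an equal number of majority samples
minority_train = min(train_counts.values())
rows_after_undersampling = 2 * minority_train

print(train_counts)
print(rows_after_undersampling)
```

The result matches `X_train_rus.shape[0]`, and it also shows that the undersampled model sees only 124 of the 1,560 available negative training samples.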

### **4.3 Model with Oversampling (Logistic Regression)**  

In [21]:
from imblearn.over_sampling import RandomOverSampler

# Apply oversampling
ros = RandomOverSampler(random_state=42)
X_train_ros, y_train_ros = ros.fit_resample(X_train, y_train)

# Train Logistic Regression model
over_model_lr = LogisticRegression(random_state=42, max_iter=1000)
over_model_lr.fit(X_train_ros, y_train_ros)

# Predictions
y_pred_over = over_model_lr.predict(X_test)
print("Oversampling Model (Logistic Regression) Classification Report:")
print(classification_report(y_test, y_pred_over))

Oversampling Model (Logistic Regression) Classification Report:
              precision    recall  f1-score   support

          -1       0.95      0.80      0.87       670
           1       0.16      0.49      0.25        53

    accuracy                           0.78       723
   macro avg       0.56      0.65      0.56       723
weighted avg       0.89      0.78      0.82       723
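
For intuition about what random oversampling does, here is a minimal NumPy sketch on a toy dataset. It illustrates the idea (duplicating minority rows with replacement until the classes are balanced) and is not the `RandomOverSampler` implementation; the toy arrays and variable names are assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy data: 8 negative rows and 2 positive rows, 2 features each
y_toy = np.array([-1] * 8 + [1] * 2)
X_toy = np.arange(len(y_toy) * 2).reshape(len(y_toy), 2)

minority_idx = np.flatnonzero(y_toy == 1)
majority_idx = np.flatnonzero(y_toy == -1)

# Draw extra minority rows with replacement until both classes have the same size
n_extra = len(majority_idx) - len(minority_idx)
extra_idx = rng.choice(minority_idx, size=n_extra, replace=True)

resampled_idx = np.concatenate([majority_idx, minority_idx, extra_idx])
X_res, y_res = X_toy[resampled_idx], y_toy[resampled_idx]

print(np.unique(y_res, return_counts=True))
```

Both classes end up with 8 rows. Because the added rows are exact copies, no new information is created, which is one reason oversampling can raise recall without improving precision.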



### **4.4 Random Forest Model (No Resampling)**  

In [23]:
from sklearn.ensemble import RandomForestClassifier

# Train Random Forest model
rf_model = RandomForestClassifier(random_state=42)
rf_model.fit(X_train, y_train)

# Predictions
y_pred_rf = rf_model.predict(X_test)
print("Random Forest Model Classification Report:")
print(classification_report(y_test, y_pred_rf))

Random Forest Model Classification Report:
              precision    recall  f1-score   support

          -1       0.93      1.00      0.96       670
           1       0.75      0.06      0.11        53

    accuracy                           0.93       723
   macro avg       0.84      0.53      0.53       723
weighted avg       0.92      0.93      0.90       723



## **5. Compare Model Performance**  

### **5.1 Plot ROC Curve & ROC-AUC Score**

In [27]:
from sklearn.metrics import roc_curve, auc
import matplotlib.pyplot as plt

models = {
    "Logistic Regression": base_model,
    "Undersampling (Logistic Regression)": under_model_lr,
    "Oversampling (Logistic Regression)": over_model_lr,
    "Random Forest": rf_model
}

plt.figure(figsize=(8, 6))

for name, model in models.items():
    y_prob = model.predict_proba(X_test)[:, 1]
    fpr, tpr, _ = roc_curve(y_test, y_prob, pos_label=1)
    roc_auc = auc(fpr, tpr)
    plt.plot(fpr, tpr, label=f'{name} (AUC = {roc_auc:.2f})')

plt.plot([0, 1], [0, 1], linestyle="--", color="gray")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC Curve")
plt.legend()
plt.show()



### **5.2 Plot PR Curve & PR-AUC Score**  

In [38]:
from sklearn.metrics import precision_recall_curve, average_precision_score

plt.figure(figsize=(8, 6))

for name, model in models.items():
    y_prob = model.predict_proba(X_test)[:, 1]
    precision, recall, _ = precision_recall_curve(y_test, y_prob, pos_label=1)
    pr_auc = average_precision_score(y_test, y_prob)
    plt.plot(recall, precision, label=f'{name} (AP = {pr_auc:.2f})')

plt.xlabel("Recall")
plt.ylabel("Precision")
plt.title("Precision-Recall Curve")
plt.legend()
plt.show()
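
Unlike the ROC plot, the PR plot above has no reference line. A common point of comparison is the share of positives in the test set, since a classifier with no skill has an average precision of roughly that value. The sketch below computes it from the test-set support shown in the classification reports:

```python
# Test-set class counts, from the "support" column of the classification reports
test_support = {-1: 670, 1: 53}

# Fraction of positive samples in the test set
prevalence = test_support[1] / sum(test_support.values())

print(f"No-skill average precision is about {prevalence:.3f}")
```

An AP score should therefore be judged against about 0.073 rather than against 0.5.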



## **6. Select the Best Model for Deployment**  
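Choose the best model based on **ROC-AUC and PR-AUC scores**. This section suggests assuming the **oversampling model** performed best, but the cell in 6.1 saves `rf_model`; save whichever model actually scores highest in Section 5.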
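
One way to make this choice explicit is to collect both scores in a table instead of reading them off the plot legends. The sketch below is an illustration only: `summarize_models` is a hypothetical helper, and the data is a synthetic stand-in from `make_classification`, so the printed numbers say nothing about the `scene` dataset.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, roc_auc_score
from sklearn.model_selection import train_test_split


def summarize_models(models, X_eval, y_eval, pos_label=1):
    """Return one row per model with ROC-AUC and average precision, best PR-AUC first."""
    rows = []
    for name, model in models.items():
        # Look up the probability column that belongs to the positive class
        pos_col = list(model.classes_).index(pos_label)
        y_prob = model.predict_proba(X_eval)[:, pos_col]
        rows.append({
            "model": name,
            # roc_auc_score treats the larger label as positive in the binary case
            "roc_auc": roc_auc_score(y_eval, y_prob),
            "pr_auc": average_precision_score(y_eval, y_prob, pos_label=pos_label),
        })
    return pd.DataFrame(rows).sort_values("pr_auc", ascending=False).reset_index(drop=True)


# Synthetic stand-in data with roughly 10% positives (assumption for this demo)
X_demo, y_demo = make_classification(
    n_samples=600, n_features=10, weights=[0.9, 0.1], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=42
)

demo_models = {
    "Logistic Regression": LogisticRegression(max_iter=1000).fit(X_tr, y_tr),
    "Random Forest": RandomForestClassifier(random_state=42).fit(X_tr, y_tr),
}

summary = summarize_models(demo_models, X_te, y_te)
print(summary)
```

In this notebook the same helper could be called as `summarize_models(models, X_test, y_test)`, using the `models` dictionary from Section 5.1.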

### **6.1 Save the Model**  

In [41]:
import joblib

# Save the best model
joblib.dump(rf_model, "best_model.pkl")

['best_model.pkl']

In [50]:
joblib.load("best_model.pkl")

## **7. Deploy Model using Streamlit**  



In [52]:
# %%writefile app.py
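
The cell above is empty, so the app itself is not shown. As a rough sketch of what `app.py` could contain, assuming the model saved as `best_model.pkl` in Section 6.1 and the 294 features seen in `X_train.shape`, the following loads the model and predicts from values pasted into a text box. The helper `parse_features`, the page title, and the input format are all assumptions made for illustration.

```python
# app.py (sketch): run with `streamlit run app.py` from a terminal
from pathlib import Path

MODEL_PATH = Path("best_model.pkl")  # file written by joblib.dump in Section 6.1
N_FEATURES = 294                     # number of columns in X_train


def parse_features(text, n_features=N_FEATURES):
    """Turn comma- or whitespace-separated numbers into a single-row 2D list."""
    values = [float(token) for token in text.replace(",", " ").split()]
    if len(values) != n_features:
        raise ValueError(f"Expected {n_features} values, got {len(values)}.")
    return [values]


def main():
    # Imported here so the helper above can be reused without Streamlit installed
    import joblib
    import streamlit as st

    st.title("Imbalanced Classification Demo")
    model = joblib.load(MODEL_PATH)

    text = st.text_area(f"Enter {N_FEATURES} feature values (comma or space separated)")
    if st.button("Predict"):
        try:
            features = parse_features(text)
        except ValueError as err:
            # Covers both a wrong number of values and non-numeric input
            st.error(str(err))
        else:
            label = model.predict(features)[0]
            positive_col = list(model.classes_).index(1)
            probability = model.predict_proba(features)[0][positive_col]
            st.write(f"Predicted class: {label}")
            st.write(f"Estimated probability of class 1: {probability:.3f}")


# Start the UI only when the saved model exists in the working directory
if __name__ == "__main__" and MODEL_PATH.exists():
    main()
```

Typing 294 numbers by hand is impractical, so a real app would more likely accept an uploaded CSV row; the text box simply keeps the sketch short. If `best_model.pkl` is missing, this version renders an empty page rather than raising an error.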


In [56]:
# Shell commands need a leading "!" in a notebook cell; this call blocks the cell until the server is stopped
!streamlit run app.py

In [None]:
# %%writefile requirements.txt



**Insert link of your App here.**