# Exercise: Logistic Regression on a Synthetic Classification Dataset

1. Generate synthetic binary classification data with 5 features (100 samples).
2. Train a logistic regression model using scikit-learn.
3. Evaluate the model using accuracy, confusion matrix, and ROC AUC.
4. Visualize the decision boundary (on a 2D projection).
5. Discuss feature importance and model performance.

## 1) Generate Synthetic Classification Data
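`make_blobs` from `sklearn.datasets` draws samples from isotropic Gaussian clusters; with `centers=2`, the cluster index of each sample serves as its binary label. See https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_blobs.html.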

In [None]:
import numpy as np
from sklearn.datasets import make_blobs

# Two Gaussian blobs (centers=2) give binary labels y in {0, 1}
X, y = make_blobs(n_samples=100, centers=2, n_features=5, cluster_std=1.5, random_state=0)

print("X shape:", X.shape)
print("First 5 samples of X:\n", X[:5])
print("First 5 labels y:\n", y[:5])
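
Before training, it can help to confirm that the two classes are balanced and to see where each blob sits. The following is a small sketch that regenerates the same data; the variable name `class_counts` is introduced here for illustration only.

```python
import numpy as np
from sklearn.datasets import make_blobs

# Same settings as the cell above
X, y = make_blobs(n_samples=100, centers=2, n_features=5, cluster_std=1.5, random_state=0)

# Count samples per class; make_blobs splits n_samples evenly across centers
class_counts = np.bincount(y)
print("Samples per class:", class_counts)

# Per-class feature means approximate the two blob centers
for label in np.unique(y):
    print(f"Class {label} feature means:", X[y == label].mean(axis=0).round(2))
```

Balanced classes make accuracy a reasonable first metric; with strongly imbalanced labels it would be less informative.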

## 2) Train Logistic Regression Model

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Split into train/test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train model (scikit-learn defaults: L2 regularization with C=1.0)
clf = LogisticRegression()
clf.fit(X_train, y_train)

print("Model coefficients:", clf.coef_)
print("Intercept:", clf.intercept_)
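
To see what the coefficients and intercept actually do, the predicted probability for class 1 can be rebuilt by hand as the sigmoid of the linear score `X w + b`. This is a sketch for the binary case only; the names `z` and `manual_proba` are illustrative, and `expit` from SciPy is used as a numerically stable sigmoid.

```python
import numpy as np
from scipy.special import expit  # numerically stable sigmoid
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_blobs(n_samples=100, centers=2, n_features=5, cluster_std=1.5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
clf = LogisticRegression().fit(X_train, y_train)

# Linear score: one weight per feature plus the intercept
z = X_test @ clf.coef_.ravel() + clf.intercept_[0]

# Sigmoid maps the score to a probability of class 1
manual_proba = expit(z)
sklearn_proba = clf.predict_proba(X_test)[:, 1]

print("Max abs difference:", np.max(np.abs(manual_proba - sklearn_proba)))
```

The two probability vectors should agree up to floating-point error, which confirms that `coef_` and `intercept_` fully describe the fitted binary model.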

## 3) Evaluate Model Performance

In [None]:
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score

# Hard labels (0.5 probability threshold) and class-1 probabilities for ROC AUC
y_pred = clf.predict(X_test)
y_proba = clf.predict_proba(X_test)[:, 1]

print("Accuracy:", accuracy_score(y_test, y_pred))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("ROC AUC:", roc_auc_score(y_test, y_proba))
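
The confusion matrix contains everything needed to recompute accuracy and related rates. The sketch below unpacks the four cells and derives the metrics by hand; the `manual_*` names are illustrative, and passing `labels=[0, 1]` is an added safeguard so the matrix is always 2x2.

```python
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score
from sklearn.model_selection import train_test_split

X, y = make_blobs(n_samples=100, centers=2, n_features=5, cluster_std=1.5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
clf = LogisticRegression().fit(X_train, y_train)
y_pred = clf.predict(X_test)

# Rows are true classes, columns are predicted classes: [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_test, y_pred, labels=[0, 1]).ravel()

manual_accuracy = (tp + tn) / (tp + tn + fp + fn)
manual_recall = tp / (tp + fn)       # true positive rate
manual_specificity = tn / (tn + fp)  # true negative rate

print("Accuracy:", manual_accuracy)
print("Recall (class 1):", manual_recall)
print("Specificity (class 0):", manual_specificity)
```

Accuracy and recall computed this way should match `accuracy_score` and `recall_score` on the same predictions.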

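## 4) (Optional) Visualize Decision Boundary (2D Projection)
Here we use `PCA` to reduce the data to two dimensions; we will study `PCA` in more detail later in the course. Note that a separate logistic regression model is fit on the projected data, so the plot shows the boundary of that 2D model rather than of the 5-feature model trained above.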

In [None]:
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Reduce to 2D for visualization
pca = PCA(n_components=2)
X_vis = pca.fit_transform(X)

# Fit a separate model on the 2D projection (all samples, for plotting only)
clf_vis = LogisticRegression().fit(X_vis, y)

# Plot decision boundary
xx, yy = np.meshgrid(np.linspace(X_vis[:,0].min(), X_vis[:,0].max(), 100),
                     np.linspace(X_vis[:,1].min(), X_vis[:,1].max(), 100))
grid = np.c_[xx.ravel(), yy.ravel()]
probs = clf_vis.predict_proba(grid)[:, 1].reshape(xx.shape)

plt.contourf(xx, yy, probs, 25, cmap="RdBu", alpha=0.8)
plt.scatter(X_vis[:, 0], X_vis[:, 1], c=y, cmap="summer", edgecolors='k')
plt.title("Logistic Regression Decision Boundary (PCA Projection)")
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.show()
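
How faithful the 2D picture is depends on how much of the data's variance the two principal components retain. A quick check, sketched here with the illustrative name `ratio`, is to inspect `explained_variance_ratio_` on the fitted `PCA` object.

```python
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA

X, y = make_blobs(n_samples=100, centers=2, n_features=5, cluster_std=1.5, random_state=0)

# Fit the same 2-component PCA used for the plot
pca = PCA(n_components=2).fit(X)

# Fraction of total variance captured by each component
ratio = pca.explained_variance_ratio_
print("Variance explained by PC1, PC2:", ratio.round(3))
print("Total variance retained in 2D:", ratio.sum().round(3))
```

If the retained fraction is low, the plotted boundary may look quite different from how the full 5-feature model separates the classes.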

## 5) Discussion
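- Coefficient magnitudes indicate feature importance only when the features are on comparable scales, so standardize the features before comparing them. The sign of a coefficient shows whether increasing that feature pushes the prediction toward class 1 or class 0.
- The test set here has only 30 samples (30% of 100), so the accuracy and ROC AUC values are coarse estimates.
- Consider regularization (controlled by `C` in `LogisticRegression`; smaller `C` means a stronger penalty) or feature selection for high-dimensional data.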
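
The points about coefficients and regularization can be made concrete with a short experiment. The sketch below standardizes the features inside a pipeline so the coefficients are comparable, then refits with a few values of `C`. The chosen `C` values, `max_iter=1000`, and the `norms` dictionary are illustrative assumptions, not part of the exercise.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_blobs(n_samples=100, centers=2, n_features=5, cluster_std=1.5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

norms = {}
for C in (0.01, 1.0, 100.0):
    # Scaling inside the pipeline is fit on the training data only
    model = make_pipeline(StandardScaler(), LogisticRegression(C=C, max_iter=1000))
    model.fit(X_train, y_train)

    # Coefficients on standardized features can be compared with each other
    coefs = model.named_steps["logisticregression"].coef_.ravel()
    norms[C] = np.linalg.norm(coefs)
    print(f"C={C:>6}: coefficients={coefs.round(3)}, L2 norm={norms[C]:.3f}")
```

Smaller `C` shrinks the coefficients toward zero, while larger `C` lets them grow; comparing test accuracy across these settings is one way to choose the regularization strength.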