# Exercise: Softmax Regression on a Synthetic Classification Dataset

1. Generate synthetic classification data with 3 classes and 5 features (100 samples).
2. Train a softmax regression model using scikit-learn.
3. Evaluate the model using accuracy and a confusion matrix.
4. Visualize the decision boundary on a 2D projection of the data.
5. Discuss feature importance and model performance.

## 1) Generate Synthetic Classification Data
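`make_blobs` from `sklearn.datasets` draws isotropic Gaussian blobs, one per cluster center, and returns the points together with the index of the blob each point came from. Those indices serve as class labels here. See https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_blobs.html.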

In [None]:
import numpy as np
from sklearn.datasets import make_blobs

# Generate classification data with 3 classes using make_blobs
X, y = make_blobs(n_samples=100, centers=3, n_features=5, cluster_std=1.5, random_state=0)

print("X shape:", X.shape)
print("First 5 samples of X:\n", X[:5])
print("First 5 labels y:\n", y[:5])
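Before training, it can help to confirm that the classes are roughly balanced and to see how far apart the blobs sit. The sketch below regenerates the same data and uses only NumPy. The names `class_counts` and `class_means` are illustrative and do not appear in the original notebook.

```python
import numpy as np
from sklearn.datasets import make_blobs

# Same settings as the cell above
X, y = make_blobs(n_samples=100, centers=3, n_features=5, cluster_std=1.5, random_state=0)

# Number of samples per class label (index = class id)
class_counts = np.bincount(y)
print("Samples per class:", class_counts)

# Per-feature mean of each class (rows = classes, columns = features)
class_means = np.array([X[y == k].mean(axis=0) for k in np.unique(y)])
print("Class means:\n", class_means)
```

Class means that lie far apart relative to `cluster_std` suggest the problem is close to linearly separable, which is the easy case for softmax regression.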

## 2) Train Softmax Regression Model
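scikit-learn has no separate softmax estimator. Softmax regression is another name for multinomial logistic regression, so `LogisticRegression` covers it: when the target has more than two classes and the solver supports it (for example `lbfgs` or `saga`), the model is fit with a softmax over the class scores.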

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Split into train/test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train the model using the 'saga' solver
clf = LogisticRegression(solver='saga', tol=1e-3, max_iter=1000)
clf.fit(X_train, y_train)

print("Model coefficients:", clf.coef_)
print("Intercept:", clf.intercept_)
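To make the link between the fitted coefficients and the softmax explicit, the following sketch recomputes the class probabilities by hand and compares them with `predict_proba`. It assumes a recent scikit-learn release in which a multiclass `LogisticRegression` with the default solver is fit as a multinomial model. The names `model` and `softmax` are illustrative helpers, not part of the original notebook.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression

X, y = make_blobs(n_samples=100, centers=3, n_features=5, cluster_std=1.5, random_state=0)

# Illustrative model with the default solver (assumed to fit a multinomial model)
model = LogisticRegression(max_iter=1000).fit(X, y)

def softmax(z):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

# Linear class scores: one column per class
scores = X @ model.coef_.T + model.intercept_
manual_probs = softmax(scores)

# Compare the hand-computed probabilities with scikit-learn's
matches = np.allclose(manual_probs, model.predict_proba(X))
print("Manual softmax matches predict_proba:", matches)
```

Each row of `manual_probs` sums to 1, and the predicted class is the column with the largest score.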

## 3) Evaluate Model Performance

In [None]:
from sklearn.metrics import accuracy_score, confusion_matrix

y_pred = clf.predict(X_test)

print("Accuracy:", accuracy_score(y_test, y_pred))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
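Accuracy is a single number, while the confusion matrix shows which classes are confused with which. As a sketch of how the two relate, the code below derives accuracy and per-class recall directly from the matrix. It repeats the data generation and split so that it runs on its own, and it assumes every class appears in the test set. The name `per_class_recall` is illustrative.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score
from sklearn.model_selection import train_test_split

X, y = make_blobs(n_samples=100, centers=3, n_features=5, cluster_std=1.5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

clf = LogisticRegression(solver='saga', tol=1e-3, max_iter=1000).fit(X_train, y_train)
y_pred = clf.predict(X_test)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_test, y_pred)

# Accuracy is the fraction of samples on the diagonal
accuracy_from_cm = np.trace(cm) / cm.sum()

# Recall for class k is the diagonal entry divided by the row total
per_class_recall = cm.diagonal() / cm.sum(axis=1)

print("Accuracy from confusion matrix:", accuracy_from_cm)
print("Per-class recall:", per_class_recall)
```

`sklearn.metrics.classification_report(y_test, y_pred)` prints the same per-class recall alongside precision and F1.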

## 4) Visualize Decision Boundary (2D Projection)

In [None]:
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Reduce to 2D for visualization
pca = PCA(n_components=2)
X_vis = pca.fit_transform(X)

clf_vis = LogisticRegression(solver='saga', tol=1e-3, max_iter=1000).fit(X_vis, y)

# Plot decision regions: predict a class for every point on a grid
xx, yy = np.meshgrid(np.linspace(X_vis[:, 0].min(), X_vis[:, 0].max(), 100),
                     np.linspace(X_vis[:, 1].min(), X_vis[:, 1].max(), 100))
grid = np.c_[xx.ravel(), yy.ravel()]
Z = clf_vis.predict(grid).reshape(xx.shape)

# One filled region per class; the edges between regions are the decision boundaries
plt.contourf(xx, yy, Z, levels=np.arange(4) - 0.5, cmap="viridis", alpha=0.3)
plt.scatter(X_vis[:, 0], X_vis[:, 1], c=y, cmap="viridis", edgecolors='k')
plt.title("Softmax Regression Decision Boundary (PCA Projection)")
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.show()
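The plot shows a model trained on two principal components, not the 5-feature model from step 2, so it is only as faithful as the projection. One rough check, sketched below, is how much of the total variance the two components retain. The name `retained` is illustrative.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA

X, y = make_blobs(n_samples=100, centers=3, n_features=5, cluster_std=1.5, random_state=0)

pca = PCA(n_components=2).fit(X)

# Fraction of the total variance captured by each principal component
ratio = pca.explained_variance_ratio_
retained = ratio.sum()

print("Explained variance ratio per component:", ratio)
print("Total variance retained in 2D:", retained)
```

A value of `retained` close to 1 means little information is lost by plotting in 2D. A low value means the boundary in the plot may look quite different from the one the 5-feature model actually learned.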

## 5) Feature Importance and Discussion
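- `clf.coef_` has one row per class and one column per feature (shape 3 x 5 here). A larger absolute coefficient means the feature has more influence on that class's score, but magnitudes are only comparable across features when the features are on similar scales.
- `LogisticRegression` applies L2 regularization by default (`C=1.0`). For high-dimensional data, consider stronger regularization (a smaller `C`) or feature selection.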
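One way to act on the points above is to standardize the features before fitting and then summarize each feature by its mean absolute coefficient across classes. This is a heuristic sketch rather than a definitive importance measure: the summary statistic is an assumption of this example, and the names `pipe`, `importance`, and `ranking` are illustrative.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_blobs(n_samples=100, centers=3, n_features=5, cluster_std=1.5, random_state=0)

# Standardize so that coefficient magnitudes are comparable across features
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)).fit(X, y)
coef = pipe.named_steps["logisticregression"].coef_  # shape (n_classes, n_features)

# Heuristic importance: mean absolute coefficient over the classes
importance = np.abs(coef).mean(axis=0)

# Feature indices sorted from most to least influential under this heuristic
ranking = np.argsort(importance)[::-1]

for idx in ranking:
    print(f"feature {idx}: {importance[idx]:.3f}")
```

Because `make_blobs` places the class centers at random in every dimension, the ranking mostly reflects which features happen to separate the centers best for this `random_state`.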