<a href="https://colab.research.google.com/github/rhodes-byu/cs180-winter25/blob/main/notebooks/17-mlp-demo.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# MLPClassifier with Scikit-Learn

This notebook demonstrates how to use the `MLPClassifier` from `scikit-learn` for classification.

---



## Introduction to Neural Networks

A **neural network** consists of layers of interconnected nodes (neurons), loosely inspired by the human brain. A basic feedforward architecture, the **multilayer perceptron (MLP)**, includes:
- **Input layer**: receives features.
- **Hidden layers**: perform nonlinear transformations using activation functions.
- **Output layer**: produces the final prediction.

Each connection has an associated **weight**, and each neuron typically has a **bias** term. During training, the model adjusts the weights and biases to minimize prediction error.
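As a rough illustration of the layer structure described above, here is a minimal NumPy sketch of a forward pass. The architecture (2 inputs, 3 hidden neurons, 1 output), the random weights, and the helper name `forward` are assumptions made for this example; this is not how `MLPClassifier` is implemented internally.

```python
import numpy as np

def forward(x, weights, biases):
    """Forward pass of a small MLP: ReLU on hidden layers, sigmoid on the output."""
    a = x
    for W, b in zip(weights[:-1], biases[:-1]):
        a = np.maximum(0.0, a @ W + b)   # hidden layer: affine map followed by ReLU
    z = a @ weights[-1] + biases[-1]     # output layer pre-activation
    return 1.0 / (1.0 + np.exp(-z))      # sigmoid maps the result into (0, 1)

rng = np.random.default_rng(0)
# Hypothetical architecture: 2 input features -> 3 hidden neurons -> 1 output
weights = [rng.normal(size=(2, 3)), rng.normal(size=(3, 1))]
biases = [np.zeros(3), np.zeros(1)]

x = np.array([[0.5, -1.2]])              # one sample with two features
print(forward(x, weights, biases))       # array of shape (1, 1)
```

Training amounts to changing the entries of `weights` and `biases` so that outputs like this one move closer to the true labels.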


In [None]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.datasets import make_moons, fetch_openml
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import ConfusionMatrixDisplay, classification_report
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import loguniform


## Moons Dataset Revisited
---

In [None]:
# Generate synthetic dataset (moons)
X, y = make_moons(n_samples = 1000, noise = 0.2, random_state = 42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Further split into validation data
X_test, X_val, y_test, y_val = train_test_split(X_test, y_test, test_size=0.5, random_state=42)


scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_val = scaler.transform(X_val)
X_test = scaler.transform(X_test)


In [None]:
print("X_train shape:", X_train.shape)
print("X_val shape:", X_val.shape)
print("X_test shape:", X_test.shape)

In [None]:
plt.scatter(X[:, 0], X[:, 1], c=y, cmap='viridis', s=10)
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.title("Scatter Plot of Moons Dataset")
plt.show()

In [None]:
# (10,) is a one-element tuple: a single hidden layer with 10 neurons
mlp = MLPClassifier(hidden_layer_sizes=(10,), max_iter=1000, random_state=42)
mlp.fit(X_train, y_train)

In [None]:
y_pred_val = mlp.predict(X_val)
print("Validation set classification report:")
print(classification_report(y_val, y_pred_val))


In [None]:
ConfusionMatrixDisplay.from_predictions(y_val, y_pred_val)
plt.show()


In [None]:
# Visualization: decision boundary
def plot_decision_boundary(model, X, y, ax):
    h = 0.02
    x_min, x_max = X[:, 0].min() - 0.5, X[:, 0].max() + 0.5
    y_min, y_max = X[:, 1].min() - 0.5, X[:, 1].max() + 0.5
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                         np.arange(y_min, y_max, h))
    Z = model.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    ax.contourf(xx, yy, Z, alpha=0.3)
    scatter = ax.scatter(X[:, 0], X[:, 1], c=y, edgecolors='k')
    legend1 = ax.legend(*scatter.legend_elements(), title="Classes")
    ax.add_artist(legend1)
    ax.set_title("MLPClassifier Decision Boundary")




In [None]:
fig, ax = plt.subplots()
plot_decision_boundary(mlp, X_test, y_test, ax)
plt.show()

## Visualizing the Loss Curve

You can monitor training progress by inspecting the loss curve:

In [None]:
plt.plot(mlp.loss_curve_)
plt.xlabel("Epochs")
plt.ylabel("Loss")
plt.title("Training Loss Curve")
plt.show()

In [None]:
mlp_partial = MLPClassifier(hidden_layer_sizes=(10,), max_iter=1, warm_start=True, random_state=42,
                            activation = 'tanh')

train_loss = []
val_loss = []
train_accuracy = []
val_accuracy = []

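classes = np.unique(y_train)
for _ in range(10000):  # Number of epochs; each partial_fit call is one pass over the training data
    mlp_partial.partial_fit(X_train, y_train, classes=classes)
    train_loss.append(mlp_partial.loss_)
    # Binary cross-entropy on the validation set; clip probabilities to avoid log(0)
    p_val = np.clip(mlp_partial.predict_proba(X_val)[:, 1], 1e-10, 1 - 1e-10)
    val_loss.append(-np.mean(y_val * np.log(p_val) + (1 - y_val) * np.log(1 - p_val)))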
    train_accuracy.append(mlp_partial.score(X_train, y_train))
    val_accuracy.append(mlp_partial.score(X_val, y_val))
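The validation loss in the loop above is the binary cross-entropy written out by hand. As a small sketch on made-up values (the arrays `y_true` and `p` are illustrative, not outputs of this notebook), the same quantity can be checked against `sklearn.metrics.log_loss`:

```python
import numpy as np
from sklearn.metrics import log_loss

y_true = np.array([0, 1, 1, 0, 1])           # illustrative labels
p = np.array([0.1, 0.8, 0.6, 0.3, 0.9])      # illustrative P(class 1)

# Hand-written binary cross-entropy, same form as in the training loop
manual = -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))
library = log_loss(y_true, p)

print(manual, library)                       # the two values should agree
```

Note that `mlp_partial.loss_` also includes the L2 penalty term, so the training and validation curves are not measured on exactly the same scale.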

In [None]:
fig, ax = plt.subplots(1, 2, figsize=(10, 5))

# Plot the loss curves
ax[0].plot(train_loss, label="Training Loss")
ax[0].plot(val_loss, label="Validation Loss")
ax[0].set_xlabel("Epochs")
ax[0].set_ylabel("Loss")
ax[0].set_title("Training and Validation Loss Curve")
ax[0].legend()

# Plot the accuracy curves
ax[1].plot(train_accuracy, label="Training Accuracy")
ax[1].plot(val_accuracy, label="Validation Accuracy")
ax[1].set_xlabel("Epochs")
ax[1].set_ylabel("Accuracy")
ax[1].set_title("Training and Validation Accuracy Curve")
ax[1].legend()

plt.tight_layout()
plt.show()

## Activation Functions

Activation functions introduce non-linearity into the network. Common ones include:
- **ReLU (Rectified Linear Unit)**: `f(x) = max(0, x)`
- **Sigmoid**: `f(x) = 1 / (1 + exp(-x))`
- **Tanh**: `f(x) = tanh(x)`

Different activations affect learning dynamics. Historically, **Sigmoid** was important because of its connection to logistic regression. In practice, **ReLU** is the most frequently used today because it tends to train faster and mitigates the vanishing gradient problem.
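To make the formulas above concrete, the following sketch evaluates each activation on a few sample inputs with NumPy. The helper names `relu` and `sigmoid` are defined here for illustration only; in `MLPClassifier` the corresponding options are `activation='relu'`, `'logistic'`, and `'tanh'`.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.array([-2.0, 0.0, 2.0])
print("relu:   ", relu(x))      # negative inputs are clipped to 0
print("sigmoid:", sigmoid(x))   # squashed into (0, 1), equal to 0.5 at x = 0
print("tanh:   ", np.tanh(x))   # squashed into (-1, 1), equal to 0 at x = 0
```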


In [None]:
activations = ['identity', 'logistic', 'tanh', 'relu']

In [None]:
models = {}
scores = {}
for activation in activations:
    mlp = MLPClassifier(hidden_layer_sizes=(10,), max_iter=1000, activation=activation, random_state=42)
    mlp.fit(X_train, y_train)
    models[activation] = mlp
    scores[activation] = mlp.score(X_val, y_val)


In [None]:
fig, ax = plt.subplots(1, 4, figsize=(30, 8))

for i, (activation, model) in enumerate(models.items()):
    score = scores[activation]
    plot_decision_boundary(model, X_test, y_test, ax[i])
    ax[i].set_title(f"Activation: {activation}, Score: {np.round(score, 3)}")  # Set title after plotting
    ax[i].set_xlabel("Feature 1")
    ax[i].set_ylabel("Feature 2")
plt.tight_layout()
plt.show()


## Hidden Layer Sizes

The `hidden_layer_sizes` parameter controls the number and size of hidden layers, e.g.:
- `(10,)`: one hidden layer with 10 neurons
- `(100,)`: one large hidden layer
- `(50, 30)`: two hidden layers with 50 and 30 neurons
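One way to get a feel for these settings is to count the trainable parameters each one implies. The sketch below uses a hypothetical helper, `count_parameters`, and assumes 2 input features and a single output unit, as for a binary problem like the moons data.

```python
def count_parameters(n_features, hidden_layer_sizes, n_outputs):
    """Count weights and biases in a fully connected network."""
    layer_sizes = [n_features, *hidden_layer_sizes, n_outputs]
    total = 0
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        total += n_in * n_out + n_out   # weight matrix plus one bias per neuron
    return total

# Assumption: 2 input features, 1 output unit
for sizes in [(10,), (100,), (50, 30)]:
    print(sizes, count_parameters(2, sizes, 1))
# (10,) 41
# (100,) 401
# (50, 30) 1711
```

Adding a second hidden layer increases the count much faster than widening a single layer, because every pair of adjacent layers is fully connected.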

In [None]:
hidden_sizes = [(1,), (2,), (10,), (50,), (100,), (10, 10), (50, 10), (100, 100)]

In [None]:
for sizes in hidden_sizes:
    mlp = MLPClassifier(hidden_layer_sizes=sizes, max_iter=1000, activation='relu', random_state=42)
    mlp.fit(X_train, y_train)
    print(f"Hidden sizes: {sizes}")
    print('Validation accuracy:', mlp.score(X_val, y_val), '\n')

## Overfitting and Regularization

Train a model with large hidden layers to observe overfitting (high training accuracy but lower validation or test accuracy).

L2 regularization, controlled by the `alpha` parameter, can then mitigate it:

```python
mlp = MLPClassifier(hidden_layer_sizes=(100,), alpha=0.01)
```

**Dropout** and **early stopping** are two other common regularization strategies. `MLPClassifier` supports early stopping through `early_stopping=True`, but it does not implement dropout.
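The comparison described above can be sketched as follows. The dataset size, noise level, layer sizes, and `alpha` values are illustrative assumptions, and the variable names (`X_demo`, `configs`, `results`) are local to this example. Whether overfitting shows up, and how strongly, depends on the data and the random seed.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Small, noisy dataset (illustrative settings) so a large network has room to overfit
X_demo, y_demo = make_moons(n_samples=200, noise=0.3, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X_demo, y_demo, test_size=0.5, random_state=0)

configs = {
    "weak L2 (alpha=1e-4)": dict(alpha=1e-4),
    "strong L2 (alpha=1.0)": dict(alpha=1.0),
    "early stopping": dict(early_stopping=True),
}

results = {}
for name, kwargs in configs.items():
    model = MLPClassifier(hidden_layer_sizes=(100, 100), max_iter=2000,
                          random_state=0, **kwargs)
    model.fit(X_tr, y_tr)
    # Compare accuracy on the data the model saw with accuracy on held-out data
    results[name] = (model.score(X_tr, y_tr), model.score(X_va, y_va))
    print(f"{name}: train={results[name][0]:.3f}, val={results[name][1]:.3f}")
```

A large gap between the training and validation accuracy is the usual sign of overfitting; regularization is working if it narrows that gap without lowering validation accuracy.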


In [None]:
# A large alpha applies a strong L2 penalty to the weights
mlp = MLPClassifier(hidden_layer_sizes=(10,), max_iter=1000, activation='relu', random_state=42, alpha=10)
mlp.fit(X_train, y_train)

y_pred_val = mlp.predict(X_val)
print("Validation set classification report:")
print(classification_report(y_val, y_pred_val))

In [None]:
fig, ax = plt.subplots()
plot_decision_boundary(mlp, X_test, y_test, ax)
plt.show()

## Hyperparameter Tuning
---

In [None]:
params = {
    'hidden_layer_sizes': [(10,), (50,), (100,), (150,)], # Number of neurons in each hidden layer
    'activation': ['tanh'],
    'solver': ['sgd', 'adam'],
    'alpha': [0.0001, 0.05], # Regularization parameter
    'learning_rate': ['constant', 'adaptive'],
    'learning_rate_init': [0.001, 0.01, 0.1], # Initial learning rate
    'max_iter': [1000],
}


In [None]:
cv = RandomizedSearchCV(mlp, param_distributions = params, n_iter=10, cv=3, random_state=42)
cv.fit(X_train, y_train)

In [None]:
cv.best_score_

In [None]:
cv.best_params_

In [None]:
fig, ax = plt.subplots()
plot_decision_boundary(cv.best_estimator_, X_test, y_test, ax)
plt.show()
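The search above draws each hyperparameter from a fixed list. `RandomizedSearchCV` also accepts continuous distributions, which is what the `loguniform` import at the top of the notebook is for. The following is a sketch under assumed ranges and a small demo dataset; the names `X_demo`, `param_distributions`, and `search` are local to this example.

```python
from scipy.stats import loguniform
from sklearn.datasets import make_moons
from sklearn.model_selection import RandomizedSearchCV
from sklearn.neural_network import MLPClassifier

X_demo, y_demo = make_moons(n_samples=300, noise=0.2, random_state=42)

# Illustrative ranges: sample alpha and the initial learning rate on a log scale
param_distributions = {
    "hidden_layer_sizes": [(10,), (50,)],
    "alpha": loguniform(1e-5, 1e-1),
    "learning_rate_init": loguniform(1e-3, 1e-1),
}

search = RandomizedSearchCV(
    MLPClassifier(max_iter=500, random_state=42),
    param_distributions=param_distributions,
    n_iter=5, cv=3, random_state=42,
)
search.fit(X_demo, y_demo)
print(search.best_params_)
print(search.best_score_)
```

Sampling on a log scale is a common choice for parameters such as `alpha` and the learning rate, whose useful values span several orders of magnitude.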

## MNIST Data

In [None]:
mnist = fetch_openml('mnist_784', version = 1, parser = 'auto')
X, y = mnist['data'], mnist['target']

In [None]:
print('Max val: ', X.values.max())
print('Min val: ', X.values.min())

In [None]:
X = X / 255.0  # Normalize pixel values to [0, 1]

In [None]:
X_train = X[:60000]
y_train = y[:60000]

X_test = X[60000:]
y_test = y[60000:]

# Split off a validation set
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=42)

In [None]:
mlp = MLPClassifier(hidden_layer_sizes=(100,), max_iter=10, random_state=42)
mlp.fit(X_train, y_train)

In [None]:
y_pred_val = mlp.predict(X_val)

print("Validation set classification report:")
print(classification_report(y_val, y_pred_val))

In [None]:
cm = ConfusionMatrixDisplay.from_estimator(mlp, X_val, y_val, cmap=plt.cm.Blues)
plt.show()

## Visualizing "Difficult" Examples
---

In [None]:
# Get predicted probabilities for the validation set
y_proba_val = mlp.predict_proba(X_val)

# Identify misclassified points
misclassified_indices = np.where(y_val != y_pred_val)[0]

# Calculate discrepancies for misclassified points
discrepancies = np.abs(y_proba_val[misclassified_indices, y_val.iloc[misclassified_indices].astype(int)] - 
                       y_proba_val[misclassified_indices, y_pred_val[misclassified_indices].astype(int)])

# Get indices of misclassified points with the highest discrepancies
top_discrepancy_indices = misclassified_indices[np.argsort(-discrepancies)[:5]]

# Display predicted probabilities and true y values for these points
print("Predicted probabilities for misclassified points with highest discrepancies:")
print(y_proba_val[top_discrepancy_indices].round(3))
print("\nTrue y values for these points:")
print(y_val.iloc[top_discrepancy_indices].values)
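The indexing in the cell above is compact, so here is the same ranking logic on a tiny made-up example (4 samples, 3 classes). The arrays `proba` and `y_true` are illustrative values, not MNIST outputs.

```python
import numpy as np

# Illustrative predicted probabilities for 4 samples and 3 classes
proba = np.array([
    [0.70, 0.20, 0.10],
    [0.10, 0.30, 0.60],
    [0.50, 0.40, 0.10],
    [0.05, 0.90, 0.05],
])
y_true = np.array([0, 1, 1, 0])
y_pred = proba.argmax(axis=1)            # predicted classes: [0, 2, 0, 1]

wrong = np.where(y_true != y_pred)[0]    # misclassified samples: [1, 2, 3]
# Gap between the probability given to the true class and to the predicted class
gap = np.abs(proba[wrong, y_true[wrong]] - proba[wrong, y_pred[wrong]])
ranked = wrong[np.argsort(-gap)]         # most confidently wrong first

print(ranked)                            # [3 1 2]
```

Sample 3 ranks first because the model assigned probability 0.9 to the wrong class and only 0.05 to the true one.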

In [None]:
fig, axes = plt.subplots(1, len(top_discrepancy_indices), figsize=(15, 5))

for i, idx in enumerate(top_discrepancy_indices):
    ax = axes[i]
    image = X_val.iloc[idx].values.reshape(28, 28)  # Reshape the flattened image to 28x28
    ax.imshow(image, cmap='gray')
    ax.axis('off')
    ax.set_title(f"True: {y_val.iloc[idx]}, Pred: {y_pred_val[idx]}")

plt.tight_layout()
plt.show()

## Your Turn!
---

The **Fashion MNIST** dataset is a collection of 70,000 grayscale images of size 28x28 pixels, representing 10 categories of clothing items, such as T-shirts, trousers, and shoes. It is designed as a drop-in replacement for the MNIST dataset, providing a more challenging benchmark for machine learning models. Each image is labeled with one of the 10 classes, and the dataset is split into 60,000 training samples and 10,000 test samples.

In [None]:
# Fetch Fashion MNIST
fashion_mnist = fetch_openml('Fashion-MNIST', version=1, as_frame=False, parser = 'auto')

# Extract features and labels
X, y = fashion_mnist["data"], fashion_mnist["target"]

In [None]:
target_names = {0: "T-shirt/top", 1: "Trouser", 2: "Pullover", 3: "Dress", 4: "Coat",
               5: "Sandal", 6: "Shirt", 7: "Sneaker", 8: "Bag", 9: "Ankle boot"}

### 0. Visualize the data
Display the first 10 data points. Include the label names in the titles.

### 1. Data Prep.
Normalize the dataset and split it into training and test sets with a 70/30 ratio. Further split the test set into test and validation sets.

### 2. Train a Model
Train an MLP. What accuracy are you getting on the validation set? Try a few more parameter combinations and select the best combination.

### 3. Model Validation
Now using the test set, evaluate the model performance. 
Create a confusion matrix; where are most of the misclassifications occurring?  
Do the misclassifications seem to make sense based on the target labels?

### 4. Visualizing Misclassified Points

Visualize some of the misclassified points. Include in the plot title the true and predicted labels. Can you see why the model might have been confused?

### 5. Compare with other models
Try another model we have covered in class. Do any of the models outperform the MLP on the test set?