# 5.1 Introduction to Neural Networks

## Introduction

In previous modules, we explored regularized logistic regression, decision trees, and random forests for predicting student departure. Now we introduce **neural networks** (also called **deep learning** when networks have many layers) - a powerful and flexible family of models inspired by the human brain.

Neural networks have revolutionized fields like image recognition, natural language processing, and speech recognition. For tabular data like our student departure dataset, they offer an alternative approach that can sometimes capture complex patterns that other models miss.

### Learning Objectives

By the end of this notebook, you will be able to:

1. Explain the connection between logistic regression and neural networks
2. Describe the components of a neural network: neurons, layers, and connections
3. Understand different activation functions and when to use them
4. Explain the backpropagation algorithm at a conceptual level
5. Compare neural networks with tree-based models for tabular data

## 1. From Logistic Regression to Neural Networks

### 1.1 The Perceptron: The Building Block

A **perceptron** (or neuron) is the fundamental building block of neural networks. Remarkably, it's almost identical to logistic regression!

**Logistic Regression:**
$$\hat{y} = \sigma(w_1x_1 + w_2x_2 + ... + w_nx_n + b) = \sigma(\vec{w} \cdot \vec{x} + b)$$

**Perceptron:**
$$\text{output} = f(w_1x_1 + w_2x_2 + ... + w_nx_n + b) = f(\vec{w} \cdot \vec{x} + b)$$

The only difference is that neural networks can use different **activation functions** $f$ (not just the sigmoid $\sigma$).

**Key insight**: Logistic regression IS a neural network - specifically, a network with one neuron and a sigmoid activation!

In [None]:
import numpy as np
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import plotly.express as px

# Visualize a single neuron/perceptron
fig = go.Figure()

# Input nodes
input_labels = ['x1', 'x2', 'x3', 'bias']
input_x = [0, 0, 0, 0]
input_y = [3, 2, 1, 0]

# Add input nodes
fig.add_trace(go.Scatter(
    x=input_x, y=input_y,
    mode='markers+text',
    marker=dict(size=40, color='lightblue', line=dict(width=2, color='darkblue')),
    text=input_labels,
    textposition='middle left',
    name='Inputs'
))

# Add neuron (output)
fig.add_trace(go.Scatter(
    x=[2], y=[1.5],
    mode='markers+text',
    marker=dict(size=60, color='lightgreen', line=dict(width=2, color='darkgreen')),
    text=['f(sum)'],
    textposition='middle center',
    name='Neuron'
))

# Add output
fig.add_trace(go.Scatter(
    x=[4], y=[1.5],
    mode='markers+text',
    marker=dict(size=40, color='lightyellow', line=dict(width=2, color='orange')),
    text=['output'],
    textposition='middle right',
    name='Output'
))

# Add connections (weights)
weight_labels = ['w1', 'w2', 'w3', 'b']
for i, (y, label) in enumerate(zip(input_y, weight_labels)):
    fig.add_trace(go.Scatter(
        x=[0.2, 1.7], y=[y, 1.5],
        mode='lines',
        line=dict(color='gray', width=2),
        showlegend=False
    ))
    # Add weight label
    mid_x = 0.95
    mid_y = (y + 1.5) / 2
    fig.add_annotation(x=mid_x, y=mid_y, text=label, showarrow=False, 
                       font=dict(size=12, color='darkblue'))

# Add output connection
fig.add_trace(go.Scatter(
    x=[2.3, 3.7], y=[1.5, 1.5],
    mode='lines',
    line=dict(color='gray', width=2),
    showlegend=False
))

fig.update_layout(
    title='A Single Neuron (Perceptron) - The Building Block',
    xaxis=dict(showgrid=False, zeroline=False, showticklabels=False, range=[-1, 5]),
    yaxis=dict(showgrid=False, zeroline=False, showticklabels=False, range=[-0.5, 4]),
    height=400,
    showlegend=False
)

fig.show()

**How a neuron works:**

1. **Inputs** ($x_1, x_2, ..., x_n$): Features from your data
2. **Weights** ($w_1, w_2, ..., w_n$): Learned parameters that determine importance
3. **Bias** ($b$): An offset term (like the intercept in linear regression)
4. **Weighted Sum**: $z = w_1x_1 + w_2x_2 + ... + w_nx_n + b$
5. **Activation Function**: $f(z)$ transforms the sum to produce the output
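
The five steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not a trained model: the input, weight, and bias values are made up, and `neuron` is a hypothetical helper name. With the sigmoid as activation, the function computes exactly what logistic regression computes.

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid, the activation used by logistic regression."""
    return 1.0 / (1.0 + np.exp(-z))

def neuron(x, w, b, activation=sigmoid):
    """Weighted sum of the inputs plus bias, passed through an activation."""
    z = np.dot(w, x) + b      # step 4: weighted sum
    return activation(z)      # step 5: activation function

# Illustrative values only (assumptions, not fitted to any data)
x = np.array([1.0, 2.0, 3.0])     # step 1: inputs
w = np.array([0.5, -0.25, 0.1])   # step 2: weights
b = 0.2                           # step 3: bias

z = np.dot(w, x) + b
out = neuron(x, w, b)
print(f"z = {z:.3f}, output = {out:.4f}")
```

Passing a different function as `activation` (for example `lambda z: np.maximum(0, z)`) turns the same code into a ReLU neuron.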

### 1.2 The Limitation of Single Neurons

A single neuron (like logistic regression) can only learn **linear decision boundaries**. It cannot capture complex, non-linear relationships in data.

**The XOR Problem**: A classic example where a single neuron fails.

In [None]:
# Demonstrate the XOR problem
# XOR: output is 1 when inputs differ, 0 when they're the same
fig = make_subplots(rows=1, cols=2, subplot_titles=(
    'Linearly Separable (AND)',
    'NOT Linearly Separable (XOR)'
))

# AND problem (linearly separable)
and_x = [0, 0, 1, 1]
and_y = [0, 1, 0, 1]
and_labels = [0, 0, 0, 1]  # AND: only 1 when both inputs are 1
and_colors = ['red' if l == 0 else 'green' for l in and_labels]

fig.add_trace(go.Scatter(
    x=and_x, y=and_y,
    mode='markers',
    marker=dict(size=20, color=and_colors),
    showlegend=False
), row=1, col=1)

# Add linear separator for AND: the line x1 + x2 = 1.5 leaves only (1, 1) on the positive side
fig.add_trace(go.Scatter(
    x=[0.3, 1.2], y=[1.2, 0.3],
    mode='lines',
    line=dict(color='blue', width=2, dash='dash'),
    showlegend=False
), row=1, col=1)

# XOR problem (NOT linearly separable)
xor_x = [0, 0, 1, 1]
xor_y = [0, 1, 0, 1]
xor_labels = [0, 1, 1, 0]  # XOR: 1 when inputs differ
xor_colors = ['red' if l == 0 else 'green' for l in xor_labels]

fig.add_trace(go.Scatter(
    x=xor_x, y=xor_y,
    mode='markers',
    marker=dict(size=20, color=xor_colors),
    showlegend=False
), row=1, col=2)

# Add annotation showing no single line can separate
fig.add_annotation(
    x=0.5, y=0.5, text='No single line<br>can separate!',
    showarrow=False, font=dict(size=12, color='red'),
    row=1, col=2
)

fig.update_xaxes(title='x1', range=[-0.3, 1.3])
fig.update_yaxes(title='x2', range=[-0.3, 1.3])
fig.update_layout(
    title='Why We Need Multiple Neurons: The XOR Problem',
    height=400
)

fig.show()

**Solution**: By combining multiple neurons in layers, neural networks can learn complex, non-linear decision boundaries - including XOR!
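
To make this concrete, here is a small sketch of a two-layer network that computes XOR. The weights are hand-picked for illustration rather than learned, and the output neuron is left linear (no activation) to keep the arithmetic easy to follow.

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

# All four XOR input combinations
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])

# Hand-picked (not learned) parameters
W_hidden = np.array([[1.0, 1.0],
                     [1.0, 1.0]])   # both hidden neurons start from x1 + x2
b_hidden = np.array([0.0, -1.0])    # the second neuron only fires when both inputs are 1
w_out = np.array([1.0, -2.0])       # subtract twice the "both on" signal

hidden = relu(X @ W_hidden + b_hidden)
xor_out = hidden @ w_out
print(xor_out)   # [0. 1. 1. 0.]
```

The hidden layer maps the four points into a new space where a single line can separate them, which is what a lone neuron cannot do on the raw inputs.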

## 2. Neural Network Architecture

### 2.1 Layers: Input, Hidden, and Output

A neural network organizes neurons into **layers**:

1. **Input Layer**: Receives the features (no computation, just passes data forward)
2. **Hidden Layers**: Where the "magic" happens - learns complex patterns
3. **Output Layer**: Produces the final prediction

**Terminology:**
- **Dense/Fully Connected Layer**: Every neuron connects to every neuron in the next layer
- **Deep Network**: A network with multiple hidden layers
- **Shallow Network**: A network with one or no hidden layers

### 2.2 Visualizing Network Architecture

In [None]:
def draw_neural_network(layer_sizes, layer_names=None, title='Neural Network Architecture'):
    """
    Draw a neural network diagram using plotly.
    
    Parameters:
    -----------
    layer_sizes : list
        Number of neurons in each layer (e.g., [10, 8, 4, 1])
    layer_names : list, optional
        Names for each layer
    title : str
        Title for the plot
    """
    fig = go.Figure()
    
    n_layers = len(layer_sizes)
    max_neurons = max(layer_sizes)
    
    # Colors for different layer types
    colors = ['lightblue'] + ['lightgreen'] * (n_layers - 2) + ['lightyellow']
    border_colors = ['darkblue'] + ['darkgreen'] * (n_layers - 2) + ['orange']
    
    if layer_names is None:
        layer_names = ['Input'] + [f'Hidden {i+1}' for i in range(n_layers - 2)] + ['Output']
    
    # Draw neurons and connections
    for layer_idx, n_neurons in enumerate(layer_sizes):
        x = layer_idx * 2
        
        # Center neurons vertically
        start_y = (max_neurons - n_neurons) / 2
        
        for neuron_idx in range(n_neurons):
            y = start_y + neuron_idx
            
            # Draw connections to next layer
            if layer_idx < n_layers - 1:
                next_n_neurons = layer_sizes[layer_idx + 1]
                next_start_y = (max_neurons - next_n_neurons) / 2
                
                for next_neuron_idx in range(next_n_neurons):
                    next_y = next_start_y + next_neuron_idx
                    fig.add_trace(go.Scatter(
                        x=[x, x + 2],
                        y=[y, next_y],
                        mode='lines',
                        line=dict(color='lightgray', width=0.5),
                        showlegend=False,
                        hoverinfo='skip'
                    ))
            
            # Draw neuron
            fig.add_trace(go.Scatter(
                x=[x], y=[y],
                mode='markers',
                marker=dict(
                    size=25,
                    color=colors[layer_idx],
                    line=dict(width=2, color=border_colors[layer_idx])
                ),
                showlegend=False,
                hoverinfo='skip'
            ))
        
        # Add layer label
        fig.add_annotation(
            x=x, y=-1,
            text=f'{layer_names[layer_idx]}<br>({n_neurons} neurons)',
            showarrow=False,
            font=dict(size=10)
        )
    
    fig.update_layout(
        title=title,
        xaxis=dict(showgrid=False, zeroline=False, showticklabels=False),
        yaxis=dict(showgrid=False, zeroline=False, showticklabels=False),
        height=500,
        plot_bgcolor='white'
    )
    
    return fig

# Example: A network for our student departure problem
# Input: 10 features, Hidden layers: 8 and 4 neurons, Output: 1 (binary classification)
fig = draw_neural_network(
    layer_sizes=[10, 8, 4, 1],
    layer_names=['Input<br>(Features)', 'Hidden 1', 'Hidden 2', 'Output<br>(Departed?)'],
    title='Example Neural Network for Student Departure Prediction'
)
fig.show()

**Architecture Explanation:**

- **Input Layer (10 neurons)**: One for each feature (GPA, DFW rate, demographics, etc.)
- **Hidden Layer 1 (8 neurons)**: Learns initial patterns from raw features
- **Hidden Layer 2 (4 neurons)**: Combines patterns into higher-level representations
- **Output Layer (1 neuron)**: Produces probability of departure (0 to 1)
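
As a rough sketch of how data moves through this [10, 8, 4, 1] architecture, the code below pushes a small batch through randomly initialized layers and prints the shape after each one. The weights are random placeholders and the five input rows are synthetic, so the resulting probabilities carry no meaning; only the shapes and the parameter count are of interest.

```python
import numpy as np

rng = np.random.default_rng(0)
layer_sizes = [10, 8, 4, 1]

# Random placeholder parameters; a trained network would learn these
weights = [rng.normal(0, 0.1, size=(n_in, n_out))
           for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]
biases = [np.zeros(n_out) for n_out in layer_sizes[1:]]

def forward(X, weights, biases):
    """ReLU on hidden layers, sigmoid on the output layer."""
    a = X
    for i, (W, b) in enumerate(zip(weights, biases)):
        z = a @ W + b
        is_output = (i == len(weights) - 1)
        a = 1 / (1 + np.exp(-z)) if is_output else np.maximum(0, z)
        print(f"layer {i + 1}: activations shape {a.shape}")
    return a

X_batch = rng.normal(size=(5, 10))   # 5 synthetic rows, 10 features each
probs = forward(X_batch, weights, biases)

n_params = sum(W.size + b.size for W, b in zip(weights, biases))
print(f"total parameters: {n_params}")   # 88 + 36 + 5 = 129
```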

### 2.3 Network Depth and Width

Two key decisions when designing a neural network:

**Depth (Number of Layers):**
- More layers can learn more complex patterns
- But deeper networks are harder to train and prone to overfitting
- For tabular data, 1-3 hidden layers is usually sufficient

**Width (Neurons per Layer):**
- More neurons can capture more nuances
- But more neurons mean more parameters and risk of overfitting
- Common pattern: start wide, narrow toward output ("funnel" shape)

In [None]:
# Compare different architectures by parameter count

architectures = [
    {'sizes': [10, 4, 1], 'params': 'Few parameters'},
    {'sizes': [10, 8, 4, 1], 'params': 'Moderate parameters'},
    {'sizes': [10, 16, 8, 4, 1], 'params': 'Many parameters'}
]

# Calculate parameters for each architecture
for arch in architectures:
    total_params = 0
    for i in range(len(arch['sizes']) - 1):
        # weights + biases
        total_params += arch['sizes'][i] * arch['sizes'][i+1] + arch['sizes'][i+1]
    arch['total_params'] = total_params

# Create comparison bar chart
fig = go.Figure()

# Plotly renders '<br>' (not '\n') as a line break in axis labels
names = ['Shallow<br>[10, 4, 1]', 'Medium<br>[10, 8, 4, 1]', 'Deep<br>[10, 16, 8, 4, 1]']
params = [arch['total_params'] for arch in architectures]

fig.add_trace(go.Bar(
    name='Total Parameters',
    x=names,
    y=params,
    marker_color='darkblue',
    text=params,
    textposition='outside'
))

fig.update_layout(
    title='Network Architecture Comparison: Parameter Count',
    xaxis_title='Architecture',
    yaxis_title='Number of Parameters',
    height=400
)

fig.show()

print("Parameter breakdown:")
for name, arch in zip(names, architectures):
    print(f"\n{name.replace('<br>', ' ')}:")
    sizes = arch['sizes']
    for i in range(len(sizes) - 1):
        weights = sizes[i] * sizes[i+1]
        biases = sizes[i+1]
        print(f"  Layer {i+1}: {sizes[i]} -> {sizes[i+1]} = {weights} weights + {biases} biases = {weights + biases}")
    print(f"  Total: {arch['total_params']} parameters")

**Key Insight**: For our student departure dataset (~10 features), a network with 1-2 hidden layers of 8-32 neurons each is a good starting point. We don't need a very deep network for tabular data.

## 3. Activation Functions

### 3.1 Why Activation Functions Matter

Without activation functions, a neural network would just be a series of linear transformations - mathematically equivalent to a single linear model!

**Composition of linear functions = still linear.** For example, with $g(x) = ax + b$ and $f(u) = cu + d$:

$$f(g(x)) = f(ax + b) = c(ax + b) + d = (ca)x + (cb + d)$$

Activation functions introduce **non-linearity**, allowing neural networks to learn complex patterns.
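
The same collapse can be checked numerically for whole layers. The sketch below uses random matrices (the names `W_a`, `b_a`, `W_b`, `b_b` are placeholders) to show that two stacked linear layers equal one linear layer with weights `W_a @ W_b` and bias `b_a @ W_b + b_b`, and that inserting a ReLU between them generally breaks that equivalence.

```python
import numpy as np

rng = np.random.default_rng(1)
W_a, b_a = rng.normal(size=(3, 4)), rng.normal(size=4)   # first layer: 3 -> 4
W_b, b_b = rng.normal(size=(4, 2)), rng.normal(size=2)   # second layer: 4 -> 2
x = rng.normal(size=(6, 3))                              # 6 synthetic input rows

# Two linear layers with no activation in between
two_layers = (x @ W_a + b_a) @ W_b + b_b

# The equivalent single linear layer
W_single = W_a @ W_b
b_single = b_a @ W_b + b_b
one_layer = x @ W_single + b_single

print("Same output without activation:", np.allclose(two_layers, one_layer))

# With a ReLU in between, the result is in general no longer a linear function of x
with_relu = np.maximum(0, x @ W_a + b_a) @ W_b + b_b
print("Same output with ReLU:", np.allclose(with_relu, one_layer))
```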

### 3.2 Common Activation Functions

In [None]:
# Visualize common activation functions
x = np.linspace(-5, 5, 200)

# Define activation functions
def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def tanh(x):
    return np.tanh(x)

def relu(x):
    return np.maximum(0, x)

def leaky_relu(x, alpha=0.1):
    return np.where(x > 0, x, alpha * x)

# Create subplots
fig = make_subplots(rows=2, cols=2, subplot_titles=(
    'Sigmoid', 'Tanh', 'ReLU', 'Leaky ReLU'
))

# Sigmoid
fig.add_trace(go.Scatter(x=x, y=sigmoid(x), mode='lines', 
                         line=dict(color='blue', width=3), name='Sigmoid'), row=1, col=1)
fig.add_hline(y=0.5, line_dash='dash', line_color='gray', row=1, col=1)

# Tanh
fig.add_trace(go.Scatter(x=x, y=tanh(x), mode='lines', 
                         line=dict(color='green', width=3), name='Tanh'), row=1, col=2)
fig.add_hline(y=0, line_dash='dash', line_color='gray', row=1, col=2)

# ReLU
fig.add_trace(go.Scatter(x=x, y=relu(x), mode='lines', 
                         line=dict(color='red', width=3), name='ReLU'), row=2, col=1)

# Leaky ReLU
fig.add_trace(go.Scatter(x=x, y=leaky_relu(x), mode='lines', 
                         line=dict(color='orange', width=3), name='Leaky ReLU'), row=2, col=2)

fig.update_xaxes(title='z (input)')
fig.update_yaxes(title='f(z) (output)')
fig.update_layout(
    title='Common Activation Functions',
    height=600,
    showlegend=False
)

fig.show()

### Activation Function Summary

| Function | Formula | Range | Use Case |
|:---------|:--------|:------|:---------|
| **Sigmoid** | $\frac{1}{1+e^{-z}}$ | (0, 1) | Output layer for binary classification |
| **Tanh** | $\frac{e^z - e^{-z}}{e^z + e^{-z}}$ | $(-1, 1)$ | Hidden layers (zero-centered output) |
| **ReLU** | $\max(0, z)$ | $[0, \infty)$ | Most common for hidden layers |
| **Leaky ReLU** | $\max(\alpha z, z)$ with small $\alpha$ (commonly 0.01; the plot above uses 0.1 so the slope is visible) | $(-\infty, \infty)$ | Prevents "dying ReLU" problem |

### 3.3 Choosing the Right Activation

**For Hidden Layers:**
- **ReLU** is the default choice (simple, fast, works well)
- Use **Leaky ReLU** if you encounter "dying neurons" (neurons whose output is stuck at 0 for every input, so they stop learning)

**For Output Layer:**
- **Sigmoid**: Binary classification (outputs probability 0-1)
- **Softmax**: Multi-class classification (outputs probability distribution)
- **Linear (no activation)**: Regression (predicting continuous values)

**For our student departure problem:**
- Hidden layers: ReLU
- Output layer: Sigmoid (binary classification: departed or retained)
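
Softmax is the only output activation listed above that is not plotted in this notebook, so here is a minimal NumPy sketch of it. The logits are made-up numbers. The last lines check a useful connection: for two classes with logits `[z, 0]`, softmax gives the same probability as the sigmoid of `z`.

```python
import numpy as np

def softmax(z):
    """Convert a vector of scores (logits) into probabilities that sum to 1."""
    shifted = z - np.max(z, axis=-1, keepdims=True)   # subtract the max for numerical stability
    exp_z = np.exp(shifted)
    return exp_z / np.sum(exp_z, axis=-1, keepdims=True)

logits = np.array([2.0, 1.0, 0.1])   # illustrative scores for three classes
probs = softmax(logits)
print(probs.round(3), "sum =", probs.sum())

# Two-class softmax with logits [z, 0] matches sigmoid(z)
z = 0.7
two_class = softmax(np.array([z, 0.0]))[0]
sigmoid_z = 1 / (1 + np.exp(-z))
print(np.isclose(two_class, sigmoid_z))
```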

In [None]:
# Demonstrate why ReLU is preferred: derivative comparison
x = np.linspace(-3, 3, 200)

# Derivatives
def sigmoid_derivative(x):
    s = sigmoid(x)
    return s * (1 - s)

def relu_derivative(x):
    return np.where(x > 0, 1, 0)

fig = make_subplots(rows=1, cols=2, subplot_titles=(
    'Sigmoid Derivative', 'ReLU Derivative'
))

fig.add_trace(go.Scatter(x=x, y=sigmoid_derivative(x), mode='lines',
                         line=dict(color='blue', width=3)), row=1, col=1)
fig.add_trace(go.Scatter(x=x, y=relu_derivative(x), mode='lines',
                         line=dict(color='red', width=3)), row=1, col=2)

fig.update_xaxes(title='z')
fig.update_yaxes(title="f'(z)")
fig.update_layout(
    title='Why ReLU Trains Faster: Gradient Comparison',
    height=350,
    showlegend=False
)

fig.add_annotation(
    x=0, y=0.25, text='Max gradient<br>= 0.25 (at z = 0)',
    showarrow=True, arrowhead=2, row=1, col=1
)
fig.add_annotation(
    x=1.5, y=1, text='Gradient = 1<br>for z > 0',
    showarrow=True, arrowhead=2, row=1, col=2
)

fig.show()

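**Key Insight**: Sigmoid's gradient is at most 0.25 (at $z = 0$) and shrinks toward 0 for large $|z|$, while ReLU's gradient is exactly 1 for every positive input. Backpropagation multiplies these factors layer by layer, so sigmoid gradients can shrink rapidly in deep networks (the "vanishing gradient" problem). ReLU mitigates this, which usually makes training faster.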
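
A back-of-the-envelope calculation shows how quickly this adds up. The sketch below looks only at the activation-derivative factor in the chain rule and ignores the weights, which also scale the gradient in a real network, so treat it as an upper bound on the sigmoid side rather than a simulation of training.

```python
max_sigmoid_grad = 0.25   # largest value the sigmoid derivative can take (at z = 0)
relu_grad = 1.0           # ReLU derivative for any z > 0

factors = {}
for n_layers in [1, 5, 10]:
    sig = max_sigmoid_grad ** n_layers   # best case for sigmoid after n layers
    rel = relu_grad ** n_layers          # active ReLU units pass the gradient unchanged
    factors[n_layers] = (sig, rel)
    print(f"{n_layers:>2} layers: sigmoid factor <= {sig:.2e}, ReLU factor = {rel:.0f}")
```

After 10 sigmoid layers the factor is below one in a million even in the best case, while active ReLU units leave it at 1.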

## 4. How Neural Networks Learn: Backpropagation

Neural networks learn through an iterative process of:
1. Making predictions (forward pass)
2. Measuring error (loss function)
3. Computing gradients (backpropagation)
4. Updating weights (optimization)

Let's walk through each step.

### 4.1 Forward Pass

Data flows from input to output, layer by layer:

1. Input features enter the network
2. Each layer computes: $\text{output} = f(W \cdot \text{input} + b)$
3. Final layer produces prediction

In [None]:
# Demonstrate forward pass with a simple example
np.random.seed(42)

# Simple network: 2 inputs -> 2 hidden -> 1 output
# Initialize weights and biases
W1 = np.array([[0.1, 0.2], [0.3, 0.4]])  # 2x2
b1 = np.array([0.1, 0.1])  # 2
W2 = np.array([[0.5], [0.6]])  # 2x1
b2 = np.array([0.1])  # 1

# Input example (e.g., GPA=3.5, DFW_rate=0.1)
x = np.array([3.5, 0.1])

print("Forward Pass Example")
print("="*50)
print(f"\nInput: x = {x}")

# Layer 1
z1 = np.dot(x, W1) + b1
print(f"\nLayer 1 (before activation):")
print(f"  z1 = x * W1 + b1 = {z1}")

a1 = relu(z1)  # ReLU activation
print(f"  a1 = ReLU(z1) = {a1}")

# Layer 2 (output)
z2 = np.dot(a1, W2) + b2
print(f"\nLayer 2 (before activation):")
print(f"  z2 = a1 * W2 + b2 = {z2}")

output = sigmoid(z2)  # Sigmoid for final output
print(f"  output = Sigmoid(z2) = {output}")

print(f"\nFinal prediction: {output[0]:.4f}")
print(f"Interpretation: {output[0]*100:.1f}% probability of departure")

### 4.2 Loss Function

The **loss function** measures how wrong our predictions are. For binary classification, we use **binary cross-entropy**:

$$\text{Loss} = -\frac{1}{N}\sum_{i=1}^{N}[y_i \log(\hat{y}_i) + (1-y_i)\log(1-\hat{y}_i)]$$

This is the same loss function used in logistic regression!
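
The formula translates almost directly into NumPy. In this sketch the labels and predictions are invented for illustration, and the `eps` clipping is an added safeguard (not part of the formula) that keeps `log(0)` from producing infinities.

```python
import numpy as np

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    """Mean binary cross-entropy over N examples."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.clip(np.asarray(y_pred, dtype=float), eps, 1 - eps)  # avoid log(0)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

labels = [1, 0, 1, 0]                    # illustrative true labels
good_preds = [0.9, 0.1, 0.8, 0.2]        # mostly right and fairly confident
bad_preds = [0.1, 0.9, 0.2, 0.8]         # confidently wrong

loss_good = binary_cross_entropy(labels, good_preds)
loss_bad = binary_cross_entropy(labels, bad_preds)
print(f"good predictions: {loss_good:.4f}")
print(f"bad predictions:  {loss_bad:.4f}")
```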

In [None]:
# Visualize binary cross-entropy loss
y_pred = np.linspace(0.01, 0.99, 100)

# Loss when true label is 1
loss_y1 = -np.log(y_pred)
# Loss when true label is 0
loss_y0 = -np.log(1 - y_pred)

fig = go.Figure()

fig.add_trace(go.Scatter(
    x=y_pred, y=loss_y1,
    mode='lines',
    name='True Label = 1 (Departed)',
    line=dict(color='red', width=3)
))

fig.add_trace(go.Scatter(
    x=y_pred, y=loss_y0,
    mode='lines',
    name='True Label = 0 (Retained)',
    line=dict(color='blue', width=3)
))

fig.update_layout(
    title='Binary Cross-Entropy Loss',
    xaxis_title='Predicted Probability',
    yaxis_title='Loss',
    height=400
)

fig.add_annotation(
    x=0.9, y=-np.log(0.9),
    text='Low loss (correct)',
    showarrow=True, arrowhead=2
)

fig.add_annotation(
    x=0.1, y=-np.log(0.1),
    text='High loss (wrong)',
    showarrow=True, arrowhead=2
)

fig.show()

**Interpretation**: When the true label is 1 (departed), predicting 0.9 gives low loss, while predicting 0.1 gives very high loss. The loss function penalizes confident wrong predictions heavily.
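
To put numbers on this, using the natural logarithm and a true label of 1:

$$-\log(0.9) \approx 0.105, \qquad -\log(0.1) \approx 2.303, \qquad -\log(0.01) \approx 4.605$$

Moving the prediction from 0.9 to 0.1 raises the loss by a factor of about 22, and the loss keeps growing without bound as the predicted probability approaches 0.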

### 4.3 Backward Pass (Backpropagation)

**Backpropagation** computes how much each weight contributed to the error using the **chain rule** from calculus.

Starting from the output:
1. Compute how the loss changes with respect to the output
2. Propagate this error backward through each layer
3. Calculate gradients for each weight and bias

**Intuition**: Backpropagation answers "How much would the loss change if I slightly changed this weight?"
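
That intuition can be tested directly. The sketch below reuses the small 2 -> 2 -> 1 network from the forward pass example, assumes a true label of `y = 1` (an assumption made only for this illustration), computes the gradients by hand with the chain rule, and then compares one of them against the loss change measured by nudging that weight slightly.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def forward_loss(W1, b1, W2, b2, x, y):
    """Forward pass plus binary cross-entropy for a single example."""
    z1 = x @ W1 + b1
    a1 = np.maximum(0, z1)            # ReLU hidden layer
    z2 = a1 @ W2 + b2
    y_hat = sigmoid(z2)[0]            # sigmoid output
    loss = -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
    return loss, (z1, a1, y_hat)

W1 = np.array([[0.1, 0.2], [0.3, 0.4]])
b1 = np.array([0.1, 0.1])
W2 = np.array([[0.5], [0.6]])
b2 = np.array([0.1])
x = np.array([3.5, 0.1])
y = 1.0   # assumed true label for this illustration

loss, (z1, a1, y_hat) = forward_loss(W1, b1, W2, b2, x, y)

# Backward pass: apply the chain rule from the output back to the input
dz2 = np.array([y_hat - y])    # sigmoid + cross-entropy simplifies to (prediction - label)
dW2 = np.outer(a1, dz2)        # gradient for output weights
db2 = dz2
da1 = W2 @ dz2                 # error passed back to the hidden layer
dz1 = da1 * (z1 > 0)           # ReLU passes gradient only where z1 > 0
dW1 = np.outer(x, dz1)         # gradient for hidden weights
db1 = dz1

# Numerical check: nudge one weight and measure the change in loss
eps = 1e-6
W1_plus, W1_minus = W1.copy(), W1.copy()
W1_plus[0, 0] += eps
W1_minus[0, 0] -= eps
numeric = (forward_loss(W1_plus, b1, W2, b2, x, y)[0]
           - forward_loss(W1_minus, b1, W2, b2, x, y)[0]) / (2 * eps)

print(f"backprop gradient for W1[0, 0]:  {dW1[0, 0]:.6f}")
print(f"numerical gradient for W1[0, 0]: {numeric:.6f}")
```

The two numbers should agree to several decimal places. Frameworks such as TensorFlow perform this bookkeeping automatically for every parameter.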

In [None]:
# Visualize the backpropagation concept
fig = go.Figure()

# Forward pass arrows (blue)
fig.add_trace(go.Scatter(
    x=[0, 1, 2, 3], y=[0, 0, 0, 0],
    mode='lines+markers+text',
    line=dict(color='blue', width=3),
    marker=dict(size=30, color='lightblue', line=dict(width=2, color='blue')),
    text=['Input', 'Hidden 1', 'Hidden 2', 'Output'],
    textposition='bottom center',
    name='Forward Pass'
))

# Add forward arrows
for i in range(3):
    fig.add_annotation(
        x=i+0.5, y=0.1,
        ax=i+0.3, ay=0.1,
        xref='x', yref='y',
        axref='x', ayref='y',
        showarrow=True,
        arrowhead=2,
        arrowcolor='blue'
    )

# Backward pass arrows (red)
fig.add_trace(go.Scatter(
    x=[3, 2, 1, 0], y=[-0.5, -0.5, -0.5, -0.5],
    mode='lines+markers+text',
    line=dict(color='red', width=3),
    marker=dict(size=30, color='lightyellow', line=dict(width=2, color='red')),
    text=['Loss', 'Gradient', 'Gradient', 'Gradient'],
    textposition='top center',
    name='Backward Pass'
))

# Add backward arrows
for i in range(3, 0, -1):
    fig.add_annotation(
        x=i-0.5, y=-0.4,
        ax=i-0.3, ay=-0.4,
        xref='x', yref='y',
        axref='x', ayref='y',
        showarrow=True,
        arrowhead=2,
        arrowcolor='red'
    )

fig.add_annotation(
    x=1.5, y=0.3,
    text='Forward Pass: Compute predictions',
    showarrow=False,
    font=dict(color='blue', size=14)
)

fig.add_annotation(
    x=1.5, y=-0.8,
    text='Backward Pass: Compute gradients',
    showarrow=False,
    font=dict(color='red', size=14)
)

fig.update_layout(
    title='Backpropagation: Forward and Backward Passes',
    xaxis=dict(showgrid=False, zeroline=False, showticklabels=False, range=[-0.5, 3.5]),
    yaxis=dict(showgrid=False, zeroline=False, showticklabels=False, range=[-1, 0.6]),
    height=350,
    showlegend=False
)

fig.show()

### 4.4 Gradient Descent Optimization

Once we have gradients, we update weights to minimize the loss:

$$w_{new} = w_{old} - \eta \cdot \frac{\partial \text{Loss}}{\partial w}$$

Where $\eta$ is the **learning rate** - how big of steps we take.
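
As a worked step, take the toy loss $\text{Loss}(w) = (w - 2)^2 + 1$ used in the next cell, a starting weight $w_{old} = 5$, and $\eta = 0.1$:

$$\frac{\partial \text{Loss}}{\partial w} = 2(w_{old} - 2) = 6, \qquad w_{new} = 5 - 0.1 \cdot 6 = 4.4$$

The weight moves toward the minimum at $w = 2$, and repeating the step moves it closer each time.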

In [None]:
# Visualize gradient descent with different learning rates
def loss_function(w):
    return (w - 2)**2 + 1

def gradient(w):
    return 2 * (w - 2)

w_range = np.linspace(-2, 6, 100)

fig = make_subplots(rows=1, cols=3, subplot_titles=(
    'Learning Rate = 0.1 (Too Small)',
    'Learning Rate = 0.5 (Just Right)',
    'Learning Rate = 1.1 (Too Large)'
))

learning_rates = [0.1, 0.5, 1.1]
colors = ['blue', 'green', 'red']

for col, (lr, color) in enumerate(zip(learning_rates, colors), 1):
    # Plot loss function
    fig.add_trace(go.Scatter(
        x=w_range, y=loss_function(w_range),
        mode='lines',
        line=dict(color='gray', width=2),
        showlegend=False
    ), row=1, col=col)
    
    # Simulate gradient descent
    w = 5.0  # Start point
    path_w = [w]
    path_loss = [loss_function(w)]
    
    for _ in range(10):
        w = w - lr * gradient(w)
        path_w.append(w)
        path_loss.append(loss_function(w))
    
    # Plot path
    fig.add_trace(go.Scatter(
        x=path_w, y=path_loss,
        mode='lines+markers',
        line=dict(color=color, width=2),
        marker=dict(size=8),
        showlegend=False
    ), row=1, col=col)

fig.update_xaxes(title='Weight (w)')
fig.update_yaxes(title='Loss')
fig.update_layout(
    title='Effect of Learning Rate on Gradient Descent',
    height=350
)

fig.show()

**Key Observations:**
- **Too small**: Slow convergence, may get stuck
- **Just right**: Smooth convergence to minimum
- **Too large**: Overshoots, may diverge

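Modern optimizers like **Adam** adapt the effective step size for each parameter during training, using running averages of past gradients. You still choose a base learning rate, but training is typically less sensitive to it than with plain gradient descent.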
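
For readers curious about what Adam does internally, here is a minimal single-parameter sketch of its update rule, applied to the same toy loss $(w - 2)^2 + 1$. The hyperparameter values are the commonly cited defaults apart from the learning rate, and `adam_minimize` is a hypothetical helper written for this illustration, not a library function.

```python
import numpy as np

def gradient(w):
    """Gradient of the toy loss (w - 2)^2 + 1."""
    return 2 * (w - 2)

def adam_minimize(w0, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8, n_steps=1000):
    w, m, v = w0, 0.0, 0.0
    for t in range(1, n_steps + 1):
        g = gradient(w)
        m = beta1 * m + (1 - beta1) * g         # running average of gradients
        v = beta2 * v + (1 - beta2) * g ** 2    # running average of squared gradients
        m_hat = m / (1 - beta1 ** t)            # bias correction for early steps
        v_hat = v / (1 - beta2 ** t)
        w = w - lr * m_hat / (np.sqrt(v_hat) + eps)   # step scaled by gradient history
    return w

w_final = adam_minimize(w0=5.0)
print(f"w after Adam: {w_final:.4f} (minimum is at w = 2)")
```

In practice you would use the optimizer provided by your framework rather than writing this by hand.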

## 5. Neural Networks vs. Traditional ML Models

### 5.1 When to Use Neural Networks

**Neural networks excel at:**
- Large datasets (more data = better performance)
- Complex patterns with many interactions
- Image, text, and sequential data
- Problems where feature engineering is difficult

**Tree-based models (Random Forests, Gradient Boosting) excel at:**
- Smaller tabular datasets
- Mixed feature types (categorical + numerical)
- When interpretability matters
- When training time is limited

### 5.2 Comparison with Tree-Based Models

In [None]:
# Create comparison table
import pandas as pd

comparison_data = {
    'Aspect': [
        'Data Size Needed',
        'Training Speed',
        'Handles Missing Data',
        'Feature Scaling Required',
        'Handles Categorical Features',
        'Interpretability',
        'Hyperparameter Tuning',
        'Risk of Overfitting',
        'Capture Non-linear Patterns',
        'GPU Acceleration'
    ],
    'Neural Networks': [
        'Large (1000s+)',
        'Slow',
        'No (requires preprocessing)',
        'Yes (critical)',
        'Requires encoding',
        'Low (black box)',
        'Complex (many hyperparameters)',
        'High',
        'Excellent',
        'Yes'
    ],
    'Random Forests': [
        'Small to Medium (100s)',
        'Fast',
        'Varies by implementation',
        'No',
        'Varies by implementation (often needs encoding)',
        'Medium',
        'Easier (fewer hyperparameters)',
        'Low (ensemble averaging)',
        'Good',
        'No'
    ],
    'Logistic Regression': [
        'Small (100s)',
        'Very Fast',
        'No (requires preprocessing)',
        'Recommended',
        'Requires encoding',
        'High (coefficients)',
        'Simple (C, penalty)',
        'Medium',
        'Limited (linear only)',
        'No'
    ]
}

comparison_df = pd.DataFrame(comparison_data)
comparison_df

In [None]:
# Visualize model comparison for tabular data
models = ['Logistic Regression', 'Random Forest', 'Neural Network']
metrics = ['Interpretability', 'Training Speed', 'Small Data Performance', 
           'Large Data Performance', 'Non-linear Patterns']

# Scores (1-5 scale)
scores = {
    'Logistic Regression': [5, 5, 4, 2, 1],
    'Random Forest': [3, 4, 4, 3, 4],
    'Neural Network': [1, 2, 2, 5, 5]
}

fig = go.Figure()

colors = ['blue', 'green', 'red']
for model, color in zip(models, colors):
    fig.add_trace(go.Scatterpolar(
        r=scores[model] + [scores[model][0]],  # Close the polygon
        theta=metrics + [metrics[0]],
        fill='toself',
        name=model,
        line=dict(color=color)
    ))

fig.update_layout(
    polar=dict(
        radialaxis=dict(visible=True, range=[0, 5])
    ),
    title='Model Comparison for Tabular Data',
    height=500
)

fig.show()

**Key Insight for Student Departure Prediction:**

Our dataset has:
- ~5000 students (moderate size)
- ~10 features (small)
- Mix of numeric and categorical

This is actually a case where **tree-based models often perform as well as or better than** neural networks. However, neural networks are worth trying because:
1. They may capture different patterns
2. They can be easily extended with more data
3. Understanding neural networks is valuable for more complex problems

## 6. Neural Networks for Student Departure Prediction

How will we apply neural networks to our student departure problem?

**Input Features (10 neurons):**
- Academic: HS_GPA, GPA_1, GPA_2, DFW_RATE_1, DFW_RATE_2
- Course load: UNITS_ATTEMPTED_1, UNITS_ATTEMPTED_2
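- Demographics: GENDER, RACE_ETHNICITY, FIRST_GEN_STATUS (categorical; one-hot encoding a feature with more than two categories adds columns, so the input layer may end up wider than 10)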

**Architecture Options:**
1. Shallow: Input -> 8 neurons -> Output
2. Medium: Input -> 16 -> 8 -> Output
3. Deeper: Input -> 32 -> 16 -> 8 -> Output

**Output (1 neuron):**
- Probability of departure (0 to 1)
- Using sigmoid activation
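
Before features like these reach the input layer, numeric columns are usually standardized and categorical columns are encoded. The sketch below shows one way to do that with pandas. The four rows and the category values (`'F'`, `'M'`, `'Yes'`, `'No'`) are invented for illustration and are not taken from the course dataset; only the column names come from the list above.

```python
import pandas as pd

# Hypothetical rows for illustration only
students = pd.DataFrame({
    'HS_GPA': [3.2, 3.8, 2.9, 3.5],
    'UNITS_ATTEMPTED_1': [12, 15, 9, 14],
    'GENDER': ['F', 'M', 'F', 'M'],
    'FIRST_GEN_STATUS': ['Yes', 'No', 'No', 'Yes'],
})

numeric_cols = ['HS_GPA', 'UNITS_ATTEMPTED_1']
categorical_cols = ['GENDER', 'FIRST_GEN_STATUS']

# Standardize numeric columns to mean 0 and standard deviation 1
numeric = students[numeric_cols]
scaled = (numeric - numeric.mean()) / numeric.std(ddof=0)

# One-hot encode categorical columns (one new column per category)
encoded = pd.get_dummies(students[categorical_cols], dtype=float)

X = pd.concat([scaled, encoded], axis=1)
print(X.shape)
print(list(X.columns))
```

Note that the two categorical columns became four input columns. On real data, the scaling statistics should be computed on the training split only and then reused for validation and test data.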

In [None]:
# Visualize our proposed architecture for student departure
fig = draw_neural_network(
    layer_sizes=[10, 16, 8, 1],
    layer_names=['Input<br>(10 features)', 'Hidden 1<br>(ReLU)', 'Hidden 2<br>(ReLU)', 'Output<br>(Sigmoid)'],
    title='Proposed Neural Network for Student Departure Prediction'
)

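# Add feature labels next to the input neurons
feature_names = ['HS_GPA', 'GPA_1', 'GPA_2', 'DFW_1', 'DFW_2', 
                 'UNITS_1', 'UNITS_2', 'GENDER', 'RACE', 'FIRST_GEN']
# Input neurons sit at x=0, centered vertically against the widest layer (16 neurons)
start_y = (16 - len(feature_names)) / 2
for i, name in enumerate(feature_names):
    fig.add_annotation(x=-0.2, y=start_y + i, text=name, showarrow=False,
                       xanchor='right', font=dict(size=9))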

fig.show()

**In the upcoming notebooks, we will:**

1. **5.2 Build**: Create neural networks using TensorFlow/Keras
2. **5.3 Train**: Learn about epochs, batches, and callbacks
3. **5.4 Evaluate & Tune**: Compare with our baseline models and optimize

## 7. Summary

In this notebook, we introduced neural networks and their core concepts.

### Key Concepts

| Concept | Description |
|:--------|:------------|
| **Neuron/Perceptron** | Basic unit: weighted sum + activation function |
| **Layers** | Input (features), Hidden (learning), Output (prediction) |
| **Activation Functions** | Introduce non-linearity (ReLU, Sigmoid, Tanh) |
| **Backpropagation** | Algorithm to compute gradients for learning |
| **Gradient Descent** | Optimization method to update weights |

### Activation Function Guide

| Layer Type | Recommended Activation | Why |
|:-----------|:-----------------------|:----|
| Hidden Layers | ReLU | Fast training, mitigates vanishing gradients |
| Binary Output | Sigmoid | Outputs probability (0-1) |
| Multi-class Output | Softmax | Outputs probability distribution |
| Regression Output | Linear (none) | Continuous values |

### Neural Networks vs. Other Models

For tabular data like student departure:
- Neural networks can work well but require more tuning
- Tree-based models are often competitive or better
- Neural networks shine with large datasets and complex patterns

### Next Steps

In the next notebook, we will build our first neural network using TensorFlow and Keras.

**Proceed to:** `5.2 Build a Neural Network with Keras`