# Enhanced Neural Network Examples from Scratch with NumPy

This notebook demonstrates several tasks implemented from scratch using NumPy:

1. **Real-time Training Progress Visualization**: Monitor loss and accuracy during training.
2. **Non-Perfectly Separable Data**: Create a toy dataset with overlapping clusters and label noise.
3. **Next-Word Prediction Task**: A simple word-level next-word prediction using learned embeddings.
4. **Word-Based Language Modeling for Text Classification**: An explanation of how embeddings improve text classification.

Items 1 and 2 are covered together in Section 1, item 3 in Section 2, and item 4 in Section 3. Each section includes code with explanations; Sections 1 and 2 also include visualizations.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from IPython.display import display, clear_output

%matplotlib inline

## 1. Real-time Training Progress Visualization

In the following example, we train a simple two-layer neural network on a toy dataset. During training, we plot loss and accuracy in real time to monitor progress.

In [None]:
# Create a toy dataset with overlapping clusters and label noise
np.random.seed(0)
n_points = 200

# Define two class centers that are close (to create overlap)
mean0, mean1 = [0, 0], [1.0, 1.0]
cov = [[1.5, 0.0], [0.0, 1.5]]
X_class0 = np.random.multivariate_normal(mean0, cov, size=n_points//2)
X_class1 = np.random.multivariate_normal(mean1, cov, size=n_points//2)
X_overlap = np.vstack([X_class0, X_class1])
y_overlap = np.array([0]*(n_points//2) + [1]*(n_points//2))

# Introduce label noise: flip 10% of the labels randomly
num_noisy = int(0.10 * n_points)
noise_idx = np.random.choice(n_points, num_noisy, replace=False)
y_noisy = y_overlap.copy()
y_noisy[noise_idx] = 1 - y_noisy[noise_idx]

# Visualize the noisy, overlapping data
plt.figure(figsize=(6,4))
plt.scatter(X_overlap[:,0], X_overlap[:,1], c=y_noisy, cmap='bwr', alpha=0.8)
plt.title('Toy Data with Overlap and Label Noise')
plt.xlabel('Feature 1'); plt.ylabel('Feature 2')
plt.show()
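
Because the two clusters overlap and 10% of the labels are flipped, no classifier should be expected to reach 100% accuracy on fresh data from this distribution. As a rough sanity check, the sketch below estimates that ceiling. It assumes the exact generating parameters from the dataset cell (equal class sizes, a shared isotropic covariance), in which case the best decision rule is the perpendicular bisector between the two means and its clean-label accuracy is $\Phi(d/2)$, where $d$ is the distance between the means in units of the standard deviation. The names `normal_cdf`, `bayes_clean_acc`, and `bayes_noisy_acc` are illustrative and do not appear elsewhere in the notebook.

```python
import math

def normal_cdf(z):
    """Standard normal CDF, computed from the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Generating parameters copied from the dataset cell
mean0, mean1 = (0.0, 0.0), (1.0, 1.0)
variance = 1.5      # diagonal entry of cov
flip_rate = 0.10    # fraction of labels flipped

# Distance between the class means in units of the standard deviation
dist = math.hypot(mean1[0] - mean0[0], mean1[1] - mean0[1])
d = dist / math.sqrt(variance)

# Accuracy of the optimal rule measured against the clean labels
bayes_clean_acc = normal_cdf(d / 2.0)

# Expected agreement of that same rule with the noisy labels:
# it matches an unflipped label when it is right, a flipped one when it is wrong
bayes_noisy_acc = (1 - flip_rate) * bayes_clean_acc + flip_rate * (1 - bayes_clean_acc)

print(f"Clean-label ceiling: {bayes_clean_acc:.3f}")
print(f"Noisy-label ceiling: {bayes_noisy_acc:.3f}")
```

Under these assumptions the two values come out at roughly 0.72 and 0.67. These are expectations over the distribution; accuracy measured on the 200 training points can be higher when a model fits the noise.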

### Two-Layer Neural Network Training Function

We define a two-layer neural network (one hidden layer with ReLU and an output layer with a sigmoid) and train it with stochastic gradient descent on the binary cross-entropy loss, updating the weights after every individual sample. We also update a live plot showing loss and accuracy over epochs.
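
When backpropagation is written by hand, a finite-difference gradient check is a common way to catch sign or shape mistakes. The sketch below is a minimal check for this architecture on a single sample; it mirrors the variable names used in the notebook's training loop (`h_pre`, `error_out`, `hidden_error`), while the helper `loss_and_grads` and the random test point are assumptions made only for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss_and_grads(x, y, W1, b1, W2, b2):
    """Binary cross-entropy loss and analytic weight gradients for one sample."""
    h_pre = x.dot(W1) + b1
    h = np.maximum(0, h_pre)                  # ReLU hidden layer
    o = sigmoid(h.dot(W2) + b2)               # shape (1,)
    loss = -(y * np.log(o) + (1 - y) * np.log(1 - o)).item()
    error_out = o - y                         # gradient w.r.t. the output pre-activation
    grad_W2 = np.outer(h, error_out)
    hidden_error = (h_pre > 0) * (error_out * W2.flatten())
    grad_W1 = np.outer(x, hidden_error)
    return loss, grad_W1, grad_W2

# Random test point and weights (illustrative values only)
rng = np.random.default_rng(0)
x, y = rng.normal(size=2), 1.0
W1, b1 = rng.normal(size=(2, 4)), rng.normal(size=4)
W2, b2 = rng.normal(size=(4, 1)), np.zeros(1)

_, grad_W1, grad_W2 = loss_and_grads(x, y, W1, b1, W2, b2)

# Central finite differences on every entry of W1
eps = 1e-6
num_grad_W1 = np.zeros_like(W1)
for idx in np.ndindex(*W1.shape):
    W_plus, W_minus = W1.copy(), W1.copy()
    W_plus[idx] += eps
    W_minus[idx] -= eps
    loss_plus = loss_and_grads(x, y, W_plus, b1, W2, b2)[0]
    loss_minus = loss_and_grads(x, y, W_minus, b1, W2, b2)[0]
    num_grad_W1[idx] = (loss_plus - loss_minus) / (2 * eps)

max_abs_diff = np.max(np.abs(num_grad_W1 - grad_W1))
print("Max |numeric - analytic| for W1:", max_abs_diff)
```

A small difference suggests the analytic gradient is consistent with the loss. The check can be unreliable when a hidden pre-activation sits almost exactly at zero, because ReLU is not differentiable there.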

In [None]:
def train_two_layer_model(X, y, hidden_size=3, lr=0.1, epochs=100, seed=None):
    if seed is not None:
        np.random.seed(seed)
    n_samples, input_dim = X.shape
    output_dim = 1

    # Initialize weights and biases
    W1 = np.random.randn(input_dim, hidden_size) * 0.1
    b1 = np.zeros(hidden_size)
    W2 = np.random.randn(hidden_size, output_dim) * 0.1
    b2 = np.zeros(output_dim)

    def sigmoid(z):
        return 1 / (1 + np.exp(-z))

    def compute_accuracy(X_data, y_data):
        hidden = np.maximum(0, X_data.dot(W1) + b1)
        output = sigmoid(hidden.dot(W2) + b2)
        preds = (output >= 0.5).astype(int).flatten()
        return np.mean(preds == y_data)

    # Lists to store training metrics
    history_loss = []
    history_acc = []

    # Set up the live plot
    plt.ion()
    fig, ax = plt.subplots(figsize=(6,4))

    for epoch in range(1, epochs+1):
        epoch_losses = []
        for i in range(n_samples):
            xi = X[i]
            yi = y[i]

            # Forward pass
            h_pre = xi.dot(W1) + b1
            h = np.maximum(0, h_pre)
            o_pre = h.dot(W2) + b2
            o = sigmoid(o_pre)

            # Compute loss for the sample (1e-8 guards against log(0))
            loss = - (yi * np.log(o + 1e-8) + (1 - yi) * np.log(1 - o + 1e-8))
            epoch_losses.append(loss)

            # Backpropagation
            error_out = o - yi
            grad_W2 = np.outer(h, error_out)
            grad_b2 = error_out
            hidden_error = (h_pre > 0).astype(float) * (error_out * W2.flatten())
            grad_W1 = np.outer(xi, hidden_error)
            grad_b1 = hidden_error

            # Update weights
            W2 -= lr * grad_W2
            b2 -= lr * grad_b2
            W1 -= lr * grad_W1
            b1 -= lr * grad_b1

        # End of epoch: compute average loss and accuracy
        avg_loss = np.mean(epoch_losses)
        acc = compute_accuracy(X, y)
        history_loss.append(avg_loss)
        history_acc.append(acc)

        # Update the live plot
        ax.cla()
        ax.plot(history_loss, label='Loss')
        ax.plot(history_acc, label='Accuracy')
        ax.set_xlabel('Epoch')
        ax.set_ylabel('Metric')
        ax.set_title('Training Progress')
        ax.legend()
        clear_output(wait=True)
        display(fig)

    plt.ioff()
    # Close the figure so the inline backend does not render it a second time
    plt.close(fig)
    return W1, b1, W2, b2

# Train on the noisy overlapping data
W1_vis, b1_vis, W2_vis, b2_vis = train_two_layer_model(
    X_overlap, y_noisy, hidden_size=5, lr=0.01, epochs=100, seed=1
)
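
The weights returned by the training function can also be used to inspect what the network learned, for instance by evaluating the predicted probability on a grid over the feature plane. The following is a minimal sketch of that idea. The helpers `predict_proba` and `probability_grid` are hypothetical names, and the block uses random stand-in weights with the same shapes as the trained ones so that it runs on its own.

```python
import numpy as np

def predict_proba(X, W1, b1, W2, b2):
    """Forward pass of the two-layer network; returns P(class 1) for each row of X."""
    hidden = np.maximum(0, X.dot(W1) + b1)
    logits = hidden.dot(W2) + b2
    return (1.0 / (1.0 + np.exp(-logits))).ravel()

def probability_grid(W1, b1, W2, b2, x_range, y_range, steps=50):
    """Evaluate the network on a regular steps x steps grid of 2-D points."""
    xs = np.linspace(x_range[0], x_range[1], steps)
    ys = np.linspace(y_range[0], y_range[1], steps)
    xx, yy = np.meshgrid(xs, ys)
    grid = np.column_stack([xx.ravel(), yy.ravel()])
    probs = predict_proba(grid, W1, b1, W2, b2).reshape(xx.shape)
    return xx, yy, probs

# Stand-in weights (random, untrained) with the shapes used for hidden_size=5
rng = np.random.default_rng(0)
W1_demo, b1_demo = rng.normal(size=(2, 5)), np.zeros(5)
W2_demo, b2_demo = rng.normal(size=(5, 1)), np.zeros(1)

xx, yy, probs = probability_grid(W1_demo, b1_demo, W2_demo, b2_demo, (-4, 5), (-4, 5))
print(probs.shape, float(probs.min()), float(probs.max()))
```

In the notebook, passing `W1_vis, b1_vis, W2_vis, b2_vis` instead of the stand-in weights and drawing the result with `plt.contourf(xx, yy, probs, cmap='bwr', alpha=0.3)` underneath the scatter plot would show the learned decision regions.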

## 2. Next-Word Prediction Task (Word-Level)

Now we implement a simple next-word prediction model. We use a small corpus of four-word sequences and build a vocabulary from it. For each sequence, the first three words form the context and the model predicts the fourth word.

We use an embedding layer to represent words as dense vectors and average the context embeddings into a single vector. A hidden layer with ReLU and an output layer with softmax then predict the next word.
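
The model can be summarized in symbols. The notation is introduced here for illustration: $E$ is the embedding matrix (`embedding` in the code), $w_1, w_2, w_3$ are the context word indices, and $t$ is the target index.

$$
\bar{e} = \frac{1}{3}\sum_{i=1}^{3} E_{w_i}, \qquad
h = \max\bigl(0,\; W_1 \bar{e} + b_1\bigr), \qquad
p = \mathrm{softmax}\bigl(W_2 h + b_2\bigr), \qquad
L = -\log p_t
$$

With $\delta_s = p - \mathrm{onehot}(t)$ as the gradient with respect to the output scores, the hidden-layer gradient is $\delta_h = (W_2^{\top}\delta_s) \odot \mathbf{1}[h > 0]$, and each of the three context embeddings receives an equal share of the gradient with respect to $\bar{e}$:

$$
\frac{\partial L}{\partial E_{w_i}} = \frac{1}{3}\, W_1^{\top} \delta_h
$$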

In [None]:
# Example training sequences
sequences = [
    ["city", "of", "new", "york"],
    ["life", "in", "the", "world"],
    ["he", "is", "the", "best"],
    ["deep", "learning", "is", "fun"]
]

# Build vocabulary and mappings
vocab = sorted({word for seq in sequences for word in seq})
word_to_idx = {word: i for i, word in enumerate(vocab)}
idx_to_word = {i: word for word, i in word_to_idx.items()}

# Create training pairs: use first 3 words as context, 4th word as target
context_size = 3
X_train = []
y_train = []
for seq in sequences:
    idx_seq = [word_to_idx[w] for w in seq]
    X_train.append(idx_seq[:context_size])
    y_train.append(idx_seq[context_size])
X_train = np.array(X_train)
y_train = np.array(y_train)

# Parameters for next-word prediction model
embed_dim = 8
hidden_dim = 16
vocab_size = len(vocab)

# Initialize embedding matrix and network weights
embedding = np.random.randn(vocab_size, embed_dim) * 0.01
W1_word = np.random.randn(hidden_dim, embed_dim) * 0.01
b1_word = np.zeros((hidden_dim,))
W2_word = np.random.randn(vocab_size, hidden_dim) * 0.01
b2_word = np.zeros((vocab_size,))

def softmax(z):
    exp_z = np.exp(z - np.max(z))
    return exp_z / np.sum(exp_z)

def train_next_word(X_train, y_train, epochs=500, lr=0.1, seed=None):
    # The global declaration must come before any use of these names in the
    # function; placing it later in the body is a SyntaxError.
    global W1_word, b1_word, W2_word, b2_word
    # Note: the weights are initialized above, outside this function, so the
    # seed set here does not affect their initial values.
    if seed is not None:
        np.random.seed(seed)
    losses = []
    for epoch in range(1, epochs+1):
        epoch_loss = 0
        for context, target in zip(X_train, y_train):
            # Look up embeddings and average them
            context_vecs = embedding[context]  # shape: (context_size, embed_dim)
            context_rep = context_vecs.mean(axis=0)  # shape: (embed_dim,)

            # Forward pass
            hidden = np.maximum(0, W1_word.dot(context_rep) + b1_word)  # ReLU
            scores = W2_word.dot(hidden) + b2_word  # shape: (vocab_size,)
            probs = softmax(scores)

            # Compute cross-entropy loss
            loss = -np.log(probs[target] + 1e-8)
            epoch_loss += loss

            # Backpropagation
            dscores = probs.copy()
            dscores[target] -= 1  # derivative of softmax cross-entropy

            grad_W2 = np.outer(dscores, hidden)
            grad_b2 = dscores
            dhidden = W2_word.T.dot(dscores)
            dhidden[hidden <= 0] = 0  # ReLU backprop
            grad_W1 = np.outer(dhidden, context_rep)
            grad_b1 = dhidden

            # Gradient for context representation
            dcontext_rep = W1_word.T.dot(dhidden)

            # Update embedding: distribute gradient to each word in the context
            for idx in context:
                embedding[idx] -= lr * (dcontext_rep / len(context))

            # Update network weights
            W2_word -= lr * grad_W2
            b2_word -= lr * grad_b2
            W1_word -= lr * grad_W1
            b1_word -= lr * grad_b1
        losses.append(epoch_loss / len(X_train))
        if epoch % 100 == 0 or epoch == epochs:
            print(f"Epoch {epoch}: Loss = {losses[-1]:.3f}")
    return losses

# Train the next-word prediction model
losses_word = train_next_word(X_train, y_train, epochs=500, lr=0.1, seed=2)

# Plot training loss for next-word prediction
plt.figure(figsize=(6,4))
plt.plot(losses_word, label='Loss')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.title('Next-Word Prediction Training Loss')
plt.legend()
plt.show()

# Function to predict next word given a context
def predict_next_word(context_words):
    # Words outside the vocabulary are skipped
    idxs = [word_to_idx[w] for w in context_words if w in word_to_idx]
    if not idxs:
        raise ValueError("None of the context words are in the vocabulary")
    context_vecs = embedding[idxs]
    context_rep = context_vecs.mean(axis=0)
    hidden = np.maximum(0, W1_word.dot(context_rep) + b1_word)
    scores = W2_word.dot(hidden) + b2_word
    pred_idx = np.argmax(scores)
    return idx_to_word[pred_idx]

# Test predictions
test_contexts = [
    ["city", "of", "new"],
    ["he", "is", "the"],
    ["deep", "learning", "is"]
]
for context in test_contexts:
    pred = predict_next_word(context)
    print(f"Context: {' '.join(context)} -> Predicted next word: {pred}")
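
Taking only the `argmax` of the scores hides how confident the model is. One possible extension is to return the top few candidates together with their softmax probabilities. The sketch below assumes a hypothetical helper `top_k_words` and uses a hand-written score vector and vocabulary so that it runs independently of the trained weights.

```python
import numpy as np

def softmax(z):
    exp_z = np.exp(z - np.max(z))
    return exp_z / np.sum(exp_z)

def top_k_words(scores, idx_to_word, k=3):
    """Return the k most probable words as (word, probability) pairs."""
    probs = softmax(scores)
    top_idx = np.argsort(probs)[::-1][:k]   # indices sorted by descending probability
    return [(idx_to_word[int(i)], float(probs[i])) for i in top_idx]

# Illustrative scores and vocabulary (not produced by the trained model)
demo_idx_to_word = {0: "best", 1: "york", 2: "fun", 3: "world"}
demo_scores = np.array([0.1, 2.0, -1.0, 1.0])

candidates = top_k_words(demo_scores, demo_idx_to_word, k=3)
for word, prob in candidates:
    print(f"{word}: {prob:.3f}")
```

In the notebook, the `scores` vector computed inside `predict_next_word` and the real `idx_to_word` mapping could be passed to the same helper.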

## 3. Word-Based Language Modeling for Text Classification

Modern text classifiers often use embeddings learned from language modeling to improve generalization. Because such embeddings tend to place words that occur in similar contexts close together, the classifier can exploit relationships between words instead of treating each word as an unrelated symbol.

For example, consider the following sketch of a sentiment classifier that uses averaged word embeddings:

```python
# Convert a sentence to an embedding by averaging word embeddings
def sentence_to_vec(sentence_words):
    # Words outside the vocabulary are skipped
    idxs = [word_to_idx[w] for w in sentence_words if w in word_to_idx]
    if not idxs:
        return np.zeros(embed_dim)
    return embedding[idxs].mean(axis=0)

# Placeholder (untrained) classifier parameters so the snippet runs;
# in practice W_clf and b_clf are trained on labeled data
W_clf = np.zeros(embed_dim)
b_clf = 0.0

sentence = ["this", "movie", "is", "fantastic"]
sent_vec = sentence_to_vec(sentence)
logit = W_clf.dot(sent_vec) + b_clf
prob_positive = 1 / (1 + np.exp(-logit))
print("Positive sentiment probability:", prob_positive)
```

In a full implementation, the embedding layer (and optionally the language model) is pre-trained, then fine-tuned on a classification task. This approach can help the classifier generalize to words and phrases that are absent from the labeled classification data but were seen during pre-training.
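
To make the classification step concrete, the sketch below trains a logistic regression classifier on averaged embeddings with plain gradient descent. Everything in it is an assumption made for illustration: the four labeled sentences are invented toy data, and the random `embedding` matrix is only a stand-in for pre-trained vectors, kept frozen during training. It shows the mechanics of the approach, not the generalization benefit, which depends on real pre-trained embeddings.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: (sentence, label) with 1 = positive, 0 = negative
sentences = [
    (["this", "movie", "is", "fantastic"], 1),
    (["this", "movie", "is", "great"], 1),
    (["this", "movie", "is", "terrible"], 0),
    (["this", "movie", "is", "boring"], 0),
]
vocab = sorted({w for words, _ in sentences for w in words})
word_to_idx = {w: i for i, w in enumerate(vocab)}

# Stand-in for pre-trained embeddings (random here, kept frozen)
embed_dim = 8
embedding = rng.normal(size=(len(vocab), embed_dim))

def sentence_to_vec(words):
    idxs = [word_to_idx[w] for w in words if w in word_to_idx]
    if not idxs:
        return np.zeros(embed_dim)
    return embedding[idxs].mean(axis=0)

X = np.stack([sentence_to_vec(words) for words, _ in sentences])
y = np.array([label for _, label in sentences], dtype=float)

def mean_bce(W, b):
    """Mean binary cross-entropy of the classifier on the toy data."""
    p = 1.0 / (1.0 + np.exp(-(X.dot(W) + b)))
    return float(-np.mean(y * np.log(p + 1e-8) + (1 - y) * np.log(1 - p + 1e-8)))

# Full-batch gradient descent on the classifier parameters only
W_clf, b_clf = np.zeros(embed_dim), 0.0
initial_loss = mean_bce(W_clf, b_clf)
lr = 0.5
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(X.dot(W_clf) + b_clf)))
    grad_logit = (p - y) / len(y)       # gradient of the mean loss w.r.t. each logit
    W_clf -= lr * X.T.dot(grad_logit)
    b_clf -= lr * grad_logit.sum()
final_loss = mean_bce(W_clf, b_clf)

print(f"Loss before training: {initial_loss:.3f}")
print(f"Loss after training:  {final_loss:.3f}")
```

Fine-tuning, as described above, would additionally propagate the gradient into `embedding`, in the same way the next-word model in Section 2 updates its context embeddings.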

## Conclusion

In this notebook, we demonstrated several neural network tasks built from scratch using NumPy:

- **Real-time Training Visualization** on a noisy, overlapping dataset.
- **Next-Word Prediction** using a word-level model with embeddings.
- **Discussion on Language Modeling for Text Classification** using embeddings for improved generalization.

These examples illustrate how basic neural network components and training loops can be implemented manually, providing insights into the workings of modern deep learning techniques.