In [None]:
'''
 * Copyright (c) 2004 Radhamadhab Dalai
 *
 * Permission is hereby granted, free of charge, to any person obtaining a copy
 * of this software and associated documentation files (the "Software"), to deal
 * in the Software without restriction, including without limitation the rights
 * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
 * copies of the Software, and to permit persons to whom the Software is
 * furnished to do so, subject to the following conditions:
 *
 * The above copyright notice and this permission notice shall be included in
 * all copies or substantial portions of the Software.
 *
 * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
 * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
 * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
 * AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
 * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
 * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
 * THE SOFTWARE.
'''

## Why Deep Generative Modeling?

###  AI Is Not Only About Decision Making

Before we start thinking about (deep) generative modeling, let us consider a simple example. Imagine we have trained a deep neural network that classifies images ($x \in \mathbb{Z}^D$) of animals ($y \in \mathcal{Y}$, where $\mathcal{Y} = \{\text{cat}, \text{dog}, \text{horse}\}$). Further, let us assume that this neural network is trained really well, so that it always assigns the correct class a high probability $p(y|x)$.

So far, so good, right? However, a problem could occur. As pointed out in [1], adding noise to images could result in completely false classification. An example of such a situation is presented in **Fig.1**, where adding noise could shift the predicted probabilities of labels. Despite this, the image is barely changed (at least to us, human beings).

![image.png](attachment:image.png)

**Fig.1**: An example of adding noise to an almost perfectly classified image that results in a shift of predicted labels.

This example indicates that neural networks used to parameterize the conditional distribution $p(y|x)$ seem to lack a semantic understanding of images. Furthermore, we hypothesize that learning discriminative models is not enough for proper decision-making and creating AI. A machine learning system cannot rely solely on learning how to make decisions without understanding reality and being able to express uncertainty about the surrounding world.

How can we trust such a system if even a small amount of noise could change its internal beliefs and also shift its certainty from one decision to another? How can we communicate with such a system effectively?
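To make the effect concrete, here is a minimal sketch that uses a hypothetical linear classifier instead of a deep network. The weights `w`, the image `x`, and the noise budget `eps` are made-up values, and aligning the noise with the sign of the weights is only one way to construct such a perturbation, not necessarily the procedure used in [1].

```python
import math

D = 784  # hypothetical number of pixels
# Hypothetical weights of a linear "cat vs. not cat" classifier
w = [0.5 if i % 2 == 0 else -0.5 for i in range(D)]
# Hypothetical image with pixel values in [0, 1]
x = [0.51 if i % 2 == 0 else 0.49 for i in range(D)]

def prob_cat(image):
    """p(y = cat | x) for the linear model: sigmoid(w . x)."""
    logit = sum(wi * xi for wi, xi in zip(w, image))
    return 1.0 / (1.0 + math.exp(-logit))

eps = 0.02  # maximum change allowed per pixel
# Move every pixel by eps against the sign of its weight
x_noisy = [xi - eps * math.copysign(1.0, wi) for xi, wi in zip(x, w)]

p_clean = prob_cat(x)
p_noisy = prob_cat(x_noisy)
max_change = max(abs(a - b) for a, b in zip(x, x_noisy))
print(f"clean: {p_clean:.3f}, noisy: {p_noisy:.3f}, max pixel change: {max_change:.3f}")
```

No single pixel moves by more than `eps`, yet the small changes accumulate over all `D` dimensions of the logit, which is why the predicted probability can swing from one side to the other.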

---

**References:**

1. [Add appropriate reference here]

![image-2.png](attachment:image-2.png)

**Fig.2**: An example of data (left) and two approaches to decision-making: a discriminative approach (middle) and a generative approach (right).

## Fig.2: Discriminative vs Generative Approaches to Decision-Making

In Fig.2 above, we illustrate two approaches to decision-making: a **discriminative approach** (middle) and a **generative approach** (right). 

To motivate the importance of concepts like uncertainty and understanding in decision-making, let us consider a system that classifies objects into two classes: **orange** and **blue**. 

We assume we have some two-dimensional data (Fig.2, left) and a new datapoint (a black cross in Fig.2) to be classified. We can make decisions using two approaches:

1. **Discriminative Approach**: This approach explicitly models the conditional distribution $ p(y|x) $ (Fig.2, middle).
2. **Generative Approach**: This approach models the joint distribution $ p(x, y) $, which can be further decomposed as $ p(x, y) = p(y|x) p(x) $ (Fig.2, right).

After training a model using the discriminative approach, namely the conditional distribution $ p(y|x) $, we obtain a clear decision boundary. From the figure, we observe that the black cross is farther away from the orange region. Therefore, the classifier assigns a higher probability to the **blue** label. As a result, the classifier is **certain** about the decision.

On the other hand, if we additionally fit a distribution $ p(x) $, we observe that the black cross is not only farther away from the decision boundary, but also distant from the region where the blue datapoints lie. In other words, the black point is far away from the region of high probability mass. As a result, the **marginal probability** $ p(x = \text{black cross}) $ is low, and the joint distribution $ p(x = \text{black cross}, y = \text{blue}) $ will be low as well. Thus, the decision is **uncertain**.
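The contrast between the two approaches can be sketched numerically. The example below is a one-dimensional stand-in for Fig.2 under assumed values: the blue and orange classes are taken to be unit-variance Gaussians centered at -2 and +2, and the "black cross" is placed at -8. None of these numbers come from the figure; they are chosen only to illustrate the argument.

```python
import math

def normal_pdf(x, mean, std):
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2 * math.pi))

# Hypothetical 1D data model: blue around -2, orange around +2, equal class priors
def p_x(x):
    """Marginal density: equal mixture of the two class-conditional Gaussians."""
    return 0.5 * normal_pdf(x, -2.0, 1.0) + 0.5 * normal_pdf(x, 2.0, 1.0)

def p_blue_given_x(x):
    """Discriminative model p(y = blue | x); for this mixture it equals sigmoid(-4x)."""
    return 1.0 / (1.0 + math.exp(4.0 * x))

typical_point = -2.0  # in the middle of the blue data
black_cross = -8.0    # on the blue side, but far away from all data

for name, x in [("typical point", typical_point), ("black cross", black_cross)]:
    conditional = p_blue_given_x(x)
    marginal = p_x(x)
    joint = conditional * marginal  # p(x, y = blue) = p(y = blue | x) p(x)
    print(f"{name}: p(blue|x)={conditional:.4f}, p(x)={marginal:.2e}, p(x, blue)={joint:.2e}")
```

The conditional alone is close to 1 for both points, while the marginal (and hence the joint) is many orders of magnitude smaller for the distant point, which is the signal the discriminative model cannot provide.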

This simple example clearly indicates that if we want to build AI systems that make reliable decisions and can communicate with us, human beings, they must first **understand the environment**. For this purpose, they cannot simply learn how to make decisions, but they should also be able to **quantify their beliefs** about their surroundings using the language of probability.

### Why Estimating the Distribution $ p(x) $ is Crucial:

From the generative perspective, knowing the distribution $ p(x) $ is essential because:

- It could be used to assess whether a given object has been observed in the past or not.
- It could help properly **weigh** the decision.

---

## 1.2 Where Can We Use (Deep) Generative Modeling?

From the generative perspective, estimating $ p(x) $ has several advantages:

- It can be used to **assess uncertainty** about the environment.
- It can facilitate **active learning** by interacting with the environment (e.g., querying labels for objects with low $ p(x) $).
- It can be used to **generate (synthesize) new objects**.
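As a small illustration of the active learning point, the sketch below ranks an unlabeled pool by estimated density and queries the least likely items. The data, the pool, and the choice of a single fitted Gaussian as the density model are all assumptions made for this example; a deep generative model would take the place of `p_x` in practice.

```python
import math
import statistics

# Hypothetical 1D observations seen so far, and an unlabeled pool to query from
observed = [0.1, -0.3, 0.5, 0.0, -0.2, 0.4]
pool = [0.05, 3.0, -0.1, -4.0, 0.3]

# Assumed density model: a single Gaussian fitted to the observations
mean = statistics.mean(observed)
std = statistics.stdev(observed)

def p_x(x):
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2 * math.pi))

# Query labels for the two pool items with the lowest estimated density
queries = sorted(pool, key=p_x)[:2]
print("Query labels for:", queries)
```

The two selected items are the ones farthest from the observed data, that is, the objects the model has the least evidence about.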

### Broader Applicability of $ p(x) $

Typically, in deep learning literature, generative models are seen primarily as tools for generating new data. However, in this discussion, we emphasize that having $ p(x) $ has much broader applicability and is essential for building successful AI systems. 

In machine learning, formulating a proper generative process is critical for understanding phenomena of interest [3, 4]. Often, another factorization is considered, namely:

$$
p(x, y) = p(x|y) p(y)
$$

We argue that using the factorization:

$$
p(x, y) = p(y|x) p(x)
$$

offers clear advantages, as discussed earlier.
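Both factorizations describe the same joint distribution. The following worked example checks this on a small, made-up probability table (the values in `joint` are hypothetical and carry no meaning beyond summing to one).

```python
# Hypothetical joint distribution p(x, y) over x in {0, 1, 2} and y in {"cat", "dog"}
joint = {
    (0, "cat"): 0.10, (0, "dog"): 0.30,
    (1, "cat"): 0.25, (1, "dog"): 0.05,
    (2, "cat"): 0.15, (2, "dog"): 0.15,
}
xs = sorted({x for x, _ in joint})
ys = sorted({y for _, y in joint})

# Marginals p(x) and p(y)
p_x = {x: sum(joint[(x, y)] for y in ys) for x in xs}
p_y = {y: sum(joint[(x, y)] for x in xs) for y in ys}

# Conditionals p(y|x) and p(x|y)
p_y_given_x = {(y, x): joint[(x, y)] / p_x[x] for x in xs for y in ys}
p_x_given_y = {(x, y): joint[(x, y)] / p_y[y] for x in xs for y in ys}

for x in xs:
    for y in ys:
        via_px = p_y_given_x[(y, x)] * p_x[x]  # p(y|x) p(x)
        via_py = p_x_given_y[(x, y)] * p_y[y]  # p(x|y) p(y)
        print(x, y, round(joint[(x, y)], 4), round(via_px, 4), round(via_py, 4))
```

The difference between the two is therefore not what they represent but which quantities are modeled explicitly: the first factorization exposes $ p(x|y) $ and $ p(y) $, the second exposes $ p(y|x) $ and $ p(x) $.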

---

## Applications of Deep Generative Modeling

With advancements in neural networks and computational power, **deep generative modeling** has become a leading direction in AI. Its applications span various domains, including:

- **Text analysis** (e.g., [5]),
- **Image analysis** (e.g., [6]),
- **Audio analysis** (e.g., [7]),
- **Active learning** (e.g., [8]),
- **Reinforcement learning** (e.g., [9]),
- **Graph analysis** (e.g., [10]),
- **Medical imaging** (e.g., [11]).

In **Fig.3**, we provide a graphical summary of potential applications of deep generative modeling. 

---

### Importance of Synthesis and Feature Modification

In some applications, generating or modifying features of objects is critical. For example, an app that transforms an image of a young person to predict their appearance as they age illustrates the power of generative modeling.

![image-3.png](attachment:image-3.png)


**Fig.3**: Various potential applications of deep generative modeling, including text, images, graphs, audio, reinforcement learning, and medical imaging.

---






In [None]:
import torch
import torch.nn as nn
import torch.optim as optim
from torchvision import datasets, transforms
from torch.utils.data import DataLoader

# Data preparation
transform = transforms.Compose([transforms.ToTensor(), transforms.Normalize((0.5,), (0.5,))])
train_data = datasets.MNIST(root='./data', train=True, download=True, transform=transform)
test_data = datasets.MNIST(root='./data', train=False, download=True, transform=transform)

train_loader = DataLoader(train_data, batch_size=64, shuffle=True)
test_loader = DataLoader(test_data, batch_size=64, shuffle=False)

# Define a simple neural network
class SimpleClassifier(nn.Module):
    def __init__(self):
        super(SimpleClassifier, self).__init__()
        self.fc = nn.Sequential(
            nn.Flatten(),
            nn.Linear(28 * 28, 128),
            nn.ReLU(),
            nn.Linear(128, 64),
            nn.ReLU(),
            nn.Linear(64, 10)
        )

    def forward(self, x):
        return self.fc(x)

# Model, loss, optimizer
model = SimpleClassifier()
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

# Training loop
for epoch in range(5):
    for images, labels in train_loader:
        optimizer.zero_grad()
        outputs = model(images)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
    print(f'Epoch {epoch+1}, Loss: {loss.item()}')


In [None]:
class VAE(nn.Module):
    def __init__(self):
        super(VAE, self).__init__()
        # Encoder
        self.encoder = nn.Sequential(
            nn.Flatten(),
            nn.Linear(28 * 28, 128),
            nn.ReLU(),
            nn.Linear(128, 64),
            nn.ReLU()
        )
        self.mu = nn.Linear(64, 2)  # Latent mean
        self.log_var = nn.Linear(64, 2)  # Latent log-variance

        # Decoder
        self.decoder = nn.Sequential(
            nn.Linear(2, 64),
            nn.ReLU(),
            nn.Linear(64, 128),
            nn.ReLU(),
            nn.Linear(128, 28 * 28),
            nn.Sigmoid()
        )

    def reparameterize(self, mu, log_var):
        std = torch.exp(0.5 * log_var)
        eps = torch.randn_like(std)
        return mu + eps * std

    def forward(self, x):
        encoded = self.encoder(x)
        mu, log_var = self.mu(encoded), self.log_var(encoded)
        z = self.reparameterize(mu, log_var)
        decoded = self.decoder(z)
        return decoded, mu, log_var

# Loss function
def vae_loss(reconstructed, original, mu, log_var):
    reconstruction_loss = nn.functional.binary_cross_entropy(reconstructed, original, reduction='sum')
    kl_divergence = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())
    return reconstruction_loss + kl_divergence

vae = VAE()
optimizer = optim.Adam(vae.parameters(), lr=0.001)

# Training loop
for epoch in range(5):
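    for images, _ in train_loader:
        # Flatten and undo Normalize((0.5,), (0.5,)) so pixels lie in [0, 1],
        # the range binary cross-entropy expects for its targets
        images = images.view(-1, 28 * 28) * 0.5 + 0.5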
        optimizer.zero_grad()
        reconstructed, mu, log_var = vae(images)
        loss = vae_loss(reconstructed, images, mu, log_var)
        loss.backward()
        optimizer.step()
    print(f'Epoch {epoch+1}, Loss: {loss.item()}')


In [None]:
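class BayesianNN(SimpleClassifier):
    """Monte Carlo dropout classifier.

    Assumption: SimpleClassifier is deterministic, so dropout on the input is
    added here to make repeated forward passes differ.
    """
    def __init__(self, p=0.2):
        super().__init__()
        self.p = p  # Dropout probability (illustrative value)

    def forward(self, x, mc_samples=10):
        # training=True keeps dropout active even after model.eval()
        outputs = [
            SimpleClassifier.forward(self, nn.functional.dropout(x, self.p, training=True))
            for _ in range(mc_samples)
        ]
        return torch.stack(outputs)  # Shape: (mc_samples, batch, classes)

# Example of uncertainty estimation (the model is untrained here; train it first in practice)
model = BayesianNN()
model.eval()  # Set model to evaluation mode

# Perform Monte Carlo sampling
mc_samples = 20
test_images, _ = next(iter(test_loader))
with torch.no_grad():
    outputs = model(test_images, mc_samples=mc_samples)
mean_prediction = outputs.mean(dim=0)  # Average over samples
uncertainty = outputs.var(dim=0)  # Variance over samples

def query_samples(model, unlabeled_loader):
    """Indices of the 10 least-confident samples; expects a model that returns (batch, classes) logits."""
    model.eval()
    uncertainties = []
    with torch.no_grad():
        for images, _ in unlabeled_loader:
            outputs = model(images)
            probabilities = torch.softmax(outputs, dim=1)
            confidence, _ = probabilities.max(dim=1)
            uncertainties.append(1 - confidence)
    return torch.cat(uncertainties).argsort(descending=True)[:10]  # Top 10 uncertain samples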


In [2]:
import numpy as np

# Generate synthetic data (2D features, binary labels)
np.random.seed(42)
X = np.random.randn(100, 2)
y = (np.dot(X, np.array([1.5, -2])) + 0.5 > 0).astype(int)

# Initialize weights and bias
weights = np.zeros(X.shape[1])
bias = 0
learning_rate = 0.1
epochs = 1000

# Sigmoid function
def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Training loop
for epoch in range(epochs):
    # Linear model
    z = np.dot(X, weights) + bias
    y_pred = sigmoid(z)

    # Binary cross-entropy loss
    loss = -np.mean(y * np.log(y_pred + 1e-8) + (1 - y) * np.log(1 - y_pred + 1e-8))

    # Gradients
    dw = np.dot(X.T, (y_pred - y)) / len(y)
    db = np.sum(y_pred - y) / len(y)

    # Update weights and bias
    weights -= learning_rate * dw
    bias -= learning_rate * db

    # Print loss every 100 epochs
    if epoch % 100 == 0:
        print(f"Epoch {epoch}, Loss: {loss:.4f}")

# Predict function
def predict(X, weights, bias):
    return (sigmoid(np.dot(X, weights) + bias) > 0.5).astype(int)

# Test prediction
print("Test prediction:", predict(X[:5], weights, bias))


Epoch 0, Loss: 0.6931
Epoch 100, Loss: 0.2888
Epoch 200, Loss: 0.2194
Epoch 300, Loss: 0.1868
Epoch 400, Loss: 0.1669
Epoch 500, Loss: 0.1531
Epoch 600, Loss: 0.1428
Epoch 700, Loss: 0.1348
Epoch 800, Loss: 0.1282
Epoch 900, Loss: 0.1228
Test prediction: [1 0 1 1 0]


![image.png](attachment:image.png)

**Fig.4**: A taxonomy of deep generative models.

## Flow-Based Models  

The **change of variables formula** provides a principled manner of expressing the density of a random variable by transforming it with an invertible transformation $ f $ $[17]$:

$$
p(\mathbf{x}) = p\left(\mathbf{z} = f(\mathbf{x})\right) \left| J_f(\mathbf{x}) \right|,
$$

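where $ J_f(\mathbf{x}) $ denotes the Jacobian matrix of $ f $ evaluated at $ \mathbf{x} $, and $ \left| J_f(\mathbf{x}) \right| $ is the absolute value of its determinant.  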

Deep neural networks can parameterize $ f $, but they cannot be arbitrary networks: $ f $ must be invertible and its Jacobian determinant must be tractable to compute. Initial approaches focused on **linear, volume-preserving transformations** that yield $ |J_f(\mathbf{x})| = 1 $ $[18, 19]$. Further advancements leveraged matrix determinant theorems, leading to transformations such as **planar flows** $[20]$ and **Sylvester flows** $[21, 22]$.  

A different approach involves invertible transformations with easily computable Jacobian determinants, like the coupling layers in **RealNVP** $[23]$. Recent efforts constrain neural networks to ensure invertibility and approximate the Jacobian determinant $[24–26]$.  
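A coupling layer can be sketched in a few lines. In the toy version below the input is split into two scalars, and the scale and translation functions `s` and `t` are fixed, hypothetical stand-ins for the neural networks used in practice; the point is only that the inverse is available in closed form and the log-determinant is just the log-scale.

```python
import math

def s(x_a):
    """Hypothetical log-scale function (a neural network in practice)."""
    return math.tanh(x_a)

def t(x_a):
    """Hypothetical translation function (a neural network in practice)."""
    return x_a ** 2

def coupling_forward(x_a, x_b):
    # x_a passes through unchanged; x_b is scaled and shifted based on x_a
    y_a = x_a
    y_b = x_b * math.exp(s(x_a)) + t(x_a)
    log_det = s(x_a)  # log |det J| of this triangular transformation
    return y_a, y_b, log_det

def coupling_inverse(y_a, y_b):
    # Because y_a == x_a, s and t can be recomputed without inverting them
    x_a = y_a
    x_b = (y_b - t(y_a)) * math.exp(-s(y_a))
    return x_a, x_b

y_a, y_b, log_det = coupling_forward(0.3, -1.2)
x_a, x_b = coupling_inverse(y_a, y_b)
print("output:", (y_a, y_b), "log|det J|:", log_det)
print("reconstructed input:", (x_a, x_b))
```

Note that `s` and `t` themselves never need to be invertible, which is what allows them to be arbitrary networks.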

For discrete distributions (e.g., integers), the formula becomes simpler since there is no volume change:

$$
p(\mathbf{x}) = p\left(\mathbf{z} = f(\mathbf{x})\right).
$$

**Integer discrete flows** use affine coupling layers with rounding operators to ensure integer-valued outputs $[27]$, with further generalizations in $[28]$.  
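The role of the rounding operator can be illustrated with a simplified, translation-only coupling on pairs of integers. This is a sketch of the idea rather than the layer from $[27]$: the translation function `t` is a fixed, hypothetical function instead of a neural network, and the scaling part of an affine coupling is left out.

```python
def t(x_a):
    """Hypothetical translation function (a neural network in practice)."""
    return 0.7 * x_a + 0.3

def idf_forward(x_a, x_b):
    # Keep x_a; shift x_b by a rounded translation so the output stays an integer
    return x_a, x_b + round(t(x_a))

def idf_inverse(z_a, z_b):
    # The same rounded shift can be recomputed from z_a == x_a and subtracted
    return z_a, z_b - round(t(z_a))

grid = [(a, b) for a in range(-3, 4) for b in range(-3, 4)]
images = [idf_forward(a, b) for a, b in grid]

print("all outputs are integers:", all(isinstance(v, int) for z in images for v in z))
print("mapping is one-to-one:", len(set(images)) == len(grid))
print("inverse recovers inputs:", all(idf_inverse(*z) == x for x, z in zip(grid, images)))
```

Since the map is a bijection on integers, probability mass is only relabeled, never stretched, which is why no Jacobian term appears in the discrete formula.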

All models leveraging the change of variables formula are referred to as **flow-based models** (or flows). These will be discussed further in Chapter 4.  

---

##  Latent Variable Models  

The idea behind **latent variable models** is to assume a lower-dimensional latent space and define the following generative process:

$$
\mathbf{z} \sim p(\mathbf{z}), \quad \mathbf{x} \sim p(\mathbf{x} | \mathbf{z}).
$$

Here, latent variables represent hidden factors in the data, and the conditional distribution $ p(\mathbf{x} | \mathbf{z}) $ acts as the generator.  

A classic example is **probabilistic Principal Component Analysis (pPCA)** $[29]$, where both $ p(\mathbf{z}) $ and $ p(\mathbf{x} | \mathbf{z}) $ are Gaussian distributions, and the relationship between $ \mathbf{z} $ and $ \mathbf{x} $ is linear. A nonlinear extension is the **Variational Auto-Encoder (VAE)** framework $[30, 31]$, which uses variational inference to approximate the posterior $ p(\mathbf{z} | \mathbf{x}) $, with neural networks parameterizing the distributions.  
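The generative process of pPCA is simple enough to sample from directly. The sketch below uses made-up parameters (`W`, `b`, `sigma`) with a one-dimensional latent and a two-dimensional observation, and compares the empirical covariance of the samples with the known pPCA marginal covariance $ W W^T + \sigma^2 I $.

```python
import random

rng = random.Random(0)

# Hypothetical pPCA parameters: 1D latent variable, 2D observation
W = [2.0, 1.0]   # loading vector (a 2 x 1 matrix)
b = [0.5, -1.0]  # mean offset
sigma = 0.5      # observation noise standard deviation

def sample_x():
    z = rng.gauss(0.0, 1.0)  # z ~ p(z) = N(0, 1)
    # x ~ p(x|z) = N(W z + b, sigma^2 I)
    return [W[d] * z + b[d] + sigma * rng.gauss(0.0, 1.0) for d in range(2)]

samples = [sample_x() for _ in range(20000)]
n = len(samples)
mean = [sum(x[d] for x in samples) / n for d in range(2)]

def cov(i, j):
    return sum((x[i] - mean[i]) * (x[j] - mean[j]) for x in samples) / (n - 1)

expected = [[W[i] * W[j] + (sigma ** 2 if i == j else 0.0) for j in range(2)] for i in range(2)]
for i in range(2):
    for j in range(2):
        print(f"cov[{i}][{j}]: empirical {cov(i, j):.3f}, W W^T + sigma^2 I {expected[i][j]:.3f}")
```

Replacing the linear map `W[d] * z + b[d]` with a neural network gives the decoder of a VAE, at which point the marginal is no longer available in closed form.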

Since the introduction of VAEs, extensions have enhanced variational posteriors $[19, 21, 22, 32]$, priors $[33, 34]$, and decoders $[35]$. Other directions include exploring different latent space topologies, such as hyperspherical latent spaces $[36]$.  

In VAEs and pPCA, all distributions are defined upfront, making them **prescribed models**. These will be explored in Chapter 5.  

---

## 1.3.4 Energy-Based Models  

Physics-inspired models define a group of generative models via an energy function, $ E(\mathbf{x}) $, and the **Boltzmann distribution**:

$$
p(\mathbf{x}) = \frac{\exp\{-E(\mathbf{x})\}}{Z}, \quad Z = \sum_{\mathbf{x}} \exp\{-E(\mathbf{x})\},
$$

where $ Z $ is the **partition function** that normalizes the probabilities.  

Energy-based models (EBMs) focus on formulating the energy function and approximating the partition function. The most notable EBMs are **Boltzmann Machines**, where variables $ \mathbf{x} $ are entangled via a bilinear form:

$$
E(\mathbf{x}) = \mathbf{x}^T W \mathbf{x}.
$$

Introducing latent variables leads to **Restricted Boltzmann Machines (RBMs)** $[41]$.  
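For a handful of binary variables, the Boltzmann distribution with the bilinear energy can be computed exactly by enumerating all states. The weight matrix `W` below is a made-up example; the enumeration is only feasible because there are just $ 2^3 $ states, which is exactly why the partition function must be approximated for realistic models.

```python
import itertools
import math

# Hypothetical symmetric weight matrix for three binary variables
W = [[0.0, -1.0, 0.5],
     [-1.0, 0.0, 0.2],
     [0.5, 0.2, 0.0]]

def energy(x):
    """Bilinear energy E(x) = x^T W x."""
    return sum(x[i] * W[i][j] * x[j] for i in range(3) for j in range(3))

states = list(itertools.product([0, 1], repeat=3))
Z = sum(math.exp(-energy(x)) for x in states)          # partition function
probs = {x: math.exp(-energy(x)) / Z for x in states}  # Boltzmann distribution

for x in states:
    print(x, f"E = {energy(x):+.2f}", f"p = {probs[x]:.3f}")
print("most probable state:", max(probs, key=probs.get))
```

The state with the lowest energy receives the highest probability, and the probabilities sum to one by construction of `Z`.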

---

These groups (autoregressive models, flow-based models, latent variable models, and energy-based models) offer diverse perspectives on deep generative modeling. Each approach is explored further in its own chapter.


In [3]:
import numpy as np

# Sigmoid activation for nonlinearity
def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# Inverse of sigmoid (logit function)
def sigmoid_inv(y):
    return np.log(y / (1 - y))

# Example of an invertible transformation
def transformation(x, scale, shift):
    """
    Simple affine transformation: z = scale * x + shift
    """
    return scale * x + shift

# Inverse of the transformation
def inverse_transformation(z, scale, shift):
    """
    Inverse affine transformation: x = (z - shift) / scale
    """
    return (z - shift) / scale

# Compute the Jacobian determinant
def jacobian_determinant(scale):
    """
    Jacobian determinant for the affine transformation
    """
    return np.abs(scale)

# Define a synthetic dataset
np.random.seed(42)
data = np.random.uniform(-1, 1, (1000, 1))  # Uniformly distributed data

# Parameters for the transformation (learned parameters in practice)
scale = 2.0  # Scaling factor
shift = 0.5  # Shift factor

# Forward transformation
z = transformation(data, scale, shift)

# Calculate density using the change of variables formula
base_density = np.exp(-0.5 * z**2) / np.sqrt(2 * np.pi)  # Standard Gaussian base density p(z)
p_x = base_density * jacobian_determinant(scale)

# Inverse transformation
x_reconstructed = inverse_transformation(z, scale, shift)

# Print results
print("Original data (first 5 samples):", data[:5].flatten())
print("Transformed data (first 5 samples):", z[:5].flatten())
print("Reconstructed data (first 5 samples):", x_reconstructed[:5].flatten())


Original data (first 5 samples): [-0.25091976  0.90142861  0.46398788  0.19731697 -0.68796272]
Transformed data (first 5 samples): [-1.83952461e-03  2.30285723e+00  1.42797577e+00  8.94633937e-01
 -8.75925438e-01]
Reconstructed data (first 5 samples): [-0.25091976  0.90142861  0.46398788  0.19731697 -0.68796272]


###  Score-Based Generative Models

Instead of matching distributions, we can match a score function, $ \nabla_x \ln p(x) $, with its model, $ s_{\theta}(x) $, using the squared $ \ell_2 $ norm (a.k.a. the mean squared error loss). However, since we do not have access to the true distribution, we can use a noisy version of the empirical distribution obtained by adding small Gaussian noise, $ \tilde{x}_n = x_n + \sigma \cdot \epsilon $ with $ \epsilon \sim \mathcal{N}(0, I) $, which yields:

$$ q_{\text{data}}(\tilde{x}_n) = \mathcal{N}(\tilde{x}_n | x_n, \sigma^2) $$

Since for Gaussian noise, the score function is analytically tractable, i.e.,

$$ \nabla_{\tilde{x}} \ln \mathcal{N}(\tilde{x}_n | x_n, \sigma^2) = -\frac{1}{\sigma} \epsilon $$

the final objective is the following:

$$ L(\theta) = \frac{1}{2N} \sum_{n=1}^{N} \left\| s_{\theta}(\tilde{x}_n) + \frac{1}{\sigma} \epsilon \right\|^2 $$

Optimizing this objective leads to a score model, $ s_{\theta}(x) $, and the approach is called **score matching** [13, 14]. Sampling from the model requires running an auxiliary procedure, e.g., **Langevin dynamics** [44]. The idea behind score matching has been further used to train generative models formulated as stochastic differential equations, a continuous generalization of diffusion-based models. Then, sampling results in applying numerical methods for solving differential equations like backward Euler's method.
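Sampling with Langevin dynamics can be sketched as follows. To keep the example self-contained, the trained score model is replaced by the exact score of a standard Gaussian, $ \nabla_x \ln \mathcal{N}(x | 0, 1) = -x $; the step size, number of steps, and starting point are arbitrary illustrative choices.

```python
import math
import random

rng = random.Random(0)

def score(x):
    """Exact score of N(0, 1); a stand-in for a trained score model s_theta(x)."""
    return -x

def langevin_sample(x0, step=0.05, n_steps=300):
    x = x0
    for _ in range(n_steps):
        # Drift along the score plus Gaussian noise scaled by sqrt(2 * step)
        x = x + step * score(x) + math.sqrt(2 * step) * rng.gauss(0.0, 1.0)
    return x

# Run many independent chains, all started far from the mode
samples = [langevin_sample(x0=5.0) for _ in range(1000)]
mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)
print(f"sample mean: {mean:.3f}, sample variance: {var:.3f}")
```

Although every chain starts at 5.0, the samples end up approximately distributed as the target Gaussian; with a finite step size the match is close but not exact.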

A similar approach to score matching with differential equations is called **flow matching** [45]. We will discuss all three frameworks in **Chap. 9**.


In [4]:
import torch
import torch.nn as nn
import torch.optim as optim

# Define a simple neural network model for score function
class ScoreModel(nn.Module):
    def __init__(self, input_dim):
        super(ScoreModel, self).__init__()
        self.fc1 = nn.Linear(input_dim, 128)
        self.fc2 = nn.Linear(128, 128)
        self.fc3 = nn.Linear(128, input_dim)  # Output same size as input for score matching

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = torch.relu(self.fc2(x))
        x = self.fc3(x)
        return x

# Generate synthetic data (a simple Gaussian distribution)
def generate_data(n_samples, input_dim):
    return torch.randn(n_samples, input_dim)

# Gaussian noise: return the noisy input together with the standard-normal noise that produced it
def add_gaussian_noise(x, sigma):
    epsilon = torch.randn_like(x)
    return x + sigma * epsilon, epsilon

# Score matching loss function
def score_matching_loss(model, x, sigma):
    # Add Gaussian noise to the input; epsilon is the same noise used to build x_tilde
    x_tilde, epsilon = add_gaussian_noise(x, sigma)

    # Calculate the model's score prediction
    score = model(x_tilde)

    # The score of the Gaussian noise kernel is -epsilon / sigma,
    # so the per-sample error is || s(x_tilde) + epsilon / sigma ||^2
    loss = 0.5 * torch.mean(torch.sum((score + epsilon / sigma) ** 2, dim=1))
    return loss

# Training setup
input_dim = 2  # Example for 2D data
n_samples = 1000
sigma = 0.1  # Standard deviation of noise
learning_rate = 0.001
epochs = 100

# Create a dataset and model
data = generate_data(n_samples, input_dim)
model = ScoreModel(input_dim)
optimizer = optim.Adam(model.parameters(), lr=learning_rate)

# Training loop
for epoch in range(epochs):
    model.train()

    # Zero gradients
    optimizer.zero_grad()

    # Compute the score matching loss
    loss = score_matching_loss(model, data, sigma)

    # Backpropagation
    loss.backward()

    # Update the model parameters
    optimizer.step()

    if (epoch + 1) % 10 == 0:
        print(f'Epoch [{epoch+1}/{epochs}], Loss: {loss.item():.4f}')

# After training, you can use the trained model to generate samples, but additional steps are required for sampling from the model.


In [6]:
import math
import random

# Generate synthetic data (Gaussian distribution)
def generate_data(n_samples, input_dim):
    return [[random.gauss(0, 1) for _ in range(input_dim)] for _ in range(n_samples)]

# Add Gaussian noise to the data
def add_gaussian_noise(data, sigma):
    noisy_data = []
    for x in data:
        noisy_data.append([xi + random.gauss(0, sigma) for xi in x])
    return noisy_data

# Initialize weights and biases for a simple fully connected neural network
def initialize_model(input_dim, hidden_dim):
    model = {
        "w1": [[random.uniform(-0.1, 0.1) for _ in range(input_dim)] for _ in range(hidden_dim)],
        "b1": [0.0 for _ in range(hidden_dim)],
        "w2": [[random.uniform(-0.1, 0.1) for _ in range(hidden_dim)] for _ in range(hidden_dim)],
        "b2": [0.0 for _ in range(hidden_dim)],
        "w3": [[random.uniform(-0.1, 0.1) for _ in range(hidden_dim)] for _ in range(input_dim)],
        "b3": [0.0 for _ in range(input_dim)],
    }
    return model

# Forward pass (each weight matrix is a list of rows with shape out_dim x in_dim)
def forward(x, model, return_hidden=False):
    def relu(z):
        return [max(0.0, zi) for zi in z]

    def linear(w, b, vec):
        # One output per weight row: dot(row, vec) + bias
        return [sum(wij * vj for wij, vj in zip(row, vec)) + bi for row, bi in zip(w, b)]

    a1 = relu(linear(model["w1"], model["b1"], x))   # Layer 1
    a2 = relu(linear(model["w2"], model["b2"], a1))  # Layer 2
    output = linear(model["w3"], model["b3"], a2)    # Output layer

    return (output, a1, a2) if return_hidden else output

# Perturb a datapoint; epsilon is the standard-normal noise that produced x_tilde
def perturb(x, sigma):
    epsilon = [random.gauss(0, 1) for _ in x]
    x_tilde = [xi + sigma * ei for xi, ei in zip(x, epsilon)]
    return x_tilde, epsilon

# Compute score-matching loss on (x_tilde, epsilon) pairs; the target score is -epsilon / sigma
def score_matching_loss(model, noisy_data, sigma):
    loss = 0.0
    for x_tilde, epsilon in noisy_data:
        score = forward(x_tilde, model)
        loss += sum((s + e / sigma) ** 2 for s, e in zip(score, epsilon))
    return loss / (2 * len(noisy_data))

# Backward pass: backpropagate through all three layers, then take one gradient step
def backward(model, noisy_data, sigma, lr):
    n = len(noisy_data)
    grads = {
        key: [[0.0] * len(val[0]) for _ in val] if key.startswith("w") else [0.0] * len(val)
        for key, val in model.items()
    }

    for x_tilde, epsilon in noisy_data:
        score, a1, a2 = forward(x_tilde, model, return_hidden=True)

        # Gradient of the loss w.r.t. each layer's pre-activation (ReLU passes gradient where a > 0)
        d3 = [(s + e / sigma) / n for s, e in zip(score, epsilon)]
        d2 = [sum(model["w3"][i][j] * d3[i] for i in range(len(d3))) * (a2[j] > 0) for j in range(len(a2))]
        d1 = [sum(model["w2"][i][j] * d2[i] for i in range(len(d2))) * (a1[j] > 0) for j in range(len(a1))]

        # Accumulate gradients: dL/dW[i][j] = delta[i] * input[j], dL/db[i] = delta[i]
        for w_key, b_key, delta, inputs in (("w3", "b3", d3, a2), ("w2", "b2", d2, a1), ("w1", "b1", d1, x_tilde)):
            for i, di in enumerate(delta):
                grads[b_key][i] += di
                for j, vj in enumerate(inputs):
                    grads[w_key][i][j] += di * vj

    # Update weights and biases (once per call)
    for key, grad in grads.items():
        if key.startswith("w"):
            for i, row in enumerate(grad):
                for j, gij in enumerate(row):
                    model[key][i][j] -= lr * gij
        else:
            for i, gi in enumerate(grad):
                model[key][i] -= lr * gi

# Training setup
input_dim = 2  # Dimensionality of input
hidden_dim = 16  # Hidden layer size
n_samples = 100  # Number of samples
sigma = 0.1  # Noise standard deviation
lr = 0.01  # Learning rate
epochs = 100  # Number of training epochs

# Data and model initialization
data = generate_data(n_samples, input_dim)
model = initialize_model(input_dim, hidden_dim)

# Training loop
for epoch in range(epochs):
    # Draw fresh noise once per epoch and use the same pairs for the loss and the gradients
    noisy_data = [perturb(x, sigma) for x in data]
    loss = score_matching_loss(model, noisy_data, sigma)
    backward(model, noisy_data, sigma, lr)

    if (epoch + 1) % 10 == 0:
        print(f"Epoch [{epoch + 1}/{epochs}], Loss: {loss:.4f}")
