The gradients computed in the `contrastive_divergence()` function using `torch.mm()` are **outer products** summed over the batch, and this is a key concept in training Restricted Boltzmann Machines (RBMs). Let me break it down step by step:

---

### 1. **What is `torch.mm()` Doing?**
`torch.mm(A, B)` performs a **matrix multiplication** between two tensors `A` and `B`. In this case:
- `positive_grad = torch.mm(h_sample.t(), v)` computes the outer product of the hidden activations (`h_sample`) and the visible units (`v`) during the **positive phase**, summed over the examples in the batch.
- `negative_grad = torch.mm(p_h_given_v_sample.t(), v_sample)` computes the outer product of the hidden probabilities (`p_h_given_v_sample`) and the reconstructed visible units (`v_sample`) during the **negative phase**, again summed over the batch.

---

### 2. **Why Outer Products?**
The **outer product** is used because the weight matrix `W` in an RBM connects every visible unit to every hidden unit. The weight update rule in RBMs is based on the correlations between visible and hidden units. Specifically:
- The **positive phase** captures the correlation between the visible units (`v`) and the hidden units (`h_sample`) when the model is exposed to the real data.
- The **negative phase** captures the correlation between the reconstructed visible units (`v_sample`) and the hidden units (`p_h_given_v_sample`) after Gibbs sampling.

The outer product is the natural way to compute these correlations because:
- The outer product of a column vector (e.g., `h_sample`) and a row vector (e.g., `v`) results in a matrix where each element represents the interaction (or correlation) between a specific hidden unit and a specific visible unit.
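For a single training example, the outer product can be sketched as follows. The sizes and values are made up for illustration; `torch.outer` and `torch.mm` are standard PyTorch calls.

```python
import torch

# One training example: 3 hidden units and 4 visible units (illustrative sizes).
h = torch.tensor([1.0, 0.0, 1.0])        # hidden states
v = torch.tensor([1.0, 1.0, 0.0, 1.0])   # visible states

# Outer product: entry [j, i] is h[j] * v[i].
corr = torch.outer(h, v)
print(corr.shape)

# Same result written as (column vector) x (row vector) with torch.mm.
corr_mm = torch.mm(h.unsqueeze(1), v.unsqueeze(0))
print(torch.equal(corr, corr_mm))
```

Row `j` of the result is non-zero only when hidden unit `j` is on, and in that case it is a copy of `v`.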

---

### 3. **Positive and Negative Gradients**
- **Positive Gradient (`positive_grad`)**:
  - This is computed using the real data (`v`) and the hidden activations (`h_sample`).
  - It represents how the weights should be adjusted to increase the probability of the observed data.
  - It is called "positive" because it reinforces the correlations between visible and hidden units for the real data.

- **Negative Gradient (`negative_grad`)**:
  - This is computed using the reconstructed data (`v_sample`) and the hidden activations (`p_h_given_v_sample`).
  - It represents how the weights should be adjusted to decrease the probability of the reconstructed data (which the model generates itself).
  - It is called "negative" because it enters the update with a minus sign: it lowers the probability of the configurations the model generates on its own.

---

### 4. **Why Does This Work?**
The goal of training an RBM is to adjust the weights (`W`) so that the model assigns higher probability to the real data and lower probability to the reconstructions it generates itself. This is achieved by:
- **Maximizing the log-likelihood** of the data, which involves computing the gradient of the log-likelihood with respect to the weights.
- The gradient of the log-likelihood can be approximated using **Contrastive Divergence (CD)**, which uses the difference between the positive and negative gradients:
  ```python
  ΔW = learning_rate * (positive_grad - negative_grad) / batch_size
  ```
  This update rule moves the weights toward the statistics of the real data and away from the statistics of the model's own reconstructions. When the two sets of statistics agree, the update is zero.
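To make the update rule concrete, here is a small hand-checkable sketch. All tensor values are invented for illustration (two examples, two visible units, one hidden unit); only the formula comes from the text above.

```python
import torch

learning_rate = 0.1
batch_size = 2

# Made-up batch: 2 examples, 2 visible units, 1 hidden unit.
v = torch.tensor([[1.0, 0.0],
                  [1.0, 1.0]])                     # real data
h_sample = torch.tensor([[1.0],
                         [1.0]])                   # sampled hidden states
v_sample = torch.tensor([[0.0, 1.0],
                         [1.0, 1.0]])              # reconstructions
p_h_given_v_sample = torch.tensor([[0.5],
                                   [0.5]])         # hidden probabilities

positive_grad = torch.mm(h_sample.t(), v)                    # [[2.0, 1.0]]
negative_grad = torch.mm(p_h_given_v_sample.t(), v_sample)   # [[0.5, 1.0]]

delta_W = learning_rate * (positive_grad - negative_grad) / batch_size
print(delta_W)  # first weight grows by 0.075, second is unchanged
```

The second weight receives no update because the data and the reconstructions already agree on that visible-hidden correlation.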

---

### 5. **Why Outer Products Specifically?**
The weight matrix `W` in an RBM connects every visible unit to every hidden unit. The outer product is used because:
- Each weight `W[j, i]` represents the connection between hidden unit `j` and visible unit `i` (rows index hidden units, columns index visible units).
- The outer product computes all these pairwise interactions (correlations) at once, resulting in a matrix of the same shape as `W`.

For example:
- If `v` has shape `(batch_size, visible_dim)` and `h_sample` has shape `(batch_size, hidden_dim)`, then:
  - `h_sample.t()` has shape `(hidden_dim, batch_size)`.
  - `torch.mm(h_sample.t(), v)` results in a matrix of shape `(hidden_dim, visible_dim)`, which matches the shape of `W`.
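The shape argument can be checked directly. The sketch below uses random binary tensors with illustrative sizes and confirms that the matrix product equals the sum of per-example outer products.

```python
import torch

torch.manual_seed(0)
batch_size, visible_dim, hidden_dim = 5, 4, 3   # illustrative sizes

# Random binary stand-ins for the visible data and the sampled hidden states.
v = torch.bernoulli(torch.full((batch_size, visible_dim), 0.5))
h_sample = torch.bernoulli(torch.full((batch_size, hidden_dim), 0.5))

# One matrix product over the whole batch.
grad = torch.mm(h_sample.t(), v)
print(grad.shape)  # (hidden_dim, visible_dim), the same shape as W

# The same quantity as an explicit sum of per-example outer products.
summed = sum(torch.outer(h_sample[n], v[n]) for n in range(batch_size))
print(torch.allclose(grad, summed))

# Entry [j, i] counts how often hidden unit j and visible unit i are on together.
j, i = 2, 1
entry = (h_sample[:, j] * v[:, i]).sum()
print(torch.isclose(grad[j, i], entry))
```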

---

### 6. **Summary**
- The **positive gradient** captures correlations between real data and hidden units, reinforcing the model's ability to represent the data.
- The **negative gradient** captures correlations between reconstructed data and hidden units, lowering the probability of configurations the model generates on its own.
- Both gradients are computed as **outer products** because the weight matrix `W` connects every visible unit to every hidden unit, and the outer product naturally captures these pairwise interactions.
- The weight update rule (`W += learning_rate * (positive_grad - negative_grad) / batch_size`) moves the model's statistics toward those of the real data.

This is why the outer product and the distinction between positive and negative gradients are central to training RBMs.

Notes on `contrastive_divergence()`

**Positive phase:**
- Compute the hidden probabilities (`p_h_given_v`) and sample hidden states (`h_sample`).
- Compute the positive gradient as the outer product of `h_sample` and `v`.

**Negative phase:**
- Perform `k` steps of Gibbs sampling to reconstruct the visible and hidden states.
- Compute the negative gradient as the outer product of the reconstructed hidden probabilities and visible states.

**Weight and bias updates:**
- Update the weights (`W`) and biases (`v_bias`, `h_bias`) using the difference between the positive-phase and negative-phase statistics, normalized by the batch size.
- Make sure to define a `learning_rate` attribute in your class (e.g., in the constructor) to control the step size for updates.
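The steps in these notes can be put together in one self-contained sketch. The class name `TinyRBM`, the default hyperparameters, the use of `torch.bernoulli` for sampling, the bias update formulas, and the returned reconstruction error are all assumptions made for illustration; they are not taken from the original implementation.

```python
import torch
import torch.nn.functional as F


class TinyRBM:
    """Hypothetical minimal RBM, used only to illustrate one CD-k update."""

    def __init__(self, visible_dim, hidden_dim, k=1, learning_rate=0.1):
        self.W = torch.randn(hidden_dim, visible_dim) * 0.01
        self.v_bias = torch.zeros(visible_dim)
        self.h_bias = torch.zeros(hidden_dim)
        self.k = k
        self.learning_rate = learning_rate  # step size for all updates

    def visible_to_hidden(self, v):
        return torch.sigmoid(F.linear(v, self.W, self.h_bias))

    def hidden_to_visible(self, h):
        return torch.sigmoid(F.linear(h, self.W.t(), self.v_bias))

    def contrastive_divergence(self, v):
        batch_size = v.size(0)

        # Positive phase: hidden probabilities and samples driven by the data.
        p_h_data = self.visible_to_hidden(v)
        h_data = torch.bernoulli(p_h_data)
        positive_grad = torch.mm(h_data.t(), v)

        # Negative phase: k steps of Gibbs sampling starting from the data.
        v_sample = v
        for _ in range(self.k):
            h_sample = torch.bernoulli(self.visible_to_hidden(v_sample))
            v_sample = torch.bernoulli(self.hidden_to_visible(h_sample))
        p_h_model = self.visible_to_hidden(v_sample)
        negative_grad = torch.mm(p_h_model.t(), v_sample)

        # Updates, averaged over the batch (bias rules are an assumption).
        lr = self.learning_rate
        self.W += lr * (positive_grad - negative_grad) / batch_size
        self.v_bias += lr * (v - v_sample).mean(dim=0)
        self.h_bias += lr * (h_data - p_h_model).mean(dim=0)

        # Mean squared reconstruction error, returned only for monitoring.
        return torch.mean((v - v_sample) ** 2).item()


torch.manual_seed(0)
rbm = TinyRBM(visible_dim=6, hidden_dim=3, k=1)
data = torch.bernoulli(torch.full((8, 6), 0.5))  # random binary toy batch
for _ in range(5):
    err = rbm.contrastive_divergence(data)
print(err)
```

Because the parameters are plain tensors without autograd, the in-place `+=` updates are safe here.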

# Midterm 2, Assignment 3 - Gaetano Barresi [579102]

A Restricted Boltzmann Machine (RBM) is a generative stochastic neural network that learns a probability distribution over its inputs. It is widely used for unsupervised learning, dimensionality reduction, and feature extraction.
To implement an RBM from scratch, we must first define its architecture. It is a simple two-layer neural network with one input layer (the visible states, our data) and one hidden layer (the hidden states, a latent feature representation). As parameters, we have one weight matrix and two sets of biases, one for the visible units and one for the hidden units:

```python
self.W = torch.randn(hidden_dim, visible_dim, device=self.device) * 0.01
self.v_bias = torch.zeros(visible_dim, device=self.device)
self.h_bias = torch.zeros(hidden_dim, device=self.device)
```

The weights `W` are initialized with small random values and the biases with zeros.
Hidden units are conditionally independent given the visible units and vice versa, because the edges are undirected and the graph is bipartite:

$$
\mathbb{P}(h_j \mid v) = \sigma \left( \sum_i M_{ij} v_i + c_j \right) \quad \forall j
$$

$$
\mathbb{P}(v_i \mid h) = \sigma \left( \sum_j M_{ij} h_j + b_i \right) \quad \forall i
$$

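These are respectively the forward pass (wake) and the backward pass (dream). In the code, the weight $M_{ij}$ is stored as `W[j, i]` (so `W` has shape `(hidden_dim, visible_dim)`), $c$ is `h_bias`, and $b$ is `v_bias`. The two passes can be implemented as: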


```python
import torch
import torch.nn.functional as F

# methods of the RBM class
def visible_to_hidden(self, v):  # forward pass
    # compute probabilities of hidden units given visible units
    p_h = torch.sigmoid(F.linear(v, self.W, self.h_bias))
    return p_h

def hidden_to_visible(self, h):  # backward pass
    # compute probabilities of visible units given hidden units
    p_v = torch.sigmoid(F.linear(h, self.W.t(), self.v_bias))
    return p_v
```
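As a quick check of the two passes, the standalone sketch below verifies that `F.linear(x, weight, bias)` computes `x @ weight.T + bias`, which is why `W` is passed directly in the forward pass and transposed in the backward pass. The sizes and the random binary input are assumptions chosen for illustration.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
batch_size, visible_dim, hidden_dim = 4, 6, 3   # illustrative sizes

W = torch.randn(hidden_dim, visible_dim) * 0.01
v_bias = torch.zeros(visible_dim)
h_bias = torch.zeros(hidden_dim)

v = torch.bernoulli(torch.full((batch_size, visible_dim), 0.5))

# Forward pass: (batch, visible) -> (batch, hidden).
p_h = torch.sigmoid(F.linear(v, W, h_bias))
p_h_manual = torch.sigmoid(v @ W.t() + h_bias)
print(torch.allclose(p_h, p_h_manual))

# Backward pass: (batch, hidden) -> (batch, visible), using the transposed weights.
h = torch.bernoulli(p_h)
p_v = torch.sigmoid(F.linear(h, W.t(), v_bias))
p_v_manual = torch.sigmoid(h @ W + v_bias)
print(p_h.shape, p_v.shape)
```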


In an RBM, sampling is a key step of training, particularly during Gibbs sampling. To generate binary samples (0 or 1) from a tensor of probabilities `p`, we use the following function. These binary samples represent the activation states of the visible or hidden units in the RBM.

```python
def sample_from_p(self, p):
    # bernoulli sampling given probabilities
    return F.relu(torch.sign(p - torch.rand_like(p, device=self.device)))
```

How it works: `sample_from_p` takes `p`, a tensor of probabilities. Each value in `p` is the probability of a unit being activated (set to 1).
`torch.rand_like(p)` generates a tensor of random values, uniformly distributed in [0, 1), with the same shape as `p`. `p - torch.rand_like(p)` computes the difference between the probabilities and the random values.
`torch.sign(p - torch.rand_like(p))` maps positive differences to 1, negative differences to -1, and an exact zero to 0. This is a thresholding operation: the difference is positive with probability `p`, so each unit is marked active (1) with probability `p` and inactive (-1 or 0) otherwise.
`F.relu(...)` clamps the negative values to 0, converting the -1 entries to 0 and leaving a binary tensor of 0s and 1s.
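The behaviour described above can be checked empirically. The sketch below is a standalone version of the function (without `self` or a device argument, which is an assumption made to keep it self-contained) and compares it with PyTorch's built-in `torch.bernoulli`, which draws the same kind of samples.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)


def sample_from_p(p):
    # Standalone variant of the method above: 1 where a uniform draw falls below p.
    return F.relu(torch.sign(p - torch.rand_like(p)))


p = torch.full((100_000,), 0.3)   # every unit is active with probability 0.3

s = sample_from_p(p)
print(torch.unique(s))            # only 0s and 1s appear
print(s.mean().item())            # empirical activation rate, close to 0.3

# Built-in alternative that samples from the same Bernoulli distribution.
alt = torch.bernoulli(p)
print(alt.mean().item())          # also close to 0.3
```

Either variant can be used inside Gibbs sampling; the built-in call is shorter, while the sign/ReLU version makes the thresholding explicit.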

All these pieces are used inside the generalized (k-step) version of the Contrastive Divergence (CD) learning algorithm. It is divided into a positive phase (wake part), a negative phase (dream part), and the parameter update.
The positive phase computes the hidden probabilities (`p_h_given_v`) and samples the hidden states (`h_sample`). It then computes the positive gradient as the outer product of `h_sample` and `v`, summed over the batch.

```python
# positive phase
p_h_given_v = self.visible_to_hidden(v)
h_sample = self.sample_from_p(p_h_given_v)
positive_grad = torch.mm(h_sample.t(), v)
```

The negative phase performs `k` steps of Gibbs sampling to reconstruct the visible and hidden states. It then computes the negative gradient as the outer product of the reconstructed hidden probabilities and visible states, summed over the batch.

```python
# gibbs sampling (negative phase): k alternating hidden/visible sampling steps
v_sample = v
for _ in range(self.k):
    p_h_given_v = self.visible_to_hidden(v_sample)
    h_sample = self.sample_from_p(p_h_given_v)
    p_v_given_h = self.hidden_to_visible(h_sample)
    v_sample = self.sample_from_p(p_v_given_h)

# negative gradient from the final reconstruction (hidden probabilities, not samples)
p_h_given_v_sample = self.visible_to_hidden(v_sample)
negative_grad = torch.mm(p_h_given_v_sample.t(), v_sample)
```

Parameters updates