**Consider the following model:**

In [None]:
import torch.nn as nn


class ToyModel(nn.Module):
    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        self.fc1 = nn.Linear(in_features, 10, bias=False)
        self.ln = nn.LayerNorm(10)
        self.fc2 = nn.Linear(10, out_features, bias=False)
        self.relu = nn.ReLU()

    def forward(self, x):
        x = self.relu(self.fc1(x))
        x = self.ln(x)
        x = self.fc2(x)
        return x


**Suppose we are training the model on a GPU and that the model parameters are originally in
FP32. We’d like to use autocasting mixed precision with FP16. What are the data types of:**
- the model parameters within the autocast context,
  - FP32
- the output of the first feed-forward layer (ToyModel.fc1),
  - FP16
- the output of layer norm (ToyModel.ln),
  - FP32
- the model’s predicted logits,
  - FP16
- the loss,
  - FP32
- and the model’s gradients?
  - FP32
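
The answers above can be checked empirically. The following is a minimal sketch, not part of the original assignment code: the helper `autocast_dtypes`, the hook names, and the tensor shapes are illustrative assumptions. It runs one forward and backward pass under `torch.autocast` and records the dtypes. FP16 autocast is used when a GPU is available; otherwise the sketch falls back to BF16 on CPU, where the autocast op lists differ, so the layer norm and loss dtypes may not match the GPU answers.

```python
import torch
import torch.nn as nn


class ToyModel(nn.Module):
    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        self.fc1 = nn.Linear(in_features, 10, bias=False)
        self.ln = nn.LayerNorm(10)
        self.fc2 = nn.Linear(10, out_features, bias=False)
        self.relu = nn.ReLU()

    def forward(self, x):
        x = self.relu(self.fc1(x))
        x = self.ln(x)
        x = self.fc2(x)
        return x


def autocast_dtypes(device_type: str, dtype: torch.dtype) -> dict:
    """Run one forward/backward pass under autocast and record dtypes.

    Hypothetical helper for illustration; sizes are arbitrary.
    """
    model = ToyModel(4, 3).to(device_type)
    x = torch.randn(2, 4, device=device_type)
    target = torch.randint(0, 3, (2,), device=device_type)
    seen = {}

    def record(name):
        # Forward hook that stores the dtype of a submodule's output.
        def hook(module, inputs, output):
            seen[name] = output.dtype
        return hook

    handles = [
        model.fc1.register_forward_hook(record("fc1 output")),
        model.ln.register_forward_hook(record("layer norm output")),
    ]
    with torch.autocast(device_type=device_type, dtype=dtype):
        seen["parameters"] = model.fc1.weight.dtype
        logits = model(x)
        loss = nn.functional.cross_entropy(logits, target)
    loss.backward()  # backward runs outside the autocast context
    for handle in handles:
        handle.remove()

    seen["logits"] = logits.dtype
    seen["loss"] = loss.dtype
    seen["gradients"] = model.fc1.weight.grad.dtype
    return seen


if torch.cuda.is_available():
    device_type, dtype = "cuda", torch.float16
else:
    # Assumption: no GPU, so fall back to CPU autocast with BF16.
    device_type, dtype = "cpu", torch.bfloat16

result = autocast_dtypes(device_type, dtype)
for name, seen_dtype in result.items():
    print(f"{name}: {seen_dtype}")
```

The parameters and their gradients stay in FP32 because autocast only casts the inputs of selected ops; it never modifies the stored weights.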

**You should have seen that FP16 mixed precision autocasting treats the layer normalization layer
differently than the feed-forward layers. What parts of layer normalization are sensitive to mixed
precision? If we use BF16 instead of FP16, do we still need to treat layer normalization differently?
Why or why not?**

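Layer normalization involves two steps that are prone to numerical instability when performed in $\text{FP16}$ (half precision):

1. Computing the variance ($\sigma^2$): the deviations from the mean are squared and summed. The largest finite $\text{FP16}$ value is 65504, so a deviation with magnitude above roughly 256 already overflows to infinity when squared.

2. Dividing by the standard deviation ($\sqrt{\sigma^2 + \epsilon}$): a very small variance falls below the smallest positive $\text{FP16}$ value (about $6 \times 10^{-8}$) and rounds to zero, and a typical $\epsilon$ of $10^{-5}$ lies in the subnormal range, where it is stored with reduced precision.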

With $\text{BF16}$ autocasting, it is generally much less necessary to force $\text{LayerNorm}$ into $\text{FP32}$, for two reasons:

1. Wider Dynamic Range ($\text{BF16}$): $\text{BF16}$ uses 8 exponent bits, the same number as $\text{FP32}$. This gives it a significantly larger range than $\text{FP16}$ (with its 5 exponent bits). This wider range almost completely eliminates the risk of overflow during the variance summation step.

2. Less Underflow ($\text{BF16}$): Because the exponent range is much wider, $\text{BF16}$ can represent much smaller non-zero numbers than $\text{FP16}$, making it far less prone to underflow to zero in the denominator of the normalization step.

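By matching the dynamic range of $\text{FP32}$, $\text{BF16}$ removes the overflow and underflow problems that make $\text{LayerNorm}$ fragile in $\text{FP16}$. The trade-off is precision: $\text{BF16}$ keeps only 7 mantissa bits ($\text{FP16}$ has 10), so the mean and variance reductions are still more accurate when accumulated in $\text{FP32}$. Note that, going by the documented autocast op lists, $\text{PyTorch}$'s Automatic Mixed Precision ($\text{AMP}$) does not distinguish between the two formats here: on CUDA, `layer_norm` autocasts to $\text{FP32}$ under both $\text{FP16}$ and $\text{BF16}$. Running $\text{LayerNorm}$ in $\text{BF16}$ is therefore a choice that trades some accuracy for speed, not something autocast does by default.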
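
The range argument can be illustrated with a few scalar operations. This is a small sketch with made-up activation values (magnitude 300, mean zero), chosen only so that each squared deviation is 90000; it is not taken from any real model.

```python
import torch

# Made-up activations with mean zero, so each squared deviation is 300^2 = 90000.
x = torch.tensor([300.0, -300.0] * 4)

squared = {}
for dtype in (torch.float16, torch.bfloat16, torch.float32):
    centered = x.to(dtype)  # the mean is exactly zero for this input
    squared[dtype] = centered * centered

# A tiny variance, as could arise from nearly constant activations.
tiny = torch.tensor(1e-8)
tiny_fp16 = tiny.to(torch.float16)
tiny_bf16 = tiny.to(torch.bfloat16)

print("FP16 max:", torch.finfo(torch.float16).max)
print("BF16 max:", torch.finfo(torch.bfloat16).max)
for dtype, values in squared.items():
    print(dtype, "squared deviation:", values[0].item())
print("1e-8 in FP16:", tiny_fp16.item())
print("1e-8 in BF16:", tiny_bf16.item())
```

The squared deviation overflows to infinity in FP16 and stays finite in BF16, although BF16 stores it only approximately because of its shorter mantissa. The tiny variance rounds to zero in FP16 and remains nonzero in BF16.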


**Modify your benchmarking script to optionally run the model using mixed precision with BF16.
Time the forward and backward passes with and without mixed-precision for each language model
size described in §1.1.2. Compare the results of using full vs. mixed precision, and comment on
any trends as model size changes. You may find the nullcontext no-op context manager to be
useful.**

| Model   |   Forward Time Avg |   Forward Time Std |   Backward Time Avg |   Backward Time Std | Mixed precision BF16 |
|:--------|-------------------:|-------------------:|--------------------:|--------------------:|:------------------|
| small   |          0.0296619 |         0.00128773 |           0.0487109 |         0.04987     | False             |
| medium  |          0.0594976 |         0.00104634 |           0.0790231 |         0.000644619 | False             |
| large   |          0.106489  |         0.0119742  |           0.177656  |         0.000461642 | False             |
| xl      |          0.211363  |         0.0194774  |           0.415546  |         0.193638    | False             |
| 2.7B    |          0.295566  |         0.0411646  |           0.544969  |         0.135777    | False             |
| small   |          0.0474328 |         0.0481943  |           0.0339963 |         0.00088042  | True              |
| medium  |          0.0626057 |         0.00077958 |           0.0676167 |         0.000379691 | True              |
| large   |          0.0950005 |         0.00390939 |           0.100845  |         0.000948544 | True              |
| xl      |          0.127277  |         0.00994632 |           0.15123   |         0.00208068  | True              |
| 2.7B    |          0.106543  |         0.0427552  |           0.231101  |         0.00334915  | True              |

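For the forward pass, BF16 mixed precision only pays off from the large model upward: it is slower for small (0.047 vs. 0.030) and slightly slower for medium (0.063 vs. 0.059), then about 1.1× faster for large, 1.7× for xl, and 2.8× for 2.7B (0.107 vs. 0.296).
The backward pass is faster with BF16 at every size, from about 1.2× for medium up to 2.7× for xl and 2.4× for 2.7B, so the benefit of mixed precision generally grows with model size. Several entries have a standard deviation comparable to the mean (for example the small forward pass with BF16 and the small backward pass with FP32), so those comparisons are noisy.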
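
For reference, the optional mixed-precision switch can be sketched as below. This is not the benchmarking script that produced the table: the function `benchmark`, its parameters, the placeholder loss, and the small stand-in model are all assumptions made for illustration. The point is that `nullcontext()` lets the same timing loop run with or without `torch.autocast`.

```python
import statistics
import timeit
from contextlib import nullcontext

import torch
import torch.nn as nn


def benchmark(model, batch, mixed_precision, warmup_steps=2, measure_steps=5):
    """Time forward and backward passes, optionally under BF16 autocast."""
    device_type = batch.device.type

    def sync():
        # CUDA kernels run asynchronously, so wait before reading the clock.
        if device_type == "cuda":
            torch.cuda.synchronize()

    forward_times, backward_times = [], []
    for step in range(warmup_steps + measure_steps):
        # nullcontext() is a no-op, so the full-precision path is unchanged.
        context = (
            torch.autocast(device_type=device_type, dtype=torch.bfloat16)
            if mixed_precision
            else nullcontext()
        )
        model.zero_grad(set_to_none=True)
        sync()
        start = timeit.default_timer()
        with context:
            loss = model(batch).float().mean()  # placeholder loss
        sync()
        after_forward = timeit.default_timer()
        loss.backward()  # backward runs outside the autocast context
        sync()
        after_backward = timeit.default_timer()
        if step >= warmup_steps:  # discard warm-up iterations
            forward_times.append(after_forward - start)
            backward_times.append(after_backward - after_forward)

    return {
        "forward_avg": statistics.mean(forward_times),
        "forward_std": statistics.stdev(forward_times),
        "backward_avg": statistics.mean(backward_times),
        "backward_std": statistics.stdev(backward_times),
    }


# Stand-in model and batch (hypothetical; the real script uses the language models).
device = "cuda" if torch.cuda.is_available() else "cpu"
toy_model = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 64)).to(device)
toy_batch = torch.randn(8, 64, device=device)
results = {flag: benchmark(toy_model, toy_batch, flag) for flag in (False, True)}
for flag, stats in results.items():
    print("mixed precision:", flag, stats)
```

Discarding warm-up steps and synchronizing before each clock read are two likely ways to reduce the large standard deviations seen in some rows of the table.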
