# bf16, fp16 or fp32 Model Pretraining Detection

The goal is to autodetect whether a model has been trained in bf16, fp16 or fp32 precision. We want this because bf16-pretrained models tend to overflow when subsequently finetuned with fp16 (mixed).

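We know that fp16's largest finite value is `65504` (just under `2**16=65536`, i.e. `~6.5e04`), so it should be easy to look at the weights: if they are larger than `1e02` (`sqrt(1e04)`, so squaring such a value already lands within an order of magnitude of the fp16 limit), then the model has most likely been trained in a precision other than fp16 (mixed or not).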
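The largest finite value of each 16-bit format can be derived from its bit layout. The following is a minimal stdlib-only sketch; the variable names are illustrative, and the `300.0` weight is a made-up value chosen only to show how quickly a product leaves the fp16 range:

```python
import struct

# Largest finite value = (2 - 2**-mantissa_bits) * 2**max_exponent
fp16_max = (2 - 2**-10) * 2**15   # IEEE 754 half precision: 10 mantissa bits, max exponent 15
bf16_max = (2 - 2**-7) * 2**127   # bfloat16: 7 mantissa bits, max exponent 127

print(f"fp16 max: {fp16_max}")
print(f"bf16 max: {bf16_max:.3e}")

# The value round-trips unchanged through Python's half-precision struct format ('e')
roundtrip = struct.unpack("<e", struct.pack("<e", fp16_max))[0]

# The product of two hypothetical weights of 300 no longer fits into fp16.
# struct raises OverflowError here; fp16 arithmetic on an accelerator would yield inf instead.
try:
    struct.pack("<e", 300.0 * 300.0)
    overflowed = False
except OverflowError:
    overflowed = True

print(f"300 * 300 overflows fp16: {overflowed}")
```

The gap between the two maxima is why large weights that are harmless in bf16 can become a problem once the same model runs in fp16.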

Let's write a script that looks at the absolute min/max values of any model's weights, apply it to a bunch of models whose training precision we know, and look for a pattern.

I thought that abs min values could give us some info about the precision, but most likely it's the abs max values that are most telling. Let's see.

I also added min and max norms, which I see are quite telling as well.
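To see what the two kinds of statistics react to, here is a small pure-Python sketch. The helper `l2_norm` and the toy values are illustrative assumptions and are not taken from any model: the norm grows with both the number of elements and their magnitude, while abs max only reacts to the single largest element.

```python
import math

def l2_norm(values):
    """Euclidean (L2) norm of a flat list of numbers."""
    return math.sqrt(sum(v * v for v in values))

small = [0.5] * 100              # 100 weights of magnitude 0.5
large = [0.5] * 10_000           # same magnitude, 100x more weights
outlier = [0.5] * 99 + [50.0]    # small tensor with one large weight

for name, values in [("small", small), ("large", large), ("outlier", outlier)]:
    abs_max = max(abs(v) for v in values)
    print(f"{name:<8} | abs max {abs_max:.3e} | norm {l2_norm(values):.3e}")
```

In this toy setup `large` and `outlier` end up with almost the same norm even though their abs max differs by a factor of 100, so the two statistics carry different information.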

**I currently need more public models to get the patterns right. Please help by adding more models whose training precision you know. Thank you!**

You can submit your contribution and/or read the database gathered so far [here](https://discuss.huggingface.co/t/compiling-data-on-how-models-were-pre-trained-fp16-fp32-bf16/5671).


In [12]:
import torch
import logging
import transformers

In [2]:
from transformers import AutoModel

## Module weights abs min/max analyser

In [3]:
def analyze(modules, verbose=True):
    """
    modules is a list of sub-modules to search recursively.

    This can be the whole model, but sometimes only some submodules need to be inspected.
    """
    if verbose:
        print("\nSearching:")
        print("module | params")
    abs_min, abs_max = 1e10, 0
    norm_min, norm_max = 1e10, 0
    for i,m in enumerate(modules):
        for j,p in enumerate(m.parameters(recurse=True)):
            p_abs = p.abs()
            p_abs_max = p_abs.max().item()
            p_abs_min = p_abs.min().item()
            if p_abs_min < abs_min: abs_min = p_abs_min
            if p_abs_max > abs_max: abs_max = p_abs_max
                
            # .item() keeps norm_min/norm_max as plain Python floats, like abs_min/abs_max
            p_norm = torch.linalg.norm(p.data).item()
            if p_norm > 0:
                if p_norm < norm_min: norm_min = p_norm
                if p_norm > norm_max: norm_max = p_norm
        if verbose:
            # j is the zero-based index of the last parameter, i.e. the parameter count minus one
            print(f"{i:>6} | {j}")
    return abs_min, abs_max, norm_min, norm_max

The only concern I have here is that a model trained in mixed precision may have some segments trained in fp32, and those may end up with larger weights. This is very unlikely, since those segments still have to interact with the rest of the system, but more thought is needed.
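One way to probe this concern is to check which parameters actually hold the largest values. This is a minimal sketch, assuming a standard `torch.nn.Module`; the helper name `top_abs_max` and the toy model with a planted outlier are hypothetical stand-ins and are not part of the analysis in this notebook:

```python
import torch
from torch import nn

def top_abs_max(model, k=3):
    """Return the k parameters with the largest absolute value as (name, abs_max) pairs."""
    stats = []
    with torch.no_grad():
        for name, p in model.named_parameters():
            if p.numel() == 0:
                continue  # skip empty parameters, max() is undefined for them
            stats.append((name, p.abs().max().item()))
    return sorted(stats, key=lambda s: s[1], reverse=True)[:k]

# Toy stand-in for a real model: two linear layers, with one planted outlier
toy = nn.Sequential(nn.Linear(4, 4), nn.Linear(4, 2))
with torch.no_grad():
    toy[1].weight.fill_(0.5)
    toy[1].weight[0, 0] = 300.0

result = top_abs_max(toy, k=1)
for name, value in result:
    print(f"{name:<12} | {value:.3e}")
```

If the largest values all come from a handful of parameters in one part of a real model, that part is worth inspecting separately with `analyze`.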

In [4]:
from transformers.utils.logging import disable_progress_bar
disable_progress_bar() # disable tqdm!

model = AutoModel.from_pretrained("t5-3b")

In [5]:
# Let's look at t5-small in verbose mode
#model = AutoModel.from_pretrained("t5-small")

# let's look at just transformer blocks
abs_min, abs_max, norm_min, norm_max = analyze([model.encoder.block, model.decoder.block])
print("\nResults:")
print("abs min   | abs max   | norm min  | norm max")
print(f"{abs_min:.3e} | {abs_max:.3e} | {norm_min:.3e} | {norm_max:.3e}")

# now the whole model
abs_min, abs_max, norm_min, norm_max = analyze([model])
print("\nResults:")
print("abs min   | abs max   | norm min  | norm max")
print(f"{abs_min:.3e} | {abs_max:.3e} | {norm_min:.3e} | {norm_max:.3e}")

del model


Searching:
module | params
     0 | 192
     1 | 312

Results:
abs min   | abs max   | norm min   | norm max
1.455e-11 | 6.950e+01 | 5.201e+00 | 2.535e+03

Searching:
module | params
     0 | 508

Results:
abs min   | abs max   | norm min   | norm max
1.455e-11 | 2.340e+02 | 5.201e+00 | 6.349e+04


## Multiple model weights abs min/max analyser

Now let's write a nice wrapper to process many models.

In [6]:
def models_analyze(mnames):
    transformers.logging.set_verbosity_error() # be quiet
    print(f"{'name':^40} | {'abs min':^9} | {'abs max':^9} | {'norm min':^9} | {'norm max':^9}  ")
    print(f"{'-'*40}-|-{'-'*9}-|-{'-'*9}-|-{'-'*9}-|-{'-'*9}-")
    for mname in mnames:
        model = AutoModel.from_pretrained(mname)
        abs_min, abs_max, norm_min, norm_max = analyze([model], verbose=False)
        print(f"{mname:<40} | {abs_min:.3e} | {abs_max:.3e} | {norm_min:.3e} | {norm_max:.3e}")
        del model

## fp16 models

Let's look at fp16-pretrained models

In [7]:
# fp16-pretrained models
mnames = ["allenai/longformer-base-4096", "allenai/longformer-large-4096", 
          "allenai/led-base-16384", "allenai/led-large-16384", "lvwerra/codeparrot", 
          "facebook/m2m100_418M", "facebook/m2m100_1.2B",
           "facebook/opt-1.3b", "facebook/opt-13b",
           "bigscience/bloom-7b1", "bigscience/bloom-3b",         
         ]
models_analyze(mnames)

                  name                   |  abs min  |  abs max  | norm min  | norm max   
-----------------------------------------|-----------|-----------|-----------|-----------
allenai/longformer-base-4096             | 0.000e+00 | 1.510e+00 | 2.272e-02 | 7.993e+02
allenai/longformer-large-4096            | 0.000e+00 | 1.146e+00 | 9.087e-02 | 9.428e+02
allenai/led-base-16384                   | 0.000e+00 | 1.600e+01 | 1.611e-02 | 4.147e+02
allenai/led-large-16384                  | 0.000e+00 | 2.320e+01 | 4.799e-02 | 6.362e+02
lvwerra/codeparrot                       | 1.245e-11 | 1.832e+00 | 1.185e-01 | 2.112e+02
facebook/m2m100_418M                     | 0.000e+00 | 1.000e+00 | 4.792e-01 | 4.829e+02
facebook/m2m100_1.2B                     | 0.000e+00 | 1.000e+00 | 4.835e-01 | 4.925e+02
facebook/opt-1.3b                        | 0.000e+00 | 1.000e+00 | 4.852e-02 | 3.619e+02
facebook/opt-13b                         | 0.000e+00 | 1.000e+00 | 7.830e-02 | 3.136e+02
bigscience/bloom-7

So we can see that the fp16 abs max weights are quite small: they are in the range of 1e0 - 1e1.

The norm max is also always under 1e3 in our samples.

The abs max for the "led" models is oddly high. They are supposed to be the same as the longformer models, which are fp16. Their norm max, however, matches the other models.

## bf16 models

Let's look at bf16-pretrained models

In [8]:
# bf16-pretrained models
mnames = ["t5-small", "t5-base", "t5-large", "google/mt5-small", "google/mt5-base", 
          "google/mt5-large",
          "google/bigbird-pegasus-large-arxiv", "google/pegasus-cnn_dailymail", 
          "google/pegasus-large", "google/pegasus-multi_news", "google/pegasus-xsum",
          "bigscience/T0_3B", "EleutherAI/gpt-neo-1.3B",
]
# "bigscience/T0pp", T0 are huge!
models_analyze(mnames)

                  name                   |  abs min  |  abs max  | norm min  | norm max   
-----------------------------------------|-----------|-----------|-----------|-----------
t5-small                                 | 5.442e-09 | 7.920e+02 | 1.780e+00 | 9.403e+04
t5-base                                  | 1.273e-10 | 5.600e+02 | 1.647e+00 | 9.332e+04
t5-large                                 | 3.638e-11 | 5.200e+02 | 3.797e+00 | 8.237e+04
google/mt5-small                         | 3.201e-09 | 1.140e+02 | 2.662e+00 | 1.610e+05
google/mt5-base                          | 1.848e-09 | 1.135e+02 | 3.445e+00 | 1.639e+05
google/mt5-large                         | 1.892e-10 | 1.750e+02 | 4.472e+00 | 2.029e+05
google/bigbird-pegasus-large-arxiv       | 0.000e+00 | 2.424e+02 | 4.955e-01 | 3.183e+03
google/pegasus-cnn_dailymail             | 0.000e+00 | 2.416e+02 | 4.926e-01 | 4.423e+03
google/pegasus-large                     | 0.000e+00 | 2.417e+02 | 4.912e-01 | 4.745e+03
google/pegasus-mul

We can see big abs max weight values, pretty consistently, so perhaps an abs max weight > 1e2 makes a model a good candidate for the bf16 group.
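The observations so far can be condensed into a rough rule of thumb. The sketch below is only an illustration: the function name `guess_precision`, its labels and the exact cutoffs are assumptions read off the fp16 and bf16 tables above, and it makes no attempt to identify fp32.

```python
def guess_precision(abs_max, norm_max):
    """Hypothetical rule of thumb based only on the samples in this notebook."""
    if abs_max > 1e2:
        # large individual weights: the pattern seen in the bf16 samples
        return "bf16 candidate"
    if abs_max <= 1e1 and norm_max < 1e3:
        # small weights and small norms: the pattern seen in the fp16 samples
        return "consistent with fp16"
    return "inconclusive"

# (abs max, norm max) pairs copied from the result tables above
samples = {
    "t5-small": (7.920e+02, 9.403e+04),
    "google/mt5-small": (1.140e+02, 1.610e+05),
    "allenai/longformer-base-4096": (1.510e+00, 7.993e+02),
    "allenai/led-large-16384": (2.320e+01, 6.362e+02),
}
for name, (abs_max, norm_max) in samples.items():
    print(f"{name:<30} | {guess_precision(abs_max, norm_max)}")
```

An abs max above 1e2 only rules out the fp16 pattern seen here; it does not by itself distinguish bf16 from fp32.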

## fp32 models

Let's look at fp32-pretrained models

In [9]:
# fp32-pretrained models
mnames = ["gsarti/it5-small", "gsarti/it5-base", "gsarti/it5-base-oscar", 
          "gsarti/it5-large", "EleutherAI/gpt-neo-2.7B", 
         ]
models_analyze(mnames)

                  name                   |  abs min  |  abs max  | norm min  | norm max   
-----------------------------------------|-----------|-----------|-----------|-----------
gsarti/it5-small                         | 6.114e-08 | 4.693e+02 | 8.411e-02 | 6.881e+04
gsarti/it5-base                          | 1.068e-08 | 1.598e+03 | 3.596e-01 | 8.997e+04
gsarti/it5-base-oscar                    | 3.638e-12 | 2.092e+01 | 3.637e+00 | 5.758e+03
gsarti/it5-large                         | 2.094e-09 | 4.388e+04 | 7.982e-02 | 1.105e+06
EleutherAI/gpt-neo-2.7B                  | 2.319e-11 | 3.563e+00 | 1.322e+00 | 9.850e+02


The abs max is all over the map here.

"EleutherAI/gpt-neo-2.7B"'s abs max is very low.

XXX: need more inputs

## Unknown models

Let's look at some unknown models

In [10]:
# fp32? (XXX: need to check)
#mnames = ["bigscience/T0_3B"] 
# mnames = ["bigscience/T0pp", "bigscience/T0_3B"] "bigscience/T0pp" is huge!
#models_analyze(mnames)

Need to check how it was trained. It looks like bf16 to me.

In [11]:
# fp32? (XXX: need to check)
#mnames = ["google/pegasus-pubmed"] 
#mnames = [] 

#mnames = [""] 
#models_analyze(mnames)