
New and better T5 checkpoints from scaling transformers paper #15467

Open · 3 tasks done · Xirider opened this issue Feb 1, 2022 · 26 comments

@Xirider commented Feb 1, 2022

🌟 New model addition

Model description

This paper explores different ways of scaling T5:
Scaling Efficiently: Insights from Pre-training and Finetuning Transformers
https://arxiv.org/abs/2109.10686

They release new checkpoints (DeepNarrow) that perform significantly better than the previous models on downstream tasks.

Here is a table from the paper that compares the old Base, Large, XL and XXL models to the new ones:

[Image: comparison table from the paper]

The checkpoints were released today here:
https://github.com/google-research/google-research/tree/master/scaling_transformers

Open source status

@LysandreJik (Member) commented

cc @patil-suraj @patrickvonplaten

@patrickvonplaten (Contributor) commented Feb 2, 2022

Holy moly 170 checkpoints at those sizes. I'll give it a try tomorrow. Official link: https://console.cloud.google.com/storage/browser/scenic-bucket/scaling_explorer/

@patrickvonplaten (Contributor) commented

Okay, I can't get the models to work out of the box. Will dive a bit deeper into the code next week.

@patrickvonplaten patrickvonplaten self-assigned this Feb 7, 2022
@patrickvonplaten (Contributor) commented

The conversion seems to work now. I've added a first model here:
https://huggingface.co/NewT5/t5-efficient-base-el4

I have some internal conversion scripts for https://github.com/google-research/text-to-text-transfer-transformer => Transformers that allow me to quickly convert mesh-tensorflow checkpoints, verify them, and upload them.

All I need for a conversion now is the following:

Now, there are over a hundred new checkpoints, which makes manual conversion too slow and time-consuming. I think we should do one of the following:

  1. Write a script that does the conversion automatically. This can definitely be done and it shouldn't be too hard. In order to do this we have to do three things:
    a. Find a good name pattern, e.g. t5-efficient-{config}
    b. (This is the time-consuming part.) Prepare the model config for each checkpoint to be uploaded, i.e. look at each checkpoint and define its config depending on how it deviates from the defaults. I won't have time to do this alone, so I could use some help here. @Xirider, would you be interested in helping? Would be happy to decide on a good strategy together to port all of the models :-)
    c. Decide how to write good model cards in an automated way.

  2. Only do the conversion for the most important models. In this case we should decide which models those are.

What approach do you think is best here? @LysandreJik @patil-suraj @craffel @Xirider ?

In general, I think it's important to find a good name for the new checkpoints. What do you think would be a good name?
google/t5-base-efficient-{config}? Or google/t5-base-scale-efficient-{config}? Better ideas?

@patrickvonplaten (Contributor) commented

Love that sentence from the paper: "since there is a lack of representation of transformers at lower compute regions." -> very true! I think those small checkpoints can be very impactful.

@craffel commented Feb 7, 2022

I would advocate for porting all the models, though take it with a grain of salt because I'm not volunteering to do the manual work. Regarding an automatic naming convention, if the original TF checkpoint names are somewhat sane/follow some kind of reasonable pattern, we could just do t5-efficient-{t5 checkpoint name}.

@patrickvonplaten (Contributor) commented

Having talked to @patil-suraj, it should actually be possible to fully automate porting all those models. As a first step, I've preprocessed the names of each of the folders available online to come up with the following names:

t5-efficient-xxl-nl4
t5-efficient-xxl
t5-efficient-xl-nl12
t5-efficient-xl-nl16
t5-efficient-xl-nl28
t5-efficient-xl-nl2
t5-efficient-xl-nl4
t5-efficient-xl-nl6
t5-efficient-xl-nl8
t5-efficient-xl
t5-efficient-xl-sh
t5-efficient-xl-skv
t5-efficient-base
t5-efficient-base-dm1000
t5-efficient-base-dm256
t5-efficient-base-dm2000
t5-efficient-base-dm512
t5-efficient-base-dml2
t5-efficient-base-dml4
t5-efficient-base-dml6
t5-efficient-base-dml8
t5-efficient-base-el16
t5-efficient-base-el2
t5-efficient-base-el4
t5-efficient-base-el6
t5-efficient-base-el8
t5-efficient-base-ff12000
t5-efficient-base-ff1000
t5-efficient-base-ff2000
t5-efficient-base-ff6000
t5-efficient-base-ff9000
t5-efficient-base-nh16
t5-efficient-base-nh24
t5-efficient-base-nh32
t5-efficient-base-nh8
t5-efficient-base-kv128
t5-efficient-base-kv16
t5-efficient-base-kv256
t5-efficient-base-kv32
t5-efficient-base-l16
t5-efficient-base-l24
t5-efficient-base-l2
t5-efficient-base-l32
t5-efficient-base-l36
t5-efficient-base-l40
t5-efficient-base-l48
t5-efficient-base-l4
t5-efficient-base-l8
t5-efficient-large
t5-efficient-large-dm128
t5-efficient-large-dm256
t5-efficient-large-dm2000
t5-efficient-large-dm512
t5-efficient-large-dm768
t5-efficient-large-dl12
t5-efficient-large-dl16
t5-efficient-large-dl2
t5-efficient-large-dl32
t5-efficient-large-dl4
t5-efficient-large-dl6
t5-efficient-large-dl8
t5-efficient-large-el12
t5-efficient-large-el2
t5-efficient-large-el4
t5-efficient-large-el6
t5-efficient-large-el8
t5-efficient-large-nh12
t5-efficient-large-nh24
t5-efficient-large-nh2
t5-efficient-large-nh32
t5-efficient-large-nh4
t5-efficient-large-nh8-nl16
t5-efficient-large-nh8-nl32
t5-efficient-large-nh8
t5-efficient-large-kv128
t5-efficient-large-kv16
t5-efficient-large-kv256
t5-efficient-large-kv32
t5-efficient-large-nl10
t5-efficient-large-nl12
t5-efficient-large-nl16
t5-efficient-large-nl20
t5-efficient-large-nl2
t5-efficient-large-nl32
t5-efficient-large-nl36
t5-efficient-large-nl4
t5-efficient-large-nl8
t5-efficient-large-sh
t5-efficient-large-skv
t5-efficient-mini-nl12
t5-efficient-mini-nl24
t5-efficient-mini-nl6
t5-efficient-mini-nl8
t5-efficient-mini
t5-efficient-base-sh
t5-efficient-base-skv
t5-efficient-small-dm128
t5-efficient-small-dm1000
t5-efficient-small-dm256
t5-efficient-small-dm2000
t5-efficient-small-dm768
t5-efficient-small-dl12
t5-efficient-small-dl16
t5-efficient-small-dl2
t5-efficient-small-dl4
t5-efficient-small-dl8
t5-efficient-small-el12
t5-efficient-small-el16
t5-efficient-small-el16-dl1
t5-efficient-small-el16-dl2
t5-efficient-small-el16-dl4
t5-efficient-small-el16-dl8
t5-efficient-small-el2
t5-efficient-small-el32
t5-efficient-small-el48
t5-efficient-small-el4
t5-efficient-small-el64
t5-efficient-small-el8
t5-efficient-small-el8-dl1
t5-efficient-small-el8-dl2
t5-efficient-small-el8-dl4
t5-efficient-small-ff12000
t5-efficient-small-ff1000
t5-efficient-small-ff3000
t5-efficient-small-ff6000
t5-efficient-small-ff9000
t5-efficient-small-kv128
t5-efficient-small-kv16
t5-efficient-small-kv256
t5-efficient-small-kv32
t5-efficient-small-nl16
t5-efficient-small-nl20
t5-efficient-small-nl22
t5-efficient-small-nl24
t5-efficient-small-nl2
t5-efficient-small-nl32
t5-efficient-small-nl36
t5-efficient-small-nl40
t5-efficient-small-nl48
t5-efficient-small-nl4
t5-efficient-small-nl8
t5-efficient-small-sh
t5-efficient-small-shkv
t5-efficient-small
t5-efficient-tiny-dl2
t5-efficient-tiny-dl6
t5-efficient-tiny-dl8
t5-efficient-tiny-el12
t5-efficient-tiny-el2
t5-efficient-tiny-el6
t5-efficient-tiny-el8
t5-efficient-tiny-ff12000
t5-efficient-tiny-ff2000
t5-efficient-tiny-ff3000
t5-efficient-tiny-ff6000
t5-efficient-tiny-ff9000
t5-efficient-tiny-nh16
t5-efficient-tiny-nh1
t5-efficient-tiny-nh32
t5-efficient-tiny-nh8
t5-efficient-tiny-nl12
t5-efficient-tiny-nl16
t5-efficient-tiny-nl24
t5-efficient-tiny-nl2
t5-efficient-tiny-nl32
t5-efficient-tiny-nl6
t5-efficient-tiny-nl8
t5-efficient-tiny
t5-efficient-tiny-sh
t5-efficient-tiny-skv

-> I think this is pretty clear with `t5-efficient-{default_size}-{change_to_default_size_1}(-{change_to_default_size_2})`, where the default size codenames follow those of Table 2 of the paper.
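For illustration, a minimal sketch of how such a name could be parsed back into config overrides (the helper and its mapping names are hypothetical, they just follow the paper's codenames: nl = layers, el/dl = encoder/decoder layers, dm = model dim, ff = feed-forward dim, nh = heads, kv = key/value dim):

import re

# Hypothetical suffix-to-parameter mapping following the paper's codenames.
SUFFIX_TO_PARAM = {
    "nl": "num_layers",
    "el": "num_encoder_layers",
    "dl": "num_decoder_layers",
    "dm": "d_model",
    "ff": "d_ff",
    "nh": "num_heads",
    "kv": "d_kv",
}

def parse_overrides(name: str) -> dict:
    """Parse e.g. 't5-efficient-large-nh8-nl16' into
    {'size': 'large', 'num_heads': 8, 'num_layers': 16}."""
    parts = name.split("-")
    assert parts[:2] == ["t5", "efficient"], name
    overrides = {"size": parts[2]}
    for part in parts[3:]:
        match = re.fullmatch(r"([a-z]+)(\d+)", part)
        if match and match.group(1) in SUFFIX_TO_PARAM:
            overrides[SUFFIX_TO_PARAM[match.group(1)]] = int(match.group(2))
        else:
            overrides[part] = True  # flags such as 'sh'/'skv' (shared heads/KVs)
    return overrides

print(parse_overrides("t5-efficient-large-nh8-nl16"))
# -> {'size': 'large', 'num_heads': 8, 'num_layers': 16}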

@patrickvonplaten (Contributor) commented Feb 8, 2022

Automatically parsed and uploaded all configs now here: https://huggingface.co/NewT5 . Will now look into automatically uploading the weights.

@patrickvonplaten (Contributor) commented Feb 9, 2022

Have 157/169 now correctly converted and uploaded: https://huggingface.co/models?other=t5-new-success .

Something seems to be wrong with the "shared heads" SH checkpoints in the conversion.

@craffel - do you know what exactly "shared heads" means? Does it mean that each transformer block uses the same head weights, or that the heads within a transformer block are shared with each other?

@craffel commented Feb 9, 2022

Hm, I'm not sure, but that would be my guess. If you point me to the operative config for one of the shared heads checkpoints, I can try to hunt down what it means in the original codebase.

@versae (Contributor) commented Feb 11, 2022

Sorry for the off-topic question, but having such a conversion script available would be awesome. I'm struggling with the conversion myself.

@patrickvonplaten (Contributor) commented

Uploaded some of my scripts here: https://github.com/patrickvonplaten/t5-mtf-to-hf-converter . The repo is not very clean and badly lacks comments/explanations. Could you check whether those scripts help you in any way, though?

@versae (Contributor) commented Feb 14, 2022

I'll test them and will report back :) Thanks! 🙏🏼

@patrickvonplaten (Contributor) commented

Okay, 159/169 checkpoints are now correct. Given that the others might not be that useful/practical for now (see google-research/google-research#986 (comment)), I'll go ahead with those 159 checkpoints. I'll convert them to TF and Flax, write a nice README, and then I think we can publish :-)

@stefan-it (Collaborator) commented

Hi @patrickvonplaten, I've trained a 32EL model with the T5 Mesh codebase (the model is still training, so I'm using an intermediate checkpoint). Now I wanted to convert the TF checkpoint to PyTorch, but the following error is thrown:

Initialize PyTorch weight ['decoder', 'block_000', 'layer_001', 'rms_norm', 'scale']
Skipping decoder/block_000/layer_001/rms_norm/scale_slot_v
Skipping decoder/block_000/layer_002/DenseReluDense/wi/kernel
Traceback (most recent call last):
  File "/home/stefan/model-hub/transformers/src/transformers/models/t5/convert_t5_original_tf_checkpoint_to_pytorch.py", line 59, in <module>
    convert_tf_checkpoint_to_pytorch(args.tf_checkpoint_path, args.config_file, args.pytorch_dump_path)
  File "/home/stefan/model-hub/transformers/src/transformers/models/t5/convert_t5_original_tf_checkpoint_to_pytorch.py", line 34, in convert_tf_checkpoint_to_pytorch
    load_tf_weights_in_t5(model, config, tf_checkpoint_path)
  File "/home/stefan/model-hub/transformers/src/transformers/models/t5/modeling_t5.py", line 122, in load_tf_weights_in_t5
    pointer = getattr(pointer, "weight")
  File "/home/stefan/.venvs/dev/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1177, in __getattr__
    raise AttributeError("'{}' object has no attribute '{}'".format(
AttributeError: 'T5DenseGatedGeluDense' object has no attribute 'weight'

I used the config.json created by your create_config script (based on google/t5-v1_1-base, because I couldn't find the template JSON).
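For reference, I'm invoking the conversion script like this (paths are placeholders for my setup):

python src/transformers/models/t5/convert_t5_original_tf_checkpoint_to_pytorch.py \
    --tf_checkpoint_path /path/to/mesh_tf_checkpoint \
    --config_file /path/to/config.json \
    --pytorch_dump_path /path/to/pytorch_dump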

The conversion was done with the latest Transformers master. Do you have any hint as to what's missing here 🤔

Many thanks!

@patrickvonplaten (Contributor) commented

Hey @stefan-it,

You need to base your config on t5-base (the original T5 model) instead of google/t5-v1_1-base, I believe: v1.1 uses gated-GELU feed-forward layers (hence the T5DenseGatedGeluDense in your traceback), while your mesh checkpoint uses the original DenseReluDense layers. Could you try this instead?

@patrickvonplaten (Contributor) commented

You should be able to use this script: https://github.com/patrickvonplaten/t5-mtf-to-hf-converter/blob/master/create_config.py where you load from the t5-base config :-)
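Roughly, that amounts to something like this (a sketch; the layer overrides are illustrative values for a 32EL-style model, adjust them to your checkpoint):

from transformers import T5Config

# Start from the original T5 config (ReLU feed-forward, matching the
# mesh-tensorflow DenseReluDense layers), not the gated-GELU v1.1 one.
config = T5Config.from_pretrained("t5-base")

# Illustrative overrides for a deep-encoder variant.
config.num_layers = 32          # encoder layers (the "32EL" part)
config.num_decoder_layers = 12  # decoder layers kept at the base default
config.vocab_size = 32128       # 32000 spm tokens + 100 sentinels, padded

config.save_pretrained("./t5-efficient-config")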

@stefan-it (Collaborator) commented Feb 18, 2022

Hi @patrickvonplaten, thanks, it is working with t5-base.

Another question: why is the vocab size in the config set to 32128, whereas the spm model has a size of 32000? Is it because of the integrated tasks (such as translation)? My T5 model demands 32000 in the config (otherwise it throws an error).

@versae (Contributor) commented Feb 18, 2022

Maybe related: according to this comment, it was rounded to a multiple of 128 for TPU efficiency.

https://github.com/google-research/t5x/blob/main/t5x/examples/scalable_t5/t5_1_1/base.gin#L45

@craffel commented Feb 18, 2022

100 IDs were added for sentinel tokens (for the pre-training objective), and then, as @versae said, it was rounded up to the nearest multiple of 128 for TPU efficiency.
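In numbers (a quick sanity check):

# Standard T5 vocab setup.
sp_vocab_size = 32000   # raw SentencePiece vocabulary
sentinel_ids = 100      # <extra_id_0> ... <extra_id_99> for span corruption
multiple = 128          # TPU-friendly rounding

raw = sp_vocab_size + sentinel_ids       # 32100
padded = -(-raw // multiple) * multiple  # ceil to a multiple of 128
print(raw, padded)                       # 32100 32128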

@stefan-it (Collaborator) commented Feb 18, 2022

Hi @versae and @craffel, thanks for that hint!

Do you happen to know how to add these sentinel IDs in the t5_mesh_transformer command or in the gin file (I'm not using T5X) 🤔

Can this be configured in the seqio Task 🤔

@craffel commented Feb 18, 2022

It's vocabularies.Vocabulary.extra_ids = 100 in gin.

@stefan-it (Collaborator) commented Feb 19, 2022

Hi @craffel, thanks for that! I was able to solve the problem by using seqio.SentencePieceVocabulary(SPM_VOCAB, extra_ids=100) in the task description. I've checked the converted checkpoint and it now has the desired 32128 shape 👍
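Concretely, my output features now look roughly like this (SPM_VOCAB is a placeholder for the path to my SentencePiece model):

import seqio

SPM_VOCAB = "/path/to/spm.model"

# 32000 SentencePiece tokens + 100 sentinel ids; the 32128 in the config
# comes from padding this up to a multiple of 128.
VOCABULARY = seqio.SentencePieceVocabulary(SPM_VOCAB, extra_ids=100)

DEFAULT_OUTPUT_FEATURES = {
    "inputs": seqio.Feature(vocabulary=VOCABULARY, add_eos=True, required=False),
    "targets": seqio.Feature(vocabulary=VOCABULARY, add_eos=True),
}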

But I have another question regarding the Scaling Efficiently paper: it seems that c4_v220_unsupervised is used as the mixture/task in the GIN files, but I can't find this recipe in the T5 (or T5X) repository. Do you happen to know how it could be structured, or do you know a comparable task from the T5 library, such as:

# ================================ Wikipedia ===================================
# (as defined in the T5 library; roughly requires:
#    import functools
#    import seqio
#    from t5.data import preprocessors
#    TaskRegistry = seqio.TaskRegistry)
TaskRegistry.add(
    "wikipedia_20190301.en_v003_unsupervised",
    source=seqio.TfdsDataSource(tfds_name="wikipedia/20190301.en:1.0.0"),
    preprocessors=[
        functools.partial(
            preprocessors.rekey, key_map={
                "inputs": None,       # no input key; built later by the objective
                "targets": "text"     # raw article text becomes the target
            }),
        seqio.preprocessors.tokenize,
        seqio.CacheDatasetPlaceholder(),
        preprocessors.unsupervised,   # configurable unsupervised objective
        seqio.preprocessors.append_eos_after_trim,
    ],
    output_features=DEFAULT_OUTPUT_FEATURES,
    metric_fns=[])

I'm highly interested in the preprocessors part of it. Many thanks!

(/cc @vanzytay)

@craffel commented Feb 19, 2022

It's here: https://github.com/google-research/text-to-text-transfer-transformer/blob/main/t5/data/tasks.py#L106

But it needs further gin configuration if you actually want to use it as a pre-training task. If you want to use the standard T5 pre-training task, use https://github.com/google-research/text-to-text-transfer-transformer/blob/main/t5/data/tasks.py#L46
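Structurally, that standard pre-training task mirrors the Wikipedia example above, with C4 as the source and span corruption as the objective (a sketch; see tasks.py at the links above for the authoritative definition):

TaskRegistry.add(
    "c4_v220_span_corruption",
    source=seqio.TfdsDataSource(tfds_name="c4/en:2.2.0"),
    preprocessors=[
        functools.partial(
            preprocessors.rekey, key_map={
                "inputs": None,
                "targets": "text"
            }),
        seqio.preprocessors.tokenize,
        seqio.CacheDatasetPlaceholder(),
        preprocessors.span_corruption,  # the T5 denoising objective
        seqio.preprocessors.append_eos_after_trim,
    ],
    output_features=DEFAULT_OUTPUT_FEATURES,
    metric_fns=[])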

@KnutJaegersberg commented
How does t5-efficient-xxl-nl4 perform compared to, say, medium-sized models? While the XXL model file is 45 GB, this one is smaller than 4 GB. Googling this model didn't help me: 3 results in total, it isn't discussed in the paper as far as I can see, and I couldn't find any performance comparison. 4 transformer blocks instead of 24 sounds like quite a radical change, so probably a performance penalty, but then again, they did share the model, so is it any good?

@patrickvonplaten (Contributor) commented

Gently pinging the original author @vanzytay here
