
New and better T5 checkpoints from scaling transformers paper #15467

Open · 3 tasks done · Xirider opened this issue Feb 1, 2022 · 26 comments

@Xirider commented Feb 1, 2022

🌟 New model addition

Model description

This paper explores different ways of scaling T5:
Scaling Efficiently: Insights from Pre-training and Finetuning Transformers
https://arxiv.org/abs/2109.10686

They release new checkpoints (DeepNarrow) that perform significantly better than the previous models on downstream tasks.

Here is a table from the paper that compares the old Base, Large, XL and XXL models to the new ones:

[Image: comparison table from the paper]

The checkpoints were released today here:
https://github.com/google-research/google-research/tree/master/scaling_transformers

Open source status

@LysandreJik (Member) commented

cc @patil-suraj @patrickvonplaten

@patrickvonplaten (Contributor) commented Feb 2, 2022

Holy moly 170 checkpoints at those sizes. I'll give it a try tomorrow. Official link: https://console.cloud.google.com/storage/browser/scenic-bucket/scaling_explorer/

@patrickvonplaten (Contributor) commented

Okay, I can't get the models to work out of the box. Will dive a bit deeper into the code next week.

@patrickvonplaten patrickvonplaten self-assigned this Feb 7, 2022
@patrickvonplaten (Contributor) commented

The conversion seems to work now. I've added a first model here:
https://huggingface.co/NewT5/t5-efficient-base-el4

I have some internal conversion scripts for https://github.com/google-research/text-to-text-transfer-transformer => Transformers that allow me to quickly convert mesh-tensorflow checkpoints, verify them, and upload them.

All I need for a conversion now is the following:

Now, there are over a hundred new checkpoints, which makes manual conversion too slow and time-consuming. I think we should do one of the following:

  1. Write a script that does the conversion automatically. This can definitely be done and it shouldn't be too hard. In order to do this we have to do three things:
    a. Find a good name pattern, e.g. t5-efficient-{config}
    b. (This is the time-consuming part.) Prepare the model config for each checkpoint to be uploaded, i.e. look at each checkpoint and define its config depending on how it deviates from the defaults. I won't have time to do this alone, so I could use some help here. @Xirider, would you be interested in helping? Would be happy to decide on a good strategy together to port all of the models :-)
    c. Decide how to write good model cards in an automated way.

  2. Only do the conversion for the most important models. In this case we should decide which models those are.

What approach do you think is best here? @LysandreJik @patil-suraj @craffel @Xirider ?

In general, I think it's important to find a good name for the new checkpoints. What do you think would be a good name?
google/t5-base-efficient-{config}? Or google/t5-base-scale-efficient-{config}? Better ideas?

@patrickvonplaten (Contributor) commented

Love that sentence from the paper: "since there is a lack of representation of transformers at lower compute regions." -> very true! I think those small checkpoints can be very impactful.

@craffel commented Feb 7, 2022

I would advocate for porting all the models, though take it with a grain of salt because I'm not volunteering to do the manual work. Regarding an automatic naming convention, if the original TF checkpoint names are somewhat sane/follow some kind of reasonable pattern, we could just do t5-efficient-{t5 checkpoint name}.

@patrickvonplaten (Contributor) commented

Having talked to @patil-suraj, it should actually be possible to fully automate porting all those models. As a first step, I've preprocessed the names of each of the folders available online to come up with the following names:

t5-efficient-xxl-nl4
t5-efficient-xxl
t5-efficient-xl-nl12
t5-efficient-xl-nl16
t5-efficient-xl-nl28
t5-efficient-xl-nl2
t5-efficient-xl-nl4
t5-efficient-xl-nl6
t5-efficient-xl-nl8
t5-efficient-xl
t5-efficient-xl-sh
t5-efficient-xl-skv
t5-efficient-base
t5-efficient-base-dm1000
t5-efficient-base-dm256
t5-efficient-base-dm2000
t5-efficient-base-dm512
t5-efficient-base-dml2
t5-efficient-base-dml4
t5-efficient-base-dml6
t5-efficient-base-dml8
t5-efficient-base-el16
t5-efficient-base-el2
t5-efficient-base-el4
t5-efficient-base-el6
t5-efficient-base-el8
t5-efficient-base-ff12000
t5-efficient-base-ff1000
t5-efficient-base-ff2000
t5-efficient-base-ff6000
t5-efficient-base-ff9000
t5-efficient-base-nh16
t5-efficient-base-nh24
t5-efficient-base-nh32
t5-efficient-base-nh8
t5-efficient-base-kv128
t5-efficient-base-kv16
t5-efficient-base-kv256
t5-efficient-base-kv32
t5-efficient-base-l16
t5-efficient-base-l24
t5-efficient-base-l2
t5-efficient-base-l32
t5-efficient-base-l36
t5-efficient-base-l40
t5-efficient-base-l48
t5-efficient-base-l4
t5-efficient-base-l8
t5-efficient-large
t5-efficient-large-dm128
t5-efficient-large-dm256
t5-efficient-large-dm2000
t5-efficient-large-dm512
t5-efficient-large-dm768
t5-efficient-large-dl12
t5-efficient-large-dl16
t5-efficient-large-dl2
t5-efficient-large-dl32
t5-efficient-large-dl4
t5-efficient-large-dl6
t5-efficient-large-dl8
t5-efficient-large-el12
t5-efficient-large-el2
t5-efficient-large-el4
t5-efficient-large-el6
t5-efficient-large-el8
t5-efficient-large-nh12
t5-efficient-large-nh24
t5-efficient-large-nh2
t5-efficient-large-nh32
t5-efficient-large-nh4
t5-efficient-large-nh8-nl16
t5-efficient-large-nh8-nl32
t5-efficient-large-nh8
t5-efficient-large-kv128
t5-efficient-large-kv16
t5-efficient-large-kv256
t5-efficient-large-kv32
t5-efficient-large-nl10
t5-efficient-large-nl12
t5-efficient-large-nl16
t5-efficient-large-nl20
t5-efficient-large-nl2
t5-efficient-large-nl32
t5-efficient-large-nl36
t5-efficient-large-nl4
t5-efficient-large-nl8
t5-efficient-large-sh
t5-efficient-large-skv
t5-efficient-mini-nl12
t5-efficient-mini-nl24
t5-efficient-mini-nl6
t5-efficient-mini-nl8
t5-efficient-mini
t5-efficient-base-sh
t5-efficient-base-skv
t5-efficient-small-dm128
t5-efficient-small-dm1000
t5-efficient-small-dm256
t5-efficient-small-dm2000
t5-efficient-small-dm768
t5-efficient-small-dl12
t5-efficient-small-dl16
t5-efficient-small-dl2
t5-efficient-small-dl4
t5-efficient-small-dl8
t5-efficient-small-el12
t5-efficient-small-el16
t5-efficient-small-el16-dl1
t5-efficient-small-el16-dl2
t5-efficient-small-el16-dl4
t5-efficient-small-el16-dl8
t5-efficient-small-el2
t5-efficient-small-el32
t5-efficient-small-el48
t5-efficient-small-el4
t5-efficient-small-el64
t5-efficient-small-el8
t5-efficient-small-el8-dl1
t5-efficient-small-el8-dl2
t5-efficient-small-el8-dl4
t5-efficient-small-ff12000
t5-efficient-small-ff1000
t5-efficient-small-ff3000
t5-efficient-small-ff6000
t5-efficient-small-ff9000
t5-efficient-small-kv128
t5-efficient-small-kv16
t5-efficient-small-kv256
t5-efficient-small-kv32
t5-efficient-small-nl16
t5-efficient-small-nl20
t5-efficient-small-nl22
t5-efficient-small-nl24
t5-efficient-small-nl2
t5-efficient-small-nl32
t5-efficient-small-nl36
t5-efficient-small-nl40
t5-efficient-small-nl48
t5-efficient-small-nl4
t5-efficient-small-nl8
t5-efficient-small-sh
t5-efficient-small-shkv
t5-efficient-small
t5-efficient-tiny-dl2
t5-efficient-tiny-dl6
t5-efficient-tiny-dl8
t5-efficient-tiny-el12
t5-efficient-tiny-el2
t5-efficient-tiny-el6
t5-efficient-tiny-el8
t5-efficient-tiny-ff12000
t5-efficient-tiny-ff2000
t5-efficient-tiny-ff3000
t5-efficient-tiny-ff6000
t5-efficient-tiny-ff9000
t5-efficient-tiny-nh16
t5-efficient-tiny-nh1
t5-efficient-tiny-nh32
t5-efficient-tiny-nh8
t5-efficient-tiny-nl12
t5-efficient-tiny-nl16
t5-efficient-tiny-nl24
t5-efficient-tiny-nl2
t5-efficient-tiny-nl32
t5-efficient-tiny-nl6
t5-efficient-tiny-nl8
t5-efficient-tiny
t5-efficient-tiny-sh
t5-efficient-tiny-skv

-> I think this is pretty clear with `t5-efficient-{default_size}-{change_to_default_size_1}(-{change_to_default_size_2})`, where the default size codenames follow those of Table 2 of the paper.
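For illustration, a minimal sketch of how such a name could be parsed back into config overrides (the helper and its mapping names are hypothetical, they just follow the paper's codenames: nl = layers, el/dl = encoder/decoder layers, dm = model dim, ff = feed-forward dim, nh = heads, kv = key/value dim):

import re

# Hypothetical suffix-to-parameter mapping following the paper's codenames.
SUFFIX_TO_PARAM = {
    "nl": "num_layers",
    "el": "num_encoder_layers",
    "dl": "num_decoder_layers",
    "dm": "d_model",
    "ff": "d_ff",
    "nh": "num_heads",
    "kv": "d_kv",
}

def parse_overrides(name: str) -> dict:
    """Parse e.g. 't5-efficient-large-nh8-nl16' into
    {'size': 'large', 'num_heads': 8, 'num_layers': 16}."""
    parts = name.split("-")
    assert parts[:2] == ["t5", "efficient"], name
    overrides = {"size": parts[2]}
    for part in parts[3:]:
        match = re.fullmatch(r"([a-z]+)(\d+)", part)
        if match and match.group(1) in SUFFIX_TO_PARAM:
            overrides[SUFFIX_TO_PARAM[match.group(1)]] = int(match.group(2))
        else:
            overrides[part] = True  # flags such as 'sh'/'skv' (shared heads/KVs)
    return overrides

print(parse_overrides("t5-efficient-large-nh8-nl16"))
# -> {'size': 'large', 'num_heads': 8, 'num_layers': 16}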

@patrickvonplaten (Contributor) commented Feb 8, 2022

Automatically parsed and uploaded all configs now here: https://huggingface.co/NewT5 . Will now look into automatically uploading the weights.

@patrickvonplaten (Contributor) commented Feb 9, 2022

Have 157/169 now correctly converted and uploaded: https://huggingface.co/models?other=t5-new-success .

Something seems to be wrong with the "shared heads" SH checkpoints in the conversion.

@craffel - do you know what exactly "shared heads" means? Does it mean that each transformer block uses the same head weights, or that the heads within a transformer block are shared with each other?

@craffel commented Feb 9, 2022

Hm, I'm not sure, but that would be my guess. If you point me to the operative config for one of the shared heads checkpoints, I can try to hunt down what it means in the original codebase.

@versae (Contributor) commented Feb 11, 2022

Sorry for the off-topic question, but having such a conversion script available would be awesome. I'm struggling with the conversion myself.

@patrickvonplaten (Contributor) commented

Uploaded some of my scripts here: https://github.com/patrickvonplaten/t5-mtf-to-hf-converter . The repo is not very clean and badly lacks comments/explanations. Could you check whether those scripts help you in any way, though?

@versae (Contributor) commented Feb 14, 2022

I'll test them and will report back :) Thanks! 🙏🏼

@patrickvonplaten (Contributor) commented

Okay, 159/169 checkpoints are now correct. Given that the others might not be that useful/practical for now (see google-research/google-research#986 (comment)), I'll go ahead with those 159 checkpoints. I'll convert them to TF and Flax, write a nice README, and then I think we can publish :-)

@stefan-it (Collaborator) commented

Hi @patrickvonplaten, I've trained a 32EL model with the T5 Mesh codebase (the model is still training, so I'm using an intermediate checkpoint). Now I wanted to convert the TF checkpoint to PyTorch, but the following error is thrown:

Initialize PyTorch weight ['decoder', 'block_000', 'layer_001', 'rms_norm', 'scale']
Skipping decoder/block_000/layer_001/rms_norm/scale_slot_v
Skipping decoder/block_000/layer_002/DenseReluDense/wi/kernel
Traceback (most recent call last):
  File "/home/stefan/model-hub/transformers/src/transformers/models/t5/convert_t5_original_tf_checkpoint_to_pytorch.py", line 59, in <module>
    convert_tf_checkpoint_to_pytorch(args.tf_checkpoint_path, args.config_file, args.pytorch_dump_path)
  File "/home/stefan/model-hub/transformers/src/transformers/models/t5/convert_t5_original_tf_checkpoint_to_pytorch.py", line 34, in convert_tf_checkpoint_to_pytorch
    load_tf_weights_in_t5(model, config, tf_checkpoint_path)
  File "/home/stefan/model-hub/transformers/src/transformers/models/t5/modeling_t5.py", line 122, in load_tf_weights_in_t5
    pointer = getattr(pointer, "weight")
  File "/home/stefan/.venvs/dev/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1177, in __getattr__
    raise AttributeError("'{}' object has no attribute '{}'".format(
AttributeError: 'T5DenseGatedGeluDense' object has no attribute 'weight'

I used the config.json created by your create_config script (based on google/t5-v1_1-base, because I couldn't find the template JSON).
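For reference, I'm invoking the conversion script like this (paths are placeholders for my setup):

python src/transformers/models/t5/convert_t5_original_tf_checkpoint_to_pytorch.py \
    --tf_checkpoint_path /path/to/mesh_tf_checkpoint \
    --config_file /path/to/config.json \
    --pytorch_dump_path /path/to/pytorch_dump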

The conversion was done with the latest Transformers master. Do you have any hint as to what's missing here 🤔

Many thanks!

@patrickvonplaten (Contributor) commented

Hey @stefan-it,

You need to base your config on t5-base (the original T5 model) instead of google/t5-v1_1-base, I believe: v1.1 uses gated-GELU feed-forward layers (hence the T5DenseGatedGeluDense in your traceback), while your mesh checkpoint uses the original DenseReluDense layers. Could you try this instead?

@patrickvonplaten (Contributor) commented

You should be able to use this script: https://github.com/patrickvonplaten/t5-mtf-to-hf-converter/blob/master/create_config.py where you load from the t5-base config :-)
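Roughly, that amounts to something like this (a sketch; the layer overrides are illustrative values for a 32EL-style model, adjust them to your checkpoint):

from transformers import T5Config

# Start from the original T5 config (ReLU feed-forward, matching the
# mesh-tensorflow DenseReluDense layers), not the gated-GELU v1.1 one.
config = T5Config.from_pretrained("t5-base")

# Illustrative overrides for a deep-encoder variant.
config.num_layers = 32          # encoder layers (the "32EL" part)
config.num_decoder_layers = 12  # decoder layers kept at the base default
config.vocab_size = 32128       # 32000 spm tokens + 100 sentinels, padded

config.save_pretrained("./t5-efficient-config")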

@stefan-it (Collaborator) commented Feb 18, 2022

Hi @patrickvonplaten, thanks, it is working with t5-base.

Another question: why is the vocab size in the config set to 32128, whereas the spm model has a size of 32000? Is it because of the integrated tasks (such as translation)? My T5 model demands 32000 in the config (otherwise it throws an error).

@versae (Contributor) commented Feb 18, 2022

Maybe related: according to this comment, it was rounded to a multiple of 128 for TPU efficiency.

https://github.com/google-research/t5x/blob/main/t5x/examples/scalable_t5/t5_1_1/base.gin#L45

@craffel commented Feb 18, 2022

100 IDs were added for sentinel tokens (for the pre-training objective), and then, as @versae said, it was rounded up to the nearest multiple of 128 for TPU efficiency.
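In numbers (a quick sanity check):

# Standard T5 vocab setup.
sp_vocab_size = 32000   # raw SentencePiece vocabulary
sentinel_ids = 100      # <extra_id_0> ... <extra_id_99> for span corruption
multiple = 128          # TPU-friendly rounding

raw = sp_vocab_size + sentinel_ids       # 32100
padded = -(-raw // multiple) * multiple  # ceil to a multiple of 128
print(raw, padded)                       # 32100 32128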

@stefan-it (Collaborator) commented Feb 18, 2022

Hi @versae and @craffel, thanks for that hint!

Do you happen to know how to add these sentinel IDs in the t5_mesh_transformer command or in the gin file (I'm not using T5X) 🤔

Can this be configured in the seqio Task 🤔

@craffel commented Feb 18, 2022

It's vocabularies.Vocabulary.extra_ids = 100 in gin.

@stefan-it (Collaborator) commented Feb 19, 2022

Hi @craffel, thanks for that! I was able to solve the problem by using seqio.SentencePieceVocabulary(SPM_VOCAB, extra_ids=100) in the task description. I've checked the converted checkpoint and it now has the desired 32128 shape 👍
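Concretely, my output features now look roughly like this (SPM_VOCAB is a placeholder for the path to my SentencePiece model):

import seqio

SPM_VOCAB = "/path/to/spm.model"

# 32000 SentencePiece tokens + 100 sentinel ids; the 32128 in the config
# comes from padding this up to a multiple of 128.
VOCABULARY = seqio.SentencePieceVocabulary(SPM_VOCAB, extra_ids=100)

DEFAULT_OUTPUT_FEATURES = {
    "inputs": seqio.Feature(vocabulary=VOCABULARY, add_eos=True, required=False),
    "targets": seqio.Feature(vocabulary=VOCABULARY, add_eos=True),
}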

But I have another question regarding the Scaling Efficiently paper: it seems that c4_v220_unsupervised is used as the mixture/task in the GIN files, but I can't find this recipe in the T5 (or T5X) repository. Do you happen to know how it could be structured, or do you know a comparable task from the T5 library, such as:

# ================================ Wikipedia ===================================
# (as defined in the T5 library; roughly requires:
#    import functools
#    import seqio
#    from t5.data import preprocessors
#    TaskRegistry = seqio.TaskRegistry)
TaskRegistry.add(
    "wikipedia_20190301.en_v003_unsupervised",
    source=seqio.TfdsDataSource(tfds_name="wikipedia/20190301.en:1.0.0"),
    preprocessors=[
        functools.partial(
            preprocessors.rekey, key_map={
                "inputs": None,       # no input key; built later by the objective
                "targets": "text"     # raw article text becomes the target
            }),
        seqio.preprocessors.tokenize,
        seqio.CacheDatasetPlaceholder(),
        preprocessors.unsupervised,   # configurable unsupervised objective
        seqio.preprocessors.append_eos_after_trim,
    ],
    output_features=DEFAULT_OUTPUT_FEATURES,
    metric_fns=[])

I'm highly interested in the preprocessors part of it. Many thanks!

(/cc @vanzytay)

@craffel commented Feb 19, 2022

It's here: https://github.com/google-research/text-to-text-transfer-transformer/blob/main/t5/data/tasks.py#L106

But it needs further gin configuration if you actually want to use it as a pre-training task. If you want to use the standard T5 pre-training task, use https://github.com/google-research/text-to-text-transfer-transformer/blob/main/t5/data/tasks.py#L46
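Structurally, that standard pre-training task mirrors the Wikipedia example above, with C4 as the source and span corruption as the objective (a sketch; see tasks.py at the links above for the authoritative definition):

TaskRegistry.add(
    "c4_v220_span_corruption",
    source=seqio.TfdsDataSource(tfds_name="c4/en:2.2.0"),
    preprocessors=[
        functools.partial(
            preprocessors.rekey, key_map={
                "inputs": None,
                "targets": "text"
            }),
        seqio.preprocessors.tokenize,
        seqio.CacheDatasetPlaceholder(),
        preprocessors.span_corruption,  # the T5 denoising objective
        seqio.preprocessors.append_eos_after_trim,
    ],
    output_features=DEFAULT_OUTPUT_FEATURES,
    metric_fns=[])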

@KnutJaegersberg commented
How does t5-efficient-xxl-nl4 perform compared to, say, medium-sized models? While the XXL model file is 45 GB, this one is smaller than 4 GB. Googling this model didn't help me: 3 results in total, it isn't discussed in the paper as far as I can see, and I couldn't find any performance comparison. 4 transformer blocks instead of 24 sounds like quite a radical change, so probably a performance penalty, but then again, they did share the model, so is it any good?

@patrickvonplaten (Contributor) commented

Gently pinging the original author @vanzytay here
