New and better T5 checkpoints from scaling transformers paper #15467
Holy moly, 170 checkpoints at those sizes! I'll give it a try tomorrow. Official link: https://console.cloud.google.com/storage/browser/scenic-bucket/scaling_explorer/
Okay, I can't get the models to work out of the box. Will dive a bit deeper into the code next week.
The conversion seems to work now; I've added a first model. I have some internal conversion scripts for https://github.com/google-research/text-to-text-transfer-transformer => Transformers that allow me to quickly convert mesh-tensorflow checkpoints, verify them, and upload them. All I need for a conversion now is a handful of inputs per checkpoint.
Now, there are over a hundred new checkpoints, which makes manual conversion too slow and time-consuming. I think there are a few ways we could handle this.
What approach do you think is best here? @LysandreJik @patil-suraj @craffel @Xirider? In general, I think it's important to find a good name for the new checkpoints. What do you think would be a good name?
Love that sentence from the paper.
I would advocate for porting all the models, though take it with a grain of salt because I'm not volunteering to do the manual work. Regarding an automatic naming convention: if the original TF checkpoint names are somewhat sane/follow some kind of reasonable pattern, we could just derive the new names from them.
Having talked to @patil-suraj, it should actually be possible to fully automate porting all those models. As a first step, I've preprocessed the naming of each of the folders available online to come up with the following names:

t5-efficient-xxl-nl4
t5-efficient-xxl
t5-efficient-xl-nl12
t5-efficient-xl-nl16
t5-efficient-xl-nl28
t5-efficient-xl-nl2
t5-efficient-xl-nl4
t5-efficient-xl-nl6
t5-efficient-xl-nl8
t5-efficient-xl
t5-efficient-xl-sh
t5-efficient-xl-skv
t5-efficient-base
t5-efficient-base-dm1000
t5-efficient-base-dm256
t5-efficient-base-dm2000
t5-efficient-base-dm512
t5-efficient-base-dml2
t5-efficient-base-dml4
t5-efficient-base-dml6
t5-efficient-base-dml8
t5-efficient-base-el16
t5-efficient-base-el2
t5-efficient-base-el4
t5-efficient-base-el6
t5-efficient-base-el8
t5-efficient-base-ff12000
t5-efficient-base-ff1000
t5-efficient-base-ff2000
t5-efficient-base-ff6000
t5-efficient-base-ff9000
t5-efficient-base-nh16
t5-efficient-base-nh24
t5-efficient-base-nh32
t5-efficient-base-nh8
t5-efficient-base-kv128
t5-efficient-base-kv16
t5-efficient-base-kv256
t5-efficient-base-kv32
t5-efficient-base-l16
t5-efficient-base-l24
t5-efficient-base-l2
t5-efficient-base-l32
t5-efficient-base-l36
t5-efficient-base-l40
t5-efficient-base-l48
t5-efficient-base-l4
t5-efficient-base-l8
t5-efficient-large
t5-efficient-large-dm128
t5-efficient-large-dm256
t5-efficient-large-dm2000
t5-efficient-large-dm512
t5-efficient-large-dm768
t5-efficient-large-dl12
t5-efficient-large-dl16
t5-efficient-large-dl2
t5-efficient-large-dl32
t5-efficient-large-dl4
t5-efficient-large-dl6
t5-efficient-large-dl8
t5-efficient-large-el12
t5-efficient-large-el2
t5-efficient-large-el4
t5-efficient-large-el6
t5-efficient-large-el8
t5-efficient-large-nh12
t5-efficient-large-nh24
t5-efficient-large-nh2
t5-efficient-large-nh32
t5-efficient-large-nh4
t5-efficient-large-nh8-nl16
t5-efficient-large-nh8-nl32
t5-efficient-large-nh8
t5-efficient-large-kv128
t5-efficient-large-kv16
t5-efficient-large-kv256
t5-efficient-large-kv32
t5-efficient-large-nl10
t5-efficient-large-nl12
t5-efficient-large-nl16
t5-efficient-large-nl20
t5-efficient-large-nl2
t5-efficient-large-nl32
t5-efficient-large-nl36
t5-efficient-large-nl4
t5-efficient-large-nl8
t5-efficient-large-sh
t5-efficient-large-skv
t5-efficient-mini-nl12
t5-efficient-mini-nl24
t5-efficient-mini-nl6
t5-efficient-mini-nl8
t5-efficient-mini
t5-efficient-base-sh
t5-efficient-base-skv
t5-efficient-small-dm128
t5-efficient-small-dm1000
t5-efficient-small-dm256
t5-efficient-small-dm2000
t5-efficient-small-dm768
t5-efficient-small-dl12
t5-efficient-small-dl16
t5-efficient-small-dl2
t5-efficient-small-dl4
t5-efficient-small-dl8
t5-efficient-small-el12
t5-efficient-small-el16
t5-efficient-small-el16-dl1
t5-efficient-small-el16-dl2
t5-efficient-small-el16-dl4
t5-efficient-small-el16-dl8
t5-efficient-small-el2
t5-efficient-small-el32
t5-efficient-small-el48
t5-efficient-small-el4
t5-efficient-small-el64
t5-efficient-small-el8
t5-efficient-small-el8-dl1
t5-efficient-small-el8-dl2
t5-efficient-small-el8-dl4
t5-efficient-small-ff12000
t5-efficient-small-ff1000
t5-efficient-small-ff3000
t5-efficient-small-ff6000
t5-efficient-small-ff9000
t5-efficient-small-kv128
t5-efficient-small-kv16
t5-efficient-small-kv256
t5-efficient-small-kv32
t5-efficient-small-nl16
t5-efficient-small-nl20
t5-efficient-small-nl22
t5-efficient-small-nl24
t5-efficient-small-nl2
t5-efficient-small-nl32
t5-efficient-small-nl36
t5-efficient-small-nl40
t5-efficient-small-nl48
t5-efficient-small-nl4
t5-efficient-small-nl8
t5-efficient-small-sh
t5-efficient-small-shkv
t5-efficient-small
t5-efficient-tiny-dl2
t5-efficient-tiny-dl6
t5-efficient-tiny-dl8
t5-efficient-tiny-el12
t5-efficient-tiny-el2
t5-efficient-tiny-el6
t5-efficient-tiny-el8
t5-efficient-tiny-ff12000
t5-efficient-tiny-ff2000
t5-efficient-tiny-ff3000
t5-efficient-tiny-ff6000
t5-efficient-tiny-ff9000
t5-efficient-tiny-nh16
t5-efficient-tiny-nh1
t5-efficient-tiny-nh32
t5-efficient-tiny-nh8
t5-efficient-tiny-nl12
t5-efficient-tiny-nl16
t5-efficient-tiny-nl24
t5-efficient-tiny-nl2
t5-efficient-tiny-nl32
t5-efficient-tiny-nl6
t5-efficient-tiny-nl8
t5-efficient-tiny
t5-efficient-tiny-sh
t5-efficient-tiny-skv

-> I think this is pretty clear with `t5-efficient-{default_size}-{change_to_default_size_1}(-{change_to_default_size_2})`, where the default size codenames follow those of Table 2 of the paper.
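To make the convention concrete, here is a small sketch (not the actual tooling used for the conversion/upload) of how such a name decomposes:

```python
# Sketch only: decompose a t5-efficient-{default_size}-{mod1}(-{mod2}) name.
DEFAULT_SIZES = {"tiny", "mini", "small", "base", "large", "xl", "xxl"}

def parse_name(name: str):
    parts = name.split("-")
    assert parts[:2] == ["t5", "efficient"], name
    size = parts[2]
    assert size in DEFAULT_SIZES, size
    mods = parts[3:]  # e.g. ["nh8", "nl32"]; empty means the default shape for that size
    return size, mods

print(parse_name("t5-efficient-large-nh8-nl32"))  # ('large', ['nh8', 'nl32'])
print(parse_name("t5-efficient-base"))            # ('base', [])
```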
Automatically parsed and uploaded all configs now here: https://huggingface.co/NewT5. Will now look into automatically uploading the weights.
Have 157/169 now correctly converted and uploaded: https://huggingface.co/models?other=t5-new-success. Something seems to be wrong with the "shared heads" (SH) checkpoints in the conversion. @craffel, do you know what exactly "shared heads" means? Does it mean that each transformer block uses the same head weights, or that all heads within a transformer block share their weights?
Hm, I'm not sure, but that would be my guess. If you point me to the operative config for one of the shared-heads checkpoints, I can try to hunt down what it means in the original codebase.
Sorry for the off-topic question, but having such a conversion script available would be awesome. I'm struggling with the conversion myself.
Uploaded some of my scripts here: https://github.com/patrickvonplaten/t5-mtf-to-hf-converter. The repo is not very clean and heavily lacks comments/explanations. Could you try them and see whether they help you in any way?
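For anyone following along, the converter that ships with transformers can also be driven directly from Python; a minimal sketch, with all file paths assumed:

```python
# Minimal sketch of converting a mesh-TF T5 checkpoint with the converter
# bundled in transformers. "config.json" and the checkpoint path are
# placeholders; the config must describe the same architecture that the
# checkpoint was trained with.
from transformers import T5Config, T5ForConditionalGeneration
from transformers.models.t5.modeling_t5 import load_tf_weights_in_t5

config = T5Config.from_json_file("config.json")
model = T5ForConditionalGeneration(config)
load_tf_weights_in_t5(model, config, "/path/to/model.ckpt-1000000")
model.save_pretrained("./converted-t5")
```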
I'll test them and will report back :) Thanks! 🙏🏼
Okay, 159/169 checkpoints are now correct. Given that the remaining ones might not be that useful/practical for now (google-research/google-research#986 (comment)), I'll go ahead with 159 of the 169 checkpoints. I will convert them to TF and Flax, write a nice README, and then we can publish, I think :-)
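Once published, these checkpoints should load like any other T5 model on the Hub; a sketch, where the google/t5-efficient-base repo name is an assumption at this point in the thread, and note the checkpoints are pre-trained only, so they need fine-tuning before real use:

```python
# Sketch: loading one of the converted checkpoints. The repo name is an
# assumption here; the models are pre-trained only and need fine-tuning.
from transformers import T5ForConditionalGeneration, T5TokenizerFast

tokenizer = T5TokenizerFast.from_pretrained("google/t5-efficient-base")
model = T5ForConditionalGeneration.from_pretrained("google/t5-efficient-base")

inputs = tokenizer("summarize: <your text here>", return_tensors="pt")
outputs = model.generate(**inputs, max_length=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```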
Hi @patrickvonplaten, I've trained a 32EL model with the T5 Mesh codebase (the model is actually still training, so I'm using an intermediate checkpoint). Now I wanted to convert the TF checkpoint into PyTorch, but the following error is thrown:

```
Initialize PyTorch weight ['decoder', 'block_000', 'layer_001', 'rms_norm', 'scale']
Skipping decoder/block_000/layer_001/rms_norm/scale_slot_v
Skipping decoder/block_000/layer_002/DenseReluDense/wi/kernel
Traceback (most recent call last):
  File "/home/stefan/model-hub/transformers/src/transformers/models/t5/convert_t5_original_tf_checkpoint_to_pytorch.py", line 59, in <module>
    convert_tf_checkpoint_to_pytorch(args.tf_checkpoint_path, args.config_file, args.pytorch_dump_path)
  File "/home/stefan/model-hub/transformers/src/transformers/models/t5/convert_t5_original_tf_checkpoint_to_pytorch.py", line 34, in convert_tf_checkpoint_to_pytorch
    load_tf_weights_in_t5(model, config, tf_checkpoint_path)
  File "/home/stefan/model-hub/transformers/src/transformers/models/t5/modeling_t5.py", line 122, in load_tf_weights_in_t5
    pointer = getattr(pointer, "weight")
  File "/home/stefan/.venvs/dev/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1177, in __getattr__
    raise AttributeError("'{}' object has no attribute '{}'".format(
AttributeError: 'T5DenseGatedGeluDense' object has no attribute 'weight'
```

I used the created config, and the conversion was done with the latest Transformers version. Many thanks!
Hey @stefan-it, you need to base your config on
You should be able to use this script: https://github.com/patrickvonplaten/t5-mtf-to-hf-converter/blob/master/create_config.py where you load from the operative config.
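To make the architecture dependency explicit, here is a minimal sketch of building such a config by hand; all the dimension values below are assumptions to be read off your operative config, and the key point from the traceback above is that feed_forward_proj must match how the checkpoint was trained:

```python
# Sketch: hand-building a T5Config for a custom mesh-TF checkpoint.
# All values are assumptions -- read the real ones from operative_config.gin.
from transformers import T5Config

config = T5Config(
    vocab_size=32128,   # see the vocab-size discussion below
    d_model=512,
    d_kv=64,
    d_ff=2048,
    num_layers=6,
    num_heads=8,
    # Must match the checkpoint: "gated-gelu" for T5 v1.1-style models,
    # "relu" for original-T5-style models. A mismatch makes
    # load_tf_weights_in_t5 walk a module tree that doesn't line up with
    # the checkpoint's variable names, as in the AttributeError above.
    feed_forward_proj="gated-gelu",
)
config.save_pretrained("./my-t5-config")
```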
Hi @patrickvonplaten, thanks, it is working now. Another question: why is the vocab size in the config set to 32128, whereas the spm model has a size of 32000? Is it because of the integrated tasks (such as translation)? My T5 model demands 32000 in the config (otherwise it throws an error).
Maybe related: according to this comment, it was rounded to a multiple of 128 for TPU efficiency: https://github.com/google-research/t5x/blob/main/t5x/examples/scalable_t5/t5_1_1/base.gin#L45
100 IDs were added for sentinel tokens (for the pre-training objective), and then, as @versae said, it was rounded up to the nearest multiple of 128 for TPU efficiency.
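As a worked example of the arithmetic in the two comments above:

```python
# 32000 SentencePiece entries + 100 sentinel IDs, rounded up to a
# multiple of 128 for TPU efficiency, gives the 32128 in the config.
import math

spm_vocab = 32000
sentinel_tokens = 100
multiple = 128

vocab_size = math.ceil((spm_vocab + sentinel_tokens) / multiple) * multiple
print(vocab_size)  # 32128
```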
Hi @craffel, thanks for that! I could solve the problem with that approach. But I have another question regarding the Scaling Efficiently paper: it seems that a different pre-training task was used there. I'm highly interested in that task (/cc @vanzytay).
It's here: https://github.com/google-research/text-to-text-transfer-transformer/blob/main/t5/data/tasks.py#L106. But it needs further gin configuration if you actually want to use it as a pre-training task. If you want to use the standard T5 pre-training task, use https://github.com/google-research/text-to-text-transfer-transformer/blob/main/t5/data/tasks.py#L46.
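A hedged sketch of pulling a registered task out of the t5 library (the task name below is an assumption for illustration; the exact names are in the tasks.py file linked above):

```python
# Sketch: fetch a registered pre-training task from the t5 TaskRegistry.
# "c4_v220_span_corruption" is assumed to be the standard task's name.
import t5.data.tasks  # noqa: F401 -- importing this module registers the tasks
from t5.data import TaskRegistry

task = TaskRegistry.get("c4_v220_span_corruption")
ds = task.get_dataset(
    sequence_length={"inputs": 512, "targets": 114},
    split="train",
)
for example in ds.take(1):
    print(example)
```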
How does t5-efficient-xxl-nl4 perform compared to, say, medium-sized models? While the xxl model file is 45 GB, this one is smaller than 4 GB. Googling this model didn't help me: 3 results in total. It is not spoken of in the paper as far as I can see, and there is no performance comparison I could find. 4 transformer blocks instead of 24 sounds like quite a radical change, so probably a performance penalty, but then again, they released the model, so is it any good?
Gently pinging the original author @vanzytay here.
🌟 New model addition
Model description
This paper explores different ways of scaling T5:
Scaling Efficiently: Insights from Pre-training and Finetuning Transformers
https://arxiv.org/abs/2109.10686
They found new checkpoints (DeepNarrow) that perform significantly better than the previous models on downstream applications.
Here is a table from the paper that compares the old Base, Large, XL and XXL models to the new ones.
The checkpoints were released today here:
https://github.com/google-research/google-research/tree/master/scaling_transformers
Open source status