
[WIP] Add TF BEiT Implementation #18559

Closed
wants to merge 101 commits

Conversation

@MadElf1337 commented Aug 10, 2022

Porting BEiT model from PyTorch to TensorFlow backend

What does this PR do?

Fixes #18085

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@amyeroberts @gante @LysandreJik @NielsRogge


@MadElf1337 (Author)

@gante @amyeroberts Here's the WIP draft of BEiT!

Please tell me if I have done anything wrong, I'll make the changes right away!

Thanks!

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.

@amyeroberts (Collaborator)

Hi @MadElf1337 - thanks for opening a PR and for adding this model! Outline looks good.

As a quick overview, I see two main things that you'll want to add (alongside docs and tests):

  • # Copied from statements in the TF data2vec model definition
  • TFBeitForXxx classes

Looking forward to seeing the full PR and having this model available for our TF users :)

@MadElf1337 (Author)

@amyeroberts Sure! I'll make the changes!

@MadElf1337 (Author)

@amyeroberts @gante So I think I'm done with the model; can you look it over once while I finish writing the tests?

@gante (Member) commented Sep 5, 2022

@MadElf1337 From a quick glance, the model code looks fine 👍 As always, the devil is in the details, so you'll likely come across issues in the tests. Let us know if you get stuck on a particular test (tip: breakpoint() + comparing to PT are your friends).

Will do an in-depth review when the tests are added.

@amyeroberts (Collaborator)

@MadElf1337 As discussed on the issue #18085 here for this model, we want to copy the relevant code in data2vec to modeling_tf_beit.py, then add the necessary #Copied from statements in modeling_tf_data2vec.py, i.e. modeling_tf_beit.py and modeling_tf_data2vec.py should have the same structure and equivalent #Copied from statements as modeling_beit.py and modeling_data2vec.py. Let me know if any of this isn't clear or you need any help.

@MadElf1337 (Author)

Yeah, it was clear; I just wanted to see if the broad architecture was written correctly or not. Once I complete the tests (I'm a bit stuck on the attention output test for TF), I'll do the formatting, add the comments, and then ask for a complete review.

@amyeroberts (Collaborator)

If you follow the same structure as the pytorch data2vec vision and beit, including the copied from statements, then almost all of the architecture considerations will be taken care of for you, and it will be easier for us as reviewers.

If you need any help with the tests, let us know and we can try and lend a hand.

@MadElf1337 (Author)

Yeah, as I said, I'm just stuck on the seq_len part of the attention output for TF, since that is one thing present in data2vec but not in BEiT. So I just need to figure out that test.

@gante (Member) commented Oct 3, 2022

Hey @MadElf1337 -- we've just released a guide for TF conversions, which might come in handy for you :D

https://huggingface.co/docs/transformers/main/en/add_tensorflow_model

@MadElf1337 (Author)

Yep thanks!

Mostly done with the tests as well, just a little hiccup that will be solved soon, else I’ll make sure to ask for help!

@MadElf1337 (Author)

@gante @amyeroberts Terribly sorry for the delay, had to deal with some personal stuff that could not be avoided.

I think I'm done writing the tests and the model, can I get a review to see if I've missed anything/done anything wrong?

Thanks!

(Also, I'll add the #Copied from TFData2vec comments in the final commit.)

@MadElf1337 (Author)

@amyeroberts @gante

Can I get a review please?

@amyeroberts (Collaborator) left a comment

Thanks for the update and for implementing this first pass. Structure looks good and ready for addition of all extra pieces of work e.g. making the models importable.

Few comments:

  • TFBeitModel is missing and will need to be implemented.
  • Some small copy-pasta nits with torch and data2vec
  • I'm asking you again to implement with the #Copied from statements. I will only review again once this is done. This isn't just for completeness - it helps check that the architecture is correct and makes everything easier for both the reviewer and the implementer. As almost all of the architecture for data2vec is a copy of beit, it does not require you to write, or us to review, a new stand-alone architecture implementation. This will ensure your PR gets merged faster. If you have any questions about how to do this, please do not hesitate to ask.

>>> image = Image.open(requests.get(url, stream=True).raw)
>>> feature_extractor = AutoFeatureExtractor.from_pretrained("microsoft/beit-base-patch16-224-pt22k-ft22k")
>>> model = TFBeitForSemanticSegmentation.from_pretrained("microsoft/beit-base-patch16-224-pt22k-ft22k")
>>> inputs = feature_extractor(images=image, return_tensors="pt")
@amyeroberts (Collaborator)

Suggested change
>>> inputs = feature_extractor(images=image, return_tensors="pt")
>>> inputs = feature_extractor(images=image, return_tensors="tf")

@MadElf1337 (Author)

Done!


@MadElf1337 (Author) commented Nov 12, 2022

@amyeroberts Thanks for the review!

  1. As suggested, I've added the #Copied from... comments. (Sorry that you had to ask twice; I thought they were just comments and didn't know they were part of the review process.)

  2. I've also added the missing code and the torch references have been changed!


@amyeroberts (Collaborator)

Hi @MadElf1337 - thanks for the updates and iterating so quickly after review.

There are still a few files that need to be added for the model to be importable and fully integrated into the library. The guidelines in the document @gante shared detail these. Here's a recent model PR for reference. As the overall architecture looks good, this is the next step for this PR.


@MadElf1337 (Author)

@amyeroberts @gante So I've done everything as specified in the docs (I think); can I get a review to see if I've missed anything?

@MadElf1337 (Author)

Hey @amyeroberts @gante Can I get a review please?

@amyeroberts (Collaborator)

@MadElf1337 Thanks for the update!

The next stage for this PR is getting all of the tests running - the fun part! The tests aren't running at the moment as the models can't be imported:

E   ImportError: cannot import name 'TFBeitForImageClassification' from 'transformers' (/home/circleci/transformers/src/transformers/__init__.py)

One thing I can see that needs to be added is including the beit models in import_structure in __init__.py, e.g. here.

Some of the failing tests, e.g. check_code_quality, you can fix and/or diagnose by running make fixup locally.

Finally, the # Copied from statements should be added to the data2vec vision model in modeling_tf_data2vec_vision.py
and the ones in modeling_tf_beit.py removed.
# Copied from transformers.models.beit.modeling_tf_beit.TFBeitModelOutputWithPooling with Beit->Data2VecVision
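
For reference, here is a minimal sketch of what registering the TF classes in src/transformers/__init__.py might look like, modelled on the existing data2vec-vision entries; the exact class list is an assumption and depends on which TFBeit* classes this PR ends up adding.

# Inside the TF-only section of _import_structure in src/transformers/__init__.py:
_import_structure["models.beit"].extend(
    [
        "TFBeitForImageClassification",
        "TFBeitForMaskedImageModeling",
        "TFBeitForSemanticSegmentation",
        "TFBeitModel",
        "TFBeitPreTrainedModel",
    ]
)

# ...with the same names mirrored in the TYPE_CHECKING section further down the file:
# from .models.beit import (
#     TFBeitForImageClassification,
#     TFBeitForMaskedImageModeling,
#     TFBeitForSemanticSegmentation,
#     TFBeitModel,
#     TFBeitPreTrainedModel,
# )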

@MadElf1337 (Author)

@amyeroberts Thanks for the review!

I can see that the original repo does not have the import structure entries in __init__.py; however, I have added those to the init file in my dev branch, which is why it is showing a conflict for that file.

@github-actions

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@MadElf1337 (Author) commented Jan 11, 2023

Hey, can I know what to do next to solve the merge conflict?

@gante (Member) commented Jan 16, 2023

Hey @MadElf1337 -- You will have to rebase your PR with main :)

  1. Get the latest main
git checkout main
git pull
  2. Rebase
git checkout your_branch
git rebase origin/main
  3. Handle conflicts manually (i.e. keep the desired changes and remove the unwanted ones in the conflicting files, and follow the instructions that git gives you)

  4. Force-push your changes (force to avoid GitHub showing a diff of 666 files)

git push -u origin your_branch -f

@MadElf1337 (Author)

There, I think I've solved the conflict, but the test errors are occurring due to errors in data2vec vision.

@amyeroberts (Collaborator)

@MadElf1337 Some of the failures are because the # Copied from statements point to a path that doesn't exist e.g.
# Copied from transformers.models.data2vec.modeling_data2vec_vision.TFData2VecVisionEmbeddings with Data2VecVision->Beit is copying the object TFData2VecVisionEmbeddings but is referring to the pytorch modeling file transformers.models.data2vec.modeling_data2vec_vision.

Note: The copied from statement should be in the modeling_tf_data2vec_vision.py file and should copy from the beit model e.g. # Copied from transformers.models.beit.modeling_tf_beit.TFBeitEmbeddings with Beit->Data2VecVision. There shouldn't be any # Copied from comments in the BEiT modeling file modeling_tf_beit.py.

If you run make fixup locally in the repo, you'll be able to reproduce the check_copies check, and it will make the check_code_quality checks pass.
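
For illustration, here is a minimal sketch of the intended direction of the comments, assuming TFBeitEmbeddings ends up defined in modeling_tf_beit.py as in this PR (the pass bodies stand in for the real implementations):

import tensorflow as tf

# modeling_tf_beit.py -- BEiT is the source of truth, so its classes carry no "# Copied from" comments.
class TFBeitEmbeddings(tf.keras.layers.Layer):
    pass  # real implementation lives in this PR's modeling_tf_beit.py


# modeling_tf_data2vec_vision.py -- data2vec-vision copies from TF BEiT, with the rename applied.
# Copied from transformers.models.beit.modeling_tf_beit.TFBeitEmbeddings with Beit->Data2VecVision
class TFData2VecVisionEmbeddings(tf.keras.layers.Layer):
    pass  # must stay an exact (renamed) copy of TFBeitEmbeddings for check_copies / make fixup to pass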

@MadElf1337 (Author)

@amyeroberts Seems like only the assertion errors remain now; how do I go about solving those?

@amyeroberts (Collaborator)

@MadElf1337 That's not completely true. As I have asked many times in the past, please look at the CircleCI errors, e.g. these ones.

The process for resolving the assertion errors is as I have mentioned in the past here and here.

@MadElf1337 (Author)

@amyeroberts Oh my bad, I overlooked the documentation errors :(

I'll fix them and the assertion errors immediately!

MadElf1337 and others added 6 commits February 3, 2024 12:23

@MadElf1337 (Author)

@amyeroberts I fixed the data2vec layer errors; now I get this for the output of the hidden_states and attentions:

{'attentions[0]': 6.446242e-05,
 'attentions[10]': 3.9696693e-05,
 'attentions[11]': 0.00011232495,
 'attentions[1]': 4.529953e-06,
 'attentions[2]': 3.1590462e-06,
 'attentions[3]': 6.765127e-06,
 'attentions[4]': 1.8686056e-05,
 'attentions[5]': 1.2725592e-05,
 'attentions[6]': 1.424551e-05,
 'attentions[7]': 8.791685e-06,
 'attentions[8]': 7.587671e-05,
 'attentions[9]': 8.711219e-05,
 'pooler_output': 8.150935e-06}


{'hidden_states[0]': 1.7166138e-05,
 'hidden_states[10]': 0.00036287308,
 'hidden_states[11]': 0.000667572,
 'hidden_states[12]': 0.0011978149,
 'hidden_states[1]': 7.6293945e-05,
 'hidden_states[2]': 5.9127808e-05,
 'hidden_states[3]': 7.9631805e-05,
 'hidden_states[4]': 0.00015258789,
 'hidden_states[5]': 0.00015258789,
 'hidden_states[6]': 0.00018310547,
 'hidden_states[7]': 0.00018310547,
 'hidden_states[8]': 0.00022506714,
 'hidden_states[9]': 0.0003066063,
 'last_hidden_state': 0.0011978149}

Additionally, I get the following warning when I run the test - Some weights of BeitModel were not initialized from the model checkpoint at microsoft/beit-base-patch16-224-pt22k and are newly initialized: ['beit.pooler.layernorm.bias', 'beit.pooler.layernorm.weight']

I'm wondering if that's why I'm getting those assertion errors?

@amyeroberts (Collaborator)

@MadElf1337

Two things:

Some weights of BeitModel were not initialized from the model checkpoint at microsoft/beit-base-patch16-224-pt22k and are newly initialized: ['beit.pooler.layernorm.bias', 'beit.pooler.layernorm.weight']

This shouldn't be happening. These weights should be loaded in when you load a checkpoint. I'd investigate this first.

I'm wondering if that's why I'm getting those assertion errors?

I don't know. You'll be able to answer that by comparing the activations of the TF and PT models and seeing whether they're similar before the pooler layer but not after.
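
As an illustration of that comparison, here is a minimal sketch, assuming the TFBeitModel added on this PR's branch is importable and using the same checkpoint and processor as elsewhere in this thread:

import numpy as np
import requests
import torch
from PIL import Image

from transformers import BeitImageProcessor, BeitModel, TFBeitModel  # TFBeitModel comes from this PR's branch

image = Image.open(requests.get("http://images.cocodataset.org/val2017/000000039769.jpg", stream=True).raw)

processor = BeitImageProcessor.from_pretrained("microsoft/beit-base-patch16-224")
pt_model = BeitModel.from_pretrained("microsoft/beit-base-patch16-224-pt22k")
tf_model = TFBeitModel.from_pretrained("microsoft/beit-base-patch16-224-pt22k", from_pt=True)

with torch.no_grad():
    pt_out = pt_model(**processor(images=image, return_tensors="pt"))
tf_out = tf_model(**processor(images=image, return_tensors="tf"))

# last_hidden_state is produced before the pooler, pooler_output after it:
# if last_hidden_state already disagrees, the problem is upstream of the pooler.
for name in ("last_hidden_state", "pooler_output"):
    pt_t, tf_t = getattr(pt_out, name, None), getattr(tf_out, name, None)
    if pt_t is None or tf_t is None:
        continue  # the pooler may be disabled depending on add_pooling_layer
    print(f"{name}: max abs diff = {np.max(np.abs(pt_t.numpy() - tf_t.numpy()))}")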

@MadElf1337 (Author) commented Feb 19, 2024

@amyeroberts Yes, I fixed the weight init issue by using the MaskedImageModeling fix mentioned in one of the issues, and I'm getting 0.00011232495 as the difference between the attentions before the pooling layer when I compute the difference using output.attentions[-1] for both PT and TF.

@amyeroberts (Collaborator)

@MadElf1337 This is an incredibly long running PR. For context, we want most of our model PRs to be open for a few weeks at most - this has been open for over a year and a half. There's been a lot of upstream changes to our TF models, in particular how they are built, which would need to be incorporated here. For example, I can see in the diff for modeling_tf_data2vec_vision.py many of these necessary methods are now being removed.

We can of course help if there are weird behaviours in the repo, or you don't know how to add something, but adding and debugging the model is ultimately the contributor's responsibility. This includes finding out why there are differences between the models, which, looking at the tests at the moment, are large. If you don't think you'll be able to resolve the conflicts and make the TF and PT models equivalent within a month, then I'd suggest closing this PR.

@MadElf1337 (Author)

@amyeroberts Yep, I'll fix everything and wrap it up now

@MadElf1337 (Author)

@amyeroberts So I went through all of the model layers, and I found out where the difference is occurring.

It's the layer before the pooler, so there must be a problem in the layernorm.

Attaching layers and differences below:

outputs.last_hidden_state
-------------2.9802322e-08
outputs.pooler_output
-------------4.7683716e-07
outputs.hidden_states_0
-------------0.0
outputs.hidden_states_1
-------------1.4901161e-08
outputs.hidden_states_2
-------------1.4901161e-08
outputs.hidden_states_3
-------------1.4901161e-08
outputs.hidden_states_4
-------------2.9802322e-08
outputs.attentions_0
-------------1.8626451e-09
outputs.attentions_1
-------------1.8626451e-09
outputs.attentions_2
-------------1.8626451e-09
outputs.attentions_3
-------------1.8626451e-09
outputs.last_hidden_state
-------------0.9371357

The second last_hidden_state listed occurs just before the final pooler_output.

@amyeroberts (Collaborator)

@MadElf1337 In your example, there are two outputs.last_hidden_state values listed. What's the difference between the two? It seems very odd that a layer norm would cause this large a difference to suddenly arise, but you can confirm by comparing the differences between the arrays for the PT and TF models before and after that layer.

@MadElf1337 (Author)

@amyeroberts Yeah, I think you're right, because the test only errors out on the layernorm, but when I compute the layerwise max abs diff, here's what I get:

embeddings: Max Absolute Difference = 1.71661376953125e-05
encoder.layer_.0: Max Absolute Difference = 10.488643646240234
encoder.layer_.1: Max Absolute Difference = 6.2569780349731445
encoder.layer_.2: Max Absolute Difference = 63.09259033203125
encoder.layer_.3: Max Absolute Difference = 182.445556640625
encoder.layer_.4: Max Absolute Difference = 138.13844299316406
encoder.layer_.5: Max Absolute Difference = 95.3775634765625
encoder.layer_.6: Max Absolute Difference = 49.490692138671875
encoder.layer_.7: Max Absolute Difference = 19.959197998046875
encoder.layer_.8: Max Absolute Difference = 33.18023681640625
encoder.layer_.9: Max Absolute Difference = 145.5604248046875
encoder.layer_.10: Max Absolute Difference = 183.478515625
encoder.layer_.11: Max Absolute Difference = 145.6247100830078
layernorm: Max Absolute Difference = 0.0010194778442382812
pooler: Max Absolute Difference = 6.467103958129883e-06

The code I'm using is this:

import numpy as np
import torch
from PIL import Image

from transformers import BeitImageProcessor
from transformers.models.beit.configuration_beit import BeitConfig
from transformers.models.beit.modeling_beit import BeitModel
from transformers.models.beit.tf_test import TFBeitModel
from transformers.models.beit.modeling_tf_beit import TFBeitModelOutputWithPooling

img = Image.open("/home/madelf1337/Projects/transformers/tests/fixtures/tests_samples/COCO/000000039769.png")

img_processor = BeitImageProcessor.from_pretrained("microsoft/beit-base-patch16-224")

image1 = img_processor(images=img, return_tensors="pt")
image2 = img_processor(images=img, return_tensors="tf")

config = BeitConfig.from_pretrained(
    "microsoft/beit-base-patch16-224-pt22k", output_hidden_states=True, output_attentions=True
)

pt_outputs = BeitModel.from_pretrained("microsoft/beit-base-patch16-224-pt22k", config=config)
tf_outputs = TFBeitModel.from_pretrained("microsoft/beit-base-patch16-224-pt22k", config=config, from_pt=True)



with torch.no_grad():
    op1 = pt_outputs(**image1)
op2, tf_layerwise_outputs = tf_outputs(image2)
hidden_states = op1.hidden_states
layerwise_outputs = {}
layerwise_outputs["embeddings"] = hidden_states[0].numpy()
for i, layer_output in enumerate(hidden_states[1:]):
    layer_name = f"encoder.layer_.{i}"
    layerwise_outputs[layer_name] = layer_output.numpy()

layerwise_outputs["layernorm"] = op1.last_hidden_state.numpy()
if op1.pooler_output is not None:
    layerwise_outputs["pooler"] = op1.pooler_output.numpy()

for layer_name, pt_output in layerwise_outputs.items():
    tf_output = tf_layerwise_outputs[layer_name]
    abs_diff = np.amax(np.abs(pt_output - tf_output))
    print(f"{layer_name}: Max Absolute Difference = {abs_diff}")

Earlier I was adding stuff to the test itself so that I could see where the diff was occurring, and it errored out at the same layernorm with the message - AssertionError: outputs.last_hidden_state: Difference between torch and tf is 0.0010194778442382812 (>= 0.0002).

@amyeroberts (Collaborator)

@MadElf1337 Having large errors on the outputs of the layer doesn't tell you it's the layernorm - it tells you that the final activation differences are large. You'll need to compare the activations at each step within the layer to see where the differences are coming from.
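
One way to get those per-step activations on the PyTorch side is with forward hooks; here is a minimal sketch, where the zero tensor is a stand-in for the shared test image and the TF counterparts would need to be exposed separately, e.g. by returning intermediates from the TF layer's call():

import torch
from transformers import BeitModel

pt_model = BeitModel.from_pretrained("microsoft/beit-base-patch16-224-pt22k")
pt_model.eval()

pt_acts = {}

def save_activation(name):
    def hook(module, inputs, output):
        # attention submodules return tuples; keep the first tensor
        out = output[0] if isinstance(output, tuple) else output
        pt_acts[name] = out.detach()
    return hook

# hook every submodule inside encoder layer 0 (layernorm_before, attention, layernorm_after, ...)
for name, module in pt_model.encoder.layer[0].named_modules():
    if name:  # skip the layer module itself
        module.register_forward_hook(save_activation(f"encoder.layer.0.{name}"))

with torch.no_grad():
    pt_model(pixel_values=torch.zeros(1, 3, 224, 224))

for name, act in pt_acts.items():
    print(name, tuple(act.shape))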

@MadElf1337 (Author)

@amyeroberts Yes I've started going through each encoder layer now

@MadElf1337 (Author)

@amyeroberts I finally got everything!
I know this has been quite long, but I really want to see this through to completion!

Max absolute difference for layer embeddings: 1.71661376953125e-05

Max absolute difference for layer encoder.layer_.0: 1.71661376953125e-05

Max absolute difference for layer encoder.layer_.0.attention_output: 1.2278556823730469e-05

Max absolute difference for layer encoder.layer_.0.attention_output_w_lambda: 1.4901161193847656e-05

Max absolute difference for layer encoder.layer_.0.residual_1: 1.71661376953125e-05

Max absolute difference for layer encoder.layer_.0.layernorm_after: 8.45193862915039e-05

Max absolute difference for layer encoder.layer_.0.residual_2: 5.817413330078125e-05

Max absolute difference for layer encoder.layer_.1: 5.817413330078125e-05

Max absolute difference for layer encoder.layer_.1.attention_output: 5.781650543212891e-06

Max absolute difference for layer encoder.layer_.1.attention_output_w_lambda: 2.3484230041503906e-05

Max absolute difference for layer encoder.layer_.1.residual_1: 5.054473876953125e-05

Max absolute difference for layer encoder.layer_.1.layernorm_after: 4.100799560546875e-05

Max absolute difference for layer encoder.layer_.1.residual_2: 4.291534423828125e-05

Max absolute difference for layer encoder.layer_.2: 4.291534423828125e-05

Max absolute difference for layer encoder.layer_.2.attention_output: 1.9431114196777344e-05

Max absolute difference for layer encoder.layer_.2.attention_output_w_lambda: 5.91278076171875e-05

Max absolute difference for layer encoder.layer_.2.residual_1: 6.532669067382812e-05

Max absolute difference for layer encoder.layer_.2.layernorm_after: 5.218386650085449e-05

Max absolute difference for layer encoder.layer_.2.residual_2: 6.771087646484375e-05

Max absolute difference for layer encoder.layer_.3: 6.771087646484375e-05

Max absolute difference for layer encoder.layer_.3.attention_output: 1.5497207641601562e-05

Max absolute difference for layer encoder.layer_.3.attention_output_w_lambda: 2.7179718017578125e-05

Max absolute difference for layer encoder.layer_.3.residual_1: 5.4836273193359375e-05

Max absolute difference for layer encoder.layer_.3.layernorm_after: 4.427134990692139e-05

Max absolute difference for layer encoder.layer_.3.residual_2: 9.1552734375e-05

Max absolute difference for layer encoder.layer_.4: 9.1552734375e-05

Max absolute difference for layer encoder.layer_.4.attention_output: 2.1338462829589844e-05

Max absolute difference for layer encoder.layer_.4.attention_output_w_lambda: 2.956390380859375e-05

Max absolute difference for layer encoder.layer_.4.residual_1: 9.1552734375e-05

Max absolute difference for layer encoder.layer_.4.layernorm_after: 4.151463508605957e-05

Max absolute difference for layer encoder.layer_.4.residual_2: 0.0001220703125

Max absolute difference for layer encoder.layer_.5: 0.0001220703125

Max absolute difference for layer encoder.layer_.5.attention_output: 1.0967254638671875e-05

Max absolute difference for layer encoder.layer_.5.attention_output_w_lambda: 3.814697265625e-05

Max absolute difference for layer encoder.layer_.5.residual_1: 0.0001220703125

Max absolute difference for layer encoder.layer_.5.layernorm_after: 6.580352783203125e-05

Max absolute difference for layer encoder.layer_.5.residual_2: 0.000152587890625

Max absolute difference for layer encoder.layer_.6: 0.000152587890625

Max absolute difference for layer encoder.layer_.6.attention_output: 1.5497207641601562e-05

Max absolute difference for layer encoder.layer_.6.attention_output_w_lambda: 5.227327346801758e-05

Max absolute difference for layer encoder.layer_.6.residual_1: 0.000152587890625

Max absolute difference for layer encoder.layer_.6.layernorm_after: 5.996227264404297e-05

Max absolute difference for layer encoder.layer_.6.residual_2: 0.000152587890625

Max absolute difference for layer encoder.layer_.7: 0.000152587890625

Max absolute difference for layer encoder.layer_.7.attention_output: 2.47955322265625e-05

Max absolute difference for layer encoder.layer_.7.attention_output_w_lambda: 0.00010347366333007812

Max absolute difference for layer encoder.layer_.7.residual_1: 0.000152587890625

Max absolute difference for layer encoder.layer_.7.layernorm_after: 5.507469177246094e-05

Max absolute difference for layer encoder.layer_.7.residual_2: 0.0002384185791015625

Max absolute difference for layer encoder.layer_.8: 0.0002384185791015625

Max absolute difference for layer encoder.layer_.8.attention_output: 6.711483001708984e-05

Max absolute difference for layer encoder.layer_.8.attention_output_w_lambda: 0.00019747018814086914

Max absolute difference for layer encoder.layer_.8.residual_1: 0.0002651214599609375

Max absolute difference for layer encoder.layer_.8.layernorm_after: 5.412101745605469e-05

Max absolute difference for layer encoder.layer_.8.residual_2: 0.000263214111328125

Max absolute difference for layer encoder.layer_.9: 0.000263214111328125

Max absolute difference for layer encoder.layer_.9.attention_output: 5.364418029785156e-05

Max absolute difference for layer encoder.layer_.9.attention_output_w_lambda: 0.000217437744140625

Max absolute difference for layer encoder.layer_.9.residual_1: 0.00028967857360839844

Max absolute difference for layer encoder.layer_.9.layernorm_after: 4.7206878662109375e-05

Max absolute difference for layer encoder.layer_.9.residual_2: 0.000339508056640625

Max absolute difference for layer encoder.layer_.10: 0.000339508056640625

Max absolute difference for layer encoder.layer_.10.attention_output: 2.3066997528076172e-05

Max absolute difference for layer encoder.layer_.10.attention_output_w_lambda: 0.00016999244689941406

Max absolute difference for layer encoder.layer_.10.residual_1: 0.000339508056640625

Max absolute difference for layer encoder.layer_.10.layernorm_after: 4.2825937271118164e-05

Max absolute difference for layer encoder.layer_.10.residual_2: 0.00052642822265625

Max absolute difference for layer encoder.layer_.11: 0.00052642822265625

Max absolute difference for layer encoder.layer_.11.attention_output: 9.34600830078125e-05

Max absolute difference for layer encoder.layer_.11.attention_output_w_lambda: 0.0004742145538330078

Max absolute difference for layer encoder.layer_.11.residual_1: 0.00067901611328125

Max absolute difference for layer encoder.layer_.11.layernorm_after: 7.05718994140625e-05

Max absolute difference for layer encoder.layer_.11.residual_2: 0.0010194778442382812

Max absolute difference for layer layernorm: 0.0010194778442382812

Here are the max abs diffs across all layers, and they are not spiking across the encoder layers!

@MadElf1337 (Author)

@amyeroberts I think the error was occurring in the test because the test might be considering the base model, for which the checkpoint weights are not the correct initialization, as described by @NielsRogge on this issue.

@amyeroberts (Collaborator)

Here is the max abs diff across all layers, which is not spiking across the encoder layers!

@MadElf1337 You'll notice that it's still very high for the residual layers.

The linked initialization issue shouldn't affect the TF-PT cross tests, as whatever the weights are for the PT (randomly initialized or loaded from a checkpoint) they should be the same for the TF model.

In order for the PR to be reviewable, all the failing tests would need to be addressed.

@amyeroberts (Collaborator)

Hi @MadElf1337, I'm closing this PR.

There are a lot of upstream changes which have happened with TF models, and even updates to the BEiT model, which mean this PR is increasingly diverging and hard to reconcile with the changes upstream. Model PRs should be open on the timescale of days or weeks, and we're now approaching two years. Thanks for your efforts in porting this model. Adding models is always a very large piece of work, particularly handling compatibility between frameworks.

If you're still interested in contributing to transformers, I'd suggest looking through issues tagged with Good first issue or Good second issue and seeing if any interest you. They're far more likely to be small in scope and enable you to add something quickly into the codebase.

@MadElf1337 (Author)

@amyeroberts I understand, thanks for all the help till now!

I'll still continue with this model offline and make all the necessary adjustments. Once done, I'll add it to the Hub, and if it's a valuable contribution maybe we can revisit this PR.

Thanks for bearing with me!

Development

Successfully merging this pull request may close these issues.

Adding TF Implementation of BEiT
5 participants