
Weird output #31

Closed
kmr2017 opened this issue May 5, 2022 · 7 comments

Comments


kmr2017 commented May 5, 2022

Hi
I ran the code, and the final output looks strange regardless of which image I use. I am attaching it. Can you explain what it is?

[attached image of the plotted output]

Thanks

Collaborator

uakarsh commented May 17, 2022

Sorry for the delay, but could you let me know from which layer you extracted the output?

Regards,

Author

kmr2017 commented May 18, 2022

Hi @uakarsh

Thanks for your response.

I tried the code below:

# Imports were omitted in the original comment; module paths below are a
# guess based on the docformer repo layout
from transformers import BertTokenizerFast
from docformer import dataset, modeling

config = {
"coordinate_size": 96,
"hidden_dropout_prob": 0.1,
"hidden_size": 768,
"image_feature_pool_shape": [7, 7, 256],
"intermediate_ff_size_factor": 4,
"max_2d_position_embeddings": 1000,
"max_position_embeddings": 512,
"max_relative_positions": 8,
"num_attention_heads": 12,
"num_hidden_layers": 12,
"pad_token_id": 0,
"shape_size": 96,
"vocab_size": 30522,
"layer_norm_eps": 1e-12,
}

fp = "img.jpeg"

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
encoding = dataset.create_features(fp, tokenizer, add_batch_dim=True)

feature_extractor = modeling.ExtractFeatures(config)
docformer = modeling.DocFormerEncoder(config)
v_bar, t_bar, v_bar_s, t_bar_s = feature_extractor(encoding)
output = docformer(v_bar, t_bar, v_bar_s, t_bar_s) # shape (1, 512, 768)

then I visualized the output.
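For context, "visualizing the output" here amounts to rendering the (1, 512, 768) tensor as a heatmap. A minimal numpy sketch of the preprocessing that such a plot assumes, using a random stand-in for the real encoder output (the plot itself would be a call such as matplotlib's plt.imshow(arr)):

```python
import numpy as np

# Random stand-in for the DocFormer encoder output, shape (1, 512, 768);
# the real tensor would come from the docformer(...) call above
output = np.random.randn(1, 512, 768).astype(np.float32)

arr = output.squeeze(0)  # drop the batch dim -> (512, 768)
# Scale to [0, 1] so a heatmap call such as plt.imshow(arr) uses the full color range
arr = (arr - arr.min()) / (arr.max() - arr.min())

print(arr.shape)  # (512, 768)
```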

Collaborator

uakarsh commented May 27, 2022

Hi,

We know the output is (512, 768); this output results from attention over three different entities:

  1. Image feature of (512, 768)
  2. Language Feature of (512, 768)
  3. Spatial Dimension of (512, 768)

When we perform a downstream task, we have an encoded version of these three modalities, so the diagram you plotted helps show which encodings the model can attend to while performing the downstream task.

The same can be seen on page 15, Figure 11(b) of the DocFormer paper. Hope this helps.
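As a rough illustration of that fusion, here is a toy numpy sketch in which the three (512, 768) encodings are summed and passed through one plain self-attention step. This is a simplified, shape-preserving stand-in for DocFormer's multi-modal attention, not the paper's exact mechanism:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 768  # hidden size from the config above

# Toy stand-ins for the three per-token encodings described above
visual = rng.standard_normal((512, d))    # image features
language = rng.standard_normal((512, d))  # language features
spatial = rng.standard_normal((512, d))   # spatial features

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# One plain self-attention pass over the summed modalities
x = visual + language + spatial        # (512, 768)
attn = softmax(x @ x.T / np.sqrt(d))   # (512, 512) token-to-token weights
fused = attn @ x                       # (512, 768): same shape as the encoder output

print(fused.shape)  # (512, 768)
```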

Author

kmr2017 commented Jun 7, 2022

Thanks for the info. How can I do entity-level classification, as in the FUNSD dataset?

Author

kmr2017 commented Jun 7, 2022

@uakarsh

Collaborator

uakarsh commented Jun 7, 2022

I have almost finished the training script for RVL-CDIP (Document Classification), and have started working on FUNSD for token classification.

You can visit my cloned repo (https://github.com/uakarsh/docformer/tree/master/examples/docformer_pl); in examples/docformer_pl you can find:

  1. Data visualization
  2. Dataset creation
  3. MLM with PyTorch Lightning
  4. Document classification with DocFormer (to be uploaded soon)

Next up is NER with FUNSD.

Will update you soon!
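For a sense of what token-level (entity) classification on FUNSD adds on top of the encoder, here is a minimal numpy sketch: a linear head over the (512, 768) output producing one BIO label per token. The label set and the head are illustrative assumptions, not the repo's actual training code:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical DocFormer output for one document: 512 tokens x 768 hidden dims
hidden = rng.standard_normal((512, 768))

# A BIO label set for FUNSD's entity types -- a common choice, not
# necessarily the exact labels the training script will use
labels = ["O", "B-HEADER", "I-HEADER", "B-QUESTION", "I-QUESTION",
          "B-ANSWER", "I-ANSWER"]

# Linear token-classification head: one logit per label for every token
W = rng.standard_normal((768, len(labels))) * 0.02
b = np.zeros(len(labels))
logits = hidden @ W + b            # (512, 7)
pred_ids = logits.argmax(axis=-1)  # predicted label id per token
pred_tags = [labels[i] for i in pred_ids]

print(logits.shape, len(pred_tags))  # (512, 7) 512
```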

@uakarsh uakarsh closed this as completed Jun 21, 2022
@BakingBrains

@uakarsh Hello,

Any update on NER with FUNSD using docformer?
