
How to handle Index Error Issue #17

Closed
ShauryaUppal-1Mg opened this issue Oct 10, 2019 · 5 comments


@ShauryaUppal-1Mg

Should I trim the string to 512 tokens, or can I increase the maximum sequence length somehow?
[Three screenshots of the error traceback attached]

@ShauryaUppal-1Mg
Author

@Tiiiger

@felixgwu
Collaborator

Hi @ShauryaUppal-1Mg,

Thank you for using our BERTScore.
The reason for this error is that both BERT and RoBERTa are trained on sequences of at most 512 tokens. Unlike the original Transformer, BERT and RoBERTa use learned positional embeddings whose size is fixed during training.
BERTScore is commonly computed between a pair of sentences.
We would suggest that you split the documents into sentences before feeding them to BERT.
If there happen to be some sentences with more than 512 tokens, you can:

  1. train a BERT model on longer sequences
  2. cut sentences into multiple chunks and design a better way to aggregate them. This is one of the future directions for extending BERTScore, but we haven't studied it yet.
  3. use XLNet, which supports longer inputs. However, it performs worse in our experiments.
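To illustrate the suggested workflow (splitting documents into sentences before scoring), here is a minimal sketch. The regex-based splitter and whitespace tokenization are stand-ins chosen for illustration; a real pipeline would use a proper sentence splitter (e.g. nltk or spaCy) and BERT's subword tokenizer, which typically produces more tokens than whitespace splitting does.

```python
import re

MAX_TOKENS = 512  # BERT/RoBERTa positional-embedding limit

def split_into_sentences(document):
    """Naive sentence splitter on ., !, ? boundaries (illustrative only)."""
    parts = re.split(r'(?<=[.!?])\s+', document.strip())
    return [p for p in parts if p]

def find_too_long(sentences, tokenize=str.split):
    """Return (sentence, approx_token_count) pairs that exceed the limit.

    Whitespace splitting approximates the real subword tokenizer,
    which usually yields more tokens, so this is a lower bound.
    """
    return [(s, len(tokenize(s)))
            for s in sentences
            if len(tokenize(s)) > MAX_TOKENS]

doc = "First sentence here. Second one! A third?"
sentences = split_into_sentences(doc)
print(sentences)
print(find_too_long(sentences))  # empty: all sentences are short
```

Sentences flagged by `find_too_long` would then need one of the three workarounds above before being passed to BERTScore.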

Best,
Felix

@ShauryaUppal-1Mg
Author

Will it work for sentences with length > 512 if I remove stopwords and common words?

@ShauryaUppal-1Mg
Author

But bert-as-service allows the user to set the max sequence length to be ignored.
Can't we do something with that?

https://github.com/hanxiao/bert-as-service

@felixgwu
Collaborator

felixgwu commented Oct 11, 2019

As far as I know, they just trim down the sequence. Please see:
https://github.com/hanxiao/bert-as-service/blob/85690491d66fd1ca0d03924f8c9ead3d1cad90b1/server/bert_serving/server/__init__.py#L414-L422

We would like to follow huggingface's transformers and just raise an error instead.
We encourage users to deal with their own special cases.
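The trimming behavior described above can be sketched roughly as follows. This is not the actual bert-as-service code (see the linked source for that); it is an assumed simplification showing what silently truncating to a fixed budget looks like, and why information past the limit is simply dropped.

```python
def trim_tokens(tokens, max_seq_len=512):
    """Truncate a token list to fit the model's sequence limit.

    Reserves two slots for the special [CLS] and [SEP] tokens and
    silently drops everything past the budget -- the behavior this
    sketch assumes bert-as-service applies, rather than erroring.
    """
    budget = max_seq_len - 2  # reserve [CLS] and [SEP]
    return ["[CLS]"] + tokens[:budget] + ["[SEP]"]

tokens = [f"tok{i}" for i in range(600)]
trimmed = trim_tokens(tokens)
print(len(trimmed))  # 512: tokens 510..599 were dropped
```

Raising an error instead, as huggingface's transformers does, makes the truncation visible so the user can decide how to handle it.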

Thank you for raising this issue. We will update the README to remind other users.
