return length of the Java embedTokens() method #2
Ok, so I figured out the answer myself, so I'm posting it here in case it helps someone else: the array is always padded to the model's fixed max sequence length, which is why every sequence comes back with the same length no matter how short the input is. There are 2 options as I see it:
1. allow the max sequence length to be configured, so short inputs don't carry so much padding, and
2. keep track of which output vector corresponds to which word of the original input.
@robrua would you consider adding this? According to https://github.com/hanxiao/bert-as-service, it is more efficient (faster) to use a smaller maxSequenceLength (see the question "How about the speed? Is it fast enough for production?").
Hey, thanks for the research and the detailed issue. For (1) that's an excellent idea to add here and I'll look into allowing dynamic max sequence length on both the Python and Java ends next time I sit down and do some work on this project. For (2) there's an extra complication involved here in matching the output token vectors back to the original source: BERT uses a wordpiece vocabulary, which may split a single word from your sequence into multiple subtokens before inputting it to the model. Because of this, the output size doesn't necessarily match the number of words in the input sequence (even after accounting for the [CLS] and [SEP] tokens); you'd need to inject some "tracking" logic into the tokenizer to keep track of any words that are getting subdivided during tokenization. There's no reason this wouldn't work, and I think providing a way to match the output vectors to each word in the input sequence would be useful, so I'll also take a look at this in the future.
Hi Rob! Thanks for considering adding (1) to the code. About (2): you are right that this is not straightforward because of the special wordpiece tokenizer that BERT uses, but the code included in the link I posted (https://github.com/google-research/bert), which I converted to Java, takes this into account. It uses the same wordpiece tokenizer to tokenize the words and keeps track of how each word is tokenized: e.g. the verb "faked" is tokenized as "fake" + "##d", and each of these two subtokens gets its own vector. In that case, the code keeps track of the original word's first subtoken, here "fake", and gives you back the vector of "fake" and not "##d" as the representation of "faked". In other words, the code always tracks the first subtoken of each word, which is also the base form of the word. Thanks!
Reopening this to remind myself to add this in the future. On (2): Right then, I hadn't read it closely enough and failed to notice you were tracking the start indices for each token. I'll probably end up including something very similar to this, just integrated into the tokenizer itself to avoid needing to run it twice on each sequence. |
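The tracking logic discussed above can be sketched roughly as follows: a toy greedy longest-match wordpiece tokenizer that records the index of each word's first subtoken. Note this is a minimal illustration with a made-up vocabulary, not easy-bert's actual tokenizer; the class and method names here are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class WordpieceTracking {
    // Toy vocabulary for illustration; a real BERT model ships its own vocab.txt.
    static final Set<String> VOCAB = new HashSet<>(Arrays.asList(
            "fake", "##d", "the", "dog", "bark", "##ed"));

    // Greedy longest-match-first wordpiece split of a single word.
    // Continuation pieces carry the "##" prefix, as in BERT's vocabulary.
    static List<String> wordpiece(String word) {
        List<String> pieces = new ArrayList<>();
        int start = 0;
        while (start < word.length()) {
            int end = word.length();
            String piece = null;
            while (end > start) {
                String candidate = word.substring(start, end);
                if (start > 0) candidate = "##" + candidate;
                if (VOCAB.contains(candidate)) { piece = candidate; break; }
                end--;
            }
            if (piece == null) { pieces.add("[UNK]"); break; }
            pieces.add(piece);
            start = end;
        }
        return pieces;
    }

    // Tokenizes a sequence of words and records, for each word, the index of
    // its FIRST subtoken in the flat token list (offset by 1 for [CLS]).
    static int[] firstSubtokenIndices(String[] words, List<String> tokensOut) {
        int[] starts = new int[words.length];
        tokensOut.add("[CLS]");
        for (int i = 0; i < words.length; i++) {
            starts[i] = tokensOut.size(); // first piece of word i
            tokensOut.addAll(wordpiece(words[i]));
        }
        tokensOut.add("[SEP]");
        return starts;
    }

    public static void main(String[] args) {
        List<String> tokens = new ArrayList<>();
        int[] starts = firstSubtokenIndices(
                new String[] {"the", "dog", "barked"}, tokens);
        System.out.println(tokens);                  // [[CLS], the, dog, bark, ##ed, [SEP]]
        System.out.println(Arrays.toString(starts)); // [1, 2, 3]
        // embeddings[starts[i]] would then be the vector representing word i.
    }
}
```

With the start indices in hand, picking `embeddings[starts[i]]` yields one vector per original word, which is exactly the matching described in the comments above.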
Hi,
first of all, well done for this great work and thanks for making it publicly available!
I have the following problem: I am using the Java version and I want to match each token to its embedding. I am loading the English uncased model and getting the embeddings of 2 strings (str1, str2) with
float[][][] embeddings = bert.embedTokens(str1, str2);
After that, I can get the embedding corresponding to each sequence/string with
float[][] firstSent = embeddings[0];
float[][] secondSent = embeddings[1];
However, firstSent and secondSent always have a standard length of 127 and not the length of my strings str1 and str2. If I then take firstSent[0], it has a length of 768, which is the expected size of the embeddings, but I don't understand why I am getting 127 as the length of firstSent and secondSent. And given this length, I guess that firstSent[0] does NOT correspond to the first token of my first sentence, which is what I would like to get. Any help is much appreciated! Thanks a lot!
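As the comments above explain, the middle dimension of the returned array is the model's fixed max sequence length, not the number of tokens in the input. A standalone illustration of the array shape (toy code only; no easy-bert calls, and whether slot 0 holds the [CLS] vector is an assumption based on the discussion above):

```java
public class PaddedEmbeddings {
    public static void main(String[] args) {
        int maxSeqLen = 127;     // fixed by the model configuration, not by the input
        int embeddingSize = 768; // BERT-base hidden size

        // Shape mirrors what embedTokens(str1, str2) returns:
        // [sequence][token slot][embedding dimension].
        float[][][] embeddings = new float[2][maxSeqLen][embeddingSize];

        float[][] firstSent = embeddings[0];
        float[][] secondSent = embeddings[1];

        // Both sequences report the padded length, regardless of how many
        // real tokens the input strings contained.
        System.out.println(firstSent.length);    // 127
        System.out.println(secondSent.length);   // 127
        System.out.println(firstSent[0].length); // 768

        // If slot 0 holds the [CLS] token, firstSent[1] would be the first
        // real (sub)token of the first sentence, and trailing slots padding.
    }
}
```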