After extensive testing and digging, @mheinzinger and I figured out that ELMo (aka: SeqVec) has an "initialization" problem (see: allenai/allennlp#1169)
In short: the first batch (or first few batches) embedded with SeqVec will produce embeddings that differ significantly to slightly from what is expected. A visual example:
In this case, a reference set of sequences was embedded, including P12345. Then, P12345 was embedded in batches of 1 (setting `max_amino_acids: 1`), and the Euclidean distance between these embeddings and the "reference" embeddings was calculated.
As is evident, in the first batch, P12345 is at a Euclidean distance of 0.02 from itself. In the second batch, the distance is 0.007 (a factor of >10 less than in the first batch). Further down the line, the distance decreases further.
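The comparison described above can be sketched as follows. Note this is a minimal illustration: the embedding vectors are made-up placeholders, not actual SeqVec output, and stand in for the per-protein embeddings of P12345 from the two runs.

```python
import numpy as np

# Hypothetical embeddings of the same protein (e.g. P12345) from two runs.
# Values are illustrative only, not real SeqVec output.
reference = np.array([0.12, -0.40, 0.33])    # from the reference run
first_batch = np.array([0.13, -0.41, 0.35])  # from the first small-batch run

# Euclidean distance between the two embeddings of the same protein;
# for a properly initialized model this should be ~0.
distance = np.linalg.norm(reference - first_batch)
```

A nonzero `distance` for identical input sequences is exactly the symptom reported here.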
At this stage, the suggested fix is that after programmatic initialization, SeqVec/ELMo needs to be run on a random (but real!) sequence in a single batch before processing the actual sequence set. @mheinzinger suggests actually running 2-3 sequences in 2-3 batches (1 sequence per batch). This should truly "initialize" the model.
This is relevant for both CPU and GPU, meaning that, in the code, you should add a call to `embed`: