ELMo has a cold start problem #76

Closed
sacdallago opened this issue Oct 23, 2020 · 0 comments

After extensive testing and digging, @mheinzinger and I figured out that ELMo (aka: SeqVec) has an "initialization" problem (see: allenai/allennlp#1169)

In short: the first batch or two embedded with SeqVec produce embeddings that differ, anywhere from slightly to significantly, from the expected ones. A visual example:

[image: Euclidean distance between the P12345 embedding and its reference embedding, per batch]

In this case, a reference set of sequences was embedded, including P12345. Then P12345 was embedded repeatedly in batches of 1 (setting max_amino_acids: 1), and the Euclidean distance between each of these embeddings and the "reference" embedding was calculated.

As is evident, in the first batch the embedding of P12345 lies at a Euclidean distance of 0.02 from its own reference embedding. In the second batch the distance is 0.007 (roughly a factor of 3 smaller than in the first batch). Further down the line, the distance keeps decreasing.
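The measurement described above can be sketched roughly as follows. This is only an illustration, not the script that produced the plot: `embed_one` is a hypothetical callable standing in for "embed this one sequence as its own batch", and embeddings are assumed to be flat sequences of floats (a per-residue matrix would have to be flattened or pooled first).

```python
import math
from typing import Callable, List, Sequence


def distances_to_reference(
    embed_one: Callable[[str], Sequence[float]],  # hypothetical single-sequence embedder
    sequence: str,
    reference: Sequence[float],
    repeats: int = 5,
) -> List[float]:
    """Embed `sequence` `repeats` times, one batch per call, and return the
    Euclidean distance of each resulting embedding to `reference`."""
    return [math.dist(embed_one(sequence), reference) for _ in range(repeats)]
```

If the cold start effect is present, the first entries of the returned list should be noticeably larger than the later ones.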

At this stage, the suggested fix is that, after programmatic initialization, SeqVec/ELMo is run on a random (but real!) sequence in a single batch before it starts processing the actual sequence set. @mheinzinger suggests actually running 2-3 sequences in 2-3 batches (1 sequence per batch). This should truly "initialize" the model.

This is relevant for both CPU and GPU. In the code, this means adding a warm-up call to embed right after the model is initialized, before any real sequences are processed.
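A minimal sketch of such a warm-up step, under stated assumptions: `embed_one` is a hypothetical callable that embeds a single sequence as its own batch, and `WARMUP_SEQUENCES` holds placeholder strings that should be swapped for real protein sequences. This is not the code from the eventual fix, only an illustration of the idea.

```python
from typing import Callable, Iterable

# Placeholder amino-acid strings (an assumption for this sketch);
# replace them with real protein sequences, as the issue recommends.
WARMUP_SEQUENCES = (
    "MKTAYIAKQRQISFVKSHFSRQ",
    "MSEQNNTEMTFQIQRIYTKDIS",
    "MALWMRLLPLLALLALWGPDPA",
)


def warm_up(
    embed_one: Callable[[str], object],  # hypothetical single-sequence embedder
    sequences: Iterable[str] = WARMUP_SEQUENCES,
) -> int:
    """Run each warm-up sequence through the model in its own batch and
    discard the result. Returns the number of warm-up calls made."""
    count = 0
    for seq in sequences:
        embed_one(seq)  # output intentionally ignored
        count += 1
    return count
```

With an allennlp-style `ElmoEmbedder`, `embed_one` could plausibly be something like `lambda s: embedder.embed_sentence(list(s))`, called once right after the embedder is constructed; treat that wiring as an assumption rather than the project's actual API.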

sacdallago added the bug and prio:high labels Oct 23, 2020
sacdallago added this to the Version v0.1.5 milestone Oct 23, 2020
konstin closed this as completed in 832224c Nov 3, 2020
konstin pushed a commit that referenced this issue Nov 3, 2020
ELMo warmup to fix GH-76

See merge request sacdallago/bio_embeddings!97