Version 0.9.2 #107
Merged
Conversation
1. Embeddings are now saved to file once a threshold is reached. 2. Memory is deleted and garbage-collected after calculation; the embedder model in particular was sometimes causing problems.
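A minimal sketch of this save-and-collect pattern. All names, the pickle format, and the threshold value here are illustrative assumptions, not the project's actual API:

```python
import gc
import pickle
import tempfile
from pathlib import Path

SAVE_THRESHOLD = 2  # assumption: flush to disk after this many embeddings


def compute_embeddings(sequences, out_dir):
    """Buffer embeddings in RAM and flush them to disk at a threshold."""
    buffer = {}
    saved_files = []
    for i, seq in enumerate(sequences):
        buffer[f"seq_{i}"] = [float(len(seq))]  # stand-in for a real embedding
        if len(buffer) >= SAVE_THRESHOLD:
            path = Path(out_dir) / f"embeddings_{len(saved_files)}.pkl"
            with open(path, "wb") as f:
                pickle.dump(buffer, f)
            saved_files.append(path)
            buffer.clear()  # drop references so memory can be reclaimed
            gc.collect()    # explicit collection, as in the release note
    if buffer:  # flush the remainder
        path = Path(out_dir) / f"embeddings_{len(saved_files)}.pkl"
        with open(path, "wb") as f:
            pickle.dump(buffer, f)
        saved_files.append(path)
    return saved_files


with tempfile.TemporaryDirectory() as tmp:
    files = compute_embeddings(["SEQ", "PROTEIN", "MKT"], tmp)
    print(len(files))  # 2: one file for the first two, one for the remainder
```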
It was found that the tokenizers of different pLMs work differently under the hood, so the correct strategy for passing sequences to each tokenizer had to be identified.
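For example (a hedged sketch; which preprocessing biotrainer applies per model is an assumption here): ProtT5-style tokenizers are commonly fed space-separated residues with rare amino acids mapped to X, while ESM-style tokenizers take the raw sequence:

```python
import re

def preprocess_for_tokenizer(sequence: str, needs_spaces: bool) -> str:
    """Prepare a protein sequence for a given tokenizer convention."""
    # Map rare residues (U, Z, O, B) to the unknown token X
    sequence = re.sub(r"[UZOB]", "X", sequence.upper())
    # Some tokenizers expect one residue per "word", i.e. space-separated
    return " ".join(sequence) if needs_spaces else sequence

print(preprocess_for_tokenizer("MKTU", needs_spaces=True))   # M K T X
print(preprocess_for_tokenizer("MKTU", needs_spaces=False))  # MKTX
```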
LayerNorm is more commonly used in NLP and avoids problems with batches of size 1
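The reason in miniature: BatchNorm normalizes each feature across the batch, so a batch of one sample has zero per-feature variance and degenerates in training mode, while LayerNorm normalizes across the features of each sample independently. A pure-Python illustration (not the project's code):

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize across the features of a single sample."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def batch_norm_train(batch, eps=1e-5):
    """Training-mode BatchNorm: normalize each feature across the batch."""
    n = len(batch)
    cols = []
    for j in range(len(batch[0])):
        col = [row[j] for row in batch]
        mean = sum(col) / n
        var = sum((v - mean) ** 2 for v in col) / n  # 0 when n == 1
        cols.append([(v - mean) / math.sqrt(var + eps) for v in col])
    return [list(row) for row in zip(*cols)]  # back to batch-major

single = [1.0, 2.0, 3.0]
print([round(v, 2) for v in layer_norm(single)])  # [-1.22, 0.0, 1.22]
print(batch_norm_train([single]))                 # [[0.0, 0.0, 0.0]]
```

With a batch of one, BatchNorm maps every feature to exactly zero (before scale/shift), destroying the signal; LayerNorm still produces a meaningful normalization.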
Padded residue embeddings are now masked out, which improves reproducibility and avoids predictions differing between batched and single inputs.
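A sketch of the idea (shapes and names are illustrative): positions beyond a sequence's true length are excluded before pooling, so the same sequence yields the same pooled embedding alone or inside a padded batch:

```python
def masked_mean_pool(residue_embeddings, length):
    """Mean-pool residue embeddings, ignoring padded positions entirely."""
    dim = len(residue_embeddings[0])
    pooled = [0.0] * dim
    for pos in range(length):          # only the first `length` real residues
        for d in range(dim):
            pooled[d] += residue_embeddings[pos][d]
    return [v / length for v in pooled]

# Sequence of true length 2, padded to length 3; last row is padding garbage
padded = [[1.0, 2.0], [3.0, 4.0], [9.9, 9.9]]
print(masked_mean_pool(padded, length=2))  # [2.0, 3.0]
```

Without the mask, the padding row would leak into the mean and the result would depend on how much padding the batch happened to need.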
The mask is now applied before the attention convolution, and `-float('inf')` is used instead of `-1e9`, which seems to improve reproducibility and avoids predictions differing between batched and single inputs.
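Why `-float('inf')` helps, in miniature: after the softmax, a position masked with `-inf` receives exactly zero attention weight, whereas a large finite value like `-1e9` leaves a tiny residual probability that can differ between batched and single inputs. A standalone illustration:

```python
import math

def masked_softmax(scores, mask):
    """Softmax over attention scores; masked-out positions get weight 0."""
    masked = [s if keep else -math.inf for s, keep in zip(scores, mask)]
    mx = max(masked)
    exps = [math.exp(s - mx) for s in masked]   # exp(-inf) == 0.0 exactly
    total = sum(exps)
    return [e / total for e in exps]

weights = masked_softmax([1.0, 2.0, 0.5], mask=[True, True, False])
print(weights[2])  # 0.0 — the padded position gets exactly zero weight
```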
Using AutoTokenizer does not work for all models, but it can help when the correct tokenizer class name is available.
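A sketch of the fallback pattern this suggests: try the automatic loader first, and fall back to an explicitly named tokenizer class when it fails. The loader functions below are stand-ins so the sketch is self-contained; they are not the transformers API:

```python
def resolve_tokenizer(auto_loader, explicit_classes, model_name):
    """Try the automatic loader; fall back to a per-model explicit class."""
    try:
        return auto_loader(model_name)
    except Exception:
        cls = explicit_classes.get(model_name)
        if cls is None:
            raise  # no fallback known for this model
        return cls(model_name)

# Stand-in loaders (assumptions, for illustration only):
def fake_auto(name):
    if name != "works/auto":
        raise ValueError("automatic loading cannot resolve this model")
    return f"auto:{name}"

explicit = {"needs/explicit": lambda name: f"explicit:{name}"}

print(resolve_tokenizer(fake_auto, explicit, "works/auto"))      # auto:works/auto
print(resolve_tokenizer(fake_auto, explicit, "needs/explicit"))  # explicit:needs/explicit
```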
Embedder: https://github.com/westlake-repl/SaProt
Discussion: J-SNACKKB/FLIP#26
Enhanced the embedding computation function to manage RAM dynamically: it estimates the maximum number of embeddings that fit in memory and automatically saves them to disk to optimize memory usage.
- Extract core embedding service logic into the embedding_service method
- Add special-case handling for ultra-long reads:
  - Immediately save ultra-long read embeddings to disk
  - Avoid loading additional sequences when an ultra-long read is detected
- Dynamically calculate the maximum number of embeddings that fit in available memory
- Use this to determine when to flush embeddings to disk
- Code cleanup:
  - Use type hints for improved readability
  - Add docstrings for key methods
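A hypothetical sketch of the memory estimate. The constants 0.75 (fraction of available memory to use) and 18 (per-embedding overhead factor) mirror those documented for _max_embedding_fit elsewhere in this release; how exactly they combine is an assumption made for illustration:

```python
def max_embedding_fit(available_bytes: int, bytes_per_embedding: int) -> int:
    """Estimate how many embeddings fit in RAM before flushing to disk."""
    usable = 0.75 * available_bytes            # leave 25% headroom
    overhead_factor = 18                       # assumed per-embedding overhead
    return max(1, int(usable / (bytes_per_embedding * overhead_factor)))

# e.g. 8 GiB free, float32 residue embeddings: 1024-dim, sequence length 500
per_embedding = 4 * 1024 * 500
print(max_embedding_fit(8 * 1024**3, per_embedding))  # 174
```

The service would then flush the buffered embeddings to disk whenever the buffer reaches this estimate.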
…ogging
- Renamed embedding_service to _do_embeddings_computation for clarity.
- Replaced the while loop with a for loop to prevent infinite loops and simplify processing.
- Combined handling of ultra-long and normal reads into a single loop to reduce complexity.
- Updated logging levels for better clarity and fewer unnecessary logs.
- Optimized garbage collection by consolidating gc.collect() calls.
…gement
- Replacing tearDown and manual cleanup with tempfile.TemporaryDirectory() for automatic resource management.
- Adjusting the long_length calculation to use a fixed value for local testing and a memory-based value for CI environments.
- Enhancing _run_embedding_test to support both the sequence_to_class and residue_to_class protocols.
- Updating _verify_result to validate output based on the protocol used.
- Adding new tests for comprehensive coverage of both embedding protocols.
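The tempfile.TemporaryDirectory() change can be sketched like this (the test case and file names are illustrative, not the project's actual tests). The context manager removes the directory even when a test fails, so no tearDown is needed:

```python
import tempfile
import unittest
from pathlib import Path

class EmbeddingServiceTest(unittest.TestCase):
    def test_writes_output(self):
        # Each test owns a throwaway directory; cleanup is automatic
        with tempfile.TemporaryDirectory() as tmp:
            out = Path(tmp) / "embeddings.h5"   # hypothetical output name
            out.write_bytes(b"stub")            # stand-in for the real run
            self.assertTrue(out.exists())
        # After the with-block the whole directory is gone
        self.assertFalse(out.exists())

suite = unittest.defaultTestLoader.loadTestsFromTestCase(EmbeddingServiceTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
print(result.wasSuccessful())  # True
```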
- Moving the progress bar initialization to ensure visibility during the first embedding calculation, and updating its description.
- Consolidating memory cleanup: moving del self._embedder and gc.collect() to avoid repeated calls.
- Updating the docstring of _max_embedding_fit to explain its constants (0.75 and 18).
SebieF added the enhancement (New feature or request), breaking (Breaking change), and maintenance (Code or example maintenance) labels on Aug 26, 2024.
26.08.2024 - Version 0.9.2
Features
Maintenance