
Version 0.9.0 #92

Merged
Merged 11 commits into sacdallago:main from version-0-9-0 on Jun 16, 2024

Conversation

@SebieF (Collaborator) commented Jun 16, 2024

09.06.2024 - Version 0.9.0

Maintenance

  • Adding more extensive code documentation
  • Optimizing imports
  • Applying consistent file naming
  • Updating dependencies. Note that `jupyter` was removed as a direct optional dependency.
    You can always add it via `poetry add jupyter`.
  • Adding simple differentiation between t5 and esm tokenizers and models in the embedders module (see the sketch below)
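
The t5/esm switch is name-based (see the commit "Adding simple heuristic to switch between t5 and esm" below). A minimal sketch of such a heuristic, assuming the embedder name is a Hugging Face model ID that carries the family name; the class choices here are illustrative, not biotrainer's actual code:

```python
# Illustrative sketch, not biotrainer's actual code: pick the tokenizer family
# from the embedder name (assumed to be a Hugging Face model ID).
from transformers import EsmTokenizer, T5Tokenizer

def resolve_tokenizer_class(embedder_name: str):
    """Return EsmTokenizer for ESM-style models, T5Tokenizer otherwise."""
    if "esm" in embedder_name.lower():
        return EsmTokenizer
    return T5Tokenizer

print(resolve_tokenizer_class("facebook/esm2_t33_650M_UR50D").__name__)  # EsmTokenizer
print(resolve_tokenizer_class("Rostlab/prot_t5_xl_uniref50").__name__)   # T5Tokenizer
```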

Features

  • Adding a new residues_to_value protocol. Like the residues_to_class protocol,
    it predicts a single value per sequence from per-residue embeddings.
    In some situations it may outperform the sequence_to_value protocol.
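
Conceptually, residues_to_value keeps the full per-residue embedding matrix and pools it inside the model, instead of reducing each sequence to a single vector before the model sees it, as sequence_to_value does. A minimal, self-contained sketch of this idea (an illustration, not biotrainer's actual model):

```python
import torch
import torch.nn as nn

class ResiduesToValueSketch(nn.Module):
    """Toy illustration: attention-pool per-residue embeddings, then regress a scalar."""

    def __init__(self, embedding_dim: int = 1024):
        super().__init__()
        self.attention = nn.Linear(embedding_dim, 1)  # one relevance score per residue
        self.regressor = nn.Linear(embedding_dim, 1)  # one value per sequence

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: [batch, seq_len, embedding_dim] per-residue embeddings
        weights = torch.softmax(self.attention(x), dim=1)  # [batch, seq_len, 1]
        pooled = (weights * x).sum(dim=1)                  # [batch, embedding_dim]
        return self.regressor(pooled).squeeze(-1)          # [batch]

embeddings = torch.randn(2, 50, 1024)  # 2 sequences of 50 residues each
print(ResiduesToValueSketch()(embeddings).shape)  # torch.Size([2])
```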

Bug fixes

  • For `huggingface_transformer_embedder.py`, all special tokens are now always deleted from the final embedding
    (e.g. first/last for esm1b, last for t5)
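
The fix follows the pattern below: build a special-tokens mask and delete the flagged rows from the per-token embedding matrix. A standalone toy example; the mask and dimensions are made up:

```python
import numpy as np

# Toy example: an esm1b-style sequence of 5 tokens, <cls> ... <eos>,
# so positions 0 and 4 are special tokens to be removed.
special_tokens_mask = np.array([1, 0, 0, 0, 1])
embedding = np.random.rand(5, 1280)  # one 1280-d vector per token

cleaned = np.delete(embedding, np.where(special_tokens_mask)[0], axis=0)
print(cleaned.shape)  # (3, 1280): only the actual residues remain
```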

SebieF added 11 commits June 6, 2024 14:54
  • Adding simple heuristic to switch between t5 and esm
  • All special tokens are sliced off from the final embedding now for all models (e.g. esm: first/last, t5: last)
  • Similar to the residues_to_class protocol, this protocol predicts a value for each sequence, using per-residue embeddings.
  • Note that `jupyter` was removed as a direct optional dependency. You can always add it via `poetry add jupyter`.
@SebieF added the bug, documentation and enhancement labels Jun 16, 2024
@SebieF self-assigned this Jun 16, 2024
@SebieF (Collaborator, Author) commented Jun 16, 2024

@CodiumAI-Agent /review

@CodiumAI-Agent

PR Reviewer Guide 🔍

⏱️ Estimated effort to review [1-5]: 4
🧪 Relevant tests: No
🔒 Security concerns: No
⚡ Key issues to review:

  • Possible Bug: The method _early_stop in solver.py decrements _stop_count, which can go negative if it is called repeatedly beyond the patience threshold. Consider resetting _stop_count to self.patience when the early-stop condition is not met.
  • Performance Concern: The method _do_dropout_iterations in solver.py might be computationally expensive, as it involves multiple forward passes with dropout enabled. This could be optimized or parallelized to improve performance.
  • Code Clarity: The complex list comprehensions and nested functions in fasta.py might reduce code readability and maintainability. Simplifying these constructs would make the code more understandable.
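
To make the possible bug concrete, a patience counter along these lines never goes negative, because it resets whenever a new minimum loss is reached (an illustrative sketch, not the actual solver.py code):

```python
class EarlyStopper:
    """Illustrative patience counter; not biotrainer's actual implementation."""

    def __init__(self, patience: int = 10):
        self.patience = patience
        self._stop_count = patience
        self._min_loss = float("inf")

    def should_stop(self, current_loss: float) -> bool:
        if current_loss < self._min_loss:
            self._min_loss = current_loss
            self._stop_count = self.patience  # reset instead of decrementing forever
            return False
        self._stop_count -= 1
        return self._stop_count <= 0
```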

@SebieF (Collaborator, Author) commented Jun 16, 2024

@CodiumAI-Agent /improve

@CodiumAI-Agent

PR Code Suggestions ✨

Security
Improve security by using yaml.safe_load() instead of yaml.load()

Replace the use of yaml.load() with yaml.safe_load() to avoid potential security risks
associated with loading arbitrary YAML content.

biotrainer/inference/inferencer.py [86]

-output_vars = yaml.load(tmp_output_file, Loader=yaml.RoundTripLoader)
+output_vars = yaml.safe_load(tmp_output_file)
 
Suggestion importance[1-10]: 10

Why: This suggestion addresses a significant security concern by replacing yaml.load() with yaml.safe_load(), which is a safer method for loading YAML content.

Possible bug
Correct the superclass initialization in the DeeperFNN class

In the DeeperFNN class, the superclass initialization incorrectly calls super(FNN, self).__init__();
it should be super(DeeperFNN, self).__init__(). This ensures that the DeeperFNN class
correctly initializes its superclass.

biotrainer/models/fnn.py [38]

-super(FNN, self).__init__()
+super(DeeperFNN, self).__init__()
 
Suggestion importance[1-10]: 10

Why: This suggestion addresses a critical bug in the superclass initialization of the DeeperFNN class, ensuring proper inheritance and functionality.

Performance
Optimize the removal of special tokens by using np.where()

Replace the list comprehension passed to np.delete() with np.where() to improve code
readability and efficiency.

biotrainer/embedders/huggingface_transformer_embedder.py [62-63]

 special_tokens_mask = self._tokenizer.get_special_tokens_mask(input_id, already_has_special_tokens=True)
-embedding = np.delete(embeddings[seq_num],
-                      [index for index, mask in enumerate(special_tokens_mask) if mask != 0], axis=0)
+embedding = np.delete(embeddings[seq_num], np.where(special_tokens_mask)[0], axis=0)
 
Suggestion importance[1-10]: 9

Why: Using np.where() improves both readability and performance. This change makes the code more efficient and easier to understand.

Use torch.no_grad() to optimize inference performance

Consider using torch.no_grad() context manager during inference to disable gradient
computation, which can reduce memory consumption and increase computation speed.

biotrainer/inference/inferencer.py [232]

-inference_dict = solver.inference(dataloader, calculate_test_metrics=targets is not None)
+with torch.no_grad():
+    inference_dict = solver.inference(dataloader, calculate_test_metrics=targets is not None)
 
Suggestion importance[1-10]: 9

Why: This suggestion correctly identifies a performance optimization by using torch.no_grad() during inference, which can reduce memory consumption and increase computation speed.

Robustness
Add exception handling for model state loading to manage file-related errors

Implement exception handling for the torch.load function to manage potential errors during
the loading of model states, such as file not found or corrupted files.

biotrainer/solvers/solver.py [251]

-state = torch.load(checkpoint_path, map_location=torch.device(self.device))
+try:
+    state = torch.load(checkpoint_path, map_location=torch.device(self.device))
+except FileNotFoundError:
+    logger.error("Checkpoint file not found.")
+    return
+except Exception as e:
+    logger.error(f"Failed to load checkpoint: {str(e)}")
+    return
 
Suggestion importance[1-10]: 9

Why: Adding exception handling for torch.load is a good practice to manage potential errors such as file not found or corrupted files. This enhances the robustness of the code.

Best practice
Improve type checking by using isinstance()

Replace the direct type checks with isinstance() for better type checking, especially when
dealing with inheritance.

biotrainer/config/config_option.py [67-68]

-return ("range" in str(self.value) or type(self.value) is list or
-        (type(self.value) is str and "[" in self.value and "]" in self.value))
+return ("range" in str(self.value) or isinstance(self.value, list) or
+        (isinstance(self.value, str) and "[" in self.value and "]" in self.value))
 
Suggestion importance[1-10]: 8

Why: Using isinstance() is a best practice for type checking, especially when dealing with inheritance. This change improves code robustness and readability.

Use more specific exception types for clearer error handling

Use a more specific exception type than the general Exception to provide clearer error
handling.

biotrainer/config/configurator.py [81-82]

-except Exception as e:
-    raise Exception(f"Loading {embedder_name} automatically and as {tokenizer_class.__class__.__name__} failed!"
-                    f" Please provide a custom_embedder script for your use-case.") from e
+except ImportError as e:
+    raise ImportError(f"Loading {embedder_name} automatically and as {tokenizer_class.__class__.__name__} failed!"
+                      f" Please provide a custom_embedder script for your use-case.") from e
 
Suggestion importance[1-10]: 7

Why: Using a more specific exception type like ImportError provides clearer error handling and makes the code easier to debug. However, the improvement is minor and context-specific.

Enhancement
Enhance error messages for clarity and debugging

Replace the manual exception raising for unknown split_name with a more informative error
message that includes the available splits.

biotrainer/inference/inferencer.py [191]

-raise Exception(f"Unknown split_name {split_name} for given configuration!")
+if split_name not in self.solvers_and_loaders_by_split:
+    available_splits = ', '.join(self.solvers_and_loaders_by_split.keys())
+    raise ValueError(f"Unknown split_name '{split_name}'. Available splits are: {available_splits}")
 
Suggestion importance[1-10]: 8

Why: The suggestion improves the clarity of error messages by including available split names, which aids in debugging and provides more informative feedback to the user.

Apply dropout consistently to both feature and attention convolutions in the LightAttention class

In the LightAttention class, the dropout operation is applied only to the output of
feature_convolution but not to attention_convolution. Consistently applying dropout to
both could potentially improve model performance by regularizing both features and
attention mechanisms.

biotrainer/models/light_attention.py [47]

 o = self.dropout(o)
+attention = self.dropout(attention)
 
Suggestion importance[1-10]: 8

Why: This suggestion potentially improves model performance by regularizing both features and attention mechanisms, making it a valuable enhancement.

Enhance the _early_stop method by logging the reason for stopping

Modify the _early_stop method to log the reason for stopping, which could be due to
achieving a new minimum loss or reaching the patience limit. This enhances debugging and
monitoring capabilities.

biotrainer/solvers/solver.py [299]

 if self._stop_count == 0:
+    logger.info("Early stopping due to patience limit reached.")
 
Suggestion importance[1-10]: 8

Why: Logging the reason for early stopping enhances debugging and monitoring capabilities, making it easier to understand why the training was stopped. This is a useful enhancement for tracking the training process.

Improve variable naming for clarity in the FNN class's forward method

In the FNN class, consider using a more descriptive variable name for the input tensor x
in the forward method. Renaming x to input_tensor would improve code readability and make
the method's purpose clearer.

biotrainer/models/fnn.py [20]

-def forward(self, x):
+def forward(self, input_tensor):
 
Suggestion importance[1-10]: 5

Why: While this suggestion enhances code readability, it is a minor improvement and does not address any functional issues.

Maintainability
Simplify dictionary creation from an iterable

Use dict(enumerate(...)) to simplify the creation of embeddings_dict from embeddings when it
is not already a dictionary.

biotrainer/inference/inferencer.py [278]

-embeddings_dict = {str(idx): embedding for idx, embedding in enumerate(embeddings)}
+embeddings_dict = dict(enumerate(embeddings)) if not isinstance(embeddings, Dict) else embeddings
 
Suggestion importance[1-10]: 7

Why: This suggestion simplifies the code for creating embeddings_dict from an iterable, improving code readability and maintainability. However, the improvement is minor.

Refactor the mask calculation into a separate method in the LightAttention class

The mask calculation in the forward method of the LightAttention class should be moved to
a separate method to improve code readability and maintainability. This change will make
the forward method cleaner and focus primarily on the forward pass logic.

biotrainer/models/light_attention.py [43]

-mask = x.sum(dim=-1) != utils.SEQUENCE_PAD_VALUE
+mask = self.calculate_mask(x)
 
+def calculate_mask(self, x):
+    return x.sum(dim=-1) != utils.SEQUENCE_PAD_VALUE
+
Suggestion importance[1-10]: 7

Why: This suggestion improves code readability and maintainability by separating concerns, but it does not address a critical issue.

Refactor to separate training and validation into distinct methods for better modularity

Refactor the train method to separate the training and validation phases into their own
methods. This improves code readability and maintainability by modularizing the training
process.

biotrainer/solvers/solver.py [66]

 for epoch in range(self.start_epoch, self.number_of_epochs):
+    self._train_epoch(training_dataloader, epoch)
+    self._validate_epoch(validation_dataloader, epoch)
 
Suggestion importance[1-10]: 7

Why: Refactoring the train method to separate training and validation phases improves code readability and maintainability. However, this suggestion requires additional implementation details for the new methods, which are not provided.

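As the review notes, the suggestion leaves the bodies of the new methods open. One way the split could look, with hypothetical method bodies that are not biotrainer's actual solver code:

```python
import torch

class SolverSketch:
    """Hypothetical illustration of the proposed split; not biotrainer's solver."""

    def __init__(self, network, optimizer, loss_fn, number_of_epochs: int):
        self.network = network
        self.optimizer = optimizer
        self.loss_fn = loss_fn
        self.number_of_epochs = number_of_epochs

    def train(self, training_dataloader, validation_dataloader):
        for epoch in range(self.number_of_epochs):
            self._train_epoch(training_dataloader)
            self._validate_epoch(validation_dataloader)

    def _train_epoch(self, dataloader):
        self.network.train()
        for inputs, targets in dataloader:
            self.optimizer.zero_grad()
            loss = self.loss_fn(self.network(inputs), targets)
            loss.backward()
            self.optimizer.step()

    def _validate_epoch(self, dataloader):
        self.network.eval()
        with torch.no_grad():  # no gradients needed for validation
            for inputs, targets in dataloader:
                self.loss_fn(self.network(inputs), targets)
```
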
Simplify dictionary initialization using comprehension

Use dictionary comprehension to simplify the initialization of __DATASETS and
__COLLATE_FUNCTIONS.

biotrainer/datasets/__init__.py [9-22]

-__DATASETS = {
-    Protocol.residue_to_class: ResidueEmbeddingsClassificationDataset,
-    Protocol.residues_to_class: ResidueEmbeddingsClassificationDataset,
-    Protocol.residues_to_value: ResidueEmbeddingsRegressionDataset,
-    Protocol.sequence_to_class: SequenceEmbeddingsClassificationDataset,
-    Protocol.sequence_to_value: SequenceEmbeddingsRegressionDataset,
-}
-__COLLATE_FUNCTIONS = {
-    Protocol.residue_to_class: pad_residue_embeddings,
-    Protocol.residues_to_class: pad_residues_embeddings,
-    Protocol.residues_to_value: pad_residues_embeddings,
-    Protocol.sequence_to_class: pad_sequence_embeddings,
-    Protocol.sequence_to_value: pad_sequence_embeddings,
-}
+__DATASETS = {protocol: dataset for protocol, dataset in zip(
+    [Protocol.residue_to_class, Protocol.residues_to_class, Protocol.residues_to_value,
+     Protocol.sequence_to_class, Protocol.sequence_to_value],
+    [ResidueEmbeddingsClassificationDataset, ResidueEmbeddingsClassificationDataset, ResidueEmbeddingsRegressionDataset,
+     SequenceEmbeddingsClassificationDataset, SequenceEmbeddingsRegressionDataset])}
+__COLLATE_FUNCTIONS = {protocol: function for protocol, function in zip(
+    [Protocol.residue_to_class, Protocol.residues_to_class, Protocol.residues_to_value,
+     Protocol.sequence_to_class, Protocol.sequence_to_value],
+    [pad_residue_embeddings, pad_residues_embeddings, pad_residues_embeddings,
+     pad_sequence_embeddings, pad_sequence_embeddings])}
 
Suggestion importance[1-10]: 6

Why: While dictionary comprehension can make the code more concise, it may also reduce readability for some developers. The improvement is more about code style and maintainability.


@SebieF merged commit cfbef9c into sacdallago:main Jun 16, 2024
1 check passed
@SebieF deleted the version-0-9-0 branch June 16, 2024 10:34