-
Notifications
You must be signed in to change notification settings - Fork 267
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Added class SubstitutionMatrix
#1913
Conversation
This reverts commit 69e15b2.
Thanks Qiyun. Any interest in adding in a pytest.fixture to iterate all of the substitution matrices to make sure that they won't break the alignments? If that is in, I'd be convinced that this pull request is good to go. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, @qiyunzhu!!! A few minor comments and suggestions, and a couple of notes for items that I think would be valuable to consider as a separate PR or two
from skbio.stats.distance import DissimilarityMatrix | ||
|
||
|
||
@experimental(as_of='0.5.10') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Separate from this PR, I wonder about the long term value of these tags. Does anything use them?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The documentation displays these notes. Back in days the team established a rigorous life cycle mechanism. See below a screenshot from this video. (I wasn't there so this is just my understanding.) I tend to this that this mechanism is beneficial. Other packages also have similar things.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What other packages are doing something like this? My concern is that while this did get setup, I don't recall seeing any PRs over the history of the project which actually changed things associated with these decorators nor am I aware of consumers of the library using this information (though I may be wrong there?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I echo your questions. Here are some examples in Pandas (and here) and scikit-learn. Instead of adding decorators, they achieve this in the documentation only. We can have a discussion on how to deal with the decorators.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Deprecation warnings, and noting versions added aren't a bad idea but qualititively different from these decorators
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with you on this.
skbio/sequence/_substitution.py
Outdated
|
||
# Defined according to the matrices hosted at the NCBI FTP server: | ||
# https://ftp.ncbi.nlm.nih.gov/blast/matrices/ | ||
named_substitution_matrices = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be all capital as it's global scope.
Separate from this PR, it would be pleasant to place this in a defined object to facilitate lazy load. As is right now, all of these objects will be instantiated on import of anything from this module
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like the idea of lazy loading. I am not familiar with the best approaches to achieve this. Some other packages simply store the matrices as external files. I wonder if there is a better approach? It may be the greatest if you can show an example.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Something like the following, and would preserve dict-like interaction?
class _LazySubstitutionMatrix:
def __init__(self, name, path):
self.name = name
self.path = path
self._data = None
@property
def data(self):
if self._data is None:
self._data = _some_mechanism_to_load_data(self.path)
return self._data
class _SubstitutionMatrices:
_matrices = (_LazySubstitutionMatrix('NUC.4.4', 'file://foo/bar'),
_LazySubstitutionMatrix('PAM30', http://foo/bar'),
...)
def __init__(self):
self._lazy_data = {m.name: m for m in self._matrices}
def __get__(self, k):
return self._lazy_data[k].data
def keys(self):
return sorted(self._lazy_data)
def values(self):
return [self[k] for k in self.keys()]
def items(self):
return [(k, self[k]) for k in self.keys()]
named_substitution_matrices = _SubstitutionMatrices()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Relevant to the other comment. Will put this on hold pending the discussion on external data.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PS @wasade : If we directly type data in _LazySubstitutionMatrix
(instead of external files), will it still suffice lazy loading?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You could defer the calls to _vector_to_matrix
for example but bigger bang may be until these can be pulled from an authoritative source
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I envision that skbio should have a local data repository that is distributed with the package, plus a remote server that stores larger datasets. Analogs are scikit-learn's small, local datasets and pyTorch's big, remote datasets.
For substitution matrices specifically, the NCBI repo has many common matrices, but not all. These are only for alignments. Meanwhile, matrices for tree building are not there. Examples are JTT, WAG, LG, etc. In comparison, BioPython has some matrices that are not found in NCBI. For skbio, I think that having a local repo of most common matrices (like the ones I already typed) and a remote repo of extended ones may be the way to go.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good! Could an issue be opened to capture this? It could refer back to the discussion here
Co-authored-by: Daniel McDonald <d3mcdonald@eng.ucsd.edu>
Co-authored-by: Daniel McDonald <d3mcdonald@eng.ucsd.edu>
Co-authored-by: Daniel McDonald <d3mcdonald@eng.ucsd.edu>
@mortonjt Thanks for your feedback! I added unit tests with nucleotide and protein matrices in the last commit. I didn't do it for all matrices though, as the alignment functions will be removed and replaced, and it will make more sense if we test on the new functions. |
@wasade Thanks again for such careful review! I think I have addressed all the comments you made. Would you like to take a look? |
@wasade Thanks and happy weekend! |
This PR provides an explicit substitution matrix class for sequence alignment. It is an follow-up of the previous issue #161 , but uses a new design. It is a subclass of
DissimilarityMatrix
. I found thatDissimilarityMatrix
is a perfect existing structure for this purpose. Specifically, it provides a 2D NumPy array and a character-to-index dictionary that are essential for efficient sequence alignment (which necessitates converting sequences to vectors of indices followed by vectorized calculation). In contrast, the previous design -- a nested dictionary, is not suitable. Meanwhile, the new classSubstitutionMatrix
provides a few extra methods and attributes overDissimilarityMatrix
to make it more explict for the purpose.Several most common matrices are bundled in the code, including NUC.4.4, PAM series and BLOSUM series. This choice follows the options in the NCBI blastn and blastp web servers. The matrix data were derived from the NCBI FTP server (https://ftp.ncbi.nlm.nih.gov/blast/matrices/). To my understanding, these data files are in the public domain (see here and here).
The user can read additional matrices retrieved from the URL above, or any compatible ones (including ones bundled in other popular programs, such as BioPython).
The reason it is under
sequence
instead ofalignment
is because substitution matrices are also useful for phylogenetic reconstruction.The method
identity
replacesmake_identity_substitution_matrix
(deprecated) for creating identity matrices.To-do 1: Add method
from_msa(TabularMSA, [TreeNode])
, which creates a substitution matrix by calculating the substitution frequencies observed in a multiple sequence alignment, given a phylogenetic tree.To-do 2: Bundle more matrices, including new ones like VTML, and ones for tree-building, such as JTT, WAG and LG.
Please complete the following checklist:
I have read the guidelines in CONTRIBUTING.md.
I have documented all public-facing changes in CHANGELOG.md.
This pull request includes code, documentation, or other content derived from external source(s). If this is the case, ensure the external source's license is compatible with scikit-bio's license. Include the license in the
licenses
directory and add a comment in the code giving proper attribution. Ensure any other requirements set forth by the license and/or author are satisfied. It is your responsibility to disclose code, documentation, or other content derived from external source(s). If you have questions about whether something can be included in the project or how to give proper attribution, include those questions in your pull request and a reviewer will assist you.This pull request does not include code, documentation, or other content derived from external source(s).
Note: REVIEWING.md may also be helpful to see some of the things code reviewers will be verifying when reviewing your pull request.