Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added class SubstitutionMatrix #1913

Merged
merged 18 commits into from
Jan 13, 2024
Merged

Added class SubstitutionMatrix #1913

merged 18 commits into from
Jan 13, 2024

Conversation

qiyunzhu
Copy link
Contributor

@qiyunzhu qiyunzhu commented Dec 23, 2023

This PR provides an explicit substitution matrix class for sequence alignment. It is an follow-up of the previous issue #161 , but uses a new design. It is a subclass of DissimilarityMatrix. I found that DissimilarityMatrix is a perfect existing structure for this purpose. Specifically, it provides a 2D NumPy array and a character-to-index dictionary that are essential for efficient sequence alignment (which necessitates converting sequences to vectors of indices followed by vectorized calculation). In contrast, the previous design -- a nested dictionary, is not suitable. Meanwhile, the new class SubstitutionMatrix provides a few extra methods and attributes over DissimilarityMatrix to make it more explict for the purpose.

Several most common matrices are bundled in the code, including NUC.4.4, PAM series and BLOSUM series. This choice follows the options in the NCBI blastn and blastp web servers. The matrix data were derived from the NCBI FTP server (https://ftp.ncbi.nlm.nih.gov/blast/matrices/). To my understanding, these data files are in the public domain (see here and here).

The user can read additional matrices retrieved from the URL above, or any compatible ones (including ones bundled in other popular programs, such as BioPython).

The reason it is under sequence instead of alignment is because substitution matrices are also useful for phylogenetic reconstruction.

The method identity replaces make_identity_substitution_matrix (deprecated) for creating identity matrices.

To-do 1: Add method from_msa(TabularMSA, [TreeNode]), which creates a substitution matrix by calculating the substitution frequencies observed in a multiple sequence alignment, given a phylogenetic tree.

To-do 2: Bundle more matrices, including new ones like VTML, and ones for tree-building, such as JTT, WAG and LG.


Please complete the following checklist:

  • I have read the guidelines in CONTRIBUTING.md.

  • I have documented all public-facing changes in CHANGELOG.md.

  • This pull request includes code, documentation, or other content derived from external source(s). If this is the case, ensure the external source's license is compatible with scikit-bio's license. Include the license in the licenses directory and add a comment in the code giving proper attribution. Ensure any other requirements set forth by the license and/or author are satisfied. It is your responsibility to disclose code, documentation, or other content derived from external source(s). If you have questions about whether something can be included in the project or how to give proper attribution, include those questions in your pull request and a reviewer will assist you.

  • This pull request does not include code, documentation, or other content derived from external source(s).

Note: REVIEWING.md may also be helpful to see some of the things code reviewers will be verifying when reviewing your pull request.

@qiyunzhu
Copy link
Contributor Author

qiyunzhu commented Jan 3, 2024

Hi @mortonjt @wasade Any of you interested in taking a look? Thanks!

@mortonjt
Copy link
Collaborator

mortonjt commented Jan 4, 2024

Thanks Qiyun.

Any interest in adding in a pytest.fixture to iterate all of the substitution matrices to make sure that they won't break the alignments? If that is in, I'd be convinced that this pull request is good to go.

Copy link
Collaborator

@wasade wasade left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, @qiyunzhu!!! A few minor comments and suggestions, and a couple of notes for items that I think would be valuable to consider as a separate PR or two

skbio/alignment/_pairwise.py Show resolved Hide resolved
skbio/alignment/_pairwise.py Show resolved Hide resolved
skbio/alignment/_pairwise.py Show resolved Hide resolved
skbio/alignment/tests/test_ssw.py Show resolved Hide resolved
from skbio.stats.distance import DissimilarityMatrix


@experimental(as_of='0.5.10')
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Separate from this PR, I wonder about the long term value of these tags. Does anything use them?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The documentation displays these notes. Back in days the team established a rigorous life cycle mechanism. See below a screenshot from this video. (I wasn't there so this is just my understanding.) I tend to this that this mechanism is beneficial. Other packages also have similar things.

image

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What other packages are doing something like this? My concern is that while this did get setup, I don't recall seeing any PRs over the history of the project which actually changed things associated with these decorators nor am I aware of consumers of the library using this information (though I may be wrong there?)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I echo your questions. Here are some examples in Pandas (and here) and scikit-learn. Instead of adding decorators, they achieve this in the documentation only. We can have a discussion on how to deal with the decorators.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Deprecation warnings, and noting versions added aren't a bad idea but qualititively different from these decorators

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with you on this.

skbio/sequence/_substitution.py Outdated Show resolved Hide resolved
skbio/sequence/_substitution.py Outdated Show resolved Hide resolved

# Defined according to the matrices hosted at the NCBI FTP server:
# https://ftp.ncbi.nlm.nih.gov/blast/matrices/
named_substitution_matrices = {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be all capital as it's global scope.

Separate from this PR, it would be pleasant to place this in a defined object to facilitate lazy load. As is right now, all of these objects will be instantiated on import of anything from this module

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the idea of lazy loading. I am not familiar with the best approaches to achieve this. Some other packages simply store the matrices as external files. I wonder if there is a better approach? It may be the greatest if you can show an example.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Something like the following, and would preserve dict-like interaction?

class _LazySubstitutionMatrix:
    def __init__(self, name, path):
        self.name = name
        self.path = path
        self._data = None
    
    @property
    def data(self):
        if self._data is None:
            self._data = _some_mechanism_to_load_data(self.path)
        return self._data

class _SubstitutionMatrices:
    _matrices = (_LazySubstitutionMatrix('NUC.4.4', 'file://foo/bar'),
                 _LazySubstitutionMatrix('PAM30', http://foo/bar'), 
                 ...)
    def __init__(self):
        self._lazy_data = {m.name: m for m in self._matrices}
     
    def __get__(self, k):
        return self._lazy_data[k].data

    def keys(self):
        return sorted(self._lazy_data)

    def values(self):
        return [self[k] for k in self.keys()]

    def items(self):
        return [(k, self[k]) for k in self.keys()]

named_substitution_matrices = _SubstitutionMatrices()

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Relevant to the other comment. Will put this on hold pending the discussion on external data.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PS @wasade : If we directly type data in _LazySubstitutionMatrix (instead of external files), will it still suffice lazy loading?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could defer the calls to _vector_to_matrix for example but bigger bang may be until these can be pulled from an authoritative source

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I envision that skbio should have a local data repository that is distributed with the package, plus a remote server that stores larger datasets. Analogs are scikit-learn's small, local datasets and pyTorch's big, remote datasets.

For substitution matrices specifically, the NCBI repo has many common matrices, but not all. These are only for alignments. Meanwhile, matrices for tree building are not there. Examples are JTT, WAG, LG, etc. In comparison, BioPython has some matrices that are not found in NCBI. For skbio, I think that having a local repo of most common matrices (like the ones I already typed) and a remote repo of extended ones may be the way to go.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good! Could an issue be opened to capture this? It could refer back to the discussion here

skbio/sequence/_substitution.py Show resolved Hide resolved
skbio/sequence/_substitution.py Show resolved Hide resolved
@qiyunzhu
Copy link
Contributor Author

@wasade (and @mortonjt ) thanks for reviewing my code! I am sorry for the late response and I had to deal with quite a few things in the beginning of the semester. Now I am back to skbio development.

qiyunzhu and others added 4 commits January 12, 2024 12:24
Co-authored-by: Daniel McDonald <d3mcdonald@eng.ucsd.edu>
Co-authored-by: Daniel McDonald <d3mcdonald@eng.ucsd.edu>
Co-authored-by: Daniel McDonald <d3mcdonald@eng.ucsd.edu>
@qiyunzhu
Copy link
Contributor Author

@mortonjt Thanks for your feedback! I added unit tests with nucleotide and protein matrices in the last commit. I didn't do it for all matrices though, as the alignment functions will be removed and replaced, and it will make more sense if we test on the new functions.

@qiyunzhu
Copy link
Contributor Author

@wasade Thanks again for such careful review! I think I have addressed all the comments you made. Would you like to take a look?

@qiyunzhu qiyunzhu merged commit 91e1643 into scikit-bio:master Jan 13, 2024
22 checks passed
@qiyunzhu qiyunzhu deleted the submat branch January 13, 2024 00:20
@qiyunzhu
Copy link
Contributor Author

@wasade Thanks and happy weekend!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants