Added class `SubstitutionMatrix` #1913

qiyunzhu · 2023-12-23T23:06:35Z

This PR provides an explicit substitution matrix class for sequence alignment. It is an follow-up of the previous issue #161 , but uses a new design. It is a subclass of DissimilarityMatrix. I found that DissimilarityMatrix is a perfect existing structure for this purpose. Specifically, it provides a 2D NumPy array and a character-to-index dictionary that are essential for efficient sequence alignment (which necessitates converting sequences to vectors of indices followed by vectorized calculation). In contrast, the previous design -- a nested dictionary, is not suitable. Meanwhile, the new class SubstitutionMatrix provides a few extra methods and attributes over DissimilarityMatrix to make it more explict for the purpose.

Several most common matrices are bundled in the code, including NUC.4.4, PAM series and BLOSUM series. This choice follows the options in the NCBI blastn and blastp web servers. The matrix data were derived from the NCBI FTP server (https://ftp.ncbi.nlm.nih.gov/blast/matrices/). To my understanding, these data files are in the public domain (see here and here).

The user can read additional matrices retrieved from the URL above, or any compatible ones (including ones bundled in other popular programs, such as BioPython).

The reason it is under sequence instead of alignment is because substitution matrices are also useful for phylogenetic reconstruction.

The method identity replaces make_identity_substitution_matrix (deprecated) for creating identity matrices.

To-do 1: Add method from_msa(TabularMSA, [TreeNode]), which creates a substitution matrix by calculating the substitution frequencies observed in a multiple sequence alignment, given a phylogenetic tree.

To-do 2: Bundle more matrices, including new ones like VTML, and ones for tree-building, such as JTT, WAG and LG.

Please complete the following checklist:

I have read the guidelines in CONTRIBUTING.md.
I have documented all public-facing changes in CHANGELOG.md.
This pull request includes code, documentation, or other content derived from external source(s). If this is the case, ensure the external source's license is compatible with scikit-bio's license. Include the license in the licenses directory and add a comment in the code giving proper attribution. Ensure any other requirements set forth by the license and/or author are satisfied. It is your responsibility to disclose code, documentation, or other content derived from external source(s). If you have questions about whether something can be included in the project or how to give proper attribution, include those questions in your pull request and a reviewer will assist you.
This pull request does not include code, documentation, or other content derived from external source(s).

Note: REVIEWING.md may also be helpful to see some of the things code reviewers will be verifying when reviewing your pull request.

This reverts commit 69e15b2.

…ubmat

qiyunzhu · 2024-01-03T21:01:11Z

Hi @mortonjt @wasade Any of you interested in taking a look? Thanks!

mortonjt · 2024-01-04T10:49:24Z

Thanks Qiyun.

Any interest in adding in a pytest.fixture to iterate all of the substitution matrices to make sure that they won't break the alignments? If that is in, I'd be convinced that this pull request is good to go.

wasade

Thanks, @qiyunzhu!!! A few minor comments and suggestions, and a couple of notes for items that I think would be valuable to consider as a separate PR or two

skbio/alignment/_pairwise.py

skbio/alignment/tests/test_ssw.py

wasade · 2024-01-04T16:20:48Z

skbio/sequence/_substitution.py

+from skbio.stats.distance import DissimilarityMatrix
+
+
+@experimental(as_of='0.5.10')


Separate from this PR, I wonder about the long term value of these tags. Does anything use them?

The documentation displays these notes. Back in days the team established a rigorous life cycle mechanism. See below a screenshot from this video. (I wasn't there so this is just my understanding.) I tend to this that this mechanism is beneficial. Other packages also have similar things.

What other packages are doing something like this? My concern is that while this did get setup, I don't recall seeing any PRs over the history of the project which actually changed things associated with these decorators nor am I aware of consumers of the library using this information (though I may be wrong there?)

I echo your questions. Here are some examples in Pandas (and here) and scikit-learn. Instead of adding decorators, they achieve this in the documentation only. We can have a discussion on how to deal with the decorators.

Deprecation warnings, and noting versions added aren't a bad idea but qualititively different from these decorators

I agree with you on this.

skbio/sequence/_substitution.py

wasade · 2024-01-04T16:54:00Z

skbio/sequence/_substitution.py

+
+# Defined according to the matrices hosted at the NCBI FTP server:
+# https://ftp.ncbi.nlm.nih.gov/blast/matrices/
+named_substitution_matrices = {


This should be all capital as it's global scope.

Separate from this PR, it would be pleasant to place this in a defined object to facilitate lazy load. As is right now, all of these objects will be instantiated on import of anything from this module

I like the idea of lazy loading. I am not familiar with the best approaches to achieve this. Some other packages simply store the matrices as external files. I wonder if there is a better approach? It may be the greatest if you can show an example.

Something like the following, and would preserve dict-like interaction?

class _LazySubstitutionMatrix: def __init__(self, name, path): self.name = name self.path = path self._data = None @property def data(self): if self._data is None: self._data = _some_mechanism_to_load_data(self.path) return self._data class _SubstitutionMatrices: _matrices = (_LazySubstitutionMatrix('NUC.4.4', 'file://foo/bar'), _LazySubstitutionMatrix('PAM30', http://foo/bar'), ...) def __init__(self): self._lazy_data = {m.name: m for m in self._matrices} def __get__(self, k): return self._lazy_data[k].data def keys(self): return sorted(self._lazy_data) def values(self): return [self[k] for k in self.keys()] def items(self): return [(k, self[k]) for k in self.keys()] named_substitution_matrices = _SubstitutionMatrices()

Relevant to the other comment. Will put this on hold pending the discussion on external data.

PS @wasade : If we directly type data in _LazySubstitutionMatrix (instead of external files), will it still suffice lazy loading?

You could defer the calls to _vector_to_matrix for example but bigger bang may be until these can be pulled from an authoritative source

I envision that skbio should have a local data repository that is distributed with the package, plus a remote server that stores larger datasets. Analogs are scikit-learn's small, local datasets and pyTorch's big, remote datasets.

For substitution matrices specifically, the NCBI repo has many common matrices, but not all. These are only for alignments. Meanwhile, matrices for tree building are not there. Examples are JTT, WAG, LG, etc. In comparison, BioPython has some matrices that are not found in NCBI. For skbio, I think that having a local repo of most common matrices (like the ones I already typed) and a remote repo of extended ones may be the way to go.

Sounds good! Could an issue be opened to capture this? It could refer back to the discussion here

skbio/sequence/_substitution.py

qiyunzhu · 2024-01-12T19:14:43Z

@wasade (and @mortonjt ) thanks for reviewing my code! I am sorry for the late response and I had to deal with quite a few things in the beginning of the semester. Now I am back to skbio development.

Co-authored-by: Daniel McDonald <d3mcdonald@eng.ucsd.edu>

qiyunzhu · 2024-01-12T23:28:21Z

@mortonjt Thanks for your feedback! I added unit tests with nucleotide and protein matrices in the last commit. I didn't do it for all matrices though, as the alignment functions will be removed and replaced, and it will make more sense if we test on the new functions.

qiyunzhu · 2024-01-13T00:06:39Z

@wasade Thanks again for such careful review! I think I have addressed all the comments you made. Would you like to take a look?

qiyunzhu · 2024-01-13T00:21:00Z

@wasade Thanks and happy weekend!

qiyunzhu added 13 commits December 21, 2023 10:11

updated URL and doc build

69e15b2

Revert "updated URL and doc build"

631ed95

This reverts commit 69e15b2.

Merge branch 'master' of https://github.com/biocore/scikit-bio

34bfef2

creating substitution matrix

d33867f

Merge branch 'master' of https://github.com/biocore/scikit-bio into s…

2d765e7

…ubmat

finished basic substitution matrix

dfcc69e

added script files

42363de

plugged new class into pairwise

fa45504

added documentation

0c083bc

fixed unit tests

a676839

edited changelog

0e89b75

improving documentation

155ddf4

fixing linting

78e34a9

qiyunzhu requested review from wasade and mortonjt December 24, 2023 03:02

wasade requested changes Jan 4, 2024

View reviewed changes

qiyunzhu and others added 4 commits January 12, 2024 12:24

Update skbio/sequence/_substitution.py

5b29be5

Co-authored-by: Daniel McDonald <d3mcdonald@eng.ucsd.edu>

Update skbio/sequence/_substitution.py

0c502f0

Co-authored-by: Daniel McDonald <d3mcdonald@eng.ucsd.edu>

Apply suggestions from code review

9288c69

Co-authored-by: Daniel McDonald <d3mcdonald@eng.ucsd.edu>

addressing comments

2eb653f

made named matrices internal

ca1165b

wasade approved these changes Jan 13, 2024

View reviewed changes

qiyunzhu mentioned this pull request Jan 13, 2024

Refactoring the lifecycle decorators #1916

Closed

qiyunzhu merged commit 91e1643 into scikit-bio:master Jan 13, 2024
22 checks passed

qiyunzhu deleted the submat branch January 13, 2024 00:20

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Added class `SubstitutionMatrix` #1913

Added class `SubstitutionMatrix` #1913

qiyunzhu commented Dec 23, 2023 •

edited

qiyunzhu commented Jan 3, 2024

mortonjt commented Jan 4, 2024

wasade left a comment

wasade Jan 4, 2024

qiyunzhu Jan 12, 2024

wasade Jan 12, 2024

qiyunzhu Jan 12, 2024

wasade Jan 12, 2024

qiyunzhu Jan 12, 2024

wasade Jan 4, 2024

qiyunzhu Jan 12, 2024

wasade Jan 12, 2024

qiyunzhu Jan 12, 2024

qiyunzhu Jan 12, 2024

wasade Jan 12, 2024

qiyunzhu Jan 13, 2024

wasade Jan 13, 2024

qiyunzhu commented Jan 12, 2024

qiyunzhu commented Jan 12, 2024

qiyunzhu commented Jan 13, 2024

qiyunzhu commented Jan 13, 2024

		from skbio.stats.distance import DissimilarityMatrix


		@experimental(as_of='0.5.10')

Added class SubstitutionMatrix #1913

Added class SubstitutionMatrix #1913

Conversation

qiyunzhu commented Dec 23, 2023 • edited

qiyunzhu commented Jan 3, 2024

mortonjt commented Jan 4, 2024

wasade left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

qiyunzhu commented Jan 12, 2024

qiyunzhu commented Jan 12, 2024

qiyunzhu commented Jan 13, 2024

qiyunzhu commented Jan 13, 2024

Added class `SubstitutionMatrix` #1913

Added class `SubstitutionMatrix` #1913

qiyunzhu commented Dec 23, 2023 •

edited