A Python tool for selecting representative protein sequences from large datasets. It combines CD-HIT clustering with RepSet, a representative set selection algorithm, to maintain sequence diversity while reducing redundancy.
- Reduce protein sequence redundancy using CD-HIT
- Select representative sequences using submodular optimization
- Maintain sequence diversity while minimizing dataset size
- Easy-to-use command line interface
- Flexible Python API for integration into bioinformatics pipelines
conda env create -f environment.yml
conda activate seqpicker
(seqpicker) poetry build
(seqpicker) pip install dist/seqpicker-0.1.0-py3-none-any.whl
# Basic usage
seqpick input.fasta -o output.fasta --maxsize 1000
# Use only CD-HIT (faster but less sophisticated)
seqpick input.fasta --cdhit-only --similarity 0.9
# Use only RepSet selection (slower but more accurate)
seqpick input.fasta --repset-only --maxsize 500
# Fine-tune the selection process
seqpick input.fasta \
    --maxsize 1000 \
    --mixture-weight 0.7 \
    --cdhit-args "-c 0.9 -n 5"
from seqpicker import reduce_database_redundancy
# Basic usage
reduce_database_redundancy(
    input_fasta="input.fasta",
    output_fasta="output.fasta",
    maxsize=1000,
)
# Advanced usage with more control
reduce_database_redundancy(
    input_fasta="input.fasta",
    output_fasta="output.fasta",
    cdhit=True,
    maxsize=1000,
    cdhit_args="-c 0.9 -n 5",
    mixture_weight=0.7,
)
seqpicker uses a two-step approach to select representative sequences:
- Initial Redundancy Reduction (optional)
  - Uses CD-HIT to quickly remove highly similar sequences
  - Configurable similarity threshold and parameters
- Representative Selection
  - Implements RepSet, a submodular optimization algorithm to select representative sequences
  - Balances sequence diversity and coverage
  - Uses sequence similarity and redundancy metrics
  - Configurable mixture weight between objectives
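
To make the representative selection step more concrete, here is a toy sketch of greedy submodular selection over a precomputed similarity matrix. It is not seqpicker's or RepSet's actual implementation: the function names, the facility-location coverage term, the pairwise redundancy penalty, and the direction in which `mixture_weight` blends the two are all assumptions made for illustration.

```python
from typing import List, Sequence

Matrix = Sequence[Sequence[float]]


def coverage(sim: Matrix, selected: List[int]) -> float:
    """Facility-location term: each sequence is credited with its
    highest similarity to any selected representative."""
    if not selected:
        return 0.0
    return sum(max(row[j] for j in selected) for row in sim)


def redundancy_penalty(sim: Matrix, selected: List[int]) -> float:
    """Negative sum of pairwise similarities among the selected sequences."""
    return -sum(
        sim[a][b]
        for i, a in enumerate(selected)
        for b in selected[i + 1:]
    )


def greedy_select(sim: Matrix, maxsize: int, mixture_weight: float = 0.5) -> List[int]:
    """Greedily pick up to `maxsize` indices, maximizing a weighted mix of
    coverage and (negative) redundancy. Hypothetical helper, for illustration."""

    def objective(subset: List[int]) -> float:
        return (
            mixture_weight * coverage(sim, subset)
            + (1.0 - mixture_weight) * redundancy_penalty(sim, subset)
        )

    selected: List[int] = []
    n = len(sim)
    while len(selected) < min(maxsize, n):
        current = objective(selected)
        best_gain, best_idx = None, None
        for candidate in range(n):
            if candidate in selected:
                continue
            # Marginal gain of adding this candidate to the current set.
            gain = objective(selected + [candidate]) - current
            if best_gain is None or gain > best_gain:
                best_gain, best_idx = gain, candidate
        selected.append(best_idx)
    return selected


# Made-up similarities: sequences 0 and 1 form one cluster, 2 and 3 another.
sim = [
    [1.00, 0.95, 0.10, 0.10],
    [0.95, 1.00, 0.10, 0.10],
    [0.10, 0.10, 1.00, 0.90],
    [0.10, 0.10, 0.90, 1.00],
]
representatives = greedy_select(sim, maxsize=2, mixture_weight=0.7)
print(representatives)  # one index from each cluster
```

In this sketch a higher `mixture_weight` favors coverage of the whole dataset, while a lower value penalizes picking near-duplicates more strongly. A real pipeline would compute similarities from sequence alignments or k-mer profiles rather than a hand-written matrix.
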
Contributions are welcome! Please feel free to submit a Pull Request.
This project is licensed under the MIT License - see the LICENSE file for details.
If you use seqpicker in your research, please cite:
@software{seqpicker2024,
  author = {Semidán Robaina Estévez},
  title = {seqpicker: A tool for selecting representative protein sequences},
  year = {2024},
  publisher = {GitHub},
  url = {https://github.com/Robaina/seqpicker}
}