forked from zhanglabNKU/eCAMI

Simultaneous Classification and Motif Identification(eCAMI)

eCAMI: Simultaneous Classification and Motif Identification for enzyme/CAZyme annotation

Running Environment

  • Linux environment, Python 3
  • The following Python packages need to be available: psutil, scipy, numpy (collections is also used, but it is part of the Python standard library)
  • The two Python scripts (clustering.py and prediction.py) need to be in the same folder
  • The FASTA file must use headers in this format:
>unique_sequence_ID followed by a “|”, e.g., >AIJ19564.1|GH5_4|3.2.1.4
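The header rule above can be checked before running either script. The sketch below is only an illustration: the function name check_fasta_headers is hypothetical and not part of eCAMI, and it assumes that "unique" means no sequence ID may appear twice in the file.

```python
def check_fasta_headers(lines):
    """Return (line_number, header) pairs for headers that break the assumed format."""
    problems = []
    seen_ids = set()
    for number, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        if not line.startswith(">"):
            continue  # sequence line, not a header
        # Split the header into the ID and whatever follows the first "|".
        seq_id, separator, _rest = line[1:].partition("|")
        # Assumed rule: non-empty ID, a "|" after it, and no repeated IDs.
        if not seq_id.strip() or not separator or seq_id in seen_ids:
            problems.append((number, line))
        seen_ids.add(seq_id)
    return problems
```

To use it on a file, pass the open file object, for example check_fasta_headers(open("test.faa")); an empty list means every header passed the check.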

Citation

eCAMI: simultaneous classification and motif identification for enzyme annotation. Xu J, Zhang H, Zheng J, Dovoedo P, Yin Y. Bioinformatics. 2019 Dec 3. pii: btz908. doi: 10.1093/bioinformatics/btz908. [Epub ahead of print] PMID: 31794006

Install

git clone https://github.com/yinlabniu/eCAMI.git
cd eCAMI/

* Install the required Python packages (argparse is already part of the Python 3 standard library, so that line is optional)
pip install scipy
pip install argparse
pip install psutil
pip install numpy
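
After installation, a quick way to confirm that the third-party packages are importable is sketched below. The helper name missing_packages is hypothetical; it only uses importlib.util.find_spec from the standard library and does not check package versions.

```python
import importlib.util


def missing_packages(names=("psutil", "scipy", "numpy")):
    """Return the top-level package names that cannot be found by the import system."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All required packages are available.")
```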

Prediction:

Given new protein sequences, search them against a pre-computed library of k-mer peptides, which are associated with known CAZyme families or EC numbers, to assign CAZyme families or EC numbers.

  • First, two pre-computed k-mer peptide libraries come with this package (the CAZyme and EC folders) and must exist in your folder
  • The prediction.py can be run as follows:
python prediction.py -input <file_name> -kmer_db <kmer_dir_path> -output <prediction_output_name>

For example:

python prediction.py -input examples/prediction/input/test.faa -kmer_db CAZyme -output examples/prediction/output/new-output.txt

The test.faa in the examples/prediction/input folder will be processed by comparing to the pre-computed k_mer peptides (the k_mer library) in the CAZyme folder. You will get the prediction results named new-output.txt in the examples/prediction/output folder. You can compare it with the test_pred_cluster_labels.txt that is already there.
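
To make the idea of comparing a query to a k_mer library concrete, here is a minimal sketch of sliding-window k_mer extraction and lookup. It is not the code used by prediction.py: the function names are hypothetical, the library is reduced to a plain set of peptides, positions are reported as 1-based by assumption, and the frequency-based scoring controlled by -beta is not reproduced.

```python
def extract_kmers(sequence, k=8):
    """Map each k_mer in the sequence to the start position of its first occurrence.

    Positions are 1-based here; that convention is an assumption.
    """
    kmers = {}
    for start in range(len(sequence) - k + 1):
        kmers.setdefault(sequence[start:start + k], start + 1)
    return kmers


def shared_kmers(sequence, library_kmers, k=8):
    """Return the query k_mers (with positions) that also occur in the library set."""
    found = extract_kmers(sequence, k)
    return {kmer: pos for kmer, pos in found.items() if kmer in library_kmers}


def passes_minimum(sequence, library_kmers, k=8, important_k_mer_number=3):
    """Check whether the query shares at least the given number of k_mers with the library."""
    return len(shared_kmers(sequence, library_kmers, k)) >= important_k_mer_number
```

The k value must match the k_mer length of the library, which is why the sketch takes k as a parameter instead of inferring it.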

  • In addition, other options for prediction are also available:

    • -k_mer: the length of the k_mer peptide to be extracted from the query for comparison against the k_mer peptide library; the length must be the same as the length of the k_mers in the k_mer peptide library (default=8)
    • -jobs: the number of CPU processors to use (default=8)
    • -beta: the minimum sum of k_mer frequency for assigning the query to an existing protein family of the k_mer peptide library (default=1)
    • -important_k_mer_number: the minimum number of the same k_mers shared between the query and a family in the k_mer peptide library (default=3)
  • In most cases, only the -k_mer option needs to be changed

  • To print all options, you can type:

python prediction.py -help
  • Output files: please see the readme.txt file in the examples/prediction folder for an explanation of the output files and format.

  • Briefly, the output file lists the assignment of each query to an existing protein family of the k_mer peptide library, in this format:

protein_name    fam_name:group_number    subfam_name_of_the_group:subfam_name_count
matching kmers(start position in the query sequence)
For example:
>AIJ19564.1|GH5_4|3.2.1.4	GH5:102	GH5_4:15|3.2.1.4:3
SVRIPVTW(82),RIPVTWMG(84),PVTWMGHI(86),VTWMGHIG(87),IINIHHDG(124),VLNEWNQV(210),NRLMVAVH(259),
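
The two-line record above can be read into a dictionary with a few lines of Python. This is a sketch under assumptions: parse_record is a hypothetical helper, the three fields of the first line are assumed to be tab-separated, and the readme.txt in examples/prediction remains the authoritative description of the format.

```python
import re

# A k_mer followed by its start position in parentheses, e.g. SVRIPVTW(82).
KMER_PATTERN = re.compile(r"([A-Za-z]+)\((\d+)\)")


def parse_record(header_line, kmer_line):
    """Parse one prediction record (assignment line plus matching k_mer line)."""
    name, family_field, subfam_field = header_line.rstrip("\n").split("\t")
    family, group = family_field.rsplit(":", 1)
    subfamilies = {}
    for item in subfam_field.split("|"):
        label, count = item.rsplit(":", 1)
        subfamilies[label] = int(count)
    kmers = [(kmer, int(pos)) for kmer, pos in KMER_PATTERN.findall(kmer_line)]
    return {
        "protein": name.lstrip(">"),
        "family": family,
        "group": int(group),
        "subfamilies": subfamilies,
        "kmers": kmers,
    }


# The example record shown above, with tabs written explicitly.
example_header = ">AIJ19564.1|GH5_4|3.2.1.4\tGH5:102\tGH5_4:15|3.2.1.4:3"
example_kmers = "SVRIPVTW(82),RIPVTWMG(84),PVTWMGHI(86),VTWMGHIG(87),IINIHHDG(124),VLNEWNQV(210),NRLMVAVH(259),"
record = parse_record(example_header, example_kmers)
```

Splitting on the last ":" keeps labels such as 3.2.1.4 intact, and splitting the first line on tabs avoids breaking the protein name, which itself contains "|".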

Clustering:

Given a FASTA file of a protein family (e.g., a CAZyme family, an EC class at the 3rd level, or any other protein family), classify the sequences into clusters and extract the distinguishing k-mer peptides of each cluster (i.e., build your own library of k-mer peptides for future use).

  • The clustering.py can be run as follows:
python clustering.py -input <file_name> -output_dir <output_dir_name>

For example:

python clustering.py -input examples/clustering/input/GH5.faa -output_dir examples/clustering/output/new-GH5
python clustering.py -input examples/clustering/input/3.2.1.-.faa -output_dir examples/clustering/output/new-EC3.2.1.-

The GH5.faa in the examples/clustering/input folder will be processed. You will get the clustering results (FASTA sequences and distinguishing k-mer peptides of each cluster) in the examples/clustering/output/new-GH5 folder.
The 3.2.1.-.faa in the examples/clustering/input folder will be processed. You will get the clustering results (FASTA sequences and distinguishing k-mer peptides of each cluster) in the examples/clustering/output/new-EC3.2.1.- folder.
You can compare them with the GH5 and the EC3.2.1.- folders that are already there.

  • In addition, other options for clustering are also available:
    • -k_mer: the length of the k-mer peptide for clustering (default=8)
    • -minimum_group_size: the minimum number of proteins in a final cluster (default=5)
    • -piece_number: the number of pieces into which the sorted peptides are split (default=8)
    • -jobs: the number of CPU processors to use (default=1)
    • -important_k_mer_number: the minimum number of k_mer for making groups/clusters in the first step (default=10)
    • -alpha: the minimum sum of k_mer frequency for combining groups (default=7)
    • -beta: the minimum sum of k_mer frequency for adding singletons into existing groups (default=1)
    • -del_percent: the minimum ratio of a k_mer's frequency in a cluster to its frequency in the whole protein set for the k_mer to be considered for clustering (default=0.2)
    • -selected_k_mer_number_cut: the minimum number of proteins in which a k_mer must be present; k_mers present in fewer proteins are removed before clustering (default=2)

  • In most cases, only the -k_mer, -minimum_group_size and -jobs options need to be changed
  • To print all options, you can type:
python clustering.py -help
  • Output files: please see the readme.txt file in the examples/clustering folder for explanation of the output files and format.
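
When clustering several families in a row, the command line can be assembled in Python. The sketch below is an assumption-laden convenience, not part of eCAMI: clustering_command is a hypothetical helper that only builds the argument list from the options documented above, and it leaves the actual execution to the caller.

```python
import sys


def clustering_command(input_fasta, output_dir, k_mer=8, minimum_group_size=5, jobs=1):
    """Build the argument list for clustering.py using the options most often changed."""
    return [
        sys.executable, "clustering.py",
        "-input", input_fasta,
        "-output_dir", output_dir,
        "-k_mer", str(k_mer),
        "-minimum_group_size", str(minimum_group_size),
        "-jobs", str(jobs),
    ]


cmd = clustering_command(
    "examples/clustering/input/GH5.faa",
    "examples/clustering/output/new-GH5",
    jobs=4,
)
# To run it from the eCAMI folder: import subprocess; subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems with file names such as 3.2.1.-.faa.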

