zhangrengang/REPcluster

REPcluster aims to cluster repeat sequences that have similar content but distinct structures, such as (TTTAGGG)m vs (TTTAGGG)n (tandem repeats) or A-B-C vs A-C (some TE sequences).

Quick install and start

git clone https://github.com/zhangrengang/REPcluster
cd REPcluster

# install
conda install -c bioconda kmer-db mcl xopen 
python3 setup.py install

# run an example
cd example_data
REPclust hifi.trf.fa -x 2	 # for tandem repeats
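
To illustrate why `-x 2` can help with tandem repeats, here is a minimal pure-Python sketch. It is not REPcluster's implementation (the tool relies on kmer-db); it assumes that `-x` concatenates each sequence with itself that many times, which is one reading of the help text below. The helper names `kmers` and `jaccard` are hypothetical, and a short k is used only to keep the toy example readable.

```python
def kmers(seq, k):
    """Return the set of all substrings of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets: |A & B| / |A | B|."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


unit = "TTTAGGG"      # one copy of the telomeric repeat unit
rotated = "GGGTTTA"   # the same unit, reported from a different start position
k = 5                 # toy value; REPclust's default is 15

# Single copies share no 5-mers, so the two look unrelated.
single = jaccard(kmers(unit, k), kmers(rotated, k))

# Doubling each sequence (assumed effect of -x 2) exposes the k-mers that
# span the junction between copies, so both yield the same k-mer set.
doubled = jaccard(kmers(unit * 2, k), kmers(rotated * 2, k))

print(single, doubled)  # 0.0 1.0
```

In this toy case the doubled sequences each contain all seven rotations of the 5-mer, which is why the similarity rises from 0 to 1.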

Outputs

repclust.a2a.csv.jaccard	# Similarity matrix
repclust.network	# Network to import into Cytoscape
repclust.attr		# Attributes to import into Cytoscape
repclust.mcl		# One cluster per line
repclust.fa			# Center sequence for each cluster
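
As a rough sketch of downstream use, the cluster file can be read into Python lists. The only documented fact here is that `repclust.mcl` holds one cluster per line; that members on a line are whitespace-separated (as in MCL's usual label output) is an assumption, and `parse_clusters` and the sequence IDs are hypothetical.

```python
def parse_clusters(lines):
    """Turn an iterable of text lines into a list of clusters.

    Each non-empty line is treated as one cluster whose members are
    separated by whitespace (assumed format).
    """
    clusters = []
    for line in lines:
        members = line.split()
        if members:  # skip blank lines
            clusters.append(members)
    return clusters


# Made-up content standing in for the lines of repclust.mcl.
example = ["seq1\tseq4\tseq7\n", "seq2\tseq3\n", "\n", "seq5\n"]

clusters = parse_clusters(example)
sizes = [len(c) for c in clusters]
print(sizes)  # [3, 2, 1]
```

For a real run, passing an open file handle (for example `open("repclust.mcl")`) in place of `example` should work the same way, provided the format assumption holds.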

Usage

usage: REPclust [-h] [-pre STR] [-o DIR] [-tmpdir DIR] [-x INT] [-k INT]
               [-m {jaccard,min,max,cosine}] [-c FLOAT] [-I FLOAT] [-p INT]
               [-cleanup] [-overwrite] [-v]
               FILE [FILE ...]

Cluster Repeat Sequences.

optional arguments:
  -h, --help            show this help message and exit

Input:
  FILE                  Each sequence in a FASTA file is treated as a separate
                        sample

Output:
  -pre STR, -prefix STR
                        Prefix for output [default=repclust]
  -o DIR, -outdir DIR   Output directory [default=.]
  -tmpdir DIR           Temporary directory [default=tmp]

Kmer matrix:
  -x INT, -multiple INT
                        Repeat sequences to cluster tandem repeat or circular
                        sequences [default=1]
  -k INT                Length of kmer [default=15]
  -m {jaccard,min,max,cosine}, -measure {jaccard,min,max,cosine}
                        The similarity measure to be calculated.
                        [default=jaccard]

Cluster:
  -c FLOAT, -min_similarity FLOAT
                        Minimum similarity to cluster [default=0.2]
  -I FLOAT, -inflation FLOAT
                        Inflation for MCL (varying this parameter affects
                        granularity) [default=2.0]

Other options:
  -p INT, -ncpu INT     Maximum number of processors to use [default=32]
  -cleanup              Remove the temporary directory [default=False]
  -overwrite            Overwrite even if check point files existed
                        [default=False]
  -v, -version          show program's version number and exit
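
The four `-m` choices can be sketched with plain sets. The formulas below are the usual set-based definitions (and match those documented for kmer-db, as far as I am aware); that REPclust computes them exactly this way is an assumption. The function `similarity` and the toy k-mer sets are hypothetical. The last lines show how a `-c`-style threshold would decide whether a pair is kept as an edge.

```python
import math


def similarity(a, b, measure="jaccard"):
    """Similarity of two k-mer sets under one of four measures."""
    shared = len(a & b)
    if measure == "jaccard":
        denom = len(a | b)
    elif measure == "min":
        denom = min(len(a), len(b))
    elif measure == "max":
        denom = max(len(a), len(b))
    elif measure == "cosine":
        denom = math.sqrt(len(a) * len(b))
    else:
        raise ValueError(f"unknown measure: {measure}")
    return shared / denom if denom else 0.0


# Toy stand-in for "A-B-C vs A-C": integers play the role of k-mers.
abc = {0, 1, 2, 3, 4, 5}   # k-mers from parts A, B and C
ac = {0, 1, 4, 5}          # k-mers from parts A and C only

scores = {m: similarity(abc, ac, m) for m in ("jaccard", "min", "max", "cosine")}

min_similarity = 0.2       # the documented default of -c
kept = {m: s >= min_similarity for m, s in scores.items()}
```

Here `min` gives 1.0 because the shorter element is fully contained in the longer one, while `jaccard` and `max` both give 2/3, which illustrates how the choice of measure changes the treatment of nested repeats.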

About

Clustering repeat sequences with kmer-based distance
