Welcome to the CD-HIT Wiki - http://cd-hit.org
CD-HIT is a widely used program for clustering biological sequences to reduce sequence redundancy and improve the performance of other sequence analyses.
CD-HIT was originally developed to cluster protein sequences to create reference databases with reduced redundancy (Li, et al., 2001) and was then extended to support clustering nucleotide sequences and comparing two datasets (Li and Godzik, 2006). The CD-HIT web server, implemented in 2009, allows users to cluster or compare sequences without using command-line CD-HIT. The server provides an interactive interface and additional visualization tools.
Currently, the CD-HIT package includes many programs: cd-hit, cd-hit-2d, cd-hit-est, cd-hit-est-2d, cd-hit-para, cd-hit-2d-para, psi-cd-hit, cd-hit-454, cd-hit-dup, cd-hit-lap, cd-hit-otu, etc. There are also many utility scripts, written in Perl, to help run and analyze CD-HIT jobs. Briefly:
* cd-hit: Cluster peptide sequences
* cd-hit-est: Cluster nucleotide sequences
* cd-hit-2d: Compare 2 peptide databases
* cd-hit-est-2d: Compare 2 nucleotide databases
* psi-cd-hit: Cluster proteins at <40% cutoff
* cd-hit-lap: Identify overlapping reads
* cd-hit-dup: Identify duplicates from single or paired Illumina reads
* cd-hit-454: Identify duplicates from 454 reads
* cd-hit-otu: Cluster rRNA tags
* cd-hit web server: Cluster user-uploaded data
* cd-hit-para: Cluster sequences in parallel on a computer cluster
* scripts: Parse results and so on
* h-cd-hit: Hierarchical clustering
Recent development of CD-HIT, especially the multi-threaded version introduced in 2012 (Fu et al.), has enabled clustering of very large NGS datasets. For example, `cd-hit` took less than a day on a 32-core computer to cluster a few hundred million protein sequences from a metagenomics study.
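As a rough illustration of a multi-threaded run, the sketch below assembles a `cd-hit` command line in Python. The file names, the 90% identity threshold, and the helper `build_cdhit_command` are assumptions made for this example and do not come from the wiki. The flags used are `-i` (input), `-o` (output), `-c` (identity threshold), `-n` (word size), `-T` (threads), and `-M` (memory limit in MB, where 0 means unlimited). The snippet only builds and prints the command; it does not run `cd-hit`.

```python
import shlex


def build_cdhit_command(fasta_in, out_prefix, identity=0.9,
                        word_size=5, threads=32, memory_mb=0):
    """Return the argument list for a multi-threaded cd-hit run.

    Hypothetical helper for illustration only; the default values are
    assumptions, not recommendations from the CD-HIT authors.
    """
    if not 0.0 < identity <= 1.0:
        raise ValueError("identity must be in (0, 1]")
    return [
        "cd-hit",
        "-i", fasta_in,         # input FASTA file
        "-o", out_prefix,       # output file name
        "-c", str(identity),    # sequence identity threshold
        "-n", str(word_size),   # word size
        "-T", str(threads),     # number of threads
        "-M", str(memory_mb),   # memory limit in MB (0 = unlimited)
    ]


# Placeholder file names; replace with real paths before running cd-hit.
cmd = build_cdhit_command("proteins.fasta", "proteins_nr90")
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` on a machine where `cd-hit` is installed. A suitable word size depends on the identity threshold, so consult the CD-HIT user guide before changing `-c`.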