Shyam Saladi edited this page Sep 1, 2018 · 4 revisions

Welcome to the CD-HIT Wiki - http://cd-hit.org

Program developed by Weizhong Li's lab at UCSD http://weizhongli-lab.org and JCVI http://jcvi.org

Contact: liwz@sdsc.edu.

Introduction

CD-HIT is a widely used program for clustering biological sequences to reduce sequence redundancy and improve the performance of other sequence analyses.

CD-HIT was originally developed to cluster protein sequences to create reference databases with reduced redundancy (Li et al., 2001) and was then extended to support clustering nucleotide sequences and comparing two datasets (Li and Godzik, 2006). The CD-HIT web server, implemented in 2009, allows users to cluster or compare sequences without using command-line CD-HIT. The server provides an interactive interface and additional visualization tools.

Currently, the CD-HIT package has many programs: cd-hit, cd-hit-2d, cd-hit-est, cd-hit-est-2d, cd-hit-para, cd-hit-2d-para, psi-cd-hit, cd-hit-454, cd-hit-dup, cd-hit-lap, cd-hit-otu, etc. There are also many utility scripts, written in Perl, to help run and analyze CD-HIT jobs. Briefly:

  * cd-hit              Cluster peptide sequences
  * cd-hit-est          Cluster nucleotide sequences
  * cd-hit-2d           Compare 2 peptide databases
  * cd-hit-est-2d       Compare 2 nucleotide databases
  * psi-cd-hit          Cluster proteins at <40% cutoff
  * cd-hit-lap          Identify overlapping reads
  * cd-hit-dup          Identify duplicates from single or paired Illumina reads
  * cd-hit-454          Identify duplicates from 454 reads
  * cd-hit-otu          Cluster rRNA tags
  * cd-hit Web server   Cluster user-uploaded data
  * cd-hit-para         Cluster sequences in parallel on a computer cluster
  * scripts             Parse results and so on
  * h-cd-hit            Hierarchical clustering

Recent development of cd-hit, especially the multi-threaded version introduced in 2012 (Fu et al., 2012), has enabled clustering of very large NGS datasets. For example, `cd-hit` took less than a day on a 32-core computer to cluster a few hundred million protein sequences from a metagenomics study.
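
As a rough illustration of a multi-threaded run, the Python sketch below assembles a `cd-hit` command line and launches it only if the executable is found on the PATH. The helper names (`build_cd_hit_command`, `run_cd_hit`), the file names, and the default 0.9 identity cutoff are assumptions made for this example, not part of the CD-HIT package. The options used (`-i`, `-o`, `-c`, `-n`, `-T`, `-M`) and the word-size ranges are given as commonly documented for `cd-hit`; check them against `cd-hit -h` for your installed version.

```python
import shutil
import subprocess


def build_cd_hit_command(fasta_in, out_prefix, identity=0.9, threads=32):
    """Assemble an argument list for a multi-threaded cd-hit run (hypothetical helper)."""
    if not 0.4 <= identity <= 1.0:
        # Below 40% identity, psi-cd-hit is the program listed for the job.
        raise ValueError("identity must be between 0.4 and 1.0 for cd-hit")

    # Word size (-n) is assumed to follow the usual ranges for protein clustering.
    if identity >= 0.7:
        word_size = 5
    elif identity >= 0.6:
        word_size = 4
    elif identity >= 0.5:
        word_size = 3
    else:
        word_size = 2

    return [
        "cd-hit",
        "-i", str(fasta_in),    # input FASTA file
        "-o", str(out_prefix),  # output file name
        "-c", str(identity),    # sequence identity cutoff
        "-n", str(word_size),   # word size
        "-T", str(threads),     # number of threads
        "-M", "0",              # memory limit in MB; 0 means no limit
    ]


def run_cd_hit(fasta_in, out_prefix, **kwargs):
    """Run cd-hit if it is installed; raise FileNotFoundError otherwise."""
    cmd = build_cd_hit_command(fasta_in, out_prefix, **kwargs)
    if shutil.which(cmd[0]) is None:
        raise FileNotFoundError("cd-hit executable not found on PATH")
    subprocess.run(cmd, check=True)
    return cmd
```

A call such as `run_cd_hit("proteins.fasta", "proteins_nr90", identity=0.9, threads=32)` would then correspond to the kind of 32-core run mentioned above; both file names are placeholders.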