-d parameter does not change length of description in .clstr file #4

Open
GoogleCodeExporter opened this Issue May 4, 2015 · 6 comments

GoogleCodeExporter commented May 4, 2015

What steps will reproduce the problem?
1. Choose a FASTA file with long headers as infile.fasta.
2. Run 'cd-hit -i infile.fasta -o outfile.fasta'.
3. Run 'cd-hit -i infile.fasta -o outfile2.fasta -d 100'.
4. Compare outfile.fasta.clstr with outfile2.fasta.clstr (e.g. on Linux: 'diff outfile{'',2}.fasta.clstr').

What is the expected output? What do you see instead?
Because of the -d 100 parameter, outfile2.fasta.clstr should contain a larger 
part of each header instead of just the first few characters.
Instead, both outfile.fasta.clstr and outfile2.fasta.clstr contain only a very 
short part of each header, e.g.:

>Cluster 0
0       69aa, >1cc1_L... *
>Cluster 1
0       61aa, >2wpn_B... *

for these infile.fasta entries:
>1cc1_L Hydrogenase (large subunit); NI-Fe-Se hydrogenase, oxidoreduct (111-179:497)
QSHILHFYHLAALDYVKGPDVSPFVPRYANADLLTDRIKDGAKADATNTYGLNQYLKALEIRRICHEMV
>2wpn_B Periplasmic [nifese] hydrogenase, large subunit, selenocystein (116-176:494)
QSHILHFYHLSAQDFVQGPDTAPFVPRFPKSDLRLSKELNKAGVDQYIEALEVRRICHEMV
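
To check quickly whether -d had any effect, the descriptions stored in a .clstr file can be pulled out and compared between the two runs. The following is a minimal sketch, assuming member lines look like the ones quoted above (index, length with a unit suffix, then `>description...`). The function name and the regular expression are illustrative and not part of cd-hit.

```python
import re

# Matches a member line as quoted in this report, e.g.
# "0       69aa, >1cc1_L... *"
# Assumption: the description is followed by a literal "...".
MEMBER_RE = re.compile(r"^\d+\s+\d+[a-z]+,\s+>(.*?)\.\.\.")


def clstr_descriptions(lines):
    """Return the description stored on each member line of a .clstr file."""
    descriptions = []
    for line in lines:
        match = MEMBER_RE.match(line)
        if match:  # ">Cluster N" lines do not match and are skipped
            descriptions.append(match.group(1))
    return descriptions


sample = [
    ">Cluster 0",
    "0       69aa, >1cc1_L... *",
    ">Cluster 1",
    "0       61aa, >2wpn_B... *",
]
print(clstr_descriptions(sample))
```

Running the same function over outfile.fasta.clstr and outfile2.fasta.clstr and comparing the two lists should show whether the longer -d value changed anything.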

What version of the product are you using? On what operating system?
CD-HIT version 4.5.4 (built on May 31 2011)
Kubuntu 11.10 64bit

Original issue reported on code.google.com by klaus.ko...@gmail.com on 1 Feb 2012 at 11:58

GoogleCodeExporter commented May 4, 2015

The -d feature has to be the most frustrating parameter of cd-hit. I use cd-hit 
a lot in a number of applications. Not printing, by default, at least the 
sequence description up to the first base (instead of a default of 20, why?) 
makes no sense to me. Not allowing the researcher to print the whole line 
makes no sense to me either. And if there are spaces in the sequence name, the 
-d setting seems to be ignored: regardless of the value of -d, cd-hit only 
seems to print up to the first space. Please consider changing this behavior.


Original comment by mattsett...@gmail.com on 17 Feb 2013 at 4:20

GoogleCodeExporter commented May 4, 2015

Original comment by daoko...@gmail.com on 27 Mar 2013 at 11:52

  • Changed state: Accepted

GoogleCodeExporter commented May 4, 2015

After some code diving, I've come up with a patch.
The FASTA entry description is stored incorrectly (only up to the first 
whitespace, as opposed to the first newline).
One line needs to change.

I've tested it and it now respects the -d switch.
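
The difference between the two behaviours can be modelled in a few lines. This is only a sketch of what the thread describes, not cd-hit's actual code: both function names are hypothetical, `d` stands in for the -d value, and the header is the wrapped example from the original report joined into one line.

```python
def description_reported(header, d):
    """Reported behaviour: keep only the text before the first whitespace."""
    parts = header.split(None, 1)
    first_word = parts[0] if parts else ""
    # Truncating to d characters can never reach past the first space.
    return first_word[:d]


def description_patched(header, d):
    """Behaviour described for the patch: keep the whole line, then truncate."""
    return header.rstrip("\n")[:d]


header = ("1cc1_L Hydrogenase (large subunit); NI-Fe-Se hydrogenase, "
          "oxidoreduct (111-179:497)")

for d in (20, 100):
    print(d, repr(description_reported(header, d)),
          repr(description_patched(header, d)))
```

In this model the first function returns the same short identifier for every `d`, which matches the identical .clstr files in the report.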

Original comment by vladv...@gmail.com on 5 Oct 2013 at 2:55

Attachments:

wanyuac commented Jul 28, 2015

I solved this issue by using "-d 0" and replacing the first space with another character, such as "|".
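
A minimal sketch of that preprocessing step, assuming header lines start with ">" and that "|" does not already carry meaning in the identifiers. The function name is made up for illustration. Only the first space of each header is replaced, as described above, so later spaces stay in place.

```python
def join_header_fields(fasta_lines, replacement="|"):
    """Replace the first space in each FASTA header line with `replacement`."""
    for line in fasta_lines:
        if line.startswith(">"):
            # count=1: only the first space is replaced
            line = line.replace(" ", replacement, 1)
        yield line  # sequence lines pass through unchanged


records = [
    ">1cc1_L Hydrogenase (large subunit)",
    "QSHILHFYHLAALDYVKGPDVSPFVPRYANADLLTDRIKDGAKADATNTYGLNQYLKALEIRRICHEMV",
]
for line in join_header_fields(records):
    print(line)
```

The rewritten file would then be passed to cd-hit with -d 0.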

robertrentzsch commented May 13, 2016

I don't get why this super simple fix (see the patch above, it's just a single-line change!) hasn't been applied to the official CD-HIT release in a whole year now. -d still doesn't work as it should: it stops at the first space regardless of the number given (e.g. 100 or 1000).

dbolser-ebi commented May 16, 2016

Time to fork?
