Compression with reference genome #4

Open
8banzhuan opened this issue Jan 3, 2023 · 3 comments

Comments

@8banzhuan

Hello, and thanks for the excellent compression tool. I have a question about compressing with a reference genome.
I am compressing FASTA data with GRCh38 as the reference genome. I subsampled the reference, that is, I did not use the whole genome. The original reference is about 3 GB. Whether I use the complete reference, half of it, or even one-tenth of it, the compression rate barely changes. I used Li Heng's samtools to check the comparison results, and in fact only about one-fifth of the data was compared successfully. Does the comparison rate have a large impact on the compression performance of CoLoRd? Why do I get similar compression results with the full reference genome and with one-tenth of it (the smaller the reference genome, the lower the alignment rate)?
Looking forward to your reply. Best wishes!

@marekkokot
Collaborator

Hello, @8banzhuan
First of all, thanks for using CoLoRd! It is hard to tell what causes this, but we have some suspicions. One is that the data you are compressing is of such high quality that the reference genome is in fact not beneficial at all. It would be great if you could provide your input FASTA file. Also, I'm not sure what you mean by "comparison ratio". It would also help if you could provide your whole testing pipeline (command lines).

@8banzhuan
Author

Thank you for your reply! By "comparison rate" I mean the fraction of my FASTA data that was successfully mapped to the reference genome. I used Li Heng's minimap2 for mapping, and then used samtools to compute the alignment rate from the SAM file. I wanted to find out how the alignment rate affects the compression rate (when a reference genome is used).
The command I use is similar to the following:
colord compress-hifi -G reference.fasta inputfile.fasta outputfile
What I see is that the size of the reference genome does not seem to change my compression rate very much.
In other words, with a reference genome that gives an alignment rate of about 25% and one that gives an alignment rate of 9%, the compression rate is almost the same (in both cases the output is about 10% of the original file size). For a reference genome with poor alignment quality, is CoLoRd's compression in reference mode still much better than without a reference genome?
Looking forward to your reply, best wishes!
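
For readers who want to reproduce the "alignment rate" figure, here is a minimal Python sketch of one way to compute it from SAM text. It is an assumption about the calculation, not the exact samtools procedure used in this thread: it counts each read once by skipping secondary (flag bit 0x100) and supplementary (0x800) records, and treats a read as mapped when the unmapped bit (0x4) is clear. The function name `alignment_rate` and the example records are made up for illustration.

```python
def alignment_rate(sam_lines):
    """Return the fraction of reads whose primary record is mapped.

    sam_lines: any iterable of SAM text lines (header lines allowed).
    """
    total = 0
    mapped = 0
    for line in sam_lines:
        # Skip blank lines and header lines (headers start with '@').
        if not line.strip() or line.startswith("@"):
            continue
        fields = line.rstrip("\n").split("\t")
        flag = int(fields[1])  # second SAM column is the bitwise FLAG
        # Count each read once: ignore secondary and supplementary records.
        if flag & 0x100 or flag & 0x800:
            continue
        total += 1
        if not flag & 0x4:  # 0x4 set means the read is unmapped
            mapped += 1
    return mapped / total if total else 0.0


# Hypothetical records, for illustration only (not data from this issue).
example = [
    "@SQ\tSN:chr1\tLN:1000",
    "read1\t0\tchr1\t1\t60\t4M\t*\t0\t0\tACGT\t*",
    "read2\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\t*",
    "read3\t16\tchr1\t5\t60\t4M\t*\t0\t0\tACGT\t*",
    "read3\t2064\tchr1\t50\t60\t4M\t*\t0\t0\tACGT\t*",  # supplementary, skipped
    "read4\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\t*",
]
rate = alignment_rate(example)
```

To run it on a real file, pass an open file handle, for example `alignment_rate(open("aln.sam"))`, where `aln.sam` is a placeholder name.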

@marekkokot
Collaborator

Hi,

Sorry for the late response, I didn't get (or missed) the notification.
What is the compression ratio when the reference genome is not used at all?
Since you are using HiFi data, its quality may be so good that there is no benefit from using the reference genome.
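
The comparison asked for above can be sketched in Python as follows. This is only an illustration: the helper names are made up, the byte counts are hypothetical placeholders rather than measurements from this issue, and "compression ratio" is taken to mean compressed size divided by original size, matching the "10% of the original file" wording earlier in the thread.

```python
def compression_ratio(original_bytes, compressed_bytes):
    """Compressed size as a fraction of the original size (lower is better)."""
    return compressed_bytes / original_bytes


def reference_gain(no_ref_bytes, with_ref_bytes):
    """Relative size reduction obtained by adding the reference genome."""
    return 1 - with_ref_bytes / no_ref_bytes


# Hypothetical sizes in bytes, for illustration only.
original = 1000
without_reference = 110
with_reference = 100

ratio_without = compression_ratio(original, without_reference)
ratio_with = compression_ratio(original, with_reference)
gain = reference_gain(without_reference, with_reference)

print(f"ratio without reference: {ratio_without:.3f}")
print(f"ratio with reference:    {ratio_with:.3f}")
print(f"gain from reference:     {gain:.1%}")
```

In practice the sizes could come from `os.path.getsize` on the input FASTA and on the two CoLoRd outputs. A gain close to zero would be consistent with the suspicion that the reference adds little for this data set.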
