Error in reading BAM/SAM file. truncated file #34

Closed
baozg opened this issue May 22, 2020 · 14 comments

Comments

@baozg

baozg commented May 22, 2020

Hi,

I am using SyRI to identify SVs between two species (same genus, diverged 10 Mya). Below is the full command I used, but it failed with this error: [W::sam_read1] Parse error at line 20 Reading BAM/SAM file - ERROR - Error in reading BAM/SAM file. truncated file

  • Command
minimap2 -t 24 -ax asm10 --eqx ref.fa que.fa > out.sam
python3 /data/software/SyRI/syri/syri/bin/syri -c out.sam -r ref.fa -q que.fa -k -F S --nc 12
  • System:
CentOS 7.4
python3.5
latest SyRI (installed yesterday via git clone)
  • File
    out.sam
Chr1    0       Chr1    74430467        60      78538456S106=1X244=2I202=1X104=4I123=1X157=1X49=1X102=1X123=1X260=2X75=1X270=1X40=1X222=1D19=1X205=1X376=1X229=1X15=1X920=1X138=1X679=1X464=1D228=1X1=1X32=1D113=1X47=1X393=1X195=1X334=1X3=1X160=1X141=1X61=1X313=1X2=1X90=1X118=1X9=1X200=1X56=1X33=1X58=1X147=1D281=1X94=2X202=1X36=1X38=1X46=1X691=1X480=1X196=1X277=1X6=1X512=4I313=1X717=1X230=1X287=1X389=1D147=1X23=1X300=1X66=11D74=1X46=1X525=1D177=16I86=5I11=1X128=1X1=1X1=13D181=1X241=1X97=1D88=1X102=1X50=1X56=1X84=1X226=1X26=1X238=1X59=1D23=1X71=1D129=1X399=1X190=1D169=1X4=1X179=1X11=1X45=1D96=1X546=1X176=1X1=1X12=1X30=1X28=1X195=1X178=1X50=12I1=1X317=1X55=1D258=1X226=1D190=1X188=1X35=1X124=1X13=1D620=1D78=1X572=1X467=1X154=1I47=1X15=1X13=1X946=1X80=1X342=1X40=1I358=1X85=1X340=1X43=1X230=1X191=1X53=1X384=1X137=1X47=1X30=1X276=1X114=1X339=1X78=1X24=1X45=1X55=1X9=1X246=1X205=1X13=1X6=1X44=1X44=1X145=1X40=7I441=1X69=1X19=1X10=1X134=1X276=1X257=1X113=1X74=11D1173=1X223=1X112=1X241=1X176=1X389=1X119=4D282=1X455=1X29=1X11=6D1=1X360=1X439=1X499=1X91=1X207=1X62=1X38=1X248=1X80=1I173=143I206=1X100=1X63=1X397=1X58=1X310=1X535=1X444=1D154=1X39=1X201=1X63=1X68=1X86=1X431=1X564=4D17=1X92=1X10=1X37=1X21=1X127=1X232=1X149=1X66=1X114=1X8=1X446=1X51=1X98=1X210=1D189=1X150=1D1X61=1X72=1X109=1X122=1X126=1X252=1X504=1X84=1X150=1X56=3D232=1I435=1X133=1X373=1X151=1X206=1X128=1X53=1X15=1X52=1X88=1D52=1X9=1X99=1X241=1X144=1X43=2I90=16I128=16I150=1X25=1X344=2X346=1X132=1X420=1X754=1X101=1X559=1X58=1X724=16I14=1X160=1X433=143D1X182=1X483=1I179=1X124=1D72=1X202=1D93=1X67=1X237=3D44=1X48=1I225=1X291=1X479=1X75=1X376=1D723=1X26=1X17=1X88=1X4=1X413=1X26=1X4=1X10=2037I332=1D751=1X532=1X218=1X9=4I231=1X288=1X214=1D93=1X21=1I147=1X23=1D312=1X48=1X203=1I51=1X47=1X17=1X13=1X498=1D112=1X44=1D9=1X40=10I292=1X30=1I10=1X214=1X52=1I102=1D74=1X79=1D140=1X30=1X62=1X126=1X314=1I154=1D15=1X93=1X558=1X5=1X5=1X70=1X56=1X33=1X15=1X307=1X492=1X206=1X376=1X188=1X95=1X1=1X166=1D841=1X71=1X39=1X527=1X93=1X152=1X1278=1X63
3=1X81=2I39=1X413=1D118=1X111=1X294=1I6=1X224=1X93=1X71=1X251=23I10=1X421=1X11=1D338=1I211=1X477=4I267=1D1=1X450=1X201=1X73=1X350=1X64=2I383=1X13=4D329=1X250=1X267=1X496=5I95=1X129=1I12=1X82=1X833=20I46=1X107=1X117=1X10=1X121=1X35=1X28=4I212=1X41=1X83=1X707=1X340=1X221=1X45=1D10=1X136=1X207=1X36=7I12=1X188=1X295=1X53=1X367=1X257=1X19=1X192=1X417=1D82=1X403=1X56=1X178=1X288=1X270=1X30=1I356=1I141=1X194=1X97=1X178=1X230=4D102=1I5=1X202=1X19=1X512=12D240=1X465=1X26=1X200=1X356=1X364=1X6=4I185=1X163=1I529=1X1054=3I802=1D572=2D168=1X309=1X347=1X1233=1D69=1X89=2D71=1D34=1X1083=1X22=1D227=2I812=2X50=1X69=1X144=1X33=1X7=1X72=1X113=1X44=1X37=1X16=1D218=1X469=1X19=1D172=1X143=1D1X305=1X328=1I16=1X73=6D126=1X61=1X83=1X176=1X318=1X152=1X269=1X1=9D15=1X18=1X43=1X665=1X388=1X70=1X6=1D224=20I9=4D78=1X9=1X30=8D25=20I3=1X3=1X137=1X370=1I196=1X149=1D105=1D226=1X31=1X270=1I251=27D59=1X79=1X56=2D143=16I63=1X27=1X27=1X11=3579I52=1X22=1X71=1X351=1X123=1X499=3I157=10I8=1I61=1X14=1X43=1X113=1X313=6I15=1X84=1X83=1X84=1X27=1X18=1X1
@mnshgl0110
Member

Hi. The error message suggests that there is something unexpected in the input SAM file (formatting, odd characters, or something else), and as a result it cannot be read. A quick check would be to try converting it to BAM. If that also gives this error, then you can be sure that the SAM file is incorrect. If it works, then you can try using that BAM file as input for SyRI.
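Before attempting the conversion, it can help to locate the record the parser is complaining about. The sketch below is only an illustration and is not part of SyRI or pysam: `first_malformed_line` is a hypothetical helper that reports the first alignment line with fewer than the 11 mandatory SAM fields or with a non-integer FLAG, POS or MAPQ.

```python
import io

MANDATORY_FIELDS = 11  # QNAME FLAG RNAME POS MAPQ CIGAR RNEXT PNEXT TLEN SEQ QUAL


def first_malformed_line(handle):
    """Return (line_number, reason) for the first suspicious record, or None.

    Hypothetical helper for illustration; it only checks the field count
    and three integer columns, not the full SAM grammar.
    """
    for number, line in enumerate(handle, start=1):
        if line.startswith("@"):  # header lines carry no alignment fields
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < MANDATORY_FIELDS:
            return number, "expected at least 11 fields, found {}".format(len(fields))
        for index, name in ((1, "FLAG"), (3, "POS"), (4, "MAPQ")):
            if not fields[index].isdigit():
                return number, "{} is not a non-negative integer".format(name)
    return None


# Usage on a real file (path is an assumption):
# with open("out.sam") as handle:
#     print(first_malformed_line(handle))
```

A record that was cut off mid-line, for example by a full disk, would show up here as a line with too few fields.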

@mnshgl0110 mnshgl0110 self-assigned this May 22, 2020
@baozg
Author

baozg commented May 22, 2020

Hi,

Actually, due to the length of the chromosome (>400Mb), it cannot be converted to BAM: the maximum CIGAR operation length is 2**28 = 268435456 (see lh3/minimap2#440).
So do you read the SAM file with pysam? How do you deal with the maize chromosomes?
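To see whether a given record actually hits this limit, one can scan its CIGAR string. The following is a minimal sketch, assuming the limit comes from BAM packing each CIGAR operation as `length << 4 | op` in 32 bits, so a length must fit in 28 bits (at most 2**28 - 1). `oversized_ops` is a hypothetical helper name.

```python
import re

# Largest operation length that fits in the 28 bits BAM reserves for it.
BAM_MAX_OP_LEN = (1 << 28) - 1
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def oversized_ops(cigar):
    """Return (length, op) pairs from a CIGAR string that exceed the BAM limit."""
    return [
        (int(length), op)
        for length, op in CIGAR_OP.findall(cigar)
        if int(length) > BAM_MAX_OP_LEN
    ]


# A 300 Mb soft clip cannot be stored in a BAM CIGAR field, a 78 Mb one can.
too_long = oversized_ops("300000000S106=1X244=")
fits = oversized_ops("78538456S106=1X244=")
```

Here `too_long` holds the single offending soft clip and `fits` is empty.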

@mnshgl0110
Member

Yes, SyRI uses pysam to read the SAM file. I would have to check how pysam handles this. The discussion in pysam-developers/pysam#613 suggests that they fixed it, but I would have to check exactly how this works.
For aligning maize, I used nucmer, so I did not encounter this issue then.

@baozg
Author

baozg commented May 22, 2020

Thanks, I will try nucmer. But more than 60% of my species' genome is repetitive sequence. Do I need to repeat-mask it first? Hard-mask or soft-mask?

@mnshgl0110
Member

Repeats are not an issue algorithmically; they can, however, increase run time and memory use significantly. So, you can decide about masking based on your project's requirements and time/memory restrictions.

@baozg
Author

baozg commented May 22, 2020

Thanks for your prompt response. In fact, I tried splitting the chromosomes (ignoring intra-chromosome rearrangements) to speed things up, but SyRI was stuck on one chromosome for 7 days and then crashed. I will compare the results with and without masking.

@mnshgl0110
Member

Could you please start a new issue for this crash and, if possible, share the syri.log file? I would like to see what happened there. Also, if you still have the error message then please share that too.

@baozg
Author

baozg commented May 22, 2020

What if my species' chromosomes have undergone splitting and fission? How should I use SyRI in that case, or should I use only the chromosomes with one-to-one synteny?
image

@mnshgl0110
Member

I created a new issue for this.

@baozg
Author

baozg commented May 24, 2020

Hi,

The delta file from nucmer runs smoothly with SyRI, so mummer is more appropriate for genomes with big chromosomes (in my case, the biggest chromosome is 450Mb). Have you compared the results from mummer and minimap2? It is said that the minimizers used by minimap2 make base-level accuracy very difficult to achieve.

@mnshgl0110
Member

Hi Zhigui,

Thanks for letting me know. If I understand correctly, minimap2 can align such large chromosomes and can generate the SAM output. However, the alignments can neither be converted to BAM nor be read through pysam. If this is the case, then it can be solved by a custom function to read SAM files, which would be a better solution than not being able to use minimap2 for larger chromosomes.
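A rough sketch of what such a custom reader could look like is given here. This is not SyRI's actual implementation; `read_sam` and `Alignment` are hypothetical names, and only the reference coordinates are derived from the CIGAR string. Because the file is parsed as plain text, the BAM limit on CIGAR operation length does not apply.

```python
import io
import re
from collections import namedtuple

# Hypothetical record type for this sketch; coordinates are 1-based, inclusive.
Alignment = namedtuple("Alignment", "query flag ref ref_start ref_end cigar")
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
REF_CONSUMING = set("MDN=X")  # operations that advance along the reference


def read_sam(handle):
    """Yield one Alignment per mapped record of a text SAM stream."""
    for line in handle:
        if line.startswith("@"):  # skip header lines
            continue
        fields = line.rstrip("\n").split("\t")
        query, flag, ref = fields[0], int(fields[1]), fields[2]
        pos, cigar = int(fields[3]), fields[5]
        if flag & 4 or cigar == "*":  # unmapped record, nothing to report
            continue
        ref_len = sum(
            int(length)
            for length, op in CIGAR_OP.findall(cigar)
            if op in REF_CONSUMING
        )
        yield Alignment(query, flag, ref, pos, pos + ref_len - 1, cigar)


example = io.StringIO(
    "@SQ\tSN:Chr1\tLN:1000\n"
    "Chr1\t0\tChr1\t100\t60\t5S10=1X2I3D4=\t*\t0\t0\t*\t*\n"
)
records = list(read_sam(example))
# records[0] spans reference positions 100 to 117 (10 + 1 + 3 + 4 = 18 bases)
```

Reading line by line keeps memory use independent of file size, which matters for multi-gigabyte SAM files.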

Regarding the comparison between mummer and minimap2, the former is more sensitive and finds more alignments (at least in my experience), but at the cost of adding extra noisy alignments, which can result in noisy annotations by SyRI. minimap2, on the other hand, produces cleaner alignments (and cleaner SyRI annotations), but some alignments could be missing. I have not compared the differences at the basepair level though.

@baozg
Author

baozg commented May 25, 2020

Hi Goel,

Yes, you are right.
It is true that minimap2 produces PAF very quickly, but its speed is comparable to mummer's (in my experience) when using -ax asm5 --eqx. In my case, the SAM file from minimap2 is 16Gb, but the raw .delta file is only 150Mb, so maybe mummer4 is more suitable for large genomes.

@mnshgl0110
Member

minimap2 does not do basepair alignment without the -a option (as PAF does not output that). So, using -a does increase its runtime. The file size becomes large because in SAM/BAM each line contains the query sequence, but the alignment information stays the same.
Also, the mummer4 package has some issues in handling N's between genomes. It considers N-N as a match, or something similar. So, you might want to check that before using alignments generated by it.

@mnshgl0110
Member

I have added a reader for SAM files and now genome size should not be an issue.
