-
Notifications
You must be signed in to change notification settings - Fork 50
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Finding fusions and counting supporting reads zsh: killed #229
Comments
Something does not seem right about this sample. In fact, the situation looks a lot similar to your other issue #228. Since it's public data, can you point me to the source from where you obtained it? Then I can take a look at it myself. |
Thank you for your reply. (If access to the restricted public data is difficult, Is it okay to explain the situation to the data owner and seek permission?) Thanks a lot. Sincerely, |
Apologies for the slow response - it has been a busy week. I didn't realize it's restricted data from EGA. I doubt it will be possible to share the data with me. So we will need to resort to remote troubleshooting. Can you run Looking at the status messages from Arriba, about 54 million reads were considered as chimeric. That's almost all reads. This could happen when read1 and read2 were accidentally swapped when given as input to STAR. Can you rule this out? |
I am very sorry for the late response due to personal reasons. I have tried running samtools flagstat and the outcome is as follows: 108111052 + 6633048 in total (QC-passed reads + QC-failed reads) The STAR log.file.out could not be attached here, so I have attached it via email. Thanks! |
I am very sorry for the late response due to personal reasons. I have tried
running samtools flagstat and the outcome is as follows:
108111052 + 6633048 in total (QC-passed reads + QC-failed reads)
108111052 + 6633048 primary
0 + 0 secondary
0 + 0 supplementary
8903170 + 805612 duplicates
8903170 + 805612 primary duplicates
107740389 + 5690274 mapped (99.66% : 85.79%)
107740389 + 5690274 primary mapped (99.66% : 85.79%)
108111052 + 6633048 paired in sequencing
54055526 + 3316524 read1
54055526 + 3316524 read2
92916680 + 4440744 properly paired (85.95% : 66.95%)
107438656 + 5331394 with itself and mate mapped
301733 + 358880 singletons (0.28% : 5.41%)
10155368 + 676362 with mate mapped to a different chr
1489734 + 157793 with mate mapped to a different chr (mapQ>=5)
The STAR log.file.out could not be attached here, so I have attached it via
email.
Thanks!
Sincerely,
DJ. J
2024년 2월 10일 (토) 오전 8:48, suhrig ***@***.***>님이 작성:
… Apologies for the slow response - it has been a busy week.
I didn't realize it's restricted data from EGA. I doubt it will be
possible to share the data with me. So we will need to resort to remote
troubleshooting.
Can you run samtools flagstat on the problematic BAM file, please? Also,
do you have the STAR Log.final.out file and could share it with me?
Looking at the status messages from Arriba, about 54 million reads were
considered as chimeric. That's almost all reads. This could happen when
read1 and read2 were accidentally swapped when given as input to STAR. Can
you rule this out?
—
Reply to this email directly, view it on GitHub
<#229 (comment)>, or
unsubscribe
<https://github.com/notifications/unsubscribe-auth/AQKDW3WASPFXNIKRLSGK4ALYS2YT7AVCNFSM6AAAAABCV3XY76VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTSMZWG42DOMRYGM>
.
You are receiving this because you authored the thread.Message ID:
***@***.***>
--
DAJEONG JEONG, M.D., Ph.D.
Department of Laboratory Medicine
Yonsei University College of Medicine, Severance Hospital
50-1, Yonsei-ro, Seodaemun-gu, Seoul, Republic of Korea (03722)
Tel: 82-2-2228-2444 Fax: 82-2-2227-8353
Cel: 82-10-4901-9486
|
Unfortunately, the log file was also dropped by mail. Can you just copy-paste the content as a reply to this thread? The flagstats already look weird. The are no supplementary alignments. Can you show me the STAR command, please? |
Thank you for your reply. The STAR_log.final.out file result is as follows:
Number of reads unmapped: too many mismatches | 0
|
The problem could be that you haven't collated the paired-end mates before converting to FastQ. The mates will be mixed up if you don't do this. Would you please try the following command for the conversion?
|
Thank you for your reply. Started job on | Feb 27 14:53:10
Number of reads unmapped: too many mismatches | 0 It looks the same as the previous result. Thanks. :) |
Clearly, something must be going wrong during the conversion from BAM to FastQ. It is not normal that 70% of the reads are chimeric. This number is typically well below 10%. Also, Arriba usually takes only a few minutes to run and the main output file should only be a few kb in size. The flagstats you send me are from the original BAM file, not your realigned BAM file, right? Things still look okay in the original file, then. In this case, can you check the following: Pick a random read pair from the original BAM file. Note down the sequences from both mates. Next, extract the sequences of the read pair with the same name from the converted FastQs, e.g., using |
Hi, did you find some time to inspect the read sequences as explained in my previous post? Are my instructions clear enough? I'm pretty sure this has to do with the BAM file not being converted to FastQs properly. |
Hi, I am very sorry for the late reply. Thank you for suggesting various solutions to try. First, the flagstats are indeed from the original BAM file. The command used at that time is as follows. ======================================================================== Convert BAM to FASTQ/Users/dajeong/samtools-1.18/samtools fastq /Users/dajeong/STAR-Fusion/data/01-15563_T1.rna.bam -1 /Users/dajeong/STAR-Fusion/data/01-15563_T1_R1_fastq.gz -2 /Users/dajeong/STAR-Fusion/data/01-15563_T1_R2_fastq.gz Unzip FASTQ filesgunzip -k /Users/dajeong/STAR-Fusion/data/01-15563_T1_R1_fastq Run STAR alignment/Users/dajeong/STAR/bin/MacOSX_x86_64/STAR --runThreadN 5 ============================================================================== ############################################################################## HS27_221:4:1213:7746:97787 83 chr7 139877304 60 75M = 139877168 -211 GCATAGCTTCTTGCTTTAAGCGTGATAGGTGCTCGATAAAGGCTCAAGAAATTGAACTGTCTACATTTCTCTTAC E1G@BE>1=<1CDFGGF>FBGGGGFF1DFC<0GGC<@1g>EGFF=:1C1GGF@;GEF@1>GGGGGGGGF1@BBB@ MD:Z:75 PG:Z:MarkDuplicates RG:Z:226820 NM:i:0 AS:i:75 XS:i:0 grep -w -A1 -h "HS27_221:4:1213:7746:97787" /Users/dajeong/STAR-Fusion/data/01-15563_T1_R1_fastq > extracted_sequence_read1_second.txt @HS27_221:4:1213:7746:97787 grep -w -A1 -h "HS27_221:4:1213:7746:97787" /Users/dajeong/STAR-Fusion/data/01-15563_T1_R2_fastq > extracted_sequence_read2_second.txt @HS27_221:4:1213:7746:97787 ############################################################################## Trial 2. samtools view /Users/dajeong/STAR-Fusion/data/01-15563_T1.rna.bam | gshuf -n 1 > random_read_third.txt HS27_221:4:1109:11102:98424 99 chr2 196163272 35 75M = 196163385 8093 CTTTAGATGTAAGTATATAGAAATTATTAAAGTTTTCCATTTTTATTGGAATTTGAGGAGTTGTAGTTAGTAGGC CCCCCEFGGGGGGGGGGGGGGGGEGGGGGGGGGGGGGFGGGGGGGGGGGGGGGGGGGEGGGGGGGEFGGGGGGGG MD:Z:75 PG:Z:MarkDuplicates RG:Z:226820 NM:i:0 AS:i:75 XS:i:63 grep -w -A1 -h "HS27_221:4:1109:11102:98424" /Users/dajeong/STAR-Fusion/data/01-15563_T1_R1_fastq > extracted_sequence_read1_third.txt @HS27_221:4:1109:11102:98424 grep -w -A1 -h "HS27_221:4:1109:11102:98424" /Users/dajeong/STAR-Fusion/data/01-15563_T1_R2_fastq > extracted_sequence_read2_third.txt @HS27_221:4:1109:11102:98424 ############################################################################## Does this result suggest a problem in the file conversion process? Thanks! |
Thanks for the additional details. This looks good so far. May I ask you for another piece of information? Can you please run the following command on the BAM file which you have created (i.e., the one after realigning the reads)? The command extracts the randomly selected reads:
Another thing you might want to try is to run Arriba on the original BAM file. Does it also crash here? If not, then this would confirm that the problem must arise during conversion to FastQ or realignment. |
Thank you for providing additional guidance. samtools view /Users/dajeong/STAR-Fusion/output/01-15563_T1_STAR_Aligned.out.bam | grep -wE 'HS27_221:4:1213:7746:97787|HS27_221:4:1109:11102:98424' (base) dajeong@DajeongcBookPro data % samtools view /Users/dajeong/STAR-Fusion/output/01-15563_T1_STAR_Aligned.out.bam | grep -wE 'HS27_221:4:1213:7746:97787|HS27_221:4:1109:11102:98424'
|
The output from the You can try a different tool for the conversion, for example from the
|
Did you have more success with biobambam? |
I apologize for the late reply. |
Hi,
I am using Arriba to analyze RNA fusions.
I downloaded public data from two different sources (A and B).
I could successfully get final tsv files from RNA sequencing data (Aligned.sortedByCoord.out.bam BAM file with ~5-7 GB) from A.
The size of fusions.tsv files was about 170KB.
However, when I did the same with data from B, it took much time running Arriba.
Also, it worked well with RNA BAM files of ~6GB (final tsv file size: 110MB) while it failed to work when I dealt with >7GB BAM files with error message like follows:
(base) xxx@xxxcBookPro data % /Users/dajeong/arriba_v2.4.0/arriba -x /Users/dajeong/STAR-Fusion/output/01-17618_T1_STAR_Aligned.out.bam -g /Users/dajeong/Arriba_DJ/GENCODE38.gtf -a /Users/dajeong/Arriba_DJ/hg38.fa
-b /Users/dajeong/arriba_v2.4.0/database/blacklist_hg38_GRCh38_v2.4.0.tsv.gz -k /Users/dajeong/arriba_v2.4.0/database/known_fusions_hg38_GRCh38_v2.4.0.tsv.gz -p /Users/dajeong/arriba_v2.4.0/database/protein_domains_hg38_GRCh38_v2.4.0.gff3
-o /Users/dajeong/Arriba_DJ/output/01-17618_T1_fusions.tsv
[2024-02-01T18:53:46] Launching Arriba 2.4.0
[2024-02-01T18:53:46] Loading assembly from '/Users/dajeong/Arriba_DJ/hg38.fa'
[2024-02-01T18:54:03] Loading annotation from '/Users/dajeong/Arriba_DJ/GENCODE38.gtf'
[2024-02-01T18:54:09] Reading chimeric alignments from '/Users/dajeong/STAR-Fusion/output/01-17618_T1_STAR_Aligned.out.bam' (total=54199648)
[2024-02-01T18:57:53] Marking multi-mapping alignments (marked=160621)
[2024-02-01T18:58:10] Detecting strandedness (reverse)
[2024-02-01T18:58:12] Assigning strands to alignments
[2024-02-01T18:58:20] Annotating alignments
[2024-02-01T19:00:20] Filtering duplicates (remaining=50395284)
[2024-02-01T19:00:48] Filtering mates which do not map to interesting contigs (1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 X Y AC_* NC_) (remaining=49947069)
[2024-02-01T19:00:56] Filtering mates which only map to viral contigs (AC_ NC_*) (remaining=49947069)
[2024-02-01T19:01:04] Filtering viral contigs with expression lower than the top 5 (remaining=49947069)
[2024-02-01T19:01:20] Filtering viral contigs with less than 5% coverage (remaining=49947069)
[2024-02-01T19:01:29] Estimating fragment length (mate gap mean=18863.6, mate gap stddev=24841.7, read length mean=39.6217)
[2024-02-01T19:01:37] Filtering read-through fragments with a distance <=10000bp (remaining=48397868)
[2024-02-01T19:01:46] Filtering inconsistently clipped mates (remaining=48397733)
[2024-02-01T19:01:54] Filtering breakpoints adjacent to homopolymers >=6nt (remaining=48394882)
[2024-02-01T19:02:02] Filtering fragments with small insert size (remaining=48394744)
[2024-02-01T19:02:10] Filtering alignments with long gaps (remaining=48394744)
[2024-02-01T19:02:19] Filtering fragments with both mates in the same gene (remaining=48392988)
[2024-02-01T19:02:27] Filtering fusions arising from hairpin structures (remaining=48388080)
[2024-02-01T19:02:37] Filtering reads with a mismatch p-value <=0.01 (remaining=46440036)
[2024-02-01T19:03:28] Filtering reads with low entropy (k-mer content >=60%) (remaining=46381071)
[2024-02-01T19:04:54] Finding fusions and counting supporting reads zsh: killed /Users/dajeong/arriba_v2.4.0/arriba -x -g -a -b -k -p -o
I converted BAM file to fastq and proceeded the downstream analysis.
When I performed FastQC,
Data A>
Measure Value
Filename ATL005_R1_fastq.gz
File type Conventional base calls
Encoding Sanger / Illumina 1.9
Total Sequences 31961691
Total Bases 3.1 Gbp
Sequences flagged as poor quality 0
Sequence length 100
%GC 49
Data B>
Measure Value
Filename 01-15563_T1_R2_fastq.gz
File type Conventional base calls
Encoding Sanger / Illumina 1.9
Total Sequences 57372050
Total Bases 4.3 Gbp
Sequences flagged as poor quality 0
Sequence length 75
%GC 47
============================================================================================
My MacBook has 64GB RAM.
Could it be due to resource shortage?
Is there any way to get RNA fusions.tsv file from >7GB RNA BAM files from data B on my MacBook?
I am confused because there is no big difference between data A and B in terms of BAM file size.
Thanks!
Sincerely,
DJ. J
The text was updated successfully, but these errors were encountered: