Finding fusions and counting supporting reads zsh: killed #229

jdjdj0202 · 2024-02-02T01:13:49Z

Hi,
I am using Arriba to analyze RNA fusions.

I downloaded public data from two different sources (A and B).
I could successfully get final tsv files from RNA sequencing data (Aligned.sortedByCoord.out.bam BAM file with ~5-7 GB) from A.
The size of fusions.tsv files was about 170KB.

However, with data from B, Arriba took much longer to run.
It worked with RNA BAM files of ~6 GB (final tsv file size: 110 MB), but it failed on BAM files larger than 7 GB with the following error message:

(base) xxx@xxxcBookPro data % /Users/dajeong/arriba_v2.4.0/arriba -x /Users/dajeong/STAR-Fusion/output/01-17618_T1_STAR_Aligned.out.bam -g /Users/dajeong/Arriba_DJ/GENCODE38.gtf -a /Users/dajeong/Arriba_DJ/hg38.fa
-b /Users/dajeong/arriba_v2.4.0/database/blacklist_hg38_GRCh38_v2.4.0.tsv.gz -k /Users/dajeong/arriba_v2.4.0/database/known_fusions_hg38_GRCh38_v2.4.0.tsv.gz -p /Users/dajeong/arriba_v2.4.0/database/protein_domains_hg38_GRCh38_v2.4.0.gff3
-o /Users/dajeong/Arriba_DJ/output/01-17618_T1_fusions.tsv
[2024-02-01T18:53:46] Launching Arriba 2.4.0
[2024-02-01T18:53:46] Loading assembly from '/Users/dajeong/Arriba_DJ/hg38.fa'
[2024-02-01T18:54:03] Loading annotation from '/Users/dajeong/Arriba_DJ/GENCODE38.gtf'
[2024-02-01T18:54:09] Reading chimeric alignments from '/Users/dajeong/STAR-Fusion/output/01-17618_T1_STAR_Aligned.out.bam' (total=54199648)
[2024-02-01T18:57:53] Marking multi-mapping alignments (marked=160621)
[2024-02-01T18:58:10] Detecting strandedness (reverse)
[2024-02-01T18:58:12] Assigning strands to alignments
[2024-02-01T18:58:20] Annotating alignments
[2024-02-01T19:00:20] Filtering duplicates (remaining=50395284)
[2024-02-01T19:00:48] Filtering mates which do not map to interesting contigs (1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 X Y AC_* NC_*) (remaining=49947069)
[2024-02-01T19:00:56] Filtering mates which only map to viral contigs (AC_* NC_*) (remaining=49947069)
[2024-02-01T19:01:04] Filtering viral contigs with expression lower than the top 5 (remaining=49947069)
[2024-02-01T19:01:20] Filtering viral contigs with less than 5% coverage (remaining=49947069)
[2024-02-01T19:01:29] Estimating fragment length (mate gap mean=18863.6, mate gap stddev=24841.7, read length mean=39.6217)
[2024-02-01T19:01:37] Filtering read-through fragments with a distance <=10000bp (remaining=48397868)
[2024-02-01T19:01:46] Filtering inconsistently clipped mates (remaining=48397733)
[2024-02-01T19:01:54] Filtering breakpoints adjacent to homopolymers >=6nt (remaining=48394882)
[2024-02-01T19:02:02] Filtering fragments with small insert size (remaining=48394744)
[2024-02-01T19:02:10] Filtering alignments with long gaps (remaining=48394744)
[2024-02-01T19:02:19] Filtering fragments with both mates in the same gene (remaining=48392988)
[2024-02-01T19:02:27] Filtering fusions arising from hairpin structures (remaining=48388080)
[2024-02-01T19:02:37] Filtering reads with a mismatch p-value <=0.01 (remaining=46440036)
[2024-02-01T19:03:28] Filtering reads with low entropy (k-mer content >=60%) (remaining=46381071)
[2024-02-01T19:04:54] Finding fusions and counting supporting reads zsh: killed /Users/dajeong/arriba_v2.4.0/arriba -x -g -a -b -k -p -o

I converted the BAM files to FASTQ and proceeded with the downstream analysis.
FastQC reports the following:

Data A>
Measure Value
Filename ATL005_R1_fastq.gz
File type Conventional base calls
Encoding Sanger / Illumina 1.9
Total Sequences 31961691
Total Bases 3.1 Gbp
Sequences flagged as poor quality 0
Sequence length 100
%GC 49

Data B>
Measure Value
Filename 01-15563_T1_R2_fastq.gz
File type Conventional base calls
Encoding Sanger / Illumina 1.9
Total Sequences 57372050
Total Bases 4.3 Gbp
Sequences flagged as poor quality 0
Sequence length 75
%GC 47

============================================================================================

My MacBook has 64GB RAM.
Could it be due to resource shortage?
Is there any way to get RNA fusions.tsv file from >7GB RNA BAM files from data B on my MacBook?

I am confused because there is no big difference between data A and B in terms of BAM file size.

Thanks!

Sincerely,
DJ. J

suhrig · 2024-02-03T01:09:58Z

Something does not seem right about this sample. In fact, the situation looks very similar to your other issue #228. Since it's public data, can you point me to the source from which you obtained it? Then I can take a look at it myself.

jdjdj0202 · 2024-02-05T00:57:22Z

Thank you for your reply.
After receiving data access approval from EGA, I downloaded the RNA data.
It was in BAM file format, which I converted into a FASTQ file, then used STAR-Align followed by Arriba.
The dataset ID is EGAD00001006268 = EGAS00001004289.
Among the dataset, I tried EGAF00004171085 (file name: 01-15563_T1.rna.bam) and EGAF00004171086 (file name: 01-17618_T1.rna.bam) etc.

(If access to the restricted data is difficult, would it be okay for me to explain the situation to the data owner and seek permission?)

Thanks a lot.

Sincerely,
DJ. J

suhrig · 2024-02-09T23:48:03Z

Apologies for the slow response - it has been a busy week.

I didn't realize it's restricted data from EGA. I doubt it will be possible to share the data with me. So we will need to resort to remote troubleshooting.

Can you run samtools flagstat on the problematic BAM file, please? Also, do you have the STAR Log.final.out file and could share it with me?

Looking at the status messages from Arriba, about 54 million reads were considered as chimeric. That's almost all reads. This could happen when read1 and read2 were accidentally swapped when given as input to STAR. Can you rule this out?

jdjdj0202 · 2024-02-26T04:41:16Z

I am very sorry for the late response due to personal reasons. I have tried running samtools flagstat and the outcome is as follows:

108111052 + 6633048 in total (QC-passed reads + QC-failed reads)
108111052 + 6633048 primary
0 + 0 secondary
0 + 0 supplementary
8903170 + 805612 duplicates
8903170 + 805612 primary duplicates
107740389 + 5690274 mapped (99.66% : 85.79%)
107740389 + 5690274 primary mapped (99.66% : 85.79%)
108111052 + 6633048 paired in sequencing
54055526 + 3316524 read1
54055526 + 3316524 read2
92916680 + 4440744 properly paired (85.95% : 66.95%)
107438656 + 5331394 with itself and mate mapped
301733 + 358880 singletons (0.28% : 5.41%)
10155368 + 676362 with mate mapped to a different chr
1489734 + 157793 with mate mapped to a different chr (mapQ>=5)

I could not attach the STAR Log.final.out file here, so I have sent it by email.

Thanks!

suhrig · 2024-02-26T08:32:47Z

Unfortunately, the log file was also dropped by mail. Can you just copy-paste the content as a reply to this thread?

The flagstats already look weird. There are no supplementary alignments. Can you show me the STAR command, please?

jdjdj0202 · 2024-02-26T10:13:46Z

Thank you for your reply.

The STAR_log.final.out file result is as follows:
Started job on | Jan 22 19:51:31
Started mapping on | Jan 22 19:51:44
Finished on | Jan 22 19:58:08
Mapping speed, Million of reads per hour | 537.86

                      Number of input reads |	57372050
                  Average input read length |	150
                                UNIQUE READS:
               Uniquely mapped reads number |	4053986
                    Uniquely mapped reads % |	7.07%
                      Average mapped length |	148.53
                   Number of splices: Total |	593449
        Number of splices: Annotated (sjdb) |	576252
                   Number of splices: GT/AG |	577965
                   Number of splices: GC/AG |	7210
                   Number of splices: AT/AC |	256
           Number of splices: Non-canonical |	8018
                  Mismatch rate per base, % |	0.51%
                     Deletion rate per base |	0.01%
                    Deletion average length |	1.33
                    Insertion rate per base |	0.03%
                   Insertion average length |	1.33
                         MULTI-MAPPING READS:
    Number of reads mapped to multiple loci |	3707714
         % of reads mapped to multiple loci |	6.46%
    Number of reads mapped to too many loci |	73616
         % of reads mapped to too many loci |	0.13%
                              UNMAPPED READS:

Number of reads unmapped: too many mismatches | 0
% of reads unmapped: too many mismatches | 0.00%
Number of reads unmapped: too short | 49489970
% of reads unmapped: too short | 86.26%
Number of reads unmapped: other | 39041
% of reads unmapped: other | 0.07%
CHIMERIC READS:
Number of chimeric reads | 40096610
% of chimeric reads | 69.89%

The STAR command is as follows:

Convert BAM to FASTQ

/Users/dajeong/samtools-1.18/samtools fastq /Users/dajeong/STAR-Fusion/data/01-15563_T1.rna.bam -1 /Users/dajeong/STAR-Fusion/data/01-15563_T1_R1_fastq.gz -2 /Users/dajeong/STAR-Fusion/data/01-15563_T1_R2_fastq.gz

Unzip FASTQ files

gunzip -k /Users/dajeong/STAR-Fusion/data/01-15563_T1_R1_fastq
gunzip -k /Users/dajeong/STAR-Fusion/data/01-15563_T1_R2_fastq

Run STAR alignment

/Users/dajeong/STAR/bin/MacOSX_x86_64/STAR --runThreadN 5 \
--genomeDir /Users/dajeong/Arriba_DJ/STAR_index_hg38_GENCODE38 \
--readFilesIn /Users/dajeong/STAR-Fusion/data/01-15563_T1_R1_fastq /Users/dajeong/STAR-Fusion/data/01-15563_T1_R2_fastq \
--outFileNamePrefix /Users/dajeong/STAR-Fusion/output/01-15563_T1_ \
--chimSegmentMin 20 \
--chimOutType WithinBAM \
--alignMatesGapMax 100000 \
--alignIntronMax 100000 \
--chimJunctionOverhangMin 20 \
--outSAMtype BAM Unsorted

Rename the output file

mv /Users/dajeong/STAR-Fusion/output/A37987_Aligned.out.bam /Users/dajeong/STAR-Fusion/output/A37987_STAR_Aligned.out.bam

Run Arriba for fusion detection

/Users/dajeong/arriba_v2.4.0/arriba -x /Users/dajeong/STAR-Fusion/output/01-15563_T1_STAR_Aligned.out.bam -g /Users/dajeong/Arriba_DJ/GENCODE38.gtf -a /Users/dajeong/Arriba_DJ/hg38.fa \
-b /Users/dajeong/arriba_v2.4.0/database/blacklist_hg38_GRCh38_v2.4.0.tsv.gz -k /Users/dajeong/arriba_v2.4.0/database/known_fusions_hg38_GRCh38_v2.4.0.tsv.gz -p /Users/dajeong/arriba_v2.4.0/database/protein_domains_hg38_GRCh38_v2.4.0.gff3 \
-o /Users/dajeong/Arriba_DJ/output/01-15563_T1_fusions.tsv

Thanks :)

suhrig · 2024-02-26T12:22:04Z

The problem could be that you haven't collated the paired-end mates before converting to FastQ. The mates will be mixed up if you don't do this.

Would you please try the following command for the conversion?

samtools collate -f -O -u -r 1000000  /Users/dajeong/STAR-Fusion/data/01-15563_T1.rna.bam |
samtools fastq -0 other.fastq -1 read1.fastq -2 read2.fastq -s singletons.fastq
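
Once the FastQ files exist, it may help to confirm that read 1 and read 2 list the same read names in the same order before passing them to STAR. The following is only a rough sketch: `check_pairs` is a hypothetical helper name (not part of samtools or Arriba), and it assumes uncompressed FastQ files with exactly four lines per record. It prints the number of records whose names differ between the two files.

```shell
# Hypothetical helper: count records whose read names differ between two
# FastQ files (assumes uncompressed files with 4 lines per record).
check_pairs() {
    # paste joins line N of file 1 with line N of file 2, separated by a tab
    paste "$1" "$2" | awk -F '\t' '
        function clean(s) {
            sub(/^@/, "", s)       # drop the leading "@"
            sub(/[ ].*$/, "", s)   # drop any comment after the read name
            sub(/\/[12]$/, "", s)  # drop a trailing /1 or /2 mate suffix
            return s
        }
        # only the first line of each 4-line record holds the read name
        NR % 4 == 1 && clean($1) != clean($2) { bad++ }
        END { print bad + 0 }
    '
}

# Example call (file names as in the command above):
#   check_pairs read1.fastq read2.fastq
```

A result of 0 suggests the mates are in sync; any other value means the two files are ordered differently or differ in length.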

jdjdj0202 · 2024-02-27T11:27:40Z

Thank you for your reply.
I tried the command line for the conversion.
I think it gives a similar output file, with more than 20,000 fusion results.
Also, the Arriba run took about 5 hours.
The STAR_log.final.out file looks as follows:

Started job on | Feb 27 14:53:10
Started mapping on | Feb 27 14:53:31
Finished on | Feb 27 14:59:55
Mapping speed, Million of reads per hour | 537.86

                      Number of input reads |	57372050
                  Average input read length |	150
                                UNIQUE READS:
               Uniquely mapped reads number |	4053986
                    Uniquely mapped reads % |	7.07%
                      Average mapped length |	148.53
                   Number of splices: Total |	593449
        Number of splices: Annotated (sjdb) |	576252
                   Number of splices: GT/AG |	577965
                   Number of splices: GC/AG |	7210
                   Number of splices: AT/AC |	256
           Number of splices: Non-canonical |	8018
                  Mismatch rate per base, % |	0.51%
                     Deletion rate per base |	0.01%
                    Deletion average length |	1.33
                    Insertion rate per base |	0.03%
                   Insertion average length |	1.33
                         MULTI-MAPPING READS:
    Number of reads mapped to multiple loci |	3707714
         % of reads mapped to multiple loci |	6.46%
    Number of reads mapped to too many loci |	73616
         % of reads mapped to too many loci |	0.13%
                              UNMAPPED READS:

It looks the same as the previous result.
Is there anything more I can try?

Thanks. :)

suhrig · 2024-02-27T21:27:13Z

Clearly, something must be going wrong during the conversion from BAM to FastQ. It is not normal that 70% of the reads are chimeric. This number is typically well below 10%. Also, Arriba usually takes only a few minutes to run and the main output file should only be a few kb in size.
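
To keep an eye on this number across samples, the percentage can be pulled out of the STAR log before Arriba is started. This is a minimal sketch, assuming the log uses the `label | value` layout shown in this thread; `star_stat` is a hypothetical helper name, not a STAR or Arriba command.

```shell
# Hypothetical helper: print the value of one field from a STAR Log.final.out.
#   $1 = field label exactly as it appears left of the "|"
#   $2 = path to the log file
star_stat() {
    awk -F '|' -v label="$1" '
        {
            key = $1
            gsub(/^[ \t]+|[ \t]+$/, "", key)   # trim whitespace around the label
            if (key == label) {
                val = $2
                gsub(/[ \t%]/, "", val)        # drop whitespace and the percent sign
                print val
                exit
            }
        }
    ' "$2"
}

# Example call (the path is a placeholder):
#   star_stat "% of chimeric reads" /path/to/Log.final.out
```

A value far above 10 would be a hint to inspect the input files before spending hours on the Arriba run.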

The flagstats you sent me are from the original BAM file, not your realigned BAM file, right? Things still look okay in the original file, then. In this case, can you check the following:

Pick a random read pair from the original BAM file. Note down the sequences from both mates. Next, extract the sequences of the read pair with the same name from the converted FastQs, e.g., using grep -w -A1 NAME-OF-READ read1.fastq read2.fastq. Both sequences should be the same as in the original BAM file or the reverse complement. Can you confirm this with a couple of randomly selected read pairs?
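
The comparison can be done by eye, but a small helper avoids mistakes with 75 nt sequences. The sketch below uses hypothetical function names (`revcomp`, `same_read`) and assumes the sequences contain only A, C, G, T, and N. A BAM record stores the sequence on the forward strand of the reference, so a read aligned to the reverse strand shows up reverse-complemented relative to the FastQ.

```shell
# Reverse-complement a DNA sequence passed as the first argument.
revcomp() {
    printf '%s\n' "$1" |
        tr 'ACGTNacgtn' 'TGCANtgcan' |   # complement each base
        awk '{ out = ""; for (i = length($0); i > 0; i--) out = out substr($0, i, 1); print out }'   # reverse the string
}

# Hypothetical helper: succeed if the two sequences are identical or
# reverse complements of each other.
#   $1 = sequence from the BAM file, $2 = sequence from the FastQ file
same_read() {
    [ "$1" = "$2" ] || [ "$(revcomp "$1")" = "$2" ]
}

# Example call (BAM_SEQ and FASTQ_SEQ are placeholders):
#   same_read "$BAM_SEQ" "$FASTQ_SEQ" && echo "match" || echo "MISMATCH"
```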

suhrig · 2024-03-09T13:00:49Z

Hi, did you find some time to inspect the read sequences as explained in my previous post? Are my instructions clear enough? I'm pretty sure this has to do with the BAM file not being converted to FastQs properly.

jdjdj0202 · 2024-03-14T05:23:41Z

Hi, I am very sorry for the late reply. Thank you for suggesting various solutions to try.

First, the flagstats are indeed from the original BAM file. The command used at that time is as follows.

========================================================================
samtools collate -f -O -u -r 1000000 /Users/dajeong/STAR-Fusion/data/01-15563_T1.rna.bam |
samtools fastq -0 other.fastq -1 read1.fastq -2 read2.fastq -s singletons.fastq

Convert BAM to FASTQ

/Users/dajeong/samtools-1.18/samtools fastq /Users/dajeong/STAR-Fusion/data/01-15563_T1.rna.bam -1 /Users/dajeong/STAR-Fusion/data/01-15563_T1_R1_fastq.gz -2 /Users/dajeong/STAR-Fusion/data/01-15563_T1_R2_fastq.gz

Unzip FASTQ files

gunzip -k /Users/dajeong/STAR-Fusion/data/01-15563_T1_R1_fastq
gunzip -k /Users/dajeong/STAR-Fusion/data/01-15563_T1_R2_fastq

Run STAR alignment

/Users/dajeong/STAR/bin/MacOSX_x86_64/STAR --runThreadN 5 \
--genomeDir /Users/dajeong/Arriba_DJ/STAR_index_hg38_GENCODE38 \
--readFilesIn /Users/dajeong/STAR-Fusion/data/01-15563_T1_R1_fastq /Users/dajeong/STAR-Fusion/data/01-15563_T1_R2_fastq \
--outFileNamePrefix /Users/dajeong/STAR-Fusion/output/01-15563_T1_ \
--chimSegmentMin 20 \
--chimOutType WithinBAM \
--alignMatesGapMax 100000 \
--alignIntronMax 100000 \
--chimJunctionOverhangMin 20 \
--outSAMtype BAM Unsorted

==============================================================================
Also, the results of following your previous instructions are as follows.

##############################################################################
Trial 1.
samtools view /Users/dajeong/STAR-Fusion/data/01-15563_T1.rna.bam | gshuf -n 1 > random_read_second.txt

HS27_221:4:1213:7746:97787 83 chr7 139877304 60 75M = 139877168 -211 GCATAGCTTCTTGCTTTAAGCGTGATAGGTGCTCGATAAAGGCTCAAGAAATTGAACTGTCTACATTTCTCTTAC E1G@BE>1=<1CDFGGF>FBGGGGFF1DFC<0GGC<@1g>EGFF=:1C1GGF@;GEF@1>GGGGGGGGF1@BBB@ MD:Z:75 PG:Z:MarkDuplicates RG:Z:226820 NM:i:0 AS:i:75 XS:i:0

grep -w -A1 -h "HS27_221:4:1213:7746:97787" /Users/dajeong/STAR-Fusion/data/01-15563_T1_R1_fastq > extracted_sequence_read1_second.txt

@HS27_221:4:1213:7746:97787
GTAAGAGAAATGTAGACAGTTCAATTTCTTGAGCCTTTATCGAGCACCTATCACGCTTAAAGCAAGAAGCTATGC

grep -w -A1 -h "HS27_221:4:1213:7746:97787" /Users/dajeong/STAR-Fusion/data/01-15563_T1_R2_fastq > extracted_sequence_read2_second.txt

@HS27_221:4:1213:7746:97787
GGAGCAGGGAGGTGTACTCCTCGGATGGACGAGCGGGGAGGCAAGGCGTGTCTTTAAATAACAAGCAACCCTGCT

##############################################################################

Trial 2.

samtools view /Users/dajeong/STAR-Fusion/data/01-15563_T1.rna.bam | gshuf -n 1 > random_read_third.txt

HS27_221:4:1109:11102:98424 99 chr2 196163272 35 75M = 196163385 8093 CTTTAGATGTAAGTATATAGAAATTATTAAAGTTTTCCATTTTTATTGGAATTTGAGGAGTTGTAGTTAGTAGGC CCCCCEFGGGGGGGGGGGGGGGGEGGGGGGGGGGGGGFGGGGGGGGGGGGGGGGGGGEGGGGGGGEFGGGGGGGG MD:Z:75 PG:Z:MarkDuplicates RG:Z:226820 NM:i:0 AS:i:75 XS:i:63

grep -w -A1 -h "HS27_221:4:1109:11102:98424" /Users/dajeong/STAR-Fusion/data/01-15563_T1_R1_fastq > extracted_sequence_read1_third.txt

@HS27_221:4:1109:11102:98424
CTTTAGATGTAAGTATATAGAAATTATTAAAGTTTTCCATTTTTATTGGAATTTGAGGAGTTGTAGTTAGTAGGC

grep -w -A1 -h "HS27_221:4:1109:11102:98424" /Users/dajeong/STAR-Fusion/data/01-15563_T1_R2_fastq > extracted_sequence_read2_third.txt

@HS27_221:4:1109:11102:98424
CAGTCCCGGGAGAACCTGCGGCGGCCGGAGCGGTAAAAATAAGTGACTAAAGAAGCAGACCTGGGAATCACCTAA

##############################################################################

Does this result suggest a problem in the file conversion process?
I would appreciate your feedback.

Thanks!

suhrig · 2024-03-15T13:09:30Z

Thanks for the additional details. This looks good so far. May I ask you for another piece of information? Can you please run the following command on the BAM file which you have created (i.e., the one after realigning the reads)? The command extracts the randomly selected reads:

samtools view /path/to/new_alignments.bam | grep -wP 'HS27_221:4:1213:7746:97787|HS27_221:4:1109:11102:98424'

Another thing you might want to try is to run Arriba on the original BAM file. Does it also crash here? If not, then this would confirm that the problem must arise during conversion to FastQ or realignment.

jdjdj0202 · 2024-03-18T06:21:09Z

Thank you for providing additional guidance.
Firstly, the result of executing the command you provided is as follows. I'm not sure what they mean.

samtools view /Users/dajeong/STAR-Fusion/output/01-15563_T1_STAR_Aligned.out.bam | grep -wE 'HS27_221:4:1213:7746:97787|HS27_221:4:1109:11102:98424'

(base) dajeong@DajeongcBookPro data % samtools view /Users/dajeong/STAR-Fusion/output/01-15563_T1_STAR_Aligned.out.bam | grep -wE 'HS27_221:4:1213:7746:97787|HS27_221:4:1109:11102:98424'
HS27_221:4:1109:11102:98424 97 chr2 196163272 255 75M = 196136014 0 CTTTAGATGTAAGTATATAGAAATTATTAAAGTTTTCCATTTTTATTGGAATTTGAGGAGTTGTAGTTAGTAGGC CCCCCEFGGGGGGGGGGGGGGGGEGGGGGGGGGGGGGFGGGGGGGGGGGGGGGGGGGEGGGGGGGEFGGGGGGGG NH:i:1 HI:i:1 AS:i:73 nM:i:0 NM:i:0
HS27_221:4:1109:11102:98424 145 chr2 196136014 255 75M = 196163272 0 GTTATTCAAAAAGTTTTAAGTTATTATTAAAGAATTTGGAAACTTACCAAAATTTTGAGAAGTTAAAGGTCTAAA FGCGGGGDGGG1GGGFEGCGFC1GEGCCGGGEGGGGGGGGCEGGGGGGGGGFC1BF1>GEGGGGGGGGGGBBBBB NH:i:1 HI:i:1 AS:i:73 nM:i:0 NM:i:0
HS27_221:4:1213:7746:97787 81 chr7 139877304 255 75M = 138570982 0 GCATAGCTTCTTGCTTTAAGCGTGATAGGTGCTCGATAAAGGCTCAAGAAATTGAACTGTCTACATTTCTCTTAC E1G@BE>1=<1CDFGGF>FBGGGGFF1DFC<0GGC<@1g>EGFF=:1C1GGF@;GEF@1>GGGGGGGGF1@BBB@ NH:i:1 HI:i:1 AS:i:73 nM:i:0 NM:i:0
HS27_221:4:1213:7746:97787 161 chr7 138570982 255 22M2503N53M = 139877304 0 TAATCTTCCCTCTCTTCCGGATATTGACTGTTCAAGTACTATTATGCTGGACAATATTGTGAGGAAAGATACTAA BBBBBGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG NH:i:1 HI:i:1 AS:i:74 nM:i:0 NM:i:0

Moreover, running Arriba with the original BAM file took about 5 minutes and yielded 45 low confidence fusion results.

If the problem arises from converting BAM to FastQ files, how could it be resolved?
Thanks!

suhrig · 2024-03-18T19:28:46Z

The output from the grep command confirms that your realigned BAM file does not have matching paired-end read sequences anymore. Read 2 is different from what it says in the FastQ file. The most likely explanation is that the order of the reads is different in fastq1 and fastq2, which is usually caused by improper conversion from BAM to FastQ.
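
The second column of each SAM record is a bitwise flag, and decoding it makes the pairing status easier to read. The following sketch uses the bit values defined in the SAM specification; `decode_flag` is a hypothetical helper name, and the short labels are chosen here for illustration only.

```shell
# Hypothetical helper: print the names of all bits set in a SAM flag.
decode_flag() {
    flag=$1
    out=""
    # bit values as defined in the SAM specification; labels are illustrative
    for pair in 1:paired 2:proper_pair 4:unmapped 8:mate_unmapped 16:reverse \
                32:mate_reverse 64:read1 128:read2 256:secondary 512:qc_fail \
                1024:duplicate 2048:supplementary; do
        bit=${pair%%:*}    # number before the colon
        name=${pair#*:}    # label after the colon
        if [ $(( flag & bit )) -ne 0 ]; then
            out="${out:+$out,}$name"   # append, comma-separated
        fi
    done
    printf '%s\n' "$out"
}

# Example calls with flags that appear in this thread:
#   decode_flag 83    # record from the original BAM file
#   decode_flag 97    # record from the realigned BAM file
```

For the two reads inspected here, the records from the original BAM file (flags 83 and 99) carry the properly-paired bit, whereas none of the four realigned records (flags 97, 145, 81, 161) do.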

You can try a different tool for the conversion, for example from the biobambam package:

conda install bioconda::biobambam
bamtofastq filename=/path/to/original_alignments.bam F=read1.fastq F2=read2.fastq S=/dev/null gz=0 collate=1

suhrig · 2024-04-17T11:19:19Z

Did you have more success with biobambam?

jdjdj0202 · 2024-04-30T05:07:14Z

I apologize for the late reply.
I am having difficulty using biobambam in a MacBook environment, so it might take more time.
I will post an update once it is done.
Thank you so much.
