Tximeta for novel transcripts by stringite #68

karlaarz · 2022-10-25T10:33:28Z

Hello,

I am trying to run Swish on a novel transcript assembly. I tried to import some Salmon data using the gtf from novel transcripts that I obtained from StringTie. These novel transcripts have "MSTRG" as prefix. I import the data as following:

#Genome setting----
indexDir <- file.path(dir, "salmon_idx_both_zebra")
fasta <- file.path(dir, "both_zebra.fa")
gtf <- file.path(dir, "zebrafish_merged_both.gtf")
makeLinkedTxome(indexDir=indexDir, 
                source="de-novo", 
                organism="Danio rerio",
                release="105", genome="GRCz11", fasta=fasta, gtf=gtf, write=FALSE)

#tximeta----
se <- tximeta(coldata, type = "salmon", useHub=FALSE, skipMeta=TRUE, ignoreTxVersion = TRUE)

>head(rownames(se))
[1] "ENSDART00000189431.1" "ENSDART00000189226.1" "ENSDART00000172037.2" "ENSDART00000165410.2"
[5] "ENSDART00000163675.2" "ENSDART00000172374.2"

However, when I search for these novel transcripts, they seem to be missing.

>grep("MSTRG", rownames(se))
integer(0)

I am using tximeta v1.14.1 and the R version 4.2.0 (2022-04-22).

Any help would be appreciated.

Thanks

The text was updated successfully, but these errors were encountered:

mikelove · 2022-10-25T12:11:46Z

You'd want to check that these transcripts are both in the FASTA that you used to make a Salmon index (and hence in the Salmon quant files), and in the GTF.

E.g.

grep MSTRG sample1/quant.sf | head
grep MSTRG zebrafish_merged_both.gtf | head

karlaarz · 2022-10-25T12:50:54Z

Hi Mike, thanks for your quick response.

Yes, these novel transcripts are in the FASTA, the quant files and in the GTF:

grep MSTRG both_zebra.fa | head
>MSTRG.4.2
>MSTRG.4.4
>MSTRG.2.1
>MSTRG.9.2
>MSTRG.10.2
>MSTRG.6.1
>MSTRG.5.1
>MSTRG.7.1
>MSTRG.7.2
>MSTRG.19.3

grep MSTRG sample1/quant.sf | head
MSTRG.4.2       2476    2476.981        0.000000        0.000
MSTRG.4.4       2610    2645.029        1.863677        131.235
MSTRG.2.1       2322    2290.501        0.610686        37.239
MSTRG.9.2       2504    2868.330        1.006466        76.856
MSTRG.10.2      2296    2782.355        0.105928        7.846
MSTRG.6.1       1289    773.007 4.567673        94.000
MSTRG.5.1       865     715.359 2.362866        45.000
MSTRG.7.1       887     910.781 0.750790        18.205
MSTRG.7.2       1066    1124.650        0.828140        24.795
MSTRG.19.3      1816    1849.839        0.000000        0.000

awk -F "\t" '$3 == "transcript" { print $0 }' zebrafish_merged_both.gtf | awk -F ";" '{print $1";"$2";"$3}' | grep MSTRG | head
1       StringTie       transcript      27690   34330   1000    +       .       gene_id "MSTRG.1"; transcript_id "ENSDART00000162928"; gene_name "eed"
1       StringTie       transcript      27814   31905   1000    +       .       gene_id "MSTRG.1"; transcript_id "ENSDART00000166052"; gene_name "eed"
1       StringTie       transcript      27849   31321   1000    +       .       gene_id "MSTRG.1"; transcript_id "ENSDART00000161955"; gene_name "eed"
1       StringTie       transcript      31787   32353   1000    +       .       gene_id "MSTRG.1"; transcript_id "ENSDART00000163891"; gene_name "eed"
1       StringTie       transcript      18716   23837   1000    +       .       gene_id "MSTRG.2"; transcript_id "MSTRG.2.1";
1       StringTie       transcript      18716   19963   1000    +       .       gene_id "MSTRG.2"; transcript_id "ENSDART00000168451"; gene_name "nfkbiz"
1       StringTie       transcript      18716   23837   1000    +       .       gene_id "MSTRG.2"; transcript_id "ENSDART00000172454"; gene_name "nfkbiz"
1       StringTie       transcript      18747   20331   1000    +       .       gene_id "MSTRG.2"; transcript_id "ENSDART00000172487"; gene_name "nfkbiz"
1       StringTie       transcript      18786   23837   1000    +       .       gene_id "MSTRG.2"; transcript_id "ENSDART00000161190"; gene_name "nfkbiz"
1       StringTie       transcript      6408    12027   1000    -       .       gene_id "MSTRG.3"; transcript_id "ENSDART00000164359"; gene_name "rpl24"

I'm not sure if I might be missing something while importing the data.

mikelove · 2022-10-25T13:22:54Z

ignoreTxVersion = TRUE strips off whatever trails the transcript id string after .

For example if you have ENST1234.10 this becomes ENST1234. So that doesn't work for your case as it is the only thing identifying the different MSTRG transcripts.

Do you need to set this to TRUE?

mikelove · 2022-10-25T13:25:18Z

Oh I also see a bigger problem. These transcripts are named differently in the FASTA/quant.sf and in the GTF.

MSTRG.x.y in the FASTA but look what follows transcript_id in the GTF. tximeta needs to have matching transcript IDs in the FASTA and GTF, it won't do any guesswork.

karlaarz · 2022-10-25T17:22:50Z

I am not getting your point; as for instance, in the examples that I provided, I see that the name of the transcript ID MSTRG.2.1 is the same as in FASTA/quant.sf and GTF files.

And you're right, the ignoreTxVersion = TRUE is not necessary in this case.

mikelove · 2022-10-25T17:41:21Z

GTF doesn't have transcript ids of the form MSTRG.2.1

karlaarz · 2022-10-28T11:31:29Z

what about the following line of the GTF:

1       StringTie       transcript      18716   23837   1000    +       .       gene_id "MSTRG.2"; transcript_id "MSTRG.2.1";

mikelove · 2022-10-28T12:09:15Z

Oh sorry you're right. I think I see what's happening. The code recognizes versions as numbers following the first .. It's fairly common to use . to indicate transcript versions, so . is not often in the other part of the stable identifier, as it produces an ambiguity as to what is the stable ID and what is the version. The fact that StringTie uses . for both separators (gene + txp) and (txp + version) is a bit confusing for downstream software.

One solution would be to fix this in the FASTA, just find replace MSTRG. with MSTRG- everywhere before running Salmon index. Does it take long to re-quantify the files?

This would not just help with tximeta but other tools that would recognize . as separating a version number from a stable ID.

karlaarz · 2022-10-28T13:01:11Z

Ohh awesome! Ok, I will make those changes and re-run Salmon and then tximeta and will let you know how it goes (: thanks!!

karlaarz · 2022-11-07T18:11:30Z

Hi Mike,

Substituting MSTRG. with MSTRG_ was the perfect solution. I re-ran Salmon, imported the data using tximeta and then ran swish, and everything worked perfectly.

Thanks a lot!

mikelove · 2022-11-07T18:16:13Z

Great, I will try to add this to the documentation. Thanks for the report.

mikelove · 2022-11-07T18:36:20Z

added this to the docs in b09fd02

karlaarz closed this as completed Nov 7, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Tximeta for novel transcripts by stringite #68

Tximeta for novel transcripts by stringite #68

karlaarz commented Oct 25, 2022

mikelove commented Oct 25, 2022

karlaarz commented Oct 25, 2022

mikelove commented Oct 25, 2022

mikelove commented Oct 25, 2022

karlaarz commented Oct 25, 2022

mikelove commented Oct 25, 2022

karlaarz commented Oct 28, 2022

mikelove commented Oct 28, 2022

karlaarz commented Oct 28, 2022

karlaarz commented Nov 7, 2022

mikelove commented Nov 7, 2022

mikelove commented Nov 7, 2022

Tximeta for novel transcripts by stringite #68

Tximeta for novel transcripts by stringite #68

Comments

karlaarz commented Oct 25, 2022

mikelove commented Oct 25, 2022

karlaarz commented Oct 25, 2022

mikelove commented Oct 25, 2022

mikelove commented Oct 25, 2022

karlaarz commented Oct 25, 2022

mikelove commented Oct 25, 2022

karlaarz commented Oct 28, 2022

mikelove commented Oct 28, 2022

karlaarz commented Oct 28, 2022

karlaarz commented Nov 7, 2022

mikelove commented Nov 7, 2022

mikelove commented Nov 7, 2022