Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tximeta for novel transcripts by stringite #68

Closed
karlaarz opened this issue Oct 25, 2022 · 12 comments
Closed

Tximeta for novel transcripts by stringite #68

karlaarz opened this issue Oct 25, 2022 · 12 comments

Comments

@karlaarz
Copy link

Hello,

I am trying to run Swish on a novel transcript assembly. I tried to import some Salmon data using the gtf from novel transcripts that I obtained from StringTie. These novel transcripts have "MSTRG" as prefix. I import the data as following:

#Genome setting----
indexDir <- file.path(dir, "salmon_idx_both_zebra")
fasta <- file.path(dir, "both_zebra.fa")
gtf <- file.path(dir, "zebrafish_merged_both.gtf")
makeLinkedTxome(indexDir=indexDir, 
                source="de-novo", 
                organism="Danio rerio",
                release="105", genome="GRCz11", fasta=fasta, gtf=gtf, write=FALSE)

#tximeta----
se <- tximeta(coldata, type = "salmon", useHub=FALSE, skipMeta=TRUE, ignoreTxVersion = TRUE)

>head(rownames(se))
[1] "ENSDART00000189431.1" "ENSDART00000189226.1" "ENSDART00000172037.2" "ENSDART00000165410.2"
[5] "ENSDART00000163675.2" "ENSDART00000172374.2"

However, when I search for these novel transcripts, they seem to be missing.

>grep("MSTRG", rownames(se))
integer(0)

I am using tximeta v1.14.1 and the R version 4.2.0 (2022-04-22).

Any help would be appreciated.

Thanks

@mikelove
Copy link
Collaborator

You'd want to check that these transcripts are both in the FASTA that you used to make a Salmon index (and hence in the Salmon quant files), and in the GTF.

E.g.

grep MSTRG sample1/quant.sf | head
grep MSTRG zebrafish_merged_both.gtf | head

@karlaarz
Copy link
Author

Hi Mike, thanks for your quick response.

Yes, these novel transcripts are in the FASTA, the quant files and in the GTF:

grep MSTRG both_zebra.fa | head
>MSTRG.4.2
>MSTRG.4.4
>MSTRG.2.1
>MSTRG.9.2
>MSTRG.10.2
>MSTRG.6.1
>MSTRG.5.1
>MSTRG.7.1
>MSTRG.7.2
>MSTRG.19.3
grep MSTRG sample1/quant.sf | head
MSTRG.4.2       2476    2476.981        0.000000        0.000
MSTRG.4.4       2610    2645.029        1.863677        131.235
MSTRG.2.1       2322    2290.501        0.610686        37.239
MSTRG.9.2       2504    2868.330        1.006466        76.856
MSTRG.10.2      2296    2782.355        0.105928        7.846
MSTRG.6.1       1289    773.007 4.567673        94.000
MSTRG.5.1       865     715.359 2.362866        45.000
MSTRG.7.1       887     910.781 0.750790        18.205
MSTRG.7.2       1066    1124.650        0.828140        24.795
MSTRG.19.3      1816    1849.839        0.000000        0.000
awk -F "\t" '$3 == "transcript" { print $0 }' zebrafish_merged_both.gtf | awk -F ";" '{print $1";"$2";"$3}' | grep MSTRG | head
1       StringTie       transcript      27690   34330   1000    +       .       gene_id "MSTRG.1"; transcript_id "ENSDART00000162928"; gene_name "eed"
1       StringTie       transcript      27814   31905   1000    +       .       gene_id "MSTRG.1"; transcript_id "ENSDART00000166052"; gene_name "eed"
1       StringTie       transcript      27849   31321   1000    +       .       gene_id "MSTRG.1"; transcript_id "ENSDART00000161955"; gene_name "eed"
1       StringTie       transcript      31787   32353   1000    +       .       gene_id "MSTRG.1"; transcript_id "ENSDART00000163891"; gene_name "eed"
1       StringTie       transcript      18716   23837   1000    +       .       gene_id "MSTRG.2"; transcript_id "MSTRG.2.1";
1       StringTie       transcript      18716   19963   1000    +       .       gene_id "MSTRG.2"; transcript_id "ENSDART00000168451"; gene_name "nfkbiz"
1       StringTie       transcript      18716   23837   1000    +       .       gene_id "MSTRG.2"; transcript_id "ENSDART00000172454"; gene_name "nfkbiz"
1       StringTie       transcript      18747   20331   1000    +       .       gene_id "MSTRG.2"; transcript_id "ENSDART00000172487"; gene_name "nfkbiz"
1       StringTie       transcript      18786   23837   1000    +       .       gene_id "MSTRG.2"; transcript_id "ENSDART00000161190"; gene_name "nfkbiz"
1       StringTie       transcript      6408    12027   1000    -       .       gene_id "MSTRG.3"; transcript_id "ENSDART00000164359"; gene_name "rpl24"

I'm not sure if I might be missing something while importing the data.

@mikelove
Copy link
Collaborator

ignoreTxVersion = TRUE strips off whatever trails the transcript id string after .

For example if you have ENST1234.10 this becomes ENST1234. So that doesn't work for your case as it is the only thing identifying the different MSTRG transcripts.

Do you need to set this to TRUE?

@mikelove
Copy link
Collaborator

Oh I also see a bigger problem. These transcripts are named differently in the FASTA/quant.sf and in the GTF.

MSTRG.x.y in the FASTA but look what follows transcript_id in the GTF. tximeta needs to have matching transcript IDs in the FASTA and GTF, it won't do any guesswork.

@karlaarz
Copy link
Author

I am not getting your point; as for instance, in the examples that I provided, I see that the name of the transcript ID MSTRG.2.1 is the same as in FASTA/quant.sf and GTF files.

And you're right, the ignoreTxVersion = TRUE is not necessary in this case.

@mikelove
Copy link
Collaborator

GTF doesn't have transcript ids of the form MSTRG.2.1

@karlaarz
Copy link
Author

what about the following line of the GTF:

1       StringTie       transcript      18716   23837   1000    +       .       gene_id "MSTRG.2"; transcript_id "MSTRG.2.1";

@mikelove
Copy link
Collaborator

Oh sorry you're right. I think I see what's happening. The code recognizes versions as numbers following the first .. It's fairly common to use . to indicate transcript versions, so . is not often in the other part of the stable identifier, as it produces an ambiguity as to what is the stable ID and what is the version. The fact that StringTie uses . for both separators (gene + txp) and (txp + version) is a bit confusing for downstream software.

One solution would be to fix this in the FASTA, just find replace MSTRG. with MSTRG- everywhere before running Salmon index. Does it take long to re-quantify the files?

This would not just help with tximeta but other tools that would recognize . as separating a version number from a stable ID.

@karlaarz
Copy link
Author

Ohh awesome! Ok, I will make those changes and re-run Salmon and then tximeta and will let you know how it goes (: thanks!!

@karlaarz
Copy link
Author

karlaarz commented Nov 7, 2022

Hi Mike,

Substituting MSTRG. with MSTRG_ was the perfect solution. I re-ran Salmon, imported the data using tximeta and then ran swish, and everything worked perfectly.

Thanks a lot!

@karlaarz karlaarz closed this as completed Nov 7, 2022
@mikelove
Copy link
Collaborator

mikelove commented Nov 7, 2022

Great, I will try to add this to the documentation. Thanks for the report.

@mikelove
Copy link
Collaborator

mikelove commented Nov 7, 2022

added this to the docs in b09fd02

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants