Idea: Supporting alevin output #24

PeteHaitch · 2018-08-16T03:52:38Z

(Unsure if this idea belongs here, in the tximeta repo, or in the salmon repo. Apologies if I guessed wrong. Tagging @rob-p and @k3yavi for their thoughts).

I'm looking into using the alevin tool from salmon for processing 3'-tag scRNA-seq data (currently just in the planning stage for such a workflow).
In such a workflow I'll ultimately want to get the gene-sample count matrix created by alevin into a SingleCellExperiment (ideally with all the metadata via the very awesome tximeta).

Looking at the alevin docs (https://salmon.readthedocs.io/en/latest/alevin.html#output):

by default alevin dumps a per-cell level gene-count matrix in a binary-compressed format with the row and column indexes in a separate file.
A typical run of alevin will generate 3 files:
quants_mat.gz – Compressed count matrix.
quants_mat_cols.txt – Column Header (Gene-ids) of the matrix.
quants_mat_rows.txt – Row Index (CB-ids) of the matrix.
Alevin can also dump the count-matrix in a human readable – comma-separated-value (CSV) format

It's probably worth noting that the CSV version has cells along the rows and genes along the columns (i.e. opposite of SingleCellExperiment).
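To make the orientation point concrete, here is a minimal sketch of reading a cells-by-genes CSV and flipping it to genes-by-cells, the orientation SingleCellExperiment expects. The exact CSV layout is an assumption for illustration (a header row of gene ids whose first field labels the barcode column, then one row per cell barcode); the function name `read_cells_by_genes_csv` is hypothetical and is not part of alevin or tximport.

```python
import csv
import io


def read_cells_by_genes_csv(handle):
    """Parse a cells-by-genes CSV and return (genes, cells, genes_by_cells).

    ASSUMED layout (not confirmed by the alevin docs): the header row holds
    gene ids after a leading label field, and each later row starts with the
    cell barcode followed by one count per gene.
    """
    reader = csv.reader(handle)
    header = next(reader)
    genes = header[1:]
    cells = []
    rows = []
    for record in reader:
        if not record:
            continue  # skip blank lines
        cells.append(record[0])
        rows.append([float(value) for value in record[1:]])
    # zip(*rows) turns each input column (a gene) into an output row.
    genes_by_cells = [list(column) for column in zip(*rows)]
    return genes, cells, genes_by_cells


demo = io.StringIO("cell,geneA,geneB,geneC\nAAAC,1,0,2\nTTTG,0,3,0\n")
genes, cells, matrix = read_cells_by_genes_csv(demo)
```

After the transpose, `matrix` has one row per gene and one column per cell, so `genes` labels the rows and `cells` labels the columns.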

So a few questions:

@mikelove

Have you explored importing alevin output into R/Bioconductor?
Would you be interested in having this functionality in tximport?

@rob-p and @k3yavi

You note that the output format for scRNA-seq is itself an area of open research. Do you think the current formats are stable enough for me (and others?) to invest time in writing importers for R/Bioconductor?
Is there a more detailed description of the 'compressed count matrix' format? (I'd like to see whether it maps to one of the sparse matrix formats in the R Matrix package.)
Somewhat tangentially, have you considered HDF5 support (recognising that I think this is quite a heavy thing to incorporate and probably overkill for experiments with a few thousand cells)?

Thanks, I'm eager to hear your thoughts
Pete

mikelove · 2018-08-17T14:42:51Z

Thanks Pete,

This looks like a good suggestion (I haven't tried importing yet), and tximport probably makes sense as the place where it should live.

First thing I need to start working on it is some minimal example data that I can put into tximportData so we can have proper testing and vignette code. I'll coordinate with @rob-p and @k3yavi on it.

For timeline, I'm pretty busy the next few weeks with revisions and school starting, but we'll make it happen. Cool if we could do it before release in Oct.

PeteHaitch · 2018-08-19T22:07:57Z

I should have an alevin-processed dataset this week, but I don't think it'll be shareable. If @rob-p or @k3yavi already have one, that'd be ideal for inclusion in tximportData.

If you're interested in working together, I may have some time over the next few weeks to begin on this.

k3yavi · 2018-08-19T23:20:44Z

Hi guys,
Thanks for considering Alevin to be included in tximport, I think it would open up the gateway for Alevin to the awesome R world.
I have tried a stupid connection of Alevin -> R (Seurat) here although it starts from a csv file.
Not to bug @mikelove, since I totally understand how busy it gets with school opening up soon, but when you guys have time, the binary version of the matrix can be found here (inside the folder alevin). Do let me know if you guys need any other help.

mikelove · 2018-08-23T12:27:14Z

@k3yavi what set of files should I put in tximportData to use for testing?

6.3K MappedUmi.txt
2.5K alevin.log
133K barcodeSoftMaps.txt
9.6M cell_eq_info.txt
2.4M cell_eq_mat.gz
5.0K cell_eq_order.txt
15K featureDump.txt
1.7K predictions.txt
1.6M quants_mat.gz
2.0M quants_mat_cols.txt
5.0K quants_mat_rows.txt
6.4K raw_cb_frequency.txt
5.9M transcripts.txt
1.6K whitelist.txt

k3yavi · 2018-08-23T13:31:39Z

Hi @mikelove ,

The relevant files would be quants_mat*, i.e.

quants_mat.gz: the compressed matrix of double
quants_mat_cols.txt: the name of the columns (gene names) in the above matrix.
quants_mat_rows.txt: the name of the rows (CB) in the above matrix.
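As a rough illustration of how these three files fit together, here is a minimal sketch of a reader in Python. It rests on assumptions the thread does not confirm: that quants_mat.gz is a gzip stream holding a dense matrix of little-endian 8-byte doubles, written one cell (row) at a time with one value per gene. The function names `read_names` and `read_quants` are hypothetical, and this is not taken from any code attached to this thread.

```python
import gzip
import os
import struct


def read_names(path):
    """Return the non-empty lines of a text file, one name per line."""
    with open(path) as handle:
        return [line.strip() for line in handle if line.strip()]


def read_quants(alevin_dir):
    """Read the quants_mat* files into (cells, genes, rows).

    ASSUMPTION: quants_mat.gz holds dense little-endian float64 values,
    stored cell by cell, with len(genes) values per cell.
    """
    cells = read_names(os.path.join(alevin_dir, "quants_mat_rows.txt"))
    genes = read_names(os.path.join(alevin_dir, "quants_mat_cols.txt"))
    row_format = "<%dd" % len(genes)  # e.g. "<3d" for three genes
    row_size = struct.calcsize(row_format)
    rows = []
    with gzip.open(os.path.join(alevin_dir, "quants_mat.gz"), "rb") as handle:
        for _ in cells:
            chunk = handle.read(row_size)
            if len(chunk) != row_size:
                raise ValueError("quants_mat.gz is shorter than rows x columns")
            rows.append(list(struct.unpack(row_format, chunk)))
        if handle.read(1):
            raise ValueError("quants_mat.gz is longer than rows x columns")
    return cells, genes, rows
```

The length checks are there because the matrix file carries no dimensions of its own under this assumed layout; the row and column files are the only way to know where one cell ends and the next begins.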

mikelove · 2018-10-28T17:54:42Z

Sorry this didn’t happen by release, got sidetracked by some revisions and other stuff. I’ll be working on this in the new development branch

PeteHaitch · 2018-10-28T21:06:10Z

Totally understandable. I haven't spent time on it either, due to other commitments.

k3yavi · 2018-10-29T02:05:33Z

No worries @mikelove .
Just let me know once you have a stable version, I'll just copy paste the R code to parse the quants here 😜 .

k3yavi · 2018-10-30T19:27:55Z

Hi @mikelove ,
There have been a couple of users asking how to read the gzip output format of Alevin in R. I am attaching a snippet of code which I wrote for parsing the output; I hope it might help in tximport too. Please let me know if you see something that can be made more efficient in the attached R code.
readAlevin.R.txt

mikelove · 2019-01-06T11:42:44Z

In devel branch

mikelove mentioned this issue Nov 5, 2018

sparse matrix support? #25

Closed

mikelove closed this as completed Jan 6, 2019
