Reorganize/improve import tool #37

Open
bernt-matthias opened this issue Aug 31, 2022 · 11 comments
Comments

@bernt-matthias
Contributor

I have a hard time figuring out how to import data into qiime2 tools using the import tool. I guess the most frequently used data is demultiplexed fastq.gz (maybe plus a sample data TSV file), e.g. https://data.qiime2.org/2022.8/tutorials/importing/casava-18-single-end-demultiplexed.zip. I could not find the corresponding option in the import tool.

  • I guess I start with SampleData[PairedEndSequencesWithQuality]?
  • Then there are many options that allow selecting either
    • 1 dataset (e.g. Paired End Fastq Manifest Phred33)
    • or a collection (e.g. Casava One Eight Laneless Per Sample Directory Format), but the collection type is not set (I guess it should be collection_type="list:paired")
    • or individual files via a repeat, where the number of elements can be anything from 1 upwards

To get me started with exploring downstream tools it would be nice if someone could tell me for now how I could import data like the above (is there already a Galaxy specific tutorial that I did not notice so far?).

I guess the main problem is that the mapping between Galaxy concepts and qiime2 concepts needs a bit of improvement (e.g. Galaxy data types and collection types are not used yet). But it's probably also because I'm inexperienced with qiime2. At the moment I'm just guessing that the goal of the import is to create a single qza dataset from all fastq files? The help is also missing information (such as the definition of a manifest).

Since the tool is auto generated I'm unsure if this is easily possible. An alternative would be to handcraft an import tool covering the most frequently used types of input data that has a tight integration of the Galaxy concepts.

I imagine a tool that takes as input either

  • list:paired (for paired end data) or
  • list (for single end data)

with format fastq.gz (simple data inputs with multiple="true" might also be useful, because some users don't seem to like collections for some reason), plus

  • optional barcodes
  • optional tabular data set for metadata.

The tool then automatically knows about the phred encoding due to the specific Galaxy fastq.gz sub-datatypes.
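To illustrate that last point, here is a minimal Python sketch of how a wrapper could infer the quality encoding from the Galaxy datatype. The mapping reflects the usual meaning of Galaxy's fastqsanger (Phred+33) and fastqillumina (Phred+64) datatypes. The function name `phred_encoding` and the returned labels are hypothetical and not part of q2galaxy.

```python
# Assumed mapping from Galaxy FASTQ sub-datatypes to quality encodings.
# The labels on the right are illustrative, not QIIME 2 identifiers.
GALAXY_FASTQ_ENCODING = {
    "fastqsanger": "phred33",
    "fastqsanger.gz": "phred33",
    "fastqillumina": "phred64",
    "fastqillumina.gz": "phred64",
}


def phred_encoding(galaxy_ext: str) -> str:
    """Return the quality encoding implied by a Galaxy datatype extension."""
    try:
        return GALAXY_FASTQ_ENCODING[galaxy_ext]
    except KeyError:
        # The generic "fastq"/"fastq.gz" datatypes carry no encoding information.
        raise ValueError(
            f"cannot infer Phred encoding from datatype {galaxy_ext!r}"
        ) from None
```

With the generic fastq.gz datatype the encoding stays unknown, so in that case the user would still have to choose it explicitly.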

@ebolyen
Member

ebolyen commented Aug 31, 2022

Hey @bernt-matthias!

Yeah, there's definitely some mild impedance mismatch here. This section of our tutorial should go over the "easy" way to do this:
https://docs.qiime2.org/jupyterbooks/cancer-microbiome-intervention-tutorial/020-tutorial-upstream/030-importing.html

But generally speaking, QIIME 2 doesn't have a notion of "collections" per se. Instead, we are indeed trying to place all of those fastq.gz files into a single QZA (we've found this to be pretty user-friendly). To get the data into that QZA, we expect a Galaxy collection and then use a regular expression on the element IDs to figure out which is forward vs reverse. This is the same regular expression that we use to validate that the user has given us a directory containing the appropriate files (we're quite file oriented).

There's really no equivalent concept of paired data in QIIME 2, as it's all defined by the format, which is expecting some directory structure. Instead we rely on the semantic type to indicate paired-ness, since many tools will use the default Casava layout. In principle, you should be able to upload a directory of raw reads from the sequencing instrument and place them in a collection (not paired, just a boring collection) and then probably add the file-extension of .fastq.gz if the upload stripped the file extension already. From there we go through some real pain to find the element IDs and reconstruct a temporary directory of the right shape for import to QZA.
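The element-ID matching described above can be sketched roughly as follows. This is not the q2galaxy implementation. The pattern is an assumption modeled on the Casava laneless naming scheme (`<sample>_<barcode>_R<1|2>_001.fastq.gz`), and `pair_reads` is a hypothetical helper.

```python
import re
from collections import defaultdict

# Assumed laneless Casava pattern with named groups for sample and read number.
CASAVA_LANELESS = re.compile(
    r"(?P<sample>.+)_.+_R(?P<read>[12])_001\.fastq\.gz"
)


def pair_reads(element_ids):
    """Group flat collection element IDs into forward/reverse reads per sample."""
    pairs = defaultdict(dict)
    for element_id in element_ids:
        match = CASAVA_LANELESS.fullmatch(element_id)
        if match is None:
            # Mirrors the "Unrecognized file" style of failure.
            raise ValueError(f"Unrecognized element identifier: {element_id}")
        direction = "forward" if match["read"] == "1" else "reverse"
        pairs[match["sample"]][direction] = element_id
    return dict(pairs)
```

Because pairedness is read from the names alone, a plain (non-paired) list collection is sufficient as long as every identifier matches the pattern.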

@ebolyen
Member

ebolyen commented Aug 31, 2022

Also I should mention that the Manifest style formats you mention for this particular type were a hack for importing which can basically never work in Galaxy, as they expect real filepaths to exist.

I have a rather informal proposal for modifying directory formats to better suit Galaxy as well. Perhaps there is a way to indicate pairedness in this realm, which we could then automatically map to Galaxy's paired collections.

@bernt-matthias
Contributor Author

Thanks for the clarifications and in particular for the link.

@bernt-matthias
Contributor Author

Hi @ebolyen, is there some documentation on the expected file names for the different input types (which might be added to the Galaxy tool help)?

I am (or rather, a colleague is) currently struggling to import data. I'm using Type of data to import: SampleData[PairedEndSequencesWithQuality]

With QIIME 2 file format to import from: CasavaOneEightLanelessPerSampleDirFmt

Unexpected error importing data:
Unrecognized file (/work/songalax/galaxy-dev/database/jobs_directory/020/20999/working/q2galaxy-importb4s4e8x1/metadata-hs-t1.txt) for CasavaOneEightLanelessPerSampleDirFmt.

With QIIME 2 file format to import from: CasavaOneEightSingleLanePerSampleDirFmt

Unexpected error importing data:
Missing one or more files for CasavaOneEightSingleLanePerSampleDirFmt: '.+_.+_L[0-9][0-9][0-9]_R[12]_001\\.fastq\\.gz'

The latter is fairly clear from the error message, since the regex does not match our file names: ids.txt

Could you give us some advice on which import format we should choose, or whether we should rename our data?
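For reference, a quick way to check names against the single-lane pattern from the error message is sketched below. The regex is the one shown in the message, with the doubled backslashes unescaped. The helper `unmatched` is hypothetical, and the use of a full match is an assumption about how the format validates names.

```python
import re

# Pattern from the CasavaOneEightSingleLanePerSampleDirFmt error message.
SINGLE_LANE = re.compile(r".+_.+_L[0-9][0-9][0-9]_R[12]_001\.fastq\.gz")


def unmatched(names):
    """Return the names that the single-lane pattern would reject."""
    return [name for name in names if SINGLE_LANE.fullmatch(name) is None]
```

Names without an `_L001`-style lane segment, and non-FASTQ files such as a metadata table, would be reported by this check.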

@ebolyen
Member

ebolyen commented Oct 7, 2022

Hey @bernt-matthias,

Sorry for not getting back to you. For user-support the forum is much more closely observed.

Regarding the error: yeah, that's definitely an unhelpful message. Your IDs look ok, although I see

qiime2 metadata tabulate on data 142: visualization.qzv
metadata-hs-t1.txt

in your list, which I presume isn't actually in the collection.

I would try setting the "append an extension" option to fastq.gz if the IDs in your collection are something like:

29_4_S83_R1_001

as QIIME 2 is trying to match the entire collection element identifier to the directory regex.
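A small sketch of why the extension matters, assuming the laneless pattern has the same shape as the single-lane regex from the error message minus the lane segment. `with_extension` is a hypothetical stand-in for the tool's option, not its actual code.

```python
import re

# Assumed laneless Casava pattern (no _L### lane segment).
CASAVA_LANELESS = re.compile(r".+_.+_R[12]_001\.fastq\.gz")


def with_extension(element_id, extension="fastq.gz"):
    """Append the extension unless the identifier already ends with it."""
    suffix = "." + extension.lstrip(".")
    if element_id.endswith(suffix):
        return element_id
    return element_id + suffix


bare_id = "29_4_S83_R1_001"
# The bare identifier fails a full match; the extended one passes.
matches_bare = CASAVA_LANELESS.fullmatch(bare_id) is not None
matches_extended = CASAVA_LANELESS.fullmatch(with_extension(bare_id)) is not None
```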

@bernt-matthias
Contributor Author

> Sorry for not getting back to you.

No worries :)

> For user-support the forum is much more closely observed.

Wondering if you want to add a link to the forum to the tool's help section?

> Your IDs look ok, although I see

Oh, yes. That is probably it.

@bernt-matthias
Contributor Author

I just read this again:

> Also I should mention that the Manifest style formats you mention for this particular type were a hack for importing which can basically never work in Galaxy, as they expect real filepaths to exist.

Would a relative path work?

@ebolyen
Member

ebolyen commented Oct 10, 2022

Unfortunately no, you would need to have an absolute path to the /some/galaxy/managed/path/001.dat file, which you happen to know is a fastq.gz. If you can predict those paths then it would work, presuming the data was in fact on the same host as the job, which is also not likely to be true.

I'm working on something right now that may clean this up, but no particular ETA. Until then, using the directory formats is your best bet as you have control over the element identifiers which can be made to match the expected relative path of the directory format (as tedious as that is).
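As a rough illustration of making element identifiers match the expected relative path, the sketch below builds laneless Casava-style names. The pattern is an assumption (the single-lane regex from the error message without the lane segment), and `casava_laneless_id` with its `index` parameter is a hypothetical helper.

```python
import re

# Assumed laneless Casava pattern (no _L### lane segment).
CASAVA_LANELESS = re.compile(r".+_.+_R[12]_001\.fastq\.gz")


def casava_laneless_id(sample_id, read, index):
    """Build <sample>_S<index>_R<read>_001.fastq.gz for use as an element ID."""
    if read not in (1, 2):
        raise ValueError("read must be 1 (forward) or 2 (reverse)")
    name = f"{sample_id}_S{index}_R{read}_001.fastq.gz"
    # Self-check against the pattern the import is expected to apply.
    if CASAVA_LANELESS.fullmatch(name) is None:
        raise ValueError(f"generated name does not match the pattern: {name}")
    return name
```

For example, `casava_laneless_id("29_4", 1, 83)` gives `29_4_S83_R1_001.fastq.gz`, which could then be set as the element identifier when building the collection.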

@bernt-matthias
Contributor Author

> Unfortunately no, you would need to have an absolute path to the /some/galaxy/managed/path/001.dat file, which you happen to know is a fastq.gz. If you can predict those paths then it would work, presuming the data was in fact on the same host as the job, which is also not likely to be true.

Indeed, this assumption does not hold in all Galaxy installations.

> I'm working on something right now that may clean this up, but no particular ETA.

+1

> Until then, using the directory formats is your best bet as you have control over the element identifiers which can be made to match the expected relative path of the directory format (as tedious as that is).

Thanks

@bgruening

Hi guys! Now that we have the tools on EU we get this problem as well :)

Is there any workaround yet?

@bernt-matthias
Contributor Author

The workaround seems to be to not use the manifest formats for importing. For now we have to educate users to instead use the manifest (which is just a metadata table, right?) to construct a collection, and then use that collection for the import.
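A minimal sketch of that idea, assuming a tab-separated paired-end manifest with the columns `sample-id`, `forward-absolute-filepath` and `reverse-absolute-filepath`. The function `manifest_to_element_ids` and the `_S<n>` numbering are hypothetical. The output is only a renaming plan, and building the actual Galaxy collection is left out.

```python
import csv
import io


def manifest_to_element_ids(manifest_text):
    """Map each file path in a paired-end manifest to a Casava-style element ID."""
    reader = csv.DictReader(io.StringIO(manifest_text), delimiter="\t")
    plan = {}
    for index, row in enumerate(reader, start=1):
        sample = row["sample-id"]
        # R1 is the forward read, R2 the reverse read.
        plan[row["forward-absolute-filepath"]] = (
            f"{sample}_S{index}_R1_001.fastq.gz"
        )
        plan[row["reverse-absolute-filepath"]] = (
            f"{sample}_S{index}_R2_001.fastq.gz"
        )
    return plan
```

The resulting identifiers could then be applied when creating a flat list collection, so the import no longer needs the file paths from the manifest.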
