FASTA Reader: supported file endings #2054

Closed
cbielow opened this issue Apr 3, 2017 · 7 comments

@cbielow

cbielow commented Apr 3, 2017

We'd like to upgrade from SeqAn 1.6 to SeqAn 2.x for OpenMS in the near future.
Currently, readRecord() is holding us back a little, since it seems to enforce certain file endings on FASTA input files and refuses to read *.tmp files, which can occur in auto-generated class-test files or even in workflow systems. This is a deal breaker for us.
Is there a way to add a flag to readRecord(), or somewhere else appropriate, that switches off the filename suffix restrictions?

@rrahn
Contributor

rrahn commented Apr 3, 2017

I am afraid that's not so easy, but here is a workaround:

The FormattedFile class can be specialized through its third template parameter.
You can create a tag such as OpenMSFastaAdaptor and use it to instantiate your own formatted file type.
For this type you need to specialize the FileFormat<...> metafunction, which returns the set of valid file formats. Here you define a tag list that gets an additional tag for your custom file format, which I called FastaAdaptor; you can use any name you like.
You also need to specify the valid file extensions for your custom file format with the FileExtensions metafunction.
Finally, you need to overload readRecord for your custom file format tag FastaAdaptor so that it delegates to the Fasta format, because the two are the same.

I hope that helps and works.

// Tag used as the third template parameter of FormattedFile.
struct OpenMSFastaAdaptor_;
using OpenMSFastaAdaptor = Tag<OpenMSFastaAdaptor_>;

// Your custom file format.
struct FastaAdaptor_;
using FastaAdaptor = Tag<FastaAdaptor_>;

// List of valid input formats for your customized sequence file.
typedef
    TagList<Fastq,
    TagList<Fasta,
    TagList<FastaAdaptor
    > > >
    OpenMSSeqInFormats;

// Overloaded file format metafunction.
template <>
struct FileFormat<FormattedFile<Fastq, Input, OpenMSFastaAdaptor> >
{
    typedef TagSelector<OpenMSSeqInFormats> Type;
};

// Specify the valid ending for your fasta adaptor:
template <typename T>
struct FileExtensions< FastaAdaptor, T>
{
    static char const * VALUE[1];  // must match the size used in the definition below
};
template <typename T>
char const * FileExtensions< FastaAdaptor, T>::VALUE[1] =
{
    ".tmp"  // fasta file with tmp ending.
};

// Overload the readRecord function:
template <typename TIdString, typename TSeqString, typename TFwdIterator>
inline SEQAN_FUNC_ENABLE_IF(Not<IsSameType<TFwdIterator, FormattedFile<Fastq, Input, OpenMSFastaAdaptor > > >, void)
readRecord(TIdString & meta, TSeqString & seq, TFwdIterator & iter, FastaAdaptor)
{
    readRecord(meta, seq, iter, Fasta());
}

@h-2 h-2 added this to the Release 2.4.0 milestone Jun 8, 2017
@h-2
Member

h-2 commented Jun 8, 2017

@cbielow Does this solve your problem?

@martinjvickers
Contributor

martinjvickers commented Jun 22, 2017

@h-2 @rrahn A documented solution to this issue in ReadTheDocs would be very useful as this problem also affects developers who wish to create tools for Galaxy.

When writing a Galaxy plugin you need to account for Galaxy renaming every input file so that it has a .dat extension. The current solution is to write quite terrible wrappers around your program in Galaxy: https://biostar.usegalaxy.org/p/1965/

It would be nice to be able to accept any file extension (tmp/dat etc). Maybe using a flag, e.g. ./program --fastq 001.dat

@h-2
Member

h-2 commented Jun 23, 2017

@martinjvickers There is no nice solution for this in the SeqAn2 code base, and tbh I think any other solution will be bad usability-wise for users (it should auto-detect!) and confusing for programmers: if you set -i foo.fastq --format fasta, should the file be treated as fastq or fasta?
That having been said, we will provide workarounds for this in SeqAn3, although the default will still be extension-based format recognition.

@martinjvickers
Contributor

@h-2 I 100% agree with the current way that SeqAn treats file formats. For command line tools it makes perfect sense and I don't think that should change. I've found it quite annoying that Galaxy renames the file. However, Galaxy is very popular, so it has been important to ensure the tools I write can be used with it, which leads to the terrible wrappers in Galaxy (e.g. symlinking the dat file to a fasta/fastq or whatever before running the tool, potentially leaving stale symlinks in place if something dies before they're removed).

Your example shows exactly how you can't (and shouldn't) use SeqAn at the moment to support this, e.g. a --format flag to say what the input -i should be. In that example I would simply throw an error "Input is not a fasta file". My thinking wasn't to choose the format and the input file separately, but rather that the flag itself would only accept a specific file type, maybe as part of the argument parser.
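One way to picture a flag that only accepts a specific file type is to check the content rather than the file name. The sketch below assumes that a FASTA record starts with `>` and a FASTQ record with `@`. The names `SeqFormat`, `sniffFormat` and `requireFasta` are hypothetical and not SeqAn API, and this is not a description of how SeqAn itself recognizes formats.

```cpp
#include <istream>
#include <sstream>
#include <stdexcept>
#include <string>

enum class SeqFormat { Fasta, Fastq, Unknown };

// Looks at the first non-whitespace character of the stream without
// consuming it, so the record can still be parsed afterwards.
inline SeqFormat sniffFormat(std::istream & in)
{
    in >> std::ws;            // skip leading blank lines and spaces
    int const c = in.peek();  // returns EOF on an empty stream
    if (c == '>')
        return SeqFormat::Fasta;
    if (c == '@')
        return SeqFormat::Fastq;
    return SeqFormat::Unknown;
}

// Models a FASTA-only option: any other content is rejected,
// whatever the file happens to be called.
inline void requireFasta(std::istream & in)
{
    if (sniffFormat(in) != SeqFormat::Fasta)
        throw std::runtime_error("Input is not a fasta file");
}
```

With a check like this, an input named 001.dat would be accepted or rejected based on its first record, so the file extension no longer matters for that one option.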

However, I'm not convinced that this is necessary for SeqAn as a whole, but as you say a workaround is needed. It's quite telling that there is no documentation about creating Galaxy workflows in the ReadTheDocs, despite headers being in place. It's not easy to create a Galaxy workflow for a SeqAn tool by following the Galaxy documentation because of this issue. Maybe the answer to the Galaxy issue isn't to alter SeqAn but to demonstrate decent wrappers for SeqAn tools in the SeqAn docs. That probably doesn't help the OP @cbielow though.

@rrahn rrahn self-assigned this Sep 18, 2017
@rrahn rrahn added this to ToDo in SeqAn Oct 4, 2017
@rrahn
Contributor

rrahn commented Oct 9, 2017

  • Add tutorial

@rrahn
Contributor

rrahn commented Jan 31, 2018

I hope this helps to develop file ending wrappers for your use cases.
