Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: Handle MANIFEST comments in per sample transformer #130

Closed
wants to merge 6 commits into from

Conversation

jakereps
Copy link
Member

@jakereps jakereps commented Jul 7, 2017

This resolves comments in the MANIFEST file, but runs into the same issue that demultiplexing did, where it hits the open file limit on large sample sets.

Resolves #129

@jairideout jairideout added the type:bug Something is wrong. label Jul 10, 2017
@jairideout jairideout requested a review from ebolyen July 11, 2017 17:01
Copy link
Member

@ebolyen ebolyen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think things are a little overcomplicated, you shouldn't need that many lambdas in one place. Was there something that specifically wasn't working without them?

manifest.filename = manifest.filename.apply(lambda x: str(dirfmt.path / x))

manifest['data'] = manifest.filename.apply(
lambda filepath: lambda: _iterator_magic(filepath))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is like a triple η-abstraction with _iterator_magic. I'm pretty sure you just need the one...

lambda filepath: skbio.io.read(filepath, format='fastq', constructor=skbio.DNA)

This will give you a series of Sequence generators (which you can return directly instead of the dataframe)

Copy link
Member Author

@jakereps jakereps Jul 11, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Without the abstraction it hits the open file limit. The demultiplexed artifact I tested it against was 292 samples, and the vanilla lambda immediately makes the generator for each sample, which includes assignment of the file handle. Shimming the __iter__ to call each lambda when it is requested assures only one open file at a time, or two in the case of paired end.

Copy link
Member Author

@jakereps jakereps Jul 11, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's essentially doing this:

def _iterator_magic(filepath):
    def sequence_generator():
        yield from skbio.io.read(filepath, format='fastq',
                                 constructor=skbio.DNA)
    return sequence_generator
...
manifest['data'] = manifest.filename.apply(_iterator_magic)
...

In order to delay the actual opening of the file through skbio.io.read.


Now that I look at that variant, it is much cleaner...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 and technically that is one less η-abstraction too since the closure for the filepath also gets to be the generator.

manifest = _prepare_manifest(dirfmt)

for sample_id, data in manifest.data.iteritems():
result[sample_id] = data
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think pandas has a to_dict() which will do basically this.

Copy link
Member Author

@jakereps jakereps Jul 11, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't it need to be added into the PerSampleDNAIterators that is initialized as result at the beginning of the function? to_dict() would be a vanilla dictionary.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, although I guess you could just pass the dict to the subclass. dicts consume dicts in their constructor for some reason.


def __iter__(self):
for sample_id, seqs in self.items():
yield sample_id, seqs()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

None of this should be necessary if the dict already stores the generators ready to go

@ebolyen
Copy link
Member

ebolyen commented Jul 12, 2017

I understand what this PR is trying to accomplish now w.r.t. open filehandles and I think it makes sense, although it makes me a little uncomfortable that the following is true:

sample_iterator['sample1'] is not sample_iterator['sample1']

Backing up a bit. Is there a better object representation to start with? Maybe a dict of filepaths instead? I get that that isn't exactly what this view does, but I don't see a good consistent way to address filehandle limits and "normal" dict behavior (unless our iterators somehow knew how to open and close a file between iter calls, but that sounds like a bad idea for performance).

@jakereps
Copy link
Member Author

Is it better to have it as just a generator itself, like BarcodeSequenceFastqIterator in q2-demux? That way it has no get or items, etc... Because it being a generator is even kind of implied in the name of them with r'PerSample.*DNAIterators'. It could internally hold the dict that it is trying to be, and just have the same __iter__, but on an internal dict's items()?

@ebolyen
Copy link
Member

ebolyen commented Jul 12, 2017

If I understand correctly you're suggesting something like:

class PerSampleWhateverIterator:
    def __iter__(self):
        for key, value in self.mapping.items():
            yield from ((key, seq) for seq in value()))

? I certainly wouldn't be opposed to that, I guess the real question is who is using the view and would that work (or a different view if I misunderstood).

@jakereps
Copy link
Member Author

I'm not entirely sure. This all came up from the manifest breaking these views, and then the file limit being hit once that was fixed. The internal methods of q2-demux use that hacked/hidden _PlotQualView to be single/paired agnostic, as well as _read_fastq_seqs to avoid the time of reading into skbio, but my specific use-case was to just count how many sequences survived demultiplexing, independent of using demux.methods.summarize. I could revert the OS file limit fixes, and just fix the manifest intake for the time being if that's easiest?

Kind of just fixed it locally and threw up a PR with what I had done to get it to work.

@jakereps
Copy link
Member Author

Closing to pivot to new views

@jakereps jakereps closed this Jul 14, 2017
jairideout added a commit to jairideout/q2-types that referenced this pull request Aug 16, 2017
Removes `q2_types.per_sample_sequences.PerSampleDNAIterators` and `q2_types.per_sample_sequences.PerSamplePairedDNAIterators` view types, along with their corresponding transformers. As part of discussion in qiime2#129 and qiime2#130 it was found that these view types never actually worked and weren't being used anywhere. Even getting them to work raises another problem: file descriptor limit.

New view types can be added in the future that avoid the file descriptor limit.
thermokarst pushed a commit that referenced this pull request Aug 17, 2017
Removes `q2_types.per_sample_sequences.PerSampleDNAIterators` and `q2_types.per_sample_sequences.PerSamplePairedDNAIterators` view types, along with their corresponding transformers. As part of discussion in #129 and #130 it was found that these view types never actually worked and weren't being used anywhere. Even getting them to work raises another problem: file descriptor limit.

New view types can be added in the future that avoid the file descriptor limit.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
type:bug Something is wrong.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

PerSampleDNAIterators doesn't take comments into account
3 participants