BUG: Handle MANIFEST comments in per sample transformer #130

jakereps · 2017-07-07T18:19:01Z

This resolves comments in the MANIFEST file, ~~but runs into the same issue that demultiplexing did, where it hits the open file limit on large sample sets.~~

Resolves #129

ebolyen

I think things are a little overcomplicated, you shouldn't need that many lambdas in one place. Was there something that specifically wasn't working without them?

ebolyen · 2017-07-11T19:47:22Z

q2_types/per_sample_sequences/_transformer.py

+    manifest.filename = manifest.filename.apply(lambda x: str(dirfmt.path / x))
+
+    manifest['data'] = manifest.filename.apply(
+        lambda filepath: lambda: _iterator_magic(filepath))


This is like a triple η-abstraction with _iterator_magic. I'm pretty sure you just need the one...

lambda filepath: skbio.io.read(filepath, format='fastq', constructor=skbio.DNA)

This will give you a series of Sequence generators (which you can return directly instead of the dataframe)

Without the abstraction it hits the open file limit. The demultiplexed artifact I tested it against was 292 samples, and the vanilla lambda immediately makes the generator for each sample, which includes assignment of the file handle. Shimming the __iter__ to call each lambda when it is requested assures only one open file at a time, or two in the case of paired end.

It's essentially doing this:

def _iterator_magic(filepath): def sequence_generator(): yield from skbio.io.read(filepath, format='fastq', constructor=skbio.DNA) return sequence_generator ... manifest['data'] = manifest.filename.apply(_iterator_magic) ...

In order to delay the actual opening of the file through skbio.io.read.

Now that I look at that variant, it is much cleaner...

👍 and technically that is one less η-abstraction too since the closure for the filepath also gets to be the generator.

ebolyen · 2017-07-11T19:47:35Z

q2_types/per_sample_sequences/_transformer.py

+    manifest = _prepare_manifest(dirfmt)
+
+    for sample_id, data in manifest.data.iteritems():
+        result[sample_id] = data


I think pandas has a to_dict() which will do basically this.

Doesn't it need to be added into the PerSampleDNAIterators that is initialized as result at the beginning of the function? to_dict() would be a vanilla dictionary.

Good point, although I guess you could just pass the dict to the subclass. dicts consume dicts in their constructor for some reason.

ebolyen · 2017-07-11T19:56:42Z

q2_types/per_sample_sequences/_transformer.py

+
+    def __iter__(self):
+        for sample_id, seqs in self.items():
+            yield sample_id, seqs()


None of this should be necessary if the dict already stores the generators ready to go

ebolyen · 2017-07-12T20:04:54Z

I understand what this PR is trying to accomplish now w.r.t. open filehandles and I think it makes sense, although it makes me a little uncomfortable that the following is true:

sample_iterator['sample1'] is not sample_iterator['sample1']

Backing up a bit. Is there a better object representation to start with? Maybe a dict of filepaths instead? I get that that isn't exactly what this view does, but I don't see a good consistent way to address filehandle limits and "normal" dict behavior (unless our iterators somehow knew how to open and close a file between iter calls, but that sounds like a bad idea for performance).

jakereps · 2017-07-12T20:33:10Z

Is it better to have it as just a generator itself, like BarcodeSequenceFastqIterator in q2-demux? That way it has no get or items, etc... Because it being a generator is even kind of implied in the name of them with r'PerSample.*DNAIterators'. It could internally hold the dict that it is trying to be, and just have the same __iter__, but on an internal dict's items()?

ebolyen · 2017-07-12T20:47:37Z

If I understand correctly you're suggesting something like:

class PerSampleWhateverIterator:
    def __iter__(self):
        for key, value in self.mapping.items():
            yield from ((key, seq) for seq in value()))

? I certainly wouldn't be opposed to that, I guess the real question is who is using the view and would that work (or a different view if I misunderstood).

jakereps · 2017-07-12T23:18:56Z

I'm not entirely sure. This all came up from the manifest breaking these views, and then the file limit being hit once that was fixed. The internal methods of q2-demux use that hacked/hidden _PlotQualView to be single/paired agnostic, as well as _read_fastq_seqs to avoid the time of reading into skbio, but my specific use-case was to just count how many sequences survived demultiplexing, independent of using demux.methods.summarize. I could revert the OS file limit fixes, and just fix the manifest intake for the time being if that's easiest?

Kind of just fixed it locally and threw up a PR with what I had done to get it to work.

jakereps · 2017-07-14T16:56:23Z

Closing to pivot to new views

Removes `q2_types.per_sample_sequences.PerSampleDNAIterators` and `q2_types.per_sample_sequences.PerSamplePairedDNAIterators` view types, along with their corresponding transformers. As part of discussion in qiime2#129 and qiime2#130 it was found that these view types never actually worked and weren't being used anywhere. Even getting them to work raises another problem: file descriptor limit. New view types can be added in the future that avoid the file descriptor limit.

Removes `q2_types.per_sample_sequences.PerSampleDNAIterators` and `q2_types.per_sample_sequences.PerSamplePairedDNAIterators` view types, along with their corresponding transformers. As part of discussion in #129 and #130 it was found that these view types never actually worked and weren't being used anywhere. Even getting them to work raises another problem: file descriptor limit. New view types can be added in the future that avoid the file descriptor limit.

jakereps added 2 commits July 7, 2017 11:18

BUG: Handle MANIFEST comments in per sample transformer

da49a8a

MAINT: Refactor block of code into helper

f02b143

jakereps force-pushed the psdnaiter branch from 7fadc1d to f02b143 Compare July 7, 2017 18:21

jakereps added 3 commits July 7, 2017 11:26

MAINT: flake8

9b139bd

MAINT: Handle file limit

31dcd6e

ENH: Change items() to __iter__() to match BCSeqs in demux

0767911

jairideout added the type:bug Something is wrong. label Jul 10, 2017

jairideout requested a review from ebolyen July 11, 2017 17:01

ebolyen requested changes Jul 11, 2017

View reviewed changes

ENH: Clean up nested functions

b441f94

jakereps closed this Jul 14, 2017

jairideout mentioned this pull request Aug 16, 2017

BUG: remove PerSampleDNAIterators and PerSamplePairedDNAIterators #141

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

BUG: Handle MANIFEST comments in per sample transformer #130

BUG: Handle MANIFEST comments in per sample transformer #130

jakereps commented Jul 7, 2017 •

edited

Loading

ebolyen left a comment

ebolyen Jul 11, 2017

jakereps Jul 11, 2017 •

edited

Loading

jakereps Jul 11, 2017 •

edited

Loading

ebolyen Jul 12, 2017

ebolyen Jul 11, 2017

jakereps Jul 11, 2017 •

edited

Loading

ebolyen Jul 12, 2017

ebolyen Jul 11, 2017

ebolyen commented Jul 12, 2017

jakereps commented Jul 12, 2017

ebolyen commented Jul 12, 2017

jakereps commented Jul 12, 2017

jakereps commented Jul 14, 2017

BUG: Handle MANIFEST comments in per sample transformer #130

BUG: Handle MANIFEST comments in per sample transformer #130

Conversation

jakereps commented Jul 7, 2017 • edited Loading

ebolyen left a comment

Choose a reason for hiding this comment

ebolyen Jul 11, 2017

Choose a reason for hiding this comment

jakereps Jul 11, 2017 • edited Loading

Choose a reason for hiding this comment

jakereps Jul 11, 2017 • edited Loading

Choose a reason for hiding this comment

ebolyen Jul 12, 2017

Choose a reason for hiding this comment

ebolyen Jul 11, 2017

Choose a reason for hiding this comment

jakereps Jul 11, 2017 • edited Loading

Choose a reason for hiding this comment

ebolyen Jul 12, 2017

Choose a reason for hiding this comment

ebolyen Jul 11, 2017

Choose a reason for hiding this comment

ebolyen commented Jul 12, 2017

jakereps commented Jul 12, 2017

ebolyen commented Jul 12, 2017

jakereps commented Jul 12, 2017

jakereps commented Jul 14, 2017

jakereps commented Jul 7, 2017 •

edited

Loading

jakereps Jul 11, 2017 •

edited

Loading

jakereps Jul 11, 2017 •

edited

Loading

jakereps Jul 11, 2017 •

edited

Loading