BUG: Handle MANIFEST comments in per sample transformer #130
Conversation
I think things are a little overcomplicated; you shouldn't need that many lambdas in one place. Was there something that specifically wasn't working without them?
```python
manifest.filename = manifest.filename.apply(lambda x: str(dirfmt.path / x))

manifest['data'] = manifest.filename.apply(
    lambda filepath: lambda: _iterator_magic(filepath))
```
This is like a triple η-abstraction with `_iterator_magic`. I'm pretty sure you just need the one...

```python
lambda filepath: skbio.io.read(filepath, format='fastq', constructor=skbio.DNA)
```

This will give you a series of `Sequence` generators (which you can return directly instead of the dataframe).
Without the abstraction it hits the open file limit. The demultiplexed artifact I tested it against had 292 samples, and the vanilla lambda immediately makes the generator for each sample, which includes assignment of the file handle. Shimming `__iter__` to call each lambda only when it is requested ensures only one open file at a time, or two in the paired-end case.
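To make the file-limit point concrete, here is a hedged, plain-Python sketch (no skbio, no pandas; `read_seqs` is a hypothetical stand-in for `skbio.io.read`, which acquires the file handle as soon as it is called even though it returns a generator). It contrasts creating every reader eagerly with wrapping each call in a thunk:

```python
import os
import tempfile

open_count = 0

def read_seqs(path):
    """Stand-in for skbio.io.read: the handle is acquired as soon as
    this is called, even though a generator is returned."""
    global open_count
    fh = open(path)
    open_count += 1
    def gen():
        with fh:
            for line in fh:
                yield line.strip()
    return gen()

tmpdir = tempfile.mkdtemp()
paths = []
for i in range(3):
    p = os.path.join(tmpdir, "sample%d.fastq" % i)
    with open(p, "w") as f:
        f.write("ACGT\n")
    paths.append(p)

# Eager: one handle per sample, all at once -- scaled to 292 samples,
# this is what blows past the OS file-descriptor limit.
eager = [read_seqs(p) for p in paths]
assert open_count == 3

# Deferred: wrap each call in a thunk; nothing opens until the thunk
# is invoked, so only one file is open at a time.
open_count = 0
deferred = [(lambda p=p: read_seqs(p)) for p in paths]
assert open_count == 0
first = list(deferred[0]())
assert open_count == 1
assert first == ["ACGT"]
```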
It's essentially doing this:

```python
def _iterator_magic(filepath):
    def sequence_generator():
        yield from skbio.io.read(filepath, format='fastq',
                                 constructor=skbio.DNA)
    return sequence_generator

...

manifest['data'] = manifest.filename.apply(_iterator_magic)
```

in order to delay the actual opening of the file through `skbio.io.read`.

Now that I look at that variant, it is much cleaner...
👍 and technically that is one less η-abstraction too since the closure for the filepath also gets to be the generator.
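The deferral works because a generator function's body does not run until the first `next()`. A hedged, plain-Python sketch of that mechanism (no skbio; `eager_read` is a hypothetical stand-in for `skbio.io.read`, and the filename is made up):

```python
events = []

def eager_read(path):
    # Stands in for skbio.io.read, which opens the file when called.
    events.append("opened " + path)
    return iter(["ACGT", "TTGA"])

def deferred(path):
    # Generator function: its body -- and hence the open inside
    # eager_read -- does not run until the first next().
    yield from eager_read(path)

g = deferred("sample1.fastq")
assert events == []                       # nothing opened yet
assert next(g) == "ACGT"
assert events == ["opened sample1.fastq"]
```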
```python
manifest = _prepare_manifest(dirfmt)

for sample_id, data in manifest.data.iteritems():
    result[sample_id] = data
```
I think pandas has a `to_dict()` which will do basically this.
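A small sketch of that suggestion, with placeholder strings standing in for the per-sample generators (the column and sample names are illustrative):

```python
import pandas as pd

# A tiny manifest-like frame; 'data' would hold the per-sample
# generator thunks in the real code.
manifest = pd.DataFrame({
    "data": {"sample1": "gen1", "sample2": "gen2"},
})

# Series.to_dict() collapses index -> value in one call, replacing
# the manual iteritems() loop.
result = manifest["data"].to_dict()
assert result == {"sample1": "gen1", "sample2": "gen2"}
```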
Doesn't it need to be added into the `PerSampleDNAIterators` that is initialized as `result` at the beginning of the function? `to_dict()` would be a vanilla dictionary.
Good point, although I guess you could just pass the dict to the subclass. `dict`s consume `dict`s in their constructor for some reason.
```python
def __iter__(self):
    for sample_id, seqs in self.items():
        yield sample_id, seqs()
```
None of this should be necessary if the dict already stores the generators ready to go.
I understand what this PR is trying to accomplish now w.r.t. open filehandles, and I think it makes sense, although it makes me a little uncomfortable that the following is true:

```python
sample_iterator['sample1'] is not sample_iterator['sample1']
```

Backing up a bit: is there a better object representation to start with? Maybe a dict of filepaths instead? I get that that isn't exactly what this view does, but I don't see a good, consistent way to address filehandle limits and "normal" dict behavior (unless our iterators somehow knew how to open and close a file between …).
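A hypothetical sketch (`LazyLookup` is not q2-types code) of where that identity oddity comes from: if `__getitem__` invokes the stored thunk, every lookup builds a fresh generator object:

```python
class LazyLookup(dict):
    """Hypothetical view: values are zero-arg thunks; lookups call them."""
    def __getitem__(self, key):
        return super().__getitem__(key)()  # invoke the thunk

si = LazyLookup({"sample1": lambda: iter(["ACGT"])})

a = si["sample1"]
b = si["sample1"]
assert a is not b            # two lookups, two distinct generators
assert list(a) == ["ACGT"]   # each one still yields the sequences
```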
Is it better to have it as just a generator itself, like …
If I understand correctly you're suggesting something like:

```python
class PerSampleWhateverIterator:
    def __iter__(self):
        for key, value in self.mapping.items():
            yield from ((key, seq) for seq in value())
```

? I certainly wouldn't be opposed to that, I guess the real question is who is using the view and would that work (or a different view if I misunderstood).
I'm not entirely sure. This all came up from the manifest breaking these views, and then the file limit being hit once that was fixed. The internal methods of … Kind of just fixed it locally and threw up a PR with what I had done to get it to work.
Closing to pivot to new views
Removes `q2_types.per_sample_sequences.PerSampleDNAIterators` and `q2_types.per_sample_sequences.PerSamplePairedDNAIterators` view types, along with their corresponding transformers. As part of discussion in qiime2#129 and qiime2#130 it was found that these view types never actually worked and weren't being used anywhere. Even getting them to work raises another problem: file descriptor limit. New view types can be added in the future that avoid the file descriptor limit.
This resolves comments in the `MANIFEST` file, but runs into the same issue that demultiplexing did, where it hits the open file limit on large sample sets.

Resolves #129