GFF parser updates #132
I would think using the builder pattern would be the right thing to do here.
Hmm.. that is also an option. It would mean there needs to be a builder struct, but then the validation can be handled at once by its `build()` method, rather than by the approach I proposed.

Now actually, while we're discussing this, I've been thinking a bit more about the `gff` module itself. What do you think about exposing two different levels of iterated objects: a raw one and a fully-parsed one? This is similar to having two kinds of iterators over the same underlying reader. Conversion from `gff::Record` to `gff::Row` could be done by implementing a conversion trait such as `From`.

I don't have a proper benchmark yet (which I can generate if you prefer), but the reason I think this may be a good idea is because I noticed that one of the bottlenecks of the current reader seems to be parsing the attribute column; with a raw type, that parsing could be deferred until it is actually needed.

What do you think?
Just to be more concrete with the code: I imagine the builder pattern that you suggested, combined with what I outlined above, would pair a builder struct with the raw and parsed record types. I'm not exactly sure what the error struct for the builder should be, though. (EDIT: fixed some obvious syntax errors)
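A rough sketch of how such a builder could look. The names (`RecordBuilder`, a reduced field set, the error variant) are made up for illustration and are not the actual rust-bio types; the point is that validation is concentrated in `build()`:

```rust
// Illustrative builder sketch; field names and `BuildError` are assumptions,
// not the real `bio::io::gff` API.
#[derive(Debug, Default, PartialEq)]
struct Record {
    seqname: String,
    source: String,
    feature_type: String,
    start: u64,
    end: u64,
}

#[derive(Debug)]
enum BuildError {
    InvalidCoordinates,
}

#[derive(Default)]
struct RecordBuilder {
    seqname: String,
    source: String,
    feature_type: String,
    start: u64,
    end: u64,
}

impl RecordBuilder {
    fn new() -> Self {
        Self::default()
    }

    // Setters accept anything convertible into a String, for flexibility.
    fn seqname<T: Into<String>>(mut self, v: T) -> Self {
        self.seqname = v.into();
        self
    }

    fn source<T: Into<String>>(mut self, v: T) -> Self {
        self.source = v.into();
        self
    }

    fn feature_type<T: Into<String>>(mut self, v: T) -> Self {
        self.feature_type = v.into();
        self
    }

    fn coords(mut self, start: u64, end: u64) -> Self {
        self.start = start;
        self.end = end;
        self
    }

    // All validation happens in one place, when the record is built.
    fn build(self) -> Result<Record, BuildError> {
        if self.start > self.end {
            return Err(BuildError::InvalidCoordinates);
        }
        Ok(Record {
            seqname: self.seqname,
            source: self.source,
            feature_type: self.feature_type,
            start: self.start,
            end: self.end,
        })
    }
}

fn main() {
    let rec = RecordBuilder::new()
        .seqname("chr1")
        .source("ensembl")
        .feature_type("gene")
        .coords(100, 500)
        .build()
        .expect("valid record");
    println!("{}..{}", rec.start, rec.end);
}
```

The trade-off discussed below applies here: each `build()` call produces a freshly allocated `Record`.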
Can't we add the builder methods directly to `Record`? This way, we could avoid the `build` method. Maybe `Row` could then be renamed as well?
A builder, as I understand it, should give flexibility in struct creation and also validate the struct's fields. Its `build()` method is the single place where that validation can happen. For example, we can make the builder accept either an owned or a borrowed string for the text fields. There is also the slight issue of the setters being in the same namespace as the struct's accessor methods, though this can be resolved if we use consistent names (e.g. a common prefix for the setters).
I was curious to see how a builder would actually be implemented (and one thing led to another), so I ended up with the changeset here: bow@90035c8. This should be everything I've mentioned above. I hope the commit message is quite clear ~ it's longer than what I usually write ;). It's still missing some documentation, there should probably be a proper deprecation of the functions (instead of just removing them), and we haven't really decided on the right builder struct, so like always, it's still up for discussion.

As a side note, I also want to show a rough estimate of the speedup when iterating over raw records. This was done on a ~1.4 GB GTF file (unzipped, Homo sapiens from Ensembl release 89, without the GTF header): iterating over the raw records was measurably faster than iterating over the fully-parsed records, since the attribute column is only parsed on demand. Of course the exact numbers will vary between machines and files, so treat this as an estimate only.
Thanks! Sorry for the late reply. I am very busy with some deadlines right now, but I hope to come back to this soon. Looks promising indeed!
No worries @johanneskoester ~ I realize this changeset is quite big and requires some consideration :).
I have thought a bit about it. A downside of the builder is that you need new allocations for each record. Further, it is not so intuitive. What about the following general strategy for all record-based file formats: give the reader a `read` method that fills a caller-provided, reusable record, so that a single allocation can be reused across the whole file.
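For concreteness, the allocation-free strategy could look something like this sketch: the reader exposes a `read` method that fills an existing record in place. The type names and the simplified three-column line format are assumptions for illustration, not rust-bio code:

```rust
// Sketch of the buffer-reuse strategy: `read(&mut Record)` fills a record
// the caller owns, so one allocation serves the whole file. The three
// tab-separated columns here are a simplification, not the real GFF layout.
use std::io::{self, BufRead};

#[derive(Debug, Default)]
struct Record {
    seqname: String,
    start: u64,
    end: u64,
}

struct Reader<R: BufRead> {
    inner: R,
    line: String, // reused line buffer
}

impl<R: BufRead> Reader<R> {
    fn new(inner: R) -> Self {
        Reader { inner, line: String::new() }
    }

    /// Fills `record` in place; returns Ok(false) at EOF.
    fn read(&mut self, record: &mut Record) -> io::Result<bool> {
        self.line.clear();
        if self.inner.read_line(&mut self.line)? == 0 {
            return Ok(false);
        }
        let mut fields = self.line.trim_end().split('\t');
        // Minimal field handling for the sketch; a real parser would
        // report errors instead of falling back to defaults.
        record.seqname.clear();
        record.seqname.push_str(fields.next().unwrap_or(""));
        record.start = fields.next().and_then(|s| s.parse().ok()).unwrap_or(0);
        record.end = fields.next().and_then(|s| s.parse().ok()).unwrap_or(0);
        Ok(true)
    }
}

fn main() -> io::Result<()> {
    let data = b"chr1\t100\t200\nchr2\t300\t400\n" as &[u8];
    let mut reader = Reader::new(data);
    let mut record = Record::default(); // allocated once, reused below
    while reader.read(&mut record)? {
        println!("{} {}..{}", record.seqname, record.start, record.end);
    }
    Ok(())
}
```

The same `read(&mut record)` shape generalizes to any record-based format, which is the appeal of the strategy.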
Aha, I am slightly familiar with that approach. I have to admit I've grown more used to the builders (maybe more than I should be), but the per-record allocation is indeed quite unattractive. I'll give this one more thought and play around with some implementations ~ it sounds ok for now :).
Great! Thanks for considering!
It's been a while, but I've finally got enough time to fiddle with this again. I was wondering whether what you had in mind is along these lines as well: a `read` method on the reader that takes a mutable reference to an existing record and fills it in place, instead of allocating a new record per iteration.
On a side note, the internal `csv` dependency may also need to be updated.
Yes, before implementing anything we should update to the latest CSV version. |
Sorry for the slow answer. I was very busy with the bioconda paper. |
No worries. I haven't been the fastest to respond either. Ok then, this issue should indeed wait for the internal csv dependency update. Congratulations on the paper by the way (also extended to the other people on it who may stumble here) ~ fingers crossed for the peer-reviewed publication!
Wait, aren't we already using the latest csv? |
Oh wait, yes! I missed the last update to the TOML file ~ I thought it was still at the older version.
Hello everyone,
This issue is related to #115 and #128, as it concerns the GFF parser (I should ping @natir as the initial author, and @johanneskoester). I have been spending some time with the GFF parser and I have some questions and one concrete issue about it.
(I'd be happy to open PRs for any of these ~ just thought I should start here first :).)
Would marking the fields of `gff::Record` as public make the struct easier to use? I stumbled on this when I realized that the current `gff::Record` does not have a `frame_mut` method (so frames can not be updated after record creation).

I also feel that the creation of `gff::Record` can be improved. My use case is simply to create records which I would later write to a file. However, since all of the fields are private, the only way I can create new records seems to be to start from an empty record and set each column through its mutator method. This can be made easier by simply exposing the fields as public, so library users have more freedom when creating the records. This also makes the struct pattern-matchable.
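For illustration, that creation pattern looks roughly like the following self-contained mock. The field set and the `*_mut()` accessor names are assumptions for the sketch, not the actual `bio::io::gff` API:

```rust
// Mock of the private-field + `*_mut()` accessor style under discussion;
// the fields shown are illustrative, not the real gff::Record definition.
#[derive(Debug, Default)]
struct Record {
    seqname: String,
    start: u64,
    end: u64,
}

impl Record {
    fn new() -> Self {
        Record::default()
    }

    fn seqname_mut(&mut self) -> &mut String {
        &mut self.seqname
    }

    fn start_mut(&mut self) -> &mut u64 {
        &mut self.start
    }

    fn end_mut(&mut self) -> &mut u64 {
        &mut self.end
    }
}

fn main() {
    // With private fields, each column must be set through its mutator,
    // one statement at a time:
    let mut record = Record::new();
    *record.seqname_mut() = "chr1".to_owned();
    *record.start_mut() = 100;
    *record.end_mut() = 200;
    println!("{:?}", record);
}
```

Public fields (or a builder) would collapse this into a single struct literal and also allow pattern matching on the record.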
If exposing the fields as public is not an option, I can imagine that record creation can still be improved by either of these two options:

a. Update the `new()` function of `gff::Record` to accept a tuple, with one item representing one column in the GFF file. Currently, it seems to have the same behavior as the derived `default()` function, which I think can then be removed.

b. If not using `new()`, we can also `impl From<tuple_type> for gff::Record`. This would perhaps be easier if a type alias or a tuple struct is defined for the raw tuple type.

Also, some rather small related things that I've noticed:
- Since the GFF specs always use one character for the strand column, could this not be made a `char` instead of a `String`?
- Would it be useful to also convert the coordinate system of the parsed GFF records? Off-by-one errors coming from different file formats are always unpleasant to work with, and having a consistent coordinate system across rust-bio would make this easier to deal with. In the record itself, perhaps the `utils::interval::Interval` struct could be used?
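On the coordinate-system point: GFF uses 1-based, fully-closed intervals, while Rust's `Range` is 0-based and half-open, so converting between the two only shifts the start coordinate. A minimal sketch with made-up helper names:

```rust
use std::ops::Range;

// GFF: 1-based, end-inclusive. Rust ranges: 0-based, half-open.
// Only the start coordinate needs shifting.
fn gff_to_range(start_1based: u64, end_inclusive: u64) -> Range<u64> {
    (start_1based - 1)..end_inclusive
}

fn range_to_gff(r: &Range<u64>) -> (u64, u64) {
    (r.start + 1, r.end)
}

fn main() {
    // A GFF feature spanning bases 100..=200 covers 101 bases.
    let r = gff_to_range(100, 200);
    assert_eq!(r.end - r.start, 101);
    assert_eq!(range_to_gff(&r), (100, 200));
    println!("{:?}", r);
}
```

Doing this conversion once at parse time would keep every downstream rust-bio consumer in the same half-open convention.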