
GFF parser updates #132

Open · bow opened this issue May 15, 2017 · 17 comments

@bow (Contributor) commented May 15, 2017

Hello everyone,

This issue is related to #115 and #128, as it concerns the GFF parser (I should ping @natir as the initial author, and @johanneskoester). I have been spending some time with the GFF parser and have some questions and one concrete issue about it.

(I'd be happy to open PRs on any of these ~ just thought I should start here first :) ).

  1. I was wondering if exposing the fields of gff::Record as public would make the struct easier to use?

I stumbled on this when I realized that the current gff::Record does not have a frame_mut method (so frames cannot be updated after record creation).

I also feel that the creation of gff::Record can be improved. My use case is simply to create records which I would later write to a file. However, since all of the fields are private, this seems to be the only way I can create new records:

let mut rec = gff::Record::default();
*rec.seqname_mut() = <value>;
*rec.source_mut() = <value>;
... and so on

This could be made easier by simply exposing the fields as public, giving library users more freedom when creating records. This would also make the struct pattern-matchable.
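For illustration, creation and matching might then look like this (a hypothetical sketch, assuming public fields and the derived Default; the values are made up):

let rec = gff::Record {
    seqname: "chr1".to_owned(),
    source: "ensembl".to_owned(),
    // remaining fields filled in by the derived Default
    ..Default::default()
};

// Pattern matching on the fields also becomes possible:
let gff::Record { ref seqname, ref source, .. } = rec;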

If exposing the fields as public is not an option, I can imagine the record creation can still be improved by either of these two options:

a. Update the new() function of gff::Record to accept a tuple, with each item representing one column in the GFF file. Currently it seems to do the same thing as the derived default() function, so I think the duplication could be removed.

b. If not using new(), we could also impl From<tuple_type> for gff::Record. This would perhaps be easier if a type alias or a tuple struct were defined for the raw tuple type.

  2. Are there any definite plans / deadlines for using the nom crate in the GFF parser?

Also, some rather small related things that I've noticed:

  1. Since the GFF specs always use one character for the strand column, couldn't this be a char instead of a String?

  2. Would it be useful to also convert the coordinate system of parsed GFF records? Off-by-one errors coming from different file formats are always unpleasant to work with, and a consistent coordinate system across rust-bio would make them easier to deal with. In the record itself, perhaps the utils::interval::Interval struct could be used?
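As an illustration of that conversion (a hypothetical sketch; GFF/GTF coordinates are 1-based with inclusive ends, and a 0-based, half-open Interval is assumed as the internal representation):

use bio::utils::interval::Interval;

// GFF start=100, end=200 (1-based, inclusive) becomes the 0-based,
// half-open interval [99, 200).
let (gff_start, gff_end) = (100u64, 200u64);
let interval = Interval::new((gff_start - 1)..gff_end)
    .expect("start must not be greater than end");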

@johanneskoester (Contributor):

I would think using the builder pattern would be the right thing to do here.

@bow (Contributor, Author) commented Jun 7, 2017

Hmm, that is also an option. It would mean there needs to be a builder struct, but then validation can be handled all at once by its build() method. gff::Record can then have a better representation of the GFF/GTF columns (e.g. score() wouldn't have to parse every time it is called; we could store the underlying field as Option<u64> directly).

I proposed a new() method that accepts tuples to reduce the possible number of allocations, since the current new() needs to allocate empty strings upon creation (for most of the GTF fields), and then the actual strings the user supplies need to be allocated as well before being set (e.g. via *record.seqname_mut()). With a builder object, I guess this could be solved by making the String fields Option<String> in the builder struct, right?

Now actually, while we're discussing this, I've been thinking a bit more about the gff module itself.

What do you think about exposing two different levels of iterated objects? This is similar to the iterators in rust-csv: it has iterators over the byte fields, over the string values of each column, and over the decoded records. I imagine the gff reader could expose two levels of iterators as well: one that yields tuples of all columns (let's call this the gff::Row type, which we can simply define as a type alias) and one that yields the records (the one that uses the builder for construction and does more validation: the gff::Record).

Conversion from gff::Record to gff::Row could be done by implementing the From trait, while the other direction would need its own method for now (until perhaps the TryFrom trait lands in stable). The writer would then essentially just need to work with gff::Row.

I don't have a proper benchmark yet (which I can generate if you prefer), but the reason I think this may be a good idea is that one of the bottlenecks of the current reader seems to be parsing the attribute column (currently with the regex::Regex struct). I got more than a 6x speedup when I just kept the attributes unparsed. Of course, in general the attributes should still be parsed (hence the use of gff::Record), but in some cases only a few values from the attributes may be interesting (e.g. only gene_id and transcript_id when trying to reconstruct transcripts). We could let users decide what they want to do with the attributes by providing the gff::Row type; a sketch of this follows below.
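For instance, pulling only gene_id out of an unparsed GTF attribute column might look like this (a hypothetical sketch; gene_id_of is not part of any existing API):

// Extract the value of gene_id from a raw GTF attribute string such as
// `gene_id "ENSG00000223972"; transcript_id "ENST00000456328";`
fn gene_id_of(raw_attributes: &str) -> Option<&str> {
    raw_attributes
        .split(';')
        .map(str::trim)
        .find(|field| field.starts_with("gene_id "))
        .and_then(|field| field.splitn(2, ' ').nth(1))
        .map(|value| value.trim_matches('"'))
}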

What do you think?

@bow (Contributor, Author) commented Jun 7, 2017

Just to be more concrete with the code, I imagine the builder pattern that you suggested and what I outlined above would look something like this:

// (seqname, source, feature_type, start, end, score, strand, frame, attributes)
pub type Row = (String, String, String, u64, u64, String, char, char, String);

pub struct Record {
    seqname: String,
    source: String,
    feature_type: String,
    interval: Interval<u64>,
    score: Option<f64>,
    strand: Strand,
    frame: Option<u8>,
    attributes: MultiMap<String, String>,
}

pub struct RecordBuilder {
    seqname: String,
    start: u64,
    end: u64,
    source: Option<String>,
    feature_type: Option<String>,
    score: Option<f64>,
    strand: Strand,
    frame: Option<u8>,
    attributes: MultiMap<String, String>,
}

impl RecordBuilder {

    pub fn new<T>(seqname: T, start: u64, end: u64) -> Self
        where T: Into<String>
    {
        RecordBuilder {
            seqname: seqname.into(),
            start: start,
            end: end,
            source: None,
            feature_type: None,
            score: None,
            strand: Strand::Unknown,
            frame: None,
            attributes: MultiMap::new(),
        }
    }

    pub fn source<T>(mut self, source: T) -> Self
        where T: Into<String>
    {
        self.source = Some(source.into());
        self
    }

    pub fn build(self) -> Result<Record, ???> {
        let interval = Interval::new(self.start..self.end)?;
        Ok(Record {
            seqname: self.seqname,
            interval: interval,
            source: self.source.unwrap_or(".".to_owned()),
            ...
        })
    }
}

impl Record {

    fn try_from(row: Row) -> Result<Self, ???> {
        ...
    }

    fn start(&self) -> u64 {
        self.interval.start
    }
}
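Usage might then look like this (hypothetical values; the error type is the open question below):

let record = RecordBuilder::new("chr1", 99, 200)
    .source("ensembl")
    .build()?;
assert_eq!(record.start(), 99);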

I'm not exactly sure what the error struct for the builder should be (gff::Error?), but I hope the idea is clear. The iterator structs would also be quite straightforward, I think, with the records iterator using the rows iterator under the hood (which in turn uses the rust-csv iterator). This is all still in the context of using rust-csv as the underlying parser, which still has issues with the GFF header and comments.

(EDIT: fixed some obvious syntax errors)

@johanneskoester (Contributor):

Can't we add the builder methods directly to Record? This way, we could avoid the build method. Maybe Row could be renamed to RawRecord?

@bow (Contributor, Author) commented Jun 7, 2017

RawRecord sounds like a good name, too. As for the builder: although technically possible, wouldn't keeping the builder methods in a separate struct make the API easier to understand?

A builder, as I understand it, should give flexibility in struct creation and also validate the struct's fields. Its build() method is essentially where all the action happens, since it returns Result<Record, Error>. If we merge the builder's methods into the struct itself, I think we will end up with methods that may return either Record or Result<Record, Error>, since validation would then also need to occur in these methods.

For example, we could make the builder accept either an Interval or start and end coordinates (separately) for use in the created Record. I used new() in my example above to handle only separate coordinates, but instead of exposing a new() that accepts just coordinates, I can imagine a RecordBuilder::with_coords for coordinates and a RecordBuilder::with_interval for intervals. But if we did this on the Record struct, interval(...) would return Record while coords(...) would return Result<_, _>, since it requires extra validation. Another example is allowing the builder to accept either Strand or char to be used as the Record's strand.
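In signatures, that asymmetry might look roughly like this (hypothetical names; Error stands in for the still-undecided error type):

// On a separate builder, both constructors can defer validation to build():
impl RecordBuilder {
    pub fn with_interval<T: Into<String>>(seqname: T, interval: Interval<u64>) -> Self {
        unimplemented!()
    }

    pub fn with_coords<T: Into<String>>(seqname: T, start: u64, end: u64) -> Self {
        unimplemented!()
    }
}

// Merged into Record itself, the same two entry points would have
// different return types, since start..end still needs validation:
impl Record {
    pub fn interval(self, interval: Interval<u64>) -> Record {
        unimplemented!()
    }

    pub fn coords(self, start: u64, end: u64) -> Result<Record, Error> {
        unimplemented!()
    }
}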

There is also the slight issue of the builder methods sharing a namespace with the struct's accessor methods, though this could be resolved with consistent naming (e.g. Record::interval() returns the interval, Record::set_interval() is the builder method).

@bow (Contributor, Author) commented Jun 8, 2017

I was curious to see how a builder would actually be implemented (and one thing led to another), so I ended up with the changeset here: bow@90035c8. This should cover everything I've mentioned above. I hope the commit message is clear enough ~ it's longer than what I usually write ;).

It's still missing some documentation, there should probably be a proper deprecation of the removed functions (instead of just removing them), and we haven't really decided on the right builder struct, so as always, it's still up for discussion.

As a side note, I also want to show a rough estimate of the speedup when iterating over raw records. This was done on a ~1.4 GB GTF file (unzipped, Homo sapiens from Ensembl release 89, without the GTF header), using the code below; cargo build was executed with --release:

extern crate bio;

use std::env;
use bio::io::gff::{self, GffType};


fn main() {
    let mut reader = env::args().nth(1)
        .and_then(|fp| gff::Reader::from_file(fp, GffType::GTF2).ok())
        .expect("a GTF reader");

    let mut count = 0;
    for _ in reader.records() { //  or reader.raw_rows()
        count += 1;
    }
    println!("num entries: {}", count);
}

When iterating over records(), I got:

num entries: 2599436

real	0m47.646s
user	0m47.340s
sys	0m0.280s

and when iterating over raw_rows(), I got:

num entries: 2599436

real	0m4.280s
user	0m4.040s
sys	0m0.230s

and when iterating over records() without parsing the attributes column string:

num entries: 2599436

real	0m4.975s
user	0m4.750s
sys	0m0.220s

Of course, the RawRow tuples themselves will often (if not always) require additional processing. But with the addition of a record builder, I hope this also gives users more control over record creation.

@johanneskoester (Contributor):

Thanks! Sorry for the late reply. I am very busy with some deadlines right now, but I hope to come back to this soon. Looks promising indeed!

@bow (Contributor, Author) commented Jul 7, 2017

No worries @johanneskoester ~ I realize this changeset is quite big and requires some consideration :).

@johanneskoester (Contributor):

I have thought a bit about it. The downside of the builder is that you need new allocations for each record. Further, it is not so intuitive. What about the following general strategy for all record-based file formats:

  • we have a trait AbstractRecord that is accepted by the write method.
  • we have a struct RecordView that is yielded by the iterators and accepted by read. This struct is lazy about the attributes and has a getter for retrieving a hashmap, similar to the format implementation in rust-htslib.
  • we have a struct Record that can be filled from scratch and directly contains an attribute hashmap, without the need for a builder.
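A rough sketch of how these three pieces might fit together (all names, fields, and signatures hypothetical, loosely modeled on the rust-htslib approach):

use multimap::MultiMap;

/// Anything the writer can serialize as one GFF line.
pub trait AbstractRecord {
    fn seqname(&self) -> &str;
    fn start(&self) -> u64;
    fn end(&self) -> u64;
    // ... one accessor per remaining column
}

/// Yielded by the iterators and filled by `read`; keeps the attribute
/// column as raw text and only parses it on demand.
pub struct RecordView {
    seqname: String,
    start: u64,
    end: u64,
    raw_attributes: String,
    // ... other simple columns, decoded directly
}

impl RecordView {
    /// Lazy getter: parses the raw attribute text into a map on request.
    pub fn attributes(&self) -> MultiMap<String, String> {
        unimplemented!()
    }
}

/// Built from scratch by users; holds an eagerly parsed attribute map,
/// with no builder needed.
pub struct Record {
    seqname: String,
    start: u64,
    end: u64,
    attributes: MultiMap<String, String>,
}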

@bow (Contributor, Author) commented Sep 5, 2017

Aha, I am slightly familiar with the RecordView & fillable Record setup (I actually came across it in rust-htslib some weeks ago). Isn't this also what the fasta module already uses?

I have to admit I've grown more used to builders (maybe more than I should be), but the per-record allocation is indeed quite unattractive.

I'll give this one more thought and play around with some implementations ~ it sounds ok for now :).

@johanneskoester (Contributor):

Great! Thanks for considering!

@bow (Contributor, Author) commented Oct 22, 2017

It's been a while, but I've finally got enough time to fiddle with this again. I was wondering if what you had in mind was something along these lines:

  1. Because we are now working with parsing results from the csv library, the RecordView struct would probably be a tuple struct around the tuple returned by csv. It would have either an into_record method that consumes itself to create a proper Record, or a to_record method that creates a Record by clearing its own fields when possible (without deallocating itself). The lazy attribute parsing into a MultiMap could maybe be implemented using a RefCell (or the lazycell crate). I'm not completely sure a parsed MultiMap on its own, outside a proper Record, would be useful, though. Perhaps attribute parsing should only be done when a Record is created?

    In code:

    struct RecordView(/* raw columns returned by csv */);
    impl RecordView {
        pub fn into_record(self) -> Record {/* ... */} // or,
        pub fn to_record(&mut self) -> Record {/* ... */}
    }
  2. The reader struct would be expanded with a read method that looks something like this:

    impl<R: io::Read> Reader<R> {
        pub fn read(&mut self, record_view: &mut RecordView) -> Result<bool>
    }

    which could be used in a while loop to reduce allocations (see the sketch after this list)? I'm not 100% sure here, since the underlying csv already allocates a tuple at every iteration. Perhaps the read method should accept a Record instead?

  3. Since Record should be fillable on its own without intermediate builders, perhaps something like a .clear() method and .set_<field_name> setters for it as well? We could also make the accessor methods return more meaningful types (e.g. strand() returning a bio::utils::Strand instead of a String, as it does now).
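The read method from item 2 could then be driven like this (a hypothetical sketch; RecordView::default() and the loop shape are assumptions, not existing API):

let mut reader = gff::Reader::from_file("annotation.gtf", GffType::GTF2)
    .expect("a GTF reader");

// Reuse one RecordView across iterations to keep allocations down.
let mut record_view = RecordView::default();
while reader.read(&mut record_view).expect("a valid record") {
    // inspect record_view here, or build a full Record from it when needed
}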

On a side note, the csv crate has gone through some significant upgrades. Are there any plans to update rust-bio to the newer API?

@johanneskoester (Contributor):

  1. I prefer attributes to return a separate object instead of using a RefCell or something like that. This approach works very well in rust-htslib. All simple columns can be decoded directly into the RecordView via the csv crate, so there is no extra allocation. I don't think we even need an into_record method.
  2. Yes, read with RecordView.
  3. Yes, exactly.

Yes, before implementing anything we should update to the latest CSV version.
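Direct decoding of the simple columns might look roughly like this (a hypothetical sketch using csv 1.x's serde support; the field layout is an assumption):

#[macro_use]
extern crate serde_derive;

use multimap::MultiMap;

// Simple columns are deserialized directly by csv; the attribute column
// stays as raw text until the separate attributes object is requested.
// (The csv reader would be configured with delimiter(b'\t') and
// has_headers(false).)
#[derive(Debug, Deserialize)]
struct RecordView {
    seqname: String,
    source: String,
    feature_type: String,
    start: u64,
    end: u64,
    score: String,  // may be "."
    strand: String, // may be "+", "-", ".", or "?"
    frame: String,
    raw_attributes: String,
}

impl RecordView {
    /// Separate attributes object, parsed on demand; no RefCell needed.
    fn attributes(&self) -> MultiMap<String, String> {
        unimplemented!()
    }
}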

@johanneskoester (Contributor):

Sorry for the slow answer. I was very busy with the bioconda paper.

@bow (Contributor, Author) commented Nov 7, 2017

No worries, I haven't been the fastest to respond either. Ok then, this issue should indeed wait for the internal csv dependency update.

Congratulations on the paper, by the way (also extended to the other people on it who may stumble upon this) ~ fingers crossed for the peer-reviewed publication!

@johanneskoester (Contributor):

Wait, aren't we already using the latest csv?

@bow (Contributor, Author) commented Nov 16, 2017

Oh wait, yes! I missed the last update to the TOML file ~ I thought it was still at 0.15. Anyway, good then ~ no other blocking issues, it seems :).
