Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

how to Write or Display interval types #9

Open
brentp opened this issue Jan 26, 2024 · 4 comments
Open

how to Write or Display interval types #9

brentp opened this issue Jan 26, 2024 · 4 comments

Comments

@brentp
Copy link
Member

brentp commented Jan 26, 2024

We could use std::fmt::Display but what if we have VCF input and want to write, e.g. BCF output.

Currently, we're using standard outputting, but should likely switch to writers. This will require tracking that, for example, we can have a BCF writer from a VCF query, or a compressed (BGZF) writer from a VCF, but not a BAM writer from a VCF query.

The user should be able to specify, e.g. --output-format BAM | CRAM | SAM or --output-format BCF | VCF along with a --compression-level = 0..9

For BED, perhaps it's --output-format BED12 | BED3 | ??? where perhaps ??? is a template that can be filled.

I need to think about how how to implement this and to look at changes to noodles in 0.61.0 that unify BAM/CRAM/SAM reading.

@brentp
Copy link
Member Author

brentp commented Jan 30, 2024

now, we get a Report struct here in the example.

Instead of handling all of the branching there (in the example code), we probably want to have something like:

writer = Writer::init(&args)
...
for intersection ... {
    let report = intersection.report(...)
    writer.write(report)
}

so what goes to init() instead of args? Could be:

use sniff::FileFormat;
impl Writer {
    pub fn init(in:FileFormat, out:FileFormat) -> Result<Writer, FormatCompatibilityError> { ... }
}

this can be a good start, but will need a way to pass all of the options, for example the option to count overlaps should be known by the writer. But this can be in an additional struct.

@brentp
Copy link
Member Author

brentp commented Mar 5, 2024

The above seems good enough to proceed to specify input format X to output format Y.
But we also need to handle adding extra columns. For example, if the user asked for counts (-c in bedtools). In VCF/BCF, this would go into the INFO field with, for example counts=4,5,6 which also requires updating the header of the VCF.

For BAM, it can go to tags and for bed, it is appended as a column.

Who is responsible for keeping that information?

Is it:

let wtr = Writer::init(in_fmt, out_fmt, compression, args.count, args.count_bases, ...)

or

wtr.write(report, count, count_bases)

but then any possible additional count or column must be an argument to these. We can likely special-case those and also accept a function:

impl Writer {
    pub fn init(
        in_fmt: FileFormat,
        out_fmt: Option<FileFormat>,
        compression: Compression,
        column_fn: fn(Option<&report>) -> String
)

with the convention that when the argument is None, the function returns the name (for VCF), e.g. count so that it can be added to the header. But we also need to know the type for the VCF header...
So then we can instead accept the trait:

pub trait ColumnReporter {
   /// report the name, e.g. `count` for the INFO field of the VCF 
   fn name(&self) -> String
   /// report the type, for the INFO field of the VCF
   fn type(&self) -> Type // Type is some enum from noodles or here that limits to relevant types
   fn description(&self) -> String
   fn number(&self) -> Number // Number is some enum ...

  fn value(&report) -> Value // Value probably something from noodles that encapsulates Float/Int/Vec<Float>/String/...
}

This makes it more work to implement, but I think it's worth it and actually not too onerous.
With this, Writer.write can be:

impl Writer {
    ...
    pub fn write(r: &Report, Vec<dyn ColumnReporter>) {...}
}

brentp added a commit that referenced this issue Apr 1, 2024
@brentp
Copy link
Member Author

brentp commented Apr 1, 2024

This is started in 9e6f440

It's necessary to have a way to link each ColumnReporter which is responsible for the type and name metadata with the files in the Report. Report is Vec<ReportFragment> and that is:

pub struct ReportFragment {
    pub a: Option<Position>,
    pub b: Vec<Position>,
    pub id: usize,
}

so we might want

    pub fn write(&self, report: &Report, crs: Vec<Vec<Box<dyn ColumnReporter>>>) {

where the outer vec is has length of the number of b files and each of those has a length equal to the number of columns from that file.
This can get verbose, when we want a simple thing like the 3 bed columns, then we need:

struct BedReporter {
   name_: String,
   idx: usize,
   type: Type,
   desc: String,
  }

impl ColumnReporter for BedReporter {
    fn name(&self) { self.name }
    fn ftype(&self) { self.type }
    ...
   fn value(&self,  r: &ReportFragment) -> Result<Value, ColumnError> {
      r.b.map(|p| {
        match p {
            Position::Bed(br) => match self.idx {
                  0 => p.chrom(),
                   1 => p.start(),
                   2 => p.stop()
            },
            _ => Error
        }
      })
  }
}

This is very flexible, but would get quite complicated just to extract a single column

@brentp
Copy link
Member Author

brentp commented Apr 2, 2024

I am exploring using lua snippets again as this seems flexible and user-friendly enough. The cost is performance, but I think that's not the main concern.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant