Skip to content

API Design

Michael Lawrence edited this page Oct 9, 2017 · 2 revisions

Design Goals

The overarching goal is to distill the IRanges API into something more user-friendly.

Specific goals:

  • API entry points should be chainable, endomorphic verbs,
  • We intentionally avoid exposure of accessors, and intermediate data structures, although the implementation will be based on them,
  • We will not use any S4 generics, instead relying on dispatch by the low-level API. For meaningful error messages, we will probably need some way to assert that an object satisfies the implicit contract of the function.

Data structures

For compatibility, we will use the existing data structures from the IRanges family of packages. The API will treat these as tables, despite the Bioc API regarding them as vectors.

For the sake of simplicity, some data structures will not be handled by the API (explicitly; some might end up working through duck typing). The primary supported data structures include:

  • Ranges
  • GRanges
  • SummarizedExperiment (SE)

These notes apply primarily to the Ranges API, but at least some of the operations will also be valid for GRanges and even SE.

GRanges obviously introduces the notions of sequence and strand. Could we generalize this?

With RangedData, we tried to represent sequence as the “space”. We could also think of the sequence as a default grouping variable, and the user will have the option to group by an arbitrary column.

Strandedness seems to have an analog with directedness in graphs. We define a “directed” range as one that has an orientation, proceeding either from start or the end. Of course, “start” and “end” connote a direction, at least a meaningful default for “undirected” ranges.

GRangesLists typically arise with gene structures and read alignments. Gene structures typically come from a TxDb, and we can make the selection and shape of the data explicit:

txdb %>% exons() %>% group_by(tx_id)

We could represent GRangesList using a grouped GRanges, but it is not clear if that is semantically appropriate, because the groups are being treated as records/rows in the dataset, rather than as a specification for aggregation. We also want to preserve the GRangesList if possible, since many methods have been written for it. Thus, the API should try to treat a GRangesList as a nested type of GRanges, i.e., as a table, not a list. We could introduce the notion of “wound” ranges, where unwind() would flatten a GRangesList to a GRanges, and wind() would wind a GRanges into a GRangesList, e.g.

wind(exons, tx_id)

For read alignments, the data currently arrive as a GAlignments object. The user should never see this. But it is not clear how to represent the alignment structure as a GRanges. The user might expect one row per alignment (with a range spanning it), and there could be a split_alignments() that breaks up the ranges by the cigar N codes and groups by the original alignments automatically:

bam %>% split_alignments()

And maybe a split_alignments_by_deletions() to exclude the D regions, but that is uncommon.

Construction

We can construct ranges from their basic components (start, end, width). We could guess the variable names if a tabular object is passed to the constructor, just how GRanges() works now:

df %>% Ranges

Maybe extra parameters could specify atypical mappings, like:

df %>% Ranges(start=st, width=wdth)

This matches up well with existing tabular constructors like data.frame() and data_frame().

Arithmetic

Unary

IRanges has already explored this space and defined its own “algebra”, but we could revisit it. bedtools is also inspiring.

We need to make adjustments to the start, end and width. Extending dplyr, this would just be mutate(), e.g.:

mutate(x, start=start+1L)

And while that could be supported, we typically want to make relative adjustments.

One complication is that the start, end and width are mutually dependent. Changing the start, as above, needs to change either the width or the end to preserve integrity. We assume that the user thinks of the range as a pair of endpoints (this is natural in genomics anyway), so setting the start will modify the width but preserve the end. The opposite, preserving the start but not the other endpoint, is IRanges::shift(). Maybe we should be more explicit about whether the width or the other endpoint is changed by side effect.

Here are some possible verbs for changing the width:

stretch_start/end
Extend the range by ‘x’ by pulling the

start or the end (and leaving the other one anchored),

squeeze_start/end
Opposite of stretch (could just use

negative value for stretch, but being explicit seems preferable in many cases),

Not sure whether squeeze is actually useful, as we typically want to stretch a range to consider context, not squeeze it.

Or maybe it would be more obvious if we supported the notion of anchoring, similar to grouping. This would require decorating the object. For example,

stretch(anchor_end(x), y)

This pushes information from the syntax to the object state. While it simplifies the grammar, object behavior is more complex and implicit. But if the code is part of the same chain, it seems intuitive.

There would be three ways to anchor:

  • anchor_start()
  • anchor_end()
  • anchor_center()

Anchoring the midpoint would enable symmetric zooming.

This is basically a deconstruction of IRanges::resize(), except resize() is absolute. Perhaps set_width() could play the role of resize(). It is not clear whether we need set_start() and set_end().

When strand is involved, we will need something like:

  • anchor_3p (start for negative strand)
  • anchor_5p (end for negative strand)

Anchoring the width would not make sense, because (1) it is not a point that can be intuitively anchored and (2) stretching and squeezing obviously modify the width, so it would not apply to those.

For width-invariant changes, there is shift() from the IRanges API. That is a symmetric operation (positive for right, negative for left). We could introduce:

  • shift_left()
  • shift_right()

If we wanted the direction to be explicit.

Note that with GRanges, strand can modulate direction, and we need to distinguish left/right (chromosomal coordinates) from the direction of transcription:

  • shift_upstream()
  • shift_downstream()
    Fluent IRanges bedtools
    stretch(anchor_start()) - slop -r
    stretch(anchor_end()) - slop -l
    stretch(anchor_center()) +() slop -b
    set_width(anchor_*) resize(fix=*) -
    shift_right() shift() shift -s
    shift_left() - -
    shift_up/downstream() - -

Note that resize,GRanges() and slop -s use the strand for direction, i.e, anchor_3p() and anchor_5p(). Bedtools supports width-relative adjustment with slop -pct; we will let the user be explicit, e.g.: stretch(x, 2*width(x)).

One problem is that can be stretched or shifted out of bounds, or squeezed out of existence. Currently, GenomicRanges will warn the user when a range is pushed out of bounds and suggest calling trim(). We should probably throw an error in this case, because it is never a good idea for an object to be in a (semi)-invalid state. The resolution needs to be explicit, as part of the function call. By default, we throw an error. There could be variants of all of these operations with the post fix _up_to that imply the trim, e.g., stretch_up_to() or set_width_up_to(). Another resolution would be to drop the ranges that would become out of bounds. Those could use the _or_drop() postfix.

We could also use this anchoring notion to define a new verb flip() that generates flanking regions, but that will cause semantic issues with strandedness. Instead, we probably need flank_left() and flank_right(). With strand, there would be flank_before() and flank_after().

Binary

Binary operations between two ranges mostly align with set operations that treat ranges as sets of integers. These are parallel/vectorized operations between two vectors/tables of ranges, so are not to be confused with the aggregating set operations described later.

Currently, these set operations have the p prefix, e.g., punion(). We could make the binary nature more obvious by aliasing them as operators:

  • %union%
  • %intersect%
  • %setdiff%

There are also operations that relate to the relative positions of two ranges:

  • between(): the range in the gap between the two ranges
    • To get the separation, just do width(between(x, y))
  • span(): a range that spans both ranges (and the gap)

As a note, to express a binary operation involving the subject of a mutate(), summarize() or whatever call, we could use the pronoun:

mutate(x, foo = width(. %intersect% other_ranges))

Comparison

The Allen’s Interval Algebra defines different relationships between ranges. We could map those to special %x% operators (the formatting of this table is messed up in the GitHub display, so click Edit to the org-mode source):

Example Operator Inverse
“xxxx ” %upstream_of%, %left_of%
” xxxx” %downstream_of%, %right_of%
“xxxx ” %flanks_upstream%, %flanks_left%
” xxxxx” %flanks_downstream%, %flanks_right%
“xxxx ” %over% %outside%
” xxxxxx”
“xxxx ” %begins%, %starts%
“xxxxxx ” %begun_by%, %started_by%
” xxx ” %within%
” xxxxxx ” %includes%
” xxxxx” %finishes%, %ends%
” xxxxxxx” %finished_by%, %ended_by%
“xxxxxxxxx” == %!=%
“xxxxxxxxx”

Of those, only %over% and == (and their inverses) are currently supported. Not clear how many of these are actually useful in genomics, but these are easy to implement and describe.

For some relationships, there are multiple operators, which behave differently with respect to strand/direction.

The “before” and “after” relationships have only proven useful for some k, usually k=1. Those correspond to precede() and follow() in the current framework.

Restriction

We can keep or discard ranges based on range-specific comparisons. At the low-level, one could filter as a table. For example, to filter to the ranges that overlap some other ranges:

filter(x, ranges %over% other_ranges)

But we typically want to abstract references to ranges. This means high-level functions, including:

  • filter_by_overlaps() as an analog of subsetByOverlaps()
    • How to invert this (equivalent of %outside%)?
      • filter_by_non_overlaps()?

Aggregation

General aggregation will operate on a preceding group_by(), so objects will need to keep track of their grouping.

We often want to aggregate based on a matching/overlap operation, e.g., summarizing ranges that fall into some windows, or breaking ties in nearest hits. Ideally, some sort of grouping could be carried over from a previous overlap join, i.e., there is some variable that groups the joined ranges by the original query (or subject) range. This corresponds to the bedtools map tool. See the Merging section for more details.

Other range aggregations include:

range
min start and max end
  • a little confusing due to multiple use of the word “range”, maybe compute_extents?
coverage
counting overlapping ranges per position
  • the range analog of table()
  • needs to be an endomorphism, not generate an RleList
  • the output should be a GPos by default, as coverage is typically seen as a per-position summary
    • compute_runs could convert to an run-length encoded GRanges
  • covered_by_any() and covered_by_all() may be useful conveniences based on the coverage and the equivalence of ranges and logical vectors (TRUE inside, FALSE outside), see coerce,Rle,Ranges().

There is also a group of set operations view the range as a set of integers, counting from the start through the end. IRanges currently implements intersect(), union(), and setdiff(). There is some confusion though, because %in%() and match() operate at the range level and thus do not follow the compatible semantics. The ranges equivalent of unique() is reduce(), and disjoin() is in the same family.

There are also p variants of these functions that operate in parallel down two objects. While these do aggregate ranges (if we think of them as sets of integers), they operate within individual rows, so they are not aggregations in the present sense.

Some more notes on the special operations:

reduce
combine overlapping and adjacent ranges
  • in a way this is the range analog of unique()
  • bedtools calls it merge
disjoin
at the set level equivalent to reduce, but really a

union of the end points

The range aggregations themselves imply a grouping, which we want to use for aggregating other variables in the dataset. Some of the operations currently return a “revmap” list to do this.

We could express the grouping verbs more explicitly, e.g.:

  • reduce() maps to group_by_any_overlap()
  • disjoin() maps to group_by_all_overlap() (many-to-many)

So one could imagine an API where a reduce() is specified by:

r %>% group_by_any_overlap() %>% summarize(foo.mean=mean(foo))

where the grouping result is a deferred reduce(), finally executed by the summarize() call. The problem is that one could interpret the semantic of summarize() to be “apply these operations to each group, generating a list of tables, and stack them”, i.e., there is an implicit desire to perform a reduce() that should be made more explicit. This argues for coalescing the operations into:

r %>% reduce(foo.mean=mean(foo))

In the context of general tabular manipulations, it might make sense to explicitly qualify these summaries as range operations. This must be done consistently. I.e.,

  • intersect_ranges()
  • union_ranges()
  • setdiff_ranges()
  • reduce_ranges()
  • disjoin_ranges()

It would be nice to have a complement_ranges() that is like gaps() but does not treat strand as a “space”, which has the frustrating effect of including the entire sequence in gaps on the other strands. The name gaps() is also confusing, since it implies the result will only contain ranges between a pair of input ranges (the flanking regions are also included).

Merging

There are several common types of joins:

  • Mutate (combination) join (merge())
  • Filter join (filter with %in%)
  • Set operations (intersect(), etc)

A join consists of two steps:

  1. Matching to generate a bipartite graph, and a corresponding table where each row represents an edge (or, optionally, node for singletons) in that graph,
  2. Optionally apply some aggregation on the graph/table, like
    • Taking the left nodes with at least one edge for filtering,
    • Taking all unique left nodes and singleton right nodes for union.

The types of matching correspond to the comparison operations, described above. Conventional joins tend to match on identity, so the key columns are folded together. Would we carry over the other ranges, or drop them like an identity join? Probably carry them over.

Matching operations yield, internally, a Hits object, which we have recently started modeling as a graph. The canonical coercion to a Grouping is by the query (left) node, which is useful, because we are typically operating on the ranges of interest and want to summarize another set with respect to them. We can describe this as a new type of join: a grouping join. It defers the aggregation and just groups by the graph in some way, typically by left node. That operation would be group_by_left_join(). This is not perfect though, because the function does a lot more than group the rows.

Sometimes we want to group by the connected components of the graph, so there should be a convenient way to do that.

The most obvious and useful matching is the overlap, and findOverlaps() has supported that for a long time, along with restrictions for “start”, “end”, “within” and “equal”. We want to recast findOverlap() and similar matching functions to join operations, maybe with the term ojoin instead of join. Would we want join* variants for all operations, like flanking_upstream_join() and finishing_join()? Really only “any” overlap and “within” overlap, along with nearest neighbords, has proven useful in the current framework.

A very common aggregation on overlaps is counting how many subject ranges overlap each query, i.e., countOverlaps(). This could be done using dplyr:

left_join(x, group_by_left_ojoin(., y) %>%
             summarize(key = unique(key), count = N()),
          "key")

Reading that,

mutate(x, count = count_overlaps(., y))

seems a lot easier to understand. And it also suggests that we do not need to use join terminology at all, even though the operations could be classified as joins. In other words, we could keep find_overlaps() as a left-ojoin, and group_by_overlaps() as a join that groups the matches by query. It does not solve the problem, however, of the complexity of implementing something like count_overlaps(). We could use a hybrid of dplyr and base R:

mutate(x, count = lengths(group_by_overlaps(., y)))

Or maybe a bit more explicit,

mutate(x, count = group_nrows(group_by_overlaps(., y)))

In the case of GRanges, findOverlaps() has always required identical strand. This often surprises users, including experts. Instead, the require of identical strand should be expressed as a filter() on the find_overlaps() result.

When the input to find_overlaps() is already grouped, overlap is at the group level, i.e., if any two ranges in different groups overlap, those groups are considered overlapping. This supports GRangesList-style overlap operations.

For nearest neighbors, there is nearest(), precede(), and follow(). We should give these more consistent names, like:

join_nearest()
nearest on either side
join_nearest_left/right()
nearest on the left/right (strand ignored)
join_nearest_up/downstream()
nearest on one side, considering strand

Typical post-filters will include exclusion by distance, exclusion of overlapping ranges, etc. Tie breaking is an example of a post-aggregation.

Sorting

Things to order by (in some order of precedence):

  • Start, end, width, midpoint?

While sort() might follow IRanges (start, end), arrange() or sort(by =) would make this explicit.

For GRanges, seqname and strand are also useful.

This is a place where generic tables and annotated range tables diverge. Each row of our table has a primary value, the range, on which we can sort. Do we force the user to spell out the range-sorting, e.g:

r %>% arrange(start, end, score)

Or do we provide a convenience:

r %>% arrange_ranges(score)

For some reason arrange_ranges() does not seem obvious enough. Probably should keep it explicit.

Data import

The canonical way to import annotated genomic ranges is rtracklayer::import(). While import() is a useful abstraction, it may be too abstract. There is benefit to enabling the user to express the expected return value and/or the format of the file being parsed. Since we are more or less stipulating that every object is a GRanges, there is no need to be explicit about the type of the return value. Instead, we could follow the pattern of read.csv() and read_csv() and have functions named according to read_[format](), e.g., read_bed(), read_bam(), etc.

For BAM files specifically, the user typically wants the alignments, but sometimes the read sequences are also useful. Should that be a separate function (read_bam_with_sequences()), or a parameter?