Added a section explaining how slicing reads are handled
Reworded it so that the chunk deciding behavior is optional

Added some information to the proto file


Markdown table formatting
westonpace authored and jacques-n committed Feb 6, 2022
1 parent 3670518 commit 9e3ca4a
Showing 2 changed files with 21 additions and 4 deletions.
5 changes: 5 additions & 0 deletions proto/substrait/algebra.proto
@@ -75,6 +75,11 @@ message ReadRel {
repeated FileOrFiles items = 1;
substrait.extensions.AdvancedExtension advanced_extension = 10;

// Many files consist of indivisible chunks (e.g. parquet row groups
// or CSV rows). If a slice partially selects an indivisible chunk
// then the consumer should employ some rule to decide which slice to
// include the chunk in (e.g. include it in the slice that contains
// the midpoint of the chunk)
message FileOrFiles {
oneof path_type {
string uri_path = 1;
20 changes: 16 additions & 4 deletions site/docs/relations/logical_relations.md
@@ -37,12 +37,24 @@ Read definition types are built by the community and added to the specification.


#### Files Type

| Property | Description | Required |
| --------------------------- | ----------------------------------------------------------------- | -------- |
| Items | An array of items (paths or path globs) associated with the read | Required |
| Format per item | Enumeration of available formats. The only current option is PARQUET. | Required |
| Slicing parameters per item | Information to use when reading a slice of a file. | Optional |

##### Slicing Files

A read operation may read only part of a file. This is convenient, for example, when distributing
a read operation across several nodes. The slicing parameters are specified as byte offsets
into the file.
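
As a rough illustration of byte-offset slicing, the sketch below divides a file of known size into contiguous, non-overlapping byte ranges, one per node. The function name `split_into_slices` and the `(start, length)` tuple layout are assumptions made for this example; they are not defined by the specification text above.

```python
from typing import List, Tuple


def split_into_slices(file_size: int, num_slices: int) -> List[Tuple[int, int]]:
    """Return (start, length) byte ranges covering [0, file_size) with no overlap.

    Hypothetical helper: the spec only says slices are byte offsets into the file.
    """
    if num_slices <= 0:
        raise ValueError("num_slices must be positive")
    # Spread any remainder over the first `extra` slices so sizes differ by at most 1.
    base, extra = divmod(file_size, num_slices)
    slices = []
    start = 0
    for i in range(num_slices):
        length = base + (1 if i < extra else 0)
        slices.append((start, length))
        start += length
    return slices


# Example: a 10-byte file read by 3 nodes.
print(split_into_slices(10, 3))
```

The ranges are computed without looking at the file contents, so a boundary may land in the middle of a record or row group.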

Many file formats consist of indivisible "chunks" of data (e.g. parquet row groups). If a slice
boundary falls inside such a chunk, the consumer can decide which slice the chunk belongs to. For
example, one possible approach is to read a chunk only if its midpoint (computed by dividing by
2 and rounding down) is contained within the requested byte range.
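
The midpoint approach can be sketched as follows. This is one reading of the rule, not a normative algorithm: it assumes the midpoint is `chunk_start + chunk_length // 2`, that byte ranges are half-open (`[start, start + length)`), and that chunk positions are already known (for Parquet, from the file footer). The function name and parameters are hypothetical.

```python
def chunk_in_slice(chunk_start: int, chunk_length: int,
                   slice_start: int, slice_length: int) -> bool:
    """Return True if the chunk's midpoint lies inside the slice's byte range."""
    # Integer division rounds down, matching "dividing by 2 and rounding down".
    midpoint = chunk_start + chunk_length // 2
    # Assumption: the slice covers the half-open range [slice_start, slice_start + slice_length).
    return slice_start <= midpoint < slice_start + slice_length


# Three chunks given as (start, length); two slices splitting a 250-byte file.
chunks = [(0, 100), (100, 100), (200, 50)]
slices = [(0, 125), (125, 125)]

for s_start, s_len in slices:
    selected = [c for c in chunks if chunk_in_slice(c[0], c[1], s_start, s_len)]
    print((s_start, s_len), "reads", selected)
```

When the slices are contiguous and together cover the whole file, each midpoint falls in exactly one slice, so every chunk is read once even though the second chunk straddles the boundary at byte 125.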


## Filter Operation