Commit ee5e56d: Merge e2a0e37 into 97648e1

ibgreen committed Apr 12, 2019 (2 parents: 97648e1 + e2a0e37)

Showing 5 changed files with 224 additions and 71 deletions.
91 changes: 61 additions & 30 deletions arrow-docs/api-reference/chunked.md
# Chunked

Holds a "chunked array" that allows a number of array fragments (represented by `Vector` instnces) to be treated logically as a single vector. `Array` instances can be concatenated into a `Chunked` without any memory beind copied.
Holds a "chunked array" that allows a number of array fragments (represented by `Vector` instances) to be treated logically as a single vector. `Vector` instances can be concatenated into a `Chunked` without any memory beind copied.


## Usage

Create a new contiguous typed array from a `Chunked` instance (note that this creates and populates a new typed array unless the `Chunked` holds exactly one chunk):

```js
const typedArray = chunked.toArray();
```

A `Chunked` array supports iteration, random element access, and mutation.
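
For example, a minimal sketch (assuming apache-arrow's `Int32Vector`; `Vector.concat` produces a two-chunk `Chunked` without copying):

```js
const { Int32Vector } = require('apache-arrow');

// concatenating two Vectors yields a Chunked with two inner chunks
const chunked = Int32Vector.from([1, 2]).concat(Int32Vector.from([3, 4]));

for (const value of chunked) console.log(value); // 1 2 3 4
console.log(chunked.get(2)); // 3  (random element access)
chunked.set(2, 30);          // mutation writes through to the owning chunk
console.log(chunked.get(2)); // 30
```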



class Chunked extends [Vector](docs-arrow/api-reference/vector.md)

### Chunked.flatten(...vectors: Vector[]) : Vector

<p class="badges">
<img src="https://img.shields.io/badge/zero-copy-green.svg?style=flat-square" alt="zero-copy" />
</p>

Utility method that flattens a number of `Vector` instances or Arrays of `Vector` instances into a single Array of `Vector` instances. If the incoming Vectors are instances of `Chunked`, the child chunks are extracted and flattened into the resulting Array. Does not mutate or copy data from the Vector instances.

Returns an Array of `Vector` instances.
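
A small sketch of how `flatten` relates to `concat` (assuming `Chunked` and `Int32Vector` are exported by apache-arrow):

```js
const { Chunked, Int32Vector } = require('apache-arrow');

const a = Int32Vector.from([0, 1]);
const b = Int32Vector.from([2, 3]);

const chunked = Chunked.concat(a, b);     // a Chunked wrapping [a, b]
const vectors = Chunked.flatten(chunked); // extracts the inner chunks

console.log(vectors.length);   // 2
console.log(vectors[0] === a); // true: the chunks are the original Vectors
```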

### Chunked.concat(...chunks: Vector<T>[]): Chunked

<p class="badges">
<img src="https://img.shields.io/badge/zero-copy-green.svg?style=flat-square" alt="zero-copy" />
</p>

Concatenates a number of `Vector` instances of the same type into a single `Chunked` Vector. Returns a new `Chunked` Vector.

Note: This method extracts the inner chunks of any incoming `Chunked` instances, and flattens them into the `chunks` array of the returned `Chunked` Vector.
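
A short usage sketch (assuming apache-arrow's exports):

```js
const { Chunked, Int32Vector } = require('apache-arrow');

const chunked = Chunked.concat(
  Int32Vector.from([0, 1]),
  Int32Vector.from([2, 3])
);

console.log(chunked.length);        // 4
console.log(chunked.chunks.length); // 2
console.log(chunked.toArray());     // Int32Array [0, 1, 2, 3]
```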

## Members

### [Symbol.iterator]() : Iterator

`Chunked` arrays are iterable, allowing you to use constructs like `for (const element of chunked)` to iterate over elements. For in-order traversal, this is more performant than random element access.

### type : T (read-only)

Returns the DataType instance which determines the type of elements this `Chunked` instance contains. All vector chunks will have this type.

### length: Number (read-only)

Expand All @@ -60,26 +62,37 @@ Returns an array of the `Vector` chunks that hold the elements in this `Chunked`

### typeId : Type (read-only)

The `typeId` enum value of this `Chunked` instance's `type`.

### data : Data (read-only)

Returns the `Data` instance of the _first_ chunk in the list of inner Vectors.

### ArrayType (read-only)

Returns the constructor of the underlying typed array for the values buffer, as determined by this Vector's DataType.

### numChildren (read-only)

The number of logical Vector children for the Chunked Vector. Only applicable if the DataType of the Vector is one of the nested types (List, FixedSizeList, Struct, or Map).

### stride (read-only)

The number of elements in the underlying data buffer that constitute a single logical value for the given type. The stride for all DataTypes is 1 unless noted here:

- For `Decimal` types, the stride is 4.
- For `Date` types, the stride is 1 if the `unit` is DateUnit.DAY, else 2.
- For `Int`, `Interval`, or `Time` types, the stride is 1 if `bitWidth <= 32`, else 2.
- For `FixedSizeList` types, the stride is the `listSize` property of the `FixedSizeList` instance.
- For `FixedSizeBinary` types, the stride is the `byteWidth` property of the `FixedSizeBinary` instance.

### nullCount (read-only)

Number of null values across all Vector chunks in this chunked array.

### indices : ChunkedKeys<T> | null (read-only)

If this is a dictionary encoded column, returns a `Chunked` instance of the indices of all the inner chunks. Otherwise, returns `null`.

### dictionary: ChunkedDict | null (read-only)

If this is a dictionary encoded column, returns the Dictionary.

## Constructor

### new Chunked(type: DataType, chunks?: Vector[], offsets?: Uint32Array)

Creates a new `Chunked` array instance of the given `type` and optionally initializes it with a list of `Vector` instances.

* `type` - The DataType of the inner chunks.
* `chunks`= - Vectors must all be compatible with `type`.
* `offsets`= - A Uint32Array of offsets where each inner chunk starts and ends. If not provided, offsets are automatically calculated from the list of chunks.

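A minimal construction sketch (assuming the `Int32` DataType class and `Chunked` are exported by apache-arrow):

```js
const { Chunked, Int32, Int32Vector } = require('apache-arrow');

const chunked = new Chunked(new Int32(), [
  Int32Vector.from([0, 1]),
  Int32Vector.from([2, 3])
]);

console.log(chunked.length); // 4
```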

Returns a new `Chunked` instance that is a clone of this instance. Does not copy the underlying data.

### concat(...vectors: Vector[]): Chunked

Concatenates a number of `Vector` instances after the chunks. Returns a new `Chunked` array.

The supplied `Vector` chunks must be the same DataType as the `Chunked` instance.

### slice(begin?: Number, end?: Number): Chunked

Returns a new chunked array representing the logical array containing the elements within the index range, potentially dropping some chunks at the beginning and end.

* `begin`=`0` - The first logical index to be included as index 0 in the new array.
* `end` - The logical index at which to stop (exclusive). Defaults to the length of the array.

Returns a zero-copy slice of this Vector. The begin and end arguments are handled the same way as JS' `Array.prototype.slice`; they are clamped between 0 and `vector.length` and wrap around when negative, e.g. `slice(-1, 5)` or `slice(5, -1)`.
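
A short sketch (the slice itself is zero-copy; `toArray` below copies for display):

```js
const { Int32Vector } = require('apache-arrow');

const chunked = Int32Vector.from([0, 1, 2]).concat(Int32Vector.from([3, 4]));

console.log(chunked.slice(1, 4).toArray()); // Int32Array [1, 2, 3]
console.log(chunked.slice(-2).toArray());   // Int32Array [3, 4]: negative indices wrap
```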


### getChildAt(index : Number): Chunked | null

If this `Chunked` Vector's DataType is one of the nested types (Map or Struct), returns a `Chunked` Vector view over all the chunks for the child Vector at `index`.

### search(index: Number): [number, number] | null
### search(index: Number, then?: SearchContinuation): ReturnType<N>

Given an `index` relative to the whole `Chunked` Vector, binary searches the list of inner chunks to find the chunk containing that location. Returns the child index of the inner chunk, and an element index adjusted to the keyspace of the found inner chunk.

`search()` can be called with only an integer index, in which case a pair of `[chunkIndex, valueIndex]` are returned as a two-element Array:

```ts
import { Int32Vector } from 'apache-arrow';
import { strict as assert } from 'assert';

let chunked = [
  Int32Vector.from([0, 1, 2, 3]),
  Int32Vector.from([4, 5, 6, 7, 8])
].reduce((x, y) => x.concat(y));

let [chunkIndex, valueIndex] = chunked.search(6);
assert(chunkIndex === 1); // index 6 falls in the second chunk
assert(valueIndex === 2); // 6 - 4 = 2 elements into that chunk
```

If `search()` is called with an integer index and a callback, the callback will be invoked with the `Chunked` instance as the first argument, then the `chunkIndex` and `valueIndex` as the second and third arguments:

```ts
let getChildValue = (parent, childIndex, valueIndex) =>
  parent.chunks[childIndex].get(valueIndex);

let childValue = chunked.search(6, getChildValue);
assert(childValue === 6); // the element at logical index 6
```


### isValid(index: Number): boolean

Checks if the element at `index` in the logical array is valid.

Checks the null map (if present) to determine whether the value at the logical `index` is non-null.

### get(index : Number): T['TValue'] | null

Returns the element at `index` in the logical array, or `null` if no such element exists (e.g. if `index` is out of range).

### set(index: Number, value: T['TValue'] | null): void

Writes the given `value` at the provided `index`. If the value is null, the null bitmap is updated.
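
A short get/set sketch:

```js
const { Int32Vector } = require('apache-arrow');

const chunked = Int32Vector.from([1, 2]).concat(Int32Vector.from([3, 4]));

console.log(chunked.get(3));  // 4
chunked.set(3, 40);           // writes through to the owning chunk
console.log(chunked.get(3));  // 40
console.log(chunked.get(99)); // null: out of range
```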

### indexOf(element: Type, offset?: Number): Number

Returns the index of the first occurrence of `element`, or `-1` if the value was not found.

* `offset` - The index to start searching from.
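
A short sketch, as below:

```js
const { Int32Vector } = require('apache-arrow');

const chunked = Int32Vector.from([5, 6]).concat(Int32Vector.from([7, 5]));

console.log(chunked.indexOf(5));    // 0
console.log(chunked.indexOf(5, 1)); // 3: start searching at index 1
console.log(chunked.indexOf(9));    // -1: not found
```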

66 changes: 54 additions & 12 deletions arrow-docs/api-reference/record-batch-writer.md
## RecordBatchWriter

The `RecordBatchWriter` "serializes" Arrow Tables (or streams of RecordBatches) to the Arrow File, Stream, or JSON representations for inter-process communication (see also: [Arrow IPC format docs](https://arrow.apache.org/docs/format/IPC.html#streaming-format)).

The RecordBatchWriter is conceptually a "transform" stream that transforms Tables or RecordBatches into binary `Uint8Array` chunks that represent the Arrow IPC messages (`Schema`, `DictionaryBatch`, `RecordBatch`, and in the case of the File format, `Footer` messages).

These binary chunks are buffered inside the `RecordBatchWriter` instance until they are consumed, typically by piping the RecordBatchWriter instance to a Writable Stream (like a file or socket), enumerating the chunks via async-iteration, or by calling `toUint8Array()` to create a single contiguous buffer of the concatenated results once the desired Tables or RecordBatches have been written.

RecordBatchWriter conforms to the `AsyncIterableIterator` protocol in all environments, and supports two additional stream primitives based on the environment (nodejs or browsers) available at runtime.

* In nodejs, the `RecordBatchWriter` can be converted to a `ReadableStream`, piped to a `WritableStream`, and has a static method that returns a `TransformStream` suitable for chained `pipe` calls.
* In browser environments that support the [DOM/WhatWG Streams Standard](https://github.com/whatwg/streams), corresponding methods exist to convert a `RecordBatchWriter` to the DOM `ReadableStream`, `WritableStream`, and `TransformStream` variants.

*Note*: The Arrow JSON representation is not suitable as an IPC mechanism in real-world scenarios. It is used inside the Arrow project as a human-readable debugging tool and for validating interoperability between each language's separate implementation of the Arrow library.


## Member Fields

### closed : Promise (read-only)

A Promise which resolves when this `RecordBatchWriter` is closed.

## Static Methods

### RecordBatchWriter.throughNode(options?: Object): DuplexStream

Creates a Node.js `TransformStream` that transforms an input `ReadableStream` of Tables or RecordBatches into a stream of `Uint8Array` Arrow Message chunks.

- `options.autoDestroy`: boolean - (default: `true`) Indicates whether the RecordBatchWriter should close after writing the first logical stream of RecordBatches (batches which all share the same Schema), or should continue and reset each time it encounters a new Schema.
- `options.*` - Any Node Duplex stream options can be supplied

Returns: A Node.js Duplex stream

Example:

```js
const fs = require('fs');
const { PassThrough, finished } = require('stream');
const { Table, RecordBatchWriter, Int32Vector, Float32Vector } = require('apache-arrow');

const table = Table.new({
  i32: Int32Vector.from([1, 2, 3]),
  f32: Float32Vector.from([1.0, 1.5, 2.0]),
});

const source = new PassThrough({ objectMode: true });

const result = source
  .pipe(RecordBatchWriter.throughNode())
  .pipe(fs.createWriteStream('table.arrow'));

source.write(table);
source.end();

finished(result, () => console.log('done writing table.arrow'));
```

### RecordBatchWriter.throughDOM(writableStrategy? : Object, readableStrategy? : Object) : Object

Creates a DOM/WhatWG `ReadableStream`/`WritableStream` pair that together transforms an input `ReadableStream` of Tables or RecordBatches into a stream of `Uint8Array` Arrow Message chunks.

- `writableStrategy.autoDestroy`: boolean - (default: `true`) Indicates whether the RecordBatchWriter should close after writing the first logical stream of RecordBatches (batches which all share the same Schema), or should continue and reset each time it encounters a new Schema.
- `writableStrategy.*`= - Any options for QueuingStrategy\<RecordBatch\>
- `readableStrategy.highWaterMark`? : Number
- `readableStrategy.size`?: Number

Returns: an object with the following fields:

- `writable`: WritableStream\<Table | RecordBatch\>
- `readable`: ReadableStream\<Uint8Array\>
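
A browser-oriented sketch (assumes WhatWG streams are available, `table` is an apache-arrow `Table`, and `sink` is any `WritableStream` to receive the bytes):

```js
const { RecordBatchWriter } = require('apache-arrow');

const { writable, readable } = RecordBatchWriter.throughDOM();

// write Tables or RecordBatches into the writable side...
const writer = writable.getWriter();
writer.write(table);
writer.close();

// ...and pipe the serialized Uint8Array chunks to their destination
readable.pipeTo(sink);
```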




## Methods

Returns a new DOM/WhatWG stream that can be used to read the Uint8Array chunks produced by this writer.

- `options` - Passed through to the Node ReadableStream constructor; any Node ReadableStream options.

### close() : void

Close the RecordBatchWriter. After close is called, no more chunks can be written.

### abort(reason?: any) : void
### finish() : this
### reset(sink?: WritableSink<ArrayBufferViewInput>, schema?: Schema | null): this

Resets the writer, optionally changing the sink and schema, so a new logical stream of RecordBatches can be written.

### write(payload?: Table | RecordBatch | Iterable<Table> | Iterable<RecordBatch> | null): void

Writes a `RecordBatch`, all the RecordBatches from a `Table`, or an Iterable of Tables or RecordBatches.
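
A minimal write/finish/collect sketch (assumes the `RecordBatchStreamWriter` subclass and reuses a `table` like the example above):

```js
const { Table, Int32Vector, RecordBatchStreamWriter } = require('apache-arrow');

const table = Table.new({ i32: Int32Vector.from([1, 2, 3]) });

const writer = new RecordBatchStreamWriter();
writer.write(table); // writes all of the Table's RecordBatches
writer.finish();     // finalizes the logical stream

// collect the buffered chunks as one contiguous Uint8Array
writer.toUint8Array().then((buffer) => console.log(buffer.byteLength));
```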

33 changes: 32 additions & 1 deletion arrow-docs/api-reference/row.md
# Row

A `Row` is an Object that retrieves each value at a certain index across a collection of child Vectors. Rows are returned from the `get()` function of the nested `StructVector` and `MapVector`, as well as `RecordBatch` and `Table`.

A `Row` defines read-only accessors for the indices and (if applicable) names of the child Vectors. For example, given a `StructVector` with the following schema:

```ts
import { Data, Field, Int32, Int32Vector, Struct, StructVector, Utf8, Utf8Vector } from 'apache-arrow';
import { strict as assert } from 'assert';

const children = [
  Int32Vector.from([0, 1]),
  Utf8Vector.from(['foo', 'bar'])
];

const type = new Struct<{ id: Int32, value: Utf8 }>([
  new Field('id', children[0].type),
  new Field('value', children[1].type)
]);

const vector = new StructVector(Data.Struct(type, 0, 2, 0, null, children));

const row = vector.get(1);

assert((row[0] === 1) && (row.id === row[0]));
assert((row[1] === 'bar') && (row.value === row[1]));
```

`Row` implements the Iterable protocol, enumerating each value in the order of the child vectors list.

Notes:

- If the Row's parent type is a `Struct`, `Object.getOwnPropertyNames(row)` returns the child vector indices.
- If the Row's parent type is a `Map`, `Object.getOwnPropertyNames(row)` returns the child vector field names, as defined by the `children` Fields list of the `Map` instance.

## Methods

### [kLength]: number (read-only)
### [Symbol.iterator](): IterableIterator<T[keyof T]["TValue"]>
### get(key: K): T[K]["TValue"]

Returns the value at the supplied `key`, where `key` is either the integer index into the set of child Vectors, or the name of a child Vector.
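
A short usage sketch, continuing the `StructVector` example above (`row = vector.get(1)`):

```js
console.log(row.get('value')); // 'bar': look up by child Vector name
console.log(row.get(1));       // 'bar': look up by child Vector index
```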

### toJSON(): any
### toString(): any