Commit ee5e56d: Merge e2a0e37 into 97648e1

ibgreen committed Apr 12, 2019 (2 parents: 97648e1 + e2a0e37)

Showing 5 changed files with 224 additions and 71 deletions.
91 changes: 61 additions & 30 deletions arrow-docs/api-reference/chunked.md
# Chunked

Holds a "chunked array" that allows a number of array fragments (represented by `Vector` instnces) to be treated logically as a single vector. `Array` instances can be concatenated into a `Chunked` without any memory beind copied.
Holds a "chunked array" that allows a number of array fragments (represented by `Vector` instances) to be treated logically as a single vector. `Vector` instances can be concatenated into a `Chunked` without any memory beind copied.


## Usage

Create a new contiguous typed array from a `Chunked` instance (note that this creates and populates a new typed array unless the `Chunked` holds exactly one chunk):

```js
const typedArray = chunked.toArray();
```

A `Chunked` array supports iteration, random element access, and mutation.
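
For example, a minimal sketch (assuming apache-arrow's `Int32Vector`; `Vector.concat` produces a two-chunk `Chunked` without copying):

```js
const { Int32Vector } = require('apache-arrow');

// concatenating two Vectors yields a Chunked with two inner chunks
const chunked = Int32Vector.from([1, 2]).concat(Int32Vector.from([3, 4]));

for (const value of chunked) console.log(value); // 1 2 3 4
console.log(chunked.get(2)); // 3  (random element access)
chunked.set(2, 30);          // mutation writes through to the owning chunk
console.log(chunked.get(2)); // 30
```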



class Chunked extends [Vector](docs-arrow/api-reference/vector.md)

### Chunked.flatten(...vectors: Vector[]) : Vector

<p class="badges">
<img src="https://img.shields.io/badge/zero-copy-green.svg?style=flat-square" alt="zero-copy" />
</p>

Utility method that flattens a number of `Vector` instances or Arrays of `Vector` instances into a single Array of `Vector` instances. If the incoming Vectors are instances of `Chunked`, the child chunks are extracted and flattened into the resulting Array. Does not mutate or copy data from the Vector instances.

Returns an Array of `Vector` instances.
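
A small sketch of how `flatten` relates to `concat` (assuming `Chunked` and `Int32Vector` are exported by apache-arrow):

```js
const { Chunked, Int32Vector } = require('apache-arrow');

const a = Int32Vector.from([0, 1]);
const b = Int32Vector.from([2, 3]);

const chunked = Chunked.concat(a, b);     // a Chunked wrapping [a, b]
const vectors = Chunked.flatten(chunked); // extracts the inner chunks

console.log(vectors.length);   // 2
console.log(vectors[0] === a); // true: the chunks are the original Vectors
```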

### Chunked.concat(...chunks: Vector<T>[]): Chunked

<p class="badges">
<img src="https://img.shields.io/badge/zero-copy-green.svg?style=flat-square" alt="zero-copy" />
</p>

Concatenates a number of `Vector` instances of the same type into a single `Chunked` Vector. Returns a new `Chunked` Vector.

Note: This method extracts the inner chunks of any incoming `Chunked` instances, and flattens them into the `chunks` array of the returned `Chunked` Vector.
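
A short usage sketch (assuming apache-arrow's exports):

```js
const { Chunked, Int32Vector } = require('apache-arrow');

const chunked = Chunked.concat(
  Int32Vector.from([0, 1]),
  Int32Vector.from([2, 3])
);

console.log(chunked.length);        // 4
console.log(chunked.chunks.length); // 2
console.log(chunked.toArray());     // Int32Array [0, 1, 2, 3]
```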

## Members

### [Symbol.iterator]() : Iterator

`Chunked` arrays are iterable, allowing you to use constructs like `for (const element of chunked)` to iterate over elements. For in-order traversal, this is more performant than random element access.

### type : T (read-only)

Returns the DataType instance which determines the type of elements this `Chunked` instance contains. All vector chunks will have this type.

### length: Number (read-only)

Expand All @@ -60,26 +62,37 @@ Returns an array of the `Vector` chunks that hold the elements in this `Chunked`

### typeId : Type (read-only)

The `typeId` enum value of this `Chunked` instance's `type`.

### data : Data (read-only)

Returns the `Data` instance of the _first_ chunk in the list of inner Vectors.

### ArrayType (read-only)

Returns the constructor of the underlying typed array for the values buffer, as determined by this Vector's DataType.

### numChildren (read-only)

The number of logical Vector children for the Chunked Vector. Only applicable if the DataType of the Vector is one of the nested types (List, FixedSizeList, Struct, or Map).

### stride (read-only)

The number of elements in the underlying data buffer that constitute a single logical value for the given type. The stride for all DataTypes is 1 unless noted here:

- For `Decimal` types, the stride is 4.
- For `Date` types, the stride is 1 if the `unit` is DateUnit.DAY, else 2.
- For `Int`, `Interval`, or `Time` types, the stride is 1 if `bitWidth <= 32`, else 2.
- For `FixedSizeList` types, the stride is the `listSize` property of the `FixedSizeList` instance.
- For `FixedSizeBinary` types, the stride is the `byteWidth` property of the `FixedSizeBinary` instance.

### nullCount (read-only)

Number of null values across all Vector chunks in this chunked array.

### indices : ChunkedKeys<T> | null (read-only)

If this is a dictionary encoded column, returns a `Chunked` instance of the indices of all the inner chunks. Otherwise, returns `null`.

### dictionary: ChunkedDict | null (read-only)

If this is a dictionary encoded column, returns the Dictionary.

## Constructor

### new Chunked(type: DataType, chunks?: Vector[], offsets?: Uint32Array)

Creates a new `Chunked` array instance of the given `type` and optionally initializes it with a list of `Vector` instances.

* `type` - The DataType of the inner chunks.
* `chunks`= - Vectors must all be compatible with `type`.
* `offsets`= - A Uint32Array of offsets where each inner chunk starts and ends. If not provided, offsets are automatically calculated from the list of chunks.

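A minimal construction sketch (assuming the `Int32` DataType class and `Chunked` are exported by apache-arrow):

```js
const { Chunked, Int32, Int32Vector } = require('apache-arrow');

const chunked = new Chunked(new Int32(), [
  Int32Vector.from([0, 1]),
  Int32Vector.from([2, 3])
]);

console.log(chunked.length); // 4
```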

Returns a new `Chunked` instance that is a clone of this instance. Does not copy the underlying data.

### concat(...vectors: Vector[]): Chunked

Concatenates a number of `Vector` instances after the chunks. Returns a new `Chunked` array.

The supplied `Vector` chunks must be the same DataType as the `Chunked` instance.

### slice(begin?: Number, end?: Number): Chunked

Returns a new chunked array representing the logical array containing the elements within the index range, potentially dropping some chunks at the beginning and end.

* `begin`=`0` - The first logical index to be included as index 0 in the new array.
* `end` - The logical index at which to stop (exclusive). Defaults to the length of the array.

Returns a zero-copy slice of this Vector. The begin and end arguments are handled the same way as JS' `Array.prototype.slice`; they are clamped between 0 and `vector.length` and wrap around when negative, e.g. `slice(-1, 5)` or `slice(5, -1)`.
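
A short sketch (the slice itself is zero-copy; `toArray` below copies for display):

```js
const { Int32Vector } = require('apache-arrow');

const chunked = Int32Vector.from([0, 1, 2]).concat(Int32Vector.from([3, 4]));

console.log(chunked.slice(1, 4).toArray()); // Int32Array [1, 2, 3]
console.log(chunked.slice(-2).toArray());   // Int32Array [3, 4]: negative indices wrap
```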


### getChildAt(index : Number): Chunked | null

If this `Chunked` Vector's DataType is one of the nested types (Map or Struct), returns a `Chunked` Vector view over all the chunks for the child Vector at `index`.

### search(index: Number): [number, number] | null
### search(index: Number, then?: SearchContinuation): ReturnType<N>

Given an `index` relative to the whole `Chunked` Vector, binary searches the list of inner chunks to find the chunk containing that location. Returns the child index of the inner chunk, and an element index adjusted to the keyspace of the found inner chunk.

`search()` can be called with only an integer index, in which case a pair of `[chunkIndex, valueIndex]` are returned as a two-element Array:

```ts
import { Int32Vector } from 'apache-arrow';
import { strict as assert } from 'assert';

let chunked = [
  Int32Vector.from([0, 1, 2, 3]),
  Int32Vector.from([4, 5, 6, 7, 8])
].reduce((x, y) => x.concat(y));

let [chunkIndex, valueIndex] = chunked.search(6);
assert(chunkIndex === 1); // index 6 falls in the second chunk
assert(valueIndex === 2); // 6 - 4 = 2 elements into that chunk
```

If `search()` is called with an integer index and a callback, the callback will be invoked with the `Chunked` instance as the first argument, then the `chunkIndex` and `valueIndex` as the second and third arguments:

```ts
let getChildValue = (parent, childIndex, valueIndex) =>
  parent.chunks[childIndex].get(valueIndex);

let childValue = chunked.search(6, getChildValue);
assert(childValue === 6); // the element at logical index 6
```


### isValid(index: Number): boolean

Checks if the element at `index` in the logical array is valid.

Checks the null map (if present) to determine whether the value at the logical `index` is non-null.

### get(index : Number): T['TValue'] | null

Returns the element at `index` in the logical array, or `null` if no such element exists (e.g. if `index` is out of range).

### set(index: Number, value: T['TValue'] | null): void

Writes the given `value` at the provided `index`. If the value is null, the null bitmap is updated.
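
A short get/set sketch:

```js
const { Int32Vector } = require('apache-arrow');

const chunked = Int32Vector.from([1, 2]).concat(Int32Vector.from([3, 4]));

console.log(chunked.get(3));  // 4
chunked.set(3, 40);           // writes through to the owning chunk
console.log(chunked.get(3));  // 40
console.log(chunked.get(99)); // null: out of range
```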

### indexOf(element: Type, offset?: Number): Number

Returns the index of the first occurrence of `element`, or `-1` if the value was not found.

* `offset` - The index to start searching from.
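
A short sketch, as below:

```js
const { Int32Vector } = require('apache-arrow');

const chunked = Int32Vector.from([5, 6]).concat(Int32Vector.from([7, 5]));

console.log(chunked.indexOf(5));    // 0
console.log(chunked.indexOf(5, 1)); // 3: start searching at index 1
console.log(chunked.indexOf(9));    // -1: not found
```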

66 changes: 54 additions & 12 deletions arrow-docs/api-reference/record-batch-writer.md
## RecordBatchWriter

The `RecordBatchWriter` "serializes" Arrow Tables (or streams of RecordBatches) to the Arrow File, Stream, or JSON representations for inter-process communication (see also: [Arrow IPC format docs](https://arrow.apache.org/docs/format/IPC.html#streaming-format)).

The RecordBatchWriter is conceptually a "transform" stream that transforms Tables or RecordBatches into binary `Uint8Array` chunks that represent the Arrow IPC messages (`Schema`, `DictionaryBatch`, `RecordBatch`, and in the case of the File format, `Footer` messages).

These binary chunks are buffered inside the `RecordBatchWriter` instance until they are consumed, typically by piping the RecordBatchWriter instance to a Writable Stream (like a file or socket), enumerating the chunks via async-iteration, or by calling `toUint8Array()` to create a single contiguous buffer of the concatenated results once the desired Tables or RecordBatches have been written.

RecordBatchWriter conforms to the `AsyncIterableIterator` protocol in all environments, and supports two additional stream primitives based on the environment (nodejs or browsers) available at runtime.

* In nodejs, the `RecordBatchWriter` can be converted to a `ReadableStream`, piped to a `WritableStream`, and has a static method that returns a `TransformStream` suitable for chained `pipe` calls.
* In browser environments that support the [DOM/WhatWG Streams Standard](https://github.com/whatwg/streams), corresponding methods exist to convert a `RecordBatchWriter` to the DOM `ReadableStream`, `WritableStream`, and `TransformStream` variants.

*Note*: The Arrow JSON representation is not suitable as an IPC mechanism in real-world scenarios. It is used inside the Arrow project as a human-readable debugging tool and for validating interoperability between each language's separate implementation of the Arrow library.


## Member Fields

### closed : Promise (read-only)

A Promise which resolves when this `RecordBatchWriter` is closed.

## Static Methods

### RecordBatchWriter.throughNode(options?: Object): DuplexStream

Creates a Node.js `TransformStream` that transforms an input `ReadableStream` of Tables or RecordBatches into a stream of `Uint8Array` Arrow Message chunks.

- `options.autoDestroy`: boolean - (default: `true`) Indicates whether the RecordBatchWriter should close after writing the first logical stream of RecordBatches (batches which all share the same Schema), or should continue and reset each time it encounters a new Schema.
- `options.*` - Any Node Duplex stream options can be supplied

Returns: A Node.js Duplex stream

Example:

```js
const fs = require('fs');
const { PassThrough, finished } = require('stream');
const { Table, RecordBatchWriter, Int32Vector, Float32Vector } = require('apache-arrow');

const table = Table.new({
  i32: Int32Vector.from([1, 2, 3]),
  f32: Float32Vector.from([1.0, 1.5, 2.0]),
});

const source = new PassThrough({ objectMode: true });

const result = source
  .pipe(RecordBatchWriter.throughNode())
  .pipe(fs.createWriteStream('table.arrow'));

source.write(table);
source.end();

finished(result, () => console.log('done writing table.arrow'));
```

### RecordBatchWriter.throughDOM(writableStrategy? : Object, readableStrategy? : Object) : Object

Creates a DOM/WhatWG `ReadableStream`/`WritableStream` pair that together transforms an input `ReadableStream` of Tables or RecordBatches into a stream of `Uint8Array` Arrow Message chunks.

- `writableStrategy.autoDestroy`: boolean - (default: `true`) Indicates whether the RecordBatchWriter should close after writing the first logical stream of RecordBatches (batches which all share the same Schema), or should continue and reset each time it encounters a new Schema.
- `writableStrategy.*`= - Any options for QueuingStrategy\<RecordBatch\>
- `readableStrategy.highWaterMark`? : Number
- `readableStrategy.size`?: Number

Returns: an object with the following fields:

- `writable`: WritableStream\<Table | RecordBatch\>
- `readable`: ReadableStream\<Uint8Array\>
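
A browser-oriented sketch (assumes WhatWG streams are available, `table` is an apache-arrow `Table`, and `sink` is any `WritableStream` to receive the bytes):

```js
const { RecordBatchWriter } = require('apache-arrow');

const { writable, readable } = RecordBatchWriter.throughDOM();

// write Tables or RecordBatches into the writable side...
const writer = writable.getWriter();
writer.write(table);
writer.close();

// ...and pipe the serialized Uint8Array chunks to their destination
readable.pipeTo(sink);
```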




## Methods

Returns a new DOM/WhatWG stream that can be used to read the Uint8Array chunks produced by this writer.

- `options` - Passed through to the Node ReadableStream constructor; any Node ReadableStream options.

### close() : void

Close the RecordBatchWriter. After close is called, no more chunks can be written.

### abort(reason?: any) : void
### finish() : this
### reset(sink?: WritableSink<ArrayBufferViewInput>, schema?: Schema | null): this

Resets the writer, optionally changing the sink and schema, so a new logical stream of RecordBatches can be written.

### write(payload?: Table | RecordBatch | Iterable<Table> | Iterable<RecordBatch> | null): void

Writes a `RecordBatch`, all the RecordBatches from a `Table`, or an Iterable of Tables or RecordBatches.
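
A minimal write/finish/collect sketch (assumes the `RecordBatchStreamWriter` subclass and reuses a `table` like the example above):

```js
const { Table, Int32Vector, RecordBatchStreamWriter } = require('apache-arrow');

const table = Table.new({ i32: Int32Vector.from([1, 2, 3]) });

const writer = new RecordBatchStreamWriter();
writer.write(table); // writes all of the Table's RecordBatches
writer.finish();     // finalizes the logical stream

// collect the buffered chunks as one contiguous Uint8Array
writer.toUint8Array().then((buffer) => console.log(buffer.byteLength));
```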

33 changes: 32 additions & 1 deletion arrow-docs/api-reference/row.md
# Row

A `Row` is an Object that retrieves each value at a certain index across a collection of child Vectors. Rows are returned from the `get()` function of the nested `StructVector` and `MapVector`, as well as `RecordBatch` and `Table`.

A `Row` defines read-only accessors for the indices and (if applicable) names of the child Vectors. For example, given a `StructVector` with the following schema:

```ts
import { Data, Field, Int32, Int32Vector, Struct, StructVector, Utf8, Utf8Vector } from 'apache-arrow';
import { strict as assert } from 'assert';

const children = [
  Int32Vector.from([0, 1]),
  Utf8Vector.from(['foo', 'bar'])
];

const type = new Struct<{ id: Int32, value: Utf8 }>([
  new Field('id', children[0].type),
  new Field('value', children[1].type)
]);

const vector = new StructVector(Data.Struct(type, 0, 2, 0, null, children));

const row = vector.get(1);

assert((row[0] === 1) && (row.id === row[0]));
assert((row[1] === 'bar') && (row.value === row[1]));
```

`Row` implements the Iterable protocol, enumerating each value in the order of the child vectors list.

Notes:

- If the Row's parent type is a `Struct`, `Object.getOwnPropertyNames(row)` returns the child vector indices.
- If the Row's parent type is a `Map`, `Object.getOwnPropertyNames(row)` returns the child vector field names, as defined by the `children` Fields list of the `Map` instance.

## Methods

### [kLength]: number (read-only)
### [Symbol.iterator](): IterableIterator<T[keyof T]["TValue"]>
### get(key: K): T[K]["TValue"]

Returns the value at the supplied `key`, where `key` is either the integer index into the set of child Vectors, or the name of a child Vector.
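
A short usage sketch, continuing the `StructVector` example above (`row = vector.get(1)`):

```js
console.log(row.get('value')); // 'bar': look up by child Vector name
console.log(row.get(1));       // 'bar': look up by child Vector index
```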

### toJSON(): any
### toString(): any