Arrow
Apache Arrow is a cross-language development platform for in-memory data. It specifies a standardized language-independent columnar memory format for flat and hierarchical data, organized for efficient analytic operations on modern hardware.
Source: https://arrow.apache.org/
Tables are collections of columns in accordance with a schema. They are the most capable dataset-providing abstraction in Arrow.
Schemas describe a logical collection of several pieces of data, each with a distinct name and type, and optional metadata.
Columns contain the actual data in the form of Chunked Arrays. A Chunked Array is a data structure that manages a list of primitive Arrow arrays logically as one large array. An array holds the actual data in contiguous memory. The diagram further below shows the relation between a chunked array and its arrays.
The following is a logical representation of an Arrow Table:
Table {
    Schema {
        Metadata {
            keys:vec; // Vector of strings
            vals:vec; // Vector of strings
        }
        Fields:vec {
            name:string; // Column name
            type:datatype; // Column data type
        }
    }
    Column:vec {
        name:string;
        type:datatype;
        ChunkedArray {
            chunks:array // Array which holds actual data.
        }
    }
}
Before adding data, declare the name and data type of each column in the schema:
std::vector<std::shared_ptr<arrow::Field>> schema_vector = {
arrow::field("colname_1", arrow::int8()), arrow::field("colname_2", arrow::uint16())};
auto schema = std::make_shared<arrow::Schema>(schema_vector);
To add values to a column, Apache Arrow provides a builder class for each data type. For the two columns above, we therefore need to define an instance of the matching builder class for each:
arrow::MemoryPool* pool = arrow::default_memory_pool();
arrow::Int8Builder int8_builder(pool);
arrow::UInt16Builder uint16_builder(pool);
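Add the elements using the Append function provided by the builder class. Append returns an arrow::Status, which should be checked:
ARROW_RETURN_NOT_OK(int8_builder.Append(10));
ARROW_RETURN_NOT_OK(uint16_builder.Append(12));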
At the end, we finalise the arrays, declare the (type) schema and combine them into a single arrow::Table:
std::shared_ptr<arrow::Array> int8_array;
ARROW_RETURN_NOT_OK(int8_builder.Finish(&int8_array));
std::shared_ptr<arrow::Array> uint16_array;
ARROW_RETURN_NOT_OK(uint16_builder.Finish(&uint16_array));
std::shared_ptr<arrow::Table> table = arrow::Table::Make(schema, {int8_array, uint16_array});
Each column in Arrow is represented by a Chunked Array. A chunked array is a vector of chunks, i.e. arrays, which hold the actual data.
Chunked Array          Array 1
    +---+             +--+--+--+
    | +-------------> |  |  |  |
    +---+             +--+--+--+
    | +---+
    +---+ |           +--+--+--+
    |...| +---------> |  |  |  |
    +---+             +--+--+--+
                       Array 2
Printing a column therefore requires iterating through all the arrays inside its chunked array. Suppose a table has n rows and m columns, and each column (i.e. chunked array) has 3 chunks (i.e. arrays). Each column then holds n elements, distributed sequentially across its chunks. Printing the table means printing each row. To print the 1st row, we need the 1st element of column 1, then the 1st element of column 2, and so on up to the 1st element of column m. In Arrow terms, we print the 1st element in the 1st chunk of the 1st chunked array, then the 1st element in the 1st chunk of the 2nd chunked array, and so on up to the mth chunked array. We continue like this until the 1st chunk of each chunked array is exhausted. Then we move on to the 2nd chunk of each chunked array and print its elements in the same way.
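The traversal order described above can be sketched without the Arrow library. The following is a minimal illustration, not Skyhook's implementation: Chunk and ChunkedColumn are stand-ins built on std::vector (they are not Arrow types), render_rows and sample_columns are hypothetical names, and the sketch assumes every column is split at the same row boundaries.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Stand-ins for illustration only; these are NOT Arrow types.
using Chunk = std::vector<int>;            // plays the role of one array
using ChunkedColumn = std::vector<Chunk>;  // plays the role of one chunked array

// Renders the table row by row: chunk by chunk, and within a chunk,
// element r of every column before moving on to element r + 1.
// Assumption: chunk c has the same length in every column.
std::string render_rows(const std::vector<ChunkedColumn>& columns) {
    std::ostringstream out;
    if (columns.empty()) return out.str();
    const std::size_t num_chunks = columns[0].size();
    for (std::size_t c = 0; c < num_chunks; ++c) {
        const std::size_t chunk_len = columns[0][c].size();
        for (std::size_t r = 0; r < chunk_len; ++r) {
            for (std::size_t col = 0; col < columns.size(); ++col) {
                // Check the alignment assumption stated above.
                assert(columns[col].size() == num_chunks);
                assert(columns[col][c].size() == chunk_len);
                if (col > 0) out << ' ';
                out << columns[col][c][r];
            }
            out << '\n';
        }
    }
    return out.str();
}

// Hypothetical sample: 2 columns, 5 rows, split into chunks of 2, 2 and 1 rows.
std::vector<ChunkedColumn> sample_columns() {
    ChunkedColumn col1 = {Chunk{1, 2}, Chunk{3, 4}, Chunk{5}};
    ChunkedColumn col2 = {Chunk{10, 20}, Chunk{30, 40}, Chunk{50}};
    return {col1, col2};
}
```

Calling render_rows(sample_columns()) yields one line per row, with the values of a row separated by spaces. If columns were chunked at different boundaries, a separate chunk cursor per column would be needed instead of the shared chunk index used here.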
Skyhook metadata is mapped into the Arrow schema as key-value pairs under arrow->schema->metadata. The index of each key and value in the metadata vectors is given by the following enum:
enum arrow_metadata_t {
    METADATA_SKYHOOK_VERSION,
    METADATA_DATA_SCHEMA_VERSION,
    METADATA_DATA_STRUCTURE_VERSION,
    METADATA_DATA_FORMAT_TYPE,
    METADATA_DATA_SCHEMA,
    METADATA_DB_SCHEMA,
    METADATA_TABLE_NAME,
    METADATA_NUM_ROWS
};
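For example, you can get the number of rows using the following code. Metadata values are stored as strings, so the value has to be converted to an integer:
auto schema = table->schema();
auto metadata = schema->metadata();
int num_rows = std::stoi(metadata->value(METADATA_NUM_ROWS));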
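To make the index-based lookup concrete without depending on the Arrow library, here is a minimal sketch. MetadataSketch is a stand-in for the schema metadata (two parallel vectors of strings, as in the logical representation), and make_sample_metadata, read_num_rows and all key names and values are hypothetical and chosen only for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Index order copied from the enum in the text.
enum arrow_metadata_t {
    METADATA_SKYHOOK_VERSION,
    METADATA_DATA_SCHEMA_VERSION,
    METADATA_DATA_STRUCTURE_VERSION,
    METADATA_DATA_FORMAT_TYPE,
    METADATA_DATA_SCHEMA,
    METADATA_DB_SCHEMA,
    METADATA_TABLE_NAME,
    METADATA_NUM_ROWS
};

// Stand-in for the schema metadata (illustration only, not an Arrow type):
// keys[i] and vals[i] together form the i-th key-value pair.
struct MetadataSketch {
    std::vector<std::string> keys;
    std::vector<std::string> vals;

    const std::string& value(std::size_t i) const {
        assert(i < vals.size());
        return vals[i];
    }
};

// Builds sample metadata; the key names and values are made up.
MetadataSketch make_sample_metadata() {
    MetadataSketch md;
    md.keys.resize(METADATA_NUM_ROWS + 1);
    md.vals.resize(METADATA_NUM_ROWS + 1);
    md.keys[METADATA_TABLE_NAME] = "table_name";
    md.vals[METADATA_TABLE_NAME] = "example_table";
    md.keys[METADATA_NUM_ROWS] = "num_rows";
    md.vals[METADATA_NUM_ROWS] = "1000";
    return md;
}

// Values are strings, so numeric entries must be parsed before use.
long long read_num_rows(const MetadataSketch& md) {
    return std::stoll(md.value(METADATA_NUM_ROWS));
}
```

Because the lookup is positional, the sketch only works if the writer stored the pairs in exactly the order of the enum; a lookup by key name would not have that constraint.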
In the Flatbuffers format, the RID is stored as a separate table, i.e. table ROW_TABLE, inside the flatbuffer schema, and the deleted vector is represented as a bitmap inside table ROW. In the Arrow format, both are represented as separate columns of the Arrow table. They are always the last two columns of the table. Their indexes are given by the following macros, where cols is the number of data columns, not counting these two:
#define ARROW_RID_INDEX(cols) (cols)
#define ARROW_DELVEC_INDEX(cols) (cols + 1)
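As a small illustration of how these macros resolve, the sketch below applies them to a plain vector of column names rather than to a real Arrow table. The helper names and the column names "RID" and "DELETED_VECTOR" are hypothetical, and cols is assumed to be the number of data columns.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Macros copied from the text.
#define ARROW_RID_INDEX(cols) (cols)
#define ARROW_DELVEC_INDEX(cols) (cols + 1)

// Hypothetical helper: appends the two bookkeeping columns after the data
// columns, mirroring the "always the last two columns" layout.
std::vector<std::string> with_bookkeeping_columns(std::vector<std::string> data_columns) {
    data_columns.push_back("RID");             // hypothetical column name
    data_columns.push_back("DELETED_VECTOR");  // hypothetical column name
    return data_columns;
}

// Hypothetical sample with two data columns, named as in the schema example.
std::vector<std::string> sample_column_names() {
    std::vector<std::string> data_columns;
    data_columns.push_back("colname_1");
    data_columns.push_back("colname_2");
    return with_bookkeeping_columns(data_columns);
}

// Looks up the bookkeeping column names for a table with `cols` data columns.
std::string rid_column(const std::vector<std::string>& names, int cols) {
    assert(ARROW_DELVEC_INDEX(cols) < static_cast<int>(names.size()));
    return names[ARROW_RID_INDEX(cols)];
}

std::string delvec_column(const std::vector<std::string>& names, int cols) {
    assert(ARROW_DELVEC_INDEX(cols) < static_cast<int>(names.size()));
    return names[ARROW_DELVEC_INDEX(cols)];
}
```

With two data columns, the RID column sits at index 2 and the deleted vector at index 3. Since the macro body uses cols + 1 without parenthesizing the argument, passing a plain variable or literal is the safest usage.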