Google Summer of Code 2019 Report
Google Summer of Code 2019 has come to an end. I have worked on extending current processing methods with column-oriented processing via Arrow. The report includes work done in each phase with relevant links to the code, challenges faced, and achievements made so far.
Skyhook projects add the data management capabilities into the storage system, which helps with scalability for databases in the cloud. Skyhook is part of the larger 'programmable storage' research at UC Santa Cruz. Programmable storage means delegating data management, like transform or process data, tasks to the storage layer. Skyhook uses Ceph as an object-store.
Skyhook extends Ceph distributed object storage with a limited set of database functionality. This includes common data processing tasks such as select, project, and aggregate data, as well as higher-level data management tasks such as indexing, statistics collection, data layout, and formatting tasks. Skyhook used to store data in row-oriented fashion (using Flatbuffers). Therefore my project was to extend the current processing methods with column-oriented processing. Processing methods mainly include transforming row-oriented data to column-oriented data and process the queries on the column-oriented data.
In this phase, the main goal was to transform the row-oriented data to column-oriented data. It started with an understanding of the relevant skyhook code and setting up the environment. Parallelly, I worked on writing a small program which converts an array of structs (row data) to the Apache arrow table. Here is the link for the program.
Next challenge was to learn about Apache Arrow architecture and methods. Once equipped with essential knowledge I wrote a transform method which converts flatbuffer to apache arrow. Apart from that methods were written to dump/read apache arrow to/from a file (which was critically helpful in debugging)
- Skeleton for transform DB (bf07204)
- Process rows for transform DB operation (f40feb0)
- Added object type as an extended attribute (a58bb02)
- Transform flatbuffer to Apache arrow (13b8062)
After the transform method, the next target was to do processing of Apache arrow table. Initially, the processing method was written only for projecting some/all the columns in the table. Afterward, the processing of predicates was added. Other than processing logic to compress and split arrow table was also written for future needs. Also, the method to print an arrow table in CLS_LOG was written.
Documenting the work done so far was also one of the primary goals of this phase. All the design and implementation of columnar data is captured under the Wiki page.
Also, Jeff LeFevre (Mentor) and I constructed a data type comparison table which compares the data types supported by different formats. As Skyhook is intended to with generic datatypes, coming up with a minimum set of data types which can represent maximum formats is an interesting problem to solve in the Skyhook. We compared PostgreSQL data types with the other formats. Here is the link for the table.
- Add methods to split and merge arrow tables (8d5f675)
- Change apache arrow print function (3de9a01)
- Process apache arrow buffer (1bd694f)
This being the last phase, the main target was to optimize and evaluate the processing of arrow vs flatbuffer. Major optimization done was to make the processing of arrow column-wise instead of row-wise, which proved to be better in performance. Also, changes in metadata (fbmeta) were adopted into skyhook arrow methods.
For all the evaluations each node in a cluster is equipped with the following configuration:
Linux Ubuntu 18.04, Kernel 4.15.0-55 64-bit Version CPU Two Intel Xeon Silver 4114 10-core CPUs at 2.20 GHz RAM 192GB ECC DDR4-2666 Memory Disk One 1 TB 7200 RPM 6G SAS HDs Disk One Intel DC S3500 480 GB 6G SATA SSD NIC Dual-port Intel X520-DA2 10Gb NIC (PCIe v3.0, 8 lanes) NIC Onboard Intel i350 1Gb
We evaluated the performance of Apache Arrow with Flatbuffer in terms of duration taken for processing of different queries. Also, we have tested performance by varying the size of the objects i.e. a varying number of rows in each object.
In this experiment, we uploaded 100 objects each having 50K rows (1 MB) of flatflex and arrow. The graph shows the results of three queries ran over different cluster configurations (1, 2 and 4 OSDs) The results show that the arrow data format performs slightly better than flatflex format. Although the performance of flatflex and arrow is similar in 4 OSD configuration, arrow performs significantly better as we decrease the number of OSDs.
In this experiment, we uploaded 100 objects each having 500K rows (10 MB) of flatflex and arrow. The graph shows the results of three queries ran over different cluster configurations (1, 2 and 4 OSDs) For the queries on 1 OSD configuration, arrow performs better than flatflex almost by an order of magnitude. And around 7X in 2 OSDs and 3X in 4 OSDs cases. Therefore, with bigger object size arrow data format performs a lot better than flatflex data format.
In this experiment, we uploaded 100 objects each having 500K rows (100 MB) of flatflex and arrow. The graph shows the results of three queries ran over different cluster configurations (1, 2 and 4 OSDs) For the all the queries on all OSD configuration, arrow performs 3X better than flatflex.
We compared the size of objects (100 Objects) on Ceph for arrow and flatflex. We compared two sets of objects each having 50K, 500K, and 50000K rows.
For each object, the arrow takes almost 1/3rd space as compared to flatflex.
In this experiment, we compared the time taken for converting 100 1MB objects (50K Rows/Obj), 10MB objects (500K Rows/Obj) and 100MB objects (50000K Rows/Obj) flatflex to the arrow on different cluster configurations As we increase the number of OSDs the transform duration improves by the order of two.
Apache Arrow performed better in the execution of queries than flatflex in all the cluster configuration. This might be because the arrow takes less space for each object than flatflex format, therefore processing less amount of data.
Also, as we increase the number of OSDs both the data formats perform significantly better by leveraging parallel processing.
Work planned in each phase was delivered successfully. There is a lot of scope for enhancement on top of the work done so far but that is out of the scope of the GSoC Program. Also, I will be uploading a patch upstream in Ceph after the GSoC for the possible enhancement in dynamic object interface. NOTE: This work will be a voluntary work which is not a part of GSoC.
It was challenging yet exciting journey. My GSoC project was part of a larger research and development project it helped me to develop critical thinking and problem-solving skills. I learned a lot throughout the project which mainly includes different serialization types (Google's Flatbuffer and Apache Arrow) and Ceph in general. Ceph is a huge open-source effort and I am glad that I got to know bits and pieces of it. Also, I gained knowledge about the trade-offs between different parameters in storage systems.
One of the main advantage gained by participating in GSoC is, I truly got connected to the Open Source world. During the initial phase, I got a chance to submit a patch upstream into Google's Flatbuffer. Also, I communicated with folks on IRC channels of Apache Kudu and Apache Cassandra for understanding more about support for Apache Arrow.
Last but not least, GSoC gave me a perfect start to my masters. Having project done in the CROSS which is part of UC Santa Cruz, I got an early exposure to academic folks. Most of the projects in CROSS being in the storage systems domain getting connected to the members will be helpful for me in the coming future.