[FEA] Refactoring JSON reader tree algorithms with Compressed Sparse Row (CSR) #15903

GregoryKimball · 2024-05-31T22:24:52Z

The steady addition of features to the JSON reader has resulted in some code paths that are error-prone (see #15750) and difficult to maintain. Support for mixed types, coercing nested types to string, array of arrays, null literals and more has been added over the past few releases (see comment) and stretched the original design of token-to-tree and tree-to-column processing.

Status	Topic
🔄	Introduce column vertex structure and graph traversal to the tree representation. Make sure to maintain the pandas requirements for handling array-of-arrays and null literals.
	Introduce mixed type handling with pruning for non-conforming dtypes (updated Spark requirement). Also consider the case where a dtype is not provided for a column with mixed types.
	Add an option for cross-column pruning, for cases when validation fails and all values in the row become null
	#15278

GregoryKimball · 2024-05-31T22:27:48Z

Hello @shrshi, I added this issue about the refactoring work you started this week. Please excuse me if you documented this elsewhere and I missed it. Please feel free to update these topics with your current picture of the project. Thank you!

shrshi · 2024-06-03T19:37:47Z

Tree representation:

This feature introduces a new column_tree_csr struct that stores the column tree in CSR format. The nodes are renumbered level-wise instead of being mapped directly to column ids. This serves two purposes: (i) sub-trees matching input dtype schema can be skipped in between-column pruning, and (ii) sub-trees matching non-conforming dtypes in mixed type columns can be similarly skipped (within-column pruning).
The key advantage of storing column properties (mixed type and map type support, column pruning, ignoring null literals, column validity, and array of arrays support) as 'non-zero' values in the column_tree_csr struct is maintainability and ease of adding new features in the future.
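
To make the idea concrete, here is a minimal host-side sketch of a level-wise renumbered tree stored in CSR form, with a traversal that skips pruned sub-trees. It is only an illustration under stated assumptions: the function names, the parent-array input, and the `keep` mask are hypothetical and are not the column_tree_csr struct or any libcudf API.

```python
from collections import deque


def build_level_order_csr(parent):
    """Build a CSR child adjacency for a tree given as a parent array.

    Hypothetical helper: parent[i] is the parent of node i, and the root
    has parent -1. Nodes are renumbered in level (BFS) order.
    """
    children = [[] for _ in parent]
    root = -1
    for node, p in enumerate(parent):
        if p < 0:
            root = node
        else:
            children[p].append(node)

    # BFS visits nodes level by level; the visit order is the new numbering.
    order = []
    queue = deque([root])
    while queue:
        node = queue.popleft()
        order.append(node)
        queue.extend(children[node])
    new_id = {old: new for new, old in enumerate(order)}

    # Row i of the CSR holds the (renumbered) children of node i.
    row_offsets = [0]
    col_indices = []
    for old in order:
        col_indices.extend(new_id[c] for c in children[old])
        row_offsets.append(len(col_indices))
    return row_offsets, col_indices, order


def traverse_with_pruning(row_offsets, col_indices, keep):
    """Depth-first walk from node 0 that skips any sub-tree whose root is not kept."""
    visited = []
    stack = [0]
    while stack:
        node = stack.pop()
        if not keep[node]:
            continue  # the whole sub-tree below this node is never expanded
        visited.append(node)
        start, end = row_offsets[node], row_offsets[node + 1]
        stack.extend(reversed(col_indices[start:end]))
    return visited


# Assumed example tree (original ids): 0 = root, 1 = "a", 2 = "a.b",
# 3 = "a.c", 4 = "d", 5 = element of "d".
parent = [-1, 0, 1, 1, 0, 4]
row_offsets, col_indices, order = build_level_order_csr(parent)

# Prune the sub-tree rooted at "a" (new id 1), e.g. a column outside the schema.
keep = [True, False, True, True, True, True]
visited = traverse_with_pruning(row_offsets, col_indices, keep)
```

In this sketch a per-node property array indexed by the new ids could sit alongside `row_offsets` and `col_indices`, which is one way to read the 'non-zero' values mentioned above; that pairing is an assumption, not a description of the actual struct.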

GregoryKimball added the feature request, libcudf, cuIO, and Spark labels May 31, 2024

GregoryKimball added this to the Nested JSON reader milestone May 31, 2024

GregoryKimball assigned shrshi May 31, 2024

shrshi mentioned this issue Jun 11, 2024

JSON tree algorithms refactor I: CSR data structure for column tree #15979

Open

4 tasks

shrshi mentioned this issue Jul 6, 2024

[WIP] JSON tree algorithms refactor II: Constructing device JSON column #16205

Draft

8 tasks

GregoryKimball changed the title ~~[FEA] Refactoring JSON reader tree algorithms in 24.08~~ [FEA] Refactoring JSON reader tree algorithms with Compressed Sparse Row (CSR) Jul 30, 2024
