[REVIEW] Fix data corruption in string columns #7746

galipremsagar · 2021-03-28T20:42:47Z

A minimal repro of the above issue:

>>> import cudf
>>> s = cudf.Series(['hi', 'hello', None])
>>> s
0       hi
1    hello
2     <NA>
dtype: string
>>> h = s[0:3]
>>> h
0       hi
1    hello
2     <NA>
dtype: string
>>> s._column.null_count
1
>>> h._column.null_count
1

The mask is computed incorrectly in Column.from_column_view because StringColumn calculates base_size incorrectly:

>>> s._column.mask.to_host_array()
array([3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
      dtype=uint8)
>>> h._column.mask.to_host_array()
array([], dtype=uint8) # Should have a mask similar to the one above.

>>> s._column.base_size
0 # Should be 3
>>> h._column.base_size
0 # Should be 3
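To make the mask output above easier to read, here is a minimal pure-Python sketch of a little-endian validity bitmask. It assumes the mask uses one bit per row (1 = valid), packed least-significant bit first and padded to 64 bytes, which is consistent with the output shown. The helpers `pack_validity` and `unpack_validity` are hypothetical names for illustration, not cuDF APIs.

```python
def pack_validity(valid, pad_to=64):
    """Pack booleans into an LSB-first bitmask, padded to a multiple of pad_to bytes."""
    nbytes = (len(valid) + 7) // 8
    padded = -(-nbytes // pad_to) * pad_to  # round up; stays 0 for an empty input
    mask = bytearray(padded)
    for i, is_valid in enumerate(valid):
        if is_valid:
            mask[i // 8] |= 1 << (i % 8)
    return bytes(mask)


def unpack_validity(mask, size):
    """Read back the first `size` validity bits from the mask."""
    return [bool((mask[i // 8] >> (i % 8)) & 1) for i in range(size)]


# Rows 0 and 1 are valid, row 2 is null, so the first byte is 0b00000011 == 3.
mask = pack_validity([True, True, False])
```

Under this model, reading the mask with a size of 0 yields no bits at all, which mirrors how a `base_size` of 0 leads to an empty mask for the slice.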

This PR fixes the calculation of StringColumn.base_size and adds tests that check it.
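As a rough illustration of why the expected `base_size` is 3, the sketch below models an Arrow-style string layout, where a column of n strings is stored as n + 1 offsets plus a flat character buffer. This is an assumption about the layout made for illustration; `build_string_column` and `size_from_offsets` are hypothetical helpers and are not the code changed in this PR.

```python
def build_string_column(values):
    """Return (offsets, chars) for a list of str-or-None values."""
    offsets = [0]
    chars = bytearray()
    for value in values:
        if value is not None:
            chars += value.encode("utf-8")
        # A null row contributes no characters, so its offset repeats.
        offsets.append(len(chars))
    return offsets, bytes(chars)


def size_from_offsets(offsets):
    """Row count implied by an offsets buffer: one fewer than its length."""
    return max(len(offsets) - 1, 0)


offsets, chars = build_string_column(["hi", "hello", None])
size = size_from_offsets(offsets)
```

With the repro data, the offsets have four entries, so the implied row count is 3 even though the last row is null.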

shwina · 2021-03-28T20:46:10Z

python/cudf/cudf/tests/test_string.py

+def test_string_slice_with_mask():
+    actual = cudf.Series(["hi", "hello", None])
+    expected = actual[0:3]
+
+    assert actual._column.base_size == 3
+    assert_eq(actual._column.base_size, expected._column.base_size)
+    assert_eq(actual._column.null_count, expected._column.null_count)
+    assert_eq(
+        actual._column.mask.to_host_array(),
+        expected._column.mask.to_host_array(),
+    )
+    assert_eq(actual, expected)


Can we convert this to something that tests user-facing behaviour rather than internal behaviour?

In other words, did this bug manifest in a way that affected end-users? If so, can we test that we fixed that instead?

Yeah, I had a similar thought initially and considered checking with the public isnull API. But since that goes to a libcudf call, which returns the correct result without interacting with Column.mask, we cannot validate this using Series.isnull.

The closest user-facing behavior where this issue would surface is a round trip (through serialization) of a string-dtype series that contains nulls, as in test_distributed::test_str_series_roundtrip.

I had to add both a user-facing and an internal test because we don't seem to validate base_size anywhere except in this test, which only checks an empty column:

cudf/python/cudf/cudf/tests/test_string.py

Lines 1315 to 1318 in ccc28d5

def test_string_no_children_properties():
    empty_col = StringColumn(children=())
    assert empty_col.base_children == ()
    assert empty_col.base_size == 0

Sounds like we should definitely add a test for that then (maybe in test_serialize.py?).

We can keep this test in addition, if you prefer. Personally, I'm not a fan of testing internals, but that could be just me :-)

Added a serialization test, where we still have to validate an internal component (i.e., the frames). Removed the mask check and retained the base_size checks in test_string.py.

I don't understand -- what internal attribute are we testing in the new serialize test?

Serialize returns a dict and the frames as buffers. The internal attribute we are testing here is the frames ([index_frame, offset_frame, chars_frame, mask_frame]). To be specific, we want to validate the mask_frame at the last index, which is the right validation.

Got it. Thanks!
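The validation pattern discussed in this thread can be sketched with a toy, pure-Python serializer. It assumes only the frame order mentioned above (index, offsets, chars, mask); `toy_serialize` and its header keys are hypothetical and do not reproduce cuDF's actual serialization format.

```python
import struct


def toy_serialize(values):
    """Return (header, frames) with frames ordered index, offsets, chars, mask."""
    n = len(values)
    index_frame = struct.pack(f"<{n}q", *range(n))
    offsets = [0]
    chars = bytearray()
    mask = bytearray((n + 7) // 8)
    for i, value in enumerate(values):
        if value is not None:
            chars += value.encode("utf-8")
            mask[i // 8] |= 1 << (i % 8)  # set the validity bit for row i
        offsets.append(len(chars))
    offset_frame = struct.pack(f"<{n + 1}i", *offsets)
    header = {"size": n, "frame_count": 4}
    return header, [index_frame, offset_frame, bytes(chars), bytes(mask)]


values = ["hi", "hello", None]
header, frames = toy_serialize(values)
sliced_header, sliced_frames = toy_serialize(values[0:3])

# The mask is the last frame; a full-range slice should serialize the same mask.
mask_frame = frames[-1]
```

Comparing `frames[-1]` for the original and the sliced data is the kind of check described above: if the slice lost its mask, the last frame would differ.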

codecov · 2021-03-28T22:56:06Z

Codecov Report

Merging #7746 (7f279d1) into branch-0.19 (ccc28d5) will increase coverage by 0.40%.
The diff coverage is 100.00%.

@@               Coverage Diff               @@
##           branch-0.19    #7746      +/-   ##
===============================================
+ Coverage        82.13%   82.54%   +0.40%     
===============================================
  Files              101      101              
  Lines            17096    17461     +365     
===============================================
+ Hits             14042    14413     +371     
+ Misses            3054     3048       -6

Impacted Files	Coverage Δ
python/cudf/cudf/core/column/string.py	`86.79% <100.00%> (+0.21%)`	⬆️
python/cudf/cudf/io/feather.py	`100.00% <0.00%> (ø)`
python/cudf/cudf/comm/serialize.py	`0.00% <0.00%> (ø)`
python/cudf/cudf/_fuzz_testing/io.py	`0.00% <0.00%> (ø)`
python/cudf/cudf/core/column/struct.py	`100.00% <0.00%> (ø)`
python/dask_cudf/dask_cudf/_version.py	`0.00% <0.00%> (ø)`
python/dask_cudf/dask_cudf/io/tests/test_csv.py	`100.00% <0.00%> (ø)`
python/dask_cudf/dask_cudf/io/tests/test_orc.py	`100.00% <0.00%> (ø)`
python/dask_cudf/dask_cudf/io/tests/test_json.py	`100.00% <0.00%> (ø)`
...ython/dask_cudf/dask_cudf/io/tests/test_parquet.py	`100.00% <0.00%> (ø)`
... and 38 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

kkraus14 · 2021-03-29T18:35:02Z

@gpucibot merge

galipremsagar added 3 commits March 28, 2021 15:15

fix base_size calculation of string column

66a5283

add distributed round trip test

c6d7321

add test

78d6dc0

galipremsagar added the bug, 3 - Ready for Review, Python, 4 - Needs cuDF (Python) Reviewer, dask, and strings labels Mar 28, 2021

galipremsagar requested review from shwina and kkraus14 March 28, 2021 20:42

galipremsagar requested review from a team as code owners March 28, 2021 20:42

galipremsagar self-assigned this Mar 28, 2021

galipremsagar added the non-breaking Non-breaking change label Mar 28, 2021

shwina reviewed Mar 28, 2021

add serialization tests

7f279d1

kkraus14 approved these changes Mar 29, 2021

kkraus14 removed the 3 - Ready for Review and 4 - Needs Dask Reviewer labels Mar 29, 2021

rapids-bot bot merged commit 42c3bf9 into rapidsai:branch-0.19 Mar 29, 2021
