Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[REVIEW] Fix N/A detection for empty fields in CSV reader #6922

Merged
merged 15 commits into from
Dec 9, 2020

Conversation

vuule
Copy link
Contributor

@vuule vuule commented Dec 5, 2020

Fixes #6682, #6680

Currently, empty fields are treated as N/A regardless on parsing options. However, the desired behavior is to handle empty fields the same way as fields with special values (apply default_na_values, na_filter logic).
This PR irons out the behavior so it matches Pandas in this regard.

  • Tries now support matching empty strings.
  • The list of special NA values is now generated more robustly, so it has correct elements in any parameter combination.
  • Empty string is added to the list of special NA values.
  • Empty string string ("/"/"") is added to NA value list if empty string ("") is included (mirrors Pandas behavior).
  • Added tests for previously failing parameter combinations.
  • Reworked some of the tests to check against Pandas results instead of assumed desired behavior.

@vuule vuule added bug Something isn't working libcudf Affects libcudf (C++/CUDA) code. cuIO cuIO issue labels Dec 5, 2020
@vuule vuule self-assigned this Dec 5, 2020
@vuule vuule added this to PR-WIP in v0.18 Release via automation Dec 5, 2020
@GPUtester
Copy link
Collaborator

Please update the changelog in order to start CI tests.

View the gpuCI docs here.

@vuule vuule added the non-breaking Non-breaking change label Dec 5, 2020
@vuule vuule changed the title [WIP] Fix N/A detection for empty fields in CSV reader [REVIEW] Fix N/A detection for empty fields in CSV reader Dec 5, 2020
@vuule vuule marked this pull request as ready for review December 5, 2020 00:34
@vuule vuule requested review from a team as code owners December 5, 2020 00:34
v0.18 Release automation moved this from PR-WIP to PR-Reviewer approved Dec 7, 2020
Copy link
Contributor

@rgsl888prabhu rgsl888prabhu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@codecov
Copy link

codecov bot commented Dec 7, 2020

Codecov Report

Merging #6922 (cfa64bb) into branch-0.18 (598a14d) will increase coverage by 1.96%.
The diff coverage is n/a.

Impacted file tree graph

@@               Coverage Diff               @@
##           branch-0.18    #6922      +/-   ##
===============================================
+ Coverage        81.53%   83.50%   +1.96%     
===============================================
  Files               96       96              
  Lines            15876    18290    +2414     
===============================================
+ Hits             12945    15273    +2328     
- Misses            2931     3017      +86     
Impacted Files Coverage Δ
python/cudf/cudf/io/csv.py 93.75% <ø> (-0.25%) ⬇️
python/cudf/cudf/io/feather.py 100.00% <0.00%> (ø)
python/cudf/cudf/comm/serialize.py 0.00% <0.00%> (ø)
python/cudf/cudf/_fuzz_testing/io.py 0.00% <0.00%> (ø)
python/dask_cudf/dask_cudf/_version.py 0.00% <0.00%> (ø)
python/dask_cudf/dask_cudf/io/tests/test_csv.py 100.00% <0.00%> (ø)
python/dask_cudf/dask_cudf/io/tests/test_orc.py 100.00% <0.00%> (ø)
python/dask_cudf/dask_cudf/io/tests/test_json.py 100.00% <0.00%> (ø)
...ython/dask_cudf/dask_cudf/io/tests/test_parquet.py 100.00% <0.00%> (ø)
python/cudf/cudf/utils/applyutils.py 98.74% <0.00%> (+0.02%) ⬆️
... and 37 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 598a14d...cfa64bb. Read the comment docs.

@vuule
Copy link
Contributor Author

vuule commented Dec 8, 2020

rerun tests

@@ -224,7 +224,7 @@ __global__ void __launch_bounds__(csvparse_block_dim)
long tempPos = pos - 1;
long field_len = pos - start;

if (field_len <= 0 || serialized_trie_contains(opts.trie_na, raw_csv + start, field_len)) {
if (field_len < 0 || serialized_trie_contains(opts.trie_na, raw_csv + start, field_len)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess field_len < 0 is because we're using inclusive offsets (i.e, [0, 0] includes index 0)... Do we have a bug open for changing inclusive offsets -> exclusive offsets throughout cuio?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might want to add a comment to this so it's not thought to be a mistake. Or use <= -1 ?

Copy link
Contributor Author

@vuule vuule Dec 9, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Took another look and it seems like field_len can never be negative. Since long field_len = pos - start;, and pos is initialized to start and can only advance.
I left the condition as a sanity check, but it actually makes no sense in the context. Will remove the condition, thanks!

@@ -89,6 +89,8 @@ inline thrust::host_vector<SerialTrieNode> createSerializedTrie(
// Serialize the tree trie
std::deque<IndexedTrieNode> to_visit;
thrust::host_vector<SerialTrieNode> nodes;
// suport for matching empty input
nodes.push_back(SerialTrieNode(trie_terminating_character, tree_trie.is_end_of_word));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this mean all tries match empty input? Difficult to follow the logic how this coordinates with the code below.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not always. Here we create a node that will have is_end_of_word = true if the root node of the tree_trie is also end of a word (empty word, since it's the root). Hence, passing tree_trie.is_end_of_word as the second parameter.
I'll expand the comment above to make this clearer. 👍

Copy link
Collaborator

@kkraus14 kkraus14 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pytests lgtm

@rapids-bot rapids-bot bot merged commit 17c8f97 into rapidsai:branch-0.18 Dec 9, 2020
v0.18 Release automation moved this from PR-Reviewer approved to Done Dec 9, 2020
@vuule vuule deleted the bug-csv-empty-na-values branch December 9, 2020 01:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working cuIO cuIO issue libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change
Projects
No open projects
v0.18 Release
  
Done
Development

Successfully merging this pull request may close these issues.

[BUG] csv reader - empty fields treated as null even with na_filter=False
5 participants