remove embedded nul bytes if present in raw data to be read #2250

jozefhajnala · 2020-02-02T18:39:14Z

When trying to collect data that contains embedded nul bytes into R, the process fails with an error (a reproducible example is added as a unit test case). This PR proposes working around the issue by omitting those bytes.
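
For readers unfamiliar with the failure mode, here is a minimal base-R sketch of the idea, not the code from this PR. The helper name `strip_embedded_nuls` and the sample payload are hypothetical; the sketch only assumes that the data arrives as a raw vector that is later converted with `rawToChar()`.

```r
# Hypothetical helper (not sparklyr's implementation): drop every 0x00 byte
# from a raw vector so that it can be converted to an R string.
strip_embedded_nuls <- function(bytes) {
  stopifnot(is.raw(bytes))
  bytes[bytes != as.raw(0)]
}

# Illustrative payload: the bytes of "test", one nul byte, then "string".
payload <- c(charToRaw("test"), as.raw(0), charToRaw("string"))

# rawToChar() raises an error when a nul byte is embedded in the data.
failed <- inherits(try(rawToChar(payload), silent = TRUE), "try-error")

# After stripping the nul byte, the conversion succeeds.
cleaned <- rawToChar(strip_embedded_nuls(payload))
```

Dropping the bytes silently changes the data, which is the trade-off this work-around accepts in exchange for a successful read.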

tests/testthat/test-read-write.R

javierluraschi · 2020-02-04T18:57:48Z

Looks like arrow does not support this test case:

spark_read_csv() can read if embedded nuls present: error: embedded nul in string: 'test\0string'

We can skip the test; I added a suggestion. Thank you for this fix, @jozefhajnala!

jozefhajnala · 2020-02-14T18:31:27Z

Added the skip as suggested.

falaki · 2020-04-16T15:43:05Z

Databricks Connect tests failed. View logs here.

falaki · 2020-04-16T16:02:22Z

Databricks Connect tests succeeded. View logs here.

yitao-li

Looks good to me!
Let's try rebasing past #2426 and see if arrow tests are skipped as expected.

falaki · 2020-04-16T17:54:25Z

Databricks Connect tests succeeded. View logs here.

yitao-li · 2020-04-16T18:09:15Z

Great success!

rexdouglass · 2020-08-04T14:57:07Z

I'm encountering an error on collect:
Error in RecordBatch__to_dataframe(x, use_threads = option_use_threads()) :
embedded nul in string: '�\0'

I have the latest release, 1.3.1, whose release notes include "Embedded nul bytes are removed from strings when reading strings from Spark to R (#2250)"
https://spark.rstudio.com/news/

I do not have a reproducible example (the nul appears somewhere in a 13-billion-row table, and creating one artificially is difficult). So I'm just asking whether this fix is supposed to cover running collect() on a Spark table and returning the result to R, which is where I'm getting this error.

yitao-li · 2020-08-04T16:06:11Z

@rexdouglass Hey, I don't think this PR covers the use case involving arrow. You can try disabling arrow as a workaround, but there will be some performance penalty because deserialization is slower without arrow.

I'll try to address the use case involving arrow in #2633 -- I'm hoping it's uncomplicated and can be shipped as part of sparklyr 1.4.
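
One possible way to try the "disable arrow" workaround mentioned above is sketched below. It rests on an assumption that is not confirmed in this thread: that sparklyr only uses Arrow while the arrow package is attached to the R session. The connection `sc` and the table name in the trailing comment are hypothetical.

```r
# Assumption (not confirmed in this thread): sparklyr uses Arrow for
# collect() only while the arrow package is attached to the session.
if ("package:arrow" %in% search()) {
  detach("package:arrow", character.only = TRUE)
}

# Check that arrow is no longer on the search path.
arrow_attached <- "package:arrow" %in% search()

# With a live connection `sc` (hypothetical), collect() would then go
# through the non-Arrow deserialization path, e.g.:
# local_df <- dplyr::collect(dplyr::tbl(sc, "my_table"))
```

If the error persists after detaching arrow, the assumption above does not hold for the installed version and another way of disabling Arrow is needed.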

jozefhajnala force-pushed the fix-embedded-nuls branch from d22272f to 8c86497 on February 2, 2020 18:42

javierluraschi reviewed Feb 4, 2020

jozefhajnala force-pushed the fix-embedded-nuls branch from 8c86497 to 65fcdeb on February 6, 2020 17:49

jozefhajnala force-pushed the fix-embedded-nuls branch 2 times, most recently from 732697e to 1c536df on April 16, 2020 15:27

yitao-li mentioned this pull request Apr 16, 2020

Fix skip_on_arrow() #2426

Merged

yitao-li self-requested a review April 16, 2020 16:56

yitao-li approved these changes Apr 16, 2020

remove embedded nul bytes if present

1797af1

Signed-off-by: Jozef <jozef.hajnala@gmail.com> Signed-off-by: Jozef Hajnala <jozef.hajnala@gmail.com>

jozefhajnala force-pushed the fix-embedded-nuls branch from 1c536df to 1797af1 on April 16, 2020 17:33

yitao-li merged commit bf09d50 into sparklyr:master Apr 16, 2020

yitao-li mentioned this pull request Aug 4, 2020

make 'remove embedded null' behavior configurable #2634

Open
