Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

remove embedded nul bytes if present in raw data to be read #2250

Merged
merged 1 commit into from Apr 16, 2020

Conversation

jozefhajnala
Copy link
Contributor

When trying to collect data that contain embedded nul bytes into R the process fails with an error (reproducible example added as a unit test case). This proposes to work around this issue by omitting those bytes.

@javierluraschi
Copy link
Collaborator

javierluraschi commented Feb 4, 2020

Looks like arrow does not support this test case:

spark_read_csv() can read if embedded nuls present: error: embedded nul in string: 'test\0string' 

We can skip the test, added a suggestion, thank you for this fix @jozefhajnala!

@jozefhajnala
Copy link
Contributor Author

Added the skip as suggested.

@jozefhajnala jozefhajnala force-pushed the fix-embedded-nuls branch 2 times, most recently from 732697e to 1c536df Compare April 16, 2020 15:27
@falaki
Copy link
Collaborator

falaki commented Apr 16, 2020

Databricks Connect tests failed. View logs here.

@falaki
Copy link
Collaborator

falaki commented Apr 16, 2020

Databricks Connect tests succeeded. View logs here.

@yitao-li yitao-li mentioned this pull request Apr 16, 2020
@yitao-li yitao-li self-requested a review April 16, 2020 16:56
Copy link
Contributor

@yitao-li yitao-li left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me!
Let's try rebaseing past #2426 and see if arrow tests are skipped as expected.

Signed-off-by: Jozef <jozef.hajnala@gmail.com>
Signed-off-by: Jozef Hajnala <jozef.hajnala@gmail.com>
@falaki
Copy link
Collaborator

falaki commented Apr 16, 2020

Databricks Connect tests succeeded. View logs here.

@yitao-li yitao-li merged commit bf09d50 into sparklyr:master Apr 16, 2020
@yitao-li
Copy link
Contributor

Great success!

@rexdouglass
Copy link

I'm encountering an error on collect
Error in RecordBatch__to_dataframe(x, use_threads = option_use_threads()) :
embedded nul in string: '�\0'

I have the most updated 1.3.1 whose patch notes include "Embedded nul bytes are removed from strings when reading strings from Spark to R (#2250)"
https://spark.rstudio.com/news/

I do not have a reproducible example (the null appears somewhere in a 13 billion row table, and creating one artificially is difficult). So i'm just asking if this hotfix is supposed to cover the use case of running a collect() on a spark table and returning it to R, as is the case where I'm getting this error.

@yitao-li
Copy link
Contributor

yitao-li commented Aug 4, 2020

@rexdouglass Hey I don't think this PR covers the use case involving arrow. You can try disabling arrow as a workaround but then there will be some performance penalty as deserialization will be slower without arrow.

I'll try to address the use case involving arrow in #2633 -- I'm hoping it's non-complicated and can be shipped as part of Sparklyr 1.4.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants