
Sparklyr and Arrow Max Records #2243

Closed · ryan-chien opened this issue Jan 21, 2020 · 6 comments

ryan-chien commented Jan 21, 2020

Hey Sparklyr Team,

I noticed that when using spark_apply with arrow, there is a default maximum of 10000 records per batch if I forget to set "spark.sql.execution.arrow.maxRecordsPerBatch" when connecting to Spark. The relevant code is spark_apply line 191.

A demonstration of bad results when using the default settings is here: https://github.com/rychien-official/sparklyr_maxrecords/blob/master/readme.md

Is there a technical reason for hard coding the maximum records per batch during the Spark connection? It would be nice if spark_apply could accept the maximum records per batch as a parameter instead. I'd be happy to create a pull request to do so.
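In the meantime, the workaround is to set the option explicitly when connecting. A rough sketch of that (the master URL and the 500000 value are placeholders, not recommendations):

library(sparklyr)

# Set the Arrow batch size up front so spark_apply does not fall back to
# the hard-coded 10000-record default
config <- spark_config()
config[["spark.sql.execution.arrow.maxRecordsPerBatch"]] <- "500000"
sc <- spark_connect(master = "local", config = config)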

Thanks!

-Ryan

mjcarroll1985 commented Feb 24, 2020

Related to this, I've recently discovered that setting spark.sql.execution.arrow.maxRecordsPerBatch to 100000 or more caused an error, because when the value was processed it was converted to scientific notation rather than the integer form Spark requires. I think it can be fixed by changing the R session's scipen option, although I'm not 100% certain that works. So if any work is done on this, it would also be good to check this behaviour.
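The coercion itself is easy to reproduce in plain R (this just illustrates the formatting behaviour, not the sparklyr internals):

format(100000)   # "1e+05" with default options -- not a valid integer for Spark
format(80000)    # "80000" -- short enough to stay in fixed notation

options(scipen = 999)   # bias format() heavily towards fixed notation
format(100000)          # "100000"

Passing the value as a character string ("100000") sidesteps the coercion entirely.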

ryan-chien (Author) commented

@mjcarroll1985 how are you setting spark.sql.execution.arrow.maxRecordsPerBatch?

Are you setting it as a character, or as numeric?

E.g. setting maxRecordsPerBatch as a character:

library(sparklyr)
library(magrittr)

# Set dynamic config, passing the batch size as a character string
dyn_spark_config <-
  spark_config() %>%
  append(.,
         list(
           "spark.sql.execution.arrow.maxRecordsPerBatch" = "1000000"))

Versus numeric:

# Set dynamic config, passing the batch size as a numeric value
dyn_spark_config <-
  spark_config() %>%
  append(.,
         list(
           "spark.sql.execution.arrow.maxRecordsPerBatch" = 100000))

mjcarroll1985 commented

@rychien-official as numeric - it seems to work with a value of, say, 80000, just not with anything >=100000

ryan-chien (Author) commented

Got it. Can you please try using the character version and see if that works?

mjcarroll1985 commented

@rychien-official Yes, specifying as character works - thank you!

ryan-chien (Author) commented

Nice!
