
Sparklyr and Arrow Max Records #2243

Closed · ryan-chien opened this issue Jan 21, 2020 · 6 comments

ryan-chien commented Jan 21, 2020

Hey Sparklyr Team,

I noticed that when using spark_apply with arrow, there is a default maximum of 10000 records per batch if I forget to set "spark.sql.execution.arrow.maxRecordsPerBatch" when connecting to Spark. The relevant code is spark_apply line 191.

A demonstration of bad results when using the default settings is here: https://github.com/rychien-official/sparklyr_maxrecords/blob/master/readme.md

Is there a technical reason for hard coding the maximum records per batch during the Spark connection? It would be nice if spark_apply could accept the maximum records per batch as a parameter instead. I'd be happy to create a pull request to do so.
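In the meantime, the workaround is to set the option explicitly when connecting. A rough sketch of that (the master URL and the 500000 value are placeholders, not recommendations):

library(sparklyr)

# Set the Arrow batch size up front so spark_apply does not fall back to
# the hard-coded 10000-record default
config <- spark_config()
config[["spark.sql.execution.arrow.maxRecordsPerBatch"]] <- "500000"
sc <- spark_connect(master = "local", config = config)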

Thanks!

-Ryan

mjcarroll1985 commented Feb 24, 2020

Related to this, I've recently discovered that setting spark.sql.execution.arrow.maxRecordsPerBatch to 100000 or more caused an error, because when the value was processed it was converted to scientific notation rather than the integer form Spark requires. I think it can be fixed by changing the R session's scipen option, although I'm not 100% certain that works. So if any work is done on this, it would also be good to check this behaviour.
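The coercion itself is easy to reproduce in plain R (this just illustrates the formatting behaviour, not the sparklyr internals):

format(100000)   # "1e+05" with default options -- not a valid integer for Spark
format(80000)    # "80000" -- short enough to stay in fixed notation

options(scipen = 999)   # bias format() heavily towards fixed notation
format(100000)          # "100000"

Passing the value as a character string ("100000") sidesteps the coercion entirely.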

ryan-chien (Author) commented

@mjcarroll1985 how are you setting spark.sql.execution.arrow.maxRecordsPerBatch?

Are you setting it as a character, or as numeric?

E.g. setting maxRecordsPerBatch as a character:

library(sparklyr)
library(magrittr)

# Set dynamic config, passing the batch size as a character string
dyn_spark_config <-
  spark_config() %>%
  append(.,
         list(
           "spark.sql.execution.arrow.maxRecordsPerBatch" = "1000000"))

Versus numeric:

# Set dynamic config, passing the batch size as a numeric value
dyn_spark_config <-
  spark_config() %>%
  append(.,
         list(
           "spark.sql.execution.arrow.maxRecordsPerBatch" = 100000))

mjcarroll1985 commented

@rychien-official as numeric - it seems to work with a value of, say, 80000, just not with anything >=100000

ryan-chien (Author) commented

Got it. Can you please try using the character version and see if that works?

mjcarroll1985 commented

@rychien-official Yes, specifying as character works - thank you!

ryan-chien (Author) commented

Nice!
