You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
It seems that using the arrow package (v9.0.0.1) with sparklyr::spark_apply() (v1.7.5) and its group_by parameter may cause unexpected results due to batches being used by the worker(s). Namely, summary results are returned by batch rather than by group. Detaching arrow leads to summary results by group rather than by batch. My apologies if this is expected behavior or previously reported (I couldn't find it) or if this belongs under arrow's issues.
Unfortunately, my real problem demands results by group but I receive results by batch even without using arrow. I'll add that reprex if I figure it out.
Observed
There are only two values in the group_by column, "A" and "B". However, sparklyr::spark_apply() returns four result rows, two for "A" and two for "B". This is consistent with batches allowing 10,000 rows (see #2243 and/or #2503 for 10,000 as a magic number). As each group has 15,000 rows, it seems the rows are split into two batches for "A" and two for "B" and results are returned by batch rather than by group.
Ah, sorry, I was trying to be helpful with this issue that is close to my actual real-world issue, but I can see that this is expected behavior in Chapter 11.9.2 Apache Arrow.
So I'll close this for now and ask my question elsewhere:
Is there a way to say: I have 800 groups, 8 workers, and 100 groups per machine so it's OK to make batches within partition 1's 100 groups but not to break groups across batches?
Summary
It seems that using the
arrow
package (v9.0.0.1) withsparklyr::spark_apply()
(v1.7.5) and itsgroup_by
parameter may cause unexpected results due to batches being used by the worker(s). Namely, summary results are returned by batch rather than by group. Detachingarrow
leads to summary results by group rather than by batch. My apologies if this is expected behavior or previously reported (I couldn't find it) or if this belongs underarrow
's issues.Unfortunately, my real problem demands results by group but I receive results by batch even without using
arrow.
I'll add that reprex if I figure it out.Observed
There are only two values in the
group_by
column,"A"
and"B"
. However,sparklyr::spark_apply()
returns four result rows, two for"A"
and two for"B"
. This is consistent with batches allowing 10,000 rows (see #2243 and/or #2503 for 10,000 as a magic number). As each group has 15,000 rows, it seems the rows are split into two batches for"A"
and two for"B"
and results are returned by batch rather than by group.Created on 2022-12-16 with reprex v2.0.2
Expected
There are only two values in the
group_by
column,"A"
and"B"
, andsparklyr::spark_apply()
returns two result rows and results are returned by group.Created on 2022-12-16 with reprex v2.0.2
The text was updated successfully, but these errors were encountered: