Description
Describe the bug, including details regarding any error messages, version, and platform.
Hello, it's me again with big-data R segfaults :)
I have a hive-partitioned dataset of approx. 8 GB, made up of 8537 Parquet files. I can probably share the dataset.
I'm executing this query:
library(arrow)
library(dplyr)
open_dataset("data/bluesky/labeler_logs_dirty_parquet") %>%
  group_by(uri) %>%
  tally() %>%
  filter(n == 1) %>%
  tally() %>%
  collect()
which throws:
*** caught segfault ***
address 0x7f0634a5e2e8, cause 'memory not mapped'
*** caught segfault ***
address 0x7f063441a2d5, cause 'memory not mapped'
Traceback:
1: Table__from_ExecPlanReader(self)
2: x$read_table()
3: as_arrow_table.RecordBatchReader(reader)
4: as_arrow_table(reader)
5: as_arrow_table.arrow_dplyr_query(x)
6: as_arrow_table(x)
7: doTryCatch(return(expr), name, parentenv, handler)
8: tryCatchOne(expr, names, parentenv, handlers[[1L]])
9: tryCatchList(expr, classes, parentenv, handlers)Segmentation fault
Unfortunately, I didn't get a core dump this time, no clue why.
Another query got as far as computing the number of rows and columns, but also segfaulted:
open_dataset("data/bluesky/labeler_logs_dirty_parquet") %>%
  group_by(uri) %>%
  tally() %>%
  collect()
... gets as far as this:
# A tibble: 62,642,379 × 2
and segfaults like so:
*** caught segfault ***
address 0x7ff004949d34, cause 'memory not mapped'
Traceback:
1: vec_slice(x, seq_len(n))
2: vec_head(as.data.frame(x), n)
3: df_head(x, n)
4: tbl_format_setup.tbl(x, width, ..., setup = setup, n = n, max_extra_cols = max_extra_cols, max_footer_lines = max_footer_lines, focus = focus)
5: tbl_format_setup_dispatch(x, width, ..., setup = setup, n = n, max_extra_cols = max_extra_cols, max_footer_lines = max_footer_lines, focus = focus)
6: tbl_format_setup(x, width = width, ..., setup = setup, n = n, max_extra_cols = max_extra_cols, max_footer_lines = max_footer_lines, focus = attr(x, "pillar_focus"))
7: format_tbl(x, width = width, ..., n = n, max_extra_cols = max_extra_cols, max_footer_lines = max_footer_lines, transform = writeLines)
8: print_tbl(x, width, ..., n = n, max_extra_cols = max_extra_cols, max_footer_lines = max_footer_lines)
9: print.tbl(x)
10: (function (x, ...) UseMethod("print"))(x)
The second crash is less surprising, as that's a giant tibble and R probably doesn't like it.
But the result of the first query is essentially a scalar (a single count), so collecting it should be fine.
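One way to separate the two failure modes might be to keep the grouped result as an Arrow Table with compute() instead of converting and printing a tibble of 62 million rows. A minimal sketch on an in-memory toy table follows; the data and the names toy, counts, n_groups and preview are made up for illustration, and whether this avoids the segfault on the real dataset is untested.

```r
library(arrow)
library(dplyr)

# Made-up stand-in for the real dataset; only the `uri` column matters here.
toy <- arrow_table(uri = c("at://a", "at://a", "at://b", "at://c"))

# compute() evaluates the query but keeps the result as an Arrow Table,
# so the full result is not converted to an R data frame or printed.
counts <- toy %>%
  group_by(uri) %>%
  tally() %>%
  compute()

# Number of groups, taken from the Arrow Table rather than from a tibble.
n_groups <- nrow(counts)

# Convert only a couple of rows to an R data frame for inspection.
preview <- as.data.frame(head(counts, 2))
```

If nrow() on the Arrow Table succeeds for the real data but the conversion to a tibble does not, that would point at the Arrow-to-R conversion or at printing rather than at the aggregation itself.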
The Parquet files were originally produced by DuckDB.
This is the schema:
> open_dataset("data/bluesky/labeler_logs_dirty_parquet")
FileSystemDataset with 8537 Parquet files
12 columns
dom: int64
seq: int64
ts: timestamp[us, tz=UTC]
src: string
neg: bool
val: string
uri: string
cid: string
ver: int64
labeler_host: string
year: int32
month: int32
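If the real data cannot be shared, a small synthetic dataset with a few of the same columns might help check whether the query shape alone is enough to trigger the crash. This is a minimal sketch under stated assumptions: the values are invented, the directory toy_dir is a hypothetical temp location, and only uri, neg, year and month from the schema above are included. It is not expected to reproduce the segfault by itself.

```r
library(arrow)
library(dplyr)

# Made-up rows; only a subset of the real columns, and the values are invented.
toy <- data.frame(
  uri   = c("at://a", "at://a", "at://b", "at://c", "at://c", "at://d"),
  neg   = c(FALSE, TRUE, FALSE, FALSE, FALSE, TRUE),
  year  = c(2024L, 2024L, 2024L, 2025L, 2025L, 2025L),
  month = c(1L, 2L, 2L, 1L, 1L, 3L)
)

# Hypothetical output location; write_dataset() writes hive-style
# year=/month= directories by default.
toy_dir <- file.path(tempdir(), "toy_labeler_logs")
write_dataset(toy, toy_dir, format = "parquet", partitioning = c("year", "month"))

# Same query shape as in the report: count the URIs that occur exactly once.
singletons <- open_dataset(toy_dir) %>%
  group_by(uri) %>%
  tally() %>%
  filter(n == 1) %>%
  tally() %>%
  collect()
```

On the toy data the result is a one-row tibble with a single column n. Scaling the number of rows and partitions up from there could show at what size, if any, the crash starts to appear.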
Additional Info
Machine overview:
Memory: 378 GB
CPU: 64x Intel(R) Xeon(R) Gold 6154
OS: Debian 12
R sessionInfo():
> sessionInfo()
R version 4.5.0 (2025-04-11)
Platform: x86_64-pc-linux-gnu
Running under: Debian GNU/Linux 12 (bookworm)
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.11.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.11.0 LAPACK version 3.11.0
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
time zone: Europe/Berlin
tzcode source: system (glibc)
attached base packages:
[1] stats graphics grDevices datasets utils methods base
other attached packages:
[1] paletteer_1.6.0 ggplot2_3.5.2 viridis_0.6.5 viridisLite_0.4.2
[5] pracma_2.4.4 xtable_1.8-4 forcats_1.0.0 readr_2.1.5
[9] arrow_20.0.0 tidyr_1.3.1 stringr_1.5.1 lubridate_1.9.4
[13] dplyr_1.1.4
loaded via a namespace (and not attached):
[1] bit_4.6.0 gtable_0.3.6 rematch2_2.1.2 compiler_4.5.0
[5] renv_1.0.3 tidyselect_1.2.1 parallel_4.5.0 assertthat_0.2.1
[9] gridExtra_2.3 scales_1.4.0 R6_2.6.1 generics_0.1.4
[13] tibble_3.3.0 RColorBrewer_1.1-3 pillar_1.10.2 tzdb_0.5.0
[17] rlang_1.1.6 stringi_1.8.7 bit64_4.6.0-1 timechange_0.3.0
[21] cli_3.6.5 withr_3.0.2 magrittr_2.0.3 grid_4.5.0
[25] hms_1.1.3 lifecycle_1.0.4 vctrs_0.6.5 glue_1.8.0
[29] farver_2.1.2 purrr_1.0.4 tools_4.5.0 pkgconfig_2.0.3
lsb_release -a:
No LSB modules are available.
Distributor ID: Debian
Description: Debian GNU/Linux 12 (bookworm)
Release: 12
Codename: bookworm
uname -a:
Linux <redacted> 6.1.0-34-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.135-1 (2025-04-25) x86_64 GNU/Linux
cat /proc/cpuinfo (truncated):
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 85
model name : Intel(R) Xeon(R) Gold 6154 CPU @ 3.00GHz
stepping : 4
microcode : 0x2007108
cpu MHz : 2992.968
cache size : 16384 KB
physical id : 0
siblings : 64
core id : 0
cpu cores : 64
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 22
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology cpuid tsc_known_freq pni pclmulqdq vmx ssse3 fma cx16 pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves arat umip pku ospke md_clear flush_l1d arch_capabilities
vmx flags : vnmi preemption_timer posted_intr invvpid ept_x_only ept_ad ept_1gb flexpriority apicv tsc_offset vtpr mtf vapic ept vpid unrestricted_guest vapic_reg vid shadow_vmcs pml tsc_scaling
bugs : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs taa mmio_stale_data retbleed gds bhi ibpb_no_ret
bogomips : 5985.93
clflush size : 64
cache_alignment : 64
address sizes : 46 bits physical, 48 bits virtual
power management:
Component(s)
R