[C++] Segfault when executing query on large-ish parquet dataset #46814

Open
@mrd0ll4r

Describe the bug, including details regarding any error messages, version, and platform.

Hello, it's me again with big-data R segfaults :)

I have a hive-partitioned dataset of approximately 8 GB, consisting of 8537 Parquet files. I can probably share the dataset.

I'm executing this query:

library(arrow)
library(dplyr)

# Count the URIs that occur exactly once across the whole dataset.
open_dataset("data/bluesky/labeler_logs_dirty_parquet") %>%
  group_by(uri) %>%
  tally() %>%
  filter(n == 1) %>%
  tally() %>%
  collect()

which throws:

 *** caught segfault ***
address 0x7f0634a5e2e8, cause 'memory not mapped'

 *** caught segfault ***
address 0x7f063441a2d5, cause 'memory not mapped'

Traceback:
 1: Table__from_ExecPlanReader(self)
 2: x$read_table()
 3: as_arrow_table.RecordBatchReader(reader)
 4: as_arrow_table(reader)
 5: as_arrow_table.arrow_dplyr_query(x)
 6: as_arrow_table(x)
 7: doTryCatch(return(expr), name, parentenv, handler)
 8: tryCatchOne(expr, names, parentenv, handlers[[1L]])
 9: tryCatchList(expr, classes, parentenv, handlers)
Segmentation fault

Unfortunately, I didn't get a core dump this time, and I don't know why.

Another query got as far as computing the number of rows and columns, but also segfaulted:

open_dataset("data/bluesky/labeler_logs_dirty_parquet") %>%
  group_by(uri) %>%
  tally() %>%
  collect()

... gets as far as this:

# A tibble: 62,642,379 × 2

and segfaults like so:

 *** caught segfault ***
address 0x7ff004949d34, cause 'memory not mapped'

Traceback:
 1: vec_slice(x, seq_len(n))
 2: vec_head(as.data.frame(x), n)
 3: df_head(x, n)
 4: tbl_format_setup.tbl(x, width, ..., setup = setup, n = n, max_extra_cols = max_extra_cols, max_footer_lines = max_footer_lines, focus = focus)
 5: tbl_format_setup_dispatch(x, width, ..., setup = setup, n = n, max_extra_cols = max_extra_cols, max_footer_lines = max_footer_lines, focus = focus)
 6: tbl_format_setup(x, width = width, ..., setup = setup, n = n, max_extra_cols = max_extra_cols, max_footer_lines = max_footer_lines, focus = attr(x, "pillar_focus"))
 7: format_tbl(x, width = width, ..., n = n, max_extra_cols = max_extra_cols, max_footer_lines = max_footer_lines, transform = writeLines)
 8: print_tbl(x, width, ..., n = n, max_extra_cols = max_extra_cols, max_footer_lines = max_footer_lines)
 9: print.tbl(x)
10: (function (x, ...) UseMethod("print"))(x)

The second crash is less surprising: the result is a giant tibble (about 62.6 million rows), and the segfault happens while R is printing it.
But the first query returns essentially a scalar (a single-row count), so that should be fine.
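
To make clear what the two queries are meant to compute, here is a minimal sketch on a tiny synthetic dataset. The `toy` data frame, the `toy_dir` path, and the variable names are made up for illustration and are not part of my real data; the sketch only shows the query shapes and is not expected to reproduce the crash. The second query is kept as an Arrow Table via `compute()` and only measured with `nrow()`, so no large tibble gets printed.

```r
library(arrow)
library(dplyr)

# Hypothetical stand-in data: four rows, with uri "a" appearing twice.
toy <- data.frame(
  uri   = c("a", "a", "b", "c"),
  year  = c(2024L, 2024L, 2025L, 2025L),
  month = c(1L, 2L, 1L, 1L)
)

# Write a small hive-partitioned Parquet dataset to a temporary directory.
toy_dir <- file.path(tempdir(), "toy_labeler_logs")
write_dataset(toy, toy_dir, format = "parquet", partitioning = c("year", "month"))

# Same shape as the first query: count the URIs that occur exactly once.
singletons <- open_dataset(toy_dir) %>%
  group_by(uri) %>%
  tally() %>%
  filter(n == 1) %>%
  tally() %>%
  collect()

# Same shape as the second query, but kept as an Arrow Table and only
# measured, so nothing large is converted to a tibble and printed.
per_uri <- open_dataset(toy_dir) %>%
  group_by(uri) %>%
  tally() %>%
  compute()

singletons$n   # 2 for the toy data ("b" and "c")
nrow(per_uri)  # 3 distinct URIs
```

On the real dataset, the first query should likewise return a single row, which is why I would not expect its result size to be a problem.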

The parquet files were originally produced by DuckDB.
This is the format:

> open_dataset("data/bluesky/labeler_logs_dirty_parquet")
FileSystemDataset with 8537 Parquet files
12 columns
dom: int64
seq: int64
ts: timestamp[us, tz=UTC]
src: string
neg: bool
val: string
uri: string
cid: string
ver: int64
labeler_host: string
year: int32
month: int32

Additional Info

Machine overview:

Memory: 378 GB
CPU: 64x Intel(R) Xeon(R) Gold 6154
OS: Debian 12

R sessionInfo():

> sessionInfo()
R version 4.5.0 (2025-04-11)
Platform: x86_64-pc-linux-gnu
Running under: Debian GNU/Linux 12 (bookworm)

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.11.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.11.0  LAPACK version 3.11.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

time zone: Europe/Berlin
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices datasets  utils     methods   base

other attached packages:
 [1] paletteer_1.6.0   ggplot2_3.5.2     viridis_0.6.5     viridisLite_0.4.2
 [5] pracma_2.4.4      xtable_1.8-4      forcats_1.0.0     readr_2.1.5
 [9] arrow_20.0.0      tidyr_1.3.1       stringr_1.5.1     lubridate_1.9.4
[13] dplyr_1.1.4

loaded via a namespace (and not attached):
 [1] bit_4.6.0          gtable_0.3.6       rematch2_2.1.2     compiler_4.5.0
 [5] renv_1.0.3         tidyselect_1.2.1   parallel_4.5.0     assertthat_0.2.1
 [9] gridExtra_2.3      scales_1.4.0       R6_2.6.1           generics_0.1.4
[13] tibble_3.3.0       RColorBrewer_1.1-3 pillar_1.10.2      tzdb_0.5.0
[17] rlang_1.1.6        stringi_1.8.7      bit64_4.6.0-1      timechange_0.3.0
[21] cli_3.6.5          withr_3.0.2        magrittr_2.0.3     grid_4.5.0
[25] hms_1.1.3          lifecycle_1.0.4    vctrs_0.6.5        glue_1.8.0
[29] farver_2.1.2       purrr_1.0.4        tools_4.5.0        pkgconfig_2.0.3

lsb_release -a:

No LSB modules are available.
Distributor ID: Debian
Description:    Debian GNU/Linux 12 (bookworm)
Release:        12
Codename:       bookworm

uname -a:

Linux <redacted> 6.1.0-34-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.135-1 (2025-04-25) x86_64 GNU/Linux

cat /proc/cpuinfo (truncated):

processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 85
model name      : Intel(R) Xeon(R) Gold 6154 CPU @ 3.00GHz
stepping        : 4
microcode       : 0x2007108
cpu MHz         : 2992.968
cache size      : 16384 KB
physical id     : 0
siblings        : 64
core id         : 0
cpu cores       : 64
apicid          : 0
initial apicid  : 0
fpu             : yes
fpu_exception   : yes
cpuid level     : 22
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology cpuid tsc_known_freq pni pclmulqdq vmx ssse3 fma cx16 pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves arat umip pku ospke md_clear flush_l1d arch_capabilities
vmx flags       : vnmi preemption_timer posted_intr invvpid ept_x_only ept_ad ept_1gb flexpriority apicv tsc_offset vtpr mtf vapic ept vpid unrestricted_guest vapic_reg vid shadow_vmcs pml tsc_scaling
bugs            : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs taa mmio_stale_data retbleed gds bhi ibpb_no_ret
bogomips        : 5985.93
clflush size    : 64
cache_alignment : 64
address sizes   : 46 bits physical, 48 bits virtual
power management:

Component(s)

R
