Unexpected column behavior with spark_read_csv #2107

Closed
InfProbSciX opened this issue Aug 2, 2019 · 1 comment

Comments

InfProbSciX commented Aug 2, 2019

I'm trying to read three CSV files from a folder with spark_read_csv, where the second file has an additional column that the other two don't. I would expect spark_read_csv to align columns based on column name, but this doesn't happen.

One way to get around this is to use lapply with spark_read_csv to read every file as a separate table and then sdf_bind_rows to row-bind them into one big data frame. To avoid doing this, is there an option I can pass to spark_read_csv that handles these mismatched columns?
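A minimal sketch of that lapply workaround, assuming the files written in the example below (the data_1, data_2, ... table names are just for illustration); sdf_bind_rows(), like dplyr::bind_rows(), should match columns by name and fill missing ones with NA:

library(sparklyr)

# Read each file into its own Spark table, then row-bind them by name.
paths <- list.files("~/spark_test", pattern = "\\.csv$", full.names = TRUE)
tables <- lapply(seq_along(paths), function(i) {
  spark_read_csv(sc, name = paste0("data_", i), path = paths[[i]])
})
data <- sdf_bind_rows(tables)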

An example of my issue is given below:

library(dplyr)
library(sparklyr)
library(magrittr)

data_1 <- tibble(a = 1:5, c = 101:105)
data_2 <- tibble(a = 1:5, b = 56:60, c = 106:110)
data_3 <- tibble(a = 11:15, c = 111:115)

# ensure the output folder exists
dir.create("~/spark_test", showWarnings = FALSE)
write.csv(data_1, "~/spark_test/data_1.csv", row.names = FALSE)
write.csv(data_2, "~/spark_test/data_2.csv", row.names = FALSE)
write.csv(data_3, "~/spark_test/data_3.csv", row.names = FALSE)

sc <- spark_connect(master = "local", version = "2.4.3")
data <- spark_read_csv(sc, path = "~/spark_test/", memory = TRUE)

# Columns are matched by position, not name: in the files without b,
# the values of c slide into the b column (b looks like ..., 59, 60,
# 111, 112, ...) and column c is filled with NAs, which shouldn't be the case

I'd expect the behavior in this scenario to be similar to data.table::rbindlist(fill = TRUE) or dplyr::bind_rows().
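For reference, the name-based alignment dplyr applies locally, using the tibbles from the example above:

library(dplyr)

# bind_rows() matches columns by name: b is NA for the rows coming
# from data_1 and data_3, and c keeps its own values throughout.
bind_rows(data_1, data_2, data_3)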

javierluraschi commented Aug 2, 2019

@InfProbSciX Not that I'm aware of; you will have to implement this yourself.

One option is to read the files as text instead of CSVs, as in:

data <- spark_read_text(sc, path = "~/spark_test/", memory = TRUE)

And then use spark_apply() and readr to convert appropriately...

data %>% spark_apply(function(e) {
  # Reassemble this partition's lines into one string, then parse it.
  df <- readr::read_csv(paste(e$line, collapse = "\n"))
  # TODO: custom transformation to match columns across files.
  df
})
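A sketch of one way to fill in that TODO for this example's files. One big caveat: it assumes each partition happens to contain one whole file with its header as the first line, which spark_read_text does not guarantee; the fixed column set c("a", "b", "c") comes from the example above.

data %>% spark_apply(function(e) {
  df <- readr::read_csv(paste(e$line, collapse = "\n"))
  # Add any column this file lacks so every partition returns
  # the same schema, with NA for the missing values.
  for (col in c("a", "b", "c")) {
    if (!col %in% names(df)) df[[col]] <- NA_integer_
  }
  df[, c("a", "b", "c")]
})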

yitao-li closed this as completed Jun 4, 2021