Unexpected column behavior with spark_read_csv #2107

Closed
InfProbSciX opened this issue Aug 2, 2019 · 1 comment

Comments

InfProbSciX commented Aug 2, 2019

I'm trying to read three CSV files from a folder with spark_read_csv, where the second file has an additional column that the other two don't. I would expect spark_read_csv to align columns based on column name, but this doesn't happen.

One way to get around this is to use lapply with spark_read_csv to read every file as a separate table and then sdf_bind_rows to row-bind them into one big data frame. To avoid doing this, is there an option I can pass to spark_read_csv that handles these mismatched columns?
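A minimal sketch of that lapply workaround, assuming the files written in the example below (the data_1, data_2, ... table names are just for illustration); sdf_bind_rows(), like dplyr::bind_rows(), should match columns by name and fill missing ones with NA:

library(sparklyr)

# Read each file into its own Spark table, then row-bind them by name.
paths <- list.files("~/spark_test", pattern = "\\.csv$", full.names = TRUE)
tables <- lapply(seq_along(paths), function(i) {
  spark_read_csv(sc, name = paste0("data_", i), path = paths[[i]])
})
data <- sdf_bind_rows(tables)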

An example of my issue is given below:

library(dplyr)
library(sparklyr)
library(magrittr)

data_1 <- tibble(a = 1:5, c = 101:105)
data_2 <- tibble(a = 1:5, b = 56:60, c = 106:110)
data_3 <- tibble(a = 11:15, c = 111:115)

# ensure the output folder exists
dir.create("~/spark_test", showWarnings = FALSE)
write.csv(data_1, "~/spark_test/data_1.csv", row.names = FALSE)
write.csv(data_2, "~/spark_test/data_2.csv", row.names = FALSE)
write.csv(data_3, "~/spark_test/data_3.csv", row.names = FALSE)

sc <- spark_connect(master = "local", version = "2.4.3")
data <- spark_read_csv(sc, path = "~/spark_test/", memory = TRUE)

# Columns are matched by position, not name: in the files without b,
# the values of c slide into the b column (b looks like ..., 59, 60,
# 111, 112, ...) and column c is filled with NAs, which shouldn't be the case

I'd expect the behavior in this scenario to be similar to data.table::rbindlist(fill = TRUE) or dplyr::bind_rows().
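For reference, the name-based alignment dplyr applies locally, using the tibbles from the example above:

library(dplyr)

# bind_rows() matches columns by name: b is NA for the rows coming
# from data_1 and data_3, and c keeps its own values throughout.
bind_rows(data_1, data_2, data_3)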

javierluraschi commented Aug 2, 2019

@InfProbSciX Not that I'm aware of; you will have to implement this yourself.

One option is to read the files as text instead of CSVs, as in:

data <- spark_read_text(sc, path = "~/spark_test/", memory = TRUE)

And then use spark_apply() and readr to convert appropriately...

data %>% spark_apply(function(e) {
  # Reassemble this partition's lines into one string, then parse it.
  df <- readr::read_csv(paste(e$line, collapse = "\n"))
  # TODO: custom transformation to match columns across files.
  df
})
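A sketch of one way to fill in that TODO for this example's files. One big caveat: it assumes each partition happens to contain one whole file with its header as the first line, which spark_read_text does not guarantee; the fixed column set c("a", "b", "c") comes from the example above.

data %>% spark_apply(function(e) {
  df <- readr::read_csv(paste(e$line, collapse = "\n"))
  # Add any column this file lacks so every partition returns
  # the same schema, with NA for the missing values.
  for (col in c("a", "b", "c")) {
    if (!col %in% names(df)) df[[col]] <- NA_integer_
  }
  df[, c("a", "b", "c")]
})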

yitao-li closed this as completed Jun 4, 2021