I'm trying to read three CSV files contained in a folder with `spark_read_csv`, where the second CSV file has an additional column that the other two don't. I would expect `spark_read_csv` to align columns by name, but this doesn't happen.

One way to work around this is to use `lapply` with `spark_read_csv` to read every file as a separate table and then call `sdf_bind_rows` to row-bind them into one big data frame. To avoid doing this, is there an option I can pass to `spark_read_csv` that handles these mismatched columns?

An example of my issue is given below:
```r
library(dplyr)
library(sparklyr)
library(magrittr)

data_1 <- tibble(a = 1:5, c = 101:105)
data_2 <- tibble(a = 1:5, b = 56:60, c = 106:110)
data_3 <- tibble(a = 11:15, c = 111:115)

write.csv(data_1, "~/spark_test/data_1.csv", row.names = FALSE)
write.csv(data_2, "~/spark_test/data_2.csv", row.names = FALSE)
write.csv(data_3, "~/spark_test/data_3.csv", row.names = FALSE)

sc <- spark_connect(master = "local", version = "2.4.3")
data <- spark_read_csv(sc, path = "~/spark_test/", memory = TRUE)
# The b column looks like ..., 59, 60, 111, 112, ... and column c is
# filled with NAs, which shouldn't be the case.
```
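For reference, a minimal sketch of the `lapply` workaround mentioned above (assuming the three files above exist in `~/spark_test/`, `sc` is already connected, and that `sdf_bind_rows` matches columns by name and fills missing ones with NA, like `dplyr::bind_rows`):

```r
# Read each file into its own Spark table, then row-bind them by column name.
files <- dir("~/spark_test", pattern = "\\.csv$", full.names = TRUE)
tbls <- lapply(files, function(f) spark_read_csv(sc, path = f))
data <- sdf_bind_rows(tbls)
```

This avoids the silent positional misalignment, at the cost of one read per file.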
I'd expect the behavior in this scenario to be similar to `data.table::rbindlist` or `dplyr::bind_rows`.
And then use `spark_apply()` and readr to convert appropriately...

```r
data %>% spark_apply(function(e) {
  data <- paste(readr::read_csv(e$line), collapse = "\n")
  # TODO: custom transformation to match columns across files.
  data
})
```