
Strings with "NA" still handled incorrectly in sdf_copy_to #2031

Closed
caewok opened this issue Jun 3, 2019 · 4 comments · Fixed by #2757
@caewok commented Jun 3, 2019

Problem:

When copying a data frame into Spark, sdf_copy_to (or Spark itself?) treats character or factor values equal to the string "NA" as the missing value NA. In R, the string "NA" is not the same as NA, so the copy in Spark may contain more NAs than it should.

This appears related to #1854.

Example:

sc <- spark_connect()

na_dat <- data.frame(A = rep("NA", times = 10),
                     B = c(rep("NA", times = 5), rep(NA, times = 5)),
                     C = c(rep("C", times = 5), rep("NA", times = 3), rep(NA, times = 2)),
                     stringsAsFactors = FALSE)
na_dat_spark <- sdf_copy_to(sc = sc, x = na_dat, name = "na_dat",
                            repartition = 8)
na_dat       # A has zero NAs; B has 50% NAs; C has 20% NAs.
na_dat_spark # A and B are all NAs; C is 50% NAs.
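For reference, base R clearly distinguishes the string "NA" from the missing value NA, which is why the NA counts in the comments above differ per column:

```r
# In base R, the string "NA" and the missing value NA are distinct values:
x <- c("NA", NA, "C")
is.na(x)
#> [1] FALSE  TRUE FALSE
sum(is.na(x))
#> [1] 1
# Only the true NA counts as missing; the string "NA" does not.
```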
@jablauvelt commented Feb 14, 2020

Just wanted to say I'm facing the same issue, and there don't seem to be any arguments that affect this behavior.

@wkdavis (Contributor) commented Sep 10, 2020

When data is copied to Spark, it is first written to a temporary CSV file, which is then read into Spark:

sdf_current <- serializer(sc, df, columns, repartition)

def parseCsvField(column: String, value: String) = {
  column match {
    case "integer"   => Try(value.toInt).getOrElse(null)
    case "double"    => Try(value.toDouble).getOrElse(null)
    case "logical"   => Try(value.toBoolean).getOrElse(null)
    case "timestamp" => Try(new java.sql.Timestamp(value.toLong * 1000)).getOrElse(null)
    case "date"      => Try(java.sql.Date.valueOf(value)).getOrElse(null)
    case _           => if (value == "NA") null else value
  }
}

def createDataFrameFromCsv(
    sc: SparkContext,
    path: String,
    columns: Array[String],
    partitions: Int,
    separator: String): RDD[Row] = {
  val lines = scala.io.Source.fromFile(path).getLines.toIndexedSeq
  val rddRows: RDD[String] = sc.parallelize(lines, partitions)
  val data: RDD[Row] = rddRows.map(o => {
    val r = o.split(separator, -1)
    var typed = (Array.range(0, r.length)).map(idx => {
      val column = columns(idx)
      val value = r(idx)
      parseCsvField(column, value)
    })
    org.apache.spark.sql.Row.fromSeq(typed)
  })
  data
}

After the CSV write, any NA becomes the token "NA", so when the Scala method reads and parses the CSV file it treats "NA" as null (Spark's version of NA). This also scoops up any values that were the string "NA" in R before the CSV was created, because after the write both NA and "NA" become the identical token "NA". I think the only way around your issue would be to use some value other than "NA" if you want to distinguish R's NA from the string "NA" in your dataset.
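A minimal base-R sketch of the round trip described above, using write.csv with quote = FALSE to mimic an unquoted temporary CSV (an assumption about the serializer's output format): once written, the true NA and the string "NA" become the same token and cannot be told apart on read-back.

```r
df <- data.frame(A = c("NA", NA), stringsAsFactors = FALSE)
tmp <- tempfile(fileext = ".csv")
# The missing value NA is rendered as the literal token NA (the default
# na = "NA"), which is exactly what the string "NA" is rendered as too.
write.csv(df, tmp, row.names = FALSE, quote = FALSE)
readLines(tmp)
#> [1] "A"  "NA" "NA"
# Rows 1 and 2 of column A are now indistinguishable.
```

Consistent with the workaround suggested above, writing with a sentinel such as na = "__NA_SENTINEL__" (a hypothetical placeholder value) would keep the two cases distinct in the file.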

@yitao-li (Contributor) commented Oct 27, 2020

@wkdavis I'm in the process of replacing the CSV-based serialization code paths with something more sensible in #2757

@yitao-li yitao-li self-assigned this Oct 27, 2020
@yitao-li yitao-li linked a pull request Oct 27, 2020 that will close this issue
@yitao-li (Contributor) commented Nov 10, 2020

@caewok @wkdavis This is fixed now after an extensive rewrite (!) of serialization routines in #2757 🙂

@yitao-li yitao-li added this to the 1.5.0 milestone Nov 16, 2020