
Improve Serialization #941

Closed
javierluraschi opened this issue Aug 16, 2017 · 21 comments

Comments

@javierluraschi
Collaborator

javierluraschi commented Aug 16, 2017

There are two areas worth considering here:

  1. Improve collect() serialization. This is already implemented as columnar-based collect; however, I believe only for numeric and logical data types.

  2. Implement copy_to() through columnar invoke().

Assuming (2) is on a par with current implementations of copy_to(), we can stop here; otherwise, we would have to explore additional serialization improvements.
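For concreteness, a minimal sketch of the two code paths in question, run against a local connection with column types (numeric, logical, Date) that exercise the serializer; the object names are illustrative only:

library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")

df <- data.frame(
  id   = c(1, 2, 3),                          # numeric: covered by the columnar collect
  ok   = c(TRUE, FALSE, TRUE),                # logical: covered by the columnar collect
  when = as.Date(c("2017-08-01", "2017-08-02", "2017-08-03"))  # Date: not yet covered
)

# (2) copy_to() serializes R data into Spark
df_spark <- copy_to(sc, df, "df_spark", overwrite = TRUE)

# (1) collect() deserializes Spark data back into R
df_round_trip <- df_spark %>% collect()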

Related issues:

@JakeRuss

JakeRuss commented Aug 21, 2017

@javierluraschi I know you're hard at work supporting these features, but do you have any estimate for how long before sparklyr will support date (and datetime) column types?

I'm updating a production database via sparklyr now, but the code is fragile because it includes workarounds for managing timestamps, and the rest of the dev team has me weighing the merits of a switch to SparkR. If you think the answer is "soon" then I may be able to stall the refactor and continue using sparklyr (my preference).

@kevinykuo
Collaborator

@JakeRuss We'll be taking a look at this next week to see if we can do something about dates before entirely revamping copy_to(), and we'll report back.

@JakeRuss

JakeRuss commented Sep 1, 2017

@kevinykuo That update sounds great from my end. As soon as it's ready, I'll be in line to test it out. Many thanks!

@javierluraschi
Collaborator Author

@JakeRuss thanks for the feedback. There are two points to consider here:

  1. The current implementation of copy_to does contain a number of serialization issues; however, copy_to was never intended to copy "big data" into the cluster; it is more likely that the data will already be available in the Spark cluster.
  2. There are a few other serialization issues; however, most of the serialization code in sparklyr was actually forked from SparkR. Therefore, the relevant part of this work is to avoid using the original SparkR serializer and use a more robust approach.

If you end up switching to SparkR, I would love to hear feedback from you on which things happen to work in SparkR but not in sparklyr.

We don't have a timeline yet to implement this "Improve Serialization" feature. However, if there is a particular issue that is blocking you, please open a github issue and I'll try to address it very soon without depending on the completion of this work.

@JakeRuss

I appreciate your thoughts, @javierluraschi. My desired workflow may be irrelevant to the copy_to discussion, but I'm not certain, so let me explain a bit more; if this discussion should be moved, I will open another issue.

I mainly use spark_read_jdbc to pull data into Spark from our production database, make a calculation, and then record the results back to another database table. Does this copy_to serialization work affect serialization in spark_read_jdbc? The only remaining pain point between my ideal workflow and what I am using right now is that I can't currently work with timestamp fields. When I use spark_read_jdbc, timestamp fields are brought into Spark as character, and I then convert them to Unix epochs (integers). From there I can handle the calculations and various date manipulations using epoch math before I send the results back to our database. The spark_write_jdbc step also requires a workaround for timestamps, because I have to write the epoch first and then convert that field to a timestamp on the database side.
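For illustration, a minimal sketch of that epoch workaround, assuming an existing connection sc and a table already registered in Spark as "events" with a character column ts_string (the names are hypothetical); unix_timestamp() is a Spark SQL function that sparklyr's dplyr backend passes through to Spark:

library(sparklyr)
library(dplyr)

events <- tbl(sc, "events")

events_epoch <- events %>%
  # unix_timestamp() is evaluated by Spark SQL, not by R
  mutate(ts_epoch = unix_timestamp(ts_string, "yyyy-MM-dd HH:mm:ss")) %>%
  # date manipulations become integer math on the epoch
  mutate(ts_next_day = ts_epoch + 86400L)

# the results are then written back with spark_write_jdbc(), and the epoch
# column is converted to a timestamp on the database side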

I did experiment with a sparkR workflow but there were other issues that also required workarounds using that process too. And the code wasn't as expressive as the dplyr pipes, so I am content to stick with sparklyr and wait for the serialization improvements before I finalize these processes.

Please let me know if I should open a separate serialization issue for spark_read_jdbc and spark_write_jdbc.

@javierluraschi
Collaborator Author

@JakeRuss not at all; copy_to would not affect spark_read_jdbc or spark_write_jdbc. Most of the blocking issues this bug is tracking fall into the copy_to category, which does not affect you.

Scala Date should map to R Date and Scala Timestamp to R POSIXct while collecting; however, there might be additional types to consider for JDBC or other formats that should fall under data conversion as well.

@JakeRuss, what would really help us here would be to get a copy of the schema (not necessarily the data) that you are using, to make sure data round-trips correctly when we get to this work. Something like:

CREATE TABLE COMPANY(
   ID INT PRIMARY KEY     NOT NULL,
   NAME           TEXT    NOT NULL,
   AGE            INT     NOT NULL,
   ADDRESS        CHAR(50),
   SALARY         REAL
);

and I'll make sure this gets validated when the improvements get implemented. Feel free to comment here or open a new issue with details that I can close and link to this one.

Also worth mentioning that I do want to work on this issue as soon as possible. I've been cleaning up old github issues and while no specific serialization issue seems critical, it's now obvious that there are many little ones that make this more urgent. I'll keep you posted!

@JakeRuss

Moved to a new issue, thank you @javierluraschi!

@russellpierce
Contributor

@javierluraschi You mentioned that copy_to was never intended to copy "big data" into the cluster. What is the preferred pathway for large datasets? Getting the data directly into file storage as CSV, Parquet, or another format, and then reading it off file storage?

@javierluraschi
Collaborator Author

javierluraschi commented Sep 22, 2017

@russellpierce yes, for instance with hadoop fs -cp <src> <dest> if you are using Hadoop, etc. See https://hadoop.apache.org/docs/r2.4.1/hadoop-project-dist/hadoop-common/FileSystemShell.html#cp
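Once the file is in cluster storage it can be read straight into Spark rather than going through copy_to; a minimal sketch, where the master, table name, and HDFS path are placeholders:

library(sparklyr)

sc <- spark_connect(master = "yarn-client")

# read the file directly from cluster storage; nothing is serialized from R
flights <- spark_read_parquet(sc, name = "flights", path = "hdfs:///data/flights.parquet")

# CSV and JSON work the same way via spark_read_csv() and spark_read_json()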

I would like to improve copy_to, but in general, since Spark is a big data / big compute solution, it is often the case that the data will already be available in the cluster. That said, more people are also using sparklyr + Spark as a general compute engine, so uploading the data and then doing computations over larger nodes is something we can improve at some point.

Would you mind explaining why the data is not already available in Spark in your case, and how you are using sparklyr? I would love to hear what the use case for copy_to is here, to see how best to help.

@russellpierce
Contributor

We're still feeling out usage patterns. At the moment we have a bunch of pre-existing code that pulls data locally into R and processes it. Previously, at the end of processing we'd just write the results as a CSV. Now we'd like to write the results as Parquet instead. The easiest path I've found to write Parquet from R is to bounce it through a Spark DataFrame, hence using copy_to, or else writing the CSV to the cluster and then loading the CSV (which seemed unnecessarily awkward).
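A minimal sketch of that bounce-through-Spark path; the data frame and output path below are stand-ins:

library(sparklyr)

sc <- spark_connect(master = "local")

# stand-in for the locally processed results
results_df <- data.frame(id = 1:3, score = c(0.12, 0.48, 0.95))

# copy_to() pushes the local data frame into a Spark DataFrame...
results_spark <- copy_to(sc, results_df, "results", overwrite = TRUE)

# ...which can then be written out as Parquet
spark_write_parquet(results_spark, path = "/tmp/results_parquet")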

@javierluraschi
Collaborator Author

#1041 makes data collection improvements for dates and timestamps.

@javierluraschi
Collaborator Author

#1045 makes data collection improvements for fields with NAs.

@JakeRuss

JakeRuss commented Oct 2, 2017

It looks like spark_read_jdbc() is reading in date types correctly now. Will comment once I'm able to test spark_write_jdbc().

Is it expected behavior for TINYINT (a field of 0s and 1s) to be read in as logical?

@doorisajar

Is this also the place to discuss things like expanding spark_apply functions to accept multiple arguments, and then shipping those artifacts (e.g. small R model objects) to the workers along with the package dependencies?

@asantucci

@javierluraschi The issue of date columns being truncated to year when using copy_to is preventing me from adopting sparklyr further. I have data stored in a PostgreSQL database that I would like to import into a sparklyr pipeline.

I tried using spark_read_jdbc but was greeted with an error of the form java.sql.SQLException: No suitable driver, and there is no R-specific documentation that I can find that addresses this. (If you can point me to a resource addressing this, that would be appreciated.)

In an effort to prototype, I attempted to pull a subset of the data into memory in R and then use copy_to to upload the data to a Spark cluster (which, at this point, happens to be local as well, since I am just determining the limitations of existing packages). Of course, I get dates truncated to year, and this is a showstopper.

To give some context, I have actually already built a pipeline that plugs into PostgreSQL using dbplyr and pushes all computations out of memory, the intermediate goal being to generate features which will then be used for modeling. Given that PostgreSQL limits each row to a single page of memory (8060 bytes), I cannot create very "wide" feature sets, and hence have been motivated to switch over to a sparklyr pipeline. The idea from the beginning has been to write all code using basic dplyr transformations so that we can experiment with both Spark and PostgreSQL backends.

@russellpierce
Contributor

russellpierce commented Nov 9, 2017

@asantucci I've been having some success with writing a CSV out to distributed storage and then loading the dataset into Spark via a CSV read (specifying data types where it helps). That bypasses copy_to and any serialization issues we have, and it was mentioned here that copy_to() isn't intended for large data sets. A draft of that, mostly focused on saving data off to Parquet, is here: https://github.com/zapier/parquetr - but it doesn't work entirely yet because it depends on some internal packages for communicating with S3. I think those parts can be self-documenting / fixable for someone who is particularly interested. (Gladly accepting PRs if that issue can be resolved - or for anything else - I haven't been able to dedicate time to it yet.)
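A minimal sketch of that CSV-read path with explicit column types, assuming a file already sitting in distributed storage; the path and column names are hypothetical:

library(sparklyr)

sc <- spark_connect(master = "yarn-client")

# declare the schema up front instead of relying on inference, so numerics
# and timestamps arrive with the intended types
results <- spark_read_csv(
  sc,
  name         = "results",
  path         = "hdfs:///data/results.csv",
  infer_schema = FALSE,
  columns      = c(id = "integer", score = "double", created_at = "timestamp")
)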

@JakeRuss

JakeRuss commented Nov 9, 2017

@asantucci ,

I use sparklyr to pull data out of MySQL and directly into the Spark context, via spark_read_jdbc(). To do this I add a MySQL JDBC driver to the class path via spark_config, like so:

config <- spark_config()
config$`sparklyr.shell.driver-class-path` <- "~/my/local/path/mysql-connector-java-5.1.43/mysql-connector-java-5.1.43-bin.jar"

Maybe there is a PostgreSQL driver you could add instead, which would eliminate the error message. Here?

Also, it looks like Javier has resolved the issues with dates in the development version of sparklyr. I am testing this out on my pipeline right now, as it is critical for me as well.

@javierluraschi are the serialization improvements scheduled to be part of the 0.7 release to CRAN?

@asantucci

@JakeRuss I'm trying your approach as follows:

config <- spark_config()
config$`sparklyr.shell.driver-class-path` <- "valid_path/postgresql-42.14"

sc <- spark_connect(master = 'local', config = config)

spark_read_jdbc(sc, name = tbl_name, options = list(url = 'jdbc:postgresql://...', user = username, password = pword, dbtable = tbl_name))

And I still get the same error: java.sql.SQLException: No suitable driver. Any further thoughts?

@JakeRuss

JakeRuss commented Nov 9, 2017

Let me post more of my code and see if you spot anything different about your set up...

config <- spark_config()
config$`sparklyr.shell.driver-class-path` <- "valid_path/mysql-connector-java-5.1.44/mysql-connector-java-5.1.44-bin.jar"
config$`spark.executor.heartbeatInterval` <- "120000ms"
config$`spark.network.timeout`            <- "240s"
config$`sparklyr.shell.driver-memory`     <- "6G"
config$`spark.executor.memory`            <- "4G" 

sc <- spark_connect(master         = "local",
                     version        = "1.6.0",
                     hadoop_version = 2.6,
                     config         = config)

jdbc_read_url <- paste0("jdbc:mysql://", db_host_ip,":3306/", db_name)

db_table <- sc %>%
  spark_read_jdbc(sc      = .,
                  name    = "spark_tbl_name",  
                  options = list(url      = jdbc_read_url,
                                 user     = db_username,
                                 password = db_password,
                                 dbtable  = "database_tbl_name"),
                  memory  = FALSE)

My jdbc url is formatted like jdbc:mysql://host_ip:3306/database_name. The only difference I see is that I have memory = FALSE, but that shouldn't matter here.

My cursory Google searching for that error message indicates that either the jdbc URL is incorrect, or the driver still isn't found on the class path. Maybe check your valid_path again?

@asantucci

Thank you for the follow-up. It looks like my problem was in fact that I hadn't appended both a port and a database name to the JDBC URL. Now it appears that I am able to connect via DBI!

@javierluraschi
Collaborator Author

We are planning to improve serialization by using Apache Arrow. This work should address many conversion issues by providing a common serialization format between R and Scala that we don't have to maintain in sparklyr. The Arrow work is being tracked in #1457.
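As a rough sketch of how this is expected to look from the user's side (this assumes a later sparklyr release with Arrow support; see #1457 for the actual API), attaching the arrow package before connecting is intended to switch copy_to()/collect() onto the Arrow path:

# sketch only: assumes a sparklyr release with Arrow support
library(arrow)     # attaching arrow is expected to enable Arrow-based serialization
library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")

df <- data.frame(id = 1:3, when = as.Date("2017-08-16") + 0:2)

# copy_to() and collect() would then round-trip through Arrow rather than
# the serializer forked from SparkR
df_spark <- copy_to(sc, df, "df_arrow", overwrite = TRUE)
collect(df_spark)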
