Improve Serialization #941
@javierluraschi I know you're hard at work supporting these features, but do you have any estimate for how long before sparklyr will support date (and datetime) column types? I'm updating a production database via sparklyr now, but the code is fragile because it includes workarounds for managing timestamps, and the rest of the dev team has me weighing the merits of a switch to SparkR. If you think the answer is "soon" then I may be able to stall the refactor and continue using sparklyr (my preference).
@JakeRuss We'll be taking a look at this next week to see if we can do something about dates before entirely revamping …
@kevinykuo That update sounds great from my end. As soon as it's ready, I'll be in line to test it out. Many thanks!
@JakeRuss thanks for the feedback. There are two points to consider here:
If you end up switching to SparkR … We don't have a timeline yet to implement this "Improve Serialization" feature. However, if there is a particular issue that is blocking you, please open a github issue and I'll try to address it very soon without depending on the completion of this work.
I appreciate your thoughts, @javierluraschi. My desired workflow may be irrelevant for the … I mainly use … I did experiment with a … Please let me know if I should open a separate serialization issue for …
@JakeRuss not at all,
@JakeRuss, what would really help us here would be to get a copy of the schema (not necessarily the data) that you are using, to make sure data round trips correctly when we get to this work. Something like:
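For what it's worth, a minimal way to capture such a schema in R might look like the sketch below; the data frame and its columns are made-up placeholders, not taken from this thread.

```r
library(sparklyr)

# Hypothetical zero-row data frame standing in for the real table; only the
# column names and types matter here, not the values.
df <- data.frame(
  id         = integer(),
  created_at = as.POSIXct(character()),
  amount     = numeric(),
  active     = logical()
)

# One way to share the schema without any data:
sapply(df, function(col) class(col)[1])

# For a table already in Spark, sdf_schema() reports the Spark-side types:
# sdf_schema(my_spark_tbl)
```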
and I'll make sure this gets validated when the improvements get implemented. Feel free to comment here or open a new issue with the details, which I can then close and link to this one. Also worth mentioning that I do want to work on this issue as soon as possible. I've been cleaning up old github issues and, while no specific serialization issue seems critical, it's now obvious that there are many little ones that make this more urgent. I'll keep you posted!
Moved to a new issue, thank you @javierluraschi!
@javierluraschi You mentioned that …
@russellpierce yes, for instance with … I would like to improve … Would you mind explaining why the data isn't already available in Spark in your case? And how you are using …
We're still feeling out usage patterns. At the moment we have a bunch of pre-existing code that pulls data locally into R and processes it. Previously, at the end of processing we'd just write the results as a CSV. Now we'd like to write the results as Parquet instead. The easiest path I've found to write Parquet from R is to bounce it through a Spark DataFrame; hence using copy_to, or writing the CSV to the cluster and then loading it back (which seemed unnecessarily awkward).
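A minimal sketch of that bounce-through-Spark path, assuming a local connection; the object names and output path below are illustrative only.

```r
library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")

# `results` stands in for the locally processed data frame described above.
results <- data.frame(id = 1:3, value = c(0.1, 0.2, 0.3))

# Copy the local data into Spark, then write it out as Parquet.
results_tbl <- copy_to(sc, results, "results", overwrite = TRUE)
spark_write_parquet(results_tbl, path = "file:///tmp/results_parquet")

spark_disconnect(sc)
```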
#1041 makes data collection improvements for dates and timestamps.
#1045 makes data collection improvements for fields with …
It looks like … Is it expected behavior for TINYINT (a field of 0s and 1s) to be read in as logical?
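If the logical mapping is unwanted, one possible workaround (a sketch of the general idea, not confirmed as the intended behavior) is to cast the column back to integer inside Spark after reading:

```r
library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")

# Hypothetical table whose 0/1 column arrived on the R side as logical.
flags <- copy_to(sc, data.frame(id = 1:3, flag = c(TRUE, FALSE, TRUE)), "flags")

# The cast is translated to SQL and executed in Spark, not in R.
flags <- flags %>% mutate(flag = as.integer(flag))
```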
Is this also the place to discuss things like expanding …
@javierluraschi The issue of date columns being truncated to year when using … I tried using … In an effort to prototype, I attempted to pull a subset of the data into memory in R and then use … To give some context, I actually have already built a pipeline that plugs into PostgreSQL using …
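A common workaround sketch for that year-only truncation (an assumption about the general approach, not code from this thread): copy the dates as character strings and cast them back to dates inside Spark.

```r
library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")

# Hypothetical local data with a Date column.
local_df <- data.frame(
  id         = 1:2,
  event_date = as.Date(c("2017-01-15", "2017-06-30"))
)

# Send the dates as strings, then cast in Spark; to_date() is not an R
# function here, it is passed through to Spark SQL by dplyr's translation.
local_df$event_date <- as.character(local_df$event_date)
events <- copy_to(sc, local_df, "events", overwrite = TRUE) %>%
  mutate(event_date = to_date(event_date))
```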
@asantucci I've been having some success with writing a CSV out to distributed storage, then loading the dataset into Spark via a CSV read (and specifying data types where it helps). That bypasses copy_to and any serialization issues we have, and here it was mentioned that …
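A sketch of that CSV route, assuming a path visible to both R and the cluster; the file location and column types are illustrative only.

```r
library(sparklyr)

sc <- spark_connect(master = "local")

# Hypothetical results written out as CSV first...
results <- data.frame(
  id         = 1:3,
  event_date = as.Date("2017-06-30") + 0:2,
  amount     = c(1.5, 2.5, 3.5)
)
write.csv(results, "/tmp/results.csv", row.names = FALSE)

# ...then read into Spark with explicit column types instead of copy_to().
# Dates read as character here can be cast afterwards with to_date() in Spark.
results_tbl <- spark_read_csv(
  sc, "results",
  path         = "file:///tmp/results.csv",
  infer_schema = FALSE,
  columns      = c(id = "integer", event_date = "character", amount = "double")
)
```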
I use …
Maybe there is a driver you could add for PostgreSQL instead, which would eliminate the error message. Here? Also, it looks like Javier has resolved the issues with dates in the development version of sparklyr. @javierluraschi, are the serialization improvements scheduled to be part of the 0.7 release to CRAN?
@JakeRuss I'm trying your approach as follows:
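In case it helps later readers, here is a sketch of one common way to make a PostgreSQL JDBC driver visible to sparklyr before connecting; the jar path is a placeholder, and this is an assumption about the setup being tried rather than the exact code from the thread.

```r
library(sparklyr)

# Point Spark at a locally downloaded PostgreSQL JDBC driver jar.
config <- spark_config()
config$sparklyr.jars.default <- "/path/to/postgresql-42.1.4.jar"

sc <- spark_connect(master = "local", config = config)
```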
And I still get the same error: …
Let me post more of my code and see if you spot anything different from your setup...
My JDBC URL is formatted like … My cursory Google searching for that error message indicates that either the JDBC URL is incorrect, or the driver still isn't found on the class path. Maybe check your valid_path again?
Thank you for the follow-up. It looks like my problem was in fact not appending the JDBC URL with both a port and database name. Now it appears that I am able to connect to the DBI!
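For anyone who hits the same error, here is a sketch of a fully specified PostgreSQL JDBC URL plus a read call; the host, port, database name, table, and credentials are placeholders, and `sc` is the Spark connection set up earlier with the driver jar on the classpath.

```r
# Fully specified URL: host, port, and database name all included.
jdbc_url <- "jdbc:postgresql://db.example.com:5432/mydatabase"

pg_tbl <- spark_read_jdbc(
  sc, "my_table",
  options = list(
    url      = jdbc_url,
    dbtable  = "my_table",
    user     = "username",
    password = "password",
    driver   = "org.postgresql.Driver"
  )
)
```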
We are planning to improve serialization by using Apache Arrow; this work should address many conversion issues by providing a common serialization format between R and Scala that we don't have to maintain in sparklyr.
There are two areas worth considering here:
1. Improve `collect()` serialization. This is already implemented as columnar-based collect; however, I believe only for numeric and logical data types.
2. Implement `copy_to()` through columnar `invoke()`.

Assuming (2) is on a par with current implementations of `copy_to()`, we can stop here; otherwise, we would have to explore additional serialization improvements.

Related issues:
- `copy_to` turns Dates into numeric columns (with only the year) #187: should be fixed with this work item, since `Date` would serialize properly into Spark.
- `""` and `NA`: should be addressed by a common serialization mechanism. Notice also that some `NA` in `flights` are being copied as `NaN`.
- `robotics.csv`: it is not clear what it was, nor do we have a repro, but this change should help fix that.
- `copy_to`: improving fidelity here would solve this issue as well.
- `NaN`, `NA`, `NULL`, and `""`.
- `\n` in `copy_to`.
- `DATE`, `DATETIME`, and `TIMESTAMP` while coming from JDBC: probably no need to do anything JDBC specific, but worth testing `copy_to()`.
- `spark_apply()` could significantly improve performance for data transfer into Scala #1034: column-based `copy_to` could be reused to speed up `spark_apply()`.
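To make the intended outcome concrete, here is a minimal round-trip check in the spirit of the issues listed above; it is a sketch assuming a local connection, with made-up column names, and is not part of the original issue.

```r
library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")

# Hypothetical data frame exercising the problematic types.
local_df <- data.frame(
  d  = as.Date(c("2017-01-01", "2017-06-15")),
  ts = as.POSIXct(c("2017-01-01 12:00:00", "2017-06-15 18:30:00"), tz = "UTC"),
  x  = c(1.5, NA)
)

round_trip <- copy_to(sc, local_df, "roundtrip_test", overwrite = TRUE) %>% collect()

# Once serialization is improved, the classes should survive the trip.
sapply(local_df, function(col) class(col)[1])
sapply(round_trip, function(col) class(col)[1])
```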