add new codec option for compression in Spark-Tensorflow connector #131

Merged: 2 commits, May 7, 2019


vgod-dbx
Contributor

With #125, it became possible to write gzipped TFRecords by setting spark.hadoop.mapreduce.output.fileoutputformat.compress in the global SparkConf.
However, there is no way to enable compression for only an individual DataFrame output.

This PR adds a new option, codec, to the Spark-TensorFlow connector that enables compression on an individual DataFrameWriter. With it, spark.hadoop.mapreduce.output.fileoutputformat.compress no longer needs to be set globally.

Sample usage:

(
  dataframe
  .write
  .format('tfrecords')
  .option('codec', 'org.apache.hadoop.io.compress.GzipCodec')
  .save('sample.tfrecord.gz')
)

@googlebot

Thanks for your pull request. It looks like this may be your first contribution to a Google open source project (if not, look below for help). Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

📝 Please visit https://cla.developers.google.com/ to sign.

Once you've signed (or fixed any issues), please reply here (e.g. I signed it!) and we'll verify it.



@vgod-dbx
Contributor Author

I signed it!

@googlebot

CLAs look good, thanks!


@jhseu
Contributor

jhseu commented May 1, 2019

@skavulya Mind doing a code review?

@skavulya
Contributor

skavulya commented May 2, 2019

@jhseu Sure, I'll review it. Thanks!

@skavulya
Contributor

skavulya commented May 7, 2019

@vgod-dbx Thank you so much for the contribution. It looks good. Please add a description and example usage of the codec option to the README under the features section before merge.

@vgod-dbx
Contributor Author

vgod-dbx commented May 7, 2019

@skavulya README updated! Thanks for the review.

@skavulya
Contributor

skavulya commented May 7, 2019

@vgod-dbx Thanks! Looks great.
@jhseu The PR is ready for merge

@jhseu merged commit 12d65f2 into tensorflow:master on May 7, 2019
fjkfwz pushed a commit to xiachufang/ecosystem that referenced this pull request Oct 12, 2019
…ensorflow#131)

* support option 'codec' for compression

* add `codec` option to README
@eggie5

eggie5 commented Nov 14, 2019

@vgod-dbx what version did this make it into? I'm on 1.13.1 and it seems to ignore the codec option...

@acastelli1

Hi, it seems that the codec option is actually ignored.
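For anyone trying to confirm whether the codec option took effect, one rough check is to look at the first two bytes of each output part file, since every gzip stream starts with 0x1f 0x8b. The sketch below is not part of this PR or of the connector. The helper names is_gzip_file and find_uncompressed_parts are hypothetical, and it assumes the output was written to a local directory of part-* files, as Spark writers normally produce.

```python
import os

# Every gzip stream begins with these two bytes (RFC 1952).
GZIP_MAGIC = b"\x1f\x8b"


def is_gzip_file(path):
    """Return True if the file at `path` starts with the gzip magic bytes.

    This is a heuristic: an uncompressed file could, in rare cases,
    happen to begin with the same two bytes.
    """
    with open(path, "rb") as f:
        return f.read(2) == GZIP_MAGIC


def find_uncompressed_parts(output_dir):
    """Return the sorted names of part-* files that are not gzip streams.

    Hypothetical helper: assumes `output_dir` is a local directory
    containing Spark-style part files (names starting with "part-").
    Other files, such as _SUCCESS markers, are skipped.
    """
    return sorted(
        name
        for name in os.listdir(output_dir)
        if name.startswith("part-")
        and not is_gzip_file(os.path.join(output_dir, name))
    )
```

Calling find_uncompressed_parts('sample.tfrecord.gz') on the directory from the sample usage would return an empty list if every part file is gzip-compressed, and the names of any plain part files otherwise.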
