
Use compact and compressed model json by default #375

Merged 1 commit into salesforce:master on Jul 30, 2019

Conversation

gerashegalov (Contributor)

Related issues
Fixes #374

Describe the proposed solution
Use compact JSON serialization and apply gzip compression to the output.

Describe alternatives you've considered
Make this behavior configurable.

Additional context
In a test scenario, the output size is reduced from 1.6M to 188K.
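
For context on what "compact serialization, apply gzip" means in practice, here is a minimal Scala sketch (not the actual diff; `saveModelJson` is a hypothetical helper) of rendering the model JSON without pretty-printing whitespace and writing it through Spark with a gzip codec:

```scala
import org.apache.hadoop.io.compress.GzipCodec
import org.apache.spark.SparkContext
import org.json4s.JValue
import org.json4s.jackson.JsonMethods.{compact, render}

// Hypothetical helper: render the JSON compactly (single line, no indentation)
// and let Spark/Hadoop gzip the output.
def saveModelJson(sc: SparkContext, json: JValue, path: String): Unit = {
  val compactJson = compact(render(json))          // compact instead of pretty rendering
  sc.parallelize(Seq(compactJson), numSlices = 1)  // one partition -> one output file
    .saveAsTextFile(path, classOf[GzipCodec])      // writes a gzip-compressed part file under `path`
}
```

Compact rendering drops the indentation whitespace and the gzip codec compresses the repetitive JSON structure; together they account for the 1.6M to 188K reduction reported above.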


codecov bot commented Jul 29, 2019

Codecov Report

Merging #375 into master will increase coverage by 0.01%.
The diff coverage is 100%.


@@            Coverage Diff             @@
##           master     #375      +/-   ##
==========================================
+ Coverage   86.77%   86.79%   +0.01%     
==========================================
  Files         336      336              
  Lines       10921    10922       +1     
  Branches      342      577     +235     
==========================================
+ Hits         9477     9480       +3     
+ Misses       1444     1442       -2
Impacted Files Coverage Δ
...cala/com/salesforce/op/OpWorkflowModelWriter.scala 100% <100%> (ø) ⬆️
...es/src/main/scala/com/salesforce/op/OpParams.scala 89.79% <0%> (+4.08%) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 82bb2c1...527b3a1.

tovbinm (Collaborator) left a comment:

Don’t you need to mention the gzip codec when reading the model?

gerashegalov (Contributor, Author) replied:

> Don’t you need to mention the gzip codec when reading the model?

It's handled transparently by Hadoop's TextInputFormat based on the filename extension. On the write path, the filename extension is appended based on the compression codec. On the read path, you can even mix files with different compression codecs/extensions and uncompressed files in the same directory.
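
To illustrate that read-path transparency, here is a small self-contained sketch (assuming a local Spark session; the `/tmp/model/...` paths and the literal JSON strings are placeholders). Hadoop's `TextInputFormat` picks a decompression codec per file from its extension, so a single `textFile` call reads gzip-compressed and plain files alike:

```scala
import org.apache.hadoop.io.compress.GzipCodec
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("gzip-read-demo").getOrCreate()
val sc = spark.sparkContext

// Write the same kind of content twice: once gzip-compressed, once uncompressed.
sc.parallelize(Seq("""{"k":1}"""), 1).saveAsTextFile("/tmp/model/compressed", classOf[GzipCodec])
sc.parallelize(Seq("""{"k":2}"""), 1).saveAsTextFile("/tmp/model/plain")

// One read call covers both: the codec is chosen per file from the ".gz"
// extension, with no explicit decompression configuration on the read path.
sc.textFile("/tmp/model/*").collect().foreach(println)

spark.stop()
```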

@tovbinm tovbinm merged commit b505ff7 into salesforce:master Jul 30, 2019
@gerashegalov gerashegalov deleted the gera/model-json-out branch July 31, 2019 18:27
@gerashegalov gerashegalov mentioned this pull request Sep 8, 2019
gerashegalov added a commit that referenced this pull request Sep 11, 2019
Bug fixes:
- Ensure correct metrics despite model failures on some CV folds [#404](#404)
- Fix flaky `ModelInsight` tests [#395](#395)
- Avoid creating `SparseVector`s for LOCO [#377](#377)

New features / updates:
- Model combiner [#385](#399)
- Added new sample for HousingPrices [#365](#365)
- Test to verify that custom metrics appear in model insight metrics [#387](#387)
- Add `FeatureDistribution` to `SerializationFormat`s [#383](#383)
- Add metadata to `OpStandardScaler` to allow for descaling [#378](#378)
- Improve json serde error in `evalMetFromJson` [#380](#380)
- Track mean & standard deviation as metrics for numeric features and for text length of text features [#354](#354)
- Making model selectors robust to failing models [#372](#372)
- Use compact and compressed model json by default [#375](#375)
- Descale feature contribution for Linear Regression & Logistic Regression [#345](#345)

Dependency updates:   
- Update tika version [#382](#382)
Successfully merging this pull request may close these issues.

compact and compressed json serialization for models
2 participants