Add ft_robust_scaler #2254

Merged
merged 3 commits into from Sep 15, 2020

zero323
Contributor

@zero323 zero323 commented Feb 6, 2020

This PR adds ft_robust_scaler as a wrapper for RobustScaler (SPARK-28399), which is described as follows:

RobustScaler removes the median and scales the data according to the quantile range.

It applies to Spark >= 3.0.0.
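
For readers unfamiliar with the transformation, here is a minimal plain-Python sketch of the idea quoted above: subtract the median, then divide by the range between a lower and an upper quantile. It is not the Spark or sparklyr implementation. The helper names `quantile` and `robust_scale`, the exact linear-interpolation quantiles, and the handling of a zero range are all assumptions made for illustration, and the input is assumed to be a non-empty list of numbers.

```python
def quantile(sorted_vals, q):
    """Quantile by linear interpolation between the two nearest ranks."""
    pos = q * (len(sorted_vals) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (pos - lo) * (sorted_vals[hi] - sorted_vals[lo])


def robust_scale(values, lower=0.25, upper=0.75):
    """Subtract the median, then divide by the (upper - lower) quantile range."""
    s = sorted(values)
    med = quantile(s, 0.5)
    spread = quantile(s, upper) - quantile(s, lower)
    if spread == 0:
        # Assumption of this sketch: a constant column maps to zeros.
        return [0.0 for _ in values]
    return [(v - med) / spread for v in values]


# The outlier (100) barely affects the median or the quartile range,
# so the other values keep a sensible scale.
print(robust_scale([1, 2, 3, 4, 100]))  # [-1.0, -0.5, 0.0, 0.5, 48.5]
```

Because only the median and the two quantiles enter the computation, a single extreme value does not distort the scaling of the remaining points the way a mean and standard deviation would.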

Signed-off-by: zero323 <mszymkiewicz@gmail.com>

@zero323
Contributor Author

@zero323 zero323 commented Feb 7, 2020

It seems like the build on master is older than apache/spark@bb47870:

Context: ml feature robust scaler
ft_robust_scaler() works properly: error: java.lang.IllegalArgumentException: invalid method setRelativeError for object 14691/org.apache.spark.ml.feature.RobustScaler fields 0 selected 0
	at sparklyr.Invoke.invoke(invoke.scala:174)
	at sparklyr.StreamHandler.handleMethodCall(stream.scala:122)
	at sparklyr.StreamHandler.read(stream.scala:65)
	at sparklyr.BackendHandler.channelRead0(handler.scala:53)
	at sparklyr.BackendHandler.channelRead0(handler.scala:12)
	at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:352)
	at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:102)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:352)
	at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:328)
	at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:302)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:352)
	at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1422)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360)
	at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:931)
	at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:163)
	at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:700)
	at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:635)
	at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:552)
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:514)
	at io.netty.util.concurrent.SingleThreadEventExecutor$6.run(SingleThreadEventExecutor.java:1044)
	at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
	at java.base/java.lang.Thread.run(Thread.java:834) 
  callstack:

...
)

if (is_ml_transformer(stage))
Contributor Author

@zero323 zero323 Feb 7, 2020

It doesn't seem like this can ever happen, but since this pattern is repeated all over the code base, I'll keep it for consistency.

@zero323 zero323 changed the title from "Add fr_robust_scaler" to "Add ft_robust_scaler" Feb 15, 2020
@zero323 zero323 closed this Mar 12, 2020
@zero323 zero323 reopened this Mar 12, 2020
@zero323 zero323 force-pushed the ml-robust-scaler branch 2 times, most recently from 8b73595 to ba49785 Compare Apr 7, 2020
@zero323 zero323 closed this Sep 15, 2020
@zero323 zero323 reopened this Sep 15, 2020
@falaki
Collaborator

@falaki falaki commented Sep 15, 2020

Databricks Connect tests failed. View logs here.

zero323 added 3 commits Sep 15, 2020
Signed-off-by: zero323 <mszymkiewicz@gmail.com>
Signed-off-by: zero323 <mszymkiewicz@gmail.com>
@yitao-li
Contributor

@yitao-li yitao-li commented Sep 15, 2020

@zero323 Nice! Looks like you fixed the build failure and the change looks good to me.

Would you mind signing off your 3 commits (by running git rebase --signoff HEAD~3, where --signoff adds a Signed-off-by trailer to each rebased commit; note that -S would GPG-sign the commits instead) followed by git push --force? Commits being signed off is typically required for contributing to sparklyr.

Once you have signed off all your commits, the "DCO check" should pass, and then we can merge your commits into master.

@yitao-li yitao-li self-requested a review Sep 15, 2020
@yitao-li yitao-li mentioned this pull request Sep 15, 2020
@yitao-li
Contributor

@yitao-li yitao-li commented Sep 15, 2020

@zero323 ^^ Nah, actually never mind. No need to rebase manually. I'll just append the correct Signed-off-by field to your commit message, which would be easier.

Thanks for contributing to sparklyr!!

@yitao-li yitao-li merged commit 8400edd into sparklyr:master Sep 15, 2020
12 of 14 checks passed
@zero323
Contributor Author

@zero323 zero323 commented Sep 15, 2020

Thanks @falaki and @yitao-li.

yitao-li pushed a commit that referenced this issue Sep 15, 2020
Add fr_robust_scaler

Signed-off-by: zero323 <mszymkiewicz@gmail.com>
@zero323 zero323 deleted the ml-robust-scaler branch Sep 15, 2020
yitao-li added a commit that referenced this issue Oct 27, 2020
* fix typo in test-dplyr-hof.R (#2688)

Signed-off-by: Yitao Li <yitao@rstudio.com>

* implement `pcre_to_java` and all relevant test cases

Signed-off-by: Yitao Li <yitao@rstudio.com>

* support POSIX char classes in `sep` parameter of separate.tbl_spark

Signed-off-by: Yitao Li <yitao@rstudio.com>

* support deterministic sampling outcomes for dplyr::sample_* on Spark dataframes (#2689)

Signed-off-by: Yitao Li <yitao@rstudio.com>

* NEWS.md update for sparklyr 1.4 release (#2693)

Signed-off-by: Yitao Li <yitao@rstudio.com>

* also mention `grepl` support in `dplyr` (#2695)

Signed-off-by: Yitao Li <yitao@rstudio.com>

* implement tidyr::fill functionality for Spark data frame (#2691)

Signed-off-by: Yitao Li <yitao@rstudio.com>

* update default Spark version for spark_install() (#2696)

Signed-off-by: Yitao Li <yitao@rstudio.com>

* make `dplyr_sample_*` work with `ft_dplyr_transformer` (#2698)

Signed-off-by: Yitao Li <yitao@rstudio.com>

* support `ptype` specification in `unnest.tbl_spark` (#2700)

Signed-off-by: Yitao Li <yitao@rstudio.com>

* fix tidyr-unnest requirement (#2702)

Signed-off-by: Yitao Li <yitao@rstudio.com>

* update sparklyr_livy_branch (#2704)

Signed-off-by: Yitao Li <yitao@rstudio.com>

* fix reexports.R (#2706)

Signed-off-by: Yitao Li <yitao@rstudio.com>

* prepare for sparklyr 1.4.0 release (#2709)

Signed-off-by: Yitao Li <yitao@rstudio.com>

* Add ft_robust_scaler (#2254)

Add fr_robust_scaler

Signed-off-by: zero323 <mszymkiewicz@gmail.com>

* sdf_quantile() handles multiple columns (#2716)

sdf_quantile() handles multiple columns

Signed-off-by: wkdavis <william.davis@worthingtonindustries.com>

* fix warnings from --as-cran checks (#2715)

minor changes to fix warnings from CRAN-related checks

Signed-off-by: Yitao Li <yitao@rstudio.com>

* fix a bug with grouping vars in nest.tbl_spark (#2720)

Signed-off-by: Yitao Li <yitao@rstudio.com>

* Avoiding bundle file name collision when session_id is not provided (#2721)

* Avoiding bundle file name collision with session_id is not provided

* Mon Sep 21 23:26:36 PDT 2020

* Update R/spark_apply_bundle.R

Co-authored-by: Yitao Li <yl790@10xeng.ca>

* Update R/spark_apply_bundle.R

Co-authored-by: Yitao Li <yl790@10xeng.ca>

* update package metadata (#2723)

Signed-off-by: Yitao Li <yitao@rstudio.com>

* skip append-data test on db connect (#2727)

Signed-off-by: Yitao Li <yitao@rstudio.com>

* fix incorrect column name in `stream_watermark()` (#2728)

Signed-off-by: Yitao Li <yitao@rstudio.com>

* implement `unnest_wider` functionality for Spark dataframes (#2730)

Signed-off-by: Yitao Li <yitao@rstudio.com>

* implement `unnest_longer` functionality for Spark dataframes (#2732)

Signed-off-by: Yitao Li <yitao@rstudio.com>

* update _pkgdown.yml with recent topics and make docs/reference contain only static html content (#2726)

update _pkgdown.yml with recent topics and make docs/reference contain only static html content

Signed-off-by: Yitao Li <yitao@rstudio.com>

* ignore platform-specific date serialization issue on Windows (#2734)

Signed-off-by: Yitao Li <yitao@rstudio.com>

* remove rjson usage (#2735)

Signed-off-by: Yitao Li <yitao@rstudio.com>

* revise spark_web impl (#2738)

Signed-off-by: Yitao Li <yitao@rstudio.com>

* call rstudioapi::translateLocalUrl() when applicable (#2625)

Signed-off-by: yl790 <yitao@rstudio.com>

* implement the equivalent of dplyr lag() functionality for streaming dataframes (#2739)

Signed-off-by: Yitao Li <yitao@rstudio.com>

* implement timestamp threshold option for stream_lag() (#2743)

Signed-off-by: Yitao Li <yitao@rstudio.com>

* switch CI workflow default branch from 'master' to 'main'

Signed-off-by: Yitao Li <yitao@rstudio.com>

* update CONTRIBUTORS.md (#2748)

Signed-off-by: Yitao Li <yitao@rstudio.com>

* remove GitHub CI workflow for R 3.2.5 (#2749)

Signed-off-by: Yitao Li <yitao@rstudio.com>

* replace 'master' with 'main' in jenkins config file (#2746)

Signed-off-by: Yitao Li <yitao@rstudio.com>

* skip unsupported test on windows (#2751)

Signed-off-by: Yitao Li <yitao@rstudio.com>

* improve sparklyr serialization routines

Signed-off-by: Yitao Li <yitao@rstudio.com>

Co-authored-by: Maciej <zero323@users.noreply.github.com>
Co-authored-by: Wil Davis <william.davis@worthingtonindustries.com>
Co-authored-by: Hossein Falaki <falaki@gmail.com>