XGBoost classification & regression models + Spark 2.3.2 #44

tovbinm · 2018-08-08T22:07:14Z

Describe the proposed solution
Add XGBoost classification and regression models. This should eventually let us train models that perform on par with or better than Random Forest and, more importantly, use wider sparse feature vectors.

Describe alternatives you've considered
Fix Spark Random Forest implementation.

Additional context
This change adds the xgboost4j-spark dependency and upgrades to Spark 2.3.2.

TODO

Compare model quality and runtime performance against RF models.
If it performs well, add it to our model selectors.

kinfaikan · 2018-08-10T22:54:21Z

core/src/main/scala/ml/dmlc/xgboost4j/scala/spark/XGBoostParams.scala

+  /**
+   * Copied from [[ml.dmlc.xgboost4j.scala.spark.XGBoost.removeMissingValues]] private method
+   */
+  def removeMissingValues(xgbLabelPoints: Iterator[LabeledPoint], missing: Float): Iterator[LabeledPoint] = {


The reflection trick doesn't work for an object?

The method is not present in the snapshot version I use. We need to switch to the latest 0.8 version once it is released.

Done! It's present in the 0.80 release.
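
For context, here is a minimal sketch of what a helper with this signature might do, assuming the intent is to drop feature entries equal to the `missing` sentinel. `Point` is a hypothetical stand-in for the library's `LabeledPoint`. Passing the input through unchanged when `missing` is NaN is an assumption, not something this thread confirms.

```scala
// Hypothetical stand-in for xgboost4j's LabeledPoint; only the fields used here.
final case class Point(label: Float, indices: Array[Int], values: Array[Float])

object MissingValues {
  // Drops every (index, value) pair whose value equals the `missing` sentinel.
  // Assumption: a NaN sentinel means "nothing to filter", so input is returned as-is.
  def removeMissingValues(points: Iterator[Point], missing: Float): Iterator[Point] =
    if (missing.isNaN) points
    else points.map { p =>
      val kept = p.indices.zip(p.values).filter { case (_, v) => v != missing }
      p.copy(indices = kept.map(_._1), values = kept.map(_._2))
    }
}
```

With `missing = 0f`, a point holding values `(0f, 2f, 0f)` at indices `(0, 3, 5)` keeps only the entry at index 3.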

kinfaikan · 2018-08-10T22:55:33Z

core/src/test/scala/com/salesforce/op/ModelInsightsTest.scala

@@ -63,7 +63,7 @@ class ModelInsightsTest extends FlatSpec with PassengerSparkFixtureTest {
  implicit val doubleOptEquality = new Equality[Option[Double]] {
    def areEqual(a: Option[Double], b: Any): Boolean = b match {
      case None => a.isEmpty
-      case s: Option[Double] => (a.exists(_.isNaN) && s.exists(_.isNaN)) ||
+      case s: Option[Double]@unchecked => (a.exists(_.isNaN) && s.exists(_.isNaN)) ||
        (a.nonEmpty && a.toSeq.zip(s.toSeq).forall{ case (n, m) => n == m })


(a.exists(_.isNaN) && s.exists(_.isNaN)) || (a == s)
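
The one-line suggestion above can be sketched in plain Scala, without the ScalaTest `Equality` wrapper. `OptionDoubleEquality` is a hypothetical name used only for illustration. The explicit NaN clause is needed because `Double.NaN == Double.NaN` is false under IEEE 754 semantics. All remaining cases are assumed to be covered by ordinary `Option` equality.

```scala
// Hypothetical helper; mirrors the reviewer's simplified expression.
object OptionDoubleEquality {
  // Treats two NaN-holding options as equal, otherwise falls back to ==.
  def areEqual(a: Option[Double], s: Option[Double]): Boolean =
    (a.exists(_.isNaN) && s.exists(_.isNaN)) || (a == s)
}
```

For example, `OptionDoubleEquality.areEqual(Some(Double.NaN), Some(Double.NaN))` is true via the first clause, while `OptionDoubleEquality.areEqual(Some(1.0), None)` is false.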

codecov · 2018-09-28T06:48:04Z

Codecov Report

Merging #44 into master will decrease coverage by 0.67%.
The diff coverage is 47.39%.

@@            Coverage Diff             @@
##           master      #44      +/-   ##
==========================================
- Coverage    86.4%   85.72%   -0.68%     
==========================================
  Files         299      302       +3     
  Lines        9750     9881     +131     
  Branches      354      540     +186     
==========================================
+ Hits         8424     8470      +46     
- Misses       1326     1411      +85

Impacted Files	Coverage Δ
...ges/impl/classification/OpLogisticRegression.scala	`57.14% <ø> (ø)`	⬆️
...m/salesforce/op/aggregators/ExtendedMultiset.scala	`75% <0%> (-25%)`	⬇️
...ce/op/stages/impl/classification/OpLinearSVC.scala	`77.27% <100%> (ø)`	⬆️
...ssification/OpMultilayerPerceptronClassifier.scala	`69.23% <100%> (+5.59%)`	⬆️
.../scala/com/salesforce/op/features/types/Maps.scala	`92.68% <100%> (+0.27%)`	⬆️
...lesforce/op/utils/reflection/ReflectionUtils.scala	`97.36% <100%> (+0.14%)`	⬆️
...om/salesforce/op/utils/spark/OpSparkListener.scala	`97.4% <100%> (-1.3%)`	⬇️
...s/sparkwrappers/specific/SparkModelConverter.scala	`94.11% <100%> (+0.78%)`	⬆️
...a/com/salesforce/op/filters/RawFeatureFilter.scala	`88.99% <100%> (+0.2%)`	⬆️
...op/evaluators/OpMultiClassificationEvaluator.scala	`94.73% <100%> (+0.07%)`	⬆️
... and 15 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update abdfff0...5a98621.

tovbinm · 2018-10-11T02:11:22Z

@leahmcguire if there are no objections - let's get this merged.

leahmcguire · 2018-10-12T19:35:51Z

core/src/main/scala/com/salesforce/op/stages/impl/classification/OpXGBoostClassifier.scala

+    CheckIsResponseValues(in1, in2)
+  }
+
+  def setWeightCol(value: String): this.type = set(weightCol, value)


please add comments for these params

done as well

leahmcguire · 2018-10-12T19:42:49Z

core/src/main/scala/com/salesforce/op/stages/impl/regression/OpXGBoostRegressor.scala

+    CheckIsResponseValues(in1, in2)
+  }
+
+  def setWeightCol(value: String): this.type = set(weightCol, value)


please put comments on these settings

leahmcguire · 2018-10-12T19:43:56Z

core/src/main/scala/com/salesforce/op/stages/impl/regression/RegressionModelSelector.scala

@@ -199,6 +199,7 @@ object RegressionModelsToTry extends Enum[RegressionModelsToTry] {
  case object OpRandomForestRegressor extends RegressionModelsToTry
  case object OpGBTRegressor extends RegressionModelsToTry
  case object OpGeneralizedLinearRegression extends RegressionModelsToTry
+  case object OpXGBoostRegressor extends RegressionModelsToTry


please also define some default grid settings for this to run in regression

also for the other model selectors

I propose making the addition of xgb to the model selectors a separate PR.

OK, if that is the plan then let's hold off on adding it to the enum as well, particularly since you only added it to regression and not to multiclass and binary.

sounds good. removed.

leahmcguire

LGTM

salesforce-cla · 2020-10-16T17:50:27Z

Thanks for the contribution! It looks like @Jauntbox is an internal user so signing the CLA is not required. However, we need to confirm this.

tovbinm added 5 commits August 8, 2018 09:32

Initial implementation of XGBoost classifier & regressor moddels + up…

eb2da76

…grade to Spark 2.3.1

fix property name

e46a152

Minor updates

533aa3a

added maven repo

4877aa3

move repo to build.gradle

3a023bd

tovbinm requested a review from leahmcguire as a code owner August 8, 2018 22:07

tovbinm added the work in progress label Aug 8, 2018

tovbinm requested a review from kinfaikan August 8, 2018 22:07

tovbinm and others added 20 commits August 8, 2018 15:07

Merge branch 'master' into mt/xgboost

bb547eb

Merge branch 'master' of github.com:salesforce/TransmogrifAI into mt/…

0801e1b

…xgboost

quite logging in tests

26c1d89

update some tests

b355d4c

debug stuff

9f7bf85

Merge branch 'master' of github.com:salesforce/TransmogrifAI into mt/…

625dd2e

…xgboost

remove line

75bec3f

Merge branch 'master' of github.com:salesforce/TransmogrifAI into mt/…

e28e13a

…xgboost

Fix GeneralizedLinearRegression

565868e

Merge branch 'master' of github.com:salesforce/TransmogrifAI into mt/…

2b85332

…xgboost

Make xgboost work

de7969d

cleanup

ea71e94

update test

42a95cf

Added test

07391ad

Added test

fafa9a0

make stalastyle happy

be245a2

Merge branch 'master' into mt/xgboost

c41d1b6

fix tests

7dbfd68

cleanup

a3e6334

Merge branch 'mt/xgboost' of github.com:salesforce/TransmogrifAI into…

c40f264

… mt/xgboost

kinfaikan reviewed Aug 10, 2018

tovbinm and others added 7 commits August 31, 2018 22:10

move version

91d52a8

Merge branch 'master' of github.com:salesforce/TransmogrifAI into mt/…

e7607e5

…xgboost

Merge branch 'master' of github.com:salesforce/TransmogrifAI into mt/…

f108111

…xgboost

Merge branch 'master' into mt/xgboost

0271ab6

make it compile

76a6d6f

cleanup

031a37d

Merge branch 'master' of github.com:salesforce/TransmogrifAI into mt/…

d38b934

…xgboost

tovbinm changed the title from "XGBoost classification & regression models + Spark 2.3.1" to "XGBoost classification & regression models + Spark 2.3.2" on Sep 28, 2018

tovbinm added 2 commits September 27, 2018 23:19

spark 2.3.2

5ec1a32

Merge branch 'master' of github.com:salesforce/TransmogrifAI into mt/…

c3b2502

…xgboost

tovbinm added 2 commits October 1, 2018 10:09

Merge branch 'master' into mt/xgboost

b97f510

Merge branch 'master' into mt/xgboost

2c75617

tovbinm removed the DO NOT MERGE label Oct 11, 2018

leahmcguire reviewed Oct 12, 2018

tovbinm mentioned this pull request Oct 12, 2018

Investigate which classes require registration with Kryo #155

Open

tovbinm added 3 commits October 12, 2018 13:52

added docs

9d49d21

remove enums for now

55e7f27

Addressed comments + docs

5a98621

leahmcguire approved these changes Oct 15, 2018

tovbinm merged commit b1aec92 into master Oct 15, 2018

tovbinm deleted the mt/xgboost branch October 15, 2018 17:08

ericwayman pushed a commit that referenced this pull request Feb 8, 2019

XGBoost classification & regression models + Spark 2.3.2 (#44)

b19d764

salesforce-cla bot added the cla:signed label Jul 10, 2020

salesforce-cla bot removed the cla:signed label Oct 16, 2020

salesforce-cla bot added the cla:missing label Oct 16, 2020

salesforce-cla bot added cla:signed and removed cla:missing labels Mar 19, 2021
