Scala API for XGBoost-Spark

This document focuses on the GPU-related Scala API. Seven new classes are introduced:

CrossValidator
GpuDataset
GpuDataReader
XGBoostClassifier
XGBoostClassificationModel
XGBoostRegressor
XGBoostRegressionModel

CrossValidator

The full name is ml.dmlc.xgboost4j.scala.spark.rapids.CrossValidator. It extends Spark's CrossValidator.

Constructors

CrossValidator()

Methods

Note: Only GPU-related methods are listed below.

fit(dataset: GpuDataset): Model[_]. This method triggers cross validation for hyperparameter tuning.
- dataset: a GpuDataset used for cross validation
- returns the best Model[_] for the given hyperparameters. The returned model is actually an XGBoostClassificationModel when the estimator is an XGBoostClassifier, or an XGBoostRegressionModel when it is an XGBoostRegressor. Cast it to the matching model type before calling the GPU version of transform(dataset: GpuDataset).
- Note: For the CPU version, you can still call fit(dataset: Dataset[_])
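
The following is a minimal sketch of GPU cross validation followed by the cast described above. The column names, parameter values and grid values are hypothetical, and "tree_method" -> "gpu_hist" is assumed to be the right setting for your build. The setters on CrossValidator are the ones inherited from Spark's CrossValidator.

```scala
import ml.dmlc.xgboost4j.scala.spark.{XGBoostClassificationModel, XGBoostClassifier}
import ml.dmlc.xgboost4j.scala.spark.rapids.{CrossValidator, GpuDataset}
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.ml.tuning.ParamGridBuilder

object CrossValidationSketch {
  // Hypothetical column names; replace them with the columns of your own data.
  val labelName = "label"
  val featureNames: Seq[String] = Seq("f0", "f1", "f2")

  def tune(trainSet: GpuDataset): XGBoostClassificationModel = {
    val classifier = new XGBoostClassifier(Map[String, Any](
        "objective" -> "binary:logistic",
        "tree_method" -> "gpu_hist")) // assumption: GPU tree method for this build
      .setLabelCol(labelName)
      .setFeaturesCols(featureNames)

    // Illustrative grid over two booster parameters.
    val paramGrid = new ParamGridBuilder()
      .addGrid(classifier.maxDepth, Array(3, 8))
      .addGrid(classifier.eta, Array(0.1, 0.3))
      .build()

    val evaluator = new MulticlassClassificationEvaluator().setLabelCol(labelName)

    val cv = new CrossValidator()
      .setEstimator(classifier)
      .setEvaluator(evaluator)
      .setEstimatorParamMaps(paramGrid)
      .setNumFolds(3)

    // fit returns Model[_]; cast it to reach transform(dataset: GpuDataset).
    cv.fit(trainSet).asInstanceOf[XGBoostClassificationModel]
  }
}
```

For an XGBoostRegressor estimator, the same pattern would cast to XGBoostRegressionModel instead.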

GpuDataset

The full name is ml.dmlc.xgboost4j.scala.spark.rapids.GpuDataset. A GpuDataset is an object that is produced by GpuDataReaders and consumed by XGBoostClassifiers and XGBoostRegressors. No constructors or methods are exposed for this class.

GpuDataReader

The full name is ml.dmlc.xgboost4j.scala.spark.rapids.GpuDataReader. A GpuDataReader sets options and builds a GpuDataset from data sources. Data loading is lazy: it happens only when the data is processed later.

Constructors

GpuDataReader(sparkSession: SparkSession)
- sparkSession: spark session for data loading

Methods

format(source: String): GpuDataReader. This method sets data format. Valid values include csv, parquet and orc.
- source: data format to set
- returns the data reader itself
schema(schema: StructType): GpuDataReader. This method sets data schema.
- schema: data schema in StructType format
- returns the data reader itself
schema(schemaString: String): GpuDataReader. This method sets data schema.
- schemaString: data schema in DDL-formatted String, e.g., a INT, b STRING, c DOUBLE
- returns the data reader itself
option(key: String, value: String): GpuDataReader. This method sets an option.
- key: the option key
- value: the option value in string format
- returns the data reader itself
option(key: String, value: Boolean): GpuDataReader. This method sets an option.
- key: the option key
- value: the Boolean option value
- returns the data reader itself
option(key: String, value: Long): GpuDataReader. This method sets an option.
- key: the option key
- value: the Long option value
- returns the data reader itself
option(key: String, value: Double): GpuDataReader. This method sets an option.
- key: the option key
- value: the Double option value
- returns the data reader itself
options(options: scala.collection.Map[String, String]): GpuDataReader. This method sets options.
- options: the options Map to set
- returns the data reader itself
options(options: java.util.Map[String, String]): GpuDataReader. This method sets options. It is designed for Java compatibility.
- options: the options Map to set
- returns the data reader itself
load(): GpuDataset. This method builds a GpuDataset.
- returns a GpuDataset as the result
load(path: String): GpuDataset. This method builds a GpuDataset.
- path: the data source path
- returns a GpuDataset as the result
load(paths: String*): GpuDataset. This method builds a GpuDataset.
- paths: the data source paths
- returns a GpuDataset as the result
csv(path: String): GpuDataset. This method builds a GpuDataset.
- path: the CSV data path
- returns a GpuDataset as the result
csv(paths: String*): GpuDataset. This method builds a GpuDataset.
- paths: the CSV data paths
- returns a GpuDataset as the result
parquet(path: String): GpuDataset. This method builds a GpuDataset.
- path: the Parquet data path
- returns a GpuDataset as the result
parquet(paths: String*): GpuDataset. This method builds a GpuDataset.
- paths: the Parquet data paths
- returns a GpuDataset as the result
orc(path: String): GpuDataset. This method builds a GpuDataset.
- path: the ORC data path
- returns a GpuDataset as the result
orc(paths: String*): GpuDataset. This method builds a GpuDataset.
- paths: the ORC data paths
- returns a GpuDataset as the result
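
As a rough illustration of how the methods above chain together, the sketch below builds GpuDatasets in three ways. The schema, column names and paths are hypothetical placeholders. Because every setter returns the reader itself, the calls can be chained.

```scala
import ml.dmlc.xgboost4j.scala.spark.rapids.{GpuDataReader, GpuDataset}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{FloatType, StructField, StructType}

object ReaderSketch {
  // Hypothetical schema, expressed once as a StructType and once as a DDL string.
  val schema: StructType = StructType(Seq(
    StructField("f0", FloatType),
    StructField("f1", FloatType),
    StructField("label", FloatType)))
  val schemaDdl = "f0 FLOAT, f1 FLOAT, label FLOAT"

  // Format-specific shortcut: csv(path).
  def readCsv(spark: SparkSession, path: String): GpuDataset =
    new GpuDataReader(spark).schema(schema).csv(path)

  // Generic form: format(...) followed by load(...), here with a DDL schema.
  def readCsvWithDdl(spark: SparkSession, path: String): GpuDataset =
    new GpuDataReader(spark).format("csv").schema(schemaDdl).load(path)

  // Several paths can be passed through the varargs overload.
  def readParquet(spark: SparkSession, paths: Seq[String]): GpuDataset =
    new GpuDataReader(spark).format("parquet").load(paths: _*)
}
```

Since loading is lazy, none of these calls should read any data by themselves; the read happens when the resulting GpuDataset is consumed.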

Options

Common options
- asFloats: A Boolean flag indicating whether to cast all numeric values to floats. Default is true.
- maxRowsPerChunk: An Int specifying the maximum number of rows per chunk. Default is Int.MaxValue.
Options for CSV
- comment: A single character; lines beginning with this character are skipped. Default is empty string, which disables comment skipping.
- header: A Boolean flag indicating whether the first line should be used as column names. Default is false.
- nullValue: The string representation of a null value. Default is empty string.
- quote: A single character used for escaping quoted values where the separator can be part of the value. Default is ".
- sep: A single character used as the separator between adjacent values. Default is ,.
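
A small sketch of how these options might be passed to a reader, using the typed option overloads and the options map. The option values, the schema string and the path are illustrative assumptions, not recommended settings.

```scala
import ml.dmlc.xgboost4j.scala.spark.rapids.{GpuDataReader, GpuDataset}
import org.apache.spark.sql.SparkSession

object ReaderOptionsSketch {
  // Hypothetical CSV settings: pipe-separated values, "NA" as null, "#" comments.
  val csvOptions: Map[String, String] = Map(
    "sep" -> "|",
    "nullValue" -> "NA",
    "comment" -> "#")

  def read(spark: SparkSession, path: String): GpuDataset =
    new GpuDataReader(spark)
      .schema("f0 FLOAT, f1 FLOAT, label FLOAT") // hypothetical columns
      .option("header", true)                    // Boolean overload
      .option("asFloats", true)                  // already the default, shown for clarity
      .option("maxRowsPerChunk", 100000L)        // Long overload
      .options(csvOptions)                       // Scala Map overload
      .csv(path)
}
```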

XGBoostClassifier

The full name is ml.dmlc.xgboost4j.scala.spark.XGBoostClassifier. It extends ProbabilisticClassifier[Vector, XGBoostClassifier, XGBoostClassificationModel].

Constructors

XGBoostClassifier(xgboostParams: Map[String, Any])
- all standard xgboost parameters are supported
- eval_sets: Map[String, GpuDataset]. This parameter sets the eval sets for training. (For CPU training, the type of parameter eval_sets is Map[String, DataFrame])

Methods

Note: Only GPU-related methods are listed below.

setFeaturesCols(value: Seq[String]): XGBoostClassifier. This method sets the feature columns for training.
- value: a sequence of feature column names to set
- returns the classifier itself
setEvalSets(evalSets: Map[String, GpuDataset]): XGBoostClassifier. This method sets eval sets for training.
- evalSets: eval sets for training (For CPU training, the type is Map[String, DataFrame])
- returns the classifier itself
fit(dataset: GpuDataset): XGBoostClassificationModel. This method triggers the training.
- dataset: a GpuDataset to train
- returns the training result as an XGBoostClassificationModel
- Note: For CPU training, you can still call fit(dataset: Dataset[_])
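
The sketch below shows one way the constructor and the GPU-related setters could be combined for training. The parameter values and column names are hypothetical, and "tree_method" -> "gpu_hist" is an assumption about the appropriate GPU setting. setLabelCol is inherited from Spark's Predictor.

```scala
import ml.dmlc.xgboost4j.scala.spark.{XGBoostClassificationModel, XGBoostClassifier}
import ml.dmlc.xgboost4j.scala.spark.rapids.GpuDataset

object ClassifierTrainingSketch {
  // Hypothetical column names and parameter values.
  val labelName = "label"
  val featureNames: Seq[String] = Seq("f0", "f1", "f2")
  val params: Map[String, Any] = Map(
    "objective" -> "binary:logistic",
    "num_round" -> 100,
    "num_workers" -> 1,
    "tree_method" -> "gpu_hist") // assumption: GPU tree method for this build

  def train(trainSet: GpuDataset, evalSet: GpuDataset): XGBoostClassificationModel =
    new XGBoostClassifier(params)
      .setLabelCol(labelName)
      .setFeaturesCols(featureNames)       // feature columns by name, no vector assembly
      .setEvalSets(Map("eval" -> evalSet)) // named eval set for training
      .fit(trainSet)
}
```

Passing "eval_sets" -> Map("eval" -> evalSet) inside the constructor map is the alternative described under Constructors.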

XGBoostClassificationModel

The full name is ml.dmlc.xgboost4j.scala.spark.XGBoostClassificationModel. It extends ProbabilisticClassificationModel[Vector, XGBoostClassificationModel].

Methods

Note: Only GPU-related methods are listed below.

transform(dataset: GpuDataset): DataFrame. This method predicts results based on the model.
- dataset: a GpuDataset to predict on
- returns a DataFrame with the predictions
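
A minimal prediction sketch, assuming the model was trained as described for XGBoostClassifier and that the output uses Spark ML's default prediction column name, "prediction":

```scala
import ml.dmlc.xgboost4j.scala.spark.XGBoostClassificationModel
import ml.dmlc.xgboost4j.scala.spark.rapids.GpuDataset
import org.apache.spark.sql.DataFrame

object ClassifierPredictionSketch {
  // Spark ML's default prediction column name; an assumption if it was overridden.
  val predictionCol = "prediction"

  def predict(model: XGBoostClassificationModel, testSet: GpuDataset): DataFrame = {
    // GPU version of transform: takes a GpuDataset, returns an ordinary DataFrame.
    val results = model.transform(testSet)
    results.select(predictionCol)
  }
}
```

The result is a regular Spark DataFrame, so it can be passed on to evaluators or written out as usual.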

XGBoostRegressor

The full name is ml.dmlc.xgboost4j.scala.spark.XGBoostRegressor. It extends Predictor[Vector, XGBoostRegressor, XGBoostRegressionModel].

Constructors

XGBoostRegressor(xgboostParams: Map[String, Any])
- all standard xgboost parameters are supported
- eval_sets: Map[String, GpuDataset]. This parameter sets the eval sets for training. (For CPU training, the type of parameter eval_sets is Map[String, DataFrame])

Methods

Note: Only GPU-related methods are listed below.

setFeaturesCols(value: Seq[String]): XGBoostRegressor. This method sets the feature columns for training.
- value: a sequence of feature column names to set
- returns the regressor itself
setEvalSets(evalSets: Map[String, GpuDataset]): XGBoostRegressor. This method sets eval sets for training.
- evalSets: eval sets for training (For CPU training, the type is Map[String, DataFrame])
- returns the regressor itself
fit(dataset: GpuDataset): XGBoostRegressionModel. This method triggers the training.
- dataset: a GpuDataset to train
- returns the training result as an XGBoostRegressionModel
- Note: For CPU training, you can still call fit(dataset: Dataset[_])
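
Regression training follows the same shape as classification. The sketch below uses hypothetical column names and parameter values, and leaves the objective at the library default.

```scala
import ml.dmlc.xgboost4j.scala.spark.{XGBoostRegressionModel, XGBoostRegressor}
import ml.dmlc.xgboost4j.scala.spark.rapids.GpuDataset

object RegressorTrainingSketch {
  // Hypothetical column names and parameter values.
  val labelName = "target"
  val featureNames: Seq[String] = Seq("f0", "f1", "f2")
  val params: Map[String, Any] = Map(
    "max_depth" -> 8,
    "eta" -> 0.1,
    "num_round" -> 100)

  def train(trainSet: GpuDataset, evalSet: GpuDataset): XGBoostRegressionModel =
    new XGBoostRegressor(params)
      .setLabelCol(labelName)
      .setFeaturesCols(featureNames)
      .setEvalSets(Map("eval" -> evalSet))
      .fit(trainSet)
}
```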

XGBoostRegressionModel

The full name is ml.dmlc.xgboost4j.scala.spark.XGBoostRegressionModel. It extends PredictionModel[Vector, XGBoostRegressionModel].

Methods

Note: Only GPU-related methods are listed below.

transform(dataset: GpuDataset): DataFrame. This method predicts results based on the model.
- dataset: a GpuDataset to predict on
- returns a DataFrame with the predictions
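
As an illustration, predictions from a regression model can be scored with Spark's RegressionEvaluator. This sketch assumes the DataFrame returned by transform still carries the label column (named "target" here, a hypothetical name) next to the default "prediction" column.

```scala
import ml.dmlc.xgboost4j.scala.spark.XGBoostRegressionModel
import ml.dmlc.xgboost4j.scala.spark.rapids.GpuDataset
import org.apache.spark.ml.evaluation.RegressionEvaluator

object RegressorPredictionSketch {
  // Hypothetical label column; assumed to be present in the transform output.
  val labelName = "target"

  def rmse(model: XGBoostRegressionModel, testSet: GpuDataset): Double = {
    val predictions = model.transform(testSet)
    new RegressionEvaluator()
      .setLabelCol(labelName)
      .setMetricName("rmse")
      .evaluate(predictions)
  }
}
```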
