Model Selection, Evaluation and Prediction

Model Selection

When a learner is a subclass of mltk.predictor.HoldoutValidatedLearner, you can specify a validation set and a metric, and the best model on the validation set according to that metric is selected during training.
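For example, here is a minimal sketch of wiring in holdout validation, assuming LogitBoostLearner (a HoldoutValidatedLearner subclass) exposes setValidSet and setMetric setters and that an AUC metric class exists in mltk.predictor.evaluation (these names are assumptions; check the Javadoc):

// Sketch only: setValidSet/setMetric are assumed method names.
LogitBoostLearner learner = new LogitBoostLearner();
learner.setValidSet(validSet);  // held-out validation set (an Instances object)
learner.setMetric(new AUC());   // keep the model with the best AUC on validSet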

Alternatively, MLTK supports cross validation using mltk.core.processor.InstancesSplitter.

Convergence Criteria

Most machine learning algorithms are iterative, and the best model is usually selected on a validation set based on some metric. By analyzing the series of metric values, one can often conclude that the series has already converged and stop the learning algorithm before it hits the maximum number of iterations. This not only saves training time, but also makes the maximum number of iterations a less important parameter to tune.

MLTK uses the mltk.predictor.evaluation.ConvergenceTester class to keep track of the series of metric values and determine whether the series has converged. There are three main parameters for the convergence test: minNumPoints, n and c. The series has to be at least minNumPoints long to be eligible for the convergence test. Once the series is at least minNumPoints long, we find the index idx of the best metric value so far. We say the series has converged if idx + n < size * c, where size is the current number of points in the series. The intuition is that the best metric value should be a peak (or bottom) with a wide margin after it.
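As a concrete illustration of this rule (plain arithmetic, not MLTK API): with n = 0 and c = 0.8, a series of 100 points is considered converged only if the best value occurred before index 80.

// Plain illustration of the convergence rule idx + n < size * c.
int idx = 70;       // index of the best metric value so far
int n = 0;          // required number of points after the best value
double c = 0.8;     // fraction of the series allowed before the best value
int size = 100;     // current number of points in the series
boolean converged = idx + n < size * c;  // 70 < 80.0, so converged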

With n and c we can implement complex convergence rules, but there are two common cases.

  • n = 0 and c in [0, 1]

For example, with n = 0 and c = 0.8, at least 20% of the points must come after the best metric value. A smaller value of c leads to a more conservative convergence test. This setting is recommended when training boosted tree ensembles.

  • n > 0 and c = 1.0

Sometimes we need to make sure there are at least n points after the peak (or bottom). For example, when training a GAM model, it is recommended to test at least k passes after the peak (or bottom), where a pass means iterating over all p features. This translates to setting n = k * p, as in the sketch below.
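For instance, with k = 3 passes over p = 50 features, this gives n = 3 * 50 = 150. Using the ConvergenceTester constructor shown in the next section (minNumPoints = 200 is an arbitrary choice for this sketch):

int k = 3, p = 50;  // illustrative values
ConvergenceTester ct = new ConvergenceTester(200, k * p, 1.0);  // n = 150, c = 1.0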

Specifying Convergence Criteria from Command Line

Most subclasses of mltk.predictor.HoldoutValidatedLearner have an -S option that specifies the convergence criteria. Currently it only works with a validation set. The syntax is minNumPoints[:n][:c]. For example, to require at least 200 points and c = 0.8, use -S 200:0:0.8. To require 200 points and n = 400, use -S 200:400. The default values for n and c are 0 and 1.0, respectively. A negative minNumPoints turns off the convergence test.

Specifying Convergence Criteria in Java Code

The following code specifies minNumPoints = 200, n = 0 and c = 0.8 in LogitBoostLearner:

// minNumPoints = 200, n = 0, c = 0.8: converge once at least 200 points are
// recorded and at least 20% of them come after the best metric value.
ConvergenceTester ct = new ConvergenceTester(200, 0, 0.8);
LogitBoostLearner learner = new LogitBoostLearner();
learner.setConvergenceTester(ct);
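During training, the learner feeds metric values to the tester and stops once it reports convergence. A hypothetical sketch of that loop, assuming ConvergenceTester exposes add(double) and isConverged() (method names are assumptions; check the Javadoc):

// Hypothetical training loop; add(...) and isConverged() are assumed names.
for (int iter = 0; iter < maxNumIters; iter++) {
    trainOneIteration();                 // placeholder for one boosting step
    double metric = evalOnValidSet();    // placeholder validation evaluation
    ct.add(metric);
    if (ct.isConverged()) {
        break;  // the metric series has converged; stop early
    }
}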

Model Evaluation

Evaluating Models from Command Line

MLTK provides the mltk.predictor.evaluation.Evaluator class for this purpose. To evaluate models from the command line, use the following command:

$ java mltk.predictor.evaluation.Evaluator

It should output a message like this:

Usage: mltk.predictor.evaluation.Evaluator
-d	data set path
-m	model path
[-r]	attribute file path
[-e]	AUC (a), Error (c), Logistic Loss (l), MAE(m), RMSE (r) (default: r)
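
For example, to compute AUC for a model on a test set (file names here are placeholders):

$ java mltk.predictor.evaluation.Evaluator -d test.txt -m model.txt -e a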

Currently MLTK supports area-under-curve (AUC), classification error (Error), logistic loss (Logistic Loss), log loss (Log Loss), mean absolute error (MAE) and root-mean-squared error (RMSE).

Some learners, such as LogitBoostLearner, support customized metrics. On the command line, -e AUC will use AUC as the metric, while -e LogLoss:true will use log loss. Note that : separates parameters: the part before the first : is the metric name, and what follows it is an optional parameter. Here LogLoss:true computes log loss from the raw score (the default is to compute it from the probability).

Evaluating Models in Java Code

The following code builds an L1-regularized linear model and evaluates its classification error on a held-out test set:

LassoLearner learner = new LassoLearner();
// Train an L1-regularized GLM; here 100 is the maximum number of iterations
// and 0.01 is the regularization parameter (lambda).
GLM glm = learner.buildClassifier(trainSet, 100, 0.01);

// Classification error of the model on the held-out test set.
double error = Evaluator.evalError(glm, testSet);
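Other metrics follow the same pattern. For instance, assuming Evaluator provides an analogous evalAUC method (an assumed name; verify against the Javadoc):

double auc = Evaluator.evalAUC(glm, testSet);  // evalAUC is an assumed name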

Model Prediction

Making Predictions from Command Line

MLTK provides the mltk.predictor.evaluation.Predictor class. To make predictions from the command line, use the following command:

$ java mltk.predictor.evaluation.Predictor

It should output a message like this:

Usage: mltk.predictor.evaluation.Predictor
-d	data set path
-m	model path
[-r]	attribute file path
[-p]	prediction path
[-R]	residual path
[-g]	task between classification (c) and regression (r) (default: r)
[-P]	output probability (default: false)

When using -P true, it generates probabilities instead of predicted labels. When -R is used, it generates residuals (pseudo residuals for classification problems). Residuals are the input to mltk.predictor.gam.interaction.FAST when running GA2M.
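
Putting it together, the following command writes class probabilities for a test set to a prediction file (file names here are placeholders):

$ java mltk.predictor.evaluation.Predictor -d test.txt -m model.txt -p prediction.txt -g c -P true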