
Dataset Transformation



Discretization

Discretization from Command Line

To launch the discretizer, use the following command:

$ java mltk.core.processor.Discretizer

It should output a message like this:

Usage: mltk.core.processor.Discretizer
-i	input dataset path
-o	output dataset path
[-r]	attribute file path
[-t]	training file path
[-d]	discretized attribute file path
[-m]	output attribute file path
[-n]	maximum num of bins (default: 256)

This class has two functions. The first is to learn a discretization from training data and to write a new attribute file containing the learned discretization. The second is to use such an attribute file to discretize new datasets.

Learning Discretization

$ java mltk.core.processor.Discretizer -r <attr file> -t <training data> -m <output attribute file> -i <input dataset> -o <discretized output dataset>

This command loads the training data and discretizes all continuous features into at most 256 bins (the default maximum). It generates a new attribute file, specified by the -m argument, that records the learned discretization. It also takes an input dataset, discretizes it, and saves the new discretized dataset to disk.
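For example, assuming hypothetical file names, a run might look like this:

$ java mltk.core.processor.Discretizer -r data.attr -t train.txt -m data.disc.attr -i train.txt -o train.disc.txt

Here the training set itself is also discretized and written to train.disc.txt, while data.disc.attr stores the learned discretization for later use.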

Applying Discretization

$ java mltk.core.processor.Discretizer -r <attr file> -i <input dataset> -d <discretized attribute file> -o <discretized output dataset>

This command loads the input dataset, applies the discretization specified by the -d argument (the attribute file produced in the previous step), and saves the new discretized dataset to disk.
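Continuing the example above (file names are again hypothetical):

$ java mltk.core.processor.Discretizer -r data.attr -i test.txt -d data.disc.attr -o test.disc.txt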

Discretization in Java Code

List<Attribute> attributes = instances.getAttributes();
for (int j = 0; j < instances.dimension(); j++) {
    // Only numeric (continuous) attributes need discretization
    if (attributes.get(j).getType() == Type.NUMERIC) {
        // Discretize attribute j in place into at most 256 bins
        Discretizer.discretize(instances, j, 256);
    }
}

This snippet discretizes all numeric attributes of instances into at most 256 bins. The corresponding attribute objects are updated in place.
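A minimal end-to-end sketch follows. The reader and writer class names (InstancesReader and InstancesWriter in mltk.core.io) and their signatures are assumptions here; check them against your MLTK version.

// Assumed API: load a dataset given an attribute file and a data file
Instances instances = InstancesReader.read("data.attr", "train.txt");

// Discretize every numeric attribute into at most 256 bins
List<Attribute> attributes = instances.getAttributes();
for (int j = 0; j < instances.dimension(); j++) {
    if (attributes.get(j).getType() == Type.NUMERIC) {
        Discretizer.discretize(instances, j, 256);
    }
}

// Assumed API: write the discretized dataset back to disk
InstancesWriter.write(instances, "train.disc.txt");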

Splitting

To launch the splitter, use the following command:

$ java mltk.core.processor.InstancesSplitter

It should output a message like this:

Usage: mltk.core.processor.InstancesSplitter
-i	input dataset path
-o	output directory path
[-r]	attribute file path
[-m]	splitting mode:parameter. Splitting mode can be split (s) and cross validation (c) (default: c:5)
[-a]	attribute name to perform stratified sampling (default: null)
[-s]	seed of the random number generator (default: 0)

There are two modes in InstancesSplitter: split (s) and cross validation (c). The output is written under the directory specified by the -o argument; if the directory does not exist, it will be created. Optionally, stratified sampling can be performed: for example, -a label tells the splitter to preserve the distribution of the attribute label in every sample.
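For example, a stratified 5-fold split could be invoked like this (file and directory names are hypothetical):

$ java mltk.core.processor.InstancesSplitter -r data.attr -i data.txt -o splits -m c:5 -a label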

Split

In this mode, the dataset is split into a training set and a validation set (and optionally a test set). The parameter following s determines the fraction of points assigned to each part. For example, -m s:0.8 puts 80% of the points in the training set and 20% in the validation set; -m s:0.7:0.15:0.15 puts 70% in the training set, 15% in the validation set, and 15% in the test set.
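A concrete invocation (hypothetical file names):

$ java mltk.core.processor.InstancesSplitter -r data.attr -i data.txt -o splits -m s:0.8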

Cross validation

In this mode, the dataset is split into k folds. Each fold contains a training set, a test set, and an optional validation set. For example, -m c:5 creates 5 directories (cv.0, ..., cv.4) under the output directory, each containing a training set and a test set. In each fold, 1/k of the points are in the test set and the rest are in the training set; with k = 5, that is 20% in the test set and 80% in the training set. The test sets are disjoint and their union is the whole dataset. -m c:5:0.8 additionally creates a validation set for each fold: 20% of the points go to the test set, 16% to the validation set, and 64% to the training set.
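For example, 5-fold cross validation with a per-fold validation set (hypothetical file names):

$ java mltk.core.processor.InstancesSplitter -r data.attr -i data.txt -o splits -m c:5:0.8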