Yin Lou edited this page Aug 26, 2019 · 6 revisions

Datasets

Format

Dense Input Format

The typical input to MLTK is a text file containing the data matrix. An optional attribute file may also be provided to specify the target attribute. Each dataset should be provided in its own white-space-delimited text file without a header row. MLTK supports continuous, nominal, and binned attributes. Missing values can be written as ? or NaN, although MLTK currently has only limited support for them. All dense datasets should have the same number and order of columns. Each line of the attribute file has the following structure:

attribute_name: type [(target)|(x)]

Attributes marked with "(x)" are skipped. There are two kinds of binned attributes: one is specified by the number of bins alone, and the other by the number of bins together with the upper bound and the median of each bin.

Example attribute file

f1: cont (x)
f2: {a, b, c}
f3: binned (256) 
f4: binned (3;[1, 5, 6];[0.5, 2.5, 5.5])
label: cont (target) 
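The attribute line structure above can be illustrated with a small parser. This is a minimal sketch, not MLTK's own parsing code: the class `AttrLineSketch` and its `parse` method are hypothetical names, and the sketch assumes the optional marker, when present, is the last token on the line.

```java
// Hypothetical helper, not part of MLTK: splits one attribute description
// line of the form "name: type [(target)|(x)]" into its parts.
class AttrLineSketch {

    // Returns {name, type, role}; role is "target", "skip", or "" (no marker).
    static String[] parse(String line) {
        int colon = line.indexOf(':');
        if (colon < 0) {
            throw new IllegalArgumentException("Missing ':' in: " + line);
        }
        String name = line.substring(0, colon).trim();
        String rest = line.substring(colon + 1).trim();
        String role = "";
        // Only the exact markers are stripped, so "binned (256)" stays intact.
        if (rest.endsWith("(target)")) {
            role = "target";
            rest = rest.substring(0, rest.length() - "(target)".length()).trim();
        } else if (rest.endsWith("(x)")) {
            role = "skip";
            rest = rest.substring(0, rest.length() - "(x)".length()).trim();
        }
        return new String[] {name, rest, role};
    }

    public static void main(String[] args) {
        String[] lines = {
            "f1: cont (x)",
            "f3: binned (256)",
            "label: cont (target)"
        };
        for (String line : lines) {
            String[] parts = parse(line);
            System.out.println(parts[0] + " | " + parts[1] + " | " + parts[2]);
        }
    }
}
```

The type string is returned unparsed, so a caller would still need to interpret "cont", a nominal value list, or a binned specification.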

Example data file

0.1 1 2 0 5
-2.3 0 255 1 2
? 2 128 2 -3
5 1 0 1 0.2
0.1 1 37 0 0.1
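To make the handling of missing values concrete, the sketch below reads one dense row and maps both ? and NaN to `Double.NaN`. It is an illustration only: `DenseRowSketch` and `parseRow` are hypothetical names, and the sketch assumes every cell is numeric, as in the example data file above.

```java
// Hypothetical helper, not part of MLTK: parses one white-space-delimited
// row of a dense data file into an array of doubles.
class DenseRowSketch {

    // "?" is mapped to NaN explicitly; "NaN" is accepted by Double.parseDouble.
    static double[] parseRow(String line) {
        String[] tokens = line.trim().split("\\s+");
        double[] row = new double[tokens.length];
        for (int i = 0; i < tokens.length; i++) {
            row[i] = tokens[i].equals("?") ? Double.NaN : Double.parseDouble(tokens[i]);
        }
        return row;
    }

    public static void main(String[] args) {
        double[] row = parseRow("? 2 128 2 -3");
        System.out.println("columns: " + row.length
                + ", first value missing: " + Double.isNaN(row[0]));
    }
}
```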

Sparse Input Format

MLTK uses the following structure for sparse input format:

target feature:value ... feature:value

Feature/value pairs must be ordered by increasing feature number. MLTK does not skip zero-valued features. For classification problems, make sure the target is in {0, ..., K - 1}, where K is the number of classes.

Example data file

0 1:0.2 3:0.5
0 2:-0.4
1 1:3.2 5:-3
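The sparse line structure and its ordering rule can be sketched as follows. This is not MLTK's reader: `SparseLineSketch` and its methods are hypothetical, and the sketch assumes "increasing" means strictly increasing, so a repeated feature number is rejected.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of MLTK: parses one line of the form
// "target feature:value ... feature:value".
class SparseLineSketch {

    // The target is the first token on the line.
    static double parseTarget(String line) {
        return Double.parseDouble(line.trim().split("\\s+")[0]);
    }

    // Returns feature number -> value in file order; throws if the feature
    // numbers are not strictly increasing (an assumption of this sketch).
    static Map<Integer, Double> parseFeatures(String line) {
        String[] tokens = line.trim().split("\\s+");
        Map<Integer, Double> features = new LinkedHashMap<>();
        int previous = Integer.MIN_VALUE;
        for (int i = 1; i < tokens.length; i++) {
            String[] pair = tokens[i].split(":");
            if (pair.length != 2) {
                throw new IllegalArgumentException("Bad pair: " + tokens[i]);
            }
            int index = Integer.parseInt(pair[0]);
            if (index <= previous) {
                throw new IllegalArgumentException("Feature numbers must increase: " + line);
            }
            features.put(index, Double.parseDouble(pair[1]));
            previous = index;
        }
        return features;
    }

    public static void main(String[] args) {
        String line = "1 1:3.2 5:-3";
        System.out.println(parseTarget(line) + " " + parseFeatures(line));
    }
}
```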

Dataset I/O

MLTK provides two classes to perform reading/writing of datasets: mltk.core.io.InstancesReader and mltk.core.io.InstancesWriter.

Example

Instances instances = InstancesReader.read("< attr file path >", "< dataset file path >");

This reads a dataset from an attribute file and a data file. The attribute file can be null. When no attribute file is provided, the result depends on the data format: for a dense data file, no target attribute is assigned; for a sparse data file, the target attribute is nominal if all target values are integers and continuous otherwise.

Instances instances = InstancesReader.read("< dataset file path >", < target index >);

This reads a dense dataset from a data file, using the given integer target index. A negative target index (e.g., -1) means no target is specified.

Models

All models in MLTK are subclasses of mltk.predictor.Predictor.

Classifiers

All classifiers are subclasses of mltk.predictor.Classifier. A classifier classifies an instance and returns the index of the predicted class. For example, a return value of 0 means that the instance is assigned to the first class (as specified by the target attribute).

When a classifier is a subclass of mltk.predictor.ProbabilisticClassifier, it can also output a probability for each class.
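One common way to relate per-class probabilities to a class index is to pick the most probable class. The sketch below shows that convention only; this page does not state that MLTK's classifiers work this way, and `ArgmaxSketch` is a hypothetical name rather than an MLTK class.

```java
// Hypothetical helper, not part of MLTK: turns a vector of class
// probabilities into a class index in {0, ..., K - 1}.
class ArgmaxSketch {

    // Returns the index of the largest entry; ties go to the lowest index.
    static int argmax(double[] probabilities) {
        if (probabilities.length == 0) {
            throw new IllegalArgumentException("No classes");
        }
        int best = 0;
        for (int k = 1; k < probabilities.length; k++) {
            if (probabilities[k] > probabilities[best]) {
                best = k;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        double[] probabilities = {0.2, 0.7, 0.1};
        System.out.println("predicted class index: " + argmax(probabilities));
    }
}
```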

Regressors

All regressors are subclasses of mltk.predictor.Regressor. A regressor returns a predicted continuous value.

Predictor I/O

MLTK provides two classes for reading and writing predictors: mltk.predictor.io.PredictorReader and mltk.predictor.io.PredictorWriter. Although each predictor object has its own read and write methods, use them with caution, as they are intended to be called by PredictorReader and PredictorWriter.

All model files are plain text, so they are easy to edit and to parse from another language. The first line of a model file usually specifies the model class.
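Because model files are plain text, the first line can be inspected with ordinary text I/O. The sketch below returns that line without interpreting it, since its exact layout is not described here. `ModelHeaderSketch` is a hypothetical name, and the header text in `main` is a placeholder rather than real MLTK output.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical helper, not part of MLTK: returns the first line of a
// plain-text model source without interpreting its contents.
class ModelHeaderSketch {

    // Returns the trimmed first line, or "" if the source is empty.
    static String firstLine(Reader source) {
        try (BufferedReader reader = new BufferedReader(source)) {
            String line = reader.readLine();
            return line == null ? "" : line.trim();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Placeholder content; a real model file would be opened with a FileReader.
        Reader source = new StringReader("<model class line>\n<model body>\n");
        System.out.println(firstLine(source));
    }
}
```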

Example

GAM gam = (GAM) PredictorReader.read("< model file path >");

This reads a generalized additive model (GAM) from the specified model file. The cast is necessary because the concrete model class is specified in the file rather than known at compile time. PredictorReader also provides a second method for reading models:

GAM gam = PredictorReader.read("< model file path >", GAM.class);

In this case, no cast is needed because the method performs it automatically.
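The difference between the two reading styles can be illustrated with the standard `Class.cast` idiom. This is a sketch of the general pattern and not MLTK's implementation: `Model`, `AdditiveModel`, `OtherModel`, `readUntyped`, and `readTyped` are stand-ins invented for the example.

```java
// Hypothetical stand-ins, not MLTK classes: shows how passing a Class
// token lets a generic method perform the cast for the caller.
class TypedReadSketch {

    static class Model {}
    static class AdditiveModel extends Model {}
    static class OtherModel extends Model {}

    // Pretends to load a model whose concrete class is only known at run time.
    static Model readUntyped() {
        return new AdditiveModel();
    }

    // Casts via the Class token; throws ClassCastException on a mismatch.
    static <T extends Model> T readTyped(Class<T> type) {
        return type.cast(readUntyped());
    }

    public static void main(String[] args) {
        // Style 1: the caller casts explicitly.
        AdditiveModel first = (AdditiveModel) readUntyped();
        // Style 2: the method casts, so the call site needs no cast.
        AdditiveModel second = readTyped(AdditiveModel.class);
        System.out.println(first.getClass() == second.getClass());
    }
}
```

In both styles a mismatch between the requested class and the stored class only surfaces at run time, as a `ClassCastException`.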

The following code writes a GAM object to a file.

PredictorWriter.write(gam, "< model file path >");