
[WIP] Tidy-up and improve ergonomics with new interface and dataset #45

Closed
wants to merge 37 commits

Conversation

bytesnake
Member

@bytesnake bytesnake commented Sep 16, 2020

I think the linfa crate has collected enough algorithms to move towards a unified interface. The intention of this PR is to tidy up the project and implement unified traits for transformers, fittable models and incremental models. Furthermore, a Dataset struct is introduced, which wraps records, targets, labels and weights in a unified way.

The description is also WIP and will be updated over time

Traits for the learning process

This PR introduces traits for transformers, learnable models and incremental models. The trait implementations (see here) follow these conventions:

  • all three traits are implemented for a hyperparameter set, which governs the learning algorithm
  • Transformer: a transformer is an algorithm which does not expose its inner state. Common examples are kernel methods, which are unique to every given dataset. Support Vector Machines, on the other hand, may implement this trait as well and call fit and predict internally.
  • Fit: a fittable algorithm learns a model from the training dataset. The resulting model can predict targets for new data from the same domain as the input.
  • IncrementalFit: an incremental algorithm can update its inner state, based on a former model and new data
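
The three traits above could be sketched roughly as follows. This is a minimal sketch in Rust; all names, signatures and the toy `MeanParams` example are illustrative assumptions for this description, not the PR's exact API:

```rust
// Illustrative sketch only; names and signatures are assumptions.

/// Wraps records (features) together with targets.
pub struct Dataset<R, T> {
    pub records: R,
    pub targets: T,
}

/// A transformer maps a dataset to a new representation without
/// exposing inner state (e.g. a kernel transformation).
pub trait Transformer<D, O> {
    fn transform(&self, dataset: D) -> O;
}

/// A fittable algorithm, implemented on its hyperparameter set,
/// learns a model from a training dataset.
pub trait Fit<D> {
    type Object;
    fn fit(&self, dataset: &D) -> Self::Object;
}

/// An incremental algorithm updates a previously fitted model with new data.
pub trait IncrementalFit<M, D> {
    fn fit_with(&self, model: M, dataset: &D) -> M;
}

// Toy instance: the "hyperparameter set" of a mean predictor is empty.
pub struct MeanParams;
pub struct MeanModel {
    pub mean: f64,
}

impl Fit<Dataset<Vec<f64>, Vec<f64>>> for MeanParams {
    type Object = MeanModel;

    fn fit(&self, dataset: &Dataset<Vec<f64>, Vec<f64>>) -> MeanModel {
        let sum: f64 = dataset.targets.iter().sum();
        MeanModel {
            mean: sum / dataset.targets.len() as f64,
        }
    }
}
```

The key design point is that fit lives on the hyperparameter set (here `MeanParams`), not on the model, so configuring and training are cleanly separated.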

Dataset

A dataset contains a records field and may further contain targets, weights and labels. The Targets trait corresponds to any kind of target (f32 as well as String), while the Labels trait can be used to narrow the implementation down to comparable targets (at the moment implemented for bool, usize and String)
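
The Targets/Labels split could look roughly like this. This is a hypothetical sketch; the trait bodies and method names are my assumptions, not the actual implementation:

```rust
// Hypothetical sketch of the Targets/Labels split; names are assumptions.
use std::collections::HashSet;
use std::hash::Hash;

/// Any kind of target, continuous (f32) as well as discrete (String).
pub trait Targets {
    type Elem;
    fn as_slice(&self) -> &[Self::Elem];
}

/// Comparable, discrete targets that can be enumerated as a label set.
pub trait Labels: Targets
where
    Self::Elem: Eq + Hash + Clone,
{
    fn labels(&self) -> HashSet<Self::Elem> {
        self.as_slice().iter().cloned().collect()
    }
}

// Continuous targets only implement `Targets` ...
impl Targets for Vec<f32> {
    type Elem = f32;
    fn as_slice(&self) -> &[f32] {
        self
    }
}

// ... while boolean targets additionally implement `Labels`.
impl Targets for Vec<bool> {
    type Elem = bool;
    fn as_slice(&self) -> &[bool] {
        self
    }
}
impl Labels for Vec<bool> {}
```

Floating-point targets cannot implement Labels because f32 is not Eq, which is exactly the narrowing the description mentions.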

Tidying up the crate

  • remove top level dependencies to sub-crates
  • remove dependency on serde

Unresolved questions

  • the implementation of a dataset follows the model of ndarray, where everything is implemented for concrete types. On the one hand this creates some boilerplate code (for example, Vec<T>, &Vec<T> and &[T] have to be implemented separately), but I don't think that Rust is expressive enough at the moment to go another way
  • the name Records was chosen to distinguish it from the Data trait of ndarray, but it's not the most common name to describe the actual data in a dataset
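
To illustrate the first point, here is a sketch of the boilerplate under stated assumptions: the Records name follows the PR, but the observations method is invented for illustration. Each owned/borrowed form needs its own impl:

```rust
// Sketch of the concrete-type boilerplate: Vec<T>, &Vec<T> and &[T]
// each need a separate implementation (method name is illustrative).
pub trait Records {
    fn observations(&self) -> usize;
}

impl<T> Records for Vec<T> {
    fn observations(&self) -> usize {
        self.len()
    }
}

impl<T> Records for &Vec<T> {
    fn observations(&self) -> usize {
        self.len()
    }
}

impl<T> Records for &[T] {
    fn observations(&self) -> usize {
        self.len()
    }
}
```

A blanket impl over a common deref target would reduce this, but coherence rules and conflicting upstream impls make that hard in practice, which matches the "Rust is not expressive enough atm" remark.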

Example

Here is a simple example for a kernel transformation, followed by the training of a SVC model:

    // everything above 6.5 is considered a good wine
    let dataset = Dataset::new(data, targets)
        .map_targets(|x| **x > 6.5);

    // split into training and validation dataset
    let (train, valid) = dataset.split_with_ratio(0.1);
    
    // transform with RBF kernel
    let train_kernel = Kernel::params()
        .method(KernelMethod::Gaussian(80.0))
        .transform(&train);

    println!(
        "Fit SVM classifier with #{} training points",
        train.observations()
    );

    // fit a SVM with C value 7 and 0.6 for positive and negative classes
    let model = Svm::params()
        .pos_neg_weights(7., 0.6)
        .fit(&train_kernel);

    println!("{}", model);
    // A positive prediction indicates a good wine, a negative, a bad one
    fn tag_classes(x: &bool) -> String {
        if *x {
            "good".into()
        } else {
            "bad".into()
        }
    }

    // map targets for validation dataset
    let valid = valid.map_targets(tag_classes);

    // predict and map targets
    let pred = model.predict(&valid)
        .map_targets(|x| *x > 0.0)
        .map_targets(tag_classes);

    // create a confusion matrix
    let cm = pred.confusion_matrix(&valid);

    // Print the confusion matrix, this will print a table with four entries. On the diagonal are
    // the number of true-positive and true-negative predictions, off the diagonal are
    // false-positive and false-negative
    println!("{:?}", cm);

    // Calculate the accuracy and Matthews Correlation Coefficient (correlation between
    // predicted and true labels)
    println!("accuracy {}, MCC {}", cm.accuracy(), cm.mcc());

@bytesnake
Member Author

I don't have much time atm, but any feedback is welcome

NDarray has already a trait called `Data`, to avoid name collisions our
`Data` is now called `Records`
 * implement `Transformer` for `KernelParams`
 * move creation functions to `KernelParams`
 * use `Records` and `Float`
 * make `Fit` more generic over return object
 * implement one-class classification for Labels = ()
The phantom type distinguishes between different kind of predictions,
like boolean, probability or floating predictions
 * add builder pattern to kernel methods
 * wrestle with type system
 * split Targets into Targets and Labels
 * implement Fit and Predict traits for SVM regression
 * create a dataset from records and targets
 * use kernel method as transformer
 * fit a model with hyperparameters and given dataset
 * create a second validation dataset and populate targets with predict
 * evaluate with confusion matrix
@bytesnake bytesnake changed the title [WIP] Experiment with public interface and datasets [WIP] Tidy-up and improve ergonomics with new interface and dataset Oct 17, 2020
@paulkoerbitz
Collaborator

I've taken a short look tonight! This looks really awesome, thanks for this extensive amount of work! I'd like to spend a bit more time thinking about the names and visibility of some items. I'll write more comments tomorrow.

Thanks for the work!

@paulkoerbitz paulkoerbitz self-requested a review October 27, 2020 21:30
@bytesnake
Member Author

Moved to #55

@bytesnake bytesnake closed this Nov 2, 2020