
Dataset improvements #70

Open · 8 of 10 tasks
bytesnake opened this issue Dec 11, 2020 · 29 comments
Labels: enhancement (New feature or request)

@bytesnake (Member, Author) commented Dec 11, 2020

Most of the time the dataset is implemented for ArrayBase<D, Ix2> for records and ArrayBase<D, Ix1> for targets. We could simplify a lot of impls by:

  • renaming Dataset to DatasetBase
  • implementing Dataset and DatasetView
  • implementing Dataframe
  • supporting multi-targets and label probabilities

Implement for the new types:

  • k-folding
  • non-copying k-folding iter_fold
  • bootstrapping
  • shuffling

Improve ergonomics:

  • feature naming
  • target naming
@Sauro98 (Member) commented Dec 16, 2020

Hi, I would love to help with this, but I can't say that I completely understand what you have in mind. Could you explain it a little more? Is the following what you had in mind?

  • Keep what's already implemented for Dataset but rename the struct to DatasetBase (kind of like ndarray with Array and ArrayBase?)
  • Add new types Dataset and DatasetView that use ArrayBase and ArrayView for the records and targets like you described. Would one do this with pub type Dataset = DatasetBase<ArrayBase<D, Ix2>, ArrayBase<D, Ix1>>?
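For illustration, a minimal sketch of what such aliases could look like, following the ndarray pattern (the struct shape and field names are assumptions from this thread, not the final linfa API):

use ndarray::{Array1, Array2, ArrayView1, ArrayView2};

// Generic base over any records/targets representation (assumed layout).
pub struct DatasetBase<R, T> {
    pub records: R,
    pub targets: T,
}

// Owning variant, analogous to ndarray::Array.
pub type Dataset<F> = DatasetBase<Array2<F>, Array1<F>>;

// Borrowing variant, analogous to ndarray::ArrayView.
pub type DatasetView<'a, F> = DatasetBase<ArrayView2<'a, F>, ArrayView1<'a, F>>;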

@bytesnake (Member, Author) commented:

Both points are right, we would just follow the ndarray model more closely. Because of the coming holidays, my plan was to schedule that for the beginning of next year. You can give it a try, though it's rather tedious work, as all algorithms have to be updated :D

@Sauro98 (Member) commented Dec 16, 2020

Well, I'll give it a try then and let you know how it goes 👍🏻

@Sauro98 (Member) commented Dec 17, 2020

Hi, is there some specific reason why some datasets in the datasets crate are returned using a Vec for the targets instead of an Array1?

@bytesnake (Member, Author) commented:

No, this should default to Array1.

@Sauro98 (Member) commented Jan 5, 2021

I have one doubt about iter_fold: couldn't the iterator items possibly outlive the object from which the iterator is created? I mean, one would have to:

  • take a mutable reference to the original dataset
  • split that reference into mutable slices
  • for each item in the iterator, swap the contents of some slices and then recombine the slices into two arrays

But at that point the contents of the various items would overlap, so calling collect on that iterator would yield a mess, right? If one wanted to use two items from the iterator at the same time, there would be a conflict over the slices composing the two items.

Wouldn't such an implementation need something like this proposal?

I'm asking because, as a Rust noob, I just can't wrap my head around it.

@bytesnake (Member, Author) commented:

We can't ensure that nobody is holding a reference while the content is modified. One solution would involve RefCell, which basically moves the borrow check to runtime and panics when the content is currently borrowed for mutation.
Another solution would drop the iterator completely and just take a closure Dataset -> Model. The caller would then ensure that the reference to the dataset is dropped after the model is trained.

@Sauro98 (Member) commented Jan 7, 2021

And what would a function like the second one you mention return? The k validation scores, or the k pairs of (model, validation set)? I'm trying to make some sort of connection with scikit-learn's cross_val_score.

@bytesnake (Member, Author) commented:

It should return impl Iterator<Item = (model, validation set)>, but I don't know whether the model type can be determined by the return type of a closure.
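A minimal sketch of how such a signature could look, reusing the DatasetBase/DatasetView shapes sketched earlier; here the model type M is simply the closure's return type, so the caller fixes it implicitly (the trait name, the boxed return type, and the omitted fold-splitting logic are assumptions, not the actual linfa design):

use ndarray::{ArrayView1, ArrayView2};

// Assumed aliases from the earlier sketch.
pub struct DatasetBase<R, T> {
    pub records: R,
    pub targets: T,
}
pub type DatasetView<'a, F> = DatasetBase<ArrayView2<'a, F>, ArrayView1<'a, F>>;

pub trait IterFold<F> {
    // For each of the k folds, call `fit` on a view of the training split and
    // yield the fitted model together with a view of the validation split.
    fn iter_fold<'a, M, C>(
        &'a self,
        k: usize,
        fit: C,
    ) -> Box<dyn Iterator<Item = (M, DatasetView<'a, F>)> + 'a>
    where
        F: 'a,
        M: 'a,
        C: FnMut(&DatasetView<'a, F>) -> M + 'a;
}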

@Sauro98 (Member) commented Jan 7, 2021

Ok, so I thought I had it because it now compiles, but there's one tiny problem. I tried to implement iter_fold for fittable algorithms, but the problem is that fit requires the dataset used for fitting to have the same lifetime as the model. That means that the view given as input to the closure can't be used to train the model, because it's too short-lived.

@bytesnake (Member, Author) commented:

Mh, this was introduced to associate the correct lifetime with the kernel matrix here. We could copy the kernel matrix and avoid the lifetime there.

@Sauro98 (Member) commented Jan 14, 2021

I've been a little busy lately and I'm proceeding slowly so I thought I'd give a little "status-report" here, also to get a second opinion on what I've been doing.

First part:

  • removed lifetime requirement on the training set for fittable models;
  • completed the implementation of iter_fold for fittable models;
  • tested the implementation on a mock struct implementing Fit;

Second part:

Here things get a little less straightforward. I didn't want to force people using SVMs to clone their kernel every time, so I am trying to change SVMs to accept either an owned kernel or a "view" of one. To that end I changed linfa-kernel in this way:

  • Added two new enum variants to KernelInner to be able to take a view of sparse or dense kernel matrices;
  • Renamed Kernel to KernelBase in linfa-kernel;
  • Added types:
    • KernelArray: uses ArrayBase<D, Ix2> as records
    • Kernel: uses Array2 as records
    • KernelView: uses ArrayView2 as records
  • Moved implementations relying only on the inner matrix to KernelBase and the ones relying also on the dataset to KernelArray;
  • Implemented new methods for Kernel and KernelView where the first one takes ownership of the input data and both use owned inner matrices;
  • Implemented view methods for Kernel and KernelView where the first one takes a view of the dataset and both take views of their inner matrices;

This way, instead of taking a reference to the kernel, one can call kernel.view() and not worry about data being cloned.
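A rough sketch of the view relationship described here, with the kernel reduced to a dense inner matrix for brevity (the field names and simplified layout are assumptions; the real types are generic over sparse/dense inners):

use ndarray::{Array2, ArrayView2};

pub struct Kernel<F> {
    records: Array2<F>,
    inner: Array2<F>, // dense inner matrix only, for brevity
}

pub struct KernelView<'a, F> {
    records: ArrayView2<'a, F>,
    inner: ArrayView2<'a, F>,
}

impl<F> Kernel<F> {
    // Borrow both the records and the inner matrix without cloning anything.
    pub fn view(&self) -> KernelView<'_, F> {
        KernelView {
            records: self.records.view(),
            inner: self.inner.view(),
        }
    }
}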

At this point the problem is that I would need to be able to differentiate between an SVM that has a view of data living as long as itself and one that owns its data. I don't know if that can be done with just one type; I'm trying not to complicate the code too much.

Now I still need to modify linfa-svm to adjust it to these changes, but I plan to do that sometime in the next few weeks. One thing that bothers me: since KernelInner can contain either a view or owned data, I'm forced to specify a lifetime for the type even if I am not planning to use the view variant. For example, the type Kernel still needs a lifetime even if it can't have a view as an inner matrix, but I don't know if that's avoidable somehow.

@bytesnake (Member, Author) commented:

> I've been a little busy lately and I'm proceeding slowly so I thought I'd give a little "status-report" here, also to get a second opinion on what I've been doing.

I'm currently relocating to a different city and this completely slipped my radar, sorry for the late response.

> First part:
>
> * removed lifetime requirement on the training set for fittable models;
> * completed the implementation of `iter_fold` for fittable models;
> * tested the implementation on a mock struct implementing `Fit`;

👍

> Second part:
>
> Here things get a little less straightforward. I didn't want to force people using SVMs to clone their kernel every time, so I am trying to change SVMs to accept either an owned kernel or a "view" of one. To that end I changed linfa-kernel in this way:
>
> * Added two new enum variants to `KernelInner` to be able to take a view of sparse or dense kernel matrices;
> * Renamed `Kernel` to `KernelBase` in `linfa-kernel`;
> * Added types:
>   * `KernelArray`: uses `ArrayBase<D, Ix2>` as records
>   * `Kernel`: uses `Array2` as records
>   * `KernelView`: uses `ArrayView2` as records
> * Moved implementations relying only on the inner matrix to `KernelBase` and the ones relying also on the dataset to `KernelArray`;
> * Implemented `new` methods for `Kernel` and `KernelView` where the first one takes ownership of the input data and both use owned inner matrices;
> * Implemented `view` methods for `Kernel` and `KernelView` where the first one takes a view of the dataset and both take views of their inner matrices;
>
> This way, instead of taking a reference to the kernel, one can call kernel.view() and not worry about data being cloned.

This seems to be the most straightforward way to have a view of a kernel matrix.

> At this point the problem is that I would need to be able to differentiate between an SVM that has a view of data living as long as itself and one that owns its data. I don't know if that can be done with just one type; I'm trying not to complicate the code too much.

You are in a similar spot to the std::borrow::Cow implementation here, because you cannot suppress the lifetime in the type signature.

> Now I still need to modify linfa-svm to adjust it to these changes, but I plan to do that sometime in the next few weeks. One thing that bothers me: since KernelInner can contain either a view or owned data, I'm forced to specify a lifetime for the type even if I am not planning to use the view variant. For example, the type Kernel still needs a lifetime even if it can't have a view as an inner matrix, but I don't know if that's avoidable somehow.

No, you cannot avoid this without introducing SvmView and Svm, but then you have to decide what to return in SvmParams::fit.

I think it would be reasonable to copy only those sample points which are support vectors. The whole point of SVM is to find a few SVs which can describe the separating hyperplane, and we can easily copy those. (Btw. this is only true for the non-linear case; otherwise we do not need to cache the samples and have memory complexity O(n).) The following changes would be reasonable:

  • introduce a KernelInner::Dynamic: a placeholder where the distance function is calculated on the fly without caching
  • add a Kernel::select function: this takes the indices of the sub-samples and returns a copy of the kernel with owned records and an empty (i.e. KernelInner::Dynamic) kernel matrix
  • store only the smaller kernel of the support vectors in Svm and don't borrow the whole kernel beyond the fitting process
  • in Svm::predict, pass weighted_sum only the actual scaling factors of the SV solution (we could also think about not storing the kernel at all for the linear case, but then we would probably rename Kernel::select)

Hope this is plausible; I think we should not make the API more complex than it already is.
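A rough sketch of the proposed select under a simplified kernel layout (only a dense inner variant is shown; the struct shape and bounds are assumptions for illustration, not the actual linfa-kernel code):

use ndarray::{Array2, Axis};

pub enum KernelInner<F> {
    Dense(Array2<F>),
    // Placeholder: distances are computed on the fly, nothing is cached.
    Dynamic,
}

pub struct Kernel<F> {
    records: Array2<F>,
    inner: KernelInner<F>,
}

impl<F: Clone> Kernel<F> {
    // Copy only the rows at `indices` (e.g. the support vectors) and return a
    // kernel that owns them, with an empty (Dynamic) kernel matrix.
    pub fn select(&self, indices: &[usize]) -> Kernel<F> {
        Kernel {
            records: self.records.select(Axis(0), indices),
            inner: KernelInner::Dynamic,
        }
    }
}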

@Sauro98 (Member) commented Jan 20, 2021

> You are in a similar spot to the std::borrow::Cow implementation here, because you cannot suppress the lifetime in the type signature.

One way I've found is to define a trait Inner with all methods concerning only the inner matrices and implement it for Array2, ArrayView2, CsMat, and CsMatView. Then I define an enum:

pub enum KernelInnerBase<K1: Inner, K2: Inner> {
    Dense(K1),
    Sparse(K2),
}

Now the kernel definition becomes:

pub struct KernelBase<R: Records, K1: Inner, K2: Inner> {
    ...
    inner: KernelInnerBase<K1, K2>,
    ...
}

So I can define type Kernel<F> = KernelBase<Array2<F>, Array2<F>, CsMat<F>>. Then the naming gets a little confusing, because I think both KernelArrayView<'a, F> = KernelBase<ArrayView2<'a, F>, Array2<F>, CsMat<F>> and KernelView<'a, F> = KernelBase<ArrayView2<'a, F>, ArrayView2<'a, F>, CsMatView<'a, F>> would be useful.

With this kind of implementation, all methods like diag and such become just calls to the corresponding methods of either K1 or K2, depending on whether the inner matrix is sparse or not.
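For instance, a minimal compiling sketch of that dispatch, with the assumed Inner trait reduced to a single method and only the dense impl shown (the sparse impl would follow the same pattern):

use ndarray::{Array1, Array2};

pub trait Inner {
    type Elem;
    fn diagonal(&self) -> Array1<Self::Elem>;
}

impl<F: Clone> Inner for Array2<F> {
    type Elem = F;
    fn diagonal(&self) -> Array1<F> {
        self.diag().to_owned()
    }
}

pub enum KernelInnerBase<K1: Inner, K2: Inner> {
    Dense(K1),
    Sparse(K2),
}

impl<F, K1, K2> KernelInnerBase<K1, K2>
where
    K1: Inner<Elem = F>,
    K2: Inner<Elem = F>,
{
    // Each shared method simply forwards to the active variant.
    pub fn diagonal(&self) -> Array1<F> {
        match self {
            KernelInnerBase::Dense(k) => k.diagonal(),
            KernelInnerBase::Sparse(k) => k.diagonal(),
        }
    }
}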

I agree that the API should stay as simple as possible. Right now I'm deep into studying for and taking my finals, so I'll slowly get to the points you made about SVM in the next few weeks 👍🏻

Sauro98 mentioned this issue on Jan 27, 2021
@Sauro98 (Member) commented Feb 4, 2021

What if in SVMs the kernel was specified as a parameter? I mean something like this:

Svm::params().with_kernel(Kernel::params()).fit(&dataset)

That way fit would take the original dataset instead of the transformed kernel and inside fit one could do something like this:

let kernel_matrix = self.kernel.transform(dataset);
...
svm.kernel = kernel_matrix;
svm.dataset = dataset;
...

without the need to store the dataset inside the kernel. Then the Svm could be improved to pick the support vectors from the dataset and clone only them instead of the whole set.

@bytesnake (Member, Author) commented:

> What if in SVMs the kernel was specified as a parameter? I mean something like this:
>
>     Svm::params().with_kernel(Kernel::params()).fit(&dataset)

This should be called with_kernel_params, or take the actual kernel matrix; alternatively, we could add separate functions for the different kernel functions, like .with_gaussian(eps), etc.

> That way fit would take the original dataset instead of the transformed kernel and inside fit one could do something like this:
>
>     let kernel_matrix = self.kernel.transform(dataset);
>     ...
>     svm.kernel = kernel_matrix;
>     svm.dataset = dataset;
>     ...
>
> without the need to store the dataset inside the kernel. Then the Svm could be improved to pick the support vectors from the dataset and clone only them instead of the whole set.

Sounds like a good solution 👍
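A minimal sketch of how such a builder could look with the naming suggested here; all types and fields below are hypothetical, only meant to illustrate that the params object stores kernel parameters while the kernel matrix itself is computed inside fit:

// Hypothetical parameter types, for illustration only.
pub enum KernelParams {
    Linear,
    Gaussian { eps: f64 },
}

pub struct SvmParams {
    kernel: KernelParams,
}

pub struct Svm;

impl Svm {
    pub fn params() -> SvmParams {
        SvmParams { kernel: KernelParams::Linear }
    }
}

impl SvmParams {
    // Store the kernel *parameters*; the kernel matrix itself would only be
    // computed from the dataset inside `fit`.
    pub fn with_kernel_params(mut self, kernel: KernelParams) -> Self {
        self.kernel = kernel;
        self
    }

    // Convenience setter in the spirit of the `.with_gaussian(eps)` idea.
    pub fn with_gaussian(self, eps: f64) -> Self {
        self.with_kernel_params(KernelParams::Gaussian { eps })
    }
}

// usage: let params = Svm::params().with_gaussian(0.5);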

@bytesnake (Member, Author) commented:

In Elements of Statistical Learning the training vectors are sometimes called predictors; perhaps that's a better name than records.

@Sauro98 (Member) commented Feb 7, 2021

To me, calling dataset.records() sounds more straightforward than dataset.predictors(), just from what I would generally expect from a dataset-like structure, but since this structure is focused purely on ML, maybe it would make sense. I know that Understanding Machine Learning uses predictor to refer to the learned hypothesis and instances to refer to domain points, so I think that predictors could be misleading to people who learned ML through that book. Other names that I heard a lot in my uni courses are samples or examples, but I don't know how often either of those terms is actually used in the literature.

@bytesnake (Member, Author) commented:

Good point! ESL uses predictors as a synonym for training data in the Least Angle Regression section, and it should only be used for those features which are used in the prediction. If you have selected a subset to decrease the variance of your model, then only those are the predictors, as opposed to records. I don't like samples or examples because they can refer to the records as well as the targets. I think we should stick to records.

@Sauro98 (Member) commented Feb 7, 2021

I agree; even if it may not be the best possible name, I think it is general enough to at least avoid conflicts 👍🏻

@Sauro98 (Member) commented Mar 5, 2021

I've run into a problem when trying to implement the proposed changes to the regression metrics traits.
Since both SingleTargetRegression and MultiTargetRegression provide methods with the same name, as soon as I implement both SingleTargetRegression and MultiTargetRegression for a certain type I get errors, because multiple methods with the same name are available, even if they apply to different compared targets.
I don't know if there is a nice way to solve this; one solution is to keep defaulting to multi-target as happens now, or to make the API a little uglier and rename the methods to single_target_metric and multi_target_metric.

If I use associated types it gets even worse, because I can only provide one implementation (and so only one option for the associated type) for each AsTargets type, since I can't take the associated type as a template parameter in the trait definition.

I think the problem can't be overcome in Rust, since there is no concept of function overloading.

This is what I've tried so far (SomeType, BaseType, etc. are placeholders):

Case 1: no associated types

pub trait SingleTargetRegression<F, T>: AsTargets<F> {
    fn max_error(&self, compare_to: &T) -> ... { ... }
}

pub trait MultiTargetRegression<F, T>: AsTargets<F> {
    fn max_error(&self, compare_to: &T) -> ... { ... }
}

impl<F, T1> SingleTargetRegression<F, T1> for SomeType {}
impl<F, T2> MultiTargetRegression<F, T2> for SomeType {}

// error: multiple methods with the same name
some_value.max_error(..)

This means that I can't have Array2 use single-target metrics when compared to an Array1 and multi-target metrics when compared to an Array2.

Case 2: associated types

Defining the type in the trait template:

pub trait SingleTargetRegression<F, T>: AsTargets<F> {
    // error: associated type defaults are unstable
    type Tar = T;
    fn max_error(&self, compare_to: &T) -> ... { ... }
}

Defining the type in the impl template:

pub trait SingleTargetRegression<F>: AsTargets<F> {
    type Tar: AsTargets<F>;
    fn max_error(&self, compare_to: &Self::Tar) -> ... { ... }
}

// error: unconstrained type parameter
impl<F, T2> SingleTargetRegression<F> for SomeType {
    type Tar = T2;
}

Implementing twice:

pub trait SingleTargetRegression<F>: AsTargets<F> {
    type Tar: AsTargets<F>;
    fn max_error(&self, compare_to: &Self::Tar) -> ... { ... }
}

// error: conflicting implementations
impl<F> SingleTargetRegression<F> for SomeType {
    type Tar = BaseType;
}

impl<F> SingleTargetRegression<F> for SomeType {
    type Tar = BaseType2;
}

@bytesnake (Member, Author) commented:

Mh, I wasn't aware of the Array2/Array1 implementation, but yeah, I think it is not possible with the current trait system (though I'm absolutely not an expert there). The only workaround I see is to implement MultiTargetRegression like this:

impl<F: Float, D: Data<Elem = F>> MultiTargetRegression<F, ArrayBase<D, Ix1>> for ArrayBase<D, Ix2> {}

and hope that most people are using predict in combination with this trait.

@Sauro98 (Member) commented Mar 5, 2021

Yes, it is implemented like that, just allowing the two arrays to have different data; the only issue is that Array2 can only implement multi-target metrics.

(In this comment and in the ones before, I am using Array1/Array2 loosely to indicate both owned arrays and views.)

@bytesnake (Member, Author) commented Mar 9, 2021

A comment under the Reddit post mentioned polars, and I'm now thinking about adding a dataframe type:

pub type Dataframe<T> = DatasetBase<polars::DataFrame, polars::ChunkedArray<T>>

Note the missing element type on polars::DataFrame; this would allow us to represent generic records with mixed categorical/continuous features. For example, naive Bayes could implement Fit for a Dataframe as well and handle the categorical features with multinomial distributions.

@Sauro98 (Member) commented Mar 9, 2021

Wow, I didn't know linfa was being posted on Reddit 🚀
I had never thought of categorical features. We would still need to convert categorical features into continuous ones before being able to use them in most algorithms, right? Or would this be something to be used only in some specific algorithms, like the naive Bayes that you mentioned?

@bytesnake (Member, Author) commented:

The latter: for example, it makes no sense for ICA to support categorical features, but we could extend the scope of some algorithms with categorical types.

@Sauro98 (Member) commented Mar 10, 2021

I guess dataframes would also be useful for count vectorizers, right?

@bytesnake (Member, Author) commented:

You can write a transformer f(Array<String>) -> Array2 without dataframes, but they would make handling strings easier.

bytesnake added the enhancement label on Jul 22, 2021
@CGMossa commented Oct 14, 2022

I think .with_feature_names can be given a vector that is longer than the number of features.
