[MRG+1] Sparse Target Dummy Classifier #3438

Merged
merged 59 commits into scikit-learn:master from hamsal:sprs-out-dmy on Aug 17, 2014

5 participants

@hamsal
  • Test sparse with all four strategies
  • Fit with sparse target
  • Predict sparse target
  • Simplify (skip steps) when constant strategy is used
  • Clean up redundancies of indices, data, indptr in predict
  • Clean up conditional redundancies in fit
  • Random sampling function outputs sparse column
  • Test random sampling function with simulation
  • Uniques/class distribution function for multilabel-multioutput (share with sprs knn)
  • Always use dense predict with uniform case, and throw a sparse efficiency warning.
@vene vene and 1 other commented on an outdated diff Jul 19, 2014
sklearn/dummy.py
+ "shape (%d, 1)." % self.n_outputs_)
+
+ for k in xrange(self.n_outputs_):
+ classes, y_k = np.unique(y[:, k], return_inverse=True)
+ self.classes_.append(classes)
+ self.n_classes_.append(classes.shape[0])
+ self.class_prior_.append(np.bincount(y_k) / float(y_k.shape[0]))
+
+ # Checking in case of constant strategy if the constant provided
+ # by the user is in y.
+ if self.strategy == "constant":
+ if constant[k] not in self.classes_[k]:
+ raise ValueError("The constant target value must be "
+ "present in training data")
+ else:
+ y.eliminate_zeros()
@vene
vene Jul 19, 2014

You don't need this if strategy="constant". Maybe the order of the ifs can be changed and some things can be kept common for sparse and dense?

@hamsal
hamsal Jul 21, 2014

Note this has not yet been implemented; the change on this line is unrelated

@hamsal
hamsal Aug 1, 2014

The code here has been abstracted to a function sparse_class_distribution which the constant case uses

@hamsal
hamsal Aug 1, 2014

This is a good solution because it solves all sparse cases with the same reusable function.

@vene
scikit-learn member

What is the point in returning sparse output in predict?

@arjoly
scikit-learn member

What is the point in returning sparse output in predict?

Because if the output was actually sparse in the fit method, it probably means that you don't want a dense output, since it might blow up your memory.

@vene
scikit-learn member

Yeah, sorry, my question was stupid.

@arjoly
scikit-learn member

You are welcome :-)

@arjoly
scikit-learn member

There is code to share between the sparse and dense paths.

@arjoly arjoly and 1 other commented on an outdated diff Jul 20, 2014
sklearn/dummy.py
+ self.class_prior_.append(np.bincount(y_k) / float(y_k.shape[0]))
+
+ # Checking in case of constant strategy if the constant provided
+ # by the user is in y.
+ if self.strategy == "constant":
+ if constant[k] not in self.classes_[k]:
+ raise ValueError("The constant target value must be "
+ "present in training data")
+ else:
+ y.eliminate_zeros()
+ self.n_outputs_ = y.shape[1]
+ self.classes_ = []
+ self.n_classes_ = []
+ self.class_prior_ = []
+
+ for k in xrange(self.n_outputs_):
@arjoly
arjoly Jul 20, 2014

I would use range instead of xrange. In Python 3 the distinction doesn't exist (range is already lazy), so it doesn't matter.

@arjoly
arjoly Aug 4, 2014

I think that we could drop the xrange in favor of range (even on Python 2). Thus we could drop the import from .externals.six.moves import xrange.

@arjoly arjoly and 1 other commented on an outdated diff Jul 20, 2014
sklearn/dummy.py
+ # Checking in case of constant strategy if the constant provided
+ # by the user is in y.
+ if self.strategy == "constant":
+ if constant[k] not in self.classes_[k]:
+ raise ValueError("The constant target value must be "
+ "present in training data")
+ else:
+ y.eliminate_zeros()
+ self.n_outputs_ = y.shape[1]
+ self.classes_ = []
+ self.n_classes_ = []
+ self.class_prior_ = []
+
+ for k in xrange(self.n_outputs_):
+ y_col_k = y.getcol(k)
+ classes, y_k = np.unique(y_col_k.data, return_inverse=True)
@arjoly
arjoly Jul 20, 2014

It looks like this could be improved by working directly on y.data of the csr matrix. I would do the following (a rough sketch follows after the list):

  1. convert to csc
  2. eliminate zeros
  3. count the number of zeros of the column through indptr
  4. compute the unique set of non zero classes and get the inverse
  5. Add the zero class in classes if the column is not dense
  6. perform a bincount on the classes and use the minlength parameter
  7. compute the class_prior
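A rough, hedged sketch of those steps (illustrative names, no sample-weight handling, not the PR's final code):

import numpy as np
import scipy.sparse as sp


def sparse_class_distribution_sketch(y):
    y = sp.csc_matrix(y)               # step 1: convert to CSC
    y.eliminate_zeros()                # step 2: drop explicit zeros
    n_samples, n_outputs = y.shape
    classes, n_classes, class_prior = [], [], []
    for k in range(n_outputs):
        col = y.data[y.indptr[k]:y.indptr[k + 1]]
        n_zeros = n_samples - col.shape[0]                     # step 3: zeros via indptr
        classes_k, y_k = np.unique(col, return_inverse=True)   # step 4
        prior_k = np.bincount(y_k).astype(float)               # step 6 (no minlength needed here)
        if n_zeros > 0:                                        # step 5: add the zero class
            classes_k = np.insert(classes_k, 0, 0)
            prior_k = np.insert(prior_k, 0, n_zeros)
        classes.append(classes_k)
        n_classes.append(classes_k.shape[0])
        class_prior.append(prior_k / prior_k.sum())            # step 7
    return classes, n_classes, class_prior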
@hamsal
hamsal Jul 21, 2014

I have implemented these changes except for step 6; I did not use the minlength parameter. I think predict could work without the minlength, which could save some space, but I could be wrong. Will report when I implement sparse predict.

@arjoly
arjoly Jul 21, 2014

I would have used minlength to be sure to handle the 0 class properly. But you might have handled this case differently. :-)

@hamsal
hamsal Jul 21, 2014

Oh I see! I will take a look at that corner case
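As an aside, a minimal illustration (illustrative values, not the PR's code) of what the minlength parameter guarantees:

import numpy as np

# bincount sizes its output by the largest value it sees; minlength pads it
# so every class index up to 3 gets a slot even if it never appears.
y_k = np.array([0, 0, 2])
print(np.bincount(y_k))               # [2 0 1]
print(np.bincount(y_k, minlength=4))  # [2 0 1 0]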

@hamsal

I have updated fit to share some more code between the dense and sparse conditionals. Predict is partially implemented, but 'stratified' and 'uniform' need improvement, which I am working on.

@coveralls

Coverage Status

Coverage decreased (-0.04%) when pulling ff54b58 on hamsal:sprs-out-dmy into 7a7fca8 on scikit-learn:master.

@hamsal

The class prior calculation is incorrect for the sparse case; I am working on this now, and it will correct the stratified case.

@vene
scikit-learn member

I suggest first writing a failing test. It's scary to see travis green when something is incorrect :)

@hamsal

I agree, doing that now!

@coveralls

Coverage Status

Coverage increased (+0.12%) when pulling 7791b5f on hamsal:sprs-out-dmy into 7a7fca8 on scikit-learn:master.

@hamsal

The most recent test in the commit b3b40c9 serves to test the corner case with the class priors that was previously incorrect.

@coveralls

Coverage Status

Coverage increased (+0.12%) when pulling 1311a55 on hamsal:sprs-out-dmy into 7a7fca8 on scikit-learn:master.

@hamsal

@arjoly and @vene could you review the changes made to predict?

The "stratified" and "uniform" cases are failing to run in a reasonable time (or even close to the time it takes with a dense format). I must be doing something very costly in those cases.

Also, here is a memory benchmark for the "most_frequent" case, which seems to be performing well.
The data set is eurlex, with sparsity 0.01, and the script is here: https://gist.github.com/hamsal/438e05cea2ed2c3e1428

(Memory profile plots: Dense vs. Sparse)

In comparison with the dense case, the sparse case uses very little memory for fit and predict after the initial loading of the data set. The dense case reaches about 1250 MiB while the sparse case peaks at 325 MiB.

@coveralls

Coverage Status

Coverage increased (+0.12%) when pulling dc5c6cb on hamsal:sprs-out-dmy into 7a7fca8 on scikit-learn:master.

@arjoly
scikit-learn member

"uniform" cases are failing to run in a reasonable time

In the uniform case, this could be due to the sampling, which is indeed dense.

@arjoly arjoly and 1 other commented on an outdated diff Jul 23, 2014
sklearn/dummy.py
- raise ValueError("The constant target value must be "
- "present in training data")
+ if not self.sparse_target_input_:
+ for k in xrange(self.n_outputs_):
+ classes, y_k = np.unique(y[:, k], return_inverse=True)
+ self.classes_.append(classes)
+ self.n_classes_.append(classes.shape[0])
+ class_prior = np.bincount(y_k, weights=sample_weight)
+ self.class_prior_.append(class_prior / class_prior.sum())
+ else:
+ y = y.tocsc()
+ y.eliminate_zeros()
+ y_nnz = np.diff(y.indptr)
+
+ for k in range(self.n_outputs_):
+ nz_indices = y.indices[y.indptr[k]:y.indptr[k+1]]
@arjoly
arjoly Jul 23, 2014

col_n_nonzero/ col_nnz?

@hamsal
hamsal Aug 1, 2014

The code is now in the sparse_class_distribution function, and this is renamed to col_nonzero.

@arjoly
scikit-learn member

The scalability issue for the stratified version might come from the call to predict_proba.

@hamsal

I have gotten the uniform case to run much faster on the eurlex dataset by writing the sparse sampling function suggested by you @arjoly. Will update the PR and the other cases in the morning.

@arjoly
scikit-learn member

Great to know !!! Thanks @hamsal !

@hamsal

I am back to a slow implementation. It looks like I made a mistake; I thought the fast version was correct. I am still working on the correct sparse sampler.

@hamsal

@arjoly If you want to take a look at my implementation of the sparse sampling method, please find it in the utils/sparsefuncs file. It is designed to behave very similarly to numpy.random.choice, except it returns a sparse CSC column. I am still working on optimizing the uniform case. Right now it is just way too slow on a large number of classes.

Here is eurlex (17,000 samples with 500 classes):

dense_fit_time:  0.727050065994
dense_predict_time:  0.185846090317
sparse_fit_time:  0.0599489212036
sparse_predict_time:  2.65254998207
@coveralls

Coverage Status

Coverage increased (+0.13%) when pulling 8063084 on hamsal:sprs-out-dmy into 7a7fca8 on scikit-learn:master.

@coveralls

Coverage Status

Coverage increased (+0.18%) when pulling 2754829 on hamsal:sprs-out-dmy into 7a7fca8 on scikit-learn:master.

@arjoly arjoly commented on an outdated diff Jul 24, 2014
sklearn/utils/sparsefuncs.py
@@ -278,3 +279,37 @@ def min_max_axis(X, axis):
return sparse_min_max(X, axis=axis)
else:
_raise_typeerror(X)
+
+
+def random_choice_csc(a, size, p=None):
@arjoly
arjoly Jul 24, 2014

What do you think about creating the entire matrix directly? I was thinking of the following interface:

def random_choice_csc(n_samples, classes, class_probability):
    """Generate a sparse random matrix given a class distribution

    Parameters
    ----------
    n_samples : int,
        Number of samples to draw

    classes : list of size n_outputs of array of size (n_classes, )
        List of classes for each column.

    class_probability : list of size n_outputs of array of size (n_classes,)
        Class distribution of each column.

    Return
    ------
    random_matrix : sparse csc matrix of size (n_samples, n_outputs)

    """    
@arjoly
arjoly Jul 24, 2014

Maybe instead of classes, you could put n_classes: a numpy array of size (n_outputs,), the number of classes of each output.

@arjoly arjoly commented on an outdated diff Jul 24, 2014
sklearn/utils/sparsefuncs.py
@@ -278,3 +279,37 @@ def min_max_axis(X, axis):
return sparse_min_max(X, axis=axis)
else:
_raise_typeerror(X)
+
+
+def random_choice_csc(a, size, p=None):
+ # XXX Corner case: size 0
+
+ if isinstance(a, int):
+ a = np.arange(a)
+ a = np.array(a, np.int)
+
+ if p is None:
+ p = np.empty(shape=a.shape[0])
+ p.fill(1.0/a.shape[0])
+ p = np.array(p)
@arjoly
arjoly Jul 24, 2014

Looks like complicated logic. I wouldn't bother with such a default at the beginning. :-)

@arjoly arjoly and 1 other commented on an outdated diff Jul 24, 2014
sklearn/utils/sparsefuncs.py
+ p = np.empty(shape=a.shape[0])
+ p.fill(1.0/a.shape[0])
+ p = np.array(p)
+
+ # XXX Validate: p and a have the same length
+ # XXX Warning: Sparse efficieny if 0 is not in a, or p for 0 is small
+
+ if 0 not in a:
+ a = np.insert(a, 0, 0)
+ p = np.insert(p, 0, 0.0)
+
+ if 0 in a and a.shape[0] is 1:
+ return sp.csc_matrix((size, 1))
+
+ nnz = size - int(size * p[np.where(a == 0)[0][0]]) # XXX maybe round
+ indices = choice(a=range(size), size=nnz, replace=False)
@arjoly
arjoly Jul 24, 2014

Instead of choice, it might be faster to use np.searchsorted. Something like:

cumulative_p_c = np.cumsum(p_c)
c = np.searchsorted(cumulative_p_c, generator.rand(n_samples))

as in make_multilabel_classification.

@hamsal
hamsal Jul 29, 2014

I believe I understand how this works, but I don't think it allows for random sampling with no replacement.

@arjoly
scikit-learn member

Have you benchmarked the implementation of random_choice_csc? There might be some sweet spots to optimize. :-)

@coveralls

Coverage Status

Coverage increased (+0.19%) when pulling 3c53404 on hamsal:sprs-out-dmy into 7a7fca8 on scikit-learn:master.

@hamsal

Here is a small benchmark:

from sklearn.utils.sparsefuncs import random_choice_csc
from sklearn.utils.random import choice
import numpy as np
import time

nruns = 50


def bench():
    start = time.time()
    for i in range(nruns):
        choice(range(5), size=200000,
               p=[0.99, 0.0025, 0.0025, 0.0025, 0.0025])
    print("choice: ", time.time() - start)

    start = time.time()
    for i in range(nruns):
        random_choice_csc(range(5), size=200000,
                          p=[0.99, 0.0025, 0.0025, 0.0025, 0.0025])
    print("random_choice_csc: ", time.time() - start)

bench()

Output:

('choice: ', 0.41407179832458496)
('random_choice_csc: ', 2.8454980850219727)

I ran both functions 50 times and printed the total time; the data to be generated has sparsity ~0.1.

This custom function is much slower; the next thing to try will be full matrix generation instead of column-wise generation.

@hamsal hamsal closed this Jul 29, 2014
@hamsal hamsal reopened this Jul 29, 2014
@hamsal

I have found out that all of this slowness is from the call to choice with the replace=False parameter. If I use replace=True, random_choice_csc outperforms the choice function as intended.

EDIT: of course, replace=True does not give the behavior needed

@arjoly
scikit-learn member

Why would it be without replacement (replace=False)?

@hamsal

The call with replace=False is used for choosing nnz distinct indices.

@arjoly
scikit-learn member

We have an efficient sample_without_replacement function in sklearn.utils.random. Could it solve your problem?
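For reference, a small usage sketch (assumed sizes) of how sample_without_replacement can draw the nnz distinct row indices for a sparse column:

from sklearn.utils.random import sample_without_replacement

n_samples, nnz = 200000, 2000
# nnz distinct integers from range(n_samples): the rows that will hold
# nonzero entries in the generated column
row_indices = sample_without_replacement(n_samples, nnz, random_state=0)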

@coveralls

Coverage Status

Coverage increased (+0.13%) when pulling fa4947a on hamsal:sprs-out-dmy into 7a7fca8 on scikit-learn:master.

@hamsal

The sample_without_replacement function seems to have fixed the slowness.

Here are the results of the benchmark above:

('choice: ', 0.40119290351867676)
('random_choice_csc: ', 0.03993511199951172)

Here are the memory profiler plots of the stratified case on the eurlex data set:

(Memory profile plots: Stratified - Dense vs. Stratified - Sparse)

You can see the time difference on the right.

@arjoly
scikit-learn member

Great news :-)

@coveralls

Coverage Status

Coverage increased (+0.13%) when pulling 438ecc8 on hamsal:sprs-out-dmy into 7a7fca8 on scikit-learn:master.

hamsal added some commits Jul 18, 2014
@hamsal hamsal Test sparse target data with dummy classifier bdd5cb1
@hamsal hamsal Fit dummy classifier with sparse target data c04ba19
@hamsal hamsal Op directly on y.data in fit, xrange -> range, y.tocsc f7efdc9
@hamsal hamsal Share array initialization between sparse and dense fit
The arrays n_outputs_, classes_, n_classes_, class_prior_
73f2f0c
@hamsal hamsal Reorder fit and share dense/sparse, scaffold sparse pred
Put the code for managing the constant case in fit
outside of the sparse conditional so it is shared
by both branches of execution.

Scaffold the four cases for sparse prediction,
use a new class attribute sparse_target_input_
to carry over issparse from fit. Cases for
stratified and uniform need improvement.
c0981d2
@hamsal hamsal Remove temporary code from test_sparse_target_data
The code was used to test a classifier with an incomplete
 predict function.
4052878
@hamsal hamsal remove zero from random uniform return data 388ac61
@hamsal hamsal Emulate sample weight support from dense (Rebase) a120827
@hamsal hamsal Add four tests for sparse target dummy, one for each strategy
The tests are
test_constant_strategy_sparse_target,
test_uniform_strategy_sparse_target,
test_stratified_strategy_sparse_target,
test_most_frequent_strategy_sparse_target
0a381ed
@hamsal hamsal Correct the class priors computation in the sparse target case
Consider the weights of the zero elements and handle the insertion
of the count of zeros in the bincount manually
7d9114c
@hamsal hamsal Remove redundant indices, data, indptr appends in predict
Take them out of the four branches and do them once at the end of
all of the conditionals
8c3a986
@hamsal hamsal Test sparse stratified corner case with 0s
This activates the code for the manual handling of class prior
of zero elements
ff12d77
@hamsal hamsal Combine dense column check in fit 69adaa4
@hamsal hamsal Update y with y.tocsc (Not an inplace op) 5c5ef1a
@hamsal hamsal Select nonzero elements in predict stratified case 05a8dd7
@hamsal hamsal Implement uniform and stratified w/ sparse random choices, optimize concats

-Write a sparse version of numpy.random.choice in utils/sparsefuncs
-Remove concats from inner loop of predict and use only one final concat call
3d692a9
@hamsal hamsal np.random.choice -> choice (from utils/random) fd58c28
@hamsal hamsal Use array for sparse matrix construction, fix constant check
-Replace the numpy array construction of data, indices, and indptr
with regular arrays.
-Loop over all output columns when checking for the presence of
the user-provided constant.
72885ef
@hamsal hamsal Replace np.random.choice with utils.random.sample_without_replacement f96a6aa
@hamsal hamsal Raise warning uniform fit sparse target, Raise error uniform predict sparse target
d0f52b6
@hamsal hamsal Support matrix output random_choice_csc, Update usage in dummy predict
Use this general function for each case in the sparse predict
of the dummy classifier
91a5b92
@hamsal hamsal Validate lengths equal each element of classes and class_probabilities 6e9d166
@hamsal hamsal Test random_choice_csc 319d96e
@hamsal hamsal Test ValueError for length mismatch between array in classes, and probs 292590c
@coveralls

Coverage Status

Coverage increased (+0.04%) when pulling 292590c on hamsal:sprs-out-dmy into 195dd26 on scikit-learn:master.

@coveralls

Coverage Status

Coverage increased (+0.05%) when pulling 292590c on hamsal:sprs-out-dmy into 195dd26 on scikit-learn:master.

@coveralls

Coverage Status

Coverage increased (+0.05%) when pulling 65f4f95 on hamsal:sprs-out-dmy into 195dd26 on scikit-learn:master.

hamsal added some commits Aug 1, 2014
@hamsal hamsal Test sparse_class_distribution, Correct data indexing
Write a new test and update sparse_class_distribution for correctness,
the indexing of y.data in sparse_class_distribution is now done correctly
e1bbd9a
@hamsal hamsal Move array initializations into dense conditional, nz_indices -> col_nonzero
44499fa
@hamsal

This is ready for merge. @vene and @arjoly could you please review the changes? Most recently I have made two large additions to utils/sparsefuncs.py: the sparse_class_distribution function and the random_choice_csc function. These additions have simplified the code for dummy fit and predict. Thanks to @arjoly for the suggestion to write these functions.

@coveralls

Coverage Status

Coverage increased (+0.06%) when pulling 44499fa on hamsal:sprs-out-dmy into 195dd26 on scikit-learn:master.

@hamsal hamsal changed the title from [WIP] Sparse Target Dummy Classifier to [MRG] Sparse Target Dummy Classifier Aug 1, 2014
@arjoly arjoly and 1 other commented on an outdated diff Aug 4, 2014
sklearn/dummy.py
if constant.shape[0] != self.n_outputs_:
raise ValueError("Constant target value should have "
"shape (%d, 1)." % self.n_outputs_)
- for k in xrange(self.n_outputs_):
- classes, y_k = np.unique(y[:, k], return_inverse=True)
- self.classes_.append(classes)
- self.n_classes_.append(classes.shape[0])
- class_prior = np.bincount(y_k, weights=sample_weight)
- self.class_prior_.append(class_prior / class_prior.sum())
-
- # Checking in case of constant strategy if the constant provided
- # by the user is in y.
+ if not self.sparse_target_input_:
@arjoly
arjoly Aug 4, 2014

This attribute is not documented.

By the way, I would rename sparse_target_input_ to sparse_output_. This would be consistent with LabelBinarizer.

@arjoly
arjoly Aug 4, 2014

It's easier to read when you put the positive case first, then the negative. This would be transformed to:


if self.sparse_output:
    ...
else:
    ...

@hamsal
hamsal Aug 4, 2014

The attribute is now named sparse_output_ and is documented. The positive case has been moved to come first.

@arjoly arjoly and 1 other commented on an outdated diff Aug 4, 2014
sklearn/dummy.py
@@ -90,36 +95,55 @@ def fit(self, X, y, sample_weight=None):
"constant"):
raise ValueError("Unknown strategy type.")
- y = np.atleast_1d(y)
- self.output_2d_ = y.ndim == 2
+ if self.strategy == "uniform" and sp.issparse(y):
+ y = y.toarray()
+ warnings.warn('prediciting on sparse target data with the uniform '
+ 'strategy would not save memory and would be '
+ 'slower. The target data has been densified',
@arjoly
arjoly Aug 4, 2014

The target data has been densified' => The target data has been converted to a numpy array.

@arjoly arjoly and 2 others commented on an outdated diff Aug 4, 2014
sklearn/dummy.py
if self.strategy == "constant":
if self.constant is None:
- raise ValueError("Constant target value has to be specified "
- "when the constant strategy is used.")
+ raise ValueError("Constant target value has to be"
+ "specified when the constant strategy is"
+ " used.")
@arjoly
arjoly Aug 4, 2014

Why has it changed?

@vene
vene Aug 4, 2014

Also, now there's no space between "be" and "specified"

@hamsal
hamsal Aug 4, 2014

This has been changed back

@arjoly arjoly and 1 other commented on an outdated diff Aug 4, 2014
sklearn/dummy.py
else:
- constant = np.reshape(np.atleast_1d(self.constant), (-1, 1))
+ constant = np.reshape(np.atleast_1d(self.constant),
+ (-1, 1))
@arjoly
arjoly Aug 4, 2014

Same here, what has changed?

@hamsal
hamsal Aug 4, 2014

This has been changed back

@arjoly arjoly and 1 other commented on an outdated diff Aug 4, 2014
sklearn/dummy.py
- # by the user is in y.
+ if not self.sparse_target_input_:
+ self.classes_ = []
+ self.n_classes_ = []
+ self.class_prior_ = []
+
+ for k in xrange(self.n_outputs_):
+ classes, y_k = np.unique(y[:, k], return_inverse=True)
+ self.classes_.append(classes)
+ self.n_classes_.append(classes.shape[0])
+ class_prior = np.bincount(y_k, weights=sample_weight)
+ self.class_prior_.append(class_prior / class_prior.sum())
+ else:
+ (self.classes_,
+ self.n_classes_,
+ self.class_prior_) = sparse_class_distribution(y, sample_weight)
@arjoly
arjoly Aug 4, 2014

I have the feeling that we could have only one function that handles both the dense and sparse cases. Do you see a reason to keep both paths separate in the code?

@hamsal
hamsal Aug 5, 2014

Fixed. The function is now class_distribution and handles both the dense and sparse cases.

@arjoly arjoly and 1 other commented on an outdated diff Aug 4, 2014
sklearn/dummy.py
@@ -172,26 +196,45 @@ def predict(self, X):
if self.n_outputs_ == 1:
proba = [proba]
- y = []
- for k in xrange(self.n_outputs_):
+ if not self.sparse_target_input_:
@arjoly
arjoly Aug 4, 2014

Same here. It's easier to read when you put the positive case first, then the negative. This would be transformed to:


if self.sparse_output:
    ...
else:
    ...

@arjoly arjoly and 1 other commented on an outdated diff Aug 4, 2014
sklearn/dummy.py
@@ -172,26 +196,45 @@ def predict(self, X):
if self.n_outputs_ == 1:
proba = [proba]
- y = []
- for k in xrange(self.n_outputs_):
+ if not self.sparse_target_input_:
+ y = []
+ for k in xrange(self.n_outputs_):
+ if self.strategy == "most_frequent":
+ ret = (np.ones(n_samples, dtype=int) *
+ class_prior_[k].argmax())
@arjoly
arjoly Aug 4, 2014

Here, I would use np.tile

@arjoly
arjoly Aug 4, 2014

This could be done for all outputs using

np.tile([class_prior_[k].argmax() for k in range(self.n_outputs)], (n_samples, 1))
@arjoly
arjoly Aug 4, 2014

Making it simpler by predicting all the outputs at once could be done in a separate pull request if you want.

An even faster strategy would be to do:

ret = np.empty(n_samples, dtype=int)
ret.fill(class_prior_[k].argmax())
@hamsal
hamsal Aug 5, 2014

I removed the outer for loop and used these suggestions to write each case with list comprehensions.

@arjoly arjoly commented on an outdated diff Aug 4, 2014
sklearn/dummy.py
+ if not self.sparse_target_input_:
+ y = []
+ for k in xrange(self.n_outputs_):
+ if self.strategy == "most_frequent":
+ ret = (np.ones(n_samples, dtype=int) *
+ class_prior_[k].argmax())
+
+ elif self.strategy == "stratified":
+ ret = proba[k].argmax(axis=1)
+
+ elif self.strategy == "uniform":
+ ret = rs.randint(n_classes_[k], size=n_samples)
+
+ elif self.strategy == "constant":
+ ret = (np.ones(n_samples, dtype=int) *
+ np.where(classes_[k] == constant[k]))
@arjoly
arjoly Aug 4, 2014

Here I would also use np.tile.

@arjoly
arjoly Aug 4, 2014

This looks like the constant case could be obtained easily through np.tile.

np.tile(self.constant, (n_samples, 1))
@arjoly
scikit-learn member

Can you add a test with a truly multioutput-multiclass task (instead of a multioutput-binary / multilabel task)?

@arjoly arjoly and 1 other commented on an outdated diff Aug 4, 2014
sklearn/utils/sparsefuncs.py
+ # Normalize probabilites for the nonzero elements
+ class_probability_nz = class_prob_j[classes[j] != 0]
+ class_probability_nz_norm = (class_probability_nz /
+ np.sum(class_probability_nz))
+ data.extend(choice(classes[j][classes[j] != 0],
+ size=nnz,
+ p=class_probability_nz_norm,
+ replace=True))
+ indptr.append(len(indices))
+
+ return sp.csc_matrix((data, indices, indptr),
+ (n_samples, len(classes)),
+ dtype=int)
+
+
+def sparse_class_distribution(y, sample_weight=None):
@arjoly
arjoly Aug 4, 2014

I would put that function in sklearn.utils.multiclass.

@hamsal
hamsal Aug 4, 2014

Moved to sklearn.utils.multiclass

@arjoly arjoly and 1 other commented on an outdated diff Aug 4, 2014
sklearn/utils/sparsefuncs.py
+ Return
+ ------
+ random_matrix : sparse csc matrix of size (n_samples, n_outputs)
+
+ """
+ data = array.array('i')
+ indices = array.array('i')
+ indptr = array.array('i', [0])
+
+ for j in range(len(classes)):
+ classes[j] = classes[j].astype(int)
+
+ # use uniform distribution if no class_probability is given
+ if class_probability is None:
+ class_prob_j = np.empty(shape=classes[j].shape[0])
+ class_prob_j.fill(1.0/classes[j].shape[0])
@arjoly
arjoly Aug 4, 2014

pep8: class_prob_j.fill(1.0 / classes[j].shape[0])

@arjoly
arjoly Aug 4, 2014

You could add from __future__ import division to get rid of the .0
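As a quick aside, a minimal example of what that import changes:

from __future__ import division

# On Python 2, 1 / 4 would otherwise be integer division and give 0; with the
# future import (and on Python 3) it gives the real-valued result.
print(1 / 4)  # 0.25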

@arjoly arjoly and 1 other commented on an outdated diff Aug 4, 2014
sklearn/utils/sparsefuncs.py
+ classes[j] = classes[j].astype(int)
+
+ # use uniform distribution if no class_probability is given
+ if class_probability is None:
+ class_prob_j = np.empty(shape=classes[j].shape[0])
+ class_prob_j.fill(1.0/classes[j].shape[0])
+ else:
+ class_prob_j = class_probability[j]
+ class_prob_j = np.asarray(class_prob_j)
+
+ if class_prob_j.shape[0] != classes[j].shape[0]:
+ raise ValueError("classes[{0}] (length {1}) and "
+ "class_probability[{0}] (length {2}) have "
+ "different length.".format(j,
+ classes[j].shape[0],
+ class_prob_j.shape[0]))
@arjoly
arjoly Aug 4, 2014

This kind of error is good, and even better if raised early (before any heavy computation).

@hamsal
hamsal Aug 5, 2014

This depends on the statement before it. However, the computations following it are, as you say, heavy and will delay the next iteration of this error check. It could be moved into its own for loop if necessary.

@arjoly arjoly and 1 other commented on an outdated diff Aug 4, 2014
sklearn/utils/sparsefuncs.py
+ class_prob_j = np.asarray(class_prob_j)
+
+ if class_prob_j.shape[0] != classes[j].shape[0]:
+ raise ValueError("classes[{0}] (length {1}) and "
+ "class_probability[{0}] (length {2}) have "
+ "different length.".format(j,
+ classes[j].shape[0],
+ class_prob_j.shape[0]))
+
+ # If 0 is not present in the classes insert it with a probability 0.0
+ if 0 not in classes[j]:
+ classes[j] = np.insert(classes[j], 0, 0)
+ class_prob_j = np.insert(class_prob_j, 0, 0.0)
+
+ # If there are nonzero classes choose randomly using class_probability
+ if 0 not in classes[j] or classes[j].shape[0] > 1:
@arjoly
arjoly Aug 4, 2014

It seems that the condition 0 not in classes[j] is never triggered due to the previous lines.

@hamsal
hamsal Aug 4, 2014

removed the condition here

@arjoly arjoly and 1 other commented on an outdated diff Aug 4, 2014
sklearn/utils/sparsefuncs.py
+ classes[j].shape[0],
+ class_prob_j.shape[0]))
+
+ # If 0 is not present in the classes insert it with a probability 0.0
+ if 0 not in classes[j]:
+ classes[j] = np.insert(classes[j], 0, 0)
+ class_prob_j = np.insert(class_prob_j, 0, 0.0)
+
+ # If there are nonzero classes choose randomly using class_probability
+ if 0 not in classes[j] or classes[j].shape[0] > 1:
+ nnz = int(n_samples -
+ n_samples *
+ class_prob_j[np.where(classes[j] == 0)[0][0]])
+ indices.extend(sample_without_replacement(n_samples,
+ nnz,
+ "auto",
@arjoly
arjoly Aug 4, 2014

I wouldn't pass the auto argument if you use the default.

@arjoly arjoly and 1 other commented on an outdated diff Aug 4, 2014
sklearn/utils/sparsefuncs.py
+ raise ValueError("classes[{0}] (length {1}) and "
+ "class_probability[{0}] (length {2}) have "
+ "different length.".format(j,
+ classes[j].shape[0],
+ class_prob_j.shape[0]))
+
+ # If 0 is not present in the classes insert it with a probability 0.0
+ if 0 not in classes[j]:
+ classes[j] = np.insert(classes[j], 0, 0)
+ class_prob_j = np.insert(class_prob_j, 0, 0.0)
+
+ # If there are nonzero classes choose randomly using class_probability
+ if 0 not in classes[j] or classes[j].shape[0] > 1:
+ nnz = int(n_samples -
+ n_samples *
+ class_prob_j[np.where(classes[j] == 0)[0][0]])
@arjoly
arjoly Aug 4, 2014

I would break this line into two:

p_nonzero = 1 - class_prob_j[classes == 0]
n_non_zeros = int(n_samples * p_nonzero)

(The where statement doesn't seem to be necessary.)
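For reference, a small self-contained illustration (hypothetical values) of the boolean mask replacing the np.where lookup; note that the mask has to come from the column's own class array:

import numpy as np

classes_j = np.array([0, 1, 2])
class_prob_j = np.array([0.7, 0.2, 0.1])
p_zero_where = class_prob_j[np.where(classes_j == 0)[0][0]]  # 0.7
p_zero_mask = class_prob_j[classes_j == 0][0]                # 0.7, same result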

@arjoly arjoly and 1 other commented on an outdated diff Aug 4, 2014
sklearn/utils/sparsefuncs.py
+
+ # If there are nonzero classes choose randomly using class_probability
+ if 0 not in classes[j] or classes[j].shape[0] > 1:
+ nnz = int(n_samples -
+ n_samples *
+ class_prob_j[np.where(classes[j] == 0)[0][0]])
+ indices.extend(sample_without_replacement(n_samples,
+ nnz,
+ "auto",
+ random_state))
+
+ # Normalize probabilites for the nonzero elements
+ class_probability_nz = class_prob_j[classes[j] != 0]
+ class_probability_nz_norm = (class_probability_nz /
+ np.sum(class_probability_nz))
+ data.extend(choice(classes[j][classes[j] != 0],
@arjoly
arjoly Aug 4, 2014

This statement, classes[j] != 0, could be factored out.

@hamsal
hamsal Aug 4, 2014

Factored out classes_j_nonzero

@arjoly arjoly and 1 other commented on an outdated diff Aug 4, 2014
sklearn/utils/sparsefuncs.py
+ nnz = int(n_samples -
+ n_samples *
+ class_prob_j[np.where(classes[j] == 0)[0][0]])
+ indices.extend(sample_without_replacement(n_samples,
+ nnz,
+ "auto",
+ random_state))
+
+ # Normalize probabilites for the nonzero elements
+ class_probability_nz = class_prob_j[classes[j] != 0]
+ class_probability_nz_norm = (class_probability_nz /
+ np.sum(class_probability_nz))
+ data.extend(choice(classes[j][classes[j] != 0],
+ size=nnz,
+ p=class_probability_nz_norm,
+ replace=True))
@arjoly
arjoly Aug 4, 2014

replace=True is already the default of np.random.choice.

@arjoly arjoly and 1 other commented on an outdated diff Aug 4, 2014
sklearn/utils/sparsefuncs.py
+ "auto",
+ random_state))
+
+ # Normalize probabilites for the nonzero elements
+ class_probability_nz = class_prob_j[classes[j] != 0]
+ class_probability_nz_norm = (class_probability_nz /
+ np.sum(class_probability_nz))
+ data.extend(choice(classes[j][classes[j] != 0],
+ size=nnz,
+ p=class_probability_nz_norm,
+ replace=True))
+ indptr.append(len(indices))
+
+ return sp.csc_matrix((data, indices, indptr),
+ (n_samples, len(classes)),
+ dtype=int)
@arjoly
arjoly Aug 4, 2014

If you work only with integers, I would replace the choice with a searchsorted approach (which is normally faster).
You would do something like

In [20]: p = np.array([0.1, 0.2, 0.7])

In [21]: cum_p = p.cumsum()

In [22]: classes = np.searchsorted(cum_p, np.random.rand(1000))

In [23]: np.bincount(classes) / 1000.
Out[23]: array([ 0.103,  0.183,  0.714])
@arjoly
arjoly Aug 4, 2014

Here is a small comparison:

In [29]: mychoice??
Type:       function
String Form:<function mychoice at 0x10ca75500>
File:       /Users/ajoly/git/scikit-learn/<ipython-input-25-8eca022b55a1>
Definition: mychoice(p, n_samples)
Source:
def mychoice(p, n_samples):
    return np.searchsorted(np.array(p).cumsum(), np.random.rand(n_samples))

In [27]: timeit mychoice([0.1, 0.2, 0.7], 1000)
10000 loops, best of 3: 36.5 us per loop

In [28]: timeit np.random.choice([0, 1, 2], p=[0.1, 0.2, 0.7], size=1000)
10000 loops, best of 3: 128 us per loop
@arjoly
arjoly Aug 4, 2014

If the dtype is not int only, this might still be faster.

@hamsal
hamsal Aug 5, 2014

Choice has been replaced with the searchsorted approach

@arjoly arjoly and 1 other commented on an outdated diff Aug 4, 2014
sklearn/utils/sparsefuncs.py
+ class_probability_nz = class_prob_j[classes[j] != 0]
+ class_probability_nz_norm = (class_probability_nz /
+ np.sum(class_probability_nz))
+ data.extend(choice(classes[j][classes[j] != 0],
+ size=nnz,
+ p=class_probability_nz_norm,
+ replace=True))
+ indptr.append(len(indices))
+
+ return sp.csc_matrix((data, indices, indptr),
+ (n_samples, len(classes)),
+ dtype=int)
+
+
+def sparse_class_distribution(y, sample_weight=None):
+ """Generate class priors from multiclass-multioutput target data
@arjoly
arjoly Aug 4, 2014

Compute class priors from sparse multioutput-multiclass target data

@arjoly arjoly and 1 other commented on an outdated diff Aug 4, 2014
sklearn/utils/sparsefuncs.py
+
+ y = y.tocsc()
+ y.eliminate_zeros()
+ y_nnz = np.diff(y.indptr)
+
+ n_samples, n_outputs = y.shape
+
+ for k in range(n_outputs):
+ col_nonzero = y.indices[y.indptr[k]:y.indptr[k+1]]
+ # separate sample weights for zero and non-zero elements
+ nz_sample_weight = (None if sample_weight is None else
+ sample_weight[col_nonzero])
+ z_sample_weight_sum = (y.shape[0] - y_nnz[k] if
+ sample_weight is None else
+ np.sum(sample_weight) -
+ np.sum(nz_sample_weight))
@arjoly
arjoly Aug 4, 2014

Can you break these lines into several? (expand the if, ...)

@arjoly arjoly and 1 other commented on an outdated diff Aug 4, 2014
sklearn/utils/sparsefuncs.py
+ classes_ = []
+ n_classes = []
+ class_prior_ = []
+
+ y = y.tocsc()
+ y.eliminate_zeros()
+ y_nnz = np.diff(y.indptr)
+
+ n_samples, n_outputs = y.shape
+
+ for k in range(n_outputs):
+ col_nonzero = y.indices[y.indptr[k]:y.indptr[k+1]]
+ # separate sample weights for zero and non-zero elements
+ nz_sample_weight = (None if sample_weight is None else
+ sample_weight[col_nonzero])
+ z_sample_weight_sum = (y.shape[0] - y_nnz[k] if
@arjoly
arjoly Aug 4, 2014

I find that the z prefix is not very intuitive.

@hamsal
hamsal Aug 4, 2014

Renamed this zeros_samp_weight_sum

@arjoly arjoly and 1 other commented on an outdated diff Aug 4, 2014
sklearn/utils/sparsefuncs.py
+
+ Return
+ ------
+ classes : list of size n_outputs of arrays of size (n_classes, )
+ List of classes for each column.
+
+ n_classes : list of integrs of size n_outputs
+ Number of classes in each column
+
+ class_prior : list of size n_outputs of arrays of size (n_classes, )
+ Class distribution of each column.
+
+ """
+ classes_ = []
+ n_classes = []
+ class_prior_ = []
@arjoly
arjoly Aug 4, 2014

Could you remove the last _ in classes and class_prior?

@arjoly arjoly and 1 other commented on an outdated diff Aug 4, 2014
sklearn/utils/sparsefuncs.py
+ classes : list of size n_outputs of arrays of size (n_classes, )
+ List of classes for each column.
+
+ n_classes : list of integrs of size n_outputs
+ Number of classes in each column
+
+ class_prior : list of size n_outputs of arrays of size (n_classes, )
+ Class distribution of each column.
+
+ """
+ classes_ = []
+ n_classes = []
+ class_prior_ = []
+
+ y = y.tocsc()
+ y.eliminate_zeros()
@arjoly
arjoly Aug 4, 2014

Can you add a comment about this?

@arjoly arjoly and 1 other commented on an outdated diff Aug 4, 2014
sklearn/utils/tests/test_sparsefuncs.py
@@ -289,3 +293,55 @@ def test_min_max_axis_errors():
assert_raises(TypeError, min_max_axis, X_csr.tolil(), axis=0)
assert_raises(ValueError, min_max_axis, X_csr, axis=2)
assert_raises(ValueError, min_max_axis, X_csc, axis=-3)
+
+
+def test_random_choice_csc():
+ n_samples = 10000
+ classes = [np.array([[0, 1]]).T, np.array([[0, 1, 2]]).T]
+ class_probabilites = [np.array([[0.5, 0.5]]).T,
+ np.array([[0.6, 0.1, 0.3]]).T]
@arjoly
arjoly Aug 4, 2014

Why do we need the transpose?

@arjoly arjoly and 1 other commented on an outdated diff Aug 4, 2014
sklearn/utils/sparsefuncs.py
@@ -278,3 +283,138 @@ def min_max_axis(X, axis):
return sparse_min_max(X, axis=axis)
else:
_raise_typeerror(X)
+
+
+def random_choice_csc(n_samples, classes, class_probability=None,
@arjoly
arjoly Aug 4, 2014

I have the feeling that this function would be better in sklearn.utils.random.

@hamsal
hamsal Aug 4, 2014

This is now in sklearn.utils.random

@arjoly arjoly and 1 other commented on an outdated diff Aug 4, 2014
sklearn/utils/sparsefuncs.py
+ by `np.random`.
+
+ Return
+ ------
+ random_matrix : sparse csc matrix of size (n_samples, n_outputs)
+
+ """
+ data = array.array('i')
+ indices = array.array('i')
+ indptr = array.array('i', [0])
+
+ for j in range(len(classes)):
+ classes[j] = classes[j].astype(int)
+
+ # use uniform distribution if no class_probability is given
+ if class_probability is None:
@arjoly
arjoly Aug 4, 2014

Could you add a test case for this?

@hamsal
hamsal Aug 5, 2014

The test has been expanded to add this case

@vene vene and 1 other commented on an outdated diff Aug 4, 2014
sklearn/dummy.py
@@ -90,36 +95,55 @@ def fit(self, X, y, sample_weight=None):
"constant"):
raise ValueError("Unknown strategy type.")
- y = np.atleast_1d(y)
- self.output_2d_ = y.ndim == 2
+ if self.strategy == "uniform" and sp.issparse(y):
+ y = y.toarray()
+ warnings.warn('prediciting on sparse target data with the uniform '
@vene vene and 1 other commented on an outdated diff Aug 4, 2014
sklearn/dummy.py
@@ -90,36 +95,55 @@ def fit(self, X, y, sample_weight=None):
"constant"):
raise ValueError("Unknown strategy type.")
- y = np.atleast_1d(y)
- self.output_2d_ = y.ndim == 2
+ if self.strategy == "uniform" and sp.issparse(y):
+ y = y.toarray()
+ warnings.warn('prediciting on sparse target data with the uniform '
+ 'strategy would not save memory and would be '
+ 'slower. The target data has been densified',
+ sp.SparseEfficiencyWarning)
@vene
vene Aug 4, 2014

I wonder if it's appropriate to reuse this scipy warning class here.

@hamsal
hamsal Aug 5, 2014

Changed to UserWarning

@hamsal

Thank you for the review!

@coveralls

Coverage Status

Coverage increased (+0.06%) when pulling cfad359 on hamsal:sprs-out-dmy into 195dd26 on scikit-learn:master.

@coveralls

Coverage Status

Coverage increased (+0.06%) when pulling 7bb1c01 on hamsal:sprs-out-dmy into 195dd26 on scikit-learn:master.

hamsal added some commits Aug 4, 2014
@hamsal hamsal Make cosmetic revisions to random_choice_csc
Drop the passing of default parameters in calls to
sample_without_replacement and random.choice, calculate
nnz in a more concise 2-line sequence
31d9b12
@hamsal hamsal import division from future to give real results with two ints 11c7538
@hamsal hamsal Make cosmetic changes to sparse_class_distribution aa47285
@coveralls

Coverage Status

Coverage increased (+0.06%) when pulling 11c7538 on hamsal:sprs-out-dmy into 195dd26 on scikit-learn:master.

@hamsal hamsal Make cosmetic changes to sparse_class_distribution
classes_ -> classes
class_prior_ -> class_prior
class_prior -> class_prior_k
classes -> classes_k

Separate if out from assignment to nz_sample_weight and
z_sample_weight_sum
f5622b7
@coveralls

Coverage Status

Coverage increased (+0.05%) when pulling f5622b7 on hamsal:sprs-out-dmy into 195dd26 on scikit-learn:master.

hamsal added some commits Aug 4, 2014
@hamsal hamsal Make naming changes in sparse_class_distribution
z_sample_weight_sum -> zeros_samp_weight_sum
nz_sample_weight_sum -> nz_samp_weight_sum
431f291
@hamsal hamsal Make default parameters for test_random_choice_csc, and use almost equal
Use assert_array_almost_equal in test_sparse_class_distribution so
that slight numerical inaccuracies are tolerated,

Move initialization of n_samples and random state to the parameters
of the function
2207046
@coveralls

Coverage Status

Coverage increased (+0.05%) when pulling 2207046 on hamsal:sprs-out-dmy into 195dd26 on scikit-learn:master.

@coveralls

Coverage Status

Coverage increased (+0.05%) when pulling 7e694fa on hamsal:sprs-out-dmy into 195dd26 on scikit-learn:master.

@coveralls

Coverage Status

Coverage increased (+0.05%) when pulling 478f9ad on hamsal:sprs-out-dmy into 195dd26 on scikit-learn:master.

@vene vene and 1 other commented on an outdated diff Aug 5, 2014
sklearn/dummy.py
@@ -58,6 +63,10 @@ class DummyClassifier(BaseEstimator, ClassifierMixin):
`outputs_2d_` : bool,
True if the output at fit is 2d, else false.
+ `sparse_output_` : bool,
+ True if the array returned from predict is to be in sparse CSC format.
+ Is set True if the input target data given to fit in sparse format.
@vene
vene Aug 5, 2014

I'd rephrase this line a bit.

"Is automatically set to True if the input y is passed in sparse format." ?

@vene vene and 1 other commented on an outdated diff Aug 5, 2014
sklearn/dummy.py
@@ -90,16 +99,24 @@ def fit(self, X, y, sample_weight=None):
"constant"):
raise ValueError("Unknown strategy type.")
- y = np.atleast_1d(y)
- self.output_2d_ = y.ndim == 2
+ if self.strategy == "uniform" and sp.issparse(y):
+ y = y.toarray()
+ warnings.warn('The target data has been converted to a numpy '
+ 'array. Predicting on sparse target data with the '
@vene
vene Aug 5, 2014

The dense y is just a local copy, right? This message might make a user think that their y has been converted in place.

@vene vene and 1 other commented on an outdated diff Aug 5, 2014
sklearn/utils/multiclass.py
+
+
+def sparse_class_distribution(y, sample_weight=None):
+ """Compute class priors from sparse multioutput-multiclass target data
+
+ Parameters
+ ----------
+ y : sparse matrix of size (n_samples, n_outputs)
+ The labels for each example.
+
+ sample_weight : array-like of shape = [n_samples], optional
+ Sample weights.
+
+ Return
+ ------
+ classes : list of size n_outputs of arrays of size (n_classes, )
@vene
vene Aug 5, 2014

should there be a space between the comma and the bracket here? I don't know...

@hamsal
hamsal Aug 5, 2014

Fixed. I made it (n_classes,) because numpy prints shapes without the space

@vene vene and 1 other commented on an outdated diff Aug 5, 2014
sklearn/utils/multiclass.py
+
+ """
+ classes = []
+ n_classes = []
+ class_prior = []
+
+ y = y.tocsc()
+ # Remove explcit zeros so we can count zeros vs nonzeros accurately
+ # using the shape of the matrix
+ y.eliminate_zeros()
+ y_nnz = np.diff(y.indptr)
+
+ n_samples, n_outputs = y.shape
+
+ for k in range(n_outputs):
+ col_nonzero = y.indices[y.indptr[k]:y.indptr[k+1]]
@vene vene and 1 other commented on an outdated diff Aug 5, 2014
sklearn/utils/multiclass.py
@@ -347,3 +347,62 @@ def _check_partial_fit_first_call(clf, classes=None):
# classes is None and clf.classes_ has already previously been set:
# nothing to do
return False
+
+
+def sparse_class_distribution(y, sample_weight=None):
+ """Compute class priors from sparse multioutput-multiclass target data
+
+ Parameters
+ ----------
+ y : sparse matrix of size (n_samples, n_outputs)
+ The labels for each example.
+
+ sample_weight : array-like of shape = [n_samples], optional
@vene
vene Aug 5, 2014

should shape be (n_samples,)?

@vene vene and 1 other commented on an outdated diff Aug 5, 2014
sklearn/utils/multiclass.py
+ classes : list of size n_outputs of arrays of size (n_classes, )
+ List of classes for each column.
+
+ n_classes : list of integrs of size n_outputs
+ Number of classes in each column
+
+ class_prior : list of size n_outputs of arrays of size (n_classes, )
+ Class distribution of each column.
+
+ """
+ classes = []
+ n_classes = []
+ class_prior = []
+
+ y = y.tocsc()
+ # Remove explcit zeros so we can count zeros vs nonzeros accurately
@vene vene and 1 other commented on an outdated diff Aug 5, 2014
sklearn/utils/multiclass.py
+ n_classes = []
+ class_prior = []
+
+ y = y.tocsc()
+ # Remove explcit zeros so we can count zeros vs nonzeros accurately
+ # using the shape of the matrix
+ y.eliminate_zeros()
+ y_nnz = np.diff(y.indptr)
+
+ n_samples, n_outputs = y.shape
+
+ for k in range(n_outputs):
+ col_nonzero = y.indices[y.indptr[k]:y.indptr[k+1]]
+ # separate sample weights for zero and non-zero elements
+ nz_samp_weight = None
+ zeros_samp_weight_sum = y.shape[0] - y_nnz[k]
@vene
vene Aug 5, 2014

much more readable if done in an if/else rather than overriding

@hamsal hamsal Include dense case support in class_distribution, pep8 revisions
Make the function previously named sparse_class_distribution support both the sparse and
dense cases, and make some cosmetic updates for improved readability
5ba7df8
@coveralls

Coverage Status

Coverage increased (+0.05%) when pulling 5ba7df8 on hamsal:sprs-out-dmy into 195dd26 on scikit-learn:master.

@coveralls

Coverage Status

Coverage increased (+0.02%) when pulling 40d8fb3 on hamsal:sprs-out-dmy into 195dd26 on scikit-learn:master.

@coveralls

Coverage Status

Coverage increased (+0.02%) when pulling f69dcef on hamsal:sprs-out-dmy into 195dd26 on scikit-learn:master.

@hamsal

I revised the testing to include multioutput-multiclass in place of the binary columns. The dense case in predict also uses list comprehensions now instead of a larger for loop.

@arjoly arjoly and 1 other commented on an outdated diff Aug 6, 2014
sklearn/dummy.py
- for k in xrange(self.n_outputs_):
- classes, y_k = np.unique(y[:, k], return_inverse=True)
- self.classes_.append(classes)
- self.n_classes_.append(classes.shape[0])
- class_prior = np.bincount(y_k, weights=sample_weight)
- self.class_prior_.append(class_prior / class_prior.sum())
-
- # Checking in case of constant strategy if the constant provided
- # by the user is in y.
+ (self.classes_,
+ self.n_classes_,
+ self.class_prior_) = class_distribution(y, sample_weight)
+
+ for k in range(self.n_outputs_):
+ # Checking in case of constant strategy if the constant
+ # provided by the user is in y.
if self.strategy == "constant":
@arjoly
arjoly Aug 6, 2014

I would put this before the for loop.

@arjoly arjoly and 1 other commented on an outdated diff Aug 6, 2014
sklearn/dummy.py
elif self.strategy == "constant":
- ret = np.ones(n_samples, dtype=int) * (
- np.where(classes_[k] == constant[k]))
+ classes_ = [np.array([c]) for c in constant]
+
+ y = random_choice_csc(n_samples, classes_, class_prob,
+ self.random_state)
+ else:
+ if self.strategy == "most_frequent":
+ ret = [classes_[k][class_prior_[k].argmax()] for k in
+ range(self.n_outputs_)]
+ y = np.tile(ret, [n_samples, 1])
@arjoly
arjoly Aug 6, 2014

Nitpick: I would break at the for and maybe merge both lines.

@arjoly arjoly and 1 other commented on an outdated diff Aug 6, 2014
sklearn/dummy.py
- y.append(classes_[k][ret])
+ elif self.strategy == "stratified":
+ y = np.vstack([classes_[k][proba[k].argmax(axis=1)] for k in
+ range(self.n_outputs_)]).T
@arjoly
arjoly Aug 6, 2014

Nitpick: I would break at the for.

@arjoly
arjoly Aug 6, 2014

You don't need the [] inside the vstack. For instance, you could do

y = np.vstack(classes_[k][proba[k].argmax(axis=1)] for k in range(self.n_outputs_))
@arjoly arjoly and 1 other commented on an outdated diff Aug 6, 2014
sklearn/dummy.py
- y = np.vstack(y).T
- if self.n_outputs_ == 1 and not self.output_2d_:
- y = np.ravel(y)
+ elif self.strategy == "uniform":
+ ret = [classes_[k][rs.randint(n_classes_[k], size=n_samples)]
+ for k in range(self.n_outputs_)]
@arjoly
arjoly Aug 6, 2014

Nitpick: I would break at the for and maybe merge both lines.
vstack works with a generator. For instance, you can do:

np.vstack(np.random.randint(3, size=(4,)) for k in range(5)) 
@hamsal
hamsal Aug 7, 2014

This line is too long to break at the for or join with the next line

@arjoly arjoly and 1 other commented on an outdated diff Aug 6, 2014
sklearn/utils/multiclass.py
+ np.sum(nz_samp_weight))
+ else:
+ nz_samp_weight = None
+ zeros_samp_weight_sum = y.shape[0] - y_nnz[k]
+
+ classes_k, y_k = np.unique(y.data[y.indptr[k]:y.indptr[k + 1]],
+ return_inverse=True)
+ class_prior_k = np.bincount(y_k, weights=nz_samp_weight)
+ if y_nnz[k] < y.shape[0]:
+ classes_k = np.insert(classes_k, 0, 0)
+ class_prior_k = np.insert(class_prior_k, 0,
+ zeros_samp_weight_sum)
+
+ classes.append(classes_k)
+ n_classes.append(classes_k.shape[0])
+ class_prior.append(class_prior_k / float(class_prior_k.sum()))
@arjoly
arjoly Aug 6, 2014

I am not sure that the float(...) is needed.

@arjoly arjoly and 1 other commented on an outdated diff Aug 6, 2014
sklearn/utils/random.py
+ If int, random_state is the seed used by the random number generator;
+ If RandomState instance, random_state is the random number generator;
+ If None, the random number generator is the RandomState instance used
+ by `np.random`.
+
+ Return
+ ------
+ random_matrix : sparse csc matrix of size (n_samples, n_outputs)
+
+ """
+ data = array.array('i')
+ indices = array.array('i')
+ indptr = array.array('i', [0])
+
+ for j in range(len(classes)):
+ classes[j] = classes[j].astype(int)
@arjoly
arjoly Aug 6, 2014

Can you check that this will fail if classes is an array of strings, dtype object, or floating point values that can be converted to int?

@arjoly
arjoly Aug 6, 2014

Do you accept array-like? For instance, [0, 1, 2]?

@hamsal
hamsal Aug 7, 2014

There is now a test to check that string and float dtypes fail, and also that it accepts array-likes.
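For reference, a small illustration of numpy's dtype kind codes, which is the kind of check being tested (the accept/reject comments reflect the ValueError discussed above):

import numpy as np

print(np.asarray([0, 1, 2]).dtype.kind)    # 'i' -> integer classes, accepted
print(np.asarray([0.0, 1.5]).dtype.kind)   # 'f' -> float classes, rejected
print(np.asarray(["a", "b"]).dtype.kind)   # 'U' (or 'S') -> string classes, rejected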

@arjoly arjoly and 1 other commented on an outdated diff Aug 6, 2014
sklearn/utils/random.py
+
+ """
+ data = array.array('i')
+ indices = array.array('i')
+ indptr = array.array('i', [0])
+
+ for j in range(len(classes)):
+ classes[j] = classes[j].astype(int)
+
+ # use uniform distribution if no class_probability is given
+ if class_probability is None:
+ class_prob_j = np.empty(shape=classes[j].shape[0])
+ class_prob_j.fill(1 / classes[j].shape[0])
+ else:
+ class_prob_j = class_probability[j]
+ class_prob_j = np.asarray(class_prob_j)
@arjoly
arjoly Aug 6, 2014

I would merge both lines.

@arjoly
scikit-learn member

The DummyClassifier code is especially clean.

@hamsal hamsal Make cosmetic changes
Break list comprehensions differently in dummy predict

Drop unnecessary float() in multiclass class_distribution

Combine two lines in random_choice_csc
7b7784e
@coveralls

Coverage Status

Coverage increased (+0.02%) when pulling 7b7784e on hamsal:sprs-out-dmy into 195dd26 on scikit-learn:master.

@arjoly arjoly and 1 other commented on an outdated diff Aug 7, 2014
sklearn/utils/multiclass.py
+ n_samples, n_outputs = y.shape
+
+ if issparse(y):
+ y = y.tocsc()
+ # Remove explicit zeros so we can count zeros vs nonzeros accurately
+ # using the shape of the matrix
+ y.eliminate_zeros()
+ y_nnz = np.diff(y.indptr)
+
+ for k in range(n_outputs):
+ col_nonzero = y.indices[y.indptr[k]:y.indptr[k + 1]]
+ # separate sample weights for zero and non-zero elements
+ if sample_weight is not None:
+ nz_samp_weight = sample_weight[col_nonzero]
+ zeros_samp_weight_sum = (np.sum(sample_weight) -
+ np.sum(nz_samp_weight))
@arjoly
arjoly Aug 7, 2014

When I do nosetests sklearn/utils/tests/test_multiclass.py --with-coverage, the following lines are not covered:

Name                                          Stmts   Miss Branch BrMiss  Cover   Missing
-----------------------------------------------------------------------------------------
sklearn.utils.multiclass                        132      7     71      8    93%   ..., 393-394

@hamsal
hamsal Aug 7, 2014

Fixed. By adding a test with explicit sample weights.

@arjoly arjoly and 1 other commented on an outdated diff Aug 7, 2014
sklearn/utils/random.py
+
+ if class_prob_j.shape[0] != classes[j].shape[0]:
+ raise ValueError("classes[{0}] (length {1}) and "
+ "class_probability[{0}] (length {2}) have "
+ "different length.".format(j,
+ classes[j].shape[0],
+ class_prob_j.shape[0]))
+
+ # If 0 is not present in the classes insert it with a probability 0.0
+ if 0 not in classes[j]:
+ classes[j] = np.insert(classes[j], 0, 0)
+ class_prob_j = np.insert(class_prob_j, 0, 0.0)
+
+ # If there are nonzero classes choose randomly using class_probability
+ if classes[j].shape[0] > 1:
+ p_nonzero = 1 - class_prob_j[classes == 0]
@arjoly
arjoly Aug 7, 2014

Whenever I run nosetests sklearn/utils/tests/test_random.py --with-coverage,
I get the following warning:

./Users/ajoly/git/scikit-learn/sklearn/utils/random.py:257: DeprecationWarning: using a boolean instead of an integer will result in an error in the future
  p_nonzero = 1 - class_prob_j[classes == 0]
@arjoly
arjoly Aug 7, 2014

I am using numpy==1.8.1

@arjoly
scikit-learn member

Once the last comments are addressed, +1!
Thanks @hamsal for your hard work.

@coveralls

Coverage Status

Coverage increased (+0.01%) when pulling 7b7784e on hamsal:sprs-out-dmy into 046f7e8 on scikit-learn:master.

@coveralls

Coverage Status

Coverage increased (+0.01%) when pulling 94b7845 on hamsal:sprs-out-dmy into 046f7e8 on scikit-learn:master.

@hamsal

Thank you @arjoly for the quick follow up review! I have addressed the comments

@arjoly arjoly and 1 other commented on an outdated diff Aug 8, 2014
sklearn/utils/random.py
+ If None, the random number generator is the RandomState instance used
+ by `np.random`.
+
+ Return
+ ------
+ random_matrix : sparse csc matrix of size (n_samples, n_outputs)
+
+ """
+ data = array.array('i')
+ indices = array.array('i')
+ indptr = array.array('i', [0])
+
+ for j in range(len(classes)):
+ classes[j] = np.asarray(classes[j])
+ if classes[j].dtype.kind is 'f':
+ raise ValueError("Class type float is not supported")
@arjoly
arjoly Aug 8, 2014

I would do

if classes[j].dtype.kind != 'i':
    raise ValueError("class dtype %s is not supported" % classes[j].dtype)
@arjoly
scikit-learn member

A last nitpick to address and then a big +1 :-)
Thanks @hamsal !

@vene vene and 1 other commented on an outdated diff Aug 8, 2014
sklearn/dummy.py
@@ -90,16 +98,24 @@ def fit(self, X, y, sample_weight=None):
"constant"):
raise ValueError("Unknown strategy type.")
- y = np.atleast_1d(y)
- self.output_2d_ = y.ndim == 2
+ if self.strategy == "uniform" and sp.issparse(y):
+ y = y.toarray()
+ warnings.warn('A Local copy of the target data has been converted '
@vene vene commented on the diff Aug 8, 2014
sklearn/dummy.py
@@ -4,13 +4,17 @@
# License: BSD 3 clause
@vene
vene Aug 8, 2014

It's not your change, but could you add a space between the address and the e-mail on line 3?

@vene vene commented on the diff Aug 8, 2014
sklearn/dummy.py
- y.append(classes_[k][ret])
+ y = random_choice_csc(n_samples, classes_, class_prob,
+ self.random_state)
+ else:
+ if self.strategy == "most_frequent":
+ y = np.tile([classes_[k][class_prior_[k].argmax()] for
@vene
vene Aug 8, 2014

I find it unconventional to break after for; I'd break before.

@hamsal
hamsal Aug 8, 2014

You and @arjoly have suggested different break points. To me, breaking after the for makes some sense because pep8 recommends breaking after 'and' and 'or' when using multiline conditionals.

@vene
vene Aug 8, 2014
@arjoly
arjoly Aug 9, 2014

I thought that I said the other way. Anyway, if pep8 is OK with it, it's OK for me.

@arjoly
arjoly Aug 9, 2014

(I mean break before the for)

@vene vene and 1 other commented on an outdated diff Aug 8, 2014
sklearn/tests/test_dummy.py
+ y_pred = clf.predict(X)
+ assert_true(sp.issparse(y_pred))
+ assert_array_equal(y_pred.toarray(), np.hstack([np.ones((n_samples, 1)),
+ np.zeros((n_samples, 1))]))
+
+
+def test_uniform_strategy_sparse_target_warning():
+ X = [[0]] * 5 # ignored
+ y = sp.csc_matrix(np.array([[2, 1],
+ [2, 2],
+ [1, 4],
+ [4, 2],
+ [1, 1]]))
+
+ clf = DummyClassifier(strategy="uniform", random_state=0)
+ assert_warns(UserWarning, clf.fit, X, y)
@vene
vene Aug 8, 2014

Should we check for the exact warning? It might be better. You can use sklearn.utils.testing.assert_warns_message and pass a crucial substring.
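A hedged sketch of that pattern, reusing the test's setup; the substring passed here is illustrative and would need to match the actual warning text:

import numpy as np
import scipy.sparse as sp
from sklearn.dummy import DummyClassifier
from sklearn.utils.testing import assert_warns_message

X = [[0]] * 5  # ignored
y = sp.csc_matrix(np.array([[2, 1], [2, 2], [1, 4], [4, 2], [1, 1]]))
clf = DummyClassifier(strategy="uniform", random_state=0)
# "sparse target data" is an assumed fragment of the warning message
assert_warns_message(UserWarning, "sparse target data", clf.fit, X, y)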

@vene vene and 1 other commented on an outdated diff Aug 8, 2014
sklearn/utils/multiclass.py
+ for k in range(n_outputs):
+ col_nonzero = y.indices[y.indptr[k]:y.indptr[k + 1]]
+ # separate sample weights for zero and non-zero elements
+ if sample_weight is not None:
+ nz_samp_weight = np.asarray(sample_weight)[col_nonzero]
+ zeros_samp_weight_sum = (np.sum(sample_weight) -
+ np.sum(nz_samp_weight))
+ else:
+ nz_samp_weight = None
+ zeros_samp_weight_sum = y.shape[0] - y_nnz[k]
+
+ classes_k, y_k = np.unique(y.data[y.indptr[k]:y.indptr[k + 1]],
+ return_inverse=True)
+ class_prior_k = np.bincount(y_k, weights=nz_samp_weight)
+ if y_nnz[k] < y.shape[0]:
+ classes_k = np.insert(classes_k, 0, 0)
@vene
vene Aug 8, 2014

Looks to me like you could still do this even without calling y.eliminate_zeros().

You'd just need to check whether 0 is already in classes_k, in which case don't add it and just increment its class prior.

Would it be worth it? Eliminating zeros basically rebuilds the CSC matrix, as far as I can tell from the code in scipy.

@hamsal
hamsal Aug 9, 2014

I have removed the eliminate_zeros call and written a few lines to handle the case of explicit zeros. I also modified the test to include a mix of explicit and implicit zeros in the sparse matrix.
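
As a rough sketch of that per-column logic (a hypothetical helper, not the merged code), assuming col_data holds the stored entries of one CSC column, possibly including explicit zeros, and non-negative integer labels:

import numpy as np

def column_class_distribution(col_data, n_rows):
    # classes and counts over the stored entries of this column
    classes_k, y_k = np.unique(col_data, return_inverse=True)
    counts_k = np.bincount(y_k).astype(float)

    # rows with no stored entry are implicit zeros
    n_implicit_zeros = n_rows - col_data.shape[0]
    if n_implicit_zeros > 0:
        if classes_k.size and classes_k[0] == 0:
            # a 0 class already exists because of explicit zeros:
            # just add the implicitly stored zeros to its count
            counts_k[0] += n_implicit_zeros
        else:
            classes_k = np.insert(classes_k, 0, 0)
            counts_k = np.insert(counts_k, 0, n_implicit_zeros)

    return classes_k, counts_k / n_rows

Per output k, col_data would be y.data[y.indptr[k]:y.indptr[k + 1]], as in the diff above.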

@vene vene and 1 other commented on an outdated diff Aug 8, 2014
sklearn/utils/random.py
@@ -1,8 +1,10 @@
# This file contains a backport of np.random.choice from numpy 1.7
@vene
vene Aug 8, 2014

This description is no longer complete; starting the file with this comment might be misleading.

@hamsal
hamsal Aug 8, 2014

I put this near the function and gave the file an authors list.

@vene vene and 1 other commented on an outdated diff Aug 8, 2014
sklearn/utils/tests/test_multiclass.py
@@ -332,6 +338,63 @@ def test_type_of_target():
assert_raises(ValueError, type_of_target, example)
+def test_class_distribution():
+ y = np.array([[1, 0, 0, 1],
+ [2, 2, 0, 1],
+ [1, 3, 0, 1],
+ [4, 2, 0, 1],
+ [2, 0, 0, 1],
+ [1, 3, 0, 1]])
+
+ classes, n_classes, class_prior = class_distribution(y)
+ classes_expected = [np.array([1, 2, 4]), np.array([0, 2, 3]),
@vene
vene Aug 8, 2014

I'd rather put one per line.

Also I think you can skip the inner np.array calls here; assert_array_almost_equal should be able to compare an array to a list, right?

@vene vene and 1 other commented on an outdated diff Aug 8, 2014
sklearn/utils/tests/test_multiclass.py
@@ -332,6 +338,63 @@ def test_type_of_target():
assert_raises(ValueError, type_of_target, example)
+def test_class_distribution():
+ y = np.array([[1, 0, 0, 1],
+ [2, 2, 0, 1],
+ [1, 3, 0, 1],
+ [4, 2, 0, 1],
+ [2, 0, 0, 1],
+ [1, 3, 0, 1]])
+
+ classes, n_classes, class_prior = class_distribution(y)
+ classes_expected = [np.array([1, 2, 4]), np.array([0, 2, 3]),
+ np.array([0]), np.array([1])]
+ n_classes_expected = [3, 3, 1, 1]
+ class_prior_expected = [np.array([3/6, 2/6, 1/6]),
+ np.array([1/3, 1/3, 1/3]),
+ np.array([1.0]), np.array([1.0])]
@vene vene and 1 other commented on an outdated diff Aug 8, 2014
sklearn/utils/tests/test_multiclass.py
+
+ classes, n_classes, class_prior = class_distribution(y)
+ classes_expected = [np.array([1, 2, 4]), np.array([0, 2, 3]),
+ np.array([0]), np.array([1])]
+ n_classes_expected = [3, 3, 1, 1]
+ class_prior_expected = [np.array([3/6, 2/6, 1/6]),
+ np.array([1/3, 1/3, 1/3]),
+ np.array([1.0]), np.array([1.0])]
+
+ for k in range(y.shape[1]):
+ assert_array_almost_equal(classes[k], classes_expected[k])
+ assert_array_almost_equal(n_classes[k], n_classes_expected[k])
+ assert_array_almost_equal(class_prior[k], class_prior_expected[k])
+
+
+def test_class_distribution_sparse():
@vene
vene Aug 8, 2014

I think you could collapse this test and the previous one into one test. This would also make sure that class_weight with dense y is tested.
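
A possible shape for that collapsed test (just a sketch, reusing the data above) that runs the same assertions on dense and sparse targets:

import numpy as np
import scipy.sparse as sp
from numpy.testing import assert_array_almost_equal
from sklearn.utils.multiclass import class_distribution

def test_class_distribution():
    y_dense = np.array([[1, 0, 0, 1],
                        [2, 2, 0, 1],
                        [1, 3, 0, 1],
                        [4, 2, 0, 1],
                        [2, 0, 0, 1],
                        [1, 3, 0, 1]])
    classes_expected = [[1, 2, 4], [0, 2, 3], [0], [1]]

    # the same expectations must hold for dense and sparse targets
    for y in (y_dense, sp.csc_matrix(y_dense)):
        classes, n_classes, class_prior = class_distribution(y)
        for k in range(y_dense.shape[1]):
            assert_array_almost_equal(classes[k], classes_expected[k])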

@vene vene commented on the diff Aug 8, 2014
sklearn/utils/tests/test_random.py
@@ -99,3 +101,57 @@ def check_sample_int_distribution(sample_without_replacement):
raise AssertionError(
"number of combinations != number of expected (%s != %s)" %
(len(output), n_expected))
+
+
+def test_random_choice_csc(n_samples=10000, random_state=24):
+ # Explicit class probabilities
+ classes = [np.array([0, 1]), np.array([0, 1, 2])]
+ class_probabilites = [np.array([0.5, 0.5]), np.array([0.6, 0.1, 0.3])]
@vene
vene Aug 8, 2014

Should we also test classes that get 0 or 1 priors?
Should we test what happens when the given probas don't sum to 1?

@vene
vene Aug 8, 2014

Oh I see, the 0 proba case is tested below.

@vene
vene Aug 8, 2014

How about the case where a target has only one possible class? I think this is actually a way this function is used in your code (for the most_frequent strategy).

@hamsal
hamsal Aug 9, 2014

I have added these test cases and now raise a ValueError when the given probabilities do not sum to 1.
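
A hedged sketch of both additions; the helper name, message, and tolerance below are illustrative only, not necessarily the merged code:

import numpy as np

def check_class_probability(class_probability):
    # hypothetical validation: a per-output probability vector must sum to 1
    total = np.sum(class_probability)
    if not np.isclose(total, 1.0):
        raise ValueError("Probabilities sum to %f instead of 1." % total)

# the single-class case (as used by the most_frequent strategy) is valid
check_class_probability(np.array([1.0]))

# an invalid distribution raises
try:
    check_class_probability(np.array([0.6, 0.1]))
except ValueError as exc:
    print(exc)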

@vene vene commented on the diff Aug 8, 2014
sklearn/utils/tests/test_random.py
@@ -99,3 +101,57 @@ def check_sample_int_distribution(sample_without_replacement):
raise AssertionError(
"number of combinations != number of expected (%s != %s)" %
(len(output), n_expected))
+
+
+def test_random_choice_csc(n_samples=10000, random_state=24):
@vene
vene Aug 8, 2014

does the test strongly depend on this random state? or is it just 42 backwards? :)

@hamsal
hamsal Aug 8, 2014

It does not depend on this random state; I fixed the random_state so that this test does not vary between runs.

@vene vene and 1 other commented on an outdated diff Aug 8, 2014
sklearn/utils/tests/test_random.py
@@ -99,3 +101,57 @@ def check_sample_int_distribution(sample_without_replacement):
raise AssertionError(
"number of combinations != number of expected (%s != %s)" %
(len(output), n_expected))
+
+
+def test_random_choice_csc(n_samples=10000, random_state=24):
+ # Explicit class probabilities
+ classes = [np.array([0, 1]), np.array([0, 1, 2])]
+ class_probabilites = [np.array([0.5, 0.5]), np.array([0.6, 0.1, 0.3])]
+
+ got = random_choice_csc(n_samples, classes, class_probabilites,
+ random_state)
+ assert_true(sp.issparse(got))
+ got = got.toarray()
@vene
vene Aug 8, 2014

Instead of doing this, you could densify one column at a time in the for loop; it might make the test more efficient.
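
For instance, something along these lines, assuming the random_choice_csc helper introduced in this PR and the fixture from the test above; only one column is densified per loop iteration:

import numpy as np
import scipy.sparse as sp
from sklearn.utils.random import random_choice_csc

n_samples, random_state = 10000, 24
classes = [np.array([0, 1]), np.array([0, 1, 2])]
class_probabilites = [np.array([0.5, 0.5]), np.array([0.6, 0.1, 0.3])]

got = random_choice_csc(n_samples, classes, class_probabilites, random_state)
assert sp.issparse(got)

for k in range(len(classes)):
    # densify only the k-th column instead of the whole matrix
    col = got[:, k].toarray().ravel().astype(int)
    p = np.bincount(col) / float(n_samples)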

@vene vene and 1 other commented on an outdated diff Aug 8, 2014
sklearn/utils/tests/test_random.py
+def test_random_choice_csc(n_samples=10000, random_state=24):
+ # Explicit class probabilities
+ classes = [np.array([0, 1]), np.array([0, 1, 2])]
+ class_probabilites = [np.array([0.5, 0.5]), np.array([0.6, 0.1, 0.3])]
+
+ got = random_choice_csc(n_samples, classes, class_probabilites,
+ random_state)
+ assert_true(sp.issparse(got))
+ got = got.toarray()
+
+ for k in range(len(classes)):
+ p = np.bincount(got[:, k]) / float(n_samples)
+ assert_array_almost_equal(class_probabilites[k], p, decimal=1)
+
+ # Implicit class probabilities
+ classes = [[0, 1], [1, 2]] # array like support
@vene
vene Aug 8, 2014

you mean "test for array-like support"?

@vene
scikit-learn member

Looks good to me. Apart from a few test cases that I mentioned, the rest of my comments are pretty minor.

@coveralls

Coverage Status

Coverage increased (+0.01%) when pulling 7cfea35 on hamsal:sprs-out-dmy into 046f7e8 on scikit-learn:master.

@coveralls

Coverage Status

Coverage increased (+0.01%) when pulling 2a3f504 on hamsal:sprs-out-dmy into 046f7e8 on scikit-learn:master.

@coveralls

Coverage Status

Coverage increased (+0.02%) when pulling 9783de3 on hamsal:sprs-out-dmy into 046f7e8 on scikit-learn:master.

@coveralls

Coverage Status

Coverage increased (+0.01%) when pulling 2203422 on hamsal:sprs-out-dmy into 046f7e8 on scikit-learn:master.

@hamsal

I believe I have addressed all of the comments; the tests now cover more corner cases and the code is cosmetically cleaner.

@arjoly arjoly changed the title from [MRG] Sparse Target Dummy Classifier to [MRG+1] Sparse Target Dummy Classifier Aug 11, 2014
@arjoly
scikit-learn member

Again +1 for merge!

@arjoly
scikit-learn member

Is it ok for you @vene ?

@GaelVaroquaux
scikit-learn member

Seems fine to me. Merging.

@GaelVaroquaux GaelVaroquaux merged commit 39da06f into scikit-learn:master Aug 17, 2014

1 check passed

continuous-integration/travis-ci: The Travis CI build passed
@arjoly
scikit-learn member

Thanks @hamsal !!

@arjoly
scikit-learn member

🍻
