
[MRG+1] Update discriminant analysis code for better memory usage #10904

Merged: 7 commits merged into scikit-learn:master on Apr 25, 2018

Conversation

bobchennan (Contributor)

Fixed #10898

@rth (Member) left a comment

Thanks for this PR @bobchennan !

Please add [MRG] at the beginning of the PR title if this is ready for review. A few comments are below.

What's the estimated memory usage gain from this change? (You can measure memory usage, for instance, with memory_profiler.)
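For reference, the memory effect of the two accumulation patterns can be sketched with the stdlib tracemalloc module (the comment suggests memory_profiler; tracemalloc is a dependency-free stand-in here, and the shapes below are illustrative, not taken from the PR):

```python
import tracemalloc
import numpy as np

n_classes, n_features = 200, 100
rng = np.random.RandomState(0)
data = [rng.randn(50, n_features) for _ in range(n_classes)]

def peak_mib(fn):
    # peak traced allocation, in MiB, while fn runs
    tracemalloc.start()
    fn()
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return peak / 2 ** 20

def covs_as_list():
    # old pattern: keep every per-class covariance, then average the list
    covs = [np.cov(Xg, rowvar=False, bias=True) for Xg in data]
    return np.average(covs, axis=0)

def covs_running_sum():
    # new pattern: accumulate one running sum, never store all matrices
    cov = np.zeros((n_features, n_features))
    for Xg in data:
        cov += np.cov(Xg, rowvar=False, bias=True) / n_classes
    return cov

assert np.allclose(covs_as_list(), covs_running_sum())
assert peak_mib(covs_as_list) > peak_mib(covs_running_sum)
```

The running-sum version holds at most one class covariance at a time, so its peak scales with n_features² rather than n_classes × n_features².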

    covs.append(np.atleast_2d(_cov(Xg, shrinkage)))
    return np.average(covs, axis=0, weights=priors)
    if priors is None:
        cov = cov + np.atleast_2d(_cov(Xg, shrinkage)) / len(classes)
Member

Since cov is already allocated, you could just do an in-place add: cov += ...
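A minimal illustration of why the in-place form helps: += reuses the already-allocated buffer instead of creating a new array (the arrays here are toy data):

```python
import numpy as np

cov = np.zeros((3, 3))
update = np.eye(3)

# `cov = cov + update` allocates a fresh result array and rebinds the name;
# `cov += update` writes into the buffer that is already allocated.
buf_before = cov.__array_interface__['data'][0]  # address of the data buffer
cov += update
buf_after = cov.__array_interface__['data'][0]

assert buf_before == buf_after        # same buffer: no new allocation
assert np.allclose(cov, np.eye(3))
```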

    if priors is None:
        cov = cov + np.atleast_2d(_cov(Xg, shrinkage)) / len(classes)
    else:
        cov = cov + priors[idx] * np.atleast_2d(_cov(Xg, shrinkage))
Member

I'm not familiar with this part of the code base, but if I understand correctly #10898 was only about memory usage and I don't see how this line can be equivalent to the one above (if priors=None).

@jnothman (Member) left a comment

I failed to submit these comments hours ago...

    classes = np.unique(y)
    for group in classes:
    means = np.empty(shape=(len(classes), X.shape[1]), dtype=X.dtype)
    for idx, group in enumerate(classes):
Member

This looks good, though FWIW, I think we could do this without a loop, using np.add.at

Contributor Author

Not sure I understand this part. I can rewrite it as

    means = np.array([X[y == group].mean(0) for group in classes])

but the loop is still inside.

    if priors is None:
        cov = cov + np.atleast_2d(_cov(Xg, shrinkage)) / len(classes)
    else:
        cov = cov + priors[idx] * np.atleast_2d(_cov(Xg, shrinkage))
Member

Please avoid this duplication by applying the if only to the prior factor, not to the whole statement.

You also make the assumption that priors add to 1. The previous version did not. Please verify that this is a safe assumption.
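This matters because np.average divides by the sum of the weights, while a running weighted sum does not; a small illustration with made-up numbers:

```python
import numpy as np

covs = [np.eye(2) * 2.0, np.eye(2) * 4.0]
priors = np.array([0.3, 0.3])  # deliberately does NOT sum to 1

# old behaviour: np.average normalizes by sum(weights)
avg = np.average(covs, axis=0, weights=priors)

# plain weighted sum, as in the new loop
total = priors[0] * covs[0] + priors[1] * covs[1]

assert np.allclose(avg, np.eye(2) * 3.0)      # (0.3*2 + 0.3*4) / 0.6 = 3
assert np.allclose(total, np.eye(2) * 1.8)    # 0.3*2 + 0.3*4 = 1.8
assert not np.allclose(avg, total)            # equal only if priors sum to 1
```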

@bobchennan (Contributor Author)

Regarding the sum of priors: in the fit function it is already assigned (thus not None) and normalized.

@bobchennan bobchennan changed the title Update discriminant analysis code for better memory usage [MRG]Update discriminant analysis code for better memory usage Apr 5, 2018
@bobchennan (Contributor Author)

For the memory usage I will give an example later.

    classes, ny = np.unique(y, return_inverse=True)
    cnt = np.bincount(ny)
    means = np.zeros(shape=(len(classes), X.shape[1]))
    for idx in xrange(X.shape[0]):
Member

This loop over n_samples is slow.
What about the equivalent array operation:

    means = np.zeros(shape=(len(classes), X.shape[1]))
    np.add.at(means, ny, X)
    means /= cnt[:, None]
    return means

Contributor Author

Looks good! I didn't know about this function before.
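The np.add.at version can be checked against the loop-based means on toy data (shapes and labels here are illustrative):

```python
import numpy as np

rng = np.random.RandomState(0)
X = rng.randn(10, 3)
y = np.array([0, 1, 0, 2, 1, 0, 2, 2, 1, 0])

classes, ny = np.unique(y, return_inverse=True)
cnt = np.bincount(ny)

# vectorized per-class means via unbuffered scatter-add
means = np.zeros(shape=(len(classes), X.shape[1]))
np.add.at(means, ny, X)   # adds each row of X into its class's slot
means /= cnt[:, None]

# reference: one mean per class with a Python-level loop
expected = np.array([X[y == g].mean(0) for g in classes])
assert np.allclose(means, expected)
```

np.add.at is unbuffered, so repeated indices (several samples in the same class) all accumulate correctly, unlike plain fancy-index assignment.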

    Xg = X[y == group, :]
    covs.append(np.atleast_2d(_cov(Xg, shrinkage)))
    return np.average(covs, axis=0, weights=priors)
    cov += priors[idx] * np.atleast_2d(_cov(Xg, shrinkage))
Member

This line does not work if priors is None. You should do something like:

cov_g = np.atleast_2d(_cov(Xg, shrinkage))
if priors is not None:
    cov_g *= priors[idx]
cov += cov_g
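Folding that branch into the surrounding loop could look roughly like this; np.cov stands in for sklearn's private _cov helper, and the function name class_cov is hypothetical:

```python
import numpy as np

def class_cov(X, y, priors=None):
    # Sketch of the pooled within-class covariance with a running sum.
    # np.cov(..., bias=True) is a stand-in for sklearn's _cov helper.
    classes = np.unique(y)
    cov = np.zeros((X.shape[1], X.shape[1]))
    for idx, group in enumerate(classes):
        Xg = X[y == group, :]
        cov_g = np.atleast_2d(np.cov(Xg, rowvar=False, bias=True))
        if priors is not None:
            cov_g *= priors[idx]          # weight by the class prior
        else:
            cov_g /= len(classes)         # uniform average over classes
        cov += cov_g
    return cov

rng = np.random.RandomState(0)
X = rng.randn(30, 4)
y = rng.randint(0, 3, size=30)
uniform = class_cov(X, y)
weighted = class_cov(X, y, priors=np.full(3, 1 / 3))
assert np.allclose(uniform, weighted)  # equal priors match the uniform average
```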

Member

Oh, priors cannot be None.
Then maybe we should remove the default, which implies None is a valid input.

    covs.append(np.atleast_2d(_cov(Xg, shrinkage)))
    return np.average(covs, axis=0, weights=priors)
    cov += priors[idx] * np.atleast_2d(_cov(Xg, shrinkage))
    return cov
Member

In your version, you don't return the weighted average as before, but the weighted sum.
You should divide cov by the sum of priors, or by n_classes if the priors are None.
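To see the fix in isolation: dividing the running weighted sum by the weight total recovers what np.average computed before (the arrays and weights here are illustrative):

```python
import numpy as np

covs = np.array([np.eye(2), 3 * np.eye(2)])
priors = np.array([0.2, 0.6])   # need not sum to 1

# running weighted sum, as in the new loop
weighted_sum = np.zeros((2, 2))
for p, c in zip(priors, covs):
    weighted_sum += p * c

# dividing by the weight total matches the old np.average result
assert np.allclose(weighted_sum / priors.sum(),
                   np.average(covs, axis=0, weights=priors))
```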

Member

OK, you answered it already, my mistake.

@bobchennan (Contributor Author)

One example of memory usage is given here.
New implementation reduced memory usage from 2.2GB to 838MB for 20000 classes case.

@TomDLT TomDLT changed the title [MRG]Update discriminant analysis code for better memory usage [MRG+1]Update discriminant analysis code for better memory usage Apr 5, 2018
@jnothman (Member) left a comment

LGTM otherwise.

    Xg = X[y == group, :]
    means.append(Xg.mean(0))
    return np.asarray(means)
    classes, ny = np.unique(y, return_inverse=True)
Member

I don't know what ny means. Perhaps just overwrite y?

@jnothman (Member) commented Apr 9, 2018

Please add an entry to the Enhancements change log at doc/whats_new/v0.20.rst. Like the other entries there, please reference this pull request with :issue: and credit yourself (and other contributors, if applicable) with :user:.

@bobchennan (Contributor Author)

@rth any suggestions?

@TomDLT TomDLT merged commit 481dac7 into scikit-learn:master Apr 25, 2018
@TomDLT (Member) commented Apr 25, 2018

Thank you @bobchennan !
