[MRG]: add fast_dot function calling BLAS directly and consume only twice the memory of your data #2248

Closed

Contributor

Hi there, I finally got it running.

This implements a feature 'advocated' on this scipy page (section on large arrays and linalg):

http://wiki.scipy.org/PerformanceTips

When calling BLAS directly instead of np.dot, it's possible to avoid copying when the data are passed in F-contiguous order. In addition, I've added chunking to the _logcosh function, which avoids an extra copy.
This is how it now looks on 1 GB of test data:

(plot: fast_dot_chunking_logcosh)

This is how the same test looked on the current master (plot from the last memory PR):
(plot: memory_ica_par_w1_computation_del_gwx)

To make this functionality available for other use cases, I've added a fast_dot function to utils.extmath, with deliberately simple but explicit tests that exemplify the mapping between np.dot and fast_dot, which can be hellish.
Finally, I've made sure that downstream applications still work. For example, with this local branch the mne-python ICA looks as good as it did before.

cc @agramfort @GaelVaroquaux @mblondel
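
The chunking idea in _logcosh, as a minimal sketch (illustrative names, not the exact PR code): tanh is applied in place, and the derivative's mean is accumulated one row at a time, so no second full-size temporary is ever allocated.

import numpy as np

def _logcosh_chunked(x, alpha=1.0):
    # In-place tanh: reuses x's buffer instead of allocating a copy.
    x *= alpha
    gx = np.tanh(x, out=x)
    # Accumulate the derivative's mean row by row, avoiding a
    # temporary the size of x for (1 - gx ** 2).
    g_x = np.empty(x.shape[0])
    for i, gx_i in enumerate(gx):
        g_x[i] = (alpha * (1 - gx_i ** 2)).mean()
    return gx, g_x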

@pgervais pgervais and 1 other commented on an outdated diff Jul 26, 2013
sklearn/utils/extmath.py
@@ -58,6 +58,45 @@ def _fast_logdet_numpy(A):
fast_logdet = _fast_logdet
+def _check_forder(X):
+ """Helper Function"""
+ if X.flags.c_contiguous:
+ return X.T
+ else:
+ return X
+
+
+def fast_dot(A, B, trans_a=False, trans_b=False):
+ """Compute fast dot products directly calling blas.
+
+ This function calls BLAS directly while warranting
+ that Fortran contiguity.
pgervais
pgervais Jul 26, 2013 Contributor

Fortran contiguity. [ no "that" ]

dengemann
dengemann Jul 26, 2013 Contributor

ouups.

@pgervais pgervais and 1 other commented on an outdated diff Jul 26, 2013
sklearn/utils/extmath.py
+ http://wiki.scipy.org/PerformanceTips
+
+ Parameters
+ ----------
+ A : array array-like
+ First argument.
+ B : array-like
+ Second argument.
+ trans_a : bool, optional
+ If True A will be transposed.
+ trans_b : bool, optional
+ If True B will be transposed.
+ """
+ if A.dtype != B.dtype:
+ raise ValueError('A and B must be of the same type.')
+ if A.dtype not in [np.dtype(k) for k in 'f4', 'f8']:
pgervais
pgervais Jul 26, 2013 Contributor

Using this instead of a list comprehension should work and would be faster:

(np.float32, np.float64)
GaelVaroquaux
GaelVaroquaux Jul 26, 2013 Owner

I think that you can do:

A.dtype.kind == 'f'
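
Both checks fit on one line; note that the dtype.kind test is looser, since 'f' also matches float16 and platform long doubles, for which BLAS has no gemm routine. A quick illustration (not PR code):

import numpy as np

A = np.empty((2, 2), dtype=np.float32)
print(A.dtype in (np.float32, np.float64))  # True: explicit whitelist
print(A.dtype.kind == 'f')                  # True, but float16 would pass too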
@pgervais pgervais commented on an outdated diff Jul 26, 2013
sklearn/utils/extmath.py
+
+
+def fast_dot(A, B, trans_a=False, trans_b=False):
+ """Compute fast dot products directly calling blas.
+
+ This function calls BLAS directly while warranting
+ that Fortran contiguity.
+ This helps avoiding extra copies `np.dot` would have created.
+ For details see section `Linear Algebra on large Arrays`:
+ http://wiki.scipy.org/PerformanceTips
+
+ Parameters
+ ----------
+ A : array array-like
+ First argument.
+ B : array-like
pgervais
pgervais Jul 26, 2013 Contributor

Could you add a blank line between any two parameter descriptions?

@pgervais pgervais commented on an outdated diff Jul 26, 2013
sklearn/utils/extmath.py
+ else:
+ return X
+
+
+def fast_dot(A, B, trans_a=False, trans_b=False):
+ """Compute fast dot products directly calling blas.
+
+ This function calls BLAS directly while warranting
+ that Fortran contiguity.
+ This helps avoiding extra copies `np.dot` would have created.
+ For details see section `Linear Algebra on large Arrays`:
+ http://wiki.scipy.org/PerformanceTips
+
+ Parameters
+ ----------
+ A : array array-like
pgervais
pgervais Jul 26, 2013 Contributor

extra "array"

@pgervais pgervais commented on an outdated diff Jul 26, 2013
sklearn/utils/extmath.py
+
+def fast_dot(A, B, trans_a=False, trans_b=False):
+ """Compute fast dot products directly calling blas.
+
+ This function calls BLAS directly while warranting
+ that Fortran contiguity.
+ This helps avoiding extra copies `np.dot` would have created.
+ For details see section `Linear Algebra on large Arrays`:
+ http://wiki.scipy.org/PerformanceTips
+
+ Parameters
+ ----------
+ A : array array-like
+ First argument.
+ B : array-like
+ Second argument.
pgervais
pgervais Jul 26, 2013 Contributor

You can describe A and B at the same time:

A, B: array-like
    input matrices

By the way, this function only accepts numpy.ndarray. Lists will not be accepted, since there is an access to dtype in the first line.

@pgervais pgervais and 1 other commented on an outdated diff Jul 26, 2013
sklearn/utils/extmath.py
+ B : array-like
+ Second argument.
+ trans_a : bool, optional
+ If True A will be transposed.
+ trans_b : bool, optional
+ If True B will be transposed.
+ """
+ if A.dtype != B.dtype:
+ raise ValueError('A and B must be of the same type.')
+ if A.dtype not in [np.dtype(k) for k in 'f4', 'f8']:
+ raise ValueError('Data must be single or double precision float.')
+
+ dot = linalg.get_blas_funcs('gemm', (A, B), dtype=A.dtype)
+ # from scipy.linalg._fblas import dgemm as dot
+ return dot(alpha=1.0, a=_check_forder(A), b=_check_forder(B),
+ trans_a=trans_a, trans_b=trans_b)
pgervais
pgervais Jul 26, 2013 Contributor

I don't understand the computation here. Shouldn't there be some kind of relationship between the value of trans_a and the ordering of A (same for B)?

Suppose you have A and B in C order, and both trans_a and trans_b False. _check_forder will transpose both arrays, so you'll end up computing something like np.dot(A.T, B.T) instead of np.dot(A, B). The right way would be to compute np.dot(B.T, A.T).T. Or did I miss something?

dengemann
dengemann Jul 26, 2013 Contributor

It's a bit tricky. You've got to make sure that none of the arrays is in C order when passed to BLAS. Then, depending on the order, the matrix to transpose changes. Take a look at the tests in extmath.

pgervais
pgervais Jul 26, 2013 Contributor

This is a question of implicit versus explicit: depending on the ordering of the input matrices, the result will change. This should not happen in any case.
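
For reference, the transpose identity mentioned above is easy to check numerically (a standalone snippet, not part of the PR):

import numpy as np

rng = np.random.RandomState(0)
A = rng.random_sample((3, 4))  # C-contiguous
B = rng.random_sample((4, 5))  # C-contiguous

# A.T and B.T are F-contiguous views, so transposing both operands and
# the result reproduces np.dot(A, B) without any copy on the way in.
np.testing.assert_allclose(np.dot(A, B), np.dot(B.T, A.T).T)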

@GaelVaroquaux GaelVaroquaux and 2 others commented on an outdated diff Jul 26, 2013
sklearn/utils/extmath.py
@@ -58,6 +58,45 @@ def _fast_logdet_numpy(A):
fast_logdet = _fast_logdet
+def _check_forder(X):
+ """Helper Function"""
+ if X.flags.c_contiguous:
+ return X.T
+ else:
+ return X
GaelVaroquaux
GaelVaroquaux Jul 26, 2013 Owner

I think that the name of this function can be a bit misleading: it is not guaranteed to return an F-contiguous array: X may be neither C nor F contiguous.

You probably need tests with such an array, by the way:

X = np.ones((10, 10))[::2]
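
Such a slice indeed reports neither contiguity flag (a quick check, not PR code):

import numpy as np

X = np.ones((10, 10))[::2]   # strided view: every other row
print(X.flags.c_contiguous)  # False
print(X.flags.f_contiguous)  # False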
GaelVaroquaux
GaelVaroquaux Jul 26, 2013 Owner

Also, shouldn't it be called '_impose_f_order' rather than '_check_forder'?

pgervais
pgervais Jul 26, 2013 Contributor

I gather that this is not the first function in scikit-learn starting with 'check_' that returns the modified input :-)

GaelVaroquaux
GaelVaroquaux Jul 26, 2013 Owner

I gather that this is not the first function in scikit-learn starting with 'check_' that returns the modified input :-)

Yes :$

dengemann
dengemann Jul 26, 2013 Contributor

@GaelVaroquaux @pgervais yes, good points. Actually it does not guarantee much. In FastICA I'm using asserts to make sure all is fine. +1 for the proposed API changes.

GaelVaroquaux
GaelVaroquaux Jul 26, 2013 Owner

In FastICA I'm using asserts to make sure all is fine.

Maybe they should be moved to the fast_dot function then. Another option
would be to fall back to standard dot if we are not in the right
situation.

dengemann
dengemann Jul 26, 2013 Contributor

@GaelVaroquaux @pgervais
We just had a discussion about the confusing call signature. Making a fast dot with the same call signature as np.dot will be a PITA, needless to mention the tests. I think what we should do is rename fast_dot to safe_gemm, since it's supposed to handle dtype and order issues while exposing the gemm API. Then, if BLAS is not available, the arguments will be parsed in a way that allows an equivalent dispatch to np.dot. We should add more information to the docs to really convey the point that everything has to be in F-order while the transposes are done inside BLAS. For np.dot(A.T, A) you need gemm(A.T, A.T, trans_b=True), and if you don't comply with this it may consume lots of memory ...
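
To make that last mapping concrete, here is a minimal sketch using scipy's get_blas_funcs directly, assuming 2-D float arrays (illustrative, not the PR's code):

import numpy as np
from scipy import linalg

rng = np.random.RandomState(0)
A = rng.random_sample((1000, 50))  # C-contiguous

gemm = linalg.get_blas_funcs('gemm', (A, A))
# A.T is an F-contiguous view, so BLAS consumes it without a copy;
# trans_b=True makes gemm compute A.T * (A.T).T, i.e. A.T * A.
C = gemm(alpha=1.0, a=A.T, b=A.T, trans_b=True)
np.testing.assert_allclose(C, np.dot(A.T, A))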

@GaelVaroquaux GaelVaroquaux commented on an outdated diff Jul 26, 2013
sklearn/utils/extmath.py
@@ -58,6 +58,45 @@ def _fast_logdet_numpy(A):
fast_logdet = _fast_logdet
+def _check_forder(X):
+ """Helper Function"""
+ if X.flags.c_contiguous:
+ return X.T
+ else:
+ return X
+
+
+def fast_dot(A, B, trans_a=False, trans_b=False):
+ """Compute fast dot products directly calling blas.
+
+ This function calls BLAS directly while warranting
GaelVaroquaux
GaelVaroquaux Jul 26, 2013 Owner

I don't think that it actually guarantees Fortran contiguity (see my example of an array that is neither F nor C contiguous).

@pgervais pgervais and 2 others commented on an outdated diff Jul 26, 2013
sklearn/utils/tests/test_extmath.py
+
+
+def test_fast_dot():
+ """Check fast dot blas wrapper function"""
+ A = np.random.random([2, 10])
+ B = np.random.random([2, 10])
+ assert_raises(ValueError, fast_dot, A, A.astype('f4'))
+ assert_raises(ValueError, fast_dot, A.astype('i4'), A)
+ # test cov-like use case + dtypes
+ for dtype in ['f8', 'f4']:
+ A = A.astype(dtype)
+ B = B.astype(dtype)
+
+ # col < row
+ C = np.dot(A.T, A)
+ C_ = fast_dot(A, A, trans_b=True)
pgervais
pgervais Jul 26, 2013 Contributor

In my opinion, the signatures of np.dot and fast_dot should be identical; otherwise it will be confusing to everybody.

dengemann
dengemann Jul 26, 2013 Contributor

This is going to be painful ... I did it this way to expose the BLAS arguments. Ideas on how to remap this are welcome.

pgervais
pgervais Jul 26, 2013 Contributor

If fast_dot is a higher-level version of the BLAS function, then it should be called fast_gemm, not fast_dot.

GaelVaroquaux
GaelVaroquaux Jul 26, 2013 Owner

Agreed: np.dot and fast_dot should do exactly the same thing.

It seems to me that trans_a and trans_b could be determined inside fast_dot, by adding a return value to _check_forder indicating whether it transposed or not.
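
A minimal sketch of that suggestion (hypothetical names, not the PR's final code, assuming 2-D float inputs): the helper reports whether it had to transpose, and those flags become gemm's trans_a / trans_b, so the wrapper keeps np.dot's two-argument signature:

from scipy import linalg

def _impose_f_order_sketch(X):
    # Return an F-ordered view plus a flag saying whether we transposed.
    if X.flags.c_contiguous:
        return X.T, True
    return X, False

def fast_dot_sketch(A, B):
    # np.dot-compatible wrapper around BLAS gemm (illustrative only).
    a, trans_a = _impose_f_order_sketch(A)
    b, trans_b = _impose_f_order_sketch(B)
    gemm = linalg.get_blas_funcs('gemm', (a, b))
    # gemm computes op(a) * op(b); the trans flags undo our transposes,
    # so the result always equals np.dot(A, B), whatever the input order.
    return gemm(alpha=1.0, a=a, b=b, trans_a=trans_a, trans_b=trans_b)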

@GaelVaroquaux GaelVaroquaux and 1 other commented on an outdated diff Jul 26, 2013
sklearn/decomposition/fastica_.py
+ # this is crucial to avoid going to BLAS-Hell.
+ assert X1.flags.f_contiguous
GaelVaroquaux
GaelVaroquaux Jul 26, 2013 Owner

Should that stay in there?

dengemann
dengemann Jul 26, 2013 Contributor

I can remove the comment ;-) Kidding; once we're confident we can remove it. It helped debug things across init parameters in the tests, since contiguity was init-dependent ....

Contributor

There are two fundamental flaws in the current design that I think should be addressed before merging:

  • fast_dot's output depends on the ordering of the input arrays. This is highly implicit, and different from the usual NumPy behaviour.
  • fast_dot and numpy.dot must have the same signature (for consistency). fast_dot should just be a faster / leaner version of np.dot.
Owner

Agreed with @pgervais' comment on the design flaws.

In addition, the Travis tests fail.

Contributor

Agreed with @pgervais' comment on the design flaws.

see inline discussion above ...

In addition, the Travis tests fail.

would be great to know more about the Travis environment; their BLAS module seems to be somewhat older / different from what I have on my machine, grrr.

Owner

their BLAS module seems to be somewhat older / different from what I have on my machine, grrr.

Yeah, users tend to do that :)

Contributor

Yeah, users tend to do that :)

LOL

Contributor

@GaelVaroquaux @pgervais my recent commit addresses our discussion; tests are passing on my box. I now need to find a way to deal with BLAS versions.

@pgervais pgervais commented on an outdated diff Jul 26, 2013
sklearn/utils/extmath.py
@@ -58,6 +58,44 @@ def _fast_logdet_numpy(A):
fast_logdet = _fast_logdet
+def _impose_f_order(X):
+ """Helper Function"""
+ if X.flags.c_contiguous:
+ return array2d(X.T, copy=False, order='F'), True
+ else:
+ return array2d(X, copy=False, order='F'), False
pgervais
pgervais Jul 26, 2013 Contributor

array2d makes a copy if the order changes. Is that what you intend here?

pgervais
pgervais Jul 26, 2013 Contributor

Sorry, I got it wrong. This code works.

@agramfort agramfort and 1 other commented on an outdated diff Jul 26, 2013
sklearn/decomposition/fastica_.py
@@ -93,10 +93,12 @@ def _ica_par(X, tol, g, fun_args, max_iter, w_init):
del w_init
p_ = float(X.shape[1])
for ii in moves.xrange(max_iter):
- gwtx, g_wtx = g(np.dot(W, X), fun_args)
- W1 = _sym_decorrelation(np.dot(gwtx, X.T) / p_
- - g_wtx[:, np.newaxis] * W)
+ gwtx, g_wtx = g(fast_dot(W, X), fun_args)
+ # array2d needed here, in some corner cases gwtx is 1d
+ W1 = _sym_decorrelation(fast_dot(array2d(gwtx, copy=False), X.T) /
agramfort
agramfort Jul 26, 2013 Owner

why do you need the array2d since it's already 2D?

dengemann
dengemann Jul 26, 2013 Contributor

I found a corner case where, in a test, suddenly a 1d array would appear. BLAS is not forgiving ....

agramfort
agramfort Jul 26, 2013 Owner

BLAS is not forgiving ....

like MEG data analysis :)

@agramfort agramfort commented on an outdated diff Jul 26, 2013
sklearn/decomposition/tests/test_fastica.py
for whiten in [True, False]:
- for n_components in [5, 10]:
+ for n_components in [n1, n1]:
agramfort
agramfort Jul 26, 2013 Owner

should be n1, n2

@agramfort agramfort commented on an outdated diff Jul 26, 2013
sklearn/utils/extmath.py
@@ -58,6 +58,45 @@ def _fast_logdet_numpy(A):
fast_logdet = _fast_logdet
+def _impose_f_order(X):
+ """Helper Function"""
+ if X.flags.c_contiguous:
agramfort
agramfort Jul 26, 2013 Owner

use np.isfortran instead of direct access to flags
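
For reference, np.isfortran reports Fortran ordering without reaching into the flags attribute (a quick check, not PR code):

import numpy as np

X = np.ones((4, 4))       # C-ordered
print(np.isfortran(X))    # False
print(np.isfortran(X.T))  # True: the transposed view is F-ordered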

@agramfort agramfort commented on the diff Jul 26, 2013
sklearn/utils/tests/test_extmath.py
+ # col < row
+ C = np.dot(A.T, A)
+ C_ = fast_dot(A.T, A)
+ assert_array_equal(C, C_)
+
+ C = np.dot(A.T, B)
+ C_ = fast_dot(A.T, B)
+ assert_array_equal(C, C_)
+
+ C = np.dot(A, B.T)
+ C_ = fast_dot(A, B.T)
+ assert_array_equal(C, C_)
+
+ C = np.dot(A, B.T)
+ C_ = fast_dot(A, B.T)
+ assert_array_equal(C, C_)
agramfort
agramfort Jul 26, 2013 Owner

you do the same test twice

agramfort
agramfort Jul 27, 2013 Owner

you still do the same test twice. See below

@agramfort agramfort and 1 other commented on an outdated diff Jul 26, 2013
sklearn/utils/tests/test_extmath.py
+ C_ = fast_dot(A.T, A)
+ assert_array_equal(C, C_)
+
+ C = np.dot(A.T, B)
+ C_ = fast_dot(A.T, B)
+ assert_array_equal(C, C_)
+
+ C = np.dot(A, B.T)
+ C_ = fast_dot(A, B.T)
+ assert_array_equal(C, C_)
+
+ C = np.dot(A, B.T)
+ C_ = fast_dot(A, B.T)
+ assert_array_equal(C, C_)
+
+ # test square matrix * rectengular use case
agramfort
agramfort Jul 26, 2013 Owner

rectangular

agramfort
agramfort Jul 27, 2013 Owner

rectangular still

dengemann
dengemann Jul 27, 2013 Contributor

... I was sure I fixed this, weird.

dengemann
dengemann Jul 27, 2013 Contributor

@agramfort ok, addressed in a recent commit, along with the test repetition

agramfort
agramfort Jul 27, 2013 Owner

ask for an extra reviewer, but you have my +1 for merge

dengemann
dengemann Jul 27, 2013 Contributor

ask for an extra reviewer, but you have my +1 for merge

great! @pgervais @vene can anyone pull, take a look, and run the tests? I'd love to see the BLAS stuff running on a second machine before this is merged (watch out, import errors are silenced).

@agramfort agramfort commented on the diff Jul 26, 2013
sklearn/utils/tests/test_extmath.py
+ C_ = fast_dot(A, B.T)
+ assert_array_equal(C, C_)
+
+ # test square matrix * rectengular use case
+ A = np.random.random([2, 2])
+ for dtype in ['f8', 'f4']:
+ A = A.astype(dtype)
+ B = B.astype(dtype)
+
+ C = np.dot(A, B)
+ C_ = fast_dot(A, B)
+ assert_array_equal(C, C_)
+
+ C = np.dot(A.T, B)
+ C_ = fast_dot(A.T, B)
+ assert_array_equal(C, C_)
agramfort
agramfort Jul 26, 2013 Owner

why not A * B.T too?

dengemann
dengemann Jul 26, 2013 Contributor

... matrices don't support this ;-)

dengemann
dengemann Jul 27, 2013 Contributor

@ogrisel you can execute these lines and comment out one of the dots
while watching your memory / use a memory profiler

import numpy as np
n_samples = int(1e6 / 2)
rng = np.random.RandomState(42)
W = rng.random_sample([250, 250])
X = rng.random_sample([n_samples, 250])

print('estimated data size in memory')
print('%i MB' % (X.size * X.itemsize / 1e6))
print('%s' % X.dtype)

from sklearn.utils.extmath import fast_dot

# np.dot(W, X.T)

fast_dot(W, X.T)
dengemann
dengemann Jul 27, 2013 Contributor

@ogrisel

here are my timing benchmarks:

In [8]: timeit np.dot(W, X.T)
1 loops, best of 3: 7.24 s per loop

In [9]: timeit fast_dot(W, X.T)
1 loops, best of 3: 5.39 s per loop

dengemann
dengemann Jul 27, 2013 Contributor

The above results were for 1 GB of data;
here are the same benchmarks for 100 MB:

In [3]: timeit fast_dot(W, X.T)
1 loops, best of 3: 528 ms per loop

In [4]: timeit np.dot(W, X.T)
1 loops, best of 3: 611 ms per loop
Contributor

@agramfort addressed your comments.

Contributor

@agramfort seems we're finally done here.

Contributor

@agramfort rebased. Travis won't be happy, though, unless master is fixed.

Contributor

Rebased, updating to MRG

@ogrisel ogrisel commented on an outdated diff Jul 27, 2013
sklearn/utils/tests/test_extmath.py
@@ -266,3 +268,42 @@ def test_logistic_sigmoid():
extreme_x = np.array([-100, 100], dtype=np.float)
assert_array_almost_equal(logistic_sigmoid(extreme_x), [0, 1])
assert_array_almost_equal(logistic_sigmoid(extreme_x, log=True), [-100, 0])
+
+
+def test_fast_dot():
+ """Check fast dot blas wrapper function"""
+ A = np.random.random([2, 10])
+ B = np.random.random([2, 10])
ogrisel
ogrisel Jul 27, 2013 Owner

Please never use the np.random singleton in tests; rather do:

rng = np.random.RandomState(42)
A = rng.random_sample([2, 10])
B = rng.random_sample([2, 10])

The goal is to have all the tests run side-effect-free, to make them order independent.

@ogrisel ogrisel commented on the diff Jul 27, 2013
sklearn/utils/tests/test_extmath.py
+ C_ = fast_dot(A, B.T)
+ assert_array_equal(C, C_)
+
+ # test square matrix * rectangular use case
+ A = np.random.random([2, 2])
+ for dtype in ['f8', 'f4']:
+ A = A.astype(dtype)
+ B = B.astype(dtype)
+
+ C = np.dot(A, B)
+ C_ = fast_dot(A, B)
+ assert_array_equal(C, C_)
+
+ C = np.dot(A.T, B)
+ C_ = fast_dot(A.T, B)
+ assert_array_equal(C, C_)
ogrisel
ogrisel Jul 27, 2013 Owner

I have the following test failures: you should use assert_array_almost_equal instead:

======================================================================
FAIL: Check fast dot blas wrapper function
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/local/Cellar/python/2.7.5/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/nose/case.py", line 197, in runTest
    self.test(*self.arg)
  File "/Users/ogrisel/code/scikit-learn/sklearn/utils/tests/test_extmath.py", line 295, in test_fast_dot
    assert_array_equal(C, C_)
  File "/usr/local/Cellar/python/2.7.5/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/numpy/testing/utils.py", line 719, in assert_array_equal
    verbose=verbose, header='Arrays are not equal')
  File "/usr/local/Cellar/python/2.7.5/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/numpy/testing/utils.py", line 645, in assert_array_compare
    raise AssertionError(msg)
AssertionError:
Arrays are not equal

(mismatch 50.0%)
 x: array([[ 2.51309336,  3.09625551],
       [ 3.20171644,  3.40308068]])
 y: array([[ 2.51309336,  3.09625551],
       [ 3.20171644,  3.40308068]])
Owner
ogrisel commented Jul 27, 2013

You have to check with numpy 1.7.1, but I think this optimization has already been included in upstream numpy:

>>> import numpy as np
>>> from sklearn.utils.extmath import fast_dot
>>> A, B = np.random.normal(size=(1000, 1000)), np.random.normal(size=(1000, 1000))
>>> %timeit _ = fast_dot(A, B)
10 loops, best of 3: 71.9 ms per loop
>>> %timeit _ = np.dot(A, B)
10 loops, best of 3: 66.5 ms per loop

I am using numpy built against the OS X 10.8 system BLAS. Which BLAS do you have? MKL?

@ogrisel ogrisel commented on the diff Jul 27, 2013
sklearn/decomposition/tests/test_fastica.py
@@ -199,12 +199,12 @@ def test_fit_transform():
X = rng.random_sample((100, 10))
for whiten, n_components in [[True, 5], [False, 10]]:
- ica = FastICA(n_components=5, whiten=whiten, random_state=0)
+ ica = FastICA(n_components=n_components, whiten=whiten, random_state=0)
ogrisel
ogrisel Jul 27, 2013 Owner

This fix should be cherry-picked in another PR to be merged to master before the release.

Contributor

Thanks @ogrisel, addressed the testing issue in a recent commit. My BLAS should be MKL, binaries from Canopy64.

Contributor

@pgervais @agramfort @ogrisel @GaelVaroquaux @vene | whoever got a few minutes to spend on this

With this gist:

https://gist.github.com/dengemann/6094449

You can produce this plot:

(plot: fast_dot_profile)

It would be good to know whether you can get similar results on your machines across shapes / sizes.

Contributor

Here is my second benchmark, for n_features, n_samples = 5e3, 5e3:
(plot: fast_dot_square_matrix)

Owner

Can you open a new PR without fast_dot, just the ICA improvements?

Contributor

Can you open a new PR without fast_dot, just the ICA improvements?

Yes, that's possible. I think there aren't that many in this PR. Oh yes, the logcosh ....
I fear this is too interwoven to cherry-pick; I need to do it manually.

@dengemann dengemann ENH: directly call BLAS to further improve ICA memory
FIXES: dot products

ENH: fix BLAS, get tests running, reduce MEM

COSMITS + FIX version issue

ENH: equalize call signatures, address API

ENH: improve impose f order

ENH: return np.dot if blas not available

ENH catch attribute error

ENH: remove debugging support statements + checks

ENH: more fast dots

FIX/STY: addressing @agramfort 's comments

ENH: better 2d handling

COSMIT

what's new

COSMITS

ENH: better tests

ENH: another fast_dot
2ae2b8e
Contributor

Closing this PR to create two separate ones: fast_dot only + ica improvements.

@dengemann dengemann closed this Jul 28, 2013