Conversation

Contributor

@MohamedBsh MohamedBsh commented Jan 22, 2022

Reference Issues/PRs

Fixes #20558.

What does this implement/fix? Explain your changes.

Possible performance improvement of FastICA.

Several contributors have demonstrated, with different benchmarks, that using np.einsum instead of np.dot can save both runtime and memory.
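
For illustration, here is a minimal sketch (not the exact PR diff; the shapes are arbitrary) of the kind of replacement involved: computing the row-wise inner products of two matrices directly instead of materialising the full product matrix and keeping only its diagonal.

import numpy as np

rng = np.random.default_rng(0)
W1 = rng.standard_normal((7, 7))
W = rng.standard_normal((7, 7))

# Before: allocates the full (n, n) product only to keep its diagonal.
rowwise_dot = np.diag(np.dot(W1, W.T))

# After: computes only the n row-wise inner products.
rowwise_einsum = np.einsum("ij,ij->i", W1, W)

assert np.allclose(rowwise_dot, rowwise_einsum)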

@jjerphan @chritter @norbusan

@MohamedBsh
Contributor Author

It seems that one of the tests failed because of the following warning:

sklearn/decomposition/_fastica.py:12:1: F401 'email.mime.base' imported but unused
from email.mime import base

How do I handle this?

Member

@jjerphan jjerphan left a comment

Nice to see you contributing, @MohamedBsh! 👋

Here are a few comments to help with this contribution.

Once you have addressed them, I think the only thing left will be the benchmarks.

To do this, you can:

  • Start from @ogrisel's script #20558 (comment) and adapt it to %timeit and %memit the sklearn.decomposition.FastICA.fit method
  • Report the result you get for scikit-learn current implementation (i.e. on main)
  • Report the result you get on this PR branch (i.e. perfImprovementFastICA)

Let me know if you need more information.
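
For reference, a minimal sketch of such a benchmarking session (assuming an IPython shell with memory_profiler installed; the dataset and parameters below are only placeholders):

from sklearn.datasets import load_digits
from sklearn.decomposition import FastICA

X, _ = load_digits(return_X_y=True)
transformer = FastICA(n_components=7, random_state=0, whiten="unit-variance")

# Run once on main and once on the PR branch, then compare the numbers.
%load_ext memory_profiler
%timeit transformer.fit(X)  # runtime
%memit transformer.fit(X)   # peak memory and increment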

@jjerphan jjerphan changed the title memory improvements and fast execution with np.einsum than np.dot FIX Optimise decomposition.FastICA.fit memory footprint and runtime Jan 22, 2022
@jjerphan jjerphan changed the title FIX Optimise decomposition.FastICA.fit memory footprint and runtime ENH Optimise decomposition.FastICA.fit memory footprint and runtime Jan 22, 2022
@MohamedBsh
Contributor Author

MohamedBsh commented Jan 22, 2022

Branch: perfImprovementFastICA

Hey @jjerphan, I ran the following examples taken from the documentation; here are the results comparing the main branch and this branch:

Example 1 : load_digits

from sklearn.datasets import load_digits
from sklearn.decomposition import FastICA
X, _ = load_digits(return_X_y=True)
transformer = FastICA(n_components=7, random_state=0, whiten='unit-variance')

Result - perfImprovementFastICA branch:

%timeit transformer.fit_transform(X)

> 156 ms ± 8.76 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%memit transformer.fit_transform(X)

> peak memory: 96.81 MiB, increment: -2.62 MiB

Result - main branch:

%timeit transformer.fit_transform(X)

> 148 ms ± 9.6 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%memit transformer.fit_transform(X)

> peak memory: 104.30 MiB, increment: 0.09 MiB

Example 2 : load_breast_cancer

from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import FastICA
X, _ = load_breast_cancer(return_X_y=True)
transformer = FastICA(n_components=7, random_state=0, whiten='unit-variance')

Result - perfImprovementFastICA branch:

%timeit transformer.fit_transform(X)

> 7.86 ms ± 2.81 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)

%memit transformer.fit_transform(X)

> peak memory: 97.33 MiB, increment: 1.51 MiB

Result - main branch:

%timeit transformer.fit_transform(X)

> 7.78 ms ± 1.77 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)

%memit transformer.fit_transform(X)

> peak memory: 97.76 MiB, increment: 0.18 MiB

@MohamedBsh
Contributor Author

Hmm, to be honest, I don't know how to fix the errors related to azure-pipelines:

Build log #L13
The HTTP request timed out after 00:00:30.

@jjerphan
Member

Thanks for the benchmark. Can you use bigger synthetic datasets?

Azure CI runs sometimes fail randomly. The only thing we can do is restart them.

To trigger the CI again, you can create an empty commit:

git commit --allow-empty -m "CI Rerun CI"

and push it.

@MohamedBsh
Contributor Author

MohamedBsh commented Jan 26, 2022

Hey, sorry for the late reply.

I have generated two synthetic datasets and also benchmarked the real-world dataset fetch_california_housing from the documentation examples.

Here are the results obtained between the main branch and this branch:

Dataset fetch_california_housing
Samples total: 20640 rows

MAIN

from sklearn.datasets import fetch_california_housing
from sklearn.decomposition import FastICA
X, _ = fetch_california_housing(return_X_y=True)
transformer = FastICA(n_components=7, random_state=0, whiten='unit-variance')

%timeit transformer.fit_transform(X)

> 35.6 ms ± 3 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%load_ext memory_profiler
%memit transformer.fit_transform(X)

> peak memory: 98.03 MiB, increment: 0.08 MiB

Current branch

%timeit transformer.fit_transform(X)

> 36.6 ms ± 3.8 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%memit transformer.fit_transform(X)

> peak memory: 93.72 MiB, increment: 0.14 MiB

1st Synthetic Dataset
n_samples = 50000

from sklearn.decomposition import FastICA
import sklearn.datasets as dt
rand_state = 11
noise = 0.2
X,Y = dt.make_regression(n_samples=50000,
                             n_features=2,
                             noise=noise,
                             random_state=rand_state)
transformer = FastICA(n_components=7, random_state=0, whiten='unit-variance')

MAIN

%timeit transformer.fit_transform(X)

> 35.5 ms ± 11.5 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%memit transformer.fit_transform(X)

> peak memory: 89.88 MiB, increment: -0.24 MiB

Current branch

%timeit transformer.fit_transform(X)

> 33.7 ms ± 7.19 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%memit transformer.fit_transform(X)

> peak memory: 103.87 MiB, increment: 0.18 MiB

2nd Synthetic Dataset
n_samples = 100000

from sklearn.decomposition import FastICA
import sklearn.datasets as dt
rand_state = 11
noise = 0.2
X,Y = dt.make_regression(n_samples=100000,
                             n_features=2,
                             noise=noise,
                             random_state=rand_state)
transformer = FastICA(n_components=7, random_state=0, whiten='unit-variance')

MAIN

%timeit transformer.fit_transform(X)

> 35.4 ms ± 5.67 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%memit transformer.fit_transform(X)

> peak memory: 113.36 MiB, increment: 0.15 MiB

Current branch

%timeit transformer.fit_transform(X)

> 46.5 ms ± 15.4 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%memit transformer.fit_transform(X)

> peak memory: 97.33 MiB, increment: -0.48 MiB

The results seem to indicate a memory footprint gain, but the picture on the runtime side is rather unclear. Feel free to react and to reproduce these tests.

Member

@thomasjpfan thomasjpfan left a comment

Note: for the benchmarks you showed, n_components is small, which means np.einsum will spend more of its time parsing the string "ij,ij->i" than on the actual computation. (A small matrix means less computation and less memory, while parsing the string is a constant cost.)
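
To see this effect in isolation, a minimal micro-benchmark (not from the PR; the shapes are arbitrary and diag(dot) stands in for the expression being replaced) might look like:

import numpy as np
from timeit import timeit

rng = np.random.default_rng(0)
for n in (7, 500):  # small vs. large n_components
    W1 = rng.standard_normal((n, n))
    W = rng.standard_normal((n, n))
    t_dot = timeit(lambda: np.diag(np.dot(W1, W.T)), number=200)
    t_einsum = timeit(lambda: np.einsum("ij,ij->i", W1, W), number=200)
    print(f"n={n}: diag(dot) {t_dot:.4f}s, einsum {t_einsum:.4f}s")

For small n the constant einsum overhead is expected to dominate, while for large n the reduced computation and allocation should win.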

Given this benchmark with n_components=500

Benchmark script
from sklearn.decomposition import FastICA
from time import perf_counter
from sklearn.datasets import make_blobs
from tqdm import trange
from statistics import mean, stdev
from sklearn.exceptions import ConvergenceWarning
import warnings

warnings.filterwarnings("ignore", category=ConvergenceWarning)

X, Y = make_blobs(n_samples=10000, n_features=500, random_state=0)

transformer = FastICA(n_components=500, random_state=0, whiten="unit-variance", max_iter=10)

n_repeat = 10
durations = []

for i in trange(n_repeat):
    start = perf_counter()
    transformer.fit(X)
    duration = perf_counter() - start
    durations.append(duration)

print(f"{mean(durations):.2f} +/- {stdev(durations):.2f}")

This PR: 3.67 +/- 0.08, and on main: 4.75 +/- 0.99

In theory, the memory saving for the above case is 4 * (500 * 500 - 500) / 10^6 ≈ 1 MB per iteration, which reduces memory pressure. For memory profiling I used:

Memory profiling script
from sklearn.decomposition import FastICA
from time import perf_counter
from sklearn.datasets import make_blobs
from statistics import mean, stdev

from sklearn.exceptions import ConvergenceWarning
import warnings

warnings.filterwarnings("ignore", category=ConvergenceWarning)

X, Y = make_blobs(n_samples=10000, n_features=500, random_state=0)

transformer = FastICA(
    n_components=500, random_state=0, whiten="unit-variance", max_iter=10
)

transformer.fit(X)

Using scalene line-by-line memory profiling, I see that on main the original line used 4 MB, while with this PR the changed line uses 0.0 MB (I'm guessing the actual ~0.002 MB collapsed to zero in the display).

Summary

There is a small gain in memory, and there is a runtime benefit for high values of n_components. Overall I am +1 on this.

@jjerphan
Member

jjerphan commented Mar 4, 2022

Hi @MohamedBsh, do you still have time to work on this?

@MohamedBsh
Contributor Author

MohamedBsh commented Mar 26, 2022

Hello @jjerphan, @thomasjpfan's benchmark seems relevant. However, I can't reproduce the results with my example. When I run this code on main or on this branch, I get the following warnings:

from sklearn.datasets import fetch_california_housing
from sklearn.decomposition import FastICA
X, _ = fetch_california_housing(return_X_y=True)
transform = FastICA(n_components=500, random_state=0, whiten='unit-variance')
../scikit-learn/sklearn/decomposition/_fastica.py:540: UserWarning: n_components is too large: it will be set to 8

=> I can't set n_components=500 and do comparisons like @thomasjpfan. n_components is automatically set to 8 (this is equivalent to doing the initial benchmark with n_components=7).

transform = FastICA(n_components=500, random_state=0, whiten='unit-variance', max_iter=10)
../scikit-learn/sklearn/decomposition/_fastica.py:116: ConvergenceWarning: FastICA did not converge. Consider increasing tolerance or the maximum number of iterations.

-> When I set max_iter=10, FastICA does not converge.

Is it possible to have more information about this?

@jjerphan @thomasjpfan thank you for your time!

@ogrisel
Member

ogrisel commented Apr 5, 2022

=> I can't set n_components=500 and do comparisons like @thomasjpfan. n_components is automatically set to 8 (this is equivalent to doing the initial benchmark with n_components=7).

This is expected since the California housing dataset only has 8 features. The performance improvement can only be measured on wider datasets.
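
For example, a minimal sketch (the parameters are only placeholders) of a synthetic dataset wide enough that n_components=500 is not silently reduced:

from sklearn.datasets import make_blobs
from sklearn.decomposition import FastICA

# 500 features, so FastICA can actually keep n_components=500.
X, _ = make_blobs(n_samples=10_000, n_features=500, random_state=0)
transformer = FastICA(n_components=500, random_state=0,
                      whiten="unit-variance", max_iter=10)
transformer.fit(X)  # expect a ConvergenceWarning with max_iter=10; fine for timing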

Member

@ogrisel ogrisel left a comment

LGTM as well once #22268 (comment) is accepted.

@MohamedBsh please also document the performance improvement in a dedicated changelog entry in doc/whats_new/v1.1.rst.

@ogrisel
Member

ogrisel commented Apr 5, 2022

I merged the main branch into this PR to check whether it fixes the Circle CI problem.

@MohamedBsh
Contributor Author

Done.
Thank you for your time and your explanations @ogrisel @lorentzenchr @jjerphan @thomasjpfan.
I am +1 on this.

Member

@lorentzenchr lorentzenchr left a comment

LGTM

@lorentzenchr
Member

@MohamedBsh Could you merge the main branch into this branch once more to resolve the merge conflicts (in the what's new file)? Then we're ready to merge this PR.

Member

@jjerphan jjerphan left a comment

LGTM

@lorentzenchr lorentzenchr merged commit 8308646 into scikit-learn:main Jun 14, 2022