[RFC] WASM / pyodide as a (somewhat) officially supported platform for scikit-learn #23727

Open
ogrisel opened this issue Jun 22, 2022 · 6 comments
Labels
Needs Investigation Issue requires investigation RFC

Comments

@ogrisel
Member

ogrisel commented Jun 22, 2022

We have started receiving bug reports (at least one indirect, in-person report at a conference: #23707) from users of scikit-learn in WASM environments (e.g. pyodide / jupyterlite, pyscript...).

Shall we invest effort in setting CI tooling to properly test and maybe even handle packaging of scikit-learn to target that platform?

It's very likely that not all of scikit-learn will work out of the box, but with proper tooling in place we could maintain a public list of modules whose tests all pass, and maybe a list of modules that required some patches to degrade gracefully on this platform (e.g. the number of parallel worker threads with n_jobs).
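As a rough illustration of what graceful degradation for n_jobs could look like, here is a minimal sketch. The helper name `resolve_n_jobs` is hypothetical and not part of scikit-learn or joblib; the only fact it relies on is that Pyodide reports `sys.platform == "emscripten"`. Treating `"wasi"` the same way is an assumption.

```python
import sys
import warnings

# Platforms assumed here to lack working thread/process pools.
# "emscripten" is what Pyodide reports; "wasi" is included as an assumption.
_WASM_PLATFORMS = ("emscripten", "wasi")


def resolve_n_jobs(n_jobs, platform=None):
    """Hypothetical helper: fall back to a single worker on WASM platforms.

    On any other platform, n_jobs is returned unchanged.
    """
    if platform is None:
        platform = sys.platform
    if platform in _WASM_PLATFORMS:
        if n_jobs not in (None, 1):
            warnings.warn(
                f"n_jobs={n_jobs} is not supported on {platform}; "
                "falling back to n_jobs=1.",
                UserWarning,
            )
        return 1
    return n_jobs
```

The `platform` argument only exists to make the behaviour easy to check from a regular (non-WASM) Python session.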

@rth put some interesting info in the following comment on how to run the tests:

Pros:

  • WASM is likely to be a very popular target platform, at least for education (can directly teach Python programming and ML concepts without having to teach how to install packages from the command line first).

Cons:

  • test execution is probably much slower than on our regular CI targets;
  • need to maintain a list of known issues / limitations;
  • more packaging work: the release process will become even more complicated;
  • SciPy is quite heavily patched because there is no working Fortran compiler on that platform (that might change soon with LFortran), so it relies on a semi-hackish Fortran to C transpilation step that introduces additional complexity.
@github-actions github-actions bot added the Needs Triage Issue requires triage label Jun 22, 2022
@ogrisel ogrisel changed the title RFC WASM / pyodide as a (somewhat) officially supported platform for scikit-learn [RFC] WASM / pyodide as a (somewhat) officially supported platform for scikit-learn Jun 22, 2022
@ogrisel ogrisel added RFC Needs Decision - Include Feature Requires decision regarding including feature Needs Investigation Issue requires investigation and removed Needs Triage Issue requires triage labels Jun 22, 2022
@ogrisel
Member Author

ogrisel commented Jun 22, 2022

Maybe we should start by measuring the fraction of modules with failing tests and scanning through the failures to estimate how many stem from a common, known upstream limitation. For each limitation, we could then assess how likely it is to be lifted in the short to medium term, or whether there is a reasonably maintainable workaround that we might want to include in the scikit-learn code base or as an external packaging patch.

@rth
Member

rth commented Jun 22, 2022

Yes, I think running it in CI is probably a bit early (and it would also be really slow). A few more investigative steps would be a good start:

  1. Manually run the test suite module by module and see what fails (and report upstream). This would already be very helpful. The last time I did this was in 2018, in "Package scikit-learn" pyodide/pyodide#139 (comment), and the situation should be much better now.
  2. Figure out how best to run the test suite programmatically for a large package such as scikit-learn. Pyodide has a pytest plugin, which will be better packaged once "TST Make pyodide-test-runner installable" pyodide/pyodide#2742 is merged. Once installed, it exposes pytest fixtures that allow running Python code in Pyodide, either inside a browser (Chrome or Firefox) with selenium or in Node.js. For now, though, this package has no users outside of Pyodide, so more work is likely needed to make it standalone and reusable by external packages.

Then, once we have a way to run Python code inside the browser from a Python script (or pytest) on the host, the question remains how best to run the full scikit-learn test suite. The problem is that running pytest.main over the full package directly takes a while, and no feedback is reported to the user until the run completes. Furthermore, if there is a fatal error somewhere in scipy (similar to a segfault in terms of outcome), the whole session crashes. So it's probably better to run pytest inside WASM on smaller chunks, serialize the results back, and concatenate them on the host. I was wondering about something similar in pytest-dev/pytest-xdist#336, since in the end the problem is very similar to running pytest on a remote node (except that communication does not happen over the network).
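The chunked approach could be sketched roughly as follows. This is only an illustration under assumptions: `run_in_chunks` and `merge_reports` are hypothetical names, the JSON layout is made up, and how the serialized string travels from the WASM side to the host is left out. The only real API used is `pytest.main` with the `--pyargs` option. Note that a fatal error of the WASM runtime would not surface as a Python exception, so the host would still need to detect a chunk that never reports back.

```python
import json


def pytest_runner(module):
    """Run the tests of one importable module and return pytest's exit code."""
    import pytest  # imported lazily so the helpers below work without pytest

    return int(pytest.main(["--pyargs", module, "-q"]))


def run_in_chunks(modules, runner=pytest_runner):
    """Run each module separately and return the results as a JSON string.

    Intended to run on the WASM side; the JSON string is what gets sent
    back to the host.
    """
    results = []
    for module in modules:
        try:
            exit_code = runner(module)
            status = "passed" if exit_code == 0 else "failed"
        except Exception as exc:  # keep going if one chunk blows up
            exit_code, status = None, f"error: {exc!r}"
        # Report progress after every chunk instead of at the very end.
        print(f"{module}: {status}")
        results.append(
            {"module": module, "exit_code": exit_code, "status": status}
        )
    return json.dumps(results)


def merge_reports(serialized_reports):
    """Host side: concatenate the per-chunk JSON reports into one list."""
    merged = []
    for serialized in serialized_reports:
        merged.extend(json.loads(serialized))
    return merged
```

Passing a different `runner` makes it possible to try the plumbing without pytest or scikit-learn installed.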

In any case, if anyone is interested in investigating this, I'd be happy to talk more about it.

@ogrisel
Member Author

ogrisel commented Jun 22, 2022

Thanks for the summary, I agree with your plan.

@ogrisel
Member Author

ogrisel commented Jun 22, 2022

Once the test runner tooling is improved, we could imagine a nightly job that runs the test suite of each top-level scikit-learn module and consolidates a report listing which modules work without any failure, which run with some test failures, and which cause an unrecoverable crash or a fatal error of the WASM runtime environment (it would be great to automatically collect the post-mortem output of the browser's JS console in such a case).
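To make the three report categories concrete, here is a small sketch of the consolidation step. The function name `consolidate` and the input format (module name mapped to a `(n_failed, crashed)` pair) are assumptions made for illustration, not an existing tool.

```python
def consolidate(outcomes):
    """Group modules into the three categories of the nightly report.

    outcomes maps a module name to a (n_failed, crashed) pair, where
    n_failed is the number of failing tests and crashed tells whether
    the WASM runtime died while running that module (hypothetical format).
    """
    report = {"ok": [], "failures": [], "crash": []}
    for module, (n_failed, crashed) in sorted(outcomes.items()):
        if crashed:
            # A runtime crash takes precedence over any counted failures.
            report["crash"].append(module)
        elif n_failed > 0:
            report["failures"].append(module)
        else:
            report["ok"].append(module)
    return report
```

A crash is checked first because a module that killed the runtime cannot report a reliable failure count.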

@amueller
Member

Btw, I think one of the benefits we'd get from WASM support is the ability to have interactive examples in the browser in the docs. I think that'll be a game changer for documentation.

@lesteve
Member

lesteve commented Oct 10, 2022

I put together a repo to run the scikit-learn tests inside Pyodide, and listed the issues I have spotted so far there. This will need more investigation. If you have any feedback, let me know!

https://github.com/lesteve/scikit-learn-tests-pyodide
