Feature request: out-of-tree Pyodide builds for statsmodels #9166

Open
agriyakhetarpal opened this issue Feb 29, 2024 · 16 comments

@agriyakhetarpal

agriyakhetarpal commented Feb 29, 2024

Is your feature request related to a problem? Please describe

Hi there! I am opening this feature request to gather ideas and comments about out-of-tree Pyodide builds for statsmodels, i.e., wasm32 wheels built via the Emscripten toolchain. In my current work assignment, I am improving the interoperability of the Scientific Python ecosystem of packages with Pyodide and with each other. At a later phase of the project, this should culminate in interactive documentation for these packages that can be run in JupyterLite notebooks, enabled by nightly builds and wheels for these packages pushed to PyPI-like indices on Anaconda.

It looks like in-tree builds of statsmodels in Pyodide have been available for some time, as noted in the conversations in #7956, and they seem to be maintained with every Pyodide release. However, this issue proposes out-of-tree builds for statsmodels on its own CI and build infrastructure. I would be glad to work on this for statsmodels.

Describe the solution you'd like

  1. A CI pipeline on Azure Pipelines or GitHub Actions (I would prefer the latter, but I have no qualms with the former) where Emscripten/Pyodide builds for the development version of statsmodels are pursued
  2. Testing the built wheels against a Pyodide wasm32 runtime virtual environment within the same workflow
  3. Fixing up and skipping failing tests as necessary based on current Pyodide limitations and ensuring that all relevant test cases pass
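Item 3 could be handled with platform-conditional skips. The following is a minimal sketch, assuming the tests run under a Pyodide interpreter, where `sys.platform` is `"emscripten"`. The test class and test names are hypothetical and are not taken from the statsmodels test suite; it uses `unittest` only to stay dependency-free.

```python
import sys
import unittest

# Pyodide's interpreter reports "emscripten" as its platform.
IS_PYODIDE = sys.platform == "emscripten"


class ExampleTests(unittest.TestCase):
    # Hypothetical test: new threads cannot be started in a default Pyodide build.
    @unittest.skipIf(IS_PYODIDE, "threading is unavailable on Pyodide")
    def test_uses_threads(self):
        import threading

        results = []
        worker = threading.Thread(target=results.append, args=(1,))
        worker.start()
        worker.join()
        self.assertEqual(results, [1])

    # Runs everywhere, including under wasm32.
    def test_pure_python(self):
        self.assertEqual(sum(range(4)), 6)


def run_example_tests():
    """Run the example tests and return the unittest result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(ExampleTests)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

The same `sys.platform` condition can be passed to `pytest.mark.skipif` in a pytest-based suite.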

Describe alternatives you have considered

N/A

Additional context

This project is being tracked at Quansight-Labs/czi-scientific-python-mgmt#18 and has started out with packages like PyWavelets (PyWavelets/pywt#701) and NumPy (numpy/numpy#25894) being complete, thanks to @rgommers. Other packages besides statsmodels, such as matplotlib, zarr, pandas, and many more are planned to follow suit soon in the coming months.

@bashtage
Member

Someone who had significant time would need to contribute this. The core team is pretty small and mostly focused on quality and new features.

@agriyakhetarpal
Author

> Someone who had significant time would need to contribute this. The core team is pretty small and mostly focused on quality and new features.

Indeed, I would be happy to start working on this (I edited the issue description a bit in the time you responded). It might be a long ordeal given that this is to be done from the ground up – but as long as someone can help out with code review and suggestions, that would help streamline the process.

@bashtage
Member

I'm happy to try and help. I have no idea what the process looks like, since we:

  • Have templated cython code
  • Link to SciPy provided blas
  • Require patsy and pandas

@agriyakhetarpal
Author

agriyakhetarpal commented Feb 29, 2024

Thanks for sharing, @bashtage. The compilation procedure should take care of Cythonizing and building wasm32-compatible shared object files, but I am not sure how to get a BLAS distribution working on WASM (xref: OpenMathLib/OpenBLAS#4023). The requirements should be partly satisfied, since patsy==0.5.3 and pandas==1.5.3 exist in the current Pyodide packages [1], but from the requirements.txt file it looks like I will have to get patsy bumped to 0.5.6; let me see how that goes. It does look like the patsy project is no longer actively developed, but they are trying to keep compatibility with newer Python versions and with their dependents, and version 0.5.6 came out last month. I hope it is trivial to bump that for Pyodide.

Edit: patsy==0.5.6 is out in the development branch for Pyodide, all we need is to wait for a release from their end.
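The version gap described above can be checked mechanically. This is a small sketch under stated assumptions: the pins are the ones quoted in this comment, the minimum for patsy is the 0.5.6 mentioned above, and `parse_pin` and `pins_below_minimum` are hypothetical helpers that only understand exact `==` pins with numeric components.

```python
def parse_pin(requirement):
    """Split an exact pin such as 'patsy==0.5.3' into (name, version tuple)."""
    name, _, version = requirement.partition("==")
    return name.strip(), tuple(int(part) for part in version.strip().split("."))


def pins_below_minimum(available, minimums):
    """Return {name: (have, need)} for packages older than the required minimum."""
    have = dict(parse_pin(pin) for pin in available)
    outdated = {}
    for name, need in minimums.items():
        if name in have and have[name] < need:
            outdated[name] = (have[name], need)
    return outdated


# Pins quoted in the comment above; the minimum is the patsy version it mentions.
pyodide_pins = ["patsy==0.5.3", "pandas==1.5.3"]
required_minimums = {"patsy": (0, 5, 6)}
print(pins_below_minimum(pyodide_pins, required_minimums))
```

For real requirement strings with ranges or pre-release tags, a dedicated version parser would be the safer choice.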

Footnotes

  1. https://pyodide.org/en/stable/usage/packages-in-pyodide.html

@rgommers
Member

rgommers commented Mar 1, 2024

The use of scipy.linalg.cython_blas is the interesting one here I guess - I don't know yet how that works in the Pyodide build, just that it should work in principle because statsmodels and scikit-learn are both using that and are packaged in Pyodide. I'll note that it doesn't specifically require OpenBLAS; any BLAS library will do (and IIRC Pyodide used reference BLAS; getting OpenBLAS to work is about improving performance only).

@agriyakhetarpal maybe you can look into how this works in the in-tree Pyodide build for statsmodels? CI logs, code comments and/or patches may show that. The first questions I would ask are:

  • is scipy/linalg/cython_blas.pxd shipped inside a Pyodide-provided SciPy wheel?
  • if so, does the build use that, or does it use the cython_blas.pxd from a native (i.e., for the architecture of the build machine, usually x86-64) scipy package and rely on the .pxd file contents being the same between x86-64 and wasm32?
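The first question can be checked from inside whichever interpreter is of interest (Pyodide or native). A minimal sketch, assuming the package is installed as ordinary files on disk; `shipped_pxd_files` is a hypothetical helper name, and it returns an empty list when the package is not installed at all.

```python
import importlib.util
from pathlib import Path


def shipped_pxd_files(package_name):
    """List the .pxd files that sit next to an installed package's modules."""
    try:
        spec = importlib.util.find_spec(package_name)
    except ModuleNotFoundError:
        # A parent package (e.g. scipy) is not installed at all.
        return []
    if spec is None or not spec.submodule_search_locations:
        return []
    names = set()
    for location in spec.submodule_search_locations:
        names.update(path.name for path in Path(location).glob("*.pxd"))
    return sorted(names)


print(shipped_pxd_files("scipy.linalg"))
```

Running the same function in a native environment and in a Pyodide environment gives two lists that can be compared directly.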

@rgommers
Member

rgommers commented Mar 1, 2024

> but as long as someone can help out with code review and suggestions, that would help streamline the process.

For the record, I can help with this. I even still have my statsmodels commit rights, although I haven't exercised them in a long time.

@agriyakhetarpal
Author

agriyakhetarpal commented Mar 1, 2024

I spent some time going through links and files, and I believe I have some more insight into both questions, though probably not complete answers:

> is scipy/linalg/cython_blas.pxd shipped inside a Pyodide-provided SciPy wheel?

It looks like yes, they do ship it:

import scipy

scipy.linalg.cython_blas

returns the WASM shared object compiled with Emscripten:

<module 'scipy.linalg.cython_blas' from '/lib/python3.11/site-packages/scipy/linalg/cython_blas.cpython-311-wasm32-emscripten.so'>

> if so, does the build use that, or does it use the cython_blas.pxd from a native (i.e., for the architecture of the build machine, usually x86-64) scipy package and rely on the .pxd file contents being the same between x86-64 and wasm32?

I am not entirely sure that I followed that – I checked:

import os
os.listdir("/lib/python3.11/site-packages/scipy/linalg/")

which returns


['__init__.py',
 '_basic.py',
 '_blas_subroutine_wrappers.f',
 '_blas_subroutines.h',
 '_cythonized_array_utils.cpython-311-wasm32-emscripten.so',
 '_cythonized_array_utils.pxd',
 '_cythonized_array_utils.pyi',
 '_decomp.py',
 '_decomp_cholesky.py',
 '_decomp_cossin.py',
 '_decomp_ldl.py',
 '_decomp_lu.py',
 '_decomp_lu_cython.cpython-311-wasm32-emscripten.so',
 '_decomp_lu_cython.pyi',
 '_decomp_polar.py',
 '_decomp_qr.py',
 '_decomp_qz.py',
 '_decomp_schur.py',
 '_decomp_svd.py',
 '_decomp_update.cpython-311-wasm32-emscripten.so',
 '_expm_frechet.py',
 '_fblas.cpython-311-wasm32-emscripten.so',
 '_flapack.cpython-311-wasm32-emscripten.so',
 '_flinalg.cpython-311-wasm32-emscripten.so',
 '_flinalg_py.py',
 '_interpolative.cpython-311-wasm32-emscripten.so',
 '_interpolative_backend.py',
 '_lapack_subroutine_wrappers.f',
 '_lapack_subroutines.h',
 '_matfuncs.py',
 '_matfuncs_expm.cpython-311-wasm32-emscripten.so',
 '_matfuncs_expm.pyi',
 '_matfuncs_inv_ssq.py',
 '_matfuncs_sqrtm.py',
 '_matfuncs_sqrtm_triu.cpython-311-wasm32-emscripten.so',
 '_misc.py',
 '_procrustes.py',
 '_sketches.py',
 '_solve_toeplitz.cpython-311-wasm32-emscripten.so',
 '_solvers.py',
 '_special_matrices.py',
 '_testutils.py',
 'basic.py',
 'blas.py',
 'cython_blas.cpython-311-wasm32-emscripten.so',
 'cython_blas.pxd',
 'cython_blas.pyx',
 'cython_lapack.cpython-311-wasm32-emscripten.so',
 'cython_lapack.pxd',
 'cython_lapack.pyx',
 'decomp.py',
 'decomp_cholesky.py',
 'decomp_lu.py',
 'decomp_qr.py',
 'decomp_schur.py',
 'decomp_svd.py',
 'flinalg.py',
 'interpolative.py',
 'lapack.py',
 'matfuncs.py',
 'misc.py',
 'special_matrices.py'
]

where cython_blas.pxd is present. I inspected the contents of the file, and it is indeed different from what I have in a local M-series (arm64) macOS SciPy installation. In the Pyodide copy, Cython C-function declarations such as this one:

cdef int zgerc(int *m, int *n, z *alpha, z *x, int *incx, z *y, int *incy, z *a, int *lda) noexcept nogil

return int, while in my local copy the same function returns nothing (the void keyword is present instead of int):

cdef void zgerc(int *m, int *n, z *alpha, z *x, int *incx, z *y, int *incy, z *a, int *lda) noexcept nogil

I tracked this change down to these lines in the Pyodide recipe for SciPy, and since these files are generated by the build system (or Cython?) at compilation time, the difference in return type seems to be coming from when OpenBLAS was used for SciPy in pyodide/pyodide#3331 – this should be the related patch that induced this change.
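The comparison described here can be automated across a whole file. A minimal sketch, assuming the .pxd declarations follow the single-line `cdef <return type> <name>(...)` shape of the two lines quoted above; the regular expression and helper names are illustrative, and the inputs below are just those two quoted declarations rather than full files.

```python
import re

# Matches declarations such as "cdef int zgerc(int *m, ...) noexcept nogil".
DECLARATION = re.compile(r"^\s*cdef\s+(?P<ret>\w+)\s+(?P<name>\w+)\s*\(", re.MULTILINE)


def return_types(pxd_text):
    """Map each declared function name to its return type."""
    return {m.group("name"): m.group("ret") for m in DECLARATION.finditer(pxd_text)}


def differing_return_types(pxd_a, pxd_b):
    """Return {name: (type_a, type_b)} for functions declared in both texts."""
    types_a, types_b = return_types(pxd_a), return_types(pxd_b)
    return {
        name: (types_a[name], types_b[name])
        for name in sorted(types_a.keys() & types_b.keys())
        if types_a[name] != types_b[name]
    }


pyodide_pxd = "cdef int zgerc(int *m, int *n, z *alpha, z *x, int *incx, z *y, int *incy, z *a, int *lda) noexcept nogil"
native_pxd = "cdef void zgerc(int *m, int *n, z *alpha, z *x, int *incx, z *y, int *incy, z *a, int *lda) noexcept nogil"
print(differing_return_types(pyodide_pxd, native_pxd))  # {'zgerc': ('int', 'void')}
```

Passing the full text of both cython_blas.pxd files would list every routine whose declared return type differs between the two builds.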

As for statsmodels, the recipe does exist, but with no tests: https://github.com/pyodide/pyodide/tree/main/packages/statsmodels. So, yes, we should be able to use the reference BLAS or a simpler BLAS implementation for now, rather than OpenBLAS. But @bashtage mentions that statsmodels links to the SciPy-provided BLAS, so I am not sure what the Pyodide developers have been doing for compilation. This comment, pyodide/pyodide#537 (comment), mentions that if SciPy is unavailable at build time, then pure-Python implementations are used and extension modules are possibly not compiled. However, in the recipe for statsmodels above, SciPy is indeed listed as a build-time dependency, so I suppose the WASM-patched OpenBLAS provided by the WASM build of SciPy is being used to compile the latest available version of Pyodide's in-tree statsmodels. For earlier versions, pyodide/pyodide#2073 reveals that the cython_blas.pxd file was being patched, but this patch no longer exists after statsmodels was updated from v0.13.1 to v0.14.0.

How should we proceed with this? Based on inputs I receive, I can start conjuring a Pyodide build on my fork to test things out (and open a draft PR for visibility as needed if it helps with review and expedites things). I will have to use GitHub Actions, however – I do not have access or credentials for testing builds on Azure pipelines :)

@agriyakhetarpal
Author

agriyakhetarpal commented Mar 1, 2024

Minor update: I did compile a WASM wheel through a workflow on my fork, but there are linkage issues, and unresolved symbols appear.

Error: Dynamic linking error: cannot resolve symbol pow_di, with the logs here: https://github.com/agriyakhetarpal/statsmodels/actions/runs/8113306346/job/22176325503#step:9:38

Google search reveals some hits:

  1. https://opensource.apple.com/source/gcc/gcc-1435/libf2c/libF77/pow_di.c.auto.html
  2. https://github.com/fermi-lat/f2c/blob/master/pow_di.c

It looks like this is coming from the f2c F77 compiler, which I have no experience of using – this would be challenging to fix 😕 because Pyodide itself doesn't support Fortran to WASM translation very well so far (xref: pyodide/pyodide#184) and issues with f2c have appeared for SciPy modules: pyodide/pyodide#3380.

Note that there are also warnings from the compilation; I am not sure how to tell whether any of them is relevant to our case:

warning: statsmodels/tsa/statespace/_smoothers/_univariate_diffuse.pyx:558:14: Unreachable code
warning: statsmodels/tsa/statespace/_smoothers/_univariate_diffuse.pyx:1136:14: Unreachable code
warning: statsmodels/tsa/statespace/_smoothers/_univariate_diffuse.pyx:1714:14: Unreachable code
warning: statsmodels/tsa/statespace/_smoothers/_univariate_diffuse.pyx:2292:14: Unreachable code

out of which, the first one is:

blas.sgemm("N", "N", &model.k_states, &model.k_states, &model.k_states,
                  &gamma, &kfilter.predicted_state_cov[0,0,smoother.t+1], &kfilter.k_states,
                          smoother._input_scaled_smoothed_estimator_cov, &kfilter.k_states,
                  &beta, smoother._tmp0, &kfilter.k_states)

so this could also be an issue with the BLAS vendor being linked to. Is there some advice on debugging this (I am unsure how to go forward)?

@ChadFulton
Member

Unfortunately, I don't have much to contribute on the Fortran compiler side, although I can confirm that the following warnings are not a problem (just an early return with some code following):

warning: statsmodels/tsa/statespace/_smoothers/_univariate_diffuse.pyx:558:14: Unreachable code
warning: statsmodels/tsa/statespace/_smoothers/_univariate_diffuse.pyx:1136:14: Unreachable code
warning: statsmodels/tsa/statespace/_smoothers/_univariate_diffuse.pyx:1714:14: Unreachable code
warning: statsmodels/tsa/statespace/_smoothers/_univariate_diffuse.pyx:2292:14: Unreachable code

@bashtage
Member

bashtage commented Mar 1, 2024

Can you run it with more verbosity so I can see which file? I suspect this is somehow coming from an intermediate step or dependency, because statsmodels does not use Fortran.

@agriyakhetarpal
Author

Neither the Pyodide build command nor the pypa/build frontend has a flag to turn up the verbosity, so I passed the --verbose flag to setup.py – I don't know if that brought up some more output. Here are the logs for the workflow run: https://github.com/agriyakhetarpal/statsmodels/actions/runs/8114876518/job/22181528198

@bashtage
Member

bashtage commented Mar 1, 2024

This is not something that is in statsmodels, from what I can tell. It likely has to do with either BLAS or SciPy. I just checked the generated C source for statsmodels and pow_di does not appear.
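A similar search can be run against the compiled extension modules inside the built wheel. This is a rough sketch, not a real symbol-table parser: it assumes that imported symbol names are stored as plain strings in the wasm binaries, so a raw byte search is enough to narrow down which member mentions the name. `members_mentioning` is a hypothetical helper, and the demonstration uses an in-memory stand-in rather than an actual statsmodels wheel.

```python
import io
import zipfile


def members_mentioning(wheel, symbol):
    """Return the .so members of a wheel whose raw bytes contain `symbol`."""
    needle = symbol.encode("ascii")
    hits = []
    with zipfile.ZipFile(wheel) as archive:
        for name in archive.namelist():
            if name.endswith(".so") and needle in archive.read(name):
                hits.append(name)
    return sorted(hits)


# Self-contained demonstration on a fake wheel with made-up member names.
demo = io.BytesIO()
with zipfile.ZipFile(demo, "w") as archive:
    archive.writestr("pkg/_uses_pow.so", b"\x00asm...env\x06pow_di...")
    archive.writestr("pkg/_clean.so", b"\x00asm...nothing here...")
    archive.writestr("pkg/notes.txt", b"pow_di")
demo.seek(0)
print(members_mentioning(demo, "pow_di"))  # ['pkg/_uses_pow.so']
```

A hit only shows which module refers to the symbol; it does not say which library was expected to provide it.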

@bashtage
Member

bashtage commented Mar 1, 2024

What happens if you just python -c "import statsmodels.api"?

@agriyakhetarpal
Author

agriyakhetarpal commented Mar 1, 2024

> What happens if you just python -c "import statsmodels.api"?

That doesn't work; it fails to find the statsmodels.robust._qn module: https://github.com/agriyakhetarpal/statsmodels/actions/runs/8115706064/job/22184121018

I can confirm that the import passes locally.

@bashtage
Member

bashtage commented Mar 1, 2024

This just tells you that things are not getting compiled. robust._qn is a Cython module.

The log looks like it is building:

building 'statsmodels.robust._qn' extension
creating build/temp.emscripten_3_1_46_wasm32-cpython-311/statsmodels/robust
/tmp/tmpiahnkw84/cc -DNDEBUG -g -fwrapv -O3 -Wall -O2 -g0 -fPIC -DPY_CALL_TRAMPOLINE -DCYTHON_TRACE_NOGIL=0 -DNPY_NO_DEPRECATED_API=NPY_1_7_API_VERSION -I/tmp/build-env-gn3pcfdh/lib/python3.11/site-packages/numpy/core/include -I/tmp/build-env-gn3pcfdh/lib/python3.11/site-packages/numpy/core/include -I/tmp/build-env-gn3pcfdh/include -I/opt/hostedtoolcache/Python/3.11.3/x64/include/python3.11 -c statsmodels/robust/_qn.c -o build/temp.emscripten_3_1_46_wasm32-cpython-311/statsmodels/robust/_qn.o
/tmp/tmpiahnkw84/cc build/temp.emscripten_3_1_46_wasm32-cpython-311/statsmodels/robust/_qn.o -L/tmp/build-env-gn3pcfdh/lib/python3.11/site-packages/numpy/core/include/../lib -lnpymath -o build/lib.emscripten_3_1_46_wasm32-cpython-311/statsmodels/robust/_qn.cpython-311-wasm32-emscripten.so

Can you see it in your tree?

When I said "python" I meant the pyodide python. I don't really know how pyodide works, but I am assuming you can use it like standard Python.

Looks like it would be something like

node --experimental-repl-await
Welcome to Node.js v18.5.0.
Type ".help" for more information.
> const { loadPyodide } = require("pyodide");
> let pyodide = await loadPyodide();
> sm = pyodide.pyimport("statsmodels.api");

@agriyakhetarpal
Author

agriyakhetarpal commented Mar 1, 2024

Yes, I see in the logs that it is being copied into the wheel as well, after it gets compiled:

copying build/lib.emscripten_3_1_46_wasm32-cpython-311/statsmodels/robust/_qn.cpython-311-wasm32-emscripten.so -> build/bdist.emscripten_3_1_46_wasm32/wheel/statsmodels/robust
...
adding 'statsmodels/robust/_qn.cpython-311-wasm32-emscripten.so'

> Can you see it in your tree?

I do have statsmodels/robust/_qn.cpython-311-darwin.so locally, so no problems with Cython-ization there.
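The two filenames in this exchange differ only in their platform tag, which decides whether a given interpreter will consider the file at all. A minimal sketch of that matching rule, assuming the import system looks for the module name followed by one of the interpreter's extension suffixes; `importable_as` is a hypothetical helper, and the wasm suffix list is written out by hand from the filenames in this thread (a real interpreter's list may contain more entries).

```python
import importlib.machinery


def importable_as(module_name, filename, suffixes=None):
    """Return True if `filename` is `module_name` plus an accepted extension suffix."""
    if suffixes is None:
        # Suffixes accepted by the interpreter running this code.
        suffixes = importlib.machinery.EXTENSION_SUFFIXES
    return any(filename == module_name + suffix for suffix in suffixes)


# Assumed suffix list for a Pyodide interpreter, taken from the filenames above.
wasm_suffixes = [".cpython-311-wasm32-emscripten.so"]

print(importable_as("_qn", "_qn.cpython-311-wasm32-emscripten.so", wasm_suffixes))  # True
print(importable_as("_qn", "_qn.cpython-311-darwin.so", wasm_suffixes))  # False
```

Printing `importlib.machinery.EXTENSION_SUFFIXES` inside the Pyodide virtual environment would show the actual list to compare against.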

> When I said "python" I meant the pyodide python. I don't really know how pyodide works, but I am assuming you can use it like standard Python.

One can, actually. In the workflow file, I activated a special virtual environment created with pyodide venv, where the Python interpreter (python) is a wasm32 one and Node.js is, IIUC, used under the hood.

However, the Node.js code snippet you mentioned is usable directly too and should also return the same output.

It looks like the compilation is successful, but it cannot be loaded on WASM – otool does not help for the one I have locally:

$ otool -vL statsmodels/robust/_qn.cpython-311-darwin.so

statsmodels/robust/_qn.cpython-311-darwin.so:
        /usr/lib/libSystem.B.dylib (compatibility version 1.0.0, current version 1336.61.1)
        time stamp 2 Thu Jan  1 05:30:02 1970

which means that the shared object is self-contained (which should be good).
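otool only understands Mach-O files, so it cannot inspect the wasm32 artifact from the CI run. As a first sanity check on that file, its format can be identified from the leading magic bytes. A minimal sketch; `describe_binary` is a hypothetical helper and it recognises only the few formats relevant here.

```python
WASM_MAGIC = b"\x00asm"
MACHO_MAGICS = (b"\xcf\xfa\xed\xfe", b"\xce\xfa\xed\xfe", b"\xca\xfe\xba\xbe")
ELF_MAGIC = b"\x7fELF"


def describe_binary(header):
    """Classify a shared object by the first bytes of the file."""
    if header[:4] == WASM_MAGIC:
        # The four bytes after the magic hold the little-endian format version.
        version = int.from_bytes(header[4:8], "little")
        return f"WebAssembly module (version {version})"
    if header[:4] in MACHO_MAGICS:
        return "Mach-O binary"
    if header[:4] == ELF_MAGIC:
        return "ELF binary"
    return "unknown format"


print(describe_binary(b"\x00asm\x01\x00\x00\x00"))  # WebAssembly module (version 1)
```

In practice the header would come from reading the first eight bytes of the .so file opened in binary mode.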
