Cannot use pgml.activate_venv() to set environment for parallel workers. #1147

Open
@higuoxing

Description

Currently, pgml provides a UDF called pgml.activate_venv() [1]. However, when a query uses parallel workers, the venv activated by this UDF is not applied to those workers. The problem is not easy to reproduce, but parallel queries are not rare in PostgreSQL. We could probably remove this UDF, since the GUC parameter 'pgml.venv' [2] already controls the venv path.
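
A minimal sketch of the GUC-based alternative mentioned above. It assumes that pgml.venv can be set at session level like an ordinary custom setting, and that each backend (including parallel workers) reads it when it initializes Python. Neither assumption is confirmed in this issue. The path is the one used in the reproduction steps.

```sql
-- Assumption: pgml.venv is accepted as a session-level setting.
SET pgml.venv = '/tmp/virtualenv';

-- Check the value the current backend sees.
SHOW pgml.venv;

-- Assumption: pgml now picks up the venv without a call to pgml.activate_venv().
SELECT pgml.validate_python_dependencies();
```

If these assumptions hold, the venv path would travel with the session's settings rather than with the state of a single backend process, which is the motivation for dropping the UDF.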

Steps to reproduce:

  1. The xgboost package is installed in my venv.

  2. Remove the IMMUTABLE qualifier from pgml.validate_python_dependencies so that the call is no longer folded into a constant at plan time and the UDF is actually executed, multiple times, on parallel workers.

    diff --git a/pgml-extension/src/api.rs b/pgml-extension/src/api.rs
    index ad952e48..440df23d 100644
    --- a/pgml-extension/src/api.rs
    +++ b/pgml-extension/src/api.rs
    @@ -27,7 +27,7 @@ pub fn activate_venv(venv: &str) -> bool {
     }
    
     #[cfg(feature = "python")]
    -#[pg_extern(immutable, parallel_safe)]
    +#[pg_extern(parallel_safe)]
     pub fn validate_python_dependencies() -> bool {
         unwrap_or_error!(crate::bindings::python::validate_dependencies())
     }

  3. Construct a query that involves parallel workers.

    CREATE TABLE t1(i int);
    INSERT INTO t1 VALUES(generate_series(1,500000));
    INSERT INTO t1 VALUES(generate_series(1,500000));
    INSERT INTO t1 VALUES(generate_series(1,500000));
    INSERT INTO t1 VALUES(generate_series(1,500000));
    
    pgml=# select pgml.activate_venv('/tmp/virtualenv');
    activate_venv
    ---------------
     t
    (1 row)
    
    pgml=# explain (analyze) select count(pgml.validate_python_dependencies()) 
    from t1;
    INFO:  Python version: 3.11.5 (main, Sep  2 2023, 14:16:33) [GCC 13.2.1 20230801]
    INFO:  Scikit-learn 1.3.0, XGBoost 2.0.1, LightGBM 4.1.0, NumPy 1.26.1
    INFO:  Python version: 3.11.5 (main, Sep  2 2023, 14:16:33) [GCC 13.2.1 20230801]
    ERROR:  The xgboost package is missing. Install it with `sudo pip3 install xgboost`
    ModuleNotFoundError: No module named 'xgboost'
    CONTEXT:  parallel worker
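
One way to check that the failure is specific to parallel workers is to disable parallel plans for the session and rerun the same query. This is a diagnostic sketch, not a fix. max_parallel_workers_per_gather is a standard PostgreSQL setting; with it set to 0, the planner should not launch workers for this query, so the whole count runs in the backend where pgml.activate_venv() was called.

```sql
-- Disable parallel workers for this session only.
SET max_parallel_workers_per_gather = 0;

-- Same query as in step 3; it should now run entirely in the leader backend.
EXPLAIN (ANALYZE) SELECT count(pgml.validate_python_dependencies()) FROM t1;

-- Restore the default afterwards.
RESET max_parallel_workers_per_gather;
```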
    

Footnotes

  1. https://github.com/postgresml/postgresml/blob/785815d47698551cfc59634e889b564b156e6a3e/pgml-extension/src/api.rs#L25

  2. https://github.com/postgresml/postgresml/blob/785815d47698551cfc59634e889b564b156e6a3e/pgml-extension/src/bindings/python/mod.rs#L12

Metadata

Labels: bug