XGBRegressor fit method numpy type error #96

Closed
PrestonBlackburn opened this issue Apr 10, 2024 · 4 comments

@PrestonBlackburn

Hey, I'm having an issue where the XGBRegressor fit method throws a numpy type error. Snowpark ML installs numpy version 1.24, but when I run the .fit() method I get an error related to the numpy version (deprecation of the float alias).

Original Code (simplified, but still errors)

from snowflake.ml.modeling.model_selection import GridSearchCV
from snowflake.ml.modeling.xgboost import XGBRegressor
from snowflake.snowpark.types import DecimalType, IntegerType, DoubleType
from snowflake.snowpark.functions import cast

df_typed = df.select(cast(df["SEASON"], IntegerType()).as_("SEASON"),
                     cast(df["HOLIDAY"], IntegerType()).as_("HOLIDAY"),
                     cast(df["WORKINGDAY"], IntegerType()).as_("WORKINGDAY"),
                     cast(df["COUNT"], DoubleType()).as_("COUNT"))

param_grid = {
        "max_depth":[3, 4, 5, 6, 7, 8],
        "min_child_weight":[1, 2, 3, 4],
}

grid_search = GridSearchCV(
    estimator=XGBRegressor(),
    param_grid=param_grid,
    n_jobs = -1,
    scoring="neg_root_mean_squared_error",
    input_cols=["SEASON", "HOLIDAY", "WORKINGDAY"],
    label_cols=["COUNT"],
    output_cols=['PREDICTED_COUNT']
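)

# The error is raised by this call (see the traceback later in the thread)
grid_search.fit(df_typed)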

Error

AttributeError: module 'numpy' has no attribute 'float'.
`np.float` was a deprecated alias for the builtin `float`. To avoid this error in existing code, use `float` by itself. Doing this will not modify any behavior and is safe. If you specifically wanted the numpy scalar type, use `np.float64` here.
The aliases was originally deprecated in NumPy 1.20; for more details and guidance see the original release note at:
    https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations

Environment
Python version: 3.8.3

snowflake-ml-python==1.4.0
snowflake-snowpark-python==1.14.0
snowflake-connector-python==3.6.0
numpy==1.24.4

If I downgrade numpy to <1.20, the GridSearchCV import fails instead:

cannot import name 'GridSearchCV' from 'snowflake.ml.modeling.model_selection'
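
The error message above says the alias was deprecated in NumPy 1.20; it was then removed outright in NumPy 1.24, which is why 1.24.4 raises an AttributeError instead of a warning. A minimal sketch of that timeline, using a hypothetical helper name (np_float_alias_status) that is not part of NumPy or snowflake-ml:

```python
def np_float_alias_status(numpy_version):
    """Classify how a given NumPy version treats the `np.float` alias.

    Hypothetical helper for illustration only.
    Thresholds: deprecated in 1.20, removed in 1.24.
    """
    major, minor = (int(part) for part in numpy_version.split(".")[:2])
    if (major, minor) >= (1, 24):
        return "removed"      # accessing np.float raises AttributeError
    if (major, minor) >= (1, 20):
        return "deprecated"   # accessing np.float emits a DeprecationWarning
    return "available"        # np.float is a plain alias for the builtin float


if __name__ == "__main__":
    for version in ("1.19.5", "1.20.0", "1.24.4"):
        print(version, np_float_alias_status(version))
```

Any library that still references np.float at import time will therefore fail to import under the numpy==1.24.4 pin listed above.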
@sfc-gh-afero sfc-gh-afero self-assigned this Apr 18, 2024
@sfc-gh-afero

sfc-gh-afero commented Apr 18, 2024

Hi @PrestonBlackburn - I'd be happy to help you. In an attempt to reproduce, I made a script that loads sample sklearn data, does a similar snowflake casting, and runs the grid search fit.

I was unable to reproduce your error with the script I made below, so there may be an interaction with your particular data. Two questions:

  1. Do you know what line of code the error is being thrown from?
  2. Is it possible to share a fully reproducible example so that I can reproduce on my end?

In an attempt to find the deprecated attributes, I did a search for np.float in our code base but I don't see any references, so I'd be curious to see where the error is being raised from.

My repro:

From pip freeze:

numpy==1.24.4
snowflake-connector-python==3.6.0
snowflake-ml-python==1.4.0
snowflake-snowpark-python==1.14.0

Script:

from snowflake.ml.modeling.model_selection import GridSearchCV
from snowflake.ml.modeling.xgboost import XGBRegressor
from snowflake.snowpark.types import DecimalType, IntegerType, DoubleType
from snowflake.snowpark.functions import cast
from snowflake.snowpark import Session
from sklearn.datasets import fetch_california_housing


INPUT_COLS = ["MEDINC", "AVEROOMS", "LATITUDE", "LONGITUDE"]
LABEL_COLS = ["MEDHOUSEVAL"]

session = Session.builder.create()

def load_housing_data():
    input_df_pandas = fetch_california_housing(as_frame=True).frame
    input_df_pandas.columns = [c.upper() for c in input_df_pandas.columns]
    input_df = session.create_dataframe(input_df_pandas)

    return input_df

df = load_housing_data()


df_typed = df.select(cast(df["MEDINC"], DoubleType()).as_("MEDINC"),
                     cast(df["AVEROOMS"], IntegerType()).as_("AVEROOMS"),
                     cast(df["LATITUDE"], DoubleType()).as_("LATITUDE"),
                     cast(df["LONGITUDE"], DoubleType()).as_("LONGITUDE"),
                     cast(df["MEDHOUSEVAL"], DoubleType()).as_("MEDHOUSEVAL"))

param_grid = {
        "max_depth":[3, 4, 5, 6, 7, 8],
        "min_child_weight":[1, 2, 3, 4],
}

grid_search = GridSearchCV(
    estimator=XGBRegressor(),
    param_grid=param_grid,
    n_jobs = -1,
    scoring="neg_root_mean_squared_error",
    input_cols=INPUT_COLS,
    label_cols=LABEL_COLS,
    output_cols=['PREDICTED_COUNT']
)

grid_search.fit(df_typed)

@PrestonBlackburn

Hey, thanks for following up. I tested the same code with the same Python 3.8.3 version and requirements in a separate conda env, but when I did that, it worked. I guess it must have been some sort of issue with that particular environment.

Here is the full error, but since I can't reproduce it in another environment, I think this can probably be closed.

Traceback (most recent call last):
  File "C:\Users\Preston\anaconda3\lib\site-packages\snowflake\ml\_internal\telemetry.py", line 367, in wrap
    res = func(*args, **kwargs)
  File "C:\Users\Preston\anaconda3\lib\site-packages\snowflake\ml\modeling\model_selection\grid_search_cv.py", line 331, in fit
    self._sklearn_object = model_trainer.train()
  File "C:\Users\Preston\anaconda3\lib\site-packages\snowflake\ml\modeling\_internal\snowpark_implementations\snowpark_trainer.py", line 433, in train
    fit_wrapper_sproc = self._get_fit_wrapper_sproc(statement_params=statement_params)
  File "C:\Users\Preston\anaconda3\lib\site-packages\snowflake\ml\modeling\_internal\snowpark_implementations\snowpark_trainer.py", line 253, in _get_fit_wrapper_sproc
    model_spec = ModelSpecificationsBuilder.build(model=self.estimator)
  File "C:\Users\Preston\anaconda3\lib\site-packages\snowflake\ml\modeling\_internal\model_specifications.py", line 132, in build
    return SklearnModelSelectionModelSpecifications()
  File "C:\Users\Preston\anaconda3\lib\site-packages\snowflake\ml\modeling\_internal\model_specifications.py", line 97, in __init__
    import lightgbm
  File "C:\Users\Preston\anaconda3\lib\site-packages\lightgbm\__init__.py", line 8, in <module> 
    from .basic import Booster, Dataset, register_logger
  File "C:\Users\Preston\anaconda3\lib\site-packages\lightgbm\basic.py", line 17, in <module>   
    from .compat import PANDAS_INSTALLED, concat, dt_DataTable, is_dtype_sparse, pd_DataFrame, pd_Series
  File "C:\Users\Preston\anaconda3\lib\site-packages\lightgbm\compat.py", line 114, in <module> 
    from dask.array import Array as dask_Array
  File "C:\Users\Preston\anaconda3\lib\site-packages\dask\array\__init__.py", line 3, in <module>
    from .core import (
  File "C:\Users\Preston\anaconda3\lib\site-packages\dask\array\core.py", line 22, in <module>  
    from . import chunk
  File "C:\Users\Preston\anaconda3\lib\site-packages\dask\array\chunk.py", line 7, in <module>  
    from . import numpy_compat as npcompat
  File "C:\Users\Preston\anaconda3\lib\site-packages\dask\array\numpy_compat.py", line 21, in <module>
    np.divide(0.4, 1, casting="unsafe", dtype=np.float),
  File "C:\Users\Preston\anaconda3\lib\site-packages\numpy\__init__.py", line 305, in __getattr__
    raise AttributeError(__former_attrs__[attr])
AttributeError: module 'numpy' has no attribute 'float'.
`np.float` was a deprecated alias for the builtin `float`. To avoid this error in existing code, use `float` by itself. Doing this will not modify any behavior and is safe. If you specifically wanted the numpy scalar type, use `np.float64` here.
The aliases was originally deprecated in NumPy 1.20; for more details and guidance see the original release note at:
    https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "issue_test.py", line 65, in <module>
    grid_search.fit(df_typed)
  File "C:\Users\Preston\anaconda3\lib\site-packages\snowflake\ml\_internal\telemetry.py", line 389, in wrap
    raise me.original_exception from e
AttributeError: (0000) module 'numpy' has no attribute 'float'.
`np.float` was a deprecated alias for the builtin `float`. To avoid this error in existing code, use `float` by itself. Doing this will not modify any behavior and is safe. If you specifically wanted the numpy scalar type, use `np.float64` here.
The aliases was originally deprecated in NumPy 1.20; for more details and guidance see the original release note at:
    https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
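
Read bottom-up, the first traceback passes through snowflake, lightgbm, and dask before reaching numpy; the np.float reference itself sits in dask's numpy_compat.py. A small sketch for pulling that package chain out of a pasted traceback, assuming the usual `File "...", line N` frame format; the helper name packages_in_traceback is hypothetical, and the sample is an abridged copy of the frames above:

```python
import re

# Matches the 'File "<path>", line <n>' part of a traceback frame
FRAME_RE = re.compile(r'File "(?P<path>[^"]+)", line (?P<line>\d+)')

# Abridged frames from the traceback in this thread
SAMPLE = r'''
  File "C:\Users\Preston\anaconda3\lib\site-packages\snowflake\ml\modeling\_internal\model_specifications.py", line 97, in __init__
    import lightgbm
  File "C:\Users\Preston\anaconda3\lib\site-packages\lightgbm\compat.py", line 114, in <module>
    from dask.array import Array as dask_Array
  File "C:\Users\Preston\anaconda3\lib\site-packages\dask\array\numpy_compat.py", line 21, in <module>
    np.divide(0.4, 1, casting="unsafe", dtype=np.float),
  File "C:\Users\Preston\anaconda3\lib\site-packages\numpy\__init__.py", line 305, in __getattr__
    raise AttributeError(__former_attrs__[attr])
'''


def packages_in_traceback(tb_text):
    """Return top-level site-packages names in call order, collapsing repeats."""
    names = []
    for match in FRAME_RE.finditer(tb_text):
        parts = re.split(r"[\\/]", match.group("path"))  # Windows or POSIX paths
        if "site-packages" not in parts:
            continue  # frame is in user code or the standard library
        idx = parts.index("site-packages")
        if idx + 1 < len(parts):
            name = parts[idx + 1]
            if not names or names[-1] != name:
                names.append(name)
    return names


if __name__ == "__main__":
    print(packages_in_traceback(SAMPLE))
    # ['snowflake', 'lightgbm', 'dask', 'numpy']
```

The package just before numpy in the result is the one that still uses the removed alias.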

@sfc-gh-afero

sfc-gh-afero commented Apr 19, 2024

@PrestonBlackburn Thanks for sharing the full error. This provides many more clues as to what happened. This error is actually being raised when calling import lightgbm, so the root of the issue is likely that your version of lightgbm is not compatible with numpy 1.24.4. On the environment that's failing, what version of lightgbm do you have installed?
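
One way to answer the version question is to list the installed distributions in the failing environment. This is a minimal sketch using only the standard library; the package list is an assumption based on the names that appear in the traceback, and installed_versions is a hypothetical helper:

```python
from importlib import metadata


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    # Names taken from the traceback and the pip freeze output in this thread
    wanted = ["numpy", "lightgbm", "dask", "xgboost", "snowflake-ml-python"]
    for name, version in installed_versions(wanted).items():
        print(f"{name}: {version or 'not installed'}")
```

Because it reads package metadata rather than importing the packages, this works even when `import lightgbm` itself fails.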

You may be wondering, why are we importing lightgbm at all since you're running an xgboost model?

When the snowflake-ml client executes the fit method, we often implement it as a stored procedure in Snowflake. In some cases, as an optimization, we reuse the same stored procedure for consecutive grid searches. Because grid search is a "composed" estimator, it can be run with many different types of models, including both xgboost and lightgbm. As such, when we create the stored procedure we include both xgboost and lightgbm as dependencies, provided lightgbm is available in the client's environment. I expect that your new conda env worked for one of two reasons: 1) you did not install lightgbm at all, or 2) you installed a version that is compatible with numpy 1.24.4.
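
The "include it if it is available" pattern described above can be sketched as follows. This is a hypothetical illustration, not snowflake-ml's actual code; the function name available_optional_deps is invented here. It also shows why a package that is installed but broken at import time behaves differently from one that is simply missing:

```python
import importlib


def available_optional_deps(candidates):
    """Return the candidate modules that import cleanly.

    Hypothetical helper: a missing module raises ImportError, while an
    installed-but-incompatible one can raise something else (for example the
    AttributeError seen in this thread), so both cases are skipped here.
    """
    usable = []
    for name in candidates:
        try:
            importlib.import_module(name)
        except Exception:
            continue  # missing or broken: leave it out of the dependency list
        usable.append(name)
    return usable


if __name__ == "__main__":
    print(available_optional_deps(["xgboost", "lightgbm"]))
```

If the availability check only guards against ImportError, a broken install surfaces as an unrelated-looking error during fit, which matches what the traceback in this thread shows.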

@sfc-gh-afero

@PrestonBlackburn Marking as closed because this seems to be an incompatibility in your local Python environment. If you are able to reproduce it with a version of lightgbm we support, please do update us.
