QuantileRegressor unable to allocate memory for large datasets #22922

@bharatkrishna

Describe the bug

I am using sklearn.linear_model.QuantileRegressor for a dataset with ~2.9 million datapoints.
When I use it as follows, I get a memory error.

MemoryError: Unable to allocate 61.6 TiB for an array with shape (2909376, 2909376) and data type float64
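The 61.6 TiB figure follows directly from the shape in the message: a dense square float64 array with one row and one column per datapoint. A quick check of that arithmetic, using only the numbers from the error text:

```python
# Size of a dense (n, n) float64 array, using the shape from the error message.
n = 2_909_376            # rows and columns reported in the MemoryError
bytes_per_float64 = 8    # float64 is 8 bytes per element

total_bytes = n * n * bytes_per_float64
total_tib = total_bytes / 2**40  # 1 TiB = 2**40 bytes

print(f"{total_tib:.1f} TiB")  # prints "61.6 TiB"
```

The requested size grows with the square of the number of rows, so halving the dataset would still ask for roughly 15 TiB.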

When I do the same with the statsmodels library, I don't encounter any issues.

Is this a known limitation of the sklearn implementation? What is the largest dataset it can handle? Or is it a bug?

I chose sklearn over statsmodels because the sklearn API supports regularization, which statsmodels doesn't.

Steps/Code to Reproduce

from sklearn.linear_model import QuantileRegressor

# X and y are my own data (~2.9 million rows); they are not included here.
reg = QuantileRegressor(quantile=0.8).fit(X, y)

Expected Results

The data is fit without errors.

Actual Results

---------------------------------------------------------------------------
MemoryError                               Traceback (most recent call last)
/tmp/ipykernel_53907/1165686720.py in <module>
      1 from sklearn.linear_model import QuantileRegressor
      2 
----> 3 reg = QuantileRegressor(quantile=0.8).fit(X, y)
      4 print_analysis(reg, X, y, flyte_type)

/anaconda3/lib/python3.7/site-packages/sklearn/linear_model/_quantile.py in fit(self, X, y, sample_weight)
    239                     -np.ones((n_mask, 1)),
    240                     -X[mask],
--> 241                     np.eye(n_mask),
    242                     -np.eye(n_mask),
    243                 ],

/anaconda3/lib/python3.7/site-packages/numpy/lib/twodim_base.py in eye(N, M, k, dtype, order)
    197     if M is None:
    198         M = N
--> 199     m = zeros((N, M), dtype=dtype, order=order)
    200     if k >= M:
    201         return m

MemoryError: Unable to allocate 61.6 TiB for an array with shape (2909376, 2909376) and data type float64
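The traceback points at `np.eye(n_mask)`, a dense identity matrix with one row and one column per sample. The sketch below is only an illustration of how that allocation scales compared with a sparse identity of the same shape. It is not a description of how scikit-learn builds, or should build, its constraint matrix, and the helper names `dense_identity_nbytes` and `sparse_identity_nbytes` are hypothetical.

```python
import numpy as np
from scipy import sparse


def dense_identity_nbytes(n, itemsize=8):
    """Bytes needed for a dense (n, n) array with `itemsize`-byte entries."""
    return n * n * itemsize


def sparse_identity_nbytes(n):
    """Bytes held by a CSR identity: the stored values plus its index arrays."""
    eye = sparse.eye(n, format="csr")
    return eye.data.nbytes + eye.indices.nbytes + eye.indptr.nbytes


# Small n so this runs anywhere; the issue's n is 2,909,376.
n_small = 1_000
dense = np.eye(n_small)  # same call as in the traceback, at a safe size

# The closed-form estimate matches what NumPy actually allocates.
assert dense.nbytes == dense_identity_nbytes(n_small)

print("dense bytes: ", dense.nbytes)
print("sparse bytes:", sparse_identity_nbytes(n_small))
```

Dense storage grows with n squared, while the sparse identity stores only the n diagonal entries and their indices, so the gap widens quickly as the number of samples increases.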

Versions

System:
    python: 3.7.7 (default, Mar 26 2020, 15:48:22)  [GCC 7.3.0]
executable: /anaconda3/bin/python
   machine: Linux-4.14.68-4.14.3-amd64-3adf3675665129fa-x86_64-with-debian-stretch-sid

Python dependencies:
          pip: 21.2.2
   setuptools: 58.0.4
      sklearn: 1.0.1
        numpy: 1.19.2
        scipy: 1.5.2
       Cython: 0.29.25
       pandas: 1.2.4
   matplotlib: 3.5.0
       joblib: 1.1.0
threadpoolctl: 2.2.0

Built with OpenMP: True
