Unreliable auto-casting of pandas data in model fitters #9205

Open
maciejskorski opened this issue Apr 13, 2024 · 0 comments
Describe the bug

Currently, np.asarray is called without an explicit dtype to convert model input to a numeric array when fitting models such as OLS and WLS:

def _convert_endog_exog(self, endog, exog=None):
    # TODO: remove this when we handle dtype systematically
    endog = np.asarray(endog)
    exog = exog if exog is None else np.asarray(exog)
    if endog.dtype == object or exog is not None and exog.dtype == object:
        raise ValueError("Pandas data cast to numpy dtype of object. "
                         "Check input data with np.asarray(data).")
    return super()._convert_endog_exog(endog, exog)

This is unreliable, particularly for mixed data types, as the example below shows:

Code Sample, a copy-pastable example if possible

import pandas as pd
import numpy as np
import statsmodels.api as sm

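Xy = pd.DataFrame(
    data=[[0.0, False, 0.1], [1.0, True, 0.9], [1.0, True, 0.9]],
    columns=['y', 'x1', 'x2'],
)

# FIXME: appending .astype(float) here is a workaround; without it,
# the mixed bool/float columns are cast to object dtype and fitting fails
X = Xy[['x1', 'x2']]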
y = Xy['y']

sm.OLS(y,X)
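To see where the failure comes from, it may help to inspect the dtype that np.asarray produces for the mixed-type frame. The following is a minimal sketch that uses only pandas and numpy and reuses the column names from the example above; it does not call statsmodels itself.

```python
import numpy as np
import pandas as pd

Xy = pd.DataFrame(
    data=[[0.0, False, 0.1], [1.0, True, 0.9], [1.0, True, 0.9]],
    columns=['y', 'x1', 'x2'],
)
X = Xy[['x1', 'x2']]

# x1 is bool and x2 is float64, so pandas has no common numeric dtype
# for the block and falls back to object when converting to numpy.
raw = np.asarray(X)
print(raw.dtype)

# Casting the frame to float first gives a plain float64 array.
casted = np.asarray(X.astype(float))
print(casted.dtype)
```

The object dtype of the first array is what triggers the ValueError in the check quoted earlier.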


Related Issues

Issue #8794 opens a narrower discussion about the unreliable handling of nulls.

Suggested solution

I suggest passing an explicit numeric dtype to numpy when casting:

np.asarray(Xy, dtype=np.float32)
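As a rough illustration of this suggestion, the sketch below wraps the cast in a helper. The name to_float_array is hypothetical and not part of statsmodels, and the float64 default is an assumption made here to avoid precision loss (the call above uses float32). Data that cannot be interpreted as numbers still raises an error, but with a message that names the requested dtype.

```python
import numpy as np
import pandas as pd


def to_float_array(data, dtype=np.float64):
    """Hypothetical helper: cast array-like input to an explicit float dtype."""
    try:
        # An explicit dtype makes numpy convert bools and ints to floats
        # instead of keeping an object array for mixed-type frames.
        return np.asarray(data, dtype=dtype)
    except (TypeError, ValueError) as exc:
        raise ValueError(
            f"Input data could not be cast to {np.dtype(dtype).name}."
        ) from exc


Xy = pd.DataFrame(
    data=[[0.0, False, 0.1], [1.0, True, 0.9], [1.0, True, 0.9]],
    columns=['y', 'x1', 'x2'],
)
X_arr = to_float_array(Xy[['x1', 'x2']])
print(X_arr.dtype, X_arr.shape)
```

With this cast, the bool column x1 becomes 0.0 and 1.0, so the array passes the object-dtype check without a manual .astype(float).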

Expected Output

Fitting does not produce errors.

Output of import statsmodels.api as sm; sm.show_versions()


INSTALLED VERSIONS

Python: 3.10.12.final.0
OS: Linux 6.1.58+ #1 SMP PREEMPT_DYNAMIC Sat Nov 18 15:31:17 UTC 2023 x86_64
byteorder: little
LC_ALL: en_US.UTF-8
LANG: en_US.UTF-8

statsmodels

Installed: 0.14.1 (/usr/local/lib/python3.10/dist-packages/statsmodels)

Required Dependencies

cython: 3.0.10 (/usr/local/lib/python3.10/dist-packages/Cython)
numpy: 1.25.2 (/usr/local/lib/python3.10/dist-packages/numpy)
scipy: 1.11.4 (/usr/local/lib/python3.10/dist-packages/scipy)
pandas: 2.0.3 (/usr/local/lib/python3.10/dist-packages/pandas)
dateutil: 2.8.2 (/usr/local/lib/python3.10/dist-packages/dateutil)
patsy: 0.5.6 (/usr/local/lib/python3.10/dist-packages/patsy)

Optional Dependencies

matplotlib: 3.7.1 (/usr/local/lib/python3.10/dist-packages/matplotlib)
backend: module://matplotlib_inline.backend_inline
cvxopt: 1.3.2 (/usr/local/lib/python3.10/dist-packages/cvxopt)
joblib: 1.4.0 (/usr/local/lib/python3.10/dist-packages/joblib)

Developer Tools

IPython: 7.34.0 (/usr/local/lib/python3.10/dist-packages/IPython)
jinja2: 3.1.3 (/usr/local/lib/python3.10/dist-packages/jinja2)
sphinx: 5.0.2 (/usr/local/lib/python3.10/dist-packages/sphinx)
pygments: 2.16.1 (/usr/local/lib/python3.10/dist-packages/pygments)
pytest: 7.4.4 (/usr/local/lib/python3.10/dist-packages/pytest)
virtualenv: Not installed

maciejskorski changed the title from "Non-reliable auto-casting of pandas data in model fitters" to "Unreliable auto-casting of pandas data in model fitters" on Apr 13, 2024