Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Return X/Xw arrays in linear regression output? #112

Closed
kncrabtree opened this issue Jul 30, 2020 · 4 comments · Fixed by #113
Closed

Return X/Xw arrays in linear regression output? #112

kncrabtree opened this issue Jul 30, 2020 · 4 comments · Fixed by #113
Assignees
Labels
feature request 🚧 New feature or request

Comments

@kncrabtree
Copy link
Contributor

Hello again Raphael,

What do you think about adding the X and/or Xw array as hidden outputs in linear_regression along with the residuals that are currently returned? The use case is that following regression, I'd like to plot the residuals vs the regressors. Unfortunately, my input dataframe has many NaN values in each of the Y and [X,...] columns, and I'm using the remove_na option in the linear_regression function for convenience. As a result, though, I cannot easily associate the values in the returned residuals_ array with the appropriate X values without reproducing everything that was done inside pingouin to prepare the data for regression, or by manually recalculating the residuals from the model coefficients.

If, however, the X array were returned as a hidden attribute X_, I could plot fit.residuals_ vs fit.X_[i] for each regressor. As far as I can tell from the source, this could be done simply by adding X and/or Xw to the output dataframe or dict exactly as is done for the residuals without any side effects except potentially increased memory usage for particularly large regressions.

If this seems like a reasonable feature, I can implement it and submit a pull request. I think I would add a returnx argument to linear_regression (default False), and if True, I would add X and Xw to the output DataFrame as X_ and Xw_ or to the output dict as 'X' and 'Xw'. Let me know what you think about this.

@raphaelvallat raphaelvallat self-assigned this Jul 30, 2020
@raphaelvallat raphaelvallat added the feature request 🚧 New feature or request label Jul 30, 2020
@raphaelvallat
Copy link
Owner

Hi @kncrabtree!

I think it would indeed make sense to return the post-processed X/ y for OLS and Xw / yw for WLS. However, I'd prefer not adding yet another input parameter to the function to avoid overburdening the API. Two ideas:

  1. X and y are only returned when as_dataframe=False, in other words, if you only care about the coefficients and p-values you can use a DataFrame (default), but if you want more then you'd use a dict output which would contain a lot of additional infos (X, y, predicted values, residuals, etc).

  2. If we prefer to return X and y in all cases, I would be keen on trying the pandas.DataFrame.attrs method. I know it's still under active development (see here) but I think it would be more elegant than the current Pingouin implementation.

Let me know what you think,
Raphael

@kncrabtree
Copy link
Contributor Author

Hm, I've never worked with pandas.DataFrame.attrs, and since it's also experimental I don't know what best practices would be. For now I can implement the first idea easily enough; that seems like a good solution. I'll give it a shot soon and open a pull request if that's ok.

@raphaelvallat
Copy link
Owner

Sounds good, thanks!

@raphaelvallat raphaelvallat linked a pull request Aug 3, 2020 that will close this issue
@raphaelvallat
Copy link
Owner

This is fixed in the just-released stable version of Pingouin (v0.3.8). Please make sure to update your version. Closing the issue. Thanks again!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature request 🚧 New feature or request
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants