Return X/Xw arrays in linear regression output? #112

kncrabtree · 2020-07-30T17:34:14Z

Hello again Raphael,

What do you think about adding the X and/or Xw array as hidden outputs in linear_regression along with the residuals that are currently returned? The use case is that following regression, I'd like to plot the residuals vs the regressors. Unfortunately, my input dataframe has many NaN values in each of the Y and [X,...] columns, and I'm using the remove_na option in the linear_regression function for convenience. As a result, though, I cannot easily associate the values in the returned residuals_ array with the appropriate X values without reproducing everything that was done inside pingouin to prepare the data for regression, or by manually recalculating the residuals from the model coefficients.

If, however, the X array were returned as a hidden attribute X_, I could plot fit.residuals_ vs fit.X_[i] for each regressor. As far as I can tell from the source, this could be done simply by adding X and/or Xw to the output dataframe or dict exactly as is done for the residuals without any side effects except potentially increased memory usage for particularly large regressions.

If this seems like a reasonable feature, I can implement it and submit a pull request. I think I would add a returnx argument to linear_regression (default False), and if True, I would add X and Xw to the output DataFrame as X_ and Xw_ or to the output dict as 'X' and 'Xw'. Let me know what you think about this.

The text was updated successfully, but these errors were encountered:

raphaelvallat · 2020-07-30T21:50:21Z

Hi @kncrabtree!

I think it would indeed make sense to return the post-processed X/ y for OLS and Xw / yw for WLS. However, I'd prefer not adding yet another input parameter to the function to avoid overburdening the API. Two ideas:

X and y are only returned when as_dataframe=False, in other words, if you only care about the coefficients and p-values you can use a DataFrame (default), but if you want more then you'd use a dict output which would contain a lot of additional infos (X, y, predicted values, residuals, etc).
If we prefer to return X and y in all cases, I would be keen on trying the pandas.DataFrame.attrs method. I know it's still under active development (see here) but I think it would be more elegant than the current Pingouin implementation.

Let me know what you think,
Raphael

kncrabtree · 2020-07-31T20:37:13Z

Hm, I've never worked with pandas.DataFrame.attrs, and since it's also experimental I don't know what best practices would be. For now I can implement the first idea easily enough; that seems like a good solution. I'll give it a shot soon and open a pull request if that's ok.

raphaelvallat · 2020-07-31T20:56:02Z

Sounds good, thanks!

raphaelvallat · 2020-09-07T18:24:52Z

This is fixed in the just-released stable version of Pingouin (v0.3.8). Please make sure to update your version. Closing the issue. Thanks again!

raphaelvallat self-assigned this Jul 30, 2020

raphaelvallat added the feature request 🚧 New feature or request label Jul 30, 2020

raphaelvallat linked a pull request Aug 3, 2020 that will close this issue

Add processed arrays to linear_regression output #113

Merged

raphaelvallat closed this as completed Sep 7, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Return X/Xw arrays in linear regression output? #112

Return X/Xw arrays in linear regression output? #112

kncrabtree commented Jul 30, 2020

raphaelvallat commented Jul 30, 2020

kncrabtree commented Jul 31, 2020

raphaelvallat commented Jul 31, 2020

raphaelvallat commented Sep 7, 2020

Return X/Xw arrays in linear regression output? #112

Return X/Xw arrays in linear regression output? #112

Comments

kncrabtree commented Jul 30, 2020

raphaelvallat commented Jul 30, 2020

kncrabtree commented Jul 31, 2020

raphaelvallat commented Jul 31, 2020

raphaelvallat commented Sep 7, 2020