ENH: predict tools, helper function using pandas for exog grid #8439

Open
josef-pkt opened this issue Oct 13, 2022 · 3 comments

@josef-pkt
Member

helper function for lsmeans, emmeans, predict exog
related: #5387 ...

Currently we don't have any function that creates a grid of exog values for predict in which more than a single column varies.
This is relevant for predict and for marginal/derivative effects.

lsmeans, emmeans and other margin packages in R (and SAS, ...) have functions that create grids of exog values.

In our case the models do not keep enough information about the original data; by the time we are inside the model it is too late to build the grid.

So what we need are helper functions that let the user process the original dataframe, assuming it contains no irrelevant columns.
These would work entirely with pandas DataFrames, using the corresponding methods for categorical variables and quantiles.
The formula transform then converts the grid to a design matrix for predict and other methods, consistent with the model specification.
(Do we have _get_predict_exog in general? AFAIR, I only added it in a few models.)

bonus:
for purely categorical exog, we might also want to have freq or prob weights for cell frequencies or probabilities in the original sample.
(new get_prediction in discrete allows for aggregation weights.)
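
A minimal sketch of the bonus idea, assuming a purely categorical DataFrame. The helper name `cell_grid_with_weights`, the `prob` column and the example columns are hypothetical and not statsmodels API. It builds the full grid of observed levels and attaches each cell's share of the original sample, which could then serve as aggregation weights.

```python
from itertools import product

import pandas as pd


def cell_grid_with_weights(df):
    """Full factorial grid of observed levels plus sample cell probabilities.

    Hypothetical helper, not part of statsmodels.
    """
    cols = list(df.columns)
    # every combination of observed levels, including cells with no observations
    levels = [sorted(df[c].unique()) for c in cols]
    grid = pd.DataFrame(list(product(*levels)), columns=cols)
    # share of the original sample that falls in each observed cell
    prob = df.groupby(cols).size().div(len(df)).rename("prob").reset_index()
    # cells that never occur get weight 0 after the left merge
    out = grid.merge(prob, on=cols, how="left")
    out["prob"] = out["prob"].fillna(0.0)
    return out


# made-up example data
df = pd.DataFrame({
    "sex": ["f", "f", "m", "m", "m", "f"],
    "group": ["a", "b", "a", "a", "b", "b"],
})
grid = cell_grid_with_weights(df)
```

Using frequencies instead of probabilities would only require dropping the division by `len(df)`.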

@josef-pkt
Member Author

a not very quick try:
I didn't find any useful pandas methods, or I don't know pandas well enough to figure it out.

The following works in my example (speed is the endog).
get_col_values is a generator that works with Python's itertools.product and can be extended to handle other dtypes like categorical.
Not sure how to handle count data in exog, e.g. the user provides a list of count varnames, and we then either use all values or round the quantiles to int.


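from itertools import product

import numpy as np
import pandas as pd

# data2 is the original DataFrame used for the model (not shown here)
included = []  # names of the columns that end up in the grid, filled by the generator
def get_col_values(data2, exclude=("const", "speed")):
    for col in data2:
        ser = data2[col]
        if ser.name in exclude:
            continue
        if ser.dtype == np.float64:
            # continuous column: low, median and high quantiles
            values = ser.quantile(q=[0.1, 0.5, 0.9]).to_list()
        elif ser.dtype == object:
            # string column: all observed levels
            values = ser.unique().tolist()  # unique() returns ndarray
        else:
            # other dtypes (int counts, categorical) are not handled yet, skip them
            continue

        included.append(ser.name)
        yield values

# based on https://stackoverflow.com/a/37755303/333700
# preserves dtypes of values (but not meta info, e.g. categorical)
result = pd.DataFrame(list(product(*get_col_values(data2))), columns=included)
result.head()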
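
One way the generator could be extended to categorical dtype and to user-listed count columns, as a rough sketch. The names `grid_values`, `make_grid`, `count_cols` and the example data are hypothetical. Count columns use quantiles rounded to int, which is only one of the two options mentioned above.

```python
from itertools import product

import pandas as pd
from pandas.api.types import is_bool_dtype, is_numeric_dtype


def grid_values(ser, is_count=False, q=(0.1, 0.5, 0.9)):
    """Return the list of grid values for one column (hypothetical helper)."""
    if isinstance(ser.dtype, pd.CategoricalDtype):
        # observed levels only, in category order
        return [c for c in ser.cat.categories if (ser == c).any()]
    if is_bool_dtype(ser) or not is_numeric_dtype(ser):
        # strings, objects, booleans: every observed value
        return ser.unique().tolist()
    if is_count:
        # round quantiles to integers and drop duplicates, keeping order
        vals = ser.quantile(q=list(q)).round().astype(int).tolist()
        return list(dict.fromkeys(vals))
    return ser.quantile(q=list(q)).tolist()


def make_grid(data, exclude=(), count_cols=()):
    cols = [c for c in data.columns if c not in exclude]
    values = [grid_values(data[c], is_count=c in count_cols) for c in cols]
    grid = pd.DataFrame(list(product(*values)), columns=cols)
    # restore categorical dtype, which itertools.product does not preserve
    for c in cols:
        if isinstance(data[c].dtype, pd.CategoricalDtype):
            grid[c] = grid[c].astype(data[c].dtype)
    return grid


# made-up example data; "y" plays the role of endog
data = pd.DataFrame({
    "x": [0.0, 1.0, 2.0, 3.0, 4.0],
    "n": [0, 1, 1, 2, 10],
    "g": pd.Categorical(["a", "b", "a", "b", "a"], categories=["a", "b", "c"]),
    "y": [1.0, 2.0, 3.0, 4.0, 5.0],
})
grid = make_grid(data, exclude=("y",), count_cols=("n",))
```

Returning a list per column instead of yielding avoids the module-level `included` list, at the cost of no longer being a drop-in generator.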

@josef-pkt
Member Author

another question, about predicted means at some exog values:
Stata's margins command defaults to predicted marginal means, not to a marginal effect with "marginal" in the sense of a derivative or difference.

https://stackoverflow.com/questions/75772170/produce-predictive-margins-in-statsmodels-output-for-logistic-regression
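
To make the distinction concrete, a small sketch of predicted marginal means: set one variable to each of its levels for every observation, predict, and average over the sample. The helper `predictive_margins` is hypothetical, and `toy_predict` with its made-up coefficients only stands in for the predict method of a fitted model.

```python
import numpy as np
import pandas as pd


def predictive_margins(predict, data, var, levels=None):
    """Average prediction over the sample with `var` set to each level.

    `predict` is any callable mapping a DataFrame to predicted means.
    Hypothetical helper, not statsmodels API.
    """
    if levels is None:
        levels = sorted(data[var].unique())
    margins = {}
    for level in levels:
        counterfactual = data.copy()
        counterfactual[var] = level  # every observation gets this level
        margins[level] = float(np.mean(predict(counterfactual)))
    return pd.Series(margins, name="margin")


def toy_predict(df):
    # stand-in for a fitted logit model, coefficients are made up
    lin = -1.0 + 0.5 * df["age"] + 1.0 * (df["treated"] == 1)
    return 1.0 / (1.0 + np.exp(-lin))


data = pd.DataFrame({"age": [0.0, 1.0, 2.0, 3.0], "treated": [0, 1, 0, 1]})
pm = predictive_margins(toy_predict, data, "treated")
```

The difference `pm.loc[1] - pm.loc[0]` would be the corresponding average discrete effect, which is the "marginal as difference" reading.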

@josef-pkt
Member Author

josef-pkt commented Mar 19, 2023

something that would give us an original array

predict_at, where the user needs to provide a DataFrame that includes all original variables used in exog, and only those.
We then automatically construct a grid from whatever options are specified for at.
Finally we call get_prediction with the constructed "exog", which will still be formula transformed in get_prediction.

The same would be possible if we have the original DataFrame attached to the model and we can identify columns that were used in the formula.

Possible ambiguity: we need to know which variables are categorical if they have numeric levels, i.e. C(cat) in the formula. Those should use unique values instead of the mean.
Fractional exog for categorical variables (like mean gender in the population) would not be possible.
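
A rough sketch of the grid construction step for predict_at, under the assumptions above. The function name `predict_at_grid` and the `at` and `categorical` arguments are hypothetical. Columns named in `categorical` stand for numeric-coded variables that appear as C(cat) in the formula.

```python
from itertools import product

import pandas as pd
from pandas.api.types import is_numeric_dtype


def predict_at_grid(data, at=None, categorical=()):
    """Build an exog grid from the original variables (hypothetical helper).

    at : dict mapping a column name to the values to evaluate at.
    categorical : columns to treat as categorical even if numeric;
        these use unique levels instead of the mean.
    All other numeric columns are held at their sample mean.
    """
    at = at or {}
    values = {}
    for col in data.columns:
        if col in at:
            values[col] = list(at[col])
        elif col in categorical or not is_numeric_dtype(data[col]):
            values[col] = sorted(data[col].unique())
        else:
            values[col] = [data[col].mean()]
    return pd.DataFrame(list(product(*values.values())), columns=list(values))


# made-up example data containing only the variables used in exog
data = pd.DataFrame({
    "dose": [1.0, 2.0, 3.0, 4.0],
    "cat": [1, 2, 1, 2],
    "sex": ["f", "m", "f", "m"],
})
grid = predict_at_grid(data, at={"dose": [1.0, 4.0]}, categorical=("cat",))
# without the categorical hint, "cat" collapses to its mean of 1.5
ambiguous = predict_at_grid(data)
# the grid would then go to get_prediction of a formula-based results instance
```

The `ambiguous` grid shows why the list of categorical variables has to come from the user or from the formula: a mean of 1.5 is not a valid level for C(cat).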
