
BinomialBayesMixedGLM predict function returns non-1D array #6158

Open
mnky9800n opened this issue Sep 10, 2019 · 7 comments

@mnky9800n

Describe the bug

I tried to use BinomialBayesMixedGLM to predict some values so that I can assess the predictive capability of the model. However, when putting in sample data, the output is not a 1-D array of predictions but an array with the same shape as the input data. This seems to be a bug, since the expected output is a 1-dimensional array of predicted values.

Code Sample, a copy-pastable example if possible

In[1] : from statsmodels.genmod import bayes_mixed_glm as b
        import pandas as pd
        data = pd.DataFrame(my_private_data, columns=['graduates_next_semester', 'first_course_year', 'cumulative_avg_grade', 'hs_gpa', 'female'])
        formula = 'graduates_next_semester ~ C(first_course_year) + cumulative_avg_grade + hs_gpa + C(female)'
        random = {'a':'0 + C(first_course_year)', 'b':'0 + C(first_course_year)*hs_gpa'}
        model = b.BinomialBayesMixedGLM.from_formula(formula, random, data)
        results = model.fit_vb()

In[2] : data.values
Out: array([[ 0.        ,  0.        , -0.16846258, -3.26481235,  0.        ],
       [ 0.        ,  0.        ,  0.25580621,  0.60181999,  1.        ],
       [ 1.        ,  4.        ,  1.64888846,  1.66018495,  0.        ],
       ...,
       [ 0.        , 20.        ,  0.88437497,  0.15070583,  1.        ],
       [ 0.        , 20.        , -1.72232004, -1.76500254,  0.        ],
       [ 1.        , 20.        ,  0.82043374,  0.27632605,  1.        ]])
In[3] : model.predict(data)
Out : array([[0.5       , 0.5       , 0.60994004, 0.00872048, 0.45798368],
       [0.5       , 0.5       , 0.7986418 , 0.01236763, 0.77830333],
       [0.5       , 0.5       , 0.14909222, 0.00507625, 0.93394245],
       ...,
       [0.73105858, 0.5       , 0.72314575, 0.00128766, 0.86811284],
       [0.73105858, 0.5       , 0.34848906, 0.01597394, 0.15157257],
       [0.73105858, 0.5       , 0.71658333, 0.00144936, 0.86061816]])
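For context on the expected shape: a binomial model's predictions should be one fitted probability per observation, i.e. the inverse logit of the linear predictor. A minimal numpy sketch with synthetic data and hypothetical coefficients (none of these values come from the thread):

```python
import numpy as np
from scipy.special import expit  # inverse logit link

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))              # design matrix: 6 rows, 4 predictors
beta = np.array([0.5, -1.0, 0.3, 0.8])   # hypothetical coefficients

# One fitted probability per row: expit of the linear predictor.
p = expit(X @ beta)
print(p.shape)   # (6,) -- 1-D, one prediction per observation
```

An output shaped like the input data, as above, suggests the inverse link was applied to the wrong array rather than to the linear predictor.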
@bashtage
Member

What is the dtype of graduates_next_semester? This is probably happening because Patsy is encoding graduates_next_semester as a categorical variable.

@mnky9800n
Author

It's int64. It is categorical, though: just 1 or 0 depending on graduation.

@bashtage
Member

Turn it into a plain int64.

@mnky9800n
Author

mnky9800n commented Sep 10, 2019

As in, don't have the formula be C(graduates_next_semester) ~ ...?

The current formula already lacks the C() designation.

As a note:

In : model.endog.dtype
Out : dtype('float64')

@bashtage
Member

You are getting float because Patsy is encoding your categorical as two columns of floating-point 0.0 or 1.0 values.
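The encoding described here is easy to inspect directly. A small sketch using patsy (the formula engine statsmodels uses); the data are made up for illustration:

```python
import pandas as pd
import patsy

df = pd.DataFrame({"y": [0, 1, 0, 1], "x": [1.0, 2.0, 3.0, 4.0]})

# Wrapping the outcome in C() makes patsy dummy-encode it as
# floating-point indicator columns, one per level:
y_cat, X1 = patsy.dmatrices("C(y) ~ x", df)
print(y_cat.shape)   # (4, 2)

# A plain numeric outcome stays a single column:
y_num, X2 = patsy.dmatrices("y ~ x", df)
print(y_num.shape)   # (4, 1)
```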

@mnky9800n
Author

Sorry, I'm unclear on what you're suggesting as the fix. Converting the graduates_next_semester column to int64 doesn't seem to make a difference:

In[1] : data = df[df.semester_idx==10][[ 'first_course_year', 'cumulative_avg_grade', 'hs_gpa', 'female', 
        'graduates_next_semester',]].copy()
        data['graduates_next_semester'] = data.graduates_next_semester.astype(np.int64)

        formula = 'graduates_next_semester ~ C(first_course_year) + cumulative_avg_grade + hs_gpa + C(female)'

        # formula = 'C(graduates_next_semester) ~ C(first_course_year) + cumulative_avg_grade + hs_gpa + C(female)'

        random = {'a':'1 + C(first_course_year)', 'b':'1 + C(first_course_year)*hs_gpa'}
        model = b.BinomialBayesMixedGLM.from_formula(formula, random, data)
        results = model.fit_vb()
In[2] : model.predict(data[data.columns[:-1]])
Out   : array([[5.00000000e-01, 4.97043420e-01, 3.27626316e-01, 3.12715703e-02],
       [5.00000000e-01, 2.09075156e-01, 2.01295331e-02, 9.68658863e-01],
       [5.00000000e-01, 1.62070571e-01, 7.78242678e-04, 7.46860763e-01],
       ...,
       [5.00000000e-01, 3.45092835e-01, 2.62984333e-02, 8.90587849e-01],
       [5.00000000e-01, 3.11531895e-01, 5.49629386e-03, 7.12442971e-01],
       [5.00000000e-01, 6.60760922e-01, 1.14419101e-01, 9.37076501e-01]])
In[3] : model.endog
Out   : array([0., 0., 0., ..., 1., 1., 0.])
In[4] : model.endog.dtype
Out[4]: dtype('float64')

@sean00002

I just had the same issue. Instead of using model.predict, you probably want to try results.predict?
