
Implement softmax regression (categorical_logit_glm_lpmf) #1242

Open. Wants to merge 10 commits into develop.

Conversation

@t4c1 (Contributor) commented May 15, 2019

Summary

Adds a CPU implementation of softmax regression. Since the distribution used in this GLM is already implemented under the name categorical_logit_lpmf, I named the GLM categorical_logit_glm_lpmf.
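Concretely (using the dimension conventions discussed later in this thread: n observations y_i in {1, ..., K}, covariate vectors x_i of length J, an intercept vector alpha of length K, and a J x K coefficient matrix beta), the quantity this GLM computes is

$$
\texttt{categorical\_logit\_glm\_lpmf}(y \mid x, \alpha, \beta)
  = \sum_{i=1}^{n} \log \operatorname{softmax}\!\bigl(\alpha + \beta^{\top} x_i\bigr)_{y_i},
\qquad
\operatorname{softmax}(\eta)_k = \frac{\exp(\eta_k)}{\sum_{j=1}^{K} \exp(\eta_j)}.
$$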

Tests

New tests are in test/unit/math/rev/mat/prob/categorical_logit_glm_lpmf_test.cpp.

Side Effects

None.

Checklist

  • Math issue #1241

  • Copyright holder: Tadej Ciglarič, University of Ljubljana

    The copyright holder is typically you or your assignee, such as a university or company. By submitting this pull request, the copyright holder is agreeing to license the submitted work under the following licenses:
    - Code: BSD 3-clause (https://opensource.org/licenses/BSD-3-Clause)
    - Documentation: CC-BY 4.0 (https://creativecommons.org/licenses/by/4.0/)

  • the basic tests are passing

    • unit tests pass (to run, use: ./runTests.py test/unit)
    • header checks pass (make test-headers)
    • docs build (make doxygen)
    • code passes the built-in C++ standards checks (make cpplint)
  • the code is written in idiomatic C++ and changes are documented in the doxygen

  • the new changes are tested

@t4c1 (Contributor, Author) commented May 15, 2019

I measured the speedup compared to the version that uses autodiff instead of analytic derivatives. Benchmark code is here. Results are in the graphs below.

[Figure: glm_speedup, speedup of the analytic-derivative implementation relative to the autodiff version]

@bob-carpenter (Contributor) commented May 15, 2019

There's an issue with this GLM in terms of whether you use K or K - 1 coefficient vectors.

The traditional approach in stats is to use K - 1 coefficient vectors because they're not identified otherwise. With Bayesian models, you can identify them with the prior. We've gone around and around on this over time, with the advantage of K coefficient vectors being symmetry and K - 1 being identifiability.
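Concretely, the non-identifiability with K coefficient vectors comes from the softmax being invariant to a constant shift of all the linear predictors:

$$
\operatorname{softmax}(\eta + c\,\mathbf{1})_k
  = \frac{\exp(\eta_k + c)}{\sum_{j} \exp(\eta_j + c)}
  = \frac{\exp(\eta_k)}{\sum_{j} \exp(\eta_j)}
  = \operatorname{softmax}(\eta)_k ,
$$

so adding the same vector to all K coefficient vectors (and the same scalar to all K intercepts) leaves every likelihood term unchanged; fixing one coefficient vector to zero removes exactly this invariance.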

Which one are you using here, and if K, how are you dealing with the non-identifiability?

If you use K - 1, then the model's identified in the traditional sense (assuming the data's not separable).

This looks great either way. I'm not sure whether we want one or both of these versions.

@t4c1 (Contributor, Author) commented May 15, 2019

Right now it uses K coefficient vectors and leaves it to the user to specify an appropriate prior. If K-1 vectors are preferable, I can easily change that.

@bob-carpenter (Contributor) commented May 15, 2019

I almost think we should implement both, but it'd be super confusing having two. I think if we only have one, it should probably be the K-based one. I wonder what other people think. We haven't done a lot of analysis on efficiency. Lately, we've been moving more toward softer identifications with priors on groups of coefficients (like sum(alpha) ~ normal(0, 0.001)), but I don't know how to do that here.

@estrumbelj commented May 15, 2019

Hey, I'd suggest just K. With K-1 we take away some freedom from the user (like Bob said, some might prefer a "soft" identifiability constraint via a prior, selecting the reference category, etc.), and with K-1 there are multiple possible interpretations of what they mean. I'd go as far as to say that K-1 would be a bad idea.

@bob-carpenter (Contributor) commented May 15, 2019

> with K-1 there are multiple possible interpretations of what they mean

What the coefficients mean? Don't they only have meaning in a relative sense because of the non-identifiability? If you fix one of the coefficient vectors to zero, the other ones are all interpreted relative to that value.

@estrumbelj commented May 15, 2019

>> with K-1 there are multiple possible interpretations of what they mean
>
> What the coefficients mean? Don't they only have meaning in a relative sense because of the non-identifiability? If you fix one of the coefficient vectors to zero, the other ones are all interpreted relative to that value.

What I wanted to say was that with K-1 you have to know which category the developer picked as the "-1". At least the 1st and the K-th are possible. That is, I have to read the documentation. :) And then I need to reorder my categories to put the one I want as the reference in the correct spot.

Or we can just assume it's the K-th one, which makes what I said before sound stupid. :)

@betanalpha (Contributor) commented May 16, 2019 (this comment has been minimized)

@bob-carpenter (Contributor) commented May 16, 2019

We discussed this issue in the Stan meeting today (@t4c1 --- we can send you the link if you'd like to attend---it's 11 am NY time on Thursdays). The general consensus was that it should be coded with K - 1 categories. There's no binding decision from our meetings, but @andrewgelman and @bgoodri were both there and preferred K - 1. We haven't heard from a couple of our regression/modeling experts, @dpsimpson and @avehtari.

Andrew suggested you might have a motivating example for using K and identifying with a prior instead of K - 1. If so, please share.

@t4c1 (Contributor, Author) commented May 16, 2019

I don't have such an example, but @estrumbelj might have one.

@estrumbelj commented May 16, 2019

> We discussed this issue in the Stan meeting today (@t4c1 --- we can send you the link if you'd like to attend---it's 11 am NY time on Thursdays). The general consensus was that it should be coded with K - 1 categories. There's no binding decision from our meetings, but @andrewgelman and @bgoodri were both there and preferred K - 1. We haven't heard from a couple of our regression/modeling experts, @dpsimpson and @avehtari.
>
> Andrew suggested you might have a motivating example for using K and identifying with a prior instead of K - 1. If so, please share.

What were the arguments for "K - 1"? It just doesn't make any sense to me.

By picking "K - 1" we are literally saying that this is the only possible way of identifying the model AND that the model must be identified.

On the other hand, going with "K" does not prevent the user from setting one row of Beta to zero. It is exactly what users have to do now with the categorical() and categorical_logit() when doing multinomial regression. And with "K" I have the additional freedom of choosing which category I want to use as the reference category.

Even the Stan manual's categorical_logit example starts with a prior and then mentions K-1 as an alternative: https://mc-stan.org/docs/2_19/stan-users-guide/multi-logit-section.html :)
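For concreteness, here is the kind of model a user can already write with categorical_logit(), and could keep writing on top of a K-column GLM. This is a rough sketch only, with illustrative names (beta_free), 2019-era Stan syntax, and beta stored as J x K so the reference category is a column rather than a row:

```stan
data {
  int<lower=2> K;                      // number of categories
  int<lower=0> J;                      // number of covariates
  int<lower=0> N;                      // number of observations
  int<lower=1, upper=K> y[N];
  matrix[N, J] x;
}
parameters {
  matrix[J, K - 1] beta_free;          // free coefficients
}
transformed parameters {
  // pin the K-th column to zero so category K is the reference
  matrix[J, K] beta = append_col(beta_free, rep_vector(0, J));
}
model {
  to_vector(beta_free) ~ normal(0, 5);
  for (n in 1:N)
    y[n] ~ categorical_logit((x[n] * beta)');
}
```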

@estrumbelj commented May 24, 2019

This GLM raised a few specific and general questions about how GLMs should be implemented.

1. Should we implement it with K categories or K - 1 categories, enforcing the K-th category as the reference and identifying the model?

There seems to be a general consensus that in almost all cases you'd want to use a reference category. But there also exist cases where you (I) wouldn't want to. The shortest example I can come up with is when I'd want to regularize the coefficients for the covariates in my categorical logit regression (I might still use a reference for the intercepts). If I only have K-1 available, I'm forced to choose a reference category, breaking symmetry and making my results dependent on the category I chose.

My view is that if there is a possibility that K will be used, that option should be available. An additional argument for K is that you are still able to set a reference category by setting the corresponding column of beta to zero. And it shouldn't mislead the Stan user into doing something silly, because users already have to "manually" set a reference category when doing categorical regression with categorical_logit().

On the other hand, going with K - 1 is strictly less flexible.

2. Separate intercept or not or both?

categorical_logit_glm_lpmf(y | x, alpha, beta) VS categorical_logit_glm_lpmf(y | x, beta)

I think having the intercept as a separate argument is very useful. If nothing else, it makes it easier to regularize the coefficients but not the intercept. The four GLMs that are currently in Stan are implemented only with a separate intercept.

@syclik, you'll probably have something to add here.

3. Signature/types/vectorization

Based on what we discussed, I suggest the following:

categorical_logit_glm_lpmf(y | x, alpha, beta) =def= categorical_logit(y | alpha + x * beta);

  • y can be a scalar OR array of size n,
  • x can be a vector of size J OR n x J matrix (only if y is an array)
  • alpha is a vector of size K
  • beta is a J x K matrix.

J >= 0, K > 0, n > 0, where J is the number of covariates, K the number of categories, and n the number of observations.

The only broadcasting that should happen is when x is a vector and y is an array?
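To make the proposal concrete, usage in a Stan model would look roughly like the sketch below. This is illustrative only: it assumes the lpmf is eventually exposed to the Stan language under this name and argument order, which is exactly what still needs to be decided.

```stan
data {
  int<lower=0> N;                      // observations
  int<lower=0> J;                      // covariates
  int<lower=1> K;                      // categories
  int<lower=1, upper=K> y[N];
  matrix[N, J] x;
}
parameters {
  vector[K] alpha;                     // intercepts, one per category
  matrix[J, K] beta;                   // coefficients, one column per category
}
model {
  alpha ~ normal(0, 5);
  to_vector(beta) ~ normal(0, 5);

  // proposed GLM form:
  y ~ categorical_logit_glm(x, alpha, beta);

  // ...defined to be equivalent to the unvectorized version:
  // for (n in 1:N)
  //   y[n] ~ categorical_logit(alpha + (x[n] * beta)');
}
```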

Anyway, someone needs to make a decision on these before we can move forward.

@bob-carpenter (Contributor) commented May 24, 2019 (this comment has been minimized)

@estrumbelj commented May 24, 2019

> For 1, the main reason to use K - 1 is identifiability. If you go with K, the joint density is only identified through the priors.

This makes it sound as if we would have to give up identifiability via a reference category. Just to make it clear: if we go with K, no one will lose their right to identify their categorical logit with a reference category :)

> For 3, thanks for defining the suggested signatures. I have one modification I'd like to propose to avoid dependencies in the arguments. Specifically, I'd like to make the legal types for y and x independent. If x is a matrix and y a scalar, y gets broadcast; if x is a vector and y an array, then x gets broadcast.

It just seemed to me that broadcasting the same value of y is not something that would come up. But thinking about it, I agree with what you wrote. It is more elegant even if it never comes up.
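A sketch of the two broadcast cases, again assuming the proposed signature gets exposed to the language as written above. Whether the single-observation x is declared as a vector or a row_vector is part of the signature discussion; a row_vector is used here so that x0 matches a row of x. The two target += statements are only there to show both call patterns side by side, not to define a sensible model.

```stan
data {
  int<lower=0> N;
  int<lower=0> J;
  int<lower=1> K;
  int<lower=1, upper=K> y[N];          // an array of outcomes
  int<lower=1, upper=K> y0;            // a single outcome
  matrix[N, J] x;                      // one covariate row per observation
  row_vector[J] x0;                    // a single observation's covariates
}
parameters {
  vector[K] alpha;
  matrix[J, K] beta;
}
model {
  // x is a matrix and y a scalar: y0 is broadcast across the rows of x.
  target += categorical_logit_glm_lpmf(y0 | x, alpha, beta);

  // x is a single row and y an array: x0 is broadcast across the elements of y.
  target += categorical_logit_glm_lpmf(y | x0, alpha, beta);
}
```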

> Given that you don't suggest vectorizing alpha and beta, does that mean we won't be trying to cover discrete choice models with this GLM? I think it's best not to include them, but it was on the table earlier, so I thought I'd check.

No, no discrete choice. I think it is a nice idea, but I don't think it is worth the extra implementation effort. And it would probably make it too complicated and confuse a lot of Stan users.

But it might be worth thinking about a "discrete_choice_logit_glm_lpdf" once we can do ragged structures.
