Description
The score returned by LatentDirichletAllocation (LDA) seems inverted. GridSearchCV maximizes the score, but for LDA the score appears to be better when it is more negative.
Steps/Code to Reproduce
- Build a Pipeline() from a CountVectorizer() and an LDA()
- Pass the pipeline to a GridSearchCV()
- The grid search doesn't seem to work: on manual inspection, the score improves in the wrong direction (it becomes more negative) when it should be increasing. A minimal sketch of the setup is given below.
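For reference, a minimal sketch of the setup described above (the corpus and the parameter grid are placeholders, not my real data):

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.model_selection import GridSearchCV

# Placeholder corpus; the real one is much larger.
docs = [
    "the cat sat on the mat",
    "dogs and cats are animals",
    "the stock market fell today",
    "investors sold shares in the market",
    "the dog chased the cat",
    "markets react to economic news",
]

pipe = Pipeline([
    ("vect", CountVectorizer()),
    ("lda", LatentDirichletAllocation(learning_method="batch", random_state=0)),
])

# Hypothetical grid; the real one sweeps more parameters.
param_grid = {"lda__n_components": [2, 5, 10]}

# GridSearchCV maximizes the pipeline's score(), which delegates to
# LDA.score(), so it relies on that score pointing the right way.
search = GridSearchCV(pipe, param_grid, cv=3)
search.fit(docs)
print(search.best_params_, search.best_score_)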
I also tried implementing a subclass of LDA that negates the score, but the score is also used internally by the LDA class itself, so overriding it seemed to change nothing at all. So either no learning happens because the grid search picks bad candidates, or no learning happens because of early stopping on a score that is not improving in the expected direction.
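Roughly what the subclass attempt looked like (reconstructed from memory, not the exact code):

from sklearn.decomposition import LatentDirichletAllocation

class NegatedScoreLDA(LatentDirichletAllocation):
    """LDA whose score() is sign-flipped so that higher means better."""

    def score(self, X, y=None):
        # Negate the approximate log-likelihood bound returned by the
        # parent class; GridSearchCV would then maximize this value.
        return -super().score(X, y)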
Expected Results
If I'm wrong, the documentation should state more clearly whether GridSearchCV minimizes or maximizes the score. There should also be a better description of the directions in which the score and the perplexity move in the LDA. Normally the perplexity should go down as the model improves, yet here the score goes down together with the perplexity. I'd expect a "score" to be a metric that is better the higher it is.
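For what it's worth, the docstring defines perplexity as exp(-1 * log-likelihood per word), so the two quantities can be cross-checked directly. A sketch, assuming lda is a fitted model and X is the document-term matrix produced by the vectorizer:

import numpy as np

total_log_likelihood = lda.score(X)   # approximate bound over the whole corpus
n_words = X.sum()                     # total token count in X
# If score and perplexity were consistent with the documented relation,
# these two numbers should roughly agree; in the runs reported below,
# the two metrics move in the same direction instead.
print(np.exp(-total_log_likelihood / n_words))
print(lda.perplexity(X))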
Actual Results
# Printed over a few optimization rounds; the score grows more negative as the perplexity decreases.
print(lda.score(train), lda.perplexity(train))
# -182.51029019702543 27.615270885701467
# -52.05494670051718 41.19061662552155
# -41.973876576664225 189.9450029464927
# -32.432059761463556 3320.9791408284505
Versions
Linux-4.17.0-amd64-x86_64
Python 3.6.6 (default, Jun 27 2018, 14:44:17)
[GCC 8.1.0]
NumPy 1.14.0
SciPy 1.1.0
Scikit-Learn 0.19.2