[MRG+2] Added 'average' option to passive aggressive classifier/regressor. #4939
Conversation
Unlike SGDClassifier (and SGDRegressor), PassiveAggressiveClassifier (and PassiveAggressiveRegressor) does not expose the 'average' option of the BaseSGD class. The 'average' option helps smooth out the impact of rarely observed features. For example, when used in combination with the HashingVectorizer on text, it helps compensate for the lack of idf information, lowering the impact of low-idf (i.e., rare) features.
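A minimal sketch of the use case the description refers to (it assumes the `average` parameter proposed by this PR is available; the documents and labels below are made up for illustration):

```python
# Hedged sketch: hashed text features carry no idf weighting, so rarely
# observed hash buckets can receive disproportionate weights; averaging
# the weight iterates dampens their influence.
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import PassiveAggressiveClassifier

docs = ["good movie", "great film", "fine plot",
        "bad movie", "awful film", "poor plot"]
labels = [1, 1, 1, 0, 0, 0]

# Hash the documents directly, with no idf information.
X = HashingVectorizer(n_features=2 ** 10).transform(docs)

# average=True averages the weight vectors seen during training.
clf = PassiveAggressiveClassifier(average=True, random_state=0)
clf.fit(X, labels)
print(clf.score(X, labels))
```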
can you add a tiny test?
Yes. But it is not clear to me what the test should check. Have you any suggestions?
I have extended the original tests for the PassiveAggressive classifier/regressor so that they also run with the average option turned on.
looks pretty good but can you avoid duplicating so much code for testing? also I feel that the average param is not really tested. I am sure that the tests would still pass if the average param was ignored internally.
@ogrisel are there logs we can look into for the failure? Or restart appveyor? |
…de, following what is done in test_sgd.py.
@agramfort I have added the same checks from test_sgd.py to verify that averaging is actually performed (checking for the presence of attributes added exclusively by the averaging code).
Looks good to me. To reduce test duplication, you can do something like:

```python
def test_classifier_partial_fit():
    classes = np.unique(y)
    for data in (X, X_csr):
        for average in (False, True):
            clf = PassiveAggressiveClassifier(C=1.0, fit_intercept=True,
                                              random_state=0, average=average)
            for t in range(30):
                clf.partial_fit(data, y, classes)
            score = clf.score(data, y)
            assert_greater(score, 0.79)
            if average:
                assert_true(hasattr(clf, 'average_coef_'))
```

and so on...
@TomDLT thank you for the suggestion, I followed your advice.
Good job, it looks better!
Travis errors are caused by #5045 and are unrelated.
LGTM (though testing for the presence of the attributes in a single test would probably have sufficed).
Conflicts: sklearn/linear_model/passive_aggressive.py
…into aesuli-pa-average Conflicts: sklearn/linear_model/passive_aggressive.py
We could have merged this a long time ago, sorry @aesuli for the slow reaction time.
I added a few comments and I think we can merge it.
result in the ``coef_`` attribute. If set to an int greater than 1,
averaging will begin once the total number of samples seen reaches
average. So average=10 will begin averaging after seeing 10 samples.
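A minimal runnable sketch of the integer form documented above, assuming the `average` parameter is available (the data here is synthetic):

```python
# Sketch: with average=10, plain passive-aggressive updates run for the
# first 10 samples, and weight averaging kicks in once 10 samples have
# been seen.
import numpy as np
from sklearn.linear_model import PassiveAggressiveClassifier

rng = np.random.RandomState(0)
X = rng.randn(100, 5)
y = (X[:, 0] > 0).astype(int)  # linearly separable on the first feature

clf = PassiveAggressiveClassifier(average=10, random_state=0)
for start in range(0, 100, 20):  # simulate a stream of mini-batches
    clf.partial_fit(X[start:start + 20], y[start:start + 20], classes=[0, 1])
print(clf.score(X, y))
```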
You should add a tag ``.. versionadded:: 0.19`` as above.
@@ -1,3 +1,5 @@
import unittest
from nose.tools import assert_true
Please use ``from sklearn.utils.testing import assert_true`` instead, which centralizes the nose dependency.
@@ -1,3 +1,5 @@
import unittest
I guess you can remove this import.
score = clf.score(data, y)
assert_greater(score, 0.79)
if average:
    assert_true(hasattr(clf, 'average_coef_'))
Is there a straightforward way to test the sanity of these values?
I don't know what kind of relevant characteristics these weights have that could be checked.
The check on the score in line 81 is an indirect check that they produce sound classifications.
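One direct sanity check along the lines discussed here would be to compare the averaged and non-averaged weights, which would also catch the case (raised earlier in the thread) where the parameter is silently ignored. A sketch, with `make_classification` standing in for the real test data:

```python
# Sketch: averaging should change the fitted weights relative to the
# non-averaged run, while the model still scores well on the training data.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import PassiveAggressiveClassifier

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

clf_avg = PassiveAggressiveClassifier(average=True, random_state=0).fit(X, y)
clf_std = PassiveAggressiveClassifier(average=False, random_state=0).fit(X, y)

# If `average` were ignored internally, this difference would be zero.
print(np.linalg.norm(clf_avg.coef_ - clf_std.coef_))
print(clf_avg.score(X, y))
```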
@@ -59,9 +59,15 @@ class PassiveAggressiveClassifier(BaseSGDClassifier):
weights inversely proportional to class frequencies in the input data
as ``n_samples / (n_classes * np.bincount(y))``

.. versionadded:: 0.17
.. versionadded:: 0.17
Why this change in indentation?
warm_start=warm_start,
class_weight=class_weight,
n_jobs=n_jobs)
penalty=None,
Why this change in indentation?
result in the ``coef_`` attribute. If set to an int greater than 1,
averaging will begin once the total number of samples seen reaches
average. So average=10 will begin averaging after seeing 10 samples.
versionadded
verbose=verbose,
random_state=random_state,
warm_start=warm_start)
penalty=None,
indentation
LGTM
Please add an entry to what's new and we can merge.
@jnothman Is it a "new feature" or an "enhancement"?
Enhancement is fine. Could be either, but features are usually bigger (e.g. new estimators) and highlighted in a release.
Thanks, @aesuli |
The following is an example of using averaging with both SGDClassifier and PassiveAggressiveClassifier.
The data is the Bo Pang movie review dataset ( https://www.cs.cornell.edu/people/pabo/movie-review-data/review_polarity.tar.gz ).
The results show that averaging in BaseSGD works well for both classifiers, reducing the impact of "noise" features.
The following is the code to replicate the experiment that generated the above plot:
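The original script is not reproduced in this excerpt. A hedged stand-in sketch of such a comparison, with synthetic documents replacing the movie review corpus (all data and names below are illustrative, not from the original experiment):

```python
# Illustrative stand-in (NOT the original experiment): compare averaged
# vs. non-averaged training for SGDClassifier and
# PassiveAggressiveClassifier on hashed text features.
import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier, PassiveAggressiveClassifier

rng = np.random.RandomState(0)
pos, neg = ["good", "great", "fine"], ["bad", "awful", "poor"]
docs, labels = [], []
for _ in range(200):
    label = int(rng.randint(2))
    docs.append(" ".join(rng.choice(pos if label else neg, size=5)))
    labels.append(label)

X = HashingVectorizer(n_features=2 ** 12).transform(docs)

scores = {}
for name, Est in [("SGD", SGDClassifier), ("PA", PassiveAggressiveClassifier)]:
    for average in (False, True):
        clf = Est(average=average, random_state=0).fit(X, labels)
        scores[(name, average)] = clf.score(X, labels)
print(scores)
```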