
[MRG+2] Added 'average' option to passive aggressive classifier/regressor. #4939

Merged
merged 11 commits into from Sep 30, 2016

Conversation

@aesuli
Contributor

@aesuli aesuli commented Jul 8, 2015

Unlike SGDClassifier (and SGDRegressor), PassiveAggressiveClassifier (and PassiveAggressiveRegressor) does not expose the 'average' option of the BaseSGD class.
The 'average' option helps smooth out the impact of rarely observed variables. For example, when used in combination with the HashingVectorizer on text, it helps compensate for the lack of idf information by lowering the impact of low-idf (i.e., rare) features.
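As a minimal, self-contained sketch of the proposed usage (the synthetic data and the average=10 value here are illustrative assumptions, not taken from the experiment below):

```python
import numpy as np
from sklearn.linear_model import PassiveAggressiveClassifier

# Tiny synthetic two-class problem (placeholder data for illustration).
rng = np.random.RandomState(0)
X = rng.randn(200, 20)
y = (X[:, 0] + 0.1 * rng.randn(200) > 0).astype(int)

# average=10: start averaging the weight vector once 10 samples have
# been seen, mirroring the semantics of the existing SGDClassifier option.
clf = PassiveAggressiveClassifier(average=10, random_state=0)
clf.fit(X, y)
print(clf.score(X, y))
```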

The following is an example of using averaging with both SGDClassifier and PassiveAggressiveClassifier.
The data is the Bo Pang movie review dataset ( https://www.cs.cornell.edu/people/pabo/movie-review-data/review_polarity.tar.gz ).
The results show that the averaging implemented in BaseSGD works well for both classifiers at damping "noise" features.

[Figure "averagedemo": score histograms for the six classifier/vectorizer configurations]

The following is the code to replicate the experiment that generated the above plot:

import argparse
import codecs
import sys
import matplotlib.pyplot as plt
import numpy as np

# sklearn.cross_validation is the pre-0.18 API (later sklearn.model_selection)
from sklearn.cross_validation import ShuffleSplit
from sklearn import cross_validation
from sklearn.datasets import load_files
from sklearn.feature_extraction.text import TfidfVectorizer, HashingVectorizer
from sklearn.linear_model import PassiveAggressiveClassifier, SGDClassifier
from sklearn.metrics import SCORERS
from sklearn.pipeline import Pipeline


def randomized_train_test(estimator, data, target, test_size, n_iter, scorer, random_state, n_jobs):
    cv = ShuffleSplit(len(data), n_iter=n_iter, test_size=test_size, random_state=random_state)

    scores = cross_validation.cross_val_score(estimator, data, target, scorer, cv, n_jobs)
    score_mean = scores.mean()
    score_std = scores.std()
    return score_mean, score_std, scores


def plot(scorer_name, results, grid_w=None, grid_h=None):
    all_scores = list()
    for _, (_, _, scores) in results:
        all_scores.extend(scores)

    count = len(results)

    # np.histogram requires an integer bin count (true division yields a float)
    bins = np.histogram(all_scores,
                        bins=int(10 + (len(all_scores) /
                                       (count * np.max((1, np.log2(len(all_scores) / count)))))))[1]

    if grid_w is None:
        grid_w = int(np.ceil(np.sqrt(count)))
    if grid_h is None:
        grid_h = int(np.ceil(count / grid_w))

    subplots = list()

    for i, (name, (score_mean, score_std, scores)) in enumerate(results):
        if len(subplots) > 0:
            subplot = plt.subplot2grid((grid_h, grid_w),
                                       (int(i / grid_w), int(i % grid_w)),
                                       sharex=subplots[0],
                                       sharey=subplots[0])
        else:
            subplot = plt.subplot2grid((grid_h, grid_w),
                                       (int(i / grid_w), int(i % grid_w)))

        subplot.set_title('%s mean %s=%0.3f std=%0.3f' % (name, scorer_name, score_mean, score_std))
        subplot.hist(scores, alpha=0.5, bins=bins)
        subplots.append(subplot)

    ylim = subplots[0].get_ylim()

    for (name, (score_mean, score_std, _)), subplot in zip(results, subplots):
        subplot.plot([score_mean for _ in ylim], ylim, '--k', linewidth=2)
        subplot.plot([score_mean - score_std for _ in ylim], ylim, ':k', linewidth=2)
        subplot.plot([score_mean + score_std for _ in ylim], ylim, ':k', linewidth=2)

# download demo data from:
# https://www.cs.cornell.edu/people/pabo/movie-review-data/review_polarity.tar.gz

def main():
    sys.stdout = codecs.getwriter('utf8')(sys.stdout.buffer)
    parser = argparse.ArgumentParser(description='')
    parser.add_argument('-p', '--path', help='Input path', required=True)
    parser.add_argument('-r', '--random', help='Randomization seed (default: 0)', type=int, default=0)
    parser.add_argument('-s', '--scorer',
                        help='Evaluation function (default: f1, values: %s)' % ' '.join(SCORERS.keys()), type=str,
                        default='f1')
    parser.add_argument('-i', '--iter', help='Number of randomized runs (default: 100)', type=int, default=100)
    parser.add_argument('-t', '--test', help='Portions of examples to be used as test set (default: 0.2)', type=float,
                        default=0.2)
    args = parser.parse_args()

    test_size = args.test
    random_state = args.random
    n_iter = args.iter
    n_jobs = -1
    scorer = SCORERS[args.scorer]

    examples = load_files(args.path, shuffle=False)

    data = examples.data
    target = examples.target

    print("Total %i Train/test split %0.2f/%0.2f %i/%i" % (len(data),
                                                           (1 - args.test), args.test,
                                                           int(len(data) * (1 - args.test)),
                                                           int(len(data) * args.test)))

    experiments = list()

    experiments.append(('pa-tfidf', Pipeline([
        ('vec', TfidfVectorizer()),
        ('clf', PassiveAggressiveClassifier(random_state=random_state)),
    ])))

    experiments.append(('pa-hash-no-average', Pipeline([
        ('vec', HashingVectorizer()),
        ('clf', PassiveAggressiveClassifier(random_state=random_state)),
    ])))

    experiments.append(('pa-hash-average-100', Pipeline([
        ('vec', HashingVectorizer()),
        ('clf', PassiveAggressiveClassifier(random_state=random_state, average=100)),
    ])))


    experiments.append(('sgdc-tfidf', Pipeline([
        ('vec', TfidfVectorizer()),
        ('clf', SGDClassifier(random_state=random_state)),
    ])))

    experiments.append(('sgdc-hash-no-average', Pipeline([
        ('vec', HashingVectorizer()),
        ('clf', SGDClassifier(random_state=random_state)),
    ])))

    experiments.append(('sgdc-hash-average-100', Pipeline([
        ('vec', HashingVectorizer()),
        ('clf', SGDClassifier(random_state=random_state, average=100)),
    ])))

    results = list()

    for name, estimator in experiments:
        print(name)
        results.append((name, list(randomized_train_test(estimator, data, target, test_size, n_iter,
                                                         scorer,
                                                         random_state, n_jobs))))

    print(results)

    plt.figure()
    plot(args.scorer, results, grid_w=3, grid_h=2)
    plt.show()


if __name__ == "__main__":
    main()

@agramfort
Member

@agramfort agramfort commented Jul 8, 2015

can you add a tiny test?

@aesuli
Contributor Author

@aesuli aesuli commented Jul 8, 2015

Yes. But it is not clear to me what the test should check. Do you have any examples of similar tests to point me to?
On 08 Jul 2015 at 15:07, "Alexandre Gramfort" notifications@github.com wrote:

can you add a tiny test?



@aesuli
Contributor Author

@aesuli aesuli commented Jul 9, 2015

I have extended the original tests for the PassiveAggressive classifier/regressor so that they are also run and checked with the average option turned on.
Note that the code that actually performs the averaging is implemented in the BaseSGDClassifier, BaseSGDRegressor, and BaseSGD classes, and it is extensively tested in linear_model/tests/test_sgd.py.

@agramfort
Member

@agramfort agramfort commented Jul 11, 2015

looks pretty good but can you avoid duplicating so much code for testing?

also I feel that the average param is not really tested. I am sure that the tests would still pass if the average param was ignored internally.

@amueller
Member

@amueller amueller commented Jul 11, 2015

@ogrisel are there logs we can look into for the failure? Or restart appveyor?

@aesuli
Contributor Author

@aesuli aesuli commented Jul 13, 2015

@agramfort I have added the same checks from test_sgd.py to verify that averaging is actually performed (checking for the presence of attributes added exclusively by the averaging code).
With respect to code duplication, it is the minimal code needed to get the data, fit the classifier, and check its accuracy.

@TomDLT
Member

@TomDLT TomDLT commented Jul 27, 2015

Looks good to me.

To reduce test duplication, you can do something like:

def test_classifier_partial_fit():
    classes = np.unique(y)
    for data in (X, X_csr):
        for average in (False, True):
            clf = PassiveAggressiveClassifier(C=1.0, fit_intercept=True,
                                              random_state=0, average=average)
            for t in range(30):
                clf.partial_fit(data, y, classes)
            score = clf.score(data, y)
            assert_greater(score, 0.79)
            if average:
                assert_true(hasattr(clf, 'average_coef_'))

and so on...
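For the regressor the same pattern would apply; a sketch (the synthetic X and y here are illustrative placeholders, not the fixtures used in the real test module):

```python
import numpy as np
from sklearn.linear_model import PassiveAggressiveRegressor

# Placeholder data; the real test module uses shared fixtures for X and y.
rng = np.random.RandomState(0)
X = rng.randn(100, 10)
y = X.dot(rng.randn(10))

for average in (False, True):
    reg = PassiveAggressiveRegressor(C=1.0, fit_intercept=True,
                                     random_state=0, average=average)
    reg.fit(X, y)
    # With averaging on, the averaged weights are what end up in coef_.
    assert reg.coef_.shape == (10,)
```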

@aesuli
Contributor Author

@aesuli aesuli commented Jul 29, 2015

@TomDLT thank you for the suggestion, I followed your advice.

@TomDLT
Member

@TomDLT TomDLT commented Jul 30, 2015

Good job, it looks better!
Travis is not happy but it is unrelated: the master branch also fails.

@amueller
Member

@amueller amueller commented Jul 30, 2015

Travis errors are caused by #5045 and are unrelated.

@amueller
Member

@amueller amueller commented Jul 30, 2015

LGTM (though testing for the presence of the attributes in a single test would probably have sufficed).

@amueller amueller changed the title Added 'average' option to passive aggressive classifier/regressor. [MRG] Added 'average' option to passive aggressive classifier/regressor. Jul 30, 2015
@aesuli aesuli closed this Sep 1, 2015
@aesuli aesuli reopened this Sep 1, 2015
aesuli added 2 commits Sep 1, 2015
Conflicts:
	sklearn/linear_model/passive_aggressive.py
…into aesuli-pa-average

Conflicts:
	sklearn/linear_model/passive_aggressive.py
@TomDLT
TomDLT approved these changes Sep 19, 2016
Member

@TomDLT TomDLT left a comment

We could have merged this a long time ago; sorry @aesuli for the slow turnaround.
I added a few comments and I think we can merge it.

result in the ``coef_`` attribute. If set to an int greater than 1,
averaging will begin once the total number of samples seen reaches
average. So average=10 will begin averaging after seeing 10 samples.


@TomDLT

TomDLT Sep 19, 2016
Member

You should add a tag .. versionadded:: 0.19 as above.

@@ -1,3 +1,5 @@
import unittest
from nose.tools import assert_true


@TomDLT

TomDLT Sep 19, 2016
Member

Please use instead from sklearn.utils.testing import assert_true which centralizes the nose dependency.

@@ -1,3 +1,5 @@
import unittest


@TomDLT

TomDLT Sep 19, 2016
Member

I guess you can remove this import.

            score = clf.score(data, y)
            assert_greater(score, 0.79)
            if average:
                assert_true(hasattr(clf, 'average_coef_'))


@jnothman

jnothman Sep 20, 2016
Member

Is there a straightforward way to test the sanity of these values?


@aesuli

aesuli Sep 23, 2016
Author Contributor

I don't know what kind of relevant characteristics these weights have that could be checked.
The check on the score at line 81 is an indirect check that they produce sound classifications.
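One possible indirect sanity check (a sketch on synthetic data, not code from the PR): averaging should change the fitted weights while still classifying an easy problem well.

```python
import numpy as np
from sklearn.linear_model import PassiveAggressiveClassifier

# Easy, linearly separable placeholder problem.
rng = np.random.RandomState(0)
X = rng.randn(300, 5)
y = (X[:, 0] > 0).astype(int)

plain = PassiveAggressiveClassifier(random_state=0).fit(X, y)
avg = PassiveAggressiveClassifier(random_state=0, average=True).fit(X, y)

# Same shape, different values: averaging actually altered the solution.
assert plain.coef_.shape == avg.coef_.shape
assert not np.allclose(plain.coef_, avg.coef_)
# The averaged model still classifies the easy problem well.
assert avg.score(X, y) > 0.9
```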

@@ -59,9 +59,15 @@ class PassiveAggressiveClassifier(BaseSGDClassifier):
weights inversely proportional to class frequencies in the input data
as ``n_samples / (n_classes * np.bincount(y))``
.. versionadded:: 0.17
.. versionadded:: 0.17


@jnothman

jnothman Sep 20, 2016
Member

Why this change in indentation?

warm_start=warm_start,
class_weight=class_weight,
n_jobs=n_jobs)
penalty=None,


@jnothman

jnothman Sep 20, 2016
Member

Why this change in indentation?

result in the ``coef_`` attribute. If set to an int greater than 1,
averaging will begin once the total number of samples seen reaches
average. So average=10 will begin averaging after seeing 10 samples.


@jnothman

jnothman Sep 20, 2016
Member

veersionadded

verbose=verbose,
random_state=random_state,
warm_start=warm_start)
penalty=None,


@jnothman

jnothman Sep 20, 2016
Member

indentation

@TomDLT TomDLT changed the title [MRG] Added 'average' option to passive aggressive classifier/regressor. [MRG+1] Added 'average' option to passive aggressive classifier/regressor. Sep 23, 2016
@jnothman
Member

@jnothman jnothman commented Sep 28, 2016

LGTM

@jnothman
Member

@jnothman jnothman commented Sep 28, 2016

Please add an entry to what's new and we can merge.

@aesuli
Contributor Author

@aesuli aesuli commented Sep 29, 2016

@jnothman Is this a "new feature" or an "enhancement"?

@jnothman
Member

@jnothman jnothman commented Sep 30, 2016

Enhancement is fine. Could be either, but features are usually bigger (e.g. new estimators) and highlighted in a release.

@jnothman jnothman changed the title [MRG+1] Added 'average' option to passive aggressive classifier/regressor. [MRG+2] Added 'average' option to passive aggressive classifier/regressor. Sep 30, 2016
@jnothman jnothman merged commit 69c8382 into scikit-learn:master Sep 30, 2016
2 of 3 checks passed
ci/circleci: A command timed out during your tests
continuous-integration/appveyor/pr: AppVeyor build succeeded
continuous-integration/travis-ci/pr: The Travis CI build passed
@jnothman
Member

@jnothman jnothman commented Sep 30, 2016

Thanks, @aesuli

TomDLT added a commit to TomDLT/scikit-learn that referenced this pull request Oct 3, 2016
afiodorov added a commit to unravelin/scikit-learn that referenced this pull request Apr 25, 2017
Sundrique added a commit to Sundrique/scikit-learn that referenced this pull request Jun 14, 2017
paulha added a commit to paulha/scikit-learn that referenced this pull request Aug 19, 2017