
[MRG+2] Added 'average' option to passive aggressive classifier/regressor. #4939

Merged
merged 11 commits into scikit-learn:master on Sep 30, 2016

aesuli (Contributor) commented Jul 8, 2015

Unlike SGDClassifier (and SGDRegressor), PassiveAggressiveClassifier (and PassiveAggressiveRegressor) does not expose the 'average' option of the BaseSGD class.
The 'average' option helps smooth out the impact of rarely observed variables. For example, when used in combination with the HashingVectorizer on text, it helps to compensate for the lack of idf information, lowering the impact of low-idf (i.e., rare) features.
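
With this PR applied, enabling averaging is just a constructor parameter, mirroring SGDClassifier (a minimal sketch):

from sklearn.linear_model import PassiveAggressiveClassifier

# average=True averages from the first sample; an int such as average=10
# starts averaging only after 10 samples have been seen
clf = PassiveAggressiveClassifier(random_state=0, average=10)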

The following is an example of using averaging with both SGDClassifier and PassiveAggressiveClassifier.
The data is the Bo Pang movie review dataset ( https://www.cs.cornell.edu/people/pabo/movie-review-data/review_polarity.tar.gz ).
The results show that the averaging implemented in BaseSGD works for both classifiers, smoothing out the impact of "noise" features.

[figure "averagedemo": per-experiment histograms of the evaluation scores over the randomized runs, with the mean (dashed) and ±1 std (dotted) marked]

The following is the code to replicate the experiment that generated the above plot:

import argparse
import codecs
import sys
import matplotlib.pyplot as plt
import numpy as np

from sklearn.cross_validation import ShuffleSplit
from sklearn import cross_validation
from sklearn.datasets import load_files
from sklearn.feature_extraction.text import TfidfVectorizer, HashingVectorizer
from sklearn.linear_model import PassiveAggressiveClassifier, SGDClassifier
from sklearn.metrics import SCORERS
from sklearn.pipeline import Pipeline


def randomized_train_test(estimator, data, target, test_size, n_iter, scorer, random_state, n_jobs):
    cv = ShuffleSplit(len(data), n_iter=n_iter, test_size=test_size, random_state=random_state)

    scores = cross_validation.cross_val_score(estimator, data, target,
                                               scoring=scorer, cv=cv, n_jobs=n_jobs)
    score_mean = scores.mean()
    score_std = scores.std()
    return score_mean, score_std, scores


def plot(scorer_name, results, grid_w=None, grid_h=None):
    all_scores = list()
    for _, (_, _, scores) in results:
        all_scores.extend(scores)

    count = len(results)

    # np.histogram requires an integer bin count
    n_bins = 10 + int(len(all_scores) /
                      (count * np.max((1, np.log2(len(all_scores) / count)))))
    bins = np.histogram(all_scores, bins=n_bins)[1]

    if grid_w is None:
        grid_w = int(np.ceil(np.sqrt(count)))
    if grid_h is None:
        grid_h = int(np.ceil(count / grid_w))

    subplots = list()

    for i, (name, (score_mean, score_std, scores)) in enumerate(results):
        if len(subplots) > 0:
            subplot = plt.subplot2grid((grid_h, grid_w),
                                       (int(i / grid_w), int(i % grid_w)),
                                       sharex=subplots[0],
                                       sharey=subplots[0])
        else:
            subplot = plt.subplot2grid((grid_h, grid_w),
                                       (int(i / grid_w), int(i % grid_w)))

        subplot.set_title('%s mean %s=%0.3f std=%0.3f' % (name, scorer_name, score_mean, score_std))
        subplot.hist(scores, alpha=0.5, bins=bins)
        subplots.append(subplot)

    ylim = subplots[0].get_ylim()

    for (name, (score_mean, score_std, _)), subplot in zip(results, subplots):
        subplot.plot([score_mean for _ in ylim], ylim, '--k', linewidth=2)
        subplot.plot([score_mean - score_std for _ in ylim], ylim, ':k', linewidth=2)
        subplot.plot([score_mean + score_std for _ in ylim], ylim, ':k', linewidth=2)

# download demo data from:
# https://www.cs.cornell.edu/people/pabo/movie-review-data/review_polarity.tar.gz

def main():
    sys.stdout = codecs.getwriter('utf8')(sys.stdout.buffer)
    parser = argparse.ArgumentParser(description='')
    parser.add_argument('-p', '--path', help='Input path', required=True)
    parser.add_argument('-r', '--random', help='Randomization seed (default: 0)', type=int, default=0)
    parser.add_argument('-s', '--scorer',
                        help='Evaluation function (default: f1, values: %s)' % ' '.join(SCORERS.keys()), type=str,
                        default='f1')
    parser.add_argument('-i', '--iter', help='Number of randomized runs (default: 100)', type=int, default=100)
    parser.add_argument('-t', '--test', help='Portions of examples to be used as test set (default: 0.2)', type=float,
                        default=0.2)
    args = parser.parse_args()

    test_size = args.test
    random_state = args.random
    n_iter = args.iter
    n_jobs = -1
    scorer = SCORERS[args.scorer]

    examples = load_files(args.path, shuffle=False)

    data = examples.data
    target = examples.target

    print("Total %i Train/test split %0.2f/%0.2f %i/%i" % (len(data),
                                                           (1 - args.test), args.test,
                                                           int(len(data) * (1 - args.test)),
                                                           int(len(data) * args.test)))

    experiments = list()

    experiments.append(('pa-tfidf', Pipeline([
        ('vec', TfidfVectorizer()),
        ('clf', PassiveAggressiveClassifier(random_state=random_state)),
    ])))

    experiments.append(('pa-hash-no-average', Pipeline([
        ('vec', HashingVectorizer()),
        ('clf', PassiveAggressiveClassifier(random_state=random_state)),
    ])))

    experiments.append(('pa-hash-average-100', Pipeline([
        ('vec', HashingVectorizer()),
        ('clf', PassiveAggressiveClassifier(random_state=random_state, average=100)),
    ])))


    experiments.append(('sgdc-tfidf', Pipeline([
        ('vec', TfidfVectorizer()),
        ('clf', SGDClassifier(random_state=random_state)),
    ])))

    experiments.append(('sgdc-hash-no-average', Pipeline([
        ('vec', HashingVectorizer()),
        ('clf', SGDClassifier(random_state=random_state)),
    ])))

    experiments.append(('sgdc-hash-average-100', Pipeline([
        ('vec', HashingVectorizer()),
        ('clf', SGDClassifier(random_state=random_state, average=100)),
    ])))

    results = list()

    for name, estimator in experiments:
        print(name)
        results.append((name, list(randomized_train_test(estimator, data, target, test_size, n_iter,
                                                         scorer,
                                                         random_state, n_jobs))))

    print(results)

    plt.figure()
    plot(args.scorer, results, grid_w=3, grid_h=2)
    plt.show()


if __name__ == "__main__":
    main()
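
Assuming the archive is extracted in place (with the pos/neg folders under review_polarity/txt_sentoken, the usual layout of that dataset), the experiment can be run as, for example:

python pa_average_demo.py -p review_polarity/txt_sentoken -s f1 -i 100 -t 0.2

where pa_average_demo.py is a hypothetical name for the script above.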

agramfort (Member) commented Jul 8, 2015

can you add a tiny test?

aesuli (Contributor) commented Jul 8, 2015

Yes. But it is not clear to me what the test should check. Do you have any examples of similar tests to point me to?

aesuli (Contributor) commented Jul 9, 2015

I have extended the original tests for the PassiveAggressive classifier/regressor so that they are also run and checked with the average option turned on.
Note that the code that actually performs the averaging is implemented in the BaseSGDClassifier, BaseSGDRegressor, and BaseSGD classes, and it is extensively tested in linear_model/tests/test_sgd.py.

agramfort (Member) commented Jul 11, 2015

Looks pretty good, but can you avoid duplicating so much code for testing?

Also, I feel that the average param is not really tested. I am sure that the tests would still pass if the average param were ignored internally.

amueller (Member) commented Jul 11, 2015

@ogrisel are there logs we can look into for the failure? Or restart AppVeyor?

Added checks for presence of attributes generated by the averaging code, following what is done in test_sgd.py.
aesuli (Contributor) commented Jul 13, 2015

@agramfort I have added the same checks from test_sgd.py to verify that averaging is actually performed (checking for the presence of attributes added exclusively by the averaging code).
With respect to code duplication, it is the minimal code needed to get the data, fit the classifier, and check its accuracy.
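
For the record, the borrowed checks are along these lines (a sketch; the attribute names are those used in test_sgd.py, where averaging exposes the averaged and the last-seen weights separately):

import numpy as np
from sklearn.linear_model import PassiveAggressiveClassifier

X = np.array([[0., 0.], [1., 1.], [2., 2.], [3., 3.]])
y = np.array([0, 0, 1, 1])

clf = PassiveAggressiveClassifier(average=True, random_state=0)
clf.fit(X, y)

# these attributes are set only by the averaging code path
assert hasattr(clf, 'average_coef_')
assert hasattr(clf, 'average_intercept_')
assert hasattr(clf, 'standard_coef_')
assert hasattr(clf, 'standard_intercept_')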

TomDLT (Member) commented Jul 27, 2015

Looks good to me.

To reduce test duplication, you can do something like:

def test_classifier_partial_fit():
    classes = np.unique(y)
    for data in (X, X_csr):
        for average in (False, True):
            clf = PassiveAggressiveClassifier(C=1.0, fit_intercept=True,
                                              random_state=0, average=average)
            for t in range(30):
                clf.partial_fit(data, y, classes)
            score = clf.score(data, y)
            assert_greater(score, 0.79)
            if average:
                assert_true(hasattr(clf, 'average_coef_'))

and so on...

aesuli (Contributor) commented Jul 29, 2015

@TomDLT thank you for the suggestion, I followed your advice.

TomDLT (Member) commented Jul 30, 2015

Good job, it looks better!
Travis is not happy, but it is unrelated; the master branch also fails.

amueller (Member) commented Jul 30, 2015

Travis errors are caused by #5045 and unrelated.

amueller (Member) commented Jul 30, 2015

LGTM (though testing for the presence of the attributes in a single test would probably have sufficed).

@amueller amueller changed the title from Added 'average' option to passive aggressive classifier/regressor. to [MRG] Added 'average' option to passive aggressive classifier/regressor. Jul 30, 2015

@aesuli aesuli closed this Sep 1, 2015

@aesuli aesuli reopened this Sep 1, 2015

aesuli added some commits Sep 1, 2015

Merge remote-tracking branch 'upstream/master' into aesuli-pa-average
Conflicts:
	sklearn/linear_model/passive_aggressive.py

Merge branch 'master' of https://github.com/scikit-learn/scikit-learn into aesuli-pa-average
Conflicts:
	sklearn/linear_model/passive_aggressive.py
TomDLT approved these changes Sep 19, 2016

We could have merged this a long time ago; sorry @aesuli for the slow reaction time.
I added a few comments and I think we can merge it.

result in the ``coef_`` attribute. If set to an int greater than 1,
averaging will begin once the total number of samples seen reaches
average. So average=10 will begin averaging after seeing 10 samples.

TomDLT (Member) commented Sep 19, 2016

You should add a tag .. versionadded:: 0.19 as above.
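
A sketch of how the documented parameter then reads (the description is quoted from the diff above; the exact signature line is an assumption):

    average : bool or int, optional
        When set to True, computes the averaged SGD weights and stores the
        result in the ``coef_`` attribute. If set to an int greater than 1,
        averaging will begin once the total number of samples seen reaches
        average. So average=10 will begin averaging after seeing 10 samples.

        .. versionadded:: 0.19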

sklearn/linear_model/tests/test_passive_aggressive.py
@@ -1,3 +1,5 @@
import unittest
from nose.tools import assert_true

TomDLT (Member) commented Sep 19, 2016

Please use from sklearn.utils.testing import assert_true instead, which centralizes the nose dependency.

sklearn/linear_model/tests/test_passive_aggressive.py
@@ -1,3 +1,5 @@
import unittest

TomDLT (Member) commented Sep 19, 2016

I guess you can remove this import.

        score = clf.score(data, y)
        assert_greater(score, 0.79)
        if average:
            assert_true(hasattr(clf, 'average_coef_'))

jnothman (Member) commented Sep 20, 2016

Is there a straightforward way to test the sanity of these values?

aesuli (Contributor) commented Sep 23, 2016

I don't know what kind of relevant characteristics these weights have that could be checked.
The check on the score in line 81 is an indirect check that they produce sound classifications.

sklearn/linear_model/passive_aggressive.py
@@ -59,9 +59,15 @@ class PassiveAggressiveClassifier(BaseSGDClassifier):
weights inversely proportional to class frequencies in the input data
as ``n_samples / (n_classes * np.bincount(y))``
.. versionadded:: 0.17
.. versionadded:: 0.17

jnothman (Member) commented Sep 20, 2016

Why this change in indentation?

sklearn/linear_model/passive_aggressive.py
warm_start=warm_start,
class_weight=class_weight,
n_jobs=n_jobs)
penalty=None,

jnothman (Member) commented Sep 20, 2016

Why this change in indentation?

result in the ``coef_`` attribute. If set to an int greater than 1,
averaging will begin once the total number of samples seen reaches
average. So average=10 will begin averaging after seeing 10 samples.

jnothman (Member) commented Sep 20, 2016

veersionadded

sklearn/linear_model/passive_aggressive.py
verbose=verbose,
random_state=random_state,
warm_start=warm_start)
penalty=None,

jnothman (Member) commented Sep 20, 2016

indentation

@TomDLT TomDLT changed the title from [MRG] Added 'average' option to passive aggressive classifier/regressor. to [MRG+1] Added 'average' option to passive aggressive classifier/regressor. Sep 23, 2016

jnothman (Member) commented Sep 28, 2016

LGTM

jnothman (Member) commented Sep 28, 2016

Please add an entry to what's new and we can merge.

@jnothman jnothman removed the Needs Review label Sep 28, 2016

aesuli (Contributor) commented Sep 29, 2016

@jnothman Is it a "new feature" or an "enhancement"?

aesuli added some commits Sep 29, 2016

jnothman (Member) commented Sep 30, 2016

Enhancement is fine. It could be either, but features are usually bigger (e.g. new estimators) and highlighted in a release.
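
For reference, an entry under Enhancements could read something like this (hypothetical wording, following the what's new conventions):

   - :class:`linear_model.PassiveAggressiveClassifier` and
     :class:`linear_model.PassiveAggressiveRegressor` now expose the
     ``average`` option of ``BaseSGD``, computing averaged weights as in
     :class:`linear_model.SGDClassifier`. (#4939)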

@jnothman jnothman changed the title from [MRG+1] Added 'average' option to passive aggressive classifier/regressor. to [MRG+2] Added 'average' option to passive aggressive classifier/regressor. Sep 30, 2016

@jnothman jnothman merged commit 69c8382 into scikit-learn:master Sep 30, 2016

2 of 3 checks passed

ci/circleci: A command timed out during your tests
continuous-integration/appveyor/pr: AppVeyor build succeeded
continuous-integration/travis-ci/pr: The Travis CI build passed
jnothman (Member) commented Sep 30, 2016

Thanks, @aesuli

TomDLT added a commit to TomDLT/scikit-learn that referenced this pull request Oct 3, 2016

afiodorov added a commit to unravelin/scikit-learn that referenced this pull request Apr 25, 2017

Sundrique added a commit to Sundrique/scikit-learn that referenced this pull request Jun 14, 2017

paulha added a commit to paulha/scikit-learn that referenced this pull request Aug 19, 2017
