
Fix the issue #3745, the code book generation for OutputCodeClassifier #3768

Closed
queqichao wants to merge 11 commits into scikit-learn:master from queqichao:multiclass_code_book_fix

Conversation

queqichao
Contributor

  • Change the process of generating the output code for OutputCodeClassifier. The new process draws subsets of the exhaustive code book (see [1]) multiple times and picks the one that gives the largest Hamming distances between classes.
  • Change the default value of code_size from 1.5 to 1. The value 1.5 is problematic: for example, when n_classes = 3 the exhaustive code book has size 3, so a code_size of 1.5 is not achievable.
  • Add a test case.
  • Update the documentation.

[1] Thomas G. Dietterich, Ghulum Bakiri. Solving Multiclass Learning Problems via Error-Correcting Output Codes
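For concreteness, here is a minimal sketch of the proposed strategy, assuming the exhaustive code book construction from [1]; the function name, signature, and max_iter default are illustrative and may differ from the actual code in this PR.

```python
import numpy as np
from sklearn.metrics import pairwise_distances
from sklearn.utils import check_random_state

def max_hamming_code_book(n_classes, n_columns_drawn, random_state=None,
                          max_iter=100):
    # The exhaustive code book for n_classes has 2**(n_classes - 1) - 1
    # distinct columns; any further column is the complement of an
    # existing one and defines the same binary problem.
    n_columns = 2 ** (n_classes - 1) - 1
    if not 0 < n_columns_drawn <= n_columns:
        raise ValueError("n_columns_drawn must be in (0, %d]" % n_columns)
    rng = check_random_state(random_state)
    bits = 1 << np.arange(n_classes - 1, -1, -1)
    best_book, best_dist = None, -np.inf
    for _ in range(max_iter):
        # Draw column indices without replacement so that no two
        # columns (binary problems) are identical.
        drawn = rng.permutation(n_columns)[:n_columns_drawn]
        # Column j is the n_classes-bit expansion of j + n_columns + 1;
        # its leading bit is always 1, which rules out complements.
        book = (((drawn[:, None] + n_columns + 1) & bits) > 0).astype(int).T
        # Keep the draw that maximizes the summed pairwise Hamming
        # distance between class codewords (rows).
        dist = pairwise_distances(book, metric="hamming").sum()
        if dist > best_dist:
            best_book, best_dist = book, dist
    return best_book
```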

@coveralls

Coverage Status

Coverage increased (+0.0%) when pulling 9565a2a on queqichao:multiclass_code_book_fix into 031a3fc on scikit-learn:master.

@arjoly
Member

arjoly commented Oct 14, 2014

Can you preserve the previous strategy? We need to remain backward compatible.

@queqichao
Contributor Author

@arjoly I didn't actually change the interface. The only place that could potentially cause a backward-compatibility issue is that a ValueError is now raised when code_size is not in the valid range.

To resolve this, there are two things I can do:
(1) Keep the old method as the default and add an option to use the new one. However, the old method is suboptimal and inefficient, so I do not think keeping it as the default is good for OutputCodeClassifier.
(2) Fall back to the old method when code_size is not in the valid range for the new one. In that case, instead of raising an error, the program would still run and could emit a deprecation warning.

Please give me your thoughts. Thanks.

@arjoly
Member

arjoly commented Oct 14, 2014

Being backward compatible also means that you are still able to reproduce results from past experiments with the current version of scikit-learn.

What do you think of having a new constructor parameter called strategy, which would allow selecting between the previous strategy and the one you implemented via a string? The default could be an "auto" option that automatically makes a good choice.
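A hypothetical sketch of what such a constructor could look like (this parameter was never part of the released API; names and defaults are illustrative):

```python
from sklearn.base import BaseEstimator, ClassifierMixin

class OutputCodeClassifier(BaseEstimator, ClassifierMixin):
    def __init__(self, estimator, code_size=1.5, strategy="auto",
                 random_state=None, n_jobs=1):
        # strategy: "auto" would pick between "random" (the historical
        # behavior) and the new subset-sampling strategy, depending on
        # whether code_size is valid for the exhaustive code book.
        self.estimator = estimator
        self.code_size = code_size
        self.strategy = strategy
        self.random_state = random_state
        self.n_jobs = n_jobs
```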

@queqichao
Contributor Author

I can add an extra parameter strategy to the constructor. By "auto" do you mean setting the default value of strategy to the old strategy? Of course I can do this, but as I mentioned before, the old one is not a very good method. One example: for iris in the example code there are 3 classes, [1, 2, 3], and the exhaustive code book would be
[[1, 1, 1],
 [0, 0, 1],
 [0, 1, 0]]
where each row is the code for one class and each column corresponds to a binary classification problem. Any extra column would be the complement of an existing column and thus give the same binary classification problem. In this case code_size cannot be larger than 1, yet the old strategy allows it.

The new method only samples a subset of the exhaustive code book, which has 2^(n_class-1)-1 columns. I admit that the choice of code_size becomes a little trickier.
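To make the complement argument concrete, a small illustration based on the 3-class example above:

```python
import numpy as np

# Exhaustive code book for 3 classes: rows are classes, columns are
# binary problems; there are 2**(3 - 1) - 1 = 3 distinct columns.
code_book = np.array([[1, 1, 1],
                      [0, 0, 1],
                      [0, 1, 0]])

# Any fourth column would be the complement of an existing one: e.g.
# [0, 1, 1] is the complement of the first column [1, 0, 0] and splits
# the classes identically, so it defines the same binary problem.
```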

So what I would like to do is set the default to the new strategy but still keep the old strategy working for existing users who do not change their code. Do you have any idea how this could be achieved?

@arjoly
Member

arjoly commented Oct 14, 2014

By 'auto', I mean that it would always make a reasonably good choice for the user, i.e. select the old or the new strategy depending on the code size.

@queqichao
Contributor Author

@arjoly I just kept the old strategy and added an extra parameter to the constructor to let the user choose the coding strategy.

@@ -625,6 +631,13 @@ class OutputCodeClassifier(BaseEstimator, ClassifierMixin, MetaEstimatorMixin):
If 1 is given, no parallel computing code is used at all, which is
useful for debugging. For n_jobs below -1, (n_cpus + 1 + n_jobs) are
used. Thus for n_jobs = -2, all CPUs but one are used.
coding_strategy : str, optional, default: None
Member

strategy would be consistent with the dummy estimators.

What is the meaning of None?

Member

Instead of None, could it be auto for the automatic strategy?

Contributor Author

In the case of None it uses an auto strategy internally, so maybe "auto" is better in terms of readability.

@arjoly
Member

arjoly commented Oct 15, 2014

Can you add tests that ensure that each code book satisfies the properties stated in the documentation?

elif self.coding_strategy == "opt_column_selection":
    self._opt_column_selection_code_book(random_state, code_size_)
else:
    raise ValueError("Unknown coding strategy %r" % self.coding_strategy)
Member

Can you list all the possible strategies for the user?

@queqichao
Contributor Author

Hi @arjoly, thanks for your comments; I addressed them in the new version.

dist = 0
for k in range(max_iter):
    p = random_state.permutation(max_code_size)
    tmp_code_book = (p[:code_size, None] + max_code_size+1 & (1 << np.arange(n_classes-1, -1, -1)) > 0).astype(int).T
Member

Could you cut this into several lines? It's a bit hard to read as it is.

Contributor Author

Done.
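For readers of the thread, one possible multi-line form of the quoted expression (a reconstruction, not necessarily the code that was pushed; it assumes max_code_size is the exhaustive size 2^(n_classes-1) - 1, and relies on + binding tighter than &, which binds tighter than >):

```python
# Shift the drawn indices into [max_code_size + 1, 2 * max_code_size]
# so that the leading bit of every value is set.
offsets = p[:code_size, None] + max_code_size + 1
# One mask per bit, from the most significant bit down.
bit_masks = 1 << np.arange(n_classes - 1, -1, -1)
# Extract the bits and transpose so that rows are classes and
# columns are binary problems.
tmp_code_book = ((offsets & bit_masks) > 0).astype(int).T
```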

dist1 = np.sum(pairwise_distances(
    _max_hamming_code_book(5, random_state, 10, 2),
    metric='hamming'))
assert_true(dist0 >= dist1)
Member

Here you can use assert_greater_equal.

@coveralls

Coverage Status

Coverage increased (+0.03%) when pulling e5187ee on queqichao:multiclass_code_book_fix into 031a3fc on scikit-learn:master.

@queqichao
Contributor Author

@arjoly, please take a look at the new version. Thanks.

@arjoly
Member

arjoly commented Oct 22, 2014

Ok, I now have a better understanding of the whole algorithm. To summarize, the two main differences between your approach and the old one are:

  1. Ensure non-repeated codewords by sampling without replacement.
  2. Iteratively try to obtain a good code book.

What do you think of adding a parameter such as bootstrap_codes to select between sampling with and without replacement? And what do you think of adding the possibility to iteratively generate a good code book for both approaches?

Finally, what are the advantages of the "dense" code book vs the "sparse" code book presented in the paper Reducing Multiclass to Binary: A Unifying Approach for Margin Classifiers?

Could you come up with an example that illustrates the benefit of the new feature? Based on the example, we might have a better picture of the best default parameters for the estimator.

+ Update reference.
@queqichao
Contributor Author

@arjoly, I think the first feature is the most important one in the new algorithm. Suppose you sample from the code book with replacement; you could end up with two identical columns. E.g., for a 3-class problem you could sample the following code book:
1 1 1
0 0 1
1 1 0
Each column gives a binary problem, so here the first two columns correspond to the same binary problem (setting classes 0 and 2 to "1" and class 1 to "0"). As long as you do not change the training data, the result will be the same if you use a "deterministic" algorithm like an SVM, and you get a duplicated classifier. So sampling with replacement does not help in this case.

The second thing is why iterative optimization can improve the code book. This part is more subtle. As Solving Multiclass Learning Problems via Error-Correcting Output Codes suggests, there are two criteria: (1) row separation and (2) column separation. The reasoning is provided in the paper. What I do in the algorithm is basically to optimize these two criteria iteratively through random sampling.
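A sketch of how these two criteria could be measured (an illustrative helper, not the code in this PR): row separation is the minimum pairwise Hamming distance between codewords, and column separation also considers complements, so identical or complementary columns score zero.

```python
import numpy as np
from scipy.spatial.distance import pdist

def separation_scores(code_book):
    # Row separation: minimum pairwise Hamming distance between the
    # class codewords (rows).
    row_sep = pdist(code_book, metric="hamming").min()
    # Column separation: for each pair of columns, take the smaller of
    # the distance to the other column and to its complement, so that
    # duplicated or complementary columns (the same binary problem)
    # score zero.
    cols = code_book.T
    col_seps = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            d = np.mean(cols[i] != cols[j])
            d_comp = np.mean(cols[i] != 1 - cols[j])
            col_seps.append(min(d, d_comp))
    return row_sep, min(col_seps)

# For the code book with the duplicated column above, this returns a
# column separation of 0.0, flagging the redundant binary classifier.
```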

I can also run some experiments later to demonstrate the empirical effectiveness of the new algorithm on MNIST.

@arjoly
Member

arjoly commented Oct 23, 2014

Thanks @queqichao for the explanation. Now we have to be sure that we have the proper interface, one that rationalizes the current and the new features (e.g. benefiting everywhere from the iterative algorithm while staying DRY) without blocking future options and without building things we don't yet need (YAGNI).

Could you come up with an example that illustrates the benefit of the new feature? Based on the example, we might have a better picture of the best default parameters for the estimator.

I can also run some experiments later to demonstrate the empirical effectiveness of the new algorithm on MNIST.

What I suggested is a new example for the narrative documentation. It's very important to highlight your work and make it known to everybody that you have written a very useful piece of code. Without an example, it will be hard for users to discover your contribution.

Unfortunately MNIST is too big a dataset for the narrative documentation; instead we can use any dataset used in http://scikit-learn.org/stable/auto_examples/index.html.

+ Make too small code_size invalid for all strategies.
@queqichao
Contributor Author

@arjoly I think you make a good point. I am planning to add a simple example comparing the new algorithm with the old one and with the other multi-class coding methods. I ran a simple experiment on the digits dataset and plotted the classification error of the different coding algorithms.
[Figure: classification error vs. code_size for the different coding strategies]
Here the horizontal axis is code_size and the vertical axis is the error. Because 'iter_hamming' and 'random' are randomized algorithms, their errors are averaged over 50 repetitions.

The new algorithm 'iter_hamming' is better than the old 'random' one when code_size is relatively small. This is expected, because the 'random' strategy is more vulnerable to codeword collisions when code_size is small. But both are worse than the other algorithms, probably because neither randomized coding scheme is particularly optimized.

Finding a better coding algorithm for multi-class problems is still an open problem, I believe. But the new algorithm is at least better than the old one in certain situations. So what do you think? Where would be the right place for the example and the corresponding documentation?

@queqichao
Contributor Author

Hi @arjoly, please take a look at the new version, which includes an example for the new multiclass coding strategy.

@arjoly
Member

arjoly commented Nov 3, 2014

Recently, I haven't had much time to look at this pull request. I will try to dig some time this week. Thanks for your patience.

@@ -0,0 +1,139 @@
"""
===========================
Multi-class classification
Member

I would rename this "Multi-class encoding" or something since the example is about coding strategies, not multi-class classification.

@amueller
Member

amueller commented Nov 6, 2014

Basically your example shows that output coding is much worse in every way than just using OVO or OVR. With this graph, it is not really clear why we would want to add the algorithm.
The method you contributed is better than the random one, but I'm not super convinced about adding a substantial amount of code for something that will not be useful in practice. Do you have an example where output encoding fares better than OVR or OVO?

@amueller
Member

amueller commented Nov 6, 2014

I guess we could still add the algorithms for completeness, since we already have the error-correcting output code, but maybe add a note that this is more for illustration purposes? I'm not entirely sure what the purpose of adding it is...

@queqichao
Contributor Author

You're correct. I guess output coding does not necessarily outperform OVO or OVR in practice; that's probably why most people still prefer OVO or OVR. Before I initiated this pull request, I just thought the original algorithm was not perfect. But after doing the experiment, I found that output coding does not work so well, at least on the data sets I have tried.

@amueller
Member

amueller commented Nov 6, 2014

Do you think it would still be worth including this in scikit-learn? Or do you think it would be worth doing more experiments?

@queqichao
Contributor Author

If output codes are kept in scikit-learn, I think an improvement to the original algorithm might be worthwhile. But I admit that the justification for the improvement is not solid. Actually, the original motivation for adding output codes to scikit-learn is confusing to me, because their effectiveness was never fully tested.

I would like to do more experiments, but the data sets available for multi-class classification are quite limited.

@amueller amueller added the Needs Decision Requires decision label Aug 5, 2019
Base automatically changed from master to main January 22, 2021 10:48
@thomasjpfan thomasjpfan added Needs Decision - Close Requires decision for closing and removed Needs Decision Requires decision labels Feb 8, 2022
@glemaitre
Member

From the latest comment, closing this PR.

@glemaitre glemaitre closed this Jul 29, 2022