Discretization using Fayyad's MDLP stop criterion #4801

Closed
wants to merge 22 commits

Conversation

@hlin117
Contributor

hlin117 commented Jun 2, 2015

This pull request addresses #4468

This adds discretization using Fayyad's minimum description length principle (MDLP) stopping criterion. The original paper describing the principle is here. Essentially, it splits each continuous attribute into intervals by minimizing the conditional entropy of the class labels given the attribute.

I demonstrate how to use this feature in this gist. I also show that it produces "approximately" the same output as the corresponding R package "discretization". I say "approximately" because there are some rows where the output differs; having looked into this, I believe the discrepancies are due to roundoff errors in the R package. Also note that this feature allows users to specify which columns to discretize, whereas the R package assumes every column is continuous.
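For a rough idea of the intended usage, here is a hypothetical sketch (the class name follows the diff in this PR and the final API may well change):

from sklearn.datasets import load_iris
from sklearn.preprocessing import MDLP  # name as in this PR's diff; subject to change

iris = load_iris()
X, y = iris.data, iris.target

# fit() learns the cut points from the class labels; transform() maps each
# continuous column onto an interval index
discretizer = MDLP()
X_binned = discretizer.fit(X, y).transform(X)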

@jnothman

Member

jnothman commented Jun 2, 2015

Please add tests with examples from the paper, for instance, as well as tests for tricky cases. Your code uses syntax that is not valid in Python 3 (e.g. lambda (x, y): ...) as well as syntax that is not valid in Python 2.6 (dict comprehensions). But I suspect almost your entire contribution will be rewritten before merging in order to harness numpy. However, tests can always be implemented, regardless of the internal code structure.

sklearn/preprocessing/discretization.py
+ return log(x, 2) if x > 0 else 0
+
+
+class MDLP(object):

@jnothman

jnothman Jun 2, 2015

Member

This is not in accordance with scikit-learn's naming scheme. From a glance at your code, it is applicable to multiclass classification problems only. Perhaps we should call this ClassificationDiscretizer or MDLDiscretizer.

sklearn/preprocessing/discretization.py
+ Classification Learning"
+ """
+
+ def __init__(self):

@jnothman

jnothman Jun 2, 2015

Member

Please review the contributor's guide. __init__ might take constructor parameters like continuous_features, and these -- but only these -- should set attributes of the same names on the object.
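To illustrate the convention being pointed to, a minimal sketch (the parameter names here are only examples, not this PR's final API):

from sklearn.base import BaseEstimator, TransformerMixin

class MDLP(BaseEstimator, TransformerMixin):
    def __init__(self, continuous_features=None, min_depth=0):
        # __init__ only stores its parameters, unmodified; validation and
        # derived state belong in fit()
        self.continuous_features = continuous_features
        self.min_depth = min_depth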

sklearn/preprocessing/discretization.py
+ """
+
+ def __init__(self):
+ self.intervals = dict()

@jnothman

jnothman Jun 2, 2015

Member

This should be named intervals_ as with other output attributes in the project. Please also document input parameters and output attributes appropriately in the class docstring.

sklearn/preprocessing/discretization.py
+
+ "pos" here reflects large error. "neg" here reflect small error.
+ """
+

@jnothman

jnothman Jun 2, 2015

Member

In terms of idiomatic numpy, I'm almost certain you're going to land up with something more like:

def _slice_entropy(self, y_reordered, start, stop):
    counts = np.bincount(y_reordered[start:stop], minlength=self.n_classes_)
    vals = counts / float(stop - start)  # note float unnecessary with `from __future__ import division`
    return scipy.stats.entropy(vals, base=2)  # or could calculate `vals * np.log2(vals)` directly
sklearn/preprocessing/discretization.py
+ """
+
+ # attr_list contains triples, (attribute, y class)
+ attr_list = list([attr, y] for attr, y in izip(attributes, Y))

@jnothman

jnothman Jun 2, 2015

Member

Instead, maintain parallel numpy arrays for attr (let's call it x, a column of X) and y.

order = np.argsort(x)
x = x[order]
y = y[order]

Rewrite the method to work with this data structure.

@hlin117

Contributor

hlin117 commented Jun 2, 2015

Thanks for the review, @jnothman. I'll rewrite my code according to your comments.

@jnothman

Member

jnothman commented Jun 2, 2015

Just note it was not a full review, rather some ideas to take into consideration.


@hlin117

Contributor

hlin117 commented Jun 2, 2015

I updated the code according to @jnothman's comments above. I also tried to follow the scikit-learn contribution guidelines here more closely.

@hlin117

Contributor

hlin117 commented Jun 2, 2015

One point to note: the discretization might not be exactly the same as before, even though the algorithm did not change. There is a caveat when attribute values x are repeated but their class labels y differ: the algorithm sorts by x, so tied values can land in either order and the entropy at different cut points may vary.

@jnothman

Member

jnothman commented Jun 2, 2015

make it stable, at least, by doing a np.lexsort on x and y

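A minimal sketch of that stable ordering, assuming x and y are parallel 1-D numpy arrays:

import numpy as np

# np.lexsort treats the last key as the primary key: sort by x and break ties
# by y, so repeated x values always come out in the same order
order = np.lexsort((y, x))
x, y = x[order], y[order]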

sklearn/preprocessing/discretization.py
+ for Classification Learning"
+ """
+
+ def __init__(self, **params):

@jnothman

jnothman Jun 3, 2015

Member

Interesting that you've interpreted the contributors' guide this way... All our estimators have an explicit list of parameters, not ** magic. With an explicit list of parameters, the implementation of get_params and set_params in BaseEstimator will suffice.

@hlin117

hlin117 Jun 3, 2015

Contributor

Thanks for the tip. I think I got confused by the documentation of the OneHotEncoder here, whose set_params() function takes **. But now that I know that OneHotEncoder just inherits that function from its base class, which has the ** parameters, this makes sense to me.

sklearn/preprocessing/discretization.py
+ def fit(self, X, y):
+ """Finds the intervals of interest from the input data.
+ """
+ if type(X) is list:

@jnothman

jnothman Jun 3, 2015

Member

Use sklearn.utils.validation.check_array
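For illustration, a small self-contained example of that validation (not the PR's code):

from sklearn.utils.validation import check_X_y, check_array

# what fit() could do instead of the `type(X) is list` checks
X_list = [[0.0, 1.0], [2.0, 3.0]]
y_list = [0, 1]
X, y = check_X_y(X_list, y_list)   # returns validated ndarrays of consistent length
X = check_array(X_list)            # equivalent when only X needs validating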

sklearn/preprocessing/discretization.py
+ """Implements the MDLP discretization criterion from Usama Fayyad's
+ paper "Multi-Interval Discretization of Continuous-Valued Attributes
+ for Classification Learning"
+ """

@jnothman

jnothman Jun 3, 2015

Member

add Parameters and Attributes sections. In particular, I'd like to see a description of the intervals_ output.

sklearn/preprocessing/discretization.py
+ if type(y) is list:
+ y = np.array(y)
+ if self.continuous_columns is None:
+ self.continuous_columns = range(X.shape[1])

@jnothman

jnothman Jun 3, 2015

Member

The initial parameters should not be modified. If this is necessary, store continuous_columns_.
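A sketch of that convention inside fit (attribute names follow this comment; this is not the PR's code):

import numpy as np

def fit(self, X, y):
    # leave the constructor parameter untouched and store the resolved value
    # on a trailing-underscore attribute instead
    if self.continuous_columns is None:
        self.continuous_columns_ = np.arange(X.shape[1])
    else:
        self.continuous_columns_ = np.asarray(self.continuous_columns)
    # ... rest of fit ...
    return self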

sklearn/preprocessing/discretization.py
+ for i, interval in enumerate(intervals)]
+ return self
+
+ def transform(self, X):

@jnothman

jnothman Jun 3, 2015

Member

I've not looked at how you've stored the model, etc., but I would have expected a discretiser's transform method to work with a sorted array of boundaries per feature, such as boundaries=array([-inf, 0, 2, 5, 10, inf]) (the infs may not be necessary) then employ np.searchsorted, e.g. np.searchsorted(boundaries, [6, 3, 100, 0, 1]) returns array([4, 3, 5, 1, 2]).

So transform should be something like:

out = X.copy()
for i, boundaries in zip(self.continuous_columns, self.boundaries_):
    out[:, i] = np.searchsorted(boundaries, X[:, i])
return out

Please also accept an unused y argument (y=None) for API consistency.
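Put together, that suggestion might look something like the following sketch (boundaries_ and continuous_columns_ are the hypothetical fitted attributes discussed above):

import numpy as np
from sklearn.utils.validation import check_array

def transform(self, X, y=None):
    # y is accepted but ignored, for API consistency
    X = check_array(X, copy=True)
    for i, boundaries in zip(self.continuous_columns_, self.boundaries_):
        # index of the interval each value falls into
        X[:, i] = np.searchsorted(boundaries, X[:, i])
    return X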

@hlin117

hlin117 Jun 3, 2015

Contributor

Hmm, your implementation with np.searchsorted is actually very sleek, I like it. With some refactoring, I could definitely implement my discretizer this way. Thanks!

sklearn/preprocessing/discretization.py
+ return self
+
+ def transform(self, X):
+ """Converts the continuous values in X into ascii character

@jnothman

jnothman Jun 3, 2015

Member

I don't think we care for ASCII chars, just ordinal integers.

sklearn/preprocessing/discretization.py
+ "of targets in y")
+
+ self.num_classes_ = set(y)
+ self.intervals_ = dict()

@jnothman

jnothman Jun 3, 2015

Member

should be a list, with entries parallel to continuous_columns, so that it can be iterated in order.

@jnothman

jnothman Jun 3, 2015

Member

although I appreciate there's room for debate here, and you could keep the current approach if you like

@hlin117

hlin117 Jun 3, 2015

Contributor

I imagine each feature as being associated with a specific list of intervals, which is why I wanted self.intervals_ to be a dictionary: it maps a column to its list of intervals.

Furthermore, a dictionary would make compatibility with pandas DataFrame objects easier in the future: each column (indexed by a string) would map to its corresponding list of intervals. (But obviously, we need to go one step at a time and work on this pull request first.)

Does that sound fair? This implementation seems intuitive to me, at least.

sklearn/preprocessing/discretization.py
+
+ def _mdlp(self, attributes, Y):
+ """
+ *attributes*: A numpy 1 dimensional ndarray

@jnothman

jnothman Jun 3, 2015

Member

Usually scikit-learn calls these "features"

@jnothman

jnothman Jun 3, 2015

Member

And x would be more appropriate.

sklearn/preprocessing/discretization.py
+ """
+ *attributes*: A numpy 1 dimensional ndarray
+
+ *Y*: A python list of numeric class labels.

@jnothman

jnothman Jun 3, 2015

Member

this is conventionally called y, not Y

sklearn/preprocessing/discretization.py
+ # and [ind, end).
+ length = end - start
+
+ def partition_entropy(ind):

@jnothman

jnothman Jun 3, 2015

Member

Please inline this function. It doesn't help to be given a name, given that it is called once.

sklearn/preprocessing/discretization.py
+
+ # Need to see whether the "front" and "back" of the intervals need
+ # to be float("-inf") or float("inf")
+ if (k == -1) or (self._reject_split(y, start, end, k) and

@jnothman

jnothman Jun 3, 2015

Member

This condition suggests that _reject_split should be renamed _validate_split. It is also a much more complex condition than depth >= self.min_depth, so placing that condition first will avoid unnecessary calls to _validate_split.

@hlin117

hlin117 Jun 3, 2015

Contributor

Swapping depth >= self.min_depth and _reject_split is a good suggestion.

I think that we should keep the name _reject_split though. _validate_split, semantically, seems to check whether an input is correct, whereas _reject_split semantically means "we should stop cutting".

Do you have any follow up comments?

sklearn/preprocessing/discretization.py
+ return gain <= 1 / N * (log(N - 1) + delta)
+
+ @staticmethod
+ def _find_cut(y, start, end):

@jnothman

jnothman Jun 3, 2015

Member

This method is the most expensive part of the algorithm. You currently have an O(n^2) implementation, though it could be written in O(n), either by going back to using Counters, but using them more cleverly, or by implementing something fast and array-based in Cython, along the lines of:

def find_cut(int[:] y, int n_classes):
    cdef int[:] suffix_counts = np.bincount(y, minlength=n_classes)
    cdef int[:] prefix_counts = np.zeros_like(suffix_counts)
    # Iterate through cut positions, subtracting labels from suffix_counts and adding them to prefix_counts
    # Return minimum prefix entropy, suffix entropy, and cut position
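For illustration, a pure-numpy version of that O(n) prefix/suffix-count scan could look like this (find_best_cut is a hypothetical helper, not the code in this PR):

import numpy as np
from scipy.stats import entropy

def find_best_cut(y_sorted, n_classes):
    # y_sorted: class labels ordered by the feature value. Class counts on
    # each side of the cut are updated in O(1) per position, so the whole
    # scan is O(n) rather than O(n^2).
    n = len(y_sorted)
    suffix = np.bincount(y_sorted, minlength=n_classes).astype(float)
    prefix = np.zeros_like(suffix)
    best_cost, best_cut = np.inf, None
    for k in range(1, n):                      # cut between positions k-1 and k
        label = y_sorted[k - 1]
        prefix[label] += 1
        suffix[label] -= 1
        cost = (k * entropy(prefix, base=2)
                + (n - k) * entropy(suffix, base=2)) / n
        if cost < best_cost:
            best_cost, best_cut = cost, k
    return best_cut, best_cost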
@jnothman

Member

jnothman commented Jun 3, 2015

PS: still really looking for tests

@hlin117

Contributor

hlin117 commented Jun 3, 2015

@jnothman: I'll be happy to provide more tests. Where should I commit these tests though? Should I commit them as github gists?

@jnothman

Member

jnothman commented Jun 3, 2015

Tests are coded in separate modules. See sklearn/preprocessing/tests


@amueller

Member

amueller commented Jun 3, 2015

Also, a usage example would be great. Your example shows just what happens, but not why it is useful. It would be great to have a data set and task where using this method actually improves upon the numeric features. Why not just put them in a forest?

@GaelVaroquaux

Member

GaelVaroquaux commented Jun 3, 2015

@amueller

Member

amueller commented Jun 3, 2015

I can imagine there are use-cases where you want to train a linear model for speed reasons or because trees don't work on some variables or something. But I need to see an application / comparison to believe it is useful.

@jnothman

Member

jnothman commented Jun 3, 2015

Yes, while reviewing, I've been thinking the algorithm is very close to tree building, but with independence between columns. I suspect you're right that transforming into tree/forest-leaf space is equivalent :s


@jnothman

Member

jnothman commented Jun 3, 2015

One advantage of the forest, at least in terms of performant models due to the discretization, is that it allows overlapping intervals.


@jnothman

Member

jnothman commented Jun 3, 2015

Sorry for not digesting that sooner...

I suspect more useful discretizers/binning to include may be those that are substantially less complex (e.g. equal-width or equal-population buckets).


@jnothman

Member

jnothman commented Jun 3, 2015

To be clearer to @hlin117, what is the substantive difference between this approach and building a DecisionTreeClassifier on each continuous feature independently, and using its split points? Should we reuse that implementation? Or is this functionality deficient with respect to forests?

@hlin117

Contributor

hlin117 commented Jun 4, 2015

I'll try to answer the questions above, but excuse me if some of them aren't answered.

The algorithm's purpose is to discretize continuous features in a supervised fashion by maximizing the information gain per split (which amounts to minimizing the conditional entropy per split, as my code does). MDLP provides a heuristic (with theoretical justification) for when to stop the splitting process. The gory mathematical formulation of the stopping criterion is described in IBM's SPSS documentation (http://www-01.ibm.com/support/knowledgecenter/SSLVMB_20.0.0/com.ibm.spss.statistics.help/alg_optimal-binning_mdlp_accept-criterion.htm). The ugly inequality is derived in Fayyad's original paper.

In Fayyad's original paper (http://ijcai.org/Past%20Proceedings/IJCAI-93-VOL2/PDF/022.pdf), discretizing these continuous features was shown to help improve the accuracy of decision trees. (He provides the results of the experiments towards the bottom of his write-up.) I'm not sure what stopping criterion DecisionTreeClassifier (http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html) uses, but the MDLP stopping criterion has been shown to yield, on average, less error than other stopping criteria. Fayyad's results are corroborated by the results in this paper: https://www1.comp.nus.edu.sg/~tancl/publications/j2001-2/DMKD-Liu-2002.pdf. Even Quinlan, the author of C4.5, studied MDLP, and stated in section 4.1 of this paper (http://www.jair.org/media/279/live-279-1538-jair.pdf) that discretizing using MDLP outperformed his original formulation of C4.5.

To answer @jnothman's and @amueller's question regarding forests, the MDLP stopping criterion is a separate idea from forests (I'm assuming random forests here). It's not a learning algorithm; it's an optimal binning algorithm based on class labels.
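For concreteness, the stopping criterion being discussed can be restated in a few lines; the following is a hedged sketch of Fayyad and Irani's MDLP acceptance test, not the code in this PR:

import numpy as np

def class_entropy(labels):
    # Shannon entropy (base 2) of a 1-D array of non-negative integer labels
    counts = np.bincount(labels)
    p = counts[counts > 0] / float(counts.sum())
    return -np.sum(p * np.log2(p))

def mdlp_accepts_cut(y_sorted, k):
    # accept the cut at position k iff the information gain exceeds the
    # MDL penalty (log2(N - 1) + delta) / N
    n = len(y_sorted)
    s1, s2 = y_sorted[:k], y_sorted[k:]
    ent, ent1, ent2 = class_entropy(y_sorted), class_entropy(s1), class_entropy(s2)
    gain = ent - (len(s1) * ent1 + len(s2) * ent2) / float(n)
    c, c1, c2 = len(np.unique(y_sorted)), len(np.unique(s1)), len(np.unique(s2))
    delta = np.log2(3 ** c - 2) - (c * ent - c1 * ent1 - c2 * ent2)
    return gain > (np.log2(n - 1) + delta) / n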

@jnothman

Member

jnothman commented Jun 4, 2015

The best way to argue your case is with an illustrative example of this being more useful/effective in a machine learning context than the discrete embedding resulting from decision trees.


@amueller

Member

amueller commented Jun 4, 2015

I understand that it is a supervised feature transformation, not a classification algorithm. I was just wondering what common use patterns are, and in which situations this works well (aka what @jnothman said).

@amueller

Member

amueller commented Jun 4, 2015

(the paper does have ~3k cites ^^, so there probably is some use?)

@hlin117

Contributor

hlin117 commented Jun 11, 2015

Hi all. Thanks for all of the comments. Because my advisor assigned me to another research project, I'm taking a break from working on this pull request for now. I'll try to contribute more to it as soon as I find time.

@ptully

ptully commented Jun 18, 2015

Hi all,

I'm now using @hlin117's mdlp library for my purposes, so I think I can help clarify why one would choose to use it in the first place (also, it's working like a charm!).

In cases where there is no clear-cut, uniformly distributed pattern of intervals, nor any a priori notion of how many clusters should arise (think k in k-means clustering), MDLP can be extremely useful. Something like Dirichlet clustering could be used instead; however, in my experience it is very sensitive to parameters (like alpha) and, in my opinion, not as intuitive as MDLP. I think it would be beneficial for sklearn to have some kind of expert binning functionality for feature selection, like MDLP, which is tried and true in the literature.

@amueller

Member

amueller commented Jun 18, 2015

@ptully thanks for your input. Can you clarify what the ultimate application of the binning is here? Do you inspect the bins? Or do you train another model on top of it?

@ptully

ptully commented Jun 19, 2015

I understand your point about Random Forests, where something similar is performed. But if not using RFs, please see this paper for a more in-depth discussion of the issue:

Discretization Techniques: A recent survey - Sotiris Kotsiantis, Dimitris Kanellopoulos

@amueller

Member

amueller commented Jun 19, 2015

I feel like putting "recent" in a paper title is not a great idea lol.

@zmallen

zmallen commented Jul 15, 2015

@GaelVaroquaux although random forest does help with binning techniques by design, why would you want to pigeon-hole this solution into one classifier? There could be requirements from someone wanting to implement a specific classifier that isn't random forest, where binning could be a useful feature.

@hlin117

Contributor

hlin117 commented Jul 20, 2015

Regarding a use case for this algorithm, the other day I was using the MDLP algorithm to bin datetime objects to try predicting crimes in San Francisco. (It was part of this kaggle competition.) Though you could convert an hour of a day to a continuous variable, I think that binned times would be more informative. In my use case, the bins would correspond to a given crime in SF.

@amueller

Member

amueller commented Jul 30, 2015

@zmallen the question was more "what are these applications?".

@jnothman

Member

jnothman commented Jul 31, 2015

(and I think @zmallen also misses the point that you can use a random forest to discretise a feature directly, as in RandomTreesEmbedding, then apply a different learning algorithm)


@hlin117

Contributor

hlin117 commented Sep 2, 2015

For my own self-satisfaction, I have been working on this PR, while knowing that it probably won't be pulled into the master branch soon.

I have recently been learning Cython, and I rewrote some of the MDLP code in it; the rewritten module needs to be built using the usual build command

python setup.py build_ext --inplace

while in the scikit-learn/sklearn/preprocessing directory.

How can I have the project's Makefile automatically set up the Cython files? This is the error output that occurs when I run it without modifications.

# Some stuff here ...
running build_ext
cythoning _mdlp.pyx to _mdlp.c
error: [Errno 2] No such file or directory: '/Users/hlin117/scikit-learn/_mdlp.pyx'
make: *** [inplace] Error 1

(It's trying to run python setup.py build_ext --inplace in the wrong directory.)

@amueller

Member

amueller commented Sep 8, 2015

Look at any of the other setup.py in the submodules, for example the tree one: https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/tree/setup.py
It shows you how to do the setup correctly.

If you have some example applications where this is really helpful, we'd be happy to include it. It's just that it seems most maintainers don't know any domains where this kind of processing is usually used.
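As a rough sketch of that submodule pattern (assumed to mirror sklearn/tree/setup.py of that era; the exact details may differ):

import numpy
from numpy.distutils.misc_util import Configuration

def configuration(parent_package='', top_path=None):
    config = Configuration('preprocessing', parent_package, top_path)
    # register the compiled Cython module so it is built with the package
    config.add_extension('_mdlp',
                         sources=['_mdlp.c'],
                         include_dirs=[numpy.get_include()])
    config.add_subpackage('tests')
    return config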

@hlin117

Contributor

hlin117 commented Sep 20, 2015

Someone on the Apache Spark side is implementing Fayyad's MDLP algorithm in MLLib here.

Here is a use case that they found:

A dataset selected for the GECCO-2014 in Vancouver, July 13th, 2014 competition, which comes from the Protein Structure Prediction field (http://cruncher.ncl.ac.uk/bdcomp/). The dataset has 32 million instances, 631 attributes, 2 classes, 98% of negative examples and occupies, when uncompressed, about 56GB of disk space.
Epsilon dataset: http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#epsilon. 400K instances and 2K attributes

We have demonstrated that our method performs 300 times faster than the sequential version for the first dataset, and also improves the accuracy for Naive Bayes.

What do you guys think?

@raghavrv raghavrv referenced this pull request Nov 10, 2015

Closed

Discretizer #5778

@mblondel

Member

mblondel commented Nov 10, 2015

I suspect more useful discretizers/binning to include may be those that are substantially less complex (e.g. equal-width or equal-population buckets).

+1

I suspect uniform binning and quantile based binning should work well enough in many applications.
I created #5778 to track this issue.

Also lack of sparse data support is problematic.

+ currlevel = search_intervals.back()
+ search_intervals.pop_back()
+ start, end, depth = unwrap(currlevel)
+ PyMem_Free(currlevel)

@mblondel

mblondel Nov 10, 2015

Member

We try to avoid manual memory management in the project.

@hlin117

hlin117 Nov 10, 2015

Contributor

I'm only a beginner when it comes to Cython. Does Cython have a garbage collector, or will this cause a memory leak?
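As an aside, one way to sidestep the manual PyMem_Free calls in this hunk is to keep the work stack as ordinary Python tuples, which are reference-counted automatically; an illustrative sketch, not the PR's code:

def iter_intervals(n_samples):
    # stack of (start, end, depth) tuples; no manual allocation or freeing needed
    stack = [(0, n_samples, 0)]
    while stack:
        start, end, depth = stack.pop()
        yield start, end, depth
        # sub-intervals accepted by the MDLP test would be pushed back onto `stack`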

@hlin117

Contributor

hlin117 commented Nov 10, 2015

I can definitely work on this PR more if there is more interest in it. I've gotten more familiar with the scikit-learn source code since then, and I'm willing to polish this code up.

@Zelazny7

Zelazny7 Apr 13, 2016

I just wanted to add that a very useful use case of supervised discretization is scorecard modeling in a heavily regulated industry like credit decisioning. Uniform binning and quantile binning are usually not sufficient for zero-inflated data, for example.

This technique won't help win Kaggle competitions, but it is very useful when transparency is more important than predictive performance.


@hlin117

Contributor

hlin117 commented Feb 20, 2017

Closing this for now. If you would like to see a related project, see https://github.com/hlin117/mdlp-discretization. Pull requests to this project are welcome!

@hlin117 hlin117 closed this Feb 20, 2017

@jnothman jnothman referenced this pull request Mar 21, 2018

Closed

Discretization #10848
