
Adds plot_decision_region_slices function #189

Merged
merged 9 commits into rasbt:master from jrbourbeau:add_plot_decision_slices
May 17, 2017

Conversation

jrbourbeau
Contributor

@jrbourbeau commented May 11, 2017

Description

Adds plot_decision_region_slices function to mlxtend/plotting/decision_regions.py

Related issues or pull requests

See issue #188

Pull Request requirements

  • Added appropriate unit test functions in the ./mlxtend/*/tests directories
  • Ran nosetests ./mlxtend -sv and made sure that all unit tests pass
  • Checked the test coverage by running nosetests ./mlxtend --with-coverage
  • Checked for style issues by running flake8 ./mlxtend
  • Added a note about the modification or contribution to the ./docs/sources/CHANGELOG.md file
  • Modified documentation in the appropriate location under mlxtend/docs/sources/ (optional)
  • Checked that the Travis-CI build passed at https://travis-ci.org/rasbt/mlxtend

@coveralls

Coverage Status

Coverage remained the same at 93.451% when pulling 4bbccc7 on jrbourbeau:add_plot_decision_slices into 861cade on rasbt:master.

@jrbourbeau
Contributor Author

jrbourbeau commented May 11, 2017

A couple of comments:

  1. The existing plot_decision_regions function has built-in validation and tests to ensure that there aren't more than 2 training features (which is really nice). But that makes it hard to have plot_decision_region_slices be a wrapper for plot_decision_regions. Instead, I have plot_decision_region_slices fall back to using plot_decision_regions if there are only 2 training features.

  2. When doing a 2D decision region slice in feature space, the values for the features on the x-y axes are given by the grid points on the plot. But the user needs to specify "filler" values that will be used for the other features not on the x-y axes. For example, in the plot shown in issue #188 (Add support for plotting 2D decision regions for a slice in feature space), the value for feature 3 needs to be specified. I decided to do this by using a pandas DataFrame. This way you can specify the desired training/target features by their column names. E.g.

training_features = ['feature1', 'feature2']
target_feature = 'target'
filler_feature_dict = {'feature3': 1.0}

plot_decision_region_slices('feature1', 'feature2', dataframe,
                            training_features, target_feature, clf,
                            filler_feature_dict=filler_feature_dict)

But this can also be done using numpy arrays and replacing the column names with the column indices. Which direction (pandas dataframe vs. numpy array) do you think is the better way to go?
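For comparison, a minimal sketch of what the NumPy-array variant might look like. The positional signature here is only a guess for illustration, not an actual API:

# Hypothetical NumPy-based call: column indices replace the column names.
X = dataframe[['feature1', 'feature2', 'feature3']].values
y = dataframe['target'].values

plot_decision_region_slices(0, 1, X, y, clf,
                            filler_feature_dict={2: 1.0})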

@rasbt
Copy link
Owner

rasbt commented May 12, 2017

The existing plot_decision_regions function has built-in validation and tests to ensure that there aren't more than 2 training features (which is really nice). But that makes it hard to have plot_decision_region_slices be a wrapper for plot_decision_regions. Instead, I have plot_decision_region_slices fall back to using plot_decision_regions if there are only 2 training features.

Good point. I think what we could do is to add a parameter to the plot_decision_regions function to select 2 features from the X array. E.g., let's say

plot_decision_regions(..., feature_index=None)

And if feature_index=None, it would select the first two features by default, i.e., (0, 1) if the dataset is 2D, and (0,) if the dataset is 1D. If X.shape[1] > 2, it would use (0, 1) as well but add those filler values for the other features. This just has to be documented well so users don't get confused (I got a lot of emails where people were asking why the function doesn't work on their multidimensional datasets and thought there was a bug). Further, a user can then specify slices other than (0, 1), e.g., (0, 2), (2, 4), etc.
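A rough sketch of how that default selection might look inside the function (illustrative only, not the actual implementation):

# Illustrative default handling for the proposed feature_index parameter:
if feature_index is None:
    feature_index = (0,) if X.shape[1] == 1 else (0, 1)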

In other words, I guess there's no good way to make plot_decision_region_slices work without the modification above, right? What would work, though, is to let plot_decision_regions plot one custom slice (e.g., (0, 1) or (0,) by default) and have plot_decision_region_slices as a wrapper function that plots a grid of slices.

What do you think?

@jrbourbeau
Contributor Author

Good idea. On second thought, I think that modifying plot_decision_regions to support X.shape[1] > 2 is a good way to go.

if feature_index=None, it would select the first two features by default, i.e., (0, 1) if the dataset is 2D, and (0,) if the dataset is 1D. If X.shape[1] > 2, it would use (0, 1) as well but add those filler values for the other features...Further, a user can then specify slices other than (0, 1), e.g., (0, 2), (2, 4), etc.

Couldn't agree more 👍. I'll work on adding this additional feature to plot_decision_regions and commit it to the PR.

plot_decision_regions should behave exactly the same for the case when
the number of training features (X.shape[1]) is 1 or 2. This commit
adds the additional keyword arguments feature_index and
filler_feature_dict. Feature_index is an array-like object with two
items that specify the features indices in X to be plotted on the
x and y axis of the decision regions plot. Filler_feature_dict is a
dictionary of feature index-value pairs that will be filled in for the
features not on the x and y axis.
@pep8speaks

pep8speaks commented May 12, 2017

Hello @jrbourbeau! Thanks for updating the PR.

Cheers ! There are no PEP8 issues in this Pull Request. 🍻

Comment last updated on May 16, 2017 at 02:30 Hours UTC

@coveralls

Coverage Status

Coverage remained the same at 93.451% when pulling c20887f on jrbourbeau:add_plot_decision_slices into 861cade on rasbt:master.

@jrbourbeau
Contributor Author

jrbourbeau commented May 12, 2017

A few comments about the previous commit

  1. I didn't update plot_decision_region_slices at all. That still needs to be done.

  2. This new version of plot_decision_regions should behave exactly the same as the previous version when there are 1 or 2 features (so no existing code will break).

  3. For the case X.shape[1] > 2, the feature indices to be plotted on the x and y axes are specified with feature_index (default is (0, 1)). And the filler values for the other features are provided by filler_feature_dict (default is None, but a ValueError will be raised if it is not provided). For example, if you have a dataset with 3 training features and want to have the first and third features on the x and y axes, and the value 999 for the second feature, you would do something like

plot_decision_regions(X, y, clf, feature_index=(0, 2), filler_feature_dict={1: 999})

Passing the filler values in with a dictionary of index-value pairs is a little awkward, but I haven't been able to think of a simpler way of doing it. Any suggestions?

  4. The main problem I ran into has to do with scatter plotting the training data. The problem arises because (I think) you really only want to plot the training data that has values for the filled features near the filler values. For example, with the previous example, if the second feature is filled with the value 999, then you probably only want to show training events that have a value for the second feature close to 999. Presumably, the value of the second feature will affect what the decision boundaries look like.

My solution to this problem was...the null solution. I put if dim <= 2: before the scatter section of the code. Any thoughts on whether the training data scatter plot should be built into plot_decision_regions for X.shape[1] > 2?
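For context, a hypothetical helper showing what that guard amounts to (a sketch, not the actual mlxtend code):

import numpy as np

def scatter_training_data(ax, X, y, dim):
    # Hypothetical helper: skip the training-data scatter entirely for > 2
    # features (the "null solution" described above).
    if dim > 2:
        return
    for label in np.unique(y):
        idx = (y == label)
        if dim == 1:
            ax.scatter(X[idx, 0], np.zeros(idx.sum()), label=label)
        else:
            ax.scatter(X[idx, 0], X[idx, 1], label=label)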

@rasbt
Owner

rasbt commented May 13, 2017

Thanks for the notes!

This new version of plot_decision_regions should behave exactly the same as the previous version when there are 1 or 2 features (so no existing code will break).

Sounds great! Unfortunately, there are currently no unit tests for the plotting functions, but I will also test it "manually" before I merge (later when we think the PR looks ready :))

For the case X.shape[1] > 2, the feature indices to be plotted on the x and y axes are specified with feature_index (default is (0, 1)). And the filler values for the other features are provided by filler_feature_dict (default is None, but a ValueError will be raised if it is not provided).

But the filler_feature_dict would only need to be provided if feature_index is manually provided (i.e., not None), right?

Passing the filler values in with a dictionary of index-value pairs is a little awkward, but I haven't been able to think of a simpler way of doing it. Any suggestions?

Hm, not sure, but is there a reason why you would use different filler values for different feature columns? Otherwise, we could simply accept an integer for filler_feature_dict and use that filler value for all feature columns that are not selected in feature_index. For example, say we have an array with 4 feature columns; then

plot_decision_regions(X, y, clf, feature_index=(0, 2), filler_feature_dict=999)

would fill the columns with index 1 and index 3 with the filler value 999. Would that work, and would this also address the issue in the following?

The main problem I ran into has to do with scatter plotting the training data. The problem arises because (I think) you really only want to plot the training data that has values for the filled features near the filler values.

@jrbourbeau
Contributor Author

jrbourbeau commented May 14, 2017

Sorry, my first response is super long!

But the filler_feature_dict would only need to be provided if feature_index is manually provided (i.e., not None), right?

The filler_feature_dict will always need to be provided if there are more than 2 training features. Say that someone has trained a classifier on the full iris data set (sepal length, sepal width, petal length, and petal width) and wants to plot the decision region for sepal length vs. sepal width. Below are 10 random samples from the iris dataset

[[ 4.6,  3.4,  1.4,  0.3],
 [ 4.6,  3.1,  1.5,  0.2],
 [ 5.7,  2.5,  5. ,  2. ],
 [ 4.8,  3. ,  1.4,  0.1],
 [ 4.8,  3.4,  1.9,  0.2],
 [ 7.2,  3. ,  5.8,  1.6],
 [ 5. ,  3. ,  1.6,  0.2],
 [ 6.7,  2.5,  5.8,  1.8],
 [ 6.4,  2.8,  5.6,  2.1],
 [ 4.8,  3. ,  1.4,  0.3]]

When the classifier's predict method is called, there need to be four feature columns. That is, the sepal length and sepal width columns alone (shown below) will raise an error because the classifier was trained with 4 features.

[[ 4.6,  3.4],
 [ 4.6,  3.1],
 [ 5.7,  2.5],
 [ 4.8,  3. ],
 [ 4.8,  3.4],
 [ 7.2,  3. ],
 [ 5. ,  3. ],
 [ 6.7,  2.5],
 [ 6.4,  2.8],
 [ 4.8,  3. ]]

The user needs to specify values to be filled in for the other features (in this case the petal length and petal width). Say they specify a petal length filler value of 1.2 and a petal width filler value of 0.3. Then the new feature array would be something like

[[ 4.6,  3.4,  1.2,  0.3],
 [ 4.6,  3.1,  1.2,  0.3],
 [ 5.7,  2.5,  1.2,  0.3],
 [ 4.8,  3. ,  1.2,  0.3],
 [ 4.8,  3.4,  1.2,  0.3],
 [ 7.2,  3. ,  1.2,  0.3],
 [ 5. ,  3. ,  1.2,  0.3],
 [ 6.7,  2.5,  1.2,  0.3],
 [ 6.4,  2.8,  1.2,  0.3],
 [ 4.8,  3. ,  1.2,  0.3]]

This example, plotting the sepal length and width decision region for a petal length of 1.2 and a petal width of 0.3, would be implemented using the following code.

plot_decision_regions(X_iris, y_iris, clf,
                        feature_index=(0, 1),
                        filler_feature_dict={2: 1.2, 3: 0.3})

That was pretty long-winded; hopefully my explanation makes some sense.
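To make the mechanics concrete, here is a minimal sketch of how the filler values could be combined with the meshgrid features before calling clf.predict. The helper name and exact construction are illustrative, not the actual mlxtend code:

import numpy as np

def build_prediction_input(xx, yy, feature_index, filler_feature_dict, n_features):
    # xx, yy: flattened meshgrid coordinates for the two plotted features.
    X_pred = np.zeros((xx.size, n_features))
    X_pred[:, feature_index[0]] = xx
    X_pred[:, feature_index[1]] = yy
    for idx, value in filler_feature_dict.items():
        X_pred[:, idx] = value  # constant "slice" value for the off-axis features
    return X_pred

# For the iris example above:
# X_pred = build_prediction_input(xx.ravel(), yy.ravel(), (0, 1), {2: 1.2, 3: 0.3}, 4)
# Z = clf.predict(X_pred).reshape(xx.shape)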

Hm, not sure, but is there a reason why you would use different filler values for different feature columns?

I think that the filler values will change what the decision regions look like. In the above example, if instead of choosing a petal length of 1.2 and petal width of 0.3 the user chose 1.2 for both the petal length and petal width, one would expect the decision boundaries to look different from the 1.2 and 0.3 case. While a single filler value can be used when there are 3 training features, for more than 3 features multiple filler values should be possible.

Although you bring up a good point: we could build into plot_decision_regions the syntax filler_feature_dict=<number> if the user does indeed want to use the same value for all the filler feature columns.
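If that convenience syntax were added, normalizing the input up front would keep the rest of the code dict-only. A rough sketch, with a hypothetical helper name:

import numbers

def normalize_filler(filler, n_features, feature_index):
    # Hypothetical convenience: expand a single number into an
    # index -> value dict for every feature column not on the plot axes.
    if isinstance(filler, numbers.Number):
        return {idx: filler for idx in range(n_features)
                if idx not in feature_index}
    return dict(filler)

# normalize_filler(999, 4, (0, 2))           ->  {1: 999, 3: 999}
# normalize_filler({2: 1.2, 3: 0.3}, 4, (0, 1))  ->  {2: 1.2, 3: 0.3}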

@rasbt
Owner

rasbt commented May 14, 2017

Thanks for the clarification; I was a bit confused yesterday and thought one would typically use the same filler values. But yeah, you bring up an important point:

I think that the filler values will change what the decision regions look like. In the above example, if instead of choosing a petal length of 1.2 and petal width of 0.3 the user chose 1.2 for both the petal length and petal width, one would expect the decision boundaries to look different from the 1.2 and 0.3 case. While a single filler value can be used when there are 3 training features, for more than 3 features multiple filler values should be possible.

We need to remember to be careful with using a dictionary as a default for a function argument -- i.e., we mustn't modify it in the function in any way, since it is mutable. I remember some gotcha from when I was new at Python :P :

>>> def my_func(dct={'a': 1, 'b': 2}, some_pair=('c', 1)):
...     dct[some_pair[0]] = some_pair[1] 
...     return dct

>>> res1 = my_func()
>>> print(res1)
{'a': 1, 'b': 2, 'c': 1}

>>> res2 = my_func(some_pair=('d', 2))
>>> print(res2)
{'a': 1, 'b': 2, 'c': 1, 'd': 2}

@jrbourbeau
Contributor Author

We need to remember to be careful with using a dictionary as a default for a function argument -- i.e., we mustn't modify it in the function in any way, since it is mutable.

That's a great point. I was thinking about making filler_feature_dict=None by default, and then using whatever dictionary the user specifies. If the number of training features is > 2 and the user doesn't specify a filler_feature_dict, i.e. filler_feature_dict is None, then a ValueError would be raised saying that for >2 training features, a filler_feature_dict must be specified.

Is this sufficient to avoid mutability issues with filler_feature_dict?
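For reference, a minimal sketch of that pattern -- an immutable None default plus explicit validation (abbreviated signature, not the actual implementation):

def plot_decision_regions(X, y, clf, feature_index=None, filler_feature_dict=None):
    # None (immutable) as the default avoids the shared-dict gotcha shown above.
    if X.shape[1] > 2 and filler_feature_dict is None:
        raise ValueError('filler_feature_dict must be specified when there '
                         'are more than 2 training features')
    ...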

@rasbt
Owner

rasbt commented May 14, 2017

If the number of training features is > 2 and the user doesn't specify a filler_feature_dict, i.e. filler_feature_dict is None, then a ValueError would be raised saying that for >2 training features, a filler_feature_dict must be specified.

That sounds great! It would also make it kind of fool-proof, avoiding confusion about potentially unexpected results if someone hasn't read the docstring carefully.

@jrbourbeau
Contributor Author

Awesome!

With regard to making a scatter plot of the training data on top of the decision regions, I was thinking about two options:

  1. Let the user specify a certain range for each filler value (e.g. for the iris example above, scatter all training samples where the petal length is 1.2 +/- 0.1 and the petal width is 0.3 +/- 0.05).

  2. For > 2 training features, don't support the training set scatter plot. Just focus on the decision regions.

What do you think about this?

Adds a test that checks the format of feature_index and the
existence of filler_feature_dict. Also removes the test that checks that
there aren't more than 2 training features.
@coveralls

Coverage Status

Coverage remained the same at 93.451% when pulling 28f4d67 on jrbourbeau:add_plot_decision_slices into 861cade on rasbt:master.

@rasbt
Owner

rasbt commented May 15, 2017

That's a good idea; I hope it doesn't overcomplicate things when plotting the training samples, though. Otherwise we could just leave them out -- but I think it would be useful to have them.

Would you suggest specifying this range via the
filler_feature_dict or rather adding an additional function parameter?

@jrbourbeau
Contributor Author

Yeah, it seems like either way will be a little cluttered : /

My thought was to make it an additional function parameter, say filler_feature_width. By default, this could be None, in which case the training samples won't be plotted. So,

plot_decision_regions(X, y, clf,
                        feature_index=(0, 1),
                        filler_feature_dict={2: 1.2, 3: 0.3})

will just plot the decision regions. And if the user really wants to scatter the training data, they can then provide the width of the feature values they want to include

plot_decision_regions(X, y, clf,
                        feature_index=(0, 1),
                        filler_feature_dict={2: 1.2, 3: 0.3},
                        filler_feature_width={2: 0.1, 3: 0.05})

Does this seem too complicated, or okay to you?
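Under the hood, the width option would presumably boil down to a mask over the training samples. A sketch, assuming each entry means value +/- width:

import numpy as np

def scatter_mask(X, filler_feature_dict, filler_feature_width):
    # Keep only the training samples whose off-axis feature values fall
    # within value +/- width (an assumed interpretation of the width dict).
    mask = np.ones(X.shape[0], dtype=bool)
    for idx, value in filler_feature_dict.items():
        mask &= np.abs(X[:, idx] - value) <= filler_feature_width[idx]
    return mask

# e.g. mask = scatter_mask(X, {2: 1.2, 3: 0.3}, {2: 0.1, 3: 0.05})
# and then only X[mask] is scattered on top of the decision regions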

@rasbt
Owner

rasbt commented May 15, 2017

That sounds okay. This way, it only does something for people who really want it -- and others don't have to worry about the additional parameter :P

@jrbourbeau
Contributor Author

Sounds good to me! I'll go ahead and add the training set plotting to the PR.

Also, I was thinking about dropping the plot_decision_region_slices wrapper for plot_decision_regions. With these extra training set plotting values, it would probably just be easier for the user to call plot_decision_regions themselves.

@rasbt
Owner

rasbt commented May 15, 2017

Also, I was thinking about dropping the plot_decision_region_slices wrapper for plot_decision_regions. With these extra training set plotting values, it would probably just be easier for the user to call plot_decision_regions themselves.

Yeah, I somehow didn't think that it would require so many additions to the original function. I think dropping it would be fine, but it would be nice to have an example for plotting a grid of slices in the documentation (https://github.com/rasbt/mlxtend/blob/master/docs/sources/user_guide/plotting/plot_decision_regions.ipynb) then!

Thanks for putting so much thought and effort into it, I really appreciate it! :)

In addition, this commit also:
* Replaces several of the existing checks with check_Xy from the utils module
* Adds additional input validation
…to be either a float or an array-like object. This allows for different x-axis and y-axis resolutions.

* Adds a test
* Adds two examples to plot_decision_regions.ipynb in the user guide
* Updates the changelog accordingly
@coveralls

Coverage Status

Coverage remained the same at 93.451% when pulling 240a7c1 on jrbourbeau:add_plot_decision_slices into 861cade on rasbt:master.

@coveralls

Coverage Status

Coverage remained the same at 93.451% when pulling 0c74ce4 on jrbourbeau:add_plot_decision_slices into 861cade on rasbt:master.

@coveralls

Coverage Status

Coverage remained the same at 93.451% when pulling 89fb8a5 on jrbourbeau:add_plot_decision_slices into 861cade on rasbt:master.

@jrbourbeau
Contributor Author

That series of commits should take care of the stuff we've been discussing. The last thing I wanted your opinion on was the names of the new function parameters. Do you think feature_index, filler_feature_values, and filler_feature_ranges are suitable, or do you have other suggestions?
Let me know if you find any issues and I can take care of them
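For reference, a call using those final parameter names would presumably look something like this (illustrative; see the merged docstring for the exact semantics):

plot_decision_regions(X, y, clf,
                      feature_index=(0, 1),
                      filler_feature_values={2: 1.2, 3: 0.3},
                      filler_feature_ranges={2: 0.1, 3: 0.05})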

No problem! I think this is a really great project and I'm happy to help out however I can


@rasbt
Owner

rasbt commented May 16, 2017

Do you think feature_index, filler_feature_values, and filler_feature_ranges are suitable, or do you have other suggestions?
Let me know if you find any issues and I can take care of them

They all sound reasonable to me! However, what about filler_feature_scatter? That would kind of hint that there's a "scattering" (and scatter plot) of training/filler values. But honestly, I really don't have a strong preference here.

I am looking forward to giving the new code a try tomorrow to see how it works and whether I have additional feedback :) -- looks great so far!

@rasbt
Owner

rasbt commented May 17, 2017

Was looking through the code; this PR looks fantastic! Thanks a lot! It's nice that it doesn't change the "previous" usage but just adds additional functionality that can come in handy for higher-dimensional datasets. I think filler_feature_ranges sounds fine in this context. From looking at the examples and inline comments, I think it should be clear what it does :).

I am happy to merge this now, and regarding the unit tests, mlxtend.plotting will be added back to the CI after merging #190, which addresses an issue in one of the other plotting functions.

Thanks for the great work!

@rasbt merged commit 89a2a0e into rasbt:master on May 17, 2017
@jrbourbeau
Contributor Author

Awesome! Thanks for all the comments and suggestions
