Update step_nearmiss.R #109

Merged: 2 commits, Dec 14, 2022

Conversation

PursuitOfDataScience
Contributor

Fixed a few typos. 

Also, a quick question: does `step_nearmiss()` provide the three different versions of NearMiss described at this [link](https://machinelearningmastery.com/undersampling-algorithms-for-imbalanced-classification/)? Thanks!
@EmilHvitfeldt
Member

Thank you for the PR!

This step only does what that article calls NearMiss-1. If you would like to see the other methods, feel free to request them in an issue so I can keep track of them for when I come back to {themis} development.
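
For readers unfamiliar with the term, here is a minimal sketch of NearMiss-1 as it is usually described: keep the majority-class points whose average distance to their k closest minority-class points is smallest. It is plain Python for illustration only and is not taken from {themis}; the function name `nearmiss1`, its parameters, and the toy data are all hypothetical.

```python
import math

def nearmiss1(points, labels, majority, minority, k=3, n_keep=None):
    """Return sorted indices of the rows kept by a NearMiss-1 style undersample.

    Hypothetical helper for illustration; not the {themis} implementation.
    """
    maj = [i for i, y in enumerate(labels) if y == majority]
    mino = [i for i, y in enumerate(labels) if y == minority]
    if n_keep is None:
        # Default assumption: keep as many majority points as there are minority points.
        n_keep = len(mino)

    def avg_dist_to_nearest_minority(i):
        # Average distance from majority point i to its k closest minority points.
        dists = sorted(math.dist(points[i], points[j]) for j in mino)
        nearest = dists[:k]
        return sum(nearest) / len(nearest)

    # Majority points closest (on average) to the minority class are retained.
    kept_majority = sorted(maj, key=avg_dist_to_nearest_minority)[:n_keep]
    return sorted(mino + kept_majority)

# Toy data: two minority points and five majority points.
points = [(0, 0), (1, 0), (0.5, 0.5), (5, 5), (6, 5), (1, 1), (10, 10)]
labels = ["circle", "circle", "rest", "rest", "rest", "rest", "rest"]

kept = nearmiss1(points, labels, majority="rest", minority="circle", k=2)
print(kept)
```

Note that every distance here is Euclidean, which is one reason a step like this needs all-numeric columns.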

@EmilHvitfeldt merged commit badbece into tidymodels:main on Dec 14, 2022
@PursuitOfDataScience
Contributor Author

Thank you.

I still have two questions: 1) When using step_nearmiss(), what if the data contains some nominal variables? Will the function tease them out automatically? 2) The two plots at this link look essentially the same. Did step_tomek() really work here?

@EmilHvitfeldt
Member

  1. Per the documentation, step_nearmiss() requires:

     "All columns used in this step must be numeric with no missing data."

     so if there are nominal variables, it will error.

  2. It did work; it just doesn't remove many points because that data is already fairly well separated. Some of the removed points are highlighted:

[Image: Screen Shot 2022-12-14 at 2 54 52 PM]

@PursuitOfDataScience
Contributor Author

Thank you. The Tomek link is interesting. Based on the two plots shown, it seems this method does not downsample the majority class to the point where the classes are balanced. As the plot you attached shows, there are still far more majority cases (Rest dots) than minority cases (Circle dots). Could you help me understand this better? I am confused about why the class ratio isn't 1 after using this method.

@PursuitOfDataScience
Contributor Author

Also, another follow-up question if you don't mind. Don't we just keep all the minority class observations and remove the majority class only? Why do these two points get removed at the same time?

@EmilHvitfeldt
Member

A Tomek link is a pair of observations from different classes that are each other's nearest neighbors. This method then removes the whole link.

@EmilHvitfeldt
Member

The method could have been modified to remove only the majority-class observation, but right now it follows the literature (as far as I can tell).

@EmilHvitfeldt
Member

Tomek link removal is less about balancing the classes than about removing "troublesome" observations.
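
To make the points above concrete, here is a minimal plain-Python sketch of Tomek link detection and removal on toy data. It is an illustration under stated assumptions, not the step_tomek() implementation: the helper names, the `majority_only` flag, and the data are hypothetical, and ties in distance are not handled.

```python
import math
from collections import Counter

def nearest_neighbor(points, i):
    # Index of the point closest to points[i], excluding i itself.
    return min((j for j in range(len(points)) if j != i),
               key=lambda j: math.dist(points[i], points[j]))

def tomek_links(points, labels):
    # A Tomek link: two points of different classes that are mutual nearest neighbors.
    nn = [nearest_neighbor(points, i) for i in range(len(points))]
    return [(i, j) for i, j in enumerate(nn)
            if i < j and nn[j] == i and labels[i] != labels[j]]

def remove_tomek_links(points, labels, majority_only=False, majority=None):
    # By default both members of each link are dropped; with majority_only=True
    # (a hypothetical variant) only the majority-class member is dropped.
    drop = set()
    for i, j in tomek_links(points, labels):
        for idx in (i, j):
            if not majority_only or labels[idx] == majority:
                drop.add(idx)
    return [i for i in range(len(points)) if i not in drop]

# Toy data: 6 "rest" (majority) points and 3 "circle" (minority) points.
points = [(0.0, 0.0), (0.3, 0.0), (1.0, 0.0), (1.1, 0.0), (3.0, 0.0),
          (3.2, 0.0), (6.0, 0.0), (6.3, 0.0), (5.0, 0.0)]
labels = ["rest", "rest", "rest", "circle", "rest",
          "rest", "circle", "circle", "rest"]

links = tomek_links(points, labels)
kept_both = remove_tomek_links(points, labels)
kept_majority_only = remove_tomek_links(points, labels,
                                        majority_only=True, majority="rest")

print(links)                                    # the single cross-class mutual pair
print(Counter(labels[i] for i in kept_both))    # class counts after removal
```

In this toy set only one link exists, so removing it leaves 5 "rest" and 2 "circle" points: the boundary is cleaner, but the classes are no closer to a 1:1 ratio.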

@EmilHvitfeldt
Member

If you find any particular documentation unclear please open an issue. :)

@PursuitOfDataScience
Contributor Author

Thanks for your clarification. At this link, step_tomek() is listed under the "Under-sampling" category. I doubt that adding this step to the recipe will make a huge difference, because the two plots are very alike, with only minor differences. I am wondering when a recipe with this step will perform better than one without it. I am very new to this method, and it came as a surprise to me that it doesn't work the way I had thought.

@github-actions

This pull request has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue.

@github-actions bot locked and limited conversation to collaborators on Dec 30, 2022