Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PR for packaging, Python3 Support, and Issues #3 and #4 #5

Merged
merged 6 commits into from Apr 10, 2016

Conversation

mbernico
Copy link
Contributor

@mbernico mbernico commented Apr 9, 2016

First of all, I apologize for writing such a big PR all at once. I was struggling to get Boruta functioning on my problem and these were the things I needed to fix to get it working for me.

  1. PR supports Python 3.5
  2. PR refactored code to support setup.py. Boruta can be uploaded to PyPI or installed with 'python setup.py install'
  3. PR resolves issue # 3 Re slicing Pandas DataFrames (likely only in versions > .17.x?)
  4. PR resolves issue Error when all variables are selected #4 Re all features selected.
  5. I was unsure what to do with the two versions of boruta.py. Patching two files in parallel isn't ideal. The second version seems desirable/valuable so I removed the first from the package.

I hope you find this useful. Happy to make any changes you'd like. Thanks for implementing Boruta in python!

@danielhomola
Copy link
Collaborator

Hey there,
This is fantastic work thanks so much! Could you please put back the first version of Boruta and then I'll accept your PR.. Basically the BorutaPy is the line-by-line re-implementation of the original R method which was published and went through peer-review, plus that version was used in several projects. Consequently people might expect it to work a certain way, like the one in R, that's why I want to keep it in here as a reference. The added features of BorutaPy2 could be used if someone needs them, but that's not equivalent of the original and published method, so I don't want that to be default as that would be misleading..

@mbernico
Copy link
Contributor Author

Hi Daniel,

My pleasure, glad it could help.

I re-added Boruta_py and applied the Python3, #3, and #4 fixes to it as well. I kept the 'reference' implementation as class BorutaPy and made boruta_py2.py's class BorutaPyPlus (for lack of a better name). That way the user can have the module installed and instantiate either.

I can update the doc to reflect this as well if you'd like and are satisfied with that name.

@danielhomola
Copy link
Collaborator

Hey Mike,

Thanks a lot this is really great! Like your name suggestion as well! Will
accept your pr tonight!
On 10 Apr 2016 20:36, "Mike Bernico" notifications@github.com wrote:

Hi Daniel,

My pleasure, glad it could help.

I re-added Boruta_py and applied the Python3, #3
#3, and #4
#4 fixes to it as well.
I kept the 'reference' implementation as class BorutaPy and made
boruta_py2.py's class BorutaPyPlus (for lack of a better name). That way
the user can have the module installed and instantiate either.

I can update the doc to reflect this as well if you'd like and are
satisfied with that name.


You are receiving this because you commented.
Reply to this email directly or view it on GitHub
#5 (comment)

@danielhomola danielhomola merged commit 809c31d into scikit-learn-contrib:master Apr 10, 2016
@mbq
Copy link

mbq commented Apr 11, 2016

Hi guys,
Is it really necessary to have BorutaPlus a separate entity, while, as I understand, it is equal to Boruta when perc=100 and multi_corr_method=bonferroni (and those are more-less defaults anyway)?

Automatic tree number selection is actually a part of the wrapped importance source, so in this regard it is already doable in R Boruta, arbitrary perc and MC correction could be back-ported quite easily making R&Python implementations fully compatible again.

@danielhomola
Copy link
Collaborator

Hey Miron,

Thanks a lot for joining in. If you're happy with it like that, having BorutaPlus be the one and only Boruta_py would be a great simplification as Mike already pointed out. I'll do this tomorrow and change the docs accordingly. Cheers!

@mbq
Copy link

mbq commented Apr 12, 2016

Er, why calling it Plus then?

@danielhomola
Copy link
Collaborator

Sorry I wasn't clear, I meant, the functionality in BorutaPyPlus will be the only one in this repo and it will be called BorutaPy, just like Mike suggested 2 days ago.. Will do this tonight.

@mbq
Copy link

mbq commented Apr 12, 2016

Great.

@mbernico
Copy link
Contributor Author

Thanks Miron, I agree. @danielhomola when you make that change, remember to update init.py as well. It's out of sync at the moment, from the last refactor. :)

@danielhomola
Copy link
Collaborator

Hi Miron,

I just noticed that in boruta.R, at line 165, you have this:
toAccept<-stats::p.adjust(stats::pbinom(hitReg-1,runs,0.5,lower.tail=FALSE),method=pAdjMethod)<pValue;

my R is getting rusty as I use more and more Python, but doesn't this part
pbinom(hitReg-1,runs,0.5,lower.tail=FALSE)
return a vector with the length of all features, p? If yes, then your bonferroni correction will be off, because as you remove features in later iterations, it should only correct for the multiple testing across the features that are still included in the testing, i.e. the tentative ones.. Instead of that in it's current form, if you start with p=100 features and in the 60th iteration we only have 20 tentative left, the p-values will be (corrected) divided by a 100, and not 20.. Thus the Bonferroni correction will be overly harsh, won't it?

Am I missing something? Sorry if this is stupid and I misunderstood something..

@mbq
Copy link

mbq commented Apr 13, 2016

This is intentional; larger p makes all selected attributes more likely to be false positive, regardless in which iteration they were selected. (In other words, correction applies to the whole Boruta run, not a single iteration; this is also why Bonferroni method is the only correction which is not somewhat fishy to use.)

BTW, it strikes me that you find it too strict; wherever I apply it, I feel it selects too much...

@danielhomola
Copy link
Collaborator

Oh OK, if that's intentional, then I'll leave the Python version like that as well. I understand your logic, it's just in biological datasets we often test 10-20 thousand features and bonferroni is always overly harsh that's why FDR is widely used instead.

@mbq
Copy link

mbq commented Apr 13, 2016

So have I, but also the p-values from Boruta easily go extremely low (like log p required low) and there are tentatives to cover the grey area. Anyway, proper FDR has landed on my Boruta research problems list.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants