PR for packaging, Python3 Support, and Issues #3 and #4 #5

mbernico · 2016-04-09T22:29:47Z

First of all, I apologize for writing such a big PR all at once. I was struggling to get Boruta functioning on my problem and these were the things I needed to fix to get it working for me.

PR supports Python 3.5
PR refactored code to support setup.py. Boruta can be uploaded to PyPI or installed with 'python setup.py install'
PR resolves issue # 3 Re slicing Pandas DataFrames (likely only in versions > .17.x?)
PR resolves issue Error when all variables are selected #4 Re all features selected.
I was unsure what to do with the two versions of boruta.py. Patching two files in parallel isn't ideal. The second version seems desirable/valuable so I removed the first from the package.

I hope you find this useful. Happy to make any changes you'd like. Thanks for implementing Boruta in python!

danielhomola · 2016-04-10T17:44:42Z

Hey there,
This is fantastic work thanks so much! Could you please put back the first version of Boruta and then I'll accept your PR.. Basically the BorutaPy is the line-by-line re-implementation of the original R method which was published and went through peer-review, plus that version was used in several projects. Consequently people might expect it to work a certain way, like the one in R, that's why I want to keep it in here as a reference. The added features of BorutaPy2 could be used if someone needs them, but that's not equivalent of the original and published method, so I don't want that to be default as that would be misleading..

mbernico · 2016-04-10T19:36:15Z

Hi Daniel,

My pleasure, glad it could help.

I re-added Boruta_py and applied the Python3, #3, and #4 fixes to it as well. I kept the 'reference' implementation as class BorutaPy and made boruta_py2.py's class BorutaPyPlus (for lack of a better name). That way the user can have the module installed and instantiate either.

I can update the doc to reflect this as well if you'd like and are satisfied with that name.

danielhomola · 2016-04-10T19:38:30Z

Hey Mike,

Thanks a lot this is really great! Like your name suggestion as well! Will
accept your pr tonight!
On 10 Apr 2016 20:36, "Mike Bernico" notifications@github.com wrote:

Hi Daniel,

My pleasure, glad it could help.

I re-added Boruta_py and applied the Python3, #3
#3, and #4
#4 fixes to it as well.
I kept the 'reference' implementation as class BorutaPy and made
boruta_py2.py's class BorutaPyPlus (for lack of a better name). That way
the user can have the module installed and instantiate either.

I can update the doc to reflect this as well if you'd like and are
satisfied with that name.

—
You are receiving this because you commented.
Reply to this email directly or view it on GitHub
#5 (comment)

mbq · 2016-04-11T15:28:23Z

Hi guys,
Is it really necessary to have BorutaPlus a separate entity, while, as I understand, it is equal to Boruta when perc=100 and multi_corr_method=bonferroni (and those are more-less defaults anyway)?

Automatic tree number selection is actually a part of the wrapped importance source, so in this regard it is already doable in R Boruta, arbitrary perc and MC correction could be back-ported quite easily making R&Python implementations fully compatible again.

danielhomola · 2016-04-11T18:52:31Z

Hey Miron,

Thanks a lot for joining in. If you're happy with it like that, having BorutaPlus be the one and only Boruta_py would be a great simplification as Mike already pointed out. I'll do this tomorrow and change the docs accordingly. Cheers!

mbq · 2016-04-12T10:07:21Z

Er, why calling it Plus then?

danielhomola · 2016-04-12T10:53:20Z

Sorry I wasn't clear, I meant, the functionality in BorutaPyPlus will be the only one in this repo and it will be called BorutaPy, just like Mike suggested 2 days ago.. Will do this tonight.

mbq · 2016-04-12T11:04:17Z

Great.

mbernico · 2016-04-12T13:36:17Z

Thanks Miron, I agree. @danielhomola when you make that change, remember to update init.py as well. It's out of sync at the moment, from the last refactor. :)

danielhomola · 2016-04-13T18:34:55Z

Hi Miron,

I just noticed that in boruta.R, at line 165, you have this:
toAccept<-stats::p.adjust(stats::pbinom(hitReg-1,runs,0.5,lower.tail=FALSE),method=pAdjMethod)<pValue;

my R is getting rusty as I use more and more Python, but doesn't this part
pbinom(hitReg-1,runs,0.5,lower.tail=FALSE)
return a vector with the length of all features, p? If yes, then your bonferroni correction will be off, because as you remove features in later iterations, it should only correct for the multiple testing across the features that are still included in the testing, i.e. the tentative ones.. Instead of that in it's current form, if you start with p=100 features and in the 60th iteration we only have 20 tentative left, the p-values will be (corrected) divided by a 100, and not 20.. Thus the Bonferroni correction will be overly harsh, won't it?

Am I missing something? Sorry if this is stupid and I misunderstood something..

mbq · 2016-04-13T19:06:24Z

This is intentional; larger p makes all selected attributes more likely to be false positive, regardless in which iteration they were selected. (In other words, correction applies to the whole Boruta run, not a single iteration; this is also why Bonferroni method is the only correction which is not somewhat fishy to use.)

BTW, it strikes me that you find it too strict; wherever I apply it, I feel it selects too much...

danielhomola · 2016-04-13T19:15:05Z

Oh OK, if that's intentional, then I'll leave the Python version like that as well. I understand your logic, it's just in biological datasets we often test 10-20 thousand features and bonferroni is always overly harsh that's why FDR is widely used instead.

mbq · 2016-04-13T19:28:55Z

So have I, but also the p-values from Boruta easily go extremely low (like log p required low) and there are tentatives to cover the grey area. Anyway, proper FDR has landed on my Boruta research problems list.

mbernico added 5 commits April 9, 2016 10:48

Initial Commit of package

6260eda

Refactor

83d368c

Adding Examples, Structure for test cases

e789c46

Fix Issue scikit-learn-contrib#3

6c0f997

Fix Issue scikit-learn-contrib#4, Added Madalon Example

c5ecb87

re-added original method

52734c0

danielhomola merged commit 809c31d into scikit-learn-contrib:master Apr 10, 2016

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

PR for packaging, Python3 Support, and Issues #3 and #4 #5

PR for packaging, Python3 Support, and Issues #3 and #4 #5

mbernico commented Apr 9, 2016

danielhomola commented Apr 10, 2016

mbernico commented Apr 10, 2016

danielhomola commented Apr 10, 2016

mbq commented Apr 11, 2016

danielhomola commented Apr 11, 2016

mbq commented Apr 12, 2016

danielhomola commented Apr 12, 2016

mbq commented Apr 12, 2016

mbernico commented Apr 12, 2016

danielhomola commented Apr 13, 2016

mbq commented Apr 13, 2016

danielhomola commented Apr 13, 2016

mbq commented Apr 13, 2016

PR for packaging, Python3 Support, and Issues #3 and #4 #5

PR for packaging, Python3 Support, and Issues #3 and #4 #5

Conversation

mbernico commented Apr 9, 2016

danielhomola commented Apr 10, 2016

mbernico commented Apr 10, 2016

danielhomola commented Apr 10, 2016

mbq commented Apr 11, 2016

danielhomola commented Apr 11, 2016

mbq commented Apr 12, 2016

danielhomola commented Apr 12, 2016

mbq commented Apr 12, 2016

mbernico commented Apr 12, 2016

danielhomola commented Apr 13, 2016

mbq commented Apr 13, 2016

danielhomola commented Apr 13, 2016

mbq commented Apr 13, 2016