- unified boruta_py and boruta_py_plus, now boruta_py has all previously added features of the latter (two step multiple correction, percentile)

- default perc=100 is equivalent to choosing the maximum of the shadow importances as in original boruta.R
- added two_step Boolean param, if set to False, then simple Bonferroni is performed as in boruta.R
- replaced the nanrankdata method with _nanrankdata to decouple the bottleneck dependency
- include multipletests from statsmodels as _fdrcorrection, to decouple this dependency
- delete _check_pandas, adds unnecessary dependency as Mike said, plus check_X_y returns numpy array anyway
- updated `__init__.py`
- updated class documentation and README.md to reflect these changes
- changed multi_alpha param to alpha
- changed iter to _iter because iter is used by Python internally
danielhomola committed Apr 13, 2016
1 parent 4f17e2d commit 80a74c1
Showing 5 changed files with 216 additions and 643 deletions.
80 changes: 33 additions & 47 deletions README.md
@@ -8,10 +8,7 @@ This project hosts Python implementations of the [Boruta all-relevant feature se

* numpy
* scipy
* bottleneck
* scikit-learn
* statsmodels


## How to use ##
Download, import and do as you would with any other scikit-learn method:
@@ -46,7 +43,6 @@ by definition depends on your classifier choice).

### BorutaPy ###


It is the original R package recoded in Python with a few added extra features.
Some improvements include:

@@ -59,48 +55,48 @@ Some improvements include:
* Automatic n_estimator selection

* Ranking of features

For more details, please check the top of the docstring.

We highly recommend using pruned trees with a depth between 3-7.

### BorutaPyPlus ###

After playing around a lot with the original code I identified a few areas
where the core algorithm could be improved. I basically ran lots of
benchmarking tests on simulated datasets (scikit-learn's amazing
make_classification was used for generating these).
Also, after playing around a lot with the original code I identified a few areas
where the core algorithm could be improved/altered to make it less strict and
more applicable to biological data, where the Bonferroni correction might be
overly harsh.

__Percentile as threshold__
The original method uses the maximum of the shadow features as a threshold in
deciding which real feature is doing better than the shadow ones. This could be
overly harsh.
overly harsh.

To control this, in the 2nd version the perc parameter sets the
percentile of the shadow features' importances, the algorithm uses as the
threshold. The default of 99 is close to taking the maximum, but as it's a
percentile, it changes with the size of the dataset. With several thousands of
To control this, I added the perc parameter, which sets the
percentile of the shadow features' importances that the algorithm uses as the
threshold. The default of 100 is equivalent to taking the maximum, as the
R version of Boruta does, but it can be relaxed. Note that since this is a
percentile, it changes with the size of the dataset: with several thousand
features it isn't as stringent as with a few dozen at the end of a Boruta run.
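
As a rough illustration of the percentile threshold described above, the sketch below compares made-up importances against the maximum and against a relaxed percentile. The helper name `shadow_threshold` and the example numbers are assumptions for illustration only, not BorutaPy's internal code.

```python
import numpy as np


def shadow_threshold(shadow_importances, perc=100):
    """Return the importance a real feature has to exceed to count as a hit.

    Hypothetical helper: perc=100 reduces to the maximum of the shadow
    importances; lower values relax the threshold.
    """
    # nanpercentile ignores NaN importances instead of propagating them
    return np.nanpercentile(shadow_importances, perc)


# Made-up importances for five shadow features and three real features
shadow_imp = np.array([0.01, 0.02, 0.05, 0.03, 0.10])
real_imp = np.array([0.08, 0.12, 0.04])

strict = shadow_threshold(shadow_imp, perc=100)   # the maximum
relaxed = shadow_threshold(shadow_imp, perc=80)   # a lower threshold

hits_strict = real_imp > strict     # only the second feature beats the max
hits_relaxed = real_imp > relaxed   # the first feature now counts as well
```

With the strict threshold only one real feature beats the shadows; relaxing perc to 80 lets a second one through, which is the trade-off between false positives and missed features.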


__Two step correction for multiple testing__
The correction for multiple testing was improved by making it a two step
The correction for multiple testing was relaxed by making it a two step
process, rather than a harsh one step Bonferroni correction.

We need to correct firstly because in each iteration we test a number of
We need to correct firstly because in each iteration we test a number of
features against the null hypothesis (does a feature perform better than
expected by random). For this the Bonferroni correction is used in the original
code which is known to be too stringent in such scenarios, and also the
original code corrects for n features, even if we are in the 50th iteration
where we only have k<<n features left. For this reason the first step of
correction is the widely used Benjamini Hochberg FDR.
expected by random). The original code uses the Bonferroni correction for
this, which is known to be too stringent in such scenarios (at least for
biological data). The original code also corrects for n features, even if
we are in the 50th iteration where we only have k<<n features left. For this
reason the first step of correction is the widely used Benjamini-Hochberg FDR.

Following that, however, we also need to account for the fact that we have been
testing the same features over and over again in each iteration with the
same test. The Bonferroni correction is well suited to this scenario, so it is
applied by dividing the p-value threshold by the current iteration index.
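
The two steps described above can be sketched as follows. This is a minimal illustration, not the code shipped in BorutaPy: the function names `bh_reject` and `two_step_reject` are hypothetical, the p-values are made up, and the iteration index is assumed to start at 1.

```python
import numpy as np


def bh_reject(pvals, alpha=0.05):
    """Benjamini-Hochberg step-up procedure; returns a boolean reject mask."""
    pvals = np.asarray(pvals, dtype=float)
    m = pvals.size
    order = np.argsort(pvals)
    sorted_p = pvals[order]
    # Critical value for rank i (1-based) is alpha * i / m
    below = sorted_p <= alpha * np.arange(1, m + 1) / m
    reject = np.zeros(m, dtype=bool)
    if below.any():
        # Reject every hypothesis up to the largest rank that passes
        k = np.max(np.nonzero(below)[0])
        reject[order[:k + 1]] = True
    return reject


def two_step_reject(pvals, iteration, alpha=0.05):
    """Step 1: FDR across the features tested in this iteration.
    Step 2: Bonferroni across iterations (alpha / iteration index)."""
    pvals = np.asarray(pvals, dtype=float)
    fdr_ok = bh_reject(pvals, alpha)
    bonferroni_ok = pvals <= alpha / iteration
    return fdr_ok & bonferroni_ok


# Made-up p-values for the features still being tested
pvals = [0.0001, 0.004, 0.02, 0.3]
early = two_step_reject(pvals, iteration=1)
late = two_step_reject(pvals, iteration=50)
```

The same p-values pass fewer tests in later iterations, because the second step tightens the threshold each time the features are tested again.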

We highly recommend using pruned trees with a depth between 3-7.

If this two step correction is not required, set the two_step parameter to
False; then (with perc=100) BorutaPy behaves exactly as the R version.

* * *

@@ -119,32 +115,22 @@ __n_estimators__ : int or string, default = 1000
> dataset. The other parameters of the used estimators need to be set
> with initialisation.
__multi_corr_method__ : string, default = 'bonferroni' - only in BorutaPy
>Method for correcting for multiple testing during the feature selection process. statsmodels' multiple test is used, so one of the following:
>* 'bonferroni' : one-step correction
>* 'sidak' : one-step correction
>* 'holm-sidak' : step down method using Sidak adjustments
>* 'holm' : step-down method using Bonferroni adjustments
>* 'simes-hochberg' : step-up method (independent)
>* 'hommel' : closed method based on Simes tests (non-negative)
>* 'fdr_bh' : Benjamini/Hochberg (non-negative)
>* 'fdr_by' : Benjamini/Yekutieli (negative)
>* 'fdr_tsbh' : two stage fdr correction (non-negative)
>* 'fdr_tsbky' : two stage fdr correction (non-negative)
__perc__ : int, default = 99 - only in BorutaPy2
__perc__ : int, default = 100
> Instead of the max we use the percentile defined by the user, to pick
> our threshold for comparison between shadow and real features. The max
> tends to be too stringent. This provides a finer control over this. The
> lower perc is the more false positives will be picked as relevant but
> lower perc is the more false positives will be picked as relevant but
> also the less relevant features will be left out. The usual trade-off.
> The default is essentially the vanilla Boruta corresponding to the max.

__multi_alpha__ : float, default = 0.05
__alpha__ : float, default = 0.05
> Level at which the corrected p-values will get rejected in both correction
> steps.

__two_step__ : Boolean, default = True
> If you want to use the original implementation of Boruta with Bonferroni
> correction only, set this to False.
__max_iter__ : int, default = 100
> The maximum number of iterations to perform.
@@ -177,10 +163,10 @@ __verbose__ : int, default=0

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from boruta_py import boruta_py
from boruta_py import BorutaPy

# load X and y
# NOTE BorutaPy accepts numpy arrays only, hence .values
# NOTE BorutaPy accepts numpy arrays only, hence the .values attribute
X = pd.read_csv('my_X_table.csv', index_col=0).values
y = pd.read_csv('my_y_vector.csv', index_col=0).values

@@ -189,7 +175,7 @@ __verbose__ : int, default=0
rf = RandomForestClassifier(n_jobs=-1, class_weight='balanced', max_depth=5)

# define Boruta feature selection method
feat_selector = boruta_py.BorutaPy(rf, n_estimators='auto', verbose=2)
feat_selector = BorutaPy(rf, n_estimators='auto', verbose=2)

# find all relevant features
feat_selector.fit(X, y)
1 change: 0 additions & 1 deletion boruta/__init__.py
@@ -1,2 +1 @@
from .boruta_py import BorutaPy
from .boruta_py_plus import BorutaPyPlus