Improved mRMR is a re-implementation of the minimum redundancy maximum relevance (mRMR) feature selection algorithm with emphasis on greatly increased performance (1000x or greater on large data sets) and an improved user interface. There are no disadvantages to using this utility as opposed to the original release by Hanchuan Peng, and benefits include:

- results identical to the original mRMR implementation by Hanchuan Peng, excluding statistically inconsequential preservation of rank correspondence in the case of metric ties
- incorporation of all improvements from the Fast-MRMR implementation by Sergio Ramírez
- additional performance improvements, such as avoiding mutual information computation for zero-entropy attributes and careful selection and use of data structures
- output for each attribute includes its selection rank, entropy, mutual information with the class attribute, and mRMR score (see the note following this list) in an easily parsed format friendly to downstream manipulation
- operates directly on the original textual data, requiring no one-time transformation into a binary representation
- a robust data set parser that fails gracefully on bad input and reports the location of the first error
- modular support in the code for arbitrary discretization routines, with several examples already provided and implemented
- support for outputting the result of parsing and discretization so that it can be verified and analyzed with external tools
- stream-based processing: the utility operates equally well reading a data set from standard input, in a pipeline, from a named pipe, or with process substitution (see the additional invocations under EXAMPLE USAGE below)
- standard GNU getopt_long POSIX-compliant option processing, including full-featured -h/--help capability, informative error messages, graceful failure, and sensible defaults
- a high-quality C++14-compliant code base
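The mRMR score reported for each attribute follows the general minimum-redundancy-maximum-relevance criterion introduced by Hanchuan Peng and colleagues. As a rough sketch in notation that is not taken from this code base (f is a candidate attribute, c the class attribute, S the set of already-selected attributes, and I(.;.) mutual information), the commonly used difference (MID) form selects at each step the attribute maximizing

    score(f) = I(f; c) - (1/|S|) * sum over s in S of I(f; s)

with the first attribute chosen purely by I(f; c). This is background on the criterion in general, not a statement of the exact variant or tie handling used here.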
BUILDING
===============
1. To build, enter the project directory and type 'make'.
2. To run from the project directory, type './mrmr -h' to get usage information and additional help.

EXAMPLE USAGE
===============
The following two commands are equivalent in effect when run from the project top-level directory.

< example.tsv bin/mrmr
bin/mrmr -t '\t' -c 1 -d 'truncate' example.tsv
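Because processing is stream-based, the data set can also arrive through a pipeline, process substitution, or a named pipe. The invocations below are illustrative sketches rather than shipped examples; example.tsv.gz and the awk row filter are hypothetical stand-ins for whatever produces your data, and only options documented above are used.

# decompress on the fly and feed the data set through a pipeline
zcat example.tsv.gz | bin/mrmr -t '\t' -c 1 -d 'truncate'

# read the data set via process substitution (first 10000 rows only)
bin/mrmr <(awk 'NR <= 10000' example.tsv)

# read the data set from a named pipe
mkfifo data.fifo
cat example.tsv > data.fifo &
bin/mrmr data.fifo
rm data.fifo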
Notes:
Implemented discretization functionality is minimal. Feature values are expected to be, or to discretize to, contiguous integers starting from 0; for example, under the default 'truncate' procedure raw values such as 0.4, 1.7, and 2.2 would map to 0, 1, and 2. This requirement is not currently checked.

VERSION HISTORY
===============
0.93 (beta)
- Added a header for easier usage as a library, principally in support of Python bindings.

0.92 (beta)
- Added example data and updated README with example usage.
- Added a note about expected feature value ranges.

0.91 (beta)
- Refactored discretization code.
- Added the 'truncate' discretization procedure and made it the default.
- Added support for a new warning log level and made it the default.
- Now warns if a discretization method is not explicitly chosen.
- Implemented attribute domain translation and integer overflow detection.

0.9 (beta)
- Support for delimiter specification.
- Minor improvement to log handling.
- Small incidental code changes.

0.1 (beta)
- Initial release.