About the project
This project's goal is to extend the machine learning library Mahout with attribute selection functionality. It is developed by an informal research group at the Faculty of Mathematics, Informatics and Mechanics at the University of Warsaw. It is also developed in collaboration with UK PhD Centre in Financial Computing as part of DRACUS project. Members of the group are the following MSc students in mathematics: Paweł Olszewski, Krzysztof Rutkowski and Wiktor Gromniak. At University of Warsaw the work is supervised by Professor Andrzej Skowron, who is heading the Group of Logic. At UK PhD Centre in Financial Computing it is supervised by Michał Galas PhD.
The library currently features two attribute selection frameworks:
- Random / Dynamic reducts - an algorithm based on the paper Random Reducts  by Kruczyk et al., extended with the ideas of Dynamic Reducts  by Bazan et al.
- Monte Carlo Feature Selection - an algorithm based on random forest-like approach found in Monte Carlo Feature Selection  by Draminski et al.
The library uses, among others, the following two machine learning libraries:
- Rseslib - " a library of machine learning data structures and algorithms implemented in Java " featuring algorithms for reduct's calculations
- Cognitive Foundry - a great, general-purpose, well-coded, well-documented library of machine learning algorithms
The library is currently only partially designed to handle big data, most of the algorithms' implementations require the data to fit into a single machine's memory. Although, work on big data support is in progress and there is already one implementation (reducts on Spark, see below) that handles any size of data your cluster can handle (tested with data up to 1TB in size). The former algorithms take advantage of MapReduce's and Spark's capabilities to parallelize computationally intensive algorithms on a number of machines.
Currently there exist the following modes of computation:
- For reducts-based framework: standalone (on a single computer), MapReduce, Spark, Spark Big Data
- For trees-based framework: standalone, MapReduce, Spark
-  Marcin Kruczyk et. al, [Random Reducts: A Monte Carlo Rough Set-based Method for Feature Selection in Large Datasets] (http://iospress.metapress.com/content/53556177225267w6/), 2013
-  Jan G. Bazan et al., [Dynamic Reducts as a Tool for Extracting Laws from Decisions Tables] (http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.56.6411), 1994
-  M Draminski et al., [Monte Carlo feature selection for supervised classification] (http://www.ncbi.nlm.nih.gov/pubmed/18048398), 2008
Running the project
To build the project we use gradle. Currently, there are gradle tasks designed to build one JAR for each computation mode (see above; we're planning to put it into a single JAR or extend the Mahout JAR).
To run the project:
- Clone the repository
git clone https://github.com/wgromniak/mahout-extensions.git
- Build a JAR of your interest. Available tasks
- Run the selected task using gradle wrapper, e.g. (in the project directory)
This will create a JAR in
- Run the JAR
on Spark, e.g. (your command will depend on your cluser configuration, read more)
spark-submit --class org.mimuw.attrsel.trees.spark.TreeAttrSelDriver --master yarn-cluster mahout-extensions-spark-trees.jar -i myData.csv -numSub 5000 -subCard 150 -numCutIter 2 -subGen org.mimuw.attrsel.common.MatrixFixedSizeAttributeSubtableGenerator
on Hadoop, e.g
hadoop jar mahout-extensions-mapred-trees.jar -i myData.csv -o outputDir -numSub 5000 -subCard 150 -subGen org.mimuw.attrsel.common.MatrixFixedSizeAttributeSubtableGenerator
java -jar mahout-extensions-standalone-reducts.jar -i myData.csv -numSub 5000 -subCard 150
Notes on running
- To get all available configuration options run the JARs with
- Spark needs the data in HDFS, MapReduce in a local directory (this will be made consistent in the near future).
- Input data should be in CSV format. Objects are in rows, attributes are in columns, last column is always the decision attribute (only integer decision is allowed). Absent values are not possible.
- If you want to run the Reducts Spark Big Data version run the
- You might need to run Spark with options such as
--driver-memory 8g --executor-memory 2g --driver-java-options "-Dspark.akka.frameSize=1000" --num-executors 50to make it work. Running spark is tricky, cf. Spark should be better than MapReduce (if only it worked). If you need help running the project, please contact us, we've probably already seen all the weired exceptions you get.