This project contains the implementation of evolutionary feature selection models based on MapReduce, as used in [1].
The implementation is built on the Apache Mahout 0.8 library. The goal of the Apache Mahout project is to build an environment for quickly creating scalable, performant machine learning applications.


Requirements:

  • Hadoop 2.5
  • Apache Ant

Associated paper:

  • [1] D. Peralta, S. Del Río, S. Ramírez-Gallego, I. Triguero, J.M. Benítez, F. Herrera. Evolutionary Feature Selection for Big Data Classification: A MapReduce Approach.

Compile the whole project with Ant:

$ ant

Copy the dataset folder into HDFS:

$ hadoop fs -put datasets/

Generate the descriptor file needed by the Mahout code:

$ hadoop jar Model.jar -p  datasets/page-blocks-10-fold/  -f  datasets/page-blocks-10-fold/ -d  10 N L

Show the options of the feature selection model:

$ hadoop jar Model.jar org.apache.mahout.classifier.feature_selection.mapreduce.FeatureSelectionModel -h

 [--data <path> --dataset <dataset> --header <header> --output <path>]
  --data (-d) path           Data path
  --dataset (-ds) dataset    The path of the file descriptor of the dataset
  --header (-he) header      Header of the dataset in Keel format
  --output (-o) path         Output path, will contain the set of selected

Example of use:

The number of mappers depends on the input split size, so first check the size in bytes of the training file:

$ ls -l datasets/page-blocks-10-fold/ 
 -rw-rw-r-- 1 isaac isaac 221580 jul 15  2013 datasets/page-blocks-10-fold/ 

To run 4 map tasks, divide this size by 4 (221580 / 4 = 55395) and pass the result as the maximum split size:

$ hadoop jar Model.jar org.apache.mahout.classifier.feature_selection.mapreduce.FeatureSelectionModel -Dmapred.max.split.size=55395 -d datasets/page-blocks-5-fold/  -he datasets/page-blocks-5-fold/page-blocks.header  -ds datasets/page-blocks-5-fold/  -o output-FS-pageblocks
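The split-size arithmetic can be scripted instead of done by hand. The following is a minimal sketch, not part of the repository: split_size is a hypothetical helper name, and it assumes the training file is smaller than one HDFS block, so that mapred.max.split.size is what limits the size of each split.

```shell
# Hypothetical helper (not part of this repository): compute a value for
# -Dmapred.max.split.size from a file size in bytes and a desired number of maps.
split_size() {
  # Round up so that (number of maps * split size) always covers the whole file.
  echo $(( ($1 + $2 - 1) / $2 ))
}

# Size of the training file from the example, divided among 4 map tasks.
split_size 221580 4
```

For a local copy of the training file, the size can be read with wc -c rather than from the ls listing, for example split_size "$(wc -c < "$TRAIN_FILE")" 4, where TRAIN_FILE is a placeholder for the path of the training file.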

Build the preprocessed dataset for classification purposes:

$ hadoop jar Model.jar org.apache.mahout.classifier.feature_selection.mapreduce.FSconstructor -i datasets/page-blocks-5-fold/ -fs output-FS-pageblocks/seleccionadas.txt -ds datasets/page-blocks-5-fold/ -he datasets/page-blocks-5-fold/page-blocks.header -o output-FSconstructor
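The two steps above (feature selection, then dataset construction) can be chained in a small wrapper script. This is only a sketch under stated assumptions: TRAIN_FILE, DESCRIPTOR_FILE and INPUT_FILE are placeholders rather than file names taken from the repository, and the run helper and the DRY_RUN variable are hypothetical. Only the classes and options shown in the commands above are used.

```shell
#!/bin/sh
# Sketch: run feature selection and then build the preprocessed dataset.
FOLD_DIR=datasets/page-blocks-5-fold
TRAIN="$FOLD_DIR/TRAIN_FILE"       # placeholder: training file
INFO="$FOLD_DIR/DESCRIPTOR_FILE"   # placeholder: dataset descriptor file
INPUT="$FOLD_DIR/INPUT_FILE"       # placeholder: file to be preprocessed
HEADER="$FOLD_DIR/page-blocks.header"
PKG=org.apache.mahout.classifier.feature_selection.mapreduce
FS_OUT=output-FS-pageblocks

# run (hypothetical helper): print the command, and execute it only
# when DRY_RUN=0. By default nothing is executed.
run() {
  echo "$@"
  if [ "${DRY_RUN:-1}" = "0" ]; then "$@"; fi
}

# The second step starts only if the first one succeeds.
run hadoop jar Model.jar "$PKG.FeatureSelectionModel" \
    -Dmapred.max.split.size=55395 \
    -d "$TRAIN" -he "$HEADER" -ds "$INFO" -o "$FS_OUT" &&
run hadoop jar Model.jar "$PKG.FSconstructor" \
    -i "$INPUT" -fs "$FS_OUT/seleccionadas.txt" -ds "$INFO" -he "$HEADER" \
    -o output-FSconstructor
```

With the default settings the script only prints the two commands, so the paths can be checked first. Setting DRY_RUN=0 in the environment makes it execute them.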