This repository contains the MapReduce implementation for Prototype Reduction proposed in [1]. The implementation is based on the Apache Mahout 0.8 library. The goal of the Apache Mahout project (http://mahout.apache.org/) is to build an environment for quickly creating scalable, performant machine learning applications.
Prerequisites:
- Hadoop 2.5
- Apache Ant
Associated papers:
[1] I. Triguero, D. Peralta, J. Bacardit, S. García, F. Herrera. MRPR: A MapReduce Solution for Prototype Reduction in Big Data Classification. Neurocomputing 150 (2015), 331-345. doi: 10.1016/j.neucom.2014.04.078 (http://sci2s.ugr.es/sites/default/files/ficherosPublicaciones/1769_2015-Neurocomputing-MRPR-A%20MapReduce%20solution%20for%20prototype%20reduction%20in%20big%20data%20classification.pdf)
[2] I. Triguero, D. Peralta, J. Bacardit, S. García, F. Herrera. A Combined MapReduce-Windowing Two-Level Parallel Scheme for Evolutionary Prototype Generation. In Proceedings of the WCCI 2014 IEEE World Congress on Computational Intelligence, IEEE Congress on Evolutionary Computation CEC'2014, Beijing (China), 6-11 July, pp. 3036-3043, 2014.
Compile the whole project with ANT:
$ ant
Copy the datasets folder into HDFS:
$ hadoop fs -put datasets/
Generate the descriptor file needed by the Mahout code (see ...classifier.df.tools.Describe.java):
$ hadoop jar Model.jar org.apache.mahout.classifier.df.tools.Describe -p datasets/page-blocks-10-fold/page-blocks-10-1tra.data -f datasets/page-blocks-10-fold/page-blocks.info -d 10 N L
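The descriptor passed with -d uses a compact notation in which a number repeats the token that follows it. Assuming, as in Mahout's decision forest tools, that N marks a numerical attribute and L the class label, "10 N L" stands for ten numerical attributes followed by the label. The shell sketch below only illustrates that expansion; the function name expand_descriptor is hypothetical and is not part of Mahout or of this repository.

```shell
# expand_descriptor: hypothetical helper, not part of Mahout.
# Expands a compact descriptor such as "10 N L" into one token per column.
expand_descriptor() {
  out=""
  count=1
  for tok in "$@"; do
    case "$tok" in
      ''|*[!0-9]*)            # not a pure number: an attribute token
        i=0
        while [ "$i" -lt "$count" ]; do
          out="${out:+$out }$tok"
          i=$((i + 1))
        done
        count=1
        ;;
      *)                      # a pure number: repeat count for the next token
        count=$tok
        ;;
    esac
  done
  printf '%s\n' "$out"
}

expand_descriptor 10 N L
```

For the page-blocks example this prints ten N tokens followed by a single L, one token per column of the data file.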
Show the options of the prototype generation job:
$ hadoop jar Model.jar org.apache.mahout.classifier.pg.mapreduce.PrototypeGenerationModel --help
Usage:
 [--data --dataset --header --output --help --pgMethod --TypeOfReduce --numberOfWindows ]
Options
  --data (-d) path              Data path
  --dataset (-ds) dataset       The path of the file descriptor of the dataset
  --header (-he) header         Header of the dataset in Keel format
  --output (-o) path            Output path, will contain the preprocessed dataset
  --help (-h)                   Print out help
  --pgMethod (-pg) path         PG method: IPADE or SSMASFLSDE. Default: IPADE
  --TypeOfReduce (-r) path      Type of reduce: Join, Fusion, Filtering, NoReduce. Default: Join
  --numberOfWindows (-w) path   Number of Windows
Example: generating the reduced set.
The number of mappers is controlled through the input split size, so first check the size in bytes of the training file:
$ ls -l datasets/page-blocks-10-fold/page-blocks-10-1tra.data
-rw-rw-r-- 1 isaac isaac 221580 jul 15 2013 datasets/page-blocks-10-fold/page-blocks-10-1tra.data
To get 4 map tasks, divide this size by 4 (221580 / 4 = 55395) and use the result as the minimum split size, with the maximum one byte larger:
$ hadoop jar Model.jar org.apache.mahout.classifier.pg.mapreduce.PrototypeGenerationModel -Dmapred.min.split.size=55395 -Dmapred.max.split.size=55396 -d datasets/page-blocks-10-fold/page-blocks-10-1tra.data -he datasets/page-blocks-5-fold/page-blocks.header -ds datasets/page-blocks-10-fold/page-blocks.info -pg SSMASFLSDE -r Fusion -o output-MRPR/
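The split-size arithmetic behind the -Dmapred.min.split.size and -Dmapred.max.split.size values in the command above can be sketched in shell. The helper name split_size is hypothetical, and rounding up when the file size is not an exact multiple of the number of maps is an assumption of this sketch, not something the instructions above specify.

```shell
# split_size BYTES MAPS: hypothetical helper that prints BYTES / MAPS,
# rounded up so that MAPS splits are enough to cover the whole file.
split_size() {
  echo $(( ($1 + $2 - 1) / $2 ))
}

bytes=221580   # size of page-blocks-10-1tra.data as reported by ls -l
maps=4         # desired number of map tasks
min=$(split_size "$bytes" "$maps")
max=$(( min + 1 ))
echo "-Dmapred.min.split.size=$min -Dmapred.max.split.size=$max"
```

With the values from the example this prints -Dmapred.min.split.size=55395 -Dmapred.max.split.size=55396, matching the options used in the command.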
Show the options of the classification job:
$ hadoop jar Model.jar org.apache.mahout.classifier.pg.mapreduce.TestModel --help
Usage:
 [--input --info --header --preprocessed --save Reduce set as plain text --output --help]
Options
  --input (-i) input                             Path to job input directory.
  --info (-ds) test                              The path of the file descriptor of the dataset
  --header (-he) header                          Header of the dataset in Keel format
  --preprocessed (-pre) path                     Preprocessed set path
  --save Reduce set as plain text (-save) path   Preprocessed set path
  --output (-o) output                           The directory pathname for output.
  --help (-h)                                    Print out help
Example: classifying the test set.
The number of mappers of the test phase is set in the same way. The test file is 24706 bytes, so splitting it into two parts gives 24706 / 2 = 12353 bytes per split:
$ hadoop jar Model.jar org.apache.mahout.classifier.pg.mapreduce.TestModel -Dmapred.min.split.size=12353 -Dmapred.max.split.size=12354 -i datasets/page-blocks-10-fold/page-blocks-10-1tst.data -ds datasets/page-blocks-10-fold/page-blocks.info -he datasets/page-blocks-5-fold/page-blocks.header -pre output-MRPR/resultingSet.data -o outputKNN-pageblocks
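As a sketch, the byte count and the two split-size options could also be derived directly from the file instead of being read off ls -l by hand. The function name split_flags is hypothetical, and the script assumes that a copy of the file is readable on the local filesystem (not only in HDFS).

```shell
# split_flags FILE MAPS: hypothetical helper that prints the two -D options
# for splitting a local FILE into MAPS parts of (almost) equal size.
split_flags() {
  size=$(wc -c < "$1")                 # file size in bytes
  min=$(( (size + $2 - 1) / $2 ))      # bytes per split, rounded up
  echo "-Dmapred.min.split.size=$min -Dmapred.max.split.size=$(( min + 1 ))"
}

# Example call for the test file used above, with two map tasks:
# split_flags datasets/page-blocks-10-fold/page-blocks-10-1tst.data 2
```

For a 24706-byte file and two map tasks, the function yields the same 12353 and 12354 values used in the command above.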