This repository includes the MapReduce implementation proposed for Prototype Reduction in [1].

MRPR

This repository includes the MapReduce implementation proposed for Prototype Reduction in [1]. The implementation is based on the Apache Mahout 0.8 library. The goal of the Apache Mahout project (http://mahout.apache.org/) is to build an environment for quickly creating scalable, performant machine learning applications.

Prerequisites:

  • Hadoop 2.5
  • Apache Ant

Associated papers:

[1] I. Triguero, D. Peralta, J. Bacardit, S. García, F. Herrera. MRPR: A MapReduce Solution for Prototype Reduction in Big Data Classification. Neurocomputing 150 (2015), 331-345. doi: 10.1016/j.neucom.2014.04.078 (http://sci2s.ugr.es/sites/default/files/ficherosPublicaciones/1769_2015-Neurocomputing-MRPR-A%20MapReduce%20solution%20for%20prototype%20reduction%20in%20big%20data%20classification.pdf)

[2] I. Triguero, D. Peralta, J. Bacardit, S. García, F. Herrera. A Combined MapReduce-Windowing Two-Level Parallel Scheme for Evolutionary Prototype Generation. In Proceedings of the WCCI 2014 IEEE World Congress on Computational Intelligence, IEEE Congress on Evolutionary Computation CEC'2014, Beijing (China), 6-11 July, pp. 3036-3043, 2014.

Compile the whole project with Ant:

$ ant

Copy the datasets folder into HDFS:

$ hadoop fs -put datasets/

Generate the descriptor file needed by the Mahout code (see ...classifier.df.tools.Describe.java):

$ hadoop jar Model.jar org.apache.mahout.classifier.df.tools.Describe -p  datasets/page-blocks-10-fold/page-blocks-10-1tra.data  -f  datasets/page-blocks-10-fold/page-blocks.info -d  10 N L
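The descriptor passed with -d (here "10 N L") is a compact description of the columns. As far as Mahout's Describe tool is documented, a number acts as a multiplier for the token that follows it, N marks a numerical attribute and L marks the class label, so "10 N L" would mean ten numerical attributes followed by the label. The sketch below only illustrates that reading; expand_descriptor is a hypothetical helper and not part of this repository or of Mahout.

```shell
#!/bin/sh
# Expand a Describe-style descriptor (e.g. "10 N L") into one token per column.
# Assumption: a number is a repeat count for the next attribute-type token.
expand_descriptor() {
  ed_count=1
  ed_out=""
  for ed_tok in "$@"; do
    case "$ed_tok" in
      ''|*[!0-9]*)
        # Attribute-type token: emit it ed_count times, then reset the count.
        ed_i=0
        while [ "$ed_i" -lt "$ed_count" ]; do
          ed_out="$ed_out$ed_tok "
          ed_i=$((ed_i + 1))
        done
        ed_count=1
        ;;
      *)
        # Purely numeric token: remember it as the repeat count.
        ed_count=$ed_tok
        ;;
    esac
  done
  # Drop the trailing space before printing.
  echo "${ed_out% }"
}

expand_descriptor 10 N L
```

Read this way, the page-blocks training file is expected to hold ten numerical columns followed by the class label on every line.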

== PrototypeGenerationModel class

hadoop jar Model.jar org.apache.mahout.classifier.pg.mapreduce.PrototypeGenerationModel --help

Usage:                                                                          
 [--data  --dataset  --header  --output  --help    
--pgMethod  --TypeOfReduce  --numberOfWindows ]               
Options                                                                         
  --data (-d) path               Data path                                      
  --dataset (-ds) dataset        The path of the file descriptor of the dataset 
  --header (-he) header          Header of the dataset in Keel format           
  --output (-o) path             Output path, will contain the preprocessed     
                                 dataset                                        
  --help (-h)                    Print out help                                 
  --pgMethod (-pg) path          PG method: IPADE or SSMASFLSDE. Default: IPADE 
  --TypeOfReduce (-r) path       Type of reduce: Join, Fusion, Filtering,       
                                 NoReduce. Default: Join                        
  --numberOfWindows (-w) path    Number of Windows     

Example: generating the reduced set

The number of mappers is set indirectly through the input split size, so first check the size in bytes of the training file:

$ ls -l datasets/page-blocks-10-fold/page-blocks-10-1tra.data 
 -rw-rw-r-- 1 isaac isaac 221580 jul 15  2013 datasets/page-blocks-10-fold/page-blocks-10-1tra.data 

To get 4 map tasks, divide this size by 4 (221580 / 4 = 55395) and pass the result as the minimum split size, with the maximum split size one byte larger:

$ hadoop jar Model.jar org.apache.mahout.classifier.pg.mapreduce.PrototypeGenerationModel -Dmapred.min.split.size=55395 -Dmapred.max.split.size=55396 -d datasets/page-blocks-5-fold/page-blocks-5-1tra.data -he datasets/page-blocks-5-fold/page-blocks.header -ds datasets/page-blocks-5-fold/page-blocks.info -pg SSMASFLSDE -r Fusion -o output-MRPR/
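The split-size arithmetic can be scripted so it does not have to be redone by hand for every file. This is a minimal sketch: split_size and split_flags are hypothetical helper names, and the ceiling division is an assumption made for sizes that do not divide evenly (both examples in this README divide exactly).

```shell
#!/bin/sh
# Bytes per split so that a file of $1 bytes is covered by $2 splits.
# Ceiling division: a remainder is absorbed instead of forming an extra split.
split_size() {
  ss_bytes=$1
  ss_maps=$2
  echo $(( (ss_bytes + ss_maps - 1) / ss_maps ))
}

# Print the two -D options in the form used by the commands in this README:
# minimum = bytes per split, maximum = minimum + 1.
split_flags() {
  sf_min=$(split_size "$1" "$2")
  echo "-Dmapred.min.split.size=$sf_min -Dmapred.max.split.size=$((sf_min + 1))"
}

# 221580 is the training file size reported by `ls -l`; for a local file it
# could also be read with: wc -c < path/to/file
split_flags 221580 4
```

The printed options can be pasted after the class name in the hadoop jar command line.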

== TestModel class

hadoop jar Model.jar  org.apache.mahout.classifier.pg.mapreduce.TestModel --help


Usage:                                                                          
 [--input  --info  --header  --preprocessed  --save  
Reduce set as plain text  --output  --help]         

Options                                                                         
  --input (-i) input                              Path to job input directory.  
  --info (-ds) test                               The path of the file          
                                                  descriptor of the dataset     
  --header (-he) header                           Header of the dataset in Keel 
                                                  format                        
  --preprocessed (-pre) path                      Preprocessed set path         
  --save Reduce set as plain text (-save) path    Preprocessed set path         
  --output (-o) output                            The directory pathname for    
                                                  output.                       
  --help (-h)                                     Print out help      

Example: classifying the test set

The number of mappers for the test phase is set the same way. The test file has 24706 bytes; dividing it into two parts gives 12353 bytes per split:

$ hadoop jar Model.jar org.apache.mahout.classifier.pg.mapreduce.TestModel -Dmapred.min.split.size=12353 -Dmapred.max.split.size=12354 -i datasets/page-blocks-10-fold/page-blocks-10-1tst.data -ds datasets/page-blocks-10-fold/page-blocks.info -he datasets/page-blocks-5-fold/page-blocks.header -pre output-MRPR/resultingSet.data -o outputKNN-pageblocks
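The two phases are linked through the reduced set: PrototypeGenerationModel writes resultingSet.data into its output directory, and TestModel reads that file as the preprocessed set. The dry-run sketch below makes the hand-off explicit. It only prints the commands, and it rests on several assumptions: the variable names and the run helper are hypothetical, the test file name is guessed by analogy with the training file, the split-size options are left out, and -pg and -r are set to the defaults listed in the help output.

```shell
#!/bin/sh
# Dry run: run() only prints each command. Replace `echo "$@"` with `"$@"`
# to actually launch the jobs on a configured Hadoop installation.
run() { echo "$@"; }

PKG=org.apache.mahout.classifier.pg.mapreduce
DIR=datasets/page-blocks-5-fold      # assumed dataset folder in HDFS
TRAIN=$DIR/page-blocks-5-1tra.data
TEST=$DIR/page-blocks-5-1tst.data    # assumed name, mirroring the training file
REDUCED=output-MRPR                  # output directory of the reduction phase

# Phase 1: build the reduced set from the training data.
reduce_step() {
  run hadoop jar Model.jar $PKG.PrototypeGenerationModel \
    -d $TRAIN -he $DIR/page-blocks.header -ds $DIR/page-blocks.info \
    -pg IPADE -r Join -o $REDUCED/
}

# Phase 2: classify the test data against the reduced set from phase 1.
test_step() {
  run hadoop jar Model.jar $PKG.TestModel \
    -i $TEST -ds $DIR/page-blocks.info -he $DIR/page-blocks.header \
    -pre $REDUCED/resultingSet.data -o outputKNN-pageblocks
}

reduce_step
test_step
```

Keeping the output directory in a single variable avoids a mismatch between the -o value of the first job and the -pre value of the second.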