ROSEFW-RF

This repository includes the MapReduce implementations used in [1]. The implementation is based on the Apache Mahout 0.8 library. The Apache Mahout project (http://mahout.apache.org/) aims to build an environment for quickly creating scalable, performant machine learning applications.

Prerequisites:

  • Hadoop 2.5.
  • ant
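
A quick way to confirm that both prerequisites are available on the PATH:

$ hadoop version
$ ant -version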

Associated paper:

  • I. Triguero, S. Río, V. López, J. Bacardit, J.M. Benítez, F. Herrera. ROSEFW-RF: The winner algorithm for the ECBDL'14 Big Data Competition: An extremely imbalanced big data bioinformatics problem. Knowledge-Based Systems, in press. doi: 10.1016/j.knosys.2015.05.027

Compile the whole project with Ant:

$ ant
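
The build is assumed to produce Model.jar (the jar used in all the commands below); check the result with, e.g.:

$ ls -l Model.jar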

Put the datasets folder into HDFS:

$ hadoop fs -put datasets/
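
With no explicit destination, the folder should end up in your HDFS home directory; verify the upload with:

$ hadoop fs -ls datasets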

Generate the descriptor file needed by the Mahout code (see org.apache.mahout.classifier.df.tools.Describe):

$ hadoop jar Model.jar org.apache.mahout.classifier.df.tools.Describe -p datasets/ECBDL14subset.data -f datasets/ECBDL14subset.info -d 3 N 18 C 18 N 54 C 38 N 20 C 480 N L
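
The -d argument describes the attributes of the data file in order: in the Describe tool, a number is a repetition count for the token that follows, N marks a numerical attribute, C a categorical one, and L the class label. The string above therefore declares 3 numerical, 18 categorical, 18 numerical, 54 categorical, 38 numerical, 20 categorical and 480 numerical attributes (631 in total), followed by the label.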

== Random Oversampling

$ hadoop jar Model.jar org.apache.mahout.classifier.df.mapreduce.Resampling --help

Usage:
 [--data <path> --dataset <dataset> --time <path> --help --resampling <resampling>
 --dataPreprocessing <path> --nbpartitions <nbpartitions> --npos <npos>
 --nneg <nneg> --negclass <negclass>]
Options                                                                         
  --data (-d) path                    Data path                                 
  --dataset (-ds) dataset             Dataset path                              
  --time (-tm) path                   Time path                                 
  --help (-h)                         Print out help                            
  --resampling (-rs) resampling       The resampling technique (oversampling    
                                      (overs), undersampling (unders) or SMOTE  
                                      (smote))                                  
  --dataPreprocessing (-dp) path      Data Preprocessing path                   
  --nbpartitions (-p) nbpartitions    Number of partitions                      
  --npos (-npos) npos                 Number of instances of the positive class 
  --nneg (-nneg) nneg                 Number of instances of the negative class 
  --negclass (-negclass) negclass     Name of the negative class      

Example of generating the preprocessed data:

To control the number of mappers, we first need the size in bytes of the training file:

$ ls -l datasets/
-rw-r--r--. 1 isaact users 19019170 Jun  9 14:10 ECBDL14subset.data

If we want 4 map tasks, we divide this size by 4, obtaining a split size of 4754792 bytes.
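
This calculation can be scripted; a minimal sketch using GNU stat (Linux):

$ BYTES=$(stat -c%s datasets/ECBDL14subset.data)   # 19019170 for this file
$ MAPS=4
$ echo "min=$((BYTES / MAPS)) max=$((BYTES / MAPS + 1))"
min=4754792 max=4754793

These are the values passed as mapred.min.split.size and mapred.max.split.size below.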

$ hadoop jar Model.jar org.apache.mahout.classifier.df.mapreduce.Resampling -Dmapred.min.split.size=4754792 -Dmapred.max.split.size=4754793 -dp datasets/ECBDL14subset.data -d output-ROS -ds datasets/ECBDL14subset.info -rs overs -p 4 -tm ROS-ECBDL14-build_time

== Evolutionary Feature Weighting

$ hadoop jar Model.jar org.apache.mahout.classifier.feature_weighting.mapreduce.FeatureWeightingModel --help

Usage:
 [--data <path> --dataset <dataset> --header <header> --output <path>]
Options                                                                         
  --data (-d) path           Data path                                          
  --dataset (-ds) dataset    The path of the file descriptor of the dataset     
  --header (-he) header      Header of the dataset in Keel format               
  --output (-o) path         Output path, will contain the set of selected      
                             features   

Example of applying EFW to the previously generated balanced data (adjust the split size according to the size of the input data):

$ hadoop jar Model.jar org.apache.mahout.classifier.feature_weighting.mapreduce.FeatureWeightingModel -Dmapred.max.split.size=XXXX -d output-ROS/part-r-00000 -ds datasets/ECBDL14subset.info -he datasets/ECBDL14subset.header -o output-DEFW
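
The job writes the computed feature weights to the output path; the next step reads them from output-DEFW/Pesos.txt, so they can be inspected with:

$ hadoop fs -cat output-DEFW/Pesos.txt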

Create the resulting preprocessed dataset:

$ hadoop jar Model.jar org.apache.mahout.classifier.feature_weighting.mapreduce.FWconstructor --help

Usage:
 [--input <input> --info <test> --header <header> --feature_weighting <path>
 --weight threshold <path> --output <output> --help]
Options                                                                         
  --input (-i) input                Path to job input directory.                
  --info (-ds) test                 The path of the file descriptor of the      
                                    dataset                                     
  --header (-he) header             Header of the dataset in Keel format        
  --feature_weighting (-fw) path    Feature weights path                        
  --weight threshold (-w) path      Weight threshold to select features         
  --output (-o) output              The directory pathname for output.          
  --help (-h)                       Print out help  

Example:

$ hadoop jar Model.jar org.apache.mahout.classifier.feature_weighting.mapreduce.FWconstructor -i output-ROS/part-r-00000 -fw output-DEFW/Pesos.txt -w 0.46 -ds datasets/ECBDL14subset.info -he datasets/ECBDL14subset.header -o output-FWconstructor
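
As a quick sanity check, and assuming the constructed file keeps the comma-separated format of the input data, count how many attributes survived the 0.46 weight threshold in the first record:

$ hadoop fs -cat output-FWconstructor/part-r-00000.out | head -n 1 | tr ',' '\n' | wc -l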

== RandomForest

First, generate the descriptor (.info) file for this data:

$ hadoop jar Model.jar org.apache.mahout.classifier.df.tools.Describe -p output-FWconstructor/part-r-00000.out -f output-FWconstructor/part-r-00000.info -d 3 N 18 C 18 N 54 C 38 N 20 C 480 N L

Build a model with the previously preprocessed data, adjusting the split size as computed above. Here -sl sets the number of features randomly selected at each tree node, -p selects the partial implementation, and -t sets the number of trees:

$ hadoop jar Model.jar org.apache.mahout.classifier.df.mapreduce.BuildForest -Dmapred.min.split.size=XXXXX -Dmapred.max.split.size=XXXX -o output-RF/ -d output-FWconstructor/part-r-00000.out -ds output-FWconstructor/part-r-00000.info -sl 25 -p -t 200 -tm model_build_time

Classify the test data (-a analyses the results, printing a confusion matrix, and -mr runs the classification as a MapReduce job):

$ hadoop jar Model.jar org.apache.mahout.classifier.df.mapreduce.TestForest -Dmapred.min.split.size=XXXX -Dmapred.max.split.size=XXXX -i datasets/ECBDL14subset.data -ds datasets/ECBDL14subset.info -m output-RF/ -a -mr -o outputTEST-RF
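
The job output, including the predictions, is written under outputTEST-RF and can be listed with:

$ hadoop fs -ls outputTEST-RF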