Load libraries

In [1]:
library(sparklyr)
library(rsparkling)
library(h2o)
library(dplyr)


----------------------------------------------------------------------

Your next step is to start H2O:
    > h2o.init()

For H2O package documentation, ask for help:
    > ??h2o

After starting H2O, you can use the Web UI at http://localhost:54321
For more information visit http://docs.h2o.ai

----------------------------------------------------------------------


Attaching package: ‘h2o’

The following objects are masked from ‘package:stats’:

    cor, sd, var

The following objects are masked from ‘package:base’:

    ||, &&, %*%, apply, as.factor, as.numeric, colnames, colnames<-,
    ifelse, %in%, is.character, is.factor, is.numeric, log, log10,
    log1p, log2, round, signif, trunc


Attaching package: ‘dplyr’

The following objects are masked from ‘package:stats’:

    filter, lag

The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union



Start H2O

In [2]:
h2o.init()


H2O is not running yet, starting it now...

Note:  In case of errors look at the following log files:
    /tmp/Rtmp35RSLI/h2o_root_started_from_r.out
    /tmp/Rtmp35RSLI/h2o_root_started_from_r.err


Starting H2O JVM and connecting: .. Connection successful!

R is connected to the H2O cluster: 
    H2O cluster uptime:         2 seconds 983 milliseconds 
    H2O cluster version:        3.10.3.2 
    H2O cluster version age:    4 months and 12 days !!! 
    H2O cluster name:           H2O_started_from_R_root_drv418 
    H2O cluster total nodes:    1 
    H2O cluster total memory:   0.86 GB 
    H2O cluster total cores:    2 
    H2O cluster allowed cores:  2 
    H2O cluster healthy:        TRUE 
    H2O Connection ip:          localhost 
    H2O Connection port:        54321 
    H2O Connection proxy:       NA 
    R Version:                  R version 3.4.0 (2017-04-21) 


“
Your H2O cluster version is too old (4 months and 12 days)!
Please download and install the latest version from http://h2o.ai/download/”


Note:  As started, H2O is limited to the CRAN default of 2 CPUs.
       Shut down and restart H2O as shown below to use all your CPUs.
           > h2o.shutdown()
           > h2o.init(nthreads = -1)



Connect to Spark

In [3]:
config <- spark_config()
config$sparklyr.gateway.port <- 8881

sc <- spark_connect(master = "spark://s01:7077", config = config)

Spark version 2.1 detected. Will call latest Sparkling Water version 2.1.8


Read training data

In [4]:
tmp_data <- spark_read_csv(sc, name="data", path="/data/mnist.csv", header = TRUE)
data <- as_h2o_frame(sc, tmp_data, strict_version_check = FALSE)
rm( tmp_data )

Split the dataset into training (80%), validation (10%), and test (10%) sets

In [5]:
splits <- h2o.splitFrame(
  data=data, 
  ratios = c(0.8,0.1),   ## only need to specify 2 fractions, the 3rd is implied
  destination_frames = c("train.hex", "valid.hex", "test.hex"), seed = 1099
)

rm( data )

train <- splits[[1]]
valid <- splits[[2]]
test <- splits[[3]]

rm( splits )

dim( train )
dim( valid )
dim( test )
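
h2o.splitFrame uses a probabilistic split rather than an exact one, so the row counts reported by dim() only approximate 80/10/10. The base R sketch below illustrates that idea on plain row indices. It does not call H2O, the helper name split_rows is hypothetical, and it is not H2O's actual implementation.

```r
# Hypothetical helper: assign each of n rows to a split by drawing a uniform
# random number and cutting it at the cumulative ratios.
split_rows <- function(n, ratios, seed = 1099) {
  stopifnot(sum(ratios) < 1)
  set.seed(seed)
  breaks <- c(0, cumsum(ratios), 1)   # the last fraction is implied
  u <- runif(n)
  # findInterval returns 1 for the first split, 2 for the second, and so on
  findInterval(u, breaks, rightmost.closed = TRUE)
}

assignment <- split_rows(10000, c(0.8, 0.1))
sizes <- table(assignment)
print(sizes)               # close to 8000 / 1000 / 1000, but not exact
print(sizes / sum(sizes))  # observed proportions
```

Because every row is assigned independently, the observed proportions drift slightly around the requested ratios, and the drift shrinks as the number of rows grows.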

The labels must be declared as factors so that H2O treats the task as a classification problem rather than a regression

In [6]:
train[,'labels'] <- as.factor( train[,'labels'] )
valid[,'labels'] <- as.factor( valid[,'labels'] )
test[,'labels'] <- as.factor( test[,'labels'] )
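
The difference between a numeric and a factor response can be seen in a small base R sketch. The label values below are a toy vector standing in for the 'labels' column, not the MNIST data, and the code does not require H2O.

```r
# Toy labels standing in for the 'labels' column (assumed values)
labels_num <- c(5, 0, 4, 1, 9, 2, 1, 3, 1, 4)

# As numbers, the digits are ordered quantities, so an average is defined
print(is.numeric(labels_num))
print(mean(labels_num))

# As a factor, each digit is an unordered class with its own level
labels_fac <- as.factor(labels_num)
print(is.factor(labels_fac))
print(levels(labels_fac))
print(table(labels_fac))   # class counts
```

With a numeric response a model would predict a real-valued digit, whereas a factor response asks for one of a fixed set of classes, which is what digit recognition needs.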

Build the Gradient Boosting Machine (GBM) model

In [9]:
start.time <- Sys.time()

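# Use every column as a predictor except the response ("labels") and "_c0"
# (presumably a row-index column carried over from the CSV)
gbm_model <- h2o.gbm(x = setdiff( names( train ), c( "_c0", "labels" ) ),
                     y = "labels",
                     training_frame = train,
                     validation_frame = valid,
                     learn_rate = 0.01)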

end.time <- Sys.time()
time.taken <- end.time - start.time
time.taken

“Dropping constant columns: [V728, V646, V645, V729, V21, V20, V23, V22, V25, V731, V24, V730, V27, V1, V26, V2, V29, V3, V28, V4, V5, V6, V7, V8, V759, V9, V758, V757, V756, V755, V30, V32, V31, V169, V760, V561, V702, V701, V700, V83, V393, V85, V84, V86, V477, V113, V674, V112, V673, V672, V10, V54, V53, V12, V56, V11, V55, V58, V57, V784, V783, V18, V782, V17, V781, V142, V19, V141].
”



Time difference of 13.80028 mins
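
The start/end timing pattern used above can also be wrapped in a small helper built on base R's system.time. This is only a sketch: the helper name time_minutes is hypothetical, and the workload is a stand-in so the example runs without an H2O cluster.

```r
# Hypothetical helper: evaluate an expression and report elapsed minutes
time_minutes <- function(expr) {
  # system.time evaluates its argument, so 'result' is assigned here
  elapsed <- system.time(result <- expr)[["elapsed"]]
  list(result = result, minutes = elapsed / 60)
}

# Stand-in workload; in the notebook this would be the h2o.gbm(...) call
timed <- time_minutes(sum(sqrt(seq_len(1e6))))
print(timed$minutes)
```

Returning both the result and the elapsed time keeps the model object and its training time together, which is convenient when several models are timed in a row.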

Calculate predictions

In [10]:
# Compute predicted values on the test dataset
pred <- h2o.predict(gbm_model, newdata = test )
# Convert from H2O Frame to Spark DataFrame
predicted <- as_spark_dataframe(sc, pred, strict_version_check = FALSE)



Calculate accuracy

In [11]:
is_correct <- pred$predict == test$labels
accuracy <- mean(is_correct)
print( sprintf( 'Accuracy := %f', accuracy ) )

[1] "Accuracy := 0.878398"
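
Overall accuracy hides which digits are confused with each other. The base R sketch below shows how the same comparison extends to a confusion matrix and per-class recall. The two vectors are toy values standing in for test$labels and pred$predict, and the code does not use H2O.

```r
# Toy vectors standing in for test$labels and pred$predict (assumed values)
actual    <- factor(c(0, 1, 2, 2, 1, 0, 2, 1), levels = 0:2)
predicted <- factor(c(0, 1, 2, 1, 1, 0, 2, 2), levels = 0:2)

# Overall accuracy: the fraction of matching labels
accuracy <- mean(predicted == actual)

# Confusion matrix: rows are true classes, columns are predicted classes
confusion <- table(actual = actual, predicted = predicted)

# Per-class recall: correct predictions divided by the true count per class
recall <- diag(confusion) / rowSums(confusion)

print(sprintf("Accuracy := %f", accuracy))
print(confusion)
print(recall)
```

The diagonal of the confusion matrix counts correct predictions, so its sum divided by the total number of rows equals the accuracy computed in the first step.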


Apply Deep Learning

In [12]:
start.time <- Sys.time()

dpl_model <- h2o.deeplearning(x = setdiff( names( train ), c( "_c0", "labels" ) ),
                              y = "labels",
                              training_frame = train,
                              validation_frame = valid,
                              epochs = 1)

end.time <- Sys.time()
time.taken <- end.time - start.time
time.taken

“Dropping constant columns: [V728, V646, V645, V729, V21, V20, V23, V22, V25, V731, V24, V730, V27, V1, V26, V2, V29, V3, V28, V4, V5, V6, V7, V8, V759, V9, V758, V757, V756, V755, V30, V32, V31, V169, V760, V561, V702, V701, V700, V83, V393, V85, V84, V86, V477, V113, V674, V112, V673, V672, V10, V54, V53, V12, V56, V11, V55, V58, V57, V784, V783, V18, V782, V17, V781, V142, V19, V141].
”



Time difference of 1.54599 mins

Calculate predictions

In [15]:
# Compute predicted values on the test dataset
pred <- h2o.predict(dpl_model, newdata = test )
# Convert from H2O Frame to Spark DataFrame
predicted <- as_spark_dataframe(sc, pred, strict_version_check = FALSE)



Calculate accuracy

In [16]:
is_correct <- pred$predict == test$labels
accuracy <- mean(is_correct)
print( sprintf( 'Accuracy := %f', accuracy ) )

[1] "Accuracy := 0.898095"
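
The two runs can be placed side by side with a few lines of base R. The figures are copied from the outputs above. Each model was trained once with mostly default settings, so the comparison is indicative only and not a benchmark.

```r
# Figures copied from the notebook output above
results <- data.frame(
  model    = c("GBM", "Deep Learning"),
  accuracy = c(0.878398, 0.898095),
  minutes  = c(13.80028, 1.54599)
)

# Test error rate is the complement of accuracy
results$error <- 1 - results$accuracy

# Training time of each model relative to the fastest one
results$time_ratio <- results$minutes / min(results$minutes)

print(results)
```

In this run the deep learning model reached a higher test accuracy in less training time, but different hyperparameters (for example more trees or a larger learning rate for the GBM, or more epochs for the network) could change that picture.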
