
# Deep Neural Network with h2o

Zia Ahmed, University at Buffalo


H2O’s Deep Learning is based on a multi-layer feedforward artificial neural network that is trained with stochastic gradient descent using back-propagation. The network can contain a large number of hidden layers consisting of neurons with tanh, rectifier, and maxout activation functions. Advanced features such as adaptive learning rate, rate annealing, momentum training, dropout, L1 or L2 regularization, checkpointing, and grid search enable high predictive accuracy. Each compute node trains a copy of the global model parameters on its local data with multi-threading (asynchronously) and contributes periodically to the global model via model averaging across the network.
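
To make the idea of a multi-layer feedforward network concrete, here is a minimal base-R sketch of a single forward pass through one tanh hidden layer. The layer sizes, random weights, and the `forward` helper are assumptions made up for illustration; this is not H2O's implementation, which also handles training, dropout, and distributed model averaging.

```r
# Minimal sketch of a feedforward pass (illustrative only, not H2O's code)
set.seed(1)

# Hypothetical layer sizes: 3 inputs -> 4 hidden units -> 1 output
n_in <- 3
n_hidden <- 4
n_out <- 1

# Randomly initialised weights and zero biases
W1 <- matrix(rnorm(n_hidden * n_in, sd = 0.5), nrow = n_hidden)
b1 <- rep(0, n_hidden)
W2 <- matrix(rnorm(n_out * n_hidden, sd = 0.5), nrow = n_out)
b2 <- rep(0, n_out)

forward <- function(x) {
  h <- tanh(W1 %*% x + b1)    # hidden layer with tanh activation
  as.numeric(W2 %*% h + b2)   # linear output unit, as used for regression
}

x <- c(0.2, -1.0, 0.5)        # one (already standardized) input row
y_hat <- forward(x)
y_hat
```

Training would adjust `W1`, `b1`, `W2`, and `b2` by back-propagating the prediction error, which is the part H2O performs with stochastic gradient descent.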



### Install h2o

In [None]:
install.packages('h2o')
install.packages('tidymodels')
install.packages('tidyverse')


### Data

In this exercise we will use the following synthetic data set, with DEM, Slope, TPI, MAT, MAP, NDVI, NLCD, and FRG as predictors, to fit a deep neural network regression model for SOC. The data were generated with AI from the gp_soil_data data set.

[gp_soil_data_syn.csv](https://www.dropbox.com/s/c63etg7u5qql2y8/gp_soil_data_syn.csv?dl=0)

In [None]:
library(tidyverse)
# CSV file hosted on the author's GitHub repository
urlfile <- "https://github.com//zia207/r-colab/raw/main/Data/USA/gp_soil_data_syn.csv"
mf <- read_csv(url(urlfile))
# Keep the response (SOC) and the eight predictors, then preview the columns
df <- mf %>%
  dplyr::select(SOC, DEM, Slope, TPI, MAT, MAP, NDVI, NLCD, FRG) %>%
  glimpse()


### Data Preprocessing

### Convert to factor

In [None]:
df$NLCD <- as.factor(df$NLCD)
df$FRG <- as.factor(df$FRG)

### Data split

The data set (n = 1408) will be randomly split into subsets for training (70%), validation (15%), and testing (15%). The validation data will be used to optimize the model parameters during tuning and training. The test data will serve as hold-out data for evaluating the DNN model.

In [None]:
library(tidymodels)
set.seed(1245)   # for reproducibility
split_01 <- initial_split(df, prop = 0.7, strata = SOC)
train <- split_01 %>% training()
test_valid <-  split_01 %>% testing()

split_02 <- initial_split(test_valid, prop = 0.5, strata = SOC)
test <- split_02 %>% training()
valid <-  split_02 %>% testing()

# Density plots of SOC: all data (black), training (green), test (red), validation (blue)
ggplot() +
  geom_density(data = df, aes(SOC)) +
  geom_density(data = train, aes(SOC), color = "green") +
  geom_density(data = test, aes(SOC), color = "red") +
  geom_density(data = valid, aes(SOC), color = "blue") +
  xlab("Soil Organic Carbon (kg/g)") +
  ylab("Density")
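
As a rough check on the subset sizes a 70/15/15 split should give, the following base-R sketch works through the arithmetic for n = 1408. It is an unstratified stand-in, not the `initial_split()` procedure used above, and the index names are hypothetical; stratifying on SOC can shift the counts by a few rows.

```r
# Base-R sketch of a 70/15/15 split (unstratified, illustrative only)
set.seed(1245)
n <- 1408                        # number of rows reported in the text
idx <- sample(n)                 # random permutation of the row indices

n_train <- floor(0.70 * n)       # rows for training
n_test  <- floor(0.15 * n)       # rows for testing

train_idx <- idx[1:n_train]
test_idx  <- idx[(n_train + 1):(n_train + n_test)]
valid_idx <- idx[(n_train + n_test + 1):n]   # remaining rows for validation

sizes <- c(train = length(train_idx),
           test  = length(test_idx),
           valid = length(valid_idx))
sizes
```

Comparing these counts with `nrow(train)`, `nrow(test)`, and `nrow(valid)` is a quick way to confirm that the two-stage split behaved as intended.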

### Import h2o

In [None]:
library(h2o)
h2o.init()
#disable progress bar for RMarkdown
h2o.no_progress()
# Optional: remove anything from previous session
h2o.removeAll()

### Import data to h2o cluster

In [None]:
h_df=as.h2o(df)
h_train = as.h2o(train)
h_test = as.h2o(test)
h_valid = as.h2o(valid)

In [None]:
CV.xy<- as.data.frame(h_train)
test.xy<- as.data.frame(h_test)

### Define response and predictors

In [None]:
y <- "SOC"
x <- setdiff(names(h_df), y)

### Fit h2o model with a few parameters

First we fit a DNN model with the following parameters:

standardize: logical. If enabled, automatically standardize the data.

distribution: Specify the distribution (i.e., the loss function). The model below uses AUTO.

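activation: Specify the activation function. In the R interface the values are written Tanh, TanhWithDropout, Rectifier, RectifierWithDropout, Maxout, and MaxoutWithDropout. One of: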

  - tanh
  
  - tanh_with_dropout
  
  - rectifier (default)
  
  - rectifier_with_dropout

  - maxout (not supported when autoencoder is enabled)
  
  - maxout_with_dropout
  
hidden: Specify the hidden layer sizes (e.g., (100,100)). The value must be positive. This option defaults to (200,200).

adaptive_rate: Specify whether to enable the adaptive learning rate (ADADELTA). This option defaults to True (enabled).

epochs: Specify the number of times to iterate (stream) the dataset. The value can be a fraction. This option defaults to 10.

epsilon: (Applicable only if adaptive_rate=True) Specify the adaptive learning rate time smoothing factor to avoid dividing by zero. This option defaults to 1e-08.

input_dropout_ratio: Specify the input layer dropout ratio to improve generalization. Suggested values are 0.1 or 0.2. This option defaults to 0.

l1: Specify the L1 regularization to add stability and improve generalization; drives many weights to exactly 0. This option defaults to 0.

l2: Specify the L2 regularization to add stability and improve generalization; sets the value of many weights to smaller values. Defaults to 0.

max_w2: Specify the constraint for the squared sum of the incoming weights per unit (e.g. for rectifier). Defaults to 3.4028235e+38.

momentum_start: (Applicable only if adaptive_rate=False) Specify the initial momentum at the beginning of training; we suggest 0.5. This option defaults to 0.

rate: (Applicable only if adaptive_rate=False) Specify the learning rate. Higher values result in a less stable model, while lower values lead to slower convergence. This option defaults to 0.005.

rate_annealing: (Applicable only if adaptive_rate=False) Specify the rate annealing value, which controls learning rate decay. The effective rate is rate / (1 + rate_annealing × samples). This option defaults to 1e-06.

rate_decay: (Applicable only if adaptive_rate=False) Specify the rate decay factor between layers. The N-th layer uses rate × rate_decay^(N−1). This option defaults to 1.

regression_stop: (Regression models only) Specify the stopping criterion for regression error (MSE) on the training data. When the error is at or below this threshold, training stops. To disable this option, enter -1. This option defaults to 1e-06.

rho: (Applicable only if adaptive_rate is enabled) Specify the adaptive learning rate time decay factor. This option defaults to 0.99.

shuffle_training_data: Specify whether to shuffle the training data. This option is recommended if the training data is replicated and the value of train_samples_per_iteration is close to the number of nodes times the number of rows. This option defaults to False (disabled).

stopping_tolerance: Relative tolerance for the metric-based stopping criterion.

stopping_rounds: Early stopping based on convergence of stopping_metric. Defaults to 5.

stopping_metric: Metric to use for early stopping.

variable_importances: Specify whether to compute variable importance. This option defaults to True (enabled).
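
The two learning-rate formulas in the list above, rate / (1 + rate_annealing × samples) for annealing and rate × rate_decay^(N−1) for the N-th layer, can be checked numerically with a short base-R sketch. The helper names `annealed_rate` and `layer_rate` are hypothetical, and `rate_decay = 0.9` is an assumed value chosen only to make the per-layer effect visible (the default of 1 leaves every layer at the same rate).

```r
# Sketch of the learning-rate formulas (only relevant when adaptive_rate = FALSE)
rate           <- 0.005   # initial learning rate (default)
rate_annealing <- 1e-06   # annealing value (default)
rate_decay     <- 0.9     # hypothetical value, for illustration only

# Annealed rate after a given number of training samples
annealed_rate <- function(samples) rate / (1 + rate_annealing * samples)

# Per-layer rate: layer n uses rate * rate_decay^(n - 1)
layer_rate <- function(n) rate * rate_decay^(n - 1)

annealed_rate(c(0, 1e6, 1e7))   # rate shrinks as more samples are seen
layer_rate(1:3)                 # rate shrinks from the first to the third hidden layer
```

With these values the annealed rate halves after one million samples, from 0.005 to 0.0025.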

In [None]:
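# Fit a DNN with three hidden layers of 100 units each
DNN <- h2o.deeplearning(
  model_id = "DNN_model_ID",
  training_frame = h_train,
  validation_frame = h_valid,
  x = x,
  y = y,
  distribution = "AUTO",
  standardize = TRUE,
  shuffle_training_data = TRUE,
  activation = "Tanh",                # R spelling of the tanh activation
  hidden = c(100, 100, 100),
  epochs = 500,
  adaptive_rate = TRUE,               # ADADELTA, controlled by rho and epsilon
  rate = 0.005,                       # applies only if adaptive_rate = FALSE
  rate_annealing = 1e-06,             # applies only if adaptive_rate = FALSE
  rate_decay = 1,                     # applies only if adaptive_rate = FALSE
  rho = 0.99,
  epsilon = 1e-08,
  momentum_start = 0.5,               # applies only if adaptive_rate = FALSE
  momentum_stable = 0.99,             # applies only if adaptive_rate = FALSE
  input_dropout_ratio = 0.0001,
  regression_stop = 1e-06,
  l1 = 0.0001,
  l2 = 0.0001,
  max_w2 = 3.4028235e+38,
  stopping_tolerance = 0.001,
  stopping_rounds = 3,
  stopping_metric = "RMSE",
  nfolds = 5,                         # 5-fold cross-validation on the training frame
  keep_cross_validation_models = TRUE,
  keep_cross_validation_predictions = TRUE,
  variable_importances = TRUE,
  seed = 1256
)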

### Scoring history

In [None]:
plot(DNN)

### Model Performance

#### Training

In [None]:
h2o.performance(DNN,  h_train)

#### Cross-validation

In [None]:
h2o.performance(DNN,  xval=TRUE)

#### Validation data

In [None]:
h2o.performance(DNN,  h_valid)

#### Test data

In [None]:
h2o.performance(DNN, h_test)

### Prediction

In [None]:
test.pred.DNN<-as.data.frame(h2o.predict(object = DNN, newdata = h_test))
test.xy$DNN_SOC<-test.pred.DNN$predict
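
Once the predictions sit next to the observed values, simple accuracy metrics can be computed directly in R. The sketch below defines a hypothetical helper, `eval_metrics`, and runs it on made-up toy vectors so that it works on its own; the commented last line shows how it might be applied to the notebook's data, assuming `test.xy` holds both `SOC` and `DNN_SOC`.

```r
# Hypothetical helper: RMSE, MAE, and R-squared from observed and predicted values
eval_metrics <- function(obs, pred) {
  err <- obs - pred
  c(RMSE = sqrt(mean(err^2)),
    MAE  = mean(abs(err)),
    R2   = cor(obs, pred)^2)    # squared Pearson correlation
}

# Toy vectors (made up) so the block runs without an H2O cluster
obs  <- c(2.0, 4.5, 6.1, 8.3, 3.7)
pred <- c(2.4, 4.0, 6.5, 7.6, 4.1)
m <- eval_metrics(obs, pred)
round(m, 3)

# With the notebook's objects this would be:
# eval_metrics(test.xy$SOC, test.xy$DNN_SOC)
```

The RMSE computed this way should agree with the value reported by `h2o.performance(DNN, h_test)`, which makes it a useful cross-check.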