
# ELEVEN - ELEctronic inVoicEs in the portuguese laNguage

## Preparation of the environment (installation of Java and H2O)

1. Set seed for reproducible results

In [1]:
set.seed(100)
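
`set.seed()` fixes the state of R's own random number generator, so any later call in the R session that draws random numbers gives the same result on every run. A minimal base-R sketch of that behaviour (the names `a` and `b` are illustrative and not part of the notebook):

```r
# Draw five random integers twice, resetting the seed before each draw.
set.seed(100)
a <- sample(1:1000, 5)

set.seed(100)
b <- sample(1:1000, 5)

# Same seed, same sequence of draws.
identical(a, b)  # TRUE
```

H2O runs in a separate Java process, so this seed most likely does not reach it. The H2O functions used later accept their own `seed` argument, as `h2o.splitFrame()` does in step 10.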

2. Packages

In [2]:
# Text mining: corpus and document-term matrix (pinned to tm 0.7-1)
require(devtools)
install_version("tm", version = "0.7-1", repos = "http://cran.us.r-project.org")
library(tm)
# Word stemming
install.packages("SnowballC")
library(SnowballC)
# H2O package
install.packages("h2o")
library(h2o)

Loading required package: devtools

Loading required package: usethis

Downloading package from url: http://cran.us.r-project.org/src/contrib/Archive/tm/tm_0.7-1.tar.gz



Rcpp (NA -> 1.0.8.3 ) [CRAN]
BH   (NA -> 1.78.0-0) [CRAN]
slam (NA -> 0.1-50  ) [CRAN]
NLP  (NA -> 0.2-1   ) [CRAN]


Installing 4 packages: Rcpp, BH, slam, NLP

Installing packages into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)

Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)

Loading required package: NLP

Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)

Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)

also installing the dependencies ‘bitops’, ‘RCurl’



----------------------------------------------------------------------

Your next step is to start H2O:
    > h2o.init()

For H2O package documentation, ask for help:
    > ??h2o

After starting H2O, you can use the Web UI at http://localhost:54321
For more information visit https://docs.h2o.ai

----------------------------------------------------------------------



Attaching package: ‘h2o’


The following objects are masked from ‘package:stats’:

    cor, sd, var


The following objects are masked from ‘package:base’:

    &&, %*
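
Re-running the package cell reinstalls everything each time. One possible way to avoid that is to install only what is missing. The helper `missing_packages()` below is hypothetical (it is not part of the notebook or of any package), and the install call is left commented out so the sketch has no side effects beyond loading namespaces:

```r
# Hypothetical helper: return the subset of `pkgs` that is not installed.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

needed <- c("SnowballC", "h2o")
to_install <- missing_packages(needed)

# Install only what is missing (uncomment to actually install).
# if (length(to_install) > 0) install.packages(to_install)
```

The pinned `tm` version would still need `install_version()`, since this check only tests whether a package is present, not which version.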

3. Initialize H2O

In [3]:
h2o.init()


H2O is not running yet, starting it now...

Note:  In case of errors look at the following log files:
    /tmp/Rtmp1dTEN0/file3a3cd7a022/h2o_UnknownUser_started_from_r.out
    /tmp/Rtmp1dTEN0/file3a502ac85d/h2o_UnknownUser_started_from_r.err


Starting H2O JVM and connecting: .... Connection successful!

R is connected to the H2O cluster: 
    H2O cluster uptime:         3 seconds 330 milliseconds 
    H2O cluster timezone:       Etc/UTC 
    H2O data parsing timezone:  UTC 
    H2O cluster version:        3.36.0.4 
    H2O cluster version age:    1 month and 14 days  
    H2O cluster name:           H2O_started_from_R_root_gwk712 
    H2O cluster total nodes:    1 
    H2O cluster total memory:   3.17 GB 
    H2O cluster total cores:    2 
    H2O cluster allowed cores:  2 
    H2O cluster healthy:        TRUE 
    H2O Connection ip:          localhost 
    H2O Connection port:        54321 
    H2O Connection proxy:       NA 
    H2O Internal Security:      FALSE 
    R Version:    

## Acquire and prepare Data-Set

4. Read the CSV file with two columns: Text and Category

In [4]:
df <- read.delim(
  "https://raw.githubusercontent.com/vinidiol/descmerc/main/AmostraDescMerc.csv"
  # , encoding = "UTF-8"
)
head(df)

| | Text | Category |
|---|---|---|
| | `<chr>` | `<chr>` |
| 1 | - REEF.: (140330) - SPRAY CAPILAR 270ML - TIGI CATWALK SESSION SERIES SALT SPRAY - PROT.DO.M.S | cosmeticos |
| 2 | BOZZANO ESPUMA BARBA CERRADA 6X190G | cosmeticos |
| 3 | NINA ELIXIR BC100 GENERIC GWP 2012 | cosmeticos |
| 4 | CHA DE CAMOMILA 12X10X1,0 G | cosmeticos |
| 5 | EDP SILVER RAIN VAPO 50 ML | cosmeticos |
| 6 | MINOTAURE M75 | cosmeticos |


5. Create corpus

In [5]:
docs <- Corpus(VectorSource(df$Text))

6. Clean corpus

In [21]:
docs <- tm_map(docs, content_transformer(tolower))
docs <- tm_map(docs, removeNumbers)
docs <- tm_map(docs, removeWords, stopwords("pt"))
docs <- tm_map(docs, removePunctuation)
docs <- tm_map(docs, stripWhitespace)
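
To make the effect of the five cleaning steps concrete, here is a base-R approximation applied to two descriptions from the sample. This is a sketch, not the `tm` implementation: `clean_text()` is a hypothetical helper, and the short `stop_words` vector is an illustrative stand-in for the much longer list returned by `stopwords("pt")`.

```r
# Base-R approximation of the five cleaning steps (not the tm implementation).
clean_text <- function(x, stop_words) {
  x <- tolower(x)                                   # 1. lower-case
  x <- gsub("[[:digit:]]+", "", x)                  # 2. drop digits
  pattern <- paste0("\\b(", paste(stop_words, collapse = "|"), ")\\b")
  x <- gsub(pattern, "", x, perl = TRUE)            # 3. drop stop words
  x <- gsub("[[:punct:]]+", "", x)                  # 4. drop punctuation
  x <- gsub("[[:space:]]+", " ", x)                 # 5. collapse whitespace
  x
}

# Tiny illustrative stop-word list; stopwords("pt") is much longer.
stop_words <- c("de", "da", "do", "e")

clean_text("CHA DE CAMOMILA 12X10X1,0 G", stop_words)
# "cha camomila xx g"
clean_text("BOZZANO ESPUMA BARBA CERRADA 6X190G", stop_words)
# "bozzano espuma barba cerrada xg"
```

Removing digits from a token such as `6X190G` leaves the fragment `xg`, which may explain glued terms like `cxkg` in the dropped-columns warning printed during modeling.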

7. Create the document-term matrix (DTM)

In [7]:
dtm <- DocumentTermMatrix(docs)
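
A document-term matrix has one row per document and one column per distinct term, and each cell holds the number of times the term occurs in the document. The base-R sketch below builds such a matrix by hand for three made-up cleaned descriptions. The names `texts`, `tokens`, `vocab`, and `toy_dtm` are illustrative, and the result is a dense matrix rather than the sparse object `tm` returns.

```r
texts <- c("cha camomila", "espuma barba", "cha verde cha")

# Split each document into tokens and collect the vocabulary.
tokens <- strsplit(texts, " ", fixed = TRUE)
vocab <- sort(unique(unlist(tokens)))

# Count every vocabulary term in every document (documents in rows).
toy_dtm <- t(vapply(
  tokens,
  function(tok) as.integer(table(factor(tok, levels = vocab))),
  integer(length(vocab))
))
dimnames(toy_dtm) <- list(docs = as.character(seq_along(texts)), terms = vocab)
toy_dtm
```

By default, `DocumentTermMatrix()` also drops terms shorter than three characters, so one- and two-letter fragments left over from cleaning do not become columns.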

8. Convert the DTM to a matrix and then to a data frame

In [8]:
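# Convert the sparse DTM to a dense matrix, then to a data frame.
# Every column is a numeric term count, so stringsAsFactors has no effect here
# (the original argument was also misspelled as "stringsAsfactors" and ignored).
mat.df <- as.data.frame(as.matrix(dtm))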

9. Preparing labels

In [9]:
# Column bind category (known classification)
mat.df <- cbind(mat.df, df$Category)

# Change name of new column to "category"
colnames(mat.df)[ncol(mat.df)] <- "category"

In [10]:
mat.df$category <- as.factor(mat.df$category)
str(mat.df$category)

 Factor w/ 3 levels "alimentos","cosmeticos",..: 2 2 2 2 2 2 2 2 2 2 ...
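
The `2 2 2 ...` in the output above are the integer codes behind the factor. A short base-R sketch of how those codes arise, using the three class names that appear in this notebook (the vector `labels` is made up for illustration):

```r
labels <- c("cosmeticos", "alimentos", "matlimpeza", "cosmeticos")
f <- as.factor(labels)

levels(f)      # sorted levels: "alimentos" "cosmeticos" "matlimpeza"
as.integer(f)  # codes index into the levels: 2 1 3 2
```

Because `cosmeticos` is the second level alphabetically, the first rows of the sample, which are all cosmetics, print as `2`. Keeping the response as a factor matters later: H2O treats a factor response as a classification target.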


10. Preparing the H2O frame and splitting it into training, validation, and test sets

In [11]:
mat.df.h <- as.h2o(mat.df)
data.split <- h2o.splitFrame(data = mat.df.h, ratios = c(0.7, 0.2), seed = 1234)
data.train <- data.split[[1]]
data.valid <- data.split[[2]]
data.test <- data.split[[3]]
myY <- "category"
myX <- setdiff(names(data.train), c(myY, "ID"))
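
With `ratios = c(0.7, 0.2)`, `h2o.splitFrame()` returns three frames: roughly 70% for training, 20% for validation, and the remaining 10% for testing. H2O documents the split as approximate rather than exact. The base-R sketch below is only an analogy for that kind of row-wise random assignment, not H2O's implementation, and the row count `n` is illustrative rather than taken from the data set.

```r
set.seed(1234)
n <- 15000  # illustrative row count, not taken from the data set

# Assign each row to a partition with probabilities 0.7 / 0.2 / 0.1.
part <- sample(c("train", "valid", "test"), size = n, replace = TRUE,
               prob = c(0.7, 0.2, 0.1))

# Observed shares are close to, but not exactly, the requested ratios.
round(prop.table(table(part)), 3)
```

This is why partition sizes need not be round numbers; the validation confusion matrices below, for example, total 2,966 rows.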



## Modeling

11. GBM model - Gradient Boosting Machine (this can take a couple of hours to run)

In [12]:
gbm.model <- h2o.gbm(x = myX, y = myY,
                     training_frame = data.train,
                     validation_frame = data.valid, ntrees = 1000, max_depth = 3,
                     model_id = "gbm_xprod_5mil")
h2o.confusionMatrix(gbm.model@model$validation_metrics)

“Dropping bad and constant columns: [crpent, aventura, espelhada, potexgr, ccg, saquinho, somboc, saku, pratico, sotxg, ted, tec, tee, lactus, essentials, rfinas, conceal, venclote, ferrare, sulminas, serr, ultime, mochilete, gpomegranatelocao, pytaia, lint, size, amable, linx, cellular, solix, afthercoldesodorante, marroq, wine, furadinho, rechchocol, pauto, borrachinha, goma, luminosos, feiticeira, aprefpil, zoodrin, sals, ervadoceg, salt, jumbitos, itacolomy, mlnecessaire, daya, curaprox, danesa, lustralgodao, blua, charcoal, condiciona, tordelini, neutergen, terracota, leli, cibele, maybellinemlcandy, got, enxaguante, dijon, conchiglioni, lvml, pak, macgalo, moisture, cht, sand, zaza, cia, thym, santos, cxkg, powerdose, folia, licor, crmarina, mtodeschini, aquila, emagran, ansolar, sportml, remove, refillable, pvaso, semorin, treasure, almar, pearlfusion, pcr, sanafit, escargots, arga, victory, mlxxxwebxxx, modamlvinho, condicionad, china, fioruccimlvanila, medref, enfamil, sexo, l



| | alimentos | cosmeticos | matlimpeza | Error | Rate |
|---|---|---|---|---|---|
| | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<chr>` |
| alimentos | 895 | 92 | 7 | 0.09959759 | 99 / 994 |
| cosmeticos | 17 | 934 | 24 | 0.04205128 | 41 / 975 |
| matlimpeza | 14 | 124 | 859 | 0.13841525 | 138 / 997 |
| Totals | 926 | 1150 | 890 | 0.09372893 | 278 / 2,966 |


In [13]:
conf.mat <- h2o.confusionMatrix(gbm.model@model$validation_metrics)
conf.mat

| | alimentos | cosmeticos | matlimpeza | Error | Rate |
|---|---|---|---|---|---|
| | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<chr>` |
| alimentos | 895 | 92 | 7 | 0.09959759 | 99 / 994 |
| cosmeticos | 17 | 934 | 24 | 0.04205128 | 41 / 975 |
| matlimpeza | 14 | 124 | 859 | 0.13841525 | 138 / 997 |
| Totals | 926 | 1150 | 890 | 0.09372893 | 278 / 2,966 |
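
The confusion matrix can be turned into overall accuracy and per-class recall and precision with a few lines of base R. The sketch below re-enters the GBM validation counts printed above by hand. Rows are taken to be the actual classes and columns the predicted classes, which matches the row totals in the `Rate` column; the names `cm`, `accuracy`, `recall`, and `precision` are illustrative.

```r
classes <- c("alimentos", "cosmeticos", "matlimpeza")

# GBM validation counts, copied from the confusion matrix above.
cm <- matrix(c(895,  92,   7,
                17, 934,  24,
                14, 124, 859),
             nrow = 3, byrow = TRUE,
             dimnames = list(actual = classes, predicted = classes))

accuracy  <- sum(diag(cm)) / sum(cm)   # overall share of correct predictions
recall    <- diag(cm) / rowSums(cm)    # per class: 1 - Error column
precision <- diag(cm) / colSums(cm)    # per class: correct / predicted as class

round(accuracy, 4)
round(cbind(recall, precision), 4)
```

Here `1 - accuracy` reproduces the overall error of 0.09372893. Precision is lowest for `cosmeticos`, since the model assigns 1,150 rows to that class while only 975 actually belong to it.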


In [15]:
r2.gbm.model.5mil <- gbm.model@model$validation_metrics@metrics$r2
r2.gbm.model.5mil

12. Deep Learning (MLP - Multilayer Perceptron)

In [17]:
#encoding = "LabelEncoder"

dl.model <- h2o.deeplearning(x = myX, y = myY, 
                             training_frame = data.train ,
                             #categorical_encoding = encoding,
                             hidden = c(100,200,100), 
                             epochs = 20,
                             validation_frame = data.valid,
                             model_id = "dl_xprod_5mil")
h2o.confusionMatrix(dl.model@model$validation_metrics)

“Dropping bad and constant columns: [crpent, aventura, espelhada, potexgr, ccg, saquinho, somboc, saku, pratico, sotxg, ted, tec, tee, lactus, essentials, rfinas, conceal, venclote, ferrare, sulminas, serr, ultime, mochilete, gpomegranatelocao, pytaia, lint, size, amable, linx, cellular, solix, afthercoldesodorante, marroq, wine, furadinho, rechchocol, pauto, borrachinha, goma, luminosos, feiticeira, aprefpil, zoodrin, sals, ervadoceg, salt, jumbitos, itacolomy, mlnecessaire, daya, curaprox, danesa, lustralgodao, blua, charcoal, condiciona, tordelini, neutergen, terracota, leli, cibele, maybellinemlcandy, got, enxaguante, dijon, conchiglioni, lvml, pak, macgalo, moisture, cht, sand, zaza, cia, thym, santos, cxkg, powerdose, folia, licor, crmarina, mtodeschini, aquila, emagran, ansolar, sportml, remove, refillable, pvaso, semorin, treasure, almar, pearlfusion, pcr, sanafit, escargots, arga, victory, mlxxxwebxxx, modamlvinho, condicionad, china, fioruccimlvanila, medref, enfamil, sexo, l



| | alimentos | cosmeticos | matlimpeza | Error | Rate |
|---|---|---|---|---|---|
| | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<chr>` |
| alimentos | 967 | 17 | 10 | 0.02716298 | 27 / 994 |
| cosmeticos | 13 | 934 | 28 | 0.04205128 | 41 / 975 |
| matlimpeza | 5 | 64 | 928 | 0.06920762 | 69 / 997 |
| Totals | 985 | 1015 | 966 | 0.04619016 | 137 / 2,966 |


In [18]:
conf.mat.dl5mil <- h2o.confusionMatrix(dl.model@model$validation_metrics)
conf.mat.dl5mil

| | alimentos | cosmeticos | matlimpeza | Error | Rate |
|---|---|---|---|---|---|
| | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<chr>` |
| alimentos | 967 | 17 | 10 | 0.02716298 | 27 / 994 |
| cosmeticos | 13 | 934 | 28 | 0.04205128 | 41 / 975 |
| matlimpeza | 5 | 64 | 928 | 0.06920762 | 69 / 997 |
| Totals | 985 | 1015 | 966 | 0.04619016 | 137 / 2,966 |
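
Both models are scored on the same validation frame, so their error counts can be compared directly. The base-R sketch below re-enters the misclassification counts from the `Rate` columns of the two validation confusion matrices in this notebook; the names `errors`, `n_actual`, `per_class`, and `overall` are illustrative.

```r
classes <- c("alimentos", "cosmeticos", "matlimpeza")

# Misclassified rows per actual class, from the Rate columns.
errors <- rbind(
  gbm = c(99, 41, 138),
  dl  = c(27, 41, 69)
)
colnames(errors) <- classes

# Number of validation rows that actually belong to each class.
n_actual <- c(alimentos = 994, cosmeticos = 975, matlimpeza = 997)

# Per-class error rate: misclassified rows divided by actual rows of that class.
per_class <- sweep(errors, 2, n_actual, "/")
overall   <- rowSums(errors) / sum(n_actual)

round(per_class, 4)
round(overall, 4)
```

On this validation set the deep learning model roughly halves the overall error (137 versus 278 misclassified rows out of 2,966), with the gain coming from `alimentos` and `matlimpeza`; the `cosmeticos` error count is the same for both models.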


In [19]:
r2.dl.model.5mil <- dl.model@model$validation_metrics@metrics$r2
r2.dl.model.5mil