Binary classification

ivan-pavlov edited this page Mar 17, 2019 · 1 revision

We will try to classify a tumour as benign or malignant using its histological characteristics.

library(rvw)
library(mlbench) # For a dataset

data("BreastCancer", package = "mlbench")
data_full <- BreastCancer

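First, preprocess the data: drop rows with missing values and draw the row indices for an 80/20 train/test split.

# Keep only rows without missing values
data_full <- data_full[complete.cases(data_full),]
# Randomly pick 80% of the rows for training (no seed is set, so the split differs between runs)
ind_train <- sample(seq_len(nrow(data_full)), floor(0.8*nrow(data_full)))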

summary(data_full)
#>      Id             Cl.thickness   Cell.size     Cell.shape  Marg.adhesion  Epith.c.size  Bare.nuclei   Bl.cromatin  Normal.nucleoli    Mitoses   
#> Length:683         1      :139   1      :373   1      :346   1      :393   2      :376   1      :402   3      :161   1      :432     1      :563  
#> Class :character   5      :128   10     : 67   2      : 58   2      : 58   3      : 71   10     :132   2      :160   10     : 60     2      : 35  
#> Mode  :character   3      :104   3      : 52   10     : 58   3      : 58   4      : 48   2      : 30   1      :150   3      : 42     3      : 33  
#>                    4      : 79   2      : 45   3      : 53   10     : 55   1      : 44   5      : 30   7      : 71   2      : 36     10     : 14  
#>                    10     : 69   4      : 38   4      : 43   4      : 33   6      : 40   3      : 28   4      : 39   8      : 23     4      : 12  
#>                    2      : 50   5      : 30   5      : 32   8      : 25   5      : 39   8      : 21   5      : 34   6      : 22     7      :  9  
#>                    (Other):114   (Other): 78   (Other): 93   (Other): 61   (Other): 65   (Other): 40   (Other): 68   (Other): 68     (Other): 17  
#>       Class    
#> benign   :444  
#> malignant:239  

We can see that "benign" cases appear more often in our dataset (444 of 683, about 65%). This will be used to set up a baseline model.
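
The class balance can also be computed directly rather than read off the `summary()` printout. A minimal sketch, which rebuilds a stand-in vector `cls` (a hypothetical name) from the class counts shown above so that it runs without `mlbench`:

```r
# Stand-in for data_full$Class, rebuilt from the counts in the summary above
cls <- factor(rep(c("benign", "malignant"), times = c(444, 239)))

# Share of each class; the majority share equals the accuracy of an
# "always predict the majority class" baseline on the full dataset
class_share <- prop.table(table(cls))
round(class_share, 3)

# Name of the most frequent class
majority_class <- names(which.max(class_share))
majority_class
```

On the tutorial's own data, `prop.table(table(data_full$Class))` should give the same shares once incomplete rows have been removed.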

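# Drop the Id column, which carries no predictive information
data_full <- data_full[,-1]
# Encode the label as +1 (malignant) / -1 (benign)
data_full$Class <- ifelse(data_full$Class == "malignant", 1, -1)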

data_train <- data_full[ind_train,]
data_test <- data_full[-ind_train,]
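
Because `sample()` draws a different split on every run, accuracy figures computed on the test set will vary slightly between runs. One possible way to make the split reproducible is to fix the random seed before sampling. The sketch below uses a toy data frame `toy` and an arbitrary seed value; both are assumptions for illustration and are not part of the original tutorial.

```r
set.seed(42)  # arbitrary seed (an assumption); any fixed value works

# Toy stand-in for data_full
toy <- data.frame(x = rnorm(100), Class = rep(c(-1, 1), each = 50))

# Draw 80% of the row indices for training
n <- nrow(toy)
ind_train <- sample(seq_len(n), size = floor(0.8 * n))

toy_train <- toy[ind_train, ]
toy_test  <- toy[-ind_train, ]

# The two parts are disjoint and together cover every row
c(train = nrow(toy_train), test = nrow(toy_test))
```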

Our baseline model simply predicts that every tumour is benign.

baseline_pred <- rep(-1, length(data_test$Class))

# Accuracy (in percent) for the binary classification case
acc_prc <- function(y_pred, y_true){sum(y_pred == y_true) / length(y_pred) * 100}

# Predictions first, true labels second, matching the function signature
acc_prc(baseline_pred, data_test$Class)
#> [1] 64.9635

With our baseline model, we get an accuracy of around 65%.

Now we are ready to use Vowpal Wabbit models.

test_vwmodel <-  vwsetup(dir = "./", model = "mdl.vw",
                         option = "binary") # Convert predictions to {-1,+1}

Basic training and testing

vwtrain(vwmodel = test_vwmodel,
        data = data_train,
        passes = 10,
        targets = "Class",
        quiet = TRUE)

vw_output <- vwtest(vwmodel = test_vwmodel, data = data_test, quiet = TRUE)

# Predictions first, true labels second, matching the function signature
acc_prc(vw_output, data_test$Class)
#> [1] 97.08029

Now we get much better results, with an accuracy of around 97%.
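
Accuracy alone does not show which kind of error a model makes; the baseline, for instance, misses every malignant case. A confusion table makes this visible. The sketch below uses small made-up vectors `y_true` and `y_pred` (hypothetical values, not results from the model above) together with base R's `table()`:

```r
# Made-up labels in the same {-1, +1} encoding (illustration only)
y_true <- c(-1, -1, -1, -1, 1, 1,  1, -1, 1, -1)
y_pred <- c(-1, -1, -1,  1, 1, 1, -1, -1, 1, -1)

# Fix the factor levels so both classes always appear in the table,
# even if one of them is never predicted
lv <- c(-1, 1)
conf_mat <- table(predicted = factor(y_pred, levels = lv),
                  actual    = factor(y_true, levels = lv))
conf_mat

# Share of actual positives (malignant, +1) that were detected
recall_pos <- conf_mat["1", "1"] / sum(conf_mat[, "1"])
recall_pos
```

With the tutorial's objects, the same table could be built from `vw_output` and `data_test$Class`, assuming `vw_output` holds -1/+1 values as the `binary` option suggests.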
