Skip to content

Robust Convex Clustering Algorithm Implemented in R

Notifications You must be signed in to change notification settings

statsle/Rcvxclustr

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

11 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Rcvxclustr

Robust Convex Clustering Algorithm Implemented in R


Description

This R package Rcvxclustr implements the robust convex clustering problem proposed by Liu, Sun and Tan (2019). Classical approaches towards convex clustering solves a convex optimization problem with the cost function being a squared loss plus a fusion penalty that encourages the estimated centroids for observations in the same cluster to be identical. Those approaches are not robust to arbitrary outliers, and when data are contaminated, they fail to identify the correct cluster relationships. This proposed robust convex clustering algorithm, applying Huber loss and a modified weight function, performs well in cases with outliers. It does not break down until more than half of the observations are arbitrary outliers.

In a convex clustering problem, the input is a data matrix X of dimension n×p, with n is the number of samples and p the number of features. Convex clustering algorithm estimates a centroid matrix U of the same size as X, with Xi and Xj being in the same cluster if and only if Ui=Uj.

This proposed algorithm is a modified version of that proposed by Chi and Lange (2015), implemented in the R package cvxclustr. The two methods are compared in our paper and can be reproduced.

Installation

Install Rcvxclustr from GitHub:

install.packages("devtools")
library(devtools)
devtools::install_github("statsle/Rcvxclustr")
library(Rcvxclustr)

Helper functions can be accessed by typing help(function name) in R command.

Functions

Two main functions are implemented, and other functions in the package are dependencies.

  • robust_weights: This function implements the new weight function applied in fusion penalty, proposed in our paper. It returns the lower triangle part (in the form of vector) of the n ×n weight matrix for data X with n rows (number of samples).

  • robustcvxclust: This function implements the alternating direction method of multipliers algorithm for solving the convex clustering objective function. It returns the matrix of centroid differences, i.e. matrix V, with Ui-Ui'=Vii' for all i<i'.

  • create_adjacency_matrix: This function creates the adjacency matrix A using matrix V and the weight vector.

  • find_clusters_from_adjacency: This function generates clustering results from the adjacency matrix A.

A simple example

We first library all packages we need:

library(clustRviz)
library(MASS)
library(Matrix)
library(igraph)
library(gdata)
library(Rcpp)
library(clues)
library(Rcvxclustr)
library(ggplot2)
library(RSNNS)

In the example, there are n=50 observations that belong to two distinct non-overlapping clusters. The data matrix X is generated according to the model Xi=U1i when i belongs to the first cluster, and Xi=U2i when i belongs to the second cluster. Here U1 and U2 subject to different 20-dimensional multivariate Gaussian distribution with identical covariance matrix and different means. We also some outliers to the data, making some dimensions in some samples deviate a lot from its neighbours. The data is generated as follows:

set.seed(1234)
N <- 50/2
p <- 20
mu1 <- mvrnorm(mu=rep(0,p),Sigma=diag(1,p))
X1 <- mvrnorm(N,mu=mu1,Sigma=diag(1,p))
mu2 <- mvrnorm(mu=rep(5,p),Sigma=diag(1,p))
X2 <- mvrnorm(N,mu=mu2,Sigma=diag(1,p))
X <- rbind(X1, X2)
n = dim(X)[1]
outliers <- sample(c(runif(n=5,min=20,max=50),runif(n=5,min=-50,max=-20)))
X[sample(1:(n*p),10)] <- outliers

Then create the weight vector:

wt.vec <- uni_weights(n,p)

Solve the convex clustering objective function:

H <- robustcvxclust(X,rho=1,tau=3,lambda=0.3,wt=wt.vec)

The returned H contains the final outcome of matrix U, W, Y, V, Z, the iteration time, and the final tolerance level ε.

We use the output V (first dimension equals to the length of weight vector, and second dimension equals to p) to create the adjacency matrix:

A <- create_adjacency_matrix(t(H$V),wt.vec,n)

Finally we can obtain the clustering results by

cl <- find_clusters_from_adjacency(A)$cluster

Using function adjustedRand in package clues, we can evaluate the performance of clustering:

cl_true <- c(rep(1,N),rep(2,N))
adjustedRand(cl_true,cl)

In this case, the algorithm gets all the results correctly, that is, cl equals cl_true, and the adjustedRand all equal to 1.

> cl
 [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[26] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
> adjustedRand(cl_true,cl)
   Rand      HA      MA      FM Jaccard 
      1       1       1       1       1 

The ADMM algorithm with parameter tuning for $\lambda$

We use the ADMM algorithm to solve the optimization problem over an increasing sequence of $\lambda$ values until all data points are clustered into one single cluster. To enhance computational efficiency, we employ a warm-start strategy, initializing the ADMM algorithm for a new $\lambda$ with the solution obtained from the preceding $\lambda$. For each value of $\lambda$, we perform multiple iterations of the ADMM algorithm. We report the best solution obtained along the path of increasing $\lambda$’s, measured by Hubert and Arabie’s adjusted Rand index (HA Rand index). We use the function ADMM_RCC in the package to conduct the numerical experiments in our paper.

Synthetic data

In our numerical studies, we show the robustness of our proposed method using the synthetic data generated as follows:

data.gen.mixed <- function(seed,N,p,out_form='entry',out_entry_prop,
                           noise_type='normal',outlier_type='uniform',df=NULL){
  set.seed(seed)
  
  # normal noise with uniform outliers
  if (noise_type=='normal' & outlier_type=='uniform'){
    mu1 <- rnorm(p,0,1)
    X1 <- mvrnorm(N,mu=mu1,Sigma=diag(1,p))
    mu2 <- c(rnorm(p/2,3,1),rnorm(p/2,-3,1))
    X2 <- mvrnorm(N,mu=mu2,Sigma=diag(1,p))
    X <- rbind(X1, X2)
    X0<-X #clean data
    n = dim(X)[1]
    if (out_form=='entry'){
      out_num <- as.integer(n*p*out_entry_prop)
      outliers <- runif(n=out_num,min=10,max=20)
      if (out_num > 0){
        X[sample(1:(n*p),out_num)] <- outliers
      }
    }
    else if (out_form=='row'){
      out_row <- as.integer(n*out_entry_prop)
      out_rows <- sample(1:n,out_row)
      out_num <- as.integer(p*0.2)
      for (row in out_rows){
        out_cols<-sample(1:p,out_num)
        outliers <- runif(n=out_num,min=10,max=20)
        X[row,out_cols]<-outliers
      }
    } 
    else {
      stop("The contamination form needs to be either 'entry' or 'row'.")
    }
  }
  # t noise with uniform outliers
  else if (noise_type=='t' & outlier_type=='uniform'){
    mu1 <- rnorm(p,0,1)
    X1 <- matrix(rep(mu1,each=N),nrow=N,ncol=p)+matrix(rt(n=N*p,df=df,ncp=0),nrow=N,ncol=p)
    mu2 <- c(rnorm(p/2,3,1),rnorm(p/2,-3,1))
    X2 <- matrix(rep(mu2,each=N),nrow=N,ncol=p)+matrix(rt(n=N*p,df=df,ncp=0),nrow=N,ncol=p)
    X <- rbind(X1, X2)
    X0<-X #clean data
    n = dim(X)[1]
    if (out_form=='entry'){
      out_num <- as.integer(n*p*out_entry_prop)
      outliers <- runif(n=out_num,min=10,max=20)
      if (out_num > 0){
        X[sample(1:(n*p),out_num)] <- outliers
      }
    }
    else if (out_form=='row'){
      out_row <- as.integer(n*out_entry_prop)
      out_rows <- sample(1:n,out_row)
      out_num <- as.integer(p*0.2)
      for (row in out_rows){
        out_cols<-sample(1:p,out_num)
        outliers <- runif(n=out_num,min=10,max=20)
        X[row,out_cols]<-outliers
      }
    } 
    else {
      stop("The contamination form needs to be either 'entry' or 'row'.")
    }
  }
  # normal noise with t outliers
  else if (noise_type=='normal' & outlier_type=='t'){
    mu1 <- rnorm(p,0,1)
    X1 <- mvrnorm(N,mu=mu1,Sigma=diag(1,p))
    mu2 <- c(rnorm(p/2,3,1),rnorm(p/2,-3,1))
    X2 <- mvrnorm(N,mu=mu2,Sigma=diag(1,p))
    X <- rbind(X1, X2)
    X0<-X #clean data
    n = dim(X)[1]
    if (out_form=='entry'){
      out_num <- as.integer(n*p*out_entry_prop)
      outliers <- rt(n=out_num,df=df,ncp=0)
      if (out_num > 0){
        X[sample(1:(n*p),out_num)] <- outliers
      }
    }
    else if (out_form=='row'){
      out_row <- as.integer(n*out_entry_prop)
      out_rows <- sample(1:n,out_row)
      out_num <- as.integer(p*0.2)
      for (row in out_rows){
        out_cols<-sample(1:p,out_num)
        outliers <- rt(n=out_num,df=df,ncp=0)
        X[row,out_cols]<-outliers
      }
    } 
    else {
      stop("The contamination form needs to be either 'entry' or 'row'.")
    }
  } 
  else{
    stop("The noise type or outlier type is not recognized.")
  }
  
  cl_true <- c(rep(1,N),rep(2,N))
  return (list(X0=X0,X=X,cl_true=cl_true))
}

To generate synthetic data with Gaussian noise and uniform outliers with 2% entry-wise contamination:

gen <- data.gen.mixed(seed=2024,N=20,p=20,out_form='entry',out_entry_prop=0.02,
                      noise_type='normal',outlier_type='uniform')
X <- gen$X
cl_true <- gen$cl_true

To generate synthetic data with t noise (with df=5) and uniform outliers with 10% entry-wise contamination:

gen <- data.gen.mixed(seed=2024,N=20,p=20,out_form='entry',out_entry_prop=0.1,
                      noise_type='t',outlier_type='uniform',df=5)
X <- gen$X
cl_true <- gen$cl_true

To generate synthetic data with Gaussian noise and t outliers (with df=1) with 50% row-wise contamination:

gen <- data.gen.mixed(seed=2024,N=20,p=20,out_form='row',out_entry_prop=0.5,
                      noise_type='normal',outlier_type='t',df=1)
X <- gen$X
cl_true <- gen$cl_true

Results

Using the cvxclustr method with uniform weights, we obtain:

gen <- data.gen.mixed(seed=2024,N=10,p=20,out_form='entry',out_entry_prop=0.02,
                      noise_type='normal',outlier_type='uniform')
X <- gen$X
cl_true <- gen$cl_true
result <- ADMM_RCC(X=X,phi=.1,method='cvx_uni',lam.begin=0.01,lam.step=1.05,
                   rho=1,tau=NULL,cl_true=cl_true,randmode='HA',
                   max.log=200,max.iter_cvx=100,max.iter_Rcvx=NULL)
result$cl_est
[1]  1  1  1  2  3  1  1  1  1  1  4  5  6  7  8  9 10 11 12 13

Using the proposed Rcvxclustr method with uniform weights and $\tau=0.1$, we obtain:

gen <- data.gen.mixed(seed=2024,N=10,p=20,out_form='entry',out_entry_prop=0.02,
                      noise_type='normal',outlier_type='uniform')
X <- gen$X
cl_true <- gen$cl_true
result <- ADMM_RCC(X=X,phi=.1,method='Rcvx_uni',lam.begin=0.01,lam.step=1.05,
                   rho=1,tau=0.1,cl_true=cl_true,randmode='HA',
                   max.log=200,max.iter_cvx=NULL,max.iter_Rcvx=100)
result$cl_est
[1] 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2

The path graph can be obtained from result$path, and some pictures are shown in the paper.

About

Robust Convex Clustering Algorithm Implemented in R

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published