
R package for "Block-wise Variable Selection for Clustering via Latent States of Mixture Models"


seo-beomseok/HDclustVS


HDclustVS

This repository contains an R package for the paper "Block-wise Variable Selection for Clustering via Latent States of Mixture Models."

HDclustVS can be installed from the R console with two lines of code:

install.packages("devtools")
devtools::install_github("seo-beomseok/HDclustVS")

Quick tour of HDclustVS

Beomseok Seo

2021-04-22

1. Introduction

HDclustVS is an R package implementing two new block-wise variable selection methods for clustering, HMM-VB-VS and GMM-VB-VS. Both exploit the latent states of a mixture model: the hidden Markov model on variable blocks (HMM-VB) or the Gaussian mixture model (GMM-VB). The variable blocks are formed by early-stop-and-sorted-depth-first-search (ESS-DFS) on a dendrogram built from the mutual information between every pair of variables. Variable selection is then carried out by an independence test between the latent states and semi-clusters, which are smaller clusters that are later grouped into the final clusters. This package will be merged with HDclust on CRAN in the near future.

2. Variable block construction by ESS-DFS

Here we illustrate how to use HDclustVS for variable selection in high-dimensional clustering, using a simulated toy example with 300 samples, 100 relevant variables, 100 irrelevant variables, and 5 clusters. The data set is generated by genData2().

library(HDclustVS)
library(mclust)
# Data generation
set.seed(1)
dat = genData2(n=300,p1=100,p2=100,C=5,rep=1) 
#> [1] TRUE
#> [1] TRUE
#> [1] TRUE
#> [1] TRUE
#> [1] TRUE
X = dat[[1]]$X_total
Y = dat[[1]]$z
  
n = dim(X)[1]
p = dim(X)[2]

# EDA
Xr = prcomp(dat[[1]]$X_relev)$x[,1:2]
Xi = prcomp(dat[[1]]$X_irrel)$x[,1:2]
Xt = prcomp(dat[[1]]$X_total)$x[,1:2]
plotCls(Xr,Y,title="Relevant var.",no.legend=TRUE)
plotCls(Xi,Y,title="Irrelevant var.",no.legend=TRUE)
plotCls(Xt,Y,title="Total var.",no.legend=TRUE)

First, variable blocks are constructed by the ESS-DFS algorithm with maximum block size \(m=10\), based on the mutual information between every pair of variables. As a result, the algorithm generates \(30\) variable blocks.

# The number of clusters.
C = 5
# Maximum block size is set to 5% of the total dimension (10 of 200 variables).
max.vb.size = 10

# Calculate Mutual Information.
pwmi = pairwiseMI(X)
# Variable block construction by ESS-DFS.
vbs = constVB(X,pwmi,max.vb.size)
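
To make the idea of grouping variables by mutual information concrete, here is a minimal base-R sketch on toy data. It is not the package's implementation: it assumes a Gaussian estimate of mutual information, \(-\tfrac{1}{2}\log(1-\rho^2)\), and it uses a plain `cutree()` cut in place of ESS-DFS. The names `X_toy` and `gaussian_mi` are hypothetical, and `pairwiseMI()` and `constVB()` may work differently.

```r
# Sketch only: Gaussian-assumption pairwise MI plus a plain dendrogram cut.
# This does not reproduce HDclustVS::pairwiseMI() or the ESS-DFS search.
set.seed(42)
n_toy <- 200
z <- rnorm(n_toy)
# Variables 1-3 share the signal z; variables 4-6 are independent noise.
X_toy <- cbind(z + rnorm(n_toy, sd = 0.3),
               z + rnorm(n_toy, sd = 0.3),
               z + rnorm(n_toy, sd = 0.3),
               rnorm(n_toy), rnorm(n_toy), rnorm(n_toy))

gaussian_mi <- function(X) {
  rho <- cor(X)
  mi <- -0.5 * log(1 - rho^2)  # MI of a bivariate Gaussian with correlation rho
  diag(mi) <- NA               # self-information is infinite, so mask it
  mi
}

mi_toy <- gaussian_mi(X_toy)
# Convert the similarity into a dissimilarity and build the dendrogram.
d_toy <- as.dist(max(mi_toy, na.rm = TRUE) - mi_toy)
tree_toy <- hclust(d_toy, method = "average")
# Cut level chosen by hand for this toy example.
groups <- cutree(tree_toy, k = 4)
print(groups)  # variables 1-3 should fall into the same group
```

In this sketch, strongly dependent variables merge low in the dendrogram, which is the property a block-construction procedure can exploit.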

3. Fitting HMM-VB or GMM-VB

Next, an HMM-VB with \(C=5\) components for each block is fitted. The number of components is set to the true number of clusters, which we assume to be known.

# Fitting HMM-VB with 5 components.
fit = fitHmmvb(X,C,vbs)
# To use GMM-VB instead, call fit = fitGmmvb(X,C,vbs).

4. Variable block selection by an independence test between semi-clusters and latent states of variable blocks

Semi-clusters are computed by dendrogram clustering of the estimated MAP state sequences under a chosen distance measure. Then, to select variable blocks, a bimodality test is applied to the normalized mutual information (NMI) between the latent states of each variable block and the semi-cluster labels.

# Semi-clusters
semi.cls = semicls(X,fit)
# Variable block selection by a bimodality test.
chosen.vb = semi.cls$chosen.vb

# Reduce the model structure
red.dat = reduceVB(X,fit,chosen.vb)
red.X = red.dat$X
red.vbs = red.dat$vbs
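
As an illustration of the NMI statistic mentioned above, the following base-R sketch computes the normalized mutual information between two label vectors. The function name `nmi` is hypothetical, and normalizing by the geometric mean of the two entropies is an assumption; the package may use a different normalization.

```r
# Sketch only: NMI between two label vectors, normalized by sqrt(H(a) * H(b)).
# The normalization is an assumption, not taken from HDclustVS.
nmi <- function(a, b) {
  joint <- unclass(table(a, b)) / length(a)  # joint label distribution
  pa <- rowSums(joint)                       # marginal of a
  pb <- colSums(joint)                       # marginal of b
  entropy <- function(p) {
    p <- p[p > 0]
    -sum(p * log(p))
  }
  ha <- entropy(pa)
  hb <- entropy(pb)
  if (ha == 0 || hb == 0) return(0)          # a constant labeling carries no information
  nz <- joint > 0
  mi <- sum(joint[nz] * log(joint[nz] / outer(pa, pb)[nz]))
  mi / sqrt(ha * hb)
}

# Identical partitions (up to relabeling) versus independent partitions.
nmi_same  <- nmi(c(1, 1, 2, 2), c("a", "a", "b", "b"))
nmi_indep <- nmi(c(1, 1, 2, 2), c(1, 2, 1, 2))
print(c(nmi_same, nmi_indep))
```

A block whose latent states track the semi-cluster labels gives an NMI near 1, while a block unrelated to them gives an NMI near 0.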

5. Retraining of HMM-VB with the dimension-reduced data and finding final clusters

Now an HMM-VB is refitted on the reduced variable blocks, and the final clusters are computed by dendrogram clustering of the re-estimated MAP state sequences with the desired number of clusters.

# Re-estimation of the HMM-VB model with reduced dimensions
re.fit = fitHmmvb(red.X,C,red.vbs)

# Final clustering
final.cls = finalcls(red.dat$X,re.fit,C,5)
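
Dendrogram clustering of state sequences can be sketched in base R as follows. This is a toy illustration, not the internals of `semicls()` or `finalcls()`: the Hamming distance, the average linkage, and the names `toy_states`, `hamming_dist`, and `toy_cls` are all assumptions made for the example.

```r
# Sketch only: cluster samples by their state sequences across variable blocks.
# Hamming distance and average linkage are assumptions made for illustration.
# One row per sample, one column per variable block.
toy_states <- rbind(
  c(1, 1, 2),
  c(1, 1, 2),
  c(1, 2, 2),
  c(3, 3, 1),
  c(3, 3, 1),
  c(3, 2, 1)
)

hamming_dist <- function(S) {
  n <- nrow(S)
  D <- matrix(0, n, n)
  for (i in seq_len(n)) {
    for (j in seq_len(n)) {
      D[i, j] <- mean(S[i, ] != S[j, ])  # fraction of blocks whose states differ
    }
  }
  as.dist(D)
}

seq_tree <- hclust(hamming_dist(toy_states), method = "average")
toy_cls <- cutree(seq_tree, k = 2)  # desired number of clusters
print(toy_cls)
```

Samples whose state sequences agree on most blocks merge early in the dendrogram, so cutting the tree at the desired number of clusters groups similar sequences together.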

For this simulated data set, HMM-VB-VS achieves high clustering accuracy, measured by the adjusted Rand index (ARI) and the Wasserstein distance (WD), as well as perfect variable selection accuracy, measured by the F-score.

library(mclust)
library(OTclust)
# The clustering and variable selection accuracy.
ARI = adjustedRandIndex(Y,final.cls)
print(ARI)
#> [1] 0.9506484

WD = wassDist(Y,final.cls)
print(WD)
#> [1] 0.06529856

# Dimension reduction rate: fraction of variables removed by the selection step.
DRR = 1-dim(red.X)[2]/dim(X)[2]
print(DRR)
#> [1] 0.5

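# Indices of the variables in the selected blocks.
chosen.v = unlist(fit$vbs$vb[chosen.vb])
# In this simulation, variables 1-100 are relevant and 101-200 are irrelevant.
tp = sum(chosen.v<=100)
fp = sum(chosen.v>100)
tn = sum((1:p)[-chosen.v]>100)
fn = sum((1:p)[-chosen.v]<=100)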
  
precision = tp/(tp+fp)
print(precision)
#> [1] 1

recall = tp/(tp+fn)
print(recall)
#> [1] 1

F1 = 2/(1/precision+1/recall)
print(F1)
#> [1] 1
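
The precision, recall, and F1 computation above can be wrapped in a small helper for reuse with other data sets. The sketch below uses a hypothetical function `selection_metrics` that takes the selected indices, the truly relevant indices, and the total dimension. It relies on set operations rather than negative indexing, because `(1:p)[-chosen.v]` returns an empty vector when `chosen.v` is empty.

```r
# Sketch only: variable selection accuracy from index sets.
# `selection_metrics` is a hypothetical helper, not part of HDclustVS.
selection_metrics <- function(chosen, relevant, p) {
  tp <- length(intersect(chosen, relevant))  # relevant and selected
  fp <- length(setdiff(chosen, relevant))    # irrelevant but selected
  fn <- length(setdiff(relevant, chosen))    # relevant but missed
  tn <- p - tp - fp - fn                     # irrelevant and not selected
  precision <- if (tp + fp > 0) tp / (tp + fp) else NA_real_
  recall    <- if (tp + fn > 0) tp / (tp + fn) else NA_real_
  f1 <- if (is.na(precision) || is.na(recall) || precision + recall == 0) {
    NA_real_
  } else {
    2 * precision * recall / (precision + recall)
  }
  c(tp = tp, fp = fp, tn = tn, fn = fn,
    precision = precision, recall = recall, F1 = f1)
}

# Toy check: 90 of 100 relevant variables selected, plus 10 irrelevant ones.
m <- selection_metrics(chosen = c(1:90, 101:110), relevant = 1:100, p = 200)
print(m)
```

With the objects from this tour, the call would be `selection_metrics(chosen.v, 1:100, p)`, assuming the first 100 columns are the relevant variables as in the simulation.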
