Expectation–Maximization (EM) algorithm implementation in R and Python, and a comparison with K-means.

EM-algorithm - Academic project

An academic study and implementation of the expectation–maximization (EM) algorithm in Python and R, with a comparison against K-means.

To start off, clone the project:

git clone https://github.com/Samashi47/EM-algorithm.git

Then:

cd EM-algorithm

Python implementation

After cloning the project, go to the Python-implementation folder:

cd Python-implementation

Then, create your virtual environment:

Windows

py -3 -m venv .venv

macOS/Linux

python3 -m venv .venv

And, activate it:

Windows

.venv\Scripts\activate

macOS/Linux

. .venv/bin/activate

You can run the following command to install the dependencies:

pip3 install -r requirements.txt

To run the code:

  1. Select the kernel in the Jupyter notebook in the Python-implementation folder.
  2. Run the cells.
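To give a sense of what the notebook implements, here is a minimal, self-contained sketch of EM for a Gaussian mixture in NumPy. This is an illustration of the algorithm, not the repository's code; the function name `em_gmm` and the synthetic data are ours:

```python
import numpy as np

def em_gmm(X, k, n_iter=200, tol=1e-8):
    """Fit a k-component Gaussian mixture to X with EM (illustrative sketch)."""
    n, d = X.shape
    # Spread the initial means across the data, sorted along the first coordinate
    idx = np.argsort(X[:, 0])[np.linspace(0, n - 1, k).astype(int)]
    mu = X[idx].copy()                                     # (k, d) means
    cov = np.stack([np.cov(X.T) + 1e-6 * np.eye(d)] * k)   # (k, d, d) covariances
    w = np.full(k, 1.0 / k)                                # mixing probabilities
    prev_ll = -np.inf
    for _ in range(n_iter):
        # E-step: responsibilities r[i, j] proportional to w_j * N(x_i | mu_j, cov_j)
        dens = np.empty((n, k))
        for j in range(k):
            diff = X - mu[j]
            quad = np.einsum('ij,jk,ik->i', diff, np.linalg.inv(cov[j]), diff)
            norm = np.sqrt((2 * np.pi) ** d * np.linalg.det(cov[j]))
            dens[:, j] = w[j] * np.exp(-0.5 * quad) / norm
        ll = np.log(dens.sum(axis=1)).sum()   # observed-data log-likelihood
        r = dens / dens.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means, and covariances from responsibilities
        nk = r.sum(axis=0)
        w = nk / n
        mu = (r.T @ X) / nk[:, None]
        for j in range(k):
            diff = X - mu[j]
            cov[j] = (r[:, j, None] * diff).T @ diff / nk[j] + 1e-6 * np.eye(d)
        if abs(ll - prev_ll) < tol:           # stop once the log-likelihood stabilizes
            break
        prev_ll = ll
    return mu, cov, w

# Two well-separated 2-D Gaussian blobs
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.5, (100, 2)), rng.normal(5, 0.5, (100, 2))])
mu, cov, w = em_gmm(X, k=2)
```

On this data the fitted means land near (0, 0) and (5, 5), and the mixing weights sum to 1 by construction.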

R implementation

Note

This section assumes R is already installed and configured on your machine.

R Markdown doesn't require any further configuration to run in RStudio or VS Code, but for a richer experience in VS Code (live preview, generating HTML, LaTeX, and PDF output) you need a TeX distribution and Pandoc on your computer. You can install Pandoc from https://pandoc.org/installing.html

To start with the R implementation, install the required packages first. In the R console, run:

install.packages(c("plyr", "mvtnorm", "ggplot2"))

The remaining packages the code loads (base, methods, datasets, utils, grDevices, graphics, stats) ship with R and do not need to be installed.

Then you are ready to run the implementations in the .rmd files chunk by chunk.

Use

To use the implementation, you first need to initialize starting values for the means, covariances, and mixing probabilities.

  1. The mean is a matrix of dimensions (number of wanted clusters, number of columns used to generate clusters), holding a mean for each column and each cluster.
  2. The cov is a tensor of dimensions (number of columns used to generate clusters, number of columns used to generate clusters, number of wanted clusters), holding one covariance matrix between the dataset's columns per cluster.
  3. The probs is a vector of length (number of wanted clusters), holding the probability that a given data point belongs to each cluster.
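As a cross-check, the same shapes can be expressed in Python with NumPy. This is an illustrative sketch only; the random matrix here is a stand-in for the four numeric iris columns used in the R example:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(150, 4))   # stand-in for the 4 numeric iris columns

k, d = 3, X.shape[1]
# (k, d): column means plus a little noise so the cluster rows differ
mu = np.tile(X.mean(axis=0), (k, 1)) + rng.uniform(0, 0.5, size=(k, d))
# (d, d, k): one covariance matrix per cluster
cov = np.dstack([np.cov(X.T)] * k)
# (k,): mixing probabilities, summing to 1
probs = np.full(k, 1.0 / k)
```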

To do that in code, we first generate a list of means for each column, and a covariance matrix between columns:

library(plyr)

# Create starting values
Mu = daply(iris2, NULL, function(x) colMeans(x)) + runif(4, 0, 0.5)
Cov = dlply(iris2, NULL, function(x) var(x) + diag(runif(4, 0, 0.5)))
column.names <- colnames(iris2)
row.names <- c("Cluster 1", "Cluster 2", "Cluster 3")

Then we create a 2D array of means for the number of wanted clusters, with noise added so that no two rows are identical, and a tensor of covariance matrices for the number of wanted clusters:

initMu = array(
  c(Mu[1] + 0.1, Mu[1] + 0.2, Mu[1] + 0.3,
    Mu[2] + 0.1, Mu[2] + 0.2, Mu[2] + 0.3,
    Mu[3] + 0.1, Mu[3] + 0.2, Mu[3] + 0.3,
    Mu[4] + 0.1, Mu[4] + 0.2, Mu[4] + 0.4),
  dim = c(3, 4),
  dimnames = list(row.names, column.names)
)
initCov <- list('Cluster 1' = Cov[[1]], 'Cluster 2' = Cov[[1]], 'Cluster 3' = Cov[[1]])

For probabilities, we can initiate them manually:

initProbs = c(.1, .2, .7)

Or randomly, normalized so that the mixing probabilities sum to 1:

p <- runif(3, min = 0.1, max = 0.9)
initProbs = p / sum(p)

Finally, we encapsulate the initialized parameters in a variable called initParams:

initParams <- list(mu = initMu, var = initCov, probs = initProbs)

And run the algorithm with:

results = gaussmixEM(params=initParams, X=as.matrix(iris2), clusters = 3, tol=1e-10, maxits=1500, showits=T)
print(results)
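The project also compares EM against K-means. The essential difference is that K-means makes hard cluster assignments, whereas EM's E-step assigns soft responsibilities. A minimal NumPy sketch of Lloyd's K-means (ours, not the repository's code) makes the hard-assignment loop explicit:

```python
import numpy as np

def kmeans(X, k, n_iter=100):
    """Lloyd's algorithm: hard assignments, unlike EM's soft responsibilities."""
    n, _ = X.shape
    # Spread the initial centers across the data, sorted along the first coordinate
    idx = np.argsort(X[:, 0])[np.linspace(0, n - 1, k).astype(int)]
    centers = X[idx].copy()
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest center (hard 0/1 weights)
        dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        # Update step: each center becomes the mean of its assigned points
        new = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new, centers):
            break
        centers = new
    return centers, labels

# Two well-separated 2-D blobs
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 0.5, (100, 2)), rng.normal(4, 0.5, (100, 2))])
centers, labels = kmeans(X, k=2)
```

On well-separated spherical blobs like these, K-means and a Gaussian-mixture EM recover essentially the same clusters; the methods diverge when clusters are elongated, unequal in size, or overlapping, since EM models full covariances and mixing weights.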
