## Prelude data

The notebook requires the following packages:
```
Pkg.update()
Pkg.add("CSV")
Pkg.add("Plots")
Pkg.add("Distributions")
Pkg.add("GaussianMixtures")
```

**NOTE:** If `PyPlot` fails to load, install the package and then run `Pkg.build("PyPlot")` from the Julia command line.

In [7]:
using CSV
using Distributions
using GaussianMixtures
using Plots; plotlyjs()

[1m[34mINFO: Precompiling module GaussianMixtures.


Plots.PlotlyJSBackend()

## Load the data
We use two-dimensional data loaded from a CSV file with two numeric columns and no header row.
Data of this kind can be generated quickly under the desired constraints with https://github.com/starcolon/data-generator
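
If the generator is not at hand, comparable test data can be sketched directly in Julia. The snippet below is only an illustration and is not how the linked tool works: the helper name `make_blobs`, the three centers, the noise level and the seed are all assumptions made up for this example, and it targets Julia 1.x (it uses the `Random` standard library).

```julia
using Random

# Hypothetical helper: draw `n_per` points around each center,
# adding isotropic Gaussian noise with standard deviation `sigma`.
function make_blobs(centers::Vector{Tuple{Float64,Float64}}, n_per::Int, sigma::Float64;
                    rng = MersenneTwister(42))
    pts = Matrix{Float32}(undef, n_per * length(centers), 2)
    row = 1
    for (cx, cy) in centers
        for _ in 1:n_per
            pts[row, 1] = cx + sigma * randn(rng)   # x coordinate
            pts[row, 2] = cy + sigma * randn(rng)   # y coordinate
            row += 1
        end
    end
    return pts
end

# Made-up centers: 3 blobs of 100 points each, 300 points in total
pts = make_blobs([(-1.0, 0.0), (1.0, 1.0), (0.0, -1.5)], 100, 0.4)
```

The resulting matrix can be written as a header-less, two-column file with `writedlm("data.csv", pts, ',')` from the `DelimitedFiles` standard library.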

In [2]:
# Read in data sources
function readCSVData(path)
    return CSV.read(path; types=[Float32,Float32], header=false)
end

data = readCSVData("data.csv")
d    = Array(data)
# Unwrap the numeric values from their Nullable wrappers
x    = map(get, d[:,1])
y    = map(get, d[:,2])

# Preview of the data
data[1:4, :]

| Row | Column1 | Column2 |
|-----|---------|---------|
| 1 | Nullable{Float32}(-0.726729) | Nullable{Float32}(0.280804) |
| 2 | Nullable{Float32}(0.791204) | Nullable{Float32}(0.525192) |
| 3 | Nullable{Float32}(0.647266) | Nullable{Float32}(1.37907) |
| 4 | Nullable{Float32}(-0.55901) | Nullable{Float32}(-1.0751) |


## Distribution of the data

In [154]:
# Plot the data points prior to clustering
Plots.scatter(x,y)

## Estimate the clusters with Gaussian Mixture Model
We assume the data points are drawn from a mixture of several Gaussian distributions, one per cluster.
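
For reference, the standard mixture density behind this assumption can be written as follows. The symbols are notation introduced here, not names from the code: $\pi_k$ are the mixing weights, $\boldsymbol{\mu}_k$ the component means, and $\boldsymbol{\Sigma}_k$ the component covariances, with $C$ the number of components.

$$
p(\mathbf{x}) = \sum_{k=1}^{C} \pi_k \, \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k),
\qquad \pi_k \ge 0, \quad \sum_{k=1}^{C} \pi_k = 1
$$

With a diagonal covariance, which is what the log output of the fit reports, each $\boldsymbol{\Sigma}_k$ reduces to $\mathrm{diag}(\sigma_{k1}^2, \sigma_{k2}^2)$ for two-dimensional data.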

In [71]:
C   = 3            # Number of clusters (mixture components)
p   = hcat(x, y)   # N×2 matrix with one sample per row
gmm = GMM(C, p)    # Initialize with k-means, then refine with EM

[1m[34mINFO: Initializing GMM, 3 Gaussians diag covariance 2 dimensions using 300 data points
[0m

  Iters               objv        objv-change | affected 
-------------------------------------------------------------
      0       9.186924e+02
      1       5.639701e+02      -3.547223e+02 |        3
      2       5.487179e+02      -1.525220e+01 |        3
      3       5.477159e+02      -1.001953e+00 |        2
      4       5.472988e+02      -4.171143e-01 |        2
      5       5.470349e+02      -2.639160e-01 |        0
      6       5.470349e+02       0.000000e+00 |        0
K-means converged with 6 iterations (objv = 547.0349)


[1m[34mINFO: K-means with 300 data points using 6 iterations
33.3 data points per parameter
[0m[1m[34mINFO: Running 10 iterations EM on diag cov GMM with 3 Gaussians in 2 dimensions
[0m[1m[34mINFO: iteration 1, average log likelihood -1.808315
[0m[1m[34mINFO: iteration 2, average log likelihood -1.803464
[0m[1m[34mINFO: iteration 3, average log likelihood -1.799670
[0m[1m[34mINFO: iteration 4, average log likelihood -1.796227
[0m[1m[34mINFO: iteration 5, average log likelihood -1.793364
[0m[1m[34mINFO: iteration 6, average log likelihood -1.791394
[0m[1m[34mINFO: iteration 7, average log likelihood -1.790349
[0m[1m[34mINFO: iteration 8, average log likelihood -1.789924
[0m[1m[34mINFO: iteration 9, average log likelihood -1.789783
[0m[1m[34mINFO: iteration 10, average log likelihood -1.789742
[0m[1m[34mINFO: EM with 300 data points 10 iterations avll -1.789742
21.4 data points per parameter
[0m

GMM{Float32} with 3 components in 2 dimensions and diag covariance
⋮


In [127]:
# Assign each sample to the GMM component with the highest posterior probability
pos    = gmmposterior(gmm, p)[1]   # N×C matrix: pos[i,c] = P(component c | sample i)
labels = [indmax(pos[i,:]) for i in 1:size(p,1)]

# Split the data points by label; each clusters[c] is an n_c×2 matrix
# whose columns are the x and y coordinates
clusters = [p[labels .== c, :] for c in 1:C]
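
The assignment step can be checked in isolation on a toy example. The sketch below uses made-up posterior values and sample coordinates rather than output from the fitted model, and relies on `findmax`, which returns a `(value, index)` pair, so the second element is the index of the most probable component.

```julia
# Toy posterior matrix (made-up numbers): one row per sample, one column per component
pos = [0.7 0.2 0.1;
       0.1 0.8 0.1;
       0.2 0.3 0.5;
       0.6 0.3 0.1]

# Matching toy samples, one 2-D point per row
pts = [0.0 0.1;
       1.0 1.1;
       2.0 2.1;
       0.2 0.3]

# Hard assignment: pick the component with the largest posterior for each sample
labels = [findmax(pos[i, :])[2] for i in 1:size(pos, 1)]

# Boolean-mask row indexing keeps whole rows, so each cluster stays an n×2 matrix
clusters = [pts[labels .== c, :] for c in 1:size(pos, 2)]
```

Selecting whole rows with a mask avoids flattening the coordinates into one vector, so the x and y columns cannot get interleaved when the cluster matrices are rebuilt.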

## Plotting of clusters

In [181]:
# Draw each cluster with an explicit call, since a loop did not work with the cumulative plot
Plots.scatter(clusters[1][:,1],clusters[1][:,2], color=:red)
Plots.scatter!(clusters[2][:,1],clusters[2][:,2], color=:blue)
Plots.scatter!(clusters[3][:,1],clusters[3][:,2], color=:green)

LoadError: UndefVarError: plt not defined
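
One possible way to build the cumulative plot in a loop is to keep an explicit plot object, add a series to it on every iteration, and make that object the last expression of the cell so the notebook displays it. The sketch below is an assumption-laden illustration: `toy_clusters` stands in for `clusters`, the name `fig` is arbitrary, and it has not been checked against the notebook's original Julia and Plots versions or against the error shown above.

```julia
using Plots

# Toy stand-ins for the per-cluster n×2 matrices (made-up values)
toy_clusters = [rand(Float32, 5, 2) .+ Float32(shift) for shift in (0, 2, 4)]
colors = [:red, :blue, :green]

fig = plot()   # empty plot that the loop adds series to
for (c, pts) in enumerate(toy_clusters)
    scatter!(fig, pts[:, 1], pts[:, 2], color = colors[c], label = "cluster $c")
end
fig            # last expression of the cell, so the notebook renders it
```

Passing the plot object to `scatter!` explicitly means the loop does not depend on which plot happens to be current.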