## Clustering and collaborative filtering (via clustering) algorithms

- [Importable source code (most up-to-date version)](https://github.com/sylvaticus/Bmlt.jl/blob/master/src/clusters.jl) - [Julia Package](https://github.com/sylvaticus/Bmlt.jl)
- [Demonstrative static notebook](https://github.com/sylvaticus/Bmlt.jl/blob/master/notebooks/Clustering.ipynb)
- [Demonstrative live notebook](https://mybinder.org/v2/gh/sylvaticus/Bmlt.jl/master?filepath=notebooks%2FClustering.ipynb) (temporary personal online computational environment on myBinder) - it can take a few minutes to start!
- Theory based on [MITx 6.86x - Machine Learning with Python: from Linear Models to Deep Learning](https://github.com/sylvaticus/MITx_6.86x) ([Unit 4](https://github.com/sylvaticus/MITx_6.86x/blob/master/Unit%2004%20-%20Unsupervised%20Learning/Unit%2004%20-%20Unsupervised%20Learning.md))
- New to Julia? [A concise Julia tutorial](https://github.com/sylvaticus/juliatutorial) - [Julia Quick Syntax Reference book](https://julia-book.com)

In [3]:
using Pkg
# Base.find_package returns `nothing` when the package cannot be loaded from the
# current environment (Pkg.installed() is deprecated since Julia 1.4)
if Base.find_package("Distributions") === nothing
    Pkg.add("Distributions")
end
using LinearAlgebra
using Random
using Distributions
using Statistics
using DelimitedFiles
using Bmlt.Clustering


In [4]:
K = 3
X = [1 10.5;1.5 10.8; 1.8 8; 1.7 15; 3.2 40; 3.6 32; 3.3 38; 5.1 -2.3; 5.2 -2.4]

9×2 Array{Float64,2}:
 1.0  10.5
 1.5  10.8
 1.8   8.0
 1.7  15.0
 3.2  40.0
 3.6  32.0
 3.3  38.0
 5.1  -2.3
 5.2  -2.4

In [8]:
Z₀ = initRepresentatives([1 10.5;1.5 10.8; 1.8 8; 1.7 15; 3.2 40; 3.6 32; 3.6 38],2,initStrategy="grid")

2×2 Array{Float64,2}:
 1.0   8.0
 3.6  40.0
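
The output above is consistent with a "grid" strategy that spreads the representatives evenly between the minimum and maximum of each dimension. A minimal sketch of that idea follows; `gridRepresentatives` is a hypothetical helper written for illustration, and the even-spacing rule is an assumption inferred from the output, not taken from the Bmlt source.

```julia
# Hypothetical helper (not part of Bmlt): place K representatives evenly
# between the per-dimension minimum and maximum of the data.
# Assumes K >= 2, so that both endpoints are used.
function gridRepresentatives(X, K)
    D = size(X, 2)
    Z = zeros(K, D)
    for d in 1:D
        Z[:, d] = collect(range(minimum(X[:, d]), stop=maximum(X[:, d]), length=K))
    end
    return Z
end

Xgrid = [1 10.5; 1.5 10.8; 1.8 8; 1.7 15; 3.2 40; 3.6 32; 3.6 38]
Zgrid = gridRepresentatives(Xgrid, 2)
```

With `K = 2` the two rows are simply the per-column minimum and maximum, which is what `initRepresentatives` returned above.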

In [9]:
(clIdx,Z) = kmeans([1 10.5;1.5 10.8; 1.8 8; 1.7 15; 3.2 40; 3.6 32; 3.3 38; 5.1 -2.3; 5.2 -2.4],3)

([2, 2, 2, 2, 3, 3, 3, 1, 1], [5.15 -2.3499999999999996; 1.5 11.075; 3.366666666666667 36.666666666666664])
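
To make the result above easier to read, here is a minimal sketch of the classic k-means (Lloyd) iteration: assign each point to its nearest representative, then move each representative to the mean of its points. `simpleKmeans` and the starting points `Z₀demo` are hypothetical and chosen by hand for illustration; this is not the Bmlt implementation, whose initialisation and stopping rule may differ.

```julia
using Statistics, LinearAlgebra

# Hypothetical helper (not Bmlt's implementation): plain Lloyd iterations
# starting from the given representatives Z₀ (one row per cluster).
function simpleKmeans(X, Z₀; maxIter=100)
    N, K = size(X, 1), size(Z₀, 1)
    Z = float.(copy(Z₀))
    idx = zeros(Int, N)
    for _ in 1:maxIter
        # Assignment step: each point goes to its nearest representative
        newIdx = [argmin([norm(X[i, :] - Z[k, :]) for k in 1:K]) for i in 1:N]
        newIdx == idx && break      # assignments unchanged: converged
        idx = newIdx
        # Update step: each representative becomes the mean of its points
        for k in 1:K
            members = X[idx .== k, :]
            size(members, 1) > 0 && (Z[k, :] = vec(mean(members, dims=1)))
        end
    end
    return idx, Z
end

Xkm = [1 10.5; 1.5 10.8; 1.8 8; 1.7 15; 3.2 40; 3.6 32; 3.3 38; 5.1 -2.3; 5.2 -2.4]
Z₀demo = [5.0 0.0; 1.5 10.0; 3.0 35.0]   # hand-picked starting points (assumption)
idxKm, Zkm = simpleKmeans(Xkm, Z₀demo)
```

With these starting points the sketch converges to the same assignments and centroids as the `kmeans` call above. Other starting points may permute the cluster labels or end in a different local optimum.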

In [10]:
(clIdx,Z) = kmedoids([1 10.5;1.5 10.8; 1.8 8; 1.7 15; 3.2 40; 3.6 32; 3.3 38; 5.1 -2.3; 5.2 -2.4],3,dist = (x,y) -> norm(x-y)^2,initStrategy="grid")

([2, 2, 2, 2, 3, 3, 3, 1, 1], [5.1 -2.3; 1.5 10.8; 3.3 38.0])
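
Unlike k-means, k-medoids restricts each representative to be one of the data points. A minimal sketch of the update step under that definition: within a cluster, pick the member with the smallest total distance to all other members. `medoid` is a hypothetical helper, not a Bmlt function, and it reuses the squared Euclidean distance passed as `dist` above.

```julia
using LinearAlgebra

# Hypothetical helper (not part of Bmlt): return the member row that
# minimises the summed distance to all members of the cluster.
function medoid(members; dist=(x, y) -> norm(x - y)^2)
    n = size(members, 1)
    costs = [sum(dist(members[i, :], members[j, :]) for j in 1:n) for i in 1:n]
    return members[argmin(costs), :]
end

clusterDemo = [1 10.5; 1.5 10.8; 1.8 8; 1.7 15]   # the rows assigned to cluster 2 above
mDemo = medoid(clusterDemo)
```

For these four rows the selected medoid is `[1.5, 10.8]`, the second representative in the `kmedoids` output.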

In [12]:
clusters = emGMM([1 10.5;1.5 10.8; 1.8 8; 1.7 15; 3.2 40; 3.6 32; 3.3 38; 5.1 -2.3; 5.2 -2.4],3,msgStep=1)
clusters.pⱼₓ

Iter. 1:	Var. of the post  2.7798403823788407 	  Log-likelihood -62.16435618142972
Iter. 2:	Var. of the post  0.5606080950482362 	  Log-likelihood -51.82785452710985
Iter. 3:	Var. of the post  0.3047407377931759 	  Log-likelihood -47.21642564372429
Iter. 4:	Var. of the post  0.003227835533034755 	  Log-likelihood -40.26621217189606
Iter. 5:	Var. of the post  5.609004006879284e-16 	  Log-likelihood -40.26558370139746
Iter. 6:	Var. of the post  1.7801324102294862e-27 	  Log-likelihood -40.26558370139746


9×3 Array{Float64,2}:
 2.90709e-158  1.0          6.00753e-27
 1.10092e-161  1.0          2.5648e-26
 4.57484e-102  1.0          2.31584e-31
 1.12191e-270  1.0          9.14102e-18
 0.0           8.60227e-57  1.0
 0.0           1.7555e-29   1.0
 0.0           1.33625e-49  1.0
 1.0           1.59504e-14  6.0099e-59
 1.0           9.36404e-15  2.97135e-59
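
Each row of the posterior matrix holds the probabilities of one observation belonging to each of the three mixture components. When a hard assignment is needed, one common choice is to take the most probable component per row. The sketch below does this on three rows copied from the output above; `Pdemo` and `hardDemo` are illustrative names only.

```julia
# Three rows copied from the posterior matrix printed above
Pdemo = [2.90709e-158 1.0         6.00753e-27;
         0.0          8.60227e-57 1.0;
         1.0          1.59504e-14 6.0099e-59]

# Index of the most probable component for each observation
hardDemo = [argmax(Pdemo[i, :]) for i in 1:size(Pdemo, 1)]
```

Applied to the full 9×3 matrix, the same rule gives the grouping found by `kmeans` and `kmedoids` earlier (rows 1 to 4, rows 5 to 7, rows 8 and 9).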

In [13]:
cd(@__DIR__)
K = [1,12]
seeds = [0,1,2,3,4]

5-element Array{Int64,1}:
 0
 1
 2
 3
 4

In [14]:
# Test data
baseDir = "assets/netflix/toy_data/"
X = readdlm(joinpath(baseDir,"toy_data.txt"))
for k in K
    ulL = -Inf
    bestSeed = -1
    bestOut = nothing
    for s in seeds
        println("[INFO] Working with (k,seed) = ($(k), $(s))")
        μ₀ = readdlm(joinpath(baseDir,"init_mu_$(k)_$(s).csv"), ' ')
        σ²₀ = dropdims(readdlm(joinpath(baseDir,"init_var_$(k)_$(s).csv"), ' '),dims=2)
        p₀ = dropdims(readdlm(joinpath(baseDir,"init_p_$(k)_$(s).csv"), ' '),dims=2)
        emOut = emGMM(X,k;p₀=p₀,μ₀=μ₀,σ²₀=σ²₀,msgStep=0,missingValue=0)
        lL  = emOut.lL
        if lL > ulL
            ulL = lL
            bestSeed = s
            bestOut = emOut
        end
    end
    println("Upper logLikelihood with $(k) clusters: $(ulL) (seed $(bestSeed))")
end

[INFO] Working with (k,seed) = (1, 0)
[INFO] Working with (k,seed) = (1, 1)
[INFO] Working with (k,seed) = (1, 2)
[INFO] Working with (k,seed) = (1, 3)
[INFO] Working with (k,seed) = (1, 4)
Upper logLikelihood with 1 clusters: -1307.2234317600933 (seed 0)
[INFO] Working with (k,seed) = (12, 0)
[INFO] Working with (k,seed) = (12, 1)
[INFO] Working with (k,seed) = (12, 2)
[INFO] Working with (k,seed) = (12, 3)
[INFO] Working with (k,seed) = (12, 4)
Upper logLikelihood with 12 clusters: -1118.6190434326675 (seed 2)


In [18]:
# Full Netflix dataset: the cells below may take a long time to run
baseDir = "assets/netflix/full/"
X = convert(Array{Int64,2},readdlm(joinpath(baseDir,"netflix_incomplete.txt")))

1200×1200 Array{Int64,2}:
 2  4  5  0  0  3  5  0  4  2  0  4  3  …  5  4  4  0  0  4  0  5  4  4  4  4
 3  5  5  3  4  3  5  4  2  4  4  4  3     4  4  3  4  2  4  4  3  3  5  3  4
 2  0  4  3  3  1  3  3  3  3  0  3  3     2  3  3  4  0  0  0  3  2  4  4  3
 4  3  4  4  5  2  4  5  4  4  3  5  4     4  4  3  4  5  4  5  4  4  4  4  4
 2  2  5  4  0  1  5  0  5  3  3  1  0     2  2  1  4  0  3  2  5  0  4  0  0
 3  0  0  4  4  4  4  4  5  0  1  4  4  …  2  3  2  3  5  4  4  5  4  2  5  4
 1  4  5  4  5  5  4  4  4  5  4  4  3     5  5  5  4  3  4  4  5  5  4  5  4
 2  0  5  4  5  1  5  2  3  3  4  4  4     5  3  3  3  2  3  4  4  2  5  3  3
 3  5  0  5  4  0  4  4  4  4  3  0  5     3  0  0  3  5  4  4  4  0  2  5  5
 0  0  0  0  3  5  0  0  4  0  0  0  0     4  0  0  0  0  0  0  0  0  0  5  5
 3  0  5  3  3  0  0  4  0  3  0  5  0  …  2  0  3  3  0  0  4  0  4  0  3  0
 2  4  5  0  4  0  4  2  4  2  0  3  0     5  3  3  0  0  0  4  4  4  4  3  0
 0  4  5  0  3  3  5  0  4  0  0  4  2

In [15]:
for k in K
    ulL = -Inf
    bestSeed = -1
    bestOut = nothing
    for s in seeds
        println("[INFO] Working with (k,seed) = ($(k), $(s))")
        μ₀  = readdlm(joinpath(baseDir,"init_mu_$(k)_$(s).csv"), ' ')
        σ²₀ = dropdims(readdlm(joinpath(baseDir,"init_var_$(k)_$(s).csv"), ' '),dims=2)
        p₀  = dropdims(readdlm(joinpath(baseDir,"init_p_$(k)_$(s).csv"), ' '),dims=2)
        emOut = emGMM(X,k;p₀=p₀,μ₀=μ₀,σ²₀=σ²₀,msgStep=0,missingValue=0)
        lL  = emOut.lL
        if lL > ulL
            ulL = lL
            bestSeed = s
            bestOut = emOut
        end
    end
    println("Upper logLikelihood with $(k) clusters: $(ulL) (seed $(bestSeed))")
end

[INFO] Working with (k,seed) = (1, 0)
[INFO] Working with (k,seed) = (1, 1)
[INFO] Working with (k,seed) = (1, 2)
[INFO] Working with (k,seed) = (1, 3)
[INFO] Working with (k,seed) = (1, 4)
Upper logLikelihood with 1 clusters: -1.5210609539852452e6 (seed 0)
[INFO] Working with (k,seed) = (12, 0)
[INFO] Working with (k,seed) = (12, 1)
[INFO] Working with (k,seed) = (12, 2)
[INFO] Working with (k,seed) = (12, 3)
[INFO] Working with (k,seed) = (12, 4)
Upper logLikelihood with 12 clusters: -1.3902809991574623e6 (seed 1)


In [16]:
X = [1 10.5;1.5 0; 1.8 8; 1.7 15; 3.2 40; 0 0; 3.3 38; 0 -2.3; 5.2 -2.4]

9×2 Array{Float64,2}:
 1.0  10.5
 1.5   0.0
 1.8   8.0
 1.7  15.0
 3.2  40.0
 0.0   0.0
 3.3  38.0
 0.0  -2.3
 5.2  -2.4

In [17]:
cFOut = collFilteringGMM(X,3,msgStep=1,missingValue=0)
cFOut.X̂

Iter. 1:	Var. of the post  2.61937747932065 	  Log-likelihood -47.59140596017498
Iter. 2:	Var. of the post  0.5226030386857065 	  Log-likelihood -34.55184066668723
Iter. 3:	Var. of the post  0.3500981768393402 	  Log-likelihood -32.92185047653772
Iter. 4:	Var. of the post  0.32940171779360017 	  Log-likelihood -30.01085600946215
Iter. 5:	Var. of the post  0.05092179105118827 	  Log-likelihood -27.686896657600293
Iter. 6:	Var. of the post  0.01144416282455234 	  Log-likelihood -27.681990100476558
Iter. 7:	Var. of the post  0.004605091358874689 	  Log-likelihood -27.681832719530703
Iter. 8:	Var. of the post  0.0022110716618263934 	  Log-likelihood -27.68179603140188
Iter. 9:	Var. of the post  0.0010765120575048945 	  Log-likelihood -27.68178722759999


9×2 Array{Float64,2}:
 1.0     10.5
 1.5     14.1779
 1.8      8.0
 1.7     15.0
 3.2     40.0
 2.8627  15.1255
 3.3     38.0
 5.2     -2.3
 5.2     -2.4
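
In the output above, observed entries are left untouched and only the entries marked as missing (`0`) are replaced. One common way to obtain such predictions from a fitted mixture is to use the posterior-weighted average of the cluster means. The sketch below illustrates that idea with made-up posteriors and means; `fillMissing` is a hypothetical helper, and whether `collFilteringGMM` uses exactly this rule is an assumption.

```julia
# Hypothetical helper (not Bmlt's implementation): replace entries equal to
# `missingValue` with the posterior-weighted average of the cluster means.
function fillMissing(X, P, μ; missingValue=0)
    Xfilled = float.(copy(X))
    expected = P * μ                  # (N×K) * (K×D): expected value per cell
    mask = X .== missingValue         # true where an entry is missing
    Xfilled[mask] = expected[mask]
    return Xfilled
end

Xcf = [1.0 10.0; 0.0 12.0; 5.0 0.0]   # 0 marks a missing entry
Pcf = [1.0 0.0; 0.5 0.5; 0.0 1.0]     # made-up posteriors p(j|i), one row per observation
μcf = [1.0 11.0; 5.0 -2.0]            # made-up cluster means, one row per cluster
Xout = fillMissing(Xcf, Pcf, μcf)
```

Using `0` as the missing marker means a genuine rating of zero cannot be represented, which is acceptable for rating data on a 1 to 5 scale such as the Netflix matrix used earlier.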