## Joins with Quasi-stable Coloring

In [135]:
using Distributions
using DataStructures: counter

In [136]:
n = 250
nothing #hide

Let's generate two arrays of size $n$.

In [137]:
d = Geometric()
x1 = rand(d, n)
x2 = rand(d, n)
nothing

Hash each value into one of four buckets (value modulo 4), then count how many entries land in each bucket:

In [138]:
hash1 = x1 .% 4
hash2 = x2 .% 4
agg1 = counter(hash1)
agg2 = counter(hash2)

DataStructures.Accumulator{Int64, Int64} with 4 entries:
  0 => 141
  2 => 30
  3 => 13
  1 => 66

By *Walter's method*, summing the products of matching bucket counts gives an upper bound on the join cardinality:

In [139]:
estimate_prior = sum(agg1[v] * agg2[v] for v in keys(agg1))

24973
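
Why this sum is an upper bound can be spelled out as follows. The symbols are notation introduced only for this sketch: $c_k(v)$ is the number of entries of `xk` equal to $v$, and $h$ is the hash function (here $h(v) = v \bmod 4$).

$$
|x_1 \bowtie x_2| \;=\; \sum_{v} c_1(v)\,c_2(v) \;\le\; \sum_{b} \Big(\sum_{v\,:\,h(v)=b} c_1(v)\Big)\Big(\sum_{v\,:\,h(v)=b} c_2(v)\Big) \;=\; \sum_{b} \mathtt{agg1}[b]\,\mathtt{agg2}[b]
$$

Expanding each product on the right yields every term $c_1(v)\,c_2(v)$ plus non-negative cross terms $c_1(v)\,c_2(w)$ for distinct values $v \ne w$ that share a bucket, so the bound becomes an equality when every bucket holds a single distinct value.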

The actual join cardinality, counted by brute force over all pairs, is:

In [140]:
cardinality = sum(i == j for (i, j) in Iterators.product(x1, x2))

21946
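
The brute-force count above compares all $n^2$ pairs. As a rough sketch, the same number can be obtained from per-value counts; the helper names `value_counts` and `join_cardinality` are hypothetical, and the small arrays `a` and `b` are made-up inputs rather than the notebook's data.

```julia
# Count occurrences of each value with a plain Dict (no packages needed).
function value_counts(xs)
    counts = Dict{eltype(xs), Int}()
    for x in xs
        counts[x] = get(counts, x, 0) + 1
    end
    return counts
end

# Exact equi-join size: for every value present in both arrays,
# multiply its two occurrence counts and add the products up.
function join_cardinality(a, b)
    ca, cb = value_counts(a), value_counts(b)
    return sum(c * get(cb, v, 0) for (v, c) in ca; init=0)
end

a = [0, 0, 1, 2]
b = [0, 1, 1, 3]
brute = sum(i == j for (i, j) in Iterators.product(a, b))
fast = join_cardinality(a, b)
```

Both `brute` and `fast` count the same matching pairs; the second only needs one pass over each array.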

## Graph Coloring
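
Now let's transform this into a graph coloring problem: every possible value and every array entry becomes a node, and each entry is connected to the value it holds.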

In [141]:
using Graphs
using QuasiStableColors

In [142]:
m = 1 + maximum([x1; x2])

g = Graph(n * 2 + m)

for (i, x) in enumerate(x1)
    add_edge!(g, m + i, x + 1)
end

for (i, x) in enumerate(x2)
    add_edge!(g, n + m + i, x + 1)
end

m

10
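
The node numbering used in the construction above can be made explicit with a small sketch. The helper names `value_node`, `x1_node`, and `x2_node`, and the tiny inputs `a` and `b`, are hypothetical and only illustrate the layout; no graph package is needed here.

```julia
# Node numbering (1-based), mirroring the cell above:
#   1 .. m          value nodes, value v lives at node v + 1
#   m+1 .. m+n      one node per entry of the first array
#   m+n+1 .. m+2n   one node per entry of the second array
value_node(v) = v + 1
x1_node(i, m) = m + i
x2_node(i, n, m) = n + m + i

# Tiny made-up inputs, just to list the edges that would be added.
a = [0, 2, 2]
b = [1, 2, 0]
n_small = length(a)
m_small = 1 + maximum([a; b])

edges = vcat(
    [(x1_node(i, m_small), value_node(v)) for (i, v) in enumerate(a)],
    [(x2_node(i, n_small, m_small), value_node(v)) for (i, v) in enumerate(b)],
)
```

Every edge connects an entry node to a value node, so the graph is bipartite and the highest node index is `2 * n_small + m_small`.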

In [143]:
# Initial partition: value nodes 1:m in one color, all array-entry nodes in the other.
warm_start = [collect(1:m), collect(m+1 : 2n + m)]
C = q_color(g, n_colors=4 + 1, warm_start=warm_start)

5-element Vector{Vector{Int64}}:
 [4, 5, 6, 7, 8, 9, 10]
 [11, 12, 13, 14, 15, 16, 17, 18, 19, 20  …  501, 502, 503, 504, 505, 506, 507, 508, 509, 510]
 [3]
 [1]
 [2]

In [144]:
# Map each value (value node index minus 1) to the color of its node.
color_hash = Dict{Int, Int}()
for (color, nodes) in enumerate(C)
    for x in nodes
        if x <= m
            color_hash[x - 1] = color
        end
    end
end
color_hash

Dict{Int64, Int64} with 10 entries:
  5 => 1
  4 => 1
  6 => 1
  7 => 1
  2 => 3
  0 => 4
  9 => 1
  8 => 1
  3 => 1
  1 => 5

In [145]:
hash1 = map(x -> color_hash[x], x1)
hash2 = map(x -> color_hash[x], x2)
agg1 = counter(hash1)
agg2 = counter(hash2)
estimate = sum(agg1[v] * agg2[v] for v in keys(agg1))

22425
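
Both estimates follow the same recipe with a different bucketing function. The sketch below, with hypothetical helper names `bucket_counts` and `bucket_estimate` and made-up inputs, illustrates how the choice of buckets changes the bound. The function `x -> min(x, 2)` only imitates the shape of the coloring above (frequent small values keep their own bucket, the rest are merged); it is not produced by `q_color`.

```julia
# Count how many entries fall into each bucket of the hash function h.
function bucket_counts(xs, h)
    counts = Dict{Any, Int}()
    for x in xs
        b = h(x)
        counts[b] = get(counts, b, 0) + 1
    end
    return counts
end

# Upper bound on the join size: sum over buckets of the product of bucket counts.
function bucket_estimate(a, b, h)
    ca, cb = bucket_counts(a, h), bucket_counts(b, h)
    return sum(c * get(cb, k, 0) for (k, c) in ca; init=0)
end

a = [0, 0, 1, 5]
b = [0, 1, 1, 4]

exact   = bucket_estimate(a, b, identity)        # every value is its own bucket
mod4    = bucket_estimate(a, b, x -> x % 4)      # 1 collides with 5, 0 collides with 4
grouped = bucket_estimate(a, b, x -> min(x, 2))  # 0 and 1 kept apart, larger values merged
```

On this toy input the bound tightens as frequent values stop sharing buckets, which is the effect the color-based hash aims for.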

In [157]:
"Hashing error: $(estimate_prior / cardinality * 100 - 100)%, color hashing error: $(estimate / cardinality * 100 - 100)%"

"Hashing error: 13.792946322792304%, color hashing error: 2.182630092044107%"