## LSH-based Graph Partition Algorithm

### 1. Abstract

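`(State the problem)`Partitioning a graph as a preprocessing step can significantly improve the efficiency of algorithm execution. Yet in large-scale graph processing, optimized graph partitioning algorithms are rarely used to preprocess the data. `(Say why it’s an interesting problem;)`The reason is plain: existing graph partitioning algorithms are mostly iterative, based on multilevel k-way partitioning or on label propagation. Although they all produce fairly good partitions, their space and time complexity limits their applicability to large-scale datasets. `(Say what your solution achieves)`The LSH-based graph partitioning method proposed in this paper has space and time complexity of $O(n)$, where $n$ is the number of vertices. `(Say what follows from your solution)`It partitions a variety of very large graph datasets quickly, and the time saved by running algorithms on the preprocessed data far exceeds the preprocessing time. Compared with other mainstream and state-of-the-art graph partitioning algorithms, performance improves by a factor of $10^1$ to $10^3$.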

### 2. Introduction

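`(Describe your problem / motivation)`Beyond algorithmic improvements, parallelism remains one of the main ways to process ever-larger datasets. For a large share of datasets and algorithms, partitioning the dataset is a prerequisite for running a parallel algorithm. Different partitioning schemes can yield drastically different efficiency; running times can differ by several orders of magnitude. For graph datasets there are already many algorithms with good performance and quality, such as Metis, based on multilevel k-way partitioning, and the label-propagation-based methods 1, 2, 3. Metis has complexity $O(n^3)$ and works very well on small datasets, but once a graph grows to about 100K edges it essentially fails to produce a result. For large-scale datasets its authors proposed an improved Parallel Metis, but measured results show ... Label-propagation-based algorithms ...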

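After the data is partitioned, the dependencies between data blocks are one of the main factors limiting the performance of a parallel algorithm. A poorly optimized partitioning algorithm creates many dependencies between blocks, which increases the communication load between compute nodes and lengthens the time a computation spends preparing the data it depends on. It also limits scaling to larger degrees of parallelism, because the added communication overhead cancels out the time saved by adding compute.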
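The dependency cost described above is commonly measured as the edge cut: the number of edges whose endpoints land in different blocks. The following is a minimal sketch on a toy graph; the function names `edge_cut` and `block_sizes` and the two example partitions are illustrative assumptions, not part of this paper.

```python
from collections import Counter


def edge_cut(edges, part):
    """Count edges whose two endpoints are assigned to different blocks."""
    return sum(1 for u, v in edges if part[u] != part[v])


def block_sizes(part):
    """Number of vertices per block, to check that a partition is balanced."""
    return Counter(part.values())


# Toy graph (assumed for illustration): two triangles joined by the edge (2, 3).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]

# Both partitions put three vertices in each block, but they cut very
# different numbers of edges.
good = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}
bad = {0: 0, 1: 1, 2: 0, 3: 1, 4: 0, 5: 1}

print("good partition:", edge_cut(edges, good), "cut edges", dict(block_sizes(good)))
print("bad partition: ", edge_cut(edges, bad), "cut edges", dict(block_sizes(bad)))
```

Each cut edge stands for data that one compute node must fetch from another, so two equally balanced partitions can still differ widely in communication cost.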

`As huge networks become abundant, there is a need for their parallel analysis. In many cases, a graph needs to be partitioned`

`(State your contributions)` This paper proposes an LSH-based graph partitioning algorithm.    
`The main contribution of this paper is a graph partitioning algorithm with low time complexity.`
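This section does not spell out the algorithm itself, so the following is only a minimal sketch of one way locality-sensitive hashing could assign vertices to blocks, not the method of this paper. It assumes integer vertex IDs and uses each vertex's closed neighbourhood (its neighbours plus itself) as the set to hash; the names `make_hash_params`, `minhash_signature`, and `lsh_partition` are hypothetical.

```python
import random
import zlib

PRIME = (1 << 61) - 1  # Mersenne prime used as the modulus of the hash family


def make_hash_params(num_hashes, seed=1):
    """Draw (a, b) pairs for hash functions h(x) = (a * x + b) mod PRIME."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME))
            for _ in range(num_hashes)]


def minhash_signature(items, params):
    """MinHash signature: the minimum hash value of the set under each function."""
    return tuple(min((a * x + b) % PRIME for x in items) for a, b in params)


def lsh_partition(adjacency, num_blocks, rows_per_band=2, seed=1):
    """Assign each vertex to a block from the MinHash of its closed neighbourhood.

    Illustrative assumption only: a single band of `rows_per_band` hash
    functions is used, and the resulting bucket key is folded onto
    `num_blocks` block ids with CRC32.
    """
    params = make_hash_params(rows_per_band, seed)
    part = {}
    for v, neighbours in adjacency.items():
        closed = set(neighbours) | {v}
        key = minhash_signature(closed, params)
        part[v] = zlib.crc32(repr(key).encode("utf8")) % num_blocks
    return part


# Toy graph (assumed for illustration): two triangles joined by the edge (2, 3).
adjacency = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3],
             3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
part = lsh_partition(adjacency, num_blocks=2)
print(part)
```

In this sketch, vertices with identical closed neighbourhoods (0 and 1, or 4 and 5) always receive the same block, and vertices with similar neighbourhoods are likely to. The work is a single pass over the adjacency lists, so its cost grows with `rows_per_band` times the total length of those lists; nothing here balances block sizes.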

### 3. Related Work

In [61]:
'''
Some examples for MinHash
'''

from datasketch import MinHash
from pprint import pprint

data1 = ['minhash', 'is', 'a', 'probabilistic', 'data', 'structure', 'for',
        'estimating', 'the', 'similarity', 'between', 'datasets']
data2 = ['minhash', 'is', 'a', 'probability', 'data', 'structure', 'for',
        'estimating', 'the', 'similarity', 'between', 'documents']

def eg1():
    m1 = MinHash()
    print("Before hash doc:")
    print(len(m1.hashvalues))
    print(len(m1.permutations[0]))
    m2 = MinHash()
    for d in data1:
        m1.update(d.encode('utf8'))
    for d in data2:
        m2.update(d.encode('utf8'))
    print("After hash doc:")
    print(len(m1.hashvalues))
    print(len(m1.permutations[0]))
    pprint(vars(m1))
    pprint(m2)
    print("Estimated Jaccard for data1 and data2 is", m1.jaccard(m2))

    s1 = set(data1)
    s2 = set(data2)
    actual_jaccard = float(len(s1.intersection(s2))) /\
            float(len(s1.union(s2)))
    print("Actual Jaccard for data1 and data2 is", actual_jaccard)

if __name__ == "__main__":
    eg1()

Before hash doc:
128
128
After hash doc:
128
128
{'hashobj': <built-in function openssl_sha1>,
 'hashvalues': array([ 297616339,  279951299,  113505080,  311917730,    1735256,
        278730948,  249258812,  306660385,  386953741,  423518424,
        120511132,  607298570,  490287863,  115094987,  290874010,
         58384851,   82568189,  483072302,  312640790,   86174351,
        198731659,  788039411,   54507159,  828911042,   93863906,
         16071831,  260431759,  316407020,  261463262, 1524825895,
        648376383,  206326676,  176707072,   18714679,  478567185,
        180270267,   89979232,  111646838,  240537181,  342142234,
        620096571, 1407834531,  330961037,  663383944, 1105899070,
        181581527,  132285593,  375422674, 1436377075,  484486034,
        252946215,   87331021,  374968398,  968098446,   72863372,
        484842735,  179471924,  672911886,   14648640,  656664915,
        417888415,  271335895,   10055390,   48164330,  430379235,
        622471011, ...
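To make the estimate printed by the cell above less of a black box, here is a minimal standard-library sketch of the MinHash idea: the fraction of signature positions on which two sets agree approximates their Jaccard similarity. It uses salted SHA-1 as a stand-in for the library's random permutations, which is an assumption made for brevity and not a copy of datasketch's implementation; the helper names are hypothetical.

```python
import hashlib


def token_hash(token, seed):
    """64-bit hash of a token, salted with a seed to simulate one permutation."""
    digest = hashlib.sha1(f"{seed}:{token}".encode("utf8")).digest()
    return int.from_bytes(digest[:8], "big")


def signature(tokens, num_perm=128):
    """For each of num_perm salted hashes, keep the minimum value over the set."""
    return [min(token_hash(t, seed) for t in tokens) for seed in range(num_perm)]


def estimate_jaccard(sig_a, sig_b):
    """Fraction of signature positions where the two minima coincide."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)


def exact_jaccard(a, b):
    """Exact Jaccard similarity |A ∩ B| / |A ∪ B|."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)


data1 = ['minhash', 'is', 'a', 'probabilistic', 'data', 'structure', 'for',
         'estimating', 'the', 'similarity', 'between', 'datasets']
data2 = ['minhash', 'is', 'a', 'probability', 'data', 'structure', 'for',
         'estimating', 'the', 'similarity', 'between', 'documents']

estimate = estimate_jaccard(signature(data1), signature(data2))
print("Estimated Jaccard:", estimate)
print("Exact Jaccard:    ", exact_jaccard(data1, data2))
```

The two lists share 10 of 14 distinct words, so the exact value is 10/14; the estimate fluctuates around it, and the fluctuation shrinks as `num_perm` grows.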

In [60]:
from pprint import pprint

from datasketch import MinHash, MinHashLSH

set1 = set(['minhash', 'is', 'a', 'probabilistic', 'data', 'structure', 'for',
            'estimating', 'the', 'similarity', 'between', 'datasets'])
set2 = set(['minhash', 'is', 'a', 'probability', 'data', 'structure', 'for',
            'estimating', 'the', 'similarity', 'between', 'documents'])
set3 = set(['minhash', 'is', 'probability', 'data', 'structure', 'for',
            'estimating', 'the', 'similarity', 'between', 'documents'])

set4 = set(['001'])
set5 = set(['112', '150'])
set6 = set(['112', '150', '450'])

m1 = MinHash(num_perm=128)
m2 = MinHash(num_perm=128)
m3 = MinHash(num_perm=128)
m4 = MinHash(num_perm=128)
m5 = MinHash(num_perm=128)
m6 = MinHash(num_perm=128)
for d in set1:
    m1.update(d.encode('utf8'))
for d in set2:
    m2.update(d.encode('utf8'))
for d in set3:
    m3.update(d.encode('utf8'))
for d in set4:
    m4.update(d.encode('utf8'))
for d in set5:
    m5.update(d.encode('utf8'))
for d in set6:
    m6.update(d.encode('utf8'))

# Create LSH index
lsh = MinHashLSH(threshold=0.3, num_perm=128)
#pprint(vars(lsh))
lsh.insert("m2", m2)
lsh.insert("m3", m3)
lsh.insert("m4", m4)
lsh.insert("m5", m5)
lsh.insert("m6", m6)
# pprint(vars(lsh.hashtables[0]))

for i in range(len(lsh.hashtables)):
    #pprint(vars(lsh.hashtables[i]))
    # Print the buckets stored in the hash table of band i.
    print(lsh.hashtables[i]._dict.values())

print(type(lsh.hashtables[0]._dict))
print(lsh.hashtables[0]._dict.values())
pprint(vars(lsh.hashtables[0]))
    
result = lsh.query(m6)
print("Approximate neighbours with Jaccard similarity > 0.3", result)

dict_values([{'m4'}, {'m3', 'm2'}, {'m5', 'm6'}])
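The buckets above put `m2` with `m3` and `m5` with `m6`, while `m4` stays alone. The standard banding analysis of MinHash LSH explains why: with `b` bands of `r` rows each, two sets with Jaccard similarity `s` share at least one bucket with probability `1 - (1 - s^r)^b`. The sketch below evaluates that formula; the split `b=32, r=4` is a hypothetical choice for 128 permutations, since the values datasketch actually picks for `threshold=0.3` are not shown here.

```python
def candidate_probability(s, b, r):
    """Probability that two sets with Jaccard similarity s collide in >= 1 band."""
    return 1.0 - (1.0 - s ** r) ** b


def approximate_threshold(b, r):
    """Similarity at which the S-curve rises most steeply, roughly (1/b)^(1/r)."""
    return (1.0 / b) ** (1.0 / r)


# Hypothetical banding for num_perm = 128 (assumption, not read from datasketch).
b, r = 32, 4

# Jaccard similarities of the pairs used in the cell above:
#   m5 vs m6: {'112','150'} and {'112','150','450'}        -> 2/3
#   m2 vs m3: set3 is set2 without the word 'a'            -> 11/12
#   m4 vs m5: no shared element                            -> 0
for name, s in [("m5/m6", 2 / 3), ("m2/m3", 11 / 12), ("m4/m5", 0.0)]:
    print(name, "similarity", round(s, 3),
          "collision probability", candidate_probability(s, b, r))

print("approximate threshold:", approximate_threshold(b, r))
```

Increasing `r` makes each band stricter and pushes the threshold up, while increasing `b` gives similar pairs more chances to collide and pushes it down.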
