
BUG: spectral clustering always crashes with > 100 eigenvectors #710

Open

joelb92 opened this issue Feb 10, 2020 · 2 comments

@joelb92 commented Feb 10, 2020

Describe the bug
I am attempting to partition a large sparse graph using cugraph's new spectralModularityMaximizationClustering method (this bug can also be reproduced with the balanced-cut version).

The graph I am clustering is roughly 100k x 100k, with 444429 edges (quite sparse). No matter what I set the partition count to (5000 in my case), I cannot set the num_eigen_vects parameter above 100 without a crash.

To Reproduce
Steps to reproduce the behavior:

Graph building:

import os
import json
import numpy as np
from joblib import Parallel, delayed
from scipy.sparse import lil_matrix
from tqdm import tqdm

def getedges(f):
    # imagenames, imagepaths, imagedb, and jsonfiles are built elsewhere
    fname = os.path.basename(f)[:-5]  # get rid of .json suffix
    with open(f, 'r') as fp:
        results = json.load(fp)
    edges = []
    for r in results:
        score = r['searchscore']
        rname = os.path.basename(r['source'])
        row = np.where(imagenames == fname)[0][0]
        column = np.where(imagenames == rname)[0]
        if len(column) > 1:  # ambiguous name, fall back to the full path
            column = np.where(imagepaths == r['source'])[0]
        if len(column) > 0:
            column = column[0]
            edges.append((row, column, score))
    return edges

edgelist = Parallel(n_jobs=20)(delayed(getedges)(f) for f in tqdm(jsonfiles))
adj_mat = lil_matrix((len(imagedb), len(imagedb)), dtype='float32')

for edges in tqdm(edgelist):
    for row, column, score in edges:
        adj_mat[row, column] = score
adj_mat = adj_mat.tocsr()

Converting to cugraph:

row_offsets = cudf.Series(adj_mat.indptr)
col_indices = cudf.Series(adj_mat.indices)
values = cudf.Series(adj_mat.data)
G = cugraph.Graph()
G.add_adj_list(row_offsets, col_indices, values)

Performing spectral clustering:

p = 5000
df = cugraph.spectralModularityMaximizationClustering(G, p, num_eigen_vects=101)  # crashes for any value > 100
score = cugraph.analyzeClustering_edge_cut(G, p, df['cluster'])
print(score)

Error:

GDFError                                  Traceback (most recent call last)
<ipython-input-15-9d098fb37fd7> in <module>
      1 p = 5000
----> 2 df = cugraph.spectralModularityMaximizationClustering(G, p, num_eigen_vects=101)
      3 score = cugraph.analyzeClustering_edge_cut(G, p, df['cluster'])
      4 print(score)

cugraph/spectral_clustering/spectral_clustering.pyx in cugraph.spectralModularityMaximizationClustering()

cugraph/spectral_clustering/spectral_clustering.pyx in cugraph.spectralModularityMaximizationClustering()

cudf/bindings/cudf_cpp.pyx in cudf.bindings.cudf_cpp.check_gdf_error()

cudf/bindings/cudf_cpp.pyx in cudf.bindings.cudf_cpp.check_gdf_error()

GDFError: CUDA ERROR. b'cudaSuccess': b'no error'

Expected behavior
With a matrix of this size I would expect partitioning to allow more than 100 eigenvectors.

Desktop (please complete the following information):

  • OS: Ubuntu 18.04 with NVIDIA 2080ti
  • Software: Cugraph installed via this call: conda install -y -c nvidia -c rapidsai -c numba -c conda-forge -c defaults cugraph cudatoolkit=10.0

Can anyone else reproduce this issue with a large random graph?
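Something along these lines should generate a comparable random graph (an untested sketch using scipy.sparse.random; it only mimics the ~100k vertices and ~440k edges of my graph and reuses the same conversion call as above):

import cudf
import cugraph
from scipy.sparse import random as sparse_random

# Untested sketch: build a random ~100k x 100k CSR matrix with roughly the
# same number of non-zeros as the real graph, then run the same call.
n = 100000
rnd = sparse_random(n, n, density=444429 / (n * n), format='csr', dtype='float32')

G = cugraph.Graph()
G.add_adj_list(cudf.Series(rnd.indptr),
               cudf.Series(rnd.indices),
               cudf.Series(rnd.data))

p = 5000
df = cugraph.spectralModularityMaximizationClustering(G, p, num_eigen_vects=101)  # > 100 crashes here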

Joel

@mike-wendt mike-wendt added this to Needs Prioritizing in Bug Squashing Feb 15, 2020
@BradReesWork BradReesWork added this to P0 in v0.13 Release via automation Feb 18, 2020
@BradReesWork BradReesWork moved this from P0 to Bugs in v0.13 Release Feb 18, 2020
@BradReesWork (Contributor) commented Feb 18, 2020

@joelb92 We currently limit the number of eigenvectors since there is limited benefit to using more than the first 30 eigenvectors. Is there an analytical problem that you are trying to solve that depends on the 100th eigenvector?

@BradReesWork BradReesWork moved this from Needs Prioritizing to Hotfix - current release in Bug Squashing Feb 18, 2020
@afender (Member) commented Feb 18, 2020

Hi Joel,

I'd recommend trying a smaller number of eigenpairs, based on the experiment in Fig. 2a of https://www.sciencedirect.com/science/article/pii/S1877050917307913:

In our experiments we have found that for most networks it is possible to maximize the modularity by varying the number of clusters independently of the number of eigenpairs, as shown in Fig. 2a. In these experiments we have computed 2 to 8 eigenpairs, and afterward we have continued to increase only the number of k-means clusters. Notice that the plot demonstrates how the choice of the number of eigenpairs impacts the modularity score.
Based on the formulation of the modularity problem one can expect that the best-case scenario would be to compute a number of eigenpairs equal to the number of clusters. This is mostly, but not always, true, with the exceptions often due to loss of orthogonality or a low-quality approximation of the last eigenvectors.
We also see that it is possible to compute fewer eigenpairs for an equivalent clustering quality, as shown in Fig. 2a. Indeed, modularity values seem to follow a trend set by the number of clusters, and the selected number of eigenpairs is secondary. Hence, given a constant number of eigenpairs, it is possible to maximize modularity by only increasing the number of k-means clusters.
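
Concretely, that suggestion translates to something like this (a sketch reusing the graph G and the calls from the issue; num_eigen_vects=8 is just an illustrative value, not a tuned setting):

# Keep the number of eigenpairs small and vary only the cluster count;
# G is the adjacency-list graph built in the issue above.
p = 5000
df = cugraph.spectralModularityMaximizationClustering(G, p, num_eigen_vects=8)
score = cugraph.analyzeClustering_edge_cut(G, p, df['cluster'])
print(score)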
