
Memory Error / Agglomerative Clustering with Sparse Matrices #18859

Open
esyker opened this issue Nov 17, 2020 · 0 comments
esyker commented Nov 17, 2020

Describe the workflow you want to enable

Below is agglomerative clustering from scikit-learn. The problem is that, to use the .fit method, I need to convert the sparse matrix representation to a dense array. Since I have 80k documents and around 400k terms, the resulting array has 80k × 400k elements. This array is huge, and naturally my computer runs out of memory and raises a MemoryError.

```python
from sklearn.cluster import AgglomerativeClustering

model = AgglomerativeClustering(n_clusters=n_clusters,
                                linkage=linkage,
                                affinity=affinity,
                                distance_threshold=distance_threshold)
# .toarray() densifies the 80k x 400k sparse matrix and exhausts memory
model = model.fit(vectorspace.toarray())
```

With KMeans, I don't need to transform the sparse matrix representation into a dense array. Notice that I don't call .toarray() inside model.fit(). Therefore my computer does not raise an exception, since it does not use that much memory.

```python
from sklearn.cluster import KMeans

# KMeans accepts the sparse matrix directly; no .toarray() needed
model = KMeans(n_clusters=n_clusters)
model = model.fit(vectorspace)
```

Notice that most of the entries are 0, since most documents share only a few terms, so a sparse matrix hugely reduces the memory used.
Does anyone know if there is already another Python module that does agglomerative clustering on sparse matrices?
Could you implement this? A workaround sketch follows below.
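In the meantime, a possible workaround (just a sketch, not what this issue asks for) is to reduce the sparse matrix to a small dense one before clustering: TruncatedSVD accepts scipy.sparse input directly, so the full-width matrix is never densified. The `n_components=100` and `n_neighbors=10` values below are arbitrary illustration choices, not recommendations:

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.neighbors import kneighbors_graph
from sklearn.cluster import AgglomerativeClustering

# Reduce the sparse term-document matrix to a small dense matrix.
# TruncatedSVD works on scipy.sparse input, so the 80k x 400k matrix
# is never densified; only the 80k x 100 result is dense.
svd = TruncatedSVD(n_components=100)
reduced = svd.fit_transform(vectorspace)

# A sparse k-nearest-neighbors connectivity graph restricts merges to
# neighboring documents, avoiding the full O(n^2) distance matrix.
connectivity = kneighbors_graph(reduced, n_neighbors=10, include_self=False)

model = AgglomerativeClustering(n_clusters=n_clusters,
                                connectivity=connectivity)
model = model.fit(reduced)
```

This sidesteps the .toarray() call, though it clusters an SVD approximation of the data rather than the raw term vectors.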

Describe your proposed solution

Use the same algorithms with sparse matrices. Sparse multiplication and summation are already implemented in other modules (e.g. scipy.sparse), so this shouldn't be very difficult; see the sketch below.
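To illustrate that sparse-aware building blocks already exist: sklearn.metrics.pairwise_distances accepts scipy.sparse input for metrics like cosine, so a precomputed-distance route only goes dense at the n × n distance matrix, never at the n × n_terms data. A minimal sketch using scipy.cluster.hierarchy instead of scikit-learn's estimator (feasible only while the document count is moderate; an 80k × 80k matrix would still not fit):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform
from sklearn.metrics import pairwise_distances

# Cosine distances computed directly on the sparse matrix;
# only the resulting n x n matrix is dense.
dist = pairwise_distances(vectorspace, metric='cosine')
np.fill_diagonal(dist, 0.0)  # squareform expects an exact zero diagonal

# Condense the square matrix and build the hierarchy with average linkage.
condensed = squareform(dist, checks=False)
Z = linkage(condensed, method='average')
labels = fcluster(Z, t=n_clusters, criterion='maxclust')
```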

Describe alternatives you've considered, if relevant

Additional context

This is a huge issue with large collections / big data. I want to process a big collection in Python, but I am not able to, due to the HUGE memory requirements of dense matrices.
