
Memory Error / Agglomerative Clustering with Sparse Matrices #18859

Open
esyker opened this issue Nov 17, 2020 · 0 comments
esyker commented Nov 17, 2020

Describe the workflow you want to enable

Below is agglomerative clustering from scikit-learn. The problem is that, to use the .fit method, I need to convert the sparse matrix representation to a dense array. Since I have 80k documents and around 400k terms, the resulting array has 80k × 400k elements. This array is huge, and naturally my computer runs out of memory and raises a MemoryError.

```python
from sklearn.cluster import AgglomerativeClustering

model = AgglomerativeClustering(n_clusters=n_clusters,
                                linkage=linkage,
                                affinity=affinity,
                                distance_threshold=distance_threshold)
# .toarray() densifies the 80k x 400k sparse matrix and exhausts memory
model = model.fit(vectorspace.toarray())
```

With KMeans, I don't need to transform the sparse matrix representation into a dense array. Notice that I don't call .toarray() inside model.fit(). Therefore my computer does not raise an exception, since it does not use that much memory.

```python
from sklearn.cluster import KMeans

# KMeans accepts the sparse matrix directly; no .toarray() needed
model = KMeans(n_clusters=n_clusters)
model = model.fit(vectorspace)
```

Notice that most of the entries are 0, since most documents share only a few terms, so a sparse matrix hugely reduces the memory used.
Does anyone know if there is already another Python module that does agglomerative clustering on sparse matrices?
Could you implement this? A workaround sketch follows below.
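In the meantime, a possible workaround (just a sketch, not what this issue asks for) is to reduce the sparse matrix to a small dense one before clustering: TruncatedSVD accepts scipy.sparse input directly, so the full-width matrix is never densified. The `n_components=100` and `n_neighbors=10` values below are arbitrary illustration choices, not recommendations:

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.neighbors import kneighbors_graph
from sklearn.cluster import AgglomerativeClustering

# Reduce the sparse term-document matrix to a small dense matrix.
# TruncatedSVD works on scipy.sparse input, so the 80k x 400k matrix
# is never densified; only the 80k x 100 result is dense.
svd = TruncatedSVD(n_components=100)
reduced = svd.fit_transform(vectorspace)

# A sparse k-nearest-neighbors connectivity graph restricts merges to
# neighboring documents, avoiding the full O(n^2) distance matrix.
connectivity = kneighbors_graph(reduced, n_neighbors=10, include_self=False)

model = AgglomerativeClustering(n_clusters=n_clusters,
                                connectivity=connectivity)
model = model.fit(reduced)
```

This sidesteps the .toarray() call, though it clusters an SVD approximation of the data rather than the raw term vectors.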

Describe your proposed solution

Use the same algorithms with sparse matrices. Sparse multiplication and summation are already implemented in other modules (e.g. scipy.sparse), so this shouldn't be very difficult; see the sketch below.
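To illustrate that sparse-aware building blocks already exist: sklearn.metrics.pairwise_distances accepts scipy.sparse input for metrics like cosine, so a precomputed-distance route only goes dense at the n × n distance matrix, never at the n × n_terms data. A minimal sketch using scipy.cluster.hierarchy instead of scikit-learn's estimator (feasible only while the document count is moderate; an 80k × 80k matrix would still not fit):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform
from sklearn.metrics import pairwise_distances

# Cosine distances computed directly on the sparse matrix;
# only the resulting n x n matrix is dense.
dist = pairwise_distances(vectorspace, metric='cosine')
np.fill_diagonal(dist, 0.0)  # squareform expects an exact zero diagonal

# Condense the square matrix and build the hierarchy with average linkage.
condensed = squareform(dist, checks=False)
Z = linkage(condensed, method='average')
labels = fcluster(Z, t=n_clusters, criterion='maxclust')
```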

Describe alternatives you've considered, if relevant

Additional context

This is a huge issue with large collections / big data. I want to process a big collection in Python, but I am not able to, due to the HUGE memory requirements of dense matrices.
