Describe the workflow you want to enable

Below is agglomerative clustering from scikit-learn. The problem is that to use the `.fit` method I have to convert the sparse matrix representation to a dense array. Since I have 80k documents with around 400k terms, the resulting array has 80k × 400k elements. This is a huge array, and naturally my computer runs out of memory and raises a `MemoryError`.
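As a rough sanity check (my own back-of-the-envelope numbers, not from the report), the dense `float64` array that `.toarray()` would produce is far beyond typical RAM:

```python
# Approximate footprint of densifying an 80k x 400k float64 matrix.
n_docs, n_terms = 80_000, 400_000
bytes_per_float64 = 8
dense_gb = n_docs * n_terms * bytes_per_float64 / 1e9
print(dense_gb)  # 256.0 (GB)
```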
```python
from sklearn.cluster import AgglomerativeClustering

model = AgglomerativeClustering(n_clusters=n_clusters,
                                linkage=linkage,
                                affinity=affinity,
                                distance_threshold=distance_threshold)
model = model.fit(vectorspace.toarray())
```
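Until sparse input is supported, one possible workaround (my suggestion, not part of the original report) is to reduce the dimensionality first with `TruncatedSVD`, which does accept sparse input, and only densify the reduced matrix. A minimal sketch with synthetic data standing in for the 80k × 400k tf-idf matrix:

```python
from scipy.sparse import random as sparse_random
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import AgglomerativeClustering

# Small synthetic stand-in for the large sparse document-term matrix.
vectorspace = sparse_random(200, 1000, density=0.01, format="csr", random_state=0)

# TruncatedSVD works on sparse input directly, so nothing is densified
# until the data has already been reduced to n_components columns.
svd = TruncatedSVD(n_components=20, random_state=0)
reduced = svd.fit_transform(vectorspace)  # dense, but only 200 x 20

model = AgglomerativeClustering(n_clusters=5).fit(reduced)
print(reduced.shape, model.labels_.shape)
```

This trades exactness for memory: clustering happens in the reduced space, not on the raw term vectors.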
With KMeans, I don't need to transform the sparse matrix representation to a dense array. Notice that I don't call `.toarray()` inside `model.fit()`. Consequently, my computer does not raise an exception, since it does not use nearly as much memory.
```python
from sklearn.cluster import KMeans

model = KMeans(n_clusters=n_clusters)
model = model.fit(vectorspace)
```
Notice that most of the entries are 0, since most documents share only a few terms, so a sparse matrix hugely reduces the memory used.
Does anyone know if there is already another Python module that does agglomerative clustering on sparse matrices?
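To illustrate the savings (a minimal sketch with synthetic data): CSR storage scales with the number of nonzeros, not with rows × columns:

```python
from scipy.sparse import random as sparse_random

# A CSR matrix stores only the nonzero values plus their indices,
# so its memory scales with nnz rather than with rows * cols.
X = sparse_random(1000, 5000, density=0.001, format="csr", random_state=0)
sparse_bytes = X.data.nbytes + X.indices.nbytes + X.indptr.nbytes
dense_bytes = X.shape[0] * X.shape[1] * 8  # float64 dense equivalent
print(sparse_bytes, dense_bytes)
```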
Could you implement this?
Describe your proposed solution
Use the same algorithms with sparse matrices. Sparse multiplication/summation is already defined in other modules (e.g. `scipy.sparse`), so this shouldn't be very difficult.
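For example, `sklearn.metrics.pairwise_distances` already accepts sparse input, so a distance matrix can be computed without ever densifying the feature matrix; that matrix could then be fed to `AgglomerativeClustering` with a precomputed affinity. (For 80k samples the 80k × 80k distance matrix is itself large, so this is only a partial workaround.) A sketch with synthetic data:

```python
from scipy.sparse import random as sparse_random
from sklearn.metrics import pairwise_distances

# pairwise_distances handles scipy.sparse input directly: the cosine
# distances are computed from the sparse rows without a dense copy of X.
X = sparse_random(100, 2000, density=0.05, format="csr", random_state=0)
D = pairwise_distances(X, metric="cosine")  # dense (100, 100) distance matrix
print(D.shape)
```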
Describe alternatives you've considered, if relevant
Additional context
This is a huge issue with large collections/big data. I want to process a big collection in Python, but I am not able to due to the HUGE memory requirements of dense matrices.