from_tensor_slices not compatible with sparse data #44565

Open
amirhmk opened this issue Nov 3, 2020 · 11 comments
Assignees
Labels
comp:data tf.data related issues TF 2.3 Issues related to TF 2.3 type:feature Feature requests

Comments

@amirhmk

amirhmk commented Nov 3, 2020

System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Google Colab / TF 2.3.0

Describe the current behavior
The documentation states that Dataset can handle SparseTensors, in addition to RaggedTensors and the other standard data types.

The issue is that from_tensor_slices cannot handle a list of SparseTensors, whereas from_tensors, which accepts a single data point, can. Dataset.from_generator cannot handle the sparse type either (in the current release at least; see #41981), so I'm not sure how one is supposed to handle datasets that are composed of both sparse and dense data.

Describe the expected behavior
from_tensor_slices should accept a list of SparseTensors, provided the list contains exactly as many items as the other features passed into from_tensor_slices.

Standalone code to reproduce the issue
Colab Link
Other info / logs

I'm not sure if it's just me, but when I read this part of the documentation I was under the impression that this feature would be supported. If I've misunderstood the documentation, you may want to classify this as a feature request rather than a bug.

@amirhmk amirhmk added the type:bug Bug label Nov 3, 2020
@amahendrakar amahendrakar added comp:data tf.data related issues type:feature Feature requests and removed type:bug Bug labels Nov 4, 2020
@amahendrakar
Contributor

Was able to reproduce the issue with TF v2.3 and TF-nightly. Please find the gist of it here. Thanks!

@amahendrakar amahendrakar added the TF 2.3 Issues related to TF 2.3 label Nov 4, 2020
@amirhmk
Author

amirhmk commented Nov 5, 2020

@amahendrakar No worries! I can potentially look into creating a fix for this in the next couple of days and submit a PR. Though do you know of any workarounds to deal with such a dataset using the Dataset API? (Sparse + Dense)?

@jvishnuvardhan jvishnuvardhan added the stat:awaiting tensorflower Status - Awaiting response from tensorflower label Nov 5, 2020
@aaudiber
Contributor

aaudiber commented Nov 5, 2020

@amirhmk Each component of the argument passed to Dataset.from_tensors or Dataset.from_tensor_slices must be convertible to a single object representable by a tf.TypeSpec. list(SparseTensor) doesn't satisfy this. There are a couple ways to make it work:

  1. Define a single sparse tensor and take slices out of it:
sparse = tf.SparseTensor(indices=[[0, 0, 0], [0, 1, 2], [1, 0, 0], [1, 1, 1]], 
    values=[1, 2, 3, 4], dense_shape=[2, 3, 4])
dataset = tf.data.Dataset.from_tensor_slices(sparse)
  2. Produce sparse tensors using Dataset.from_generator:
sparse_1 = tf.SparseTensor(indices=[[0, 0], [1, 2]], values=[1, 2], dense_shape=[3, 4])
sparse_2 = tf.SparseTensor(indices=[[0, 0], [1, 1]], values=[3, 4], dense_shape=[2, 2])
def gen():
  yield sparse_1
  yield sparse_2

dataset = tf.data.Dataset.from_generator(gen, output_signature=tf.SparseTensorSpec(dtype=tf.int32))
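For the mixed sparse and dense case asked about earlier in the thread, the first approach can be combined with dense components in one call. This is a minimal sketch, assuming the sparse tensor's leading dimension indexes the examples; the `labels` tensor is hypothetical data added for illustration and does not come from the original report.

```python
import tensorflow as tf

# One sparse tensor whose first dimension indexes the examples:
# two examples, each with dense shape [3, 4].
sparse = tf.SparseTensor(
    indices=[[0, 0, 0], [0, 1, 2], [1, 0, 0], [1, 1, 1]],
    values=[1, 2, 3, 4],
    dense_shape=[2, 3, 4],
)

# Hypothetical dense per-example labels, one per slice of `sparse`.
labels = tf.constant([0, 1])

# Every component is sliced along its first dimension.
dataset = tf.data.Dataset.from_tensor_slices((sparse, labels))

# Materialize each element; the sparse part is densified only for inspection.
rows = [(tf.sparse.to_dense(s).numpy(), int(l)) for s, l in dataset]
```

Each element of `dataset` is a `(SparseTensor, Tensor)` pair, so the sparse component stays sparse until it is explicitly converted.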

@tensorflowbutler tensorflowbutler removed the stat:awaiting tensorflower Status - Awaiting response from tensorflower label Nov 7, 2020
@amirhmk
Author

amirhmk commented Nov 7, 2020

@aaudiber Thank you for your comment, that makes sense. I like the first approach, I just have to keep track of the indices that represent each data point after merging them into a single SparseTensor which shouldn't be too bad.
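One way to do the merging described here, sketched under the assumption that every per-example SparseTensor has the same dense shape: give each tensor a leading axis of size 1 and concatenate along it, so the example index becomes the first coordinate. The `sparse_list` contents are illustrative only.

```python
import tensorflow as tf

# Illustrative per-example sparse tensors, all with dense shape [3, 4].
sparse_list = [
    tf.SparseTensor(indices=[[0, 0], [1, 2]], values=[1, 2], dense_shape=[3, 4]),
    tf.SparseTensor(indices=[[0, 0], [1, 1]], values=[3, 4], dense_shape=[3, 4]),
]

# Add a leading axis to each tensor, then concatenate along that axis.
# The result has dense shape [num_examples, 3, 4].
stacked = tf.sparse.concat(
    axis=0,
    sp_inputs=[tf.sparse.expand_dims(s, axis=0) for s in sparse_list],
)

# Slicing along the first dimension recovers one sparse tensor per example.
dataset = tf.data.Dataset.from_tensor_slices(stacked)
elements = [tf.sparse.to_dense(s).numpy() for s in dataset]
```

With this layout the bookkeeping is implicit: the first index of each entry in `stacked.indices` is the example it belongs to.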

As for the second approach, I think from_generator accepts SparseTensorSpec in the pre-release version. Do you know when this will become a stable version?

@aaudiber
Contributor

@amirhmk It will be available in TF 2.4.0, which should be released in the next week or two.

@amirhmk
Author

amirhmk commented Nov 11, 2020

Sounds good. I still think this feature would be nice to have, as in each data point has some sort of a sparse tensor attached to it. Feel free to close if there is a strong reason not to do that. Thanks!

@ktsitsi

ktsitsi commented May 7, 2021

I was trying to ingest sparse data into tf.data.Dataset by using the from_generator API. The example below uses scipy.coo_matrix. Trying that with an on-prem design for Sparse Arrays, which follows the COO representation, I was not able to pass the

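import numpy as np
import tensorflow as tf
from scipy.sparse import coo_matrix

# Excerpt from the body of a classmethod; cls._generator is the generator function.
row = np.array([0, 3, 1, 0])
col = np.array([0, 3, 1, 2])
data = np.array([4, 5, 7, 9])
a = coo_matrix((data, (row, col)), shape=(4, 4))
return tf.data.Dataset.from_generator(
    generator=cls._generator,
    output_signature=tf.SparseTensorSpec(dtype=tf.int32),
    # The sparse matrix is passed through args, which is where conversion fails.
    args=(a,),
)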

I was getting the following error:

TypeError("Failed to convert object of type %s to Tensor. "
TypeError: Failed to convert object of type <class 'SparseArray'> to Tensor. Contents: ....

Thinking that the culprit may be the on-prem implementation of sparse data I tried the same with scipy.coo_matrix

I get the same error:

Attempt to convert a value (<4x4 sparse matrix of type '<class 'numpy.int64'>' with 4 stored elements in COOrdinate format>) with an unsupported type (<class 'scipy.sparse.coo.coo_matrix'>) to a Tensor.

After some debugging I saw that the error comes from the evaluation of the arguments of the generator function, which makes sense, since the docs clearly state that the args should be tf.Tensors.

https://www.tensorflow.org/api_docs/python/tf/data/Dataset#from_generator

args | (Optional.) A tuple of tf.Tensor objects that will be evaluated and passed to generator as NumPy-array arguments.

However, after the introduction of SparseTensors and SparseTensorSpec, and having read this part of the documentation, I was under the impression that a scipy.coo_matrix would be translated into a SparseTensor, or at least into index and value NumPy arrays, by exploiting the data, row and col attributes of the COO sparse format (e.g. scipy.coo_matrix). Is there any workaround for ingesting sparse data using from_generator? Or is there an official way that I am not aware of?

Using tensorflow==2.4.1

@sat2000pts

I am doing something very similar to @ktsitsi, except that I convert the coo matrix to a SparseTensor inside the function that I pass to from_generator(). Like @ktsitsi, I specify the SparseTensorSpec in the output_signature. The error message I get is the following:

TypeError: Cannot convert value SparseTensor Spec to a TensorFlow DType.

@aaudiber
Contributor

@sat2000pts This colab demonstrates how to convert from coo_matrix to SparseTensor:

The crux is

coo = coo_matrix((data, (row, col)), shape=shape)
tf_sparse = tf.sparse.SparseTensor(list(zip(coo.row, coo.col)), coo.data, coo.shape)
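Putting this conversion together with from_generator might look like the sketch below. It assumes TF 2.4 or later (where `output_signature` is available), and it has the generator close over the scipy matrices instead of receiving them through `args`, since `args` must be convertible to tf.Tensor. The helper name `coo_to_sparse_tensor` and the `matrices` list are hypothetical, introduced only for illustration.

```python
import numpy as np
import tensorflow as tf
from scipy.sparse import coo_matrix


def coo_to_sparse_tensor(coo):
    """Convert a scipy COO matrix into an int32 tf.SparseTensor (hypothetical helper)."""
    indices = np.stack([coo.row, coo.col], axis=1).astype(np.int64)
    sparse = tf.SparseTensor(indices, coo.data.astype(np.int32), coo.shape)
    # COO entries may be in any order; reorder them to row-major order.
    return tf.sparse.reorder(sparse)


# Illustrative input, using the same values as the example earlier in the thread.
row = np.array([0, 3, 1, 0])
col = np.array([0, 3, 1, 2])
data = np.array([4, 5, 7, 9])
matrices = [coo_matrix((data, (row, col)), shape=(4, 4))]


def gen():
    # The scipy matrices are captured from the enclosing scope, not passed via args.
    for matrix in matrices:
        yield coo_to_sparse_tensor(matrix)


dataset = tf.data.Dataset.from_generator(
    gen,
    output_signature=tf.SparseTensorSpec(shape=[4, 4], dtype=tf.int32),
)

first = tf.sparse.to_dense(next(iter(dataset))).numpy()
```

The conversion happens inside the generator, so what crosses into tf.data is already a SparseTensor that matches the declared spec.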

@sat2000pts

sat2000pts commented May 11, 2021

@aaudiber thanks for the reply.
Just a few more details: I am using the sklearn count_vectorizer to transform the text and then convert it to a coo_matrix.
Now, with the code you sent, I convert it to a SparseTensor and the error is the same: TypeError: Cannot convert value SparseTensorSpec(TensorShape(None), tf.int32) to a TensorFlow DType.

I don't think this plays a role, but I also yield a simple integer -> ds = tf.data.Dataset.from_generator(rdd_generator, (tf.SparseTensorSpec(dtype=tf.int32), tf.int32), (tf.TensorShape([None,vec_shape]), tf.TensorShape([1])))
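One possible reading of the call above, offered as an assumption and not a confirmed diagnosis: the second positional parameter of from_generator is `output_types`, which expects dtypes, so a SparseTensorSpec placed there cannot be converted to a DType. A sketch that passes the specs through the `output_signature` keyword instead (TF 2.4 or later) is shown below; the generator body and the `vec_shape` value are hypothetical stand-ins for the real data.

```python
import tensorflow as tf

vec_shape = 5  # hypothetical vocabulary size


def gen():
    # Hypothetical data: one sparse feature row plus an integer label per example.
    for label in (0, 1):
        features = tf.SparseTensor(
            indices=[[0, label]], values=[1], dense_shape=[1, vec_shape]
        )
        yield features, label


ds = tf.data.Dataset.from_generator(
    gen,
    # Specs describe both shape and dtype, so no separate shapes argument is needed.
    output_signature=(
        tf.SparseTensorSpec(shape=[None, vec_shape], dtype=tf.int32),
        tf.TensorSpec(shape=(), dtype=tf.int32),
    ),
)

examples = [(tf.sparse.to_dense(f).numpy(), int(l)) for f, l in ds]
```

Here the label is declared as a scalar; if the generator yields a length-1 vector instead, the TensorSpec shape would need to be `(1,)` to match.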

@aaudiber
Contributor

@sat2000pts Can you open a new issue, and include code to reproduce the error you're seeing?


8 participants