Description
I am building a data pipeline with TensorFlow Transform on Apache Beam. Inside the preprocessing function I generate a vocabulary, which converts my input sentences to lists of integers.
My data has 3 features:
Column 1 = Context (dtype = String) (Sentences of varying Length)
Column 2 = Utterance (dtype = String) (Sentences of varying Length)
Column 3 = Label (0/1)
I want to right-pad each list of integers, as soon as it is generated in the preprocessing function below, to a maximum length of 160 words.
Example:
"I love Pizza" --> [34, 67, 78], with a max length of 10 for illustration
Then I want [34, 67, 78, 0, 0, 0, 0, 0, 0, 0]. If the sentence is already longer than 10, I want to trim the extra portion so that its length is exactly 10.
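To make the desired behavior concrete, here is a minimal plain-Python sketch of the padding and trimming rule. The helper name pad_or_truncate is hypothetical, and it works on ordinary Python lists rather than tensors, so it only illustrates the expected output and is not a solution for the preprocessing function. It should match what pad_sequences produces with padding='post' and truncating='post'.

```python
def pad_or_truncate(token_ids, max_len, pad_value=0):
    """Return exactly max_len items: trim on the right, then right-pad.

    Hypothetical helper for illustration only; operates on Python lists.
    """
    # Keep at most the first max_len ids (drops the extra portion).
    trimmed = list(token_ids[:max_len])
    # Fill the remaining positions on the right with the pad value.
    return trimmed + [pad_value] * (max_len - len(trimmed))


# "I love Pizza" --> [34, 67, 78], padded to a max length of 10.
print(pad_or_truncate([34, 67, 78], 10))
# A 12-item sequence is cut down to its first 10 items.
print(pad_or_truncate(list(range(1, 13)), 10))
```

In the real pipeline the max length would be 160 instead of 10.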
tf.keras.preprocessing.sequence.pad_sequences expects a list of sequences as input, but as shown below, my mapped_context and mapped_utterance are tf.int64 tensors, so I cannot use the Keras padding functionality.
Can someone please help me achieve this?
The reference code I am following is the TensorFlow Transform sentiment analysis example:
https://github.com/tensorflow/transform/blob/599691c8b94bbd6ee7f67c11542e7fef1792a566/examples/sentiment_example.py
My code's preprocessing function is below:
