
How to pad data in TensorFlow Transform so that sequences are of equal length? #171

@anantvir

I am building a data pipeline with TensorFlow Transform on Apache Beam. Inside the preprocessing function I generate a vocabulary, which converts my input sentences to lists of integers.
My data has 3 features:
Column 1 = Context (dtype = string) (sentences of varying length)
Column 2 = Utterance (dtype = string) (sentences of varying length)
Column 3 = Label (0/1)
I want to right-pad each list of integers to a max length of 160 words as soon as it is generated in the preprocessing function below.
Example, with a max length of 10:
"I love Pizza" --> [34, 67, 78] --> [34, 67, 78, 0, 0, 0, 0, 0, 0, 0]
If a sentence is already longer than 10, I want to trim the extra portion so that its length is exactly 10.
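In plain Python, the behavior I am after looks roughly like the sketch below. The helper name `pad_or_truncate` and its parameters are only for illustration and are not part of any library.

```python
def pad_or_truncate(ids, max_len, pad_value=0):
    """Right-pad `ids` with `pad_value`, or cut it, so the result has exactly `max_len` items."""
    # Slicing handles the "too long" case; it is a no-op for short inputs.
    trimmed = list(ids[:max_len])
    # Append as many pad values as are still missing (zero for long inputs).
    return trimmed + [pad_value] * (max_len - len(trimmed))


short = pad_or_truncate([34, 67, 78], max_len=10)
long = pad_or_truncate(list(range(1, 13)), max_len=10)

print(short)  # [34, 67, 78, 0, 0, 0, 0, 0, 0, 0]
print(long)   # [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
```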

tf.keras.preprocessing.sequence.pad_sequences takes a list of sequences as input, but as shown below, my mapped_context and mapped_utterance are tf.int64 tensors, so I cannot use the Keras padding functionality.
Can someone please help me achieve this?

The reference code I am following is the TensorFlow Transform sentiment analysis example:
https://github.com/tensorflow/transform/blob/599691c8b94bbd6ee7f67c11542e7fef1792a566/examples/sentiment_example.py

My code's preprocessing function is below:

[image: screenshot of the preprocessing function]
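One possible direction, as a sketch only: if the integer ids (such as mapped_context) arrive as a `tf.RaggedTensor`, or as a `tf.SparseTensor` whose rows are left-aligned as a string split usually produces, then `RaggedTensor.to_tensor` with a `shape` argument both pads and truncates in one call. The helper name `pad_or_truncate`, the constant `MAX_LEN`, and the sample ids are assumptions made for illustration, and this has not been checked against the preprocessing function in the screenshot.

```python
import tensorflow as tf

MAX_LEN = 10  # illustrative; the real data would use 160


def pad_or_truncate(token_ids, max_len=MAX_LEN, pad_value=0):
    """Return a dense [batch, max_len] tensor from variable-length rows of ids."""
    if isinstance(token_ids, tf.SparseTensor):
        # Assumes every row's entries are left-aligned (ragged-right).
        token_ids = tf.RaggedTensor.from_sparse(token_ids)
    # With `shape` given, rows are right-padded or truncated to max_len.
    return token_ids.to_tensor(default_value=pad_value, shape=[None, max_len])


# Two sample rows: one shorter and one longer than MAX_LEN.
ragged_ids = tf.ragged.constant(
    [[34, 67, 78], list(range(1, 13))], dtype=tf.int64)
dense_ids = pad_or_truncate(ragged_ids)
print(dense_ids.shape)  # (2, 10)
```

Padding with 0 follows the example in the question. Whether 0 is also a real vocabulary id depends on how the vocabulary was built, so a different pad value or an id offset may be needed.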
