From 378feee8052121fcd33aa1ba3c168f4fc33fa5ae Mon Sep 17 00:00:00 2001
From: tanzhenyu
Date: Tue, 14 Jul 2020 17:08:36 -0700
Subject: [PATCH 1/3] Update Keras categorical input design, mark it as
 complete. More importantly, add an official guide on how TF1 Estimator users
 with feature columns can convert their code to TF2 Keras with KPL.

---
 rfcs/20191212-keras-categorical-inputs.md | 617 +++++++++++++---------
 1 file changed, 376 insertions(+), 241 deletions(-)

diff --git a/rfcs/20191212-keras-categorical-inputs.md b/rfcs/20191212-keras-categorical-inputs.md
index b1954c25d..19e5c04a9 100644
--- a/rfcs/20191212-keras-categorical-inputs.md
+++ b/rfcs/20191212-keras-categorical-inputs.md
@@ -1,6 +1,6 @@
 # Keras categorical inputs

-| Status        | Accepted |
+| Status        | Completed |
 :-------------- |:---------------------------------------------------- |
 | **Author(s)** | Zhenyu Tan (tanzheny@google.com), Francois Chollet (fchollet@google.com)|
 | **Sponsor**   | Karmel Allison (karmel@google.com), Martin Wicke (wicke@google.com) |
@@ -8,8 +8,8 @@

 ## Objective

-This document proposes 4 new preprocessing Keras layers (`IndexLookup`, `CategoryCrossing`, `CategoryEncoding`, `Hashing`), and an extension to existing op (`tf.sparse.from_dense`) to allow users to:
-* Perform feature engineering for categorical inputs
+This document proposes 5 new Keras preprocessing layers (KPL) (`StringLookup`, `CategoryCrossing`, `CategoryEncoding`, `Hashing`, `IntegerLookup`) to allow users to:
+* Perform basic feature engineering for categorical inputs
 * Replace feature columns and `tf.keras.layers.DenseFeatures` with proposed layers
 * Introduce sparse inputs that work with Keras linear models and other layers that support sparsity

@@ -19,10 +19,12 @@ The proposed layers should support ragged tensors.

 ## Motivation

-Specifically, by introducing the 4 layers, we aim to address these pain points:
+Specifically, by introducing the 5 layers, we aim to address these pain points:
 * Users have to define both feature columns and Keras Inputs for the model, resulting in code duplication and deviation from DRY (Do not repeat yourself) principle. See this [Github issue](https://github.com/tensorflow/tensorflow/issues/27416).
 * Users with large dimension categorical inputs will incur large memory footprint and computation cost, if wrapped with indicator column through `tf.keras.layers.DenseFeatures`.
-* Currently there is no way to correctly feed Keras linear model or dense layer with multivalent categorical inputs or weighted categorical inputs.
+* Currently there is no way to correctly feed Keras linear model or dense layer with multivalent categorical inputs or weighted categorical inputs, or shared embedding inputs.
+* feature columns offer black-box implementations, mix feature engineering with trainable objects, and lead to
+  unintended coding patterns.

 ## User Benefit

 We expect to get rid of these user pain points once users migrate off feature columns.

 ## Example Workflows

-Two example workflows are presented below. These workflows can be found at this [colab](https://colab.sandbox.google.com/drive/1cEJhSYLcc2MKH7itwcDvue4PfvrLN-OR#scrollTo=22sa0D19kxXY).
+Two example workflows are presented below. These workflows can be found at this [colab](https://colab.sandbox.google.com/drive/1cEJhSYLcc2MKH7itwcDvue4PfvrLN-OR).
-### Workflow 1
+### Workflow 1 -- Official guide on how to replace feature columns with KPL

-The first example gives an equivalent code snippet to canned `LinearEstimator` [tutorial](https://www.tensorflow.org/tutorials/estimator/linear) on the Titanic dataset:
+Refer to [tf.feature_column](https://www.tensorflow.org/api_docs/python/tf/feature_column) for a complete list of feature columns.

+1. Replacing `tf.feature_column.categorical_column_with_hash_bucket` with `Hashing`
+from
 ```python
-dftrain = pd.read_csv('https://storage.googleapis.com/tf-datasets/titanic/train.csv')
-y_train = dftrain.pop('survived')
-
-CATEGORICAL_COLUMNS = ['sex', 'n_siblings_spouses', 'parch', 'class', 'deck', 'embark_town', 'alone']
-NUMERICAL_COLUMNS = ['age', 'fare']
-# input list to create functional model.
-model_inputs = []
-# input list to feed linear model.
-linear_inputs = []
-for feature_name in CATEGORICAL_COLUMNS:
-  feature_input = tf.keras.Input(shape=(1,), dtype=tf.string, name=feature_name, sparse=True)
-  vocab_list = sorted(dftrain[feature_name].unique())
-  # Map string values to indices
-  x = tf.keras.layers.IndexLookup(vocabulary=vocab_list, name=feature_name)(feature_input)
-  x = tf.keras.layers.CategoryEncoding(num_categories=len(vocab_list))(x)
-  linear_inputs.append(x)
-  model_inputs.append(feature_input)
-
-for feature_name in NUMERICAL_COLUMNS:
-  feature_input = tf.keras.Input(shape=(1,), name=feature_name)
-  linear_inputs.append(feature_input)
-  model_inputs.append(feature_input)
-
-linear_model = tf.keras.experimental.LinearModel(units=1)
-linear_logits = linear_model(linear_inputs)
-model = tf.keras.Model(model_inputs, linear_logits)
-
-model.compile('sgd', loss=tf.keras.losses.BinaryCrossEntropy(from_logits=True), metrics=['accuracy'])
-
-dataset = tf.data.Dataset.from_tensor_slices((
-  (tf.sparse.from_dense(dftrain.sex, "Unknown"), tf.sparse.from_dense(dftrain.n_siblings_spouses, -1),
-  tf.sparse.from_dense(dftrain.parch, -1), tf.sparse.from_dense(dftrain['class'], "Unknown"), tf.sparse.from_dense(dftrain.deck, "Unknown"),
-  tf.expand_dims(dftrain.age, axis=1), tf.expand_dims(dftrain.fare, axis=1)),
-  y_train)).batch(bach_size).repeat(n_epochs)
-
-model.fit(dataset)
+tf.feature_column.categorical_column_with_hash_bucket(key, hash_bucket_size)
+```
+to
+```python
+keras_input = tf.keras.Input(shape=(1,), name=key, dtype=dtype)
+hashed_input = tf.keras.layers.experimental.preprocessing.Hashing(num_bins=hash_bucket_size)(keras_input)
 ```

-### Workflow 2
+Note that the hashed output from KPL will be different from the hashed output from the feature column, given how the seed is chosen. `Hashing` also supports a customized `salt`.

-The second example gives an instruction on how to transition from categorical feature columns to the proposed layers. Note that one difference for vocab categorical column is that, instead of providing a pair of mutually exclusive `default_value` and `num_oov_buckets` where `default_value` represents the value to map input to given out-of-vocab value, and `num_oov_buckets` represents value range of [len(vocab), len(vocab)+num_oov_buckets) to map input to from a hashing function given out-of-vocab value. In practice, we believe out-of-vocab values should be mapped to the head, i.e., [0, num_oov_tokens), and in-vocab values should be mapped to [num_oov_tokens, num_oov_tokens+len(vocab)).

-1. Categorical vocab list column
+2. `tf.feature_column.categorical_column_with_identity`
+This feature column merely maps out-of-range values to `default_value`, leaving all other inputs unchanged. This can easily be done at the data cleaning stage,
+rather than as part of feature engineering, and hence it is dropped from this proposal.
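+For completeness, a rough sketch of an equivalent cleanup, adapted from the `Lambda` replacement in an earlier revision of this document (`key`, `num_buckets` and `default_value` carry the feature column's semantics):
+```python
+keras_input = tf.keras.Input(shape=(1,), name=key, dtype=tf.dtypes.int64)
+# Values outside [0, num_buckets) are mapped to `default_value`; the rest pass through.
+identity_input = tf.keras.layers.Lambda(
+    lambda x: tf.where(tf.logical_or(x < 0, x >= num_buckets),
+                       tf.fill(dims=tf.shape(x), value=default_value), x))(keras_input)
+```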
+3. Replacing `tf.feature_column.categorical_column_with_vocabulary_file` and `tf.feature_column.categorical_column_with_vocabulary_list` with `StringLookup` or `IntegerLookup`
+for string inputs,
+from
+```python
+tf.feature_column.categorical_column_with_vocabulary_file(key, vocabulary_file, vocabulary_size, tf.dtypes.string, default_value, num_oov_buckets)
+```
+to
+```python
+keras_input = tf.keras.Input(shape=(1,), name=key, dtype=tf.dtypes.string)
+id_input = tf.keras.layers.experimental.preprocessing.StringLookup(max_tokens=vocabulary_size + num_oov_buckets,
+    num_oov_indices=num_oov_buckets, mask_token=None, vocabulary=vocabulary_file)(keras_input)
+```

-Original:
+Similarly, from
 ```python
-fc = tf.feature_column.categorical_feature_column_with_vocabulary_list(
-    key, vocabulary_list, dtype, default_value, num_oov_buckets)
+tf.feature_column.categorical_column_with_vocabulary_list(key, vocabulary_list, tf.dtypes.string, default_value, num_oov_buckets)
 ```
-Proposed:
+to
 ```python
-x = tf.keras.Input(shape=(1,), name=key, dtype=dtype)
-layer = tf.keras.layers.IndexLookup(
-    vocabulary=vocabulary_list, num_oov_tokens=num_oov_buckets)
-out = layer(x)
+keras_input = tf.keras.Input(shape=(1,), name=key, dtype=tf.dtypes.string)
+id_input = tf.keras.layers.experimental.preprocessing.StringLookup(max_tokens=len(vocabulary_list) + num_oov_buckets, num_oov_indices=num_oov_buckets,
+    mask_token=None, vocabulary=vocabulary_list)(keras_input)
 ```

-2. categorical vocab file column
-Original:
+Note that `default_value` is mutually exclusive with `num_oov_buckets`; in the case of `num_oov_buckets=0` and `default_value=-1`, simply set `num_oov_indices=0`. We do not support
+any values other than `default_value=-1`.
+
+Note that the out-of-range values for `StringLookup` are prepended, i.e., [0, ..., num_oov_tokens) is used for out-of-range values, whereas for `categorical_column_with_vocabulary_file` they are
+appended, i.e., [vocabulary_size, vocabulary_size + num_oov_tokens) is used for out-of-range values. The former can give you more flexibility when reloading and adding vocab.
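+For illustration, a minimal sketch of the prepended OOV ids (the exact indices follow the default behavior described above):
+```python
+lookup = tf.keras.layers.experimental.preprocessing.StringLookup(
+    vocabulary=["a", "b", "c"], num_oov_indices=1, mask_token=None)
+# Out-of-vocab "z" maps to the prepended OOV index 0; "a" -> 1, "b" -> 2, "c" -> 3.
+# The equivalent feature column would instead append OOV: "a" -> 0, "c" -> 2, "z" -> 3.
+print(lookup(tf.constant([["z", "a", "c"]])))
+```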
+
+for integer inputs,
+from
 ```python
-fc = tf.feature_column.categorical_column_with_vocab_file(
-    key, vocabulary_file, vocabulary_size, dtype,
-    default_value, num_oov_buckets)
+tf.feature_column.categorical_column_with_vocabulary_file(key, vocabulary_file, vocabulary_size, tf.dtypes.int64, default_value, num_oov_buckets)
 ```
-Proposed:
+to
 ```python
-x = tf.keras.Input(shape=(1,), name=key, dtype=dtype)
-layer = tf.keras.layers.IndexLookup(
-    vocabulary=vocabulary_file, num_oov_tokens=num_oov_buckets)
-out = layer(x)
+keras_input = tf.keras.Input(shape=(1,), name=key, dtype=tf.dtypes.int64)
+id_input = tf.keras.layers.experimental.preprocessing.IntegerLookup(max_values=vocabulary_size + num_oov_buckets, num_oov_indices=num_oov_buckets, mask_value=None, vocabulary=vocabulary_file)(keras_input)
+```
+
+Similarly, from
+```python
+tf.feature_column.categorical_column_with_vocabulary_list(key, vocabulary_list, tf.dtypes.int64, default_value, num_oov_buckets)
+```
+to
+```python
+keras_input = tf.keras.Input(shape=(1,), name=key, dtype=tf.dtypes.int64)
+id_input = tf.keras.layers.experimental.preprocessing.IntegerLookup(max_values=len(vocabulary_list) + num_oov_buckets, num_oov_indices=num_oov_buckets, mask_value=None, vocabulary=vocabulary_list)(keras_input)
 ```
-Note: `vocabulary_size` is only valid if `adapt` is called. Otherwise if user desires to lookup for the first K vocabularies in vocab file, then shrink the vocab file by only having the first K lines.

-3. categorical hash column
-Original:
+4. Replacing `tf.feature_column.crossed_column` with `CategoryCrossing` or `Hashing`
+from
 ```python
-fc = tf.feature_column.categorical_column_with_hash_bucket(
-    key, hash_bucket_size, dtype)
+tf.feature_column.crossed_column(keys, hash_bucket_size, hash_key)
 ```
-Proposed:
+to
 ```python
-x = tf.keras.Input(shape=(1,), name=key, dtype=dtype)
-layer = tf.keras.layers.Hashing(num_bins=hash_bucket_size)
-out = layer(x)
+keras_inputs = []
+for key in keys:
+  keras_inputs.append(tf.keras.Input(shape=(1,), name=key, dtype=tf.dtypes.string))
+hashed_input = tf.keras.layers.experimental.preprocessing.Hashing(num_bins=hash_bucket_size)(keras_inputs)
 ```

-4. categorical identity column
+Note that when `hash_bucket_size=0`, no hashing is performed; in this case the column should be replaced with:
+```python
+keras_inputs = []
+for key in keys:
+  keras_inputs.append(tf.keras.Input(shape=(1,), name=key, dtype=tf.dtypes.string))
+crossed_input = tf.keras.layers.experimental.preprocessing.CategoryCrossing()(keras_inputs)
+```

-Original:
+5. Replacing `tf.feature_column.embedding_column` with `tf.keras.layers.Embedding`
+Note that `combiner=sum` can be replaced with `tf.reduce_sum` and `combiner=mean` with `tf.reduce_mean` after
+the embedding output. `sqrtn` can also be implemented using tf operations, as sketched below.
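+For instance, a `sqrtn` combiner could be sketched as follows (an illustrative helper, not part of the proposed API; it assumes dense, unweighted inputs):
+```python
+def sqrtn_combine(embedded, axis=-2):
+  # Sum the embeddings, then divide by sqrt(n), where n is the number of
+  # combined values (each with an implicit weight of 1).
+  n = tf.cast(tf.shape(embedded)[axis], embedded.dtype)
+  return tf.reduce_sum(embedded, axis=axis) / tf.sqrt(n)
+```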
+For example, with a `sum` combiner:
 ```python
-fc = tf.feature_column.categorical_column_with_identity(
-    key, num_buckets, default_value)
+categorical_column = tf.feature_column.categorical_column_with_vocabulary_list(key, vocabulary_list)
+tf.feature_column.embedding_column(categorical_column, dimension=dimension, combiner="sum", initializer=initializer,
+    max_norm=max_norm)
 ```
-Proposed:
+can be replaced with:
 ```python
-x = tf.keras.Input(shape=(1,), name=key, dtype=dtype)
-layer = tf.keras.layers.Lambda(lambda x: tf.where(tf.logical_or(x < 0, x > num_buckets), tf.fill(dims=tf.shape(x), value=default_value), x))
-out = layer(x)
+categorical_input = tf.keras.Input(name=key, dtype=tf.string)
+id_input = tf.keras.layers.experimental.preprocessing.StringLookup(vocabulary=vocabulary_list)(categorical_input)
+# Note: with StringLookup's default mask and OOV tokens, the embedding must
+# cover len(vocabulary_list) + 2 indices.
+embedding_input = tf.keras.layers.Embedding(input_dim=len(vocabulary_list) + 2, output_dim=dimension,
+    embeddings_initializer=initializer, embeddings_constraint=tf.keras.constraints.MaxNorm(max_norm))(id_input)
+embedding_input = tf.reduce_sum(embedding_input, axis=-2)
 ```

-5. cross column
+6. Replacing `tf.feature_column.indicator_column` with `CategoryEncoding`
+from
+```python
+categorical_column = tf.feature_column.categorical_column_with_vocabulary_list(key, vocabulary_list)
+tf.feature_column.indicator_column(categorical_column)
+```
+to
+```python
+categorical_input = tf.keras.Input(name=key, dtype=tf.string)
+id_input = tf.keras.layers.experimental.preprocessing.StringLookup(vocabulary=vocabulary_list)(categorical_input)
+encoded_input = tf.keras.layers.experimental.preprocessing.CategoryEncoding(
+    max_tokens=categorical_column.num_buckets, output_mode="count", sparse=True)(id_input)
+```

-Original:
+Note that `CategoryEncoding` supports one-hot through `output_mode="binary"` as well. This is a much more
+efficient approach than `tf.one_hot` + `tf.reduce_sum(axis=-2)` for reducing multivalent categorical inputs.
+
+Note that by specifying the `sparse` flag, the output can be either a `tf.Tensor` or a `tf.SparseTensor`.
+
+7. Replacing `tf.feature_column.weighted_categorical_column` with `CategoryEncoding`
+from
 ```python
-fc_1 = tf.feature_column.categorical_column_with_vocabulary_list(key_1, vocabulary_list,
-    dtype, default_value, num_oov_buckets)
-fc_2 = tf.feature_column.categorical_column_with_hash_bucket(key_2, hash_bucket_size,
-    dtype)
-fc = tf.feature_column.crossed_column([fc_1, fc_2], hash_bucket_size, hash_key)
+categorical_column = tf.feature_column.categorical_column_with_vocabulary_list(key, vocabulary_list)
+tf.feature_column.weighted_categorical_column(categorical_column, weight_feature_key)
 ```
-Proposed:
+to
 ```python
-x1 = tf.keras.Input(shape=(1,), name=key_1, dtype=dtype)
-x2 = tf.keras.Input(shape=(1,), name=key_2, dtype=dtype)
-layer1 = tf.keras.layers.IndexLookup(
-    vocabulary=vocabulary_list,
-    num_oov_tokens=num_oov_buckets)
-x1 = layer1(x1)
-layer2 = tf.keras.layers.Hashing(
-    num_bins=hash_bucket_size)
-x2 = layer2(x2)
-layer = tf.keras.layers.CategoryCrossing(num_bins=hash_bucket_size)
-out = layer([x1, x2])
+categorical_input = tf.keras.Input(name=key, dtype=tf.string)
+lookup_output = tf.keras.layers.experimental.preprocessing.StringLookup(vocabulary=vocabulary_list)(categorical_input)
+weight_input = tf.keras.Input(shape=(1,), dtype=tf.float32, name=weight_feature_key)
+weighted_output = tf.keras.layers.experimental.preprocessing.CategoryEncoding(
+    max_tokens=categorical_column.num_buckets)(lookup_output, weight_input)
 ```

-6. weighted categorical column
+8. Replacing `tf.feature_column.shared_embeddings` with a single `tf.keras.layers.Embedding`
Similar to item 5, but with multiple categorical inputs:
from
```python
watched_video_id = tf.feature_column.categorical_column_with_vocabulary_list('watched_video_id', video_vocab_list)
impression_video_id = tf.feature_column.categorical_column_with_vocabulary_list('impression_video_id', video_vocab_list)
tf.feature_column.shared_embeddings([watched_video_id, impression_video_id], dimension)
```
to
```python
watched_video_input = tf.keras.Input(shape=(1,), name='watched_video_id', dtype=tf.int64)
impression_video_input = tf.keras.Input(shape=(1,), name='impression_video_id', dtype=tf.int64)
embed_layer = tf.keras.layers.Embedding(input_dim=len(video_vocab_list), output_dim=dimension)
embedded_watched_video_input = embed_layer(watched_video_input)
embedded_impression_video_input = embed_layer(impression_video_input)
```

9. Replacing `tf.estimator.LinearXXX` with `CategoryEncoding` and `tf.keras.experimental.LinearModel`
LinearClassifier or LinearRegressor treats categorical columns as multi-hot; this can be replaced by an encoding layer and a Keras linear model. See Workflow 2 for details.

10. Replacing `tf.feature_column.numeric_column` and `tf.feature_column.sequence_numeric_column` with `tf.keras.Input` and `Normalization`
Use `tf.keras.layers.experimental.preprocessing.Normalization` with `set_weights` on mean and standard deviation, or `adapt` as sketched below.
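A minimal sketch using `adapt` to compute the statistics, as an alternative to `set_weights` (`numeric_data` is a hypothetical array holding the training values of this feature):
```python
keras_input = tf.keras.Input(shape=(1,), name=key, dtype=tf.float32)
normalizer = tf.keras.layers.experimental.preprocessing.Normalization()
normalizer.adapt(numeric_data)  # computes the mean and variance to normalize with
normalized_input = normalizer(keras_input)
```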
11. Replacing `tf.feature_column.sequence_categorical_xxx`
Replacing `tf.feature_column.sequence_categorical_xxx` is similar to `tf.feature_column.categorical_xxx`, except that `tf.keras.Input` should take the time dimension into
`input_shape` as well.

12. Replacing `tf.feature_column.bucketized_column` with `Discretization`
from
```python
-fc = tf.feature_column.categorical_column_with_vocab_list(key, vocabulary_list,
-    dtype, default_value, num_oov_buckets)
-weight_fc = tf.feature_column.weighted_categorical_column(fc, weight_feature_key,
-    dtype=weight_dtype)
-linear_model = tf.estimator.LinearClassifier(units, feature_columns=[weight_fc])
+source_column = tf.feature_column.numeric_column(key)
+tf.feature_column.bucketized_column(source_column, boundaries)
```
-Proposed:
+to
```python
-x1 = tf.keras.Input(shape=(1,), name=key, dtype=dtype)
-x2 = tf.keras.Input(shape=(1,), name=weight_feature_key, dtype=weight_dtype)
-layer = tf.keras.layers.IndexLookup(
-    vocabulary=vocabulary_list,
-    num_oov_tokens=num_oov_buckets)
-x1 = layer(x1)
-x = tf.keras.layers.CategoryEncoding(num_categories=len(vocabulary_list)+num_oov_buckets)([x1, x2])
-linear_model = tf.keras.premade.LinearModel(units)
-linear_logits = linear_model(x)
+keras_input = tf.keras.Input(shape=(1,), name=key, dtype=tf.float32)
+bucketized_input = tf.keras.layers.experimental.preprocessing.Discretization(bins=boundaries)(keras_input)
```


### Workflow 2 -- Complete Example

This example gives an equivalent code snippet to the canned `LinearEstimator` [tutorial](https://www.tensorflow.org/tutorials/estimator/linear) on the Titanic dataset:

Refer to this [colab](https://colab.sandbox.google.com/drive/1cEJhSYLcc2MKH7itwcDvue4PfvrLN-OR) to reproduce.

```python
dftrain = pd.read_csv('https://storage.googleapis.com/tf-datasets/titanic/train.csv')
y_train = dftrain.pop('survived')

STRING_CATEGORICAL_COLUMNS = ['sex', 'class', 'deck', 'embark_town', 'alone']
INT_CATEGORICAL_COLUMNS = ['n_siblings_spouses', 'parch']
NUMERIC_COLUMNS = ['age', 'fare']

keras_inputs = {}
keras_preproc_inputs = []
for key in STRING_CATEGORICAL_COLUMNS:
  keras_input = tf.keras.Input(shape=(1,), dtype=tf.string, name=key)
  keras_inputs[key] = keras_input
  vocab = dftrain[key].unique()
  keras_preproc_input = tf.keras.layers.experimental.preprocessing.StringLookup(vocabulary=vocab, num_oov_indices=0, mask_token=None, name='lookup' + key)(keras_input)
  keras_preproc_input = tf.keras.layers.experimental.preprocessing.CategoryEncoding(max_tokens=len(vocab), output_mode='count', sparse=True, name='encode' + key)(keras_preproc_input)
  keras_preproc_inputs.append(keras_preproc_input)

for key in INT_CATEGORICAL_COLUMNS:
  keras_input = tf.keras.Input(shape=(1,), dtype=tf.int64, name=key)
  keras_inputs[key] = keras_input
  vocab = dftrain[key].unique()
  keras_preproc_input = tf.keras.layers.experimental.preprocessing.IntegerLookup(vocabulary=vocab, num_oov_indices=0, mask_value=None, name='lookup' + key)(keras_input)
  keras_preproc_input = tf.keras.layers.experimental.preprocessing.CategoryEncoding(max_tokens=len(vocab), output_mode='count', sparse=True, name='encode' + key)(keras_preproc_input)
  keras_preproc_inputs.append(keras_preproc_input)

for key in NUMERIC_COLUMNS:
  keras_input = tf.keras.Input(shape=(1,), dtype=tf.float32, name=key)
  keras_inputs[key] = keras_input
  keras_preproc_inputs.append(keras_input)  # numeric features feed the linear model directly

age_x_sex = tf.keras.layers.experimental.preprocessing.CategoryCrossing(name='age_x_sex_crossing')([keras_inputs['age'], keras_inputs['sex']])
age_x_sex = tf.keras.layers.experimental.preprocessing.Hashing(num_bins=100, name='age_x_sex_hashing')(age_x_sex)
keras_output_age_x_sex = tf.keras.layers.experimental.preprocessing.CategoryEncoding(max_tokens=100, output_mode='count', sparse=True, name='age_x_sex_encoding')(age_x_sex)
keras_preproc_inputs.append(keras_output_age_x_sex)


linear_model = tf.keras.experimental.LinearModel(units=1, kernel_initializer='zeros', activation='sigmoid')
linear_logits = linear_model(keras_preproc_inputs)
sorted_keras_inputs = tuple(keras_inputs[key] for key in sorted(keras_inputs.keys()))
model = tf.keras.Model(sorted_keras_inputs, linear_logits)

model.compile('ftrl', 'binary_crossentropy', metrics=['accuracy'])

df_dataset = tf.data.Dataset.from_tensor_slices((dict(dftrain), y_train))
def encode_map(features, labels):
  encoded_features = tuple(tf.expand_dims(features[key], axis=1) for key in sorted(features.keys()))
  return (encoded_features, labels)
encoded_dataset = df_dataset.batch(32).map(encode_map)

model.fit(encoded_dataset)
```

## Design Proposal

We propose a `StringLookup` layer and an `IntegerLookup` layer to replace `tf.feature_column.categorical_column_with_vocabulary_list` and `tf.feature_column.categorical_column_with_vocabulary_file`, a `Hashing` layer to replace `tf.feature_column.categorical_column_with_hash_bucket`, a `CategoryCrossing` layer to replace `tf.feature_column.crossed_column`, and a `CategoryEncoding` layer to convert the sparse input to the format required by linear models.
```python
-`tf.keras.layers.IndexLookup`
-IndexLookup(PreprocessingLayer):
+`tf.keras.layers.StringLookup`
+StringLookup(PreprocessingLayer):
"""This layer transforms categorical inputs to index space.
   If input is dense/sparse, then output is dense/sparse."""

-  def __init__(self, max_tokens=None, num_oov_tokens=1, vocabulary=None,
-               name=None, **kwargs):
+  def __init__(self, max_tokens=None, num_oov_indices=1, mask_token="",
+               oov_token="[UNK]", vocabulary=None, encoding=None,
+               invert=False, name=None, **kwargs):
    """Constructs a StringLookup layer.

    Args:
-      max_tokens: The maximum size of the vocabulary for this layer. If None,
-        there is no cap on the size of the vocabulary. This is used when `adapt`
-        is called.
-      num_oov_tokens: Non-negative integer. The number of out-of-vocab tokens.
-        All out-of-vocab inputs will be assigned IDs in the range of
-        [0, num_oov_tokens) based on a hash. When
-        `vocabulary` is None, it will convert inputs in [0, num_oov_tokens)
-      vocabulary: the vocabulary to Lookup the input. If it is a file, specify the file
-        path to represent the source vocab file, example 'tmp/vocab_file.txt';
-        If it is a list/tuple, it represents the source vocab list, example '[A, B, C]';
-        If it is None, the vocabulary can later be set.
-      name: Name to give to the layer.
-      **kwargs: Keyword arguments to construct a layer.
+      max_tokens: The maximum size of the vocabulary for this layer. If None,
+        there is no cap on the size of the vocabulary. Note that this vocabulary
+        includes the OOV and mask tokens, so the effective number of tokens is
+        (max_tokens - num_oov_indices - (1 if mask_token else 0)).
+      num_oov_indices: The number of out-of-vocabulary tokens to use; defaults to
+        1. If this value is more than 1, OOV inputs are hashed to determine their
+        OOV value; if this value is 0, passing an OOV input will result in a '-1'
+        being returned for that value in the output tensor. (Note that, because
+        the value is -1 and not 0, this will allow you to effectively drop OOV
+        values from categorical encodings.)
+      mask_token: A token that represents masked values, and which is mapped to
+        index 0. Defaults to the empty string "". If set to None, no mask term
+        will be added and the OOV tokens, if any, will be indexed from
+        (0...num_oov_indices) instead of (1...num_oov_indices+1).
+      oov_token: The token representing an out-of-vocabulary value. Defaults to
+        "[UNK]".
+      vocabulary: An optional list of vocabulary terms, or a path to a text file
+        containing a vocabulary to load into this layer. The file should contain
+        one token per line. If the list or file contains the same token multiple
+        times, an error will be thrown.
+      encoding: The Python string encoding to use. Defaults to `'utf-8'`.
+      invert: If true, this layer will map indices to vocabulary items instead
+        of mapping vocabulary items to indices.
+      name: Name of the layer.
+      **kwargs: Keyword arguments to construct a layer.

-    Input: a string or int tensor of shape `[batch_size, d1, ..., dm]`
-    Output: an int tensor of shape `[batch_size, d1, ..., dm]`
+    Input shape:
+      a string or int tensor of shape `[batch_size, d1, ..., dm]`
+    Output shape:
+      an int tensor of shape `[batch_size, d1, ..., dm]`

    Example:
-    If one input sample is `["a", "c", "d", "a", "x"]` and the vocabulary is ["a", "b", "c", "d"],
-    and a single OOV token is used (`num_oov_tokens=1`), then the corresponding output sample is
-    `[1, 3, 4, 1, 0]`. 0 stands for an OOV token.
+    >>> vocab = ["a", "b", "c", "d"]
+    >>> data = tf.constant([["a", "c", "d"], ["d", "z", "b"]])
+    >>> layer = StringLookup(vocabulary=vocab)
+    >>> layer(data)
+
    """
    pass

+`tf.keras.layers.IntegerLookup`
+IntegerLookup(PreprocessingLayer):
+"""This layer transforms categorical inputs to index space.
+   If input is dense/sparse, then output is dense/sparse."""
+
+  def __init__(self, max_values=None, num_oov_indices=1, mask_value=0,
+               oov_value=-1, vocabulary=None, invert=False, name=None, **kwargs):
+    """Constructs an IntegerLookup layer.
+
+    Args:
+      max_values: The maximum size of the vocabulary for this layer. If None,
+        there is no cap on the size of the vocabulary. Note that this vocabulary
+        includes the OOV and mask values, so the effective number of values is
+        (max_values - num_oov_indices - (1 if mask_value else 0)).
+      num_oov_indices: The number of out-of-vocabulary values to use; defaults to
+        1. If this value is more than 1, OOV inputs are modulated to determine
+        their OOV value; if this value is 0, passing an OOV input will result in
+        a '-1' being returned for that value in the output tensor. (Note that,
+        because the value is -1 and not 0, this will allow you to effectively drop
+        OOV values from categorical encodings.)
+      mask_value: A value that represents masked inputs, and which is mapped to
+        index 0. Defaults to 0. If set to None, no mask term will be added and the
+        OOV values, if any, will be indexed from (0...num_oov_indices) instead of
+        (1...num_oov_indices+1).
+      oov_value: The value representing an out-of-vocabulary value. Defaults to -1.
+      vocabulary: An optional list of values, or a path to a text file containing
+        a vocabulary to load into this layer. The file should contain one value
+        per line. If the list or file contains the same value multiple times, an
+        error will be thrown.
+      invert: If true, this layer will map indices to vocabulary items instead
+        of mapping vocabulary items to indices.
+      name: Name of the layer.
+      **kwargs: Keyword arguments to construct a layer.
+
+    Input shape:
+      an int tensor of shape `[batch_size, d1, ..., dm]`
+    Output shape:
+      an int tensor of shape `[batch_size, d1, ..., dm]`
+
+    Example:
+    >>> vocab = [12, 36, 1138, 42]
+    >>> data = tf.constant([[12, 1138, 42], [42, 1000, 36]])
+    >>> layer = IntegerLookup(vocabulary=vocab)
+    >>> layer(data)
+
+    """
+    pass
+
+
`tf.keras.layers.CategoryCrossing`
CategoryCrossing(PreprocessingLayer):
"""This layer transforms multiple categorical inputs to categorical outputs
   by Cartesian product, and hash the output if necessary.
   If any of the inputs is sparse, then all outputs will be sparse. Otherwise, all outputs will be dense."""

-  def __init__(self, depth=None, num_bins=None, name=None, **kwargs):
+  def __init__(self, depth=None, separator=None, name=None, **kwargs):
    """Constructs a CategoryCrossing layer.

    Args:
-      depth: depth of input crossing. By default None, all inputs are crossed
-        into one output. It can be an int or tuple/list of ints, where inputs are
-        combined into all combinations of output with degree of `depth`. For example,
-        with inputs `a`, `b` and `c`, `depth=2` means the output will be [ab;ac;bc]
-      num_bins: Number of hash bins. By default None, no hashing is performed.
+      depth: depth of input crossing. By default None, all inputs are crossed into
+        one output. It can also be an int or tuple/list of ints. Passing an
+        integer will create combinations of crossed outputs with depth up to that
+        integer, i.e., [1, 2, ..., `depth`], and passing a tuple of integers will
+        create crossed outputs with depth for the specified values in the tuple,
+        i.e., `depth`=(N1, N2) will create all possible crossed outputs with depth
+        equal to N1 or N2. Passing `None` means a single crossed output with all
+        inputs. For example, with inputs `a`, `b` and `c`, `depth=2` means the
+        output will be [a; b; c; cross(a, b); cross(b, c); cross(c, a)].
+      separator: A string added between each input being joined. Defaults to '_X_'.
       name: Name to give to the layer.
       **kwargs: Keyword arguments to construct a layer.

-    Input: a list of int tensors of shape `[batch_size, d1, ..., dm]`
-    Output: a single int tensor of shape `[batch_size, d1, ..., dm]`
+    Input shape: a list of string or int tensors or sparse tensors of shape
+      `[batch_size, d1, ..., dm]`

-    Example:
-    If the layer receives two inputs, `a=[[1, 2]]` and `b=[[1, 3]]`, and
-    if depth is 2, then the output will be a string tensor `[[b'1_X_1',
-    b'1_X_3', b'2_X_1', b'2_X_3']]` if not hashed, or integer tensor
-    `[[hash(b'1_X_1'), hash(b'1_X_3'), hash(b'2_X_1'), hash(b'2_X_3')]]` if hashed.
+    Output shape: a single string or int tensor or sparse tensor of shape
+      `[batch_size, d1, ..., dm]`
+
+    Example: (`depth`=None)
+    If the layer receives three inputs:
+    `a=[[1], [4]]`, `b=[[2], [5]]`, `c=[[3], [6]]`
+    the output will be a string tensor:
+    `[[b'1_X_2_X_3'], [b'4_X_5_X_6']]`
    """
    pass

@@ -258,28 +428,38 @@ CategoryEncoding(PreprocessingLayer):
"""This layer transforms categorical inputs from index space to category space.
   If input is dense/sparse, then output is dense/sparse."""

-  def __init__(self, num_categories, mode="count", axis=-1, sparse_out=True, name=None, **kwargs):
+  def __init__(self, max_tokens=None, output_mode="binary", sparse=False, name=None, **kwargs):
    """Constructs a CategoryEncoding layer.

    Args:
-      num_categories: Number of elements in the vocabulary.
-      mode: how to reduce a categorical input if multivalent, can be one of "count",
-        "avg_count", "binary", "tfidf". It can also be None if this is not a multivalent input,
-        and simply needs to convert input from index space to category space. "tfidf" is only
-        valid when adapt is called on this layer.
-      axis: the axis to reduce, by default will be the last axis, specially true
-        for sequential feature columns.
-      sparse_out: boolean to indicate whether the output should be dense or sparse tensor.
+      max_tokens: The maximum size of the vocabulary for this layer. If None,
+        there is no cap on the size of the vocabulary.
+      output_mode: Specification for the output of the layer.
+        Defaults to "binary". Values can be "binary", "count" or "tf-idf",
+        configuring the layer as follows:
+          "binary": Outputs a single int array per batch, of either vocab_size or
+            max_tokens size, containing 1s in all elements where the token mapped
+            to that index exists at least once in the batch item.
+          "count": As "binary", but the int array contains a count of the number
+            of times the token at that index appeared in the batch item.
+          "tf-idf": As "binary", but the TF-IDF algorithm is applied to find the
+            value in each token slot.
+      sparse: Boolean. If true, returns a `SparseTensor` instead of a dense
+        `Tensor`. Defaults to `False`.
       name: Name to give to the layer.
       **kwargs: Keyword arguments to construct a layer.
-    Input: a int tensor of shape `[batch_size, d1, ..., dm-1, dm]`
-    Output: a float tensor of shape `[batch_size, d1, ..., dm-1, num_categories]`
+    Input shape: An int tensor of shape `[batch_size, d1, ..., dm-1, dm]`
+    Output shape: A float tensor of shape `[batch_size, d1, ..., dm-1, max_tokens]`

    Example:
-    If the input is 2 by 2 dense integer tensor '[[0, 2], [2, 2]]' with `num_categories=3`, then
-    output is 2 by 3 dense integer tensor '[[1, 0, 1], [0, 0, 2]]' with a `count` encoding, or
-    dense float tensor '[[.5, 0, .5], [0, 0, 1.]]' with a `avg_count` encoding, or dense integer tensor
-    '[[1, 0, 1], [0, 0, 1]]' with a `binary` encoding.
+    >>> layer = tf.keras.layers.experimental.preprocessing.CategoryEncoding(
+    ...           max_tokens=4, output_mode="count")
+    >>> layer([[0, 1], [0, 0], [1, 2], [3, 1]])
+
    """
    pass

@@ -287,46 +467,48 @@ CategoryEncoding(PreprocessingLayer):
Hashing(PreprocessingLayer):
"""This layer transforms categorical inputs to hashed output.
   If input is dense/sparse, then output is dense/sparse."""

-  def __init__(self, num_bins, name=None, **kwargs):
+  def __init__(self, num_bins, salt=None, name=None, **kwargs):
    """Constructs a Hashing layer.

    Args:
      num_bins: Number of hash bins.
+      salt: A single unsigned integer or None.
+        If passed, the hash function used will be SipHash64, with these values
+        used as an additional input (known as a "salt" in cryptography).
+        These should be non-zero. Defaults to `None` (in that
+        case, the FarmHash64 hash function is used). It also supports
+        a tuple/list of 2 unsigned integers; see the reference paper for details.
      name: Name to give to the layer.
      **kwargs: Keyword arguments to construct a layer.

-    Input: a int tensor of shape `[batch_size, d1, ..., dm]`
-    Output: a int tensor of shape `[batch_size, d1, ..., dm]`
+    Input shape: A single or list of string, int32 or int64 `Tensor`,
+      `SparseTensor` or `RaggedTensor` of shape `[batch_size, ...]`
+
+    Output shape: An int64 `Tensor`, `SparseTensor` or `RaggedTensor` of shape
+      `[batch_size, ...]`. If any input is `RaggedTensor` then output is
+      `RaggedTensor`, otherwise if any input is `SparseTensor` then output is
+      `SparseTensor`, otherwise the output is `Tensor`.

    Example:
-    If the input is a 5 by 1 string tensor '[['A'], ['B'], ['C'], ['D'], ['E']]' with `num_bins=2`,
-    then output is 5 by 1 integer tensor '[[0], [0], [1], [1], [0]]', i.e.,
-    [[hash('A')], [hash('B')], [hash('C')], [hash('D')], [hash('E')]].
+    >>> layer = tf.keras.layers.experimental.preprocessing.Hashing(num_bins=3)
+    >>> inp = [['A'], ['B'], ['C'], ['D'], ['E']]
+    >>> layer(inp)
+
    """
    pass
```

-We also propose to extend the current `tf.sparse.from_dense` op with a `ignore_value` to convert dense tensors to sparse tensors given user specified ignore values. This op can be used in both `tf.data` or [TF Transform](https://www.tensorflow.org/tfx/transform/get_started). In previous feature column world, "" is ignored for dense string input and -1 is ignored for dense int input.
-
-```python
-`tf.sparse.from_dense`
-def from_dense(tensor, name=None, ignore_value=0):
-  """Convert dense/sparse tensor to sparse while dropping user specified values.
-
-  Args:
-    tensor: A dense `Tensor` to be converted to a `SparseTensor`.
-    name: Optional name for the op.
-    ignore_value: The value to be dropped from input. Default to 0 for backward compatibility.
-  """
-  pass
-```
-
### Alternatives Considered
An alternative is to provide solutions on top of feature columns. This will make user code slightly cleaner but far less flexible.

### Performance Implications
-End to End benchmark should be same as other preprocessing layers.
+End-to-end benchmarks should be the same as or faster than the feature column implementations.

### Dependencies
This proposal does not add any new dependencies.
@@ -347,56 +529,6 @@ No backward compatibility issues.

### User Impact
User-facing changes to migrate feature-column-based Keras modeling to preprocessing-layer-based Keras modeling, as the example workflows suggest.
-
-## Code Snippets
-
-Below is a more detailed illustration of how each layer works. If there is a vocabulary list of countries:
-```python
-vocabulary_list = ["Italy", "France", "England", "Austria", "Germany"]
-inp = np.asarray([["Italy", "Italy"], ["Germany", ""]])
-sp_inp = tf.sparse.from_dense(inp, ignore_value="")
-cat_layer = tf.keras.layers.IndexLookup(vocabulary=vocabulary_list)
-sp_out = cat_layer(sp_inp)
-```
-
-The categorical layer will first convert the input to:
-```python
-sp_out.indices =
-sp_out.values =
-```
-
-The `CategoryEncoding` layer will then convert the input from index space to category space, e.g., from a sparse tensor with indices shape as [batch_size, n_columns] and values in the range of [0, n_categories) to a sparse tensor with indices shape as [batch_size, n_categories] and values as the frequency of each value that occured in the example:
-```python
-encoding_layer = CategoryEncoding(num_categories=len(vocabulary_list))
-sp_encoded_out = encoding_layer(sp_out)
-sp_encoded_out.indices =
-sp_encoded_out.values =
-```
-A weight input can also be passed into the layer if different categories/examples should be treated differently.
-
-If this input needs to be crossed with another categorical input, say a vocabulary list of days, then use `CategoryCrossing` which works in the same way as `tf.feature_column.crossed_column` without setting `depth`:
-```python
-days = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]
-inp_days = tf.sparse.from_dense(np.asarray([["Sunday"], [""]]), ignore_value="")
-layer_days = IndexLookup(vocabulary=days)
-sp_out_2 = layer_days(inp_days)
-
-sp_out_2.indices =
-sp_out_2.values =
-
-cross_layer = CategoryCrossing(num_bins=5)
-# Use the output from IndexLookup (sp_out), not CategoryEncoding (sp_combined_out)
-crossed_out = cross_layer([sp_out, sp_out_2])
-
-cross_out.indices =
-cross_out.values =
-```
-
## Questions and Meeting Notes
We'd like to gather feedback on `IndexLookup`; specifically, we propose migrating off the mutually exclusive `num_oov_buckets` and `default_value` and replacing them with `num_oov_tokens`.
1. Naming for encoding v.s. vectorize: encoding can mean many things, vectorize seems too general. We will go with "CategoryEncoding".
3. Rename "sparse_combiner" to "mode", which aligns with scikit-learn.
4. Have a 'sparse_out' flag for the "CategoryEncoding" layer.
5. Hashing -- we refer to hashing when we mean fingerprinting. Keep using "Hashing" for the layer name, but document how it relies on tf.fingerprint, and also provide an option for salt.
-5. Rename "CategoryLookup" to "IndexLookup"
\ No newline at end of file
+6. Rename "CategoryLookup" to "IndexLookup"
+
+## Updates on 07/14/20
+Mark the RFC as completed, update the layer naming and arguments.

From 4d8e61ec183880dd7a326a980653ce615f430ecc Mon Sep 17 00:00:00 2001
From: tanzhenyu
Date: Wed, 15 Jul 2020 11:05:18 -0700
Subject: [PATCH 2/3] Address comments.

---
 rfcs/20191212-keras-categorical-inputs.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/rfcs/20191212-keras-categorical-inputs.md b/rfcs/20191212-keras-categorical-inputs.md
index 19e5c04a9..ebd6ce1b3 100644
--- a/rfcs/20191212-keras-categorical-inputs.md
+++ b/rfcs/20191212-keras-categorical-inputs.md
@@ -1,6 +1,6 @@
 # Keras categorical inputs

-| Status        | Completed |
+| Status        | Implemented (https://github.com/tensorflow/community/pull/209) |
 :-------------- |:---------------------------------------------------- |
 | **Author(s)** | Zhenyu Tan (tanzheny@google.com), Francois Chollet (fchollet@google.com)|
 | **Sponsor**   | Karmel Allison (karmel@google.com), Martin Wicke (wicke@google.com) |
@@ -23,7 +23,7 @@ Specifically, by introducing the 5 layers, we aim to address these pain points:
 * Users have to define both feature columns and Keras Inputs for the model, resulting in code duplication and deviation from DRY (Do not repeat yourself) principle. See this [Github issue](https://github.com/tensorflow/tensorflow/issues/27416).
 * Users with large dimension categorical inputs will incur large memory footprint and computation cost, if wrapped with indicator column through `tf.keras.layers.DenseFeatures`.
 * Currently there is no way to correctly feed Keras linear model or dense layer with multivalent categorical inputs or weighted categorical inputs, or shared embedding inputs.
-* feature columns offer black-box implementations, mix feature engineering with trainable objects, and lead to
+* Feature columns offer black-box implementations, mix feature engineering with trainable objects, and lead to
  unintended coding patterns.

 ## User Benefit

From 5507ba06c8f43830f4a77ea85bc8d345f5b60765 Mon Sep 17 00:00:00 2001
From: tanzhenyu
Date: Wed, 15 Jul 2020 11:26:21 -0700
Subject: [PATCH 3/3] Fix some punctuation.

---
 rfcs/20191212-keras-categorical-inputs.md | 14 +++++++-------
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/rfcs/20191212-keras-categorical-inputs.md b/rfcs/20191212-keras-categorical-inputs.md
index ebd6ce1b3..f0eeb948f 100644
--- a/rfcs/20191212-keras-categorical-inputs.md
+++ b/rfcs/20191212-keras-categorical-inputs.md
@@ -55,7 +55,7 @@ Note that the hashed output from KPL will be different from the hashed output
 This feature column merely maps out-of-range values to `default_value`, leaving all other inputs unchanged. This can easily be done at the data cleaning stage,
 rather than as part of feature engineering, and hence it is dropped from this proposal.

-3. Replacing `tf.feature_column.categorical_column_with_vocabulary_file` and `tf.feature_column.categorical_column_with_vocabulary_list` with `StringLookup` or `IntegerLookup`
+3. Replacing `tf.feature_column.categorical_column_with_vocabulary_file` and `tf.feature_column.categorical_column_with_vocabulary_list` with `StringLookup` or `IntegerLookup`.
 for string inputs,
 from
 ```python
@@ -86,7 +86,7 @@ any values other than `default_value=-1`.
 Note that the out-of-range values for `StringLookup` are prepended, i.e., [0, ..., num_oov_tokens) is used for out-of-range values, whereas for `categorical_column_with_vocabulary_file` they are
 appended, i.e., [vocabulary_size, vocabulary_size + num_oov_tokens) is used for out-of-range values. The former can give you more flexibility when reloading and adding vocab.
-for integer inputs,
+For integer inputs,
 from
 ```python
 tf.feature_column.categorical_column_with_vocabulary_file(key, vocabulary_file, vocabulary_size, tf.dtypes.int64, default_value, num_oov_buckets)
@@ -180,7 +180,7 @@ weighted_output = tf.keras.layers.experimental.preprocessing.CategoryEncoding(
     max_tokens=categorical_column.num_buckets)(lookup_output, weight_input)
 ```

-8. Replacing `tf.feature_column.shared_embeddings` with a single `tf.keras.layers.Embedding`
+8. Replacing `tf.feature_column.shared_embeddings` with a single `tf.keras.layers.Embedding`.
 Similar to item 5, but with multiple categorical inputs:
 from
 ```python
 watched_video_id = tf.feature_column.categorical_column_with_vocabulary_list('watched_video_id', video_vocab_list)
 impression_video_id = tf.feature_column.categorical_column_with_vocabulary_list('impression_video_id', video_vocab_list)
 tf.feature_column.shared_embeddings([watched_video_id, impression_video_id], dimension)
 ```
 to
 ```python
 watched_video_input = tf.keras.Input(shape=(1,), name='watched_video_id', dtype=tf.int64)
 impression_video_input = tf.keras.Input(shape=(1,), name='impression_video_id', dtype=tf.int64)
 embed_layer = tf.keras.layers.Embedding(input_dim=len(video_vocab_list), output_dim=dimension)
 embedded_watched_video_input = embed_layer(watched_video_input)
 embedded_impression_video_input = embed_layer(impression_video_input)
 ```

-9. Replacing `tf.estimator.LinearXXX` with `CategoryEncoding` and `tf.keras.experimental.LinearModel`
+9. Replacing `tf.estimator.LinearXXX` with `CategoryEncoding` and `tf.keras.experimental.LinearModel`.
 LinearClassifier or LinearRegressor treats categorical columns as multi-hot; this can be replaced by an encoding layer and a Keras linear model. See Workflow 2 for details.

-10. Replacing `tf.feature_column.numeric_column` and `tf.feature_column.sequence_numeric_column` with `tf.keras.Input` and `Normalization`
+10. Replacing `tf.feature_column.numeric_column` and `tf.feature_column.sequence_numeric_column` with `tf.keras.Input` and `Normalization`.
 Use `tf.keras.layers.experimental.preprocessing.Normalization` with `set_weights` on mean and standard deviation, or `adapt` as sketched below.

-11. Replacing `tf.feature_column.sequence_categorical_xxx`
+11. Replacing `tf.feature_column.sequence_categorical_xxx`.
 Replacing `tf.feature_column.sequence_categorical_xxx` is similar to `tf.feature_column.categorical_xxx`, except that `tf.keras.Input` should take the time dimension into
 `input_shape` as well.

-12. Replacing `tf.feature_column.bucketized_column` with `Discretization`
+12. Replacing `tf.feature_column.bucketized_column` with `Discretization`.
 from
 ```python
 source_column = tf.feature_column.numeric_column(key)
 tf.feature_column.bucketized_column(source_column, boundaries)