https://www.tensorflow.org/recommenders/examples/featurization

One of the great advantages of using a deep learning framework to build recommender models is the freedom to build rich, flexible feature representations.

The first step in doing so is preparing the features, as raw features will usually not be immediately usable in a model.

For example:

- User and item ids may be strings (titles, usernames) or large, noncontiguous integers (database IDs).
- Item descriptions could be raw text.
- Interaction timestamps could be raw Unix timestamps.

These need to be appropriately transformed in order to be useful in building models:

- User and item ids have to be translated into embedding vectors: high-dimensional numerical representations that are adjusted during training to help the model predict its objective better.
- Raw text needs to be tokenized (split into smaller parts such as individual words) and translated into embeddings.
- Numerical features need to be normalized so that their values lie in a small interval around 0.

Fortunately, by using TensorFlow we can make such preprocessing part of our model rather than a separate step. This is not only convenient, but also ensures that our preprocessing is exactly the same during training and during serving. This makes it safe and easy to deploy models that include even very sophisticated preprocessing.

In this tutorial, we are going to focus on recommenders and the preprocessing we need to do on a dataset of viewer-broadcaster interactions, which stands in here for the MovieLens dataset used in the original tutorial. If you're interested in a larger tutorial without a recommender system focus, have a look at the full Keras preprocessing guide.

In [1]:
import os
import tempfile
import pprint

from typing import Dict, Text

In [2]:
import numpy as np
import pandas as pd

In [3]:
import tensorflow_datasets as tfds

In [4]:
import tensorflow as tf

In [5]:
def load_data_file_cold(file, stats):
    # `stats` is only referenced by the commented-out call further down.
    print('loading file:' + file)
    columns = [
        "viewer", "broadcaster", "viewer_age", "viewer_gender",
        "viewer_longitude", "viewer_latitude", "viewer_lang", "viewer_country",
        "broadcaster_age", "broadcaster_gender", "broadcaster_longitude",
        "broadcaster_latitude", "broadcaster_lang", "broadcaster_country",
        "duration", "viewer_network", "broadcaster_network", "count",
    ]
    training_df = pd.read_csv(
        file,
        skiprows=[0],  # skip the header row; column names are given explicitly
        names=columns,
        # The `np.unicode` and `np.int` aliases are gone from recent NumPy
        # releases, so the built-in `str` and `np.int64` are used instead.
        dtype={
            'viewer': str,
            'broadcaster': str,
            'viewer_age': np.single,
            'viewer_gender': str,
            'viewer_longitude': np.single,
            'viewer_latitude': np.single,
            'viewer_lang': str,
            'viewer_country': str,
            'broadcaster_age': np.single,
            'broadcaster_longitude': np.single,
            'broadcaster_latitude': np.single,
            'broadcaster_lang': str,
            'broadcaster_country': str,
            'viewer_network': str,
            'broadcaster_network': str,
            'count': np.int64
        })

    values = {
        'viewer': 'unknown',
        'broadcaster': 'unknown',
        'viewer_age': 30,
        'viewer_gender': 'unknown',
        'viewer_longitude': 0,
        'viewer_latitude': 0,
        'viewer_lang': 'unknown',
        'viewer_country': 'unknown',
        'broadcaster_age': 30,
        'broadcaster_longitude': 0,
        'broadcaster_latitude': 0,
        'broadcaster_lang': 'unknown',
        'broadcaster_country': 'unknown',
        'duration': 0,
        'viewer_network': 'unknown',
        'broadcaster_network': 'unknown',
        'count': 0
    }
    training_df.fillna(value=values, inplace=True)
#     print(training_df.head(10))
#     print(training_df.iloc[-10:])
#     stats.send_stats('data-size', len(training_df.index))

    sampled_df = training_df.sample(frac=0.1)
    print(sampled_df.head(10))
    print(sampled_df.iloc[-10:])
    return sampled_df

def load_training_data_cold(file, stats):
    ratings_df = load_data_file_cold(file, stats)
    print('creating data set')
    training_ds = (
        tf.data.Dataset.from_tensor_slices(
            ({
                "viewer": tf.cast(
                    ratings_df['viewer'].values,
                    tf.string),
                "viewer_gender": tf.cast(
                    ratings_df['viewer_gender'].values,
                    tf.string),
                "viewer_lang": tf.cast(
                    ratings_df['viewer_lang'].values,
                    tf.string),
                "viewer_country": tf.cast(
                    ratings_df['viewer_country'].values,
                    tf.string),
                "viewer_age": tf.cast(
                    ratings_df['viewer_age'].values,
                    tf.int32),
                "viewer_longitude": tf.cast(
                    ratings_df['viewer_longitude'].values,
                    tf.float16),
                "viewer_latitude": tf.cast(
                    ratings_df['viewer_latitude'].values,
                    tf.float16),
                "broadcaster": tf.cast(
                    ratings_df['broadcaster'].values,
                    tf.string),
                "viewer_network": tf.cast(
                    ratings_df['viewer_network'].values,
                    tf.string),
                "broadcaster_network": tf.cast(
                    ratings_df['broadcaster_network'].values,
                    tf.string),
                "duration": tf.cast(
                    ratings_df['duration'].values,
                    tf.float16),
                "count": tf.cast(
                    ratings_df['count'].values,
                    tf.int16),
            })))

    return training_ds

In [6]:
ratings = load_training_data_cold(file="a3d86f3b-eb45-4641-b05d-30dff7423e6b.csv", stats="")

for x in ratings.take(1).as_numpy_iterator():
    pprint.pprint(x)

loading file:a3d86f3b-eb45-4641-b05d-30dff7423e6b.csv
                   viewer       broadcaster  viewer_age viewer_gender  \
1858698  meetme:313644104     pof:298284045        29.0          male   
2509002     pof:329096608  meetme:311343456        26.0          male   
1931494  meetme:217635487  meetme:228586518        25.0          male   
2476227     pof:291205266  meetme:276266098        28.0        female   
2130883   skout:164234213  meetme:311655828        28.0          male   
1357574     pof:172586161     pof:331106405        29.0          male   
3818589  meetme:317318339  meetme:283611530        26.0          male   
5300278     pof:325631121     pof:317685822        41.0          male   
3082418  meetme:317976949    skout:62089611        26.0          male   
249200   meetme:209570613  meetme:274948689        35.0          male   

         viewer_longitude  viewer_latitude viewer_lang viewer_country  \
1858698       -117.166397        33.915401          en             US

### Defining the vocabulary

In [7]:
broadcaster_lookup = tf.keras.layers.experimental.preprocessing.StringLookup()

In [8]:
broadcaster_lookup.adapt(ratings.map(lambda x: x["broadcaster"]))

Please report this to the TensorFlow team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output.
Cause: module 'gast' has no attribute 'Constant'
Please report this to the TensorFlow team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output.
Cause: 'arguments' object has no attribute 'posonlyargs'


In [9]:
print(f"Vocabulary: {broadcaster_lookup.get_vocabulary()[:20]}")

Vocabulary: ['[UNK]', 'meetme:277903808', 'meetme:50697624', 'meetme:219070323', 'pof:300442673', 'pof:322045884', 'pof:319663298', 'pof:315853960', 'pof:297373249', 'meetme:283611530', 'pof:79582086', 'skout:150743909', 'pof:299641758', 'meetme:309755964', 'meetme:197536011', 'meetme:294844287', 'meetme:308663123', 'skout:39313218', 'meetme:228586518', 'meetme:195325769']


In [10]:
broadcaster_lookup.vocabulary_size()

69797

In [11]:
broadcaster_lookup(["[UNK]", "meetme:277903808", "meetme:50697624"])

<tf.Tensor: shape=(3,), dtype=int64, numpy=array([0, 1, 2])>
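
To make the lookup step concrete, here is a minimal plain-Python sketch of what a vocabulary lookup does conceptually. The helpers `build_vocabulary` and `lookup` are hypothetical names, not part of Keras, and the toy IDs are made up. The sketch assumes, as the output above suggests, that index 0 is reserved for the out-of-vocabulary token `[UNK]` and that more frequent IDs receive smaller indices.

```python
from collections import Counter

OOV_TOKEN = "[UNK]"  # same out-of-vocabulary token as in the output above


def build_vocabulary(values):
    """Return tokens ordered by descending frequency, with the OOV token first."""
    counts = Counter(values)
    # most_common() sorts by count; ties keep their first-seen order.
    return [OOV_TOKEN] + [token for token, _ in counts.most_common()]


def lookup(vocabulary, values):
    """Map each string to its vocabulary index; unknown strings map to 0."""
    index = {token: i for i, token in enumerate(vocabulary)}
    return [index.get(value, 0) for value in values]


# Toy broadcaster IDs, made up for illustration.
ids = ["meetme:1", "pof:2", "meetme:1", "skout:3", "meetme:1", "pof:2"]
vocabulary = build_vocabulary(ids)
indices = lookup(vocabulary, ["meetme:1", "skout:3", "zoosk:9"])

print(vocabulary)  # ['[UNK]', 'meetme:1', 'pof:2', 'skout:3']
print(indices)     # [1, 3, 0]
```

The unseen ID `zoosk:9` falls back to index 0, which is why the embedding table needs a row for the unknown token in addition to one row per known ID.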

### Using feature hashing

In [12]:
# We set up a large number of bins to reduce the chance of hash collisions.
num_hashing_bins = 200_000

broadcaster_hashing = tf.keras.layers.experimental.preprocessing.Hashing(
    num_bins=num_hashing_bins
)

In [13]:
broadcaster_hashing(["[UNK]", "meetme:277903808", "meetme:50697624"])

<tf.Tensor: shape=(3,), dtype=int64, numpy=array([ 18280, 193815, 180119])>
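
The idea behind feature hashing can be sketched in a few lines of plain Python. This is an illustration only: `hash_bucket` is a hypothetical helper, and MD5 is used here merely as a stand-in for a stable hash. The Keras layer uses a different hash function, so the bin numbers will not match the output above.

```python
import hashlib


def hash_bucket(value, num_bins):
    """Map a string to a bin in [0, num_bins) using a stable hash."""
    digest = hashlib.md5(value.encode("utf-8")).digest()
    # Interpret the first 8 bytes as an unsigned integer, then wrap into range.
    return int.from_bytes(digest[:8], "big") % num_bins


num_bins = 200_000
bins = [hash_bucket(v, num_bins) for v in ["meetme:277903808", "meetme:50697624"]]
print(bins)
```

No vocabulary has to be built or stored, and previously unseen IDs still get a bin. The price is that two different IDs can land in the same bin, which is why a large `num_bins` is chosen above.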

### Defining the embeddings

In [14]:
broadcaster_embedding = tf.keras.layers.Embedding(
    # Let's use the explicit vocabulary lookup.
    input_dim=broadcaster_lookup.vocabulary_size(),
    output_dim=32
)

In [15]:
broadcaster_model = tf.keras.Sequential([broadcaster_lookup, broadcaster_embedding])

In [16]:
broadcaster_model(["meetme:277903808"])

Consider rewriting this model with the Functional API.


<tf.Tensor: shape=(1, 32), dtype=float32, numpy=
array([[-0.00220549,  0.00318506, -0.04274105, -0.03318482,  0.01996905,
        -0.00360984,  0.03728544,  0.04276649,  0.02944965, -0.01693236,
        -0.03837664,  0.02658382, -0.01988866, -0.02986122,  0.02398682,
        -0.00580009,  0.0463425 , -0.02724286,  0.03874153, -0.00180887,
        -0.00071955, -0.02124978, -0.03418276,  0.03018219, -0.02725717,
         0.03669694,  0.04854344,  0.0132411 ,  0.02092062,  0.04907768,
        -0.04367078, -0.01637457]], dtype=float32)>
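
Conceptually, an embedding layer is a trainable table with one row per vocabulary index, and calling it is a row lookup. The following plain-Python sketch shows only that lookup; `make_embedding_table` and `embed` are hypothetical helpers, and the initial range of ±0.05 is an assumption chosen to resemble the values printed above. Training, which is what makes the vectors useful, is not shown.

```python
import random


def make_embedding_table(num_rows, dim, seed=0):
    """Create a table of small random vectors, one row per vocabulary index."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.05, 0.05) for _ in range(dim)] for _ in range(num_rows)]


def embed(table, indices):
    """Look up one row of the table for each integer index."""
    return [table[i] for i in indices]


# Four rows: one for the unknown token plus three known IDs.
table = make_embedding_table(num_rows=4, dim=32)
vectors = embed(table, [1, 3])
print(len(vectors), len(vectors[0]))  # 2 32
```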

In [17]:
# user embedding
user_id_lookup = tf.keras.layers.experimental.preprocessing.StringLookup()
user_id_lookup.adapt(ratings.map(lambda x: x["viewer"]))

user_id_embedding = tf.keras.layers.Embedding(user_id_lookup.vocabulary_size(), 32)
user_id_model = tf.keras.Sequential([user_id_lookup, user_id_embedding])

In [18]:
user_id_model(["meetme:277903808"])

<tf.Tensor: shape=(1, 32), dtype=float32, numpy=
array([[ 0.01943531, -0.00493991, -0.00137669,  0.0355396 , -0.02645339,
        -0.00294147,  0.03740872,  0.02480601, -0.04640242,  0.03370896,
        -0.02711483,  0.00093615,  0.01668551,  0.03670264,  0.01320238,
        -0.0214431 ,  0.04782381,  0.00272961,  0.04388637,  0.02362683,
         0.02637327, -0.02401898, -0.03807665, -0.00941879,  0.0490785 ,
        -0.04063647,  0.03368082, -0.01720614, -0.0470655 , -0.00019421,
        -0.04578037,  0.04345498]], dtype=float32)>

### Normalizing continuous features

In [19]:
for x in ratings.take(3).as_numpy_iterator():
    print(f"viewer_age: {x['viewer_age']}.")

viewer_age: 29.
viewer_age: 26.
viewer_age: 25.


In [20]:
# Standardization
viewer_age_normalization = tf.keras.layers.experimental.preprocessing.Normalization(
    axis=None
)
viewer_age_normalization.adapt(ratings.map(lambda x: x['viewer_age']).batch(32))

for x in ratings.take(3).as_numpy_iterator():
    print(f"Normalized viewer age: {viewer_age_normalization(x['viewer_age'])}.")

Normalized viewer age: [-0.52391666].
Normalized viewer age: [-0.83302295].
Normalized viewer age: [-0.9360584].
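
Standardization itself is simple arithmetic: subtract the mean and divide by the standard deviation, both learned from the data during `adapt`. A minimal sketch in plain Python, with hypothetical helper names and made-up ages, assuming the population standard deviation is used:

```python
import statistics


def fit_standardizer(values):
    """Compute the mean and population standard deviation of the data."""
    return statistics.mean(values), statistics.pstdev(values)


def standardize(value, mean, std):
    """Shift and scale a value so the data has mean 0 and unit variance."""
    return (value - mean) / std


ages = [18, 25, 26, 29, 35, 41]  # toy ages, made up for illustration
mean, std = fit_standardizer(ages)
normalized = [standardize(age, mean, std) for age in ages]
print(mean)
print(normalized)
```

Ages below the mean come out negative and ages above it positive, which matches the negative values printed above for viewers in their twenties.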


In [21]:
# Discretization
max_viewer_age = ratings.map(lambda x: x["viewer_age"]).reduce(
    tf.cast(0, tf.int32), tf.maximum).numpy().max()
min_viewer_age = ratings.map(lambda x: x["viewer_age"]).reduce(
    np.int32(100), tf.minimum).numpy().min()

viewer_age_buckets = np.linspace(
    min_viewer_age, max_viewer_age, num=10)

print(f"Buckets: {viewer_age_buckets[:10]}")

Buckets: [ 18.          38.22222222  58.44444444  78.66666667  98.88888889
 119.11111111 139.33333333 159.55555556 179.77777778 200.        ]


In [22]:
# Given the bucket boundaries we can transform viewer ages into embeddings:
viewer_age_embedding_model = tf.keras.Sequential([
  tf.keras.layers.experimental.preprocessing.Discretization(viewer_age_buckets.tolist()),
  tf.keras.layers.Embedding(len(viewer_age_buckets) + 1, 32)
])

for viewer_age in ratings.take(1).map(lambda x: x["viewer_age"]).batch(1).as_numpy_iterator():
    print(f"Viewer_age embedding: {viewer_age_embedding_model(viewer_age)}.")

Viewer_age embedding: [[-0.03302993  0.03110944  0.01416438 -0.01521438 -0.03592966 -0.02223239
  -0.04625424 -0.00605955  0.00684149 -0.02312992 -0.01667572 -0.03321574
  -0.02386065 -0.02983061  0.0152416   0.03268285  0.04790663  0.04542525
  -0.02657044 -0.04043273 -0.03326219  0.04068789  0.04848525  0.02951981
   0.04435328  0.02764675 -0.00330366 -0.03459014  0.00951219 -0.04073434
   0.017962    0.01527431]].
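
The `+ 1` in the embedding size follows from how bucketing works: ten boundaries split the number line into eleven buckets, including one below the first boundary and one above the last. The sketch below reproduces the evenly spaced boundaries and the bucket assignment in plain Python. The helper names are hypothetical, and it assumes that a value equal to a boundary falls into the upper bucket.

```python
from bisect import bisect_right


def linspace(start, stop, num):
    """Return `num` evenly spaced values from start to stop, inclusive."""
    return [start + (stop - start) * i / (num - 1) for i in range(num)]


def bucket_index(boundaries, value):
    """Return the number of boundaries that are <= value."""
    return bisect_right(boundaries, value)


boundaries = linspace(18, 200, 10)  # same range as the buckets printed above
buckets = [bucket_index(boundaries, age) for age in (17, 18, 29, 200, 250)]
print(buckets)  # [0, 1, 1, 10, 10]
```

Bucket indices therefore range from 0 to 10, so the embedding table needs `len(boundaries) + 1 = 11` rows.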


### Processing text features

In [28]:
def split_on_colons(text):
    return tf.strings.split(text, sep=":")

In [32]:
broadcaster_text = tf.keras.layers.experimental.preprocessing.TextVectorization(standardize= None, split=split_on_colons)
broadcaster_text.adapt(ratings.map(lambda x: x["broadcaster"]))

In [33]:
for row in ratings.batch(1).map(lambda x: x["broadcaster"]).take(1):
  print(broadcaster_text(row))

tf.Tensor([[    3 13467]], shape=(1, 2), dtype=int64)


In [36]:
broadcaster_text.get_vocabulary()[:10]

['',
 '[UNK]',
 'meetme',
 'pof',
 'skout',
 '277903808',
 '50697624',
 'zoosk',
 '219070323',
 '300442673']
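
The two steps performed by the text layer, splitting and token lookup, can be sketched in plain Python. The helpers `tokenize` and `vectorize` are hypothetical; the vocabulary is a shortened copy of the one printed above, where index 0 is the padding token and index 1 is the unknown token.

```python
def tokenize(text, sep=":"):
    """Split an ID such as 'meetme:277903808' into network and number."""
    return text.split(sep)


def vectorize(vocabulary, text):
    """Map each token to its vocabulary index; unknown tokens map to 1."""
    index = {token: i for i, token in enumerate(vocabulary)}
    return [index.get(token, 1) for token in tokenize(text)]


# Shortened copy of the vocabulary shown above.
vocabulary = ["", "[UNK]", "meetme", "pof", "skout", "277903808"]

print(vectorize(vocabulary, "meetme:277903808"))  # [2, 5]
print(vectorize(vocabulary, "pof:12345"))         # [3, 1]
```

Splitting on the colon lets the model share what it learns about a network such as `pof` across all broadcasters on that network, even when the numeric part is unknown.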

### User model

In [40]:
class UserModel(tf.keras.Model):

    def __init__(self):
        super().__init__()

        self.user_embedding = tf.keras.Sequential([
            user_id_lookup,
            tf.keras.layers.Embedding(user_id_lookup.vocabulary_size(), 32),
        ])

        self.viewer_age_embedding = tf.keras.Sequential([
            tf.keras.layers.experimental.preprocessing.Discretization(viewer_age_buckets.tolist()),
            tf.keras.layers.Embedding(len(viewer_age_buckets) + 1, 32)
        ])
        self.normalized_viewer_age = tf.keras.layers.experimental.preprocessing.Normalization(
            axis=None
        )

    def call(self, inputs):
        # Take the input dictionary, pass it through each input layer,
        # and concatenate the result.
        return tf.concat([
            self.user_embedding(inputs["viewer"]),
            self.viewer_age_embedding(inputs["viewer_age"]),
            tf.reshape(self.normalized_viewer_age(inputs["viewer_age"]), (-1, 1))
        ], axis=1)

In [41]:
user_model = UserModel()

user_model.normalized_viewer_age.adapt(
    ratings.map(lambda x: x["viewer_age"]).batch(128))

for row in ratings.batch(1).take(1):
    print(f"Computed representations: {user_model(row)[0, :3]}")

Computed representations: [-0.03831752 -0.03813299  0.0499326 ]
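
The width of the user representation follows directly from the concatenation in `call`: two 32-dimensional embeddings plus one normalized scalar. A small plain-Python sketch with placeholder values (the zeros and the helper `concat_features` are made up for illustration):

```python
def concat_features(*parts):
    """Concatenate several feature vectors into one flat vector."""
    combined = []
    for part in parts:
        combined.extend(part)
    return combined


user_id_embedding = [0.0] * 32     # placeholder for the viewer ID embedding
age_bucket_embedding = [0.0] * 32  # placeholder for the age bucket embedding
normalized_age = [-0.52]           # placeholder for the standardized age

representation = concat_features(
    user_id_embedding, age_bucket_embedding, normalized_age)
print(len(representation))  # 65
```

Any layer stacked on top of the user model therefore receives 65 features per example under these settings.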


### Broadcaster model

In [51]:
broadcaster_ids = ratings.batch(100_000).map(lambda x: x["broadcaster"])
unique_broadcaster_ids = np.unique(np.concatenate(list(broadcaster_ids)))

In [59]:
class BroadcasterModel(tf.keras.Model):

    def __init__(self):
        super().__init__()

        # Maximum vocabulary size for the tokenized broadcaster ID text.
        max_tokens = 32

        self.broadcaster_embedding = tf.keras.Sequential([
            tf.keras.layers.experimental.preprocessing.StringLookup(
                vocabulary=unique_broadcaster_ids, max_tokens=None),
            tf.keras.layers.Embedding(len(unique_broadcaster_ids) + 1, 32)
        ])

        self.broadcaster_text_embedding = tf.keras.Sequential([
            tf.keras.layers.experimental.preprocessing.TextVectorization(
                standardize=None, split=split_on_colons, max_tokens=max_tokens),
            tf.keras.layers.Embedding(max_tokens, 32, mask_zero=True),
            # We average the embeddings of the individual tokens to get one
            # embedding vector per broadcaster ID.
            tf.keras.layers.GlobalAveragePooling1D(),
        ])

    def call(self, inputs):
        # Concatenate the ID embedding with the pooled token embedding.
        return tf.concat([
            self.broadcaster_embedding(inputs["broadcaster"]),
            self.broadcaster_text_embedding(inputs["broadcaster"]),
        ], axis=1)

In [60]:
broadcaster_model = BroadcasterModel()

broadcaster_model.broadcaster_text_embedding.layers[0].adapt(
    ratings.map(lambda x: x["broadcaster"]))

for row in ratings.batch(1).take(1):
    print(f"Computed representations: {broadcaster_model(row)[0, :3]}")

Computed representations: [ 0.0276117  -0.04592123  0.02068477]
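
The pooling step in the text branch averages the token embeddings while skipping padding, which is what the combination of `mask_zero=True` and `GlobalAveragePooling1D` is meant to achieve. A plain-Python sketch of that masked average, using a hypothetical helper and a tiny made-up table with two dimensions:

```python
def masked_average(token_ids, table):
    """Average the embedding rows of all tokens except padding (index 0)."""
    dim = len(table[0])
    vectors = [table[t] for t in token_ids if t != 0]
    if not vectors:
        # Only padding: return a zero vector instead of dividing by zero.
        return [0.0] * dim
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]


# Row 0 belongs to the padding token and is ignored by the average.
table = [[0.0, 0.0], [1.0, 3.0], [3.0, 5.0]]

print(masked_average([1, 2, 0, 0], table))  # [2.0, 4.0]
```

Without the mask, the two padding positions would pull the average towards the padding row, so IDs with fewer tokens would get systematically different vectors.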
