https://www.tensorflow.org/recommenders/examples/featurization

One of the great advantages of using a deep learning framework to build recommender models is the freedom to build rich, flexible feature representations.

The first step in doing so is preparing the features, as raw features will usually not be immediately usable in a model.

For example:

- User and item ids may be strings (titles, usernames) or large, noncontiguous integers (database IDs).
- Item descriptions could be raw text.
- Interaction timestamps could be raw Unix timestamps.

These need to be appropriately transformed in order to be useful in building models:

- User and item ids have to be translated into embedding vectors: high-dimensional numerical representations that are adjusted during training to help the model predict its objective better.
- Raw text needs to be tokenized (split into smaller parts such as individual words) and translated into embeddings.
- Numerical features need to be normalized so that their values lie in a small interval around 0.

Fortunately, by using TensorFlow we can make such preprocessing part of our model rather than a separate preprocessing step. This is not only convenient, but also ensures that our preprocessing is exactly the same during training and during serving. This makes it safe and easy to deploy models that include even very sophisticated preprocessing.

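In this tutorial, we focus on recommenders and the preprocessing needed for their input features. The original TensorFlow Recommenders tutorial (linked above) uses the MovieLens dataset; this notebook applies the same steps to a dataset of viewer and broadcaster interactions. If you're interested in a larger tutorial without a recommender system focus, have a look at the full Keras preprocessing guide.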

In [1]:
import os
import tempfile
import pprint

from typing import Dict, Text

In [2]:
import numpy as np
import pandas as pd

In [3]:
import tensorflow_datasets as tfds

In [4]:
import tensorflow as tf

In [7]:
def load_data_file_cold(file, stats=""):
    print('loading file:' + file)
    training_df = pd.read_csv(
        file,
        skiprows=[0],
        names=[
            "viewer", "broadcaster", "viewer_age", "viewer_gender",
            "viewer_longitude", "viewer_latitude", "viewer_lang", "viewer_country",
            "broadcaster_age", "broadcaster_gender", "broadcaster_longitude",
            "broadcaster_latitude", "broadcaster_lang", "broadcaster_country",
            "duration", "viewer_network", "broadcaster_network", "count",
        ],
        # The np.unicode alias has been removed from recent NumPy releases;
        # the built-in str is equivalent here.
        dtype={
            'viewer': str,
            'broadcaster': str,
            'viewer_age': np.single,
            'viewer_gender': str,
            'viewer_longitude': np.single,
            'viewer_latitude': np.single,
            'viewer_lang': str,
            'viewer_country': str,
            'broadcaster_age': np.single,
            'broadcaster_longitude': np.single,
            'broadcaster_latitude': np.single,
            'broadcaster_lang': str,
            'broadcaster_country': str,
            'viewer_network': str,
            'broadcaster_network': str,
            'count': str,
        })

    values = {
        'viewer': 'unknown',
        'broadcaster': 'unknown',
        'viewer_age': 30,
        'viewer_gender': 'unknown',
        'viewer_longitude': 0,
        'viewer_latitude': 0,
        'viewer_lang': 'unknown',
        'viewer_country': 'unknown',
        'broadcaster_age': 30,
        'broadcaster_longitude': 0,
        'broadcaster_latitude': 0,
        'broadcaster_lang': 'unknown',
        'broadcaster_country': 'unknown',
        'duration': 0,
        'viewer_network': 'unknown',
        'broadcaster_network': 'unknown',
        'count': '0',
    }
    training_df = training_df.sample(frac = 0.001)
    training_df.fillna(value=values, inplace=True)
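    # Build a "latitude,longitude" string feature. Label-based access avoids
    # the deprecated positional indexing (x[0], x[1]) on a labeled Series.
    training_df['viewer_lat_long'] = training_df[['viewer_latitude', 'viewer_longitude']].apply(
        lambda x: '{},{}'.format(x['viewer_latitude'], x['viewer_longitude']), axis=1)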
    print(training_df.head(10))
    print(training_df.iloc[-10:])
    return training_df


def load_training_data_cold(file, stats):
    ratings_df = load_data_file_cold(file, stats)
    print('creating data set')
    training_ds = (
        tf.data.Dataset.from_tensor_slices(
            ({
                "viewer": tf.cast(
                    ratings_df['viewer'].values,
                    tf.string),
                "viewer_gender": tf.cast(
                    ratings_df['viewer_gender'].values,
                    tf.string),
                "viewer_lang": tf.cast(
                    ratings_df['viewer_lang'].values,
                    tf.string),
                "viewer_country": tf.cast(
                    ratings_df['viewer_country'].values,
                    tf.string),
                "viewer_age": tf.cast(
                    ratings_df['viewer_age'].values,
                    tf.int16),
                "viewer_longitude": tf.cast(
                    ratings_df['viewer_longitude'].values,
                    tf.float16),
                "viewer_latitude": tf.cast(
                    ratings_df['viewer_latitude'].values,
                    tf.float16),
                "broadcaster": tf.cast(
                    ratings_df['broadcaster'].values,
                    tf.string),
                "viewer_network": tf.cast(
                    ratings_df['viewer_network'].values,
                    tf.string),
                "broadcaster_network": tf.cast(
                    ratings_df['broadcaster_network'].values,
                    tf.string),
                "viewer_lat_long": tf.cast(
                    ratings_df['viewer_lat_long'].values,
                    tf.string),
            })))

    return training_ds


def prepare_training_data_cold(train_ds):
    print('prepare_training_data')
    training_ds = train_ds.cache().map(lambda x: {
        "broadcaster": x["broadcaster"],
        "viewer": x["viewer"],
        "viewer_gender": x["viewer_gender"],
        "viewer_lang": x["viewer_lang"],
        "viewer_country": x["viewer_country"],
        "viewer_age": x["viewer_age"],
        "viewer_longitude": x["viewer_longitude"],
        "viewer_latitude": x["viewer_latitude"],
        "viewer_network": x["viewer_network"],
        "broadcaster_network": x["broadcaster_network"],
        "viewer_lat_long": x["viewer_lat_long"],
    }, num_parallel_calls=tf.data.AUTOTUNE,
       deterministic=False)

    print('done prepare_training_data')
    return training_ds


In [8]:
ratings = load_training_data_cold(file="csv/a3d86f3b-eb45-4641-b05d-30dff7423e6b.csv", stats="")

for x in ratings.take(1).as_numpy_iterator():
    pprint.pprint(x)

loading file:csv/a3d86f3b-eb45-4641-b05d-30dff7423e6b.csv
                   viewer       broadcaster  viewer_age viewer_gender  \
4129856   skout:183939265  meetme:250339028        29.0          male   
4642062   skout:183504804   skout:183348180        28.0          male   
1471301  meetme:253949213     pof:331596292        41.0          male   
2640883   skout:164668693   skout:158946969        56.0          male   
1826086  meetme:316347618  meetme:317022882        60.0          male   
4615372  meetme:299434476  meetme:312750515        28.0          male   
4971800  meetme:316353025   skout:127053087        34.0          male   
2626976   skout:153125384  meetme:236794915        44.0          male   
448094      pof:280603467     pof:268252594        33.0        female   
4618366     pof:332744870     pof:310626492        37.0          male   

         viewer_longitude  viewer_latitude viewer_lang viewer_country  \
4129856         77.583000        12.983000          en           



### Defining the vocabulary

In [7]:
broadcaster_lookup = tf.keras.layers.experimental.preprocessing.StringLookup()

In [8]:
broadcaster_lookup.adapt(ratings.map(lambda x: x["broadcaster"]))

Please report this to the TensorFlow team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output.
Cause: module 'gast' has no attribute 'Constant'
Please report this to the TensorFlow team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output.
Cause: module 'gast' has no attribute 'Constant'
Please report this to the TensorFlow team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output.
Cause: 'arguments' object has no attribute 'posonlyargs'
Please report this to the TensorFlow team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output.
Cause: 'arguments' object has no attribute 'posonlyargs'


In [9]:
print(f"Vocabulary: {broadcaster_lookup.get_vocabulary()[:20]}")

Vocabulary: ['[UNK]', 'meetme:277903808', 'meetme:50697624', 'meetme:219070323', 'pof:300442673', 'pof:322045884', 'pof:319663298', 'pof:315853960', 'pof:297373249', 'meetme:283611530', 'pof:79582086', 'skout:150743909', 'pof:299641758', 'meetme:309755964', 'meetme:197536011', 'meetme:294844287', 'meetme:308663123', 'skout:39313218', 'meetme:228586518', 'meetme:195325769']


In [10]:
broadcaster_lookup.vocabulary_size()

69797

In [11]:
broadcaster_lookup(["[UNK]", "meetme:277903808", "meetme:50697624"])

<tf.Tensor: shape=(3,), dtype=int64, numpy=array([0, 1, 2])>
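To make the lookup step concrete, here is a minimal pure-Python sketch of what a vocabulary lookup does conceptually: distinct values are ordered by frequency, index 0 is reserved for the out-of-vocabulary token, and unseen values map to that index. The helper names (`build_vocabulary`, `lookup`), the toy ids, and the alphabetical tie-breaking are assumptions for illustration, not the layer's actual implementation.

```python
from collections import Counter

def build_vocabulary(values, oov_token="[UNK]"):
    """Order distinct values by descending frequency, OOV token first."""
    counts = Counter(values)
    # Ties are broken alphabetically here; that choice is an assumption.
    ordered = sorted(counts, key=lambda v: (-counts[v], v))
    return [oov_token] + ordered

def lookup(vocabulary, values):
    """Map each value to its vocabulary index, or 0 if it was never seen."""
    index = {token: i for i, token in enumerate(vocabulary)}
    return [index.get(v, 0) for v in values]

# Hypothetical ids, shaped like the "network:number" ids in this dataset.
ids = ["meetme:1", "pof:2", "meetme:1", "skout:3", "meetme:1", "pof:2"]
vocab = build_vocabulary(ids)
print(vocab)
print(lookup(vocab, ["meetme:1", "skout:3", "zoosk:9"]))
```

The unseen id `"zoosk:9"` falls back to index 0, which is why the embedding table built on top of a lookup needs one row per vocabulary entry including the OOV token.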

### Using feature hashing

In [12]:
# We set up a large number of bins to reduce the chance of hash collisions.
num_hashing_bins = 200_000

broadcaster_hashing = tf.keras.layers.experimental.preprocessing.Hashing(
    num_bins=num_hashing_bins
)

In [13]:
broadcaster_hashing(["[UNK]", "meetme:277903808", "meetme:50697624"])

<tf.Tensor: shape=(3,), dtype=int64, numpy=array([ 18280, 193815, 180119])>
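The idea behind feature hashing can be sketched without TensorFlow. The snippet below is only an illustration: it uses MD5 from the standard library as a stand-in hash, so the bucket ids it prints will not match the ones produced by the Keras `Hashing` layer above. The helper `expected_occupied_bins` is a hypothetical name; it estimates how many bins end up non-empty if ids hash uniformly, which gives a rough feel for how many ids share a bin.

```python
import hashlib

def hash_bucket(value: str, num_bins: int) -> int:
    """Map a string to a bucket in [0, num_bins) with a stable hash."""
    digest = hashlib.md5(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], byteorder="big") % num_bins

def expected_occupied_bins(num_ids: int, num_bins: int) -> float:
    """Expected number of non-empty bins under uniform random hashing."""
    return num_bins * (1.0 - (1.0 - 1.0 / num_bins) ** num_ids)

num_hashing_bins = 200_000
sample_ids = ["[UNK]", "meetme:277903808", "meetme:50697624"]
buckets = [hash_bucket(b, num_hashing_bins) for b in sample_ids]
print(buckets)

# Roughly 70,000 distinct broadcaster ids, as in the vocabulary size above.
occupied = expected_occupied_bins(70_000, num_hashing_bins)
print(f"Expected non-empty bins: {occupied:.0f} of {num_hashing_bins}")
```

Hashing needs no vocabulary pass over the data, at the cost of some ids sharing an embedding row; the explicit lookup avoids collisions but must be adapted first.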

### Defining the embeddings

In [14]:
broadcaster_embedding = tf.keras.layers.Embedding(
    # Let's use the explicit vocabulary lookup.
    input_dim=broadcaster_lookup.vocabulary_size(),
    output_dim=32
)

In [15]:
broadcaster_model = tf.keras.Sequential([broadcaster_lookup, broadcaster_embedding])

In [16]:
broadcaster_model(["meetme:277903808"])

Consider rewriting this model with the Functional API.


<tf.Tensor: shape=(1, 32), dtype=float32, numpy=
array([[-0.00220549,  0.00318506, -0.04274105, -0.03318482,  0.01996905,
        -0.00360984,  0.03728544,  0.04276649,  0.02944965, -0.01693236,
        -0.03837664,  0.02658382, -0.01988866, -0.02986122,  0.02398682,
        -0.00580009,  0.0463425 , -0.02724286,  0.03874153, -0.00180887,
        -0.00071955, -0.02124978, -0.03418276,  0.03018219, -0.02725717,
         0.03669694,  0.04854344,  0.0132411 ,  0.02092062,  0.04907768,
        -0.04367078, -0.01637457]], dtype=float32)>

In [17]:
# user embedding
user_id_lookup = tf.keras.layers.experimental.preprocessing.StringLookup()
user_id_lookup.adapt(ratings.map(lambda x: x["viewer"]))

user_id_embedding = tf.keras.layers.Embedding(user_id_lookup.vocabulary_size(), 32)
user_id_model = tf.keras.Sequential([user_id_lookup, user_id_embedding])



In [18]:
user_id_model(["meetme:277903808"])



<tf.Tensor: shape=(1, 32), dtype=float32, numpy=
array([[ 0.01943531, -0.00493991, -0.00137669,  0.0355396 , -0.02645339,
        -0.00294147,  0.03740872,  0.02480601, -0.04640242,  0.03370896,
        -0.02711483,  0.00093615,  0.01668551,  0.03670264,  0.01320238,
        -0.0214431 ,  0.04782381,  0.00272961,  0.04388637,  0.02362683,
         0.02637327, -0.02401898, -0.03807665, -0.00941879,  0.0490785 ,
        -0.04063647,  0.03368082, -0.01720614, -0.0470655 , -0.00019421,
        -0.04578037,  0.04345498]], dtype=float32)>

### Normalizing continuous features

In [19]:
for x in ratings.take(3).as_numpy_iterator():
    print(f"viewer_age: {x['viewer_age']}.")

viewer_age: 29.
viewer_age: 26.
viewer_age: 25.


In [20]:
# Standardization
viewer_age_normalization = tf.keras.layers.experimental.preprocessing.Normalization(
    axis=None
)
viewer_age_normalization.adapt(ratings.map(lambda x: x['viewer_age']).batch(32))

for x in ratings.take(3).as_numpy_iterator():
    print(f"Normalized viewer age: {viewer_age_normalization(x['viewer_age'])}.")

Normalized viewer age: [-0.52391666].
Normalized viewer age: [-0.83302295].
Normalized viewer age: [-0.9360584].
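As a rough NumPy sketch of what standardization computes: subtract the mean and divide by the standard deviation, both estimated from the data. The ages below are a small hypothetical sample, and the population variance is assumed, so the numbers will not match the output above.

```python
import numpy as np

# Hypothetical sample of viewer ages.
ages = np.array([29, 26, 25, 41, 56, 60], dtype=np.float32)

mean = ages.mean()
variance = ages.var()  # population variance (assumption)

# Standardize: zero mean, unit variance.
normalized = (ages - mean) / np.sqrt(variance)
print(normalized)
```

Because the mean and variance are stored in the layer after `adapt`, the same statistics are reused at serving time.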


In [21]:
# Discretization
# Cast ages to int32 so the reduce state and the dataset elements share a dtype.
viewer_ages = ratings.map(lambda x: tf.cast(x["viewer_age"], tf.int32))
max_viewer_age = viewer_ages.reduce(
    tf.constant(tf.int32.min, tf.int32), tf.maximum).numpy()
min_viewer_age = viewer_ages.reduce(
    tf.constant(tf.int32.max, tf.int32), tf.minimum).numpy()

viewer_age_buckets = np.linspace(
    min_viewer_age, max_viewer_age, num=10)

print(f"Buckets: {viewer_age_buckets[:10]}")

Buckets: [ 18.          38.22222222  58.44444444  78.66666667  98.88888889
 119.11111111 139.33333333 159.55555556 179.77777778 200.        ]
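The bucketing step can be sketched in NumPy with `np.digitize`, which assigns each value the index of the interval it falls into. With 10 boundaries there are 11 possible bucket ids (0 through 10), which is why an embedding on top of the buckets needs `len(boundaries) + 1` rows. The ages used here are illustrative. Since the maximum age of 200 looks like an outlier, most equal-width buckets would be nearly empty; the quantile-based boundaries at the end are one possible alternative, offered as an assumption rather than something the notebook does.

```python
import numpy as np

# Same equal-width boundaries as printed above.
boundaries = np.linspace(18, 200, num=10)

# Illustrative ages, including values below the first and at the last boundary.
ages = np.array([17, 25, 26, 29, 60, 200])
bucket_ids = np.digitize(ages, boundaries)
print(bucket_ids.tolist())  # bucket ids: 0, 1, 1, 1, 3, 10

# Alternative (assumption): quantile boundaries spread values more evenly.
sample = np.array([18, 25, 26, 28, 29, 33, 34, 37, 41, 44, 56, 60, 200])
quantile_boundaries = np.quantile(sample, np.linspace(0.1, 0.9, num=9))
print(quantile_boundaries)
```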


In [22]:
# Given the bucket boundaries we can transform viewer ages into embeddings:
viewer_age_embedding_model = tf.keras.Sequential([
  tf.keras.layers.experimental.preprocessing.Discretization(viewer_age_buckets.tolist()),
  tf.keras.layers.Embedding(len(viewer_age_buckets) + 1, 32)
])

for viewer_age in ratings.take(1).map(lambda x: x["viewer_age"]).batch(1).as_numpy_iterator():
    print(f"Viewer_age embedding: {viewer_age_embedding_model(viewer_age)}.")

Viewer_age embedding: [[-0.03302993  0.03110944  0.01416438 -0.01521438 -0.03592966 -0.02223239
  -0.04625424 -0.00605955  0.00684149 -0.02312992 -0.01667572 -0.03321574
  -0.02386065 -0.02983061  0.0152416   0.03268285  0.04790663  0.04542525
  -0.02657044 -0.04043273 -0.03326219  0.04068789  0.04848525  0.02951981
   0.04435328  0.02764675 -0.00330366 -0.03459014  0.00951219 -0.04073434
   0.017962    0.01527431]].


### Processing text features

In [28]:
def split_on_colons(text):
    return tf.strings.split(text, sep=":")

In [32]:
broadcaster_text = tf.keras.layers.experimental.preprocessing.TextVectorization(standardize= None, split=split_on_colons)
broadcaster_text.adapt(ratings.map(lambda x: x["broadcaster"]))



In [33]:
for row in ratings.batch(1).map(lambda x: x["broadcaster"]).take(1):
  print(broadcaster_text(row))

tf.Tensor([[    3 13467]], shape=(1, 2), dtype=int64)


In [36]:
broadcaster_text.get_vocabulary()[:10]

['',
 '[UNK]',
 'meetme',
 'pof',
 'skout',
 '277903808',
 '50697624',
 'zoosk',
 '219070323',
 '300442673']

### User Model

In [40]:
class UserModel(tf.keras.Model):

	def __init__(self):
		super().__init__()

		self.user_embedding = tf.keras.Sequential([
			user_id_lookup,
			tf.keras.layers.Embedding(user_id_lookup.vocabulary_size(), 32),
		])
        
		self.viewer_age_embedding = tf.keras.Sequential([
			tf.keras.layers.experimental.preprocessing.Discretization(viewer_age_buckets.tolist()),
			tf.keras.layers.Embedding(len(viewer_age_buckets) + 1, 32)
		])
		self.normalized_viewer_age = tf.keras.layers.experimental.preprocessing.Normalization(
			axis = None
		)

	def call(self, inputs):
		# Take the input dictionary, pass it through each input layer,
		# and concatenate the result.
		return tf.concat([
			self.user_embedding(inputs["viewer"]),
			self.viewer_age_embedding(inputs["viewer_age"]),
			tf.reshape(self.normalized_viewer_age(inputs["viewer_age"]), (-1, 1))
		], axis = 1)

In [41]:
user_model = UserModel()

user_model.normalized_viewer_age.adapt(
    ratings.map(lambda x: x["viewer_age"]).batch(128))

for row in ratings.batch(1).take(1):
    print(f"Computed representations: {user_model(row)[0, :3]}")

Computed representations: [-0.03831752 -0.03813299  0.0499326 ]
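A quick way to reason about the size of the user representation is to mimic the concatenation with NumPy arrays of the same shapes. This sketch uses random placeholder values; only the shapes mirror the model above: a 32-dimensional id embedding, a 32-dimensional age-bucket embedding, and one normalized age column.

```python
import numpy as np

batch_size = 4
rng = np.random.default_rng(0)

# Placeholder arrays standing in for the three feature blocks.
user_embedding = rng.normal(size=(batch_size, 32))
age_bucket_embedding = rng.normal(size=(batch_size, 32))
normalized_age = rng.normal(size=(batch_size, 1))

# Concatenate along the feature axis, as in UserModel.call.
user_representation = np.concatenate(
    [user_embedding, age_bucket_embedding, normalized_age], axis=1)
print(user_representation.shape)  # (4, 65)
```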


### Broadcaster model

In [51]:
broadcaster_ids = ratings.batch(100_000).map(lambda x: x["broadcaster"])
unique_broadcaster_ids = np.unique(np.concatenate(list(broadcaster_ids)))



In [59]:
class BroadcasterModel(tf.keras.Model) :

	def __init__(self) :
		super().__init__()

		max_tokens = 32
		self.broadcaster_embedding = tf.keras.Sequential([
			tf.keras.layers.experimental.preprocessing.StringLookup(
				vocabulary = unique_broadcaster_ids, max_tokens=None),
			tf.keras.layers.Embedding(len(unique_broadcaster_ids) + 1, 32)
		])
        
		self.broadcaster_text_embedding = tf.keras.Sequential([
			tf.keras.layers.experimental.preprocessing.TextVectorization(standardize= None, split=split_on_colons, max_tokens=32),
			tf.keras.layers.Embedding(max_tokens, 32, mask_zero = True),
			# We average the embeddings of the individual tokens to get one embedding
			# vector per broadcaster ID.
			tf.keras.layers.GlobalAveragePooling1D(),
		])


	def call(self, inputs) :
		return tf.concat([
			self.broadcaster_embedding(inputs["broadcaster"]),
			self.broadcaster_text_embedding(inputs["broadcaster"]),
		], axis = 1)

In [60]:
broadcaster_model = BroadcasterModel()

broadcaster_model.broadcaster_text_embedding.layers[0].adapt(
    ratings.map(lambda x: x["broadcaster"]))

for row in ratings.batch(1).take(1):
    print(f"Computed representations: {broadcaster_model(row)[0, :3]}")

Computed representations: [ 0.0276117  -0.04592123  0.02068477]
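The combination of `mask_zero=True` and average pooling can be illustrated with a small NumPy sketch: padded positions (token id 0) are excluded from the average, so short and long token sequences are pooled on the same footing. The function name `masked_mean` and the toy embedding table are assumptions for illustration.

```python
import numpy as np

def masked_mean(embeddings, token_ids):
    """Average token embeddings over the sequence axis, ignoring id 0 (padding)."""
    mask = (token_ids != 0).astype(embeddings.dtype)[..., None]  # (batch, seq, 1)
    summed = (embeddings * mask).sum(axis=1)                     # (batch, dim)
    counts = np.maximum(mask.sum(axis=1), 1.0)                   # avoid divide-by-zero
    return summed / counts

# Toy embedding table: 8 tokens, 4 dimensions.
table = np.random.default_rng(0).normal(size=(8, 4))

# Two sequences padded with 0 to length 3.
token_ids = np.array([[3, 7, 0],
                      [2, 0, 0]])
pooled = masked_mean(table[token_ids], token_ids)
print(pooled.shape)  # (2, 4)
```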


### Preprocessing Lat/Long

In [188]:
lat_long = (ratings
           # Retain only the fields we need.
           .map(lambda x: {"viewer_latitude": x["viewer_latitude"], 
                           "viewer_longitude": x["viewer_longitude"]}))



In [189]:
CENTROIDS = np.array([[36.68147669256268, -82.8910274009993],
                      [23.22243322909555, 78.23027450833709],
                      [50.04997682638993, 0.22379313938744885],
                      [37.9309447099281, -117.00741350764692],
                      [-32.795864819917725, 148.7159172660312],
                      [-18.570548393114084, -54.280255665692565],
                      [13.921140442819565, 116.38740315555172],
                      [29.78951080730802, 40.279515865947936]])

In [210]:
len(CENTROIDS)

8

In [190]:
def classify(datapoint):
    """
    Given a datapoint, find the centroid closest to it and
    return the index (cluster ID) of that centroid.
    :param datapoint: [latitude, longitude] pair
    :return: cluster ID
    """
    print(datapoint)
    # Euclidean distance from the datapoint to every centroid.
    dists = np.sqrt(np.sum((CENTROIDS - datapoint) ** 2, axis = 1))
    return np.argmin(dists)

In [191]:
for row in lat_long.take(2).as_numpy_iterator():
    res = classify([row['viewer_latitude'], row['viewer_longitude']])
    print(res)

[28.7, 77.1]
1
[40.22, -84.8]
0
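The single-point `classify` above can be extended to a whole batch with NumPy broadcasting. The sketch below is one possible way to do it; `classify_batch` is a hypothetical name. Note that plain Euclidean distance on degrees ignores the curvature of the earth and the wrap-around of longitude at ±180, so it is only a rough proxy for geographic proximity.

```python
import numpy as np

CENTROIDS = np.array([[36.68147669256268, -82.8910274009993],
                      [23.22243322909555, 78.23027450833709],
                      [50.04997682638993, 0.22379313938744885],
                      [37.9309447099281, -117.00741350764692],
                      [-32.795864819917725, 148.7159172660312],
                      [-18.570548393114084, -54.280255665692565],
                      [13.921140442819565, 116.38740315555172],
                      [29.78951080730802, 40.279515865947936]])

def classify_batch(points):
    """Return the index of the nearest centroid for each [lat, long] row."""
    points = np.asarray(points, dtype=np.float64).reshape(-1, 2)
    # (n, 1, 2) - (1, 8, 2) broadcasts to (n, 8, 2).
    diffs = points[:, None, :] - CENTROIDS[None, :, :]
    squared_dists = (diffs ** 2).sum(axis=2)  # (n, 8)
    return squared_dists.argmin(axis=1)       # (n,)

clusters = classify_batch([[28.7, 77.1], [40.22, -84.8]])
print(clusters.tolist())  # [1, 0]
```

The square root is skipped because it does not change which centroid is closest.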


In [80]:
from tensorflow.keras.layers import Lambda
from tensorflow.keras import backend as K

In [92]:
unique_clusters = 8
viewer_lat_long_embedding = tf.keras.Sequential([
    tf.keras.layers.Lambda(lambda x: classify(x)), 
    tf.keras.layers.Embedding(unique_clusters + 1, 2),
])

In [150]:
for x in ratings.take(10).as_numpy_iterator():
    pprint.pprint(str(x['viewer_latitude']) + "|" + str(x['viewer_longitude']))

'28.7|77.1'
'40.22|-84.8'
'37.0|-122.0'
'41.5|-87.7'
'29.77|-95.7'
'32.2|-90.3'
'32.7|-97.0'
'-7.938|-34.88'
'42.44|-83.3'
'43.28|-76.4'


In [32]:
float("1.5")

1.5

In [206]:
def classify(pair):
    """
    Given a (latitude, longitude) tensor, find the closest centroid
    and return its cluster ID as a one-element tensor.
    :param pair: eager tensor holding [latitude, longitude]
    :return: cluster ID
    """
    datapoint = pair.numpy().tolist()
    dists = np.sqrt(np.sum((CENTROIDS - datapoint) ** 2, axis = 1))
    return tf.constant([np.argmin(dists)])

In [34]:
latlong = ratings.map(lambda x: tf.stack([x["viewer_latitude"], x["viewer_longitude"]]))
for val in latlong.take(2).as_numpy_iterator():
    pprint.pprint(val)

array([ 43.9, -78.9], dtype=float16)
array([ 14.51, 121.  ], dtype=float16)


In [209]:
viewer_latitude_longitude = tf.keras.Sequential([
  tf.keras.layers.Lambda(lambda x: classify(x)), 
  tf.keras.layers.Embedding(len(CENTROIDS) + 1, 32)
])

for val in latlong.take(2).batch(1).as_numpy_iterator():
    print(viewer_latitude_longitude(val))

[[28.703125, 77.125]]
tf.Tensor(
[[ 0.00398612  0.0471969  -0.00608052 -0.02895863 -0.03615345 -0.0151992
  -0.04495274 -0.04468764 -0.0247431   0.02581981  0.04732665  0.04497429
  -0.01733258  0.01310393 -0.02832994 -0.04735022  0.0124325  -0.01465311
   0.04150859 -0.00486708 -0.02039013  0.02938708  0.03096035  0.01580676
  -0.0127617  -0.02478284 -0.02168956  0.01502671 -0.04769066  0.02023294
  -0.02210033  0.01853705]], shape=(1, 32), dtype=float32)
[[40.21875, -84.8125]]
tf.Tensor(
[[-0.02717905  0.03290197 -0.04601464  0.00960105 -0.03947737 -0.02975992
   0.0410358  -0.00295651 -0.01346616  0.01446816  0.02706251 -0.01199473
  -0.0110734  -0.00836089  0.04716246  0.04230661 -0.00511326 -0.03318679
   0.03733987 -0.04369868 -0.03661479 -0.00964867  0.03899993  0.04625989
   0.02531463 -0.00039531  0.02407861 -0.01779387  0.02972336 -0.02403492
   0.01394169  0.01960785]], shape=(1, 32), dtype=float32)


In [33]:
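# Add a middle axis so the centroids broadcast against a batch of points.
expanded_centroids = tf.expand_dims(tf.constant(CENTROIDS, dtype=tf.float32), 1)
expanded_centroids.shape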

TensorShape([8, 1, 2])

In [65]:
samples = tf.constant([[36.68147669256268, -82.8910274009993],[13.921140442819565, 116.38740315555172],[29.78951080730802, 40.279515865947936]])

In [66]:
expanded_vectors = tf.expand_dims(samples, 0)

In [67]:
expanded_vectors

<tf.Tensor: shape=(1, 3, 2), dtype=float32, numpy=
array([[[ 36.681477, -82.89103 ],
        [ 13.921141, 116.387405],
        [ 29.78951 ,  40.279514]]], dtype=float32)>

In [70]:
distances = tf.reduce_sum(tf.square(tf.subtract(expanded_vectors, expanded_centroids)), 2)
print(distances)
cluster = tf.math.argmin(distances)
print(cluster)

tf.Tensor(
[[    0.     40229.93   15218.482 ]
 [26141.223   1542.4805  1483.387 ]
 [ 7086.7905 14799.277   2014.9473]
 [ 1165.489  55049.613  24805.46  ]
 [58468.875   3227.6106 15675.382 ]
 [ 3871.3628 30183.162  11280.245 ]
 [40229.93       0.      6044.217 ]
 [15218.482   6044.217      0.    ]], shape=(8, 3), dtype=float32)
tf.Tensor([0 6 7], shape=(3,), dtype=int64)


In [83]:
def classify(pair):
    """
    given a datapoint, compute the cluster closest to the datapoint. Return the cluster ID of that cluster.
    :param pair:
    :return: cluster ID
    """
    centroids = tf.constant([
         [36.68147669256268, -82.8910274009993],
         [23.22243322909555, 78.23027450833709],
         [50.04997682638993, 0.22379313938744885],
         [37.9309447099281, -117.00741350764692],
         [-32.795864819917725, 148.7159172660312],
         [-18.570548393114084, -54.280255665692565],
         [13.921140442819565, 116.38740315555172],
         [29.78951080730802, 40.279515865947936]]
    )
    expanded_centroids = tf.expand_dims(centroids, 1)

    latlong = tf.strings.split(pair, sep=",")
    datapoints = [tf.strings.to_number(splits) for splits in latlong]
    expanded_vectors = tf.expand_dims(datapoints, 0)
    
    distances = tf.reduce_sum(tf.square(tf.subtract(expanded_vectors, expanded_centroids)), 2)
    clusters = tf.math.argmin(distances)
    print(clusters)
    return tf.strings.as_string(clusters)

In [87]:
viewer_lat_long_embedding = tf.keras.layers.experimental.preprocessing.TextVectorization(
    standardize= None, split=classify, vocabulary=['0', '1', '2', '3', '4', '5', '6', '7'])
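One detail worth keeping in mind when a vocabulary is passed explicitly: in integer output mode, `TextVectorization` reserves index 0 for padding and index 1 for out-of-vocabulary tokens, so the token for cluster `k` ends up at index `k + 2`. The pure-Python sketch below mimics that offset; `cluster_to_token_index` is a hypothetical helper, not part of any library.

```python
def cluster_to_token_index(cluster_id, vocabulary):
    """Index a cluster id would get once padding and OOV slots are prepended."""
    full_vocabulary = ["", "[UNK]"] + list(vocabulary)
    return full_vocabulary.index(str(cluster_id))

vocabulary = ['0', '1', '2', '3', '4', '5', '6', '7']
indices = [cluster_to_token_index(c, vocabulary) for c in (1, 3, 0)]
print(indices)  # [3, 5, 2]

# An embedding on top of this layer therefore needs len(vocabulary) + 2 rows.
embedding_rows = len(vocabulary) + 2
print(embedding_rows)  # 10
```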

In [88]:
for pair in ratings.batch(1).map(lambda x: x["viewer_lat_long"]).take(10):
    print(pair)
    print(f"Viewer_latlong embedding: {viewer_lat_long_embedding(pair)}.")

tf.Tensor([b'12.982999801635742,77.58300018310547'], shape=(1,), dtype=string)
tf.Tensor([1], shape=(1,), dtype=int64)
Viewer_latlong embedding: [3].
tf.Tensor([b'34.590999603271484,-118.12000274658203'], shape=(1,), dtype=string)
tf.Tensor([3], shape=(1,), dtype=int64)
Viewer_latlong embedding: [5].
tf.Tensor([b'40.78089904785156,-73.25450134277344'], shape=(1,), dtype=string)
tf.Tensor([0], shape=(1,), dtype=int64)
Viewer_latlong embedding: [2].
tf.Tensor([b'37.16299819946289,126.98100280761719'], shape=(1,), dtype=string)
tf.Tensor([6], shape=(1,), dtype=int64)
Viewer_latl

In [100]:
viewer_age_buckets = [0, 1, 2, 3, 4, 5, 6, 7]
# Given the bucket boundaries we can transform viewer ages into embeddings:
viewer_age_embedding_model = tf.keras.Sequential([
  tf.keras.layers.Lambda(lambda x: x ** 2), 
  tf.keras.layers.experimental.preprocessing.Discretization(viewer_age_buckets),
  tf.keras.layers.Embedding(len(viewer_age_buckets) + 1, 32)
])

for viewer_age in ratings.take(1).map(lambda x: x["viewer_age"]).batch(1).as_numpy_iterator():
    print(f"Viewer_age embedding: {viewer_age_embedding_model(viewer_age)}.")

Viewer_age embedding: [[-0.04835718  0.0053164   0.01803212  0.00861342 -0.01517547 -0.03086997
  -0.00026621  0.00785236 -0.03899376  0.02694774 -0

In [64]:
latlong = (ratings
           .take(1)
           # Retain only the fields we need.
           .map(lambda x: {"viewer_latitude": x["viewer_latitude"], 
                           "viewer_longitude": x["viewer_longitude"]}))
print(latlong)
    
#     print(f"Viewer_lat_long_cluster: {viewer_lat_long_cluster_embedding(latlong)}.")

<MapDataset shapes: {viewer_latitude: (), viewer_longitude: ()}, types: {viewer_latitude: tf.float16, viewer_longitude: tf.float16}>
