# Wide-and-Deep ML: Model Preparation

In this notebook, we train and evaluate the wide-and-deep collaborative filtering recommender using features engineered in the prior notebook.

In [42]:
# !pip3 install mlflow

In [None]:
# import required libraries

import pandas as pd
import tensorflow as tf
from tensorflow.keras.layers.experimental.preprocessing import TextVectorization, StringLookup
import math
import mlflow

In [2]:
# directories to save the model's training progress, output, and save files
CHECKPOINT_PATH = './tmp/model_checkpoint'
EXPORT_PATH = './tmp/model_export'

In [3]:
# configs

# enable eager execution
tf.config.run_functions_eagerly(True)
tf.data.experimental.enable_debug_mode()

## 1. Prepare the data

### 1.1. Load the data

In [4]:
# save models
train_df = pd.read_csv('../data/user_movie_interaction_train.csv')
val_df = pd.read_csv('../data/user_movie_interaction_val.csv')
test_df = pd.read_csv('../data/user_movie_interaction_train.csv')

In [57]:
train_df.head()

Unnamed: 0.1,Unnamed: 0,userId,movieId,rating,title,genres,avg_movie_rating,user_all_genres
0,23695,442,51662,0.4,300 (2007),action fantasy war imax,0.721622,fantasy sci-fi mystery animation documentary w...
1,37754,417,1027,0.4,Robin Hood: Prince of Thieves (1991),adventure drama,0.610526,fantasy sci-fi musical horror mystery western ...
2,18178,394,45499,0.5,X-Men: The Last Stand (2006),action sci-fi thriller,0.638095,sci-fi drama children thriller western film-no...
3,33268,271,60609,0.9,Death Note (2006),adventure crime drama horror mystery,0.9,sci-fi drama children thriller western film-no...
4,47465,489,3301,0.6,"Whole Nine Yards, The (2000)",comedy crime,0.641667,sci-fi drama children thriller western film-no...


In [5]:
# add dataframes to list for convenience
df_list = [train_df, val_df, test_df]

In [6]:
# drop unnecessary columns
for df in df_list:
    df.drop(['Unnamed: 0'], axis=1, inplace=True)

train_df.head()

Unnamed: 0,userId,movieId,rating,title,genres,avg_movie_rating,user_all_genres
0,442,51662,0.4,300 (2007),action fantasy war imax,0.721622,fantasy sci-fi mystery animation documentary w...
1,417,1027,0.4,Robin Hood: Prince of Thieves (1991),adventure drama,0.610526,fantasy sci-fi musical horror mystery western ...
2,394,45499,0.5,X-Men: The Last Stand (2006),action sci-fi thriller,0.638095,sci-fi drama children thriller western film-no...
3,271,60609,0.9,Death Note (2006),adventure crime drama horror mystery,0.9,sci-fi drama children thriller western film-no...
4,489,3301,0.6,"Whole Nine Yards, The (2000)",comedy crime,0.641667,sci-fi drama children thriller western film-no...


### 1.2. Preprocess raw features and make Embeddings with Keras preprocessing layers.

We know that raw features may not be sufficiently accessible or practically usable by machine learning models and need to be preprocessed before the data is made available for training. This process typically involves:
- normalizing numerical features so that their impact on learning are not minimized or overpronounced relative to others.
- turning categorical features into *embeddings*.
- *tokenizing* textual features to translate them into embeddings.

We already normalized the *rating* column by scaling down the values to fit between 0.0 and 1.0. But remember that the movie titles and genre data are still strings, so we process them in this step.

#### 1.2.1. Converting primary features into categorical data

Categorical features represent discrete data. However, for our dataset, we see that the user and movie ids are of integer data types and therefore express continuous quantities. As such, we start by transforming them to the appropriate data type, which we achieved through trial and error.

In [7]:
# convert id features into string data to allow
# tokenization with keras
for df in df_list:
    id_cols = df.columns[df.columns.str.contains('Id')].tolist()
    for col in id_cols:
        df[col] = tf.convert_to_tensor(df[col].astype('string'), dtype=tf.string)

train_df.dtypes

userId               object
movieId              object
rating              float64
title                object
genres               object
avg_movie_rating    float64
user_all_genres      object
dtype: object

Most deep learning models express categorical data as high-dimensional embedding vectors that can be adjusted during model training. We can achieve this by building a '**vocabulary**' that maps each raw value into unique integers that can then be turned into embedding vectors.

We start by making a `Keras` `StringLookup` layer for the mapping. The `StringLookup` layer is a ***non-trainable*** layer and its *state*, the vocabulary, must be constructed and set before training in a step called 'adaptation.' This includes one or more unknown - or 'out of vocabulary,' OOV - tokens that allow the layer to handle categorical values that are not in the vocabulary, and consequently, ensures that the model can continue to learn using features that have not been seen during vocabulary construction.

In [8]:
# make a keras string lookup layer
userId_lookup_layer = StringLookup(mask_token=None)
movieId_lookup_layer = StringLookup(mask_token=None)

for df in df_list:
    userId_lookup_layer.adapt(df['userId'])
    movieId_lookup_layer.adapt(df['movieId'])

# verify tokenization
userId_lookup_layer.get_vocabulary()[:10]

['[UNK]', '156', '359', '208', '394', '298', '116', '104', '424', '348']

In [9]:
userId_lookup_layer(train_df['userId'])

<tf.Tensor: shape=(45794,), dtype=int64, numpy=array([237,  46,   4, ...,  22, 113, 291])>

#### 1.2.2. Tokenize textual features and translate them into embeddings

Textual data is tokenized as words so we can create word embeddings to represent words as dense vectors of real numbers, where each dimension represents a different feature of the word. This allows us to capture the semantic meaning of words and their relationships to other words in way that is useful to machine learning models.

In our case, tokenizing then transforming the genre data into embeddings will allow the deep network to predict movie preferences by identifying and generalizing their contextual meaning.

In [10]:
# Keras TextVectorization layer turns raw string data into an encoded
# representation that can be read by an embedding or dense layer

# get all columns with string data
str_cols = df.select_dtypes(include=['object']).columns.tolist()
# str_cols.remove('title')

for df in df_list:
    for col_name in str_cols:
        vectorizer = TextVectorization()
        vectorizer.adapt(df[col_name])

# verify tokenization
print(f'vocabulary[0:10]: {vectorizer.get_vocabulary()[:10]}')
vectorizer(train_df['genres'])

vocabulary[0:10]: ['', '[UNK]', 'thriller', 'drama', 'comedy', 'action', 'romance', 'adventure', 'crime', 'scifi']


<tf.Tensor: shape=(45794, 7), dtype=int64, numpy=
array([[ 5, 10, 12, ...,  0,  0,  0],
       [ 7,  3,  0, ...,  0,  0,  0],
       [ 5,  9,  2, ...,  0,  0,  0],
       ...,
       [ 5,  4,  9, ...,  0,  0,  0],
       [ 2,  0,  0, ...,  0,  0,  0],
       [ 4,  6,  0, ...,  0,  0,  0]])>

## 2. Create the model

In this step, we create a [`tf.estimator.DNNLinearCombinedClassifier`](https://developers.googleblog.com/2017/09/introducing-tensorflow-datasets.html) estimator object. It is important to note that according to the documentation, estimators are deprecated, and the warnings on my notebook indicate that I should switch to `Keras`, but I don't know how to do *that* yet and this is a semester project; I will try to update the model later.

### 2.1. Define the input columns

The `get_wide_and_deep_columns()` function returns a tuple of `(wide_columns, deep_columns)` where each item represents [`tf.feature_columns`](https://developers.googleblog.com/2017/11/introducing-tensorflow-feature-columns.html).

The feature inputs for the `DNNLinearCombinedClassifier` estimator are divided into those associated with the wide, linear and classifier models, and those associated with a deep neural network. The inputs to the wide model are the user and movie ID combinations that allow us to train the linear and classifier model to memorize what users watch which movies and how they rate them. These features may be brought into the model as simple categorical features hashed into a smaller number of buckets. The inclusion of the *crossed hashes* allow the model to better understand the user-movie interactions.

The target does not go into either wide or deep columns, so we need to specify it because it is useful for training the model.

In [11]:
# calculate number of examples
user_num = len(train_df['userId'].unique())
movie_num = len(train_df['movieId'].unique())
genre_num = len(train_df['genres'].unique())
user_all_genres_num = len(train_df['genres'].unique())

# calculate manually added embedding dimensions
user_dim = int(round(math.pow(user_num, 1/3)))
movie_dim = int(round(math.pow(movie_num, 1/3)))
genre_dim = int(round(math.pow(genre_num, 1/3)))
user_all_genres_dim = int(round(math.pow(user_all_genres_num, 1/3)))

# variables to define wide and deep columns from the dataset
LABEL_COL = 'title'

CATEGORICAL_COLS = [
    'userId',
    'movieId'
]

NUMERIC_COLS = [
    'rating',
    'avg_movie_rating'
]

TEXT_COLS = [
    'genres',
    'user_all_genres'
]

HASH_BUCKET_SIZES = {
    'userId': user_num,
    'movieId': movie_num,
    'genres': genre_num,
    'user_all_genres': user_all_genres_num
}

EMBEDDING_DIMENSIONS = {
    'userId': user_dim,
    'movieId': movie_dim,
    'genres': genre_dim,
    'user_all_genres': user_all_genres_dim,
}

# define wide and deep columns
def get_wide_and_deep_columns():
    wide_cols, deep_cols = [], []
    text_buckets = []
    numeric_cols, numeric_buckets = [], []
    cat_hash_bucket_size = genre_num * genre_dim
    l, r = 3/5, 4.5/5

    # categorical embedding columns
    for col_name in CATEGORICAL_COLS:
        categorical_col = tf.feature_column.categorical_column_with_identity(
            col_name,
            num_buckets = HASH_BUCKET_SIZES[col_name])
        wrapped_col = tf.feature_column.embedding_column(
            categorical_col,
            dimension = EMBEDDING_DIMENSIONS[col_name],
            combiner = 'sqrtn')
        wide_cols.append(categorical_col)
        deep_cols.append(wrapped_col)

    # text data embedding
    for col_name in TEXT_COLS:
        text_col = tf.feature_column.categorical_column_with_identity(
            col_name,
            num_buckets = HASH_BUCKET_SIZES[col_name])
        wrapped_col = tf.feature_column.embedding_column(
            categorical_col,
            dimension = EMBEDDING_DIMENSIONS[col_name],
            combiner = 'sqrtn')
        text_buckets.append(col_name)
        wide_cols.append(text_col)
        deep_cols.append(wrapped_col)

    # numeric columns
    for col_name in NUMERIC_COLS:
        col_name = tf.feature_column.numeric_column(
            col_name,
            shape = (1,),
            dtype = tf.float32)
        col_buckets = tf.feature_column.bucketized_column(
            col_name,
            boundaries=[l, r])
        numeric_cols.append(col_name)
        numeric_buckets.append(col_buckets)
        deep_cols.append(col_name)

    # cross numeric columns, text data columns
    numeric_cols_crossed = tf.feature_column.crossed_column(numeric_buckets, 12)
    text_cols_crossed = tf.feature_column.crossed_column(text_buckets, cat_hash_bucket_size)

    # add buckets and crossed columns to set of wide columns
    wide_cols.extend([numeric_buckets, numeric_cols_crossed, text_cols_crossed])

    return wide_cols, deep_cols

As we pointed out earlier, embeddings allow machine learning models to perform better. One vital property of embedding is that are semantically similar are spacially closer to each other, which means they have similar vector representations as measured by a distance metric such as *cosine similarity*.

The **dimensionality** of the word embeddings refers to the number of dimensions in which the vector representation of the word is defined, that is, the total number of features that are encoded in the word. The number of dimensions can have a significant impact on the performance: too few and the embeddings might not capture enough relevant information about the data; too many and the number of embeddings might become too complex, causing the model to overfit, and consequently, leading to poor generalization of new data. Setting the dimensionality is conventionally done through trial and error, much like adjusting the weights and biases of a model to improve its performance. Here we use the third root of the number of unique items in each column.

**Hash bucket sizes** are used to represent categorical data as a fixed-length vector of integers by hashing the categorical values into a fixed number of buckets, which are then used as indices into the vector. By using a fixed number of buckets, we can represent a large number of categories with a relatively small number of dimensions. This can help to reduce overfitting and improve the generalization performance of machine learning models. The optimal hash bucket size depends on the specific dataset and task. In general, it is recommended to use a hash bucket size that is large enough to capture all the relevant information in the data, but not so large that it becomes computationally expensive to train the model.

> **TODO**: find out how exactly the dimensions of the wide-and-deep model are similar/differ from the dimensions of the preprocessed data.

In [13]:
wide_columns, deep_columns = get_wide_and_deep_columns()

In [14]:
wide_columns

[IdentityCategoricalColumn(key='userId', number_buckets=500, default_value=None),
 IdentityCategoricalColumn(key='movieId', number_buckets=6368, default_value=None),
 IdentityCategoricalColumn(key='genres', number_buckets=766, default_value=None),
 IdentityCategoricalColumn(key='user_all_genres', number_buckets=766, default_value=None),
 [BucketizedColumn(source_column=NumericColumn(key='rating', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None), boundaries=(0.6, 0.9)),
  BucketizedColumn(source_column=NumericColumn(key='avg_movie_rating', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None), boundaries=(0.6, 0.9))],
 CrossedColumn(keys=(BucketizedColumn(source_column=NumericColumn(key='rating', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None), boundaries=(0.6, 0.9)), BucketizedColumn(source_column=NumericColumn(key='avg_movie_rating', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None), boundaries=(0.6, 0.9)))

In [15]:
deep_columns

[EmbeddingColumn(categorical_column=IdentityCategoricalColumn(key='userId', number_buckets=500, default_value=None), dimension=8, combiner='sqrtn', initializer=<tensorflow.python.ops.init_ops.TruncatedNormal object at 0x7f2174b3b7c0>, ckpt_to_load_from=None, tensor_name_in_ckpt=None, max_norm=None, trainable=True, use_safe_embedding_lookup=True),
 EmbeddingColumn(categorical_column=IdentityCategoricalColumn(key='movieId', number_buckets=6368, default_value=None), dimension=19, combiner='sqrtn', initializer=<tensorflow.python.ops.init_ops.TruncatedNormal object at 0x7f2174b3b5e0>, ckpt_to_load_from=None, tensor_name_in_ckpt=None, max_norm=None, trainable=True, use_safe_embedding_lookup=True),
 EmbeddingColumn(categorical_column=IdentityCategoricalColumn(key='movieId', number_buckets=6368, default_value=None), dimension=9, combiner='sqrtn', initializer=<tensorflow.python.ops.init_ops.TruncatedNormal object at 0x7f2174b3be80>, ckpt_to_load_from=None, tensor_name_in_ckpt=None, max_norm=Non

### 1.2. Define the wide-and-deep model

We are ready to assemble our estimator. We use the `Ftrl` and `Adagrad` optimizers that have their default learning rate and other parameters that we can modify later, then declare the number of layers for the model.

In [17]:
# adapted from https://www.tensorflow.org/api_docs/python/tf/estimator/DNNLinearCombinedClassifier
estimator = tf.estimator.DNNLinearCombinedClassifier(
    # wide settings
    linear_feature_columns=wide_columns,
    linear_optimizer=tf.keras.optimizers.Ftrl(),

    # deep settings
    dnn_feature_columns=deep_columns,
    dnn_hidden_units=[100, 50],
    dnn_optimizer=tf.keras.optimizers.Adagrad(),

    # warm-start settings
    model_dir=CHECKPOINT_PATH
)

INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_model_dir': './tmp/model_checkpoint', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': None, '_save_checkpoints_secs': 600, '_session_config': allow_soft_placement: true
graph_options {
  rewrite_options {
    meta_optimizer_iterations: ONE
  }
}
, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 100, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_experimental_max_worker_delay_secs': None, '_session_creation_timeout_secs': 7200, '_checkpoint_save_graph_def': True, '_service': None, '_cluster_spec': ClusterSpec({}), '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}


### 1.3. Create the custom metric

We learn from this [Databricks](https://notebooks.databricks.com/notebooks/RCG/Wide_and_Deep/index.html#Wide_and_Deep_3.html) tutorial that for recommenders where the goal is to present items in order from most to least likely to be selected, *average precision at k* is conventionally used. Ultimately, our goal is to recommend video games and tournaments in this very fashion, so this metric works for us.

> the metric examines the average precision associated with a top-k number of recommendations. The closer the value of MAP@K (the average precision at k), the better aligned those recommendations are with a customer's product selections.

In [18]:
# adapted from: https://github.com/NVIDIA/DeepLearningExamples/blob/master/TensorFlow/Recommendation/WideAndDeep/utils/metrics.py
def map_custom_metric(features, labels, predictions):
    user_ids = tf.reshape(features['userId'], [-1])
    predictions = predictions['probabilities'][:, 1]
    
    # Processing unique userIds, indices and counts
    # Sorting needed in case the same userId occurs in two different places
    sorted_ids = tf.argsort(user_ids)
    user_ids = tf.gather(user_ids, indices=sorted_ids)
    predictions = tf.gather(predictions, indices=sorted_ids)
    labels = tf.gather(labels, indices=sorted_ids)
    
    _, user_ids_idx, user_ids_movies_count = tf.unique_with_counts(user_ids, out_idx=tf.int64)
    pad_length = 30 - tf.reduce_max(user_ids_movies_count)
    pad_fn = lambda x: tf.pad(x, [(0, 0), (0, pad_length)])
    
    preds = tf.RaggedTensor.from_value_rowids(predictions, user_ids_idx).to_tensor()
    labels = tf.RaggedTensor.from_value_rowids(labels, user_ids_idx).to_tensor()
    
    labels = tf.argmax(labels, axis=1)
    
    return {
        'map': tf.compat.v1.metrics.average_precision_at_k(
            predictions=pad_fn(preds),
            labels=labels,
            k=5,
            name="streaming_map"
        )}

In [19]:
estimator = tf.estimator.add_metrics(estimator, map_custom_metric)

Instructions for updating:
Use tf.keras instead.
INFO:tensorflow:Using config: {'_model_dir': './tmp/model_checkpoint', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': None, '_save_checkpoints_secs': 600, '_session_config': allow_soft_placement: true
graph_options {
  rewrite_options {
    meta_optimizer_iterations: ONE
  }
}
, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 100, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_experimental_max_worker_delay_secs': None, '_session_creation_timeout_secs': 7200, '_checkpoint_save_graph_def': True, '_service': None, '_cluster_spec': ClusterSpec({}), '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}


## 3. Train and evaluate the model

We start by aggregating all the preprocessed columns into a `tf.data.Dataset`.

In [20]:
# combine the numerical and vectorized data to create a
# tensorflow dataset

tfd_list = ['train', 'test', 'val']
i = 0

for df in df_list:
    userId = userId_lookup_layer(df['userId'])
    movieId = userId_lookup_layer(df['movieId'])
    rating = df['rating']
    avg_movie_rating = df['avg_movie_rating']
    genres = vectorizer(df['genres'])
    user_all_genres = vectorizer(df['user_all_genres'])
    title = vectorizer(df['title'])
    
    skibidi = (userId, movieId, rating, avg_movie_rating, genres, user_all_genres, title)
    
    globals()[f'{tfd_list[i]}_tf_dataset'] = tf.data.Dataset.from_tensor_slices(skibidi)
    i += 1

Finally, we connect the model to the training data using an input function. The function creates the tensorflow operations that generate data for the model. It maps the raw (preprocessed) data into the model.

We found that we need to make the `tf.data.Dataset` iterable before we can train the model with it, and one way to do this was by using `as_numpy_iterator()`. However, it appears that this can only happen when running in ***eager execution*** mode, but unfortunately, we still could not train it after setting it as required. I am guessing that this is due to incompatibility issues with previous versions of tensorflow. What I do know is that we need to get the `numpy` data from the `TensorSliceDataset` data.

In [21]:
# utility function that converts the tf.data.Dataset into a tuple
def to_tuple(batch):
    iterator = batch.as_numpy_iterator()
    for item in iterator:
        features = {
            'userId': item[0],
            'movieId': item[1],
            'rating': item[2],
            'avg_movie_rating': item[3],
            'genres': item[4],
            'user_all_genres': item[5]
        }
        label = item[6]
    return features, label

# utility function that creates the input function from the tf dataset
def get_input_fn(dataset_context_manager):
    def _fn():
        return to_tuple(dataset_context_manager)
    return _fn()

In [None]:
train_spec = tf.estimator.TrainSpec(input_fn=lambda: get_input_fn(train_tf_dataset), max_steps=29000)
eval_spec = tf.estimator.EvalSpec(input_fn=lambda: get_input_fn(val_tf_dataset))

with mlflow.start_run():
    tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)
    artifact_uri = mlflow.get_artifact_uri()

## 4. Future

A high priority is the backwards compatibility issues and learning to adapt Keras for the model so that it can be trained and tested better. This will provide a solid foundation for the recommendation engine.