### Deep Structured Semantic Model - Triplet Hinge Loss
This notebook trains the Deep Structured Semantic Model (DSSM) on the Yelp dataset. The DSSM predicts how strongly a user is interested in a business from the user and business features. The model is trained using the Triplet Hinge Loss as the loss function, according to the paper from Facebook. The goal is to separate the positive and negative samples by a margin, so that positive samples are closer to the anchor than negative samples.
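
The margin objective described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: cosine similarity as the score, a hypothetical margin of 0.2, and function names (`cosine_similarity`, `triplet_hinge_loss`) that do not appear in the notebook.

```python
import numpy as np

def cosine_similarity(a, b):
    """Row-wise cosine similarity between two (batch, dim) arrays."""
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return np.sum(a * b, axis=-1)

def triplet_hinge_loss(anchor, positive, negative, margin=0.2):
    """Hinge loss that is zero once the positive beats the negative by `margin`.

    The margin value is an assumption chosen for illustration.
    """
    pos_sim = cosine_similarity(anchor, positive)
    neg_sim = cosine_similarity(anchor, negative)
    return np.maximum(0.0, margin - pos_sim + neg_sim)

# Toy triplet: the anchor matches the positive and is orthogonal to the negative.
anchor = np.array([[1.0, 0.0]])
positive = np.array([[1.0, 0.0]])
negative = np.array([[0.0, 1.0]])
print(triplet_hinge_loss(anchor, positive, negative))
```

When the positive is already more similar to the anchor than the negative by at least the margin, the loss is zero and the triplet contributes no gradient.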

#### Pre-requisites
1. Have the processed Yelp dataset in the `../../data/processed_data/yelp_data` folder.
2. Set up the virtual environment and use it to run the notebook.

#### Output
1. `user_model.keras` - The trained user model for retrieval. 
2. `user_id_encoder.pkl` - The label encoder for the user id.
3. `user_scaler.pkl` - The scaler for the user features.
4. `item_model.keras` - The trained business (item) model for retrieval.
5. `business_id_encoder.pkl` - The label encoder for the business id.
6. `categories_encoder.pkl` - The label encoder for the business categories.
7. `business_geohash_encoder.pkl` - The label encoder for the business geohash.
8. `business_scaler.pkl` - The scaler for the business features.
    
#### Move to Production
1. Before moving to production, the model should be indexed in the `DSSM Index (Faiss).ipynb` notebook.
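
To make the purpose of the indexing step concrete, here is a brute-force NumPy sketch of what a nearest-neighbour index answers: which business embeddings score highest against a user embedding. It is not the Faiss notebook's code. The random embeddings, the helper name `top_k_by_inner_product`, and the choice of unit-normalised vectors are all assumptions made for illustration.

```python
import numpy as np

def top_k_by_inner_product(user_vec, item_matrix, k=3):
    """Return the indices and scores of the k items with the largest inner product."""
    scores = item_matrix @ user_vec
    top = np.argsort(-scores)[:k]
    return top, scores[top]

rng = np.random.default_rng(0)

# Hypothetical stand-in for business tower outputs: 100 items, 16 dimensions.
item_embeddings = rng.normal(size=(100, 16)).astype("float32")
item_embeddings /= np.linalg.norm(item_embeddings, axis=1, keepdims=True)

# Hypothetical user embedding pointing in the same direction as item 7.
user_embedding = 2.0 * item_embeddings[7]

ids, scores = top_k_by_inner_product(user_embedding, item_embeddings, k=5)
print(ids, scores)
```

An approximate index trades some of this exactness for speed once the number of businesses makes a full scan too slow.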

In [1]:
from general_program import *
from tensorflow.keras.layers import Dot, Activation
from tensorflow.keras.models import Model
from tensorflow.keras.callbacks import ModelCheckpoint

Loaded 78059 rows from business_details table.
Loaded 360656 rows from business_categories table.
Loaded 980418 rows from review table.
Loaded 229447 rows from user table.
Loaded 173085 rows from tip table.


In [2]:
categories_encoder = LabelEncoder()
user_id_encoder = LabelEncoder()
business_id_encoder = LabelEncoder()
business_geohash_encoder = LabelEncoder()

user_scaler = StandardScaler()
business_scaler = StandardScaler()
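
For readers unfamiliar with these two preprocessing steps, the sketch below reproduces their core arithmetic in plain NumPy on made-up data: label encoding maps each distinct id to an integer position in the sorted list of ids, and standard scaling subtracts the column mean and divides by the population standard deviation. The ids and values are hypothetical, and `prepare_data` may apply the encoders and scalers differently.

```python
import numpy as np

# Label encoding: each distinct id becomes its position in the sorted unique ids.
raw_ids = np.array(["u_b", "u_a", "u_c", "u_a"])
classes, encoded = np.unique(raw_ids, return_inverse=True)
print(classes)   # sorted unique ids
print(encoded)   # integer code for each original id

# Decoding is a plain lookup into the sorted classes.
decoded = classes[encoded]

# Standard scaling: zero mean and unit variance per column (population std, ddof=0).
values = np.array([[1.0], [2.0], [3.0]])
scaled = (values - values.mean(axis=0)) / values.std(axis=0)
print(scaled)
```

The integer codes are what the embedding layers consume, which is why the number of distinct ids determines each embedding table's `input_dim`.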

In [3]:
(
    user_df, business_df, review_df, label_df,
    user_continuous_features_scaled, business_continuous_features_scaled,
    num_users, num_businesses, num_categories, num_geohashes,
) = prepare_data(
    user_df, business_df, review_df, categories_df,
    user_id_encoder, business_id_encoder, categories_encoder, business_geohash_encoder,
    user_scaler, business_scaler,
)

  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  easy_positive_df['label'] = 1
  positive_df = pd.concat([positive_df, easy_positive_df])
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  easy_negative_df['label'] = 0
  negative_df = pd.concat([negative_df, easy_negative_df])

In [4]:
user_continuous_features_scaled = user_continuous_features_scaled.set_index(user_df['user_id_encoded'].values)
business_continuous_features_scaled = business_continuous_features_scaled.set_index(business_df['business_id_encoded'].values)
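
Re-indexing by the encoded id is what later allows one `.loc` call to fetch a feature row per interaction, including repeated ids. A small sketch with made-up feature names and values (`review_count`, `avg_stars` are hypothetical columns):

```python
import numpy as np
import pandas as pd

# Hypothetical scaled features, one row per user, in arbitrary row order.
features = pd.DataFrame(
    {"review_count": [0.5, -1.0, 0.5], "avg_stars": [1.2, 0.0, -1.2]}
)
encoded_ids = np.array([2, 0, 1])  # encoded id of each row above

# Use the encoded id as the index so rows can be looked up by id.
features = features.set_index(encoded_ids)

# Interactions may repeat ids; .loc returns one row per requested id, in order.
interaction_ids = np.array([0, 0, 2, 1])
batch = features.loc[interaction_ids].values
print(batch.shape)
```

The result has one row per interaction, so it lines up with the id and label arrays built from the same DataFrame.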

In [5]:
def create_embedding_layer(input_dim, output_dim, name):
    return layers.Embedding(
        input_dim=input_dim,
        output_dim=output_dim,
        name=f"{name}_embedding",
        # embeddings_regularizer=regularizers.l2(1e-4)  # L2 regularization
    )

# Create embedding layers
user_id_embedding = create_embedding_layer(num_users, 16, "user_id")
business_id_embedding = create_embedding_layer(num_businesses, 16, "business_id")
category_embedding = create_embedding_layer(num_categories, 16, "category")
business_geohash_embedding = create_embedding_layer(num_geohashes, 16, "geohash")
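
Conceptually, an embedding layer is a trainable lookup table indexed by the encoded id. The NumPy sketch below (random table, hypothetical sizes) shows the shapes involved and why the towers flatten the result: an input of shape `(batch, 1)` yields `(batch, 1, dim)`, which has to become `(batch, dim)` before it can be concatenated with the continuous features.

```python
import numpy as np

rng = np.random.default_rng(0)
num_users, embedding_dim = 5, 16

# Stand-in for the trainable embedding matrix: one row per encoded id.
table = rng.normal(size=(num_users, embedding_dim))

# A batch of three encoded ids with shape (batch, 1), as fed to the towers.
ids = np.array([[3], [0], [3]])

embedded = table[ids]                   # shape (batch, 1, embedding_dim)
flat = embedded.reshape(len(ids), -1)   # shape (batch, embedding_dim)
print(embedded.shape, flat.shape)
```

Rows that share an id receive the identical vector, which is how the model ties together every interaction of the same user or business.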

In [6]:
def user_tower(continuous_dim):
    # Inputs
    user_id_input = layers.Input(shape=(1,), name="user_id")
    user_continuous_input = layers.Input(shape=(continuous_dim,), name="user_continuous")

    # Embedding
    user_id_embedded = user_id_embedding(user_id_input)
    user_id_embedded = layers.Flatten()(user_id_embedded)

    # Combine
    concat = layers.Concatenate()([user_id_embedded, user_continuous_input])
    x = layers.Dense(64, activation='relu')(concat)
    x = layers.Dense(32, activation='relu')(x)
    # x = layers.Dense(64, activation='relu', kernel_regularizer=regularizers.l2(1e-4))(concat)
    # x = layers.Dropout(0.15)(x)  # Drop 15% of units randomly
    # x = layers.Dense(32, activation='relu', kernel_regularizer=regularizers.l2(1e-4))(x)
    # x = layers.Dropout(0.15)(x)  # Drop 15% of units randomly
    user_embedding = layers.Dense(16, activation=None, name="user_embedding")(x)

    return Model([user_id_input, user_continuous_input], user_embedding, name="UserTower")


In [7]:
def item_tower(continuous_dim):
    # Inputs
    business_id_input = layers.Input(shape=(1,), name="business_id")
    business_continuous_input = layers.Input(shape=(continuous_dim,), name="business_continuous")
    # category_input = layers.Input(shape=(None,), dtype="int32", name="category_indices")  # Variable-length input
    geohash_input = layers.Input(shape=(1,), name="geohash")

    # Embedding
    business_id_embedded = business_id_embedding(business_id_input)
    business_id_embedded = layers.Flatten()(business_id_embedded)

    # category_embeddings = category_embedding(category_input)
    # aggregated_category_embedding = CategoryPoolingLayer(name="category_pooling")(category_embeddings)

    geohash_embedding = business_geohash_embedding(geohash_input)
    geohash_embedding = layers.Flatten()(geohash_embedding)

    # Combine
    concat = layers.Concatenate()([business_id_embedded,
                                    # aggregated_category_embedding, 
                                    geohash_embedding, business_continuous_input])

    # x = layers.Dense(64, activation='relu', kernel_regularizer=regularizers.l2(1e-4))(concat)
    # x = layers.Dropout(0.15)(x)  # Drop 15% of units randomly
    # x = layers.Dense(32, activation='relu', kernel_regularizer=regularizers.l2(1e-4))(x)
    # x = layers.Dropout(0.15)(x)  # Drop 15% of units randomly
    x = layers.Dense(64, activation='relu')(concat)
    x = layers.Dense(32, activation='relu')(x)
    business_embedding = layers.Dense(16, activation=None, name="business_embedding")(x)

    return Model(
        [
            business_id_input,
            # category_input,
            geohash_input,
            business_continuous_input,
        ],
        business_embedding,
        name="ItemTower",
    )


In [8]:
# Split label_df into train and test sets
train_df, test_df = train_test_split(label_df, test_size=0.2, random_state=42)

In [9]:
# Instantiate towers
user_model = user_tower(user_continuous_features_scaled.shape[1])
item_model = item_tower(business_continuous_features_scaled.shape[1])

# Define inputs for user and business towers
user_inputs_model = [Input(shape=(1,), dtype=tf.int32, name="user_id_input"),
                     Input(shape=(user_continuous_features_scaled.shape[1],), name="user_cont_features_input")]

business_inputs_model = [
    Input(shape=(1,), dtype=tf.int32, name="business_id_input"),
    Input(shape=(1,), dtype=tf.int32, name="geohash_input"),
    Input(shape=(business_continuous_features_scaled.shape[1],), name="business_cont_features_input")
]

user_embedding = user_model(user_inputs_model)    # Shape: (batch, embedding_dim)
business_embedding = item_model(business_inputs_model)  # Shape: (batch, embedding_dim)

# Compute cosine similarity: with normalize=True, Dot L2-normalizes both
# embeddings before taking the dot product, so the output lies in [-1, 1].
similarity = Dot(axes=-1, normalize=True)([user_embedding, business_embedding])

# Squash the similarity with a sigmoid so it can be trained with binary cross-entropy
output = Activation('sigmoid', name="similarity_output")(similarity)

# Build the binary classification model
binary_model = Model(
    inputs=user_inputs_model + business_inputs_model,
    outputs=output,
    name="binary_dssm_model"
)

binary_model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
binary_model.summary()
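
One property of this head is worth checking numerically. Because the cell above feeds a cosine similarity in [-1, 1] into a sigmoid, the predicted probability cannot leave a fairly narrow band, and the binary cross-entropy of a perfectly aligned positive pair stays above zero. The sketch below computes those bounds with NumPy; it is a side calculation, not part of the notebook's training code.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Extreme and neutral cosine similarities.
cosine = np.array([-1.0, 0.0, 1.0])
probs = sigmoid(cosine)
print(probs)

# Smallest achievable binary cross-entropy for a positive pair (cosine = 1).
min_bce_positive = -np.log(sigmoid(1.0))
print(min_bce_positive)
```

This is useful context when reading the reported loss: values near 0.31 would already be close to the floor for this architecture.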

In [10]:
user_ids = np.array(train_df['user_id_encoded'].values, dtype=np.int32)
business_ids = np.array(train_df['business_id_encoded'].values, dtype=np.int32)

# Look up the scaled continuous features for each user ID in the training set
user_features = user_continuous_features_scaled.loc[user_ids].values

# Look up the scaled continuous features for each business ID in the training set
business_features = business_continuous_features_scaled.loc[business_ids].values

# Series mapping business_id_encoded -> geohash_encoded
business_geohash_map = business_df.set_index('business_id_encoded')['geohash_encoded']

# Look up the encoded geohash for each business ID; the result is aligned
# row-for-row with business_ids
geohash_features = business_geohash_map.loc[business_ids].values

labels = train_df['label'].values


# Prepare the input list for binary_model; the order must match
# user_inputs_model + business_inputs_model
train_inputs = [
    user_ids, user_features,
    business_ids, geohash_features, business_features
]

In [15]:
binary_model.fit(
    x=train_inputs,
    y=labels,
    batch_size=32,
    epochs=1,
    validation_split=0.2,
    verbose=1,
)

12101/12101 ━━━━━━━━━━━━━━━━━━━━ 189s 16ms/step - accuracy: 0.8853 - loss: 0.4332 - val_accuracy: 0.7860 - val_loss: 0.5216


<keras.src.callbacks.history.History at 0x2373616a410>
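
A note on how `validation_split=0.2` picks its rows: Keras takes the validation set from the last samples of `x` and `y`, before any shuffling. The sketch below mimics that slicing on indices only; the helper name `split_like_validation_split` is hypothetical. Here `train_df` comes from `train_test_split`, which shuffles by default, so the tail should already be a random sample.

```python
import numpy as np

def split_like_validation_split(n_samples, validation_split):
    """Index split that mirrors taking the last fraction of rows for validation."""
    split_at = int(n_samples * (1.0 - validation_split))
    train_idx = np.arange(split_at)
    val_idx = np.arange(split_at, n_samples)
    return train_idx, val_idx

train_idx, val_idx = split_like_validation_split(10, 0.2)
print(train_idx, val_idx)
```

If the input rows were ordered, for example positives first, the validation tail would not be representative, so it is worth confirming the order before relying on `val_accuracy`.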

In [16]:
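import os

save_folder_path = 'Saved_BCE_Loss/'
os.makedirs(save_folder_path, exist_ok=True)  # Create the output folder if it does not exist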

# Save the models
user_model.save(save_folder_path + 'user_model.keras')
item_model.save(save_folder_path + 'item_model.keras')

# Save the label encoders
with open(save_folder_path + 'user_id_encoder.pkl', 'wb') as f:
    pickle.dump(user_id_encoder, f)

with open(save_folder_path + 'business_id_encoder.pkl', 'wb') as f:
    pickle.dump(business_id_encoder, f)

with open(save_folder_path + 'categories_encoder.pkl', 'wb') as f:
    pickle.dump(categories_encoder, f)

with open(save_folder_path + 'business_geohash_encoder.pkl', 'wb') as f:
    pickle.dump(business_geohash_encoder, f)
    
# Save the scalers
with open(save_folder_path + 'user_scaler.pkl', 'wb') as f:
    pickle.dump(user_scaler, f)

with open(save_folder_path + 'business_scaler.pkl', 'wb') as f:
    pickle.dump(business_scaler, f)