### Deep Structured Semantic Model - Triplet Hinge Loss
This notebook trains a Deep Structured Semantic Model (DSSM) on the Yelp dataset. The DSSM predicts user interest in businesses from user and business features. It is trained with a triplet hinge loss, following the paper from Facebook. The goal is to separate positive and negative samples by a margin, so that each positive sample is closer to the anchor than the negative sample.
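
For reference, the objective computed by the `triplet_hinge_loss` cell in this notebook can be written as follows. Here $a_i$ is the user (anchor) embedding, $p_i$ and $n_i$ are the positive and negative business embeddings, $m$ is the margin, and $B$ is the batch size. The symbols are notation chosen here for illustration, not taken from the paper.

$$
\mathcal{L} = \frac{1}{B} \sum_{i=1}^{B} \max\left(0,\; \lVert a_i - p_i \rVert_2^2 - \lVert a_i - n_i \rVert_2^2 + m\right)
$$

A triplet contributes zero loss once its negative is farther from the anchor than its positive by at least $m$, measured in squared Euclidean distance.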

#### Pre-requisites
1. Have the processed Yelp dataset in the `../../data/processed_data/yelp_data` folder.
2. Have the virtual environment set up and selected as the kernel for this notebook.

#### Output
1. `user_model.keras` - The trained user model for retrieval. 
2. `user_id_encoder.pkl` - The label encoder for the user id.
3. `user_scaler.pkl` - The scaler for the user features.
4. `item_model.keras` - The trained business (item) model for retrieval.
5. `business_id_encoder.pkl` - The label encoder for the business id.
6. `categories_encoder.pkl` - The label encoder for the business categories.
7. `business_scaler.pkl` - The scaler for the business features.
8. `business_geohash_encoder.pkl` - The label encoder for the business geohash.

#### Move to Production
1. Before moving to production, the model should be indexed in the `DSSM Index (Faiss).ipynb` notebook.
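
To illustrate what indexing is for, here is a minimal brute-force sketch of the retrieval step that an index speeds up: given one user embedding, find the closest business embeddings. It is a NumPy stand-in, not the Faiss notebook's code. The function name `top_k_businesses` and the toy vectors are hypothetical, and squared L2 distance is assumed because that is the distance the training loss uses.

```python
import numpy as np

def top_k_businesses(user_vec, business_vecs, k=3):
    """Return indices of the k business embeddings closest to user_vec (squared L2)."""
    dists = np.sum((business_vecs - user_vec) ** 2, axis=1)
    return np.argsort(dists, kind="stable")[:k]

# Toy 2-D embeddings (hypothetical values, not model output)
business_vecs = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0], [0.9, 1.2]])
user_vec = np.array([1.0, 1.0])

nearest = top_k_businesses(user_vec, business_vecs, k=2)
print(nearest)
```

A brute-force scan like this costs one distance computation per business for every query, which is the cost an approximate nearest-neighbour index is meant to reduce.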

In [1]:
from general_program import *

Loaded 78059 rows from business_details table.
Loaded 360656 rows from business_categories table.
Loaded 980418 rows from review table.
Loaded 229447 rows from user table.
Loaded 173085 rows from tip table.


In [2]:
categories_encoder = LabelEncoder()
user_id_encoder = LabelEncoder()
business_id_encoder = LabelEncoder()
business_geohash_encoder = LabelEncoder()

user_scaler = StandardScaler()
business_scaler = StandardScaler()

In [3]:
user_df, business_df, review_df, label_df, user_continuous_features_scaled, business_continuous_features_scaled, num_users, num_businesses, num_categories, num_geohashes = prepare_data(user_df, business_df, review_df, categories_df, user_id_encoder, business_id_encoder, categories_encoder, business_geohash_encoder, user_scaler, business_scaler)

  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  easy_positive_df['label'] = 1
  positive_df = pd.concat([positive_df, easy_positive_df])
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  easy_negative_df['label'] = 0
  negative_df = pd.concat([negative_df, easy_negative_df])

In [4]:
user_continuous_features_scaled = user_continuous_features_scaled.set_index(user_df['user_id_encoded'].values)
business_continuous_features_scaled = business_continuous_features_scaled.set_index(business_df['business_id_encoded'].values)

In [5]:
def generate_balanced_triplets(review_df, neg_sampling_factor=3):
    """
    Generate triplets with oversampling of negatives when necessary.
    
    Parameters:
      review_df: DataFrame of labeled interactions with 'user_id_encoded', 'business_id_encoded', and 'label' columns.
      neg_sampling_factor: How many negative samples to generate per positive sample.
      
    Returns:
      A numpy array of triplets in the form [user_id, positive_business_id, negative_business_id].
    """
    triplets = []
    # Group reviews by user
    grouped = review_df.groupby('user_id_encoded')
    
    for user_id, group in grouped:
        # Separate positives and negatives for this user
        pos_samples = group[group['label'] == 1]
        neg_samples = group[group['label'] == 0]
        
        # Skip if user doesn't have at least one positive and one negative sample
        if pos_samples.empty or neg_samples.empty:
            continue
        
        # For each positive sample, generate triplets by oversampling negatives.
        for _, pos_row in pos_samples.iterrows():
            for _ in range(neg_sampling_factor):
                # Sample a negative with replacement (oversampling)
                neg_row = neg_samples.sample(n=1, replace=True).iloc[0]
                triplets.append([
                    user_id, 
                    pos_row['business_id_encoded'], 
                    neg_row['business_id_encoded']
                ])
    
    return np.array(triplets)
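
To make the shape of the output concrete, here is a minimal NumPy-only sketch of the same sampling idea on a hypothetical toy interaction table. It is not the function above: `sample_triplets` and the toy arrays are made up for illustration, and it draws all negatives for a user in one call instead of one row at a time.

```python
import numpy as np

def sample_triplets(user_ids, business_ids, labels, neg_sampling_factor=3, seed=0):
    """Build [user, positive_business, negative_business] rows per user."""
    rng = np.random.default_rng(seed)
    triplets = []
    for user in np.unique(user_ids):
        mask = user_ids == user
        pos = business_ids[mask & (labels == 1)]
        neg = business_ids[mask & (labels == 0)]
        # Skip users without at least one positive and one negative
        if len(pos) == 0 or len(neg) == 0:
            continue
        # Repeat each positive, then draw negatives with replacement
        pos_rep = np.repeat(pos, neg_sampling_factor)
        neg_draw = rng.choice(neg, size=len(pos_rep), replace=True)
        user_col = np.full(len(pos_rep), user)
        triplets.append(np.column_stack([user_col, pos_rep, neg_draw]))
    if not triplets:
        return np.empty((0, 3), dtype=int)
    return np.concatenate(triplets)

# Hypothetical interactions: user 0 has both labels, users 1 and 2 do not
user_ids = np.array([0, 0, 0, 1, 1, 2])
business_ids = np.array([10, 11, 12, 10, 13, 14])
labels = np.array([1, 0, 0, 1, 1, 0])

toy_triplets = sample_triplets(user_ids, business_ids, labels, neg_sampling_factor=2)
print(toy_triplets.shape)
```

Only user 0 yields triplets here, because users 1 and 2 lack either a negative or a positive interaction and are skipped, just as in the loop above.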

In [14]:
def create_embedding_layer(input_dim, output_dim, name):
    return layers.Embedding(
        input_dim=input_dim,
        output_dim=output_dim,
        name=f"{name}_embedding",
        # embeddings_regularizer=regularizers.l2(1e-4)  # L2 regularization
    )

# Create embedding layers
user_id_embedding = create_embedding_layer(num_users, 16, "user_id")
business_id_embedding = create_embedding_layer(num_businesses, 16, "business_id")
category_embedding = create_embedding_layer(num_categories, 16, "category")
business_geohash_embedding = create_embedding_layer(num_geohashes, 16, "geohash")

In [15]:
def user_tower(continuous_dim):
    # Inputs
    user_id_input = layers.Input(shape=(1,), name="user_id")
    user_continuous_input = layers.Input(shape=(continuous_dim,), name="user_continuous")

    # Embedding
    user_id_embedded = user_id_embedding(user_id_input)
    user_id_embedded = layers.Flatten()(user_id_embedded)

    # Combine
    concat = layers.Concatenate()([user_id_embedded, user_continuous_input])
    # x = layers.Dense(64, activation='relu', kernel_regularizer=regularizers.l2(1e-4))(concat)
    x = layers.Dense(64, activation='relu')(concat)
    x = layers.Dropout(0.1)(x)  # Drop 10% of units randomly during training
    # x = layers.Dense(32, activation='relu', kernel_regularizer=regularizers.l2(1e-4))(x)
    x = layers.Dense(32, activation='relu')(x)
    x = layers.Dropout(0.1)(x)  # Drop 10% of units randomly during training
    user_embedding = layers.Dense(16, activation=None, name="user_embedding")(x)

    return Model([user_id_input, user_continuous_input], user_embedding, name="UserTower")


In [16]:
def item_tower(continuous_dim):
    # Inputs
    business_id_input = layers.Input(shape=(1,), name="business_id")
    business_continuous_input = layers.Input(shape=(continuous_dim,), name="business_continuous")
    # category_input = layers.Input(shape=(None,), dtype="int32", name="category_indices")  # Variable-length input
    geohash_input = layers.Input(shape=(1,), name="geohash")

    # Embedding
    business_id_embedded = business_id_embedding(business_id_input)
    business_id_embedded = layers.Flatten()(business_id_embedded)

    # category_embeddings = category_embedding(category_input)
    # aggregated_category_embedding = CategoryPoolingLayer(name="category_pooling")(category_embeddings)

    geohash_embedding = business_geohash_embedding(geohash_input)
    geohash_embedding = layers.Flatten()(geohash_embedding)

    # Combine
    concat = layers.Concatenate()([business_id_embedded,
                                    # aggregated_category_embedding, 
                                    geohash_embedding, business_continuous_input])

    # x = layers.Dense(64, activation='relu', kernel_regularizer=regularizers.l2(1e-4))(concat)
    x = layers.Dense(64, activation='relu')(concat)
    x = layers.Dropout(0.1)(x)  # Drop 10% of units randomly during training
    # x = layers.Dense(32, activation='relu', kernel_regularizer=regularizers.l2(1e-4))(x)
    x = layers.Dense(32, activation='relu')(x)
    x = layers.Dropout(0.1)(x)  # Drop 10% of units randomly during training
    business_embedding = layers.Dense(16, activation=None, name="business_embedding")(x)

    return Model([business_id_input, 
                #   category_input,
                    geohash_input,
                    business_continuous_input],
                    business_embedding, name="ItemTower")


In [17]:
# Triplet loss function
def triplet_hinge_loss(margin=1.0):
    def loss(y_true, y_pred):
        # y_pred shape: (batch_size, 3, embedding_dim)
        anchor, positive, negative = tf.unstack(y_pred, num=3, axis=1)
        
        # Squared Euclidean distances: anchor-positive and anchor-negative
        pos_dist = tf.reduce_sum(tf.square(anchor - positive), axis=-1)
        neg_dist = tf.reduce_sum(tf.square(anchor - negative), axis=-1)
        
        # Hinge loss: max(0, pos_dist - neg_dist + margin)
        return tf.reduce_mean(tf.maximum(pos_dist - neg_dist + margin, 0.0))
    return loss
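
As a sanity check on the arithmetic, the following is a small NumPy re-implementation of the same formula, evaluated on two hand-made triplets. The function name `triplet_hinge_loss_np` and the toy embeddings are hypothetical. The block mirrors the TensorFlow operations above but is not a replacement for them.

```python
import numpy as np

def triplet_hinge_loss_np(y_pred, margin=1.0):
    # y_pred shape: (batch_size, 3, embedding_dim)
    anchor, positive, negative = y_pred[:, 0], y_pred[:, 1], y_pred[:, 2]
    pos_dist = np.sum((anchor - positive) ** 2, axis=-1)
    neg_dist = np.sum((anchor - negative) ** 2, axis=-1)
    return float(np.mean(np.maximum(pos_dist - neg_dist + margin, 0.0)))

# Two hypothetical triplets with 2-D embeddings
anchor = np.array([[0.0, 0.0], [0.0, 0.0]])
positive = np.array([[1.0, 0.0], [1.0, 0.0]])
negative = np.array([[2.0, 0.0], [1.0, 0.0]])

# Stack along axis 1 so each sample holds (anchor, positive, negative)
y_pred = np.stack([anchor, positive, negative], axis=1)  # shape (2, 3, 2)
loss_value = triplet_hinge_loss_np(y_pred, margin=1.0)
print(loss_value)
```

The first triplet has distances 1 and 4, so its hinge term is max(0, 1 - 4 + 1) = 0. The second has equal distances, so its term equals the margin, 1. The batch mean is therefore 0.5.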

In [18]:
# Split label_df into train and test sets
train_df, test_df = train_test_split(label_df, test_size=0.2, random_state=42)

# Generate triplets from the full label_df (train_df and test_df are not used below)
balanced_triplets = generate_balanced_triplets(label_df, neg_sampling_factor=3)
# train_triplets = generate_triplets(train_df)
train_triplets = balanced_triplets
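
One thing to be aware of: `generate_balanced_triplets` emits triplets grouped by user, and Keras's `validation_split` takes the last fraction of the samples before any shuffling. Together these mean the validation set in the `fit` call later in this notebook would consist mostly of the users with the highest encoded ids. A minimal sketch of shuffling the rows first is shown below. `shuffle_triplets` and the toy array are hypothetical, and whether shuffling changes the validation curve here has not been tested.

```python
import numpy as np

def shuffle_triplets(triplets, seed=42):
    """Return a row-shuffled copy so no split is dominated by one range of user ids."""
    rng = np.random.default_rng(seed)
    return triplets[rng.permutation(len(triplets))]

# Hypothetical triplets ordered by user id, as a groupby loop would emit them
ordered = np.array([[0, 10, 11], [0, 10, 12], [1, 13, 14], [2, 15, 16], [2, 15, 17]])
shuffled = shuffle_triplets(ordered)
print(shuffled)
```

In this notebook that would correspond to `train_triplets = shuffle_triplets(balanced_triplets)` before the inputs are built.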

In [19]:
# Prepare train inputs
def prepare_triplet_inputs(triplets, user_features, business_features, business_geohash_map, business_category_map, max_category_length=MAX_CATEGORY_LENGTH):
    # Replace NaN values with empty lists in `business_category_map`
    business_category_map = business_category_map.apply(lambda x: x if isinstance(x, list) else [])

    anchor_indices = triplets[:, 0]
    positive_indices = triplets[:, 1]
    negative_indices = triplets[:, 2]

    anchor_features = [anchor_indices, user_features.take(anchor_indices, axis=0).values]
    positive_features = [
        positive_indices,
        # pad_sequences(business_category_map.loc[positive_indices].tolist(), maxlen=max_category_length, padding="post"),
        business_geohash_map.take(positive_indices).values,
        business_features.take(positive_indices, axis=0).values, 
    ]
    negative_features = [
        negative_indices, 
        # pad_sequences(business_category_map.loc[negative_indices].tolist(), maxlen=max_category_length, padding="post"),
        business_geohash_map.take(negative_indices).values,
        business_features.take(negative_indices, axis=0).values,
    ]

    return [
        anchor_features[0], anchor_features[1],
        positive_features[0], positive_features[1], 
        positive_features[2], 
        # positive_features[3],
        negative_features[0], negative_features[1], 
        negative_features[2], 
        # negative_features[3]
    ]

business_category_map = business_df.set_index('business_id_encoded')['category_encoded']
business_geohash_map = business_df.set_index('business_id_encoded')['geohash_encoded']

max_category_length = MAX_CATEGORY_LENGTH

train_inputs = prepare_triplet_inputs(train_triplets, user_continuous_features_scaled, business_continuous_features_scaled, business_geohash_map, business_category_map, max_category_length)
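
Note that `take` selects rows by position, not by index label, so the lookups in `prepare_triplet_inputs` return the right features only if row `i` of each frame holds encoded id `i`. Whether the frames returned by `prepare_data` are already in that order is not shown in this notebook. Below is a minimal NumPy sketch of checking and enforcing that alignment. `align_rows_to_ids` and the toy arrays are hypothetical.

```python
import numpy as np

def align_rows_to_ids(encoded_ids, features):
    """Reorder rows so that row i holds the features of encoded id i."""
    order = np.argsort(encoded_ids, kind="stable")
    if not np.array_equal(encoded_ids[order], np.arange(len(encoded_ids))):
        raise ValueError("encoded ids must be exactly 0..n-1 with no gaps or repeats")
    return features[order]

# Hypothetical features stored in a different order than their encoded ids
encoded_ids = np.array([2, 0, 1])
features = np.array([[0.2], [0.0], [0.1]])

aligned = align_rows_to_ids(encoded_ids, features)
# Positional lookup, analogous to DataFrame.take
looked_up = aligned[np.array([1, 2, 2])]
print(looked_up.ravel())
```

With pandas, sorting each frame by its encoded id column before calling `take`, or using label-based `.loc` lookups instead, would be two ways to get the same guarantee.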

In [20]:
# Instantiate towers
user_model = user_tower(user_continuous_features_scaled.shape[1])
item_model = item_tower(business_continuous_features_scaled.shape[1])

# Define inputs for user and business towers
user_inputs_model = [Input(shape=(1,), dtype=tf.int32, name="user_id_input"),
                     Input(shape=(user_continuous_features_scaled.shape[1],), name="user_cont_features_input")]

positive_inputs_model = [
    Input(shape=(1,), dtype=tf.int32, name="positive_id_input"),
    # Input(shape=(max_category_length,), dtype=tf.int32, name="positive_category_input"),
    Input(shape=(1,), dtype=tf.int32, name="positive_geohash_input"),
    Input(shape=(business_continuous_features_scaled.shape[1],), name="positive_cont_features_input")
]

negative_inputs_model = [
    Input(shape=(1,), dtype=tf.int32, name="negative_id_input"),
    # Input(shape=(max_category_length,), dtype=tf.int32, name="negative_category_input"),
    Input(shape=(1,), dtype=tf.int32, name="negative_geohash_input"),
    Input(shape=(business_continuous_features_scaled.shape[1],), name="negative_cont_features_input")
]

# Generate embeddings
anchor_embedding = user_model(user_inputs_model)
positive_embedding = item_model(positive_inputs_model)
negative_embedding = item_model(negative_inputs_model)

In [21]:
def stack_embeddings(embeddings):
    # Unpack the embeddings from the list
    anchor, positive, negative = embeddings
    return tf.stack([anchor, positive, negative], axis=1)

triplet_embeddings = Lambda(stack_embeddings, name="triplet_embeddings")(
    [anchor_embedding, positive_embedding, negative_embedding]
)




In [22]:
all_inputs = user_inputs_model + positive_inputs_model + negative_inputs_model
# Build the model
triplet_model = Model(
    inputs= all_inputs,
    outputs=triplet_embeddings,
    name="triplet_model"
)

# Compile with triplet loss
triplet_model.compile(
    optimizer='adam',
    # loss=triplet_hinge_loss(margin=0.2)
    loss=triplet_hinge_loss(margin=1)
)

In [23]:
# Train the triplet model
triplet_model.fit(
    x=train_inputs,
    y=np.zeros(len(train_inputs[0])),  # Dummy labels, as loss is computed from embeddings
    batch_size=32,
    # epochs=10,
    epochs=5,
    # epochs=1,
    validation_split=0.2,
    verbose=1
)


Epoch 1/5
20938/20938 ━━━━━━━━━━━━━━━━━━━━ 425s 20ms/step - loss: 0.5443 - val_loss: 0.7398
Epoch 2/5
20938/20938 ━━━━━━━━━━━━━━━━━━━━ 476s 23ms/step - loss: 0.1374 - val_loss: 0.8208
Epoch 3/5
20938/20938 ━━━━━━━━━━━━━━━━━━━━ 392s 19ms/step - loss: 0.0768 - val_loss: 0.8871
Epoch 4/5
  319/20938 ━━━━━━━━━━━━━━━━━━━━ 5:41 17ms/step - loss: 0.0519

KeyboardInterrupt: 
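
In the log above the training loss falls from 0.5443 to 0.0768 while `val_loss` rises from 0.7398 to 0.8871, a pattern usually associated with overfitting. Keras offers the `tf.keras.callbacks.EarlyStopping` callback (with `monitor`, `patience`, and `restore_best_weights` arguments) for stopping automatically in that situation. The plain-Python sketch below only illustrates the patience rule on the logged values. It is not Keras's implementation, and `early_stopping_epoch` is a hypothetical helper.

```python
def early_stopping_epoch(val_losses, patience=1):
    """Return (stop_epoch, best_epoch), both 1-indexed, for a patience-based rule."""
    best_loss = float("inf")
    best_epoch = 0
    waited = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            # New best validation loss: remember it and reset the counter
            best_loss, best_epoch, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                return epoch, best_epoch
    return len(val_losses), best_epoch

# Validation losses from the three completed epochs in the log above
val_losses = [0.7398, 0.8208, 0.8871]
stop_epoch, best_epoch = early_stopping_epoch(val_losses, patience=1)
print(stop_epoch, best_epoch)
```

With a patience of 1, this rule would have stopped after epoch 2 and kept epoch 1 as the best checkpoint for this run.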

In [25]:
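import os

save_folder_path = 'Saved_Triplet_Hinge_Loss/'
os.makedirs(save_folder_path, exist_ok=True)  # Create the folder if it does not exist yet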

# Save the models
user_model.save(save_folder_path + 'user_model.keras')
item_model.save(save_folder_path + 'item_model.keras')

# Save the label encoders
with open(save_folder_path + 'user_id_encoder.pkl', 'wb') as f:
    pickle.dump(user_id_encoder, f)

with open(save_folder_path + 'business_id_encoder.pkl', 'wb') as f:
    pickle.dump(business_id_encoder, f)

with open(save_folder_path + 'categories_encoder.pkl', 'wb') as f:
    pickle.dump(categories_encoder, f)

with open(save_folder_path + 'business_geohash_encoder.pkl', 'wb') as f:
    pickle.dump(business_geohash_encoder, f)
    
# Save the scalers
with open(save_folder_path + 'user_scaler.pkl', 'wb') as f:
    pickle.dump(user_scaler, f)

with open(save_folder_path + 'business_scaler.pkl', 'wb') as f:
    pickle.dump(business_scaler, f)
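
For completeness, here is a minimal sketch of reading one of these pickled artifacts back, for example in the indexing notebook. It uses a stand-in dictionary and a temporary directory so that it runs on its own. `save_pickle` and `load_pickle` are hypothetical helpers, and in practice the object would be one of the fitted encoders or scalers saved above.

```python
import os
import pickle
import tempfile

def save_pickle(obj, path):
    with open(path, "wb") as f:
        pickle.dump(obj, f)

def load_pickle(path):
    with open(path, "rb") as f:
        return pickle.load(f)

# Stand-in object; in the notebook this would be a fitted encoder or scaler
stand_in_encoder = {"classes": ["a", "b", "c"]}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "user_id_encoder.pkl")
    save_pickle(stand_in_encoder, path)
    restored = load_pickle(path)

print(restored)
```

The `.keras` model files are a different format and would be reloaded with `tf.keras.models.load_model` instead of `pickle`.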