### Deep Structured Semantic Model - Index Building
This notebook is used to build the index for the Deep Structured Semantic Model (DSSM) using the Yelp dataset. The DSSM model is used to retrieve similar businesses based on the business name and categories. The model is built using the `faiss` library which is a library for efficient similarity search and clustering of dense vectors.

#### Pre-requisites
1. Have the processed `user_model.keras`, `scalers.pkl` and `encoder.pkl` in the `./Saved_Triplet_Hinge_Loss` folder.
2. Have the processed Yelp dataset in the `../../data/processed_data/yelp_data` folder.
3. Have the virtual environment setup and used for the notebook.

#### Output
1. `faiss_index.bin` - The index file that is used to retrieve similar businesses based on the business name and categories.
2. `business_ids.npy` - The business ids that are used to retrieve the business details from the Yelp dataset.
3. `user_continuous_features.pkl` - The user continuous features that are used to retrieve the user details from the Yelp dataset. (Temporary file)

#### Move to Production
1. Gather all the files in the `./Saved_Triplet_Hinge_Loss` folder and move them to the `../../data/processed_data/DSSM` folder. The files are:
    - `user_model.keras`
    - `user_scalers.pkl`
    - `user_id_encoder.pkl`
    - `user_continous_features.pkl` (Temporary file)
    - `faiss_index.bin` (This replaces the business model)
    - `business_ids.npy`
    - `business_id_encoder.pkl`

In [74]:
from general_program import *
import faiss

In [75]:
user_model, item_model, user_id_encoder, business_id_encoder, categories_encoder, user_scaler, business_scaler = load_saved_models(save_folder_path='Saved_Triplet_Hinge_Loss/')

In [76]:
user_df, business_df, review_df, user_continuous_features_scaled, business_continuous_features_scaled, num_users, num_businesses, num_categories = prepare_data(user_df, business_df, review_df, categories_df, user_id_encoder, business_id_encoder, categories_encoder, user_scaler, business_scaler, use_stage='test')

In [77]:
business_category_map = business_df.set_index('business_id_encoded')['category_encoded']

In [78]:
# Step 1: Prepare the Faiss index for business embeddings
def create_faiss_index(item_model, business_ids, business_cont_features, business_category_map, max_category_length=MAX_CATEGORY_LENGTH):
    business_categories = business_category_map.loc[business_ids].apply(
        lambda x: x if isinstance(x, list) else []
    )
    business_category_padded = pad_sequences(business_categories.tolist(), maxlen=max_category_length, padding="post")

    # Predict embeddings
    business_embeddings = item_model.predict([business_ids, business_category_padded, business_cont_features])

    business_embeddings_normalized = normalize(business_embeddings, axis=1)
    # Create a Faiss index for cosine similarity (using inner product)
    index = faiss.IndexFlatIP(business_embeddings_normalized.shape[1])  # Assuming 16-dimensional embeddings
    index.add(business_embeddings_normalized)
    return index, business_embeddings_normalized

business_ids = business_continuous_features_scaled.index.values
faiss_index, business_embeddings_normalized = create_faiss_index(
    item_model, business_ids, 
    business_continuous_features_scaled.values, 
    business_category_map
)

[1m2440/2440[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m4s[0m 1ms/step


In [79]:
# Step 2: Query top-k businesses for a given user
def query_top_k(user_id, user_model, faiss_index, business_ids, k=5):
    # Encode user_id and get continuous features
    user_id_encoded = user_id_encoder.transform([user_id])[0]
    user_cont_features = user_scaler.transform(
        user_continuous_features_scaled.loc[[user_id_encoded]].values
    )

    # Predict and normalize the user's embedding
    user_embedding = user_model.predict([np.array([user_id_encoded]), user_cont_features], verbose=0)
    user_embedding_normalized = normalize(user_embedding, axis=1)

    # Perform ANN search using Faiss
    distances, indices = faiss_index.search(user_embedding_normalized, k)

    # Return top-k businesses and distances
    top_k_business_ids = business_ids[indices.flatten()]
    return top_k_business_ids, distances.flatten()

In [81]:
# Step 3: Example usage
user_id = "9HQLEChkam3GMBQn0SmvVw"  # Replace with an actual user_id from your dataset
top_k_business_ids, scores = query_top_k(user_id, user_model, faiss_index, business_ids, k=300)

# Decode business IDs back to their original format
decoded_business_ids = business_id_encoder.inverse_transform(top_k_business_ids)
result_df = pd.DataFrame({
    'business_id': decoded_business_ids,
    'similarity_score': scores
})

print(result_df)

                business_id  similarity_score
0    sQhh7JCGpqNgf0hHWc4m8g          0.762543
1    TV58NdbRgHq2IMxyzHWDbQ          0.732087
2    lwJllJ5e4CLHdliOPEfgGg          0.729557
3    vUrTGX_7HxqeoQ_6QCVz6g          0.728677
4    mBgfK8HLthPOMPkEbYLW-A          0.725245
..                      ...               ...
295  O1XwKgUYNI_YBXnOqgaTaw          0.598820
296  ETAxDtQbcCmy5ibQ5Y6Glg          0.598788
297  9A5Gw0At6so0x-vWM0_JZw          0.598568
298  hWuLvI5QqPyQ1x9ww0HeRw          0.597822
299  _6BDxk8486ZYiRwpPmQewg          0.597797

[300 rows x 2 columns]


In [None]:
save_folder = "Saved_Triplet_Hinge_Loss/"
# Save the Faiss index to a file
faiss.write_index(faiss_index, save_folder+"faiss_index.bin")

# Save business IDs
np.save(save_folder+"business_ids.npy", business_ids)

# Save user continuous features (temporal solution)
with open(save_folder + "user_continuous_features_scaled.pkl", "wb") as f:
    pickle.dump(user_continuous_features_scaled, f)