### Deep Structured Semantic Model - Index Building
This notebook builds the retrieval index for the Deep Structured Semantic Model (DSSM) on the Yelp dataset. The trained item (business) model embeds each business from its encoded ID, categories, and continuous features. The embeddings are stored in a `faiss` index so that the businesses closest to a user's embedding can be retrieved. `faiss` is a library for efficient similarity search and clustering of dense vectors.

#### Pre-requisites
1. Have the trained `user_model.keras` and the fitted `scalers.pkl` and `encoder.pkl` in the `./Saved_Triplet_Hinge_Loss` folder.
2. Have the processed Yelp dataset in the `../../data/processed_data/yelp_data` folder.
3. Have the virtual environment set up and selected as the kernel for this notebook.

#### Output
1. `faiss_index.bin` - The index file used to retrieve the businesses whose embeddings are closest to a user's embedding.
2. `business_ids.npy` - The encoded business IDs, in the same order as the index rows, used to map search results back to businesses in the Yelp dataset.
3. `user_continuous_features_scaled.pkl` - The scaled continuous features of each user, fed to the user model at query time. (Temporary file)

#### Move to Production
1. Gather all the files in the `./Saved_Triplet_Hinge_Loss` folder and move them to the `../../data/processed_data/DSSM` folder. The files are:
    - `user_model.keras`
    - `user_scalers.pkl`
    - `user_id_encoder.pkl`
    - `user_continuous_features_scaled.pkl` (Temporary file)
    - `faiss_index.bin` (This replaces the business model)
    - `business_ids.npy`
    - `business_id_encoder.pkl`
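
The gathering step above can be scripted. The following is a minimal sketch, not the project's deployment code: `copy_artifacts` and `ARTIFACTS` are hypothetical names, the file names are taken from the list above (with the feature file named as the last cell of this notebook writes it) and may differ from what `load_saved_models` expects, and the files are copied rather than moved so the originals stay in place.

```python
import shutil
from pathlib import Path

# File names assumed from the list above; adjust to the names actually on disk.
ARTIFACTS = [
    "user_model.keras",
    "user_scalers.pkl",
    "user_id_encoder.pkl",
    "user_continuous_features_scaled.pkl",
    "faiss_index.bin",
    "business_ids.npy",
    "business_id_encoder.pkl",
]


def copy_artifacts(src_dir, dst_dir, names=ARTIFACTS):
    """Copy every artifact from src_dir to dst_dir, failing early if any is missing."""
    src_dir, dst_dir = Path(src_dir), Path(dst_dir)
    missing = [name for name in names if not (src_dir / name).is_file()]
    if missing:
        raise FileNotFoundError(f"Missing artifacts in {src_dir}: {missing}")
    dst_dir.mkdir(parents=True, exist_ok=True)
    for name in names:
        shutil.copy2(src_dir / name, dst_dir / name)
    return [dst_dir / name for name in names]


# Example call (paths as given in this notebook):
# copy_artifacts("./Saved_Triplet_Hinge_Loss", "../../data/processed_data/DSSM")
```

Checking for missing files before copying avoids a half-populated production folder.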

In [1]:
from general_program import *
import faiss

Loaded 78059 rows from business_details table.
Loaded 360656 rows from business_categories table.
Loaded 980418 rows from review table.
Loaded 229447 rows from user table.
Loaded 173085 rows from tip table.


In [2]:
save_folder_path='Saved_Triplet_Hinge_Loss/'
user_model, item_model, user_id_encoder, business_id_encoder, categories_encoder, user_scaler, business_scaler = load_saved_models(save_folder_path=save_folder_path)




In [3]:
user_df, business_df, review_df, user_continuous_features_scaled, business_continuous_features_scaled, num_users, num_businesses, num_categories = prepare_data(user_df, business_df, review_df, categories_df, user_id_encoder, business_id_encoder, categories_encoder, user_scaler, business_scaler, use_stage='test')

In [4]:
business_category_map = business_df.set_index('business_id_encoded')['category_encoded']
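
The category lists in this map have different lengths, so the next cell pads them to `MAX_CATEGORY_LENGTH` with `pad_sequences(..., padding="post")`. The pure-Python sketch below illustrates what that call does with Keras's default settings (pad value 0, `truncating="pre"`); `pad_post` is a hypothetical helper for illustration only and assumes `maxlen > 0`.

```python
def pad_post(sequences, maxlen, value=0):
    """Mimic pad_sequences(padding="post", truncating="pre") for lists of ints."""
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]  # truncating="pre": keep only the last maxlen items
        padded.append(seq + [value] * (maxlen - len(seq)))  # pad at the end
    return padded


example = pad_post([[3, 1], [7, 8, 9, 4], []], maxlen=3)
# [3, 1] is padded at the end, [7, 8, 9, 4] loses its first item, [] becomes all zeros
```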

In [5]:
# Step 1: Prepare the Faiss index for business embeddings
def create_faiss_index(item_model, business_ids, business_cont_features, business_category_map, max_category_length=MAX_CATEGORY_LENGTH):
    # business_categories = business_category_map.loc[business_ids].apply(
    #     lambda x: x if isinstance(x, list) else []
    # )
    # business_category_padded = pad_sequences(business_categories.tolist(), maxlen=max_category_length, padding="post")

    business_categories = business_category_map.loc[business_ids].astype(object).tolist()
    business_category_padded = pad_sequences(business_categories, maxlen=max_category_length, padding="post")

    
    # print(business_category_padded)
    # Predict embeddings
    business_embeddings = item_model.predict([business_ids, business_category_padded, business_cont_features])

    business_embeddings_normalized = normalize(business_embeddings, axis=1)
    # Create a Faiss index for cosine similarity (inner product on L2-normalized vectors)
    index = faiss.IndexFlatIP(business_embeddings_normalized.shape[1])  # Dimension is read from the embeddings
    index.add(business_embeddings_normalized)

    # index = faiss.IndexHNSWFlat(business_embeddings_normalized.shape[1], 32)  # 32 neighbors
    # index.hnsw.efConstruction = 200  # Controls recall/accuracy tradeoff
    # index.add(business_embeddings_normalized)

    return index, business_embeddings_normalized

business_ids = business_continuous_features_scaled.index.values
faiss_index, business_embeddings_normalized = create_faiss_index(
    item_model, business_ids, 
    business_continuous_features_scaled.values, 
    business_category_map
)

[1m2440/2440[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m5s[0m 2ms/step
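
The index relies on the fact that the inner product of two L2-normalized vectors equals their cosine similarity. The NumPy sketch below shows that idea on toy vectors without Faiss; `l2_normalize` and `top_k_inner_product` are hypothetical helpers, and `normalize` in the cell above is presumably scikit-learn's row-wise L2 normalization, which is an assumption since it comes from `general_program`.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row of x to unit L2 norm."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.maximum(norms, eps)


def top_k_inner_product(query, items, k):
    """Brute-force top-k search by inner product, highest score first."""
    scores = items @ query
    order = np.argsort(-scores)[:k]
    return order, scores[order]


items = l2_normalize(np.array([[3.0, 4.0], [1.0, 0.0]]))  # rows become [0.6, 0.8] and [1, 0]
query = l2_normalize(np.array([[0.0, 2.0]]))[0]           # becomes [0, 1]
idx, scores = top_k_inner_product(query, items, k=2)
```

Because every vector has unit length, the scores stay within [-1, 1] and can be read directly as cosine similarities.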


In [6]:
# Step 2: Retrieve the top-k businesses for a user
def query_top_k(user_id, user_model, faiss_index, business_ids, k=100):
    # Check if the user_id is in the user_id_encoder
    if user_id not in user_id_encoder.classes_:
        # raise ValueError("User ID is not in the encoder")
        user_id = "default_user"

    # Encode user_id and get continuous features
    user_id_encoded = user_id_encoder.transform([user_id])[0]
    user_cont_features = user_scaler.transform(
        user_continuous_features_scaled.loc[[user_id_encoded]].values
    )

    # Predict the user's embedding
    # user_embedding = user_model.predict([np.array([user_id_encoded]), user_cont_features], verbose=0)

    user_embedding = user_model.predict([user_id_encoded.reshape(1, -1), user_cont_features], verbose=0)
    user_embedding_normalized = normalize(user_embedding, axis=1)

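    # Search the Faiss index. IndexFlatIP is an exact (brute-force) inner-product search,
    # so the returned "distances" are cosine similarities (higher means more similar)
    distances, indices = faiss_index.search(user_embedding_normalized, k)

    # Map the returned row positions to encoded business IDs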
    top_k_business_ids = business_ids[indices.flatten()]

    # valid_indices = indices[indices != -1].flatten()
    # top_k_business_ids = business_ids[valid_indices]

    return top_k_business_ids, distances.flatten()
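
The commented-out lines in `query_top_k` hint at one edge case: when an index cannot return `k` results (for example, `k` is larger than the number of indexed vectors), Faiss fills the remaining slots with the label `-1`. Indexing a NumPy array with `-1` silently returns its last element, so those slots should be dropped before the lookup. A minimal sketch, with `map_hits_to_ids` as a hypothetical helper:

```python
import numpy as np


def map_hits_to_ids(indices, scores, business_ids):
    """Drop the -1 padding returned by a search and map row positions to IDs."""
    indices = np.asarray(indices).ravel()
    scores = np.asarray(scores).ravel()
    valid = indices != -1
    return business_ids[indices[valid]], scores[valid]


business_ids_demo = np.array([10, 20, 30])
indices_demo = np.array([[2, 0, 1, -1, -1]])            # k=5 requested, only 3 vectors indexed
scores_demo = np.array([[0.9, 0.5, 0.1, -1e38, -1e38]])  # placeholder scores for the empty slots
ids, kept_scores = map_hits_to_ids(indices_demo, scores_demo, business_ids_demo)
```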


In [7]:
# Step 3: Example usage
user_id = "9HQLEChkam3GMBQn0SmvVw"  # Replace with an actual user_id from your dataset
# Check if the user_id is in the encoder
if user_id not in user_id_encoder.classes_:
    user_id = "default_user"
top_k_business_ids, scores = query_top_k(user_id, user_model, faiss_index, business_ids, k=300)

# Decode business IDs back to their original format
decoded_business_ids = business_id_encoder.inverse_transform(top_k_business_ids)
result_df = pd.DataFrame({
    'business_id': decoded_business_ids,
    'similarity_score': scores
})

print(user_id)
print(result_df)



9HQLEChkam3GMBQn0SmvVw
                business_id  similarity_score
0    G503RdNVGztzLuBfNY4l3A          0.919918
1    cQlpkV5PLr6g_pwyb4M3Mw          0.919611
2    ARKtuM_CFDs427PjEXKwQw          0.918689
3    wAgw6Ufvgnnh6HoTzHvM1Q          0.915781
4    2gVxo1YubGEhfwNhNd6DAQ          0.909420
..                      ...               ...
295  uMbeb4sQLjqm2rlx3Kss2g          0.807704
296  p_w3VYR50sPbN-mAp7UqVg          0.807535
297  DXn6UxlpT4R41hqpWbXcLA          0.807359
298  _wZbh1bLXGxXQwzs1_5zsg          0.807282
299  J1O2KJ57jnAZlsdaNQCyCQ          0.807072

[300 rows x 2 columns]


In [8]:
# Save the Faiss index to a file
faiss.write_index(faiss_index, save_folder_path+"faiss_index.bin")
faiss_index = None  # Free memory

# Save business IDs
np.save(save_folder_path+"business_ids.npy", business_ids)

# Save user continuous features (temporary solution)
with open(save_folder_path + "user_continuous_features_scaled.pkl", "wb") as f:
    pickle.dump(user_continuous_features_scaled, f)
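
For reference, the saved artifacts can be read back as sketched below. This is an illustrative helper, not the production loader: `load_retrieval_artifacts` and its `read_index` parameter are hypothetical, and the file names are the ones written by the cell above. `faiss.read_index` is the counterpart of `faiss.write_index`; it is imported lazily so the helper can be exercised without Faiss by passing a stand-in reader.

```python
import pickle
from pathlib import Path

import numpy as np


def load_retrieval_artifacts(folder, read_index=None):
    """Load the Faiss index, the business IDs, and the scaled user features."""
    folder = Path(folder)
    if read_index is None:
        import faiss  # imported here so a stand-in reader can be used in tests
        read_index = faiss.read_index
    index = read_index(str(folder / "faiss_index.bin"))
    business_ids = np.load(folder / "business_ids.npy")
    # Only unpickle files from a trusted source: pickle can execute arbitrary code.
    with open(folder / "user_continuous_features_scaled.pkl", "rb") as f:
        user_features = pickle.load(f)
    return index, business_ids, user_features


# Example call (folder as used in this notebook):
# index, business_ids, user_features = load_retrieval_artifacts("Saved_Triplet_Hinge_Loss/")
```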