### **KNN Model Training**

**Purpose**:  
Fit a K-Nearest Neighbors (KNN) model on the sparse partner-scheme matrix, using cosine distance to measure how similar two partners are.

**Process**:  
- The partner-scheme matrix (one row per `Partner_id`, one column per `Scheme_Type`, plus the averaged one-hot Geography and Stockist_Type features) is converted to a sparse matrix.
- `fit()` is called on that sparse matrix. With `algorithm='brute'`, fitting simply stores the rows; distances are computed later, when `kneighbors()` is queried for a given partner.

**Input**:  
- A sparse matrix (`user_scheme_sparse_scaled_baseline`) with one row per partner, holding the summed `Engagement_Score` per scheme type alongside the Geography and Stockist_Type features.

**Output**:  
- A fitted KNN model (`knn_model`) that can be queried for each partner's nearest neighbors.
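
As a minimal sketch of this step, the toy example below fits `NearestNeighbors` with cosine distance on a small sparse matrix and queries the neighbors of the first row. The matrix values and the names `toy_matrix` and `toy_knn` are made up for illustration and are not taken from the dataset.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.neighbors import NearestNeighbors

# Hypothetical engagement scores: 4 partners (rows) x 3 scheme types (columns)
toy_matrix = csr_matrix(np.array([
    [10.0, 0.0, 5.0],
    [20.0, 0.0, 10.0],   # same direction as row 0, so cosine distance is ~0
    [0.0, 8.0, 0.0],     # orthogonal to row 0, so cosine distance is 1
    [9.0, 1.0, 4.0],     # close to row 0, but not identical in direction
]))

toy_knn = NearestNeighbors(metric="cosine", algorithm="brute")
toy_knn.fit(toy_matrix)

# Query the 3 nearest neighbors of row 0 (the row itself is among them)
distances, indices = toy_knn.kneighbors(toy_matrix[0], n_neighbors=3)
similarities = 1 - distances.flatten()
print(indices.flatten(), similarities.round(4))
```

Cosine distance ignores magnitude, so row 1 (twice row 0) is as close to row 0 as row 0 is to itself. This is why the query partner has to be filtered out explicitly when neighbors are turned into recommendations.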



In [37]:
import os
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import NearestNeighbors
from scipy.sparse import csr_matrix
from sklearn.preprocessing import MinMaxScaler
# Load dataset
base_dir = os.getcwd()
file_path = os.path.join(base_dir, "Augmented_Stockist_Data (1).csv")
df = pd.read_csv(file_path)

# One-hot encoding for Geography and Stockist_Type
df = pd.get_dummies(df, columns=["Geography", "Stockist_Type"], dtype=int)

# Identify geography and stockist type columns
geo_columns = [col for col in df.columns if col.startswith("Geography")]
stockist_columns = [col for col in df.columns if col.startswith("Stockist_Type")]

if not geo_columns or not stockist_columns:
    raise ValueError("No Geography or Stockist_Type features found after encoding! Check encoding step.")

# Replace zeros in Sales_Value_Last_Period with 1.
# Note: np.log1p(0) is already defined (it is 0), so this is not required to avoid log(0);
# its effect is that zero-sales rows get log1p(1) ≈ 0.693 instead of 0.
df["Sales_Value_Last_Period"] = df["Sales_Value_Last_Period"].replace(0, 1)

# Calculate Engagement Score

df["Engagement_Score"] = np.log1p(df["Sales_Value_Last_Period"]) * (df["Feedback_Score"] + df[geo_columns + stockist_columns].sum(axis=1)*5)

# Train-Test Split
train_df, test_df = train_test_split(df, test_size=0.2, random_state=42, stratify=df["Partner_id"])

# Pivot User-Scheme Matrix using Engagement Score
#Create a pivot table where rows represent unique 'Partner_id', columns represent unique 'Scheme_Type', and values represent the sum of 'Engagement_Score'.
user_scheme_matrix = train_df.pivot_table(
    index="Partner_id", columns="Scheme_Type", values="Engagement_Score", aggfunc="sum", fill_value=0
)

# Add Geography & Stockist_Type Features
user_features = train_df.groupby("Partner_id")[geo_columns + stockist_columns].mean()  # Aggregate features per Partner_id
user_scheme_matrix = user_scheme_matrix.merge(user_features, left_index=True, right_index=True, how="left")
print(user_scheme_matrix)
# Copy the matrix that will be fed to KNN
user_scheme_matrix_scaled_baseline = user_scheme_matrix.copy()

# Min-Max scaling to the [0, 1] range is currently disabled for the baseline,
# so despite its name this matrix holds unscaled values.
# Uncomment the assignment below to scale it.
scaler = MinMaxScaler()
# user_scheme_matrix_scaled_baseline[:] = scaler.fit_transform(user_scheme_matrix_scaled_baseline)

# Build the sparse matrix and fit the KNN model on it
user_scheme_sparse_scaled_baseline = csr_matrix(user_scheme_matrix_scaled_baseline.values)
partner_id_lookup = list(user_scheme_matrix_scaled_baseline.index)
knn_model = NearestNeighbors(metric='cosine', algorithm='brute')
knn_model.fit(user_scheme_sparse_scaled_baseline)


                 Cashback  Loyalty Points  Loyalty Program  Seasonal Offer  \
Partner_id                                                                   
P1000         2964.507680        0.000000      3844.334581     4297.830301   
P1001       105721.875362   110674.843668      3694.135924   106864.840847   
P1002       104626.720136    99992.868468      3176.507841   112757.532771   
P1003       113841.577883   105184.827189      3009.054418   111495.252426   
P1004         3084.659184        0.000000      3324.383398     3034.040779   
...                   ...             ...              ...             ...   
P1096         3532.912006        0.000000      3955.409316     2614.569961   
P1097         4324.722459        0.000000      2276.373331     2488.797636   
P1098         2400.713194        0.000000      3158.219812     3766.749875   
P1099         4070.093753        0.000000      3311.431481     2311.683146   
P1100         3764.344532        0.000000      3512.510927     3
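
To make the Engagement Score formula from the cell above concrete, here is a small worked sketch for a single row. The helper `engagement_score`, its parameter names, and the input values are hypothetical; only the formula `log1p(sales) * (feedback + 5 * one-hot sum)` comes from the notebook. After one-hot encoding, a row typically has one Geography flag and one Stockist_Type flag set, so the one-hot sum is 2.

```python
import math

def engagement_score(sales_value, feedback_score, n_active_flags, flag_weight=5):
    """Mirror the notebook's formula: log1p(sales) * (feedback + weight * one-hot sum)."""
    return math.log1p(sales_value) * (feedback_score + flag_weight * n_active_flags)

# Hypothetical row: sales of 1000, feedback of 4, one Geography and one Stockist_Type flag set
score = engagement_score(sales_value=1000, feedback_score=4, n_active_flags=2)
print(round(score, 2))

# The same row with zero sales replaced by 1, as the notebook does before scoring
score_zero_sales = engagement_score(sales_value=1, feedback_score=4, n_active_flags=2)
print(round(score_zero_sales, 2))
```

The pivot table then sums these per-row scores for each partner and scheme type, which is why partners with many rows show much larger values in the printed matrix.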


**Purpose**:  
Generate personalized scheme recommendations for a given partner based on similar users using a trained KNN model.

**Process**:  
- The function takes a partner's ID and queries the KNN model for that partner's nearest neighbors. It asks for `top_n + 1` neighbors, because the partner itself is normally returned as its own closest match.
- Cosine distances are converted to similarity scores (`1 - distance`), and the partner itself is removed from the neighbor list.
- Only the single most similar remaining partner is used: its most frequent scheme types in `train_df` become the recommendations.

**Input**:  
- `partner_id`: The ID of the partner for whom recommendations are generated.
- `knn_model`: The KNN model that has been trained on the user-scheme matrix.
- `user_scheme_matrix`: The feature matrix that captures the partner-scheme interactions, used to train the KNN model.
- `train_df`: DataFrame containing details about partner-scheme relationships.
- `top_n`: The number of neighbors to retrieve from the KNN model (default is 3). Only the closest one is used for the recommendations.

**Output**:  
- `None` if the partner is not in `user_scheme_matrix` or has no other neighbor; otherwise a flat list containing:
  - `partner_id`: The ID of the partner for whom the recommendations are made.
  - `product_id`: The first product ID found for the given partner in `train_df`.
  - `similarity_score`: The cosine similarity of the most similar partner, rounded to 6 decimals.
  - `top_schemes`: The three most frequent scheme types of the most similar partner, padded with `"No Scheme"` if fewer than three exist.
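
The neighbor-to-recommendation step described above can be sketched on toy data as follows. The `distances` and `indices` arrays stand in for a hypothetical `kneighbors()` result, and `partner_ids` and `toy_train` are made-up illustrations, not values from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical kneighbors() output for the partner stored at row 0
distances = np.array([[0.0, 0.02, 0.35]])
indices = np.array([[0, 2, 1]])
partner_ids = ["P1", "P2", "P3"]   # row order of the feature matrix
query_idx = 0

# Convert cosine distance to similarity and drop the query partner itself
pairs = [
    (int(i), 1 - float(d))
    for i, d in zip(indices.flatten(), distances.flatten())
    if i != query_idx
]
best_idx, best_sim = pairs[0]          # closest remaining neighbor
similar_partner = partner_ids[best_idx]

# Hypothetical training rows for the neighbors
toy_train = pd.DataFrame({
    "Partner_id": ["P3", "P3", "P3", "P2"],
    "Scheme_Type": ["Cashback", "Cashback", "Seasonal Offer", "Loyalty Points"],
})

# Most frequent scheme types of the most similar partner
top_schemes = (
    toy_train.loc[toy_train["Partner_id"] == similar_partner, "Scheme_Type"]
    .value_counts()
    .head(3)
    .index.tolist()
)
# Pad to exactly three entries
top_schemes += ["No Scheme"] * (3 - len(top_schemes))

print(similar_partner, round(best_sim, 6), top_schemes)
```

Note that the schemes are ranked by how often they appear for the neighbor, not by engagement score, and that the remaining neighbors are ignored.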

In [38]:
def recommend_user_based(partner_id, knn_model, user_scheme_matrix, train_df, top_n=3):
    """
    Recommends schemes based on similar users using a given KNN model.
    
    Parameters:
    - partner_id: The ID of the partner for whom recommendations are needed.
    - knn_model: The trained KNN model (either base or perturbed).
    - user_scheme_matrix: The feature matrix used for training the KNN model.
    - train_df: DataFrame containing Partner-Scheme relationships.
    - top_n: Number of neighbors to query; only the closest one is used for recommendations.

    Returns:
    - None if the partner is unknown or has no other neighbor; otherwise a list with
      partner_id, product_id, similarity score, and three recommended schemes.
    """
    
    if partner_id not in user_scheme_matrix.index:
        return None

    # Get index of the partner_id in the feature matrix
    partner_id_lookup = list(user_scheme_matrix.index)
    idx = partner_id_lookup.index(partner_id)

    # Find nearest neighbors
    distances, indices = knn_model.kneighbors(user_scheme_matrix.iloc[idx].values.reshape(1, -1),
                                              n_neighbors=min(top_n + 1, len(user_scheme_matrix)))

    similarities = 1 - distances.flatten()
    neighbors = indices.flatten()

    # Exclude the input partner itself from recommendations
    filtered = [(i, sim) for i, sim in zip(neighbors, similarities) if i != idx]

    if not filtered:
        return None

    top_idx, sim_score = filtered[0]
    similar_user = partner_id_lookup[top_idx]
    sim_score = round(sim_score, 6)

    # Retrieve the top 3 most common scheme types for the similar user
    top_schemes = (
        train_df[train_df["Partner_id"] == similar_user]["Scheme_Type"]
        .value_counts()
        .head(3)
        .index
        .tolist()
    )

    # Ensure at least 3 scheme recommendations
    while len(top_schemes) < 3:
        top_schemes.append("No Scheme")

    # Get the product ID associated with the given partner
    product = train_df[train_df["Partner_id"] == partner_id]["Product_id"].unique()[0]

    return [partner_id, product, sim_score, *top_schemes]


In [39]:
# Generate Recommendations
user_partners = test_df["Partner_id"].unique()

# Call the recommender once per partner and keep only the non-empty results
user_recommendations = [
    rec
    for rec in (
        recommend_user_based(pid, knn_model, user_scheme_matrix_scaled_baseline, train_df)
        for pid in user_partners
    )
    if rec
]
# Save Output
user_rec_df = pd.DataFrame(user_recommendations, columns=["Partner_id", "Product_id", "Similarity_Score", "Scheme_1", "Scheme_2", "Scheme_3"])
user_rec_df.to_csv("user_based_recommendations_with_geography_stockist.csv", index=False)

print("User-Based Recommendations saved with Geography and Stockist_Type features.")

User-Based Recommendations saved with Geography and Stockist_Type features.


Dropping Geography_Region

In [40]:
from copy import deepcopy
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix
from sklearn.neighbors import NearestNeighbors

def compute_feature_importance_drop(geo_columns, stockist_columns,random_seed=42):
    importance_scores = {}

    # Step 1: Generate Baseline Recommendations (Using All Features)
    baseline_recommendations = {
        pid: recommend_user_based(pid, knn_model, user_scheme_matrix_scaled_baseline, train_df)
        for pid in test_df["Partner_id"].unique()
    }
    
    # Convert Baseline to DataFrame
    baseline_df = pd.DataFrame(
        [v for v in baseline_recommendations.values() if v],
        columns=["Partner_id", "Product_id", "Similarity_Score", "Scheme_1", "Scheme_2", "Scheme_3"]
    ).set_index(["Partner_id", "Product_id"])

    # Step 2: Drop Only Geography Features (Keep Stockist Type)
    perturbed_df = deepcopy(df)
    
    # Drop Geography columns but keep Stockist Type
    perturbed_df = perturbed_df.drop(columns=geo_columns, errors='ignore')

    # Keep the Same Training Data (Avoid Re-Splitting); copy so the new column is not written to a slice
    train_df_perturbed = perturbed_df.loc[train_df.index].copy()

    # Recompute Engagement Score without the dropped Geography features.
    # Unlike the baseline formula, the one-hot sum is not multiplied by 5 here.
    train_df_perturbed["Engagement_Score"] = np.log1p(train_df_perturbed["Sales_Value_Last_Period"]) * (
        train_df_perturbed["Feedback_Score"] + train_df_perturbed[stockist_columns].sum(axis=1)  # No geo_columns included here
    )

    # Pivot User-Scheme Matrix
    user_scheme_matrix_perturbed = train_df_perturbed.pivot_table(
        index="Partner_id", columns="Scheme_Type", values="Engagement_Score", aggfunc="sum", fill_value=0
    )
    user_features_perturbed = train_df.groupby("Partner_id")[stockist_columns].mean()  # Aggregate features per Partner_id
    user_scheme_matrix_perturbed = user_scheme_matrix_perturbed.merge(user_features_perturbed, left_index=True, right_index=True, how="left")
    print(user_scheme_matrix_perturbed)

    user_scheme_matrix_scaled = user_scheme_matrix_perturbed.copy()

    # Apply Min-Max Scaling to bring all features to [0,1] range
    scaler = MinMaxScaler()
    user_scheme_matrix_scaled[:] = scaler.fit_transform(user_scheme_matrix_scaled)
    
    # Train KNN model with the normalized user-scheme matrix
    user_scheme_sparse_scaled = csr_matrix(user_scheme_matrix_scaled.values)
    #print(user_scheme_sparse_scaled)
    knn_model_scaled = NearestNeighbors(metric='cosine', algorithm='brute')
    knn_model_scaled.fit(user_scheme_sparse_scaled)

    
    # Step 3: Generate New Recommendations (After Dropping Geography)
    perturbed_recommendations = {
        pid: recommend_user_based(pid, knn_model_scaled, user_scheme_matrix_scaled, train_df)
        for pid in test_df["Partner_id"].unique()
    }

    # Convert Perturbed Recommendations to DataFrame
    perturbed_df = pd.DataFrame(
        [v for v in perturbed_recommendations.values() if v],
        columns=["Partner_id", "Product_id", "Similarity_Score", "Scheme_1", "Scheme_2", "Scheme_3"]
    ).set_index(["Partner_id", "Product_id"])

    # Step 4: Compare Before & After Dropping Geography Features
    #changed_count = np.sum(~(baseline_df == perturbed_df).all(axis=1))
    #importance_scores["Dropped_Geography"] = changed_count / len(df)
    scheme_columns = ["Scheme_1", "Scheme_2", "Scheme_3"]
    changed_recommendations = (baseline_df[scheme_columns] != perturbed_df[scheme_columns]).all(axis=1)
    
    # Count only rows where all three schemes have changed
    changed_count = changed_recommendations.sum()
    
    # Normalize by total recommendations in baseline
    importance_scores["Dropped_Importance_Score"] = changed_count / len(baseline_df)

    return importance_scores

# Compute and Print Feature Importance for Dropping Only Geography Features
geo_importance_drop = compute_feature_importance_drop(geo_columns, stockist_columns)
print("\nFeature Importance (Drop Geography Approach):")
print(pd.Series(geo_importance_drop).sort_values(ascending=False))


                Cashback  Loyalty Points  Loyalty Program  Seasonal Offer  \
Partner_id                                                                  
P1000         885.661640        0.000000      1105.035861     1280.902192   
P1001       32515.686255    34196.205391      1175.657718    32506.887265   
P1002       32128.464313    30963.539069       953.888022    34974.005340   
P1003       34840.725196    32609.553382       853.345957    34085.500555   
P1004         871.870933        0.000000       940.298089      919.815557   
...                  ...             ...              ...             ...   
P1096        1143.484406        0.000000      1240.294333      842.574362   
P1097        1342.635532        0.000000       654.238036      778.514939   
P1098         697.198000        0.000000       902.148448     1225.708992   
P1099        1197.513981        0.000000       972.875024      737.642567   
P1100        1056.535694        0.000000      1130.561992      957.527462   
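
The importance score used in the function above is the fraction of partners whose three recommended schemes all differ, slot by slot, between the baseline and the perturbed run. A minimal sketch on toy data, where the scheme labels `A`, `B`, `C` and the partner IDs are hypothetical:

```python
import pandas as pd

scheme_columns = ["Scheme_1", "Scheme_2", "Scheme_3"]
partners = ["P1", "P2", "P3", "P4"]

baseline = pd.DataFrame(
    [["A", "B", "C"]] * 4,
    index=partners, columns=scheme_columns,
)
perturbed = pd.DataFrame(
    [
        ["A", "B", "C"],   # unchanged: not counted
        ["B", "A", "C"],   # two slots changed: not counted
        ["B", "C", "A"],   # all three slots changed: counted
        ["C", "A", "B"],   # all three slots changed: counted
    ],
    index=partners, columns=scheme_columns,
)

# True only where every slot differs between the two runs
changed = (baseline[scheme_columns] != perturbed[scheme_columns]).all(axis=1)
importance = changed.sum() / len(baseline)
print(importance)
```

Because the comparison is positional, a row that contains the same three schemes in a different order still counts as changed when no slot matches, while a row with one unchanged slot does not count at all.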

Dropping Stockist Type

In [41]:
from copy import deepcopy
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix
from sklearn.neighbors import NearestNeighbors

def compute_feature_importance_drop(feature_columns,random_seed=42):
    importance_scores = {}

    # Step 1: Generate Baseline Recommendations (Using All Features)
    baseline_recommendations = {
        pid: recommend_user_based(pid, knn_model, user_scheme_matrix_scaled_baseline, train_df)
        for pid in test_df["Partner_id"].unique()
    }

    # Convert Baseline to DataFrame
    baseline_df = pd.DataFrame(
        [v for v in baseline_recommendations.values() if v],
        columns=["Partner_id", "Product_id", "Similarity_Score", "Scheme_1", "Scheme_2", "Scheme_3"]
    ).set_index(["Partner_id", "Product_id"])

    # Step 2: Drop Stockist Type Features
    perturbed_df = deepcopy(df)
    perturbed_df = perturbed_df.drop(columns=feature_columns)  # Drop Stockist Type features

    # Keep the Same Training Data (Avoid Re-Splitting); copy so the new column is not written to a slice
    train_df_perturbed = perturbed_df.loc[train_df.index].copy()

    # Recompute Engagement Score without Stockist Type but keeping Geography.
    # Unlike the baseline (Feedback_Score + 5 * one-hot sum), this multiplies
    # Feedback_Score by the geography one-hot sum.
    train_df_perturbed["Engagement_Score"] = (
        np.log1p(train_df_perturbed["Sales_Value_Last_Period"]) *
        train_df_perturbed["Feedback_Score"] *
        train_df_perturbed[geo_columns].sum(axis=1)  # Keep geography impact
    )

    # Pivot User-Scheme Matrix
    user_scheme_matrix_perturbed = train_df_perturbed.pivot_table(
        index="Partner_id", columns="Scheme_Type", values="Engagement_Score", aggfunc="sum", fill_value=0
    )
    user_features_perturbed = train_df.groupby("Partner_id")[geo_columns].mean()  # Aggregate features per Partner_id
    user_scheme_matrix_perturbed = user_scheme_matrix_perturbed.merge(user_features_perturbed, left_index=True, right_index=True, how="left")
    print(user_scheme_matrix_perturbed)


    user_scheme_matrix_scaled = user_scheme_matrix_perturbed.copy()

    # Apply Min-Max Scaling to bring all features to [0,1] range
    scaler = MinMaxScaler()
    user_scheme_matrix_scaled[:] = scaler.fit_transform(user_scheme_matrix_scaled)
    
    # Prepare the sparse matrix and retrain the KNN model on the scaled matrix
    user_scheme_sparse_scaled = csr_matrix(user_scheme_matrix_scaled.values)
    knn_model_scaled = NearestNeighbors(metric='cosine', algorithm='brute')
    knn_model_scaled.fit(user_scheme_sparse_scaled)

    # Step 3: Generate New Recommendations (After Dropping Features)
    perturbed_recommendations = {
        pid: recommend_user_based(pid, knn_model_scaled, user_scheme_matrix_scaled, train_df)
        for pid in test_df["Partner_id"].unique()
    }

    # Convert Perturbed Recommendations to DataFrame
    perturbed_df = pd.DataFrame(
        [v for v in perturbed_recommendations.values() if v],
        columns=["Partner_id", "Product_id", "Similarity_Score", "Scheme_1", "Scheme_2", "Scheme_3"]
    ).set_index(["Partner_id", "Product_id"])

    # Step 4: Compare Before & After Dropping Features
    #changed_count = np.sum(~(baseline_df == perturbed_df).all(axis=1))
    #importance_scores["Dropped_" + "_".join(feature_columns)] = changed_count / len(df)
    scheme_columns = ["Scheme_1", "Scheme_2", "Scheme_3"]
    changed_recommendations = (baseline_df[scheme_columns] != perturbed_df[scheme_columns]).all(axis=1)
    
    # Count only rows where all three schemes have changed
    changed_count = changed_recommendations.sum()
    
    # Normalize by total recommendations in baseline
    importance_scores["Dropped_Importance_Score"] = changed_count / len(baseline_df)

    return importance_scores

# Compute and Print Feature Importance for Dropped Stockist Type Features
stockist_importance_drop = compute_feature_importance_drop(stockist_columns)
print("\nFeature Importance (Drop Stockist Type Approach):")
print(pd.Series(stockist_importance_drop).sort_values(ascending=False))


                Cashback  Loyalty Points  Loyalty Program  Seasonal Offer  \
Partner_id                                                                  
P1000         654.678747        0.000000       800.669337      945.687958   
P1001       24381.665244    25698.578916       895.826806    24244.892422   
P1002       24073.102554    23293.613581       706.930265    26331.391181   
P1003       26062.852676    24545.634070       613.822794    25484.417014   
P1004         626.005571        0.000000       675.399722      684.901644   
...                  ...             ...              ...             ...   
P1096         877.992451        0.000000       938.614890      645.685962   
P1097        1011.292541        0.000000       474.000781      588.483528   
P1098         507.918534        0.000000       651.473851      943.371116   
P1099         878.338451        0.000000       713.035418      562.749169   
P1100         755.668045        0.000000       865.900999      715.995710   
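
Both perturbed runs apply Min-Max scaling before fitting KNN, so it may help to see how column-wise scaling can change cosine similarity between two rows. The sketch below uses a plain NumPy stand-in for `MinMaxScaler` (the helper `min_max_scale` is an assumption for illustration, not the scikit-learn implementation) and a made-up three-partner matrix with two large score columns and one 0/1 flag column.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def min_max_scale(matrix):
    """Column-wise min-max scaling to [0, 1]; constant columns map to 0."""
    col_min = matrix.min(axis=0)
    col_range = matrix.max(axis=0) - col_min
    col_range[col_range == 0] = 1.0   # avoid division by zero
    return (matrix - col_min) / col_range

# Hypothetical partner-by-feature matrix
raw = np.array([
    [3000.0, 4000.0, 1.0],
    [100000.0, 110000.0, 0.0],
    [3500.0, 2500.0, 0.0],
])
scaled = min_max_scale(raw)

raw_sim = cosine_similarity(raw[0], raw[2])
scaled_sim = cosine_similarity(scaled[0], scaled[2])
print(round(raw_sim, 4), round(scaled_sim, 4))
```

In this toy case the two small partners look very similar on raw scores but not after scaling, because scaling shrinks their score columns towards zero and lets the 0/1 flag dominate. A change in recommendations between a scaled and an unscaled run can therefore reflect the scaling as well as the dropped features.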