# Customer Segmentation

Generated by Auto-Analysis Web App

## Feature Engineering
This step engineers new, more descriptive features from the raw dataset. 'customer_age' is calculated from 'customer_birthdate' and `current_year` (2024), and 'customer_since_years' from 'year_first_transaction' and `current_year`. Age is a plain difference of calendar years, so it can overstate a customer's age by up to one year. The individual 'lifetime_spend_' columns are aggregated into higher-level categories: 'total_lifetime_spend', 'total_food_spend', 'total_drink_spend', 'total_home_personal_care_spend', and 'total_entertainment_spend'. Two ratios, 'alcohol_spend_ratio' and 'promotion_efficiency_index', are also created. To avoid division by zero, zero denominators are replaced with NaN before dividing, and the resulting NaNs are filled with 0.

In [None]:
import pandas as pd
import numpy as np
from datetime import datetime

# Parameters for Step 1
current_year = 2024

# Assume 'df' is the initial raw dataframe for the entire process
df_engineered = df.copy()

# Substep 1.1: Calculate 'customer_age'
df_engineered['customer_birthdate'] = pd.to_datetime(df_engineered['customer_birthdate'], errors='coerce')
df_engineered['customer_age'] = current_year - df_engineered['customer_birthdate'].dt.year

# Substep 1.2: Calculate 'customer_since_years'
df_engineered['customer_since_years'] = current_year - df_engineered['year_first_transaction']

# Substep 1.3: Derive 'total_lifetime_spend'
spend_columns = [col for col in df_engineered.columns if 'lifetime_spend_' in col]
df_engineered['total_lifetime_spend'] = df_engineered[spend_columns].sum(axis=1)

# Substep 1.4: Create 'total_food_spend'
df_engineered['total_food_spend'] = (
    df_engineered['lifetime_spend_groceries']
    + df_engineered['lifetime_spend_vegetables']
    + df_engineered['lifetime_spend_meat']
    + df_engineered['lifetime_spend_fish']
)

# Substep 1.5: Create 'total_drink_spend'
df_engineered['total_drink_spend'] = (
    df_engineered['lifetime_spend_nonalcohol_drinks']
    + df_engineered['lifetime_spend_alcohol_drinks']
)

# Substep 1.6: Create 'total_home_personal_care_spend'
df_engineered['total_home_personal_care_spend'] = (
    df_engineered['lifetime_spend_hygiene']
    + df_engineered['lifetime_spend_petfood']
)

# Substep 1.7: Create 'total_entertainment_spend'
df_engineered['total_entertainment_spend'] = (
    df_engineered['lifetime_spend_electronics']
    + df_engineered['lifetime_spend_videogames']
)

# Substep 1.8: Calculate 'alcohol_spend_ratio' (handle division by zero)
df_engineered['alcohol_spend_ratio'] = (
    df_engineered['lifetime_spend_alcohol_drinks']
    / df_engineered['total_drink_spend'].replace(0, np.nan)
)
df_engineered['alcohol_spend_ratio'] = df_engineered['alcohol_spend_ratio'].fillna(0)

# Substep 1.9: Calculate 'promotion_efficiency_index' (handle division by zero)
df_engineered['promotion_efficiency_index'] = (
    df_engineered['total_lifetime_spend']
    / df_engineered['percentage_of_products_bought_promotion'].replace(0, np.nan)
)
df_engineered['promotion_efficiency_index'] = df_engineered['promotion_efficiency_index'].fillna(0)

# The dataframe `df_engineered` is now prepared for the next step and will be used for final profiling.
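
The zero-safe division used for both ratios can be factored into a small helper. The following is a minimal sketch on made-up toy data; `safe_ratio` and the `toy` frame are hypothetical names introduced for illustration and are not part of the generated notebook.

```python
import numpy as np
import pandas as pd


def safe_ratio(numerator: pd.Series, denominator: pd.Series, fill: float = 0.0) -> pd.Series:
    """Divide two Series, returning `fill` wherever the result is undefined.

    Zero denominators are turned into NaN first so the division never
    produces inf; any NaN in the result is then replaced by `fill`.
    """
    ratio = numerator / denominator.replace(0, np.nan)
    return ratio.fillna(fill)


# Toy data (values invented for illustration only)
toy = pd.DataFrame({
    "lifetime_spend_alcohol_drinks": [30.0, 0.0, 10.0],
    "lifetime_spend_nonalcohol_drinks": [70.0, 0.0, 10.0],
})
toy["total_drink_spend"] = (
    toy["lifetime_spend_alcohol_drinks"] + toy["lifetime_spend_nonalcohol_drinks"]
)
toy["alcohol_spend_ratio"] = safe_ratio(
    toy["lifetime_spend_alcohol_drinks"], toy["total_drink_spend"]
)
print(toy["alcohol_spend_ratio"].tolist())  # [0.3, 0.0, 0.5]
```

Note that filling with 0 makes a customer with no drink spend at all indistinguishable from one who buys drinks but no alcohol. Whether that is acceptable depends on how the ratio is used downstream.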

## Pre-clustering Feature Preparation
This step prepares the engineered dataset for clustering by encoding categorical variables and removing irrelevant features. The 'customer_gender' column is one-hot encoded into new binary features, and the original 'customer_gender' column is then dropped. A list of identifier or redundant columns ('customer_id', 'customer_name', 'customer_birthdate', 'loyalty_card_number', and 'year_first_transaction') is also removed so that only meaningful features enter the clustering.

In [None]:
from sklearn.preprocessing import OneHotEncoder

# Parameters for Step 2
features_to_drop_pre_clustering = ["customer_id", "customer_name", "customer_birthdate", "loyalty_card_number", "year_first_transaction"]

df_preprocessed = df_engineered.copy()

# Substep 2.1: One-hot encode the 'customer_gender' categorical feature
encoder = OneHotEncoder(handle_unknown='ignore', sparse_output=False)
gender_encoded = encoder.fit_transform(df_preprocessed[['customer_gender']])
gender_df = pd.DataFrame(
    gender_encoded,
    columns=encoder.get_feature_names_out(['customer_gender']),
    index=df_preprocessed.index,
)
df_preprocessed = pd.concat([df_preprocessed, gender_df], axis=1)

# Substep 2.2: Drop the original 'customer_gender' column after encoding.
df_preprocessed = df_preprocessed.drop('customer_gender', axis=1)

# Substep 2.3: Remove columns specified in 'features_to_drop_pre_clustering'
df_preprocessed = df_preprocessed.drop(columns=features_to_drop_pre_clustering, errors='ignore')

# The dataframe `df_preprocessed` is now ready for scaling.
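
To see what the encoding produces, here is a small sketch that uses `pandas.get_dummies` in place of the `OneHotEncoder` above (a swapped-in, pandas-only alternative, not the notebook's method). The category values `"female"` and `"male"` are assumptions; the real dataset may use other labels.

```python
import pandas as pd

# Toy frame; the gender labels are hypothetical
toy = pd.DataFrame({
    "customer_gender": ["female", "male", "female"],
    "total_lifetime_spend": [120.0, 80.0, 200.0],
})

# get_dummies replaces the listed column with one indicator column per category
encoded = pd.get_dummies(toy, columns=["customer_gender"], dtype=int)
print(encoded.columns.tolist())
# ['total_lifetime_spend', 'customer_gender_female', 'customer_gender_male']
```

With exactly two categories the two indicator columns are complementary, so together they count the same information twice in a Euclidean distance. Passing `drop_first=True` would keep a single indicator, which may be worth considering before K-Means.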

## Feature Scaling
This step standardizes the numerical features of the pre-processed dataset, which is crucial for distance-based algorithms like K-Means. Using the `StandardScaler`, as selected by `scaler_type`, each numerical feature is transformed to have a mean of 0 and a standard deviation of 1. Before scaling, infinite values are converted to NaN and missing values are imputed with the median of their column. The scaled features are then converted back into a DataFrame that retains the original column names and index.

In [None]:
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Parameters for Step 3
scaler_type = "StandardScaler"

df_scaled_for_clustering = df_preprocessed.copy()

# Substep 3.1: Initialize the chosen scaler
if scaler_type == "StandardScaler":
    scaler = StandardScaler()
elif scaler_type == "MinMaxScaler":
    scaler = MinMaxScaler()
else:
    raise ValueError(f"Unsupported scaler_type: {scaler_type}. Choose 'StandardScaler' or 'MinMaxScaler'.")

# Substep 3.2: Apply the scaler to all numerical features
numerical_cols = df_scaled_for_clustering.select_dtypes(include=np.number).columns.tolist()

# Handle potential NaN/Inf values before scaling
df_scaled_for_clustering[numerical_cols] = df_scaled_for_clustering[numerical_cols].replace([np.inf, -np.inf], np.nan)
for col in numerical_cols:
    if df_scaled_for_clustering[col].isnull().any():
        # Impute with median
        df_scaled_for_clustering[col] = df_scaled_for_clustering[col].fillna(df_scaled_for_clustering[col].median())

scaled_features = scaler.fit_transform(df_scaled_for_clustering[numerical_cols])

# Substep 3.3: Convert the scaled features back into a DataFrame, retaining column names.
df_scaled_for_clustering = pd.DataFrame(scaled_features, columns=numerical_cols, index=df_scaled_for_clustering.index)

# `df_scaled_for_clustering` is the input for clustering algorithms.
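
The effect of the imputation and scaling can be checked on a tiny example. This is a sketch on invented values, assuming the same order of operations as the cell above (inf to NaN, median fill, then `StandardScaler`).

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy data with one infinite and one missing value (invented for illustration)
toy = pd.DataFrame({
    "total_lifetime_spend": [100.0, 200.0, np.inf, 400.0],
    "customer_age": [25.0, np.nan, 40.0, 55.0],
})

# Treat +/-inf as missing, then fill each column with its own median
clean = toy.replace([np.inf, -np.inf], np.nan)
clean = clean.fillna(clean.median())

# Standardize and rebuild a DataFrame with the original labels
scaled = pd.DataFrame(
    StandardScaler().fit_transform(clean),
    columns=clean.columns,
    index=clean.index,
)

print(scaled.mean().round(6))
print(scaled.std(ddof=0).round(6))  # ddof=0 matches StandardScaler's population std
```

`StandardScaler` divides by the population standard deviation, so the check uses `ddof=0`. The pandas default (`ddof=1`) would report values slightly above 1.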

## Determine Optimal Cluster Count (K)
This step estimates the optimal number of clusters ('K') using two widely accepted methods: the Elbow Method and Silhouette Analysis. The Elbow Method computes the Within-Cluster Sum of Squares (WCSS) for K-Means models, initialized with `random_state=42` and `n_init=10`, for K from 1 to `max_clusters_to_test` (10). Silhouette Analysis calculates the average Silhouette Score for the same models, but only for K from 2 to 10, because the score is undefined for a single cluster. The results are plotted in two side-by-side panels to help identify an 'elbow' point or a peak silhouette score, guiding the selection of 'K' for the subsequent clustering step.

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Parameters for Step 4
max_clusters_to_test = 10
random_state = 42

wcss = []
silhouette_scores = []

# Ensure df_scaled_for_clustering is available from Step 3
X = df_scaled_for_clustering

print("Calculating WCSS and Silhouette Scores for K-Means...")
for i in range(1, max_clusters_to_test + 1):
    kmeans = KMeans(n_clusters=i, random_state=random_state, n_init=10)  # n_init for robustness
    kmeans.fit(X)
    wcss.append(kmeans.inertia_)
    if i > 1:  # Silhouette score requires at least 2 clusters
        score = silhouette_score(X, kmeans.labels_)
        silhouette_scores.append(score)
        print(f"K={i}: Silhouette Score = {score:.4f}")
    else:
        print(f"K={i}: WCSS = {kmeans.inertia_:.2f}")

# Substep 4.3: Visualize the results
plt.style.use('ggplot')
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 6))

# Elbow Method Plot
ax1.plot(range(1, max_clusters_to_test + 1), wcss, marker='o', linestyle='--', color='blue', mfc='red')
ax1.set_title('Elbow Method for Optimal K', fontsize=14)
ax1.set_xlabel('Number of Clusters (K)', fontsize=12)
ax1.set_ylabel('Within-Cluster Sum of Squares (WCSS)', fontsize=12)
ax1.set_xticks(range(1, max_clusters_to_test + 1))
ax1.grid(True, linestyle='--', alpha=0.7)
ax1.tick_params(axis='both', which='major', labelsize=10)

# Silhouette Analysis Plot
ax2.plot(range(2, max_clusters_to_test + 1), silhouette_scores, marker='o', linestyle='--', color='green', mfc='purple')
ax2.set_title('Silhouette Score for Optimal K', fontsize=14)
ax2.set_xlabel('Number of Clusters (K)', fontsize=12)
ax2.set_ylabel('Average Silhouette Score', fontsize=12)
ax2.set_xticks(range(2, max_clusters_to_test + 1))
ax2.grid(True, linestyle='--', alpha=0.7)
ax2.tick_params(axis='both', which='major', labelsize=10)

plt.suptitle('Determining Optimal Number of Clusters (K)', fontsize=16, y=1.02)
plt.tight_layout(rect=[0, 0.03, 1, 0.98])  # Adjust layout to prevent suptitle overlap
plt.show()

print("\nInterpretation Guidance:")
print("Elbow Method: Identify the 'elbow' point where the decrease in WCSS starts to slow down significantly.")
print("Silhouette Score: Look for the number of clusters (K) that yields the highest average silhouette score, "
      "indicating well-defined and separated clusters.")
print("A visual inspection of both plots is recommended to make a final decision on 'k'.")
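
The silhouette criterion can also be read off programmatically as a cross-check on the plot. Below is a minimal sketch on synthetic data generated with `make_blobs`; the three explicit centers and the names `X_toy`, `scores`, and `best_k` are assumptions for illustration, not the customer data.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Three tight, well-separated synthetic groups (illustrative only)
X_toy, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=0.5,
    random_state=42,
)

# Average silhouette score for each candidate K (the score needs K >= 2)
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, random_state=42, n_init=10).fit_predict(X_toy)
    scores[k] = silhouette_score(X_toy, labels)

# Candidate with the highest average silhouette score
best_k = max(scores, key=scores.get)
print(best_k, round(scores[best_k], 3))
```

On real customer data the peak is rarely this clear-cut, so the argmax is best treated as one input alongside the elbow plot and business constraints rather than as the final answer.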

## Perform K-Means Clustering
This step performs K-Means clustering on the scaled feature set. The K-Means model is initialized with `n_clusters` set to 5 (the value given in the plan; ideally it should be confirmed against the Step 4 plots), `random_state=42` for reproducibility, and `n_init=10` so that the algorithm runs with ten different centroid seeds and keeps the best result. After the model is fitted to `df_scaled_for_clustering`, the resulting cluster labels are attached to a copy of `df_engineered` (the result of Step 1), creating `df_with_clusters` for subsequent profiling and interpretation.

In [None]:
from sklearn.cluster import KMeans

# Parameters for Step 5
n_clusters = 5  # This value should ideally be determined from Step 4's visualization
random_state = 42
n_init = 10

# Ensure df_scaled_for_clustering is the scaled dataset for K-Means
X = df_scaled_for_clustering

# Substep 5.1: Initialize the K-Means model
kmeans_model = KMeans(n_clusters=n_clusters, random_state=random_state, n_init=n_init)

# Substep 5.2: Fit the K-Means model to the scaled feature set.
kmeans_model.fit(X)

# Substep 5.3: Assign the cluster labels obtained from the model back to the engineered customer dataset.
# `df_engineered` (from Step 1) contains all relevant features for profiling.
df_with_clusters = df_engineered.copy()
df_with_clusters['cluster_label'] = kmeans_model.labels_

# The dataframe `df_with_clusters` now includes the assigned cluster labels and is ready for profiling.
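
Assigning `labels_` as a raw array relies on the scaled frame and the engineered frame having the same rows in the same order. A slightly more defensive variant, sketched here on invented data, wraps the labels in a Series keyed by the feature frame's index and then checks the cluster sizes. The names `features`, `profile_df`, and `sizes` are hypothetical.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Toy feature frame with non-default row labels (invented for illustration)
features = pd.DataFrame(
    {"x": [0.0, 0.1, 5.0, 5.1, 10.0, 10.1], "y": [0.0, 0.2, 5.0, 5.2, 10.0, 10.2]},
    index=[101, 102, 103, 104, 105, 106],
)

model = KMeans(n_clusters=3, random_state=42, n_init=10).fit(features)

# Key the labels by the index of the data they were fitted on,
# so pandas aligns them row by row on assignment
labels = pd.Series(model.labels_, index=features.index, name="cluster_label")

profile_df = features.copy()
profile_df["cluster_label"] = labels

# Quick sanity check: how many rows landed in each cluster
sizes = profile_df["cluster_label"].value_counts().sort_index()
print(sizes)
```

A very small cluster in this count is often a sign of outliers driving the segmentation, which is worth checking before profiling.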

## Cluster Profiling and Interpretation
Analyzes and interprets the characteristics of each customer segment identified by K-Means clustering. Descriptive statistics (mean values) are calculated for all relevant features within each cluster, excluding identifiers and the cluster label itself. Discriminative features are identified by comparing cluster means against overall means, highlighting the top `top_n_features_for_profiling` (10) for each cluster. Cluster profiles are visualized using a heatmap for an overview and a radar chart to compare the most discriminative features across clusters. Finally, a narrative interpretation framework is provided for each cluster, outlining their defining traits, potential motivations, and implications for business strategies.

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
from sklearn.preprocessing import MinMaxScaler  # used to normalize the radar chart values

# Parameters for Step 6
top_n_features_for_profiling = 10

# Ensure df_with_clusters is available from Step 5
if 'df_with_clusters' not in locals():
    print("Error: `df_with_clusters` not found. Please ensure Step 5 executed correctly.")
    # Example placeholder for isolated execution if df_with_clusters is missing:
    # df_with_clusters = pd.DataFrame(np.random.rand(100, 15), columns=[f'feature_{i}' for i in range(15)])
    # df_with_clusters['customer_id'] = range(100)
    # df_with_clusters['cluster_label'] = np.random.randint(0, 5, 100)
    # df_engineered = df_with_clusters.drop(columns=['cluster_label'])

# Substep 6.1: Calculate descriptive statistics (e.g., mean) for all relevant features within each customer cluster.
# Exclude identifiers and the cluster label itself from the stats calculation
features_to_exclude_from_profiling = ['customer_id', 'customer_name', 'customer_birthdate', 'loyalty_card_number', 'year_first_transaction', 'cluster_label']
# Keep numeric columns only: `df_with_clusters` still contains the string column
# 'customer_gender', which cannot be averaged.
profiling_features = (
    df_with_clusters
    .drop(columns=features_to_exclude_from_profiling, errors='ignore')
    .select_dtypes(include=np.number)
    .columns.tolist()
)
cluster_profiles = df_with_clusters.groupby('cluster_label')[profiling_features].mean()

# Calculate overall means for comparison
overall_means = df_with_clusters[profiling_features].mean()

print("\n--- Cluster Profiles (Mean Values) ---")
print(cluster_profiles.round(2))
print("\n--- Overall Mean Values for Comparison ---")
print(overall_means.round(2))

# Substep 6.2: Identify the most discriminative features for each cluster.
relative_cluster_profiles = cluster_profiles - overall_means

print("\n--- Top Discriminative Features per Cluster (Relative to Overall Mean) ---")
for cluster_id in cluster_profiles.index:
    print(f"\nCluster {cluster_id}:")
    # Get top_n_features with largest absolute deviation for the current cluster
    top_features_abs_dev = relative_cluster_profiles.loc[cluster_id].abs().nlargest(top_n_features_for_profiling).index.tolist()
    # Print their actual values and relative deviation
    for feature in top_features_abs_dev:
        cluster_val = cluster_profiles.loc[cluster_id, feature]
        overall_val = overall_means[feature]
        deviation = relative_cluster_profiles.loc[cluster_id, feature]
        print(f"  - {feature}: Cluster Mean={cluster_val:.2f}, Overall Mean={overall_val:.2f}, Deviation={deviation:.2f}")

# Substep 6.3: Visualize cluster profiles.
plt.figure(figsize=(15, 8))
sns.heatmap(cluster_profiles.round(2), annot=True, cmap='viridis', fmt=".2f", linewidths=.5)
plt.title('Cluster Mean Values Across Features', fontsize=16)
plt.ylabel('Cluster Label', fontsize=12)
plt.xlabel('Features', fontsize=12)
plt.yticks(rotation=0)
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()

# Radar chart for a more detailed comparison of key features
if not cluster_profiles.empty:
    # Select features with high overall discriminative power across all clusters
    # (e.g., features where the standard deviation of cluster means is high)
    discriminative_features = cluster_profiles.std().nlargest(top_n_features_for_profiling).index.tolist()

    if len(discriminative_features) > 0:
        # Normalize data for radar chart to a 0-1 range for consistent visualization
        scaler_profiling = MinMaxScaler()
        cluster_profiles_scaled = pd.DataFrame(
            scaler_profiling.fit_transform(cluster_profiles[discriminative_features]),
            columns=discriminative_features,
            index=cluster_profiles.index,
        )

        # Prepare for radar chart plotting
        categories = list(cluster_profiles_scaled.columns)
        num_vars = len(categories)
        angles = np.linspace(0, 2 * np.pi, num_vars, endpoint=False).tolist()
        angles += angles[:1]  # Complete the loop for plotting

        fig, ax = plt.subplots(figsize=(8, 8), subplot_kw=dict(polar=True))

        for i, cluster_label in enumerate(cluster_profiles_scaled.index):
            values = cluster_profiles_scaled.loc[cluster_label].tolist()
            values += values[:1]
            ax.plot(angles, values, linewidth=1, linestyle='solid', label=f'Cluster {cluster_label}')
            ax.fill(angles, values, alpha=0.25)

        ax.set_theta_offset(np.pi / 2)
        ax.set_theta_direction(-1)
        ax.set_rlabel_position(0)
        ax.set_xticks(angles[:-1])
        ax.set_xticklabels(categories, fontsize=10)
        ax.set_yticklabels([])  # Hide r-axis tick labels for cleaner look
        plt.title('Radar Chart of Key Features by Cluster (Scaled)', size=16, color='grey', y=1.05)
        plt.legend(loc='upper right', bbox_to_anchor=(1.3, 1.1))
        plt.show()
    else:
        print("Not enough discriminative features to plot radar chart.")
else:
    print("No cluster profiles to visualize for radar chart.")

# Substep 6.4: Provide a comprehensive narrative interpretation for each cluster.
print("\n--- Narrative Interpretation of Clusters ---")
print("Based on the calculated cluster profiles and visual analysis, interpret each cluster:")
for cluster_id in cluster_profiles.index:
    print(f"\nCluster {cluster_id}:")
    # Identify key high and low features for this cluster based on relative deviation
    high_features = relative_cluster_profiles.loc[cluster_id].nlargest(3).index.tolist()
    low_features = relative_cluster_profiles.loc[cluster_id].nsmallest(3).index.tolist()

    print(f"  - Characterized by: Higher than average spending/activity in {', '.join(high_features)}.")
    print(f"  - Less prominent in: Lower than average spending/activity in {', '.join(low_features)}.")
    print("  - Potential Segment Description: (e.g., 'High-value tech enthusiasts', 'Budget-conscious family shoppers', 'Health-focused singles', etc.)")
    print("  - Actionable Insights: (e.g., Tailored product recommendations, targeted marketing campaigns, personalized loyalty programs, specific communication channels.)")

print("\nFurther in-depth domain knowledge and business context are essential to refine these interpretations and develop highly specific, actionable strategies for each segment.")
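
Because the deviations in Substep 6.2 are measured in each feature's own units, large-scale features such as total spend tend to dominate the ranking over ratios. One possible refinement, shown here as a sketch on invented data and not as part of the generated analysis, is to divide each deviation by the feature's overall standard deviation so that features become comparable.

```python
import pandas as pd

# Toy data: one large-scale feature and one ratio (values invented)
toy = pd.DataFrame({
    "total_lifetime_spend": [1000.0, 1100.0, 5000.0, 5200.0],
    "alcohol_spend_ratio": [0.10, 0.12, 0.80, 0.90],
    "cluster_label": [0, 0, 1, 1],
})
feature_cols = ["total_lifetime_spend", "alcohol_spend_ratio"]

cluster_means = toy.groupby("cluster_label")[feature_cols].mean()
overall_mean = toy[feature_cols].mean()
overall_std = toy[feature_cols].std(ddof=0)  # assumes no constant (zero-variance) feature

# Raw deviation: dominated by whichever feature has the largest units
raw_dev = cluster_means - overall_mean

# Standardized deviation: expressed in overall standard deviations per feature
z_dev = raw_dev / overall_std

print(raw_dev.abs().idxmax(axis=1).tolist())
print(z_dev.round(2))
```

In this toy example the raw ranking picks total spend for both clusters, while the standardized view shows that the alcohol ratio separates the clusters about as strongly. A constant feature would make `overall_std` zero, so such columns should be dropped or guarded before dividing.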