# Customer Segmentation

Generated by Auto-Analysis App

## Feature Engineering
Derives new features: customer age, tenure, total dependents, total lifetime spend, a spend ratio for each spend category, birth month, and birth day of week. Missing ages and tenures are imputed with the median, and a spend ratio is set to 0 when a customer's total lifetime spend is 0, which avoids division by zero.

In [None]:

    import pandas as pd
    import numpy as np

    current_year = 2024

    try:
        # Ensure 'customer_birthdate' is in datetime format
        df['customer_birthdate'] = pd.to_datetime(df['customer_birthdate'], errors='coerce')

        # Step 1.1: Calculate 'customer_age'
        df['customer_age'] = current_year - df['customer_birthdate'].dt.year
        median_age = df['customer_age'].median()
        # Assign the result back: fillna(..., inplace=True) on a selected column
        # may not update df in recent pandas versions
        df['customer_age'] = df['customer_age'].fillna(median_age)
        print(f"Calculated 'customer_age'. Median age: {median_age:.0f}")

        # Step 1.2: Calculate 'customer_tenure'
        df['customer_tenure'] = current_year - df['year_first_transaction']
        median_tenure = df['customer_tenure'].median()
        df['customer_tenure'] = df['customer_tenure'].fillna(median_tenure)
        print(f"Calculated 'customer_tenure'. Median tenure: {median_tenure:.0f}")

        # Step 1.3: Calculate 'total_dependents'
        df['total_dependents'] = df['kids_home'] + df['teens_home']
        print("Calculated 'total_dependents'.")

        # Identify all 'lifetime_spend_X' columns, skipping ratio columns
        # left over from an earlier run of this cell
        spend_cols = [col for col in df.columns
                      if 'lifetime_spend_' in col and not col.endswith('_ratio')]

        # Step 1.4: Calculate 'total_lifetime_spend'
        df['total_lifetime_spend'] = df[spend_cols].sum(axis=1)
        print("Calculated 'total_lifetime_spend'.")

        # Step 1.5: Derive 'lifetime_spend_X_ratio' for each individual spend category
        for col in spend_cols:
            ratio_col_name = col + '_ratio'
            # Handle division by zero: if total_lifetime_spend is 0, ratio is 0
            df[ratio_col_name] = np.where(
                df['total_lifetime_spend'] == 0,
                0,
                df[col] / df['total_lifetime_spend']
            )
        print("Derived 'lifetime_spend_X_ratio' for all spend categories, handling division by zero.")

        # Step 1.6: Extract 'birth_month' and 'birth_day_of_week'
        df['birth_month'] = df['customer_birthdate'].dt.month
        df['birth_day_of_week'] = df['customer_birthdate'].dt.dayofweek
        # Impute NaNs for birth_month/day_of_week if birthdate was NaT
        df['birth_month'] = df['birth_month'].fillna(df['birth_month'].median())
        df['birth_day_of_week'] = df['birth_day_of_week'].fillna(df['birth_day_of_week'].median())
        df['birth_month'] = df['birth_month'].astype(int)
        df['birth_day_of_week'] = df['birth_day_of_week'].astype(int)
        print("Extracted 'birth_month' and 'birth_day_of_week'.")

        print("Feature Engineering completed successfully.")
    except KeyError as e:
        print(f"Error: Missing expected column for feature engineering: {e}")
    except Exception as e:
        print(f"An unexpected error occurred during feature engineering: {e}")
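
The two missing-value rules used in this step can be illustrated without pandas. The sketch below is a minimal standard-library illustration, not the notebook's code: `impute_median` and `spend_ratios` are hypothetical helper names, and the sample values are made up.

```python
from statistics import median


def impute_median(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]


def spend_ratios(spend_by_category):
    """Return each category's share of total spend; all zeros if the total is 0."""
    total = sum(spend_by_category.values())
    if total == 0:
        # Mirrors the np.where guard: no spend at all means every ratio is 0
        return {name: 0.0 for name in spend_by_category}
    return {name: amount / total for name, amount in spend_by_category.items()}


# Made-up sample data for illustration only
ages = impute_median([34, None, 52, 41])
ratios = spend_ratios({"groceries": 300.0, "electronics": 100.0})
no_spend = spend_ratios({"groceries": 0.0, "electronics": 0.0})
print(ages, ratios, no_spend)
```

Setting the ratio to 0 for customers with no spend keeps them in the dataset with a neutral profile instead of producing NaN or infinite values that would break scaling later.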

## Data Preparation and Encoding
Drops identifier columns ('customer_id', 'customer_name', 'loyalty_card_number'), original date columns ('customer_birthdate', 'year_first_transaction'), one-hot encodes 'customer_gender', imputes any remaining numerical NaNs with the median, and selects the final feature set for clustering.

In [None]:

    import numpy as np
    import pandas as pd

    try:
        # Create a copy to avoid modifying the original DataFrame directly for subsequent steps
        df_processed = df.copy()

        # Step 2.1: Drop identifier columns
        id_cols_to_drop = ['customer_id', 'customer_name', 'loyalty_card_number']
        df_processed.drop(columns=[col for col in id_cols_to_drop if col in df_processed.columns], inplace=True)
        print(f"Dropped identifier columns: {', '.join(id_cols_to_drop)}.")

        # Step 2.2: Drop original 'customer_birthdate' and 'year_first_transaction' columns
        original_date_cols_to_drop = ['customer_birthdate', 'year_first_transaction']
        df_processed.drop(columns=[col for col in original_date_cols_to_drop if col in df_processed.columns], inplace=True)
        print(f"Dropped original date columns: {', '.join(original_date_cols_to_drop)}.")

        # Step 2.3: One-Hot Encode the 'customer_gender' column
        if 'customer_gender' in df_processed.columns:
            # dtype=int keeps the dummy columns numeric; pandas 2.x defaults to bool,
            # which select_dtypes(include=np.number) below would silently exclude
            df_processed = pd.get_dummies(df_processed, columns=['customer_gender'],
                                          drop_first=True, prefix='gender', dtype=int)
            print("One-Hot Encoded 'customer_gender'.")
        else:
            print("'customer_gender' column not found for encoding.")

        # Step 2.4: Handle any remaining missing values in numerical columns by imputing with the median
        numerical_cols = df_processed.select_dtypes(include=np.number).columns
        if df_processed[numerical_cols].isnull().sum().sum() > 0:
            for col in numerical_cols:
                if df_processed[col].isnull().any():
                    median_val = df_processed[col].median()
                    df_processed[col] = df_processed[col].fillna(median_val)
            print("Imputed remaining numerical NaNs with median.")
        else:
            print("No remaining numerical NaNs found.")

        # Step 2.5: Select the final set of features for clustering
        # Exclude any non-numeric columns that might have slipped through (e.g., if new object columns were created)
        features_for_clustering = df_processed.select_dtypes(include=np.number).columns.tolist()

        # Store the selected features for later use (e.g., PCA, scaling, profiling)
        selected_features_df = df_processed[features_for_clustering].copy()
        print(f"Selected {len(features_for_clustering)} features for clustering.")

        print("Data Preparation and Encoding completed successfully.")
    except KeyError as e:
        print(f"Error: Missing expected column during data preparation: {e}")
    except Exception as e:
        print(f"An unexpected error occurred during data preparation: {e}")
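
To make the effect of `drop_first=True` concrete, here is a small pure-Python sketch of one-hot encoding with the first category dropped. It only approximates what `pd.get_dummies` does for a column without missing values; `one_hot_drop_first` is a hypothetical helper and the gender values are invented.

```python
def one_hot_drop_first(values, prefix):
    """One-hot encode a list of labels, omitting the first (sorted) category."""
    categories = sorted(set(values))
    kept = categories[1:]  # the dropped category is implied when all kept columns are 0
    return {f"{prefix}_{cat}": [int(v == cat) for v in values] for cat in kept}


# Invented sample column for illustration only
encoded = one_hot_drop_first(["F", "M", "F"], prefix="gender")
print(encoded)
```

With two categories this yields a single 0/1 column, which avoids feeding two perfectly correlated columns into scaling and PCA.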

## Feature Scaling
Initializes `StandardScaler`, fits it to the selected numerical features, and transforms them to zero mean and unit variance. The scaled array is held in `X_scaled` and stored, as a DataFrame with the original column names, in `X_scaled_data` for later steps.

In [None]:

    from sklearn.preprocessing import StandardScaler
    import pandas as pd

    try:
        if 'selected_features_df' not in globals() or selected_features_df.empty:
            raise ValueError("selected_features_df is not defined or is empty. Please ensure Step 2 ran successfully.")

        # Step 3.1: Initialize StandardScaler
        scaler = StandardScaler()
        print("StandardScaler initialized.")

        # Step 3.2 & 3.3: Fit the StandardScaler and transform the feature set
        X_scaled = scaler.fit_transform(selected_features_df)

        # Convert back to DataFrame for easier inspection and consistency, preserving column names
        X_scaled_df = pd.DataFrame(X_scaled, columns=selected_features_df.columns, index=selected_features_df.index)
        print("Features scaled using StandardScaler.")

        # Store the scaled data for subsequent steps
        X_scaled_data = X_scaled_df

        print("Feature Scaling completed successfully.")
    except ValueError as e:
        print(f"Error during feature scaling: {e}")
    except Exception as e:
        print(f"An unexpected error occurred during feature scaling: {e}")
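
For intuition, the standardization applied to each column is z = (x - mean) / std, using the population standard deviation. The sketch below computes it by hand with the standard library; `standardize` is a hypothetical helper and the input values are made up.

```python
from statistics import fmean, pstdev


def standardize(column):
    """Scale a list of numbers to zero mean and unit (population) variance."""
    mu = fmean(column)
    sigma = pstdev(column)
    if sigma == 0:
        # A constant column carries no information; map it to zeros
        return [0.0 for _ in column]
    return [(x - mu) / sigma for x in column]


# Made-up spend values for illustration only
z = standardize([100.0, 200.0, 300.0])
print(z)
```

Without this step, features with large numeric ranges (such as total spend) would dominate the distance calculations in PCA and K-Means over small-range features (such as number of dependents).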

## Dimensionality Reduction (PCA)
Applies Principal Component Analysis (PCA) to the scaled data, keeping the smallest number of components whose cumulative explained variance exceeds 95%. The transformed data is stored in `X_pca` (and in `X_pca_data` for later steps), and the explained variance ratio is printed and plotted.

In [None]:

    from sklearn.decomposition import PCA
    import matplotlib.pyplot as plt
    import numpy as np

    n_components_pca = 0.95

    try:
        if 'X_scaled_data' not in globals() or X_scaled_data.empty:
            raise ValueError("X_scaled_data is not defined or is empty. Please ensure Step 3 ran successfully.")

        # Step 4.1: Initialize PCA
        pca = PCA(n_components=n_components_pca, random_state=42)
        print(f"PCA initialized with n_components={n_components_pca} (explaining {n_components_pca:.0%} variance).")

        # Step 4.2 & 4.3: Fit the PCA model and transform the scaled data
        X_pca = pca.fit_transform(X_scaled_data)
        print(f"Data transformed into {X_pca.shape[1]} principal components.")

        # Step 4.4: Analyze the explained variance ratio
        explained_variance_ratio = pca.explained_variance_ratio_
        cumulative_explained_variance = np.cumsum(explained_variance_ratio)
        print(f"Explained variance ratio for each component: {explained_variance_ratio}")
        print(f"Cumulative explained variance: {cumulative_explained_variance}")

        # Plot explained variance
        plt.figure(figsize=(10, 6))
        plt.plot(range(1, len(cumulative_explained_variance) + 1), cumulative_explained_variance, marker='o', linestyle='--')
        plt.title('Cumulative Explained Variance by Principal Components')
        plt.xlabel('Number of Components')
        plt.ylabel('Cumulative Explained Variance')
        plt.grid(True)
        plt.axhline(y=n_components_pca, color='r', linestyle='-', label=f'{n_components_pca:.0%} Explained Variance')
        plt.legend()
        plt.show()

        # Store the PCA transformed data for subsequent steps
        X_pca_data = X_pca

        print("Dimensionality Reduction (PCA) completed successfully.")
    except ValueError as e:
        print(f"Error during PCA: {e}")
    except Exception as e:
        print(f"An unexpected error occurred during PCA: {e}")
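
When `n_components` is a fraction, the number of components kept follows from the cumulative explained variance. The sketch below shows that selection rule in plain Python, assuming the ratios are already sorted in descending order as PCA reports them; `components_for_variance` is a hypothetical helper and the ratios are invented.

```python
from itertools import accumulate


def components_for_variance(ratios, threshold):
    """Return the smallest number of leading components whose cumulative
    explained-variance ratio exceeds the threshold."""
    for count, cumulative in enumerate(accumulate(ratios), start=1):
        if cumulative > threshold:
            return count
    return len(ratios)  # threshold never exceeded: keep everything


# Invented explained-variance ratios for illustration only
n_kept = components_for_variance([0.5, 0.3, 0.1, 0.06, 0.04], threshold=0.95)
print(n_kept)
```

Here the first three components reach only about 90%, so a fourth is needed to pass the 95% line drawn in the plot.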

## Determine Optimal Number of Clusters (K)
Evaluates candidate numbers of clusters (K) for K-Means using the Elbow Method (WCSS/inertia) and the Silhouette Score. K-Means is run for K = 2 through 9 (`range(2, 10)`) with a random state of 42 and 10 initializations per K. Both metrics are plotted to aid the decision.

In [None]:

    from sklearn.cluster import KMeans
    from sklearn.metrics import silhouette_score
    import matplotlib.pyplot as plt
    import numpy as np

    k_range_start = 2
    k_range_end = 10
    random_state_kmeans_evaluation = 42

    try:
        if 'X_pca_data' not in globals() or X_pca_data.shape[0] == 0:
            raise ValueError("X_pca_data is not defined or is empty. Please ensure Step 4 ran successfully.")

        wcss = []  # Within-Cluster Sum of Squares (Inertia)
        silhouette_scores = []
        k_values = range(k_range_start, k_range_end)

        print(f"Evaluating K-Means for K from {k_range_start} to {k_range_end-1}...")
        for k in k_values:
            kmeans_model = KMeans(n_clusters=k, random_state=random_state_kmeans_evaluation, n_init=10)
            kmeans_model.fit(X_pca_data)
            wcss.append(kmeans_model.inertia_)

            if k == 1:  # Silhouette score is not defined for k=1
                silhouette_scores.append(np.nan)  # Not applicable
                continue

            # Step 5.2: Apply the Silhouette Score Method
            labels = kmeans_model.labels_
            score = silhouette_score(X_pca_data, labels)
            silhouette_scores.append(score)
            print(f"  K={k}: WCSS={kmeans_model.inertia_:.2f}, Silhouette Score={score:.4f}")

        # Step 5.1: Plot Elbow Method (WCSS)
        plt.figure(figsize=(12, 5))
        plt.subplot(1, 2, 1)
        plt.plot(k_values, wcss, marker='o', linestyle='--')
        plt.title('Elbow Method for Optimal K (WCSS)')
        plt.xlabel('Number of Clusters (K)')
        plt.ylabel('WCSS (Inertia)')
        plt.xticks(k_values)
        plt.grid(True)

        # Plot Silhouette Score Method
        plt.subplot(1, 2, 2)
        plt.plot(k_values, silhouette_scores, marker='o', linestyle='--')
        plt.title('Silhouette Score for Optimal K')
        plt.xlabel('Number of Clusters (K)')
        plt.ylabel('Silhouette Score')
        plt.xticks(k_values)
        plt.grid(True)

        plt.tight_layout()
        plt.show()

        # Step 5.3: Consider domain knowledge and business interpretability
        print("\nBased on the Elbow Method and Silhouette Score, visually inspect the plots to determine the optimal K.")
        print("Consider business interpretability and the trade-off between cluster cohesion and separation.")
        print("The 'n_clusters' parameter in Step 6 should be updated based on this analysis.")

        print("Optimal K determination completed successfully.")
    except ValueError as e:
        print(f"Error during optimal K determination: {e}")
    except Exception as e:
        print(f"An unexpected error occurred during optimal K determination: {e}")
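
The two metrics plotted above can be computed by hand on a toy dataset to see what they measure. This is a minimal standard-library sketch using Euclidean distance, not scikit-learn's implementation; `wcss_of`, `silhouette_of`, and the four sample points are hypothetical.

```python
import math


def wcss_of(points, labels):
    """Within-cluster sum of squared distances to each cluster's centroid."""
    total = 0.0
    for cluster in set(labels):
        members = [p for p, lab in zip(points, labels) if lab == cluster]
        centroid = [sum(axis) / len(members) for axis in zip(*members)]
        total += sum(math.dist(p, centroid) ** 2 for p in members)
    return total


def silhouette_of(points, labels):
    """Mean silhouette coefficient: (b - a) / max(a, b) averaged over points."""
    scores = []
    for i, (p, own) in enumerate(zip(points, labels)):
        same = [math.dist(p, q) for j, (q, lab) in enumerate(zip(points, labels))
                if lab == own and j != i]
        if not same:  # singleton cluster: score defined as 0
            scores.append(0.0)
            continue
        a = sum(same) / len(same)  # mean distance within own cluster
        other_means = []
        for other in set(labels) - {own}:
            dists = [math.dist(p, q) for q, lab in zip(points, labels) if lab == other]
            other_means.append(sum(dists) / len(dists))
        b = min(other_means)  # mean distance to the nearest other cluster
        scores.append(0.0 if max(a, b) == 0 else (b - a) / max(a, b))
    return sum(scores) / len(scores)


# Two well-separated pairs of points (made-up data)
pts = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good = silhouette_of(pts, [0, 0, 1, 1])
bad = silhouette_of(pts, [0, 1, 0, 1])
inertia = wcss_of(pts, [0, 0, 1, 1])
print(good, bad, inertia)
```

A sensible grouping gives a silhouette close to 1, while a grouping that mixes the two pairs gives a negative value. WCSS always shrinks as K grows, which is why the elbow is read from where the decrease flattens.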

## Apply K-Means Clustering
Applies K-Means clustering to the PCA-transformed data using 5 clusters (this value should be updated based on Step 5's analysis), with a random state of 42 and 10 initializations. The resulting cluster labels are assigned back to the original DataFrame.

In [None]:

    from sklearn.cluster import KMeans
    import pandas as pd

    n_clusters = 5  # IMPORTANT: Update this value based on the analysis from Step 5
    random_state_kmeans_final = 42
    n_init = 10

    try:
        if 'X_pca_data' not in globals() or X_pca_data.shape[0] == 0:
            raise ValueError("X_pca_data is not defined or is empty. Please ensure Step 4 ran successfully.")
        if df.empty:
            raise ValueError("Original DataFrame 'df' is empty.")

        # Step 6.1: Initialize KMeans
        kmeans = KMeans(n_clusters=n_clusters, random_state=random_state_kmeans_final, n_init=n_init)
        print(f"KMeans initialized with n_clusters={n_clusters}, random_state={random_state_kmeans_final}, n_init={n_init}.")

        # Step 6.2: Fit the KMeans model to the preprocessed data (PCA-transformed data)
        kmeans.fit(X_pca_data)
        print("KMeans model fitted to the PCA-transformed data.")

        # Step 6.3: Assign the generated cluster labels back to the original DataFrame
        # labels_ is a positional array, so this relies on df still having the same rows,
        # in the same order, as the data passed through scaling and PCA
        df['cluster_label'] = kmeans.labels_
        print("Cluster labels assigned to the original DataFrame.")
        print(f"Cluster distribution:\n{df['cluster_label'].value_counts().sort_index()}")

        print("K-Means Clustering completed successfully.")
    except ValueError as e:
        print(f"Error during K-Means clustering: {e}")
    except Exception as e:
        print(f"An unexpected error occurred during K-Means clustering: {e}")
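
At its core, K-Means alternates between assigning each point to its nearest centroid and moving each centroid to the mean of its points. The sketch below shows that basic loop (Lloyd's algorithm) in plain Python under simplifying assumptions: the starting centroids are fixed by hand, whereas scikit-learn's `KMeans` chooses them itself and repeats the run `n_init` times. The names `assign`, `update`, and `lloyd` are hypothetical.

```python
import math


def assign(points, centroids):
    """Label each point with the index of its nearest centroid."""
    return [min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
            for p in points]


def update(points, labels, centroids):
    """Move each centroid to the mean of its assigned points."""
    moved = []
    for j, old in enumerate(centroids):
        members = [p for p, lab in zip(points, labels) if lab == j]
        if not members:
            moved.append(old)  # leave an empty cluster's centroid in place
        else:
            moved.append(tuple(sum(axis) / len(members) for axis in zip(*members)))
    return moved


def lloyd(points, centroids, max_iter=100):
    """Repeat assign/update until the centroids stop moving."""
    labels = assign(points, centroids)
    for _ in range(max_iter):
        updated = update(points, labels, centroids)
        if updated == centroids:
            break
        centroids = updated
        labels = assign(points, centroids)
    return labels, centroids


# Made-up points and hand-picked starting centroids
pts = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
labels, centers = lloyd(pts, [(0.0, 0.0), (10.0, 0.0)])
print(labels, centers)
```

Because the result depends on the starting centroids, fixing `random_state` and using several initializations, as the cell above does, makes the final segmentation reproducible and less sensitive to a poor start.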

## Cluster Profiling and Interpretation
Calculates descriptive statistics (mean and median) for key original and engineered features, grouped by the assigned cluster labels. Visualizes cluster profiles using bar plots for selected features to highlight differences and aid in assigning meaningful names to each segment. Also analyzes cluster sizes.

In [None]:

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    import seaborn as sns

    try:
        if 'df' not in globals() or df.empty or 'cluster_label' not in df.columns:
            raise ValueError("DataFrame 'df' or 'cluster_label' column is missing. Please ensure Step 6 ran successfully.")

        print("Starting Cluster Profiling and Interpretation...")

        # Define a list of key features for profiling. This should include original and engineered features.
        # Exclude highly correlated ratio features for initial profiling, focus on absolute spends and key demographics.
        profiling_features = [
            'customer_age', 'customer_tenure', 'total_dependents', 'number_complaints',
            'distinct_stores_visited', 'typical_hour', 'total_lifetime_spend',
            'lifetime_spend_groceries', 'lifetime_spend_electronics', 'lifetime_spend_vegetables',
            'lifetime_spend_nonalcohol_drinks', 'lifetime_spend_alcohol_drinks', 'lifetime_spend_meat',
            'lifetime_spend_fish', 'lifetime_spend_hygiene', 'lifetime_spend_videogames',
            'lifetime_spend_petfood', 'lifetime_total_distinct_products',
            'percentage_of_products_bought_promotion', 'latitude', 'longitude',
            'gender_M'  # Created in df_processed, not df; the filter below drops it unless it is also added to df
        ]
        # Filter for features actually present in the DataFrame
        profiling_features = [f for f in profiling_features if f in df.columns]

        # Step 7.1: Calculate descriptive statistics for each cluster
        cluster_means = df.groupby('cluster_label')[profiling_features].mean()
        cluster_medians = df.groupby('cluster_label')[profiling_features].median()
        print("\nCluster Means:\n", cluster_means)
        print("\nCluster Medians:\n", cluster_medians)

        # Step 7.5: Analyze the size and proportion of each cluster
        cluster_sizes = df['cluster_label'].value_counts().sort_index()
        cluster_proportions = df['cluster_label'].value_counts(normalize=True).sort_index()
        print("\nCluster Sizes:\n", cluster_sizes)
        print("\nCluster Proportions:\n", (cluster_proportions * 100).round(2).astype(str) + '%')

        # Step 7.3: Visualize cluster profiles
        print("\nGenerating visualizations for cluster profiles...")

        # Select a subset of features for visualization to avoid clutter
        # Choose features that are likely to show clear distinctions
        visual_features = [
            'customer_age', 'customer_tenure', 'total_dependents', 'total_lifetime_spend',
            'lifetime_spend_groceries', 'lifetime_spend_electronics', 'number_complaints',
            'distinct_stores_visited', 'percentage_of_products_bought_promotion', 'typical_hour'
        ]
        visual_features = [f for f in visual_features if f in df.columns]

        if not visual_features:
            print("No suitable features found for visualization. Skipping visualization step.")
        else:
            num_clusters = df['cluster_label'].nunique()
            num_features = len(visual_features)

            # Determine grid size for subplots
            n_cols = 3  # Max columns for plots
            n_rows = (num_features + n_cols - 1) // n_cols

            plt.figure(figsize=(n_cols * 5, n_rows * 4))
            for i, feature in enumerate(visual_features):
                plt.subplot(n_rows, n_cols, i + 1)
                sns.barplot(x=cluster_means.index, y=cluster_means[feature], palette='viridis')
                plt.title(f'Mean {feature} by Cluster')
                plt.xlabel('Cluster')
                plt.ylabel(feature)
            plt.tight_layout()
            plt.show()

            # Example: Radar chart for a more holistic view (requires normalization)
            # For a radar chart, we need to normalize the cluster means across features
            # so that different scales don't dominate.
            if num_features > 2 and num_clusters > 1:
                print("Generating a radar chart for a subset of features...")
                # Select a subset of features for the radar chart, ideally 5-8 features
                radar_features = ['total_lifetime_spend', 'customer_age', 'total_dependents',
                                  'lifetime_spend_electronics', 'lifetime_spend_groceries',
                                  'percentage_of_products_bought_promotion']
                radar_features = [f for f in radar_features if f in cluster_means.columns]

                if radar_features:
                    radar_df = cluster_means[radar_features].copy()
                    # Min-Max scale each feature across clusters for radar chart
                    for col in radar_df.columns:
                        min_val = radar_df[col].min()
                        max_val = radar_df[col].max()
                        if max_val - min_val > 0:
                            radar_df[col] = (radar_df[col] - min_val) / (max_val - min_val)
                        else:
                            radar_df[col] = 0.5  # If all values are the same, set to mid-point

                    # Repeat the first angle so each polygon closes; the matching first value
                    # is repeated per cluster in the loop below
                    angles = np.linspace(0, 2 * np.pi, len(radar_features), endpoint=False).tolist()
                    angles += angles[:1]

                    plt.figure(figsize=(8, 8))
                    ax = plt.subplot(111, polar=True)

                    # One polygon per cluster (one row of radar_df per cluster)
                    for cluster_id, row in radar_df.iterrows():
                        values = row[radar_features].tolist()
                        values += values[:1]
                        ax.plot(angles, values, linewidth=2, linestyle='solid', label=f'Cluster {cluster_id}')
                        ax.fill(angles, values, alpha=0.25)

                    ax.set_theta_offset(np.pi / 2)
                    ax.set_theta_direction(-1)
                    ax.set_rlabel_position(0)
                    plt.xticks(angles[:-1], radar_features, color='grey', size=8)
                    plt.yticks([0.2, 0.4, 0.6, 0.8, 1.0], ["0.2", "0.4", "0.6", "0.8", "1.0"], color="grey", size=7)
                    plt.ylim(0, 1)
                    plt.title('Cluster Profiles (Normalized Features)', size=15, color='black', y=1.1)
                    plt.legend(loc='upper right', bbox_to_anchor=(1.3, 1.1))
                    plt.show()
                else:
                    print("Not enough suitable features for radar chart.")

        # Step 7.4: Assign meaningful names to each cluster (manual step based on analysis)
        print("\n--- Cluster Naming Guide ---")
        print("Review the cluster means/medians and visualizations to assign descriptive names to each cluster.")
        print("Example: Cluster 0: 'Young Tech Enthusiasts', Cluster 1: 'Family Shoppers', etc.")
        print("This step requires human interpretation based on the data patterns.")

        print("Cluster Profiling and Interpretation completed successfully.")
    except ValueError as e:
        print(f"Error during cluster profiling: {e}")
    except Exception as e:
        print(f"An unexpected error occurred during cluster profiling: {e}")
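
The radar chart depends on two small transformations: min-max scaling each feature across clusters, and repeating the first point so the polygon closes. The sketch below isolates both in plain Python; `min_max_scale` and `close_polygon` are hypothetical helpers and the cluster means are invented.

```python
import math


def min_max_scale(values):
    """Rescale values to [0, 1]; a constant feature maps to the mid-point 0.5."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.5 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def close_polygon(values):
    """Append the first value so a plotted line returns to its start."""
    return values + values[:1]


# Invented mean total spend for three clusters
scaled_spend = min_max_scale([100.0, 300.0, 200.0])

# One cluster's scaled values across four features, plus evenly spaced angles
profile = close_polygon([0.0, 1.0, 0.5, 0.25])
angles = close_polygon([2 * math.pi * i / 4 for i in range(4)])
print(scaled_spend, profile, angles)
```

Scaling is done per feature across clusters, so the chart shows how clusters rank relative to each other on each axis, not absolute magnitudes.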

## Reporting and Strategic Recommendations
Summarizes the findings of the customer segmentation and provides actionable strategic recommendations for each identified segment. This includes documenting key characteristics, proposing targeted marketing, suggesting product development, outlining retention/acquisition strategies, and discussing potential business impact.

In [None]:

    print("--- Step 8: Reporting and Strategic Recommendations ---")
    print("\nThis step involves synthesizing all previous analysis into a comprehensive report and actionable recommendations.")

    # Step 8.1: Document the key characteristics, size, and unique behaviors of each customer segment.
    print("\n8.1 Documenting Cluster Characteristics:")
    print("  - For each cluster (e.g., Cluster 0, Cluster 1, etc.), summarize its defining features based on Step 7's profiling.")
    print("  - Include average age, tenure, total spend, dominant spending categories, family size, complaint levels, etc.")
    print("  - State the size and proportion of each segment within the total customer base.")

    # Step 8.2: Propose targeted marketing and communication strategies for each segment.
    print("\n8.2 Targeted Marketing and Communication Strategies:")
    print("  - **Segment 0 (e.g., 'Young Tech Enthusiasts'):** Focus on digital channels, new gadget promotions, gaming-related offers. Emphasize convenience and innovation.")
    print("  - **Segment 1 (e.g., 'Family Shoppers'):** Promote bulk discounts on groceries, kid-friendly products, household essentials. Use family-oriented messaging and loyalty programs.")
    print("  - **Segment 2 (e.g., 'Budget Conscious'):** Highlight value-for-money products, discount coupons, and essential items. Communicate through flyers and in-store promotions.")
    print("  - **Segment 3 (e.g., 'Luxury Spenders'):** Offer premium products, personalized shopping experiences, exclusive events. Focus on quality and brand prestige.")
    print("  - *[Add strategies for other segments as identified]*")

    # Step 8.3: Suggest potential product or service development opportunities.
    print("\n8.3 Product/Service Development Opportunities:")
    print("  - For 'Tech Enthusiasts': Consider partnerships for smart home devices or subscription services for electronics.")
    print("  - For 'Family Shoppers': Develop a wider range of organic baby food or convenient meal kits.")
    print("  - For 'Health-Conscious': Introduce more plant-based options, health supplements, or fitness-related products.")

    # Step 8.4: Outline strategies for customer retention, loyalty programs, and acquisition efforts.
    print("\n8.4 Retention, Loyalty, and Acquisition Strategies:")
    print("  - **Retention:** Tailored newsletters, birthday discounts, exclusive early access to sales based on segment preferences.")
    print("  - **Loyalty Programs:** Tiered rewards for high-value segments, points for specific category purchases relevant to a segment.")
    print("  - **Acquisition:** Target look-alike audiences for high-value segments, run campaigns on platforms frequented by specific segments.")

    # Step 8.5: Discuss potential business impact and expected ROI.
    print("\n8.5 Business Impact and ROI:")
    print("  - Quantify potential increase in customer lifetime value (CLTV) by segment.")
    print("  - Estimate revenue uplift from targeted promotions and reduced churn rates.")
    print("  - Discuss cost savings from optimized marketing spend by avoiding generic campaigns.")
    print("  - Highlight improved customer satisfaction and brand perception.")

    print("\nReporting and Strategic Recommendations outline completed. This output serves as a template for a detailed business report.")