# Customer Segmentation

Generated by Auto-Analysis App

## Feature Engineering
Created new features including customer age, tenure, total children at home, a boolean for having children, total lifetime spend, average spend per product, promotion engagement, spend category ratios, and a boolean for pet ownership. Handled potential division by zero for ratio calculations.

In [None]:
from datetime import datetime
import numpy as np

try:
    # Assumes df_fe (the working copy of df) already exists and that
    # 'customer_birthdate' is a datetime column.
    current_year = datetime.now().year

    # 1.1 Calculate 'customer_age'
    df_fe['customer_age'] = current_year - df_fe['customer_birthdate'].dt.year
    # Treat negative ages (future or invalid birthdates) as missing; pre-processing should normally catch these
    df_fe['customer_age'] = df_fe['customer_age'].apply(lambda x: x if x >= 0 else np.nan)
    # Impute any resulting NaNs with the median age
    if df_fe['customer_age'].isnull().any():
        median_age = df_fe['customer_age'].median()
        # Assign the result back: fillna(inplace=True) on a column selection is deprecated in recent pandas
        df_fe['customer_age'] = df_fe['customer_age'].fillna(median_age)

    # 1.2 Calculate 'customer_tenure'
    df_fe['customer_tenure'] = current_year - df_fe['year_first_transaction']
    df_fe['customer_tenure'] = df_fe['customer_tenure'].apply(lambda x: x if x >= 0 else 0)  # Tenure cannot be negative

    # 1.3 Combine 'kids_home' and 'teens_home'
    df_fe['total_children_home'] = df_fe['kids_home'] + df_fe['teens_home']
    df_fe['has_children'] = (df_fe['total_children_home'] > 0).astype(int)

    # 1.4 Calculate 'total_lifetime_spend'
    # Sum the general spend columns first, then add pet food for the overall value
    spend_cols = [col for col in df_fe.columns if 'lifetime_spend_' in col and col != 'lifetime_spend_petfood']
    df_fe['total_lifetime_spend'] = df_fe[spend_cols].sum(axis=1)
    df_fe['total_lifetime_spend'] = df_fe['total_lifetime_spend'] + df_fe['lifetime_spend_petfood']

    # 1.5 Derive 'average_spend_per_product' (0 when no distinct products were bought)
    df_fe['average_spend_per_product'] = df_fe.apply(
        lambda row: row['total_lifetime_spend'] / row['lifetime_total_distinct_products']
        if row['lifetime_total_distinct_products'] > 0 else 0,
        axis=1
    )

    # 1.6 Calculate 'promotion_engagement'
    # 'percentage_of_products_bought_promotion' scaled by 'lifetime_total_distinct_products'
    df_fe['promotion_engagement'] = (
        df_fe['percentage_of_products_bought_promotion'] * df_fe['lifetime_total_distinct_products']
    )

    # 1.7 Create 'spend_category_ratios' (0 when total spend is 0)
    spend_categories = [col for col in df_fe.columns if 'lifetime_spend_' in col]
    for col in spend_categories:
        ratio_col_name = col.replace('lifetime_spend_', 'spend_ratio_')
        df_fe[ratio_col_name] = df_fe.apply(
            lambda row: row[col] / row['total_lifetime_spend']
            if row['total_lifetime_spend'] > 0 else 0,
            axis=1
        )

    # 1.8 Derive 'has_pet'
    df_fe['has_pet'] = (df_fe['lifetime_spend_petfood'] > 0).astype(int)

    print("Feature Engineering completed successfully. New features added:")
    print(df_fe.filter(regex='customer_age|customer_tenure|total_children_home|has_children|total_lifetime_spend|average_spend_per_product|promotion_engagement|spend_ratio_|has_pet').head())
except Exception as e:
    print(f"Error during Feature Engineering: {e}")
    # In case of error, revert df_fe to the original df
    df_fe = df.copy()
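
The row-wise apply calls above guard each ratio against a zero denominator. As a minimal sketch of the same guard in vectorized form, the helper below (safe_ratio is a hypothetical name, and the toy values are made up for illustration) returns 0 wherever the denominator is not positive:

```python
import pandas as pd


def safe_ratio(numerator: pd.Series, denominator: pd.Series) -> pd.Series:
    """Element-wise numerator / denominator, with 0 where the denominator is not positive."""
    # Mask non-positive denominators as NaN so the division never divides by zero
    masked = denominator.where(denominator > 0)
    ratio = numerator / masked
    # Put 0 back wherever the denominator was not positive
    return ratio.where(denominator > 0, 0.0)


# Hypothetical toy data; the column names mirror the cell above
toy = pd.DataFrame({
    "total_lifetime_spend": [100.0, 0.0, 60.0],
    "lifetime_total_distinct_products": [4, 0, 3],
})
toy["average_spend_per_product"] = safe_ratio(
    toy["total_lifetime_spend"], toy["lifetime_total_distinct_products"]
)
print(toy)
```

On large DataFrames this form usually runs much faster than apply with axis=1, because the division happens on whole columns at once.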

## Feature Preprocessing for Clustering
Dropped irrelevant identifier columns. Applied one-hot encoding to 'customer_gender'. Identified and scaled numerical features using StandardScaler, preparing data for clustering.

In [None]:
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler

try:
    # Make a copy of the engineered DataFrame for preprocessing
    df_processed = df_fe.copy()

    # Parameters from the plan
    categorical_encoding_method = "one-hot"
    numerical_scaling_method = "standard_scaler"

    # 2.1 Drop identifier columns, plus the raw columns already used for age and tenure
    columns_to_drop = [
        'customer_id', 'customer_name', 'loyalty_card_number',
        'customer_birthdate', 'year_first_transaction'
    ]
    df_processed.drop(columns=columns_to_drop, errors='ignore', inplace=True)

    # Identify categorical features
    categorical_features = df_processed.select_dtypes(include=['object', 'category']).columns.tolist()

    # 2.2 Apply one-hot encoding to 'customer_gender'
    if 'customer_gender' in categorical_features and categorical_encoding_method == 'one-hot':
        # dtype=int keeps the dummy column numeric; recent pandas versions default to bool,
        # which select_dtypes(include=np.number) would skip, leaving the column unscaled
        df_processed = pd.get_dummies(
            df_processed, columns=['customer_gender'], prefix='gender', drop_first=True, dtype=int
        )
        categorical_features.remove('customer_gender')

    # 2.3 Drop any remaining categorical features that were not encoded
    if categorical_features:
        df_processed.drop(columns=categorical_features, errors='ignore', inplace=True)

    # Identify numerical features after encoding and dropping (includes the gender dummy)
    numerical_features = df_processed.select_dtypes(include=np.number).columns.tolist()

    # 2.4 Scale all selected numerical features with the configured scaler
    scalers = {'standard_scaler': StandardScaler, 'min_max_scaler': MinMaxScaler}
    if numerical_scaling_method in scalers:
        scaler = scalers[numerical_scaling_method]()
        df_scaled = pd.DataFrame(
            scaler.fit_transform(df_processed[numerical_features]),
            columns=numerical_features,
            index=df_processed.index
        )
        # Replace original numerical columns with scaled ones in df_processed
        for col in numerical_features:
            df_processed[col] = df_scaled[col]
        print(f"Numerical features scaled using {numerical_scaling_method}.")
    else:
        print("No numerical scaling applied as method is not 'standard_scaler' or 'min_max_scaler'.")
        df_scaled = df_processed[numerical_features].copy()  # If no scaling, df_scaled is just the numerical part

    print("Feature Preprocessing completed successfully. Head of processed data:")
    print(df_processed.head())
    print(f"Shape of preprocessed data: {df_processed.shape}")
except Exception as e:
    print(f"Error during Feature Preprocessing: {e}")
    df_processed = df_fe.copy()  # Revert to df_fe in case of error
    df_scaled = pd.DataFrame()  # Clear scaled data
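
To make the scaling step concrete, here is a minimal NumPy sketch of the z-score transform that StandardScaler applies per column: subtract the column mean and divide by the population standard deviation. The function name standardize and the toy matrix are assumptions for illustration only.

```python
import numpy as np


def standardize(X: np.ndarray) -> np.ndarray:
    """Z-score each column: (x - mean) / std, using the population std (ddof=0)."""
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    # Constant columns have std 0; divide by 1 instead so they map to 0
    std = np.where(std == 0, 1.0, std)
    return (X - mean) / std


# Hypothetical toy data: an age-like column and a spend-like column on very different scales
X = np.array([[20.0, 1000.0],
              [40.0, 3000.0],
              [60.0, 5000.0]])
Z = standardize(X)
print(Z)
print("column means:", Z.mean(axis=0))
print("column stds:", Z.std(axis=0))
```

After the transform both columns carry equal weight in Euclidean distance, which matters because K-Means would otherwise be dominated by the large-valued spend columns.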

## Dimensionality Reduction (Optional)
Applied Principal Component Analysis (PCA) to reduce dimensionality, aiming to retain 95% of the variance in the scaled data. The data was transformed into the new PCA space.

In [None]:
import pandas as pd
from sklearn.decomposition import PCA

try:
    # Parameters from the plan
    apply_pca = True
    n_components_pca = 0.95  # Fraction of variance to retain

    if apply_pca:
        if df_processed.empty:
            raise ValueError("df_processed is empty. Ensure previous preprocessing step ran successfully.")

        # 3.1 Perform Principal Component Analysis (PCA)
        # PCA must run on the scaled features. At this stage df_processed should contain
        # only scaled numerical features and the one-hot encoded gender column.
        pca_model = PCA(n_components=n_components_pca, random_state=42)
        df_pca = pd.DataFrame(pca_model.fit_transform(df_processed), index=df_processed.index)

        print(f"PCA applied. Original dimensions: {df_processed.shape[1]}, Reduced dimensions: {df_pca.shape[1]}")
        print(f"Explained variance ratio: {pca_model.explained_variance_ratio_.sum():.2f}")
        print("Head of PCA transformed data:")
        print(df_pca.head())
    else:
        df_pca = df_processed.copy()  # If PCA is not applied, use the preprocessed data directly
        print("PCA not applied as 'apply_pca' is set to False.")
except Exception as e:
    print(f"Error during Dimensionality Reduction: {e}")
    df_pca = df_processed.copy()  # Fall back to non-PCA data if an error occurs
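
Passing a fraction such as 0.95 as n_components asks PCA to keep the fewest components whose cumulative explained variance exceeds that fraction. The sketch below reproduces that selection rule with a plain SVD in NumPy; the function name and the toy matrix are assumptions chosen so the variance shares are easy to verify by hand.

```python
import numpy as np


def components_for_variance(X: np.ndarray, target: float = 0.95):
    """Return (k, cumulative): the fewest components whose cumulative variance share exceeds target."""
    centered = X - X.mean(axis=0)
    # Singular values of the centered data; their squares are proportional to component variances
    singular_values = np.linalg.svd(centered, compute_uv=False)
    explained_ratio = singular_values ** 2 / np.sum(singular_values ** 2)
    cumulative = np.cumsum(explained_ratio)
    k = int(np.searchsorted(cumulative, target, side="right")) + 1
    return k, cumulative


# Hypothetical toy data: three centered, mutually orthogonal columns
# with sums of squares 162, 18, and 2 (variance shares of about 89%, 10%, and 1%)
X = np.array([
    [9.0, 0.0, 0.0],
    [-9.0, 0.0, 0.0],
    [0.0, 3.0, 0.0],
    [0.0, -3.0, 0.0],
    [0.0, 0.0, 1.0],
    [0.0, 0.0, -1.0],
])
k, cumulative = components_for_variance(X, target=0.95)
print("components kept:", k)
print("cumulative variance:", np.round(cumulative, 3))
```

Here the first component alone explains about 89% of the variance, so two components are needed to pass the 95% threshold.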

## Determine Optimal Number of Clusters
Evaluated candidate numbers of clusters (k) using the Elbow Method (WCSS) for k from 1 to 10 and the Silhouette Score for k from 2 to 10, since the silhouette is undefined for a single cluster. Visualizations were generated to aid in selecting the best k.

In [None]:
import matplotlib.pyplot as plt
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

try:
    # Parameters from the plan
    max_k_for_evaluation = 10

    if df_pca.empty:
        raise ValueError("df_pca is empty. Ensure previous steps ran successfully.")

    wcss = []
    silhouette_scores = []
    k_range = range(1, max_k_for_evaluation + 1)

    for k in k_range:
        kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)  # n_init for robust initialization
        kmeans.fit(df_pca)
        wcss.append(kmeans.inertia_)
        if k > 1:  # Silhouette score requires at least 2 clusters
            silhouette_scores.append(silhouette_score(df_pca, kmeans.labels_))
        else:
            silhouette_scores.append(np.nan)  # Placeholder for k=1

    # 4.1 Apply the Elbow Method
    plt.figure(figsize=(10, 6))
    plt.plot(k_range, wcss, marker='o', linestyle='--')
    plt.title('Elbow Method for Optimal K (WCSS)')
    plt.xlabel('Number of Clusters (K)')
    plt.ylabel('Within-Cluster Sum of Squares (WCSS)')
    plt.xticks(k_range)
    plt.grid(True)
    plt.show()

    # 4.2 Plot the Silhouette Score (k=1 is excluded)
    plt.figure(figsize=(10, 6))
    plt.plot(k_range[1:], silhouette_scores[1:], marker='o', linestyle='--')
    plt.title('Silhouette Score for Optimal K')
    plt.xlabel('Number of Clusters (K)')
    plt.ylabel('Silhouette Score')
    plt.xticks(k_range[1:])
    plt.grid(True)
    plt.show()

    # 4.3 Review the results and select a consensus 'k' value
    print("WCSS values:", wcss)
    print("Silhouette Scores (for k > 1):", silhouette_scores[1:])
    print("\nReview the plots above to determine the optimal number of clusters (k).")
    print("Look for the 'elbow' in the WCSS plot and the peak in the Silhouette Score plot.")
except Exception as e:
    print(f"Error during Optimal K determination: {e}")
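
For readers who want to see what the two curves measure, the following NumPy sketch computes WCSS and the mean silhouette directly from their definitions for a fixed labeling. The function names and the four toy points are assumptions for illustration; the notebook itself relies on KMeans.inertia_ and silhouette_score.

```python
import numpy as np


def wcss(X: np.ndarray, labels: np.ndarray) -> float:
    """Sum of squared distances from each point to the mean of its cluster."""
    total = 0.0
    for cluster in np.unique(labels):
        points = X[labels == cluster]
        total += np.sum((points - points.mean(axis=0)) ** 2)
    return float(total)


def mean_silhouette(X: np.ndarray, labels: np.ndarray) -> float:
    """Average of (b - a) / max(a, b) over all points (needs at least 2 clusters)."""
    distances = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    clusters = np.unique(labels)
    scores = np.zeros(len(X))
    for i in range(len(X)):
        same = labels == labels[i]
        n_same = int(same.sum())
        if n_same == 1:
            continue  # A singleton cluster scores 0 by convention
        # a: mean distance to the other members of the point's own cluster
        a = distances[i, same].sum() / (n_same - 1)
        # b: mean distance to the nearest other cluster
        b = min(distances[i, labels == c].mean() for c in clusters if c != labels[i])
        scores[i] = (b - a) / max(a, b)
    return float(scores.mean())


# Hypothetical toy data: two tight pairs of points, far apart
X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]])
labels = np.array([0, 0, 1, 1])
print("WCSS:", wcss(X, labels))
print("Mean silhouette:", round(mean_silhouette(X, labels), 3))
```

WCSS always decreases as k grows, which is why the elbow is read from the bend rather than the minimum, whereas the silhouette can rise or fall and is read from its peak.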

## Apply Clustering Algorithm
Applied K-Means clustering with 4 clusters and a random state of 42 to the preprocessed, PCA-transformed data. Assigned the resulting cluster labels back to the feature-engineered DataFrame (df_fe), which keeps the unscaled values for profiling.

In [None]:
from sklearn.cluster import KMeans

try:
    # Parameters from the plan
    n_clusters = 4
    random_state = 42
    clustering_algorithm = "kmeans"

    if df_pca.empty:
        raise ValueError("df_pca is empty. Ensure previous steps ran successfully.")

    if clustering_algorithm == 'kmeans':
        # 5.1 Initialize and fit the K-Means clustering model
        kmeans_model = KMeans(n_clusters=n_clusters, random_state=random_state, n_init=10)  # n_init for robust initialization
        kmeans_model.fit(df_pca)
        cluster_labels = kmeans_model.labels_

        # 5.2 Assign each customer to their respective cluster
        # Labels go on df_fe (unscaled features) for profiling; this relies on df_fe and df_pca sharing row order
        df_fe['cluster_label'] = cluster_labels

        print(f"K-Means clustering applied with {n_clusters} clusters.")
        print("Cluster labels added to df_fe. First 5 cluster labels:")
        print(df_fe['cluster_label'].head())
    else:
        print(f"Clustering algorithm '{clustering_algorithm}' is not supported in this step. Only 'kmeans' is implemented.")
except Exception as e:
    print(f"Error during Clustering Algorithm application: {e}")
    if 'df_fe' in locals() and 'cluster_label' in df_fe.columns:
        df_fe.drop(columns=['cluster_label'], inplace=True)  # Clean up if an error occurs
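
Assigning a raw label array to a DataFrame column works positionally, so it is only correct while both frames keep the same row order. As a hedged alternative, the sketch below wraps the labels in a Series carrying the index of the clustered data, so pandas aligns them by customer rather than by position. All names and values here are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical clustered features, indexed by customer id
features = pd.DataFrame({"pc_1": [0.1, 5.2, 0.3]}, index=[101, 205, 309])
labels = np.array([0, 1, 0])  # Stand-in for a fitted model's labels_, in the row order of features

# Hypothetical profiling frame holding the same customers in a different row order
profile = pd.DataFrame({"name": ["c", "a", "b"]}, index=[309, 101, 205])

# A Series with the features index is aligned on index values when assigned
profile["cluster_label"] = pd.Series(labels, index=features.index)
print(profile)
```

If the two frames are guaranteed to share the same index and order, as they do when every step preserves df_fe's index, the positional assignment gives the same result.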

## Cluster Profiling and Interpretation
Calculated the per-cluster mean of every numeric original and engineered feature. Visualized key feature means across clusters using bar plots to highlight distinguishing characteristics, and ran one-way ANOVA tests to check whether the cluster means differ significantly.

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import f_oneway

try:
    if 'cluster_label' not in df_fe.columns:
        raise ValueError("Cluster labels not found in df_fe. Ensure clustering step ran successfully.")

    # 6.1 Calculate the mean of every numeric original and engineered feature for each cluster
    # numeric_only=True skips text columns such as names, which would otherwise raise an error
    cluster_profiles = df_fe.groupby('cluster_label').mean(numeric_only=True).T
    print("\nCluster Profiles (Mean values per cluster):\n")
    print(cluster_profiles)

    # 6.2 Identify the distinguishing characteristics of each cluster
    # This is an interpretive step, but we can print some key insights.
    print("\nKey observations from cluster profiles (mean values):\n")
    spend_rows = [col for col in cluster_profiles.index if 'lifetime_spend_' in col]
    for i in cluster_profiles.columns:
        print(f"Cluster {i}:")
        # Example: Top 3 highest average spend categories
        top_spend_categories = cluster_profiles.loc[spend_rows, i].nlargest(3)
        print(f"  Top spend categories: {top_spend_categories.index.tolist()} (Avg: {top_spend_categories.values.round(2).tolist()})")
        # Example: Age, children, total spend
        print(f"  Avg Age: {cluster_profiles.loc['customer_age', i]:.1f}, Total Children: {cluster_profiles.loc['total_children_home', i]:.1f}, Total Spend: {cluster_profiles.loc['total_lifetime_spend', i]:.2f}")
        print(f"  Has Pet: {cluster_profiles.loc['has_pet', i]:.1f} (proportion), Promotion Engagement: {cluster_profiles.loc['promotion_engagement', i]:.2f}")
        print("\n")

    # 6.3 Visualize cluster profiles
    # Select a few key features for visualization
    features_to_visualize = [
        'customer_age', 'total_children_home', 'total_lifetime_spend',
        'lifetime_spend_groceries', 'lifetime_spend_electronics', 'lifetime_spend_alcohol_drinks',
        'promotion_engagement', 'distinct_stores_visited', 'number_complaints', 'has_pet'
    ]
    # Keep only the features that exist in df_fe
    features_to_visualize = [f for f in features_to_visualize if f in df_fe.columns]
    if 'gender_Male' in df_fe.columns:
        features_to_visualize.append('gender_Male')  # Add gender if one-hot encoded

    for feature in features_to_visualize:
        plt.figure(figsize=(8, 5))
        sns.barplot(x='cluster_label', y=feature, data=df_fe)
        plt.title(f'Mean {feature} by Cluster')
        plt.xlabel('Cluster Label')
        plt.ylabel(f'Mean {feature}')
        plt.grid(axis='y', linestyle='--', alpha=0.7)
        plt.show()

    # 6.4 Perform statistical tests (one-way ANOVA)
    print("\nPerforming ANOVA tests for key features across clusters:\n")
    for feature in features_to_visualize:
        if feature != 'cluster_label':  # Never run ANOVA on the label itself
            groups = [df_fe[feature][df_fe['cluster_label'] == i] for i in sorted(df_fe['cluster_label'].unique())]
            # Drop NaNs and skip groups that are empty afterwards
            groups = [g.dropna() for g in groups if not g.empty and not g.dropna().empty]
            if len(groups) > 1:
                f_stat, p_val = f_oneway(*groups)
                print(f"Feature: {feature}, F-statistic: {f_stat:.2f}, P-value: {p_val:.3f}")
                if p_val < 0.05:
                    print("  (Statistically significant difference between cluster means)")
                else:
                    print("  (No statistically significant difference between cluster means)")
            else:
                print(f"  Not enough groups for ANOVA for feature: {feature}")
except Exception as e:
    print(f"Error during Cluster Profiling and Interpretation: {e}")
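
The F-statistic reported by f_oneway is the ratio of between-group variance to within-group variance. The NumPy sketch below computes it by hand on three made-up groups so the arithmetic can be followed; the function name is hypothetical, and the p-value, which needs the F distribution, is left to scipy.

```python
import numpy as np


def anova_f_statistic(groups) -> float:
    """One-way ANOVA F = (SS_between / (k - 1)) / (SS_within / (N - k))."""
    groups = [np.asarray(g, dtype=float) for g in groups]
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = np.concatenate(groups).mean()
    # Variation of the group means around the grand mean, weighted by group size
    ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
    # Variation of the observations around their own group mean
    ss_within = sum(np.sum((g - g.mean()) ** 2) for g in groups)
    return float((ss_between / (k - 1)) / (ss_within / (n_total - k)))


# Hypothetical feature values for three clusters
cluster_values = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
f_stat = anova_f_statistic(cluster_values)
print("F-statistic:", round(f_stat, 2))
```

For these groups the between-group sum of squares is 26 on 2 degrees of freedom and the within-group sum of squares is 6 on 6 degrees of freedom, giving F = 13. Because the clusters were built from these same features, large F values are expected and should be read as descriptive rather than as independent confirmation.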

## Evaluation and Refinement
Evaluated the internal validity of the clusters using Silhouette Score, Davies-Bouldin Index, and Calinski-Harabasz Index. Provided guidance for assessing practical utility and potential refinement steps.

In [None]:
from sklearn.metrics import calinski_harabasz_score, davies_bouldin_score, silhouette_score

try:
    if 'cluster_label' not in df_fe.columns or df_pca.empty:
        raise ValueError("Cluster labels or PCA data not found. Ensure clustering and PCA steps ran successfully.")

    # The validity metrics below need more than one cluster
    n_clusters_actual = df_fe['cluster_label'].nunique()
    if n_clusters_actual <= 1:
        print("Cannot evaluate cluster validity with 1 or fewer clusters.")
    else:
        # 7.1 Evaluate the internal validity of the clusters
        # Use df_pca (the data used for clustering) and the assigned cluster labels
        silhouette_avg = silhouette_score(df_pca, df_fe['cluster_label'])
        davies_bouldin = davies_bouldin_score(df_pca, df_fe['cluster_label'])
        calinski_harabasz = calinski_harabasz_score(df_pca, df_fe['cluster_label'])

        print(f"\nCluster Internal Validity Metrics (for {n_clusters_actual} clusters):\n")
        print(f"Silhouette Score: {silhouette_avg:.3f} (Higher is better, range -1 to 1)")
        print(f"Davies-Bouldin Index: {davies_bouldin:.3f} (Lower is better, min 0)")
        print(f"Calinski-Harabasz Index: {calinski_harabasz:.3f} (Higher is better)")

    # 7.2 Assess the practical utility and business relevance of the segments
    print("\n--- Practical Utility and Business Relevance Assessment ---")
    print("Based on the cluster profiles from Step 6, consider the following:")
    print("1. Distinctness: Are the segments clearly different from each other?")
    print("2. Stability: Would these segments likely remain consistent over time?")
    print("3. Substantiality: Are the segments large enough to be worth targeting?")
    print("4. Accessibility: Can these segments be reached through marketing efforts?")
    print("5. Actionability: Can specific marketing strategies be developed for each segment?")
    print("Review the visualizations and descriptive statistics to answer these questions.")

    # 7.3 If segments are not well-defined, distinct, or actionable, consider iterating
    print("\n--- Refinement Considerations ---")
    print("If the current segmentation is not satisfactory, consider the following iterative steps:")
    print("a) Adjust feature engineering: Create new features or remove less impactful ones (Go back to Step 1).")
    print("b) Try different scaling methods: Experiment with MinMaxScaler instead of StandardScaler (Go back to Step 2).")
    print("c) Explore alternative clustering algorithms or parameters: e.g., DBSCAN, Agglomerative Clustering, or different K-Means parameters (Go back to Step 5).")
    print("d) Re-evaluate the optimal number of clusters: Revisit the Elbow and Silhouette plots (Go back to Step 4).")
    print("e) Consider different dimensionality reduction techniques or parameters (Go back to Step 3).")

    # 7.4 Document the final segmentation model
    print("\n--- Documentation Guidance ---")
    print("Once satisfied, document the final model including:")
    print("- Chosen features and their engineering logic.")
    print("- Preprocessing steps (encoding, scaling).")
    print("- Dimensionality reduction (if applied) and parameters.")
    print("- Optimal number of clusters and justification.")
    print("- Detailed profiles for each customer segment, including their names/descriptions.")
    print("- Evaluation metrics and business implications.")
except Exception as e:
    print(f"Error during Evaluation and Refinement: {e}")
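
To show what two of the validity metrics measure, here is a minimal NumPy sketch of the Davies-Bouldin and Calinski-Harabasz indices computed from their standard definitions on a toy labeling. The function names and data are assumptions for illustration; the notebook uses the scikit-learn implementations.

```python
import numpy as np


def davies_bouldin(X: np.ndarray, labels: np.ndarray) -> float:
    """Mean over clusters of the worst (scatter_i + scatter_j) / centroid_distance ratio."""
    clusters = np.unique(labels)
    centroids = np.array([X[labels == c].mean(axis=0) for c in clusters])
    # Scatter: average distance from a cluster's points to its centroid
    scatter = np.array([
        np.linalg.norm(X[labels == c] - centroids[i], axis=1).mean()
        for i, c in enumerate(clusters)
    ])
    worst = []
    for i in range(len(clusters)):
        ratios = [
            (scatter[i] + scatter[j]) / np.linalg.norm(centroids[i] - centroids[j])
            for j in range(len(clusters)) if j != i
        ]
        worst.append(max(ratios))
    return float(np.mean(worst))


def calinski_harabasz(X: np.ndarray, labels: np.ndarray) -> float:
    """Between-cluster dispersion over within-cluster dispersion, each per degree of freedom."""
    clusters = np.unique(labels)
    n, k = len(X), len(clusters)
    overall = X.mean(axis=0)
    between, within = 0.0, 0.0
    for c in clusters:
        points = X[labels == c]
        centroid = points.mean(axis=0)
        between += len(points) * np.sum((centroid - overall) ** 2)
        within += np.sum((points - centroid) ** 2)
    return float((between / (k - 1)) / (within / (n - k)))


# Hypothetical toy data: two tight, well-separated clusters
X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]])
labels = np.array([0, 0, 1, 1])
print("Davies-Bouldin:", round(davies_bouldin(X, labels), 3))
print("Calinski-Harabasz:", round(calinski_harabasz(X, labels), 1))
```

Tight, well-separated clusters like these give a Davies-Bouldin value near 0 and a large Calinski-Harabasz value; overlapping clusters push the two indices in the opposite directions.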
