# Customer Segmentation

Generated by Auto-Analysis Web App

## Initial Data Overview
Displays the first few rows, data types, and checks for missing values to get an initial understanding of the dataset structure and quality. This helps in identifying potential data cleaning and preprocessing steps.

In [None]:
# Core libraries for data handling, plotting, and clustering
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer

# NOTE: `df` is assumed to be loaded already; the loading step is not shown in this notebook.
print('DataFrame Head:')
print(df.head())

print('\nDataFrame Info:')
df.info()

print('\nMissing Values:')
print(df.isnull().sum())

## Calculate Age and Tenure
Extracts the customer's age from their birthdate and calculates customer tenure based on the `year_first_transaction` relative to a current year (e.g., 2024). These features provide insights into customer demographics and loyalty.

In [None]:
current_year = 2024  # Reference year for age and tenure

df['customer_birthdate'] = pd.to_datetime(df['customer_birthdate'])
df['age'] = current_year - df['customer_birthdate'].dt.year

# Replace non-positive or missing ages (e.g., future birthdates) with the mean of the valid ages
valid_age_mean = df.loc[df['age'] > 0, 'age'].mean()
df['age'] = df['age'].where(df['age'] > 0, valid_age_mean)

df['customer_tenure'] = current_year - df['year_first_transaction']
# Set tenure to 0 if year_first_transaction is missing or in the future
df['customer_tenure'] = df['customer_tenure'].where(df['customer_tenure'] >= 0, 0)

print('New features created: age, customer_tenure')
print(df[['customer_birthdate', 'age', 'year_first_transaction', 'customer_tenure']].head())
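
To make the handling of invalid values concrete, here is a minimal sketch on a hypothetical three-row frame (the data and the `toy` name are illustrative assumptions, not part of the dataset). It shows one way to fill implausible ages with the mean of the plausible ones and to floor negative tenure at zero.

```python
import pandas as pd

# Hypothetical toy data: a valid customer, a future birthdate, and a missing birthdate.
toy = pd.DataFrame({
    "customer_birthdate": ["1990-05-01", "2030-01-01", None],
    "year_first_transaction": [2015, 2026, 2020],
})
current_year = 2024

toy["customer_birthdate"] = pd.to_datetime(toy["customer_birthdate"])
age = current_year - toy["customer_birthdate"].dt.year  # 34, -6, NaN

# Average only over plausible ages so invalid rows do not skew the fill value.
valid_mean = age[age > 0].mean()
toy["age"] = age.where(age > 0, valid_mean)

# Negative tenure (first transaction in the future) is floored at 0.
tenure = current_year - toy["year_first_transaction"]  # 9, -2, 4
toy["customer_tenure"] = tenure.clip(lower=0)

print(toy[["age", "customer_tenure"]])
```

Whether the mean is a sensible fill value depends on how many rows are affected; for a large share of invalid birthdates, dropping or flagging those rows may be preferable.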

## Engineer Household & Encode Gender
Combines `kids_home` and `teens_home` into `total_children_home` to represent household composition. Also, one-hot encodes the `customer_gender` column to convert categorical data into a numerical format suitable for machine learning models.

In [None]:
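# Treat missing child counts as 0 before summing
df['total_children_home'] = df['kids_home'].fillna(0) + df['teens_home'].fillna(0)

# Convert gender to string for consistent one-hot encoding
df['customer_gender'] = df['customer_gender'].astype(str)
# dtype=int keeps the dummy columns numeric so the later numeric-feature selection retains them
df = pd.get_dummies(df, columns=['customer_gender'], prefix='gender', drop_first=True, dtype=int)

# The exact dummy column names depend on the gender values present in the data
gender_columns = [col for col in df.columns if col.startswith('gender_')]
print(f'New features created: total_children_home, {", ".join(gender_columns)}')
print(df[['kids_home', 'teens_home', 'total_children_home'] + gender_columns].head())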
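
As an illustration of what `drop_first=True` produces, the sketch below encodes a hypothetical gender column. The `"unknown"` label for missing values is an assumption made here for a predictable result; the notebook itself relies on `astype(str)`, whose output for missing values can vary across pandas versions.

```python
import pandas as pd

# Hypothetical toy column with one missing value.
toy = pd.DataFrame({"customer_gender": ["male", "female", None, "male"]})

# Assumption: label missing values explicitly instead of relying on astype(str).
toy["customer_gender"] = toy["customer_gender"].fillna("unknown")

# drop_first=True drops the alphabetically first category ("female"),
# which becomes the implicit baseline; dtype=int yields 0/1 instead of booleans.
encoded = pd.get_dummies(
    toy, columns=["customer_gender"], prefix="gender", drop_first=True, dtype=int
)
print(encoded.columns.tolist())
print(encoded)
```

A row with zeros in every `gender_*` column therefore belongs to the dropped baseline category, not to a missing value.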

## Aggregate Spending Features
Calculates `total_lifetime_spend` and the proportion of spending in various categories relative to the total spend. This helps in understanding overall spending power and specific spending habits across different product categories.

In [None]:
spend_columns = [
    'lifetime_spend_groceries', 'lifetime_spend_electronics',
    'lifetime_spend_vegetables', 'lifetime_spend_nonalcohol_drinks',
    'lifetime_spend_alcohol_drinks', 'lifetime_spend_meat',
    'lifetime_spend_fish', 'lifetime_spend_hygiene',
    'lifetime_spend_videogames', 'lifetime_spend_petfood'
]

# Impute missing spend values with the column mean before summing
for col in spend_columns:
    if df[col].isnull().any():
        df[col] = df[col].fillna(df[col].mean())

df['total_lifetime_spend'] = df[spend_columns].sum(axis=1)

# Share of total spend per category; the small epsilon avoids division by zero
for col in spend_columns:
    df[f'{col}_ratio'] = df[col] / (df['total_lifetime_spend'] + 1e-6)

print('New features created: total_lifetime_spend and spend ratios for each category')
print(df[['total_lifetime_spend'] + [col + '_ratio' for col in spend_columns[:3]]].head())
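
A small worked example may help show what the ratios look like, including the zero-spend edge case the epsilon guards against. The two-column frame below is hypothetical, and `DataFrame.div` is used here as a vectorized alternative to the loop.

```python
import pandas as pd

# Hypothetical toy data: one regular customer and one with no recorded spend.
toy = pd.DataFrame({
    "lifetime_spend_groceries": [80.0, 0.0],
    "lifetime_spend_electronics": [20.0, 0.0],
})
spend_columns = list(toy.columns)

toy["total_lifetime_spend"] = toy[spend_columns].sum(axis=1)

# Divide every spend column by the row total; epsilon keeps 0/0 from producing NaN.
ratios = (
    toy[spend_columns]
    .div(toy["total_lifetime_spend"] + 1e-6, axis=0)
    .add_suffix("_ratio")
)
print(ratios)
```

For customers with non-zero spend the ratios sum to approximately 1; for a zero-spend customer every ratio is 0 rather than undefined.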

## Select Features for Clustering
Identifies and selects relevant numerical features for clustering, dropping unique identifiers (`customer_id`, `customer_name`, `loyalty_card_number`), original categorical columns, and date/spend columns that have been engineered or aggregated.

In [None]:
features_to_drop = [
    'customer_id', 'customer_name', 'customer_birthdate',
    'year_first_transaction', 'loyalty_card_number',
    'kids_home', 'teens_home',
] + spend_columns  # Drop original spend columns as we have total and ratios

# Ensure only existing columns are dropped
features_to_drop = [col for col in features_to_drop if col in df.columns]
df_clustering = df.drop(columns=features_to_drop)

# Select numerical features for clustering.
# Note: a gender_nan dummy (created when gender is missing) is not filtered out by name here.
clustering_features = df_clustering.select_dtypes(include=np.number).columns.tolist()

# Drop any specific columns if they are not suitable (e.g., original longitude/latitude if not used directly)
# For this example, we keep latitude/longitude as they are numerical.
df_clustering = df_clustering[clustering_features]

print(f'Selected {len(df_clustering.columns)} features for clustering:')
print(df_clustering.columns.tolist())
print(df_clustering.head())
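
One detail worth checking at this step: `select_dtypes(include=np.number)` does not match boolean columns, and recent pandas versions return boolean dummies from `get_dummies` by default. The sketch below, on a hypothetical frame, shows the difference and one possible way to keep boolean columns.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with an integer, a boolean, and a text column.
toy = pd.DataFrame({
    "age": [30, 40],
    "gender_male": [True, False],
    "customer_name": ["a", "b"],
})

# np.number covers integers and floats, but not booleans.
numeric_only = toy.select_dtypes(include=np.number).columns.tolist()

# Adding "bool" to the include list keeps boolean dummy columns as well.
numeric_and_bool = toy.select_dtypes(include=[np.number, "bool"]).columns.tolist()

print(numeric_only)
print(numeric_and_bool)
```

Printing the selected feature list, as the cell above does, is a quick way to confirm that the gender dummies actually made it into the clustering input.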

## Impute Remaining Missing Values
Addresses any remaining missing values in the selected numerical features by imputing them with the mean. This ensures that the clustering algorithm can process the data without errors.

In [None]:
imputer = SimpleImputer(strategy='mean')

# Fit on the DataFrame for clustering and transform
df_clustering_imputed = pd.DataFrame(
    imputer.fit_transform(df_clustering),
    columns=df_clustering.columns,
    index=df_clustering.index
)

print('Missing values after imputation:')
print(df_clustering_imputed.isnull().sum().sum())
print(df_clustering_imputed.head())

## Scale Numerical Features
Applies `StandardScaler` to normalize the numerical features. This ensures that features with larger ranges do not disproportionately influence the clustering algorithm, which relies on distance calculations.

In [None]:
scaler = StandardScaler()
df_scaled = pd.DataFrame(
    scaler.fit_transform(df_clustering_imputed),
    columns=df_clustering_imputed.columns,
    index=df_clustering_imputed.index
)

print('Scaled data head:')
print(df_scaled.head())
print('Mean of scaled data (should be close to 0):')
print(df_scaled.mean().head())
print('Std Dev of scaled data (should be close to 1):')
print(df_scaled.std().head())
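
The printed standard deviation is close to 1 but not exactly 1, because `StandardScaler` divides by the population standard deviation (`ddof=0`) while pandas `std()` defaults to the sample standard deviation (`ddof=1`). A minimal sketch on hypothetical data:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy features on very different scales.
toy = pd.DataFrame({
    "age": [20.0, 30.0, 40.0, 50.0],
    "total_lifetime_spend": [100.0, 200.0, 300.0, 1000.0],
})

scaled = pd.DataFrame(StandardScaler().fit_transform(toy), columns=toy.columns)

pop_std = scaled.std(ddof=0)   # matches the scaler's convention: exactly 1
sample_std = scaled.std()      # pandas default ddof=1: sqrt(n / (n - 1))

print(pop_std)
print(sample_std)
```

With only four rows the gap is visible (about 1.155); on a dataset with thousands of customers it becomes negligible.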

## Apply PCA for Visualization/Reduction
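Reduces the dimensionality of the scaled data to two principal components using Principal Component Analysis (PCA). In this notebook the components are used only for visualization: K-Means is fit on the full scaled feature set, not on the PCA output. PCA is not required for K-Means, although clustering on a reduced representation can sometimes help by removing noise and multicollinearity.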

In [None]:
pca = PCA(n_components=2)  # For visualization
df_pca = pca.fit_transform(df_scaled)
df_pca = pd.DataFrame(
    data=df_pca,
    columns=['principal_component_1', 'principal_component_2'],
    index=df_scaled.index
)

print('PCA explained variance ratio:')
print(pca.explained_variance_ratio_)
print('PCA transformed data head:')
print(df_pca.head())
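
The explained variance ratio indicates how much of the data's structure a 2D plot can show. As a rough sketch, the cumulative ratio can also suggest how many components would be needed to retain a given share of variance. The synthetic data and the 90% threshold below are illustrative assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in for scaled features: 6 columns, one strongly correlated pair.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 6))
X[:, 3] = X[:, 0] + 0.1 * rng.normal(size=200)

# Fit PCA with all components to inspect the full variance spectrum.
pca_full = PCA().fit(X)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

# Smallest number of components whose cumulative ratio reaches 90% (assumed threshold).
n_for_90 = int(np.searchsorted(cumulative, 0.90) + 1)

print(cumulative.round(3))
print(f"Components needed for 90% variance: {n_for_90}")
```

If the first two components explain only a small share of the variance, overlapping clusters in the 2D plot do not necessarily mean the clusters overlap in the full feature space.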

## Elbow Method for Optimal K
Uses the elbow method to choose the number of clusters (K) for K-Means clustering. It plots the Within-Cluster Sum of Squares (WCSS) against different K values. Because WCSS generally keeps decreasing as K grows, the 'elbow point', where the decrease levels off, indicates a reasonable number of clusters.

In [None]:
wcss = []
max_k = 10  # Define a reasonable max K to test

for i in range(1, max_k + 1):
    kmeans = KMeans(n_clusters=i, init='k-means++', random_state=42, n_init=10)
    kmeans.fit(df_scaled)
    wcss.append(kmeans.inertia_)  # inertia_ is the WCSS for this K

plt.figure(figsize=(10, 6))
plt.plot(range(1, max_k + 1), wcss, marker='o', linestyle='--')
plt.title('Elbow Method for Optimal K')
plt.xlabel('Number of Clusters (K)')
plt.ylabel('WCSS (Within-Cluster Sum of Squares)')
plt.grid(True)
plt.show()

# Double quotes outside so the inner 'elbow' quotes do not end the string
print("Please observe the plot above to determine the optimal K. A common choice is where the 'elbow' appears (e.g., 3, 4, or 5).")
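
Reading an elbow off a plot can be subjective. One common complement, not used in the original notebook, is the silhouette score, which rewards compact and well-separated clusters. The sketch below uses synthetic blobs from `make_blobs` as a stand-in for the scaled customer data.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for df_scaled (assumption: 4 well-separated groups).
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.6, random_state=42)

# The silhouette score needs at least 2 clusters, so K starts at 2.
scores = {}
for k in range(2, 8):
    labels = KMeans(n_clusters=k, init="k-means++", n_init=10, random_state=42).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)

for k, score in scores.items():
    print(f"K={k}: silhouette={score:.3f}")
print(f"Highest silhouette score at K={best_k}")
```

The two criteria do not always agree; when they differ, the interpretability of the resulting segments is usually the deciding factor.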

## Perform K-Means Clustering
Applies the K-Means clustering algorithm with the chosen number of clusters (determined from the elbow method, here chosen as 4 for demonstration) to segment the customers.

In [None]:
n_clusters = 4  # Example: Replace with the K chosen from the Elbow Method
kmeans_model = KMeans(n_clusters=n_clusters, init='k-means++', random_state=42, n_init=10)
df['cluster'] = kmeans_model.fit_predict(df_scaled)

print(f'K-Means clustering performed with K={n_clusters}.')
print('Cluster distribution:')
print(df['cluster'].value_counts())
print(df[['customer_id', 'cluster']].head())

## Analyze Customer Segments
Uses the cluster labels added to the original dataframe in the previous step to calculate the mean of each clustering feature per cluster. This provides a statistical summary of each segment, allowing for interpretation and characterization.

In [None]:
segment_analysis = df.groupby('cluster')[clustering_features].mean()
print('Mean feature values for each cluster:')
print(segment_analysis)

# Also show value counts for categorical/discrete features if any were kept
# For example, if 'total_children_home' is important, we can analyze its distribution per cluster
print('\nTotal children home distribution per cluster:')
print(df.groupby('cluster')['total_children_home'].value_counts(normalize=True).unstack(fill_value=0))

print('\nGender distribution per cluster:')
if 'gender_male' in df.columns:
    print(
        df.groupby('cluster')['gender_male']
        .value_counts(normalize=True)
        .unstack(fill_value=0)
        .rename(columns={0: 'Female/Other', 1: 'Male'})
    )
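
The per-cluster means can also be recovered from the fitted model itself: K-Means centers live in scaled space, and `StandardScaler.inverse_transform` maps them back to original units. The sketch below demonstrates this on hypothetical two-group data; all names and values are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical data: a young low-spend group and an older high-spend group.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "age": np.concatenate([rng.normal(25, 2, 50), rng.normal(60, 2, 50)]),
    "total_lifetime_spend": np.concatenate([rng.normal(500, 50, 50), rng.normal(5000, 200, 50)]),
})

scaler = StandardScaler()
X = scaler.fit_transform(toy)
km = KMeans(n_clusters=2, n_init=10, random_state=42).fit(X)

# Map the cluster centers from scaled space back to original units.
centers = pd.DataFrame(scaler.inverse_transform(km.cluster_centers_), columns=toy.columns)

# For comparison: plain per-cluster means of the unscaled data.
group_means = toy.groupby(km.labels_).mean()

print(centers.round(1))
print(group_means.round(1))
```

Once K-Means has converged, both tables describe the same segment profiles, so either can be used to label clusters in business terms.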

## Visualize Clusters with PCA
Visualizes the customer segments in a 2D plot using the first two principal components, colored by cluster. This graphical representation helps to visually inspect the separation and distribution of the identified customer segments.

In [None]:
df_pca['cluster'] = df['cluster']  # Add cluster labels to PCA components

plt.figure(figsize=(12, 8))
sns.scatterplot(
    x='principal_component_1',
    y='principal_component_2',
    hue='cluster',
    palette='viridis',
    data=df_pca,
    s=100,
    alpha=0.7
)
plt.title(f'Customer Segments (K={n_clusters}) - PCA 2D Visualization')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.legend(title='Cluster')
plt.grid(True)
plt.show()

print('Visualization of clusters based on first two principal components displayed.')