<a href="https://colab.research.google.com/github/srJboca/segmentacion/blob/main/EN/3.%20Segmentation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Tutorial: Gas Customer Segmentation

## Introduction

This notebook segments customers using the data in `df_analisis.parquet`, which was prepared in the data exploration notebook. Segmentation groups customers with similar characteristics, which is useful for personalized marketing strategies, service management, and operational optimization.

**Objective:** Identify distinct customer segments based on their consumption and payment behavior.

**Techniques Used:**
1.  **Data Aggregation:** Transforming invoice-level data into customer-level features.
2.  **Principal Component Analysis (PCA):** Reducing the dimensionality of the data to facilitate visualization and clustering.
3.  **K-Means Clustering:** Grouping customers into K distinct segments.
4.  **Segment Profiling:** Analyzing the characteristics of each segment.

## 1. Environment Setup and Data Loading

### 1.1 Importing Libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings

from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

warnings.filterwarnings('ignore')
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (10, 6)

### 1.2 Downloading and Loading the Preprocessed DataFrame

We will use the `df_analisis.parquet` file.

In [None]:
!wget -N https://github.com/srJboca/segmentacion/raw/refs/heads/main/archivos/df_analisis.parquet
df_analysis = pd.read_parquet('df_analisis.parquet')

## 2. Quick Data Review
Let's recall the structure of the `df_analysis` DataFrame.

In [None]:
print("--- First 5 rows of df_analysis ---")
print(df_analysis.head())
print("\n--- Information of df_analysis ---")
df_analysis.info()

## 3. Aggregating Features at the Customer Level

To segment customers, we need features that describe each customer's behavior over time. We will group the data by `Numero de contrato` and calculate aggregate metrics (averages) of their consumption, costs, and payment behavior.

In [None]:
df_grouped = df_analysis.groupby('Numero de contrato').agg(
    Average_Consumption=('Consumo (m3)', 'mean'),
    Average_Consumption_Price=('Precio por Consumo', 'mean'),
    Average_Days_Issue_DueDate=('Dias_Emision_PagoOportuno', 'mean'),
    Average_Days_Reading_Issue=('Dias_Lectura_Emision', 'mean'),
    Average_Days_DueDate_ActualPayment=('Dias_PagoOportuno_PagoReal', 'mean'),
    Average_Late_Payment_Rate=('Mora', 'mean') # Late payment rate
).reset_index()

print("--- df_grouped (data aggregated by customer) ---")
print(df_grouped.head())
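
To make the aggregation concrete, here is a minimal sketch on a made-up four-invoice table. The column names mirror the ones above, but the values are invented. It also assumes `Mora` is a 0/1 flag, in which case its mean is the share of invoices paid late.

```python
import pandas as pd

# Toy invoice-level data (values are invented for illustration)
toy_invoices = pd.DataFrame({
    'Numero de contrato': [101, 101, 102, 102],
    'Consumo (m3)': [10.0, 14.0, 30.0, 50.0],
    'Mora': [0, 1, 0, 0],  # assumed 0/1 late-payment flag
})

# Named aggregation: one row per contract, one column per metric
toy_grouped = toy_invoices.groupby('Numero de contrato').agg(
    Average_Consumption=('Consumo (m3)', 'mean'),
    Average_Late_Payment_Rate=('Mora', 'mean'),
).reset_index()

print(toy_grouped)
```

Contract 101 ends up with an average consumption of 12.0 and a late payment rate of 0.5, because one of its two invoices was late.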

### 3.1 Incorporating 'Estrato Socioeconómico'

The socioeconomic stratum ('Estrato socioeconomico') is an important customer characteristic. `df_analysis` already has an 'Estrato' column, so we take a single value per contract from it and join that to the grouped DataFrame.

In [None]:
# Get the stratum for each contract (taking the first one, assuming it doesn't change)
df_strata = df_analysis.drop_duplicates(subset=['Numero de contrato'])[['Numero de contrato', 'Estrato']].copy()
# Rename 'Estrato' to 'Estrato socioeconomico' for clarity
df_strata.rename(columns={'Estrato': 'Estrato socioeconomico'}, inplace=True)

df_segmentation = pd.merge(df_grouped, df_strata, on='Numero de contrato', how='left')

print("--- df_segmentation with Estrato ---")
print(df_segmentation.head())
print("\n--- Information about df_segmentation ---")
df_segmentation.info()
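
Taking the first stratum per contract relies on the assumption that a contract never changes stratum. A quick way to test that assumption is to count the distinct strata per contract. The sketch below uses invented toy data with the same column names, not the real dataset.

```python
import pandas as pd

# Toy data: contract 102 appears with two different strata (invented values)
toy = pd.DataFrame({
    'Numero de contrato': [101, 101, 102, 102],
    'Estrato': ['Estrato 2', 'Estrato 2', 'Estrato 3', 'Estrato 4'],
})

# Number of distinct strata recorded for each contract
strata_per_contract = toy.groupby('Numero de contrato')['Estrato'].nunique()
inconsistent = strata_per_contract[strata_per_contract > 1].index.tolist()
print(f"Contracts with more than one stratum: {inconsistent}")
```

If the list is empty on the real data, keeping the first value is safe. Otherwise, a rule such as "most recent" or "most frequent" stratum may be preferable.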

## 4. Preprocessing for PCA and Clustering

### 4.1 Selecting Numerical Features and Handling NaNs
We will select the numerical features for PCA and clustering. The 'Estrato socioeconomico' column is categorical but ordinal, so we convert it to an integer (`Stratum_Num`) and include it as a feature; the original column is kept for profiling the segments later.
We will also remove rows with NaN values in the selected features.

In [None]:
# Convert 'Estrato socioeconomico' to numeric (ordinal)
if df_segmentation['Estrato socioeconomico'].dtype == 'object' or isinstance(df_segmentation['Estrato socioeconomico'].dtype, pd.CategoricalDtype):
    df_segmentation['Stratum_Num'] = df_segmentation['Estrato socioeconomico'].str.replace('Estrato ', '', regex=False).astype(int)
else:
    df_segmentation['Stratum_Num'] = df_segmentation['Estrato socioeconomico'].astype(int)

features_for_pca = [
    'Average_Consumption',
    'Average_Consumption_Price',
    'Average_Days_Issue_DueDate',
    'Average_Days_Reading_Issue',
    'Average_Days_DueDate_ActualPayment',
    'Average_Late_Payment_Rate',
    'Stratum_Num' # We include the numerical stratum
]
X = df_segmentation[features_for_pca].copy()

print(f"Shape before dropna: {X.shape}")
X.dropna(inplace=True) # Remove rows with NaNs in these features
print(f"Shape after dropna: {X.shape}")

print("\nMissing values after dropna:")
print(X.isnull().sum())
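
Note that `astype(int)` raises an error if any stratum is missing, because the conversion runs before `dropna()`. A more tolerant variant is sketched below. It assumes the labels look like `'Estrato 3'`, as the string replacement above implies, and the toy labels are invented.

```python
import pandas as pd

# Toy labels, including one missing value (invented for illustration)
labels = pd.Series(['Estrato 1', 'Estrato 3', None])

# Pull out the digits, then use the nullable Int64 dtype so the missing value survives
stratum_num = pd.to_numeric(
    labels.str.extract(r'(\d+)')[0], errors='coerce'
).astype('Int64')
print(stratum_num)
```

Rows with a missing stratum can then be removed by the same `dropna()` step that handles the other features.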

### 4.2 Feature Scaling
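PCA is sensitive to the scale of the features: a feature measured in large units (such as consumption in m3) would dominate the components over a feature with a much smaller range. Therefore, we will standardize the data to zero mean and unit variance.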

In [None]:
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

print("--- Scaled data (first 5 rows) ---")
print(pd.DataFrame(X_scaled, columns=X.columns).head())
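
As a small illustration of what standardization does, the sketch below applies `StandardScaler` to an invented 3x2 array and reproduces the result by hand. The array values are made up; only the formula matters.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two toy features on very different scales (invented values)
X_toy = np.array([[10.0, 0.1],
                  [20.0, 0.3],
                  [30.0, 0.5]])

X_toy_scaled = StandardScaler().fit_transform(X_toy)

# Same result by hand: subtract the column mean, divide by the population std (ddof=0)
X_manual = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)

print(X_toy_scaled.mean(axis=0))  # approximately 0 for each column
print(X_toy_scaled.std(axis=0))   # approximately 1 for each column
```

After scaling, both columns carry equal weight regardless of their original units.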

## 5. Principal Component Analysis (PCA)

We will reduce the data to 2 principal components, which are used both for visualization and as the input to K-Means.

In [None]:
pca = PCA(n_components=2) # Reduce to 2 components
X_pca = pca.fit_transform(X_scaled)

df_pca = pd.DataFrame(data=X_pca, columns=['principal_component_1', 'principal_component_2'])

print("--- Principal Components (first 5 rows) ---")
print(df_pca.head())

print(f"\nExplained variance by each component: {pca.explained_variance_ratio_}")
print(f"Total explained variance (2 components): {pca.explained_variance_ratio_.sum():.2f}")

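The total explained variance tells us what fraction of the variance in the scaled features is retained by the 2 principal components. If it is low, the 2-D plot and any clustering done in this space reflect only part of the structure in the data.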
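
One way to judge whether 2 components are enough is to fit PCA with all components and look at the cumulative explained variance, together with the component weights that show which features drive each component. The sketch below uses random synthetic data and hypothetical feature names (`f1` to `f4`), so its numbers say nothing about the real customers.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the scaled feature matrix (random data, not the real customers)
rng = np.random.default_rng(0)
toy_features = ['f1', 'f2', 'f3', 'f4']  # hypothetical feature names
X_toy = StandardScaler().fit_transform(rng.normal(size=(200, 4)))

# Fit PCA with all components to see how variance accumulates
pca_full = PCA().fit(X_toy)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)
print("Cumulative explained variance:", cumulative.round(2))

# Component weights: how much each original feature contributes to each component
weights = pd.DataFrame(
    pca_full.components_.T,
    index=toy_features,
    columns=[f'PC{i + 1}' for i in range(pca_full.n_components_)],
)
print(weights.round(2))
```

On the real data, the same two outputs would be obtained by fitting on `X_scaled` and using `features_for_pca` as the index.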

## 6. K-Means Clustering

### 6.1 Elbow Method for Optimal K
We will use the elbow method to help determine an appropriate number of clusters (K). We are looking for the point where adding more clusters does not significantly improve the Within-Cluster Sum of Squares (WCSS) or inertia.

In [None]:
inertia_list = []
k_range = range(1, 11)

for k in k_range:
    kmeans = KMeans(n_clusters=k, random_state=42, n_init='auto') # n_init='auto' is the default in recent versions
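    kmeans.fit(df_pca[['principal_component_1', 'principal_component_2']]) # PCA components only, so the cell stays correct if re-run after 'segment' is added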
    inertia_list.append(kmeans.inertia_)

plt.figure(figsize=(10, 6))
plt.plot(k_range, inertia_list, marker='o')
plt.title('Elbow Method for Finding Optimal K')
plt.xlabel('Number of Clusters (k)')
plt.ylabel('Inertia')
plt.xticks(k_range)
plt.grid(True)
plt.show()

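Observe the plot above. The "elbow" is the point where the decrease in inertia slows markedly, so adding more clusters yields diminishing returns. This point suggests a reasonable K value, though the choice often also depends on business considerations. For this tutorial, we will use K=4.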
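
Because the elbow can be hard to read, the silhouette score is a common complementary check: it is higher when clusters are compact and well separated. The sketch below is an illustration on synthetic points with four obvious groups, not on the customer data. The centers and noise level are invented.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic 2-D points around four well-separated centers (not the real data)
rng = np.random.default_rng(42)
centers = np.array([[0, 0], [10, 0], [0, 10], [10, 10]])
points = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in centers])

# The silhouette score needs at least 2 clusters, so start at k=2
scores = {}
for k in range(2, 8):
    labels = KMeans(n_clusters=k, random_state=42, n_init=10).fit_predict(points)
    scores[k] = silhouette_score(points, labels)

best_k = max(scores, key=scores.get)
print(f"Best k by silhouette: {best_k}")
```

On real customer data the peak is usually much less pronounced, so the score is best read alongside the elbow plot rather than as a replacement for it.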


### 6.2 Applying K-Means and Visualization
We apply K-Means with the chosen K (K=4) and visualize the clusters in the PCA space.

In [None]:
optimal_k = 4 # @param {type:"integer"} # Chosen from the elbow method or business analysis

kmeans_optimal = KMeans(n_clusters=optimal_k, random_state=42, n_init='auto')
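pc_cols = ['principal_component_1', 'principal_component_2']
# Fit on the component columns only, so re-running this cell does not feed 'segment' back in as a feature
df_pca['segment'] = kmeans_optimal.fit_predict(df_pca[pc_cols])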

plt.figure(figsize=(12, 8))
sns.scatterplot(x='principal_component_1', y='principal_component_2', hue='segment', data=df_pca, palette='viridis', s=100, alpha=0.7)
plt.title(f'Customer Clusters (k={optimal_k}) in PCA Space')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.legend(title='Customer Segment')
plt.grid(True)
plt.show()

print("--- df_pca with Segments ---")
print(df_pca.head())
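
Cluster centers found in PCA space are hard to interpret directly. One possible approach is to map them back to the original feature units by inverting the PCA projection and then the scaling. The sketch below uses random synthetic data with 5 hypothetical features; the names `scaler_toy`, `pca_toy`, and `kmeans_toy` are illustrative and separate from the objects fitted above.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the customer features (random data, 5 hypothetical features)
rng = np.random.default_rng(0)
X_toy = rng.normal(loc=50, scale=10, size=(300, 5))

scaler_toy = StandardScaler()
pca_toy = PCA(n_components=2)
X_toy_pca = pca_toy.fit_transform(scaler_toy.fit_transform(X_toy))

kmeans_toy = KMeans(n_clusters=4, random_state=42, n_init=10).fit(X_toy_pca)

# Undo PCA, then undo scaling, to express each centroid in original feature units
centroids_original = scaler_toy.inverse_transform(
    pca_toy.inverse_transform(kmeans_toy.cluster_centers_)
)
print(centroids_original.round(1))
```

The result is an approximation, since the 2 components do not retain all of the variance; averaging the original features per segment remains the more direct way to profile.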

## 7. Segment Profiling

Now that we have the segments, we need to understand what characterizes each one. We will join the segment labels back to the `df_segmentation` DataFrame (which contains the aggregated features and the stratum) and analyze the mean of the features for each segment.

In [None]:
# df_pca was built from X after dropna(), so it has the same length and row order as X
# Selecting df_segmentation rows by X.index keeps only those customers, in the same order
df_segmentation_cleaned = df_segmentation.loc[X.index].copy()
df_segmentation_cleaned['segment'] = df_pca['segment'].values

print("--- df_segmentation_cleaned with Segments ---")
print(df_segmentation_cleaned.head())
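
The `.values` in the assignment above matters: after `dropna()`, the index of `X` can have gaps, while `df_pca` has a fresh 0..n-1 index. The toy sketch below (invented values and column names) shows what goes wrong when a Series is assigned directly and pandas aligns on index labels.

```python
import numpy as np
import pandas as pd

# Toy customer table with a missing value in row 1 (invented values)
customers = pd.DataFrame({'contract': [101, 102, 103],
                          'avg_consumption': [12.0, np.nan, 40.0]})

X_toy = customers[['avg_consumption']].dropna()   # keeps index labels 0 and 2
labels = pd.DataFrame({'segment': [1, 0]})        # fresh index 0 and 1, like df_pca

cleaned = customers.loc[X_toy.index].copy()

# Assigning the Series directly aligns on index labels: row 2 finds no match and becomes NaN
misaligned = cleaned.assign(segment=labels['segment'])

# Assigning the raw values is positional, which is what we want here
aligned = cleaned.assign(segment=labels['segment'].to_numpy())

print(misaligned)
print(aligned)
```

An alternative design is to build `df_pca` with `index=X.index` from the start, so that label-based alignment works without extra care.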

### 7.1 Average Characteristics per Segment

In [None]:
segment_profiles = df_segmentation_cleaned.groupby('segment')[features_for_pca].mean()
print("--- Segment Profiles (Average Characteristics) ---")
print(segment_profiles)

# Visualizing the profiles
segment_profiles.T.plot(kind='bar', figsize=(15, 7))
plt.title('Average Characteristics per Customer Segment')
plt.ylabel('Average Value')
plt.xticks(rotation=45, ha='right')
plt.legend(title='Segment')
plt.tight_layout()
plt.show()
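
Because the features are measured in different units, raw averages plotted on one axis can be dominated by the largest-valued feature. A possible refinement is to express each segment mean as a z-score relative to the whole customer base. The sketch below uses an invented four-customer table with two of the feature names from above.

```python
import pandas as pd

# Toy customer-level table with segment labels (invented values)
toy = pd.DataFrame({
    'Average_Consumption': [10.0, 12.0, 50.0, 54.0],
    'Average_Late_Payment_Rate': [0.40, 0.60, 0.05, 0.15],
    'segment': [0, 0, 1, 1],
})
features = ['Average_Consumption', 'Average_Late_Payment_Rate']

segment_means = toy.groupby('segment')[features].mean()

# Express each segment mean as a z-score against the whole customer base
relative_profiles = (segment_means - toy[features].mean()) / toy[features].std(ddof=0)
print(relative_profiles.round(2))
```

Positive values mean the segment is above the overall average on that feature and negative values mean it is below, which makes the bars directly comparable across features.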

### 7.2 Stratum Distribution per Segment

In [None]:
stratum_segment_distribution = pd.crosstab(df_segmentation_cleaned['segment'], df_segmentation_cleaned['Estrato socioeconomico'], normalize='index') * 100
print("\n--- Percentage Distribution of Stratum per Segment ---")
print(stratum_segment_distribution)

stratum_segment_distribution.plot(kind='bar', stacked=True, figsize=(12, 7), colormap='viridis')
plt.title('Distribution of Socioeconomic Stratum by Segment')
plt.xlabel('Customer Segment')
plt.ylabel('Percentage of Customers (%)')
plt.xticks(rotation=0)
plt.legend(title='Stratum')
plt.tight_layout()
plt.show()

**Interpreting the Profiles:**
* **Average Characteristics:** Look at the bars for each segment. Which characteristics have high or low values in a particular segment? For example, one segment might have a high 'Average_Consumption' and a low 'Average_Late_Payment_Rate', indicating high-value customers with good payment behavior.
* **Stratum Distribution:** Is there a predominant stratum in each segment? This can help understand the socioeconomic composition of the groups.

**Naming the Segments (Example):**
Based on the analysis, you could assign descriptive names to each segment, such as:
* **Segment 0:** Low Consumption, Punctual Payers
* **Segment 1:** High Consumption, Moderate Late Payments, High Stratum
* Etc.
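
Once names are chosen, they can be attached with a simple dictionary lookup. The sketch below uses invented segment labels, and the names are the placeholder examples from the list above, not findings from the data.

```python
import pandas as pd

# Toy segment labels (invented); the names below are placeholders, not findings
toy = pd.DataFrame({'segment': [0, 1, 1, 0, 1]})
segment_names = {
    0: 'Low Consumption, Punctual Payers',
    1: 'High Consumption, Moderate Late Payments, High Stratum',
}

toy['segment_name'] = toy['segment'].map(segment_names)
print(toy['segment_name'].value_counts())
```

Any segment id missing from the dictionary would map to NaN, which makes an incomplete naming easy to spot.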

## 8. Conclusions and Next Steps

In this notebook, we have:
1.  Aggregated data at the customer level.
2.  Used PCA to reduce dimensionality and facilitate visualization.
3.  Applied K-Means to group customers into segments (K=4 in this example).
4.  Begun profiling the segments by analyzing their average characteristics and stratum distribution.

**Potential Next Steps:**
* **Segment Validation:** Use clustering metrics (like the Silhouette Coefficient) to evaluate the quality of the segments. Also, validate the stability of the segments with different K-Means initializations or data subsets.
* **Detailed Profiling:** Analyze more variables for each segment (e.g., customer tenure, contract type, if they were included).
* **Strategic Actions:** Define specific strategies for each segment (e.g., loyalty campaigns for high-value customers, late payment management programs for segments with high default rates, personalized offers based on consumption).
* **Monitoring:** Observe how segments evolve over time and whether customers move from one segment to another.
* **Try other algorithms:** Explore other clustering algorithms (e.g., DBSCAN, Agglomerative Clustering) if K-Means does not produce satisfactory results or if non-spherical cluster structures are suspected.
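
As a sketch of the stability check mentioned under segment validation, K-Means can be re-run with several seeds and the resulting labelings compared with the adjusted Rand index (ARI), where 1.0 means identical partitions. The example below uses synthetic, clearly separated points rather than the customer data, so its near-perfect agreement should not be expected in practice.

```python
import numpy as np
from itertools import combinations
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

# Synthetic 2-D points around four well-separated centers (not the real data)
rng = np.random.default_rng(0)
centers = np.array([[0, 0], [10, 0], [0, 10], [10, 10]])
points = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in centers])

# Re-run K-Means with several seeds and compare every pair of labelings
labelings = [
    KMeans(n_clusters=4, random_state=seed, n_init=10).fit_predict(points)
    for seed in range(5)
]
ari_scores = [adjusted_rand_score(a, b) for a, b in combinations(labelings, 2)]
print(f"Minimum pairwise ARI: {min(ari_scores):.3f}")
```

ARI ignores how clusters are numbered, so two runs that find the same groups under different labels still score 1.0.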

Customer segmentation is a powerful tool for better understanding the customer base and making more informed business decisions.