<a href="https://colab.research.google.com/github/srJboca/segmentacion/blob/main/EN/4.%20Prediction%20with%20PCA.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Tutorial: Late Payment Prediction with PCA

## Introduction

This notebook walks through building a model that predicts 'Mora' (late payment) for customer invoices. Before training a binary classifier, we apply Principal Component Analysis (PCA) to reduce the dimensionality of the feature set.

**Key Points of the Tutorial:**
1.  Preparing data for prediction.
2.  Applying PCA to reduce the dimensionality of the feature set.
3.  Training and evaluating a binary classification model using the principal components.
4.  A method for estimating the importance of the original features after applying PCA.

These steps are usually revisited several times while developing a machine learning model, because evaluation results feed back into data preparation and feature choices.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings

from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Silence library warnings to keep the notebook output readable
warnings.filterwarnings('ignore')
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (10, 6)

## 1. Downloading and Loading the Preprocessed DataFrame

We will use the `df_analisis.parquet` file.

In [None]:
!wget -N https://github.com/srJboca/segmentacion/raw/refs/heads/main/archivos/df_analisis.parquet
df_analysis = pd.read_parquet('df_analisis.parquet')

## 2. Quick Data Review
Let's recall the structure of the `df_analysis` DataFrame.

In [None]:
print("--- First 5 rows of df_analysis ---")
print(df_analysis.head())
print("\n--- df_analysis Information ---")
df_analysis.info()

In [None]:
# Keep the invoice identifier, the candidate features, and the target ('Mora')
df_filtered = df_analysis[[
    'Numero de factura',
    'Consumo (m3)',
    'Estrato',
    'Precio m3 (COP)',
    'Dias_Emision_PagoOportuno',
    'Dias_Lectura_Emision',
    'Dias_PagoOportuno_PagoReal',
    'Mora'
]].copy()

# Check the first few rows and info of the filtered dataframe
print("\n--- First 5 rows of the filtered DataFrame ---")
print(df_filtered.head())
print("\n--- Information of the filtered DataFrame ---")
df_filtered.info()

In [None]:
# Convert 'Estrato' to an ordinal integer:
# string or categorical values have the 'Estrato ' prefix stripped first
if df_filtered['Estrato'].dtype == 'object' or isinstance(df_filtered['Estrato'].dtype, pd.CategoricalDtype):
    df_filtered['Stratum_Num'] = df_filtered['Estrato'].str.replace('Estrato ', '', regex=False).astype(int)
else:
    df_filtered['Stratum_Num'] = df_filtered['Estrato'].astype(int)

features_for_pca = [
    'Consumo (m3)',
    'Stratum_Num',
    'Precio m3 (COP)',
    'Dias_Emision_PagoOportuno',
    'Dias_Lectura_Emision',
    'Dias_PagoOportuno_PagoReal',
]
X = df_filtered[features_for_pca].copy()

print(f"Shape before dropna: {X.shape}")
X.dropna(inplace=True) # Remove rows with NaNs in these features
print(f"Shape after dropna: {X.shape}")

print("\nMissing values after dropna:")
print(X.isnull().sum())
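
The conversion above calls `astype(int)` before `dropna`, so it assumes that `Estrato` has no missing values. The following is a minimal sketch of a variant that tolerates gaps by using the nullable `Int64` dtype from pandas. The `estrato_demo` series is made-up illustration data (an assumption), not taken from `df_analysis`.

```python
import pandas as pd

# Hypothetical sample values (an assumption); the real column lives in df_analysis.
estrato_demo = pd.Series(
    ['Estrato 1', 'Estrato 3', None, 'Estrato 6'], dtype='category'
)

# Strip the text prefix, then coerce to numbers; unparseable entries become NaN.
stripped = estrato_demo.astype(object).str.replace('Estrato ', '', regex=False)
stratum_num_demo = pd.to_numeric(stripped, errors='coerce').astype('Int64')

print(stratum_num_demo)
```

Rows with a missing stratum then survive the conversion and are removed later by the existing `dropna` step.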

## 3. Feature Scaling
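PCA looks for the directions of greatest variance, so a feature with a large numeric range would dominate the components if left unscaled. We therefore standardize each feature to zero mean and unit variance.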

In [None]:
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

print("--- Scaled data (first 5 rows) ---")
print(pd.DataFrame(X_scaled, columns=X.columns).head())
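
To illustrate why scaling matters for PCA, here is a small sketch on synthetic data (an assumption, not the tutorial's dataset). It builds two independent features on very different scales and compares the explained variance ratio with and without `StandardScaler`. The names `X_demo`, `ratio_raw`, and `ratio_scaled` are introduced only for this example.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption): two independent features
# measured on very different scales.
rng = np.random.default_rng(0)
small_scale = rng.normal(loc=3.0, scale=1.0, size=500)
large_scale = rng.normal(loc=2000.0, scale=300.0, size=500)
X_demo = np.column_stack([small_scale, large_scale])

# Without scaling, the large-scale feature dominates the first component.
ratio_raw = PCA(n_components=2).fit(X_demo).explained_variance_ratio_

# After standardization, both features contribute comparably.
X_demo_scaled = StandardScaler().fit_transform(X_demo)
ratio_scaled = PCA(n_components=2).fit(X_demo_scaled).explained_variance_ratio_

print(f"Explained variance ratio, raw data:    {ratio_raw}")
print(f"Explained variance ratio, scaled data: {ratio_scaled}")
```

On the raw data the first component captures almost all of the variance simply because one feature has larger units, not because it carries more information.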

## 4. Principal Component Analysis (PCA)

We reduce the six scaled features to 2 principal components, which can be plotted directly and are later used as the inputs to the classifier.

In [None]:
pca = PCA(n_components=2) # Reduce to 2 components
X_pca = pca.fit_transform(X_scaled)

df_pca = pd.DataFrame(data=X_pca, columns=['principal_component_1', 'principal_component_2'])

print("--- Principal Components (first 5 rows) ---")
print(df_pca.head())

print(f"\nExplained variance by each component: {pca.explained_variance_ratio_}")
print(f"Total explained variance (2 components): {pca.explained_variance_ratio_.sum():.2f}")
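
Two components are convenient for plotting, but they may not retain enough variance for modeling. One way to check, sketched here on synthetic data from `make_classification` (an assumption standing in for the six scaled features), is to fit PCA with all components and inspect the cumulative explained variance. The 90% `threshold` is a hypothetical choice for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the six scaled features (an assumption).
X_demo, _ = make_classification(
    n_samples=1000, n_features=6, n_informative=4, n_redundant=2, random_state=0
)
X_demo_scaled = StandardScaler().fit_transform(X_demo)

# Fit PCA with all components to see how variance accumulates.
pca_full = PCA().fit(X_demo_scaled)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)
for k, total in enumerate(cumulative, start=1):
    print(f"{k} component(s): {total:.2%} of variance")

# Smallest number of components that reaches the hypothetical threshold.
threshold = 0.90
n_needed = int(np.searchsorted(cumulative, threshold) + 1)
print(f"Components needed for {threshold:.0%}: {n_needed}")
```

scikit-learn can also make this choice directly: passing a float between 0 and 1 as `n_components` (with `svd_solver='full'`) keeps as many components as are needed to explain that fraction of the variance.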

In [None]:
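# Restore the original row index (rows with NaNs were dropped from X)
# so that 'Mora' joins onto the matching rows
df_pca = df_pca.set_index(X.index)
df_pca = df_pca.join(df_filtered[['Mora']])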

print("\n--- df_pca with 'Mora' column (first 5 rows) ---")
print(df_pca.head())

print("\n--- df_pca Information ---")
df_pca.info()

In [None]:
# Create the scatter plot
plt.figure(figsize=(10, 6))
sns.scatterplot(
    x='principal_component_1',
    y='principal_component_2',
    hue='Mora',  # Color the points according to the 'Mora' value (0 or 1)
    data=df_pca,
    palette='viridis', # Optional: change the color palette
    alpha=0.6 # Optional: adjust the transparency of the points
)

plt.title('PCA Scatter Plot: PCA1 vs PCA2 colored by Mora')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.grid(True)
plt.show()
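
The axes of this scatter plot are mixtures of the original features, and the introduction lists estimating original-feature importance as a key point. One common heuristic, sketched below and not necessarily the method the original notebook uses, reads the loadings from `pca.components_` and weights their absolute values by each component's explained variance ratio. The feature names are copied from `features_for_pca`, but the values are synthetic (an assumption).

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Names copied from features_for_pca; the values are synthetic stand-ins
# (an assumption), not the tutorial's data.
feature_names = [
    'Consumo (m3)', 'Stratum_Num', 'Precio m3 (COP)',
    'Dias_Emision_PagoOportuno', 'Dias_Lectura_Emision',
    'Dias_PagoOportuno_PagoReal',
]
X_demo, _ = make_classification(
    n_samples=1000, n_features=6, n_informative=4, n_redundant=2, random_state=0
)
X_demo_scaled = StandardScaler().fit_transform(X_demo)
pca_demo = PCA(n_components=2).fit(X_demo_scaled)

# Each row of components_ holds the weights of the original features
# in one principal component; transpose to get one row per feature.
loadings = pd.DataFrame(
    pca_demo.components_.T,
    index=feature_names,
    columns=['principal_component_1', 'principal_component_2'],
)

# Heuristic importance: absolute loadings weighted by explained variance ratio.
importance = (loadings.abs() * pca_demo.explained_variance_ratio_).sum(axis=1)

print(loadings.round(3))
print(importance.sort_values(ascending=False).round(3))
```

This score describes how strongly each feature shapes the retained components, not how predictive it is of 'Mora'. A feature can load heavily on a component that the classifier barely uses.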


In [None]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Define features (PCA components) and the target variable
X_model = df_pca[['principal_component_1', 'principal_component_2']]
y_model = df_pca['Mora']

# Split the data into training and testing sets,
# stratifying to maintain the 'Mora' proportion in both
X_train, X_test, y_train, y_test = train_test_split(
    X_model, y_model, test_size=0.3, random_state=42, stratify=y_model
)

print(f"Training set size: {X_train.shape[0]}")
print(f"Test set size: {X_test.shape[0]}")
print(f"Proportion of 'Mora' in training set: {y_train.value_counts(normalize=True)}")
print(f"Proportion of 'Mora' in test set: {y_test.value_counts(normalize=True)}")

# Initialize and train the Logistic Regression model
model = LogisticRegression(random_state=42)
model.fit(X_train, y_train)

# Make predictions on the test set
y_pred = model.predict(X_test)

# Evaluate the model
print("\n--- Model Evaluation ---")

accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.4f}")

print("\nClassification Report:")
print(classification_report(y_test, y_pred))

print("\nConfusion Matrix:")
print(confusion_matrix(y_test, y_pred))

# Optional: Visualize the decision boundaries (only works well for 2 components)
# You can run this part to see how the model separates the classes in the PCA space.

# Create a grid to plot the decision boundaries
x_min, x_max = X_model['principal_component_1'].min() - 0.5, X_model['principal_component_1'].max() + 0.5
y_min, y_max = X_model['principal_component_2'].min() - 0.5, X_model['principal_component_2'].max() + 0.5
xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.02),
                     np.arange(y_min, y_max, 0.02))

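# Predict the class for each point in the grid
# (a DataFrame with the training column names avoids a feature-name mismatch warning)
grid_points = pd.DataFrame(np.c_[xx.ravel(), yy.ravel()], columns=X_model.columns)
Z = model.predict(grid_points)
Z = Z.reshape(xx.shape)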

# Plot the decision boundaries and data points
plt.figure(figsize=(10, 6))
plt.contourf(xx, yy, Z, alpha=0.8, cmap='viridis') # Background colors for the regions

sns.scatterplot(
    x='principal_component_1',
    y='principal_component_2',
    hue='Mora',
    data=df_pca,
    palette='viridis',
    alpha=0.6,
    edgecolor='k', # Add border to points for better visibility
    s=50 # Adjust point size
)

plt.title('Classification Model Decision Boundaries with PCA')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.xlim(xx.min(), xx.max())
plt.ylim(yy.min(), yy.max())
plt.grid(True)
plt.show()
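
The cells above fit `StandardScaler` and `PCA` on all rows before calling `train_test_split`, so the test rows influence the transformation. One common alternative, sketched here and not the notebook's own method, wraps the three steps in a scikit-learn `Pipeline` so that the scaler and PCA are fit on the training split only. The data comes from `make_classification` as a synthetic stand-in (an assumption), and the 80/20 class weights are illustrative, not the real 'Mora' proportion.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in data (an assumption).
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=6, n_informative=4, n_redundant=2,
    weights=[0.8, 0.2], random_state=42,
)

# Split first, so the test rows never influence the scaler or PCA.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

# Scaling, PCA, and the classifier are all fit on the training split only.
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('pca', PCA(n_components=2)),
    ('clf', LogisticRegression(random_state=42)),
])
pipeline.fit(X_tr, y_tr)

test_accuracy = pipeline.score(X_te, y_te)
print(f"Test accuracy: {test_accuracy:.4f}")
```

Calling `pipeline.predict` on new rows applies the training-set scaling and projection automatically, which also keeps later scoring consistent with training.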



In [None]:
# Visualize the Confusion Matrix
cm = confusion_matrix(y_test, y_pred)

plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=['Not Late (0)', 'Late (1)'],
            yticklabels=['Not Late (0)', 'Late (1)'])
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix')
plt.show()

# Explanation of the Confusion Matrix
print("\n--- Explanation of the Confusion Matrix ---")
print("The confusion matrix shows the performance of our model on the test set.")
print("The rows represent the actual classes, and the columns represent the predicted classes.")
print("It has 4 cells:")
print(f"  - Top-left (True Negatives, TN): {cm[0, 0]} cases where the actual value was 0 (Not Late) and the model predicted 0 (Not Late).")
print(f"  - Top-right (False Positives, FP): {cm[0, 1]} cases where the actual value was 0 (Not Late) but the model predicted 1 (Late). These are 'Type I' errors.")
print(f"  - Bottom-left (False Negatives, FN): {cm[1, 0]} cases where the actual value was 1 (Late) but the model predicted 0 (Not Late). These are 'Type II' errors.")
print(f"  - Bottom-right (True Positives, TP): {cm[1, 1]} cases where the actual value was 1 (Late) and the model predicted 1 (Late).")

print(f"\nFrom these values, metrics are calculated such as:")
print(f"  - Accuracy = (TN + TP) / Total cases = ({cm[0, 0]} + {cm[1, 1]}) / {np.sum(cm):.0f} = {accuracy:.4f}")
print(f"  - Precision (for class 1, Late) = TP / (TP + FP) = {cm[1, 1]} / ({cm[1, 1]} + {cm[0, 1]}): Proportion of positive predictions (Late) that were correct.")
print(f"  - Recall (Sensitivity, for class 1, Late) = TP / (TP + FN) = {cm[1, 1]} / ({cm[1, 1]} + {cm[1, 0]}): Proportion of actual positive cases (Late) that were correctly identified.")
print(f"  - F1-Score (for class 1, Late): Harmonic mean of Precision and Recall, useful when there is a class imbalance.")
print(f"These metrics, especially Precision and Recall, give us a more detailed view of the model's performance, particularly in identifying cases of 'Mora', which may be the class of primary interest.")
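
The formulas printed above can be checked numerically. The sketch below uses a small hand-made pair of label arrays (an assumption, not the model's predictions), computes precision, recall, and F1 from the four cells of the confusion matrix, and compares them with the scikit-learn helpers.

```python
import numpy as np
from sklearn.metrics import (
    confusion_matrix, f1_score, precision_score, recall_score
)

# Hand-made labels for illustration (an assumption): 1 = Late, 0 = Not Late.
y_true_demo = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_pred_demo = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 0])

# For binary labels, ravel() returns the cells in the order TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()

accuracy_demo = (tn + tp) / (tn + fp + fn + tp)
precision_demo = tp / (tp + fp)
recall_demo = tp / (tp + fn)
f1_demo = 2 * precision_demo * recall_demo / (precision_demo + recall_demo)

print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")
print(f"Accuracy={accuracy_demo:.4f}, Precision={precision_demo:.4f}, "
      f"Recall={recall_demo:.4f}, F1={f1_demo:.4f}")

# The manual values agree with the scikit-learn helpers.
assert np.isclose(precision_demo, precision_score(y_true_demo, y_pred_demo))
assert np.isclose(recall_demo, recall_score(y_true_demo, y_pred_demo))
assert np.isclose(f1_demo, f1_score(y_true_demo, y_pred_demo))
```

Here accuracy is 0.70 while precision for the Late class is only 0.60, which shows how accuracy alone can hide weak performance on the class of interest.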
