[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/husseinlopez/diplomadoIA/blob/main/M1-4_Ejercicios_Limpieza.ipynb)

# Module 1: Introduction to Data Mining
## Hands-On Exercises in Data Cleaning and Preparation

**Diplomado en Inteligencia Artificial**  
Dr. Irvin Hussein López Nava  
CICESE - UABC

---

## Session objectives

1. **Identify and correct quality problems** in real-world datasets
2. **Handle missing values** with different imputation strategies
3. **Detect and treat outliers** without losing relevant information
4. **Apply dimensionality reduction techniques** (PCA, t-SNE)
5. **Select relevant features** with Filter and Wrapper methods
6. **Balance imbalanced classes** with over- and undersampling techniques

## Notebook structure

### Part 1: Data Cleaning
* Initial inspection and problem detection
* Handling missing values
* Identifying and treating outliers
* Transformations and scaling

### Part 2: Dimensionality Reduction
* Principal Component Analysis (PCA)
* t-SNE for nonlinear visualization
* Comparison of methods

### Part 3: Feature Selection
* Filter methods
* Wrapper methods
* Consensus across methods

### Part 4: Class Balancing
* Oversampling techniques (SMOTE, ADASYN)
* Undersampling techniques
* Visualizing the impact

---
## 0. Environment Setup

We import all the libraries needed for the full analysis.

In [None]:
# Data handling
import numpy as np
import pandas as pd
from scipy import stats

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots

# Plot settings
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")
%matplotlib inline

# pandas display settings
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)
pd.set_option('display.float_format', lambda x: '%.3f' % x)

# Reproducibility
np.random.seed(42)

# Silence warnings
import warnings
warnings.filterwarnings('ignore')

print("✓ Core libraries imported successfully")

In [None]:
# Preprocessing
from sklearn.preprocessing import (
    StandardScaler, MinMaxScaler, RobustScaler,
    LabelEncoder, OneHotEncoder, PowerTransformer
)
from sklearn.impute import SimpleImputer, KNNImputer

# Dimensionality reduction
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# Feature selection
from sklearn.feature_selection import (
    SelectKBest, chi2, f_classif, mutual_info_classif,
    RFE
)

# Models used for feature selection
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression

# Class balancing
from imblearn.over_sampling import SMOTE, ADASYN, BorderlineSMOTE, RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler
from imblearn.combine import SMOTETomek

# Datasets
from sklearn.datasets import load_breast_cancer, make_classification

print("✓ ML and preprocessing libraries imported successfully")

---
# Part 1: Data Cleaning

In this section we work with a dataset that exhibits common problems:
- Missing values
- Outliers
- Incompatible scales
- Incorrect data types
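
As a quick illustration of the four problems listed above, the sketch below runs one simple check for each on a tiny hand-made table. The frame `toy`, its column names, and the plausibility bounds are assumptions made for this example; they are not part of the notebook's dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy records; column names and values are illustrative only.
toy = pd.DataFrame({
    "height": [172.0, 1.65, 180.0, np.nan, 158.0],  # one value in metres, one missing
    "weight": ["70", "82", "64", "91", "250"],      # numbers stored as text
})

# Incorrect data type: convert text to numbers (unparseable entries would become NaN).
toy["weight"] = pd.to_numeric(toy["weight"], errors="coerce")

# Missing values: count them per column.
missing_per_column = toy.isnull().sum()

# Incompatible scales: heights below 3 are assumed to be metres rather than centimetres.
in_metres = toy["height"] < 3
toy.loc[in_metres, "height"] = toy.loc[in_metres, "height"] * 100

# Outliers: a crude plausibility range for adult weight in kg (assumed bounds).
implausible_weight = ~toy["weight"].between(30, 200)

print(toy.dtypes)
print(missing_per_column)
print(toy)
print(implausible_weight.tolist())
```

Each check is deliberately naive; the sections that follow apply more careful versions of the same ideas.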

## 1.1 Creating a Dataset with Realistic Problems

We will create a synthetic dataset that simulates medical data with typical quality problems.

In [None]:
def create_messy_health_dataset(n_samples=500):
    """
    Create a synthetic health dataset with realistic problems:
    - Missing values (MCAR, MAR, MNAR)
    - Outliers
    - Inconsistent scales
    - Recording errors
    """
    np.random.seed(42)
    
    # Base variables
    data = {
        'edad': np.random.normal(45, 15, n_samples).clip(18, 90),
        'peso': np.random.normal(70, 15, n_samples).clip(40, 150),
        'estatura': np.random.normal(165, 10, n_samples).clip(140, 200),
        'presion_sistolica': np.random.normal(120, 15, n_samples).clip(80, 200),
        'presion_diastolica': np.random.normal(80, 10, n_samples).clip(60, 120),
        'glucosa': np.random.normal(100, 20, n_samples).clip(70, 300),
        'colesterol': np.random.normal(200, 40, n_samples).clip(120, 350),
        'trigliceridos': np.random.normal(150, 50, n_samples).clip(50, 500),
        'frecuencia_cardiaca': np.random.normal(75, 10, n_samples).clip(50, 120),
    }
    
    df = pd.DataFrame(data)
    
    # Compute BMI (stored in the 'imc' column)
    df['imc'] = df['peso'] / ((df['estatura']/100) ** 2)
    
    # Categorical variables
    df['genero'] = np.random.choice(['M', 'F'], n_samples)
    df['fumador'] = np.random.choice(['Si', 'No', 'Exfumador'], n_samples, p=[0.2, 0.6, 0.2])
    df['diabetes'] = (df['glucosa'] > 126).astype(int)
    df['hipertension'] = (df['presion_sistolica'] > 140).astype(int)
    
    # Introduce missing values of different types
    
    # MCAR (Missing Completely At Random): 5% of 'edad'
    mcar_mask = np.random.random(n_samples) < 0.05
    df.loc[mcar_mask, 'edad'] = np.nan
    
    # MAR (Missing At Random): people with diabetes have more missing cholesterol values
    mar_mask = (df['diabetes'] == 1) & (np.random.random(n_samples) < 0.15)
    df.loc[mar_mask, 'colesterol'] = np.nan
    
    # MNAR (Missing Not At Random): high glucose values are more likely to be missing
    high_glucose = df['glucosa'] > df['glucosa'].quantile(0.75)
    mnar_mask = high_glucose & (np.random.random(n_samples) < 0.10)
    df.loc[mnar_mask, 'glucosa'] = np.nan
    
    # Additional missing values
    df.loc[np.random.random(n_samples) < 0.08, 'trigliceridos'] = np.nan
    df.loc[np.random.random(n_samples) < 0.03, 'frecuencia_cardiaca'] = np.nan
    
    # Introduce outliers
    
    # Extreme outliers (measurement errors)
    outlier_indices = np.random.choice(n_samples, size=10, replace=False)
    df.loc[outlier_indices[:3], 'peso'] = np.random.uniform(200, 250, 3)
    df.loc[outlier_indices[3:6], 'presion_sistolica'] = np.random.uniform(220, 280, 3)
    df.loc[outlier_indices[6:], 'glucosa'] = np.random.uniform(400, 600, 4)
    
    # Moderate outliers (real but unusual values)
    moderate_outliers = np.random.choice(n_samples, size=20, replace=False)
    df.loc[moderate_outliers, 'colesterol'] = np.random.uniform(300, 400, 20)
    
    # Introduce inconsistencies
    
    # Most heights are in cm; a few are recorded in metres
    error_indices = np.random.choice(n_samples, size=5, replace=False)
    df.loc[error_indices, 'estatura'] = df.loc[error_indices, 'estatura'] / 100
    
    # Compute the target variable (cardiovascular risk)
    risk_score = (
        (df['edad'] > 55).astype(int) * 2 +
        (df['imc'] > 30).astype(int) * 2 +
        df['diabetes'] * 3 +
        df['hipertension'] * 3 +
        (df['fumador'] == 'Si').astype(int) * 2 +
        (df['colesterol'] > 240).fillna(0).astype(int) * 2
    )
    
    # Binarize the risk score, flipping about 10% of the labels as noise
    noise = np.random.random(n_samples) < 0.1
    df['riesgo_alto'] = ((risk_score >= 6) != noise).astype(int)
    
    return df

# Create the dataset
df_health = create_messy_health_dataset(500)

print(f"Dataset created with {len(df_health)} observations and {len(df_health.columns)} variables")
print("\nFirst rows:")
df_health.head(10)

## 1.2 Initial Inspection

A first look at the structure and quality of the data.

In [None]:
def inspect_dataset(df):
    """
    Run a complete inspection of the dataset
    """
    print("="*80)
    print("GENERAL DATASET INSPECTION")
    print("="*80)
    
    print(f"\n📊 Dimensions: {df.shape[0]} rows × {df.shape[1]} columns")
    print(f"💾 Memory usage: {df.memory_usage(deep=True).sum() / 1024**2:.2f} MB")
    
    print("\n" + "="*80)
    print("DATA TYPES")
    print("="*80)
    print(df.dtypes)
    
    print("\n" + "="*80)
    print("MISSING VALUES")
    print("="*80)
    
    missing = df.isnull().sum()
    missing_pct = 100 * missing / len(df)
    missing_table = pd.DataFrame({
        'Column': missing.index,
        'Missing': missing.values,
        'Percent': missing_pct.values
    })
    missing_table = missing_table[missing_table['Missing'] > 0].sort_values('Percent', ascending=False)
    
    if len(missing_table) > 0:
        print(missing_table.to_string(index=False))
        print(f"\n⚠️  Total missing values: {missing.sum()} ({100*missing.sum()/(df.shape[0]*df.shape[1]):.2f}% of the dataset)")
    else:
        print("✓ No missing values")
    
    print("\n" + "="*80)
    print("DESCRIPTIVE STATISTICS (NUMERIC VARIABLES)")
    print("="*80)
    print(df.describe().T)
    
    print("\n" + "="*80)
    print("DISTRIBUTION OF CATEGORICAL VARIABLES")
    print("="*80)
    
    categorical_cols = df.select_dtypes(include=['object', 'category']).columns
    for col in categorical_cols:
        print(f"\n{col}:")
        print(df[col].value_counts())
        print(f"Unique values: {df[col].nunique()}")

inspect_dataset(df_health)

## 1.3 Visualizing Missing Values

Understanding the pattern of missing data is crucial for deciding how to handle it.
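
Before plotting, the same pattern can be summarized numerically. The sketch below, on a small hypothetical frame (`toy` and its columns are invented for illustration), computes the share of missing values per column, the number of incomplete rows, and how often two columns are missing together.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with missing values in columns "a" and "b" only.
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan, 5.0, 6.0],
    "b": [np.nan, 2.0, 3.0, np.nan, 5.0, 6.0],
    "c": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
})

is_missing = toy.isnull()

# Percentage of missing values per column.
pct_missing = 100 * is_missing.mean()

# Number of rows with at least one missing value.
rows_with_missing = int(is_missing.any(axis=1).sum())

# Rows where "a" and "b" are missing at the same time.
both_missing = int((is_missing["a"] & is_missing["b"]).sum())

print(pct_missing.round(1))
print("rows with missing values:", rows_with_missing)
print("rows missing both a and b:", both_missing)
```

Columns that tend to be missing together are a hint that the missingness is not purely random, which is what the correlation panel of the figure is meant to show.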

In [None]:
def visualize_missing_data(df):
    """
    Create a comprehensive set of missing-value visualizations
    """
    fig = plt.figure(figsize=(16, 12))
    gs = fig.add_gridspec(3, 2, hspace=0.3, wspace=0.3)
    
    # 1. Missing-value matrix
    ax1 = fig.add_subplot(gs[0, :])
    missing_matrix = df.isnull().astype(int)
    sns.heatmap(missing_matrix.T, cmap='YlOrRd', cbar=True, ax=ax1,
                yticklabels=df.columns, xticklabels=False)
    ax1.set_title('Missing-Value Matrix\n(Yellow = Present, Red = Missing)', 
                  fontsize=14, fontweight='bold')
    ax1.set_xlabel('Observations')
    
    # 2. Percentage of missing values per column
    ax2 = fig.add_subplot(gs[1, 0])
    missing_pct = 100 * df.isnull().sum() / len(df)
    missing_pct = missing_pct[missing_pct > 0].sort_values(ascending=True)
    
    if len(missing_pct) > 0:
        colors = ['#d62728' if x > 10 else '#ff7f0e' if x > 5 else '#2ca02c' for x in missing_pct]
        missing_pct.plot(kind='barh', ax=ax2, color=colors)
        ax2.set_xlabel('Percentage of missing values (%)')
        ax2.set_title('Missing Values per Variable', fontweight='bold')
        ax2.axvline(x=5, color='orange', linestyle='--', alpha=0.5, label='5%')
        ax2.axvline(x=10, color='red', linestyle='--', alpha=0.5, label='10%')
        ax2.legend()
        ax2.grid(axis='x', alpha=0.3)
    
    # 3. Number of missing values per row
    ax3 = fig.add_subplot(gs[1, 1])
    missing_per_row = df.isnull().sum(axis=1)
    missing_counts = missing_per_row.value_counts().sort_index()
    
    ax3.bar(missing_counts.index, missing_counts.values, color='steelblue', alpha=0.7)
    ax3.set_xlabel('Number of missing values')
    ax3.set_ylabel('Number of observations')
    ax3.set_title('Distribution of Missing Values per Row', fontweight='bold')
    ax3.grid(axis='y', alpha=0.3)
    
    # Add a text box with summary statistics
    total_rows_with_missing = (missing_per_row > 0).sum()
    ax3.text(0.95, 0.95, 
             f'Rows with missing values: {total_rows_with_missing}\n'
             f'Complete rows: {len(df) - total_rows_with_missing}',
             transform=ax3.transAxes, fontsize=10,
             verticalalignment='top', horizontalalignment='right',
             bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.5))
    
    # 4. Correlation between missingness indicators
    ax4 = fig.add_subplot(gs[2, :])
    missing_corr = df.isnull().corr()
    mask = np.triu(np.ones_like(missing_corr, dtype=bool), k=1)
    
    sns.heatmap(missing_corr, mask=mask, annot=True, fmt='.2f', cmap='coolwarm',
                center=0, ax=ax4, cbar_kws={'label': 'Correlation'})
    ax4.set_title('Correlation between Missing-Value Patterns\n'
                  '(High values suggest non-random missingness)', fontweight='bold')
    
    plt.suptitle('Comprehensive Missing-Value Analysis', 
                 fontsize=16, fontweight='bold', y=0.995)
    
    return fig

fig = visualize_missing_data(df_health)
plt.show()

## 1.4 Analysis of Missing-Value Patterns

Check whether the missing values look MCAR, MAR, or MNAR. Observed data can separate MCAR from MAR, but MNAR cannot be confirmed from the observed data alone.
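
To make the distinction concrete, the following sketch simulates two missingness mechanisms and compares the mean of a fully observed covariate between rows with and without a missing value. The variable `age`, the probabilities, and the helper `mean_gap` are assumptions chosen for illustration, not values taken from the notebook.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000
age = rng.normal(45, 15, size=n)  # fully observed covariate (hypothetical)

# MCAR: every row has the same 10% chance of losing its value in another column.
mcar_missing = rng.random(n) < 0.10

# MAR: the chance of a missing value depends on the observed covariate (age).
mar_missing = rng.random(n) < np.where(age > 45, 0.40, 0.05)

def mean_gap(values, missing):
    """Mean of `values` where the other column is missing minus the mean where it is present."""
    return values[missing].mean() - values[~missing].mean()

mcar_gap = mean_gap(age, mcar_missing)
mar_gap = mean_gap(age, mar_missing)

print(f"MCAR gap in mean age: {mcar_gap:+.2f}")
print(f"MAR gap in mean age:  {mar_gap:+.2f}")
```

Under MCAR the gap should stay close to zero, while under MAR the rows with missing values are clearly older on average. This is the comparison that a t-test on group means formalizes.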

In [None]:
# Detailed analysis of missing-value patterns
def analyze_missing_patterns(df):
    """
    Look for evidence on whether missing values are MCAR, MAR, or MNAR
    """
    print("="*80)
    print("MISSING-VALUE PATTERN ANALYSIS")
    print("="*80)
    
    # Columns that contain at least one missing value
    cols_with_missing = df.columns[df.isnull().any()].tolist()
    
    for col in cols_with_missing:
        print(f"\n{'='*80}")
        print(f"Variable: {col}")
        print(f"{'='*80}")
        
        missing_mask = df[col].isnull()
        
        # Compare rows where `col` is missing against rows where it is present
        numeric_cols = df.select_dtypes(include=[np.number]).columns
        numeric_cols = [c for c in numeric_cols if c != col]
        
        print("\nComparison of means (present → missing):")
        print("-" * 60)
        
        for other_col in numeric_cols[:5]:  # Limit to 5 columns to keep the output short
            if df[other_col].notna().sum() > 0:
                mean_missing = df.loc[missing_mask, other_col].mean()
                mean_present = df.loc[~missing_mask, other_col].mean()
                
                if pd.notna(mean_missing) and pd.notna(mean_present):
                    diff_pct = 100 * (mean_missing - mean_present) / mean_present
                    
                    # t-test for the difference in means
                    try:
                        t_stat, p_value = stats.ttest_ind(
                            df.loc[missing_mask, other_col].dropna(),
                            df.loc[~missing_mask, other_col].dropna()
                        )
                        significance = "***" if p_value < 0.001 else "**" if p_value < 0.01 else "*" if p_value < 0.05 else ""
                    except Exception:
                        p_value = np.nan
                        significance = ""
                    
                    print(f"{other_col:30s}: {mean_present:7.2f} → {mean_missing:7.2f} "
                          f"({diff_pct:+6.1f}%) p={p_value:.3f} {significance}")
    
    print("\n" + "="*80)
    print("INTERPRETATION:")
    print("="*80)
    print("* = p < 0.05  (statistically significant difference)")
    print("** = p < 0.01 (high significance)")
    print("*** = p < 0.001 (very high significance)")
    print("\nSignificant differences are evidence against MCAR and are consistent with MAR")
    print("No differences is consistent with MCAR, but MNAR cannot be ruled out from observed data")

analyze_missing_patterns(df_health)

## 1.4 Handling Missing Values

We will compare different imputation strategies.
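
One effect worth keeping in mind when comparing strategies is that filling with a single constant leaves the mean (or median) unchanged but shrinks the spread. The sketch below shows this on simulated values; it uses pandas `fillna` as a stand-in for `SimpleImputer`, and the data and the 20% missing rate are assumptions for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical measurements with about 20% of values removed completely at random.
values = pd.Series(rng.normal(100, 20, size=1000))
values_missing = values.copy()
values_missing[rng.random(1000) < 0.2] = np.nan

observed = values_missing.dropna()

# Constant-value imputation with the observed mean and the observed median.
mean_filled = values_missing.fillna(observed.mean())
median_filled = values_missing.fillna(observed.median())

print(f"observed:      mean={observed.mean():.2f}  std={observed.std():.2f}")
print(f"mean-filled:   mean={mean_filled.mean():.2f}  std={mean_filled.std():.2f}")
print(f"median-filled: mean={median_filled.mean():.2f}  std={median_filled.std():.2f}")
```

The filled series have a smaller standard deviation than the observed values because every imputed point sits at the center of the distribution. Neighbor-based methods such as KNN imputation are one way to reduce that distortion.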

In [None]:
def compare_imputation_methods(df, column):
    """
    Compare several imputation methods on one column
    """
    df_test = df.copy()
    missing_mask = df_test[column].isnull()
    original_values = df_test.loc[~missing_mask, column].copy()
    
    methods = {}
    
    # 1. Deletion
    methods['Deletion'] = df_test[column].dropna()
    
    # 2. Mean
    imputer_mean = SimpleImputer(strategy='mean')
    methods['Mean'] = pd.Series(
        imputer_mean.fit_transform(df_test[[column]]).ravel(),
        index=df_test.index
    )
    
    # 3. Median
    imputer_median = SimpleImputer(strategy='median')
    methods['Median'] = pd.Series(
        imputer_median.fit_transform(df_test[[column]]).ravel(),
        index=df_test.index
    )
    
    # 4. KNN imputer
    numeric_cols = df_test.select_dtypes(include=[np.number]).columns.tolist()
    if len(numeric_cols) > 1:
        imputer_knn = KNNImputer(n_neighbors=5)
        df_knn = df_test[numeric_cols].copy()
        imputed_knn = imputer_knn.fit_transform(df_knn)
        col_idx = numeric_cols.index(column)
        methods['KNN (k=5)'] = pd.Series(
            imputed_knn[:, col_idx],
            index=df_test.index
        )
    
    # Comparative visualization
    fig, axes = plt.subplots(2, 3, figsize=(18, 10))
    axes = axes.ravel()
    
    # Plot the observed (non-missing) values
    ax = axes[0]
    ax.hist(original_values, bins=30, alpha=0.7, color='gray', edgecolor='black')
    ax.axvline(original_values.mean(), color='red', linestyle='--', 
               linewidth=2, label=f'Mean: {original_values.mean():.2f}')
    ax.axvline(original_values.median(), color='blue', linestyle='--', 
               linewidth=2, label=f'Median: {original_values.median():.2f}')
    ax.set_title('Original Distribution\n(observed values only)', fontweight='bold')
    ax.set_xlabel(column)
    ax.set_ylabel('Frequency')
    ax.legend()
    ax.grid(alpha=0.3)
    
    # Plot each method
    for idx, (method_name, imputed_data) in enumerate(methods.items(), 1):
        if idx >= len(axes):
            break
        ax = axes[idx]
        ax.hist(original_values, bins=30, alpha=0.4, color='gray', label='Original', edgecolor='black')
        ax.hist(imputed_data.dropna(), bins=30, alpha=0.6, color='steelblue', label=method_name, edgecolor='black')
        mean_diff = imputed_data.mean() - original_values.mean()
        std_diff = imputed_data.std() - original_values.std()
        ax.set_title(f'{method_name}\nΔmean: {mean_diff:+.2f}, Δstd: {std_diff:+.2f}', fontweight='bold')
        ax.set_xlabel(column)
        ax.set_ylabel('Frequency')
        ax.legend()
        ax.grid(alpha=0.3)
    
    for idx in range(len(methods) + 1, len(axes)):
        axes[idx].axis('off')
    
    plt.suptitle(f'Comparison of Imputation Methods: {column}', fontsize=16, fontweight='bold')
    plt.tight_layout()
    
    # Statistics
    print("="*80)
    print(f"COMPARISON OF IMPUTATION METHODS: {column}")
    print("="*80)
    print(f"\nOriginal: N={len(original_values)}, Mean={original_values.mean():.2f}, Std={original_values.std():.2f}")
    for method_name, imputed_data in methods.items():
        print(f"{method_name}: N={len(imputed_data.dropna())}, Mean={imputed_data.mean():.2f}, Std={imputed_data.std():.2f}")
    
    return fig, methods

# Compare methods for glucose
fig, methods = compare_imputation_methods(df_health, 'glucosa')
plt.show()

In [None]:
# Apply KNN imputation
def apply_imputation(df, strategy='knn'):
    """
    Apply an imputation strategy to the whole dataset (only 'knn' is implemented)
    """
    df_imputed = df.copy()
    
    numeric_cols = df_imputed.select_dtypes(include=[np.number]).columns.tolist()
    categorical_cols = df_imputed.select_dtypes(include=['object', 'category']).columns.tolist()
    
    if strategy == 'knn':
        imputer_num = KNNImputer(n_neighbors=5)
        df_imputed[numeric_cols] = imputer_num.fit_transform(df_imputed[numeric_cols])
        
        for col in categorical_cols:
            if df_imputed[col].isnull().any():
                mode_value = df_imputed[col].mode()[0]
                # Assign back instead of a chained inplace fillna, which may not update the frame
                df_imputed[col] = df_imputed[col].fillna(mode_value)
    else:
        raise ValueError(f"Unsupported strategy: {strategy!r}")
    
    print(f"Imputation applied with strategy: {strategy}")
    print(f"Rows before: {len(df)} → Rows after: {len(df_imputed)}")
    print(f"Remaining missing values: {df_imputed.isnull().sum().sum()}")
    
    return df_imputed

df_health_imputed = apply_imputation(df_health, strategy='knn')

## 1.5 Outlier Detection and Treatment

We will identify outliers using several methods.
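
As a minimal sketch of two of these rules, the code below applies the IQR fences and a z-score threshold to simulated weights with one planted error. The data, the planted value of 300, and the thresholds (1.5 × IQR and |z| > 3) are assumptions for illustration; the thresholds are common conventions rather than fixed requirements.

```python
import numpy as np

rng = np.random.default_rng(0)

# 200 plausible weights plus one planted recording error at the end.
weights = np.append(rng.normal(70, 10, size=200), 300.0)

# IQR rule: flag values beyond 1.5 * IQR from the quartiles.
q1, q3 = np.percentile(weights, [25, 75])
iqr = q3 - q1
iqr_flags = (weights < q1 - 1.5 * iqr) | (weights > q3 + 1.5 * iqr)

# Z-score rule: flag values more than 3 standard deviations from the mean.
z_scores = np.abs((weights - weights.mean()) / weights.std())
z_flags = z_scores > 3

# Consensus: flagged by both rules.
consensus = iqr_flags & z_flags

print("IQR flags:", int(iqr_flags.sum()))
print("Z-score flags:", int(z_flags.sum()))
print("Consensus flags:", int(consensus.sum()))
```

Note that the z-score depends on the mean and standard deviation, which the outlier itself inflates, so with very small samples a single extreme value may never exceed the threshold. The IQR rule is less affected because quartiles are robust to a few extreme points.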

In [None]:
def detect_outliers_multiple_methods(df, column):
    """
    Detect outliers with several methods:
    1. IQR (Interquartile Range)
    2. Z-score
    3. Isolation Forest
    """
    from sklearn.ensemble import IsolationForest
    
    # Work only with non-null data
    data = df[column].dropna().values.reshape(-1, 1)
    indices_validos = df[column].dropna().index  # Index labels of the non-null rows
    
    outliers = {}
    
    # 1. IQR method
    Q1 = np.percentile(data, 25)
    Q3 = np.percentile(data, 75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    outliers['IQR'] = (data < lower_bound) | (data > upper_bound)
    
    # 2. Z-score
    z_scores = np.abs(stats.zscore(data))
    outliers['Z-score'] = z_scores > 3
    
    # 3. Isolation Forest
    iso_forest = IsolationForest(contamination=0.1, random_state=42)
    outliers['Isolation Forest'] = iso_forest.fit_predict(data) == -1
    
    # Visualization
    fig, axes = plt.subplots(2, 2, figsize=(16, 12))
    
    # Box plot
    ax = axes[0, 0]
    bp = ax.boxplot([data.ravel()], vert=True, patch_artist=True,
                     boxprops=dict(facecolor='lightblue', alpha=0.7),
                     medianprops=dict(color='red', linewidth=2))
    ax.axhline(lower_bound, color='orange', linestyle='--', label=f'IQR lower: {lower_bound:.2f}')
    ax.axhline(upper_bound, color='orange', linestyle='--', label=f'IQR upper: {upper_bound:.2f}')
    ax.set_ylabel(column)
    ax.set_title('Box Plot with IQR Limits', fontweight='bold')
    ax.legend()
    ax.grid(alpha=0.3)
    
    # Distribution with outliers
    ax = axes[0, 1]
    ax.hist(data, bins=50, alpha=0.6, color='steelblue', edgecolor='black')
    for method_name, is_outlier in outliers.items():
        outlier_values = data[is_outlier.ravel()]
        if len(outlier_values) > 0:
            ax.scatter(outlier_values, [0] * len(outlier_values), s=100, alpha=0.6, label=method_name)
    ax.set_xlabel(column)
    ax.set_ylabel('Frequency')
    ax.set_title('Distribution with Detected Outliers', fontweight='bold')
    ax.legend()
    ax.grid(alpha=0.3)
    
    # Z-scores
    ax = axes[1, 0]
    sorted_idx = np.argsort(data.ravel())
    ax.scatter(range(len(data)), z_scores[sorted_idx], alpha=0.5, s=20)
    ax.axhline(3, color='red', linestyle='--', label='Umbral Z=3')
    ax.set_xlabel('Observations (sorted)')
    ax.set_ylabel('|Z-score|')
    ax.set_title('Z-scores', fontweight='bold')
    ax.legend()
    ax.grid(alpha=0.3)
    
    # Comparison
    ax = axes[1, 1]
    method_names = list(outliers.keys())
    counts = [outliers[m].sum() for m in method_names]
    bars = ax.barh(method_names, counts, color=['#ff7f0e', '#2ca02c', '#d62728'])
    ax.set_xlabel('Number of outliers detected')
    ax.set_title('Method Comparison', fontweight='bold')
    ax.grid(axis='x', alpha=0.3)
    
    for bar, count in zip(bars, counts):
        width = bar.get_width()
        ax.text(width, bar.get_y() + bar.get_height()/2,
               f'{int(count)} ({100*count/len(data):.1f}%)',
               ha='left', va='center', fontweight='bold')
    
    plt.suptitle(f'Outlier Detection: {column}', fontsize=16, fontweight='bold')
    plt.tight_layout()
    
    # Consensus (at least 2 methods agree)
    # Sum the boolean arrays: each True counts as 1
    consensus_count = np.zeros(len(data), dtype=int)
    for method_array in outliers.values():
        consensus_count += method_array.ravel().astype(int)
    
    # Outliers flagged by at least 2 methods
    consensus_outliers_indices = consensus_count >= 2
    
    # Build a boolean mask covering the full DataFrame
    consensus_mask = pd.Series(False, index=df.index, dtype=bool)
    
    # Set True at the index labels flagged as outliers
    outlier_positions = indices_validos[consensus_outliers_indices]
    consensus_mask.loc[outlier_positions] = True
    
    print("="*80)
    print(f"OUTLIER DETECTION: {column}")
    print("="*80)
    print(f"\nTotal valid observations: {len(data)}")
    for method, is_outlier in outliers.items():
        n_outliers = is_outlier.sum()
        print(f"{method:20s}: {n_outliers:4d} ({100*n_outliers/len(data):5.2f}%)")
    print(f"\nConsensus (≥2 methods): {consensus_outliers_indices.sum()} ({100*consensus_outliers_indices.sum()/len(data):.2f}%)")
    print(f"Mask created with {len(consensus_mask)} elements, {consensus_mask.sum()} flagged as outliers")
    
    return fig, outliers, consensus_mask

# Detect outliers
fig_out, outliers_peso, consensus_peso = detect_outliers_multiple_methods(df_health_imputed, 'peso')
plt.show()

In [None]:
# Outlier treatment
def treat_outliers(df, column, method='cap', outlier_mask=None):
    """
    Treat outliers with one of several strategies
    
    Parameters:
    -----------
    df : DataFrame
        DataFrame holding the data
    column : str
        Name of the column to treat
    method : str
        Treatment method: 'remove', 'cap', 'transform'
    outlier_mask : boolean Series
        Mask marking the outliers (same index as df); only 'remove' uses it to select rows
    """
    df_treated = df.copy()
    original = df_treated[column].copy()
    
    if outlier_mask is None:
        # Without a mask, fall back to the IQR rule
        Q1 = df_treated[column].quantile(0.25)
        Q3 = df_treated[column].quantile(0.75)
        IQR = Q3 - Q1
        outlier_mask = (df_treated[column] < Q1 - 1.5*IQR) | (df_treated[column] > Q3 + 1.5*IQR)
    
    # Check that the mask has the right length
    if len(outlier_mask) != len(df_treated):
        raise ValueError(f"outlier_mask has length {len(outlier_mask)} but the DataFrame has {len(df_treated)} rows")
    
    n_outliers = outlier_mask.sum()
    print(f"Treating {n_outliers} outliers in '{column}' with method '{method}'")
    
    if method == 'remove':
        df_treated = df_treated[~outlier_mask]
        print(f"  → Rows removed: {n_outliers}")
        print(f"  → Rows remaining: {len(df_treated)}")
    elif method == 'cap':
        # Clips every value to the 5th-95th percentile range, not only the masked rows
        lower = df_treated[column].quantile(0.05)
        upper = df_treated[column].quantile(0.95)
        df_treated[column] = df_treated[column].clip(lower, upper)
        print(f"  → Values limited to [{lower:.2f}, {upper:.2f}]")
    elif method == 'transform':
        # Winsorization at the 5th and 95th percentiles (same result as 'cap')
        lower = df_treated[column].quantile(0.05)
        upper = df_treated[column].quantile(0.95)
        df_treated.loc[df_treated[column] < lower, column] = lower
        df_treated.loc[df_treated[column] > upper, column] = upper
        print(f"  → Outliers replaced with the 5th and 95th percentiles")
    
    # Visualization
    fig, axes = plt.subplots(1, 3, figsize=(18, 5))
    
    # Before
    axes[0].hist(original.dropna(), bins=50, alpha=0.7, color='red', edgecolor='black')
    axes[0].set_title('Before Treatment', fontweight='bold')
    axes[0].set_xlabel(column)
    axes[0].set_ylabel('Frequency')
    axes[0].grid(alpha=0.3)
    
    # After
    axes[1].hist(df_treated[column].dropna(), bins=50, alpha=0.7, color='green', edgecolor='black')
    axes[1].set_title(f'After ({method})', fontweight='bold')
    axes[1].set_xlabel(column)
    axes[1].set_ylabel('Frequency')
    axes[1].grid(alpha=0.3)
    
    # Boxplot comparison (both series are drawn for every method)
    if method != 'remove':
        box_labels = ['Before', 'After']
    else:
        box_labels = ['Before\n(all)', 'After\n(outliers removed)']
    axes[2].boxplot([original.dropna(), df_treated[column].dropna()], patch_artist=True)
    # set_xticklabels avoids the `labels` keyword, which matplotlib 3.9 renamed to `tick_labels`
    axes[2].set_xticklabels(box_labels)
    axes[2].set_title('Comparison', fontweight='bold')
    axes[2].set_ylabel(column)
    axes[2].grid(alpha=0.3)
    
    plt.tight_layout()
    
    return df_treated, fig

# Try different methods
print("\n" + "="*80)
print("EXAMPLE 1: 'cap' method (limit values)")
print("="*80)
df_peso_capped, fig_cap = treat_outliers(df_health_imputed, 'peso', method='cap', outlier_mask=consensus_peso)
plt.show()

print("\n" + "="*80)
print("EXAMPLE 2: 'remove' method (drop rows)")
print("="*80)
df_peso_removed, fig_remove = treat_outliers(df_health_imputed, 'peso', method='remove', outlier_mask=consensus_peso)
plt.show()

## 1.6 Scaling and Transformations

A comparison of different scaling methods.
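
The formulas behind three of these scalers are short enough to write by hand. The sketch below applies standardization, min-max scaling, and robust scaling (median and IQR) to a small hypothetical array with one outlier. It mirrors the default definitions used by `StandardScaler`, `MinMaxScaler`, and `RobustScaler`, but it is written in plain NumPy for illustration.

```python
import numpy as np

# Hypothetical measurements; 250 is an outlier.
x = np.array([50.0, 60.0, 65.0, 70.0, 75.0, 80.0, 250.0])

# Standardization: zero mean, unit variance.
standard = (x - x.mean()) / x.std()

# Min-max scaling: map the range onto [0, 1].
min_max = (x - x.min()) / (x.max() - x.min())

# Robust scaling: center on the median, divide by the interquartile range.
q1, q3 = np.percentile(x, [25, 75])
robust = (x - np.median(x)) / (q3 - q1)

print("standard:", np.round(standard, 2))
print("min-max: ", np.round(min_max, 2))
print("robust:  ", np.round(robust, 2))
```

With min-max scaling the single outlier compresses every other value into the bottom of the range, whereas robust scaling keeps the bulk of the data spread around zero. That is the main reason to prefer median-based scaling when outliers have not been treated yet.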

In [None]:
def compare_scaling_methods(df, columns=None):
    """
    Compare several scaling methods
    """
    if columns is None:
        columns = df.select_dtypes(include=[np.number]).columns[:4]
    
    df_subset = df[columns].copy()
    
    scalers = {
        'Original': None,
        'StandardScaler': StandardScaler(),
        'MinMaxScaler': MinMaxScaler(),
        'RobustScaler': RobustScaler(),
        'PowerTransformer': PowerTransformer(method='yeo-johnson')
    }
    
    scaled_data = {}
    for name, scaler in scalers.items():
        if scaler is None:
            scaled_data[name] = df_subset.values
        else:
            scaled_data[name] = scaler.fit_transform(df_subset)
    
    # Visualization
    fig, axes = plt.subplots(len(scalers), len(columns), figsize=(5*len(columns), 4*len(scalers)))
    if len(columns) == 1:
        axes = axes.reshape(-1, 1)
    
    for i, (method_name, data) in enumerate(scaled_data.items()):
        for j, col in enumerate(columns):
            ax = axes[i, j]
            ax.hist(data[:, j], bins=50, alpha=0.7, color='steelblue', edgecolor='black')
            mean = np.mean(data[:, j])
            std = np.std(data[:, j])
            if i == 0:
                ax.set_title(f'{col}\n{method_name}\nμ={mean:.2f}, σ={std:.2f}', fontweight='bold')
            else:
                ax.set_title(f'{method_name}\nμ={mean:.2f}, σ={std:.2f}', fontweight='bold')
            ax.axvline(mean, color='red', linestyle='--', linewidth=2, alpha=0.7)
            ax.grid(alpha=0.3)
    
    plt.suptitle('Comparison of Scaling Methods', fontsize=16, fontweight='bold')
    plt.tight_layout()
    return fig, scaled_data

cols_to_scale = ['edad', 'peso', 'presion_sistolica', 'glucosa']
fig_scale, scaled_results = compare_scaling_methods(df_health_imputed, cols_to_scale)
plt.show()

---
# Part 2: Dimensionality Reduction

We will explore techniques that reduce the number of variables while preserving as much information as possible.

## 2.1 Preparation: Breast Cancer Dataset

We will use the classic Wisconsin Breast Cancer dataset, which has 30 features.

In [None]:
# Load the dataset
cancer = load_breast_cancer()
X_cancer = pd.DataFrame(cancer.data, columns=cancer.feature_names)
y_cancer = cancer.target

print("="*80)
print("DATASET: Wisconsin Breast Cancer")
print("="*80)
print(f"\nDimensions: {X_cancer.shape}")
print(f"Classes: {np.unique(y_cancer, return_counts=True)}")
print("\nFirst features:")
print(X_cancer.columns.tolist()[:10])
print("...")

# Scale beforehand (needed for PCA and t-SNE)
scaler = StandardScaler()
X_cancer_scaled = scaler.fit_transform(X_cancer)
X_cancer_scaled_df = pd.DataFrame(X_cancer_scaled, columns=X_cancer.columns)

print("\n✓ Data scaled with StandardScaler")

---
# Part 3: Feature Selection

We will identify the most relevant features using Filter and Wrapper methods.

## 3.1 Filter Methods

Filter methods score the relevance of each feature independently of any model.
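
A minimal sketch of the filter idea follows: each feature gets a score computed on its own, and the top `k` are kept without training any model. The score used here is the absolute Pearson correlation with the label, chosen only because it is easy to write in NumPy; it stands in for the statistics imported earlier (`f_classif`, `chi2`, `mutual_info_classif`). The synthetic data and the choice `k = 2` are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 2000

# Hypothetical data: 5 features, of which only columns 0 and 3 drive the label.
X = rng.normal(size=(n_samples, 5))
noise = 0.3 * rng.normal(size=n_samples)
y = (X[:, 0] + 0.5 * X[:, 3] + noise > 0).astype(float)

# Filter score: absolute Pearson correlation between each feature and the label,
# computed one feature at a time.
scores = np.array([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])])

# Keep the k highest-scoring features.
k = 2
selected = np.argsort(scores)[::-1][:k]

print("scores:", np.round(scores, 3))
print("selected columns:", sorted(selected.tolist()))
```

Because every feature is scored in isolation, filter methods are fast but can miss features that only matter in combination with others, which is the gap Wrapper methods try to cover.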

## 2.2 Principal Component Analysis (PCA)

PCA finds orthogonal directions of maximum variance.
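
The statement above can be checked directly with a singular value decomposition of the centered data, which is one standard way to compute PCA. The sketch below uses hypothetical 2-D data stretched along the diagonal; the variable names are chosen for this example only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 2-D data: strong variation along the direction (1, 1), small noise elsewhere.
latent = rng.normal(size=(500, 1))
X = np.hstack([latent, latent]) + 0.1 * rng.normal(size=(500, 2))

# Center the data, then decompose it.
X_centered = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

components = Vt                                   # rows are the principal directions
explained_variance = S**2 / (len(X) - 1)          # variance along each direction
explained_ratio = explained_variance / explained_variance.sum()
projected = X_centered @ Vt.T                     # coordinates in the new basis

print("components:\n", np.round(components, 3))
print("explained variance ratio:", np.round(explained_ratio, 4))
print("orthonormal:", np.allclose(components @ components.T, np.eye(2)))
```

The first row of `components` points along the diagonal and captures almost all of the variance, while the second is perpendicular to it. The ratios correspond to what `explained_variance_ratio_` reports in scikit-learn's `PCA`.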

In [None]:
def perform_pca_analysis(X, y=None, feature_names=None):
    """
    Run a full PCA analysis with several visualizations
    """
    # Full PCA (all components)
    pca_full = PCA()
    X_pca_full = pca_full.fit_transform(X)
    
    explained_variance = pca_full.explained_variance_ratio_
    cumulative_variance = np.cumsum(explained_variance)
    
    # Find the number of components needed for 90%, 95%, and 99% of the variance
    n_90 = np.argmax(cumulative_variance >= 0.90) + 1
    n_95 = np.argmax(cumulative_variance >= 0.95) + 1
    n_99 = np.argmax(cumulative_variance >= 0.99) + 1
    
    print("="*80)
    print("PCA ANALYSIS")
    print("="*80)
    print(f"\nOriginal dimensions: {X.shape[1]}")
    print("\nComponents needed for:")
    print(f"  - 90% variance: {n_90} components")
    print(f"  - 95% variance: {n_95} components")
    print(f"  - 99% variance: {n_99} components")
    print()
    for n_first in (5, 10):
        # Guard against inputs with fewer than n_first components
        if n_first <= len(cumulative_variance):
            print(f"First {n_first} components explain: {cumulative_variance[n_first - 1]:.1%}")
    
    # Visualizations
    fig = plt.figure(figsize=(20, 12))
    gs = fig.add_gridspec(3, 3, hspace=0.3, wspace=0.3)
    
    # 1. Variance per component (scree plot)
    ax1 = fig.add_subplot(gs[0, 0])
    components = np.arange(1, min(21, len(explained_variance)+1))
    ax1.bar(components, explained_variance[:20], alpha=0.7, color='steelblue', edgecolor='black')
    ax1.set_xlabel('Principal Component', fontsize=12)
    ax1.set_ylabel('Explained Variance', fontsize=12)
    ax1.set_title('Scree Plot\n(First 20 components)', fontweight='bold', fontsize=13)
    ax1.grid(alpha=0.3, axis='y')
    ax1.set_xticks(components[::2])
    
    # 2. Cumulative variance
    ax2 = fig.add_subplot(gs[0, 1])
    ax2.plot(range(1, len(cumulative_variance)+1), cumulative_variance, 
             marker='o', linewidth=2, markersize=4, color='steelblue')
    ax2.axhline(y=0.90, color='green', linestyle='--', linewidth=2, label='90%', alpha=0.7)
    ax2.axhline(y=0.95, color='orange', linestyle='--', linewidth=2, label='95%', alpha=0.7)
    ax2.axhline(y=0.99, color='red', linestyle='--', linewidth=2, label='99%', alpha=0.7)
    ax2.axvline(x=n_95, color='orange', linestyle=':', alpha=0.5)
    ax2.set_xlabel('Number of Components', fontsize=12)
    ax2.set_ylabel('Cumulative Variance', fontsize=12)
    ax2.set_title('Cumulative Explained Variance', fontweight='bold', fontsize=13)
    ax2.legend(fontsize=10)
    ax2.grid(alpha=0.3)
    ax2.set_xlim(0, min(30, len(cumulative_variance)))
    
    # 3. Eigenvalues (Kaiser criterion; meaningful when the input is standardized)
    ax3 = fig.add_subplot(gs[0, 2])
    eigenvalues = pca_full.explained_variance_[:20]
    ax3.plot(range(1, len(eigenvalues)+1), eigenvalues, marker='s', 
             linewidth=2, markersize=6, color='darkred')
    ax3.axhline(y=1, color='black', linestyle='--', linewidth=2, label='Kaiser criterion (λ=1)', alpha=0.7)
    ax3.set_xlabel('Principal Component', fontsize=12)
    ax3.set_ylabel('Eigenvalue (λ)', fontsize=12)
    ax3.set_title('Eigenvalues\n(Kaiser: retain λ > 1)', fontweight='bold', fontsize=13)
    ax3.legend(fontsize=10)
    ax3.grid(alpha=0.3)
    n_kaiser = np.sum(pca_full.explained_variance_ > 1)
    ax3.text(0.98, 0.98, f'n={n_kaiser}', transform=ax3.transAxes,
             ha='right', va='top', fontsize=11, fontweight='bold',
             bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.8))
    
    # 4. 2D projection (PC1 vs PC2)
    ax4 = fig.add_subplot(gs[1, :2])
    if y is not None:
        scatter = ax4.scatter(X_pca_full[:, 0], X_pca_full[:, 1], 
                            c=y, cmap='RdYlGn', alpha=0.6, s=50, edgecolors='black', linewidth=0.5)
        cbar = plt.colorbar(scatter, ax=ax4)
        cbar.set_label('Class', fontsize=11)
    else:
        ax4.scatter(X_pca_full[:, 0], X_pca_full[:, 1], 
                   alpha=0.6, s=50, color='steelblue', edgecolors='black', linewidth=0.5)
    
    ax4.set_xlabel(f'PC1 ({explained_variance[0]:.1%} variance)', fontsize=12)
    ax4.set_ylabel(f'PC2 ({explained_variance[1]:.1%} variance)', fontsize=12)
    ax4.set_title(f'Projection onto the First 2 Components\n(Total: {explained_variance[0]+explained_variance[1]:.1%} variance)', 
                 fontweight='bold', fontsize=13)
    ax4.grid(alpha=0.3)
    ax4.axhline(0, color='k', linestyle='-', linewidth=0.5, alpha=0.3)
    ax4.axvline(0, color='k', linestyle='-', linewidth=0.5, alpha=0.3)
    
    # 5. Loadings PC1
    ax5 = fig.add_subplot(gs[1, 2])
    if feature_names is not None:
        loadings_pc1 = pd.Series(pca_full.components_[0], index=feature_names)
        top_loadings = pd.concat([loadings_pc1.nlargest(5), loadings_pc1.nsmallest(5)])
        colors = ['red' if x < 0 else 'green' for x in top_loadings.values]
        top_loadings.plot(kind='barh', ax=ax5, color=colors, alpha=0.7, edgecolor='black')
        ax5.set_xlabel('Loading', fontsize=11)
        ax5.set_title('Top Loadings PC1', fontweight='bold', fontsize=13)
        ax5.axvline(0, color='black', linewidth=1)
        ax5.grid(alpha=0.3, axis='x')
    
    # 6. Loadings PC2
    ax6 = fig.add_subplot(gs[2, 0])
    if feature_names is not None:
        loadings_pc2 = pd.Series(pca_full.components_[1], index=feature_names)
        top_loadings = pd.concat([loadings_pc2.nlargest(5), loadings_pc2.nsmallest(5)])
        colors = ['red' if x < 0 else 'green' for x in top_loadings.values]
        top_loadings.plot(kind='barh', ax=ax6, color=colors, alpha=0.7, edgecolor='black')
        ax6.set_xlabel('Loading', fontsize=11)
        ax6.set_title('Top Loadings PC2', fontweight='bold', fontsize=13)
        ax6.axvline(0, color='black', linewidth=1)
        ax6.grid(alpha=0.3, axis='x')
    
    # 7. Biplot (PC1 vs PC2 with variable vectors)
    ax7 = fig.add_subplot(gs[2, 1:])
    if y is not None:
        scatter = ax7.scatter(X_pca_full[:, 0], X_pca_full[:, 1], 
                            c=y, cmap='RdYlGn', alpha=0.3, s=30)
    else:
        ax7.scatter(X_pca_full[:, 0], X_pca_full[:, 1], alpha=0.3, s=30, color='gray')
    
    if feature_names is not None:
        # Draw variable vectors (only the 8 with the largest absolute loading on PC1)
        scale = 4
        top_features = np.argsort(np.abs(pca_full.components_[0]))[-8:]
        for i in top_features:
            ax7.arrow(0, 0, 
                     pca_full.components_[0, i]*scale, 
                     pca_full.components_[1, i]*scale,
                     head_width=0.1, head_length=0.1, fc='red', ec='red', alpha=0.6, linewidth=2)
            ax7.text(pca_full.components_[0, i]*scale*1.15, 
                    pca_full.components_[1, i]*scale*1.15,
                    feature_names[i], fontsize=9, ha='center', 
                    bbox=dict(boxstyle='round,pad=0.3', facecolor='yellow', alpha=0.7))
    
    ax7.set_xlabel(f'PC1 ({explained_variance[0]:.1%})', fontsize=12)
    ax7.set_ylabel(f'PC2 ({explained_variance[1]:.1%})', fontsize=12)
    ax7.set_title('Biplot (Observations + Variables)', fontweight='bold', fontsize=13)
    ax7.axhline(0, color='k', linestyle='-', linewidth=0.5, alpha=0.3)
    ax7.axvline(0, color='k', linestyle='-', linewidth=0.5, alpha=0.3)
    ax7.grid(alpha=0.3)
    
    plt.suptitle('Full Principal Component Analysis (PCA)', 
                fontsize=18, fontweight='bold', y=0.998)
    
    return pca_full, X_pca_full, fig

# Apply PCA
pca_model, X_cancer_pca, fig_pca = perform_pca_analysis(
    X_cancer_scaled, 
    y_cancer, 
    X_cancer.columns
)
plt.show()

## 2.3 Interactive 3D PCA Visualization

The first three principal components can be explored in an interactive 3D scatter plot.

In [None]:
def plot_pca_3d_interactive(X_pca, y=None, explained_variance=None):
    """
    Create an interactive 3D visualization of the PCA projection.
    """
    fig = go.Figure()
    
    if y is not None:
        # One trace per class (in this dataset, 0 = malignant and 1 = benign)
        for class_label in np.unique(y):
            mask = y == class_label
            class_name = 'Malignant' if class_label == 0 else 'Benign'
            color = 'red' if class_label == 0 else 'green'
            
            fig.add_trace(go.Scatter3d(
                x=X_pca[mask, 0],
                y=X_pca[mask, 1],
                z=X_pca[mask, 2],
                mode='markers',
                name=class_name,
                marker=dict(
                    size=5,
                    color=color,
                    opacity=0.6,
                    line=dict(color='black', width=0.5)
                ),
                text=[class_name] * mask.sum(),
                hovertemplate='<b>%{text}</b><br>PC1: %{x:.2f}<br>PC2: %{y:.2f}<br>PC3: %{z:.2f}<extra></extra>'
            ))
    else:
        fig.add_trace(go.Scatter3d(
            x=X_pca[:, 0],
            y=X_pca[:, 1],
            z=X_pca[:, 2],
            mode='markers',
            marker=dict(size=5, color='steelblue', opacity=0.6),
        ))
    
    # Axis labels
    if explained_variance is not None:
        xlabel = f'PC1 ({explained_variance[0]:.1%})'
        ylabel = f'PC2 ({explained_variance[1]:.1%})'
        zlabel = f'PC3 ({explained_variance[2]:.1%})'
        total_var = explained_variance[0] + explained_variance[1] + explained_variance[2]
        title = f'PCA 3D - Total Variance: {total_var:.1%}'
    else:
        xlabel, ylabel, zlabel = 'PC1', 'PC2', 'PC3'
        title = 'PCA 3D'
    
    fig.update_layout(
        title=dict(text=title, font=dict(size=20, color='black'), x=0.5, xanchor='center'),
        scene=dict(
            xaxis=dict(title=xlabel, backgroundcolor='rgb(230, 230,230)'),
            yaxis=dict(title=ylabel, backgroundcolor='rgb(230, 230,230)'),
            zaxis=dict(title=zlabel, backgroundcolor='rgb(230, 230,230)'),
        ),
        width=900,
        height=700,
        showlegend=True
    )
    
    return fig

# Create the 3D visualization
fig_3d = plot_pca_3d_interactive(
    X_cancer_pca, 
    y_cancer, 
    pca_model.explained_variance_ratio_
)
fig_3d.show()

## 2.4 t-SNE for Nonlinear Visualization

t-SNE (t-Distributed Stochastic Neighbor Embedding) preserves the local structure of the data: points that are close in the original space tend to stay close in the embedding.
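
As a minimal sketch of how this could look with scikit-learn, the block below embeds the standardized breast cancer data into two dimensions. The parameter values (`perplexity=30`, `init="pca"`, `random_state=42`) and the names `X_std` and `X_embedded` are illustrative assumptions, not settings taken from the notebook.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()
X_std = StandardScaler().fit_transform(data.data)

# perplexity roughly controls how many neighbors each point "pays attention" to
tsne = TSNE(n_components=2, perplexity=30, init="pca", random_state=42)
X_embedded = tsne.fit_transform(X_std)

print("Embedding shape:", X_embedded.shape)
```

Unlike PCA, the t-SNE axes have no direct interpretation, and distances between well-separated groups should not be read quantitatively. The picture can also change with the perplexity and the random seed, so it is worth trying a few values.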

---
# Part 3: Feature Selection

We will identify the most relevant features using Filter and Wrapper methods.

## 3.1 Filter Methods

Filter methods score the relevance of each feature independently of any model.

In [None]:
def apply_filter_methods(X, y, k=15):
    """
    Apply several filter methods for feature selection.
    """
    feature_names = X.columns if hasattr(X, 'columns') else [f'F{i}' for i in range(X.shape[1])]
    results = {}
    
    print("="*80)
    print("FILTER METHODS - FEATURE SELECTION")
    print("="*80)
    
    # 1. ANOVA F-test (for classification)
    print("\n1. Running ANOVA F-test...", end=' ')
    f_selector = SelectKBest(f_classif, k='all')
    f_selector.fit(X, y)
    results['F-test'] = pd.DataFrame({
        'Feature': feature_names,
        'Score': f_selector.scores_,
        'p-value': f_selector.pvalues_
    }).sort_values('Score', ascending=False)
    print("✓")
    
    # 2. Mutual Information
    print("2. Running Mutual Information...", end=' ')
    mi_selector = SelectKBest(mutual_info_classif, k='all')
    mi_selector.fit(X, y)
    results['Mutual Info'] = pd.DataFrame({
        'Feature': feature_names,
        'Score': mi_selector.scores_
    }).sort_values('Score', ascending=False)
    print("✓")
    
    # 3. Chi-squared (requires non-negative values)
    print("3. Running Chi-squared...", end=' ')
    # Rescale to [0, 1] for chi2; chi2 is meant for counts, so treat these scores as a heuristic
    from sklearn.preprocessing import MinMaxScaler
    scaler = MinMaxScaler()
    X_normalized = scaler.fit_transform(X)
    chi2_selector = SelectKBest(chi2, k='all')
    chi2_selector.fit(X_normalized, y)
    results['Chi-squared'] = pd.DataFrame({
        'Feature': feature_names,
        'Score': chi2_selector.scores_
    }).sort_values('Score', ascending=False)
    print("✓")
    
    # Visualization
    fig = plt.figure(figsize=(20, 6))
    
    for idx, (method_name, scores_df) in enumerate(results.items(), 1):
        ax = plt.subplot(1, 3, idx)
        top_features = scores_df.head(k)
        
        # Colors based on the min-max normalized score
        scores_norm = (top_features['Score'] - top_features['Score'].min()) / (top_features['Score'].max() - top_features['Score'].min())
        colors = plt.cm.RdYlGn(scores_norm)
        
        bars = ax.barh(range(len(top_features)), top_features['Score'].values, color=colors, edgecolor='black')
        ax.set_yticks(range(len(top_features)))
        ax.set_yticklabels(top_features['Feature'].values, fontsize=10)
        ax.invert_yaxis()
        ax.set_xlabel('Score', fontsize=12)
        ax.set_title(f'{method_name}\nTop {k} Features', fontweight='bold', fontsize=14)
        ax.grid(axis='x', alpha=0.3)
        
        # Annotate each bar with its score
        for i, (bar, score) in enumerate(zip(bars, top_features['Score'].values)):
            width = bar.get_width()
            ax.text(width, bar.get_y() + bar.get_height()/2,
                   f' {score:.2f}', ha='left', va='center', fontsize=9, fontweight='bold')
    
    plt.suptitle('Filter Methods: Feature Ranking', fontsize=18, fontweight='bold')
    plt.tight_layout()
    
    # Print the rankings
    print("\n" + "="*80)
    print("TOP 10 FEATURES BY METHOD")
    print("="*80)
    for method_name, scores_df in results.items():
        print(f"\n{method_name}:")
        print(scores_df.head(10)[['Feature', 'Score']].to_string(index=False))
    
    return results, fig

# Apply the filter methods
filter_results, fig_filter = apply_filter_methods(
    pd.DataFrame(X_cancer_scaled, columns=X_cancer.columns), 
    y_cancer, 
    k=15
)
plt.show()

### Correlation Among the Top Features

Features that rank highly can still be redundant with one another, so we check the pairwise correlations among the top-ranked ones.

In [None]:
def plot_feature_correlations(X, top_features=None, threshold=0.8):
    """
    Visualize correlations between features and detect redundancy.
    """
    if top_features is not None:
        X_subset = X[top_features]
    else:
        X_subset = X
    
    # Compute the correlation matrix
    corr_matrix = X_subset.corr()
    
    # Find highly correlated pairs
    high_corr_pairs = []
    for i in range(len(corr_matrix.columns)):
        for j in range(i+1, len(corr_matrix.columns)):
            if abs(corr_matrix.iloc[i, j]) > threshold:
                high_corr_pairs.append({
                    'Feature1': corr_matrix.columns[i],
                    'Feature2': corr_matrix.columns[j],
                    'Correlation': corr_matrix.iloc[i, j]
                })
    
    # Visualization
    fig = plt.figure(figsize=(18, 14))
    
    # Full heatmap (lower triangle only)
    ax1 = plt.subplot(2, 1, 1)
    mask = np.triu(np.ones_like(corr_matrix, dtype=bool), k=1)
    sns.heatmap(corr_matrix, mask=mask, annot=True, fmt='.2f', 
                cmap='coolwarm', center=0, square=True,
                linewidths=0.5, cbar_kws={'label': 'Correlation'},
                ax=ax1, vmin=-1, vmax=1)
    ax1.set_title('Feature Correlation Matrix', fontweight='bold', fontsize=16)
    
    # Distribution of correlations
    ax2 = plt.subplot(2, 2, 3)
    corr_values = corr_matrix.values[np.triu_indices_from(corr_matrix.values, k=1)]
    ax2.hist(corr_values, bins=50, alpha=0.7, color='steelblue', edgecolor='black')
    ax2.axvline(threshold, color='red', linestyle='--', linewidth=2, label=f'Threshold: ±{threshold}')
    ax2.axvline(-threshold, color='red', linestyle='--', linewidth=2)
    ax2.set_xlabel('Correlation', fontsize=12)
    ax2.set_ylabel('Frequency', fontsize=12)
    ax2.set_title('Distribution of Correlations', fontweight='bold', fontsize=14)
    ax2.legend()
    ax2.grid(alpha=0.3)
    
    # Table of highly correlated feature pairs
    ax3 = plt.subplot(2, 2, 4)
    ax3.axis('tight')
    ax3.axis('off')
    
    if high_corr_pairs:
        df_high_corr = pd.DataFrame(high_corr_pairs)
        df_high_corr = df_high_corr.sort_values('Correlation', ascending=False, key=abs)
        
        table_data = []
        for _, row in df_high_corr.head(15).iterrows():
            table_data.append([
                row['Feature1'][:20],
                row['Feature2'][:20],
                f"{row['Correlation']:.3f}"
            ])
        
        table = ax3.table(cellText=table_data,
                         colLabels=['Feature 1', 'Feature 2', 'Corr'],
                         cellLoc='left',
                         loc='center',
                         colWidths=[0.4, 0.4, 0.2])
        table.auto_set_font_size(False)
        table.set_fontsize(9)
        table.scale(1, 2)
        
        # Color the header row
        for i in range(3):
            table[(0, i)].set_facecolor('#40466e')
            table[(0, i)].set_text_props(weight='bold', color='white')
        
        ax3.set_title(f'Highly Correlated Features (|r| > {threshold})\n{len(high_corr_pairs)} pairs found',
                     fontweight='bold', fontsize=14, pad=20)
    else:
        ax3.text(0.5, 0.5, f'No features with |r| > {threshold}',
                ha='center', va='center', fontsize=14, fontweight='bold')
    
    plt.tight_layout()
    
    print("="*80)
    print(f"CORRELATION ANALYSIS (threshold = {threshold})")
    print("="*80)
    print(f"\nTotal highly correlated pairs: {len(high_corr_pairs)}")
    if high_corr_pairs:
        print("\nTop 10 most correlated pairs:")
        df_high_corr = pd.DataFrame(high_corr_pairs).sort_values('Correlation', ascending=False, key=abs)
        print(df_high_corr.head(10).to_string(index=False))
    
    return fig, high_corr_pairs

# Analyze correlations among the top features from the F-test
top_15_features = filter_results['F-test'].head(15)['Feature'].tolist()
fig_corr, high_corr = plot_feature_correlations(
    pd.DataFrame(X_cancer_scaled, columns=X_cancer.columns),
    top_features=top_15_features,
    threshold=0.8
)
plt.show()

## 3.2 Wrapper Methods

Wrapper methods evaluate subsets of features by training models on them.

In [None]:
def apply_wrapper_methods(X, y, n_features_to_select=10):
    """
    Apply RFE (Recursive Feature Elimination) with several models.
    """
    # Use a pandas Index in both branches so the boolean mask rfe.support_ can index it
    feature_names = X.columns if hasattr(X, 'columns') else pd.Index([f'F{i}' for i in range(X.shape[1])])
    
    # Define the models
    models = {
        'Logistic Regression': LogisticRegression(max_iter=1000, random_state=42),
        'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
        'Gradient Boosting': GradientBoostingClassifier(n_estimators=100, random_state=42)
    }
    
    results = {}
    
    print("="*80)
    print("WRAPPER METHODS - RFE (Recursive Feature Elimination)")
    print("="*80)
    print(f"\nSelecting the top {n_features_to_select} features with each model...")
    
    for model_name, model in models.items():
        print(f"\n{model_name}...", end=' ')
        
        # RFE
        rfe = RFE(estimator=model, n_features_to_select=n_features_to_select, step=1)
        rfe.fit(X, y)
        
        # Store the results
        results[model_name] = {
            'selected': feature_names[rfe.support_].tolist(),
            'ranking': rfe.ranking_
        }
        
        print("✓")
        print(f"  Selected features: {results[model_name]['selected'][:5]}...")
    
    # Visualization
    fig = plt.figure(figsize=(20, 12))
    
    # 1. Ranking per model
    for idx, (model_name, result) in enumerate(results.items(), 1):
        ax = plt.subplot(2, 3, idx)
        
        ranking_df = pd.DataFrame({
            'Feature': feature_names,
            'Ranking': result['ranking']
        }).sort_values('Ranking')
        
        top_features = ranking_df.head(15)
        colors = ['green' if r == 1 else 'orange' if r <= 3 else 'red' 
                 for r in top_features['Ranking']]
        
        bars = ax.barh(range(len(top_features)), top_features['Ranking'].values, 
                      color=colors, alpha=0.7, edgecolor='black')
        ax.set_yticks(range(len(top_features)))
        ax.set_yticklabels(top_features['Feature'].values, fontsize=9)
        ax.invert_yaxis()
        ax.set_xlabel('Ranking (1 = best)', fontsize=11)
        ax.set_title(f'{model_name}\nTop 15 Features', fontweight='bold', fontsize=13)
        ax.grid(axis='x', alpha=0.3)
        
        # Vertical reference line at x = n_features_to_select
        ax.axvline(n_features_to_select, color='blue', linestyle='--', 
                  linewidth=2, alpha=0.5, label=f'Top {n_features_to_select}')
        ax.legend()
    
    # 2. Consensus panel (text summary of the set overlaps)
    ax4 = plt.subplot(2, 3, 4)
    ax4.axis('off')
    
    selected_sets = {name: set(result['selected']) for name, result in results.items()}
    
    # Intersections
    all_three = selected_sets['Logistic Regression'] & selected_sets['Random Forest'] & selected_sets['Gradient Boosting']
    lr_rf = (selected_sets['Logistic Regression'] & selected_sets['Random Forest']) - all_three
    lr_gb = (selected_sets['Logistic Regression'] & selected_sets['Gradient Boosting']) - all_three
    rf_gb = (selected_sets['Random Forest'] & selected_sets['Gradient Boosting']) - all_three
    
    only_lr = selected_sets['Logistic Regression'] - selected_sets['Random Forest'] - selected_sets['Gradient Boosting']
    only_rf = selected_sets['Random Forest'] - selected_sets['Logistic Regression'] - selected_sets['Gradient Boosting']
    only_gb = selected_sets['Gradient Boosting'] - selected_sets['Logistic Regression'] - selected_sets['Random Forest']
    
    # Text summary
    y_pos = 0.9
    ax4.text(0.5, y_pos, 'CONSENSUS AMONG MODELS', ha='center', fontsize=16, fontweight='bold')
    y_pos -= 0.1
    
    ax4.text(0.1, y_pos, f'🟢 All 3 models ({len(all_three)}):', fontsize=12, fontweight='bold')
    y_pos -= 0.05
    for feat in sorted(all_three):
        ax4.text(0.15, y_pos, f'• {feat}', fontsize=10)
        y_pos -= 0.04
    
    y_pos -= 0.03
    ax4.text(0.1, y_pos, '🟡 2 models:', fontsize=12, fontweight='bold')
    y_pos -= 0.05
    for feat in sorted(lr_rf | lr_gb | rf_gb):
        ax4.text(0.15, y_pos, f'• {feat}', fontsize=10)
        y_pos -= 0.04
        if y_pos < 0.1:
            break
    
    ax4.set_xlim(0, 1)
    ax4.set_ylim(0, 1)
    
    # 3. Selection heatmap
    ax5 = plt.subplot(2, 3, 5)
    selection_matrix = []
    model_names_list = list(results.keys())
    
    for model_name in model_names_list:
        row = [1 if feat in results[model_name]['selected'] else 0 
               for feat in feature_names]
        selection_matrix.append(row)
    
    selection_df = pd.DataFrame(selection_matrix, 
                               index=model_names_list,
                               columns=feature_names)
    
    # Sort features by how many models selected them
    feature_counts = selection_df.sum(axis=0)
    selection_df = selection_df[feature_counts.sort_values(ascending=False).index]
    
    sns.heatmap(selection_df.iloc[:, :20], annot=True, fmt='d', cmap='RdYlGn',
                cbar_kws={'label': 'Selected'}, ax=ax5,
                linewidths=0.5, vmin=0, vmax=1)
    ax5.set_title('Features Selected by Each Model\n(20 most frequently selected)', 
                 fontweight='bold', fontsize=13)
    ax5.set_xlabel('')
    ax5.set_ylabel('')
    
    # 4. Selection frequency
    ax6 = plt.subplot(2, 3, 6)
    feature_counts_sorted = feature_counts.sort_values(ascending=False).head(15)
    colors_freq = ['green' if c == 3 else 'orange' if c == 2 else 'red' 
                   for c in feature_counts_sorted.values]
    
    bars = ax6.barh(range(len(feature_counts_sorted)), feature_counts_sorted.values,
                   color=colors_freq, alpha=0.7, edgecolor='black')
    ax6.set_yticks(range(len(feature_counts_sorted)))
    ax6.set_yticklabels(feature_counts_sorted.index, fontsize=10)
    ax6.invert_yaxis()
    ax6.set_xlabel('Number of models that selected it', fontsize=11)
    ax6.set_title('Selection Frequency\nTop 15 Features', fontweight='bold', fontsize=13)
    ax6.set_xticks([0, 1, 2, 3])
    ax6.grid(axis='x', alpha=0.3)
    
    plt.suptitle('Wrapper Methods: RFE with Multiple Models', 
                fontsize=18, fontweight='bold')
    plt.tight_layout()
    
    return results, all_three, fig

# Apply RFE
wrapper_results, consensus_features, fig_wrapper = apply_wrapper_methods(
    pd.DataFrame(X_cancer_scaled, columns=X_cancer.columns),
    y_cancer,
    n_features_to_select=10
)
plt.show()

print("\n" + "="*80)
print("CONSENSUS FEATURES (selected by all 3 models):")
print("="*80)
for feat in sorted(consensus_features):
    print(f"  ✓ {feat}")

---
# Part 4: Class Balancing

We will handle class imbalance using over- and undersampling techniques.

## 4.1 Creating an Imbalanced Dataset

We will simulate a realistic scenario of severe imbalance.

## 4.2 Comparing Balancing Methods

We will compare different over- and undersampling techniques.

## 4.3 Visualizing the Impact of Balancing

We will see how each method affects the feature space.

---
# Summary and Conclusions

## ✅ What we have learned

### 1. Data Cleaning
* Missing values call for a careful look at the missingness mechanism (MCAR, MAR, MNAR)
* KNN imputation often outperforms simple strategies such as the mean or median when features are correlated
* Outliers should be investigated before they are removed
* Scaling is crucial for many algorithms, especially distance-based ones and PCA
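
The point about KNN imputation can be checked on a small synthetic example. The sketch below assumes two strongly correlated columns and values missing completely at random; the data, the 20% missing rate, and the helper `rmse` are all hypothetical choices for illustration.

```python
import numpy as np
from sklearn.impute import KNNImputer, SimpleImputer

rng = np.random.default_rng(42)

# Hypothetical data: the second column is almost a linear function of the first
x = rng.normal(50, 10, size=300)
X_true = np.column_stack([x, 2 * x + rng.normal(0, 1, size=300)])

# Hide about 20% of the second column completely at random (MCAR)
X_missing = X_true.copy()
mask = rng.random(300) < 0.2
X_missing[mask, 1] = np.nan

X_mean = SimpleImputer(strategy="mean").fit_transform(X_missing)
X_knn = KNNImputer(n_neighbors=5).fit_transform(X_missing)

def rmse(X_imputed):
    """Root mean squared error on the entries that were hidden."""
    return np.sqrt(np.mean((X_imputed[mask, 1] - X_true[mask, 1]) ** 2))

rmse_mean, rmse_knn = rmse(X_mean), rmse(X_knn)
print(f"RMSE mean imputation: {rmse_mean:.2f}")
print(f"RMSE KNN imputation:  {rmse_knn:.2f}")
```

KNN does well here because the observed column carries most of the information about the missing one. With weakly related features the advantage shrinks, so the comparison is worth repeating on the actual dataset.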

### 2. Dimensionality Reduction
* **PCA**: fast, interpretable, linear
  * Useful for actually reducing dimensionality
  * Preserves global variance
  
* **t-SNE**: slow, not interpretable, nonlinear
  * Excellent for visualization
  * Preserves local structure (clusters)

### 3. Feature Selection
* **Filter methods**: fast, but independent of the model
* **Wrapper methods**: slower, but tailored to the model
* **Consensus**: combining several methods makes the selection more robust
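
One simple way to build such a consensus is to intersect the top-ranked features of two scoring methods. The sketch below does this for the ANOVA F-test and mutual information on the breast cancer data; the cutoff `k = 10` and the variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import f_classif, mutual_info_classif
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()
X_std = StandardScaler().fit_transform(data.data)
y = data.target
names = np.array(data.feature_names)

k = 10  # hypothetical cutoff: keep the k best features per method

f_scores, _ = f_classif(X_std, y)
mi_scores = mutual_info_classif(X_std, y, random_state=42)

# argsort is ascending, so the last k indices are the k highest scores
top_f = set(names[np.argsort(f_scores)[-k:]])
top_mi = set(names[np.argsort(mi_scores)[-k:]])

consensus = sorted(top_f & top_mi)
print(f"{len(consensus)} features appear in both top-{k} lists:")
for name in consensus:
    print(" -", name)
```

Intersection is the strictest rule. A softer alternative is to keep features selected by a majority of methods, as the frequency plot in the wrapper section does.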

### 4. Class Balancing
* Severe imbalance biases models toward the majority class
* **SMOTE** generates synthetic examples by interpolating between minority-class neighbors
* **ADASYN** adapts the amount of synthesis to the local density
* **BorderlineSMOTE** focuses on examples near the class boundary
* **Random Oversampling** duplicates minority examples
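
The interpolation idea behind SMOTE can be sketched in a few lines. This is a simplified illustration and not the imbalanced-learn implementation; the function `smote_like_oversample` and the toy minority class are hypothetical names introduced here.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_like_oversample(X_min, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbors."""
    rng = np.random.default_rng(seed)
    # Ask for k + 1 neighbors because each point is its own nearest neighbor
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    _, neighbor_idx = nn.kneighbors(X_min)

    # Pick random minority samples and, for each, one of its k true neighbors
    base = rng.integers(0, len(X_min), size=n_new)
    chosen = neighbor_idx[base, rng.integers(1, k + 1, size=n_new)]

    # Place each new point at a random position on the segment between the two
    gap = rng.random((n_new, 1))
    return X_min[base] + gap * (X_min[chosen] - X_min[base])

rng = np.random.default_rng(1)
X_minority = rng.normal(loc=[2.0, 2.0], scale=0.5, size=(20, 2))  # hypothetical minority class
X_synthetic = smote_like_oversample(X_minority, n_new=80)
print("Synthetic samples:", X_synthetic.shape)
```

Because every synthetic point lies on a segment between two real minority points, the new samples stay inside the region the minority class already occupies. In practice, resampling should be applied only to the training split so that evaluation data remain untouched.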

## 🎯 Best Practices

1. **Understand the data** before cleaning it
2. **Document your preprocessing decisions**
3. **Validate the impact** of each transformation
4. **Do not remove data without investigating** first
5. **Scale before PCA** or any distance-based method
6. **Whether and how to balance** is a deliberate decision that depends on the problem
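
To see why scaling before PCA matters, the sketch below fits PCA on the raw breast cancer features and again inside a pipeline that standardizes first. Using `make_pipeline` here is an illustrative choice, not something the notebook prescribes.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, _ = load_breast_cancer(return_X_y=True)

# PCA on raw features: columns with large units dominate the variance
pca_raw = PCA(n_components=2).fit(X)

# PCA after standardization: every feature contributes on the same scale
pipeline = make_pipeline(StandardScaler(), PCA(n_components=2)).fit(X)
pca_scaled = pipeline.named_steps["pca"]

ratio_raw = pca_raw.explained_variance_ratio_[0]
ratio_scaled = pca_scaled.explained_variance_ratio_[0]
print(f"PC1 share without scaling: {ratio_raw:.1%}")
print(f"PC1 share with scaling:    {ratio_scaled:.1%}")
```

Without scaling, the first component mostly reflects the features measured in the largest units (the area-related columns), which says little about the structure of the data. Wrapping both steps in a pipeline also keeps the scaler from being fitted on test data by accident.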

## 📊 Key Results from This Notebook

From our experiments:
* Scaling brought the variables onto comparable scales
* PCA reduced the number of dimensions while preserving structure
* Filter and Wrapper methods identified consistent features
* SMOTE and its variants improved the representation of the minority class

## 🚀 Next Steps

In the following notebooks we will cover:
* Combining these techniques into complete pipelines
* Evaluating models with appropriate metrics
* Cross-validation and hyperparameter tuning
* Application to real classification problems

## 📚 References and Resources

* Scikit-learn Documentation: https://scikit-learn.org
* Imbalanced-learn: https://imbalanced-learn.org
* "Feature Engineering and Selection" - Kuhn & Johnson
* "Hands-On Machine Learning" - Aurélien Géron

---
## 💡 Additional Exercises (Optional)

Test your understanding:

### Exercise 1: A Different Dataset
Apply the cleaning techniques to another dataset:
* Wine Quality
* Iris
* Digits

### Exercise 2: Parameters
Experiment by changing:
* The number of neighbors in the KNN Imputer
* The percentiles used for outlier detection
* The number of components in PCA
* The perplexity in t-SNE

### Exercise 3: Comparative Analysis
Compare:
* Different imputation strategies on the same dataset
* PCA versus feature selection for dimensionality reduction
* Different balancing methods on specific metrics

### Exercise 4: Build Your Own Dataset
Generate a synthetic dataset with:
* Specific patterns of missing values
* Controlled outliers
* A defined class imbalance
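
As a possible starting point for this exercise, the sketch below builds a synthetic table with a chosen imbalance, a few injected outliers, and two missingness patterns. All sizes, rates, and column names (`x0` to `x4`, `target`) are arbitrary assumptions that you can change.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification

rng = np.random.default_rng(7)

# Defined imbalance: roughly a 95% / 5% split between the two classes
X, y = make_classification(
    n_samples=1000, n_features=5, n_informative=3, n_redundant=0,
    weights=[0.95, 0.05], random_state=7,
)
df = pd.DataFrame(X, columns=[f"x{i}" for i in range(5)])
df["target"] = y

# Controlled outliers: push 10 random rows of x0 far away from the bulk
outlier_rows = rng.choice(len(df), size=10, replace=False)
df.loc[outlier_rows, "x0"] += 15 * df["x0"].std()

# MCAR pattern: about 10% of x1 removed completely at random
mcar_mask = rng.random(len(df)) < 0.10
df.loc[mcar_mask, "x1"] = np.nan

# MAR pattern: x2 goes missing more often when the observed x3 is large
mar_mask = (df["x3"] > df["x3"].quantile(0.8)) & (rng.random(len(df)) < 0.5)
df.loc[mar_mask, "x2"] = np.nan

print(df.isna().mean().round(3))
print(df["target"].value_counts(normalize=True).round(3))
```

Because you know exactly which rows were altered (`outlier_rows`, `mcar_mask`, `mar_mask`), you can measure how well each cleaning technique recovers the original values.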

---

**Excellent work completing this module!** 🎉

You have learned the fundamental data cleaning and preparation techniques that underpin every ML project.