EDA for the fe-dataset: a first pass at **feature selection**, using simple techniques to find the most important features.

In [5]:
import pandas as pd

In [6]:
df = pd.read_pickle('cleaned_dataset-fe.pkl')

In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2106 entries, 0 to 2105
Data columns (total 73 columns):
 #   Column                                                                                                    Non-Null Count  Dtype  
---  ------                                                                                                    --------------  -----  
 0   sl_contract_id                                                                                            2106 non-null   object 
 1   sie_ic_ruc                                                                                                2106 non-null   object 
 2   sie_ic_proveedor                                                                                          1512 non-null   object 
 3   sie_ic_monto_presupuestado                                                                                2106 non-null   float64
 4   sie_ic_monto_contrato                                                         

In [8]:
df.head()

Unnamed: 0,sl_contract_id,sie_ic_ruc,sie_ic_proveedor,sie_ic_monto_presupuestado,sie_ic_monto_contrato,sie_ic_monto_adjudicacion,sie_ic_indicador_01,sie_ic_indicador_04,sie_ic_indicador_05,sie_ic_indicador_06,...,sd_saldo_pago_a_45_dias,sd_saldo_pago_contra_entrega_de_bienes_obras_o_servicio,sd_saldo_pagos_por_planilla,sd_estado_adjudicado,sd_estado_cancelado,sd_estado_desierta,sd_estado_ejecucion,sd_estado_finalizada,sd_estado_otro,sd_estado_recepcion
0,1755652,,,0.0,0.0,0.0,,1.0,1,1.0,...,False,False,False,False,False,True,False,False,False,False
1,1747349,,,0.0,0.0,0.0,,1.0,1,1.0,...,False,True,False,False,False,True,False,False,False,False
2,1746667,,,0.0,0.0,0.0,,1.0,1,1.0,...,False,False,False,False,False,True,False,False,False,False
3,1746522,,,0.0,0.0,0.0,,1.0,1,0.0,...,False,True,False,False,False,True,False,False,False,False
4,1746358,991445242001.0,DIEMPEC CIA. LTDA. DISTRIBUIDORA FARMACEUTICA,25800.0,24252.0,24252.0,1.0,1.0,1,0.0,...,False,True,False,False,False,False,False,False,False,True


# Cleaning df

In [28]:
# Function to find and drop duplicate columns
def drop_duplicate_columns(df):
    # Map each duplicate column to the earlier column it repeats
    duplicates = {}
    columns = list(df.columns)

    # Compare every pair of columns once
    for i, col1 in enumerate(columns):
        if col1 in duplicates:
            continue  # Already flagged as a copy of an earlier column
        for col2 in columns[i + 1:]:
            if col2 not in duplicates and df[col1].equals(df[col2]):
                duplicates[col2] = col1  # col2 repeats col1

    # Drop the duplicate columns, keeping the first occurrence of each
    return df.drop(columns=list(duplicates))

# Drop duplicate columns
df = drop_duplicate_columns(df)
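
A more compact alternative is sketched below, assuming the columns being compared share a dtype. It relies on `DataFrame.T.duplicated()` instead of a pairwise loop. The `toy` frame and its column names are hypothetical and are not part of the dataset.

```python
import pandas as pd

# Hypothetical toy frame: "b" repeats "a" and "d" repeats "c".
toy = pd.DataFrame({
    "a": [1, 2, 3],
    "b": [1, 2, 3],
    "c": [0.5, 2.5, 1.5],
    "d": [0.5, 2.5, 1.5],
    "e": [9, 8, 7],
})

# Transposing turns columns into rows, so duplicated() flags every
# column whose values repeat an earlier column.
is_duplicate = toy.T.duplicated()
deduplicated = toy.loc[:, ~is_duplicate.to_numpy()]

print(deduplicated.columns.tolist())
```

Transposing a frame with mixed dtypes upcasts the values, so an integer column and a float column holding the same numbers would be treated as duplicates here, whereas `Series.equals` treats them as different. On wide, mixed-type data the explicit loop may be the safer choice.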

In [29]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2106 entries, 0 to 2105
Data columns (total 66 columns):
 #   Column                                                                                                    Non-Null Count  Dtype  
---  ------                                                                                                    --------------  -----  
 0   sl_contract_id                                                                                            2106 non-null   object 
 1   sie_ic_ruc                                                                                                2106 non-null   object 
 2   sie_ic_proveedor                                                                                          1512 non-null   object 
 3   sie_ic_monto_presupuestado                                                                                2106 non-null   float64
 4   sie_ic_monto_contrato                                                         

# Correlation

### **Feature Selection Using Correlation**

#### **What is Feature Selection with Correlation?**
Feature selection using correlation helps identify the most relevant features by measuring their statistical relationship with the target variable. The goal is to:
1. **Identify strong predictors**: Features that have a high correlation with the target variable are likely to be useful for the model.
2. **Remove weak or irrelevant features**: Features with very low correlation contribute little to predictions.
3. **Detect multicollinearity**: Highly correlated features might be redundant and could cause instability in models.
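
The third goal can be sketched as a pairwise check between features. This is a minimal illustration on synthetic data; the column names and the `0.9` cutoff are assumptions chosen for the example, not values taken from this analysis.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 300
a = rng.normal(size=n)

# Hypothetical features: "a_copy" is almost a rescaled copy of "a".
toy = pd.DataFrame({
    "a": a,
    "a_copy": a * 2.0 + 0.01 * rng.normal(size=n),
    "b": rng.normal(size=n),
})

corr = toy.corr().abs()

# Keep only the upper triangle so each pair is inspected once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Flag any column that is highly correlated with an earlier column.
redundant = [col for col in upper.columns if (upper[col] > 0.9).any()]
print(redundant)
```

Only the later column of each correlated pair is flagged, so one representative of every group survives.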

---

### **Key Concepts**
- **Pearson Correlation Coefficient (`.corr()`)**  
  - Measures the linear relationship between two variables.  
  - Values range from **-1 to 1**:
    - `+1`: Strong positive correlation (both variables increase together).
    - `-1`: Strong negative correlation (one increases, the other decreases).
    - `0`: No linear correlation.
- **Absolute Correlation (`.abs()`)**  
  - We take absolute values so both positive and negative correlations are treated equally.
- **Threshold Selection**  
  - A low threshold (e.g., `0.1`) keeps weakly correlated features, which might add some predictive power.  
  - A high threshold (e.g., `0.5`) keeps only strongly correlated features, which might simplify the model.
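
A small sketch of these three ideas on synthetic data follows. The feature names are hypothetical; the two thresholds are the ones quoted above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500
signal = rng.normal(size=n)

toy = pd.DataFrame({
    "strong_pos": signal + 0.1 * rng.normal(size=n),   # rises with the target
    "strong_neg": -signal + 0.1 * rng.normal(size=n),  # falls as the target rises
    "noise": rng.normal(size=n),                       # unrelated to the target
    "target": signal,
})

# Pearson correlation with the target, then absolute value so that
# strong negative correlations rank alongside strong positive ones.
abs_corr = toy.corr()["target"].abs().drop("target")

for threshold in (0.1, 0.5):
    kept = abs_corr[abs_corr > threshold].index.tolist()
    print(threshold, kept)
```

Without `.abs()`, `strong_neg` would fall below any positive threshold even though it is as informative as `strong_pos`.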


### **When to Use Correlation for Feature Selection**
✅ **Good for Exploratory Data Analysis (EDA)**  
✅ **Useful for Linear Models** (e.g., Linear Regression, Logistic Regression)  
✅ **Helps Reduce Dimensionality**  

🚨 **Limitations**:
- **Ignores non-linear relationships**: Some features might have strong predictive power but low correlation.
- **Blind to redundancy between features**: Ranking by correlation with the target can select several features that are highly correlated with each other, adding redundancy rather than unique information.
- **Does not handle categorical features**: Correlation is only computed on numeric and boolean features.
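
The first limitation is easy to reproduce. In this minimal sketch, `y` is fully determined by `x`, yet the Pearson correlation is essentially zero because the relationship is not linear.

```python
import numpy as np

x = np.linspace(-1.0, 1.0, 201)  # symmetric around zero
y = x ** 2                       # depends on x, but not linearly

pearson = np.corrcoef(x, y)[0, 1]
print(f"Pearson correlation between x and x**2: {pearson:.6f}")
```

A correlation filter would discard `x` here, which is why the later sections also try mutual information and a tree-based model.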


In [35]:
import pandas as pd
import numpy as np

def correlation_selection(df, target_col, threshold=0.1):
    """
    Returns the numeric/boolean columns whose absolute Pearson correlation
    with target_col exceeds threshold. The target itself is included (1.0).
    """
    # Numeric and boolean columns
    numeric_columns = df.select_dtypes(include=[np.number, bool]).columns

    # Absolute correlation matrix
    corr_matrix = df[numeric_columns].corr().abs()

    # Correlations with the target variable, strongest first
    correlations = corr_matrix[target_col].sort_values(ascending=False)

    # Keep features above the threshold (NaN correlations fail the comparison)
    selected = correlations[correlations > threshold]
    print(selected)

    return selected.index.tolist()



In [36]:
# Usage
selected_features = correlation_selection(df, 'sie_ic_promedio', 0.085)

sie_ic_promedio                                          1.000000
sie_ic_indicador_22                                      0.738457
sie_ic_indicador_11                                      0.738228
sie_ic_indicador_25                                      0.668869
sie_ic_indicador_19                                      0.498705
sie_ic_indicador_04                                      0.449793
sie_ic_indicador_09                                      0.434290
sie_ic_indicador_05                                      0.293830
sie_ic_indicador_06                                      0.217838
sie_ic_indicador_15                                      0.205133
sie_ic_estado_proc_desierta                              0.198808
sie_ic_monto_adjudicacion                                0.185850
sie_ic_monto_presupuestado                               0.182679
sd_presupuesto_referencial_total_sin_iva                 0.178111
sie_ic_monto_contrato                                    0.178032
sie_ic_ind

# Variance threshold

### **Understanding Variance Thresholding**
1. **Variance in Features**  
   - Variance measures how much a feature's values spread out from the mean.  
   - If a feature has very low variance (close to zero), it means that it does not change much across samples and may not contribute useful information for a model.

2. **Threshold Interpretation**  
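   - With `threshold=0.01`, any feature whose variance is below `0.01` would be removed. The helper below only ranks features by variance; it does not apply the threshold.  
   - Thresholding eliminates features with little to no variability (e.g., mostly constant columns). Variance depends on a feature's units, so raw variances are only comparable between features on a similar scale.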
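
As a minimal sketch of what applying the threshold looks like, the example below runs `VarianceThreshold(threshold=0.01)` on a hypothetical toy frame. The column names and values are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

toy = pd.DataFrame({
    "constant": np.ones(100),              # variance 0
    "rare_flag": [1] + [0] * 99,           # variance 0.01 * 0.99 = 0.0099
    "balanced_flag": [0, 1] * 50,          # variance 0.25
    "amount": np.arange(100) * 1000.0,     # large mainly because of its units
})

selector = VarianceThreshold(threshold=0.01)
selector.fit(toy)

# get_support() marks the columns whose variance exceeds the threshold.
kept = toy.columns[selector.get_support()].tolist()
print(kept)
```

The `amount` column passes easily only because of its scale, which is the reason monetary columns tend to dominate a ranking by raw variance.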


In [45]:
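from sklearn.feature_selection import VarianceThreshold

def top_variance_features(X, threshold=0.01, top_n=10):
    # NOTE: `threshold` is not used; this function ranks features and removes none
    numeric_columns = X.select_dtypes(include=[np.number, bool]).columns
    
    # threshold=0.0 so that variances are computed for every feature.
    # Caveat: when threshold is exactly 0, scikit-learn stores
    # min(variance, max - min) in variances_, so for wide-ranging features
    # the reported value can be the range rather than the variance.
    selector = VarianceThreshold(threshold=0.0)
    selector.fit(X[numeric_columns])  
    
    # Get variance values
    feature_variances = dict(zip(numeric_columns, selector.variances_))
    
    # Sort features by variance (descending)
    sorted_features = sorted(feature_variances.items(), key=lambda x: x[1], reverse=True)
    
    # Display top N features
    print(f"Top {top_n} features by variance:")
    for feature, var in sorted_features[:top_n]:
        print(f"{feature}: {var:.4f}")

    return sorted_features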



In [47]:
# Usage
top_features = top_variance_features(df, threshold=0.01, top_n=15)


Top 15 features by variance:
sd_presupuesto_referencial_total_sin_iva: 140522090780.3496
sie_ic_monto_presupuestado: 8135400.0000
sie_ic_monto_contrato: 7687948.5300
sie_ic_monto_adjudicacion: 7687948.5300
sd_plazo_de_entrega: 3740.0000
sd_vigencia_de_oferta: 357.0000
sie_ic_day: 30.0000
sie_ic_month: 9.6216
sd_comision_no: 0.2499
sd_comision_si: 0.2461
sie_ic_indicador_19: 0.2448
sie_ic_estado_proc_ejecucion_de_contrato: 0.2200
sie_ic_indicador_22: 0.2166
sie_ic_indicador_11: 0.2160
sie_ic_estado_proc_finalizada: 0.2049


# Univariate Feature Selection

### **Understanding Univariate Feature Selection (`SelectKBest`)**
Univariate feature selection picks the most relevant features based on their statistical relationship with the target variable (`y`). It evaluates each feature independently using a scoring function.

### **Key Concepts**
1. **Scoring Functions:**
   - `f_classif`: Uses **ANOVA F-test**, which measures how well a feature separates different classes.
   - `mutual_info_classif`: Uses **Mutual Information**, which measures the dependency between a feature and the target (non-linear relationships included).

2. **How It Works:**
   - Each feature gets a score based on the chosen function.
   - The top `k` features with the highest scores are selected.

3. **Why It Matters:**
   - Helps remove irrelevant features, which can improve model performance. Because each feature is scored on its own, redundant features are not detected.
   - Reduces dimensionality and computational cost.
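
The difference between the two scoring functions can be sketched on synthetic data. The feature names are hypothetical: `shifted` separates the classes by its mean, while `spread` has the same mean in both classes and differs only in its spread.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif, mutual_info_classif

rng = np.random.default_rng(0)
n = 1000
y = rng.integers(0, 2, size=n)

X = pd.DataFrame({
    "shifted": y + 0.5 * rng.normal(size=n),                     # class means differ
    "spread": np.where(y == 1, 3.0, 0.5) * rng.normal(size=n),   # same mean, different spread
    "noise": rng.normal(size=n),                                 # unrelated to y
})

# Score every feature independently with both functions.
f_scores, _ = f_classif(X, y)
mi_scores = mutual_info_classif(X, y, random_state=0)
print(pd.DataFrame({"f_classif": f_scores, "mutual_info": mi_scores}, index=X.columns))

# Keep the single best feature according to the ANOVA F-test.
selector = SelectKBest(score_func=f_classif, k=1).fit(X, y)
top = X.columns[selector.get_support()].tolist()
print(top)
```

The F-test compares class means, so it tends to overlook `spread`, whereas mutual information can pick it up. Both scorers assume a categorical target; for a continuous target, scikit-learn also provides `f_regression` and `mutual_info_regression`, which may be the better match if `sie_ic_promedio` is treated as a regression target.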


> **Why Handle NaNs?**  
> `SelectKBest` and many other scikit-learn methods do not accept NaN values.  
> Some algorithms (e.g., `HistGradientBoostingClassifier`/`Regressor`) can handle NaNs natively, but for feature selection using `SelectKBest`, you must remove or impute them.  
>  

1. **Dropping NaNs:**  
   We remove any rows containing NaN values from both the features and the target. This is simple but may reduce your dataset size significantly if many values are missing. We also print the resulting DataFrame shape.

2. **Imputation:**  
   We use `SimpleImputer` to fill in missing values (using, for example, the median). This keeps the original dataset size and is often preferred if dropping rows would lose too much data.



### **Summary**
- **Dropping NaNs** removes any rows with missing values, reducing the data size.  
- **Imputation** fills missing values (here, using the median) to preserve the data size.  
- Both methods then perform univariate feature selection using either `f_classif` or `mutual_info_classif` and return the top features along with their scores.
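
The trade-off between the two strategies shows up even on a tiny frame. This is a sketch with made-up values; every row but one contains a missing entry.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

toy = pd.DataFrame({
    "f1": [1.0, np.nan, 3.0, 4.0],
    "f2": [np.nan, 2.0, 2.0, 8.0],
    "f3": [5.0, 6.0, np.nan, 7.0],
})

# Strategy 1: drop every row that has at least one NaN.
dropped = toy.dropna()

# Strategy 2: replace each NaN with the median of its column.
imputer = SimpleImputer(strategy="median")
imputed = pd.DataFrame(
    imputer.fit_transform(toy), columns=toy.columns, index=toy.index
)

print("dropna shape: ", dropped.shape)
print("imputed shape:", imputed.shape)
```

When missing values are scattered across many columns, `dropna` can remove most or all of the rows, while imputation keeps every row at the cost of introducing estimated values.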


In [67]:
import pandas as pd
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif, mutual_info_classif
from sklearn.impute import SimpleImputer

def univariate_selection_dropna(df, target_col, k=10, score_func='f_classif'):
    """
    Performs univariate feature selection after dropping any rows with NaN values.
    Parameters:
        df: pandas DataFrame
        target_col: string, name of the target column
        k: int, number of features to select
        score_func: string, either 'f_classif' or 'mutual_info'
    """
    if target_col not in df.columns:
        raise ValueError("Target variable not found in DataFrame.")
    
    # Separate target variable
    y = df[target_col]
    X = df.drop(columns=[target_col])
    
    # Select only numeric and boolean columns
    numeric_columns = X.select_dtypes(include=[np.number, bool]).columns
    X_numeric = X[numeric_columns]
    
    # Drop rows with missing values (align y accordingly)
    X_clean = X_numeric.dropna()
    y_clean = y.loc[X_clean.index]
    print(f"Shape after dropping NaNs: {X_clean.shape}")
    
    # Choose scoring function
    score_functions = {
        'f_classif': f_classif,
        'mutual_info': mutual_info_classif
    }
    score_function = score_functions.get(score_func)
    
    if score_function is None:
        raise ValueError("Invalid score_func. Choose 'f_classif' or 'mutual_info'.")
    
    # Initialize and fit selector
    selector = SelectKBest(score_func=score_function, k=min(k, X_clean.shape[1]))
    X_selected = selector.fit_transform(X_clean, y_clean)
    
    # Get selected features
    selected_features = X_clean.columns[selector.get_support()].tolist()
    
    # Get feature scores
    scores_df = pd.DataFrame({
        'Feature': X_clean.columns,
        'Score': selector.scores_
    }).sort_values('Score', ascending=False)
    
    return X_selected, selected_features, scores_df



This strategy fails here: every row has at least one NaN, so dropping rows with missing values leaves no data.

In [68]:
df.dropna().shape

(0, 66)

In [None]:
# X_selected_drop, features_drop, scores_drop = univariate_selection_dropna(
#     df, 
#     target_col="sie_ic_promedio", 
#     k=10, 
#     score_func='f_classif'
# )

# print("\nSelected Features after dropping NaNs:", features_drop)
# print("\nTop Feature Scores (Drop NaNs):")
# print(scores_drop.head(10))




In [61]:
def univariate_selection_impute(df, target_col, k=10, score_func='f_classif', strategy='median'):
    """
    Performs univariate feature selection after imputing missing values.
    
    Parameters:
        df: pandas DataFrame
        target_col: string, name of the target column
        k: int, number of features to select
        score_func: string, either 'f_classif' or 'mutual_info'
        strategy: string, imputation strategy ('mean', 'median', 'most_frequent', or 'constant')
    
    Returns:
        tuple: (transformed X matrix, list of selected features, DataFrame of feature scores)
    """
    if target_col not in df.columns:
        raise ValueError("Target variable not found in DataFrame.")
    
    # Separate target variable
    y = df[target_col]
    X = df.drop(columns=[target_col])
    
    # Select only numeric and boolean columns
    numeric_columns = X.select_dtypes(include=[np.number, bool]).columns
    X_numeric = X[numeric_columns]
    
    # Impute missing values
    imputer = SimpleImputer(strategy=strategy)
    X_imputed_array = imputer.fit_transform(X_numeric)
    X_imputed = pd.DataFrame(
        X_imputed_array,
        columns=numeric_columns,
        index=X_numeric.index
    )
    
    print(f"Shape after imputation (should be same as original): {X_imputed.shape}")
    
    # Choose scoring function
    score_functions = {
        'f_classif': f_classif,
        'mutual_info': mutual_info_classif
    }
    score_function = score_functions.get(score_func)
    
    if score_function is None:
        raise ValueError("Invalid score_func. Choose 'f_classif' or 'mutual_info'.")
    
    # Initialize and fit selector
    selector = SelectKBest(
        score_func=score_function,
        k=min(k, X_imputed.shape[1])
    )
    X_selected = selector.fit_transform(X_imputed, y)
    
    # Get selected features
    selected_features = X_imputed.columns[selector.get_support()].tolist()
    
    # Get feature scores
    scores_df = pd.DataFrame({
        'Feature': X_imputed.columns,
        'Score': selector.scores_
    }).sort_values('Score', ascending=False)
    
    return X_selected, selected_features, scores_df


In [64]:
X_selected_imputed, features_imputed, scores_imputed = univariate_selection_impute(
    df,
    target_col="sie_ic_promedio",
    k=10,
    score_func='f_classif',
    strategy='median'
)


print("\nSelected Features after imputation:")
print(features_imputed)
print("\nTop Feature Scores (Imputation):")
print(scores_imputed.head(15))

Shape after imputation (should be same as original): (2106, 50)

Selected Features after imputation:
['sie_ic_indicador_09', 'sie_ic_indicador_11', 'sie_ic_indicador_19', 'sie_ic_indicador_22', 'sie_ic_indicador_25', 'sie_ic_estado_proc_adjudicado_-_registro_de_contratos', 'sie_ic_estado_proc_cancelado', 'sie_ic_estado_proc_desierta', 'sie_ic_estado_proc_ejecucion_de_contrato', 'sie_ic_estado_proc_finalizada']

Top Feature Scores (Imputation):
                                              Feature       Score
29                        sie_ic_estado_proc_desierta  457.039915
10                                sie_ic_indicador_22  183.968623
7                                 sie_ic_indicador_11  182.983594
11                                sie_ic_indicador_25   82.630731
9                                 sie_ic_indicador_19   72.534980
28                       sie_ic_estado_proc_cancelado   64.405468
30           sie_ic_estado_proc_ejecucion_de_contrato   56.989296
6                       

# HistGradientBoostingClassifier

### About HistGradientBoosting Algorithm:

1. Missing Value Handling:
   - Unlike many traditional gradient boosting implementations, HistGradientBoosting can handle missing values natively
   - At each split, the tree learns whether samples with a missing value go to the left or the right child, based on the potential gain
   - No need for imputation preprocessing

2. Key Features:
   - Uses histogram-based learning (similar to LightGBM)
   - Data is binned into discrete histograms
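   - Typically much faster than traditional gradient boosting on large datasets (tens of thousands of samples or more)
   - Lower memory usage due to binning
   - Native support for categorical features (via the `categorical_features` parameter)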

3. Technical Details:
   - Binning Process:
     * Continuous features are discretized into bins
     * Reduces memory usage and computation time
     * Makes the algorithm more robust to outliers
   
   - Tree Building:
     * Grows trees best-first (leaf-wise), splitting the leaf with the largest gain
     * Finds optimal splits using histograms
     * Tree size is bounded by `max_leaf_nodes`; `max_depth=None` means depth is not limited

4. Advantages:
   - Fast training and prediction
   - Memory efficient
   - Robust to outliers
   - Handles missing values naturally
   - Good performance on large datasets

5. **Parameters to tune**:
   - `learning_rate`: Controls the contribution of each tree
   - `max_iter`: Number of boosting stages (trees)
   - `max_depth`: Maximum depth of trees
   - `min_samples_leaf`: Minimum samples required in a leaf
   - `l2_regularization`: L2 regularization term


In [69]:
import pandas as pd
import numpy as np
from sklearn.ensemble import HistGradientBoostingRegressor, HistGradientBoostingClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

def feature_selection_histgb(df, target_col, k=10, task='regression', random_state=42):
    """
    Performs feature selection using HistGradientBoosting algorithm, which naturally handles NaN values.
    
    Parameters:
        df: pandas DataFrame
        target_col: string, name of the target column
        k: int, number of features to select
        task: string, either 'regression' or 'classification'
        random_state: int, for reproducibility
    
    Returns:
        tuple: (selected feature names, DataFrame with feature importances)
    """
    if target_col not in df.columns:
        raise ValueError("Target variable not found in DataFrame.")
    
    # Separate target variable
    y = df[target_col]
    X = df.drop(columns=[target_col])
    
    # Select only numeric and boolean columns
    numeric_columns = X.select_dtypes(include=[np.number, bool]).columns
    X_numeric = X[numeric_columns]
    
    # Split data for more robust feature importance
    X_train, X_test, y_train, y_test = train_test_split(
        X_numeric, y, test_size=0.2, random_state=random_state
    )
    
    # Initialize model based on task
    if task == 'regression':
        model = HistGradientBoostingRegressor(
            random_state=random_state,
            max_iter=100,  # number of trees
            learning_rate=0.1,
            max_depth=None,  # no depth limit; tree size is bounded by max_leaf_nodes
            min_samples_leaf=20,
            l2_regularization=1.0,
            categorical_features=None  # treat no feature as categorical
        )
    elif task == 'classification':
        model = HistGradientBoostingClassifier(
            random_state=random_state,
            max_iter=100,
            learning_rate=0.1,
            max_depth=None,
            min_samples_leaf=20,
            l2_regularization=1.0,
            categorical_features=None
        )
    else:
        raise ValueError("Task must be either 'regression' or 'classification'")
    
    # Fit model
    model.fit(X_train, y_train)
    
    # Calculate permutation importance on the held-out split
    # (HistGradientBoosting estimators do not expose feature_importances_)
    result = permutation_importance(
        model, X_test, y_test, 
        n_repeats=10, 
        random_state=random_state
    )
    
    # Create importance DataFrame
    importance_df = pd.DataFrame({
        'Feature': X_numeric.columns,
        'Importance_Mean': result.importances_mean,
        'Importance_Std': result.importances_std
    }).sort_values('Importance_Mean', ascending=False)
    
    # Select top k features
    selected_features = importance_df['Feature'].head(k).tolist()
    
    return selected_features, importance_df



In [71]:
selected_features, importance_df = feature_selection_histgb(
    df, 
    target_col="sie_ic_promedio", 
    k=10, 
    task='regression'
)

In [78]:
# Analyze feature importance
print("\nSelected Features (HistGradientBoosting):")
print("\nTop Feature Importance (HistGradientBoosting):")
print(importance_df.head(16))


Selected Features (HistGradientBoosting):

Top Feature Importance (HistGradientBoosting):
                                     Feature  Importance_Mean  Importance_Std
9                        sie_ic_indicador_19         0.544609        0.030859
7                        sie_ic_indicador_11         0.500104        0.031697
3                        sie_ic_indicador_04         0.475457        0.023925
11                       sie_ic_indicador_25         0.223583        0.018486
4                        sie_ic_indicador_05         0.142584        0.012386
8                        sie_ic_indicador_15         0.110361        0.007422
5                        sie_ic_indicador_06         0.098239        0.007834
13                       sie_ic_indicador_27         0.065026        0.006504
6                        sie_ic_indicador_09         0.055063        0.002544
16  sd_presupuesto_referencial_total_sin_iva         0.011494        0.002088
12                       sie_ic_indicador_26       

# PCA

### **Principal Component Analysis (PCA) - Explained**

#### **What is PCA?**
Principal Component Analysis (PCA) is a **dimensionality reduction** technique used to transform high-dimensional data into a lower-dimensional space while preserving as much variance as possible. It is useful for:
- **Reducing computational cost** in machine learning.
- **Removing multicollinearity** between features.
- **Improving visualization** of high-dimensional data.

#### **How Does PCA Work?**
1. **Standardization**: Since PCA is affected by scale, data is first standardized (mean=0, variance=1).
2. **Covariance Matrix Computation**: Measures how features vary together.
3. **Eigen Decomposition**: Computes principal components (PCs) by finding eigenvalues & eigenvectors.
4. **Sorting Principal Components**: The components are sorted by **explained variance** (higher variance = more information retained).
5. **Dimensionality Reduction**: Keep only the top `n_components` that explain the most variance.
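
The five steps can be sketched directly with NumPy and checked against scikit-learn. The data is synthetic: three hypothetical features, two of which are strongly correlated.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
base = rng.normal(size=(200, 2))
X = np.column_stack([
    base[:, 0],
    base[:, 0] * 2.0 + 0.1 * rng.normal(size=200),  # nearly collinear with column 0
    base[:, 1],
])

# 1. Standardize each feature to mean 0 and variance 1.
X_std = StandardScaler().fit_transform(X)

# 2. Covariance matrix of the standardized features.
cov = np.cov(X_std, rowvar=False)

# 3. Eigen decomposition (eigh is meant for symmetric matrices).
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# 4. Sort components by explained variance, largest first.
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]
explained_ratio = eigenvalues / eigenvalues.sum()

# 5. Project onto the top two components.
X_reduced = X_std @ eigenvectors[:, :2]

# Cross-check the explained variance ratios against scikit-learn.
sklearn_ratio = PCA(n_components=3).fit(X_std).explained_variance_ratio_
print(np.round(explained_ratio, 4), np.round(sklearn_ratio, 4))
```

scikit-learn computes the same quantities through an SVD rather than an explicit covariance matrix, so the projected values may differ in sign, but the explained variance ratios agree.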


In [79]:
import pandas as pd
import numpy as np
from sklearn.decomposition import PCA, KernelPCA, TruncatedSVD
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer

def advanced_feature_reduction(df, target_col, n_components=None, method='pca', 
                             kernel='rbf', random_state=42):
    """
    Performs dimensionality reduction using various techniques (PCA, Kernel PCA, or Truncated SVD).
    
    Parameters:
        df: pandas DataFrame
        target_col: string, name of the target column
        n_components: int or float, number of components to keep
                     if float between 0 and 1, represents variance to preserve
                     (floats are supported by method='pca' only)
        method: string, {'pca', 'kernel_pca', 'truncated_svd'}
        kernel: string, kernel for Kernel PCA ('rbf', 'poly', 'sigmoid', etc.)
        random_state: int, for reproducibility
    
    Returns:
        tuple: (transformed features, model, explained variance ratio, feature loadings)
    """
    if target_col not in df.columns:
        raise ValueError("Target variable not found in DataFrame.")
    
    # Separate target variable
    X = df.drop(columns=[target_col])
    
    # Select numeric columns
    numeric_columns = X.select_dtypes(include=[np.number]).columns
    X_numeric = X[numeric_columns]
    
    # Handle missing values
    imputer = SimpleImputer(strategy='median')
    X_imputed = imputer.fit_transform(X_numeric)
    
    # Standardize the features
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X_imputed)
    
    # If n_components is not specified, default to 95% of the smaller data
    # dimension. This is a count of components, not a share of explained
    # variance; pass a float such as 0.95 to select by variance (PCA only).
    if n_components is None:
        n_components = max(1, int(0.95 * min(X_scaled.shape)))
    
    # Initialize the appropriate reduction method
    if method == 'pca':
        reducer = PCA(n_components=n_components, random_state=random_state)
    elif method == 'kernel_pca':
        reducer = KernelPCA(n_components=n_components, kernel=kernel, 
                          random_state=random_state)
    elif method == 'truncated_svd':
        reducer = TruncatedSVD(n_components=n_components, random_state=random_state)
    else:
        raise ValueError("Method must be one of: 'pca', 'kernel_pca', 'truncated_svd'")
    
    # Fit and transform the data
    X_transformed = reducer.fit_transform(X_scaled)
    
    # Get explained variance ratio if available
    if hasattr(reducer, 'explained_variance_ratio_'):
        explained_variance = reducer.explained_variance_ratio_
    else:
        explained_variance = None
    
    # Create feature loadings DataFrame for PCA
    if method == 'pca':
        loadings = pd.DataFrame(
            reducer.components_.T,
            columns=[f'PC{i+1}' for i in range(reducer.components_.shape[0])],
            index=numeric_columns
        )
    else:
        loadings = None
    
    # Create DataFrame with transformed features
    transformed_df = pd.DataFrame(
        X_transformed,
        columns=[f'Component_{i+1}' for i in range(X_transformed.shape[1])],
        index=X_numeric.index
    )
    
    return transformed_df, reducer, explained_variance, loadings

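# Function to summarize PCA results
def analyze_pca_results(transformed_df, explained_variance, loadings, n_components=None):
    """
    Summarizes PCA results as per-component and cumulative explained variance.

    Returns a list of dicts, one per component, ready for display or plotting.
    transformed_df, loadings and n_components are accepted but not used here.
    """
    if explained_variance is None:
        raise ValueError("explained_variance is None; the reducer does not expose it.")
    
    # Cumulative share of variance explained by the first i components
    cumulative_variance = np.cumsum(explained_variance)
    
    # One record per principal component
    variance_data = [
        {
            'component': f'PC{i+1}',
            'variance': var,
            'cumulative': cum_var
        }
        for i, (var, cum_var) in enumerate(zip(explained_variance, cumulative_variance))
    ]
    
    return variance_data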



In [87]:
# df without columns that start with 'sie_ic_indicador'
df_2 = df.loc[:, ~df.columns.str.startswith('sie_ic_indicador')]

In [88]:
# Example usage:
transformed_features, pca_model, exp_variance, loadings = advanced_feature_reduction(
    df=df_2,
    target_col="sie_ic_promedio",
    n_components=0.95,  # Keep 95% of variance
    method='pca'
)

In [89]:
# Then analyze the results
variance_data = analyze_pca_results(
    transformed_features,
    exp_variance,
    loadings
)

In [90]:
variance_data[:10]

[{'component': 'PC1',
  'variance': 0.33828782897227666,
  'cumulative': 0.33828782897227666},
 {'component': 'PC2',
  'variance': 0.11107233796052435,
  'cumulative': 0.449360166932801},
 {'component': 'PC3',
  'variance': 0.09617694361804441,
  'cumulative': 0.5455371105508454},
 {'component': 'PC4',
  'variance': 0.09077978354731706,
  'cumulative': 0.6363168940981625},
 {'component': 'PC5',
  'variance': 0.08209897224482507,
  'cumulative': 0.7184158663429876},
 {'component': 'PC6',
  'variance': 0.07785139571278402,
  'cumulative': 0.7962672620557716},
 {'component': 'PC7',
  'variance': 0.0717595228788601,
  'cumulative': 0.8680267849346317},
 {'component': 'PC8',
  'variance': 0.06817432871039349,
  'cumulative': 0.9362011136450252},
 {'component': 'PC9',
  'variance': 0.061616485546330736,
  'cumulative': 0.9978175991913559}]
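
The cumulative column shows why `n_components=0.95` kept nine components: eight components explain about 93.6% of the variance, just short of the target. A small sketch of that lookup, using the cumulative values printed above (the helper name `components_needed` is hypothetical):

```python
import numpy as np

# Cumulative explained variance copied from the output above.
cumulative = np.array([
    0.33828782897227666,
    0.449360166932801,
    0.5455371105508454,
    0.6363168940981625,
    0.7184158663429876,
    0.7962672620557716,
    0.8680267849346317,
    0.9362011136450252,
    0.9978175991913559,
])

def components_needed(cumulative_variance, target):
    """Smallest number of components whose cumulative variance reaches target."""
    return int(np.argmax(cumulative_variance >= target)) + 1

for target in (0.80, 0.90, 0.95):
    print(f"{target:.0%} of variance: {components_needed(cumulative, target)} components")
```

The helper assumes the target is reachable; if no entry meets it, `np.argmax` returns 0 and the result would be misleading.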