
In [None]:
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
import pandas as pd
import warnings
warnings.filterwarnings('ignore')

In [None]:
# Suppress all warnings for cleaner output
warnings.filterwarnings('ignore')

# Configure pandas to display every column when printing
pd.set_option('display.max_columns', None)

In [None]:
# Read the CSV file and report the common failure modes explicitly.
csv_path = "processed_dataset_to_train_models.csv"
try:
    df = pd.read_csv(csv_path)
    # df = pd.read_csv("data_labels.csv")
except FileNotFoundError:
    print(f"Error: file '{csv_path}' was not found.")
except pd.errors.EmptyDataError:
    print(f"Error: file '{csv_path}' is empty.")
except pd.errors.ParserError:
    print(f"Error: file '{csv_path}' could not be parsed.")

In [None]:
# Check the number of rows and columns (90009 rows, 191 columns)
df.shape

(90009, 191)

In [None]:
print(f"Rows loaded: {len(df)}")

Rows loaded: 90009


In [None]:
# Show the first rows and the initial shape of the DataFrame
print("First rows of the DataFrame:")
print(df.head())
print("\nDataFrame dimensions (rows, columns):", df.shape)

First rows of the DataFrame:
                                                  ID Expenditure_AHF  \
0  1547558447248542464208633772467833372637054433...      2017-10-23   
1  8303023334174375372752224543342433062237135032...      2017-05-16   
2  6205323654737347834626173852036442544385334747...      2017-12-15   
3  3727835411357335232137873674310300621187543222...      2018-02-02   
4  3526827315715777302832343600863305043273474403...      2017-05-31   

   Payment_6804  Infraction_CGP  Base_7744  Base_80863  Risk_1930  \
0      0.983147        0.001925   0.024150    1.009185   0.006479   
1      0.845400        0.001012   0.032260    1.008112   0.001539   
2      0.773748        0.008989   0.007325    0.817528   0.008786   
3      0.853480        0.653419   0.107481    0.816447   0.007058   
4      0.632887        0.000809   0.009320    0.812306   0.001417   

   Expenditure_JIG  Infraction_SNZ  Base_02683  Infraction_SBF  \
0         0.104240        0.005930    0.008182          

# Creating the DataFrames: `df_SelectKBest` and `df_rfe`

This step builds two new DataFrames from the features chosen by the **SelectKBest** and **RFE** (Recursive Feature Elimination) feature selection methods.

### Goal:
- Consolidate the features selected by each method into separate DataFrames, which allows a more detailed analysis and makes it easier to compare their impact on model performance.

### Process:
1. **`df_SelectKBest`:**  
   - Contains the columns that **SelectKBest** ranked as most relevant, based on statistical metrics such as the ANOVA F-score.  
   - This DataFrame includes the features with the strongest statistical relationship to the target variable.

2. **`df_rfe`:**  
   - Contains the columns selected by **RFE**, which identifies key features by their importance to the model's performance.  
   - Because the model is refit on the remaining features at each elimination step, this method takes interactions between variables into account and yields an optimized feature set.

### Result:
Two DataFrames (`df_SelectKBest` and `df_rfe`) holding the most relevant feature subsets according to each method, ready for modeling and performance comparison.
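
The column lists in the following cells are hard-coded, and the code that produced them is not part of this notebook. As a rough sketch of how such lists could be obtained, the example below runs both selectors on synthetic data. The sample data, `k=5`, `n_features_to_select=5`, and the `LogisticRegression` estimator inside RFE are illustrative assumptions, not the settings used for this dataset.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE, SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the real data (assumption: numeric features, binary label)
X_arr, y_arr = make_classification(
    n_samples=300, n_features=12, n_informative=4, random_state=42
)
X = pd.DataFrame(X_arr, columns=[f"feat_{i}" for i in range(X_arr.shape[1])])
y = pd.Series(y_arr, name="label")

# SelectKBest: score each feature on its own with the ANOVA F-test, keep the top k
skb = SelectKBest(score_func=f_classif, k=5).fit(X, y)
skb_columns = X.columns[skb.get_support()].tolist()

# RFE: fit the estimator, drop the weakest feature, and repeat until 5 remain
rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=5).fit(X, y)
rfe_columns_demo = X.columns[rfe.get_support()].tolist()

# Build one DataFrame per method, keeping the target column alongside the features
df_skb_demo = pd.concat([X[skb_columns], y], axis=1)
df_rfe_demo = pd.concat([X[rfe_columns_demo], y], axis=1)
```

The two lists usually overlap without being identical, which is the pattern visible in `selectKBest_columns` and `rfe_columns` below.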


In [None]:
selectKBest_columns = ['Payment_6804', 'Base_80863', 'Infraction_QJJF', 'Base_76065',
       'Infraction_TLPJ', 'Base_1165', 'Base_39598', 'Base_85131', 'Base_9516',
       'Infraction_BSU', 'Infraction_ZYW', 'Infraction_TBP', 'Infraction_PBC',
       'Base_0229', 'Base_69608', 'Base_3041', 'Infraction_QKZN',
       'Infraction_CZE', 'Base_9103', 'label']

In [None]:
rfe_columns = ['Payment_6804', 'Base_80863', 'Expenditure_JIG', 'Base_02683',
       'Infraction_ZWWJ', 'Infraction_QJJF', 'Infraction_EJZ',
       'Infraction_GGO', 'Infraction_TLPJ', 'Base_1165', 'Base_39598',
       'Base_6187', 'Base_85131', 'Risk_9995', 'Infraction_AYWV', 'Base_9516',
       'Expenditure_HMO', 'Infraction_BSU', 'Infraction_ZYW', 'Infraction_TBP',
       'Infraction_PBC', 'Base_0229', 'Base_69608', 'Base_3041',
       'Infraction_QKZN', 'Infraction_CZE', 'Expenditure_MTRQ',
       'Infraction_XEPQ', 'Infraction_RKTA', 'Infraction_KEJT', 'label']

In [None]:
print(selectKBest_columns)

['Payment_6804', 'Base_80863', 'Infraction_QJJF', 'Base_76065', 'Infraction_TLPJ', 'Base_1165', 'Base_39598', 'Base_85131', 'Base_9516', 'Infraction_BSU', 'Infraction_ZYW', 'Infraction_TBP', 'Infraction_PBC', 'Base_0229', 'Base_69608', 'Base_3041', 'Infraction_QKZN', 'Infraction_CZE', 'Base_9103', 'label']


In [None]:
print(df.columns)

Index(['ID', 'Expenditure_AHF', 'Payment_6804', 'Infraction_CGP', 'Base_7744',
       'Base_80863', 'Risk_1930', 'Expenditure_JIG', 'Infraction_SNZ',
       'Base_02683',
       ...
       'Infraction_ADWZ', 'Infraction_MZI', 'Infraction_QWWW',
       'Infraction_YQXM', 'Infraction_QGR', 'Infraction_ZTLC',
       'Infraction_LSX', 'Infraction_IBJ', 'Infraction_DNOU', 'label'],
      dtype='object', length=191)
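
Indexing a DataFrame with a list of names raises `KeyError` if any name is absent, so it can help to compare the selected names against `df.columns` before subsetting. The helper below is a minimal sketch; `missing_columns`, the toy lists, and `Hypothetical_Column` are illustrative names that do not come from the dataset.

```python
def missing_columns(wanted, available):
    """Return the names in `wanted` that are absent from `available`, in order."""
    available_set = set(available)
    return [name for name in wanted if name not in available_set]


# Toy example: one requested name does not exist among the available columns
available_demo = ["Payment_6804", "Base_80863", "label"]
wanted_demo = ["Payment_6804", "Hypothetical_Column", "label"]
absent = missing_columns(wanted_demo, available_demo)
print(absent)
```

In the notebook, the same check could be run as `missing_columns(selectKBest_columns, df.columns)` and `missing_columns(rfe_columns, df.columns)`.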


In [None]:
df_SelectKBest = df[selectKBest_columns].copy()
df_SelectKBest

Unnamed: 0,Payment_6804,Base_80863,Infraction_QJJF,Base_76065,Infraction_TLPJ,Base_1165,Base_39598,Base_85131,Base_9516,Infraction_BSU,Infraction_ZYW,Infraction_TBP,Infraction_PBC,Base_0229,Base_69608,Base_3041,Infraction_QKZN,Infraction_CZE,Base_9103,label
0,0.983147,1.009185,0.001666,0.034075,0.016837,0.214074,0.023268,0.007956,0.302213,0.077297,0.001927,0.011131,0.435724,0.003501,1.003915,0.014277,0.008729,0.000062,1.001486,0
1,0.845400,1.008112,0.000419,0.013843,0.025762,0.201589,0.031007,0.007593,0.296767,0.032608,0.001471,0.032594,0.347845,0.001931,1.006270,0.011776,0.004370,0.003116,1.003285,0
2,0.773748,0.817528,0.004499,0.034746,0.039036,0.161302,0.087423,0.003430,0.297563,0.033580,0.006882,0.041917,0.439804,0.008618,1.002205,0.055730,0.008723,0.001408,1.002614,0
3,0.853480,0.816447,0.007606,0.079584,0.131762,0.041181,0.066944,0.101443,0.117961,0.209908,0.211129,0.314938,0.482618,0.002384,0.653463,0.049118,0.143268,0.134363,1.000834,0
4,0.632887,0.812306,0.000821,0.214009,0.478780,0.007032,0.484647,0.002572,0.014136,0.578611,0.255802,0.599144,0.059177,0.424084,0.591848,0.437156,0.145670,0.136903,1.002185,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
90004,0.308710,0.005810,0.375491,0.261794,0.994140,0.005503,0.421551,0.681809,0.016417,1.003539,0.420430,0.870465,0.093322,1.006234,0.053964,0.429172,0.216154,0.271683,0.002014,1
90005,0.406081,0.011926,0.378869,0.533366,0.859942,0.021502,0.251779,0.620245,0.021669,0.692073,0.634320,0.840728,0.015189,0.843106,0.052817,0.250836,0.359508,0.403070,0.007445,1
90006,0.355882,0.053917,1.009551,0.495675,0.918946,0.012835,0.349284,0.609505,0.031746,0.750879,0.396612,0.924792,0.016703,0.668673,0.119744,0.329592,0.717555,0.733539,0.007015,1
90007,0.559553,0.077777,0.133649,0.708811,0.406708,0.022080,0.291884,0.399765,0.059641,0.359651,0.557526,0.703999,0.360209,0.925509,0.116787,0.289949,0.290681,0.340025,0.001712,1


In [None]:
df_rfe = df[rfe_columns].copy()
df_rfe

Unnamed: 0,Payment_6804,Base_80863,Expenditure_JIG,Base_02683,Infraction_ZWWJ,Infraction_QJJF,Infraction_EJZ,Infraction_GGO,Infraction_TLPJ,Base_1165,Base_39598,Base_6187,Base_85131,Risk_9995,Infraction_AYWV,Base_9516,Expenditure_HMO,Infraction_BSU,Infraction_ZYW,Infraction_TBP,Infraction_PBC,Base_0229,Base_69608,Base_3041,Infraction_QKZN,Infraction_CZE,Expenditure_MTRQ,Infraction_XEPQ,Infraction_RKTA,Infraction_KEJT,label
0,0.983147,1.009185,0.104240,0.008182,0.237006,0.001666,0.782012,0.773537,0.016837,0.214074,0.023268,0.003628,0.007956,0.009305,0.230541,0.302213,0.086060,0.077297,0.001927,0.011131,0.435724,0.003501,1.003915,0.014277,0.008729,0.000062,0.961532,0.187506,0.761997,0.282657,0
1,0.845400,1.008112,0.050516,0.003052,,0.000419,0.921744,0.665193,0.025762,0.201589,0.031007,0.005277,0.007593,0.001155,0.161673,0.296767,0.067886,0.032608,0.001471,0.032594,0.347845,0.001931,1.006270,0.011776,0.004370,0.003116,0.931369,0.751306,0.710254,0.094398,0
2,0.773748,0.817528,,0.003561,0.006036,0.004499,0.121423,0.286896,0.039036,0.161302,0.087423,1.006375,0.003430,0.106186,0.196457,0.297563,,0.033580,0.006882,0.041917,0.439804,0.008618,1.002205,0.055730,0.008723,0.001408,0.081398,0.199552,0.263882,0.144643,0
3,0.853480,0.816447,0.155341,0.015199,0.123918,0.007606,0.186223,0.228199,0.131762,0.041181,0.066944,0.007518,0.101443,0.102640,0.217054,0.117961,0.108943,0.209908,0.211129,0.314938,0.482618,0.002384,0.653463,0.049118,0.143268,0.134363,0.924043,0.119938,0.290865,0.048012,0
4,0.632887,0.812306,,0.001020,,0.000821,0.417804,0.823922,0.478780,0.007032,0.484647,0.001165,0.002572,0.003210,0.134980,0.014136,,0.578611,0.255802,0.599144,0.059177,0.424084,0.591848,0.437156,0.145670,0.136903,0.077032,0.742403,0.656106,0.007144,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
90004,0.308710,0.005810,0.301038,0.528282,0.052540,0.375491,0.174523,0.481308,0.994140,0.005503,0.421551,0.008084,0.681809,0.104191,0.198862,0.016417,0.309447,1.003539,0.420430,0.870465,0.093322,1.006234,0.053964,0.429172,0.216154,0.271683,0.972134,0.081266,0.546104,0.460915,1
90005,0.406081,0.011926,0.440772,0.211694,0.138257,0.378869,0.063239,0.137879,0.859942,0.021502,0.251779,1.008972,0.620245,0.903146,0.104208,0.021669,0.362610,0.692073,0.634320,0.840728,0.015189,0.843106,0.052817,0.250836,0.359508,0.403070,0.969714,0.069067,0.288964,0.364771,1
90006,0.355882,0.053917,0.197294,0.252172,0.046706,1.009551,0.028158,0.435089,0.918946,0.012835,0.349284,1.001788,0.609505,0.008670,0.060552,0.031746,0.218536,0.750879,0.396612,0.924792,0.016703,0.668673,0.119744,0.329592,0.717555,0.733539,0.932846,0.052996,0.749184,0.275053,1
90007,0.559553,0.077777,0.133838,0.158496,0.030792,0.133649,0.066223,0.364487,0.406708,0.022080,0.291884,1.009336,0.399765,0.204014,0.176150,0.059641,0.099765,0.359651,0.557526,0.703999,0.360209,0.925509,0.116787,0.289949,0.290681,0.340025,0.947959,0.552707,0.701238,0.372105,1


In [None]:
# Train a model and return its name with the test-set accuracy
def train_and_evaluate_model(model, X_train, X_test, y_train, y_test, model_name):
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    acc = accuracy_score(y_test, y_pred)
    return model_name, acc

# Pair each DataFrame with the name of the selection method that produced it
dataframes = [(df_SelectKBest, 'SelectKBest'), (df_rfe, 'RFE')]

# Iterate over the DataFrames (df_subset avoids overwriting the full `df`)
for df_subset, method_name in dataframes:
    print(f"=== Evaluation using {method_name} ===")

    # Separate the features (X) from the target variable (y)
    X = df_subset.drop(columns=['label'])
    y = df_subset['label']

    # Split the data into training and test sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

    # Scale the data (fit on the training set, transform both sets)
    scaler = StandardScaler()
    X_train = scaler.fit_transform(X_train)
    X_test = scaler.transform(X_test)

    # Classification models
    models = [
        (LogisticRegression(random_state=42, max_iter=1000), "Logistic Regression"),
        (DecisionTreeClassifier(random_state=42), "Decision Tree"),
        (KNeighborsClassifier(), "k-Nearest Neighbors (k-NN)"),
        (SVC(kernel='linear', random_state=42), "Support Vector Machine (SVM)")
    ]

    # Store the results
    results = []

    # Train and evaluate each model
    for model, model_name in models:
        name, acc = train_and_evaluate_model(model, X_train, X_test, y_train, y_test, model_name)
        results.append((name, acc))

    # Sort the results by accuracy, highest first
    results_sorted = sorted(results, key=lambda x: x[1], reverse=True)

    # Print the results
    print(f"Results for {method_name}:")
    for name, acc in results_sorted:
        print(f"{name}: Accuracy = {acc:.4f}")
    print("\n")


=== Evaluation using SelectKBest ===


ValueError: Input X contains NaN.
LogisticRegression does not accept missing values encoded as NaN natively. For supervised learning, you might want to consider sklearn.ensemble.HistGradientBoostingClassifier and Regressor which accept missing values encoded as NaNs natively. Alternatively, it is possible to preprocess the data, for instance by using an imputer transformer in a pipeline or drop samples with missing values. See https://scikit-learn.org/stable/modules/impute.html You can find a list of all estimators that handle NaN values at the following page: https://scikit-learn.org/stable/modules/impute.html#estimators-that-handle-nan-values
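
The error arises because the selected columns contain missing values. `StandardScaler` ignores NaN when fitting and leaves it in place when transforming, so the first estimator to reject the data is `LogisticRegression`. One option the message itself points to is an imputer inside a pipeline. The sketch below shows that approach on synthetic data; the median strategy, the 10% missing rate, and the variable names are assumptions chosen for illustration, not a recommendation specific to this dataset.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data with roughly 10% of the values replaced by NaN (assumption)
X_demo, y_demo = make_classification(n_samples=400, n_features=8, random_state=42)
rng = np.random.default_rng(42)
X_demo[rng.random(X_demo.shape) < 0.1] = np.nan

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42
)

# Impute, then scale, then classify; every step is fitted on the training split only
clf = make_pipeline(
    SimpleImputer(strategy="median"),
    StandardScaler(),
    LogisticRegression(max_iter=1000, random_state=42),
)
clf.fit(X_tr, y_tr)
acc_demo = accuracy_score(y_te, clf.predict(X_te))
print(f"Accuracy with median imputation: {acc_demo:.4f}")
```

Keeping the imputer inside the pipeline means the medians are learned from the training rows alone, which avoids leaking test-set information. In the loop above, the same idea would amount to wrapping each model with `make_pipeline(SimpleImputer(...), StandardScaler(), model)` and passing the unscaled `X_train` and `X_test`.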