
In [1]:
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
import pandas as pd
import warnings
warnings.filterwarnings('ignore')

In [2]:
# Suppress warnings for cleaner output (this silences all warnings, not only those from pandas)
warnings.filterwarnings('ignore')

# Configure pandas to show every column when printing
pd.set_option('display.max_columns', None)

In [3]:
# Read the CSV file and report the most common failure modes.
csv_path = "processed_dataset_to_train_models.csv"
try:
    df = pd.read_csv(csv_path)
    # df = pd.read_csv("data_labels.csv")
except FileNotFoundError:
    print(f"Error: the file '{csv_path}' was not found.")
except pd.errors.EmptyDataError:
    print(f"Error: the file '{csv_path}' is empty.")
except pd.errors.ParserError:
    print(f"Error: the file '{csv_path}' could not be parsed.")

In [4]:
# Check how many rows and columns the DataFrame has
df.shape

(58540, 37)

In [5]:
print(f"Rows loaded: {len(df)}")

Rows loaded: 58540


In [6]:
# Show the first rows and the initial shape of the DataFrame
print("First rows of the DataFrame:")
print(df.head())
print("\nDataFrame dimensions (rows, columns):", df.shape)

First rows of the DataFrame:
   Payment_6804  Base_4569  Infraction_QKZN  Infraction_GGO  Base_76065  \
0      0.908188   0.093861         0.072951        0.323026    0.095100   
1      1.006051   0.018476         0.003973        0.877212    0.005009   
2      0.738226   0.083522         0.071831        0.268012    0.123427   
3      0.736433   0.009765         0.003048        0.189892    0.051474   
4      0.788748   0.015759         0.005867        0.195987    0.000039   

   Infraction_BSU  Expenditure_MTRQ  Expenditure_HMO  Base_9516  \
0        0.191787          0.669911         0.169783   0.074688   
1        0.017047          0.077196         0.145802   0.297800   
2        0.077244          0.677906         0.098902   0.156874   
3        0.031781          0.681072         0.033442   0.483579   
4        0.019708          0.083002         0.145802   0.300320   

   Infraction_TLPJ  Base_1165  Base_85131  Base_3041  Infraction_PBC  \
0         0.185647   0.044997    0.138221   

# Creating the DataFrames: `df_SelectKBest` and `df_rfe`

In this step, two new DataFrames are built from the features chosen by two feature selection methods: **SelectKBest** and **RFE** (Recursive Feature Elimination).

### Objective:
- Collect the features selected by each method in separate DataFrames, so that each subset can be analyzed on its own and their effect on model performance can be compared.

### Process:
1. **`df_SelectKBest`:**
   - Contains the columns that **SelectKBest** ranked as most relevant using a univariate statistical score such as the ANOVA F-score.
   - Each feature is scored independently against the target variable, so this DataFrame holds the features with the strongest individual statistical relationship to `label`.

2. **`df_rfe`:**
   - Contains the columns selected by **RFE**, which repeatedly fits a model and discards the least important features until the desired number remains.
   - Because importance comes from a model fitted on all remaining features together, the selection reflects each feature's contribution in the presence of the others rather than in isolation.

### Result:
Two DataFrames (`df_SelectKBest` and `df_rfe`), each holding the feature subset chosen by one method plus the `label` column, ready for modeling and performance comparison.
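
The column lists used in the next cells are hard-coded, and the notebook does not show how they were produced. As a rough sketch of how lists of this kind could be derived with scikit-learn, the example below runs both selectors on a small synthetic dataset. The data, the `feat_*` column names, the value `k=5`, and the choice of `LogisticRegression` as the RFE estimator are all assumptions made for illustration, not the settings used for the real dataset.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE, SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the real dataset (hypothetical column names).
X_arr, y_arr = make_classification(
    n_samples=300, n_features=10, n_informative=4, random_state=42
)
feature_names = [f"feat_{i}" for i in range(X_arr.shape[1])]
X_demo = pd.DataFrame(X_arr, columns=feature_names)
y_demo = pd.Series(y_arr, name="label")

# Univariate selection: keep the k features with the highest ANOVA F-score.
kbest = SelectKBest(score_func=f_classif, k=5).fit(X_demo, y_demo)
kbest_columns = list(X_demo.columns[kbest.get_support()]) + ["label"]

# RFE: repeatedly fit the estimator and drop the weakest feature
# until only n_features_to_select remain.
rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=5)
rfe.fit(X_demo, y_demo)
rfe_demo_columns = list(X_demo.columns[rfe.get_support()]) + ["label"]

print(kbest_columns)
print(rfe_demo_columns)
```

Appending `"label"` to each list mirrors the hard-coded lists below, which also end with the target column so that each subset DataFrame carries its own labels.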


In [7]:
selectKBest_columns = ['Payment_6804', 'Base_80863', 'Infraction_QJJF', 'Base_76065',
       'Infraction_TLPJ', 'Base_1165', 'Base_39598', 'Base_85131', 'Base_9516',
       'Infraction_BSU', 'Infraction_ZYW', 'Infraction_TBP', 'Infraction_PBC',
       'Base_0229', 'Base_69608', 'Base_3041', 'Infraction_QKZN',
       'Infraction_CZE', 'Base_9103', 'Base_67254_low', 'label']

In [13]:
rfe_columns = ['Payment_6804', 'Base_80863', 'Expenditure_JIG', 'Base_02683',
       'Infraction_ZWWJ', 'Infraction_QJJF', 'Infraction_EJZ',
       'Infraction_GGO', 'Infraction_TLPJ', 'Base_1165', 'Base_39598',
       'Base_6187', 'Base_85131', 'Risk_9995', 'Infraction_AYWV', 'Base_9516',
       'Expenditure_HMO', 'Infraction_BSU', 'Infraction_ZYW', 'Infraction_TBP',
       'Infraction_PBC', 'Base_0229', 'Base_69608', 'Base_3041',
       'Infraction_QKZN', 'Infraction_CZE', 'Expenditure_MTRQ',
       'Infraction_RKTA', 'Infraction_KEJT', 'label']

In [14]:
print(selectKBest_columns)

['Payment_6804', 'Base_80863', 'Infraction_QJJF', 'Base_76065', 'Infraction_TLPJ', 'Base_1165', 'Base_39598', 'Base_85131', 'Base_9516', 'Infraction_BSU', 'Infraction_ZYW', 'Infraction_TBP', 'Infraction_PBC', 'Base_0229', 'Base_69608', 'Base_3041', 'Infraction_QKZN', 'Infraction_CZE', 'Base_9103', 'Base_67254_low', 'label']


In [10]:
print(df.columns)

Index(['Payment_6804', 'Base_4569', 'Infraction_QKZN', 'Infraction_GGO',
       'Base_76065', 'Infraction_BSU', 'Expenditure_MTRQ', 'Expenditure_HMO',
       'Base_9516', 'Infraction_TLPJ', 'Base_1165', 'Base_85131', 'Base_3041',
       'Infraction_PBC', 'Infraction_RKTA', 'Expenditure_JIG',
       'Infraction_ZMKI', 'Base_0229', 'Infraction_ZTNC', 'Base_39598',
       'Infraction_CZE', 'Base_67254_low', 'Base_02683', 'Infraction_TBP',
       'Infraction_AYWV', 'Infraction_ZWWJ', 'Base_80863', 'Infraction_EJZ',
       'Infraction_QJJF', 'Base_6187', 'Base_2810', 'Base_69608', 'Risk_9995',
       'Infraction_KEJT', 'Infraction_ZYW', 'Base_9103', 'label'],
      dtype='object')
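
Printing `df.columns` makes it possible to check by eye that every hard-coded name exists before subsetting. The same check can be automated. The helper below, `missing_columns`, is a hypothetical name introduced here, and the two short lists are a trimmed illustration rather than the full column sets.

```python
def missing_columns(available, wanted):
    """Return the names in `wanted` that are absent from `available`, in order."""
    available_set = set(available)
    return [name for name in wanted if name not in available_set]


# Trimmed, illustrative lists; in the notebook these would be
# df.columns and selectKBest_columns (or rfe_columns).
available = ['Payment_6804', 'Base_4569', 'Infraction_QKZN', 'label']
wanted = ['Payment_6804', 'Base_80863', 'label']

print(missing_columns(available, wanted))  # ['Base_80863']
```

An empty result means `df[wanted]` will not raise a `KeyError` for missing labels.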


In [15]:
df_SelectKBest = df[selectKBest_columns].copy()
df_SelectKBest

Unnamed: 0,Payment_6804,Base_80863,Infraction_QJJF,Base_76065,Infraction_TLPJ,Base_1165,Base_39598,Base_85131,Base_9516,Infraction_BSU,Infraction_ZYW,Infraction_TBP,Infraction_PBC,Base_0229,Base_69608,Base_3041,Infraction_QKZN,Infraction_CZE,Base_9103,Base_67254_low,label
0,0.908188,0.077875,0.013905,0.095100,0.185647,0.044997,0.124141,0.138221,0.074688,0.191787,0.235487,0.360544,0.247189,1.006168,0.201661,0.093503,0.072951,0.133594,0.008919,0.0,0
1,1.006051,0.819432,0.074872,0.005009,0.003176,0.130274,0.020801,0.003870,0.297800,0.017047,0.005154,0.006396,0.505102,0.002738,1.004301,0.014684,0.003973,0.003461,1.003621,1.0,0
2,0.738226,0.726630,0.067859,0.123427,0.352163,0.161845,0.039725,0.143156,0.156874,0.077244,0.103899,0.807562,0.010014,0.250177,0.209429,0.030109,0.071831,0.138296,0.006946,0.0,0
3,0.736433,1.004405,0.065625,0.051474,0.058059,0.390846,0.020706,0.206435,0.483579,0.031781,0.002994,0.047218,0.352273,0.003406,1.008153,0.021827,0.003048,0.005108,1.003414,1.0,0
4,0.788748,0.811773,0.062932,0.000039,0.352163,0.131319,0.007435,0.004920,0.300320,0.019708,0.009743,0.462107,0.105594,0.004141,1.004778,0.001916,0.005867,0.006370,1.006909,1.0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
58535,0.734532,1.007686,0.037635,0.045501,0.352163,0.198242,0.029473,0.103242,0.296619,0.011226,0.002941,0.462107,0.426523,0.003184,1.000787,0.018564,0.005337,0.009867,1.007698,1.0,0
58536,0.678145,0.022160,0.361057,0.483222,0.406849,0.004748,0.529404,0.436659,0.016035,0.465808,0.608296,0.898326,0.028047,1.001349,0.101886,0.503907,0.361410,0.405830,0.000600,0.0,1
58537,0.819433,1.000099,0.078910,0.085687,0.168573,0.123132,0.052353,0.005474,0.181059,0.163443,0.240330,0.191067,0.389210,0.083572,0.649005,0.034571,0.072463,0.068284,1.003705,1.0,0
58538,0.999453,1.006880,0.091524,0.041952,0.117336,0.129455,0.040900,0.009843,0.161995,0.145797,0.085959,0.141279,0.048022,0.000144,0.694288,0.032017,0.071613,0.073470,1.000072,1.0,0


In [16]:
df_rfe = df[rfe_columns].copy()
df_rfe

Unnamed: 0,Payment_6804,Base_80863,Expenditure_JIG,Base_02683,Infraction_ZWWJ,Infraction_QJJF,Infraction_EJZ,Infraction_GGO,Infraction_TLPJ,Base_1165,Base_39598,Base_6187,Base_85131,Risk_9995,Infraction_AYWV,Base_9516,Expenditure_HMO,Infraction_BSU,Infraction_ZYW,Infraction_TBP,Infraction_PBC,Base_0229,Base_69608,Base_3041,Infraction_QKZN,Infraction_CZE,Expenditure_MTRQ,Infraction_RKTA,Infraction_KEJT,label
0,0.908188,0.077875,0.172576,0.165970,0.051655,0.013905,0.352824,0.323026,0.185647,0.044997,0.124141,0.004903,0.138221,0.004180,0.111744,0.074688,0.169783,0.191787,0.235487,0.360544,0.247189,1.006168,0.201661,0.093503,0.072951,0.133594,0.669911,0.587948,0.182863,0
1,1.006051,0.819432,0.165443,0.004531,0.092942,0.074872,0.296293,0.877212,0.003176,0.130274,0.020801,0.004328,0.003870,0.005659,0.276679,0.297800,0.145802,0.017047,0.005154,0.006396,0.505102,0.002738,1.004301,0.014684,0.003973,0.003461,0.077196,0.687627,0.053317,0
2,0.738226,0.726630,0.145093,0.010554,0.097831,0.067859,0.386510,0.268012,0.352163,0.161845,0.039725,0.004060,0.143156,0.104509,0.093543,0.156874,0.098902,0.077244,0.103899,0.807562,0.010014,0.250177,0.209429,0.030109,0.071831,0.138296,0.677906,0.302143,0.094294,0
3,0.736433,1.004405,0.044428,0.003168,0.010471,0.065625,0.263622,0.189892,0.058059,0.390846,0.020706,0.008570,0.206435,0.000566,0.153827,0.483579,0.033442,0.031781,0.002994,0.047218,0.352273,0.003406,1.008153,0.021827,0.003048,0.005108,0.681072,0.272384,0.413995,0
4,0.788748,0.811773,0.165443,0.008298,0.081078,0.062932,0.181022,0.195987,0.352163,0.131319,0.007435,0.008825,0.004920,0.106966,0.049698,0.300320,0.145802,0.019708,0.009743,0.462107,0.105594,0.004141,1.004778,0.001916,0.005867,0.006370,0.083002,0.354322,0.234572,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
58535,0.734532,1.007686,0.097499,0.002687,0.092942,0.037635,0.419709,0.423561,0.352163,0.198242,0.029473,0.007293,0.103242,0.002030,0.194454,0.296619,0.072305,0.011226,0.002941,0.462107,0.426523,0.003184,1.000787,0.018564,0.005337,0.009867,0.678064,0.723507,0.550445,0
58536,0.678145,0.022160,0.177490,0.545342,0.137384,0.361057,0.054532,0.591020,0.406849,0.004748,0.529404,1.004122,0.436659,0.001872,0.101297,0.016035,0.143054,0.465808,0.608296,0.898326,0.028047,1.001349,0.101886,0.503907,0.361410,0.405830,0.673840,0.656764,0.047512,1
58537,0.819433,1.000099,0.120599,0.007986,0.236165,0.078910,0.771879,0.589542,0.168573,0.123132,0.052353,0.003264,0.005474,0.267900,0.208698,0.181059,0.078037,0.163443,0.240330,0.191067,0.389210,0.083572,0.649005,0.034571,0.072463,0.068284,0.677072,0.770979,0.144491,0
58538,0.999453,1.006880,0.142966,0.008983,0.092942,0.091524,0.342313,0.847671,0.117336,0.129455,0.040900,0.004038,0.009843,0.002166,0.130773,0.161995,0.097013,0.145797,0.085959,0.141279,0.048022,0.000144,0.694288,0.032017,0.071613,0.073470,0.679685,0.886287,0.279764,0


In [17]:
# Train a model and return its name together with its test accuracy
def train_and_evaluate_model(model, X_train, X_test, y_train, y_test, model_name):
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    acc = accuracy_score(y_test, y_pred)
    return model_name, acc

# Pair each DataFrame with the name of its selection method
dataframes = [(df_SelectKBest, 'SelectKBest'), (df_rfe, 'RFE')]

# Iterate over the DataFrames (subset_df avoids overwriting the original df)
for subset_df, method_name in dataframes:
    print(f"=== Evaluation using {method_name} ===")

    # Separate the features (X) from the target variable (y)
    X = subset_df.drop(columns=['label'])
    y = subset_df['label']

    # Split the data into training and test sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

    # Scale the data (fit on the training set only, then transform both sets)
    scaler = StandardScaler()
    X_train = scaler.fit_transform(X_train)
    X_test = scaler.transform(X_test)

    # Classification models
    models = [
        (LogisticRegression(random_state=42, max_iter=1000), "Logistic Regression"),
        (DecisionTreeClassifier(random_state=42), "Decision Tree"),
        (KNeighborsClassifier(), "k-Nearest Neighbors (k-NN)"),
        (SVC(kernel='linear', random_state=42), "Support Vector Machine (SVM)")
    ]

    # Store the results
    results = []

    # Train and evaluate each model
    for model, model_name in models:
        name, acc = train_and_evaluate_model(model, X_train, X_test, y_train, y_test, model_name)
        results.append((name, acc))

    # Sort the results by accuracy, highest first
    results_sorted = sorted(results, key=lambda x: x[1], reverse=True)

    # Print the results
    print(f"Results for {method_name}:")
    for name, acc in results_sorted:
        print(f"{name}: Accuracy = {acc:.4f}")
    print("\n")


=== Evaluation using SelectKBest ===
Results for SelectKBest:
k-Nearest Neighbors (k-NN): Accuracy = 0.9082
Decision Tree: Accuracy = 0.8766
Support Vector Machine (SVM): Accuracy = 0.8343
Logistic Regression: Accuracy = 0.8332


=== Evaluation using RFE ===
Results for RFE:
k-Nearest Neighbors (k-NN): Accuracy = 0.9411
Decision Tree: Accuracy = 0.9033
Support Vector Machine (SVM): Accuracy = 0.8464
Logistic Regression: Accuracy = 0.8447
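
The accuracies above come from a single 70/30 split, with the scaler fitted on the training portion only. One possible way to keep that discipline while averaging over several splits is to wrap the scaler and the classifier in a scikit-learn `Pipeline` and score it with cross-validation. The sketch below assumes synthetic data and 5 folds, so its numbers are not comparable to the results reported in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for one of the feature subsets (an assumption).
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=42)

# The pipeline refits StandardScaler inside every fold, so the held-out
# fold never influences the scaling statistics.
knn_pipeline = make_pipeline(StandardScaler(), KNeighborsClassifier())

scores = cross_val_score(knn_pipeline, X_demo, y_demo, cv=5, scoring="accuracy")
print(f"Mean accuracy: {scores.mean():.4f} (std: {scores.std():.4f})")
```

With the notebook's data, `X_demo` and `y_demo` would be replaced by the `X` and `y` built from `df_SelectKBest` or `df_rfe`.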


