<a href="https://colab.research.google.com/github/veruizr/ML_Doc/blob/main/decision_tree_categ.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Decision tree with categorical data

Dataset used

| Math (Ma) | Science (Sc) | English (En) | Preferred career (Pc) |
|-----------|--------------|--------------|-----------------------|
| Good (G) | Good (G) | Regular (R) | Engineering (E) |
| Regular (R) | Good (G) | Bad (B) | Medicine (M) |
| Bad (B) | Regular (R) | Good (G) | Arts (A) |
| Good (G) | Regular (R) | Good (G) | Engineering (E) |
| Regular (R) | Bad (B) | Regular (R) | Arts (A) |
| Good (G) | Good (G) | Good (G) | Engineering (E) |
| Bad (B) | Good (G) | Regular (R) | Medicine (M) |
| Regular (R) | Regular (R) | Regular (R) | Medicine (M) |
| Good (G) | Bad (B) | Good (G) | Arts (A) |
| Bad (B) | Bad (B) | Bad (B) | Arts (A) |
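
The tree trained below uses `criterion='entropy'`. As a rough illustration of what that criterion measures, the following sketch computes the Shannon entropy of `Pc` and the information gain of a multiway split on each subject, using only the table above. This is an ID3-style calculation and an assumption on my part about how to illustrate the idea: scikit-learn's `DecisionTreeClassifier` searches binary threshold splits on the encoded integers, so its split scores will not match these numbers exactly. The names `rows`, `FEATURES`, `entropy`, and `information_gain` are illustrative and not part of the notebook.

```python
from collections import Counter
from math import log2

# Dataset from the table above, one tuple per student: (Ma, Sc, En, Pc)
rows = [
    ("G", "G", "R", "E"), ("R", "G", "B", "M"), ("B", "R", "G", "A"),
    ("G", "R", "G", "E"), ("R", "B", "R", "A"), ("G", "G", "G", "E"),
    ("B", "G", "R", "M"), ("R", "R", "R", "M"), ("G", "B", "G", "A"),
    ("B", "B", "B", "A"),
]
FEATURES = {"Ma": 0, "Sc": 1, "En": 2}  # column index of each subject


def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, feature_index):
    """Entropy reduction from a multiway split on one categorical feature."""
    parent = entropy([r[-1] for r in rows])
    groups = {}
    for r in rows:
        # Group the target labels by the value of the chosen feature
        groups.setdefault(r[feature_index], []).append(r[-1])
    weighted = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return parent - weighted


print(f"H(Pc) = {entropy([r[-1] for r in rows]):.3f} bits")
for name, idx in FEATURES.items():
    print(f"Gain({name}) = {information_gain(rows, idx):.3f}")
```

Under this multiway measure, `Ma` and `Sc` tie for the highest gain and `En` is clearly lower, which is consistent with the fitted tree splitting only on `Sc` and `Ma`.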



import polars as pl
from sklearn.tree import DecisionTreeClassifier, export_text
from sklearn.preprocessing import LabelEncoder

# Build the DataFrame
df = pl.DataFrame({
    "Ma": ["G", "R", "B", "G", "R", "G", "B", "R", "G", "B"],
    "Sc": ["G", "G", "R", "R", "B", "G", "G", "R", "B", "B"],
    "En": ["R", "B", "G", "G", "R", "G", "R", "R", "G", "B"],
    "Pc": ["E", "M", "A", "E", "A", "E", "M", "M", "A", "A"]
})

# Encode a categorical column as integers with LabelEncoder
def encode_column(series: pl.Series) -> pl.Series:
    encoder = LabelEncoder()
    encoded = encoder.fit_transform(series.to_list())
    return pl.Series(encoded)

# Encode all columns
df_encoded = df.with_columns([
    encode_column(df["Ma"]).alias("Ma_encoded"),
    encode_column(df["Sc"]).alias("Sc_encoded"),
    encode_column(df["En"]).alias("En_encoded"),
    encode_column(df["Pc"]).alias("Pc_encoded")
])

# Get the features (X) and the target (y) as NumPy arrays
X = df_encoded.select(["Ma_encoded", "Sc_encoded", "En_encoded"]).to_numpy()
y = df_encoded.select("Pc_encoded").to_numpy().ravel()

# Create and train the decision tree (fixed seed for reproducibility)
seed = 100
clf = DecisionTreeClassifier(criterion='entropy', random_state=seed)
clf.fit(X, y)

# Show the decision tree as text rules
tree_rules = export_text(clf, feature_names=['Ma', 'Sc', 'En'])
print("Decision tree:\n", tree_rules)

# Predict the preferred career for one student from raw grade labels
def interpretar_prediccion(Ma: str, Sc: str, En: str) -> str:
    # Encode each input with an encoder fitted on the training column.
    # Fitting a fresh encoder on the single input value would map every label to 0.
    encoded = [
        LabelEncoder().fit(df[col].to_list()).transform([value])[0]
        for col, value in (("Ma", Ma), ("Sc", Sc), ("En", En))
    ]

    # Predict the encoded class
    pred_encoded = clf.predict([encoded])[0]

    # Decode the prediction back to the career label
    pc_encoder = LabelEncoder().fit(df["Pc"].to_list())
    return pc_encoder.inverse_transform([pred_encoded])[0]

# Predictions
print("\nPrediction for Ma=G, Sc=G, En=R:", interpretar_prediccion("G", "G", "R"))
print("Prediction for Ma=B, Sc=B, En=B:", interpretar_prediccion("B", "B", "B"))

Decision tree:
 |--- Sc <= 0.50
|   |--- class: 0
|--- Sc >  0.50
|   |--- Ma <= 0.50
|   |   |--- Sc <= 1.50
|   |   |   |--- class: 2
|   |   |--- Sc >  1.50
|   |   |   |--- class: 0
|   |--- Ma >  0.50
|   |   |--- Ma <= 1.50
|   |   |   |--- class: 1
|   |   |--- Ma >  1.50
|   |   |   |--- class: 2


Prediction for Ma=G, Sc=G, En=R: E
Prediction for Ma=B, Sc=B, En=B: A
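
The printed rules refer to integer codes, not to the original labels. `LabelEncoder` assigns codes in sorted label order, so grades map to `B=0`, `G=1`, `R=2` and careers to `A=0`, `E=1`, `M=2`. As a reading aid, the sketch below transcribes the printed tree by hand into plain Python and checks it against the ten training rows. The names `GRADE_CODES`, `CAREER_LABELS`, `TRAINING_ROWS`, and `predict_from_printed_tree` are hypothetical helpers, not part of the notebook, and the function is only as accurate as the tree output shown above.

```python
# Integer codes in sorted label order, mirroring how LabelEncoder numbers classes
GRADE_CODES = {label: code for code, label in enumerate(sorted(["G", "R", "B"]))}
CAREER_LABELS = sorted(["E", "M", "A"])  # index = encoded class

# Training data from the dataset table: (Ma, Sc, En, Pc)
TRAINING_ROWS = [
    ("G", "G", "R", "E"), ("R", "G", "B", "M"), ("B", "R", "G", "A"),
    ("G", "R", "G", "E"), ("R", "B", "R", "A"), ("G", "G", "G", "E"),
    ("B", "G", "R", "M"), ("R", "R", "R", "M"), ("G", "B", "G", "A"),
    ("B", "B", "B", "A"),
]


def predict_from_printed_tree(Ma: str, Sc: str, En: str) -> str:
    """Hand transcription of the printed rules; En is unused by the tree."""
    ma, sc = GRADE_CODES[Ma], GRADE_CODES[Sc]
    if sc <= 0.5:                      # Sc = B
        return CAREER_LABELS[0]        # class 0
    if ma <= 0.5:                      # Ma = B
        # Sc = G gives class 2, Sc = R gives class 0
        return CAREER_LABELS[2] if sc <= 1.5 else CAREER_LABELS[0]
    # Ma = G gives class 1, Ma = R gives class 2
    return CAREER_LABELS[1] if ma <= 1.5 else CAREER_LABELS[2]


# Count how many training rows the transcribed rules reproduce
matches = sum(
    predict_from_printed_tree(ma, sc, en) == pc for ma, sc, en, pc in TRAINING_ROWS
)
print(f"{matches} of {len(TRAINING_ROWS)} training rows reproduced")
```

Because the codes are ordinal, a threshold such as `Ma <= 1.50` groups `B` and `G` together against `R` purely by alphabetical order, not by how good the grade is.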
