# How to Choose the Best Algorithm

Over the course of the day you have learned to use several algorithms that allow the machine to perform **Supervised Learning**.

Now let's raise a very interesting question: **How can I know which algorithm is best to use in each case?**

One way to answer this question is to compare the scores (`score`) of each algorithm by training and evaluating them inside a `for` loop. Let's look at an example.

We are going to reproduce the same situation as in the previous lesson, but instead of applying only the **Random Forest** algorithm, we will put the performance of several different algorithms to the test.

Let's start by loading the libraries needed to prepare everything, plus the ones for the algorithms we want to measure.

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# algorithms
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

We load the same dataset as in the previous lesson.

In [2]:
ruta = "C:/Users/Federico/Downloads/Python para Data Science/Día 11/7 - Bosque Aleatorio/tarjetas_credito.csv"

df = pd.read_csv(ruta)
df.head()

Unnamed: 0,Duracion,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Monto,Clase
0,0.0,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,0.363787,...,-0.018307,0.277838,-0.110474,0.066928,0.128539,-0.189115,0.133558,-0.021053,149.62,0
1,0.0,1.191857,0.266151,0.16648,0.448154,0.060018,-0.082361,-0.078803,0.085102,-0.255425,...,-0.225775,-0.638672,0.101288,-0.339846,0.16717,0.125895,-0.008983,0.014724,2.69,0
2,1.0,-1.358354,-1.340163,1.773209,0.37978,-0.503198,1.800499,0.791461,0.247676,-1.514654,...,0.247998,0.771679,0.909412,-0.689281,-0.327642,-0.139097,-0.055353,-0.059752,378.66,0
3,1.0,-0.966272,-0.185226,1.792993,-0.863291,-0.010309,1.247203,0.237609,0.377436,-1.387024,...,-0.1083,0.005274,-0.190321,-1.175575,0.647376,-0.221929,0.062723,0.061458,123.5,0
4,2.0,-1.158233,0.877737,1.548718,0.403034,-0.407193,0.095921,0.592941,-0.270533,0.817739,...,-0.009431,0.798278,-0.137458,0.141267,-0.20601,0.502292,0.219422,0.215153,69.99,0


We normalize the data, rescaling every column to the range 0 to 1.

In [3]:
escala = MinMaxScaler(feature_range=(0, 1))
normado = escala.fit_transform(df)
df_normado = pd.DataFrame(data=normado, columns=df.columns)
df_normado.head()

Unnamed: 0,Duracion,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Monto,Clase
0,0.0,0.935192,0.76649,0.881365,0.313023,0.763439,0.267669,0.266815,0.786444,0.475312,...,0.561184,0.522992,0.663793,0.391253,0.585122,0.394557,0.418976,0.312697,0.005824,0.0
1,0.0,0.978542,0.770067,0.840298,0.271796,0.76612,0.262192,0.264875,0.786298,0.453981,...,0.55784,0.480237,0.666938,0.33644,0.58729,0.446013,0.416345,0.313423,0.000105,0.0
2,6e-06,0.935217,0.753118,0.868141,0.268766,0.762329,0.281122,0.270177,0.788042,0.410603,...,0.565477,0.54603,0.678939,0.289354,0.559515,0.402727,0.415489,0.311911,0.014739,0.0
3,6e-06,0.941878,0.765304,0.868484,0.213661,0.765647,0.275559,0.266803,0.789434,0.414999,...,0.559734,0.510277,0.662607,0.223826,0.614245,0.389197,0.417669,0.314371,0.004807,0.0
4,1.2e-05,0.938617,0.77652,0.864251,0.269796,0.762975,0.263984,0.268968,0.782484,0.49095,...,0.561327,0.547271,0.663392,0.40127,0.566343,0.507497,0.420561,0.31749,0.002724,0.0
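To make the normalization step concrete, here is a minimal sketch of what `MinMaxScaler` computes for each column: `(x - min) / (max - min)`. The array `valores` is a hypothetical example, not data from the credit card file. When a column has a very wide range, small values presumably end up as tiny numbers like the `6e-06` seen in `Duracion` above.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical column of values (not taken from the dataset).
valores = np.array([[0.0], [1.0], [2.0], [10.0]])

# Manual min-max scaling: (x - min) / (max - min), computed per column.
minimo = valores.min(axis=0)
maximo = valores.max(axis=0)
manual = (valores - minimo) / (maximo - minimo)

# The same transformation through scikit-learn.
con_sklearn = MinMaxScaler(feature_range=(0, 1)).fit_transform(valores)

# Both approaches should agree up to floating-point rounding.
print(np.allclose(manual, con_sklearn))
```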


We define the independent variables (every column except `Clase`).

In [4]:
X = df_normado.drop("Clase", axis=1)

We define the dependent variable.

In [5]:
y = df_normado["Clase"]

We split our records into training and test sets.

In [6]:
X_entrena, X_prueba, y_entrena, y_prueba = train_test_split(X, y, train_size=0.7, random_state=42)
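One detail worth noting: in this lesson the scaler is fitted on the whole DataFrame before the split, so the minimum and maximum of the test rows also influence the scaling. A common alternative is to split first and fit the scaler on the training rows only. The sketch below illustrates that ordering on a small synthetic DataFrame; `demo`, `escala_demo`, and the other `*_demo` names are hypothetical and are not part of the lesson's notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Hypothetical stand-in data: two features and a binary target.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "Monto": rng.uniform(0, 500, size=100),
    "V1": rng.normal(size=100),
    "Clase": rng.integers(0, 2, size=100),
})

X_demo = demo.drop("Clase", axis=1)
y_demo = demo["Clase"]

# Split first, so the test rows stay unseen during preprocessing.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, train_size=0.7, random_state=42
)

# Learn min/max from the training rows only, then reuse them on the test rows.
escala_demo = MinMaxScaler(feature_range=(0, 1))
X_tr_norm = pd.DataFrame(
    escala_demo.fit_transform(X_tr), columns=X_tr.columns, index=X_tr.index
)
X_te_norm = pd.DataFrame(
    escala_demo.transform(X_te), columns=X_te.columns, index=X_te.index
)

print(X_tr_norm.shape, X_te_norm.shape)
```

With this ordering, test values can fall slightly outside the 0 to 1 range, which is expected because the scaler never saw them.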

And this is where it gets interesting.

Here we compare the performance of several algorithms at once.

First we create a variable `modelos`: a list containing one tuple per algorithm to test, each holding a **name** and the **algorithm** itself.

In [7]:
modelos = [
    ("Regresión Logística", LogisticRegression()),
    ("Arbol de Decisión", DecisionTreeClassifier()),
    ("Bosque Aleatorio", RandomForestClassifier())
]
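When comparing scores, it can help to have a trivial baseline as a reference point. The sketch below uses scikit-learn's `DummyClassifier`, which always predicts the most frequent class, on synthetic data where about 1% of the rows are positive. Whether the credit card data is similarly imbalanced is not stated in this lesson, so treat the data and all `*_demo` names here as assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced data: roughly 1% of rows in the positive class.
X_demo, y_demo = make_classification(
    n_samples=5000, n_features=10, weights=[0.99, 0.01], flip_y=0, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, train_size=0.7, random_state=42, stratify=y_demo
)

# Baseline that ignores the features and always predicts the majority class.
base = DummyClassifier(strategy="most_frequent")
base.fit(X_tr, y_tr)
pred = base.predict(X_te)

# Accuracy looks high, yet no positive row is ever detected.
exactitud = accuracy_score(y_te, pred)
sensibilidad = recall_score(y_te, pred)
print(f"accuracy: {exactitud:.4f}  recall: {sensibilidad:.4f}")
```

If a baseline like this already scores close to the real models, accuracy alone says little, and a metric such as recall may be more informative.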

Finally, inside a `for` loop we **train** each model in the list `modelos` and measure its **score** on the test set. For these classifiers, `score` returns the mean accuracy.

In [8]:
for nombre, modelo in modelos:
    modelo.fit(X_entrena, y_entrena)
    puntaje = modelo.score(X_prueba, y_prueba)
    print(f'{nombre}: {puntaje:.4f}')

Regresión Logística: 0.9991
Arbol de Decisión: 0.9991
Bosque Aleatorio: 0.9996


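Based on these scores, `RandomForestClassifier()` is the best-performing candidate for this case. Keep in mind that the margin is small (0.9996 versus 0.9991) and comes from a single train/test split, so it is evidence in its favor rather than proof that it is the most robust choice.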
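Because a single split can favor one model by chance, the same loop can be adapted to use cross-validation, which trains and scores each model on several different splits. The following is a minimal sketch on synthetic data generated with `make_classification` (a stand-in, not the credit card dataset); `modelos_demo`, `resultados`, and the chosen parameters such as `cv=5` and `max_iter=1000` are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Hypothetical stand-in data (NOT the credit card dataset).
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=42)

modelos_demo = [
    ("Regresión Logística", LogisticRegression(max_iter=1000)),
    ("Arbol de Decisión", DecisionTreeClassifier(random_state=42)),
    ("Bosque Aleatorio", RandomForestClassifier(n_estimators=50, random_state=42)),
]

resultados = {}
for nombre, modelo in modelos_demo:
    # 5-fold cross-validation: each model is trained and scored five times.
    puntajes = cross_val_score(modelo, X_demo, y_demo, cv=5)
    resultados[nombre] = (puntajes.mean(), puntajes.std())
    print(f"{nombre}: {puntajes.mean():.4f} ± {puntajes.std():.4f}")
```

Comparing the mean together with the standard deviation gives a sense of whether a difference between two models is larger than the variation between splits.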