<a href="https://colab.research.google.com/github/santhiperbolico/categorical_variables_treatment/blob/main/embedding_aplicado_a_variables_categ_ricas.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Embeddings applied to categorical variables

In [1]:
!git clone https://github.com/santhiperbolico/categorical_variables_treatment.git
!pip install categorical_variables_treatment/src
#@markdown ## Libraries
#@markdown Run this cell to install and load all the required libraries
import os
from dotenv import load_dotenv

from google.colab import drive
drive.mount('/content/drive/')


import ipywidgets as widgets
from IPython.display import clear_output

from categorical_variables_treatment.datasets.datasets import get_dataset
from categorical_variables_treatment.models.models import evaluate_dataset
from categorical_variables_treatment.preprocessing.preprocessing import (
    OneHotEncoding,
    LabelEncoding,
    EmbeddingEncoding,
    LlmEncoding
)

def inf(msg, style, wdth):
  # Show a disabled button that works as a status indicator.
  button = widgets.Button(
    description=msg,
    disabled=True,
    button_style=style,
    layout=widgets.Layout(min_width=wdth),
  )
  display(button)


clear_output()
inf('\u2714 Done','success', '50px')

Button(button_style='success', description='✔ Done', disabled=True, layout=Layout(min_width='50px'), style=But…

## Introduction
Categorical variables represent data as discrete categories, such as gender, country, or color. Most Machine Learning models expect numeric input, so these variables must be converted into a numeric format before training. An embedding represents each category as a numeric vector, which makes categorical variables usable in Machine Learning models.

## What is an embedding?

An embedding maps the values of a categorical variable to low-dimensional numeric vectors while preserving the relevant information about each category. The mapping is usually learned by training a machine learning model on a large dataset: the model learns one vector per category, so that similar categories end up with similar vectors.
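
As a minimal sketch of the idea, an embedding can be pictured as a lookup table with one row per category. The categories and vector values below are hypothetical and hand-picked for illustration; in a real model the table is learned from data.

```python
import numpy as np

# Hypothetical categories and a hypothetical 3-dimensional embedding table.
# The values are hand-picked so that "cat" and "dog" end up close together
# and "car" ends up far from both; a real table would be learned.
categories = ["cat", "dog", "car"]
embedding_table = np.array([
    [0.9, 0.8, 0.1],  # cat
    [0.8, 0.9, 0.2],  # dog
    [0.1, 0.0, 0.9],  # car
])
category_to_index = {name: i for i, name in enumerate(categories)}


def embed(name: str) -> np.ndarray:
    """Return the embedding vector of a category by row lookup."""
    return embedding_table[category_to_index[name]]


def distance(a: str, b: str) -> float:
    """Euclidean distance between the embeddings of two categories."""
    return float(np.linalg.norm(embed(a) - embed(b)))
```

With this table, `distance("cat", "dog")` is smaller than `distance("cat", "car")`, which is the sense in which similar categories get similar vectors.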

## Advantages of using embeddings for categorical variables

Reduced dimensionality: Embeddings represent every category with a fixed number of dimensions, which is more efficient than one-hot encoding, especially when the variable has many categories.

Capturing semantic relationships: Embeddings can capture similarities and relationships between categories. For example, in a recommendation problem they could learn that users with similar behavior have nearby embeddings.

Better performance: With embeddings, models can learn richer and more complex representations, which can improve predictive performance compared with traditional techniques.

## Embedding methods

There are several ways to build embeddings, including:

- One-hot encoding: creates a binary vector in which each element represents one unique category.

- Word embeddings: models used in natural language processing to represent words as numeric vectors.

- Embedding layers in neural networks: neural networks can learn embeddings of the categorical variables during training.

- Dimensionality reduction methods: techniques such as PCA or t-SNE can be used to reduce the dimensionality of the encoded categorical variables.

To demonstrate the power of embeddings, we compare different encoding methods on the Kaggle `"Home Credit Default Risk"` dataset, in particular on the variable `ORGANIZATION_TYPE`. The example shows how to apply embeddings to a categorical variable with multiple values. We use several statistical tests to measure how much information each of the following representations of the categorical variable contributes to the problem:

1. One-Hot Encoding, described above.

2. Embedding layers in neural networks, using Keras.

3. LLM embeddings + PCA to reduce dimensionality.

Before going into the details of these methods, let us first take a look at our dataset.

## LLM Embedding + PCA

Given a categorical column or variable, this encoding method works in two phases:

1. Compute a vector for each value of the variable with an LLM embedding model. In this test we use OpenAI's text-embedding-3-small model.
2. These models return vectors of dimension 1536, which is too large to feed into the model directly. Since we are working with a much smaller list of values, we can shrink these vectors without losing information. Specifically, we compute the rank `n` of the matrix of embedding vectors and apply a `PCA` model with `n` components. As we will see, no information is lost: the pairwise distances between the vectors are unchanged.
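
The claim that nothing is lost can be checked on synthetic data. The sketch below is an illustration under assumptions: it uses random vectors instead of real embeddings, and it implements PCA with a NumPy SVD rather than calling `sklearn.decomposition.PCA`. It suggests that projecting onto as many components as the rank of the matrix keeps every pairwise distance, and that centering 17 points leaves at most 16 informative directions, so the last coordinate comes out as zero.

```python
import numpy as np

rng = np.random.default_rng(0)
n_points, n_dims = 17, 1536
X = rng.normal(size=(n_points, n_dims))  # synthetic stand-in for the embeddings

# PCA centers the data first; centering n points leaves at most n - 1
# linearly independent directions.
X_centered = X - X.mean(axis=0)
rank_raw = np.linalg.matrix_rank(X)
rank_centered = np.linalg.matrix_rank(X_centered)

# The rows of Vt are the principal axes; project onto the first rank_raw of them.
_, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
X_reduced = X_centered @ Vt[:rank_raw].T


def pairwise_distances(M: np.ndarray) -> np.ndarray:
    """Euclidean distance between every pair of rows of M."""
    diff = M[:, None, :] - M[None, :, :]
    return np.linalg.norm(diff, axis=-1)


# Largest change in any pairwise distance caused by the reduction.
max_gap = np.abs(pairwise_distances(X) - pairwise_distances(X_reduced)).max()
```

Here `X_reduced` has shape `(17, 17)` and `max_gap` is at the level of floating-point noise. The last column of `X_reduced` is numerically zero, which matches the `cord_16` column shown later in the notebook.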

In our example we work with the following list of countries:
`["Jamaica", "Ecuador", "United Kingdom", "Paraguay", "France", "Mexico", "Colombia", "Germany", "España", "Canada", "Venezuela", "Dominican Republic", "United States", "Peru", "Uruguay", "Argentina", "Brazil"]`
We will see how the encoding places the most closely related countries nearest to each other.

In [3]:
#@markdown Helper functions
import numpy as np
import pandas as pd
from openai import OpenAI
from sklearn.decomposition import PCA



def get_distance_matrix(df_data: pd.DataFrame) -> np.ndarray:
  """
  Compute the Euclidean distance between every pair of rows of df_data.
  """
  distance = np.zeros((df_data.shape[0], df_data.shape[0]))
  for i in range(distance.shape[0]):
    vector = df_data.iloc[i, :]
    for j in range(distance.shape[1]):
      distance[i,j] = np.linalg.norm(vector - df_data.iloc[j, :])
  return distance

def create_embedding(client: OpenAI, values: np.ndarray) -> pd.DataFrame:
  """
  Generate the embedding of each element of values using
  OpenAI's text-embedding-3-small model.
  """
  embedding_list = []
  for val in values:
    response = client.embeddings.create(
      input=val,
      model="text-embedding-3-small"
    )
    embedding = response.data[0].embedding
    print(f" Embedding {val}")
    print(f"\t -Data Len {len(response.data)}")
    print(f"\t -Embedding Shape {len(embedding)}")
    embedding_list.append(embedding)

  df_data = pd.DataFrame(np.array(embedding_list), index=values)
  return df_data

def get_new_base(df_data: pd.DataFrame) -> np.ndarray:
  """
  Reduce the dimensionality of the data to n columns,
  where n is the rank of the matrix of values of df_data.
  """
  pca = PCA(n_components=np.linalg.matrix_rank(df_data))
  return pca.fit_transform(df_data)

clear_output()
inf('\u2714 Done','success', '50px')

Button(button_style='success', description='✔ Done', disabled=True, layout=Layout(min_width='50px'), style=But…
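
The double loop in `get_distance_matrix` can also be written with NumPy broadcasting. The function names below are hypothetical and not part of the repository; the loop version is included only as a reference to check that both approaches agree.

```python
import numpy as np


def pairwise_euclidean(data) -> np.ndarray:
    """Distance between every pair of rows, computed with broadcasting.

    Accepts anything np.asarray can convert, including a pandas DataFrame.
    """
    values = np.asarray(data, dtype=float)
    diff = values[:, None, :] - values[None, :, :]  # shape (n, n, d)
    return np.linalg.norm(diff, axis=-1)


def pairwise_euclidean_loop(data) -> np.ndarray:
    """Reference implementation that mirrors the explicit double loop."""
    values = np.asarray(data, dtype=float)
    n = values.shape[0]
    distance = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            distance[i, j] = np.linalg.norm(values[i] - values[j])
    return distance
```

The broadcast version builds an intermediate array of shape `(n, n, d)`, which is fine for a few dozen categories but could use a lot of memory for thousands of rows.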

In [4]:
#@markdown Configure the OpenAI credentials and generate the embeddings of the countries.
env_file = "/content/drive/MyDrive/Mis Cosas/Data Science/ML-IA/env" #@param

if env_file == "":
  load_dotenv()
else:
  load_dotenv(env_file)

params = {
  "api_key": os.getenv("API_KEY")
}

countries = [
    "Jamaica",
    "Ecuador",
    "United Kingdom",
    "Paraguay",
    "France",
    "Mexico",
    "Colombia",
    "Germany",
    "España",
    "Canada",
    "Venezuela",
    "Dominican Republic",
    "United States",
    "Peru",
    "Uruguay",
    "Argentina",
    "Brazil"
]


client = OpenAI(api_key=params.get("api_key"))

df_origin = create_embedding(client, countries)
df_origin

 Embedding Jamaica
	 -Data Len 1
	 -Embedding Shape 1536
 Embedding Ecuador
	 -Data Len 1
	 -Embedding Shape 1536
 Embedding United Kingdom
	 -Data Len 1
	 -Embedding Shape 1536
 Embedding Paraguay
	 -Data Len 1
	 -Embedding Shape 1536
 Embedding France
	 -Data Len 1
	 -Embedding Shape 1536
 Embedding Mexico
	 -Data Len 1
	 -Embedding Shape 1536
 Embedding Colombia
	 -Data Len 1
	 -Embedding Shape 1536
 Embedding Germany
	 -Data Len 1
	 -Embedding Shape 1536
 Embedding España
	 -Data Len 1
	 -Embedding Shape 1536
 Embedding Canada
	 -Data Len 1
	 -Embedding Shape 1536
 Embedding Venezuela
	 -Data Len 1
	 -Embedding Shape 1536
 Embedding Dominican Republic
	 -Data Len 1
	 -Embedding Shape 1536
 Embedding United States
	 -Data Len 1
	 -Embedding Shape 1536
 Embedding Peru
	 -Data Len 1
	 -Embedding Shape 1536
 Embedding Uruguay
	 -Data Len 1
	 -Embedding Shape 1536
 Embedding Argentina
	 -Data Len 1
	 -Embedding Shape 1536
 Embedding Brazil
	 -Data Len 1
	 -Embedding Shape 1536


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,1526,1527,1528,1529,1530,1531,1532,1533,1534,1535
Jamaica,0.03,-0.03,0.01,0.05,-0.04,0.01,0.0,0.06,0.04,-0.01,...,-0.03,-0.02,-0.08,-0.04,0.03,-0.01,0.03,-0.02,-0.02,0.02
Ecuador,0.02,-0.02,0.05,0.04,-0.05,0.03,0.0,0.1,-0.0,-0.04,...,-0.01,0.01,-0.0,0.01,0.02,0.0,0.01,-0.03,-0.04,0.02
United Kingdom,0.0,0.0,0.02,0.02,-0.07,0.02,0.01,0.05,0.0,0.01,...,0.03,0.02,-0.04,0.0,0.01,-0.05,-0.03,-0.02,-0.02,0.01
Paraguay,0.01,-0.01,-0.01,-0.01,-0.05,0.03,0.02,0.05,-0.0,-0.05,...,-0.0,-0.03,0.03,-0.01,0.01,-0.03,0.0,0.01,-0.01,0.02
France,-0.01,-0.0,0.04,0.03,-0.03,-0.05,-0.03,0.01,-0.03,-0.0,...,0.02,-0.01,-0.02,-0.03,-0.0,-0.02,0.0,-0.03,-0.01,0.01
Mexico,-0.01,0.0,0.0,0.08,-0.01,-0.03,0.03,0.08,0.01,0.0,...,-0.01,0.02,0.0,0.0,0.05,-0.01,0.02,-0.02,-0.03,0.02
Colombia,0.01,-0.02,0.03,0.03,-0.08,0.03,0.01,0.06,-0.01,-0.03,...,-0.01,-0.01,-0.01,-0.01,0.02,-0.0,0.03,0.01,-0.02,0.02
Germany,-0.03,0.01,0.06,0.03,-0.02,-0.08,-0.01,0.04,-0.03,0.04,...,0.03,0.02,-0.03,-0.03,0.02,-0.02,0.0,-0.02,-0.02,0.0
España,0.02,-0.01,-0.02,-0.01,-0.06,-0.0,0.02,0.06,0.01,-0.01,...,0.01,-0.0,-0.03,0.02,0.03,-0.03,-0.03,0.0,0.01,-0.0
Canada,0.01,0.0,0.04,0.06,-0.0,0.0,-0.0,0.01,-0.01,0.02,...,-0.01,0.01,-0.09,-0.03,0.02,-0.02,0.01,-0.02,-0.0,-0.0


Each country is now associated with a vector of dimension 1536 that reflects its semantic content. However, since we are working with only 17 countries, we can reduce the size of the vectors without losing information. To do so, we apply a PCA with 17 components (which is also the rank of the matrix).

In [5]:
#@markdown Apply the PCA with 17 components.
new_origin = pd.DataFrame(get_new_base(df_origin), index=df_origin.index)

new_origin.columns = [f"cord_{i}" for i in new_origin.columns]
new_origin.index.name = "origin"
new_origin

origin,cord_0,cord_1,cord_2,cord_3,cord_4,cord_5,cord_6,cord_7,cord_8,cord_9,cord_10,cord_11,cord_12,cord_13,cord_14,cord_15,cord_16
Jamaica,-0.09,-0.37,0.4,0.17,-0.24,0.24,-0.17,0.08,0.08,-0.18,0.14,-0.08,-0.0,0.23,0.12,0.03,0.0
Ecuador,-0.39,-0.02,-0.08,-0.17,0.13,0.08,-0.1,-0.19,0.35,0.04,-0.31,0.24,0.02,0.11,0.08,-0.08,0.0
United Kingdom,0.31,-0.36,-0.4,0.06,-0.24,0.07,-0.03,-0.04,0.22,0.06,0.12,-0.04,-0.21,-0.16,-0.13,-0.16,0.0
Paraguay,-0.4,0.21,-0.06,0.27,-0.25,-0.03,0.12,-0.09,-0.21,-0.13,-0.14,-0.03,0.11,0.1,-0.26,-0.2,0.0
France,0.48,0.17,0.01,0.24,0.16,0.18,-0.17,-0.13,-0.25,0.02,-0.02,0.1,0.03,-0.11,0.26,-0.21,0.0
Mexico,0.2,0.07,0.3,-0.3,0.2,-0.07,0.39,0.02,0.07,-0.14,-0.03,-0.23,-0.15,0.03,0.05,-0.18,0.0
Colombia,-0.33,-0.11,-0.07,-0.07,0.22,0.15,-0.15,0.32,-0.24,0.28,-0.13,-0.13,-0.2,0.06,-0.07,0.01,0.0
Germany,0.56,0.19,0.02,0.15,0.04,0.04,0.06,-0.28,-0.01,0.07,-0.08,-0.0,-0.17,0.19,-0.14,0.27,0.0
España,0.05,-0.16,-0.3,0.37,0.43,-0.05,0.14,0.18,0.15,-0.12,0.03,-0.09,0.23,0.02,-0.01,0.08,0.0
Canada,0.43,-0.1,0.1,-0.32,-0.07,0.08,-0.15,0.19,-0.08,-0.25,-0.22,0.11,0.15,-0.17,-0.17,0.09,0.0


In [8]:
#@markdown If we compute the distances between the vectors before and after applying the PCA, we see that the two tables do not differ.
distance_origin = get_distance_matrix(df_origin)
new_distance_origin = get_distance_matrix(new_origin)
result = (np.abs(distance_origin - new_distance_origin) < 1e-3).all()
if result:
  print("The distances are the same in both tables.")

The distances are the same in both tables.


We can check that the Latin American or Spanish-speaking countries tend to lie closer to each other than to the English-speaking or European countries.

In [15]:
country = "Ecuador" #@param ["Jamaica", "Ecuador", "United Kingdom", "Paraguay", "France", "Mexico", "Colombia", "Germany", "España", "Canada", "Venezuela", "Dominican Republic", "United States", "Peru", "Uruguay", "Argentina", "Brazil"]

df_distance_origins = pd.DataFrame(distance_origin, columns=df_origin.index, index=df_origin.index)
display(df_distance_origins.sort_values(country)[[country]])

origin,Ecuador
Ecuador,0.0
Uruguay,0.96
Colombia,0.97
Peru,0.98
Venezuela,0.99
Paraguay,1.02
Brazil,1.07
Dominican Republic,1.09
España,1.11
Jamaica,1.11


In [11]:
#@markdown The full distance table
display(df_distance_origins)

origin,Jamaica,Ecuador,United Kingdom,Paraguay,France,Mexico,Colombia,Germany,España,Canada,Venezuela,Dominican Republic,United States,Peru,Uruguay,Argentina,Brazil
Jamaica,0.0,1.11,1.12,1.1,1.17,1.14,1.08,1.21,1.17,1.1,1.12,0.96,1.15,1.16,1.1,1.21,1.15
Ecuador,1.11,0.0,1.18,1.02,1.24,1.13,0.97,1.24,1.11,1.18,0.99,1.09,1.15,0.98,0.96,1.15,1.07
United Kingdom,1.12,1.18,0.0,1.22,1.09,1.18,1.17,1.05,1.05,1.05,1.22,1.22,0.95,1.2,1.13,1.17,1.17
Paraguay,1.1,1.02,1.22,0.0,1.2,1.19,1.04,1.21,1.13,1.26,1.06,1.07,1.18,1.01,0.83,1.05,1.07
France,1.17,1.24,1.09,1.2,0.0,1.09,1.19,0.84,1.07,1.0,1.2,1.19,1.14,1.22,1.19,1.07,1.04
Mexico,1.14,1.13,1.18,1.19,1.09,0.0,1.14,1.02,1.13,0.99,1.13,1.07,1.08,1.11,1.13,1.08,1.04
Colombia,1.08,0.97,1.17,1.04,1.19,1.14,0.0,1.26,1.06,1.16,1.01,1.08,1.12,1.0,0.99,1.14,1.1
Germany,1.21,1.24,1.05,1.21,0.84,1.02,1.26,0.0,1.11,1.01,1.24,1.19,1.1,1.25,1.2,1.06,1.03
España,1.17,1.11,1.05,1.13,1.07,1.13,1.06,1.11,0.0,1.17,1.12,1.14,1.12,1.15,1.1,1.13,1.18
Canada,1.1,1.18,1.05,1.26,1.0,0.99,1.16,1.01,1.17,0.0,1.19,1.19,1.0,1.21,1.2,1.09,1.08
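
To read these distances more intuitively, they can be converted to cosine similarities. The sketch below assumes that the embedding vectors have unit length (OpenAI documents its embeddings as normalized to length 1), in which case the Euclidean distance d and the cosine similarity s satisfy d² = 2 − 2s. The helper name is hypothetical.

```python
import numpy as np


def cosine_from_distance(distance: float) -> float:
    """Cosine similarity implied by the Euclidean distance between two unit vectors."""
    return 1.0 - distance ** 2 / 2.0


# Sanity check of the identity on two random unit vectors.
rng = np.random.default_rng(0)
a = rng.normal(size=8)
b = rng.normal(size=8)
a /= np.linalg.norm(a)
b /= np.linalg.norm(b)
identity_gap = abs(cosine_from_distance(np.linalg.norm(a - b)) - float(a @ b))

# Rounded distances taken from the table above.
france_germany = cosine_from_distance(0.84)   # about 0.65
jamaica_germany = cosine_from_distance(1.21)  # about 0.27
```

Under that assumption, the closest pairs in the table correspond to cosine similarities around 0.65, and the most distant ones to values below 0.3.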


## Comparison of encoding methods

### Step 1: Load and preprocess the dataset

The Kaggle "Home Credit Default Risk" dataset is designed to help financial institutions assess an applicant's ability to repay a loan. It contains information about loan applicants, including demographic data, credit history, and features related to the loan application and its performance. Note that the cell below loads the dataset selected through `dataset_name` (`housing` in this run), whose categorical column is exposed as `ORGANIZATION_TYPE`.

In [None]:
dataset_name = "housing" #@param ["housing", "video_games_sales", "adults"]

dataset = get_dataset(dataset_name)
df = dataset.get_dataset()
categorical_features = dataset.categorical_features
model_type = dataset.model_type

display(df.head())

Unnamed: 0,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,TARGET,ORGANIZATION_TYPE
0,41.0,880.0,129.0,322.0,126.0,8.33,452600.0,NEAR BAY
1,21.0,7099.0,1106.0,2401.0,1138.0,8.3,358500.0,NEAR BAY
2,52.0,1467.0,190.0,496.0,177.0,7.26,352100.0,NEAR BAY
3,52.0,1274.0,235.0,558.0,219.0,5.64,341300.0,NEAR BAY
4,52.0,1627.0,280.0,565.0,259.0,3.85,342200.0,NEAR BAY


To see the starting point of the problem, we can train baseline models, such as a linear regression, without the categorical features and analyze their error. First, however, note that the dataset has null values in the `total_bedrooms` column; we choose to drop every record that contains a null value.

In [None]:
display(evaluate_dataset(df.drop(columns=categorical_features), model_type=model_type))

 17%|█▋        | 1/6 [00:00<00:01,  3.05it/s]

[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.001295 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 1328
[LightGBM] [Info] Number of data points in the train set: 16346, number of used features: 6
[LightGBM] [Info] Start training from score 206644.400098


100%|██████████| 6/6 [00:13<00:00,  2.21s/it]


Model,Adjusted R-Squared,R-Squared,RMSE,Time Taken
LGBMRegressor,0.69,0.69,65386.92,0.31
HistGradientBoostingRegressor,0.69,0.69,65575.26,0.63
RandomForestRegressor,0.67,0.67,67145.71,11.88
XGBRegressor,0.67,0.67,67294.4,0.33
LinearRegression,0.57,0.57,76587.33,0.06
BayesianRidge,0.57,0.57,76588.48,0.03
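
The table reports R-Squared, Adjusted R-Squared and RMSE for each model. How `evaluate_dataset` computes them internally is not shown in this notebook, so the following is only a sketch based on the usual textbook definitions; the function name `regression_metrics` is hypothetical.

```python
import numpy as np


def regression_metrics(y_true, y_pred, n_features: int) -> dict:
    """Textbook R^2, adjusted R^2 and RMSE for a regression model."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    n = y_true.size
    ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    # The adjustment penalizes R^2 for the number of features used.
    adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - n_features - 1)
    rmse = np.sqrt(ss_res / n)
    return {"R-Squared": r2, "Adjusted R-Squared": adj_r2, "RMSE": rmse}


metrics = regression_metrics(
    y_true=[1.0, 2.0, 3.0, 4.0, 5.0],
    y_pred=[1.1, 1.9, 3.2, 3.8, 5.0],
    n_features=1,
)
```

When the number of rows is much larger than the number of features, as in this dataset, the adjustment is tiny, which would explain why both R² columns are almost identical.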


### One-Hot Encoding

In [None]:
df_one_hot = OneHotEncoding().execute(dataset)
display(df_one_hot.head())
evaluate_dataset(df_one_hot, model_type=model_type)

Unnamed: 0,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,TARGET,ORGANIZATION_TYPE_<1H OCEAN,ORGANIZATION_TYPE_INLAND,ORGANIZATION_TYPE_ISLAND,ORGANIZATION_TYPE_NEAR BAY,ORGANIZATION_TYPE_NEAR OCEAN
0,41.0,880.0,129.0,322.0,126.0,8.33,452600.0,0,0,0,1,0
1,21.0,7099.0,1106.0,2401.0,1138.0,8.3,358500.0,0,0,0,1,0
2,52.0,1467.0,190.0,496.0,177.0,7.26,352100.0,0,0,0,1,0
3,52.0,1274.0,235.0,558.0,219.0,5.64,341300.0,0,0,0,1,0
4,52.0,1627.0,280.0,565.0,259.0,3.85,342200.0,0,0,0,1,0


 17%|█▋        | 1/6 [00:00<00:01,  2.79it/s]

[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.001030 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 1340
[LightGBM] [Info] Number of data points in the train set: 16346, number of used features: 10
[LightGBM] [Info] Start training from score 206644.400098


100%|██████████| 6/6 [00:13<00:00,  2.29s/it]


Model,Adjusted R-Squared,R-Squared,RMSE,Time Taken
LGBMRegressor,0.73,0.73,60208.57,0.3
HistGradientBoostingRegressor,0.73,0.73,60384.26,0.66
RandomForestRegressor,0.72,0.72,61560.89,12.27
XGBRegressor,0.72,0.72,62050.15,0.36
LinearRegression,0.64,0.64,70322.01,0.05
BayesianRidge,0.64,0.64,70323.89,0.1
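
How `OneHotEncoding` is implemented inside the library is not shown here. The NumPy sketch below only illustrates the idea, using a few of the category values that appear in the table above; the sample array itself is made up.

```python
import numpy as np

# A made-up sample of the categorical column.
values = np.array(["NEAR BAY", "INLAND", "NEAR BAY", "ISLAND", "<1H OCEAN"])

# np.unique returns the sorted categories and, for every row, the index of its category.
categories, codes = np.unique(values, return_inverse=True)

# Row i of the identity matrix is the one-hot vector of category i.
one_hot = np.eye(len(categories), dtype=int)[codes]
```

Each row of `one_hot` contains a single 1, and the number of columns grows with the number of distinct categories, which is the cost that embeddings try to avoid.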


### Label Encoder

In [None]:
# Encode the categories
df_encoder = LabelEncoding().execute(dataset)
display(df_encoder.sample(5))

evaluate_dataset(df_encoder, model_type=model_type)

Unnamed: 0,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,TARGET,ORGANIZATION_TYPE_COD
1810,26.0,3887.0,779.0,2512.0,740.0,2.23,122400.0,3
825,35.0,517.0,108.0,391.0,107.0,4.07,156900.0,3
2901,52.0,90.0,35.0,36.0,31.0,0.81,60000.0,1
1951,17.0,2616.0,492.0,1158.0,457.0,2.88,142600.0,1
14845,29.0,1792.0,449.0,1650.0,396.0,2.22,100000.0,4


 17%|█▋        | 1/6 [00:03<00:16,  3.32s/it]

[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.001288 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 1334
[LightGBM] [Info] Number of data points in the train set: 16346, number of used features: 7
[LightGBM] [Info] Start training from score 206644.400098


100%|██████████| 6/6 [00:16<00:00,  2.67s/it]


Model,Adjusted R-Squared,R-Squared,RMSE,Time Taken
LGBMRegressor,0.73,0.73,60433.03,0.26
HistGradientBoostingRegressor,0.73,0.73,60664.4,0.63
RandomForestRegressor,0.72,0.72,62334.63,11.73
XGBRegressor,0.71,0.71,62533.22,3.32
LinearRegression,0.57,0.58,76208.99,0.04
BayesianRidge,0.57,0.58,76210.16,0.04
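
In the table above the tree-based models score almost the same as with one-hot encoding, while the linear models stay near their baseline. One plausible explanation is that integer codes impose an arbitrary order on the categories, and a linear model can only fit a single slope along that order. The toy data below is hypothetical and built to make the effect visible.

```python
import numpy as np

# Hypothetical data: three categories, and the middle one has the highest target.
codes = np.array([0, 0, 1, 1, 2, 2])
y = np.array([10.0, 10.0, 30.0, 30.0, 10.0, 10.0])


def residual_sum_of_squares(features: np.ndarray) -> float:
    """Fit ordinary least squares with an intercept and return the residual sum of squares."""
    X = np.column_stack([np.ones(len(y)), features])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return float(np.sum((X @ coef - y) ** 2))


# Label encoding: a single numeric column holding the integer code.
rss_label = residual_sum_of_squares(codes)

# One-hot encoding: one indicator per category (first one dropped to avoid redundancy).
rss_one_hot = residual_sum_of_squares(np.eye(3)[codes][:, 1:])
```

With the integer code the best straight line is flat and leaves a large error, whereas the one-hot columns let the linear model assign an independent level to each category.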


### Keras Embedding

In [None]:
# Encode the categories
df_embedding = EmbeddingEncoding().execute(dataset)

display(df_embedding.head())

evaluate_dataset(df_embedding, model_type=model_type)

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


Unnamed: 0,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,TARGET,ORGANIZATION_TYPE_COD_embed_0,ORGANIZATION_TYPE_COD_embed_1,ORGANIZATION_TYPE_COD_embed_2,ORGANIZATION_TYPE_COD_embed_3,ORGANIZATION_TYPE_COD_embed_4
0,41.0,880.0,129.0,322.0,126.0,8.33,452600.0,0.11,0.01,0.14,-0.07,0.04
1,21.0,7099.0,1106.0,2401.0,1138.0,8.3,358500.0,0.11,0.01,0.14,-0.07,0.04
2,52.0,1467.0,190.0,496.0,177.0,7.26,352100.0,0.11,0.01,0.14,-0.07,0.04
3,52.0,1274.0,235.0,558.0,219.0,5.64,341300.0,0.11,0.01,0.14,-0.07,0.04
4,52.0,1627.0,280.0,565.0,259.0,3.85,342200.0,0.11,0.01,0.14,-0.07,0.04


 17%|█▋        | 1/6 [00:00<00:01,  2.87it/s]

[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.001952 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 1358
[LightGBM] [Info] Number of data points in the train set: 16346, number of used features: 11
[LightGBM] [Info] Start training from score 206644.400098


100%|██████████| 6/6 [00:14<00:00,  2.43s/it]


Model,Adjusted R-Squared,R-Squared,RMSE,Time Taken
LGBMRegressor,0.73,0.73,60260.48,0.31
HistGradientBoostingRegressor,0.73,0.73,60296.93,0.6
RandomForestRegressor,0.72,0.72,61546.85,13.23
XGBRegressor,0.71,0.72,62350.79,0.35
BayesianRidge,0.64,0.64,70299.97,0.03
LinearRegression,0.64,0.64,70322.01,0.03


In [None]:
df_embedding.sample(5)

Unnamed: 0,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,TARGET,ORGANIZATION_TYPE_COD_embed_0,ORGANIZATION_TYPE_COD_embed_1,ORGANIZATION_TYPE_COD_embed_2,ORGANIZATION_TYPE_COD_embed_3,ORGANIZATION_TYPE_COD_embed_4
1497,15.0,7803.0,1603.0,2957.0,1546.0,4.45,184900.0,0.11,0.01,0.14,-0.07,0.04
9742,30.0,2755.0,597.0,1519.0,554.0,3.3,234600.0,0.06,0.01,0.08,-0.04,0.03
19703,27.0,1513.0,374.0,839.0,350.0,1.2,64600.0,-0.18,0.28,-0.16,-0.13,-0.28
79,38.0,684.0,176.0,344.0,155.0,2.01,131300.0,0.11,0.01,0.14,-0.07,0.04
15813,34.0,3131.0,669.0,2204.0,600.0,3.55,251000.0,0.06,0.01,0.08,-0.04,0.03
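
The sample above shows that rows with the same category share the same embedding values. The internals of `EmbeddingEncoding` are not shown in this notebook, so the following is not the library's Keras model. It is a NumPy stand-in on synthetic data that sketches what an embedding layer does: it keeps one trainable vector per category and updates it by gradient descent together with a small linear head.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (assumption): 5 categories, each with its own effect on the target.
n_categories, embed_dim, n_samples = 5, 3, 500
codes = rng.integers(0, n_categories, size=n_samples)
category_effect = np.array([2.0, -1.0, 0.5, 0.0, 3.0])
y = category_effect[codes] + 0.1 * rng.normal(size=n_samples)

# Trainable parameters: the embedding table E and a linear head (w, b).
E = 0.1 * rng.normal(size=(n_categories, embed_dim))
w = 0.1 * rng.normal(size=embed_dim)
b = 0.0


def mse(E, w, b):
    """Mean squared error of the prediction E[codes] @ w + b."""
    return float(np.mean((E[codes] @ w + b - y) ** 2))


initial_loss = mse(E, w, b)
learning_rate = 0.05
for _ in range(2000):
    rows = E[codes]                 # embedding lookup, shape (n_samples, embed_dim)
    err = rows @ w + b - y          # prediction error per row
    grad_w = 2.0 * rows.T @ err / n_samples
    grad_b = 2.0 * err.mean()
    grad_E = np.zeros_like(E)
    # Each row contributes only to the gradient of its own category's vector.
    np.add.at(grad_E, codes, 2.0 * np.outer(err, w) / n_samples)
    E -= learning_rate * grad_E
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b
final_loss = mse(E, w, b)

# Every row receives the learned vector of its category.
features = E[codes]
```

After training, `features` plays the role of the `*_embed_*` columns: it has one column per embedding dimension and identical values for rows of the same category.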


### LLM Embedding

In [None]:
#@markdown Specify the file that contains the OpenAI "API_KEY" environment variable.
env_file = "/content/drive/MyDrive/Mis Cosas/Data Science/ML-IA/env" #@param

if env_file == "":
  load_dotenv()
else:
  load_dotenv(env_file)

params = {
  "api_key": os.getenv("API_KEY")
}



In [None]:
df_llm_embedding  = LlmEncoding().execute(dataset, params.get("api_key"))

display(df_llm_embedding)
evaluate_dataset(df_llm_embedding, model_type=model_type)

 Embedding NEAR BAY
	 -Data Len 1
	 -Embedding Shape 1536
 Embedding <1H OCEAN
	 -Data Len 1
	 -Embedding Shape 1536
 Embedding INLAND
	 -Data Len 1
	 -Embedding Shape 1536
 Embedding NEAR OCEAN
	 -Data Len 1
	 -Embedding Shape 1536
 Embedding ISLAND
	 -Data Len 1
	 -Embedding Shape 1536


Unnamed: 0,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,TARGET,ORGANIZATION_TYPE_embed_0,ORGANIZATION_TYPE_embed_1,ORGANIZATION_TYPE_embed_2,ORGANIZATION_TYPE_embed_3,ORGANIZATION_TYPE_embed_4
0,41.00,880.00,129.00,322.00,126.00,8.33,452600.00,0.50,-0.18,0.09,-0.12,0.00
1,21.00,7099.00,1106.00,2401.00,1138.00,8.30,358500.00,0.50,-0.18,0.09,-0.12,0.00
2,52.00,1467.00,190.00,496.00,177.00,7.26,352100.00,0.50,-0.18,0.09,-0.12,0.00
3,52.00,1274.00,235.00,558.00,219.00,5.64,341300.00,0.50,-0.18,0.09,-0.12,0.00
4,52.00,1627.00,280.00,565.00,259.00,3.85,342200.00,0.50,-0.18,0.09,-0.12,0.00
...,...,...,...,...,...,...,...,...,...,...,...,...
20428,27.00,1675.00,521.00,744.00,331.00,2.16,450000.00,-0.27,-0.00,0.40,0.04,0.00
20429,52.00,2359.00,591.00,1100.00,431.00,2.83,414700.00,-0.27,-0.00,0.40,0.04,0.00
20430,52.00,2127.00,512.00,733.00,288.00,3.39,300000.00,-0.27,-0.00,0.40,0.04,0.00
20431,52.00,996.00,264.00,341.00,160.00,2.74,450000.00,-0.27,-0.00,0.40,0.04,0.00


 17%|█▋        | 1/6 [00:00<00:01,  3.01it/s]

[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000589 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 1358
[LightGBM] [Info] Number of data points in the train set: 16346, number of used features: 11
[LightGBM] [Info] Start training from score 207282.923957


100%|██████████| 6/6 [00:18<00:00,  3.14s/it]


Model,Adjusted R-Squared,R-Squared,RMSE,Time Taken
HistGradientBoostingRegressor,0.73,0.73,59814.84,0.94
LGBMRegressor,0.73,0.73,60040.61,0.34
XGBRegressor,0.72,0.72,61042.42,0.33
RandomForestRegressor,0.71,0.71,61549.09,17.11
BayesianRidge,0.62,0.62,70463.48,0.04
LinearRegression,0.62,0.62,70464.73,0.04
