Deep Clustering for Financial Market Segmentation

An unsupervised deep learning approach to clustering credit card customers
With advances in unsupervised deep learning, autoencoder neural networks are now frequently used to reduce the dimensionality of high-dimensional data (for example, a dataset with thousands of features or more). An autoencoder can also be combined with a supervised learner (for example, Random Forest) to form a semi-supervised learning method. A Deep Embedded Clustering (DEC) method was recently published [1]. It combines an autoencoder with K-means and other machine learning techniques to perform clustering rather than dimensionality reduction. The original DEC implementation is based on Caffe.
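To make the clustering step more concrete, the following is a minimal NumPy sketch of the two quantities DEC iterates on as formulated in [1]: a Student's t soft assignment of each embedded point to each cluster centroid, and a sharpened target distribution derived from it. The function names, the toy embeddings, and the centroids are illustrative assumptions and are not taken from this notebook; in the full method the embeddings would come from the autoencoder's encoder and the initial centroids from K-means.

```python
import numpy as np

def soft_assign(z, centroids, alpha=1.0):
    """Student's t soft assignment q_ij of embeddings z to cluster centroids."""
    # Squared Euclidean distance between every point and every centroid.
    sq_dist = ((z[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    q = (1.0 + sq_dist / alpha) ** (-(alpha + 1.0) / 2.0)
    # Normalize so each row is a probability distribution over clusters.
    return q / q.sum(axis=1, keepdims=True)

def target_distribution(q):
    """Target p_ij: square q, divide by soft cluster frequency, renormalize."""
    weight = q ** 2 / q.sum(axis=0)
    return weight / weight.sum(axis=1, keepdims=True)

# Hypothetical 2-D "embeddings": two well-separated blobs of five points each.
rng = np.random.default_rng(0)
z = np.vstack([rng.normal(0.0, 0.1, (5, 2)), rng.normal(3.0, 0.1, (5, 2))])
# Hypothetical centroids; in DEC these would be initialized with K-means.
centroids = np.array([[0.0, 0.0], [3.0, 3.0]])

q = soft_assign(z, centroids)
p = target_distribution(q)
labels = q.argmax(axis=1)  # hard cluster label per point
```

Training then minimizes the KL divergence between `p` and `q`, which pulls each point toward the cluster it is already most confident about.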

The rest of this notebook is organized as follows:

Data preparation
Implementation of the DEC method in Keras
Summary

In [1]:
from time import time
import keras.backend as K
from tensorflow.keras.layers import Layer, InputSpec
from keras.layers import Dense, Input
from keras.models import Model
from keras.optimizers import SGD
from keras import callbacks
from keras.initializers import VarianceScaling
from sklearn.cluster import KMeans
import keras.metrics
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import silhouette_score
from IPython.display import Image
from tensorflow.keras.callbacks import TensorBoard
import tensorflow as tf
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
%matplotlib inline

In [2]:
np.random.seed(10)  # Fix NumPy's random seed at 10 so random selections are reproducible.
data = pd.read_csv('../../FraudeCanastas.csv')
data.head()

Unnamed: 0,ID,APPLE PRODUCTDESCRIPTION | SAMSUNG | MODEL90,AUDIO ACCESSORIES | AB AUDIO | AB AUDIO GO AIR TRUE WIRELESS BLUETOOTH IN-EAR H,AUDIO ACCESSORIES | APPLE | 2019 APPLE AIRPODS WITH CHARGING CASE,AUDIO ACCESSORIES | APPLE | 2019 APPLE AIRPODS WITH CHARGING CASE 2ND GENERATI,AUDIO ACCESSORIES | APPLE | 2019 APPLE AIRPODS WITH WIRELESS CHARGING CASE,AUDIO ACCESSORIES | APPLE | 2019 APPLE AIRPODS WITH WIRELESS CHARGING CASE 2ND,AUDIO ACCESSORIES | APPLE | 2021 APPLE AIRPODS WITH MAGSAFE CHARGING CASE 3RD,AUDIO ACCESSORIES | APPLE | AIRPODS PRO,AUDIO ACCESSORIES | APPLE | APPLE AIRPODS MAX,...,WOMEN S NIGHTWEAR | ANYDAY RETAILER | ANYDAY RETAILER LEOPARD PRINT JERSEY PY,WOMEN S NIGHTWEAR | RETAILER | RETAILER CLEO VELOUR JOGGER LOUNGE PANT,WOMEN S NIGHTWEAR | SOSANDAR | SOSANDAR ZEBRA PRINT PYJAMA BOTTOMS BLACK 10,Nb_of_items,total_of_items,costo_total,costo_medio_item,costo_item_max,costo_item_min,fraud_flag
0,130,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,2,2,1299,649.5,1299,0.0,1.0
1,195,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,3,3,4119,1373.0,2470,0.0,1.0
2,217,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,2,2,2806,1403.0,2799,7.0,1.0
3,552,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,2,2,1206,603.0,1199,7.0,1.0
4,854,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,19,27,1807,66.925926,195,4.0,1.0


In [None]:
# How many columns (variables) do we have?
print(data.columns)
print("# of columns: " + str(len(data.columns)))

1.2 Feature selection

The DataFrame above shows that the ID field is unique for each record. A field whose values are all unique carries no information for clustering, so it can be dropped:

In [None]:
# Drop the record identifier; the remaining columns are kept as features.
data_x = data.drop(columns=['ID'])
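
Before dropping an identifier column, it can help to confirm that the column actually exists and is unique per row, since `DataFrame.drop` raises a `KeyError` for a missing label. The sketch below does this on a hypothetical toy frame. The helper `drop_identifier` and the frame `toy` are illustrative names, and the values are only a small sample shaped like the `data.head()` output, not the real dataset.

```python
import pandas as pd

def drop_identifier(df, column):
    """Return a copy of df without `column`, after checking it is a true identifier."""
    if column not in df.columns:
        raise KeyError(f"{column!r} not found; available columns: {list(df.columns)}")
    if not df[column].is_unique:
        raise ValueError(f"{column!r} has repeated values; it is not a pure identifier")
    return df.drop(columns=[column])

# Hypothetical miniature of the basket data (illustrative values only).
toy = pd.DataFrame({
    "ID": [130, 195, 217, 552],
    "Nb_of_items": [2, 3, 2, 2],
    "costo_total": [1299, 4119, 2806, 1206],
    "fraud_flag": [1.0, 1.0, 1.0, 1.0],
})

toy_x = drop_identifier(toy, "ID")
print(list(toy_x.columns))
```

The same call on the real frame would be `drop_identifier(data, <identifier column>)`. Failing loudly on a wrong column name is a deliberate choice here, because a silently ignored drop would leave the identifier in the feature matrix.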