<a href="https://colab.research.google.com/github/unclepete-20/lab7-k-means/blob/main/K_Means.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Lab #7 (K-Means)

## Team members:


*   Pedro Pablo Arriola Jimenez (20188)
*   Oscar Fernando Lopez Barrios (20679)
*   Santiago Taracena Puga (20017)
*   YongBum Park (20117)








# Introduction to K-Means Clustering for Bank Transaction Analysis 💳💰

Analyzing bank transactions is important for detecting fraud, unusual behavior, and customer spending patterns. One way to analyze this data is with clustering techniques such as the K-Means algorithm.

K-Means is an unsupervised learning technique that partitions data into clusters (groups) based on similarity. For bank transactions, the data can be grouped by customer behavior, such as spending patterns or the places where a card is used most often.
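
As a point of reference, "similarity" in standard K-Means is usually measured with squared Euclidean distance, and the algorithm looks for cluster assignments that minimize the within-cluster sum of squares. The notation below ($K$ clusters $C_k$ with centroids $\mu_k$) is introduced here for illustration only and does not appear in the original text:

```latex
J = \sum_{k=1}^{K} \sum_{x_i \in C_k} \lVert x_i - \mu_k \rVert^2,
\qquad
\mu_k = \frac{1}{|C_k|} \sum_{x_i \in C_k} x_i
```

Each centroid $\mu_k$ is the mean of the points currently assigned to cluster $C_k$, which is where the name of the algorithm comes from.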

Implementing K-Means in Python is relatively straightforward thanks to libraries such as scikit-learn and pandas. The technique gives a better understanding of customer patterns and behavior, which can be valuable for decision-making in banking.

This study explores K-Means clustering for bank transaction analysis, including its implementation in Python and the interpretation of the results.


## Task 1 - Data Cleaning and Exploratory Analysis

In [24]:
# Libraries needed for data cleaning and exploratory analysis.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [25]:
# Load the dataset to begin cleaning and exploration.
data = pd.read_csv("./data/bank_transactions.csv")
data.head()

Unnamed: 0,TransactionID,CustomerID,CustomerDOB,CustGender,CustLocation,CustAccountBalance,TransactionDate,TransactionTime,TransactionAmount (INR)
0,T1,C5841053,10/1/94,F,JAMSHEDPUR,17819.05,2/8/16,143207,25.0
1,T2,C2142763,4/4/57,M,JHAJJAR,2270.69,2/8/16,141858,27999.0
2,T3,C4417068,26/11/96,F,MUMBAI,17874.44,2/8/16,142712,459.0
3,T4,C5342380,14/9/73,F,MUMBAI,866503.21,2/8/16,142714,2060.0
4,T5,C9031234,24/3/88,F,NAVI MUMBAI,6714.43,2/8/16,181156,1762.5


The table contains information about financial transactions. The meaning of each column is as follows:

- TransactionID: a unique identifier for each transaction
- CustomerID: a unique identifier for each customer
- CustomerDOB: the customer's date of birth
- CustGender: the customer's gender
- CustLocation: the customer's geographic location
- CustAccountBalance: the customer's account balance
- TransactionDate: the date of the transaction
- TransactionTime: the time of the transaction
- TransactionAmount (INR): the transaction amount in Indian rupees (INR)

With this information, the next step is to clean and encode the data.
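
The `TransactionTime` values in the preview (for example `143207`) look like clock times packed as `HHMMSS` integers. That reading is an assumption, since the column is only described as the time of the transaction. Under that assumption, a packed value could be converted to seconds since midnight so that numeric differences reflect elapsed time. The helper name `hhmmss_to_seconds` is hypothetical:

```python
def hhmmss_to_seconds(value):
    """Convert a clock time packed as an HHMMSS integer to seconds since midnight.

    Assumes the HHMMSS encoding (an assumption about the dataset). Only integer
    arithmetic is used, so the same function also works element-wise on a
    NumPy array or a pandas Series of integers.
    """
    hours = value // 10000
    minutes = (value // 100) % 100
    seconds = value % 100
    return hours * 3600 + minutes * 60 + seconds


first_transaction = hhmmss_to_seconds(143207)  # 14:32:07 under the assumed encoding
print(first_transaction)  # 52327
```

The notebook itself keeps the raw integer, so this conversion is only an option to consider, not a step the original analysis performs.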

In [26]:
# Drop these identifier and date columns, which serve no purpose in the analysis.
data = data.drop(["TransactionID", "CustomerID", "CustomerDOB", "TransactionDate"], axis=1)
data.head()

Unnamed: 0,CustGender,CustLocation,CustAccountBalance,TransactionTime,TransactionAmount (INR)
0,F,JAMSHEDPUR,17819.05,143207,25.0
1,M,JHAJJAR,2270.69,141858,27999.0
2,F,MUMBAI,17874.44,142712,459.0
3,F,MUMBAI,866503.21,142714,2060.0
4,F,NAVI MUMBAI,6714.43,181156,1762.5


In [27]:
# Map the customer's gender to numeric values.

gender_map = {"M": 1, "F": 0}  # Gender-to-number mapping
data["CustGender"] = data["CustGender"].replace(gender_map)
# Convert the column to numeric values; entries outside the mapping become NaN.
data['CustGender'] = pd.to_numeric(data['CustGender'], errors='coerce')
data

Unnamed: 0,CustGender,CustLocation,CustAccountBalance,TransactionTime,TransactionAmount (INR)
0,0.0,JAMSHEDPUR,17819.05,143207,25.0
1,1.0,JHAJJAR,2270.69,141858,27999.0
2,0.0,MUMBAI,17874.44,142712,459.0
3,0.0,MUMBAI,866503.21,142714,2060.0
4,0.0,NAVI MUMBAI,6714.43,181156,1762.5
...,...,...,...,...,...
1048562,1.0,NEW DELHI,7635.19,184824,799.0
1048563,1.0,NASHIK,27311.42,183734,460.0
1048564,1.0,HYDERABAD,221757.06,183313,770.0
1048565,1.0,VISAKHAPATNAM,10117.87,184706,1000.0
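
A minimal sketch on toy data (not the real dataset) of how this encoding treats values outside the mapping. The cell above uses `replace` followed by `pd.to_numeric(..., errors="coerce")`; for a dictionary mapping, `Series.map` reaches the same end result in one step, because any value missing from the dictionary becomes `NaN`. The value `"T"` is a made-up unmapped category used only for illustration:

```python
import pandas as pd

gender_map = {"M": 1, "F": 0}

# Toy column: "T" is a hypothetical unmapped value, None is a missing entry.
toy = pd.Series(["M", "F", "T", None], name="CustGender")

# Values absent from the dictionary (including missing ones) become NaN.
encoded = toy.map(gender_map)

print(encoded.tolist())
print("rows that a later dropna() would remove:", int(encoded.isna().sum()))
```

Rows that end up as `NaN` this way are removed by the `dropna()` call in the next cell.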


In [28]:
# Drop rows with missing or null values
data = data.dropna()
data

Unnamed: 0,CustGender,CustLocation,CustAccountBalance,TransactionTime,TransactionAmount (INR)
0,0.0,JAMSHEDPUR,17819.05,143207,25.0
1,1.0,JHAJJAR,2270.69,141858,27999.0
2,0.0,MUMBAI,17874.44,142712,459.0
3,0.0,MUMBAI,866503.21,142714,2060.0
4,0.0,NAVI MUMBAI,6714.43,181156,1762.5
...,...,...,...,...,...
1048562,1.0,NEW DELHI,7635.19,184824,799.0
1048563,1.0,NASHIK,27311.42,183734,460.0
1048564,1.0,HYDERABAD,221757.06,183313,770.0
1048565,1.0,VISAKHAPATNAM,10117.87,184706,1000.0
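
The later cells assign columns on the frame returned by `dropna()`, and their output includes a pandas warning about setting values on a copy of a slice. A minimal sketch, on a toy frame, of one common way to make ownership explicit: take a copy right after filtering. Whether the warning appears at all depends on the pandas version and its copy-on-write settings, so this is a defensive pattern under that assumption rather than a required fix:

```python
import pandas as pd

# Toy frame standing in for the transactions table (values are made up).
toy = pd.DataFrame({
    "CustGender": [0.0, 1.0, None],
    "CustAccountBalance": [100.0, 250.0, 75.0],
})

# .copy() makes the filtered result an independent DataFrame, so later
# column assignments clearly modify this object and not `toy`.
clean = toy.dropna().copy()
clean["CustAccountBalance"] = clean["CustAccountBalance"] / 100.0

print(clean)
```

The original `toy` frame is left unchanged by the assignment on `clean`.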


In [29]:
# Standardize the monetary columns

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
data[["CustAccountBalance", "TransactionAmount (INR)"]] = scaler.fit_transform(data[["CustAccountBalance", "TransactionAmount (INR)"]])
data

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  data[["CustAccountBalance", "TransactionAmount (INR)"]] = scaler.fit_transform(data[["CustAccountBalance", "TransactionAmount (INR)"]])


Unnamed: 0,CustGender,CustLocation,CustAccountBalance,TransactionTime,TransactionAmount (INR)
0,0.0,JAMSHEDPUR,-0.115327,143207,-0.235558
1,1.0,JHAJJAR,-0.133684,141858,4.022106
2,0.0,MUMBAI,-0.115261,142712,-0.169503
3,0.0,MUMBAI,0.886688,142714,0.074170
4,0.0,NAVI MUMBAI,-0.128438,181156,0.028891
...,...,...,...,...,...
1048562,1.0,NEW DELHI,-0.127351,184824,-0.117755
1048563,1.0,NASHIK,-0.104119,183734,-0.169351
1048564,1.0,HYDERABAD,0.125456,183313,-0.122169
1048565,1.0,VISAKHAPATNAM,-0.124419,184706,-0.087162
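
For reference, `StandardScaler` with its default settings standardizes each column to zero mean and unit variance, using the population standard deviation. A minimal NumPy sketch of the same computation on made-up balances:

```python
import numpy as np

# Made-up account balances; not taken from the dataset.
balances = np.array([100.0, 200.0, 300.0, 400.0])

# z = (x - mean) / std, with the population standard deviation (ddof=0),
# which is the convention StandardScaler uses.
z_scores = (balances - balances.mean()) / balances.std(ddof=0)

print(z_scores)
print("mean:", z_scores.mean(), "std:", z_scores.std(ddof=0))
```

Only `CustAccountBalance` and `TransactionAmount (INR)` are standardized in the cell above; the other columns keep their original scale.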


In [30]:
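# Fill missing balances with the median. Rows with missing values were already
# dropped above, so this is a safeguard that changes nothing here.
median_balance = data["CustAccountBalance"].median()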
data.loc[:, "CustAccountBalance"] = data["CustAccountBalance"].fillna(median_balance)
data

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  data.loc[:, "CustAccountBalance"] = data["CustAccountBalance"].fillna(median_balance)


Unnamed: 0,CustGender,CustLocation,CustAccountBalance,TransactionTime,TransactionAmount (INR)
0,0.0,JAMSHEDPUR,-0.115327,143207,-0.235558
1,1.0,JHAJJAR,-0.133684,141858,4.022106
2,0.0,MUMBAI,-0.115261,142712,-0.169503
3,0.0,MUMBAI,0.886688,142714,0.074170
4,0.0,NAVI MUMBAI,-0.128438,181156,0.028891
...,...,...,...,...,...
1048562,1.0,NEW DELHI,-0.127351,184824,-0.117755
1048563,1.0,NASHIK,-0.104119,183734,-0.169351
1048564,1.0,HYDERABAD,0.125456,183313,-0.122169
1048565,1.0,VISAKHAPATNAM,-0.124419,184706,-0.087162


In [31]:
from sklearn.preprocessing import LabelEncoder

# Encode each location name as an integer label.
le = LabelEncoder()
data.loc[:, "CustLocation"] = le.fit_transform(data["CustLocation"])
data


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  data.loc[:, "CustLocation"] = le.fit_transform(data["CustLocation"])


Unnamed: 0,CustGender,CustLocation,CustAccountBalance,TransactionTime,TransactionAmount (INR)
0,0.0,3567,-0.115327,143207,-0.235558
1,1.0,3629,-0.133684,141858,4.022106
2,0.0,5242,-0.115261,142712,-0.169503
3,0.0,5242,0.886688,142714,0.074170
4,0.0,5631,-0.128438,181156,0.028891
...,...,...,...,...,...
1048562,1.0,5766,-0.127351,184824,-0.117755
1048563,1.0,5603,-0.104119,183734,-0.169351
1048564,1.0,3377,0.125456,183313,-0.122169
1048565,1.0,9108,-0.124419,184706,-0.087162
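
To make the encoded `CustLocation` values easier to interpret: `LabelEncoder` assigns each distinct value an integer according to its position in the sorted list of unique values. A plain-Python sketch of that rule on a few city names from the preview:

```python
# A few location names from the preview, with one repeat.
locations = ["MUMBAI", "JAMSHEDPUR", "NAVI MUMBAI", "MUMBAI", "JHAJJAR"]

# Sorted unique values define the classes; each class gets its index as code.
classes = sorted(set(locations))
code_of = {name: code for code, name in enumerate(classes)}
encoded = [code_of[name] for name in locations]

print(classes)  # ['JAMSHEDPUR', 'JHAJJAR', 'MUMBAI', 'NAVI MUMBAI']
print(encoded)  # [2, 0, 3, 2, 1]
```

The resulting integers are alphabetical ranks, not quantities, so numeric distances between two codes carry no geographic meaning.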


In [32]:
# Brief statistical description of the data
data.describe()

Unnamed: 0,CustGender,CustLocation,CustAccountBalance,TransactionTime,TransactionAmount (INR)
count,1044946.0,1044946.0,1044946.0,1044946.0,1044946.0
mean,0.7308559,4104.684,-3.016393e-17,157100.1,1.332217e-16
std,0.4435152,2377.073,1.0,51266.09,1.0
min,0.0,0.0,-0.1363652,0.0,-0.2393632
25%,0.0,2062.0,-0.130774,124033.0,-0.2148589
50%,1.0,4101.0,-0.1165098,164236.0,-0.1694239
75%,1.0,5766.0,-0.06823476,200016.0,-0.05672226
max,1.0,9325.0,135.6825,235959.0,237.1992


In [33]:
# Also get some general information about the data
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1044946 entries, 0 to 1048566
Data columns (total 5 columns):
 #   Column                   Non-Null Count    Dtype  
---  ------                   --------------    -----  
 0   CustGender               1044946 non-null  float64
 1   CustLocation             1044946 non-null  int64  
 2   CustAccountBalance       1044946 non-null  float64
 3   TransactionTime          1044946 non-null  int64  
 4   TransactionAmount (INR)  1044946 non-null  float64
dtypes: float64(3), int64(2)
memory usage: 47.8 MB


In [34]:
# Explore the number of rows per location to get a sense of the variable
data.groupby('CustLocation').size()

CustLocation
0       11
1        1
2       21
3        5
4       11
        ..
9321     1
9322     6
9323     4
9324     9
9325     1
Length: 9326, dtype: int64

## Task 1.1 - K-Means Clustering

In [55]:
class KMeans:
    """Minimal K-Means (Lloyd's algorithm) using Euclidean distance."""

    def __init__(self, n_clusters=8, max_iter=300):
        self.n_clusters = n_clusters
        self.max_iter = max_iter

    def _assign(self, X):
        # Index of the nearest centroid for every row of X.
        distances = np.stack([np.linalg.norm(X - c, axis=1) for c in self.centroids], axis=1)
        return np.argmin(distances, axis=1)

    def fit(self, X):
        X = np.asarray(X, dtype=float)
        # Initialize the centroids with n_clusters distinct random rows of X.
        idx = np.random.choice(len(X), self.n_clusters, replace=False)
        self.centroids = X[idx]
        for _ in range(self.max_iter):
            labels = self._assign(X)
            # Move each centroid to the mean of its points; keep the previous
            # centroid if a cluster ends up empty, to avoid NaN values.
            new_centroids = np.array([
                X[labels == k].mean(axis=0) if np.any(labels == k) else self.centroids[k]
                for k in range(self.n_clusters)
            ])
            # Stop once the centroids no longer move.
            if np.allclose(self.centroids, new_centroids):
                break
            self.centroids = new_centroids
        self.labels_ = self._assign(X)
        return self

    def predict(self, X):
        return self._assign(np.asarray(X, dtype=float))
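
To make the two steps inside `fit` concrete, here is a hand-sized sketch of a single iteration (assignment, then centroid update) on made-up 2-D points with fixed starting centroids. It is written independently of the class and uses NumPy broadcasting for the distance computation:

```python
import numpy as np

# Four made-up points forming two obvious groups.
X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
# Fixed starting centroids (the class picks random rows instead).
centroids = np.array([[0.0, 0.0], [10.0, 10.0]])

# Assignment step: distance from every point to every centroid, shape (4, 2).
distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
labels = distances.argmin(axis=1)

# Update step: each centroid moves to the mean of its assigned points.
new_centroids = np.array([X[labels == k].mean(axis=0) for k in range(len(centroids))])

print(labels)         # [0 0 1 1]
print(new_centroids)  # centroids move to (0, 0.5) and (10, 10.5)
```

Repeating these two steps until the centroids stop moving is the loop that `fit` runs, up to `max_iter` times.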

#### Note that K-Means clustering does not use a target variable (y); it groups the data into clusters using only the feature matrix X. The train/validation split is therefore used only to assess the quality of the resulting clusters, not to train a model in the traditional supervised sense.
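
One way to carry out that assessment, sketched under assumptions: compute the within-cluster sum of squared distances (often called inertia) on held-out rows, using centroids fitted on the training rows. The function name `inertia` and the toy arrays are hypothetical and not part of the original notebook:

```python
import numpy as np


def inertia(X, centroids):
    """Sum of squared distances from each row of X to its nearest centroid."""
    X = np.asarray(X, dtype=float)
    centroids = np.asarray(centroids, dtype=float)
    # Squared Euclidean distances, shape (n_samples, n_clusters).
    sq_dist = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return float(sq_dist.min(axis=1).sum())


# Toy stand-ins for fitted centroids and a validation split.
centroids = [[0.0, 0.0], [10.0, 10.0]]
X_valid_toy = [[1.0, 0.0], [9.0, 10.0], [10.0, 12.0]]

print(inertia(X_valid_toy, centroids))  # 1 + 1 + 4 = 6.0
```

Comparing this value per row between the training and validation sets gives a rough indication of whether the clusters describe unseen transactions about as well as the ones used to fit them.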

In [60]:
from sklearn.model_selection import train_test_split

X = data.values

X_train, X_valid = train_test_split(X, test_size=0.2, random_state=42)


In [63]:
# Build the model
kmeans = KMeans(n_clusters=2)
kmeans.fit(X)