<div style="width:100%; overflow:hidden; background-color:#F1F1E6; padding: 10px; border-style: outset; color:#17469e">
    <div style="width: 80%; float: left;">
    <h2 align="center">Universidad de Sonora</h2>
    <hr style="border-width: 3px; border-color:#17469e">
          <h1>Pattern Recognition: Data Preparation</h1>          
          <h4>Ramón Soto C. <a href="mailto:rsotoc@moviquest.com">(rsotoc@moviquest.com)</a></h4>
    </div>
    <div style="float: right;">
    <img src="images/escudo_unison.png">
    </div>
</div>

## Case study: [*Stack Overflow 2018 Developer Survey*](https://www.kaggle.com/stackoverflow/stack-overflow-2018-developer-survey)

As the main case study for this course we selected the 2018 *Stack Overflow* developer survey, available on [Kaggle](https://www.kaggle.com). At this stage we carry out the cluster analysis.

### 4. Modeling - ISODATA

<div style="margin-top: 6px; border: 1px solid #cfcfcf; padding: 8px 12px; border-radius:2px; background-color:#f7f7f7; ">
... we now use the ISODATA technique to identify class prototypes. <br>We initialize the environment and load the data:
</div>

In [1]:
"""
Pattern recognition: ISODATA
"""

#from scipy.spatial.distance import squareform

# Initialize the environment
import sys
import numpy as np
import pandas as pd
import json
import pickle
#import math
import random
#import time

from IPython.display import display, HTML
from collections import Counter
from operator import itemgetter
#from scipy.spatial.distance import euclidean, pdist, squareform

np.set_printoptions(precision=2, suppress=True) # Limit printed decimals to 2
pd.set_option('display.max_columns', 130)
pd.set_option('max_colwidth', 80)

LARGER_DISTANCE = sys.maxsize
TALK = True # TALK = True prints partial results

In [2]:
path = "Data sets/Stack Overflow Survey/"

# Load the column headers in their original order
with open(path + 'survey_results_public_transformed.headers', 'rb') as file:  
    headers = pickle.load(file)

# Load the dictionaries... just in case they are needed
with open(path + 'survey_results_public_transformed.dicts', 'rb') as file:  
    dict_of_dicts = pickle.load(file)

with open(path + 'survey_results_public_transformed.json') as f:
    dict_json = json.load(f)
df = pd.DataFrame.from_dict(dict_json)

df = df.sample(n=100).reset_index(drop=True)


# Reorder the columns to match the original order
df = df.reindex(headers, axis=1)

DATA_LEN = df.shape[0]

# Add a "Cluster" column initialized to null
df["Cluster"] = np.nan

In [3]:
var_str = ['Hobby', 'OpenSource', 'Country', 'Student', 'Employment', 'FormalEducation', 
         'UndergradMajor', 'CompanySize', 'YearsCoding', 'YearsCodingProf', 'UpdateCV', 
         'JobSatisfaction', 'CareerSatisfaction', 'HopeFiveYears', 'JobSearchStatus', 
         'LastNewJob', 'TimeFullyProductive', 'AgreeDisagree1', 'AgreeDisagree2', 
         'AgreeDisagree3', 'OperatingSystem', 'NumberMonitors', 'CheckInCode', 'AdBlocker', 
         'AdBlockerDisable', 'AdsAgreeDisagree1', 'AdsAgreeDisagree2', 'AdsAgreeDisagree3', 
         'AIDangerous', 'AIInteresting', 'AIResponsible', 'AIFuture', 'EthicsChoice', 
         'EthicsReport', 'EthicsResponsible', 'EthicalImplications', 'HoursComputer', 
         'StackOverflowRecommend', 'StackOverflowVisit', 'StackOverflowHasAccount', 
         'StackOverflowParticipate', 'StackOverflowJobs', 'StackOverflowDevStory', 
         'StackOverflowJobsRecommend', 'StackOverflowConsiderMember', 'HypotheticalTools1', 
         'HypotheticalTools2', 'HypotheticalTools3', 'HypotheticalTools4', 'WakeTime', 
         'HypotheticalTools5', 'HoursOutside', 'SkipMeals', 'Exercise', 'EducationParents', 
         'Age', 'Dependents', 'SurveyTooLong', 'SurveyEasy']
var_list = ['DevType', 'CommunicationTools', 'EducationTypes', 'SelfTaughtTypes', 
         'HackathonReasons', 'LanguageDesireNextYear', 'DatabaseWorkedWith', 
         'DatabaseDesireNextYear', 'PlatformWorkedWith', 'PlatformDesireNextYear', 
         'FrameworkWorkedWith', 'FrameworkDesireNextYear', 'IDE', 'Methodology', 
         'VersionControl', 'AdBlockerReasons', 'AdsActions', 'ErgonomicDevices', 
         'RaceEthnicity', 'LanguageWorkedWith']
var_ranks = ['AssessJob', 'AssessBenefits', 'JobContactPriorities', 'JobEmailPriorities', 
             'AdsPriorities']
var_float = 'ConvertedSalary'

def distance_qual(x, y):
    # Number of variables; if var_float becomes a list, replace "+ 1" with "+ len(var_float)"
    numvars = len(var_str) + len(var_list) + len(var_ranks) + 1
    
    distancia = abs(x.ConvertedSalary - y.ConvertedSalary)
    if pd.isnull(distancia):
        distancia = 0
        numvars -= 1
        
    for col in var_str:
        if x[col] != y[col]:
            distancia += 1
        
    for col in var_list:
        num_vars = len(x[col]) + len(y[col])
        d = 0
        if num_vars > 0:
            d = (2*len(set(x[col] + y[col])) - num_vars) / num_vars
        distancia += d

    for col in var_ranks:
        d = 0
        max_vars = max(len(x[col]), len(y[col]))
        if len(x[col]) != 0 and len(y[col]) != 0:
            for v in range(len(x[col])):
                if x[col][v] != y[col][v]:
                    d += 1
        else:
            d += max_vars
        
        if d != 0:
            d /= max_vars
        distancia += d

    return distancia / numvars
    
def decode(dataframe):
    """Return a copy of `dataframe` with coded values mapped back to their labels."""
    new_df = dataframe.copy(deep=True)
    
    # Single-valued categorical columns
    for col in var_str:
        if col in list(dataframe) and col in dict_of_dicts:
            for index, row in dataframe.iterrows():
                value = dict_of_dicts[col][row[col]]
                new_df.at[index, col] = value
                
    # Scale ConvertedSalary back up by 200000
    for index, row in dataframe.iterrows():
        new_df.at[index, 'ConvertedSalary'] = row['ConvertedSalary'] * 200000
    
    # List-valued columns: decode every element of each list
    for col in var_list:
        if col in list(dataframe):
            for index, row in dataframe.iterrows():
                values_list = row[col].copy()
                for i in range(len(values_list)):
                    values_list[i] = dict_of_dicts[col][values_list[i]]
                new_df.at[index, col] = values_list
                
    return new_df
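
For the list-valued columns, `distance_qual` scores two lists by how many elements they do not share, relative to their combined length. The sketch below isolates that term so it can be checked by hand. The function name `list_dissimilarity` and the sample lists are illustrative assumptions, and it assumes no list repeats an element.

```python
def list_dissimilarity(a, b):
    """Dissimilarity in [0, 1] between two lists of category codes.

    Same expression as the var_list term in distance_qual:
    (2 * |A union B| - (|A| + |B|)) / (|A| + |B|).
    Assumes neither list contains repeated elements.
    """
    total = len(a) + len(b)
    if total == 0:
        # Two empty lists are treated as identical
        return 0.0
    return (2 * len(set(a + b)) - total) / total


# Hypothetical coded answers, stored as strings
same = list_dissimilarity(["1", "7", "8"], ["1", "7", "8"])
partial = list_dissimilarity(["1", "7", "8"], ["1", "3"])
disjoint = list_dissimilarity(["0", "4"], ["5", "6", "9"])
print(same, partial, disjoint)
```

Identical lists score 0 and lists with nothing in common score 1, so this term is on the same scale as the 0/1 mismatch used for the single-valued columns.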

<div style="margin-top: 6px; border: 1px solid #cfcfcf; padding: 8px 12px; border-radius:2px; background-color:#f7f7f7; ">
Next we run the ISODATA algorithm:
</div>

1) Set the values of $k_{init}, n_{min}, I_{max}, \sigma_{max}, L_{min}$ and $P_{max}$:

In [4]:
K_INIT = 5
N_MIN = 15
I_MAX = 10
S_MAX = .95 # The standard deviation is normalized
L_MIN = .75 # Distances are normalized
P_MAX = 2

NUM_CLUSTERS = K_INIT # value of k
iteration = 0

2) Arbitrarily select *k* points in the feature space as the initial cluster centers (centroids or centers of mass).

In [5]:
# Initialize the centroids
centroids = df.sample(n=NUM_CLUSTERS).reset_index(drop=True)
display(centroids)

Unnamed: 0,Hobby,OpenSource,Country,Student,Employment,FormalEducation,UndergradMajor,CompanySize,DevType,YearsCoding,YearsCodingProf,JobSatisfaction,CareerSatisfaction,HopeFiveYears,JobSearchStatus,LastNewJob,UpdateCV,ConvertedSalary,CommunicationTools,TimeFullyProductive,EducationTypes,SelfTaughtTypes,HackathonReasons,AgreeDisagree1,AgreeDisagree2,AgreeDisagree3,LanguageDesireNextYear,DatabaseWorkedWith,DatabaseDesireNextYear,PlatformWorkedWith,PlatformDesireNextYear,FrameworkWorkedWith,FrameworkDesireNextYear,IDE,OperatingSystem,NumberMonitors,Methodology,VersionControl,CheckInCode,AdBlocker,AdBlockerDisable,AdBlockerReasons,AdsAgreeDisagree1,AdsAgreeDisagree2,AdsAgreeDisagree3,AdsActions,AIDangerous,AIInteresting,AIResponsible,AIFuture,EthicsChoice,EthicsReport,EthicsResponsible,EthicalImplications,StackOverflowRecommend,StackOverflowVisit,StackOverflowHasAccount,StackOverflowParticipate,StackOverflowJobs,StackOverflowDevStory,StackOverflowJobsRecommend,StackOverflowConsiderMember,HypotheticalTools1,HypotheticalTools2,HypotheticalTools3,HypotheticalTools4,HypotheticalTools5,WakeTime,HoursComputer,HoursOutside,SkipMeals,ErgonomicDevices,Exercise,EducationParents,RaceEthnicity,Age,Dependents,SurveyTooLong,SurveyEasy,LanguageWorkedWith,AssessJob,AssessBenefits,JobContactPriorities,JobEmailPriorities,AdsPriorities,Cluster
0,1,1,USA,2,1,8,6,2,"[0, 11, 12, 13, 15, 19, 20]",10,0,3,7,6.0,1,3.0,,0.0,[],,"[1, 3, 7, 8]","[0, 3, 5, 7, 8]","[0, 2, 3, 4]",2.0,2.0,1.0,"[0, 1, 14, 18, 2, 33, 4, 5]",[],[],"[18, 25]","[15, 18, 25, 3]",[5],[5],"[15, 20, 6]",2.0,2.0,[],"[0, 1, 6]",4.0,2.0,2.0,[1],4.0,0.0,1.0,"[2, 3]",0.0,3.0,2.0,1.0,1.0,0.0,1.0,1.0,10.0,1.0,2.0,4.0,1.0,2.0,5.0,1.0,0.0,2.0,2.0,3.0,0.0,7.0,2.0,3.0,0.0,[],3.0,2.0,[6],0.0,0.0,1.0,0.0,"[1, 14, 17, 18, 33, 5]","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0]","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]","[0, 0, 0, 0, 0]","[0, 0, 0, 0, 0, 0, 0]",[],
1,1,0,DEU,0,5,3,6,8,"[0, 12, 4, 6]",3,1,5,7,3.0,0,3.0,5.0,0.0,[],,[],[],[],,,,[],[],[],[],[],[],[],[],,,[],[],,,,[],,,,[],,,,,,,,,,,,,,,,,,,,,,,,,,[],,,[],,,,,[],"[10, 1, 5, 7, 2, 9, 4, 8, 6, 3]","[10, 8, 7, 6, 11, 9, 1, 4, 5, 3, 2]","[1, 2, 3, 5, 4]","[1, 4, 7, 3, 6, 5, 2]",[],
2,1,1,GBR,0,0,1,6,7,"[0, 1, 10, 11, 12, 15, 17, 18, 2, 20, 4, 5, 6, 7, 8]",8,8,1,1,0.0,1,4.0,7.0,0.298645,"[10, 5, 7]",3.0,"[1, 7, 8]","[0, 1, 3, 5, 7, 8]",[],2.0,2.0,4.0,"[1, 12, 13, 14, 17, 18, 21, 23, 25, 26, 27, 29, 31, 33, 34, 5, 6, 8]",[14],"[11, 14, 17]","[0, 14, 15, 20]","[0, 14, 15, 2, 20, 25]",[8],[8],"[17, 4]",2.0,3.0,[],"[1, 4]",2.0,1.0,0.0,[0],1.0,2.0,3.0,[3],0.0,1.0,0.0,1.0,1.0,3.0,2.0,2.0,10.0,5.0,2.0,4.0,2.0,0.0,10.0,2.0,0.0,2.0,0.0,0.0,0.0,4.0,2.0,0.0,3.0,[0],1.0,6.0,[6],3.0,0.0,0.0,4.0,"[1, 13, 14, 17, 18, 25, 26, 27, 29, 31, 5]","[1, 8, 2, 3, 5, 9, 4, 7, 10, 6]","[2, 7, 4, 10, 5, 1, 9, 3, 11, 8, 6]","[0, 0, 0, 0, 0]","[0, 0, 0, 0, 0, 0, 0]",[],
3,1,0,PHL,0,0,1,9,1,"[12, 7]",7,11,5,5,,1,,,0.0,[],,[],[],[],,,,[],[],[],[],[],[],[],[],,,[],[],,,,[],,,,[],,,,,,,,,10.0,2.0,2.0,0.0,2.0,2.0,10.0,2.0,,,,,,,,,,[],,,[],,,,,[],"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0]","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]","[0, 0, 0, 0, 0]","[0, 0, 0, 0, 0, 0, 0]",[],
4,1,1,GBR,0,0,3,6,6,"[0, 20, 7]",1,7,0,0,6.0,2,4.0,4.0,0.29517,"[0, 10, 2, 5, 7]",0.0,"[1, 3, 8]","[2, 3, 4, 5, 7]","[0, 3, 4, 5, 6]",2.0,2.0,4.0,"[25, 26, 4]","[10, 12, 13, 14, 17, 18, 20]",[18],"[14, 20, 22, 24, 5]","[0, 14, 20]",[],[],"[0, 1, 10, 17, 9]",3.0,3.0,"[0, 4, 8, 9]","[1, 6]",2.0,2.0,2.0,[1],4.0,0.0,3.0,"[2, 3]",1.0,3.0,2.0,2.0,0.0,2.0,2.0,1.0,10.0,5.0,2.0,4.0,2.0,2.0,0.0,1.0,3.0,2.0,1.0,4.0,1.0,5.0,4.0,0.0,0.0,[],3.0,6.0,[6],1.0,0.0,1.0,2.0,"[1, 17, 25, 26, 29, 31, 6]","[7, 9, 3, 1, 6, 2, 4, 5, 10, 8]","[1, 8, 7, 9, 11, 5, 4, 2, 10, 3, 6]","[5, 2, 4, 1, 3]","[6, 5, 2, 3, 1, 7, 4]","[7, 4, 2, 1, 6, 3, 5]",


3) Assign each point in the data set to the cluster whose centroid is closest to it.

In [6]:
elim = False
members = []

def update_clusters():
    global NUM_CLUSTERS, elim, members, centroids
    changed = False
    cluster_col_index = df.shape[1] - 1
    
    if TALK :
        print("Actualizando clusters")
    for index, row in df.iterrows():
        minDistance = LARGER_DISTANCE
        currentCluster = 0
        
        # Find the smallest distance from the point to a centroid
        for i, r in centroids.iterrows():
            dist = distance_qual(row, r)
            if(dist < minDistance):
                minDistance = dist
                currentCluster = i
        
        # If the assignment changed, apply it and raise the 'changed' flag
        if(pd.isnull(row['Cluster']) or row['Cluster'] != currentCluster):
            df.iloc[index, cluster_col_index] = currentCluster
            changed = True  
            
    # Count the members of each cluster
    members = [0] * NUM_CLUSTERS
    for i in range(NUM_CLUSTERS):
        members[i] = df[df["Cluster"]==i].count()["Cluster"]
        if (TALK) : 
            print("El cluster ", i, " incluye ", members[i], "miembros.")
    if (TALK) : 
        print()

    to_eliminate = []
    for j in range(NUM_CLUSTERS):
        if members[j] < N_MIN:
            to_eliminate.append(j)
    if len(to_eliminate) > 0:
        elim = True
        if (TALK) : 
            print("Clusters a eliminar:", to_eliminate)
        # Remove the selected centroids
        centroids.drop(to_eliminate, inplace=True)    
        centroids = centroids.reset_index(drop=True)
        NUM_CLUSTERS = centroids.shape[0]
        changed = True
    else :
        elim = False
        
    if changed:
        for index, row in df.iterrows():
            minDistance = LARGER_DISTANCE
            currentCluster = 0

            for j, rc in centroids.iterrows():
                dist = distance_qual(row, rc)
                #print(j, dist, currentCluster, minDistance)
                if(dist < minDistance):
                    minDistance = dist
                    currentCluster = j
            
            if(pd.isnull(row['Cluster']) or row['Cluster'] != currentCluster):
                df.iloc[index, cluster_col_index] = currentCluster
                
        # Count the members of each cluster
        members = [0] * NUM_CLUSTERS
        for i in range(NUM_CLUSTERS):
            members[i] = df[df["Cluster"]==i].count()["Cluster"]
            if (TALK) : 
                print("El cluster ", i, " incluye ", members[i], "miembros.")
        if (TALK) : 
            print()
        
    return changed

# --------------------------
# Update the clusters
KEEP_WALKING = update_clusters()

Actualizando clusters
El cluster  0  incluye  19 miembros.
El cluster  1  incluye  13 miembros.
El cluster  2  incluye  31 miembros.
El cluster  3  incluye  16 miembros.
El cluster  4  incluye  21 miembros.

Clusters a eliminar: [1]
El cluster  0  incluye  19 miembros.
El cluster  1  incluye  31 miembros.
El cluster  2  incluye  29 miembros.
El cluster  3  incluye  21 miembros.
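
The assignment step, and the removal of clusters with fewer than `N_MIN` members, can be sketched on toy data without pandas. Everything here is an illustrative assumption: the helper names, the toy records, and the simple matching distance that stands in for `distance_qual`.

```python
def matching_distance(x, y):
    """Fraction of positions where two equal-length records differ."""
    return sum(a != b for a, b in zip(x, y)) / len(x)


def assign_to_nearest(points, centers, distance):
    """Return, for each point, the index of its closest center.

    Ties go to the lowest index, as with the strict '<' test in update_clusters.
    """
    labels = []
    for p in points:
        dists = [distance(p, c) for c in centers]
        labels.append(dists.index(min(dists)))
    return labels


def drop_small_clusters(centers, labels, n_min):
    """Keep only the centers that attracted at least n_min points."""
    counts = [labels.count(j) for j in range(len(centers))]
    return [c for c, n in zip(centers, counts) if n >= n_min]


# Hypothetical records: (country, hobby, open source)
toy_points = [("USA", 1, 0), ("USA", 1, 1), ("IND", 0, 0),
              ("IND", 0, 1), ("DEU", 1, 0)]
toy_centers = [("USA", 1, 0), ("IND", 0, 0), ("DEU", 0, 1)]

toy_labels = assign_to_nearest(toy_points, toy_centers, matching_distance)
kept_centers = drop_small_clusters(toy_centers, toy_labels, n_min=2)
# After dropping a center, every point is assigned again
final_labels = assign_to_nearest(toy_points, kept_centers, matching_distance)
print(toy_labels, kept_centers, final_labels)
```

As in the run above, dropping an undersized cluster renumbers the surviving centroids, which is why all points are reassigned afterwards.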



4) Compute the centroids from the points in each cluster. 

In [7]:
def update_centroids():
    global centroids
    
    for cl_j in range(NUM_CLUSTERS):        
        # Select the records in cluster cl_j
        df_clusterj = df[df["Cluster"] == cl_j]
        
        centroids.loc[centroids.index[cl_j]] = get_centroide(df_clusterj).loc[0]        
    return

def get_centroide(data):
    # Copy the table structure
    df2 = pd.DataFrame(data=None, columns=data.columns)
    #df2.append(pd.Series([np.nan]), ignore_index = True)

    col = 'ConvertedSalary'
    df2.at[0, col] = data[col].mean()

    # Mode for the 'simple' columns (in var_str)
    mode = data[var_str].mode()
    for col in mode:
        df2.at[0, col] = mode[col].values[0]

    # Mode for variable-length list columns (in var_list)
    for col in var_list:
        mean_len = 0
        vars_list = []
        for index, row in data.iterrows():
            mean_len += len(row[col])
            vars_list = vars_list + row[col]
        mean_len /= data.shape[0]
        counter = Counter(vars_list)
        mean_list = []
        for v in counter.most_common(round(mean_len + 0.5)):
            mean_list.append(v[0])
        df2.at[0, col] = mean_list


    # Mode for fixed-length list columns (in var_ranks)
    ranges = [11, 12, 6, 8, 8]
    # For each variable in var_ranks, take the number of components in its vector
    # and the column name
    for i, col in zip(range(len(ranges)), var_ranks):
        # Build a matrix (a list of lists, really) with one row per vector
        # component. Each row collects every value seen at that position
        position_values = []
        for j in range(ranges[i] - 1):
            position_values.append([])

        # Walk all records currently in the cluster to fill the matrix
        for index, row in data.iterrows():
            # If the variable's vector is not empty...
            if len(row[col]) > 0:
                # For each component of the vector...
                for j in range(len(row[col])):
                    # If it is not '0'
                    if row[col][j] != '0':
                        # Add it to the matching row of the matrix
                        position_values[j].append(row[col][j])

        # Count occurrences at each position. Build a matrix whose rows hold the
        # values for one position, ordered from most to least frequent
        most_commons = []
        for j in range(ranges[i] - 1):
            counter = Counter(position_values[j])
            most_commons.append(counter.most_common())

        # Initialize the vector with the most popular value of the first component.
        # Start from an empty list so values from a previous column never leak in
        vars_list = []
        if len(most_commons) > 0 and len(most_commons[0]) > 0:
            vars_list = [most_commons[0][0][0]]
            # For each component from the second one on...
            for j in range(1, ranges[i] - 1):
                # Look for the most common value...
                for c in most_commons[j]:
                    # As long as it has not been used already...
                    if c[0] not in vars_list[:j]:
                        # Add it to the vector and...
                        vars_list.append(c[0])
                        # Stop searching.
                        break

        # Fill any remaining positions with the ranks not used yet
        if len(vars_list) < ranges[i] - 1:
            for rank in range(1, ranges[i]):
                if str(rank) not in vars_list:
                    vars_list.append(str(rank))
        df2.at[0, col] = vars_list

    return df2

# --------------------------
# Update the centroids
update_centroids()
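
For the variable-length list columns, `get_centroide` builds a "mode list": it pools every element from the cluster's lists and keeps the most frequent ones, as many as the rounded average list length. A minimal standalone sketch of that idea follows. The name `list_centroid` and the toy lists are assumptions made for illustration.

```python
from collections import Counter


def list_centroid(lists):
    """Prototype list for a cluster of variable-length lists.

    Keeps the most frequent elements; how many is set by
    round(mean length + 0.5), the same rule as in get_centroide.
    """
    mean_len = sum(len(v) for v in lists) / len(lists)
    counts = Counter(item for v in lists for item in v)
    return [item for item, _ in counts.most_common(round(mean_len + 0.5))]


# Hypothetical coded answers of four cluster members
toy_lists = [["14", "17"], ["14", "19"], ["14", "17", "2"], ["17", "19"]]
print(list_centroid(toy_lists))
```

Here the mean length is 2.25, so three elements are kept: `"14"` and `"17"` (three occurrences each) and `"19"` (two occurrences).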

In [8]:
display(centroids)

Unnamed: 0,Hobby,OpenSource,Country,Student,Employment,FormalEducation,UndergradMajor,CompanySize,DevType,YearsCoding,YearsCodingProf,JobSatisfaction,CareerSatisfaction,HopeFiveYears,JobSearchStatus,LastNewJob,UpdateCV,ConvertedSalary,CommunicationTools,TimeFullyProductive,EducationTypes,SelfTaughtTypes,HackathonReasons,AgreeDisagree1,AgreeDisagree2,AgreeDisagree3,LanguageDesireNextYear,DatabaseWorkedWith,DatabaseDesireNextYear,PlatformWorkedWith,PlatformDesireNextYear,FrameworkWorkedWith,FrameworkDesireNextYear,IDE,OperatingSystem,NumberMonitors,Methodology,VersionControl,CheckInCode,AdBlocker,AdBlockerDisable,AdBlockerReasons,AdsAgreeDisagree1,AdsAgreeDisagree2,AdsAgreeDisagree3,AdsActions,AIDangerous,AIInteresting,AIResponsible,AIFuture,EthicsChoice,EthicsReport,EthicsResponsible,EthicalImplications,StackOverflowRecommend,StackOverflowVisit,StackOverflowHasAccount,StackOverflowParticipate,StackOverflowJobs,StackOverflowDevStory,StackOverflowJobsRecommend,StackOverflowConsiderMember,HypotheticalTools1,HypotheticalTools2,HypotheticalTools3,HypotheticalTools4,HypotheticalTools5,WakeTime,HoursComputer,HoursOutside,SkipMeals,ErgonomicDevices,Exercise,EducationParents,RaceEthnicity,Age,Dependents,SurveyTooLong,SurveyEasy,LanguageWorkedWith,AssessJob,AssessBenefits,JobContactPriorities,JobEmailPriorities,AdsPriorities,Cluster
0,1,1,USA,0,0,7,6,2,"[0, 19, 11]",9,0,5,5,6,1,3,7,0.076349,"[7, 8]",3.0,"[8, 1, 3]","[7, 5, 0, 8]","[0, 4]",0,2,1,"[1, 18, 2, 27, 4, 14]","[14, 17]","[17, 14]","[14, 25]","[14, 2, 18]",[5],"[5, 9]","[17, 6, 1]",1,2,"[0, 9]","[1, 0]",0,2,3,"[6, 5]",1,1,0,"[3, 2]",3,3,3,1,1,0,2,2,10,1,2,4,1,1,5,2,2,2,2,0,4,5,2,2,3,[2],3,2.0,[6],0,0.0,0,2,"[18, 1, 2, 27, 4, 14, 17]","[6, 8, 2, 7, 3, 1, 9, 4, 5, 10]","[1, 7, 8, 9, 6, 5, 4, 2, 3, 10, 11]","[2, 1, 4, 3, 5]","[3, 5, 1, 4, 2, 6, 7]","[1, 3, 2, 5, 6, 7, 4]",
1,1,0,USA,0,0,1,6,8,"[0, 12, 11, 6]",7,7,3,1,0,2,4,7,0.155259,"[8, 6]",3.0,"[8, 5, 1]","[7, 5, 3, 8]",[0],0,1,4,"[27, 5, 31, 14, 18, 1]","[14, 17, 19]","[17, 14, 18]","[14, 22, 0]","[14, 2, 22]","[5, 6]","[5, 6]","[17, 19, 18]",3,2,"[0, 9]","[1, 4]",2,2,3,"[0, 6]",0,0,0,"[3, 2]",0,1,3,1,1,3,2,2,10,2,2,4,2,1,5,2,0,2,2,3,0,6,2,0,3,[0],1,1.0,[6],1,0.0,0,4,"[18, 14, 5, 31, 27, 1, 17]","[7, 10, 1, 5, 2, 6, 4, 3, 9, 8]","[1, 6, 4, 10, 8, 2, 7, 3, 11, 9, 5]","[4, 1, 5, 3, 2]","[1, 5, 3, 2, 6, 7, 4]","[1, 5, 3, 4, 6, 7, 2]",
2,1,1,IND,0,0,1,6,8,"[0, 12, 11]",7,11,5,5,3,1,5,1,0.0,[7],,[1],[0],[],0,3,1,"[17, 31]","[14, 19]",[19],[14],[14],[0],[1],"[3, 18]",3,2,[0],[1],2,1,0,[0],3,1,1,[3],0,1,3,1,0,0,2,2,10,5,2,0,2,2,10,2,1,3,1,1,3,6,1,0,3,[1],0,,[],0,,1,0,"[17, 31, 5]","[10, 1, 5, 7, 2, 9, 4, 8, 6, 3]","[10, 8, 7, 6, 11, 9, 1, 4, 5, 3, 2]","[1, 2, 3, 5, 4]","[1, 4, 7, 3, 6, 5, 2]","[6, 5, 2, 4, 1, 7, 3]",
3,1,1,IND,0,0,1,6,8,"[0, 12, 11, 6]",9,7,3,3,2,2,4,1,0.075476,"[5, 4, 0]",0.0,"[8, 5, 1, 7]","[7, 5, 8, 0]","[0, 4]",0,2,1,"[18, 14, 17, 27, 5, 12]","[14, 13, 19]","[13, 14, 17]","[14, 24, 2]","[14, 0, 18, 2]","[5, 6]","[5, 6]","[10, 19, 18]",3,1,"[0, 4, 9]","[1, 4]",2,2,3,"[1, 5]",1,0,3,"[3, 0]",1,3,3,1,0,0,2,2,10,2,2,4,2,2,5,2,3,2,0,4,1,6,2,0,0,[0],3,2.0,[6],1,0.0,1,2,"[14, 18, 5, 31, 17, 1, 3]","[7, 9, 6, 1, 2, 4, 3, 5, 10, 8]","[1, 8, 3, 5, 11, 9, 6, 4, 10, 7, 2]","[2, 1, 5, 4, 3]","[6, 4, 5, 2, 1, 7, 3]","[1, 4, 3, 5, 7, 6, 2]",


In [9]:
deltas = []
delta = 0
def update_deltas():
    global deltas, delta, centroids
    deltas = [0] * NUM_CLUSTERS
    delta = 0  # reset so repeated calls do not accumulate
    N = 0
    for j, rc in centroids.iterrows():
        n = 0
        for i, row in df[df["Cluster"]==j].iterrows():
            deltas[j] += distance_qual(row, rc)
            n += 1
        delta += deltas[j]
        deltas[j] /= n
        N += n
    delta /= N
    
    if TALK : 
        print("Las distancias medias en cada cluster son:\n", deltas)   
        print("\nLa distancia media promedio es:", delta)   
        
    return

update_deltas()

Las distancias medias en cada cluster son:
 [0.5587983252515673, 0.5522677611527127, 0.7951183030884567, 0.5258222522628587]

La distancia media promedio es: 0.6183816686259914
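
The overall figure is the size-weighted mean of the per-cluster means, not their plain average. The check below reproduces it from the values printed above and from the cluster sizes reported after the reassignment (19, 31, 29 and 21). The helper name `overall_mean` is an assumption for this sketch.

```python
def overall_mean(cluster_means, cluster_sizes):
    """Size-weighted mean of per-cluster mean distances."""
    total = sum(m * n for m, n in zip(cluster_means, cluster_sizes))
    return total / sum(cluster_sizes)


# Values copied from the printed output
cluster_means = [0.5587983252515673, 0.5522677611527127,
                 0.7951183030884567, 0.5258222522628587]
cluster_sizes = [19, 31, 29, 21]

weighted = overall_mean(cluster_means, cluster_sizes)
unweighted = sum(cluster_means) / len(cluster_means)
print(weighted, unweighted)
```

The weighted value agrees with the printed 0.618, while the unweighted average of the four means is about 0.608.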


In [10]:
import math

def std_dev():
    # Initialize the table of standard deviations... the copied values are only placeholders
    std_vectors = centroids.copy()
    
    for c in range(NUM_CLUSTERS) :
        df_c = df[(df["Cluster"]==c)]
        
        # For the numeric variable, keep only the records where it is not null...
        df_cj = df_c[pd.notnull(df_c['ConvertedSalary'])]

        s = math.sqrt(sum(abs(df_cj["ConvertedSalary"] - 
                              centroids.iloc[c]["ConvertedSalary"])) / (df_cj.shape[0] - 1))
        std_vectors.loc[c, "ConvertedSalary"] = s
        
        for col in var_str:
            diff = sum(df_cj[col] != centroids.iloc[c][col])
            s = math.sqrt(diff / (df_cj.shape[0] - 1))
            std_vectors.loc[c, col] = s
        
        for col in var_list:
            y = centroids.iloc[c][col]
            diff = 0
            for i, row in df_cj.iterrows():
                x = row[col]
                num_vars = len(x) + len(y)
                if num_vars > 0:
                    diff += (2*len(set(x + y)) - num_vars) / num_vars
            s = math.sqrt(diff / (df_cj.shape[0] - 1))
            std_vectors.loc[c, col] = s
        
        for col in var_ranks:
            y = centroids.iloc[c][col]
            for i, row in df_cj.iterrows():
                diff = 0
                x = row[col]
                max_vars = max(len(x), len(y))
                if len(x) != 0 and len(y) != 0:
                    for v in range(len(x)):
                        if x[v] != y[v]:
                            diff += 1
                else:
                    diff += max_vars

                if diff != 0:
                    diff /= max_vars
            s = math.sqrt(diff / (df_cj.shape[0] - 1))
            std_vectors.loc[c, col] = s
         
    return std_vectors

display(std_dev())

Unnamed: 0,Hobby,OpenSource,Country,Student,Employment,FormalEducation,UndergradMajor,CompanySize,DevType,YearsCoding,YearsCodingProf,JobSatisfaction,CareerSatisfaction,HopeFiveYears,JobSearchStatus,LastNewJob,UpdateCV,ConvertedSalary,CommunicationTools,TimeFullyProductive,EducationTypes,SelfTaughtTypes,HackathonReasons,AgreeDisagree1,AgreeDisagree2,AgreeDisagree3,LanguageDesireNextYear,DatabaseWorkedWith,DatabaseDesireNextYear,PlatformWorkedWith,PlatformDesireNextYear,FrameworkWorkedWith,FrameworkDesireNextYear,IDE,OperatingSystem,NumberMonitors,Methodology,VersionControl,CheckInCode,AdBlocker,AdBlockerDisable,AdBlockerReasons,AdsAgreeDisagree1,AdsAgreeDisagree2,AdsAgreeDisagree3,AdsActions,AIDangerous,AIInteresting,AIResponsible,AIFuture,EthicsChoice,EthicsReport,EthicsResponsible,EthicalImplications,StackOverflowRecommend,StackOverflowVisit,StackOverflowHasAccount,StackOverflowParticipate,StackOverflowJobs,StackOverflowDevStory,StackOverflowJobsRecommend,StackOverflowConsiderMember,HypotheticalTools1,HypotheticalTools2,HypotheticalTools3,HypotheticalTools4,HypotheticalTools5,WakeTime,HoursComputer,HoursOutside,SkipMeals,ErgonomicDevices,Exercise,EducationParents,RaceEthnicity,Age,Dependents,SurveyTooLong,SurveyEasy,LanguageWorkedWith,AssessJob,AssessBenefits,JobContactPriorities,JobEmailPriorities,AdsPriorities,Cluster
0,0.25,0.661438,0.829156,0.707107,0.790569,0.866025,0.790569,0.866025,0.831836,0.866025,0.75,0.790569,0.866025,0.75,0.707107,0.790569,0.790569,0.33717,0.888585,0.829156,0.636677,0.645881,0.850245,0.829156,0.901388,0.790569,0.780626,0.893262,0.929189,0.817771,0.829425,0.85999,0.838437,0.828857,0.829156,0.612372,0.770359,0.622495,0.901388,0.433013,0.661438,0.83666,0.829156,0.790569,0.790569,0.622495,0.829156,0.707107,0.707107,0.5,0.5,0.75,0.790569,0.559017,0.5,0.790569,0.25,0.829156,0.612372,0.75,0.5,0.707107,0.866025,0.866025,0.829156,0.829156,0.790569,0.866025,0.790569,0.829156,0.661438,0.978945,0.75,0.75,0.5,0.661438,0.5,0.75,0.790569,0.708171,0.176777,0.238366,0.25,0.188982,0.25,
1,0.4,0.69282,0.848528,0.4,0.52915,0.69282,0.774597,0.87178,0.734922,0.916515,0.894427,0.748331,0.774597,0.8,0.663325,0.848528,0.663325,0.408203,0.80884,0.8,0.65276,0.609935,0.956681,0.663325,0.848528,0.8,0.752406,0.760597,0.801962,0.790539,0.850009,0.836394,0.840635,0.843726,0.8,0.8,0.740013,0.653197,0.565685,0.632456,0.774597,0.794984,0.87178,0.824621,0.848528,0.749222,0.8,0.8,0.824621,0.282843,0.565685,0.72111,0.565685,0.34641,0.6,0.87178,0.4,0.824621,0.663325,0.894427,0.72111,0.69282,0.848528,0.8,0.87178,0.848528,0.894427,0.824621,0.663325,0.72111,0.52915,0.848528,0.774597,0.894427,0.52915,0.8,0.565685,0.632456,0.774597,0.670149,0.2,0.2,0.2,0.169031,0.169031,
2,0.0,0.603023,0.92932,0.564076,0.6742,0.797724,0.768706,0.797724,0.804479,0.768706,0.639602,0.603023,0.639602,0.977008,0.639602,0.904534,0.977008,0.0,1.0,1.02247,0.984732,1.01504,0.0,0.953463,0.977008,0.977008,0.964734,0.945217,0.984292,0.993485,0.992395,1.00486,0.975012,0.942618,0.879049,0.92932,0.964859,0.89612,0.825723,0.904534,0.904534,0.904534,0.953463,0.953463,1.0,0.961375,0.92932,1.0,0.977008,0.953463,0.953463,0.92932,0.977008,0.977008,0.879049,0.904534,0.797724,0.92932,0.879049,0.953463,0.92932,0.904534,1.0,0.977008,0.977008,0.977008,0.977008,0.977008,0.977008,0.953463,0.92932,0.977008,1.0,1.02247,0.0,1.0,1.02247,0.953463,1.0,0.930812,0.213201,0.213201,0.213201,0.213201,0.213201,
3,0.242536,0.641689,0.874475,0.242536,0.342997,0.840168,0.685994,0.939336,0.749883,0.874475,0.840168,0.766965,0.766965,0.840168,0.485071,0.727607,0.874475,0.311017,0.76596,0.840168,0.603276,0.617169,0.874742,0.685994,0.8044,0.907485,0.782408,0.797405,0.860916,0.814258,0.847878,0.870944,0.849122,0.791204,0.594089,0.840168,0.658387,0.653797,0.420084,0.242536,0.685994,0.801958,0.907485,0.840168,0.874475,0.701539,0.840168,0.766965,0.641689,0.641689,0.641689,0.641689,0.685994,0.485071,0.641689,0.840168,0.420084,0.8044,0.727607,0.874475,0.685994,0.685994,0.766965,0.840168,0.874475,0.594089,0.874475,0.874475,0.8044,0.727607,0.8044,0.840168,0.641689,0.874475,0.727607,0.727607,0.542326,0.542326,0.766965,0.623883,0.21693,0.206835,0.153393,0.129641,0.224544,
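
For a single-valued categorical column, `std_dev` uses the number of records that differ from the centroid in place of squared deviations. The sketch below isolates that term. The name `categorical_spread` and the sample values are illustrative assumptions, not data from the survey.

```python
import math


def categorical_spread(values, prototype):
    """Spread of a categorical column around its prototype value.

    Same expression as the var_str branch of std_dev:
    sqrt(number of mismatches / (n - 1)).
    """
    mismatches = sum(v != prototype for v in values)
    return math.sqrt(mismatches / (len(values) - 1))


# Hypothetical cluster: 17 records, one of which disagrees with the prototype
one_off = categorical_spread([1] * 16 + [0], prototype=1)
# Hypothetical cluster where every record disagrees with the prototype
all_off = categorical_spread([0] * 17, prototype=1)
print(one_off, all_off)
```

Because the divisor is `n - 1`, a column in which every record disagrees with the centroid scores slightly above 1, which is how values such as 1.02 can appear in the table.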


In [11]:
def divide_clusters():
    global NUM_CLUSTERS, centroids

    if TALK :
        display(centroids)
    
    # Compute the standard deviations
    sigma_vect = std_dev()   
    if TALK :
        display(sigma_vect)
    
    candidates = []
    for c, s_row in sigma_vect.iterrows():
        for col in s_row:
            if col > S_MAX :
                candidates.append(c)
                break # An attribute with a high sigma was already found

    if TALK :
        print("Posibles clusters a dividir:", candidates)
    
    divided = False
    to_eliminate = []
    for c in candidates:
        cond = NUM_CLUSTERS < K_INIT/2 or (deltas[c] > delta and members[c] > 2 * N_MIN)
        if cond: 
            d = 0
            # Pick two points that are "far enough apart"; not optimal,
            # but good candidates at low cost
            count = 0
            while d < deltas[c] and count < 5000:
                s1 = df[df["Cluster"]==c].sample(n=2)
                d = distance_qual(s1.iloc[0], s1.iloc[1])
                count += 1
            if count < 5000:
                to_eliminate.append(c)
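                # DataFrame.append was removed in pandas 2.0; ignore_index keeps the
                # labels positional so the later drop() only hits the split cluster
                centroids = pd.concat([centroids, s1], ignore_index=True)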
                NUM_CLUSTERS += 1
                
#    display("observando", to_eliminate, NUM_CLUSTERS, centroids)
            
    if len(to_eliminate) > 0 :
        if TALK : 
            print("Clusters a eliminar:", to_eliminate)
            print("")
        centroids.drop(to_eliminate, inplace=True)
        centroids = centroids.reset_index(drop=True)
        update_clusters()
        update_centroids()
        if TALK : 
            display(centroids)
            print("")
            
    return 

divide_clusters()    

Unnamed: 0,Hobby,OpenSource,Country,Student,Employment,FormalEducation,UndergradMajor,CompanySize,DevType,YearsCoding,YearsCodingProf,JobSatisfaction,CareerSatisfaction,HopeFiveYears,JobSearchStatus,LastNewJob,UpdateCV,ConvertedSalary,CommunicationTools,TimeFullyProductive,EducationTypes,SelfTaughtTypes,HackathonReasons,AgreeDisagree1,AgreeDisagree2,AgreeDisagree3,LanguageDesireNextYear,DatabaseWorkedWith,DatabaseDesireNextYear,PlatformWorkedWith,PlatformDesireNextYear,FrameworkWorkedWith,FrameworkDesireNextYear,IDE,OperatingSystem,NumberMonitors,Methodology,VersionControl,CheckInCode,AdBlocker,AdBlockerDisable,AdBlockerReasons,AdsAgreeDisagree1,AdsAgreeDisagree2,AdsAgreeDisagree3,AdsActions,AIDangerous,AIInteresting,AIResponsible,AIFuture,EthicsChoice,EthicsReport,EthicsResponsible,EthicalImplications,StackOverflowRecommend,StackOverflowVisit,StackOverflowHasAccount,StackOverflowParticipate,StackOverflowJobs,StackOverflowDevStory,StackOverflowJobsRecommend,StackOverflowConsiderMember,HypotheticalTools1,HypotheticalTools2,HypotheticalTools3,HypotheticalTools4,HypotheticalTools5,WakeTime,HoursComputer,HoursOutside,SkipMeals,ErgonomicDevices,Exercise,EducationParents,RaceEthnicity,Age,Dependents,SurveyTooLong,SurveyEasy,LanguageWorkedWith,AssessJob,AssessBenefits,JobContactPriorities,JobEmailPriorities,AdsPriorities,Cluster
0,1,1,USA,0,0,7,6,2,"[0, 19, 11]",9,0,5,5,6,1,3,7,0.076349,"[7, 8]",3.0,"[8, 1, 3]","[7, 5, 0, 8]","[0, 4]",0,2,1,"[1, 18, 2, 27, 4, 14]","[14, 17]","[17, 14]","[14, 25]","[14, 2, 18]",[5],"[5, 9]","[17, 6, 1]",1,2,"[0, 9]","[1, 0]",0,2,3,"[6, 5]",1,1,0,"[3, 2]",3,3,3,1,1,0,2,2,10,1,2,4,1,1,5,2,2,2,2,0,4,5,2,2,3,[2],3,2.0,[6],0,0.0,0,2,"[18, 1, 2, 27, 4, 14, 17]","[6, 8, 2, 7, 3, 1, 9, 4, 5, 10]","[1, 7, 8, 9, 6, 5, 4, 2, 3, 10, 11]","[2, 1, 4, 3, 5]","[3, 5, 1, 4, 2, 6, 7]","[1, 3, 2, 5, 6, 7, 4]",
1,1,0,USA,0,0,1,6,8,"[0, 12, 11, 6]",7,7,3,1,0,2,4,7,0.155259,"[8, 6]",3.0,"[8, 5, 1]","[7, 5, 3, 8]",[0],0,1,4,"[27, 5, 31, 14, 18, 1]","[14, 17, 19]","[17, 14, 18]","[14, 22, 0]","[14, 2, 22]","[5, 6]","[5, 6]","[17, 19, 18]",3,2,"[0, 9]","[1, 4]",2,2,3,"[0, 6]",0,0,0,"[3, 2]",0,1,3,1,1,3,2,2,10,2,2,4,2,1,5,2,0,2,2,3,0,6,2,0,3,[0],1,1.0,[6],1,0.0,0,4,"[18, 14, 5, 31, 27, 1, 17]","[7, 10, 1, 5, 2, 6, 4, 3, 9, 8]","[1, 6, 4, 10, 8, 2, 7, 3, 11, 9, 5]","[4, 1, 5, 3, 2]","[1, 5, 3, 2, 6, 7, 4]","[1, 5, 3, 4, 6, 7, 2]",
2,1,1,IND,0,0,1,6,8,"[0, 12, 11]",7,11,5,5,3,1,5,1,0.0,[7],,[1],[0],[],0,3,1,"[17, 31]","[14, 19]",[19],[14],[14],[0],[1],"[3, 18]",3,2,[0],[1],2,1,0,[0],3,1,1,[3],0,1,3,1,0,0,2,2,10,5,2,0,2,2,10,2,1,3,1,1,3,6,1,0,3,[1],0,,[],0,,1,0,"[17, 31, 5]","[10, 1, 5, 7, 2, 9, 4, 8, 6, 3]","[10, 8, 7, 6, 11, 9, 1, 4, 5, 3, 2]","[1, 2, 3, 5, 4]","[1, 4, 7, 3, 6, 5, 2]","[6, 5, 2, 4, 1, 7, 3]",
3,1,1,IND,0,0,1,6,8,"[0, 12, 11, 6]",9,7,3,3,2,2,4,1,0.075476,"[5, 4, 0]",0.0,"[8, 5, 1, 7]","[7, 5, 8, 0]","[0, 4]",0,2,1,"[18, 14, 17, 27, 5, 12]","[14, 13, 19]","[13, 14, 17]","[14, 24, 2]","[14, 0, 18, 2]","[5, 6]","[5, 6]","[10, 19, 18]",3,1,"[0, 4, 9]","[1, 4]",2,2,3,"[1, 5]",1,0,3,"[3, 0]",1,3,3,1,0,0,2,2,10,2,2,4,2,2,5,2,3,2,0,4,1,6,2,0,0,[0],3,2.0,[6],1,0.0,1,2,"[14, 18, 5, 31, 17, 1, 3]","[7, 9, 6, 1, 2, 4, 3, 5, 10, 8]","[1, 8, 3, 5, 11, 9, 6, 4, 10, 7, 2]","[2, 1, 5, 4, 3]","[6, 4, 5, 2, 1, 7, 3]","[1, 4, 3, 5, 7, 6, 2]",


Unnamed: 0,Hobby,OpenSource,Country,Student,Employment,FormalEducation,UndergradMajor,CompanySize,DevType,YearsCoding,YearsCodingProf,JobSatisfaction,CareerSatisfaction,HopeFiveYears,JobSearchStatus,LastNewJob,UpdateCV,ConvertedSalary,CommunicationTools,TimeFullyProductive,EducationTypes,SelfTaughtTypes,HackathonReasons,AgreeDisagree1,AgreeDisagree2,AgreeDisagree3,LanguageDesireNextYear,DatabaseWorkedWith,DatabaseDesireNextYear,PlatformWorkedWith,PlatformDesireNextYear,FrameworkWorkedWith,FrameworkDesireNextYear,IDE,OperatingSystem,NumberMonitors,Methodology,VersionControl,CheckInCode,AdBlocker,AdBlockerDisable,AdBlockerReasons,AdsAgreeDisagree1,AdsAgreeDisagree2,AdsAgreeDisagree3,AdsActions,AIDangerous,AIInteresting,AIResponsible,AIFuture,EthicsChoice,EthicsReport,EthicsResponsible,EthicalImplications,StackOverflowRecommend,StackOverflowVisit,StackOverflowHasAccount,StackOverflowParticipate,StackOverflowJobs,StackOverflowDevStory,StackOverflowJobsRecommend,StackOverflowConsiderMember,HypotheticalTools1,HypotheticalTools2,HypotheticalTools3,HypotheticalTools4,HypotheticalTools5,WakeTime,HoursComputer,HoursOutside,SkipMeals,ErgonomicDevices,Exercise,EducationParents,RaceEthnicity,Age,Dependents,SurveyTooLong,SurveyEasy,LanguageWorkedWith,AssessJob,AssessBenefits,JobContactPriorities,JobEmailPriorities,AdsPriorities,Cluster
0,0.25,0.661438,0.829156,0.707107,0.790569,0.866025,0.790569,0.866025,0.831836,0.866025,0.75,0.790569,0.866025,0.75,0.707107,0.790569,0.790569,0.33717,0.888585,0.829156,0.636677,0.645881,0.850245,0.829156,0.901388,0.790569,0.780626,0.893262,0.929189,0.817771,0.829425,0.85999,0.838437,0.828857,0.829156,0.612372,0.770359,0.622495,0.901388,0.433013,0.661438,0.83666,0.829156,0.790569,0.790569,0.622495,0.829156,0.707107,0.707107,0.5,0.5,0.75,0.790569,0.559017,0.5,0.790569,0.25,0.829156,0.612372,0.75,0.5,0.707107,0.866025,0.866025,0.829156,0.829156,0.790569,0.866025,0.790569,0.829156,0.661438,0.978945,0.75,0.75,0.5,0.661438,0.5,0.75,0.790569,0.708171,0.176777,0.238366,0.25,0.188982,0.25,
1,0.4,0.69282,0.848528,0.4,0.52915,0.69282,0.774597,0.87178,0.734922,0.916515,0.894427,0.748331,0.774597,0.8,0.663325,0.848528,0.663325,0.408203,0.80884,0.8,0.65276,0.609935,0.956681,0.663325,0.848528,0.8,0.752406,0.760597,0.801962,0.790539,0.850009,0.836394,0.840635,0.843726,0.8,0.8,0.740013,0.653197,0.565685,0.632456,0.774597,0.794984,0.87178,0.824621,0.848528,0.749222,0.8,0.8,0.824621,0.282843,0.565685,0.72111,0.565685,0.34641,0.6,0.87178,0.4,0.824621,0.663325,0.894427,0.72111,0.69282,0.848528,0.8,0.87178,0.848528,0.894427,0.824621,0.663325,0.72111,0.52915,0.848528,0.774597,0.894427,0.52915,0.8,0.565685,0.632456,0.774597,0.670149,0.2,0.2,0.2,0.169031,0.169031,
2,0.0,0.603023,0.92932,0.564076,0.6742,0.797724,0.768706,0.797724,0.804479,0.768706,0.639602,0.603023,0.639602,0.977008,0.639602,0.904534,0.977008,0.0,1.0,1.02247,0.984732,1.01504,0.0,0.953463,0.977008,0.977008,0.964734,0.945217,0.984292,0.993485,0.992395,1.00486,0.975012,0.942618,0.879049,0.92932,0.964859,0.89612,0.825723,0.904534,0.904534,0.904534,0.953463,0.953463,1.0,0.961375,0.92932,1.0,0.977008,0.953463,0.953463,0.92932,0.977008,0.977008,0.879049,0.904534,0.797724,0.92932,0.879049,0.953463,0.92932,0.904534,1.0,0.977008,0.977008,0.977008,0.977008,0.977008,0.977008,0.953463,0.92932,0.977008,1.0,1.02247,0.0,1.0,1.02247,0.953463,1.0,0.930812,0.213201,0.213201,0.213201,0.213201,0.213201,
3,0.242536,0.641689,0.874475,0.242536,0.342997,0.840168,0.685994,0.939336,0.749883,0.874475,0.840168,0.766965,0.766965,0.840168,0.485071,0.727607,0.874475,0.311017,0.76596,0.840168,0.603276,0.617169,0.874742,0.685994,0.8044,0.907485,0.782408,0.797405,0.860916,0.814258,0.847878,0.870944,0.849122,0.791204,0.594089,0.840168,0.658387,0.653797,0.420084,0.242536,0.685994,0.801958,0.907485,0.840168,0.874475,0.701539,0.840168,0.766965,0.641689,0.641689,0.641689,0.641689,0.685994,0.485071,0.641689,0.840168,0.420084,0.8044,0.727607,0.874475,0.685994,0.685994,0.766965,0.840168,0.874475,0.594089,0.874475,0.874475,0.8044,0.727607,0.8044,0.840168,0.641689,0.874475,0.727607,0.727607,0.542326,0.542326,0.766965,0.623883,0.21693,0.206835,0.153393,0.129641,0.224544,


Posibles clusters a dividir: [0, 1, 2]


In [12]:
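def mix_clusters():
    """Merge pairs of clusters whose centroids are closer than L_MIN."""
    global centroids, NUM_CLUSTERS
    
    # Upper-triangular matrix of distances between centroids. The diagonal and
    # the lower triangle hold LARGER_DISTANCE so they are never selected.
    dist_lists = []
    for i, rc_i in centroids.iterrows():
        dist_lists.append([])
        for j, rc_j in centroids.iterrows():
            if j <= i:
                dist_lists[i].append(LARGER_DISTANCE)
            else:
                dist_lists[i].append(distance_qual(rc_i, rc_j))
    dist_matrix = np.array(dist_lists)
    
    mixed = False
    to_eliminate = []
    # to_eliminate holds one cluster of each merged pair (the absorbed one)
    while dist_matrix.min() < LARGER_DISTANCE and len(to_eliminate) < P_MAX / 2:
        dist_min = dist_matrix.min()
        if dist_min >= L_MIN:
            # The closest remaining pair is too far apart: nothing left to merge
            break
        
        # Row and column of the closest remaining pair (z1 < z2)
        z1, z2 = (int(z) for z in np.unravel_index(dist_matrix.argmin(), 
                                                   dist_matrix.shape))
        
        # Skip pairs that involve a cluster already absorbed in this pass;
        # otherwise it would be merged (and counted) twice
        if z1 not in to_eliminate and z2 not in to_eliminate:
            centroids.iloc[z1] = get_centroide(centroids.iloc[[z1, z2]]).loc[0]
            to_eliminate.append(z2)
            NUM_CLUSTERS -= 1
            mixed = True
            if TALK:
                print("Unificando clusters {} y {}.\n".format(z1, z2))
        
        # Mark the pair as processed
        dist_matrix[z1][z2] = LARGER_DISTANCE
        
    # Row labels match positions here because the index is reset after each pass
    centroids.drop(to_eliminate, inplace=True)
    centroids = centroids.reset_index(drop=True)
    
    if mixed:
        # Reassign the samples to the surviving centroids
        update_clusters()
        #update_centroids()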

mix_clusters()

Unificando clusters 1 y 3.

Actualizando clusters
El cluster  0  incluye  25 miembros.
El cluster  1  incluye  48 miembros.
El cluster  2  incluye  27 miembros.

El cluster  0  incluye  25 miembros.
El cluster  1  incluye  48 miembros.
El cluster  2  incluye  27 miembros.
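
The merge step above can also be sketched on its own with plain NumPy. The following is only a minimal illustration under assumptions that are not part of this notebook: the centroids are numeric vectors compared with the Euclidean distance (the notebook uses `distance_qual` on qualitative data), the merged centroid is the size-weighted mean that the classic ISODATA description uses (the notebook uses `get_centroide`), and each cluster takes part in at most one merge per pass. The name `merge_closest_pairs` and its parameters `l_min` and `max_pairs` are hypothetical.

```python
import numpy as np

def merge_closest_pairs(centroids, sizes, l_min, max_pairs):
    """Merge centroid pairs closer than l_min (hypothetical helper).

    At most max_pairs pairs are merged, closest pairs first, and each
    cluster is used in at most one merge. Returns the new centroids, the
    new cluster sizes and the list of absorbed cluster indices.
    """
    centroids = np.asarray(centroids, dtype=float)
    sizes = np.asarray(sizes, dtype=float)
    k = len(centroids)

    # Pairwise Euclidean distances (assumption: numeric features)
    diff = centroids[:, None, :] - centroids[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))

    # Visit only the pairs i < j, from closest to farthest
    rows, cols = np.triu_indices(k, k=1)
    order = np.argsort(dist[rows, cols])

    merged = centroids.copy()
    new_sizes = sizes.copy()
    used = set()      # clusters that already took part in a merge
    removed = []      # absorbed clusters, one per merged pair
    for p in order:
        i, j = int(rows[p]), int(cols[p])
        if dist[i, j] >= l_min or len(removed) >= max_pairs:
            break     # remaining pairs are farther apart, or quota reached
        if i in used or j in used:
            continue
        # Size-weighted mean of the two centroids (classic ISODATA merge)
        total = sizes[i] + sizes[j]
        merged[i] = (sizes[i] * centroids[i] + sizes[j] * centroids[j]) / total
        new_sizes[i] = total
        used.update((i, j))
        removed.append(j)

    keep = [c for c in range(k) if c not in removed]
    return merged[keep], new_sizes[keep], removed

# Toy data: clusters 0 and 2 are the only pair closer than l_min
new_centroids, new_sizes, removed = merge_closest_pairs(
    centroids=[[0, 0], [10, 10], [0, 1], [20, 20]],
    sizes=[10, 5, 30, 5],
    l_min=2.0,
    max_pairs=2,
)
print(removed)
print(new_centroids)
```

Sorting the candidate pairs once avoids rescanning the distance matrix on every iteration, and the `used` set keeps a cluster that has just been merged from being paired again with a stale centroid.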

