<div style="width:100%; overflow:hidden; background-color:#F1F1E6; padding: 10px; border-style: outset; color:#17469e">
    <div style="width: 80%; float: left;">
    <h2 align="center">Universidad de Sonora</h2>
    <hr style="border-width: 3px; border-color:#17469e">
          <h1>Reconocimiento de patrones: Preparación de los datos</h1>          
          <h4>Ramón Soto C. <a href="mailto:rsotoc@moviquest.com/">(rsotoc@moviquest.com)</a></h4>
    </div>
    <div style="float: right;">
    <img src="images/escudo_unison.png">
    </div>
</div>

## Caso de estudio: [*Stack Overflow 2018 Developer Survey*](https://www.kaggle.com/stackoverflow/stack-overflow-2018-developer-survey)

Como caso de estudio principal en el presente curso hemos seleccionado la encuesta de desarrolladores 2018 de *Stack Overflow* disponible en [Kaggle](https://www.kaggle.com). En este esta etapa realizaremos el análisis de agrupamientos.

### 4. Modelado - ISODATA

<div style="margin-top: 6px; border: 1px solid #cfcfcf; padding: 8px 12px; border-radius:2px; background-color:#f7f7f7; ">
... ahora utilizamos la técnica ISODATA para identificar prototipos de clases. <br>Inicializamos el contexto y cargamos los datos:
</div>

In [1]:
"""
Reconocimiento de patrones: ISODATA
"""

#from scipy.spatial.distance import squareform

# Inicializar el ambiente
import sys
import numpy as np
import pandas as pd
import json
import pickle
#import math
import random
#import time

from IPython.display import display, HTML
from collections import Counter
from operator import itemgetter
#from scipy.spatial.distance import euclidean, pdist, squareform

np.set_printoptions(precision=2, suppress=True) # Cortar la impresión de decimales a 1
pd.set_option('display.max_columns', 130)
pd.set_option('max_colwidth', 80)

LARGER_DISTANCE = sys.maxsize
TALK = True # TALK = True, imprime resultados parciales

In [2]:
path = "Data sets/Stack Overflow Survey/"

# Recuperar encabezados de columnas en orden original
with open(path + 'survey_results_public_transformed.headers', 'rb') as file:  
    headers = pickle.load(file)

# Recuperar diccionarios... sólo por si se requieren
with open(path + 'survey_results_public_transformed.dicts', 'rb') as file:  
    dict_of_dicts = pickle.load(file)

with open(path + 'survey_results_public_transformed.json') as f:
    dict_json = json.load(f)
df = pd.DataFrame.from_dict(dict_json)

df = df.sample(n=100).reset_index(drop=True)



# Reordenar las columnas de acuerdo al orden original
df = df.reindex(headers, axis=1)

DATA_LEN = df.shape[0]

# Agregar una columna "cluster" inicializada a null 
df["Cluster"] = np.nan

In [3]:
var_str = ['Hobby', 'OpenSource', 'Country', 'Student', 'Employment', 'FormalEducation', 
         'UndergradMajor', 'CompanySize', 'YearsCoding', 'YearsCodingProf', 'UpdateCV', 
         'JobSatisfaction', 'CareerSatisfaction', 'HopeFiveYears', 'JobSearchStatus', 
         'LastNewJob', 'TimeFullyProductive', 'AgreeDisagree1', 'AgreeDisagree2', 
         'AgreeDisagree3', 'OperatingSystem', 'NumberMonitors', 'CheckInCode', 'AdBlocker', 
         'AdBlockerDisable', 'AdsAgreeDisagree1', 'AdsAgreeDisagree2', 'AdsAgreeDisagree3', 
         'AIDangerous', 'AIInteresting', 'AIResponsible', 'AIFuture', 'EthicsChoice', 
         'EthicsReport', 'EthicsResponsible', 'EthicalImplications', 'HoursComputer', 
         'StackOverflowRecommend', 'StackOverflowVisit', 'StackOverflowHasAccount', 
         'StackOverflowParticipate', 'StackOverflowJobs', 'StackOverflowDevStory', 
         'StackOverflowJobsRecommend', 'StackOverflowConsiderMember', 'HypotheticalTools1', 
         'HypotheticalTools2', 'HypotheticalTools3', 'HypotheticalTools4', 'WakeTime', 
         'HypotheticalTools5', 'HoursOutside', 'SkipMeals', 'Exercise', 'EducationParents', 
         'Age', 'Dependents', 'SurveyTooLong', 'SurveyEasy']
var_list = ['DevType', 'CommunicationTools', 'EducationTypes', 'SelfTaughtTypes', 
         'HackathonReasons', 'LanguageDesireNextYear', 'DatabaseWorkedWith', 
         'DatabaseDesireNextYear', 'PlatformWorkedWith', 'PlatformDesireNextYear', 
         'FrameworkWorkedWith', 'FrameworkDesireNextYear', 'IDE', 'Methodology', 
         'VersionControl', 'AdBlockerReasons', 'AdsActions', 'ErgonomicDevices', 
         'RaceEthnicity', 'LanguageWorkedWith']
var_ranks = ['AssessJob', 'AssessBenefits', 'JobContactPriorities', 'JobEmailPriorities', 
             'AdsPriorities']
var_float = 'ConvertedSalary'

def distance_qual(x, y):
    # Número de variables; si var_float es array, modificar "+ 1" por "+ len(var_float)"
    numvars = len(var_str) + len(var_list) + len(var_ranks) + 1
    
    distancia = abs(x.ConvertedSalary - y.ConvertedSalary)
    if pd.isnull(distancia):
        distancia = 0
        numvars -= 1
        
    for col in var_str:
        if x[col] != y[col]:
            distancia += 1
        
    for col in var_list:
        num_vars = len(x[col]) + len(y[col])
        d = 0
        if num_vars > 0:
            d = (2*len(set(x[col] + y[col])) - num_vars) / num_vars
        distancia += d

    for col in var_ranks:
        d = 0
        max_vars = max(len(x[col]), len(y[col]))
        if len(x[col]) != 0 and len(y[col]) != 0:
            for v in range(len(x[col])):
                if x[col][v] != y[col][v]:
                    d += 1
        else:
            d += max_vars
        
        if d != 0:
            d /= max_vars
        distancia += d

    return distancia / numvars
    
def decode(dataframe):
    new_df = dataframe.copy(deep=True)
    
    for col in var_str:
        if col in list(dataframe) and col in dict_of_dicts:
            for index, row in dataframe.iterrows():
                value = dict_of_dicts[col][row[col]]
                new_df.at[clusters.index[index], col] = value
                
    for index, row in dataframe.iterrows():
        new_df.at[clusters.index[index], 'ConvertedSalary'] = row['ConvertedSalary'] * 200000
    
    for col in var_list:
        if col in list(dataframe):
            for index, row in dataframe.iterrows():
                values_list = row[col].copy()
                for i in range(len(values_list)):
                    values_list[i] = dict_of_dicts[col][values_list[i]]
                new_df.at[clusters.index[index], col] = values_list
                
    return new_df

<div style="margin-top: 6px; border: 1px solid #cfcfcf; padding: 8px 12px; border-radius:2px; background-color:#f7f7f7; ">
A continuación ejecutamos el algoritmo ISODATA:
</div>

1) Definir los valores de $k_{init}, n_{min}, I_{max}, \sigma_{max}, L_{min}$ y $P_{max}$:

In [4]:
K_INIT = 5
N_MIN = 15
I_MAX = 10
S_MAX = 5
L_MIN = 80
P_MAX = 2

NUM_CLUSTERS = K_INIT # valor de k
iteration = 0

2) Seleccionar de manera arbitraria *k* puntos en el espacio de características como centros iniciales de los clusters (centroides o centros de masa).

In [5]:
# Inicializar los centroides
centroids = df.sample(n=NUM_CLUSTERS).reset_index(drop=True)

3) Asignar cada punto del conjunto de datos al cluster donde la distancia del punto al centroide es menor.

In [6]:
elim = False
members = []

def update_clusters():
    global NUM_CLUSTERS, elim, members, centroids
    changed = False
    cluster_col_index = df.shape[1] - 1
    
    if TALK :
        print("Actualizando clusters")
    for index, row in df.iterrows():
        minDistance = LARGER_DISTANCE
        currentCluster = 0
        
        # Buscar la menor distancia del punto a un centroide
        for i, r in centroids.iterrows():
            dist = distance_qual(row, r)
            if(dist < minDistance):
                minDistance = dist
                currentCluster = i
        
        # Si hay cambio, realizarlo y levantar la bandera 'changed'
        if(pd.isnull(row['Cluster']) or row['Cluster'] != currentCluster):
            df.iloc[index, cluster_col_index] = currentCluster
            changed = True  
            
    # Contabilizar los elementos en cada cluster   
    members = [0] * NUM_CLUSTERS
    for i in range(NUM_CLUSTERS):
        members[i] = df[df["Cluster"]==i].count()["Cluster"]
        if (TALK) : 
            print("El cluster ", i, " incluye ", members[i], "miembros.")
    if (TALK) : 
        print()

    to_eliminate = []
    for j in range(NUM_CLUSTERS):
        if members[j] < N_MIN:
            to_eliminate.append(j)
    if len(to_eliminate) > 0:
        elim = True
        if (TALK) : 
            print("Clusters a eliminar:", to_eliminate)
        # Eliminar los centroides seleccionados
        centroids.drop(to_eliminate, inplace=True)    
        centroids = centroids.reset_index(drop=True)
        NUM_CLUSTERS = centroids.shape[0]
        changed = True
    else :
        elim = False
        
    if changed:
        for index, row in df.iterrows():
            minDistance = LARGER_DISTANCE
            currentCluster = 0

            for j, rc in centroids.iterrows():
                dist = distance_qual(row, rc)
                #print(j, dist, currentCluster, minDistance)
                if(dist < minDistance):
                    minDistance = dist
                    currentCluster = j
            
            if(pd.isnull(row['Cluster']) or row['Cluster'] != currentCluster):
                df.iloc[index, cluster_col_index] = currentCluster
                
        # Contabilizar los elementos en cada cluster   
        members = [0] * NUM_CLUSTERS
        for i in range(NUM_CLUSTERS):
            members[i] = df[df["Cluster"]==i].count()["Cluster"]
            if (TALK) : 
                print("El cluster ", i, " incluye ", members[i], "miembros.")
        if (TALK) : 
            print()
        
    return changed

# --------------------------
# Actualizar los clusters
KEEP_WALKING = update_clusters()




Actualizando clusters
El cluster  0  incluye  14 miembros.
El cluster  1  incluye  14 miembros.
El cluster  2  incluye  19 miembros.
El cluster  3  incluye  15 miembros.
El cluster  4  incluye  38 miembros.

Clusters a eliminar: [0, 1]
El cluster  0  incluye  20 miembros.
El cluster  1  incluye  15 miembros.
El cluster  2  incluye  65 miembros.



4) Calcular los centroides a partir de los puntos en cada cluster. 

In [7]:
def update_centroids():    
    for cl_j in range(NUM_CLUSTERS):
        means = [0] * (df.shape[1] - 1)
        
        # Seleccionar registros en el cluster cl_j
        df_clusterj = df[df["Cluster"] == cl_j]

        # Media en los datos numéricos
        col = 'ConvertedSalary'
        centroids.at[centroids.index[cl_j], col] = df_clusterj[col].mean()
        
        # Moda en las columnas 'simples' (en var_str)
        mode = df_clusterj[var_str].mode()
        for col in mode:
            centroids.at[centroids.index[cl_j], col] = mode[col].values[0]

        # Moda en las columnas con listas de longitud variable (en var_list)
        for col in var_list:
            mean_len = 0
            vars_list = []
            for index, row in df_clusterj.iterrows():
                mean_len += len(row[col])
                vars_list = vars_list + row[col]
            mean_len /= df_clusterj.shape[0]
            counter = Counter(vars_list)
            mean_list = []
            for v in counter.most_common(round(mean_len + 0.5)):
                mean_list.append(v[0])
            centroids.at[centroids.index[cl_j], col] = mean_list

            
        # Moda en las columnas con listas de longitud fija (en var_ranks)
        ranges = [11, 12, 6, 8, 8]
        # Para cada variable en var_list, obtener el número de componentes en el vector
        # y el nombre de la columna
        for i, col in zip(range(len(ranges)), var_ranks):
            # Inicializar una matriz (lista de listas, en realidad), con tantos renglones como 
            # componentes tiene el vector de la variable. Cada renglón tiene todos los valores 
            # utilizados en cada posición del vector
            vars = []
            for j in range(ranges[i] - 1):
                vars.append([])

            # Recorrer todos los elementos actualmente en el cluster para rellenar la matriz
            for index, row in df_clusterj.iterrows():
                # Si el vector de la variable no está vacío...
                if len(row[col]) > 0:
                    # Para cada componente en el vector...
                    for j in range(len(row[col])):
                        # Si no es 0
                        if row[col][j] != '0':
                            # Agregarla al renglón actual en la matriz
                            vars[j].append(row[col][j])

            # Contabilizar ocurrencias de cada componente. Crear una matriz con el orden para
            # cada componente como renglones
            most_commons = []
            for j in range(ranges[i] - 1):
                counter=Counter(vars[j])
                most_commons.append(counter.most_common(ranges[i] - 1))

            # Inicializar vector. Se escoge el valor más popular en la primera componente
            vars_list = [most_commons[0][0][0]]
            # Para cada componente a partir de la segunda...
            for j in range(1, ranges[i] - 1):
                # Buscar la componente más común...
                for c in most_commons[j]:
                    # Siempre y cuando no esté utilizada...
                    if c[0] not in vars_list[:j]:
                        # Agregarla al vector y...
                        vars_list.append(c[0])
                        # Dejar de buscar.
                        break
            if len(vars_list) < ranges[i] - 1:
                vars_list = vars_list + list(set(range(1, ranges[i])) - set(vars_list))
            centroids.at[centroids.index[cl_j], col] = vars_list
    return

# --------------------------
# Actualizar los centroides
update_centroids()

In [28]:
display(centroids)

Unnamed: 0,Hobby,OpenSource,Country,Student,Employment,FormalEducation,UndergradMajor,CompanySize,DevType,YearsCoding,YearsCodingProf,JobSatisfaction,CareerSatisfaction,HopeFiveYears,JobSearchStatus,LastNewJob,UpdateCV,ConvertedSalary,CommunicationTools,TimeFullyProductive,EducationTypes,SelfTaughtTypes,HackathonReasons,AgreeDisagree1,AgreeDisagree2,AgreeDisagree3,LanguageDesireNextYear,DatabaseWorkedWith,DatabaseDesireNextYear,PlatformWorkedWith,PlatformDesireNextYear,FrameworkWorkedWith,FrameworkDesireNextYear,IDE,OperatingSystem,NumberMonitors,Methodology,VersionControl,CheckInCode,AdBlocker,AdBlockerDisable,AdBlockerReasons,AdsAgreeDisagree1,AdsAgreeDisagree2,AdsAgreeDisagree3,AdsActions,AIDangerous,AIInteresting,AIResponsible,AIFuture,EthicsChoice,EthicsReport,EthicsResponsible,EthicalImplications,StackOverflowRecommend,StackOverflowVisit,StackOverflowHasAccount,StackOverflowParticipate,StackOverflowJobs,StackOverflowDevStory,StackOverflowJobsRecommend,StackOverflowConsiderMember,HypotheticalTools1,HypotheticalTools2,HypotheticalTools3,HypotheticalTools4,HypotheticalTools5,WakeTime,HoursComputer,HoursOutside,SkipMeals,ErgonomicDevices,Exercise,EducationParents,RaceEthnicity,Age,Dependents,SurveyTooLong,SurveyEasy,LanguageWorkedWith,AssessJob,AssessBenefits,JobContactPriorities,JobEmailPriorities,AdsPriorities,Cluster
0,1,0,IND,0,0,1,6,8,"[0, 16, 12]",7,11,5,5,2,1,5,7,0.01229,[],,[8],[1],[],1.0,0.0,0,"[18, 27]",[13],[12],[22],[19],[1],[0],[18],3,1,[0],[2],3,2,0,[1],1,0,0,[0],1,2,,2.0,0.0,0.0,2.0,2.0,10,0,1,,0,,5,1,0,0,3,0,4,6,2,2,0,[],0,5.0,[6],1.0,0.0,1,1.0,"[18, 14, 5]","[5, 10, 7, 6, 3, 4, 2, 1, 9, 8]","[1, 3, 2, 11, 7, 5, 4, 9, 10, 8, 6]","[1, 3, 5, 2, 4]","[3, 1, 7, 4, 2, 6, 5]","[1, 5, 4, 3, 2, 6, 7]",
1,0,0,IND,0,0,1,6,4,"[0, 12, 11]",0,0,4,3,6,2,3,3,0.047883,[5],2.0,[8],[0],[],,,4,[17],[19],[13],[14],[14],[3],[1],[18],3,1,[],[1],0,2,3,[4],2,0,0,[0],0,0,,,,,,,7,2,2,3.0,1,0.0,5,2,0,3,3,3,0,6,2,0,0,[3],2,,[],,,1,,"[3, 31]","[9, 7, 6, 4, 1, 5, 8, 3, 10, 2]","[1, 2, 3, 10, 8, 5, 9, 11, 6, 7, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]","[1, 2, 5, 3, 4]","[3, 5, 2, 4, 6, 7, 1]","[4, 1, 2, 3, 7, 6, 5]",
2,1,0,IND,0,0,1,6,3,"[0, 12, 11, 6]",7,0,3,3,2,2,0,7,0.162178,"[8, 5, 4]",3.0,"[8, 7, 5, 1]","[5, 7, 3, 0]","[4, 0]",0.0,0.0,4,"[18, 5, 14, 3, 31, 27]","[14, 19, 13]","[14, 13, 19]","[14, 2, 22]","[14, 2, 0, 18]","[5, 1]","[5, 6, 0]","[18, 10, 19, 15]",3,2,"[0, 9, 4]","[1, 4]",2,2,3,"[6, 4]",1,1,0,"[3, 2]",3,3,3.0,1.0,1.0,0.0,2.0,2.0,10,5,2,0.0,2,1.0,5,2,0,2,3,4,3,6,2,2,3,[2],3,1.0,[6],1.0,0.0,0,4.0,"[18, 14, 5, 31, 17, 3, 1, 27]","[6, 7, 9, 4, 2, 5, 10, 1, 8, 3]","[1, 6, 2, 10, 8, 7, 11, 4, 9, 5, 3]","[2, 1, 5, 3, 4]","[1, 5, 3, 4, 2, 7, 6]","[1, 5, 2, 4, 6, 7, 3]",


In [43]:
deltas = []
delta = 0
def update_deltas():
    global deltas, delta, centroids
    deltas = [0] * NUM_CLUSTERS
    N = 0
    for j, rc in centroids.iterrows():
        n = 0
        for i, row in df[df["Cluster"]==j].iterrows():
            deltas[j] += distance_qual(row, rc)
            n += 1
        delta += deltas[j]
        deltas[j] /= n
        N += n
    delta /= N
    
    if TALK : 
        print("Las distancias medias en cada cluster son:\n", deltas)   
        print("\nLa distancia media promedio es:", delta)   
        
    return

update_deltas()

Las distancias medias en cada cluster son:
 [0.7966206484339192, 0.8133942988323076, 0.5519572858653702]

La distancia media promedio es: 0.6401055103241206


In [62]:
import math

def std_dev():
    # Inicializar vector de desviaciones estándar... los valores actuales son inserbibles
    std_vectors = centroids.copy()
    
    for c in range(NUM_CLUSTERS) :
        df_c = df[(df["Cluster"]==c)]
        
        # Para cada variable numérica...
        df_cj = df_c[pd.notnull(df_c['ConvertedSalary'])]
        m = df_cj["ConvertedSalary"].mean()
        s = math.sqrt(sum(abs(df_cj["ConvertedSalary"] - m)) / (df_cj.shape[0] - 1))
        std_vectors.loc[c, "ConvertedSalary"] = s
        
        
        
        #for j, row in df_c.iterrows():
        
        
    return 

std_dev()        

0.1567801007781281
0.2824973661908406
0.38525106334514747


Unnamed: 0,Hobby,OpenSource,Country,Student,Employment,FormalEducation,UndergradMajor,CompanySize,DevType,YearsCoding,YearsCodingProf,JobSatisfaction,CareerSatisfaction,HopeFiveYears,JobSearchStatus,LastNewJob,UpdateCV,ConvertedSalary,CommunicationTools,TimeFullyProductive,EducationTypes,SelfTaughtTypes,HackathonReasons,AgreeDisagree1,AgreeDisagree2,AgreeDisagree3,LanguageDesireNextYear,DatabaseWorkedWith,DatabaseDesireNextYear,PlatformWorkedWith,PlatformDesireNextYear,FrameworkWorkedWith,FrameworkDesireNextYear,IDE,OperatingSystem,NumberMonitors,Methodology,VersionControl,CheckInCode,AdBlocker,AdBlockerDisable,AdBlockerReasons,AdsAgreeDisagree1,AdsAgreeDisagree2,AdsAgreeDisagree3,AdsActions,AIDangerous,AIInteresting,AIResponsible,AIFuture,EthicsChoice,EthicsReport,EthicsResponsible,EthicalImplications,StackOverflowRecommend,StackOverflowVisit,StackOverflowHasAccount,StackOverflowParticipate,StackOverflowJobs,StackOverflowDevStory,StackOverflowJobsRecommend,StackOverflowConsiderMember,HypotheticalTools1,HypotheticalTools2,HypotheticalTools3,HypotheticalTools4,HypotheticalTools5,WakeTime,HoursComputer,HoursOutside,SkipMeals,ErgonomicDevices,Exercise,EducationParents,RaceEthnicity,Age,Dependents,SurveyTooLong,SurveyEasy,LanguageWorkedWith,AssessJob,AssessBenefits,JobContactPriorities,JobEmailPriorities,AdsPriorities,Cluster
0,1,0,IND,0,0,1,6,8,"[0, 16, 12]",7,11,5,5,2,1,5,7,0.15678,[],,[8],[1],[],1.0,0.0,0,"[18, 27]",[13],[12],[22],[19],[1],[0],[18],3,1,[0],[2],3,2,0,[1],1,0,0,[0],1,2,,2.0,0.0,0.0,2.0,2.0,10,0,1,,0,,5,1,0,0,3,0,4,6,2,2,0,[],0,5.0,[6],1.0,0.0,1,1.0,"[18, 14, 5]","[5, 10, 7, 6, 3, 4, 2, 1, 9, 8]","[1, 3, 2, 11, 7, 5, 4, 9, 10, 8, 6]","[1, 3, 5, 2, 4]","[3, 1, 7, 4, 2, 6, 5]","[1, 5, 4, 3, 2, 6, 7]",
1,0,0,IND,0,0,1,6,4,"[0, 12, 11]",0,0,4,3,6,2,3,3,0.282497,[5],2.0,[8],[0],[],,,4,[17],[19],[13],[14],[14],[3],[1],[18],3,1,[],[1],0,2,3,[4],2,0,0,[0],0,0,,,,,,,7,2,2,3.0,1,0.0,5,2,0,3,3,3,0,6,2,0,0,[3],2,,[],,,1,,"[3, 31]","[9, 7, 6, 4, 1, 5, 8, 3, 10, 2]","[1, 2, 3, 10, 8, 5, 9, 11, 6, 7, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]","[1, 2, 5, 3, 4]","[3, 5, 2, 4, 6, 7, 1]","[4, 1, 2, 3, 7, 6, 5]",
2,1,0,IND,0,0,1,6,3,"[0, 12, 11, 6]",7,0,3,3,2,2,0,7,0.385251,"[8, 5, 4]",3.0,"[8, 7, 5, 1]","[5, 7, 3, 0]","[4, 0]",0.0,0.0,4,"[18, 5, 14, 3, 31, 27]","[14, 19, 13]","[14, 13, 19]","[14, 2, 22]","[14, 2, 0, 18]","[5, 1]","[5, 6, 0]","[18, 10, 19, 15]",3,2,"[0, 9, 4]","[1, 4]",2,2,3,"[6, 4]",1,1,0,"[3, 2]",3,3,3.0,1.0,1.0,0.0,2.0,2.0,10,5,2,0.0,2,1.0,5,2,0,2,3,4,3,6,2,2,3,[2],3,1.0,[6],1.0,0.0,0,4.0,"[18, 14, 5, 31, 17, 3, 1, 27]","[6, 7, 9, 4, 2, 5, 10, 1, 8, 3]","[1, 6, 2, 10, 8, 7, 11, 4, 9, 5, 3]","[2, 1, 5, 3, 4]","[1, 5, 3, 4, 2, 7, 6]","[1, 5, 2, 4, 6, 7, 3]",


In [32]:
        
    deltas = []
    
    
    # Número de variables no numéricas
    numvars = len(var_str) + len(var_list) + len(var_ranks)
    


    s = 0
    count = 0
        
        
            distancia = abs(x.ConvertedSalary - y.ConvertedSalary)
    if pd.isnull(distancia):
        distancia = 0
        numvars -= 1
        
    for col in var_str:
        if x[col] != y[col]:
            distancia += 1
        
    for col in var_list:
        num_vars = len(x[col]) + len(y[col])
        d = 0
        if num_vars > 0:
            d = (2*len(set(x[col] + y[col])) - num_vars) / num_vars
        distancia += d

    for col in var_ranks:
        d = 0
        max_vars = max(len(x[col]), len(y[col]))
        if len(x[col]) != 0 and len(y[col]) != 0:
            for v in range(len(x[col])):
                if x[col][v] != y[col][v]:
                    d += 1
        else:
            d += max_vars
        
        if d != 0:
            d /= max_vars
        distancia += d

        
        centroid = centroids.iloc[int(row.Cluster)]
        
        diff = row.ConvertedSalary - centroid.ConvertedSalary
        if pd.isnull(diff) :
            diff = 0
            numvars -= 1
        else :
            if row.Cluster == 0 :
                s += diff * diff
                count += 1
            
    s = math.sqrt(s / (count - 1))
    print(s)

        
    

(20, 86)
(15, 86)
(65, 86)


In [None]:
def distance_qual(x, y):
    # Número de variables; si var_float es array, modificar "+ 1" por "+ len(var_float)"
    numvars = len(var_str) + len(var_list) + len(var_ranks) + 1
    
    distancia = abs(x.ConvertedSalary - y.ConvertedSalary)
    if pd.isnull(distancia):
        distancia = 0
        numvars -= 1
        
    for col in var_str:
        if x[col] != y[col]:
            distancia += 1
        
    for col in var_list:
        num_vars = len(x[col]) + len(y[col])
        d = 0
        if num_vars > 0:
            d = (2*len(set(x[col] + y[col])) - num_vars) / num_vars
        distancia += d

    for col in var_ranks:
        d = 0
        max_vars = max(len(x[col]), len(y[col]))
        if len(x[col]) != 0 and len(y[col]) != 0:
            for v in range(len(x[col])):
                if x[col][v] != y[col][v]:
                    d += 1
        else:
            d += max_vars
        
        if d != 0:
            d /= max_vars
        distancia += d

    return distancia / numvars    

def divide_clusters():
    global NUM_CLUSTERS, centroids

    if TALK :
        display(centroids)
    
    # Cálculo de desviaciones estandar
    sigma_vect = []
    for c in range(NUM_CLUSTERS):
        sigma_vect.append(std_dev(df[df["Cluster"]==c]))    
    if TALK :
        display(sigma_vect)
    
    candidates = []
    for c in range(NUM_CLUSTERS):
        for i in range(df.shape[1] - 1):
            if sigma_vect[c][i] > S_MAX :
                candidates.append(c)
                break # Sucio... pero eficiente :-) ... ya encontramos un atributo con elevada sigma

    if TALK :
        print("Posibles clusters a dividir:", candidates)
    
    divided = False
    to_eliminate = []
    for c in candidates:
        cond = NUM_CLUSTERS < K_INIT/2 or (deltas[c] > delta and members[c] > 2 * N_MIN)
        if cond:
            dist_matrix = squareform(pdist(df[df["Cluster"]==c], 'euclidean'))
            idx = (dist_matrix==dist_matrix.max()).argmax()
            z1 = idx // members[c]
            z2 = idx % members[c]

            if TALK :
                print("\nSe dividirá el cluster {}.".format(c))
                print("Se crearán nuevos clusters en torno a {} y {}."
                     .format(z1, z2))
            to_eliminate.append(c)
            centroids = centroids.append(df.iloc[z1][['A', 'B']], 
                                         ignore_index=True, sort=False)
            centroids = centroids.append(df.iloc[z2][['A', 'B']], 
                                         ignore_index=True, sort=False)
            NUM_CLUSTERS += 1
            

    if len(to_eliminate) > 0 :
        centroids.drop(to_eliminate, inplace=True)
        centroids = centroids.reset_index(drop=True)

        if TALK : 
            display(centroids)
            print("")
            
        update_clusters()
        update_centroids()
    
    return 

In [None]:
while(KEEP_WALKING):
    KEEP_WALKING = update_clusters()
    if (KEEP_WALKING):
        update_centroids()
    else :
        if (TALK) : 
            print ("No más cambios.")  

<div style="margin-top: 6px; border: 1px solid #cfcfcf; padding: 8px 12px; border-radius:2px; background-color:#f7f7f7; ">
Los prototipos de las 3 clases obtenidos en este ejercicio son:
</div>

In [None]:
display(clusters)

<div style="margin-top: 6px; border: 1px solid #cfcfcf; padding: 8px 12px; border-radius:2px; background-color:#f7f7f7; ">
O bien, decodificando los valores:
</div>

In [None]:
dec_clusters = decode(clusters)
print(dec_clusters)

<div style="margin-top: 6px; border: 1px solid #cfcfcf; padding: 8px 12px; border-radius:2px; background-color:#f7f7f7; ">
Estos resultados permiten realizar un análisis de conformación de los grupos, así como revizar la utilidad de las variables registradas. Obsérvese que diversas varaibles no cambian de valor para las 3 prototipos, lo cual, a este nivel, indicaría que no aportan poder discriminante. <br><br>Una exploración más detallada permitiría descartar variables o, después de una revisión de los objetivos del negocio, un replanteamiento del procedimiento para recabar los datos.
<br><br>Una manera de evaluar la calidad de los clusters obtenidos es evaluar la separación entre los prototipos. Distancias muy pequeñas entre dos prototipos indicarían la conveniencia de unificar los clusters correspondientes:
</div>

In [None]:
list_array = []
for index, row in clusters.iterrows():
    for i in range(index + 1, clusters.shape[0]):
        list_array.append(distance_qual(clusters.iloc[index], clusters.iloc[i]))
print(squareform(list_array))

<div style="margin-top: 6px; border: 1px solid #cfcfcf; padding: 8px 12px; border-radius:2px; background-color:#f7f7f7; ">
En esos resultados, se repite, en gran medida, el comportamiento ya observado con 3 clusters.
</div>

In [None]:
for _, row in df.iterrows():
    if len(row['JobEmailPriorities']) > 0 and len(row['JobEmailPriorities']) < 7 :
        print(row['JobEmailPriorities'])

In [None]:
a = range(1, 8)
b = [3, 2, 6, 1, 4]
c = list(set(a) - set(b))
print(b + c)