<div style="width:100%; overflow:hidden; background-color:#F1F1E6; padding: 10px; border-style: outset; color:#17469e">
    <div style="width: 80%; float: left;">
    <h2 align="center">Universidad de Sonora</h2>
    <hr style="border-width: 3px; border-color:#17469e">
          <h1>Pattern Recognition: Data Preparation</h1>
          <h4>Ramón Soto C. <a href="mailto:rsotoc@moviquest.com">(rsotoc@moviquest.com)</a></h4>
    </div>
    <div style="float: right;">
    <img src="images/escudo_unison.png">
    </div>
</div>

## Case study: [*Stack Overflow 2018 Developer Survey*](https://www.kaggle.com/stackoverflow/stack-overflow-2018-developer-survey)

As the main case study for this course we have selected the 2018 *Stack Overflow* developer survey, available on [Kaggle](https://www.kaggle.com). In this stage we carry out the cluster analysis.

### 4. Modeling

<div style="margin-top: 6px; border: 1px solid #cfcfcf; padding: 8px 12px; border-radius:2px; background-color:#f7f7f7; ">
The first phase of modeling, in this particular case where we have no information about the categories underlying the sample, consists of identifying the potential clusters.<br><br> 
The first technique we will use for this purpose is cluster identification by means of dendrograms.<br><br> 
The first problem to solve is how to measure distances between the specific data types involved. We use a Gower distance measure for mixed data.
</div>
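
As a rough illustration of the idea behind a Gower-style distance (not the notebook's own `distance_qual`, which also handles list and ranking columns), the sketch below averages per-variable dissimilarities over the variables present in both records. The field names, the sample records, and the `numeric_ranges` argument are hypothetical, chosen only for this example.

```python
def gower_distance(a, b, numeric_ranges):
    """Gower-style distance between two records given as dicts.

    numeric_ranges maps each numeric field to its range (assumed known);
    every other field is treated as categorical.
    """
    total, used = 0.0, 0
    for key in a:
        x, y = a[key], b[key]
        # Skip variables with a missing value in either record
        if x is None or y is None:
            continue
        if key in numeric_ranges:
            # Numeric: absolute difference scaled to [0, 1] by the range
            total += abs(x - y) / numeric_ranges[key]
        else:
            # Categorical: 0 if the values match, 1 otherwise
            total += 0.0 if x == y else 1.0
        used += 1
    return total / used if used else 0.0


# Hypothetical records with one numeric and two categorical fields
dev_a = {"salary": 50000, "country": "MEX", "hobby": "Yes"}
dev_b = {"salary": 150000, "country": "USA", "hobby": "Yes"}
d = gower_distance(dev_a, dev_b, {"salary": 200000})
print(d)  # (0.5 + 1 + 0) / 3 = 0.5
```

Dividing by the number of variables actually compared keeps the result in [0, 1] even when some values are missing.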

In [1]:
"""
Pattern recognition: Dendrograms
"""

import pandas as pd
import numpy as np
import json
import pickle

from collections import Counter
from operator import itemgetter
from IPython.display import display, HTML

from matplotlib import pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage
%matplotlib inline

pd.set_option('display.max_columns', 130)
pd.set_option('display.max_colwidth', 80)

In [2]:
path = "Data sets/Stack Overflow Survey/"

# Load the column headers in their original order
with open(path + 'survey_results_public_transformed.headers', 'rb') as file:
    headers = pickle.load(file)

# Load the encoding dictionaries, in case they are needed
with open(path + 'survey_results_public_transformed.dicts', 'rb') as file:
    dict_of_dicts = pickle.load(file)

with open(path + 'survey_results_public_transformed.json') as f:
    dict_json = json.load(f)
df = pd.DataFrame.from_dict(dict_json)

# Reorder the columns to match the original order
df = df.reindex(headers, axis=1)

In [3]:
var_str = ['Hobby', 'OpenSource', 'Country', 'Student', 'Employment', 'FormalEducation', 
         'UndergradMajor', 'CompanySize', 'YearsCoding', 'YearsCodingProf', 'UpdateCV', 
         'JobSatisfaction', 'CareerSatisfaction', 'HopeFiveYears', 'JobSearchStatus', 
         'LastNewJob', 'TimeFullyProductive', 'AgreeDisagree1', 'AgreeDisagree2', 
         'AgreeDisagree3', 'OperatingSystem', 'NumberMonitors', 'CheckInCode', 'AdBlocker', 
         'AdBlockerDisable', 'AdsAgreeDisagree1', 'AdsAgreeDisagree2', 'AdsAgreeDisagree3', 
         'AIDangerous', 'AIInteresting', 'AIResponsible', 'AIFuture', 'EthicsChoice', 
         'EthicsReport', 'EthicsResponsible', 'EthicalImplications', 'HoursComputer', 
         'StackOverflowRecommend', 'StackOverflowVisit', 'StackOverflowHasAccount', 
         'StackOverflowParticipate', 'StackOverflowJobs', 'StackOverflowDevStory', 
         'StackOverflowJobsRecommend', 'StackOverflowConsiderMember', 'HypotheticalTools1', 
         'HypotheticalTools2', 'HypotheticalTools3', 'HypotheticalTools4', 'WakeTime', 
         'HypotheticalTools5', 'HoursOutside', 'SkipMeals', 'Exercise', 'EducationParents', 
         'Age', 'Dependents', 'SurveyTooLong', 'SurveyEasy']
var_list = ['DevType', 'CommunicationTools', 'EducationTypes', 'SelfTaughtTypes', 
         'HackathonReasons', 'LanguageDesireNextYear', 'DatabaseWorkedWith', 
         'DatabaseDesireNextYear', 'PlatformWorkedWith', 'PlatformDesireNextYear', 
         'FrameworkWorkedWith', 'FrameworkDesireNextYear', 'IDE', 'Methodology', 
         'VersionControl', 'AdBlockerReasons', 'AdsActions', 'ErgonomicDevices', 
         'RaceEthnicity', 'LanguageWorkedWith']
var_ranks = ['AssessJob', 'AssessBenefits', 'JobContactPriorities', 'JobEmailPriorities', 
             'AdsPriorities']
var_float = 'ConvertedSalary'

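def distance_qual(x, y):
    # Number of variables; if var_float becomes a list, replace "+ 1" with "+ len(var_float)"
    numvars = len(var_str) + len(var_list) + len(var_ranks) + 1
    
    # Numeric variable: absolute difference (ConvertedSalary is assumed to be already normalized)
    distancia = abs(x.ConvertedSalary - y.ConvertedSalary)
    if pd.isnull(distancia):
        # A missing salary does not count as a compared variable
        distancia = 0
        numvars -= 1
        
    # Simple categorical variables: 0 if equal, 1 if different
    for col in var_str:
        if x[col] != y[col]:
            distancia += 1
        
    # Variable-length lists: elements not shared by both lists, relative to the total length
    for col in var_list:
        total_len = len(x[col]) + len(y[col])
        d = 0
        if total_len > 0:
            d = (2*len(set(x[col] + y[col])) - total_len) / total_len
        distancia += d

    # Rankings: fraction of positions that differ; maximal distance if one of them is empty
    for col in var_ranks:
        d = 0
        max_vars = max(len(x[col]), len(y[col]))
        if len(x[col]) != 0 and len(y[col]) != 0:
            # zip() avoids an IndexError if the two rankings differ in length
            d = sum(1 for a, b in zip(x[col], y[col]) if a != b)
        else:
            d += max_vars
        
        if d != 0:
            d /= max_vars
        distancia += d

    return distancia / numvars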

def distance_matrix(rows):
    # Distances for every pair (i, j) with i < j, row by row: the condensed
    # (upper-triangular) form that linkage() accepts
    n = rows.shape[0]
    list_array = []
    for i in range(n):
        for j in range(i + 1, n):
            list_array.append(distance_qual(rows.iloc[i], rows.iloc[j]))

    return list_array # upper triangle of the distance matrix

def display_tree(X, clusters=10):
    Z = linkage(X, 'weighted')
    plt.figure(figsize=(12, 5))
    dendrogram(Z,     
               truncate_mode='lastp',
               p=clusters, 
               show_leaf_counts=True,  
               leaf_font_size=14)
    plt.show()
    
def decode(dataframe):
    # Translate the encoded values back to their original labels
    new_df = dataframe.copy(deep=True)
    
    # Simple categorical columns: look the code up in the column's dictionary
    for col in var_str:
        if col in dataframe.columns:
            for index, row in dataframe.iterrows():
                new_df.at[index, col] = dict_of_dicts[col][row[col]]
                
    # Undo the salary scaling (factor of 200000)
    if 'ConvertedSalary' in dataframe.columns:
        for index, row in dataframe.iterrows():
            new_df.at[index, 'ConvertedSalary'] = row['ConvertedSalary'] * 200000
    
    # List-valued columns: decode each element into a new list, so the
    # lists stored in the original dataframe are not modified in place
    for col in var_list + var_ranks:
        if col in dataframe.columns:
            for index, row in dataframe.iterrows():
                new_df.at[index, col] = [dict_of_dicts[col][v] for v in row[col]]
                
    return new_df

<div style="margin-top: 6px; border: 1px solid #cfcfcf; padding: 8px 12px; border-radius:2px; background-color:#f7f7f7; ">
A specific problem with dendrograms is that the distance matrix must be computed, which is, in principle, a matrix of every data point against all the others. Since the matrix is symmetric, with zeros on the diagonal, only its upper triangle is needed and the computation reduces to $n_d=\frac{1}{2} n_e \times (n_e -1)$, where $n_d$ is the number of distances to compute and $n_e$ is the number of elements in the sample. This is still a large number of computations for a data set such as the survey, as the following exploration shows:
</div>
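
To make the count concrete, here is a minimal sketch of the formula and of how each pair maps to a position in the condensed (upper-triangular) list. The helper names `pair_count`, `condensed_index` and `estimate_days` are hypothetical and are not part of the notebook.

```python
def pair_count(n):
    """Number of unordered pairs among n elements: n * (n - 1) / 2."""
    return n * (n - 1) // 2


def condensed_index(i, j, n):
    """Position of the pair (i, j), with i < j, in the condensed list."""
    return n * i - i * (i + 1) // 2 + (j - i - 1)


def estimate_days(seconds_per_distance, n):
    """Extrapolate the time needed for all pairs of n elements, in days."""
    return seconds_per_distance * pair_count(n) / 86400


n = 4
positions = [condensed_index(i, j, n) for i in range(n) for j in range(i + 1, n)]
print(pair_count(n), positions)  # 6 [0, 1, 2, 3, 4, 5]
print(pair_count(1000))          # 499500
```

Because the count grows quadratically, doubling the sample size roughly quadruples the number of distances to compute.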

In [None]:
import time

# Number of pairwise distances for the whole data set
total_dists = df.shape[0] * (df.shape[0] - 1) // 2

# Time the computation on a random sample of 1000 records
df1 = df.sample(n=1000).reset_index(drop=True)
start_time = time.time()
X = distance_matrix(df1)
elapsed_time = time.time() - start_time

print("Number of distances computed:", len(X), 
      "\nElapsed time (seconds):", elapsed_time,
      "\nAverage time per distance:", elapsed_time / len(X),
      "\n\nTotal number of distances to compute:", total_dists, 
      "\nTotal time for the whole data set (days):", 
      elapsed_time / len(X) * total_dists / 86400)

<div style="margin-top: 6px; border: 1px solid #cfcfcf; padding: 8px 12px; border-radius:2px; background-color:#f7f7f7; ">
As can be seen, computing the triangular matrix would take about 5 months on a computer with a 4 GHz Intel Core i7 processor and 32 GB of RAM. Instead of running the full exploration, we will run a series of explorations on random samples:
</div>

In [None]:
for i in range(6):
    df1 = df.sample(n=1000).reset_index(drop=True)
    X = distance_matrix(df1)
    display_tree(X)

<div style="margin-top: 6px; border: 1px solid #cfcfcf; padding: 8px 12px; border-radius:2px; background-color:#f7f7f7;">
These dendrograms show, at the level explored, two or three large groups. Let us examine a finer level of clustering:
</div>
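
Once a number of groups has been read off a dendrogram, the tree can also be cut into flat cluster labels. The sketch below assumes SciPy's `fcluster` with the `maxclust` criterion and uses a made-up condensed distance list for four points. It illustrates the idea only and is not a step the notebook performs.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical condensed distances for 4 points, in the pair order
# (0,1), (0,2), (0,3), (1,2), (1,3), (2,3)
condensed = np.array([0.1, 0.9, 0.8, 0.85, 0.9, 0.1])

# Same linkage method as display_tree()
Z = linkage(condensed, 'weighted')

# Cut the tree so that at most 2 flat clusters remain
labels = fcluster(Z, t=2, criterion='maxclust')
print(labels)
```

Points 0 and 1 end up with the same label, as do points 2 and 3, because each of those pairs is separated by only 0.1.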

In [None]:
display_tree(X, 20)

In [4]:
import sys
import collections

LARGER_DISTANCE = sys.maxsize
DATA_LEN = df.shape[0]
TALK = True # If True, print partial results

In [5]:
# Add a "Cluster" column initialized to null
df["Cluster"] = np.nan

In [6]:
NUM_CLUSTERS = 3

# Initialize the centroids with randomly chosen records
clusters = df.sample(n=NUM_CLUSTERS).reset_index(drop=True)

In [7]:
def update_clusters():
    changed = False
    cluster_col_index = df.shape[1] - 1
    
    for index, row in df.iterrows():
        minDistance = LARGER_DISTANCE
        currentCluster = 0
        
        # Find the smallest distance from the point to a centroid
        for i, r in clusters.iterrows():
            dist = distance_qual(row, r)
            if(dist < minDistance):
                minDistance = dist
                currentCluster = i

        # If the assignment changes, apply it and raise the 'changed' flag
        if(pd.isnull(row['Cluster']) or row['Cluster'] != currentCluster):
            df.iloc[index, cluster_col_index] = currentCluster
            changed = True
            
    # Count the elements in each cluster
    members = [0] * NUM_CLUSTERS
    for i in range(NUM_CLUSTERS):
        members[i] = df[df["Cluster"]==i].count()["Cluster"]
        if (TALK) : 
            print("El cluster ", i, " incluye ", members[i], "miembros.")
    if (TALK) : 
        print()
            
    return changed

# --------------------------
# Update the clusters
KEEP_WALKING = update_clusters()

El cluster  0  incluye  22255 miembros.

El cluster  1  incluye  39707 miembros.

El cluster  2  incluye  36481 miembros.



In [8]:
def update_centroids():    
    for cl_j in range(NUM_CLUSTERS):
        # Select the records in cluster cl_j
        df_clusterj = df[df["Cluster"] == cl_j]
        
        # Mean for the numeric data
        col = 'ConvertedSalary'
        clusters.at[clusters.index[cl_j], col] = df_clusterj[col].mean()
        
        # Mode for the 'simple' columns (in var_str)
        mode = df_clusterj[var_str].mode()
        for col in mode:
            clusters.at[clusters.index[cl_j], col] = mode[col].values[0]

        # Mode for the columns holding variable-length lists (in var_list):
        # keep the most frequent values, as many as the mean list length (roughly rounded up)
        for col in var_list:
            mean_len = 0
            vars_list = []
            for index, row in df_clusterj.iterrows():
                mean_len += len(row[col])
                vars_list = vars_list + row[col]
            mean_len /= df_clusterj.shape[0]
            counter=collections.Counter(vars_list)
            mean_list = []
            for v in counter.most_common(round(mean_len + 0.5)):
                mean_list.append(v[0])
            clusters.at[clusters.index[cl_j], col] = mean_list

            
        # Mode for the columns holding fixed-length lists (in var_ranks)
        ranges = [11, 12, 6, 8, 8]
        # For each variable in var_ranks, get the number of components of its vector
        # (ranges[i] - 1) and the column name
        for i, col in zip(range(len(ranges)), var_ranks):
            # Initialize a matrix (a list of lists, actually) with one row per 
            # component of the variable's vector. Each row collects all the values 
            # used at that position of the vector
            vars = []
            for j in range(ranges[i] - 1):
                vars.append([])

            # Go through every element currently in the cluster to fill the matrix
            for index, row in df_clusterj.iterrows():
                # If the variable's vector is not empty...
                if len(row[col]) > 0:
                    # For each component of the vector...
                    for j in range(len(row[col])):
                        # If it is not 0
                        if row[col][j] != '0':
                            # Add it to the current row of the matrix
                            vars[j].append(row[col][j])

            # Count the occurrences of each component. Build a matrix whose rows hold
            # the ordering for each component
            most_commons = []
            for j in range(ranges[i] - 1):
                counter=collections.Counter(vars[j])
                most_commons.append(counter.most_common(ranges[i] - 1))

            # Initialize the vector with the most popular value of the first component
            vars_list = [most_commons[0][0][0]]
            # For each component from the second one onward...
            for j in range(1, ranges[i] - 1):
                # Look for the most common value...
                for c in most_commons[j]:
                    # As long as it has not been used already...
                    if c[0] not in vars_list[:j]:
                        # Add it to the vector and...
                        vars_list.append(c[0])
                        # Stop searching.
                        break
            clusters.at[clusters.index[cl_j], col] = vars_list
    return

# --------------------------
# Update the centroids
update_centroids()

In [9]:
while(KEEP_WALKING):
    KEEP_WALKING = update_clusters()
    if (KEEP_WALKING):
        update_centroids()
    else :
        if (TALK) : 
            print ("No más cambios.")  

El cluster  0  incluye  31130 miembros.

El cluster  1  incluye  28633 miembros.

El cluster  2  incluye  38680 miembros.

El cluster  0  incluye  29906 miembros.

El cluster  1  incluye  29443 miembros.

El cluster  2  incluye  39094 miembros.

El cluster  0  incluye  29177 miembros.

El cluster  1  incluye  28463 miembros.

El cluster  2  incluye  40803 miembros.

El cluster  0  incluye  29280 miembros.

El cluster  1  incluye  26874 miembros.

El cluster  2  incluye  42289 miembros.

El cluster  0  incluye  30136 miembros.

El cluster  1  incluye  26433 miembros.

El cluster  2  incluye  41874 miembros.

El cluster  0  incluye  30231 miembros.

El cluster  1  incluye  26200 miembros.

El cluster  2  incluye  42012 miembros.

El cluster  0  incluye  30515 miembros.

El cluster  1  incluye  25685 miembros.

El cluster  2  incluye  42243 miembros.

El cluster  0  incluye  30931 miembros.

El cluster  1  incluye  25365 miembros.

El cluster  2  incluye  42147 miembros.

El cluster  0  i

In [11]:
display(dict_of_dicts)

{'AIDangerous': {'0': 'Algorithms making important decisions',
  '1': 'Artificial intelligence surpassing human intelligence ("the singularity")',
  '2': 'Evolving definitions of "fairness" in algorithmic versus human decisions',
  '3': 'Increasing automation of jobs'},
 'AIFuture': {'0': "I don't care about it, or I haven't thought about it.",
  '1': "I'm excited about the possibilities more than worried about the dangers.",
  '2': "I'm worried about the dangers more than I'm excited about the possibilities."},
 'AIInteresting': {'0': 'Algorithms making important decisions',
  '1': 'Artificial intelligence surpassing human intelligence ("the singularity")',
  '2': 'Evolving definitions of "fairness" in algorithmic versus human decisions',
  '3': 'Increasing automation of jobs'},
 'AIResponsible': {'0': 'A governmental or other regulatory body',
  '1': 'Nobody',
  '2': 'Prominent industry leaders',
  '3': 'The developers or the people creating the AI'},
 'AdBlocker': {'0': "I'm not sur

In [10]:
display(clusters)

Unnamed: 0,Hobby,OpenSource,Country,Student,Employment,FormalEducation,UndergradMajor,CompanySize,DevType,YearsCoding,YearsCodingProf,JobSatisfaction,CareerSatisfaction,HopeFiveYears,JobSearchStatus,LastNewJob,UpdateCV,ConvertedSalary,CommunicationTools,TimeFullyProductive,EducationTypes,SelfTaughtTypes,HackathonReasons,AgreeDisagree1,AgreeDisagree2,AgreeDisagree3,LanguageDesireNextYear,DatabaseWorkedWith,DatabaseDesireNextYear,PlatformWorkedWith,PlatformDesireNextYear,FrameworkWorkedWith,FrameworkDesireNextYear,IDE,OperatingSystem,NumberMonitors,Methodology,VersionControl,CheckInCode,AdBlocker,AdBlockerDisable,AdBlockerReasons,AdsAgreeDisagree1,AdsAgreeDisagree2,AdsAgreeDisagree3,AdsActions,AIDangerous,AIInteresting,AIResponsible,AIFuture,EthicsChoice,EthicsReport,EthicsResponsible,EthicalImplications,StackOverflowRecommend,StackOverflowVisit,StackOverflowHasAccount,StackOverflowParticipate,StackOverflowJobs,StackOverflowDevStory,StackOverflowJobsRecommend,StackOverflowConsiderMember,HypotheticalTools1,HypotheticalTools2,HypotheticalTools3,HypotheticalTools4,HypotheticalTools5,WakeTime,HoursComputer,HoursOutside,SkipMeals,ErgonomicDevices,Exercise,EducationParents,RaceEthnicity,Age,Dependents,SurveyTooLong,SurveyEasy,LanguageWorkedWith,AssessJob,AssessBenefits,JobContactPriorities,JobEmailPriorities,AdsPriorities,Cluster
0,1,0,IND,0,0,1,6,8,"[0, 12, 11]",7,0,5,3,2,2,3,7,0.065076,"[8, 4]",3,"[8, 7]","[5, 7]",[4],0,0,1,"[18, 27, 14, 5, 17]","[14, 19]","[14, 13, 17]","[14, 2]","[14, 2, 8]","[5, 1]","[5, 6]","[15, 10, 19]",3,1,"[0, 9]","[1, 4]",2,2,3,[0],1,1,0,[3],1,0,3,1,1,0,2,2,10,5,2,4,2,2,5,2,3,4,4,4,4,6,2,0,3,[0],3,1,[6],0,0,1,4,"[14, 18, 5, 31, 17]","[9, 8, 7, 1, 2, 4, 10, 3, 6, 5]","[1, 2, 3, 10, 6, 9, 11, 4, 8, 7, 5]","[2, 1, 5, 4, 3]","[1, 3, 7, 2, 4, 6, 5]","[1, 5, 2, 4, 6, 7, 3]",
1,1,1,USA,0,0,1,6,8,"[0, 12, 11]",7,11,5,5,6,1,5,7,0.039675,[8],3,"[8, 1]","[7, 5]",[0],0,2,2,"[18, 27, 14]","[14, 19]","[14, 17]","[14, 22]","[14, 2]",[5],[5],"[18, 19]",3,2,[0],[1],2,2,3,[0],1,1,0,[3],0,3,3,1,1,0,2,2,10,2,2,4,2,2,5,2,3,2,3,3,3,6,2,0,3,[0],3,1,[6],1,0,0,4,"[14, 18, 5, 31]","[9, 10, 7, 1, 2, 3, 5, 4, 8, 6]","[1, 2, 3, 10, 9, 4, 11, 5, 8, 7, 6]","[2, 1, 5, 3, 4]","[1, 6, 2, 3, 4, 7, 5]","[1, 4, 2, 3, 6, 7, 5]",
2,1,0,USA,0,0,1,6,4,"[0, 12, 11, 6]",9,7,3,3,6,2,3,7,0.162743,"[8, 4, 5]",3,"[8, 7, 5]","[7, 5, 0]",[0],0,1,1,"[18, 14, 27, 31, 5]","[14, 19, 17]","[17, 13, 14]","[14, 22, 0]","[14, 0, 2]","[5, 1]","[5, 6]","[18, 19, 10]",3,2,"[0, 9, 4]","[1, 4]",2,2,3,"[6, 2]",1,1,0,"[3, 2]",0,3,3,1,1,0,2,2,10,2,2,4,2,2,5,2,3,2,3,3,3,5,2,0,3,[2],3,2,[6],1,0,0,2,"[18, 14, 5, 31, 1, 17]","[9, 8, 7, 1, 2, 4, 10, 3, 6, 5]","[1, 2, 3, 10, 8, 4, 11, 5, 9, 7, 6]","[2, 1, 5, 3, 4]","[1, 5, 2, 3, 4, 7, 6]","[1, 4, 2, 3, 6, 7, 5]",


In [None]:
dec_clusters = decode(clusters[["Country", "OpenSource", "DevType", "YearsCodingProf", 
                  "JobSatisfaction", "CareerSatisfaction", "JobSearchStatus", "LastNewJob",
                  "ConvertedSalary", "CommunicationTools", "TimeFullyProductive", 
                  "EducationTypes", "SelfTaughtTypes", "AgreeDisagree2", "AgreeDisagree3", 
                  "LanguageDesireNextYear", "DatabaseWorkedWith", "DatabaseDesireNextYear", 
                  "PlatformWorkedWith", "PlatformDesireNextYear", "FrameworkWorkedWith", 
                  "FrameworkDesireNextYear", "IDE", "Methodology", 
                  "VersionControl", "AdBlockerReasons", "AdsAgreeDisagree2", "AdsActions", 
                  "AIDangerous", "StackOverflowVisit", "StackOverflowJobs", "HypotheticalTools2", 
                  "HypotheticalTools3", "HypotheticalTools4", "HypotheticalTools5", "WakeTime", 
                  "ErgonomicDevices", "Age", "SurveyTooLong", "SurveyEasy", "LanguageWorkedWith", 
                  "AssessJob", "AssessBenefits", "JobEmailPriorities", "AdsPriorities"]])
print(dec_clusters)

In [13]:
print(clusters[["Country", "OpenSource", "DevType", "YearsCodingProf", 
                  "JobSatisfaction", "CareerSatisfaction", "JobSearchStatus", "LastNewJob",
                  "ConvertedSalary", "CommunicationTools", "TimeFullyProductive", 
                  "EducationTypes", "SelfTaughtTypes", "AgreeDisagree2", "AgreeDisagree3", 
                  "LanguageDesireNextYear", "DatabaseWorkedWith", "DatabaseDesireNextYear", 
                  "PlatformWorkedWith", "PlatformDesireNextYear", "FrameworkWorkedWith", 
                  "FrameworkDesireNextYear", "IDE", "Methodology", 
                  "VersionControl", "AdBlockerReasons", "AdsAgreeDisagree2", "AdsActions", 
                  "AIDangerous", "StackOverflowVisit", "StackOverflowJobs", "HypotheticalTools2", 
                  "HypotheticalTools3", "HypotheticalTools4", "HypotheticalTools5", "WakeTime", 
                  "ErgonomicDevices", "Age", "SurveyTooLong", "SurveyEasy", "LanguageWorkedWith", 
                  "AssessJob", "AssessBenefits", "JobEmailPriorities", "AdsPriorities"]])
print(0.065076 * 200000)
print(0.039675 * 200000)
print(.162743  * 200000)

  Country OpenSource         DevType YearsCodingProf JobSatisfaction  \
0     IND          0     [0, 12, 11]               0               5   
1     USA          1     [0, 12, 11]              11               5   
2     USA          0  [0, 12, 11, 6]               7               3   

  CareerSatisfaction JobSearchStatus LastNewJob  ConvertedSalary  \
0                  3               2          3         0.065076   
1                  5               1          5         0.039675   
2                  3               2          3         0.162743   

  CommunicationTools TimeFullyProductive EducationTypes SelfTaughtTypes  \
0             [8, 4]                   3         [8, 7]          [5, 7]   
1                [8]                   3         [8, 1]          [7, 5]   
2          [8, 4, 5]                   3      [8, 7, 5]       [7, 5, 0]   

  AgreeDisagree2 AgreeDisagree3 LanguageDesireNextYear DatabaseWorkedWith  \
0              0              1    [18, 27, 14, 5, 17]      