<table width=100%>
<tr>
<td width=100%>
<img src="https://drive.google.com/uc?export=view&id=1v7cQGiZPXcA0o3dS56r3YBVhHgykqatq" width=500>
</td>
</tr>
<tr>
<td width=100%>
<h1><b>Associate lecturer survey: clustering</b></h1>
<h4>2021 - Josep Ramon Morros
<br>
<a href="https://imatge.upc.edu/web/"> GPI @ IDEAI</a> Research group</h4>
</td>
</tr>
</table>

First, upload the files *respostes_enquesta_associats_postprocessades.csv* and *radar_chart.py* to Colab using the menu on the left (click the folder icon 
<img src="https://drive.google.com/uc?export=view&id=1v7nFtfUESbeCxpEn4ohlF8ImxZ_V08OV" width=15>, then upload the files). 

In [None]:
!pip install kmodes

In [None]:
import sys
import pandas as pd
import numpy as np
from kmodes.kprototypes import KPrototypes
import matplotlib.pyplot as plt

Read the file with the survey responses. This file has already been preprocessed. **TODO: describe the preprocessing**

In [None]:
file_name = 'respostes_enquesta_associats_postprocessades.csv'

df = pd.read_csv(file_name, sep=',')

Discard the fields that are not used for clustering:

In [None]:
# Some fields are personal opinions and not suitable for clustering. Remove them.
df.drop(['X1', 'X2', 'X3', 'ESCOLA'], axis='columns', inplace=True)
df.drop(['ALTRAUNI', 'REDUCCIO', 'RENOVACIO', 'ASSEMBLEES', 'MOBILITZACIONS',
         'BONTRACTE', 'DEPARTAMENT', 'ANYACREDITACIO', 'ANYTESI', 'GESTIO'],
        axis='columns', inplace=True)

The ACREDITACIO column can hold multiple answers (a person can have no accreditation, one, or several). Only the "most important" accreditation is kept. I have defined an ordering of the accreditations, so this variable is treated as numeric.
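
As a small illustration of this rule, the sketch below applies it to a few made-up answers. It assumes that multiple accreditations arrive in a single string; the shortened ranking and the sample answers are hypothetical, not survey data.

```python
# Illustrative ranking only; the notebook defines its own dictionary in the next cell.
RANK = {'Col·laborador/a': 2, 'Lector/a': 4, 'Agregat/da': 6}

def best_accreditation(answer, rank=RANK):
    """Return the highest rank whose name appears in the answer, or 0 if none does."""
    return max([0] + [val for name, val in rank.items() if name in answer])

answers = ['Lector/a, Agregat/da', 'Col·laborador/a', 'Cap']  # made-up survey answers
best = [best_accreditation(a) for a in answers]
print(best)  # [6, 2, 0]
```

An answer that matches no known accreditation gets the code 0, which sorts below every real accreditation.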

In [None]:
# The column 'ACREDITACIO' can hold multiple answers.
# Rank the accreditations so that only the 'best' one is kept when there are several.
acreditations_names_dic = {'Universitat privada': 1, 'Col·laborador/a': 2, 'Ajudant doctor': 3,
                           'Lector/a': 4, 'Contractat/da doctor/a': 5, 'Agregat/da': 6}

simplified_acred = []
for elem in df['ACREDITACIO']:
    codes = [0]  # 0 means no accreditation
    for name, val in acreditations_names_dic.items():
        if name in str(elem):  # str() keeps missing answers (NaN) from raising a TypeError
            codes.append(val)
    simplified_acred.append(max(codes))  # Keep only the maximum value (most important accreditation)

# Replace the column. Note that the new 'ACREDITACIO' column ends up as the last column of df.
df.drop('ACREDITACIO', axis='columns', inplace=True)
df['ACREDITACIO'] = simplified_acred

We represent the categories in numeric form. I am not sure whether this step is necessary.
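
To make the encoding concrete, here is a minimal sketch on a made-up column. It shows how pandas assigns the codes and how to recover the code-to-label mapping, which helps when reading the clustering results later.

```python
import pandas as pd

toy = pd.DataFrame({'SEXE': ['Home', 'Dona', 'Dona', 'Home']})  # made-up values
as_category = toy['SEXE'].astype('category')
codes = as_category.cat.codes

# Categories are sorted, so 'Dona' -> 0 and 'Home' -> 1
mapping = dict(enumerate(as_category.cat.categories))
print(list(codes), mapping)
```

The codes follow the sorted order of the labels, so for truly categorical fields the numeric order carries no meaning.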

In [None]:
# Headers for the columns in the resulting sheet
df_columns = ['EDAT', 'SEXE', 'ANTIGUETAT', 'TIPUS', 'DOCTOR', 'RAOTESI', 'ACREDITACIO', 'FENTTESI',
              'RAOFENTTESI', 'DISPOSATTESI', 'CARRERA', 'ASSOCIAT', 'FALSASSOCIAT', 'EXPECTATIVES',
              'FORMACIO', 'SATISFET']

# To perform clustering, it is easier to represent the categories in numerical form                                                          
# We use label encoding.                                                                                                                     
# See https://towardsdatascience.com/categorical-encoding-using-label-encoding-and-one-hot-encoder-911ef77fb5bd                              

# Convert data to category codes (numerical)                                                                                                 
for col in df_columns:
    df[col] = df[col].astype('category').cat.codes


Split the data into two parts: lecturers with a doctorate and lecturers without one.

In [None]:
# In the form, there are conditional questions (they only appear depending on the answer to
# a previous question). For instance, if the user answers Yes to DOCTOR, then he/she can answer
# questions 'ANYTESI', 'RAOTESI', 'ACREDITACIO', 'ANYACREDITACIO'. If the user answers No,
# the extra choices are 'FENTTESI', 'RAOFENTTESI', 'DISPOSATTESI'.
# Because of this, two main groups are created, DOCTOR=Yes, DOCTOR=No

# Separate doctors from non-doctors. The two partitions will be clustered separately                                                         
df_sidoctor = df.loc[df['DOCTOR'] == 1].copy()
df_nodoctor = df.loc[df['DOCTOR'] == 0].copy()

# Drop fields that do not apply in each case                                                                                                 
df_nodoctor.drop(['DOCTOR', 'RAOTESI', 'ACREDITACIO'                 ],    axis='columns', inplace=True)
df_sidoctor.drop(['DOCTOR', 'FENTTESI', 'RAOFENTTESI', 'DISPOSATTESI'],    axis='columns', inplace=True)


Specify which fields are categorical and which are numeric. KPrototypes needs this information.
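
The distinction matters because k-prototypes, as I understand its usual formulation, measures distance differently for the two kinds of field: squared Euclidean distance on the numeric ones and a count of mismatches on the categorical ones, combined through a weight. The sketch below illustrates that idea only; it is not the kmodes implementation, and `gamma` is a hypothetical weight chosen for the example.

```python
import numpy as np

def mixed_dissimilarity(row, prototype, categorical, gamma=1.0):
    """Squared Euclidean distance on numeric fields plus gamma times the
    number of mismatching categorical fields (gamma is a hypothetical weight)."""
    row = np.asarray(row, dtype=float)
    prototype = np.asarray(prototype, dtype=float)
    is_cat = np.zeros(row.shape[0], dtype=bool)
    is_cat[categorical] = True
    numeric_part = np.sum((row[~is_cat] - prototype[~is_cat]) ** 2)
    categorical_part = np.sum(row[is_cat] != prototype[is_cat])
    return numeric_part + gamma * categorical_part

# Columns 1 and 2 are categorical codes, columns 0 and 3 are numeric
d = mixed_dissimilarity([3, 0, 2, 1], [1, 0, 1, 1], categorical=[1, 2], gamma=0.5)
print(d)  # 4.5
```

Marking a numeric field as categorical (or the reverse) changes which of the two terms it contributes to, so the type lists need to match the data.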

In [None]:
# Encoding of type of columns. 0 means the column is numeric, 1 means the column is categorical.                                             
no_doctor_cat = [0,1,0,0, 1,1,1, 1,1,1,1,1,0]
si_doctor_cat = [0,1,0,0,   1,1, 1,1,1,1,1,0]

Get the indices of the categorical fields:

In [None]:
# Indices of categorical columns                                                                                                             
no_doctor_cat_list = [ii for ii, value in enumerate(no_doctor_cat) if value == 1]
si_doctor_cat_list = [ii for ii, value in enumerate(si_doctor_cat) if value == 1]

# Convert data to numpy arrays (required by KPrototypes)                                                                                     
no_doctor_data = df_nodoctor.to_numpy(copy=True)
si_doctor_data = df_sidoctor.to_numpy(copy=True)
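
The 0/1 lists are positional, so they only stay valid while the column order is unchanged (recall that ACREDITACIO is re-added as the last column). One possible alternative, sketched here on a toy frame, derives the indices from column names instead. The choice of numeric columns in the example is an assumption for illustration.

```python
import pandas as pd

# Toy frame with made-up values; column names follow the notebook
toy = pd.DataFrame({'EDAT': [30, 45], 'SEXE': [0, 1], 'ANTIGUETAT': [2, 10], 'CARRERA': [1, 0]})

# Hypothetical choice of which columns to treat as numeric
numeric_columns = {'EDAT', 'ANTIGUETAT'}

cat_list = [idx for idx, name in enumerate(toy.columns) if name not in numeric_columns]
print(cat_list)  # [1, 3]
```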

Choosing the number of clusters can be tricky. Here we try several values of K and check visually (with [Google Data Studio](https://datastudio.google.com), for example) which one best explains our data. Methods that [determine the number of clusters automatically](https://en.wikipedia.org/wiki/Determining_the_number_of_clusters_in_a_data_set) could also be used.
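
One such automatic method is the elbow heuristic: pick the K where the clustering cost stops dropping quickly. The sketch below is one simple way to do it, assuming a list of costs such as the `kp.cost_` values is available for each K. The numbers are made up and `elbow_k` is a hypothetical helper, not part of kmodes.

```python
import numpy as np

def elbow_k(ks, costs):
    """Return the k where the cost curve bends most (largest second difference)."""
    costs = np.asarray(costs, dtype=float)
    second_diff = costs[:-2] - 2 * costs[1:-1] + costs[2:]
    return ks[1 + int(np.argmax(second_diff))]

ks = [2, 3, 4, 5, 6]
costs = [100.0, 60.0, 50.0, 45.0, 42.0]  # made-up costs, not results from the survey
best_k = elbow_k(ks, costs)
print(best_k)  # 3
```

The heuristic can only select interior values of K, so it needs at least three values to work and is best treated as a hint alongside the visual check.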

In [None]:
# Create a folder for the results (-p avoids an error if it already exists)
!mkdir -p /content/results

In [None]:
# Perform clustering with several values of K for non-doctors

no_doctor_clusters  = []
no_doctor_centroids = []
no_doctor_labels    = []
no_doctor_cost      = []

df_nodoctor_clust = []
for ii in range(2, 6):
    print('Non doctors: using {} clusters'.format(ii))
    kp = KPrototypes(n_clusters=ii, init='Cao', verbose=0)
    no_doctor_clusters.append(kp.fit_predict(no_doctor_data, categorical=no_doctor_cat_list))
    no_doctor_centroids.append(kp.cluster_centroids_)
    no_doctor_labels.append(kp.labels_)
    no_doctor_cost.append(kp.cost_)

    # Store a copy per value of K. Appending df_nodoctor itself would make every list entry
    # point to the same DataFrame, so all of them would end up with the labels of the last K.
    df_nodoctor_clust.append(df_nodoctor.copy())
    df_nodoctor_clust[-1]['Cluster'] = no_doctor_clusters[-1]

    # Save the results
    out_name = '/content/results/out_nodoctor_k{}.csv'.format(ii)
    df_nodoctor_clust[-1].to_csv(out_name)

Likewise for associate lecturers with a doctorate:

In [None]:
# Perform clustering with several values of K for doctors

si_doctor_clusters  = []
si_doctor_centroids = []
si_doctor_labels    = []
si_doctor_cost      = []

df_sidoctor_clust = []
for ii in range(2, 6):
    print('Doctors: using {} clusters'.format(ii))
    kp = KPrototypes(n_clusters=ii, init='Cao', verbose=0)

    # Get the cluster label assigned to each row (answer)
    si_doctor_clusters.append(kp.fit_predict(si_doctor_data, categorical=si_doctor_cat_list))
    si_doctor_centroids.append(kp.cluster_centroids_)
    si_doctor_labels.append(kp.labels_)
    si_doctor_cost.append(kp.cost_)

    # Store a copy per value of K. Appending df_sidoctor itself would make every list entry
    # point to the same DataFrame, so all of them would end up with the labels of the last K.
    df_sidoctor_clust.append(df_sidoctor.copy())
    df_sidoctor_clust[-1]['Cluster'] = si_doctor_clusters[-1]

    # Save the results
    out_name = '/content/results/out_sidoctor_k{}.csv'.format(ii)
    df_sidoctor_clust[-1].to_csv(out_name)


To analyze the resulting clusters, we create a *radar chart* for each clustering. Since we tried $k = 2, 3, 4, 5$, we get 4 charts. 

A given chart shows k clusters. For each cluster and each column, we compute the average of the responses belonging to that cluster. The *radar chart* visualizes these averages.

Note: for simplicity, some variables (EDAT, SEXE, FORMACIO) are used in the clustering but are not shown in the *radar chart*. The 'Cluster' column, which indicates the cluster each row belongs to, is not shown either.
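
The per-cluster averages described above can also be computed with a pandas `groupby`, as in this minimal sketch on made-up values. It is an alternative formulation for illustration, not the code used in the cells that follow.

```python
import pandas as pd

toy = pd.DataFrame({
    'ANTIGUETAT': [1, 3, 10, 12],
    'SATISFET':   [0, 2, 4, 2],
    'Cluster':    [0, 0, 1, 1],
})  # made-up values

cluster_avgs = toy.groupby('Cluster').mean()
cluster_stds = toy.groupby('Cluster').std(ddof=0)  # ddof=0 matches np.std's default
print(cluster_avgs)
```

Each row of `cluster_avgs` corresponds to one polygon in the radar chart, with one spoke per column.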

In [None]:
from radar_chart import radar_factory

For lecturers without a doctorate:

In [None]:
visualized_columns = [2,3,4,5,6,7,8,9,10,12]   # exclude EDAT, SEXE, FORMACIO and Cluster
                                               # from radar chart

# Collect, for each clustering result, the per-cluster statistics to plot in the radar charts
total_data = {} # Dictionary: the keys are the values of k, the values are [averages, stds] of each cluster

for ii in range(2,6):  # Loop the different clustering results k=2,3,4,5

    this_clustering_avgs = []
    this_clustering_stds = []

    # For each cluster 
    for jj in range(0, ii):
        # Select the rows that belong to this cluster
        data = df_nodoctor_clust[ii-2][df_nodoctor_clust[ii-2]['Cluster'] == jj]
    
        # Compute the average of all rows belonging to this cluster and store in list this_clustering_avgs
        this_clustering_avgs.append(np.average(data.to_numpy(copy=True)[:,visualized_columns], axis=0))
        this_clustering_stds.append(np.std(data.to_numpy(copy=True)[:,visualized_columns], axis=0))

    this_clustering_avgs = np.array(this_clustering_avgs) # Shape: k x 10
    this_clustering_stds = np.array(this_clustering_stds) # Shape: k x 10

    total_data[ii] = [this_clustering_avgs, this_clustering_stds] # Store in dictionary


N = 10  # Number of displayed variables in the chart 

# Names of the displayed variables
spoke_labels = df_nodoctor_clust[0].columns 
spoke_labels = spoke_labels[visualized_columns]

# Create a radar chart object
theta = radar_factory(N, frame='polygon')

colors = ['b', 'r', 'g', 'm', 'y', 'k']  # Define colors for each cluster.


fig, axs = plt.subplots(figsize=(12, 22), nrows=4, ncols=2, subplot_kw=dict(projection='radar'))
fig.subplots_adjust(wspace=0.25, hspace=0.25, top=0.85, bottom=0.05)

for jj, (k, data) in enumerate(total_data.items()):
    title = 'Lecturers without doctorate, k = {}: Avg'.format(k)
    axs[jj,0].set_rgrids([0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0])
    axs[jj,0].set_title(title, weight='bold', size='medium', position=(0.5, 1.1),
                        horizontalalignment='center', verticalalignment='center')

    title = 'Lecturers without doctorate, k = {}: Std'.format(k)
    axs[jj,1].set_rgrids([0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0])
    axs[jj,1].set_title(title, weight='bold', size='medium', position=(0.5, 1.1),
                        horizontalalignment='center', verticalalignment='center')

    for ii in range(k):
        axs[jj,0].plot(theta, data[0][ii,:], color=colors[ii])
        axs[jj,0].fill(theta, data[0][ii,:], facecolor=colors[ii], alpha=0.25)
        axs[jj,0].set_varlabels(spoke_labels)

        axs[jj,1].plot(theta, data[1][ii,:], color=colors[ii])
        axs[jj,1].fill(theta, data[1][ii,:], facecolor=colors[ii], alpha=0.25)
        axs[jj,1].set_varlabels(spoke_labels)


plt.tight_layout()
plt.show()

For lecturers with a doctorate:

In [None]:
visualized_columns = [2,3,4,5,6,7,8,10,11]   # exclude EDAT, SEXE, FORMACIO and Cluster
                                             # from radar chart

# Collect, for each clustering result, the per-cluster statistics to plot in the radar charts
total_data = {} # Dictionary: the keys are the values of k, the values are [averages, stds] of each cluster

for ii in range(2,6):  # Loop the different clustering results k=2,3,4,5

    this_clustering_avgs = []
    this_clustering_stds = []

    # For each cluster 
    for jj in range(0, ii):
        # Select the rows that belong to this cluster
        data = df_sidoctor_clust[ii-2][df_sidoctor_clust[ii-2]['Cluster'] == jj]
    
        # Compute the average of all rows belonging to this cluster and store in list this_clustering_avgs
        this_clustering_avgs.append(np.average(data.to_numpy(copy=True)[:,visualized_columns], axis=0))
        this_clustering_stds.append(np.std(data.to_numpy(copy=True)[:,visualized_columns], axis=0))

    this_clustering_avgs = np.array(this_clustering_avgs) # Shape: k x 9
    this_clustering_stds = np.array(this_clustering_stds) # Shape: k x 9

    total_data[ii] = [this_clustering_avgs, this_clustering_stds] # Store in dictionary


N = 9  # Number of displayed variables in the chart 

# Names of the displayed variables
spoke_labels = df_sidoctor_clust[0].columns 
spoke_labels = spoke_labels[visualized_columns]

# Create a radar chart object
theta = radar_factory(N, frame='polygon')

colors = ['b', 'r', 'g', 'm', 'y', 'k']  # Define colors for each cluster.


fig, axs = plt.subplots(figsize=(12, 22), nrows=4, ncols=2, subplot_kw=dict(projection='radar'))
fig.subplots_adjust(wspace=0.25, hspace=0.25, top=0.85, bottom=0.05)

for jj, (k, data) in enumerate(total_data.items()):
    title = 'Lecturers with doctorate, k = {}: Avg'.format(k)
    axs[jj,0].set_rgrids([0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0])
    axs[jj,0].set_title(title, weight='bold', size='medium', position=(0.5, 1.1),
                        horizontalalignment='center', verticalalignment='center')

    title = 'Lecturers with doctorate, k = {}: Std'.format(k)
    axs[jj,1].set_rgrids([0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0])
    axs[jj,1].set_title(title, weight='bold', size='medium', position=(0.5, 1.1),
                        horizontalalignment='center', verticalalignment='center')

    for ii in range(k):
        axs[jj,0].plot(theta, data[0][ii,:], color=colors[ii])
        axs[jj,0].fill(theta, data[0][ii,:], facecolor=colors[ii], alpha=0.25)
        axs[jj,0].set_varlabels(spoke_labels)

        axs[jj,1].plot(theta, data[1][ii,:], color=colors[ii])
        axs[jj,1].fill(theta, data[1][ii,:], facecolor=colors[ii], alpha=0.25)
        axs[jj,1].set_varlabels(spoke_labels)


plt.tight_layout()
plt.show()