<h1>Gower distance calculation for Python V2</h1>

<p>The data under study is not always a uniform matrix of numerical values. Sometimes you need to work with mixed variable types (e.g., categorical, boolean, numerical).
</p>
<p>This notebook provides a single function that calculates the Gower distance for such mixed-type data.
</p>
<p>For more details about the Gower distance, please visit: <a href="http://members.cbio.mines-paristech.fr/~jvert/svn/bibli/local/Gower1971general.pdf">Gower, J.C., 1971, A General Coefficient of Similarity and Some of Its Properties</a>.</p>


<h2>1. Generate some data with mixed types</h2>

In [2]:
import numpy as np
import pandas as pd
from scipy.spatial import distance 
from sklearn.metrics import pairwise

X = pd.DataFrame({
    'age': [21, 21, 19, 30, 21, 21, 19, 30, None],
    'gender': ['M', 'M', 'N', 'M', 'F', 'F', 'F', 'F', None],
    'civil_status': ['MARRIED', 'SINGLE', 'SINGLE', 'SINGLE', 'MARRIED', 'SINGLE', 'WIDOW', 'DIVORCED', None],
    'salary': [3000.0, 1200.0, 32000.0, 1800.0, 2900.0, 1100.0, 10000.0, 1500.0, None],
    'has_children': [True, False, True, True, True, False, False, True, None],
    'available_credit': [2200, 100, 22000, 1100, 2000, 100, 6000, 2200, None],
})


print(X)

    age  available_credit civil_status gender has_children   salary
0  21.0            2200.0      MARRIED      M         True   3000.0
1  21.0             100.0       SINGLE      M        False   1200.0
2  19.0           22000.0       SINGLE      N         True  32000.0
3  30.0            1100.0       SINGLE      M         True   1800.0
4  21.0            2000.0      MARRIED      F         True   2900.0
5  21.0             100.0       SINGLE      F        False   1100.0
6  19.0            6000.0        WIDOW      F        False  10000.0
7  30.0            2200.0     DIVORCED      F         True   1500.0
8   NaN               NaN         None   None         None      NaN
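<p>To make the quantity concrete, here is a minimal hand computation of the Gower distance between records 0 and 1 of the table above, in plain Python. It assumes equal weights, range-normalised absolute differences for numeric columns and a simple match/mismatch rule for the others; the helper name <code>feature_dissimilarity</code> is illustrative and not part of the notebook.</p>

```python
# Records 0 and 1, copied from the printed table.
rec0 = {"age": 21.0, "available_credit": 2200.0, "civil_status": "MARRIED",
        "gender": "M", "has_children": True, "salary": 3000.0}
rec1 = {"age": 21.0, "available_credit": 100.0, "civil_status": "SINGLE",
        "gender": "M", "has_children": False, "salary": 1200.0}

# Observed range (max - min) of each numeric column, read off the table.
ranges = {"age": 30.0 - 19.0,
          "available_credit": 22000.0 - 100.0,
          "salary": 32000.0 - 1100.0}


def feature_dissimilarity(name, a, b):
    """Return a dissimilarity in [0, 1] for a single feature (illustrative helper)."""
    if name in ranges:
        # numeric feature: absolute difference scaled by the column range
        return abs(a - b) / ranges[name]
    # categorical or boolean feature: 0 on a match, 1 on a mismatch
    return 0.0 if a == b else 1.0


parts = {name: feature_dissimilarity(name, rec0[name], rec1[name]) for name in rec0}

# With equal weights the Gower distance is the plain mean of the per-feature terms.
gower_01 = sum(parts.values()) / len(parts)
print(round(gower_01, 7))  # 0.3590238
```

<p>The result agrees with the first entry of the R output in the last section (0.3590238).</p>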





<h2>2. The Gower Function</h2>


In [3]:
def gower_distances(X, Y=None, w=None):
    """
    Computes the Gower distances between the rows of X and the rows of Y.

    Parameters
    ----------
    X : array-like, shape (n_samples_X, n_features)

    Y : array-like, shape (n_samples_Y, n_features), optional
        If omitted, the distances are computed between the rows of X.

    w : list of length n_features, optional
        Attribute weights. Every attribute has weight 1 by default.

    Returns
    -------
    distances : ndarray, shape (n_samples_X, n_samples_Y)

    Notes
    -----
    The Gower coefficient handles mixed categorical, boolean and numerical data.
    """

    # Work on object arrays so that numeric, boolean and string columns can coexist.
    symmetric = Y is None
    X = np.asarray(X, dtype=object)
    Y = X if symmetric else np.asarray(Y, dtype=object)
    rows, cols = X.shape

    # Infer the type of each column from the first row of X.
    dtypes = []
    for col in range(cols):
        dtypes.append(type(X[0, col]))

    # Range (max - min) of each numeric column, ignoring missing values.
    ranges_of_numeric = [0.0] * cols
    for col in range(cols):
        if np.issubdtype(dtypes[col], np.number):
            values = X[:, col].astype(dtypes[col])
            col_max = np.nanmax(values) + 0.0
            col_min = np.nanmin(values) + 0.0
            # an all-missing column has no range
            if not (np.isnan(col_max) or np.isnan(col_min)):
                ranges_of_numeric[col] = col_max - col_min

    # According to the Gower formula, w holds one weight per attribute.
    if w is None:
        w = [1] * cols

    yrows = Y.shape[0]

    dm = np.zeros((rows, yrows), dtype=np.double)

    for i in range(rows):
        # X against itself is symmetric: compute the upper triangle and mirror it.
        # With an explicit Y every pair has to be computed.
        j_start = i if symmetric else 0

        for j in range(j_start, yrows):
            xi = X[i]
            xj = Y[j]
            sum_sij = 0.0
            sum_wij = 0.0
            for col in range(cols):
                value_xi = xi[col]
                value_xj = xj[col]
                if np.issubdtype(dtypes[col], np.number):
                    if np.isnan(value_xi) or np.isnan(value_xj):
                        # a missing value takes the attribute out of the comparison
                        sij, wij = 0.0, 0
                    elif ranges_of_numeric[col] != 0:
                        sij = abs(value_xi - value_xj) / ranges_of_numeric[col]
                        wij = w[col]
                    else:
                        # constant column: every pair is identical on this attribute
                        sij, wij = 0.0, w[col]
                else:
                    if value_xi is None or value_xj is None:
                        sij, wij = 0.0, 0
                    else:
                        sij = 0.0 if value_xi == value_xj else 1.0
                        wij = w[col]
                sum_sij += wij * sij
                sum_wij += wij

            # Without any comparable attribute the distance is undefined.
            dm[i, j] = sum_sij / sum_wij if sum_wij != 0 else np.nan
            if symmetric:
                dm[j, i] = dm[i, j]

    return dm



<h2>5. Get the Gower distance matrix</h2>

In [4]:
D = gower_distances(X)
print(D)
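
<p>Gower's coefficient leaves an attribute out of a comparison when either value is missing, and lets each attribute carry its own weight. The following is a minimal per-pair sketch of that rule in plain Python, independent of the function above. The names <code>is_missing</code> and <code>gower_pair</code>, the <code>ranges</code> convention (<code>None</code> for non-numeric attributes) and the sample records are assumptions made for illustration.</p>

```python
import math


def is_missing(value):
    """True for None and for a float NaN."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def gower_pair(a, b, ranges, weights=None):
    """Gower distance between two records given as equal-length sequences.

    ranges[k] is the observed range (max - min) of numeric attribute k,
    or None when attribute k is categorical or boolean (illustrative convention).
    """
    if weights is None:
        weights = [1.0] * len(a)
    weighted_sum = 0.0
    weight_total = 0.0
    for va, vb, rng, wk in zip(a, b, ranges, weights):
        if is_missing(va) or is_missing(vb):
            continue  # a missing value removes the attribute from this comparison
        if rng is None:
            s = 0.0 if va == vb else 1.0  # match / mismatch
        elif rng == 0:
            s = 0.0  # constant column
        else:
            s = abs(va - vb) / rng  # range-normalised difference
        weighted_sum += wk * s
        weight_total += wk
    # no comparable attribute at all: the distance is undefined
    return weighted_sum / weight_total if weight_total else math.nan


# Made-up records with the attributes (age, gender, salary).
ranges = [11.0, None, 30900.0]
a = [21.0, "M", 3000.0]
b = [21.0, "F", None]  # salary is missing, so only age and gender are compared

d = gower_pair(a, b, ranges)                             # (0 + 1) / 2
d_w = gower_pair(a, b, ranges, weights=[1.0, 3.0, 1.0])  # (1*0 + 3*1) / (1 + 3)
undefined = gower_pair(a, [None, None, None], ranges)    # nothing to compare
print(d, d_w, undefined)  # 0.5 0.75 nan
```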

<h2>6. The equivalent code in R</h2>
Using the daisy function from the {cluster} package:

<p>
<code>
library(cluster)

age=c(21,21,19,30,21,21,19,30,NA)
gender=c('M','M','N','M','F','F','F','F',NA)
civil_status=c('MARRIED','SINGLE','SINGLE','SINGLE','MARRIED','SINGLE','WIDOW','DIVORCED',NA)
salary=c(3000.0,1200.0 ,32000.0,1800.0 ,2900.0 ,1100.0 ,10000.0,1500.0,NA)
children=c(TRUE,FALSE,TRUE,TRUE,TRUE,FALSE,FALSE,TRUE,NA)
available_credit=c(2200,100,22000,1100,2000,100,6000,2200,NA)
X=data.frame(age,gender,civil_status,salary,children,available_credit)

D=daisy(X,metric="gower")

print(D)

Dissimilarities :
          1         2         3         4         5         6         7         8
2 0.3590238                                                                      
3 0.6707398 0.6964303                                                            
4 0.3178742 0.3138769 0.6552807                                                  
5 0.1687281 0.5236290 0.6728013 0.4824794                                        
6 0.5262298 0.2006472 0.6969697 0.4810829 0.3575017                              
7 0.5969786 0.5472028 0.7404280 0.7481861 0.4323733 0.3478501                    
8 0.4777876 0.6539635 0.8151941 0.3433228 0.3121036 0.4878362 0.5747661          
9        NA        NA        NA        NA        NA        NA        NA        NA

</code>
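
<p>When comparing these numbers with a Python result, note how the boolean column is treated. The value printed above for records 2 and 6 (0.2006472) is reproduced by hand only if the FALSE/FALSE comparison on <code>children</code> is left out, which suggests that daisy handled the logical column as an asymmetric binary variable. A simple match/mismatch rule, as in the Python function, keeps that comparison and gives a smaller distance. The check below is a sketch in plain Python using values copied from the R vectors.</p>

```python
# Records 2 and 6 in R's 1-based numbering; both have children == FALSE.
numeric_terms = [
    abs(21 - 21) / (30 - 19),                    # age
    abs(1200.0 - 1100.0) / (32000.0 - 1100.0),   # salary
    abs(100 - 100) / (22000 - 100),              # available_credit
]
nominal_terms = [
    1.0,  # gender: M vs F
    0.0,  # civil_status: SINGLE vs SINGLE
]
total = sum(numeric_terms) + sum(nominal_terms)

# Symmetric treatment: FALSE/FALSE is a match (term 0) and still counts in the denominator.
symmetric = (total + 0.0) / 6

# Asymmetric treatment: FALSE/FALSE carries no information and is dropped entirely.
asymmetric = total / 5

print(round(symmetric, 7), round(asymmetric, 7))  # 0.167206 0.2006472
```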
