<h1>Gower distance calculation for Python V3</h1>
<h3>Version submitted for approval to the scikit-learn project</h3>
https://github.com/scikit-learn/scikit-learn/pull/9555

<p>The data under study is not always a uniform matrix of numerical values. Often you need to work with data that mixes variable types (e.g., categorical, boolean, numerical).
</p>
<p>This notebook provides a <code>gower_distances</code> function that computes pairwise Gower distances for such mixed-type data.
</p>
<p>For more details about the Gower distance, please visit: <a href="http://members.cbio.mines-paristech.fr/~jvert/svn/bibli/local/Gower1971general.pdf">Gower, J.C., 1971, A General Coefficient of Similarity and Some of Its Properties</a>.</p>
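<p>As a compact reference, the quantity computed by the function in this notebook can be written as follows. The symbols (d, w, R) are notation chosen here for illustration and do not appear in the code. Gower's paper states the coefficient as a similarity; the form below is the corresponding dissimilarity.</p>

```latex
d_{ij} = \frac{\sum_{k=1}^{p} w_{ijk}\, d_{ijk}}{\sum_{k=1}^{p} w_{ijk}},
\qquad
d_{ijk} =
\begin{cases}
\dfrac{|x_{ik} - x_{jk}|}{R_k} & \text{if attribute } k \text{ is numerical} \\[1.5ex]
0 & \text{if attribute } k \text{ is categorical or boolean and } x_{ik} = x_{jk} \\[0.5ex]
1 & \text{if attribute } k \text{ is categorical or boolean and } x_{ik} \neq x_{jk}
\end{cases}
```

<p>Here R_k is the range (maximum minus minimum) of numerical attribute k, and w_ijk is the attribute weight, set to 0 when the comparison is skipped because of missing values. In the code below, a numerical comparison is skipped when either value is missing and a categorical one when both are.</p>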


<h2>1. Generate some data with mixed types</h2>

In [34]:
import numpy as np
import pandas as pd
from scipy.spatial import distance 
from sklearn.utils import validation
from sklearn.metrics import pairwise
from scipy.sparse import issparse

X=pd.DataFrame({'age':[21,21,19,30,21,21,19,30,None],
'gender':['M','M','N','M','F','F','F','F',None],
'civil_status':['MARRIED','SINGLE','SINGLE','SINGLE','MARRIED','SINGLE','WIDOW','DIVORCED',None],
'salary':[3000.0,1200.0 ,32000.0,1800.0 ,2900.0 ,1100.0 ,10000.0,1500.0,None],
'has_children':[True,False,True,True,True,False,False,True,None],
'available_credit':[2200,100,22000,1100,2000,100,6000,2200,None]})


print(X)

    age  available_credit civil_status gender has_children   salary
0  21.0            2200.0      MARRIED      M         True   3000.0
1  21.0             100.0       SINGLE      M        False   1200.0
2  19.0           22000.0       SINGLE      N         True  32000.0
3  30.0            1100.0       SINGLE      M         True   1800.0
4  21.0            2000.0      MARRIED      F         True   2900.0
5  21.0             100.0       SINGLE      F        False   1100.0
6  19.0            6000.0        WIDOW      F        False  10000.0
7  30.0            2200.0     DIVORCED      F         True   1500.0
8   NaN               NaN         None   None         None      NaN


# 2. Some pairwise utility functions (not yet released in scikit-learn)


In [35]:


def _return_float_dtype(X, Y):
    """
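    Determine the dtype to use for X and Y.

    1. If both X and Y are float32, float32 is returned.
    2. If X is a dense object array with at least one non-numeric
       column (judged from its first row), object is returned.
    3. Otherwise float is returned.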
    """
    if not issparse(X) and not isinstance(X, np.ndarray):
        X = np.asarray(X)

    if Y is None:
        Y_dtype = X.dtype
    elif not issparse(Y) and not isinstance(Y, np.ndarray):
        Y = np.asarray(Y)
        Y_dtype = Y.dtype
    else:
        Y_dtype = Y.dtype

    if X.dtype == Y_dtype == np.float32:
        dtype = np.float32
    elif X.dtype == object and not issparse(X):
        # np.object and np.float were aliases of the builtins and were
        # removed in NumPy 1.24; the builtins behave identically.
        dtype = float
        for col in range(X.shape[1]):
            if not np.issubdtype(type(X[0, col]), np.number):
                dtype = object
                break
    else:
        dtype = float

    return X, Y, dtype


def check_pairwise_arrays(X, Y, precomputed=False, dtype=None):
    X, Y, dtype_float = _return_float_dtype(X, Y)

    warn_on_dtype = dtype is not None
    estimator = 'check_pairwise_arrays'
    if dtype is None:
        dtype = dtype_float


    if Y is X or Y is None:
        X = Y = validation.check_array(X, accept_sparse='csr', dtype=dtype,
                            warn_on_dtype=warn_on_dtype, estimator=estimator)
    else:
        X = validation.check_array(X, accept_sparse='csr', dtype=dtype,
                        warn_on_dtype=warn_on_dtype, estimator=estimator)
        Y = validation.check_array(Y, accept_sparse='csr', dtype=dtype,
                        warn_on_dtype=warn_on_dtype, estimator=estimator)

    if precomputed:
        if X.shape[1] != Y.shape[0]:
            raise ValueError("Precomputed metric requires shape "
                             "(n_queries, n_indexed). Got (%d, %d) "
                             "for %d indexed." %
                             (X.shape[0], X.shape[1], Y.shape[0]))
    elif X.shape[1] != Y.shape[1]:
        raise ValueError("Incompatible dimension for X and Y matrices: "
                         "X.shape[1] == %d while Y.shape[1] == %d" % (
                             X.shape[1], Y.shape[1]))

    return X, Y
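
The dtype check above, and the default for `categorical_features` in the Gower function of the next section, both test the type of each first-row value with `np.issubdtype`. A minimal sketch of that test on one mixed-type row follows; the helper name `is_numeric_value` is hypothetical and not part of the notebook.

```python
import numpy as np


def is_numeric_value(value):
    """Return True if the type of `value` is a NumPy numeric subtype."""
    return bool(np.issubdtype(type(value), np.number))


# One mixed-type row, stored as an object array so each value keeps its type
row = np.array(['Syria', 1200, 0, 411114.44, True], dtype=object)

# A column is treated as categorical when its first value is not numeric
categorical = [not is_numeric_value(v) for v in row]
print(categorical)  # [True, False, False, False, True]
```

Booleans end up categorical because NumPy's boolean type is not a subtype of `np.number`. Only the first row is inspected, so the classification depends on the values found there.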

# 3. The Gower Function

In [36]:
def gower_distances(X, Y=None, w=None, categorical_features=None):
    """
    Compute the Gower distances between the rows of X and Y.

    Read more in the :ref:`User Guide <metrics>`.

    Parameters
    ----------
    X : array-like, shape (n_samples_X, n_features)

    Y : array-like, shape (n_samples_Y, n_features), optional
        If None, distances are computed between the rows of X.

    w : array-like, shape (n_features,), optional
        Per-attribute weights from the Gower formula. Defaults to 1 for
        every attribute.

    categorical_features : array-like, shape (n_features,), optional
        Indicates with True/False whether a column is a categorical
        attribute. This is useful when categorical attributes are
        represented as integer values. If None, columns whose first-row
        value is not numeric are treated as categorical.

    Returns
    -------
    distances : ndarray, shape (n_samples_X, n_samples_Y)

    Notes
    -----
    Gower is a similarity measure for mixed categorical, boolean and
    numerical data.

    """

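    # Dense input is validated as an object array so mixed types survive;
    # sparse input keeps its own numeric dtype.
    X, Y = check_pairwise_arrays(
        X, Y, dtype=None if issparse(X) or issparse(Y) else object)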

    rows, cols = X.shape

    if categorical_features is None:
        categorical_features = []
        for col in range(cols):
            if np.issubdtype(type(X[0, col]), np.number):
                categorical_features.append(False)
            else:
                categorical_features.append(True)
    # Compute the maximum and the normalized range (1 - min / max) of each
    # numeric column
    ranges_of_numeric = [0.0] * cols
    max_of_numeric = [0.0] * cols
    for col in range(cols):
        if not categorical_features[col]:
            # col_max / col_min avoid shadowing the built-in max() and min()
            if issparse(X):
                col_array = X.getcol(col)
                col_max = col_array.max() + 0.0
                col_min = col_array.min() + 0.0
            else:
                col_array = X[:, col].astype(np.double)
                col_max = np.nanmax(col_array)
                col_min = np.nanmin(col_array)

            if np.isnan(col_max):
                col_max = 0.0
            if np.isnan(col_min):
                col_min = 0.0
            max_of_numeric[col] = col_max
            ranges_of_numeric[col] = ((1 - col_min / col_max)
                                      if col_max != 0 else 0.0)

    if w is None:
        w = [1] * cols

    yrows, ycols = Y.shape

    dm = np.zeros((rows, yrows), dtype=np.double)

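    # The result is symmetric only when Y is X (check_pairwise_arrays returns
    # the same object for both in that case). Comparing row counts alone is
    # not enough: two different arrays can have the same number of rows.
    symmetric = Y is X

    for i in range(rows):
        # For a symmetric result, compute the upper triangle and mirror it
        j_start = i if symmetric else 0

        for j in range(j_start, yrows):
            sum_sij = 0.0
            sum_wij = 0.0
            for col in range(cols):
                value_xi = X[i, col]
                value_xj = Y[j, col]

                if not categorical_features[col]:
                    if max_of_numeric[col] != 0:
                        value_xi = value_xi / max_of_numeric[col]
                        value_xj = value_xj / max_of_numeric[col]
                    else:
                        value_xi = 0
                        value_xj = 0

                    if ranges_of_numeric[col] != 0:
                        sij = abs(value_xi - value_xj) / ranges_of_numeric[col]
                    else:
                        sij = 0
                    # A missing numeric value gets zero weight
                    missing = np.isnan(value_xi) or np.isnan(value_xj)
                    wij = 0 if missing else w[col]
                else:
                    sij = 0.0 if value_xi == value_xj else 1.0
                    # Zero weight only when both categorical values are missing
                    wij = 0 if value_xi is None and value_xj is None else w[col]
                sum_sij += (wij * sij)
                sum_wij += wij

            if sum_wij != 0:
                dm[i, j] = (sum_sij / sum_wij)
                # Mirror only when the matrix really is symmetric
                if symmetric:
                    dm[j, i] = dm[i, j]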

    return dm
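
The triple loop above runs in pure Python, which gets slow as the number of rows grows. For dense data with no missing values, the same weighted average can be sketched with NumPy broadcasting. This is an illustrative alternative under those assumptions, not the version submitted to scikit-learn; the function name `gower_dense_sketch` and the split into separate numeric and categorical arrays are hypothetical.

```python
import numpy as np


def gower_dense_sketch(num, cat):
    """Pairwise Gower distances for complete data (no missing values).

    num : (n_samples, n_numeric) numerical attributes
    cat : (n_samples, n_categorical) categorical or boolean attributes
    Every attribute gets weight 1.
    """
    num = np.asarray(num, dtype=float)
    cat = np.asarray(cat, dtype=object)

    # Range of each numeric column; a constant column contributes 0
    ranges = num.max(axis=0) - num.min(axis=0)
    safe_ranges = np.where(ranges == 0, 1.0, ranges)

    # |x_ik - x_jk| / R_k for every pair (i, j), summed over numeric columns
    diffs = np.abs(num[:, None, :] - num[None, :, :])
    num_part = (diffs / safe_ranges).sum(axis=2)

    # Count the categorical columns in which the two rows differ
    cat_part = (cat[:, None, :] != cat[None, :, :]).sum(axis=2)

    return (num_part + cat_part) / (num.shape[1] + cat.shape[1])


# The eight complete rows of the example data: age, salary, available_credit
num = [[21, 3000.0, 2200], [21, 1200.0, 100], [19, 32000.0, 22000],
       [30, 1800.0, 1100], [21, 2900.0, 2000], [21, 1100.0, 100],
       [19, 10000.0, 6000], [30, 1500.0, 2200]]
# gender, civil_status, has_children
cat = [['M', 'MARRIED', True], ['M', 'SINGLE', False], ['N', 'SINGLE', True],
       ['M', 'SINGLE', True], ['F', 'MARRIED', True], ['F', 'SINGLE', False],
       ['F', 'WIDOW', False], ['F', 'DIVORCED', True]]

D_sketch = gower_dense_sketch(num, cat)
print(np.round(D_sketch[0, :3], 8))
```

The sketch builds an (n, n, p) intermediate array, so it trades memory for speed and is only practical for moderate n.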




# 4. Get the Gower distance matrix

In [37]:
D = gower_distances(X)
print(D)

[[ 0.          0.35902381  0.67073985  0.31787418  0.16872811  0.52622985
   0.59697856  0.47778758         nan]
 [ 0.35902381  0.          0.69643032  0.3138769   0.52362903  0.16720604
   0.45600237  0.65396349         nan]
 [ 0.67073985  0.69643032  0.          0.6552807   0.67280129  0.6969697
   0.74042795  0.8151941          nan]
 [ 0.31787418  0.3138769   0.6552807   0.          0.4824794   0.48108294
   0.74818608  0.34332284         nan]
 [ 0.16872811  0.52362903  0.67280129  0.4824794   0.          0.35750174
   0.43237334  0.31210361         nan]
 [ 0.52622985  0.16720604  0.6969697   0.48108294  0.35750174  0.
   0.28987508  0.4878362          nan]
 [ 0.59697856  0.45600237  0.74042795  0.74818608  0.43237334  0.28987508
   0.          0.57476615         nan]
 [ 0.47778758  0.65396349  0.8151941   0.34332284  0.31210361  0.4878362
   0.57476615  0.                 nan]
 [        nan         nan         nan         nan         nan         nan
          nan         nan  0.        ]]

<h1>5. The equivalent code in R</h1>
Using the <code>daisy</code> function from the {cluster} package:

<p>
<code>
library(cluster)

age=c(21,21,19,30,21,21,19,30,NA)
gender=c('M','M','N','M','F','F','F','F',NA)
civil_status=c('MARRIED','SINGLE','SINGLE','SINGLE','MARRIED','SINGLE','WIDOW','DIVORCED',NA)
salary=c(3000.0,1200.0 ,32000.0,1800.0 ,2900.0 ,1100.0 ,10000.0,1500.0,NA)
children=c(TRUE,FALSE,TRUE,TRUE,TRUE,FALSE,FALSE,TRUE,NA)
available_credit=c(2200,100,22000,1100,2000,100,6000,2200,NA)
X=data.frame(age,gender,civil_status,salary,children,available_credit)

D=daisy(X,metric="gower")

print(D)

Dissimilarities :
          1         2         3         4         5         6         7         8
2 0.3590238                                                                      
3 0.6707398 0.6964303                                                            
4 0.3178742 0.3138769 0.6552807                                                  
5 0.1687281 0.5236290 0.6728013 0.4824794                                        
6 0.5262298 0.2006472 0.6969697 0.4810829 0.3575017                              
7 0.5969786 0.5472028 0.7404280 0.7481861 0.4323733 0.3478501                    
8 0.4777876 0.6539635 0.8151941 0.3433228 0.3121036 0.4878362 0.5747661          
9        NA        NA        NA        NA        NA        NA        NA        NA

</code>
</p>
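
Three entries of the R output differ from the Python matrix in section 4: the pairs (2, 6), (2, 7) and (6, 7) in R's 1-based numbering. These are exactly the pairs in which both rows have `children = FALSE`. A likely explanation, stated here as an assumption rather than something this notebook verifies, is that `daisy` treats logical columns as asymmetric binary, so a FALSE/FALSE pair is dropped from the comparison and the sum is divided by 5 attributes instead of 6. The sketch below rescales the Python values accordingly and compares them with the R output.

```python
# Values copied from the two printed matrices (pairs in R's 1-based numbering)
python_vals = {(2, 6): 0.16720604, (2, 7): 0.45600237, (6, 7): 0.28987508}
r_vals = {(2, 6): 0.2006472, (2, 7): 0.5472028, (6, 7): 0.3478501}

rescaled = {}
for pair, d in python_vals.items():
    # Python averages over 6 attributes; dropping one leaves 5
    rescaled[pair] = d * 6 / 5
    print(pair, round(rescaled[pair], 7), r_vals[pair])
```

All other entries agree between the two outputs, since no other pair has FALSE in both rows.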


# 6. Non-square matrix test

In [38]:
X2 = np.array([['Syria', 1200, 0, 411114.44, True],
               ['Ireland', 300, 0, 199393333.22, False],
               ['United Kingdom', 100, 0, 32323222.121, False]], dtype=object)

Y2 = np.array([['United Kingdom', 200, 0, 99923921.47, True]], dtype=object)

D = gower_distances(X2, Y2)

print(D)

[[ 0.48183999]
 [ 0.51816001]
 [ 0.28612829]]
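
The last entry above can be reproduced by hand, which also shows two details of the function: the numeric ranges come from `X2` only, and the all-zero column contributes a distance of 0 while still counting in the denominator. The snippet is a worked check written for this walkthrough, not part of the original notebook.

```python
# X2 row 2 against the single Y2 row
x = ['United Kingdom', 100, 0, 32323222.121, False]
y = ['United Kingdom', 200, 0, 99923921.47, True]

# Ranges (max - min) of the numeric columns, taken from X2 only
range_col1 = 1200 - 100
range_col3 = 199393333.22 - 411114.44

terms = [
    0.0 if x[0] == y[0] else 1.0,    # country: categorical, equal
    abs(x[1] - y[1]) / range_col1,   # numeric
    0.0,                             # all-zero column in X2: distance 0
    abs(x[3] - y[3]) / range_col3,   # numeric
    0.0 if x[4] == y[4] else 1.0,    # boolean: values differ
]

# Every attribute has weight 1, so the distance is the plain mean
distance = sum(terms) / len(terms)
print(round(distance, 8))
```

The result agrees with the printed value 0.28612829 for the third row.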


# 7. Sparse Matrix Test

In [39]:
from sklearn.datasets import load_iris
from scipy.sparse import csc_matrix


iris = load_iris()
# converts to sparse matrix
C = csc_matrix(iris.data)
D = gower_distances(C)

print(D)


[[ 0.          0.06597222  0.06326507 ...,  0.4978225   0.47504708
   0.43108522]
 [ 0.06597222  0.          0.03895951 ...,  0.45962806  0.52018597
   0.39289077]
 [ 0.06326507  0.03895951  0.         ...,  0.49858757  0.51747881
   0.43185028]
 ..., 
 [ 0.4978225   0.45962806  0.49858757 ...,  0.          0.10222458
   0.06673729]
 [ 0.47504708  0.52018597  0.51747881 ...,  0.10222458  0.          0.1272952 ]
 [ 0.43108522  0.39289077  0.43185028 ...,  0.06673729  0.1272952   0.        ]]
