<h1>Euclidean and Overlap distance calculation</h1>

Real-world data is not always a uniform matrix of numerical values. Sometimes you need to work with data that mixes variable types (e.g., categorical, boolean, numerical).

This notebook proposes a refactoring of scipy's pdist function so that it supports the Euclidean-Overlap distance.


According to this paper, this is the most widely used similarity function for mixed data.<br>
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.140.8831&rep=rep1&type=pdf
<br>
In the Weka platform, the default distance measure is Euclidean with Overlap.<br>

The original paper for Overlap:<br>
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.140.8831&rep=rep1&type=pdf
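
As a reading aid, the distance used in this notebook can be written out as below. This is a sketch inferred from the code in the later sections, not a quotation from the paper: the symbols x_a, y_a and d_a are notation chosen here, numeric attributes are assumed to be already divided by their column maximum (as the normalization step in section 2 does), and the overlap term follows the usual convention of 0 for matching values and 1 for different ones.

```latex
D(x, y) = \sqrt{\sum_{a=1}^{n} d_a(x_a, y_a)^2},
\qquad
d_a(x_a, y_a) =
\begin{cases}
|x_a - y_a| & \text{if attribute } a \text{ is numeric} \\
0 & \text{if attribute } a \text{ is not numeric and } x_a = y_a \\
1 & \text{if attribute } a \text{ is not numeric and } x_a \neq y_a
\end{cases}
```

For example, two records that differ in exactly one categorical attribute and agree on every numeric one are at distance 1.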


<h2>1. Generate some data with mixed types</h2>

In [3]:
import numpy as np
import pandas as pd
from scipy.spatial.distance import pdist, squareform

#scipy._lib.six no longer exists in recent SciPy releases; on Python 3, range is already lazy
xrange = range

X=pd.DataFrame({'age':[21,21,19,30,21,21,19,30],
'gender':['M','M','M','M','F','F','F','F'],
'civil_status':['MARRIED','SINGLE','SINGLE','SINGLE','MARRIED','SINGLE','WIDOW','DIVORCED'],
'salary':[3000.0,1200.0 ,32000.0,1800.0 ,2900.0 ,1100.0 ,10000.0,1500.0],
'children':[True,False,True,True,True,False,False,True],
'available_credit':[2200,100,22000,1100,2000,100,6000,2200]})

print(X)

   age  available_credit children civil_status gender   salary
0   21              2200     True      MARRIED      M   3000.0
1   21               100    False       SINGLE      M   1200.0
2   19             22000     True       SINGLE      M  32000.0
3   30              1100     True       SINGLE      M   1800.0
4   21              2000     True      MARRIED      F   2900.0
5   21               100    False       SINGLE      F   1100.0
6   19              6000    False        WIDOW      F  10000.0
7   30              2200     True     DIVORCED      F   1500.0


<h2>2. Auxiliary functions</h2>
These helpers are needed because NumPy has no built-in support for arithmetic on matrices that mix data types.

In [4]:


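#Divide every numeric column by its maximum (values land in [0, 1] when they are non-negative).
#Non-numeric columns (strings, booleans) are left untouched.
def normalize_mixed_data_columns(arr, dtypes):

    if isinstance(arr, pd.DataFrame):
        arr = np.asmatrix(arr.copy())
    elif isinstance(arr, np.ndarray):
        arr = arr.copy()
    else:
        raise ValueError('A DataFrame or ndarray must be provided.')

    dtypes = list(dtypes)  #allows positional indexing for a Series as well as a list
    rows, cols = arr.shape
    for col in xrange(cols):
        if np.issubdtype(dtypes[col], np.number):
            col_max = arr[:, col].max() + 0.0  #converts to double
            if cols > 1:
                arr[:, col] /= col_max
            else:
                arr = arr / col_max
    return arr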
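
An alternative way to get the same scaling, shown here only as a sketch, is to let pandas pick the numeric columns. The helper name `scale_numeric_columns` and the tiny sample frame are made up for illustration. `select_dtypes(include='number')` skips boolean and string columns, which matches the `np.issubdtype(..., np.number)` test used above.

```python
import pandas as pd


def scale_numeric_columns(df):
    """Return a copy of df with each numeric column divided by its maximum."""
    out = df.copy()
    numeric_cols = out.select_dtypes(include='number').columns
    # Booleans and strings are not selected, so they pass through unchanged.
    out[numeric_cols] = out[numeric_cols] / out[numeric_cols].max()
    return out


# Hypothetical two-row sample, not the notebook's data set.
sample = pd.DataFrame({'age': [21, 30],
                       'gender': ['M', 'F'],
                       'children': [True, False]})
scaled = scale_numeric_columns(sample)
print(scaled)
```

Unlike the helper above, this version returns a DataFrame, so it would still need to be converted to an object array before being passed to a row-wise metric.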

 



<h2>3. Refactoring of pdist</h2>
This version adds support for mixed data. The helper functions that pdist relies on are private to its module and cannot be overridden from outside, so they are copied and adapted here.

In [2]:
#This function must be refactored on pdist module to support mixed data
def _copy_array_if_base_present(a):
    if a.base is not None:
        return a.copy()
    elif np.issubdtype(a.dtype, np.float32):  #np.issubsctype was removed in NumPy 2.0
        return np.array(a, dtype=np.double)
    else:
        return a

#This function must be refactored on pdist module to support mixed data
def _convert_to_double(X):
    if X.dtype == object:  #the np.object alias was removed in NumPy 1.24
        return X.copy()
    if X.dtype != np.double:
        X = X.astype(np.double)
    if not X.flags.contiguous:
        X = X.copy()
    return X

#This function was copied from pdist because it is private. No change in the original function.
def _validate_vector(u, dtype=None):
    # XXX Is order='c' really necessary?
    u = np.asarray(u, dtype=dtype, order='c').squeeze()
    # Ensure values such as u=1 and u=[1] still return 1-D arrays.
    u = np.atleast_1d(u)
    if u.ndim > 1:
        raise ValueError("Input vector should be 1-D.")
    return u


#An excerpt from pdist function only with the basic structure to call the metric function. 
#The original pdist must be adapted to this current metric using this as example.
def pdist_(X, metric='euclidean', p=2, w=None, V=None, VI=None):
    X = np.asarray(X, order='c')

    # The C code doesn't do striding.
    X = _copy_array_if_base_present(X)

    s = X.shape
    if len(s) != 2:
        raise ValueError('A 2-dimensional array must be passed.')

    m, n = s
    dm = np.zeros((m * (m - 1)) // 2, dtype=np.double)

    #(...)
    dfun = metric
    k = 0
    for i in xrange(0, m - 1):
        for j in xrange(i + 1, m):
            dm[k] = dfun(X[i], X[j],VI=VI)
            k = k + 1

    return dm
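
The loop above fills a condensed distance vector: the pairs (i, j) with i < j are stored row by row, so neither the diagonal nor the symmetric entries are kept. The sketch below uses a hypothetical helper `condensed_index` and a toy metric on made-up points to show where a given pair lands, and checks the result against `squareform`.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform


def condensed_index(i, j, m):
    """Position of the pair (i, j), with i < j, in a condensed vector for m rows."""
    return m * i - i * (i + 1) // 2 + (j - i - 1)


# Four one-dimensional points, chosen only for illustration.
points = np.array([[0.0], [1.0], [3.0], [7.0]])

# pdist accepts any callable that takes two 1-D arrays and returns a number.
condensed = pdist(points, lambda u, v: abs(u[0] - v[0]))
square = squareform(condensed)

m = len(points)
for i in range(m - 1):
    for j in range(i + 1, m):
        # Each condensed entry matches the corresponding square-matrix entry.
        assert condensed[condensed_index(i, j, m)] == square[i, j]

print(condensed)
```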

<h2>4. The Euclidean-Overlap similarity function</h2>

In [4]:
from scipy.spatial.distance import pdist, squareform


def euclidean_overlap(xi, xj,VI=None):

    cols = len(xj)

    if VI is None:
        raise ValueError('An array with the dtypes for each column must be passed in VI.')
        
    xi=_validate_vector(xi)
    xj=_validate_vector(xj)

    sum_of_sq_cathetus =0.0
    for col in xrange(cols):
        if np.issubdtype(VI[col],np.number):
            #Numeric attribute: squared difference of the normalized values
            sum_of_sq_cathetus+=abs(xi[col]-xj[col])**2
        else:
            #Overlap: contributes 0 when the values match and 1 when they differ
            sum_of_sq_cathetus+=0.0 if xi[col]==xj[col] else 1.0
            

    return(sum_of_sq_cathetus**0.5)




<h2>5. Get the Euclidean-Overlap distance matrix</h2>

In [7]:
#get the dtypes as a plain list so they can be indexed by column position
dtypes = list(X.dtypes)

#divide each numeric column by its maximum
Xn=normalize_mixed_data_columns(X, dtypes)

print(np.tril(squareform(pdist_(Xn,euclidean_overlap,VI=dtypes))))

[[ 0.          0.          0.          0.          0.          0.          0.
   0.        ]
 [ 1.41854701  0.          0.          0.          0.          0.          0.
   0.        ]
 [ 1.62349423  1.70932163  0.          0.          0.          0.          0.
   0.        ]
 [ 1.04589973  1.04518787  1.38838341  0.          0.          0.          0.
   0.        ]
 [ 1.0000462   1.73501612  1.9125516   1.44667038  0.          0.          0.
   0.        ]
 [ 1.73569495  1.00000488  1.98186928  1.44656304  1.41796429  0.          0.
   0.        ]
 [ 1.75559982  1.46696657  2.00039543  1.80269685  1.44455211  1.07411333
   0.          0.        ]
 [ 1.44644297  1.76045433  1.96287842  1.41512822  1.04498646  1.44888503
   1.49493672  0.        ]]
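
A Euclidean-Overlap matrix for this data can also be computed without the Python double loop. The sketch below is one possible vectorized formulation, not part of the proposed pdist refactoring. It assumes the usual overlap convention (0 for matching values, 1 for different ones), rebuilds the sample data so that it runs on its own, and combines two built-in scipy metrics. For integer-coded columns, `'hamming'` returns the fraction of mismatching columns, so multiplying by the number of non-numeric columns gives the mismatch count.

```python
import numpy as np
import pandas as pd
from scipy.spatial.distance import pdist, squareform

X = pd.DataFrame({'age': [21, 21, 19, 30, 21, 21, 19, 30],
                  'gender': ['M', 'M', 'M', 'M', 'F', 'F', 'F', 'F'],
                  'civil_status': ['MARRIED', 'SINGLE', 'SINGLE', 'SINGLE',
                                   'MARRIED', 'SINGLE', 'WIDOW', 'DIVORCED'],
                  'salary': [3000.0, 1200.0, 32000.0, 1800.0,
                             2900.0, 1100.0, 10000.0, 1500.0],
                  'children': [True, False, True, True, True, False, False, True],
                  'available_credit': [2200, 100, 22000, 1100, 2000, 100, 6000, 2200]})

numeric = X.select_dtypes(include='number')   # age, salary, available_credit
other = X.drop(columns=numeric.columns)       # gender, civil_status, children

# Numeric part: divide by the column maximum, then take squared Euclidean distances.
num_sq = pdist((numeric / numeric.max()).to_numpy(dtype=float), 'sqeuclidean')

# Non-numeric part: encode each column as integer codes and count the mismatches.
codes = np.column_stack([pd.factorize(other[c])[0] for c in other.columns])
mismatches = pdist(codes, 'hamming') * codes.shape[1]

# Each mismatch contributes 1 to the sum of squares.
D = squareform(np.sqrt(num_sq + mismatches))
print(np.round(D, 4))
```

Splitting the columns by type keeps both parts in compiled code, at the cost of factorizing the non-numeric columns first.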
