#**Jaccardian Index**

So far, all our examples of distance/similarity measures have dealt with variables/features that are continuous in nature. The following is an example where we compare the results of 4 tests between 3 patients. The test results are Positive/Negative and all other information is categorical.

![](https://www.computing.dcu.ie/~amccarren/mcm_images/jaccardian_example.png)

Figure 1: Comparison of 3 patients and the results from 4 medical tests.

In this situation it would not make sense to use the Euclidean distance or the Cosine similarity measure, and so we introduce the Jaccardian index. It is also known as Intersection over Union or the Jaccard similarity coefficient ([Wikipedia](https://en.wikipedia.org/wiki/Jaccard_index#Similarity_of_asymmetric_binary_attributes)). The idea is to compare two rows in the dataset (subjects, in our case) by counting the number of attributes on which both rows have a positive occurrence.

So from the example above we would create a contingency table such as that shown in Figure 2 below:

![](https://www.computing.dcu.ie/~amccarren/mcm_images/jaccardian_contigency_table.png), Wikipedia.

Figure 2: Contingency table of the comparison between 2 rows.


So to calculate the Jaccard index we would use the following formula for a symmetric variable (this form is also known as the simple matching coefficient):

$$J_s=\frac{M_{00}+M_{11}}{M_{10}+M_{01}+M_{00}+M_{11}}$$

</br>

For an asymmetric variable we would use the following:

</br>

$$J_{as}=\frac{M_{11}}{M_{10}+M_{01}+M_{11}}$$

</br> 
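The two formulas can be checked with a few lines of Python. This is only a minimal sketch: the helper names `contingency_counts`, `jaccard_symmetric` and `jaccard_asymmetric` are illustrative and not part of any library, the two rows are made up, and the inputs are assumed to be equal-length lists of 0/1 values.

```python
def contingency_counts(a, b):
    """Return (m11, m10, m01, m00) for two equal-length 0/1 sequences."""
    if len(a) != len(b):
        raise ValueError("both rows must have the same number of attributes")
    m11 = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)
    m10 = sum(1 for x, y in zip(a, b) if x == 1 and y == 0)
    m01 = sum(1 for x, y in zip(a, b) if x == 0 and y == 1)
    m00 = sum(1 for x, y in zip(a, b) if x == 0 and y == 0)
    return m11, m10, m01, m00


def jaccard_symmetric(a, b):
    """J_s: matches of either kind divided by the number of attributes."""
    m11, m10, m01, m00 = contingency_counts(a, b)
    return (m00 + m11) / (m10 + m01 + m00 + m11)


def jaccard_asymmetric(a, b):
    """J_as: joint positives divided by attributes with at least one positive."""
    m11, m10, m01, _ = contingency_counts(a, b)
    denominator = m10 + m01 + m11
    # Assumption: two all-negative rows are treated as identical (index 1.0).
    return m11 / denominator if denominator else 1.0


# Made-up rows, for illustration only.
row_a = [1, 0, 0, 1, 0]
row_b = [1, 1, 0, 0, 0]

print(contingency_counts(row_a, row_b))   # (1, 1, 1, 2)
print(jaccard_symmetric(row_a, row_b))    # 3 matches out of 5 attributes
print(jaccard_asymmetric(row_a, row_b))   # 1 joint positive out of 3
```

Note how the two shared negatives raise the symmetric score to 0.6 while the asymmetric score ignores them and stays at one third.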

So the next question you might ask is: what is an asymmetric variable? An asymmetric variable is a categorical variable whose possible outcomes are not equally likely, and where the rarer outcome is the more informative one. If you were testing patients for a rare disease, we would not expect many people to have it, so when two patients both test positive we want our metric to score this as strong evidence of similarity, whereas two shared negatives tell us very little. In our example above the only symmetric variable is Gender (we are assuming an equal probability of male and female). The Jaccardian distance is $1-J$ in both the symmetric and asymmetric cases.

So let's try to calculate the Jaccardian distance for the example in Figure 1 between Jack and Mary. We will assume the following for this example:

>* Gender is a symmetric attribute.

>* The remaining attributes are asymmetric binary.

>* Let Y and P be set to 1 and the value N be set to 0.

Now the Jaccardian distance between Jack and Mary is as follows:

$$d(Jack,Mary)=\frac{M_{10}+M_{01}}{M_{11}+M_{10}+M_{01}}=\frac{0~+~1}{2~+~0~+~1}=0.33$$

so the Jaccardian index is $1-d=0.67$

Note that because Gender is symmetric it was left out of the calculation. If we included Gender it would bias the result, to the point that the distance would go from 0.33 to 0.5. Whether a variable should be treated as symmetric really depends on the problem being addressed. In some cases the Gender variable may be asymmetric: for example, in a breast cancer study it is rare for a male to show up, but it is possible.
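The effect of Gender on the distance can be reproduced with a short sketch. The six-element vectors are the Jack and Mary rows used in the Example section; the helper name `asymmetric_distance` and the coding of Gender as M = 1, F = 0 are assumptions made here for illustration.

```python
def asymmetric_distance(a, b):
    """Jaccard distance: mismatches over attributes with at least one positive.

    Assumes at least one of the two rows contains a positive value.
    """
    m11 = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)
    mismatches = sum(1 for x, y in zip(a, b) if x != y)
    return mismatches / (m11 + mismatches)


# The six asymmetric attributes, with Y/P coded as 1 and N as 0.
jack = [1, 0, 1, 0, 0, 0]
mary = [1, 0, 1, 0, 1, 0]
d_without_gender = asymmetric_distance(jack, mary)

# Assumption: Gender is coded M = 1, F = 0 and prepended as an extra attribute.
jack_with_gender = [1] + jack
mary_with_gender = [0] + mary
d_with_gender = asymmetric_distance(jack_with_gender, mary_with_gender)

print(round(d_without_gender, 2), round(d_with_gender, 2))  # 0.33 0.5
```

Adding Gender introduces one extra mismatch, so the numerator goes from 1 to 2 and the denominator from 3 to 4.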

#Multinomial data
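A multinomial variable is a generalization of the binary variable in that it can take more than 2 states, e.g. red, yellow, blue, green. The dissimilarity between two objects $i$ and $j$ is calculated as follows,

where $m$ is the number of variables on which $i$ and $j$ match and $p$ is the total number of variables: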

$$d(i,j)=\frac{p-m}{p}$$

</br>
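As a rough illustration of the formula above, the following sketch counts matches between two rows of nominal values. The function name `simple_matching_distance` and the two example items are made up for this illustration.

```python
def simple_matching_distance(row_i, row_j):
    """d(i, j) = (p - m) / p for two rows of nominal values."""
    if len(row_i) != len(row_j):
        raise ValueError("both rows must have the same number of variables")
    p = len(row_i)                                      # total number of variables
    m = sum(1 for x, y in zip(row_i, row_j) if x == y)  # number of matches
    return (p - m) / p


# Made-up items described by colour, size and shape.
item_i = ["red", "small", "round"]
item_j = ["red", "large", "round"]

print(simple_matching_distance(item_i, item_j))  # 1 mismatch out of 3 variables
```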

#Ordinal Data
If you have an ordinal variable do the following:

>* replace each $x_{if}$ by its rank $r_{if} \in \{1,\dots,M_f\}$

>* map the range of each variable onto [0, 1] by replacing the i-th object in the f-th variable by

$$Z_{if}=\frac{r_{if}-1}{M_f-1}$$


>* compute the dissimilarity using methods for interval-scaled variables
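The three steps above can be sketched as follows. The function name `normalise_ordinal`, the level names and the sample values are assumptions chosen for illustration; the absolute difference is used here as one simple interval-scaled dissimilarity.

```python
def normalise_ordinal(values, ordered_levels):
    """Map ordinal values onto [0, 1] using z = (r - 1) / (M - 1).

    `ordered_levels` lists the levels from lowest to highest and must
    contain at least two entries.
    """
    rank = {level: position for position, level in enumerate(ordered_levels, start=1)}
    m_f = len(ordered_levels)
    return [(rank[value] - 1) / (m_f - 1) for value in values]


# Made-up ordinal variable with three levels.
levels = ["low", "medium", "high"]
observations = ["low", "high", "medium"]

z = normalise_ordinal(observations, levels)
print(z)                  # [0.0, 1.0, 0.5]

# Treat the normalised values as interval-scaled, e.g. absolute difference.
print(abs(z[0] - z[1]))   # 1.0
```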

If you have a ratio-scaled variable (e.g. weight), do not treat it as an interval-scaled variable, as the scale can be distorted. The solution is to apply a logarithmic transformation first.
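A small sketch of the logarithmic transformation is shown below. The values are made up, and the natural logarithm is used purely as an example; the transformation assumes all values are strictly positive.

```python
import math

# Made-up ratio-scaled values that grow by a factor of 10 each time.
weights = [2.0, 20.0, 200.0]

# After the log transform, equal ratios become equal differences.
log_weights = [math.log(w) for w in weights]
gaps = [later - earlier for earlier, later in zip(log_weights, log_weights[1:])]

print(gaps)  # both gaps are approximately log(10)
```

On the raw scale the gaps are 18 and 180, whereas on the log scale they are the same size, which is what an interval-scaled distance measure expects.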


#Example

The following code shows you how to calculate the Jaccardian distance for the example given in Figure 1. It relies on the [Python SciPy](https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.jaccard.html) library and assumes asymmetry. Have a go at trying to calculate the overall distance matrix for the whole dataset. As usual, leave your thoughts on the comments board.










In [0]:
from scipy.spatial import distance

# Asymmetric binary attributes only (Gender is excluded); Y/P = 1, N = 0.
jack = [1, 0, 1, 0, 0, 0]
mary = [1, 0, 1, 0, 1, 0]
print("Jaccard distance between Jack and Mary is:", distance.jaccard(jack, mary))

Jaccard distance between Jack and Mary is: 0.3333333333333333
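For the distance-matrix exercise, one possible approach is sketched below in plain Python. The Jack and Mary rows are those used above; the third row, labelled `Patient 3`, is a hypothetical stand-in and should be replaced with the actual values from Figure 1. The helper name `jaccard_distance` is also an assumption.

```python
from itertools import combinations


def jaccard_distance(a, b):
    """Asymmetric Jaccard distance between two equal-length 0/1 lists."""
    m11 = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)
    mismatches = sum(1 for x, y in zip(a, b) if x != y)
    total = m11 + mismatches
    # Assumption: two all-negative rows are given a distance of 0.
    return mismatches / total if total else 0.0


patients = {
    "Jack": [1, 0, 1, 0, 0, 0],
    "Mary": [1, 0, 1, 0, 1, 0],
    "Patient 3": [1, 1, 0, 0, 0, 0],  # hypothetical row, not taken from Figure 1
}

names = list(patients)
# Start with zeros on the diagonal, then fill in each pair symmetrically.
matrix = {name: {other: 0.0 for other in names} for name in names}
for first, second in combinations(names, 2):
    d = jaccard_distance(patients[first], patients[second])
    matrix[first][second] = matrix[second][first] = d

for name in names:
    print(name, [round(matrix[name][other], 2) for other in names])
```

If SciPy is available, `scipy.spatial.distance.pdist` with `metric="jaccard"` followed by `squareform` is a library alternative for the same calculation.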
