# **Similarity Metrics**

In the previous step we talked about clustering and where it can be used. At the end of that step I asked: how do we know if we have a good clustering? Typically, a good clustering technique has the following properties:

>* It produces high-quality clusters with
>> * high intra-class similarity.
>> * low inter-class similarity.
>* The quality of a clustering result depends on both the similarity measure used by the method and its implementation.
>* The quality of a clustering method is also measured by its ability to discover some or all of the hidden patterns.
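
To make "high intra-class similarity, low inter-class similarity" concrete, here is a minimal sketch that compares the average distance between points in the same cluster with the average distance between points in different clusters. The function name `mean_intra_inter_distance`, the toy points and the labels are all made up for illustration, and using the plain Euclidean distance here is an assumption rather than the only option.

```python
import numpy as np

def mean_intra_inter_distance(points, labels):
    """Return (mean within-cluster distance, mean between-cluster distance)."""
    points = np.asarray(points, dtype=float)
    labels = np.asarray(labels)
    # Pairwise Euclidean distances between all rows, via broadcasting.
    diffs = points[:, None, :] - points[None, :, :]
    dist = np.sqrt((diffs ** 2).sum(axis=-1))
    # same[i, j] is True when rows i and j carry the same cluster label.
    same = labels[:, None] == labels[None, :]
    # Count each unordered pair once: upper triangle without the diagonal.
    upper = np.triu(np.ones_like(same, dtype=bool), k=1)
    intra = dist[same & upper].mean()
    inter = dist[~same & upper].mean()
    return intra, inter

# Hypothetical toy data: two tight groups that are far apart.
points = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]]
labels = [0, 0, 1, 1]
intra, inter = mean_intra_inter_distance(points, labels)
print(intra, inter)
```

A small within-cluster distance together with a large between-cluster distance is what we would hope to see from a good clustering of this toy data.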

Measuring the quality of a clustering technique depends on how we calculate distance or similarity. A dissimilarity/similarity measure is usually expressed in terms of a distance function, typically a metric: $d(i, j)$. In addition to $d(i, j)$ we need a separate “quality” function that measures the “goodness” of a cluster; this could be the Mean Square Error (MSE), for example. The definitions of distance functions are usually very different for interval-scaled, boolean, categorical, ordinal, ratio, and vector variables, and the weights given to particular variables are application specific. It can be very hard to define “similar enough” or “good enough”, as the answer is typically highly subjective.
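
As a rough illustration of a "quality" function, the sketch below computes an MSE-style score: the mean squared Euclidean distance from each point to the centroid of its assigned cluster. This particular definition, the helper name `cluster_mse` and the toy data are assumptions made for the example; other definitions of cluster quality are equally valid.

```python
import numpy as np

def cluster_mse(points, labels):
    """Mean squared distance from each point to its own cluster centroid."""
    points = np.asarray(points, dtype=float)
    labels = np.asarray(labels)
    total = 0.0
    for label in np.unique(labels):
        members = points[labels == label]
        centroid = members.mean(axis=0)
        # Sum of squared Euclidean distances to the centroid.
        total += ((members - centroid) ** 2).sum()
    return total / len(points)

# Hypothetical toy data with two obvious groups.
points = [[0.0, 0.0], [0.0, 2.0], [5.0, 5.0], [5.0, 7.0]]
good = cluster_mse(points, [0, 0, 1, 1])  # groups nearby points together
bad = cluster_mse(points, [0, 1, 0, 1])   # mixes the two groups
print(good, bad)
```

With this score, lower is better: the assignment that keeps nearby points together gets a much smaller value than the one that mixes the two groups.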

So generally we want the following requirements from a clustering algorithm: 

>* Scalability
>* Ability to deal with different types of attributes
>* Ability to handle dynamic data 
>* Discovery of clusters with arbitrary shape
>* Minimal requirements for domain knowledge to determine input parameters
>* Ability to deal with noise and outliers
>* Insensitivity to the order of input records
>* Ability to handle high dimensionality
>* Incorporation of user-specified constraints
>* Interpretability and usability

# Similarity

This is a considerable list, and I generally give the most importance to the last requirement: interpretability and usability. If you remember, in MOOC 2 we discussed entropy and used the information criteria to create a discrete variable that retained as much information as possible from the original continuous variable. Finding the right clustering approach is similar: we are looking for clusters that give us new information, not clustering for clustering's sake. So the first step in this process is to decide how we calculate similarity or dissimilarity.

Similarity can be described as follows:

>* Numerical measure of how alike two data objects are.
>* Higher when objects are more alike.
>* Often falls in the range [0, 1].

Dissimilarity can be described as follows:

>* Numerical measure of how different two data objects are.
>* Lower when objects are more alike.
>* Minimum dissimilarity is often 0.
>* Upper limit varies
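
The two descriptions above are two views of the same idea, and one can be turned into the other. A minimal sketch, assuming the transformation $s = 1 / (1 + d)$, which is only one of several possible conventions and is not prescribed by this course; the function name `similarity_from_distance` is hypothetical.

```python
def similarity_from_distance(d):
    """Map a non-negative dissimilarity d to a similarity in (0, 1]."""
    if d < 0:
        raise ValueError("a dissimilarity should be non-negative")
    # d = 0 (identical objects) gives similarity 1; large d approaches 0.
    return 1.0 / (1.0 + d)

for d in [0.0, 1.0, 9.0]:
    print(d, similarity_from_distance(d))
```

Identical objects (dissimilarity 0) get a similarity of 1, and the similarity shrinks towards 0 as the dissimilarity grows, which matches the properties listed above.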



The first dissimilarity measure we will look at is probably the best known: the Euclidean distance.

# Euclidean Distance

The Euclidean distance between two points is the length of the straight line segment connecting them, and the Pythagorean theorem gives this distance. [Euclidean distance](http://en.wikipedia.org/wiki/Euclidean_distance) is often the default distance for practitioners.

<p>In Cartesian coordinates, if <b>p</b>&nbsp;=&nbsp;(<i>p</i><sub>1</sub>,&nbsp;<i>p</i><sub>2</sub>,...,&nbsp;<i>p</i><sub><i>n</i></sub>) and <b>q</b>&nbsp;=&nbsp;(<i>q</i><sub>1</sub>,&nbsp;<i>q</i><sub>2</sub>,...,&nbsp;<i>q</i><sub><i>n</i></sub>) are two points in Euclidean <i>n</i>-space, then the distance (<i>d</i>) from <b>p</b> to <b>q</b>, or from <b>q</b> to <b>p</b>, is given by the Pythagorean formula:</p>


<img src="https://wikimedia.org/api/rest_v1/media/math/render/svg/795b967db2917cdde7c2da2d1ee327eb673276c0" class="mwe-math-fallback-image-inline" aria-hidden="true" style="vertical-align: -6.671ex; width:64.822ex; height:14.509ex;" alt="{\displaystyle {\begin{aligned}d(\mathbf {p} ,\mathbf {q} )=d(\mathbf {q} ,\mathbf {p} )&amp;={\sqrt {(q_{1}-p_{1})^{2}+(q_{2}-p_{2})^{2}+\cdots +(q_{n}-p_{n})^{2}}}\\[8pt]&amp;={\sqrt {\sum _{i=1}^{n}(q_{i}-p_{i})^{2}}}.\end{aligned}}}">
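
The formula translates almost term by term into NumPy. The following is a minimal sketch written for illustration, not a library function: the name `euclidean_distance` and the example points are my own choices.

```python
import numpy as np

def euclidean_distance(p, q):
    """Euclidean distance between two points given as equal-length sequences."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    if p.shape != q.shape:
        raise ValueError("p and q must have the same number of coordinates")
    # Square the coordinate-wise differences, sum them, take the square root.
    return float(np.sqrt(np.sum((q - p) ** 2)))

print(euclidean_distance([0, 1], [1, 1]))  # 1.0
print(euclidean_distance([1, 1], [0, 0]))  # about 1.414
```

Swapping the arguments gives the same result, which reflects the symmetry $d(\mathbf{p}, \mathbf{q}) = d(\mathbf{q}, \mathbf{p})$ in the formula.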

So let's look at an implementation of the Euclidean distance metric using scikit-learn. In the first example we will find the Euclidean distance between the rows of a matrix $X$ and itself. So if we have a matrix $X$ where:

>> $X=\begin{bmatrix}0 & 1\\1 & 1\end{bmatrix}$

Each row represents a subject or an object, and each column represents a feature. Now let's calculate the distance of row 1 with itself, row 1 with row 2, row 2 with row 1, and row 2 with itself. This gives us a $2 \times 2$ **distance matrix**.


In [0]:
from sklearn.metrics.pairwise import euclidean_distances

# each row of X is an object, each column is a feature
X = [[0, 1], [1, 1]]

# pairwise distances between the rows of X
print(euclidean_distances(X, X))

[[0. 1.]
 [1. 0.]]


Now let's look at another example where we compare each of the rows in $X$ with the vector $\begin{bmatrix}0 & 0 \end{bmatrix}$, that is, the origin. The calculations are as follows:

> $\sqrt {(0-0)^{2}+(1-0)^{2}}=1$

> $\sqrt {(1-0)^{2}+(1-0)^{2}}\approx 1.414$

In [0]:
# distance from each row of X to the origin
print(euclidean_distances(X, [[0, 0]]))

[[1.        ]
 [1.41421356]]


Have a go at computing the distance matrix for the following matrix:

> $X=\begin{bmatrix}1.2 & 3.4 & 10.2 \\1.4 & 3.1 & 10.7\\2.1 & 3.77&11.3\\ 1.5 & 3.2 & 10.9\end{bmatrix}$

Can you spot where there are going to be issues with the Euclidean distance metric? What is the difference between Euclidean distance and correlation?

As usual, put your thoughts on the comments board.
