# Clustering

Clustering algorithms belong to the family of unsupervised machine learning algorithms. Why unsupervised? Because no target variable is present. The model is trained only on the given input variables and attempts to discover intrinsic groups (or clusters) in them.

# Clustering Types

* **Soft Clustering**: In this technique, the probability or likelihood of an observation being partitioned into a cluster is calculated.
* **Hard Clustering**: In hard clustering, an observation is partitioned into exactly one cluster (no probability is calculated). <br><br>
There are many types of clustering algorithms, such as K means, hierarchical clustering, etc. Other than these, several other methods have emerged which are used only for specific data sets or types (categorical, binary, numeric).
> We will look into three types of algorithms and their variants
> 1. K means
> 2. Hierarchical
> 3. DBSCAN

# Distance Measures

**K Means** and **Hierarchical clustering** techniques are driven by the distance between points and are therefore usually referred to as *distance-based techniques*. Hence it is necessary to look into the various ways of calculating distance *(Euclidean distance is not the only way to calculate distance!)*<br><br>
>There are some important things you should keep in mind:<br>
> 1. With quantitative variables, distance calculations are **highly influenced by variable units and magnitude**. For example, clustering on height (in feet) together with salary (in rupees), which have different units and differently shaped (skewed) distributions, will return results dominated by the variable with the larger magnitude. Hence, always make sure to standardize (mean = 0, sd = 1) the variables. **Standardization results in unit-less variables.**<br><br>
> 2. The choice of a particular **distance measure depends on the variable types**; i.e., the formula for calculating the distance between numerical variables differs from the one used for categorical variables.
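
To illustrate the first point, here is a minimal sketch using base R's `scale()` and `dist()`. The data frame `people` and its values are made up purely for illustration; they are not part of this notebook's data.

```r
# Hypothetical toy data: height in feet, salary in rupees
people <- data.frame(
  height_ft  = c(5.0, 5.5, 6.0, 6.2),
  salary_inr = c(300000, 320000, 900000, 310000)
)

# Raw Euclidean distances: salary differences swamp height differences
raw_dist <- dist(people)
print(raw_dist)

# Standardize every column to mean 0 and sd 1 (unit-less)
scaled <- scale(people)
print(round(colMeans(scaled), 10))  # column means are (numerically) 0
print(apply(scaled, 2, sd))         # column standard deviations are 1

# Distances on the standardized data weigh both variables comparably
scaled_dist <- dist(scaled)
print(scaled_dist)
```

After scaling, both columns contribute on the same footing, so neither unit of measurement decides the clustering on its own.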



1.**Euclidean Distance**: It is used to calculate the distance between quantitative (numeric) variables. It is also known as the L2 distance, because it is the L2 norm of the coordinate differences (square them, sum them, take the square root). Its formula is given by:<br>

> \begin{align}
\sqrt{\sum_{i=1}^n (x_i-y_i)^2} 
\end{align}
<br>

In [1]:
euclideanDistance <- function(x,y){
    
    #     Computes the Euclidean distance between two 1-D vectors x and y

    #     Args:
    #     x : (N,) vector
    #         Input vector
    #     y : (N,) vector
    #         Input vector

    #     Returns:
    #     Euclidean : double
    #         The Euclidean distance between vectors `x` and `y`.
    
    # Uncomment the line below and write your code based on the formula
    
    #return()
    }

In [85]:
euclideanDistance(c(1,2,3), c(4,5,6))
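
For checking your answer, here is one possible sketch of the formula above. The name `euclidean_sketch` is hypothetical and chosen so that it does not overwrite the exercise stub; base R's `dist()` is used only as a cross-check.

```r
# One possible implementation; the helper name is illustrative only
euclidean_sketch <- function(x, y) {
  stopifnot(length(x) == length(y))  # both vectors must have the same length
  sqrt(sum((x - y)^2))               # square root of the summed squared differences
}

euclidean_sketch(c(1, 2, 3), c(4, 5, 6))  # sqrt(27), about 5.196152

# Cross-check against base R's dist(), which compares matrix rows
as.numeric(dist(rbind(c(1, 2, 3), c(4, 5, 6)), method = "euclidean"))
```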

2.**Manhattan Distance**: It is calculated as the sum of the absolute differences between the given coordinates. This is known as the L1 distance. It is also sometimes called the taxicab or city-block distance.<br>

   An interesting fact about this distance is that it only calculates the horizontal and vertical distances. It doesn't calculate the diagonal distance. For example, in chess, we use the Manhattan distance to calculate the distance covered by rooks. Its formula is given by:<br>
> \begin{align}
\sum_{i=1}^n |x_i-y_i|
\end{align}
   <br>
Note: The generalisation of the Euclidean and Manhattan distances is called the **Minkowski distance**
>\begin{align}
\left(\sum_{i=1}^n |x_i-y_i|^p\right)^{1/p}
\end{align}
when p = 2 -> Euclidean <br>
when p = 1 -> Manhattan

In [2]:
manhattanDistance <- function(x,y){
    
#     Computes the Manhattan distance between two 1-D vectors x and y
    
#     Parameters
#     ----------
#     x : (N,) vector
#         Input vector
#     y : (N,) vector
#         Input vector
        
#     Returns
#     -------
#     manhattan : double
#         The Manhattan distance between vectors `x` and `y`.
    
    # Uncomment the line below and write your code based on the formula
    
    #return()
    }

In [86]:
manhattanDistance(c(1,2,3), c(4,5,6))

In [3]:
minkowskiDistance <- function(x,y,p_value){
    
#     Computes the Minkowski distance between two 1-D vectors x and y for a given p_value
    
#     Parameters
#     ----------
#     x : (N,) vector
#         Input vector
#     y : (N,) vector
#         Input vector
#     p_value : float
#         the norm factor
#         p == 1, Manhattan distance
#         p == 2, Euclidean distance
        
#     Returns
#     -------
#     minkowski : double
#         The minkowski distance between vectors `x` and `y`.
    
    # Uncomment the line below and write your code based on the formula
    
    #return()
    }    

In [87]:
minkowskiDistance(c(1,2,3), c(4,5,6), p_value = 3)
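
The relationship between the three formulas can be sketched in a single function. This is only one way to write it; `minkowski_sketch` is a hypothetical name that leaves the exercise stubs untouched, and the `p_value >= 1` check is an assumption added here, not something the exercise requires.

```r
# One possible implementation; the helper name is illustrative only
minkowski_sketch <- function(x, y, p_value) {
  stopifnot(length(x) == length(y), p_value >= 1)
  sum(abs(x - y)^p_value)^(1 / p_value)
}

x <- c(1, 2, 3)
y <- c(4, 5, 6)

minkowski_sketch(x, y, p_value = 1)  # Manhattan: 3 + 3 + 3 = 9
minkowski_sketch(x, y, p_value = 2)  # Euclidean: sqrt(27), about 5.196152
minkowski_sketch(x, y, p_value = 3)  # 81^(1/3), about 4.326749
```

With `p_value = 1` the expression reduces to the sum of absolute differences, and with `p_value = 2` to the square root of the summed squares.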

3.**Hamming Distance**: It is used to calculate the distance between categorical variables. It uses a contingency table to count the number of mismatches among the observations. If a categorical variable is binary (say, male or female), it encodes the variable as male = 0, female = 1.

   In case a categorical variable has more than two levels, the Hamming distance is calculated based on dummy encoding. Its formula is given by (x,y are given points):
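>\begin{align}
\frac{\sum_{i=1}^n (x_i \neq y_i)}{n}
\end{align}
Here each term of the sum is 1 when the two values differ and 0 otherwise.<br>
Note: Dividing by n normalizes the distance to the range 0 to 1. The raw mismatch count, without dividing by n, is also a valid Hamming distance.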

In [4]:
hammingDistance <- function(x, y){
#     Computes the Hamming distance between two 1-D vectors x and y
    
#     Parameters
#     ----------
#     x : (N,) vector
#         Input vector
#     y : (N,) vector
#         Input vector
        
#     Returns
#     -------
#     hamming : double
#         The hamming distance between vectors `x` and `y`.
    if(length(x) == length(y)){
        # Uncomment the line below and write your code based on the formula
        #return()
    }
    return(F)
    }

In [88]:
hammingDistance(c(1,2,3), c(4,2,6))
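
As a reference for the exercise, the mismatch count can be sketched as follows. The name `hamming_sketch` and its `normalize` argument are hypothetical additions for illustration; unlike the stub, this version stops with an error on vectors of unequal length instead of returning `F`.

```r
# One possible implementation; the helper name and `normalize` flag are illustrative only
hamming_sketch <- function(x, y, normalize = TRUE) {
  stopifnot(length(x) == length(y))
  mismatches <- sum(x != y)  # TRUE counts as 1, FALSE as 0
  if (normalize) mismatches / length(x) else mismatches
}

hamming_sketch(c(1, 2, 3), c(4, 2, 6))                                 # 2 of 3 positions differ
hamming_sketch(c("a", "b", "c"), c("a", "x", "c"), normalize = FALSE)  # 1 mismatch
```

Because the comparison uses `!=`, the same function works for numeric codes and for character labels.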

4.**Cosine Similarity**: It is one of the most commonly used similarity metrics in text analysis. The closeness of two pieces of text is measured by the angle between their vectors: the smaller the angle, the more similar they are. The angle (Θ) is assumed to be between 0° and 90°, which holds when all vector components are non-negative (as with word counts). A quick refresher: cos(0°) = 1 and cos(90°) = 0.

   Therefore, two vectors are maximally dissimilar at Θ = 90° (perpendicular, similarity 0) and most similar at Θ = 0° (parallel, similarity 1). For two vectors (x,y), the cosine similarity is given by their normalized dot product shown below:
>\begin{align}
\frac {\pmb x \cdot \pmb y}{\sqrt{(\pmb x \cdot \pmb x) (\pmb y \cdot \pmb y)}}
\end{align}
Note: **Cosine distance = 1 - Cosine similarity**

In [5]:
cosineSimilarity <- function(x,y){
#     Computes the cosine similarity between two 1-D vectors x and y
    
#     Parameters
#     ----------
#     x : (N,) vector
#         Input vector
#     y : (N,) vector
#         Input vector
        
#     Returns
#     -------
#     cosine : double
#         The cosine similarity between vectors `x` and `y`.
    
    # Uncomment the line below and write your code based on the formula
    
    #numerator = 
    #denominator = 
    #return(round(numerator/denominator,3))
    }

In [91]:
cosineSimilarity(c(1,2,3), c(4,5,6))
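
The normalized dot product can be sketched like this. The name `cosine_sketch` is hypothetical, and unlike the stub this version does not round the result.

```r
# One possible implementation; the helper name is illustrative only
cosine_sketch <- function(x, y) {
  stopifnot(length(x) == length(y))
  numerator   <- sum(x * y)                     # dot product x . y
  denominator <- sqrt(sum(x * x) * sum(y * y))  # product of the two vector lengths
  numerator / denominator
}

sim <- cosine_sketch(c(1, 2, 3), c(4, 5, 6))  # 32 / sqrt(14 * 77), about 0.9746
1 - sim                                        # the corresponding cosine distance

cosine_sketch(c(1, 0), c(0, 1))  # perpendicular vectors: 0
cosine_sketch(c(1, 2), c(2, 4))  # parallel vectors: 1
```

Note that a vector of all zeros makes the denominator 0, so the result is `NaN` in that case.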

5.**Jaccard Coefficient**: The Jaccard coefficient (sometimes called the Jaccard similarity index) compares the members of two sets to see which members are shared and which are distinct. It’s a measure of similarity for the two sets of data, with a range from 0 to 1. The higher the value, the more similar the two sets.<br>
   Although it’s easy to interpret, it is extremely sensitive to small sample sizes and may give erroneous results, especially with very small samples or data sets with missing observations.
>\begin{align}
\frac {|set(x) \cap set(y)|}{|set(x) \cup set(y)|}
\end{align}
Note: **Jaccard distance = 1 - Jaccard similarity**

In [58]:
jaccardSimilarity <- function(x,y){
#     Computes the Jaccard similarity between two 1-D vectors x and y (treated as sets)
    
#     Parameters
#     ----------
#     x : (N,) vector
#         Input vector
#     y : (N,) vector
#         Input vector
        
#     Returns
#     -------
#     jaccard : double
#         The jaccard similarity between vectors `x` and `y`.
    
    # Uncomment the line below and write your code based on the formula
    
    #intersection_cardinality = 
    #union_cardinality = 
    #return(intersection_cardinality/union_cardinality)
    }

In [92]:
jaccardSimilarity(c(1,2,3), c(1,5,6))
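
The set formula above can be sketched with base R's `intersect()` and `union()`, both of which drop duplicate values. The name `jaccard_sketch` is hypothetical and used only so the exercise stub is left as is.

```r
# One possible implementation; the helper name is illustrative only
jaccard_sketch <- function(x, y) {
  intersection_cardinality <- length(intersect(x, y))  # members present in both sets
  union_cardinality        <- length(union(x, y))      # members present in either set
  intersection_cardinality / union_cardinality
}

jaccard_sketch(c(1, 2, 3), c(1, 5, 6))      # 1 shared member out of 5 distinct: 0.2
1 - jaccard_sketch(c(1, 2, 3), c(1, 5, 6))  # Jaccard distance: 0.8
```

Two empty inputs give 0/0, which R reports as `NaN`, so callers may want to handle that case separately.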

# In-built library function

For faster calculation on huge datasets, use in-built functions as they are optimized.<br>
We will be using

1. The `distance()` function from the 'philentropy' package. It supports around 40 distance/similarity measures<br>
    **Note: For similarity measures it returns a similarity score; subtract the value from 1 to get the distance**<br>
    For more information see:<br>
    https://cran.r-project.org/web/packages/philentropy/vignettes/Distances.html
    

In [6]:
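# Install philentropy only if it is not already available, then load it
if (!is.element('philentropy', installed.packages()[,1])){
    install.packages('philentropy', dependencies = TRUE)
}
library('philentropy')

# distance() compares the rows of the matrix passed to it
euclideanDist <- distance(rbind(c(1,2,3), c(4,5,6)), method = "euclidean")
print(euclideanDist)

manhattanDist <- distance(rbind(c(1,2,3), c(4,5,6)), method = "manhattan")
print(manhattanDist)

minkowskiDist <- distance(rbind(c(1,2,3), c(4,5,6)), method = "minkowski", p = 3)
print(minkowskiDist)

# "cosine" gives a similarity score, so 1 - value is the cosine distance
cosineSim <- distance(rbind(c(1,2,3), c(4,5,6)), method = "cosine")
print(1 - cosineSim)

# For this input the printed value equals 29/47, i.e. sum(x*y) / (sum(x^2) + sum(y^2) - sum(x*y)).
# That is a real-valued form, not the set-based coefficient defined above (which gives 1/5 here).
jaccardSim <- distance(rbind(c(1,2,3), c(1,5,6)), method = "jaccard")
print(1 - jaccardSim)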

euclidean 
 5.196152 
manhattan 
        9 
minkowski 
 4.326749 
    cosine 
0.02536815 
  jaccard 
0.6170213 
