# Types of Data, Statistics and Proximity Measures

##### Although it's tempting to jump straight into models and play around with data, doing so is a pitfall that almost every beginner falls into. Building a robust algorithm is important, but to do so it is equally important to know your data. Real-world data comes from many different sources; it can be large, arbitrarily complex and really messy. It is therefore important to understand the data before doing anything else. In this notebook, we shall see the types of data, basic statistics, measures of dispersion, proximity measures, etc., which will prove highly important later when dealing with real-world data. The notebook aims to provide the basic knowledge every data scientist needs, which ultimately leads to better results. Without knowing the data, it is difficult to design efficient solutions.

#####  Please hit upvote if you find the notebook useful.

### Table of contents
*  **[Data Objects and Attribute Types](#obj_attr)**
    * [Nominal Attributes](#nominal)
    * [Binary Attributes](#binary)
    * [Numeric Attributes](#numeric)
    * [Ordinal Attributes](#ordinal)
    * [Continuous and Discrete Attributes](#cont_dis)
* **[Measures of Central Tendency](#central_tendency)**
    * [Mean](#mean)
    * [Median](#median)
    * [Mode](#mode)
* **[Measures of Dispersion of data](#dispersion)**
    * [Range](#range)
    * [Quantiles](#quantile)
    * [Quartiles, Interquartile Range and Percentile](#quartile)
    * [Skewness](#skew)
    * [Kurtosis](#kurtosis)
    * [Variance and Standard Deviation](#var_std)
    * [Basic data visualizations](#data_viz)
        * [Distribution plots](#distplot)
        * [Histograms](#hist)
        * [Scatter plots](#scatter)
        * [Box and whiskers plots](#boxplot)
* **[Proximity Measures](#proximity)**
    * [Data Matrix and Dissimilarity Matrix](#matrices)
    * [Proximity Measures for Nominal Attributes](#prox_nominal)
    * [Proximity Measures for Binary Attributes](#prox_binary)
    * [Proximity Measures for Numeric Data: Minkowski Distance](#prox_numeric)
    * [Proximity Measures for Ordinal Data](#prox_ordinal)
    * [Proximity measures for mixed attribute type data](#prox_mix)
    * [Cosine similarity](#prox_cosine)


Reference textbook: [Data Mining: Concepts and Techniques by Jiawei Han, Micheline Kamber and Jian Pei](https://github.com/mohtashim-nawaz/Books/tree/master/Data%20Science)
<br><br>
For basics of numpy: [Basics of Numpy](https://www.kaggle.com/mohtashimnawaz/numpy-with-jokes-and-funs) <br>
For basics of Pandas: [Basics of Pandas](https://www.kaggle.com/mohtashimnawaz/easy-peasy-pandas-with-jokes)
<br><br>
**Note: Since the notebook contains LaTeX mathematical formulae, it may take some time to load properly.**
**All the figures have been taken from the referred book.**

<a id = "obj_attr"></a>
## 1. Data Objects and Attribute Types
A dataset is usually organized as a collection of data objects, where each data object represents an entity. Different literature uses different names for these data objects, such as ***tuples***, ***objects***, or even ***rows***.
<br><br>
A data object describes an entity by certain characteristics, which are called ***attributes***. Attributes likewise go by different names in different literature, such as ***feature***, ***dimension*** and ***variable***. Since each data object is a vector of attributes, it is sometimes referred to as a ***feature vector***.
<br><br>
Attributes are classified into different categories on the basis of different properties. These classifications are not necessarily disjoint: an attribute can belong to more than one category.

In [None]:
# We shall see some examples
import pandas as pd
titanic_data = pd.read_csv('../input/titanic/train.csv')
print("Example of some data objects: Each row represents a data object:")
titanic_data.head(3)


In [None]:
# The attributes or features in the dataset describing each data object or tuple
print(list(titanic_data.columns))

<a id="nominal"></a>
### 1.1. Nominal Attributes
Nominal stands for "name like". The value of a nominal attribute is the name of a thing or a symbol that denotes some category, code or state. Nominal attributes are thus also referred to as ***categorical attributes***. They do not have any order, so mathematical calculations like mean, median, standard deviation, etc. are not applicable to them. However, the mode, i.e. the most commonly occurring value, can be calculated.
<br><br>
Although nominal attributes represent categories, their values can be numeric codes that stand for a category or state. Such numeric values should still be treated as categorical. For example, *average*, *good* and *excellent* can be grading categories coded as 0, 1 and 2 respectively; the codes are labels, not quantities, and should not be mistaken for numeric data. Such data is often called false numeric and should be handled carefully.

In [None]:
# You may have already guessed nominal attributes in the titanic dataset by seeing the rows above
# Here are the columns which represent nominal attributes
print(['Pclass', 'Name', 'Sex', 'Ticket', 'Cabin','Embarked'])
print("You may have noticed that Pclass contains numeric values but is still nominal.")

<a id="binary"></a>
### 1.2. Binary Attributes
Binary attributes are a special type of nominal attribute that holds only two possible values, often corresponding to true and false. All other properties of nominal attributes apply to binary attributes. For example, in the *Titanic Dataset*, the attribute *Survived* is a binary attribute.
<br><br>
Binary attributes are of two types: ***symmetric*** and ***asymmetric***. A binary attribute is symmetric when both of its states are equally important, so neither is favoured over the other; otherwise it is asymmetric (e.g. the outcome of a medical test, where the rarer positive result matters more).
<br><br> 
Although binary attributes are often associated with a positive and a negative class, this should not be taken to imply any order. As with nominal attributes, the class names simply denote two different classes that have no order.

In [None]:
# The 'Survived' attribute is a binary attribute
print(titanic_data['Survived'].value_counts())
print("Clearly only two classes, 0 and 1, exist for 'Survived' attribute. Again it contains numeric values but is actually binary nominal.")

<a id="numeric"></a>
### 1.3. Numeric Attributes
Numeric attributes are used to measure a quantity and are therefore *quantitative*. Their values are measurable quantities expressed as integers or real numbers. Mathematical operations are applicable to numeric attributes; for example, the mean, median, distributions, etc. are often calculated on numeric data. However, the mode may not be of much use for numeric data.
<br><br>
Numeric data is of two types: ***interval-scaled*** numeric attributes and ***ratio-scaled*** numeric attributes.

In [None]:
# In our Titanic dataset, we have following numeric attributes
print(['Age','Fare'],": Numeric attributes")
print("Some values of 'Age' are:")
print(titanic_data['Age'][0:3])
print("Some values of 'Fare' are:")
print(titanic_data['Fare'][0:3])

<a id="ordinal"></a>
### 1.4. Ordinal Attributes
Ordinal attributes are a type of nominal attribute whose values have a meaningful order. For example, students are often graded as A, B, C, and these grades have an inherent order.
<br><br>
Although ordinal attributes are nominal, some statistics such as the median and the mode can be computed for them; the mean, however, is not defined. Ordinal attributes can also be obtained by discretization of numeric data.
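
As a rough illustration of obtaining an ordinal attribute by discretization, the sketch below maps ages to ordered labels. The cut points and label names are hypothetical and chosen only for this example; in pandas, `pd.cut` does a similar job.

```python
from bisect import bisect_right

# Hypothetical cut points and labels, chosen only for illustration.
cut_points = [13, 20, 60]                      # an age equal to a cut point goes to the higher bin
labels = ["child", "teen", "adult", "senior"]  # ordered from youngest to oldest

def discretize(age):
    """Return the ordinal label of the bin that `age` falls into."""
    return labels[bisect_right(cut_points, age)]

ages = [4, 13, 19, 35, 60, 71]
binned = [discretize(a) for a in ages]
print(binned)  # ['child', 'teen', 'teen', 'adult', 'senior', 'senior']
```

The resulting labels keep their order (child < teen < adult < senior), so a median is meaningful for them while a mean is not.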

In [None]:
# In the titanic dataset, 'Pclass' is an ordinal attribute.
print("It is obvious that class 1, 2 and 3 have an order among themselves.")
print(titanic_data['Pclass'].value_counts())
print("Here 3, 2, 1 are Pclass values while numbers in front of them are their frequency in the dataset")

<a id="cont_dis"></a>
### 1.5. Continuous and Discrete Attributes
There is yet another way of classifying attributes. Attributes that take a finite or countably infinite set of values are called ***discrete*** attributes. All other attributes are called ***continuous*** attributes; they can take uncountably many values.
<br><br>
A discrete attribute can have infinitely many values as long as the values are countable, e.g. an attribute whose values are natural numbers. Continuous attributes, on the other hand, are typically represented with real numbers.

In [None]:
# In our Titanic dataset 'Fare' is continuous and 'Pclass', 'SibSp', 'Survived', etc. are discrete.
print("An example of continuous attribute:")
print(titanic_data['Fare'][0:3])
print("An example of discrete attribute:")
print(titanic_data['Pclass'][0:3])

<a id="central_tendency"></a>
## 2. Measures of Central Tendency
After knowing the attribute type and their properties, it is often helpful to know measures of central tendency. The measures of central tendency denote the central location of the data distribution and are therefore helpful in providing a  general overview. Also, mean, median, mode are often used in data preprocessing tasks which is the next step after knowing the data.
<br><br>
We shall see the most common measures of central tendency which are mean, median and mode.

<a id="mean"></a>
### 2.1 Mean
Mean or average is the most common and effective measure of central tendency. It is calculated as the sum of all data values divided by the number of data values.
<br><br>
Mean can be calculated as,<br><br>
$$\bar{x}=\frac{\sum_{i=1}^N x_i}{N} = \frac{x_1 + x_2 +...+x_N}{N}$$
<br><br>
If a weight $w_i$ is attached to each value $x_i$, the weighted mean can be calculated as<br><br>
$$\bar{x}=\frac{\sum_{i=1}^N w_i x_i}{\sum_{i=1}^N w_i} = \frac{w_1x_1 + w_2x_2 +...+w_Nx_N}{w_1+w_2+...+w_N}$$
<br><br>
Although the mean is very commonly used, it is **less effective in the presence of outliers**, because it is sensitive to extreme values: a few very large or very small values can pull it away from the bulk of the data.
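
The two formulas above, and the effect of a single outlier, can be sketched in plain Python. All numbers here are made up for illustration.

```python
from statistics import median

def weighted_mean(values, weights):
    """Sum of w_i * x_i divided by the sum of the weights."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)

scores = [70, 80, 90]    # hypothetical values
weights = [1, 2, 3]      # hypothetical weights

plain = sum(scores) / len(scores)
weighted = weighted_mean(scores, weights)
print(plain, round(weighted, 2))   # 80.0 83.33

# One extreme value moves the mean a lot but leaves the median almost unchanged.
clean = [10, 11, 12, 12, 13]
with_outlier = clean + [100]
mean_clean = sum(clean) / len(clean)
mean_outlier = sum(with_outlier) / len(with_outlier)
print(mean_clean, round(mean_outlier, 2))     # 11.6 26.33
print(median(clean), median(with_outlier))
```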

In [None]:
# We shall use predefined mean function to calculate the mean, which employs same mathematical procedure
print("Mean of Fare: ",titanic_data['Fare'].mean())
print("Note:Outliers and distribution are not dealt with for sake of understanding.")

<a id="median"></a>
### 2.2 Median
The median is also one of the most commonly used measures of central tendency. It can be calculated by sorting the data and then finding the middle value.
<br><br>
When the number of data values is odd, a single middle value exists. When the number of data values is even, two values lie in the middle of the data, and any value between these two middle values can serve as a median. Most commonly, however, the mean of the two middle values is taken as the median.
<br><br>
The median is expensive to compute for a large number of observations, but it can be approximated easily when the data is grouped into intervals with known frequencies. The formula used is given as,<br><br>
$$median = L_1 + \left(\frac{ N/2 - \left(\sum freq\right)_l}{freq_{median}} \right) width$$
<br>
where $L_1$ is the lower boundary of the median interval, $N$ is the number of values in the entire dataset, $\left(\sum freq\right)_l$ is the sum of the frequencies of all intervals lower than the median interval, $freq_{median}$ is the frequency of the median interval and $width$ is the width of the median interval.
<br><br>
Median is **not defined for nominal and binary attributes** and can be found for numeric and ordinal attributes.
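
A minimal sketch of the interval-based approximation above, assuming the data arrives as `(lower, upper, frequency)` triples sorted by interval. The intervals and frequencies are hypothetical.

```python
def grouped_median(intervals):
    """Approximate the median from (lower, upper, frequency) triples sorted by lower bound."""
    n = sum(freq for _, _, freq in intervals)
    if n == 0:
        raise ValueError("intervals must contain at least one observation")
    cumulative = 0                      # sum of frequencies below the current interval
    for lower, upper, freq in intervals:
        if cumulative + freq >= n / 2:  # this is the median interval
            width = upper - lower
            return lower + (n / 2 - cumulative) / freq * width
        cumulative += freq

# Hypothetical grouped data: 50 observations in four intervals of width 10.
intervals = [(0, 10, 5), (10, 20, 15), (20, 30, 20), (30, 40, 10)]
approx_median = grouped_median(intervals)
print(approx_median)  # 22.5
```

Here $N/2 = 25$, the median interval is 20 to 30, the frequencies below it sum to 20 and its own frequency is 20, so the estimate is $20 + \frac{25-20}{20}\times 10 = 22.5$.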

In [None]:
# Let's compute the median for 'Pclass'(ordinal) and 'Age' (numeric)
print("Median for Pclass:", titanic_data['Pclass'].median())
print("Median for Age:", titanic_data['Age'].median())

<a id="mode"></a>
### 2.3 Mode
Mode simply means the **most commonly occurring** value among all the data values. It is defined for all types of attributes but is of little use for continuous attributes.
<br><br>
A dataset having a single mode is called unimodal, one having two modes is called bimodal and one having three modes is called trimodal. In general, a dataset with two or more modes is called multimodal.
<br><br>
For a unimodal and moderately skewed data, a relation  among mean, median and mode exists which is given as,<br><br>
$$mean-mode \approx 3 \times (mean-median) $$
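
To check how many modes a sample has, the standard library's `statistics.multimode` returns every most-frequent value. The two samples below are made up for illustration. pandas' `Series.mode()` likewise returns all modes, which is why the notebook code indexes the result with `[0]`.

```python
from statistics import multimode

unimodal = [1, 2, 2, 3]       # hypothetical sample with one mode
bimodal = [1, 1, 2, 3, 3]     # hypothetical sample with two modes

modes_uni = multimode(unimodal)
modes_bi = multimode(bimodal)
print(modes_uni)  # [2]
print(modes_bi)   # [1, 3]
```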

In [None]:
# We shall see the mode of 'Pclass' and 'Survived'
print("Mode of Pclass:", (titanic_data['Pclass'].mode())[0])
print("Mode of Survived:", (titanic_data['Survived'].mode())[0])

<a id="dispersion"></a>
## 3. Measures of Dispersion of data
The dispersion or spread of numeric data is very useful when analysing the data. The most commonly used measures to assess the dispersion of data are discussed below.

<a id="range"></a>
### 3.1 Range
Range is simply the **difference of maximum and minimum** data value among all data values. It can be simply calculated by finding maximum and minimum values of a data.

In [None]:
# We will calculate range of 'Age' in Titanic dataset
print("Range of Age:", titanic_data['Age'].max()-titanic_data['Age'].min())

<a id="quantile"></a>
### 3.2 Quantiles
Suppose the data is sorted in increasing order. We can then choose points that split the data distribution into consecutive sets of approximately equal size. These points are called **quantiles**.
<br><br>
When a single point is chosen that divides the data distribution into two equal halves, it is called the 2-quantile; when two points are chosen, they are called the 3-quantiles, and so on. There are $q-1$ split points for $q$-quantiles.
<br><br> 
The $k$th $q$-quantile of a given data distribution is a value $x$ such that at most $\frac{k}{q}$ of the data values are less than $x$ and at most $\frac{q-k}{q}$ of the data values are greater than $x$, where $k$ is an integer with $0 < k < q$. For example, among the 4-quantiles, Q1 splits the data into 25% and 75%, Q2 splits it into 50% each and Q3 splits it into 75% and 25%.
<br><br>
**2-Quantile corresponds to median.**
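
A small sketch with the standard library's `statistics.quantiles` shows the $q-1$ split points for a few values of $q$. The data is hypothetical, and the `"inclusive"` interpolation rule is only one of several conventions, so other libraries may return slightly different cut points.

```python
from statistics import median, quantiles

data = [1, 2, 3, 4, 5, 6, 7, 8, 9]   # hypothetical, already sorted

two = quantiles(data, n=2, method="inclusive")    # 1 split point
three = quantiles(data, n=3, method="inclusive")  # 2 split points
four = quantiles(data, n=4, method="inclusive")   # 3 split points

print(len(two), len(three), len(four))  # 1 2 3
print(two[0] == median(data))           # True: the 2-quantile is the median
print(four)                             # [3.0, 5.0, 7.0]
```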

<a id="quartile"></a>
### 3.3 Quartiles, Interquartile Range and Percentile
Quartiles are nothing but a specific case of quantiles: **4-quantiles are referred to as quartiles**. Q1, Q2 and Q3 are the respective dividing points, where **Q2 corresponds to the median**. Quartiles provide an indication of the center, shape and spread of a data distribution.
<br><br>
**The interquartile range (IQR) is simply the distance (difference) between Q3 and Q1**,
so the IQR can be defined as,
<br>$$IQR = Q3-Q1$$<br>
**Percentile is simply 100-quantiles** which divides the data distribution into 100 essentially equal parts.
<br><br>
***Five Number Summary is yet another useful concept which is simply a collection of five numbers which are min, Q1, median, Q3, max of a given data.***
<br><br> Figure below summarizes the concepts.
![bell_curve.png](attachment:bell_curve.png)
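
The five-number summary and the IQR can be sketched as follows. The sample is hypothetical, the `"inclusive"` quartile rule is an assumption (quartile conventions differ between libraries), and the 1.5 x IQR fences are the common box-plot convention for flagging possible outliers.

```python
from statistics import quantiles

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using the 'inclusive' quartile rule."""
    ordered = sorted(values)
    q1, q2, q3 = quantiles(ordered, n=4, method="inclusive")
    return ordered[0], q1, q2, q3, ordered[-1]

sample = [1, 2, 3, 4, 5, 6, 7, 8, 30]   # hypothetical data with one extreme value
low, q1, med, q3, high = five_number_summary(sample)
iqr = q3 - q1

# Values beyond 1.5 * IQR from the box are flagged as possible outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in sample if x < lower_fence or x > upper_fence]

print((low, q1, med, q3, high))  # (1, 3.0, 5.0, 7.0, 30)
print(iqr, outliers)             # 4.0 [30]
```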

<a id="skew"></a>
### 3.4 Skewness
Skewness describes the asymmetry of a data distribution. If the distribution curve is perfectly symmetric, as it is for the normal distribution, the data is not skewed.
<br><br>
A data distribution can be either **positively skewed if the tail extends towards the right** or **negatively skewed if the tail extends towards the left**. The details can be found in the figure.
<br><br>
For positively skewed data, the median is typically less than the mean, while for negatively skewed data the median is typically greater than the mean.
<br><br>
Skewness is often visualized and analyzed with the help of distribution curves; however, box and whiskers plots can also be used for the same task. Box plots can also be used to detect outliers.
<br>
![skew.png](attachment:skew.png)
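
Skewness can also be quantified. The sketch below uses one common moment-based definition, $g_1 = m_3 / m_2^{3/2}$ with population moments; this formula is an assumption here, since the text above does not fix a definition, and pandas' `Series.skew()` applies a bias correction, so its numbers will differ slightly. The samples are made up.

```python
from statistics import mean, median

def skewness(values):
    """Moment-based skewness g1 = m3 / m2**1.5, using population moments."""
    n = len(values)
    mu = sum(values) / n
    m2 = sum((x - mu) ** 2 for x in values) / n
    m3 = sum((x - mu) ** 3 for x in values) / n
    return m3 / m2 ** 1.5

right_tailed = [1, 2, 2, 3, 3, 3, 4, 4, 20]   # hypothetical: long tail to the right
left_tailed = [-x for x in right_tailed]      # mirror image: long tail to the left
symmetric = [1, 2, 3, 4, 5]

for name, sample in [("right", right_tailed), ("left", left_tailed), ("symmetric", symmetric)]:
    print(name, "skewness:", round(skewness(sample), 3),
          "mean:", round(mean(sample), 3), "median:", median(sample))
```

The right-tailed sample gives a positive value with the mean above the median, the mirrored sample gives the opposite, and the symmetric sample gives zero.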

In [None]:
# Let's see the distribution of 'Age' in the Titanic dataset
import matplotlib.pyplot as plt
import seaborn as sns # seaborn is a popular visualization library
# sns.distplot is deprecated since seaborn 0.11; histplot with kde=True draws a comparable plot
sns.histplot(titanic_data['Age'].dropna(), kde=True)
plt.show()
print("The distribution is slightly positively skewed.")

In [None]:
# Let's plot box plot
# Box plot can also be used for outlier detection. It also shows Interquartile range, median, Q1, Q3, etc.
sns.boxplot(x=titanic_data['Survived'], y=titanic_data['Age'])
plt.show()
print("Box plots show the median, Q1, Q3 and IQR; whiskers reach the most extreme values within 1.5 x IQR of the box, and points beyond them are possible outliers")

More information on distribution plot and box and whiskers plot can be found in referred book and [here(distplots)](https://seaborn.pydata.org/tutorial/distributions.html) and [here(boxplots)](https://seaborn.pydata.org/generated/seaborn.boxplot.html).

<a id="kurtosis"></a>
### 3.5 Kurtosis
Kurtosis is a measure of how much tails of a distribution differ from a normal distribution. In other words, kurtosis identifies if tails of a distribution contain extreme values.<br><br>
Type of kurtosis depends on excess kurtosis which is calculated as,<br><br>
$$excess = kurtosis-3$$
<br>
Where $3$ corresponds to kurtosis of normal/gaussian bell shaped distribution.
<br><br>
When excess kurtosis is zero or close to zero, the distribution is said to be **mesokurtic**. Such a distribution has tails very similar to those of the normal distribution.<br>
When excess kurtosis is positive, the distribution is said to be **leptokurtic**. This type of distribution has heavy tails, i.e. it contains extreme values.<br>
When excess kurtosis is negative, the distribution is said to be **platykurtic**. This type of distribution has light tails, indicating an absence of extreme values.
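
A minimal sketch of excess kurtosis, assuming the moment-based definition $m_4 / m_2^2 - 3$ with population moments (library routines may apply bias corrections, so their values can differ). Both samples are hypothetical.

```python
def excess_kurtosis(values):
    """Population kurtosis m4 / m2**2, minus 3 (the kurtosis of a normal distribution)."""
    n = len(values)
    mu = sum(values) / n
    m2 = sum((x - mu) ** 2 for x in values) / n
    m4 = sum((x - mu) ** 4 for x in values) / n
    return m4 / m2 ** 2 - 3

flat = [1, 2, 3, 4, 5, 6, 7, 8, 9]            # evenly spread values, no extremes
heavy_tails = [0, 5, 5, 5, 5, 5, 5, 5, 10]    # mostly central values plus two extremes

k_flat = excess_kurtosis(flat)
k_heavy = excess_kurtosis(heavy_tails)
print(round(k_flat, 2))    # -1.23
print(round(k_heavy, 2))   # 1.5
```

The evenly spread sample comes out negative (platykurtic), while the sample with two extreme values comes out positive (leptokurtic).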

In [None]:
# Kurtosis can be calculated using inbuilt methods
print("Kurtosis of Age:", titanic_data['Age'].kurtosis())
print("Note:- This inbuilt method returns excess kurtosis, i.e. it takes the kurtosis of a normal distribution as 0.0 (Fisher's definition)")

<a id="var_std"></a>
### 3.6 Variance and Standard Deviation
Variance and standard deviation are measures of spread and indicate how widely a data distribution is spread. A high standard deviation means the data values are spread over a wide range, while a low value means the data values lie close to the mean.
<br><br>
Variance and standard deviation can be calculated as,<br><br>
$$\sigma^2 = \frac{1}{N}\sum_{i=1}^N \left( x_i - \bar{x}\right)^2 = \left(\frac{1}{N}\sum_{i=1}^N x_i^2 \right) - \bar{x}^2$$
<br>
**Standard deviation is the positive square root of variance.** 
<br>
$$\sigma = +\sqrt{\sigma^2}$$
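<br>Both measures are affected by outliers. Because standard deviation is the square root of variance, it is expressed in the same units as the data and is inflated less dramatically by extreme values than variance is.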
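
The two forms of the variance formula above can be checked against each other on a small made-up sample. The sketch also shows the sample variance, which divides by $N-1$ instead of $N$; pandas' `Series.var()` uses that $N-1$ form by default, while `np.var` divides by $N$.

```python
from statistics import pstdev, pvariance, variance

data = [2, 4, 4, 4, 5, 5, 7, 9]   # hypothetical values
n = len(data)
mean = sum(data) / n

by_definition = sum((x - mean) ** 2 for x in data) / n   # mean squared deviation
by_shortcut = sum(x * x for x in data) / n - mean ** 2   # mean of squares minus squared mean

print(by_definition, by_shortcut)       # 4.0 4.0
print(pvariance(data), pstdev(data))    # population variance and standard deviation
print(round(variance(data), 4))         # 4.5714 (sample variance, divides by N - 1)
```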

In [None]:
# We can easily compute variance and standard deviation of data using inbuilt methods
# We shall use a new dataset which contains several different types of attributes.
hp_data = pd.read_csv('../input/house-prices-advanced-regression-techniques/train.csv')

In [None]:
# This dataset has an attribute 'SalePrice' which contains prices of houses
# We shall calculate the population variance and standard deviation of this attribute by hand
import numpy as np
print("Custom formulae to calculate variance and standard deviation.")
sale_price = hp_data['SalePrice']
n = sale_price.shape[0]
mean = sale_price.sum() / n
squared_dev = (sale_price - mean) ** 2  # the scalar mean is broadcast over every value; read more in the documentation
var = squared_dev.sum() / n  # divide by n, as in the formula above
print("Variance:", var)
print("Standard deviation:", np.sqrt(var))

In [None]:
# Inbuilt functions can also be used for same purpose.
print("Variance and standard deviation using inbuilt numpy functions.")
print("Variance of SalePrice:",np.var(hp_data['SalePrice']))
print("Standard Deviation of SalePrice:",np.std(hp_data['SalePrice']))

<a id="data_viz"></a>
### 3.7 Basic Data Visualizations
Data analysis is incomplete without data visualization, and visualizations often help in recognizing properties and patterns of the data that could not be recognized by looking at the raw data.
<br><br>
There are a large number of data visualization techniques; however, **distribution plots, histograms, scatter plots and box and whiskers plots** are very common. Here, a brief introduction to each of these techniques is given along with sample code.

<a id="distplot"></a>
#### 3.7.1 Distribution plots
Distribution plots are used to visualize the **distribution of numeric attributes**. Such plots are very helpful for getting a sense of the overall data distribution and for judging the **skewness in the data**.
<br><br>
Distribution plots are used for univariate analysis (involving a single variable/attribute/feature).

In [None]:
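# Distribution plot of 'SalePrice' attribute in house price data
plt.figure(figsize=(8,6))
# sns.distplot is deprecated since seaborn 0.11; histplot with kde=True draws a comparable plot
sns.histplot(hp_data['SalePrice'], kde=True)
plt.xlabel("Sale Price")  # set the label after plotting so seaborn does not overwrite it
plt.show()
print("It is clear from the figure that distribution is positively skewed.")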

<a id="hist"></a>
#### 3.7.2 Histograms
Histograms are a classic way to graphically represent the **distribution of an attribute by counting the frequency** of its values; a taller bar means that more data falls into that particular category.
<br><br> 
For non-numeric data, the frequency of each distinct value is counted and displayed. For numeric data, however, the values first need to be discretized into bins.
<br><br>
Histograms are used for univariate analysis.

In [None]:
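# Following is an example of a histogram (count plot) of the attribute 'MSZoning' in the house-price data
plt.figure(figsize=(8,6))
sns.countplot(x='MSZoning', data=hp_data)  # name the column explicitly; recent seaborn versions expect keyword arguments here
plt.show()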

<a id="scatter"></a>
#### 3.7.3 Scatter plots
Scatter plots are the most commonly used plots for **bivariate analysis**. Two attributes are simply plotted against each other on a 2-D plot, where each data object is drawn as a separate point.
<br><br>
Scatter plots are often helpful in recognizing clusters and outliers and can provide a sense of how and where the data lies. A point that is far away from most of the other points is possibly an outlier. Scatter plots are also used for correlation analysis.
![scatter.png](attachment:scatter.png)
<br>
Plot (a) shows a positive correlation while plot (b) shows a negative correlation between the two plotted attributes. The following image shows cases of no correlation.
![scatter2.png](attachment:scatter2.png)

In [None]:
# Here is a scatter plot of 'Close/Last' and 'Open' from the McDonald's stock price dataset
mcd_stock = pd.read_csv('../input/eda-and-cleaning-mcdonald-s-stock-price-data/final.csv')
sns.scatterplot(x='Close/Last', y='Open', data=mcd_stock)
plt.show()
print("The plot shows a high positive correlation and there are no obvious outliers")

<a id="boxplot"></a>
#### 3.7.4 Box and whiskers plots
Box and whiskers plots are very popular plots mainly used for **plotting data groups using quartiles**. Box plots are used to get a sense of the **data distribution and spread**, to spot possible outliers and to plot the **five-number summary (min, Q1, median, Q3, max)**. **The length of the box denotes the IQR**.

In [None]:
# Let's create a box plot for some attributes
sns.boxplot(y='Age',x='Sex',data=titanic_data)
plt.show()
print("The points above 'male' can be possible outliers and the line in box is the median")

<a id="proximity"></a>
## 4. Proximity Measures
Applications in machine learning and data mining such as clustering, outlier analysis and nearest neighbor classification need to assess how similar or dissimilar data tuples are. **Proximity measures are mathematical techniques and formulae to assess the similarity/dissimilarity of data tuples**.
<br><br>
**Similarity and dissimilarity are related: given a dissimilarity value in the range $[0,1]$, we can easily calculate the similarity by subtracting the dissimilarity from 1**.
<br><br>
Proximity measures differ for different types of attributes. We shall discuss proximity measures for all the attribute types seen previously. We shall then see a proximity measure for data with mixed attribute types, and cosine similarity. However, before that it is important to know the data matrix and the dissimilarity matrix.

<a id="matrices"></a>
### 4.1 Data Matrix and Dissimilarity Matrix
Data matrix is nothing but all the data tuples stacked as a matrix. **Data matrix is tuple vs attribute matrix.**
<br><br>
$$
\begin{bmatrix}
x_{11} & ... & x_{1f} & ... & x_{1p} \\
... & ... & ... & ... & ... \\
x_{i1} & ... & x_{if} & ... & x_{ip} \\
... & ... & ... & ... & ... \\
x_{n1} & ... & x_{nf} & ... & x_{np}
\end{bmatrix}
$$
<br>
Dissimilarity matrix is a matrix of pairwise dissimilarity among the data tuples. It is often desirable to keep only lower triangle or upper triangle of a dissimilarity matrix. **Dissimilarity matrix is a tuple vs tuple matrix.**
<br><br>
$$
\begin{bmatrix}
0 \\
d(2,1) & 0\\
d(3,1) & d(3,2) & 0\\
. & . & . \\
. & . & . \\
d(n,1) & d(n,2) & ... & ... & 0
\end{bmatrix}
$$
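
As a rough sketch, the lower triangle of a dissimilarity matrix can be built from a data matrix with any pairwise distance function. Straight-line (Euclidean) distance via `math.dist` is used here purely as a placeholder assumption, and the three tuples are made up.

```python
from math import dist

def dissimilarity_matrix(rows, distance=dist):
    """Return the lower triangle (diagonal included) of the pairwise distances between rows."""
    return [[distance(rows[i], rows[j]) for j in range(i + 1)]
            for i in range(len(rows))]

# Hypothetical data matrix: 3 tuples (rows) described by 2 numeric attributes (columns).
data_matrix = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]

matrix = dissimilarity_matrix(data_matrix)
for row in matrix:
    print(row)
# [0.0]
# [5.0, 0.0]
# [10.0, 5.0, 0.0]
```

Only the lower triangle is stored because $d(i,j) = d(j,i)$ and $d(i,i) = 0$, so the upper triangle carries no extra information.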

<a id="prox_nominal"></a>
### 4.2 Proximity measures for Nominal Attributes
Nominal attributes can have two or more different states e.g. an attribute 'color' can have values like 'Red', 'Green', 'Yellow', etc. Dissimilarity for nominal attributes is calculated as the ratio of total number of mismatches between two data tuples to the total number of attributes.
<br><br>
Let $M$ be the total number of states of a nominal attribute. The states can then be numbered from 1 to $M$. However, the numbering does not denote any kind of ordering and cannot be used for any mathematical operations.
<br><br>
Let $m$ be the number of attributes on which two tuples match (have the same state) and $p$ be the total number of attributes; then the dissimilarity can be calculated as,
$$d(i,j)=\frac{p-m}{p}$$
<br>
We can calculate similarity as, <br>
$$s(i,j) = 1-d(i,j)$$
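
The mismatch ratio above is short enough to sketch directly. The two tuples and their attribute values are hypothetical.

```python
def nominal_dissimilarity(a, b):
    """d = (p - m) / p, where p is the number of attributes and m the number of matches."""
    if len(a) != len(b):
        raise ValueError("both tuples must have the same number of attributes")
    p = len(a)
    m = sum(1 for x, y in zip(a, b) if x == y)
    return (p - m) / p

# Hypothetical tuples described by three nominal attributes (color, shape, size).
obj_i = ("red", "round", "small")
obj_j = ("red", "square", "small")

d = nominal_dissimilarity(obj_i, obj_j)
s = 1 - d
print(round(d, 3), round(s, 3))  # 0.333 0.667
```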

<a id="prox_binary"></a>
### 4.3 Proximity measures for Binary Attributes
Since binary attributes are similar to nominal attributes, proximity measures for binary attributes are also similar to those for nominal attributes. For symmetric binary attributes, the formula is the same, i.e. <br> $$d(i,j)=\frac{p-m}{p}$$ <br>
However, for asymmetric binary attributes, matches where both tuples have the value zero are considered unimportant and are dropped from the count. Let $t$ be the number of attributes for which both tuples are zero; then,
<br> $$d(i,j)=\frac{p-m}{p-t}$$ <br>
We can calculate similarity as, <br>
$$s(i,j) = 1-d(i,j)$$
For asymmetric binary attributes, this similarity is known as the Jaccard coefficient.
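
A minimal sketch of both binary cases, assuming the attributes are coded as 0/1. The two tuples are hypothetical, and returning 0.0 when every attribute is zero in both tuples is a choice made only for this sketch.

```python
def binary_dissimilarity(a, b, asymmetric=False):
    """Mismatch ratio for 0/1 tuples; matched zeros are ignored when asymmetric=True."""
    p = len(a)
    m = sum(1 for x, y in zip(a, b) if x == y)                  # all matches
    both_zero = sum(1 for x, y in zip(a, b) if x == 0 and y == 0)
    denominator = p - both_zero if asymmetric else p
    if denominator == 0:
        return 0.0   # assumption: two all-zero tuples are treated as identical
    return (p - m) / denominator

# Hypothetical tuples over six binary attributes.
obj_i = [1, 0, 1, 0, 0, 0]
obj_j = [1, 1, 0, 0, 0, 0]

d_sym = binary_dissimilarity(obj_i, obj_j)
d_asym = binary_dissimilarity(obj_i, obj_j, asymmetric=True)
print(round(d_sym, 3), round(d_asym, 3))  # 0.333 0.667
```

Dropping the three matched zeros shrinks the denominator from 6 to 3, so the same two mismatches weigh twice as much in the asymmetric case.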

<a id="prox_numeric"></a>
### 4.4 Proximity Measures for Numeric Data : Minkowski Distance
**The distance or dissimilarity between two objects described by numeric attributes is commonly measured using the Euclidean, Manhattan or, more generally, Minkowski distance.**
<br><br>
It is important to scale the attributes to a common range, usually $[0,1]$ or $[-1,1]$. This prevents attributes with large values from outweighing those with smaller values.
<br><br>
**Euclidean distance, also known as straight-line distance, is the most popular distance metric for numeric attributes**. It can be calculated as,<br><br>
$$d(i,j) = \sqrt{|x_{i1}-x_{j1}|^2+|x_{i2}-x_{j2}|^2+...+|x_{ip}-x_{jp}|^2}$$
<br>
Another popular distance measure is **manhattan or city block distance** which is calculated as,<br><br>
$$d(i,j)=|x_{i1}-x_{j1}|+|x_{i2}-x_{j2}|+...+|x_{ip}-x_{jp}|$$
<br>
Both of the above distance measures satisfy the following properties: <br>
<br>$$d(i,j) \geq 0$$
<br>$$d(i,i) = 0$$
<br>$$d(i,j) = d(j,i)$$
<br>$$d(i,k) \leq d(i,j) + d(j,k)$$<br>
A generalized distance measure is **minkowski distance** given as,<br><br>
$$d(i,j) = \sqrt[{h}]{|x_{i1}-x_{j1}|^h+|x_{i2}-x_{j2}|^h+...+|x_{ip}-x_{jp}|^h}$$
<br>
**For $h$=1, minkowski distance corresponds to manhattan distance.**
<br>**For $h$=2, minkowski distance corresponds to euclidean distance.**
<br>**For $h\to \infty$, the minkowski distance corresponds to $L_{\infty}$ norm or uniform norm** which is given as,<br><br>
$$d(i,j)=\lim_{h\to \infty}\left(\sum_{f=1}^p|x_{if}-x_{jf}|^h\right)^{\frac{1}{h}} = \max_{1 \leq f \leq p} |x_{if}-x_{jf}|$$
<br>
All of the above distance measures treat the attributes as unweighted; however, attributes can sometimes be assigned weights. In that case, each attribute's term is multiplied by its weight. For Euclidean distance, the formula becomes,<br><br>
$$d(i,j) = \sqrt{w_1|x_{i1}-x_{j1}|^2+w_2|x_{i2}-x_{j2}|^2+...+w_p|x_{ip}-x_{jp}|^2}$$
<br>
The above mentioned distance measures are crucial to the algorithms like K-Nearest Neighbors, Clustering algorithms, etc. and are very widely used in machine learning and data mining.
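
The distance formulas in this section can be sketched with one small function. It is written in plain Python so that each term stays visible; the two points and the weights are made up.

```python
import math

def minkowski(a, b, h):
    """Minkowski distance of order h; h = math.inf gives the L-infinity (uniform) norm."""
    diffs = [abs(x - y) for x, y in zip(a, b)]
    if math.isinf(h):
        return max(diffs)
    return sum(d ** h for d in diffs) ** (1 / h)

def weighted_euclidean(a, b, weights):
    """Euclidean distance with each squared difference multiplied by its weight."""
    return math.sqrt(sum(w * (x - y) ** 2 for x, y, w in zip(a, b, weights)))

p1, p2 = (1, 2), (3, 5)   # hypothetical points with two numeric attributes

print(minkowski(p1, p2, 1))                     # 5.0  (Manhattan)
print(round(minkowski(p1, p2, 2), 4))           # 3.6056 (Euclidean)
print(minkowski(p1, p2, math.inf))              # 3  (largest single difference)
print(weighted_euclidean(p1, p2, (1.0, 0.0)))   # 2.0  (second attribute ignored)
```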

<a id="prox_ordinal"></a>
### 4.5 Proximity measures for ordinal attributes
Ordinal attributes have a meaningful order among their values; therefore, they are **treated similarly to numeric attributes**. To do so, the states first need to be converted to numbers, where each state of an ordinal attribute is assigned a rank corresponding to its position in the order.
For example, if a grading system has the grades A, B and C, the ranks can be assigned as C=1, B=2 and A=3.
<br><br>
Since the number of states can differ between ordinal attributes, it is **required to scale the values to a common range**, e.g. $[0,1]$. This can be done using the formula, <br><br>
$$z_{if} = \frac{r_{if}-1}{M_f-1}$$
<br>
where $M_f$ is the number of states of ordinal attribute $f$ (the largest rank assigned) and $r_{if}$ is the rank (numeric value) of object $i$ for that attribute.
<br><br>
After the scaling is done, we can simply apply same distance metrics as given for numeric attributes. The similarity can be calculated as:
<br><br>$$s(i,j) = 1-d(i,j)$$
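
The rank scaling can be sketched with the A/B/C grading example from the text. The function name is hypothetical.

```python
def normalize_rank(rank, num_states):
    """Map a rank in 1..M onto [0, 1] using z = (r - 1) / (M - 1)."""
    return (rank - 1) / (num_states - 1)

ranks = {"C": 1, "B": 2, "A": 3}   # grades ordered from lowest to highest
M = max(ranks.values())

z = {grade: normalize_rank(r, M) for grade, r in ranks.items()}
print(z)  # {'C': 0.0, 'B': 0.5, 'A': 1.0}

# After scaling, any numeric distance applies; with one attribute it is an absolute difference.
d_A_C = abs(z["A"] - z["C"])
d_A_B = abs(z["A"] - z["B"])
print(d_A_C, d_A_B)  # 1.0 0.5
```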

<a id="prox_mix"></a>
### 4.6 Proximity Measures for Mixed Attribute Types
Real world data is often described by a mixture of different types of attributes, so it is important to define proximity measure for such data. 
<br><br>
The approach is to combine all the attributes into a single dissimilarity matrix, bringing all meaningful attributes onto a common scale of $[0,1]$.<br><br>
$$d(i,j) = \frac{\sum_{f=1}^p \delta_{ij}^{(f)} d_{ij}^{(f)}}{\sum_{f=1}^p \delta_{ij}^{(f)}}$$
<br> where $\delta_{ij}^{(f)} = 0$ if <br>
(1) $x_{if}$ or $x_{jf}$ is missing, <br>
(2) $x_{if} = x_{jf} = 0$ and attribute $f$ is asymmetric binary.
Otherwise, $\delta_{ij}^{(f)} = 1$.
<br><br>
$d_{ij}^{(f)}$ depends on type of attribute and can be calculated as:<br><br>
(1) If $f$ is numeric, $d_{ij}^{(f)} = \frac{|x_{if}-x_{jf}|}{\max_h x_{hf}-\min_h x_{hf}}$,
where $h$ runs over all non-missing objects for attribute $f$.<br>
(2) If $f$ is nominal or binary, $d_{ij}^{(f)} = 0$, if $x_{if}=x_{jf}$, otherwise $d_{ij}^{(f)}=1$.<br>
(3) If $f$ is ordinal, compute the ranks(i.e. assign values), $r_{if}$ and $z_{if} = \frac{r_{if}-1}{M_f-1}$ and treat $z_{if}$ as numeric.
<br><br>
All the steps are very similar to what we have already seen and numeric attributes are normalized to $[0,1]$ for same reasons described above.
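
The combined measure can be sketched as below. The attribute specification format (a list of dicts with `kind`, `range` and `levels` keys), the attribute kinds and all values are assumptions made up for this example, and missing values are represented by `None`.

```python
def mixed_dissimilarity(obj_i, obj_j, attributes):
    """Average the per-attribute dissimilarities over the attributes whose indicator is 1."""
    numerator = 0.0
    denominator = 0
    for x, y, spec in zip(obj_i, obj_j, attributes):
        kind = spec["kind"]
        if x is None or y is None:
            continue                            # indicator = 0: a value is missing
        if kind == "asymmetric_binary" and x == 0 and y == 0:
            continue                            # indicator = 0: matched zeros are ignored
        if kind == "numeric":
            d = abs(x - y) / spec["range"]      # range = max - min over the whole column
        elif kind == "ordinal":
            levels = spec["levels"]             # ordered states, lowest first
            m = len(levels)
            z_x = levels.index(x) / (m - 1)     # (r - 1) / (M - 1) with r = index + 1
            z_y = levels.index(y) / (m - 1)
            d = abs(z_x - z_y)                  # assumes the scaled ranks span [0, 1]
        else:                                   # nominal or binary
            d = 0.0 if x == y else 1.0
        numerator += d
        denominator += 1
    if denominator == 0:
        raise ValueError("the two objects share no comparable attributes")
    return numerator / denominator

# Hypothetical schema: color (nominal), grade (ordinal), age (numeric), symptom (asymmetric binary).
attributes = [
    {"kind": "nominal"},
    {"kind": "ordinal", "levels": ["fair", "good", "excellent"]},
    {"kind": "numeric", "range": 40},
    {"kind": "asymmetric_binary"},
]
obj_i = ("red", "excellent", 30, 0)
obj_j = ("blue", "good", 50, 0)
obj_k = ("red", None, 30, 1)

d_ij = mixed_dissimilarity(obj_i, obj_j, attributes)   # (1 + 0.5 + 0.5) / 3
d_ik = mixed_dissimilarity(obj_i, obj_k, attributes)   # (0 + 0 + 1) / 3
print(round(d_ij, 3), round(d_ik, 3))  # 0.667 0.333
```

For `obj_i` and `obj_j` the asymmetric binary attribute is skipped because both values are zero; for `obj_i` and `obj_k` the ordinal attribute is skipped because one value is missing.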

<a id="prox_cosine"></a>
### 4.7 Cosine Proximity
Cosine similarity is not as popular a proximity measure as the methods described above, but it is important for comparing documents. Let $x$ and $y$ be two term-frequency vectors representing two documents; their cosine similarity can be computed as, <br><br>
$$sim(x,y) = \frac{x \cdot y}{||x|| \, ||y||}$$
<br>
where $||x||$ is the Euclidean norm of $x$.<br><br>
Cosine similarity is based on the dot product of the two vectors and equals the cosine of the angle between them. $sim(x,y)=0$ implies that the two vectors are orthogonal, i.e. have no match. As the value gets closer to $1$, the angle gets smaller and the similarity between $x$ and $y$ increases.
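
A minimal sketch of the formula for term-frequency vectors. The two document vectors are made up, and raising an error for an all-zero vector is a choice made only for this sketch.

```python
import math

def cosine_similarity(x, y):
    """Dot product of x and y divided by the product of their Euclidean norms."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("cosine similarity is undefined for an all-zero vector")
    return dot / (norm_x * norm_y)

# Hypothetical term-frequency vectors for two documents over the same ten terms.
doc_a = [5, 0, 3, 0, 2, 0, 0, 2, 0, 0]
doc_b = [3, 0, 2, 0, 1, 1, 0, 1, 0, 1]

sim_docs = cosine_similarity(doc_a, doc_b)
sim_orthogonal = cosine_similarity([1, 0], [0, 1])
print(round(sim_docs, 4))   # 0.9356
print(sim_orthogonal)       # 0.0
```

The two documents share most of their frequent terms, so the score is close to 1, while vectors with no term in common score 0.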

##### While data objects and attribute types familiarize us with the kinds of data found in the real world, measures of central tendency and measures of dispersion provide ways to gain insight into the data. It is really important to understand the data before proceeding to the next step (i.e. EDA and preprocessing); failing to do so leads to difficulty in later steps. Finally, proximity measures provide the mathematics for calculating dissimilarity/similarity, which is used in algorithms like kNN, clustering algorithms, etc.

**PS: I have started a Medium publication where we shall be publishing data science learning material in a sequential way. You are invited to contribute.**
Please visit [here](https://medium.com/scratch-data-science) for further details.