<center><h2>Mahalanobis distance D²</h2></center>


Mahalanobis distance (**D²**) is used to measure the distance between a point and a distribution, considering the correlations between variables. It’s particularly useful when the variables are not independent and have different scales or units. Here’s why you might need to perform Mahalanobis distance:

### 1. **Accounts for Correlation Between Variables**:
   - In multivariate data, the variables often exhibit correlations. **Mahalanobis distance** accounts for these correlations by using the covariance matrix to adjust the distance.
   - For example, in a dataset with height and weight, these two variables are correlated. Euclidean distance would treat them as independent, but Mahalanobis distance adjusts for this relationship, providing a more accurate measure of "distance" between points.

### 2. **Normalization for Different Scales**:
   - Variables can have different units or ranges (e.g., height in meters and weight in kilograms). **Mahalanobis distance** normalizes the variables by incorporating the covariance matrix, ensuring that differences in scales don’t distort the distance measure.
   - In contrast, Euclidean distance is sensitive to the scale of the data and might be dominated by variables with larger units or ranges.

### 3. **Detection of Multivariate Outliers**:
   - Mahalanobis distance is commonly used to **detect outliers** in multivariate data. 
   - Points with large Mahalanobis distances from the mean of the data are considered outliers, as they are far from the center of the distribution, even if their individual variable values don't appear extreme.

### 4. **Multivariate Analysis**:
   - When working with **multivariate distributions**, Mahalanobis distance is often used to quantify the distance between an observation and the mean of a distribution or to compare the similarity between two distributions.
   - For instance, it can be used to measure how close a new observation is to the centroid of a cluster in **clustering algorithms**.

### 5. **Better Performance in Classification**:
   - In **classification problems**, Mahalanobis distance can be used to measure how similar a point is to different classes. 
   - For example, in discriminant analysis, Mahalanobis distance helps determine whether a point belongs to one class or another by measuring how far it is from each class mean, considering the variance-covariance structure within each class.

### 6. **Applications**:
   - **Face recognition**: Mahalanobis distance is used in facial recognition algorithms to determine how close a face is to a reference face, accounting for the relationships between different facial features.
   - **Anomaly detection**: In finance, Mahalanobis distance is used to detect fraudulent transactions by measuring how far a new transaction is from the expected pattern of transactions.
   - **Quality control**: Mahalanobis distance can be used to determine whether a new batch of products deviates significantly from previous batches by considering the relationships between multiple product characteristics.

### Formula for Mahalanobis Distance:
$$
D^2 = (x - \mu)^T \Sigma^{-1} (x - \mu)
$$

- **\( x \)**: The vector of data points (an observation).
- **\( \mu \)**: The mean vector of the distribution.
- **\( \Sigma \)**: The covariance matrix of the distribution, which accounts for the relationships between the variables.
- **\( \Sigma^{-1} \)**: The inverse of the covariance matrix, which adjusts for correlations between the variables.

### Conclusion:
Mahalanobis distance is essential in multivariate data analysis because it accounts for the correlations between variables and differences in scales, making it more robust than Euclidean distance for detecting outliers, comparing distributions, or performing classification tasks.


<center><h2>CODE</h2></center>

In [4]:
import numpy as np
import pandas as pd
from scipy.spatial.distance import mahalanobis
from numpy.linalg import inv
from scipy.stats import chi2

### Generating Data

In [5]:

# Set random seed for reproducibility
np.random.seed(42)

# Generate synthetic data for 40 cars
data = {
    'Price': np.random.uniform(15000, 50000, 40),         # Prices between $15,000 and $50,000
    'Distance': np.random.uniform(0, 300000, 40),         # Distance between 0 and 300,000 km
    'Emission': np.random.uniform(50, 400, 40),           # Emissions between 50g/km and 400g/km
    'Performance': np.random.uniform(100, 500, 40),       # Performance score between 100 and 500
    'Mileage': np.random.uniform(10, 40, 40)              # Mileage between 10 and 40 miles per gallon
}

# Convert to DataFrame
df = pd.DataFrame(data)

# Print the first few rows of the data
print(df.head())


          Price       Distance    Emission  Performance    Mileage
0  28108.904160   36611.470453  352.086199   422.976062  21.033494
1  48275.000724  148553.073033  268.154344   458.436520  28.969175
2  40619.787963   10316.556335  165.814309   227.201390  29.005891
3  35953.046947  272796.120624   72.245423   144.020770  26.073241
4  20460.652415   77633.994480  158.843813   191.174065  12.708693


In [6]:
# Calculate mean vector and covariance matrix for the 5 columns
mean_vector = df.mean().values
cov_matrix = df.cov().values
inv_cov_matrix = inv(cov_matrix)

# Function to calculate Mahalanobis distance for each observation
def mahalanobis_distance(row, mean_vector, inv_cov_matrix):
    diff = row - mean_vector
    md = np.sqrt(np.dot(np.dot(diff, inv_cov_matrix), diff.T))
    return md

# Apply the function to each row in the DataFrame (using only the relevant columns)
df['Mahalanobis_D2'] = df.apply(lambda row: mahalanobis_distance(row.values, mean_vector, inv_cov_matrix), axis=1)

# Print the results with Mahalanobis distances
df.head(10)


Unnamed: 0,Price,Distance,Emission,Performance,Mileage,Mahalanobis_D2
0,28108.90416,36611.470453,352.086199,422.976062,21.033494,2.189294
1,48275.000724,148553.073033,268.154344,458.43652,28.969175,2.323676
2,40619.787963,10316.556335,165.814309,227.20139,29.005891,2.113913
3,35953.046947,272796.120624,72.245423,144.02077,26.073241,2.528953
4,20460.652415,77633.99448,158.843813,191.174065,12.708693,2.502139
5,20459.808212,198756.685306,163.814163,270.843115,35.059075,1.765385
6,17032.926426,93513.322827,305.362162,427.205906,19.623402,2.318112
7,45316.165102,156020.406353,273.145115,444.292233,15.595555,2.071525
8,36039.025411,164013.083803,360.52446,102.780852,11.223254,2.612013
9,39782.540223,55456.336658,215.275224,304.298921,27.726788,1.395439


In [7]:
df['p-value'] = 1 - chi2.cdf(df['Mahalanobis_D2'], df=5)
df

Unnamed: 0,Price,Distance,Emission,Performance,Mileage,Mahalanobis_D2,p-value
0,28108.90416,36611.470453,352.086199,422.976062,21.033494,2.189294,0.822381
1,48275.000724,148553.073033,268.154344,458.43652,28.969175,2.323676,0.802783
2,40619.787963,10316.556335,165.814309,227.20139,29.005891,2.113913,0.833169
3,35953.046947,272796.120624,72.245423,144.02077,26.073241,2.528953,0.772129
4,20460.652415,77633.99448,158.843813,191.174065,12.708693,2.502139,0.776173
5,20459.808212,198756.685306,163.814163,270.843115,35.059075,1.765385,0.880562
6,17032.926426,93513.322827,305.362162,427.205906,19.623402,2.318112,0.803603
7,45316.165102,156020.406353,273.145115,444.292233,15.595555,2.071525,0.839162
8,36039.025411,164013.083803,360.52446,102.780852,11.223254,2.612013,0.759539
9,39782.540223,55456.336658,215.275224,304.298921,27.726788,1.395439,0.924812


In [8]:
outliers = df[df['p-value']<0.01]
print("Outliers, alpha=1%")
outliers

Outliers, alpha=1%


Unnamed: 0,Price,Distance,Emission,Performance,Mileage,Mahalanobis_D2,p-value
