# 🔍 Detecting Multivariate Outliers with Local Outlier Factor (LOF)

When working with **multivariate datasets** (more than one feature), outliers may not be obvious by looking at each variable individually.  
The **Local Outlier Factor (LOF)** algorithm is a density-based method that helps identify data points that are isolated compared to their neighbors.

## 🔹 What is Local Outlier Factor?

The LOF algorithm measures the **local density deviation** of a data point with respect to its neighbors:

- Points with **similar density to their neighbors** → considered normal.  
- Points with **significantly lower density than their neighbors** → considered outliers.

Each point gets a **LOF score**:
- **LOF ≈ 1** → density similar to its neighbors (normal data point).  
- **LOF well above 1** → lower density than its neighbors, so a possible outlier (higher values = stronger outlier).  
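
To make the score interpretation concrete, here is a minimal sketch on synthetic data (not the diamonds dataset). The cluster, the isolated point, and the variable names are assumptions chosen for illustration. Note that scikit-learn stores the negated score in `negative_outlier_factor_`, so the sign is flipped to recover the usual LOF.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)

# 100 points in a tight 2-D cluster plus one isolated point (synthetic, illustration only)
cluster = rng.normal(loc=0.0, scale=0.5, size=(100, 2))
isolated = np.array([[8.0, 8.0]])
X = np.vstack([cluster, isolated])

lof = LocalOutlierFactor(n_neighbors=20)
lof.fit(X)

# scikit-learn keeps the negated LOF, so flip the sign to get the conventional score
lof_scores = -lof.negative_outlier_factor_

print("median LOF of cluster points:", np.median(lof_scores[:-1]))
print("LOF of the isolated point:   ", lof_scores[-1])
```

The cluster points should score close to 1, while the isolated point should score far above 1.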

## 🔹 Outlier Handling Strategies

Once outliers are detected with LOF, there are two main approaches:

### 1. Cleaning (Removal)
- Directly remove rows with high LOF scores (outliers).
- Useful when outliers are errors or irrelevant anomalies.  
- **Pros**: Dataset becomes clean and consistent.  
- **Cons**: Risk of losing rare but important cases.

### 2. Suppression (Capping/Correction)
- Instead of removing, **suppress the effect of outliers**:
  - Replace extreme values with boundary values.  
  - Apply transformations (e.g., log-scaling, robust scaling).  
  - Cap values at a chosen percentile.  
- **Pros**: Keeps dataset size intact.  
- **Cons**: Distribution tails become artificially compressed.
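
As a rough illustration of percentile capping, the sketch below clips a synthetic column at its 1st and 99th percentiles. The data, the percentile choices, and the names `s` and `capped` are assumptions for this example only; they are not taken from the notebook.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# 1000 well-behaved values plus two planted extremes (synthetic, illustration only)
values = np.concatenate([rng.normal(50, 5, size=1000), [500.0, -300.0]])
s = pd.Series(values, name="value")

# Assumed cut-offs: 1st and 99th percentiles
lower, upper = s.quantile(0.01), s.quantile(0.99)

# Values outside [lower, upper] are replaced by the nearest boundary
capped = s.clip(lower=lower, upper=upper)

print("rows before/after:", len(s), len(capped))
print("range before:", s.min(), s.max())
print("range after: ", capped.min(), capped.max())
```

The row count stays the same; only the extreme values are pulled in to the boundaries.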

---

#### » Pull the diamonds dataset

In [8]:
import seaborn as sns
dia = sns.load_dataset("diamonds")
df = dia.copy()
df = df.select_dtypes(include=["float64", "int64"])  # keep numeric columns only; LOF needs numeric features
df.head()

Unnamed: 0,carat,depth,table,price,x,y,z
0,0.23,61.5,55.0,326,3.95,3.98,2.43
1,0.21,59.8,61.0,326,3.89,3.84,2.31
2,0.23,56.9,65.0,327,4.05,4.07,2.31
3,0.29,62.4,58.0,334,4.2,4.23,2.63
4,0.31,63.3,58.0,335,4.34,4.35,2.75


## Local Outlier Factor Implementation

#### » Import LocalOutlierFactor, fit the model, and predict 
- n_neighbors : number of neighbors used to estimate the local density
- contamination : expected proportion of outliers in the dataset (0.1 → 10%)

In [18]:
from sklearn.neighbors import LocalOutlierFactor
lof = LocalOutlierFactor(n_neighbors=20,contamination=0.1)
lof.fit_predict(df)

array([-1, -1, -1, ...,  1,  1,  1])
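
In the output above, `-1` marks a predicted outlier and `1` an inlier. The sketch below uses synthetic data (an assumption, not the diamonds dataset) to show how `contamination` controls roughly how many points are labeled `-1`. The dictionary name `flagged` is hypothetical.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))  # synthetic data, illustration only

flagged = {}
for contamination in (0.01, 0.05, 0.1):
    model = LocalOutlierFactor(n_neighbors=20, contamination=contamination)
    labels = model.fit_predict(X)  # -1 = outlier, 1 = inlier
    flagged[contamination] = int((labels == -1).sum())

for contamination, count in flagged.items():
    print(f"contamination={contamination}: {count} of {len(X)} points flagged")
```

Because `contamination` only sets how many points are labeled, the notebook goes on to inspect the raw scores and choose a threshold by hand.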

#### » Get the negative outlier factor scores (negated LOF: values near -1 are normal, more negative values are stronger outliers)

In [10]:
df_scores = lof.negative_outlier_factor_
df_scores[0:10]

array([-1.58352526, -1.59732899, -1.62278873, -1.33002541, -1.30712521,
       -1.28408436, -1.28428162, -1.26458706, -1.28422952, -1.27351342])

#### » Sort the scores in ascending order to find the breakpoint (where the scores stop dropping sharply)

In [12]:
import numpy as np
np.sort(df_scores)[0:20]

array([-8.60430658, -8.20889984, -5.86084355, -4.98415175, -4.81502092,
       -4.81502092, -4.61522833, -4.37081214, -4.29842288, -4.10492387,
       -4.0566648 , -4.01831733, -3.94882806, -3.82378797, -3.80135297,
       -3.75680919, -3.65947378, -3.59249261, -3.55564138, -3.47157375])
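
One way to support the visual inspection is to print the gaps between consecutive sorted scores. The sketch below hard-codes the 20 values shown above; the names `lowest_scores` and `gaps` are assumptions. It is only an aid for inspection, not an automatic rule for picking the threshold.

```python
import numpy as np

# The 20 lowest scores, copied from the sorted output above
lowest_scores = np.array([
    -8.60430658, -8.20889984, -5.86084355, -4.98415175, -4.81502092,
    -4.81502092, -4.61522833, -4.37081214, -4.29842288, -4.10492387,
    -4.0566648, -4.01831733, -3.94882806, -3.82378797, -3.80135297,
    -3.75680919, -3.65947378, -3.59249261, -3.55564138, -3.47157375,
])

# gaps[i] is the jump from sorted score i to sorted score i + 1
gaps = np.diff(lowest_scores)

for i, gap in enumerate(gaps):
    print(f"{i:2d} -> {i + 1:2d}: {gap:.4f}")
```

The first few gaps are large (the biggest lies between indices 1 and 2), after which they shrink. Where to cut remains a judgment call.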

#### » Assign the score chosen above (index 13 of the sorted array) as the threshold 

In [26]:
threshold = np.sort(df_scores)[13]
threshold

np.float64(-3.823787967755565)

### Cleaning (Removal) Method

In [27]:
df_cleaned = df_scores > threshold  # boolean mask: True for rows whose score is above the threshold
df_cleaned

array([ True,  True,  True, ...,  True,  True,  True])

In [28]:
df_cleaned = df[df_cleaned]  # keep only the rows selected by the mask
df_cleaned

Unnamed: 0,carat,depth,table,price,x,y,z
0,0.23,61.5,55.0,326,3.95,3.98,2.43
1,0.21,59.8,61.0,326,3.89,3.84,2.31
2,0.23,56.9,65.0,327,4.05,4.07,2.31
3,0.29,62.4,58.0,334,4.20,4.23,2.63
4,0.31,63.3,58.0,335,4.34,4.35,2.75
...,...,...,...,...,...,...,...
53935,0.72,60.8,57.0,2757,5.75,5.76,3.50
53936,0.72,63.1,55.0,2757,5.69,5.75,3.61
53937,0.70,62.8,60.0,2757,5.66,5.68,3.56
53938,0.86,61.0,58.0,2757,6.15,6.12,3.74


### Suppression (Capping/Correction)

In [29]:
supp = df[df_scores == threshold]  # the row sitting exactly at the threshold; its values will replace the outliers
supp

Unnamed: 0,carat,depth,table,price,x,y,z
31230,0.45,68.6,57.0,756,4.73,4.5,3.19


In [30]:
df_outlier = df[df_scores < threshold]
df_outlier

Unnamed: 0,carat,depth,table,price,x,y,z
6341,1.0,44.0,53.0,4032,6.31,6.24,4.12
10377,1.09,43.0,54.0,4778,6.53,6.55,4.12
24067,2.0,58.9,57.0,12210,8.09,58.9,8.06
35633,0.29,62.8,44.0,474,4.2,4.24,2.65
36503,0.3,51.0,67.0,945,4.67,4.62,2.37
38840,0.73,70.8,55.0,1049,5.51,5.34,3.84
41918,1.03,78.2,54.0,1262,5.72,5.59,4.42
45688,0.7,71.6,55.0,1696,5.47,5.28,3.85
48410,0.51,61.8,54.7,1970,5.12,5.15,31.8
49189,0.51,61.8,55.0,2075,5.15,31.8,5.12


In [31]:
res = df_outlier.to_records(index=False)
res

rec.array([(1.  , 44. , 53. ,  4032, 6.31,  6.24,  4.12),
           (1.09, 43. , 54. ,  4778, 6.53,  6.55,  4.12),
           (2.  , 58.9, 57. , 12210, 8.09, 58.9 ,  8.06),
           (0.29, 62.8, 44. ,   474, 4.2 ,  4.24,  2.65),
           (0.3 , 51. , 67. ,   945, 4.67,  4.62,  2.37),
           (0.73, 70.8, 55. ,  1049, 5.51,  5.34,  3.84),
           (1.03, 78.2, 54. ,  1262, 5.72,  5.59,  4.42),
           (0.7 , 71.6, 55. ,  1696, 5.47,  5.28,  3.85),
           (0.51, 61.8, 54.7,  1970, 5.12,  5.15, 31.8 ),
           (0.51, 61.8, 55. ,  2075, 5.15, 31.8 ,  5.12),
           (0.81, 68.8, 79. ,  2301, 5.26,  5.2 ,  3.58),
           (0.5 , 79. , 73. ,  2579, 5.21,  5.18,  4.09),
           (0.5 , 79. , 73. ,  2579, 5.21,  5.18,  4.09)],
          dtype=[('carat', '<f8'), ('depth', '<f8'), ('table', '<f8'), ('price', '<i8'), ('x', '<f8'), ('y', '<f8'), ('z', '<f8')])

In [32]:
res[:] = supp.to_records(index=False)  # overwrite every outlier record with the single threshold record
res

rec.array([(0.45, 68.6, 57., 756, 4.73, 4.5, 3.19),
           (0.45, 68.6, 57., 756, 4.73, 4.5, 3.19),
           (0.45, 68.6, 57., 756, 4.73, 4.5, 3.19),
           (0.45, 68.6, 57., 756, 4.73, 4.5, 3.19),
           (0.45, 68.6, 57., 756, 4.73, 4.5, 3.19),
           (0.45, 68.6, 57., 756, 4.73, 4.5, 3.19),
           (0.45, 68.6, 57., 756, 4.73, 4.5, 3.19),
           (0.45, 68.6, 57., 756, 4.73, 4.5, 3.19),
           (0.45, 68.6, 57., 756, 4.73, 4.5, 3.19),
           (0.45, 68.6, 57., 756, 4.73, 4.5, 3.19),
           (0.45, 68.6, 57., 756, 4.73, 4.5, 3.19),
           (0.45, 68.6, 57., 756, 4.73, 4.5, 3.19),
           (0.45, 68.6, 57., 756, 4.73, 4.5, 3.19)],
          dtype=[('carat', '<f8'), ('depth', '<f8'), ('table', '<f8'), ('price', '<i8'), ('x', '<f8'), ('y', '<f8'), ('z', '<f8')])

In [33]:
import pandas as pd
df[df_scores < threshold] = pd.DataFrame(res,index=df[df_scores < threshold].index)  # write the replaced rows back at the original outlier indices
df[df_scores < threshold]

Unnamed: 0,carat,depth,table,price,x,y,z
6341,0.45,68.6,57.0,756,4.73,4.5,3.19
10377,0.45,68.6,57.0,756,4.73,4.5,3.19
24067,0.45,68.6,57.0,756,4.73,4.5,3.19
35633,0.45,68.6,57.0,756,4.73,4.5,3.19
36503,0.45,68.6,57.0,756,4.73,4.5,3.19
38840,0.45,68.6,57.0,756,4.73,4.5,3.19
41918,0.45,68.6,57.0,756,4.73,4.5,3.19
45688,0.45,68.6,57.0,756,4.73,4.5,3.19
48410,0.45,68.6,57.0,756,4.73,4.5,3.19
49189,0.45,68.6,57.0,756,4.73,4.5,3.19
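
The same replacement can be written more compactly with `.loc` and a boolean mask, skipping the record-array round trip. The sketch below uses a small all-float toy frame with made-up scores; `toy`, `toy_scores`, and `toy_threshold` are hypothetical names. On a frame with mixed dtypes, such as the integer `price` column here, the assignment may need extra care with type casting.

```python
import numpy as np
import pandas as pd

# Toy data and scores, invented for illustration only
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 50.0, 60.0],
    "b": [10.0, 11.0, 12.0, 900.0, -5.0],
    "c": [0.1, 0.2, 0.3, 7.0, 9.0],
})
toy_scores = np.array([-1.0, -1.1, -2.0, -9.0, -4.0])
toy_threshold = -2.0

# The row whose score equals the threshold acts as the boundary row
boundary_row = toy[toy_scores == toy_threshold].iloc[0]

# Rows scoring below the threshold are treated as outliers
outlier_mask = toy_scores < toy_threshold

# Broadcast the boundary row's values over every outlier row
toy.loc[outlier_mask, :] = boundary_row.to_numpy()

print(toy)
```

After the assignment, the two outlier rows carry the boundary row's values and the frame keeps all five rows.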
