# Data Quality: Outlier Detection and Missing Value Imputation

Real-world data is messy. Before analysis, we must handle two common issues:

1. **Outliers**: Extreme values that don't fit the pattern (measurement errors, rare events, or data entry mistakes)
2. **Missing values**: Gaps in the data (sensor failures, survey non-response, or data collection issues)

Ignoring these issues can lead to:
- Biased statistical estimates
- Poor model performance
- Misleading conclusions

`tangent/ds` provides multiple algorithms for both univariate and multivariate outlier detection, as well as simple and advanced imputation methods.

In [None]:
// Setup DOM for plotting in Jupyter with Deno
import { Window } from 'https://esm.sh/happy-dom@12.10.3';
const window = new Window();
globalThis.document = window.document;
globalThis.HTMLElement = window.HTMLElement;

// import packages
import * as ds from '../../src/index.js';
import * as Plot from '@observablehq/plot';

// Generate synthetic dataset with outliers and missing values
const seed = 42;
const rng = ds.utils.createRNG(seed);

function generateData(n = 200) {
  const data = [];
  for (let i = 0; i < n; i++) {
    const x1 = rng.normal(50, 10);
    const x2 = rng.normal(100, 20);
    const x3 = 2 * x1 + 0.5 * x2 + rng.normal(0, 5);
    
    data.push({ x1, x2, x3 });
  }
  
  // Add 5 outliers
  for (let i = 0; i < 5; i++) {
    data[i] = {
      x1: rng.uniform(100, 150),
      x2: rng.uniform(200, 300),
      x3: rng.uniform(300, 500)
    };
  }
  
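  // Add missing values: each feature is independently set to NaN with probability 0.05
  // (about 5% of all values, so roughly 14% of rows have at least one gap)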
  for (let i = 0; i < data.length; i++) {
    if (rng.random() < 0.05) data[i].x1 = NaN;
    if (rng.random() < 0.05) data[i].x2 = NaN;
    if (rng.random() < 0.05) data[i].x3 = NaN;
  }
  
  return data;
}

const dirtyData = generateData(200);
console.log('Generated', dirtyData.length, 'samples');
console.table(dirtyData.slice(0, 10));

## Part 1: Outlier Detection

We'll explore three methods:

1. **Isolation Forest**: Tree-based anomaly detection (good for high-dimensional data)
2. **Local Outlier Factor (LOF)**: Density-based detection (finds local anomalies)
3. **Mahalanobis Distance**: Statistical distance accounting for correlations

### Isolation Forest

**How it works:**
- Builds random trees by randomly selecting features and split points
- Outliers are easier to isolate (require fewer splits)
- Measures "path length" from root to leaf
- Shorter paths → anomalies
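
The path-length idea above can be sketched in one dimension. This is a toy illustration only, not how `tangent/ds` implements `IsolationForest`; the helpers `makeLcg`, `isolationDepth`, and `meanDepth` are hypothetical names introduced for this sketch.

```javascript
// Small deterministic pseudo-random generator (LCG) so the sketch is repeatable.
function makeLcg(seed) {
  let state = seed >>> 0;
  return () => {
    state = (Math.imul(state, 1664525) + 1013904223) >>> 0;
    return state / 4294967296;
  };
}

// Count the random splits needed until `target` is alone in its partition.
function isolationDepth(values, target, rand) {
  let pool = values.slice();
  let depth = 0;
  while (pool.length > 1) {
    const lo = Math.min(...pool);
    const hi = Math.max(...pool);
    if (lo === hi) break; // identical values cannot be separated
    const split = lo + rand() * (hi - lo);
    // Keep only the side of the split that contains the target.
    pool = pool.filter(v => (target < split ? v < split : v >= split));
    depth++;
  }
  return depth;
}

// Average the depth over many random "trees".
function meanDepth(values, target, trials, rand) {
  let total = 0;
  for (let t = 0; t < trials; t++) total += isolationDepth(values, target, rand);
  return total / trials;
}

const sample = [48, 49, 50, 51, 52, 53, 120];
const rand = makeLcg(42);
const inlierDepth = meanDepth(sample, 50, 500, rand);
const outlierDepth = meanDepth(sample, 120, 500, rand);
console.log({ inlierDepth, outlierDepth });
```

The value 120 sits far from the rest, so most random splits cut it off immediately and its average depth should come out well below that of 50, which needs at least two splits to be separated from its neighbors on both sides.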

**Use when:**
- You have many features
- Outliers are global (far from all normal points)
- You want fast, scalable detection

In [None]:
// Prepare data matrix (remove rows with NaN for outlier detection)
const completeData = dirtyData.filter(row => 
  !isNaN(row.x1) && !isNaN(row.x2) && !isNaN(row.x3)
);

const X = completeData.map(row => [row.x1, row.x2, row.x3]);

// Isolation Forest
const isoForest = new ds.ml.IsolationForest({
  contamination: 0.05,  // expect 5% outliers
  nEstimators: 100,
  maxSamples: 256
});

isoForest.fit(X);
const isoLabels = isoForest.predict(X);
const isoOutliers = isoLabels.filter(label => label === -1).length;

console.log('Isolation Forest detected', isoOutliers, 'outliers out of', X.length, 'samples');
console.log('Outlier indices:', isoLabels.map((l, i) => l === -1 ? i : null).filter(i => i !== null).slice(0, 10));

### Local Outlier Factor (LOF)

**How it works:**
- Compares local density of a point to its neighbors
- If a point is in a low-density region compared to neighbors → outlier
- Uses k-nearest neighbors (default k=20)
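
To make the density comparison concrete, here is a minimal sketch of the LOF score computed by brute force. It assumes no duplicate points and ignores ties at the k-th distance; `euclid` and `lofScores` are hypothetical helpers, not part of `tangent/ds`.

```javascript
const euclid = (a, b) => Math.hypot(...a.map((v, i) => v - b[i]));

function lofScores(points, k) {
  // For every point: its k nearest neighbors (index and distance), nearest first.
  const neighbors = points.map((p, i) =>
    points
      .map((q, j) => ({ j, d: euclid(p, q) }))
      .filter(n => n.j !== i)
      .sort((a, b) => a.d - b.d)
      .slice(0, k)
  );
  // k-distance: distance to the k-th nearest neighbor.
  const kDist = neighbors.map(ns => ns[ns.length - 1].d);
  // Local reachability density: inverse of the mean reachability distance.
  const lrd = neighbors.map(ns => {
    const meanReach = ns.reduce((s, n) => s + Math.max(kDist[n.j], n.d), 0) / ns.length;
    return 1 / meanReach;
  });
  // LOF: mean density of the neighbors divided by the point's own density.
  return neighbors.map((ns, i) =>
    ns.reduce((s, n) => s + lrd[n.j], 0) / ns.length / lrd[i]
  );
}

// A tight cluster of five points plus one point far away.
const cluster = [[0, 0], [0, 1], [1, 0], [1, 1], [0.5, 0.5], [5, 5]];
const scores = lofScores(cluster, 3);
console.log(scores.map(s => s.toFixed(2)));
```

Scores near 1 mean a point is about as dense as its neighbors; the isolated point at `[5, 5]` should score several times higher than the cluster members.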

**Use when:**
- Data has clusters of varying density
- You want to find local anomalies (points unusual relative to their neighborhood)
- You have moderate-dimensional data

In [None]:
// Local Outlier Factor
const lof = new ds.ml.LocalOutlierFactor({
  contamination: 0.05,
  n_neighbors: 20
});

lof.fit(X);
const lofLabels = lof.predict(X);
const lofOutliers = lofLabels.filter(label => label === -1).length;

console.log('LOF detected', lofOutliers, 'outliers out of', X.length, 'samples');
console.log('Outlier indices:', lofLabels.map((l, i) => l === -1 ? i : null).filter(i => i !== null).slice(0, 10));

### Mahalanobis Distance

**How it works:**
- Statistical distance that accounts for correlations between features
- Unlike Euclidean distance, it considers the covariance structure
- Uses pseudoinverse for robustness to near-singular covariance matrices
- Can use chi-squared distribution to set threshold
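
A small two-feature sketch shows why the covariance structure matters. It writes out the 2x2 inverse by hand instead of using a pseudoinverse, so it assumes a non-singular covariance matrix; `mahalanobisSquared2d` is a hypothetical helper. The cutoff 5.991 is the 95% chi-squared quantile for 2 degrees of freedom (for the three features in this notebook it would be 7.815).

```javascript
function mahalanobisSquared2d(points, query) {
  const n = points.length;
  const mx = points.reduce((s, p) => s + p[0], 0) / n;
  const my = points.reduce((s, p) => s + p[1], 0) / n;
  // Sample covariance entries (divide by n - 1).
  let sxx = 0, syy = 0, sxy = 0;
  for (const [x, y] of points) {
    sxx += (x - mx) ** 2;
    syy += (y - my) ** 2;
    sxy += (x - mx) * (y - my);
  }
  sxx /= n - 1; syy /= n - 1; sxy /= n - 1;
  const det = sxx * syy - sxy * sxy;
  const dx = query[0] - mx;
  const dy = query[1] - my;
  // d^2 = [dx dy] * inverse(S) * [dx dy]^T with the 2x2 inverse written out.
  return (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det;
}

// Strongly correlated data: y is roughly 2 * x. The mean is (3.5, 7).
const trend = [[1, 2.1], [2, 3.9], [3, 6.2], [4, 7.8], [5, 10.1], [6, 11.9]];
const CHI2_95_DF2 = 5.991;

// Both queries are the same Euclidean distance from the mean.
const onTrend = mahalanobisSquared2d(trend, [4.5, 9]);   // follows the trend
const offTrend = mahalanobisSquared2d(trend, [5.5, 6]);  // breaks the trend
console.log({ onTrend, offTrend, flagged: offTrend > CHI2_95_DF2 });
```

Euclidean distance cannot tell the two queries apart, but the point that breaks the correlation should land far above the chi-squared cutoff while the on-trend point stays below it.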

**Use when:**
- Features are correlated
- You want statistically-principled outlier detection
- Data is approximately multivariate normal

In [None]:
// Mahalanobis Distance
const mahal = new ds.ml.MahalanobisDistance({
  contamination: 0.05,
  use_chi2: true  // use chi-squared distribution for threshold
});

mahal.fit(X);
const mahalLabels = mahal.predict(X);
const mahalOutliers = mahalLabels.filter(label => label === -1).length;

console.log('Mahalanobis Distance detected', mahalOutliers, 'outliers out of', X.length, 'samples');
console.log('Outlier indices:', mahalLabels.map((l, i) => l === -1 ? i : null).filter(i => i !== null).slice(0, 10));

// Get anomaly scores (higher = more anomalous)
const mahalScores = mahal.score_samples(X);
console.log('\nTop 5 most anomalous scores:', mahalScores.slice().sort((a, b) => b - a).slice(0, 5).map(s => s.toFixed(2)));

### Comparing Methods

Different methods may identify different outliers:
- **Isolation Forest**: Best for global outliers in high dimensions
- **LOF**: Best for local outliers in clusters
- **Mahalanobis**: Best when you understand the statistical distribution

**Tip:** Use multiple methods and look for consensus!

In [None]:
// Find consensus outliers (detected by at least 2 methods)
const consensusOutliers = [];
for (let i = 0; i < X.length; i++) {
  const votes = [
    isoLabels[i] === -1 ? 1 : 0,
    lofLabels[i] === -1 ? 1 : 0,
    mahalLabels[i] === -1 ? 1 : 0
  ].reduce((a, b) => a + b, 0);
  
  if (votes >= 2) consensusOutliers.push(i);
}

console.log('Consensus outliers (2+ methods agree):', consensusOutliers);
console.log('Total consensus outliers:', consensusOutliers.length);

## Part 2: Missing Value Imputation

Once outliers have been handled (either removed or accommodated with robust methods), we tackle missing values. We'll explore:

1. **Simple Imputer**: Fill with mean/median/mode (fast, univariate)
2. **KNN Imputer**: Use k-nearest neighbors (captures local patterns)
3. **Iterative Imputer (MICE)**: Multivariate imputation by chained equations (most sophisticated)

### Simple Imputation

**Strategy options:**
- `mean`: Good for normally distributed data
- `median`: Robust to outliers
- `most_frequent`: For categorical data
- `constant`: Fill with a fixed value
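
The difference between `mean` and `median` is easiest to see on a column with one extreme value. The sketch below is a plain JavaScript illustration, not the `SimpleImputer` source; `imputeColumns` is a hypothetical helper.

```javascript
function imputeColumns(rows, strategy = 'mean') {
  const nCols = rows[0].length;
  const fill = [];
  for (let c = 0; c < nCols; c++) {
    // Observed (non-NaN) values of this column, sorted for the median.
    const observed = rows
      .map(r => r[c])
      .filter(v => !Number.isNaN(v))
      .sort((a, b) => a - b);
    if (strategy === 'median') {
      const mid = Math.floor(observed.length / 2);
      fill.push(
        observed.length % 2 ? observed[mid] : (observed[mid - 1] + observed[mid]) / 2
      );
    } else {
      fill.push(observed.reduce((s, v) => s + v, 0) / observed.length);
    }
  }
  // Replace each NaN with the fill value of its column.
  return rows.map(r => r.map((v, c) => (Number.isNaN(v) ? fill[c] : v)));
}

// Column 0 contains an extreme value (100); column 1 does not.
const toy = [[1, 10], [2, NaN], [3, 30], [NaN, 40], [100, 20]];
const byMean = imputeColumns(toy, 'mean');
const byMedian = imputeColumns(toy, 'median');
console.log('mean fill:  ', byMean[3][0]);   // 26.5
console.log('median fill:', byMedian[3][0]); // 2.5
```

The single value 100 drags the mean fill to 26.5, while the median fill stays at 2.5, close to the bulk of the column.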

**Pros:** Fast, simple, deterministic

**Cons:** Ignores relationships between features, reduces variance

In [None]:
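// Remove consensus outliers first.
// consensusOutliers holds indices into completeData (rows without NaN), not dirtyData,
// so match on the row objects themselves rather than on position.
const outlierRows = new Set(consensusOutliers.map(i => completeData[i]));
const cleanData = dirtyData.filter(row => !outlierRows.has(row));
const X_with_nan = cleanData.map(row => [row.x1, row.x2, row.x3]);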

// Count missing values
const missingCount = X_with_nan.flat().filter(v => isNaN(v)).length;
const totalValues = X_with_nan.length * 3;
console.log(`Missing values: ${missingCount} / ${totalValues} (${(100 * missingCount / totalValues).toFixed(1)}%)`);

// Simple imputation with mean
const simpleImputer = new ds.ml.SimpleImputer({ strategy: 'mean' });
simpleImputer.fit(X_with_nan);
const X_simple = simpleImputer.transform(X_with_nan);

console.log('\nSimple imputation (mean) complete');
console.log('Imputed means:', simpleImputer.statistics_.map(s => s.toFixed(2)));

### KNN Imputation

**How it works:**
- For each missing value, find k nearest neighbors (using complete features)
- Impute as weighted average of neighbor values
- Captures local patterns in data
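
A minimal sketch of the idea, under two assumptions that may differ from the library: distances are measured only on features observed in both rows, and neighbors are weighted by inverse distance. `partialDistance` and `knnImpute` are hypothetical helpers.

```javascript
// Root-mean-square difference over the features observed in both rows.
function partialDistance(a, b) {
  let sum = 0;
  let shared = 0;
  for (let c = 0; c < a.length; c++) {
    if (!Number.isNaN(a[c]) && !Number.isNaN(b[c])) {
      sum += (a[c] - b[c]) ** 2;
      shared++;
    }
  }
  return shared === 0 ? Infinity : Math.sqrt(sum / shared);
}

function knnImpute(rows, k) {
  return rows.map((row, i) =>
    row.map((value, c) => {
      if (!Number.isNaN(value)) return value;
      // Donors: other rows where column c is observed, nearest first.
      const donors = rows
        .filter((other, j) => j !== i && !Number.isNaN(other[c]))
        .map(other => ({ value: other[c], d: partialDistance(row, other) }))
        .filter(n => Number.isFinite(n.d))
        .sort((a, b) => a.d - b.d)
        .slice(0, k);
      // Inverse-distance weights; the small epsilon avoids division by zero.
      const weights = donors.map(n => 1 / (n.d + 1e-9));
      const total = weights.reduce((s, w) => s + w, 0);
      // With no usable donors this yields NaN, leaving the gap in place.
      return donors.reduce((s, n, idx) => s + n.value * weights[idx], 0) / total;
    })
  );
}

// Two groups of rows; the gap belongs to the low-valued group.
const grouped = [[1, 10], [2, 20], [3, NaN], [10, 100], [11, 110]];
const knnFilled = knnImpute(grouped, 2);
console.log('KNN fill:', knnFilled[2][1].toFixed(2));
```

The two nearest donors are the rows with `x = 2` and `x = 1`, so the fill lands between 10 and 20. A column mean would have returned 60, a value that matches neither group.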

**Pros:** Considers relationships between samples

**Cons:** Slower than simple imputation, sensitive to k choice

In [None]:
// KNN imputation
const knnImputer = new ds.ml.KNNImputer({ 
  n_neighbors: 5,
  weights: 'distance'  // closer neighbors have more influence
});

knnImputer.fit(X_with_nan);
const X_knn = knnImputer.transform(X_with_nan);

console.log('KNN imputation complete');
console.log('Sample imputed values (first row):', X_knn[0].map(v => v.toFixed(2)));

### Iterative Imputation (MICE)

**MICE = Multivariate Imputation by Chained Equations**

**How it works:**
1. Initial imputation (e.g., with mean)
2. For each feature with missing values:
   - Use other features to predict it (via regression)
   - Update imputed values
3. Repeat until convergence or max iterations
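
The three steps can be sketched for two features with a one-predictor linear regression. This is a simplified, deterministic version: full MICE implementations usually add random draws and may produce several imputed datasets, and nothing here reflects the internals of `IterativeImputer`. `fitLine` and `chainedImpute` are hypothetical helpers.

```javascript
// Ordinary least squares with one predictor: returns [intercept, slope].
function fitLine(xs, ys) {
  const n = xs.length;
  const mx = xs.reduce((s, v) => s + v, 0) / n;
  const my = ys.reduce((s, v) => s + v, 0) / n;
  let sxy = 0;
  let sxx = 0;
  for (let i = 0; i < n; i++) {
    sxy += (xs[i] - mx) * (ys[i] - my);
    sxx += (xs[i] - mx) ** 2;
  }
  const slope = sxy / sxx;
  return [my - slope * mx, slope];
}

function chainedImpute(rows, maxIter = 10, tol = 1e-6) {
  const missing = rows.map(r => r.map(v => Number.isNaN(v)));
  // Step 1: initial fill with the column means of the observed values.
  const means = [0, 1].map(c => {
    const obs = rows.map(r => r[c]).filter(v => !Number.isNaN(v));
    return obs.reduce((s, v) => s + v, 0) / obs.length;
  });
  const filled = rows.map(r => r.map((v, c) => (Number.isNaN(v) ? means[c] : v)));

  for (let iter = 0; iter < maxIter; iter++) {
    let maxChange = 0;
    for (const target of [0, 1]) {
      const other = 1 - target;
      // Step 2: fit on rows where the target is observed...
      const train = filled.filter((_, i) => !missing[i][target]);
      const [b0, b1] = fitLine(train.map(r => r[other]), train.map(r => r[target]));
      // ...then overwrite the imputed entries with the predictions.
      filled.forEach((r, i) => {
        if (!missing[i][target]) return;
        const next = b0 + b1 * r[other];
        maxChange = Math.max(maxChange, Math.abs(next - r[target]));
        r[target] = next;
      });
    }
    // Step 3: stop once the imputed values no longer move.
    if (maxChange < tol) break;
  }
  return filled;
}

// The observed pairs follow y = 2 * x exactly; one x and one y are missing.
const linear = [[1, 2], [2, 4], [3, NaN], [4, 8], [NaN, 10], [6, 12]];
const chained = chainedImpute(linear);
console.log(chained[2], chained[4]);
```

The mean-filled starting values are off the line, but each pass pulls the imputed entries closer to it, so the gaps should settle near `y = 6` and `x = 5`.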

**Pros:** 
- Most sophisticated method
- Captures complex relationships between features
- Uses pseudoinverse for robust regression

**Cons:** Slower, may not converge if relationships are weak

In [None]:
// Iterative imputation (MICE)
const iterImputer = new ds.ml.IterativeImputer({
  initial_strategy: 'mean',
  max_iter: 10,
  tol: 1e-3,
  verbose: true
});

iterImputer.fit(X_with_nan);
const X_iter = iterImputer.transform(X_with_nan);

console.log('\nIterative imputation (MICE) complete');
console.log('Sample imputed values (first row):', X_iter[0].map(v => v.toFixed(2)));

### Comparing Imputation Methods

Let's compare the imputed values to see how methods differ:

In [None]:
// Find a row with missing values to compare
const missingRowIdx = X_with_nan.findIndex(row => row.some(v => isNaN(v)));

if (missingRowIdx !== -1) {
  console.log('Original row with NaN:', X_with_nan[missingRowIdx]);
  console.log('Simple imputation:   ', X_simple[missingRowIdx].map(v => v.toFixed(2)));
  console.log('KNN imputation:      ', X_knn[missingRowIdx].map(v => v.toFixed(2)));
  console.log('Iterative imputation:', X_iter[missingRowIdx].map(v => v.toFixed(2)));
} else {
  console.log('No missing values in this subset');
}

## Best Practices

### Outlier Detection:
1. **Visualize** your data first (scatter plots, box plots)
2. **Domain knowledge** is crucial (is it a true outlier or a rare event?)
3. **Don't blindly remove** outliers - investigate them!
4. Consider **robust methods** instead of removal (e.g., robust regression)
5. **Document** which points were flagged and why

### Missing Value Imputation:
1. **Understand why** data is missing (random? systematic?)
2. **Simple methods** (mean/median) are often adequate when only a small fraction of values is missing
3. **KNN** when you have local structure
4. **MICE** when features are correlated
5. **Set bounds** if you know valid ranges (e.g., age > 0)
6. **Sensitivity analysis**: try multiple methods and check if conclusions change
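
For the bounds point above, one option is to clamp only the imputed entries after the fact. The sketch assumes you still have the original matrix with its NaN markers; `clampImputed` and the bounds shown are hypothetical, chosen for illustration.

```javascript
// Clamp imputed entries to [lo, hi] per column; observed values are left untouched.
function clampImputed(original, imputed, bounds) {
  return imputed.map((row, i) =>
    row.map((v, c) => {
      if (!Number.isNaN(original[i][c])) return v;
      const [lo, hi] = bounds[c];
      return Math.min(hi, Math.max(lo, v));
    })
  );
}

const before = [[NaN, 5], [30, NaN]];
const after = [[-3, 5], [30, 120]];            // output of some imputer
const limits = [[0, Infinity], [0, 100]];      // e.g. age >= 0, percentage in 0..100
const bounded = clampImputed(before, after, limits);
console.log(bounded); // [[0, 5], [30, 100]]
```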

### Combined Workflow:
```javascript
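// 1. Detect outliers on the complete rows
//    (X is the full data matrix and may contain NaN; as in Part 1, detection uses rows without gaps)
const X_complete = X.filter(row => row.every(v => !Number.isNaN(v)));
const outlierDetector = new ds.ml.IsolationForest({ contamination: 0.05 });
outlierDetector.fit(X_complete);
const outlierLabels = outlierDetector.predict(X_complete);

// 2. Remove or flag outliers, keeping incomplete rows so they can be imputed
const flaggedRows = new Set(X_complete.filter((_, i) => outlierLabels[i] === -1));
const X_clean = X.filter(row => !flaggedRows.has(row));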

// 3. Impute missing values
const imputer = new ds.ml.IterativeImputer();
imputer.fit(X_clean);
const X_imputed = imputer.transform(X_clean);

// 4. Proceed with analysis
```

## Summary

We covered:

**Outlier Detection:**
- ✅ Isolation Forest (tree-based, global outliers)
- ✅ Local Outlier Factor (density-based, local outliers)
- ✅ Mahalanobis Distance (statistical, accounts for correlations)

**Missing Value Imputation:**
- ✅ Simple Imputer (mean/median/mode)
- ✅ KNN Imputer (local neighborhood)
- ✅ Iterative Imputer (MICE, multivariate)

Clean data is the foundation of good analysis. Invest time in understanding and handling outliers and missing values before diving into modeling!