# Module 1: Introduction to Scikit-Learn

## Part 7: Robust Covariance Estimation

Robust Covariance Estimation is a statistical technique used in data analysis and machine learning to compute covariance matrices that are less sensitive to outliers and deviations from normality. The traditional sample covariance matrix can be strongly influenced by outliers and may not accurately represent the true underlying structure of the data. Robust covariance estimation methods aim to provide more reliable and stable covariance estimates in the presence of such anomalies.

### 7.1 Understanding Covariance

Covariance is a statistical measure that quantifies the degree to which two random variables change together. In other words, it assesses how changes in one variable correspond to changes in another. Covariance can provide insights into the direction of the relationship between two variables:
- Positive Covariance: When two variables tend to increase or decrease together, their covariance is positive. This suggests a positive linear relationship.
- Negative Covariance: When one variable tends to increase as the other decreases, their covariance is negative. This indicates a negative linear relationship.
- Zero Covariance: When changes in one variable do not correspond to changes in the other, their covariance is close to zero. This suggests little to no linear relationship.

However, the magnitude of covariance can be challenging to interpret as it depends on the scales of the variables. To address this, the correlation coefficient is often used, which normalizes covariance to a scale between -1 (perfect negative linear relationship) and 1 (perfect positive linear relationship). A correlation of 0 indicates no linear relationship.
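
The scale dependence described above can be checked numerically. The following is a minimal sketch on made-up data (the variables `x`, `y`, and the factor of 100 are illustrative assumptions, not part of the original example): rescaling one variable changes the covariance but leaves the correlation unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=500)
y = 2.0 * x + rng.normal(scale=0.5, size=500)  # y tends to rise with x

# Off-diagonal entry of the 2x2 matrix is the covariance / correlation of x and y
cov_xy = np.cov(x, y)[0, 1]
corr_xy = np.corrcoef(x, y)[0, 1]

# Express x in different units (for example metres -> centimetres)
x_scaled = 100.0 * x
cov_scaled = np.cov(x_scaled, y)[0, 1]
corr_scaled = np.corrcoef(x_scaled, y)[0, 1]

print(f"covariance:  {cov_xy:.3f} -> {cov_scaled:.3f}")
print(f"correlation: {corr_xy:.3f} -> {corr_scaled:.3f}")
```

The covariance grows by the same factor as the rescaling, while the correlation stays the same, which is why correlation is easier to compare across variables.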

### 7.2 Understanding Robust Covariance Estimation

The key idea behind Robust Covariance Estimation is to estimate the covariance matrix using robust statistical measures, such as the Minimum Covariance Determinant (MCD) estimator or the Orthogonalized Gnanadesikan-Kettenring (OGK) estimator. These methods downweight or exclude outlying observations, resulting in a more accurate estimate of the underlying covariance structure. Heavy-tailed or otherwise non-Gaussian data can also distort the sample covariance, which is a further reason to consider a robust estimator. In scikit-learn, the MCD estimator is available as `sklearn.covariance.MinCovDet`, and `EllipticEnvelope` builds on it.

Robust covariance estimators provide a balance between resisting the influence of outliers and preserving the genuine structure of the data. The choice of robust covariance estimator depends on the specific characteristics of the data and the problem at hand. Evaluating the performance of robust covariance estimators may involve comparing their results to traditional covariance estimates and assessing their impact on downstream analysis.
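
One way to compare a robust estimate with the traditional one is to use synthetic data whose true covariance is known. The sketch below assumes a single Gaussian cluster with a chosen `true_cov` plus a block of artificial outliers (all values are illustrative), and measures how far each estimate lands from the truth.

```python
import numpy as np
from sklearn.covariance import EmpiricalCovariance, MinCovDet

rng = np.random.default_rng(0)

# Inliers drawn from a known covariance (assumed values for illustration)
true_cov = np.array([[1.0, 0.6], [0.6, 1.0]])
X_clean = rng.multivariate_normal(mean=[0.0, 0.0], cov=true_cov, size=300)

# About 9% contamination placed far from the inliers
X_out = rng.uniform(low=8.0, high=12.0, size=(30, 2))
X_demo = np.vstack([X_clean, X_out])

emp = EmpiricalCovariance().fit(X_demo)
mcd = MinCovDet(random_state=0).fit(X_demo)

# Frobenius-norm distance between each estimate and the true covariance
err_emp = np.linalg.norm(emp.covariance_ - true_cov)
err_mcd = np.linalg.norm(mcd.covariance_ - true_cov)

print(f"Empirical covariance error: {err_emp:.3f}")
print(f"MCD covariance error:       {err_mcd:.3f}")
```

On data like this, the MCD estimate should stay much closer to `true_cov`, because the outlying block is left out of the subset it is computed from.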

### 7.3 Training and Evaluation

To apply Robust Covariance Estimation, we fit an estimator to a dataset of shape `(n_samples, n_features)`. The estimator computes a robust location (center) and a robust covariance matrix. The resulting covariance matrix represents the relationships between variables as seen through the observations the estimator treats as inliers.

Robust covariance estimators have several parameters that need to be set appropriately. For scikit-learn's `EllipticEnvelope`, one of the most important is `contamination`, the expected proportion of outliers in the data (default 0.1). It sets the threshold used to flag points as outliers, and it should be chosen from prior knowledge or estimated from the dataset. The related `support_fraction` parameter controls the proportion of points included in the raw MCD estimate.
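
To see what `contamination` does in practice, the sketch below fits `EllipticEnvelope` with a few different values on the same synthetic data (the data and the chosen values are assumptions for illustration). It records the fraction of training points flagged as outliers and keeps each fitted covariance matrix for comparison.

```python
import numpy as np
from sklearn.covariance import EllipticEnvelope

rng = np.random.default_rng(1)
X_demo = rng.normal(size=(500, 2))

flagged_fraction = {}
covariances = {}
for contamination in (0.01, 0.05, 0.20):
    env = EllipticEnvelope(contamination=contamination, random_state=0).fit(X_demo)
    # predict returns -1 for points flagged as outliers and 1 for inliers
    flagged_fraction[contamination] = float(np.mean(env.predict(X_demo) == -1))
    covariances[contamination] = env.covariance_
    print(f"contamination={contamination:.2f} -> "
          f"flagged {flagged_fraction[contamination]:.3f} of the training points")
```

The flagged fraction tracks `contamination` closely, while the fitted covariance matrix is the same for every value: the parameter moves the decision threshold, not the covariance estimate itself.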

Once the robust covariance matrix is estimated, we can use it for various purposes, such as anomaly detection, outlier detection, or dimensionality reduction.
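
As a sketch of the outlier detection use case, squared Mahalanobis distances computed from a robust fit can be compared against a chi-square quantile. The data, the three planted points, and the 97.5% cutoff are assumptions chosen for illustration; the cutoff is a common convention for roughly Gaussian inliers, not a fixed rule.

```python
import numpy as np
from scipy.stats import chi2
from sklearn.covariance import MinCovDet

rng = np.random.default_rng(2)
X_in = rng.normal(size=(200, 2))
X_far = np.array([[8.0, 8.0], [-9.0, 7.0], [10.0, -8.0]])  # planted outliers
X_demo = np.vstack([X_in, X_far])

mcd = MinCovDet(random_state=0).fit(X_demo)

# mahalanobis() returns squared distances to the robust location
d2 = mcd.mahalanobis(X_demo)

# For Gaussian inliers, squared distances roughly follow a chi-square
# distribution with n_features degrees of freedom
cutoff = chi2.ppf(0.975, df=X_demo.shape[1])
is_outlier = d2 > cutoff

print(f"cutoff = {cutoff:.2f}, flagged {is_outlier.sum()} of {len(X_demo)} points")
print("planted outliers flagged:", is_outlier[-3:])
```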

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.covariance import EllipticEnvelope, EmpiricalCovariance
from sklearn.datasets import make_blobs

# Generate three Gaussian clusters and append two far-away outliers
n_samples = 200
n_features = 2
X, _ = make_blobs(n_samples=n_samples, n_features=n_features, centers=3, random_state=42)
outliers = np.array([[20, 20], [25, 15]])
X = np.vstack((X, outliers))

# Fit the robust (MCD-based) estimator and the classical sample estimator
robust_cov = EllipticEnvelope(random_state=0)
robust_cov.fit(X)
sample_cov = EmpiricalCovariance()
sample_cov.fit(X)
robust_cov_matrix = robust_cov.covariance_
sample_cov_matrix = sample_cov.covariance_

# Plot the dataset, including the appended outliers
plt.figure(figsize=(6, 4))
plt.scatter(X[:, 0], X[:, 1], c='black', s=20, label='Original Data')
plt.title('Original Dataset')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.legend()
plt.grid(True)


def evaluate_covariance_matrix(cov_matrix, label):
    """Print the determinant, condition number, and eigenvalues of a covariance matrix."""
    determinant = np.linalg.det(cov_matrix)
    condition_number = np.linalg.cond(cov_matrix)
    # eigvalsh is intended for symmetric matrices and returns real eigenvalues
    eigenvalues = np.linalg.eigvalsh(cov_matrix)
    print(f"{label} Covariance Matrix:")
    print(f"Determinant: {determinant:.4f}")
    print(f"Condition Number: {condition_number:.4f}")
    print(f"Eigenvalues: {eigenvalues}")
    print("\n")


evaluate_covariance_matrix(robust_cov_matrix, "Robust")
evaluate_covariance_matrix(sample_cov_matrix, "Sample")
plt.show()
```

The code above compares two covariance estimation methods on a dataset containing outliers: Robust Covariance Estimation (using `EllipticEnvelope`) and Sample Covariance Estimation (using `EmpiricalCovariance`).

First, it generates synthetic data with three Gaussian clusters and appends two outliers at (20, 20) and (25, 15). It then fits both estimators and plots the dataset.

Finally, it evaluates the covariance matrix from each method by printing its determinant, condition number, and eigenvalues.

The determinant of a covariance matrix, sometimes called the generalized variance, is a scalar measure of the volume of the scatter the matrix describes. A smaller determinant means the estimate describes a tighter cloud of points. Outliers tend to inflate the sample covariance and therefore its determinant, whereas the MCD estimator is built around the subset of points whose covariance has the smallest determinant.

The condition number is the ratio of the largest to the smallest singular value of a matrix. A lower condition number indicates a well-conditioned matrix whose inverse is less sensitive to small perturbations. This matters here because quantities such as Mahalanobis distances rely on the inverse of the covariance matrix.

The eigenvalues of a covariance matrix are the variances of the data along its principal axes. Larger eigenvalues indicate greater variability in those directions, which is the basis of techniques like Principal Component Analysis (PCA).
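
The three metrics are closely related for a covariance matrix. As a small check on an assumed 2x2 matrix (the values are illustrative), the determinant equals the product of the eigenvalues and the condition number equals the ratio of the largest to the smallest eigenvalue:

```python
import numpy as np

# A symmetric positive-definite matrix standing in for a covariance matrix
cov = np.array([[4.0, 1.2],
                [1.2, 1.0]])

eigenvalues = np.linalg.eigvalsh(cov)  # real eigenvalues in ascending order

det_from_eigs = np.prod(eigenvalues)
cond_from_eigs = eigenvalues.max() / eigenvalues.min()

print(f"det:  {np.linalg.det(cov):.4f} vs product of eigenvalues: {det_from_eigs:.4f}")
print(f"cond: {np.linalg.cond(cov):.4f} vs eigenvalue ratio:       {cond_from_eigs:.4f}")
```

One consequence is that an eigenvalue close to zero pushes the determinant toward zero and the condition number toward very large values at the same time.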

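In this example, the robust covariance matrix is expected to have the smaller determinant: the MCD fit is based on the most concentrated subset of points, whereas the sample covariance is inflated by the two appended outliers and by the separation between the clusters. The condition numbers and eigenvalues should be read together, because for a covariance matrix the condition number is the ratio of the largest to the smallest eigenvalue. An eigenvalue that is small relative to the other means the fitted ellipse is narrow along that direction. It reflects the shape of the subset the estimator selected, not the presence of influential outliers. The eigenvalues of the sample covariance matrix describe the spread of the full dataset, outliers included, along its principal axes.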

This comparison underscores the importance of selecting an appropriate covariance estimation method based on the dataset's characteristics and the need for robustness to outliers.

### 7.4 Summary

Robust Covariance Estimation is a crucial technique for obtaining more reliable covariance matrices in the presence of outliers and deviations from normality. By downweighting or mitigating the effects of extreme data points, these estimators provide a more accurate representation of the underlying data structure. Robust covariance estimation finds applications in various fields, including finance, anomaly detection, and machine learning, where the quality of covariance estimates plays a critical role in decision-making and modeling.