# <u>Multivariate Outlier Detection in Geochemical Datasets</u>

This notebook is intended as an open-source resource for exploring, analyzing and comparing three different methods of outlier detection in geochemical datasets in the context of mineral exploration.  
<br />
<br />
The three primary outlier detection algorithms we will use are the following: 
- Isolation Forests (IF) (Liu et al., 2008)
- Local Outlier Factor (LOF) (Breunig et al., 2000)
- Angle Based Outlier Detection (ABOD) (Shahrestani & Sanislav, 2025)
<br />
<br />

This work is motivated by the findings in Antoine Caté's article on multivariate outlier detection for mineral exploration.

<br />
<br />
<u>References: </u>

*Breunig, M. M., Kriegel, H.-P., Ng, R. T., and Sander, J., 2000, LOF: Identifying Density-Based Local Outliers: ACM SIGMOD Record, v. 29, no. 2, p. 93-104.*

*Liu, F. T., Ting, K. M., and Zhou, Z.-H., 2008, Isolation Forest, 2008 Eighth IEEE International Conference on Data Mining, p. 413-422.*

*Maklin, C., 2022, Isolation Forest - Cory Maklin - Medium: Medium, https://medium.com/@corymaklin/isolation-forest-799fceacdda4.*

*Shahrestani, S., and Sanislav, I., 2025, Mapping geochemical anomalies using angle-based outlier detection approach: Journal of Geochemical Exploration, v. 269.*
<br />

---

In [None]:
import pandas as pd
import time
import matplotlib.pyplot as plt
import seaborn as sns

import helper_functions
import outlier_detection_functions

import importlib
importlib.reload(helper_functions)
importlib.reload(outlier_detection_functions)

*For the testing of these algorithms we are going to use geochemical data from a region in southwestern Saudi Arabia. All units are converted to ppm for consistency.*

In [None]:
df = pd.read_parquet('data_files/KSA_raw_data.parquet')

In [None]:
# List of all geochemical columns to be used in the analysis - omits sample number and location columns
feature_columns = [
    col for col in df.columns if col not in ["Sample Field Number", "Longitude", "Latitude"]
]

print(f"Selected feature columns: {feature_columns}")

### <u>Simple EDA on the data</u>

To start off, let's do a brief investigation into broad trends or relationships in the data to get a sense of what we are working with. 

In [None]:
# Generate pairplot of select elements
elements_to_plot = ["TFe2O3", "Co", "V", "Rb"]  # Choose key elements - ideally those related to mineralization or other interesting trends
helper_functions.generate_pairplot(df, elements_to_plot, height=2)

In [None]:
# Generate correlation heatmap
helper_functions.plot_correlation_heatmap(df, feature_columns, figsize=(10,8), annot=False) # Increase figsize to view all element labels

In [None]:
# PCA
pc1_scores, top_features1, top_features2 = helper_functions.generate_pca(df, feature_columns)
print("Top 5 Contributing Features to PC1:", top_features1)
print("Top 5 Contributing Features to PC2:", top_features2)

---

## <u>Outlier Detection Methods</u>

Outliers in the binary plots below are classified using the Modified Z-Score method based on the Median Absolute Deviation (MAD). Data points with a modified Z-score greater than 3.5 are labeled as outliers (-1). This is done in place of the standard contamination parameter, as it is hard to manually estimate the proportion of outliers in the dataset.
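As a rough illustration of the thresholding described above, the sketch below computes MAD-based modified Z-scores with NumPy. The function names, the 0.6745 scaling constant (the conventional choice for this method), and the use of the absolute value are assumptions made for illustration; the notebook's own helper may differ, for example by thresholding only one tail of the score distribution.

```python
import numpy as np

def modified_z_scores(values):
    """Return the MAD-based modified Z-score for each value (hypothetical helper)."""
    values = np.asarray(values, dtype=float)
    median = np.median(values)
    mad = np.median(np.abs(values - median))
    if mad == 0:
        # More than half the values equal the median; nothing can be ranked as extreme.
        return np.zeros_like(values)
    # 0.6745 rescales the MAD so it is comparable to a standard deviation for normal data.
    return 0.6745 * (values - median) / mad

def label_outliers(scores, threshold=3.5):
    """Label each score as -1 (outlier) or 1 (inlier) using the absolute modified Z-score."""
    z = modified_z_scores(scores)
    return np.where(np.abs(z) > threshold, -1, 1)

# Toy anomaly scores: six similar values and one extreme value
scores = np.array([0.10, 0.12, 0.11, 0.09, 0.13, 0.10, 0.95])
labels = label_outliers(scores)
print(labels)
```

Because the median and MAD are barely influenced by the extreme values themselves, this threshold stays stable even when the proportion of outliers is unknown.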

### <u>Isolation Forest</u>

Isolation forest is an unsupervised machine learning method of outlier/anomaly detection. It is an ensemble method that combines the predictions of several randomized decision trees to assign an anomaly score to a given data point. Samples that require fewer splits to isolate, averaged across all trees, are given a lower anomaly score (higher likelihood of being anomalous). This method of outlier detection makes no assumption about the underlying data distribution, but does require some parameter-tuning.
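To make the scoring idea concrete, here is a minimal sketch using scikit-learn's `IsolationForest` on synthetic data. It is not the implementation inside `outlier_detection_functions.isolation_forest`, whose internals are not shown here; the data, parameter values, and variable names are assumptions for illustration only.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)

# Synthetic stand-in for geochemical features: 200 background samples
# plus 5 samples placed far from the background population.
background = rng.normal(loc=0.0, scale=1.0, size=(200, 4))
anomalies = rng.normal(loc=8.0, scale=0.5, size=(5, 4))
X = np.vstack([background, anomalies])

model = IsolationForest(n_estimators=100, random_state=0)
model.fit(X)

# score_samples: a lower value means a shorter average path length,
# i.e. the sample was easier to isolate and is more anomalous.
if_scores = model.score_samples(X)

print("mean background score:", if_scores[:200].mean())
print("mean anomaly score:   ", if_scores[200:].mean())
```

The far-away samples need fewer random splits to be isolated, so their scores sit below those of the background population.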

In [None]:
IF_df = outlier_detection_functions.isolation_forest(
    df, feature_columns
)

In [None]:
# Plotting IF results
helper_functions.plot_outlier_results(
    data=IF_df,
    x_col="Longitude",
    y_col="Latitude",
    score_col="anomaly_score",
    binary_col="outlier",
    point_size=50,
    plot_title="IF Outlier Detection Results",
    cmap='viridis_r'
)


---

### <u>Local Outlier Factor</u>

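LOF is another unsupervised outlier detection method that uses a density-based approach, comparing the local density of each data point with that of its nearest neighbors. Isolated samples or those on the margins of a neighborhood cluster will have a lower density than their neighbors. Under the scoring convention used here, samples with a lower score are considered outliers (in the original formulation of Breunig et al., 2000, a LOF well above 1 marks an outlier; implementations such as scikit-learn report the negated value, so lower means more anomalous). Similar to IF, LOF makes no assumption about the data distribution but does require some parameter-tuning.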
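The density comparison can be sketched with scikit-learn's `LocalOutlierFactor`. This is an illustrative example on synthetic data, not the notebook's `local_outlier_factor` helper; the data and the `n_neighbors` value are assumptions.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(1)

# One dense cluster of 100 points plus a single isolated point
dense = rng.normal(0.0, 0.3, size=(100, 2))
isolated = np.array([[4.0, 4.0]])
X = np.vstack([dense, isolated])

lof = LocalOutlierFactor(n_neighbors=20)
lof.fit(X)

# scikit-learn stores the NEGATED factor: values near -1 are inliers,
# strongly negative values are outliers (lower = more anomalous).
lof_scores = lof.negative_outlier_factor_
most_anomalous = int(np.argmin(lof_scores))

print("index of most anomalous sample:", most_anomalous)
```

Because the isolated point sits in a region far sparser than the neighborhoods of its nearest neighbors, it receives the most negative score.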

In [None]:
LOF_df = outlier_detection_functions.local_outlier_factor(
    df, feature_columns, n_neighbors=50, scale_data=False
)

In [None]:
# Plotting LOF results
helper_functions.plot_outlier_results(
    data=LOF_df,
    x_col="Longitude",
    y_col="Latitude",
    score_col="anomaly_score",
    binary_col="outlier",
    point_size=50,
    plot_title="LOF Outlier Detection Results",
    cmap="viridis_r",
)


---

### <u>Angle Based Outlier Detection</u>

Our final method of outlier detection is angle-based outlier detection (ABOD). This method examines the angles between the difference vectors from a sample point to pairs of other points; a point with a wide spread (high variance) of angles is considered an inlier (within a cluster), while a point with a narrow spread of angles is likely an outlier (outside a cluster, so that most other points lie in a similar direction). An angle-based score is then calculated, with less variation indicating a higher probability of the sample point being an outlier. One benefit of the full ABOD is that it is free of any parameters, and thus does not have the potential prediction variability resulting from tuning. Its implementation does not, however, output a binary classification, so an arbitrary threshold must be defined in order to generate one.
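The angle-variance idea can be sketched in a few lines of NumPy. This is a simplified, unweighted form of the angle-based outlier factor written for illustration; it is an assumption about the general technique, not the code behind `outlier_detection_functions.abod`, and it assumes the data contain no duplicate points (which would cause division by zero).

```python
import numpy as np

def abod_scores(X):
    """Naive all-pairs ABOD sketch: variance of distance-weighted angles per point.

    Lower variance = more likely an outlier. Assumes no duplicate rows in X.
    """
    X = np.asarray(X, dtype=float)
    n = len(X)
    upper = np.triu_indices(n - 1, k=1)  # each unordered pair (B, C) counted once
    scores = np.empty(n)
    for i in range(n):
        diffs = np.delete(X, i, axis=0) - X[i]           # vectors from point i to all others
        sq_norms = np.einsum("ij,ij->i", diffs, diffs)   # squared lengths of those vectors
        # <AB, AC> / (|AB|^2 * |AC|^2) for every pair of other points
        weighted = (diffs @ diffs.T) / np.outer(sq_norms, sq_norms)
        scores[i] = np.var(weighted[upper])
    return scores

rng = np.random.default_rng(2)
cluster = rng.normal(0.0, 1.0, size=(40, 2))
X = np.vstack([cluster, [[10.0, 10.0]]])  # last row is a clear outlier

scores = abod_scores(X)
print("index with lowest angle variance:", int(np.argmin(scores)))
```

Each point is compared against every pair of other points, which is why the full calculation becomes slow as the number of samples grows.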

***Note: running the full ABOD function, with use_knn=False, takes quite a long time to process, especially for larger datasets (n>1000). It is recommended to use the kNN approximation (use_knn=True) if running on a large dataset, but know that the model accuracy may be impacted.*** 

In [None]:
ABOD_df = outlier_detection_functions.abod(df, feature_columns, scale_data=True, use_knn=False, k_neighbors=50)

In [None]:
# Plotting ABOD results
helper_functions.plot_outlier_results(
    data=ABOD_df,
    x_col="Longitude",
    y_col="Latitude",
    score_col="anomaly_score",
    binary_col="outlier",
    point_size=50,
    plot_title="ABOD Outlier Detection Results",
    cmap="viridis_r",
)

---

## <u>Validation of predictions</u>

Multivariate anomaly detection picks up on trends across multiple elements, rather than just single-element variation. As opposed to univariate anomalies that may be attributed to noise, sampling error, or highly-localized trends, multivariate analysis may point towards regions of broader geologic alteration related to mineral deposits. To test this relationship, we will compare outlier predictions from each model against known mineral occurrences in the sampling region. 

In [None]:
validation_df = pd.read_parquet('data_files/KSA_validation.parquet')

### *Spatial validation of outlier predictions*

To start, we will conduct a visual analysis on the data, comparing binary outlier classifications vs. known mineral occurrences in the region.

In [None]:
# Plot outlier detection results and validation dataset

outlier_results = [IF_df, LOF_df, ABOD_df]
outlier_result_names = ['IF', 'LOF', 'ABOD']

helper_functions.plot_validation(outlier_datasets=outlier_results, outlier_dataset_names=outlier_result_names, validation_df=validation_df, point_size=10, colormap='viridis')

---

### *ROC-AUC, ANOVA F-statistic, and Mutual Information scoring of outlier predictions*

To quantitatively compare the different outlier prediction methods, we will use three different scoring methods: ROC-AUC, ANOVA F-statistic, and Mutual Information.
<br />
<br />
- *<u>ROC-AUC, or Receiver Operating Characteristic Area Under the Curve</u>* is a machine learning metric used to evaluate a model's ability to distinguish between positive and negative classes; a score of 1 means the model ranks every positive above every negative, while a score of 0.5 is no better than random guessing. 
    - Generally best for evaluating overall predictive performance, regardless of spatial location.
<br />
<br />
- The *<u>ANOVA F-statistic</u>* compares the variance between groups to the variance within groups (here, the model's predictions grouped according to the validation set); the higher the score, the more clearly the model's predictions separate the groups. That is, there is a meaningful pattern between predicted outliers and known mineral deposits.
    - Measures how well the model distinguishes spatially relevant anomalies.
<br />
<br />
- *<u>Mutual Information</u>* is a method of measuring how much information one variable provides about another, or how dependent they are on each other. Higher MI values indicate stronger relationships between variables. 
    - Quantifies dependency between outlier predictions and proximity to known deposits.
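The three metrics above can be sketched with scikit-learn on synthetic data. Everything here is an assumption for illustration: the coordinates, deposit locations, radius, and the rule that a sample counts as a positive when it lies within the radius of a known occurrence (plain Euclidean distance on the coordinates). The notebook's `helper_functions` may label and score samples differently.

```python
import numpy as np
from scipy.spatial import cKDTree
from sklearn.feature_selection import f_classif, mutual_info_classif
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(42)

# Hypothetical stand-ins for sample locations and known occurrences
sample_xy = rng.uniform(0.0, 1.0, size=(500, 2))
deposit_xy = np.array([[0.25, 0.25], [0.75, 0.60]])
radius = 0.1

# Label a sample 1 if it lies within `radius` of any known occurrence
tree = cKDTree(deposit_xy)
nearest_dist, _ = tree.query(sample_xy, k=1)
near_deposit = (nearest_dist <= radius).astype(int)

# Synthetic scores (lower = more anomalous), shifted lower near deposits
anomaly_score = rng.normal(0.0, 1.0, size=500) - 1.5 * near_deposit

# roc_auc_score expects higher = more likely positive, so negate the score
auc = roc_auc_score(near_deposit, -anomaly_score)

# f_classif and mutual_info_classif take a 2-D feature matrix and class labels
features = anomaly_score.reshape(-1, 1)
f_stat, p_value = f_classif(features, near_deposit)
mi = mutual_info_classif(features, near_deposit, random_state=0)

print(f"ROC-AUC: {auc:.3f}  F-statistic: {f_stat[0]:.1f}  MI: {mi[0]:.3f}")
```

Note the sign flip for ROC-AUC: with a lower-is-more-anomalous score, passing the raw score would report one minus the intended value.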

In [None]:
# Calculate scores for each outlier detection method

scoring_radius = 0.005  # in decimal degrees (coordinates are lon/lat); roughly 500 m

roc_auc_scores = helper_functions.calculate_roc_auc(
    outlier_results, outlier_result_names, validation_df, radius=scoring_radius
) 

f_scores = helper_functions.calculate_f_score(
    outlier_results, outlier_result_names, validation_df, radius=scoring_radius
)

mi_scores = helper_functions.calculate_mi_score(
    outlier_results, outlier_result_names, validation_df, radius=scoring_radius
)

helper_functions.plot_scores(
    [roc_auc_scores, f_scores, mi_scores],
    titles=["ROC-AUC Scores", "F-Scores", "Mutual Information Scores"],
)

---

### *Deeper investigation into ROC-AUC scoring of each model*

Since the ROC-AUC scores for each model above are so similar, let's create a ROC curve to better understand the model performance. 

In [None]:
helper_functions.plot_roc_curves(
    outlier_datasets=outlier_results,  # Your outlier model DataFrames
    outlier_dataset_names=outlier_result_names,  # Names of models
    validation_df=validation_df,  # Known mineral deposits
    radius=scoring_radius,  # Search radius
)

The ROC curves above match the scores we observed: the ABOD method appears to offer the best trade-off between true positive rate (TPR) and false positive rate (FPR), indicating better model performance. 
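For readers who want to see how such a curve is built, the sketch below uses scikit-learn's `roc_curve` on synthetic labels and scores. The data are made up for illustration, and the scores follow a higher-is-more-anomalous convention; a lower-is-more-anomalous score would need to be negated first.

```python
import numpy as np
from sklearn.metrics import auc, roc_auc_score, roc_curve

rng = np.random.default_rng(7)

# 180 negatives (not near a deposit) and 20 positives (near a deposit)
y_true = np.array([0] * 180 + [1] * 20)
# Higher value = more likely an outlier; positives are shifted upward
outlier_likelihood = rng.normal(0.0, 1.0, size=200) + 1.0 * y_true

# Each threshold yields one (FPR, TPR) point on the curve
fpr, tpr, thresholds = roc_curve(y_true, outlier_likelihood)
area = auc(fpr, tpr)  # trapezoidal area under the curve

print(f"AUC from curve: {area:.3f}")
```

A curve that rises toward the top-left corner reaches a high TPR while keeping the FPR low, which is what a larger area under the curve summarizes.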

---

### <u>Time Cost Analysis</u>

Beyond the quantitative accuracy of each model, it is also important to consider the time cost of each; IF and LOF run almost instantly, while ABOD takes some time due to the nature of the calculation. This ABOD function is set up to allow for using k-nearest neighbors to calculate variance rather than using all possible pairs in the dataset, which improves processing time while potentially skewing results. For relatively small datasets, the time cost of ABOD is minimal, but for larger datasets the tradeoffs should be considered; it may be more efficient to use an algorithm like Isolation Forest despite the small cost in model accuracy. Further investigation into the effects of kNN on model accuracy is needed. 

Below we will do a brief investigation into the time cost of each method, focusing particularly on ABOD.

In [None]:
# Explore time cost of outlier detection vs. roc-auc score

outlier_models = [
    outlier_detection_functions.isolation_forest, 
    outlier_detection_functions.local_outlier_factor, 
    outlier_detection_functions.abod
]

outlier_model_names = ['IF', 'LOF', 'ABOD']
scoring_radius = 0.005

# Dictionary to store results
results = {
    "Model": [],
    "Iteration": [],
    "Execution Time (s)": [],
    "ROC-AUC Score": []
}

for model, name in zip(outlier_models, outlier_model_names):
    for i in range(7):
        start = time.perf_counter()  # monotonic, high-resolution timer
        output_df = model(df, feature_columns)
        exec_time = time.perf_counter() - start

        print(f"{model.__name__} iteration {i} took {exec_time:.3f} seconds")

        roc_auc_score = helper_functions.calculate_roc_auc(
            outlier_datasets=[output_df],
            outlier_dataset_names=[name],
            validation_df=validation_df,
            radius=scoring_radius,
        )[name]

        # Store results in dictionary
        results["Model"].append(name)
        results["Iteration"].append(i)
        results["Execution Time (s)"].append(exec_time)
        results["ROC-AUC Score"].append(roc_auc_score)

In [None]:
# Plot the results of the above analysis

results_df = pd.DataFrame(results)

fig, axes = plt.subplots(1, 2, figsize=(10, 4))

# Box Plot
sns.boxplot(x="Model", y="Execution Time (s)", data=results_df, ax=axes[0])
axes[0].set_title("Execution Times by Model")
axes[0].set_xlabel("Outlier Detection Model")
axes[0].set_ylabel("Execution Time (s)")
axes[0].set_yscale("log")

# Scatter Plot
sns.scatterplot(
    x="Execution Time (s)", y="ROC-AUC Score", hue="Model", data=results_df, ax=axes[1]
)
axes[1].set_title("Execution Time vs. ROC-AUC Score")
axes[1].set_xlabel("Execution Time (s)")
axes[1].set_xscale("log")
axes[1].set_ylabel("ROC-AUC Score")
axes[1].legend(title="Model")

plt.tight_layout()
plt.show()

As we can see in the above plots, IF and LOF have significantly lower execution times. IF seems to strike somewhat of a balance between execution time and ROC-AUC score, indicating it may be more suitable than ABOD for datasets with n > ~2000-3000, where ABOD's gain in score no longer justifies its time cost. 

In [None]:
# Analyze the execution time of each method with varying sample sizes

# Define sample sizes
sample_sizes = [50, 100, 200, 400, 800, 1600, 2700]

# Measure execution time
abod_results_df = helper_functions.measure_model_execution(
    df, sample_sizes, model=outlier_detection_functions.abod
)

if_results_df = helper_functions.measure_model_execution(
    df, sample_sizes, model=outlier_detection_functions.isolation_forest
)

lof_results_df = helper_functions.measure_model_execution(
    df, sample_sizes, model=outlier_detection_functions.local_outlier_factor
)



In [None]:
# Plot the above results
fig, axes = plt.subplots(1, 3, figsize=(14, 4))

# ABOD
axes[0].plot(
    abod_results_df["Number of Samples"],
    abod_results_df["Execution Time (s)"],
    marker="o",
    linestyle="-",
)
axes[0].set_xscale("log") 
axes[0].set_yscale("log")
axes[0].set_xlabel("Number of Samples (Log Scale)")
axes[0].set_ylabel("Execution Time (s) (Log Scale)")
axes[0].set_title("ABOD Method")
axes[0].grid(True, which="both", linestyle="--", linewidth=0.5)

# IF
axes[1].plot(
    if_results_df["Number of Samples"],
    if_results_df["Execution Time (s)"],
    marker="o",
    linestyle="-",
)
axes[1].set_xlabel("Number of Samples (Log Scale)")
axes[1].set_xscale("log")  
axes[1].set_ylabel("Execution Time (s)")
axes[1].set_ylim(0, 0.07)
axes[1].set_title("IF Method")
axes[1].grid(True, which="both", linestyle="--", linewidth=0.5)

# LOF
axes[2].plot(
    lof_results_df["Number of Samples"],
    lof_results_df["Execution Time (s)"],
    marker="o",
    linestyle="-",
)
axes[2].set_xlabel("Number of Samples (Log Scale)")
axes[2].set_xscale("log")  
axes[2].set_ylabel("Execution Time (s)")
axes[2].set_ylim(0, 0.07)
axes[2].set_title("LOF Method")
axes[2].grid(True, which="both", linestyle="--", linewidth=0.5)

fig.suptitle("Number of Samples vs. Execution Time per Outlier Detection Method")
plt.tight_layout()
plt.show()

As we can see above, ABOD has execution times similar to IF/LOF (<1s) when n<~200, but beyond that the time cost grows rapidly. The IF and LOF curves are irregular because the processing times are extremely short, so the variation can be attributed to random noise; over this range they scale roughly linearly, O(n), while ABOD appears to follow at least O(n^2). This is polynomial rather than exponential growth, and it aligns with the pairwise computations required to calculate ABOD (the naive all-pairs formulation is O(n^3) in the worst case, since each point is compared against every pair of other points). 
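One way to check the apparent scaling is to fit a straight line to the timings in log-log space: the slope estimates the exponent in time ≈ c·n^p. The sketch below uses made-up quadratic timings rather than the measured results, and the function name is hypothetical.

```python
import numpy as np

def empirical_exponent(sample_sizes, times):
    """Slope of log(time) vs. log(n): about 1 for O(n), 2 for O(n^2), 3 for O(n^3)."""
    slope, _intercept = np.polyfit(np.log(sample_sizes), np.log(times), deg=1)
    return slope

sizes = np.array([50, 100, 200, 400, 800, 1600, 2700])

# Made-up timings that scale exactly quadratically, for demonstration only
quadratic_times = 1e-6 * sizes.astype(float) ** 2

exponent = empirical_exponent(sizes, quadratic_times)
print(f"estimated exponent: {exponent:.2f}")
```

Applied to the measured ABOD timings (for example the `Number of Samples` and `Execution Time (s)` columns of `abod_results_df`), a slope near 2 or 3 would indicate polynomial rather than exponential growth. For very short runs such as IF and LOF, timing noise can dominate the fit.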

To summarize, ABOD seems to be the most appropriate model for smaller datasets (n < ~2000-3000), but beyond that the time cost becomes significant, and IF should be considered as a capable alternative. 