# Programming a random forest model to identify minerals from SEM-EDS data
(C) 2025 Austin M. Weber

#### Load dependencies
This notebook depends on the packages listed below. If one or more of them are not installed, open the Julia REPL and execute the following:

```julia
using Pkg
Pkg.add("CSV")
Pkg.add("DataFrames")
Pkg.add("DecisionTree")
Pkg.add("MLJ")
Pkg.add("CategoricalArrays")
Pkg.add("Statistics")
```

Note that `Pkg.add()` only needs to be run for the packages that are not already installed. `Statistics` is a standard library that ships with Julia.

In [1]:
# LOAD DEPENDENCIES
using CSV, DataFrames, DecisionTree, MLJ, CategoricalArrays, Statistics

#### Import training data
The file `model_training_data_balanced.csv` contains 1098 rows of mineral observations. The first column, `:Mineral`, is the target variable, and the remaining 23 columns are the features (in this case, different elemental net intensity ratios). The dataset has been balanced using the synthetic minority oversampling technique (SMOTE) so that each target class has an equal number of observations.

**Note:** For the import to work, `model_training_data_balanced.csv` must be located in the current working directory (which can be checked with `pwd()`).
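
Since the dataset is described as balanced, it can be worth confirming the class counts once the file is loaded. The following is a minimal sketch in base Julia that uses a small made-up label vector in place of the `:Mineral` column. The helper `class_counts` is hypothetical and not part of any package.

```julia
# Count how many observations belong to each class.
# `class_counts` is an illustrative helper, not a library function.
function class_counts(labels)
    counts = Dict{eltype(labels),Int}()
    for label in labels
        counts[label] = get(counts, label, 0) + 1
    end
    return counts
end

# Toy labels standing in for the :Mineral column
toy_labels = ["Ab", "Ap", "Ab", "Bt", "Ap", "Bt"]
counts = class_counts(toy_labels)

# A dataset is balanced when every class has the same count
is_balanced = length(unique(values(counts))) == 1
println(counts)
println("Balanced: ", is_balanced)
```

With the real data, the same check would be `class_counts(data.Mineral)`.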

In [2]:
# IMPORT TRAINING DATA
data = CSV.read("model_training_data_balanced.csv", DataFrame);
# Preview the first five rows
first(data, 5)

#### Remove rows with missing data

In [3]:
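# REMOVE ROWS WITH MISSING DATA
# dropmissing! modifies `data` in place, so `data` and `data_no_missing` refer to the same DataFrame
data_no_missing = dropmissing!(data);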

#### Extract target labels

In [4]:
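# EXTRACT TARGET LABELS
# Integer class codes for every observation (one code per mineral class)
labels = Int.(CategoricalArray(data_no_missing.Mineral).refs);
# Mineral names for every observation (assumes :Mineral is the first column)
true_classes = data_no_missing[:,1];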
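
The `refs` field of a `CategoricalArray` holds, for each element, the index of its class in the array's list of levels, which is sorted by default for string data. The sketch below reproduces that kind of encoding in base Julia on a toy vector. The helper `encode_labels` is hypothetical and only meant to illustrate the idea.

```julia
# Map each label to the position of its class in the sorted list of unique classes.
# `encode_labels` is an illustrative helper, not a library function.
function encode_labels(labels)
    class_list = sort(unique(labels))
    lookup = Dict(c => i for (i, c) in enumerate(class_list))
    codes = [lookup[l] for l in labels]
    return codes, class_list
end

toy = ["Bt", "Ab", "Chl", "Ab"]
codes, class_list = encode_labels(toy)
println(class_list)  # sorted class names
println(codes)       # integer code of each observation
```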

#### Extract features

In [5]:
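# EXTRACT FEATURES
# All feature columns of the full dataset as a numeric matrix
features = Matrix{Float64}(select(data_no_missing, Not(:Mineral)));
# The same feature columns kept as a DataFrame for partitioning (assumes :Mineral is the first column)
training_data = data_no_missing[:,2:end];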

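#### Partition the data into 70/30 training/test datasets using a shuffled split

In [6]:
# PARTITION THE DATA INTO 70/30 TRAINING/TEST DATASETS USING A SHUFFLED SPLIT
# Supplying `rng` makes the shuffle reproducible. No `stratify` argument is passed,
# so class proportions are not guaranteed to match between the two partitions.
(Xtrain1, Xtest1), (ytrain1, ytest1) = partition((training_data, true_classes), 0.7, rng=10, multi=true);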
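
A stratified split keeps the class proportions roughly equal in the training and test partitions by splitting each class separately. The following is a minimal sketch of that idea in base Julia with the `Random` standard library. The function `stratified_split` and the toy labels are assumptions made for illustration and are not how MLJ implements it.

```julia
using Random

# Return (train_idx, test_idx) such that each class contributes
# approximately `fraction` of its observations to the training set.
# `stratified_split` is an illustrative helper, not a library function.
function stratified_split(y, fraction; rng=Random.default_rng())
    train_idx = Int[]
    test_idx = Int[]
    for class in unique(y)
        idx = shuffle(rng, findall(==(class), y))   # rows of this class, in random order
        n_train = round(Int, fraction * length(idx))
        append!(train_idx, idx[1:n_train])
        append!(test_idx, idx[n_train+1:end])
    end
    return sort(train_idx), sort(test_idx)
end

y = repeat(["Ab", "Ap", "Bt"], inner=10)  # toy data: 10 observations per class
train_idx, test_idx = stratified_split(y, 0.7; rng=MersenneTwister(10))
println(length(train_idx), " training rows, ", length(test_idx), " test rows")
```

MLJ's `partition` documents a `stratify` keyword that serves the same purpose; its docstring describes the exact behavior.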

#### Convert into data types that `build_forest` can interpret (a `Matrix{Float64}` of features and a `Vector{Int}` of class codes)

In [7]:
# CONVERT INTO DATA TYPES THAT build_forest CAN INTERPRET (Matrix{Float64} features and Vector{Int} class codes)
Xtrain1_features = Matrix{Float64}(Xtrain1);
Xtest1_features = Matrix{Float64}(Xtest1);

# The integer codes are derived separately for each subset, so they only agree
# if every mineral class appears in both the training and the test labels.
ytrain1_targets = Int.(CategoricalArray(ytrain1).refs);
ytest1_targets = Int.(CategoricalArray(ytest1).refs);
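
Encoding the training and test labels separately only yields matching integer codes when both subsets contain every class. A sketch of a more defensive encoding, which uses one shared class list for all subsets, is given below. The helper `encode_with` is hypothetical and the toy vectors are made up.

```julia
# Encode labels against a fixed, shared list of classes so that the same
# class always receives the same integer code in every subset.
# `encode_with` is an illustrative helper, not a library function.
function encode_with(class_list, labels)
    lookup = Dict(c => i for (i, c) in enumerate(class_list))
    return [lookup[l] for l in labels]
end

all_classes = ["Ab", "Ap", "Bt"]          # sorted list of every class in the full dataset
train_labels = ["Ab", "Ap", "Bt", "Ab"]
test_labels = ["Bt", "Ab"]                # "Ap" happens to be absent from this subset

# Encoding the test subset on its own uses only the classes it contains
own_list = sort(unique(test_labels))
own_codes = encode_with(own_list, test_labels)        # "Bt" receives code 2
shared_codes = encode_with(all_classes, test_labels)  # "Bt" receives code 3, as in the training codes
train_codes = encode_with(all_classes, train_labels)
println(own_codes)
println(shared_codes)
```

With CategoricalArrays, a comparable effect can be obtained by passing the same `levels` to both arrays when they are constructed.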

#### Construct machine learning model

In [8]:
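# CONSTRUCT MACHINE LEARNING MODEL
# Default hyperparameters are used. No random seed is passed to build_forest,
# so the fitted forest and the results below can differ between runs.
model = build_forest(ytrain1_targets, Xtrain1_features)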

Ensemble of Decision Trees
Trees:      10
Avg Leaves: 23.5
Avg Depth:  5.8

#### Apply random forest model to the training dataset (`Xtrain1_features`) and the test dataset (`Xtest1_features`)

In [9]:
# APPLY RANDOM FOREST MODEL TO THE TRAINING DATASET (Xtrain1_features) AND THE TEST DATASET (Xtest1_features)
validation_predictions = apply_forest(model, Xtrain1_features);
test_predictions = apply_forest(model, Xtest1_features);
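
A random forest classifier typically predicts the class that receives the most votes across its trees. The sketch below illustrates that aggregation step in base Julia with made-up tree votes. It is not DecisionTree.jl's implementation, and `majority_vote` is a hypothetical helper.

```julia
# Return the class that occurs most often among the per-tree predictions.
# Ties are broken in favor of the smallest class code.
# `majority_vote` is an illustrative helper, not a library function.
function majority_vote(votes)
    tally = Dict{eltype(votes),Int}()
    for v in votes
        tally[v] = get(tally, v, 0) + 1
    end
    best = maximum(values(tally))
    return minimum(c for (c, n) in tally if n == best)
end

tree_votes = [15, 15, 3, 15, 9, 15, 15, 3, 15, 15]  # made-up predictions from 10 trees
prediction = majority_vote(tree_votes)
println(prediction)
```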

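#### Evaluate the accuracy of the model by comparing the predictions to the true labels (i.e., targets)

Note that the "validation accuracy" below is computed on the same data that the model was trained on, so it is an optimistic estimate. The test accuracy is the better indicator of how the model performs on unseen observations.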

In [10]:
# EVALUATE THE ACCURACY OF THE MODEL BY COMPARING THE PREDICTIONS TO THE TRUE LABELS (i.e. targets)
validation_accuracy = mean(validation_predictions .== ytrain1_targets)
test_accuracy = mean(test_predictions .== ytest1_targets)
println("Validation accuracy: ", round(validation_accuracy * 100, digits=3), "%\n")
println("Test accuracy: ", round(test_accuracy * 100, digits=3), "%\n")

Validation accuracy: 99.727%

Test accuracy: 98.726%



#### Visualize the test confusion matrix

In [11]:
# VISUALIZE THE TEST CONFUSION MATRIX
DecisionTree.confusion_matrix(ytest1_targets, test_predictions)

18×18 Matrix{Int64}:
 20  0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
  0  5   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
  0  0  16   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
  0  0   0  19   0   0   0   0   0   0   0   0   0   0   0   0   0   1
  0  0   1   0  21   0   0   0   0   0   0   0   0   0   0   0   0   0
  0  0   0   0   0  17   1   0   0   0   0   0   0   0   0   0   0   0
  0  0   0   0   0   0  21   0   0   0   0   0   0   0   0   0   0   0
  0  0   0   0   0   0   0  17   0   0   0   0   0   0   0   0   0   0
  0  0   0   0   0   0   0   0  17   0   0   0   0   0   0   0   0   0
  0  0   0   0   0   0   0   0   0  16   0   0   0   0   0   0   0   0
  0  0   0   0   0   0   0   0   0   0  21   0   0   0   0   0   0   0
  0  0   0   0   0   0   0   0   0   0   0  14   0   0   0   0   0   0
  0  0   0   0   0   0   0   0   0   0   0   0  18   0   0   0   0   0
  0  0   0   0   0   0   0   0   0   0   0   0   0  23  …
  ⋮

Classes:  [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18]
Matrix:   
Accuracy: 0.9872611464968153
Kappa:    0.9864697454459275
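
The `Accuracy` and `Kappa` values can be derived from the confusion matrix itself. The sketch below applies the textbook definition of Cohen's kappa to a small made-up matrix, assuming rows are true classes and columns are predicted classes. The function `accuracy_and_kappa` is hypothetical, and the formula is assumed (not confirmed here) to match what DecisionTree.jl reports.

```julia
using LinearAlgebra

# Accuracy and Cohen's kappa from a square confusion matrix.
# `accuracy_and_kappa` is an illustrative helper, not a library function.
function accuracy_and_kappa(cm::AbstractMatrix{<:Integer})
    n = sum(cm)
    observed = tr(cm) / n                      # fraction of observations on the diagonal
    row_totals = vec(sum(cm, dims=2))          # number of observations per true class
    col_totals = vec(sum(cm, dims=1))          # number of observations per predicted class
    expected = dot(row_totals, col_totals) / n^2   # agreement expected by chance
    kappa = (observed - expected) / (1 - expected)
    return observed, kappa
end

toy_cm = [5 1;
          2 4]
acc, kappa = accuracy_and_kappa(toy_cm)
println("Accuracy: ", acc)
println("Kappa:    ", kappa)
```

Kappa discounts the agreement that would be expected by chance, which is why it is slightly lower than the raw accuracy.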

#### Get probability scores for a prediction

In [12]:
# GET PROBABILITY SCORES FOR A PREDICTION
observation = 123; # i.e., the 123rd row of the training feature matrix
# The line below shows the possible classes in the left-hand column and the probability (in percent) of each class in the right-hand column
[levels(true_classes) apply_forest_proba(model,Xtrain1_features[observation,:],levels(labels)).*100]

18×2 Matrix{Any}:
 "Ab"      0.0
 "Ap"      0.0
 "Aug"     0.0
 "Bt"      0.0
 "Chl"     0.0
 "En"      0.0
 "Hbl"     0.0
 "Kln"     0.0
 "Lab"     0.0
 "Mc"      0.0
 "Mnt"     0.0
 "Ms"      0.0
 "Olig"    0.0
 "Pgt"     0.0
 "Plg"   100.0
 "Spl"     0.0
 "Spn"     0.0
 "Vrm"     0.0