<img src='../../media/common/LogoWekeo_Copernicus_RGB_0.png' align='left' height='96px'></img>

<hr>

# Grassland Classification

*Authors: Adrian Di Paolo, Chung-Xiang Hong, Jonas Viehweger* 

This notebook demonstrates how to clean and preprocess satellite data for a machine learning task. It also details how to use the cleaned data to train, evaluate, and select a model for a larger-scale application.

The task is to train a model that can classify grassland areas in the Netherlands. The ultimate goal would be to classify grasslands yearly and derive change maps of grassland loss and gain from the results. For the classification, we use phenological data, with the EuroCrops dataset as ground truth.

This notebook uses data that was already downloaded and prepared for Python in the previous notebook.

1. [Preprocessing Data](#preprocessing-data)

    - [Labeling Data](#labeling-the-data)

    - [Data Cleaning](#data-cleaning)

    - [Normalizing Dates](#normalizing-dates)

    - [Normalizing Numerical Data](#normalize-numerical-data)

    - [Split Dataset into Train and Test](#split-the-dataset-into-train-and-test) 

2. [Model Training](#model-training)

    
3. [Model Evaluation](#model-evaluation)

    - [Metrics](#metrics)

    - [Memory Usage](#memory-usage)

    - [Execution Time](#execution-time) 

    - [Confusion Matrix](#confusion-matrix)

    - [ROC Curve](#roc-curve)
    
4. [Model Selection](#model-selection)

    - [Fine-tuning](#hyperparameter-optimization)

    - [Feature Importance](#feature-importance)

5. [Neural Network](#bonus-training-neural-network)



In [None]:
library(dplyr)
library(lubridate)
library(caret)
library(naivebayes)
library(lightgbm)
library(doParallel)
library(scales)
library(glmnet)

library(ggplot2)
library(gridExtra)
library(pROC)

library(MLmetrics)
library(keras)

### Preprocessing Data


In [None]:
# Load the downloaded data
x_data <- readRDS('../../data/processing/ml-grassland-classification/dataset/x_data.rds')
y_data <- readRDS('../../data/processing/ml-grassland-classification/dataset/y_data.rds')

#### Labeling the data

We assign a 1 to the grassland label and 0 for others. 

In [None]:
binary_label <- function(y_data) {
    # Assign new labels: 1 if grassland, else 0
    grassland_value_str <- format(3.302e+09, scientific = TRUE)

    # Convert the array to strings in scientific notation
    y_data_str <- sapply(y_data, format, scientific = TRUE)

    binary_assign <- function(x) {
        ifelse(x == grassland_value_str, 1, 0)
    }

    sapply(y_data_str, binary_assign, USE.NAMES = FALSE)
}

#### Data cleaning

Our dataset contains NaN values as well as values that the HRVPP documentation defines as No Data, so we delete the rows containing them from the dataset.

In [None]:
data_to_df <- function(x_data, y_data = NULL) {

    # Convert the data to a data frame and clean it
    x_df <- data.frame(x_data)
    colnames(x_df) <- c('AMPL', 'EOSD', 'EOSV', 'LENGTH', 'LSLOPE', 'MAXD', 'MAXV', 'MINV', 'QFLAG', 'RSLOPE', 'SOSD', 'SOSV', 'SPROD', 'TPROD')

    if (!is.null(y_data)) {
        y_df <- data.frame(LABEL = y_data)
        df <- cbind(x_df, y_df)
    } else {
        df <- x_df
    }
    
    df <- df[complete.cases(df), ]
    df <- df[!(df$EOSD == 0 | df$SOSD == 0 | df$MAXD == 0 | df$LENGTH == 0), ]
    df <- df[!(df$SOSV == 32768 | df$EOSV == 32768 | df$MAXV == 32768 | df$MINV == 32768 | df$AMPL == 32768 | df$LSLOPE == 32768 | df$RSLOPE == 32768), ]
    df <- df[!(df$SPROD == 65535 | df$TPROD == 65535), ]
    
    return(df)
}

#### Normalizing Dates

As we are dealing with datasets that include columns with dates, it is essential to ensure that those columns are normalized. 

In our case, the datasets for different years store dates in the format 'YYDOY': the first two digits represent the year and the last three digits the day of the year. Because these dates need to be consistent across years and fall in the same range, we transform them into the number of days since a reference date, the 1st of January of the earliest year found in the season start dates (normally the previous year).

In [None]:
transform_dates <- function(df) {
    # Transform dates columns from YYDOY to YY-MM-DD and then to the days since the 1 January of the previous year
    
    df$EOSD <- as.character(df$EOSD)
    df$SOSD <- as.character(df$SOSD)
    df$MAXD <- as.character(df$MAXD)
    
    df$SOSD <- as.Date(df$SOSD, format = "%y%j")
    df$EOSD <- as.Date(df$EOSD, format = "%y%j")
    df$MAXD <- as.Date(df$MAXD, format = "%y%j")
    
    min_year <- min(format(df$SOSD, "%Y"))
    reference_date <- as.Date(paste0(min_year, "-01-01"))
    
    df$SOSD <- as.numeric(df$SOSD - reference_date)
    df$EOSD <- as.numeric(df$EOSD - reference_date)
    df$MAXD <- as.numeric(df$MAXD - reference_date)
    
    return(df)
}
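
As a worked example of the conversion described above, here is a minimal sketch using a made-up value (`21045`, day 45 of 2021) and assuming 2020 as the reference year. Both values are illustrative assumptions, not taken from the dataset.

```r
# Hypothetical YYDOY value: year 21, day-of-year 045
yydoy <- "21045"

# %y parses the two-digit year, %j the three-digit day of the year
parsed <- as.Date(yydoy, format = "%y%j")
print(parsed)

# Assumed reference date: 1 January of the previous year
reference_date <- as.Date("2020-01-01")

# Number of days between the parsed date and the reference date
days_since_reference <- as.numeric(parsed - reference_date)
print(days_since_reference)
```

Here the date parses to 14 February 2021, which lies 410 days after the assumed reference date (2020 is a leap year, so 366 days plus 44).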


#### Normalizing Numerical Data

To ensure that our machine learning models perform well, it is essential to rescale the numerical data. Rescaling transforms the features to a common scale while preserving the relative differences within each feature. This is crucial for algorithms that compute distances between data points, such as k-nearest neighbors and support vector machines, and it also helps the gradient-based training of neural networks, because features with larger ranges would otherwise dominate the learning process. Tree-based models such as gradient boosting are largely insensitive to feature scale.

In this notebook, we will apply Min-Max scaling to bring all numerical features into the range [0, 1].

In [None]:
normalize_df <- function(df) {
    
    columns_to_scale <- df %>%
        select(-LABEL, -MAXD, -SOSD, -EOSD)
    
    scaled_data <- as.data.frame(lapply(columns_to_scale, rescale))
    
    df <- df %>%
        select(LABEL, MAXD, SOSD, EOSD) %>%
        bind_cols(scaled_data)
    
    return(df)
}
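
To make the Min-Max formula concrete, the following sketch applies it by hand to a small made-up vector. The helper name `min_max_scale` and the sample values are assumptions for illustration; `rescale()` from the scales package, called with its default arguments, should give the same result.

```r
# Min-Max scaling: x' = (x - min(x)) / (max(x) - min(x))
min_max_scale <- function(x) {
  rng <- range(x, na.rm = TRUE)
  (x - rng[1]) / (rng[2] - rng[1])
}

# Hypothetical feature values, for illustration only
feature <- c(10, 20, 40)
scaled <- min_max_scale(feature)
print(scaled)
```

The smallest value maps to 0, the largest to 1, and 20 maps to one third. Note that a constant column would lead to a division by zero in this simple version.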

### Prepare Data for Machine Learning

In [None]:
y_data <- binary_label(y_data)

df_data <- data_to_df(x_data, y_data) %>%
    transform_dates() %>%
    normalize_df()

x_data <- df_data %>% select(-LABEL)
y_data <- df_data$LABEL

##### Split Dataset into Train and Test

In [None]:
set.seed(42)
train_index <- createDataPartition(y_data, p = .8, list = FALSE, times = 1)

X_train <- x_data[train_index, ]
X_test <- x_data[-train_index, ]
y_train <- y_data[train_index]
y_test <- y_data[-train_index]

#### Undersampling Dataset

Undersampling is a technique used to balance imbalanced datasets, where one class has significantly more samples than another class.

The main advantage of undersampling is that it can improve the performance of classifiers by reducing the bias towards the majority class, which can lead to better predictions on the minority class. Undersampling can also reduce the training time and memory requirements of the model, since there are fewer instances to process.

In [None]:
# Compute class distribution before undersampling
class_counts_before <- table(y_train)
print(class_counts_before)

In [None]:
undersampled_data <- downSample(x = X_train, y = as.factor(y_train), list = TRUE)

X_train <- undersampled_data$x
y_train <- undersampled_data$y
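
The idea behind random undersampling can be sketched in base R as follows. This is only an illustration on made-up labels and not necessarily how `downSample()` is implemented internally: every class is randomly reduced to the size of the smallest class.

```r
set.seed(42)  # make the random draw reproducible

# Hypothetical imbalanced labels: 8 samples of class 0, 3 of class 1
labels <- factor(c(rep(0, 8), rep(1, 3)))

# Size of the smallest class
n_min <- min(table(labels))

# Draw n_min row indices at random from every class
idx_by_class <- split(seq_along(labels), labels)
keep <- unlist(lapply(idx_by_class, function(idx) {
  idx[sample.int(length(idx), n_min)]
}))

# Both classes now have the same number of samples
balanced <- labels[keep]
print(table(balanced))
```

The same `keep` indices would be used to subset the feature rows, so that features and labels stay aligned.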

In [None]:
class_counts_after <- table(y_train)

# Convert to data frames for ggplot
class_counts_before_df <- as.data.frame(class_counts_before)
colnames(class_counts_before_df) <- c("Class", "Count")
class_counts_before_df$Class <- factor(class_counts_before_df$Class, levels = c("1", "0"), labels = c("Grassland", "No Grassland"))
class_counts_before_df$Percent <- round(class_counts_before_df$Count / sum(class_counts_before_df$Count) * 100, 1)


class_counts_after_df <- as.data.frame(class_counts_after)
colnames(class_counts_after_df) <- c("Class", "Count")
class_counts_after_df$Class <- factor(class_counts_after_df$Class, levels = c("1", "0"), labels = c("Grassland", "No Grassland"))
class_counts_after_df$Percent <- round(class_counts_after_df$Count / sum(class_counts_after_df$Count) * 100, 1)


# Create pie chart for class distribution before undersampling
plot_before <- ggplot(class_counts_before_df, aes(x = "", y = Count, fill = Class)) +
  geom_bar(stat = "identity", width = 1) +
  coord_polar(theta = "y") +
  scale_fill_manual(values = c("Grassland" = "#4CAF50", "No Grassland" = "#B0B0B0")) +
  theme_void() +
  theme(legend.position = "right") +
  geom_text(aes(label = paste0(Percent, "%")), position = position_stack(vjust = 0.5)) +
  ggtitle("Class Distribution Before Undersampling")

# Create pie chart for class distribution after undersampling
plot_after <- ggplot(class_counts_after_df, aes(x = "", y = Count, fill = Class)) +
  geom_bar(stat = "identity", width = 1) +
  coord_polar(theta = "y") +
  scale_fill_manual(values = c("Grassland" = "#4CAF50", "No Grassland" = "#B0B0B0")) +
  theme_void() +
  theme(legend.position = "right") +
  geom_text(aes(label = paste0(Percent, "%")), position = position_stack(vjust = 0.5)) +
  ggtitle("Class Distribution After Undersampling")

options(repr.plot.width = 24, repr.plot.height = 8)
grid.arrange(plot_before, plot_after, ncol = 2)

### Model Training

Here we test several different algorithms on their performance for the task. 
Note that this is a fairly naive, brute-force approach: the models' hyperparameters are not tuned, and no pre-selection of machine learning algorithms based on expert knowledge is made. 

However, it gives a rough idea of how the algorithms perform and, in addition, provides information on their computational efficiency in terms of memory usage and computation time. These are also important parameters to consider when scaling the model up.

In [None]:
# After undersampling, the rows may end up grouped by class, so we shuffle them before training.
# Shuffling keeps any subset of the training data (for example cross-validation folds)
# representative of the overall dataset and prevents the models from being affected
# by the order of the rows.

set.seed(123) # For reproducibility

data <- data.frame(X_train, LABEL = y_train)
shuffled_data <- data[sample(nrow(data)), ]

X_train_shuffled <- shuffled_data %>% select(-LABEL)
y_train_shuffled <- shuffled_data$LABEL

In [None]:
models <- list()

# Metrics initialization
accuracy <- list()
precision <- list()
recall <- list()
f1 <- list()
time_usage <- list()

# Logistic Regression
start_time <- Sys.time()
models[["Logistic Regression"]] <- train(X_train_shuffled, y_train_shuffled,
  method = "glmnet",
  trControl = trainControl(method = "cv", number = 10),
  tuneGrid = expand.grid(alpha = 0, lambda = seq(0.001, 0.1, by = 0.001))
)
time_usage[["Logistic Regression"]] <- as.numeric(difftime(Sys.time(), start_time, units = "secs"))

# Support Vector Machines
start_time <- Sys.time()
models[["Support Vector Machines"]] <- train(X_train_shuffled, y_train_shuffled,
  method = "svmLinear",
  trControl = trainControl(method = "cv", number = 10, allowParallel = TRUE)
)
time_usage[["Support Vector Machines"]] <- as.numeric(difftime(Sys.time(), start_time, units = "secs"))

# Decision Trees
start_time <- Sys.time()
models[["Decision Trees"]] <- train(X_train_shuffled, y_train_shuffled,
  method = "rpart",
  trControl = trainControl(method = "cv", number = 10, allowParallel = TRUE),
  tuneLength = 10
)
time_usage[["Decision Trees"]] <- as.numeric(difftime(Sys.time(), start_time, units = "secs"))


# Naive Bayes
start_time <- Sys.time()
models[["Naive Bayes"]] <- train(X_train_shuffled, y_train_shuffled,
  method = "naive_bayes",
  trControl = trainControl(method = "cv", number = 10, allowParallel = TRUE),
  tuneLength = 10
)
time_usage[["Naive Bayes"]] <- as.numeric(difftime(Sys.time(), start_time, units = "secs"))

# K-Nearest Neighbors
start_time <- Sys.time()
models[["K-Nearest Neighbor"]] <- train(X_train_shuffled, y_train_shuffled,
  method = "knn",
  trControl = trainControl(method = "cv", number = 10, allowParallel = TRUE),
  preProcess = c("center", "scale")  # Ensure data is scaled
)
time_usage[["K-Nearest Neighbor"]] <- as.numeric(difftime(Sys.time(), start_time, units = "secs"))


# LightGBM
train_lightgbm <- function(x_train, y_train, custom_params = list()) {

  params <- list(
    objective = "binary",
    metric = "binary_error",
    feature_fraction = 1,
    learning_rate = 0.02,
    num_leaves = 25
  )

  params <- modifyList(params, custom_params)

  lgb.train(
    params = params,
    data = lgb.Dataset(data = as.matrix(x_train), label = as.numeric(as.character(y_train))),
    nrounds = 100,
    verbose = 0
  )

}

start_time <- Sys.time()
models[["Lightgbm"]] <- train_lightgbm(X_train_shuffled, y_train_shuffled)
time_usage[["Lightgbm"]] <- as.numeric(difftime(Sys.time(), start_time, units = "secs"))

In [None]:
for (key in names(models)) {

  print(key)
  flush.console()

  cm <- NULL
  if (key == "Lightgbm") {
    # LightGBM returns probabilities, so threshold them to get class labels
    predictions <- predict(models[[key]], as.matrix(X_test))
    predictions_binary <- ifelse(predictions > 0.5, 1, 0)
    predictions <- factor(predictions_binary, levels = c(0, 1))
  } else {
    predictions <- predict(models[[key]], X_test)
  }
  # Treat grassland (1) as the positive class for precision, recall and F1
  cm <- caret::confusionMatrix(predictions, factor(y_test, levels = c(0, 1)), positive = "1")

  accuracy[[key]] <- cm$overall["Accuracy"]
  precision[[key]] <- cm$byClass["Precision"]
  recall[[key]] <- cm$byClass["Recall"]
  f1[[key]] <- cm$byClass["F1"]

  print(paste("F1: ", (f1[[key]])))

  # Print time usage
  cat(sprintf("Model: %s\nTime: %.2f s\n\n", key, time_usage[[key]]))
}


### Model Evaluation

##### Metrics

In [None]:
plot_dict <- function(data, title, x_label, y_label) {
  df <- data.frame(metric = unlist(data), model = names(data))

  # Sort the DataFrame by the metric column
  df$model <- factor(df$model, levels = df$model[order(-df$metric)])

  # Create a color palette for the bars
  colors <- scale_fill_gradient(low = "lightgreen", high = "darkgreen")

  # Create the plot
  ggplot(df, aes(x = metric, y = model)) +
    geom_bar(stat = "identity", aes(fill = metric), show.legend = FALSE) +
    colors +
    theme_minimal() +
    labs(title = title, x = x_label, y = y_label) +
    theme(
      plot.title = element_text(hjust = 0.5, size = 20),
      axis.title.x = element_text(size = 16),
      axis.title.y = element_text(size = 16),
      axis.text = element_text(size = 14)
    )
}


In [None]:
plot_dict(f1, 'Models F1-Score', 'F1-Score', 'Model')

##### Execution Time

In [None]:
plot_dict(time_usage, 'Training Execution Time', 'Time (s)', 'Model')

As we can see, the LightGBM model is by far the fastest to train.

#### Loading Validation Dataset

Now we will test the models on the validation dataset, which was taken from a different bounding box area. This shows how the models perform on data they have not seen during training. If the performance is much worse on this dataset, the model has overfit the training data and is not general enough to perform well on new data.

We also have to preprocess the validation data with the same steps as explained before. 

In [None]:
x_validation <- readRDS('../../data/processing/ml-grassland-classification/dataset/x_validation.rds')
y_validation <- readRDS('../../data/processing/ml-grassland-classification/dataset/y_validation.rds')

In [None]:
y_validation <- binary_label(y_validation)

df_validation <- data_to_df(x_validation, y_validation) %>%
    transform_dates() %>%
    normalize_df()

x_validation <- df_validation %>% select(-LABEL)
y_validation <- df_validation$LABEL

To evaluate the models' performance on the validation dataset, we will use confusion matrices and ROC curves. 

#### Confusion Matrix

A confusion matrix is a table that summarizes the performance of a classification model by comparing the predicted labels with the true labels of a set of data. 

The confusion matrix consists of four values: true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN). The rows of the matrix represent the actual labels, while the columns represent the predicted labels. The diagonal elements of the matrix represent the instances that are classified correctly, while the off-diagonal elements represent the instances that are misclassified.
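
As a small illustration of these four values, the sketch below counts them by hand for made-up labels and derives the usual metrics from them. The vectors `truth` and `pred` are hypothetical and unrelated to the notebook's data.

```r
# Hypothetical true and predicted labels (1 = grassland, 0 = other)
truth <- c(1, 1, 1, 0, 0, 0, 0, 1)
pred  <- c(1, 0, 1, 0, 1, 0, 0, 1)

# Count the four cells of the confusion matrix
tp <- sum(pred == 1 & truth == 1)  # true positives
fp <- sum(pred == 1 & truth == 0)  # false positives
tn <- sum(pred == 0 & truth == 0)  # true negatives
fn <- sum(pred == 0 & truth == 1)  # false negatives

# Metrics derived from the four counts
precision <- tp / (tp + fp)
recall    <- tp / (tp + fn)
f1        <- 2 * precision * recall / (precision + recall)
accuracy  <- (tp + tn) / length(truth)

# Rows: actual labels, columns: predicted labels
print(table(Actual = truth, Predicted = pred))
print(c(precision = precision, recall = recall, f1 = f1, accuracy = accuracy))
```

In this toy case there are 3 true positives, 1 false positive, 3 true negatives and 1 false negative, so all four metrics equal 0.75.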

In [None]:
plot_confusion_matrices <- function(models, X_test, y_test) {
    num_models <- length(models)
    nrows <- ceiling(num_models / 3)
    ncols <- min(num_models, 3)
    plots <- list()

    # Factor levels are "0" and "1", so map each level to its class name explicitly
    class_names <- c("0" = "No Grassland", "1" = "Grassland")
    references <- factor(y_test, levels = c(0, 1))

    for (key in names(models)) {
        print(key)
        model <- models[[key]]

        if (key == "Lightgbm") {
            # LightGBM expects a matrix and returns probabilities
            predictions <- predict(model, as.matrix(X_test))
            predictions <- factor(ifelse(predictions > 0.9, 1, 0), levels = c(0, 1))
        } else {
            predictions <- predict(model, X_test)
        }
        cm <- confusionMatrix(predictions, references)

        # cm$table has predictions in rows and references in columns;
        # normalize so that the values for each predicted class sum to 1
        cm_normalized <- prop.table(cm$table, 1)
        cm_data <- as.data.frame(cm_normalized)
        colnames(cm_data) <- c("Predicted", "True", "Freq")
        cm_data$Freq[is.na(cm_data$Freq)] <- 0

        p <- ggplot(cm_data, aes(x = Predicted, y = True, fill = Freq)) +
            coord_equal() +
            geom_tile() +
            geom_text(aes(label = sprintf("%.2f", Freq))) +
            scale_fill_gradient(low = "white", high = "darkgreen") +
            scale_x_discrete(labels = class_names) +
            scale_y_discrete(labels = class_names) +
            labs(title = key, y = "True labels", x = "Predicted labels") +
            theme_minimal() +
            theme(legend.position = "none")

        plots[[key]] <- p
    }

    grid.arrange(grobs = plots, nrow = nrows, ncol = ncols)
}


In [None]:
plot_confusion_matrices(models, x_validation, y_validation)

#### ROC Curve

The ROC (Receiver Operating Characteristic) curve is a graphical representation of the performance of a binary classifier. The ROC curve shows the trade-off between the true positive rate (TPR), also called sensitivity or recall, and the false positive rate (FPR), which is the proportion of negative instances that are incorrectly classified as positive.

To create a ROC curve, the classifier's output is sorted by confidence or probability, and the threshold for classification is varied from high to low. At each threshold value, the TPR and FPR are calculated and plotted on a graph with TPR on the y-axis and FPR on the x-axis. The resulting curve represents the classifier's performance at all possible threshold values.

The closer the curve is to the top-left corner of the graph, the better the classifier's performance, as this indicates a high TPR and a low FPR. The area under the ROC curve (AUC) is a commonly used metric to summarize the classifier's performance. A perfect classifier would have an AUC of 1, while a random classifier would have an AUC of 0.5.
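
The procedure described above can be sketched by hand on a few made-up scores. The `scores` and `truth` vectors and the helper `roc_point` are illustrative assumptions. The AUC is computed here with the rank-based (Mann-Whitney) formula, which equals the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one.

```r
# Hypothetical classifier scores and true labels (1 = positive)
scores <- c(0.9, 0.8, 0.7, 0.6, 0.55, 0.4)
truth  <- c(1, 1, 0, 1, 0, 0)

# FPR and TPR when everything scoring at or above the threshold is called positive
roc_point <- function(threshold) {
  pred <- as.integer(scores >= threshold)
  c(
    fpr = sum(pred == 1 & truth == 0) / sum(truth == 0),
    tpr = sum(pred == 1 & truth == 1) / sum(truth == 1)
  )
}

# Sweep the threshold from high to low to trace the curve
thresholds <- c(Inf, sort(unique(scores), decreasing = TRUE))
roc_points <- t(sapply(thresholds, roc_point))
print(roc_points)

# AUC from the ranks of the positive samples (no ties in this example)
n_pos <- sum(truth == 1)
n_neg <- sum(truth == 0)
auc_value <- (sum(rank(scores)[truth == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
print(auc_value)
```

In this example 8 of the 9 positive-negative pairs are ranked correctly, so the AUC is 8/9.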

In [None]:
plot_roc_curve <- function(models, x_test, y_test, labels) {
  roc_data <- data.frame()

  for (i in seq_along(models)) {
    model <- models[[i]]
    label <- labels[i]

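    # LightGBM returns probabilities; the caret models return hard class labels here,
    # so their curves are built from a single operating point
    y_prob <- predict(model, as.matrix(x_test))
    y_prob <- as.numeric(as.character(y_prob))
    roc_curve <- roc(y_test, y_prob)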

    roc_auc <- auc(roc_curve)
    roc_df <- data.frame(
      fpr = 1 - roc_curve$specificities,
      tpr = roc_curve$sensitivities,
      model = paste(label, "(AUC =", sprintf("%.2f", roc_auc), ")")
    )

    roc_data <- rbind(roc_data, roc_df)
  }

  ggplot(roc_data, aes(x = fpr, y = tpr, color = model)) +
    geom_line(linewidth = 1) +
    geom_abline(linetype = "dashed", color = "gray") +
    xlim(0, 1) +
    ylim(0, 1) +
    labs(x = "False Positive Rate", y = "True Positive Rate", title = "ROC Curve") +
    theme_minimal() +
    theme(
      legend.position = "bottom",
      plot.title = element_text(size = 20),
      axis.title = element_text(size = 16),
      axis.text = element_text(size = 14),
      legend.text = element_text(size = 12),
      legend.title = element_text(size = 14)
    ) +
    guides(color = guide_legend(title = "Models"))
}


In [None]:
x_validation[is.na(x_validation)] <- -1

model_labels <- names(models)
plot_roc_curve(models, x_validation, y_validation, model_labels)

### Model Selection

As the metrics and execution times above show, the best candidate to solve this binary classification problem is the LightGBM model.

LightGBM is a popular open-source gradient boosting framework developed by Microsoft. It is designed to be highly efficient in terms of training speed and memory usage, making it a popular choice for large-scale machine learning tasks. LightGBM uses gradient boosting algorithms to build models, which iteratively improve on the performance of weak learners by adding new decision trees to the ensemble.
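
To illustrate the boosting idea itself, here is a deliberately simplified sketch: gradient boosting for regression with squared-error loss, using one-split trees (stumps) on made-up data. It is not LightGBM's actual algorithm, which uses far more elaborate tree construction and, in this notebook, a binary objective. All names in the sketch are hypothetical.

```r
# Toy 1-D regression data (illustrative only)
x <- 1:20
y <- sin(x / 3)

# Fit a decision stump (a single split) to the residuals by least squares
fit_stump <- function(x, r) {
  best <- NULL
  for (s in sort(unique(x))[-1]) {
    left <- x < s
    sse <- sum((r[left] - mean(r[left]))^2) + sum((r[!left] - mean(r[!left]))^2)
    if (is.null(best) || sse < best$sse) {
      best <- list(split = s, left = mean(r[left]), right = mean(r[!left]), sse = sse)
    }
  }
  best
}

predict_stump <- function(stump, x) {
  ifelse(x < stump$split, stump$left, stump$right)
}

learning_rate <- 0.1
n_rounds <- 50

# Start from a constant prediction
pred <- rep(mean(y), length(y))
mse_before <- mean((y - pred)^2)

for (m in seq_len(n_rounds)) {
  # Residuals are the negative gradient of the loss 0.5 * (y - pred)^2
  res <- y - pred
  stump <- fit_stump(x, res)
  # Add a shrunken version of the new weak learner to the ensemble
  pred <- pred + learning_rate * predict_stump(stump, x)
}

mse_after <- mean((y - pred)^2)
print(c(before = mse_before, after = mse_after))
```

Each round fits a weak learner to what the current ensemble still gets wrong, so the training error shrinks step by step.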

In [None]:
lgbm <- models[["Lightgbm"]]

##### Hyperparameter Optimization

Grid search with cross-validation is a technique used to fine-tune a model's hyperparameters in order to improve its performance. In essence, it involves searching over a range of values for each hyperparameter, evaluating every combination with cross-validation, and keeping the combination that yields the best results.

In [None]:
f1_metric <- function(preds, dtrain) {
  labels <- dtrain$construct()$get_field("label")

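  predictions_binary <- ifelse(preds > 0.9, 1, 0)
  # Fix the factor levels so both classes are always present,
  # and score grassland (1) as the positive class
  cm <- caret::confusionMatrix(
    factor(predictions_binary, levels = c(0, 1)),
    factor(labels, levels = c(0, 1)),
    positive = "1"
  )
  f1 <- unname(cm$byClass["F1"])
  # F1 is undefined when no sample is predicted as grassland; count that as 0
  if (is.na(f1)) f1 <- 0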

  return(list(name = "f1", value = f1, higher_better = TRUE))
}

# Define the parameter grid
grid_params <- expand.grid(
  learning_rate = c(0.1, 0.01),
  num_leaves = c(20, 30),
  max_depth = c(7, 10, 14)
)

# Initialize variables to store the best results
best_params <- NULL
best_f1 <- -Inf

# Grid search loop
for (i in 1:nrow(grid_params)) {
  params <- list(
    objective = "binary",
    metric = "None",  # No built-in metric, using custom metric
    learning_rate = grid_params$learning_rate[i],
    num_leaves = grid_params$num_leaves[i],
    max_depth = grid_params$max_depth[i]
  )
  
  
  # Perform cross-validation with the custom F1 metric
  cv_result <- lgb.cv(
    params = params,
    data = lgb.Dataset(data = as.matrix(X_train), label = as.numeric(as.character(y_train))),
    nfold = 5,
    eval = f1_metric,
    verbose = 0
  )
  
  best_f1_iter <- max(cv_result$best_score)
  
  # Update best parameters if current F1 score is higher
  if (best_f1_iter > best_f1) {
    best_f1 <- best_f1_iter
    best_params <- params
  }
}


In [None]:
# Print the best parameters and the corresponding F1 score
print("Best parameters:")
print(best_params)

lgbm <- train_lightgbm(X_train, y_train, custom_params = best_params)

In [None]:
evaluate_lgbm <- function(model, x_test, y_test) {

  # Make predictions
  predictions <- predict(model, as.matrix(x_test))
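  predictions_binary <- ifelse(predictions > 0.9, 1, 0)

  # Calculate the confusion matrix using the caret package,
  # treating grassland (1) as the positive class
  cm <- caret::confusionMatrix(
    factor(predictions_binary, levels = c(0, 1)),
    factor(y_test, levels = c(0, 1)),
    positive = "1"
  )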

  # Extract relevant metrics from the confusion matrix
  accuracy <- cm$overall["Accuracy"]
  precision <- cm$byClass["Precision"]
  recall <- cm$byClass["Recall"]
  f1 <- cm$byClass["F1"]
  specificity <- cm$byClass["Specificity"]
  
  # Return a list of metrics
  return(list(
    accuracy = accuracy,
    precision = precision,
    recall = recall,
    f1 = f1,
    specificity = specificity
  ))
}

In [None]:
evaluate_lgbm(lgbm, X_test, y_test)

In [None]:
evaluate_lgbm(lgbm, x_validation, y_validation)

In [None]:
plot_confusion_matrix <- function(model, x_validation, y_validation) {
    # Factor levels are "0" and "1", so map each level to its class name explicitly
    class_names <- c("0" = "No Grassland", "1" = "Grassland")

    predictions <- predict(model, as.matrix(x_validation))
    predictions_binary <- ifelse(predictions > 0.9, 1, 0)

    cm <- confusionMatrix(
        factor(predictions_binary, levels = c(0, 1)),
        factor(y_validation, levels = c(0, 1))
    )
    # cm$table has predictions in rows and references in columns;
    # normalize so that the values for each predicted class sum to 1
    cm_normalized <- prop.table(cm$table, 1)
    cm_data <- as.data.frame(cm_normalized)
    colnames(cm_data) <- c("Predicted", "True", "Freq")

    ggplot(cm_data, aes(x = Predicted, y = True, fill = Freq)) +
         coord_equal() +
         geom_tile() +
         geom_text(aes(label = sprintf("%.2f", Freq)), size = 5) +
         scale_fill_gradient(low = "white", high = "darkgreen") +
         scale_x_discrete(labels = class_names) +
         scale_y_discrete(labels = class_names) +
         labs(title = "LGBM", y = "True labels", x = "Predicted labels") +
         theme_minimal() +
         theme(
          legend.position = "none",
          axis.title = element_text(size = 14),
          axis.text = element_text(size = 12),
          plot.title = element_text(size = 16)
         )
}

In [None]:
plot_confusion_matrix(lgbm, x_validation, y_validation)

### Conclusions

After completing this first machine learning workflow for a limited area, we can draw several conclusions:

- **The best-performing models for this case were the Random Forest and the LightGBM, both tree-based ML algorithms.** 

- **Our preferred choice for this case is LightGBM, primarily due to its superior speed and memory efficiency, as well as its ability to effectively handle multi-dimensional datasets.**

- **Training the model with limited areas may lead to overfitting due to the correlation between adjacent pixels. In order to address this issue, our upcoming step involves scaling up the analysis and implementing measures to reduce the impact of adjacent pixel correlation.**




### Bonus: Training Neural Network 

This bonus section gives a quick demonstration of using a neural network for this classification task. Neural networks are not the focus of this notebook, but it may serve as an interesting starting point for further exploration of this area of machine learning.

In [None]:
# Initialize a Sequential model
model <- keras_model_sequential()

# Add layers to the model
model %>%
  layer_dense(units = 64, activation = 'relu', input_shape = ncol(X_train)) %>%
  layer_dense(units = 64, activation = 'relu') %>%
  layer_dense(units = 1, activation = 'sigmoid')

# Compile the model
model %>% compile(
  loss = 'binary_crossentropy',
  optimizer = optimizer_adam(),
  metrics = c('accuracy')
)

# Display the model's architecture
model %>% summary()

In [None]:
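# keras expects numeric matrices and vectors, so convert the data frames and the factor labels
hist <- model %>% fit(
  x = as.matrix(X_train),
  y = as.numeric(as.character(y_train)),
  validation_data = list(as.matrix(X_test), y_test),
  epochs = 20,
  batch_size = 100
)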

In [None]:
# Extract the accuracy and validation accuracy from the history object
acc <- hist$metrics$accuracy
val_acc <- hist$metrics$val_accuracy
epochs <- 1:length(acc)

# Create a data frame for plotting
df <- data.frame(
  Epoch = epochs,
  Accuracy = acc,
  Validation_Accuracy = val_acc
)

# Plotting the training and validation accuracy
ggplot(df, aes(x = Epoch)) +
  geom_line(aes(y = Accuracy, color = "Training Accuracy"), linetype = "solid") +
  geom_line(aes(y = Validation_Accuracy, color = "Validation Accuracy"), linetype = "dashed") +
  labs(title = "Training and Validation Accuracy",
       x = "Epoch",
       y = "Accuracy") +
  scale_color_manual(name = "Legend", values = c("Training Accuracy" = "blue", "Validation Accuracy" = "red")) +
  theme_minimal() +
  theme(legend.position = "bottom")