<div align="center">
    <h1><b>3. Classification</b></h1>
</div>

---

Classification is based on the <i>Solar Clearness Index (SCI)</i>, defined as the ratio of the measured 
<i>global horizontal irradiance (GHI)</i> to the value expected under a clear sky (<i>clearsky GHI</i>). 
<i>SCI</i> values are restricted to the range [0, 1].

For classification, SCI is split into two categories:  
- <b>clear</b>: SCI ≥ 0.8  
- <b>cloudy</b>: SCI < 0.8  
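
The definition above can be illustrated with a minimal sketch. The GHI values below are made up for illustration and are not taken from the dataset; treating rows with zero clear-sky GHI (night) as `NA` is an assumption, since the text does not say how such rows are handled.

```r
# Hypothetical example values in W/m^2 (not from the NSRDB data)
ghi          <- c(850, 300, 990, 120)
clearsky_ghi <- c(900, 880, 950, 0)

THRESHOLD <- 0.8

# Ratio of measured to clear-sky GHI.
# Assumption: rows with clearsky_ghi == 0 have no defined ratio and become NA.
sci <- ifelse(clearsky_ghi > 0, ghi / clearsky_ghi, NA_real_)

# Restrict SCI to the range [0, 1]
sci <- pmin(pmax(sci, 0), 1)

# Two-class label: "clear" if SCI >= 0.8, otherwise "cloudy" (NA stays NA)
sci_label <- ifelse(sci >= THRESHOLD, "clear", "cloudy")
print(data.frame(ghi, clearsky_ghi, sci, sci_label))
```

The third row shows the clipping step: the raw ratio is slightly above 1 and is capped at 1 before labelling.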



In [7]:
library(arrow)
library(dplyr)
library(caret)
library(ranger)
library(kernlab)

In [None]:
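TRAIN_DATASET_PATH <- "../data/nsrdb_puerto_rico_2017_train.parquet"
# NOTE: TRANSFORMED_DATASET_PATH and TEST_DATASET_PATH are used below but are not
# defined in this section; they are assumed to be set earlier in the notebook.

# SCI threshold separating "clear" from "cloudy" (see the class definition above)
THRESHOLD <- 0.8
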
features <- c(
  "air_temperature",
  "surface_albedo",
  "surface_pressure",
  "total_precipitable_water",
  "wind_speed"
)

In [None]:
df <- read_parquet(TRANSFORMED_DATASET_PATH) %>%
  as.data.frame() 

In [4]:
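# Label each row by its SCI and keep the result in `df`,
# so that later cells can use `sci_label` without re-reading the file
df <- df %>%
  mutate(sci_label = if_else(sci >= THRESHOLD, "clear", "cloudy"))

write_parquet(df, TRANSFORMED_DATASET_PATH)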

In [7]:
df %>%
  group_by(sci_label) %>%
  summarise(count = n())

sci_label,count
<chr>,<int>
clear,82042023
cloudy,47268278


In [6]:
set.seed(123)

train_index <- createDataPartition(df$sci_label, p = 0.7, list = FALSE)

train <- df[train_index, ]
test  <- df[-train_index, ]

write_parquet(train, TRAIN_DATASET_PATH)
write_parquet(test, TEST_DATASET_PATH)

In [3]:
# "cloudy" is the first factor level, so caret treats it as the positive class
train <- read_parquet(TRAIN_DATASET_PATH) %>%
  mutate(sci_label = factor(sci_label, levels = c("cloudy", "clear"))) %>%
  as.data.frame()

test <- read_parquet(TEST_DATASET_PATH) %>%
  mutate(sci_label = factor(sci_label, levels = c("cloudy", "clear"))) %>%
  as.data.frame()

In [5]:
train_scaled <- scale(train[, features])
test_scaled <- scale(test[, features],
                    center = attr(train_scaled, "scaled:center"),
                    scale  = attr(train_scaled, "scaled:scale"))
  
train_final <- data.frame(train_scaled, sci_label = train$sci_label)
test_final  <- data.frame(test_scaled, sci_label = test$sci_label)

In [6]:
cv_ctrl <- trainControl(
  method = "cv",
  number = 5,
  classProbs = TRUE,
  summaryFunction = defaultSummary,
  savePredictions = "final"
)


<br><br>

<div align="center">
   <h3><b>Logistic Regression</b></h3>
</div>

---

<h5>Scenario 1 – Standard logistic regression</h5>
<ul>
   <li>Model without regularization: <code>glm</code></li>
</ul>
<h5>Scenario 2 – Ridge regression (L2 regularization)</h5>
<ul>
   <li>The model uses <code>glmnet</code> with <code>alpha = 0</code>.</li>
   <li>Scenario 2.1: <code>lambda = 0.01</code></li>
   <li>Scenario 2.2: <code>lambda = 0.1</code></li>
   <li>Scenario 2.3: <code>lambda = 1</code></li>
</ul>
<h5>Scenario 3 – Lasso regression (L1 regularization)</h5>
<ul>
   <li>The model uses <code>glmnet</code> with <code>alpha = 1</code>.</li>
   <li>Scenario 3.1: <code>lambda = 0.01</code></li>
   <li>Scenario 3.2: <code>lambda = 0.1</code></li>
   <li>Scenario 3.3: <code>lambda = 1</code></li>
</ul>
<br>
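
For reference, the ridge and lasso scenarios can be read as two special cases of one penalized objective. The form below follows the elastic-net objective documented for the binomial family in `glmnet`; the symbols $N$, $x_i$, $y_i \in \{0,1\}$, $\beta_0$ and $\beta$ are notation introduced here and do not appear elsewhere in the notebook.

```latex
\min_{\beta_0,\,\beta}\;
-\frac{1}{N}\sum_{i=1}^{N}
\Big[\, y_i\,(\beta_0 + x_i^{\top}\beta)
      - \log\!\big(1 + e^{\beta_0 + x_i^{\top}\beta}\big) \Big]
\;+\; \lambda\Big[\tfrac{1-\alpha}{2}\,\lVert\beta\rVert_2^{2}
      + \alpha\,\lVert\beta\rVert_1\Big]
```

With $\alpha = 0$ only the squared L2 term remains (ridge), with $\alpha = 1$ only the L1 term remains (lasso), and a larger $\lambda$ shrinks the coefficients more strongly in both cases.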

In [9]:
# -------------------------------
# Logistic Regression - scenario 1
# -------------------------------

log_model <- train(
  sci_label ~ .,
  data = train_final,
  method = "glm",
  family = "binomial",
  trControl = cv_ctrl,
  metric = "Accuracy"
)

cat("\n--- Feature Coefficients ---\n")
print(coef(log_model$finalModel))

log_pred <- predict(log_model, newdata = test_final[, features])

cat("\n--- Confusion Matrix ---\n")
cm <- confusionMatrix(log_pred, test_final$sci_label)
print(cm)

cat("\n--- Model Summary ---\n")
print(log_model)



--- Feature Coefficients ---
             (Intercept)          air_temperature           surface_albedo 
              0.62925555               0.07815943              -0.11393684 
        surface_pressure total_precipitable_water               wind_speed 
              0.40867042              -0.79388933              -0.08732995 

--- Confusion Matrix ---
Confusion Matrix and Statistics

          Reference
Prediction  cloudy   clear
    cloudy 1146200  686392
    clear  1690552 4235473
                                         
               Accuracy : 0.6936         
                 95% CI : (0.6933, 0.694)
    No Information Rate : 0.6344         
    P-Value [Acc > NIR] : < 2.2e-16      
                                         
                  Kappa : 0.286          
                                         
 Mcnemar's Test P-Value : < 2.2e-16      
                                         
            Sensitivity : 0.4041         
            Specificity : 0.8605         
  

In [None]:
# ----------------------------------------
# Logistic Regression - Ridge & Lasso
# ----------------------------------------


alpha_values <- c(0, 1)
lambda_values <- c(0.01, 0.1, 1)

glmnet_results <- list()

for (a in alpha_values) {
  for (l in lambda_values) {
    
    cat("\n--- Training Logistic Regression with alpha =", a, "and lambda =", l, "---\n")
    
    glmnet_grid <- expand.grid(alpha = a, lambda = l)
    
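    # Unscaled predictors are passed here; glmnet standardizes them
    # internally by default (standardize = TRUE)
    glmnet_model <- train(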
      sci_label ~ ., 
      data = train[, c(features, "sci_label")],
      method = "glmnet",
      trControl = cv_ctrl,
      tuneGrid = glmnet_grid,
      metric = "Accuracy"
    )
    
    glmnet_pred <- predict(glmnet_model, newdata = test[, features])
    
    cm <- confusionMatrix(glmnet_pred, test$sci_label)
    
    glmnet_results[[paste0("alpha_", a, "_lambda_", l)]] <- list(
      model = glmnet_model,
      confusion = cm
    )
    
    cat("\nConfusion Matrix:\n")
    print(cm$table)
    
    cat("\nAccuracy:", cm$overall["Accuracy"], "\n")
    cat("Precision:", cm$byClass["Precision"], "\n")
    cat("Recall (Sensitivity):", cm$byClass["Recall"], "\n")
    cat("F1 Score:", cm$byClass["F1"], "\n")
  }
}



--- Training Logistic Regression with alpha = 0 and lambda = 0.01 ---

Confusion Matrix:
          Reference
Prediction cloudy clear
    cloudy   5407  3082
    clear    8746 21557

Accuracy: 0.6950918 
Precision: 0.6369419 
Recall (Sensitivity): 0.3820391 
F1 Score: 0.477608 

--- Training Logistic Regression with alpha = 0 and lambda = 0.1 ---

Confusion Matrix:
          Reference
Prediction cloudy clear
    cloudy   3182  1359
    clear   10971 23280

Accuracy: 0.682151 
Precision: 0.7007267 
Recall (Sensitivity): 0.2248287 
F1 Score: 0.3404301 

--- Training Logistic Regression with alpha = 0 and lambda = 1 ---

Confusion Matrix:
          Reference
Prediction cloudy clear
    cloudy      1     0
    clear   14152 24639

Accuracy: 0.6351825 
Precision: 1 
Recall (Sensitivity): 7.06564e-05 
F1 Score: 0.0001413028 

--- Training Logistic Regression with alpha = 1 and lambda = 0.01 ---

Confusion Matrix:
          Reference
Prediction cloudy clear
    cloudy   5275  3002
    clear  

<br><br>

<div align="center">
   <h3><b>Random Forest</b></h3>
</div>

---

In [None]:
mtry_values <- c(2, 3)

for (mtry_val in mtry_values) {
  cat("\n==============================\n")
  cat("Training Random Forest with mtry =", mtry_val, "\n")
  
  rf_model <- train(
    sci_label ~ ., 
    data = train_final[, c(features, "sci_label")],
    method = "ranger",
    trControl = cv_ctrl,
    metric = "Accuracy",
    tuneGrid = data.frame(mtry = mtry_val, splitrule = "gini", min.node.size = 1),
    num.trees = 100,
    importance = "impurity"
  )
  
  rf_pred <- predict(rf_model, newdata = test_final[, features])
  cm_rf <- confusionMatrix(rf_pred, test_final$sci_label)
  
  cat("\n--- Confusion Matrix ---\n")
  print(cm_rf)
  
  cat("\n--- Feature Importance ---\n")
  print(rf_model$finalModel$variable.importance)

  cat("\n--- Cross-Validation Results ---\n")
  print(rf_model$results)
  
  cat("\n--- Best Parameters ---\n")
  print(rf_model$bestTune)
  
  cat("\n--- Model Summary ---\n")
  print(rf_model)
  cat("==============================\n")
}



Training Random Forest with mtry = 2 

--- Confusion Matrix ---
Confusion Matrix and Statistics

          Reference
Prediction cloudy  clear
    cloudy 227210 104056
    clear

<br><br>

<div align="center">
   <h3><b>Support Vector Machine (SVM)</b></h3>
</div>

---

In [None]:
C_values <- c(0.01, 0.1, 1, 10)  # the four values reported in the results table

svm_results <- list()

for (C_val in C_values) {
  
  cat("\n--- Training Linear SVM with C =", C_val, "---\n")
  
  svm_grid <- expand.grid(C = C_val)
  
  svm_model <- train(
    sci_label ~ .,
    data = train_final[, c(features, "sci_label")],
    method = "svmLinear",
    trControl = cv_ctrl,         
    metric = "Accuracy",
    tuneGrid = svm_grid
  )
  
  svm_pred <- predict(svm_model, newdata = test_final[, features])
  
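  cm <- confusionMatrix(svm_pred, test_final$sci_label)

  # Keep the confusion matrix so the summary printed after the loop is not empty
  svm_results[[paste0("C_", C_val)]] <- cm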
  
  cat("\nConfusion Matrix:\n")
  print(cm$table)
  
  cat("\nAccuracy:", cm$overall["Accuracy"], "\n")
  cat("Precision:", cm$byClass["Precision"], "\n")
  cat("Recall (Sensitivity):", cm$byClass["Recall"], "\n")
  cat("F1 Score:", cm$byClass["F1"], "\n")
  
}

cat("\n=== SUMMARY OF ALL Linear SVM RESULTS ===\n")
print(svm_results)


<br><br>

<div align="center">
   <h3><b>Relationship between parameter values and classification performance</b></h3>

---
<br> <br>

<table>
  <tr>
    <th>Method</th>
    <th>Parameters</th>
    <th>Accuracy</th>
    <th>Precision</th>
    <th>Specificity</th>
    <th>Sensitivity</th>
    <th>F1</th>
  </tr>
  <tr>
    <td rowspan="7">Logistic Regression</td>
    <td>no regularization (glm, binomial)</td>
    <td>0.694</td><td>0.625</td><td>0.861</td><td>0.404</td><td>0.491</td>
  </tr>
  <tr>
    <td>λ = 0.01, α = 0</td><td>0.695</td><td>0.637</td><td>0.875</td><td>0.382</td><td>0.478</td>
  </tr>
  <tr>
    <td>λ = 0.1, α = 0</td><td>0.682</td><td>0.701</td><td>0.945</td><td>0.225</td><td>0.340</td>
  </tr>
  <tr>
    <td>λ = 1.0, α = 0</td><td>0.635</td><td>1</td><td>1</td><td>0.00007</td><td>0.00014</td>
  </tr>
  <tr>
    <td>λ = 0.01, α = 1</td><td>0.694</td><td>0.637</td><td>0.878</td><td>0.373</td><td>0.470</td>
  </tr>
  <tr>
    <td>λ = 0.1, α = 1</td><td>0.636</td><td>1</td><td>1</td><td>0.00113</td><td>0.00226</td>
  </tr>
  <tr>
    <td>λ = 1.0, α = 1</td><td>0.635</td><td>NA</td><td>1</td><td>0</td><td>0</td>
  </tr>
  <tr>
    <td rowspan="2">Random Forest</td>
    <td>trees = 100, mtry = 2</td><td>0.740</td><td>0.686</td><td>0.859</td><td>0.534</td><td>0.600</td>
  </tr>
  <tr>
    <td>trees = 100, mtry = 3</td><td>0.731</td><td>0.657</td><td>0.832</td><td>0.557</td><td>0.603</td>
  </tr>
  <tr>
    <td rowspan="4">Linear SVM</td>
    <td>C = 0.01</td><td>0.695</td><td>0.628</td><td>0.862</td><td>0.405</td><td>0.493</td>
  </tr>
  <tr>
    <td>C = 0.1</td><td>0.695</td><td>0.628</td><td>0.862</td><td>0.405</td><td>0.493</td>
  </tr>
  <tr>
    <td>C = 1</td><td>0.695</td><td>0.628</td><td>0.862</td><td>0.404</td><td>0.492</td>
  </tr>
  <tr>
    <td>C = 10</td><td>0.695</td><td>0.628</td><td>0.862</td><td>0.404</td><td>0.492</td>
  </tr>
</table>

<i>Table 3</i>

</div>
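
The metric columns of the table follow directly from the confusion-matrix counts. As a sketch, the counts below are copied from the scenario 1 confusion matrix printed earlier, with "cloudy" as the positive class (the first factor level); the variable names are chosen here for illustration.

```r
# Counts from the scenario 1 (glm) confusion matrix; positive class = "cloudy"
tp <- 1146200  # predicted cloudy, reference cloudy
fp <- 686392   # predicted cloudy, reference clear
fn <- 1690552  # predicted clear,  reference cloudy
tn <- 4235473  # predicted clear,  reference clear

accuracy    <- (tp + tn) / (tp + fp + fn + tn)
precision   <- tp / (tp + fp)
sensitivity <- tp / (tp + fn)   # recall for the positive class
specificity <- tn / (tn + fp)
f1          <- 2 * precision * sensitivity / (precision + sensitivity)

metrics <- round(c(accuracy = accuracy, precision = precision,
                   specificity = specificity, sensitivity = sensitivity,
                   f1 = f1), 3)
print(metrics)
```

The computed Accuracy, Sensitivity and Specificity should agree with the values caret printed for scenario 1 (0.6936, 0.4041 and 0.8605).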


<br> <br>

1. **Logistic Regression (Ridge & Lasso)**

- Performance is fairly stable for weak regularization (*λ* = 0.01, *α* = 0 or 1), with Accuracy around 0.694–0.695.
- As *λ* grows (e.g. *λ* = 1), the model becomes over-regularized and predicts almost exclusively the majority class (clear): Sensitivity and F1 drop to about 0, while Precision is 1 (ridge, a single cloudy prediction) or undefined (lasso, no cloudy predictions). Lasso already collapses at *λ* = 0.1.
- Among the regularized models, *λ* = 0.01, *α* = 0 gives the best balance between Accuracy, Precision and Sensitivity, with F1 ≈ 0.478; the unregularized model is slightly ahead on F1 (0.491).
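
To make the over-regularization effect concrete, the sketch below compares the accuracy reported for ridge with *λ* = 1 against a trivial classifier that always predicts "clear". The counts are copied from the corresponding confusion matrix above; the variable names are illustrative.

```r
# Reference totals from the ridge (alpha = 0, lambda = 1) confusion matrix
n_cloudy <- 1 + 14152   # column "cloudy"
n_clear  <- 0 + 24639   # column "clear"
n_total  <- n_cloudy + n_clear

# Trivial baseline: always predict the majority class "clear"
baseline_accuracy <- n_clear / n_total

# Accuracy of the over-regularized model: 1 correct cloudy + 24639 correct clear
model_accuracy <- (1 + 24639) / n_total

cat("Baseline accuracy:", baseline_accuracy, "\n")
cat("Model accuracy:   ", model_accuracy, "\n")
```

The two accuracies differ by a single correctly classified sample, which is why Accuracy alone hides the collapse and Sensitivity or F1 is needed to see it.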

<br>

2. **Random Forest**
- Highest Accuracy of all classification methods: 0.740 (trees = 100, mtry = 2).
- Sensitivity is about 0.534–0.557, so the model recognizes the minority class (cloudy) better than *Logistic Regression* does.
- The F1 score is also higher (0.600–0.603), which confirms a better balance between Precision and Sensitivity.

<br>

3. **Linear SVM**
- Accuracy is stable (~0.695) for all values of *C*, which shows that the model is largely insensitive to the regularization strength in this range.
- Linear SVM gives results similar to *Logistic Regression* with small *λ*, but it is worse than *Random Forest* at recognizing the minority class (Sensitivity ≈ 0.40 versus 0.53–0.56).

<br>

---

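- The best method for this dataset is *Random Forest* with 100 trees and mtry = 2: it gives the highest Accuracy (0.740) and an F1 (0.600) essentially equal to the best observed (0.603, mtry = 3), with a better balance between minority-class detection and overall accuracy than the linear models.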