<div align="center">
    <h1><b>3. Classification</b></h1>
</div>

---

Classification is based on the <i>Solar Clearness Index (SCI)</i>, defined as the ratio of the measured 
<i>global horizontal irradiance (GHI)</i> to the value expected under a clear sky (<i>clearsky GHI</i>). 
<i>SCI</i> values are restricted to the range [0, 1].

For classification, SCI is split into two categories:  
- <b>clear</b>: SCI ≥ 0.8  
- <b>cloudy</b>: SCI < 0.8  

This threshold was chosen because it yields a reasonably balanced dataset (~63% "clear", ~37% "cloudy").
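The definition above can be illustrated with a minimal sketch in base R. The column names `ghi` and `clearsky_ghi`, the toy values, and the choice to return `NA` when the clear-sky value is zero are all assumptions made for this example; they are not taken from the dataset used in this notebook.

```r
# Toy data; `ghi` and `clearsky_ghi` are hypothetical column names.
toy <- data.frame(
  ghi          = c(850, 300, 0, 1020),
  clearsky_ghi = c(900, 800, 0, 1000)
)

# SCI = measured GHI / clear-sky GHI, clipped to [0, 1].
# Assumption: rows with clear-sky GHI of 0 (e.g. night) get NA.
compute_sci <- function(ghi, clearsky_ghi) {
  ratio <- ifelse(clearsky_ghi > 0, ghi / clearsky_ghi, NA_real_)
  pmin(pmax(ratio, 0), 1)
}

toy$sci <- compute_sci(toy$ghi, toy$clearsky_ghi)

# Same 0.8 threshold as in the text: "clear" at or above it, "cloudy" below.
toy$sci_label <- ifelse(toy$sci >= 0.8, "clear", "cloudy")

print(toy)
```

Rows whose SCI is `NA` keep an `NA` label in this sketch, so they would need to be dropped or handled before training.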


In [3]:
library(arrow)
library(dplyr)
library(caret)
library(ranger)


In [1]:
TRANSFORMED_DATASET_PATH <- "../data/nsrdb_puerto_rico_2017_transformed.parquet"
TRAIN_DATASET_PATH <- "../data/nsrdb_puerto_rico_2017_train.parquet"
TEST_DATASET_PATH <- "../data/nsrdb_puerto_rico_2017_test.parquet"

THRESHOLD <- 0.8

In [5]:
df <- read_parquet(TRANSFORMED_DATASET_PATH)%>%
  as.data.frame()

In [6]:
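# Label each row by thresholding SCI and keep the result in `df`,
# so that later cells can use `sci_label` without re-reading the file.
df <- df %>%
  mutate(sci_label = case_when(
    sci >= THRESHOLD ~ "clear",
    sci < THRESHOLD ~ "cloudy"
  ))

write_parquet(df, TRANSFORMED_DATASET_PATH)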

In [7]:
df %>%
  group_by(sci_label) %>%
  summarise(count = n())

| sci_label | count    |
|:---------:|:--------:|
| clear     | 82042023 |
| cloudy    | 47268278 |
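As a quick check of the class balance, the counts printed above can be turned into percentage shares. This is only a small arithmetic sketch in base R; the counts are copied from the output above.

```r
# Class counts copied from the summary output above.
counts <- c(clear = 82042023, cloudy = 47268278)

# Share of each class in percent, rounded to one decimal place.
shares <- round(100 * prop.table(counts), 1)
print(shares)
```

The share of the majority class is also the accuracy a trivial "always clear" classifier would reach, which is what caret reports as the No Information Rate.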


In [None]:
set.seed(123)
train_index <- createDataPartition(df$sci_label, p = 0.7, list = FALSE)
train <- df[train_index, ]
test  <- df[-train_index, ]

write_parquet(train, TRAIN_DATASET_PATH)
write_parquet(test, TEST_DATASET_PATH)


In [9]:
train <- read_parquet(TRAIN_DATASET_PATH)%>%
  mutate(sci_label = factor(sci_label, levels = c("cloudy", "clear"))) %>%
  as.data.frame()

test <- read_parquet(TEST_DATASET_PATH)%>%
  mutate(sci_label = factor(sci_label, levels = c("cloudy", "clear"))) %>%
  as.data.frame()

In [8]:
features <- c(
  "air_temperature",
  "surface_albedo",
  "surface_pressure",
  "total_precipitable_water",
  "wind_speed"
)

In [10]:
train_scaled <- scale(train[, features])
test_scaled <- scale(test[, features],
                    center = attr(train_scaled, "scaled:center"),
                    scale  = attr(train_scaled, "scaled:scale"))
  
train_final <- data.frame(train_scaled, sci_label = train$sci_label)
test_final  <- data.frame(test_scaled, sci_label = test$sci_label)

In [None]:
cv_ctrl <- trainControl(
  method = "cv",
  number = 5,
  classProbs = TRUE,
  summaryFunction = defaultSummary,
  savePredictions = "final"
)


<br><br>

<div align="center">
   <h3><b>Logistic Regression</b></h3>
</div>

---

<h5>Scenario 1 – Classical logistic regression</h5>
<ul>
   <li>Model without regularization: <code>glm</code></li>
</ul>
<h5>Scenario 2 – Ridge regression (L2 regularization)</h5>
<ul>
   <li>The model uses <code>glmnet</code> with <code>alpha = 0</code>.</li>
   <li>Scenario 2.1 <code>lambda=0.01</code></li>
   <li>Scenario 2.2 <code>lambda=0.1</code></li>
   <li>Scenario 2.3 <code>lambda=1</code></li>
</ul>
<h5>Scenario 3 – Lasso regression (L1 regularization)</h5>
<ul>
   <li>The model uses <code>glmnet</code> with <code>alpha = 1</code>.</li>
   <li>Scenario 3.1 <code>lambda=0.01</code></li>
   <li>Scenario 3.2 <code>lambda=0.1</code></li>
   <li>Scenario 3.3 <code>lambda=1</code></li>
</ul>
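
The regularized scenarios differ only in the penalty term. As a reference, and assuming the standard elastic-net formulation that <code>glmnet</code> documents for binomial models, the fitted objective can be written as follows. The symbols $N$, $x_i$, $y_i \in \{0, 1\}$, $\beta_0$ and $\beta$ are notation introduced for this sketch.

$$
\min_{\beta_0,\,\beta}\;
-\frac{1}{N}\sum_{i=1}^{N}\Big[\,y_i\,(\beta_0 + x_i^{\top}\beta) - \log\!\big(1 + e^{\beta_0 + x_i^{\top}\beta}\big)\Big]
\;+\; \lambda\left[\frac{1-\alpha}{2}\,\lVert\beta\rVert_2^{2} + \alpha\,\lVert\beta\rVert_1\right]
$$

With $\alpha = 0$ only the squared L2 term remains (ridge), and with $\alpha = 1$ only the L1 term remains (lasso). Larger $\lambda$ shrinks the coefficients more strongly in both cases.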
<br>

In [9]:
# -------------------------------
# Logistic Regression - scenario 1
# -------------------------------

log_model <- train(
  sci_label ~ .,
  data = train_final,
  method = "glm",
  family = "binomial",
  trControl = cv_ctrl,
  metric = "Accuracy"
)

cat("\n--- Feature Coefficients ---\n")
print(coef(log_model$finalModel))

log_pred <- predict(log_model, newdata = test_final[, features])

cat("\n--- Confusion Matrix ---\n")
cm <- confusionMatrix(log_pred, test_final$sci_label)
print(cm)

cat("\n--- Model Summary ---\n")
print(log_model)



--- Feature Coefficients ---
             (Intercept)          air_temperature           surface_albedo 
              0.62925555               0.07815943              -0.11393684 
        surface_pressure total_precipitable_water               wind_speed 
              0.40867042              -0.79388933              -0.08732995 

--- Confusion Matrix ---
Confusion Matrix and Statistics

          Reference
Prediction  cloudy   clear
    cloudy 1146200  686392
    clear  1690552 4235473
                                         
               Accuracy : 0.6936         
                 95% CI : (0.6933, 0.694)
    No Information Rate : 0.6344         
    P-Value [Acc > NIR] : < 2.2e-16      
                                         
                  Kappa : 0.286          
                                         
 Mcnemar's Test P-Value : < 2.2e-16      
                                         
            Sensitivity : 0.4041         
            Specificity : 0.8605         
  

In [None]:
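# ----------------------------------------
# Logistic Regression - Ridge and Lasso (scenarios 2 and 3)
# ----------------------------------------

# The unscaled `train` data is used here: glmnet standardizes predictors
# internally by default and reports coefficients on the original scale.
glmnet_model <- train(
  sci_label ~ ., 
  data = train[, c(features, "sci_label")],
  method = "glmnet",
  trControl = cv_ctrl,
  # alpha = 0 is ridge (L2), alpha = 1 is lasso (L1); three lambda values each
  tuneGrid = expand.grid(alpha = c(0, 1), lambda = c(0.01, 0.1, 1)),
  metric = "Accuracy"
)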

cat("\n--- Best params (alpha, lambda) ---\n")
print(glmnet_model$bestTune)

cat("\n--- Coefficients for best model ---\n")
print(coef(glmnet_model$finalModel, s = glmnet_model$bestTune$lambda))

glmnet_pred <- predict(glmnet_model, newdata = test[, features])

cat("\n--- Confusion Matrix ---\n")
cm <- confusionMatrix(glmnet_pred, test$sci_label)
print(cm)

cat("\n--- Model Summary ---\n")
print(glmnet_model)

glmnet_model$results


--- Best params (alpha, lambda) ---
  alpha lambda
1     0   0.01

--- Coefficients for best model ---
6 x 1 sparse Matrix of class "dgCMatrix"
                                s=0.01
(Intercept)              -1.169190e+01
air_temperature           3.359050e-02
surface_albedo           -4.530493e-03
surface_pressure          1.513484e-03
total_precipitable_water -7.122313e-04
wind_speed               -4.878482e-03

--- Confusion Matrix ---
Confusion Matrix and Statistics

          Reference
Prediction  cloudy   clear
    cloudy 1074570  612701
    clear  1762182 4309164
                                          
               Accuracy : 0.6939          
                 95% CI : (0.6936, 0.6942)
    No Information Rate : 0.6344          
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.2782          
                                          
 Mcnemar's Test P-Value : < 2.2e-16       
                                   

|   | alpha | lambda | Accuracy  | Kappa        | AccuracySD   | KappaSD      |
|:-:|:-----:|:------:|:---------:|:------------:|:------------:|:------------:|
| 1 | 0     | 0.01   | 0.6935412 | 0.2772558    | 0.0001891443 | 0.000360693  |
| 2 | 0     | 0.1    | 0.6830016 | 0.2025193    | 0.0001035228 | 0.0002865672 |
| 3 | 0     | 1.0    | 0.6343961 | 8.370556e-05 | 2.349362e-06 | 8.443386e-06 |
| 4 | 1     | 0.01   | 0.6917243 | 0.2705152    | 0.0002335733 | 0.0004690767 |
| 5 | 1     | 0.1    | 0.6347748 | 0.001409552  | 1.783779e-05 | 6.178135e-05 |
| 6 | 1     | 1.0    | 0.634374  | 0.0          | 9.974149e-08 | 0.0          |


<br><br>

<div align="center">
   <h3><b>Random Forest</b></h3>
</div>

---

In [None]:
mtry_values <- c(2, 3)

for (mtry_val in mtry_values) {
  cat("\n==============================\n")
  cat("Training Random Forest with mtry =", mtry_val, "\n")
  
  rf_model <- train(
    sci_label ~ ., 
    data = train_final[, c(features, "sci_label")],
    method = "ranger",
    trControl = cv_ctrl,
    metric = "Accuracy",
    tuneGrid = data.frame(mtry = mtry_val, splitrule = "gini", min.node.size = 1),
    num.trees = 50,
    importance = "impurity"
  )
  
  rf_pred <- predict(rf_model, newdata = test_final[, features])
  cm_rf <- confusionMatrix(rf_pred, test_final$sci_label)
  
  cat("\n--- Confusion Matrix ---\n")
  print(cm_rf)
  
  cat("\n--- Feature Importance ---\n")
  print(rf_model$finalModel$variable.importance)

  cat("\n--- Cross-Validation Results ---\n")
  print(rf_model$results)
  
  cat("\n--- Best Parameters ---\n")
  print(rf_model$bestTune)
  
  cat("\n--- Model Summary ---\n")
  print(rf_model)
  cat("==============================\n")
}



Training Random Forest with mtry = 2 

--- Confusion Matrix ---
Confusion Matrix and Statistics

          Reference
Prediction cloudy  clear
    cloudy 227210 104056
    clear

<br><br>

<div align="center">
   <h3><b>Classification Results</b></h3>
</div>

---

<br><br>


<i>Table 2</i> shows the distribution of the target variable.

<div align="center">


| Label                    | Count               | Percentage |
|:------------------------:|:-------------------:|:----------:|
| clear                    | 82042023            |   63.4 %   |
| cloudy                   | 47268278            |   36.6 %   |


<i>Table 2</i>

<br>

</div>

The following predictor features were selected: **air_temperature, surface_albedo, surface_pressure, total_precipitable_water, wind_speed**.

Model performance was estimated using k-fold cross-validation with k = 5.
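
To make the procedure concrete, here is a minimal base-R sketch of how rows can be assigned to 5 folds, with each fold serving once as the validation set. The row count `n` is a toy value chosen for illustration. In the notebook itself the folds are built internally by caret (which, as far as I understand, also balances them by class), so this shows the idea rather than caret's exact procedure.

```r
set.seed(123)

n <- 20  # toy number of rows (assumption for this sketch)
k <- 5   # number of folds, as in the text

# Assign each row to one of k folds of (almost) equal size, in random order.
fold_id <- sample(rep(seq_len(k), length.out = n))

for (fold in seq_len(k)) {
  val_idx   <- which(fold_id == fold)  # rows held out in this round
  train_idx <- which(fold_id != fold)  # rows used for fitting
  cat("fold", fold, ": train =", length(train_idx),
      ", validation =", length(val_idx), "\n")
}
```

The reported cross-validation accuracy is then the average of the k validation scores.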

The performance of the classification methods under different parameter settings (Logistic Regression, Random Forest, and SVM) is shown below in <i>Table 3</i>. For each method the table reports **Accuracy, Precision, Specificity, Sensitivity, and F1 score**.
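
These metrics can all be derived from a 2×2 confusion matrix. The sketch below uses the four counts from the unregularized logistic-regression confusion matrix printed earlier in this section and treats "cloudy" as the positive class, which is how caret reported Sensitivity and Specificity there. The variable names are introduced for this example only.

```r
# Counts copied from the glm confusion matrix (positive class: "cloudy").
tp <- 1146200  # predicted cloudy, reference cloudy
fp <- 686392   # predicted cloudy, reference clear
fn <- 1690552  # predicted clear,  reference cloudy
tn <- 4235473  # predicted clear,  reference clear

accuracy    <- (tp + tn) / (tp + fp + fn + tn)
sensitivity <- tp / (tp + fn)  # recall of the positive class
specificity <- tn / (tn + fp)
precision   <- tp / (tp + fp)
f1          <- 2 * precision * sensitivity / (precision + sensitivity)

print(round(c(Accuracy = accuracy, Precision = precision,
              Specificity = specificity, Sensitivity = sensitivity,
              F1 = f1), 4))
```

Accuracy, Sensitivity and Specificity computed this way agree with the values caret printed for that model; Precision and F1 follow from the same four counts.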

> Note: the number of trees (**num.trees**) for Random Forest was limited to 100 because of memory constraints on the machine where the model was trained, using the <i>ranger</i> library.

<br><br>

<div align="center">

| Method                   |Parameters                           | Accuracy  | Precision   | Specificity   | Sensitivity  | F1 |
|:------------------------:|:-----------------------------------:|:---------:|:-----------:|:-------------:|:------------:|:--:|
| Logistic Regression      | no regularization (glm, binomial)   | 0         | 0           | 0             | 0            | 1  |
| Logistic Regression      | λ = 0.01,    α = 0                  | 0         | 0           | 0             | 0            | 1  |
| Logistic Regression      | λ = 0.1,     α = 0                  | 0         | 0           | 0             | 0            | 1  |
| Logistic Regression      | λ = 1.0,     α = 0                  | 0         | 0           | 0             | 0            | 1  |
| Logistic Regression      | λ = 0.01,    α = 1                  | 0         | 0           | 0             | 0            | 1  |
| Logistic Regression      | λ = 0.1,     α = 1                  | 0         | 0           | 0             | 0            | 1  |
| Logistic Regression      | λ = 1.0,     α = 1                  | 0         | 0           | 0             | 0            | 1  |
| Random Forest            | trees = 100, mtry = 2               | 0         | 0           | 0             | 0            | 1  |
| Random Forest            | trees = 100, mtry = 3               | 0         | 0           | 0             | 0            | 1  |


<i>Table 3</i>

</div>