# Project 2 - Formal Methods in Software Engineering

## Introduction

This project compares the performance of a neural network and a LightGBM classifier on the RCV1 dataset.
The neural network has a single hidden layer of 512 neurons, while the LightGBM model is tuned with the Optuna library for hyperparameter search. The two models are compared in terms of precision and recall.

### Running the project

Running this project requires Python 3.12.2, the version it was developed and tested with, so that everything behaves as it did for the author.

Recommended hardware:

- **Processor (CPU):** Intel Core i5 or an equivalent AMD processor
- **Memory (RAM):** 16 GB
- **Disk:** SSD with at least 100 GB of free space

The project runs on a CPU, but a GPU is recommended for faster training and requires only minimal changes to the code.
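
As a rough illustration of what such minimal changes usually look like in PyTorch, the sketch below picks a device and moves both the model and each batch onto it. The tiny linear model and the random batch are placeholders chosen for the example; they are not the project's network or the RCV1 data.

```python
import torch
import torch.nn as nn

# Use the GPU when one is available, otherwise fall back to the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Placeholder model (hypothetical); the project's own network would go here
toy_model = nn.Linear(8, 3).to(device)

# Every batch has to live on the same device as the model
batch = torch.randn(4, 8).to(device)
with torch.no_grad():
    outputs = toy_model(batch)

# Tensors must be moved back to the CPU before calling .numpy()
outputs_np = outputs.cpu().numpy()
print(device.type, outputs_np.shape)
```

In a training loop the same `.to(device)` call would be applied to both `data` and `target` of every batch, and `.cpu()` would be added before any `.numpy()` conversion.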


### Installing the required libraries

Running this project requires the following libraries (we installed them during the lab exercises):

- `scikit-learn`
- `scipy`
- `numpy`
- `torch`
- `lightgbm`
- `optuna`

You can install them with the following command in the command prompt (CMD):

```bash
pip install scikit-learn scipy numpy torch lightgbm optuna
```

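
To check that the installation worked, a small script along the following lines can list the installed version of each package. It is only a sketch: the helper name `installed_versions` is made up for this example, and it relies solely on the standard-library `importlib.metadata` module.

```python
from importlib.metadata import PackageNotFoundError, version

# Distribution names as they are passed to pip
REQUIRED = ["scikit-learn", "scipy", "numpy", "torch", "lightgbm", "optuna"]


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


for name, ver in installed_versions(REQUIRED).items():
    print(f"{name}: {ver or 'NOT INSTALLED'}")
```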

#### 1. Loading and preparing the data

```python
import numpy as np
from sklearn.datasets import fetch_rcv1
from sklearn.model_selection import train_test_split

# Load the RCV1 dataset
rcv1 = fetch_rcv1()

# Input features (sparse matrix)
X = rcv1.data

# Target values (sparse multilabel indicator matrix)
y = rcv1.target

# Split the data 50/50 into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=42)
```



#### 2. Implementing the neural network

```python
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, Dataset
from sklearn.metrics import precision_score, recall_score


# Rows are converted to dense one at a time, because converting the whole
# sparse matrix to dense at once raised a MemoryError
class SparseDataset(Dataset):
    def __init__(self, X, y):
        self.X = X
        self.y = y

    # Return the number of samples in the dataset
    def __len__(self):
        return self.X.shape[0]

    # Return the sample and its target for the given index
    def __getitem__(self, idx):
        # Convert the sparse row to a dense array and drop the extra dimension
        x = self.X[idx].toarray().squeeze()
        y = self.y[idx].toarray().squeeze()
        return torch.tensor(x, dtype=torch.float32), torch.tensor(y, dtype=torch.float32)


class SimpleNN(nn.Module):
    def __init__(self, input_dim, output_dim):
        super().__init__()
        self.fc1 = nn.Linear(input_dim, 512)  # single hidden layer with 512 neurons
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(512, output_dim)

    def forward(self, x):
        out = self.fc1(x)
        out = self.relu(out)
        out = self.fc2(out)  # raw logits; the sigmoid is applied in the loss
        return out


# Create DataLoaders for the training and test sets
train_dataset = SparseDataset(X_train, y_train)
test_dataset = SparseDataset(X_test, y_test)
train_loader = DataLoader(train_dataset, batch_size=64, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=64, shuffle=False)

input_dim = X_train.shape[1]
output_dim = y_train.shape[1]
model = SimpleNN(input_dim, output_dim)

# Loss function for multilabel classification
# Seen at https://stackoverflow.com/questions/59336899/which-loss-function-and-metrics-to-use-for-multi-label-classification-with-very
criterion = nn.BCEWithLogitsLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)
```
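
The memory error mentioned in the `SparseDataset` comment is easy to motivate with some back-of-the-envelope arithmetic. The sketch below assumes the RCV1 shape documented for `fetch_rcv1` (804,414 samples, 47,236 features) and 4 bytes per `float32` value; the helper `dense_size_gb` is introduced only for this example.

```python
def dense_size_gb(n_rows, n_cols, bytes_per_value=4):
    """Size in GB (10**9 bytes) of a dense n_rows x n_cols array."""
    return n_rows * n_cols * bytes_per_value / 1e9


# Shape of the full RCV1 feature matrix (assumed from the fetch_rcv1 docs)
N_SAMPLES, N_FEATURES = 804_414, 47_236

full = dense_size_gb(N_SAMPLES, N_FEATURES)       # whole dataset as float32
half = dense_size_gb(N_SAMPLES // 2, N_FEATURES)  # the 50% training split
batch = dense_size_gb(64, N_FEATURES)             # one batch of 64 rows

print(f"full: {full:.1f} GB, train half: {half:.1f} GB, batch: {batch * 1000:.1f} MB")
```

Under these assumptions the dense training split alone would need roughly 76 GB, far more than the recommended 16 GB of RAM, whereas a single dense batch of 64 rows needs only about 12 MB. That is why converting one row at a time inside `__getitem__` avoids the error.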

#### 3. Training the neural network

```python
num_epochs = 5
best_precision = 0

for epoch in range(num_epochs):
    model.train()
    # Iterate over every batch in the training set
    for data, target in train_loader:
        optimizer.zero_grad()
        outputs = model(data)
        loss = criterion(outputs, target)
        loss.backward()
        optimizer.step()

    # Evaluate after each epoch (note: the test set doubles as the validation set here)
    model.eval()
    with torch.no_grad():
        y_true = []
        y_pred = []
        for data, target in test_loader:
            outputs = model(data)
            # Sigmoid turns logits into probabilities; rounding thresholds them at 0.5
            predictions = torch.sigmoid(outputs).round()
            y_true.extend(target.numpy())
            y_pred.extend(predictions.numpy())

        precision = precision_score(np.array(y_true), np.array(y_pred), average='micro')
        recall = recall_score(np.array(y_true), np.array(y_pred), average='micro')

        # Keep the checkpoint with the highest precision seen so far
        if precision > best_precision:
            best_precision = precision
            torch.save(model.state_dict(), 'best_model.pth')

    print(f'Epoch [{epoch+1}/{num_epochs}], Precision: {precision:.4f}, Recall: {recall:.4f}')
```

```text
Epoch [1/5], Precision: 0.9073, Recall: 0.8208
Epoch [2/5], Precision: 0.9010, Recall: 0.8344
Epoch [3/5], Precision: 0.8931, Recall: 0.8401
Epoch [4/5], Precision: 0.8891, Recall: 0.8375
Epoch [5/5], Precision: 0.8825, Recall: 0.8377
```
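
In this log precision falls from epoch to epoch while recall mostly rises, and because the checkpoint is chosen by precision alone, the saved model is the one from epoch 1. As a purely illustrative aside, the sketch below computes the F1 score (the harmonic mean of precision and recall) from the logged values. F1 is not used anywhere in the project; it is introduced here only as an assumed alternative selection criterion.

```python
# (precision, recall) per epoch, copied from the training log
epoch_metrics = [
    (0.9073, 0.8208),
    (0.9010, 0.8344),
    (0.8931, 0.8401),
    (0.8891, 0.8375),
    (0.8825, 0.8377),
]


def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


f1_scores = [f1(p, r) for p, r in epoch_metrics]

# 1-based epoch numbers of the best checkpoint under each criterion
best_by_precision = max(range(len(epoch_metrics)), key=lambda i: epoch_metrics[i][0]) + 1
best_by_f1 = max(range(len(f1_scores)), key=lambda i: f1_scores[i]) + 1

for epoch, score in enumerate(f1_scores, start=1):
    print(f"Epoch {epoch}: F1 = {score:.4f}")
print(f"best by precision: epoch {best_by_precision}, best by F1: epoch {best_by_f1}")
```

With these numbers the two criteria disagree: precision favours epoch 1, while F1 favours epoch 2, so the choice of selection metric affects which checkpoint ends up in `best_model.pth`.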


#### 4. Training the LightGBM classifier with Optuna

```python
import lightgbm as lgb
import optuna

# Dense targets, for compatibility with the code below
y = rcv1.target.toarray()

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=42)

# Convert the label matrix to a 1D array, as required by the multiclass objective
# Note: np.argmax keeps only the first positive label of each multilabel sample
y_train_array = np.argmax(y_train, axis=1)
y_test_array = np.argmax(y_test, axis=1)

# Number of classes in RCV1
num_classes = y.shape[1]

print(f'Number of classes: {num_classes}')

def objective(trial):
    param = {
        'objective': 'multiclass',
        'num_class': num_classes,
        'metric': 'multi_logloss',
        'verbosity': -1,
        'boosting_type': 'gbdt',

        # I chose this learning_rate range because it seemed optimal to me.
        # With lr near 0.001 the model learns slowly but steadily and needs more time;
        # with lr near 0.05 it learns faster.
        'learning_rate': trial.suggest_float('learning_rate', 0.001, 0.05, log=True),

        # For the upper bound I first chose 256, as in class, but the model trained very slowly.
        # With 160 the precision stayed the same and training was faster.
        # The lower bound is 20 because, with a minimum of 2, no trial ever used fewer than 20 leaves,
        # and raising it sped up training.
        'num_leaves': trial.suggest_int('num_leaves', 20, 160),

        # Same reasoning as above: the upper bound was originally 20, but models with a depth above 13 had lower precision.
        # A larger value can lead to overfitting and a smaller one simplifies the tree,
        # so this range was ideal for reaching the precision I wanted.
        'max_depth': trial.suggest_int('max_depth', 4, 13),

        # Taken from class, but I raised the lower bound to 0.7 for training speed.
        # Lower value (0.7): using only a subset of the features in each boosting step makes the model
        # less prone to overfitting, because it does not rely too heavily on particular features.
        # Higher value (up to 1.0): using all features can increase precision, because the model has
        # access to all the information in the data at every boosting step.
        # However, it can also increase the risk of overfitting, especially when many features are mutually dependent.
        'feature_fraction': trial.suggest_float('feature_fraction', 0.7, 1.0),

        # Lower value (0.7): using only a subset of the samples in each boosting step makes the model
        # less prone to overfitting, because it does not rely too heavily on particular training samples.
        # It also makes the model more robust and improves generalization, since different steps see different samples.
        # Higher value (up to 1.0): using all samples can increase precision, but also the risk of overfitting.
        # Note: LightGBM applies bagging only when bagging_freq > 0 (the default is 0),
        # so as written this parameter has no effect.
        'bagging_fraction': trial.suggest_float('bagging_fraction', 0.7, 1.0)
    }

    lgb_train = lgb.Dataset(X_train, label=y_train_array)
    lgb_eval = lgb.Dataset(X_test, label=y_test_array, reference=lgb_train)

    # Reduced early stopping from 10 to 5 rounds; the result was the same
    # Reduced the number of boosting rounds from the default 100 to 50; same result
    callbacks = [lgb.early_stopping(stopping_rounds=5)]
    gbm = lgb.train(param, lgb_train, num_boost_round=50, valid_sets=lgb_eval, callbacks=callbacks)

    y_pred = gbm.predict(X_test)

    # Had to add the reshape because of a TypeError
    if y_pred.ndim == 1:
        y_pred = y_pred.reshape(-1, num_classes)

    # Predicted class = index of the highest predicted probability
    y_pred_labels = np.argmax(y_pred, axis=1)

    precision = precision_score(y_test_array, y_pred_labels, average='micro')
    return precision

# Must be 'maximize', because Optuna minimizes by default
study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=10)

best_params = study.best_params
# Had to add these; otherwise an error reported that the parameters were missing
best_params['objective'] = 'multiclass'
best_params['num_class'] = num_classes
best_params['metric'] = 'multi_logloss'
# Suppress unnecessary log messages
best_params['verbosity'] = -1

# Retrain the model with the best hyperparameters
lgb_train = lgb.Dataset(X_train, label=y_train_array)
lgb_eval = lgb.Dataset(X_test, label=y_test_array, reference=lgb_train)
final_gbm = lgb.train(best_params, lgb_train, num_boost_round=50, valid_sets=lgb_eval, callbacks=[lgb.early_stopping(stopping_rounds=5)])

y_pred = final_gbm.predict(X_test)

if y_pred.ndim == 1:
    y_pred = y_pred.reshape(-1, num_classes)  # -1 lets NumPy infer that dimension

y_pred_labels = np.argmax(y_pred, axis=1)
final_precision = precision_score(y_test_array, y_pred_labels, average='micro')

print(f'Best precision with LightGBM: {final_precision:.4f}')
```

```text
[I 2024-06-29 11:42:29,670] A new study created in memory with name: no-name-6ca5ad6c-1412-4003-9fea-add78fb51742

Number of classes: 103
Training until validation scores don't improve for 5 rounds
Did not meet early stopping. Best iteration is:
[50]	valid_0's multi_logloss: 2.21506

[I 2024-06-29 12:10:43,147] Trial 0 finished with value: 0.35751491147593156 and parameters: {'learning_rate': 0.0020024383282386355, 'num_leaves': 103, 'max_depth': 7, 'feature_fraction': 0.8441809732147931, 'bagging_fraction': 0.82269670434411}. Best is trial 0 with value: 0.35751491147593156.

Training until validation scores don't improve for 5 rounds
Did not meet early stopping. Best iteration is:
[50]	valid_0's multi_logloss: 2.11053

[I 2024-06-29 12:28:13,293] Trial 1 finished with value: 0.45962899700900284 and parameters: {'learning_rate': 0.0035195745803832212, 'num_leaves': 94, 'max_depth': 4, 'feature_fraction': 0.7754985367165826, 'bagging_fraction': 0.8711515501074998}. Best is trial 1 with value: 0.45962899700900284.

Training until validation scores don't improve for 5 rounds
Did not meet early stopping. Best iteration is:
[47]	valid_0's multi_logloss: 1.00386

[I 2024-06-29 12:50:32,581] Trial 2 finished with value: 0.7693923775568302 and parameters: {'learning_rate': 0.03771153693025783, 'num_leaves': 101, 'max_depth': 5, 'feature_fraction': 0.909732921629326, 'bagging_fraction': 0.8236992893643755}. Best is trial 2 with value: 0.7693923775568302.

Training until validation scores don't improve for 5 rounds
Did not meet early stopping. Best iteration is:
[50]	valid_0's multi_logloss: 1.32793

[I 2024-06-29 13:09:24,714] Trial 3 finished with value: 0.7115415693908858 and parameters: {'learning_rate': 0.01744440871588926, 'num_leaves': 25, 'max_depth': 4, 'feature_fraction': 0.997133648147889, 'bagging_fraction': 0.9833239284239321}. Best is trial 2 with value: 0.7693923775568302.

Training until validation scores don't improve for 5 rounds
Did not meet early stopping. Best iteration is:
[50]	valid_0's multi_logloss: 0.862909

[I 2024-06-29 13:53:13,027] Trial 4 finished with value: 0.8026986104170241 and parameters: {'learning_rate': 0.028353039884590363, 'num_leaves': 113, 'max_depth': 11, 'feature_fraction': 0.9168549463388005, 'bagging_fraction': 0.7179470646848696}. Best is trial 4 with value: 0.8026986104170241.

Training until validation scores don't improve for 5 rounds
Did not meet early stopping. Best iteration is:
[50]	valid_0's multi_logloss: 1.33778

[I 2024-06-29 14:31:33,388] Trial 5 finished with value: 0.7562971305820138 and parameters: {'learning_rate': 0.00985987551511969, 'num_leaves': 74, 'max_depth': 12, 'feature_fraction': 0.9229858758597357, 'bagging_fraction': 0.9678347639812017}. Best is trial 4 with value: 0.8026986104170241.

Training until validation scores don't improve for 5 rounds
Did not meet early stopping. Best iteration is:
[49]	valid_0's multi_logloss: 0.890921

[I 2024-06-29 15:06:00,900] Trial 6 finished with value: 0.7933800257081552 and parameters: {'learning_rate': 0.02952102482015001, 'num_leaves': 125, 'max_depth': 8, 'feature_fraction': 0.9147257157515503, 'bagging_fraction': 0.907971964279543}. Best is trial 4 with value: 0.8026986104170241.

Training until validation scores don't improve for 5 rounds
Did not meet early stopping. Best iteration is:
[50]	valid_0's multi_logloss: 2.15266

[I 2024-06-29 15:31:04,005] Trial 7 finished with value: 0.38323549813901797 and parameters: {'learning_rate': 0.00234272496475997, 'num_leaves': 38, 'max_depth': 8, 'feature_fraction': 0.8123605133836813, 'bagging_fraction': 0.99841746342673}. Best is trial 4 with value: 0.8026986104170241.

Training until validation scores don't improve for 5 rounds
Did not meet early stopping. Best iteration is:
[50]	valid_0's multi_logloss: 0.852802

[I 2024-06-29 16:09:26,699] Trial 8 finished with value: 0.8034594126904803 and parameters: {'learning_rate': 0.02869112969714743, 'num_leaves': 75, 'max_depth': 12, 'feature_fraction': 0.810412097404499, 'bagging_fraction': 0.790699861028308}. Best is trial 8 with value: 0.8034594126904803.

Training until validation scores don't improve for 5 rounds
Did not meet early stopping. Best iteration is:
[50]	valid_0's multi_logloss: 1.71214

[I 2024-06-29 16:48:33,128] Trial 9 finished with value: 0.6759877376574749 and parameters: {'learning_rate': 0.005017481841806167, 'num_leaves': 86, 'max_depth': 13, 'feature_fraction': 0.9240857290334837, 'bagging_fraction': 0.7202895036862704}. Best is trial 8 with value: 0.8034594126904803.

Training until validation scores don't improve for 5 rounds
Did not meet early stopping. Best iteration is:
[50]	valid_0's multi_logloss: 0.852802
Best precision with LightGBM: 0.8035
```
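
The LightGBM cell turns the multilabel target matrix into one label per sample with `np.argmax`. The sketch below shows on made-up toy data (not RCV1) what that conversion does: `np.argmax` returns the index of the first maximum in each row, so for a sample with several positive labels only the lowest-indexed one is kept.

```python
import numpy as np

# Toy multi-hot label matrix: 3 samples, 4 classes (made-up data)
y_multilabel = np.array([
    [0, 1, 0, 1],  # labels 1 and 3
    [1, 0, 0, 0],  # label 0 only
    [0, 0, 1, 1],  # labels 2 and 3
])

# Index of the first positive label in each row
y_single = np.argmax(y_multilabel, axis=1)

# Number of positive labels per sample that the conversion discards
dropped = y_multilabel.sum(axis=1) - 1

print(y_single)
print(dropped)
```

The LightGBM model is therefore trained and scored on a single-label version of the task, which is worth keeping in mind when reading its precision.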


## Performance comparison

### Neural network vs. LightGBM

```python
# Load the best neural-network checkpoint saved during training
model.load_state_dict(torch.load('best_model.pth'))

# Evaluate the neural network on the test set
model.eval()
with torch.no_grad():
    y_true = []
    y_pred = []
    for data, target in test_loader:
        outputs = model(data)
        predictions = torch.sigmoid(outputs).round()
        y_true.extend(target.numpy())
        y_pred.extend(predictions.numpy())

nn_precision = precision_score(np.array(y_true), np.array(y_pred), average='micro')
nn_recall = recall_score(np.array(y_true), np.array(y_pred), average='micro')

print(f'Neural network - Precision: {nn_precision:.4f}, Recall: {nn_recall:.4f}')

# Compare with the best LightGBM precision
if nn_precision > final_precision:
    print("The neural network is the better model.")
else:
    print("LightGBM is the better model.")
```

```text
Neural network - Precision: 0.9073, Recall: 0.8208
The neural network is the better model.
```


I believe the neural network is the better model compared with LightGBM, because it is a more complex architecture and can be trained more effectively on a dataset of this size. In addition, finding an optimal LightGBM classifier would require searching the entire hyperparameter space with Optuna, which is practically impossible, and I think that is why the results look the way they do.
I tried several different ways of tuning LightGBM, but never obtained a precision above 90% on this dataset.
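
When reading the two precision figures side by side, it may help to remember that they are computed on differently shaped problems: the neural network is scored on the multilabel indicator matrix, while LightGBM is scored on one label per sample. In the single-label case every sample receives exactly one prediction, so micro-averaged precision coincides with plain accuracy. The sketch below checks this on made-up labels; the helper `micro_precision` is written out by hand for illustration and is not part of the project code.

```python
import numpy as np


def micro_precision(y_true, y_pred, num_classes):
    """Micro-averaged precision: total TP / (total TP + total FP) over all classes."""
    tp = 0
    fp = 0
    for c in range(num_classes):
        predicted_c = y_pred == c
        tp += int(np.sum(predicted_c & (y_true == c)))
        fp += int(np.sum(predicted_c & (y_true != c)))
    return tp / (tp + fp)


# Made-up single-label ground truth and predictions for 3 classes
y_true = np.array([0, 1, 2, 2, 1, 0, 2, 1])
y_pred = np.array([0, 2, 2, 2, 1, 1, 2, 0])

accuracy = float(np.mean(y_true == y_pred))
precision = micro_precision(y_true, y_pred, num_classes=3)

print(accuracy, precision)
```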

### Slaviša Čovakušić 1127/22