ANALYSIS AND APPLICATION OF WEIGHTS & BIASES (W&B) FOR TRACKING ML MODELS

### Introduction

When developing machine learning models, especially during the experimental phases, parameters, metrics, and model versions need to be tracked systematically. 
Tracking experiments by hand quickly becomes inefficient and unreliable.

Weights & Biases (W&B) is a platform for:
- experiment tracking
- logging metrics in real time
- versioning models and data
- visualizing the training process
- comparing multiple models

In this paper we present:

1. The procedure for installing and configuring W&B
2. The integration with a model for hepatitis C prediction
3. An analysis of the functionality
4. A comparison with MLflow

### Setup and installation of Weights & Biases

#### Creating an account

1. Create an account on the platform at https://wandb.ai
2. Generate an API key
3. Use the API key to authenticate when logging to the project

#### Installation
pip install wandb nbformat scikit-learn pandas
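
Before moving on, it can help to confirm that the packages are visible to the notebook kernel. The following is a minimal sketch using only the standard library; the helper name `find_missing` is hypothetical, and the module list is an assumption derived from the pip command above (scikit-learn is imported as `sklearn`).

```python
import importlib.util

# Import names for the packages installed above.
# Note: the scikit-learn package is imported as "sklearn".
REQUIRED_MODULES = ["wandb", "nbformat", "sklearn", "pandas"]


def find_missing(modules):
    """Return the module names that cannot be found in the current environment."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


missing = find_missing(REQUIRED_MODULES)
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All required modules are importable.")
```

If any module is reported as missing, the usual cause is that pip installed into a different Python environment than the one the notebook kernel uses.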


#### Authentication

After installation, log in from the notebook:

In [18]:
import wandb
wandb.login()


True
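
For non-interactive runs (scripts, scheduled jobs), the API key can be supplied through an environment variable instead of the login prompt. The sketch below assumes the key is stored in `WANDB_API_KEY`, the variable the wandb client itself reads; the helper names `read_api_key` and `login_from_env` are hypothetical and only illustrate the idea.

```python
import os


def read_api_key(env=None):
    """Return the W&B API key from WANDB_API_KEY, or None if it is unset or blank."""
    env = os.environ if env is None else env
    key = env.get("WANDB_API_KEY", "").strip()
    return key or None


def login_from_env():
    """Log in with the key from the environment; return False if no key is set."""
    key = read_api_key()
    if key is None:
        return False
    # Imported lazily so read_api_key() can be used without wandb installed.
    import wandb
    return wandb.login(key=key)


if read_api_key() is None:
    print("WANDB_API_KEY is not set; wandb.login() would prompt for a key.")
else:
    print("WANDB_API_KEY found; login_from_env() can be used.")
```

Keeping the key in the environment rather than in the notebook avoids committing it to version control along with the code.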

In [3]:
import wandb
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix

from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

wandb.login()

# ===== Load + preprocessing =====
df = pd.read_csv("HepatitisCdata.csv")

# The index column is read as "Unnamed: 0" (instead of X), so we drop it
if "Unnamed: 0" in df.columns:
    df = df.drop("Unnamed: 0", axis=1)

df["Sex"] = df["Sex"].astype("category").cat.codes

# Binarize the target: 0 if the label contains "Blood Donor", otherwise 1
df["Category"] = df["Category"].apply(lambda x: 0 if "Blood Donor" in str(x) else 1)

X = df.drop("Category", axis=1)
y = df["Category"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

features = list(X.columns)

def run_and_log(model, run_name, config):
    """Train the model, evaluate it on the test split, and log config and metrics to one W&B run."""
    run = wandb.init(
        project="hepatitis_c_prediction",
        entity="milosandjelkovic20-faculty-of-natural-sciences-kragujevac",
        name=run_name,
        config={**config, "features": features},
        reinit=True
    )

    model.fit(X_train, y_train)
    pred = model.predict(X_test)

    y_true = y_test.to_numpy()
    y_pred = np.asarray(pred)

    acc = accuracy_score(y_true, y_pred)
    prec = precision_score(y_true, y_pred, zero_division=0)
    rec  = recall_score(y_true, y_pred, zero_division=0)
    f1   = f1_score(y_true, y_pred, zero_division=0)

    # Rows are the true classes, columns are the predicted classes
    cm = confusion_matrix(y_true, y_pred, labels=[0, 1])

    wandb.log({
        "accuracy": acc,
        "precision": prec,
        "recall": rec,
        "f1_score": f1,
        "confusion_matrix": wandb.Table(
            data=cm.tolist(),
            columns=["Pred_BloodDonor", "Pred_HepatitisC"]
        )
    })

    wandb.finish()

# RandomForest
rf_model = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("rf", RandomForestClassifier(n_estimators=200, random_state=42))
])

run_and_log(
    rf_model,
    run_name="RandomForest-baseline",
    config={"model": "RandomForest", "n_estimators": 200, "test_size": 0.3, "random_state": 42}
)

# LogisticRegression
lr_model = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
    ("lr", LogisticRegression(max_iter=2000))
])

run_and_log(
    lr_model,
    run_name="LogisticRegression-baseline",
    config={"model": "LogisticRegression", "max_iter": 2000, "test_size": 0.3, "random_state": 42}
)

# DecisionTree
dt_model = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("dt", DecisionTreeClassifier(random_state=42, max_depth=5))
])

run_and_log(
    dt_model,
    run_name="DecisionTree-baseline",
    config={"model": "DecisionTree", "max_depth": 5, "test_size": 0.3, "random_state": 42}
)

print("Done.")

W&B run summaries, in the order the runs were executed (each run history contains a single logged step):

| Run | accuracy | f1_score | precision | recall |
| --- | --- | --- | --- | --- |
| RandomForest-baseline | 0.96216 | 0.82051 | 1.0 | 0.69565 |
| LogisticRegression-baseline | 0.95676 | 0.78947 | 1.0 | 0.65217 |
| DecisionTree-baseline | 0.96216 | 0.84444 | 0.86364 | 0.82609 |

Done.
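
To make the relationship between the logged metrics concrete, the sketch below recomputes them from confusion-matrix counts. The counts are an assumption: they are not reported in the source and were chosen because they are consistent with the first run summary above (taken here to be RandomForest-baseline, the first run executed). The function name `binary_metrics` is hypothetical.

```python
def binary_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    # Fall back to 0.0 on empty denominators, mirroring zero_division=0.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1_score": f1,
    }


# Hypothetical counts (inferred, not taken from the source):
# 16 true positives, 0 false positives, 7 false negatives, 162 true negatives.
rf = binary_metrics(tp=16, fp=0, fn=7, tn=162)
for name, value in rf.items():
    print(f"{name}: {value:.5f}")
```

With these counts the precision is perfect because there are no false positives, while the recall is lower because 7 of the 23 positive cases are missed. High accuracy alone can hide this kind of imbalance.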


<p align="center">
  <img src="runs-tab.png" width="600"/>
</p>

<p align="center">
Figure 1: List of experiments in the W&B project hepatitis_c_prediction
</p>

---

<p align="center">
  <img src="config_matrike.png" width="600"/>
</p>

<p align="center">
Figure 2: Detailed view of the RandomForest-baseline run with logged hyperparameters and metrics
</p>
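
The `confusion_matrix` table logged in `run_and_log` has columns for the predicted classes only, so its rows carry no label. One possible refinement is to prefix each row with its true class before building the table. This is a sketch: the helper `labelled_rows`, the `Actual` column name, and the counts in `cm` are illustrative assumptions, not values from the source.

```python
def labelled_rows(cm, class_names):
    """Prefix each confusion-matrix row with the name of its true class."""
    if len(cm) != len(class_names):
        raise ValueError("need exactly one class name per matrix row")
    return [[name, *row] for name, row in zip(class_names, cm)]


# Illustrative 2x2 matrix (rows: true class, columns: predicted class).
cm = [[162, 0], [7, 16]]
columns = ["Actual", "Pred_BloodDonor", "Pred_HepatitisC"]
rows = labelled_rows(cm, ["BloodDonor", "HepatitisC"])

print(columns)
for row in rows:
    print(row)

# Inside run_and_log, the table could then be built as:
#   wandb.Table(data=labelled_rows(cm.tolist(), ["BloodDonor", "HepatitisC"]), columns=columns)
```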

---

<p align="center">
  <img src="grafik_test_accuracy.png" width="600"/>
</p>

<p align="center">
Figure 3: Plot of the metric values
</p>


### Discussion

The project demonstrates the integration of a hepatitis C classification model with the Weights & Biases platform. 
The tool logs the hyperparameters and metrics of each run and presents model performance graphically. 
The results show that the RandomForest model reaches an accuracy of about 0.96 on the test set, 
which suggests it is well suited to this type of problem, although its recall of about 0.70 indicates that some positive cases are missed.

Weights & Biases makes experimental work considerably easier because it allows different models and configurations to be compared quickly.
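
The same kind of comparison can be sketched outside the W&B interface. The example below ranks the three runs on a chosen metric using the summary values reported in this section. It assumes the summaries were printed in the order the runs were executed; the function name `rank_runs` is hypothetical.

```python
# Summary values copied from the three runs reported in this section.
RUNS = {
    "RandomForest-baseline": {
        "accuracy": 0.96216, "f1_score": 0.82051, "precision": 1.0, "recall": 0.69565,
    },
    "LogisticRegression-baseline": {
        "accuracy": 0.95676, "f1_score": 0.78947, "precision": 1.0, "recall": 0.65217,
    },
    "DecisionTree-baseline": {
        "accuracy": 0.96216, "f1_score": 0.84444, "precision": 0.86364, "recall": 0.82609,
    },
}


def rank_runs(runs, metric):
    """Return (run_name, value) pairs sorted from highest to lowest on one metric."""
    pairs = ((name, summary[metric]) for name, summary in runs.items())
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


for metric in ("recall", "f1_score"):
    print(metric)
    for name, value in rank_runs(RUNS, metric):
        print(f"  {name}: {value}")
```

Ranking by recall or F1 rather than accuracy changes the picture: the RandomForest and DecisionTree runs have the same accuracy, but they differ clearly on recall.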
