
  evaluate_model.py
  -------------------------------------------------------
  PURPOSE:

    - Load trained model from disk

    - Evaluate on test set

    - Save metrics (accuracy, precision, recall, F1)

    - Save confusion matrix and misclassified examples


  SECURITY & GOVERNANCE NOTES:

    - Keep test dataset hashes to confirm integrity

    - Save metrics in JSON for auditability

    - Misclassified samples saved for error analysis


**Temporary (Colab / Development phase):**
For now, we are loading the trained model directly from Google Drive into Colab.

This allows quick access and faster iteration.

Evaluation metrics (eval_report.txt, eval_report.json, etc.) are generated from the Drive copy.

Governance checks (metadata.json, dataset hashes) are still enforced to ensure integrity.
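
One way such an integrity check could look is sketched below. It assumes, hypothetically, that `metadata.json` stores the expected SHA-256 of the test CSV under a key named `test_csv_sha256`. The notebook does not show the actual schema, and `verify_dataset_hash` is an illustrative helper rather than part of the project.

```python
import hashlib
import json


def sha256_of_file(path, chunk_size=8192):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_dataset_hash(csv_path, metadata_path, key="test_csv_sha256"):
    """Raise ValueError if the CSV's hash differs from the recorded one.

    ASSUMPTION: metadata.json holds the expected digest under `key`;
    the real schema may differ.
    """
    with open(metadata_path, "r", encoding="utf-8") as f:
        expected = json.load(f)[key]
    actual = sha256_of_file(csv_path)
    if actual != expected:
        raise ValueError(
            f"Hash mismatch for {csv_path}: expected {expected}, got {actual}"
        )
    return actual
```

Calling `verify_dataset_hash(TEST_CSV, ...)` before evaluation would stop the run if the test set had changed since the hash was recorded, instead of only logging the hash afterwards.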

**Final (Docker / Deployment phase):**
Once training, evaluation, and testing are finalized, we will embed the model inside a Docker image.

Recruiters, teammates, or reviewers can then run:

docker build -t imdb-sentiment .
docker run --rm imdb-sentiment


The Docker image will contain:

- Model weights (quick_distilbert_model/)

- Metadata (metadata.json for auditability)

- Scripts (train_model.py, evaluate_model.py, test_model.py)

This ensures reproducibility, offline use, and no dependency on Google Drive.

✅ Summary:
Drive is used now for development convenience, but Docker will be the final, self-contained delivery format for reproducibility and professional deployment.
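
To let the same script run in both phases, the paths could be read from environment variables, with the Drive locations as defaults. This is only a sketch: the environment variable names `MODEL_DIR`, `TEST_CSV`, and `SAVE_DIR` and the helper `resolve_paths` are assumptions, not something the project defines.

```python
import os

# Default mirrors the Drive folder used during the Colab phase.
DRIVE_ROOT = "/content/drive/MyDrive/quick_distilbert_model"


def resolve_paths(env=None):
    """Return model, test-set, and output paths, preferring environment overrides.

    ASSUMPTION: the environment variable names are hypothetical; a Docker
    image could set them to paths inside the container.
    """
    env = os.environ if env is None else env
    model_dir = env.get("MODEL_DIR", DRIVE_ROOT)
    return {
        "model_dir": model_dir,
        "test_csv": env.get("TEST_CSV", os.path.join(model_dir, "imdb_test_clean.csv")),
        "save_dir": env.get("SAVE_DIR", os.path.join(model_dir, "eval_outputs")),
    }
```

With no overrides set, `resolve_paths()` returns the Drive paths used in this notebook, so the Colab workflow is unchanged.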

In [None]:
from google.colab import drive
drive.mount('/content/drive')


Mounted at /content/drive


In [None]:
import os

MODEL_DIR = "/content/drive/MyDrive/quick_distilbert_model"   # trained model folder
TEST_CSV  = "/content/drive/MyDrive/quick_distilbert_model/imdb_test_clean.csv"      # test dataset
SAVE_DIR  = "/content/drive/MyDrive/quick_distilbert_model/eval_outputs"             # save results here

os.makedirs(SAVE_DIR, exist_ok=True)


In [None]:
import json, hashlib, datetime
import pandas as pd
import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from sklearn.metrics import accuracy_score, precision_recall_fscore_support, classification_report, confusion_matrix
from tqdm import tqdm

# Utility: SHA-256 digest of a file, read in 8 KiB chunks so large files
# are never loaded into memory at once (recorded for governance/audit)
def file_hash(path):
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

# Minimal Dataset wrapper
class TextDataset(Dataset):
    def __init__(self, texts, labels, tokenizer, max_length=256):
        self.texts = texts
        self.labels = labels
        self.tokenizer = tokenizer
        self.max_length = max_length
    def __len__(self):
        return len(self.texts)
    def __getitem__(self, idx):
        text = str(self.texts[idx])
        enc = self.tokenizer(
            text,
            truncation=True,
            padding="max_length",
            max_length=self.max_length,
            return_tensors="pt"
        )
        item = {k: v.squeeze(0) for k, v in enc.items()}
        item["labels"] = torch.tensor(int(self.labels[idx]), dtype=torch.long)
        item["text"] = text
        return item

# Evaluation function
def run_evaluation(model_dir, test_csv, save_dir, batch_size=32, max_length=256):
    device = "cuda" if torch.cuda.is_available() else "cpu"
    print(f"[INFO] Using device: {device}")

    # Load model + tokenizer
    tokenizer = AutoTokenizer.from_pretrained(model_dir, local_files_only=True)
    model = AutoModelForSequenceClassification.from_pretrained(model_dir, local_files_only=True)
    model.to(device).eval()

    # Load test data
    df = pd.read_csv(test_csv)
    texts = df["review"].astype(str).tolist()
    labels = df["sentiment"].astype(int).tolist()

    dataset = TextDataset(texts, labels, tokenizer, max_length=max_length)
    dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=False)

    # Run predictions
    preds, true_labels, probs, texts_record = [], [], [], []
    softmax = torch.nn.Softmax(dim=1)

    with torch.no_grad():
        for batch in tqdm(dataloader, desc="Evaluating"):
            ids = batch["input_ids"].to(device)
            mask = batch["attention_mask"].to(device)
            out = model(input_ids=ids, attention_mask=mask)
            batch_probs = softmax(out.logits).cpu().numpy()
            batch_preds = np.argmax(batch_probs, axis=1).tolist()

            preds.extend(batch_preds)
            probs.extend(batch_probs.tolist())
            true_labels.extend(batch["labels"].numpy().tolist())
            texts_record.extend(batch["text"])

    # Metrics
    acc = accuracy_score(true_labels, preds)
    prec, rec, f1, _ = precision_recall_fscore_support(true_labels, preds, average="binary")
    report = classification_report(true_labels, preds, digits=4)
    cm = confusion_matrix(true_labels, preds)

    # Save human-readable report
    txt_path = os.path.join(save_dir, "eval_report.txt")
    with open(txt_path, "w") as f:
        f.write(f"Accuracy: {acc:.4f}\nPrecision: {prec:.4f}\nRecall: {rec:.4f}\nF1: {f1:.4f}\n\n")
        f.write("Classification Report:\n" + report + "\n")
        f.write("Confusion Matrix:\n" + str(cm))

    # Save structured JSON (for governance/audit)
    json_path = os.path.join(save_dir, "eval_report.json")
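    metrics = {
        # Timezone-aware UTC timestamp; datetime.utcnow() is deprecated since Python 3.12
        "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat().replace("+00:00", "Z"),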
        "accuracy": float(acc),
        "precision": float(prec),
        "recall": float(rec),
        "f1": float(f1),
        "test_csv_hash": file_hash(test_csv)
    }
    with open(json_path, "w") as f:
        json.dump(metrics, f, indent=2)

    # Save predictions + misclassified
    df_out = pd.DataFrame({"text": texts_record, "true": true_labels, "pred": preds})
    df_out.to_csv(os.path.join(save_dir, "predictions_log.csv"), index=False)
    mis = df_out[df_out["true"] != df_out["pred"]]
    mis.to_csv(os.path.join(save_dir, "misclassified_examples.csv"), index=False)

    print(f"[INFO] ✅ Evaluation complete. Reports saved to: {save_dir}")
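
For reference, the binary metrics computed in `run_evaluation` follow directly from the confusion-matrix counts. The sketch below is a plain-Python illustration with made-up counts, not the notebook's results. It treats label 1 as the positive class, which matches scikit-learn's default `pos_label=1` for `average="binary"`; `binary_metrics` is an illustrative helper.

```python
def binary_metrics(tn, fp, fn, tp):
    """Accuracy, precision, recall, and F1 from confusion-matrix counts.

    Label 1 is the positive class. Ratios with a zero denominator are
    reported as 0.0.
    """
    total = tn + fp + fn + tp
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Illustrative counts only (not taken from the evaluation run).
m = binary_metrics(tn=40, fp=10, fn=5, tp=45)
```

For a 2x2 matrix from scikit-learn's `confusion_matrix`, `tn, fp, fn, tp = cm.ravel()` unpacks the counts in this order.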


In [None]:
run_evaluation(MODEL_DIR, TEST_CSV, SAVE_DIR, batch_size=32, max_length=256)


[INFO] Using device: cuda


Evaluating: 100%|██████████| 313/313 [01:14<00:00,  4.18it/s]


[INFO] ✅ Evaluation complete. Reports saved to: /content/drive/MyDrive/quick_distilbert_model/eval_outputs
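
As a starting point for error analysis, the saved `misclassified_examples.csv` can be split into false positives and false negatives. The sketch below uses only the standard library and relies on the `text`, `true`, and `pred` columns written by `run_evaluation`. It assumes label 1 means positive sentiment, which the notebook does not state explicitly; `split_errors` is an illustrative helper.

```python
import csv


def split_errors(csv_path):
    """Split misclassified rows into false positives and false negatives.

    ASSUMPTION: label 1 = positive sentiment, label 0 = negative.
    """
    false_pos, false_neg = [], []
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            true, pred = int(row["true"]), int(row["pred"])
            if true == pred:
                continue  # correctly classified rows are ignored
            (false_pos if pred == 1 else false_neg).append(row["text"])
    return false_pos, false_neg
```

Comparing the sizes of the two lists shows whether the model leans toward over-predicting positive or negative sentiment, and the texts themselves can then be read for recurring patterns.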
