# **Testing the SIGA Model Without Fine-tuning**
## In this notebook, we will evaluate the performance of the pre-trained SIGA model (Microsoft DeBERTa) on our dataset.
## We will compare its performance with a fine-tuned version to assess the impact of fine-tuning on classification accuracy.

### **1. Install Required Libraries**
### Ensure that scikit-learn is installed before proceeding.

In [32]:
!pip install scikit-learn

import torch
import pandas as pd
import sklearn
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from sklearn.metrics import classification_report, accuracy_score

[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m24.2[0m[39;49m -> [0m[32;49m25.0.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m


### **2. Load the Dataset**
### Read the dataset containing premises, statements, and labels for Natural Language Inference (NLI).
### (Remember to replace the path below with the location of your own data file.)

In [36]:
df = pd.read_csv('/home/wtto/Documents/HHU/fourth_semester/NLI_AP/evaluation_dataset.csv')
display(df.head())  # Display first few rows of the dataset

| Unnamed: 0 | Scale | premise | alternative | statement | label |
|---|---|---|---|---|---|
| 0 | adequate | relearning lessons from the past: adequate pot... | good | relearning lessons from the past: adequate, bu... | contradiction |
| 1 | adequate | i'm a huge fan of vegetables, physical movemen... | good | i'm a huge fan of vegetables, physical movemen... | contradiction |
| 2 | adequate | it is the responsibility of municipal & city c... | good | it is the responsibility of municipal & city c... | contradiction |
| 3 | adequate | fell asleep with a sheet mask on and honestly ... | good | fell asleep with a sheet mask on and honestly ... | entailment |
| 4 | adequate | our billing cycles have set dates and to give ... | good | our billing cycles have set dates and to give ... | contradiction |
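
Before running the model, it may help to confirm that the `label` column contains only the class names the evaluation expects. The following is a minimal sketch using only the standard library; `EXPECTED_LABELS` and `summarize_labels` are hypothetical names introduced here, and the set of three NLI classes is an assumption based on the labels used later in this notebook.

```python
from collections import Counter

# Assumption: the three standard NLI classes; adjust if your file differs.
EXPECTED_LABELS = {"contradiction", "neutral", "entailment"}


def summarize_labels(labels, expected=EXPECTED_LABELS):
    """Return (counts, unexpected) for an iterable of label strings."""
    counts = Counter(labels)
    # Any label not in the expected set (e.g. different casing) is reported.
    unexpected = sorted(set(counts) - set(expected))
    return counts, unexpected


# Toy input for illustration; with the notebook's DataFrame this would be
# summarize_labels(df["label"].tolist()).
counts, unexpected = summarize_labels(
    ["contradiction", "entailment", "contradiction", "Neutral"]
)
print(counts)
print(unexpected)
```

An empty `unexpected` list suggests every row can be mapped to an integer class id without a `KeyError`.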


### **3. Load Pre-trained DeBERTa Model and Tokenizer**
### We use the `"microsoft/deberta-base-mnli"` model for sequence classification.

In [37]:
model_name = "microsoft/deberta-base-mnli"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()  # Set model to evaluation mode

Some weights of the model checkpoint at microsoft/deberta-base-mnli were not used when initializing DebertaForSequenceClassification: ['config']
- This IS expected if you are initializing DebertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DebertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


DebertaForSequenceClassification(
  (deberta): DebertaModel(
    (embeddings): DebertaEmbeddings(
      (word_embeddings): Embedding(50265, 768, padding_idx=0)
      (LayerNorm): DebertaLayerNorm()
      (dropout): StableDropout()
    )
    (encoder): DebertaEncoder(
      (layer): ModuleList(
        (0-11): 12 x DebertaLayer(
          (attention): DebertaAttention(
            (self): DisentangledSelfAttention(
              (in_proj): Linear(in_features=768, out_features=2304, bias=False)
              (pos_dropout): StableDropout()
              (pos_proj): Linear(in_features=768, out_features=768, bias=False)
              (pos_q_proj): Linear(in_features=768, out_features=768, bias=True)
              (dropout): StableDropout()
            )
            (output): DebertaSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): DebertaLayerNorm()
              (dropout): StableDropout()
            )
          )
          (
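
Because the dataset stores labels as strings while the model outputs integer class ids, the two need a shared mapping. One hedged way to obtain it is to invert the checkpoint's own `id2label` configuration rather than hard-coding the order. In the sketch below, `build_label_map` is a hypothetical helper, and `example_id2label` is an illustrative stand-in; in the notebook you would pass `model.config.id2label` and check what it actually contains.

```python
def build_label_map(id2label):
    """Invert an id -> name mapping into a lowercase name -> int id mapping."""
    return {name.lower(): int(idx) for idx, name in id2label.items()}


# Illustrative stand-in (an assumption, not read from the checkpoint);
# in the notebook, use: build_label_map(model.config.id2label)
example_id2label = {0: "CONTRADICTION", 1: "NEUTRAL", 2: "ENTAILMENT"}

label_map = build_label_map(example_id2label)
print(label_map)
```

Lowercasing the names lets the result be indexed directly with the lowercase strings in the dataset's `label` column.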

### **4. Move Model to GPU if Available**

In [38]:
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

### **5. Define Batch Processing Parameters**

In [39]:
batch_size = 8  # Adjust based on available GPU memory
predicted_labels = []  # Store model predictions
true_labels = []  # Store ground truth labels

### **6. Process the Dataset in Batches**

In [40]:
# Map string labels to the integer class ids used by the model.
# Assumed to match model.config.id2label for this checkpoint; verify before running.
label_map = {"contradiction": 0, "neutral": 1, "entailment": 2}

# Process the dataset in batches
for i in range(0, len(df), batch_size):
    batch_texts = df["premise"].iloc[i:i+batch_size].tolist()
    batch_statements = df["statement"].iloc[i:i+batch_size].tolist()
    batch_true_labels = df["label"].iloc[i:i+batch_size].tolist()

    # Tokenize the premise/statement pairs of this batch and move them to the device
    inputs = tokenizer(batch_texts, batch_statements, truncation=True, padding=True, return_tensors="pt")
    inputs = {key: val.to(device) for key, val in inputs.items()}

    # Get predictions without tracking gradients
    with torch.no_grad():
        outputs = model(**inputs)

    # Predicted class ids (integers), one per example
    batch_predicted_labels = torch.argmax(outputs.logits, dim=-1).tolist()

    # Append the batch predictions and true labels to the overall lists
    predicted_labels.extend(batch_predicted_labels)
    true_labels.extend([label_map[lbl] for lbl in batch_true_labels])  # Convert string labels to integer ids

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.
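
The warning above indicates that `truncation=True` had no effect because no maximum length was available, so long inputs were passed through unshortened. A possible remedy is to pass `max_length` explicitly. The sketch below wraps the tokenizer call in a hypothetical helper, `encode_pairs`; the value 512 is an assumption, not a setting taken from this notebook. A small recording stand-in replaces the real tokenizer so the example runs without downloading a model.

```python
def encode_pairs(tokenizer, premises, statements, max_length=512):
    """Tokenize premise/statement pairs with an explicit truncation length."""
    return tokenizer(
        premises,
        statements,
        truncation=True,        # truncate pairs longer than max_length
        max_length=max_length,  # assumed limit; choose one suited to your data
        padding=True,           # pad to the longest sequence in the batch
        return_tensors="pt",
    )


class RecordingTokenizer:
    """Stand-in for a real tokenizer; it only records the keyword arguments."""

    def __call__(self, text, text_pair, **kwargs):
        self.kwargs = kwargs
        return {"n_pairs": len(text)}


stub = RecordingTokenizer()
result = encode_pairs(stub, ["a premise"], ["a statement"], max_length=256)
print(stub.kwargs["max_length"], result["n_pairs"])
```

In the notebook, the object returned by `AutoTokenizer.from_pretrained(model_name)` would be passed in place of the stand-in.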


### **7. Evaluate Model Performance**
### Add predictions to the DataFrame

In [41]:
df["predicted_label"] = predicted_labels

# Print classification report
print("Classification Report:")
print(classification_report(true_labels, predicted_labels, labels=[0,1,2], target_names=["contradiction", "neutral", "entailment"]))


Classification Report:
               precision    recall  f1-score   support

contradiction       0.50      0.00      0.00       441
      neutral       0.26      1.00      0.42       500
   entailment       0.00      0.00      0.00       953

     accuracy                           0.26      1894
    macro avg       0.25      0.33      0.14      1894
 weighted avg       0.19      0.26      0.11      1894
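
The report shows a recall of 1.00 for `neutral` and 0.00 for the other two classes, which suggests that almost all predictions fall into a single class. A confusion matrix makes this kind of collapse directly visible. Below is a minimal pure-Python sketch; `confusion_counts` is a hypothetical helper, and the toy id lists are made up for illustration rather than taken from the results above.

```python
def confusion_counts(true_ids, pred_ids, n_classes=3):
    """Return a matrix where entry [t][p] counts examples of true class t predicted as p."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(true_ids, pred_ids):
        matrix[t][p] += 1
    return matrix


# Toy ids for illustration (0 = contradiction, 1 = neutral, 2 = entailment)
toy_true = [0, 0, 1, 2, 2, 2]
toy_pred = [1, 1, 1, 1, 1, 0]

matrix = confusion_counts(toy_true, toy_pred)
for name, row in zip(["contradiction", "neutral", "entailment"], matrix):
    print(f"{name:>13}: {row}")
```

With the notebook's variables, the call would be `confusion_counts(true_labels, predicted_labels)`; a single dominant column indicates that the model predicts one class for nearly every input.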

