# Quantifying the Environmental Cost of AI: Carbon Emissions in Language Model Fine-Tuning for Question Answering

> ### **Project Goal**: As language models play a growing role in natural language processing, their environmental impact has become an important concern. Most research in this area focuses on improving model accuracy, while the energy use and carbon footprint of training are often overlooked or poorly documented. This project examines that imbalance by studying how gains in model performance relate to the environmental cost of fine-tuning.


# Training Strategy 1: Full Fine-Tuning (Model: DistilBERT)

In [1]:
!pip install transformers
!pip install datasets
!pip install accelerate
!pip install codecarbon
!pip install evaluate



In [2]:
# Importing Necessary Libraries
import os
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    AutoModelForQuestionAnswering,
    TrainingArguments,
    Trainer,
    default_data_collator,
    pipeline
)
import torch
from datasets import Dataset
import evaluate
from codecarbon import EmissionsTracker
from google.colab import drive
import pandas as pd
from collections import defaultdict
import json
import plotly.graph_objects as go
import plotly.express as px
from plotly.subplots import make_subplots
import numpy as np

drive.mount('/content/drive')

Mounted at /content/drive


## STEP 1: Loading the Stanford Question Answering Dataset (SQuAD 2.0)

In [3]:
squad = load_dataset("squad_v2")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


README.md: 0.00B [00:00, ?B/s]

squad_v2/train-00000-of-00001.parquet:   0%|          | 0.00/16.4M [00:00<?, ?B/s]

squad_v2/validation-00000-of-00001.parqu(…):   0%|          | 0.00/1.35M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/130319 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/11873 [00:00<?, ? examples/s]

In [4]:
print("SQuAD Format: ",squad)
print(f"\nFull training set size: {len(squad['train'])}")
print(f"\nValidation set size: {len(squad['validation'])}")

SQuAD Format:  DatasetDict({
    train: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 130319
    })
    validation: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 11873
    })
})

Full training set size: 130319

Validation set size: 11873


## STEP 2: Tokenization Function for the Model

In [5]:
# AutoTokenizer automatically selects the correct tokenizer for the given model

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

tokenizer_config.json:   0%|          | 0.00/48.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/483 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

In [6]:
def preprocess_function(examples):
    questions = [q.strip() for q in examples["question"]]
    contexts = [c.strip() for c in examples["context"]]

    # Tokenize
    tokenized = tokenizer(
        questions,
        contexts,
        max_length=384,
        stride=128,
        padding="max_length",
        truncation="only_second",         #Truncate from context
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
    )

    # Mapping back to original samples
    sample_mapping = tokenized.pop("overflow_to_sample_mapping")
    offset_mapping = tokenized["offset_mapping"]

    start_positions = []
    end_positions = []

    for i, offsets in enumerate(offset_mapping):
        sample_idx = sample_mapping[i]
        answers = examples["answers"][sample_idx]

        # In no answer case
        if len(answers["answer_start"]) == 0:
            start_positions.append(0)
            end_positions.append(0)
            continue

        start_char = answers["answer_start"][0]
        end_char = start_char + len(answers["text"][0])

        seq_ids = tokenized.sequence_ids(i)

        # Find context section
        context_start = seq_ids.index(1) if 1 in seq_ids else 0
        context_end = len(seq_ids) - 1 - seq_ids[::-1].index(1) if 1 in seq_ids else len(seq_ids) - 1

        # If answer not inside context - mark no answer
        if not (offsets[context_start][0] <= start_char and offsets[context_end][1] >= end_char):
            start_positions.append(0)
            end_positions.append(0)
            continue

        # Find start token
        token_start = context_start
        while token_start <= context_end and offsets[token_start][0] <= start_char:
            token_start += 1
        start_positions.append(token_start - 1)

        # Find end token - move forward until we pass answer end
        token_end = context_start
        while token_end <= context_end and offsets[token_end][1] < end_char:
            token_end += 1
        end_positions.append(token_end)

    tokenized["start_positions"] = start_positions
    tokenized["end_positions"] = end_positions

    return tokenized
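
The span-labelling logic above can be checked in isolation. The following is a minimal sketch in plain Python that applies the same two scans to a hand-written list of `(start, end)` character offsets. The helper name `char_span_to_token_span` and the toy offsets are illustrative assumptions, not part of the notebook.

```python
def char_span_to_token_span(offsets, context_start, context_end, start_char, end_char):
    """Map a character span to inclusive token indices, or (0, 0) if it is not fully inside the window."""
    # Answer not fully inside this window: label it as "no answer" (index 0).
    if not (offsets[context_start][0] <= start_char and offsets[context_end][1] >= end_char):
        return 0, 0

    # Start token: the last context token whose start offset is <= start_char.
    token_start = context_start
    while token_start <= context_end and offsets[token_start][0] <= start_char:
        token_start += 1
    token_start -= 1

    # End token: the first context token whose end offset is >= end_char.
    token_end = context_start
    while token_end <= context_end and offsets[token_end][1] < end_char:
        token_end += 1

    return token_start, token_end


# Toy window: positions 0..2 stand in for a special token, a one-token question,
# and a separator; positions 3..6 are the context "in the late 1990s".
toy_offsets = [(0, 0), (0, 4), (0, 0), (0, 2), (3, 6), (7, 11), (12, 17)]

# The answer "late 1990s" occupies characters 7..17 of the context.
span = char_span_to_token_span(toy_offsets, 3, 6, start_char=7, end_char=17)
print(span)  # (5, 6)
```

Both indices are inclusive, which is why the notebook decodes `input_ids[start:end + 1]` when checking the mapping.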

In [7]:
print("========== Data Format Within SQuAD Training Set ==========")
print("\nQuestion at Index[0]: ", squad["train"][0]['question'])
print("\nContext at Index[0]: ", squad["train"][0]['context'])
print("\nAnswers at Index[0]: ", squad["train"][0]['answers'])

# Test preprocess_function on a single example
sample = {
    "question": [squad["train"][0]['question']],
    "context": [squad["train"][0]['context']],
    "answers": [squad["train"][0]['answers']]
}

output = preprocess_function(sample)
print("\n========== Data Format After Preprocessing ==========")

for k, v in output.items():

    print('\n',k, ":", v[:5] if isinstance(v, list) else v)

# Decode the labelled token span to verify the start/end position mapping
decoded = tokenizer.decode(output['input_ids'][0][output['start_positions'][0]:output['end_positions'][0]+1])
print(f"\nDecoded Answer Span: '{decoded}'")


Question at Index[0]:  When did Beyonce start becoming popular?

Context at Index[0]:  Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ bee-YON-say) (born September 4, 1981) is an American singer, songwriter, record producer and actress. Born and raised in Houston, Texas, she performed in various singing and dancing competitions as a child, and rose to fame in the late 1990s as lead singer of R&B girl-group Destiny's Child. Managed by her father, Mathew Knowles, the group became one of the world's best-selling girl groups of all time. Their hiatus saw the release of Beyoncé's debut album, Dangerously in Love (2003), which established her as a solo artist worldwide, earned five Grammy Awards and featured the Billboard Hot 100 number-one singles "Crazy in Love" and "Baby Boy".

Answers at Index[0]:  {'text': ['in the late 1990s'], 'answer_start': [269]}


 input_ids : [[101, 2043, 2106, 20773, 2707, 3352, 2759, 1029, 102, 20773, 21025, 19358, 22815, 1011, 5708, 1006, 1013, 12170, 2343

In [8]:
# Preprocess the full validation set
print("\n🔄 Preprocessing validation set...")
tokenized_validation = squad["validation"].map(
    preprocess_function,
    batched=True,
    remove_columns=squad["validation"].column_names
)


🔄 Preprocessing validation set...


Map:   0%|          | 0/11873 [00:00<?, ? examples/s]

In [9]:
tokenized_validation.features

{'input_ids': List(Value('int32')),
 'attention_mask': List(Value('int8')),
 'offset_mapping': List(List(Value('int64'))),
 'start_positions': Value('int64'),
 'end_positions': Value('int64')}
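
Because long contexts are split into overlapping windows (`max_length=384`, `stride=128`), one example can yield several features, which is why the results summary later lists 12134 validation samples for 11873 validation examples. The sketch below estimates the number of windows for a single example. The helper `count_windows` and its assumption of three special tokens per window are illustrative and simplify what the tokenizer does internally.

```python
import math


def count_windows(n_question_tokens, n_context_tokens,
                  max_length=384, stride=128, n_special=3):
    """Estimate how many overlapping windows one question/context pair produces."""
    # Context tokens that fit in one window next to the question and special tokens.
    room = max_length - n_question_tokens - n_special
    if room <= stride:
        raise ValueError("stride must be smaller than the per-window context capacity")

    # A short context fits into a single window.
    if n_context_tokens <= room:
        return 1

    # Each additional window advances by (room - stride) new context tokens.
    step = room - stride
    return 1 + math.ceil((n_context_tokens - room) / step)


print(count_windows(10, 300))  # 1
print(count_windows(10, 500))  # 2
print(count_windows(10, 700))  # 3
```

Here `stride` is treated as the number of context tokens shared between consecutive windows.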

In [10]:
# Build a tokenized subset of the training data for a given size fraction.

def prepare_dataset(train_data, size_fraction, preprocess_fn):

    # Select the first `size_fraction` of the training data and preprocess it.
    num_samples = int(len(train_data) * size_fraction)
    train_subset = train_data.select(range(num_samples))

    print(f"🔄 Preprocessing {num_samples} training samples...")
    tokenized_train = train_subset.map(
        preprocess_fn,
        batched=True,
        remove_columns=train_subset.column_names
    )

    return tokenized_train, num_samples
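
Note that `prepare_dataset` takes the first `num_samples` rows via `select(range(num_samples))`, so each subset is a prefix of the training set rather than a random sample. If a random subset were preferred, one option is to draw indices with a fixed seed so that runs stay reproducible. A minimal sketch in plain Python follows; `sample_indices` and the seed value are hypothetical, and the resulting list could be passed to `train_data.select(...)` in place of `range(num_samples)`.

```python
import random


def sample_indices(total_rows, size_fraction, seed=42):
    """Return a sorted, reproducible random sample of row indices."""
    num_samples = int(total_rows * size_fraction)
    rng = random.Random(seed)  # local generator, so global random state is untouched
    return sorted(rng.sample(range(total_rows), num_samples))


indices = sample_indices(130319, 0.25)
print(len(indices))  # 32579
```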

## STEP 3: Training Functions for the DistilBERT Model

In [11]:
#Model Architecture:
model = AutoModelForQuestionAnswering.from_pretrained("distilbert-base-uncased")
print(f"\n{'='*80}")
print("\n🛠 DistilBERT Model Architecture:")
print(f"\n{'='*80}")
print("\nTransformer layers:",model.config.n_layers)
print("\nHidden size:",model.config.dim)
print('\nIntermediate feed-forward size:',model.config.hidden_dim)
print("\nAttention heads:",model.config.n_heads)
print("\nMax positional embeddings:", model.config.max_position_embeddings)
print("\nVocabulary size:", model.config.vocab_size)


model.safetensors:   0%|          | 0.00/268M [00:00<?, ?B/s]

Some weights of DistilBertForQuestionAnswering were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


🛠 DistilBERT Model Architecture:


Transformer layers: 6

Hidden size: 768

Intermediate feed-forward size: 3072

Attention heads: 12

Max positional embeddings: 512

Vocabulary size: 30522
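
The configuration values printed above are enough to reproduce the parameter count reported later in the results summary (66,364,418). The sketch below tallies the weights by hand. It assumes the usual DistilBERT layout (word and position embeddings with a LayerNorm, four attention projections and a two-layer feed-forward block per layer, each followed by a LayerNorm, plus a two-output QA head), so treat it as a sanity check rather than a description of the library internals.

```python
def distilbert_qa_param_count(vocab_size=30522, max_positions=512, dim=768,
                              hidden_dim=3072, n_layers=6, num_labels=2):
    """Count weights and biases for a DistilBERT-style QA model (assumed layout)."""
    layer_norm = 2 * dim  # scale and bias

    # Word embeddings + position embeddings + embedding LayerNorm.
    embeddings = vocab_size * dim + max_positions * dim + layer_norm

    # Query, key, value, and output projections, each with a bias.
    attention = 4 * (dim * dim + dim)

    # Two linear layers: dim -> hidden_dim -> dim.
    feed_forward = (dim * hidden_dim + hidden_dim) + (hidden_dim * dim + dim)

    # One LayerNorm after attention and one after the feed-forward block.
    per_layer = attention + feed_forward + 2 * layer_norm

    # QA head: one linear layer producing start and end logits.
    qa_head = dim * num_labels + num_labels

    return embeddings + n_layers * per_layer + qa_head


total = distilbert_qa_param_count()
print(f"{total:,}")  # 66,364,418
```

Roughly a third of the parameters sit in the word embedding matrix, all of which are updated under full fine-tuning.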


In [12]:
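# Custom metrics: exact match and F1 over predicted vs. labelled token positions.
# Note: this is a position-level proxy computed per tokenized feature,
# not the official text-based SQuAD v2 metric.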
def compute_metrics(pred):
    predictions, labels = pred
    start_preds = np.argmax(predictions[0], axis=1)
    end_preds = np.argmax(predictions[1], axis=1)

    start_true = labels[0]
    end_true = labels[1]

    # Calculate exact match
    exact_matches = ((start_preds == start_true) & (end_preds == end_true)).sum()
    exact_match = exact_matches / len(start_true)

    # Calculate F1 score (token overlap)
    f1_scores = []
    for start_p, end_p, start_t, end_t in zip(start_preds, end_preds, start_true, end_true):
        pred_tokens = set(range(start_p, end_p + 1))
        true_tokens = set(range(start_t, end_t + 1))

        if len(pred_tokens) == 0 and len(true_tokens) == 0:
            f1_scores.append(1.0)
        elif len(pred_tokens) == 0 or len(true_tokens) == 0:
            f1_scores.append(0.0)
        else:
            overlap = len(pred_tokens & true_tokens)
            precision = overlap / len(pred_tokens)
            recall = overlap / len(true_tokens)
            f1 = 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else 0
            f1_scores.append(f1)

    avg_f1 = np.mean(f1_scores)

    return {
        "exact_match": exact_match,
        "f1": avg_f1
    }
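
To make the per-span F1 computation concrete, here is a small standalone version for a single predicted and labelled span. `span_f1` is a hypothetical helper that mirrors the loop body above; spans are inclusive `(start, end)` token index pairs.

```python
def span_f1(pred_span, true_span):
    """Token-position F1 between two inclusive (start, end) spans."""
    pred_tokens = set(range(pred_span[0], pred_span[1] + 1))
    true_tokens = set(range(true_span[0], true_span[1] + 1))

    # An inverted span (end < start) yields an empty set.
    if not pred_tokens and not true_tokens:
        return 1.0
    if not pred_tokens or not true_tokens:
        return 0.0

    overlap = len(pred_tokens & true_tokens)
    if overlap == 0:
        return 0.0

    precision = overlap / len(pred_tokens)
    recall = overlap / len(true_tokens)
    return 2 * precision * recall / (precision + recall)


print(span_f1((5, 6), (5, 6)))            # 1.0
print(round(span_f1((4, 6), (5, 6)), 3))  # 0.8
print(span_f1((8, 7), (5, 6)))            # 0.0
```

A no-answer label `(0, 0)` covers token 0, so predicting `(0, 0)` for an unanswerable feature scores 1.0 under this measure.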

In [13]:
def train_model(tokenized_train, tokenized_eval, tokenizer, compute_metrics_fn,
                size_fraction, model_name):

    # Load fresh model
    model = AutoModelForQuestionAnswering.from_pretrained(model_name)

    # Setup output directory
    output_dir = f"results_distilbert_{int(size_fraction*100)}pct"

    # Training arguments
    training_args = TrainingArguments(
        output_dir=output_dir,
        eval_strategy="epoch",
        learning_rate=3e-5,
        per_device_train_batch_size=16,
        per_device_eval_batch_size=16,
        num_train_epochs=2,
        weight_decay=0.01,
        fp16=torch.cuda.is_available(),
        save_strategy="epoch",
        load_best_model_at_end=True,
        metric_for_best_model="f1",
        push_to_hub=False,
        logging_steps=100,
        greater_is_better=True
    )

    # Initialize trainer
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=tokenized_train,
        eval_dataset=tokenized_eval,
        tokenizer=tokenizer,
        data_collator=default_data_collator,
        compute_metrics=compute_metrics_fn
    )

    # Start carbon tracking
    tracker = EmissionsTracker(
        project_name=f"DistilBERT_{int(size_fraction*100)}pct",
        output_dir=output_dir
    )
    tracker.start()

    # Train
    print("🏋️ Training model...")
    train_results = trainer.train()

    # Stop carbon tracking
    tracker.stop()

    emissions_data = tracker.final_emissions_data

    return trainer, train_results, emissions_data, output_dir


## STEP 4: Evaluation and Result-Saving Functions

> We train the model on three subsets of the SQuAD training set of increasing size.
>
> Training data fractions: 25%, 50%, 80%
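
For reference, the subset sizes implied by these fractions come from truncating `len(train) * fraction` to an integer, as `prepare_dataset` does. A quick check, using the training set size of 130319 printed earlier:

```python
TRAIN_SIZE = 130319  # full SQuAD v2 training split, as printed in Step 1
fractions = [0.25, 0.5, 0.8]

# int() truncates toward zero, matching prepare_dataset.
subset_sizes = {fraction: int(TRAIN_SIZE * fraction) for fraction in fractions}

for fraction, n in subset_sizes.items():
    print(f"{fraction:.0%} -> {n} examples")
```

These values (32579, 65159, and 104255) match the sample counts logged at the start of each experiment.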

In [14]:
def evaluate_and_save(trainer, train_results, emissions_data, output_dir,
                      size_fraction, num_samples):
    """Evaluate model, print results, and save artifacts."""

    # Evaluate
    print("📊 Evaluating model...")
    eval_results = trainer.evaluate()

    # Count trainable parameters (all of them under full fine-tuning)
    model = trainer.model
    trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total_params = sum(p.numel() for p in model.parameters())
    trainable_percentage = 100 * trainable_params / total_params

    # Compile results
    result_entry = {
        "training_method": "Full Fine-Tuning",
        "model_name": "DistilBERT",
        'dataset_size%': int(size_fraction*100),
        "train_samples": num_samples,
        "valid_samples": len(trainer.eval_dataset),  # tokenized features, not raw examples
        "trainable_params": trainable_params,
        "total_params": total_params,
        "trainable_percentage": trainable_percentage,

        # Performance metrics
        "f1_score": eval_results["eval_f1"],
        "exact_match": eval_results["eval_exact_match"],
        "eval_loss": eval_results["eval_loss"],
        "training_time_hours": train_results.metrics["train_runtime"] / 3600,

        # Emissions data
        "emissions_rate_kg_per_s": emissions_data.emissions_rate,
        "emissions_kg": emissions_data.emissions,
        "timestamp": emissions_data.timestamp,
        "duration_seconds": emissions_data.duration,
        "duration_hours": emissions_data.duration / 3600,

        # Energy consumption
        "energy_consumed_kwh": emissions_data.energy_consumed,
        "cpu_energy_kwh": emissions_data.cpu_energy,
        "gpu_energy_kwh": emissions_data.gpu_energy,
        "ram_energy_kwh": emissions_data.ram_energy,

        # Power draw
        "cpu_power_w": emissions_data.cpu_power,
        "gpu_power_w": emissions_data.gpu_power,
        "ram_power_w": emissions_data.ram_power,

        # Location and system info
        "country_name": emissions_data.country_name,
        "country_iso_code": emissions_data.country_iso_code,
        "region": emissions_data.region,
        "cloud_provider": emissions_data.cloud_provider,
        "cloud_region": emissions_data.cloud_region,
        "on_cloud": emissions_data.on_cloud,

        # System specifications
        "os": emissions_data.os,
        "python_version": emissions_data.python_version,
        "cpu_count": emissions_data.cpu_count,
        "cpu_model": emissions_data.cpu_model,
        "gpu_count": emissions_data.gpu_count,
        "gpu_model": emissions_data.gpu_model,
        "ram_total_size_gb": emissions_data.ram_total_size,

        # Additional metrics
        "pue": emissions_data.pue,
        "codecarbon_version": emissions_data.codecarbon_version,
    }

    # Print summary
    print(f"\n{'='*80}")
    print(f"\n📈 FINE-TUNING RESULTS SUMMARY FOR {size_fraction*100}% DATASET:")
    print(f"{'='*80}")
    print(f"  Training Method: Full Fine-Tuning")
    print(f"  Model: DistilBERT")

    print(f"\n🔧 Model Parameters:")
    print(f"  Total Parameters: {total_params:,}")
    print(f"  Trainable Parameters: {trainable_params:,}")
    print(f"  Trainable Percentage: {trainable_percentage:.2f}%")


    print(f"\n🎯 Performance Metrics:")
    print(f"  F1 Score: {eval_results['eval_f1']:.4f}")
    print(f"  Exact Match: {eval_results['eval_exact_match']:.4f}")
    print(f"  Eval Loss: {eval_results['eval_loss']:.4f}")

    print(f"\n⚡ Energy Consumption:")
    print(f"  Total Energy: {emissions_data.energy_consumed:.6f} kWh")
    print(f"  CPU Energy: {emissions_data.cpu_energy:.6f} kWh ({emissions_data.cpu_energy/emissions_data.energy_consumed*100:.1f}%)")
    print(f"  GPU Energy: {emissions_data.gpu_energy:.6f} kWh ({emissions_data.gpu_energy/emissions_data.energy_consumed*100:.1f}%)")
    print(f"  RAM Energy: {emissions_data.ram_energy:.6f} kWh ({emissions_data.ram_energy/emissions_data.energy_consumed*100:.1f}%)")

    print(f"\n🔌 Average Power Draw:")
    print(f"  CPU Power: {emissions_data.cpu_power:.2f} W")
    print(f"  GPU Power: {emissions_data.gpu_power:.2f} W")
    print(f"  RAM Power: {emissions_data.ram_power:.2f} W")
    print(f"  Total Power: {emissions_data.cpu_power + emissions_data.gpu_power + emissions_data.ram_power:.2f} W")

    print(f"\n🌱 Carbon Footprint:")
    print(f"  Total CO2 Emissions: {emissions_data.emissions:.6f} kg")
    print(f"  Emissions Rate: {emissions_data.emissions_rate:.9f} kg/s")
    print(f"  Duration: {emissions_data.duration/3600:.2f} hours")
    print(f"  Training Time (Trainer): {train_results.metrics['train_runtime']/3600:.2f} hours")

    print(f"\n📍 Location & Infrastructure:")
    print(f"  Country: {emissions_data.country_name} ({emissions_data.country_iso_code})")
    print(f"  Region: {emissions_data.region}")
    print(f"  On Cloud: {emissions_data.on_cloud}")
    print(f"  PUE (Power Usage Effectiveness): {emissions_data.pue}")

    print(f"\n💻 System Specifications:")
    print(f"  OS: {emissions_data.os}")
    print(f"  CPU: {emissions_data.cpu_model} ({emissions_data.cpu_count} cores)")
    if emissions_data.gpu_count and emissions_data.gpu_model:
        print(f"  GPU: {emissions_data.gpu_model} (Count: {emissions_data.gpu_count})")
    else:
        print(f"  GPU: None detected")
    print(f"  RAM: {emissions_data.ram_total_size:.2f} GB")
    print(f"  Python: {emissions_data.python_version}")

    print(f"\n{'='*80}")

    # Save model
    trainer.save_model(f"{output_dir}/final_model")

    # Clear GPU memory
    del trainer.model
    del trainer
    torch.cuda.empty_cache()

    return result_entry

In [15]:
def run_experiment(size_fraction, train_data, eval_data, tokenizer, preprocess_fn, compute_metrics_fn, model_name):

    print(f"\n{'='*60}")
    print(f"🚀 Training with {size_fraction*100}% of training data")
    print(f"{'='*60}")

    # Step 1: Prepare dataset
    tokenized_train, num_samples = prepare_dataset(train_data, size_fraction, preprocess_fn)

    # Step 2: Train model
    trainer, train_results, emissions_data, output_dir = train_model(
        tokenized_train, eval_data, tokenizer, compute_metrics_fn,
        size_fraction, model_name
    )

    # Step 3: Evaluate and save
    result_entry = evaluate_and_save(trainer, train_results, emissions_data, output_dir, size_fraction, num_samples)

    return result_entry

In [16]:
# Store results
results_summary = []

In [17]:
%%time
# Train on 25% of the training data. (%%time is a cell magic, so it must be the first line of the cell.)
print("\n" + "="*80)
print("🔬 EXPERIMENT 1: FULL FINE-TUNING WITH 25.0% TRAINING DATASET")
print("="*80)
result1 = run_experiment(
        size_fraction=0.25,
        train_data=squad["train"],
        eval_data=tokenized_validation,
        tokenizer=tokenizer,
        preprocess_fn=preprocess_function,
        compute_metrics_fn=compute_metrics,
        model_name="distilbert-base-uncased"
    )

results_summary.append(result1)


🔬 EXPERIMENT 1: FULL FINE-TUNING WITH 25.0% TRAINING DATASET

🚀 Training with 25.0% of training data
🔄 Preprocessing 32579 training samples...


Map:   0%|          | 0/32579 [00:00<?, ? examples/s]

Some weights of DistilBertForQuestionAnswering were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
  trainer = Trainer(
[codecarbon INFO @ 03:20:12] [setup] RAM Tracking...
[codecarbon INFO @ 03:20:12] [setup] CPU Tracking...
 Linux OS detected: Please ensure RAPL files exist, and are readable, at /sys/class/powercap/intel-rapl/subsystem to measure CPU

[codecarbon INFO @ 03:20:13] CPU Model on constant consumption mode: Intel(R) Xeon(R) CPU @ 2.20GHz
[codecarbon INFO @ 03:20:13] [setup] GPU Tracking...
[codecarbon INFO @ 03:20:13] Tracking Nvidia GPU via pynvml
[codecarbon INFO @ 03:20:13] The below tracking methods have been set up:
                RAM Tracking Method: RAM power estimation model
                CPU Tracking Method: global constant
                GPU Tracking Method: py

🏋️ Training model...



Epoch,Training Loss,Validation Loss,Exact Match,F1
1,1.2787,1.740836,0.388248,0.453493
2,0.921,1.985141,0.388083,0.467055


[codecarbon INFO @ 03:20:30] Energy consumed for RAM : 0.000158 kWh. RAM Power : 38.0 W
[codecarbon INFO @ 03:20:30] Delta energy consumed for CPU with constant : 0.000177 kWh, power : 42.5 W
[codecarbon INFO @ 03:20:30] Energy consumed for All CPU : 0.000177 kWh
[codecarbon INFO @ 03:20:30] Energy consumed for all GPUs : 0.000514 kWh. Total GPU Power : 123.21456454482632 W
[codecarbon INFO @ 03:20:30] 0.000849 kWh of electricity and 0.000000 L of water were used since the beginning.
[codecarbon INFO @ 03:20:37] Energy consumed for RAM : 0.000158 kWh. RAM Power : 38.0 W
[codecarbon INFO @ 03:20:37] Delta energy consumed for CPU with constant : 0.000177 kWh, power : 42.5 W
[codecarbon INFO @ 03:20:37] Energy consumed for All CPU : 0.000177 kWh
[codecarbon INFO @ 03:20:37] Energy consumed for all GPUs : 0.000865 kWh. Total GPU Power : 207.5472413139655 W
[codecarbon INFO @ 03:20:37] 0.001201 kWh of electricity and 0.000000 L of water were used since the beginning.
[codecarbon INFO @ 03:2

📊 Evaluating model...

📈 FINE-TUNING RESULTS SUMMARY FOR 25.0% DATASET:
  Training Method: Full Fine-Tuning
  Model: DistilBERT

🔧 Model Parameters:
  Total Parameters: 66,364,418
  Trainable Parameters: 66,364,418
  Trainable Percentage: 100.00%

🎯 Performance Metrics:
  F1 Score: 0.4671
  Exact Match: 0.3881
  Eval Loss: 1.9851

⚡ Energy Consumption:
  Total Energy: 0.014444 kWh
  CPU Energy: 0.002100 kWh (14.5%)
  GPU Energy: 0.010467 kWh (72.5%)
  RAM Energy: 0.001877 kWh (13.0%)

🔌 Average Power Draw:
  CPU Power: 42.50 W
  GPU Power: 211.33 W
  RAM Power: 38.00 W
  Total Power: 291.83 W

🌱 Carbon Footprint:
  Total CO2 Emissions: 0.003866 kg
  Emissions Rate: 0.000021720 kg/s
  Duration: 0.05 hours
  Training Time (Trainer): 0.05 hours

📍 Location & Infrastructure:
  Country: The Netherlands (NLD)
  Region: groningen
  On Cloud: N
  PUE (Power Usage Effectiveness): 1.0

💻 System Specifications:
  OS: Linux-6.6.105+-x86_64-with-glibc2.35
  CPU: Intel(R) Xeon(R) CPU @ 2.20GHz (12
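
As a rough cross-check of the figures above, energy should be close to power multiplied by duration. The sketch below redoes that arithmetic for the 25% run, using the reported total power draw (291.83 W), the tracked duration from the results table (177.968649 s), and the reported total energy (0.014444 kWh). The reported power values may not be exact time averages, so only approximate agreement is expected.

```python
def energy_kwh(power_watts, duration_seconds):
    """Convert a constant power draw over a duration into kilowatt-hours."""
    joules = power_watts * duration_seconds
    return joules / 3_600_000  # 1 kWh = 3.6e6 J


estimated = energy_kwh(291.83, 177.968649)
reported = 0.014444

relative_gap = abs(estimated - reported) / reported
print(f"estimated: {estimated:.6f} kWh")
print(f"reported:  {reported:.6f} kWh")
print(f"relative gap: {relative_gap:.2%}")
```

A gap well under 1% suggests the energy total and the power readings are mutually consistent for this run.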

In [18]:
%%time
# Train on 50% of the training data. (%%time is a cell magic, so it must be the first line of the cell.)
print("\n" + "="*80)
print("🔬 EXPERIMENT 2: FULL FINE-TUNING WITH 50.0% TRAINING DATASET")
print("="*80)
result2 = run_experiment(
        size_fraction=0.5,
        train_data=squad["train"],
        eval_data=tokenized_validation,
        tokenizer=tokenizer,
        preprocess_fn=preprocess_function,
        compute_metrics_fn=compute_metrics,
        model_name="distilbert-base-uncased"
    )
results_summary.append(result2)


🔬 EXPERIMENT 2: FULL FINE-TUNING WITH 50.0% TRAINING DATASET

🚀 Training with 50.0% of training data
🔄 Preprocessing 65159 training samples...


Map:   0%|          | 0/65159 [00:00<?, ? examples/s]

Some weights of DistilBertForQuestionAnswering were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
  trainer = Trainer(
[codecarbon INFO @ 03:24:03] [setup] RAM Tracking...
[codecarbon INFO @ 03:24:03] [setup] CPU Tracking...
 Linux OS detected: Please ensure RAPL files exist, and are readable, at /sys/class/powercap/intel-rapl/subsystem to measure CPU

[codecarbon INFO @ 03:24:04] CPU Model on constant consumption mode: Intel(R) Xeon(R) CPU @ 2.20GHz
[codecarbon INFO @ 03:24:04] [setup] GPU Tracking...
[codecarbon INFO @ 03:24:04] Tracking Nvidia GPU via pynvml
[codecarbon INFO @ 03:24:04] The below tracking methods have been set up:
                RAM Tracking Method: RAM power estimation model
                CPU Tracking Method: global constant
                GPU Tracking Method: py

🏋️ Training model...


Epoch,Training Loss,Validation Loss,Exact Match,F1
1,1.2324,1.466417,0.452283,0.531723
2,0.9195,1.443946,0.491017,0.569492


[codecarbon INFO @ 03:24:20] Energy consumed for RAM : 0.000158 kWh. RAM Power : 38.0 W
[codecarbon INFO @ 03:24:20] Delta energy consumed for CPU with constant : 0.000177 kWh, power : 42.5 W
[codecarbon INFO @ 03:24:20] Energy consumed for All CPU : 0.000177 kWh
[codecarbon INFO @ 03:24:20] Energy consumed for all GPUs : 0.000916 kWh. Total GPU Power : 219.73905841974874 W
[codecarbon INFO @ 03:24:20] 0.001252 kWh of electricity and 0.000000 L of water were used since the beginning.
[codecarbon INFO @ 03:24:21] Energy consumed for RAM : 0.000158 kWh. RAM Power : 38.0 W
[codecarbon INFO @ 03:24:21] Delta energy consumed for CPU with constant : 0.000177 kWh, power : 42.5 W
[codecarbon INFO @ 03:24:21] Energy consumed for All CPU : 0.000177 kWh
[codecarbon INFO @ 03:24:21] Energy consumed for all GPUs : 0.000936 kWh. Total GPU Power : 224.4718334008609 W
[codecarbon INFO @ 03:24:21] 0.001271 kWh of electricity and 0.000000 L of water were used since the beginning.
[codecarbon INFO @ 03:2

📊 Evaluating model...

📈 FINE-TUNING RESULTS SUMMARY FOR 50.0% DATASET:
  Training Method: Full Fine-Tuning
  Model: DistilBERT

🔧 Model Parameters:
  Total Parameters: 66,364,418
  Trainable Parameters: 66,364,418
  Trainable Percentage: 100.00%

🎯 Performance Metrics:
  F1 Score: 0.5695
  Exact Match: 0.4910
  Eval Loss: 1.4439

⚡ Energy Consumption:
  Total Energy: 0.026666 kWh
  CPU Energy: 0.003716 kWh (13.9%)
  GPU Energy: 0.019627 kWh (73.6%)
  RAM Energy: 0.003323 kWh (12.5%)

🔌 Average Power Draw:
  CPU Power: 42.50 W
  GPU Power: 224.40 W
  RAM Power: 38.00 W
  Total Power: 304.90 W

🌱 Carbon Footprint:
  Total CO2 Emissions: 0.007136 kg
  Emissions Rate: 0.000022660 kg/s
  Duration: 0.09 hours
  Training Time (Trainer): 0.09 hours

📍 Location & Infrastructure:
  Country: The Netherlands (NLD)
  Region: groningen
  On Cloud: N
  PUE (Power Usage Effectiveness): 1.0

💻 System Specifications:
  OS: Linux-6.6.105+-x86_64-with-glibc2.35
  CPU: Intel(R) Xeon(R) CPU @ 2.20GHz (12

In [19]:
%%time
# Train on 80% of the training data. (%%time is a cell magic, so it must be the first line of the cell.)
print("\n" + "="*80)
print("🔬 EXPERIMENT 3: FULL FINE-TUNING WITH 80.0% TRAINING DATASET")
print("="*80)
result3 = run_experiment(
        size_fraction=0.8,
        train_data=squad["train"],
        eval_data=tokenized_validation,
        tokenizer=tokenizer,
        preprocess_fn=preprocess_function,
        compute_metrics_fn=compute_metrics,
        model_name="distilbert-base-uncased"
    )
results_summary.append(result3)


🔬 EXPERIMENT 3: FULL FINE-TUNING WITH 80.0% TRAINING DATASET

🚀 Training with 80.0% of training data
🔄 Preprocessing 104255 training samples...


Map:   0%|          | 0/104255 [00:00<?, ? examples/s]

Some weights of DistilBertForQuestionAnswering were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
  trainer = Trainer(
[codecarbon INFO @ 03:30:34] [setup] RAM Tracking...
[codecarbon INFO @ 03:30:34] [setup] CPU Tracking...
 Linux OS detected: Please ensure RAPL files exist, and are readable, at /sys/class/powercap/intel-rapl/subsystem to measure CPU

[codecarbon INFO @ 03:30:35] CPU Model on constant consumption mode: Intel(R) Xeon(R) CPU @ 2.20GHz
[codecarbon INFO @ 03:30:35] [setup] GPU Tracking...
[codecarbon INFO @ 03:30:35] Tracking Nvidia GPU via pynvml
[codecarbon INFO @ 03:30:35] The below tracking methods have been set up:
                RAM Tracking Method: RAM power estimation model
                CPU Tracking Method: global constant
                GPU Tracking Method: py

🏋️ Training model...


Epoch,Training Loss,Validation Loss,Exact Match,F1
1,1.2582,1.220999,0.522416,0.594213
2,0.8958,1.300708,0.530658,0.609658


[codecarbon INFO @ 03:30:51] Energy consumed for RAM : 0.000158 kWh. RAM Power : 38.0 W
[codecarbon INFO @ 03:30:51] Delta energy consumed for CPU with constant : 0.000177 kWh, power : 42.5 W
[codecarbon INFO @ 03:30:51] Energy consumed for All CPU : 0.000177 kWh
[codecarbon INFO @ 03:30:51] Energy consumed for all GPUs : 0.000911 kWh. Total GPU Power : 218.4722305666519 W
[codecarbon INFO @ 03:30:51] 0.001246 kWh of electricity and 0.000000 L of water were used since the beginning.
[codecarbon INFO @ 03:30:52] Energy consumed for RAM : 0.000158 kWh. RAM Power : 38.0 W
[codecarbon INFO @ 03:30:52] Delta energy consumed for CPU with constant : 0.000177 kWh, power : 42.5 W
[codecarbon INFO @ 03:30:52] Energy consumed for All CPU : 0.000177 kWh
[codecarbon INFO @ 03:30:52] Energy consumed for all GPUs : 0.000937 kWh. Total GPU Power : 224.6698272343695 W
[codecarbon INFO @ 03:30:52] 0.001272 kWh of electricity and 0.000000 L of water were used since the beginning.
[codecarbon INFO @ 03:31

📊 Evaluating model...

📈 FINE-TUNING RESULTS SUMMARY FOR 80.0% DATASET:
  Training Method: Full Fine-Tuning
  Model: DistilBERT

🔧 Model Parameters:
  Total Parameters: 66,364,418
  Trainable Parameters: 66,364,418
  Trainable Percentage: 100.00%

🎯 Performance Metrics:
  F1 Score: 0.6097
  Exact Match: 0.5307
  Eval Loss: 1.3007

⚡ Energy Consumption:
  Total Energy: 0.041722 kWh
  CPU Energy: 0.005793 kWh (13.9%)
  GPU Energy: 0.030749 kWh (73.7%)
  RAM Energy: 0.005179 kWh (12.4%)

🔌 Average Power Draw:
  CPU Power: 42.50 W
  GPU Power: 225.07 W
  RAM Power: 38.00 W
  Total Power: 305.57 W

🌱 Carbon Footprint:
  Total CO2 Emissions: 0.011166 kg
  Emissions Rate: 0.000022743 kg/s
  Duration: 0.14 hours
  Training Time (Trainer): 0.14 hours

📍 Location & Infrastructure:
  Country: The Netherlands (NLD)
  Region: groningen
  On Cloud: N
  PUE (Power Usage Effectiveness): 1.0

💻 System Specifications:
  OS: Linux-6.6.105+-x86_64-with-glibc2.35
  CPU: Intel(R) Xeon(R) CPU @ 2.20GHz (12

### STEP 4.1: Results and Analysis

In [20]:
# Create summary DataFrame
results_df = pd.DataFrame(results_summary)

print("\n" + "="*60)
print("📊 FINAL RESULTS SUMMARY")
print("="*60)
print(results_df.to_string(index=False))


📊 FINAL RESULTS SUMMARY
 training_method model_name  dataset_size%  train_samples  valid_samples  trainable_params  total_params  trainable_percentage  f1_score  exact_match  eval_loss  training_time_hours  emissions_rate_kg_per_s  emissions_kg           timestamp  duration_seconds  duration_hours  energy_consumed_kwh  cpu_energy_kwh  gpu_energy_kwh  ram_energy_kwh  cpu_power_w  gpu_power_w  ram_power_w    country_name country_iso_code    region cloud_provider cloud_region on_cloud                                   os python_version  cpu_count                      cpu_model  gpu_count                 gpu_model  ram_total_size_gb  pue codecarbon_version
Full Fine-Tuning DistilBERT             25          32579          12134          66364418      66364418                 100.0  0.467055     0.388083   1.985141             0.049323                 0.000022      0.003866 2025-12-01T03:23:13        177.968649        0.049436             0.014444        0.002100        0.010467        

In [21]:
results_df.to_csv("/content/drive/MyDrive/distilbert_dataset_size_results.csv", index=False)

In [22]:
# Load the dataset
full_ft_results = pd.read_csv("/content/drive/MyDrive/distilbert_dataset_size_results.csv")

print("📊 Data loaded successfully!")
print(f"Total experiments: {len(full_ft_results)}")
print("\nExperiments:")
print(full_ft_results[['train_samples', 'dataset_size%', 'f1_score', 'emissions_kg']])


📊 Data loaded successfully!
Total experiments: 3

Experiments:
   train_samples  dataset_size%  f1_score  emissions_kg
0          32579             25  0.467055      0.003866
1          65159             50  0.569492      0.007136
2         104255             80  0.609658      0.011166
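
The table suggests diminishing returns: each additional slice of training data adds a similar amount of emissions but buys less F1. A minimal sketch of that arithmetic, using only the three rows printed above (the helper name `marginal_gains` is hypothetical and not part of the notebook):

```python
# Rows copied from the printed results above: (dataset_size%, f1_score, emissions_kg)
runs = [
    (25, 0.467055, 0.003866),
    (50, 0.569492, 0.007136),
    (80, 0.609658, 0.011166),
]

def marginal_gains(rows):
    """Return (from%, to%, delta_f1, delta_kg, delta_f1 per gram CO2) for consecutive runs."""
    out = []
    for (s0, f0, e0), (s1, f1, e1) in zip(rows, rows[1:]):
        d_f1 = f1 - f0          # F1 gained by adding more data
        d_kg = e1 - e0          # extra emissions for that gain, in kg
        out.append((s0, s1, d_f1, d_kg, d_f1 / (d_kg * 1000)))
    return out

for s0, s1, d_f1, d_kg, per_gram in marginal_gains(runs):
    print(f"{s0}% -> {s1}%: +{d_f1:.4f} F1 for +{d_kg * 1000:.2f} g CO2 ({per_gram:.4f} F1 per g)")
```

Under these numbers, the step from 50% to 80% yields roughly a third of the F1 gain per gram of CO₂ that the step from 25% to 50% does. Each configuration was run once, so the ratio should be read as indicative rather than exact.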


In [23]:
# PLOT 1: Energy Consumption vs Dataset Size (Stacked Bar)
df_sorted = full_ft_results.sort_values('train_samples')

fig = go.Figure()

fig.add_trace(go.Bar(
    name='CPU Energy',
    x=df_sorted['dataset_size%'],
    y=df_sorted['cpu_energy_kwh'],
    marker_color='#FF6B6B',
    hovertemplate='<b>CPU Energy</b><br>%{y:.6f} kWh<br>Dataset: %{x:.0f}%<extra></extra>'
))

fig.add_trace(go.Bar(
    name='GPU Energy',
    x=df_sorted['dataset_size%'],
    y=df_sorted['gpu_energy_kwh'],
    marker_color='#4ECDC4',
    hovertemplate='<b>GPU Energy</b><br>%{y:.6f} kWh<br>Dataset: %{x:.0f}%<extra></extra>'
))

fig.add_trace(go.Bar(
    name='RAM Energy',
    x=df_sorted['dataset_size%'],
    y=df_sorted['ram_energy_kwh'],
    marker_color='#95E1D3',
    hovertemplate='<b>RAM Energy</b><br>%{y:.6f} kWh<br>Dataset: %{x:.0f}%<extra></extra>'
))

fig.update_layout(
    title=dict(text="Energy Consumption Scaling with Dataset Size", font=dict(size=18)),
    xaxis_title='Dataset Size (%)',
    yaxis_title='Energy Consumption (kWh)',
    barmode='stack',
    template='plotly_white',
    height=500,
    font=dict(size=13),
    legend=dict(
        orientation="h",
        yanchor="bottom",
        y=1.02,
        xanchor="right",
        x=1
    ),
    hovermode='x unified'
)

fig.show()
fig.write_html("/content/drive/MyDrive/full_ft_energy_scaling.html")

In [24]:
# PLOT 2: Performance & Emissions Growth (Dual Y-axis)
df_sorted = full_ft_results.sort_values('train_samples')

fig = make_subplots(specs=[[{"secondary_y": True}]])

# F1 Score line
fig.add_trace(
    go.Scatter(
        x=df_sorted['dataset_size%'],
        y=df_sorted['f1_score'],
        name='F1 Score',
        mode='lines+markers',
        line=dict(color='#4ECDC4', width=3),
        marker=dict(size=12, line=dict(width=2, color='white')),
        hovertemplate='<b>F1 Score</b>: %{y:.4f}<br>Dataset: %{x:.0f}%<extra></extra>'
    ),
    secondary_y=False
)

# Exact Match line
fig.add_trace(
    go.Scatter(
        x=df_sorted['dataset_size%'],
        y=df_sorted['exact_match'],
        name='Exact Match',
        mode='lines+markers',
        line=dict(color='#95E1D3', width=3, dash='dash'),
        marker=dict(size=10),
        hovertemplate='<b>Exact Match</b>: %{y:.4f}<br>Dataset: %{x:.0f}%<extra></extra>'
    ),
    secondary_y=False
)

# CO2 Emissions bar
fig.add_trace(
    go.Bar(
        x=df_sorted['dataset_size%'],
        y=df_sorted['emissions_kg'],
        name='CO₂ Emissions',
        marker_color='#FF6B6B',
        opacity=0.6,
        hovertemplate='<b>CO₂</b>: %{y:.6f} kg<br>Dataset: %{x:.0f}%<extra></extra>'
    ),
    secondary_y=True
)

fig.update_xaxes(title_text="Dataset Size (%)")
fig.update_yaxes(title_text="Performance Score", secondary_y=False)
fig.update_yaxes(title_text="CO₂ Emissions (kg)", secondary_y=True)

fig.update_layout(
    title=dict(text="Performance vs Carbon Emissions by Dataset Size", font=dict(size=18)),
    template='plotly_white',
    height=500,
    font=dict(size=13),
    hovermode='x unified',
    legend=dict(
        orientation="h",
        yanchor="bottom",
        y=1.02,
        xanchor="right",
        x=1
    )
)

fig.show()
fig.write_html("/content/drive/MyDrive/full_ft_performance_emissions.html")

# Training Strategy 2: LoRA (Low-Rank Adaptation) Fine-Tuning (Model DistilBERT)

In [25]:
# Import PEFT for LoRA
from peft import LoraConfig, get_peft_model, TaskType, PeftModel

## STEP 5: Creating And Training LoRA Model

In [26]:
def create_lora_model(model_name="distilbert-base-uncased", r=8, lora_alpha=16, lora_dropout=0.1):
    # Load base model
    base_model = AutoModelForQuestionAnswering.from_pretrained(model_name)

    # Configure LoRA
    lora_config = LoraConfig(
        task_type=TaskType.QUESTION_ANS,  # Task type for QA
        r=r,                               # Rank of update matrices
        lora_alpha=lora_alpha,             # Scaling factor
        lora_dropout=lora_dropout,         # Dropout probability
        target_modules=["q_lin", "v_lin"], # Which layers to apply LoRA to
        bias="none",                       # Don't train biases
        inference_mode=False,              # Training mode
    )

    # Apply LoRA to model
    lora_model = get_peft_model(base_model, lora_config)

    # Print trainable parameters
    lora_model.print_trainable_parameters()

    return lora_model
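
As a rough check on what `print_trainable_parameters()` reports, the trainable count can be estimated by hand. The sketch below assumes DistilBERT-base has 6 transformer layers with hidden size 768, that each adapted linear layer gains one `r x 768` and one `768 x r` matrix, and that PEFT also keeps the QA head (`qa_outputs`: 768 x 2 weights plus 2 biases) trainable for `TaskType.QUESTION_ANS`. The function name `lora_trainable_params` is illustrative only and does not appear elsewhere in this notebook.

```python
def lora_trainable_params(r, hidden=768, layers=6, targets=2, qa_head=768 * 2 + 2):
    """Estimate trainable parameters for LoRA applied to q_lin and v_lin in DistilBERT."""
    # Each adapted linear layer gains A (r x hidden) and B (hidden x r).
    per_module = r * hidden + hidden * r
    # `targets` adapted modules per layer, plus the (assumed trainable) QA head.
    return layers * targets * per_module + qa_head

for r in (4, 8, 16):
    print(f"rank {r}: {lora_trainable_params(r):,} trainable parameters")
```

The adapter term is linear in `r`, so doubling the rank roughly doubles the trainable parameters, while the QA head contributes a fixed 1,538.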

In [27]:
def train_lora_model(tokenized_train, tokenized_eval, tokenizer, compute_metrics_fn,
                     size_fraction, lora_rank=8):

    # Create LoRA model
    print(f"\n🔧 Creating LoRA model (rank={lora_rank})...")
    lora_model = create_lora_model(
        model_name="distilbert-base-uncased",
        r=lora_rank,
        lora_alpha=lora_rank * 2,  # Common practice: alpha = 2*r
        lora_dropout=0.1
    )

    # Setup output directory
    output_dir = f"results_distilbert_lora_r{lora_rank}_{int(size_fraction*100)}pct"

    # Training arguments (can use higher learning rate for LoRA)
    training_args = TrainingArguments(
        output_dir=output_dir,
        eval_strategy="epoch",
        learning_rate=3e-4,
        per_device_train_batch_size=16,
        per_device_eval_batch_size=16,
        num_train_epochs=2,
        weight_decay=0.01,
        fp16=torch.cuda.is_available(),
        save_strategy="epoch",
        load_best_model_at_end=True,
        metric_for_best_model="f1",
        push_to_hub=False,
        logging_steps=100,
        greater_is_better=True
    )

    # Initialize trainer
    trainer = Trainer(
        model=lora_model,
        args=training_args,
        train_dataset=tokenized_train,
        eval_dataset=tokenized_eval,
        tokenizer=tokenizer,
        data_collator=default_data_collator,
        compute_metrics=compute_metrics_fn
    )

    # Start carbon tracking
    tracker = EmissionsTracker(
        project_name=f"DistilBERT_LoRA_r{lora_rank}_{int(size_fraction*100)}pct",
        output_dir=output_dir,
        save_to_file=True,
        log_level="info"
    )
    tracker.start()

    # Train
    print("🏋️ Training LoRA model...")
    train_results = trainer.train()

    # Stop tracking and get detailed emissions data
    emissions_kg = tracker.stop()

    # Get full emissions data object
    emissions_data = tracker.final_emissions_data

    return trainer, train_results, emissions_data, output_dir, lora_model

## STEP 6: Evaluating The LoRA Model On Different Rank Sizes

> We train the LoRA model at several ranks, holding the training data fixed at 80% of the SQuAD training set.
>
> LoRA ranks tested: [4, 8, 16]

In [28]:
def evaluate_and_save_lora(trainer, train_results, emissions_data, output_dir,
                           size_fraction, num_samples, lora_model):
    """Evaluate LoRA model and save results with detailed emissions."""
    print("📊 Evaluating LoRA model...")
    eval_results = trainer.evaluate()

    # Count trainable parameters
    trainable_params = sum(p.numel() for p in lora_model.parameters() if p.requires_grad)
    total_params = sum(p.numel() for p in lora_model.parameters())
    trainable_percentage = 100 * trainable_params / total_params

    # Extract emissions data from EmissionsData object
    result_entry = {
        "training_method": "LoRA",
        "model_name": "DistilBERT",
        "dataset_size%": int(size_fraction*100),
        "lora_rank": lora_model.peft_config['default'].r,
        "train_samples": num_samples,
        "valid_samples": len(trainer.eval_dataset),  # use the trainer's own eval set instead of a notebook global
        "trainable_params": trainable_params,
        "total_params": total_params,
        "trainable_percentage": trainable_percentage,

        # Performance metrics
        "f1_score": eval_results["eval_f1"],
        "exact_match": eval_results["eval_exact_match"],
        "eval_loss": eval_results["eval_loss"],
        "training_time_hours": train_results.metrics["train_runtime"] / 3600,

        # Emissions data (direct access to EmissionsData attributes)
        "emissions_rate_kg_per_s": emissions_data.emissions_rate,
        "emissions_kg": emissions_data.emissions,
        "timestamp": emissions_data.timestamp,
        "duration_seconds": emissions_data.duration,
        "duration_hours": emissions_data.duration / 3600,

        # Energy consumption
        "energy_consumed_kwh": emissions_data.energy_consumed,
        "cpu_energy_kwh": emissions_data.cpu_energy,
        "gpu_energy_kwh": emissions_data.gpu_energy,
        "ram_energy_kwh": emissions_data.ram_energy,

        # Power draw
        "cpu_power_w": emissions_data.cpu_power,
        "gpu_power_w": emissions_data.gpu_power,
        "ram_power_w": emissions_data.ram_power,

        # Location and system info
        "country_name": emissions_data.country_name,
        "country_iso_code": emissions_data.country_iso_code,
        "region": emissions_data.region,
        "cloud_provider": emissions_data.cloud_provider,
        "cloud_region": emissions_data.cloud_region,
        "on_cloud": emissions_data.on_cloud,

        # System specifications
        "os": emissions_data.os,
        "python_version": emissions_data.python_version,
        "cpu_count": emissions_data.cpu_count,
        "cpu_model": emissions_data.cpu_model,
        "gpu_count": emissions_data.gpu_count,
        "gpu_model": emissions_data.gpu_model,
        "ram_total_size_gb": emissions_data.ram_total_size,

        # Additional metrics
        "pue": emissions_data.pue,  # Power Usage Effectiveness
        "codecarbon_version": emissions_data.codecarbon_version,
    }

    # Print detailed summary
    print(f"\n{'='*80}")
    print(f"📈 LoRA RESULTS SUMMARY (Rank {result_entry['lora_rank']})")
    print(f"{'='*80}")

    print(f"\n🔧 Model Configuration:")
    print(f"  Training Method: LoRA")
    print(f"  LoRA Rank: {result_entry['lora_rank']}")
    print(f"  Trainable Parameters: {trainable_params:,} ({trainable_percentage:.2f}%)")
    print(f"  Total Parameters: {total_params:,}")
    print(f"  Dataset Size: {size_fraction*100}%")

    print(f"\n🎯 Performance Metrics:")
    print(f"  F1 Score: {eval_results['eval_f1']:.4f}")
    print(f"  Exact Match: {eval_results['eval_exact_match']:.4f}")
    print(f"  Eval Loss: {eval_results['eval_loss']:.4f}")

    print(f"\n⚡ Energy Consumption:")
    print(f"  Total Energy: {emissions_data.energy_consumed:.6f} kWh")
    print(f"  CPU Energy: {emissions_data.cpu_energy:.6f} kWh ({emissions_data.cpu_energy/emissions_data.energy_consumed*100:.1f}%)")
    print(f"  GPU Energy: {emissions_data.gpu_energy:.6f} kWh ({emissions_data.gpu_energy/emissions_data.energy_consumed*100:.1f}%)")
    print(f"  RAM Energy: {emissions_data.ram_energy:.6f} kWh ({emissions_data.ram_energy/emissions_data.energy_consumed*100:.1f}%)")

    print(f"\n🔌 Average Power Draw:")
    print(f"  CPU Power: {emissions_data.cpu_power:.2f} W")
    print(f"  GPU Power: {emissions_data.gpu_power:.2f} W")
    print(f"  RAM Power: {emissions_data.ram_power:.2f} W")
    print(f"  Total Power: {emissions_data.cpu_power + emissions_data.gpu_power + emissions_data.ram_power:.2f} W")

    print(f"\n🌱 Carbon Footprint:")
    print(f"  Total CO2 Emissions: {emissions_data.emissions:.6f} kg")
    print(f"  Emissions Rate: {emissions_data.emissions_rate:.9f} kg/s")
    print(f"  Duration: {emissions_data.duration/3600:.2f} hours")
    print(f"  Training Time (Trainer): {train_results.metrics['train_runtime']/3600:.2f} hours")

    print(f"\n📍 Location & Infrastructure:")
    print(f"  Country: {emissions_data.country_name} ({emissions_data.country_iso_code})")
    print(f"  Region: {emissions_data.region}")
    print(f"  On Cloud: {emissions_data.on_cloud}")
    print(f"  PUE (Power Usage Effectiveness): {emissions_data.pue}")

    print(f"\n💻 System Specifications:")
    print(f"  OS: {emissions_data.os}")
    print(f"  CPU: {emissions_data.cpu_model} ({emissions_data.cpu_count} cores)")
    if emissions_data.gpu_count and emissions_data.gpu_model:
        print(f"  GPU: {emissions_data.gpu_model} (Count: {emissions_data.gpu_count})")
    else:
        print(f"  GPU: None detected")
    print(f"  RAM: {emissions_data.ram_total_size:.2f} GB")
    print(f"  Python: {emissions_data.python_version}")

    print(f"\n{'='*80}")

    # Save LoRA adapters
    lora_model.save_pretrained(f"{output_dir}/lora_adapters")
    tokenizer.save_pretrained(f"{output_dir}/lora_adapters")
    print(f"✅ LoRA adapters saved to {output_dir}/lora_adapters")

    # Clean up
    del trainer.model
    del trainer
    torch.cuda.empty_cache()

    return result_entry


In [29]:
def run_lora_experiment(size_fraction, train_data, eval_data, tokenizer, preprocess_fn, compute_metrics_fn, lora_rank):

    print(f"\n{'='*60}")
    print(f"🚀 LoRA Training with {size_fraction*100}% of training data")
    print(f"{'='*60}")

    # Step 1: Prepare dataset
    tokenized_train, num_samples = prepare_dataset(train_data, size_fraction, preprocess_fn)

    # Step 2: Train LoRA model
    trainer, train_results, emissions_data, output_dir, lora_model = train_lora_model(
        tokenized_train, eval_data, tokenizer, compute_metrics_fn,
        size_fraction, lora_rank
    )

    # Step 3: Evaluate and save
    result_entry = evaluate_and_save_lora(trainer, train_results, emissions_data, output_dir, size_fraction, num_samples, lora_model)

    return result_entry

In [30]:
result_lora = []

In [31]:
%%time
print("\n" + "="*80)
print("🔬 EXPERIMENT 1: LoRA with Rank 4")
print("="*80)

result_r4 = run_lora_experiment(
    size_fraction=0.8,  # 80% of training data
    train_data=squad["train"],
    eval_data=tokenized_validation,
    tokenizer=tokenizer,
    preprocess_fn=preprocess_function,
    compute_metrics_fn=compute_metrics,
    lora_rank=4
)
result_lora.append(result_r4)

Some weights of DistilBertForQuestionAnswering were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



🔬 EXPERIMENT 1: LoRA with Rank 4

🚀 LoRA Training with 80.0% of training data
🔄 Preprocessing 104255 training samples...

🔧 Creating LoRA model (rank=4)...
trainable params: 75,266 || all params: 66,439,684 || trainable%: 0.1133



`tokenizer` is deprecated and will be removed in version 5.0.0 for `Trainer.__init__`. Use `processing_class` instead.

[codecarbon INFO @ 03:39:04] [setup] RAM Tracking...
[codecarbon INFO @ 03:39:04] [setup] CPU Tracking...
 Linux OS detected: Please ensure RAPL files exist, and are readable, at /sys/class/powercap/intel-rapl/subsystem to measure CPU

[codecarbon INFO @ 03:39:05] CPU Model on constant consumption mode: Intel(R) Xeon(R) CPU @ 2.20GHz
[codecarbon INFO @ 03:39:05] [setup] GPU Tracking...
[codecarbon INFO @ 03:39:05] Tracking Nvidia GPU via pynvml
[codecarbon INFO @ 03:39:05] The below tracking methods have been set up:
                RAM Tracking Method: RAM power estimation model
                CPU Tracking Method: global constant
                GPU Tracking Method: pynvml
            
[codecarbon INFO @ 03:39:05] >>> Tracker's metadata:
[codecarbon INFO @ 03:39:05]   Platform system: Linux-6.6.105+-x86_64-with-glibc2.35
[codecarbon INFO @ 03:39:05]   Python versio

🏋️ Training LoRA model...


Epoch,Training Loss,Validation Loss,Exact Match,F1
1,1.7302,1.487096,0.408274,0.462559
2,1.5664,1.417748,0.430526,0.493034


[codecarbon INFO @ 03:39:21] Energy consumed for RAM : 0.000158 kWh. RAM Power : 38.0 W
[codecarbon INFO @ 03:39:21] Delta energy consumed for CPU with constant : 0.000177 kWh, power : 42.5 W
[codecarbon INFO @ 03:39:21] Energy consumed for All CPU : 0.000177 kWh
[codecarbon INFO @ 03:39:21] Energy consumed for all GPUs : 0.000770 kWh. Total GPU Power : 184.6006738769024 W
[codecarbon INFO @ 03:39:21] 0.001105 kWh of electricity and 0.000000 L of water were used since the beginning.
[codecarbon INFO @ 03:39:22] Energy consumed for RAM : 0.000158 kWh. RAM Power : 38.0 W
[codecarbon INFO @ 03:39:22] Delta energy consumed for CPU with constant : 0.000177 kWh, power : 42.5 W
[codecarbon INFO @ 03:39:22] Energy consumed for All CPU : 0.000177 kWh
[codecarbon INFO @ 03:39:22] Energy consumed for all GPUs : 0.000789 kWh. Total GPU Power : 189.35941162223386 W
[codecarbon INFO @ 03:39:22] 0.001125 kWh of electricity and 0.000000 L of water were used since the beginning.
[codecarbon INFO @ 03:3

📊 Evaluating LoRA model...



📈 LoRA RESULTS SUMMARY (Rank 4)

🔧 Model Configuration:
  Training Method: LoRA
  LoRA Rank: 4
  Trainable Parameters: 75,266 (0.11%)
  Total Parameters: 66,439,684
  Dataset Size: 80.0%

🎯 Performance Metrics:
  F1 Score: 0.4930
  Exact Match: 0.4305
  Eval Loss: 1.4177

⚡ Energy Consumption:
  Total Energy: 0.036692 kWh
  CPU Energy: 0.005768 kWh (15.7%)
  GPU Energy: 0.025768 kWh (70.2%)
  RAM Energy: 0.005156 kWh (14.1%)

🔌 Average Power Draw:
  CPU Power: 42.50 W
  GPU Power: 189.39 W
  RAM Power: 38.00 W
  Total Power: 269.89 W

🌱 Carbon Footprint:
  Total CO2 Emissions: 0.009820 kg
  Emissions Rate: 0.000020088 kg/s
  Duration: 0.14 hours
  Training Time (Trainer): 0.14 hours

📍 Location & Infrastructure:
  Country: The Netherlands (NLD)
  Region: groningen
  On Cloud: N
  PUE (Power Usage Effectiveness): 1.0

💻 System Specifications:
  OS: Linux-6.6.105+-x86_64-with-glibc2.35
  CPU: Intel(R) Xeon(R) CPU @ 2.20GHz (12 cores)
  GPU: 1 x NVIDIA A100-SXM4-40

In [32]:
%%time
print("\n" + "="*80)
print("🔬 EXPERIMENT 2: LoRA with Rank 8")
print("="*80)

result_r8 = run_lora_experiment(
    size_fraction=0.8,
    train_data=squad["train"],
    eval_data=tokenized_validation,
    tokenizer=tokenizer,
    preprocess_fn=preprocess_function,
    compute_metrics_fn=compute_metrics,
    lora_rank=8
)
result_lora.append(result_r8)

Some weights of DistilBertForQuestionAnswering were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



🔬 EXPERIMENT 2: LoRA with Rank 8

🚀 LoRA Training with 80.0% of training data
🔄 Preprocessing 104255 training samples...

🔧 Creating LoRA model (rank=8)...
trainable params: 148,994 || all params: 66,513,412 || trainable%: 0.2240



`tokenizer` is deprecated and will be removed in version 5.0.0 for `Trainer.__init__`. Use `processing_class` instead.

[codecarbon INFO @ 03:47:29] [setup] RAM Tracking...
[codecarbon INFO @ 03:47:29] [setup] CPU Tracking...
 Linux OS detected: Please ensure RAPL files exist, and are readable, at /sys/class/powercap/intel-rapl/subsystem to measure CPU

[codecarbon INFO @ 03:47:31] CPU Model on constant consumption mode: Intel(R) Xeon(R) CPU @ 2.20GHz
[codecarbon INFO @ 03:47:31] [setup] GPU Tracking...
[codecarbon INFO @ 03:47:31] Tracking Nvidia GPU via pynvml
[codecarbon INFO @ 03:47:31] The below tracking methods have been set up:
                RAM Tracking Method: RAM power estimation model
                CPU Tracking Method: global constant
                GPU Tracking Method: pynvml
            
[codecarbon INFO @ 03:47:31] >>> Tracker's metadata:
[codecarbon INFO @ 03:47:31]   Platform system: Linux-6.6.105+-x86_64-with-glibc2.35
[codecarbon INFO @ 03:47:31]   Python versio

🏋️ Training LoRA model...


Epoch,Training Loss,Validation Loss,Exact Match,F1
1,1.6433,1.424173,0.423768,0.484082
2,1.4595,1.371498,0.443053,0.512776


[codecarbon INFO @ 03:47:47] Energy consumed for RAM : 0.000158 kWh. RAM Power : 38.0 W
[codecarbon INFO @ 03:47:47] Delta energy consumed for CPU with constant : 0.000177 kWh, power : 42.5 W
[codecarbon INFO @ 03:47:47] Energy consumed for All CPU : 0.000177 kWh
[codecarbon INFO @ 03:47:47] Energy consumed for all GPUs : 0.000779 kWh. Total GPU Power : 187.00624227949908 W
[codecarbon INFO @ 03:47:47] 0.001115 kWh of electricity and 0.000000 L of water were used since the beginning.
[codecarbon INFO @ 03:47:48] Energy consumed for RAM : 0.000158 kWh. RAM Power : 38.0 W
[codecarbon INFO @ 03:47:48] Delta energy consumed for CPU with constant : 0.000177 kWh, power : 42.5 W
[codecarbon INFO @ 03:47:48] Energy consumed for All CPU : 0.000177 kWh
[codecarbon INFO @ 03:47:48] Energy consumed for all GPUs : 0.000794 kWh. Total GPU Power : 190.49705863434735 W
[codecarbon INFO @ 03:47:48] 0.001130 kWh of electricity and 0.000000 L of water were used since the beginning.
[codecarbon INFO @ 03:

📊 Evaluating LoRA model...



📈 LoRA RESULTS SUMMARY (Rank 8)

🔧 Model Configuration:
  Training Method: LoRA
  LoRA Rank: 8
  Trainable Parameters: 148,994 (0.22%)
  Total Parameters: 66,513,412
  Dataset Size: 80.0%

🎯 Performance Metrics:
  F1 Score: 0.5128
  Exact Match: 0.4431
  Eval Loss: 1.3715

⚡ Energy Consumption:
  Total Energy: 0.036667 kWh
  CPU Energy: 0.005752 kWh (15.7%)
  GPU Energy: 0.025774 kWh (70.3%)
  RAM Energy: 0.005142 kWh (14.0%)

🔌 Average Power Draw:
  CPU Power: 42.50 W
  GPU Power: 189.81 W
  RAM Power: 38.00 W
  Total Power: 270.31 W

🌱 Carbon Footprint:
  Total CO2 Emissions: 0.009813 kg
  Emissions Rate: 0.000020132 kg/s
  Duration: 0.14 hours
  Training Time (Trainer): 0.14 hours

📍 Location & Infrastructure:
  Country: The Netherlands (NLD)
  Region: groningen
  On Cloud: N
  PUE (Power Usage Effectiveness): 1.0

💻 System Specifications:
  OS: Linux-6.6.105+-x86_64-with-glibc2.35
  CPU: Intel(R) Xeon(R) CPU @ 2.20GHz (12 cores)
  GPU: 1 x NVIDIA A100-SXM4-4

In [33]:
%%time
print("\n" + "="*80)
print("🔬 EXPERIMENT 3: LoRA with Rank 16")
print("="*80)

result_r16 = run_lora_experiment(
    size_fraction=0.8,
    train_data=squad["train"],
    eval_data=tokenized_validation,
    tokenizer=tokenizer,
    preprocess_fn=preprocess_function,
    compute_metrics_fn=compute_metrics,
    lora_rank=16
)
result_lora.append(result_r16)

Some weights of DistilBertForQuestionAnswering were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



🔬 EXPERIMENT 3: LoRA with Rank 16

🚀 LoRA Training with 80.0% of training data
🔄 Preprocessing 104255 training samples...

🔧 Creating LoRA model (rank=16)...
trainable params: 296,450 || all params: 66,660,868 || trainable%: 0.4447



`tokenizer` is deprecated and will be removed in version 5.0.0 for `Trainer.__init__`. Use `processing_class` instead.

[codecarbon INFO @ 03:55:54] [setup] RAM Tracking...
[codecarbon INFO @ 03:55:54] [setup] CPU Tracking...
 Linux OS detected: Please ensure RAPL files exist, and are readable, at /sys/class/powercap/intel-rapl/subsystem to measure CPU

[codecarbon INFO @ 03:55:55] CPU Model on constant consumption mode: Intel(R) Xeon(R) CPU @ 2.20GHz
[codecarbon INFO @ 03:55:55] [setup] GPU Tracking...
[codecarbon INFO @ 03:55:55] Tracking Nvidia GPU via pynvml
[codecarbon INFO @ 03:55:55] The below tracking methods have been set up:
                RAM Tracking Method: RAM power estimation model
                CPU Tracking Method: global constant
                GPU Tracking Method: pynvml
            
[codecarbon INFO @ 03:55:55] >>> Tracker's metadata:
[codecarbon INFO @ 03:55:55]   Platform system: Linux-6.6.105+-x86_64-with-glibc2.35
[codecarbon INFO @ 03:55:55]   Python versio

🏋️ Training LoRA model...


Epoch,Training Loss,Validation Loss,Exact Match,F1
1,1.562,1.345946,0.459206,0.522682
2,1.3689,1.309249,0.476183,0.545845


[codecarbon INFO @ 03:56:11] Energy consumed for RAM : 0.000158 kWh. RAM Power : 38.0 W
[codecarbon INFO @ 03:56:11] Delta energy consumed for CPU with constant : 0.000177 kWh, power : 42.5 W
[codecarbon INFO @ 03:56:11] Energy consumed for All CPU : 0.000177 kWh
[codecarbon INFO @ 03:56:11] Energy consumed for all GPUs : 0.000772 kWh. Total GPU Power : 185.20461018538904 W
[codecarbon INFO @ 03:56:11] 0.001107 kWh of electricity and 0.000000 L of water were used since the beginning.
[codecarbon INFO @ 03:56:12] Energy consumed for RAM : 0.000158 kWh. RAM Power : 38.0 W
[codecarbon INFO @ 03:56:12] Delta energy consumed for CPU with constant : 0.000177 kWh, power : 42.5 W
[codecarbon INFO @ 03:56:12] Energy consumed for All CPU : 0.000177 kWh
[codecarbon INFO @ 03:56:12] Energy consumed for all GPUs : 0.000792 kWh. Total GPU Power : 189.94245020067257 W
[codecarbon INFO @ 03:56:12] 0.001127 kWh of electricity and 0.000000 L of water were used since the beginning.
[codecarbon INFO @ 03:

📊 Evaluating LoRA model...



📈 LoRA RESULTS SUMMARY (Rank 16)

🔧 Model Configuration:
  Training Method: LoRA
  LoRA Rank: 16
  Trainable Parameters: 296,450 (0.44%)
  Total Parameters: 66,660,868
  Dataset Size: 80.0%

🎯 Performance Metrics:
  F1 Score: 0.5458
  Exact Match: 0.4762
  Eval Loss: 1.3092

⚡ Energy Consumption:
  Total Energy: 0.036633 kWh
  CPU Energy: 0.005778 kWh (15.8%)
  GPU Energy: 0.025689 kWh (70.1%)
  RAM Energy: 0.005166 kWh (14.1%)

🔌 Average Power Draw:
  CPU Power: 42.50 W
  GPU Power: 188.46 W
  RAM Power: 38.00 W
  Total Power: 268.96 W

🌱 Carbon Footprint:
  Total CO2 Emissions: 0.009804 kg
  Emissions Rate: 0.000020021 kg/s
  Duration: 0.14 hours
  Training Time (Trainer): 0.14 hours

📍 Location & Infrastructure:
  Country: The Netherlands (NLD)
  Region: groningen
  On Cloud: N
  PUE (Power Usage Effectiveness): 1.0

💻 System Specifications:
  OS: Linux-6.6.105+-x86_64-with-glibc2.35
  CPU: Intel(R) Xeon(R) CPU @ 2.20GHz (12 cores)
  GPU: 1 x NVIDIA A100-SXM4

### STEP 6.1: Results and Analysis

In [34]:
results_df_lora = pd.DataFrame(result_lora)
print("\n" + "="*60)
print("📊 LoRA RESULTS SUMMARY")
print("="*60)
print(results_df_lora.to_string(index=False))

# Save to CSV
results_df_lora.to_csv("/content/drive/MyDrive/distilbert_lora_results.csv", index=False)


📊 LoRA RESULTS SUMMARY
training_method model_name  dataset_size%  lora_rank  train_samples  valid_samples  trainable_params  total_params  trainable_percentage  f1_score  exact_match  eval_loss  training_time_hours  emissions_rate_kg_per_s  emissions_kg           timestamp  duration_seconds  duration_hours  energy_consumed_kwh  cpu_energy_kwh  gpu_energy_kwh  ram_energy_kwh  cpu_power_w  gpu_power_w  ram_power_w    country_name country_iso_code    region cloud_provider cloud_region on_cloud                                   os python_version  cpu_count                      cpu_model  gpu_count                 gpu_model  ram_total_size_gb  pue codecarbon_version
           LoRA DistilBERT             80          4         104255          12134             75266      66439684              0.113285  0.493034     0.430526   1.417748             0.135649                  0.00002      0.009820 2025-12-01T03:47:15        488.820864        0.135784             0.036692        0.005768     
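
The energy and emissions columns can be sanity-checked against the reported power draw and duration. The sketch below uses the rank-4 values from the printed summary and the table row above. It treats the reported power figures as constant over the whole run, which is an approximation, and derives the carbon intensity by division rather than reading it from CodeCarbon. The helper `energy_kwh` is hypothetical.

```python
# Values for the rank-4 run, copied from the printed summary and the results table.
duration_s = 488.820864
power_w = {"cpu": 42.50, "gpu": 189.39, "ram": 38.00}
reported_energy_kwh = 0.036692
reported_emissions_kg = 0.009820

def energy_kwh(watts, seconds):
    """Energy in kWh for a constant power draw: W * s = J, and 1 kWh = 3.6e6 J."""
    return watts * seconds / 3.6e6

# Sum the per-component estimates (assumes constant power, see above).
estimated = sum(energy_kwh(w, duration_s) for w in power_w.values())

# Implied grid carbon intensity in kg CO2 per kWh (derived, not reported).
implied_intensity = reported_emissions_kg / reported_energy_kwh

print(f"Estimated energy: {estimated:.6f} kWh (reported {reported_energy_kwh:.6f})")
print(f"Implied carbon intensity: {implied_intensity:.3f} kg CO2/kWh")
```

The estimate lands within about 1% of the reported total, and the implied intensity is roughly 0.27 kg CO₂ per kWh. Because emissions are energy multiplied by a location-dependent factor, the same runs would report different emissions on a different grid.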

In [35]:
print("\n" + "="*80)
print("📊 LoRA RANK COMPARISON")
print("="*80)
print(results_df_lora[['lora_rank', 'trainable_params', 'trainable_percentage', 'f1_score', 'exact_match', 'emissions_kg', 'training_time_hours']].to_string(index=False))


📊 LoRA RANK COMPARISON
 lora_rank  trainable_params  trainable_percentage  f1_score  exact_match  emissions_kg  training_time_hours
         4             75266              0.113285  0.493034     0.430526      0.009820             0.135649
         8            148994              0.224006  0.512776     0.443053      0.009813             0.135268
        16            296450              0.444714  0.545845     0.476183      0.009804             0.135895


In [36]:
# Compare efficiency vs performance
print("\n" + "="*80)
print("📈 EFFICIENCY ANALYSIS")
print("="*80)

baseline = results_df_lora[results_df_lora['lora_rank'] == 8].iloc[0]  # Use rank 8 as baseline

for _, row in results_df_lora.iterrows():
    rank = row['lora_rank']
    params_ratio = row['trainable_params'] / baseline['trainable_params']
    f1_diff = row['f1_score'] - baseline['f1_score']
    emissions_diff = row['emissions_kg'] - baseline['emissions_kg']

    print(f"\nLoRA Rank {rank}:")
    print(f"  Trainable Params: {row['trainable_params']:,} ({row['trainable_percentage']:.2f}%)")
    print(f"  vs Rank 8: {params_ratio:.2f}x parameters")
    print(f"  F1 Score: {row['f1_score']:.4f} ({f1_diff:+.4f} vs Rank 8)")
    print(f"  Emissions: {row['emissions_kg']:.6f} kg ({emissions_diff:+.6f} vs Rank 8)")
    print(f"  Training Time: {row['training_time_hours']:.2f} hours")

    # Efficiency metric: F1 per kg CO2
    efficiency = row['f1_score'] / row['emissions_kg']
    print(f"  Efficiency (F1/kg CO2): {efficiency:.2f}")


📈 EFFICIENCY ANALYSIS

LoRA Rank 4:
  Trainable Params: 75,266 (0.11%)
  vs Rank 8: 0.51x parameters
  F1 Score: 0.4930 (-0.0197 vs Rank 8)
  Emissions: 0.009820 kg (+0.000007 vs Rank 8)
  Training Time: 0.14 hours
  Efficiency (F1/kg CO2): 50.21

LoRA Rank 8:
  Trainable Params: 148,994 (0.22%)
  vs Rank 8: 1.00x parameters
  F1 Score: 0.5128 (+0.0000 vs Rank 8)
  Emissions: 0.009813 kg (+0.000000 vs Rank 8)
  Training Time: 0.14 hours
  Efficiency (F1/kg CO2): 52.25

LoRA Rank 16:
  Trainable Params: 296,450 (0.44%)
  vs Rank 8: 1.99x parameters
  F1 Score: 0.5458 (+0.0331 vs Rank 8)
  Emissions: 0.009804 kg (-0.000009 vs Rank 8)
  Training Time: 0.14 hours
  Efficiency (F1/kg CO2): 55.68
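
The same efficiency metric can be used to set the best LoRA run against full fine-tuning on the same 80% split. A minimal sketch, using only the F1 and emissions figures reported in this notebook (the `compare` helper is hypothetical, and each configuration was run once, so small differences may not be significant):

```python
# (f1_score, emissions_kg) copied from the 80% runs reported in this notebook.
full_ft = (0.609658, 0.011166)     # Full fine-tuning
lora_r16 = (0.545845, 0.009804)    # LoRA, rank 16

def compare(baseline, candidate):
    """Return relative emissions saving, absolute F1 drop, and F1-per-kg for both runs."""
    (f1_b, kg_b), (f1_c, kg_c) = baseline, candidate
    return {
        "emissions_saving_pct": 100 * (kg_b - kg_c) / kg_b,
        "f1_drop": f1_b - f1_c,
        "baseline_f1_per_kg": f1_b / kg_b,
        "candidate_f1_per_kg": f1_c / kg_c,
    }

for key, value in compare(full_ft, lora_r16).items():
    print(f"{key}: {value:.4f}")
```

On these figures, rank-16 LoRA emits about 12% less CO₂ than full fine-tuning but gives up about 0.064 F1, so its F1 per kg is only marginally higher. The ratio alone hides that absolute accuracy gap, so it is best read alongside the raw F1 scores.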


In [37]:
# PLOT 1: LoRA Energy Consumption by Rank
df_sorted_lora = results_df_lora.sort_values('lora_rank')

fig_lora_energy = go.Figure()

fig_lora_energy.add_trace(go.Bar(
    name='CPU Energy',
    x=df_sorted_lora['lora_rank'],
    y=df_sorted_lora['cpu_energy_kwh'],
    marker_color='#FF6B6B',
    hovertemplate='<b>CPU Energy</b><br>%{y:.6f} kWh<br>Rank: %{x}<extra></extra>'
))

fig_lora_energy.add_trace(go.Bar(
    name='GPU Energy',
    x=df_sorted_lora['lora_rank'],
    y=df_sorted_lora['gpu_energy_kwh'],
    marker_color='#4ECDC4',
    hovertemplate='<b>GPU Energy</b><br>%{y:.6f} kWh<br>Rank: %{x}<extra></extra>'
))

fig_lora_energy.add_trace(go.Bar(
    name='RAM Energy',
    x=df_sorted_lora['lora_rank'],
    y=df_sorted_lora['ram_energy_kwh'],
    marker_color='#95E1D3',
    hovertemplate='<b>RAM Energy</b><br>%{y:.6f} kWh<br>Rank: %{x}<extra></extra>'
))

fig_lora_energy.update_layout(
    title=dict(text="LoRA: Energy Consumption by Rank", font=dict(size=18)),
    xaxis_title='LoRA Rank',
    yaxis_title='Energy Consumption (kWh)',
    barmode='stack',
    template='plotly_white',
    height=500,
    font=dict(size=13),
    legend=dict(
        orientation="h",
        yanchor="bottom",
        y=1.02,
        xanchor="right",
        x=1
    ),
    hovermode='x unified'
)

fig_lora_energy.show()
fig_lora_energy.write_html("/content/drive/MyDrive/lora_energy_by_rank.html")

In [38]:
# PLOT 2: LoRA Performance & Emissions by Rank (Dual Y-axis)
df_sorted_lora = results_df_lora.sort_values('lora_rank')

fig_lora_perf = make_subplots(specs=[[{"secondary_y": True}]])

# F1 Score line
fig_lora_perf.add_trace(
    go.Scatter(
        x=df_sorted_lora['lora_rank'],
        y=df_sorted_lora['f1_score'],
        name='F1 Score',
        mode='lines+markers',
        line=dict(color='#4ECDC4', width=3),
        marker=dict(size=12, line=dict(width=2, color='white')),
        hovertemplate='<b>F1 Score</b>: %{y:.4f}<br>Rank: %{x}<extra></extra>'
    ),
    secondary_y=False
)

# Exact Match line
fig_lora_perf.add_trace(
    go.Scatter(
        x=df_sorted_lora['lora_rank'],
        y=df_sorted_lora['exact_match'],
        name='Exact Match',
        mode='lines+markers',
        line=dict(color='#95E1D3', width=3, dash='dash'),
        marker=dict(size=10),
        hovertemplate='<b>Exact Match</b>: %{y:.4f}<br>Rank: %{x}<extra></extra>'
    ),
    secondary_y=False
)

# CO2 Emissions bar
fig_lora_perf.add_trace(
    go.Bar(
        x=df_sorted_lora['lora_rank'],
        y=df_sorted_lora['emissions_kg'],
        name='CO₂ Emissions',
        marker_color='#FF6B6B',
        opacity=0.6,
        hovertemplate='<b>CO₂</b>: %{y:.6f} kg<br>Rank: %{x}<extra></extra>'
    ),
    secondary_y=True
)

fig_lora_perf.update_xaxes(title_text="LoRA Rank")
fig_lora_perf.update_yaxes(title_text="Performance Score", secondary_y=False)
fig_lora_perf.update_yaxes(title_text="CO₂ Emissions (kg)", secondary_y=True)

fig_lora_perf.update_layout(
    title=dict(text="LoRA: Performance vs Carbon Emissions by Rank", font=dict(size=18)),
    template='plotly_white',
    height=500,
    font=dict(size=13),
    hovermode='x unified',
    legend=dict(
        orientation="h",
        yanchor="bottom",
        y=1.02,
        xanchor="right",
        x=1
    )
)

fig_lora_perf.show()
fig_lora_perf.write_html("/content/drive/MyDrive/lora_performance_emissions.html")

# Training Strategy 3: Few-shot Learning With Frozen Backbone

## STEP 7: Creating And Training Few-shot Model

In [39]:
def create_frozen_model(model_name="distilbert-base-uncased"):
    """Create a model with a frozen backbone (only the QA head is trainable)."""
    # Load base model
    model = AutoModelForQuestionAnswering.from_pretrained(model_name)

    # Freeze ALL parameters first
    for param in model.parameters():
        param.requires_grad = False

    # Unfreeze ONLY the QA head (classifier layer)
    # For DistilBERT: qa_outputs layer
    for param in model.qa_outputs.parameters():
        param.requires_grad = True

    # Count parameters
    trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total_params = sum(p.numel() for p in model.parameters())

    print(f"\n🔒 Model Configuration:")
    print(f"  Total Parameters: {total_params:,}")
    print(f"  Trainable Parameters: {trainable_params:,}")
    print(f"  Frozen Parameters: {total_params - trainable_params:,}")
    print(f"  Trainable Percentage: {100 * trainable_params / total_params:.4f}%")

    return model, trainable_params, total_params
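
The trainable-parameter count can be checked by hand. The sketch below assumes the QA head is a single linear layer from a 768-dimensional hidden state to two logits (start and end), which is the usual layout for `distilbert-base-uncased`; the total is taken from the notebook output.

```python
# Assumed dimensions: hidden size 768, two output logits (answer start and end).
hidden_size = 768
num_labels = 2

head_weights = hidden_size * num_labels  # weight matrix of the linear head
head_biases = num_labels                 # one bias per logit
trainable_params = head_weights + head_biases

total_params = 66_364_418  # total reported in the notebook output

print(f"Trainable parameters: {trainable_params:,}")
print(f"Trainable percentage: {100 * trainable_params / total_params:.4f}%")
```

Under these assumptions the head holds 1,538 parameters, which agrees with the count printed when the frozen model is created.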


In [40]:
def prepare_fewshot_dataset(train_data, num_shots, preprocess_fn):
    # Select only num_shots examples
    train_subset = train_data.select(range(num_shots))

    print(f"🔄 Creating few-shot dataset with {num_shots} examples...")
    tokenized_train = train_subset.map(
        preprocess_fn,
        batched=True,
        remove_columns=train_subset.column_names
    )

    # After tokenization with sliding window, we get more samples
    actual_samples = len(tokenized_train)
    print(f"  Original examples: {num_shots}")
    print(f"  After tokenization (with sliding window): {actual_samples} samples")

    return tokenized_train, num_shots  # Return original num_shots for tracking
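
`prepare_fewshot_dataset` takes the first `num_shots` rows. SQuAD is distributed with questions grouped by source article, so the first rows tend to cover only a few topics. A seeded random subset is one possible alternative; the helper name, the seed, and the placeholder dataset size below are illustrative assumptions, not part of the notebook.

```python
import random

def sample_fewshot_indices(dataset_size, num_shots, seed=42):
    """Return a sorted, reproducible list of row indices (hypothetical helper)."""
    if num_shots > dataset_size:
        raise ValueError("num_shots cannot exceed dataset_size")
    rng = random.Random(seed)  # local RNG, leaves the global random state untouched
    return sorted(rng.sample(range(dataset_size), num_shots))

# Placeholder size for illustration; in the notebook this would be len(train_data).
indices = sample_fewshot_indices(dataset_size=10_000, num_shots=100)
print(indices[:5])
```

The resulting list could then be passed to `train_data.select(indices)` in place of `range(num_shots)`. Doing so would change which examples are trained on, so the results reported here would not carry over.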


In [41]:
def train_fewshot_model(tokenized_train, tokenized_eval, tokenizer, compute_metrics_fn, num_shots, model_name="distilbert-base-uncased"):
    # Create frozen model
    model, trainable_params, total_params = create_frozen_model(model_name)

    # Setup output directory
    output_dir = f"results_distilbert_fewshot_{num_shots}shots"

    # Training arguments - DIFFERENT from full fine-tuning
    training_args = TrainingArguments(
        output_dir=output_dir,
        eval_strategy="epoch",
        learning_rate=5e-4,  # Higher LR since we're only training the head
        per_device_train_batch_size=16,
        per_device_eval_batch_size=16,
        num_train_epochs=10,  # More epochs for few-shot
        fp16=torch.cuda.is_available(),
        save_strategy="epoch",
        load_best_model_at_end=True,
        metric_for_best_model="f1",
        push_to_hub=False,
        logging_steps=50,
        greater_is_better=True
    )

    # Initialize trainer
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=tokenized_train,
        eval_dataset=tokenized_eval,
        tokenizer=tokenizer,
        data_collator=default_data_collator,
        compute_metrics=compute_metrics_fn
    )

    # Start carbon tracking
    tracker = EmissionsTracker(
        project_name=f"DistilBERT_FewShot_{num_shots}shots",
        output_dir=output_dir,
        save_to_file=True,
        log_level="info"
    )
    tracker.start()

    # Train
    print(f"\n🏋️ Training few-shot model ({num_shots} examples)...")
    train_results = trainer.train()

    # Stop tracking and get detailed emissions data
    emissions_kg = tracker.stop()
    emissions_data = tracker.final_emissions_data

    return trainer, train_results, emissions_data, output_dir, model, trainable_params, total_params


## STEP 8: Evaluating The Few-shot Model On Different Shot Sizes

>We train the frozen-backbone model on few-shot subsets of the SQuAD training set and compare the results across subset sizes.
>
>Few-shot sizes: [100, 500, 1000]

In [42]:
def evaluate_and_save_fewshot(trainer, train_results, emissions_data, output_dir, num_shots, trainable_params, total_params):
    print("📊 Evaluating few-shot model...")
    eval_results = trainer.evaluate()

    trainable_percentage = 100 * trainable_params / total_params

    # Compile results
    result_entry = {
        "training_method": "Few-Shot (Frozen Backbone)",
        "model_name": "DistilBERT",
        "num_shots": num_shots,
        "train_samples": num_shots,
        "valid_samples": len(tokenized_validation),
        "trainable_params": trainable_params,
        "total_params": total_params,
        "trainable_percentage": trainable_percentage,

        # Performance
        "f1_score": eval_results["eval_f1"],
        "exact_match": eval_results["eval_exact_match"],
        "eval_loss": eval_results["eval_loss"],
        "training_time_hours": train_results.metrics["train_runtime"] / 3600,

        # Emissions data
        "emissions_rate_kg_per_s": emissions_data.emissions_rate,
        "emissions_kg": emissions_data.emissions,
        "timestamp": emissions_data.timestamp,
        "duration_seconds": emissions_data.duration,
        "duration_hours": emissions_data.duration / 3600,

        # Energy
        "energy_consumed_kwh": emissions_data.energy_consumed,
        "cpu_energy_kwh": emissions_data.cpu_energy,
        "gpu_energy_kwh": emissions_data.gpu_energy,
        "ram_energy_kwh": emissions_data.ram_energy,

        # Power
        "cpu_power_w": emissions_data.cpu_power,
        "gpu_power_w": emissions_data.gpu_power,
        "ram_power_w": emissions_data.ram_power,

        # Location and system info
        "country_name": emissions_data.country_name,
        "country_iso_code": emissions_data.country_iso_code,
        "region": emissions_data.region,
        "cloud_provider": emissions_data.cloud_provider,
        "cloud_region": emissions_data.cloud_region,
        "on_cloud": emissions_data.on_cloud,

        # System specifications
        "os": emissions_data.os,
        "python_version": emissions_data.python_version,
        "cpu_count": emissions_data.cpu_count,
        "cpu_model": emissions_data.cpu_model,
        "gpu_count": emissions_data.gpu_count,
        "gpu_model": emissions_data.gpu_model,
        "ram_total_size_gb": emissions_data.ram_total_size,

        # Additional metrics
        "pue": emissions_data.pue,
        "codecarbon_version": emissions_data.codecarbon_version,
    }

    # Print summary
    print(f"\n{'='*80}")
    print(f"📈 FEW-SHOT LEARNING RESULTS ({num_shots} examples)")
    print(f"{'='*80}")
    print(f"\n🔧 Model Configuration:")
    print(f"  Training Method: Few-Shot (Frozen Backbone)")
    print(f"  Training Examples: {num_shots}")
    print(f"  Trainable Parameters: {trainable_params:,} ({trainable_percentage:.4f}%)")
    print(f"  Frozen Parameters: {total_params - trainable_params:,}")

    print(f"\n🎯 Performance:")
    print(f"  F1 Score: {eval_results['eval_f1']:.4f}")
    print(f"  Exact Match: {eval_results['eval_exact_match']:.4f}")
    print(f"  Eval Loss: {eval_results['eval_loss']:.4f}")

    print(f"\n⚡ Energy:")
    print(f"  Total: {emissions_data.energy_consumed:.6f} kWh")
    if emissions_data.energy_consumed > 0:
        print(f"  GPU: {emissions_data.gpu_energy:.6f} kWh ({emissions_data.gpu_energy/emissions_data.energy_consumed*100:.1f}%)")
        print(f"  CPU: {emissions_data.cpu_energy:.6f} kWh ({emissions_data.cpu_energy/emissions_data.energy_consumed*100:.1f}%)")

    print(f"\n🌱 Carbon:")
    print(f"  CO₂ Emissions: {emissions_data.emissions:.6f} kg")
    print(f"  Training Time: {train_results.metrics['train_runtime']/3600:.2f} hours")
    print(f"{'='*80}")

    # Save model
    trainer.save_model(f"{output_dir}/final_model")
    print(f"✅ Model saved to {output_dir}/final_model")

    # Clean up
    del trainer.model
    del trainer
    torch.cuda.empty_cache()

    return result_entry


In [43]:
def run_fewshot_experiment(num_shots, train_data, eval_data, tokenizer, preprocess_fn, compute_metrics_fn, model_name="distilbert-base-uncased"):

    print(f"\n{'='*60}")
    print(f"🚀 Few-Shot Learning with {num_shots} examples")
    print(f"{'='*60}")

    # Step 1: Prepare few-shot dataset
    tokenized_train, num_shots = prepare_fewshot_dataset(train_data, num_shots, preprocess_fn)

    # Step 2: Train with frozen backbone
    trainer, train_results, emissions_data, output_dir, model, trainable_params, total_params = train_fewshot_model(
        tokenized_train, eval_data, tokenizer, compute_metrics_fn,
        num_shots, model_name
    )

    # Step 3: Evaluate and save
    result_entry = evaluate_and_save_fewshot(
        trainer, train_results, emissions_data, output_dir,
        num_shots, trainable_params, total_params
    )

    return result_entry

In [44]:
result_fewshot = []

In [45]:
%%time
print("\n" + "="*80)
print("🔬 EXPERIMENT 1: 100-shot Learning")
print("="*80)

result_100 = run_fewshot_experiment(
    num_shots=100,
    train_data=squad["train"],
    eval_data=tokenized_validation,
    tokenizer=tokenizer,
    preprocess_fn=preprocess_function,
    compute_metrics_fn=compute_metrics,
    model_name="distilbert-base-uncased"
)
result_fewshot.append(result_100)


🔬 EXPERIMENT 1: 100-shot Learning

🚀 Few-Shot Learning with 100 examples
🔄 Creating few-shot dataset with 100 examples...


Map:   0%|          | 0/100 [00:00<?, ? examples/s]

  Original examples: 100
  After tokenization (with sliding window): 100 samples


Some weights of DistilBertForQuestionAnswering were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



🔒 Model Configuration:
  Total Parameters: 66,364,418
  Trainable Parameters: 1,538
  Frozen Parameters: 66,362,880
  Trainable Percentage: 0.0023%



`tokenizer` is deprecated and will be removed in version 5.0.0 for `Trainer.__init__`. Use `processing_class` instead.

[codecarbon INFO @ 04:04:22] [setup] RAM Tracking...
[codecarbon INFO @ 04:04:22] [setup] CPU Tracking...
 Linux OS detected: Please ensure RAPL files exist, and are readable, at /sys/class/powercap/intel-rapl/subsystem to measure CPU

[codecarbon INFO @ 04:04:24] CPU Model on constant consumption mode: Intel(R) Xeon(R) CPU @ 2.20GHz
[codecarbon INFO @ 04:04:24] [setup] GPU Tracking...
[codecarbon INFO @ 04:04:24] Tracking Nvidia GPU via pynvml
[codecarbon INFO @ 04:04:24] The below tracking methods have been set up:
                RAM Tracking Method: RAM power estimation model
                CPU Tracking Method: global constant
                GPU Tracking Method: pynvml
            
[codecarbon INFO @ 04:04:24] >>> Tracker's metadata:
[codecarbon INFO @ 04:04:24]   Platform system: Linux-6.6.105+-x86_64-with-glibc2.35
[codecarbon INFO @ 04:04:24]   Python versio


🏋️ Training few-shot model (100 examples)...


Epoch,Training Loss,Validation Loss,Exact Match,F1
1,No log,5.960748,0.000165,0.006065
2,No log,5.925188,0.000165,0.006769
3,No log,5.897899,0.000165,0.006562
4,No log,5.877831,0.000247,0.00762
5,No log,5.861616,0.000165,0.007852
6,No log,5.847296,0.00033,0.008439
7,No log,5.837134,0.00033,0.008846
8,5.605100,5.830018,0.000412,0.009092
9,5.605100,5.82541,0.000412,0.009039
10,5.605100,5.823724,0.000412,0.009082


[codecarbon INFO @ 04:04:40] Energy consumed for RAM : 0.000158 kWh. RAM Power : 38.0 W
[codecarbon INFO @ 04:04:40] Delta energy consumed for CPU with constant : 0.000177 kWh, power : 42.5 W
[codecarbon INFO @ 04:04:40] Energy consumed for All CPU : 0.000177 kWh
[codecarbon INFO @ 04:04:40] Energy consumed for all GPUs : 0.000692 kWh. Total GPU Power : 165.8547214679277 W
[codecarbon INFO @ 04:04:40] 0.001027 kWh of electricity and 0.000000 L of water were used since the beginning.
[codecarbon INFO @ 04:04:40] Energy consumed for RAM : 0.000158 kWh. RAM Power : 38.0 W
[codecarbon INFO @ 04:04:40] Delta energy consumed for CPU with constant : 0.000177 kWh, power : 42.5 W
[codecarbon INFO @ 04:04:40] Energy consumed for All CPU : 0.000177 kWh
[codecarbon INFO @ 04:04:40] Energy consumed for all GPUs : 0.000711 kWh. Total GPU Power : 170.44782118470903 W
[codecarbon INFO @ 04:04:40] 0.001046 kWh of electricity and 0.000000 L of water were used since the beginning.
[codecarbon INFO @ 04:0

📊 Evaluating few-shot model...



📈 FEW-SHOT LEARNING RESULTS (100 examples)

🔧 Model Configuration:
  Training Method: Few-Shot (Frozen Backbone)
  Training Examples: 100
  Trainable Parameters: 1,538 (0.0023%)
  Frozen Parameters: 66,362,880

🎯 Performance:
  F1 Score: 0.0091
  Exact Match: 0.0004
  Eval Loss: 5.8300

⚡ Energy:
  Total: 0.008555 kWh
  GPU: 0.005841 kWh (68.3%)
  CPU: 0.001433 kWh (16.8%)

🌱 Carbon:
  CO₂ Emissions: 0.002290 kg
  Training Time: 0.03 hours
✅ Model saved to results_distilbert_fewshot_100shots/final_model
CPU times: user 2min 11s, sys: 2.78 s, total: 2min 14s
Wall time: 2min 16s
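
The `No log` entries in the training-loss column follow from `logging_steps=50`: a training loss is only reported once 50 optimizer steps have run. A rough check, assuming a single device, no gradient accumulation, and the last partial batch counted as a step:

```python
import math

def first_logged_epoch(num_samples, batch_size=16, logging_steps=50):
    """Epoch during which the first training-loss log is emitted."""
    steps_per_epoch = math.ceil(num_samples / batch_size)
    return math.ceil(logging_steps / steps_per_epoch)

# Tokenized sample counts reported by the three few-shot runs.
for samples in (100, 527, 1027):
    print(f"{samples} samples: first training loss in epoch {first_logged_epoch(samples)}")
```

With 100 samples there are 7 steps per epoch, so step 50 falls in epoch 8, which is where the first training loss appears in the 100-shot table. The 527- and 1027-sample runs reach step 50 in epochs 2 and 1 respectively.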


In [46]:
%%time
print("\n" + "="*80)
print("🔬 EXPERIMENT 2: 500-shot Learning")
print("="*80)

result_500 = run_fewshot_experiment(
    num_shots=500,
    train_data=squad["train"],
    eval_data=tokenized_validation,
    tokenizer=tokenizer,
    preprocess_fn=preprocess_function,
    compute_metrics_fn=compute_metrics,
    model_name="distilbert-base-uncased"
)
result_fewshot.append(result_500)


🔬 EXPERIMENT 2: 500-shot Learning

🚀 Few-Shot Learning with 500 examples
🔄 Creating few-shot dataset with 500 examples...


Map:   0%|          | 0/500 [00:00<?, ? examples/s]

Some weights of DistilBertForQuestionAnswering were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


  Original examples: 500
  After tokenization (with sliding window): 527 samples

🔒 Model Configuration:
  Total Parameters: 66,364,418
  Trainable Parameters: 1,538
  Frozen Parameters: 66,362,880
  Trainable Percentage: 0.0023%



`tokenizer` is deprecated and will be removed in version 5.0.0 for `Trainer.__init__`. Use `processing_class` instead.

[codecarbon INFO @ 04:06:39] [setup] RAM Tracking...
[codecarbon INFO @ 04:06:39] [setup] CPU Tracking...
 Linux OS detected: Please ensure RAPL files exist, and are readable, at /sys/class/powercap/intel-rapl/subsystem to measure CPU

[codecarbon INFO @ 04:06:40] CPU Model on constant consumption mode: Intel(R) Xeon(R) CPU @ 2.20GHz
[codecarbon INFO @ 04:06:40] [setup] GPU Tracking...
[codecarbon INFO @ 04:06:40] Tracking Nvidia GPU via pynvml
[codecarbon INFO @ 04:06:40] The below tracking methods have been set up:
                RAM Tracking Method: RAM power estimation model
                CPU Tracking Method: global constant
                GPU Tracking Method: pynvml
            
[codecarbon INFO @ 04:06:40] >>> Tracker's metadata:
[codecarbon INFO @ 04:06:40]   Platform system: Linux-6.6.105+-x86_64-with-glibc2.35
[codecarbon INFO @ 04:06:40]   Python versio


🏋️ Training few-shot model (500 examples)...


Epoch,Training Loss,Validation Loss,Exact Match,F1
1,No log,5.600938,0.000247,0.012893
2,5.614300,5.36577,0.000824,0.013981
3,5.614300,5.196386,0.000989,0.014842
4,5.119000,5.050659,0.001731,0.015044
5,4.839400,4.921439,0.002225,0.015549
6,4.839400,4.84842,0.00272,0.015826
7,4.660800,4.767348,0.005357,0.017459
8,4.575200,4.723065,0.00684,0.018732
9,4.575200,4.696511,0.007747,0.019596
10,4.522800,4.687153,0.008241,0.019856


[codecarbon INFO @ 04:06:57] Energy consumed for RAM : 0.000158 kWh. RAM Power : 38.0 W
[codecarbon INFO @ 04:06:57] Delta energy consumed for CPU with constant : 0.000177 kWh, power : 42.5 W
[codecarbon INFO @ 04:06:57] Energy consumed for All CPU : 0.000177 kWh
[codecarbon INFO @ 04:06:57] Energy consumed for all GPUs : 0.000686 kWh. Total GPU Power : 164.6809040429256 W
[codecarbon INFO @ 04:06:57] 0.001022 kWh of electricity and 0.000000 L of water were used since the beginning.
[codecarbon INFO @ 04:06:57] Energy consumed for RAM : 0.000158 kWh. RAM Power : 38.0 W
[codecarbon INFO @ 04:06:57] Delta energy consumed for CPU with constant : 0.000177 kWh, power : 42.5 W
[codecarbon INFO @ 04:06:57] Energy consumed for All CPU : 0.000177 kWh
[codecarbon INFO @ 04:06:57] Energy consumed for all GPUs : 0.000705 kWh. Total GPU Power : 169.03454929120085 W
[codecarbon INFO @ 04:06:57] 0.001040 kWh of electricity and 0.000000 L of water were used since the beginning.
[codecarbon INFO @ 04:0

📊 Evaluating few-shot model...



📈 FEW-SHOT LEARNING RESULTS (500 examples)

🔧 Model Configuration:
  Training Method: Few-Shot (Frozen Backbone)
  Training Examples: 500
  Trainable Parameters: 1,538 (0.0023%)
  Frozen Parameters: 66,362,880

🎯 Performance:
  F1 Score: 0.0199
  Exact Match: 0.0082
  Eval Loss: 4.6872

⚡ Energy:
  Total: 0.008894 kWh
  GPU: 0.006069 kWh (68.2%)
  CPU: 0.001492 kWh (16.8%)

🌱 Carbon:
  CO₂ Emissions: 0.002380 kg
  Training Time: 0.03 hours
✅ Model saved to results_distilbert_fewshot_500shots/final_model
CPU times: user 2min 16s, sys: 2.8 s, total: 2min 19s
Wall time: 2min 21s
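
The 500 examples expand to 527 tokenized samples because contexts longer than the model window are split into overlapping chunks. The sketch below is a simplified model of that splitting. The values `max_length=384` and `stride=128` are common choices for SQuAD preprocessing and are assumptions here, since `preprocess_function` is defined elsewhere; the three special tokens stand for `[CLS]` and two `[SEP]` markers.

```python
import math

def count_windows(context_tokens, question_tokens,
                  max_length=384, stride=128, special_tokens=3):
    """Estimate how many overlapping chunks one example produces.

    `stride` is treated as the number of tokens shared by consecutive chunks.
    """
    window = max_length - question_tokens - special_tokens  # room left for context
    step = window - stride                                   # new tokens per extra chunk
    if window <= 0 or step <= 0:
        raise ValueError("max_length is too small for this question and stride")
    if context_tokens <= window:
        return 1
    return 1 + math.ceil((context_tokens - window) / step)

# Hypothetical token counts, chosen only to illustrate the three regimes.
for ctx in (150, 400, 700):
    print(f"context of {ctx} tokens: {count_windows(ctx, question_tokens=12)} chunk(s)")
```

Most SQuAD contexts fit in a single window, which is consistent with the small increase from 500 to 527 samples.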


In [47]:
%%time
print("\n" + "="*80)
print("🔬 EXPERIMENT 3: 1000-shot Learning")
print("="*80)

result_1000 = run_fewshot_experiment(
    num_shots=1000,
    train_data=squad["train"],
    eval_data=tokenized_validation,
    tokenizer=tokenizer,
    preprocess_fn=preprocess_function,
    compute_metrics_fn=compute_metrics,
    model_name="distilbert-base-uncased"
)
result_fewshot.append(result_1000)


🔬 EXPERIMENT 3: 1000-shot Learning

🚀 Few-Shot Learning with 1000 examples
🔄 Creating few-shot dataset with 1000 examples...


Map:   0%|          | 0/1000 [00:00<?, ? examples/s]

Some weights of DistilBertForQuestionAnswering were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


  Original examples: 1000
  After tokenization (with sliding window): 1027 samples

🔒 Model Configuration:
  Total Parameters: 66,364,418
  Trainable Parameters: 1,538
  Frozen Parameters: 66,362,880
  Trainable Percentage: 0.0023%



`tokenizer` is deprecated and will be removed in version 5.0.0 for `Trainer.__init__`. Use `processing_class` instead.

[codecarbon INFO @ 04:09:01] [setup] RAM Tracking...
[codecarbon INFO @ 04:09:01] [setup] CPU Tracking...
 Linux OS detected: Please ensure RAPL files exist, and are readable, at /sys/class/powercap/intel-rapl/subsystem to measure CPU

[codecarbon INFO @ 04:09:02] CPU Model on constant consumption mode: Intel(R) Xeon(R) CPU @ 2.20GHz
[codecarbon INFO @ 04:09:02] [setup] GPU Tracking...
[codecarbon INFO @ 04:09:02] Tracking Nvidia GPU via pynvml
[codecarbon INFO @ 04:09:02] The below tracking methods have been set up:
                RAM Tracking Method: RAM power estimation model
                CPU Tracking Method: global constant
                GPU Tracking Method: pynvml
            
[codecarbon INFO @ 04:09:02] >>> Tracker's metadata:
[codecarbon INFO @ 04:09:02]   Platform system: Linux-6.6.105+-x86_64-with-glibc2.35
[codecarbon INFO @ 04:09:02]   Python versio


🏋️ Training few-shot model (1000 examples)...


Epoch,Training Loss,Validation Loss,Exact Match,F1
1,5.6405,5.487395,0.001071,0.014791
2,5.1223,5.161344,0.002225,0.016665
3,4.808,4.953194,0.003626,0.018421
4,4.4793,4.815657,0.004533,0.019417
5,4.4013,4.72191,0.005522,0.020926
6,4.2976,4.657375,0.005934,0.021714
7,4.226,4.601258,0.006099,0.021734
8,4.1701,4.566637,0.006346,0.022027
9,4.1827,4.545423,0.006428,0.022245
10,4.1202,4.540926,0.006428,0.022166


[codecarbon INFO @ 04:09:19] Energy consumed for RAM : 0.000158 kWh. RAM Power : 38.0 W
[codecarbon INFO @ 04:09:19] Delta energy consumed for CPU with constant : 0.000177 kWh, power : 42.5 W
[codecarbon INFO @ 04:09:19] Energy consumed for All CPU : 0.000177 kWh
[codecarbon INFO @ 04:09:19] Energy consumed for all GPUs : 0.000676 kWh. Total GPU Power : 162.1411535663205 W
[codecarbon INFO @ 04:09:19] 0.001011 kWh of electricity and 0.000000 L of water were used since the beginning.
[codecarbon INFO @ 04:09:19] Energy consumed for RAM : 0.000158 kWh. RAM Power : 38.0 W
[codecarbon INFO @ 04:09:19] Delta energy consumed for CPU with constant : 0.000177 kWh, power : 42.5 W
[codecarbon INFO @ 04:09:19] Energy consumed for All CPU : 0.000177 kWh
[codecarbon INFO @ 04:09:19] Energy consumed for all GPUs : 0.000691 kWh. Total GPU Power : 165.88317416048844 W
[codecarbon INFO @ 04:09:19] 0.001027 kWh of electricity and 0.000000 L of water were used since the beginning.
[codecarbon INFO @ 04:0

📊 Evaluating few-shot model...



📈 FEW-SHOT LEARNING RESULTS (1000 examples)

🔧 Model Configuration:
  Training Method: Few-Shot (Frozen Backbone)
  Training Examples: 1000
  Trainable Parameters: 1,538 (0.0023%)
  Frozen Parameters: 66,362,880

🎯 Performance:
  F1 Score: 0.0222
  Exact Match: 0.0064
  Eval Loss: 4.5454

⚡ Energy:
  Total: 0.009262 kWh
  GPU: 0.006294 kWh (68.0%)
  CPU: 0.001567 kWh (16.9%)

🌱 Carbon:
  CO₂ Emissions: 0.002479 kg
  Training Time: 0.04 hours
✅ Model saved to results_distilbert_fewshot_1000shots/final_model
CPU times: user 2min 23s, sys: 3.27 s, total: 2min 26s
Wall time: 2min 28s
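
The printed summaries list only the GPU and CPU shares, so the percentages do not add up to 100. The sketch below recovers the remaining share and the carbon intensity implied by each run. It assumes total energy is the sum of CPU, GPU, and RAM energy (consistent with the codecarbon log lines above) and that emissions are energy multiplied by a grid carbon intensity.

```python
# Values copied from the three few-shot result printouts (kWh and kg CO2).
runs = {
    100:  {"total": 0.008555, "gpu": 0.005841, "cpu": 0.001433, "co2": 0.002290},
    500:  {"total": 0.008894, "gpu": 0.006069, "cpu": 0.001492, "co2": 0.002380},
    1000: {"total": 0.009262, "gpu": 0.006294, "cpu": 0.001567, "co2": 0.002479},
}

summary = {}
for shots, r in runs.items():
    # Assumption: RAM energy is whatever is left after GPU and CPU.
    ram_kwh = r["total"] - r["gpu"] - r["cpu"]
    # Assumption: emissions = energy x carbon intensity.
    intensity = r["co2"] / r["total"]
    summary[shots] = {"ram_share": ram_kwh / r["total"], "kg_per_kwh": intensity}
    print(f"{shots:>4} shots: RAM {ram_kwh:.6f} kWh "
          f"({100 * ram_kwh / r['total']:.1f}%), implied intensity {intensity:.4f} kg/kWh")
```

Under these assumptions, RAM accounts for roughly 15% of the energy in every run, and all three runs imply about 0.27 kg CO₂ per kWh.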


### STEP 8.1: Results and Analysis

In [48]:
results_df_fewshot = pd.DataFrame(result_fewshot)
print("\n" + "="*60)
print("📊 FEW-SHOT LEARNING RESULTS SUMMARY")
print("="*60)
print(results_df_fewshot[['num_shots', 'trainable_percentage', 'f1_score', 'exact_match', 'emissions_kg', 'training_time_hours']].to_string(index=False))

# Save to CSV
results_df_fewshot.to_csv("/content/drive/MyDrive/distilbert_fewshot_results.csv", index=False)



📊 FEW-SHOT LEARNING RESULTS SUMMARY
 num_shots  trainable_percentage  f1_score  exact_match  emissions_kg  training_time_hours
       100              0.002318  0.009092     0.000412      0.002290             0.033616
       500              0.002318  0.019856     0.008241      0.002380             0.034983
      1000              0.002318  0.022245     0.006428      0.002479             0.036757
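
One way to read this table is through marginal returns: how much F1 each additional kilogram of CO₂ buys when moving from one shot size to the next. The sketch below uses the values from the summary; the marginal-return metric itself is an illustrative addition, not something the notebook computes.

```python
# (num_shots, f1_score, emissions_kg) copied from the summary table.
rows = [
    (100, 0.009092, 0.002290),
    (500, 0.019856, 0.002380),
    (1000, 0.022245, 0.002479),
]

marginal = []
for (s0, f0, e0), (s1, f1, e1) in zip(rows, rows[1:]):
    delta_f1 = f1 - f0   # F1 gained by adding examples
    delta_kg = e1 - e0   # extra emissions for that gain
    marginal.append((s0, s1, delta_f1 / delta_kg))
    print(f"{s0} -> {s1} shots: {delta_f1:+.6f} F1 for {delta_kg:+.6f} kg "
          f"({delta_f1 / delta_kg:.1f} F1 per kg)")
```

The step from 100 to 500 shots returns about 120 F1 per kg, while the step from 500 to 1000 returns about 24. Absolute F1 stays near 0.02 in every run, so these ratios describe diminishing returns on a very small base.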


In [49]:
# FEW-SHOT EFFICIENCY ANALYSIS

print("\n" + "="*80)
print("📈 FEW-SHOT EFFICIENCY ANALYSIS")
print("="*80)

# Use 500-shot as baseline (middle ground)
baseline = results_df_fewshot[results_df_fewshot['num_shots'] == 500].iloc[0]

for _, row in results_df_fewshot.iterrows():
    shots = row['num_shots']
    samples_ratio = row['num_shots'] / baseline['num_shots']
    f1_diff = row['f1_score'] - baseline['f1_score']
    emissions_diff = row['emissions_kg'] - baseline['emissions_kg']
    time_diff = row['training_time_hours'] - baseline['training_time_hours']

    print(f"\n{shots}-Shot Learning:")
    print(f"  Training Examples: {row['num_shots']:,}")
    print(f"  Trainable Params: {row['trainable_params']:,} ({row['trainable_percentage']:.4f}%)")
    print(f"  vs 500-shot: {samples_ratio:.2f}x training data")
    print(f"  F1 Score: {row['f1_score']:.4f} ({f1_diff:+.4f} vs 500-shot)")
    print(f"  Emissions: {row['emissions_kg']:.6f} kg ({emissions_diff:+.6f} vs 500-shot)")
    print(f"  Training Time: {row['training_time_hours']:.2f} hours ({time_diff:+.2f} vs 500-shot)")

    # Efficiency metrics
    efficiency_co2 = row['f1_score'] / row['emissions_kg'] if row['emissions_kg'] > 0 else 0
    efficiency_time = row['f1_score'] / row['training_time_hours'] if row['training_time_hours'] > 0 else 0
    efficiency_samples = row['f1_score'] / row['num_shots'] if row['num_shots'] > 0 else 0

    print(f"  Efficiency (F1/kg CO₂): {efficiency_co2:.2f}")
    print(f"  Efficiency (F1/hour): {efficiency_time:.4f}")
    print(f"  Efficiency (F1/sample): {efficiency_samples:.6f}")




📈 FEW-SHOT EFFICIENCY ANALYSIS

100-Shot Learning:
  Training Examples: 100
  Trainable Params: 1,538 (0.0023%)
  vs 500-shot: 0.20x training data
  F1 Score: 0.0091 (-0.0108 vs 500-shot)
  Emissions: 0.002290 kg (-0.000091 vs 500-shot)
  Training Time: 0.03 hours (-0.00 vs 500-shot)
  Efficiency (F1/kg CO₂): 3.97
  Efficiency (F1/hour): 0.2705
  Efficiency (F1/sample): 0.000091

500-Shot Learning:
  Training Examples: 500
  Trainable Params: 1,538 (0.0023%)
  vs 500-shot: 1.00x training data
  F1 Score: 0.0199 (+0.0000 vs 500-shot)
  Emissions: 0.002380 kg (+0.000000 vs 500-shot)
  Training Time: 0.03 hours (+0.00 vs 500-shot)
  Efficiency (F1/kg CO₂): 8.34
  Efficiency (F1/hour): 0.5676
  Efficiency (F1/sample): 0.000040

1000-Shot Learning:
  Training Examples: 1,000
  Trainable Params: 1,538 (0.0023%)
  vs 500-shot: 2.00x training data
  F1 Score: 0.0222 (+0.0024 vs 500-shot)
  Emissions: 0.002479 kg (+0.000099 vs 500-shot)
  Training Time: 0.04 hours (+0.00 vs 500-shot)
  

In [50]:
# PLOT 1: Few-Shot Energy Consumption by Shots
df_sorted_fewshot = results_df_fewshot.sort_values('num_shots')

fig_fewshot_energy = go.Figure()

fig_fewshot_energy.add_trace(go.Bar(
    name='CPU Energy',
    x=df_sorted_fewshot['num_shots'],
    y=df_sorted_fewshot['cpu_energy_kwh'],
    marker_color='#FF6B6B',
    hovertemplate='<b>CPU Energy</b><br>%{y:.6f} kWh<br>Shots: %{x}<extra></extra>'
))

fig_fewshot_energy.add_trace(go.Bar(
    name='GPU Energy',
    x=df_sorted_fewshot['num_shots'],
    y=df_sorted_fewshot['gpu_energy_kwh'],
    marker_color='#4ECDC4',
    hovertemplate='<b>GPU Energy</b><br>%{y:.6f} kWh<br>Shots: %{x}<extra></extra>'
))

fig_fewshot_energy.add_trace(go.Bar(
    name='RAM Energy',
    x=df_sorted_fewshot['num_shots'],
    y=df_sorted_fewshot['ram_energy_kwh'],
    marker_color='#95E1D3',
    hovertemplate='<b>RAM Energy</b><br>%{y:.6f} kWh<br>Shots: %{x}<extra></extra>'
))

fig_fewshot_energy.update_layout(
    title=dict(text="Few-Shot: Energy Consumption by Number of Examples", font=dict(size=18)),
    xaxis_title='Number of Training Examples',
    yaxis_title='Energy Consumption (kWh)',
    barmode='stack',
    template='plotly_white',
    height=500,
    font=dict(size=13),
    legend=dict(
        orientation="h",
        yanchor="bottom",
        y=1.02,
        xanchor="right",
        x=1
    ),
    hovermode='x unified'
)

fig_fewshot_energy.show()
fig_fewshot_energy.write_html("/content/drive/MyDrive/fewshot_energy_by_shots.html")

In [51]:
# PLOT 2: Few-Shot Performance & Emissions by Shots (Dual Y-axis)
df_sorted_fewshot = results_df_fewshot.sort_values('num_shots')

fig_fewshot_perf = make_subplots(specs=[[{"secondary_y": True}]])

# F1 Score line
fig_fewshot_perf.add_trace(
    go.Scatter(
        x=df_sorted_fewshot['num_shots'],
        y=df_sorted_fewshot['f1_score'],
        name='F1 Score',
        mode='lines+markers',
        line=dict(color='#4ECDC4', width=3),
        marker=dict(size=12, line=dict(width=2, color='white')),
        hovertemplate='<b>F1 Score</b>: %{y:.4f}<br>Shots: %{x}<extra></extra>'
    ),
    secondary_y=False
)

# Exact Match line
fig_fewshot_perf.add_trace(
    go.Scatter(
        x=df_sorted_fewshot['num_shots'],
        y=df_sorted_fewshot['exact_match'],
        name='Exact Match',
        mode='lines+markers',
        line=dict(color='#95E1D3', width=3, dash='dash'),
        marker=dict(size=10),
        hovertemplate='<b>Exact Match</b>: %{y:.4f}<br>Shots: %{x}<extra></extra>'
    ),
    secondary_y=False
)

# CO2 Emissions bar
fig_fewshot_perf.add_trace(
    go.Bar(
        x=df_sorted_fewshot['num_shots'],
        y=df_sorted_fewshot['emissions_kg'],
        name='CO₂ Emissions',
        marker_color='#FF6B6B',
        opacity=0.6,
        hovertemplate='<b>CO₂</b>: %{y:.6f} kg<br>Shots: %{x}<extra></extra>'
    ),
    secondary_y=True
)

fig_fewshot_perf.update_xaxes(title_text="Number of Training Examples")
fig_fewshot_perf.update_yaxes(title_text="Performance Score", secondary_y=False)
fig_fewshot_perf.update_yaxes(title_text="CO₂ Emissions (kg)", secondary_y=True)

fig_fewshot_perf.update_layout(
    title=dict(text="Few-Shot: Performance vs Carbon Emissions", font=dict(size=18)),
    template='plotly_white',
    height=500,
    font=dict(size=13),
    hovermode='x unified',
    legend=dict(
        orientation="h",
        yanchor="bottom",
        y=1.02,
        xanchor="right",
        x=1
    )
)

fig_fewshot_perf.show()
fig_fewshot_perf.write_html("/content/drive/MyDrive/fewshot_performance_emissions.html")

# Comparing And Testing All The Models

In [52]:
def test_model(model_path, examples, tokenizer_name="distilbert-base-uncased"):
    print("\n" + "="*80)
    print("🧪 MODEL TESTING")
    print("="*80)

    # Load tokenizer
    tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)

    # Auto-detect model type from the path (case-insensitive)
    is_lora = "lora" in model_path.lower()

    if is_lora:
        # Local import so this cell does not rely on an earlier `peft` import
        from peft import PeftModel

        method = "LoRA"
        # Load base model + LoRA adapters
        base_model = AutoModelForQuestionAnswering.from_pretrained(tokenizer_name)
        model = PeftModel.from_pretrained(base_model, model_path)
        print("✅ Loaded LoRA model (base + adapters)")
    else:
        # Detect if few-shot or full fine-tuning
        method = "Few-Shot" if "fewshot" in model_path.lower() else "Full Fine-tuning"
        model = AutoModelForQuestionAnswering.from_pretrained(model_path)
        print(f"✅ Loaded {method} model")

    print(f"📋 Method: {method}")
    print(f"📂 Path: {model_path}")
    print("="*80)

    # Create pipeline
    qa_pipeline = pipeline(
        "question-answering",
        model=model,
        tokenizer=tokenizer,
        device=0 if torch.cuda.is_available() else -1
    )

    # Test all examples
    results = []
    for i, ex in enumerate(examples, 1):
        # Get prediction
        prediction = qa_pipeline(question=ex['question'], context=ex['context'])

        # Store result
        result = {
            'example_num': i,
            'method': method,
            'question': ex['question'],
            'context': (ex['context'][:100] + "...") if len(ex['context']) > 100 else ex['context'],
            'predicted_answer': prediction['answer'],
            'expected_answer': ex.get('expected_answer', None),
            'confidence': prediction['score'],
            'start_position': prediction['start'],
            'end_position': prediction['end']
        }
        results.append(result)

        # Print formatted output
        print(f"\n📝 Example {i}")
        print(f"Question: {ex['question']}")
        print(f"Context: {ex['context'][:150]}{'...' if len(ex['context']) > 150 else ''}")
        print(f"\n✅ Predicted: '{prediction['answer']}'")
        print(f"   Confidence: {prediction['score']:.2%}")

        # Check match with expected answer
        if ex.get('expected_answer'):
            expected = ex['expected_answer'].lower().strip()
            predicted = prediction['answer'].lower().strip()
            # Flexible matching: either one contains the other.
            # An empty prediction is a substring of everything, so require a non-empty one.
            match = bool(predicted) and ((predicted in expected) or (expected in predicted))
            print(f"   Expected: '{ex['expected_answer']}'")
            print(f"   Match: {'✓ YES' if match else '✗ NO'}")

        print("-" * 80)

    return results
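
The substring check in `test_model` is lenient: any prediction that contains the expected answer counts as a match, including a prediction that returns most of the context. A stricter alternative is SQuAD-style scoring, which normalizes both strings and reports exact match and token-level F1. The sketch below is a minimal, standalone version of that idea; the helper names `normalize_answer`, `exact_match`, and `token_f1` are illustrative and not part of the notebook, and the normalization rules (lowercase, strip punctuation and English articles) are an assumption modeled on the usual SQuAD evaluation.

```python
import re
import string
from collections import Counter


def normalize_answer(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, expected):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize_answer(prediction) == normalize_answer(expected))


def token_f1(prediction, expected):
    """Harmonic mean of token precision and recall after normalization."""
    pred_tokens = normalize_answer(prediction).split()
    gold_tokens = normalize_answer(expected).split()
    if not pred_tokens or not gold_tokens:
        # Both empty counts as agreement (the "no answer" case); one empty is a miss.
        return float(pred_tokens == gold_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

Applied to the few-shot prediction 'Google Colab provides free access to GPUs' against the expected 'GPUs and TPUs', `token_f1` gives partial credit for the one shared token instead of a flat yes/no, which makes weak models easier to tell apart.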

In [53]:
test_examples = [
    {
        'question': "What is the capital of France?",
        'context': "Paris is the capital and most populous city of France. It has been one of Europe's major centers of finance, diplomacy, commerce, fashion, and arts.",
        'expected_answer': "Paris"
    },
    {
        'question': "What does Google Colab provide access to?",
        'context': "Google Colab provides free access to GPUs and TPUs, which makes it popular for deep learning.",
        'expected_answer': "GPUs and TPUs"
    },
    {
        'question': "When was Python created?",
        'context': "Python was created by Guido van Rossum and first released in 1991. Its design philosophy emphasizes code readability.",
        'expected_answer': "1991"
    },
    {
        'question': "Who invented the telephone?",
        'context': "The telephone was invented by Alexander Graham Bell in 1876. He made the first successful telephone call on March 10, 1876.",
        'expected_answer': "Alexander Graham Bell"
    },
]


In [54]:
print(os.listdir('/content/'))

['.config', 'results_distilbert_80pct', 'results_distilbert_lora_r8_80pct', 'wandb', 'results_distilbert_lora_r4_80pct', 'results_distilbert_lora_r16_80pct', 'drive', 'results_distilbert_50pct', 'results_distilbert_fewshot_1000shots', 'results_distilbert_25pct', 'results_distilbert_fewshot_500shots', 'results_distilbert_fewshot_100shots', 'sample_data']


In [55]:
# Test Full Fine-tuning
print("🔷" * 40)
print("TESTING FULL FINE-TUNING MODEL")
print("🔷" * 40)
results_full_ft = test_model(
    model_path="results_distilbert_80pct/final_model",
    examples=test_examples
)

üî∑üî∑üî∑üî∑üî∑üî∑üî∑üî∑üî∑üî∑üî∑üî∑üî∑üî∑üî∑üî∑üî∑üî∑üî∑üî∑üî∑üî∑üî∑üî∑üî∑üî∑üî∑üî∑üî∑üî∑üî∑üî∑üî∑üî∑üî∑üî∑üî∑üî∑üî∑üî∑
TESTING FULL FINE-TUNING MODEL
üî∑üî∑üî∑üî∑üî∑üî∑üî∑üî∑üî∑üî∑üî∑üî∑üî∑üî∑üî∑üî∑üî∑üî∑üî∑üî∑üî∑üî∑üî∑üî∑üî∑üî∑üî∑üî∑üî∑üî∑üî∑üî∑üî∑üî∑üî∑üî∑üî∑üî∑üî∑üî∑

üß™ MODEL TESTING


Device set to use cuda:0


‚úÖ Loaded Full Fine-tuning model
üìã Method: Full Fine-tuning
üìÇ Path: results_distilbert_80pct/final_model

üìù Example 1
Question: What is the capital of France?
Context: Paris is the capital and most populous city of France. It has been one of Europe's major centers of finance, diplomacy, commerce, fashion, and arts.

‚úÖ Predicted: 'Paris'
   Confidence: 99.16%
   Expected: 'Paris'
   Match: ‚úì YES
--------------------------------------------------------------------------------

üìù Example 2
Question: What does Google Colab provide access to?
Context: Google Colab provides free access to GPUs and TPUs, which makes it popular for deep learning.

‚úÖ Predicted: 'GPUs and TPUs'
   Confidence: 72.64%
   Expected: 'GPUs and TPUs'
   Match: ‚úì YES
--------------------------------------------------------------------------------

üìù Example 3
Question: When was Python created?
Context: Python was created by Guido van Rossum and first released in 1991. Its design philosophy empha

In [56]:
# Test LoRA
print("\n" + "🔷" * 40)
print("TESTING LORA MODEL")
print("🔷" * 40)
results_lora = test_model(
    model_path="results_distilbert_lora_r16_80pct/lora_adapters",
    examples=test_examples
)



üî∑üî∑üî∑üî∑üî∑üî∑üî∑üî∑üî∑üî∑üî∑üî∑üî∑üî∑üî∑üî∑üî∑üî∑üî∑üî∑üî∑üî∑üî∑üî∑üî∑üî∑üî∑üî∑üî∑üî∑üî∑üî∑üî∑üî∑üî∑üî∑üî∑üî∑üî∑üî∑
TESTING LORA MODEL
üî∑üî∑üî∑üî∑üî∑üî∑üî∑üî∑üî∑üî∑üî∑üî∑üî∑üî∑üî∑üî∑üî∑üî∑üî∑üî∑üî∑üî∑üî∑üî∑üî∑üî∑üî∑üî∑üî∑üî∑üî∑üî∑üî∑üî∑üî∑üî∑üî∑üî∑üî∑üî∑

üß™ MODEL TESTING


Some weights of DistilBertForQuestionAnswering were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Device set to use cuda:0


‚úÖ Loaded LoRA model (base + adapters)
üìã Method: LoRA
üìÇ Path: results_distilbert_lora_r16_80pct/lora_adapters

üìù Example 1
Question: What is the capital of France?
Context: Paris is the capital and most populous city of France. It has been one of Europe's major centers of finance, diplomacy, commerce, fashion, and arts.

‚úÖ Predicted: 'Paris'
   Confidence: 74.85%
   Expected: 'Paris'
   Match: ‚úì YES
--------------------------------------------------------------------------------

üìù Example 2
Question: What does Google Colab provide access to?
Context: Google Colab provides free access to GPUs and TPUs, which makes it popular for deep learning.

‚úÖ Predicted: 'GPUs and TPUs'
   Confidence: 38.71%
   Expected: 'GPUs and TPUs'
   Match: ‚úì YES
--------------------------------------------------------------------------------

üìù Example 3
Question: When was Python created?
Context: Python was created by Guido van Rossum and first released in 1991. Its design philosophy 

In [57]:
# Test Few-Shot
print("\n" + "🔷" * 40)
print("TESTING FEW-SHOT MODEL")
print("🔷" * 40)
test_results_fewshot = test_model(
    model_path="results_distilbert_fewshot_1000shots/final_model",
    examples=test_examples
)



üî∑üî∑üî∑üî∑üî∑üî∑üî∑üî∑üî∑üî∑üî∑üî∑üî∑üî∑üî∑üî∑üî∑üî∑üî∑üî∑üî∑üî∑üî∑üî∑üî∑üî∑üî∑üî∑üî∑üî∑üî∑üî∑üî∑üî∑üî∑üî∑üî∑üî∑üî∑üî∑
TESTING FEW-SHOT MODEL
üî∑üî∑üî∑üî∑üî∑üî∑üî∑üî∑üî∑üî∑üî∑üî∑üî∑üî∑üî∑üî∑üî∑üî∑üî∑üî∑üî∑üî∑üî∑üî∑üî∑üî∑üî∑üî∑üî∑üî∑üî∑üî∑üî∑üî∑üî∑üî∑üî∑üî∑üî∑üî∑

üß™ MODEL TESTING


Device set to use cuda:0


‚úÖ Loaded Few-Shot model
üìã Method: Few-Shot
üìÇ Path: results_distilbert_fewshot_1000shots/final_model

üìù Example 1
Question: What is the capital of France?
Context: Paris is the capital and most populous city of France. It has been one of Europe's major centers of finance, diplomacy, commerce, fashion, and arts.

‚úÖ Predicted: 'Paris'
   Confidence: 4.59%
   Expected: 'Paris'
   Match: ‚úì YES
--------------------------------------------------------------------------------

üìù Example 2
Question: What does Google Colab provide access to?
Context: Google Colab provides free access to GPUs and TPUs, which makes it popular for deep learning.

‚úÖ Predicted: 'Google Colab provides free access to GPUs'
   Confidence: 3.72%
   Expected: 'GPUs and TPUs'
   Match: ‚úó NO
--------------------------------------------------------------------------------

üìù Example 3
Question: When was Python created?
Context: Python was created by Guido van Rossum and first released in 1991. Its de

In [58]:
all_results = {
    "Full FT (80%)": results_full_ft,
    "LoRA (r=16, 80%)": results_lora,
    "Few-Shot (1000)": test_results_fewshot
}


## Comparison Function

In [59]:
def compare_models(results_dict):
    print("\n" + "="*80)
    print("📊 MODEL COMPARISON")
    print("="*80)

    comparison_data = []

    for model_name, results in results_dict.items():
        for result in results:
            comparison_data.append({
                'Model': model_name,
                'Question': result['question'][:50] + "...",
                'Predicted': result['predicted_answer'],
                # test_model stores None when no expected answer is given, so fall back explicitly
                'Expected': result.get('expected_answer') or 'N/A',
                'Confidence': result['confidence']
            })

    comparison_df = pd.DataFrame(comparison_data)

    # Group by question to see how different models answer
    for question in comparison_df['Question'].unique():
        print(f"\nQUESTION: {question}")
        question_results = comparison_df[comparison_df['Question'] == question]
        for _, row in question_results.iterrows():
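            exp = str(row['Expected']).lower()
            pred = str(row['Predicted']).lower()
            # An empty prediction is a substring of everything, so require a non-empty one
            is_match = exp != 'n/a' and bool(pred) and (pred in exp or exp in pred)
            match_indicator = "✓" if is_match else "✗"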
            print(f"  {match_indicator} {row['Model']:20s}: {row['Predicted']:40s} ({row['Confidence']:.1%})")
        expected_val = question_results.iloc[0]['Expected']
        if expected_val != 'N/A':
            print(f"Expected ANSWER: {expected_val}")

    return comparison_df
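
The per-question printout can also be condensed into one row per model. The following is a minimal sketch, assuming result dicts shaped like those returned by `test_model` (keys `predicted_answer`, `expected_answer`, `confidence`). The helper `summarize_results` is hypothetical and not part of the notebook; it reuses the same lenient substring rule for matching.

```python
def summarize_results(results_dict):
    """Return {model_name: {n_examples, match_rate, mean_confidence}}."""
    summary = {}
    for model_name, results in results_dict.items():
        # Only examples that carry an expected answer can be scored
        scored = [r for r in results if r.get("expected_answer")]
        hits = 0
        for r in scored:
            pred = r["predicted_answer"].lower().strip()
            exp = r["expected_answer"].lower().strip()
            # Same substring rule as the notebook, with a guard for empty predictions
            if pred and (pred in exp or exp in pred):
                hits += 1
        summary[model_name] = {
            "n_examples": len(results),
            "match_rate": hits / len(scored) if scored else None,
            "mean_confidence": (
                sum(r["confidence"] for r in results) / len(results) if results else None
            ),
        }
    return summary
```

With the notebook's variables this would be called as `summarize_results(all_results)`. With only four hand-written examples the match rate is a sanity check, not a benchmark score.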

In [60]:
comparison_df = compare_models(all_results)


📊 MODEL COMPARISON

QUESTION: What is the capital of France?...
  ✓ Full FT (80%)       : Paris                                    (99.2%)
  ✓ LoRA (r=16, 80%)    : Paris                                    (74.9%)
  ✓ Few-Shot (1000)     : Paris                                    (4.6%)
Expected ANSWER: Paris

QUESTION: What does Google Colab provide access to?...
  ✓ Full FT (80%)       : GPUs and TPUs                            (72.6%)
  ✓ LoRA (r=16, 80%)    : GPUs and TPUs                            (38.7%)
  ✗ Few-Shot (1000)     : Google Colab provides free access to GPUs (3.7%)
Expected ANSWER: GPUs and TPUs

QUESTION: When was Python created?...
  ✓ Full FT (80%)       : 1991                                     (87.9%)
  ✓ LoRA (r=16, 80%)    : 1991                                     (80.9%)
  ✗ Few-Shot (1000)     : van Rossum                               (3.3%)
Expected ANSWER: 1991

QUESTION: Who invented the telephone?...
  ✓ Full FT (80%)       : 

In [61]:
# Save comparison results
comparison_df.to_csv("/content/drive/MyDrive/model_comparison.csv", index=False)

In [62]:
print("\n" + "="*80)
print("📊 COMPARISON: FULL FT vs LoRA vs FEW-SHOT")
print("="*80)

# Load previous results
full_ft_results = pd.read_csv("/content/drive/MyDrive/distilbert_dataset_size_results.csv")
lora_results = pd.read_csv("/content/drive/MyDrive/distilbert_lora_results.csv")
results_fewshot = pd.read_csv("/content/drive/MyDrive/distilbert_fewshot_results.csv")

# Add method identifiers if not present
if 'training_method' not in full_ft_results.columns:
    full_ft_results['training_method'] = 'Full Fine-tuning'
if 'training_method' not in lora_results.columns:
    lora_results['training_method'] = 'LoRA'

# Combine all results
all_methods = pd.concat([full_ft_results, lora_results, results_fewshot], ignore_index=True)

print("\n🔍 Training Efficiency Comparison:")
print(all_methods[['training_method', 'train_samples','f1_score', 'emissions_kg', 'training_time_hours']].to_string(index=False))

# Save combined results
all_methods.to_csv("/content/drive/MyDrive/all_training_methods_comparison_distilbert.csv", index=False)



📊 COMPARISON: FULL FT vs LoRA vs FEW-SHOT

🔍 Training Efficiency Comparison:
           training_method  train_samples  f1_score  emissions_kg  training_time_hours
          Full Fine-Tuning          32579  0.467055      0.003866             0.049323
          Full Fine-Tuning          65159  0.569492      0.007136             0.087358
          Full Fine-Tuning         104255  0.609658      0.011166             0.136250
                      LoRA         104255  0.493034      0.009820             0.135649
                      LoRA         104255  0.512776      0.009813             0.135268
                      LoRA         104255  0.545845      0.009804             0.135895
Few-Shot (Frozen Backbone)            100  0.009092      0.002290             0.033616
Few-Shot (Frozen Backbone)            500  0.019856      0.002380             0.034983
Few-Shot (Frozen Backbone)           1000  0.022245      0.002479             0.036757
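
One way to read this table against the project goal is to divide each run's F1 by its emissions. The sketch below copies the nine rows printed above and computes F1 per gram of CO₂-equivalent. The metric and the helper name `f1_per_gram` are illustrative choices, not something the notebook defines, and the ratio ignores the fact that a run with very low absolute F1 may be unusable however cheap it is.

```python
# Rows copied from the table above: (method, train_samples, f1_score, emissions_kg)
rows = [
    ("Full Fine-Tuning", 32579, 0.467055, 0.003866),
    ("Full Fine-Tuning", 65159, 0.569492, 0.007136),
    ("Full Fine-Tuning", 104255, 0.609658, 0.011166),
    ("LoRA", 104255, 0.493034, 0.009820),
    ("LoRA", 104255, 0.512776, 0.009813),
    ("LoRA", 104255, 0.545845, 0.009804),
    ("Few-Shot (Frozen Backbone)", 100, 0.009092, 0.002290),
    ("Few-Shot (Frozen Backbone)", 500, 0.019856, 0.002380),
    ("Few-Shot (Frozen Backbone)", 1000, 0.022245, 0.002479),
]


def f1_per_gram(f1_score, emissions_kg):
    """F1 points obtained per gram of CO2-equivalent emitted during training."""
    return f1_score / (emissions_kg * 1000.0)


for method, n_samples, f1, kg in rows:
    print(f"{method:28s} {n_samples:>7d}  F1={f1:.3f}  "
          f"{kg * 1000:6.2f} g  {f1_per_gram(f1, kg):.4f} F1/g")

# Run with the highest F1 per gram among the rows above
most_efficient = max(rows, key=lambda r: f1_per_gram(r[2], r[3]))
```

The same calculation could be run on the `all_methods` DataFrame by dividing its `f1_score` column by `emissions_kg * 1000`.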
