# Quantifying the Environmental Cost of AI: Carbon Emissions in Language Model Fine-Tuning for Question Answering

> ### **Project Goal**: As language models play a growing role in natural language processing, their environmental impact deserves closer attention. Most research in this area focuses on improving model accuracy, while the energy use and carbon footprint of training are often overlooked or poorly documented. This project examines that imbalance by measuring how gains in question-answering performance relate to the energy and carbon cost of fine-tuning.

# Training Strategy 1: Full Fine-Tuning (Model BERT)

In [1]:
!pip install transformers
!pip install datasets
!pip install accelerate
!pip install codecarbon
!pip install evaluate

Collecting codecarbon
  Downloading codecarbon-3.2.0-py3-none-any.whl.metadata (12 kB)
Collecting fief-client[cli] (from codecarbon)
  Downloading fief_client-0.20.0-py3-none-any.whl.metadata (2.1 kB)
Collecting psutil>=6.0.0 (from codecarbon)
  Downloading psutil-7.1.3-cp36-abi3-manylinux2010_x86_64.manylinux_2_12_x86_64.manylinux_2_28_x86_64.whl.metadata (23 kB)
Collecting rapidfuzz (from codecarbon)
  Downloading rapidfuzz-3.14.3-cp312-cp312-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.metadata (12 kB)
Collecting questionary (from codecarbon)
  Downloading questionary-2.1.1-py3-none-any.whl.metadata (5.4 kB)
Collecting httpx<0.28.0,>=0.21.3 (from fief-client[cli]->codecarbon)
  Downloading httpx-0.27.2-py3-none-any.whl.metadata (7.1 kB)
Collecting jwcrypto<2.0.0,>=1.4 (from fief-client[cli]->codecarbon)
  Downloading jwcrypto-1.5.6-py3-none-any.whl.metadata (3.1 kB)
Collecting yaspin (from fief-client[cli]->codecarbon)
  Downloading yaspin-3.4.0-py3-none-any.whl.metadata (15 kB)


Collecting evaluate
  Downloading evaluate-0.4.6-py3-none-any.whl.metadata (9.5 kB)
Downloading evaluate-0.4.6-py3-none-any.whl (84 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 84.1/84.1 kB 3.6 MB/s eta 0:00:00
Installing collected packages: evaluate
Successfully installed evaluate-0.4.6


KeyboardInterrupt: 

In [1]:
# Import the necessary libraries
import json
import os
from collections import defaultdict

import numpy as np
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
import torch
from codecarbon import EmissionsTracker
from datasets import Dataset, load_dataset
from google.colab import drive
from plotly.subplots import make_subplots
from transformers import (
    AutoModelForQuestionAnswering,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
    default_data_collator,
    pipeline,
)

drive.mount('/content/drive')

Mounted at /content/drive


## STEP 1: Loading the Stanford Question Answering Dataset (SQuAD 2.0)

In [2]:
squad = load_dataset("squad_v2")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


README.md: 0.00B [00:00, ?B/s]

squad_v2/train-00000-of-00001.parquet:   0%|          | 0.00/16.4M [00:00<?, ?B/s]

squad_v2/validation-00000-of-00001.parqu(…):   0%|          | 0.00/1.35M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/130319 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/11873 [00:00<?, ? examples/s]

In [3]:
print("SQuAD Format: ",squad)
print(f"\nFull training set size: {len(squad['train'])}")
print(f"\nValidation set size: {len(squad['validation'])}")

SQuAD Format:  DatasetDict({
    train: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 130319
    })
    validation: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 11873
    })
})

Full training set size: 130319

Validation set size: 11873


In [4]:
train_data = pd.DataFrame(squad['train'])
train_data.head()

| | id | title | context | question | answers |
|---|---|---|---|---|---|
| 0 | 56be85543aeaaa14008c9063 | Beyoncé | Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ b... | When did Beyonce start becoming popular? | {'text': ['in the late 1990s'], 'answer_start'... |
| 1 | 56be85543aeaaa14008c9065 | Beyoncé | Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ b... | What areas did Beyonce compete in when she was... | {'text': ['singing and dancing'], 'answer_star... |
| 2 | 56be85543aeaaa14008c9066 | Beyoncé | Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ b... | When did Beyonce leave Destiny's Child and bec... | {'text': ['2003'], 'answer_start': [526]} |
| 3 | 56bf6b0f3aeaaa14008c9601 | Beyoncé | Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ b... | In what city and state did Beyonce grow up? | {'text': ['Houston, Texas'], 'answer_start': [... |
| 4 | 56bf6b0f3aeaaa14008c9602 | Beyoncé | Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ b... | In which decade did Beyonce become famous? | {'text': ['late 1990s'], 'answer_start': [276]} |


## STEP 2: Tokenization Function for the Model

In [5]:
# AutoTokenizer automatically picks the correct tokenizer for the given model

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

tokenizer_config.json:   0%|          | 0.00/48.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/570 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

In [6]:
def preprocess_function(examples):
    #Convert raw SQuAD examples into model-ready training data.
    questions = [q.strip() for q in examples["question"]]
    contexts = [c.strip() for c in examples["context"]]

    # Tokenize
    tokenized = tokenizer(
        questions,
        contexts,
        max_length=384,
        stride=128,
        padding="max_length",
        truncation="only_second",         # Truncate only the context, never the question
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
    )

    # Mapping back to original samples
    sample_mapping = tokenized.pop("overflow_to_sample_mapping")
    offset_mapping = tokenized["offset_mapping"]

    start_positions = []
    end_positions = []

    for i, offsets in enumerate(offset_mapping):
        sample_idx = sample_mapping[i]
        answers = examples["answers"][sample_idx]

        # No-answer case: label the [CLS] token (index 0) as both start and end
        if len(answers["answer_start"]) == 0:
            start_positions.append(0)
            end_positions.append(0)
            continue

        start_char = answers["answer_start"][0]
        end_char = start_char + len(answers["text"][0])

        seq_ids = tokenized.sequence_ids(i)

        # Find context section
        context_start = seq_ids.index(1) if 1 in seq_ids else 0
        context_end = len(seq_ids) - 1 - seq_ids[::-1].index(1) if 1 in seq_ids else len(seq_ids) - 1

        # If answer not inside context - mark no answer
        if not (offsets[context_start][0] <= start_char and offsets[context_end][1] >= end_char):
            start_positions.append(0)
            end_positions.append(0)
            continue

        # Find start token
        token_start = context_start
        while token_start <= context_end and offsets[token_start][0] <= start_char:
            token_start += 1
        start_positions.append(token_start - 1)

        # Find end token - move forward until we pass answer end
        token_end = context_start
        while token_end <= context_end and offsets[token_end][1] < end_char:
            token_end += 1
        end_positions.append(token_end)

    tokenized["start_positions"] = start_positions
    tokenized["end_positions"] = end_positions

    return tokenized
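The character-to-token mapping inside `preprocess_function` can be hard to follow in the full function. The sketch below isolates the same two scanning loops on a toy example. The sentence and its `offsets` list are hand-written for illustration (they are not real tokenizer output), and `char_span_to_token_span` is a hypothetical helper name.

```python
# Toy context and hand-written (start_char, end_char) offsets, one pair per token.
# Assumption: these stand in for the tokenizer's offset_mapping of the context.
context = "She rose to fame in the late 1990s as lead singer."
offsets = [(0, 3), (4, 8), (9, 11), (12, 16), (17, 19), (20, 23),
           (24, 28), (29, 34), (35, 37), (38, 42), (43, 49), (49, 50)]

answer = "in the late 1990s"
start_char = context.index(answer)
end_char = start_char + len(answer)


def char_span_to_token_span(offsets, start_char, end_char):
    """Map a character span to (first_token, last_token), both inclusive."""
    # Start token: advance while the token begins at or before the answer start,
    # then step back one position.
    token_start = 0
    while token_start < len(offsets) and offsets[token_start][0] <= start_char:
        token_start += 1
    token_start -= 1

    # End token: first token whose end offset reaches the answer end.
    token_end = 0
    while token_end < len(offsets) and offsets[token_end][1] < end_char:
        token_end += 1
    return token_start, token_end


span = char_span_to_token_span(offsets, start_char, end_char)
recovered = context[offsets[span[0]][0]:offsets[span[1]][1]]
print(span, repr(recovered))
```

Slicing the context with the offsets of the returned tokens gives back the answer text, which is the same round-trip check the notebook performs with `tokenizer.decode`.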

In [7]:
print("========== Data Format Within SQuAD Training Set ==========")
print("\nQuestion at Index[0]: ", squad["train"][0]['question'])
print("\nContext at Index[0]: ", squad["train"][0]['context'])
print("\nAnswers at Index[0]: ", squad["train"][0]['answers'])

# Test preprocess_function on the first training example
sample = {
    "question": [squad["train"][0]['question']],
    "context": [squad["train"][0]['context']],
    "answers": [squad["train"][0]['answers']]
}

output = preprocess_function(sample)
print("\n========== Data Format After Preprocessing ==========")

for k, v in output.items():

    print('\n',k, ":", v[:5] if isinstance(v, list) else v)

# Decode the labelled start/end positions to check that they recover the gold answer
decoded_answer = tokenizer.decode(output['input_ids'][0][output['start_positions'][0]:output['end_positions'][0]+1])
print(f"\nDecoded Answer Span: '{decoded_answer}'")


Question at Index[0]:  When did Beyonce start becoming popular?

Context at Index[0]:  Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ bee-YON-say) (born September 4, 1981) is an American singer, songwriter, record producer and actress. Born and raised in Houston, Texas, she performed in various singing and dancing competitions as a child, and rose to fame in the late 1990s as lead singer of R&B girl-group Destiny's Child. Managed by her father, Mathew Knowles, the group became one of the world's best-selling girl groups of all time. Their hiatus saw the release of Beyoncé's debut album, Dangerously in Love (2003), which established her as a solo artist worldwide, earned five Grammy Awards and featured the Billboard Hot 100 number-one singles "Crazy in Love" and "Baby Boy".

Answers at Index[0]:  {'text': ['in the late 1990s'], 'answer_start': [269]}


 input_ids : [[101, 2043, 2106, 20773, 2707, 3352, 2759, 1029, 102, 20773, 21025, 19358, 22815, 1011, 5708, 1006, 1013, 12170, 23432, 297

In [8]:
# Preprocess validation set (full)
print("\nPreprocessing validation set...")
tokenized_validation = squad["validation"].map(
    preprocess_function,
    batched=True,
    remove_columns=squad["validation"].column_names
)


Preprocessing validation set...


Map:   0%|          | 0/11873 [00:00<?, ? examples/s]
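Because `return_overflowing_tokens=True` is set, a context that does not fit into `max_length=384` tokens is split into several overlapping features, so the tokenized set can contain more rows than the 11,873 raw examples. The sketch below illustrates how such windows could be laid out. It is an approximation of the tokenizer's behaviour, not its actual implementation: `window_ranges` is a hypothetical helper, and the 10-token question and 3 special tokens are assumed values.

```python
def window_ranges(n_context_tokens, capacity, stride):
    """Return (start, stop) context-token ranges for overlapping windows.

    capacity: context tokens that fit in one feature
    stride:   tokens shared between two consecutive windows
    """
    step = capacity - stride
    ranges = []
    start = 0
    while True:
        stop = min(start + capacity, n_context_tokens)
        ranges.append((start, stop))
        if stop == n_context_tokens:
            break
        start += step
    return ranges


max_length, stride = 384, 128
question_tokens = 10   # assumed question length
special_tokens = 3     # assumed: [CLS] question [SEP] context [SEP]
capacity = max_length - question_tokens - special_tokens

long_context = window_ranges(600, capacity, stride)
short_context = window_ranges(300, capacity, stride)
print(long_context, short_context)
```

Under these assumptions a 600-token context yields two features that overlap by 128 tokens, while a 300-token context fits in a single feature.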

In [9]:
tokenized_validation.features

{'input_ids': List(Value('int32')),
 'token_type_ids': List(Value('int8')),
 'attention_mask': List(Value('int8')),
 'offset_mapping': List(List(Value('int64'))),
 'start_positions': Value('int64'),
 'end_positions': Value('int64')}

In [10]:
# Helper that selects a fraction of the training data and tokenizes it.

def prepare_dataset(train_data, size_fraction, preprocess_fn):

    #Create and preprocess a subset of training data.
    num_samples = int(len(train_data) * size_fraction)
    train_subset = train_data.select(range(num_samples))

    print(f"Preprocessing {num_samples} training samples...")
    tokenized_train = train_subset.map(
        preprocess_fn,
        batched=True,
        remove_columns=train_subset.column_names
    )

    return tokenized_train, num_samples
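As a quick check of the subset sizes `prepare_dataset` produces, the arithmetic can be reproduced without loading the dataset. The sketch below assumes the training-set size of 130,319 reported earlier and the three fractions used in this notebook.

```python
TRAIN_SIZE = 130_319            # len(squad["train"]) as printed above
fractions = [0.25, 0.50, 0.80]  # fractions used in the experiments

# int() truncates toward zero, exactly as in prepare_dataset
subset_sizes = {f: int(TRAIN_SIZE * f) for f in fractions}

for fraction, n in subset_sizes.items():
    print(f"{fraction:.0%} of training data -> {n} examples")
```

Note that `select(range(num_samples))` takes the first rows in their stored order rather than a random sample. Since the first rows shown earlier all come from the same article, a shuffled subset might cover a different mix of articles; the notebook does not test this.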

## STEP 3: Functions for Training the BERT Model

In [11]:
#Model Architecture:
model = AutoModelForQuestionAnswering.from_pretrained("bert-base-uncased")
print(f"\n{'='*80}")
print("\nBERT Model Architecture:")
print(f"\n{'='*80}")
print("\nTransformer layers:",model.config.num_hidden_layers)
print("\nHidden size:",model.config.hidden_size)
print('\nIntermediate feed-forward size:',model.config.intermediate_size)
print("\nAttention heads:",model.config.num_attention_heads)
print("\nMax positional embeddings:", model.config.max_position_embeddings)
print("\nVocabulary size:", model.config.vocab_size)

# Parameter Count
total_params = sum(p.numel() for p in model.parameters())
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)

print("\nTotal parameters:", f"{total_params:,}")
print("\nTrainable parameters:", f"{trainable_params:,}")


model.safetensors:   0%|          | 0.00/440M [00:00<?, ?B/s]

Some weights of BertForQuestionAnswering were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.




BERT Model Architecture:


Transformer layers: 12

Hidden size: 768

Intermediate feed-forward size: 3072

Attention heads: 12

Max positional embeddings: 512

Vocabulary size: 30522

Total parameters: 108,893,186

Trainable parameters: 108,893,186
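The parameter total can be reproduced by hand from the configuration values printed above. The sketch below assumes the standard BERT layer layout (four attention projections, two feed-forward layers and two LayerNorms per block), a segment-embedding table of size 2, and a question-answering model without a pooler layer. The helper `linear` is a hypothetical name.

```python
hidden, layers, ffn, vocab, max_pos = 768, 12, 3072, 30522, 512
type_vocab = 2  # assumption: standard BERT segment-embedding size


def linear(n_in, n_out):
    """Parameters of a dense layer: weight matrix plus bias vector."""
    return n_in * n_out + n_out


layer_norm = 2 * hidden  # scale and shift vectors

embeddings = (vocab + max_pos + type_vocab) * hidden + layer_norm

per_layer = (
    4 * linear(hidden, hidden)   # query, key, value, attention output
    + layer_norm
    + linear(hidden, ffn)        # feed-forward expansion
    + linear(ffn, hidden)        # feed-forward projection
    + layer_norm
)

qa_head = linear(hidden, 2)      # start and end logits (qa_outputs)

total = embeddings + layers * per_layer + qa_head
print(f"{total:,}")
```

Only the 1,538 parameters of `qa_outputs` are newly initialized, which matches the warning printed when the checkpoint is loaded. In full fine-tuning all of the remaining weights are updated as well.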


In [12]:
# Custom compute_metrics function: Exact Match and F1 computed on token positions
def compute_metrics(pred):
    predictions, labels = pred
    start_preds = np.argmax(predictions[0], axis=1)
    end_preds = np.argmax(predictions[1], axis=1)

    start_true = labels[0]
    end_true = labels[1]

    # Calculate exact match
    exact_matches = ((start_preds == start_true) & (end_preds == end_true)).sum()
    exact_match = exact_matches / len(start_true)

    # Calculate F1 score (token overlap)
    f1_scores = []
    for start_p, end_p, start_t, end_t in zip(start_preds, end_preds, start_true, end_true):
        pred_tokens = set(range(start_p, end_p + 1))
        true_tokens = set(range(start_t, end_t + 1))

        if len(pred_tokens) == 0 and len(true_tokens) == 0:
            f1_scores.append(1.0)
        elif len(pred_tokens) == 0 or len(true_tokens) == 0:
            f1_scores.append(0.0)
        else:
            overlap = len(pred_tokens & true_tokens)
            precision = overlap / len(pred_tokens)
            recall = overlap / len(true_tokens)
            f1 = 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else 0
            f1_scores.append(f1)

    avg_f1 = np.mean(f1_scores)

    return {
        "exact_match": exact_match,
        "f1": avg_f1
    }
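To make the overlap-based F1 in `compute_metrics` concrete, here is a minimal sketch that scores a single predicted span against a single labelled span. `span_f1` is a hypothetical helper that follows the same set logic; the example positions are made up.

```python
def span_f1(pred_start, pred_end, true_start, true_end):
    """F1 between two inclusive token-index spans."""
    pred = set(range(pred_start, pred_end + 1))
    true = set(range(true_start, true_end + 1))

    if not pred and not true:
        return 1.0
    if not pred or not true:
        return 0.0

    overlap = len(pred & true)
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(true)
    return 2 * precision * recall / (precision + recall)


partial = span_f1(10, 14, 12, 15)   # 3 shared positions out of 5 predicted and 4 labelled
no_answer = span_f1(0, 0, 0, 0)     # both spans point at index 0
inverted = span_f1(14, 10, 12, 15)  # end before start gives an empty predicted set
print(round(partial, 4), no_answer, inverted)
```

These scores compare token positions within each tokenized feature. They are a proxy and may differ from the official SQuAD metrics, which compare normalized answer text per question.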

In [13]:
def train_model(tokenized_train, tokenized_eval, tokenizer, compute_metrics_fn, size_fraction, model_name):

    # Load fresh model
    model = AutoModelForQuestionAnswering.from_pretrained(model_name)

    # Setup output directory
    output_dir = f"results_bert_{int(size_fraction*100)}pct"

    # Training arguments
    training_args = TrainingArguments(
        output_dir=output_dir,
        eval_strategy="epoch",
        learning_rate=3e-5,
        per_device_train_batch_size=16,
        per_device_eval_batch_size=16,
        num_train_epochs=2,
        weight_decay=0.01,
        fp16=torch.cuda.is_available(),
        save_strategy="epoch",
        load_best_model_at_end=True,
        metric_for_best_model="f1",
        push_to_hub=False,
        logging_steps=100,
        greater_is_better=True
    )

    # Initialize trainer
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=tokenized_train,
        eval_dataset=tokenized_eval,
        tokenizer=tokenizer,
        data_collator=default_data_collator,
        compute_metrics=compute_metrics_fn
    )

    # Start carbon tracking
    tracker = EmissionsTracker(
        project_name=f"BERT_{int(size_fraction*100)}pct",
        output_dir=output_dir
    )
    tracker.start()

    # Train; always stop the tracker, even if training fails or is interrupted
    print("Training model...")
    try:
        train_results = trainer.train()
    finally:
        tracker.stop()

    emissions_data = tracker.final_emissions_data

    return trainer, train_results, emissions_data, output_dir
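The codecarbon log lines in the experiment output report a constant 42.5 W for the CPU and 38.0 W for RAM, at timestamps about 15 seconds apart. The sketch below checks that the logged energy values are consistent with energy = power × time. The 15-second interval is read off the log timestamps and is an assumption about the tracker's measurement period; `energy_kwh` is a hypothetical helper.

```python
def energy_kwh(power_w, seconds):
    """Convert a constant power draw over a duration into kilowatt-hours."""
    return power_w * seconds / 3600 / 1000


interval_s = 15  # assumed from the spacing of the log timestamps

ram_kwh = energy_kwh(38.0, interval_s)
cpu_kwh = energy_kwh(42.5, interval_s)

print(f"RAM: {ram_kwh:.6f} kWh per interval")
print(f"CPU: {cpu_kwh:.6f} kWh per interval")
```

Because the CPU and RAM figures are constant estimates rather than direct measurements, only the GPU energy in these runs reflects actual load.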


## STEP 4: Functions for Evaluating and Saving the Results

> We will be training our model on various data sizes from our SQuAD dataset.
>
> Training Data Variation: [25%, 50%, 80%]

In [14]:
def evaluate_and_save(trainer, train_results, emissions_data, output_dir, size_fraction, num_samples):
    """Evaluate model, print results, and save artifacts."""

    # Evaluate
    print("Evaluating model...")
    eval_results = trainer.evaluate()

    #Calculate trainable parameters for Full Fine-tuning
    model = trainer.model
    trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total_params = sum(p.numel() for p in model.parameters())
    trainable_percentage = 100 * trainable_params / total_params

    # Compile results
    result_entry = {
        "training_method": "Full Fine-Tuning",
        "model_name": "BERT",
        'dataset_size%': int(size_fraction*100),
        "train_samples": num_samples,
        "valid_samples": len(trainer.eval_dataset),  # tokenized features, not raw examples
        "trainable_params": trainable_params,
        "total_params": total_params,
        "trainable_percentage": trainable_percentage,

        # Performance metrics
        "f1_score": eval_results["eval_f1"],
        "exact_match": eval_results["eval_exact_match"],
        "eval_loss": eval_results["eval_loss"],
        "training_time_hours": train_results.metrics["train_runtime"] / 3600,

        # Emissions data
        "emissions_rate_kg_per_s": emissions_data.emissions_rate,
        "emissions_kg": emissions_data.emissions,
        "timestamp": emissions_data.timestamp,
        "duration_seconds": emissions_data.duration,
        "duration_hours": emissions_data.duration / 3600,

        # Energy consumption
        "energy_consumed_kwh": emissions_data.energy_consumed,
        "cpu_energy_kwh": emissions_data.cpu_energy,
        "gpu_energy_kwh": emissions_data.gpu_energy,
        "ram_energy_kwh": emissions_data.ram_energy,

        # Power draw
        "cpu_power_w": emissions_data.cpu_power,
        "gpu_power_w": emissions_data.gpu_power,
        "ram_power_w": emissions_data.ram_power,

        # Location and system info
        "country_name": emissions_data.country_name,
        "country_iso_code": emissions_data.country_iso_code,
        "region": emissions_data.region,
        "cloud_provider": emissions_data.cloud_provider,
        "cloud_region": emissions_data.cloud_region,
        "on_cloud": emissions_data.on_cloud,

        # System specifications
        "os": emissions_data.os,
        "python_version": emissions_data.python_version,
        "cpu_count": emissions_data.cpu_count,
        "cpu_model": emissions_data.cpu_model,
        "gpu_count": emissions_data.gpu_count,
        "gpu_model": emissions_data.gpu_model,
        "ram_total_size_gb": emissions_data.ram_total_size,

        # Additional metrics
        "pue": emissions_data.pue,
        "codecarbon_version": emissions_data.codecarbon_version,
    }

    # Print summary
    print(f"\n{'='*80}")
    print(f"\nFINE-TUNING RESULTS SUMMARY FOR {size_fraction*100}% DATASET:")
    print(f"{'='*80}")
    print(f"Training Method: Full Fine-Tuning")
    print(f"Model: BERT")

    print(f"\nModel Parameters:")
    print(f"Total Parameters: {total_params:,}")
    print(f"Trainable Parameters: {trainable_params:,}")
    print(f"Trainable Percentage: {trainable_percentage:.2f}%")


    print(f"\nPerformance Metrics:")
    print(f"F1 Score: {eval_results['eval_f1']:.4f}")
    print(f"Exact Match: {eval_results['eval_exact_match']:.4f}")
    print(f"Eval Loss: {eval_results['eval_loss']:.4f}")

    print(f"\nEnergy Consumption:")
    print(f"Total Energy: {emissions_data.energy_consumed:.6f} kWh")
    print(f"CPU Energy: {emissions_data.cpu_energy:.6f} kWh ({emissions_data.cpu_energy/emissions_data.energy_consumed*100:.1f}%)")
    print(f"GPU Energy: {emissions_data.gpu_energy:.6f} kWh ({emissions_data.gpu_energy/emissions_data.energy_consumed*100:.1f}%)")
    print(f"RAM Energy: {emissions_data.ram_energy:.6f} kWh ({emissions_data.ram_energy/emissions_data.energy_consumed*100:.1f}%)")

    print(f"\nAverage Power Draw:")
    print(f"CPU Power: {emissions_data.cpu_power:.2f} W")
    print(f"GPU Power: {emissions_data.gpu_power:.2f} W")
    print(f"RAM Power: {emissions_data.ram_power:.2f} W")
    print(f"Total Power: {emissions_data.cpu_power + emissions_data.gpu_power + emissions_data.ram_power:.2f} W")

    print(f"\nCarbon Footprint:")
    print(f"Total CO2 Emissions: {emissions_data.emissions:.6f} kg")
    print(f"Emissions Rate: {emissions_data.emissions_rate:.9f} kg/s")
    print(f"Duration: {emissions_data.duration/3600:.2f} hours")
    print(f"Training Time (Trainer): {train_results.metrics['train_runtime']/3600:.2f} hours")

    print(f"\nLocation & Infrastructure:")
    print(f"Country: {emissions_data.country_name} ({emissions_data.country_iso_code})")
    print(f"Region: {emissions_data.region}")
    print(f"On Cloud: {emissions_data.on_cloud}")
    print(f"PUE (Power Usage Effectiveness): {emissions_data.pue}")

    print(f"\nSystem Specifications:")
    print(f"OS: {emissions_data.os}")
    print(f"CPU: {emissions_data.cpu_model} ({emissions_data.cpu_count} cores)")
    if emissions_data.gpu_count and emissions_data.gpu_model:
        print(f"GPU: {emissions_data.gpu_model} (Count: {emissions_data.gpu_count})")
    else:
        print(f"GPU: None detected")
    print(f"RAM: {emissions_data.ram_total_size:.2f} GB")
    print(f"Python: {emissions_data.python_version}")

    print(f"\n{'='*80}")

    # Save model
    trainer.save_model(f"{output_dir}/final_model")

    # Clear GPU memory
    del trainer.model
    del trainer
    torch.cuda.empty_cache()

    return result_entry

In [15]:
def run_experiment(size_fraction, train_data, eval_data, tokenizer, preprocess_fn, compute_metrics_fn, model_name):

    print(f"\n{'='*60}")
    print(f"Training with {size_fraction*100}% of training data")
    print(f"{'='*60}")

    # Step 1: Prepare dataset
    tokenized_train, num_samples = prepare_dataset(train_data, size_fraction, preprocess_fn)

    # Step 2: Train model
    trainer, train_results, emissions_data, output_dir = train_model(
        tokenized_train, eval_data, tokenizer, compute_metrics_fn,
        size_fraction, model_name
    )

    # Step 3: Evaluate and save
    result_entry = evaluate_and_save(trainer, train_results, emissions_data, output_dir, size_fraction, num_samples)

    return result_entry

In [16]:
# Store results
results_summary = []

In [17]:
%%time
# Train on 25% of the training data
print("\n" + "="*80)
print("EXPERIMENT 1: FULL FINE-TUNING WITH 25.0% TRAINING DATASET")
print("="*80)
result1 = run_experiment(
        size_fraction=0.25,
        train_data=squad["train"],
        eval_data=tokenized_validation,
        tokenizer=tokenizer,
        preprocess_fn=preprocess_function,
        compute_metrics_fn=compute_metrics,
        model_name="bert-base-uncased"
    )

results_summary.append(result1)


EXPERIMENT 1: FULL FINE-TUNING WITH 25.0% TRAINING DATASET

Training with 25.0% of training data
Preprocessing 32579 training samples...


Map:   0%|          | 0/32579 [00:00<?, ? examples/s]

Some weights of BertForQuestionAnswering were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
[codecarbon INFO @ 05:00:49] [setup] RAM Tracking...
[codecarbon INFO @ 05:00:49] [setup] CPU Tracking...
 Linux OS detected: Please ensure RAPL files exist, and are readable, at /sys/class/powercap/intel-rapl/subsystem to measure CPU

[codecarbon INFO @ 05:00:51] CPU Model on constant consumption mode: Intel(R) Xeon(R) CPU @ 2.20GHz
[codecarbon INFO @ 05:00:51] [setup] GPU Tracking...
[codecarbon INFO @ 05:00:51] Tracking Nvidia GPU via pynvml
[codecarbon INFO @ 05:00:51] The below tracking methods have been set up:
                RAM Tracking Method: RAM power estimation model
                CPU Tracking Method: global constant
                GPU Tracking Method: pynvml
       

Training model...


[codecarbon INFO @ 05:01:07] Energy consumed for RAM : 0.000158 kWh. RAM Power : 38.0 W
[codecarbon INFO @ 05:01:07] Delta energy consumed for CPU with constant : 0.000177 kWh, power : 42.5 W
[codecarbon INFO @ 05:01:07] Energy consumed for All CPU : 0.000177 kWh
[codecarbon INFO @ 05:01:07] Energy consumed for all GPUs : 0.000203 kWh. Total GPU Power : 48.72035303165239 W
[codecarbon INFO @ 05:01:07] 0.000539 kWh of electricity and 0.000000 L of water were used since the beginning.
[codecarbon INFO @ 05:01:22] Energy consumed for RAM : 0.000317 kWh. RAM Power : 38.0 W
[codecarbon INFO @ 05:01:22] Delta energy consumed for CPU with constant : 0.000177 kWh, power : 42.5 W
[codecarbon INFO @ 05:01:22] Energy consumed for All CPU : 0.000354 kWh
[codecarbon INFO @ 05:01:22]



| Epoch | Training Loss | Validation Loss | Exact Match | F1 |
|---|---|---|---|---|
| 1 | 1.1457 | 1.582309 | 0.434729 | 0.517007 |
| 2 | 0.7569 | 1.732935 | 0.441569 | 0.531585 |


[codecarbon INFO @ 05:01:37] Energy consumed for RAM : 0.000475 kWh. RAM Power : 38.0 W
[codecarbon INFO @ 05:01:37] Delta energy consumed for CPU with constant : 0.000177 kWh, power : 42.5 W
[codecarbon INFO @ 05:01:37] Energy consumed for All CPU : 0.000531 kWh
[codecarbon INFO @ 05:01:37] Energy consumed for all GPUs : 0.000697 kWh. Total GPU Power : 69.9816305796751 W
[codecarbon INFO @ 05:01:37] 0.001703 kWh of electricity and 0.000000 L of water were used since the beginning.
[codecarbon INFO @ 05:01:49] Energy consumed for RAM : 0.000158 kWh. RAM Power : 38.0 W
[codecarbon INFO @ 05:01:49] Delta energy consumed for CPU with constant : 0.000177 kWh, power : 42.5 W
[codecarbon INFO @ 05:01:49] Energy consumed for All CPU : 0.000177 kWh
[codecarbon INFO @ 05:01:49] Energy consumed for all GPUs : 0.000876 kWh. Total GPU Power : 210.1097739095056 W
[codecarbon INFO @ 05:01:49] 0.001212 kWh of electricity and 0.000000 L of water were used since the beginning.
[codecarbon INFO @ 05:01:

Evaluating model...




FINE-TUNING RESULTS SUMMARY FOR 25.0% DATASET:
Training Method: Full Fine-Tuning
Model: BERT

Model Parameters:
Total Parameters: 108,893,186
Trainable Parameters: 108,893,186
Trainable Percentage: 100.00%

Performance Metrics:
F1 Score: 0.5316
Exact Match: 0.4416
Eval Loss: 1.7329

Energy Consumption:
Total Energy: 0.027816 kWh
CPU Energy: 0.004143 kWh (14.9%)
GPU Energy: 0.019970 kWh (71.8%)
RAM Energy: 0.003704 kWh (13.3%)

Average Power Draw:
CPU Power: 42.50 W
GPU Power: 203.66 W
RAM Power: 38.00 W
Total Power: 284.16 W

Carbon Footprint:
Total CO2 Emissions: 0.012590 kg
Emissions Rate: 0.000035859 kg/s
Duration: 0.10 hours
Training Time (Trainer): 0.10 hours

Location & Infrastructure:
Country: United States (USA)
Region: iowa
On Cloud: N
PUE (Power Usage Effectiveness): 1.0

System Specifications:
OS: Linux-6.6.105+-x86_64-with-glibc2.35
CPU: Intel(R) Xeon(R) CPU @ 2.20GHz (12 cores)
GPU: 1 x NVIDIA A100-SXM4-40GB (Count: 1)
RAM: 83.47 GB
Python: 3.12.12

CPU times: user 6min 
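The summary above can be cross-checked with a few lines of arithmetic. The sketch below recomputes the component shares and derives the implied carbon intensity (kg CO2 per kWh) from the reported totals for the 25% run. The dictionary is a hand-copied illustration of the printed values, not an object returned by codecarbon.

```python
# Values copied from the 25% run summary
run_25 = {
    "energy_kwh": 0.027816,
    "cpu_kwh": 0.004143,
    "gpu_kwh": 0.019970,
    "ram_kwh": 0.003704,
    "emissions_kg": 0.012590,
}

# Share of each component in the total energy
shares = {
    name: 100 * run_25[f"{name}_kwh"] / run_25["energy_kwh"]
    for name in ("cpu", "gpu", "ram")
}

# Implied grid carbon intensity used for the conversion
intensity = run_25["emissions_kg"] / run_25["energy_kwh"]

component_sum = run_25["cpu_kwh"] + run_25["gpu_kwh"] + run_25["ram_kwh"]

print({k: round(v, 1) for k, v in shares.items()})
print(f"Implied intensity: {intensity:.3f} kg CO2 per kWh")
```

The implied intensity of roughly 0.45 kg CO2 per kWh depends on the location codecarbon detected, so the same training run would be attributed different emissions on a different grid.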

In [18]:
%%time
# Train on 50% of the training data
print("\n" + "="*80)
print("EXPERIMENT 2: FULL FINE-TUNING WITH 50.0% TRAINING DATASET")
print("="*80)
result2 = run_experiment(
        size_fraction=0.5,
        train_data=squad["train"],
        eval_data=tokenized_validation,
        tokenizer=tokenizer,
        preprocess_fn=preprocess_function,
        compute_metrics_fn=compute_metrics,
        model_name="bert-base-uncased"
    )
results_summary.append(result2)


EXPERIMENT 2: FULL FINE-TUNING WITH 50.0% TRAINING DATASET

Training with 50.0% of training data
Preprocessing 65159 training samples...


Map:   0%|          | 0/65159 [00:00<?, ? examples/s]

Some weights of BertForQuestionAnswering were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
[codecarbon INFO @ 05:07:45] [setup] RAM Tracking...
[codecarbon INFO @ 05:07:45] [setup] CPU Tracking...
 Linux OS detected: Please ensure RAPL files exist, and are readable, at /sys/class/powercap/intel-rapl/subsystem to measure CPU

[codecarbon INFO @ 05:07:46] CPU Model on constant consumption mode: Intel(R) Xeon(R) CPU @ 2.20GHz
[codecarbon INFO @ 05:07:46] [setup] GPU Tracking...
[codecarbon INFO @ 05:07:46] Tracking Nvidia GPU via pynvml
[codecarbon INFO @ 05:07:46] The below tracking methods have been set up:
                RAM Tracking Method: RAM power estimation model
                CPU Tracking Method: global constant
                GPU Tracking Method: pynvml
       

Training model...


| Epoch | Training Loss | Validation Loss | Exact Match | F1 |
|---|---|---|---|---|
| 1 | 1.0868 | 1.309288 | 0.526207 | 0.613705 |
| 2 | 0.7407 | 1.281163 | 0.568568 | 0.656064 |


[codecarbon INFO @ 05:08:03] Energy consumed for RAM : 0.000158 kWh. RAM Power : 38.0 W
[codecarbon INFO @ 05:08:03] Delta energy consumed for CPU with constant : 0.000177 kWh, power : 42.5 W
[codecarbon INFO @ 05:08:03] Energy consumed for All CPU : 0.000177 kWh
[codecarbon INFO @ 05:08:03] Energy consumed for all GPUs : 0.000938 kWh. Total GPU Power : 225.05789303185085 W
[codecarbon INFO @ 05:08:03] 0.001274 kWh of electricity and 0.000000 L of water were used since the beginning.
[codecarbon INFO @ 05:08:03] Energy consumed for RAM : 0.000158 kWh. RAM Power : 38.0 W
[codecarbon INFO @ 05:08:03] Delta energy consumed for CPU with constant : 0.000177 kWh, power : 42.5 W
[codecarbon INFO @ 05:08:03] Energy consumed for All CPU : 0.000177 kWh
[codecarbon INFO @ 05:08:03] Energy consumed for all GPUs : 0.000971 kWh. Total GPU Power : 232.868103069133 W
[codecarbon INFO @ 05:08:03] 0.001306 kWh of electricity and 0.000000 L of water were used since the beginning.
[codecarbon INFO @ 05:08

Evaluating model...




FINE-TUNING RESULTS SUMMARY FOR 50.0% DATASET:
Training Method: Full Fine-Tuning
Model: BERT

Model Parameters:
Total Parameters: 108,893,186
Trainable Parameters: 108,893,186
Trainable Percentage: 100.00%

Performance Metrics:
F1 Score: 0.6561
Exact Match: 0.5686
Eval Loss: 1.2812

Energy Consumption:
Total Energy: 0.050025 kWh
CPU Energy: 0.006844 kWh (13.7%)
GPU Energy: 0.037062 kWh (74.1%)
RAM Energy: 0.006119 kWh (12.2%)

Average Power Draw:
CPU Power: 42.50 W
GPU Power: 229.61 W
RAM Power: 38.00 W
Total Power: 310.11 W

Carbon Footprint:
Total CO2 Emissions: 0.022642 kg
Emissions Rate: 0.000039037 kg/s
Duration: 0.16 hours
Training Time (Trainer): 0.16 hours

Location & Infrastructure:
Country: United States (USA)
Region: iowa
On Cloud: N
PUE (Power Usage Effectiveness): 1.0

System Specifications:
OS: Linux-6.6.105+-x86_64-with-glibc2.35
CPU: Intel(R) Xeon(R) CPU @ 2.20GHz (12 cores)
GPU: 1 x NVIDIA A100-SXM4-40GB (Count: 1)
RAM: 83.47 GB
Python: 3.12.12

CPU times: user 11min
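With two completed runs, the trade-off between accuracy and emissions can be expressed as a marginal cost. The sketch below compares the 25% and 50% runs using the F1 scores and CO2 totals printed in their summaries. The metric "grams of CO2 per additional F1 point" is an illustrative choice, not one defined elsewhere in this notebook.

```python
# Values copied from the two run summaries above
runs = {
    25: {"f1": 0.5316, "emissions_kg": 0.012590},
    50: {"f1": 0.6561, "emissions_kg": 0.022642},
}

delta_f1_points = (runs[50]["f1"] - runs[25]["f1"]) * 100          # F1 in percentage points
delta_grams = (runs[50]["emissions_kg"] - runs[25]["emissions_kg"]) * 1000

emissions_ratio = runs[50]["emissions_kg"] / runs[25]["emissions_kg"]
grams_per_f1_point = delta_grams / delta_f1_points

print(f"Emissions ratio (50% vs 25%): {emissions_ratio:.2f}x")
print(f"Extra F1: {delta_f1_points:.2f} points for {delta_grams:.2f} g CO2")
print(f"Marginal cost: {grams_per_f1_point:.2f} g CO2 per F1 point")
```

Doubling the training data raised emissions by about 1.8x rather than 2x, which is plausible because evaluation on the fixed validation set contributes a constant share to each run.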

In [19]:
%%time
# Train on 80% of the training data
print("\n" + "="*80)
print("EXPERIMENT 3: FULL FINE-TUNING WITH 80.0% TRAINING DATASET")
print("="*80)
result3 = run_experiment(
        size_fraction=0.8,
        train_data=squad["train"],
        eval_data=tokenized_validation,
        tokenizer=tokenizer,
        preprocess_fn=preprocess_function,
        compute_metrics_fn=compute_metrics,
        model_name="bert-base-uncased"
    )
results_summary.append(result3)


EXPERIMENT 3: FULL FINE-TUNING WITH 80.0% TRAINING DATASET

Training with 80.0% of training data
Preprocessing 104255 training samples...


Map:   0%|          | 0/104255 [00:00<?, ? examples/s]

Some weights of BertForQuestionAnswering were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
  trainer = Trainer(
[codecarbon INFO @ 05:18:58] [setup] RAM Tracking...
[codecarbon INFO @ 05:18:58] [setup] CPU Tracking...
 Linux OS detected: Please ensure RAPL files exist, and are readable, at /sys/class/powercap/intel-rapl/subsystem to measure CPU

[codecarbon INFO @ 05:18:59] CPU Model on constant consumption mode: Intel(R) Xeon(R) CPU @ 2.20GHz
[codecarbon INFO @ 05:18:59] [setup] GPU Tracking...
[codecarbon INFO @ 05:18:59] Tracking Nvidia GPU via pynvml
[codecarbon INFO @ 05:18:59] The below tracking methods have been set up:
                RAM Tracking Method: RAM power estimation model
                CPU Tracking Method: global constant
                GPU Tracking Method: pynvml
       

Training model...


Epoch,Training Loss,Validation Loss,Exact Match,F1
1,1.0573,1.034003,0.607302,0.688115
2,0.72,1.166671,0.608456,0.695708


[codecarbon INFO @ 05:19:15] Energy consumed for RAM : 0.000158 kWh. RAM Power : 38.0 W
[codecarbon INFO @ 05:19:15] Delta energy consumed for CPU with constant : 0.000177 kWh, power : 42.5 W
[codecarbon INFO @ 05:19:15] Energy consumed for All CPU : 0.000177 kWh
[codecarbon INFO @ 05:19:15] Energy consumed for all GPUs : 0.000938 kWh. Total GPU Power : 225.09408950382976 W
[codecarbon INFO @ 05:19:15] 0.001274 kWh of electricity and 0.000000 L of water were used since the beginning.
[codecarbon INFO @ 05:19:16] Energy consumed for RAM : 0.000158 kWh. RAM Power : 38.0 W
[codecarbon INFO @ 05:19:16] Delta energy consumed for CPU with constant : 0.000177 kWh, power : 42.5 W
[codecarbon INFO @ 05:19:16] Energy consumed for All CPU : 0.000177 kWh
[codecarbon INFO @ 05:19:16] Energy consumed for all GPUs : 0.000965 kWh. Total GPU Power : 231.49150048907336 W
[codecarbon INFO @ 05:19:16] 0.001301 kWh of electricity and 0.000000 L of water were used since the beginning.
[codecarbon INFO @ 05:

Evaluating model...




FINE-TUNING RESULTS SUMMARY FOR 80.0% DATASET:
Training Method: Full Fine-Tuning
Model: BERT

Model Parameters:
Total Parameters: 108,893,186
Trainable Parameters: 108,893,186
Trainable Percentage: 100.00%

Performance Metrics:
F1 Score: 0.6957
Exact Match: 0.6085
Eval Loss: 1.1667

Energy Consumption:
Total Energy: 0.079030 kWh
CPU Energy: 0.010649 kWh (13.5%)
GPU Energy: 0.058862 kWh (74.5%)
RAM Energy: 0.009520 kWh (12.0%)

Average Power Draw:
CPU Power: 42.50 W
GPU Power: 232.89 W
RAM Power: 38.00 W
Total Power: 313.39 W

Carbon Footprint:
Total CO2 Emissions: 0.035771 kg
Emissions Rate: 0.000039639 kg/s
Duration: 0.25 hours
Training Time (Trainer): 0.25 hours

Location & Infrastructure:
Country: United States (USA)
Region: iowa
On Cloud: N
PUE (Power Usage Effectiveness): 1.0

System Specifications:
OS: Linux-6.6.105+-x86_64-with-glibc2.35
CPU: Intel(R) Xeon(R) CPU @ 2.20GHz (12 cores)
GPU: 1 x NVIDIA A100-SXM4-40GB (Count: 1)
RAM: 83.47 GB
Python: 3.12.12

CPU times: user 17min

### STEP 4.1: Results and Analysis

In [20]:
# Create summary DataFrame
results_df = pd.DataFrame(results_summary)

print("\n" + "="*60)
print("FINAL RESULTS SUMMARY")
print("="*60)
print(results_df.to_string(index=False))


FINAL RESULTS SUMMARY
 training_method model_name  dataset_size%  train_samples  valid_samples  trainable_params  total_params  trainable_percentage  f1_score  exact_match  eval_loss  training_time_hours  emissions_rate_kg_per_s  emissions_kg           timestamp  duration_seconds  duration_hours  energy_consumed_kwh  cpu_energy_kwh  gpu_energy_kwh  ram_energy_kwh  cpu_power_w  gpu_power_w  ram_power_w  country_name country_iso_code region cloud_provider cloud_region on_cloud                                   os python_version  cpu_count                      cpu_model  gpu_count                 gpu_model  ram_total_size_gb  pue codecarbon_version
Full Fine-Tuning       BERT             25          32579          12134         108893186     108893186                 100.0  0.531585     0.441569   1.732935             0.097368                 0.000036      0.012590 2025-12-14T05:06:43        351.109398        0.097530             0.027816        0.004143        0.019970        0.003704  

In [21]:
results_df.to_csv("/content/drive/MyDrive/bert_dataset_size_results.csv", index=False)

In [22]:
# Load the dataset
full_ft_results = pd.read_csv("/content/drive/MyDrive/bert_dataset_size_results.csv")

print("Data loaded successfully!")
print(f"Total experiments: {len(full_ft_results)}")
print("\nExperiments:")
print(full_ft_results[['train_samples', 'dataset_size%', 'f1_score', 'emissions_kg']])


Data loaded successfully!
Total experiments: 3

Experiments:
   train_samples  dataset_size%  f1_score  emissions_kg
0          32579             25  0.531585      0.012590
1          65159             50  0.656064      0.022642
2         104255             80  0.695708      0.035771
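The table above suggests diminishing returns: each step up in dataset size adds emissions but buys less F1. A minimal sketch of that comparison follows, using only the three rows printed above. The helper name `marginal_gains` is hypothetical and not part of the notebook.

```python
# Values copied from the experiment table above (full fine-tuning, BERT).
runs = [
    {"size_pct": 25, "f1": 0.531585, "emissions_kg": 0.012590},
    {"size_pct": 50, "f1": 0.656064, "emissions_kg": 0.022642},
    {"size_pct": 80, "f1": 0.695708, "emissions_kg": 0.035771},
]

def marginal_gains(rows):
    """Return the F1 gained per extra gram of CO2 between consecutive runs."""
    out = []
    for prev, curr in zip(rows, rows[1:]):
        delta_f1 = curr["f1"] - prev["f1"]
        delta_g = (curr["emissions_kg"] - prev["emissions_kg"]) * 1000.0  # kg -> g
        out.append({
            "step": f'{prev["size_pct"]}% -> {curr["size_pct"]}%',
            "delta_f1": delta_f1,
            "delta_co2_g": delta_g,
            "f1_per_g": delta_f1 / delta_g,
        })
    return out

gains = marginal_gains(runs)
for g in gains:
    print(f'{g["step"]}: +{g["delta_f1"]:.4f} F1 for +{g["delta_co2_g"]:.2f} g CO2 '
          f'({g["f1_per_g"]:.4f} F1 per g)')
```

On these numbers, the 25% to 50% step yields roughly 0.0124 F1 per extra gram of CO2, while the 50% to 80% step yields roughly 0.0030. Each configuration was run once, so the ratio is indicative rather than a precise estimate.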


In [23]:
# PLOT 1: Energy Consumption vs Dataset Size (Stacked Bar)
df_sorted = full_ft_results.sort_values('train_samples')

fig = go.Figure()

fig.add_trace(go.Bar(
    name='CPU Energy',
    x=df_sorted['dataset_size%'],
    y=df_sorted['cpu_energy_kwh'],
    marker_color='#FF6B6B',
    hovertemplate='<b>CPU Energy</b><br>%{y:.6f} kWh<br>Dataset: %{x:.0f}%<extra></extra>'
))

fig.add_trace(go.Bar(
    name='GPU Energy',
    x=df_sorted['dataset_size%'],
    y=df_sorted['gpu_energy_kwh'],
    marker_color='#4ECDC4',
    hovertemplate='<b>GPU Energy</b><br>%{y:.6f} kWh<br>Dataset: %{x:.0f}%<extra></extra>'
))

fig.add_trace(go.Bar(
    name='RAM Energy',
    x=df_sorted['dataset_size%'],
    y=df_sorted['ram_energy_kwh'],
    marker_color='#95E1D3',
    hovertemplate='<b>RAM Energy</b><br>%{y:.6f} kWh<br>Dataset: %{x:.0f}%<extra></extra>'
))

fig.update_layout(
    title=dict(text="Energy Consumption Scaling with Dataset Size", font=dict(size=18)),
    xaxis_title='Dataset Size (%)',
    yaxis_title='Energy Consumption (kWh)',
    barmode='stack',
    template='plotly_white',
    height=500,
    font=dict(size=13),
    legend=dict(
        orientation="h",
        yanchor="bottom",
        y=1.02,
        xanchor="right",
        x=1
    ),
    hovermode='x unified'
)

fig.show()
fig.write_html("/content/drive/MyDrive/full_ft_energy_scaling.html")

In [24]:
# PLOT 2: Performance & Emissions Growth (Dual Y-axis)
df_sorted = full_ft_results.sort_values('train_samples')

fig = make_subplots(specs=[[{"secondary_y": True}]])

# F1 Score line
fig.add_trace(
    go.Scatter(
        x=df_sorted['dataset_size%'],
        y=df_sorted['f1_score'],
        name='F1 Score',
        mode='lines+markers',
        line=dict(color='#4ECDC4', width=3),
        marker=dict(size=12, line=dict(width=2, color='white')),
        hovertemplate='<b>F1 Score</b>: %{y:.4f}<br>Dataset: %{x:.0f}%<extra></extra>'
    ),
    secondary_y=False
)

# Exact Match line
fig.add_trace(
    go.Scatter(
        x=df_sorted['dataset_size%'],
        y=df_sorted['exact_match'],
        name='Exact Match',
        mode='lines+markers',
        line=dict(color='#95E1D3', width=3, dash='dash'),
        marker=dict(size=10),
        hovertemplate='<b>Exact Match</b>: %{y:.4f}<br>Dataset: %{x:.0f}%<extra></extra>'
    ),
    secondary_y=False
)

# CO2 Emissions bar
fig.add_trace(
    go.Bar(
        x=df_sorted['dataset_size%'],
        y=df_sorted['emissions_kg'],
        name='CO₂ Emissions',
        marker_color='#FF6B6B',
        opacity=0.6,
        hovertemplate='<b>CO₂</b>: %{y:.6f} kg<br>Dataset: %{x:.0f}%<extra></extra>'
    ),
    secondary_y=True
)

fig.update_xaxes(title_text="Dataset Size (%)")
fig.update_yaxes(title_text="Performance Score", secondary_y=False)
fig.update_yaxes(title_text="CO₂ Emissions (kg)", secondary_y=True)

fig.update_layout(
    title=dict(text="Performance vs Carbon Emissions by Dataset Size", font=dict(size=18)),
    template='plotly_white',
    height=500,
    font=dict(size=13),
    hovermode='x unified',
    legend=dict(
        orientation="h",
        yanchor="bottom",
        y=1.02,
        xanchor="right",
        x=1
    )
)

fig.show()
fig.write_html("/content/drive/MyDrive/full_ft_performance_emissions.html")

# Training Strategy 2: LoRA (Low-Rank Adaptation) Fine-Tuning (Model BERT)

In [25]:
#Import PEFT for LoRA
from peft import LoraConfig, get_peft_model, TaskType, PeftModel

## STEP 5: Creating And Training LoRA Model

In [26]:
def create_lora_model(model_name, r, lora_alpha, lora_dropout=0.1):
    """Load the base QA model and wrap its query/value projections with LoRA adapters."""
    # Load base model
    base_model = AutoModelForQuestionAnswering.from_pretrained(model_name)

    # Configure LoRA
    lora_config = LoraConfig(
        task_type=TaskType.QUESTION_ANS,  # Task type for QA
        r=r,                               # Rank of update matrices
        lora_alpha=lora_alpha,             # Scaling factor
        lora_dropout=lora_dropout,         # Dropout probability
        target_modules=["query", "value"], # Which layers to apply LoRA to
        bias="none",                       # Don't train biases
        inference_mode=False,              # Training mode
    )

    # Apply LoRA to model
    lora_model = get_peft_model(base_model, lora_config)

    # Print trainable parameters
    lora_model.print_trainable_parameters()

    return lora_model
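As a rough cross-check on what `print_trainable_parameters()` reports, the trainable count can be estimated by hand. The sketch below assumes the standard BERT-base shape (12 layers, hidden size 768), two adapted projections per layer (`query` and `value`), and that the two-label QA head stays trainable alongside the adapters. The last point is an assumption about how PEFT handles the question-answering task type, and `lora_trainable_params` is a hypothetical helper.

```python
def lora_trainable_params(rank, hidden=768, layers=12, targets_per_layer=2, num_labels=2):
    """Estimate trainable parameters for LoRA on BERT-base query/value projections."""
    # Each adapted hidden x hidden projection gains A (rank x hidden) and B (hidden x rank).
    adapter = layers * targets_per_layer * (rank * hidden + hidden * rank)
    # Assumption: the QA head (hidden x num_labels weight plus bias) is also trained.
    qa_head = hidden * num_labels + num_labels
    return adapter + qa_head

for r in (4, 8, 16):
    print(f"rank {r}: {lora_trainable_params(r):,} trainable parameters")
```

Under these assumptions the estimate is 148,994 for rank 4, 296,450 for rank 8, and 591,362 for rank 16. The adapter term grows linearly with rank, while the QA head contributes a fixed 1,538 parameters.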

In [27]:
def train_lora_model(tokenized_train, tokenized_eval, tokenizer, compute_metrics_fn, size_fraction, lora_rank):
    """Fine-tune a LoRA-wrapped BERT QA model and track its emissions with CodeCarbon."""
    # Create LoRA model
    print(f"\nCreating LoRA model (rank={lora_rank})...")
    lora_model = create_lora_model(
        model_name="bert-base-uncased",
        r=lora_rank,
        lora_alpha=lora_rank * 2,  # Common practice: alpha = 2*r
        lora_dropout=0.1
    )

    # Setup output directory
    output_dir = f"results_bert_lora_r{lora_rank}_{int(size_fraction*100)}pct"

    # Training arguments (can use higher learning rate for LoRA)
    training_args = TrainingArguments(
        output_dir=output_dir,
        eval_strategy="epoch",
        learning_rate=3e-4,
        per_device_train_batch_size=16,
        per_device_eval_batch_size=16,
        num_train_epochs=2,
        weight_decay=0.01,
        fp16=torch.cuda.is_available(),
        save_strategy="epoch",
        load_best_model_at_end=True,
        metric_for_best_model="f1",
        push_to_hub=False,
        logging_steps=100,
        greater_is_better=True
    )

    # Initialize trainer
    trainer = Trainer(
        model=lora_model,
        args=training_args,
        train_dataset=tokenized_train,
        eval_dataset=tokenized_eval,
        tokenizer=tokenizer,
        data_collator=default_data_collator,
        compute_metrics=compute_metrics_fn
    )

    # Start carbon tracking
    tracker = EmissionsTracker(
        project_name=f"BERT_LoRA_r{lora_rank}_{int(size_fraction*100)}pct",
        output_dir=output_dir,
        save_to_file=True,
        log_level="info"
    )
    tracker.start()

    # Train
    print("Training LoRA model...")
    train_results = trainer.train()

    # Stop tracking and get detailed emissions data
    emissions_kg = tracker.stop()

    # Get full emissions data object
    emissions_data = tracker.final_emissions_data

    return trainer, train_results, emissions_data, output_dir, lora_model
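Because `lora_alpha` is set to `lora_rank * 2`, the LoRA scaling factor `alpha / rank` stays at 2 for every rank tried, so changing the rank alters adapter capacity without also rescaling the update. A toy sketch of the standard LoRA forward pass, in pure Python with made-up 3-dimensional matrices, illustrates this. The names `matvec` and `lora_forward` are hypothetical, and this is not how PEFT implements the layer internally.

```python
def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def lora_forward(W, A, B, x, rank, alpha):
    """Compute y = W x + (alpha / rank) * B (A x), the standard LoRA update."""
    scale = alpha / rank
    base = matvec(W, x)
    delta = matvec(B, matvec(A, x))
    return [b + scale * d for b, d in zip(base, delta)]

# With alpha = 2 * rank, the scale is 2.0 for each rank used in the notebook.
scales = {rank: (2 * rank) / rank for rank in (4, 8, 16)}
print(scales)

# Toy example: hidden size 3, rank 2 (illustrative values only).
W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
A = [[0.5, -0.5, 0.0], [0.0, 0.5, 0.5]]          # rank x hidden
B_zero = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]    # hidden x rank, all zeros
B_eye = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]     # hidden x rank, non-zero
x = [1.0, 2.0, 3.0]

y_start = lora_forward(W, A, B_zero, x, rank=2, alpha=4)
y_after = lora_forward(W, A, B_eye, x, rank=2, alpha=4)
print(y_start, y_after)
```

A zero `B` leaves the base output unchanged, which is why a freshly wrapped model can start from the pretrained behaviour when `B` is zero-initialized.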

## STEP 6: Evaluating The LoRA Model On Different Rank Sizes

> We train the LoRA model at several ranks on the SQuAD dataset, keeping the training subset fixed at 80% of the data.
>
> LoRA Rank Variation: [4, 8, 16]

In [28]:
def evaluate_and_save_lora(trainer, train_results, emissions_data, output_dir, size_fraction, num_samples, lora_model):
    """Evaluate LoRA model and save results with detailed emissions."""
    print("Evaluating LoRA model...")
    eval_results = trainer.evaluate()

    # Count trainable parameters
    trainable_params = sum(p.numel() for p in lora_model.parameters() if p.requires_grad)
    total_params = sum(p.numel() for p in lora_model.parameters())
    trainable_percentage = 100 * trainable_params / total_params

    # Extract emissions data from EmissionsData object
    result_entry = {
        "training_method": "LoRA",
        "model_name": "BERT",
        "dataset_size%": int(size_fraction*100),
        "lora_rank": lora_model.peft_config['default'].r,
        "train_samples": num_samples,
        "valid_samples": len(tokenized_validation),
        "trainable_params": trainable_params,
        "total_params": total_params,
        "trainable_percentage": trainable_percentage,

        # Performance metrics
        "f1_score": eval_results["eval_f1"],
        "exact_match": eval_results["eval_exact_match"],
        "eval_loss": eval_results["eval_loss"],
        "training_time_hours": train_results.metrics["train_runtime"] / 3600,

        # Emissions data (direct access to EmissionsData attributes)
        "emissions_rate_kg_per_s": emissions_data.emissions_rate,
        "emissions_kg": emissions_data.emissions,
        "timestamp": emissions_data.timestamp,
        "duration_seconds": emissions_data.duration,
        "duration_hours": emissions_data.duration / 3600,

        # Energy consumption
        "energy_consumed_kwh": emissions_data.energy_consumed,
        "cpu_energy_kwh": emissions_data.cpu_energy,
        "gpu_energy_kwh": emissions_data.gpu_energy,
        "ram_energy_kwh": emissions_data.ram_energy,

        # Power draw
        "cpu_power_w": emissions_data.cpu_power,
        "gpu_power_w": emissions_data.gpu_power,
        "ram_power_w": emissions_data.ram_power,

        # Location and system info
        "country_name": emissions_data.country_name,
        "country_iso_code": emissions_data.country_iso_code,
        "region": emissions_data.region,
        "cloud_provider": emissions_data.cloud_provider,
        "cloud_region": emissions_data.cloud_region,
        "on_cloud": emissions_data.on_cloud,

        # System specifications
        "os": emissions_data.os,
        "python_version": emissions_data.python_version,
        "cpu_count": emissions_data.cpu_count,
        "cpu_model": emissions_data.cpu_model,
        "gpu_count": emissions_data.gpu_count,
        "gpu_model": emissions_data.gpu_model,
        "ram_total_size_gb": emissions_data.ram_total_size,

        # Additional metrics
        "pue": emissions_data.pue,  # Power Usage Effectiveness
        "codecarbon_version": emissions_data.codecarbon_version,
    }

    # Print detailed summary
    print(f"\n{'='*80}")
    print(f"LoRA RESULTS SUMMARY (Rank {result_entry['lora_rank']})")
    print(f"{'='*80}")

    print(f"\nModel Configuration:")
    print(f"Training Method: LoRA")
    print(f"LoRA Rank: {result_entry['lora_rank']}")
    print(f"Trainable Parameters: {trainable_params:,} ({trainable_percentage:.2f}%)")
    print(f"Total Parameters: {total_params:,}")
    print(f"Dataset Size: {size_fraction*100}%")

    print(f"\nPerformance Metrics:")
    print(f"F1 Score: {eval_results['eval_f1']:.4f}")
    print(f"Exact Match: {eval_results['eval_exact_match']:.4f}")
    print(f"Eval Loss: {eval_results['eval_loss']:.4f}")

    print(f"\nEnergy Consumption:")
    print(f"Total Energy: {emissions_data.energy_consumed:.6f} kWh")
    print(f"CPU Energy: {emissions_data.cpu_energy:.6f} kWh ({emissions_data.cpu_energy/emissions_data.energy_consumed*100:.1f}%)")
    print(f"GPU Energy: {emissions_data.gpu_energy:.6f} kWh ({emissions_data.gpu_energy/emissions_data.energy_consumed*100:.1f}%)")
    print(f"RAM Energy: {emissions_data.ram_energy:.6f} kWh ({emissions_data.ram_energy/emissions_data.energy_consumed*100:.1f}%)")

    print(f"\nAverage Power Draw:")
    print(f"CPU Power: {emissions_data.cpu_power:.2f} W")
    print(f"GPU Power: {emissions_data.gpu_power:.2f} W")
    print(f"RAM Power: {emissions_data.ram_power:.2f} W")
    print(f"Total Power: {emissions_data.cpu_power + emissions_data.gpu_power + emissions_data.ram_power:.2f} W")

    print(f"\nCarbon Footprint:")
    print(f"Total CO2 Emissions: {emissions_data.emissions:.6f} kg")
    print(f"Emissions Rate: {emissions_data.emissions_rate:.9f} kg/s")
    print(f"Duration: {emissions_data.duration/3600:.2f} hours")
    print(f"Training Time (Trainer): {train_results.metrics['train_runtime']/3600:.2f} hours")

    print(f"\nLocation & Infrastructure:")
    print(f"Country: {emissions_data.country_name} ({emissions_data.country_iso_code})")
    print(f"Region: {emissions_data.region}")
    print(f"On Cloud: {emissions_data.on_cloud}")
    print(f"PUE (Power Usage Effectiveness): {emissions_data.pue}")

    print(f"\nSystem Specifications:")
    print(f"OS: {emissions_data.os}")
    print(f"CPU: {emissions_data.cpu_model} ({emissions_data.cpu_count} cores)")
    if emissions_data.gpu_count and emissions_data.gpu_model:
        print(f"GPU: {emissions_data.gpu_model} (Count: {emissions_data.gpu_count})")
    else:
        print(f"GPU: None detected")
    print(f"RAM: {emissions_data.ram_total_size:.2f} GB")
    print(f"Python: {emissions_data.python_version}")

    print(f"\n{'='*80}")

    # Save LoRA adapters
    lora_model.save_pretrained(f"{output_dir}/lora_adapters")
    tokenizer.save_pretrained(f"{output_dir}/lora_adapters")
    print(f"LoRA adapters saved to {output_dir}/lora_adapters")

    # Clean up
    del trainer.model
    del trainer
    torch.cuda.empty_cache()

    return result_entry


In [29]:
def run_lora_experiment(size_fraction, train_data, eval_data, tokenizer, preprocess_fn, compute_metrics_fn, lora_rank):
    """Prepare the data subset, train a LoRA model at the given rank, then evaluate and save it."""
    print(f"\n{'='*60}")
    print(f"LoRA Training with {size_fraction*100}% of training data")
    print(f"{'='*60}")

    # Step 1: Prepare dataset
    tokenized_train, num_samples = prepare_dataset(train_data, size_fraction, preprocess_fn)

    # Step 2: Train LoRA model
    trainer, train_results, emissions_data, output_dir, lora_model = train_lora_model(
        tokenized_train, eval_data, tokenizer, compute_metrics_fn,
        size_fraction, lora_rank
    )

    # Step 3: Evaluate and save
    result_entry = evaluate_and_save_lora(trainer, train_results, emissions_data, output_dir, size_fraction, num_samples, lora_model)

    return result_entry
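Since the rank experiments differ only in `lora_rank`, they could also be driven from a single loop. The sketch below shows one way to do that. `sweep_ranks` is a hypothetical helper, and `fake_run` is a stand-in that performs no training so the example runs on its own; in the notebook, `run_lora_experiment` would be passed instead, with its remaining arguments as keywords.

```python
def sweep_ranks(run_fn, ranks, **common_kwargs):
    """Call run_fn once per rank and collect the returned result dicts."""
    results = []
    for rank in ranks:
        results.append(run_fn(lora_rank=rank, **common_kwargs))
    return results

# Stand-in runner for illustration only; it does not train anything.
def fake_run(size_fraction, lora_rank):
    return {"lora_rank": lora_rank, "dataset_size%": round(size_fraction * 100)}

demo = sweep_ranks(fake_run, ranks=[4, 8, 16], size_fraction=0.8)
print(demo)
```

Keeping one cell per rank, as the notebook does, has the advantage that each run's logs and `%%time` output stay separate. The loop form mainly reduces the chance of the shared arguments drifting between runs.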

In [30]:
result_lora = []

In [31]:
%%time
print("\n" + "="*80)
print("EXPERIMENT 1: LoRA with Rank 4")
print("="*80)

result_r4 = run_lora_experiment(
    size_fraction=0.8,  # 80% of training data
    train_data=squad["train"],
    eval_data=tokenized_validation,
    tokenizer=tokenizer,
    preprocess_fn=preprocess_function,
    compute_metrics_fn=compute_metrics,
    lora_rank=4
)
result_lora.append(result_r4)

Some weights of BertForQuestionAnswering were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



EXPERIMENT 1: LoRA with Rank 4

LoRA Training with 80.0% of training data
Preprocessing 104255 training samples...

Creating LoRA model (rank=4)...



`tokenizer` is deprecated and will be removed in version 5.0.0 for `Trainer.__init__`. Use `processing_class` instead.

[codecarbon INFO @ 05:34:26] [setup] RAM Tracking...
[codecarbon INFO @ 05:34:26] [setup] CPU Tracking...


trainable params: 148,994 || all params: 109,042,180 || trainable%: 0.1366


 Linux OS detected: Please ensure RAPL files exist, and are readable, at /sys/class/powercap/intel-rapl/subsystem to measure CPU

[codecarbon INFO @ 05:34:27] CPU Model on constant consumption mode: Intel(R) Xeon(R) CPU @ 2.20GHz
[codecarbon INFO @ 05:34:27] [setup] GPU Tracking...
[codecarbon INFO @ 05:34:27] Tracking Nvidia GPU via pynvml
[codecarbon INFO @ 05:34:27] The below tracking methods have been set up:
                RAM Tracking Method: RAM power estimation model
                CPU Tracking Method: global constant
                GPU Tracking Method: pynvml
            
[codecarbon INFO @ 05:34:27] >>> Tracker's metadata:
[codecarbon INFO @ 05:34:27]   Platform system: Linux-6.6.105+-x86_64-with-glibc2.35
[codecarbon INFO @ 05:34:27]   Python version: 3.12.12
[codecarbon INFO @ 05:34:27]   CodeCarbon version: 3.2.0
[codecarbon INFO @ 05:34:27]   Available RAM : 83.474 GB
[codecarbon INFO @ 05:34:27]   CPU count: 12 thread(s) in 1 physical CPU(s)
[codecarbon INFO @ 05:34:2

Training LoRA model...


Epoch,Training Loss,Validation Loss,Exact Match,F1
1,1.5526,1.462346,0.408274,0.481865
2,1.4355,1.368017,0.454096,0.529031


[codecarbon INFO @ 05:34:44] Energy consumed for RAM : 0.000158 kWh. RAM Power : 38.0 W
[codecarbon INFO @ 05:34:44] Delta energy consumed for CPU with constant : 0.000177 kWh, power : 42.5 W
[codecarbon INFO @ 05:34:44] Energy consumed for All CPU : 0.000177 kWh
[codecarbon INFO @ 05:34:44] Energy consumed for all GPUs : 0.000802 kWh. Total GPU Power : 192.2639999279996 W
[codecarbon INFO @ 05:34:44] 0.001138 kWh of electricity and 0.000000 L of water were used since the beginning.
[codecarbon INFO @ 05:34:44] Energy consumed for RAM : 0.000158 kWh. RAM Power : 38.0 W
[codecarbon INFO @ 05:34:44] Delta energy consumed for CPU with constant : 0.000177 kWh, power : 42.5 W
[codecarbon INFO @ 05:34:44] Energy consumed for All CPU : 0.000177 kWh
[codecarbon INFO @ 05:34:44] Energy consumed for all GPUs : 0.000820 kWh. Total GPU Power : 196.65651341030562 W
[codecarbon INFO @ 05:34:44] 0.001155 kWh of electricity and 0.000000 L of water were used since the beginning.
[codecarbon INFO @ 05:3

Evaluating LoRA model...



LoRA RESULTS SUMMARY (Rank 4)

Model Configuration:
Training Method: LoRA
LoRA Rank: 4
Trainable Parameters: 148,994 (0.14%)
Total Parameters: 109,042,180
Dataset Size: 80.0%

Performance Metrics:
F1 Score: 0.5290
Exact Match: 0.4541
Eval Loss: 1.3680

Energy Consumption:
Total Energy: 0.071645 kWh
CPU Energy: 0.010876 kWh (15.2%)
GPU Energy: 0.051045 kWh (71.2%)
RAM Energy: 0.009723 kWh (13.6%)

Average Power Draw:
CPU Power: 42.50 W
GPU Power: 199.22 W
RAM Power: 38.00 W
Total Power: 279.72 W

Carbon Footprint:
Total CO2 Emissions: 0.032428 kg
Emissions Rate: 0.000035181 kg/s
Duration: 0.26 hours
Training Time (Trainer): 0.26 hours

Location & Infrastructure:
Country: United States (USA)
Region: iowa
On Cloud: N
PUE (Power Usage Effectiveness): 1.0

System Specifications:
OS: Linux-6.6.105+-x86_64-with-glibc2.35
CPU: Intel(R) Xeon(R) CPU @ 2.20GHz (12 cores)
GPU: 1 x NVIDIA A100-SXM4-40GB (Count: 1)
RAM: 83.47 GB
Python: 3.12.12

LoRA adapters saved to results_bert_lora_r4_80pct/lor

In [32]:
%%time
print("\n" + "="*80)
print("EXPERIMENT 2: LoRA with Rank 8")
print("="*80)

result_r8 = run_lora_experiment(
    size_fraction=0.8,
    train_data=squad["train"],
    eval_data=tokenized_validation,
    tokenizer=tokenizer,
    preprocess_fn=preprocess_function,
    compute_metrics_fn=compute_metrics,
    lora_rank=8
)
result_lora.append(result_r8)


EXPERIMENT 2: LoRA with Rank 8

LoRA Training with 80.0% of training data
Preprocessing 104255 training samples...

Creating LoRA model (rank=8)...


Some weights of BertForQuestionAnswering were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


trainable params: 296,450 || all params: 109,189,636 || trainable%: 0.2715



`tokenizer` is deprecated and will be removed in version 5.0.0 for `Trainer.__init__`. Use `processing_class` instead.

[codecarbon INFO @ 05:50:15] [setup] RAM Tracking...
[codecarbon INFO @ 05:50:15] [setup] CPU Tracking...
 Linux OS detected: Please ensure RAPL files exist, and are readable, at /sys/class/powercap/intel-rapl/subsystem to measure CPU

[codecarbon INFO @ 05:50:16] CPU Model on constant consumption mode: Intel(R) Xeon(R) CPU @ 2.20GHz
[codecarbon INFO @ 05:50:16] [setup] GPU Tracking...
[codecarbon INFO @ 05:50:16] Tracking Nvidia GPU via pynvml
[codecarbon INFO @ 05:50:16] The below tracking methods have been set up:
                RAM Tracking Method: RAM power estimation model
                CPU Tracking Method: global constant
                GPU Tracking Method: pynvml
            
[codecarbon INFO @ 05:50:16] >>> Tracker's metadata:
[codecarbon INFO @ 05:50:16]   Platform system: Linux-6.6.105+-x86_64-with-glibc2.35
[codecarbon INFO @ 05:50:16]   Python versio

Training LoRA model...


Epoch,Training Loss,Validation Loss,Exact Match,F1
1,1.4558,1.378176,0.448739,0.524788
2,1.3147,1.31834,0.482281,0.562268


[codecarbon INFO @ 05:50:33] Energy consumed for RAM : 0.000158 kWh. RAM Power : 38.0 W
[codecarbon INFO @ 05:50:33] Delta energy consumed for CPU with constant : 0.000177 kWh, power : 42.5 W
[codecarbon INFO @ 05:50:33] Energy consumed for All CPU : 0.000177 kWh
[codecarbon INFO @ 05:50:33] Energy consumed for all GPUs : 0.000800 kWh. Total GPU Power : 191.82441569306033 W
[codecarbon INFO @ 05:50:33] 0.001135 kWh of electricity and 0.000000 L of water were used since the beginning.
[codecarbon INFO @ 05:50:33] Energy consumed for RAM : 0.000158 kWh. RAM Power : 38.0 W
[codecarbon INFO @ 05:50:33] Delta energy consumed for CPU with constant : 0.000177 kWh, power : 42.5 W
[codecarbon INFO @ 05:50:33] Energy consumed for All CPU : 0.000177 kWh
[codecarbon INFO @ 05:50:33] Energy consumed for all GPUs : 0.000824 kWh. Total GPU Power : 197.67500880428582 W
[codecarbon INFO @ 05:50:33] 0.001160 kWh of electricity and 0.000000 L of water were used since the beginning.
[codecarbon INFO @ 05:

Evaluating LoRA model...



LoRA RESULTS SUMMARY (Rank 8)

Model Configuration:
Training Method: LoRA
LoRA Rank: 8
Trainable Parameters: 296,450 (0.27%)
Total Parameters: 109,189,636
Dataset Size: 80.0%

Performance Metrics:
F1 Score: 0.5623
Exact Match: 0.4823
Eval Loss: 1.3183

Energy Consumption:
Total Energy: 0.071025 kWh
CPU Energy: 0.010826 kWh (15.2%)
GPU Energy: 0.050520 kWh (71.1%)
RAM Energy: 0.009679 kWh (13.6%)

Average Power Draw:
CPU Power: 42.50 W
GPU Power: 197.93 W
RAM Power: 38.00 W
Total Power: 278.43 W

Carbon Footprint:
Total CO2 Emissions: 0.032148 kg
Emissions Rate: 0.000035038 kg/s
Duration: 0.25 hours
Training Time (Trainer): 0.25 hours

Location & Infrastructure:
Country: United States (USA)
Region: iowa
On Cloud: N
PUE (Power Usage Effectiveness): 1.0

System Specifications:
OS: Linux-6.6.105+-x86_64-with-glibc2.35
CPU: Intel(R) Xeon(R) CPU @ 2.20GHz (12 cores)
GPU: 1 x NVIDIA A100-SXM4-40GB (Count: 1)
RAM: 83.47 GB
Python: 3.12.12

LoRA adapters saved to results_bert_lora_r8_80pct/lor

In [33]:
%%time
print("\n" + "="*80)
print("EXPERIMENT 3: LoRA with Rank 16")
print("="*80)

result_r16 = run_lora_experiment(
    size_fraction=0.8,
    train_data=squad["train"],
    eval_data=tokenized_validation,
    tokenizer=tokenizer,
    preprocess_fn=preprocess_function,
    compute_metrics_fn=compute_metrics,
    lora_rank=16
)
result_lora.append(result_r16)


EXPERIMENT 3: LoRA with Rank 16

LoRA Training with 80.0% of training data
Preprocessing 104255 training samples...

Creating LoRA model (rank=16)...


Some weights of BertForQuestionAnswering were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


trainable params: 591,362 || all params: 109,484,548 || trainable%: 0.5401



`tokenizer` is deprecated and will be removed in version 5.0.0 for `Trainer.__init__`. Use `processing_class` instead.

[codecarbon INFO @ 06:06:00] [setup] RAM Tracking...
[codecarbon INFO @ 06:06:00] [setup] CPU Tracking...
 Linux OS detected: Please ensure RAPL files exist, and are readable, at /sys/class/powercap/intel-rapl/subsystem to measure CPU

[codecarbon INFO @ 06:06:01] CPU Model on constant consumption mode: Intel(R) Xeon(R) CPU @ 2.20GHz
[codecarbon INFO @ 06:06:01] [setup] GPU Tracking...
[codecarbon INFO @ 06:06:01] Tracking Nvidia GPU via pynvml
[codecarbon INFO @ 06:06:01] The below tracking methods have been set up:
                RAM Tracking Method: RAM power estimation model
                CPU Tracking Method: global constant
                GPU Tracking Method: pynvml
            
[codecarbon INFO @ 06:06:01] >>> Tracker's metadata:
[codecarbon INFO @ 06:06:01]   Platform system: Linux-6.6.105+-x86_64-with-glibc2.35
[codecarbon INFO @ 06:06:01]   Python versio

Training LoRA model...


Epoch,Training Loss,Validation Loss,Exact Match,F1
1,1.3759,1.28299,0.501896,0.573835
2,1.2397,1.25573,0.520768,0.600994


[codecarbon INFO @ 06:06:17] Energy consumed for RAM : 0.000158 kWh. RAM Power : 38.0 W
[codecarbon INFO @ 06:06:17] Delta energy consumed for CPU with constant : 0.000177 kWh, power : 42.5 W
[codecarbon INFO @ 06:06:17] Energy consumed for All CPU : 0.000177 kWh
[codecarbon INFO @ 06:06:17] Energy consumed for all GPUs : 0.000799 kWh. Total GPU Power : 191.66001638999498 W
[codecarbon INFO @ 06:06:17] 0.001134 kWh of electricity and 0.000000 L of water were used since the beginning.
[codecarbon INFO @ 06:06:18] Energy consumed for RAM : 0.000158 kWh. RAM Power : 38.0 W
[codecarbon INFO @ 06:06:18] Delta energy consumed for CPU with constant : 0.000177 kWh, power : 42.5 W
[codecarbon INFO @ 06:06:18] Energy consumed for All CPU : 0.000177 kWh
[codecarbon INFO @ 06:06:18] Energy consumed for all GPUs : 0.000823 kWh. Total GPU Power : 197.28524706126086 W
[codecarbon INFO @ 06:06:18] 0.001159 kWh of electricity and 0.000000 L of water were used since the beginning.
[codecarbon INFO @ 06:

Evaluating LoRA model...



LoRA RESULTS SUMMARY (Rank 16)

Model Configuration:
Training Method: LoRA
LoRA Rank: 16
Trainable Parameters: 591,362 (0.54%)
Total Parameters: 109,484,548
Dataset Size: 80.0%

Performance Metrics:
F1 Score: 0.6010
Exact Match: 0.5208
Eval Loss: 1.2557

Energy Consumption:
Total Energy: 0.071284 kWh
CPU Energy: 0.010869 kWh (15.2%)
GPU Energy: 0.050698 kWh (71.1%)
RAM Energy: 0.009717 kWh (13.6%)

Average Power Draw:
CPU Power: 42.50 W
GPU Power: 197.99 W
RAM Power: 38.00 W
Total Power: 278.49 W

Carbon Footprint:
Total CO2 Emissions: 0.032265 kg
Emissions Rate: 0.000035028 kg/s
Duration: 0.26 hours
Training Time (Trainer): 0.26 hours

Location & Infrastructure:
Country: United States (USA)
Region: iowa
On Cloud: N
PUE (Power Usage Effectiveness): 1.0

System Specifications:
OS: Linux-6.6.105+-x86_64-with-glibc2.35
CPU: Intel(R) Xeon(R) CPU @ 2.20GHz (12 cores)
GPU: 1 x NVIDIA A100-SXM4-40GB (Count: 1)
RAM: 83.47 GB
Python: 3.12.12

LoRA adapters saved to results_bert_lora_r16_80pct/

### STEP 6.1: Results and Analysis

In [34]:
results_df_lora = pd.DataFrame(result_lora)
print("\n" + "="*60)
print("LoRA RESULTS SUMMARY")
print("="*60)
print(results_df_lora.to_string(index=False))

# Save to CSV
results_df_lora.to_csv("/content/drive/MyDrive/bert_lora_results.csv", index=False)


LoRA RESULTS SUMMARY
training_method model_name  dataset_size%  lora_rank  train_samples  valid_samples  trainable_params  total_params  trainable_percentage  f1_score  exact_match  eval_loss  training_time_hours  emissions_rate_kg_per_s  emissions_kg           timestamp  duration_seconds  duration_hours  energy_consumed_kwh  cpu_energy_kwh  gpu_energy_kwh  ram_energy_kwh  cpu_power_w  gpu_power_w  ram_power_w  country_name country_iso_code region cloud_provider cloud_region on_cloud                                   os python_version  cpu_count                      cpu_model  gpu_count                 gpu_model  ram_total_size_gb  pue codecarbon_version
           LoRA       BERT             80          4         104255          12134            148994     109042180              0.136639  0.529031     0.454096   1.368017             0.255871                 0.000035      0.032428 2025-12-14T05:49:50        921.733660        0.256037             0.071645        0.010876        0.05104
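The reported figures can be sanity-checked with simple arithmetic. The CodeCarbon log shows CPU and RAM tracked at constant power (42.5 W and 38.0 W), so their energy should be close to power times duration. Dividing total emissions by total energy gives the carbon intensity implied by the run. The sketch below uses only numbers printed for the rank-4 run; the helper `kwh` is hypothetical.

```python
# Figures copied from the rank-4 LoRA run reported above.
duration_s = 921.733660
cpu_power_w, ram_power_w = 42.5, 38.0
reported = {
    "cpu_kwh": 0.010876,
    "ram_kwh": 0.009723,
    "total_kwh": 0.071645,
    "emissions_kg": 0.032428,
}

def kwh(power_w, seconds):
    """Energy in kWh for a constant power draw: watts * seconds / 3.6e6."""
    return power_w * seconds / 3.6e6

cpu_est = kwh(cpu_power_w, duration_s)
ram_est = kwh(ram_power_w, duration_s)
# Implied grid carbon intensity in kg CO2 per kWh.
intensity = reported["emissions_kg"] / reported["total_kwh"]

print(f"CPU estimate: {cpu_est:.6f} kWh (reported {reported['cpu_kwh']:.6f})")
print(f"RAM estimate: {ram_est:.6f} kWh (reported {reported['ram_kwh']:.6f})")
print(f"Implied carbon intensity: {intensity:.4f} kg CO2/kWh")
```

The CPU and RAM estimates land within about 0.1% of the reported values, and the implied intensity is roughly 0.453 kg CO2/kWh. Because CPU and RAM energy here are modelled from constants rather than measured, differences between runs mostly reflect GPU energy and wall-clock duration.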

In [35]:
print("\n" + "="*80)
print("LoRA RANK COMPARISON")
print("="*80)
print(results_df_lora[['lora_rank', 'trainable_params', 'trainable_percentage', 'f1_score', 'exact_match', 'emissions_kg', 'training_time_hours']].to_string(index=False))


LoRA RANK COMPARISON
 lora_rank  trainable_params  trainable_percentage  f1_score  exact_match  emissions_kg  training_time_hours
         4            148994              0.136639  0.529031     0.454096      0.032428             0.255871
         8            296450              0.271500  0.562268     0.482281      0.032148             0.254697
        16            591362              0.540133  0.600994     0.520768      0.032265             0.255698
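
The trainable-parameter counts above grow almost linearly with rank. The LoRA configuration is not shown in this section, so the sketch below is an inference from the numbers, not a confirmed setting. It assumes adapters on 24 linear projections of width 768 (for instance, query and value in each of BERT-base's 12 layers) plus a trainable 1,538-parameter QA head. `ADAPTED_MODULES` and `lora_trainable_params` are hypothetical names.

```python
# Hypothetical reconstruction of the trainable-parameter counts in the table.
HIDDEN_SIZE = 768          # hidden width of bert-base-uncased
ADAPTED_MODULES = 24       # ASSUMPTION: e.g. query + value in each of 12 layers
QA_HEAD_PARAMS = HIDDEN_SIZE * 2 + 2  # weights + biases for start/end logits


def lora_trainable_params(rank: int) -> int:
    """Each adapted square projection adds A (rank x hidden) and B (hidden x rank)."""
    per_module = 2 * HIDDEN_SIZE * rank
    return ADAPTED_MODULES * per_module + QA_HEAD_PARAMS


# Reported values are copied from the LoRA RANK COMPARISON table.
for rank, reported in [(4, 148_994), (8, 296_450), (16, 591_362)]:
    print(f"rank={rank:>2}  predicted={lora_trainable_params(rank):,}  reported={reported:,}")
```

Under these assumptions, doubling the rank doubles the adapter parameters while the QA head stays fixed, which would explain why the ratio against rank 8 is 1.99x, not exactly 2x.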


In [36]:
# Compare efficiency vs performance
print("\n" + "="*80)
print("EFFICIENCY ANALYSIS")
print("="*80)

baseline = results_df_lora[results_df_lora['lora_rank'] == 8].iloc[0]  # Use rank 8 as baseline

for _, row in results_df_lora.iterrows():
    rank = row['lora_rank']
    params_ratio = row['trainable_params'] / baseline['trainable_params']
    f1_diff = row['f1_score'] - baseline['f1_score']
    emissions_diff = row['emissions_kg'] - baseline['emissions_kg']

    print(f"\nLoRA Rank {rank}:")
    print(f"Trainable Params: {row['trainable_params']:,} ({row['trainable_percentage']:.2f}%)")
    print(f"vs Rank 8: {params_ratio:.2f}x parameters")
    print(f"F1 Score: {row['f1_score']:.4f} ({f1_diff:+.4f} vs Rank 8)")
    print(f"Emissions: {row['emissions_kg']:.6f} kg ({emissions_diff:+.6f} vs Rank 8)")
    print(f"Training Time: {row['training_time_hours']:.2f} hours")

    # Efficiency metric: F1 per kg CO2
    efficiency = row['f1_score'] / row['emissions_kg']
    print(f"Efficiency (F1/kg CO2): {efficiency:.2f}")


EFFICIENCY ANALYSIS

LoRA Rank 4:
Trainable Params: 148,994 (0.14%)
vs Rank 8: 0.50x parameters
F1 Score: 0.5290 (-0.0332 vs Rank 8)
Emissions: 0.032428 kg (+0.000280 vs Rank 8)
Training Time: 0.26 hours
Efficiency (F1/kg CO2): 16.31

LoRA Rank 8:
Trainable Params: 296,450 (0.27%)
vs Rank 8: 1.00x parameters
F1 Score: 0.5623 (+0.0000 vs Rank 8)
Emissions: 0.032148 kg (+0.000000 vs Rank 8)
Training Time: 0.25 hours
Efficiency (F1/kg CO2): 17.49

LoRA Rank 16:
Trainable Params: 591,362 (0.54%)
vs Rank 8: 1.99x parameters
F1 Score: 0.6010 (+0.0387 vs Rank 8)
Emissions: 0.032265 kg (+0.000117 vs Rank 8)
Training Time: 0.26 hours
Efficiency (F1/kg CO2): 18.63
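
As a cross-check on the efficiency figures above, the following sketch recomputes F1 per kg CO2 from the values in the rank comparison table. It also compares how much emissions vary across ranks with how much F1 varies. The dictionary layout and helper name are illustrative and not part of the notebook's own code.

```python
# Values copied from the LoRA RANK COMPARISON table.
runs = {
    4: {"f1": 0.529031, "emissions_kg": 0.032428},
    8: {"f1": 0.562268, "emissions_kg": 0.032148},
    16: {"f1": 0.600994, "emissions_kg": 0.032265},
}


def f1_per_kg(run: dict) -> float:
    """Efficiency metric used in the analysis cell: F1 divided by kg of CO2."""
    return run["f1"] / run["emissions_kg"]


efficiency = {rank: f1_per_kg(run) for rank, run in runs.items()}

# Relative spread of emissions across ranks, in percent of the smallest value.
emissions = [run["emissions_kg"] for run in runs.values()]
emissions_spread_pct = 100 * (max(emissions) - min(emissions)) / min(emissions)

# Absolute F1 gain from the smallest to the largest rank.
f1_gain = runs[16]["f1"] - runs[4]["f1"]

for rank, value in efficiency.items():
    print(f"rank={rank:>2}  F1/kg CO2 = {value:.2f}")
print(f"emissions spread: {emissions_spread_pct:.2f}%  |  F1 gain (r=4 to r=16): {f1_gain:+.4f}")
```

Because emissions differ by less than 1% across ranks in these runs, the efficiency ranking here follows the F1 ranking almost exactly.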


In [37]:
# PLOT 1: LoRA Energy Consumption by Rank
df_sorted_lora = results_df_lora.sort_values('lora_rank')

fig_lora_energy = go.Figure()

fig_lora_energy.add_trace(go.Bar(
    name='CPU Energy',
    x=df_sorted_lora['lora_rank'],
    y=df_sorted_lora['cpu_energy_kwh'],
    marker_color='#FF6B6B',
    hovertemplate='<b>CPU Energy</b><br>%{y:.6f} kWh<br>Rank: %{x}<extra></extra>'
))

fig_lora_energy.add_trace(go.Bar(
    name='GPU Energy',
    x=df_sorted_lora['lora_rank'],
    y=df_sorted_lora['gpu_energy_kwh'],
    marker_color='#4ECDC4',
    hovertemplate='<b>GPU Energy</b><br>%{y:.6f} kWh<br>Rank: %{x}<extra></extra>'
))

fig_lora_energy.add_trace(go.Bar(
    name='RAM Energy',
    x=df_sorted_lora['lora_rank'],
    y=df_sorted_lora['ram_energy_kwh'],
    marker_color='#95E1D3',
    hovertemplate='<b>RAM Energy</b><br>%{y:.6f} kWh<br>Rank: %{x}<extra></extra>'
))

fig_lora_energy.update_layout(
    title=dict(text="LoRA: Energy Consumption by Rank", font=dict(size=18)),
    xaxis_title='LoRA Rank',
    yaxis_title='Energy Consumption (kWh)',
    barmode='stack',
    template='plotly_white',
    height=500,
    font=dict(size=13),
    legend=dict(
        orientation="h",
        yanchor="bottom",
        y=1.02,
        xanchor="right",
        x=1
    ),
    hovermode='x unified'
)

fig_lora_energy.show()
fig_lora_energy.write_html("/content/drive/MyDrive/lora_energy_by_rank.html")

In [38]:
# PLOT 2: LoRA Performance & Emissions by Rank (Dual Y-axis)
df_sorted_lora = results_df_lora.sort_values('lora_rank')

fig_lora_perf = make_subplots(specs=[[{"secondary_y": True}]])

# F1 Score line
fig_lora_perf.add_trace(
    go.Scatter(
        x=df_sorted_lora['lora_rank'],
        y=df_sorted_lora['f1_score'],
        name='F1 Score',
        mode='lines+markers',
        line=dict(color='#4ECDC4', width=3),
        marker=dict(size=12, line=dict(width=2, color='white')),
        hovertemplate='<b>F1 Score</b>: %{y:.4f}<br>Rank: %{x}<extra></extra>'
    ),
    secondary_y=False
)

# Exact Match line
fig_lora_perf.add_trace(
    go.Scatter(
        x=df_sorted_lora['lora_rank'],
        y=df_sorted_lora['exact_match'],
        name='Exact Match',
        mode='lines+markers',
        line=dict(color='#95E1D3', width=3, dash='dash'),
        marker=dict(size=10),
        hovertemplate='<b>Exact Match</b>: %{y:.4f}<br>Rank: %{x}<extra></extra>'
    ),
    secondary_y=False
)

# CO2 Emissions bar
fig_lora_perf.add_trace(
    go.Bar(
        x=df_sorted_lora['lora_rank'],
        y=df_sorted_lora['emissions_kg'],
        name='CO₂ Emissions',
        marker_color='#FF6B6B',
        opacity=0.6,
        hovertemplate='<b>CO₂</b>: %{y:.6f} kg<br>Rank: %{x}<extra></extra>'
    ),
    secondary_y=True
)

fig_lora_perf.update_xaxes(title_text="LoRA Rank")
fig_lora_perf.update_yaxes(title_text="Performance Score", secondary_y=False)
fig_lora_perf.update_yaxes(title_text="CO₂ Emissions (kg)", secondary_y=True)

fig_lora_perf.update_layout(
    title=dict(text="LoRA: Performance vs Carbon Emissions by Rank", font=dict(size=18)),
    template='plotly_white',
    height=500,
    font=dict(size=13),
    hovermode='x unified',
    legend=dict(
        orientation="h",
        yanchor="bottom",
        y=1.02,
        xanchor="right",
        x=1
    )
)

fig_lora_perf.show()
fig_lora_perf.write_html("/content/drive/MyDrive/lora_performance_emissions.html")

# Training Strategy 3: Few-shot Learning With Frozen Backbone

## STEP 7: Creating And Training Few-shot Model

In [39]:
def create_frozen_model(model_name="bert-base-uncased"):
    """Create a model with a frozen backbone (only the QA head is trainable)."""
    # Load base model
    model = AutoModelForQuestionAnswering.from_pretrained(model_name)

    # Freeze ALL parameters first
    for param in model.parameters():
        param.requires_grad = False

    # Unfreeze ONLY the QA head (classifier layer)
    # For BERT: qa_outputs layer
    for param in model.qa_outputs.parameters():
        param.requires_grad = True

    # Count parameters
    trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total_params = sum(p.numel() for p in model.parameters())

    print(f"\nModel Configuration:")
    print(f"Total Parameters: {total_params:,}")
    print(f"Trainable Parameters: {trainable_params:,}")
    print(f"Frozen Parameters: {total_params - trainable_params:,}")
    print(f"Trainable Percentage: {100 * trainable_params / total_params:.4f}%")

    return model, trainable_params, total_params
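
The size of the trainable QA head can be checked by hand. The sketch below assumes that `qa_outputs` is a single linear layer mapping BERT-base's 768-dimensional hidden state to two logits (span start and span end); the total parameter count is taken from the run output in this notebook.

```python
HIDDEN_SIZE = 768            # hidden width of bert-base-uncased
NUM_OUTPUTS = 2              # one start logit and one end logit per token
TOTAL_PARAMS = 108_893_186   # total reported in the "Model Configuration" output

# A linear layer has (inputs x outputs) weights plus one bias per output.
qa_head_params = HIDDEN_SIZE * NUM_OUTPUTS + NUM_OUTPUTS
trainable_pct = 100 * qa_head_params / TOTAL_PARAMS

print(f"{qa_head_params:,} trainable parameters ({trainable_pct:.4f}% of the model)")
```

This matches the 1,538 trainable parameters reported when the model is created.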


In [40]:
def prepare_fewshot_dataset(train_data, num_shots, preprocess_fn):
    # Select only num_shots examples
    train_subset = train_data.select(range(num_shots))

    print(f"Creating few-shot dataset with {num_shots} examples...")
    tokenized_train = train_subset.map(
        preprocess_fn,
        batched=True,
        remove_columns=train_subset.column_names
    )

    # Sliding-window tokenization can yield more features than input examples
    actual_samples = len(tokenized_train)
    print(f"Original examples: {num_shots}")
    print(f"After tokenization (with sliding window): {actual_samples} samples")

    return tokenized_train, num_shots  # Return original num_shots for tracking
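
The number of tokenized samples can exceed `num_shots` because long contexts are split into overlapping windows. `preprocess_function` is not shown in this section, so the sketch below is a simplified model of that behaviour and not the notebook's implementation. It ignores question and special tokens, and the window size and overlap in the usage lines are hypothetical values.

```python
import math


def count_windows(context_tokens: int, window_size: int, overlap: int) -> int:
    """Number of overlapping windows needed to cover a context.

    Simplified model: each window holds `window_size` context tokens and
    consecutive windows share `overlap` tokens.
    """
    if overlap >= window_size:
        raise ValueError("overlap must be smaller than window_size")
    if context_tokens <= window_size:
        return 1
    step = window_size - overlap
    return 1 + math.ceil((context_tokens - window_size) / step)


# Hypothetical settings: 300 context tokens per window, 100 tokens of overlap.
for length in (100, 300, 500, 501):
    print(length, "tokens ->", count_windows(length, window_size=300, overlap=100), "window(s)")
```

Under this model, only contexts longer than one window add extra samples, so a subset made of short passages keeps a one-to-one mapping between examples and features.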


In [41]:
def train_fewshot_model(tokenized_train, tokenized_eval, tokenizer, compute_metrics_fn, num_shots, model_name="bert-base-uncased"):
    # Create frozen model
    model, trainable_params, total_params = create_frozen_model(model_name)

    # Setup output directory
    output_dir = f"results_bert_fewshot_{num_shots}shots"

    # Training arguments - DIFFERENT from full fine-tuning
    training_args = TrainingArguments(
        output_dir=output_dir,
        eval_strategy="epoch",
        learning_rate=5e-4,  # Higher LR since we're only training the head
        per_device_train_batch_size=16,
        per_device_eval_batch_size=16,
        num_train_epochs=10,  # More epochs for few-shot
        fp16=torch.cuda.is_available(),
        save_strategy="epoch",
        load_best_model_at_end=True,
        metric_for_best_model="f1",
        push_to_hub=False,
        logging_steps=50,
        greater_is_better=True
    )

    # Initialize trainer
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=tokenized_train,
        eval_dataset=tokenized_eval,
        tokenizer=tokenizer,
        data_collator=default_data_collator,
        compute_metrics=compute_metrics_fn
    )

    # Start carbon tracking
    tracker = EmissionsTracker(
        project_name=f"BERT_FewShot_{num_shots}shots",
        output_dir=output_dir,
        save_to_file=True,
        log_level="info"
    )
    tracker.start()

    # Train
    print(f"\nTraining few-shot model ({num_shots} examples)...")
    train_results = trainer.train()

    # Stop tracking and get detailed emissions data
    emissions_kg = tracker.stop()
    emissions_data = tracker.final_emissions_data

    return trainer, train_results, emissions_data, output_dir, model, trainable_params, total_params
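
With `logging_steps=50` and a batch size of 16, small training sets take several epochs to reach the first logged training loss, and the training table shows "No log" until then. The sketch below estimates the first epoch with a logged loss. It assumes a single device, no gradient accumulation, and that the last partial batch is kept. The feature counts (100, 527, 1027) are the post-tokenization sizes reported by the runs in this notebook; `first_logged_epoch` is a hypothetical helper.

```python
import math
from typing import Optional

BATCH_SIZE = 16      # per_device_train_batch_size
LOGGING_STEPS = 50   # logging_steps
NUM_EPOCHS = 10      # num_train_epochs


def first_logged_epoch(num_features: int) -> Optional[int]:
    """Epoch in which optimizer step number LOGGING_STEPS occurs, or None if never reached."""
    steps_per_epoch = math.ceil(num_features / BATCH_SIZE)
    if steps_per_epoch * NUM_EPOCHS < LOGGING_STEPS:
        return None
    return math.ceil(LOGGING_STEPS / steps_per_epoch)


for num_features in (100, 527, 1027):
    print(f"{num_features:>5} features -> first logged loss in epoch {first_logged_epoch(num_features)}")
```

A smaller `logging_steps` value would make the training loss visible from the first epoch in the 100-example run.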


## STEP 8: Evaluating The Few-shot Model On Different Sample Sizes

> We train the model on several sample sizes drawn from the SQuAD dataset.
>
> Few-shot training variations: [100, 500, 1000]

In [42]:
def evaluate_and_save_fewshot(trainer, train_results, emissions_data, output_dir, num_shots, trainable_params, total_params):
    print("Evaluating few-shot model...")
    eval_results = trainer.evaluate()

    trainable_percentage = 100 * trainable_params / total_params

    # Compile results
    result_entry = {
        "training_method": "Few-Shot (Frozen Backbone)",
        "model_name": "BERT",
        "num_shots": num_shots,
        "train_samples": num_shots,
        # Read the size from the trainer instead of relying on a notebook global
        "valid_samples": len(trainer.eval_dataset),
        "trainable_params": trainable_params,
        "total_params": total_params,
        "trainable_percentage": trainable_percentage,

        # Performance
        "f1_score": eval_results["eval_f1"],
        "exact_match": eval_results["eval_exact_match"],
        "eval_loss": eval_results["eval_loss"],
        "training_time_hours": train_results.metrics["train_runtime"] / 3600,

        # Emissions data
        "emissions_rate_kg_per_s": emissions_data.emissions_rate,
        "emissions_kg": emissions_data.emissions,
        "timestamp": emissions_data.timestamp,
        "duration_seconds": emissions_data.duration,
        "duration_hours": emissions_data.duration / 3600,

        # Energy
        "energy_consumed_kwh": emissions_data.energy_consumed,
        "cpu_energy_kwh": emissions_data.cpu_energy,
        "gpu_energy_kwh": emissions_data.gpu_energy,
        "ram_energy_kwh": emissions_data.ram_energy,

        # Power
        "cpu_power_w": emissions_data.cpu_power,
        "gpu_power_w": emissions_data.gpu_power,
        "ram_power_w": emissions_data.ram_power,

        # Location and system info
        "country_name": emissions_data.country_name,
        "country_iso_code": emissions_data.country_iso_code,
        "region": emissions_data.region,
        "cloud_provider": emissions_data.cloud_provider,
        "cloud_region": emissions_data.cloud_region,
        "on_cloud": emissions_data.on_cloud,

        # System specifications
        "os": emissions_data.os,
        "python_version": emissions_data.python_version,
        "cpu_count": emissions_data.cpu_count,
        "cpu_model": emissions_data.cpu_model,
        "gpu_count": emissions_data.gpu_count,
        "gpu_model": emissions_data.gpu_model,
        "ram_total_size_gb": emissions_data.ram_total_size,

        # Additional metrics
        "pue": emissions_data.pue,
        "codecarbon_version": emissions_data.codecarbon_version,
    }

    # Print summary
    print(f"\n{'='*80}")
    print(f"FEW-SHOT LEARNING RESULTS ({num_shots} examples)")
    print(f"{'='*80}")
    print(f"\nModel Configuration:")
    print(f"Training Method: Few-Shot (Frozen Backbone)")
    print(f"Training Examples: {num_shots}")
    print(f"Trainable Parameters: {trainable_params:,} ({trainable_percentage:.4f}%)")
    print(f"Frozen Parameters: {total_params - trainable_params:,}")

    print(f"\nPerformance:")
    print(f"F1 Score: {eval_results['eval_f1']:.4f}")
    print(f"Exact Match: {eval_results['eval_exact_match']:.4f}")
    print(f"Eval Loss: {eval_results['eval_loss']:.4f}")

    print(f"\nEnergy:")
    print(f"Total: {emissions_data.energy_consumed:.6f} kWh")
    if emissions_data.energy_consumed > 0:
        print(f"GPU: {emissions_data.gpu_energy:.6f} kWh ({emissions_data.gpu_energy/emissions_data.energy_consumed*100:.1f}%)")
        print(f"CPU: {emissions_data.cpu_energy:.6f} kWh ({emissions_data.cpu_energy/emissions_data.energy_consumed*100:.1f}%)")

    print(f"\nCarbon:")
    print(f"CO₂ Emissions: {emissions_data.emissions:.6f} kg")
    print(f"Training Time: {train_results.metrics['train_runtime']/3600:.2f} hours")
    print(f"{'='*80}")

    # Save model
    trainer.save_model(f"{output_dir}/final_model")
    print(f"Model saved to {output_dir}/final_model")

    # Clean up
    del trainer.model
    del trainer
    torch.cuda.empty_cache()

    return result_entry


In [43]:
def run_fewshot_experiment(num_shots, train_data, eval_data, tokenizer, preprocess_fn, compute_metrics_fn, model_name="bert-base-uncased"):

    print(f"\n{'='*60}")
    print(f"Few-Shot Learning with {num_shots} examples")
    print(f"{'='*60}")

    # Step 1: Prepare few-shot dataset
    tokenized_train, num_shots = prepare_fewshot_dataset(train_data, num_shots, preprocess_fn)

    # Step 2: Train with frozen backbone
    trainer, train_results, emissions_data, output_dir, model, trainable_params, total_params = train_fewshot_model(
        tokenized_train, eval_data, tokenizer, compute_metrics_fn,
        num_shots, model_name
    )

    # Step 3: Evaluate and save
    result_entry = evaluate_and_save_fewshot(
        trainer, train_results, emissions_data, output_dir,
        num_shots, trainable_params, total_params
    )

    return result_entry

In [44]:
result_fewshot = []

In [45]:
%%time
print("\n" + "="*80)
print("EXPERIMENT 1: 100-shot Learning")
print("="*80)

result_100 = run_fewshot_experiment(
    num_shots=100,
    train_data=squad["train"],
    eval_data=tokenized_validation,
    tokenizer=tokenizer,
    preprocess_fn=preprocess_function,
    compute_metrics_fn=compute_metrics,
    model_name="bert-base-uncased"
)
result_fewshot.append(result_100)


EXPERIMENT 1: 100-shot Learning

Few-Shot Learning with 100 examples
Creating few-shot dataset with 100 examples...



Original examples: 100
After tokenization (with sliding window): 100 samples


Some weights of BertForQuestionAnswering were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



Model Configuration:
Total Parameters: 108,893,186
Trainable Parameters: 1,538
Frozen Parameters: 108,891,648
Trainable Percentage: 0.0014%



`tokenizer` is deprecated and will be removed in version 5.0.0 for `Trainer.__init__`. Use `processing_class` instead.

[codecarbon INFO @ 06:21:49] [setup] RAM Tracking...
[codecarbon INFO @ 06:21:49] [setup] CPU Tracking...
 Linux OS detected: Please ensure RAPL files exist, and are readable, at /sys/class/powercap/intel-rapl/subsystem to measure CPU

[codecarbon INFO @ 06:21:50] CPU Model on constant consumption mode: Intel(R) Xeon(R) CPU @ 2.20GHz
[codecarbon INFO @ 06:21:50] [setup] GPU Tracking...
[codecarbon INFO @ 06:21:50] Tracking Nvidia GPU via pynvml
[codecarbon INFO @ 06:21:50] The below tracking methods have been set up:
                RAM Tracking Method: RAM power estimation model
                CPU Tracking Method: global constant
                GPU Tracking Method: pynvml
            
[codecarbon INFO @ 06:21:50] >>> Tracker's metadata:
[codecarbon INFO @ 06:21:50]   Platform system: Linux-6.6.105+-x86_64-with-glibc2.35
[codecarbon INFO @ 06:21:50]   Python versio


Training few-shot model (100 examples)...


Epoch,Training Loss,Validation Loss,Exact Match,F1
1,No log,5.907028,0.000165,0.011898
2,No log,5.855969,0.000247,0.012747
3,No log,5.820057,0.000247,0.013365
4,No log,5.793118,0.000247,0.013983
5,No log,5.771776,0.000494,0.014424
6,No log,5.753668,0.000577,0.014235
7,No log,5.740057,0.000742,0.014386
8,5.516800,5.730417,0.000742,0.014118
9,5.516800,5.725586,0.000742,0.014158
10,5.516800,5.723785,0.000742,0.014147


[codecarbon INFO @ 06:22:06] Energy consumed for RAM : 0.000158 kWh. RAM Power : 38.0 W
[codecarbon INFO @ 06:22:06] Delta energy consumed for CPU with constant : 0.000177 kWh, power : 42.5 W
[codecarbon INFO @ 06:22:07] Energy consumed for All CPU : 0.000177 kWh
[codecarbon INFO @ 06:22:07] Energy consumed for all GPUs : 0.000788 kWh. Total GPU Power : 188.90480976643576 W
[codecarbon INFO @ 06:22:07] 0.001123 kWh of electricity and 0.000000 L of water were used since the beginning.
[codecarbon INFO @ 06:22:07] Energy consumed for RAM : 0.000158 kWh. RAM Power : 38.0 W
[codecarbon INFO @ 06:22:07] Delta energy consumed for CPU with constant : 0.000177 kWh, power : 42.5 W
[codecarbon INFO @ 06:22:07] Energy consumed for All CPU : 0.000177 kWh
[codecarbon INFO @ 06:22:07] Energy consumed for all GPUs : 0.000813 kWh. Total GPU Power : 195.00083038104844 W
[codecarbon INFO @ 06:22:07] 0.001148 kWh of electricity and 0.000000 L of water were used since the beginning.
[codecarbon INFO @ 06:

Evaluating few-shot model...



FEW-SHOT LEARNING RESULTS (100 examples)

Model Configuration:
Training Method: Few-Shot (Frozen Backbone)
Training Examples: 100
Trainable Parameters: 1,538 (0.0014%)
Frozen Parameters: 108,891,648

Performance:
F1 Score: 0.0144
Exact Match: 0.0005
Eval Loss: 5.7718

Energy:
Total: 0.015446 kWh
GPU: 0.010898 kWh (70.6%)
CPU: 0.002401 kWh (15.5%)

Carbon:
CO₂ Emissions: 0.006991 kg
Training Time: 0.06 hours
Model saved to results_bert_fewshot_100shots/final_model
CPU times: user 3min 39s, sys: 5.66 s, total: 3min 44s
Wall time: 3min 46s
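
The reported F1 of 0.0144 and eval loss of 5.7718 correspond to epoch 5, not the final epoch. That is consistent with `load_best_model_at_end=True` and `metric_for_best_model="f1"`. The sketch below reproduces that selection from the per-epoch F1 values in the training table above; it illustrates the selection rule and does not call the `Trainer` itself.

```python
# Per-epoch validation F1 copied from the 100-shot training table.
epoch_f1 = [
    0.011898, 0.012747, 0.013365, 0.013983, 0.014424,
    0.014235, 0.014386, 0.014118, 0.014158, 0.014147,
]

# Epochs are 1-indexed; pick the epoch with the highest F1 (greater_is_better=True).
best_epoch = max(range(1, len(epoch_f1) + 1), key=lambda epoch: epoch_f1[epoch - 1])
best_f1 = epoch_f1[best_epoch - 1]

print(f"best epoch: {best_epoch}, F1: {best_f1:.4f}")
```

Since the checkpoint is chosen on the same validation set used for the final evaluation, the reported score is the best of ten evaluations and may be slightly optimistic.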


In [46]:
%%time
print("\n" + "="*80)
print("EXPERIMENT 2: 500-shot Learning")
print("="*80)

result_500 = run_fewshot_experiment(
    num_shots=500,
    train_data=squad["train"],
    eval_data=tokenized_validation,
    tokenizer=tokenizer,
    preprocess_fn=preprocess_function,
    compute_metrics_fn=compute_metrics,
    model_name="bert-base-uncased"
)
result_fewshot.append(result_500)


EXPERIMENT 2: 500-shot Learning

Few-Shot Learning with 500 examples
Creating few-shot dataset with 500 examples...



Some weights of BertForQuestionAnswering were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Original examples: 500
After tokenization (with sliding window): 527 samples

Model Configuration:
Total Parameters: 108,893,186
Trainable Parameters: 1,538
Frozen Parameters: 108,891,648
Trainable Percentage: 0.0014%



`tokenizer` is deprecated and will be removed in version 5.0.0 for `Trainer.__init__`. Use `processing_class` instead.

[codecarbon INFO @ 06:25:36] [setup] RAM Tracking...
[codecarbon INFO @ 06:25:36] [setup] CPU Tracking...
 Linux OS detected: Please ensure RAPL files exist, and are readable, at /sys/class/powercap/intel-rapl/subsystem to measure CPU

[codecarbon INFO @ 06:25:37] CPU Model on constant consumption mode: Intel(R) Xeon(R) CPU @ 2.20GHz
[codecarbon INFO @ 06:25:37] [setup] GPU Tracking...
[codecarbon INFO @ 06:25:37] Tracking Nvidia GPU via pynvml
[codecarbon INFO @ 06:25:37] The below tracking methods have been set up:
                RAM Tracking Method: RAM power estimation model
                CPU Tracking Method: global constant
                GPU Tracking Method: pynvml
            
[codecarbon INFO @ 06:25:37] >>> Tracker's metadata:
[codecarbon INFO @ 06:25:37]   Platform system: Linux-6.6.105+-x86_64-with-glibc2.35
[codecarbon INFO @ 06:25:37]   Python versio


Training few-shot model (500 examples)...


Epoch,Training Loss,Validation Loss,Exact Match,F1
1,No log,5.565303,0.000494,0.01381
2,5.535200,5.248468,0.002143,0.015332
3,5.535200,4.975043,0.003379,0.016909
4,4.866200,4.745785,0.006181,0.018929
5,4.559100,4.546533,0.019285,0.030151
6,4.559100,4.450377,0.027444,0.038582
7,4.342100,4.344647,0.04541,0.056162
8,4.240400,4.294476,0.051096,0.061694
9,4.240400,4.264266,0.057772,0.068332
10,4.181800,4.253274,0.059997,0.070391


[codecarbon INFO @ 06:25:53] Energy consumed for RAM : 0.000158 kWh. RAM Power : 38.0 W
[codecarbon INFO @ 06:25:53] Delta energy consumed for CPU with constant : 0.000177 kWh, power : 42.5 W
[codecarbon INFO @ 06:25:53] Energy consumed for All CPU : 0.000177 kWh
[codecarbon INFO @ 06:25:53] Energy consumed for all GPUs : 0.000790 kWh. Total GPU Power : 189.40330599022263 W
[codecarbon INFO @ 06:25:53] 0.001125 kWh of electricity and 0.000000 L of water were used since the beginning.
[codecarbon INFO @ 06:25:54] Energy consumed for RAM : 0.000158 kWh. RAM Power : 38.0 W
[codecarbon INFO @ 06:25:54] Delta energy consumed for CPU with constant : 0.000177 kWh, power : 42.5 W
[codecarbon INFO @ 06:25:54] Energy consumed for All CPU : 0.000177 kWh
[codecarbon INFO @ 06:25:54] Energy consumed for all GPUs : 0.000815 kWh. Total GPU Power : 195.47905663643255 W
[codecarbon INFO @ 06:25:54] 0.001150 kWh of electricity and 0.000000 L of water were used since the beginning.
[codecarbon INFO @ 06:

Evaluating few-shot model...



FEW-SHOT LEARNING RESULTS (500 examples)

Model Configuration:
Training Method: Few-Shot (Frozen Backbone)
Training Examples: 500
Trainable Parameters: 1,538 (0.0014%)
Frozen Parameters: 108,891,648

Performance:
F1 Score: 0.0704
Exact Match: 0.0600
Eval Loss: 4.2533

Energy:
Total: 0.016035 kWh
GPU: 0.011313 kWh (70.6%)
CPU: 0.002493 kWh (15.5%)

Carbon:
CO₂ Emissions: 0.007258 kg
Training Time: 0.06 hours
Model saved to results_bert_fewshot_500shots/final_model
CPU times: user 3min 47s, sys: 5.7 s, total: 3min 53s
Wall time: 3min 54s
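
The emissions and energy figures reported so far imply a nearly constant ratio of kg CO2 per kWh. The sketch below computes that implied carbon intensity for the 100-shot and 500-shot runs. It assumes emissions are estimated as energy multiplied by a regional grid intensity factor; the exact factor used by the tracker is not shown in the logs.

```python
# (energy_kwh, emissions_kg) copied from the few-shot result summaries.
runs = {
    100: (0.015446, 0.006991),
    500: (0.016035, 0.007258),
}


def implied_intensity(energy_kwh: float, emissions_kg: float) -> float:
    """kg of CO2 attributed to each kWh of electricity consumed."""
    return emissions_kg / energy_kwh


intensity = {shots: implied_intensity(*values) for shots, values in runs.items()}

for shots, value in intensity.items():
    print(f"{shots}-shot: {value:.4f} kg CO2 per kWh")
```

Because the ratio is effectively identical across runs, differences in emissions between experiments on this machine reflect differences in energy use alone.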


In [47]:
%%time
print("\n" + "="*80)
print("EXPERIMENT 3: 1000-shot Learning")
print("="*80)

result_1000 = run_fewshot_experiment(
    num_shots=1000,
    train_data=squad["train"],
    eval_data=tokenized_validation,
    tokenizer=tokenizer,
    preprocess_fn=preprocess_function,
    compute_metrics_fn=compute_metrics,
    model_name="bert-base-uncased"
)
result_fewshot.append(result_1000)


EXPERIMENT 3: 1000-shot Learning

Few-Shot Learning with 1000 examples
Creating few-shot dataset with 1000 examples...



Some weights of BertForQuestionAnswering were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Original examples: 1000
After tokenization (with sliding window): 1027 samples

Model Configuration:
Total Parameters: 108,893,186
Trainable Parameters: 1,538
Frozen Parameters: 108,891,648
Trainable Percentage: 0.0014%



`tokenizer` is deprecated and will be removed in version 5.0.0 for `Trainer.__init__`. Use `processing_class` instead.

[codecarbon INFO @ 06:29:31] [setup] RAM Tracking...
[codecarbon INFO @ 06:29:31] [setup] CPU Tracking...
 Linux OS detected: Please ensure RAPL files exist, and are readable, at /sys/class/powercap/intel-rapl/subsystem to measure CPU

[codecarbon INFO @ 06:29:32] CPU Model on constant consumption mode: Intel(R) Xeon(R) CPU @ 2.20GHz
[codecarbon INFO @ 06:29:32] [setup] GPU Tracking...
[codecarbon INFO @ 06:29:32] Tracking Nvidia GPU via pynvml
[codecarbon INFO @ 06:29:32] The below tracking methods have been set up:
                RAM Tracking Method: RAM power estimation model
                CPU Tracking Method: global constant
                GPU Tracking Method: pynvml
            
[codecarbon INFO @ 06:29:32] >>> Tracker's metadata:
[codecarbon INFO @ 06:29:32]   Platform system: Linux-6.6.105+-x86_64-with-glibc2.35
[codecarbon INFO @ 06:29:32]   Python versio


Training few-shot model (1000 examples)...


Epoch,Training Loss,Validation Loss,Exact Match,F1
1,5.5375,5.414817,0.001896,0.015567
2,4.8438,4.978211,0.00478,0.018874
3,4.4956,4.641418,0.006675,0.022342
4,4.1348,4.438277,0.007829,0.024113
5,4.0612,4.313487,0.009065,0.026081
6,3.9823,4.235535,0.010714,0.028072
7,3.8572,4.191252,0.011867,0.029529
8,3.8498,4.162303,0.012774,0.030687
9,3.8404,4.144755,0.013104,0.031119
10,3.7846,4.143575,0.013186,0.031147


[codecarbon INFO @ 06:29:48] Energy consumed for RAM : 0.000158 kWh. RAM Power : 38.0 W
[codecarbon INFO @ 06:29:48] Delta energy consumed for CPU with constant : 0.000177 kWh, power : 42.5 W
[codecarbon INFO @ 06:29:48] Energy consumed for All CPU : 0.000177 kWh
[codecarbon INFO @ 06:29:48] Energy consumed for all GPUs : 0.000777 kWh. Total GPU Power : 186.38696312738475 W
[codecarbon INFO @ 06:29:48] 0.001113 kWh of electricity and 0.000000 L of water were used since the beginning.
[codecarbon INFO @ 06:29:49] Energy consumed for RAM : 0.000158 kWh. RAM Power : 38.0 W
[codecarbon INFO @ 06:29:49] Delta energy consumed for CPU with constant : 0.000177 kWh, power : 42.5 W
[codecarbon INFO @ 06:29:49] Energy consumed for All CPU : 0.000177 kWh
[codecarbon INFO @ 06:29:49] Energy consumed for all GPUs : 0.000796 kWh. Total GPU Power : 191.04568120201728 W
[codecarbon INFO @ 06:29:49] 0.001132 kWh of electricity and 0.000000 L of water were used since the beginning.
[codecarbon INFO @ 06:

Evaluating few-shot model...



FEW-SHOT LEARNING RESULTS (1000 examples)

Model Configuration:
Training Method: Few-Shot (Frozen Backbone)
Training Examples: 1000
Trainable Parameters: 1,538 (0.0014%)
Frozen Parameters: 108,891,648

Performance:
F1 Score: 0.0311
Exact Match: 0.0132
Eval Loss: 4.1436

Energy:
Total: 0.016912 kWh
GPU: 0.011861 kWh (70.1%)
CPU: 0.002667 kWh (15.8%)

Carbon:
CO₂ Emissions: 0.007655 kg
Training Time: 0.06 hours
Model saved to results_bert_fewshot_1000shots/final_model
CPU times: user 4min 1s, sys: 5.46 s, total: 4min 6s
Wall time: 4min 9s
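
The energy summary prints only the GPU and CPU shares. The sketch below derives the remaining RAM share for the 1000-shot run, assuming the total is the sum of the GPU, CPU, and RAM components. `energy_shares` is a hypothetical helper, not part of the notebook.

```python
def energy_shares(total_kwh: float, gpu_kwh: float, cpu_kwh: float) -> dict:
    """Percentage share per component; RAM is taken as the remainder of the total."""
    ram_kwh = total_kwh - gpu_kwh - cpu_kwh
    return {
        "gpu": 100 * gpu_kwh / total_kwh,
        "cpu": 100 * cpu_kwh / total_kwh,
        "ram": 100 * ram_kwh / total_kwh,
    }


# Values copied from the 1000-shot energy summary.
shares = energy_shares(total_kwh=0.016912, gpu_kwh=0.011861, cpu_kwh=0.002667)

for component, pct in shares.items():
    print(f"{component.upper()}: {pct:.1f}%")
```

Under that assumption, RAM accounts for roughly one seventh of the energy in this run, which is close to the CPU share.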


### STEP 8.1: Results and Analysis

In [48]:
results_df_fewshot = pd.DataFrame(result_fewshot)
print("\n" + "="*60)
print("FEW-SHOT LEARNING RESULTS SUMMARY")
print("="*60)
print(results_df_fewshot[['num_shots', 'trainable_percentage', 'f1_score', 'exact_match', 'emissions_kg', 'training_time_hours']].to_string(index=False))

# Save to CSV
results_df_fewshot.to_csv("/content/drive/MyDrive/bert_fewshot_results.csv", index=False)



FEW-SHOT LEARNING RESULTS SUMMARY
 num_shots  trainable_percentage  f1_score  exact_match  emissions_kg  training_time_hours
       100              0.001412  0.014424     0.000494      0.006991             0.056369
       500              0.001412  0.070391     0.059997      0.007258             0.058514
      1000              0.001412  0.031147     0.013186      0.007655             0.062639


In [49]:
# FEW-SHOT EFFICIENCY ANALYSIS

print("\n" + "="*80)
print("FEW-SHOT EFFICIENCY ANALYSIS")
print("="*80)

# Use 500-shot as baseline (middle ground)
baseline = results_df_fewshot[results_df_fewshot['num_shots'] == 500].iloc[0]

for _, row in results_df_fewshot.iterrows():
    shots = row['num_shots']
    samples_ratio = row['num_shots'] / baseline['num_shots']
    f1_diff = row['f1_score'] - baseline['f1_score']
    emissions_diff = row['emissions_kg'] - baseline['emissions_kg']
    time_diff = row['training_time_hours'] - baseline['training_time_hours']

    print(f"\n{shots}-Shot Learning:")
    print(f"Training Examples: {row['num_shots']:,}")
    print(f"Trainable Params: {row['trainable_params']:,} ({row['trainable_percentage']:.4f}%)")
    print(f"vs 500-shot: {samples_ratio:.2f}x training data")
    print(f"F1 Score: {row['f1_score']:.4f} ({f1_diff:+.4f} vs 500-shot)")
    print(f"Emissions: {row['emissions_kg']:.6f} kg ({emissions_diff:+.6f} vs 500-shot)")
    print(f"Training Time: {row['training_time_hours']:.2f} hours ({time_diff:+.2f} vs 500-shot)")

    # Efficiency metrics
    efficiency_co2 = row['f1_score'] / row['emissions_kg'] if row['emissions_kg'] > 0 else 0
    efficiency_time = row['f1_score'] / row['training_time_hours'] if row['training_time_hours'] > 0 else 0
    efficiency_samples = row['f1_score'] / row['num_shots'] if row['num_shots'] > 0 else 0

    print(f"Efficiency (F1/kg CO₂): {efficiency_co2:.2f}")
    print(f"Efficiency (F1/hour): {efficiency_time:.4f}")
    print(f"Efficiency (F1/sample): {efficiency_samples:.6f}")




FEW-SHOT EFFICIENCY ANALYSIS

100-Shot Learning:
Training Examples: 100
Trainable Params: 1,538 (0.0014%)
vs 500-shot: 0.20x training data
F1 Score: 0.0144 (-0.0560 vs 500-shot)
Emissions: 0.006991 kg (-0.000266 vs 500-shot)
Training Time: 0.06 hours (-0.00 vs 500-shot)
Efficiency (F1/kg CO₂): 2.06
Efficiency (F1/hour): 0.2559
Efficiency (F1/sample): 0.000144

500-Shot Learning:
Training Examples: 500
Trainable Params: 1,538 (0.0014%)
vs 500-shot: 1.00x training data
F1 Score: 0.0704 (+0.0000 vs 500-shot)
Emissions: 0.007258 kg (+0.000000 vs 500-shot)
Training Time: 0.06 hours (+0.00 vs 500-shot)
Efficiency (F1/kg CO₂): 9.70
Efficiency (F1/hour): 1.2030
Efficiency (F1/sample): 0.000141

1000-Shot Learning:
Training Examples: 1,000
Trainable Params: 1,538 (0.0014%)
vs 500-shot: 2.00x training data
F1 Score: 0.0311 (-0.0392 vs 500-shot)
Emissions: 0.007655 kg (+0.000397 vs 500-shot)
Training Time: 0.06 hours (+0.00 vs 500-shot)
Efficiency (F1/kg CO₂): 4.07
Efficiency (F1/hour): 0.4972
Efficiency (F1/sample): 0.000031

In [50]:
# PLOT 1: Few-Shot Energy Consumption by Shots
df_sorted_fewshot = results_df_fewshot.sort_values('num_shots')

fig_fewshot_energy = go.Figure()

fig_fewshot_energy.add_trace(go.Bar(
    name='CPU Energy',
    x=df_sorted_fewshot['num_shots'],
    y=df_sorted_fewshot['cpu_energy_kwh'],
    marker_color='#FF6B6B',
    hovertemplate='<b>CPU Energy</b><br>%{y:.6f} kWh<br>Shots: %{x}<extra></extra>'
))

fig_fewshot_energy.add_trace(go.Bar(
    name='GPU Energy',
    x=df_sorted_fewshot['num_shots'],
    y=df_sorted_fewshot['gpu_energy_kwh'],
    marker_color='#4ECDC4',
    hovertemplate='<b>GPU Energy</b><br>%{y:.6f} kWh<br>Shots: %{x}<extra></extra>'
))

fig_fewshot_energy.add_trace(go.Bar(
    name='RAM Energy',
    x=df_sorted_fewshot['num_shots'],
    y=df_sorted_fewshot['ram_energy_kwh'],
    marker_color='#95E1D3',
    hovertemplate='<b>RAM Energy</b><br>%{y:.6f} kWh<br>Shots: %{x}<extra></extra>'
))

fig_fewshot_energy.update_layout(
    title=dict(text="Few-Shot: Energy Consumption by Number of Examples", font=dict(size=18)),
    xaxis_title='Number of Training Examples',
    yaxis_title='Energy Consumption (kWh)',
    barmode='stack',
    template='plotly_white',
    height=500,
    font=dict(size=13),
    legend=dict(
        orientation="h",
        yanchor="bottom",
        y=1.02,
        xanchor="right",
        x=1
    ),
    hovermode='x unified'
)

fig_fewshot_energy.show()
fig_fewshot_energy.write_html("/content/drive/MyDrive/fewshot_energy_by_shots.html")

In [51]:
# PLOT 2: Few-Shot Performance & Emissions by Shots (Dual Y-axis)
df_sorted_fewshot = results_df_fewshot.sort_values('num_shots')

fig_fewshot_perf = make_subplots(specs=[[{"secondary_y": True}]])

# F1 Score line
fig_fewshot_perf.add_trace(
    go.Scatter(
        x=df_sorted_fewshot['num_shots'],
        y=df_sorted_fewshot['f1_score'],
        name='F1 Score',
        mode='lines+markers',
        line=dict(color='#4ECDC4', width=3),
        marker=dict(size=12, line=dict(width=2, color='white')),
        hovertemplate='<b>F1 Score</b>: %{y:.4f}<br>Shots: %{x}<extra></extra>'
    ),
    secondary_y=False
)

# Exact Match line
fig_fewshot_perf.add_trace(
    go.Scatter(
        x=df_sorted_fewshot['num_shots'],
        y=df_sorted_fewshot['exact_match'],
        name='Exact Match',
        mode='lines+markers',
        line=dict(color='#95E1D3', width=3, dash='dash'),
        marker=dict(size=10),
        hovertemplate='<b>Exact Match</b>: %{y:.4f}<br>Shots: %{x}<extra></extra>'
    ),
    secondary_y=False
)

# CO2 Emissions bar
fig_fewshot_perf.add_trace(
    go.Bar(
        x=df_sorted_fewshot['num_shots'],
        y=df_sorted_fewshot['emissions_kg'],
        name='CO₂ Emissions',
        marker_color='#FF6B6B',
        opacity=0.6,
        hovertemplate='<b>CO₂</b>: %{y:.6f} kg<br>Shots: %{x}<extra></extra>'
    ),
    secondary_y=True
)

fig_fewshot_perf.update_xaxes(title_text="Number of Training Examples")
fig_fewshot_perf.update_yaxes(title_text="Performance Score", secondary_y=False)
fig_fewshot_perf.update_yaxes(title_text="CO₂ Emissions (kg)", secondary_y=True)

fig_fewshot_perf.update_layout(
    title=dict(text="Few-Shot: Performance vs Carbon Emissions", font=dict(size=18)),
    template='plotly_white',
    height=500,
    font=dict(size=13),
    hovermode='x unified',
    legend=dict(
        orientation="h",
        yanchor="bottom",
        y=1.02,
        xanchor="right",
        x=1
    )
)

fig_fewshot_perf.show()
fig_fewshot_perf.write_html("/content/drive/MyDrive/fewshot_performance_emissions.html")

# Comparing and Testing All the Models

In [52]:
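def test_model(model_path, examples, tokenizer_name="bert-base-uncased"):
    """Run a saved QA model on hand-written examples and print each prediction.

    The training method (LoRA, Few-Shot, or Full Fine-tuning) is inferred from
    `model_path`. Returns a list of per-example result dicts.
    """
    # Load tokenizer
    tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)

    # Auto-detect model type from the path ("lora_adapters" also contains "lora")
    is_lora = "lora" in model_path.lower()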

    if is_lora:
        method = "LoRA"
        # Load base model + LoRA adapters
        base_model = AutoModelForQuestionAnswering.from_pretrained(tokenizer_name)
        model = PeftModel.from_pretrained(base_model, model_path)

    else:
        # Detect if few-shot or full fine-tuning
        method = "Few-Shot" if "fewshot" in model_path.lower() else "Full Fine-tuning"
        model = AutoModelForQuestionAnswering.from_pretrained(model_path)


    # Create pipeline
    qa_pipeline = pipeline(
        "question-answering",
        model=model,
        tokenizer=tokenizer,
        device=0 if torch.cuda.is_available() else -1
    )

    # Test all examples
    results = []
    for i, ex in enumerate(examples, 1):
        # Get prediction
        prediction = qa_pipeline(question=ex['question'], context=ex['context'])

        # Store result
        result = {
            'example_num': i,
            'method': method,
            'question': ex['question'],
            'context': (ex['context'][:100] + "...") if len(ex['context']) > 100 else ex['context'],
            'predicted_answer': prediction['answer'],
            'expected_answer': ex.get('expected_answer', None),
            'confidence': prediction['score'],
            'start_position': prediction['start'],
            'end_position': prediction['end']
        }
        results.append(result)

        # Print formatted output
        print(f"\nExample {i}")
        print(f"Question: {ex['question']}")
        print(f"Context: {ex['context'][:150]}{'...' if len(ex['context']) > 150 else ''}")
        print(f"\nPredicted: '{prediction['answer']}'")
        print(f"   Confidence: {prediction['score']:.2%}")

        # Check match with expected answer
        if ex.get('expected_answer'):
            expected = ex['expected_answer'].lower().strip()
            predicted = prediction['answer'].lower().strip()
            # Lenient matching: either string may contain the other, so a longer
            # predicted span that merely includes the expected answer counts as YES
            match = (predicted in expected) or (expected in predicted)
            print(f"   Expected: '{ex['expected_answer']}'")
            print(f"   Match: {'YES' if match else 'NO'}")

        print("-" * 80)

    return results
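
The containment check in `test_model` is deliberately lenient: a prediction that returns a whole clause still counts as a match if the expected answer appears somewhere inside it. A stricter alternative is SQuAD-style scoring, which normalizes both strings and then reports exact match and token-level F1. The sketch below is an assumption about how such a check could look, not the scoring used elsewhere in this notebook; the function names are hypothetical.

```python
import re
import string
from collections import Counter

def normalize_answer(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(predicted: str, expected: str) -> bool:
    """True only if the normalized strings are identical."""
    return normalize_answer(predicted) == normalize_answer(expected)

def token_f1(predicted: str, expected: str) -> float:
    """Harmonic mean of token precision and recall after normalization."""
    pred_tokens = normalize_answer(predicted).split()
    gold_tokens = normalize_answer(expected).split()
    if not pred_tokens or not gold_tokens:
        # Both empty counts as agreement; one empty counts as a miss.
        return float(pred_tokens == gold_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

# A long span that contains the expected year passes the containment check
# but is not an exact match and scores a low F1.
long_span = "Guido van Rossum and first released in 1991"
print(exact_match(long_span, "1991"), round(token_f1(long_span, "1991"), 3))
# -> False 0.222
```

Reporting both numbers next to the YES/NO flag would make over-long spans visible without changing the existing output.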

In [53]:
test_examples = [
     {
        'question': "What is the capital of France?",
        'context': "Paris is the capital and most populous city of France. It has been one of Europe's major centers of finance, diplomacy, commerce, fashion, and arts.",
        'expected_answer': "Paris"
    },
    {
        'question': "What does Google Colab provide access to?",
        'context': "Google Colab provides free access to GPUs and TPUs, which makes it popular for deep learning.",
        'expected_answer': "GPUs and TPUs"
    },
    {
        'question': "When was Python created?",
        'context': "Python was created by Guido van Rossum and first released in 1991. Its design philosophy emphasizes code readability.",
        'expected_answer': "1991"
    },
    {
        'question': "Who invented the telephone?",
        'context': "The telephone was invented by Alexander Graham Bell in 1876. He made the first successful telephone call on March 10, 1876.",
        'expected_answer': "Alexander Graham Bell"
    },
]


In [54]:
print(os.listdir('/content/'))

['.config', 'results_bert_50pct', 'results_bert_lora_r16_80pct', 'results_bert_fewshot_100shots', 'drive', 'results_bert_lora_r8_80pct', 'results_bert_25pct', 'results_bert_80pct', 'results_bert_fewshot_500shots', 'results_bert_lora_r4_80pct', 'results_bert_fewshot_1000shots', 'wandb', 'sample_data']


In [55]:
# Test Full Fine-tuning
print("=" * 40)
print("TESTING FULL FINE-TUNING MODEL")
print("=" * 40)
results_full_ft = test_model(
    model_path="results_bert_80pct/final_model",
    examples=test_examples
)

TESTING FULL FINE-TUNING MODEL


Device set to use cuda:0



Example 1
Question: What is the capital of France?
Context: Paris is the capital and most populous city of France. It has been one of Europe's major centers of finance, diplomacy, commerce, fashion, and arts.

Predicted: 'Paris'
   Confidence: 98.78%
   Expected: 'Paris'
   Match: YES
--------------------------------------------------------------------------------

Example 2
Question: What does Google Colab provide access to?
Context: Google Colab provides free access to GPUs and TPUs, which makes it popular for deep learning.

Predicted: 'GPUs and TPUs'
   Confidence: 92.15%
   Expected: 'GPUs and TPUs'
   Match: YES
--------------------------------------------------------------------------------

Example 3
Question: When was Python created?
Context: Python was created by Guido van Rossum and first released in 1991. Its design philosophy emphasizes code readability.

Predicted: '1991'
   Confidence: 97.90%
   Expected: '1991'
   Match: YES
--------------------------------------------

In [56]:
# Test LoRA
print("\n" + "=" * 40)
print("TESTING LORA MODEL")
print("=" * 40)
results_lora = test_model(
    model_path="results_bert_lora_r16_80pct/lora_adapters",
    examples=test_examples
)



TESTING LORA MODEL


Some weights of BertForQuestionAnswering were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Device set to use cuda:0



Example 1
Question: What is the capital of France?
Context: Paris is the capital and most populous city of France. It has been one of Europe's major centers of finance, diplomacy, commerce, fashion, and arts.

Predicted: 'Paris'
   Confidence: 82.68%
   Expected: 'Paris'
   Match: YES
--------------------------------------------------------------------------------

Example 2
Question: What does Google Colab provide access to?
Context: Google Colab provides free access to GPUs and TPUs, which makes it popular for deep learning.

Predicted: 'GPUs and TPUs'
   Confidence: 55.75%
   Expected: 'GPUs and TPUs'
   Match: YES
--------------------------------------------------------------------------------

Example 3
Question: When was Python created?
Context: Python was created by Guido van Rossum and first released in 1991. Its design philosophy emphasizes code readability.

Predicted: '1991'
   Confidence: 76.12%
   Expected: '1991'
   Match: YES
--------------------------------------------
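
The "newly initialized" warning above is emitted when `test_model` loads `bert-base-uncased` as the LoRA base, before the adapters are attached. Whether the trained QA head is then restored depends on what was saved alongside the adapters. A minimal sketch for checking this, assuming the adapters were written with PEFT's `save_pretrained` (which stores an `adapter_config.json`) and that fully trained modules are listed under its `modules_to_save` key; the helper name `saved_head_modules` is hypothetical:

```python
import json
from pathlib import Path

def saved_head_modules(adapter_dir: str) -> list:
    """Return the `modules_to_save` list from a PEFT adapter config, or [] if absent."""
    config_path = Path(adapter_dir) / "adapter_config.json"
    with config_path.open() as fh:
        config = json.load(fh)
    # The key may be missing or null when only LoRA matrices were saved.
    return config.get("modules_to_save") or []

# In this notebook the call would be (path taken from the cell above):
# saved_head_modules("results_bert_lora_r16_80pct/lora_adapters")
```

If the returned list includes `qa_outputs`, the head weights should come from the adapter checkpoint rather than from the random initialization the warning describes.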

In [57]:
# Test Few-Shot
print("\n" + "=" * 40)
print("TESTING FEW-SHOT MODEL")
print("=" * 40)
test_results_fewshot = test_model(
    model_path="results_bert_fewshot_1000shots/final_model",
    examples=test_examples
)



TESTING FEW-SHOT MODEL


Device set to use cuda:0



Example 1
Question: What is the capital of France?
Context: Paris is the capital and most populous city of France. It has been one of Europe's major centers of finance, diplomacy, commerce, fashion, and arts.

Predicted: 'Paris'
   Confidence: 4.92%
   Expected: 'Paris'
   Match: YES
--------------------------------------------------------------------------------

Example 2
Question: What does Google Colab provide access to?
Context: Google Colab provides free access to GPUs and TPUs, which makes it popular for deep learning.

Predicted: 'free access to GPUs and TPUs'
   Confidence: 3.06%
   Expected: 'GPUs and TPUs'
   Match: YES
--------------------------------------------------------------------------------

Example 3
Question: When was Python created?
Context: Python was created by Guido van Rossum and first released in 1991. Its design philosophy emphasizes code readability.

Predicted: 'Guido van Rossum and first released in 1991'
   Confidence: 3.00%
   Expected: '1991'
   Matc

In [58]:
all_results = {
    "Full FT (80%)": results_full_ft,
    "LoRA (r=16, 80%)": results_lora,
    "Few-Shot (1000)": test_results_fewshot
}


## Comparison Function

In [63]:
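def compare_models(results_dict):
    """Print each model's answer per question and return the rows as a DataFrame.

    `results_dict` maps a display name to the list returned by `test_model`.
    """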
    print("\n" + "="*80)
    print("MODEL COMPARISON")
    print("="*80)

    comparison_data = []

    for model_name, results in results_dict.items():
        for result in results:
            comparison_data.append({
                'Model': model_name,
                'Question': result['question'][:50] + "...",
                'Predicted': result['predicted_answer'],
                'Expected': result.get('expected_answer') or 'N/A',  # key may exist but hold None
                'Confidence': result['confidence']
            })

    comparison_df = pd.DataFrame(comparison_data)

    # Group by question to see how different models answer
    for question in comparison_df['Question'].unique():
        print(f"\nQUESTION: {question}")
        question_results = comparison_df[comparison_df['Question'] == question]
        for _, row in question_results.iterrows():
            exp = str(row['Expected']).lower()
            pred = str(row['Predicted']).lower()
            match_indicator = "M" if (pred in exp or exp in pred) and exp != 'n/a' else "N.M"
            print(f"  {match_indicator} {row['Model']:20s}: {row['Predicted']:40s} ({row['Confidence']:.1%})")
        expected_val = question_results.iloc[0]['Expected']
        if expected_val != 'N/A':
            print(f"Expected ANSWER: {expected_val}")

    return comparison_df

In [64]:
comparison_df = compare_models(all_results)


MODEL COMPARISON

QUESTION: What is the capital of France?...
  M Full FT (80%)       : Paris                                    (98.8%)
  M LoRA (r=16, 80%)    : Paris                                    (82.7%)
  M Few-Shot (1000)     : Paris                                    (4.9%)
Expected ANSWER: Paris

QUESTION: What does Google Colab provide access to?...
  M Full FT (80%)       : GPUs and TPUs                            (92.2%)
  M LoRA (r=16, 80%)    : GPUs and TPUs                            (55.8%)
  M Few-Shot (1000)     : free access to GPUs and TPUs             (3.1%)
Expected ANSWER: GPUs and TPUs

QUESTION: When was Python created?...
  M Full FT (80%)       : 1991                                     (97.9%)
  M LoRA (r=16, 80%)    : 1991                                     (76.1%)
  M Few-Shot (1000)     : Guido van Rossum and first released in 1991 (3.0%)
Expected ANSWER: 1991

QUESTION: Who invented the telephone?...
  M Full FT (80%)       : Alexander Graham Bell  

In [61]:
# Save comparison results
comparison_df.to_csv("/content/drive/MyDrive/model_comparison.csv", index=False)
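
All three models return a matching span on the examples shown, so the main visible difference is the confidence score. A per-model average makes that gap explicit. The sketch below rebuilds a small frame from the confidences printed in the per-example outputs above (rounded as displayed; the fourth example is left out because its output is truncated in this export). The helper name `mean_confidence` is hypothetical.

```python
import pandas as pd

# Confidences copied from the printed outputs above (examples 1-3 only).
demo_df = pd.DataFrame({
    "Model": ["Full FT (80%)"] * 3 + ["LoRA (r=16, 80%)"] * 3 + ["Few-Shot (1000)"] * 3,
    "Confidence": [0.9878, 0.9215, 0.9790,
                   0.8268, 0.5575, 0.7612,
                   0.0492, 0.0306, 0.0300],
})

def mean_confidence(df: pd.DataFrame) -> pd.Series:
    """Average pipeline confidence per model, highest first."""
    return df.groupby("Model")["Confidence"].mean().sort_values(ascending=False)

summary = mean_confidence(demo_df)
print(summary.round(3))
```

With the full `comparison_df` in memory, `mean_confidence(comparison_df)` would include the fourth example as well.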

In [62]:
print("\n" + "="*80)
print("COMPARISON: FULL FT vs LoRA vs FEW-SHOT")
print("="*80)

# Load previous results
full_ft_results = pd.read_csv("/content/drive/MyDrive/bert_dataset_size_results.csv")
lora_results = pd.read_csv("/content/drive/MyDrive/bert_lora_results.csv")
results_fewshot = pd.read_csv("/content/drive/MyDrive/bert_fewshot_results.csv")

# Add method identifiers if not present
if 'training_method' not in full_ft_results.columns:
    full_ft_results['training_method'] = 'Full Fine-tuning'
if 'training_method' not in lora_results.columns:
    lora_results['training_method'] = 'LoRA'

# Combine all results
all_methods = pd.concat([full_ft_results, lora_results, results_fewshot], ignore_index=True)

print("\nTraining Efficiency Comparison:")
print(all_methods[['training_method', 'train_samples', 'f1_score', 'emissions_kg', 'training_time_hours']].to_string(index=False))

# Save combined results
all_methods.to_csv("/content/drive/MyDrive/all_training_methods_comparison.csv", index=False)



COMPARISON: FULL FT vs LoRA vs FEW-SHOT

Training Efficiency Comparison:
           training_method  train_samples  f1_score  emissions_kg  training_time_hours
          Full Fine-Tuning          32579  0.531585      0.012590             0.097368
          Full Fine-Tuning          65159  0.656064      0.022642             0.160960
          Full Fine-Tuning         104255  0.695708      0.035771             0.250517
                      LoRA         104255  0.529031      0.032428             0.255871
                      LoRA         104255  0.562268      0.032148             0.254697
                      LoRA         104255  0.600994      0.032265             0.255698
Few-Shot (Frozen Backbone)            100  0.014424      0.006991             0.056369
Few-Shot (Frozen Backbone)            500  0.070391      0.007258             0.058514
Few-Shot (Frozen Backbone)           1000  0.031147      0.007655             0.062639
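
One way to relate performance to environmental cost is to divide F1 by emissions for each run. The sketch below uses the values from the table above; the `f1_per_kg` column is an illustrative metric introduced here, not one computed elsewhere in the notebook.

```python
import pandas as pd

# Rows copied from the "Training Efficiency Comparison" table above.
rows = [
    ("Full Fine-Tuning", 32579, 0.531585, 0.012590),
    ("Full Fine-Tuning", 65159, 0.656064, 0.022642),
    ("Full Fine-Tuning", 104255, 0.695708, 0.035771),
    ("LoRA", 104255, 0.529031, 0.032428),
    ("LoRA", 104255, 0.562268, 0.032148),
    ("LoRA", 104255, 0.600994, 0.032265),
    ("Few-Shot (Frozen Backbone)", 100, 0.014424, 0.006991),
    ("Few-Shot (Frozen Backbone)", 500, 0.070391, 0.007258),
    ("Few-Shot (Frozen Backbone)", 1000, 0.031147, 0.007655),
]
table = pd.DataFrame(
    rows, columns=["training_method", "train_samples", "f1_score", "emissions_kg"]
)

# F1 obtained per kilogram of CO2-equivalent emitted during training.
table["f1_per_kg"] = table["f1_score"] / table["emissions_kg"]
ranked = table.sort_values("f1_per_kg", ascending=False)
print(ranked.to_string(index=False))
```

On these numbers the smallest full fine-tuning run ranks first and the 100-shot run last. Because this is a ratio, it favors cheap runs and should be read alongside absolute F1 rather than in place of it.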
