# Assignment 1: Comparative Analysis of Transformer Architectures

**Name:** Shakirth Anisha

**SRN:** PES2UG23CS927

**Section:** I

---

## Objective

This assignment investigates how different transformer model architectures perform when applied to tasks outside their intended design. We'll examine three distinct models:

- **BERT (bert-base-uncased)** - Bidirectional Encoder
- **RoBERTa (roberta-base)** - Robustly Optimized BERT
- **BART (facebook/bart-base)** - Bidirectional and Auto-Regressive Transformer

By challenging these models with incompatible tasks, we'll understand the fundamental importance of architectural design in NLP systems.

## Part 1: Environment Setup

Let's begin by installing dependencies and configuring our workspace.

In [1]:
!pip install -q transformers torch sentencepiece

In [2]:
from transformers import pipeline, AutoTokenizer, AutoModelForMaskedLM
import warnings
import sys

warnings.filterwarnings('ignore')

print(f"Python Version: {sys.version}")



Python Version: 3.12.12 (main, Oct 10 2025, 08:52:57) [GCC 11.4.0]


## Part 2: Understanding Model Architectures

### Architecture Overview

| Model | Type | Core Mechanism | Training Approach | Ideal Use Case |
|-------|------|----------------|-------------------|----------------|
| **BERT** | Encoder-Only | Bidirectional Context | Masked Token Prediction | Text Classification, Entity Recognition |
| **RoBERTa** | Encoder-Only | Bidirectional Context (Optimized Training) | Masked Token Prediction with Dynamic Masking | Same Understanding Tasks as BERT |
| **BART** | Encoder-Decoder | Sequence-to-Sequence | Text Reconstruction | Summarization, Translation |

## Experiment #1: Auto-Regressive Text Generation

### Hypothesis
Encoder-only models (BERT, RoBERTa) should struggle or fail at generating sequential text because they lack auto-regressive capabilities. BART, with its decoder component, should handle this more naturally.
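To make "auto-regressive" concrete, here is a minimal sketch of greedy generation, where each new token is chosen from the tokens produced so far. The `NEXT_TOKEN` table and the `greedy_generate` helper are hypothetical stand-ins for a trained decoder; they are not part of `transformers` and the continuation is not real model output.

```python
# Hypothetical bigram table standing in for a trained decoder (illustration only).
NEXT_TOKEN = {
    "is": "full",
    "full": "of",
    "of": "promise",
    "promise": "<eos>",
}


def greedy_generate(prompt: list[str], max_new_tokens: int) -> list[str]:
    """Append one token at a time, each chosen from the last token generated so far."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        next_token = NEXT_TOKEN.get(tokens[-1], "<eos>")
        if next_token == "<eos>":
            break  # stop as soon as the toy model signals end of sequence
        tokens.append(next_token)
    return tokens


prompt = ["The", "future", "of", "Artificial", "Intelligence", "is"]
generated = greedy_generate(prompt, max_new_tokens=5)
print(" ".join(generated))
```

The loop is the part an encoder-only model has no training for: its pre-training never asks it to extend a sequence one token at a time.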

### Test Prompt
```
"The future of Artificial Intelligence is"
```

In [3]:
test_prompt = "The future of Artificial Intelligence is"

print("═" * 70)
print("EXPERIMENT 1: AUTO-REGRESSIVE GENERATION")
print("═" * 70)
print(f"\nTest Input: \"{test_prompt}\"\n")
print("-" * 70)

══════════════════════════════════════════════════════════════════════
EXPERIMENT 1: AUTO-REGRESSIVE GENERATION
══════════════════════════════════════════════════════════════════════

Test Input: "The future of Artificial Intelligence is"

----------------------------------------------------------------------


### 1.1 Testing BERT - Expected: Failure or degraded output

In [4]:
bert_gen_pipeline = pipeline(
    task='text-generation',
    model='bert-base-uncased',
    device=0 if __import__('torch').cuda.is_available() else -1
)

bert_output = bert_gen_pipeline(
    test_prompt,
    max_new_tokens=25,
    num_return_sequences=1
)

print(f"Generated Text: {bert_output[0]['generated_text']}")

If you want to use `BertLMHeadModel` as a standalone, add `is_decoder=True.`


Device set to use cpu


Generated Text: The future of Artificial Intelligence is.........................


BERT uses a bidirectional encoder that processes entire sequences at once and lacks a unidirectional decoder for next-token prediction; therefore, as an encoder-only architecture, it cannot naturally generate sequential text.
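The difference between bidirectional and unidirectional processing can be sketched as an attention mask. This is a simplified illustration in plain Python, not the masking code used inside `transformers`; the `attention_mask` helper is a hypothetical name.

```python
def attention_mask(seq_len: int, causal: bool) -> list[list[int]]:
    """Return a seq_len x seq_len mask; 1 means position i may attend to position j."""
    return [
        [1 if (not causal or j <= i) else 0 for j in range(seq_len)]
        for i in range(seq_len)
    ]


tokens = ["The", "future", "of", "AI", "is"]
bidirectional = attention_mask(len(tokens), causal=False)  # encoder-style (BERT, RoBERTa)
causal = attention_mask(len(tokens), causal=True)          # decoder-style (left-to-right)

for name, mask in (("bidirectional", bidirectional), ("causal", causal)):
    print(name)
    for token, row in zip(tokens, mask):
        print(f"  {token:>6}: {row}")
```

In the bidirectional mask every token sees every other token, which suits understanding tasks. In the causal mask each token sees only itself and earlier tokens, which is what next-token prediction requires.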

### 1.2 Testing RoBERTa - Expected: Similar failure to BERT

In [5]:
roberta_gen_pipeline = pipeline(
    task='text-generation',
    model='roberta-base',
    device=0 if __import__('torch').cuda.is_available() else -1
)

roberta_output = roberta_gen_pipeline(
    test_prompt,
    max_new_tokens=25,
    num_return_sequences=1
)

print(f"Generated Text: {roberta_output[0]['generated_text']}")

If you want to use `RobertaLMHeadModel` as a standalone, add `is_decoder=True.`


Device set to use cpu


Generated Text: The future of Artificial Intelligence is


RoBERTa improves upon BERT through enhanced training strategies but has the same encoder-only architecture. Hence, optimization alone does not enable text generation.

### 1.3 Testing BART - Expected: Variable results

In [6]:
bart_gen_pipeline = pipeline(
    task='text-generation',
    model='facebook/bart-base',
    device=0 if __import__('torch').cuda.is_available() else -1
)

bart_output = bart_gen_pipeline(
    test_prompt,
    max_new_tokens=25,
    num_return_sequences=1
)

print(f"Generated Text: {bart_output[0]['generated_text']}")

Some weights of BartForCausalLM were not initialized from the model checkpoint at facebook/bart-base and are newly initialized: ['lm_head.weight', 'model.decoder.embed_tokens.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Device set to use cpu


Generated Text: The future of Artificial Intelligence is accomplishments accomplishments classmate classmate classmate Farms classmate Drunk Drunk Drunk Pagan thri appearing widget compilation MethodSOURCE Ips Ips genuine highway highway Treatment possession


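BART has an encoder–decoder architecture, so it can generate text in principle. Here, however, the `text-generation` pipeline loads it as `BartForCausalLM`, and the warning above shows that `lm_head.weight` and `model.decoder.embed_tokens.weight` were newly initialized rather than loaded from the checkpoint. The result is a stream of unrelated tokens: the architecture fits the task, but the base checkpoint was pre-trained for denoising, not for stand-alone next-token prediction.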

## Experiment #2: Token Masking and Prediction

### Hypothesis
This is the native task for BERT and RoBERTa. They should excel here. BART may work but uses different pre-training objectives.

### Test Sentences
Note: The models use different mask tokens. BERT expects `[MASK]`, while RoBERTa and BART expect `<mask>`.
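One way to avoid hard-coding a separate sentence per model is to build the input from a template. The sketch below is an assumption-laden helper, not part of the original notebook: `MASK_TOKENS`, `build_masked_input`, and the `{mask}` placeholder are hypothetical names, and the token values are the ones used in this experiment. With a loaded tokenizer, the same value is available as its `mask_token` attribute.

```python
# Mask tokens as used in this notebook's inputs.
MASK_TOKENS = {
    "bert-base-uncased": "[MASK]",
    "roberta-base": "<mask>",
    "facebook/bart-base": "<mask>",
}


def build_masked_input(template: str, model_name: str) -> str:
    """Fill the hypothetical {mask} placeholder with the model's mask token."""
    return template.format(mask=MASK_TOKENS[model_name])


template = "The goal of Generative AI is to {mask} new content."
for model_name in MASK_TOKENS:
    print(f"{model_name:>20}: {build_masked_input(template, model_name)}")
```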

In [13]:
print("\n" + "═" * 70)
print("EXPERIMENT 2: MASKED TOKEN PREDICTION")
print("═" * 70)

# Different mask tokens for different models
bert_masked_input = "The goal of Generative AI is to [MASK] new content."
roberta_masked_input = "The goal of Generative AI is to <mask> new content."
bart_masked_input = "The goal of Generative AI is to <mask> new content."

print("\nTask: Predict the masked word\n")
print("-" * 70)


══════════════════════════════════════════════════════════════════════
EXPERIMENT 2: MASKED TOKEN PREDICTION
══════════════════════════════════════════════════════════════════════

Task: Predict the masked word

----------------------------------------------------------------------


### 2.1 Testing BERT

In [14]:
bert_mask_pipeline = pipeline(
    task='fill-mask',
    model='bert-base-uncased',
    device=0 if __import__('torch').cuda.is_available() else -1
)

bert_predictions = bert_mask_pipeline(bert_masked_input)

print("Top 5 Predictions:\n")
for idx, prediction in enumerate(bert_predictions[:5], start=1):
    word = prediction['token_str'].strip()
    probability = prediction['score']
    print(f"  {idx}. '{word}' → Confidence: {probability:.2%}")

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use cpu


Top 5 Predictions:

  1. 'create' → Confidence: 53.97%
  2. 'generate' → Confidence: 15.58%
  3. 'produce' → Confidence: 5.41%
  4. 'develop' → Confidence: 4.45%
  5. 'add' → Confidence: 1.76%


BERT is trained specifically on masked language modeling, where some tokens are hidden during training, so this task is a perfect fit and high-confidence predictions are expected.
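The confidence values reported by a fill-mask model are probabilities over the vocabulary at the masked position. The sketch below shows the general idea with a softmax over a five-word toy vocabulary; the logits are made up for illustration and are not BERT's actual scores.

```python
import math


def softmax(logits: dict[str, float]) -> dict[str, float]:
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(score - peak) for tok, score in logits.items()}
    total = sum(exps.values())
    return {tok: value / total for tok, value in exps.items()}


# Hypothetical logits at the [MASK] position over a toy vocabulary.
mask_logits = {"create": 4.0, "generate": 2.8, "produce": 1.7, "develop": 1.5, "banana": -3.0}

probs = softmax(mask_logits)
top_k = sorted(probs.items(), key=lambda item: item[1], reverse=True)[:3]
for rank, (token, prob) in enumerate(top_k, start=1):
    print(f"  {rank}. '{token}' -> {prob:.2%}")
```

A model trained directly on this objective tends to concentrate probability on a few plausible words, which matches the peaked distribution BERT produced above.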

### 2.2 Testing RoBERTa

In [15]:
print(roberta_masked_input)

roberta_mask_pipeline = pipeline(
    task='fill-mask',
    model='roberta-base',
    device=0 if __import__('torch').cuda.is_available() else -1
)

roberta_predictions = roberta_mask_pipeline(roberta_masked_input)

print("Top 5 RoBERTa Predictions:\n")
for idx, prediction in enumerate(roberta_predictions[:5], start=1):
    word = prediction['token_str'].strip()
    probability = prediction['score']
    print(f"  {idx}. '{word}' → Confidence: {probability:.2%}")

The goal of Generative AI is to <mask> new content.


Device set to use cpu


Top 5 RoBERTa Predictions:

  1. 'generate' → Confidence: 37.11%
  2. 'create' → Confidence: 36.77%
  3. 'discover' → Confidence: 8.35%
  4. 'find' → Confidence: 2.13%
  5. 'provide' → Confidence: 1.65%


RoBERTa uses dynamic masking and is trained on significantly more data with larger batches, so it typically matches or outperforms BERT on masked language modeling tasks.
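The idea behind dynamic masking can be sketched in a few lines. This is a simplified illustration: real masked language modeling selects about 15% of tokens and sometimes keeps or randomly replaces them instead of masking, which is omitted here. The `mask_tokens` helper is a hypothetical name, and a higher masking rate is used only so the effect is visible on one short sentence.

```python
import random


def mask_tokens(tokens: list[str], rng: random.Random, mask_prob: float = 0.15) -> list[str]:
    """Replace each token with <mask> independently with probability mask_prob."""
    return [("<mask>" if rng.random() < mask_prob else tok) for tok in tokens]


sentence = "The goal of Generative AI is to create new content .".split()

# Static masking: the pattern is sampled once and reused in every epoch.
static_view = mask_tokens(sentence, random.Random(0), mask_prob=0.3)
static_epochs = [static_view for _ in range(3)]

# Dynamic masking: a fresh pattern is sampled each time the sentence is seen.
rng = random.Random(0)
dynamic_epochs = [mask_tokens(sentence, rng, mask_prob=0.3) for _ in range(3)]

for epoch, (static, dynamic) in enumerate(zip(static_epochs, dynamic_epochs), start=1):
    print(f"epoch {epoch} static : {' '.join(static)}")
    print(f"epoch {epoch} dynamic: {' '.join(dynamic)}")
```

With dynamic masking the model is asked to recover different words from the same sentence across epochs, so it sees more varied training signals from the same data.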

### 2.3 Testing BART

In [16]:
print(bart_masked_input)
bart_mask_pipeline = pipeline(
    task='fill-mask',
    model='facebook/bart-base',
    device=0 if __import__('torch').cuda.is_available() else -1
)

bart_predictions = bart_mask_pipeline(bart_masked_input)

print("Top 5 Predictions:\n")
for idx, prediction in enumerate(bart_predictions[:5], start=1):
    word = prediction['token_str'].strip()
    probability = prediction['score']
    print(f"  {idx}. '{word}' → Confidence: {probability:.2%}")

The goal of Generative AI is to <mask> new content.


Device set to use cpu


Top 5 Predictions:

  1. 'create' → Confidence: 7.46%
  2. 'help' → Confidence: 6.57%
  3. 'provide' → Confidence: 6.09%
  4. 'enable' → Confidence: 3.59%
  5. 'improve' → Confidence: 3.32%


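BART is pre-trained as a denoising autoencoder with objectives such as text infilling and sentence permutation. It can fill a `<mask>`, but its infilling objective replaces whole spans with a single mask rather than individual tokens, which is consistent with the lower, more evenly spread confidence scores seen here.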
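The two noising operations named above can be sketched as simple functions. This is a rough illustration under simplifying assumptions (fixed span position and length, whitespace tokens); `text_infill` and `permute_sentences` are hypothetical helpers, not BART's actual pre-processing code.

```python
import random


def text_infill(tokens: list[str], start: int, length: int) -> list[str]:
    """Replace tokens[start:start + length] with a single <mask> token."""
    return tokens[:start] + ["<mask>"] + tokens[start + length:]


def permute_sentences(sentences: list[str], rng: random.Random) -> list[str]:
    """Return the sentences in a shuffled order."""
    shuffled = list(sentences)
    rng.shuffle(shuffled)
    return shuffled


tokens = "The goal of Generative AI is to create new content .".split()
corrupted = text_infill(tokens, start=7, length=2)  # hides the span "create new"
print(" ".join(corrupted))

doc = ["BERT is an encoder.", "BART adds a decoder.", "Both use attention."]
print(permute_sentences(doc, random.Random(0)))
```

Because one `<mask>` may stand for several words, the model must also decide how much text is missing, which differs from predicting exactly one token per mask.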

## Experiment #3: Extractive Question Answering

### Hypothesis
Base models (not fine-tuned on QA datasets like SQuAD) will show poor performance. Encoder architectures are suitable but need fine-tuning.

### Test Configuration

In [17]:
print("\n" + "═" * 70)
print("EXPERIMENT 3: EXTRACTIVE QUESTION ANSWERING")
print("═" * 70)

qa_context = "Generative AI poses significant risks such as hallucinations, bias, and deepfakes."
qa_question = "What are the risks?"

print(f"\nContext: {qa_context}")
print(f"Question: {qa_question}\n")
print("-" * 70)


══════════════════════════════════════════════════════════════════════
EXPERIMENT 3: EXTRACTIVE QUESTION ANSWERING
══════════════════════════════════════════════════════════════════════

Context: Generative AI poses significant risks such as hallucinations, bias, and deepfakes.
Question: What are the risks?

----------------------------------------------------------------------


### 3.1 Testing BERT

In [18]:
bert_qa_pipeline = pipeline(
    task='question-answering',
    model='bert-base-uncased',
    device=0 if __import__('torch').cuda.is_available() else -1
)

bert_qa_result = bert_qa_pipeline(
    question=qa_question,
    context=qa_context
)

print("BERT Result:")
print(f"  Extracted Answer: '{bert_qa_result['answer']}'")
print(f"  Confidence Score: {bert_qa_result['score']:.4f}")
print(f"  Position: Characters {bert_qa_result['start']}-{bert_qa_result['end']}")

Some weights of BertForQuestionAnswering were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Device set to use cpu


BERT Result:
  Extracted Answer: 'hallucinations'
  Confidence Score: 0.0063
  Position: Characters 46-60


BERT’s architecture can support span extraction, but the base model isn’t fine-tuned for question answering, so its randomly initialized output layers lead to poor performance.
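Span extraction can be sketched as picking the start and end positions with the highest combined score. The example below is a simplified illustration: `best_span` is a hypothetical helper, the logits are made up, and the actual pipeline's scoring and post-processing differ in detail.

```python
def best_span(start_logits: list[float], end_logits: list[float], max_len: int = 8) -> tuple[int, int]:
    """Return (start, end) maximizing start_logits[start] + end_logits[end] with start <= end."""
    best, best_score = (0, 0), float("-inf")
    for s, s_score in enumerate(start_logits):
        # Only consider spans of at most max_len tokens that end inside the context.
        for e in range(s, min(s + max_len, len(end_logits))):
            score = s_score + end_logits[e]
            if score > best_score:
                best, best_score = (s, e), score
    return best


context = "Generative AI poses significant risks such as hallucinations , bias , and deepfakes .".split()

# Hypothetical logits that a fine-tuned QA head might produce (made up for illustration).
start_logits = [0.0] * len(context)
end_logits = [0.0] * len(context)
start_logits[7] = 5.0   # "hallucinations"
end_logits[12] = 5.0    # "deepfakes"

s, e = best_span(start_logits, end_logits)
print(" ".join(context[s:e + 1]))
```

When the output layer is randomly initialized, the start and end logits carry no learned signal, so the highest-scoring span is effectively arbitrary and its score stays low.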

### 3.2 Testing RoBERTa

In [19]:
roberta_qa_pipeline = pipeline(
    task='question-answering',
    model='roberta-base',
    device=0 if __import__('torch').cuda.is_available() else -1
)

roberta_qa_result = roberta_qa_pipeline(
    question=qa_question,
    context=qa_context
)

print("RoBERTa Result:")
print(f"  Extracted Answer: '{roberta_qa_result['answer']}'")
print(f"  Confidence Score: {roberta_qa_result['score']:.4f}")
print(f"  Position: Characters {roberta_qa_result['start']}-{roberta_qa_result['end']}")

Some weights of RobertaForQuestionAnswering were not initialized from the model checkpoint at roberta-base and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Device set to use cpu


RoBERTa Result:
  Extracted Answer: 'Generative AI poses significant risks such as hallucinations, bias,'
  Confidence Score: 0.0078
  Position: Characters 0-67


RoBERTa has the same QA-capable architecture as BERT, but stronger pre-training can’t replace task-specific fine-tuning, which is needed to correctly map questions and context to answer spans.

### 3.3 Testing BART

In [20]:
bart_qa_pipeline = pipeline(
    task='question-answering',
    model='facebook/bart-base',
    device=0 if __import__('torch').cuda.is_available() else -1
)

bart_qa_result = bart_qa_pipeline(
    question=qa_question,
    context=qa_context
)

print("BART Result:")
print(f"  Extracted Answer: '{bart_qa_result['answer']}'")
print(f"  Confidence Score: {bart_qa_result['score']:.4f}")
print(f"  Position: Characters {bart_qa_result['start']}-{bart_qa_result['end']}")

if bart_qa_result['score'] < 0.1:
    print("\nVery low confidence - likely random extraction")

Some weights of BartForQuestionAnswering were not initialized from the model checkpoint at facebook/bart-base and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Device set to use cpu


BART Result:
  Extracted Answer: 'and deepfakes'
  Confidence Score: 0.0723
  Position: Characters 68-81

Very low confidence - likely random extraction


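BART is most often used for generative tasks such as abstractive QA, whereas extractive QA requires selecting a span from the context. The pipeline can still attach a span-prediction head (`BartForQuestionAnswering`), but the warning above shows that `qa_outputs` was newly initialized, so without fine-tuning the extracted span ('and deepfakes', score 0.0723) is essentially arbitrary.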

## Conclusion

### Complete Observation Matrix

The table below summarizes the results observed in the experiments above:

| Task           | Model   | Classification (Success/Failure) | Observation (What actually happened?)                      | Why did this happen? (Architectural Reason)                                     |
| -------------- | ------- | -------------------------------- | ---------------------------------------------------------- | ------------------------------------------------------------------------------- |
| **Generation** | BERT    | Failure                          | Appended only a run of periods to the prompt; `is_decoder` warning shown. | Encoder-only, bidirectional attention; not trained for next-token generation. |
|                | RoBERTa | Failure                          | Returned the prompt unchanged with no new text. | Same encoder-only architecture as BERT; no decoder or causal masking. |
|                | BART    | Partial Success                  | Generated new tokens, but they were incoherent and repetitive. | Has a decoder, but was trained for denoising, and the causal LM head was newly initialized. |
| **Fill-Mask**  | BERT    | Success                          | Top predictions were “create” (53.97%) and “generate” (15.58%). | Trained directly on Masked Language Modeling (MLM). |
|                | RoBERTa | Success                          | “generate” (37.11%) and “create” (36.77%) were nearly tied at the top. | Optimized MLM training with dynamic masking and more data. |
|                | BART    | Moderate Success                 | Predicted valid words, but the top confidence was only 7.46% (“create”). | Trained for span-level text reconstruction, not single-token masking. |
| **QA**         | BERT    | Failure                          | Extracted only 'hallucinations' with a score of 0.0063. | QA architecture exists, but the QA head is randomly initialized without fine-tuning. |
|                | RoBERTa | Failure                          | Extracted nearly the whole sentence with a score of 0.0078. | Better pre-training cannot replace task-specific QA fine-tuning. |
|                | BART    | Failure                          | Extracted 'and deepfakes' with a score of 0.0723. | Pre-trained for denoising and generation; the QA head is newly initialized and not fine-tuned. |

---

## Inferences

* **Architecture Constrains Capability:** A model's architecture determines which tasks it can perform natively; the pipelines still ran on mismatched models, but the outputs were unusable.
* **Pre-training Creates Specialization:** Models perform best on tasks aligned with their pre-training objectives.
* **Fine-tuning Enables Performance:** Task-specific fine-tuning is required for reliable, real-world results.
* **Practical Implication:** Effective model selection requires aligning task, architecture, and training.

---
