In [1]:
!pip install -q transformers sentencepiece

In [2]:
from transformers import pipeline  # the Auto* classes are not needed; pipeline() picks the model class per task
import torch



In [5]:
device = 0 if torch.cuda.is_available() else -1
print("Using device:", "GPU" if device == 0 else "CPU")

Using device: CPU


# Models used

In [3]:
bert_model = "bert-base-uncased"
roberta_model = "roberta-base"
bart_model = "facebook/bart-base"

In [13]:
# None of these checkpoints is a causal LM, so the loading warnings below are expected.
gen_bert = pipeline("text-generation", model=bert_model, framework="pt", device=device)
gen_roberta = pipeline("text-generation", model=roberta_model, framework="pt", device=device)
gen_bart = pipeline("text-generation", model=bart_model, framework="pt", device=device)

If you want to use `BertLMHeadModel` as a standalone, add `is_decoder=True.`
Device set to use cpu
If you want to use `RobertaLMHeadModel` as a standalone, add `is_decoder=True.`
Device set to use cpu
Some weights of BartForCausalLM were not initialized from the model checkpoint at facebook/bart-base and are newly initialized: ['lm_head.weight', 'model.decoder.embed_tokens.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Device set to use cpu


In [8]:
mask_bert = pipeline("fill-mask", model=bert_model, device=device)
mask_roberta = pipeline("fill-mask", model=roberta_model, device=device)
mask_bart = pipeline("fill-mask", model=bart_model, device=device)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use cpu
Device set to use cpu
Device set to use cpu


In [9]:
qa_bert = pipeline("question-answering", model=bert_model, device=device)
qa_roberta = pipeline("question-answering", model=roberta_model, device=device)
qa_bart = pipeline("question-answering", model=bart_model, device=device)

Some weights of BertForQuestionAnswering were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Device set to use cpu
Some weights of RobertaForQuestionAnswering were not initialized from the model checkpoint at roberta-base and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Device set to use cpu
Some weights of BartForQuestionAnswering were not initialized from the model checkpoint at facebook/bart-base and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Device set to use cpu
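The "newly initialized" warnings above mean that the `qa_outputs` layers start from random values, so the QA results later in this notebook come from untrained heads. A minimal sketch of how the same information could be checked in code, assuming `from_pretrained` accepts `output_loading_info=True` and returns a dict with a `missing_keys` entry (true for the transformers versions I know of, but worth verifying for yours). The helper names `newly_initialized` and `check_qa_head` are hypothetical.

```python
def newly_initialized(loading_info):
    """Return the parameter names the checkpoint did not provide."""
    return sorted(loading_info.get("missing_keys", []))


def check_qa_head(model_name):
    """Load a QA model and list the weights that were randomly initialized."""
    # Imported lazily so newly_initialized() can be used without transformers.
    from transformers import AutoModelForQuestionAnswering

    model, info = AutoModelForQuestionAnswering.from_pretrained(
        model_name, output_loading_info=True
    )
    return newly_initialized(info)


# Hand-written dict shaped like the loading info. The key names are copied
# from the warning text above, not produced by running the library here.
example_info = {
    "missing_keys": ["qa_outputs.weight", "qa_outputs.bias"],
    "unexpected_keys": [],
}
print(newly_initialized(example_info))
```

Calling `check_qa_head(bert_model)` would download the checkpoint; a non-empty result is a sign that the head needs fine-tuning before its answers mean anything.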


# Text generation

In [31]:
prompt = "Once upon a time"

In [32]:
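def try_generation(generator, name):
    """Sample a continuation of the global `prompt` and print it, or print the error."""
    print(f"\n{name} Generation:")
    # BERT's tokenizer defines no EOS token, so prefer the pad token and fall back to EOS.
    tok = generator.tokenizer
    pad_id = tok.pad_token_id if tok.pad_token_id is not None else tok.eos_token_id
    try:
        output = generator(
            prompt,
            max_new_tokens=40,
            do_sample=True,          # sample instead of greedy decoding
            temperature=0.8,         # values below 1 sharpen the next-token distribution
            top_p=0.9,               # nucleus sampling over the top 90% probability mass
            repetition_penalty=1.2,
            no_repeat_ngram_size=2,
            pad_token_id=pad_id,
        )
        print(output[0]["generated_text"])
    except Exception as e:
        print("Failed with error:", str(e))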

try_generation(gen_bert, "BERT")
try_generation(gen_roberta, "RoBERTa")
try_generation(gen_bart, "BART")


BERT Generation:
Once upon a time..''. and and.?. ;.!. i..... ".,. a a. - - t. (. / ) ) and " "'" and

RoBERTa Generation:
Once upon a time...."." " " ( "( "( " ) ) )) )))))))))))))))))))))))))))))))).))(( (( ((((((('(''''' "' "'"'

BART Generation:
Once upon a timeManufact Philosophyaineriors Doct1998uuphotos Dan linking.— suscept wasteland salesman:{wu escalated pageantcing EVE schedcritacoconscious FawDenver Coalition Fall Windowsaviourfascistasma July Bl flavours shaved resurrected dietary repud
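The `temperature` and `top_p` arguments passed to the generator above control how the next token is sampled. The toy sketch below illustrates the idea on a made-up 5-token vocabulary. It is not the library's implementation, and the logits and function names are hypothetical.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Turn raw scores into probabilities; a lower temperature sharpens them."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p=0.9):
    """Keep the smallest set of most likely tokens whose mass reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    # Renormalize so the surviving probabilities sum to 1.
    return {i: probs[i] / mass for i in kept}


# Hypothetical logits, for illustration only.
toy_logits = [2.0, 1.0, 0.5, -1.0, -3.0]
probs = softmax_with_temperature(toy_logits, temperature=0.8)
nucleus = top_p_filter(probs, top_p=0.9)
print(sorted(nucleus))  # prints [0, 1, 2]
```

Sampling then draws only from the surviving tokens. These settings shape how a model's distribution is sampled; they cannot repair a model whose next-token distribution is itself meaningless, which is what the outputs above show.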


# Fill Mask

In [25]:
bert_text = "The goal of Generative AI is to [MASK] new content."
roberta_text = "The goal of Generative AI is to <mask> new content."
bart_text = "The goal of Generative AI is to <mask> new content."
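The three strings above differ only in the mask token: BERT expects `[MASK]`, while RoBERTa and BART expect `<mask>`. A small sketch of building them from one template, so the sentence cannot drift between models. The helper `make_masked_text` and the `{mask}` placeholder are hypothetical names for this illustration.

```python
def make_masked_text(template, mask_token):
    """Replace the single '{mask}' placeholder with a model-specific mask token."""
    if template.count("{mask}") != 1:
        raise ValueError("template must contain exactly one '{mask}' placeholder")
    return template.replace("{mask}", mask_token)


template = "The goal of Generative AI is to {mask} new content."
print(make_masked_text(template, "[MASK]"))
print(make_masked_text(template, "<mask>"))

# With the pipelines created earlier, the token could be read from the
# tokenizer instead of being hard-coded, for example:
#   make_masked_text(template, mask_bert.tokenizer.mask_token)
```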


In [26]:
def try_fill_mask(model, text, name):
    print(f"\n{name} Fill-Mask Results:")
    try:
        outputs = model(text)
        for o in outputs[:5]:
            print(o["sequence"], " | score:", round(o["score"], 4))
    except Exception as e:
        print("Failed with error:", str(e))

try_fill_mask(mask_bert, bert_text, "BERT")
try_fill_mask(mask_roberta, roberta_text, "RoBERTa")
try_fill_mask(mask_bart, bart_text, "BART")



BERT Fill-Mask Results:
the goal of generative ai is to create new content.  | score: 0.5397
the goal of generative ai is to generate new content.  | score: 0.1558
the goal of generative ai is to produce new content.  | score: 0.0541
the goal of generative ai is to develop new content.  | score: 0.0445
the goal of generative ai is to add new content.  | score: 0.0176

RoBERTa Fill-Mask Results:
The goal of Generative AI is to generate new content.  | score: 0.3711
The goal of Generative AI is to create new content.  | score: 0.3677
The goal of Generative AI is to discover new content.  | score: 0.0835
The goal of Generative AI is to find new content.  | score: 0.0213
The goal of Generative AI is to provide new content.  | score: 0.0165

BART Fill-Mask Results:
The goal of Generative AI is to create new content.  | score: 0.0746
The goal of Generative AI is to help new content.  | score: 0.0657
The goal of Generative AI is to provide new content.  | score: 0.0609
The goal of Generative

# Question Answering

In [27]:
context = "Generative AI poses significant risks such as hallucinations, bias, and deepfakes."
question = "What are the risks?"

In [28]:
def try_qa(model, name):
    print(f"\n{name} QA Result:")
    try:
        answer = model(question=question, context=context)
        print("Answer:", answer["answer"])
        print("Score:", round(answer["score"], 4))
    except Exception as e:
        print("Failed with error:", str(e))

try_qa(qa_bert, "BERT")
try_qa(qa_roberta, "RoBERTa")
try_qa(qa_bart, "BART")


BERT QA Result:
Answer: hallucinations
Score: 0.0145

RoBERTa QA Result:
Answer: Generative AI poses significant risks such as hallucinations, bias, and deepfakes
Score: 0.0114

BART QA Result:
Answer: Generative AI poses significant risks such as
Score: 0.0188
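All three scores above are close to 0.01. As far as I understand the question-answering pipeline, a span's score is the product of a start-token probability and an end-token probability, so a head that spreads probability almost evenly over the context can only produce tiny scores. The sketch below illustrates that reasoning with made-up probabilities. It is a rough sanity check under that assumption, not the pipeline's actual code, and `best_span` is a hypothetical helper.

```python
def best_span(start_probs, end_probs, max_answer_len=15):
    """Return (score, start, end) maximizing start_probs[s] * end_probs[e], s <= e."""
    best = (0.0, 0, 0)
    for s, p_start in enumerate(start_probs):
        last = min(len(end_probs), s + max_answer_len)
        for e in range(s, last):
            score = p_start * end_probs[e]
            if score > best[0]:
                best = (score, s, e)
    return best


# Untrained head: probability spread evenly over 20 hypothetical tokens.
n = 20
uniform = [1.0 / n] * n
print("uniform:", best_span(uniform, uniform))

# Trained-looking head: most of the mass on one start and one end token.
peaked_start = [0.02, 0.02, 0.92, 0.02, 0.02]
peaked_end = [0.02, 0.02, 0.02, 0.92, 0.02]
print("peaked:", best_span(peaked_start, peaked_end))
```

The uniform case scores about 1/400, while the peaked case scores about 0.85 for the span from token 2 to token 3. Scores in the 0.01 range therefore suggest that the answer boundaries carry little real signal.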


| Task | Model | Classification (Success/Failure) | Observation (What actually happened?) | Why did this happen? (Architectural Reason) |
| :--- | :--- | :--- | :--- | :--- |
| **Generation** | **BERT** | Failure | Output was mostly punctuation, repeated words like “and”, and broken fragments instead of a meaningful sentence. | BERT is an **encoder-only** model trained with Masked Language Modeling. It never learns left-to-right next-token prediction, so it cannot perform autoregressive generation. The loading warning also notes that it was used as a standalone LM head without `is_decoder=True`. |
| | **RoBERTa** | Failure | Produced repetitive symbols, quotes, and parentheses with no meaningful continuation. | RoBERTa is also **encoder-only**. Although its pretraining recipe improves on BERT's, it still has no decoder and was never trained to generate sequences token by token. |
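| | **BART** | Partial Failure | Generated real words but in a chaotic, nonsensical sequence (random topics, names, broken structure). | BART is an **encoder–decoder** model, so its architecture *can* generate text. Here, however, the `text-generation` pipeline loaded it as `BartForCausalLM`, and the loading warning reports that `lm_head.weight` and `model.decoder.embed_tokens.weight` were newly initialized instead of being read from the checkpoint. With an untrained output head the sampled tokens are effectively random. In addition, `bart-base` is pretrained for **denoising**, not open-ended story continuation. |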
| **Fill-Mask** | **BERT** | Success | Predicted highly relevant words like *create, generate, produce* with strong confidence scores. | BERT was **explicitly trained** on the Masked Language Modeling objective, so predicting missing words is its primary strength. |
| | **RoBERTa** | Success | Very accurate predictions similar to BERT, with *generate* and *create* as top outputs. | RoBERTa improves BERT’s MLM training with more data and longer training, making it strong at contextual word prediction. |
| | **BART** | Partial Success | Predicted reasonable but less precise words like *create, help, provide*. Confidence scores were much lower (top score 0.0746) and spread almost evenly. | BART was not pretrained with BERT-style MLM. Its objective is **denoising corrupted text**, where the decoder reconstructs the original sequence, so predicting a single masked token is related to its training but not its main specialization. |
| **QA** | **BERT** | Partial Success | Extracted the single word *“hallucinations”* and missed the rest of the list. The score (0.0145) is very low. | `bert-base-uncased` is **not fine-tuned for extractive QA**. The loading warning shows that the `qa_outputs` head was newly initialized, so the span boundaries come from untrained weights and the partly correct answer is largely chance. |
| | **RoBERTa** | Partial Success | Returned almost the entire context sentence instead of a concise span (score 0.0114). | Same cause: the `qa_outputs` head of `roberta-base` was newly initialized. Without QA fine-tuning the model cannot predict answer boundaries, even though its encoder represents the passage well. |
| | **BART** | Partial Success | Returned the partial phrase *“Generative AI poses significant risks such as”*, which stops just before the list (score 0.0188). | The pipeline loaded `BartForQuestionAnswering`, which performs **extractive span selection** like the other two models, and its `qa_outputs` head was also newly initialized. The cut-off span reflects the untrained head rather than BART's generative design. |
