In [4]:
!pip install transformers torch --quiet
from transformers import pipeline


### Hypothesis
Encoder-only models (BERT, RoBERTa) should fail or produce degenerate output, because they are
trained with masked language modeling rather than auto-regressive text generation.
BART should succeed, because it includes a decoder trained to generate text.
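
The architectural distinction behind this hypothesis can be illustrated with attention masks. The following is a minimal sketch and is not taken from any of the three models: the helper names `bidirectional_mask` and `causal_mask` are hypothetical, and the code only shows which positions each token is allowed to attend to.

```python
def bidirectional_mask(n):
    """Encoder-style mask: every position may attend to every position."""
    return [[1] * n for _ in range(n)]


def causal_mask(n):
    """Decoder-style mask: position i may attend only to positions 0..i."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]


# Print both masks for a 4-token sequence (1 = may attend, 0 = blocked).
print("bidirectional:")
for row in bidirectional_mask(4):
    print(row)
print("causal:")
for row in causal_mask(4):
    print(row)
```

A model trained only under the bidirectional pattern has never had to predict a token from its left context alone, which is the skill that left-to-right generation relies on.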


In [5]:
models = {
    "BERT": "bert-base-uncased",
    "RoBERTa": "roberta-base",
    "BART": "facebook/bart-base"
}

prompt = "The future of Artificial Intelligence is"

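# Try the text-generation pipeline with each checkpoint and report any failure.
for name, model in models.items():
    print(f"\n{name} OUTPUT:")
    try:
        generator = pipeline("text-generation", model=model)
        # Note: max_length=30 is overridden by a max_new_tokens value of 256
        # that is already set; see the warning in the output below.
        output = generator(prompt, max_length=30, num_return_sequences=1)
        print(output[0]["generated_text"])
    except Exception as e:
        print("ERROR:", e)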


If you want to use `BertLMHeadModel` as a standalone, add `is_decoder=True.`



BERT OUTPUT:


Device set to use cpu
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Both `max_new_tokens` (=256) and `max_length`(=30) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)
If you want to use `RobertaLMHeadModel` as a standalone, add `is_decoder=True.`


The future of Artificial Intelligence is................................................................................................................................................................................................................................................................

RoBERTa OUTPUT:


Device set to use cpu
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Both `max_new_tokens` (=256) and `max_length`(=30) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


The future of Artificial Intelligence is

BART OUTPUT:


Some weights of BartForCausalLM were not initialized from the model checkpoint at facebook/bart-base and are newly initialized: ['lm_head.weight', 'model.decoder.embed_tokens.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Device set to use cpu
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Both `max_new_tokens` (=256) and `max_length`(=30) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


The future of Artificial Intelligence is outage CounselSalt When advised adore musSalt coloring Search spoof biggesttext spoof rhythm crowatin crow crow morp blacks DEL DEL DELffe advertisements adore spoof crow crow 720 crow lit adore adore adorerequires cementantis DEL crow crowEducation adore adoreと DEL DEL adore DEL denounce denounce adore cement organizational adore spoofCOR adore organizational mating adoresponsored crow crow adore crow adore adoretextou Americantis adore adore biggest biggesttext Sinn adore adoreparable drawer compos adoretext adore adore organizational DELvered adore adore DEL DEL biggest biggest Demons adoreLuckily adore adoreicrobial adore adore cement adore DEL biggest drawerparable biggest biggest drawer DELantisparable adore adoreRegarding adore drawer illust adore illusttext adoreと adore adore drawer drawer crow drawertexttext drawer drawerparableparableparableicrobialparable biggest adore adore biomass drawer Tirtext adore drawer DELtextantis drawer ador
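
The warnings above explain why the BERT and BART outputs run far past 30 tokens: a `max_new_tokens` value of 256 was already set, and it takes precedence over `max_length=30`. The sketch below is a toy model of that rule as the warning states it, not the library's implementation. `effective_new_tokens` is a hypothetical helper, and the prompt length of 8 tokens is an assumed figure for illustration.

```python
def effective_new_tokens(prompt_len, max_length=None, max_new_tokens=None):
    """Return how many tokens may be generated under the stated precedence rule.

    max_new_tokens counts generated tokens only; max_length counts the prompt
    plus the generated tokens. When both are set, max_new_tokens wins.
    """
    if max_new_tokens is not None:
        return max_new_tokens
    if max_length is not None:
        return max(max_length - prompt_len, 0)
    raise ValueError("set max_length or max_new_tokens")


# Assumed prompt length of 8 tokens, for illustration only.
print(effective_new_tokens(8, max_length=30, max_new_tokens=256))  # both set
print(effective_new_tokens(8, max_length=30))                      # max_length only
```

Under these assumptions, passing `max_new_tokens=30` in the generator call instead of `max_length=30` would likely be the more direct way to cap the continuation.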

In [6]:
fill_mask_models = {
    "BERT": ("bert-base-uncased", "[MASK]"),
    "RoBERTa": ("roberta-base", "<mask>"),
    "BART": ("facebook/bart-base", "<mask>")
}

# Each model uses its own mask token, so build the sentence per model.
for name, (model, mask_token) in fill_mask_models.items():
    print(f"\n{name} OUTPUT:")
    pipe = pipeline("fill-mask", model=model)
    sentence = f"The goal of Generative AI is to {mask_token} new content."
    results = pipe(sentence)
    # Show the three highest-scoring candidates for the masked position.
    for r in results[:3]:
        print(r["token_str"], " | score:", round(r["score"], 4))


Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).



BERT OUTPUT:


Device set to use cpu


create  | score: 0.5397
generate  | score: 0.1558
produce  | score: 0.0541

RoBERTa OUTPUT:


Device set to use cpu


 generate  | score: 0.3711
 create  | score: 0.3677
 discover  | score: 0.0835

BART OUTPUT:


Device set to use cpu


 create  | score: 0.0746
 help  | score: 0.0657
 provide  | score: 0.0609
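
The scores printed above are probabilities over the vocabulary for the masked position. As a rough illustration of how such scores arise, the sketch below applies a softmax to a handful of made-up logits and keeps the top three. The tokens and logit values are hypothetical and are not the models' actual outputs; `top_k_probabilities` is a helper written for this example.

```python
import math


def top_k_probabilities(logits, k=3):
    """Softmax over a token -> logit mapping, returning the k most likely tokens."""
    peak = max(logits.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(val - peak) for tok, val in logits.items()}
    total = sum(exps.values())
    probs = {tok: e / total for tok, e in exps.items()}
    return sorted(probs.items(), key=lambda item: item[1], reverse=True)[:k]


# Hypothetical logits for the masked position (not real model output).
toy_logits = {"create": 4.0, "generate": 3.0, "produce": 2.0, "eat": -1.0}
for token, prob in top_k_probabilities(toy_logits):
    print(token, " | score:", round(prob, 4))
```

A flatter set of logits yields lower top scores, which matches the pattern in the BART output, where the leading candidates all score below 0.08.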


In [7]:
qa_models = {
    "BERT": "bert-base-uncased",
    "RoBERTa": "roberta-base",
    "BART": "facebook/bart-base"
}

context = "Generative AI poses significant risks such as hallucinations, bias, and deepfakes."
question = "What are the risks?"

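# These are base checkpoints, so each QA head is newly initialized
# (see the "newly initialized" warnings in the output below).
for name, model in qa_models.items():
    print(f"\n{name} OUTPUT:")
    qa = pipeline("question-answering", model=model)
    result = qa(question=question, context=context)
    print("Answer:", result["answer"])
    print("Score:", round(result["score"], 4))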



BERT OUTPUT:


Some weights of BertForQuestionAnswering were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Device set to use cpu
Some weights of RobertaForQuestionAnswering were not initialized from the model checkpoint at roberta-base and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Answer: hallucinations, bias, and deepfakes
Score: 0.0085

RoBERTa OUTPUT:


Device set to use cpu
Some weights of BartForQuestionAnswering were not initialized from the model checkpoint at facebook/bart-base and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Answer: risks such as hallucinations, bias, and deepfakes
Score: 0.0081

BART OUTPUT:


Device set to use cpu


Answer: Generative
Score: 0.0078
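
Extractive QA heads score every token as a possible answer start and answer end, and the pipeline returns a high-scoring span. The sketch below is a simplified version of that selection step, assuming per-token start and end logits are already available; `best_span` and the logit values are hypothetical. With a newly initialized `qa_outputs` layer, as the warnings above report, the real logits carry no learned signal, which is consistent with the near-zero scores.

```python
def best_span(start_logits, end_logits, max_answer_len=15):
    """Return (start, end, score) maximizing start + end logit with start <= end."""
    best = None
    for i, s in enumerate(start_logits):
        # Only consider ends at or after the start, within the length limit.
        for j in range(i, min(i + max_answer_len, len(end_logits))):
            score = s + end_logits[j]
            if best is None or score > best[2]:
                best = (i, j, score)
    return best


tokens = ["risks", "such", "as", "hallucinations", ",", "bias"]
# Hypothetical logits resembling a fine-tuned head (not real model output).
start = [0.1, 0.0, 0.2, 3.0, 0.0, 0.5]
end = [0.0, 0.1, 0.0, 0.4, 0.1, 2.5]
i, j, _ = best_span(start, end)
print(" ".join(tokens[i : j + 1]))
```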


| Task | Model | Classification (Success / Partial Success / Failure) | Observation (What actually happened?) | Why did this happen? (Architectural Reason) |
| :--- | :--- | :--- | :--- | :--- |
| **Generation** | BERT | *Failure* | The model appended only a long run of periods to the prompt and produced no meaningful continuation. | BERT is an encoder-only model trained for masked token prediction, not auto-regressive text generation. The pipeline warning also shows it was loaded without `is_decoder=True`. |
|  | RoBERTa | *Failure* | The model echoed the prompt without generating any additional content. | RoBERTa is also encoder-only and lacks a decoder to generate new tokens sequentially. |
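|  | BART | *Partial Success* | The model generated a long sequence of tokens, but the output was incoherent gibberish. | BART has a decoder, but the pipeline loaded it as `BartForCausalLM` and the log reports that `lm_head.weight` and `model.decoder.embed_tokens.weight` were newly initialized, so those weights are untrained. BART is also pretrained for sequence-to-sequence denoising, not the causal language modeling used in free text generation. |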
| **Fill-Mask** | BERT | *Success* | The model confidently predicted appropriate words such as "create", "generate", and "produce". | BERT was explicitly trained using Masked Language Modeling (MLM), making it well-suited for predicting missing words. |
|  | RoBERTa | *Success* | The model predicted "generate" and "create" with nearly equal scores (0.3711 and 0.3677). | RoBERTa is pretrained with MLM only (dynamic masking, no next-sentence prediction, more data), so masked token prediction is its native task. |
|  | BART | *Partial Success* | The model predicted reasonable words such as "create", but with much lower scores (0.0746 for the top prediction). | BART is pretrained with denoising objectives such as text infilling, where one `<mask>` can stand for a span of several tokens, rather than single-token MLM, so masked token prediction is not its primary strength. |
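| **QA** | BERT | *Partial Success* | The model extracted the correct answer span, but with a very low score (0.0085). | BERT’s encoder architecture supports extractive QA, but `bert-base-uncased` was not fine-tuned on a QA dataset such as SQuAD. The log shows the `qa_outputs` head was newly initialized, so the correct span may be coincidental. |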
|  | RoBERTa | *Partial Success* | The model returned a longer span that contains the correct answer, but with a very low score (0.0081). | RoBERTa provides strong contextual representations, but its `qa_outputs` head was newly initialized and therefore untrained. |
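|  | BART | *Failure* | The model returned an incorrect answer ("Generative") with a very low score (0.0078). | As with BERT and RoBERTa, the log shows the `qa_outputs` head of `BartForQuestionAnswering` was newly initialized. With an untrained head the selected span is essentially arbitrary, so task-specific fine-tuning is required for QA. |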
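
Several rows above hinge on whether a task head was loaded from the checkpoint or newly initialized. One way to check this programmatically, rather than by reading warnings, may be the `output_loading_info=True` argument of `from_pretrained`. The sketch below assumes that argument returns a dictionary with a `missing_keys` entry. `untrained_head_params` and `check_checkpoint` are hypothetical helpers, and `check_checkpoint` is left uncalled because it downloads the model.

```python
def untrained_head_params(missing_keys, head_prefixes=("qa_outputs", "lm_head")):
    """Return the missing parameter names that belong to a task head."""
    return [k for k in missing_keys if k.startswith(head_prefixes)]


def check_checkpoint(model_name):
    """Load a QA model and list head parameters absent from the checkpoint.

    Requires the transformers library and network access; the import is
    local so the helper above can be used without them.
    """
    from transformers import AutoModelForQuestionAnswering

    _, info = AutoModelForQuestionAnswering.from_pretrained(
        model_name, output_loading_info=True
    )
    return untrained_head_params(info["missing_keys"])


# Parameter names copied from the warnings in this notebook.
print(untrained_head_params(["qa_outputs.bias", "qa_outputs.weight"]))
print(untrained_head_params(["lm_head.weight", "model.decoder.embed_tokens.weight"]))
```

A non-empty result for a checkpoint suggests the head needs fine-tuning before its predictions mean anything.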
