In [1]:
!pip install -qU transformers peft trl accelerate bitsandbytes datasets
!pip install -qU flash-attn --no-build-isolation


---
The task is to use a foundation model to detect fake news, framed as binary classification into `real` or `fake`. The dataset is `mohammadjavadpirhadi/fake-news-detection-dataset-english`, available from Hugging Face datasets [here](https://huggingface.co/datasets/mohammadjavadpirhadi/fake-news-detection-dataset-english).

In [2]:
from datasets import load_dataset

# Load the train and test splits
train_dataset = load_dataset("mohammadjavadpirhadi/fake-news-detection-dataset-english", split='train')
test_dataset = load_dataset("mohammadjavadpirhadi/fake-news-detection-dataset-english", split='test')

# Downsample the datasets due to memory and processing time constraints
train_dataset = train_dataset.shuffle(seed=99).select(range(5000))
test_dataset = test_dataset.shuffle(seed=99).select(range(500))

In [3]:
id2label = {0: "real", 1: "fake"}
label2id = {"real": 0, "fake": 1}  # must be the exact inverse of id2label

In [4]:
# Let's look at the first example in the train set, which is a real news article
print(f"""
title: {train_dataset[0]['title']}
label: {id2label[train_dataset[0]['label']]}
text:
{train_dataset[0]['text']}
""")


title: Lebanon's finances can cope with PM resignation: finance minister
label: real
text:
BEIRUT (Reuters) - Lebanon and its financial institutions can cope with the impact of Prime Minister Saad Hariri s surprise resignation, Finance Minister Ali Hassan Khalil said on Monday.  We are confident in the stability of the financial and monetary situation in the country. There are no very big challenges ahead of us,  Khalil said in a televised statement after a meeting on the economy chaired by President Michel Aoun.  The state is able to finance itself,  he said.   



In [5]:
# Let's look at the second example in the train set, which is a fake news article
print(f"""
title: {train_dataset[1]['title']}
label: {id2label[train_dataset[1]['label']]}
text:
{train_dataset[1]['text']}
""")


title:  Sean Hannity Is Totally Butthurt Over This Onion Picture
label: fake
text:
Sean Hannity is giddy and offended that finally, just like his Great Orange Leader, there s a bloody Sean Hannity joke picture out there, and he finally gets to share in the victim outrage.All of Hannity s righteous indignation is over an Onion (yes, the satire site) article with the headline,  Hundreds Of Miniature Sean Hannitys Burst From Roger Ailes  Corpse.  The picture is what really got under Hannity s thin yet completely abrasive skin. It should several of him, like in the movie  Alien  bursting from what looks like a white shirt.Hannity, the man who bled advertisers over a false murder accusation toward Hillary Clinton, was just appalled that his 15 year old daughter would see such a horrible picture.What is wrong with the left that they think these sorts of things are funny? https://t.co/sAxON5xxmh  Sean Hannity (@seanhannity) June 1, 2017Personally, I don t think it s funny, but whatever. And 

---
Import the required libraries. I'm using the Supervised Fine-Tuning Trainer (`SFTTrainer`) from the [trl](https://huggingface.co/docs/trl/en/index) package.

In [6]:
from peft import LoraConfig, AutoPeftModelForCausalLM
import torch
import transformers
from trl import SFTTrainer
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments, BitsAndBytesConfig, pipeline
from transformers.pipelines.pt_utils import KeyDataset

The foundation model I'll be using comes from Microsoft's Phi-3 family: `Phi-3-mini-4k-instruct`, a 3.8B-parameter model whose reported performance is comparable to many larger 7B/8B models, including `Mistral-7b-v0.1` and `Llama-3-8B-Instruct`.

Source: [microsoft/Phi-3-mini-4k-instruct](https://huggingface.co/microsoft/Phi-3-mini-4k-instruct)

I'll also be fine-tuning the model with QLoRA (4-bit quantization and LoRA).
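
As a rough back-of-the-envelope sketch of why 4-bit loading helps, the snippet below estimates the memory taken by the weights of a 3.8B-parameter model at a few precisions. The helper name `weight_memory_gb` is hypothetical, and the estimate deliberately ignores quantization constants, activations, optimizer state, and the LoRA adapters, so it is an order-of-magnitude guide rather than a measurement.

```python
def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Approximate memory for the model weights alone, in GB (10^9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


NUM_PARAMS = 3.8e9  # parameter count quoted for Phi-3-mini above

# Compare full bf16 weights with 8-bit and 4-bit (nf4) storage.
for name, bits in [("bf16", 16), ("int8", 8), ("nf4", 4)]:
    print(f"{name}: ~{weight_memory_gb(NUM_PARAMS, bits):.1f} GB")
```

Under these assumptions the 4-bit weights need about a quarter of the bf16 footprint, which is what leaves room for long sequences and adapter training on a single GPU.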

In [7]:
checkpoint_path = "microsoft/Phi-3-mini-4k-instruct"
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
# loading the model with flash-attention support
model_kwargs = dict(
    use_cache=False,
    trust_remote_code=True,
    attn_implementation="flash_attention_2",
    torch_dtype=torch.bfloat16,
    quantization_config=quantization_config,
    device_map="auto"
)
model = AutoModelForCausalLM.from_pretrained(checkpoint_path, **model_kwargs)
tokenizer = AutoTokenizer.from_pretrained(checkpoint_path)
tokenizer.model_max_length = 3072
tokenizer.pad_token = tokenizer.unk_token  # use unk rather than eos token to prevent endless generation
tokenizer.pad_token_id = tokenizer.convert_tokens_to_ids(tokenizer.pad_token)
tokenizer.padding_side = 'left'


In [8]:
def apply_chat_template(example, tokenizer, return_response=False):
    """Function to apply the chat template required by phi-3 models.
    
    Parameters:
    example (record): A record of dataset
    tokenizer (tokenizer): Tokenizer
    return_response (boolean): Whether to append the label (answer) for fine-tuning

    Returns:
    record: A modified record of dataset
    """
    news = example["title"] + "\n" + example["text"]
    messages =[
        {"role": "system", "content": "You are a helpful AI assistant."},
        {"role": "user", 
         "content": f"Please classify whether the following news is fake or real.\n{news}\nOnly answer in one word: `fake` or `real`."},
    ]
    if return_response:
        messages.append({"role": "assistant", "content": id2label[example["label"]]})
    example["messages"] = messages
    example["text"] = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=False)
    return example
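
To make the structure of the prompt easier to inspect, here is a minimal sketch that mirrors only the message-building step of `apply_chat_template`, without needing the tokenizer. The helper `build_messages` and the `toy` record are hypothetical and exist purely for illustration; the actual string passed to the model is still produced by `tokenizer.apply_chat_template`.

```python
id2label = {0: "real", 1: "fake"}


def build_messages(example, return_response=False):
    """Build the chat messages for one record (same logic as above, no tokenizer)."""
    news = example["title"] + "\n" + example["text"]
    messages = [
        {"role": "system", "content": "You are a helpful AI assistant."},
        {"role": "user",
         "content": f"Please classify whether the following news is fake or real.\n{news}\nOnly answer in one word: `fake` or `real`."},
    ]
    if return_response:
        # For fine-tuning, the gold label becomes the assistant turn.
        messages.append({"role": "assistant", "content": id2label[example["label"]]})
    return messages


# Made-up record, not taken from the dataset.
toy = {"title": "Example headline", "text": "Example body.", "label": 1}
for message in build_messages(toy, return_response=True):
    print(message["role"], "->", repr(message["content"]))
```

With `return_response=False` the conversation ends on the user turn, which is the form used for inference; with `return_response=True` the label is appended as the assistant turn, which is the form used for supervised fine-tuning.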

In [9]:
processed_train_dataset = train_dataset.map(
    apply_chat_template,
    fn_kwargs={"tokenizer": tokenizer},
    num_proc=10,
    desc="Applying chat template to train",
)

processed_test_dataset = test_dataset.map(
    apply_chat_template,
    fn_kwargs={"tokenizer": tokenizer},
    num_proc=10,
    desc="Applying chat template to test",
)

In [10]:
pipe = pipeline( 
    "text-generation", 
    model=model, 
    tokenizer=tokenizer,
)

generation_args = {
    "batch_size": 4,
    "max_new_tokens": 500, 
    "return_full_text": False, 
    "temperature": 0.0, 
    "do_sample": False, 
} 

In [11]:
# Run the inference (generation) pipeline on the first example in the train set, which is a real news article
# The model is able to classify it correctly
output = pipe(processed_train_dataset[0]['messages'], **generation_args)

output[0]['generated_text']

The `seen_tokens` attribute is deprecated and will be removed in v4.41. Use the `cache_position` model input instead.


' real'

In [12]:
# Run the inference (generation) pipeline on the second example in the train set, which is a fake news article
# The model is able to classify it correctly
output = pipe(processed_train_dataset[1]['messages'], **generation_args)

output[0]['generated_text']

' fake'

Let's run an accuracy evaluation on the whole test set.

In [13]:
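def predict_test_set(pipe):
    """Generate a prediction for every test record and return lowercased, stripped strings."""
    # The parameter is named `pipe` so it does not shadow the imported `pipeline` factory
    preds = []
    for output in pipe(KeyDataset(processed_test_dataset, 'messages'), **generation_args):
        preds.append(output[0]['generated_text'].lower().strip())
    return preds

def evaluate_test_set(preds):
    """Return accuracy; a prediction counts as correct if it contains the true label word."""
    num_correct = 0
    for pred, test_record in zip(preds, processed_test_dataset):
        if id2label[test_record['label']] in pred:
            num_correct += 1
    return num_correct / len(processed_test_dataset)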

In [14]:
# Predict and evaluate on whole test set
preds = predict_test_set(pipe)
evaluate_test_set(preds)

Token indices sequence length is longer than the specified maximum sequence length for this model (3544 > 3072). Running this sequence through the model will result in indexing errors


0.858

Without fine-tuning, the model classifies the 500-example test sample with an accuracy of 85.8%.
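
A single accuracy figure hides which class the errors fall on. The sketch below, using hypothetical helpers `confusion_counts` and `recall` and made-up toy predictions, shows one way to break the results down per class. Unlike the substring check above, it treats an output that contains both label words (or neither) as `other` rather than as correct; that stricter rule is my assumption, not part of the evaluation used in this notebook.

```python
from collections import Counter


def confusion_counts(preds, true_labels):
    """Count (true label, predicted label) pairs from free-text model outputs."""
    counts = Counter()
    for pred, true_label in zip(preds, true_labels):
        text = pred.lower()
        has_real, has_fake = "real" in text, "fake" in text
        if has_real != has_fake:  # exactly one label word is present
            predicted = "real" if has_real else "fake"
        else:  # neither or both words present: ambiguous output
            predicted = "other"
        counts[(true_label, predicted)] += 1
    return counts


def recall(counts, label):
    """Fraction of records with this true label that were predicted correctly."""
    total = sum(n for (true_label, _), n in counts.items() if true_label == label)
    return counts[(label, label)] / total if total else float("nan")


# Made-up outputs and labels, for illustration only.
toy_preds = ["real", "fake", "fake", "real", "fake or real"]
toy_labels = ["real", "fake", "real", "fake", "fake"]
counts = confusion_counts(toy_preds, toy_labels)
print(dict(counts))
print("recall(real):", recall(counts, "real"))
print("recall(fake):", recall(counts, "fake"))
```

In this notebook the same function could be called with `preds` and `[id2label[r['label']] for r in processed_test_dataset]`.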

In [14]:
# Delete the pipeline and free up GPU memory
del pipe
torch.cuda.empty_cache()

In [15]:
# Process the train dataset again with the answers / labels for fine-tuning
processed_train_dataset = train_dataset.map(
    apply_chat_template,
    fn_kwargs={"tokenizer": tokenizer, "return_response": True},
    num_proc=10,
    desc="Applying chat template to train",
)

In [None]:
# Fine-tune the model for 2 epochs
training_config = {
    "bf16": True,
    "do_eval": False,
    "learning_rate": 2e-4,
    "num_train_epochs": 2,
    "max_steps": -1,
    "output_dir": "./checkpoint_dir",
    "overwrite_output_dir": True,
    "per_device_train_batch_size": 8,
    "gradient_accumulation_steps": 4,
    "remove_unused_columns": True,
    "save_steps": 100,
    "save_total_limit": 1,
    "seed": 0,
    "warmup_ratio": 0.2,
    }

# Use a LoRA rank of 16
peft_config = {
    "r": 16,
    "lora_alpha": 32,
    "lora_dropout": 0.05,
    "bias": "none",
    "task_type": "CAUSAL_LM",
    "target_modules": "all-linear",
    "modules_to_save": None,
}

train_conf = TrainingArguments(**training_config)
peft_conf = LoraConfig(**peft_config)

# Use supervised (instruction) fine-tuning
trainer = SFTTrainer(
    model=model,
    args=train_conf,
    peft_config=peft_conf,
    train_dataset=processed_train_dataset,
    eval_dataset=processed_test_dataset,
    max_seq_length=3072,
    dataset_text_field="text",
    tokenizer=tokenizer,
    packing=True
)

train_result = trainer.train()
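
For a sense of scale, the following sketch estimates how many parameters the rank-16 adapters add. A LoRA adapter on a linear layer with shape `d_in x d_out` adds `r * (d_in + d_out)` parameters. The layer shapes below (hidden size 3072, MLP size 8192, 32 decoder layers, fused `qkv_proj` and `gate_up_proj`) are my assumptions about Phi-3-mini and should be checked against `model.config`; the helper `lora_params` is hypothetical.

```python
def lora_params(d_in: int, d_out: int, r: int) -> int:
    """LoRA adds two low-rank matrices per layer: A (r x d_in) and B (d_out x r)."""
    return r * (d_in + d_out)


R, ALPHA = 16, 32                     # from peft_config above
HIDDEN, MLP, LAYERS = 3072, 8192, 32  # ASSUMED Phi-3-mini dimensions

# ASSUMED linear layers inside each decoder block, as (d_in, d_out).
ASSUMED_LINEAR_SHAPES = {
    "qkv_proj": (HIDDEN, 3 * HIDDEN),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_up_proj": (HIDDEN, 2 * MLP),
    "down_proj": (MLP, HIDDEN),
}

per_layer = sum(lora_params(d_in, d_out, R) for d_in, d_out in ASSUMED_LINEAR_SHAPES.values())
total = per_layer * LAYERS
print(f"adapter parameters: {total:,} ({total / 3.8e9:.2%} of 3.8B)")
print("LoRA scaling factor (lora_alpha / r):", ALPHA / R)
# per_device_train_batch_size * gradient_accumulation_steps from training_config
print("sequences per optimizer step per device:", 8 * 4)
```

If the assumed shapes are right, the adapters amount to well under 1% of the base model's parameters, which is why only a small set of weights needs gradients and optimizer state during training.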

In [17]:
metrics = train_result.metrics
trainer.log_metrics("train", metrics)
trainer.save_metrics("train", metrics)
trainer.save_state()
trainer.save_model()

In [18]:
# Delete the model and trainer and free up GPU memory
del model
del trainer
torch.cuda.empty_cache()

In [19]:
# Reload the LoRA and merge with original model
model = AutoPeftModelForCausalLM.from_pretrained(
    training_config['output_dir'],
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    device_map="auto"
)

merged_model = model.merge_and_unload()


In [25]:
from huggingface_hub import notebook_login

notebook_login()


In [26]:
merged_model.push_to_hub("zanelim/Phi-3-mini-4k-instruct-fakenews")


CommitInfo(commit_url='https://huggingface.co/zanelim/Phi-3-mini-4k-instruct-fakenews/commit/f623352e396f1fc6d51d482d923f04c019c10af9', commit_message='Upload Phi3ForCausalLM', commit_description='', oid='f623352e396f1fc6d51d482d923f04c019c10af9', pr_url=None, pr_revision=None, pr_num=None)

In [23]:
# Recreate the pipeline using the fine-tuned model
pipe = pipeline( 
    "text-generation", 
    model=merged_model,
    tokenizer=tokenizer,
)

In [24]:
preds = predict_test_set(pipe)
evaluate_test_set(preds)

0.99

After fine-tuning, the model classifies the test set with an accuracy of 99%, up from 85.8% for the base model on the same 500 examples.
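
Because the test sample has only 500 examples, it is worth checking how much of the gap could be sampling noise. The sketch below computes a 95% Wilson score interval for each accuracy. The helper `wilson_interval` is hypothetical, and the calculation assumes the test examples are independent draws, which is a simplification.

```python
import math


def wilson_interval(p_hat: float, n: int, z: float = 1.96):
    """Wilson score interval for a proportion p_hat observed over n trials."""
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half


N_TEST = 500  # size of the downsampled test set

for name, acc in [("base", 0.858), ("fine-tuned", 0.99)]:
    lo, hi = wilson_interval(acc, N_TEST)
    print(f"{name}: {acc:.3f} (95% CI roughly {lo:.3f} to {hi:.3f})")
```

Under these assumptions the two intervals do not overlap, which supports reading the improvement as more than noise on this sample.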

In [36]:
correct_preds = []
for pred, test_record in zip(preds, test_dataset):
    if pred.find(id2label[test_record['label']]) != -1:
        correct_preds.append((pred, id2label[test_record['label']], test_record['title'] + '\n' + test_record['text']))

In [37]:
len(correct_preds)

495

In [38]:
# Let's look at the first 2 correctly predicted examples from test set
for pred, label, text in correct_preds[:2]:
    print(f"Predicted label: {pred}")
    print(f"Actual label: {label}")
    print(f"News: {text}")
    print()

Predicted label: real
Actual label: real
News: Denmark set to become next European country to ban burqas
COPENHAGEN (Reuters) - Denmark looks set to become the next European country to restrict the burqa and the niqab, worn by some Muslim women, after most parties in the Danish parliament backed some sort of ban on facial coverings. Full and partial face veils such as burqas and niqabs divide opinion across Europe, setting advocates of religious freedom against secularists and those who argue that such garments are culturally alien or a symbol of the oppression of women. The niqab covers everything but the eyes, while the burqa also covers the eyes with a transparent veil. France, Belgium, the Netherlands, Bulgaria and the German state of Bavaria have all imposed some restrictions on the wearing of full-face veils in public places.  This is not a ban on religious clothing, this is a ban on masking,  Jacob Ellemann-Jensen, spokesman for the Liberal Party, told reporters on Friday after 

In [31]:
wrong_preds = []
for pred, test_record in zip(preds, test_dataset):
    if pred.find(id2label[test_record['label']]) == -1:
        wrong_preds.append((pred, id2label[test_record['label']], test_record['title'] + '\n' + test_record['text']))

In [33]:
len(wrong_preds)

5

In [35]:
# Let's look at the first 2 wrongly predicted examples from test set
for pred, label, text in wrong_preds[:2]:
    print(f"Predicted label: {pred}")
    print(f"Actual label: {label}")
    print(f"News: {text}")
    print()

Predicted label: fake
Actual label: real
News: Trump says pope will wish he was president if Vatican is attacked by Islamic State
(Reuters) - U.S. Republican presidential candidate Donald Trump, responding to Pope Francis’ calling him “not Christian” because of his positions on immigration, said on Thursday the pope would have wished Trump was president if Islamic State attacked Vatican.  “If and when the Vatican is attacked by the ISIS, which as everyone knows is ISIS’s ultimate trophy, I can promise you that the pope would have only wished and prayed that Donald Trump would have been president,” Trump said in a speech in South Carolina, using an acronym for the militant group.  (Reporting by Mohammad Zargham in Washington) This article was funded in part by SAP. It was independently created by the Reuters editorial staff. SAP had no editorial involvement in its creation or production.

Predicted label: real
Actual label: fake
News: RONALD REAGAN: Remembering Our Heroes [Video]
This s