# Fine-tune GPT-2 for Fake News Classification
In this notebook, we fine-tune the GPT-2 model for fake news classification using the Hugging Face Transformers library.

## 1. Installations

In [2]:
# Install the accelerate and transformers libraries.
!pip install accelerate -U
!pip install transformers[torch]



In [3]:
import torch

# Check if a GPU is available and if not, use a CPU
device = torch.device(
    "cuda" if torch.cuda.is_available() else "cpu")
print(f'Using device: {device}')

Using device: cuda


## 2. Import

In [4]:
import io
import os
import torch
import numpy as np
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go


from sklearn.metrics import (accuracy_score, classification_report, f1_score,
                             precision_score, recall_score)
from sklearn.model_selection import train_test_split
from transformers import (set_seed,
                          GPT2ForSequenceClassification,
                          GPT2Tokenizer,
                          GPT2Config,
                          Trainer,
                          TrainingArguments,
                          get_linear_schedule_with_warmup)
from sklearn.preprocessing import LabelEncoder
from plotly.subplots import make_subplots
from tqdm.notebook import tqdm
from torch.utils.data import Dataset, DataLoader

## 2. Dataset
For this task, the dataset we are going to use is available at the following [link](https://drive.google.com/drive/folders/1hry-KMpj4seazY4J-IIVrmOV-qgjhVFV?usp=sharing).

In [5]:
df1 = pd.read_csv("/kaggle/input/news-article/modified_news_article.csv")
df1.head(10)

Unnamed: 0.1,Unnamed: 0,title,text,label
0,0,muslims busted they stole millions in govt ben...,print they should pay all the back all the mon...,Real
1,1,re why did attorney general loretta lynch plea...,why did attorney general loretta lynch plead t...,Real
2,2,breaking weiner cooperating with fbi on hillar...,red state \nfox news sunday reported this mor...,Real
3,3,pin drop speech by father of daughter kidnappe...,email kayla mueller was a prisoner and torture...,Real
4,4,fantastic trumps point plan to reform healthc...,email healthcare reform to make america great ...,Real
5,5,hillary goes absolutely berserk on protester a...,print hillary goes absolutely berserk she expl...,Real
6,6,breaking nypd ready to make arrests in weiner ...,breaking nypd ready to make arrests in weiner ...,Real
7,7,wow whistleblower tells chilling story of mass...,breaking nypd ready to make arrests in weiner ...,Real
8,8,breaking clinton clearedwas this a coordinated...,limbaugh said that the revelations in the wiki...,Real
9,9,evil hillary supporters yell fck trumpburn tru...,email \nthese people are sick and evil they wi...,Real


In [6]:
df1['label'] = df1['label'].str.lower()

In [7]:
df1.head()

Unnamed: 0.1,Unnamed: 0,title,text,label
0,0,muslims busted they stole millions in govt ben...,print they should pay all the back all the mon...,real
1,1,re why did attorney general loretta lynch plea...,why did attorney general loretta lynch plead t...,real
2,2,breaking weiner cooperating with fbi on hillar...,red state \nfox news sunday reported this mor...,real
3,3,pin drop speech by father of daughter kidnappe...,email kayla mueller was a prisoner and torture...,real
4,4,fantastic trumps point plan to reform healthc...,email healthcare reform to make america great ...,real


We don't need the `title` column, so we keep only `text` and `label`.

In [8]:
df1 = df1[["text", "label"]]
df1.head()

Unnamed: 0,text,label
0,print they should pay all the back all the mon...,real
1,why did attorney general loretta lynch plead t...,real
2,red state \nfox news sunday reported this mor...,real
3,email kayla mueller was a prisoner and torture...,real
4,email healthcare reform to make america great ...,real


Let's check the second CSV file.

In [9]:
df2 = pd.read_csv("/kaggle/input/news-article/modified_news_article2.csv")
df2.head(10)

Unnamed: 0.1,Unnamed: 0,title,text,label
0,0,You Can Smell Hillary’s Fear,"Daniel Greenfield, a Shillman Journalism Fello...",FAKE
1,1,Watch The Exact Moment Paul Ryan Committed Pol...,Google Pinterest Digg Linkedin Reddit Stumbleu...,FAKE
2,2,Kerry to go to Paris in gesture of sympathy,U.S. Secretary of State John F. Kerry said Mon...,REAL
3,3,Bernie supporters on Twitter erupt in anger ag...,"— Kaydee King (@KaydeeKing) November 9, 2016 T...",FAKE
4,4,The Battle of New York: Why This Primary Matters,It's primary day in New York and front-runners...,REAL
5,5,"Tehran, USA","\nI’m not an immigrant, but my grandparents ...",FAKE
6,6,Girl Horrified At What She Watches Boyfriend D...,"Share This Baylee Luciani (left), Screenshot o...",FAKE
7,7,‘Britain’s Schindler’ Dies at 106,A Czech stockbroker who saved more than 650 Je...,REAL
8,8,Fact check: Trump and Clinton at the 'commande...,Hillary Clinton and Donald Trump made some ina...,REAL
9,9,Iran reportedly makes new push for uranium con...,Iranian negotiators reportedly have made a las...,REAL


In [10]:
df2['label'] = df2['label'].str.lower()

In [11]:
df2 = df2[["text", "label"]]
df2.head()

Unnamed: 0,text,label
0,"Daniel Greenfield, a Shillman Journalism Fello...",fake
1,Google Pinterest Digg Linkedin Reddit Stumbleu...,fake
2,U.S. Secretary of State John F. Kerry said Mon...,real
3,"— Kaydee King (@KaydeeKing) November 9, 2016 T...",fake
4,It's primary day in New York and front-runners...,real


Now, merge these two dataframes.

In [12]:
df = pd.concat([df1,df2])

In [13]:
df

Unnamed: 0,text,label
0,print they should pay all the back all the mon...,real
1,why did attorney general loretta lynch plead t...,real
2,red state \nfox news sunday reported this mor...,real
3,email kayla mueller was a prisoner and torture...,real
4,email healthcare reform to make america great ...,real
...,...,...
6330,The State Department told the Republican Natio...,real
6331,The ‘P’ in PBS Should Stand for ‘Plutocratic’ ...,fake
6332,Anti-Trump Protesters Are Tools of the Oligar...,fake
6333,"ADDIS ABABA, Ethiopia —President Obama convene...",real


In [14]:
df.describe()

Unnamed: 0,text,label
count,8380,8380
unique,7996,2
top,"Killing Obama administration rules, dismantlin...",fake
freq,58,4455


In [15]:
# Drop duplicate rows
df = df.drop_duplicates(subset=['text'])

In [16]:
df.describe()

Unnamed: 0,text,label
count,7996,7996
unique,7996,2
top,print they should pay all the back all the mon...,fake
freq,1,4278


In [17]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 7996 entries, 0 to 6334
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   text    7996 non-null   object
 1   label   7996 non-null   object
dtypes: object(2)
memory usage: 187.4+ KB


Check the target balance.

In [18]:
colors = ['gold', 'mediumturquoise']
# Derive labels from value_counts() so each share is paired with its own class
values = df['label'].value_counts(normalize=True)
labels = values.index

fig = go.Figure(data=[go.Pie(labels=labels, values=values, hole=.3)])
fig.update_traces(hoverinfo='label+percent', textinfo='percent', textfont_size=20,
                  marker=dict(colors=colors, line=dict(color='#000000', width=2)))
fig.update_layout(
    title_text="Target Balance",
    title_font_color="white",
    legend_title_font_color="yellow",
    paper_bgcolor="black",
    plot_bgcolor='black',
    font_color="white",
)
fig.show()

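As we can see, the classes are almost balanced: after dropping duplicates, 4,278 of the 7,996 articles (about 53.5%) are fake and the rest are real.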

Let's transform the text labels into numbers:
```real = 1 and fake = 0```
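
The mapping above follows from how label encoders typically number classes: the distinct labels are sorted and indexed in that order, so `fake` comes before `real`. The following is a minimal pure-Python sketch of that idea, not scikit-learn's implementation; the toy `labels` list is a hypothetical stand-in for `df['label']`.

```python
# Toy labels standing in for df['label'] (hypothetical values)
labels = ["real", "fake", "real", "fake", "fake"]

# Sort the distinct classes, then number them in that order
classes = sorted(set(labels))
mapping = {name: index for index, name in enumerate(classes)}

# Replace every text label with its integer id
encoded = [mapping[label] for label in labels]

print(mapping)
print(encoded)
```

After fitting the real encoder, inspecting `label_encoder.classes_` is a quick way to confirm which integer belongs to which class.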

In [19]:
# Initialize the LabelEncoder
label_encoder = LabelEncoder()

# Work on an explicit copy so pandas does not raise a SettingWithCopyWarning
df = df.copy()

# Fit and transform the 'label' column (classes are sorted, so fake -> 0 and real -> 1)
df['label'] = label_encoder.fit_transform(df['label'])
df.head()


Unnamed: 0,text,label
0,print they should pay all the back all the mon...,1
1,why did attorney general loretta lynch plead t...,1
2,red state \nfox news sunday reported this mor...,1
3,email kayla mueller was a prisoner and torture...,1
4,email healthcare reform to make america great ...,1


In [20]:
df

Unnamed: 0,text,label
0,print they should pay all the back all the mon...,1
1,why did attorney general loretta lynch plead t...,1
2,red state \nfox news sunday reported this mor...,1
3,email kayla mueller was a prisoner and torture...,1
4,email healthcare reform to make america great ...,1
...,...,...
6330,The State Department told the Republican Natio...,1
6331,The ‘P’ in PBS Should Stand for ‘Plutocratic’ ...,0
6332,Anti-Trump Protesters Are Tools of the Oligar...,0
6333,"ADDIS ABABA, Ethiopia —President Obama convene...",1


Split the dataset into training (90%) and validation (10%) sets, stratified by label.


In [21]:
X = list(df["text"])
y = list(df["label"])

X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.1,stratify=y)

In [22]:
print(len(X_train))
print(len(y_train))
print(len(X_valid))
print(len(y_valid))

7196
7196
800
800
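
The sizes printed above can be checked by hand. The sketch below assumes that the validation size is the test fraction times the number of samples, rounded up, and that the training set gets the remainder; this is my understanding of how `train_test_split` resolves a fractional `test_size`, so treat the rounding rule as an assumption.

```python
import math

n_samples = 7996   # rows left after dropping duplicates
test_size = 0.1    # fraction passed to train_test_split

# Assumed rounding rule: validation size is rounded up, training gets the rest
n_valid = math.ceil(test_size * n_samples)
n_train = n_samples - n_valid

print(n_train, n_valid)
```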


A Transformer cannot take raw text as input, so the text needs to be converted into numbers:

1. First, tokenize the sentence.
2. Then assign each token an index based on its position in the vocabulary.
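
The two steps above can be sketched with a toy example. This is a deliberate simplification: it splits on whitespace and uses a hypothetical four-entry vocabulary, whereas GPT-2's actual tokenizer applies byte-level BPE over a vocabulary of 50,257 entries. The names `vocab` and `encode` are illustrative only.

```python
# Hypothetical toy vocabulary mapping each token to its index
vocab = {"I": 0, "like": 1, "you": 2, "<unk>": 3}


def encode(sentence, vocab):
    """Tokenize on whitespace, then look up each token's vocabulary index."""
    tokens = sentence.split()                      # step 1: tokenize
    unk_id = vocab["<unk>"]
    return [vocab.get(t, unk_id) for t in tokens]  # step 2: map tokens to ids


ids = encode("I like you", vocab)
print(ids)
```

Words missing from the toy vocabulary fall back to `<unk>`; byte-level BPE avoids that fallback by splitting unknown words into smaller known pieces.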

In [23]:
# Instantiate a default GPT-2 configuration (imported from transformers)
configuration = GPT2Config()
# Load the tokenizer of the smallest GPT-2 checkpoint. GPT-2 has no pad token,
# so reuse the end-of-sequence token for padding.
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token

sample_data = ["I like you","Alex!, play the music."]
tokenizer(sample_data, padding=True, truncation=True)

{'input_ids': [[40, 588, 345, 50256, 50256, 50256], [15309, 28265, 711, 262, 2647, 13]], 'attention_mask': [[1, 1, 1, 0, 0, 0], [1, 1, 1, 1, 1, 1]]}
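
In the output above, the shorter sentence is padded with id 50256 (the end-of-text token reused as padding) and its attention mask marks the padded positions with 0. The sketch below mimics that behaviour in plain Python, under the assumption of right-side padding; `pad_batch` is a hypothetical helper, not part of the tokenizer's API.

```python
PAD_ID = 50256  # id of GPT-2's end-of-text token, reused as the pad token here


def pad_batch(batch, pad_id=PAD_ID, max_length=None):
    """Right-pad (and optionally truncate) id sequences; build attention masks."""
    target = max_length or max(len(seq) for seq in batch)
    input_ids, attention_mask = [], []
    for seq in batch:
        seq = seq[:target]                      # truncate overly long sequences
        n_pad = target - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Token ids taken from the tokenizer output shown above
batch = [[40, 588, 345], [15309, 28265, 711, 262, 2647, 13]]
padded = pad_batch(batch)
print(padded)
```

Passing `max_length` pads or truncates every sequence to a fixed size, which is the role `padding="max_length", max_length=512` plays in the next cell.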

In [24]:
# Truncate or pad every article to 512 tokens; longer inputs could exhaust RAM or GPU memory during training
X_train_tokenized = tokenizer(X_train, truncation=True, padding="max_length", max_length=512)
X_valid_tokenized = tokenizer(X_valid, truncation=True, padding="max_length", max_length=512)

In [25]:
X_train_tokenized.keys()

dict_keys(['input_ids', 'attention_mask'])

Now we need to wrap the tokenized inputs in a PyTorch `Dataset`.

In [26]:
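class Dataset(torch.utils.data.Dataset):
    """Wrap tokenizer encodings (and optional labels) for use with the Trainer."""

    def __init__(self, encodings, labels=None):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        # Build one example as a dict of tensors (input_ids, attention_mask)
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        # Attach the label when one is available (labels are omitted at inference time)
        if self.labels is not None:
            item["labels"] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.encodings["input_ids"])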

In [27]:
train_dataset = Dataset(X_train_tokenized, y_train)
valid_dataset = Dataset(X_valid_tokenized, y_valid)

## 3. Model Instantiation and Training
We are going to fine-tune the GPT-2 model on our own dataset.

In [28]:
# Load the pretrained GPT-2 weights and add a 2-class classification head
model = GPT2ForSequenceClassification.from_pretrained('gpt2', num_labels=2).to(device)

# set the pad token of the model's configuration
model.config.pad_token_id = model.config.eos_token_id


Some weights of GPT2ForSequenceClassification were not initialized from the model checkpoint at gpt2 and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Create a `compute_metrics` function that measures the accuracy, precision, recall, and F1 score of the model at each evaluation.

In [29]:
def compute_metrics(p):
    pred, labels = p
    pred = np.argmax(pred, axis=1)

    accuracy = accuracy_score(y_true=labels, y_pred=pred)
    recall = recall_score(y_true=labels, y_pred=pred)
    precision = precision_score(y_true=labels, y_pred=pred)
    f1 = f1_score(y_true=labels, y_pred=pred)

    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1-score": f1}

Define the training arguments and the trainer.

In [30]:
args = TrainingArguments(
    output_dir="/kaggle/working/gpt-2_fake_news",
    num_train_epochs=6,
    evaluation_strategy="epoch",
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,

)
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    eval_dataset=valid_dataset,
    compute_metrics=compute_metrics
)
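
With these arguments, the number of optimization steps can be estimated ahead of time. The sketch below assumes a single device and no gradient accumulation, so one step consumes one batch of 8 examples; the variable names are illustrative.

```python
import math

n_train_examples = 7196  # size of the training split
batch_size = 8           # per_device_train_batch_size
num_epochs = 6           # num_train_epochs

# The last batch of each epoch may be smaller, hence the ceiling
steps_per_epoch = math.ceil(n_train_examples / batch_size)
total_steps = steps_per_epoch * num_epochs

print(steps_per_epoch, total_steps)
```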

Train the model.

In [31]:
trainer.train()

***** Running training *****
  Num examples = 7196
  Num Epochs = 6
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 5400

Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1-score
1,0.2935,0.23131,0.925,0.9875,0.849462,0.913295
2,0.1551,0.335129,0.9425,0.988024,0.887097,0.934844
3,0.0771,0.217151,0.95,0.951087,0.94086,0.945946
4,0.0376,0.284902,0.9475,0.941176,0.946237,0.9437
5,0.0157,0.320562,0.95125,0.966387,0.927419,0.946502
6,0.0046,0.357897,0.9525,0.963889,0.932796,0.948087


TrainOutput(global_step=5400, training_loss=0.0919587853219774, metrics={'train_runtime': 2941.7944, 'train_samples_per_second': 14.677, 'train_steps_per_second': 1.836, 'total_flos': 1.1281748503560192e+16, 'train_loss': 0.0919587853219774, 'epoch': 6.0})

## 4. Model Evaluation

In [32]:
trainer.evaluate()

***** Running Evaluation *****
  Num examples = 800
  Batch size = 8


{'eval_loss': 0.3578970432281494,
 'eval_accuracy': 0.9525,
 'eval_precision': 0.9638888888888889,
 'eval_recall': 0.9327956989247311,
 'eval_f1-score': 0.948087431693989,
 'eval_runtime': 17.7227,
 'eval_samples_per_second': 45.14,
 'eval_steps_per_second': 5.642,
 'epoch': 6.0}
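
To make the metrics above concrete, they can be recomputed from confusion-matrix counts, with "real" (label 1) as the positive class. The counts below are not logged anywhere in the notebook: they are inferred as the integer counts over 800 validation examples that reproduce the reported values, so treat them as an assumption.

```python
# Inferred (not logged) confusion counts for the 800 validation examples
tp = 347  # real articles predicted as real
fp = 13   # fake articles predicted as real
fn = 25   # real articles predicted as fake
tn = 415  # fake articles predicted as fake

accuracy = (tp + tn) / (tp + fp + fn + tn)
precision = tp / (tp + fp)   # share of predicted-real articles that are real
recall = tp / (tp + fn)      # share of real articles that were found
f1 = 2 * precision * recall / (precision + recall)

print(accuracy, precision, recall, f1)
```

Precision is higher than recall here, which means the model more often mistakes a real article for fake than the other way around.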

## 5. Inference

In [33]:
np.set_printoptions(suppress=True)

In [34]:
text = "Secret Underground City Discovered Beneath New York's Central Park! In a shocking revelation, a team of archaeologists claims to have unearthed a hidden underground city beneath the bustling streets of New York City's iconic Central Park. The mysterious metropolis, believed to date back centuries, is said to contain a labyrinth of tunnels, chambers, and forgotten relics. According to sources close to the investigation, the discovery was made during routine maintenance work in the park, when workers stumbled upon a concealed entrance hidden beneath a centuries-old oak tree. Upon further exploration, researchers were astounded to find a sprawling network of interconnected passages, complete with ancient hieroglyphics, elaborate murals, and even evidence of advanced engineering."
inputs = tokenizer(text, padding=True, truncation=True, return_tensors='pt').to(device)
# Inference only, so gradient tracking is not needed
with torch.no_grad():
    outputs = model(**inputs)
# Convert the two logits into class probabilities
predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
predictions = predictions.cpu().numpy()
print(predictions)

[[0.99999547 0.00000458]]


In [35]:
# 0=Fake, 1=Real
pred = np.argmax(predictions, axis=1)
if pred == 0:
    print("Fake News")
else:
    print("Real News")

Fake News
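
The two inference cells above apply a softmax to the logits and then pick the most probable class. The sketch below shows the same two steps in plain Python. The logits are hypothetical, since the notebook only prints probabilities, and `softmax`, `classify`, and `ID2LABEL` are illustrative names.

```python
import math

ID2LABEL = {0: "Fake News", 1: "Real News"}  # same convention as above: 0=Fake, 1=Real


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)                            # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(logits):
    """Return the predicted label and its probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)  # argmax
    return ID2LABEL[best], probs[best]


# Hypothetical logits for a single article
label, confidence = classify([6.0, -6.3])
print(label, confidence)
```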


Yes, the article above was fake news generated by ChatGPT.

In [38]:
text = (
    "students expressed their fear over a trump presidency in messages to each other that were being shared on twitter today literally scared for their lives is the new literally hitler notmypresident pictwittercomckfqdfce paul joseph watson prisonplanet november and finally this ridiculous and totally biased email was sent from university of michigan president to the students offering them assistance to help them through the results of our presidential election last night the president wants to ensure the students that the university remains committed to their most important responsibility at their school which is apparently to remain committed to education discovery and intellectual honesty  and to diversity equity and inclusion to all members of the university community  as im sure many of you did i watched the election coverage late into the night and had the opportunity to visit with students and staff at a resultswatching event sponsored by the ginsberg center at the michigan union it will take quite some time to completely absorb the results from yesterdays election understand the full implications and discern the longterm impact on our university and our nation  more immediately in the aftermath of a close and highly contentious election we continue to embrace our most important responsibility as a university community our responsibility is to remain committed to education discovery and intellectual honesty  and to diversity equity and inclusion we are at our best when we come together to engage respectfully across our ideological differences to support all who feel marginalized threatened or unwelcome and to pursue knowledge and understanding as we always have as the students faculty and staff of the university of michigan there are reports of members of our community offering support to one another students are planning a vigil tonight on the diag at  pm our center for research on teaching and learning also has numerous resources available for faculty "
    "seeking help in cultivating classroom environments that are responsive to national issues i also want to make everyone aware of some of the plans and events we have had in place for today and beyond  our gerald r ford school of public policy is holding a postelection analysis from  to  pm today in the weill halls annenberg auditorium speakers include former us congressman john dingell former ambassador ron weiser and faculty members mara ostfeld betsey stevenson and marina whitman  our history department has organized a community discussion led by faculty and students to include historical perspectives at  pm tonight in  tisch hall  the office of student life will provide resources and referrals for support on campus to students faculty and staff at a location in the michigan unions willis ward lounge it will be open today from  am to  pm  our office of multiethnic student affairs is offering an open space of support to help members of our community connect during open hours today mesas office is in the michigan union room   tomorrow our ginsberg center and counseling and psychological services office is facilitating a postelection dialogue impact perspectivetaking and moving forward this event is part of the student life professional development conference at    pm in the michigan leagues henderson room i know that other schools colleges and offices across our campus are planning events as well i thank everyone who is helping us come together and ask anyone scheduling a post election event post it on the university of michigan events calendar i hope all of us will continue to proudly embrace the opportunities before us as the students faculty and staff of a great public research university governed by the people elections are often times of great change but the values we stand for at um have been shaped over the course of nearly  years our mission remains as essential for society as ever to serve the people of michigan and the world through preeminence in creating "
    "communicating preserving and applying knowledge art and academic values and in developing leaders and citizens who will challenge the present and enrich the future i look forward to working together with all of you to advance the work we do in service of the public  and to ensure that the university of michigan will always be a welcoming place for all members of society sincerely  mark schlissel president"
)
inputs = tokenizer(text, padding=True, truncation=True, return_tensors='pt').to(device)
# Inference only, so gradient tracking is not needed
with torch.no_grad():
    outputs = model(**inputs)
# Convert the two logits into class probabilities
predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
predictions = predictions.cpu().numpy()
print(predictions)

[[0.00204054 0.9979595 ]]


In [39]:
# 0=Fake, 1=Real
pred = np.argmax(predictions, axis=1)
if pred == 0:
    print("Fake News")
else:
    print("Real News")

Real News
