https://www.kaggle.com/datasets/emineyetm/fake-news-detection-datasets/data

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split

In [2]:
file_path = '../combined.csv'
df = pd.read_csv(file_path)

In [3]:
df

Unnamed: 0,title,text,subject,date,label,id
0,Donald Trump Sends Out Embarrassing New Year’...,Donald Trump just couldn t wish all Americans ...,News,"December 31, 2017",0,0
1,Drunk Bragging Trump Staffer Started Russian ...,House Intelligence Committee Chairman Devin Nu...,News,"December 31, 2017",0,1
2,Sheriff David Clarke Becomes An Internet Joke...,"On Friday, it was revealed that former Milwauk...",News,"December 30, 2017",0,2
3,Trump Is So Obsessed He Even Has Obama’s Name...,"On Christmas day, Donald Trump announced that ...",News,"December 29, 2017",0,3
4,Pope Francis Just Called Out Donald Trump Dur...,Pope Francis used his annual Christmas Day mes...,News,"December 25, 2017",0,4
...,...,...,...,...,...,...
44893,'Fully committed' NATO backs new U.S. approach...,BRUSSELS (Reuters) - NATO allies on Tuesday we...,worldnews,"August 22, 2017",1,44893
44894,LexisNexis withdrew two products from Chinese ...,"LONDON (Reuters) - LexisNexis, a provider of l...",worldnews,"August 22, 2017",1,44894
44895,Minsk cultural hub becomes haven from authorities,MINSK (Reuters) - In the shadow of disused Sov...,worldnews,"August 22, 2017",1,44895
44896,Vatican upbeat on possibility of Pope Francis ...,MOSCOW (Reuters) - Vatican Secretary of State ...,worldnews,"August 22, 2017",1,44896


In [4]:
# Remove the "(Reuters)" tag from every entry in the 'text' column.
# The tag appears in the text of every real news article and in none of the fake ones,
# so any model (even a simple MLP) can learn to separate the classes just by looking for it.
# Remove the tag if you want the model to rely on the article content instead.

df['text'] = df['text'].str.replace(r'\s*\(Reuters\)', '', regex=True)
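One way to confirm that the tag really separates the classes, and that the replacement removed it, is to measure how often it occurs per label before and after. This is a minimal sketch on an invented toy frame; `toy` and `tag_rate_by_label` are illustrative names, not part of the notebook.

```python
import pandas as pd

# Toy stand-in for the real DataFrame; the rows are invented for illustration.
toy = pd.DataFrame({
    "text": [
        "BRUSSELS (Reuters) - NATO allies met on Tuesday.",
        "LONDON (Reuters) - A provider withdrew two products.",
        "Someone just couldn t stop tweeting.",
        "On Friday, it was revealed that something happened.",
    ],
    "label": [1, 1, 0, 0],
})

def tag_rate_by_label(frame, pattern=r"\(Reuters\)"):
    """Return the fraction of rows per label whose text matches `pattern`."""
    has_tag = frame["text"].str.contains(pattern, regex=True, na=False)
    return has_tag.groupby(frame["label"]).mean()

before = tag_rate_by_label(toy)
# Same regex as the notebook cell: drop the tag and any whitespace before it.
toy["text"] = toy["text"].str.replace(r"\s*\(Reuters\)", "", regex=True)
after = tag_rate_by_label(toy)

print("before:", before.to_dict())
print("after: ", after.to_dict())
```

A rate close to 1.0 for one label and 0.0 for the other before the replacement would indicate the kind of leakage described in the comment.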

In [5]:
df

Unnamed: 0,title,text,subject,date,label,id
0,Donald Trump Sends Out Embarrassing New Year’...,Donald Trump just couldn t wish all Americans ...,News,"December 31, 2017",0,0
1,Drunk Bragging Trump Staffer Started Russian ...,House Intelligence Committee Chairman Devin Nu...,News,"December 31, 2017",0,1
2,Sheriff David Clarke Becomes An Internet Joke...,"On Friday, it was revealed that former Milwauk...",News,"December 30, 2017",0,2
3,Trump Is So Obsessed He Even Has Obama’s Name...,"On Christmas day, Donald Trump announced that ...",News,"December 29, 2017",0,3
4,Pope Francis Just Called Out Donald Trump Dur...,Pope Francis used his annual Christmas Day mes...,News,"December 25, 2017",0,4
...,...,...,...,...,...,...
44893,'Fully committed' NATO backs new U.S. approach...,BRUSSELS - NATO allies on Tuesday welcomed Pre...,worldnews,"August 22, 2017",1,44893
44894,LexisNexis withdrew two products from Chinese ...,"LONDON - LexisNexis, a provider of legal, regu...",worldnews,"August 22, 2017",1,44894
44895,Minsk cultural hub becomes haven from authorities,MINSK - In the shadow of disused Soviet-era fa...,worldnews,"August 22, 2017",1,44895
44896,Vatican upbeat on possibility of Pope Francis ...,MOSCOW - Vatican Secretary of State Cardinal P...,worldnews,"August 22, 2017",1,44896


In [6]:
# Verify the changes
print(df['text'].head())

0    Donald Trump just couldn t wish all Americans ...
1    House Intelligence Committee Chairman Devin Nu...
2    On Friday, it was revealed that former Milwauk...
3    On Christmas day, Donald Trump announced that ...
4    Pope Francis used his annual Christmas Day mes...
Name: text, dtype: object


In [7]:
# Remove rows with NaN in any column (blank strings are handled in a later cell)
df = df.dropna()  # Drops rows with NaN values
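It can help to count missing values per column before dropping them, and to keep in mind that `dropna()` does not treat blank strings as missing. A small sketch on invented data (the `toy` frame is an assumption for illustration):

```python
import numpy as np
import pandas as pd

# Invented rows: one NaN title, one NaN text, one whitespace-only text.
toy = pd.DataFrame({
    "title": ["A", None, "C", "D"],
    "text": ["body", "body", "   ", np.nan],
})

# Number of NaN/None values in each column.
missing_per_column = toy.isna().sum()
print(missing_per_column)

cleaned = toy.dropna()
# The whitespace-only row survives, because "   " is not NaN.
print(len(toy), "->", len(cleaned))
```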

In [8]:
df

Unnamed: 0,title,text,subject,date,label,id
0,Donald Trump Sends Out Embarrassing New Year’...,Donald Trump just couldn t wish all Americans ...,News,"December 31, 2017",0,0
1,Drunk Bragging Trump Staffer Started Russian ...,House Intelligence Committee Chairman Devin Nu...,News,"December 31, 2017",0,1
2,Sheriff David Clarke Becomes An Internet Joke...,"On Friday, it was revealed that former Milwauk...",News,"December 30, 2017",0,2
3,Trump Is So Obsessed He Even Has Obama’s Name...,"On Christmas day, Donald Trump announced that ...",News,"December 29, 2017",0,3
4,Pope Francis Just Called Out Donald Trump Dur...,Pope Francis used his annual Christmas Day mes...,News,"December 25, 2017",0,4
...,...,...,...,...,...,...
44893,'Fully committed' NATO backs new U.S. approach...,BRUSSELS - NATO allies on Tuesday welcomed Pre...,worldnews,"August 22, 2017",1,44893
44894,LexisNexis withdrew two products from Chinese ...,"LONDON - LexisNexis, a provider of legal, regu...",worldnews,"August 22, 2017",1,44894
44895,Minsk cultural hub becomes haven from authorities,MINSK - In the shadow of disused Soviet-era fa...,worldnews,"August 22, 2017",1,44895
44896,Vatican upbeat on possibility of Pope Francis ...,MOSCOW - Vatican Secretary of State Cardinal P...,worldnews,"August 22, 2017",1,44896


In [9]:
# Additionally, remove rows whose 'text' or 'title' is an empty string or contains only whitespace
df = df[~df[['text', 'title']].apply(lambda x: x.str.strip().eq('')).any(axis=1)]
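The blank-string filter on its own does not catch NaN, since `.str.strip()` leaves NaN as NaN and the comparison with `''` is then false. That is why the notebook drops NaN first. The sketch below folds both checks into one hypothetical helper, `drop_missing_or_blank`, so the order no longer matters; the data is invented.

```python
import pandas as pd

def drop_missing_or_blank(frame, columns):
    """Drop rows where any of `columns` is NaN, empty, or whitespace-only."""
    # Treat NaN as an empty string so a single test covers both cases.
    blank = frame[columns].fillna("").apply(lambda s: s.str.strip().eq(""))
    return frame[~blank.any(axis=1)]

toy = pd.DataFrame({
    "title": ["A", None, "C", "D"],
    "text": ["body", "body", "   ", ""],
})

cleaned = drop_missing_or_blank(toy, ["title", "text"])
print(cleaned)
```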

In [10]:
df

Unnamed: 0,title,text,subject,date,label,id
0,Donald Trump Sends Out Embarrassing New Year’...,Donald Trump just couldn t wish all Americans ...,News,"December 31, 2017",0,0
1,Drunk Bragging Trump Staffer Started Russian ...,House Intelligence Committee Chairman Devin Nu...,News,"December 31, 2017",0,1
2,Sheriff David Clarke Becomes An Internet Joke...,"On Friday, it was revealed that former Milwauk...",News,"December 30, 2017",0,2
3,Trump Is So Obsessed He Even Has Obama’s Name...,"On Christmas day, Donald Trump announced that ...",News,"December 29, 2017",0,3
4,Pope Francis Just Called Out Donald Trump Dur...,Pope Francis used his annual Christmas Day mes...,News,"December 25, 2017",0,4
...,...,...,...,...,...,...
44893,'Fully committed' NATO backs new U.S. approach...,BRUSSELS - NATO allies on Tuesday welcomed Pre...,worldnews,"August 22, 2017",1,44893
44894,LexisNexis withdrew two products from Chinese ...,"LONDON - LexisNexis, a provider of legal, regu...",worldnews,"August 22, 2017",1,44894
44895,Minsk cultural hub becomes haven from authorities,MINSK - In the shadow of disused Soviet-era fa...,worldnews,"August 22, 2017",1,44895
44896,Vatican upbeat on possibility of Pope Francis ...,MOSCOW - Vatican Secretary of State Cardinal P...,worldnews,"August 22, 2017",1,44896


In [11]:
# Split the dataset into train (80%) and remaining (20%)
train_data, remaining_data = train_test_split(df, test_size=0.2, random_state=42)

In [12]:
# Split the remaining data into validation (10% of total) and test (10% of total)
val_data, test_data = train_test_split(remaining_data, test_size=0.5, random_state=42)
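The two splits above are purely random, so the label proportions can drift slightly between sets. If matching proportions matter, `train_test_split` accepts a `stratify` argument. The following is a sketch of that variant on an invented 60/40 toy frame, not the split the notebook actually ran.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Invented data: 100 rows, 60 with label 0 and 40 with label 1.
toy = pd.DataFrame({
    "text": [f"article {i}" for i in range(100)],
    "label": [0] * 60 + [1] * 40,
})

# 80% train, 20% remaining, preserving the label ratio.
train, rest = train_test_split(
    toy, test_size=0.2, random_state=42, stratify=toy["label"]
)
# Split the remaining 20% in half, again preserving the ratio.
val, test = train_test_split(
    rest, test_size=0.5, random_state=42, stratify=rest["label"]
)

for name, part in [("train", train), ("val", val), ("test", test)]:
    print(name, part["label"].value_counts().to_dict())
```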

In [13]:
# Check the class balance (counts of label 0 and label 1) in each split
print('Train data:')
print(train_data['label'].value_counts())
print('Validation data:')
print(val_data['label'].value_counts())
print('Test data:')
print(test_data['label'].value_counts())

Train data:
label
0    18266
1    17147
Name: count, dtype: int64
Validation data:
label
0    2320
1    2107
Name: count, dtype: int64
Test data:
label
0    2265
1    2162
Name: count, dtype: int64
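The raw counts are easier to compare as proportions. This short calculation takes the counts printed above and derives each split's share of the data and its share of label 0; only the variable names are new.

```python
# Label counts copied from the printed output above.
counts = {
    "train": {0: 18266, 1: 17147},
    "validation": {0: 2320, 1: 2107},
    "test": {0: 2265, 1: 2162},
}

sizes = {name: sum(c.values()) for name, c in counts.items()}
total = sum(sizes.values())

for name, c in counts.items():
    split_share = sizes[name] / total       # fraction of all rows in this split
    label0_share = c[0] / sizes[name]       # fraction of this split with label 0
    print(f"{name}: {sizes[name]} rows ({split_share:.1%} of total), "
          f"label 0 = {label0_share:.1%}")
```

Label 0 makes up a little over half of every split, so the classes are roughly balanced even without stratification.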


In [14]:
# Save the splits to CSV files
train_data.to_csv('train_data.csv', index=False)
val_data.to_csv('val_data.csv', index=False)
test_data.to_csv('test_data.csv', index=False)
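This notebook does not remove duplicate articles before splitting, so identical texts could end up in both the training and the evaluation sets. A minimal sketch of a check for that, using a hypothetical helper `overlapping_texts` and invented frames in place of the real splits:

```python
import pandas as pd

def overlapping_texts(a, b, column="text"):
    """Return the set of values of `column` that occur in both frames."""
    return set(a[column]) & set(b[column])

# Invented stand-ins for two of the saved splits.
train = pd.DataFrame({"text": ["story one", "story two", "story three"]})
test = pd.DataFrame({"text": ["story two", "story four"]})

leaked = overlapping_texts(train, test)
print(f"{len(leaked)} text(s) appear in both splits: {sorted(leaked)}")
```

An empty result for every pair of splits would suggest no exact-text leakage; near-duplicates would need a fuzzier comparison.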

https://www.kaggle.com/datasets/hassanamin/textdb3/data

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split

In [2]:
file_path = 'fake_or_real_news.csv'
df = pd.read_csv(file_path)

In [3]:
df

Unnamed: 0,id,title,text,label
0,8476,You Can Smell Hillary’s Fear,"Daniel Greenfield, a Shillman Journalism Fello...",FAKE
1,10294,Watch The Exact Moment Paul Ryan Committed Pol...,Google Pinterest Digg Linkedin Reddit Stumbleu...,FAKE
2,3608,Kerry to go to Paris in gesture of sympathy,U.S. Secretary of State John F. Kerry said Mon...,REAL
3,10142,Bernie supporters on Twitter erupt in anger ag...,"— Kaydee King (@KaydeeKing) November 9, 2016 T...",FAKE
4,875,The Battle of New York: Why This Primary Matters,It's primary day in New York and front-runners...,REAL
...,...,...,...,...
6330,4490,State Department says it can't find emails fro...,The State Department told the Republican Natio...,REAL
6331,8062,The ‘P’ in PBS Should Stand for ‘Plutocratic’ ...,The ‘P’ in PBS Should Stand for ‘Plutocratic’ ...,FAKE
6332,8622,Anti-Trump Protesters Are Tools of the Oligarc...,Anti-Trump Protesters Are Tools of the Oligar...,FAKE
6333,4021,"In Ethiopia, Obama seeks progress on peace, se...","ADDIS ABABA, Ethiopia —President Obama convene...",REAL


In [4]:
# Print rows that contain NaN or an empty string in any field, then drop them
print(df[df.isnull().any(axis=1)])
print(df[df['text'].str.strip().eq('')])
print(df[df['title'].str.strip().eq('')])
print(df[df['label'].str.strip().eq('')])
df = df.dropna()
df = df[~df[['title', 'text', 'label']].apply(lambda x: x.str.strip().eq('')).any(axis=1)]

Empty DataFrame
Columns: [id, title, text, label]
Index: []
         id                                              title text label
106    5530  The Arcturian Group by Marilyn Raffaele Octobe...       FAKE
710    8332  MARKETWATCH LEFTIST: MSM’s “Blatant” Anti Trum...       FAKE
806    9314  Southern Poverty Law Center Targets Anti-Jihad...       FAKE
919   10304  Refugee Resettlement Watch: Swept Away In Nort...       FAKE
940    9474  Michael Bloomberg Names Technological Unemploy...       FAKE
1664   5802  Alert News : Putins Army Is Coming For World W...       FAKE
1736   9564  An LDS Reader Takes A Look At Trump Accuser Je...       FAKE
1851   5752  America’s Senator Jeff Sessions Warns of Worse...       FAKE
1883   8816  Paris Migrant Campers Increase after Calais Is...       FAKE
1941   7525  Putins Army is coming for World war 3 against ...       FAKE
2244   6714  Is your promising internet career over now Vin...       FAKE
2426   5776  Radio Derb Transcript For October 21 Up

In [5]:
df

Unnamed: 0,id,title,text,label
0,8476,You Can Smell Hillary’s Fear,"Daniel Greenfield, a Shillman Journalism Fello...",FAKE
1,10294,Watch The Exact Moment Paul Ryan Committed Pol...,Google Pinterest Digg Linkedin Reddit Stumbleu...,FAKE
2,3608,Kerry to go to Paris in gesture of sympathy,U.S. Secretary of State John F. Kerry said Mon...,REAL
3,10142,Bernie supporters on Twitter erupt in anger ag...,"— Kaydee King (@KaydeeKing) November 9, 2016 T...",FAKE
4,875,The Battle of New York: Why This Primary Matters,It's primary day in New York and front-runners...,REAL
...,...,...,...,...
6330,4490,State Department says it can't find emails fro...,The State Department told the Republican Natio...,REAL
6331,8062,The ‘P’ in PBS Should Stand for ‘Plutocratic’ ...,The ‘P’ in PBS Should Stand for ‘Plutocratic’ ...,FAKE
6332,8622,Anti-Trump Protesters Are Tools of the Oligarc...,Anti-Trump Protesters Are Tools of the Oligar...,FAKE
6333,4021,"In Ethiopia, Obama seeks progress on peace, se...","ADDIS ABABA, Ethiopia —President Obama convene...",REAL


In [6]:
# Remove rows with a duplicate title, then rows with a duplicate text (keep the first occurrence)
df = df.drop_duplicates(subset=['title'], keep='first')
df = df.drop_duplicates(subset=['text'], keep='first')
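Dropping by title and then by text is stricter than a single call with both columns, which only removes rows where title and text match together. A sketch of the difference on invented rows:

```python
import pandas as pd

toy = pd.DataFrame({
    "title": ["A", "A", "B", "C"],
    "text": ["x", "y", "z", "z"],
    "label": ["FAKE", "FAKE", "REAL", "REAL"],
})

# Two passes, as in the cell above: a row goes if its title OR its text repeats.
by_title = toy.drop_duplicates(subset=["title"], keep="first")
two_pass = by_title.drop_duplicates(subset=["text"], keep="first")

# One pass on both columns: a row goes only if title AND text both repeat.
one_pass = toy.drop_duplicates(subset=["title", "text"], keep="first")

print(len(toy), "->", len(by_title), "->", len(two_pass))
print("single call on both columns keeps", len(one_pass))
```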

In [7]:
df

Unnamed: 0,id,title,text,label
0,8476,You Can Smell Hillary’s Fear,"Daniel Greenfield, a Shillman Journalism Fello...",FAKE
1,10294,Watch The Exact Moment Paul Ryan Committed Pol...,Google Pinterest Digg Linkedin Reddit Stumbleu...,FAKE
2,3608,Kerry to go to Paris in gesture of sympathy,U.S. Secretary of State John F. Kerry said Mon...,REAL
3,10142,Bernie supporters on Twitter erupt in anger ag...,"— Kaydee King (@KaydeeKing) November 9, 2016 T...",FAKE
4,875,The Battle of New York: Why This Primary Matters,It's primary day in New York and front-runners...,REAL
...,...,...,...,...
6330,4490,State Department says it can't find emails fro...,The State Department told the Republican Natio...,REAL
6331,8062,The ‘P’ in PBS Should Stand for ‘Plutocratic’ ...,The ‘P’ in PBS Should Stand for ‘Plutocratic’ ...,FAKE
6332,8622,Anti-Trump Protesters Are Tools of the Oligarc...,Anti-Trump Protesters Are Tools of the Oligar...,FAKE
6333,4021,"In Ethiopia, Obama seeks progress on peace, se...","ADDIS ABABA, Ethiopia —President Obama convene...",REAL


In [8]:
# Change the 'label' column to 0 for 'FAKE' and 1 for 'REAL'
df['label'] = df['label'].map({'FAKE': 0, 'REAL': 1})
df

Unnamed: 0,id,title,text,label
0,8476,You Can Smell Hillary’s Fear,"Daniel Greenfield, a Shillman Journalism Fello...",0
1,10294,Watch The Exact Moment Paul Ryan Committed Pol...,Google Pinterest Digg Linkedin Reddit Stumbleu...,0
2,3608,Kerry to go to Paris in gesture of sympathy,U.S. Secretary of State John F. Kerry said Mon...,1
3,10142,Bernie supporters on Twitter erupt in anger ag...,"— Kaydee King (@KaydeeKing) November 9, 2016 T...",0
4,875,The Battle of New York: Why This Primary Matters,It's primary day in New York and front-runners...,1
...,...,...,...,...
6330,4490,State Department says it can't find emails fro...,The State Department told the Republican Natio...,1
6331,8062,The ‘P’ in PBS Should Stand for ‘Plutocratic’ ...,The ‘P’ in PBS Should Stand for ‘Plutocratic’ ...,0
6332,8622,Anti-Trump Protesters Are Tools of the Oligarc...,Anti-Trump Protesters Are Tools of the Oligar...,0
6333,4021,"In Ethiopia, Obama seeks progress on peace, se...","ADDIS ABABA, Ethiopia —President Obama convene...",1
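`Series.map` with a dictionary returns NaN for any value that is not a key, so an unexpected spelling such as `fake` would silently become a missing label. A small sketch of a guard, on an invented Series:

```python
import pandas as pd

labels = pd.Series(["FAKE", "REAL", "fake", "REAL"])
mapping = {"FAKE": 0, "REAL": 1}

mapped = labels.map(mapping)
# Values that were not keys of the mapping come back as NaN.
unmapped = labels[mapped.isna()]
print("unmapped values:", unmapped.tolist())

# Optionally normalize case and whitespace first, then map again.
normalized = labels.str.strip().str.upper().map(mapping)
assert normalized.notna().all(), "some labels are still unmapped"
normalized = normalized.astype(int)
print(normalized.tolist())
```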


In [10]:
# Split the dataset into train (80%) and remaining (20%)
train_data, remaining_data = train_test_split(df, test_size=0.2, random_state=42)

# Split the remaining data into validation (10% of total) and test (10% of total)
val_data, test_data = train_test_split(remaining_data, test_size=0.5, random_state=42)

In [11]:
# Check the class balance (counts of label 0 and label 1) in each split
print('Train data:')
print(train_data['label'].value_counts())
print('Validation data:')
print(val_data['label'].value_counts())
print('Test data:')
print(test_data['label'].value_counts())

Train data:
label
0    2412
1    2396
Name: count, dtype: int64
Validation data:
label
1    308
0    293
Name: count, dtype: int64
Test data:
label
0    320
1    281
Name: count, dtype: int64


In [12]:
# Save the splits to CSV files
train_data.to_csv('train_data_cleaned.csv', index=False) 
val_data.to_csv('val_data_cleaned.csv', index=False)
test_data.to_csv('test_data_cleaned.csv', index=False)

In [15]:
# Print the most common title and how often it occurs, to confirm that no duplicates remain
print(f"{df['title'].value_counts().idxmax()} is found {df['title'].value_counts().max()} times")

You Can Smell Hillary’s Fear is found 1 times
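A more compact check counts the duplicated values directly instead of printing the most frequent one, which for the 'text' column means printing a whole article. A sketch on invented rows:

```python
import pandas as pd

toy = pd.DataFrame({
    "title": ["A", "B", "A"],
    "text": ["x", "y", "z"],
})

# duplicated() marks every repeat after the first occurrence.
dup_titles = int(toy["title"].duplicated().sum())
dup_texts = int(toy["text"].duplicated().sum())
print(f"duplicate titles: {dup_titles}, duplicate texts: {dup_texts}")
```

Both counts being zero on the cleaned frame would say the same thing as a maximum count of 1.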


In [16]:
print(f"{df['text'].value_counts().idxmax()} is found {df['text'].value_counts().max()} times")

Daniel Greenfield, a Shillman Journalism Fellow at the Freedom Center, is a New York writer focusing on radical Islam. 
In the final stretch of the election, Hillary Rodham Clinton has gone to war with the FBI. 
The word “unprecedented” has been thrown around so often this election that it ought to be retired. But it’s still unprecedented for the nominee of a major political party to go war with the FBI. 
But that’s exactly what Hillary and her people have done. Coma patients just waking up now and watching an hour of CNN from their hospital beds would assume that FBI Director James Comey is Hillary’s opponent in this election. 
The FBI is under attack by everyone from Obama to CNN. Hillary’s people have circulated a letter attacking Comey. There are currently more media hit pieces lambasting him than targeting Trump. It wouldn’t be too surprising if the Clintons or their allies were to start running attack ads against the FBI. 
The FBI’s leadership is being warned that the entire left