## Fake News Detection with TensorFlow: A Deep Learning Approach
This notebook demonstrates how to build a fake news detection model using TensorFlow. The model will be trained on a dataset of news articles labeled as real or fake.

#### 1. Import Libraries

In [1]:
import numpy as np
import pandas as pd
import tensorflow as tf
import re
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense, Dropout, Bidirectional

#### 2. Load Dataset

In [2]:
data = pd.read_csv("news.csv")
data.head()

| Unnamed: 0.1 | Unnamed: 0 | title | text | label |
|---|---|---|---|---|
| 0 | 8476 | You Can Smell Hillary’s Fear | Daniel Greenfield, a Shillman Journalism Fello... | FAKE |
| 1 | 10294 | Watch The Exact Moment Paul Ryan Committed Pol... | Google Pinterest Digg Linkedin Reddit Stumbleu... | FAKE |
| 2 | 3608 | Kerry to go to Paris in gesture of sympathy | U.S. Secretary of State John F. Kerry said Mon... | REAL |
| 3 | 10142 | Bernie supporters on Twitter erupt in anger ag... | — Kaydee King (@KaydeeKing) November 9, 2016 T... | FAKE |
| 4 | 875 | The Battle of New York: Why This Primary Matters | It's primary day in New York and front-runners... | REAL |


#### 3. Data Preprocessing

- Keep Necessary Columns: We'll use the title, text, and label columns. 

- Combine Title and Text: We will merge the title and text into a single input feature for the model. We'll also handle any missing values in the process.

- Encode Labels: The model needs numerical labels, so we'll convert "FAKE" to 0 and "REAL" to 1.

- Clean Text: We'll create a function to convert text to lowercase and remove punctuation, numbers, and extra spaces. This standardizes the text for the tokenizer.
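
To make the cleaning rules concrete, here is a small standalone sketch that applies the same kind of regex steps to a made-up sentence. The function name `clean_text_demo` and the sample string are illustrative only. One side effect worth noticing: stripping punctuation fuses contractions, so "don't" becomes "dont".

```python
import re

def clean_text_demo(text):
    """Lowercase, then strip punctuation, digits, and extra whitespace."""
    text = text.lower()
    text = re.sub(r'[^\w\s]', '', text)       # drop anything that is not a word character or whitespace
    text = re.sub(r'\d+', '', text)           # drop runs of digits
    text = re.sub(r'\s+', ' ', text).strip()  # collapse the whitespace left behind
    return text

sample = "BREAKING: 3 Senators DON'T agree!!  Read more..."  # made-up example
cleaned = clean_text_demo(sample)
print(cleaned)  # breaking senators dont agree read more
```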

In [3]:
# Keep only the 'title', 'text', and 'label' columns (copy to avoid SettingWithCopyWarning)
data = data[['title', 'text', 'label']].copy()

# Handle missing values in title and text
data['title'] = data['title'].fillna('')
data['text'] = data['text'].fillna('')

# Combine title and text into a single column
data['text'] = data['title'] + ' ' + data['text']
data = data.drop(columns=['title'])  # Drop the original title column

# Drop rows where the combined text is empty; NaNs were already filled above,
# so check for blank strings rather than using dropna
data = data[data['text'].str.strip() != '']

# Encode the labels
label_encoder = LabelEncoder()
data['label'] = label_encoder.fit_transform(data['label'])
# FAKE -> 0, REAL -> 1
print("Encoded labels mapping:")
print(dict(zip(label_encoder.classes_, label_encoder.transform(label_encoder.classes_))))
print("\n" + "="*50 + "\n")


# Text cleaning function
def clean_text(text):
    text = text.lower()  # Lowercase text
    text = re.sub(r'[^\w\s]', '', text)  # Remove punctuation
    text = re.sub(r'\d+', '', text)  # Remove numbers
    text = re.sub(r'\s+', ' ', text).strip()  # Remove extra whitespace
    return text

# Apply the cleaning function to the text column
data['text'] = data['text'].apply(clean_text)

print("Sample of cleaned text:")
print(data['text'].iloc[0])


Encoded labels mapping:
{'FAKE': 0, 'REAL': 1}


Sample of cleaned text:


In [4]:
print("Distribution of labels:")
print(data['label'].value_counts())

Distribution of labels:
label
1    3171
0    3164
Name: count, dtype: int64
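
Because the two classes are almost perfectly balanced, a classifier that always predicts the larger class would score only about 50%. The quick calculation below derives that baseline from the counts printed above. It is a useful reference point when reading accuracy figures later.

```python
# Class counts copied from the value_counts() output above
n_real, n_fake = 3171, 3164

total = n_real + n_fake
majority_baseline = max(n_real, n_fake) / total  # accuracy of always guessing the larger class

print(total)                       # 6335
print(f"{majority_baseline:.2%}")  # 50.06%
```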


#### 4. Tokenization and Padding

Neural networks operate on numbers, not words, so we need to convert our cleaned text into sequences of integers.

- Tokenizer: We'll use `tf.keras.preprocessing.text.Tokenizer` to build a vocabulary of the most common words and convert each article into a sequence of integers.

- Padding: Neural networks require inputs of a fixed length. We'll use `pad_sequences` to make sure every sequence has the same length by appending zeros to the shorter ones and truncating the longer ones.
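
The two steps above can be sketched in plain Python on a toy corpus. This is a simplified illustration of the idea, not the Keras implementation: the helpers `build_word_index` and `to_padded` are hypothetical names, index 0 is assumed to be reserved for padding and index 1 for out-of-vocabulary words, and only indices below `vocab_size` are kept.

```python
from collections import Counter

def build_word_index(texts, oov_token="<OOV>"):
    """Rank words by frequency; index 0 is reserved for padding, 1 for OOV."""
    counts = Counter(word for text in texts for word in text.split())
    word_index = {oov_token: 1}
    for rank, (word, _) in enumerate(counts.most_common(), start=2):
        word_index[word] = rank
    return word_index

def to_padded(texts, word_index, vocab_size, max_len):
    """Map words to integers, then truncate and zero-pad at the end."""
    rows = []
    for text in texts:
        seq = []
        for word in text.split():
            idx = word_index.get(word, 1)
            seq.append(idx if idx < vocab_size else 1)  # rare or unknown words become OOV
        seq = seq[:max_len]                             # like truncating='post'
        rows.append(seq + [0] * (max_len - len(seq)))   # like padding='post'
    return rows

toy_texts = ["the senate passed the bill", "the bill failed"]  # made-up corpus
word_index = build_word_index(toy_texts)
padded = to_padded(toy_texts, word_index, vocab_size=4, max_len=4)
print(padded)  # [[2, 1, 1, 2], [2, 3, 1, 0]]
```

With `vocab_size=4` only "the" (2) and "bill" (3) keep their own indices. Every other word collapses to 1, the first article loses its fifth token, and the second gets one trailing zero.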

In [5]:
# Define parameters for tokenization and padding
VOCAB_SIZE = 10000  # Number of words to keep in the vocabulary
MAX_LEN = 256      # Max length of sequences
OOV_TOKEN = "<OOV>" # Token for words not in the vocabulary

# Initialize the tokenizer
tokenizer = Tokenizer(num_words=VOCAB_SIZE, oov_token=OOV_TOKEN)
tokenizer.fit_on_texts(data['text'])

# Convert text to sequences of integers
sequences = tokenizer.texts_to_sequences(data['text'])

# Pad the sequences
padded_sequences = pad_sequences(sequences, maxlen=MAX_LEN, padding='post', truncating='post')


#### 5. Split Data into Training and Testing Sets

We need to split our data into a training set (for teaching the model) and a testing set (for evaluating its performance on unseen data). We'll use an 80/20 split.

In [6]:
# Get features (padded sequences) and labels
X = padded_sequences
y = data['label'].values

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

print(f"Training set shape: {X_train.shape}")
print(f"Testing set shape: {X_test.shape}")

Training set shape: (5068, 256)
Testing set shape: (1267, 256)
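
The shapes above can be reproduced with a small pure-Python sketch of a stratified split. This is not scikit-learn's exact algorithm: the helper `stratified_split` is a hypothetical name and it simply rounds 20% of each class. It does show why `stratify=y` keeps the REAL/FAKE ratio the same in both sets.

```python
import random
from collections import Counter

def stratified_split(labels, test_fraction=0.2, seed=42):
    """Return (train_idx, test_idx) with the class ratio preserved in both."""
    rng = random.Random(seed)
    by_class = {}
    for i, y in enumerate(labels):
        by_class.setdefault(y, []).append(i)
    train_idx, test_idx = [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)                          # shuffle within each class
        n_test = round(len(idxs) * test_fraction)  # per-class share of the test set
        test_idx.extend(idxs[:n_test])
        train_idx.extend(idxs[n_test:])
    return train_idx, test_idx

labels = [1] * 3171 + [0] * 3164  # class counts from the notebook output
train_idx, test_idx = stratified_split(labels)
test_counts = Counter(labels[i] for i in test_idx)

print(len(train_idx), len(test_idx))  # 5068 1267
print(test_counts[1], test_counts[0])  # 634 633
```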


#### 6. Build the LSTM Model

Now we define the architecture of our neural network:

- Embedding Layer: This layer learns a dense vector representation for each word in our vocabulary.

- Bidirectional LSTM: An LSTM (Long Short-Term Memory) layer processes a sequence step by step while carrying a memory of earlier tokens. Wrapping it in Bidirectional runs one pass forward and one pass backward over the text, so each position sees context from both sides.

- Dropout: A regularization technique to prevent the model from overfitting.

- Dense Layers: Standard fully connected layers for classification. The final layer uses a sigmoid activation function to output a probability between 0 and 1.
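
To see how the sigmoid output connects to the loss used for training, here is a minimal sketch of the sigmoid function and binary cross-entropy for a single example. It ignores the numerical safeguards a library implementation would apply, and the input values are arbitrary.

```python
import math

def sigmoid(z):
    """Squash any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def binary_crossentropy(y_true, p):
    """Loss for one example: small when p agrees with y_true, large otherwise."""
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1 - p))

for z in (-2.0, 0.0, 2.0):  # arbitrary pre-activation values
    p = sigmoid(z)
    print(f"z={z:+.1f}  P(REAL)={p:.3f}  "
          f"loss if REAL={binary_crossentropy(1, p):.3f}  "
          f"loss if FAKE={binary_crossentropy(0, p):.3f}")
```

A confident wrong answer is penalized much more heavily than a confident right answer is rewarded, which pushes the model toward calibrated probabilities.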

In [7]:
EMBEDDING_DIM = 64 # Dimension for the word vectors

model = Sequential([
    Embedding(input_dim=VOCAB_SIZE, output_dim=EMBEDDING_DIM),
    Bidirectional(LSTM(64, return_sequences=True)),
    Dropout(0.5),
    Bidirectional(LSTM(32)),
    Dropout(0.5),
    Dense(64, activation='relu'),
    Dense(1, activation='sigmoid') # Sigmoid for binary classification
])

# Compile the model
model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])

# Print the model summary
model.summary()


2025-06-26 22:29:30.146780: I metal_plugin/src/device/metal_device.cc:1154] Metal device set to: Apple M1
2025-06-26 22:29:30.147289: I metal_plugin/src/device/metal_device.cc:296] systemMemory: 16.00 GB
2025-06-26 22:29:30.147325: I metal_plugin/src/device/metal_device.cc:313] maxCacheSize: 5.33 GB
2025-06-26 22:29:30.147969: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:305] Could not identify NUMA node of platform GPU ID 0, defaulting to 0. Your kernel may not have been built with NUMA support.
2025-06-26 22:29:30.148000: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:271] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 0 MB memory) -> physical PluggableDevice (device: 0, name: METAL, pci bus id: <undefined>)
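
As a rough cross-check on the size of this network, the trainable parameter count can be worked out by hand. The sketch assumes the standard LSTM parameterization (four gates, each with input weights, recurrent weights, and a bias) and uses the layer sizes from the cell above. If that assumption holds, the total should match what `model.summary()` reports once the model is built.

```python
def embedding_params(vocab_size, dim):
    return vocab_size * dim

def lstm_params(input_dim, units):
    # 4 gates, each with input weights, recurrent weights, and a bias
    return 4 * units * (input_dim + units + 1)

def dense_params(input_dim, units):
    return input_dim * units + units

VOCAB_SIZE, EMBEDDING_DIM = 10000, 64  # values from the notebook

layer_params = {
    "embedding": embedding_params(VOCAB_SIZE, EMBEDDING_DIM),
    "bi_lstm_64": 2 * lstm_params(EMBEDDING_DIM, 64),  # forward + backward LSTM
    "bi_lstm_32": 2 * lstm_params(2 * 64, 32),         # input is the 128-dim concatenated output
    "dense_64": dense_params(2 * 32, 64),
    "dense_1": dense_params(64, 1),
}
total_params = sum(layer_params.values())  # Dropout layers add no parameters

for name, count in layer_params.items():
    print(f"{name:<12}{count:>10,}")
print(f"{'total':<12}{total_params:>10,}")  # total 751,489
```

Most of the weights sit in the embedding table, so `VOCAB_SIZE` and `EMBEDDING_DIM` are the first knobs to turn if the model needs to shrink.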


#### 7. Train the Model

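We now feed the training data to our model. We'll train for 5 epochs and pass the test set as `validation_data` to monitor loss and accuracy after each epoch. Note that the same data is reused for the final evaluation in the next section, so any tuning based on these validation numbers would make the reported test accuracy optimistic.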

In [8]:
# Train the model
EPOCHS = 5
BATCH_SIZE = 32

history = model.fit(X_train, y_train,
                    epochs=EPOCHS,
                    batch_size=BATCH_SIZE,
                    validation_data=(X_test, y_test),
                    verbose=1)

Epoch 1/5
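
Each epoch is reported in batches rather than samples. As a quick sanity check, the number of batches per epoch follows from the training set size and `BATCH_SIZE`, assuming the final partial batch is kept (the default when fitting on in-memory arrays).

```python
import math

n_train = 5068   # rows in X_train, from the shapes printed earlier
batch_size = 32
epochs = 5

steps_per_epoch = math.ceil(n_train / batch_size)               # batches per epoch
last_batch_size = n_train - (steps_per_epoch - 1) * batch_size  # size of the final, smaller batch
total_updates = steps_per_epoch * epochs                        # weight updates over the whole run

print(steps_per_epoch, last_batch_size, total_updates)  # 159 12 795
```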



#### 8. Evaluate the Model

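After training, let's see how well our model performs on the held-out test data. `model.evaluate` returns the loss followed by the metrics passed to `compile`, here accuracy. Since the dataset is balanced, anything well above 50% indicates the model has learned a useful signal.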



In [None]:
# Evaluate the model on the test set
loss, accuracy = model.evaluate(X_test, y_test)
print(f"\nTest Accuracy: {accuracy*100:.2f}%")
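
Accuracy alone hides which kind of mistake the model makes. The sketch below shows how accuracy, precision, and recall fall out of thresholded probabilities. The labels and probabilities are made up for illustration and are not results from this model.

```python
# Made-up labels (1 = REAL, 0 = FAKE) and predicted probabilities of REAL
y_true = [1, 0, 1, 1, 0, 0]
probs = [0.91, 0.20, 0.40, 0.75, 0.65, 0.05]

y_pred = [1 if p > 0.5 else 0 for p in probs]  # same 0.5 threshold as binary accuracy

tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))  # REAL predicted REAL
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))  # FAKE predicted REAL
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))  # REAL predicted FAKE
tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))  # FAKE predicted FAKE

accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)  # of the articles called REAL, how many were REAL
recall = tp / (tp + fn)     # of the REAL articles, how many were found

print(tp, fp, fn, tn)  # 2 1 1 2
print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f}")
```

On real data, the same counts could be computed from `model.predict(X_test)` and `y_test`.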

#### 9. Make Predictions on New Data

Let's use our trained model to classify new, unseen headline-style texts.

In [None]:
# Example news texts for prediction
new_texts = [
    "The president announced a new groundbreaking healthcare reform today that will cover millions.", # Likely REAL
    "Scientists discover aliens have been living in the ocean for centuries, controlling our weather.", # Likely FAKE
    "Stock market hits an all-time high after positive economic reports are released by the government.", # Likely REAL
    "You won't believe what this celebrity was caught doing, new secret photos leaked online by anonymous source.", # Likely FAKE
]

# Preprocess the new texts
cleaned_texts = [clean_text(text) for text in new_texts]
sequences = tokenizer.texts_to_sequences(cleaned_texts)
padded = pad_sequences(sequences, maxlen=MAX_LEN, padding='post', truncating='post')

# Make predictions
predictions = model.predict(padded)

# Print the results; the sigmoid output is the estimated probability of REAL (label 1)
for text, pred in zip(new_texts, predictions):
    prob_real = float(pred[0])
    label = "REAL" if prob_real > 0.5 else "FAKE"
    print(f"\nText: {text}")
    print(f"Prediction: {label} (P(REAL): {prob_real:.4f})")
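
The raw sigmoid output is the estimated probability of the REAL class, so a value like 0.08 means the model is fairly sure the text is FAKE, not that it has low confidence. One way to report this is sketched below. The helper `describe_prediction` is a hypothetical name and the probabilities are made up rather than taken from the model.

```python
def describe_prediction(p_real, threshold=0.5):
    """Turn P(REAL) into a label plus the probability of that label."""
    label = "REAL" if p_real > threshold else "FAKE"
    confidence = p_real if label == "REAL" else 1.0 - p_real
    return label, confidence

for p_real in (0.93, 0.08, 0.55):  # made-up model outputs
    label, confidence = describe_prediction(p_real)
    print(f"P(REAL)={p_real:.2f} -> {label} (confidence {confidence:.2f})")
```

Outputs close to 0.5, such as the last one, are best treated as uncertain whichever label they receive.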