# Step 1: Importing Libraries

Before starting with the project, we import the required Python libraries:

- **pandas (pd)** → used for handling and manipulating tabular data (our dataset is in CSV format).
- **numpy (np)** → useful for numerical operations, arrays, and data manipulation.
- **re (Regular Expressions)** → helps in text preprocessing (cleaning unwanted characters, symbols).
- **train_test_split** → splits our dataset into Training and Testing sets.
- **TfidfVectorizer** → converts text into numerical features using TF-IDF (Term Frequency – Inverse Document Frequency).
- **LogisticRegression** → a Machine Learning model used for binary classification (Fake vs Real).
- **accuracy_score, confusion_matrix, classification_report** → evaluation metrics to check how well our model performs.

👉 In short: these libraries will help us load the dataset, clean it, convert text into numbers, train a model, and evaluate its performance.



In [27]:
import pandas as pd
import numpy as np
import re
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report


# Step 2: Loading the Dataset

We are using two datasets:

- **fake.csv** → contains news articles that are labeled as *Fake*.
- **true.csv** → contains news articles that are labeled as *True*.

📊 Each dataset has rows (individual news articles) and columns (like title, text, subject, date).  
- `df_fake.shape` → tells us how many fake news samples are present.  
- `df_true.shape` → tells us how many true news samples are present.  

👉 At this stage, we are just loading and checking the size of our data before processing it.


In [28]:
# Load the datasets
df_fake = pd.read_csv("/content/fake.csv")
df_true = pd.read_csv("/content/true.csv")

print("Fake news shape:", df_fake.shape)
print("True news shape:", df_true.shape)


Fake news shape: (23481, 4)
True news shape: (21417, 4)


# Step 3: Assigning Labels

Machine Learning models need numerical labels to identify categories.

- We assign **1** to Fake News.  
- We assign **0** to True News.  

This way, our model will learn:
- Label `1` → Fake news  
- Label `0` → True news  

👉 Converting categories (fake/true) into numbers is called **encoding**, and it’s an essential step before training ML models.


In [29]:
df_fake["label"] = 1   # Fake news = 1
df_true["label"] = 0   # True news = 0


In [30]:
df = pd.concat([df_fake, df_true], axis=0)
df.reset_index(drop=True, inplace=True)

print("Total dataset shape:", df.shape)
df.head()


Total dataset shape: (44898, 5)


| | title | text | subject | date | label |
|---|---|---|---|---|---|
| 0 | Donald Trump Sends Out Embarrassing New Year’... | Donald Trump just couldn t wish all Americans ... | News | December 31, 2017 | 1 |
| 1 | Drunk Bragging Trump Staffer Started Russian ... | House Intelligence Committee Chairman Devin Nu... | News | December 31, 2017 | 1 |
| 2 | Sheriff David Clarke Becomes An Internet Joke... | On Friday, it was revealed that former Milwauk... | News | December 30, 2017 | 1 |
| 3 | Trump Is So Obsessed He Even Has Obama’s Name... | On Christmas day, Donald Trump announced that ... | News | December 29, 2017 | 1 |
| 4 | Pope Francis Just Called Out Donald Trump Dur... | Pope Francis used his annual Christmas Day mes... | News | December 25, 2017 | 1 |


# Step 4: Combining the Datasets

With both datasets labeled, the cell above combines them into a **single DataFrame**.

- `pd.concat([df_fake, df_true], axis=0)` → merges the two DataFrames vertically (row-wise).  
- `reset_index(drop=True)` → resets the row numbering so it looks clean.  

📊 The combined dataset now contains both **Fake** and **True** news articles, along with their labels:
- `1` → Fake  
- `0` → True  

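👉 This unified dataset will be used for training and testing our ML model. The next cell merges `title` and `text` into a single `content` column and keeps only `content` and `label`.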


In [31]:
# Merge title + text into one column
df["content"] = df["title"].fillna('') + " " + df["text"].fillna('')

# Keep only the useful columns
df = df[["content", "label"]]


# Step 5: Data Cleaning & Preprocessing

Text data is often messy and contains:
- Numbers
- Punctuation (!, ?, commas, etc.)
- Symbols (@, #, $ etc.)

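To make the dataset consistent, the `clean_text` function below cleans it by:
1. **Removing non-alphabet characters** → using Regular Expressions (`re.sub`).
2. **Converting to lowercase** → so that "News" and "news" are treated the same.
3. **Splitting into words** → so each word can be processed individually.
4. **Removing stopwords** → drops very common English words using NLTK's stopword list.
5. **Stemming and lemmatizing** → reduces each word to a base form (for example, "sends" becomes "send").
6. **Joining the words back** → ensures proper spacing and clean text.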

👉 This step ensures that the Machine Learning model focuses only on meaningful words.
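
The regex, lowercase, and split/join steps can be sketched with the standard library alone. This is a minimal illustration: `basic_clean` and the sample sentence are hypothetical, not part of the notebook.

```python
import re

def basic_clean(text):
    """Keep letters only, lowercase, and normalize whitespace."""
    # Replace every non-letter character (digits, punctuation, symbols) with a space
    text = re.sub(r"[^a-zA-Z]", " ", text)
    # Lowercase so "News" and "news" become the same token
    text = text.lower()
    # Splitting and re-joining collapses runs of spaces into single spaces
    return " ".join(text.split())

sample = "Breaking NEWS!!! 3 things @you"
cleaned = basic_clean(sample)
print(cleaned)  # breaking news things you
```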


In [32]:
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
import re

# Download resources (only once)
nltk.download('stopwords')
nltk.download('wordnet')

# Initialize tools
stop_words = set(stopwords.words('english'))
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

def clean_text(text):
    # 1. Keep only letters
    text = re.sub(r'[^a-zA-Z]', ' ', text)

    # 2. Lowercase everything
    text = text.lower()

    # 3. Split into words
    words = text.split()

    # 4. Remove stopwords
    words = [w for w in words if w not in stop_words]

    # 5. Apply stemming
    words = [stemmer.stem(w) for w in words]

    # 6. Apply lemmatization
    words = [lemmatizer.lemmatize(w) for w in words]

    # 7. Join words back
    return " ".join(words)

# Apply the improved cleaning function
df["content"] = df["content"].apply(clean_text)

# Preview result
df.head()



[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


Unnamed: 0,content,label
0,donald trump send embarrass new year eve messa...,1
1,drunk brag trump staffer start russian collus ...,1
2,sheriff david clark becom internet joke threat...,1
3,trump obsess even obama name code websit imag ...,1
4,pope franci call donald trump christma speech ...,1


In [38]:
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

# Download stopwords and WordNet
nltk.download('stopwords')
nltk.download('wordnet')

# Initialize
stop_words = set(stopwords.words('english'))
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# Example sentence
sample_text = "The players were playing happily in the playground and enjoyed their games."

# Tokenize
words = sample_text.lower().split()

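# Remove stopwords
filtered_words = [w for w in words if w not in stop_words]

# Stem and lemmatize the filtered words separately so the two can be compared
stemmed_words = [stemmer.stem(w) for w in filtered_words]
lemmatized_words = [lemmatizer.lemmatize(w) for w in filtered_words]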


print("Original words:", words)
print("After Stopword Removal:", filtered_words)
print("After Stemming:", stemmed_words)
print("After Lemmatization:", lemmatized_words)


Original words: ['the', 'players', 'were', 'playing', 'happily', 'in', 'the', 'playground', 'and', 'enjoyed', 'their', 'games.']
After Stopword Removal: ['players', 'playing', 'happily', 'playground', 'enjoyed', 'games.']
After Stemming: ['player', 'play', 'happili', 'playground', 'enjoy', 'games.']
After Lemmatization: ['player', 'playing', 'happily', 'playground', 'enjoyed', 'games.']



# Step 6: Text Vectorization with TF-IDF

Now that the text is cleaned, we convert it into **numerical features** using TF-IDF:

- **TF (Term Frequency)** → how often a word appears in a document.
- **IDF (Inverse Document Frequency)** → reduces weight of common words and gives more importance to rare but meaningful words.
- Example:
  - Word "government" might appear in many articles → lower weight.
  - Word "scandal" might appear in only a few articles → higher weight.

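We:
1. Split the dataset into **Training set (80%)** and **Testing set (20%)** before vectorizing, so the vocabulary and IDF weights are learned from the training data only.
2. Use `TfidfVectorizer` (and `CountVectorizer`, for comparison) to convert text into vectors: `fit_transform` on the training text, `transform` on the test text.
3. Get the final numerical matrices `X_train_tfidf` and `X_test_tfidf` (plus `X_train_bow` and `X_test_bow`), which can be fed into ML models.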

👉 After this step, our text data is now ready for **Machine Learning classification**.
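
Why fit on the training set only? A rough sketch in plain Python (no scikit-learn) may help: the vocabulary is learned from the training documents, and any test word that was not seen during fitting is ignored. The helper names `fit_vocabulary` and `transform_counts` and the toy documents are assumptions made for this illustration.

```python
def fit_vocabulary(train_docs):
    """Build a word -> column index mapping from the training documents only."""
    words = sorted({w for doc in train_docs for w in doc.split()})
    return {w: i for i, w in enumerate(words)}

def transform_counts(docs, vocab):
    """Count vocabulary words per document; unseen words are skipped."""
    rows = []
    for doc in docs:
        row = [0] * len(vocab)
        for w in doc.split():
            if w in vocab:
                row[vocab[w]] += 1
        rows.append(row)
    return rows

# Toy, already-cleaned documents (hypothetical)
train_docs = ["trump send messag", "senat pass bill"]
test_docs = ["senat send new bill"]

vocab = fit_vocabulary(train_docs)                    # learned from train only
X_train_counts = transform_counts(train_docs, vocab)
X_test_counts = transform_counts(test_docs, vocab)    # "new" is not in vocab

print(vocab)
print(X_test_counts)
```

Both matrices share the same columns, which is what lets a model trained on one be evaluated on the other.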



In [39]:
# Split before vectorization
X = df["content"]
y = df["label"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

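# Initialize vectorizers (CountVectorizer is not among the Step 1 imports, so import it here)
from sklearn.feature_extraction.text import CountVectorizer

bow_vectorizer = CountVectorizer(max_features=5000)
tfidf_vectorizer = TfidfVectorizer(max_features=5000)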

# Vectorize only here
X_train_bow = bow_vectorizer.fit_transform(X_train)
X_test_bow = bow_vectorizer.transform(X_test)

X_train_tfidf = tfidf_vectorizer.fit_transform(X_train)
X_test_tfidf = tfidf_vectorizer.transform(X_test)



# Bag of Words (BoW)

The **Bag of Words** model converts text into numbers by **counting word frequency**.

Example:
- Sentence 1: "The cat sat on the mat"
- Sentence 2: "The dog sat on the mat"

Vocabulary = ["cat", "dog", "mat", "on", "sat", "the"]

Representation:
- Sentence 1 → [1, 0, 1, 1, 1, 2]
- Sentence 2 → [0, 1, 1, 1, 1, 2]

👉 Advantages:
- Simple and easy to understand.

👉 Limitations:
- Only counts frequency, ignores importance of words.
- Common words like *the, is, in* dominate, but don’t carry much meaning.
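
The cat/dog example can be reproduced in a few lines of plain Python. This is only a sketch of the counting idea: it takes the vocabulary from every word in the two sentences (so "on" is included) and skips the extra options that `CountVectorizer` offers.

```python
from collections import Counter

sentences = ["The cat sat on the mat", "The dog sat on the mat"]

# Lowercase and split each sentence into words
tokenized = [s.lower().split() for s in sentences]

# Vocabulary: every distinct word, sorted alphabetically
vocabulary = sorted({w for tokens in tokenized for w in tokens})

# One count vector per sentence, one column per vocabulary word
vectors = [[Counter(tokens)[w] for w in vocabulary] for tokens in tokenized]

print(vocabulary)  # ['cat', 'dog', 'mat', 'on', 'sat', 'the']
print(vectors[0])  # [1, 0, 1, 1, 1, 2]
print(vectors[1])  # [0, 1, 1, 1, 1, 2]
```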


In [40]:
from sklearn.feature_extraction.text import CountVectorizer

# Initialize Bag of Words vectorizer
bow_vectorizer = CountVectorizer(max_df=0.7, stop_words='english')

# Transform the dataset
X_train_bow = bow_vectorizer.fit_transform(X_train)
X_test_bow = bow_vectorizer.transform(X_test)

print("Bag of Words - Training shape:", X_train_bow.shape)
print("Bag of Words - Testing shape:", X_test_bow.shape)


Bag of Words - Training shape: (35918, 80982)
Bag of Words - Testing shape: (8980, 80982)


# TF-IDF (Term Frequency – Inverse Document Frequency)

TF-IDF improves upon Bag of Words by **assigning importance (weights) to words**.

Key terms:
- **TF (Term Frequency)** → how often a word appears in a document.
- **IDF (Inverse Document Frequency)** → reduces weight of common words, increases weight of rare but important words.

Example:
- In a collection of 100 news articles:
  - Word "government" appears in 80 → IDF weight is low.
  - Word "scandal" appears in 5 → IDF weight is high.
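
To put numbers on this example, here is a small sketch using the textbook definition `idf = ln(N / df)`. Note the assumption: scikit-learn's `TfidfVectorizer` uses a smoothed variant by default and also normalizes each row, so its exact values will differ. The term counts in `tf_*` are hypothetical.

```python
import math

def idf(n_docs, doc_freq):
    """Textbook inverse document frequency: ln(N / df)."""
    return math.log(n_docs / doc_freq)

n_docs = 100
idf_government = idf(n_docs, 80)  # appears in 80 of 100 articles
idf_scandal = idf(n_docs, 5)      # appears in 5 of 100 articles

print(round(idf_government, 3))  # 0.223
print(round(idf_scandal, 3))     # 2.996

# TF-IDF weight = term frequency * IDF. Suppose (hypothetically) each word
# appears 3 times in one article:
tf_government = 3
tf_scandal = 3
print(round(tf_government * idf_government, 3))  # 0.669
print(round(tf_scandal * idf_scandal, 3))        # 8.987
```

With equal counts, the rarer word ends up with a weight more than ten times larger.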

👉 Advantages:
- Handles importance of words better than simple counts.
- Reduces noise from very common words.

👉 Limitation:
- Still treats words independently (doesn’t understand meaning/context like modern deep learning models).


In [41]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Initialize TF-IDF vectorizer
tfidf_vectorizer = TfidfVectorizer(max_df=0.7, stop_words='english')

# Transform the dataset
X_train_tfidf = tfidf_vectorizer.fit_transform(X_train)
X_test_tfidf = tfidf_vectorizer.transform(X_test)

print("TF-IDF - Training shape:", X_train_tfidf.shape)
print("TF-IDF - Testing shape:", X_test_tfidf.shape)


TF-IDF - Training shape: (35918, 80982)
TF-IDF - Testing shape: (8980, 80982)


# Logistic Regression with Bag of Words

We train a **Logistic Regression classifier** on the BoW representation.

- Logistic Regression is widely used for text classification.
- It predicts probabilities for each class (Fake = 1, True = 0).
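
As a rough sketch of how a probability comes out of the model: logistic regression computes a weighted sum of the features and passes it through the sigmoid function. The weights, bias, and feature counts below are hypothetical values chosen for illustration, not ones learned by the notebook's model.

```python
import math

def sigmoid(z):
    """Map any real number to the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba_fake(features, weights, bias):
    """Probability of class 1 (Fake) for one document."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)

# Hypothetical model with two word features
weights = [1.5, -2.0]   # a positive weight pushes toward Fake, a negative one toward True
bias = 0.0
features = [2, 1]       # word counts for one document

p_fake = predict_proba_fake(features, weights, bias)  # z = 1.5*2 - 2.0*1 = 1.0
label = 1 if p_fake >= 0.5 else 0                     # threshold at 0.5

print(round(p_fake, 3))  # 0.731
print(label)             # 1
```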

We then evaluate the model using:
- **Accuracy** → Overall correctness of predictions.
- **Confusion Matrix** → Shows correct and incorrect classifications.
- **Classification Report** → Precision, Recall, F1-Score.

This helps us understand how well the model is performing.


In [42]:
# Logistic Regression with Bag of Words
log_reg_bow = LogisticRegression(max_iter=1000)
log_reg_bow.fit(X_train_bow, y_train)

# Predictions
y_pred_bow = log_reg_bow.predict(X_test_bow)

# Evaluation
print("🔹 Logistic Regression (Bag of Words)")
print("Accuracy:", accuracy_score(y_test, y_pred_bow))
print("\nConfusion Matrix:\n", confusion_matrix(y_test, y_pred_bow))
print("\nClassification Report:\n", classification_report(y_test, y_pred_bow))


🔹 Logistic Regression (Bag of Words)
Accuracy: 0.994543429844098

Confusion Matrix:
 [[4221   26]
 [  23 4710]]

Classification Report:
               precision    recall  f1-score   support

           0       0.99      0.99      0.99      4247
           1       0.99      1.00      0.99      4733

    accuracy                           0.99      8980
   macro avg       0.99      0.99      0.99      8980
weighted avg       0.99      0.99      0.99      8980
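
As a sanity check, the reported scores can be re-derived by hand from the confusion matrix printed above. The sketch assumes scikit-learn's layout (rows are true labels, columns are predicted labels) and treats Fake (`1`) as the positive class.

```python
# Confusion matrix from the Bag of Words run
cm = [[4221, 26],
      [23, 4710]]

tn, fp = cm[0]  # true label 0 (True news): predicted 0, predicted 1
fn, tp = cm[1]  # true label 1 (Fake news): predicted 0, predicted 1

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision_fake = tp / (tp + fp)  # of articles predicted Fake, how many were Fake
recall_fake = tp / (tp + fn)     # of actual Fake articles, how many were caught
f1_fake = 2 * precision_fake * recall_fake / (precision_fake + recall_fake)

print(round(accuracy, 6))        # 0.994543
print(round(precision_fake, 2))  # 0.99
print(round(recall_fake, 2))     # 1.0
print(round(f1_fake, 2))         # 0.99
```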



# Logistic Regression with TF-IDF

Now, we train Logistic Regression on the **TF-IDF representation**.

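✅ TF-IDF often performs better than raw counts because:
- It reduces the weight of very common words (like "the", "is").
- It emphasizes rare but important words.

By comparing results with Bag of Words, we can **see how feature representation impacts performance**. On this dataset, Bag of Words actually comes out slightly ahead (accuracy of about 0.9945 versus 0.9834 for TF-IDF).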


In [43]:
# Logistic Regression with TF-IDF
log_reg_tfidf = LogisticRegression(max_iter=1000)
log_reg_tfidf.fit(X_train_tfidf, y_train)

# Predictions
y_pred_tfidf = log_reg_tfidf.predict(X_test_tfidf)

# Evaluation
print("🔹 Logistic Regression (TF-IDF)")
print("Accuracy:", accuracy_score(y_test, y_pred_tfidf))
print("\nConfusion Matrix:\n", confusion_matrix(y_test, y_pred_tfidf))
print("\nClassification Report:\n", classification_report(y_test, y_pred_tfidf))


🔹 Logistic Regression (TF-IDF)
Accuracy: 0.9834075723830735

Confusion Matrix:
 [[4179   68]
 [  81 4652]]

Classification Report:
               precision    recall  f1-score   support

           0       0.98      0.98      0.98      4247
           1       0.99      0.98      0.98      4733

    accuracy                           0.98      8980
   macro avg       0.98      0.98      0.98      8980
weighted avg       0.98      0.98      0.98      8980



In [44]:
news_samples = [
    """The World Health Organization announced that the COVID-19 vaccines
    developed by Pfizer and Moderna are safe and effective after reviewing
    multiple studies.""",

    """In a shocking report, it was revealed that aliens have taken over
    the White House and are negotiating peace treaties.""",

    """The Reserve Bank of India announced a new policy to regulate
    cryptocurrency transactions in the country.""",

    """A celebrity was found alive on Mars after a secret mission, according
    to unreliable social media sources."""
]

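# Test on BoW model
# Note: the samples are vectorized as-is; unlike the training data, they are not passed through clean_text() first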
print("=== Predictions using Bag of Words (BoW) ===\n")
for news in news_samples:
    vectorized = bow_vectorizer.transform([news])
    prediction = log_reg_bow.predict(vectorized)[0]
    result = "FAKE ❌" if prediction == 1 else "REAL ✅"
    print(f"{news} --> {result}")

print("\n=== Predictions using TF-IDF ===\n")
for news in news_samples:
    vectorized = tfidf_vectorizer.transform([news])
    prediction = log_reg_tfidf.predict(vectorized)[0]
    result = "FAKE ❌" if prediction == 1 else "REAL ✅"
    print(f"{news} --> {result}")



=== Predictions using Bag of Words (BoW) ===

The World Health Organization announced that the COVID-19 vaccines 
    developed by Pfizer and Moderna are safe and effective after reviewing 
    multiple studies. --> FAKE ❌
In a shocking report, it was revealed that aliens have taken over 
    the White House and are negotiating peace treaties. --> FAKE ❌
The Reserve Bank of India announced a new policy to regulate 
    cryptocurrency transactions in the country. --> FAKE ❌
A celebrity was found alive on Mars after a secret mission, according 
    to unreliable social media sources. --> FAKE ❌

=== Predictions using TF-IDF ===

The World Health Organization announced that the COVID-19 vaccines 
    developed by Pfizer and Moderna are safe and effective after reviewing 
    multiple studies. --> FAKE ❌
In a shocking report, it was revealed that aliens have taken over 
    the White House and are negotiating peace treaties. --> FAKE ❌
The Reserve Bank of India announced a new policy to re