## Problem Statement
Fake news refers to false or misleading information presented as news with the intention to deceive readers. With the rapid spread of information through digital platforms, fake news can influence public opinion and cause social harm. This project aims to build a machine learning model that can automatically classify news articles as real or fake based on their textual content.


In [102]:
import pandas as pd
import numpy as np

In [57]:
fake = pd.read_csv("/content/fake.csv", on_bad_lines='skip', engine='python')
true = pd.read_csv("/content/true.csv", on_bad_lines='skip', engine='python')

## Dataset Description
The dataset consists of two CSV files, one containing real news articles and the other containing fake news articles. Each record includes a title, the article text, a subject, and a date. A binary label (1 for real, 0 for fake) was added to each file, and the two files were then concatenated into a single supervised learning dataset of 44,898 articles (23,481 fake and 21,417 real).


In [58]:
fake.head()

Unnamed: 0,title,text,subject,date
0,Donald Trump Sends Out Embarrassing New Year’...,Donald Trump just couldn t wish all Americans ...,News,"December 31, 2017"
1,Drunk Bragging Trump Staffer Started Russian ...,House Intelligence Committee Chairman Devin Nu...,News,"December 31, 2017"
2,Sheriff David Clarke Becomes An Internet Joke...,"On Friday, it was revealed that former Milwauk...",News,"December 30, 2017"
3,Trump Is So Obsessed He Even Has Obama’s Name...,"On Christmas day, Donald Trump announced that ...",News,"December 29, 2017"
4,Pope Francis Just Called Out Donald Trump Dur...,Pope Francis used his annual Christmas Day mes...,News,"December 25, 2017"


In [59]:
true.head()

Unnamed: 0,title,text,subject,date
0,"As U.S. budget fight looms, Republicans flip t...",WASHINGTON (Reuters) - The head of a conservat...,politicsNews,"December 31, 2017"
1,U.S. military to accept transgender recruits o...,WASHINGTON (Reuters) - Transgender people will...,politicsNews,"December 29, 2017"
2,Senior U.S. Republican senator: 'Let Mr. Muell...,WASHINGTON (Reuters) - The special counsel inv...,politicsNews,"December 31, 2017"
3,FBI Russia probe helped by Australian diplomat...,WASHINGTON (Reuters) - Trump campaign adviser ...,politicsNews,"December 30, 2017"
4,Trump wants Postal Service to charge 'much mor...,SEATTLE/WASHINGTON (Reuters) - President Donal...,politicsNews,"December 29, 2017"


In [60]:
# Label encoding: 1 = real, 0 = fake
true["label"] = 1
fake["label"] = 0

In [61]:
true.head()

Unnamed: 0,title,text,subject,date,label
0,"As U.S. budget fight looms, Republicans flip t...",WASHINGTON (Reuters) - The head of a conservat...,politicsNews,"December 31, 2017",1
1,U.S. military to accept transgender recruits o...,WASHINGTON (Reuters) - Transgender people will...,politicsNews,"December 29, 2017",1
2,Senior U.S. Republican senator: 'Let Mr. Muell...,WASHINGTON (Reuters) - The special counsel inv...,politicsNews,"December 31, 2017",1
3,FBI Russia probe helped by Australian diplomat...,WASHINGTON (Reuters) - Trump campaign adviser ...,politicsNews,"December 30, 2017",1
4,Trump wants Postal Service to charge 'much mor...,SEATTLE/WASHINGTON (Reuters) - President Donal...,politicsNews,"December 29, 2017",1


In [62]:
fake.head()

Unnamed: 0,title,text,subject,date,label
0,Donald Trump Sends Out Embarrassing New Year’...,Donald Trump just couldn t wish all Americans ...,News,"December 31, 2017",0
1,Drunk Bragging Trump Staffer Started Russian ...,House Intelligence Committee Chairman Devin Nu...,News,"December 31, 2017",0
2,Sheriff David Clarke Becomes An Internet Joke...,"On Friday, it was revealed that former Milwauk...",News,"December 30, 2017",0
3,Trump Is So Obsessed He Even Has Obama’s Name...,"On Christmas day, Donald Trump announced that ...",News,"December 29, 2017",0
4,Pope Francis Just Called Out Donald Trump Dur...,Pope Francis used his annual Christmas Day mes...,News,"December 25, 2017",0


In [63]:
news = pd.concat([true, fake])

In [64]:
news.head()

Unnamed: 0,title,text,subject,date,label
0,"As U.S. budget fight looms, Republicans flip t...",WASHINGTON (Reuters) - The head of a conservat...,politicsNews,"December 31, 2017",1
1,U.S. military to accept transgender recruits o...,WASHINGTON (Reuters) - Transgender people will...,politicsNews,"December 29, 2017",1
2,Senior U.S. Republican senator: 'Let Mr. Muell...,WASHINGTON (Reuters) - The special counsel inv...,politicsNews,"December 31, 2017",1
3,FBI Russia probe helped by Australian diplomat...,WASHINGTON (Reuters) - Trump campaign adviser ...,politicsNews,"December 30, 2017",1
4,Trump wants Postal Service to charge 'much mor...,SEATTLE/WASHINGTON (Reuters) - President Donal...,politicsNews,"December 29, 2017",1


In [65]:
news.describe()

Unnamed: 0,label
count,44898.0
mean,0.477015
std,0.499477
min,0.0
25%,0.0
50%,0.0
75%,1.0
max,1.0


In [66]:
news.info()

<class 'pandas.core.frame.DataFrame'>
Index: 44898 entries, 0 to 23480
Data columns (total 5 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   title    44898 non-null  object
 1   text     44898 non-null  object
 2   subject  44898 non-null  object
 3   date     44898 non-null  object
 4   label    44898 non-null  int64 
dtypes: int64(1), object(4)
memory usage: 2.1+ MB


In [67]:
news.isnull().sum()

column,null count
title,0
text,0
subject,0
date,0
label,0


In [68]:
news.shape

(44898, 5)

In [69]:
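# Shuffle the rows so real and fake articles are interleaved, then rebuild a 0..n-1 index
# (no random_state is set, so the row order differs between runs)
news = news.sample(frac=1).reset_index(drop=True)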


In [70]:
news.head()

Unnamed: 0,title,text,subject,date,label
0,"Trump slaps travel restrictions on N.Korea, Ve...",WASHINGTON (Reuters) - President Donald Trump ...,worldnews,"September 25, 2017",1
1,Assange: ‘Trump in Conflict with CIA Over Syri...,This interview with WikiLeaks head Julian Assa...,Middle-east,"April 1, 2017",0
2,U.S. students' rape allegation against Italian...,ROME (Reuters) - Italy s defense minister has ...,worldnews,"September 9, 2017",1
3,Desperation? The Clinton Grifters Go Back On T...,The World Class Grifters aka The Clintons are ...,politics,"Jan 2, 2016",0
4,Kellyanne Conway Just Humiliated Herself On N...,Donald Trump sure hasn t done himself any favo...,News,"August 4, 2017",0


In [71]:
news.shape

(44898, 5)

In [72]:
news["label"].value_counts()


label,count
0,23481
1,21417


In [73]:
news = news.drop(["title", "subject", "date"], axis=1)

In [74]:
news.head(10)

Unnamed: 0,text,label
0,WASHINGTON (Reuters) - President Donald Trump ...,1
1,This interview with WikiLeaks head Julian Assa...,0
2,ROME (Reuters) - Italy s defense minister has ...,1
3,The World Class Grifters aka The Clintons are ...,0
4,Donald Trump sure hasn t done himself any favo...,0
5,Tune in to the Alternate Current Radio Network...,0
6,"SULAIMANIYA, Iraq (Reuters) - On the eve of an...",1
7,"Karachi, Pakistan (Reuters) - A Pakistani peac...",1
8,"Normally, the thought of listening to Coldplay...",0
9,MOSCOW/WASHINGTON (Reuters) - A Russian bank ...,1


## Data Preprocessing
The text was preprocessed to reduce noise before feature extraction. The title, subject, and date columns were dropped. Each article was then lowercased, stripped of URLs and of every character other than letters and whitespace (which removes punctuation and digits), filtered for English stopwords, and lemmatized to standardize word forms.


In [75]:
import re
import nltk
nltk.download('stopwords')
nltk.download('wordnet')


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [76]:
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer


In [77]:
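# Build these once; constructing them on every call would be slow
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def clean_text(text):
    """Lowercase, strip URLs and non-letters, drop stopwords, lemmatize."""
    text = text.lower()
    # Remove URLs first, while their punctuation is still present
    text = re.sub(r'http\S+|www\S+', '', text)
    # Keep only lowercase letters and whitespace (drops digits and punctuation)
    text = re.sub(r'[^a-z\s]', '', text)
    words = text.split()
    # lemmatize() treats every word as a noun unless a POS tag is passed
    words = [lemmatizer.lemmatize(word)
             for word in words
             if word not in stop_words]
    return " ".join(words)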


In [78]:
news["clean_text"] = news["text"].apply(clean_text)


In [79]:
news[["text", "clean_text"]].head()


Unnamed: 0,text,clean_text
0,WASHINGTON (Reuters) - President Donald Trump ...,washington reuters president donald trump sund...
1,This interview with WikiLeaks head Julian Assa...,interview wikileaks head julian assange might ...
2,ROME (Reuters) - Italy s defense minister has ...,rome reuters italy defense minister said basis...
3,The World Class Grifters aka The Clintons are ...,world class grifter aka clinton going road cam...
4,Donald Trump sure hasn t done himself any favo...,donald trump sure done favor appointing equall...
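
The effect of each cleaning stage can be traced on a single sentence. The sketch below is a simplified illustration, not the notebook's `clean_text`: it skips lemmatization and uses a tiny hand-picked stopword set (`TOY_STOPWORDS`, a hypothetical stand-in for the NLTK list) so that it runs without any NLTK downloads. The sample sentence and the function name `trace_cleaning` are also assumptions made for the example.

```python
import re

# Hypothetical stand-in for NLTK's English stopword list
TOY_STOPWORDS = {"it", "its", "now", "the", "is"}

def trace_cleaning(text, stop_words=TOY_STOPWORDS):
    """Return the intermediate result of each cleaning stage."""
    lowered = text.lower()
    # Same URL pattern as the notebook: drop anything starting with http or www
    no_urls = re.sub(r'http\S+|www\S+', '', lowered)
    # Same character filter as the notebook: keep letters and whitespace only
    letters_only = re.sub(r'[^a-z\s]', '', no_urls)
    tokens = [w for w in letters_only.split() if w not in stop_words]
    return {
        "lowered": lowered,
        "no_urls": no_urls,
        "letters_only": letters_only,
        "tokens": tokens,
        "final": " ".join(tokens),
    }

sample = "Breaking: visit https://example.com NOW, it's 100% true!"
trace = trace_cleaning(sample)
for stage, value in trace.items():
    print(f"{stage:>12}: {value!r}")
```

Two side effects are worth noticing in the trace: the apostrophe is deleted rather than replaced by a space, so "it's" becomes "its", and numbers such as "100%" disappear entirely.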


## Feature Extraction
To convert the text into numerical form, TF-IDF (Term Frequency–Inverse Document Frequency) vectorization was used. TF-IDF gives a term a higher weight when it is frequent within a document but rare across the dataset, which suits text classification. The vectorizer here uses unigrams and bigrams, ignores terms that appear in more than 70% of the articles or in fewer than 5 articles, and produces 342,755 features.
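
The weighting idea can be seen on a toy corpus. The following is a minimal sketch with three made-up headlines (an assumption for illustration, not data from this project) and default `TfidfVectorizer` settings rather than the ones used in this notebook. In the third document, "aliens" and "senate" each occur once, but "aliens" appears in only one document while "senate" appears in all three.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up headlines, used only to illustrate the weighting
docs = [
    "senate passes budget bill",
    "senate debates budget cuts",
    "aliens endorse senate budget",
]

vec = TfidfVectorizer()
weights = vec.fit_transform(docs)   # sparse matrix: one row per document
vocab = vec.vocabulary_             # maps each term to its column index

# Weights of two terms in the third document
row = weights[2].toarray().ravel()
aliens_w = row[vocab["aliens"]]
senate_w = row[vocab["senate"]]

print("idf(aliens):", vec.idf_[vocab["aliens"]])
print("idf(senate):", vec.idf_[vocab["senate"]])
print("weight of 'aliens' in doc 3:", aliens_w)
print("weight of 'senate' in doc 3:", senate_w)
```

Both terms have the same term frequency in that document, so the difference in weight comes entirely from the inverse document frequency.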


In [80]:
from sklearn.feature_extraction.text import TfidfVectorizer


In [81]:
X = news["clean_text"]
y = news["label"]


In [82]:
tfidf = TfidfVectorizer(
    max_df=0.7,
    min_df=5,
    ngram_range=(1, 2)
)


In [83]:
X_tfidf = tfidf.fit_transform(X)


In [84]:
X_tfidf.shape


(44898, 342755)

## Train-Test Split
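The dataset was divided into training and testing sets using an 80:20 split. Stratified sampling was applied so that both sets keep the same proportion of real and fake articles as the full dataset. Because the TF-IDF vectorizer was fitted on the full dataset before the split, its vocabulary and IDF weights also reflect the test articles; fitting it on the training set alone would give a stricter evaluation.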


In [85]:
from sklearn.model_selection import train_test_split


In [86]:
X_train, X_test, y_train, y_test = train_test_split(
    X_tfidf,
    y,
    test_size=0.2,
    random_state=42,
    stratify=y
)


In [87]:
print("Training set:", X_train.shape)
print("Testing set:", X_test.shape)


Training set: (35918, 342755)
Testing set: (8980, 342755)
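
The effect of `stratify=y` can be checked directly by comparing class shares before and after the split. This sketch rebuilds a label vector with the same class counts reported above (23,481 fake and 21,417 real) and splits row indices instead of the TF-IDF matrix, so it runs on its own; the helper name `fake_share` is an assumption for the example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with the same class counts as the notebook's dataset
y_all = np.array([0] * 23481 + [1] * 21417)
row_ids = np.arange(len(y_all))  # stand-in for the feature matrix rows

ids_train, ids_test, y_tr, y_te = train_test_split(
    row_ids,
    y_all,
    test_size=0.2,
    random_state=42,
    stratify=y_all,
)

def fake_share(labels):
    """Fraction of samples labeled 0 (fake)."""
    return float(np.mean(labels == 0))

print("fake share, full set :", round(fake_share(y_all), 4))
print("fake share, train set:", round(fake_share(y_tr), 4))
print("fake share, test set :", round(fake_share(y_te), 4))
```

With stratification, the three shares agree to within rounding; without it, they would vary with the random draw.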


## Model Training
Two machine learning models were trained for fake news detection: Logistic Regression and Multinomial Naive Bayes. These models are well-suited for high-dimensional sparse text data and serve as effective baselines for text classification.


In [88]:
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression(max_iter=1000)
lr.fit(X_train, y_train)


In [89]:
y_pred_lr = lr.predict(X_test)


In [90]:
from sklearn.naive_bayes import MultinomialNB

nb = MultinomialNB()
nb.fit(X_train, y_train)


In [91]:
y_pred_nb = nb.predict(X_test)


## Model Evaluation
The trained models were evaluated using standard classification metrics including precision, recall, F1-score, and accuracy. These metrics provide a detailed understanding of the models' performance, especially in detecting fake news.


In [92]:
from sklearn.metrics import (
    accuracy_score,
    classification_report,
    confusion_matrix
)


In [93]:
print("Logistic Regression Accuracy:", accuracy_score(y_test, y_pred_lr))


Logistic Regression Accuracy: 0.9824053452115813


In [94]:
print("Logistic Regression Classification Report:\n")
print(classification_report(y_test, y_pred_lr))


Logistic Regression Classification Report:

              precision    recall  f1-score   support

           0       0.98      0.98      0.98      4696
           1       0.98      0.98      0.98      4284

    accuracy                           0.98      8980
   macro avg       0.98      0.98      0.98      8980
weighted avg       0.98      0.98      0.98      8980



In [95]:
cm_lr = confusion_matrix(y_test, y_pred_lr)
cm_lr


array([[4610,   86],
       [  72, 4212]])

In [96]:
print("Naive Bayes Accuracy:", accuracy_score(y_test, y_pred_nb))


Naive Bayes Accuracy: 0.9544543429844098


In [97]:
print("Naive Bayes Classification Report:\n")
print(classification_report(y_test, y_pred_nb))


Naive Bayes Classification Report:

              precision    recall  f1-score   support

           0       0.95      0.96      0.96      4696
           1       0.95      0.95      0.95      4284

    accuracy                           0.95      8980
   macro avg       0.95      0.95      0.95      8980
weighted avg       0.95      0.95      0.95      8980



In [98]:
cm_nb = confusion_matrix(y_test, y_pred_nb)
cm_nb


array([[4503,  193],
       [ 216, 4068]])
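
The figures in the classification reports can be recomputed from the two confusion matrices above. The sketch below hard-codes those matrices and assumes scikit-learn's layout (rows are true labels, columns are predicted labels, with class 0 = fake and class 1 = real); the helper names are chosen for the example.

```python
import numpy as np

# Confusion matrices copied from the outputs above
cm_lr = np.array([[4610, 86], [72, 4212]])
cm_nb = np.array([[4503, 193], [216, 4068]])

def per_class_metrics(cm, cls):
    """Precision, recall and F1 for one class of a confusion matrix."""
    tp = cm[cls, cls]
    fn = cm[cls].sum() - tp      # true class `cls`, predicted as something else
    fp = cm[:, cls].sum() - tp   # predicted `cls`, actually something else
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

def accuracy(cm):
    """Share of correct predictions: diagonal over total."""
    return np.trace(cm) / cm.sum()

for name, cm in [("Logistic Regression", cm_lr), ("Naive Bayes", cm_nb)]:
    p, r, f1 = per_class_metrics(cm, cls=0)
    print(f"{name}: accuracy={accuracy(cm):.4f}, "
          f"fake precision={p:.2f}, recall={r:.2f}, f1={f1:.2f}, "
          f"fake predicted as real={cm[0, 1]}")
```

The entry `cm[0, 1]` counts fake articles predicted as real, which is the error type the model comparison focuses on.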

## Model Comparison
Two machine learning models, Logistic Regression and Multinomial Naive Bayes, were evaluated for fake news detection using TF-IDF features. Logistic Regression reached 98.2% accuracy, with precision, recall, and F1-score of 0.98 for both classes. Naive Bayes reached 95.4% accuracy, with precision of 0.95, recall of 0.96, and F1-score of 0.96 for fake news.

Fake articles classified as real are the most costly error in this problem. Logistic Regression made 86 such errors on the test set against 193 for Naive Bayes, and it also misclassified fewer real articles (72 against 216), so it was selected as the final model.


## Custom News Prediction


To demonstrate how the model would be used in practice, the final trained model was run on two manually written news samples. This is a quick sanity check on unseen text rather than a measure of accuracy.


In [99]:
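def predict_news(text):
    """Print whether the final Logistic Regression model labels `text` real or fake."""
    # Apply the same cleaning and the already-fitted TF-IDF vocabulary used in training
    cleaned_text = clean_text(text)
    vectorized_text = tfidf.transform([cleaned_text])
    prediction = lr.predict(vectorized_text)[0]

    # Labels follow the dataset encoding: 1 = real, 0 = fake
    if prediction == 1:
        print("🟢 REAL NEWS")
    else:
        print("🔴 FAKE NEWS")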


In [100]:
real_news = """
The government on Monday announced a new policy aimed at improving digital infrastructure
and boosting investments in renewable energy across the country.
"""

predict_news(real_news)


🟢 REAL NEWS


In [101]:
fake_news = """
Breaking news! Scientists have confirmed that aliens are secretly living among humans
and will reveal themselves next week according to anonymous sources.
"""

predict_news(fake_news)


🔴 FAKE NEWS
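
A printed label hides how confident the model is. One possible variant, sketched here under stated assumptions, returns the label together with the predicted probability of the "real" class via `predict_proba`. The function name `predict_news_proba`, the four toy training sentences, and the toy model are all hypothetical and exist only to make the example runnable on its own; in the notebook, the fitted `tfidf` and `lr` objects would be passed in and `clean_text` applied to the input first.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy training data (made up): 1 = real, 0 = fake
toy_texts = [
    "government announces new infrastructure policy",
    "central bank keeps interest rates unchanged",
    "aliens secretly living among humans say anonymous sources",
    "shocking miracle cure doctors do not want you to know",
]
toy_labels = [1, 1, 0, 0]

toy_tfidf = TfidfVectorizer().fit(toy_texts)
toy_lr = LogisticRegression(max_iter=1000).fit(
    toy_tfidf.transform(toy_texts), toy_labels
)

def predict_news_proba(text, vectorizer, model):
    """Return (label, probability that the text is real)."""
    proba = model.predict_proba(vectorizer.transform([text]))[0]
    # Look up the column for class 1 instead of assuming its position
    p_real = float(proba[list(model.classes_).index(1)])
    label = "REAL" if p_real >= 0.5 else "FAKE"
    return label, p_real

label, p_real = predict_news_proba(
    "anonymous sources say aliens will reveal themselves", toy_tfidf, toy_lr
)
print(label, round(p_real, 3))
```

Returning a value instead of printing also makes the function easier to reuse, for example when scoring a list of articles.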


## Conclusion
In this project, a fake news detection system was developed using classical machine learning and natural language processing techniques. News articles were preprocessed by cleaning the text and removing noise, followed by feature extraction using TF-IDF vectorization.

Two machine learning models, Logistic Regression and Multinomial Naive Bayes, were trained and evaluated. Based on performance metrics such as precision, recall, and F1-score, Logistic Regression showed superior and more balanced performance, especially in detecting fake news.

The results show that classical machine learning models combined with careful text preprocessing can separate the real and fake articles in this dataset well: Logistic Regression reached 98.2% accuracy on the held-out 20% test set.


## Future Work
Future improvements to this project may include:
- Combining article titles with full text to capture additional contextual information.
- Exploring deep learning approaches such as LSTM or transformer-based models for better semantic understanding.
- Deploying the trained model as a web application or API for real-time fake news detection.
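
As a starting point for deployment, the vectorizer and classifier could be bundled into a single scikit-learn `Pipeline` and saved with `joblib`. This is a minimal sketch, not the notebook's code: the toy corpus, the file name `fake_news_pipeline.joblib`, and the omission of `max_df`/`min_df` (which would empty the vocabulary on a corpus this small) are assumptions made for the example. Fitting the pipeline on training text also means the TF-IDF statistics are learned from the training data only.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Toy corpus (made up): 1 = real, 0 = fake
texts = [
    "parliament approves annual budget after debate",
    "ministry publishes quarterly employment figures",
    "court rules on regional election dispute",
    "city council votes to expand public transport",
    "secret cabal controls weather with hidden machines",
    "celebrity clone replaces politician claims insider",
    "miracle fruit cures every disease overnight",
    "aliens built the pyramids says anonymous source",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

pipe = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),
    ("clf", LogisticRegression(max_iter=1000)),
])
pipe.fit(texts, labels)  # the vectorizer sees only the text passed here

# Save to a temporary directory and load it back, as a service would at startup
path = os.path.join(tempfile.mkdtemp(), "fake_news_pipeline.joblib")
joblib.dump(pipe, path)
restored = joblib.load(path)

sample = ["anonymous source claims hidden machines control elections"]
same = bool((restored.predict(sample) == pipe.predict(sample)).all())
print("restored pipeline gives the same prediction:", same)
```

A web application or API would then only need to load the saved file and call `predict` on incoming text, with the same cleaning step applied beforehand.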
