# Predicting Fake News

**Author:** [Syed Muhammad Ebad](https://www.kaggle.com/syedmuhammadebad)  
**Date:** 23-Oct-2024  
[Send me an email](mailto:mohammadebad1@hotmail.com)  
[Visit my GitHub profile](https://github.com/smebad)

**Dataset:** [Fake News](https://www.kaggle.com/competitions/fake-news/data?select=train.csv)

## 1. Introduction
In this notebook, we predict whether a news article is fake or real using the dataset from the Fake News competition. The dataset includes the author, title, and text of each article. We combine the author and title, apply text preprocessing and TF-IDF vectorization, and train a Logistic Regression model to classify each article as fake or real.

The dataset contains:
* id: Unique identifier for the news
* title: Title of the news article
* author: Author of the news article
* text: The text of the article
* label: Target label where 0 indicates "Real" and 1 indicates "Fake"

## 2. Importing Required Libraries

In [57]:
import pandas as pd
import numpy as np
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Download necessary NLTK data
nltk.download('stopwords')

# Ignore warnings
import warnings
warnings.filterwarnings('ignore')


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\DC\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


## 3. Loading and Reviewing the Dataset

In [58]:
# Load the dataset
df = pd.read_csv('train.csv')
df.head()

|   | id | title | author | text | label |
|---|----|-------|--------|------|-------|
| 0 | 0 | House Dem Aide: We Didn’t Even See Comey’s Let... | Darrell Lucus | House Dem Aide: We Didn’t Even See Comey’s Let... | 1 |
| 1 | 1 | FLYNN: Hillary Clinton, Big Woman on Campus - ... | Daniel J. Flynn | Ever get the feeling your life circles the rou... | 0 |
| 2 | 2 | Why the Truth Might Get You Fired | Consortiumnews.com | Why the Truth Might Get You Fired October 29, ... | 1 |
| 3 | 3 | 15 Civilians Killed In Single US Airstrike Hav... | Jessica Purkiss | Videos 15 Civilians Killed In Single US Airstr... | 1 |
| 4 | 4 | Iranian woman jailed for fictional unpublished... | Howard Portnoy | Print \nAn Iranian woman has been sentenced to... | 1 |


## 4. Dataset Information and Overview
Let’s take a look at the dataset structure and check for any missing values.

In [59]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20800 entries, 0 to 20799
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   id      20800 non-null  int64 
 1   title   20242 non-null  object
 2   author  18843 non-null  object
 3   text    20761 non-null  object
 4   label   20800 non-null  int64 
dtypes: int64(2), object(3)
memory usage: 812.6+ KB


### Observation:
* We have 20800 entries and 5 columns (id, author, title, text, label).
* The title, author, and text columns contain missing values.

## 5. Handling Missing Values
Since the dataset is large, we drop every row that has a missing value for simplicity. This removes 2515 of the 20800 rows (about 12%).

In [60]:
df.isnull().sum()

id           0
title      558
author    1957
text        39
label        0
dtype: int64

In [61]:
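# Drop rows with missing values; .copy() makes df an independent frame before new columns are added
df = df.dropna().copy()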
df.isnull().sum()

id        0
title     0
author    0
text      0
label     0
dtype: int64

### Observation:
* Missing values have been successfully removed, and we are left with 18285 rows.
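
Dropping rows is the simplest option, but it also discards articles that are only missing an author or a title. As a possible alternative (not what this notebook does), missing strings could be replaced with empty strings so that every row is kept. The tiny DataFrame below is made up for illustration and only mimics some of the train.csv columns.

```python
import pandas as pd

# Made-up rows that mimic some train.csv columns (illustrative only)
toy = pd.DataFrame({
    "title": ["Some headline", None, "Another headline"],
    "author": ["Jane Doe", "John Roe", None],
    "label": [0, 1, 1],
})

# Option used in this notebook: drop every row that has a missing value
dropped = toy.dropna()

# Alternative: keep all rows and replace missing text with an empty string
filled = toy.fillna("")

print(len(dropped), len(filled))
```

Whether keeping these rows helps is an empirical question; it would need to be checked against the accuracy reported later.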

## 6. Combining Author and Title
We create a new feature, content, by concatenating the author and title columns so that the model can use both.

In [62]:
df['content'] = df['author']+' '+df['title']
print(df['content'])

0        Darrell Lucus House Dem Aide: We Didn’t Even S...
1        Daniel J. Flynn FLYNN: Hillary Clinton, Big Wo...
2        Consortiumnews.com Why the Truth Might Get You...
3        Jessica Purkiss 15 Civilians Killed In Single ...
4        Howard Portnoy Iranian woman jailed for fictio...
                               ...                        
20795    Jerome Hudson Rapper T.I.: Trump a ’Poster Chi...
20796    Benjamin Hoffman N.F.L. Playoffs: Schedule, Ma...
20797    Michael J. de la Merced and Rachel Abrams Macy...
20798    Alex Ansary NATO, Russia To Hold Parallel Exer...
20799              David Swanson What Keeps the F-35 Alive
Name: content, Length: 18285, dtype: object


### Observation:
* The new content column is created, combining the available text from the author and title.

## 7. Text Preprocessing
We will now preprocess the text data by:

1. Removing non-alphabetic characters.
2. Converting text to lowercase.
3. Removing stopwords.
4. Applying stemming using PorterStemmer.

In [63]:
ps = PorterStemmer()
# Build the stopword set once instead of rebuilding it for every word
stop_words = set(stopwords.words('english'))

def stem(text):
    # Keep letters only, convert to lowercase, and split into tokens
    tokens = re.sub('[^a-zA-Z]', ' ', text).lower().split()
    # Drop stopwords and stem the remaining tokens
    tokens = [ps.stem(word) for word in tokens if word not in stop_words]
    return ' '.join(tokens)

df['content'] = df['content'].apply(stem)
print(df['content'].head())


0    darrel lucu hous dem aid even see comey letter...
1    daniel j flynn flynn hillari clinton big woman...
2               consortiumnew com truth might get fire
3    jessica purkiss civilian kill singl us airstri...
4    howard portnoy iranian woman jail fiction unpu...
Name: content, dtype: object


### Observation:
* The text data has been successfully preprocessed with non-alphabetical characters removed, text converted to lowercase, stopwords filtered, and stemming applied.
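
To make the four steps concrete, here is a small worked example on a single string. It is a sketch: the function name `preprocess` and the hand-picked stopword list are assumptions for illustration (the notebook uses NLTK's full English stopword list, which needs a download), and the sample sentence is written out by hand.

```python
import re
from nltk.stem.porter import PorterStemmer

ps = PorterStemmer()

# Small hand-picked stopword list for illustration only;
# the notebook uses stopwords.words('english') from NLTK instead
demo_stop_words = {"we", "didn", "t", "s", "the", "a", "of"}

def preprocess(text, stop_words):
    # 1. Replace every non-alphabetic character with a space
    text = re.sub("[^a-zA-Z]", " ", text)
    # 2. Lowercase and split into tokens
    tokens = text.lower().split()
    # 3. Remove stopwords, 4. stem what is left
    return " ".join(ps.stem(tok) for tok in tokens if tok not in stop_words)

sample = "Darrell Lucus House Dem Aide: We Didn't Even See Comey's Letter"
cleaned = preprocess(sample, demo_stop_words)
print(cleaned)
```

Note how the apostrophes split "Didn't" and "Comey's" into extra tokens ("didn", "t", "s"), which are then removed as stopwords.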

## 8. Feature Selection and Target Variable
We will now extract our feature (X) and target (y) variables.

In [64]:
X = df['content'].values
y = df['label'].values

### Note:
* X contains the preprocessed news content, and y contains the target labels indicating real (0) or fake (1) news.

## 9. Text Vectorization Using TF-IDF
To convert text data into numerical form, we will use TF-IDF vectorization.

In [65]:
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(X)
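
As a rough illustration of what TF-IDF produces, the sketch below vectorizes three made-up, already-stemmed strings. The variable names and the toy corpus are assumptions and are not part of the dataset.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Three made-up, already-preprocessed documents
docs = ["trump email leak", "trump ralli crowd", "email scam warn"]

toy_vectorizer = TfidfVectorizer()
toy_matrix = toy_vectorizer.fit_transform(docs)  # sparse matrix: rows = docs, columns = terms

vocab = toy_vectorizer.vocabulary_               # maps each term to its column index
print(toy_matrix.shape)
print(sorted(vocab))

# In the first document, "leak" appears in fewer documents than "trump",
# so it receives the larger weight
leak_weight = toy_matrix[0, vocab["leak"]]
trump_weight = toy_matrix[0, vocab["trump"]]
print(leak_weight > trump_weight)

# With default settings each row is scaled to unit L2 norm
row_norms = np.sqrt(toy_matrix.multiply(toy_matrix).sum(axis=1))
```

Terms that appear in many documents are down-weighted, which is why TF-IDF tends to suit short texts such as author and title strings better than raw counts.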

## 10. Splitting the Dataset
We will split the dataset into training and testing sets using an 80-20 split.

In [66]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
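
The split above is purely random. If the class balance should be the same in both parts, `train_test_split` also accepts a `stratify` argument. The sketch below uses made-up labels with a 70/30 class ratio to show the effect; it is an optional variation, not the split used in this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 100 samples, 70 of class 0 and 30 of class 1
X_demo = np.arange(100).reshape(-1, 1)
y_demo = np.array([0] * 70 + [1] * 30)

# stratify=y_demo keeps the 70/30 class ratio in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print(len(y_tr), len(y_te))
print(y_tr.mean(), y_te.mean())  # share of class 1 in each part
```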

## 11. Model Training Using Logistic Regression
Now we will train the Logistic Regression model on the training data.

In [67]:
model = LogisticRegression()
model.fit(X_train, y_train)
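
In the cells above, the vectorizer is fitted on all rows before the split, so the vocabulary and IDF weights are also influenced by the test articles. One common way to avoid this, shown here as a sketch on made-up strings rather than as this notebook's method, is to wrap both steps in a scikit-learn pipeline and fit it on the training text only.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Made-up, already-preprocessed strings and labels (0 = real, 1 = fake)
train_texts = [
    "report senat vote budget",
    "senat pass budget bill",
    "shock secret cure doctor hate",
    "secret alien shock truth",
]
train_labels = [0, 0, 1, 1]
test_texts = ["senat budget report", "shock alien cure"]

# The vectorizer is fitted inside the pipeline, on training text only
pipe = make_pipeline(TfidfVectorizer(), LogisticRegression())
pipe.fit(train_texts, train_labels)

# At prediction time the pipeline applies transform() and then predict()
preds = pipe.predict(test_texts)
print(preds)
```

With this layout, the split would be done on the raw content strings first and the pipeline fitted on the training part.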

## 12. Model Evaluation
Let’s evaluate the model performance by predicting on the test set and calculating the accuracy.

In [69]:
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Model Accuracy: ", accuracy)

Model Accuracy:  0.9811320754716981
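
Accuracy alone does not show which kind of error the model makes. A confusion matrix breaks the result down into real articles flagged as fake and fake articles that were missed. The labels below are made up to keep the example self-contained; in the notebook, `y_test` and `y_pred` would be passed instead.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up labels for illustration (0 = real, 1 = fake)
y_true_demo = [0, 0, 1, 1, 1, 0]
y_pred_demo = [0, 1, 1, 1, 0, 0]

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true_demo, y_pred_demo)
tn, fp, fn, tp = cm.ravel()

acc = accuracy_score(y_true_demo, y_pred_demo)
print(cm)
print("real flagged as fake:", fp, "| fake missed:", fn, "| accuracy:", acc)
```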


## 13. Making Predictions on New News
We will now test the model on a single article from the test set (the row at index 3) and report whether it is predicted to be real or fake.

In [70]:
new_news = X_test[3]
prediction = model.predict(new_news)
print(prediction)

if prediction[0] == 0:
    print("The news is real")
else:
    print("The news is fake")

[0]
The news is real
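
The row above was already vectorized. A genuinely new article arrives as raw text, so it has to go through the same preprocessing and through the fitted vectorizer's `transform` (not `fit_transform`) before prediction. The sketch below shows that flow on made-up data; `simple_clean` is a stand-in for the notebook's `stem` function, and `predict_label` is a hypothetical helper.

```python
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

def simple_clean(text):
    # Stand-in for the notebook's stem(): letters only, lowercase
    return " ".join(re.sub("[^a-zA-Z]", " ", text).lower().split())

# Made-up "author + title" strings and labels (0 = real, 1 = fake)
train_texts = [
    "Jane Doe Senate passes budget bill",
    "John Roe Court rules on budget dispute",
    "Anonymous Shocking miracle cure revealed",
    "Admin Secret miracle they hide from you",
]
train_labels = [0, 0, 1, 1]

demo_vectorizer = TfidfVectorizer()
X_demo = demo_vectorizer.fit_transform([simple_clean(t) for t in train_texts])
demo_model = LogisticRegression().fit(X_demo, train_labels)

def predict_label(raw_text):
    # Reuse the fitted vocabulary: transform, never fit_transform, on new text
    features = demo_vectorizer.transform([simple_clean(raw_text)])
    return "fake" if demo_model.predict(features)[0] == 1 else "real"

result = predict_label("Jane Doe Senate debates new budget")
print(result)
```

Words that were not seen during fitting are ignored by `transform`, so very short or unusual headlines may end up with few or no active features.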


## Summary
In this notebook, we explored the Fake News dataset and built a Logistic Regression model to classify news articles as fake or real. We performed the following steps:

1. Loaded and cleaned the dataset by dropping rows with missing values.
2. Preprocessed the text data by removing stopwords, stemming, and vectorizing the text using TF-IDF.
3. Split the data into training and test sets.
4. Trained a Logistic Regression model on the training data.
5. Evaluated the model's performance with an accuracy score.
6. Made a prediction on an article from the test set.

The model reaches about 98% accuracy on the held-out test set using only the author and title, which makes it a simple but effective baseline for classifying fake news with Logistic Regression.