# ECE 580 Project Code (Final)

## Fake News Detection using SVM and TF-IDF

- Rebecca Du (rrd17)
- Anish Parmar (avp30)

## Imports

In [3]:
import pandas as pd
import numpy as np
import re
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
import string
import nltk
from nltk.corpus import stopwords
from sklearn.svm import SVC
from sklearn.metrics import precision_score, recall_score, f1_score, classification_report, confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

In [4]:
#Download NLTK stopwords
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\rebec\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


## Step 1: Data Preprocessing

## Overview:

To further test the model, we use the [fake_real_news_dataset](https://github.com/GeorgeMcIntire/fake_real_news_dataset/tree/main). It consists of a single CSV file containing both real and fake news articles.

- **Real news**: $50\%$ of the dataset are random real news articles from sources such as the New York Times, Wall Street Journal, Bloomberg, NPR, and The Guardian which were published in 2015 or 2016.
- **Fake news**: $50\%$ of the dataset are random articles from the [Getting Real about Fake News](https://www.kaggle.com/datasets/mrisdal/fake-news) dataset, which itself contains text and metadata scraped from $244$ fake news websites tagged by the BS Detector Chrome Extension. The data comes from the 2016 election cycle.

Each data point contains the following features:
- **idd**: A unique ID used to identify the article
- **title**: The title of the article
- **text**: The full body of text of the article
- **label**: The label of the article as either REAL or FAKE news

## Step 1a: Load, Prune, Relabel Dataset

Since the real and fake articles are already combined into a single dataset, we load it directly into a dataframe.

Then, we drop any rows containing NaN values.

Finally, we relabel REAL as 1 and FAKE as 0 for our models.
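
As a hedged illustration of the pruning and relabeling steps, the sketch below runs them on a small made-up frame (the rows and the `label_map` name are assumptions, not taken from the dataset). `Series.map` with a dict returns NaN for any value missing from the dict, so counting NaN values after mapping is one way to catch an unexpected label spelling.

```python
import pandas as pd

# Toy frame standing in for the real CSV; all values here are invented.
toy = pd.DataFrame({
    "title": ["Headline A", None, "Headline C"],
    "text": ["Body A", "Body B", "Body C"],
    "label": ["REAL", "FAKE", "Fake"],  # last row has an inconsistent spelling
})

# Prune rows that contain any NaN (this drops the row with the missing title).
toy = toy.dropna().copy()

# Relabel: REAL -> 1, FAKE -> 0. Values absent from the dict become NaN.
label_map = {"REAL": 1, "FAKE": 0}
mapped = toy["label"].map(label_map)

# Count labels that failed to map; a nonzero count signals an unexpected label.
n_unmapped = int(mapped.isna().sum())
print(n_unmapped)
```

In this toy case the count is nonzero because of the deliberately misspelled `Fake`; on the real dataset a count of zero would confirm that every row was labeled REAL or FAKE.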

In [8]:
#Load dataset as a dataframe
data = pd.read_csv('new_data/fake_and_real_news_dataset.csv')

In [9]:
#Check data
data

| | idd | title | text | label |
|---|---|---|---|---|
| 0 | Fq+C96tcx+ | ‘A target on Roe v. Wade ’: Oklahoma bill maki... | UPDATE: Gov. Fallin vetoed the bill on Friday.... | REAL |
| 1 | bHUqK!pgmv | Study: women had to drive 4 times farther afte... | Ever since Texas laws closed about half of the... | REAL |
| 2 | 4Y4Ubf%aTi | Trump, Clinton clash in dueling DC speeches | Donald Trump and Hillary Clinton, now at the s... | REAL |
| 3 | _CoY89SJ@K | Grand jury in Texas indicts activists behind P... | A Houston grand jury investigating criminal al... | REAL |
| 4 | +rJHoRQVLe | As Reproductive Rights Hang In The Balance, De... | WASHINGTON -- Forty-three years after the Supr... | REAL |
| ... | ... | ... | ... | ... |
| 4589 | ukZm6JTO#x | Russia Calls the War Party's Bluff | License DMCA \nCold War 2.0 has reached unprec... | FAKE |
| 4590 | yu0xKEiapJ | Bernie Sanders: The Democratic primary gave me... | Print \nSen. Bernie Sanders laid out the ways ... | FAKE |
| 4591 | c4Y370E_9c | Pipeline Police Strip Search Native Girl, Then... | As the pressure to start construction on the D... | FAKE |
| 4592 | bBbeuCUeMH | Currency Crisis: Alasdair MacLeod On The Vexed... | Tweet Home » Gold » Gold News » Currency Crisi... | FAKE |


In [10]:
#Count how many NaN values there are and where they are located
data.isna().sum()

idd      0
title    1
text     0
label    0
dtype: int64

In [11]:
#Drop rows with NaN, check NaN sum again
data = data.dropna()
data.isna().sum()

idd      0
title    0
text     0
label    0
dtype: int64

In [12]:
#Count how many real and fake data points there are
data['label'].value_counts()

label
FAKE    2297
REAL    2296
Name: count, dtype: int64

In [13]:
#Relabel FAKE as 0 and REAL as 1
data.loc[:, 'label'] = data['label'].map({'REAL': 1, 'FAKE': 0})

#Check data
data

| | idd | title | text | label |
|---|---|---|---|---|
| 0 | Fq+C96tcx+ | ‘A target on Roe v. Wade ’: Oklahoma bill maki... | UPDATE: Gov. Fallin vetoed the bill on Friday.... | 1 |
| 1 | bHUqK!pgmv | Study: women had to drive 4 times farther afte... | Ever since Texas laws closed about half of the... | 1 |
| 2 | 4Y4Ubf%aTi | Trump, Clinton clash in dueling DC speeches | Donald Trump and Hillary Clinton, now at the s... | 1 |
| 3 | _CoY89SJ@K | Grand jury in Texas indicts activists behind P... | A Houston grand jury investigating criminal al... | 1 |
| 4 | +rJHoRQVLe | As Reproductive Rights Hang In The Balance, De... | WASHINGTON -- Forty-three years after the Supr... | 1 |
| ... | ... | ... | ... | ... |
| 4589 | ukZm6JTO#x | Russia Calls the War Party's Bluff | License DMCA \nCold War 2.0 has reached unprec... | 0 |
| 4590 | yu0xKEiapJ | Bernie Sanders: The Democratic primary gave me... | Print \nSen. Bernie Sanders laid out the ways ... | 0 |
| 4591 | c4Y370E_9c | Pipeline Police Strip Search Native Girl, Then... | As the pressure to start construction on the D... | 0 |
| 4592 | bBbeuCUeMH | Currency Crisis: Alasdair MacLeod On The Vexed... | Tweet Home » Gold » Gold News » Currency Crisi... | 0 |


## Step 1b: Combine Relevant Columns into 'Content'

Since we are mostly focused on the content of the articles, we will combine the title with the text columns and drop everything except the 'content' and 'label' columns.
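
The small sketch below, on invented strings, shows why the NaN pruning in Step 1a matters before this concatenation: in pandas, adding strings element-wise propagates missing values, so a row with a missing title would end up with missing content as well. The series names are hypothetical.

```python
import pandas as pd

# Invented examples; the second title is missing on purpose.
titles = pd.Series(["Headline A", None])
texts = pd.Series(["Body A", "Body B"])

# Element-wise concatenation; a missing title makes the whole result missing.
content = titles + " " + texts

print(content.isna().tolist())
```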

In [15]:
#Combine title and text into 'content' column
#(assign returns a new dataframe, which avoids pandas' SettingWithCopyWarning)
data = data.assign(content=data['title'] + ' ' + data['text'])

#Drop other columns (except for label)
data = data.drop(columns=['title', 'text', 'idd'])

#Check data
data


| | label | content |
|---|---|---|
| 0 | 1 | ‘A target on Roe v. Wade ’: Oklahoma bill maki... |
| 1 | 1 | Study: women had to drive 4 times farther afte... |
| 2 | 1 | Trump, Clinton clash in dueling DC speeches Do... |
| 3 | 1 | Grand jury in Texas indicts activists behind P... |
| 4 | 1 | As Reproductive Rights Hang In The Balance, De... |
| ... | ... | ... |
| 4589 | 0 | Russia Calls the War Party's Bluff License DMC... |
| 4590 | 0 | Bernie Sanders: The Democratic primary gave me... |
| 4591 | 0 | Pipeline Police Strip Search Native Girl, Then... |
| 4592 | 0 | Currency Crisis: Alasdair MacLeod On The Vexed... |


## Step 1c: Text Cleaning

Next, we will do some simple cleaning of the 'content' column:
- Convert it to lowercase
- Remove punctuation
- Remove stopwords
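
To make the order of these steps concrete, here is a minimal sketch on a single invented sentence. It uses a tiny hypothetical stopword set (`toy_stop_words`) rather than the NLTK list, and the function name `clean` is also an assumption. Because punctuation is stripped before the stopword check, a stopword that contains an apostrophe (such as "don't", if the list includes it) no longer matches once it has become "dont".

```python
import re

# Small hypothetical stopword set; the notebook uses NLTK's English list instead.
toy_stop_words = {"the", "is", "a", "don't", "on"}

def clean(text, stop_words):
    """Lowercase, strip punctuation, then drop stopwords."""
    text = text.lower()
    # Remove every character that is not a word character or whitespace.
    text = re.sub(r"[^\w\s]", "", text)
    return " ".join(w for w in text.split() if w not in stop_words)

result = clean("The bill is NOT a law; don't rely on it!", toy_stop_words)
print(result)  # prints: bill not law dont rely it
```

If contracted stopwords should be removed as well, one option would be to filter stopwords before stripping punctuation; the notebook keeps the simpler order shown here.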

In [17]:
#Create a function for the data preprocessing tasks
def preprocessing(text):
    #Turn to lowercase
    text = text.lower()

    #Remove punctuation (any character that is not a word character or whitespace)
    text = re.sub(r'[^\w\s]', '', text)

    #Split text into individual words and keep only the non-stopwords
    cleaned_text = [word for word in text.split() if word not in stop_words]

    #Recombine into one string
    return ' '.join(cleaned_text)

In [18]:
#Apply preprocessing function to data
data['content'] = data['content'].apply(preprocessing)

#Check to make sure it looks right
data

| | label | content |
|---|---|---|
| 0 | 1 | target roe v wade oklahoma bill making felony ... |
| 1 | 1 | study women drive 4 times farther texas laws c... |
| 2 | 1 | trump clinton clash dueling dc speeches donald... |
| 3 | 1 | grand jury texas indicts activists behind plan... |
| 4 | 1 | reproductive rights hang balance debate modera... |
| ... | ... | ... |
| 4589 | 0 | russia calls war partys bluff license dmca col... |
| 4590 | 0 | bernie sanders democratic primary gave leverag... |
| 4591 | 0 | pipeline police strip search native girl leave... |
| 4592 | 0 | currency crisis alasdair macleod vexed questio... |


## Step 1d: Train/Validation/Test Splitting

Now, we will split our cleaned data into training, validation, and testing sets. This will be useful for tuning hyperparameters and assessing model performance later on. 

The split we will choose is: 
- **Train**: 70%
- **Validation**: 15%
- **Test**: 15%
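
One way to reach these proportions with two successive splits is to hold out 30% of the data and then halve the held-out part, since 0.3 × 0.5 = 0.15. The arithmetic sketch below assumes 4593 rows (the sum of the label counts reported in Step 1a) and assumes that a fractional held-out size is rounded up, which is our understanding of how scikit-learn's `train_test_split` treats a float `test_size`. The variable names are illustrative only.

```python
import math

n_total = 4593  # rows remaining after dropping NaN (2297 FAKE + 2296 REAL)

# First split: hold out 30% as a temporary set.
n_temp = math.ceil(0.3 * n_total)
n_train = n_total - n_temp

# Second split: divide the temporary set in half.
n_test = math.ceil(0.5 * n_temp)
n_val = n_temp - n_test

for name, n in [("train", n_train), ("val", n_val), ("test", n_test)]:
    print(f"{name}: {n} rows ({n / n_total:.1%})")
```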

In [31]:
#Split into features (X) and labels (y)
X = data['content']
y = data['label']

In [33]:
#Split into train and temp 
train, temp_data = train_test_split(data, test_size=0.3, stratify=data['label'], random_state=42)

#Split temp into validation & test
val, test = train_test_split(temp_data, test_size=0.5, stratify=temp_data['label'], random_state=42)

#Get features and labels for each split
X_train, y_train = train['content'], train['label']
X_val, y_val = val['content'], val['label']
X_test, y_test = test['content'], test['label']

In [35]:
#Display the distribution of fake and real news in each split of the processed dataset

print("Processed Data Train, Validation, & Test Distribution")
print("Train Labels:", y_train.value_counts(normalize=True))
print("Validation Labels:", y_val.value_counts(normalize=True))
print("Test Labels:", y_test.value_counts(normalize=True))

Processed Data Train, Validation, & Test Distribution
Train Labels: label
0    0.500156
1    0.499844
Name: proportion, dtype: float64
Validation Labels: label
1    0.500726
0    0.499274
Name: proportion, dtype: float64
Test Labels: label
0    0.500726
1    0.499274
Name: proportion, dtype: float64
