
## **NAIVE BAYES CLASSIFIER FOR FAKE NEWS RECOGNITION**

Marco Foster & Savina Tsichli

In [None]:
#Project

*The New York Times defines fake news as "a made-up story with an intention to deceive".
Such stories are now part of daily life, and they spread especially through social media platforms
and other online applications. Distinguishing fake content from real news is one of the most serious
challenges facing the news industry today. Naive Bayes classifiers are simple probabilistic algorithms
that are widely used for text analysis, including the classification of text into multiple classes.
The goal of this project is to implement a Multinomial Naive Bayes classifier in R and test its
performance on the classification of social media posts.*

In [None]:
#Introduction

## Dataset 1: Kaggle Multiclass Fake News Dataset
The Kaggle dataset contains 6 possible labels:
- True (5)
- Not-Known (4)
- Mostly-True (3)
- Half-True (2)
- False (1)
- Barely-True (0)
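
The integer codes above can be attached to readable names with a factor. This is a minimal sketch: the vector `codes` is a hypothetical stand-in for the dataset's label column, and the name order simply follows the list above.

```r
# Label names in the order of their integer codes 0..5 (taken from the list above)
label_names <- c("Barely-True", "False", "Half-True",
                 "Mostly-True", "Not-Known", "True")

# Hypothetical label codes standing in for the dataset's label column
codes <- c(5, 0, 2, 1)

# Map each code to its name; levels are fixed so that unused classes are kept
labels <- factor(codes, levels = 0:5, labels = label_names)
print(labels)
```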

## Dataset 2: Binary Dataset
This dataset contains two labels:
- Reliable (0)
- Unreliable (1)

## Preprocessing

To prepare the data for classification, we employ the following steps:

### Tokenization
We split the text into individual words or tokens. Tokenization simplifies analysis by focusing on each word as a separate unit.
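
To illustrate the idea, here is a minimal base-R sketch of word tokenization. The helper `tokenize_simple` is hypothetical and only approximates what a dedicated tokenizer such as `tokenizers::tokenize_words` does. It lowercases the text and splits on anything that is not a letter, digit, or apostrophe.

```r
# Hypothetical helper: a rough stand-in for a proper word tokenizer
tokenize_simple <- function(text) {
  # Lowercase, then split on runs of characters that are not a-z, 0-9 or '
  tokens <- strsplit(tolower(text), "[^a-z0-9']+")[[1]]
  # Drop empty strings that appear when the text starts with a separator
  tokens[nzchar(tokens)]
}

tokens <- tokenize_simple("Breaking: the Senate passed the bill!")
print(tokens)
```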

### Stopword Removal
Stopwords are common words like "and" or "the" that add little semantic value to the text. Removing them allows the model to focus on more important words.
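
A small sketch of stopword filtering, assuming each document's tokens are held in a plain character vector. The stopword list here is a hypothetical short one. Filtering with `%in%` keeps repeated tokens, which matters when word counts are used as features; a set operation such as `setdiff` would also collapse duplicates.

```r
# Hypothetical short stopword list, for illustration only
stop_words <- c("and", "the", "a", "of", "to")

tokens <- c("the", "senate", "passed", "the", "bill",
            "and", "the", "bill", "passed")

# Keep every token that is not a stopword; repeated tokens are preserved
kept <- tokens[!tokens %in% stop_words]
print(kept)

# For comparison: setdiff() removes stopwords but also drops duplicates
unique_kept <- setdiff(tokens, stop_words)
print(unique_kept)
```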

### Normalization
Normalization reduces words to their base form (stemming), so that words like "running" and "run" become equivalent. This shrinks the feature space by treating variations of the same word as a single feature.
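
As a brief sketch of how such stems could be produced, the `SnowballC` package provides `wordStem`. The token vector is a toy example, and the choice of the English stemmer is an assumption.

```r
library(SnowballC)

# Toy tokens: three surface forms of the same word
tokens <- c("running", "runs", "run")

# Reduce each token to its stem with the English Snowball stemmer
stems <- wordStem(tokens, language = "english")
print(stems)
```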


In [None]:
#Objective

The purpose of this project is to classify news articles into multiple categories (ranging from "False" to "True") using a **Naive Bayes classifier**. By analyzing the text in news articles, we aim to detect their factuality based on predefined labels. The dataset is split into training, validation, and test sets, and we follow standard text preprocessing techniques, including tokenization, stopword removal, and normalization.


In [None]:
#Naive Bayes Classifier

A **Naive Bayes classifier** is a probabilistic machine learning model used for classification tasks. It is based on Bayes' Theorem and assumes that the features are independent of one another given the class. Despite this "naive" assumption, it performs well in many real-world applications, especially text classification tasks such as spam detection or sentiment analysis. The algorithm computes the posterior probability of each class given the observed features and selects the class with the highest posterior. It is efficient, easy to implement, and scales well to large datasets.
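
To make the computation concrete, here is a minimal base-R sketch of a multinomial Naive Bayes classifier on word counts with Laplace smoothing. It is an illustration under stated assumptions, not the `e1071` implementation used in this notebook: the helpers `train_mnb` and `predict_mnb` and the toy count matrix are hypothetical.

```r
# x: documents-by-terms matrix of counts; y: class labels
train_mnb <- function(x, y, laplace = 1) {
  y <- droplevels(as.factor(y))
  # Total count of each term within each class (classes x terms)
  term_counts <- rowsum(x, group = y)
  # Smoothed log likelihood: (count + laplace) / (class total + laplace * |V|)
  log_lik <- log((term_counts + laplace) /
                   (rowSums(term_counts) + laplace * ncol(x)))
  # Log prior: relative frequency of each class, aligned with the rows of log_lik
  prior <- table(y) / length(y)
  log_prior <- log(as.vector(prior[rownames(log_lik)]))
  list(log_prior = log_prior, log_lik = log_lik)
}

predict_mnb <- function(model, x) {
  # Score per class = log prior + sum over terms of count * log likelihood
  scores <- x %*% t(model$log_lik)
  scores <- sweep(scores, 2, model$log_prior, "+")
  classes <- colnames(scores)
  factor(classes[max.col(scores, ties.method = "first")], levels = classes)
}

# Toy data: 4 documents, 3 terms; class "1" = unreliable, "0" = reliable
x <- matrix(c(2, 0, 1,
              3, 1, 0,
              0, 2, 2,
              0, 3, 1),
            nrow = 4, byrow = TRUE,
            dimnames = list(NULL, c("hoax", "report", "official")))
y <- factor(c("1", "1", "0", "0"))

model <- train_mnb(x, y)
pred <- predict_mnb(model, x)
print(pred)
```

Working in log space avoids numerical underflow when many small probabilities are multiplied, and the smoothing term keeps a term that never occurs in a class from forcing that class's probability to zero.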

In [None]:
#Code

In [None]:
# Load Packages
package <- c("tokenizers", "tidytext", "dplyr", "tm", "SnowballC", "e1071", "caret", "readr")
install.packages(package)
install.packages("data.table")

library(tokenizers)
library(tidytext)
library(dplyr)
library(tm)
library(SnowballC)
library(e1071)
library(caret)
library(readr)
library(data.table)



**Data Preparation and Model Training**

Each dataset is split by row order into training (85%) and validation (15%) sets, and a Naive Bayes model is trained on the training portion.
The multi-class dataset is processed first; the binary dataset follows under **Final dataset**.
For the binary model, the Laplace smoothing parameter is set to `1` so that words unseen for a class during training do not produce zero probabilities on the validation set.
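
The 85/15 split described above takes rows in file order. If the file happens to be sorted (for example by label or date), a shuffled split may be preferable. The sketch below shows that alternative on toy data; the seed and the toy data frame are arbitrary assumptions, and this is not the split used for the results reported in this notebook.

```r
set.seed(42)  # arbitrary seed, only to make the shuffle reproducible

# Toy data frame standing in for the loaded CSV (hypothetical contents)
df <- data.frame(Text = paste("post", 1:20), Labels = rep(0:1, 10))

# Draw 85% of the row indices at random for training
n_train <- floor(0.85 * nrow(df))
train_idx <- sample(nrow(df), n_train)

train <- df[train_idx, ]
val <- df[-train_idx, ]   # the remaining rows form the validation set

print(c(train = nrow(train), val = nrow(val)))
```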

In [None]:
df <- read_csv("train.csv")
test <- read_csv("test.csv")

# Sequential 85/15 split: the first 85% of rows train the model, the rest validate it
index <- floor(nrow(df) * 0.85)
train <- df[1:index, ]
val <- df[(index + 1):nrow(df), ]

print(nrow(test))
print(nrow(df))
print(nrow(train))
print(nrow(val))

# Extract the text column from each split
TrainText <- train[["Text"]]
ValText <- val[["Text"]]
TestText <- test[["Text"]]

# Tokenize the text and store tokens in a list
tokens_train <- lapply(TrainText, tokenize_words)
tokens_train <- lapply(tokens_train, function(x) setdiff(x, stopwords("en")))

tokens_val <- lapply(ValText, tokenize_words)
tokens_val <- lapply(tokens_val, function(x) setdiff(x, stopwords("en")))

tokens_test <- lapply(TestText, tokenize_words)
tokens_test <- lapply(tokens_test, function(x) setdiff(x, stopwords("en")))

###
#print(head(tokens_train))
#print(head(tokens_val))
#print(head(tokens_test))

# Create a text corpus for each set
trainCorpus <- Corpus(VectorSource(tokens_train))
valCorpus <- Corpus(VectorSource(tokens_val))
testCorpus <- Corpus(VectorSource(tokens_test))

# Create document-term matrices
train_dtm <- DocumentTermMatrix(trainCorpus)
# Drop terms that are absent from more than 95% of the training documents
train_dtm <- removeSparseTerms(train_dtm, 0.95)
# Restrict the validation and test matrices to the training vocabulary
val_dtm <- DocumentTermMatrix(valCorpus, control = list(dictionary = Terms(train_dtm)))
test_dtm <- DocumentTermMatrix(testCorpus, control = list(dictionary = Terms(train_dtm)))

# Reduce the number of features in your DTMs
# Try removing sparse terms
#train_dtm <- removeSparseTerms(train_dtm, 0.99) # Keep terms that appear in at least 1% of documents
#val_dtm <- removeSparseTerms(val_dtm, 0.99)
#test_dtm <- removeSparseTerms(test_dtm, 0.99)

**Prepare Data for Modeling**
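
One common way to prepare count features for a categorical Naive Bayes model is to recode each term as present or absent. The sketch below shows this on a toy matrix. The helper `to_presence` and the toy values are hypothetical, and this recoding is an alternative to the preparation in this notebook, not the one that produced the reported results. The matrix is converted to a data frame first, because a factor assigned into a numeric matrix is stored as plain integer codes.

```r
# Toy count matrix standing in for as.matrix(train_dtm) (hypothetical values)
counts <- matrix(c(0, 2, 1,
                   1, 0, 0),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(NULL, c("hoax", "report", "official")))

# Hypothetical helper: recode every column as a two-level factor.
# Fixing the levels keeps training and validation columns comparable.
to_presence <- function(m) {
  df <- as.data.frame(m)
  df[] <- lapply(df, function(col) {
    factor(col > 0, levels = c(FALSE, TRUE), labels = c("No", "Yes"))
  })
  df
}

features <- to_presence(counts)
str(features)
```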

In [None]:
# Convert DTMs to Matrices for easier manipulation
train_matrix <- as.matrix(train_dtm)
val_matrix <- as.matrix(val_dtm)
test_matrix <- as.matrix(test_dtm)

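# Matrix columns to factors for categorization
# Note: assigning a factor into a numeric matrix stores only its integer level
# codes, so the columns remain numeric after these loops; e1071::naiveBayes()
# models numeric predictors with Gaussian densities.
for (cols in colnames(train_matrix)) {
  train_matrix[, cols] <- factor(train_matrix[, cols])
}

for (cols in colnames(val_matrix)) {
  val_matrix[, cols] <- factor(val_matrix[, cols])
}

for (cols in colnames(test_matrix)) {
  test_matrix[, cols] <- factor(test_matrix[, cols])
}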

# Ensure Labels column is a factor with the correct levels
train$Labels <- factor(train$Labels, levels = c(0, 1, 2, 3, 4, 5))
val$Labels <- factor(val$Labels, levels = c(0, 1, 2, 3, 4, 5))

# Combine Labels and DTM matrix in a data frame
# The labels are combined with the training and validation matrices to prepare for model training.
train_matrix <- data.frame(Labels = train$Labels, train_matrix)
val_matrix <- data.frame(Labels = val$Labels, val_matrix)

Rows: 10240 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): Text, Text_Tag
dbl (1): Labels

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Rows: 1267 Columns: 2
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): Text, Text_Tag

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


[1] 1267
[1] 10240
[1] 8704
[1] 1536


In [None]:
# Train the Naive Bayes classifier (e1071::naiveBayes, default laplace = 0)
model <- naiveBayes(Labels ~ ., data = train_matrix)

# Predict on validation set
valPred <- predict(model, newdata = val_matrix)

# Convert predictions and true labels to factors with the same levels
all_levels <- c(0, 1, 2, 3, 4, 5)
valPred <- factor(valPred, levels = all_levels)
val_matrix$Labels <- factor(val_matrix$Labels, levels = all_levels)

# Evaluate the model
cm <- confusionMatrix(valPred, val_matrix$Labels)
print(cm)

# Predict on test set
#testPred <- predict(model, newdata = test_final)

# Ensure test predictions have the same factor levels (optional, depending on use case)
#testPred <- factor(testPred, levels = all_levels)

# Output test predictions
#print(testPred)

Confusion Matrix and Statistics

          Reference
Prediction   0   1   2   3   4   5
         0  21  27  32  22  17  15
         1  81 126  89 100  44  72
         2  23  11  22  20   8  20
         3  16  18  41  38   6  27
         4  80  96  72  41  44  53
         5  29  35  55  57  16  62

Overall Statistics
                                          
               Accuracy : 0.2038          
                 95% CI : (0.1839, 0.2248)
    No Information Rate : 0.2038          
    P-Value [Acc > NIR] : 0.5101          
                                          
                  Kappa : 0.0499          
                                          
 Mcnemar's Test P-Value : <2e-16          

Statistics by Class:

                     Class: 0 Class: 1 Class: 2 Class: 3 Class: 4 Class: 5
Sensitivity           0.08400  0.40256  0.07074  0.13669  0.32593  0.24900
Specificity           0.91213  0.68438  0.93306  0.91415  0.75589  0.85082
Pos Pred Value        0.15672  0.24609  0.21154

**Final dataset (Binary Classification)**


In [None]:
package <- c("tokenizers", "tidytext", "dplyr", "tm", "SnowballC", "e1071", "caret", "readr")
install.packages(package)
install.packages("data.table")


In [None]:
library(tokenizers)
library(tidytext)
library(dplyr)
library(tm)
library(SnowballC)
library(e1071)
library(caret)
library(readr)
library(data.table)





In [None]:
# Same pipeline applied to the binary dataset

df <- read_csv("train2.csv")
test <- read_csv("test2.csv")

# Sequential 85/15 split: the first 85% of rows train the model, the rest validate it
index <- floor(nrow(df) * 0.85)
train <- df[1:index, ]
val <- df[(index + 1):nrow(df), ]

print(nrow(test))
print(nrow(df))
print(nrow(train))
print(nrow(val))

# Extract the text column from each split
TrainText <- train[["text"]]
ValText <- val[["text"]]
TestText <- test[["text"]]

# Tokenize the text and store tokens in a list
tokens_train <- lapply(TrainText, tokenize_words)
tokens_train <- lapply(tokens_train, function(x) setdiff(x, stopwords("en")))

tokens_val <- lapply(ValText, tokenize_words)
tokens_val <- lapply(tokens_val, function(x) setdiff(x, stopwords("en")))

tokens_test <- lapply(TestText, tokenize_words)
tokens_test <- lapply(tokens_test, function(x) setdiff(x, stopwords("en")))

###
#print(head(tokens_train))
#print(head(tokens_val))
#print(head(tokens_test))

# Create a text corpus for each set
trainCorpus <- Corpus(VectorSource(tokens_train))
valCorpus <- Corpus(VectorSource(tokens_val))
testCorpus <- Corpus(VectorSource(tokens_test))

# Create document-term matrices
train_dtm <- DocumentTermMatrix(trainCorpus)
# Drop terms that are absent from more than 95% of the training documents
train_dtm <- removeSparseTerms(train_dtm, 0.95)
# Restrict the validation and test matrices to the training vocabulary
val_dtm <- DocumentTermMatrix(valCorpus, control = list(dictionary = Terms(train_dtm)))
test_dtm <- DocumentTermMatrix(testCorpus, control = list(dictionary = Terms(train_dtm)))

# Reduce the number of features in your DTMs
#train_dtm <- removeSparseTerms(train_dtm, 0.99) # Keep terms that appear in at least 1% of documents
#val_dtm <- removeSparseTerms(val_dtm, 0.99)
#test_dtm <- removeSparseTerms(test_dtm, 0.99)

train_matrix <- as.matrix(train_dtm)
val_matrix <- as.matrix(val_dtm)
test_matrix <- as.matrix(test_dtm)

# Matrix columns to factors for categorization (the matrix stays numeric:
# each assignment stores the factor's integer level codes)
for (cols in colnames(train_matrix)) {
  train_matrix[, cols] <- factor(train_matrix[, cols])
}

for (cols in colnames(val_matrix)) {
  val_matrix[, cols] <- factor(val_matrix[, cols])
}

for (cols in colnames(test_matrix)) {
  test_matrix[, cols] <- factor(test_matrix[, cols])
}

train_matrix <- data.frame(Labels = as.factor(train$label), train_matrix)
val_matrix <- data.frame(Labels = as.factor(val$label), val_matrix)

Rows: 20800 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): title, author, text
dbl (2): id, label

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Rows: 5200 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): title, author, text
dbl (1): id

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


[1] 5200
[1] 20800
[1] 17680
[1] 3120


In [None]:
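# Check that the feature columns line up across the splits.
# train_matrix and val_matrix carry a leading Labels column that test_matrix lacks,
# so it is dropped before the second comparison; comparing vectors of unequal
# length with `==` triggers a recycling warning.
identical(colnames(train_matrix), colnames(val_matrix))       # TRUE if the columns match
identical(colnames(train_matrix)[-1], colnames(test_matrix))  # TRUE if the features match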


In [None]:
train_matrix$Labels

In [None]:
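# Train the Naive Bayes classifier with Laplace smoothing.
# In e1071::naiveBayes(), `laplace` smooths categorical (factor) predictors only;
# numeric predictors are modelled with Gaussian densities.
model2 <- naiveBayes(Labels ~ ., data = train_matrix, laplace = 1)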

# Predict on validation set
valPred2 <- predict(model2, newdata = val_matrix[,-1])

# Convert predictions and true labels to factors with the same levels
all_levels <- c(0, 1)  # Set explicitly for binary classification
valPred2 <- factor(valPred2, levels = all_levels)
val_matrix$Labels <- factor(val_matrix$Labels, levels = all_levels)

# Evaluate the model
cm <- confusionMatrix(valPred2, val_matrix$Labels)
print(cm)

# Predict on test set
#testPred2 <- predict(model2, newdata = test_matrix)

# Ensure test predictions have the same factor levels (optional, depending on use case)
#testPred2 <- factor(testPred2, levels = all_levels)

# Output test predictions
#print(head(testPred2))

Confusion Matrix and Statistics

          Reference
Prediction    0    1
         0 1126  127
         1  412 1455
                                          
               Accuracy : 0.8272          
                 95% CI : (0.8135, 0.8404)
    No Information Rate : 0.5071          
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.6535          
                                          
 Mcnemar's Test P-Value : < 2.2e-16       
                                          
            Sensitivity : 0.7321          
            Specificity : 0.9197          
         Pos Pred Value : 0.8986          
         Neg Pred Value : 0.7793          
             Prevalence : 0.4929          
         Detection Rate : 0.3609          
   Detection Prevalence : 0.4016          
      Balanced Accuracy : 0.8259          
                                          
       'Positive' Class : 0               
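
The headline statistics in this output can be reproduced by hand from the four cell counts. The sketch below rebuilds the table printed above and recomputes accuracy, sensitivity, and specificity, treating class 0 as the positive class as `caret` does here. The variable names are illustrative.

```r
# Confusion matrix from the output above: rows = prediction, columns = reference
cm <- matrix(c(1126, 412, 127, 1455), nrow = 2,
             dimnames = list(Prediction = c("0", "1"), Reference = c("0", "1")))

# Accuracy: correctly classified posts over all posts
accuracy <- sum(diag(cm)) / sum(cm)

# Sensitivity: share of true class-0 posts that were predicted as 0
sensitivity <- cm["0", "0"] / sum(cm[, "0"])

# Specificity: share of true class-1 posts that were predicted as 1
specificity <- cm["1", "1"] / sum(cm[, "1"])

print(round(c(accuracy = accuracy, sensitivity = sensitivity,
              specificity = specificity), 4))
```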
                        

## **Transformer**

### Alternative Approach: Transformers for Fake News Detection

While the Naive Bayes classifier is effective for many text classification tasks, modern approaches using transformer models have demonstrated superior performance. Transformers, such as BERT, utilize embeddings that capture contextual relationships in text, leading to better classification accuracy.

In this section, we propose replacing the Naive Bayes classifier with a transformer-based model for the fake news detection task.


In [None]:
# Additional packages needed for this section
install.packages("torch")
torch::install_torch()

# The Hugging Face transformers library is a Python package, so it cannot be
# installed as an R package with devtools::install_github(). One way to reach it
# from R is through reticulate (this assumes a working Python environment).
install.packages("reticulate")
reticulate::py_install("transformers")

ERROR: Error in `check_supported_version()`:
✖ Unsupported CUDA version "12.2"
ℹ Currently supported versions are: "11.7" and "11.8".


In [None]:
# After loading the datasets, the labels are converted to numeric,
# and then we extract text from both training and test datasets

In [None]:
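# Tokenize the data with a pre-trained tokenizer
# This converts the texts to token IDs, which the model then maps to embeddings
# (assumes the Python transformers package is available through reticulate)

transformers <- reticulate::import("transformers")
tokenizer <- transformers$AutoTokenizer$from_pretrained("bert-base-uncased")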

In [None]:
# Then we create a dataset that holds tokenized input and corresponding labels

In [None]:
# After loading the pre-trained model,
# we test it by running the data through it and get predictions.

In [None]:
# then we evaluate the model by comparing the predictions to the actual labels,
# and then calculate accuracy

## Summary of Findings

1) We implemented the Naive Bayes classifier on a binary dataset (0, 1) and on the multi-class Fake News dataset (0, 1, 2, 3, 4, 5). The binary model performed as expected, reaching a validation accuracy of 0.8272. The multi-class model did not categorize the news accurately: its validation accuracy of 0.2038 equals the no-information rate, with a kappa of 0.0499. We attribute this to the following:
- The Naive Bayes classifier assumes that features (e.g., words or phrases) are independent given the class label. For news articles this assumption does not hold; for example, certain phrases may often occur together in genuine articles but not in fake ones.
- The model may not capture the nuances that differentiate neighbouring categories (for example, Half-True and Mostly-True), especially when they share much of their vocabulary.
- It treats every feature independently and does not consider the context or the relationships between words.

2) That led us to our next step: using a model that takes into account the position of a token within a phrase or sentence. The transformer is a good example of a model that uses attention; it assigns an embedding to each token in order to capture semantic meaning, contextual relationships, and positional information.

- By implementing a transformer-based model for fake news detection, we expect improved accuracy and reliability compared to the Naive Bayes classifier. The context-aware nature of transformers enables a deeper understanding of text, which is critical for accurately distinguishing between real and fake news.
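
To give a sense of what "attention" computes, here is a minimal base-R sketch of scaled dot-product self-attention on toy embeddings. It illustrates the mechanism only and is not a transformer model; the function names, the random embeddings, and the seed are assumptions made for the example.

```r
# Row-wise softmax; the row maximum is subtracted for numerical stability
softmax_rows <- function(m) {
  e <- exp(m - apply(m, 1, max))
  e / rowSums(e)
}

# Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V
attention <- function(Q, K, V) {
  scores <- Q %*% t(K) / sqrt(ncol(K))
  weights <- softmax_rows(scores)   # one weight per (query token, key token) pair
  list(weights = weights, output = weights %*% V)
}

set.seed(1)                          # arbitrary seed for the toy embeddings
X <- matrix(rnorm(3 * 4), nrow = 3)  # 3 tokens, 4-dimensional embeddings

# Self-attention: queries, keys and values all come from the same tokens
res <- attention(X, X, X)
print(round(res$weights, 3))
```

Each row of `weights` sums to one and describes how strongly one token attends to every token in the sequence, which is how context enters each token's representation. A bag-of-words Naive Bayes model has no equivalent of this step.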

