$$
\huge{\textbf{Multinomial Naive Bayes model}}
$$
$$
\huge{\textbf{for Fake News Classification}}
$$

| Name   | ID |
| -------- | ------- |
| Calandra Buonaura Lorenzo | 2107761     |
| Turci Andrea  |2106724   |

In an era where information dissemination is increasingly digital and instantaneous, the proliferation of fake news has become a pervasive issue. The New York Times defines fake news as "a made-up story with an intention to deceive," highlighting its purpose to confuse or mislead the audience. This phenomenon is predominantly propagated through social media platforms and various online applications, embedding itself deeply in our daily lives. The ability to distinguish between fake and authentic news has emerged as one of the most pressing challenges for the modern news industry.

This assignment delves into the application of Naive Bayes classifiers for the recognition of fake news. Naive Bayes classifiers are renowned for their efficacy in text data analysis, particularly in classification tasks where the objective is to categorize text into multiple classes. These classifiers are based on Bayes' Theorem, leveraging the probability of features (words or phrases) to determine the likelihood of a particular category (e.g., fake or real news).

The goal of this project is to implement a Multinomial Naive Bayes classifier in R, a statistical programming language widely used for data analysis and machine learning. The focus will be on evaluating the performance of this classifier in categorizing social media posts, a common medium for the dissemination of fake news. The dataset suggested for this project is available on Kaggle, a popular platform for data science competitions and datasets.

In this context, the Multinomial Naive Bayes classifier is particularly suitable due to its effectiveness in handling discrete data, such as word counts in documents. This classifier assumes that the features (words) are conditionally independent given the class label, an assumption that, while simplifying computations, has been proven effective in practical text classification tasks.

The project will involve several key steps:

- Data Preprocessing: This involves cleaning the text data to remove noise, such as punctuation and stop words, and transforming the text into a format suitable for analysis, typically using techniques like tokenization and vectorization.
- Model Training: Using the cleaned dataset, the Multinomial Naive Bayes classifier will be trained to learn the patterns associated with fake and real news. This involves calculating the prior probabilities of each class and the conditional probabilities of each word given the class.
- Model Evaluation: The performance of the classifier will be tested on a separate set of data to evaluate its accuracy, precision, recall, and F1 score. This will help in understanding the effectiveness of the model in distinguishing fake news from real news.
- Hyperparameter Tuning: The model's parameters may be adjusted to improve its performance. This includes optimizing the smoothing parameter to handle zero probabilities.
- Validation and Testing: The final model will be validated using cross-validation techniques to ensure its robustness and tested on unseen data to assess its generalization capability.

The suggested labels for classifying the text include categories such as 'fake', 'real', 'satire', 'biased', among others. These labels will help in creating a more nuanced classifier that can handle various forms of misleading information.

By the end of this project, we aim to have a robust Multinomial Naive Bayes classifier that can accurately classify social media posts, contributing to the broader effort of combating the spread of fake news. This assignment not only provides practical experience in implementing a machine learning algorithm but also underscores the significance of technological solutions in addressing contemporary societal issues.

$\Large{\textbf{Datasets}}$

Two datasets were used to test the algorithm.
The first $^1$ contains 10240 texts, each labeled with one of six classes:

* $\textit{Barely-True: 0}$
* $\textit{False: 1}$
* $\textit{Half-True: 2}$
* $\textit{Mostly-True: 3}$
* $\textit{Not-Known: 4}$
* $\textit{True: 5}$

The second $^2$ contains 23481 fake news articles and 21417 real news articles. These were merged into a single dataset, manually assigning class 0 to fake news and class 1 to real news.

In the following sections the algorithm is first applied to the first dataset, whose results are shown, and only afterwards to the second.

$^1 \small{https://www.kaggle.com/datasets/anmolkumar/fake-news-content-detection?select=train.csv}$

$^2 \small{https://www.uvic.ca/ecs/ece/isot/datasets/fake-news/index.php}$

$\Large{\textbf{Multinomial Naive Bayes Model}}$

The probability that a text or document $d$ belongs to category $c$ follows from Bayes' theorem:

$P(c|d) \propto P(c) \prod_k P(t_k | c)$

where $P(t_k | c)$ is the probability that term $t_k$ occurs in a document of class $c$. This probability measures how much $t_k$ contributes to $c$ being the correct class for the document, and $P(c)$ is the prior probability that a document belongs to class $c$.
Here the prior is estimated from relative frequencies:

$P(c) = \frac{N_c}{N}$

where $N_c$ is the number of documents in class $c$ and $N$ is the total number of documents.
The conditional probability is instead estimated as the relative frequency of term $t$ in the documents belonging to class $c$:

$P(t|c) = \frac{T_{ct}}{\sum_{t'} T_{ct'}}$

where $T_{ct}$ is the number of times term $t$ occurs in the documents of class $c$.
Since $P(c|d)$ is a product of many conditional probabilities, it is computationally convenient to take the logarithms of these quantities and add them up, which also avoids floating-point underflow. In Naive Bayes classification, the best class is the one with the maximum posterior probability:

$c_{map} = \arg\max_{c \in C} P(c|d) = \arg\max_{c \in C} \left[\log P(c) + \sum_{k} \log P(t_k | c)\right]$
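As a minimal sketch of this decision rule, assuming a toy two-class model whose probabilities are made up for illustration (none of the values come from the datasets), the log-space score could be computed as follows:

```r
# Toy model: two classes and a three-word vocabulary (illustrative values only).
prior <- c(fake = 0.5, real = 0.5)
condprob <- matrix(
  c(0.6, 0.3, 0.1,   # P(t | fake)
    0.2, 0.3, 0.5),  # P(t | real)
  nrow = 3,
  dimnames = list(c("shock", "report", "study"), c("fake", "real"))
)

# A document is represented by its sequence of tokens t_1, ..., t_k.
doc_tokens <- c("shock", "shock", "report")

# log P(c) + sum_k log P(t_k | c), computed for every class at once.
log_scores <- log(prior) + colSums(log(condprob[doc_tokens, , drop = FALSE]))

# c_map is the class with the largest log score.
c_map <- names(which.max(log_scores))
print(c_map)

# Why logs: a product of many small probabilities underflows in double precision,
# while the sum of their logarithms stays finite.
p <- rep(1e-5, 100)
print(prod(p))
print(sum(log(p)))
```

Working in log space does not change the selected class, because the logarithm is monotonically increasing.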

${\textbf{Laplace Smoothing}}$

In a practical implementation of a Naive Bayes classifier, it is common to run into the zero-probability problem. This happens when a term $t_k$ does not appear in any document of a class $c$. In that case the conditional probability $P(t_k | c)$ is zero, so the product of the conditional probabilities is zero as well, wiping out the whole posterior $P(c|d)$. A technique called "Laplace smoothing" or "add-one smoothing" avoids this.

With Laplace smoothing, the prior and the conditional probability are handled as follows:

Prior probability: the prior $P(c)$ stays the same, since it is not affected by the presence or absence of terms.

$P(c) = \frac{N_c}{N}$
 
Conditional probability: the conditional probability is computed by adding 1 to the numerator (the number of times term $t$ occurs in the documents of class $c$) and adding the number of distinct terms $|V|$ to the denominator (the total number of occurrences of all terms in the documents of class $c$).

$P(t|c) = \frac{T_{ct} + 1}{\sum_{t'} T_{ct'} + |V|}$
 
where:

- $T_{ct}$ is the number of times term $t$ occurs in the documents of class $c$.
- $\sum_{t'} T_{ct'}$ is the total number of occurrences of all terms in the documents of class $c$.
- $|V|$ is the number of distinct terms in the overall vocabulary.

The final formula for the best class $c_{map}$ keeps the same form but uses the smoothed conditional probabilities:

$c_{map} = \arg\max_{c \in C} P(c|d) = \arg\max_{c \in C} \left[\log P(c) + \sum_{k} \log \frac{T_{ct_k} + 1}{\sum_{t'} T_{ct'} + |V|}\right]$

Applying Laplace smoothing has several advantages:

- It avoids the zero-probability problem.
- It gives a more robust estimate of the conditional probabilities.
- It copes better with new or rare words that may appear in test documents.
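A small worked example may help here. The sketch below assumes a four-word vocabulary and invented term counts for a single class, and compares the unsmoothed estimate with the add-one estimate:

```r
# Term counts T_ct for one class over a four-word vocabulary (illustrative values).
counts <- c(tax = 3, vote = 1, health = 0, border = 0)
V <- length(counts)  # |V|, the number of distinct vocabulary terms

# Unsmoothed estimate: terms never seen in the class get probability exactly zero.
p_raw <- counts / sum(counts)

# Add-one (Laplace) smoothing: (T_ct + 1) / (sum_t' T_ct' + |V|).
p_smooth <- (counts + 1) / (sum(counts) + V)

print(rbind(raw = p_raw, smoothed = p_smooth))
```

The smoothed values still sum to one, because the denominator grows by exactly the $|V|$ units added across the numerators.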

$\Large{\textbf{Code}}$

In [None]:
####################
# DATASET CLEANING #
####################

In [None]:
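change_labels <- function(labels) {
  # Lookup table: names are the original labels, values are the new labels.
  label_map <- c("0" = 2, "1" = 1, "2" = 3, "3" = 4, "4" = 0, "5" = 5)
  # Indexing a named vector by name is vectorized, so no explicit loop is needed.
  new_labels <- unname(label_map[as.character(labels)])
  return(new_labels)
}

The change_labels function remaps a vector of categorical labels according to a fixed mapping. label_map defines how each original label is translated, and indexing it with the labels converted to character strings applies the mapping to the whole input vector at once. Assuming the input uses the class codes listed in the Datasets section, the mapping reorders the six classes as Not-Known (0), False (1), Barely-True (2), Half-True (3), Mostly-True (4) and True (5). The function returns the vector of remapped labels; a label that is missing from label_map becomes NA.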

In [None]:
lemmatize_text <- function(text) {
  lemmatized <- textstem::lemmatize_words(unlist(strsplit(text, "\\s+")))
  lemmatized <- SnowballC::wordStem(lemmatized, language = "en")

  return(paste(lemmatized, collapse = " "))
}

The lemmatize_text function processes an input string text by first lemmatizing and then stemming each word. Lemmatization (using textstem::lemmatize_words) converts words to their base or dictionary form. Stemming (using SnowballC::wordStem) reduces words to their root form. The input text is split into individual words, processed, and then recombined into a single string. The function returns the processed string with lemmatized and stemmed words.

In [None]:
filter_non_english_words <- function(text) {
  tokens <- unlist(strsplit(text, "\\s+"))
  is_english <- hunspell::hunspell_check(tokens)
  english_tokens <- tokens[is_english]
  cleaned_text <- paste(english_tokens, collapse = " ")
  return(cleaned_text)
}

This function filter_non_english_words removes non-English words from a given input string text. It tokenizes the text into individual words, checks each word for being an English word using hunspell::hunspell_check, and retains only the words identified as English. The cleaned text, composed only of English words, is then reassembled into a single string and returned.

In [None]:
remove_numbers_inside_words <- function(text) {
  words <- unlist(strsplit(text, "\\s+"))

  clean_words <- lapply(words, function(word) {
    if (grepl("\\d", word)) {
      word <- gsub("\\d", "", word)
    }
    return(word)
  })

  cleaned_text <- paste(clean_words, collapse = " ")
  return(cleaned_text)
}

The remove_numbers_inside_words function cleans an input string by removing every digit from it. It splits the text into individual words, strips the digits from each word (using gsub), and then recombines the cleaned words into a single string. Tokens made up entirely of digits become empty strings, leaving extra spaces that the later tm::stripWhitespace step in clean collapses. The resulting string is returned.
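Since gsub is already vectorized over strings, the same digit removal can be sketched without splitting the text first. The name remove_digits is hypothetical and not part of the original code. For single-space-separated input it gives the same result as the function above, but it leaves the original whitespace untouched instead of rejoining words with single spaces.

```r
# Hypothetical compact variant: strip every digit in one pass.
remove_digits <- function(text) {
  gsub("[0-9]", "", text)
}

example <- remove_digits("covid19 vaccine 2021 b2b")
print(example)
```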

In [None]:
to_space <- tm::content_transformer(function(x, pattern) {
  return(gsub(pattern, " ", x))
})

The to_space function is a content transformer that replaces all occurrences of a specified pattern in a text with a space. This function can be used with the tm package's text processing functions to clean and transform text data by replacing specified patterns with spaces.

In [None]:
clean <- function(document, tokenize = TRUE, lemmatize = TRUE) {
  clean_doc <- tm::VCorpus(tm::VectorSource(document))

  if (tokenize) {
    clean_doc <- tm::tm_map(clean_doc, tm::content_transformer(tolower))
    clean_doc <- tm::tm_map(clean_doc, tm::removePunctuation)
    clean_doc <- tm::tm_map(clean_doc, tm::removeWords, tm::stopwords("en"))
    clean_doc <- tm::tm_map(clean_doc, tm::content_transformer(filter_non_english_words))
    clean_doc <- tm::tm_map(clean_doc, tm::content_transformer(remove_numbers_inside_words))
    clean_doc <- tm::tm_map(clean_doc, tm::stripWhitespace)
  }

  if (lemmatize) {
    clean_doc <- tm::tm_map(clean_doc, tm::content_transformer(lemmatize_text))
  }

  return(sapply(clean_doc, NLP::content))
}

The clean function performs the full text-cleaning pipeline on an input document. It first converts the input into a text corpus using tm::VCorpus. If tokenize is set to TRUE, the function applies a series of transformations in this order: converting text to lowercase, removing punctuation, removing English stop words, filtering non-English words, removing digits, and collapsing repeated whitespace. Despite its name, this flag controls the normalization and filtering steps rather than splitting the text into tokens. If lemmatize is set to TRUE, it then applies lemmatize_text, which lemmatizes and stems each word. The function returns the cleaned text as a character vector.

In [None]:
clean_empty_rows <- function(dataframe) {
  empty_rows <- which(nchar(trimws(dataframe$Text)) == 0)
  if (length(empty_rows) != 0) {
    dataframe <- dataframe[-empty_rows, ]
  }
  return(dataframe)
}

The clean_empty_rows function removes rows from a dataframe where the Text column is empty or contains only whitespace. It identifies such rows using nchar and trimws, then excludes them from the dataframe. The cleaned dataframe, with empty rows removed, is returned. This function ensures that the dataframe only contains rows with meaningful text data.
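A short usage sketch on an invented data frame (the rows below are illustrative only) shows the effect. The function is repeated so that the example runs on its own:

```r
clean_empty_rows <- function(dataframe) {
  empty_rows <- which(nchar(trimws(dataframe$Text)) == 0)
  if (length(empty_rows) != 0) {
    dataframe <- dataframe[-empty_rows, ]
  }
  return(dataframe)
}

# Toy data: rows 2 and 3 contain only whitespace or nothing at all.
toy <- data.frame(
  Text = c("senate passes bill", "   ", "", "court ruling"),
  Label = c(1, 0, 0, 1),
  stringsAsFactors = FALSE
)

cleaned <- clean_empty_rows(toy)
print(cleaned)
```

The length check inside the function matters: indexing with an empty negative index, `dataframe[-integer(0), ]`, would select no rows at all rather than keep every row.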

In [None]:
################
# VOCABULARIES #
################

In [None]:
get_vocabulary_six <- function(document, threshold) {
  words <- unlist(strsplit(document, "\\s+"))
  words <- words[words != ""]
  words_table <- table(words)

  words_freq <- as.data.frame(words_table, stringsAsFactors = FALSE)
  colnames(words_freq) <- c("word", "occurrencies")

  vocabulary <- words_freq[words_freq$occurrencies >= threshold, ]$word
  return(vocabulary)
}


The get_vocabulary_six function creates a vocabulary list from a given text document by including only those words that occur at least a specified number of times. It begins by tokenizing the input document, splitting it into individual words using spaces as delimiters. Empty strings that may result from this split are removed. The function then constructs a frequency table of these words. This table is converted into a data frame with columns named word and occurrencies, representing each unique word and its frequency of occurrence, respectively. The function filters this data frame to include only those words whose frequency meets or exceeds the specified threshold. The resulting vocabulary is returned as a vector of words that meet this criterion.
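As an illustration, the sketch below runs the function on a made-up sentence with an absolute threshold of two occurrences. The function is repeated so that the example is self-contained:

```r
get_vocabulary_six <- function(document, threshold) {
  words <- unlist(strsplit(document, "\\s+"))
  words <- words[words != ""]
  words_table <- table(words)

  words_freq <- as.data.frame(words_table, stringsAsFactors = FALSE)
  colnames(words_freq) <- c("word", "occurrencies")

  vocabulary <- words_freq[words_freq$occurrencies >= threshold, ]$word
  return(vocabulary)
}

# "tax" occurs 3 times, "vote" twice, "cut" and "health" once each.
doc <- "tax cut tax vote tax vote health"
vocab <- get_vocabulary_six(doc, threshold = 2)
print(vocab)
```

Only the words that reach the threshold survive, so raising it shrinks the vocabulary and the conditional probability matrix built from it.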

In [None]:
get_vocabulary_tags <- function(df, threshold) {
  tag_texts <- list()
  all_tags <- unique(unlist(strsplit(df$Tag, ",")))

  for (tag in all_tags) {
    # grep treats the tag as a regular expression and also matches substrings.
    matching_docs <- df[grep(tag, df$Tag), "Text"]
    doc <- paste(matching_docs, collapse = " ")

    voc <- get_vocabulary_six(doc, threshold)
    tag_texts <- append(tag_texts, voc)
  }

  # A word can pass the threshold under several tags; keep a single copy.
  return(unique(unlist(tag_texts)))
}

The get_vocabulary_tags function builds a vocabulary from the tags associated with the documents in a data frame. It first identifies all unique tags in the Tag column by splitting the tag strings on commas. For each unique tag, it retrieves the texts of all documents whose Tag field matches that tag and concatenates them into a single string. Because grep interprets the tag as a regular expression, a tag also matches any longer tag that contains it. The function then calls get_vocabulary_six on the concatenated text, keeping the words that reach the specified threshold. The per-tag vocabularies are combined and deduplicated, so that a word selected under several tags appears only once and the vocabulary size used for Laplace smoothing is not inflated. The result is returned as a vector of words that are frequent within at least one tag.

In [24]:
get_vocabulary_two <- function(document, threshold) {
  words <- unlist(strsplit(document, "\\s+"))
  words <- words[words != ""]
  words_table <- table(words)

  words_freq <- as.data.frame(words_table, stringsAsFactors = FALSE)
  colnames(words_freq) <- c("word", "occurrencies")

  total_words <- sum(words_freq$occurrencies)
  words_freq$occurrencies <- words_freq$occurrencies / total_words

  vocabulary <- words_freq[words_freq$occurrencies >= threshold, ]$word
  return(list(voc = vocabulary, df = words_freq))
}


The get_vocabulary_two function also generates a vocabulary list from a text document but does so based on the relative frequency of words. Similar to get_vocabulary_six, it starts by tokenizing the document and removing any empty strings. It creates a frequency table of the words and converts this table into a data frame with word and occurrencies columns. The function then calculates the total number of words and converts the occurrencies column to represent the relative frequency of each word. Words whose relative frequency meets or exceeds the specified threshold are filtered and stored in the vocabulary. The function returns a list containing two elements: voc, a vector of words that meet the relative frequency threshold, and df, the data frame with relative frequencies of all words.
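The difference from the absolute-count version is easiest to see on a toy input. In the sketch below (function repeated for self-containment, sentence invented for illustration) the threshold is a fraction of all tokens rather than a count:

```r
get_vocabulary_two <- function(document, threshold) {
  words <- unlist(strsplit(document, "\\s+"))
  words <- words[words != ""]
  words_table <- table(words)

  words_freq <- as.data.frame(words_table, stringsAsFactors = FALSE)
  colnames(words_freq) <- c("word", "occurrencies")

  total_words <- sum(words_freq$occurrencies)
  words_freq$occurrencies <- words_freq$occurrencies / total_words

  vocabulary <- words_freq[words_freq$occurrencies >= threshold, ]$word
  return(list(voc = vocabulary, df = words_freq))
}

# Seven tokens in total: "tax" has relative frequency 3/7, "vote" 2/7,
# "cut" and "health" 1/7 each.
doc <- "tax cut tax vote tax vote health"
res <- get_vocabulary_two(doc, threshold = 0.2)
print(res$voc)
print(res$df)
```

Because the threshold is relative, the same value selects a comparable share of the vocabulary on corpora of different sizes, which is presumably why the thresholds explored later are very small fractions.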

In [None]:
############
# TRAINING #
############

In [None]:
train_multinomial_nb <- function(classes, data, threshold, type) {
  n <- length(data$Text)

  if (type == "Six") {
    vocabulary <- get_vocabulary_six(paste(data$Text, collapse = " "), threshold)
  } else if (type == "Two") {
    vocabulary <- get_vocabulary_two(paste(data$Text, collapse = " "), threshold)$voc
  } else if (type == "Tags") {
    vocabulary <- get_vocabulary_tags(data, threshold)
  } else {
    stop("Invalid type specified")
  }

  prior <- numeric(length(classes))
  names(prior) <- classes
  post <- matrix(0, nrow = length(vocabulary), ncol = length(classes), dimnames = list(vocabulary, classes))

  for (c in seq_along(classes)) {
    class_label <- classes[c]
    docs_in_class <- data[data$Label == class_label, "Text"]
    prior[c] <- length(docs_in_class) / n

    textc <- paste(docs_in_class, collapse = " ")
    tokens <- table(strsplit(tolower(textc), "\\W+")[[1]])
    vocab_counts <- sapply(vocabulary, function(t) if (t %in% names(tokens)) tokens[t] else 0)

    post[, c] <- (vocab_counts + 1) / (sum(vocab_counts) + length(vocabulary))
  }

  return(list(vocab = vocabulary, prior = prior, condprob = post))
}

The train_multinomial_nb function trains a Multinomial Naive Bayes classifier based on text data and specified classes. This function performs several critical tasks, including constructing the vocabulary, calculating prior probabilities, and computing conditional probabilities for each class.

The function starts by determining the length of the data, which is the number of text documents. It then decides how to build the vocabulary based on the specified type parameter. If type is "Six", it calls get_vocabulary_six to generate the vocabulary from the combined text of all documents. If type is "Two", it calls get_vocabulary_two and retrieves the vocabulary part of the returned list. If type is "Tags", it calls get_vocabulary_tags to generate a vocabulary based on tags associated with the documents. If an invalid type is provided, the function stops and raises an error.

Next, the function initializes the prior probability array and the conditional probability matrix. The prior array has a length equal to the number of classes and is named according to the class labels. The conditional probability matrix has rows corresponding to the vocabulary and columns corresponding to the classes, initialized to zeros.

The function then iterates over each class to compute the prior and conditional probabilities. For each class, it filters the documents that belong to the current class and calculates the prior probability as the ratio of the number of documents in the class to the total number of documents. It concatenates the text of all documents in the class into a single string and tokenizes this string into words. It counts the frequency of each word and constructs a frequency table. The function computes the conditional probabilities using Laplace smoothing: for each word in the vocabulary, it adds one to the word count (to avoid zero probabilities) and normalizes by the total word count plus the size of the vocabulary. This ensures that every word has a non-zero probability.

Finally, the function returns a list containing three elements: the vocabulary, the prior probabilities, and the conditional probability matrix. This trained model can then be used for classifying new text documents.
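The per-class counting step can also be sketched with a factor whose levels are fixed to the vocabulary. This is an alternative formulation, not the code used above, and the vocabulary and text are toy values:

```r
# Vocabulary and the concatenated, already-cleaned text of one class (toy values).
vocabulary <- c("tax", "vote", "health")
textc <- "tax vote tax senate tax"

tokens <- strsplit(textc, "\\s+")[[1]]

# factor() with fixed levels turns out-of-vocabulary tokens into NA, which
# table() ignores, and keeps a zero count for vocabulary terms that never occur.
vocab_counts <- as.numeric(table(factor(tokens, levels = vocabulary)))
names(vocab_counts) <- vocabulary

# Same Laplace-smoothed estimate as in train_multinomial_nb.
condprob <- (vocab_counts + 1) / (sum(vocab_counts) + length(vocabulary))
print(condprob)
```

Counting through a single table call avoids one `%in%` lookup per vocabulary word, which may matter when the vocabulary is large.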

In [None]:
##################
# LOG-LIKELIHOOD #
##################

In [None]:
apply_multinomial_nb <- function(classes, vocab, prior, condprob, doc) {
  tokens <- intersect(unlist(strsplit(doc, "\\s+")), vocab)

  score_matrix <- matrix(0, nrow = length(tokens), ncol = length(classes))
  rownames(score_matrix) <- tokens
  colnames(score_matrix) <- classes

  for (c in seq_along(classes)) {
    for (t in seq_along(tokens)) {
      term <- tokens[t]
      score_matrix[t, c] <- log(condprob[term, c])
    }
  }

  scores <- colSums(score_matrix) + log(prior)

  return(names(which.max(scores)))
}

The apply_multinomial_nb function applies a trained Multinomial Naive Bayes classifier to a new document in order to classify it. This function uses the vocabulary, prior probabilities, and conditional probabilities computed during training to determine the most likely class for the given document.

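The function begins by splitting the input document into individual words on whitespace. It then intersects these tokens with the provided vocabulary, so that only words present in the vocabulary are considered. Note that intersect also removes duplicates: each distinct term contributes to the score once, however many times it occurs in the document, whereas the sum over $k$ in the $c_{map}$ formula runs over every token occurrence.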

A score matrix is initialized, with rows representing the intersected tokens and columns representing the classes. The matrix is initially filled with zeros. The function then iterates over each class and each token, populating the score matrix with the log of the conditional probability of each token given the class. This involves two nested loops: the outer loop iterates over the classes, and the inner loop iterates over the tokens.

Once the score matrix is populated, the function calculates the total score for each class by summing the log-probabilities in the score matrix and adding the log of the prior probability for each class. The class with the highest total score is selected as the predicted class.

The function returns the name of the class with the maximum score, indicating the predicted classification for the input document. This approach ensures that the classification takes into account both the prior probability of each class and the likelihood of the document given each class, making use of the Naive Bayes assumption that the presence of each word is conditionally independent given the class.
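For comparison, here is a sketch of a variant that scores every token occurrence instead of every distinct term. The name apply_multinomial_nb_counts and the toy model are hypothetical. This is not the function used to produce the results in this notebook, and the classes argument is kept only so the signature matches.

```r
# Hypothetical variant of apply_multinomial_nb that keeps repeated tokens.
apply_multinomial_nb_counts <- function(classes, vocab, prior, condprob, doc) {
  tokens <- unlist(strsplit(doc, "\\s+"))
  tokens <- tokens[tokens %in% vocab]  # filter without removing duplicates

  # Sum log P(t_k | c) over every token occurrence, then add the log prior.
  scores <- colSums(log(condprob[tokens, , drop = FALSE])) + log(prior)
  return(names(which.max(scores)))
}

# Toy model with two classes and two vocabulary words (illustrative values).
classes <- c(0, 1)
vocab <- c("shock", "study")
prior <- c("0" = 0.5, "1" = 0.5)
condprob <- matrix(
  c(0.6, 0.4,   # P(t | class 0)
    0.3, 0.7),  # P(t | class 1)
  nrow = 2,
  dimnames = list(vocab, c("0", "1"))
)

pred <- apply_multinomial_nb_counts(classes, vocab, prior, condprob,
                                    "shock study study study")
print(pred)
```

On this document the two approaches disagree. Scoring each distinct term once gives 0.5 · 0.6 · 0.4 = 0.12 for class 0 against 0.5 · 0.3 · 0.7 = 0.105 for class 1, while counting the three occurrences of "study" tips the decision to class 1.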

In [None]:
###########################
# K-FOLD CROSS VALIDATION #
###########################

In [None]:
kfold_cross_validation <- function(dataset, k = 5, thresholds, type) {
  n <- nrow(dataset)
  fold_size <- floor(n / k)

  accuracies <- matrix(0, nrow = k, ncol = length(thresholds))
  classes <- as.integer(sort(unique(dataset$Label)))

  for (fold in 1:k) {
    validation_indices <- ((fold - 1) * fold_size + 1):(fold * fold_size)
    train_indices <- setdiff(1:n, validation_indices)
    training_set <- dataset[train_indices, ]
    validation_set <- dataset[validation_indices, ]

    for (i in seq_along(thresholds)) {
      model <- train_multinomial_nb(classes, training_set, thresholds[i], type)

      pred_labels <- sapply(validation_set$Text, function(doc) {
        apply_multinomial_nb(classes, model$vocab, model$prior, model$condprob, doc)
      })

      correct_predictions <- sum(validation_set$Label == pred_labels)
      total_predictions <- length(validation_set$Label)
      accuracies[fold, i] <- correct_predictions / total_predictions
    }
  }

  mean_accuracies <- colMeans(accuracies)
  return(data.frame(threshold = thresholds, mean_accuracy = mean_accuracies))
}

The kfold_cross_validation function performs k-fold cross-validation on a given dataset to evaluate the performance of a Multinomial Naive Bayes classifier with different threshold values. This function helps to assess the classifier's accuracy by splitting the data into k subsets (folds) and iteratively training and validating the model on these folds.

The function starts by determining the number of rows (documents) in the dataset and calculating the size of each fold. It initializes a matrix accuracies to store the accuracy results for each fold and each threshold value. The unique class labels in the dataset are sorted and stored as integers in the classes vector.

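The function then iterates over the folds. Each fold's validation set is a contiguous block of fold_size rows, taken in the dataset's existing order (no shuffling), and the training set consists of all remaining rows. Because fold_size is floor(n / k), any leftover rows at the end of the dataset are never used for validation.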

Within each fold, the function iterates over the specified threshold values. For each threshold, it trains a Multinomial Naive Bayes classifier using the train_multinomial_nb function, which builds a vocabulary, computes prior probabilities, and calculates conditional probabilities based on the training set.

Next, the function applies the trained model to each document in the validation set using the apply_multinomial_nb function. This function classifies the document based on the trained model and returns the predicted class label.

The function compares the predicted labels to the actual labels of the validation set to count the number of correct predictions. The accuracy for each threshold and fold is computed as the ratio of correct predictions to the total number of predictions and stored in the accuracies matrix.

After completing the cross-validation process for all folds and thresholds, the function calculates the mean accuracy for each threshold by taking the column-wise mean of the accuracies matrix. The function returns a data frame containing the threshold values and their corresponding mean accuracies.

This cross-validation approach evaluates the classifier on several different train/validation splits of the data, which makes the accuracy estimate less dependent on any single split and gives a sounder basis for choosing the threshold.
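If the rows of the dataset are ordered in some meaningful way, contiguous folds can be unrepresentative. A possible alternative, sketched here with an arbitrary seed and an illustrative dataset size, is to assign fold ids at random so that every row is validated exactly once:

```r
set.seed(42)  # arbitrary seed, used only to make the sketch reproducible
n <- 23       # illustrative number of rows
k <- 5

# Give each row a fold id in 1..k: balanced sizes, random order.
fold_id <- sample(rep(seq_len(k), length.out = n))

for (fold in seq_len(k)) {
  validation_indices <- which(fold_id == fold)
  train_indices <- setdiff(seq_len(n), validation_indices)
  # The two index sets never overlap and together cover all rows.
  stopifnot(length(intersect(train_indices, validation_indices)) == 0)
  # ... train on train_indices, evaluate on validation_indices ...
}

print(table(fold_id))
```

With this assignment the fold sizes differ by at most one row, and no row is left out of validation when n is not a multiple of k.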

## Analysis 

In [1]:
source("./aux.R")

In [2]:
dataset <- read.csv("two_label_dataset.csv", col.names = c("ID", "Title", "Author", "Text", "Label"))
classes <- as.integer(sort(unique(dataset$Label)))

In [3]:
dataset$Text <- clean(dataset$Text)
dataset <- clean_empty_rows(dataset)

In [4]:
eighty_percent <- as.integer(length(dataset$Text) * 0.8)

training_set <- dataset[1:eighty_percent, ]
test_set <- dataset[(eighty_percent + 1):length(dataset$Text), ]

In [5]:
crossval_results <- kfold_cross_validation_two_labels(
  training_set,
  k = 5,
  occ_thresholds = c(
    1e-10, 1e-9, 5e-9, 1e-8, 5e-8, 1e-7, 5e-7,
    1e-6, 5e-6, 1e-5, 1.6e-5, 2e-5, 5e-5
  )
)
crossval_results

| occ_threshold | mean_accuracy |
| ------------- | ------------- |
| 1e-10 | 0.8670313 |
| 1e-09 | 0.8670313 |
| 5e-09 | 0.8670313 |
| 1e-08 | 0.8670313 |
| 5e-08 | 0.8670313 |
| 1e-07 | 0.8670313 |
| 5e-07 | 0.8669705 |
| 1e-06 | 0.8659982 |
| 5e-06 | 0.8635673 |
| 1e-05 | 0.8615618 |


In [9]:
model <- train_multinomial_nb_new_two_label(classes, training_set, threshold = 1e-07)

In [10]:
pred_labels <- sapply(test_set$Text, function(doc) {
  apply_multinomial_nb(classes, model$vocab, model$prior, model$condprob, doc)
})

In [11]:
correct_predictions <- sum(test_set$Label == pred_labels)
total_predictions <- length(test_set$Label)
accuracy <- correct_predictions / total_predictions
confusion_matrix <- table(True = test_set$Label, Predicted = pred_labels)


cat("Accuracy:", accuracy, "\n")
cat("Confusion Matrix:\n")
print(confusion_matrix)

Accuracy: 0.8746051 
Confusion Matrix:
    Predicted
True    0    1
   0 1803  239
   1  277 1796
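The precision, recall and F1 score mentioned in the introduction can be derived from the confusion matrix above. The sketch below copies the four counts by hand and treats class 1 as the positive class, which is an assumption made only for this illustration:

```r
# Confusion matrix copied from the output above (rows: true, columns: predicted).
confusion_matrix <- matrix(
  c(1803, 277, 239, 1796),
  nrow = 2,
  dimnames = list(True = c("0", "1"), Predicted = c("0", "1"))
)

# Class 1 is taken as the positive class (an assumption for this sketch).
tp <- confusion_matrix["1", "1"]  # true 1, predicted 1
fp <- confusion_matrix["0", "1"]  # true 0, predicted 1
fn <- confusion_matrix["1", "0"]  # true 1, predicted 0

accuracy  <- sum(diag(confusion_matrix)) / sum(confusion_matrix)
precision <- tp / (tp + fp)
recall    <- tp / (tp + fn)
f1        <- 2 * precision * recall / (precision + recall)

print(round(c(accuracy = accuracy, precision = precision,
              recall = recall, f1 = f1), 4))
```

The accuracy recomputed this way agrees with the value printed above, which is a quick check that the counts were transcribed correctly.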


__________________________________________________________________________________________________