# Fake news recognition using a Multinomial Naive Bayes Classifier

### Course: Advanced Statistics for Physics Analysis
### Students: Toso Simone, Feltrin Antonio

In [1]:
library(stringr)
library(tm) #Text-mining package
library(NLP) 
library(textstem) # For lemmatization


# A bit of theory

Our aim is to *classify* some phrases, using a predetermined set of categories $\mathcal{C}$ (e.g. `{Fake, Not Fake}`).

To do so, we must learn a *classification function* $\gamma: \mathcal{X} \to \mathcal{C}$, where $\mathcal{X}$ is the set of all possible input phrases. 

There are many possible algorithms for text classification in natural language processing. In this project we will focus on the *Multinomial Naive Bayes* classifier.

### The MNB classifier
Given a class $c$ and a *document* (i.e. phrase) $d$ containing $n_d$ tokens, we can imagine that the document was composed by independently drawing each token from the vocabulary $\mathcal{T} = \{t_1, t_2, \dots, t_m\}$, where token $t$ is drawn with probability $p(t|c)$. Under this assumption, the probability of composing the observed document is

$P(d|c) = \prod_{1 \leq k \leq n_d} p(t_k|c)$,

where $t_k$ denotes the $k$-th token of $d$.

Now, we want to infer $P(c|d)$. This can be done through Bayes' theorem:

$P(c|d) \propto p(c) \prod_{1 \leq k \leq n_d} p(t_k|c)$.

We can then easily find the *maximum a posteriori* class $c_{map}$:

$c_{map} = \mathrm{argmax}_{c} \, P(c|d) = \mathrm{argmax}_{c} \, p(c) \prod_{1 \leq k \leq n_d} p(t_k|c)$.

The maximum a posteriori class will be our guess for the classification.
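
The decision rule can be sketched in a few lines of R. This is a minimal illustration, not the notebook's implementation: the class names, the three-token vocabulary and all probabilities below are made-up assumptions, and the function `log.posterior` is hypothetical. The product is evaluated as a sum of logarithms, which gives the same argmax while avoiding numerical underflow on long documents.

```r
# Hypothetical toy parameters, chosen only for illustration.
prior <- c(fake = 0.5, real = 0.5)
# Rows: classes, columns: tokens; each row sums to 1.
token.prob <- rbind(
    fake = c(tax = 0.2, cut = 0.2, alien = 0.6),
    real = c(tax = 0.5, cut = 0.4, alien = 0.1)
)

# Log of p(c) * prod_k p(t_k | c) for every class.
# Assumes every token of the document is in the vocabulary.
log.posterior <- function(tokens, prior, token.prob) {
    log(prior) + rowSums(log(token.prob[, tokens, drop = FALSE]))
}

doc <- c('tax', 'cut', 'tax')
scores <- log.posterior(doc, prior, token.prob)
c.map <- names(which.max(scores))   # maximum a posteriori class
print(scores)
print(c.map)
```

With these numbers the unnormalized posteriors are $0.5 \cdot 0.2 \cdot 0.2 \cdot 0.2 = 0.004$ for `fake` and $0.5 \cdot 0.5 \cdot 0.4 \cdot 0.5 = 0.05$ for `real`, so the sketch selects `real`.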

### The learning algorithm
Our model depends on the following parameters:
* **Prior**: $p(c)$. 
* **Token probability**: $p(t_k|c)$, the probability for token $t_k$ to appear in a document of class $c$. 

The prior can be estimated as $\hat{p}(c) = \frac{N_c}{N}$, i.e. the fraction of documents of class $c$ in the training set.

The token probability can instead be estimated as $\hat{p}(t|c) = \frac{T_{ct} + 1}{\sum_{t'} (T_{ct'} + 1)}$, where $T_{ct}$ is the number of times token $t$ appears in the training documents of class $c$. Adding $1$ to every count, both in the numerator and in the denominator (Laplace smoothing), prevents $\hat{p}(t|c) = 0$ for tokens that never appear in documents of class $c$; a single such token would otherwise set the whole product $P(d|c)$ to zero.
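
As a rough sketch of these two estimators, the snippet below computes the prior and the smoothed token probabilities on a tiny corpus. The three documents, their labels and all variable names are assumptions made up for illustration; the notebook's own training code may be organized differently.

```r
# Hypothetical toy training set: each document is already tokenized.
docs <- list(c('tax', 'cut'), c('alien', 'tax'), c('alien', 'alien'))
labels <- c('real', 'fake', 'fake')

vocab <- sort(unique(unlist(docs)))
classes <- sort(unique(labels))

# Prior: fraction of training documents in each class, N_c / N.
prior <- sapply(classes, function(cl) mean(labels == cl))

# counts[c, t] = T_ct, the number of occurrences of token t in class c.
counts <- matrix(0, nrow = length(classes), ncol = length(vocab),
                 dimnames = list(classes, vocab))
for (i in seq_along(docs)) {
    for (tok in docs[[i]]) {
        counts[labels[i], tok] <- counts[labels[i], tok] + 1
    }
}

# Add-one smoothing: (T_ct + 1) / sum_t' (T_ct' + 1), row by row.
token.prob <- (counts + 1) / rowSums(counts + 1)
print(prior)
print(token.prob)
```

Here `cut` never occurs in a `fake` document, yet its estimated probability is $1/7$ rather than $0$. The denominator equals the total token count of the class plus the vocabulary size, so each row still sums to one.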

<hr style="border:1px solid gray">


# Kumar dataset 

We first try our hand at [this dataset](https://www.kaggle.com/datasets/anmolkumar/fake-news-content-detection?select=train.csv). It consists of 11507 records (10240 for training, 1267 for testing).

Each entry is classified as one of these 6 categories:
* *Barely true* - 0
* *False* - 1
* *Half-true* - 2
* *Mostly true* - 3
* *Not known* - 4
* *True* - 5
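
For readability, the integer codes can be turned into a labelled factor. The helper below is a hypothetical convenience (the names `label.names` and `decode.labels` are not part of the notebook); the example codes are the labels of the first eight training records shown further down.

```r
# Category names in the order of their integer codes 0..5.
label.names <- c('Barely true', 'False', 'Half-true',
                 'Mostly true', 'Not known', 'True')

# Hypothetical helper: map integer codes to a factor with readable levels.
decode.labels <- function(codes) {
    factor(codes, levels = 0:5, labels = label.names)
}

example.codes <- c(1, 2, 3, 1, 2, 5, 0, 2)
decoded <- decode.labels(example.codes)
print(table(decoded))   # class counts, including empty classes
```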

## Read data

In [2]:
dir_input <- 'data/kumar/train_pruned.csv'
dir_test <- 'data/kumar/test.csv'
input.df <- read.csv(dir_input,header=TRUE,sep=',')
test.df <- read.csv(dir_test,header=TRUE,sep=',')

In [3]:
head(input.df,8)

| | Labels | Text | Text_Tag |
|---|---|---|---|
| | <int> | <chr> | <chr> |
| 1 | 1 | Says the Annies List political group supports third-trimester abortions on demand. | abortion |
| 2 | 2 | When did the decline of coal start? It started when natural gas took off that started to begin in (President George W.) Bushs administration. | energy,history,job-accomplishments |
| 3 | 3 | Hillary Clinton agrees with John McCain "by voting to give George Bush the benefit of the doubt on Iran." | foreign-policy |
| 4 | 1 | Health care reform legislation is likely to mandate free sex change surgeries. | health-care |
| 5 | 2 | The economic turnaround started at the end of my term. | economy,jobs |
| 6 | 5 | The Chicago Bears have had more starting quarterbacks in the last 10 years than the total number of tenured (UW) faculty fired during the last two decades. | education |
| 7 | 0 | Jim Dunnam has not lived in the district he represents for years now. | candidates-biography |
| 8 | 2 | I'm the only person on this stage who has worked actively just last year passing, along with Russ Feingold, some of the toughest ethics reform since Watergate. | ethics |


## Vocabulary construction and tokenization

In [4]:
# Lower-case a string and replace every character matching `punct` with a space.
plain <- function(word, punct = '[:punct:]') {
    str_to_lower(str_replace_all(word, punct, ' '))
}

# Collect the distinct, non-empty tokens found in a character vector.
# Note: plain() runs before the split, so a punctuation `sep` such as ','
# has already been replaced by a space at that point.
get.unique.words <- function(tags.bag, sep, sortit = FALSE) {
    all.tags <- c()
    for (record in tags.bag) {
        temp.tags <- str_split_1(plain(record), sep)
        for (word in temp.tags) {
            if (word != '' && !(word %in% all.tags)) {
                all.tags <- c(all.tags, word)
            }
        }
    }
    if (sortit) all.tags <- sort(all.tags)
    return(all.tags)
}

# Clean each quote: strip punctuation, lower-case, drop English stop words.
get.quotes <- function(quotes.bag, sep) {
    all.quotes <- list()
    for (record in quotes.bag) {
        temp.q <- str_split_1(plain(record), sep)
        # Length > 0 rather than > 1: dropping one-character tokens
        # would also drop a lone '$'.
        keep <- !(temp.q %in% stopwords('en')) & str_length(temp.q) > 0
        all.quotes <- c(all.quotes, str_flatten(temp.q[keep], collapse = ' '))
    }
    return(all.quotes)
}

# Clean each tag string: only '.', ';', '(' and ')' are stripped, so hyphens
# are kept; the tags are then re-joined with spaces.
get.tags <- function(tags.bag, sep) {
    all.tags <- list()
    for (record in tags.bag) {
        temp.q <- str_split_1(plain(record, punct = '[.;()]'), sep)
        # Test the individual tags (not the list returned by the split).
        temp.q <- temp.q[!(temp.q %in% stopwords('en'))]
        all.tags <- c(all.tags, str_flatten(temp.q, collapse = ' '))
    }
    return(all.tags)
}
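
To see the cleaning steps in isolation, they can be applied to a single string. The helper `clean.one` below is hypothetical and only mirrors what `get.quotes` does for one record; the example sentence is the fourth record of the training set.

```r
library(stringr)
library(tm)

# Hypothetical helper (not part of the notebook): strip punctuation,
# lower-case, split on `sep`, drop stop words and empty tokens, re-join.
clean.one <- function(text, sep = ' ') {
    tokens <- str_split_1(str_to_lower(str_replace_all(text, '[:punct:]', ' ')), sep)
    keep <- !(tokens %in% stopwords('en')) & str_length(tokens) > 0
    str_flatten(tokens[keep], collapse = ' ')
}

example <- 'Health care reform legislation is likely to mandate free sex change surgeries.'
cleaned <- clean.one(example)
print(cleaned)
```

The result should match the cleaned text that the notebook reports for that record, `health care reform legislation likely mandate free sex change surgeries`.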

In [5]:
input.tags <- get.unique.words(input.df$Text_Tag,sep=',',sortit=TRUE)
test.tags <- get.unique.words(test.df$Text_Tag,sep=',',sortit=TRUE)    

In [6]:
vocabulary <- get.unique.words(input.df$Text,sep=' ')

In [7]:
quotes <- get.quotes(input.df$Text,sep=' ')

In [8]:
tags <- get.tags(input.df$Text_Tag,sep=',')

In [9]:
train.df <- data.frame(Labels = input.df$Labels, Text = unlist(quotes), Text_Tag = unlist(tags))

In [10]:
head(train.df,5)

| | Labels | Text | Text_Tag |
|---|---|---|---|
| | <int> | <chr> | <chr> |
| 1 | 1 | says annies list political group supports third trimester abortions demand | abortion |
| 2 | 2 | decline coal start started natural gas took started begin president george w bushs administration | energy history job-accomplishments |
| 3 | 3 | hillary clinton agrees john mccain voting give george bush benefit doubt iran | foreign-policy |
| 4 | 1 | health care reform legislation likely mandate free sex change surgeries | health-care |
| 5 | 2 | economic turnaround started end term | economy jobs |


In [11]:
stopwords('en')

In [12]:
train.df$Text <- lemmatize_strings(train.df$Text)

In [13]:
head(train.df,5)

| | Labels | Text | Text_Tag |
|---|---|---|---|
| | <int> | <chr> | <chr> |
| 1 | 1 | say annies list political group support 3 trimester abortion demand | abortion |
| 2 | 2 | decline coal start start natural gas take start begin president george w bushs administration | energy history job-accomplishments |
| 3 | 3 | hillary clinton agree john mccain vote give george bush benefit doubt iran | foreign-policy |
| 4 | 1 | health care reform legislation likely mandate free sex change surgery | health-care |
| 5 | 2 | economic turnaround start end term | economy jobs |


In [14]:
train.df$Text[5:15]

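The following filters replace money amounts, years, and any remaining numbers with placeholder tokens.

#### Filter 1: dollar amounts (`$NUM`) -> `<MONEY>`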

In [16]:
train.df$Text <- str_replace_all(train.df$Text, regex("\\$[0-9]*"), "<MONEY>")

#### Filter 2: years (`18xx`, `19xx` or `20xx`) -> `<YEAR>`

In [20]:
train.df$Text[12]

In [24]:
train.df$Text <- str_replace_all(train.df$Text, regex("(18|19|20)\\d{2}"), "<YEAR>")

#### Filter 3: any remaining number -> `<NUMBER>`

In [27]:
train.df$Text <- str_replace_all(train.df$Text, regex("\\d+"), "<NUMBER>")
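
The three filters can be bundled into one helper to make their order explicit. The sketch below is only illustrative: the function name `normalize.numbers` and the sample sentence are assumptions, while the regular expressions are the ones used in the cells above.

```r
library(stringr)

# Hypothetical helper: applies the three notebook filters in sequence.
normalize.numbers <- function(text) {
    text <- str_replace_all(text, regex("\\$[0-9]*"), "<MONEY>")        # dollar amounts
    text <- str_replace_all(text, regex("(18|19|20)\\d{2}"), "<YEAR>")  # years 1800-2099
    text <- str_replace_all(text, regex("\\d+"), "<NUMBER>")            # any digits left
    text
}

sample.text <- "cost $500 million 1999 tax cut 42 percent"
normalized <- normalize.numbers(sample.text)
print(normalized)
# [1] "cost <MONEY> million <YEAR> tax cut <NUMBER> percent"
```

The order matters: running the generic number filter first would consume the digits before the money and year patterns could see them. Note also that the year pattern has no word boundaries, so it matches four-digit runs inside longer numbers (`20000` becomes `<YEAR><NUMBER>`); wrapping it in `\\b ... \\b` would be one way to restrict it, if that behaviour is not intended.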

## Feature Selection

Let's see what our dataset looks like now.

In [29]:
head(train.df, 10)

| | Labels | Text | Text_Tag |
|---|---|---|---|
| | <int> | <chr> | <chr> |
| 1 | 1 | say annies list political group support <NUMBER> trimester abortion demand | abortion |
| 2 | 2 | decline coal start start natural gas take start begin president george w bushs administration | energy history job-accomplishments |
| 3 | 3 | hillary clinton agree john mccain vote give george bush benefit doubt iran | foreign-policy |
| 4 | 1 | health care reform legislation likely mandate free sex change surgery | health-care |
| 5 | 2 | economic turnaround start end term | economy jobs |
| 6 | 5 | chicago bear start quarterback last <NUMBER> year total numb tenure uw faculty fire last two decade | education |
| 7 | 0 | jim dunnam live district represent year now | candidates-biography |
| 8 | 2 | be person stage work actively just last year pass along russ feingold tough ethic reform since watergate | ethics |
| 9 | 2 | however take <MONEY> <NUMBER> million oregon lottery fund port newport eventually land new noaa marine operation center pacific | jobs |
| 10 | 3 | say gop primary opponent glenn grothman joe leibham cast compromise vote cost <MONEY> million high electricity cost | energy message-machine-2014 voting-record |
