# Abstract

Technology has improved rapidly, and the World Wide Web is at the heart of it.  It connects billions of people across the globe, providing them with countless forms of entertainment.  Because the internet is not well regulated, people are free to post whatever they want, regardless of whether it is factual or fake.  The aim of this project is to create a deep learning model that can determine whether an article is real or fake.

We decided to implement two models.  The first is a multi-layer Recurrent Neural Network (RNN).  Its first layer is a word embedding layer that produces a third-order tensor in which each matrix holds the word vectors of one article.  This is followed by the recurrent layers, two fully connected layers, and a sigmoid activation function for binary classification.  The second is a neural network with LSTM layers.

# Section 1 - The Dataset

We used the [WELFake](https://www.kaggle.com/datasets/saurabhshahane/fake-news-classification) dataset, which contains a well-balanced mix of real and fake news articles.  It combines datasets from Kaggle, McIntire, Reuters, and BuzzFeed.  The dataset provides a total of 78098 examples with four features:

1. Serial Number
2. Title
3. Text
4. Label

Since we only care about the article text and its label, we focus our data preprocessing on those two features.

## Data Preprocessing
Starting with the raw dataset, we removed anything in the articles that looked like an email address or a web URL.  We then removed all non-alphanumeric characters and lowercased the text.  Finally, all words were lemmatized and stop words were removed.  These preprocessing steps were done with regular expressions.

After preprocessing, the text had to be tokenized and converted to integers.  We created a dictionary that maps each token to an integer and converted each text to a sequence of integers.  Since some texts were longer than others, shorter texts were padded with zeros to the length of the longest article.  The resulting integer sequences were then converted into word vectors for use in the models.
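
The steps above can be sketched with the standard library alone. This is a minimal illustration, not the project's actual code: the regular expressions, the tiny `STOP_WORDS` set, and the helper names `clean_text`, `build_vocab`, and `encode_and_pad` are all assumptions made for the example. Lemmatization is omitted, and padding at the end of each sequence is also an assumption.

```python
import re

# Hypothetical, deliberately tiny stop-word list (a real run would use a full one)
STOP_WORDS = {"the", "a", "an", "is", "at", "of", "and", "to", "in"}

EMAIL_RE = re.compile(r"\S+@\S+")                # anything that looks like an email
URL_RE = re.compile(r"(?:https?://|www\.)\S+")   # anything that looks like a web URL
NON_ALNUM_RE = re.compile(r"[^a-z0-9\s]")        # non-alphanumeric characters


def clean_text(text):
    """Lowercase, strip URLs, emails and punctuation, then drop stop words."""
    text = text.lower()
    text = URL_RE.sub(" ", text)
    text = EMAIL_RE.sub(" ", text)
    text = NON_ALNUM_RE.sub(" ", text)
    return [token for token in text.split() if token not in STOP_WORDS]


def build_vocab(tokenized_texts):
    """Map each distinct token to an integer, reserving 0 for padding."""
    vocab = {}
    for tokens in tokenized_texts:
        for token in tokens:
            if token not in vocab:
                vocab[token] = len(vocab) + 1
    return vocab


def encode_and_pad(tokenized_texts, vocab):
    """Convert tokens to integers and zero-pad to the longest text."""
    max_len = max(len(tokens) for tokens in tokenized_texts)
    return [
        [vocab[token] for token in tokens] + [0] * (max_len - len(tokens))
        for tokens in tokenized_texts
    ]


docs = [
    "Contact me at a.b@example.com or visit https://example.com NOW!",
    "Breaking: the senate voted.",
]
tokens = [clean_text(doc) for doc in docs]
vocab = build_vocab(tokens)
padded = encode_and_pad(tokens, vocab)
print(padded)  # [[1, 2, 3, 4, 5], [6, 7, 8, 0, 0]]
```

Reserving index 0 for padding keeps padded positions distinguishable from real tokens when the sequences are later passed to an embedding layer.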


# Section 2 - Model Descriptions
We tested a regular [Recurrent Neural Network](https://pytorch.org/docs/stable/generated/torch.nn.RNN.html) and then a neural network with [LSTM](https://pytorch.org/docs/stable/generated/torch.nn.LSTM.html). We chose these two models because both can capture trends in sequential data.

## Recurrent Neural Network
We applied an RNN using a `tanh` activation for the output of the hidden layer.  The RNN had

- 4 stacked recurrent layers
- An input of 100 features per time step (the embedding dimension)
- 128 features for each hidden state
- A dropout layer with dropout probability $p = 0.5$
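
To give a sense of the size of this recurrent stack, the sketch below counts its trainable parameters. It assumes the parameter layout documented for `torch.nn.RNN` (one input-to-hidden matrix, one hidden-to-hidden matrix, and two bias vectors per layer, unidirectional). The helper name `rnn_parameter_count` is hypothetical, and the embedding layer is left out because its size depends on the vocabulary.

```python
def rnn_parameter_count(input_size, hidden_size, num_layers):
    """Count parameters of a unidirectional stacked RNN with biases."""
    total = 0
    for layer in range(num_layers):
        # Only the first layer sees the embeddings; later layers see hidden states
        layer_input = input_size if layer == 0 else hidden_size
        total += hidden_size * layer_input   # input-to-hidden weights
        total += hidden_size * hidden_size   # hidden-to-hidden weights
        total += 2 * hidden_size             # two bias vectors
    return total


recurrent = rnn_parameter_count(input_size=100, hidden_size=128, num_layers=4)
print(recurrent)  # 128512
```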

The input to the RNN is the output of a word embedding layer with 100 output features.  Following the RNN layers, we have a fully connected layer with 64 output features, another dropout layer with dropout probability $p = 0.5$, and a final fully connected layer with one output.  This output is passed to a sigmoid activation function for binary classification.  We also applied Xavier normal initialization to the weights of the first fully connected layer following the RNN layers, and gradient clipping with a clipping threshold of $\theta = 3$.
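
The gradient clipping step can be illustrated in plain Python. The text does not say whether clipping was applied to the gradient norm or to individual values, so this sketch assumes norm-based clipping in the style of `torch.nn.utils.clip_grad_norm_`. The function name `clip_by_global_norm` and the sample gradients are made up for the example.

```python
import math


def clip_by_global_norm(grads, max_norm=3.0, eps=1e-6):
    """Rescale gradients so that their L2 norm does not exceed max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    # Scale is 1.0 when the norm is already within the threshold
    scale = min(1.0, max_norm / (total_norm + eps))
    return [g * scale for g in grads], total_norm


clipped, norm_before = clip_by_global_norm([3.0, 4.0, 12.0])
norm_after = math.sqrt(sum(g * g for g in clipped))
print(norm_before, round(norm_after, 3))  # 13.0 3.0
```

Clipping by norm preserves the direction of the gradient and only shortens it, which is why it is a common choice for recurrent networks that are prone to exploding gradients.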

As both models ran into memory issues, we settled on the RNN.

In [None]:
import torch
import torch.nn as nn


class RNN(nn.Module):
    def __init__(self, input_size, hidden_size, embedding_dimensions, nbr_layers_rnn, batch_size):
        super().__init__()
        self.hidden_size = hidden_size
        self.nbr_layers_rnn = nbr_layers_rnn  # Stored so initHidden does not depend on global variables
        self.batch_size = batch_size
        # Embedding layer: maps each integer token id to a dense word vector
        self.word_sequence_to_embedding = nn.Embedding(num_embeddings=input_size, embedding_dim=embedding_dimensions)
        self.rnn = nn.RNN(input_size=embedding_dimensions, hidden_size=hidden_size, num_layers=nbr_layers_rnn, batch_first=True, dropout=0.5, bidirectional=False)
        self.fully_connected = nn.Linear(in_features=hidden_size, out_features=64)  # First fully connected layer
        torch.nn.init.xavier_normal_(self.fully_connected.weight)  # Xavier normal initialization of its weights
        self.drop = nn.Dropout(0.5)  # Dropout
        self.fully_connected_two = nn.Linear(in_features=64, out_features=1)  # Second fully connected layer, one output
        self.activation = nn.Sigmoid()  # Sigmoid activation

    def forward(self, input_, hidden):
        sequence_embeddings = self.word_sequence_to_embedding(input_)  # Create sequence embeddings
        output, hidden = self.rnn(sequence_embeddings, hidden)  # Compute outputs and final hidden state of the RNN
        fully_connected = self.fully_connected(output)  # First fully connected layer
        drop = self.drop(fully_connected)  # Apply dropout
        fully_connected_two = self.fully_connected_two(drop)  # Second fully connected layer
        # The result has shape (example, term, feature); keep only the last time step to get one value per example
        fully_connected_reduced = fully_connected_two[:, -1, :]
        output = self.activation(fully_connected_reduced)  # Sigmoid activation
        return output, hidden.detach()

    def initHidden(self):
        # Zero initial hidden state of shape (num_layers, batch_size, hidden_size), on the same device as the model
        device = next(self.parameters()).device
        return torch.zeros(self.nbr_layers_rnn, self.batch_size, self.hidden_size, device=device)

# Section 3 - Loss Function
We decided to use a [binary cross entropy loss function](https://pytorch.org/docs/stable/generated/torch.nn.BCELoss.html) since this is a binary classification task.  We used mean reduction, so the per-example losses in each batch are averaged into a single scalar.
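
As a small worked illustration of this loss, the sketch below computes binary cross entropy with mean reduction in plain Python. The function name `bce_loss_mean` and the sample predictions are assumptions made for the example. The `eps` clamp is a simplification to avoid `log(0)` and is not meant to mirror PyTorch's internal handling.

```python
import math


def bce_loss_mean(probabilities, labels, eps=1e-12):
    """Binary cross entropy averaged over all examples (mean reduction)."""
    losses = []
    for p, y in zip(probabilities, labels):
        p = min(max(p, eps), 1.0 - eps)  # keep p strictly inside (0, 1)
        losses.append(-(y * math.log(p) + (1 - y) * math.log(1.0 - p)))
    return sum(losses) / len(losses)


# Sigmoid outputs for three articles and their true labels (1 or 0)
loss = bce_loss_mean([0.9, 0.2, 0.6], [1, 0, 1])
print(loss)
```

Confident, correct predictions contribute little to the average, while confident, wrong predictions are penalized heavily.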

# Section 4 - Optimization
We used the Stochastic Gradient Descent (SGD) optimizer with a learning rate of $r = 0.005$.  We experimented with a variety of learning rates, ranging from 0.001 to 0.05.  We also applied a [step learning rate](https://pytorch.org/docs/stable/generated/torch.optim.lr_scheduler.StepLR.html#torch.optim.lr_scheduler.StepLR) schedule, where the learning rate was multiplied by a factor of 0.1 every 4 epochs.  In addition, we tried the [ReduceLROnPlateau](https://pytorch.org/docs/stable/generated/torch.optim.lr_scheduler.ReduceLROnPlateau.html#torch.optim.lr_scheduler.ReduceLROnPlateau) scheduler, which multiplied the learning rate by a factor of 0.1 every time two consecutive epochs had about the same total loss.  Overall, the constant learning rate of $r = 0.005$ appeared to provide more consistent training and testing accuracy.
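
To make the step schedule concrete, the following sketch computes the learning rate per epoch for a decay factor of 0.1 every 4 epochs, starting from $r = 0.005$. It is a plain-Python illustration of the closed-form rule rather than a call into PyTorch, and the function name `step_lr` is hypothetical.

```python
def step_lr(initial_lr, epoch, step_size=4, gamma=0.1):
    """Learning rate after `epoch` epochs under a step decay schedule."""
    return initial_lr * gamma ** (epoch // step_size)


# Learning rate for the first 12 epochs (epochs are counted from 0)
schedule = [step_lr(0.005, epoch) for epoch in range(12)]
for epoch, lr in enumerate(schedule):
    print(f"epoch {epoch:2d}: lr = {lr:.6f}")
```

After only 8 epochs the rate has dropped by a factor of 100, which may explain why a constant rate trained more consistently here.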

# Section 5 - Metrics and Experimental Results
## RNN

<br>

![](rnn_results.png)