# Title: RMDL and Roberta BERT for text classification

#### Individual's Name : Garima Malik

####  Emails : garima.malik@ryerson.ca

### INTRODUCTION:
*********************************************************************************************************************
#### AIM : 
To empirically analyse the feasibility of the proposed Model in the paper "RMDL: Random Multimodel Deep Learning for Classification" released in 2018.
*********************************************************************************************************************
#### Github Repo: 
https://github.com/kk7nc/RMDL
*********************************************************************************************************************
#### DESCRIPTION OF PAPER:
RMDL can be seen as an ensemble approach for deep learning models. It addresses the problem of finding the best deep learning structure and architecture while simultaneously improving robustness and accuracy through ensembles of deep learning architectures.
*********************************************************************************************************************
#### PROBLEM STATEMENT :
* Replicate the results reported in the paper on text classification datasets with RMDL models.
* Use 2 standard datasets : IMDB and Reuters.
* To assess the effectiveness of RMDL, scrape and preprocess a Stack Overflow classification dataset 
(link : "https://www.kaggle.com/stackoverflow/stacksample?select=Questions.csv")
* To compare against RMDL, train a Roberta BERT model on the above-mentioned datasets.
*********************************************************************************************************************
#### CONTEXT OF THE PROBLEM:
* The continually increasing number of complex datasets each year necessitates ever-improving machine learning methods for robust and accurate categorization of these data.
* Deep learning models generally involve a lot of randomization.
* Users have to tune hyperparameters manually, changing one parameter at a time, which results in long execution times.
* The authors therefore proposed an ensemble-based approach for deep learning models.
*********************************************************************************************************************
#### SOLUTION:
* The proposed approach is built on a basic concept of randomization.
* It asks the user for the minimum and maximum number of nodes to use when building each neural network.
* It builds the RMDL architecture from vertically stacked DNN + RNN + CNN models and produces the final prediction with a voting classifier built from the predictions of all these models.


# Background
*********************************************************************************************************************
#### In this paper they divided their 'Related Work' into 3 parts :

#### Feature Extraction 

|Reference|Explanation|Dataset/Input|Weakness|
|------|------|------|------|
|L. Krueger et al. [1]|Feature extraction methods based on word counting for text categorization in statistical learning|Pattern recognition dataset|Word weighting proved inconsistent on text classification datasets|
|G. Salton et al. [2]|Introduced the concept of TF-IDF by weighting the frequency counts|NPL (National Physical Laboratory) collection of 11429 documents|Normalized TF-IDF did not perform better|


*********************************************************************************************************************
#### Classification methods and techniques

|Reference|Explanation|Dataset/Input|Weakness|
|------|------|------|------|
|K. Murphy [3], I. Rish [4]|Introduced the Naive Bayes Classifier (NBC) and its empirical analysis|Introductory paper where derivations are provided for document classification using naive Bayes|Numerical underflow with probabilistic models, and requirements on data characteristics|
|C. Yu et al. [5], S. Tong et al. [6]|SVM with active learning and latent variables|Reuters, NewsGroup data|The simple method is computationally faster than the hybrid ones|

*********************************************************************************************************************


#### Deep learning for classification

| Reference |Explanation |  Dataset/Input |Weakness|
| --- | --- | --- | --- |
|D. Cireşan et al. [7]|Multi-column deep neural networks for classification tasks|GTSRB (German Traffic Sign Recognition Benchmark)|Second-best model in the competition; embedding it in a more general system is left as planned work|
|K. Kowsari et al. [8]|Hierarchical deep learning for text classification|WOS datasets|Presents results on only one dataset; more hierarchy levels could be added to the models|
|(Implemented Paper) RMDL [9]|A new ensemble deep learning approach for classification|Reuters, IMDB|Computationally expensive; excessive randomization; uses the same model for image, text, and face recognition|




# Methodology
*********************************************************************************************************************
#### Basic Details of RMDL :
* The novelty of this work lies in using multiple random deep learning (RDL) models, including DNN, RNN, and CNN, for text.
* The DNN, RNN, and CNN models are trained in parallel.
* It can be applied to any kind of classification dataset, i.e. not just text; it can be extended to images as well.
* The DNN uses TF-IDF features, while the RNN and CNN use GloVe embeddings.
* r + c + d = n (an RMDL with n models, where r is the number of RNNs, d the number of DNNs, and c the number of CNNs)
* After all RDL models are trained, the final prediction is computed by majority vote over these models.
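
The majority-vote step described in the last bullet can be sketched in a few lines of plain Python. This is a minimal illustration, not RMDL's own code: the function name `majority_vote` and the three prediction lists are hypothetical, and ties are broken by whichever class was seen first.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine per-model class predictions by majority vote.

    predictions: one list of predicted class ids per model,
    all of the same length (one entry per document).
    """
    n_docs = len(predictions[0])
    final = []
    for i in range(n_docs):
        votes = Counter(model_preds[i] for model_preds in predictions)
        # most_common sorts by count; equal counts keep first-seen order
        final.append(votes.most_common(1)[0][0])
    return final


# Hypothetical predictions from one DNN, one RNN and one CNN on four documents
dnn = [0, 1, 1, 0]
rnn = [0, 1, 0, 0]
cnn = [1, 1, 0, 0]
ensemble = majority_vote([dnn, rnn, cnn])
print(ensemble)
```

With an odd number of models and two classes a tie cannot occur; with more classes or an even number of models, the tie-breaking rule becomes a design choice that affects the result.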



In [1]:
from IPython.display import Image
Image(url='images/rmdl_voting.png', width=700)

In [2]:
from IPython.display import Image
Image(url='images/rmdl_archi.png', width=700)

# Implementation
********************************************************************************************************************
#### PART - 1: RMDL  Paper  Replication
#### PART - 2: EDA of Stack Overflow Dataset
#### PART - 3: Training of Roberta BERT model for comparison with RMDL

#### NECESSARY LIBRARIES
NOTE : To run RMDL as a library, install it with the pip command
! pip install RMDL

In [5]:
import warnings
warnings.filterwarnings('ignore')
import pandas as pd

In [8]:
import sys
import os
import nltk
nltk.download("reuters")
from RMDL import text_feature_extraction as txt
from keras.datasets import imdb
import numpy as np
from nltk.corpus import reuters
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.model_selection import train_test_split
from RMDL import RMDL_Text as rmdl

[nltk_data] Downloading package reuters to
[nltk_data]     /Users/macbookpro/nltk_data...
[nltk_data]   Package reuters is already up-to-date!
Using TensorFlow backend.


sys.version_info(major=3, minor=7, micro=4, releaselevel='final', serial=0)


[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/macbookpro/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


### Data Sets Loading ....
* IMDB : 50,000 documents with 2 classes
* Reuters : 21,578 documents with 90 categories.

#### Explanation 
The Keras library is used to load the IMDB dataset with the vocabulary capped at the 1000 most frequent words (MAX_NB_WORDS = 1000). The text is cleaned with the RMDL library's 'text_cleaner' function, and the training and testing datasets are prepared.

In [3]:
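print("Loading IMDB dataset....")
MAX_NB_WORDS = 1000  # keep only the most frequent words
(X_train_i, y_train_i), (X_test_i, y_test_i) = imdb.load_data(num_words=MAX_NB_WORDS)
# Invert the word -> index vocabulary to decode the integer sequences back to text.
# Note: load_data shifts word indices by index_from=3 by default, so this direct
# lookup is only an approximate decoding; '' guards against missing indices.
word_index = imdb.get_word_index()
index_word = {v: k for k, v in word_index.items()}
X_train_i = [txt.text_cleaner(' '.join(index_word.get(w, '') for w in x)) for x in X_train_i]
X_test_i = [txt.text_cleaner(' '.join(index_word.get(w, '') for w in x)) for x in X_test_i]
X_train_i = np.array(X_train_i).ravel()
# Evaluate on the first 50 test reviews only to keep the run short;
# the labels are sliced the same way so features and labels stay aligned.
X_test_i = np.array(X_test_i[:50]).ravel()
y_test_i = y_test_i[:50]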

Loading IMDB dataset....


#### Explanation

The Reuters dataset is downloaded with the NLTK library; it contains 21578 documents with 90 categories. A multi-label binarizer transforms the labels into a one-hot encoding, which is then reduced to a single class index per document so that it can be fed into the RMDL models.

In [3]:
print("Loading Reuters Dataset ......")
documents = reuters.fileids()
# File ids are prefixed with "training/" or "test/", which defines the split
train_docs_id = [doc for doc in documents if doc.startswith("train")]
test_docs_id = [doc for doc in documents if doc.startswith("test")]
X_train_r = [reuters.raw(doc_id) for doc_id in train_docs_id]
X_test_r = [reuters.raw(doc_id) for doc_id in test_docs_id]
# Multi-hot encode the category lists (a document can have several categories)
mlb = MultiLabelBinarizer()
y_train_r = mlb.fit_transform([reuters.categories(doc_id) for doc_id in train_docs_id])
y_test_r = mlb.transform([reuters.categories(doc_id) for doc_id in test_docs_id])
# Reduce to one class index per document: argmax keeps the first active category
y_train_r = np.argmax(y_train_r, axis=1)
y_test_r = np.argmax(y_test_r, axis=1)

Loading Reuters Dataset ......
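
One consequence of the label handling above is worth spelling out: taking the argmax of a 0/1 row returns the position of the first 1, so a document with several categories keeps only the alphabetically first one. The sketch below reproduces that behaviour in plain Python on a tiny example; the three category names and the document labels are made up for illustration, and `first_active_label` is a hypothetical stand-in for `np.argmax` on a single row.

```python
def first_active_label(multi_hot_row):
    """Mimic argmax on a 0/1 row: index of the first maximum value."""
    return multi_hot_row.index(max(multi_hot_row))


# Illustrative category subset, sorted alphabetically as a binarizer would store it
classes = ["corn", "grain", "wheat"]
doc_labels = [["grain", "wheat"], ["corn"], ["wheat"]]

# Multi-hot encoding: one column per class, 1 if the document carries that label
multi_hot = [[1 if c in labels else 0 for c in classes] for labels in doc_labels]
single = [first_active_label(row) for row in multi_hot]

print(multi_hot)
print([classes[i] for i in single])
```

The first document carries both "grain" and "wheat" but ends up labelled only as "grain", which is one reason accuracy and F1 on Reuters should be read with this simplification in mind.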


### Calling the RMDL model on IMDB data 
*********************************************************************************************************************
### Parameters Explanation :
1. Input : x_train, y_train, x_test, y_test
2. MAX_SEQUENCE_LENGTH : Maximum length of a sequence or document in the dataset; defaults to 500.
3. MAX_NB_WORDS : Maximum number of unique words in the dataset; defaults to 75000.
4. GloVe_dir : Path to the directory with GloVe or other pre-trained embeddings; defaults to null, in which case glove.6B.zip will be downloaded.
5. GloVe_file : Which GloVe or pre-trained word embedding file to use; defaults to glove.6B.50d.txt.
6. random_deep : Number of ensembled models used in RMDL: random_deep[0] is the number of DNNs, random_deep[1] the number of RNNs, and random_deep[2] the number of CNNs.
7. epochs : Number of epochs for each ensembled model used in RMDL.
8. [min_hidden_layer_dnn, max_hidden_layer_dnn] : Range for the number of hidden layers the user wants to experiment with.
9. [min_nodes_dnn, max_nodes_dnn] : Range for the number of nodes in the DNN model.
10. random_optimizor : Boolean; set it to use random optimizers in RMDL as well.
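
To make the role of the range parameters concrete, here is a rough sketch of how a random DNN layout could be drawn from them. It is not taken from the RMDL source: the helper `sample_dnn_config` is hypothetical, and drawing a separate node count for every layer is an assumption made for illustration only.

```python
import random


def sample_dnn_config(min_hidden_layer_dnn, max_hidden_layer_dnn,
                      min_nodes_dnn, max_nodes_dnn, rng):
    """Draw one random DNN layout: a layer count, then a node count per layer."""
    n_layers = rng.randint(min_hidden_layer_dnn, max_hidden_layer_dnn)
    return [rng.randint(min_nodes_dnn, max_nodes_dnn) for _ in range(n_layers)]


rng = random.Random(42)      # fixed seed so the draw is repeatable
random_deep = [3, 0, 0]      # 3 DNNs, 0 RNNs, 0 CNNs

# One independent random layout per requested DNN
dnn_configs = [sample_dnn_config(1, 8, 128, 256, rng) for _ in range(random_deep[0])]
for i, cfg in enumerate(dnn_configs):
    print(f"DNN {i}: {len(cfg)} hidden layers, nodes per layer = {cfg}")
```

Fixing the seed, as `random_state` does in the call below, is what makes a randomized ensemble repeatable between runs.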

In [6]:
# random_deep=[1, 1, 0] trains one DNN and one RNN, and no CNN
model_i = rmdl.Text_Classification(
    X_train_i, y_train_i, X_test_i, y_test_i, batch_size=128,
    EMBEDDING_DIM=100, MAX_SEQUENCE_LENGTH=50, MAX_NB_WORDS=1000,
    GloVe_dir="/Users/macbookpro/Documents/GitHub/Named-Entity-Recognition-with-Bidirectional-LSTM-CNNs/embeddings/",
    GloVe_file="glove.6B.100d.txt",
    sparse_categorical=True, random_deep=[1, 1, 0], epochs=[1, 1, 1], plot=False,
    min_hidden_layer_dnn=1, max_hidden_layer_dnn=8, min_nodes_dnn=128, max_nodes_dnn=256,
    min_hidden_layer_rnn=1, max_hidden_layer_rnn=5, min_nodes_rnn=32, max_nodes_rnn=128,
    min_hidden_layer_cnn=3, max_hidden_layer_cnn=10, min_nodes_cnn=128, max_nodes_cnn=512,
    random_state=42, random_optimizor=False, dropout=0.5, no_of_classes=2)


Done1
tf-idf with 967 features
/Users/macbookpro/Documents/GitHub/Named-Entity-Recognition-with-Bidirectional-LSTM-CNNs/embeddings/glove.6B.100d.txt
Found 998 unique tokens.
(25050, 50)
Total 400000 word vectors.
2
DNN 0

Epoch 00001: val_accuracy improved from -inf to 0.90000, saving model to weights\weights_DNN_0.hdf5
RNN 0
4
90
Train on 25000 samples, validate on 50 samples
Epoch 1/1
 - 79s - loss: 0.7027 - accuracy: 0.5040 - val_loss: 0.6918 - val_accuracy: 0.5400
(50, 2)
Accuracy of 2 models: [0.9, 0.54]
Accuracy: 0.9
F1_Micro: (0.9, 0.9, 0.9, None)
F1_Macro: (0.900974025974026, 0.8977455716586151, 0.898989898989899, None)
F1_weighted: (0.9003246753246753, 0.9, 0.8997979797979798, None)


### Similar calling for Reuters Data

In [None]:
batch_size = 100
sparse_categorical = True  # labels are integer class indices, not one-hot vectors
n_epochs = [120, 120, 120]  ## DNN--RNN-CNN
Random_Deep = [3, 0, 0]  ## DNN--RNN-CNN
model_r = rmdl.Text_Classification(X_train_r, y_train_r, X_test_r, y_test_r,
                             batch_size=batch_size,
                             sparse_categorical=sparse_categorical,
                             random_deep=Random_Deep,
                             epochs=n_epochs)

*********************************************************************************************************************
### My Additions in the Project :
* Scraped and Preprocessed the stack overflow data set from kaggle
* Applied RMDL on this new data set to assess the effectiveness
* Additionally, trained Roberta BERT model to compare the performance of RMDL model on all of these data sets

### Stack Overflow Data Set EDA
* The dataset is taken from kaggle (Question - answer data and labels : programming language)
* Initial Distribution looks like this :

<div>
   <img src="images/initial_dist.png" width="700">
</div>

### After Preprocessing 

* The raw data was preprocessed and converted into a text classification dataset.
* For simplicity, only 10,000 documents were selected (1000 from each class).
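
The per-class selection can be pictured as a balanced subsample. The sketch below is an assumption about how such a step might look, not the preprocessing actually used for this project: `balanced_sample` is a hypothetical helper, the toy rows are invented, and only the column name `single_label` is borrowed from the dataset shown later.

```python
import random
from collections import defaultdict


def balanced_sample(rows, label_key, per_class, seed=0):
    """Return up to `per_class` randomly chosen rows for every label value."""
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[label_key]].append(row)

    rng = random.Random(seed)
    sample = []
    for label in sorted(by_label):
        group = by_label[label]
        # Classes with fewer than per_class rows are kept whole
        sample.extend(rng.sample(group, min(per_class, len(group))))
    return sample


# Toy data: 30 questions spread evenly over 3 labels
rows = [{"Title": f"question {i}", "single_label": i % 3} for i in range(30)]
subset = balanced_sample(rows, "single_label", per_class=5)
print(len(subset))
```

Balancing the classes this way keeps accuracy and macro F1 comparable, at the cost of discarding most of the data from the larger classes.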

### Reading the Data 

In [6]:
df_stack = pd.read_csv("data/stack_overflow_10000.csv")

In [10]:
df_stack = df_stack.drop(['Unnamed: 0','Unnamed: 0.1','Id'],axis=1)

#### Explanation :

The data set has 'Title' as the title of the question asked on Stack Overflow, 'Body' as the question text, 'Tags' as labels, and 'single_label' as the category-coded label.

In [12]:
df_stack.head(5)

| |Title|Body|Tags|single_label|
|------|------|------|------|------|
|0|Face recognize using Opencv4Android SDK tutorial?|I am a student. Recently I've been building a ...|['android']|0|
|1|Chat messages ordering strategy in PubNub|We are building a chat application in Android ...|['android']|0|
|2|Disappearance of R.java file|I was working on an android project and i pres...|['android']|0|
|3|Android GoogleMap V2 on zoom map doesn't get s...|I'm currently working on Android and have star...|['android']|0|
|4|Where is the App permission for "identity" in ...|I am trying to use an emulator with which come...|['android']|0|


### Testing the model with stack overflow data
*******************************************************************************************************************************
To assess the effectiveness of RMDL methodology, I have applied the same model configuration on the above-mentioned data set. Complete implementation is present in src folder.

* RMDL model testing : refer to [stackoverflowRMDL.ipynb](https://github.com/garima2751/NLP_Project/blob/main/Src/stackoverflowRMDL.ipynb)

### Training the Roberta BERT model over all the data sets
*******************************************************************************************************************************
The Roberta model is trained using the Simple Transformers library, and I used a fine-tuned version of the Roberta BERT model. The full implementation is in the Src folder of the GitHub repo. For more detail, the notebook for each data set trained with Roberta BERT is linked below.

* For Stack Overflow data : refer to [Roberta_stack.ipynb](https://github.com/garima2751/NLP_Project/blob/main/Src/Roberta_stack.ipynb)
* For IMDB : refer to [Roberta_imdb.ipynb](https://github.com/garima2751/NLP_Project/blob/main/Src/Roberta_imdb.ipynb)
* For Reuters : refer to [Roberta_Reuters.ipynb](https://github.com/garima2751/NLP_Project/blob/main/Src/Roberta_Reuters.ipynb)

### Results :
*******************************************************************************************************************************
Each dataset is trained with the RMDL and Roberta BERT models, and the comparison is presented in terms of accuracy, F1-macro, and F1-weighted. The RMDL paper reports only accuracy in its results; however, I feel the F1-score is a better metric for assessing performance on text classification tasks.
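
To show why F1 can tell a different story from accuracy, here is a small self-contained sketch of macro and weighted F1 in plain Python. It follows the standard definitions (per-class F1, then an unweighted or support-weighted mean); the function names and the toy labels are illustrative and are not the evaluation code used for the results in this report.

```python
from collections import Counter


def f1_per_class(y_true, y_pred):
    """Per-class F1 = 2*TP / (2*TP + FP + FN), 0.0 when the class never appears."""
    scores = {}
    for c in sorted(set(y_true) | set(y_pred)):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        scores[c] = 2 * tp / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1: every class counts equally."""
    scores = f1_per_class(y_true, y_pred)
    return sum(scores.values()) / len(scores)


def weighted_f1(y_true, y_pred):
    """Mean of per-class F1 weighted by the number of true samples per class."""
    scores = f1_per_class(y_true, y_pred)
    support = Counter(y_true)
    return sum(scores[c] * support[c] for c in scores) / len(y_true)


# Imbalanced toy example: 8 documents of class 0, 2 of class 1
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy, macro_f1(y_true, y_pred), weighted_f1(y_true, y_pred))
```

In this toy case accuracy is 0.90 while macro F1 is about 0.80, because the single error falls on the minority class and macro averaging gives that class the same weight as the majority class.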

#### Observations :
*******************************************************************************************************************************
* For IMDB data : both models perform equally well.
* For Reuters data : the documents span 90 categories, and RMDL shows a better F1 score compared to Roberta BERT.
* For Stack Overflow data : Roberta BERT performs well; however, the execution time was 13+ hours.


In [4]:
from IPython.display import Image
Image(url='images/nlp_result_table.png', width=500)

#### Graphical Representation of Results

In [3]:
Image(url='images/nlp_plot_result.png', width=900)

### Conclusion and Future Direction :
*******************************************************************************************************************************
#### Learnings : 
During this project, I learnt how to replicate a paper from its GitHub repo and how to make changes to the source code of a package. RMDL was already available as a Python package; however, I cloned the repo and made some changes to the source code so that it would run smoothly. I also learnt how various deep learning model layers can be vertically stacked together and how the input layer and number of nodes can be randomized.

* The RMDL source code was changed to reduce the randomness, as the hardware available for running the RMDL models was limited.
* Learnt the mechanism of building a voting classifier with random multimodel deep learning.
* Managed to train the Roberta BERT model with the Simple Transformers library, which was easy to code with the help of Google Colab and the library's functions.
* A major learning is that RMDL often tries every combination of layers and input nodes, so training and execution times become uncontrollable.
* RMDL also produces a different model for the same set of inputs, which is a big disadvantage for experimentation.
* The Roberta BERT model has a similar cost problem, as it takes 13+ hours to train on a data set with more than 10,000 rows.
*******************************************************************************************************************************
#### Results Discussion :
The results depicted in the above graphs summarize the performance of both models on each data set. My goal was to see whether a recently released, fine-tuned BERT model performs better than the RMDL models.

* In the case of the Stack Overflow dataset : the BERT model outperforms the RMDL models with an F1-score of 80 %.
* In the case of the IMDB dataset : both models perform equally well.
* In the case of the Reuters dataset : RMDL performs slightly better than BERT.

*******************************************************************************************************************************
#### Limitations :
In terms of limitations, the RMDL methodology has the following problems:

* Longer execution times, because every parameter is randomized in the methodology. For example, if the range of DNN nodes is given as [3,8], it will create DNN models with all possible combinations within that range. Doing the same for the RNN and CNN as well can exhaust the available RAM while training the RMDL model.

* Although RMDL claims to improve the accuracy and robustness of models, the authors only worked with fairly standard datasets and did not report F1-scores for the text classification datasets; the accuracies shown in the paper are achievable using standalone deep learning architectures or BERT models.

* Randomization of the deep learning models results in a different model every time RMDL is run on the same input.
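
As a back-of-the-envelope check on the first limitation, the sketch below counts how many distinct DNN layouts exist if every hidden layer may independently take any integer node count in the given range. This is an assumption about the shape of the search space made purely for illustration, not a description of RMDL's internals; `count_dnn_layouts` is a hypothetical helper, the node range [3,8] comes from the example above, and the layer ranges are made up.

```python
def count_dnn_layouts(min_layers, max_layers, min_nodes, max_nodes):
    """Number of layouts when each layer picks any node count in [min_nodes, max_nodes]."""
    node_choices = max_nodes - min_nodes + 1
    # A network with k layers has node_choices ** k possible layouts
    return sum(node_choices ** n_layers
               for n_layers in range(min_layers, max_layers + 1))


small = count_dnn_layouts(1, 2, 3, 8)   # 6 one-layer + 36 two-layer layouts
deeper = count_dnn_layouts(1, 5, 3, 8)  # grows geometrically with depth
print(small, deeper)
```

Even with only six node choices, the count grows geometrically with the number of layers, which is why exhaustively building every combination quickly becomes impractical.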

In terms of BERT training, the limitations are as follows:

* Inputs cannot be padded beyond a fixed maximum length, so longer sentences are truncated, which is a pure loss of information.
* Hardware limitations (RAM availability) also make it hard to train BERT models efficiently.
* Training or execution time becomes impractically long for more than 10,000 rows.
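
The information loss from a fixed input length can be illustrated with a minimal pad-or-truncate helper. This is a simplified sketch: real tokenizers also add special tokens and build attention masks, and the names `pad_or_truncate`, `pad_id`, and the chosen `max_len` are assumptions for illustration.

```python
def pad_or_truncate(token_ids, max_len, pad_id=0):
    """Force a token sequence to exactly max_len items.

    Returns the fixed-length sequence and the number of tokens that were dropped.
    """
    kept = token_ids[:max_len]
    n_dropped = len(token_ids) - len(kept)
    padded = kept + [pad_id] * (max_len - len(kept))
    return padded, n_dropped


# A short input is padded, a long input loses its tail
short, dropped_short = pad_or_truncate([5, 6, 7], max_len=5)
long_, dropped_long = pad_or_truncate(list(range(1, 9)), max_len=5)
print(short, dropped_short)
print(long_, dropped_long)
```

Everything past `max_len` never reaches the model, so for long Stack Overflow question bodies a larger share of the text is discarded than for short titles.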


*******************************************************************************************************************************
#### Future Extension :
In future work, RMDL could be implemented with new embeddings and feature sets such as ELMo, BERT, and fastText.
RMDL could also be extended to perform extensive hyperparameter tuning by trying every possible parameter setting and returning the optimal parameters together with the best model.

# References:

[1]:  Krueger, L. E., & Shapiro, R. G. (1979). Letter detection with rapid serial visual presentation: Evidence against word superiority at feature extraction. Journal of Experimental Psychology: Human Perception and Performance, 5(4), 657.

[2]:  Salton, G., & Buckley, C. (1987). Term weighting approaches in automatic text retrieval. Cornell University.

[3]: Murphy, K. P. (2006). Naive bayes classifiers. University of British Columbia, 18(60)

[4]: Rish, I. (2001, August). An empirical study of the naive Bayes classifier. In IJCAI 2001 workshop on empirical methods in artificial intelligence (Vol. 3, No. 22, pp. 41-46)

[5]: Yu, C. N. J., & Joachims, T. (2009, June). Learning structural svms with latent variables. In Proceedings of the 26th annual international conference on machine learning (pp. 1169-1176).

[6]: Tong, S., & Koller, D. (2001). Support vector machine active learning with applications to text classification. Journal of machine learning research, 2(Nov), 45-66.

[7]: Cireşan, D., Meier, U., Masci, J., & Schmidhuber, J. (2012). Multi-column deep neural network for traffic sign classification. Neural networks, 32, 333-338.

[8]: Kowsari, K., Brown, D. E., Heidarysafa, M., Meimandi, K. J., Gerber, M. S., & Barnes, L. E. (2017, December). Hdltex: Hierarchical deep learning for text classification. In 2017 16th IEEE international conference on machine learning and applications (ICMLA) (pp. 364-371). IEEE

[9]: Kowsari, K., Heidarysafa, M., Brown, D. E., Meimandi, K. J., & Barnes, L. E. (2018, April). Rmdl: Random multimodel deep learning for classification. In Proceedings of the 2nd International Conference on Information System and Data Mining (pp. 19-28).