<center><h1>Identifying Insincere Questions</h1><h20>Wilbert Garcia, Nyjay Nelson</h20><img src="image.png" width="500" height="500" alt="Insincere Questions">
</center>

### **Problem Explanation**

* According to the [Kaggle](https://www.kaggle.com/c/quora-insincere-questions-classification) competition prompt, an insincere question is one that is "founded upon false premises, or that intend to make a statement" rather than inquire. People use tone, context, intonation and many nonverbal cues to gauge whether or not a question is insincere. Natural Language Processing (NLP) allows computers to make sense of text data and make data-driven predictions. In this case, we are performing binary classification on whether questions are insincere or not.


* For our project, we experiment with a range of NLP techniques, from building and training an LSTM recurrent neural network from scratch to using transfer learning with transformer models like BERT, in order to perform binary classification on the dataset. The goal is to train a deep neural network that classifies text questions as sincere or insincere.

### **Data Background**

* The dataset is from a Kaggle competition. The labeled.csv file in this dataset contains 1.3 million questions labeled either sincere or insincere. An important consideration is that the data is highly unbalanced: 94 percent of the questions are labeled as sincere.


* The data was imported using pandas.
 

In [23]:
import pandas as pd
import re
import nltk
import tensorflow as tf

In [2]:
data = pd.read_csv('labeled.csv', usecols=[1,2])
data

Unnamed: 0,question_text,target
0,How did Quebec nationalists see their province...,0
1,"Do you have an adopted dog, how would you enco...",0
2,Why does velocity affect time? Does velocity a...,0
3,How did Otto von Guericke used the Magdeburg h...,0
4,Can I convert montra helicon D to a mountain b...,0
...,...,...
1306117,What other technical skills do you need as a c...,0
1306118,Does MS in ECE have good job prospects in USA ...,0
1306119,Is foam insulation toxic?,0
1306120,How can one start a research project based on ...,0


### **Data Cleaning for NLP Models**

* We preprocessed and cleaned the question text in four steps. First, we replaced every non-alphabetic character (punctuation, digits and other symbols) with a space. Second, we made the text lowercase, which makes the data more uniform because capitalization no longer produces separate words. Third, we removed stopwords, since they are the most common words in the language and do not carry meaning essential to classifying questions as sincere or insincere. Finally, we removed words of two characters or fewer (the code keeps only words with `len(i) > 2`); like stopwords, these are commonly conjunctions or prepositions that do not carry significant meaning for this task. Together, these steps strip raw components of the text that are not relevant and would make training a model more difficult.


* We use the scikit-learn library to split our initial data into training and testing sets.

In [3]:
from nltk.corpus import stopwords
stop = set(stopwords.words('english'))

def clean_text(text):
    text = re.sub('[^a-zA-Z]', ' ', text)
    text= str(text).lower()
    text = " ".join([word.lower() for word in text.split() if word.lower() not in stop])
    text = " ".join([i for i in text.split() if len(i) > 2])
    return text
data['question_text'] = data['question_text'].apply(clean_text)
data

Unnamed: 0,question_text,target
0,quebec nationalists see province nation,0
1,adopted dog would encourage people adopt shop,0
2,velocity affect time velocity affect space geo...,0
3,otto von guericke used magdeburg hemispheres,0
4,convert montra helicon mountain bike changing ...,0
...,...,...
1306117,technical skills need computer science undergrad,0
1306118,ece good job prospects usa like india jobs pre...,0
1306119,foam insulation toxic,0
1306120,one start research project based biochemistry ...,0
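
To make the effect of each cleaning step concrete, the sketch below traces one question through the same operations as `clean_text`. It is a standalone illustration: `TOY_STOPWORDS` is a small hypothetical stand-in for the NLTK English stopword list, and `clean_text_steps` is a helper written only for this example, so results for other sentences may differ from the real pipeline.

```python
import re

# Hypothetical, tiny stand-in for nltk's English stopword list (assumption).
TOY_STOPWORDS = {"how", "did", "their", "as", "a", "is", "the", "does", "why"}

def clean_text_steps(text, stop=TOY_STOPWORDS):
    """Return the intermediate result after each cleaning step."""
    letters_only = re.sub('[^a-zA-Z]', ' ', text)      # digits/punctuation -> spaces
    lowered = letters_only.lower()                      # uniform casing
    no_stop = " ".join(w for w in lowered.split() if w not in stop)
    long_words = " ".join(w for w in no_stop.split() if len(w) > 2)  # keep 3+ letters
    return letters_only, lowered, no_stop, long_words

steps = clean_text_steps("Why does velocity affect time in 2D?")
for name, value in zip(["letters only", "lowercase", "no stopwords", "len > 2"], steps):
    print(f"{name:>13}: {value!r}")
```

Note that the digit in "2D" is dropped by the first step and the remaining single letter is dropped by the last one, so short alphanumeric terms do not survive cleaning.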


### **Data Pre-processing for LSTM Recurrent Neural Network**

* After cleaning the data, there are a number of steps involved in preparing the data for the Natural Language Processing models. This involves processes such as extracting tokens from the questions and then encoding said tokens. 


* We get the number of unique words in the dataset as an initial step in tokenizing the data. 

In [4]:
#get the total number of unique words in dataset
from collections import Counter

def count_unique(text):
    count = Counter()
    for i in text.values:
        for word in i.split():
            count[word] += 1
    return count

text = data.question_text
labels = data.target
counter = count_unique(text)
num_words = len(counter) +1

In [5]:
from sklearn.model_selection import train_test_split

train_sentences, test_sentences, train_labels, test_labels = train_test_split(text, labels, test_size=0.2)

#one percent of data
train_s1 = text[:10449]
test_s1 = text[10449:13062]
train_l1 = labels[:10449]
test_l1= labels[10449:13062]

### **Tokenize  and Encode Data**

In [6]:
from tensorflow import keras
from keras.preprocessing.text import Tokenizer

tokenizer = Tokenizer(num_words = num_words, oov_token = '<UNK>')
tokenizer.fit_on_texts(train_sentences)
train_sequences = tokenizer.texts_to_sequences(train_sentences)
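
The following sketch mimics, in plain Python, what the `Tokenizer` calls above do conceptually: build a word index from the training text, then map each sentence to a list of integer ids. It is a simplified stand-in, not the Keras implementation; `build_word_index` and `to_sequence` are hypothetical helper names. It follows the Keras convention that index 0 is reserved for padding and that, when an `oov_token` is given, that token takes index 1.

```python
from collections import Counter

def build_word_index(sentences, oov_token="<UNK>"):
    """Assign ids by descending frequency; 0 is left free for padding."""
    counts = Counter(word for s in sentences for word in s.split())
    word_index = {oov_token: 1}
    for rank, (word, _) in enumerate(counts.most_common(), start=2):
        word_index[word] = rank
    return word_index

def to_sequence(sentence, word_index, oov_token="<UNK>"):
    """Replace each word by its id, falling back to the OOV id."""
    oov_id = word_index[oov_token]
    return [word_index.get(word, oov_id) for word in sentence.split()]

train = ["foam insulation toxic", "foam mattress toxic smell"]
index = build_word_index(train)
print(index)
print(to_sequence("foam pillow toxic", index))  # "pillow" was never seen in training
```

Because the index is built from the training sentences only, any word that first appears in the test set is mapped to the out-of-vocabulary id.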

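* Padding makes sure that all sequences are the same length: sequences shorter than `maxlen` are filled with zeros at the end, and longer ones are truncated at the end.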

In [7]:
from keras.preprocessing.sequence import pad_sequences
train_padded = pad_sequences(train_sequences, maxlen= 20, padding = "post", truncating= "post")

# Reuse the tokenizer fitted on the training sentences; it is not refit on the test set
test_sequences = tokenizer.texts_to_sequences(test_sentences)
test_padded = pad_sequences(test_sequences, maxlen= 20, padding = "post", truncating= "post")
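
As a rough illustration of what `padding="post"` and `truncating="post"` do, here is a pure-Python sketch. `pad_post` is a hypothetical helper written for this example, not part of Keras; it assumes 0 as the padding value, which is also the default in `pad_sequences`.

```python
def pad_post(sequences, maxlen, value=0):
    """Cut each sequence to maxlen from the end, then right-pad with `value`."""
    padded = []
    for seq in sequences:
        kept = list(seq[:maxlen])                   # truncating="post"
        kept += [value] * (maxlen - len(kept))      # padding="post"
        padded.append(kept)
    return padded

example = pad_post([[5, 8, 2], [7, 1, 9, 4, 3, 6]], maxlen=4)
print(example)
```

The short sequence gains one trailing zero and the long one loses its last two ids, so both end up with exactly four entries.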

### **Create Embedding Dictionary**

* We used the GloVe 6B 100-dimensional word vectors (`glove.6B.100d.txt`). GloVe stands for [Global Vectors for Word Representation](https://nlp.stanford.edu/projects/glove/).

In [8]:
import numpy as np
embedding_dict = {}
# Each line holds a word followed by the components of its vector
with open('glove.6B.100d.txt', encoding='utf-8') as file:
    for line in file:
        values = line.split()
        word = values[0]
        vectors = np.asarray(values[1:], "float32")
        embedding_dict[word] = vectors

In [9]:
word_index = tokenizer.word_index
index_len = len(word_index) + 1
embedding_matrix = np.zeros((index_len,100))

for word, i in word_index.items():
    if i < index_len:
        emb_vec = embedding_dict.get(word)
        if emb_vec is not None:
            embedding_matrix[i] = emb_vec
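
A small worked example may help show what the resulting matrix looks like. The sketch below repeats the same loop on a toy vocabulary with made-up 3-dimensional vectors (the real GloVe vectors used here have 100 dimensions). All words and values are hypothetical. Words missing from the embedding dictionary, as well as the reserved index 0, keep an all-zero row.

```python
import numpy as np

# Hypothetical 3-d "pre-trained" vectors standing in for GloVe.
toy_embeddings = {
    "foam": np.array([0.1, 0.2, 0.3], dtype="float32"),
    "toxic": np.array([0.9, 0.8, 0.7], dtype="float32"),
}
# Hypothetical tokenizer word index (0 is reserved for padding).
toy_word_index = {"<UNK>": 1, "foam": 2, "toxic": 3, "montra": 4}

toy_matrix = np.zeros((len(toy_word_index) + 1, 3))
for word, i in toy_word_index.items():
    vec = toy_embeddings.get(word)
    if vec is not None:
        toy_matrix[i] = vec      # row i holds the vector of the word with id i

covered = sum(word in toy_embeddings for word in toy_word_index)
print(toy_matrix)
print(f"{covered} of {len(toy_word_index)} words have a pre-trained vector")
```

Counting covered words in this way is one simple check of how much of the vocabulary actually benefits from the pre-trained vectors.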

### **Data Pre-processing for Transformer Model: DistilBERT**

* The BERT model differs from the LSTM model in terms of encoding because it requires a `[CLS]` token and a `[SEP]` token to mark the beginning and end of each input sequence.


In [10]:
def bert_encode(texts, tokenizer, max_len=512):
    all_tokens = []
    
    for text in texts:
        text = tokenizer.tokenize(str(text))
            
        text = text[:max_len-2]
        input_sequence = ["[CLS]"] + text + ["[SEP]"]
        pad_len = max_len - len(input_sequence)
        
        tokens = tokenizer.convert_tokens_to_ids(input_sequence)
        tokens += [0] * pad_len
        pad_masks = [1] * len(input_sequence) + [0] * pad_len
        segment_ids = [0] * max_len
        
        all_tokens.append(tokens)
    
    return np.array(all_tokens)
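
To visualise the layout that `bert_encode` produces, the sketch below runs the same steps with a toy whitespace tokenizer. `ToyTokenizer`, `encode_one` and the vocabulary ids are hypothetical and exist only for this illustration; the real DistilBERT tokenizer splits text into sub-word pieces and uses its own ids. The only detail taken from the function above is that padding positions are filled with 0.

```python
class ToyTokenizer:
    """Hypothetical stand-in exposing the two methods bert_encode relies on."""
    vocab = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
             "foam": 4, "insulation": 5, "toxic": 6}

    def tokenize(self, text):
        return text.lower().split()

    def convert_tokens_to_ids(self, tokens):
        return [self.vocab.get(t, self.vocab["[UNK]"]) for t in tokens]

def encode_one(text, tokenizer, max_len):
    tokens = tokenizer.tokenize(text)[:max_len - 2]   # leave room for [CLS]/[SEP]
    sequence = ["[CLS]"] + tokens + ["[SEP]"]
    ids = tokenizer.convert_tokens_to_ids(sequence)
    return ids + [0] * (max_len - len(sequence))      # right-pad with 0

short = encode_one("foam insulation toxic", ToyTokenizer(), max_len=8)
long = encode_one("foam insulation toxic foam toxic", ToyTokenizer(), max_len=5)
print(short)
print(long)
```

In the second call the text is cut to three tokens so that the two special tokens still fit inside `max_len`.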

In [11]:
import transformers
from transformers import DistilBertTokenizer, DistilBertModel
transformer_layer = transformers.TFDistilBertModel.from_pretrained('distilbert-base-uncased')
tokenizer = transformers.DistilBertTokenizer.from_pretrained('distilbert-base-uncased')

Some layers from the model checkpoint at distilbert-base-uncased were not used when initializing TFDistilBertModel: ['vocab_transform', 'activation_13', 'vocab_layer_norm', 'vocab_projector']
- This IS expected if you are initializing TFDistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the layers of TFDistilBertModel were initialized from the model checkpoint at distilbert-base-uncased.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFDistilBertModel for predictions without further training.


In [12]:
train_input = bert_encode(train_sentences, tokenizer, max_len=50)
test_input = bert_encode(test_sentences, tokenizer, max_len=50)

### **Related Work**

There is a significant amount of research and implementation done pertaining to Natural Language Processing and the 

We trained an LSTM recurrent neural network (RNN). We were inspired by the simple RNN model in the TensorFlow Keras guide [2].

The goal of this assignment is to train a deep network, but deep networks are very difficult to train because they require much more data, computing power and time. We can avoid the issues of training a network from scratch by using transfer learning, which takes advantage of large neural networks that others have already trained.

We employ a transformer model. The DistilBERT model is pre-trained using .  is a fairly comprehensive dataset. It features millions of . A model that is pre-trained on such a large dataset and performs with significant accuracy tends to generalize well to new data and is less prone to overfitting. This makes it an attractive model for the classification of questions as sincere or insincere that we are attempting to solve.

### **Experiments**

* We compare two models: an LSTM model and a transformer model. This section provides an overview of the models we created and the rationale behind choosing them.

    * `keras_model`: We trained an LSTM recurrent neural network from scratch. We were inspired by the simple example that [TensorFlow](https://www.tensorflow.org/guide/keras/rnn) provides of a recurrent neural network using an LSTM layer through Keras.

    * `DistilBERT`: DistilBERT is a language representation model. [DistilBERT](https://arxiv.org/abs/1910.01108) is described as a "smaller, faster, cheaper and lighter" version of BERT. [BERT](https://arxiv.org/abs/1810.04805) is a language representation model whose name is an acronym for Bidirectional Encoder Representations from Transformers. BERT is effective and innovative because it applies bidirectional training to a Transformer. BERT is an attractive model because of its performance. It has shown impressive results in Natural Language Processing tasks such as "pushing the GLUE score to 80.5% (7.7% point absolute improvement), MultiNLI accuracy to 86.7% (4.6% absolute improvement), SQuAD v1.1 question answering Test F1 to 93.2 (1.5 point absolute improvement) and SQuAD v2.0 Test F1 to 83.1 (5.1 point absolute improvement)"[4]. However, the BERT model is large and requires a lot of time and memory, beyond the capacity of the tools that we have at our disposal. Because of this, we chose a variation of the BERT model that is better suited to our capabilities and retains most of BERT's performance. The DistilBERT model uses knowledge distillation during pretraining and is able to "reduce the size of a BERT model by 40%, while retaining 97% of its language understanding capabilities and being 60% faster" [3]. We chose the DistilBERT model because it performs similarly to the BERT model while being faster and more cost efficient.

### **LSTM Model**

* This is the model with the best performance given our LSTM RNN architecture. 

In [25]:
from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense, Dropout, Input
from tensorflow.keras import layers
from tensorflow.keras import optimizers
from tensorflow.keras import losses
from tensorflow.keras import metrics
from keras.initializers import Constant
from tensorflow.keras.models import Model

* We refer to the model below as keras_model_1. As defined in the code, it has an input_length of 20 and 256 nodes in the LSTM layer. A second configuration, keras_model_2, has an input length of 20 and 128 nodes in the LSTM layer.

In [14]:
keras_model_1 = Sequential()
keras_model_1.add(Embedding(index_len, 100, input_length = 20, embeddings_initializer = Constant(embedding_matrix), trainable = False))
keras_model_1.add(LSTM(256, activation= 'relu'))
keras_model_1.add(Dropout(0.2))
keras_model_1.add(Dense(1, activation ="sigmoid"))
keras_model_1.summary()
keras_model_1.compile(optimizer=optimizers.Adam(learning_rate=0.0001), loss=losses.BinaryCrossentropy(),metrics=[metrics.BinaryAccuracy()])

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 20, 100)           15865200  
_________________________________________________________________
lstm (LSTM)                  (None, 256)               365568    
_________________________________________________________________
dropout_19 (Dropout)         (None, 256)               0         
_________________________________________________________________
dense (Dense)                (None, 1)                 257       
Total params: 16,231,025
Trainable params: 365,825
Non-trainable params: 15,865,200
_________________________________________________________________
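
The parameter counts in this summary can be checked by hand. An LSTM layer has four gates, each with an input weight matrix, a recurrent weight matrix and a bias, which gives $4 \cdot (u \cdot (d + u) + u)$ parameters for $u$ units and input dimension $d$. The short sketch below applies this standard formula to the layer sizes shown above; `lstm_params` and `dense_params` are helper names introduced only for this check.

```python
def lstm_params(units, input_dim):
    """4 gates, each with input weights, recurrent weights and a bias."""
    return 4 * (units * (input_dim + units) + units)

def dense_params(units, input_dim):
    """One weight per input plus one bias, for each output unit."""
    return units * (input_dim + 1)

lstm = lstm_params(units=256, input_dim=100)
dense = dense_params(units=1, input_dim=256)
embedding = 15_865_200      # taken from the summary: vocabulary size x 100

print(lstm, dense, lstm + dense)    # LSTM, Dense and trainable totals
print(embedding // 100)             # implied vocabulary size (index_len)
print(embedding + lstm + dense)     # total parameters
```

The embedding layer dominates the total, but because it is frozen (`trainable = False`), only the LSTM and Dense parameters are updated during training.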


### **Transformer Model: DistilBERT**


In [31]:
def build_model(transformer, max_len=50):
    input_word_ids = Input(shape=(max_len,), dtype=tf.int32, name="input_word_ids")
    sequence_output = transformer(input_word_ids)[0]
    cls_token = sequence_output[:, 0, :]
    out = Dense(1, activation='sigmoid')(cls_token)
    
    model = Model(inputs=input_word_ids, outputs=out)
    model.compile(optimizer=optimizers.Adam(learning_rate=0.0001), loss=losses.BinaryCrossentropy(),metrics=[metrics.BinaryAccuracy()])
    
    return model

model = build_model(transformer_layer, max_len=50)
model.summary()

Model: "model_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_word_ids (InputLayer)  [(None, 50)]              0         
_________________________________________________________________
tf_distil_bert_model (TFDist TFBaseModelOutput(last_hi 66362880  
_________________________________________________________________
tf_op_layer_strided_slice_3  [(None, 768)]             0         
_________________________________________________________________
dense_4 (Dense)              (None, 1)                 769       
Total params: 66,363,649
Trainable params: 66,363,649
Non-trainable params: 0
_________________________________________________________________
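
The slicing step `sequence_output[:, 0, :]` keeps only the hidden state at position 0, the `[CLS]` token, for every question in the batch. Below is a minimal NumPy sketch of the shapes involved, using random numbers in place of real DistilBERT outputs (the batch size of 4 is arbitrary).

```python
import numpy as np

batch_size, max_len, hidden = 4, 50, 768
rng = np.random.default_rng(0)
# Stand-in for the transformer output: one 768-d vector per token position.
sequence_output = rng.normal(size=(batch_size, max_len, hidden))

cls_token = sequence_output[:, 0, :]    # first position of every sequence
print(sequence_output.shape, "->", cls_token.shape)

# The final Dense(1) layer needs one weight per hidden feature plus a bias.
dense_param_count = hidden + 1
print(dense_param_count)
```

This matches the 769 parameters listed for the final Dense layer in the summary.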


#### **Metrics**
* Since we are classifying between two classes, we use binary accuracy instead of categorical accuracy.
* Because false negatives and false positives both carry real costs and because the classes in our dataset are unbalanced, we look at several metrics beyond binary accuracy to determine the best model:
    * Precision: $\frac{TP}{TP+FP}$
    * Recall: $\frac{TP}{TP+FN}$
    * F1 score: $\frac{2 \cdot precision \cdot recall}{precision + recall}$
* Since we are doing binary classification, sklearn's `precision_score`, `recall_score` and `f1_score` with their default settings report the score for the positive class (label 1, insincere).
* We have written the `print_results()` function to print the binary accuracy, precision, recall, F1 score and confusion matrix.

In [16]:
from sklearn.metrics import f1_score, recall_score, precision_score, confusion_matrix, accuracy_score
def print_results(y_test, predictions):
    y_pred = np.round(np.squeeze(predictions))
    accuracy = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred)
    recall = recall_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)
    conf_mat = confusion_matrix(y_test, y_pred)
    print("Accuracy: ", accuracy)
    print("Precision: ", precision)
    print("Recall: ", recall)
    print("F1 score: ", f1)
    print("Confusion Matrix: ", conf_mat)
    return accuracy, precision, recall, f1, conf_mat
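
To see why accuracy alone can mislead on this dataset, consider a trivial classifier that labels every question as sincere. The sketch below uses a synthetic label vector with the same 94/6 split described in the data section (the 10,000 labels are made up for illustration) and computes the metrics directly from their definitions.

```python
import numpy as np

# Synthetic labels: 94% sincere (0), 6% insincere (1). Illustrative only.
y_true = np.array([0] * 9400 + [1] * 600)
y_pred = np.zeros_like(y_true)          # always predict "sincere"

tp = int(np.sum((y_true == 1) & (y_pred == 1)))
fp = int(np.sum((y_true == 0) & (y_pred == 1)))
fn = int(np.sum((y_true == 1) & (y_pred == 0)))

accuracy = float(np.mean(y_true == y_pred))
precision = tp / (tp + fp) if (tp + fp) else 0.0    # no positive predictions
recall = tp / (tp + fn) if (tp + fn) else 0.0
f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0

print(accuracy, precision, recall, f1)
```

The baseline reaches 94 percent accuracy while never detecting an insincere question, which is why precision, recall and F1 are reported alongside accuracy.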

#### **Experimental Specification**
* In our experiments, we vary the following hyperparameters:

| num_nodes_LSTM | input_length | learning_rate | embeddings_initializer |
|------|----|----|------|
| 32, 64, 128, 256 | 20, 50 | 0.001, 0.0001 | Yes, No |


* There are many more combinations of hyperparameter settings that we could have tried. Given the time constraints of the project, we used prior knowledge and experience to choose combinations that we believed would lead us to an optimal model.

* One of our hyperparameter settings is whether or not we include the embeddings_initializer when building the LSTM model.

* For each experiment and each model, we trained 5 epochs at a time with a batch size of 32. 


* Our results can be found in `worknyjay#.ipynb`. Each file represents a different experimental test on either the LSTM models or the DistilBERT models. We began by testing on a small subset of data with a softmax activation in the output layer and were only getting 0.06 accuracy. If the output layer had a single unit, as in the models shown here, a softmax always returns 1, so every question would be predicted insincere, which is consistent with the roughly 6 percent of questions labeled insincere. We then switched to a sigmoid activation in the output layer and a larger subset of data, and proceeded to test different hyperparameter configurations.


* We now show the optimized LSTM and DistilBERT models with the best hyperparameter settings for each.

In [19]:
keras_model_1.fit(train_padded, train_labels, batch_size = 32, 
                  epochs = 10, steps_per_epoch =10000,
                  verbose=1)
km1pred = keras_model_1.predict(test_padded, verbose=1)
accuracy, precision, recall, f1, conf_mat = print_results(test_labels, km1pred)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Accuracy:  0.9542654799502345
Precision:  0.632794315815264
Recall:  0.6344653933550328
F1 score:  0.6336287527983072
Confusion Matrix:  [[238947   5995]
 [  5952  10331]]
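
The reported scores can be re-derived from the confusion matrix above. In scikit-learn's layout, rows are true classes and columns are predicted classes, so for binary labels the matrix reads `[[TN, FP], [FN, TP]]`. A short check:

```python
# Entries copied from the confusion matrix printed for keras_model_1.
tn, fp = 238947, 5995
fn, tp = 5952, 10331

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.4f} precision={precision:.4f} "
      f"recall={recall:.4f} f1={f1:.4f}")
```

Of the 16,283 insincere questions in the test set, the model finds 10,331, and roughly one in three of its insincere predictions is a false alarm.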


In [32]:
train_history = model.fit(
    train_input, train_labels,
    epochs=10,
    batch_size=32,
    steps_per_epoch =1000
)
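# NOTE: test_padded holds the Keras Tokenizer ids (length 20) prepared for the LSTM.
# The DistilBERT inputs built with bert_encode are in test_input (length 50).
predictions = model.predict(test_padded, verbose=1)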
accuracy, precision, recall, f1, conf_mat = print_results(test_labels, predictions)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Accuracy:  0.8998487893578333
Precision:  0.1588507493611437
Recall:  0.1412516121107904
F1 score:  0.1495351407580781
Confusion Matrix:  [[232763  12179]
 [ 13983   2300]]
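
Since the two models expect differently encoded inputs (length-20 Keras Tokenizer ids for the LSTM, length-50 DistilBERT ids for the transformer), a small guard before calling `predict` can catch a mix-up early. The helper below is a hypothetical sketch, not part of the original notebook; the arrays are zero-filled placeholders with the sequence lengths used in this report.

```python
import numpy as np

def check_input_width(array, expected_len, name="input"):
    """Raise if a batch of token ids does not have the expected sequence length."""
    if array.ndim != 2 or array.shape[1] != expected_len:
        raise ValueError(
            f"{name} has shape {array.shape}, expected (n, {expected_len})"
        )
    return array

lstm_batch = np.zeros((3, 20), dtype="int32")   # placeholder for test_padded
bert_batch = np.zeros((3, 50), dtype="int32")   # placeholder for test_input

check_input_width(bert_batch, 50, name="test_input")    # passes silently
try:
    check_input_width(lstm_batch, 50, name="test_padded")
except ValueError as err:
    print(err)
```

Beyond the length, the two encodings also use different vocabularies, so ids from one tokenizer are not meaningful to the other model even when the shapes happen to agree.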


### **Results and Conclusion**

* Based on our experiments, the best model for classifying insincere questions is the LSTM recurrent neural network keras_model_1 with __ LSTM layer nodes, a learning rate of __ , an input length of __ and the embeddings initializer included in the embedding layer of the model. In training and testing this model several times with these settings, we always had a test binary accuracy ranging from .94 to .96 and an F1 score ranging from .59 to .625. This model had the highest average F1 score among the different hyperparameter configurations for the keras_model. Interestingly, some models had a higher average recall, with a maximum of 0.72, than keras_model_1, whose maximum was 0.64. The F1 score is the harmonic mean of precision and recall, so it serves as a better single metric than precision or recall individually in determining the effectiveness of the model.


* We note that there were significant limitations when gathering experimental results. Out-of-memory (OOM) errors made data collection the most difficult part of this research process. We acknowledge that there are likely other hyperparameter settings for the keras_model that could have further optimized it. We also note that, regardless of hyperparameter settings, our results for DistilBERT are inconclusive. The binary accuracy ranges from .92 to .94, which likely reflects the imbalance of classes in the dataset rather than real predictive skill; this caveat applies to both the LSTM and DistilBERT models. For the DistilBERT model, the precision, recall and F1 score are 0, which gives no signal for how to vary hyperparameters to optimize it. The DistilBERT model offers a rigorous and more efficient alternative to the BERT model, but we were unable to draw any conclusions from the results of our experiment.


* Overall, we believe that our keras_model is not very effective in classifying questions as sincere or insincere. Our average F1 score on our most successful model would not place us in the top 1000 submissions for Kaggle's Insincere Questions competition [6]. Most models had accuracy over 90 percent, but this was not informative because of the imbalance of classes. Many of our models had precision, recall and F1 scores averaging over .5 for each. Our best models were in the same range of F1 score, precision and recall as those of our peers.


* If we had had more time and had not hit OOM errors at various points, we would have liked to run a more comprehensive set of experiments, varying the hyperparameters more drastically and further testing the effect of batch size and steps per epoch. We believe this might result in a much better model overall. We were also curious about implementing the winning submission for the competition. That being said, we put a lot of time into this project and think that, given the time and memory constraints, we did the best we could.

### **References**
[1] https://www.kaggle.com/c/quora-insincere-questions-classification <br>
[2] https://www.tensorflow.org/guide/keras/rnn <br>
[3] https://arxiv.org/abs/1910.01108 <br>
[4] https://arxiv.org/abs/1810.04805 <br>
[5] https://nlp.stanford.edu/projects/glove/<br>
[6] https://www.kaggle.com/c/quora-insincere-questions-classification/leaderboard<br>
[7] https://towardsdatascience.com/bert-explained-state-of-the-art-language-model-for-nlp-f8b21a9b6270<br>