**Name:** Trương Thị Kiều Anh

**Student ID:** 19021209

**Class:** 2122I_INT3405E_20

# Problem Description

Quora is a platform that empowers people to learn from each other. On Quora, people can ask questions and connect with others who contribute unique insights and quality answers. A key challenge is to weed out insincere questions -- those founded upon false premises, or that intend to make a statement rather than look for helpful answers.
* Input: 
  a question asked on Quora
* Output:
  0/1 (sincere/insincere) - a prediction of whether the question asked on Quora is sincere (0) or insincere (1)

# Data analysis

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf
import gc
import re
import spacy

from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras import backend as K
from keras.layers import *
from keras.models import *
from keras.initializers import Constant
from keras.utils.vis_utils import plot_model

import torch
import torch.nn as nn
from tensorflow.keras.optimizers import Adam

from torch import LongTensor, FloatTensor, DoubleTensor
from torch.utils.data import Dataset, DataLoader, sampler
from torch.utils.data.distributed import DistributedSampler

from tqdm.notebook import tqdm
from IPython.display import display, HTML
tqdm.pandas()  # registers progress_apply / progress_map on pandas objects

pd_ctx = pd.option_context('display.max_colwidth', 100)

**Data size:**
1,306,122 questions

**Data fields**
   * `qid` - unique question identifier
   * `question_text` - Quora question text
   * `target` - a question labeled "insincere" has a value of 1, otherwise 0
   

*No data is null or missing*   

In [None]:
df = pd.read_csv('/kaggle/input/quora-insincere-questions-classification/train.csv')
df.info()

test_df = pd.read_csv('/kaggle/input/quora-insincere-questions-classification/test.csv')
test_df.info()

df['word_count']= df.question_text.progress_apply(lambda x: len(x.split()))
sincere_data = df[df['target']==0]
insincere_data = df[df['target']==1]
print("Sincere question")
display(sincere_data.head())
print("Insincere question")
display(insincere_data.head())

In [None]:
# summary statistics of the number of words per question, for each class
statistic = pd.merge(
    sincere_data[['word_count']].describe(percentiles=[.8, .9999]), 
    insincere_data[['word_count']].describe(percentiles=[.8, .9999]), 
    left_index=True, right_index=True, suffixes=('_sincere', '_insincere')
)
colLabels = statistic.columns
cellText = statistic.round(2).values
rowLabels = statistic.index

fig, axes = plt.subplots(nrows=1, ncols=2)
axes[0] = fig.add_axes([0,0,1,1])
axes[0].bar(['sincere question', 'insincere question'], df.target.value_counts())
for p in axes[0].patches:
    width = p.get_width()
    height = p.get_height()
    percent = height / len(df)
    x, y = p.get_xy() 
    axes[0].annotate(f'{percent:.2%}', (x + width/2, y + height + 0.01*len(df)), ha='center')
# axes[1].axis('off')
mpl_table = axes[1].table(cellText = cellText, colLabels=colLabels, rowLabels = rowLabels, bbox=[2, 0, 2, 1.5], )
mpl_table.auto_set_font_size(False)
mpl_table.set_fontsize(14)

* Sincere questions account for 93.81%
* Insincere questions account for 6.19%

$\rightarrow$ The training set is imbalanced: sincere questions are about 15 times more numerous than insincere ones, which can give a misleading picture of model quality. If accuracy is used as the performance metric, a very high score can be reached without learning anything. For example, a trivial predictor that assigns every question to the majority class already achieves about 93.8% accuracy. Therefore, for an imbalanced dataset, other performance metrics should be used, such as the F1 score.
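
To make the point about accuracy concrete, here is a minimal sketch in plain Python. The helper `accuracy_and_f1` and the toy labels are illustrative assumptions, not part of the notebook; the labels only mimic the class ratio above.

```python
def accuracy_and_f1(y_true, y_pred):
    """Return (accuracy, F1 of the positive class) for binary 0/1 labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    accuracy = correct / len(y_true)
    # F1 = 2TP / (2TP + FP + FN); reported as 0 when the denominator is 0
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return accuracy, f1

# Toy labels with roughly the same imbalance as the training set (illustrative only)
y_true = [0] * 94 + [1] * 6
always_sincere = [0] * 100  # predicts the majority class for every question

acc, f1 = accuracy_and_f1(y_true, always_sincere)
print(f"accuracy={acc:.2f}, F1={f1:.2f}")
```

The majority-class predictor scores high on accuracy but gets an F1 of zero, because it never finds a single insincere question.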

**Length of sentences in the dataset**
* Sincere questions: the average length is 12.5 words; the longest question has 134 words; 80% of the questions have at most 16 words, and nearly all of them (the 99.99th percentile) have at most 53 words.
* Insincere questions: the average length is 17.28 words; the longest question has 64 words; 80% of the questions have at most 25 words, and nearly all of them (the 99.99th percentile) have at most 54 words. Insincere questions tend to be longer than sincere ones.

## Word cloud

In [None]:
def cloud(docs, title):
    wordcloud = WordCloud(width=800, height=400, collocations=False, background_color="white").generate(" ".join(docs))
    fig = plt.figure(figsize=(10,7), facecolor='w')
    plt.imshow(wordcloud)
    plt.axis('off')
    plt.title(title, fontsize=25,color='k')
    plt.tight_layout(pad=0)
    plt.show()
cloud(sincere_data.question_text, "Sincere question")
cloud(insincere_data.question_text, "Insincere question")

The following characteristics can signify that a question is insincere:
* Has a non-neutral tone
    * Has an exaggerated tone to underscore a point about a group of people
    * Is rhetorical and meant to imply a statement about a group of people
* Is disparaging or inflammatory
    * Suggests a discriminatory idea against a protected class of people, or seeks confirmation of a stereotype
    * Makes disparaging attacks/insults against a specific person or group of people
    * Based on an outlandish premise about a group of people
    * Disparages against a characteristic that is not fixable and not measurable
* Isn't grounded in reality
    * Based on false information, or contains absurd assumptions
* Uses sexual content (incest, bestiality, pedophilia) for shock value, and not to seek genuine answers

We can see that common nouns are prominent in sincere questions, while proper nouns mostly appear in insincere questions (American, Muslim, Trump, ...), together with words about gender, race, and religion that often occur in discriminatory statements (woman, man, white, black, Muslim, ...).


## Word n-grams Count Plot 

To look more closely at the dataset, let's use n-grams to find the most frequent words and phrases. All stopwords must be removed before the frequent n-grams are counted. The text is lowercased, and consecutive tokens are then zipped into groups whose size is given by the `n_gram` parameter. Finally, a frequency dictionary counts how often each n-gram appears.

In [None]:
import plotly.graph_objs as go
from collections import defaultdict
from plotly import tools
import plotly.offline as py

stopwords = set(STOPWORDS)
    
# N-gram generation
def generate_ngrams(text, n_gram=1):
    token = [token for token in text.lower().split(" ") if token != "" if token not in STOPWORDS]
    ngrams = zip(*[token[i:] for i in range(n_gram)])
    return [" ".join(ngram) for ngram in ngrams]

# Horizontal bar chart
def horizontal_bar_chart(df, color):
    trace = go.Bar(
        y=df["word"].values[::-1],
        x=df["wordcount"].values[::-1],
        showlegend=False,
        orientation = 'h',
        marker=dict(
            color=color,
        ),
    )
    return trace
def ngram_chart(n_gram = 1):
    # Get the bar chart from sincere questions #
    freq_dict = defaultdict(int)
    for sent in sincere_data["question_text"]:
        for word in generate_ngrams(sent, n_gram):
            freq_dict[word] += 1
    fd_sorted = pd.DataFrame(sorted(freq_dict.items(), key=lambda x: x[1])[::-1])
    fd_sorted.columns = ["word", "wordcount"]
    trace0 = horizontal_bar_chart(fd_sorted.head(50), 'orange')

    # Get the bar chart from insincere questions #
    freq_dict = defaultdict(int)
    for sent in insincere_data["question_text"]:
        for word in generate_ngrams(sent, n_gram):
            freq_dict[word] += 1
    fd_sorted = pd.DataFrame(sorted(freq_dict.items(), key=lambda x: x[1])[::-1])
    fd_sorted.columns = ["word", "wordcount"]
    trace1 = horizontal_bar_chart(fd_sorted.head(50), 'green')

    # Creating two subplots
    fig = tools.make_subplots(rows=1, cols=2, vertical_spacing=0.04,
                              subplot_titles=["Frequent words of sincere questions", 
                                              "Frequent words of insincere questions"])
    fig.append_trace(trace0, 1, 1)
    fig.append_trace(trace1, 1, 2)
    fig['layout'].update(height=1200, width=900, paper_bgcolor='rgb(233,233,233)', title="Word Count Plots")
    py.iplot(fig, filename='word-plots')
    return
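
To see what the `zip` idiom in `generate_ngrams` produces, here is a small stand-alone sketch. `demo_ngrams` and the tiny stopword set `DEMO_STOPWORDS` are hypothetical stand-ins; the notebook itself uses `wordcloud.STOPWORDS`.

```python
from collections import Counter

# Tiny stand-in stopword list (hypothetical; the notebook uses wordcloud.STOPWORDS)
DEMO_STOPWORDS = {"is", "the", "to", "a", "what"}

def demo_ngrams(text, n=1, stopwords=DEMO_STOPWORDS):
    tokens = [t for t in text.lower().split(" ") if t and t not in stopwords]
    # zip over n shifted copies of the token list yields consecutive n-token windows
    return [" ".join(gram) for gram in zip(*[tokens[i:] for i in range(n)])]

sentence = "What is the best way to learn computer science"
bigrams = demo_ngrams(sentence, n=2)
counts = Counter(bigrams)  # frequency dictionary, as in ngram_chart
print(bigrams)
```

Because stopwords are dropped before the windows are formed, a bigram such as "way learn" can appear even though the two words were not adjacent in the original sentence. This is worth keeping in mind when reading the plots.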

### Unigram count plot
* Some of the top words are common to both classes, such as 'people', 'will', 'think', etc.
* Many of the other top words in sincere questions are used to describe or compare things: 'best', 'good', 'possible', 'need', etc.
* The other top words in insincere questions often involve matters that may be sensitive or controversial, or that make a statement: 'chinese', 'women', 'white', 'muslim', 'trump', etc.

In [None]:
ngram_chart(1)

### Bigram count plot

* Some of the top bigrams are common to both classes. In sincere questions especially, people look for useful answers about how to do something ('best way'), so they tend to ask informative questions. As a result, we see many phrases from different categories: 'computer science', 'world war', 'tv shows', etc.
* The top bigrams in insincere questions still involve matters that may be sensitive or controversial (and are closely related to people, religion, and politics), such as 'donald trump', 'white people', 'black people', etc. This observation is reasonable and matches the characteristics (given in the competition description) that can signify that a question is insincere.

In [None]:
ngram_chart(2)

### Trigram count plot

* Many trigrams are clearly combinations of popular unigrams and bigrams, and the characteristics of the two classes are already visible. Count plots for longer n-grams (n > 3) are therefore unnecessary.
* More detailed and informative phrases appear: 'black lives matter', 'kim jong un', etc. People's ages are mentioned frequently, especially in insincere questions. Perhaps people care a lot about the physical, psychological, and cognitive traits of others, which are influenced by age.

In [None]:
ngram_chart(3)

# Data Preparation

## Clean data
 
* Replace math equations and links with the placeholders "math equation" and "url"
* Expand contractions
* Correct misspelled words
* Remove punctuation

*I consider removing stopwords and lowercasing to be a bad idea for this classification task. In sentiment analysis, removing stop words can be problematic if it changes the context. For example, the stop word list includes 'not', a negation that can alter the valence of a passage. In addition, proper nouns can have a big effect on the classification, as the analysis above shows (insincere questions contain many controversial proper nouns such as "Trump", "Indian", "Kim Jong Un", etc.). Therefore, the data keeps its stopwords and uppercase letters, which are necessary.*

In [None]:
contractions= {"i'm": 'i am',"i'm'a": 'i am about to',"i'm'o": 'i am going to',"i've": 'i have',"i'll": 'i will',"i'll've": 'i will have',"i'd": 'i would',"i'd've": 'i would have',"Whatcha": 'What are you',"amn't": 'am not',"ain't": 'are not',"aren't": 'are not',"'cause": 'because',"can't": 'can not',"can't've": 'can not have',"could've": 'could have',"couldn't": 'could not',"couldn't've": 'could not have',"daren't": 'dare not',"daresn't": 'dare not',"dasn't": 'dare not',"didn't": 'did not','didn’t': 'did not',"don't": 'do not','don’t': 'do not',"doesn't": 'does not',"e'er": 'ever',"everyone's": 'everyone is',"finna": 'fixing to',"gimme": 'give me',"gon't": 'go not',"gonna": 'going to',"gotta": 'got to',"hadn't": 'had not',"hadn't've": 'had not have',"hasn't": 'has not',"haven't": 'have not',"he've": 'he have',"he's": 'he is',"he'll": 'he will',"he'll've": 'he will have',"he'd": 'he would',"he'd've": 'he would have',"here's": 'here is',"how're": 'how are',"how'd": 'how did',"how'd'y": 'how do you',"how's": 'how is',"how'll": 'how will',"isn't": 'is not',"it's": 'it is',"'tis": 'it is',"'twas": 'it was',"it'll": 'it will',"it'll've": 'it will have',"it'd": 'it would',"it'd've": 'it would have',"kinda": 'kind of',"let's": 'let us',"luv": 'love',"ma'am": 'madam',"may've": 'may have',"mayn't": 'may not',"might've": 'might have',"mightn't": 'might not',"mightn't've": 'might not have',"must've": 'must have',"mustn't": 'must not',"mustn't've": 'must not have',"needn't": 'need not',"needn't've": 'need not have',"ne'er": 'never',"o'": 'of',"o'clock": 'of the clock',"ol'": 'old',"oughtn't": 'ought not',"oughtn't've": 'ought not have',"o'er": 'over',"shan't": 'shall not',"sha'n't": 'shall not',"shalln't": 'shall not',"shan't've": 'shall not have',"she's": 'she is',"she'll": 'she will',"she'd": 'she would',"she'd've": 'she would have',"should've": 'should have',"shouldn't": 'should not',"shouldn't've": 'should not have',"so've": 'so have',"so's": 'so is',"somebody's": 
'somebody is',"someone's": 'someone is',"something's": 'something is',"sux": 'sucks',"that're": 'that are',"that's": 'that is',"that'll": 'that will',"that'd": 'that would',"that'd've": 'that would have',"em": 'them',"there're": 'there are',"there's": 'there is',"there'll": 'there will',"there'd": 'there would',"there'd've": 'there would have',"these're": 'these are',"they're": 'they are',"they've": 'they have',"they'll": 'they will',"they'll've": 'they will have',"they'd": 'they would',"they'd've": 'they would have',"this's": 'this is',"those're": 'those are',"to've": 'to have',"wanna": 'want to',"wasn't": 'was not',"we're": 'we are',"we've": 'we have',"we'll": 'we will',"we'll've": 'we will have',"we'd": 'we would',"we'd've": 'we would have',"weren't": 'were not',"what're": 'what are',"what'd": 'what did',"what've": 'what have',"what's": 'what is',"what'll": 'what will',"what'll've": 'what will have',"when've": 'when have',"when's": 'when is',"where're": 'where are',"where'd": 'where did',"where've": 'where have',"where's": 'where is',"which's": 'which is',"who're": 'who are',"who've": 'who have',"who's": 'who is',"who'll": 'who will',"who'll've": 'who will have',"who'd": 'who would',"who'd've": 'who would have',"why're": 'why are',"why'd": 'why did',"why've": 'why have',"why's": 'why is',"will've": 'will have',"won't": 'will not',"won't've": 'will not have',"would've": 'would have',"wouldn't": 'would not',"wouldn't've": 'would not have',"y'all": 'you all',"y'all're": 'you all are',"y'all've": 'you all have',"y'all'd": 'you all would',"y'all'd've": 'you all would have',"you're": 'you are',"you've": 'you have',"you'll've": 'you shall have',"you'll": 'you will',"you'd": 'you would',"you'd've": 'you would have','jan.': 'january','feb.': 'february','mar.': 'march','apr.': 'april','jun.': 'june','jul.': 'july','aug.': 'august','sep.': 'september','oct.': 'october','nov.': 'november','dec.': 'december','I’m': 'I am','I’m’a': 'I am about to','I’m’o': 'I am going 
to','I’ve': 'I have','I’ll': 'I will','I’ll’ve': 'I will have','I’d': 'I would','I’d’ve': 'I would have','amn’t': 'am not','ain’t': 'are not','aren’t': 'are not','’cause': 'because','can’t': 'can not','can’t’ve': 'can not have','could’ve': 'could have','couldn’t': 'could not','couldn’t’ve': 'could not have','daren’t': 'dare not','daresn’t': 'dare not','dasn’t': 'dare not','doesn’t': 'does not','e’er': 'ever','everyone’s': 'everyone is','gon’t': 'go not','hadn’t': 'had not','hadn’t’ve': 'had not have','hasn’t': 'has not','haven’t': 'have not','he’ve': 'he have','he’s': 'he is','he’ll': 'he will','he’ll’ve': 'he will have','he’d': 'he would','he’d’ve': 'he would have','here’s': 'here is','how’re': 'how are','how’d': 'how did','how’d’y': 'how do you','how’s': 'how is','how’ll': 'how will','isn’t': 'is not','it’s': 'it is','’tis': 'it is','’twas': 'it was','it’ll': 'it will','it’ll’ve': 'it will have','it’d': 'it would','it’d’ve': 'it would have','let’s': 'let us','ma’am': 'madam','may’ve': 'may have','mayn’t': 'may not','might’ve': 'might have','mightn’t': 'might not','mightn’t’ve': 'might not have','must’ve': 'must have','mustn’t': 'must not','mustn’t’ve': 'must not have','needn’t': 'need not','needn’t’ve': 'need not have','ne’er': 'never','o’': 'of','o’clock': 'of the clock','ol’': 'old','oughtn’t': 'ought not','oughtn’t’ve': 'ought not have','o’er': 'over','shan’t': 'shall not','sha’n’t': 'shall not','shalln’t': 'shall not','shan’t’ve': 'shall not have','she’s': 'she is','she’ll': 'she will','she’d': 'she would','she’d’ve': 'she would have','should’ve': 'should have','shouldn’t': 'should not','shouldn’t’ve': 'should not have','so’ve': 'so have','so’s': 'so is','somebody’s': 'somebody is','someone’s': 'someone is','something’s': 'something is','that’re': 'that are','that’s': 'that is','that’ll': 'that will','that’d': 'that would','that’d’ve': 'that would have','there’re': 'there are','there’s': 'there is','there’ll': 'there will','there’d': 'there 
would','there’d’ve': 'there would have','these’re': 'these are','they’re': 'they are','they’ve': 'they have','they’ll': 'they will','they’ll’ve': 'they will have','they’d': 'they would','they’d’ve': 'they would have','this’s': 'this is','those’re': 'those are','to’ve': 'to have','wasn’t': 'was not','we’re': 'we are','we’ve': 'we have','we’ll': 'we will','we’ll’ve': 'we will have','we’d': 'we would','we’d’ve': 'we would have','weren’t': 'were not','what’re': 'what are','what’d': 'what did','what’ve': 'what have','what’s': 'what is','what’ll': 'what will','what’ll’ve': 'what will have','when’ve': 'when have','when’s': 'when is','where’re': 'where are','where’d': 'where did','where’ve': 'where have','where’s': 'where is','which’s': 'which is','who’re': 'who are','who’ve': 'who have','who’s': 'who is','who’ll': 'who will','who’ll’ve': 'who will have','who’d': 'who would','who’d’ve': 'who would have','why’re': 'why are','why’d': 'why did','why’ve': 'why have','why’s': 'why is','will’ve': 'will have','won’t': 'will not','won’t’ve': 'will not have','would’ve': 'would have','wouldn’t': 'would not','wouldn’t’ve': 'would not have','y’all': 'you all','y’all’re': 'you all are','y’all’ve': 'you all have','y’all’d': 'you all would','y’all’d’ve': 'you all would have','you’re': 'you are','you’ve': 'you have','you’ll’ve': 'you shall have','you’ll': 'you will','you’d': 'you would','you’d’ve': 'you would have'}
missing_spell = {'colour': 'color', 'centre': 'center', 'favourite': 'favorite', 'travelling': 'traveling', 'counselling': 'counseling', 'theatre': 'theater', 'cancelled': 'canceled', 'labour': 'labor', 'organisation': 'organization', 'wwii': 'world war 2', 'citicise': 'criticize', 'youtu ': 'youtube ', 'Qoura': 'Quora', 'sallary': 'salary', 'Whta': 'What', 'narcisist': 'narcissist', 'howdo': 'how do', 'whatare': 'what are', 'howcan': 'how can', 'howmuch': 'how much', 'howmany': 'how many', 'whydo': 'why do', 'doI': 'do I', 'theBest': 'the best', 'howdoes': 'how does', 'mastrubation': 'masturbation', 'mastrubate': 'masturbate', "mastrubating": 'masturbating', 'pennis': 'penis', 'Etherium': 'bitcoin', 'narcissit': 'narcissist', 'bigdata': 'big data', '2k17': '2017', '2k18': '2018', 'qouta': 'quota', 'exboyfriend': 'ex boyfriend', 'airhostess': 'air hostess', "whst": 'what', 'watsapp': 'whatsapp', 'demonitisation': 'demonetization', 'demonitization': 'demonetization', 'demonetisation': 'demonetization','electroneum':'bitcoin','nanodegree':'degree','hotstar':'star','dream11':'dream','ftre':'fire','tensorflow':'framework','unocoin':'bitcoin','lnmiit':'limit','unacademy':'academy','altcoin':'bitcoin','altcoins':'bitcoin','litecoin':'bitcoin','coinbase':'bitcoin','cryptocurency':'cryptocurrency','simpliv':'simple','quoras':'quora','schizoids':'psychopath','remainers':'remainder','twinflame':'soulmate','quorans':'quora','brexit':'demonetized','iiest':'institute','dceu':'comics','pessat':'exam','uceed':'college','bhakts':'devotee','boruto':'anime','cryptocoin':'bitcoin','blockchains':'blockchain','fiancee':'fiance','redmi':'smartphone','oneplus':'smartphone','qoura':'quora','deepmind':'framework','ryzen':'cpu','whattsapp':'whatsapp','undertale':'adventure','zenfone':'smartphone','cryptocurencies':'cryptocurrencies','koinex':'bitcoin','zebpay':'bitcoin','binance':'bitcoin','whtsapp':'whatsapp','reactjs':'framework','bittrex':'bitcoin','bitconnect':'bitcoin','bitfinex':'bitco
in','yourquote':'your quote','whyis':'why is','jiophone':'smartphone','dogecoin':'bitcoin','onecoin':'bitcoin','poloniex':'bitcoin','7700k':'cpu','angular2':'framework','segwit2x':'bitcoin','hashflare':'bitcoin','940mx':'gpu','openai':'framework','hashflare':'bitcoin','1050ti':'gpu','nearbuy':'near buy','freebitco':'bitcoin','antminer':'bitcoin','filecoin':'bitcoin','whatapp':'whatsapp','empowr':'empower','1080ti':'gpu','crytocurrency':'cryptocurrency','8700k':'cpu','whatsaap':'whatsapp','g4560':'cpu','payymoney':'pay money','fuckboys':'fuck boys','intenship':'internship','zcash':'bitcoin','demonatisation':'demonetization','narcicist':'narcissist','mastuburation':'masturbation','trignometric':'trigonometric','cryptocurreny':'cryptocurrency','howdid':'how did','crytocurrencies':'cryptocurrencies','phycopath':'psychopath','bytecoin':'bitcoin','possesiveness':'possessiveness','scollege':'college','humanties':'humanities','altacoin':'bitcoin','demonitised':'demonetized','brasília':'brazilia','accolite':'accolyte','econimics':'economics','varrier':'warrier','quroa':'quora','statergy':'strategy','langague':'language','splatoon':'game','7600k':'cpu','gate2018':'gate 2018','in2018':'in 2018','narcassist':'narcissist','jiocoin':'bitcoin','hnlu':'hulu','7300hq':'cpu','weatern':'western','interledger':'blockchain','deplation':'deflation', 'cryptocurrencies':'cryptocurrency', 'bitcoin':'blockchain cryptocurrency'}
# Replace math equations and links with the placeholders "math equation" and "url"
def clean_tag(x):
    if '[math]' in x:
        # raw string avoids invalid-escape warnings; [math]...[/math] -> "math equation"
        x = re.sub(r'\[math\].*?math\]', 'math equation', x)
    if 'http' in x or 'www' in x:
        # any link-like token (optionally prefixed by a scheme) -> "url"
        x = re.sub(r'(?:(?:https?|ftp):\/\/)?[\w/\-?=%.]+\.[\w/\-?=%.]+', 'url', x)
    return x
# Expand contractions
def contraction_fix(word):
    # fall back to the word itself when it is not a known contraction
    return contractions.get(word, word)
# Correct misspelled words
def misspell_fix(word):
    # fall back to the word itself when no correction is listed
    return missing_spell.get(word, word)


def clean_text(text):
    text = clean_tag(text)
    text = " ".join([contraction_fix(w) for w in text.split()]) 
    text = " ".join([misspell_fix(w) for w in text.split()]) 
    #Remove punctuation
    text = re.sub(r'[^a-zA-Z0-9]', ' ', text) 
    return text

def apply_clean_text(question_text):
    tmp = pd.DataFrame()
    tmp['question_text'] = question_text
    tmp['clean'] = tmp.question_text.progress_map(clean_text)
    with pd_ctx:
        display(tmp)
    return tmp['clean']


trainX_ques = apply_clean_text(df.question_text)
testX_ques = apply_clean_text(test_df.question_text)
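
As a quick check of what the two regular expressions in `clean_tag` do, the sketch below applies the same patterns to three made-up questions. The function name `demo_clean_tag` and the sample strings are assumptions for illustration only.

```python
import re

def demo_clean_tag(x):
    """Same two substitutions as clean_tag above, applied to a single string."""
    if '[math]' in x:
        x = re.sub(r'\[math\].*?math\]', 'math equation', x)
    if 'http' in x or 'www' in x:
        x = re.sub(r'(?:(?:https?|ftp):\/\/)?[\w/\-?=%.]+\.[\w/\-?=%.]+', 'url', x)
    return x

math_example = demo_clean_tag("What is [math]x^2[/math] equal to?")
url_example = demo_clean_tag("See https://www.quora.com/about for details")
side_effect = demo_clean_tag("Is 3.5 a good GPA on www.quora.com")
print(math_example)
print(url_example)
print(side_effect)
```

The third example shows a side effect: once a question contains a link, every dotted token in it matches the URL pattern, so a decimal number such as `3.5` is replaced as well.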

## Embedding text to vectors

Tokenization splits a piece of text into smaller units called tokens. Here, each token is a word of a question. I use `torchtext` to tokenize the text and go through every word in the corpus to build a dictionary (vocabulary). The words are sorted by their frequency of occurrence, so the more frequent a word is, the lower its index. This dictionary is then used to turn each question from text into a sequence of numbers.

1. Create a dictionary of the dataset.
    `specials` is the list of special tokens (e.g., padding or eos) that are prepended to the vocabulary. Here it contains only the \<unk> token, which is also set as the default index for out-of-vocabulary words.

In [None]:
import torchtext

from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator

tokenizer = get_tokenizer("basic_english")
def yield_tokens(data_iter):
    for text in data_iter:
        yield tokenizer(text)

vocab = build_vocab_from_iterator(yield_tokens(testX_ques), specials=["<unk>"])
vocab.set_default_index(vocab["<unk>"])

# len(vocab)
# vocab(['here', 'is', 'an', 'example'])
# vocab.lookup_token(1)

2. Turn each question into a sequence of numbers.

In [None]:
# text_pipeline = lambda x: vocab(tokenizer(x))
# text_pipeline('here is the an example')

def tokenize_ques(data_iter):
    text_list = []
    text_pipeline = lambda x: vocab(tokenizer(x))
    for text in data_iter:
        processed_text = text_pipeline(text)
        text_list.append(processed_text)
    return text_list

word_sequences = tokenize_ques(testX_ques)

print("Length of 20 first word_sequences:")
print(list(map(lambda x: len(x) ,word_sequences[:20])))

print("\n20 first word_sequences:")
for sequence in word_sequences[:20]:
    print(sequence)
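
The two steps above can be sketched without `torchtext`, using only the standard library. `build_demo_vocab` and `encode` are hypothetical helpers, whitespace splitting stands in for the `basic_english` tokenizer, and the alphabetical tie-breaking is an assumption made to keep the result deterministic.

```python
from collections import Counter

def build_demo_vocab(sentences, specials=("<unk>",)):
    """Map tokens to integer ids: specials first, then tokens by descending frequency."""
    counts = Counter(tok for s in sentences for tok in s.lower().split())
    # sort by frequency (descending), then alphabetically so ties are deterministic
    ordered = sorted(counts, key=lambda tok: (-counts[tok], tok))
    return {tok: idx for idx, tok in enumerate(list(specials) + ordered)}

def encode(sentence, vocab):
    """Turn a sentence into a list of ids, falling back to the <unk> id."""
    unk = vocab["<unk>"]
    return [vocab.get(tok, unk) for tok in sentence.lower().split()]

corpus = ["how do i learn python", "how do i learn to cook", "why is the sky blue"]
demo_vocab = build_demo_vocab(corpus)
encoded = encode("how do i learn rust", demo_vocab)
print(encoded)
```

Frequent words ('do', 'how', 'i', 'learn') get the lowest ids after the special token, and the unseen word 'rust' falls back to the id of `<unk>`.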

**Padding and Truncating**
The tokenization step shows that the questions do not all have the same length, which makes it difficult to train a model on batches. Hence, let's regularize the sequences with padding and truncating:
Each sequence gets a fixed length of 60. As found in the Data analysis section, nearly all questions (99.99%) have at most 54 words.  
* Padding: if a sequence is shorter than the fixed length, zeros are added after it.
* Truncating: if a sequence is longer than the fixed length, it is shortened by removing the excess tokens.
* 'post': padding and truncating are applied at the end of the sequence

In [None]:
MAX_SENTENCE_LENGTH = 60 
PADDING_TYPE = 'post' 
TRUNCATE_TYPE = 'post'
def create_sequence(word_sequences):
    padded_word_sequences = pad_sequences(word_sequences, maxlen=MAX_SENTENCE_LENGTH, padding=PADDING_TYPE, truncating=TRUNCATE_TYPE)
    return padded_word_sequences
padded_sequences = create_sequence(word_sequences)

print("Array size:",padded_sequences.shape)

print("Length of 20 first word_sequences:")
print(list(map(lambda x: len(x) ,padded_sequences[:20])))

print("\n10 first word_sequences:")
for sequence in padded_sequences[:10]:
    print(sequence)
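
For reference, 'post' padding and 'post' truncating of a single sequence amount to the following plain-Python sketch. `pad_or_truncate` is a hypothetical helper written for illustration, not the Keras implementation.

```python
def pad_or_truncate(seq, max_len, pad_value=0):
    """Cut the sequence at max_len, then fill the end with pad_value up to max_len."""
    clipped = list(seq[:max_len])
    return clipped + [pad_value] * (max_len - len(clipped))

short = pad_or_truncate([5, 8, 2], max_len=6)
long = pad_or_truncate([1, 2, 3, 4, 5, 6, 7, 8], max_len=6)
print(short)
print(long)
```

Short sequences gain trailing zeros and long ones lose their tail, so every row ends up with exactly `max_len` entries.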

## Split dataset into train/valid/test sets
In this section, I define `QuoraDataset`, which uses the data preparation functions above to process the dataset. I also split the dataset into a train set and a validation set. The validation set gives an estimate of model skill while tuning the model's hyperparameters.

In [None]:
class QuoraDataset(Dataset):
    def __init__(self, dataset):
        # all questions in the data
        self.text = dataset.question_text
        # target is 0/1 for training and validation data, and -1 for test data (which has no labels)
        self.target = dataset.target if "target" in dataset.columns else [-1]*len(dataset)
        
    def __len__(self):
        return len(self.text)

    def __getitem__(self, i):
        target = [self.target[i]]
        question = str(self.text[i])
        question_id = create_sequence([vocab(tokenizer(question))])
        return FloatTensor(target), question, question_id

In [None]:
from torch.utils.data.dataset import random_split

BATCH_SIZE = 1024
# df = df.sample(n=500, random_state=123).reset_index(drop=True)
split = np.int32(0.8*len(df))
valid_data, training_data = df[split:], df[:split]
valid_data = valid_data.reset_index(drop=True)
val_dataset = QuoraDataset(valid_data)
val_loader = DataLoader(dataset=val_dataset, batch_size=BATCH_SIZE,
                        num_workers=0, shuffle=True)

training_data = training_data.reset_index(drop=True)
train_dataset = QuoraDataset(training_data)
train_loader = DataLoader(dataset=train_dataset, batch_size=BATCH_SIZE,
                           num_workers=0, shuffle=True)
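
The cell above cuts the data frame at the 80% mark in its stored order. As a variant, the sketch below shuffles before cutting, which the cell above does not do. `train_valid_split`, the seed, and the synthetic rows are assumptions for illustration.

```python
import random

def train_valid_split(rows, train_fraction=0.8, seed=123):
    """Shuffle a copy of rows and cut it into (train, valid) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded, so the split is reproducible
    cut = int(train_fraction * len(shuffled))
    return shuffled[:cut], shuffled[cut:]

# Synthetic (question, target) pairs, roughly 6% positives (illustrative only)
rows = [(f"question {i}", int(i % 16 == 0)) for i in range(1000)]
train_rows, valid_rows = train_valid_split(rows)
print(len(train_rows), len(valid_rows))
```

Shuffling only matters if the file is ordered in some meaningful way; whether that is the case for this dataset is not checked here.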

# Model architecture
This section describes the model architectures that were tried for classifying insincere questions and improving the evaluation metric, the F1 score.

In [None]:
EMBEDDING_DIM = 300
VOCAB_SIZE = len(vocab)

### LSTM without pretrained embedding
The first architecture is an LSTM (long short-term memory) network. An LSTM is a special kind of recurrent neural network (RNN) that performs well on many NLP tasks. LSTMs are capable of learning long-term dependencies: they address the short memory of conventional RNNs, which have trouble relating events that are far apart in time, by using gates to control what is remembered. Like an RNN, an LSTM is a chain of repeating neural network modules, but each module has a different internal structure:
<div>
<img src="https://2.bp.blogspot.com/-xWdsykP1hUg/WNn2HNEC25I/AAAAAAAADHE/vkWmQl68AT4e70AgwCFPBL4GdKObqUylACLcB/s1600/fig04_2d_LSTM.png" width="600"/>
</div>

*In the above diagram, each line carries an entire vector, from the output of one node to the inputs of others. The pink circles represent pointwise operations, like vector addition, while the yellow boxes are learned neural network layers. Lines merging denote concatenation, while a line forking denotes its content being copied and the copies going to different locations.*
> An RNN (recurrent neural network) is a popular sequence model that performs well on sequential data. RNNs can remember important things about the input they received by processing inputs sequentially, where the context from the previous input is considered when computing the output of the current step. This allows the network to carry information across time steps rather than treating all inputs as independent of each other. This is why they are a preferred algorithm for sequential data like time series, speech, text, financial data, audio, video, weather and much more. Recurrent neural networks can form a much deeper understanding of a sequence and its context than other algorithms. However, a significant shortcoming of the typical RNN is the vanishing gradient problem. Because of it, RNNs are unable to work with longer sequences and hold on to long-term dependencies, making them suffer from "short-term memory".

> <div>
<img src="https://1.bp.blogspot.com/-6hyAXQfTrXY/WNn2G3CUtbI/AAAAAAAADHA/EaaANM6G1fg460fQccTNmwa8gp9k_IS7wCLcB/s1600/fig04_2c_LSTM.png" width="600"/>
</div>
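
The gates mentioned above can be written out explicitly. The equations below are the standard LSTM formulation, given here as a reference sketch; the symbols ($x_t$ for the input at step $t$, $h_t$ for the hidden state, $c_t$ for the cell state, $\sigma$ for the sigmoid, $\odot$ for the element-wise product) are notation introduced for this note and do not appear elsewhere in the notebook.

$$
\begin{aligned}
f_t &= \sigma\left(W_f [h_{t-1}, x_t] + b_f\right) && \text{(forget gate)} \\
i_t &= \sigma\left(W_i [h_{t-1}, x_t] + b_i\right) && \text{(input gate)} \\
\tilde{c}_t &= \tanh\left(W_c [h_{t-1}, x_t] + b_c\right) && \text{(candidate cell state)} \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t && \text{(new cell state)} \\
o_t &= \sigma\left(W_o [h_{t-1}, x_t] + b_o\right) && \text{(output gate)} \\
h_t &= o_t \odot \tanh(c_t) && \text{(new hidden state)}
\end{aligned}
$$

The additive update of $c_t$ is what lets information and gradients survive over many time steps, in contrast to the repeated multiplication in a plain RNN.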


Now I define the model architecture that will ingest the questions dataset.
I'll keep it simple:
* An embedding layer: essentially a look-up table that converts each token in the dictionary into a dense vector, which is learned and adjusted throughout training
* An LSTM layer, which outputs three things: the consolidated output (all hidden states in the sequence), the hidden state of the last LSTM unit (the final output), and the cell state
* A linear layer and a dropout layer with rate 0.2

Expected dimensions: the input (batch_size x max_len) first passes through the embedding layer, which returns a (batch_size x max_len x embedding_size) tensor. The LSTM then produces a final hidden state of dimension (batch_size x hidden_dim). Finally, this hidden state is passed through the linear layer to return the prediction.
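
The same shape bookkeeping can be written compactly. Here $B$ is the batch size, $T$ the padded sequence length, $D$ the embedding size, and $H$ the hidden size; these symbols are shorthand introduced for this note.

$$
\underbrace{X}_{B \times T}
\;\xrightarrow{\text{Embedding}}\;
\underbrace{E}_{B \times T \times D}
\;\xrightarrow{\text{LSTM}}\;
\underbrace{h_T}_{B \times H}
\;\xrightarrow{\text{Linear}}\;
\underbrace{\hat{y}}_{B \times 1}
$$

With the constants used in this notebook, that is $B = 1024$, $T = 60$, $D = 300$, and $H = 786$. Dropout does not change any of these shapes.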

In [None]:
class LSTM(torch.nn.Module) :
    def __init__(self, vocab_size, embedding_dim, hidden_dim) :
        super().__init__()
        self.embeddings = nn.Embedding(vocab_size, embedding_dim, padding_idx=0)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, batch_first=True)
        self.linear = nn.Linear(hidden_dim, 1)
        self.dropout = nn.Dropout(0.2)
        
    def forward(self, x):
        x = torch.tensor(x).to(device)
        x = self.embeddings(x)
        x = self.dropout(x)
        lstm_out, (ht, ct) = self.lstm(x)
        return self.linear(ht[-1])

pure_lstm_model = LSTM(VOCAB_SIZE, EMBEDDING_DIM, 786)

In [None]:
pure_lstm_model

### BiLSTM with pre-trained Word embedding 
I intend to improve the simple LSTM model by using pretrained GloVe word vectors and a bidirectional LSTM. Instead of training word embeddings from scratch, we can use GloVe vectors that were trained on a massive corpus and have probably captured context better. In addition, a bidirectional LSTM gives the network sequence information in both directions: backward (future to past) and forward (past to future).

1. Loading the pretrained GloVe word embeddings

In [None]:
GLOVE_FILE = 'glove.840B.300d/glove.840B.300d.txt'
# PARAGRAM_FILE =  'paragram_300_sl999/paragram_300_sl999.txt'
# WIKI_FILE = 'wiki-news-300d-1M/wiki-news-300d-1M.vec'
!unzip -n /kaggle/input/quora-insincere-questions-classification/embeddings.zip {GLOVE_FILE} -d .
# !unzip -n /kaggle/input/quora-insincere-questions-classification/embeddings.zip {PARAGRAM_FILE} -d .
# !unzip -n /kaggle/input/quora-insincere-questions-classification/embeddings.zip {WIKI_FILE} -d .

def load_glove_vectors(glove_file=GLOVE_FILE):
    """Load the glove word vectors"""
    word_vectors = {}
    with open(glove_file, encoding="utf-8") as f:  # explicit encoding: the file contains non-ASCII tokens
        for line in f:
            split = line.split(" ")
            word_vectors[split[0]] = np.array([float(x) for x in split[1:]])
    return word_vectors

# word_vecs = load_glove_vectors(GLOVE_FILE)

In [None]:
def get_emb_matrix(pretrained, vocab, emb_size=300):
    """Create an embedding matrix from pre-trained word vectors."""
    vocab_size = len(vocab) + 2
    vocab_to_idx = {}
    dic = ["", "UNK"]
    W = np.zeros((vocab_size, emb_size), dtype="float32")
    # row 0 stays all zeros: it is the padding vector
    W[1] = np.random.uniform(-0.25, 0.25, emb_size)  # vector for unknown words
    for i in range(vocab_size - 2):
        word = vocab.lookup_token(i)
        if word in pretrained:
            W[i+2] = pretrained[word]
        else:
            # word missing from the pre-trained vectors: random initialisation
            W[i+2] = np.random.uniform(-0.25, 0.25, emb_size)
        vocab_to_idx[word] = i+2
        dic.append(word)
    return W, np.array(dic), vocab_to_idx

In [None]:
# pretrained_weights, dictionary, vocab2index = get_emb_matrix(word_vecs, vocab)

2. BiLSTM with pre-trained GloVe word vectors:
   *The BiLSTM output contains hidden states from both the backward and the forward direction. Therefore, we concatenate the two tensors before passing them to the next layer.*
<div>
<img src="https://www.researchgate.net/publication/343981315/figure/fig3/AS:938328841015298@1600726437710/Structure-of-bidirectional-long-short-term-memory-LSTM.png" width="600"/>
</div>
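
The sketch below illustrates which hidden states to concatenate when the LSTM is both stacked and bidirectional. It assumes PyTorch's documented layout, where `ht` has shape (num_layers * 2, batch_size, hidden_dim) and the last two entries belong to the top layer. The sizes are arbitrary illustrative values.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Illustrative sizes only.
batch_size, max_len, embedding_size, hidden_dim = 3, 5, 8, 4

bilstm = nn.LSTM(embedding_size, hidden_dim, num_layers=2,
                 batch_first=True, bidirectional=True)
x = torch.randn(batch_size, max_len, embedding_size)
lstm_out, (ht, ct) = bilstm(x)

# ht stacks (layer 0 fwd, layer 0 bwd, layer 1 fwd, layer 1 bwd).
forward_last = ht[-2]    # top layer, forward direction
backward_last = ht[-1]   # top layer, backward direction
sentence_vec = torch.cat([forward_last, backward_last], dim=1)

print(tuple(ht.shape))            # (4, 3, 4)
print(tuple(sentence_vec.shape))  # (3, 8)
```

The forward state matches the last time step of `lstm_out` (first `hidden_dim` features) and the backward state matches the first time step (last `hidden_dim` features), which is a convenient way to check that the right pair was selected.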


In [None]:
class LSTM_glove_vecs(torch.nn.Module) :
    def __init__(self, vocab_size, embedding_dim, hidden_dim, glove_weights) :
        super().__init__()
        self.embeddings = nn.Embedding(vocab_size, embedding_dim, padding_idx=0)
        self.embeddings.weight.data.copy_(torch.from_numpy(glove_weights))
        self.embeddings.weight.requires_grad = False ## freeze embeddings
        self.dropout = nn.Dropout(0.2)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, num_layers = 2, batch_first=True, bidirectional = True)
        self.linear = nn.Linear(hidden_dim*2, 1)   
    def forward(self, x):
        x = torch.as_tensor(x).to(device)
        x = self.embeddings(x)  # (batch_size, max_len, embedding_size), e.g. (1024, 60, 300)
        lstm_out, (ht, ct) = self.lstm(x)  # ht: (num_layers * 2, batch_size, hidden_dim), e.g. (4, 1024, 128)
        # ht[-2] and ht[-1] are the last layer's forward and backward final hidden states
        x = torch.cat([ht[-2], ht[-1]], dim=1)  # (batch_size, 2 * hidden_dim), e.g. (1024, 256)
        return self.linear(x)

# lstm_glove_model = LSTM_glove_vecs(VOCAB_SIZE+2, 300, 128, pretrained_weights)

### RoBERTa - Pretrained Model
I also try a state-of-the-art NLP approach: fine-tuning the pre-trained RoBERTa model on this dataset.

RoBERTa stands for Robustly Optimized BERT Pre-training Approach. It revisits how BERT is pre-trained in order to improve its results. RoBERTa has almost the same architecture as BERT, but the Next Sentence Prediction (NSP) objective is removed, the model is trained with bigger batch sizes and longer sequences, and the masking pattern is changed dynamically.

> BERT (Bidirectional Encoder Representations from Transformers) is a transformer-based machine learning technique for natural language processing (NLP) pre-training developed by Google. BERT was pretrained on two tasks: language modelling (15% of tokens were masked and BERT was trained to predict them from context) and next sentence prediction (BERT was trained to predict if a chosen next sentence was probable or not given the first sentence). As a result of the training process, BERT learns contextual embeddings for words. After pretraining, which is computationally expensive, BERT can be finetuned with less resources on smaller datasets to optimize its performance on specific tasks.
<div>
<img src="https://s3.us-west-2.amazonaws.com/secure.notion-static.com/2015b7e3-bb9f-4e07-a888-77b56a405a37/Untitled.png?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Content-Sha256=UNSIGNED-PAYLOAD&X-Amz-Credential=AKIAT73L2G45EIPT3X45%2F20220106%2Fus-west-2%2Fs3%2Faws4_request&X-Amz-Date=20220106T125505Z&X-Amz-Expires=86400&X-Amz-Signature=dd71699317b9310bc0d8cdc3ae80ecb35dc990f33c3ae8ac27a18597f2b83e9b&X-Amz-SignedHeaders=host&response-content-disposition=filename%20%3D%22Untitled.png%22&x-id=GetObject" width="600"/>
</div>


The pre-trained model is imported from `transformers`. I then define a new dataset class for RoBERTa, `RoBertaDataset`, which adds an `attention_mask` so the model knows which tokens to attend to: the sequence contains both real tokens (`attention_mask = 1`) and padding tokens (`attention_mask = 0`).
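
The relation between padding and the attention mask can be sketched without loading any model. The helper `pad_with_mask` and the token ids 713 and 16 are hypothetical; the ids 0, 1 and 2 follow RoBERTa's vocabulary, where they stand for `<s>`, `<pad>` and `</s>`. This sketch marks every real token, special tokens included, which is the usual Hugging Face convention and may differ in detail from the dataset class in this notebook.

```python
import numpy as np

def pad_with_mask(token_ids, max_len, pad_id=1):
    """Truncate or pad to max_len; mask is 1 for real tokens, 0 for padding."""
    ids = list(token_ids)[:max_len]
    n_real = len(ids)
    padded = np.full(max_len, pad_id, dtype=np.int64)
    padded[:n_real] = ids
    mask = np.zeros(max_len, dtype=np.int64)
    mask[:n_real] = 1
    return padded, mask

# Hypothetical encoded question: <s> token token </s>
ids, mask = pad_with_mask([0, 713, 16, 2], max_len=8)
print(ids)   # [  0 713  16   2   1   1   1   1]
print(mask)  # [1 1 1 1 0 0 0 0]
```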

In [None]:
import transformers
from transformers import AutoModel, AutoTokenizer
transformers.__version__
model_name = 'roberta-base'
# instantiate model & tokenizer
# model     = AutoModel.from_pretrained(model_name)
# tokenizer = AutoTokenizer.from_pretrained(model_name)

class RoBertaDataset(Dataset):
    def __init__(self, data, tokenizer):
        #contain all question in data
        self.text = data.question_text
        #initialize tokenizer for dataset
        self.data, self.tokenizer = data, tokenizer
        # target is 0/1 for labelled data and -1 when no labels are available (test data)
        self.target = data.target if "target" in data.columns else [-1]*len(data)

    def __len__(self):
        return len(self.data)

    def __getitem__(self, i):
        # `pad` is keras' pad_sequences, imported as `pad` before training
        pg, tg = 'post', 'post'
        target = [self.target[i]]
        question = str(self.text[i])
        quest_ids = self.tokenizer.encode(question.strip())
        # index of the last encoded token, used to build the attention mask
        attention_mask_idx = len(quest_ids) - 1
        # make sure the sequence starts with the <s> token (id 0)
        if 0 not in quest_ids: quest_ids = [0] + quest_ids
        # pad with the <pad> token (id 1) up to MAX_SENTENCE_LENGTH
        quest_ids = pad([quest_ids], maxlen=MAX_SENTENCE_LENGTH, value=1, padding=pg, truncating=tg)
        # mark which tokens to attend to (positions 1 .. attention_mask_idx - 1)
        attention_mask = np.zeros(MAX_SENTENCE_LENGTH)
        attention_mask[1:attention_mask_idx] = 1
        attention_mask = attention_mask.reshape((1, -1))
        # after truncation, force the last position to be the </s> token (id 2)
        if 2 not in quest_ids: quest_ids[0, -1], attention_mask[0, -1] = 2, 0
        return FloatTensor(target), LongTensor(quest_ids), LongTensor(attention_mask)

In [None]:
# split = np.int32(0.8*len(dfta))
# robert_valid_data, robert_training_data = dfta[split:], dfta[:split]
# robert_valid_data = robert_valid_data.reset_index(drop=True)
# robert_val_dataset = RoBertaDataset(robert_valid_data, tokenizer)
# robert_val_loader = DataLoader(dataset=robert_val_dataset, batch_size=1024,
#                         num_workers=0, shuffle=True)

# robert_training_data = robert_training_data.reset_index(drop=True)
# robert_train_dataset = RoBertaDataset(robert_training_data, tokenizer)
# robert_train_loader = DataLoader(dataset=robert_train_dataset, batch_size=128,
#                            num_workers=0, shuffle=True)

The RoBERTa model consists of the RoBERTa layers, whose pre-trained weights are fine-tuned during training, plus a custom (Dropout + Linear) head on top that turns it into a binary text classifier. These custom layers are trained from scratch.

In [None]:
class Roberta(nn.Module):
    def __init__(self):
        super(Roberta, self).__init__()
        self.dropout = nn.Dropout(0.2)
        self.linear = nn.Linear(768, 1)
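        self.roberta = AutoModel.from_pretrained(model_name)

    def forward(self, input_token, att_mask):
        input_token = input_token.view(-1, MAX_SENTENCE_LENGTH)
        att_mask = att_mask.view(-1, MAX_SENTENCE_LENGTH)
        # with return_dict=False the model returns (last_hidden_state, pooler_output)
        _, self.feat = self.roberta(input_token, att_mask, return_dict=False)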
        self.feat = self.dropout(self.feat)
        return self.linear(self.feat)

# roberta_model = Roberta()

# Model Training 

### Using F1 score
Data analysis shows that the training dataset is imbalanced. Accuracy can therefore give a misleading evaluation of the model, and the F1 score is a more suitable measure for models tested on this imbalanced classification dataset.
Formula of F1:

$\textrm{F1 Score} = 2 \cdot \frac{\textrm{Precision} \cdot \textrm{Recall}}{\textrm{Precision} + \textrm{Recall}}$

Where $\textrm{Recall} = \frac{\textrm{True Positives}}{\textrm{True Positives} + \textrm{False Negatives}}$

and $\textrm{Precision} = \frac{\textrm{True Positives}}{\textrm{True Positives} + \textrm{False Positives}}$

These two quantities are illustrated in the following picture:

<div>
<img src="https://upload.wikimedia.org/wikipedia/commons/2/26/Precisionrecall.svg" width="300"/>
</div>
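
A small worked example shows why accuracy is misleading here. The toy sample below (6 insincere questions out of 100) and the two hypothetical classifiers are made up for illustration only.

```python
def precision_recall_f1(y_true, y_pred):
    """Precision, recall and F1 for binary labels (1 = insincere)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Toy imbalanced sample: 6 insincere questions, 94 sincere ones.
y_true = [1] * 6 + [0] * 94

# Classifier A always answers "sincere".
pred_a = [0] * 100
acc_a = accuracy(y_true, pred_a)
_, _, f1_a = precision_recall_f1(y_true, pred_a)

# Classifier B finds 4 of the 6 insincere questions, with 2 false alarms.
pred_b = [1] * 4 + [0] * 2 + [1] * 2 + [0] * 92
acc_b = accuracy(y_true, pred_b)
precision_b, recall_b, f1_b = precision_recall_f1(y_true, pred_b)

print(acc_a, f1_a)
print(acc_b, round(f1_b, 3))
```

Both classifiers look similar in accuracy (0.94 against 0.96), while F1 separates them clearly (0.0 against about 0.667), because F1 ignores the large number of true negatives.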

In [None]:
def f1_score(y_pred, y_true):
    y_true = y_true.squeeze()
    y_pred = torch.round(nn.Sigmoid()(y_pred)).squeeze()
    tp = (y_true * y_pred).sum().to(torch.float32)
    fp = ((1 - y_true) * y_pred).sum().to(torch.float32)
    fn = (y_true * (1 - y_pred)).sum().to(torch.float32)
    tn = ((1 - y_true) * (1 - y_pred)).sum().to(torch.float32)
    epsilon = 1e-7
    recall = tp / (tp + fn + epsilon)
    precision = tp / (tp + fp + epsilon)
    return 2*(precision*recall) / (precision + recall + epsilon)

### GPU 

In [None]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(device)

### Model Training 
I selected a learning rate of 0.001 and a batch size of 1024 for the LSTM models and 128 for RoBERTa (the model is quite big, and training with a large batch makes the GPU run out of memory). The validation set is evaluated with a bigger batch size since it does not need gradient computation. BCE is the loss function, as is common for binary classification tasks.
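
For reference, binary cross-entropy on raw logits can be written out in a few lines of NumPy. This is only a sketch of the quantity that `nn.BCEWithLogitsLoss` computes with its default mean reduction, using a standard numerically stable rewriting. The function name `bce_with_logits` and the example logits are hypothetical.

```python
import numpy as np

def bce_with_logits(logits, targets):
    """Mean binary cross-entropy computed directly from raw logits."""
    logits = np.asarray(logits, dtype=np.float64)
    targets = np.asarray(targets, dtype=np.float64)
    # Stable form of -[y*log(sigmoid(x)) + (1-y)*log(1-sigmoid(x))]
    per_example = (np.maximum(logits, 0) - logits * targets
                   + np.log1p(np.exp(-np.abs(logits))))
    return per_example.mean()

loss_uncertain = bce_with_logits([0.0, 0.0], [1, 0])   # model outputs p = 0.5
loss_confident = bce_with_logits([5.0, -5.0], [1, 0])  # confident and correct
loss_wrong = bce_with_logits([5.0, -5.0], [0, 1])      # confident and wrong

print(loss_uncertain, loss_confident, loss_wrong)
```

An undecided model pays log 2 per example, a confident correct model pays almost nothing, and a confident wrong model is penalised heavily, which is the behaviour wanted from a loss for 0/1 targets.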


In [None]:
LEARNING_RATE = 0.001
NUM_EPOCHS = 10
MODEL_SAVE_PATH = 'insincerity_model.pt'

global val_f1s; global train_f1s
global val_losses; global train_losses
global metric_lists

def train_quoraModel(model, train_loader, valid_loader):
    criterion = nn.BCEWithLogitsLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=LEARNING_RATE) 
    val_losses, val_f1s = [], []
    train_losses, train_f1s = [], []
    model.to(device)
    for epoch in range(NUM_EPOCHS):
        print("EPOCH :" + str(epoch+1))
        batch = 1
        model.train()  
        for train_batch in tqdm(train_loader):
            train_targ, train_ques, train_id = train_batch
            train_targ = train_targ.to(device)
            train_id = train_id.to(device)
            train_preds = model.forward(train_id.squeeze(dim=1))
            #for RoBerta model
            #train_preds = model.forward(train_ques, train_id)
            train_preds = train_preds.to(device)
            train_loss = criterion(train_preds, train_targ)
            train_f1 = f1_score(train_preds, train_targ)
            f1 = np.round(train_f1.item(), 3)
            optimizer.zero_grad()
            train_loss.backward()
            optimizer.step()
            batch = batch + 1
            if (batch + 1) % 100 == 0:
                print(
                    f"Step [{batch + 1}], "
                    f"F1Score [{f1}], "
                    f"Loss: {train_loss.item():.4f}"
                )
        val_loss, val_f1, val_points = 0, 0, 0

        model.eval()
        with torch.no_grad():
            for val_batch in valid_loader:
                val_targ, val_ques, val_id = val_batch
                val_targ = val_targ.to(device)
                val_id = val_id.to(device)
                val_preds = model.forward(val_id.squeeze(dim=1))
                #for Roberta model
                #val_preds = model.forward(val_ques, val_id)
                val_points = val_points + len(val_targ)
                # criterion returns the batch mean, so weight it by the batch size before averaging
                val_loss = val_loss + criterion(val_preds, val_targ).item()*len(val_targ)
                val_f1 = val_f1 + f1_score(val_preds, val_targ.squeeze(dim=1)).item()*len(val_preds)
        val_f1 = val_f1/ val_points
        val_loss = val_loss/ val_points
        val_f1s.append(val_f1); train_f1s.append(train_f1.item())
        val_losses.append(val_loss); train_losses.append(train_loss.item())
    print("END TRAINING")
    
    torch.save(model.state_dict(), MODEL_SAVE_PATH); del model; gc.collect()

    metric_lists = [val_losses, train_losses, val_f1s, train_f1s]
    metric_names = ['val_loss_', 'train_loss_', 'val_f1_', 'train_f1_']
    for i, metric_list in enumerate(metric_lists):
        for j, metric_value in enumerate(metric_list):
            torch.save(metric_value, metric_names[i] + str(j) + '.pt')

In [None]:
quoraTrainning = train_quoraModel(pure_lstm_model, train_loader, val_loader)
# quoraTrainning = train_quoraModel(lstm_glove_model, train_loader, val_loader)
# from keras.preprocessing.sequence import pad_sequences as pad
# quoraTrainning = train_quoraModel(roberta_model, robert_train_loader, robert_val_loader)

# Experimental results report

In [None]:
# val_f1s = [0] + [metric_value for metric_value in metric_lists[2]]
# train_f1s = [0] + [metric_value for metric_value in metric_lists[3]]
# val_losses = [0.25] + [metric_value for metric_value in metric_lists[0]]
# train_losses = [0.25] + [metric_value for metric_value in metric_lists[1]]
val_f1s = [0] + [torch.load('val_f1_{}.pt'.format(i)) for i in range(NUM_EPOCHS)]

train_f1s = [0] + [torch.load('train_f1_{}.pt'.format(i)) for i in range(NUM_EPOCHS)]
val_losses = [0.25] + [torch.load('val_loss_{}.pt'.format(i)) for i in range(NUM_EPOCHS)]
train_losses = [0.25] + [torch.load('train_loss_{}.pt'.format(i)) for i in range(NUM_EPOCHS)]

In [None]:
import plotly.graph_objects as go

fig = go.Figure()

fig.add_trace(go.Scatter(x=np.arange(1, len(val_losses)+1),
                         y=val_losses, mode="lines+markers", name="val",
                         marker=dict(color="indianred", line=dict(width=.5,
                                                                  color='rgb(0, 0, 0)'))))

fig.add_trace(go.Scatter(x=np.arange(1, len(train_losses)+1),
                         y=train_losses, mode="lines+markers", name="train",
                         marker=dict(color="darkorange", line=dict(width=.5,
                                                                   color='rgb(0, 0, 0)'))))

fig.update_layout(xaxis_title="Epochs", yaxis_title="Binary Cross Entropy",
                  title_text="Binary Cross Entropy vs. Epochs", template="plotly_white", paper_bgcolor="#f0f0f0")

fig.show()

In [None]:
fig = go.Figure()

fig.add_trace(go.Scatter(x=np.arange(1, len(val_f1s)+1),
                         y=val_f1s, mode="lines+markers", name="val",
                         marker=dict(color="indianred", line=dict(width=.5,
                                                                  color='rgb(0, 0, 0)'))))

fig.add_trace(go.Scatter(x=np.arange(1, len(train_f1s)+1),
                         y=train_f1s, mode="lines+markers", name="train",
                         marker=dict(color="darkorange", line=dict(width=.5,
                                                                   color='rgb(0, 0, 0)'))))

fig.update_layout(xaxis_title="Epochs", yaxis_title="F1 Score",
                  title_text="F1 Score vs. Epochs", template="plotly_white", paper_bgcolor="#f0f0f0")

fig.show()


In [None]:
def predict_insincerity(question, network):
    pg, tg = 'post', 'post'
    ins = {0: 'sincere', 1: 'insincere'}
    print(question.strip())
    quest_id = create_sequence([vocab(tokenizer(question))])
    quest_id = torch.tensor(quest_id).to(device)
#     print(quest_id)
    network.to(device)
    output = network.forward(quest_id)
    return ins[int(np.round(nn.Sigmoid()(output.detach().cpu()).item()))]

print(predict_insincerity("How can I train roBERTa base on TPUs?", pure_lstm_model))
print(predict_insincerity("Why is that stupid man the biggest dictator in the world?", pure_lstm_model))

The results of the simple LSTM are shown above.
The simple LSTM reaches a best score of 0.60133, which seems better than the more complex BiLSTM, whose best score is only 0.59415.
![](https://s3.us-west-2.amazonaws.com/secure.notion-static.com/4e861996-1068-4256-82bc-2f0953a0f9e0/Untitled.png?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Content-Sha256=UNSIGNED-PAYLOAD&X-Amz-Credential=AKIAT73L2G45EIPT3X45%2F20220107%2Fus-west-2%2Fs3%2Faws4_request&X-Amz-Date=20220107T043835Z&X-Amz-Expires=86400&X-Amz-Signature=76db11d7319af280e2cbd0997e5a163c6257478a19ca593a43ecde26bc0c413a&X-Amz-SignedHeaders=host&response-content-disposition=filename%20%3D%22Untitled.png%22&x-id=GetObject)
![](https://s3.us-west-2.amazonaws.com/secure.notion-static.com/adf13b8d-2fef-4c78-8c0a-d7b9de78ecd2/newplot_%284%29.png?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Content-Sha256=UNSIGNED-PAYLOAD&X-Amz-Credential=AKIAT73L2G45EIPT3X45%2F20220107%2Fus-west-2%2Fs3%2Faws4_request&X-Amz-Date=20220107T041942Z&X-Amz-Expires=86400&X-Amz-Signature=012beb12cb1f4e8d787a85ebe5ce8ef5327d8a0ebc4de988a166d354d887d9d4&X-Amz-SignedHeaders=host&response-content-disposition=filename%20%3D%22newplot%2520%284%29.png%22&x-id=GetObject)

RoBERTa takes 5 hours to train on GPU. The model converges to an F1 score of around 0.7, close to the leaderboard score, with only 3 layers. However, this pre-trained model is not among the resources provided in the Kaggle competition, so it is not allowed to be submitted.
![](https://s3.us-west-2.amazonaws.com/secure.notion-static.com/b8d38bb7-6e79-4647-9de0-90dee1afe140/newplot_%283%29.png?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Content-Sha256=UNSIGNED-PAYLOAD&X-Amz-Credential=AKIAT73L2G45EIPT3X45%2F20220106%2Fus-west-2%2Fs3%2Faws4_request&X-Amz-Date=20220106T134914Z&X-Amz-Expires=86400&X-Amz-Signature=79ad5a024648918f07c330c4aee45b4378acdba155d0213a52990182490bf05c&X-Amz-SignedHeaders=host&response-content-disposition=filename%20%3D%22newplot%2520%283%29.png%22&x-id=GetObject)

# Run on the test data

In [None]:
pure_lstm_model.eval()
test_preds = []

test_dataset = QuoraDataset(test_df)
test_loader = tqdm(DataLoader(test_dataset, batch_size=256))
def sigmoid(x):
    return 1/(1 + np.exp(-x))
with torch.no_grad():
    for batch in test_loader:
        test_targ, test_ques, test_id = batch
#         print(test_id)
        test_pred = pure_lstm_model.forward(test_id.squeeze(dim=1))
        test_preds.extend(test_pred.squeeze(dim=1).detach().cpu().numpy())  # squeeze only dim 1 so a batch of size 1 still works

test_preds = np.int32(np.round(sigmoid(np.array(test_preds))))

In [None]:
path = '../input/quora-insincere-questions-classification/'
sample_submission = pd.read_csv(path + 'sample_submission.csv')

In [None]:
sample_submission.prediction = test_preds

In [None]:
sample_submission.head()

In [None]:
sample_submission.to_csv("submission.csv", index = False)