# **Introduction**
Classify Quora questions as sincere or insincere.

Input: a CSV file of question texts <br>
Output: 0 = sincere, 1 = insincere

# **Import necessary libraries**
# **Import spacy**

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python

%matplotlib inline
import warnings
warnings.filterwarnings('ignore')

#import necessary libraries
import os
import csv
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import string

#import spacy
import re
import nltk
import matplotlib.pyplot as plt
import seaborn as sns
from PIL import Image
from wordcloud import WordCloud #tag cloud: novelty visual representation of text data

# **Starting to understand the input files**

# **Load data into dataframe then print out to observe**

Read input files <br>
Using pandas: CSV file input/output

In [None]:
train_data = pd.read_csv('../input/quora-insincere-questions-classification/train.csv')
test_data = pd.read_csv('../input/quora-insincere-questions-classification/test.csv')

# **Print the first rows of train.csv**
+ The raw data contains 1306122 rows and 3 columns <br>
+ The features are "qid" (question id), "question_text" and "target" <br> <br>
+ qid: the question's identifier. It plays no part in classifying questions, so it can be ignored <br>
+ question_text: the only field that directly determines a question's class, so it needs preprocessing <br>
+ target: 0 for a sincere question, 1 for an insincere question <br> <br>
+ The class distribution is not visible from these first rows

In [None]:
train_data.head()

# **Print the first rows of test.csv**
+ The raw data contains 375806 rows and 2 columns <br>
+ The features are "qid" and "question_text" <br> <br>
--> These are the questions whose target (0 or 1) we have to predict, which is the purpose of this challenge

In [None]:
test_data.head()

In [None]:
print("Dimensions of Training Dataset : ", train_data.shape)
print("Dimensions of Test Dataset : ", test_data.shape)

In [None]:
sns.countplot(x="target", data=train_data, palette="Set1")
plt.title('Target Count')

--> Sincere questions far outnumber insincere questions

# **Print out the number of sincere questions and insincere questions**
+ Sincere questions have the target tag = 0 <br>
+ Insincere questions have the target tag = 1

In [None]:
num_questions = len(train_data['qid'])
num_sincere_questions = len(train_data.qid[train_data['target'] == 0])
num_insincere_questions = len(train_data.qid[train_data['target'] == 1])
print("Number of Sincere questions in the training set : ", num_sincere_questions)
print("Number of Insincere questions in the training set : ", num_insincere_questions)

# **Plot the class distribution**

In [None]:
values = [train_data[train_data['target']==0].shape[0], train_data[train_data['target']==1].shape[0]]
labels = ['Sincere Questions', 'Insincere Questions']

plt.pie(values, labels=labels, autopct='%1.1f%%', shadow=True)
plt.title('Target Distribution')
plt.show()

In [None]:
print(num_sincere_questions/num_questions * 100, 'percent of training data questions are sincere')
print(num_insincere_questions/num_questions * 100, 'percent of training data questions are insincere')

**Based on the above graph, the data falls into 2 classes: 0 = sincere; 1 = insincere** <br>
**+ class 0: 1225312 questions, or 93.81%** <br>
**+ class 1: 80810 questions, or 6.19%** <br> <br>
**--> The training dataset is imbalanced (sincere questions are about 15 times as numerous as insincere ones)**

**A 15:1 class ratio makes accuracy misleading: a model that always predicts "sincere" already reaches about 93.8% accuracy without learning anything** <br>
**--> Accuracy should not be used as the evaluation metric, to avoid false optimism about model quality** <br>
**--> We will use the F1 score to evaluate the model**
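
The accuracy trap can be checked with plain arithmetic. The sketch below uses the class counts reported above and a hypothetical "model" that labels every question as sincere; it is only an illustration, not part of the training pipeline.

```python
# Class counts reported in this notebook for train.csv
num_sincere = 1225312
num_insincere = 80810
total = num_sincere + num_insincere

# Hypothetical baseline: predict class 0 (sincere) for every question
true_positives = 0                 # insincere questions the baseline catches
false_negatives = num_insincere    # insincere questions the baseline misses

accuracy = num_sincere / total
recall = true_positives / (true_positives + false_negatives)

print(f"accuracy of the always-sincere baseline: {accuracy:.4f}")
print(f"recall on insincere questions: {recall:.4f}")
```

The baseline scores high accuracy while finding none of the insincere questions, which is why a metric built on precision and recall is more informative here.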

**The F1 score is the harmonic mean of precision and recall** <br>

**Precision: of the questions the model flags as insincere, the fraction that really are insincere** <br>

**Recall: of the questions that really are insincere, the fraction the model flags**
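
As a small worked example of these definitions, the sketch below computes precision, recall and F1 by hand for made-up labels. The function name `precision_recall_f1` and the toy data are assumptions for illustration; later cells use `sklearn.metrics.f1_score` for the real evaluation.

```python
def precision_recall_f1(y_true, y_pred):
    """Compute precision, recall and F1 for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # Harmonic mean of precision and recall
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

# Made-up labels: 3 insincere questions, 5 sincere ones
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 1, 0, 0, 0]

precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(precision, recall, f1)
```

Here 2 of the 4 flagged questions are truly insincere (precision 0.5) and 2 of the 3 insincere questions are found (recall 2/3), giving F1 = 4/7.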

In [None]:
train_data.isnull().sum()

In [None]:
train_data.duplicated(subset = ["question_text", "qid", "target"]).any()

# **Print out the info of the train data and test data**

In [None]:
train_data.info()

In [None]:
test_data.info()

# **Use the Natural Language Toolkit (NLTK) to clean the data**
+ WordNet: groups English words into sets of synonyms called synsets, provides brief definitions and usage examples, and records the relations between these synsets and their members <br>
+ Punkt: the Punkt sentence tokenizer. It divides a text into a list of sentences, using an unsupervised algorithm to build a model of abbreviations, collocations, and words that start sentences <br>
+ Stopwords: very common words that add little meaning when analyzing text data and building NLP models --> We remove these words

In [None]:
nltk.download('wordnet')
nltk.download('punkt')
nltk.download('stopwords')

# **Extract simple statistical features from the raw question text**
# **Print the first rows with the new features**

In [None]:
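# Number of rows sharing the same qid
train_data['freq_qid'] = train_data.groupby('qid')['qid'].transform('count')
# Question length in characters
train_data['qlen'] = train_data['question_text'].str.len()
# Number of space-separated words
train_data['n_words'] = train_data['question_text'].apply(lambda row: len(row.split(" ")))
# Number of digit characters
train_data['numeric_words'] = train_data['question_text'].apply(lambda row: sum(c.isdigit() for c in row))
# Number of special characters (the range 0-9 needs an ASCII hyphen, not an en dash)
train_data['sp_char_words'] = train_data['question_text'].str.findall(r'[^a-zA-Z0-9 ]').str.len()
# Same value as qlen, kept under its original column name
train_data['char_words'] = train_data['question_text'].apply(lambda row: len(str(row)))
# Number of distinct whitespace-separated words
train_data['unique_words'] = train_data['question_text'].apply(lambda row: len(set(str(row).split())))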

train_data.head()

In [None]:
test_data['freq_qid'] = test_data.groupby('qid')['qid'].transform('count') 
test_data['qlen'] = test_data['question_text'].str.len() 
test_data['n_words'] = test_data['question_text'].apply(lambda row: len(row.split(" ")))
test_data['numeric_words'] = test_data['question_text'].apply(lambda row: sum(c.isdigit() for c in row))
test_data['sp_char_words'] = test_data['question_text'].str.findall(r'[^a-zA-Z0-9 ]').str.len()
test_data['char_words'] = test_data['question_text'].apply(lambda row: len(str(row)))
test_data['unique_words'] = test_data['question_text'].apply(lambda row: len(set(str(row).split())))

test_data.head()

**Note: insincere (toxic) questions contain more words and more digits on average than sincere ones** <br>
**Insincere questions are usually marked by words with a negative meaning rather than by their grammar** <br>
**These features can be fed into a model (tested, but they did not help with linear models)**
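
One way to check a note like this is to average each feature per class. The sketch below does so in plain Python on three made-up questions, so its numbers say nothing about the real dataset; the helper name `mean_features_by_target` is an assumption.

```python
from collections import defaultdict

# Made-up (question_text, target) pairs for illustration only
rows = [
    ("How do I learn Python?", 0),
    ("What is 2 plus 2?", 0),
    ("Why are 99 percent of people in group X so bad at everything they do?", 1),
]

def mean_features_by_target(rows):
    """Average word count and digit count per target class."""
    grouped = defaultdict(list)
    for text, target in rows:
        n_words = len(text.split(" "))
        n_digits = sum(c.isdigit() for c in text)
        grouped[target].append((n_words, n_digits))
    return {
        target: {
            "n_words": sum(w for w, _ in feats) / len(feats),
            "numeric_words": sum(d for _, d in feats) / len(feats),
        }
        for target, feats in grouped.items()
    }

means = mean_features_by_target(rows)
print(means)
```

On the notebook's DataFrame the same comparison would be `train_data.groupby('target')[['n_words', 'numeric_words']].mean()`.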

---------------------------------------------------------------------------------------------

# **Special characters, numbers, paths and letter case usually do not affect the classification of a question, so they can be omitted**

****

**Remove special characters**

In [None]:
puncts=[',', '.', '"', ':', ')', '(', '-', '!', '?', '|', ';', "'", '$', '&', '/', '[', ']', '>', '%', '=', '#', '*', '+', '\\', 
        '•', '~', '@', '£', '·', '_', '{', '}', '©', '^', '®', '`', '<', '→', '°', '€', '™', '›', '♥', '←', '×', '§', '″', '′', 
        '█', '…', '“', '★', '”', '–', '●', '►', '−', '¢', '¬', '░', '¡', '¶', '↑', '±', '¿', '▾', '═', '¦', '║', '―', '¥', '▓', 
        '—', '‹', '─', '▒', '：', '⊕', '▼', '▪', '†', '■', '’', '▀', '¨', '▄', '♫', '☆', '¯', '♦', '¤', '▲', '¸', '⋅', '‘', '∞', 
        '∙', '）', '↓', '、', '│', '（', '»', '，', '♪', '╩', '╚', '・', '╦', '╣', '╔', '╗', '▬', '❤', '≤', '‡', '√', '◄', '━', 
        '⇒', '▶', '≥', '╝', '♡', '◊', '。', '✈', '≡', '☺', '✔', '↵', '≈', '✓', '♣', '☎', '℃', '◦', '└', '‟', '～', '！', '○', 
        '◆', '№', '♠', '▌', '✿', '▸', '⁄', '□', '❖', '✦', '．', '÷', '｜', '┃', '／', '￥', '╠', '↩', '✭', '▐', '☼', '☻', '┐', 
        '├', '«', '∼', '┌', '℉', '☮', '฿', '≦', '♬', '✧', '〉', '－', '⌂', '✖', '･', '◕', '※', '‖', '◀', '‰', '\x97', '↺', 
        '∆', '┘', '┬', '╬', '،', '⌘', '⊂', '＞', '〈', '⎙', '？', '☠', '⇐', '▫', '∗', '∈', '≠', '♀', '♔', '˚', '℗', '┗', '＊', 
        '┼', '❀', '＆', '∩', '♂', '‿', '∑', '‣', '➜', '┛', '⇓', '☯', '⊖', '☀', '┳', '；', '∇', '⇑', '✰', '◇', '♯', '☞', '´', 
        '↔', '┏', '｡', '◘', '∂', '✌', '♭', '┣', '┴', '┓', '✨', '\xa0', '˜', '❥', '┫', '℠', '✒', '［', '∫', '\x93', '≧', '］', 
        '\x94', '∀', '♛', '\x96', '∨', '◎', '↻', '⇩', '＜', '≫', '✩', '✪', '♕', '؟', '₤', '☛', '╮', '␊', '＋', '┈', '％', 
        '╋', '▽', '⇨', '┻', '⊗', '￡', '।', '▂', '✯', '▇', '＿', '➤', '✞', '＝', '▷', '△', '◙', '▅', '✝', '∧', '␉', '☭', 
        '┊', '╯', '☾', '➔', '∴', '\x92', '▃', '↳', '＾', '׳', '➢', '╭', '➡', '＠', '⊙', '☢', '˝', '∏', '„', '∥', '❝', '☐', 
        '▆', '╱', '⋙', '๏', '☁', '⇔', '▔', '\x91', '➚', '◡', '╰', '\x85', '♢', '˙', '۞', '✘', '✮', '☑', '⋆', 'ⓘ', '❒', 
        '☣', '✉', '⌊', '➠', '∣', '❑', '◢', 'ⓒ', '\x80', '〒', '∕', '▮', '⦿', '✫', '✚', '⋯', '♩', '☂', '❞', '‗', '܂', '☜', 
        '‾', '✜', '╲', '∘', '⟩', '＼', '⟨', '·', '✗', '♚', '∅', 'ⓔ', '◣', '͡', '‛', '❦', '◠', '✄', '❄', '∃', '␣', '≪', '｢', 
        '≅', '◯', '☽', '∎', '｣', '❧', '̅', 'ⓐ', '↘', '⚓', '▣', '˘', '∪', '⇢', '✍', '⊥', '＃', '⎯', '↠', '۩', '☰', '◥', 
        '⊆', '✽', '⚡', '↪', '❁', '☹', '◼', '☃', '◤', '❏', 'ⓢ', '⊱', '➝', '̣', '✡', '∠', '｀', '▴', '┤', '∝', '♏', 'ⓐ', 
        '✎', ';', '␤', '＇', '❣', '✂', '✤', 'ⓞ', '☪', '✴', '⌒', '˛', '♒', '＄', '✶', '▻', 'ⓔ', '◌', '◈', '❚', '❂', '￦', 
        '◉', '╜', '̃', '✱', '╖', '❉', 'ⓡ', '↗', 'ⓣ', '♻', '➽', '׀', '✲', '✬', '☉', '▉', '≒', '☥', '⌐', '♨', '✕', 'ⓝ', 
        '⊰', '❘', '＂', '⇧', '̵', '➪', '▁', '▏', '⊃', 'ⓛ', '‚', '♰', '́', '✏', '⏑', '̶', 'ⓢ', '⩾', '￠', '❍', '≃', '⋰', '♋', 
        '､', '̂', '❋', '✳', 'ⓤ', '╤', '▕', '⌣', '✸', '℮', '⁺', '▨', '╨', 'ⓥ', '♈', '❃', '☝', '✻', '⊇', '≻', '♘', '♞', 
        '◂', '✟', '⌠', '✠', '☚', '✥', '❊', 'ⓒ', '⌈', '❅', 'ⓡ', '♧', 'ⓞ', '▭', '❱', 'ⓣ', '∟', '☕', '♺', '∵', '⍝', 'ⓑ', 
        '✵', '✣', '٭', '♆', 'ⓘ', '∶', '⚜', '◞', '்', '✹', '➥', '↕', '̳', '∷', '✋', '➧', '∋', '̿', 'ͧ', '┅', '⥤', '⬆', '⋱', 
        '☄', '↖', '⋮', '۔', '♌', 'ⓛ', '╕', '♓', '❯', '♍', '▋', '✺', '⭐', '✾', '♊', '➣', '▿', 'ⓑ', '♉', '⏠', '◾', '▹', 
        '⩽', '↦', '╥', '⍵', '⌋', '։', '➨', '∮', '⇥', 'ⓗ', 'ⓓ', '⁻', '⎝', '⌥', '⌉', '◔', '◑', '✼', '♎', '♐', '╪', '⊚', 
        '☒', '⇤', 'ⓜ', '⎠', '◐', '⚠', '╞', '◗', '⎕', 'ⓨ', '☟', 'ⓟ', '♟', '❈', '↬', 'ⓓ', '◻', '♮', '❙', '♤', '∉', '؛', 
        '⁂', 'ⓝ', '־', '♑', '╫', '╓', '╳', '⬅', '☔', '☸', '┄', '╧', '׃', '⎢', '❆', '⋄', '⚫', '̏', '☏', '➞', '͂', '␙', 
        'ⓤ', '◟', '̊', '⚐', '✙', '↙', '̾', '℘', '✷', '⍺', '❌', '⊢', '▵', '✅', 'ⓖ', '☨', '▰', '╡', 'ⓜ', '☤', '∽', '╘', 
        '˹', '↨', '♙', '⬇', '♱', '⌡', '⠀', '╛', '❕', '┉', 'ⓟ', '̀', '♖', 'ⓚ', '┆', '⎜', '◜', '⚾', '⤴', '✇', '╟', '⎛', 
        '☩', '➲', '➟', 'ⓥ', 'ⓗ', '⏝', '◃', '╢', '↯', '✆', '˃', '⍴', '❇', '⚽', '╒', '̸', '♜', '☓', '➳', '⇄', '☬', '⚑', 
        '✐', '⌃', '◅', '▢', '❐', '∊', '☈', '॥', '⎮', '▩', 'ு', '⊹', '‵', '␔', '☊', '➸', '̌', '☿', '⇉', '⊳', '╙', 'ⓦ', 
        '⇣', '｛', '̄', '↝', '⎟', '▍', '❗', '״', '΄', '▞', '◁', '⛄', '⇝', '⎪', '♁', '⇠', '☇', '✊', 'ி', '｝', '⭕', '➘', 
        '⁀', '☙', '❛', '❓', '⟲', '⇀', '≲', 'ⓕ', '⎥', '\u06dd', 'ͤ', '₋', '̱', '̎', '♝', '≳', '▙', '➭', '܀', 'ⓖ', '⇛', '▊', 
        '⇗', '̷', '⇱', '℅', 'ⓧ', '⚛', '̐', '̕', '⇌', '␀', '≌', 'ⓦ', '⊤', '̓', '☦', 'ⓕ', '▜', '➙', 'ⓨ', '⌨', '◮', '☷', 
        '◍', 'ⓚ', '≔', '⏩', '⍳', '℞', '┋', '˻', '▚', '≺', 'ْ', '▟', '➻', '̪', '⏪', '̉', '⎞', '┇', '⍟', '⇪', '▎', '⇦', '␝', 
        '⤷', '≖', '⟶', '♗', '̴', '♄', 'ͨ', '̈', '❜', '̡', '▛', '✁', '➩', 'ா', '˂', '↥', '⏎', '⎷', '̲', '➖', '↲', '⩵', '̗', '❢', 
        '≎', '⚔', '⇇', '̑', '⊿', '̖', '☍', '➹', '⥊', '⁁', '✢']

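def clean_punct(x):
    # Replace every listed special character with a space, so that it is
    # removed without gluing the neighbouring words together.
    for punct in puncts:
        if punct in x:
            x = x.replace(punct, ' ')
    return x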
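
Looping over a long character list calls `str.replace` once per character. A possible alternative, sketched below on a small illustrative subset of the list, is to build a single translation table with `str.maketrans` and apply it in one pass. The names `sample_puncts`, `punct_table` and `strip_puncts` are hypothetical, and the approach assumes every entry is a single character.

```python
# A small illustrative subset of the special-character list
sample_puncts = [',', '.', '?', '!', '"', '(', ')', '•', '→']

# Map every listed character to a space in one translation table
punct_table = str.maketrans({p: ' ' for p in sample_puncts})

def strip_puncts(text):
    # Translate in a single pass, then collapse repeated whitespace
    return ' '.join(text.translate(punct_table).split())

cleaned = strip_puncts('Why (exactly) is this → so slow?!')
print(cleaned)  # Why exactly is this so slow
```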

**Replace numbers with # placeholders**

In [None]:
def clean_numbers(x):
    if bool(re.search(r'\d', x)):
        x = re.sub('[0-9]{5,}', '#####', x)
        x = re.sub('[0-9]{4}', '####', x)
        x = re.sub('[0-9]{3}', '###', x)
        x = re.sub('[0-9]{2}', '##', x)
    return x
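
To see what this masking does, the sketch below repeats the same function on a made-up sentence. Runs of five or more digits collapse to `#####`, shorter runs keep their length, and single digits are left untouched.

```python
import re

def clean_numbers(x):
    # Same logic as the cell above: mask digit runs, longest pattern first
    if bool(re.search(r'\d', x)):
        x = re.sub('[0-9]{5,}', '#####', x)
        x = re.sub('[0-9]{4}', '####', x)
        x = re.sub('[0-9]{3}', '###', x)
        x = re.sub('[0-9]{2}', '##', x)
    return x

masked = clean_numbers("I have 1234567 dollars in 2018 and 12 cats, 5 dogs")
print(masked)  # I have ##### dollars in #### and ## cats, 5 dogs
```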

 **Create a dictionary of commonly misspelled words** <br>
 **Map each misspelling or rare term to a standard form**

In [None]:
mispell_dict = {'colour': 'color', 'centre': 'center', 'favourite': 'favorite', 'travelling': 'traveling', 'counselling': 'counseling', 'theatre': 'theater', 'cancelled': 'canceled', 'labour': 'labor', 'organisation': 'organization', 'wwii': 'world war 2', 'citicise': 'criticize', 'youtu ': 'youtube ', 'Qoura': 'Quora', 'sallary': 'salary', 'Whta': 'What', 'narcisist': 'narcissist', 'howdo': 'how do', 'whatare': 'what are', 'howcan': 'how can', 'howmuch': 'how much', 'howmany': 'how many', 'whydo': 'why do', 'doI': 'do I', 'theBest': 'the best', 'howdoes': 'how does', 'mastrubation': 'masturbation', 'mastrubate': 'masturbate', "mastrubating": 'masturbating', 'pennis': 'penis', 'Etherium': 'bitcoin', 'narcissit': 'narcissist', 'bigdata': 'big data', '2k17': '2017', '2k18': '2018', 'qouta': 'quota', 'exboyfriend': 'ex boyfriend', 'airhostess': 'air hostess', "whst": 'what', 'watsapp': 'whatsapp', 'demonitisation': 'demonetization', 'demonitization': 'demonetization', 'demonetisation': 'demonetization', 
                'electroneum':'bitcoin','nanodegree':'degree','hotstar':'star','dream11':'dream','ftre':'fire','tensorflow':'framework','unocoin':'bitcoin',
                'lnmiit':'limit','unacademy':'academy','altcoin':'bitcoin','altcoins':'bitcoin','litecoin':'bitcoin','coinbase':'bitcoin','cryptocurency':'cryptocurrency',
                'simpliv':'simple','quoras':'quora','schizoids':'psychopath','remainers':'remainder','twinflame':'soulmate','quorans':'quora','brexit':'demonetized',
                'iiest':'institute','dceu':'comics','pessat':'exam','uceed':'college','bhakts':'devotee','boruto':'anime',
                'cryptocoin':'bitcoin','blockchains':'blockchain','fiancee':'fiance','redmi':'smartphone','oneplus':'smartphone','qoura':'quora','deepmind':'framework','ryzen':'cpu','whattsapp':'whatsapp',
                'undertale':'adventure','zenfone':'smartphone','cryptocurencies':'cryptocurrencies','koinex':'bitcoin','zebpay':'bitcoin','binance':'bitcoin','whtsapp':'whatsapp',
                'reactjs':'framework','bittrex':'bitcoin','bitconnect':'bitcoin','bitfinex':'bitcoin','yourquote':'your quote','whyis':'why is','jiophone':'smartphone',
                'dogecoin':'bitcoin','onecoin':'bitcoin','poloniex':'bitcoin','7700k':'cpu','angular2':'framework','segwit2x':'bitcoin','hashflare':'bitcoin','940mx':'gpu',
                'openai':'framework','hashflare':'bitcoin','1050ti':'gpu','nearbuy':'near buy','freebitco':'bitcoin','antminer':'bitcoin','filecoin':'bitcoin','whatapp':'whatsapp',
                'empowr':'empower','1080ti':'gpu','crytocurrency':'cryptocurrency','8700k':'cpu','whatsaap':'whatsapp','g4560':'cpu','payymoney':'pay money',
                'fuckboys':'fuck boys','intenship':'internship','zcash':'bitcoin','demonatisation':'demonetization','narcicist':'narcissist','mastuburation':'masturbation',
                'trignometric':'trigonometric','cryptocurreny':'cryptocurrency','howdid':'how did','crytocurrencies':'cryptocurrencies','phycopath':'psychopath',
                'bytecoin':'bitcoin','possesiveness':'possessiveness','scollege':'college','humanties':'humanities','altacoin':'bitcoin','demonitised':'demonetized',
                'brasília':'brazilia','accolite':'accolyte','econimics':'economics','varrier':'warrier','quroa':'quora','statergy':'strategy','langague':'language',
                'splatoon':'game','7600k':'cpu','gate2018':'gate 2018','in2018':'in 2018','narcassist':'narcissist','jiocoin':'bitcoin','hnlu':'hulu','7300hq':'cpu',
                'weatern':'western','interledger':'blockchain','deplation':'deflation', 'cryptocurrencies':'cryptocurrency', 'bitcoin':'blockchain cryptocurrency',}

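def _get_mispell(mispell_dict):
    # Try longer keys first and escape them so each key is matched literally
    keys = sorted(mispell_dict.keys(), key=len, reverse=True)
    mispell_re = re.compile('(%s)' % '|'.join(re.escape(key) for key in keys))
    return mispell_dict, mispell_re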

mispellings, mispellings_re = _get_mispell(mispell_dict)
def replace_typical_misspell(text):
    def replace(match):
        return mispellings[match.group(0)]
    return mispellings_re.sub(replace, text)

**Expand contractions**

In [None]:
contraction_dict = {"ain't": "is not", "aren't": "are not","can't": "cannot", "'cause": "because", "could've": "could have", "couldn't": "could not", "didn't": "did not",  "doesn't": "does not", "don't": "do not", "hadn't": "had not", "hasn't": "has not", "haven't": "have not", "he'd": "he would","he'll": "he will", "he's": "he is", "how'd": "how did", "how'd'y": "how do you", "how'll": "how will", "how's": "how is",  "I'd": "I would", "I'd've": "I would have", "I'll": "I will", "I'll've": "I will have","I'm": "I am", "I've": "I have", "i'd": "i would", "i'd've": "i would have", "i'll": "i will",  "i'll've": "i will have","i'm": "i am", "i've": "i have", "isn't": "is not", "it'd": "it would", "it'd've": "it would have", "it'll": "it will", "it'll've": "it will have","it's": "it is", "let's": "let us", "ma'am": "madam", "mayn't": "may not", "might've": "might have","mightn't": "might not","mightn't've": "might not have", "must've": "must have", "mustn't": "must not", "mustn't've": "must not have", "needn't": "need not", "needn't've": "need not have","o'clock": "of the clock", "oughtn't": "ought not", "oughtn't've": "ought not have", "shan't": "shall not", "sha'n't": "shall not", "shan't've": "shall not have", "she'd": "she would", "she'd've": "she would have", "she'll": "she will", "she'll've": "she will have", "she's": "she is", "should've": "should have", "shouldn't": "should not", "shouldn't've": "should not have", "so've": "so have","so's": "so as", "this's": "this is","that'd": "that would", "that'd've": "that would have", "that's": "that is", "there'd": "there would", "there'd've": "there would have", "there's": "there is", "here's": "here is","they'd": "they would", "they'd've": "they would have", "they'll": "they will", "they'll've": "they will have", "they're": "they are", "they've": "they have", "to've": "to have", "wasn't": "was not", "we'd": "we would", "we'd've": "we would have", "we'll": "we will", "we'll've": "we will have", "we're": "we are", 
"we've": "we have", "weren't": "were not", "what'll": "what will", "what'll've": "what will have", "what're": "what are",  "what's": "what is", "what've": "what have", "when's": "when is", "when've": "when have", "where'd": "where did", "where's": "where is", "where've": "where have", "who'll": "who will", "who'll've": "who will have", "who's": "who is", "who've": "who have", "why's": "why is", "why've": "why have", "will've": "will have", "won't": "will not", "won't've": "will not have", "would've": "would have", "wouldn't": "would not", "wouldn't've": "would not have", "y'all": "you all", "y'all'd": "you all would","y'all'd've": "you all would have","y'all're": "you all are","y'all've": "you all have","you'd": "you would", "you'd've": "you would have", "you'll": "you will", "you'll've": "you will have", "you're": "you are", "you've": "you have"}

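def _get_contractions(contraction_dict):
    # Try longer keys first so that "i'd've" is not cut short by "i'd"
    keys = sorted(contraction_dict.keys(), key=len, reverse=True)
    contraction_re = re.compile('(%s)' % '|'.join(re.escape(key) for key in keys))
    return contraction_dict, contraction_re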

contractions, contractions_re = _get_contractions(contraction_dict)

def replace_contractions(text):
    def replace(match):
        return contractions[match.group(0)]
    return contractions_re.sub(replace, text)
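
One detail worth checking with dictionary-driven regex replacement is key order: Python's `re` tries alternatives from left to right, so a short key listed before a longer key that starts with it wins the match. The sketch below, on a tiny illustrative dictionary, sorts keys longest first; `build_replacer` and `sample_contractions` are hypothetical names.

```python
import re

# A tiny illustrative subset of a contraction dictionary
sample_contractions = {"i'd": "i would", "i'd've": "i would have", "can't": "cannot"}

def build_replacer(mapping):
    """Return a function that replaces every key of `mapping` found in a text."""
    # Longest keys first, escaped so they are matched literally
    keys = sorted(mapping, key=len, reverse=True)
    pattern = re.compile('|'.join(re.escape(k) for k in keys))
    return lambda text: pattern.sub(lambda m: mapping[m.group(0)], text)

expand = build_replacer(sample_contractions)
expanded = expand("i'd've gone but i can't")
print(expanded)  # i would have gone but i cannot
```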

# **The text must be cleaned before it can be processed**

 **Remove stopwords**

In [None]:
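from nltk.tokenize.toktok import ToktokTokenizer

# A set gives fast membership tests when filtering tokens
stopword_list = set(nltk.corpus.stopwords.words('english'))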
def remove_stopwords(text, is_lower_case=True):
    tokenizer = ToktokTokenizer()
    tokens = tokenizer.tokenize(text)
    tokens = [token.strip() for token in tokens]
    if is_lower_case:
        filtered_tokens = [token for token in tokens if token not in stopword_list]
    else:
        filtered_tokens = [token for token in tokens if token.lower() not in stopword_list]
    filtered_text = ' '.join(filtered_tokens)
    return filtered_text

 **Reduce the inflected forms of a word to a single base form (stemming and lemmatization)**

In [None]:
from nltk.stem import  SnowballStemmer
from nltk.tokenize.toktok import ToktokTokenizer
def stem_text(text):
    tokenizer = ToktokTokenizer()
    stemmer = SnowballStemmer('english')
    tokens = tokenizer.tokenize(text)
    tokens = [token.strip() for token in tokens]
    tokens = [stemmer.stem(token) for token in tokens]
    return ' '.join(tokens)

In [None]:
from nltk.stem import WordNetLemmatizer
from nltk.tokenize.toktok import ToktokTokenizer
wordnet_lemmatizer = WordNetLemmatizer()
def lemma_text(text):
    tokenizer = ToktokTokenizer()
    tokens = tokenizer.tokenize(text)
    tokens = [token.strip() for token in tokens]
    tokens = [wordnet_lemmatizer.lemmatize(token) for token in tokens]
    return ' '.join(tokens)

 **Clean sentences using all of the functions above**

In [None]:
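def clean_sentence(x):
    x = x.lower()
    # Expand contractions first: this step needs the apostrophes and the
    # contracted words that later steps may strip.
    x = replace_contractions(x)
    # Fix misspellings before masking numbers, since keys such as '2k17'
    # or '7700k' contain digits.
    x = replace_typical_misspell(x)
    x = clean_punct(x)
    x = clean_numbers(x)
    x = remove_stopwords(x)
    x = stem_text(x)
    x = lemma_text(x)
    x = x.replace("'","")
    return x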

# **Clean the question text of the train and test data**

In [None]:
train_data['preprocessed_question_text'] = train_data['question_text'].apply(lambda x: clean_sentence(x))

# **Print out some sentences of train data after cleaning**

In [None]:
train_data.preprocessed_question_text.head()

In [None]:
test_data['preprocessed_question_text'] = test_data['question_text'].apply(lambda x: clean_sentence(x))

# **Print out some sentences of test data after cleaning**

In [None]:
test_data.preprocessed_question_text.head()

# **Word cloud: a visual representation of free-form text in which more frequent words appear larger**

In [None]:
def cloud(text, title, size = (10,7)):
    # Processing Text
    words_list = text.unique().tolist()
    words = ' '.join(words_list)
    
    wordcloud = WordCloud(width=800, height=400,
                          collocations=False
                         ).generate(words)
    
    # Output Visualization
    fig = plt.figure(figsize=size, dpi=80, facecolor='k',edgecolor='k')
    plt.imshow(wordcloud,interpolation='bilinear')
    plt.axis('off')
    plt.title(title, fontsize=25,color='w')
    plt.tight_layout(pad=0)
    plt.show()

# **Print out the visualization of words which appear in sincere questions (train.csv)**

In [None]:
cloud(train_data[train_data['target']==0]['question_text'], 'Sincere Questions On question_text')

# **Print out the visualization of words which appear in sincere questions (train.csv) AFTER cleaning**

In [None]:
cloud(train_data[train_data['target']==0]['preprocessed_question_text'], 'Sincere Questions On preprocessed_question_text')

# **Print out the visualization of words which appear in insincere questions (train.csv)**

In [None]:
cloud(train_data[train_data['target']==1]['question_text'], 'Insincere Questions On question_text')

# **Print out the visualization of words which appear in insincere questions (train.csv) AFTER cleaning**

In [None]:
cloud(train_data[train_data['target']==1]['preprocessed_question_text'], 'Insincere Questions On preprocessed_question_text')

# **NOTE**
**+ Insincere questions often contain many words with a negative meaning** <br>
**+ However, some words that do not carry a negative meaning, such as "people", "women" and "Trump", also have a high frequency. For this task they behave like stopwords: they occur often but say little about sincerity when viewed individually**

-----------------------------------------------------------------------------------------------------

# **Apply a Keras recurrent model to start training**

**Import necessary libraries**

In [None]:
import os
import time
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from tqdm import tqdm
import math
from sklearn.model_selection import train_test_split
from sklearn import metrics

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from tensorflow.compat.v1.keras.layers import CuDNNGRU
from keras.layers import Dense, Input, LSTM, Embedding, Dropout, Activation, Conv1D
from keras.layers import Bidirectional, GlobalMaxPool1D
from keras.models import Model
from keras import initializers, regularizers, constraints, optimizers, layers

# **First, we prepare the data used for training**
**+ Split the train file into 2 parts: train and validation. The train part is used for training and the validation part is used to check how well the model performs** <br>
**+ Replace missing question text with "_na_" so that no row is lost**

In [None]:
## split to train and val
train_data, val_data = train_test_split(train_data, test_size=0.1, random_state=2018) #use 10% of train data to validate

## some config values 
embed_size = 300 # how big is each word vector
max_features = 50000 # how many unique words to use (i.e num rows in embedding vector)
maxlen = 100 # max number of words in a question to use

## fill in "na" the missing values
train_X = train_data["question_text"].fillna("_na_").values
val_X = val_data["question_text"].fillna("_na_").values
test_X = test_data["question_text"].fillna("_na_").values

# **If we leave the data as strings, the model cannot process it. We therefore encode each word as a unique positive integer**
**E.g.: a sentence of 10 words is encoded as a vector of 10 integers**

# **We use Tokenizer to do this. It encodes words as unique positive integers. The lower the number, the more frequent the word is in the training text**
**To give every input the same length, we also use pad_sequences so that each sentence is exactly 100 tokens long:** <br>
**+ Truncate sentences longer than 100 words** <br> 
**+ Pad sentences shorter than 100 words with 0. Each sentence is then represented by a vector of 100 integers**
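
The encoding and padding steps can be sketched in plain Python. This is a simplified illustration, not the Keras implementation: it splits on whitespace only, and the helper names `fit_word_index`, `texts_to_sequences` and `pad_pre` are hypothetical. It does mirror two Keras defaults: index 0 is reserved for padding, and both padding and truncation happen at the start of the sequence.

```python
from collections import Counter

def fit_word_index(texts, num_words):
    """Assign index 1 to the most frequent word, 2 to the next, and so on."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # Keep only indices below num_words; 0 stays free for padding
    return {word: i for i, (word, _) in
            enumerate(counts.most_common(num_words - 1), start=1)}

def texts_to_sequences(texts, word_index):
    # Words outside the vocabulary are dropped
    return [[word_index[w] for w in text.lower().split() if w in word_index]
            for text in texts]

def pad_pre(sequence, maxlen):
    # Keep the last `maxlen` tokens, then left-pad with zeros
    sequence = sequence[-maxlen:]
    return [0] * (maxlen - len(sequence)) + sequence

texts = ["what is the best book", "what is the time", "what is love"]
word_index = fit_word_index(texts, num_words=50)
sequences = texts_to_sequences(texts, word_index)
padded = [pad_pre(seq, maxlen=6) for seq in sequences]
print(padded)
```

Every row of `padded` has length 6, with zeros on the left for the shorter questions.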

In [None]:
## Tokenize the sentences
tokenizer = Tokenizer(num_words=max_features)
tokenizer.fit_on_texts(list(train_X))
train_X = tokenizer.texts_to_sequences(train_X)
val_X = tokenizer.texts_to_sequences(val_X)
test_X = tokenizer.texts_to_sequences(test_X)

## Pad the sentences into 100 words
train_X = pad_sequences(train_X, maxlen=maxlen)
val_X = pad_sequences(val_X, maxlen=maxlen)
test_X = pad_sequences(test_X, maxlen=maxlen)

# **train_X, val_X and test_X are now arrays of integer vectors, one vector per question in the files**

# **Observe a question after encoding and zero-padding**

In [None]:
print(train_X[0])

# **Get the target column of the train file and the validate file for training**

In [None]:
## Get the target values
train_y = train_data['target'].values
val_y = val_data['target'].values

# **Build the model**

**+ The input is the integer vector corresponding to a question** <br>
**+ Each question has 100 word positions, so it is a 100-dimensional vector** <br>
**+ The Embedding layer helps the model learn what the words mean. It converts each word into a 1x300 vector representing the meaning of that word, so a sentence becomes a 100x300 matrix** <br>
**+ The Bidirectional layer (here wrapping an LSTM with 64 units per direction) helps the model learn the meaning of each sentence from the order of its words, read in both directions** <br>
**+ For each of the 128 features, the global max-pooling layer keeps the highest value found at any word position** <br>
**--> The remaining layers of the model are used for classification**
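
To make the shapes concrete, the sketch below walks a toy question through an embedding lookup and global max pooling in plain Python. The sizes are deliberately tiny and the embedding values are random, and pooling is applied directly to the embedded words here, whereas in the model it is applied to the recurrent layer's outputs. All names are hypothetical.

```python
import random

random.seed(0)
# Toy sizes for illustration; the notebook uses maxlen=100, max_features=50000, embed_size=300
maxlen, vocab_size, embed_size = 5, 10, 4

# Embedding table: one row of `embed_size` numbers per word index
embedding_matrix = [[random.uniform(-1, 1) for _ in range(embed_size)]
                    for _ in range(vocab_size)]

question = [0, 0, 7, 2, 9]                            # padded integer vector
embedded = [embedding_matrix[i] for i in question]    # maxlen x embed_size

# Global max pooling: for each feature column, keep the largest value
# found at any word position
pooled = [max(column) for column in zip(*embedded)]   # embed_size values

print(len(embedded), len(embedded[0]), len(pooled))
```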

In [None]:
inp = Input(shape=(maxlen,))
x = Embedding(max_features, embed_size)(inp)
x = Bidirectional(LSTM(64, return_sequences=True))(x)
x = GlobalMaxPool1D()(x)
x = Dense(16, activation="relu")(x)
x = Dropout(0.1)(x)
x = Dense(1, activation="sigmoid")(x)
model = Model(inputs=inp, outputs=x)
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

print(model.summary())

# **Start training with train_X and train_y** 
# **Feed the whole training set through the network twice (2 epochs)**
# **Each epoch is split into batches of 512 sentences. The data used for validation is val_X and val_y**
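
As a rough check on what these settings mean in practice, the arithmetic below estimates the number of batches per epoch. It assumes the 10% validation split rounds the validation size up, so treat the exact counts as an approximation.

```python
import math

total_rows = 1306122                      # rows in train.csv, as reported earlier
val_rows = math.ceil(0.1 * total_rows)    # assumption: validation size is rounded up
train_rows = total_rows - val_rows
batch_size = 512
epochs = 2

steps_per_epoch = math.ceil(train_rows / batch_size)
total_steps = steps_per_epoch * epochs
print(train_rows, steps_per_epoch, total_steps)
```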

In [None]:
model.fit(train_X, train_y, batch_size=512, epochs=2, validation_data=(val_X, val_y))

# **The model works quite well. For a more detailed look, calculate the model's F1-score with thresholds from 0.1 to 0.5**

In [None]:
pred_noemb_val_y = model.predict([val_X], batch_size=1024, verbose=1)
for thresh in np.arange(0.1, 0.501, 0.01):
    thresh = np.round(thresh, 2)
    print("F1 score at threshold {0} is {1}".format(thresh, metrics.f1_score(val_y, (pred_noemb_val_y>thresh).astype(int))))

# **We see that the F1-score in the threshold range of 0.26-0.4 is quite good**
# **For this task, wrongly flagging a question is less harmful than missing one, so we consider a threshold < 0.5 (closer to 0 than to 1)**
# **--> We would rather flag a sincere question as insincere than let an insincere question pass as sincere**
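
Instead of reading the best threshold off the printed list, it can be picked programmatically. The sketch below does this in plain Python on made-up labels and probabilities, so the resulting threshold is only illustrative; `f1_at_threshold` is a hypothetical helper.

```python
def f1_at_threshold(y_true, probs, threshold):
    """F1 of the positive class when predicting 1 for probabilities above `threshold`."""
    preds = [1 if p > threshold else 0 for p in probs]
    tp = sum(1 for t, p in zip(y_true, preds) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, preds) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, preds) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

# Made-up validation labels and predicted probabilities
y_true = [0, 0, 0, 0, 1, 0, 1, 1]
probs = [0.052, 0.125, 0.225, 0.475, 0.315, 0.165, 0.625, 0.915]

# Same grid as the cell above: 0.10, 0.11, ..., 0.50
thresholds = [round(0.1 + 0.01 * i, 2) for i in range(41)]
best_threshold = max(thresholds, key=lambda t: f1_at_threshold(y_true, probs, t))
best_f1 = f1_at_threshold(y_true, probs, best_threshold)
print(best_threshold, best_f1)
```

With the notebook's arrays, the same search could run over `val_y` and `pred_noemb_val_y` using `metrics.f1_score`.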

In [None]:
pred_noemb_test_y = model.predict([test_X], batch_size=1024, verbose=1)

# **Apply the chosen threshold (0.33), then save the submission file**

In [None]:
pred_noemb_test_y = (pred_noemb_test_y > 0.33).astype(int)
out_df = pd.DataFrame({"qid":test_data["qid"].values})
out_df['prediction'] = pred_noemb_test_y
out_df.to_csv("submission.csv", index=False)
print('Successfully saved submission')
pred_noemb_test_y