Through this notebook, I try to give a very basic introduction to a natural language processing pipeline. This notebook accompanies this [blog](http://) post on Medium.

# Introduction to the competition

Quora is a platform that empowers people to learn from each other. On this platform, people can ask questions and any member can answer them. However, some questions are intended to make a statement rather than to look for answers. These questions are labeled 'insincere'.

In this kernel we use the dataset provided in the Quora Insincere Questions Classification [competition](https://www.kaggle.com/c/quora-insincere-questions-classification), where the task is to label each question as 'insincere' or not. The training set contains 1.31 million questions labeled 0 or 1 (1 for 'insincere' and 0 for 'sincere'). Of these 1.31 million questions, about 80k are labeled 'insincere'.
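
The class balance quoted above is worth turning into numbers before modelling. The sketch below uses only the approximate figures from the text (1.31 million questions, about 80k of them insincere); the variable names are illustrative and not part of the original notebook.

```python
# Approximate figures quoted in the text above
total_questions = 1_310_000
insincere_questions = 80_000

# Share of the positive ('insincere') class
positive_rate = insincere_questions / total_questions

# Accuracy of a trivial model that always predicts 'sincere' (0)
baseline_accuracy = 1 - positive_rate

print(f"insincere share:   {positive_rate:.1%}")
print(f"all-zero accuracy: {baseline_accuracy:.1%}")
```

Since only about 6% of the questions are insincere, a model that never flags anything already reaches roughly 94% accuracy, so accuracy alone says little about how well the insincere class is detected.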



# Planning

To solve this problem, we need to build a model that classifies whether a question is sincere or not.

In [1]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import os
import operator

print(os.listdir("../input"))


['train.csv', 'sample_submission.csv', 'embeddings', 'test.csv']


In [2]:
from gensim.models import KeyedVectors

In [3]:
import re, string
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
import keras

from keras.layers import Input, Embedding, SpatialDropout1D, Bidirectional, Dense
from keras.layers import concatenate, CuDNNGRU, GlobalAveragePooling1D, GlobalMaxPooling1D
from keras.callbacks import ModelCheckpoint
from keras.optimizers import Adam
from keras.models import load_model
from keras.models import Model


from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

import tqdm
import nltk
from nltk.corpus import stopwords

Using TensorFlow backend.


In [4]:
test_df = pd.read_csv('../input/test.csv')
test_df.head()


Unnamed: 0,qid,question_text
0,0000163e3ea7c7a74cd7,Why do so many women become so rude and arroga...
1,00002bd4fb5d505b9161,When should I apply for RV college of engineer...
2,00007756b4a147d2b0b3,What is it really like to be a nurse practitio...
3,000086e4b7e1c7146103,Who are entrepreneurs?
4,0000c4c3fbe8785a3090,Is education really making good people nowadays?


In [5]:
train_df = pd.read_csv('../input/train.csv')
train_df.head()

Unnamed: 0,qid,question_text,target
0,00002165364db923c7e6,How did Quebec nationalists see their province...,0
1,000032939017120e6e44,"Do you have an adopted dog, how would you enco...",0
2,0000412ca6e4628ce2cf,Why does velocity affect time? Does velocity a...,0
3,000042bf85aa498cd78e,How did Otto von Guericke used the Magdeburg h...,0
4,0000455dfa3e01eae3af,Can I convert montra helicon D to a mountain b...,0


In [6]:
lens = train_df.question_text.str.len()
lens.mean(), lens.std(), lens.max()

(70.67883551459971, 38.78427671665139, 1017)

In [7]:
# sort=False keeps the column order and avoids the pandas FutureWarning about sorting
all_df = pd.concat([train_df, test_df], sort=False)

print("Total number of questions: ", all_df.shape[0])

Total number of questions:  1681928


In [8]:
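max_features = 100000  # vocabulary size: keep only the most frequent words
ques_len = 72          # number of tokens per question after padding/truncation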

## Preprocessing

In [9]:
UNKNOWN_WORD = "_UNK_"
END_WORD = "_END_"
NAN_WORD = "_NAN_"

In [10]:
train_df["question_text"] = train_df["question_text"].fillna(NAN_WORD)
test_df["question_text"] = test_df["question_text"].fillna(NAN_WORD)
sub = test_df[['qid']].copy()  # copy so that adding a column later does not trigger SettingWithCopyWarning

In [11]:
re_tok = re.compile(f'([{string.punctuation}“”¨«»®´·º½¾¿¡§£₤‘’])')

def clean_text(s):
    return re_tok.sub(r' \1 ', s).lower()


def clean_numbers(x):
    x = re.sub('[0-9]{5,}', '#####', x)
    x = re.sub('[0-9]{4}', '####', x)
    x = re.sub('[0-9]{3}', '###', x)
    x = re.sub('[0-9]{2}', '##', x)
    return x
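
To see what these two helpers do, here is a small self-contained sketch. The function definitions are repeated from the cell above so that it runs on its own; the sample sentences are made up for illustration.

```python
import re
import string

# Same definitions as in the cell above, repeated so the example runs on its own
re_tok = re.compile(f'([{string.punctuation}“”¨«»®´·º½¾¿¡§£₤‘’])')

def clean_text(s):
    # Surround every punctuation mark with spaces, then lowercase
    return re_tok.sub(r' \1 ', s).lower()

def clean_numbers(x):
    # Replace runs of digits with '#' placeholders, longest runs first
    x = re.sub('[0-9]{5,}', '#####', x)
    x = re.sub('[0-9]{4}', '####', x)
    x = re.sub('[0-9]{3}', '###', x)
    x = re.sub('[0-9]{2}', '##', x)
    return x

# Made-up sample inputs
cleaned = clean_text("What's up?")
masked = clean_numbers("in 2019 i paid 15 for 123456 items")

print(repr(cleaned))  # "what ' s up ? "
print(masked)         # in #### i paid ## for ##### items
```

Note that `clean_numbers` masks digits rather than removing them, and that single digits are left unchanged.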

In [12]:
%%time
print("    Cleaning train questions")
train_df["question_text"] = train_df["question_text"].apply(clean_text)
print("    Cleaning test questions")
test_df["question_text"] = test_df["question_text"].apply(clean_text)

print("    Removing numbers from train questions")
train_df["question_text"] = train_df["question_text"].apply(clean_numbers)
print("    Removing numbers from test questions")
test_df["question_text"] = test_df["question_text"].apply(clean_numbers)

    Cleaning train questions
    Cleaning test questions
    Removing numbers from train questions
    Removing numbers from test questions
CPU times: user 32.1 s, sys: 188 ms, total: 32.3 s
Wall time: 32.4 s


# Tokenize text

In [13]:
%%time
tokenizer = Tokenizer(num_words=max_features, oov_token=UNKNOWN_WORD)
tokenizer.fit_on_texts(list(train_df["question_text"]))

CPU times: user 32.6 s, sys: 80 ms, total: 32.7 s
Wall time: 32.7 s


In [14]:
%%time
train_X = tokenizer.texts_to_sequences(train_df["question_text"])
test_X = tokenizer.texts_to_sequences(test_df["question_text"])

CPU times: user 38.3 s, sys: 232 ms, total: 38.5 s
Wall time: 38.6 s


In [15]:
train_X = pad_sequences(train_X, maxlen=ques_len)
test_X = pad_sequences(test_X, maxlen=ques_len)
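
With its default arguments, `pad_sequences` pads and truncates at the front of each sequence. The sketch below mimics that behaviour in plain Python on toy data; `pad_pre` is a hypothetical helper written for illustration, not a Keras function.

```python
def pad_pre(sequences, maxlen, value=0):
    """Mimic the default behaviour of pad_sequences: pad and truncate at the front."""
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]              # keep only the last maxlen tokens
        pad = [value] * (maxlen - len(seq))    # zeros needed to reach maxlen
        padded.append(pad + seq)               # padding goes in front
    return padded

# Toy token-id sequences: one shorter and one longer than maxlen
toy_sequences = [[5, 8], [3, 9, 4, 7, 2, 6]]
result = pad_pre(toy_sequences, maxlen=4)

print(result)  # [[0, 0, 5, 8], [4, 7, 2, 6]]
```

In the notebook, every question therefore becomes a row of exactly `ques_len` token ids, with zeros in front of short questions and only the last `ques_len` tokens kept for long ones.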

In [16]:
train_y = train_df['target'].values
# test_y = test_df['target'].values

## Loading Embedding file



In [17]:
embd_file =  '../input/embeddings/paragram_300_sl999/paragram_300_sl999.txt'

In [18]:
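def load_embed(file):
    # Each line holds a word followed by its vector components, separated by spaces
    def get_coefs(word, *arr):
        return word, np.asarray(arr, dtype='float32')

    # Use a context manager so the file handle is closed after reading
    with open(file, encoding='latin') as f:
        embeddings_index = dict(get_coefs(*o.split(" ")) for o in f)

    return embeddings_index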
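
The loader above assumes a plain-text layout with one word per line, followed by its vector components. The following sketch parses two made-up lines in that layout from an in-memory file. It uses plain Python floats instead of NumPy arrays to stay dependency-free, and `parse_line` is a hypothetical helper name.

```python
import io

# Two made-up lines in the "word v1 v2 ... vn" layout (the real vectors have 300 values)
fake_file = io.StringIO("hello 0.1 0.2 0.3\nworld -0.4 0.5 0.6\n")

def parse_line(line):
    # The first field is the word, the remaining fields are the vector components
    word, *values = line.rstrip().split(" ")
    return word, [float(v) for v in values]

toy_index = dict(parse_line(line) for line in fake_file)

print(toy_index["world"])  # [-0.4, 0.5, 0.6]
```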

In [19]:
%%time
print("Extracting Paragram embedding")
embeddings_index = load_embed(embd_file)

Extracting Paragram embedding
CPU times: user 2min 18s, sys: 5.24 s, total: 2min 23s
Wall time: 2min 24s


# Creating the embedding matrix

In [20]:
%%time
# np.stack needs a sequence, so convert the dict view to a list first
all_embs = np.stack(list(embeddings_index.values()))
emb_mean,emb_std = all_embs.mean(), all_embs.std()
embed_size = all_embs.shape[1]

CPU times: user 4.65 s, sys: 1.98 s, total: 6.63 s
Wall time: 6.66 s


In [21]:
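## Rebuilding the embedding matrix
# Index 0 is reserved for padding, so the matrix needs one row more than the largest word index
nb_words = min(max_features, len(tokenizer.word_index) + 1)
# Words missing from the pretrained file keep a random vector with matching mean and std
embedding_matrix = np.random.normal(emb_mean, emb_std, (nb_words, embed_size))

for word, i in tokenizer.word_index.items():
    if i >= nb_words:
        continue
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None: 
        embedding_matrix[i] = embedding_vector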
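
The same construction can be followed on a toy vocabulary. In this sketch the dictionaries are made-up stand-ins for `tokenizer.word_index` and `embeddings_index`, and the matrix is sized as `len(word_index) + 1` rows (capped by the vocabulary limit) so that the largest kept index always fits.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up stand-ins for tokenizer.word_index and embeddings_index
toy_word_index = {"_UNK_": 1, "the": 2, "quora": 3, "zzzz": 4}
toy_embeddings = {
    "the": np.array([0.1, 0.2, 0.3], dtype="float32"),
    "quora": np.array([0.4, 0.5, 0.6], dtype="float32"),
}
toy_max_features = 4  # keep indices 0..3 only

vectors = np.stack(list(toy_embeddings.values()))
mean, std = vectors.mean(), vectors.std()

# Start from random rows drawn with the same mean and std as the pretrained vectors
n_rows = min(toy_max_features, len(toy_word_index) + 1)
matrix = rng.normal(mean, std, (n_rows, vectors.shape[1]))

for word, i in toy_word_index.items():
    if i >= n_rows:
        continue  # beyond the vocabulary cap ("zzzz" here)
    vec = toy_embeddings.get(word)
    if vec is not None:
        matrix[i] = vec  # pretrained vector; all other rows stay random

print(matrix.shape)  # (4, 3)
```

Rows 2 and 3 now hold the pretrained vectors, while row 0 (padding) and row 1 (the unknown-word token) keep their random initialisation.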
        

# Building classification model

In [22]:
input_layer = Input(shape=(ques_len,))
# Pretrained embeddings, kept frozen during training
embedding_layer = Embedding(embedding_matrix.shape[0], embedding_matrix.shape[1],
                            weights=[embedding_matrix], trainable=False)(input_layer)
x = SpatialDropout1D(0.2)(embedding_layer)
# Two stacked bidirectional GRU layers (CuDNNGRU requires a GPU)
x = Bidirectional(CuDNNGRU(90, return_sequences=True))(x)
x = Bidirectional(CuDNNGRU(90, return_sequences=True))(x)
# Summarise the sequence with average and max pooling, then concatenate both
avg_pool = GlobalAveragePooling1D()(x)
max_pool = GlobalMaxPooling1D()(x)
x = concatenate([avg_pool, max_pool])
x = Dense(256, activation="relu")(x)
# Single sigmoid unit: probability that the question is insincere
output_layer = Dense(1, activation="sigmoid")(x)
model = Model(inputs=input_layer, outputs=output_layer)
model.compile(
    loss='binary_crossentropy',
    optimizer=Adam(lr=1e-3, beta_1=0.9, beta_2=0.999, epsilon=1e-08, decay=0.0),
    metrics=['accuracy']
)

model.summary()

__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_1 (InputLayer)            (None, 72)           0                                            
__________________________________________________________________________________________________
embedding_1 (Embedding)         (None, 72, 300)      30000000    input_1[0][0]                    
__________________________________________________________________________________________________
spatial_dropout1d_1 (SpatialDro (None, 72, 300)      0           embedding_1[0][0]                
__________________________________________________________________________________________________
bidirectional_1 (Bidirectional) (None, 72, 180)      211680      spatial_dropout1d_1[0][0]        
__________________________________________________________________________________________________
bidirectio
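
The parameter counts visible in the summary above can be checked by hand. The sketch below assumes that each `CuDNNGRU` direction has three gates, each with input weights, recurrent weights, and two bias vectors; the helper names are hypothetical.

```python
def embedding_params(vocab_size, embed_dim):
    # One embed_dim-sized vector per vocabulary entry
    return vocab_size * embed_dim

def bidirectional_cudnn_gru_params(input_dim, units):
    # 3 gates, each with input weights, recurrent weights and two bias vectors
    per_direction = 3 * (units * (input_dim + units) + 2 * units)
    # Forward and backward directions have separate weights
    return 2 * per_direction

print(embedding_params(100000, 300))            # 30000000
print(bidirectional_cudnn_gru_params(300, 90))  # 211680
```

Both values match the `embedding_1` and `bidirectional_1` rows of the summary.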

In [23]:
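# ModelCheckpoint is already imported above.
# Save the model whenever validation accuracy improves. Note that the file name
# embeds the training accuracy ({acc}), while 'val_acc' decides when to save.
checkpoint = ModelCheckpoint('saved-dmodel-{acc:03f}.h5', verbose=1, monitor='val_acc', save_best_only=True, mode='auto')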

In [24]:
model.fit(train_X, train_y, batch_size=128, validation_split=0.1, callbacks=[checkpoint], epochs=8)

Train on 1175509 samples, validate on 130613 samples
Epoch 1/8

Epoch 00001: val_acc improved from -inf to 0.95806, saving model to saved-dmodel-0.955570.h5
Epoch 2/8
 202496/1175509 [====>.........................] - ETA: 4:11 - loss: 0.1020 - acc: 0.9598

# Generating the prediction

In [25]:
preds = model.predict([test_X], batch_size=1024, verbose=1)



In [26]:
preds = preds.reshape((-1, 1))

In [27]:
pred_test_y = (preds>0.5).astype(int)
sub['prediction'] = pred_test_y
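
The 0.5 cut-off used above is only one possible choice. Because the competition is scored on F1 and the classes are imbalanced, a different threshold may work better. The sketch below shows one way to pick it, assuming a held-out validation set with predicted probabilities is available; the labels and probabilities here are made up, and `f1_score_binary` is a hypothetical helper.

```python
def f1_score_binary(y_true, y_pred):
    # Count true positives, false positives and false negatives
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Made-up validation labels and predicted probabilities
val_true = [0, 0, 0, 0, 1, 1, 0, 1]
val_probs = [0.05, 0.10, 0.30, 0.45, 0.40, 0.55, 0.20, 0.90]

best_threshold, best_f1 = 0.5, -1.0
for step in range(10, 91, 5):          # thresholds 0.10, 0.15, ..., 0.90
    threshold = step / 100
    labels = [int(p > threshold) for p in val_probs]
    score = f1_score_binary(val_true, labels)
    if score > best_f1:                # keep the first threshold with the best F1
        best_threshold, best_f1 = threshold, score

print(best_threshold, round(best_f1, 3))
```

On this toy data the best F1 is reached below 0.5. On real data the threshold should be chosen on validation predictions only, never on the test set.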

In [28]:
sub.to_csv("submission.csv", index=False)