<h1>Quora Insincere Questions Classification</h1>

<b>Problem Statement: </b>
An existential problem for any major website today is how to handle toxic and divisive content. Quora wants to tackle this problem head-on to keep their platform a place where users can feel safe sharing their knowledge with the world.

Quora is a platform that empowers people to learn from each other. On Quora, people can ask questions and connect with others who contribute unique insights and quality answers. A key challenge is to weed out insincere questions -- those founded upon false premises, or that intend to make a statement rather than look for helpful answers.

The aim is to detect toxic and misleading content in a given question.

An insincere question is defined as a question intended to make a statement rather than look for helpful answers. Some characteristics that can signify that a question is insincere:

<ul>
    <li>Has a non-neutral tone.
        <ul>
            <li>Has an exaggerated tone to underscore a point about a group of people.</li>
            <li>Is rhetorical and meant to imply a statement about a group of people.</li>
        </ul>
    </li>
    <li>Is disparaging or inflammatory.
        <ul>
            <li>Suggests a discriminatory idea against a protected class of people, or seeks confirmation of a stereotype.</li>
            <li>Makes disparaging attacks/insults against a specific person or group of people.</li>
            <li>Based on an outlandish premise about a group of people.</li>
            <li>Disparages against a characteristic that is not fixable and not measurable.</li>
        </ul>
    </li>
    <li>Isn't grounded in reality.
        <ul>
            <li>Based on false information, or contains absurd assumptions</li>
        </ul>
    </li>
    <li>Uses sexual content (incest, bestiality, pedophilia) for shock value, and not to seek genuine answers.</li>
    
</ul>

Source: https://www.kaggle.com/c/quora-insincere-questions-classification/overview/description
<br/><br/>

<b>Evaluation Metric: F1 Score</b>

F1-Score = 2 x (precision x recall) / (precision + recall)

<b>Precision:</b> The precision is the ratio tp / (tp + fp) where tp is the number of true positives and fp the number of false positives. The precision is intuitively the ability of the classifier not to label as positive a sample that is negative.

<b>Recall:</b> The recall is the ratio tp / (tp + fn) where tp is the number of true positives and fn the number of false negatives. The recall is intuitively the ability of the classifier to find all the positive samples.
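
As a quick sanity check of the formulas above, here is a minimal sketch that computes precision, recall and F1 from raw counts. The function name `precision_recall_f1` and the counts are made up for illustration; the notebook itself imports `f1_score` from `sklearn.metrics`.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from raw counts."""
    # Guard against division by zero when there are no predicted/actual positives
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts: 60 true positives, 20 false positives, 40 false negatives
p, r, f1 = precision_recall_f1(tp=60, fp=20, fn=40)
print(f"precision={p:.4f} recall={r:.4f} f1={f1:.4f}")
```

With these counts, precision is 60/80 and recall is 60/100, so F1 sits between the two and closer to the lower value.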

<b>Dataset Description</b>

The dataset is divided into two parts:<br/>
<ol>
    <li>Train</li>
    <li>Test</li>
</ol>

<b>1. Train </b>

Number of rows: 1,306,122 (about 1.31 million).<br/>
Number of columns: 3 <br/>

Columns:
<ul>
    <li><b>qid: </b> Question Id</li>
    <li><b>question_text: </b> Question text.</li>
    <li><b>target: </b> Label indicating whether the question is insincere: 1 if the question is insincere, 0 otherwise.</li>
</ul>

<b>2. Test </b>

Number of rows: 376000 records.<br/>
Number of columns: 2 <br/>

Columns:
<ul>
    <li><b>qid: </b> Question Id</li>
    <li><b>question_text: </b> Question text.</li>
</ul>

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from bs4 import BeautifulSoup
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
import re
import operator 
from tqdm import tqdm_notebook as tqdm
from sklearn.metrics import f1_score, confusion_matrix
import warnings
warnings.filterwarnings('ignore')

### Load the dataset

In [2]:
train_df = pd.read_csv('train.csv')

In [3]:
train_df.head()

Unnamed: 0,qid,question_text,target
0,00002165364db923c7e6,How did Quebec nationalists see their province...,0
1,000032939017120e6e44,"Do you have an adopted dog, how would you enco...",0
2,0000412ca6e4628ce2cf,Why does velocity affect time? Does velocity a...,0
3,000042bf85aa498cd78e,How did Otto von Guericke used the Magdeburg h...,0
4,0000455dfa3e01eae3af,Can I convert montra helicon D to a mountain b...,0


In [4]:
train_df.shape

(1306122, 3)

### Train, Test Split

In [5]:
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

In [6]:
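def get_coefs(word, *arr):
    # Each embedding line has the form "<word> <v1> <v2> ... <vN>"
    return word, np.asarray(arr, dtype='float32')

def load_embed(file):
    # Use a context manager so the file handle is closed after reading
    with open(file, encoding='latin') as f:
        embeddings_index = dict(get_coefs(*o.split(" ")) for o in f)
    return embeddings_index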

In [7]:
%%time
print("Extracting GloVe embedding")
embed_glove = load_embed('glove.840B.300d.txt')

Extracting GloVe embedding
CPU times: user 3min 38s, sys: 6.83 s, total: 3min 45s
Wall time: 10min 34s


In [8]:
def build_vocab(texts):
    sentences = texts.apply(lambda x: x.split()).values
    vocab = {}
    for sentence in sentences:
        for word in sentence:
            try:
                vocab[word] += 1
            except KeyError:
                vocab[word] = 1
    return vocab

In [9]:
def check_coverage(vocab, embeddings_index):
    known_words = {}
    unknown_words = {}
    nb_known_words = 0
    nb_unknown_words = 0
    for word in vocab.keys():
        try:
            known_words[word.strip()] = embeddings_index[word.strip()]
            nb_known_words += vocab[word]
        except KeyError:
            # No embedding for this word: record it with its frequency
            unknown_words[word] = vocab[word]
            nb_unknown_words += vocab[word]

    print('Found embeddings for {:.2%} of vocab'.format(len(known_words) / len(vocab)))
    print('Found embeddings for  {:.2%} of all text'.format(nb_known_words / (nb_known_words + nb_unknown_words)))
    unknown_words = sorted(unknown_words.items(), key=operator.itemgetter(1))[::-1]

    return unknown_words

In [10]:
%%time
vocab = build_vocab(train_df['question_text'])

CPU times: user 9.58 s, sys: 400 ms, total: 9.98 s
Wall time: 9.98 s


In [11]:
print("Glove : ")
oov_glove = check_coverage(vocab, embed_glove)

Glove : 
Found embeddings for 33.02% of vocab
Found embeddings for  88.15% of all text


In [12]:
oov_glove[:10]

[('India?', 16384),
 ('it?', 12900),
 ("What's", 12425),
 ('do?', 8753),
 ('life?', 7753),
 ('you?', 6295),
 ('me?', 6202),
 ('them?', 6140),
 ('time?', 5716),
 ('world?', 5386)]

<b>Conclusion</b><br/>
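GloVe has vectors for only 33.02% of the vocabulary and 88.15% of all tokens in the text.
The most frequent out-of-vocabulary entries above are ordinary words with punctuation attached (for example 'India?'), so a lot of text preprocessing is needed.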
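
To illustrate why entries such as 'India?' show up as unknown, here is a small sketch. It uses a toy set as a stand-in for the GloVe vocabulary (an assumption for illustration, not the real embedding) and compares plain whitespace splitting with splitting after punctuation has been spaced out.

```python
import re

# Toy stand-in for the embedding vocabulary (hypothetical, for illustration only)
toy_embeddings = {"What", "is", "India", "?"}

question = "What is India?"

# Whitespace splitting keeps the question mark glued to the last word
tokens = question.split()
missing = [t for t in tokens if t not in toy_embeddings]

# Putting spaces around punctuation turns 'India?' into 'India' and '?'
spaced_tokens = re.sub(r"([?.,!])", r" \1 ", question).split()
missing_after = [t for t in spaced_tokens if t not in toy_embeddings]

print(tokens, missing)
print(spaced_tokens, missing_after)
```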

#### Let's check the effect of lowercasing the words

In [13]:
train_df['lowered_question'] = train_df['question_text'].apply(lambda x: x.lower())

In [14]:
vocab_low = build_vocab(train_df['lowered_question'])

In [15]:
print("Glove : ")
oov_glove = check_coverage(vocab_low, embed_glove)

Glove : 
Found embeddings for 27.38% of vocab
Found embeddings for  87.87% of all text


<b>Conclusion</b><br/>
Lowercasing everything reduced embedding coverage on both measures: vocab coverage fell from 33.02% to 27.38%, and text coverage fell from 88.15% to 87.87%.
Since blanket lowercasing hurts coverage, we instead add a lowercase entry to the embedding for every word whose original-case form has a vector but whose lowercase form does not.

In [16]:
def add_lower(embedding, vocab):
    count = 0
    for word in vocab:
        if word in embedding and word.lower() not in embedding:  
            embedding[word.lower()] = embedding[word]
            count += 1
    print(f"Added {count} words to embedding")

In [17]:
print("Glove : ")
add_lower(embed_glove, vocab)

Glove : 
Added 14725 words to embedding


In [18]:
print("Glove : ")
oov_glove = check_coverage(vocab, embed_glove)

Glove : 
Found embeddings for 33.28% of vocab
Found embeddings for  88.16% of all text


In [19]:
oov_glove[500:510], oov_glove[100:110]

([('fight,', 451),
  ('line?', 451),
  ('sites?', 451),
  ('France?', 451),
  ("women's", 450),
  ('loss?', 450),
  ('teacher?', 449),
  ('working?', 449),
  ('examples?', 449),
  ('mother?', 448)],
 [('from?', 1369),
  ('others?', 1356),
  ('Mumbai?', 1347),
  ('society?', 1341),
  ('use?', 1334),
  ('number?', 1332),
  ('with?', 1321),
  ('back?', 1319),
  ('me,', 1310),
  ('friends?', 1306)])

<b>Conclusion</b><br/>
Vocab coverage increased slightly, from 33.02% to 33.28%, and text coverage from 88.15% to 88.16%.

#### Expanding contractions

https://gist.github.com/nealrs/96342d8231b75cf4bb82

In [20]:
cList = {
  "ain't": "am not",
  "aren't": "are not",
  "can't": "cannot",
  "can't've": "cannot have",
  "'cause": "because",
  "could've": "could have",
  "couldn't": "could not",
  "couldn't've": "could not have",
  "didn't": "did not",
  "doesn't": "does not",
  "don't": "do not",
  "hadn't": "had not",
  "hadn't've": "had not have",
  "hasn't": "has not",
  "haven't": "have not",
  "he'd": "he would",
  "he'd've": "he would have",
  "he'll": "he will",
  "he'll've": "he will have",
  "he's": "he is",
  "how'd": "how did",
  "how'd'y": "how do you",
  "how'll": "how will",
  "how's": "how is",
  "I'd": "I would",
  "I'd've": "I would have",
  "I'll": "I will",
  "I'll've": "I will have",
  "I'm": "I am",
  "I've": "I have",
  "isn't": "is not",
  "it'd": "it had",
  "it'd've": "it would have",
  "it'll": "it will",
  "it'll've": "it will have",
  "it's": "it is",
  "let's": "let us",
  "ma'am": "madam",
  "mayn't": "may not",
  "might've": "might have",
  "mightn't": "might not",
  "mightn't've": "might not have",
  "must've": "must have",
  "mustn't": "must not",
  "mustn't've": "must not have",
  "needn't": "need not",
  "needn't've": "need not have",
  "o'clock": "of the clock",
  "oughtn't": "ought not",
  "oughtn't've": "ought not have",
  "shan't": "shall not",
  "sha'n't": "shall not",
  "shan't've": "shall not have",
  "she'd": "she would",
  "she'd've": "she would have",
  "she'll": "she will",
  "she'll've": "she will have",
  "she's": "she is",
  "should've": "should have",
  "shouldn't": "should not",
  "shouldn't've": "should not have",
  "so've": "so have",
  "so's": "so is",
  "that'd": "that would",
  "that'd've": "that would have",
  "that's": "that is",
  "there'd": "there had",
  "there'd've": "there would have",
  "there's": "there is",
  "they'd": "they would",
  "they'd've": "they would have",
  "they'll": "they will",
  "they'll've": "they will have",
  "they're": "they are",
  "they've": "they have",
  "to've": "to have",
  "wasn't": "was not",
  "we'd": "we had",
  "we'd've": "we would have",
  "we'll": "we will",
  "we'll've": "we will have",
  "we're": "we are",
  "we've": "we have",
  "weren't": "were not",
  "what'll": "what will",
  "what'll've": "what will have",
  "what're": "what are",
  "what's": "what is",
  "what've": "what have",
  "when's": "when is",
  "when've": "when have",
  "where'd": "where did",
  "where's": "where is",
  "where've": "where have",
  "who'll": "who will",
  "who'll've": "who will have",
  "who's": "who is",
  "who've": "who have",
  "why's": "why is",
  "why've": "why have",
  "will've": "will have",
  "won't": "will not",
  "won't've": "will not have",
  "would've": "would have",
  "wouldn't": "would not",
  "wouldn't've": "would not have",
  "y'all": "you all",
  "y'alls": "you alls",
  "y'all'd": "you all would",
  "y'all'd've": "you all would have",
  "y'all're": "you all are",
  "y'all've": "you all have",
  "you'd": "you had",
  "you'd've": "you would have",
  "you'll": "you will",
  "you'll've": "you will have",
  "you're": "you are",
  "you've": "you have"
}


In [21]:
known_contractions = []
for k in cList.keys():
    if k in embed_glove:
        known_contractions.append(k)
print('Known contractions available in embedding')
known_contractions

Known contractions available in embedding


["can't",
 "'cause",
 "didn't",
 "doesn't",
 "don't",
 "I'd",
 "I'll",
 "I'm",
 "I've",
 "it's",
 "ma'am",
 "o'clock",
 "that's",
 "you'll",
 "you're"]

##### Remove known contractions from cList

In [22]:
for k in known_contractions:
    del cList[k]

In [23]:
c_re = re.compile('(%s)' % '|'.join(cList.keys()))

def expandContractions(text, c_re=c_re):
    def replace(match):
        return cList[match.group(0)]
    return c_re.sub(replace, text)

In [24]:
expandContractions("you've, ain't")

'you have, am not'
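
One caveat with an alternation built directly from dictionary keys: Python's `re` tries the alternatives from left to right, so a shorter key such as "couldn't" can match before a longer key such as "couldn't've" that starts with it. The sketch below shows the effect on a two-entry toy dictionary and one possible remedy, sorting the keys longest first and escaping them. The names `contractions`, `build_pattern` and `expand` are hypothetical and not part of the notebook.

```python
import re

# Toy subset of a contraction map; the shorter key comes first
contractions = {
    "couldn't": "could not",
    "couldn't've": "could not have",
}


def build_pattern(mapping):
    # Longest keys first, so the longest possible contraction wins;
    # re.escape guards against regex metacharacters in the keys
    keys = sorted(mapping, key=len, reverse=True)
    return re.compile("|".join(re.escape(k) for k in keys))


pattern = build_pattern(contractions)


def expand(text):
    return pattern.sub(lambda m: contractions[m.group(0)], text)


# Alternation in dictionary order: "couldn't" matches first and "'ve" is left behind
naive = re.compile("|".join(contractions))
naive_result = naive.sub(lambda m: contractions[m.group(0)], "i couldn't've known")
sorted_result = expand("i couldn't've known")

print(naive_result)
print(sorted_result)
```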

In [25]:
train_df['cleaned_text'] = train_df['lowered_question'].apply(lambda x: expandContractions(x))

In [26]:
train_df['cleaned_text'].head()

0    how did quebec nationalists see their province...
1    do you have an adopted dog, how would you enco...
2    why does velocity affect time? does velocity a...
3    how did otto von guericke used the magdeburg h...
4    can i convert montra helicon d to a mountain b...
Name: cleaned_text, dtype: object

In [27]:
%%time
vocab = build_vocab(train_df['cleaned_text'])

CPU times: user 11.5 s, sys: 39.8 ms, total: 11.6 s
Wall time: 11.7 s


In [28]:
print("Glove : ")
oov_glove = check_coverage(vocab, embed_glove)

Glove : 
Found embeddings for 30.66% of vocab
Found embeddings for  88.43% of all text


In [29]:
oov_glove[:10]

[('india?', 16394),
 ('it?', 13158),
 ('do?', 8766),
 ('life?', 7791),
 ('why?', 7369),
 ('you?', 6314),
 ('me?', 6241),
 ('them?', 6141),
 ('time?', 5742),
 ('world?', 5525)]

### Remove punctuation

In [30]:
puncts = [',', '.', '"', ':', ')', '(', '-', '!', '?', '|', ';', "'", '$', '&', '/', '[', ']', '>', '%', '=', '#', 
          '*', '+', '\\', '•',  '~', '@', '£', '·', '_', '{', '}', '©', '^', '®', '`',  '<', '→', '°', '€', '™', 
          '›',  '♥', '←', '×', '§', '″', '′', 'Â', '█', '½', 'à', '…', '“', '★', '”', '–', '●', 'â', '►', '−', '¢',
          '²', '¬', '░', '¶', '↑', '±', '¿', '▾', '═', '¦', '║', '―', '¥', '▓', '—', '‹', '─', '▒', '：', '¼', '⊕',
          '▼', '▪', '†', '■', '’', '▀', '¨', '▄', '♫', '☆', 'é', '¯', '♦', '¤', '▲', 'è', '¸', '¾', 'Ã', '⋅', '‘', 
          '∞','∙', '）', '↓', '、', '│', '（', '»', '，', '♪', '╩', '╚', '³', '・', '╦', '╣', '╔', '╗', '▬', '❤', 
          'ï', 'Ø', '¹', '≤', '‡', '√', '∞', 'θ', '÷', 'α', '•', 'à', '−', 'β', '∅', '³', 'π', '‘','₹', 
          '´', "'", '°', '£', '€', '×', '™','√','²','—–','&','…', "’", "“", "”", "#", "{", "|", "}", "~"]

In [31]:
unknown_punctuations = []
known_puncts = []
for p in puncts:
    if p not in embed_glove:
        unknown_punctuations.append(p)
    else:
        known_puncts.append(p)
print('Known punctuations')
print(' '.join(known_puncts))
print('*'*50)
print('Unknown punctuations')
print(' '.join(unknown_punctuations))

Known punctuations
, . " : ) ( - ! ? | ; ' $ & / [ ] > % = # * + \ ~ @ _ { } ^ ` < ' & # { | } ~
**************************************************
Unknown punctuations
• £ · © ® → ° € ™ › ♥ ← × § ″ ′ Â █ ½ à … “ ★ ” – ● â ► − ¢ ² ¬ ░ ¶ ↑ ± ¿ ▾ ═ ¦ ║ ― ¥ ▓ — ‹ ─ ▒ ： ¼ ⊕ ▼ ▪ † ■ ’ ▀ ¨ ▄ ♫ ☆ é ¯ ♦ ¤ ▲ è ¸ ¾ Ã ⋅ ‘ ∞ ∙ ） ↓ 、 │ （ » ， ♪ ╩ ╚ ³ ・ ╦ ╣ ╔ ╗ ▬ ❤ ï Ø ¹ ≤ ‡ √ ∞ θ ÷ α • à − β ∅ ³ π ‘ ₹ ´ ° £ € × ™ √ ² —– … ’ “ ”


In [32]:
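def remove_punctuations(text):
    # Symbols without a GloVe vector are replaced by a space
    for p in unknown_punctuations:
        text = text.replace(p, ' ')
    # Symbols with a GloVe vector are kept, but separated from adjacent words
    for p in known_puncts:
        text = text.replace(p, ' ' + p + ' ')
    return text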

In [33]:
remove_punctuations('Hi,® xyz')

'Hi ,   xyz'

In [34]:
train_df['final_cleaned_text'] = train_df['cleaned_text'].apply(lambda x: remove_punctuations(x))

In [35]:
%%time
vocab = build_vocab(train_df['final_cleaned_text'])

CPU times: user 10.7 s, sys: 51.7 ms, total: 10.8 s
Wall time: 10.8 s


In [36]:
print("Glove : ")
oov_glove = check_coverage(vocab, embed_glove)

Glove : 
Found embeddings for 69.57% of vocab
Found embeddings for  99.58% of all text


<b>Conclusion</b><br/>
Vocab coverage jumped from 30.66% to 69.57%, and text coverage rose from 88.43% to 99.58%. So separating punctuation from words, and dropping the symbols GloVe does not know, greatly increased embedding coverage.

### Let's look at the words that still have no embedding

In [37]:
# the 10 most frequent words that have no GloVe embedding
oov_glove[:10]

[('quorans', 858),
 ('brexit', 524),
 ('cryptocurrencies', 499),
 ('redmi', 383),
 ('coinbase', 149),
 ('oneplus', 139),
 ('uceed', 123),
 ('demonetisation', 115),
 ('bhakts', 115),
 ('upwork', 111)]

In [45]:
oov_glove[-10:]

[('5dcv', 1),
 ('pakkstani', 1),
 ('venuas', 1),
 ('ohmagawd', 1),
 ('savegely', 1),
 ('1500mph', 1),
 ('anizara', 1),
 ('4afsb', 1),
 ('tepelene', 1),
 ('calead', 1)]

In [44]:
# 'quorans' has no embedding, so check whether the singular 'quoran' does
'quoran' in embed_glove

True

In [47]:
import json

In [57]:
# correct these spelling mistakes using a dict that maps each misspelling to its correction
# load the mapping from spell_corrections.json
with open('spell_corrections.json', 'r') as f:
    spell_corrections = json.load(f)

In [58]:
def correct_spellings(text):
    for k in spell_corrections.keys():
        text = text.replace(k, spell_corrections[k])
    return text
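
Note that `str.replace` works on substrings, so a key can also be rewritten where it appears inside a longer word. Here is a minimal sketch of a whole-word variant using regex word boundaries. The mapping `toy_corrections` and the function `correct_spellings_whole_words` are hypothetical; the actual contents of `spell_corrections.json` are not shown in this notebook.

```python
import re

# Hypothetical corrections, for illustration only
toy_corrections = {"wat": "what", "quorans": "quoran"}


def correct_spellings_whole_words(text, corrections):
    # \b anchors each key to word boundaries, so 'wat' inside 'water' is left alone;
    # keys are sorted longest first and escaped before being joined
    keys = sorted(corrections, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(k) for k in keys) + r")\b")
    return pattern.sub(lambda m: corrections[m.group(1)], text)


text = "wat do quorans think about water"

# Plain substring replacement also rewrites the 'wat' inside 'water'
naive = text
for wrong, right in toy_corrections.items():
    naive = naive.replace(wrong, right)

whole = correct_spellings_whole_words(text, toy_corrections)

print(naive)
print(whole)
```

Whether whole-word matching is preferable depends on the mapping: substring replacement is sometimes wanted, for example to fix a misspelled stem inside several inflected forms.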

In [59]:
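def remove_numerics(text):
    # Despite the name, this replaces every run of characters other than ASCII letters
    # (digits, punctuation, accented letters) with a single space
    return re.sub('[^A-Za-z]+', ' ', text)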

In [60]:
train_df['final_cleaned_text'] = train_df['final_cleaned_text'].apply(lambda x: correct_spellings(x))

In [61]:
train_df['final_cleaned_text'] = train_df['final_cleaned_text'].apply(lambda x: remove_numerics(x))

In [62]:
%%time
vocab = build_vocab(train_df['final_cleaned_text'])

CPU times: user 10.4 s, sys: 11.9 ms, total: 10.5 s
Wall time: 10.5 s


In [63]:
print("Glove : ")
oov_glove = check_coverage(vocab, embed_glove)

Glove : 
Found embeddings for 71.59% of vocab
Found embeddings for  99.61% of all text


In [64]:
oov_glove[:20]

[('redmi', 387),
 ('coinbase', 149),
 ('oneplus', 146),
 ('uceed', 124),
 ('bhakts', 115),
 ('upwork', 111),
 ('machedo', 108),
 ('adityanath', 106),
 ('boruto', 102),
 ('alshamsi', 92),
 ('dceu', 90),
 ('litecoin', 87),
 ('iiest', 86),
 ('unacademy', 86),
 ('zerodha', 80),
 ('tensorflow', 74),
 ('doklam', 70),
 ('kavalireddi', 69),
 ('muoet', 66),
 ('nicmar', 62)]

In [65]:
oov_glove[-10:]

[('bandhup', 1),
 ('seago', 1),
 ('bhashani', 1),
 ('ingredio', 1),
 ('dcv', 1),
 ('ohmagawd', 1),
 ('savegely', 1),
 ('anizara', 1),
 ('tepelene', 1),
 ('calead', 1)]

In [66]:
def get_top_sent(t):
    quorans_sentence = []
    for s in train_df['final_cleaned_text']:
        if t in s:
            quorans_sentence.append(s)
        if len(quorans_sentence) > 5:
            break
    return quorans_sentence

In [67]:
get_top_sent('redmi')

['which is best changer for redmi note ',
 'will the redmi note pro indian model work in the us ',
 'when redmi note get miui update ',
 'what are some android apps for split screen multi tasking available for redmi note marshmallow ',
 'what are the fascinating things about redmi a ',
 'what is the best in redmi note ']

In [68]:
train_df.head()

Unnamed: 0,qid,question_text,target,lowered_question,cleaned_text,final_cleaned_text
0,00002165364db923c7e6,How did Quebec nationalists see their province...,0,how did quebec nationalists see their province...,how did quebec nationalists see their province...,how did quebec nationalists see their province...
1,000032939017120e6e44,"Do you have an adopted dog, how would you enco...",0,"do you have an adopted dog, how would you enco...","do you have an adopted dog, how would you enco...",do you have an adopted dog how would you encou...
2,0000412ca6e4628ce2cf,Why does velocity affect time? Does velocity a...,0,why does velocity affect time? does velocity a...,why does velocity affect time? does velocity a...,why does velocity affect time does velocity af...
3,000042bf85aa498cd78e,How did Otto von Guericke used the Magdeburg h...,0,how did otto von guericke used the magdeburg h...,how did otto von guericke used the magdeburg h...,how did otto von guericke used the magdeburg h...
4,0000455dfa3e01eae3af,Can I convert montra helicon D to a mountain b...,0,can i convert montra helicon d to a mountain b...,can i convert montra helicon d to a mountain b...,can i convert montra helicon d to a mountain b...


In [70]:
train_df[['qid', 'target', 'final_cleaned_text']].head()

Unnamed: 0,qid,target,final_cleaned_text
0,00002165364db923c7e6,0,how did quebec nationalists see their province...
1,000032939017120e6e44,0,do you have an adopted dog how would you encou...
2,0000412ca6e4628ce2cf,0,why does velocity affect time does velocity af...
3,000042bf85aa498cd78e,0,how did otto von guericke used the magdeburg h...
4,0000455dfa3e01eae3af,0,can i convert montra helicon d to a mountain b...


In [71]:
train_df[['qid', 'target', 'final_cleaned_text']].to_csv('final_cleaned_df.csv', index=False)