Importing the necessary libraries and data.

Download the dataset from https://www.kaggle.com/c/quora-insincere-questions-classification/data

In [0]:

import numpy as np 
import pandas as pd
import os

train = pd.read_csv('./drive/My Drive/train.csv')
test = pd.read_csv('./drive/My Drive/test.csv')

In [12]:
train.head()

Unnamed: 0,qid,question_text,target
188974,24f478324bc328608830,Which hairstyle suits a thin and parrot nosy 1...,0
969067,bdddffacb23411200857,"As a Brit that admires American conservatism, ...",1
355688,45b8639338af0d29358d,What are the best ways to use Slack in board w...,0
1121566,dbc628b2821848b7edd4,"Is there a way to make the name you go by, rat...",0
819998,a0ac336114d699926985,Is eating pork harmful?,0


In [3]:
print ('Shape of train ',train.shape)
print ('Shape of test ',test.shape)

Shape of train  (1306122, 3)
Shape of test  (375806, 2)


Now we'll see what sincere questions look like and what insincere questions look like.



In [14]:

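# print() is needed for the first sample: a notebook cell only displays its last bare expression.
print ('Taking a look at Sincere Questions')
print (train.loc[train['target'] == 0].sample(5)['question_text'])

print ('Taking a look at Insincere Questions')
print (train.loc[train['target'] == 1].sample(5)['question_text'])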

Taking a look at Insincere Questions


544746    What happened to Kim Jong-un that he decides t...
336329    Quora has an answer for anything I can imagine...
681144    Explaining why colonization contributed to ret...
250442    How powerful is the Jewish community in media ...
852460     Why does Trump always sit like he's on a toilet?
Name: question_text, dtype: object

Insincere questions are questions that spread hatred against a group of people or are not genuine questions. An insincere question has target value 1, while a sincere question has target value 0.

In [5]:
 
samp = train.sample(1)
sentence = samp.iloc[0]['question_text']
print (sentence)

Is the attachment to guns by Americans akin to an addiction like it would be with drugs?


Text Preprocessing in Python


Removing Numbers and Punctuation. We use a regular expression to remove numbers and str.translate to remove punctuation.




In [6]:

import re
import string

# Remove every run of digits.
sentence = re.sub(r'\d+', '', sentence)
print ('Sentence After removing numbers\n',sentence)

# Remove the ASCII punctuation characters listed in string.punctuation.
sentence = sentence.translate(str.maketrans('', '', string.punctuation))
print ('Sentence After Removing Punctuations\n',sentence)


Sentence After removing numbers
 Is the attachment to guns by Americans akin to an addiction like it would be with drugs?
Sentence After Removing Punctuations
 Is the attachment to guns by Americans akin to an addiction like it would be with drugs
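One detail worth knowing: string.punctuation only lists ASCII punctuation, so characters such as the curly apostrophe (’) survive the translate step. The sketch below compares the approach used above with a regex that drops every character that is neither a word character nor whitespace. The function names are made up for this illustration, and whether the stricter version helps the classifier is an assumption I have not tested.

```python
import re
import string

def strip_ascii_punctuation(text):
    """Same two steps as the cell above: drop digits, then ASCII punctuation."""
    text = re.sub(r'\d+', '', text)
    return text.translate(str.maketrans('', '', string.punctuation))

def strip_all_punctuation(text):
    """Drop digits, then anything that is not a word character or whitespace."""
    text = re.sub(r'\d+', '', text)
    return re.sub(r'[^\w\s]', '', text)

example = "Why can’t I sleep 8 hours?"
print(strip_ascii_punctuation(example))  # the curly apostrophe is still there
print(strip_all_punctuation(example))    # the curly apostrophe is removed
```

Note that \w also matches the underscore, so underscores are kept by the second function.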


Removing Stop Words
Stop words are the most common words in a language like “the”, “a”, “on”, “is”, “all”. These words do not carry important meaning and are usually removed from texts.

Stop words can be removed by using the Natural Language Toolkit (NLTK). NLTK is a set of libraries for symbolic and statistical natural language processing. It was originally developed at the University of Pennsylvania.

In [7]:

import nltk
nltk.download('punkt')
nltk.download('stopwords')
stop_words = set(nltk.corpus.stopwords.words('english'))
words_in_sentence = list(set(sentence.split(' ')) - stop_words)
print (words_in_sentence)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
['drugs', 'Americans', 'would', 'Is', 'attachment', 'guns', 'addiction', 'akin', 'like']
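Two things stand out in the output above: 'Is' was not removed, because NLTK's stop word list is lowercase and the comparison is case sensitive, and the original word order is lost, because a set is used. A minimal sketch of a filter that compares in lowercase and keeps the order is shown here. The helper name and the small stop word set are made up so that the example runs without downloading anything; in the notebook you would pass the stop_words set built from NLTK instead.

```python
def remove_stop_words(sentence, stop_words):
    """Keep words whose lowercase form is not a stop word, preserving order."""
    return [word for word in sentence.split() if word.lower() not in stop_words]

# Tiny illustrative stop word set (an assumption, not NLTK's full list).
demo_stop_words = {"is", "the", "to", "by", "an", "it", "be", "with"}

sentence = ("Is the attachment to guns by Americans akin to an addiction "
            "like it would be with drugs")
print(remove_stop_words(sentence, demo_stop_words))
```

Unlike the set difference, this version also keeps repeated words, which matters if you want word counts per question.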


Stemming of Words
Stemming is the process of reducing inflected or derived words to their stem, base or root form, generally a written word form. Ex: owed -> owe, multiply -> multipli

In [8]:

from nltk.stem import PorterStemmer
nltk.download('wordnet')  # needed by WordNetLemmatizer below
stemmer = PorterStemmer()
for i, word in enumerate(words_in_sentence):
    words_in_sentence[i] = stemmer.stem(word)
print (words_in_sentence)

# Lemmatization of Words
# Lemmatisation is the process of grouping together the different inflected forms of a word so they can be analysed as a single item. Ex: dogs -> dog.
# Difference from stemming: a stemmer strips suffixes by rule and can return non-words (multiply -> multipli),
# while a lemmatizer looks words up in a vocabulary (WordNet here) and returns dictionary forms.
# The words were already stemmed above, so lemmatizing them changes nothing in this example.

from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
for i, word in enumerate(words_in_sentence):
    words_in_sentence[i] = lemmatizer.lemmatize(word)
print (words_in_sentence)


[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.
['drug', 'american', 'would', 'Is', 'attach', 'gun', 'addict', 'akin', 'like']
['drug', 'american', 'would', 'Is', 'attach', 'gun', 'addict', 'akin', 'like']
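To make the contrast between the two ideas concrete, here is a deliberately tiny toy version of each. Neither function is NLTK's algorithm: toy_stem is a made-up suffix stripper and TOY_LEMMAS is a made-up four-word dictionary. The point is only that a rule-based stemmer needs no vocabulary and can produce non-words, while a lemmatizer depends on a lookup and returns real words.

```python
def toy_stem(word):
    """Crude rule-based stemming: strip the first matching suffix."""
    for suffix in ("ies", "ing", "ed", "s"):
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Hypothetical lookup table standing in for a real vocabulary such as WordNet.
TOY_LEMMAS = {"dogs": "dog", "studies": "study", "ran": "run", "better": "good"}

def toy_lemmatize(word):
    """Dictionary-based lemmatization: return the known base form, if any."""
    return TOY_LEMMAS.get(word, word)

for word in ["dogs", "studies", "ran", "better"]:
    print(word, "->", toy_stem(word), "|", toy_lemmatize(word))
```

The toy stemmer turns "studies" into the non-word "stud" and cannot relate "ran" to "run", while the lookup handles both but only for words it knows.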


**Constructing a Naive Bayes Classifier from Scratch.**


In [0]:

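from sklearn.model_selection import train_test_split
# The Kaggle test file has no 'target' column, so hold out 20% of the labelled
# training data for evaluation. This overwrites the test DataFrame loaded earlier.
train, test = train_test_split(train, test_size=0.2)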


The next steps are as follows:

1. Combine all the preprocessing techniques and build a dictionary that maps each word to its count in the training data.

2. Calculate the probability of each word and drop the words whose probability is below a threshold. Words below the threshold are treated as insignificant.

3. For each remaining word, compute its probability of appearing in insincere questions and in sincere questions. These are the conditional probabilities used by the naive Bayes classifier.

4. Predict using the conditional probabilities.
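Step 1 is applied to both the training and the held-out data, so the preprocessing could be wrapped in one function instead of being repeated. Below is a minimal sketch of such a helper. The name preprocess and its normalize parameter are my own additions for illustration; in this notebook you would pass the NLTK stop_words set and a function such as lambda w: lemmatizer.lemmatize(stemmer.stem(w)).

```python
import re
import string

def preprocess(sentence, stop_words, normalize=lambda word: word):
    """Remove digits and ASCII punctuation, drop stop words, normalize the rest.

    Like the cells in this notebook, it splits on single spaces and uses a set,
    so each distinct word is kept once and the order is not preserved.
    """
    sentence = re.sub(r'\d+', '', sentence)
    sentence = sentence.translate(str.maketrans('', '', string.punctuation))
    words = set(sentence.split(' ')) - set(stop_words)
    return [normalize(word) for word in words]

# Toy call with a one-word stop list and no stemming or lemmatization.
print(sorted(preprocess("Is eating pork harmful?", {"Is"})))
```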

In [0]:
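import re
import string
import nltk
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer

word_count = {}            # word -> count over all training questions
word_count_sincere = {}    # word -> count in sincere questions
word_count_insincere = {}  # word -> count in insincere questions
sincere = 0                # number of sincere training questions
insincere = 0              # number of insincere training questions

stop_words = set(nltk.corpus.stopwords.words('english'))
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()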


Preprocessing the training data. I create three dictionaries that hold the counts of words occurring in sincere questions, in insincere questions, and overall, after preprocessing each word.

In [15]:
# Iterating over the two columns directly avoids a slow train.iloc[row] lookup per access.
for question, target in zip(train['question_text'], train['target']):
    insincere += target
    sincere += (1 - target)
    sentence = re.sub(r'\d+', '', question)
    sentence = sentence.translate(str.maketrans('', '', string.punctuation))
    words_in_sentence = list(set(sentence.split(' ')) - stop_words)
    # Stem, then lemmatize, each remaining word.
    words_in_sentence = [lemmatizer.lemmatize(stemmer.stem(word)) for word in words_in_sentence]
    # Pick the per-class dictionary once per question: 0 is sincere, 1 is insincere.
    class_count = word_count_sincere if target == 0 else word_count_insincere
    for word in words_in_sentence:
        class_count[word] = class_count.get(word, 0) + 1
        word_count[word] = word_count.get(word, 0) + 1   # For all words. I use this to compute probability.

print('Done')

Done


Finding the probability of each word in the dictionary.
After that, eliminating the words which are insignificant. Insignificant words are words which have a probability of occurrence less than 0.0001.

In [16]:

word_probability = {}
total_words = 0
for i in word_count:
    total_words += word_count[i]
for i in word_count:
    word_probability[i] = word_count[i] / total_words

#Eliminating words which are insignificant. Insignificant words are words which have a probability of occurrence less than 0.0001.
print ('Total words ',len(word_probability))
print ('Minimum probability ',min (word_probability.values()))
threshold_p = 0.0001
for i in list(word_probability):   # list(dict) returns a copy of its keys, so deleting inside the loop is safe
    if word_probability[i] < threshold_p:
        del word_probability[i]
        # pop with a default removes the word if present and does nothing otherwise
        word_count_sincere.pop(i, None)
        word_count_insincere.pop(i, None)
print ('Total words ',len(word_probability))


Total words  165251
Minimum probability  1.142270994655314e-07
Total words  1580
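The same filtering idea can be tried on a toy example to see what the threshold does. This is a minimal sketch using collections.Counter; the function name frequent_words, the toy questions, and the threshold of 0.2 are all made up for the illustration and are not the values used in this notebook.

```python
from collections import Counter

def frequent_words(tokenized_questions, threshold):
    """Return {word: probability} for words whose share of all tokens is >= threshold."""
    counts = Counter(word for question in tokenized_questions for word in question)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()
            if count / total >= threshold}

# Three toy "questions", already split into words.
toy_questions = [["gun", "law"], ["gun", "tax"], ["gun", "law", "vote"]]
kept = frequent_words(toy_questions, threshold=0.2)
print(kept)  # only 'gun' (3 of 7 tokens) and 'law' (2 of 7 tokens) pass the threshold
```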


To apply the naive Bayes algorithm, we need the conditional probability of each word given the class (sincere or insincere).

In [0]:

total_sincere_words = sum(word_count_sincere.values())
cp_sincere = {}  #Conditional Probability
for i in list(word_count_sincere):
    cp_sincere[i] = word_count_sincere[i] / total_sincere_words

total_insincere_words = sum(word_count_insincere.values())
cp_insincere = {}  #Conditional Probability
for i in list(word_count_insincere):
    cp_insincere[i] = word_count_insincere[i] / total_insincere_words
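Written as formulas, the cell above estimates the probability of a word given a class from the counts, and naive Bayes then combines these per-word probabilities with the class prior. The symbols are notation I am introducing here: n_{w,c} stands for the count of word w in class c (word_count_sincere or word_count_insincere), and the denominator corresponds to total_sincere_words or total_insincere_words. The second line is the standard naive Bayes rule before any smoothing is applied.

```latex
\[
P(w \mid c) = \frac{n_{w,c}}{\sum_{w'} n_{w',c}},
\qquad c \in \{\text{sincere}, \text{insincere}\}
\]
\[
P(c \mid w_1, \dots, w_k) \;\propto\; P(c) \prod_{i=1}^{k} P(w_i \mid c)
\]
```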



Prediction

In [18]:
row_count = test.shape[0]

p_insincere = insincere / (sincere + insincere)
p_sincere = sincere / (sincere + insincere)
accuracy = 0

for row in range(0,row_count):
    sentence = test.iloc[row]['question_text']
    target = test.iloc[row]['target']
    sentence = re.sub(r'\d+','',sentence)
    sentence = sentence.translate(sentence.maketrans("","",string.punctuation))
    words_in_sentence = list(set(sentence.split(' ')) - stop_words)
    for index,word in enumerate(words_in_sentence):
        word = stemmer.stem(word)
        words_in_sentence[index] = lemmatizer.lemmatize(word)
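    # Start each class score from its prior probability.
    insincere_term = p_insincere
    sincere_term = p_sincere

    # Smoothing denominators: the number of known words for each class,
    # plus one for every word of this question that the class has not seen.
    sincere_M = len(cp_sincere)
    insincere_M = len(cp_insincere)
    for word in words_in_sentence:
        if word not in cp_insincere:
            insincere_M += 1
        if word not in cp_sincere:
            sincere_M += 1

    # Multiply in a smoothed likelihood per word, so that an unseen word
    # contributes 1/M instead of zeroing out the whole product.
    for word in words_in_sentence:
        if word in cp_insincere:
            insincere_term *= (cp_insincere[word] + (1/insincere_M))
        else:
            insincere_term *= (1/insincere_M)
        if word in cp_sincere:
            sincere_term *= (cp_sincere[word] + (1/sincere_M))
        else:
            sincere_term *= (1/sincere_M)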
        
    if insincere_term/(insincere_term + sincere_term) > 0.5:
        response = 1
    else:
        response = 0
    if target == response:
        accuracy += 1
    
print ('Accuracy is ',accuracy/row_count*100)


Accuracy is  94.13991769547326
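The prediction above multiplies many small probabilities together. That works for short questions, but for much longer texts the product can underflow to 0.0 for both classes. A common alternative, sketched below, is to add log probabilities instead and to use add-one (Laplace) smoothing. This is a different smoothing scheme from the cp + 1/M adjustment used in the cell above, not a description of it, and the function name, parameters, and toy counts are all made up for the illustration.

```python
import math

def predict_log_nb(words, counts_by_class, priors, vocab_size):
    """Return the class with the highest log posterior under add-one smoothing."""
    best_label, best_score = None, -math.inf
    for label, counts in counts_by_class.items():
        total = sum(counts.values())
        score = math.log(priors[label])
        for word in words:
            # Add-one smoothing: an unseen word gets count 1 instead of 0.
            score += math.log((counts.get(word, 0) + 1) / (total + vocab_size))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Toy counts: class 0 is sincere, class 1 is insincere; three distinct words overall.
toy_counts = {0: {"pork": 3, "eat": 2}, 1: {"hate": 4, "eat": 1}}
toy_priors = {0: 0.5, 1: 0.5}
print(predict_log_nb(["hate", "eat"], toy_counts, toy_priors, vocab_size=3))
```

Since only the comparison between the two scores matters for the decision, the normalizing division by insincere_term + sincere_term is not needed in this variant.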

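Accuracy on its own can be hard to interpret. If insincere questions are only a small share of the rows (worth checking with train['target'].mean()), a classifier that always answers 0 would also score a high accuracy. Precision, recall, and F1 for the insincere class give a clearer picture. The sketch below computes them from two lists of labels; the function name and the toy labels are made up for the illustration, and in this notebook you would collect target and response from the prediction loop into two lists first.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall, and F1 for the given positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy labels: 9 sincere questions, 1 insincere, and a model that always predicts 0.
y_true = [0] * 9 + [1]
always_sincere = [0] * 10
accuracy = sum(t == p for t, p in zip(y_true, always_sincere)) / len(y_true)
print(accuracy)                                     # 0.9
print(precision_recall_f1(y_true, always_sincere))  # (0.0, 0.0, 0.0)
```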