<br>**Naive Bayes Classification**<font color='violet'></br>
<br>Naive bayes classification is an effective algorithm to predict the category/class of data points based on a training data set, and it works on Bayes theorem of probability to predict the class of unknown data sets. Utilizing this algorithm, we assume independency among predictors which is a strongly simplifying yet affective assumption.</br>
<br>In this project we have been given a training data set from which we have to evaluate each word's probability of appearance in each class based on its experienced frequency. </br>
<br>Steps:</br></font><font color='blue'>
1. Dealing with text data:
    1.1. Normalizing: 
    1.2. Tokenizing: using white-space and characters, we split each comment to a list
    1.3. Stemming and lemmatization:
    With stemming we cut the word to its root by usually ommitting the last alphabet letter which was further added to it to maybe form a plural word. Stemming sometimes tends to reduce the accuracy of the model and precision performance but increses recall performance.
    Lemmatization is a method to accurately identify a word's root using its part of speech tagger and vocabulary words. This may take up some disk space and require a lot of processing time but it will improve acuracy and precision.  
2. Finding word counts in training data set
3. Calculating probabilities and making predictions
</font>
    

In [1]:
from __future__ import unicode_literals
from hazm import *
from collections import deque
from parsivar import FindStems

import numpy as np
import pandas as pd
from collections import deque
import datetime
import secrets
import csv 
import nltk                                         #Natural language processing tool-kit

from nltk.stem import PorterStemmer                 # Stemmer
from nltk.corpus import stopwords
nltk.download('stopwords') #didnt actually use it
nltk.download('punkt')
from nltk.tokenize import word_tokenize

from sklearn.feature_extraction.text import CountVectorizer          #For Bag of words
from sklearn.feature_extraction.text import TfidfVectorizer          #For TF-IDF
from gensim.models import Word2Vec                                   #For Word2Vec


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Sara\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Sara\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [2]:
def import_data(filename):
    table=[]
    counter=0
    with open(filename,encoding="utf8") as csvfile:
        sreader=pd.read_csv(csvfile)
    csvfile.close()
    return sreader

data=import_data("comment_train.csv")
test=import_data("comment_test.csv")

persian_stopwords=open("persian.txt",encoding="utf8")
stopwords_list=deque([])
for line in persian_stopwords:
    stopwords_list.append(line.strip())
persian_stopwords.close()




In [3]:
test

Unnamed: 0,title,comment,recommend
0,وری گود,تازه خریدم یه مدت کار بکنه مشخص میشه کیفیت قطعاتش,recommended
1,زیاد مناسب نیست رنگ پس میده یه وقتایی موقع نوشتن,با این قیمت گزینه های بهتری هم میشه گرفت.\nروا...,not_recommended
2,پنکه گوشی,خیلی عالیه، فقط کاش از اون سمتش میشد به پاوربا...,recommended
3,دستگاه خیلی ضعیف,من این فیس براس چند روز یپش به دستم رسید و الا...,not_recommended
4,عالی و بیست,بنده یه هارد اکسترنال دارم که کابل فابریکش سال...,recommended
...,...,...,...
795,بسیار کوچیک,طراحیش قشنگه ولی داخل عکس خیلی بزرگتر ب چشم م...,not_recommended
796,لامپ چینی,این لامپ چینی هستتش کیفیت پایین . نور کم و فاق...,not_recommended
797,خوب بود,در کل از این خریدم راضی هستم و به تناسب قیمتش ...,recommended
798,کیفیت خوبی داره,تازع نصبش کردم-سرعت انتقال و نصب بازی روش عالی...,recommended


In [4]:
data

Unnamed: 0,title,comment,recommend
0,زیبا اما کم دوام,با وجود سابقه خوبی که از برند ایرانی نهرین سرا...,not_recommended
1,بسیار عالی,بسیار عالی,recommended
2,سلام,من الان ۳ هفته هست استفاده میکنم\nبرای کسایی ک...,not_recommended
3,به درد نمیخورهههه,عمرش کمه تا یه هفته بیشتر نمیشه استفاده کرد یا...,not_recommended
4,کلمن آب,فکر کنین کلمن بخرین با ذوق. کلی پولشو بدین. به...,not_recommended
...,...,...,...
5995,جنسش عالیه,خیلی جنس پارچش نرم ولطیفه خیلیم جنسش خوبه اما ...,recommended
5996,خرید محصول,سلام.واقعا فکر نمی کردم به این راحتی اصلاح کنم...,recommended
5997,تعریف,من از دیجی کالا خریدم خیلی زود دستم رسید،زیبا،...,recommended
5998,اصلا چای ماچا نیسش,یا شرکت نمیدونسته چای ماچا امپریال چیه یا واقع...,not_recommended


In [5]:
def preprocessed_data(data):
    normalizer=Normalizer()
    lemmatizer=Lemmatizer()
    stemmer=Stemmer()
    stemm=FindStems()
    for i in range(data.shape[0]):
        for j in range(3):
            normalizer.normalize(data.iloc[i,j])

    comments=deque([])
    titles=deque([])
    bag_of_words_recom=deque([])
    bag_of_words_nrecom=deque([])

    for i in range(data.shape[0]):
        title_list=word_tokenize(data.iloc[i,0])
        comment_list=word_tokenize(data.iloc[i,1])

        list_t=[stemm.convert_to_stem(x) for x in title_list if not x in stopwords_list]
        list_c=[stemm.convert_to_stem(x) for x in comment_list if not x in stopwords_list]
        title_list=deque(list_t)
        comment_list=deque(list_c)
        comments.append(comment_list)
        titles.append(title_list)
        comment_list+=title_list
        if data.loc[i,'recommend']=='recommended': 
            bag_of_words_recom+=comment_list
        else:
            bag_of_words_nrecom+=comment_list
            
    return bag_of_words_recom,bag_of_words_nrecom

def unpreprocessed_data(data):    
    comments=deque([])
    titles=deque([])
    bag_of_words_recom=deque([])
    bag_of_words_nrecom=deque([])

    for i in range(data.shape[0]):
        title_list=word_tokenize(data.iloc[i,0])
        comment_list=word_tokenize(data.iloc[i,1])
        comments.append(comment_list)
        titles.append(title_list)
        comment_list+=title_list
        if data.loc[i,'recommend']=='recommended': 
            bag_of_words_recom+=comment_list
        else:
            bag_of_words_nrecom+=comment_list

    return bag_of_words_recom,bag_of_words_nrecom

bag_of_words_recom,bag_of_words_nrecom=preprocessed_data(data)
bag_of_words_recom1,bag_of_words_nrecom1=unpreprocessed_data(data)


In [9]:
train_recommended_no=data[data.recommend=='recommended'].shape[0]
train_nrecommended_no=data[data.recommend=='not_recommended'].shape[0]
probab_recom=train_recommended_no/data.shape[0]
probab_nrecom=train_nrecommended_no/data.shape[0]

#preprocessed
total_words_recom=len(bag_of_words_recom)
total_words_nrecom=len(bag_of_words_nrecom)

occurences_recom=list(set(list(bag_of_words_recom)))
occurences_nrecom=list(set(list(bag_of_words_nrecom)))

repeats_recom=[]
repeats_nrecom=[]

for i in range(len(occurences_recom)):
    repeats_recom.append(bag_of_words_recom.count(occurences_recom[i]))

for i in range(len(occurences_nrecom)):
    repeats_nrecom.append(bag_of_words_nrecom.count(occurences_nrecom[i]))
    
probab_words_recom=[x/total_words_recom for x in repeats_recom]
probab_words_nrecom=[x/total_words_nrecom for x in repeats_nrecom]

#unpreprocessed
total_words_recom1=len(bag_of_words_recom1)
total_words_nrecom1=len(bag_of_words_nrecom1)

occurences_recom1=list(set(list(bag_of_words_recom1)))
occurences_nrecom1=list(set(list(bag_of_words_nrecom1)))

repeats_recom1=[]
repeats_nrecom1=[]

for i in range(len(occurences_recom1)):
    repeats_recom1.append(bag_of_words_recom1.count(occurences_recom1[i]))

for i in range(len(occurences_nrecom1)):
    repeats_nrecom1.append(bag_of_words_nrecom1.count(occurences_nrecom1[i]))
    
probab_words_recom1=[x/total_words_recom1 for x in repeats_recom1]
probab_words_nrecom1=[x/total_words_nrecom1 for x in repeats_nrecom1]

<br><font color='green'>Additive Smoothing</br></font><font color='orange'>
<br>When a word's frequency belonging to a class in the training data set is zero, our model will estimate the probability of the word's appearance in the other class to be zero and it will wipe out other information on the comment's classification derived from the probabilites of other existing words. So we will apply a small correction and assign a low value to the probability of the word so that no word's probability will be zero. Each time we do this, we add the mentioned value to the total frequency of the words to regularize the whole calculation.</br></font>

In [63]:
#preprocessed & smoothed
def predict_recom(word_bag_recom,word_bag_nrecom,repeats_recom,repeats_nrecom,bag):
    probab_recom=calc_probab(word_bag_recom,repeats_recom,bag)
    probab_nrecom=calc_probab(word_bag_nrecom,repeats_nrecom,bag)
    
    if probab_recom>=probab_nrecom:
        return 1
    else:
        return 0
    
def calc_probab(trained_bag,repeats,bag):
    n=0
    probab=1
    tot=len(trained_bag)
    words_probab=[]
    for word in bag:
        exists=trained_bag.count(word)
        if exists:
            idx=trained_bag.index(word)
            words_probab.append(repeats[idx])
        else:
            words_probab.append(0.5)
            tot=tot+0.5
            
    words_probab=[x/tot for x in words_probab]
    return np.prod(words_probab)
def naive_bayes_preprocessed_smoothed(test):
    normalizer=Normalizer()
    lemmatizer=Lemmatizer()
    stemmer=Stemmer()
    stemm=FindStems()
    for i in range(test.shape[0]):
        for j in range(3):
            normalizer.normalize(test.iloc[i,j])
    naive_bayes_predicts=[]
    comments=deque([])
    titles=deque([])
    bag=deque([])
    for i in range(test.shape[0]):
        title_list=word_tokenize(test.iloc[i,0])
        comment_list=word_tokenize(test.iloc[i,1])
        list_t=[stemm.convert_to_stem(x) for x in title_list if not x in stopwords_list]
        list_c=[stemm.convert_to_stem(x) for x in comment_list if not x in stopwords_list]
        title_list=deque(list_t)
        comment_list=deque(list_c)
        comments.append(comment_list)
        titles.append(title_list)
        comment_list+=title_list
        predict=predict_recom(occurences_recom,occurences_nrecom,repeats_recom,repeats_nrecom,list(comment_list))
        if predict==1:
            naive_bayes_predicts.append('recommended')
        else:
            naive_bayes_predicts.append('not_recommended')
    return naive_bayes_predicts



In [66]:
test_preprocessed=naive_bayes_preprocessed_smoothed(test)


In [67]:
#unpreprocessed and smoothed
def predict_recom(word_bag_recom,word_bag_nrecom,repeats_recom,repeats_nrecom,bag):
    probab_recom=calc_probab(word_bag_recom,repeats_recom,bag)
    probab_nrecom=calc_probab(word_bag_nrecom,repeats_nrecom,bag)
    
    if probab_recom>=probab_nrecom:
        return 1
    else:
        return 0
    
def calc_probab(trained_bag,repeats,bag):
    n=0
    probab=1
    tot=len(trained_bag)
    words_probab=[]
    for word in bag:
        exists=trained_bag.count(word)
        if exists:
            idx=trained_bag.index(word)
            words_probab.append(repeats[idx])
        else:
            words_probab.append(0.5)
            tot=tot+0.5
            
    words_probab=[x/tot for x in words_probab]
    return np.prod(words_probab)

def naive_bayes_upreprocessed_smoothed(test):
    naive_bayes_predicts=[]
    comments=deque([])
    titles=deque([])
    bag=deque([])
    for i in range(test.shape[0]):
        title_list=word_tokenize(test.iloc[i,0])
        comment_list=word_tokenize(test.iloc[i,1])
        comments.append(comment_list)
        titles.append(title_list)
        comment_list+=title_list
        predict=predict_recom(occurences_recom,occurences_nrecom,repeats_recom,repeats_nrecom,list(comment_list))
        if predict==1:
            naive_bayes_predicts.append('recommended')
        else:
            naive_bayes_predicts.append('not_recommended')
    
    return naive_bayes_predicts

In [70]:
test_unpreprocessed=naive_bayes_upreprocessed_smoothed(test)

In [71]:
#preprocessed & unsmoothed
def predict_recom(word_bag_recom,word_bag_nrecom,repeats_recom,repeats_nrecom,bag):
    probab_recom=calc_probab(word_bag_recom,repeats_recom,bag)
    probab_nrecom=calc_probab(word_bag_nrecom,repeats_nrecom,bag)
    
    if probab_recom>=probab_nrecom:
        return 1
    else:
        return 0
    
def calc_probab(trained_bag,repeats,bag):
    n=0
    probab=1
    tot=len(trained_bag)
    words_probab=[]
    for word in bag:
        words_probab.append(trained_bag.count(word))
    words_probab=[x/tot for x in words_probab]
    return np.prod(words_probab)
def naive_bayes_preprocessed_unsmoothed(test):
    normalizer=Normalizer()
    lemmatizer=Lemmatizer()
    stemmer=Stemmer()
    stemm=FindStems()
    for i in range(test.shape[0]):
        for j in range(3):
            normalizer.normalize(test.iloc[i,j])
    naive_bayes_predicts=[]
    comments=deque([])
    titles=deque([])
    bag=deque([])
    for i in range(test.shape[0]):
        title_list=word_tokenize(test.iloc[i,0])
        comment_list=word_tokenize(test.iloc[i,1])
        list_t=[stemm.convert_to_stem(x) for x in title_list if not x in stopwords_list]
        list_c=[stemm.convert_to_stem(x) for x in comment_list if not x in stopwords_list]
        title_list=deque(list_t)
        comment_list=deque(list_c)
        comments.append(comment_list)
        titles.append(title_list)
        comment_list+=title_list
        predict=predict_recom(occurences_recom,occurences_nrecom,repeats_recom,repeats_nrecom,list(comment_list))
        if predict==1:
            naive_bayes_predicts.append('recommended')
        else:
            naive_bayes_predicts.append('not_recommended')
    return naive_bayes_predicts



In [72]:
test_unsmoothed=naive_bayes_preprocessed_unsmoothed(test)

In [73]:
#unpreprocessed & unsmoothed
def predict_recom(word_bag_recom,word_bag_nrecom,repeats_recom,repeats_nrecom,bag):
    probab_recom=calc_probab(word_bag_recom,repeats_recom,bag)
    probab_nrecom=calc_probab(word_bag_nrecom,repeats_nrecom,bag)
    
    if probab_recom>=probab_nrecom:
        return 1
    else:
        return 0
    
def calc_probab(trained_bag,repeats,bag):
    n=0
    probab=1
    tot=len(trained_bag)
    words_probab=[]
    for word in bag:
        words_probab.append(trained_bag.count(word))
    words_probab=[x/tot for x in words_probab]
    return np.prod(words_probab)
def naive_bayes_unpreprocessed_unsmoothed(test):
    naive_bayes_predicts=[]
    comments=deque([])
    titles=deque([])
    bag=deque([])
    for i in range(test.shape[0]):
        title_list=word_tokenize(test.iloc[i,0])
        comment_list=word_tokenize(test.iloc[i,1])
        comments.append(comment_list)
        titles.append(title_list)
        comment_list+=title_list
        predict=predict_recom(occurences_recom,occurences_nrecom,repeats_recom,repeats_nrecom,list(comment_list))
        if predict==1:
            naive_bayes_predicts.append('recommended')
        else:
            naive_bayes_predicts.append('not_recommended')
    
    return naive_bayes_predicts



In [74]:
test_unpreprocessed_unsmoothed=naive_bayes_unpreprocessed_unsmoothed(test)

In [75]:
def calc_characteristics(output):
    recommended_true=0
    #accuracy
    comparing=0
    for i in range(test.shape[0]):
        if test['recommend'][i]==output[i]:
            comparing=comparing+1
            if test['recommend'][i]=='recommended':
                recommended_true=recommended_true+1
    accuracy=comparing/test.shape[0]
    #print("ACCURACY : {}".format(comparing/test.shape[0]))

    #precision
    recommended_bayes=len([x for x in output if x=='recommended'])
    recommended_real=test[test['recommend']=='recommended'].shape[0]
    precision=recommended_true/recommended_bayes
    #print("\nPRECISION : {}".format(precision))

    #recall
    recall=recommended_true/recommended_real
    #print("\nRECALL : {}".format(recall))

    #F1
    f1=2*(precision*recall)/(precision+recall)
    #print("\nF1 : {}".format(f1)
    return [accuracy,precision,recall,f1]
out1=calc_characteristics(test_preprocessed)
out2=calc_characteristics(test_unpreprocessed)
out3=calc_characteristics(test_unsmoothed)
out4=calc_characteristics(test_unpreprocessed_unsmoothed)

In [83]:
data = {'': ['Accuracy','Precision','Recall','F1'],
        'Preprocessed and Smoothed':  out1,
        'Not Preproprecessed and Smoothed': out2,
        'Preprocessed and Not Smoothed': out3,
        'Not Preprocessed and Not Smoothed': out4
        }
df = pd.DataFrame (data, columns = ['','Preprocessed and Smoothed','Not Preproprecessed and Smoothed','Preprocessed and Not Smoothed','Not Preprocessed and Not Smoothed'])
df

Unnamed: 0,Unnamed: 1,Preprocessed and Smoothed,Not Preproprecessed and Smoothed,Preprocessed and Not Smoothed,Not Preprocessed and Not Smoothed
0,Accuracy,0.93375,0.8575,0.755,0.5
1,Precision,0.934837,0.85049,0.769841,0.5
2,Recall,0.9325,0.8675,0.7275,0.9875
3,F1,0.933667,0.858911,0.748072,0.663866


<br><font color='green'>Analysis </br> </font>
<br><font color='orange'>We see that preprocessing alone has considerably raised the percentage of accuracy of our classification. Smoothing the values has also had a great impact on the accuracy which is inferered to affect the accuracy more than preprocessing does. With omitting both of the actions we get a very low accuracy and precision which is not desired at all.</br></font>


In [84]:
!jupyter nbconvert --to html AI_CA3_810098018.ipynb

[NbConvertApp] Converting notebook AI_CA3_810098018.ipynb to html
[NbConvertApp] Writing 346437 bytes to AI_CA3_810098018.html
