<a href="https://colab.research.google.com/github/w-sugata/Navie-Bayes/blob/main/Naive-Bayes.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Final Version

In [None]:
!pip install datasets

# Datasets

In [None]:
from datasets import load_dataset
imdb_dataset = load_dataset('imdb')
sms_dataset = load_dataset('sms_spam')

Reusing dataset imdb (/root/.cache/huggingface/datasets/imdb/plain_text/1.0.0/2fdd8b9bcadd6e7055e742a706876ba43f19faee861df134affd7a3f60fc38a1)


  0%|          | 0/3 [00:00<?, ?it/s]

Reusing dataset sms_spam (/root/.cache/huggingface/datasets/sms_spam/plain_text/1.0.0/53f051d3b5f62d99d61792c91acefe4f1577ad3e4c216fb0ad39e30b9f20019c)


  0%|          | 0/1 [00:00<?, ?it/s]

In [None]:
from sklearn.model_selection import train_test_split
import pandas as pd

# Split sms_spam into train and test sets
sms_df = pd.DataFrame(data = sms_dataset['train'])
# Rename the 'sms' column to 'text' so it matches the imdb dataset's column name.
# Renaming before the split avoids modifying slices of sms_df in place.
sms_df = sms_df.rename(columns = {'sms': 'text'})
sms_train, sms_test = train_test_split(sms_df, test_size = 0.2)

In [None]:
imdb_train = pd.DataFrame(imdb_dataset['train'])
imdb_test = pd.DataFrame(imdb_dataset['test'])

# shuffle the rows
imdb_train = imdb_train.sample(frac = 1)
imdb_test = imdb_test.sample(frac = 1)

# Clean Function

In [None]:
def clean(data):
  data = data['text'].str.replace(r'\W', ' ', regex=True) # replace non-word characters (punctuation) with spaces
  data = data.str.lower() # lowercase everything
  data = data.str.split() # split each text into a list of words

  vocabulary = []
  for words in data:
    for each_word in words:
      vocabulary.append(each_word)

  vocabulary = list(set(vocabulary))
  return vocabulary
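
The same cleaning steps can be tried on plain Python strings. This is only a minimal sketch using the standard `re` module; `tokenize` and `build_vocabulary` are hypothetical helper names that do not appear elsewhere in this notebook.

```python
import re

def tokenize(text):
    """Replace non-word characters with spaces, lowercase, and split on whitespace."""
    return re.sub(r"\W", " ", text).lower().split()

def build_vocabulary(texts):
    """Collect the unique tokens across all texts, sorted for a stable order."""
    return sorted({token for text in texts for token in tokenize(text)})

print(tokenize("Free entry!! Call NOW."))
print(build_vocabulary(["Free entry!", "free call"]))
```

Sorting is a choice made for this sketch; the `clean` function above returns the vocabulary in arbitrary set order.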

# Feature Extraction

### 0. Bag of Words

In [None]:
# The function takes cleaned data 
def bag_of_words(data):
  # create a set containing all the tokens that are only letters
  # the set will automatically filter out duplicates
  words = {w.lower() for w in data if w.isalpha()}
  # convert the set to a list before returning.
  return list(words)

### 1. Stop Words
removes stopwords

In [None]:
# takes cleaned data
# filters out stopwords and duplicates
import nltk
nltk.download('stopwords') # needed once per environment
from nltk.corpus import stopwords
stoplist = set(stopwords.words("english"))
def no_stopwords(data):
  no_stop = {w for w in data if w.lower() not in stoplist}
  return list(no_stop)

### 2. N-grams

In [None]:
# takes a list of tokens and the n-gram size n
# returns a list of strings (each is an n-gram)

def ngrams(text, n):

    return [
        " ".join(text[i : i + n]) for i in range(len(text) - (n - 1))
    ]  # list of str 
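
N-grams only carry information when the input keeps the original word order. The sketch below assumes the input is a single message's tokens in order, which is an assumption of this example (the cells further down pass the deduplicated vocabulary list instead). The function body is repeated from the cell above so the block runs by itself, and the example message is made up.

```python
def ngrams(text, n):
    # Slide a window of size n over the token list and join each window.
    return [" ".join(text[i : i + n]) for i in range(len(text) - (n - 1))]

tokens = "win a free prize now".split()  # hypothetical message, already cleaned

bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)

print(bigrams)   # ['win a', 'a free', 'free prize', 'prize now']
print(trigrams)  # ['win a free', 'a free prize', 'free prize now']

# A text shorter than n yields no n-grams at all.
print(ngrams(["hi"], 2))  # []
```

Note that every n-gram with n > 1 contains a space, so it can only be matched against features that are themselves n-grams, not against single words.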

### 3. Part of Speech

In [None]:
# takes cleaned data
# returns the words tagged as singular noun (NN), singular proper noun (NNP), or base-form verb (VB), as a list of strings
import nltk
nltk.download('averaged_perceptron_tagger')
def part_of_speech(data):
  tagged = nltk.pos_tag(data) # list of (word, tag) pairs
  pos = {word for word, tag in tagged if tag in ('NN', 'NNP', 'VB')}
  return list(pos)

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


# Baseline Function

In [None]:
def baseline(data):

  good = data[data['label'] == 0]
  bad = data[data['label'] == 1]
  prior_good = len(good) / len(data)
  prior_bad = len(bad) / len(data)

  if prior_good > prior_bad:
    return "prediction: ham messages or negative reviews"
  else:
    return "prediction: spam messages or positive reviews"
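
A majority-class baseline also has an accuracy that the later models can be compared against: always predicting the most common label is correct exactly as often as that label occurs. The sketch below illustrates this on a made-up label list; `majority_baseline` is a hypothetical helper, not part of the notebook.

```python
from collections import Counter

def majority_baseline(labels):
    """Return the most common label and the accuracy of always predicting it."""
    counts = Counter(labels)
    majority_label, majority_count = counts.most_common(1)[0]
    return majority_label, majority_count / len(labels)

labels = [0, 0, 0, 1]  # hypothetical labels: three ham/negative, one spam/positive
label, accuracy = majority_baseline(labels)
print(label, accuracy)  # 0 0.75
```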

# Learn Function

In [None]:
# takes train_data and cleaned data / featurized data
def learn(train_data, clean): 

  train_data['text_val_col'] = train_data['text'].str.replace(r'\W', ' ', regex=True) # replace non-word characters (punctuation) with spaces
  train_data['text_val_col'] = train_data['text_val_col'].str.lower()
  train_data['text_val_col'] = train_data['text_val_col'].str.split() 


  word_counts_per_sentences = {unique_word: [0] * len(train_data['text_val_col']) for unique_word in clean} # columns: unique words in the vocabulary, rows: one per text
  for index, value in enumerate(train_data['text_val_col']):
    for word in value:
      if word in word_counts_per_sentences: # dict lookup; skips words outside the vocabulary
        word_counts_per_sentences[word][index] += 1 # count how many times each word appears in each text
  df = pd.DataFrame(word_counts_per_sentences) # convert the dict to a dataframe

  df['text_val_col'] = list(train_data['text_val_col']) # tokenized texts, so len() below counts words rather than characters
  df['label_val_col'] = list(train_data['label'])

  neg = df[df['label_val_col'] == 0] # rows labeled 0 (ham / negative review)
  pos = df[df['label_val_col'] == 1] # rows labeled 1 (spam / positive review)
  
  ## Calculate prior probabilities 
  p_neg = len(neg) / len(df) # P(ham / negative)
  p_pos = len(pos) / len(df) # P(spam / positive)

  ## Calculate likelihood P(Wi|Cj) 
  n_words_per_neg = neg['text_val_col'].apply(len) # number of words per ham/negative text
  n_neg = n_words_per_neg.sum() # total number of words in ham/negative texts

  n_words_per_pos = pos['text_val_col'].apply(len) # number of words per spam/positive text
  n_pos = n_words_per_pos.sum() # total number of words in spam/positive texts

  n_vocabulary = len(clean) # vocabulary size

  alpha = 1 # Laplace smoothing parameter

  likelihood_neg = {unique_word:0 for unique_word in clean}
  likelihood_pos = {unique_word:0 for unique_word in clean}

  for word in clean:
    if word == 'text_val_col' or word == 'label_val_col':
      continue
    n_word_given_neg = neg[word].sum() # number of occurrences of the word in label-0 (ham/negative) texts
    p_word_given_neg = (n_word_given_neg + alpha) / (n_neg + alpha * n_vocabulary)
    likelihood_neg[word] = p_word_given_neg

    n_word_given_pos = pos[word].sum() # number of occurrences of the word in label-1 (spam/positive) texts
    p_word_given_pos = (n_word_given_pos + alpha) / (n_pos + alpha * n_vocabulary)
    likelihood_pos[word] = p_word_given_pos

  return p_pos, p_neg, likelihood_pos, likelihood_neg

# Source code: https://www.kdnuggets.com/2020/07/spam-filter-python-naive-bayes-scratch.html
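
The smoothed likelihood used in `learn`, (count + alpha) / (total + alpha * vocabulary size), can be checked by hand on a tiny example. The sketch below uses plain Python and made-up data; `smoothed_likelihoods` is a hypothetical helper, and it assumes every token in the class belongs to the vocabulary.

```python
from collections import Counter

def smoothed_likelihoods(docs, vocabulary, alpha=1):
    """Laplace-smoothed P(word | class) for one class, given its tokenized docs."""
    # Count only the tokens that belong to the vocabulary.
    counts = Counter(token for doc in docs for token in doc if token in vocabulary)
    total = sum(counts.values())
    denominator = total + alpha * len(vocabulary)
    return {word: (counts[word] + alpha) / denominator for word in vocabulary}

spam_docs = [["free", "prize"], ["free", "call"]]   # hypothetical spam messages
vocabulary = {"free", "prize", "call", "hello"}     # "hello" never occurs in spam

likelihoods = smoothed_likelihoods(spam_docs, vocabulary)
print(likelihoods["free"])   # (2 + 1) / (4 + 1 * 4) = 0.375
print(likelihoods["hello"])  # (0 + 1) / (4 + 1 * 4) = 0.125
```

Because of the smoothing term, the unseen word "hello" still gets a small non-zero probability, and the likelihoods over the whole vocabulary sum to 1.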

# Classify Function

In [None]:
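import re

def classify(doc):

  doc = re.sub(r'\W', ' ', doc) # replace non-word characters (punctuation) with spaces; str.replace does not interpret regex patterns
  doc = doc.lower()
  doc = doc.split()

  # start from the priors; p_neg, p_pos, likelihood_neg, and likelihood_pos
  # are the global values returned by learn()
  p_neg_given_message = p_neg
  p_pos_given_message = p_pos

  for word in doc:
    if word in likelihood_neg:
      p_neg_given_message *= likelihood_neg[word]

    if word in likelihood_pos:
      p_pos_given_message *= likelihood_pos[word]

  # label 0 = ham / negative review, label 1 = spam / positive review
  if p_neg_given_message > p_pos_given_message:
    return 0
  else:
    return 1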
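
Multiplying many probabilities below 1 can underflow to 0.0 for long texts, in which case both class scores tie and the comparison falls through to label 1. A common alternative, which is not what `classify` above does, is to add log-probabilities instead. The sketch below is a hypothetical `classify_log` that takes the priors and likelihoods as arguments; the numbers are made up to trigger the underflow.

```python
import math

def classify_log(tokens, prior_pos, prior_neg, likelihood_pos, likelihood_neg):
    """Compare classes by summed log-probabilities instead of raw products."""
    log_pos = math.log(prior_pos)
    log_neg = math.log(prior_neg)
    for token in tokens:
        if token in likelihood_pos:
            log_pos += math.log(likelihood_pos[token])
        if token in likelihood_neg:
            log_neg += math.log(likelihood_neg[token])
    return 0 if log_neg > log_pos else 1

# Hypothetical likelihoods: the word is twice as likely in the negative class.
likelihood_pos = {"dull": 1e-5}
likelihood_neg = {"dull": 2e-5}
tokens = ["dull"] * 100  # a long text

# Raw products underflow to exactly 0.0 for both classes, so they tie.
product_pos = product_neg = 0.5
for token in tokens:
    product_pos *= likelihood_pos[token]
    product_neg *= likelihood_neg[token]
print(product_pos, product_neg)  # 0.0 0.0

# The log-space scores stay finite and still separate the classes.
print(classify_log(tokens, 0.5, 0.5, likelihood_pos, likelihood_neg))  # 0
```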

# Evaluation

In [None]:
import datasets
accuracy = datasets.load_metric("accuracy")
precision = datasets.load_metric("precision")
recall = datasets.load_metric("recall")

# Run the code on the sms_spam dataset

In [None]:
baseline(sms_test)

'prediction: ham messages or positive reviews'

In [None]:
sms_clean = clean(sms_train)

In [None]:
sms_ref = sms_test['label'].to_list()

### Without feature extraction - sms_spam

In [None]:
p_pos, p_neg, likelihood_pos, likelihood_neg = learn(sms_train, sms_clean)
sms_test['prediction'] = sms_test['text'].apply(classify)
sms_test.head(50)
sms_pred = sms_test['prediction'].to_list()

In [None]:
acc = accuracy.compute(predictions=sms_pred, references=sms_ref)
pre = precision.compute(predictions=sms_pred, references=sms_ref)
rec = recall.compute(predictions=sms_pred, references=sms_ref)
print(acc)
print(pre)
print(rec)

{'accuracy': 0.9802690582959641}
{'precision': 0.9308176100628931}
{'recall': 0.9308176100628931}


### Bag of words - sms_spam

In [None]:
sms_bag = bag_of_words(sms_clean)
p_pos, p_neg, likelihood_pos, likelihood_neg = learn(sms_train, sms_bag)
sms_test['prediction'] = sms_test['text'].apply(classify)
sms_bow_pred = sms_test['prediction'].to_list()

In [None]:
acc = accuracy.compute(predictions=sms_bow_pred, references=sms_ref)
pre = precision.compute(predictions=sms_bow_pred, references=sms_ref)
rec = recall.compute(predictions=sms_bow_pred, references=sms_ref)
print(acc)
print(pre)
print(rec)

{'accuracy': 0.9802690582959641}
{'precision': 0.9363057324840764}
{'recall': 0.9245283018867925}


### No Stop Words - sms_spam

In [None]:
sms_stop = no_stopwords(sms_clean)
p_pos, p_neg, likelihood_pos, likelihood_neg = learn(sms_train, sms_stop)
sms_test['prediction'] = sms_test['text'].apply(classify)
sms_stop_pred = sms_test['prediction'].to_list()

In [None]:
acc = accuracy.compute(predictions=sms_stop_pred, references=sms_ref)
pre = precision.compute(predictions=sms_stop_pred, references=sms_ref)
rec = recall.compute(predictions=sms_stop_pred, references=sms_ref)
print(acc)
print(pre)
print(rec)

{'accuracy': 0.9524663677130045}
{'precision': 0.7704081632653061}
{'recall': 0.949685534591195}


### Part of Speech (NN, NNP, VB) - sms_spam

In [None]:
sms_pos = part_of_speech(sms_clean)
p_pos, p_neg, likelihood_pos, likelihood_neg = learn(sms_train, sms_pos)
sms_test['prediction'] = sms_test['text'].apply(classify)
sms_pos_pred = sms_test['prediction'].to_list()

In [None]:
acc = accuracy.compute(predictions=sms_pos_pred, references=sms_ref)
pre = precision.compute(predictions=sms_pos_pred, references=sms_ref)
rec = recall.compute(predictions=sms_pos_pred, references=sms_ref)
print(acc)
print(pre)
print(rec)

{'accuracy': 0.9506726457399103}
{'precision': 0.8023255813953488}
{'recall': 0.8679245283018868}


### N-grams - sms_spam

In [None]:
sms_bigram = ngrams(sms_clean, 2)
p_pos, p_neg, likelihood_pos, likelihood_neg = learn(sms_train, sms_bigram)
sms_test['prediction'] = sms_test['text'].apply(classify)
sms_bi_pred = sms_test['prediction'].to_list()

In [None]:
acc = accuracy.compute(predictions=sms_bi_pred, references=sms_ref)
pre = precision.compute(predictions=sms_bi_pred, references=sms_ref)
rec = recall.compute(predictions=sms_bi_pred, references=sms_ref)
print(acc)
print(pre)
print(rec)

{'accuracy': 0.8573991031390135}
{'precision': 0.0}
{'recall': 0.0}


  _warn_prf(average, modifier, msg_start, len(result))


In [None]:
sms_trigram = ngrams(sms_clean, 3)
p_pos, p_neg, likelihood_pos, likelihood_neg = learn(sms_train, sms_trigram)
sms_test['prediction'] = sms_test['text'].apply(classify)
sms_tri_pred = sms_test['prediction'].to_list()

In [None]:
acc = accuracy.compute(predictions=sms_tri_pred, references=sms_ref)
pre = precision.compute(predictions=sms_tri_pred, references=sms_ref)
rec = recall.compute(predictions=sms_tri_pred, references=sms_ref)
print(acc)
print(pre)
print(rec)

{'accuracy': 0.8573991031390135}
{'precision': 0.0}
{'recall': 0.0}


  _warn_prf(average, modifier, msg_start, len(result))


### Comparison 

**Accuracy**:

No feature extraction and bag-of-words tied for the highest accuracy, about 98%. The no-stop-words and part-of-speech methods were also accurate, at about 95% each. Bigrams and trigrams had lower accuracy than the other methods (about 86%), so N-grams might not be the right feature extraction for naive Bayes classification in spam detection.


**Precision**: 

Precision is a good measure to use when the cost of a false positive is high. In email spam detection, a false positive means that a ham (non-spam) email has been classified as spam. If there are many false positives (that is, precision is low), the user might lose important ham emails to the spam folder, so a spam detection model needs high precision. Bag-of-words had the highest precision (93.6%), slightly higher than no feature extraction (93.1%). Part-of-speech and no-stop-words had lower precision (80% and 77%). Bigrams and trigrams, on the other hand, did poorly, with 0% precision.


**Recall**: 

If recall is low, there are many false negatives. In spam detection, a false negative means that a spam email was classified as not spam. No-stop-words had the highest recall (95%), which means it was best at classifying spam emails as spam. No feature extraction (93%), bag-of-words (92%), and part-of-speech (87%) also had high recall. In contrast, N-grams again did poorly at classifying spam emails and had 0% recall.
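
The precision and recall definitions above can be made concrete by counting true positives, false positives, and false negatives directly. This is a minimal sketch with made-up predictions; `precision_recall` is a hypothetical helper, and returning 0.0 when a denominator is zero is a convention chosen for this example.

```python
def precision_recall(predictions, references, positive=1):
    """Compute precision and recall for the positive class from two label lists."""
    pairs = list(zip(predictions, references))
    tp = sum(p == positive and r == positive for p, r in pairs)  # spam caught
    fp = sum(p == positive and r != positive for p, r in pairs)  # ham flagged as spam
    fn = sum(p != positive and r == positive for p, r in pairs)  # spam missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

references = [1, 1, 0, 0, 0]   # hypothetical true labels (1 = spam)
predictions = [1, 0, 1, 0, 0]  # one hit, one miss, one false alarm
print(precision_recall(predictions, references))  # (0.5, 0.5)

# A classifier that never predicts spam has no true positives at all.
print(precision_recall([0, 0, 0, 0, 0], references))  # (0.0, 0.0)
```

The second call shows how a model can reach 0% precision and 0% recall while its accuracy still equals the share of ham messages.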


**Conclusion**: 

In spam detection, the cost of a false positive (classifying a ham email as spam) is higher than the cost of a false negative (classifying a spam email as ham). Since bag-of-words had the highest precision, it is the best of the methods I tried for spam detection.
N-grams did worst overall: with 0% precision and recall, the bigram and trigram models never correctly identified a spam message, so as implemented here they are not a suitable method for spam detection.

# Run the code on the imdb dataset

In [None]:
baseline(imdb_test[:1000])

'prediction: ham messages or positive reviews'

In [None]:
imdb_clean = clean(imdb_train[:1000])

In [None]:
imdb_ref = imdb_test['label'][:1000].to_list()

### Without feature extraction - imdb

In [None]:
p_pos, p_neg, likelihood_pos, likelihood_neg = learn(imdb_train[:1000], imdb_clean)
imdb_test['prediction'] = imdb_test['text'][:1000].apply(classify)
imdb_pred = imdb_test['prediction'][:1000].to_list()

In [None]:
imdb_pred = [int(i) for i in imdb_pred]
acc = accuracy.compute(predictions=imdb_pred, references=imdb_ref)
pre = precision.compute(predictions=imdb_pred, references=imdb_ref)
rec = recall.compute(predictions=imdb_pred, references=imdb_ref)
print(acc)
print(pre)
print(rec)

{'accuracy': 0.565}
{'precision': 0.5424107142857143}
{'recall': 0.9510763209393346}


### Bag of words - imdb


In [None]:
imdb_bag = bag_of_words(imdb_clean)
p_pos, p_neg, likelihood_pos, likelihood_neg = learn(imdb_train[:1000], imdb_bag)
imdb_test['prediction'] = imdb_test['text'][:1000].apply(classify)
imdb_bow_pred = imdb_test['prediction'][:1000].to_list()

In [None]:
imdb_bow_pred = [int(i) for i in imdb_bow_pred]
acc = accuracy.compute(predictions=imdb_bow_pred, references=imdb_ref)
pre = precision.compute(predictions=imdb_bow_pred, references=imdb_ref)
rec = recall.compute(predictions=imdb_bow_pred, references=imdb_ref)
print(acc)
print(pre)
print(rec)

{'accuracy': 0.564}
{'precision': 0.5418994413407822}
{'recall': 0.949119373776908}


### No Stop Words - imdb

In [None]:
imdb_stop = no_stopwords(imdb_clean)
p_pos, p_neg, likelihood_pos, likelihood_neg = learn(imdb_train[:1000], imdb_stop)
imdb_test['prediction'] = imdb_test['text'][:1000].apply(classify)
imdb_stop_pred = imdb_test['prediction'][:1000].to_list()

In [None]:
imdb_stop_pred = [int(i) for i in imdb_stop_pred]
acc = accuracy.compute(predictions=imdb_stop_pred, references=imdb_ref)
pre = precision.compute(predictions=imdb_stop_pred, references=imdb_ref)
rec = recall.compute(predictions=imdb_stop_pred, references=imdb_ref)
print(acc)
print(pre)
print(rec)

{'accuracy': 0.675}
{'precision': 0.6375739644970414}
{'recall': 0.8434442270058709}


### Part of Speech (NN, NNP, VB) - imdb

In [None]:
imdb_pos = part_of_speech(imdb_clean)
p_pos, p_neg, likelihood_pos, likelihood_neg = learn(imdb_train[:1000], imdb_pos)
imdb_test['prediction'] = imdb_test['text'][:1000].apply(classify)
imdb_pos_pred = imdb_test['prediction'][:1000].to_list()

In [None]:
imdb_pos_pred = [int(i) for i in imdb_pos_pred]
acc = accuracy.compute(predictions=imdb_pos_pred, references=imdb_ref)
pre = precision.compute(predictions=imdb_pos_pred, references=imdb_ref)
rec = recall.compute(predictions=imdb_pos_pred, references=imdb_ref)
print(acc)
print(pre)
print(rec)

{'accuracy': 0.667}
{'precision': 0.6854166666666667}
{'recall': 0.6438356164383562}


### N-grams - imdb

In [None]:
imdb_bigram = ngrams(imdb_clean, 2)
p_pos, p_neg, likelihood_pos, likelihood_neg = learn(imdb_train[:1000], imdb_bigram)
imdb_test['prediction'] = imdb_test['text'][:1000].apply(classify)
imdb_bi_pred = imdb_test['prediction'][:1000].to_list()

In [None]:
imdb_bi_pred = [int(i) for i in imdb_bi_pred]
acc = accuracy.compute(predictions=imdb_bi_pred, references=imdb_ref)
pre = precision.compute(predictions=imdb_bi_pred, references=imdb_ref)
rec = recall.compute(predictions=imdb_bi_pred, references=imdb_ref)
print(acc)
print(pre)
print(rec)

{'accuracy': 0.489}
{'precision': 0.0}
{'recall': 0.0}


  _warn_prf(average, modifier, msg_start, len(result))


In [None]:
imdb_trigram = ngrams(imdb_clean, 3)
p_pos, p_neg, likelihood_pos, likelihood_neg = learn(imdb_train[:1000], imdb_trigram)
imdb_test['prediction'] = imdb_test['text'][:1000].apply(classify)
imdb_tri_pred = imdb_test['prediction'][:1000].to_list()

In [None]:
imdb_tri_pred = [int(i) for i in imdb_tri_pred]
acc = accuracy.compute(predictions=imdb_tri_pred, references=imdb_ref)
pre = precision.compute(predictions=imdb_tri_pred, references=imdb_ref)
rec = recall.compute(predictions=imdb_tri_pred, references=imdb_ref)
print(acc)
print(pre)
print(rec)

{'accuracy': 0.489}
{'precision': 0.0}
{'recall': 0.0}


  _warn_prf(average, modifier, msg_start, len(result))


### Comparison

**Accuracy**:

No-stop-words and part-of-speech had the highest accuracy (67.5% and 66.7%). Bag-of-words and no feature extraction were lower (about 56%). Bigrams and trigrams had the lowest accuracy (48.9%), so N-grams might not be the right feature extraction for naive Bayes classification in sentiment analysis.


**Precision**: 

Precision is a good measure to use when the cost of a false positive is high. In movie review sentiment analysis, a false positive means that a negative review has been classified as positive. Part-of-speech and no-stop-words had the highest precision (68.5% and 63.8%), while bag-of-words and no feature extraction were lower (about 54%). Bigrams and trigrams, on the other hand, did poorly, with 0% precision.


**Recall**: 

If recall is low, there are many false negatives. In movie review sentiment analysis, a false negative means that a positive review was classified as negative. No feature extraction and bag-of-words did best in terms of recall (about 95%). No-stop-words was lower but still high (84%), and part-of-speech was lower again (64%). In contrast, N-grams again did poorly at classifying movie reviews and had 0% recall.


**Conclusion**: 

In movie review sentiment analysis, a good balance between precision and recall is desirable, so I will use the F1-score to determine which model did best at classifying movie reviews.

In [None]:
# F1-score: harmonic mean of precision and recall
def f1(precision, recall):
  return (2 * precision * recall) / (precision + recall)

# precision and recall values copied from the imdb results above
no_feature_f1 = f1(0.5424107142857143, 0.9510763209393346)
bow_f1 = f1(0.5418994413407822, 0.949119373776908)
stop_f1 = f1(0.6375739644970414, 0.8434442270058709)
pos_f1 = f1(0.6854166666666667, 0.6438356164383562)
print(f'f1-score of no-feature-extraction is {no_feature_f1*100:.2f}%')
print(f'f1-score of Bag-of-Words is {bow_f1*100:.2f}%')
print(f'f1-score of No-Stop-Words is {stop_f1*100:.2f}%')
print(f'f1-score of Part-of-Speech is {pos_f1*100:.2f}%')

f1-score of no-feature-extraction is 69.08%
f1-score of Bag-of-Words is 68.99%
f1-score of No-Stop-Words is 72.62%
f1-score of Part-of-Speech is 66.40%


Thus, I conclude that no-stop-words did best at classifying movie reviews, with the highest F1-score (72.62%).

N-grams again did poorly overall, with 0% precision and recall; as implemented here, they are not a good method for sentiment analysis.