
# Naive Bayes (NLP Task) #
The dataset is taken from https://www.sciencedirect.com/science/article/pii/S2352340920311252?via%3Dihub

The training set is annotated_okezone.csv and the test set is annotated_fimela.csv. Both files have been uploaded to my GitHub repo.

This notebook explains how to build a Naive Bayes classifier from scratch.

First, I import two libraries: pandas and re.
pandas is used to read the CSV files, and re (regex) is used to apply regular expressions to the texts.

In [1]:
import pandas as pd
import re

Define the sources for the training set and the test set. The stopwords list comes from https://github.com/masdevid/ID-Stopwords/blob/master/id.stopwords.02.01.2016.txt. I considered stopwords removal because I expected these words to add little information to the texts, so they could be omitted.


In [2]:
# Dataset source
training_source = "https://raw.githubusercontent.com/vincentjunitio00/Multinomial-Naive-Bayes/main/annotated_okezone%20(training%20set).csv"
test_source = "https://raw.githubusercontent.com/vincentjunitio00/Multinomial-Naive-Bayes/main/annotated_fimela%20(test%20set).csv"

# Stopwords source
stopwords = "https://raw.githubusercontent.com/masdevid/ID-Stopwords/master/id.stopwords.02.01.2016.txt"

In [3]:
def load_csv(source):
  '''
    This function loads a dataset from its source.
    returns the dataset as a pandas DataFrame
  '''
  dataset = pd.read_csv(source)
  return dataset

# Training Phase #

In [4]:
training_set = load_csv(training_source) # Load training set
training_set['label'].value_counts() # Check the dataset label's distribution

clickbait        759
non-clickbait    741
Name: label, dtype: int64

Next, I preprocess the training set. The texts are in the 'title' column, which I save as a list. After that, the function optionally applies regex cleaning and stopwords removal, then returns the clean training texts together with their corpus.

The regex lines below are commented out, and stopwords removal is optional. I tried three combinations: regex + stopwords removal, regex + no stopwords removal, and no regex + no stopwords removal. The last one gave the best results. Feel free to uncomment the code below to see how the results change.
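
One thing to watch for when enabling punctuation removal: inside a regex character class, an unescaped hyphen between two characters defines a range rather than a literal hyphen. The sketch below uses a made-up title and a shortened pattern (both are illustrative assumptions, not the notebook's exact regex) to show how a sequence like `=-_` would also strip uppercase letters.

```python
import re

title = "Ini Penyebab A-b +1"  # hypothetical title for illustration

# Unescaped hyphen: "=-_" is read as the range '=' .. '_', which covers A-Z.
unescaped = re.sub(r"[+=-_]", "", title)

# Escaped hyphen: only the four literal characters + = - _ are removed.
escaped = re.sub(r"[+=\-_]", "", title)

print(unescaped)
print(escaped)
```

Escaping the hyphen, or placing it first or last in the class, keeps it literal.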

In [5]:
def preprocessing(training_set, stopwords):
  '''
    This function preprocesses the texts and builds their corpus.
    returns the clean texts and the corpus (word -> frequency)
  '''
  training_list = training_set['title'].to_list() # Get list of texts in the training set title column.
  clean_text = [] # A list to save the clean text

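  # Optional regex cleaning (commented out because it lowered the evaluation metrics)
  for text in training_list:
    # text = text.lower() # Lower case the text
    # text = re.sub(r"[,.\"!@#$%^&*(){}?/;`~:<>+=\-_\']", "", text) # Punctuation removal (hyphen escaped so that "=-_" is not read as a range covering A-Z)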
    # text = re.sub(r"rp[0-9]+", "", text) # Remove currency and its value
    # text = re.sub(r"[0-9]+", "", text) # Remove any numbers
    # text = re.sub(r"u-|-", "", text) # Remove unusual sign
    clean_text.append(text) # Append all the clean text into clean_text list
  
  corpus = {} # A dictionary that contains all the words in the clean_text and its frequency

  # Load the stopword list (only needed by the commented-out loop further down)
  stopword_list = list(pd.read_csv(stopwords, header=None)[0])

  # Count word frequencies without stopwords removal
  for i in clean_text:
    sentence = i.split()
    for j in sentence:
      if j not in corpus:
        corpus[j] = 1
      else:
        corpus[j] += 1

  # Count word frequencies with stopwords removal (alternative to the loop above)
  # for text in clean_text:
  #   sentence = text.split()
  #   for word in sentence:
  #     if word not in stopword_list:
  #       if word not in corpus:
  #         corpus[word] = 1
  #       else:
  #         corpus[word] += 1
  return clean_text, corpus

**Create corpus / vocabulary**

In [6]:
clean_training_set, corpus = preprocessing(training_set, stopwords)
print(corpus) # Print corpus to see what it looks like

{'Ini': 140, 'Penyebab': 9, 'Jamaah': 6, 'Tertipu': 2, 'Penggunaan': 3, 'Visa': 1, 'Non-Haji': 1, 'Balita': 2, 'di': 425, 'Bogor': 3, 'Tewas': 9, 'dengan': 65, 'Luka': 3, 'Lebam,': 1, 'Ibu': 37, 'Tiri': 1, 'Ditetapkan': 2, 'Tersangka': 15, 'Demi': 7, 'Keadilan,': 1, 'Pria': 21, 'Habiskan': 3, 'Rp526': 1, 'Juta': 13, 'Lawan': 7, 'Denda': 1, 'Tilang': 1, 'Rp1': 1, 'Claudia': 2, 'Emanuela': 1, 'Santoso': 1, 'Harumkan': 1, 'Indonesia': 78, 'The': 8, 'Voice': 3, 'of': 2, 'Germany': 1, 'Kalah': 10, 'Saing': 2, 'Monza,': 1, 'Bottas': 2, 'Akui': 14, 'Ketangguhan': 2, 'Leclerc': 2, 'AFC': 1, 'Solidarity': 1, 'Cup': 1, 'Jadi': 70, 'Jalur': 4, 'Alternatif': 3, 'Timnas': 25, 'ke': 81, 'Piala': 5, 'Asia': 5, '2023': 1, 'Bertitel': 1, 'Edisi': 2, 'Khusus,': 1, 'Vespa': 1, 'Sprint': 1, 'Carbon': 1, 'Dibanderol': 1, 'Rp49,8': 1, 'juta': 2, 'Finis': 8, 'Ke-11': 1, 'MotoGP': 29, 'San': 16, 'Marino,': 4, 'Zarco': 5, 'Cukup': 5, 'Puas': 9, 'Rodgers': 1, 'Kesulitan': 4, 'Jaga': 4, 'Maddison': 1, 'dari': 70

In [7]:
print("Training set length before cleaning:", len(training_set)) # Check the length before cleaning
print("Training set length after cleaning:", len(clean_training_set)) # Check the length after cleaning

Training set length before cleaning: 1500
Training set length after cleaning: 1500


**Set corpus / vocabulary length to 1000**

In [8]:
# Set corpus/ vocabulary to the 1000 most frequent unique words
print("Length of corpus:", len(corpus)) # Check the length of corpus / vocabulary

vocab_len = 1000 # Set the vocabulary length to be 1000

# To get the 1000 most frequent unique words in the corpus, we have to sort it by its frequency
corpus_sort = sorted(corpus.items(), key=lambda x: x[1], reverse=True)
corpus_1000 = dict(corpus_sort[:vocab_len])
print("Corpus length for training now is {}".format(len(corpus_1000)))

Length of corpus: 5191
Corpus length for training now is 1000
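
As a side note, the counting loop and the sort-and-slice step can also be written with `collections.Counter` from the standard library. This is only a sketch on made-up titles (`toy_titles`, `toy_corpus`, and `k` are illustrative names, not part of the notebook); it is meant to show the idea, not to replace the code above.

```python
from collections import Counter

# Toy titles standing in for clean_text (hypothetical data, not from the dataset).
toy_titles = ["Ini Penyebab Banjir", "Ini Cara Cegah Banjir", "Cara Ini Ampuh"]

# Count every whitespace-separated token across all titles.
toy_corpus = Counter(word for title in toy_titles for word in title.split())

# Keep only the k most frequent words, highest count first.
k = 3
toy_top_k = dict(toy_corpus.most_common(k))
print(toy_top_k)
```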


Insert a new column 'title_clean' that contains the clean training text. I drop the 'title' column to avoid confusion about which text to work with later on.

In [9]:
training_set['title_clean'] = clean_training_set
training_set = training_set.drop(['title'], axis=1)

I separate the text based on its label / class (1 for clickbait and 0 for non clickbait).

In [10]:
clickbait_title = training_set[training_set['label_score']==1]['title_clean'].to_list() # List of clickbait texts
nonclickbait_title = training_set[training_set['label_score']==0]['title_clean'].to_list() # List of non clickbait texts

print(clickbait_title)
print(nonclickbait_title)

['Ini Penyebab Jamaah Tertipu Penggunaan Visa Non-Haji', 'Demi Keadilan, Pria Ini Habiskan Rp526 Juta Lawan Denda Tilang Rp1 Juta', 'Titi DJ Borong Peserta Blind Audition The Voice Indonesia di Episode 6', 'Sukses di ,, Riri Muha Hadirkan Lagu ', 'Memopulerkan Fesyen Ramah Lingkungan untuk Pria, Bakal Jadi Tren , Ya?   ', 'Koleksi Mobil Menpora Imam Nahrawi, Termurah Rp100 juta', 'Lolos ke Babak Kedua, Rinov/Pitha Ingin Lebih Jaga Fokus', 'Asyik Joget, Nikita Mirzani Tepergok Tak Kenakan ', 'Kemenhub Dapat Tambahan Anggaran Rp441,5 Miliar di 2020, Ini Rinciannya', 'Jokowi Tinjau Penanganan Karhutla Riau, Wali Kota Pekanbaru Malah ke Kanada', 'Aktivis HAM HS Dillon Meninggal, KPK Sampaikan Belasungkawa', 'Punya Gigi Gingsul, Bahaya atau Tidak?', 'Heboh Ajakan Pancing Hujan dengan Baskom Air Garam, Ini Penjelasan BMKG', ' Pendarahan, Lucinta Luna Kaget Dengar Hasil USG', 'Miris, Anak di Bawah Umur Ikut Demo Dukung Revisi UU KPK', 'Warna Rambut , Lagi Hits di Instagram, Coba Yuk!', 'Ungga

**Parameter Estimation**

In [11]:
# Calculate prior probability P(cj)
prior_probability = {} # Dictionary that stores the prior probability of each class
prior_probability['clickbait'] = len(clickbait_title) / len(clean_training_set) # Prior of the clickbait class
prior_probability['non_clickbait'] = len(nonclickbait_title) / len(clean_training_set) # Prior of the non clickbait class

# The two priors should sum to 1; compare with a tolerance because of floating-point rounding
assert abs(prior_probability['clickbait'] + prior_probability['non_clickbait'] - 1) < 1e-9

In [12]:
def count_parameter(title, alpha, vocab_len, add_1 = False):
  '''
    This function calculates P(wi|cj) which needs 4 parameters,
    - title: list of texts
    - alpha: constant for add 1 smoothing
    - vocab_len: vocabulary length for add 1 smoothing
    - add_1: to calculate the parameter with or without add 1 smoothing, default: False (means no add 1 smoothing)
    returns parameter and the words frequency in the texts.
  '''
  parameter = {}
  list_ = []
  count_dictionary = {}

  # Split every text into single words and append them to list_
  for text in title:
    text = text.split()
    for word in text:
      list_.append(word)

  total = len(list_)  # Calculate the total length of list_
  distinct_word = list(set(list_)) # Get the list of distinct words in list_

  # Calculate the P(wi|cj)
  for distinct in distinct_word:
    count = 0
    for word in list_:
      if word == distinct:
        count += 1
    parameter[distinct] = (count + alpha) / (total + alpha * vocab_len)
    count_dictionary[distinct] = count # The word's frequency in the text (this is needed for the prediction task)

  # If add_1 is True, add a '<OOV>' entry for out-of-vocabulary words.
  if add_1:
    parameter['<OOV>'] = (alpha) / (total + alpha * vocab_len)
    count_dictionary['<OOV>'] = 1

  return parameter, count_dictionary
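
The nested loop above rescans the full word list once per distinct word. A single-pass variant is possible with `collections.Counter`; the sketch below assumes the same inputs and the same formula, and `count_parameter_fast` is a hypothetical name introduced here for illustration only.

```python
from collections import Counter

def count_parameter_fast(titles, alpha, vocab_len, add_1=False):
    """Hypothetical single-pass variant of count_parameter (same formula)."""
    # Count each token once instead of rescanning the list per distinct word.
    counts = Counter(word for text in titles for word in text.split())
    total = sum(counts.values())
    denominator = total + alpha * vocab_len

    parameter = {word: (count + alpha) / denominator
                 for word, count in counts.items()}
    count_dictionary = dict(counts)

    if add_1:
        parameter['<OOV>'] = alpha / denominator
        count_dictionary['<OOV>'] = 1
    return parameter, count_dictionary

# Toy usage: 5 tokens, alpha = 1, assumed vocabulary size of 10.
toy_param, toy_count = count_parameter_fast(["a b a", "b c"], 1, 10, True)
print(toy_param)
```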

In [13]:
# To calculate the conditional probability without using add 1 smoothing, set alpha and vocab_len to 0
parameter_clickbait,_ = count_parameter(clickbait_title, 0, 0)
parameter_nonclickbait,_ = count_parameter(nonclickbait_title, 0, 0)

print("Clickbait probabilities are as follows\n", parameter_clickbait)
print("=========================================")
print("Nonclickbait probabilities are as follows\n", parameter_nonclickbait)

Clickbait probabilities are as follows
 {'Bayaran?': 0.00013954786491766677, 'Tata': 0.00013954786491766677, 'Bukan': 0.00027909572983533354, 'Kuntilanak': 0.00013954786491766677, 'Keluarga': 0.00027909572983533354, 'Merasa': 0.0005581914596706671, 'Takluk': 0.00013954786491766677, 'Atacama,': 0.00013954786491766677, 'Anaknya': 0.00027909572983533354, 'Perkawinan': 0.00013954786491766677, 'Justru': 0.00013954786491766677, '50': 0.00027909572983533354, 'Bertarung': 0.00013954786491766677, 'Cerolline,': 0.00013954786491766677, 'Brandnya': 0.00013954786491766677, 'Solskjaer:': 0.00027909572983533354, 'Sekolah': 0.00013954786491766677, 'Yang': 0.00013954786491766677, 'Top': 0.00027909572983533354, 'Baru': 0.0022327658386826683, 'Wajah': 0.0004186435947530003, 'Welterode': 0.00013954786491766677, 'Remaja': 0.00013954786491766677, 'Selain': 0.0005581914596706671, "'Skak'": 0.00013954786491766677, 'Arifin': 0.00013954786491766677, 'Akan': 0.0013954786491766676, 'Marino': 0.000558191459670667

In [14]:
# To calculate the conditional probability with add 1 smoothing, set alpha to 1 and vocab_len to vocab_len, and add_1 to True
add1_clickbait, count_clickbait = count_parameter(clickbait_title, 1, vocab_len, True)
add1_nonclickbait, count_nonclickbait = count_parameter(nonclickbait_title, 1, vocab_len, True)

print("Clickbait probabilities after add-1 are as follows\n", add1_clickbait)
print("=========================================")
print("Nonclickbait probabilities after add-1 are as follows\n", add1_nonclickbait)

Clickbait probabilities after add-1 are as follows
 {'Bayaran?': 0.0002449179524859172, 'Tata': 0.0002449179524859172, 'Bukan': 0.0003673769287288758, 'Kuntilanak': 0.0002449179524859172, 'Keluarga': 0.0003673769287288758, 'Merasa': 0.0006122948812147931, 'Takluk': 0.0002449179524859172, 'Atacama,': 0.0002449179524859172, 'Anaknya': 0.0003673769287288758, 'Perkawinan': 0.0002449179524859172, 'Justru': 0.0002449179524859172, '50': 0.0003673769287288758, 'Bertarung': 0.0002449179524859172, 'Cerolline,': 0.0002449179524859172, 'Brandnya': 0.0002449179524859172, 'Solskjaer:': 0.0003673769287288758, 'Sekolah': 0.0002449179524859172, 'Yang': 0.0002449179524859172, 'Top': 0.0003673769287288758, 'Baru': 0.0020818025961302964, 'Wajah': 0.0004898359049718344, 'Welterode': 0.0002449179524859172, 'Remaja': 0.0002449179524859172, 'Selain': 0.0006122948812147931, "'Skak'": 0.0002449179524859172, 'Arifin': 0.0002449179524859172, 'Akan': 0.0013470487386725448, 'Marino': 0.0006122948812147931, 'Hayya'

The main difference is that, with add-1 smoothing, a word that was not seen in a class during training still gets a probability greater than 0, so it no longer zeroes out the product of the other probabilities in the same sentence.
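
A small numeric sketch may make this concrete. All numbers below are invented for illustration (a class with 4 tokens and an assumed vocabulary size of 5); the smoothed formula is the same `(count + alpha) / (total + alpha * vocab_len)` used in `count_parameter`.

```python
# Toy counts for one class (hypothetical numbers, not taken from the dataset).
word_counts = {"Ini": 3, "Cara": 1}
total_words = sum(word_counts.values())  # 4 tokens in this class
vocab_size = 5                           # assumed vocabulary size
alpha = 1                                # add-1 smoothing constant

def p_unsmoothed(word):
    # Relative frequency; unseen words get exactly 0.
    return word_counts.get(word, 0) / total_words

def p_add1(word):
    # Add-1 estimate; unseen words get alpha / (total + alpha * vocab_size).
    return (word_counts.get(word, 0) + alpha) / (total_words + alpha * vocab_size)

sentence = ["Ini", "Cara", "Baru"]  # "Baru" never appears in word_counts

score_unsmoothed = 1.0
score_add1 = 1.0
for word in sentence:
    score_unsmoothed *= p_unsmoothed(word)
    score_add1 *= p_add1(word)

print(score_unsmoothed)  # the single unseen word forces the product to 0
print(score_add1)        # stays positive: (4/9) * (2/9) * (1/9)
```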

# Test Phase #

Load the test set with load_csv function.

In [15]:
# Load the test set with load_csv function
test_set = load_csv(test_source)
print(test_set)

                                                 title  ... label_score
0    Lewat Seni Anak-Anak akan Tampil Percaya Diri ...  ...           0
1         5 Manfaat Pilates untuk Ibu Hamil, Apa Saja?  ...           1
2    Pentingnya Sarapan dengan Makanan Padat untuk ...  ...           1
3    Selalu Ingin Tahu, 5 Zodiak Ini Tidak Bisa Men...  ...           1
4           3 Jenis Diet untuk Organ Intim Lebih Sehat  ...           1
..                                                 ...  ...         ...
695         Essential Oil Terbaik untuk Deodoran Alami  ...           0
696  Pentingnya Menggunakan Essence dan Manfaatnya ...  ...           1
697  Ketiak Cerah Alami dengan Deodoran Dry Serum P...  ...           0
698  FIMELA FEST 2019: 3 Faktor Seseorang Melakukan...  ...           1
699  Rekomendasi Produk Scrub Wajah dengan Harga ya...  ...           1

[700 rows x 3 columns]


In [16]:
# Preprocess the test set, remember that preprocessing returns clean text and its corpus. 
clean_test_set, _ = preprocessing(test_set, stopwords) # I will not be using the corpus of test set so I save it with _

**Prediction**

$Prediction(class|text) = prior\_probability * \prod P(word|class)^{number\_of\_word\_in\_text}$

Let's say my text is "hello I am hello". 

$Prediction(class|text) = prior\_probability * P('hello'|class)^2 * P('I'|class) * P('am'|class)$

Since I do not count word frequencies in the text beforehand, I multiply in the probability of each word in turn, from the first word of the sentence to the last. The product is the same, so my formula becomes:

$Prediction(class|text) = prior\_probability * P('hello'|class) * P('I'|class) * P('am'|class) * P('hello'|class)$
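
The two forms give the same value because multiplication does not depend on order or grouping. A quick check on the "hello I am hello" example, using made-up probabilities (`p_word` and `prior` are illustrative assumptions):

```python
from collections import Counter
import math

# Hypothetical conditional probabilities P(word|class) and prior for one class.
p_word = {"hello": 0.2, "I": 0.1, "am": 0.05}
prior = 0.5
tokens = "hello I am hello".split()

# Form 1: raise each distinct word's probability to its count in the text.
by_count = prior
for word, n in Counter(tokens).items():
    by_count *= p_word[word] ** n

# Form 2: multiply one factor per token, in reading order.
by_token = prior
for word in tokens:
    by_token *= p_word[word]

print(math.isclose(by_count, by_token))
```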

In [17]:
def predict_class(test_set, prior_probability, add1_clickbait, add1_nonclickbait):
  '''
    This function predicts the class of texts. This function has 4 parameters:
    - test_set: list of texts
    - prior_probability: prior probability dictionary
    - add1_clickbait: conditional probability of clickbait class after add 1 smoothing
    - add1_nonclickbait: conditional probability of nonclickbait class after add 1 smoothing
    returns list of predicted class.
  '''
  words = []
  predicted = []

  # Loop through every line in the test_set
  for text in test_set:
    temp_clickbait = prior_probability['clickbait'] # Initialize the clickbait score with the prior probability of the clickbait class
    temp_nonclickbait = prior_probability['non_clickbait'] # Initialize the non clickbait score with the prior probability of the non clickbait class
    
    words = text.split() # Split the text into list of words

    # Loop through every word in words
    for word in words:
      if word in add1_clickbait: # for word in clickbait class
        temp_clickbait *= ((add1_clickbait[word])) # Multiply temp_clickbait with the conditional probability of the word in clickbait class
      else: # for word not in clickbait class (out of vocabulary)
        temp_clickbait *= (add1_clickbait['<OOV>']) # Multiply temp_clickbait with the conditional probability of <OOV> in clickbait class
      
      if word in add1_nonclickbait: # for word in non clickbait class
        temp_nonclickbait *= ((add1_nonclickbait[word])) # Multiply temp_nonclickbait with the conditional probability of the word in non clickbait class
      else: # for word not in non clickbait class (out of vocabulary)
        temp_nonclickbait *= (add1_nonclickbait['<OOV>']) # Multiply temp_nonclickbait with the conditional probability of <OOV> in non clickbait class

    if (temp_clickbait > temp_nonclickbait): # if temp_clickbait > temp_nonclickbait
      predicted.append(1) # the text is classified as class clickbait (1)
    else:
      predicted.append(0) # the text is classified as class non clickbait (0)
      
  return predicted
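
Multiplying many small probabilities can underflow to 0.0 for long texts. Headlines are short, so this is unlikely to matter here, but a common alternative is to sum log probabilities instead. The sketch below assumes the same `'<OOV>'` convention as above; `log_score` and the toy dictionaries are hypothetical names and values, not part of the notebook.

```python
import math

def log_score(text, prior, cond_prob):
    """Log prior plus the sum of log P(word|class); unseen words use '<OOV>'."""
    score = math.log(prior)
    for word in text.split():
        score += math.log(cond_prob.get(word, cond_prob['<OOV>']))
    return score

# Hypothetical smoothed probabilities for two classes.
toy_clickbait = {"Ini": 0.3, "Cara": 0.2, "<OOV>": 0.01}
toy_nonclickbait = {"Ini": 0.05, "Cara": 0.05, "<OOV>": 0.02}
toy_prior = {"clickbait": 0.5, "non_clickbait": 0.5}

text = "Ini Cara Ampuh"
s_click = log_score(text, toy_prior["clickbait"], toy_clickbait)
s_non = log_score(text, toy_prior["non_clickbait"], toy_nonclickbait)
predicted = 1 if s_click > s_non else 0  # same decision rule, in log space

# For long inputs the raw product underflows, while the log sum stays finite.
raw_product = 0.01 ** 200
log_sum = 200 * math.log(0.01)
print(predicted, raw_product, log_sum)
```

Because the logarithm is monotonic, comparing log scores picks the same class as comparing the raw products whenever the products do not underflow.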

Call the predict_class function to do prediction.

In [18]:
predict = predict_class(clean_test_set, prior_probability, add1_clickbait, add1_nonclickbait)

test_set['predict'] = predict # Create a new column for the predicted classes.
print(test_set.info()) # Recheck whether the prediction length is the same as the test set's.

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 700 entries, 0 to 699
Data columns (total 4 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   title        700 non-null    object
 1   label        700 non-null    object
 2   label_score  700 non-null    int64 
 3   predict      700 non-null    int64 
dtypes: int64(2), object(2)
memory usage: 22.0+ KB
None


# Evaluation Phase #

In [19]:
# Count TP, TN, FP, and FN, treating the chosen class as the positive class
def calculate_metrics(dataset, string):
  '''
    This function calculates true positive (TP), true negative (TN), false positive (FP), false negative (FN).
    The string parameter selects which class is treated as positive:
      - non clickbait class has 0 for positive and 1 for negative
      - clickbait class has 1 for positive and 0 for negative
    returns TP, TN, FP, and FN.
  '''
  
  if string == "non clickbait":
    TP = len(dataset[(dataset['label_score']==0) & (dataset['predict']==0)])
    TN = len(dataset[(dataset['label_score']==1) & (dataset['predict']==1)])
    FP = len(dataset[(dataset['label_score']==1) & (dataset['predict']==0)])
    FN = len(dataset[(dataset['label_score']==0) & (dataset['predict']==1)])
  else:
    TP = len(dataset[(dataset['label_score']==1) & (dataset['predict']==1)])
    TN = len(dataset[(dataset['label_score']==0) & (dataset['predict']==0)])
    FP = len(dataset[(dataset['label_score']==0) & (dataset['predict']==1)])
    FN = len(dataset[(dataset['label_score']==1) & (dataset['predict']==0)])
  return TP, TN, FP, FN 

In [20]:
# Calculate the true positive, true negative, false positive, false negative for each class.
true_positive_click, true_negative_click, false_positive_click, false_negative_click = calculate_metrics(test_set, "clickbait")
true_positive_non, true_negative_non, false_positive_non, false_negative_non = calculate_metrics(test_set, "non clickbait")

# Calculate the sum of true positive, true negative, false positive, false negative over both classes.
true_positive_total = true_positive_click + true_positive_non
true_negative_total = true_negative_click + true_negative_non
false_positive_total = false_positive_click + false_positive_non
false_negative_total = false_negative_click + false_negative_non

# Save the sum into total_metrics; this is needed when calculating microaveraging
total_metrics = {'true_positive': true_positive_total, 'true_negative': true_negative_total, 'false_positive': false_positive_total, 'false_negative': false_negative_total}

Since there are only two classes, the counts for one class mirror the other: the true positives of one class are the true negatives of the other, and the false positives of one class are the false negatives of the other.
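
This mirroring can be checked on a tiny example. The labels and predictions below are invented, and `confusion_counts` is a hypothetical helper written for this illustration (it works on plain lists rather than a DataFrame).

```python
def confusion_counts(labels, predictions, positive):
    """Count TP, TN, FP, FN treating `positive` as the positive class."""
    tp = tn = fp = fn = 0
    for y, p in zip(labels, predictions):
        if y == positive and p == positive:
            tp += 1
        elif y != positive and p != positive:
            tn += 1
        elif y != positive and p == positive:
            fp += 1
        else:
            fn += 1
    return tp, tn, fp, fn

# Toy data: 1 = clickbait, 0 = non clickbait.
toy_labels = [1, 1, 1, 0, 0, 0]
toy_predictions = [1, 1, 0, 1, 0, 1]

click = confusion_counts(toy_labels, toy_predictions, positive=1)
non = confusion_counts(toy_labels, toy_predictions, positive=0)
print(click)
print(non)
```

Switching the positive class swaps TP with TN and FP with FN.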

In [21]:
print("====== Clickbait ======")
print("True positive:", true_positive_click)
print("True negative:", true_negative_click)
print("False positive:", false_positive_click)
print("False negative:", false_negative_click)

print("====== Non clickbait ======")
print("True positive:", true_positive_non)
print("True negative:", true_negative_non)
print("False positive:", false_positive_non)
print("False negative:", false_negative_non)

====== Clickbait ======
True positive: 361
True negative: 90
False positive: 216
False negative: 33
====== Non clickbait ======
True positive: 90
True negative: 361
False positive: 33
False negative: 216


In [22]:
def calculate_precision(TP, TN, FP, FN):
  '''
    This function calculates the precision metrics.
    returns the calculated precision.
  '''
  return TP / (FP + TP)

def calculate_recall(TP, TN, FP, FN):
  '''
    This function calculates the recall metrics.
    returns the calculated recall.
  '''
  return TP / (TP + FN)

def calculate_accuracy(TP, TN, FP, FN):
  '''
    This function calculates the accuracy metrics.
    returns the calculated accuracy.
  '''
  return (TP + TN) / (TP + TN + FP + FN)

def calculate_F1(P, R):
  '''
    This function calculates the F1 score metrics.
    returns the calculated F1 score.
  '''
  return 2 * (P * R) / (P + R)

Next, calculate the four evaluation metrics for each class.

In [23]:
# Calculate the four evaluation metrics of clickbait class
precision_click = calculate_precision(true_positive_click, true_negative_click, false_positive_click, false_negative_click)
recall_click = calculate_recall(true_positive_click, true_negative_click, false_positive_click, false_negative_click)
accuracy_click = calculate_accuracy(true_positive_click, true_negative_click, false_positive_click, false_negative_click)
f1_click = calculate_F1(precision_click, recall_click)

# Save the value of the evaluation metrics into click_parameter dictionary
click_parameter = {'precision': precision_click, 'recall': recall_click, 'accuracy': accuracy_click, 'f1': f1_click}

# Calculate the four evaluation metrics of non clickbait class
precision_non = calculate_precision(true_positive_non, true_negative_non, false_positive_non, false_negative_non)
recall_non = calculate_recall(true_positive_non, true_negative_non, false_positive_non, false_negative_non)
accuracy_non = calculate_accuracy(true_positive_non, true_negative_non, false_positive_non, false_negative_non)
f1_non = calculate_F1(precision_non, recall_non)

# Save the value of the evaluation metrics into non_parameter dictionary
non_parameter = {'precision': precision_non, 'recall': recall_non, 'accuracy': accuracy_non, 'f1': f1_non}

print("====== Clickbait ======")
print("Precision Clickbait:", precision_click)
print("Recall Clickbait:", recall_click)
print("Accuracy Clickbait:", accuracy_click)
print("F1 score Clickbait:", f1_click)
print()
print("====== Non clickbait ======")
print("Precision Non clickbait:", precision_non)
print("Recall Non clickbait:", recall_non)
print("Accuracy Non clickbait:", accuracy_non)
print("F1 score Non clickbait:", f1_non)

====== Clickbait ======
Precision Clickbait: 0.6256499133448874
Recall Clickbait: 0.916243654822335
Accuracy Clickbait: 0.6442857142857142
F1 score Clickbait: 0.7435633367662203

====== Non clickbait ======
Precision Non clickbait: 0.7317073170731707
Recall Non clickbait: 0.29411764705882354
Accuracy Non clickbait: 0.6442857142857142
F1 score Non clickbait: 0.4195804195804196
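
As a sanity check, the clickbait figures can be reproduced by hand from the confusion counts printed earlier (TP = 361, TN = 90, FP = 216, FN = 33). The snippet below only repeats the formulas from the functions above on those four numbers.

```python
# Confusion counts for the clickbait class, as printed earlier in the notebook.
TP, TN, FP, FN = 361, 90, 216, 33

precision = TP / (TP + FP)                    # 361 / 577
recall = TP / (TP + FN)                       # 361 / 394
accuracy = (TP + TN) / (TP + TN + FP + FN)    # 451 / 700
f1 = 2 * precision * recall / (precision + recall)

print(precision, recall, accuracy, f1)
```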


In [24]:
def calculate_average(val1, val2, average='macro', param=None):
  '''
    This function calculates the average of the metrics. 
    This function takes two parameters and two optional parameters;
      - val1 and val2: the evaluation metrics of the two classes
      - average: 'macro' to calculate macroaveraging and 'micro' to calculate microaveraging, default set to macro
      - param: the total evaluation metrics for microaveraging, default set to None
    returns precision, recall, accuracy, and f1 score
  '''

  if average == 'macro': # Calculate macroaveraging
    precision = (val1['precision'] + val2['precision']) / 2
    recall = (val1['recall'] + val2['recall']) / 2
    accuracy = (val1['accuracy'] + val2['accuracy']) / 2
    f1 = (val1['f1'] + val2['f1']) / 2

  elif average == 'micro': # Calculate microaveraging
    precision = calculate_precision(param['true_positive'], param['true_negative'], param['false_positive'], param['false_negative'])
    recall = calculate_recall(param['true_positive'], param['true_negative'], param['false_positive'], param['false_negative'])
    accuracy = calculate_accuracy(param['true_positive'], param['true_negative'], param['false_positive'], param['false_negative'])
    f1 = calculate_F1(precision, recall)

  return precision, recall, accuracy, f1

Calculate microaveraging and macroaveraging evaluation metrics.

In [25]:
precision_macro, recall_macro, accuracy_macro, f1_macro = calculate_average(click_parameter, non_parameter, 'macro')
precision_micro, recall_micro, accuracy_micro, f1_micro = calculate_average(click_parameter, non_parameter, 'micro', total_metrics)

In [26]:
print("======== Macroaveraging ========")
print("Precision:", precision_macro)
print("Recall:", recall_macro)
print("Accuracy:", accuracy_macro)
print("F1 Score:", f1_macro)

print()
print("======== Microaveraging ========")
print("Precision:", precision_micro)
print("Recall:", recall_micro)
print("Accuracy:", accuracy_micro)
print("F1 Score:", f1_micro)

======== Macroaveraging ========
Precision: 0.6786786152090291
Recall: 0.6051806509405793
Accuracy: 0.6442857142857142
F1 Score: 0.5815718781733199

======== Microaveraging ========
Precision: 0.6442857142857142
Recall: 0.6442857142857142
Accuracy: 0.6442857142857142
F1 Score: 0.6442857142857142


# Analysis #
## Training Phase and Test Phase ##
I tried removing punctuation, numbers, and signs, and also removing stopwords. None of these improved the evaluation metrics; the best results came from using no regex cleaning and no stopwords removal. Although stopwords are usually expected to carry little information, for this task keeping them gave higher scores. My assumption is that the elements listed above are informative for these titles, so it is better not to remove them.

Add-1 smoothing was used so that an out-of-vocabulary word does not zero out the whole product; such a word is given a small non-zero probability instead.

The vocabulary size also played a role in training and evaluation. Tweaking this variable could lead to different (perhaps better) evaluation metrics.

## Prediction ##
For the prediction, the accuracy I obtained was around 0.644. I tried regex cleaning and stopwords removal, but the accuracy tended to be lower. Changing the vocabulary length also affected the accuracy. I attribute the accuracy I obtained to the following assumptions:
1. The texts in the dataset were not cleaned enough (they contained numbers, punctuation, and signs).
2. The vocabulary length was small compared to the length of the corpus.

## Evaluation ##
Macroaveraging computes each metric independently for each class and then takes the average, so it treats all classes equally. Microaveraging pools the contributions of all classes (the summed TP, TN, FP, and FN) before computing the metric. Because the training data is fairly balanced (759 clickbait and 741 non-clickbait titles), macroaveraging could suit this task better.
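
In a two-class, single-label setting, every false positive for one class is a false negative for the other, so the pooled counts make micro-averaged precision and recall coincide with accuracy. A quick check with the per-class counts reported in the evaluation (the dictionary names are illustrative):

```python
# Per-class confusion counts reported in the evaluation above.
click = {"TP": 361, "TN": 90, "FP": 216, "FN": 33}
non = {"TP": 90, "TN": 361, "FP": 33, "FN": 216}

# Pool the counts over both classes, as microaveraging does.
tp = click["TP"] + non["TP"]   # 451
fp = click["FP"] + non["FP"]   # 249
fn = click["FN"] + non["FN"]   # 249

micro_precision = tp / (tp + fp)
micro_recall = tp / (tp + fn)
accuracy = (click["TP"] + click["TN"]) / sum(click.values())

print(micro_precision, micro_recall, accuracy)
```

This is why the four micro-averaged numbers in the output are identical, while the macro-averaged ones differ.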
