# CSC5035Z Natural Language Processing
# Assignment 1: Naive Bayes for Text Classification

**Author: Tapera Chikumbu**

**Introduction**
This notebook implements a Naive Bayes classifier for sentiment analysis of the AfriSenti dataset.

# Installations, Imports and Downloads

The following cell installs the Python packages needed for my Naive Bayes classifier.

In [1]:
%pip install jax numpy pandas matplotlib scikit-learn

Defaulting to user installation because normal site-packages is not writeable
Note: you may need to restart the kernel to use updated packages.


Once installed, these modules must be imported before they can be used.

In [2]:
import os
import warnings
import re
import pandas as pd
import numpy as np
import jax.numpy as jnp
import matplotlib.pyplot as plt

warnings.filterwarnings("ignore", category=UserWarning)

from jax import grad
from collections import defaultdict
from sklearn.metrics import classification_report

FS = (9, 3)  # figure size

The AfriSenti dataset is used for this tutorial. The following block checks whether the dataset exists in the current directory. If it is absent, the dataset is cloned from its GitHub repository; otherwise the local copy is updated.

In [3]:
# Download dataset
PROJECT_DIR = os.getcwd() + '/afrisent-semeval-2023'
print('Current directory: ', PROJECT_DIR)
PROJECT_GITHUB_URL = 'https://github.com/afrisenti-semeval/afrisent-semeval-2023.git'

if not os.path.isdir(PROJECT_DIR):
  !git clone {PROJECT_GITHUB_URL}
else:
  %cd {PROJECT_DIR}
  !git pull {PROJECT_GITHUB_URL}

Current directory:  /home/peter/Documents/Hons/NLP/Ass1/afrisent-semeval-2023
/home/peter/Documents/Hons/NLP/Ass1/afrisent-semeval-2023


From https://github.com/afrisenti-semeval/afrisent-semeval-2023
 * branch            HEAD       -> FETCH_HEAD
Already up to date.


# 1. Data Loading & Preprocessing

The AfriSenti dataset consists of tweets that have been human-labelled according to their emotional tone as positive, negative, or neutral. It covers 14 languages spoken across the African continent, each with its own annotated splits for training (*train.tsv*), validation (*dev.tsv*), and testing (*test.tsv*).

### AfriSenti languages

| No. | Language                     | Code | Country        |
|-----|------------------------------|--------------|----------------|
| 1   | Algerian Arabic              | arq          | Algeria        |
| 2   | Amharic                      | amh          | Ethiopia       |
| 3   | Hausa                        | hau          | Nigeria        |
| 4   | Igbo                         | ibo          | Nigeria        |
| 5   | Kinyarwanda                  | kin          | Rwanda         |
| 6   | Moroccan Arabic/Darija       | ary          | Morocco        |
| 7   | Mozambique Portuguese        | por          | Mozambique     |
| 8   | Nigerian Pidgin              | pcm          | Nigeria        |
| 9   | Oromo                        | orm          | Ethiopia       |
| 10  | Swahili                      | swa          | Kenya/Tanzania |
| 11  | Tigrinya                     | tir          | Ethiopia       |
| 12  | Twi                          | twi          | Ghana          |
| 13  | Xitsonga                     | tso          | Mozambique     |
| 14  | Yoruba                       | yor          | Nigeria        |

For my experiments, I use the Hausa dataset because it has the largest and most balanced set of tweets.

In [4]:
# Choose language
language = 'hau'  # Can be one of ['arq', 'amh', 'hau', 'ibo', 'kin', 'ary', 'por', 'pcm', 'orm', 'swa', 'tir', 'twi', 'tso', 'yor']

With a language specified, the train, dev, and test splits can now be loaded into dataframes using pandas.

In [5]:
# Load data
DATA_DIR = f'{PROJECT_DIR}/data/{language}'
print('Data directory: ', DATA_DIR)

train_df = pd.read_csv(f'{DATA_DIR}/train.tsv', sep='\t', names=['text', 'label'], header=0)
dev_df = pd.read_csv(f'{DATA_DIR}/dev.tsv', sep='\t', names=['text', 'label'], header=0)
test_df = pd.read_csv(f'{DATA_DIR}/test.tsv', sep='\t', names=['text', 'label'], header=0)

print('Train shape: ', train_df.shape)
print('Dev shape: ', dev_df.shape)
print('Test shape: ', test_df.shape)

# Display data
train_df.sample(n=5)

Data directory:  /home/peter/Documents/Hons/NLP/Ass1/afrisent-semeval-2023/data/hau
Train shape:  (14172, 2)
Dev shape:  (2677, 2)
Test shape:  (5303, 2)


|       | text | label |
|-------|------|-------|
| 10865 | @user Kaduna Allah yashiga lamarinku😥 | positive |
| 9901  | @user Ga baki ga hanci😂😂 wai!!! Kyawu iya kyaw... | positive |
| 3999  | @user Aradu gaskiya ne, tsohon ba abinda yake ... | negative |
| 13372 | @user Ga tsaro, ga taimako. Allah ya biya yara... | positive |
| 2655  | @user Bbbc hausa da wulakanci kuke meye kuma w... | negative |


<a name="section1_2"></a>
## 1.1. Data cleaning

The aim is **binary classification** of the tweet's content as **positive** or **negative**. All tweets that are labelled as **neutral** must be filtered out before proceeding.

In [6]:
# Discard neutral examples (copy so later column assignments do not act on a slice)
train_df = train_df[train_df['label'] != 'neutral'].copy()
dev_df = dev_df[dev_df['label'] != 'neutral'].copy()
test_df = test_df[test_df['label'] != 'neutral'].copy()

print('Train shape: ', train_df.shape)
print('Dev shape: ', dev_df.shape)
print('Test shape: ', test_df.shape)

Train shape:  (9260, 2)
Dev shape:  (1781, 2)
Test shape:  (3514, 2)


Overall, the following tasks are performed to clean the dataset:

* Discard neutral examples to perform binary classification (done above).
* Replace all URLs with a special '[URL]' token.
* Replace all numbers with a special '[NUM]' token.
* Remove extra whitespace on either side of the text.

In [7]:
def clean(text):
    # Replace URLS with [URL]
    text = re.sub(r'http\S+', '[URL]', text)

    # Replace numbers with [NUM]
    text = re.sub(r'\d+', '[NUM]', text)

    # Remove leading and trailing whitespace
    text = text.strip()

    return text

train_df['text'] = train_df['text'].apply(clean)
dev_df['text'] = dev_df['text'].apply(clean)
test_df['text'] = test_df['text'].apply(clean)
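
As a quick check of the cleaning rules, the sketch below applies the same three steps to a made-up tweet. The function name `clean_example` and the sample string are illustrative assumptions and are not taken from the dataset.

```python
import re

def clean_example(text):
    # Same three steps as `clean`, reproduced so the snippet runs on its own
    text = re.sub(r'http\S+', '[URL]', text)  # URLs -> [URL]
    text = re.sub(r'\d+', '[NUM]', text)      # runs of digits -> [NUM]
    return text.strip()                        # drop leading/trailing whitespace

sample = '  @user duba https://t.co/abc123 sau 3  '
print(clean_example(sample))  # @user duba [URL] sau [NUM]
```

Because the URL is replaced first, the digits inside it never reach the number rule.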

<a name="section1_3"></a>
## 1.2. Vocabulary construction

A vocabulary is the set of unique words or tokens present in the text corpus. We refer to vocabulary items as **types** and to particular occurrences of these types in the dataset as **tokens**. I decided to use two tokenization methods to explore their potential impact on classifier performance.

For NLP purposes, we want to map each type in our vocabulary to an **index**, a unique number identifying that type. Later we can use this index to, for example, look up vector representations for our words in a lookup table. To achieve this, our vocabulary is represented with three variables:
* index2type: list of unique types in the vocabulary e.g. ['word1', 'word2', 'word3', ...]
* type2index: dictionary mapping types to their index in the index2type list e.g. {'word1': 0, 'word2': 1, 'word3': 2, ...}
* type2count: dictionary mapping types to the number of token occurrences of that type in the training data e.g. {'word1': 1012, 'word2': 510, 'word3': 45, ...}
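
To make the three variables concrete, here is a small sketch on a two-tweet toy corpus. The corpus and the `toy_` names are made up for illustration; the notebook builds the real vocabulary with its own helper functions.

```python
from collections import Counter

toy_corpus = [['Allah', 'ya', 'kyauta'], ['Allah', 'ya', 'sa']]

# type2count: how many tokens of each type occur in the corpus
toy_type2count = Counter(token for sentence in toy_corpus for token in sentence)

# index2type: types ordered by descending count (ties keep first-seen order)
toy_index2type = [type_ for type_, _ in toy_type2count.most_common()]

# type2index: inverse lookup from a type to its position in index2type
toy_type2index = {type_: idx for idx, type_ in enumerate(toy_index2type)}

print(sum(toy_type2count.values()))  # 6 tokens
print(toy_index2type)                # ['Allah', 'ya', 'kyauta', 'sa'] (4 types)
print(toy_type2index['ya'])          # 1
```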


Next, I perform whitespace tokenization, which splits the text wherever whitespace occurs.

In [8]:
def whitespace_tokenize(sentences):
    return [sentence.split() for sentence in sentences]

# Store training data text as list of tweets
train_corpus = train_df['text'].tolist()
# Word tokenized corpus
tokenized_train_corpus = whitespace_tokenize(train_corpus)

* The function `count_tokens` takes a list of tokenized sentences as input and counts the corpus size.
* The function `create_vocabulary` takes the type counts and a minimum count as input and builds an indexed vocabulary.

In [9]:
# Count number of tokens in corpus
def count_tokens(sentences):
    """
    Count number of tokens in corpus

    param: sentences: list of list of tokens e.g. [['This', 'is', 'a', 'sentence'], ['This', 'is', 'another', 'sentence'], ...]
    return:
        count: number of tokens in corpus
    """
    count = sum(len(sent) for sent in sentences)
    return count

In [10]:
num_tokens = count_tokens(tokenized_train_corpus)
print('Number of tokens in corpus: ', num_tokens)

Number of tokens in corpus:  141969


I used the function `create_type_counts` to count how often each unique word (type) occurs in the corpus.

In [11]:
# Collect type counts in corpus
def create_type_counts(sentences):
    """
    Count number of types in corpus

    param: sentences: list of list of tokens e.g. [['This', 'is', 'a', 'sentence'], ['This', 'is', 'another', 'sentence'], ...]
    return:
        type2count: dictionary of type counts in corpus e.g. {'This': 2, 'sentence': 2, ...}
    """
    type2count = defaultdict(int)
    for sentence in sentences:
        for token in sentence:
            type2count[token] += 1
    return type2count

In [12]:
# Count occurrences of each type in the corpus
type2count = create_type_counts(tokenized_train_corpus)
print('Number of types in corpus: ', len(type2count))

# Sort types by counts
type2count = dict(sorted(type2count.items(), key=lambda x: x[1], reverse=True))

# Print first few types and counts
for i, (type_, count) in enumerate(type2count.items()):
    print(f'{type_}: {count}')
    if i == 5:
        break

Number of types in corpus:  29569
@user: 11313
Allah: 3985
da: 3914
ya: 2921
a: 1700
ba: 1247


In [13]:
# Create vocabulary
def create_vocabulary(type2count, min_count):
    """
    This function creates an indexed vocabulary from vocabulary counts and returns it as a list and a dictionary.

    param:
        type2count: dictionary of type counts in corpus e.g. {'This': 2, 'sentence': 2, ...}
        min_count: minimum count of a word to be included in the vocabulary
    return:
        index2type: list of words in the vocabulary e.g. ['word1', 'word2', 'word3', ...]
        type2index: dictionary mapping words to their index in the index2type vocabulary e.g. {'word1': 0, 'word2': 1, 'word3': 2, ...}
    """
    index2type = []
    type2index = {}
    idx = 0
    for type_, count in type2count.items():  # `type_` avoids shadowing the built-in `type`
        if count >= min_count:
            type2index[type_] = idx
            index2type.append(type_)
            idx += 1

    # Add unknown token and padding
    type2index['<UNK>'] = len(index2type)
    index2type.append('<UNK>')
    type2index['<PAD>'] = len(index2type)
    index2type.append('<PAD>')

    return index2type, type2index



In [14]:
index2type, type2index = create_vocabulary(type2count, min_count=1)

print('Vocabulary size: ', len(index2type))
print('First 10 words in the vocabulary: ', index2type[0:10])

Vocabulary size:  29571
First 10 words in the vocabulary:  ['@user', 'Allah', 'da', 'ya', 'a', 'ba', 'wannan', 'mu', 'ta', 'na']


<a name="section1_3"></a>
## 1.3. BPE tokenisation



Byte Pair Encoding (BPE) is a popular subword tokenisation algorithm in NLP. BPE and related algorithms have two parts:
* A type learner that takes a raw training corpus and induces a vocabulary (a set of types) of prespecified size (e.g. 1000 subwords).
* A token segmenter that takes a raw test sentence and tokenises it according to that subword vocabulary.
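
The type learner can be sketched as a single merge step on a toy word-frequency table: count adjacent symbol pairs weighted by word frequency, then merge the most frequent pair everywhere. The words, counts, and helper names below are illustrative assumptions, not values from the dataset.

```python
from collections import defaultdict

# Toy word frequencies; each word starts as a list of characters
word_freqs = {'dada': 4, 'da': 3, 'ya': 2}
splits = {word: list(word) for word in word_freqs}

def pair_frequencies(splits, word_freqs):
    # Count each adjacent symbol pair, weighted by the word's frequency
    freqs = defaultdict(int)
    for word, symbols in splits.items():
        for left, right in zip(symbols, symbols[1:]):
            freqs[(left, right)] += word_freqs[word]
    return freqs

def merge(symbols, pair):
    # Replace every adjacent occurrence of `pair` with the joined symbol
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged

freqs = pair_frequencies(splits, word_freqs)
best = max(freqs, key=freqs.get)
splits = {word: merge(symbols, best) for word, symbols in splits.items()}
print(best, freqs[best])  # ('d', 'a') 11
print(splits)             # {'dada': ['da', 'da'], 'da': ['da'], 'ya': ['y', 'a']}
```

Repeating this step until the merge budget is used up yields the ordered list of merges that the segmenter later replays.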

The 'BPETokenizer' class combines these parts into one for ease of use.

In [15]:
# Implement BPE algorithm

class BPETokenizer():

    def __init__(self, sentences : list[list[str]], vocab_size : int):
        """
        Initialize the BPE tokenizer.

        Args:
            sentences (list[list[str]]): list of list of tokens e.g. [['This', 'is', 'a', 'sentence'], ['This', 'is', 'another', 'sentence'], ...]
            vocab_size (int): The number of merges to learn during training (the loop in train() stops after this many merges).
        """
        self.sentences = sentences
        self.vocab_size = vocab_size
        self.word_freqs = defaultdict(int)
        self.pair_freqs = defaultdict(int)
        self.splits = {}
        self.merges = {}


    def train(self):
        """
        Train the BPE tokenizer by iteratively merging the most frequent pairs of symbols.

        Returns:
            dict: A dictionary of merges in the format {(a, b): 'ab'}, where 'a' and 'b' are symbols merged into 'ab'.
        """
        # Compute word frequencies and word splits
        self.word_freqs = create_type_counts(self.sentences)

        # Map each word to a list of its characters
        for word in self.word_freqs:
            self.splits[word] = list(word)

        while (len(self.merges) < self.vocab_size):
            self.pair_freqs = self.compute_pair_freqs()
            # Stop if no pair has been found
            if not self.pair_freqs:
                print("No pairs to merge.")
                break
            # Find most frequent pair
            max_key = max(self.pair_freqs, key=self.pair_freqs.get)
            self.merges[max_key] = "".join(max_key)
            # Merge pair in dictionary of splits
            print(f"Merging: {max_key}, Current merges: {len(self.merges)}")
            self.splits = self.merge_pair(max_key[0], max_key[1])
        return self.merges


    def compute_pair_freqs(self):
        """
        Compute the frequency of each pair of symbols in the corpus.

        Returns:
            dict: A dictionary of pairs and their frequencies in the format {(a, b): frequency}.
        """

        pair_freq = defaultdict(int)
        # Loop through each unique word in the corpus
        for word, symbols in self.splits.items():
            for i in range(len(symbols)-1):
                # Increment the pair frequency by the word frequency
                pair_freq[symbols[i],symbols[i+1]] += self.word_freqs[word]

        return pair_freq


    def merge_pair(self, a, b):
        """
        Merge the given pair of symbols in all words where they appear adjacent.

        Args:
            a (str): The first symbol in the pair.
            b (str): The second symbol in the pair.

        Returns:
            dict: The updated splits dictionary after merging.
        """
        updated_splits = {}

        for word, split in self.splits.items():
            encoded = []
            i = 0
            # Merge pair of symbols and replace word in sentence
            while (i < len(split)):
                if (i+1 != len(split)) and (split[i] == a) and (split[i+1] == b):
                    # Add merged pair and skip next symbol
                    encoded.append(a+b)
                    i+=1
                else:
                    # Add current symbol
                    encoded.append(split[i])
                i += 1
            updated_splits[word] = encoded            
        return updated_splits


    def tokenize(self, text):
        """
        Tokenize a given text using the trained BPE tokenizer.

        Args:
            text (str): The text to be tokenized.

        Returns:
            list[str]: A list of tokens obtained after applying BPE tokenization.
        """

        pre_tokenized_text = text.split()
        splits_text = [[l for l in word] for word in pre_tokenized_text]

        for pair, merge in self.merges.items():
            for idx, split in enumerate(splits_text):
                i = 0
                while i < len(split) - 1:
                    if split[i] == pair[0] and split[i + 1] == pair[1]:
                        split = split[:i] + [merge] + split[i + 2 :]
                    else:
                        i += 1
                splits_text[idx] = split
        result = sum(splits_text, [])
        return result
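
The segmentation step replays the learned merges, in the order they were learned, on the characters of each new word. Here is a minimal sketch with a hypothetical two-entry merge list; the helper `apply_merges` and the merges themselves are made up for illustration.

```python
def apply_merges(word, merges):
    # Start from single characters, then apply each merge in learned order
    symbols = list(word)
    for left, right in merges:
        i = 0
        while i < len(symbols) - 1:
            if symbols[i] == left and symbols[i + 1] == right:
                symbols[i:i + 2] = [left + right]
            else:
                i += 1
    return symbols

toy_merges = [('d', 'a'), ('da', 'da')]
print(apply_merges('dada', toy_merges))  # ['dada']
print(apply_merges('daya', toy_merges))  # ['da', 'y', 'a']
```

Characters that never took part in a merge, such as `y` here, simply remain single-character tokens.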

Running the following block trains a BPE tokeniser on our cleaned training dataframe. The merges learned are then applied to all dataframes. The final line shows how our vocabulary changes after subword tokenisation.

In [16]:
# Train BPE
# Toy corpus for quick experiments, e.g. BPETokenizer(dummy, 20)
dummy = [['This', 'is', 'a', 'sentence'], ['This', 'is', 'another', 'sentence']]
bpe = BPETokenizer(tokenized_train_corpus, vocab_size=1000)
merges = bpe.train()
print('Merges: ', merges)

# Tokenize text
text = 'This is a test sentence.'
tokens = bpe.tokenize(text)
print('BPE tokens: ', tokens)

# Apply to our dataset
train_df['bpe_text'] = train_df['text'].apply(lambda x: ' '.join(bpe.tokenize(x)))
dev_df['bpe_text'] = dev_df['text'].apply(lambda x: ' '.join(bpe.tokenize(x)))
test_df['bpe_text'] = test_df['text'].apply(lambda x: ' '.join(bpe.tokenize(x)))

train_df.head()

Merging: ('a', 'n'), Current merges: 1
Merging: ('e', 'r'), Current merges: 2
Merging: ('u', 's'), Current merges: 3
Merging: ('y', 'a'), Current merges: 4
Merging: ('@', 'us'), Current merges: 5
Merging: ('@us', 'er'), Current merges: 6
Merging: ('d', 'a'), Current merges: 7
Merging: ('k', 'a'), Current merges: 8
Merging: ('i', 'n'), Current merges: 9
Merging: ('m', 'a'), Current merges: 10
Merging: ('l', 'a'), Current merges: 11
Merging: ('b', 'a'), Current merges: 12
Merging: ('w', 'a'), Current merges: 13
Merging: ('s', 'a'), Current merges: 14
Merging: ('t', 'a'), Current merges: 15
Merging: ('l', 'la'), Current merges: 16
Merging: ('lla', 'h'), Current merges: 17
Merging: ('m', 'u'), Current merges: 18
Merging: ('s', 'h'), Current merges: 19
Merging: ('A', 'llah'), Current merges: 20
Merging: ('n', 'a'), Current merges: 21
Merging: ('r', 'a'), Current merges: 22
Merging: ('s', 'u'), Current merges: 23
Merging: ('k', 'u'), Current merges: 24
Merging: ('h', 'a'), Current merges: 25

|   | text | label | bpe_text |
|---|------|-------|----------|
| 0 | @user Da kudin da Arewa babu wani abin azo aga... | negative | @user Da kudin da Arewa babu wani abin a zo a ... |
| 1 | @user Kaga wani Adu ar Banda💔😭 wai a haka Shi ... | negative | @user Ka ga wani A du ar Ban da 💔 😭 wai a haka... |
| 2 | @user Sai haquri fa yan madrid daman kunce cha... | negative | @user Sai ha q u ri fa yan ma d ri d da man ku... |
| 3 | @user Hmmm yanzu kai kasan girman allah daxaka... | negative | @user Hmm m yanzu kai kasan gir man allah da x... |
| 4 | @user @user Wai gwamno nin Nigeria suna afa kw... | negative | @user @user Wai gwa m no nin Nigeria suna a fa... |


A vocabulary can then be created from the BPE-tokenised corpus. The `vocab_size` parameter of the BPE training algorithm sets how many merges are learned, which limits the number of subword types and gives a much smaller vocabulary than word-based tokenisation (1,643 BPE types versus 29,569 word types in this corpus).

In [17]:
bpe_corpus = train_df['bpe_text'].tolist()
tokenized_bpe_corpus = whitespace_tokenize(bpe_corpus)

# Count number of BPE tokens in corpus
num_tokens = count_tokens(tokenized_bpe_corpus)
print('Number of BPE tokens in corpus: ', num_tokens)

# Collect type counts in BPE corpus
bpe_type2count = create_type_counts(tokenized_bpe_corpus)
print('Number of BPE types in corpus: ', len(bpe_type2count))

# Sort types by counts
bpe_type2count = dict(sorted(bpe_type2count.items(), key=lambda x: x[1], reverse=True))

# Print first few types and counts
for i, (type_, count) in enumerate(bpe_type2count.items()):
    print(f'{type_}: {count}')
    if i == 5:
        break

# Create a vocabulary for BPE tokens
bpe_index2type, bpe_type2index = create_vocabulary(bpe_type2count, min_count=2)

# # Add unknown token to bpe
# bpe_type2index['<UNK>'] = len(bpe_index2type)
# bpe_index2type.append('<UNK>')

print('Vocabulary size: ', len(bpe_index2type))
print('First 10 BPE tokens in the vocabulary: ', bpe_index2type[0:10])


Number of BPE tokens in corpus:  245125
Number of BPE types in corpus:  1643
@user: 11392
da: 5864
ya: 4894
Allah: 4226
a: 3680
ka: 2976
Vocabulary size:  1477
First 10 BPE tokens in the vocabulary:  ['@user', 'da', 'ya', 'Allah', 'a', 'ka', 'ba', 'ta', 'ma', ',']


To improve readability, I set up separate lists of tokenized sentences for each tokenization method. These will be used when training the text classifier.

In [18]:
# BPE-tokenized sentences
train_bpe = train_df["bpe_text"].tolist()
dev_bpe = dev_df["bpe_text"].tolist()
test_bpe = test_df["bpe_text"].tolist()

bpe_train_sentences = whitespace_tokenize(train_bpe)
bpe_dev_sentences = whitespace_tokenize(dev_bpe)
bpe_test_sentences = whitespace_tokenize(test_bpe)

In [19]:
# Word-tokenized sentences
train_sentences = train_df["text"].tolist()
dev_sentences = dev_df["text"].tolist()
test_sentences = test_df["text"].tolist()

tokenized_train_sentences = whitespace_tokenize(train_sentences)
tokenized_dev_sentences = whitespace_tokenize(dev_sentences)
tokenized_test_sentences = whitespace_tokenize(test_sentences)

<a name="section2"></a>
# 2. Text Classification

Now a text classification model can be trained for sentiment analysis on our data. The model's task is to predict whether a tweet is positive or negative.

<a name="section2_1"></a>
## 2.1. Text vectorization

The textual data from the tweets must be transformed into a numerical format before the model can process it. Vectorization is the process of converting text into numerical vectors. I opted for **one-hot encoding** of the data: each sentence is represented as a binary vector with a 1 at the index of every vocabulary type that occurs in it and 0s elsewhere, so word counts and word order are discarded. The function below performs this vectorization.

In [20]:
def one_hot_vectorize(sentences, type2index):
    """
    One-hot encode a list of sentences.

    param:
        sentences: list of list of tokens e.g. [['This', 'is', 'a', 'sentence'], ['This', 'is', 'another', 'sentence'], ...]
        type2index: dictionary mapping words to their index in the vocabulary e.g. {'word1': 0, 'word2': 1, 'word3': 2, ...}
    return:
        one_hot_vectors: list of binary numpy vectors, one per sentence e.g. [array([1., 0., 0., 1., ...]), array([0., 1., 1., 0., ...]), ...]
    """

    one_hot_vectors = []
    for sentence in sentences:
        vector = np.zeros(len(type2index))
        for word in sentence:
            if word in type2index:
                vector[type2index[word]] = 1
            else:
                vector[type2index['<UNK>']] = 1
            
        one_hot_vectors.append(vector)

    return one_hot_vectors
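
On a toy vocabulary, the encoding looks as follows. Repeated words still produce a single 1, and words outside the vocabulary switch on the `<UNK>` position. The vocabulary and sentence are made up for illustration, and plain Python lists stand in for the NumPy vectors.

```python
toy_type2index = {'da': 0, 'ya': 1, 'Allah': 2, '<UNK>': 3, '<PAD>': 4}

def binary_vector(sentence, type2index):
    # One slot per vocabulary entry; 1 if that type occurs at least once
    vector = [0] * len(type2index)
    for word in sentence:
        vector[type2index.get(word, type2index['<UNK>'])] = 1
    return vector

print(binary_vector(['da', 'da', 'Allah', 'sannu'], toy_type2index))  # [1, 0, 1, 1, 0]
```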
    

Running the next block of code applies this vectorization to both tokenized versions of the dataset.

In [21]:
# Vectorize dataframes
X_train_word = one_hot_vectorize(tokenized_train_sentences, type2index)
X_dev_word = one_hot_vectorize(tokenized_dev_sentences, type2index)
X_test_word = one_hot_vectorize(tokenized_test_sentences, type2index)

X_train_bpe = one_hot_vectorize(bpe_train_sentences, bpe_type2index)
X_dev_bpe = one_hot_vectorize(bpe_dev_sentences, bpe_type2index)
X_test_bpe = one_hot_vectorize(bpe_test_sentences, bpe_type2index)

print("BPE length:")
print('Train length: ', len(X_train_bpe))
print('Dev length: ', len(X_dev_bpe))
print('Test length: ', len(X_test_bpe))

print("\nWord tokenization length:")
print('Train length: ', len(X_train_word))
print('Dev length: ', len(X_dev_word))
print('Test length: ', len(X_test_word))

BPE length:
Train length:  9260
Dev length:  1781
Test length:  3514

Word tokenization length:
Train length:  9260
Dev length:  1781
Test length:  3514


The BPE feature vectors are stored as JAX arrays, and the output labels are converted to numerical representations (positive = 1 and negative = 0).

In [22]:
# Store as JAX arrays
X_train = jnp.array(X_train_bpe, dtype=jnp.float16)
X_dev = jnp.array(X_dev_bpe, dtype=jnp.float16)
X_test = jnp.array(X_test_bpe, dtype=jnp.float16)
n_feat = X_train.shape[1]

# print('Train example:\n', X_train[0:5])

y_train = train_df["label"]
y_train = jnp.array(y_train.map({"positive": 1, "negative": 0}), dtype=jnp.float16)

y_dev = dev_df["label"]
y_dev = jnp.array(y_dev.map({"positive": 1, "negative": 0}), dtype=jnp.float16)

y_test = test_df["label"]
y_test = jnp.array(y_test.map({"positive": 1, "negative": 0}), dtype=jnp.float16)

## 2.2 Naive Bayes Classifier

A Naive Bayes classifier uses conditional probabilities to predict the class associated with a set of features. It combines the probabilities of each individual feature occurring in a class, treating the features as independent, to estimate the most likely class for the whole set. As my first step, I count how often each token occurs in each of the two classes. These counts are stored in a frequency dictionary.

In [23]:
# Frequency dict
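def frequency_dict(features, values):
    """
    Count the samples in which each feature is present, per label.

    return:
        freq_dict: dictionary mapping (feature index, label) to a count
    """
    freq_dict = defaultdict(int)
    for sample_idx in range(len(features)):
        for key_idx in range(len(features[sample_idx])):
            present = features[sample_idx][key_idx].item()
            if present:
                freq_dict[(key_idx, values[sample_idx].item())] += 1

    return freq_dict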

# bpe_freq = frequency_dict(X_train_bpe, y_train)
# word_freq = frequency_dict(X_train_word, y_train)

These frequencies are used to calculate the probability of each token under each sentiment.
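
Written out, the computation takes the usual additive-smoothing form. The symbols are introduced here for illustration: $\mathrm{count}(i, c)$ is the number of class-$c$ tweets containing feature $i$, $N_c$ is the sum of those counts over all features, $V$ is the vocabulary size, $D_{+}$ and $D_{-}$ are the numbers of positive and negative tweets, and $F$ is the set of features present in a tweet.

$$
\hat{P}(i \mid c) = \frac{\mathrm{count}(i, c) + \alpha}{N_c + \alpha V},
\qquad
\mathrm{score}(F) = \log\frac{D_{+}}{D_{-}} + \sum_{i \in F} \log\frac{\hat{P}(i \mid +)}{\hat{P}(i \mid -)}
$$

A tweet is labelled positive when $\mathrm{score}(F) > 0$. For $\alpha = 1$ the smoothed estimate matches the `(freq + alpha) / (N + V)` expression used in `train_nb`.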

In [24]:
# Model training
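def train_nb(train_x, train_y, alpha=1.0):
    """
    Train a binary Naive Bayes classifier on binary feature vectors.

    return:
        logprior: log-ratio of positive to negative tweet counts
        loglikelihood: dictionary mapping each feature index to its log-likelihood ratio
    """
    freqs = frequency_dict(train_x, train_y)
    loglikelihood = {}

    # Vocabulary size (number of features)
    V = len(train_x[0])

    # Total feature counts per class
    N_pos = 0
    N_neg = 0
    for (_, label), count in freqs.items():
        if label > 0:
            N_pos += count
        else:
            N_neg += count

    # Total tweets
    D = len(train_y)

    # Positive documents
    # Note: train_y is a float16 array in this notebook and the built-in sum()
    # accumulates in that dtype; float16 cannot represent every integer above
    # 2048, so this count may be inexact for larger training sets.
    D_pos = sum(train_y)

    # Negative documents
    D_neg = D - D_pos

    # Log prior (log-odds of the positive class)
    logprior = jnp.log(D_pos) - jnp.log(D_neg)

    # Log-likelihood ratio per feature
    for idx in range(V):
        # Frequency per label
        freq_pos = freqs.get((idx, 1), 0)
        freq_neg = freqs.get((idx, 0), 0)

        # Probability with smoothing to avoid log of zero
        # Note: standard additive smoothing divides by N + alpha * V; the
        # denominator used here (N + V) coincides with it only when alpha = 1.
        p_pos = (freq_pos + alpha) / (N_pos + V)
        p_neg = (freq_neg + alpha) / (N_neg + V)

        # Log-likelihood ratio
        loglikelihood[idx] = jnp.log(p_pos / p_neg)

    return logprior, loglikelihood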

In [25]:
# Predict sentiment as positive (1) or negative (0)
def predict_nb(tweet, prior, likelihood):
    p = prior
    
    for type_idx in range(len(tweet)):
        if tweet[type_idx]:
            p += likelihood[type_idx]
    
    # A positive log-odds score means the positive class is more likely
    return 1 if p > 0 else 0
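
To see the log-odds scoring at work, the sketch below trains on four made-up tweets given as sets of tokens and then scores two new ones. It uses the standard `alpha * V` smoothing denominator and plain Python rather than feature vectors, so the helper names and data are assumptions for illustration and not the notebook's implementation.

```python
import math
from collections import Counter

train = [({'nagode', '@user'}, 1), ({'nagode'}, 1),
         ({'wulakanci', '@user'}, 0), ({'wulakanci'}, 0)]
vocab = sorted({tok for toks, _ in train for tok in toks})

def fit(train, vocab, alpha=1.0):
    # Count, per class, how many tweets contain each token
    counts = {0: Counter(), 1: Counter()}
    docs = Counter()
    for tokens, label in train:
        counts[label].update(tokens)
        docs[label] += 1
    totals = {c: sum(counts[c].values()) for c in (0, 1)}
    V = len(vocab)
    # Smoothed log-likelihood ratio per token
    ratio = {
        tok: math.log((counts[1][tok] + alpha) / (totals[1] + alpha * V))
             - math.log((counts[0][tok] + alpha) / (totals[0] + alpha * V))
        for tok in vocab
    }
    prior = math.log(docs[1]) - math.log(docs[0])
    return prior, ratio

def predict(tokens, prior, ratio):
    # Add the prior and the ratios of known tokens; unknown tokens are skipped
    score = prior + sum(ratio[tok] for tok in set(tokens) if tok in ratio)
    return 1 if score > 0 else 0

prior, ratio = fit(train, vocab)
print(predict(['@user', 'nagode'], prior, ratio))     # 1
print(predict(['@user', 'wulakanci'], prior, ratio))  # 0
```

The token `@user` occurs equally often in both classes, so its log-ratio is zero and it does not influence the decision.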

# 3 Evaluation

The development set was used to evaluate model performance. This is done by predicting the labels and passing them to scikit-learn's `classification_report` function. The first step uses scikit-learn's `MultinomialNB` classifier to make predictions. It served as a performance target for my own implementation during development.

In [26]:
from sklearn import naive_bayes

def benchmark_sklearn_nb(X_train, y_train, X_dev, y_dev):
    """
    Trains a Naive Bayes classifier using scikit-learn and evaluates its performance.

    Args:
        X_train (list): Training data features.
        y_train (jaxlib.xla_extension.ArrayImpl): Training data labels.
        X_dev (list): Development data features.
        y_dev (jaxlib.xla_extension.ArrayImpl): Development data labels.

    Returns:
        str: The classification report for the development set.
    """
    clf = naive_bayes.MultinomialNB()
    clf.fit(X_train, y_train)

    # Evaluation
    y_pred = clf.predict(X_dev)
    y_pred = jnp.array(y_pred)
    return classification_report(y_dev, y_pred)

# Benchmark both tokenisation methods
print(f"Word tokenisation benchmark:\n{benchmark_sklearn_nb(X_train_word, y_train, X_dev_word, y_dev)}")
print(f"BPE tokenisation benchmark:\n{benchmark_sklearn_nb(X_train_bpe, y_train, X_dev_bpe, y_dev)}")

Word tokenisation benchmark:
              precision    recall  f1-score   support

         0.0       0.85      0.90      0.87       894
         1.0       0.89      0.84      0.86       887

    accuracy                           0.87      1781
   macro avg       0.87      0.87      0.87      1781
weighted avg       0.87      0.87      0.87      1781

BPE tokenisation benchmark:
              precision    recall  f1-score   support

         0.0       0.84      0.92      0.88       894
         1.0       0.91      0.82      0.86       887

    accuracy                           0.87      1781
   macro avg       0.87      0.87      0.87      1781
weighted avg       0.87      0.87      0.87      1781
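
The columns in these reports can be reproduced by hand from the prediction counts. The sketch below computes precision, recall, and F1 for one class on made-up labels; `prf` is a hypothetical helper, not a scikit-learn function.

```python
def prf(y_true, y_pred, positive=1):
    # True positives, false positives, and false negatives for the chosen class
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 1]
print(prf(y_true, y_pred))  # precision, recall and F1 are all 2/3 here
```

Precision asks how many predicted positives were correct, recall asks how many true positives were found, and F1 is their harmonic mean.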



In `predict_nb`, a negative log-odds score is mapped to negative sentiment and a positive score to positive sentiment.

The next block trains the classifier for several smoothing values, predicts the sentiment of every development tweet, and stores the predictions in lists.

In [27]:
alpha_values = [0.1, 0.5, 1.0]

for alpha in alpha_values:
    # Train classifier on both feature sets
    bpe_prior, bpe_likelihood = train_nb(X_train_bpe, y_train, alpha=alpha)
    word_prior, word_likelihood = train_nb(X_train_word, y_train, alpha=alpha)

    # Store predictions in lists
    pred_bpe = []
    pred_word = []
    for i in range(y_dev.shape[0]):
        sentiment_bpe = predict_nb(X_dev_bpe[i], bpe_prior, bpe_likelihood)
        sentiment_word = predict_nb(X_dev_word[i], word_prior, word_likelihood)
        
        pred_bpe.append(sentiment_bpe)
        pred_word.append(sentiment_word)

    # Classification report
    print(f'Alpha: {alpha}')
    print('BPE tokenisation report:\n', classification_report(y_dev, pred_bpe))
    print('Word tokenisation report:\n', classification_report(y_dev, pred_word))
    print('-------------------------------------------')
    

Alpha: 0.1
BPE tokenisation report:
               precision    recall  f1-score   support

         0.0       0.79      0.96      0.87       894
         1.0       0.94      0.75      0.84       887

    accuracy                           0.85      1781
   macro avg       0.87      0.85      0.85      1781
weighted avg       0.87      0.85      0.85      1781

Word tokenisation report:
               precision    recall  f1-score   support

         0.0       0.79      0.94      0.86       894
         1.0       0.92      0.75      0.83       887

    accuracy                           0.84      1781
   macro avg       0.86      0.84      0.84      1781
weighted avg       0.86      0.84      0.84      1781

-------------------------------------------
Alpha: 0.5
BPE tokenisation report:
               precision    recall  f1-score   support

         0.0       0.79      0.96      0.87       894
         1.0       0.94      0.75      0.83       887

    accuracy                         