# CSC5035Z Natural Language Processing
# Tutorial 1: Text Processing and Classification

**Authors: Francois Meyer, Jan Buys**

**Introduction**

Welcome to the first practical tutorial for the NLP course! We will implement some of the ideas covered in this week's lectures and set you up for the practical assignment. This notebook provides skeleton code that you are encouraged to complete as part of your learning process.

**Topics:**

Content: text processing, tokenisation, text classification, word embeddings


**Aims/Learning Objectives:**

* Acquire skills in loading annotated NLP datasets.
* Implement basic principles of tokenisation.
* Train a text classification model (logistic regression).
* Explore the fundamentals of word embeddings.

**Prerequisites:**

* Familiarity with Python libraries for text processing (pandas, nltk, sklearn).
* Introductory knowledge of BPE tokenisation.
* Understanding of logistic regression.
* Introductory knowledge of Word2Vec.

**Outline:**

* [1. Text processing](#section1)
    * [1.1. Data loading](#section1_1)
    * [1.2. Data cleaning](#section1_2)
    * [1.3. Vocabulary construction](#section1_3)
    * [1.4. BPE tokenisation](#section1_4)
* [2. Text classification](#section2)
    * [2.1. Text vectorization](#section2_1)
    * [2.2. Logistic regression](#section2_2)
    * [2.3. Evaluation](#section2_3)
* [3. Word embeddings](#section3)
    * [3.1. Context windows](#section3_1)
    * [3.2. Word2Vec CBOW](#section3_2)
    * [3.3. Pretrained embeddings](#section3_3)
* [4. Conclusion](#section4)

# Installations, Imports and Downloads

In [16]:
import os
import warnings
import re
warnings.filterwarnings("ignore", category=UserWarning)

from collections import defaultdict

import pandas as pd
import numpy as np

import jax.numpy as jnp
from jax import grad
import matplotlib.pyplot as plt
from sklearn.metrics import classification_report

FS = (8, 4)  # figure size
RS = 124  # random seed

In [18]:
# Download dataset
PROJECT_DIR = os.getcwd() + '/afrisent-semeval-2023'
print('Current directory: ', PROJECT_DIR)
PROJECT_GITHUB_URL = 'https://github.com/afrisenti-semeval/afrisent-semeval-2023.git'

if not os.path.isdir(PROJECT_DIR):
  !git clone {PROJECT_GITHUB_URL}
else:
  %cd {PROJECT_DIR}
  !git pull {PROJECT_GITHUB_URL}

Current directory:  /Users/rogerbukuru/Documents/UCT Masters/MSc Statistics and Data Science/NLP-CSC5035Z/NLPTutsAssignments/Assignment-I/afrisent-semeval-2023
/Users/rogerbukuru/Documents/UCT Masters/MSc Statistics and Data Science/NLP-CSC5035Z/NLPTutsAssignments/Assignment-I/afrisent-semeval-2023
From https://github.com/afrisenti-semeval/afrisent-semeval-2023
 * branch            HEAD       -> FETCH_HEAD
Already up to date.


<a name="section1"></a>
# 1. Text processing

Throughout this notebook we will use the AfriSenti dataset. It is a sentiment analysis dataset for 14 African languages. Sentiment analysis is the task of classifying the emotional tone of a piece of text. Sentiment analysis datasets resemble the following table - they consist of pieces of text which have been annotated as _positive_ or _negative_ (some datasets also allow _neutral_ as a label).


| Text                     | Label |
|-----------------------------------|--------------|
| I haven't heard anything. I'm really worried actually             |  negative     |
| About to go to bed. I am so glad the Tigers won tonight!                     | positive          |

The AfriSenti dataset consists of tweets which have been human-labelled according to their emotional tone as either positive, negative, or neutral.


* [Paper introducing the dataset: _AfriSenti: A Twitter Sentiment Analysis Benchmark for African Languages_, Muhammad et al., 2023.](https://aclanthology.org/2023.emnlp-main.862.pdf)
* [Github repository containing the full dataset](https://github.com/afrisenti-semeval/afrisent-semeval-2023)

### AfriSenti languages

| No. | Language                     | Code | Country        |
|-----|------------------------------|--------------|----------------|
| 1   | Algerian Arabic              | arq          | Algeria        |
| 2   | Amharic                      | amh          | Ethiopia       |
| 3   | Hausa                        | hau          | Nigeria        |
| 4   | Igbo                         | ibo          | Nigeria        |
| 5   | Kinyarwanda                  | kin          | Rwanda         |
| 6   | Moroccan Arabic/Darija       | ary          | Morocco        |
| 7   | Mozambique Portuguese        | por        | Mozambique     |
| 8   | Nigerian Pidgin              | pcm          | Nigeria        |
| 9   | Oromo                        | orm          | Ethiopia       |
| 10  | Swahili                      | swa          | Kenya/Tanzania |
| 11  | Tigrinya                     | tir          | Ethiopia       |
| 12  | Twi                          | twi          | Ghana          |
| 13  | Xitsonga                     | tso          | Mozambique     |
| 14  | Yoruba                       | yor          | Nigeria        |


The dataset covers 14 languages spoken across the African continent. For each language, the dataset stores annotated tweets for training (*train.tsv*), validation (*dev.tsv*), and testing (*test.tsv*). In NLP datasets and implementations, languages are often referred to via abbreviations known as language codes. Your first task is to edit the next code cell to enter the language code of the language you want to use going forward in this notebook. The table above lists the language codes of each of the 14 languages.


In [19]:
# Choose language
language =  'swa'  # Can be ['arq', 'amh', 'hau', 'ibo', 'kin', 'ary', 'por', 'pcm', 'orm', 'swa', 'tir', 'twi', 'tso', 'yor']

<a name="section1_1"></a>
## 1.1. Data loading

Now we can load the train/dev/test datasets for our chosen language. Each dataset is stored as a .tsv file (tab-separated values) with two data columns (the tweet text and the sentiment label) separated by a tab character. Below we read these datasets into dataframes - the data structure that the pandas library uses for storing tabular data. We display a few rows of the training dataframe to show the data format.

In [20]:
# Load data
DATA_DIR = f'{PROJECT_DIR}/data/{language}'
print('Data directory: ', DATA_DIR)

train_df = pd.read_csv(f'{DATA_DIR}/train.tsv', sep='\t', names=['text', 'label'], header=0)
dev_df = pd.read_csv(f'{DATA_DIR}/dev.tsv', sep='\t', names=['text', 'label'], header=0)
test_df = pd.read_csv(f'{DATA_DIR}/test.tsv', sep='\t', names=['text', 'label'], header=0)

print('Train shape: ', train_df.shape)
print('Dev shape: ', dev_df.shape)
print('Test shape: ', test_df.shape)

# Display data
train_df.sample(n=10)

Data directory:  /Users/rogerbukuru/Documents/UCT Masters/MSc Statistics and Data Science/NLP-CSC5035Z/NLPTutsAssignments/Assignment-I/afrisent-semeval-2023/data/swa
Train shape:  (1810, 2)
Dev shape:  (453, 2)
Test shape:  (748, 2)


|      | text | label |
|------|------|-------|
| 1786 | Vunja ukimya rushwa ya ngono inadhalilisha na ... | positive |
| 1127 | vipi mzee wangu uko na mshua boy Shujaa huko k... | neutral |
| 652  | Habari pole sana kwa changamoto kwa sasa laini... | neutral |
| 1575 | Namshukuru sana baba yangu babu bibi mama zan... | positive |
| 947  | Lol Msukuma aliuliza kwahiyo umeme ukikatika t... | neutral |
| 1492 | Hatutakiwa kuhukumu Mungu pekee ndo anajukumu ... | positive |
| 540  | Taasisi zinahitaji kujiuliza zinataka page za ... | neutral |
| 1464 | Episode 4 hii ni Tamthilia yenye kuburudisha n... | positive |
| 637  | Viongozi mbalimbali na wanachama wa wakiwa kat... | neutral |
| 344  | wanasemaje kuhusu Majukwaa ya kidijitali Mtoto... | neutral |


<a name="section1_2"></a>
## 1.2. Data cleaning

Before we proceed, it's important to ensure that our data is clean and ready for processing. In this section, we will perform some basic data cleaning steps to remove unwanted elements in the text and prepare our dataset for NLP modelling purposes.

In this notebook we are interested in **binary classification** - predicting a tweet's emotional content as either **positive** or **negative**. We filter out tweets that are labelled as **neutral** before we continue.

In [21]:
# Discard neutral examples
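# .copy() gives independent dataframes, so the column assignments further down do not act on a slice of the originals
train_df = train_df[train_df['label'] != 'neutral'].copy()
dev_df = dev_df[dev_df['label'] != 'neutral'].copy()
test_df = test_df[test_df['label'] != 'neutral'].copy()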

The extent of data cleaning and preprocessing will depend on the quality of the raw dataset, the NLP task we are preparing the data for, and our personal preferences as NLP practitioners. The ``nltk`` library (Natural Language Toolkit) is a popular Python library for text processing and pre-processing. It supports tokenization, stemming, tagging, and parsing for several languages. We do not need it for this notebook, since we stick to rather basic preprocessing strategies:

* Replace all urls with a special '[URL]' token.
* Replace all numbers with a special '[NUM]' token.
* Remove extra whitespace on either side of the text.

In [22]:
def clean(text):
    # Replace URLS with [URL]
    text = re.sub(r'http\S+', '[URL]', text)

    # Replace numbers with [NUM]
    text = re.sub(r'\d+', '[NUM]', text)

    # Remove leading and trailing whitespace
    text = text.strip()

    return text

train_df['text'] = train_df['text'].apply(clean)
dev_df['text'] = dev_df['text'].apply(clean)
test_df['text'] = test_df['text'].apply(clean)
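
As a quick sanity check, the cleaning rules can be tried on a made-up string. This is only a minimal sketch: the function body mirrors `clean` above, while the example tweet is invented for illustration and does not come from AfriSenti.

```python
import re

def clean(text):
    # Replace URLs first, so that digits inside a URL are not turned into [NUM]
    text = re.sub(r'http\S+', '[URL]', text)
    text = re.sub(r'\d+', '[NUM]', text)
    return text.strip()

example_tweet = '  Tazama https://t.co/abc123 mwaka 2023   '  # hypothetical input
cleaned_tweet = clean(example_tweet)
print(cleaned_tweet)  # Tazama [URL] mwaka [NUM]
```

Note that the order of the two substitutions matters: if numbers were replaced first, the digits inside the URL would be rewritten before the URL itself is matched.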

<a name="section1_3"></a>
## 1.3. Vocabulary construction

One of the fundamental steps in text processing for NLP is constructing a vocabulary from our dataset. A vocabulary is a set of unique words or tokens present in the text corpus. In this section, we will create a vocabulary from our training dataset and explore its characteristics.

We refer to vocabulary items as **types** and to particular occurrences of these types in the dataset as **tokens**. For now we simply tokenise our text data based on the existing tokens in the raw text - we split text on white spaces.

For NLP purposes, we want to map each type in our vocabulary to an **index**, a unique number identifying that type. Later we can use this index to, for example, look up vector representations for our words using a lookup table. To achieve this, our vocabulary will be represented with three variables:
* index2type: list of unique types in the vocabulary e.g. ['word1', 'word2', 'word3', ...]
* type2index: dictionary mapping types to their index in the index2type vocabulary e.g. {'word1': 0, 'word2': 1, 'word3': 2, ...}
* type2count: dictionary mapping types to the number of corresponding token occurrences of that type in the training data e.g. {'word1': 1012, 'word2': 510, 'word3': 45, ...}
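
To make the difference between types and tokens concrete, here is a small sketch of the three variables on a toy corpus. The two sentences and the `toy_` variable names are invented for illustration, and the alphabetical ordering of the vocabulary is an arbitrary choice.

```python
from collections import Counter

toy_corpus = ['na sisi na wao', 'sisi na wao']  # invented sentences
toy_tokens = [tok for sentence in toy_corpus for tok in sentence.split()]

toy_type2count = Counter(toy_tokens)          # type -> number of tokens of that type
toy_index2type = sorted(toy_type2count)       # one entry per type
toy_type2index = {t: i for i, t in enumerate(toy_index2type)}

print(len(toy_tokens), 'tokens,', len(toy_index2type), 'types')  # 7 tokens, 3 types
print(toy_type2count['na'], toy_type2index['sisi'])              # 3 1
```

The corpus contains 7 tokens but only 3 types, because each type occurs more than once.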


In [23]:
# Store training data text as list of tweets
train_corpus = train_df['text'].tolist()
train_corpus[0:5]

['Kwani tanesco wanakataga umeme makusudinadhani kuna changamoto behind zinatakiwa zitatuliwe na sio kutoa matamko',
 'cjawahi kuona content yoyote zaidi ya kuwa analalamika cjawahi kuona akitafuta solution ya tattizo zaid ya kulalamikabasi tutakuwa na Rais wa ajabu huwa mwaka [NUM]',
 'Bomu lililokuwa limetegwa ndani ya gari likiwalenga wajenzi kutoka Uturuki limelipuka katika eneo la Afgoye kaskazini Magharibi mwa mji mkuu wa Mogadishu Somalia limeua watu wanne polisi wamesema',
 'Kuna video inasambaa mitandaoni jamaa amemfumania mkewe akiwa na jamaa mwingine huku akimlalamikia jamaa akimwambia kw',
 'Viwavijeshi wanapita katika hatua kuu [NUM] za ukuaji katika hatua yake ya larvae mayai [NUM] hutagwa na larvae mmoja kwa mwezi Hatua pekee inayowezesha kusambaa kwa viwa vijeshi ni kipepeo ambao huruka kwa makundi na wanaweza kuruka hadi kilometa [NUM]']

Next we have to decide how we are going to tokenize the tweets. Libraries like ``nltk`` provide regex-based tokenizers that are handcrafted for specific languages.  For now we will use the simple strategy of splitting text on white spaces - so our tokens will be the units of text divided by white spaces.

In [24]:
def whitespace_tokenize(sentences):
    return [sentence.split() for sentence in sentences]

tokenized_train_corpus = whitespace_tokenize(train_corpus)
tokenized_train_corpus[0:5]

[['Kwani',
  'tanesco',
  'wanakataga',
  'umeme',
  'makusudinadhani',
  'kuna',
  'changamoto',
  'behind',
  'zinatakiwa',
  'zitatuliwe',
  'na',
  'sio',
  'kutoa',
  'matamko'],
 ['cjawahi',
  'kuona',
  'content',
  'yoyote',
  'zaidi',
  'ya',
  'kuwa',
  'analalamika',
  'cjawahi',
  'kuona',
  'akitafuta',
  'solution',
  'ya',
  'tattizo',
  'zaid',
  'ya',
  'kulalamikabasi',
  'tutakuwa',
  'na',
  'Rais',
  'wa',
  'ajabu',
  'huwa',
  'mwaka',
  '[NUM]'],
 ['Bomu',
  'lililokuwa',
  'limetegwa',
  'ndani',
  'ya',
  'gari',
  'likiwalenga',
  'wajenzi',
  'kutoka',
  'Uturuki',
  'limelipuka',
  'katika',
  'eneo',
  'la',
  'Afgoye',
  'kaskazini',
  'Magharibi',
  'mwa',
  'mji',
  'mkuu',
  'wa',
  'Mogadishu',
  'Somalia',
  'limeua',
  'watu',
  'wanne',
  'polisi',
  'wamesema'],
 ['Kuna',
  'video',
  'inasambaa',
  'mitandaoni',
  'jamaa',
  'amemfumania',
  'mkewe',
  'akiwa',
  'na',
  'jamaa',
  'mwingine',
  'huku',
  'akimlalamikia',
  'jamaa',
  'akimwambia

Complete the following cells of code.
* The function ``count_tokens`` should take a list of tokenised sentences as input and count the corpus size (the total number of tokens).
* The function ``create_type_counts`` should take a list of tokenised sentences as input and count how often each type occurs.
* The function ``create_vocabulary`` should take these type counts as input and iteratively build a vocabulary.

In each case, sentences are tokenised on white spaces. This can be done with the function ``split()``, which splits a string into a list of its whitespace-delimited substrings.

e.g. ``sentence.split()``

In [25]:
# Count number of tokens in corpus
def count_tokens(sentences):
    """
    Count number of tokens in corpus

    param: sentences: list of list of tokens e.g. [['This', 'is', 'a', 'sentence'], ['This', 'is', 'another', 'sentence'], ...]
    return:
        count: number of tokens in corpus
    """
    total_tokens = 0
    for sentence in sentences:
        total_tokens += len(sentence)
    return total_tokens

In [None]:
num_tokens = count_tokens(tokenized_train_corpus)
print('Number of tokens in corpus: ', num_tokens)

In [26]:
# Collect type counts in corpus
def create_type_counts(sentences):
    """
    Count the number of token occurrences of each type in the corpus

    param: sentences: list of list of tokens e.g. [['This', 'is', 'a', 'sentence'], ['This', 'is', 'another', 'sentence'], ...]
    return:
        type2count: dictionary of type counts in corpus e.g. {'This': 2, 'sentence': 2, ...}
    """
    type2count = {}
    for sentence in sentences:
        for type_ in sentence:  # `type_` avoids shadowing the built-in `type`
            type2count[type_] = type2count.get(type_, 0) + 1
    return type2count
            

In [27]:
type2count = create_type_counts(tokenized_train_corpus)
print('Number of types in corpus: ', len(type2count))

# Sort types by counts
type2count = dict(sorted(type2count.items(), key=lambda x: x[1], reverse=True))

# Print first few types and counts
for i, (type_, count) in enumerate(type2count.items()):
    print(f'{type_}: {count}')
    if i == 5:
        break

Number of types in corpus:  5383
ya: 529
na: 505
wa: 394
kwa: 321
[NUM]: 216
ni: 136


In [28]:
# Create vocabulary
def create_vocabulary(type2count, min_count):
    """
    This function creates an indexed vocabulary from vocabulary counts and returns it as a list and a dictionary.

    param:
        type2count: dictionary of type counts in corpus e.g. {'This': 2, 'sentence': 2, ...}
        min_count: minimum count of a word to be included in the vocabulary
    return:
        index2type: list of words in the vocabulary e.g. ['word1', 'word2', 'word3', ...]
        type2index: dictionary mapping words to their index in the index2type vocabulary e.g. {'word1': 0, 'word2': 1, 'word3': 2, ...}
    """
    index2type = []
    type2index = {}
    for type_, count in type2count.items():
        if count >= min_count:
            index2type.append(type_)
            type2index[type_] = len(index2type) - 1
    return index2type, type2index
            

In [None]:
index2type, type2index = create_vocabulary(type2count, min_count=1)

# It's good practice to add a special token for unknown words and padding (to make all sentences in training batches the same length)
type2index['<UNK>'] = len(index2type)
index2type.append('<UNK>')
type2index['<PAD>'] = len(index2type)
index2type.append('<PAD>')

print('Vocabulary size: ', len(index2type))
print('First 10 words in the vocabulary: ', index2type[0:10])
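
The `<UNK>` entry becomes useful as soon as we map unseen text to indices. The helper below is a hypothetical sketch (the name `encode` and the tiny stand-in vocabulary are not part of the tutorial code) of how tokens could be converted to indices with `<UNK>` as a fallback.

```python
def encode(tokens, type2index, unk_token='<UNK>'):
    """Map tokens to vocabulary indices, using the <UNK> index for unseen tokens."""
    unk_index = type2index[unk_token]
    return [type2index.get(tok, unk_index) for tok in tokens]

# Tiny stand-in vocabulary, invented for illustration
small_type2index = {'na': 0, 'ya': 1, '<UNK>': 2, '<PAD>': 3}
encoded = encode(['ya', 'habari', 'na'], small_type2index)
print(encoded)  # [1, 2, 0]
```

Here 'habari' is not in the stand-in vocabulary, so it is mapped to the index of `<UNK>`.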

<a name="section1_4"></a>
## 1.4. BPE tokenisation



So far we have tokenised sentences based on the white spaces separating tokens in raw text. In most modern NLP systems, sentences are tokenised into subword tokens instead of words. This approach helps in handling out-of-vocabulary words and improves the model's ability to capture morphological (subword) information.

Byte Pair Encoding (BPE) is a popular subword tokenisation algorithm in NLP. In this section, we will implement the BPE algorithm and apply it to our dataset.

BPE and related algorithms have two parts:
* A type learner that takes a raw training corpus and induces a vocabulary (a set of types) of prespecified size (e.g. 1000 subwords).
* A token segmenter that takes a raw test sentence and tokenises it according to that subword vocabulary.

## BPE type learner (train on training set)

1. Start with a vocabulary consisting of all individual characters e.g. {A, B, C, D,…, a, b, c, d...}.
2. Repeat until the prespecified vocabulary size has been reached:
    * Choose the two symbols that are most frequently adjacent in the training corpus (say 'A', 'B').
    * Merge these symbols and add the newly merged symbol 'AB' to the vocabulary.
    * Replace every adjacent 'A' 'B' in the corpus with 'AB'.


## BPE token segmenter (apply to train/dev/test set)

Segmenter algorithm: Run each merge learned from the training data greedily, in the order they were learned (test frequencies don't play a role).

So merge every "A" "B" to "AB", then merge "AB" "C" to "ABC", etc.
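
The segmenter can be sketched in a few lines. This is an illustrative sketch rather than a reference implementation: the function name `apply_merges` and the list of merges are assumptions made for the example.

```python
def apply_merges(word, merges):
    """Segment one word by replaying merges in the order they were learned."""
    symbols = list(word)  # start from individual characters
    for a, b in merges:
        i = 0
        while i < len(symbols) - 1:
            if symbols[i] == a and symbols[i + 1] == b:
                symbols[i:i + 2] = [a + b]  # merge the adjacent pair in place
            else:
                i += 1
    return symbols

example_merges = [('A', 'B'), ('AB', 'C')]  # hypothetical learned merges
segmented = apply_merges('ABCAB', example_merges)
print(segmented)  # ['ABC', 'AB']
```

The first merge turns the word into 'AB' 'C' 'AB', and the second merge joins only the first 'AB' with the following 'C'.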

## Other details

* Usually basic tokenization is performed first (space-based tokenization and separating punctuation). BPE is then applied to the initial tokens.
* To enable the algorithm to learn to represent the boundary between tokens, commonly a special end-of-word symbol '_' is added before spaces in the training corpus (or alternatively between the space and the next word).
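
As a small illustration of the end-of-word symbol, the sketch below appends '_' to every word before splitting it into characters. The helper name `to_symbols` and the example words are assumptions made for this example.

```python
def to_symbols(word, end_of_word='_'):
    """Split a word into characters and append the end-of-word marker."""
    return list(word) + [end_of_word]

example_words = 'na sisi'.split()
symbol_sequences = [to_symbols(w) for w in example_words]
print(symbol_sequences)  # [['n', 'a', '_'], ['s', 'i', 's', 'i', '_']]
```

With the marker in place, BPE can learn a symbol such as 'na_' for the complete word, distinct from 'na' occurring at the start or in the middle of a longer word.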

## Pseudocode

```
function BYTE-PAIR ENCODING(strings C, number of merges k) returns vocab V
    V <- all unique characters in C     # initial set of tokens is characters
    for i = 1 to k do                   # merge tokens k times / until vocab size reached
        t_L, t_R <- Most frequent pair of adjacent tokens in C
        t_new <- t_L + t_R              # make new token by concatenating
        V <- V + t_new                  # update the vocabulary
        Replace each occurrence of t_L, t_R in C with t_new   # and update the corpus
    return V
```
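
The pseudocode can be traced on a toy example before tackling the full implementation. The sketch below is one possible reading of it, under a few assumptions: the corpus is given as a dictionary of word counts (the words and counts are toy values), an end-of-word symbol '_' is appended to every word, and ties between equally frequent pairs are broken in favour of the pair that was counted first.

```python
from collections import Counter

def learn_bpe(word_counts, num_merges):
    """Learn up to `num_merges` BPE merges from a {word: count} dictionary."""
    # Each word starts as a sequence of characters plus an end-of-word marker
    splits = {word: list(word) + ['_'] for word in word_counts}
    vocab = {symbol for symbols in splits.values() for symbol in symbols}
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by how often each word occurs
        pair_counts = Counter()
        for word, symbols in splits.items():
            for left, right in zip(symbols, symbols[1:]):
                pair_counts[(left, right)] += word_counts[word]
        if not pair_counts:
            break  # nothing left to merge
        left, right = max(pair_counts, key=pair_counts.get)
        merges.append((left, right))
        vocab.add(left + right)
        # Replace every adjacent occurrence of the pair in the corpus
        for symbols in splits.values():
            i = 0
            while i < len(symbols) - 1:
                if symbols[i] == left and symbols[i + 1] == right:
                    symbols[i:i + 2] = [left + right]
                else:
                    i += 1
    return vocab, merges

toy_counts = {'low': 5, 'lowest': 2, 'newer': 6, 'wider': 3, 'new': 2}  # toy corpus
learned_vocab, learned_merges = learn_bpe(toy_counts, num_merges=3)
print(learned_merges)  # [('e', 'r'), ('er', '_'), ('n', 'e')]
```

The pair ('e', 'r') occurs 9 times (6 times in 'newer' and 3 times in 'wider'), so it is merged first; the new symbol 'er' is then always followed by '_', which gives the second merge.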




In [92]:
# Implement BPE algorithm
# The following class provides a skeleton for implementing the BPE algorithm.
# You can use the existing code and method headings as a guideline, or structure
# the code as you prefer.

class BPETokenizer():

    def __init__(self, sentences, vocab_size):
        """
        Initialize the BPE tokenizer.

        Args:
            sentences (list[list[str]]): list of list of tokens e.g. [['This', 'is', 'a', 'sentence'], ['This', 'is', 'another', 'sentence'], ...]
            vocab_size (int): The desired vocabulary size after training.
        """
        self.sentences = sentences
        self.vocab_size = vocab_size
        self.word_freqs = defaultdict(int)  # word -> number of occurrences in the corpus
        self.splits = {}                    # word -> current list of symbols for that word
        self.merges = {}                    # (a, b) -> 'ab', in the order the merges were learned


    def train(self):
        """
        Train the BPE tokenizer by iteratively merging the most frequent pairs of symbols.

        Returns:
            dict: A dictionary of merges in the format {(a, b): 'ab'}, where 'a' and 'b' are symbols merged into 'ab'.
        """
        # Count each word and split every unique word into characters
        self.word_freqs = defaultdict(int)
        for sentence in self.sentences:
            for word in sentence:
                self.word_freqs[word] += 1
        self.splits = {word: list(word) for word in self.word_freqs}
        self.merges = {}

        # The initial vocabulary consists of all characters in the corpus
        vocab = {char for word in self.word_freqs for char in word}

        # Keep merging until the requested vocabulary size is reached
        while len(vocab) < self.vocab_size:
            pair_freqs = self.compute_pair_freqs()
            if not pair_freqs:
                break  # every word is already a single symbol
            pair = max(pair_freqs, key=pair_freqs.get)  # most frequent pair
            self.merge_pair(pair[0], pair[1])
            self.merges[pair] = pair[0] + pair[1]
            vocab.add(pair[0] + pair[1])
        return self.merges


    def compute_pair_freqs(self):
        """
        Compute the frequency of each pair of adjacent symbols in the corpus.

        Returns:
            dict: A dictionary of pairs and their frequencies in the format {(a, b): frequency}.
        """
        pair_freqs = defaultdict(int)
        for word, split in self.splits.items():
            for i in range(len(split) - 1):
                # Weight each pair by the number of times the word occurs
                pair_freqs[(split[i], split[i + 1])] += self.word_freqs[word]
        return pair_freqs

    def merge_pair(self, a, b):
        """
        Merge the given pair of symbols in all words where they appear adjacent.

        Args:
            a (str): The first symbol in the pair.
            b (str): The second symbol in the pair.

        Returns:
            dict: The updated splits dictionary after merging.
        """
        for word, split in self.splits.items():
            merged = []
            i = 0
            while i < len(split):
                if i < len(split) - 1 and split[i] == a and split[i + 1] == b:
                    merged.append(a + b)  # replace only this adjacent occurrence
                    i += 2
                else:
                    merged.append(split[i])
                    i += 1
            self.splits[word] = merged
        return self.splits

    def tokenize(self, text):
        """
        Tokenize a given text using the trained BPE tokenizer.

        Args:
            text (str): The text to be tokenized.

        Returns:
            list[str]: A list of tokens obtained after applying BPE tokenization.
        """
        # Pre-tokenise on white spaces, then split every word into characters
        splits_text = [list(word) for word in text.split()]

        # Replay the merges in the order in which they were learned
        for pair, merge in self.merges.items():
            for idx, split in enumerate(splits_text):
                i = 0
                while i < len(split) - 1:
                    if split[i] == pair[0] and split[i + 1] == pair[1]:
                        split = split[:i] + [merge] + split[i + 2 :]
                    else:
                        i += 1
                splits_text[idx] = split
        return sum(splits_text, [])

Now let's train a BPE tokeniser on our AfriSenti training corpus, apply it to our corpus, and see how our vocabulary changes after subword tokenisation.

In [95]:
# Train BPE
bpe = BPETokenizer(tokenized_train_corpus, vocab_size=1000)
merges = bpe.train()
print('Number of merges: ', len(merges))
print('First 10 merges: ', list(merges.items())[:10])

# Tokenize text
text = 'This is a test sentence.'
tokens = bpe.tokenize(text)
print('BPE tokens: ', tokens)

# Apply to our dataset (the next cell relies on the 'bpe_text' column)
train_df['bpe_text'] = train_df['text'].apply(lambda x: ' '.join(bpe.tokenize(x)))
dev_df['bpe_text'] = dev_df['text'].apply(lambda x: ' '.join(bpe.tokenize(x)))
test_df['bpe_text'] = test_df['text'].apply(lambda x: ' '.join(bpe.tokenize(x)))

train_df.head()




We now create a vocabulary of BPE tokens, based on our tokenised corpus. Specifying the ``vocab_size`` parameter of our BPE training algorithm allows us to control the vocabulary size, which enables smaller vocabularies than word-based tokenisation.

In [None]:
bpe_corpus = train_df['bpe_text'].tolist()
tokenized_bpe_corpus = whitespace_tokenize(bpe_corpus)

# Count number of BPE tokens in corpus
num_tokens = count_tokens(tokenized_bpe_corpus)
print('Number of BPE tokens in corpus: ', num_tokens)

# Collect type counts in BPE corpus
bpe_type2count = create_type_counts(tokenized_bpe_corpus)
print('Number of BPE types in corpus: ', len(bpe_type2count))

# Sort types by counts
bpe_type2count = dict(sorted(bpe_type2count.items(), key=lambda x: x[1], reverse=True))

# Print first few types and counts
for i, (type_, count) in enumerate(bpe_type2count.items()):
    print(f'{type_}: {count}')
    if i == 5:
        break

# Create a vocabulary for BPE tokens
bpe_index2type, bpe_type2index = create_vocabulary(bpe_type2count, min_count=2)
print('Vocabulary size: ', len(bpe_index2type))
print('First 10 BPE tokens in the vocabulary: ', bpe_index2type[0:10])

<a name="section2"></a>
# 2. Text Classification

Now we will use AfriSenti to train a text classification model for sentiment analysis. We frame the task as a binary classification problem - the model is trained to predict whether a given piece of text is positive or negative in its emotional tone.

<a name="section2_1"></a>
## 2.1. Text vectorization

In order to train a machine learning model for our binary classification task, we need to transform the textual data into a numerical format that the model can understand and process. This transformation is known as vectorization: the process of converting text into numerical vectors. Machine learning models operate on numerical data, so vectorization is a critical step in preparing text for modeling. There are several methods for vectorizing text, including:

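* One-hot encoding: Represents each word as a binary vector with a 1 in the position corresponding to the word's index in the vocabulary and 0s elsewhere. A sentence can then be represented by a single binary vector with a 1 for every vocabulary item that occurs in it.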
* Term Frequency-Inverse Document Frequency (TF-IDF): Weighs the words based on their frequency in the document and their rarity across all documents, providing a more informative representation.
* Word Embeddings: Maps words to dense vectors in a continuous vector space, capturing semantic similarities between words.
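
To illustrate the idea behind TF-IDF, here is a minimal sketch using the textbook definition (raw term counts multiplied by the log of the inverse document frequency). The toy documents are invented, and the numbers will differ from those of scikit-learn's `TfidfVectorizer`, which applies smoothing and normalisation by default.

```python
import numpy as np

toy_docs = [['nzuri', 'sana', 'sana'], ['mbaya', 'sana']]  # invented tokenised documents
doc_vocab = sorted({tok for doc in toy_docs for tok in doc})
col = {tok: j for j, tok in enumerate(doc_vocab)}

# Term frequency: raw count of each type in each document
tf = np.zeros((len(toy_docs), len(doc_vocab)))
for i, doc in enumerate(toy_docs):
    for tok in doc:
        tf[i, col[tok]] += 1

# Inverse document frequency: log(N / number of documents containing the type)
df = (tf > 0).sum(axis=0)
idf = np.log(len(toy_docs) / df)

tfidf = tf * idf
print(doc_vocab)
print(tfidf)
```

The word 'sana' occurs in both documents, so its IDF is log(2/2) = 0 and it receives zero weight, whereas 'nzuri' and 'mbaya' each occur in only one document and are weighted by log(2).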

In this tutorial, we will explore different vectorization techniques and apply them to our text data to prepare it for classification.

In [None]:
def one_hot_vectorize(sentences, type2index):
    """
    One-hot encode a list of sentences.

    param:
        sentences: list of list of tokens e.g. [['This', 'is', 'a', 'sentence'], ['This', 'is', 'another', 'sentence'], ...]
        type2index: dictionary mapping words to their index in the vocabulary e.g. {'word1': 0, 'word2': 1, 'word3': 2, ...}
    return:
        one_hot_sentences: 2d numpy array of one-hot encoded sentences e.g. [[1, 0, 0, 1, ...], [0, 1, 1, 0, ...], ...]
    """
    # TODO: COMPLETE THIS CODE
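
If you get stuck, the following is one possible way to complete the function, offered as a sketch rather than the intended solution. It assumes that each sentence is encoded as a single binary vector over the vocabulary and that tokens missing from the vocabulary are mapped to `<UNK>` when that entry exists; both choices are assumptions, and the stand-in vocabulary is invented for the example.

```python
import numpy as np

def one_hot_vectorize(sentences, type2index):
    """Binary bag-of-words: X[i, j] = 1 if vocabulary item j occurs in sentence i."""
    unk_index = type2index.get('<UNK>')  # assumed fallback for unseen tokens
    X = np.zeros((len(sentences), len(type2index)), dtype=np.float32)
    for i, sentence in enumerate(sentences):
        for token in sentence:
            j = type2index.get(token, unk_index)
            if j is not None:
                X[i, j] = 1.0
    return X

# Tiny stand-in vocabulary, invented for illustration
demo_type2index = {'na': 0, 'ya': 1, '<UNK>': 2, '<PAD>': 3}
X_demo = one_hot_vectorize([['ya', 'na', 'ya'], ['habari']], demo_type2index)
print(X_demo)
```

The first sentence activates the columns for 'na' and 'ya' (repeated tokens still give a 1), and the second sentence only activates the `<UNK>` column.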

In [None]:
train_sentences = train_df["text"].tolist()
dev_sentences = dev_df["text"].tolist()
test_sentences = test_df["text"].tolist()

tokenized_train_sentences = whitespace_tokenize(train_sentences)
tokenized_dev_sentences = whitespace_tokenize(dev_sentences)
tokenized_test_sentences = whitespace_tokenize(test_sentences)

In [None]:
X_train = one_hot_vectorize(tokenized_train_sentences, type2index)
X_dev = one_hot_vectorize(tokenized_dev_sentences, type2index)
X_test = one_hot_vectorize(tokenized_test_sentences, type2index)
print('Train length: ', len(X_train))
print('Dev length: ', len(X_dev))
print('Test length: ', len(X_test))

# Print examples
print('Train example: ', X_train[0:5])

In [None]:
## TODO: uncomment to use TF-IDF vectorizer instead of one-hot encoding
# from sklearn.feature_extraction.text import TfidfVectorizer
# vectorizer = TfidfVectorizer()

# X_train = vectorizer.fit_transform(train_sentences)
# X_dev = vectorizer.transform(dev_sentences)
# X_test = vectorizer.transform(test_sentences)

# X_train = X_train.toarray()
# X_dev = X_dev.toarray()
# X_test = X_test.toarray()

Now that we have converted our text to vectors, we can store them as JAX arrays to later feed as input to our classification model. We also convert the output labels to numerical representations (positive = 1 and negative = 0).

In [None]:
X_train = jnp.array(X_train, dtype=jnp.float16)
X_dev = jnp.array(X_dev, dtype=jnp.float16)
X_test = jnp.array(X_test, dtype=jnp.float16)
n_feat = X_train.shape[1]

y_train = train_df["label"]
y_train = jnp.array(y_train.map({"positive": 1, "negative": 0}), dtype=jnp.float16)

y_dev = dev_df["label"]
y_dev = jnp.array(y_dev.map({"positive": 1, "negative": 0}), dtype=jnp.float16)

y_test = test_df["label"]
y_test = jnp.array(y_test.map({"positive": 1, "negative": 0}), dtype=jnp.float16)

<a name="section2_2"></a>
## 2.2. Logistic regression

In this tutorial, we are going to implement logistic regression for sentiment analysis. Logistic regression is a discriminative classifier that models the decision boundary between data from different classes, i.e., $p(y | \mathbf{x})$. In our case, this means modeling the decision boundary between positive and negative tweets.

We represent our data as pairs $(\mathbf{x}, y)$ of (input, output) / (tweet, label). Our logistic regression implementation requires the following components:

**1. A feature representation of the input.**
   - For each tweet, extract a vector of features $\mathbf{x} = [x_1, x_2, ... , x_n]^T$.
   - We will use a vectorizer like one-hot encoding or TF-IDF.

**2. A classification function that computes the estimated class via $p(y|\mathbf{x})$.**
   - In logistic regression, the classification function is the sigmoid function, defined as $\sigma(z) = \frac{1}{1 + e^{-z}}$, where $z = \mathbf{w}^T\mathbf{x} + b$, and $\mathbf{w}$ and $b$ are the parameters to be learned.
   - The output of the sigmoid function, $\sigma(z)$, represents the probability that the input $(\mathbf{x})$ belongs to the positive class i.e., $p(y=1|\mathbf{x})$.

**3. An objective function for learning, like cross-entropy loss.**
   - The cross-entropy loss function for binary classification is given by $-\frac{1}{N}\sum_{i=1}^{N}[y_i\log(\hat{y}_i) + (1 - y_i)\log(1 - \hat{y}_i)]$, where $N$ is the number of samples, $y_i$ is the true label, and $\hat{y}_i$ is the predicted probability that the $i$-th sample belongs to the positive class.
   - The goal is to minimize this loss function with respect to the parameters $\mathbf{w}$ and $b$.

**4. An algorithm for optimizing the objective function, like stochastic gradient descent (SGD).**
   - SGD is an iterative optimization algorithm that updates the parameters $\mathbf{w}$ and $b$ in the direction of the negative gradient of the loss function with respect to these parameters.
   - The updates are made for each training sample or a batch of samples, leading to the update rules: $\mathbf{w} \leftarrow \mathbf{w} - \alpha \nabla_{\mathbf{w}}\mathcal{L}$ and $b \leftarrow b - \alpha \nabla_{b}\mathcal{L}$, where $\alpha$ is the learning rate, $\nabla_{\mathbf{w}}\mathcal{L}$ and $\nabla_{b}\mathcal{L}$ are the gradients of the loss function with respect to $\mathbf{w}$ and $b$, respectively.
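
The update rules can be sketched on a tiny invented dataset. The example below is an illustration under stated assumptions: the function name `toy_loss`, the data, the learning rate and the number of steps are all made up, and every step uses the whole toy dataset, so strictly speaking it is batch gradient descent rather than SGD.

```python
import jax.numpy as jnp
from jax import grad

def toy_loss(b, w, X, y):
    """Mean binary cross-entropy of a logistic regression model."""
    p = 1 / (1 + jnp.exp(-(jnp.dot(X, w) + b)))
    p = jnp.clip(p, 1e-7, 1 - 1e-7)  # keep log() finite
    return -jnp.mean(y * jnp.log(p) + (1 - y) * jnp.log(1 - p))

# Invented toy data: 4 examples with 2 features each; the label equals the first feature
X_toy = jnp.array([[1.0, 0.0], [1.0, 1.0], [0.0, 1.0], [0.0, 0.0]])
y_toy = jnp.array([1.0, 1.0, 0.0, 0.0])

b, w = 0.0, jnp.zeros(2)
alpha = 0.5  # learning rate (assumed value)
loss_before = toy_loss(b, w, X_toy, y_toy)

for _ in range(100):
    # Gradients of the loss with respect to b (argument 0) and w (argument 1)
    grad_b, grad_w = grad(toy_loss, argnums=(0, 1))(b, w, X_toy, y_toy)
    w = w - alpha * grad_w
    b = b - alpha * grad_b

loss_after = toy_loss(b, w, X_toy, y_toy)
print(loss_before, loss_after)
```

With all parameters at zero the model predicts 0.5 for every example, so the initial loss is log(2); after the updates the loss should be clearly lower.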

In the following sections, we will implement these components step by step to build our logistic regression model for sentiment analysis.

In [None]:
# Let's start by defining our classification function p(y|x)
def sigmoid(r):
    return 1 / (1 + jnp.exp(-r))

In [None]:
# We can plot the sigmoid function to see its shape
z = 10
r = jnp.linspace(-z, z, 200)
_, ax = plt.subplots(figsize=FS)
plt.plot(r, sigmoid(r))
ax.grid()
_ = ax.set(xlabel="r", ylabel="$logistic(r)$", title="The $logistic$ curve")
_ = ax.set_xlim(-z, z)

In [None]:
# Now let's use the sigmoid function to define our classification function p(y|x)
# It takes as input the bias b, the weights w, and the input features X
def predict(b, w, X):
    return sigmoid(jnp.dot(X, w) + b)

In [None]:
# Now we can define our cost/objective function, which is the cross-entropy loss
# Remember to clip values to avoid log(0)
eps=1e-14
def cross_entropy_loss(b, w, X, y, lmbd=0.1):
    n = y.size
    p = predict(b, w, X)
    p = jnp.clip(p, eps, 1 - eps)  # clip predictions to avoid log(0)

    # TODO: COMPLETE THIS CODE

In [None]:
# Let's test these methods with simple fixed initial parameters for the logistic regression model
b_0 = 1.0
w_0 = 1.0e-5 * jnp.ones(n_feat)
print(cross_entropy_loss(b_0, w_0, X_train, y_train))

y_pred_proba = predict(b_0, w_0, X_test)
print(y_pred_proba[:5])

# Threshold the probabilities at 0.5 to obtain hard 0/1 labels
y_pred = jnp.where(y_pred_proba >= 0.5, 1.0, 0.0)
print(y_pred[:5])

In [None]:
# We can use the grad function from JAX to compute the gradients of the cross-entropy loss function
print(grad(cross_entropy_loss, argnums=0)(b_0, w_0, X_train, y_train))
print(grad(cross_entropy_loss, argnums=1)(b_0, w_0, X_train, y_train))

In the following code cell, we will train our logistic regression model using the training dataset. The training process involves iteratively updating the model's parameters to minimize the cross-entropy loss function. We will use stochastic gradient descent (SGD) as our optimization algorithm. During training, the model learns to predict the sentiment of tweets (positive or negative) based on the features extracted from the text. It's important to monitor the loss during training to ensure that the model is learning effectively.

In [None]:
%%time
n_iter = 1000
eta = 5e-2
tol = 1e-6
w = w_0
b = b_0

new_loss = float(cross_entropy_loss(b, w, X_train, y_train))
loss_hist = [new_loss]
for i in range(n_iter):
    # TODO: COMPLETE THIS CODE

In [None]:
_, ax = plt.subplots(figsize=FS)
plt.semilogy(loss_hist)
ax.grid()
_ = ax.set(xlabel="Iteration", ylabel="Cost value", title="Convergence history")
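
As a point of comparison, the overall shape of a mini-batch SGD loop with a tolerance-based stopping rule can be sketched on synthetic data. This is only one possible arrangement, not the reference solution: it uses NumPy rather than JAX, omits regularisation, and relies on the closed-form gradient of the cross-entropy loss for logistic regression (prediction error times input) instead of `grad`. The names `loss_fn`, `X_toy`, `y_toy` and `batch_size` are assumptions introduced for this example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss_fn(b, w, X, y, eps=1e-14):
    p = np.clip(sigmoid(X @ w + b), eps, 1 - eps)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 3))                            # synthetic features
y_toy = (X_toy[:, 0] + 0.5 * X_toy[:, 1] > 0).astype(float)  # synthetic labels

b, w = 0.0, np.zeros(3)
eta, tol, n_epochs, batch_size = 5e-2, 1e-6, 50, 32
loss_hist = [loss_fn(b, w, X_toy, y_toy)]
for epoch in range(n_epochs):
    order = rng.permutation(len(y_toy))                      # reshuffle every epoch
    for start in range(0, len(order), batch_size):
        idx = order[start:start + batch_size]
        residual = sigmoid(X_toy[idx] @ w + b) - y_toy[idx]  # prediction error
        b -= eta * residual.mean()                           # gradient w.r.t. b
        w -= eta * X_toy[idx].T @ residual / len(idx)        # gradient w.r.t. w
    loss_hist.append(loss_fn(b, w, X_toy, y_toy))
    if abs(loss_hist[-2] - loss_hist[-1]) < tol:             # stop when progress stalls
        break
print(f"{len(loss_hist) - 1} epochs, final loss {loss_hist[-1]:.4f}")
```

Recording the loss once per epoch keeps the convergence plot readable; with mini-batches the curve need not decrease at every step, so the stopping rule compares absolute differences.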

<a name="section2_3"></a>
## 2.3. Evaluation

After training the model, it's crucial to evaluate its performance on a separate validation or test dataset. In the following code cells, we will use the `classification_report` function from `sklearn` to assess the model's performance on several metrics, including precision, recall, and F1 score.

* Precision measures the proportion of correctly predicted positive instances out of all predicted positive instances.
* Recall measures the proportion of correctly predicted positive instances out of all actual positive instances.
* The F1 score is the harmonic mean of precision and recall, providing a single metric that balances both.

**The macro-averaged F1 is often used as a single reference to compare performance, since it incorporates precision and recall across different classes.**

These metrics are particularly important in the context of class imbalance, where the number of instances in different classes may vary significantly. Additionally, we will compare the model's performance against an untrained model with fixed initial parameters, to check that our model is making meaningful predictions and not just guessing based on class distribution.
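
To see how these metrics behave on imbalanced data, they can be computed by hand on a handful of labels. The sketch below is plain Python; the helper `precision_recall_f1`, the toy labels, and the convention of returning 0.0 when a denominator is zero are assumptions made for this illustration.

```python
def precision_recall_f1(y_true, y_pred, positive):
    """Precision, recall and F1 when `positive` is treated as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy labels (made up): 6 negatives and 2 positives
y_true_toy = [0, 0, 0, 0, 0, 0, 1, 1]
y_pred_toy = [0, 0, 0, 0, 0, 1, 1, 0]

per_class = {c: precision_recall_f1(y_true_toy, y_pred_toy, c) for c in (0, 1)}
macro_f1 = sum(f1 for _, _, f1 in per_class.values()) / len(per_class)
accuracy = sum(t == p for t, p in zip(y_true_toy, y_pred_toy)) / len(y_true_toy)
print(per_class, macro_f1, accuracy)
```

Here the accuracy is 0.75 while the macro-averaged F1 is about 0.67, because the minority class (F1 of 0.5) counts as much as the majority class in the macro average.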

In [None]:
y_pred_proba = predict(b, w, X_test)
# Threshold the probabilities at 0.5 to obtain hard 0/1 labels
y_pred = jnp.where(y_pred_proba >= 0.5, 1.0, 0.0)
print(classification_report(y_test, y_pred))

In [None]:
# Let's compare against an untrained model (fixed initial parameters)
b_0_random = 1.0
w_0_random = 1.0e-5 * jnp.ones(n_feat)
y_pred_proba_random = predict(b_0_random, w_0_random, X_test)
# Threshold the probabilities at 0.5 to obtain hard 0/1 labels
y_pred_random = jnp.where(y_pred_proba_random >= 0.5, 1.0, 0.0)
print(y_pred_random[:5])
print(classification_report(y_test, y_pred_random))

<a name="section2_4"></a>
## 2.4. Addressing class imbalance

You may have noticed that your model is not performing any better than the untrained model. It might be worth experimenting with different training hyperparameters (e.g. learning rate, number of epochs, weight initialisation), which could be affecting performance. Another possibility is class imbalance. This occurs when the number of instances in one class significantly outnumbers those in another, which is a common scenario in real-world datasets. For instance, in sentiment analysis, positive sentiments may be more prevalent than negative ones, or vice versa.

To confirm if class imbalance is affecting your model, you should:

* Investigate the distribution of classes in your test set. A simple count of instances in each class will reveal any imbalance.
* Analyze the predictions made by your model. If it's predominantly predicting the majority class, class imbalance is likely skewing its performance.
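
Both checks amount to counting labels. A minimal sketch, assuming the labels can be converted to Python integers; the helper name `class_distribution` and the toy arrays are made up for illustration (in the notebook you would pass `y_test` and `y_pred` instead).

```python
from collections import Counter

def class_distribution(labels):
    """Map each class to (count, fraction of all instances)."""
    counts = Counter(int(label) for label in labels)
    total = sum(counts.values())
    return {label: (count, count / total) for label, count in sorted(counts.items())}

# Toy labels and predictions (made up)
y_true_toy = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]
y_pred_toy = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]

print("true:", class_distribution(y_true_toy))
print("pred:", class_distribution(y_pred_toy))
```

If the predicted distribution is far more skewed towards one class than the true distribution, the model is leaning on the majority class.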

If class imbalance is indeed present, there are several strategies you can apply to your logistic regression model to mitigate its effects:

* Oversampling the minority class or undersampling the majority class can balance the class distribution.

* Modify the logistic regression's loss function to incorporate weights that reflect the class distribution, as discussed earlier.

* Use different regularization terms for different classes to penalize misclassification of the minority class more than the majority.

If you suspect class imbalance is hurting your model's performance, experiment with one of these methods to see if it solves the issue.
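
As an illustration of the first strategy, random oversampling can be sketched in a few lines of NumPy. This is a sketch under assumptions, not the tutorial's method: the function name `oversample_minority` and the toy arrays are hypothetical, the labels are assumed to be binary, and resampling should only ever be applied to the training split.

```python
import numpy as np

def oversample_minority(X, y, seed=0):
    """Randomly duplicate minority-class rows until both classes are equally frequent."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    minority = classes[np.argmin(counts)]
    n_extra = counts.max() - counts.min()
    minority_idx = np.flatnonzero(y == minority)
    extra_idx = rng.choice(minority_idx, size=n_extra, replace=True)
    # Combine original and duplicated rows, then shuffle their order
    keep = rng.permutation(np.concatenate([np.arange(len(y)), extra_idx]))
    return X[keep], y[keep]

# Toy data (made up): 8 negatives and 2 positives
X_toy = np.arange(20, dtype=float).reshape(10, 2)
y_toy = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])

X_bal, y_bal = oversample_minority(X_toy, y_toy)
print(np.bincount(y_bal))  # [8 8]
```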

The following cell block contains code for a weighted loss function that introduces weights into the loss calculation to account for class imbalance. By incorporating these weights, the loss function penalizes misclassifications of the minority class more heavily, encouraging the model to pay more attention to these instances during training and helping to reduce bias towards the majority class.

In [None]:
# # UNCOMMENT TO USE WEIGHTED LOSS FUNCTION
# positive_class_samples = jnp.sum(y_train == 1)
# negative_class_samples = jnp.sum(y_train == 0)
# total_samples = positive_class_samples + negative_class_samples

# weight_for_positive_class = total_samples / (2.0 * positive_class_samples)
# weight_for_negative_class = total_samples / (2.0 * negative_class_samples)

# def cross_entropy_loss(b, w, X, y, lmbd=0.1):
#     n = y.size
#     p = predict(b, w, X)
#     p = jnp.clip(p, eps, 1 - eps)  # clip predictions to avoid log(0)

#     # Calculate class weights
#     class_weights = jnp.where(y == 1, weight_for_positive_class, weight_for_negative_class)

#     # Weighted cross-entropy loss
#     loss = -jnp.sum(class_weights * (y * jnp.log(p) + (1 - y) * jnp.log(1 - p))) / n

#     # Regularization term
#     reg_term = 0.5 * lmbd * (jnp.dot(w, w) + b * b)

#     return loss + reg_term
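
To get a feel for the weights used above, here is the same formula applied to a made-up 20/80 split. The helper name `balanced_class_weights` and the counts are assumptions for illustration only.

```python
def balanced_class_weights(n_positive, n_negative):
    """Weights that give each class the same total contribution to the loss."""
    total = n_positive + n_negative
    return total / (2.0 * n_positive), total / (2.0 * n_negative)

w_pos, w_neg = balanced_class_weights(n_positive=20, n_negative=80)
print(w_pos, w_neg)            # 2.5 0.625
print(w_pos * 20, w_neg * 80)  # 50.0 50.0
```

Each class ends up with the same total weight, so the 20 positive examples together count as much as the 80 negative ones.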

<a name="section3"></a>
# 3. Word embeddings

Word embeddings are vector representations of words/tokens that encode semantic meanings of words in relation to other words in a given corpus. In this section we will train word embeddings from our corpus of tweets using the Continuous Bag of Words (CBOW) algorithm from Word2Vec. Our goal in this tutorial is to implement parts of the CBOW algorithm and train basic word embeddings using our small corpus of tweets.

The CBOW algorithm predicts a target word based on its surrounding context words. For example, given the context words "the cat ___ on the mat," CBOW aims to predict the target word "sat". This is achieved by training a neural network to maximize the probability of the target word given its context.


The CBOW algorithm involves the following components:

**1. Context Window**

A fixed-size window that slides over the text, capturing the surrounding context words for each target word.


**2. A 1-layer neural network**

* Input layer: The input to the neural network is the average of the vector representations of the context words.
* Hidden layer: A fully connected layer that transforms the input into a hidden representation. The dimensionality of this layer determines the size of the resulting word embeddings.
* Output layer: A softmax layer that outputs the probability distribution over the entire vocabulary, predicting the target word.


**3. Objective Function**

The training objective is to maximize the log likelihood of the correct target word given the context words. This is equivalent to minimizing the cross-entropy loss between the predicted distribution and the true target word.

**4. Optimization Algorithm**

Stochastic Gradient Descent (SGD) or other optimization algorithms are used to update the model parameters, minimizing the loss function.
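
To make these components concrete, here is a minimal NumPy sketch of a single CBOW forward pass and its loss on toy sizes. The names `cbow_forward`, `W_in` and `W_out`, the random weights, and the chosen indices are illustrative assumptions; the sketch averages the context embeddings, as the implementation later in this notebook does.

```python
import numpy as np

def softmax(logits):
    shifted = logits - logits.max()  # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()

def cbow_forward(context_ids, W_in, W_out):
    hidden = W_in[context_ids].mean(axis=0)  # average the context embeddings
    return softmax(hidden @ W_out)           # distribution over the vocabulary

rng = np.random.default_rng(0)
vocab_size, embed_dim = 6, 4                 # toy sizes
W_in = 0.2 * rng.normal(size=(vocab_size, embed_dim))
W_out = 0.2 * rng.normal(size=(embed_dim, vocab_size))

context_ids = np.array([0, 1, 3, 4])         # toy context word indices
target_id = 2                                # toy target word index
probs = cbow_forward(context_ids, W_in, W_out)
loss = -np.log(probs[target_id])             # cross-entropy with a one-hot target
print(probs.shape, probs.sum(), loss)
```

With a one-hot target, the cross-entropy reduces to the negative log probability assigned to the correct word, which is why maximizing the log likelihood and minimizing the cross-entropy coincide.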

By training the CBOW model on a large corpus of text, the resulting word embeddings capture rich semantic and syntactic information about words. Words with similar meanings tend to have similar vector representations, enabling various NLP tasks such as word similarity, analogy solving, and text classification.

In the following sections, we will implement the CBOW algorithm step by step and use it to generate word embeddings for our text data.

<a name="section3_1"></a>
## 3.1. Context windows

We start by defining a function that transforms a text corpus into the format required by Word2Vec CBOW: a sequence of context-target pairs.

In [None]:
def generate_train_data(sentences, type2index, window_size):
    """Build (context, target) index pairs for CBOW from tokenized sentences."""
    contexts = []
    targets = []

    for sentence in sentences:
        for i, target in enumerate(sentence):
            # Skip targets that are not in the vocabulary
            if target not in type2index:
                continue
            context = []
            for j in range(i - window_size, i + window_size + 1):
                if j == i:
                    continue
                # Positions outside the sentence are padded
                context.append(sentence[j] if 0 <= j < len(sentence) else '<PAD>')
            # Map tokens to indices; out-of-vocabulary context tokens become <UNK>
            contexts.append([type2index[k] if k in type2index else type2index["<UNK>"] for k in context])
            targets.append(type2index[target])

    return jnp.array(contexts), jnp.array(targets)

In [None]:
contexts, targets = generate_train_data(tokenized_train_sentences, type2index, window_size=5)

# Print example context-target indices
print(contexts[:5])
print(targets[:5])

# Print example context-target tokens
print()
print("Sentence:", train_sentences[0])
print()

for i in range(5):
    print('Context: ', [index2type[int(j)] for j in contexts[i]])
    print('Target: ', index2type[int(targets[i])])
    print()
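
For a sentence short enough to check by hand, the windowing logic can also be sketched in plain Python on raw tokens, without the index mapping. The helper name `context_windows` and the example sentence are assumptions for illustration.

```python
def context_windows(tokens, window_size, pad="<PAD>"):
    """Return (context, target) pairs, padding positions outside the sentence."""
    pairs = []
    for i, target in enumerate(tokens):
        context = []
        for j in range(i - window_size, i + window_size + 1):
            if j == i:
                continue  # the target itself is not part of its context
            context.append(tokens[j] if 0 <= j < len(tokens) else pad)
        pairs.append((context, target))
    return pairs

pairs = context_windows("the cat sat on the mat".split(), window_size=2)
for context, target in pairs:
    print(target, context)
```

Every context has exactly `2 * window_size` entries, which is what allows the contexts to be stacked into a single rectangular array.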

<a name="section3_2"></a>
## 3.2. Word2Vec CBOW

Now that we have prepared our training data in the form of context-target pairs, we are ready to define and train our Continuous Bag of Words (CBOW) model. In this section, we will focus on implementing the neural network architecture for CBOW and training it on our tweet corpus.

The CBOW model we will implement consists of the following components:

* Input Layer: This layer takes the average of the one-hot encoded vectors of the context words.
* Hidden Layer: A fully connected layer that projects the input to a lower-dimensional space, where each dimension represents a feature of the word embedding.
* Output Layer: A softmax layer that outputs a probability distribution over the entire vocabulary, aiming to predict the target word.

Our training process will involve feeding batches of context-target pairs into the model and using stochastic gradient descent (SGD) to update the weights of the network to minimize the cross-entropy loss between the predicted and actual target words.
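
One detail worth checking is that averaging one-hot vectors and then projecting them gives the same result as looking up the corresponding rows of the weight matrix and averaging those. The NumPy sketch below verifies this on toy sizes; `W` stands in for the input weight matrix, and all names and values are assumptions for illustration.

```python
import numpy as np

vocab_size, embed_dim = 5, 3                  # toy sizes
rng = np.random.default_rng(0)
W = rng.normal(size=(vocab_size, embed_dim))  # stand-in for the input weight matrix
context_ids = np.array([0, 2, 2, 4])          # toy context indices

one_hot = np.eye(vocab_size)[context_ids]     # shape (4, vocab_size)
via_one_hot = one_hot.mean(axis=0) @ W        # average one-hot vectors, then project
via_lookup = W[context_ids].mean(axis=0)      # look up rows, then average

print(np.allclose(via_one_hot, via_lookup))   # True
```

This is why each row of the input weight matrix can be read as the embedding of one vocabulary item.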

Let's proceed to define our CBOW neural network class and train it on our dataset.

In [None]:
from jax.nn import softmax, one_hot
from jax import random, vmap

class Word2VecCBOW():
    def __init__(self, window_size, embed_dim, vocab_size, random_state):
        # Defines the key to be used for the random creation of the weights
        self.key = random.PRNGKey(random_state)
        self.vocab_size = vocab_size
        self.linear = self._create_random_matrix(vocab_size, embed_dim)
        self.soft = self._create_random_matrix(embed_dim, vocab_size)

        # Vectorizes the predict method
        self.predict = vmap(self._predict, in_axes=(None, 0))

    def train(self, X, y, num_epochs, batch_size):
        # TODO: COMPLETE THIS CODE

    def _predict(self, params, X):
        activations = []
        for x in X:
            activations.append(jnp.dot(x, params[0]))
        # Averages the activations
        activation = jnp.mean(jnp.array(activations), axis=0)
        logits = jnp.dot(activation, params[1])
        result = softmax(logits)

        return result

    def _create_random_matrix(self, n_rows, n_cols):
        # Split the key so that each weight matrix is drawn from a fresh subkey
        self.key, w_key = random.split(self.key)
        return 0.2 * random.normal(w_key, (n_rows, n_cols))

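    def loss(self, params, X, y):
        # Cross-entropy between the predicted distributions and the targets;
        # y is assumed to be one-hot encoded with shape (batch, vocab_size)
        preds = self.predict(params, X)
        return -jnp.mean(jnp.sum(y * jnp.log(preds + 1e-14), axis=-1))

    def update(self, params, X, y, step_size=0.02):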
        grads = grad(self.loss)(params, X, y)
        return [params[0] - step_size * grads[0],
                params[1] - step_size * grads[1]]

    def get_embedding(self):
        return self.linear


    def generate_batches(self, X, y, batch_size):
        for offset in range(0, len(y), batch_size):
            yield X[offset: offset + batch_size], y[offset: offset + batch_size]

In [None]:
tokenized_corpus = tokenized_train_sentences
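window_size = 5  # use the same window size for data generation and the model
contexts, targets = generate_train_data(tokenized_corpus, type2index, window_size=window_size)
w2v = Word2VecCBOW(window_size=window_size, embed_dim=32, vocab_size=len(type2index), random_state=42)
w2v.train(contexts, targets, num_epochs=3, batch_size=32)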
w2v.get_embedding()

After training, we will extract the word embeddings from the weights of the input layer, which will contain the learned representations of words in our tweet corpus. These embeddings can then be used for various NLP tasks, such as sentiment analysis or text classification. In this notebook we will not use embeddings for such tasks. However, we can still get an idea of the representation space learned by CBOW by comparing the embeddings of different words. A common way to do this is to investigate the closest embeddings for different words.

In [None]:
# Compare embeddings of words
def cosine_similarity(a, b):
    return jnp.dot(a, b) / (jnp.linalg.norm(a) * jnp.linalg.norm(b))

def most_similar(word, embeddings, index2type, topn=5):
    # Note: type2index is read from the enclosing notebook scope
    word_embedding = embeddings[type2index[word]]
    similarities = []
    for i, embedding in enumerate(embeddings):
        similarities.append((index2type[i], float(cosine_similarity(word_embedding, embedding))))
    # Sort by decreasing similarity; the query word itself is part of the ranking
    similarities.sort(key=lambda x: x[1], reverse=True)
    return similarities[:topn]

embeddings = w2v.get_embedding()
most_similar("good", embeddings, index2type)

<a name="section3_3"></a>
## 3.3. Pretrained embeddings

While training word embeddings from scratch can be insightful, it often requires a large corpus and substantial computational resources to achieve high-quality embeddings. An alternative approach is to use pretrained embeddings, which have been trained on extensive text corpora such as Wikipedia or the Google News dataset. Pretrained embeddings can provide a strong starting point for various NLP tasks and can be especially useful when working with smaller datasets.

In this section, we will load publicly available pretrained word embeddings. We use the gensim library, which can be used to download several sets of pretrained embeddings. In this example we load embeddings trained with GloVe, an alternative to Word2Vec. You can check the nearest neighbours of different words, which reveals the type of semantic information encoded in word embeddings.

In [None]:
import gensim.downloader as api
wv = api.load('glove-wiki-gigaword-50')
# wv = api.load('word2vec-google-news-300') # larger embeddings that will take longer to download

In [None]:
wv.most_similar('king')

In [None]:
wv.most_similar('queen')

In [None]:
wv.most_similar('eat')

In [None]:
wv.get_vector('eat')

<a name="section4"></a>

# 4. Conclusion


In this notebook we have tackled key components of the NLP pipeline like text processing, tokenisation, classification, and word embeddings. These tools and techniques provide a solid starting point for further exploration and more complex applications in the field of NLP.