# 2024 COMP90042 Project

# Readme

#### **Custom Tokenizer Training: WordPiece Subwords + Most Frequent Whole Words**

In this notebook, we train our custom WordPiece tokenizer on a corpus that combines the text of the Knowledge Source with the training claim sentences.

**PLEASE NOTE**: The helper functions we implemented for pre-processing and cleaning the data are imported from the Python script `utils.py`. Our custom WordPiece tokenizer implementation is in the Python script `wordpiece_tokenizer.py`.



In [1]:
%load_ext autoreload
%autoreload 2

# install required packages
!pip install unidecode
!python -m nltk.downloader stopwords


from utils import *
from wordpiece_tokenizer import *

from collections import defaultdict
import pprint as pp
from tqdm import tqdm
import pickle


# 1. Dataset Processing

#### Load the Claims Dataset with Knowledge Source and Clean the Text

In [2]:
# load dataset from file
knowledge_source, train_data, val_data = load_dataset()      
print(f"Number of evidence passages: {len(knowledge_source)}")
print(f"Number of training instances: {len(train_data)}")  
print(f"Number of validation instances: {len(val_data)}")

# clean all sentences in the dataset: convert Unicode to ASCII, remove URLs,
# remove repeated non-alphanumeric characters, and strip other noise that is
# unlikely to be useful for the claim classification task
cleaner = SentenceCleaner()
knowledge_source, train_data, val_data = cleaner.clean_dataset(knowledge_source, train_data, val_data)
print(f"\nNumber of evidence passages after cleaning: {len(knowledge_source)}")
print(f"Number of training instances after cleaning: {len(train_data)}")  
print(f"Number of validation instances after cleaning: {len(val_data)}")

# combine all evidence passages and training claim sentences into a corpus which is a list of sentences
corpus = []
for ev in knowledge_source.values():
    corpus.append(ev)
for claim in train_data.values():
    corpus.append(claim['claim_text'])  

Number of evidence passages: 1208827
Number of training instances: 1228
Number of validation instances: 154

Number of evidence passages after cleaning: 1206800
Number of training instances after cleaning: 1228
Number of validation instances after cleaning: 154
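
The cleaning steps listed in the cell comment can be sketched roughly as follows. This is an illustration only, not the `SentenceCleaner` from `utils.py`: the name `clean_sentence` and both regular expressions are assumptions, repeated punctuation is collapsed to a single character as one possible reading of "removing", and the standard library's `unicodedata` stands in for the `unidecode` package that the notebook installs.

```python
import re
import unicodedata

# Assumed patterns, chosen for illustration only.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
# A run of the same non-alphanumeric, non-space character, e.g. "!!!" or "---".
REPEAT_RE = re.compile(r"([^\w\s])\1+")


def clean_sentence(text: str) -> str:
    """Hypothetical single-sentence cleaning pass."""
    # Fold accented characters to their base ASCII letter and drop the rest.
    text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")
    # Remove URLs before touching punctuation, so "://" is still intact.
    text = URL_RE.sub(" ", text)
    # Collapse repeated punctuation to a single character.
    text = REPEAT_RE.sub(r"\1", text)
    # Normalise whitespace left behind by the substitutions.
    return " ".join(text.split())
```

A sentence that becomes empty after a pass like this would presumably be dropped, which would be consistent with the smaller passage count after cleaning.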


# 2. Model Implementation

#### Instantiate WordPiece Tokenizer object

In [3]:
# instantiate the modified WordPiece tokenizer (its vocabulary is later augmented with the most frequent words)
tokenizer = WordPieceTokenizer(cleaning=True)
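
The tokenizer is described as WordPiece subwords plus the most frequent whole words. One way to read this is that common words are added to the vocabulary next to the learned subwords, so that each of them encodes to a single token. The sketch below illustrates that idea only; `augment_vocab` and its whitespace pre-tokenization are assumptions, not the code in `wordpiece_tokenizer.py`.

```python
from collections import Counter


def augment_vocab(subword_vocab, corpus, num_augmented_words):
    """Append the most frequent whole words that are not already in the vocab."""
    # Hypothetical pre-tokenization: lowercase and split on whitespace.
    counts = Counter(word for sentence in corpus for word in sentence.lower().split())
    vocab = list(subword_vocab)
    seen = set(vocab)
    added = 0
    for word, _ in counts.most_common():
        if added == num_augmented_words:
            break
        if word not in seen:
            vocab.append(word)
            seen.add(word)
            added += 1
    return vocab
```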

#### Train the tokenizer on the corpus

In [4]:
tokenizer.generate_vocab(corpus, 10000, num_augmented_words=10000)

Pretokenizing corpus into words and computing unigram counts...
Generating WordPiece vocabulary with max_vocab_size=10000...
Generating splits...
Computing initial pair scores...


Building vocab. Current vocab_size --> : 100%|██████████| 10000/10000 [19:10<00:00,  8.69it/s]
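
The "Computing initial pair scores" step in the log matches the usual WordPiece training recipe, in which each adjacent pair of symbols is scored by its pair frequency divided by the product of the frequencies of its two parts, and the best-scoring pair is merged next. A minimal sketch of that scoring, assuming the standard formula; the `splits` and `word_freqs` structures are hypothetical and the notebook's own implementation may differ.

```python
from collections import defaultdict


def compute_pair_scores(splits, word_freqs):
    """Score each adjacent symbol pair as freq(pair) / (freq(a) * freq(b)).

    splits:     word -> list of current symbols, e.g. ["h", "##u", "##g"]
    word_freqs: word -> count of that word in the corpus
    """
    symbol_freqs = defaultdict(int)
    pair_freqs = defaultdict(int)
    for word, freq in word_freqs.items():
        symbols = splits[word]
        for sym in symbols:
            symbol_freqs[sym] += freq
        for a, b in zip(symbols, symbols[1:]):
            pair_freqs[(a, b)] += freq
    return {
        (a, b): f / (symbol_freqs[a] * symbol_freqs[b])
        for (a, b), f in pair_freqs.items()
    }
```

Dividing by the individual symbol frequencies favours pairs whose parts rarely occur apart, rather than pairs that are merely common.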


In [5]:
print(tokenizer.vocab)



#### Save the trained tokenizer object to file

In [10]:
# save trained tokenizer object to pickle file
with open("tokenizer_worpiece_20000_aug.pkl", "wb") as f:
    pickle.dump(tokenizer, f)
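
To reuse the trained tokenizer elsewhere, the pickle can be read back as sketched below. The helper names `save_object` and `load_object` are hypothetical. The firm requirement is that `wordpiece_tokenizer.py` is importable at load time, because pickle stores a reference to the class rather than its code.

```python
import pickle


def save_object(obj, path):
    """Serialize obj to path with pickle."""
    with open(path, "wb") as f:
        pickle.dump(obj, f)


def load_object(path):
    """Load a pickled object; only use this on files you trust."""
    with open(path, "rb") as f:
        return pickle.load(f)


# Hypothetical usage with the file written above:
# tokenizer = load_object("tokenizer_worpiece_20000_aug.pkl")
```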

# 3. Testing and Evaluation

#### Run some subword tokenization tests to make sure our trained custom WordPiece tokenizer is working as expected.

In [14]:
# tokenize an example sentence from the corpus
s = corpus[8290]
print(f"Original Sentence --> {s}")

tokens_idx, tokens_subwords = tokenizer.encode([s], return_subwords=True)
print(f"Subword tokens --> {tokens_subwords}")
print(f"Integer tokens --> {tokens_idx}")

decoded = tokenizer.decode(tokens_idx)
print(f"Decoded --> {decoded}")

Original Sentence --> As the Earth's climate warms, we are seeing many changes: stronger, more destructive hurricanes; heavier rainfall; more disastrous flooding; more areas of the world experiencing severe drought; and more heat waves."


Subword tokens --> (['as', 'the', 'earth', "'s", 'climate', 'warm', '##s', ',', 'we', 'are', 'seeing', 'many', 'changes', ':', 'stronger', ',', 'more', 'des', '##truct', '##iv', '##e', 'hurricanes', ';', 'he', '##av', '##i', '##er', 'rainfall', ';', 'more', 'di', '##s', '##a', '##st', '##r', '##o', '##u', '##s', 'flooding', ';', 'more', 'areas', 'of', 'the', 'world', 'exper', '##i', '##e', '##nc', '##ing', 'severe', 'drought', ';', 'and', 'more', 'heat', 'waves', '.', "''"],)
Integer tokens --> ([10002, 9604, 10496, 10006, 10234, 13670, 4837, 5901, 10671, 10009, 16223, 10072, 10771, 6702, 16361, 5901, 8551, 7429, 5320, 3511, 2786, 17506, 6711, 7919, 2214, 3306, 2905, 14095, 6711, 8551, 7459, 4837, 1673, 5089, 4734, 4277, 5379, 4837, 14873, 6711, 8551, 10337, 8696, 9604, 10031, 7674, 3306, 2786, 4132, 3381, 12833, 15624, 6711, 6807, 8551, 11003, 13883, 5921, 10051],)
Decoded --> ["as the earth 's climate warms , we are seeing many changes : stronger , more destructive hurricanes ; heavi
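
The subword output follows the WordPiece convention of marking word-internal pieces with `##`. Splitting a single word is commonly done with a greedy longest-match-first search, and decoding glues `##` pieces back onto the preceding piece. The sketch below assumes that approach; `wordpiece_split`, `join_subwords`, and the `[UNK]` token are illustrative names, not the notebook's API.

```python
def wordpiece_split(word, vocab, unk_token="[UNK]"):
    """Greedily split word into the longest vocab pieces, left to right."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Shrink the candidate from the right until it is in the vocab.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            # No piece matches at this position: treat the whole word as unknown.
            return [unk_token]
        pieces.append(piece)
        start = end
    return pieces


def join_subwords(pieces):
    """Merge '##' continuation pieces back into whole words."""
    words = []
    for piece in pieces:
        if piece.startswith("##") and words:
            words[-1] += piece[2:]
        else:
            words.append(piece)
    return " ".join(words)
```

With a vocabulary containing `des`, `##truct`, `##iv`, and `##e`, this reproduces the split of "destructive" seen in the output.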

#### Test the multi-process parallelized encoding and decoding speed to make sure it is fast.

In [26]:
# test encoding speed on 10000 sentences
encoded = tokenizer.encode(corpus[:10000], num_procs=8)

In [27]:
# test decoding speed on 10000 encoded sentences
decoded = tokenizer.decode(encoded, num_procs=8)
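
The two cells run the parallel encode and decode but do not print any timings. A small helper such as the following could be wrapped around each call to report throughput; `time_call` is a hypothetical name, and the commented usage assumes the `encode` call shown in the cell.

```python
import time


def time_call(fn, *args, **kwargs):
    """Run fn once and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


# Hypothetical usage in this notebook:
# encoded, secs = time_call(tokenizer.encode, corpus[:10000], num_procs=8)
# print(f"encoded 10000 sentences in {secs:.2f}s ({10000 / secs:.0f} sentences/s)")
```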