<a href="https://colab.research.google.com/github/rdkdaniel/The-Swahili-Project/blob/main/The_Tokenizer.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Kiswahili Under a Natural Language Processing Lens**

**Introduction**


*   This notebook shows the process used to design the tokenizer for the Kiswahili Project (title above).
*   Kiswahili is a low resource language but above that, it has a different morphological strcture than English or other languages whose tokenizers are readily available. 
*   It is therefore important to design a tokenizer specific to Kiswahili i.e. based on its strcture.
*   List item





**Sample Kiswahili Words and Sentences**


*   List item
*   List item
*   List item
*   List item



## **1.0 Libraries**

In [1]:
pip install datasets

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting datasets
  Downloading datasets-2.6.1-py3-none-any.whl (441 kB)
[K     |████████████████████████████████| 441 kB 32.2 MB/s 
Collecting dill<0.3.6
  Downloading dill-0.3.5.1-py2.py3-none-any.whl (95 kB)
[K     |████████████████████████████████| 95 kB 6.5 MB/s 
[?25hCollecting xxhash
  Downloading xxhash-3.1.0-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (212 kB)
[K     |████████████████████████████████| 212 kB 103.9 MB/s 
Collecting responses<0.19
  Downloading responses-0.18.0-py3-none-any.whl (38 kB)
Collecting huggingface-hub<1.0.0,>=0.2.0
  Downloading huggingface_hub-0.10.1-py3-none-any.whl (163 kB)
[K     |████████████████████████████████| 163 kB 81.0 MB/s 
Collecting multiprocess
  Downloading multiprocess-0.70.14-py37-none-any.whl (115 kB)
[K     |████████████████████████████████| 115 kB 62.6 MB/s 
Collecting urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1
  D

In [2]:
#Libraries
import pandas as pd
import numpy as np
import datasets

In [3]:
!pip install transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.24.0-py3-none-any.whl (5.5 MB)
[K     |████████████████████████████████| 5.5 MB 29.9 MB/s 
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1
  Downloading tokenizers-0.13.2-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.6 MB)
[K     |████████████████████████████████| 7.6 MB 74.1 MB/s 
Installing collected packages: tokenizers, transformers
Successfully installed tokenizers-0.13.2 transformers-4.24.0


In [4]:
import transformers

## **2.0 Loading the Datasets**

In [6]:
df = pd.read_fwf('/content/Kiswdata1.txt')

In [7]:
print(df)

   Mhadhiri Denis Skopin (kushoto)  akiwa ameshikilia karatasi zake za  \
0  Dikteta wa Soviet Joseph Stalin  amepitia aina  fulani ya ukarabati   

   kufukuzwa kutoka Chuo  Kikuu cha  \
0  katika Urusi ya Putin  - unaweza   

   Jimbo la St PetersburgKatika nyumba yake ya St  \
0  hata kununua bidhaa za Stalin.Mhadhiri wa chuo   

   Petersburg, mhadhiri wa chuo kikuu  \
0  kikuu aliyefutwa kazi Denis Skopin   

  Denis Skopin ananionyesha hati ambayo imebadilisha maisha  \
0  amesoma miaka ya Stalin. Anaona uwiano kati ya...          

  yake.Maelekezo: "Maelekezo No.87/2D. Kuhusu: Kufutwa kazi."Hadi  hivi  \
0  na sasa."Nimetoka kuchapisha kitabu kwa Kiinge...               watu   

   majuzi Denis  ... kusahau madoa ya.23 umwagaji damu ya.24 historia ya.25  \
0  wa  Urusi ya  ...     NaN   NaN   NaN      NaN  NaN   NaN      NaN   NaN   

  nchi.1 yetu."  
0    NaN    NaN  

[1 rows x 402 columns]




*   Wapi makofi ya good data scrapped na mimi!!
*   👏 👏



## **3.0 Building The Tokenizer**

**Brief Overview of the Process**

Tokenization involves several steps:

1.   Normalization - which involves text cleanup such as lowercasing, removing accents or weird characters with Unicode normalization, etc
2.   Pre-tokenization - splitting the words into parts.
3.   Model - the actual tokenization where characters or subwords are merged into logical components.
4.   Post-processing - at thsis step, special tokens are added and these tokens are translated into IDs.
5.   Decoder - the final step that takes the tokenized data and converts it into human-readable text. Often this step is not seen as part of the tokenization process but is necessary to understand any text-based model output.





### **3.1 Libraries**

In [8]:
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace
from tokenizers import models

In [9]:
tokenizer = Tokenizer(models.WordPiece(unk_token='[UNK]'))

### **3.2 Normalization**

In [10]:
from tokenizers import normalizers
tokenizer.normalizer = normalizers.Sequence(
    [normalizers.Lowercase(), normalizers.NFKD()]
)

### **3.3 Pre-Tokenization**

In [11]:
from tokenizers import pre_tokenizers
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

### **3.4 Training the Tokenizer**

In [12]:
from tokenizers import trainers

trainer = trainers.WordPieceTrainer(
    vocab_size=30_000,
    special_tokens=['[UNK]', '[PAD]', '[CLS]', '[SEP]', '[MASK]'],
    min_frequency=2,
    continuing_subword_prefix='##'
)

In [13]:
tokenizer.train_from_iterator(df, trainer=trainer)

### **3.5 Post Processing**

In [14]:
from tokenizers import processors

# first we get the token ID values (defined in the vocab) for CLS and SEP
cls_id = tokenizer.token_to_id('[CLS]')
sep_id = tokenizer.token_to_id('[SEP]')

# then setup the post processing step with TemplateProcessing
tokenizer.post_processor = processors.TemplateProcessing(
    single=f'[CLS]:0 $A:0 [SEP]:0',
    pair=f'[CLS]:0 $A:0 [SEP]:0 $B:1 [SEP]:1',
    special_tokens=[
        ('[CLS]', cls_id),
        ('[SEP]', sep_id)
    ]
)

### **3.6 Decoder**

In [15]:
from tokenizers import decoders

tokenizer.decoder = decoders.WordPiece(prefix='##')

## **4.0 Saving the Tokenizer**

In [16]:
from transformers import PreTrainedTokenizerFast

# load the tokenizer in a transformers tokenizer instance
tokenizer = PreTrainedTokenizerFast(
    tokenizer_object=tokenizer,
    unk_token='[UNK]',
    pad_token='[PAD]',
    cls_token='[CLS]',
    sep_token='[SEP]',
    mask_token='[MASK]'
)

# save the tokenizer
tokenizer.save_pretrained('RDK-Kisw-Tokenizer')

('RDK-Kisw-Tokenizer/tokenizer_config.json',
 'RDK-Kisw-Tokenizer/special_tokens_map.json',
 'RDK-Kisw-Tokenizer/tokenizer.json')

## **5.0 Using the Tokenizer**

In [17]:
tokenizer = PreTrainedTokenizerFast.from_pretrained('RDK-Kisw-Tokenizer')

In [18]:
tokenizer("Ilikuwa wakati wa jioni jua limepunguza udhia wake na upepo mwanana ulikuwa ukipita na kuzipapasa ngozi zetu mfano wa pamba")

{'input_ids': [2, 219, 211, 344, 87, 282, 263, 36, 294, 38, 58, 152, 73, 391, 134, 161, 46, 233, 53, 309, 104, 46, 414, 255, 285, 102, 102, 46, 89, 211, 246, 58, 223, 111, 104, 90, 109, 308, 308, 137, 40, 249, 109, 50, 69, 114, 39, 221, 56, 87, 384, 128, 3], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}