<a href="https://colab.research.google.com/github/rdkdaniel/The-Swahili-Project/blob/main/The_Tokenizer.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Kiswahili Under a Natural Language Processing Lens**

**Introduction**


*   This notebook shows the process used to design the tokenizer for the Kiswahili Project (title above).
*   Kiswahili is a low-resource language and, beyond that, it has a different morphological structure from English and other languages whose tokenizers are readily available.
*   It is therefore important to design a tokenizer specific to Kiswahili, i.e. one based on its structure.





**Sample Kiswahili Words and Sentences**


*   Mwanaume - Man
*   Vitabu - Books
*   Mwanaume mkubwa alienda - The big man went
*   Vitabu vikubwa vilichukuliwa - The big books were taken.



## **1.0 Libraries**

In [1]:
!pip install datasets

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting datasets
  Downloading datasets-2.7.1-py3-none-any.whl (451 kB)
Collecting xxhash
  Downloading xxhash-3.1.0-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (212 kB)
Collecting huggingface-hub<1.0.0,>=0.2.0
  Downloading huggingface_hub-0.11.1-py3-none-any.whl (182 kB)
Collecting multiprocess
  Downloading multiprocess-0.70.14-py37-none-any.whl (115 kB)
Collecting re

In [2]:
#Libraries
import pandas as pd
import numpy as np
import datasets

In [3]:
!pip install transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.24.0-py3-none-any.whl (5.5 MB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1
  Downloading tokenizers-0.13.2-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.6 MB)
Installing collected packages: tokenizers, transformers
Successfully installed tokenizers-0.13.2 transformers-4.24.0


In [4]:
import transformers

## **2.0 Loading the Datasets**

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [5]:
df = pd.read_fwf('/content/drive/MyDrive/Kiswahili_Dataset/Kiswahili_data1.txt')
df2 = pd.read_fwf('/content/drive/MyDrive/Kiswahili_Dataset/Kiswahili_data2.txt')
df3 = pd.read_fwf('/content/drive/MyDrive/Kiswahili_Dataset/Kiswahili_data3.txt')
df4 = pd.read_fwf('/content/drive/MyDrive/Kiswahili_Dataset/Kiswahili_data4.txt')
df5 = pd.read_fwf('/content/drive/MyDrive/Kiswahili_Dataset/Kiswahili_data5.txt')
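
`pd.read_fwf` parses fixed-width columns, so running prose ends up spread across many columns of a single row. A minimal sketch of one alternative, assuming each file is UTF-8 text with one passage per line: read the file as plain lines. The helper `read_text_lines` and the file name `sample.txt` are hypothetical and are not part of the original notebook.

```python
import os
import tempfile


def read_text_lines(path):
    """Return the non-empty lines of a UTF-8 text file, stripped of surrounding whitespace."""
    with open(path, encoding="utf-8") as handle:
        return [line.strip() for line in handle if line.strip()]


# Hypothetical stand-in for one corpus file, written to a temporary folder.
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "sample.txt")
    with open(sample_path, "w", encoding="utf-8") as handle:
        handle.write("Mwanaume mkubwa alienda\n\nVitabu vikubwa vilichukuliwa\n")
    corpus_lines = read_text_lines(sample_path)

print(corpus_lines)
```

A list of strings like `corpus_lines` keeps each passage intact, which may be easier to clean and to feed to a tokenizer trainer than a wide DataFrame.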

In [6]:
print(df, df2, df3, df4, df5)

   Mhadhiri Denis Skopin (kushoto)  akiwa ameshikilia karatasi zake za  \
0  Dikteta wa Soviet Joseph Stalin  amepitia aina  fulani ya ukarabati   

   kufukuzwa kutoka Chuo  Kikuu cha  \
0  katika Urusi ya Putin  - unaweza   

   Jimbo la St PetersburgKatika nyumba yake ya St  \
0  hata kununua bidhaa za Stalin.Mhadhiri wa chuo   

   Petersburg, mhadhiri wa chuo kikuu  \
0  kikuu aliyefutwa kazi Denis Skopin   

  Denis Skopin ananionyesha hati ambayo imebadilisha maisha  \
0  amesoma miaka ya Stalin. Anaona uwiano kati ya...          

  yake.Maelekezo: "Maelekezo No.87/2D. Kuhusu: Kufutwa kazi."Hadi  hivi  \
0  na sasa."Nimetoka kuchapisha kitabu kwa Kiinge...               watu   

   majuzi Denis  ... kusahau madoa ya.23 umwagaji damu ya.24 historia ya.25  \
0  wa  Urusi ya  ...     NaN   NaN   NaN      NaN  NaN   NaN      NaN   NaN   

  nchi.1 yetu."  
0    NaN    NaN  

[1 rows x 402 columns]    KNEC STUDY MATERIALS, REVISION KITS  AND PAST PAPERSSTUDY FOR  \
0  Δdocument.ge



*   Where is the applause for the good data I scraped!!
*   👏 👏



## **2.1 Merge the DF**

In [7]:
df_merged = pd.concat([df, df2, df3, df4, df5])
#df_merged = pd.merge(df, df2, df3, df4, df5)

In [8]:
print(df_merged)

   Mhadhiri Denis Skopin (kushoto)  akiwa ameshikilia karatasi zake za  \
0  Dikteta wa Soviet Joseph Stalin  amepitia aina  fulani ya ukarabati   
0                              NaN                                 NaN   
0                              NaN                                 NaN   
0                              NaN                                 NaN   
0                              NaN                                 NaN   

   kufukuzwa kutoka Chuo  Kikuu cha  \
0  katika Urusi ya Putin  - unaweza   
0                    NaN        NaN   
0                    NaN        NaN   
0                    NaN        NaN   
0                    NaN        NaN   

   Jimbo la St PetersburgKatika nyumba yake ya St  \
0  hata kununua bidhaa za Stalin.Mhadhiri wa chuo   
0                                             NaN   
0                                             NaN   
0                                             NaN   
0                                             NaN   

 

## **2.2 Checking on Null Values**

In [9]:
df_merged.isnull()

Unnamed: 0,Mhadhiri Denis Skopin (kushoto),akiwa ameshikilia karatasi zake za,kufukuzwa kutoka Chuo,Kikuu cha,Jimbo la St PetersburgKatika nyumba yake ya St,"Petersburg, mhadhiri wa chuo kikuu",Denis Skopin ananionyesha hati ambayo imebadilisha maisha,"yake.Maelekezo: ""Maelekezo No.87/2D. Kuhusu: Kufutwa kazi.""Hadi",hivi,majuzi Denis,...,kuchagua,mbinu.4,hii..3,6)(b).1,sita.2,kutumia.1,kuUdumisha,utanzu,semi.,6)Your
0,False,False,False,False,False,False,False,False,False,False,...,True,True,True,True,True,True,True,True,True,True
0,True,True,True,True,True,True,True,True,True,True,...,True,True,True,True,True,True,True,True,True,True
0,True,True,True,True,True,True,True,True,True,True,...,True,True,True,True,True,True,True,True,True,True
0,True,True,True,True,True,True,True,True,True,True,...,True,True,True,True,True,True,True,True,True,True
0,True,True,True,True,True,True,True,True,True,True,...,True,True,True,True,True,True,True,True,True,True


In [10]:
print(df_merged.isnull().sum())

Mhadhiri Denis Skopin (kushoto)                   4
akiwa ameshikilia karatasi zake za                4
kufukuzwa kutoka Chuo                             4
Kikuu cha                                         4
Jimbo la St PetersburgKatika nyumba yake ya St    4
                                                 ..
kutumia.1                                         5
kuUdumisha                                        5
utanzu                                            5
semi.                                             5
6)Your                                            5
Length: 4347, dtype: int64


## **2.3 Removing Null Values**

In [11]:
df_merged = df_merged.dropna()
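
By default `dropna()` removes every row that contains at least one missing value. When the concatenated frames do not share column labels, every row has gaps in the other frames' columns, so all rows are dropped. The sketch below reproduces this on two toy frames (invented for illustration) and shows one possible alternative, assuming the goal is to keep the text cells and discard only the gaps.

```python
import pandas as pd

# Two toy single-row frames with different column labels, standing in for df and df2.
left = pd.DataFrame({"habari": ["njema"]})
right = pd.DataFrame({"vitabu": ["vikubwa"]})

merged = pd.concat([left, right])

# Each row has a NaN in the other frame's column, so dropna() removes all rows.
dropped = merged.dropna()
print(len(merged), len(dropped))

# Alternative: flatten the cells and keep only the ones that are not missing.
cells = [value for value in merged.to_numpy().ravel() if pd.notna(value)]
print(cells)
```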

In [12]:
#Checking again for missing values
print(df_merged.isnull().sum())
df_merged.isnull()

Mhadhiri Denis Skopin (kushoto)                   0.0
akiwa ameshikilia karatasi zake za                0.0
kufukuzwa kutoka Chuo                             0.0
Kikuu cha                                         0.0
Jimbo la St PetersburgKatika nyumba yake ya St    0.0
                                                 ... 
kutumia.1                                         0.0
kuUdumisha                                        0.0
utanzu                                            0.0
semi.                                             0.0
6)Your                                            0.0
Length: 4347, dtype: float64


Unnamed: 0,Mhadhiri Denis Skopin (kushoto),akiwa ameshikilia karatasi zake za,kufukuzwa kutoka Chuo,Kikuu cha,Jimbo la St PetersburgKatika nyumba yake ya St,"Petersburg, mhadhiri wa chuo kikuu",Denis Skopin ananionyesha hati ambayo imebadilisha maisha,"yake.Maelekezo: ""Maelekezo No.87/2D. Kuhusu: Kufutwa kazi.""Hadi",hivi,majuzi Denis,...,kuchagua,mbinu.4,hii..3,6)(b).1,sita.2,kutumia.1,kuUdumisha,utanzu,semi.,6)Your


In [13]:
print(df_merged)

Empty DataFrame
Columns: [Mhadhiri Denis Skopin (kushoto), akiwa ameshikilia karatasi zake za, kufukuzwa kutoka Chuo, Kikuu cha, Jimbo la St PetersburgKatika nyumba yake ya St, Petersburg, mhadhiri wa chuo kikuu, Denis Skopin ananionyesha hati ambayo imebadilisha maisha, yake.Maelekezo: "Maelekezo No.87/2D. Kuhusu: Kufutwa kazi."Hadi, hivi, majuzi Denis, alikuwa profesa msaidizi katika Kitivo cha Sanaa ya Kiliberali na Sayansi ya Chuo Kikuu cha Jimbo, la, St Petersburg. Lakini tarehe 20 Oktoba chuo kikuu kilimfukuza, kazi kwa "kitendo, cha ukosefu wa maadili  ambacho, hakiendani na kazi za elimu".Hiki kinachoitwa kitendo kisicho cha madili kilikuwa ni nini? Kushiriki katika mkutano "usioidhinishwa".Tarehe 21 Septemba Denis, alijiunga na maandamano ya mitaani, kupinga uamuzi wa Kremlin kuwaandikisha Warusi kupigana nchini Ukraine. Mapema siku hiyo, Rais Vladimir Putin alikuwa ametangaza "uhamasishaji wa sehemu" kote nchini., Wakati wa, maandamano Denis alikamatwa na kukaa jela siku 10.

## **2.4 Transposing the Dataset**

(Transposing gives many rows and a single column, instead of the current single row with many columns.)

In [14]:
df_merged.T

Mhadhiri Denis Skopin (kushoto)
akiwa ameshikilia karatasi zake za
kufukuzwa kutoka Chuo
Kikuu cha
Jimbo la St PetersburgKatika nyumba yake ya St
...
kutumia.1
kuUdumisha
utanzu
semi.
6)Your


In [15]:
print(df_merged.T)
# BTW, does this change the performance of the tokenizer?

Empty DataFrame
Columns: []
Index: [Mhadhiri Denis Skopin (kushoto), akiwa ameshikilia karatasi zake za, kufukuzwa kutoka Chuo, Kikuu cha, Jimbo la St PetersburgKatika nyumba yake ya St, Petersburg, mhadhiri wa chuo kikuu, Denis Skopin ananionyesha hati ambayo imebadilisha maisha, yake.Maelekezo: "Maelekezo No.87/2D. Kuhusu: Kufutwa kazi."Hadi, hivi, majuzi Denis, alikuwa profesa msaidizi katika Kitivo cha Sanaa ya Kiliberali na Sayansi ya Chuo Kikuu cha Jimbo, la, St Petersburg. Lakini tarehe 20 Oktoba chuo kikuu kilimfukuza, kazi kwa "kitendo, cha ukosefu wa maadili  ambacho, hakiendani na kazi za elimu".Hiki kinachoitwa kitendo kisicho cha madili kilikuwa ni nini? Kushiriki katika mkutano "usioidhinishwa".Tarehe 21 Septemba Denis, alijiunga na maandamano ya mitaani, kupinga uamuzi wa Kremlin kuwaandikisha Warusi kupigana nchini Ukraine. Mapema siku hiyo, Rais Vladimir Putin alikuwa ametangaza "uhamasishaji wa sehemu" kote nchini., Wakati wa, maandamano Denis alikamatwa na kukaa jel

## **3.0 Building The Tokenizer**

**Brief Overview of the Process**

Tokenization involves several steps:

1.   Normalization - text cleanup such as lowercasing, removing accents, and handling unusual characters with Unicode normalization.
2.   Pre-tokenization - splitting the text into words or other preliminary units.
3.   Model - the actual tokenization, where characters or subwords are merged into logical components.
4.   Post-processing - at this step, special tokens are added and the tokens are translated into IDs.
5.   Decoder - the final step, which takes the tokenized data and converts it back into human-readable text. This step is often not seen as part of the tokenization process, but it is necessary for understanding any text-based model output.





### **3.1 Libraries**

In [16]:
# Only the WordPiece model is used below, so the BPE imports are not needed
from tokenizers import Tokenizer, models

In [17]:
tokenizer = Tokenizer(models.WordPiece(unk_token='[UNK]'))
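
To make the model step concrete, here is a plain-Python sketch of the greedy longest-match-first lookup that WordPiece-style models apply to each word once a vocabulary exists. It is an illustration only, not the library's implementation: the function `wordpiece_encode` and the toy vocabulary are invented for this example.

```python
def wordpiece_encode(word, vocab, unk_token="[UNK]", prefix="##"):
    """Greedy longest-match-first split of one word into WordPiece-style pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Shrink the candidate from the right until it is found in the vocabulary.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = prefix + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece fits at this position: the whole word maps to the unknown token.
            return [unk_token]
        pieces.append(match)
        start = end
    return pieces


# Toy vocabulary, invented for illustration (not the trained vocabulary).
toy_vocab = {"vi", "ki", "##tabu", "##kubwa", "m", "##tu"}

print(wordpiece_encode("vitabu", toy_vocab))
print(wordpiece_encode("vikubwa", toy_vocab))
print(wordpiece_encode("kisu", toy_vocab))
```

With this toy vocabulary the noun-class prefix `vi` is split off from the stem, which is the kind of segmentation a Kiswahili-specific vocabulary might ideally learn.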

### **3.2 Normalization**

In [18]:
from tokenizers import normalizers
tokenizer.normalizer = normalizers.Sequence(
    [normalizers.Lowercase(), normalizers.NFKD()]
)
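
The effect of lowercasing followed by NFKD can be illustrated with the standard library's `unicodedata` module, used here as a stand-in for the `tokenizers` normalizers. NFKD decomposes an accented character into a base letter plus a combining mark but does not delete the mark. If accent removal is wanted, a further step is needed; the `tokenizers` library offers `normalizers.StripAccents()` for this. The helper names below are hypothetical.

```python
import unicodedata


def normalize(text):
    """Lowercase, then apply NFKD, in the same order as the normalizer sequence."""
    return unicodedata.normalize("NFKD", text.lower())


def strip_combining_marks(text):
    """Drop the combining marks left behind by NFKD (an extra, optional step)."""
    return "".join(ch for ch in text if not unicodedata.combining(ch))


sample = "Caf\u00e9"  # "Café" written with a precomposed e-acute

normalized = normalize(sample)
stripped = strip_combining_marks(normalized)

print(len(sample), len(normalized), len(stripped))
print(stripped)
```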

### **3.3 Pre-Tokenization**

In [19]:
from tokenizers import pre_tokenizers
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
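
The `Whitespace` pre-tokenizer is documented as splitting on the pattern `\w+|[^\w\s]+`, so words and punctuation runs become separate pieces. A rough standard-library sketch of that behaviour, with character offsets, is shown below. The function `pre_tokenize` is a hypothetical helper, and Python's `re` may differ from the library in edge cases.

```python
import re

# Runs of word characters, or runs of characters that are neither
# word characters nor whitespace (i.e. punctuation).
WHITESPACE_PATTERN = re.compile(r"\w+|[^\w\s]+")


def pre_tokenize(text):
    """Return (piece, (start, end)) pairs for each word or punctuation run."""
    return [(m.group(), m.span()) for m in WHITESPACE_PATTERN.finditer(text)]


pieces = pre_tokenize("Vitabu vikubwa, vilichukuliwa.")
print(pieces)
```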

### **3.4 Training the Tokenizer**

In [20]:
from tokenizers import trainers

trainer = trainers.WordPieceTrainer(
    vocab_size=30_000,
    special_tokens=['[UNK]', '[PAD]', '[CLS]', '[SEP]', '[MASK]'],
    min_frequency=2,
    continuing_subword_prefix='##'
)

In [22]:
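# Note: iterating over a DataFrame yields its column labels, so the trainer
# receives the column names of df_merged as its training text.
tokenizer.train_from_iterator(df_merged, trainer=trainer)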
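
`train_from_iterator` accepts an iterator over strings or over lists of strings. A minimal sketch of a batching generator is given below, assuming the corpus is held as a list of text lines. The names `batch_iterator` and `corpus_lines`, and the sample sentences, are hypothetical.

```python
def batch_iterator(lines, batch_size=2):
    """Yield successive lists of at most batch_size strings."""
    for start in range(0, len(lines), batch_size):
        yield lines[start:start + batch_size]


# Hypothetical corpus: one passage per list entry.
corpus_lines = [
    "Mwanaume mkubwa alienda",
    "Vitabu vikubwa vilichukuliwa",
    "Ilikuwa wakati wa jioni",
]

batches = list(batch_iterator(corpus_lines))
print(batches)
```

A generator like `batch_iterator(corpus_lines, batch_size=1000)` could then be passed as the first argument of `train_from_iterator`, so that the trainer sees whole passages.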

### **3.5 Post Processing**

In [23]:
from tokenizers import processors

# first we get the token ID values (defined in the vocab) for CLS and SEP
cls_id = tokenizer.token_to_id('[CLS]')
sep_id = tokenizer.token_to_id('[SEP]')

# then setup the post processing step with TemplateProcessing
tokenizer.post_processor = processors.TemplateProcessing(
    single='[CLS]:0 $A:0 [SEP]:0',
    pair='[CLS]:0 $A:0 [SEP]:0 $B:1 [SEP]:1',
    special_tokens=[
        ('[CLS]', cls_id),
        ('[SEP]', sep_id)
    ]
)
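
What the two templates produce can be sketched in plain Python. The helper `apply_template` is hypothetical. The default IDs 2 and 3 for `[CLS]` and `[SEP]` are assumptions taken from the first and last values of the encoded output in section 5.0, and the other token IDs are made up.

```python
def apply_template(ids_a, ids_b=None, cls_id=2, sep_id=3):
    """Return (input_ids, token_type_ids) laid out as [CLS] A [SEP] (B [SEP])."""
    input_ids = [cls_id] + ids_a + [sep_id]
    type_ids = [0] * len(input_ids)
    if ids_b is not None:
        # The second sequence and its closing [SEP] get type ID 1.
        input_ids += ids_b + [sep_id]
        type_ids += [1] * (len(ids_b) + 1)
    return input_ids, type_ids


single = apply_template([10, 11])
pair = apply_template([10], [20, 21])

print(single)
print(pair)
```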

### **3.6 Decoder**

In [24]:
from tokenizers import decoders

tokenizer.decoder = decoders.WordPiece(prefix='##')
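
The core of WordPiece decoding is gluing `##`-prefixed pieces back onto the preceding piece. The sketch below shows only that idea; the library's decoder may also clean up spacing around punctuation. The function `wordpiece_decode` and the sample tokens are hypothetical.

```python
def wordpiece_decode(tokens, prefix="##"):
    """Join WordPiece-style tokens back into space-separated words."""
    words = []
    for token in tokens:
        if token.startswith(prefix) and words:
            # Continuation piece: attach it to the previous word without the prefix.
            words[-1] += token[len(prefix):]
        else:
            words.append(token)
    return " ".join(words)


decoded = wordpiece_decode(["vi", "##tabu", "vi", "##kubwa"])
print(decoded)
```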

## **4.0 Saving the Tokenizer**

In [25]:
from transformers import PreTrainedTokenizerFast

# load the tokenizer in a transformers tokenizer instance
tokenizer = PreTrainedTokenizerFast(
    tokenizer_object=tokenizer,
    unk_token='[UNK]',
    pad_token='[PAD]',
    cls_token='[CLS]',
    sep_token='[SEP]',
    mask_token='[MASK]'
)

# save the tokenizer
tokenizer.save_pretrained('/content/drive/MyDrive/Kiswahili_Dataset/RDK-Kisw-Tokenizer')

('/content/drive/MyDrive/Kiswahili_Dataset/RDK-Kisw-Tokenizer/tokenizer_config.json',
 '/content/drive/MyDrive/Kiswahili_Dataset/RDK-Kisw-Tokenizer/special_tokens_map.json',
 '/content/drive/MyDrive/Kiswahili_Dataset/RDK-Kisw-Tokenizer/tokenizer.json')

Tokenizer successfully saved.

## **5.0 Using the Tokenizer**

In [26]:
tokenizer = PreTrainedTokenizerFast.from_pretrained('/content/drive/MyDrive/Kiswahili_Dataset/RDK-Kisw-Tokenizer')

In [34]:
tokens = tokenizer("Ilikuwa wakati wa jioni jua limepunguza udhia wake na upepo mwanana ulikuwa ukipita na kuzipapasa ngozi zetu mfano wa pamba")

In [35]:
tokenizer("Ilikuwa wakati wa jioni jua limepunguza udhia wake na upepo mwanana ulikuwa ukipita na kuzipapasa ngozi zetu mfano wa pamba")

{'input_ids': [2, 1345, 634, 135, 1952, 1133, 1055, 778, 76, 161, 164, 52, 267, 75, 433, 153, 1442, 1938, 1363, 2012, 153, 2067, 2111, 663, 590, 135, 537, 141, 3], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

In [30]:
print(tokenizer)

PreTrainedTokenizerFast(name_or_path='/content/drive/MyDrive/Kiswahili_Dataset/RDK-Kisw-Tokenizer', vocab_size=2211, model_max_len=1000000000000000019884624838656, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'})


In [33]:
tokens.input_ids

[2,
 1345,
 634,
 135,
 1952,
 1133,
 1055,
 778,
 76,
 161,
 164,
 52,
 267,
 75,
 433,
 153,
 1442,
 1938,
 1363,
 2012,
 153,
 2067,
 2111,
 663,
 590,
 135,
 537,
 141,
 3]
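
One rough way to read this output is to compare the number of subword tokens with the number of whitespace-separated words. The sketch below copies the sentence and the IDs from the cells above and assumes that the first and last IDs are the `[CLS]` and `[SEP]` tokens added in post-processing.

```python
sentence = (
    "Ilikuwa wakati wa jioni jua limepunguza udhia wake na upepo mwanana "
    "ulikuwa ukipita na kuzipapasa ngozi zetu mfano wa pamba"
)
input_ids = [2, 1345, 634, 135, 1952, 1133, 1055, 778, 76, 161, 164, 52, 267,
             75, 433, 153, 1442, 1938, 1363, 2012, 153, 2067, 2111, 663, 590,
             135, 537, 141, 3]

words = sentence.split()
# Drop the leading and trailing special tokens before counting.
subword_ids = input_ids[1:-1]
tokens_per_word = len(subword_ids) / len(words)

print(len(words), len(subword_ids), round(tokens_per_word, 2))
```

A ratio above 1 means that some words were split into several pieces. Which words were split can be inspected with `tokenizer.convert_ids_to_tokens(input_ids)`.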