<a href="https://colab.research.google.com/github/rdkdaniel/The-Swahili-Project/blob/main/Swahili_Language_Model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **The General Model**

Here I develop a Kiswahili language model, which will later be fine-tuned for downstream tasks. The steps are:


1.   Get the data: we shall use the established Swahili dataset (Source:).
2.   Load the tokenizer (it has already been trained and saved, so we only load it here).
3.   Create an input pipeline.
4.   Train the model.


# **A Full Training Guide**

## **Libraries**

In [1]:
!pip install datasets


Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting datasets
  Downloading datasets-2.7.0-py3-none-any.whl (451 kB)
Collecting responses<0.19
  Downloading responses-0.18.0-py3-none-any.whl (38 kB)
Collecting multiprocess
  Downloading multiprocess-0.70.14-py37-none-any.whl (115 kB)
Collecting huggingface-hub<1.0.0,>=0.2.0
  Downloading huggingface_hub-0.11.0-py3-none-any.whl (182 kB)
Collecting xxhash
  Downloading xxhash-3.1.0-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (212 kB)
Collecting urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1
  Downloading urllib3-1.25.11-py2.py3-none-any.whl (127 kB)
Installing collected 

In [2]:
!pip install transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.24.0-py3-none-any.whl (5.5 MB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1
  Downloading tokenizers-0.13.2-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.6 MB)
Installing collected packages: tokenizers, transformers
Successfully installed tokenizers-0.13.2 transformers-4.24.0


In [3]:
import transformers

In [4]:
import pandas as pd

## **1.0 Getting the Data**

In [None]:
df = pd.read_csv('/content/drive/MyDrive/train.csv')
print(df)

            id                                            content category
0       SW4670   Bodi ya Utalii Tanzania (TTB) imesema, itafan...   uchumi
1      SW30826   PENDO FUNDISHA-MBEYA RAIS Dk. John Magufuri, ...  kitaifa
2      SW29725  Mwandishi Wetu -Singida BENKI ya NMB imetoa ms...   uchumi
3      SW20901   TIMU ya taifa ya Tanzania, Serengeti Boys jan...  michezo
4      SW12560   Na AGATHA CHARLES – DAR ES SALAAM ALIYEKUWA K...  kitaifa
...        ...                                                ...      ...
23263  SW24920   Alitoa pongezi hizo alipozindua rasmi hatua y...   uchumi
23264   SW4038   Na NORA DAMIAN-DAR ES SALAAM  TEKLA (si jina ...  kitaifa
23265  SW16649   Mkuu wa Mkoa wa Njombe, Dk Rehema Nchimbi wak...   uchumi
23266  SW23291   MABINGWA wa Ligi Kuu Soka Tanzania Bara, Simb...  michezo
23267  SW11778   WIKI iliyopita, nilianza makala haya yanayole...  kitaifa

[23268 rows x 3 columns]


## **2.0 The Tokenizer**

In [15]:
from transformers import PreTrainedTokenizerFast

In [None]:
# load the tokenizer that was previously trained and saved to disk
tokenizer = PreTrainedTokenizerFast.from_pretrained('/content/drive/MyDrive/Kiswahili_Dataset/RDK-Kisw-Tokenizer')

*We can attempt to encode some text with it*

In [None]:
# test our tokenizer on a simple sentence
tokens = tokenizer('Jumbo, habari yako?')

In [None]:
print(tokens)

{'input_ids': [2, 742, 387, 12, 1265, 808, 29, 3], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1]}


In [None]:
tokens.input_ids

[2, 742, 387, 12, 1265, 808, 29, 3]
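
The encoding above can be unpacked by hand to see what each field holds. The sketch below copies the printed values and assumes that ids `2` and `3` are the start and end special tokens, since they wrap the sequence; the tokenizer's own configuration is not shown here, so that reading is a guess.

```python
# Encoding copied from the printed output above.
encoding = {
    "input_ids": [2, 742, 387, 12, 1265, 808, 29, 3],
    "token_type_ids": [0, 0, 0, 0, 0, 0, 0, 0],
    "attention_mask": [1, 1, 1, 1, 1, 1, 1, 1],
}

# Assumption: ids 2 and 3 are the start/end special tokens of this tokenizer.
SPECIAL_IDS = {2, 3}

# Ids that correspond to pieces of the actual text.
content_ids = [i for i in encoding["input_ids"] if i not in SPECIAL_IDS]

# Every field has one entry per token; a mask of all ones means no padding.
same_length = len({len(v) for v in encoding.values()}) == 1
has_padding = 0 in encoding["attention_mask"]

print(content_ids, same_length, has_padding)
```

With the real tokenizer loaded, `tokenizer.convert_ids_to_tokens(tokens.input_ids)` would show which text piece each id stands for.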

## **3.0 Input Pipeline**

### **3.1 Small Data Prep**

## **4.0 Training the Model**

# **A Small Test**

In [6]:
from transformers import PreTrainedTokenizerFast

In [10]:
from datasets import load_dataset

raw_datasets = load_dataset("swahili")
raw_datasets



  0%|          | 0/3 [00:00<?, ?it/s]

DatasetDict({
    train: Dataset({
        features: ['text'],
        num_rows: 42069
    })
    test: Dataset({
        features: ['text'],
        num_rows: 3371
    })
    validation: Dataset({
        features: ['text'],
        num_rows: 3372
    })
})

In [11]:
raw_train_dataset = raw_datasets["train"]
raw_train_dataset[0]

{'text': 'taarifa hiyo ilisema kuwa ongezeko la joto la maji juu ya wastani katikati ya bahari ya UNK inaashiria kuwepo kwa mvua za el nino UNK hadi mwishoni mwa april ishirini moja sifuri imeelezwa kuwa ongezeko la joto magharibi mwa bahari ya hindi linatarajiwa kuhamia katikati ya bahari hiyo hali ambayo itasababisha pepo kutoka kaskazini mashariki kuvuma kuelekea bahari ya hindi'}

In [12]:
raw_train_dataset.features

{'text': Value(dtype='string', id=None)}

In [16]:
from transformers import PreTrainedTokenizerFast

tokenizer = PreTrainedTokenizerFast.from_pretrained('/content/drive/MyDrive/Kiswahili_Dataset/RDK-Kisw-Tokenizer')

tokenized_sentences = tokenizer(raw_datasets["train"]["text"])


In [17]:
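# Pad every sequence to the longest one in the split.
# Note: truncation=True has no effect here because no max_length is given
# and the saved tokenizer defines no model maximum (hence the warning below).
tokenized_dataset = tokenizer(
    raw_datasets["train"]["text"],
    padding=True,
    truncation=True,
)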

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.
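
The warning says that truncation was requested without a length limit, so nothing is cut. Passing an explicit `max_length=` to the tokenizer call is the usual remedy. The pure-Python sketch below only illustrates what truncation to a fixed length does; it is a simplification rather than the library's implementation, and the limit of 5 is an arbitrary value chosen for the demonstration.

```python
def truncate(ids, max_length):
    """Keep at most max_length ids, preserving the final (end) token.

    Simplified illustration: content tokens are dropped from the right
    and the closing special token is kept in place.
    """
    if len(ids) <= max_length:
        return list(ids)
    return list(ids[: max_length - 1]) + [ids[-1]]


# Ids from the earlier test sentence; 5 is an arbitrary demo limit.
example_ids = [2, 742, 387, 12, 1265, 808, 29, 3]
truncated = truncate(example_ids, max_length=5)
untouched = truncate(example_ids, max_length=512)

print(truncated)
print(untouched == example_ids)
```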


In [19]:
def tokenize_function(example):
    return tokenizer(example["text"], truncation=True)

In [20]:
tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)
tokenized_datasets

  0%|          | 0/43 [00:00<?, ?ba/s]

  0%|          | 0/4 [00:00<?, ?ba/s]

  0%|          | 0/4 [00:00<?, ?ba/s]

DatasetDict({
    train: Dataset({
        features: ['text', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 42069
    })
    test: Dataset({
        features: ['text', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 3371
    })
    validation: Dataset({
        features: ['text', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 3372
    })
})
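
With `batched=True`, `map` hands the tokenizer a slice of examples at a time instead of one example per call. The batch counts in the progress bars above (43, 4 and 4) are consistent with the default `batch_size` of 1000, as this small check shows; the split sizes are copied from the printed `DatasetDict`.

```python
import math

# Split sizes copied from the DatasetDict printed above.
split_sizes = {"train": 42069, "test": 3371, "validation": 3372}

# Default batch_size used by Dataset.map when batched=True.
MAP_BATCH_SIZE = 1000

# Number of calls to tokenize_function per split.
num_batches = {
    name: math.ceil(rows / MAP_BATCH_SIZE) for name, rows in split_sizes.items()
}

print(num_batches)
```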

In [21]:
from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

In [22]:
samples = tokenized_datasets["train"][:8]
samples = {k: v for k, v in samples.items() if k not in ["idx", "text"]}
[len(x) for x in samples["input_ids"]]

[123, 69, 49, 68, 25, 77, 47, 51]

In [23]:
batch = data_collator(samples)
{k: v.shape for k, v in batch.items()}

You're using a PreTrainedTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


{'input_ids': torch.Size([8, 123]),
 'token_type_ids': torch.Size([8, 123]),
 'attention_mask': torch.Size([8, 123])}
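
The collator pads each batch only up to its own longest sequence, which is why the eight samples of lengths 123, 69, 49, ... all come out with 123 columns. A minimal pure-Python sketch of that idea follows; the padding id of 0 and the dummy token ids are assumptions made for the illustration, and the real collator additionally returns tensors.

```python
def pad_batch(sequences, pad_id=0):
    """Pad every sequence to the length of the longest one in the batch."""
    longest = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (longest - len(seq)) for seq in sequences]
    # 1 marks a real token, 0 marks padding.
    attention_mask = [
        [1] * len(seq) + [0] * (longest - len(seq)) for seq in sequences
    ]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Sequence lengths copied from the output above; the ids are dummies.
lengths = [123, 69, 49, 68, 25, 77, 47, 51]
sequences = [[1] * n for n in lengths]

batch = pad_batch(sequences)
shape = (len(batch["input_ids"]), len(batch["input_ids"][0]))
real_tokens_per_row = [sum(row) for row in batch["attention_mask"]]

print(shape)
```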

## **Training**

In [24]:
from transformers import TrainingArguments

training_args = TrainingArguments("test-trainer")
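
`TrainingArguments("test-trainer")` sets only the output directory, so every other setting keeps its default. The run logged further down reports 3 epochs, a per-device batch size of 8 and 15777 optimization steps; the step count follows from the dataset size. A quick check of that arithmetic, assuming a single device and no gradient accumulation (as the log states):

```python
import math

# Values taken from the training log in this notebook.
num_examples = 42069
per_device_batch_size = 8
num_epochs = 3

# The last, partially filled batch still counts as one step.
steps_per_epoch = math.ceil(num_examples / per_device_batch_size)
total_steps = steps_per_epoch * num_epochs

print(steps_per_epoch, total_steps)
```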

In [25]:
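from transformers import AutoModelForSequenceClassification

# `checkpoint` was never defined in this notebook; the log below shows that
# bert-base-uncased was the checkpoint actually loaded.
checkpoint = "bert-base-uncased"
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)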

Downloading:   0%|          | 0.00/440M [00:00<?, ?B/s]

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.bias', 'cls.predictions.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.weight', 'cls.predictions.decoder.weight']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at

In [26]:
from transformers import Trainer

trainer = Trainer(
    model,
    training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
)

In [27]:
trainer.train()

The following columns in the training set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 42069
  Num Epochs = 3
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 15777
  Number of trainable parameters = 109483778


ValueError: ignored
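
The traceback is elided, so the cause is not confirmed. One plausible reading is that the dataset has only a `text` column and no `labels`, so a sequence-classification model has nothing to compute a loss against. Since the goal stated at the top is a language model, masked language modelling is a common alternative: the labels are derived from the text itself. In Transformers this is typically done with `AutoModelForMaskedLM` and `DataCollatorForLanguageModeling`. The sketch below shows only the core masking idea in plain Python; `MASK_ID` is a hypothetical id, the special ids are assumed, and the library's collator is more elaborate (it also sometimes substitutes random tokens or leaves the chosen token unchanged).

```python
import random

IGNORE_INDEX = -100   # label value that the loss skips
MASK_ID = 4           # hypothetical id of the mask token
SPECIAL_IDS = {2, 3}  # assumed start/end ids, never masked


def mask_tokens(input_ids, mask_prob=0.15, seed=0):
    """Return (masked_ids, labels) for masked language modelling.

    Each non-special token is replaced by MASK_ID with probability
    mask_prob; its original id becomes the label at that position.
    All other positions get IGNORE_INDEX as their label.
    """
    rng = random.Random(seed)
    masked_ids, labels = [], []
    for token in input_ids:
        if token not in SPECIAL_IDS and rng.random() < mask_prob:
            masked_ids.append(MASK_ID)
            labels.append(token)
        else:
            masked_ids.append(token)
            labels.append(IGNORE_INDEX)
    return masked_ids, labels


original = [2, 742, 387, 12, 1265, 808, 29, 3]
masked_ids, labels = mask_tokens(original, mask_prob=0.5, seed=1)
print(masked_ids)
print(labels)
```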

In [28]:
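predictions = trainer.predict(tokenized_datasets["validation"])
# Likely cause of the error below (traceback elided): the dataset has no label
# column, so predictions.label_ids is None and has no .shape attribute.
print(predictions.predictions.shape, predictions.label_ids.shape)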

The following columns in the test set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 3372
  Batch size = 8


AttributeError: ignored

In [29]:
import numpy as np

preds = np.argmax(predictions.predictions, axis=-1)

NameError: ignored
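
The `NameError` most likely follows from the failed cell above, which never assigned `predictions`. Once logits are available, `np.argmax(..., axis=-1)` picks the highest-scoring class for each row. A dependency-free sketch of the same operation, using made-up logits for two classes:

```python
def argmax_last_axis(rows):
    """Index of the largest value in each row (first index wins on ties)."""
    return [max(range(len(row)), key=row.__getitem__) for row in rows]


# Illustrative logits only; these are not results from this notebook.
logits = [
    [0.2, 1.3],
    [2.1, -0.4],
    [0.0, 0.0],
]
preds = argmax_last_axis(logits)
print(preds)
```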