<a href="https://colab.research.google.com/github/rdkdaniel/Swahili-Dataset-Augmentation/blob/main/Text_Data_Augmentation_Swahili_Dataset.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
# Test with a few Swahili sentences

In [4]:
# Install and import the library
!pip install nlpaug

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting nlpaug
  Downloading nlpaug-1.1.11-py3-none-any.whl (410 kB)
     |████████████████████████████████| 410 kB 5.4 MB/s
Installing collected packages: nlpaug
Successfully installed nlpaug-1.1.11


In [8]:
import nlpaug.augmenter.char as nac
import nlpaug.augmenter.word as naw
import nlpaug.augmenter.sentence as nas
import nlpaug.flow as nafc

In [10]:
# Base text used throughout this notebook
text="Jina langu ni Kiguru. Jina lako ni nani?."

**Character Level Augmentation**


**Optical character recognition (OCR) Augmenter:** Reading text from an image requires an OCR model.
The extracted text may contain recognition errors, such as '0' instead of 'o' or '2' instead of 'z'. This augmenter simulates such errors.
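
The idea can be sketched without any library: walk through the text and occasionally swap a character for a look-alike that an OCR engine might confuse it with. The confusion table and the function name `ocr_noise` below are illustrative assumptions, not the mapping or API that nlpaug uses.

```python
import random

# Hypothetical confusion table: characters an OCR engine might misread.
# This is an illustrative subset, not the mapping nlpaug ships with.
OCR_CONFUSIONS = {"o": "0", "l": "1", "z": "2", "s": "5", "i": "1"}

def ocr_noise(text, prob=0.3, seed=None):
    """Replace confusable characters with look-alikes, each with probability `prob`."""
    rng = random.Random(seed)
    out = []
    for ch in text:
        swap = OCR_CONFUSIONS.get(ch)
        # Only characters listed in the table can change; all others pass through.
        if swap is not None and rng.random() < prob:
            out.append(swap)
        else:
            out.append(ch)
    return "".join(out)

print(ocr_noise("Jina lako ni nani?", prob=0.5, seed=0))
```

Because every substitution is one character for one character, the augmented string always has the same length as the input.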

**Keyboard Augmenter:** Typos are common when typing or texting. This augmenter simulates them by substituting characters in words with
ones that sit close by on a keyboard.
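
A rough sketch of the same idea, assuming a QWERTY layout: each character has a small set of neighboring keys, and a typo is a random pick from that set. The neighbor map covers only a few keys and the name `keyboard_noise` is hypothetical; nlpaug's own keyboard model is more complete.

```python
import random

# Hypothetical neighbor map for a few keys on a QWERTY layout (illustrative subset).
QWERTY_NEIGHBORS = {
    "a": "qwsz",
    "i": "ujko",
    "n": "bhjm",
    "k": "jilm",
    "u": "yhji",
}

def keyboard_noise(text, prob=0.2, seed=None):
    """Replace characters with a neighboring key, each with probability `prob`."""
    rng = random.Random(seed)
    out = []
    for ch in text:
        # Lookup is case-insensitive; a replaced character comes out lowercase.
        neighbors = QWERTY_NEIGHBORS.get(ch.lower())
        if neighbors and rng.random() < prob:
            out.append(rng.choice(neighbors))
        else:
            out.append(ch)
    return "".join(out)

print(keyboard_noise("Jina lako ni nani?", prob=0.3, seed=1))
```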

In [12]:
# First, the OCR augmenter
# import nlpaug.augmenter.char as nac (already done above)

aug = nac.OcrAug()  
augmented_texts = aug.augment(text, n=3)  # n=3 requests three augmented versions of the sentence

print("Original:")
print(text)

print("Augmented Texts:")
print(augmented_texts)

Original:
Jina langu ni Kiguru. Jina lako ni nani?.
Augmented Texts:
['Jina langu ni Kiguru. Jina lako ni nani?.', 'Jina langu ni Kiguru. Jina 1aku ni nani?.', 'Jina langu ni Kiguru. Jina lar0 ni nani?.']


In [13]:
# Second, the keyboard augmenter
# import nlpaug.augmenter.char as nac (already done above)

aug = nac.KeyboardAug()
augmented_text = aug.augment(text, n=3)  # n=3 requests three augmented versions of the sentence

print("Original:")
print(text)

print("Augmented Text:")
print(augmented_text)

Original:
Jina langu ni Kiguru. Jina lako ni nani?.
Augmented Text:
['J8ba lqnfu ni >ig6ru. K(na lako ni nani?.', 'Jina pang& ni Kinutu. Jina Ixko ni HSni?.', 'JkJa lQmgu ni Kiguru. Jina ,xko ni gQni?.']


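**NOTE: nlpaug has other character augmenters too, such as `RandomAug`, which inserts, substitutes, swaps, or deletes characters at random. See the nlpaug documentation for the full list.**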

**Word Level Augmentation**

Word-level augmentation is important as well. It uses word2vec, GloVe, fastText, BERT, and WordNet to insert or substitute similar words.

nlpaug provides seven word augmenters: SpellingAug, WordEmbsAug, TfIdfAug, ContextualWordEmbsAug, FasttextAug, BertAug, and WordNetAug. The last three are names from older nlpaug releases; recent releases cover those roles with WordEmbsAug, ContextualWordEmbsAug, and SynonymAug.

In [15]:
# Spelling augmenter
aug = naw.SpellingAug()
augmented_texts = aug.augment(text, n=3)

print("Original:")
print(text)
print("Augmented Texts:")
print(augmented_texts)

Original:
Jina langu ni Kiguru. Jina lako ni nani?.
Augmented Texts:
['Jina langu Ni Kiguru. Jina lako in nine?.', 'Jina langu Ni Kiguru. Jina lako Ni nine?.', 'Jina langu in Kiguru. Jina lako Ni Nani?.']
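
In the output above, "ni" became "in" and "nani" became "nine", which suggests the default spelling dictionary is oriented toward English. The sketch below shows the underlying mechanism, a lookup from a word to its known misspellings, using a made-up Swahili table. The entries in `MISSPELLINGS` and the name `spelling_noise` are hypothetical; a usable table would have to be curated from real Swahili text.

```python
import random

# Hypothetical misspelling table, invented for illustration only.
MISSPELLINGS = {
    "jina": ["gina", "jinna"],
    "langu": ["rangu", "lagu"],
    "lako": ["rako", "lakko"],
}

def spelling_noise(text, prob=0.5, seed=None):
    """Swap whitespace-separated tokens for a listed misspelling with probability `prob`."""
    rng = random.Random(seed)
    out = []
    for token in text.split():
        # Lookup is case-insensitive; tokens with attached punctuation do not match.
        options = MISSPELLINGS.get(token.lower())
        if options and rng.random() < prob:
            out.append(rng.choice(options))
        else:
            out.append(token)
    return " ".join(out)

print(spelling_noise("Jina langu ni Kiguru. Jina lako ni nani?", seed=0))
```

Tokens that are not in the table, such as "ni" and "Kiguru.", are left untouched, so words outside the dictionary are never corrupted into words from another language.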


In [19]:
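# WordEmbsAug: leverage word2vec, GloVe, or fastText embeddings to apply augmentation
# model_dir must be defined before this call, otherwise Python raises a NameError.
# The value below is a placeholder assumption: point it at the directory that holds
# the downloaded GoogleNews-vectors-negative300.bin file.
# Note that these vectors were trained on English news text, not Swahili.
model_dir = './'
aug = naw.WordEmbsAug(model_type='word2vec', model_path=model_dir+'GoogleNews-vectors-negative300.bin', action="insert")
augmented_text = aug.augment(text)

print("Original:")
print(text)
print("Augmented Text:")
print(augmented_text)
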
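
To make the "insert" action of an embedding-based augmenter concrete, here is a minimal sketch that needs no model download. It picks a word the vocabulary knows, finds its nearest neighbor by cosine similarity, and inserts that neighbor after it. The two-dimensional vectors are made up for illustration, and `nearest` and `insert_similar` are hypothetical helpers, not nlpaug functions.

```python
import math
import random

# Toy 2-dimensional vectors, invented for illustration; real embeddings have
# hundreds of dimensions and are learned from a corpus.
TOY_VECTORS = {
    "jina": (1.0, 0.1),
    "majina": (0.9, 0.2),
    "nani": (0.1, 1.0),
    "nini": (0.2, 0.9),
}

def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def nearest(word):
    """Return the most similar other word in the toy vocabulary, or None if unknown."""
    if word not in TOY_VECTORS:
        return None
    others = [w for w in TOY_VECTORS if w != word]
    return max(others, key=lambda w: cosine(TOY_VECTORS[word], TOY_VECTORS[w]))

def insert_similar(text, seed=None):
    """Insert the nearest neighbor of one randomly chosen known word right after it."""
    rng = random.Random(seed)
    tokens = text.split()
    known = [i for i, t in enumerate(tokens) if t.lower() in TOY_VECTORS]
    if not known:
        return text  # nothing in the vocabulary, so leave the text unchanged
    i = rng.choice(known)
    tokens.insert(i + 1, nearest(tokens[i].lower()))
    return " ".join(tokens)

print(insert_similar("Jina lako ni nani", seed=1))
```

The quality of this kind of augmentation depends on the vocabulary: with vectors trained on another language, most Swahili tokens would be unknown or matched to unrelated words, so Swahili-trained embeddings are the natural choice here.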