# Transcribing Audio in Kinyarwanda to Text 

**Masakhane \#studyNLP Exercise, April 2021**

**Author: Tunde Ajayi**


## 1. Load pre-trained Kinyarwanda model

In [1]:
%%capture
!pip install datasets==1.4.1
!pip install transformers==4.4.0
!pip install torchaudio
!pip install librosa
!pip install jiwer

In [2]:
# Mount drive
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


In [3]:
# code copied from huggingface.co, with edits to fit this exercise
# import processor and model
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

processor = Wav2Vec2Processor.from_pretrained("lucio/wav2vec2-large-xlsr-kinyarwanda") 
model = Wav2Vec2ForCTC.from_pretrained("lucio/wav2vec2-large-xlsr-kinyarwanda")


Special tokens have been added in the vocabulary, make sure the associated word embedding are fine-tuned or trained.



## 2. Load Kinyarwanda audio

In [None]:
# run this cell to upload audio from your local disk
#from google.colab import files
#files.upload()

In [4]:
# Kinyarwanda audio
from IPython.display import Audio
file1 = "/content/gdrive/MyDrive/test_200627-233446_kin_60a_elicit_399.wav"
Audio(file1)

In [8]:
# load the audio file as a tensor
import torchaudio
speech, rate = torchaudio.load(file1)
print("rate:", rate)
print("audio as a tensor:", speech)

rate: 16000
audio as a tensor: tensor([[ 0.0000,  0.0000,  0.0000,  ..., -0.0007, -0.0002, -0.0002]])


The model was trained on 16 kHz audio, which matches the sampling rate of this file (`rate: 16000`), so no resampling is needed.
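
If a recording were sampled at a different rate, it would have to be converted to 16 kHz before being passed to the processor. The sketch below shows one way to do that with `torchaudio.transforms.Resample`. The helper name `resample_if_needed` and the constant `TARGET_RATE` are assumptions made for illustration; they are not part of the original notebook.

```python
TARGET_RATE = 16_000  # rate the model was trained on, as noted above


def resample_if_needed(waveform, rate, target_rate=TARGET_RATE):
    """Return the waveform at target_rate, resampling only when the rates differ.

    Hypothetical helper, not part of the original notebook.
    """
    if rate == target_rate:
        return waveform  # already at the expected rate; returned unchanged
    # Imported lazily so the no-op path does not require torchaudio.
    import torchaudio

    resampler = torchaudio.transforms.Resample(orig_freq=rate, new_freq=target_rate)
    return resampler(waveform)
```

In this notebook it could be called as `speech = resample_if_needed(speech, rate)` right after `torchaudio.load`; for the files used here it would return the tensor unchanged.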

## 3. Apply model to transcribe Kinyarwanda audio

In [10]:
# code copied from huggingface.co
import torch

# reference transcript for this audio file
text1 = "abanyaporitiki n’abayobozi b’amadini, badaha agaciro abantu cyangwa ngo babiteho.Yehova we yifuza ko twumva ko dufite"
# If the processor does not accept the 2-D tensor as-is, uncomment the next line to
# convert it to a 1-D NumPy array. Run it only once: a second run raises
# AttributeError: 'numpy.ndarray' object has no attribute 'numpy'
#speech = speech.squeeze().numpy()
inputs = processor(speech, sampling_rate=16_000, return_tensors="pt", padding=True)

with torch.no_grad():
    logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits
predicted_ids = torch.argmax(logits, dim=-1)
print("Prediction:", processor.batch_decode(predicted_ids))
print(f'Reference text: {text1}')

Prediction: ['abanyapolitiki n abayobozi b amadini badahahe agaciro abantu cyangwa ngo babiteho yehova we yifuza ko twumva ko dufite']
Reference text: abanyaporitiki n’abayobozi b’amadini, badaha agaciro abantu cyangwa ngo babiteho.Yehova we yifuza ko twumva ko dufite
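
The prediction is lowercase and unpunctuated while the reference is not, so a fair comparison needs some normalization first. The notebook installs `jiwer`, whose `wer` function computes word error rate; the block below is a dependency-free sketch of the same idea, not the evaluation used by the model's author. The function names `normalize` and `word_error_rate`, and the choice to treat apostrophes and punctuation as word separators, are assumptions.

```python
import re


def normalize(text):
    """Lowercase the text, turn punctuation into spaces, and split into words."""
    return re.sub(r"[^\w\s]", " ", text.lower()).split()


def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = normalize(reference), normalize(hypothesis)
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, substitution))
        prev = curr
    return prev[-1] / len(ref)


# Strings copied from the cell output above.
reference = "abanyaporitiki n’abayobozi b’amadini, badaha agaciro abantu cyangwa ngo babiteho.Yehova we yifuza ko twumva ko dufite"
prediction = "abanyapolitiki n abayobozi b amadini badahahe agaciro abantu cyangwa ngo babiteho yehova we yifuza ko twumva ko dufite"

print(f"WER: {word_error_rate(reference, prediction):.3f}")
```

A lower value means the prediction is closer to the reference. The score depends on the normalization chosen, so it should be read as a rough indication only.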


## Another example using a different audio file

In [11]:
# Kinyarwanda audio
from IPython.display import Audio
file2 = "/content/gdrive/MyDrive/test_200702-193159_kin_60a_elicit_156.wav"
Audio(file2)

In [12]:
# load the audio file as a tensor
import torchaudio
speech2, rate = torchaudio.load(file2)
print("rate:", rate)
print("audio as a tensor:", speech2)

rate: 16000
audio as a tensor: tensor([[ 0.0000,  0.0000,  0.0000,  ..., -0.0017, -0.0014, -0.0007]])


In [16]:
# code copied from huggingface.co
import torch

# reference transcript for this audio file
text2 = "Inshuti nyanshuti zemera kwigomwa kugira ngo zishyigikire Abakristo bagenzi babo. Urugero, umuvandimwe witwa Peter yagiye"

# If the processor does not accept the 2-D tensor as-is, uncomment the next line to
# convert it to a 1-D NumPy array. Run it only once: a second run raises
# AttributeError: 'numpy.ndarray' object has no attribute 'numpy'
#speech2 = speech2.squeeze().numpy()
inputs = processor(speech2, sampling_rate=16_000, return_tensors="pt", padding=True)

with torch.no_grad():
    logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits
predicted_ids2 = torch.argmax(logits, dim=-1)
print(f'Prediction: {processor.batch_decode(predicted_ids2)}')
print(f'Reference text: {text2}')

Prediction: ['inshuti nyanshuti zenera kwigumwa kugira ngo shyikire abakristu bagenzi babo urugero umuvandimwe witwa peter yagiye']
Reference text: Inshuti nyanshuti zemera kwigomwa kugira ngo zishyigikire Abakristo bagenzi babo. Urugero, umuvandimwe witwa Peter yagiye
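
To see where this prediction departs from the reference without reading the language, the two texts can be aligned word by word. The sketch below uses `difflib.SequenceMatcher` from the Python standard library. The helper names `words` and `word_mismatches` are assumptions introduced for illustration, and the normalization (lowercasing, punctuation treated as spaces) is one reasonable choice among several.

```python
import re
from difflib import SequenceMatcher


def words(text):
    """Lowercase the text, turn punctuation into spaces, and split into words."""
    return re.sub(r"[^\w\s]", " ", text.lower()).split()


def word_mismatches(reference, hypothesis):
    """Return (reference words, predicted words) pairs for each span that differs."""
    ref, hyp = words(reference), words(hypothesis)
    matcher = SequenceMatcher(a=ref, b=hyp, autojunk=False)
    diffs = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # An empty string on one side marks an inserted or deleted span.
            diffs.append((" ".join(ref[i1:i2]), " ".join(hyp[j1:j2])))
    return diffs


# Strings copied from the cell output above.
reference = "Inshuti nyanshuti zemera kwigomwa kugira ngo zishyigikire Abakristo bagenzi babo. Urugero, umuvandimwe witwa Peter yagiye"
prediction = "inshuti nyanshuti zenera kwigumwa kugira ngo shyikire abakristu bagenzi babo urugero umuvandimwe witwa peter yagiye"

for expected, predicted in word_mismatches(reference, prediction):
    print(f"expected: {expected!r}  predicted: {predicted!r}")
```

Adjacent mismatched words are grouped into a single span, so the number of pairs printed can be smaller than the number of wrong words.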


## Disclaimer:

I do not read or speak Kinyarwanda, so I am unable to evaluate the transcriptions properly. From listening to the audio files, however, the transcriptions appear to make sense.

## Acknowledgements

Many thanks to the following people for their contributions:

- Colin Leong: for initiating this task.

- Bradley Mensah: for the insight from his notebook. 

- Jean Paul: for the audio files.