<a href="https://colab.research.google.com/github/rfclara/fa_xhosa/blob/main/xhosa_forced_alignement.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Aligning transcriptions and annotations - Xhosa corpus



# Introduction

The purpose of this notebook is to align the interlinear glosses of a transcribed Xhosa corpus with the audio. Xhosa is one of the official languages of South Africa and Zimbabwe.

The transcriptions in this corpus are not aligned with the speech. We will use [CTC forced alignment](https://pytorch.org/audio/main/tutorials/ctc_forced_alignment_api_tutorial.html) to cut each recording into small chunks and get the timestamps corresponding to their transcriptions.
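To make the idea concrete, here is a minimal sketch of CTC forced alignment in plain Python. It is not the code used by torchaudio or MMS; the function names `ctc_forced_align` and `token_spans`, the toy vocabulary and the toy frame probabilities are all assumptions made for illustration. Given per-frame log-probabilities and the known token sequence, a Viterbi search finds the most likely frame-by-frame labelling, from which token start and end frames can be read off.

```python
import math

def ctc_forced_align(log_probs, targets, blank=0):
    """Best per-frame label path for `targets` under CTC rules.

    log_probs: list of T frames, each a list of per-class log-probabilities.
    targets:   list of token ids (without blanks).
    """
    # Interleave blanks: [blank, t1, blank, t2, ..., blank]
    ext = [blank]
    for tok in targets:
        ext += [tok, blank]
    T, S = len(log_probs), len(ext)
    score = [[-math.inf] * S for _ in range(T)]
    back = [[0] * S for _ in range(T)]
    # A path may start in the leading blank or in the first token
    score[0][0] = log_probs[0][ext[0]]
    if S > 1:
        score[0][1] = log_probs[0][ext[1]]
    for t in range(1, T):
        for s in range(S):
            # Predecessors: stay, advance by one, or skip a blank
            # (skipping is only allowed between two different tokens)
            cands = [s]
            if s >= 1:
                cands.append(s - 1)
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                cands.append(s - 2)
            best = max(cands, key=lambda p: score[t - 1][p])
            score[t][s] = score[t - 1][best] + log_probs[t][ext[s]]
            back[t][s] = best
    # A path must end in the last token or in the trailing blank
    s = S - 1
    if S > 1 and score[T - 1][S - 2] > score[T - 1][S - 1]:
        s = S - 2
    if score[T - 1][s] == -math.inf:
        raise ValueError("not enough frames to align the targets")
    path = []
    for t in range(T - 1, -1, -1):
        path.append(ext[s])
        s = back[t][s]
    path.reverse()
    return path

def token_spans(path, blank=0):
    """Collapse a frame path into (token, start_frame, end_frame) spans."""
    spans, prev = [], blank
    for frame, label in enumerate(path):
        if label != blank:
            if label == prev:
                spans[-1][2] = frame + 1      # same token continues
            else:
                spans.append([label, frame, frame + 1])
        prev = label
    return [tuple(s) for s in spans]

# Toy example: class 0 = blank, 1 = "a", 2 = "b"; five frames
probs = [[0.1, 0.8, 0.1], [0.1, 0.8, 0.1], [0.8, 0.1, 0.1],
         [0.1, 0.1, 0.8], [0.1, 0.1, 0.8]]
emissions = [[math.log(p) for p in frame] for frame in probs]
path = ctc_forced_align(emissions, [1, 2])
print(path)               # [1, 1, 0, 2, 2]
print(token_spans(path))  # [(1, 0, 2), (2, 3, 5)]
```

Multiplying the frame indices by the model's frame duration turns these spans into timestamps in seconds.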


Here, we will follow the necessary steps to prepare the data for training or fine-tuning a speech-to-text model.



<p align="center">
  <img src="https://github.com/cawoylel/nlp4all/blob/main/asr/illustrations/forced_aligner.png?raw=true" alt="forced aligner" width=500 class="center">
<br>
    <em>
    Illustration of the forced alignment task, from nlp4all
    </em>
</p>

[MMS](https://github.com/facebookresearch/fairseq/blob/main/examples/mms/README.md) provides a forced aligner built on a multilingual speech model trained on more than one thousand languages. You can check here whether your language is included: https://dl.fbaipublicfiles.com/mms/misc/language_coverage_mms.html




# Installing the dependencies in the virtual environment


In [1]:
!apt install libicu-dev pkg-config

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
libicu-dev is already the newest version (70.1-2).
pkg-config is already the newest version (0.29.2-1ubuntu3).
0 upgraded, 0 newly installed, 0 to remove and 45 not upgraded.


In [2]:
!apt-get update
!apt-get install libsox-fmt-all sox # needed for processing audio
!apt-get install -y ffmpeg
!apt install libicu-dev pkg-config # needed for processing text and unicode symbols


In [3]:
!pip uninstall torch torchaudio -y # we need to install the nightly version of torch
#!pip install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu118
!pip install --pre torch torchaudio --index-url https://download.pytorch.org/whl/nightly/cu118

#!pip install -q dataclasses
!pip install -q sox # for audio processing
!pip install -q ICU-Tokenizer # for tokenizing the text
!pip install -q datasets # we will use huggingface datasets for loading the training dataset
!pip install pandas
!pip install tensorboardX

Found existing installation: torch 2.3.0+cu121
Uninstalling torch-2.3.0+cu121:
  Successfully uninstalled torch-2.3.0+cu121
Found existing installation: torchaudio 2.3.0+cu121
Uninstalling torchaudio-2.3.0+cu121:
  Successfully uninstalled torchaudio-2.3.0+cu121
Looking in indexes: https://download.pytorch.org/whl/nightly/cu118
Collecting torch
  Downloading https://download.pytorch.org/whl/nightly/cu118/torch-2.4.0.dev20240523%2Bcu118-cp310-cp310-linux_x86_64.whl (855.8 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 855.8/855.8 MB 813.8 kB/s eta 0:00:00
Collecting torchaudio
  Downloading https://download.pytorch.org/whl/nightly/cu118/torchaudio-2.2.0.dev20240523%2Bcu118-cp310-cp310-linux_x86_64.whl (3.3 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 3.3/3.3 MB 19.3 MB/s eta 0:00:00
Collecting nvidia-cuda-nvrtc-cu11==11.8.89 (from torch)
  Downloading https://download.pytorch.org/whl/nightly/cu

In this step, we clone the repositories containing the code and resources needed for the alignment: `isi-nlp/uroman`, which romanizes text data; `facebookresearch/fairseq`, which contains the MMS alignment scripts; and `rfclara/fa_xhosa`, which contains the helper scripts for this corpus.

In [4]:
!git clone https://github.com/isi-nlp/uroman.git

Cloning into 'uroman'...
remote: Enumerating objects: 299, done.
remote: Counting objects: 100% (131/131), done.
remote: Compressing objects: 100% (82/82), done.
remote: Total 299 (delta 76), reused 88 (delta 49), pack-reused 168
Receiving objects: 100% (299/299), 5.26 MiB | 26.95 MiB/s, done.
Resolving deltas: 100% (144/144), done.


In [5]:
!git clone https://github.com/facebookresearch/fairseq.git
# `!cd` runs in its own subshell and does not persist, so pass the path to pip directly
!pip install --editable ./fairseq

Cloning into 'fairseq'...
remote: Enumerating objects: 35184, done.

In [6]:
import os

In [7]:
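# Clone the helper scripts. If the clone fails (for example because the repository
# is not publicly readable), upload the scripts manually into /content/fa_xhosa.
!git clone https://github.com/rfclara/fa_xhosa
os.makedirs("/content/fa_xhosa", exist_ok=True)  # no error if the clone already created it
os.chdir("/content/fa_xhosa")
from google.colab import files
uploaded = files.upload()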

Cloning into 'fa_xhosa'...
fatal: could not read Username for 'https://github.com': No such device or address


Saving extract_step5_transcriptions.py to extract_step5_transcriptions.py
Saving extract_transcriptions.py to extract_transcriptions.py
Saving json_to_textgrid.py to json_to_textgrid.py
Saving remove_comments.py to remove_comments.py


## Preparing the data
Getting the audio files and the transcriptions.

`original.zip` should decompress into a single folder called `original` containing the audio files and the transcriptions. Each audio file and its transcription must share the same name and differ only by extension (.wav, .xlsx).

Example:
story1.wav
story1.xlsx
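
Before going further, it can help to confirm that every recording has a matching transcription. The sketch below is one way to check this with the standard library; the helper name `find_unpaired` is hypothetical and is not part of the `fa_xhosa` scripts.

```python
from pathlib import Path

def find_unpaired(folder, audio_ext=".wav", text_ext=".xlsx"):
    """Return (stems with audio but no transcription, stems with transcription but no audio)."""
    folder = Path(folder)
    audio = {p.stem for p in folder.glob(f"*{audio_ext}")}
    text = {p.stem for p in folder.glob(f"*{text_ext}")}
    return sorted(audio - text), sorted(text - audio)

# Folder layout used in this notebook; skipped if the archive is not unpacked yet
corpus = Path("/content/xhosa/original")
if corpus.is_dir():
    no_text, no_audio = find_unpaired(corpus)
    print("Audio without transcription:", no_text)
    print("Transcription without audio:", no_audio)
```

Two empty lists mean that every file is paired.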

**EITHER** give this notebook access to your Drive:
run the next cell. It will ask for permission to access your Drive and copy the archive from your Drive to the virtual environment.

Make sure `original.zip` is placed directly in the main directory of your Drive. If not, change the path in the cell.

Example:
`!cp /content/drive/MyDrive/your/actual/path/original.zip /content`

In [9]:
# Step 1: Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

# Step 2: Copy the zip file from Google Drive to the Colab environment
!cp /content/drive/MyDrive/original.zip /content

Mounted at /content/drive


__OR__ upload the *original.zip* archive containing the transcriptions and the recordings directly here.

In [None]:
from google.colab import files
uploaded = files.upload()

Decompress original.zip into /content/xhosa

In [None]:
!mkdir /content/xhosa
!unzip /content/original.zip -d /content/xhosa

The conversion scripts used below come from the `rfclara/fa_xhosa` repository cloned (or uploaded) earlier.

### Converting the .xlsx transcriptions to .txt

In [None]:
!python /content/fa_xhosa/extract_transcriptions.py /content/xhosa/original /content/xhosa/ready

**OPTIONAL**

REMOVING COMMENTS BETWEEN ANGLE BRACKETS, e.g. \<code-switching>, \<laugh>


In [None]:
%%shell
for f in /content/xhosa/ready/*.txt; do
  python /content/fa_xhosa/remove_comments.py "$f"
done
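
For readers who want to see what such a cleaning step might look like, here is a minimal regex-based sketch. It is not the content of `remove_comments.py`, which is not shown in this notebook; the name `strip_comments` and the exact pattern are assumptions.

```python
import re

# One comment in angle brackets, e.g. "<laugh>", plus the spaces around it
COMMENT = re.compile(r"\s*<[^<>]*>\s*")

def strip_comments(line):
    """Replace each <...> comment with a single space and trim the line."""
    return COMMENT.sub(" ", line).strip()

print(strip_comments("ndiyakutshela <laugh> kaloku"))  # ndiyakutshela kaloku
```

Other markup in the transcriptions, such as `{e}`, is left untouched by this pattern.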

The following cell displays the first 10 lines of the first transcription. Make sure it shows the expected result before continuing. All transcriptions can be found in /content/xhosa/ready

In [13]:
!head -n 10 $(ls /content/xhosa/ready/*.txt | head -n 1)

ungabuza kaloku
so  phezolo u+ ee bekusenzeka ni o+ pha kwaMashezi
benikhona ?
Yho yho into {e}be ipha
ibengumyadala
Eeh
ndiyakutshela
ngoba kaloku   ee besithenjiswe kuthwa kuzofika
akutshiwongo ukuba kuzofika umlungu
kuthwe  kuza abelungu   abesuka ePitoli


### Resampling the audios  <a name="resampling"></a>

After obtaining the audio files, we need to resample them. Many modern speech models, including MMS, expect audio sampled at *16 kHz* (16,000 samples per second). We will use `ffmpeg` to resample the audio to 16 kHz, downmix it to mono, and save the result as `.wav` files.

The resampled wav files are saved into the `ready` directory, next to the extracted .txt transcriptions.

In [None]:
%%shell
for f in /content/xhosa/original/*.wav; do
  filename="$(basename "$f")"
  directory="$(dirname "$f")"
  stem=${filename%.*}
  ffmpeg -i "$f" -ac 1 -ar 16000 "/content/xhosa/ready/${stem}.wav" ;
done
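
To confirm that the conversion worked, the WAV headers can be inspected with Python's standard `wave` module. This is only a sketch: the helper names `wav_info` and `is_ready_for_alignment` are hypothetical, and `wave` only reads uncompressed PCM WAV files, which is assumed to be what `ffmpeg` produced here.

```python
import wave
from pathlib import Path

def wav_info(path):
    """Return (sample_rate_hz, channels) read from a PCM WAV header."""
    with wave.open(str(path), "rb") as w:
        return w.getframerate(), w.getnchannels()

def is_ready_for_alignment(path, rate=16000, channels=1):
    """True if the file is 16 kHz mono (the format produced by the cell above)."""
    return wav_info(path) == (rate, channels)

# Report any file in the notebook's `ready` folder that is not 16 kHz mono
ready = Path("/content/xhosa/ready")
if ready.is_dir():
    for wav_path in sorted(ready.glob("*.wav")):
        if not is_ready_for_alignment(wav_path):
            print("Unexpected format:", wav_path.name, wav_info(wav_path))
```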

## Neural Forced Alignment  <a name="aligner"></a>

Using [torchaudio.functional.forced_align()](https://pytorch.org/audio/stable/generated/torchaudio.functional.forced_align.html#torchaudio-functional-forced-align), the following cell will automatically align each line of the transcription with its corresponding time in the audio file. This step may take minutes or hours depending on the length of the corpus.

In [15]:
%%shell
input_folder=/content/xhosa/ready
output_folder=/content/xhosa/aligned
cd /content/fairseq/
# Variables are quoted so that filenames containing spaces are handled safely
for audio in "$input_folder"/*.wav; do
  filename="$(basename "$audio")"
  stem=${filename%.*}
  output_path="$output_folder/$stem"
  rm -rf "$output_path"
  python -m examples.mms.data_prep.align_and_segment \
  --audio_filepath "$input_folder/$stem.wav" \
  --text_filepath "$input_folder/$stem.txt" \
  --lang xho \
  --outdir "$output_path" \
  --uroman /content/uroman/bin
done

Using torch version: 2.4.0.dev20240523+cu118
Using torchaudio version: 2.2.0.dev20240523+cu118
Using device:  cpu
Read 24 lines from /content/xhosa/ready/BLN150925D_b.txt
Downloading model and dictionary...
100% 1.18G/1.18G [00:08<00:00, 151MB/s]
100% 79.0/79.0 [00:00<00:00, 52.2kB/s]
Using torch version: 2.4.0.dev20240523+cu118
Using torchaudio version: 2.2.0.dev20240523+cu118
Using device:  cpu
Read 29 lines from /content/xhosa/ready/BLN150925M_b.txt
Downloading model and dictionary...
Model path already exists. Skipping downloading....
Dictionary path already exists. Skipping downloading....
Using torch version: 2.4.0.dev20240523+cu118
Using torchaudio version: 2.2.0.dev20240523+cu118
Using device:  cpu
Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/content/fairseq/examples/mms/dat



The following cells, which zip and save the aligned chunks, are optional. If you are not interested in the chunks or .json files, you can skip them; you will still be able to download the TextGrids later.

In [16]:
# Zip the 'aligned' folder into 'aligned.zip'
# (absolute paths are used, so no `cd` is needed)
!zip -r /content/xhosa/aligned.zip /content/xhosa/aligned

  adding: content/xhosa/aligned/ (stored 0%)
  adding: content/xhosa/aligned/BU160331M_a/ (stored 0%)
  adding: content/xhosa/aligned/BU160331M_a/segment2.flac (deflated 0%)
  adding: content/xhosa/aligned/BU160331M_a/segment6.flac (deflated 0%)
  adding: content/xhosa/aligned/BU160331M_a/manifest.json (deflated 69%)
  adding: content/xhosa/aligned/BU160331M_a/segment1.flac (deflated 0%)
  adding: content/xhosa/aligned/BU160331M_a/segment4.flac (deflated 0%)
  adding: content/xhosa/aligned/BU160331M_a/segment3.flac (deflated 0%)
  adding: content/xhosa/aligned/BU160331M_a/segment5.flac (deflated 0%)
  adding: content/xhosa/aligned/BU160331M_a/segment0.flac (deflated 0%)
  adding: content/xhosa/aligned/BLN150925D_b/ (stored 0%)
  adding: content/xhosa/aligned/BLN150925D_b/segment8.flac (stored 0%)
  adding: content/xhosa/aligned/BLN150925D_b/segment2.flac (stored 0%)
  adding: content/xhosa/aligned/BLN150925D_b/segment6.flac (stored 0%)
  adding: content/xhosa/aligned/BLN150925D_b/segme

**SAVE** the zipped `aligned` folder, containing all the chunks and manifest.json (the file containing the timestamps), **to your Drive.**

In [17]:
# Copy aligned.zip into your Drive (Drive must already be mounted)
!cp /content/xhosa/aligned.zip /content/drive/MyDrive/aligned.zip

**OR**

**SAVE** the zipped `aligned` folder, containing all the chunks and manifest.json (the file containing the timestamps), **to your local device.**

In [18]:
from google.colab import files
# Download the zip file
files.download('/content/xhosa/aligned.zip')


Note: aligned.zip decompresses into content -> xhosa -> aligned, not into aligned directly.
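
If you would rather have an archive that unpacks straight into `aligned/`, one option is to build it from Python with `shutil.make_archive`, setting `root_dir` to the parent folder and `base_dir` to the folder name. This is a sketch of an alternative, not what the cell above does; the helper name `zip_folder` is hypothetical.

```python
import shutil
from pathlib import Path

def zip_folder(folder, archive_base):
    """Zip `folder` so that the archive unpacks to `<folder name>/...` only.

    archive_base is the archive path without the ".zip" suffix.
    Returns the path of the created archive.
    """
    folder = Path(folder)
    return shutil.make_archive(
        str(archive_base), "zip",
        root_dir=str(folder.parent),  # paths inside the archive are relative to this
        base_dir=folder.name,         # only this folder is included
    )

# Paths used in this notebook; skipped if the alignment has not been run
aligned = Path("/content/xhosa/aligned")
if aligned.is_dir():
    print(zip_folder(aligned, "/content/xhosa/aligned_flat"))
```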

## Converting the aligned transcriptions to .TextGrid
This lets you open the alignments in Praat or convert them to .eaf or any other compatible format.

In [None]:
!pip install textgrid

In [None]:
%%shell
input_directory="/content/xhosa/aligned/"
output_directory="/content/xhosa/textgrids/"  # All output files will be saved here

# Ensure the output directory exists
mkdir -p "$output_directory"

# Find all 'manifest.json' files under the input_directory
find "$input_directory" -type f -name "manifest.json" | while IFS= read -r manifest; do
  # Extract the directory of the manifest and the name of the subdirectory containing the manifest
  manifest_directory=$(dirname "$manifest")
  subdirectory_name=$(basename "$manifest_directory")

  # Construct the output file path using the subdirectory name for uniqueness
  output_file_path="${output_directory}${subdirectory_name}.TextGrid"

  # Call the python script with the manifest and output file path
  python /content/fa_xhosa/json_to_textgrid.py "$manifest" "$output_file_path"

  echo "Processed $manifest into $output_file_path"
done
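
As a rough illustration of what the conversion involves, the sketch below turns manifest lines into `(start, end, text)` intervals, which is the information a TextGrid interval tier needs. It is not the content of `json_to_textgrid.py`. It assumes that each line of `manifest.json` is one JSON object with the keys `audio_start_sec`, `duration` and `text`; check one of your own manifests before relying on it. The function name `manifest_to_intervals` is hypothetical.

```python
import json

def manifest_to_intervals(lines):
    """Turn manifest lines into (start_sec, end_sec, text) tuples.

    Assumes one JSON object per line with the keys
    "audio_start_sec", "duration" and "text".
    """
    intervals = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        segment = json.loads(line)
        start = float(segment["audio_start_sec"])
        end = start + float(segment["duration"])
        intervals.append((round(start, 3), round(end, 3), segment["text"]))
    return intervals

# Made-up manifest lines, for illustration only
example = [
    '{"audio_start_sec": 0.0, "duration": 1.5, "text": "ungabuza kaloku"}',
    '{"audio_start_sec": 1.5, "duration": 0.75, "text": "benikhona ?"}',
]
print(manifest_to_intervals(example))
# [(0.0, 1.5, 'ungabuza kaloku'), (1.5, 2.25, 'benikhona ?')]
```

With a real file, the function can be called on the open file object, for example `with open(manifest_path) as f: intervals = manifest_to_intervals(f)`.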


TODO: add the speaker and annotations from the original Excel files to the TextGrids (as different tiers)

In [None]:
!zip -r /content/xhosa/textgrids.zip /content/xhosa/textgrids

**SAVE** the TextGrid files to your Drive.

In [None]:
!cp /content/xhosa/textgrids.zip /content/drive/MyDrive/textgrids.zip

**Download** the TextGrid files to your computer.

In [None]:
from google.colab import files
files.download('/content/xhosa/textgrids.zip')

This notebook is inspired by [this](https://colab.research.google.com/github/cawoylel/nlp4all/blob/main/asr/src/asr_tutorial.ipynb) tutorial from [_NLP4ALL_](https://github.com/cawoylel/nlp4all), which focuses on simplifying the process of building NLP models for underrepresented languages and making it more accessible: "We aim to provide a replicable framework that communities can adapt for their languages, aligning with our vision of making NLP technology widely accessible."