<a href="https://colab.research.google.com/github/rfclara/fa_xhosa/blob/main/xhosa_forced_alignement.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Aligning transcriptions and annotations - Xhosa corpus



# Introduction

The purpose of this notebook is to align the interlinear glosses of a transcribed Xhosa corpus with the audio. Xhosa is one of the official languages of South Africa and Zimbabwe.

The transcriptions in this corpus are not aligned with the speech. We will use [CTC forced alignment](https://pytorch.org/audio/main/tutorials/ctc_forced_alignment_api_tutorial.html) to cut the recordings into small chunks and get the timestamps corresponding to their transcriptions.
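
To make the idea concrete, here is a toy, pure-Python sketch of CTC forced alignment. Given per-frame log-probabilities and a known token sequence, it searches for the most likely frame-by-frame path through the transcript. The function name `ctc_forced_align`, the three-symbol vocabulary and the probabilities are all invented for illustration; the notebook itself relies on the MMS scripts, not on this code.

```python
import math

def ctc_forced_align(log_probs, targets, blank=0):
    """Return the best label (token id or blank) for every frame.

    log_probs: list of frames, each a list of log-probabilities per symbol.
    targets:   the known transcript as a list of token ids.
    """
    # Interleave blanks around the tokens: [blank, y1, blank, y2, ..., blank]
    ext = [blank]
    for tok in targets:
        ext += [tok, blank]
    n_frames, n_states = len(log_probs), len(ext)

    neg_inf = -math.inf
    score = [[neg_inf] * n_states for _ in range(n_frames)]
    back = [[0] * n_states for _ in range(n_frames)]

    # A valid path starts either on the first blank or on the first token
    score[0][0] = log_probs[0][ext[0]]
    if n_states > 1:
        score[0][1] = log_probs[0][ext[1]]

    for t in range(1, n_frames):
        for s in range(n_states):
            candidates = [(score[t - 1][s], s)]          # stay on the same state
            if s >= 1:
                candidates.append((score[t - 1][s - 1], s - 1))  # advance by one
            # Skipping a blank is allowed only between two different tokens
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                candidates.append((score[t - 1][s - 2], s - 2))
            best, prev = max(candidates)
            if best > neg_inf:
                score[t][s] = best + log_probs[t][ext[s]]
                back[t][s] = prev

    # A valid path ends either on the last blank or on the last token
    s = n_states - 1
    if n_states > 1 and score[-1][n_states - 2] > score[-1][s]:
        s = n_states - 2
    if score[-1][s] == neg_inf:
        raise ValueError("not enough frames to align the transcript")

    # Follow the back-pointers from the last frame to the first
    path = [blank] * n_frames
    for t in range(n_frames - 1, -1, -1):
        path[t] = ext[s]
        s = back[t][s]
    return path

# Made-up example: symbol 0 is the blank, 1 and 2 are two transcript tokens
probs = [
    [0.8, 0.1, 0.1],
    [0.1, 0.8, 0.1],
    [0.1, 0.8, 0.1],
    [0.1, 0.1, 0.8],
    [0.8, 0.1, 0.1],
]
log_probs = [[math.log(p) for p in frame] for frame in probs]
print(ctc_forced_align(log_probs, [1, 2]))
```

Each frame in the result carries the symbol it was aligned to, so the first and last frame of a token give its start and end time once frame indices are converted to seconds.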


Here, we follow the steps needed to prepare the data for training or fine-tuning a speech-to-text model.



<p align="center">
  <img src="https://github.com/cawoylel/nlp4all/blob/main/asr/illustrations/forced_aligner.png?raw=true" alt="Forced alignment illustration" width=500 class="center">
<br>
    <em>
    Illustration of the task of Forced Alignment, from nlp4all
    </em>
</p>

[MMS](https://github.com/facebookresearch/fairseq/blob/main/examples/mms/README.md) (Massively Multilingual Speech) provides a forced aligner built on a multilingual speech model trained on more than one thousand languages. You can check whether your language is included here: https://dl.fbaipublicfiles.com/mms/misc/language_coverage_mms.html




# Installing the dependencies in the virtual environment


In [None]:
!apt-get update
!apt-get install libsox-fmt-all sox ffmpeg # needed for processing audio
!apt install libicu-dev pkg-config # needed for processing text and unicode symbols

In [None]:
!pip uninstall torch torchaudio -y # we need to install the nightly version of torch
!pip install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu118
!pip install -q dataclasses
!pip install -q sox # for audio processing
!pip install -q ICU-Tokenizer # for tokenizing the text
!pip install -q datasets # we will use huggingface datasets for loading the training dataset
!pip install pandas
!pip install tensorboardX

In this step, we clone the repositories containing the code and resources we need: `isi-nlp/uroman`, which romanizes text data; `facebookresearch/fairseq`, which contains the MMS alignment script; and `rfclara/fa_xhosa`, which contains the helper scripts used in this notebook.

In [None]:
!git clone https://github.com/isi-nlp/uroman.git

In [None]:
%%shell
git clone https://github.com/facebookresearch/fairseq.git
cd fairseq
pip install --editable ./

In [None]:
!git clone https://github.com/rfclara/fa_xhosa

## Preparing the data
Getting the audio files and the transcriptions.

`original.zip` should decompress into a single folder called `original` containing the audio files and the transcriptions. Each audio file and its transcription must share the same name and differ only by extension (`.wav`, `.xlsx`).

Example:
`story1.wav`
`story1.xlsx`

**EITHER** give this notebook access to your Drive:
run the next cell. It will ask for permission to access your Drive, then copy the archive from your Drive to the virtual environment.

Make sure `original.zip` is placed directly in the main directory of your Drive. If it is not, change the path in the cell.

Example:
`!cp /content/drive/MyDrive/your/actual/path/original.zip /content`

In [None]:
#THIS CELL IS OPTIONAL
from google.colab import drive
drive.mount('/content/drive')
!cp /content/drive/MyDrive/original.zip /content

__OR__ upload `original.zip` directly here (Files > Upload).

In [None]:
!mkdir -p /content/xhosa
!unzip /content/original.zip -d /content/xhosa
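
Before converting anything, it can help to confirm that every recording has a matching spreadsheet. The sketch below is one way to do that with the standard library; the helper name `find_unpaired` is hypothetical and not part of the `fa_xhosa` repository.

```python
from pathlib import Path

def find_unpaired(folder, audio_ext=".wav", text_ext=".xlsx"):
    """Return (audio stems without a transcription, transcription stems without audio)."""
    folder = Path(folder)
    audio = {p.stem for p in folder.glob(f"*{audio_ext}")}
    text = {p.stem for p in folder.glob(f"*{text_ext}")}
    return sorted(audio - text), sorted(text - audio)

# Folder produced by the unzip step above
missing_text, missing_audio = find_unpaired("/content/xhosa/original")
print("Audio files without a transcription:", missing_text)
print("Transcriptions without an audio file:", missing_audio)
```

Both lists should be empty; any name that shows up needs to be renamed or completed before running the next steps.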

### Converting the .xlsx transcriptions to .txt

In [None]:
!python /content/fa_xhosa/extract_transcriptions.py /content/xhosa/original /content/xhosa/ready

The following cell will display the first 10 lines of the first transcription.

In [None]:
!head -n 10 $(ls /content/xhosa/ready/*.txt | head -n 1)

### Resampling the audios  <a name="resampling"></a>

Once the audio files are in place, we need to resample them. Many modern speech models expect audio sampled at 16 kHz (16 000 samples per second). We will use `ffmpeg` to convert each recording to 16 kHz mono and save the result as a `.wav` file.

The resampled files go into the `ready` directory, alongside the extracted `.txt` transcriptions.

In [None]:
%%shell
# Convert every original recording to 16 kHz mono and write it to the ready folder
for f in /content/xhosa/original/*.wav; do
  filename="$(basename "$f")"
  stem="${filename%.*}"  # file name without its extension
  ffmpeg -i "$f" -ac 1 -ar 16000 "/content/xhosa/ready/${stem}.wav"
done
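
To double-check the result, the resampled files can be inspected with Python's built-in `wave` module. This is a minimal sketch that assumes `ffmpeg` wrote plain PCM `.wav` files (its default for this extension); the helper name `wav_info` is made up for the example.

```python
import wave
from pathlib import Path

def wav_info(path):
    """Return (sample_rate_hz, n_channels, duration_seconds) for a PCM .wav file."""
    with wave.open(str(path), "rb") as w:
        rate = w.getframerate()
        return rate, w.getnchannels(), w.getnframes() / rate

# Flag any file that is not 16 kHz mono
for wav_path in sorted(Path("/content/xhosa/ready").glob("*.wav")):
    rate, channels, seconds = wav_info(wav_path)
    status = "ok" if (rate, channels) == (16000, 1) else "CHECK"
    print(f"{status}  {wav_path.name}: {rate} Hz, {channels} channel(s), {seconds:.1f} s")
```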

## Neural Forced Alignment  <a name="aligner"></a>

Using [torchaudio.functional.forced_align()](https://pytorch.org/audio/stable/generated/torchaudio.functional.forced_align.html#torchaudio-functional-forced-align), the following cell automatically aligns each line of the transcription with its corresponding time span in the audio file. This step may take anywhere from a few minutes to several hours, depending on the length of the corpus.

In [None]:
%%shell
input_folder=/content/xhosa/ready
output_folder=/content/xhosa/aligned
mkdir -p "$output_folder"
cd fairseq/
for audio in "$input_folder"/*.wav; do
  filename="$(basename "$audio")"
  stem="${filename%.*}"
  output_path="$output_folder/$stem"
  rm -rf "$output_path"  # start from a clean output folder for this file
  python -m examples.mms.data_prep.align_and_segment \
    --audio_filepath "$input_folder/$stem.wav" \
    --text_filepath "$input_folder/$stem.txt" \
    --lang xho \
    --outdir "$output_path" \
    --uroman /content/uroman/bin
done

The two following cells are optional. If you are not interested in the chunks or .json files, you can skip them and you will be able to download the TextGrids later.

OPTIONAL: Run the following cell to save the `aligned` folder to your Drive. It contains the chunks and one `manifest.json` (timestamps) for each original file you provided. This only works if you mounted your Drive earlier.

In [None]:
!zip -r /content/drive/MyDrive/aligned.zip /content/xhosa/aligned

OPTIONAL: Run the following cell to download the `aligned` folder, which contains the chunks and one `manifest.json` (timestamps) for each original file you provided.

In [None]:
!zip -r aligned.zip /content/xhosa/aligned
from google.colab import files
files.download('/content/aligned.zip')
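
If you want to look at the timestamps without leaving the notebook, the manifests can be read with a few lines of Python. This sketch assumes that each `manifest.json` holds one JSON object per line with `audio_start_sec`, `duration` and `text` fields; check one of your own files first, since the layout is an assumption here. The helper name `read_manifest` is hypothetical.

```python
import json
from pathlib import Path

def read_manifest(path):
    """Parse a JSON-lines manifest into (start_sec, end_sec, text) tuples."""
    segments = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # ignore blank lines
            entry = json.loads(line)
            start = float(entry["audio_start_sec"])  # assumed field name
            end = start + float(entry["duration"])   # assumed field name
            segments.append((start, end, entry["text"]))
    return segments

# Print the first three segments of every aligned recording
for manifest in sorted(Path("/content/xhosa/aligned").glob("*/manifest.json")):
    segments = read_manifest(manifest)
    print(manifest.parent.name, "-", len(segments), "segments")
    for start, end, text in segments[:3]:
        print(f"  {start:8.2f} - {end:8.2f}  {text}")
```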

## Converting the aligned transcriptions to .TextGrid
This lets you open the result in Praat, or convert it to .eaf or any other compatible format.

In [None]:
!pip install textgrid

In [None]:
%%shell
input_directory="/content/xhosa/aligned/"
output_directory="/content/xhosa/textgrids/"  # All output files will be saved here

# Ensure the output directory exists
mkdir -p "$output_directory"

# Find all 'manifest.json' files under the input_directory
find "$input_directory" -type f -name "manifest.json" | while IFS= read -r manifest; do
  # Extract the directory of the manifest and the name of the subdirectory containing the manifest
  manifest_directory=$(dirname "$manifest")
  subdirectory_name=$(basename "$manifest_directory")

  # Construct the output file path using the subdirectory name for uniqueness
  output_file_path="${output_directory}${subdirectory_name}.TextGrid"

  # Call the python script with the manifest and output file path
  python /content/fa_xhosa/json_to_textgrid.py "$manifest" "$output_file_path"

  echo "Processed $manifest into $output_file_path"
done
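
For readers curious about what such a conversion involves, here is an independent sketch, not the code of `json_to_textgrid.py`. A Praat interval tier must cover the whole recording without holes, so the gaps between aligned segments are filled with empty intervals before the tier is written out in Praat's long text format. The function names, the tier name and the example words are made up, and the segments are assumed to be sorted and non-overlapping.

```python
def fill_gaps(segments, total_duration):
    """Turn sorted, non-overlapping (start, end, text) segments into
    contiguous intervals covering 0..total_duration."""
    intervals, cursor = [], 0.0
    for start, end, text in segments:
        if start > cursor:
            intervals.append((cursor, start, ""))  # silence or unaligned audio
        intervals.append((start, end, text))
        cursor = end
    if cursor < total_duration:
        intervals.append((cursor, total_duration, ""))
    return intervals

def to_textgrid(intervals, tier_name="transcription"):
    """Render contiguous intervals as a one-tier TextGrid in Praat's long text format."""
    xmax = intervals[-1][1]
    lines = [
        'File type = "ooTextFile"',
        'Object class = "TextGrid"',
        "",
        "xmin = 0",
        f"xmax = {xmax}",
        "tiers? <exists>",
        "size = 1",
        "item []:",
        "    item [1]:",
        '        class = "IntervalTier"',
        f'        name = "{tier_name}"',
        "        xmin = 0",
        f"        xmax = {xmax}",
        f"        intervals: size = {len(intervals)}",
    ]
    for i, (start, end, text) in enumerate(intervals, 1):
        escaped = text.replace('"', '""')  # double quotes are doubled inside strings
        lines += [
            f"        intervals [{i}]:",
            f"            xmin = {start}",
            f"            xmax = {end}",
            f'            text = "{escaped}"',
        ]
    return "\n".join(lines) + "\n"

# Made-up example: two aligned segments in a 3-second recording
example = fill_gaps([(0.5, 1.0, "molo"), (1.0, 2.0, "unjani")], total_duration=3.0)
print(to_textgrid(example))
```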


Save the TextGrid files to your Drive.

In [None]:
!zip -r /content/drive/MyDrive/textgrids.zip /content/xhosa/textgrids

Download the TextGrid files to your computer.

In [None]:
!zip -r textgrids.zip /content/xhosa/textgrids
from google.colab import files
files.download('/content/textgrids.zip')

# Challenges and future directions <a name="challenges"></a>

## Robustness <a name="robust"></a>

This notebook is inspired by [this](https://colab.research.google.com/github/cawoylel/nlp4all/blob/main/asr/src/asr_tutorial.ipynb) tutorial from [_NLP4ALL_](https://github.com/cawoylel/nlp4all), which focuses on simplifying the process of building NLP models for underrepresented languages and making it more accessible: "We aim to provide a replicable framework that communities can adapt for their languages, aligning with our vision of making NLP technology widely accessible."