<a href="https://colab.research.google.com/github/rfclara/fa_xhosa/blob/main/xhosa_forced_alignement.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Aligning transcriptions and annotations - Xhosa corpus



# Introduction

The purpose of this notebook is to align interlinear glosses with the audio of a transcribed corpus in Xhosa, one of the official languages of South Africa and Zimbabwe.

The transcriptions of this corpus are not aligned with the speech. We will use [CTC forced alignment](https://pytorch.org/audio/main/tutorials/ctc_forced_alignment_api_tutorial.html) to cut the recordings into small chunks and get the timestamps corresponding to their transcriptions.


Here, we will follow the necessary steps to prepare the data and automatically assign time stamps to each sentence.



<p align="center">
  <img src="https://github.com/cawoylel/nlp4all/blob/main/asr/illustrations/forced_aligner.png?raw=true" alt="transformer" width=500 class="center">
<br>
    <em>
    Illustration of the forced alignment task, from nlp4all
    </em>
</p>

[MMS](https://github.com/facebookresearch/fairseq/blob/main/examples/mms/README.md) provides a forced aligner built on a multilingual speech model trained on more than one thousand languages. You can check here whether your language is included: https://dl.fbaipublicfiles.com/mms/misc/language_coverage_mms.html




# Installing the dependencies in the virtual environment


In [None]:
!apt-get install libsox-fmt-all sox # needed for processing audio
!apt-get install -y ffmpeg
!apt install libicu-dev pkg-config # needed for processing text and unicode symbols

In [None]:
!pip uninstall torch torchaudio -y # we need to install the nightly version of torch
!pip install --pre torch torchaudio --index-url https://download.pytorch.org/whl/nightly/cu118

!pip install -q sox # for audio processing
!pip install -q ICU-Tokenizer # for tokenizing the text
!pip install pandas
!pip install tensorboardX

In this step, we clone the repositories containing the code and resources needed for the alignment: the `isi-nlp/uroman` repository, which romanizes text data; the `facebookresearch/fairseq` repository, which contains the MMS alignment script; and the `rfclara/fa_xhosa` repository, which holds the helper scripts used in this notebook.

In [None]:
!git clone https://github.com/isi-nlp/uroman.git

In [None]:
!git clone https://github.com/facebookresearch/fairseq.git
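# `!cd` runs in a throwaway subshell and has no lasting effect; the alignment cell changes into /content/fairseq itself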
#!pip install --editable ./

In [None]:
import os

In [None]:
!git clone https://github.com/rfclara/fa_xhosa
!mkdir -p /content/fa_xhosa  # -p: no error if the clone already created the folder
os.chdir("/content/fa_xhosa")
# manually uploading the Python files until I set the repository public
#from google.colab import files
#uploaded = files.upload()

## Preparing the data
Getting the audio files and the transcriptions.
Before continuing, put every audio file and transcription into a folder named `original` and compress it into `original.zip`. I recommend saving the archive to your Drive.

`original.zip` should decompress into one folder called `original` containing the audio files and the transcriptions. Each audio file and its transcription must share the same filename and differ only by extension (.wav, .xlsx).

Example:

- `story_1.wav`
- `story_1.xlsx`
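
Before zipping, it can help to confirm that every recording has a matching transcription. The following is a minimal sketch using only the standard library; the helper name `find_unpaired` is made up for this illustration and is not part of the `fa_xhosa` scripts.

```python
from pathlib import Path


def find_unpaired(folder):
    """Return (stems with .wav but no .xlsx, stems with .xlsx but no .wav)."""
    folder = Path(folder)
    wav_stems = {p.stem for p in folder.glob("*.wav")}
    xlsx_stems = {p.stem for p in folder.glob("*.xlsx")}
    # Set differences give the files that lack a partner
    return sorted(wav_stems - xlsx_stems), sorted(xlsx_stems - wav_stems)
```

Calling `find_unpaired("original")` on your local folder should return two empty lists when every file is paired. Note that the match is case-sensitive on most systems, so `Story_1.wav` and `story_1.xlsx` would be reported as unpaired.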

(FASTER) **EITHER** give this notebook access to your Drive:
run the next cell. It will ask for permission to access your Drive and will copy the archive from your Drive to the virtual environment.

Make sure `original.zip` is placed directly in the main directory of your Drive.

In [None]:
# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

# Copy the zip file from Google Drive to the Colab environment
!cp /content/drive/MyDrive/original.zip /content

__OR__ upload the *original.zip* archive containing the transcriptions and the recordings directly here (the next cell will ask you to browse for the file).

In [None]:
from google.colab import files
uploaded = files.upload()

Decompress `original.zip` into /content/xhosa
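
If you are unsure whether the archive has the expected layout, you can inspect it before extracting. This is a small sketch with the standard `zipfile` module; `top_level_entries` is a hypothetical helper name, and the path in the usage note is the one used in this notebook.

```python
import zipfile


def top_level_entries(zip_path):
    """Return the sorted set of top-level names inside a zip archive."""
    with zipfile.ZipFile(zip_path) as archive:
        # Keep only the first path component of every entry
        return sorted({name.split("/", 1)[0] for name in archive.namelist() if name})
```

For a well-formed archive, `top_level_entries("/content/original.zip")` should contain `original`. Archives created on macOS often carry an extra `__MACOSX` entry, which is harmless here.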

In [None]:
!mkdir /content/xhosa
!unzip /content/original.zip -d /content/xhosa

### Extracting the transcriptions from the Excel files
---



In [None]:
!python /content/fa_xhosa/extract_transcriptions.py /content/xhosa/original /content/xhosa/ready

The last line of the previous cell's output should display the number of files that were processed correctly.

**OPTIONAL**
(but probably leads to a better alignment)

Removing comments between angle brackets, e.g. \<code-switching>, \<laugh>


In [None]:
%%shell
for f in /content/xhosa/ready/*.txt; do
  python /content/fa_xhosa/remove_comments.py "$f"
done
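
To give an idea of what the comment-removal step above does to a line, here is a minimal regex sketch. It is an illustration only and not necessarily how `remove_comments.py` is implemented; the sample sentence is made up.

```python
import re

# One angle-bracket annotation such as <laugh>, plus any whitespace before it
TAG = re.compile(r"\s*<[^<>]*>")


def strip_bracket_comments(line):
    """Remove <...> annotations and tidy the leftover whitespace."""
    cleaned = TAG.sub("", line)
    return re.sub(r"\s+", " ", cleaned).strip()


print(strip_bracket_comments("Molo <laugh> unjani <code-switching> namhlanje"))
# prints: Molo unjani namhlanje
```

Removing these annotations matters because the aligner would otherwise try to find audio for words like "laugh" that were never spoken.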

The following cell displays the last 10 lines of the first transcription. Make sure it shows the expected result before continuing. All transcriptions can be found in /content/xhosa/ready and can be opened by double-clicking on them.

In [None]:
!tail -n 10 $(ls /content/xhosa/ready/*.txt | head -n 1)

### Resampling the audios  <a name="resampling"></a>

Once the audio files are in place, we need to resample them. Many modern speech models expect audio sampled at *16 kHz* (16,000 samples per second). We will use `ffmpeg` to convert each recording to 16 kHz mono and save the result as a `.wav` file.

The resampled files go into the `ready` directory, next to the extracted `.txt` transcriptions.

In [None]:
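import os

# Convert every .wav in `original` to 16 kHz mono and write it to `ready`
for file in os.listdir("/content/xhosa/original"):
    if file.endswith(".wav"):
        input_path = f"/content/xhosa/original/{file}"
        output_path = f"/content/xhosa/ready/{os.path.splitext(file)[0]}.wav"
        # -ac 1: mono, -ar 16000: 16 kHz, -y: overwrite an existing output without prompting
        !ffmpeg -y -i "{input_path}" -ac 1 -ar 16000 "{output_path}"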
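
To double-check the result of the resampling, the standard `wave` module can read the header of each output file. The sketch below assumes uncompressed PCM `.wav` files (what `ffmpeg` writes by default for this extension); the helper names are made up for this illustration.

```python
import wave


def wav_format(path):
    """Return (channels, sample_rate_hz, duration_sec) for a PCM .wav file."""
    with wave.open(str(path), "rb") as wav:
        channels = wav.getnchannels()
        rate = wav.getframerate()
        duration = wav.getnframes() / rate
    return channels, rate, duration


def is_ready_for_alignment(path):
    """True if the file is mono and sampled at 16 kHz."""
    channels, rate, _ = wav_format(path)
    return channels == 1 and rate == 16000
```

Running `is_ready_for_alignment` on each file in `/content/xhosa/ready` should return `True` for all of them.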


## Neural Forced Alignment  <a name="aligner"></a>

Using [torchaudio.functional.forced_align()](https://pytorch.org/audio/stable/generated/torchaudio.functional.forced_align.html#torchaudio-functional-forced-align), the following cell automatically aligns each line of the transcription with its corresponding time span in the audio file. This step may take from a few minutes to several hours, depending on the length of the corpus.
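
For readers curious about what happens inside a CTC forced aligner, here is a toy pure-Python sketch of the underlying Viterbi search. It is not torchaudio's implementation: it works on a tiny hand-made table of per-frame log-probabilities, and the function names and the example values are invented for illustration.

```python
import math


def ctc_forced_align(log_probs, targets, blank=0):
    """Best frame-by-frame labelling of `targets` over non-empty `log_probs`.

    log_probs[t][c] is the log-probability of class c at frame t.
    Returns one class id per frame (blank or a target token).
    """
    ext = [blank]
    for token in targets:
        ext += [token, blank]  # blank, y1, blank, y2, ..., blank
    T, S = len(log_probs), len(ext)
    neg = -math.inf
    score = [[neg] * S for _ in range(T)]
    back = [[0] * S for _ in range(T)]
    score[0][0] = log_probs[0][blank]
    if S > 1:
        score[0][1] = log_probs[0][ext[1]]
    for t in range(1, T):
        for s in range(S):
            candidates = [(score[t - 1][s], s)]  # stay on the same symbol
            if s >= 1:
                candidates.append((score[t - 1][s - 1], s - 1))  # advance by one
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                candidates.append((score[t - 1][s - 2], s - 2))  # skip a blank
            best, prev = max(candidates)
            if best > neg:
                score[t][s] = best + log_probs[t][ext[s]]
                back[t][s] = prev
    # A valid path ends on the last token or on the final blank
    s = S - 1
    if S > 1 and score[T - 1][S - 2] > score[T - 1][S - 1]:
        s = S - 2
    if score[T - 1][s] == neg:
        raise ValueError("not enough frames to fit the target sequence")
    path = [blank] * T
    for t in range(T - 1, -1, -1):
        path[t] = ext[s]
        s = back[t][s]
    return path


def spans_from_path(path, blank=0):
    """Collapse a frame-level path into (token, start_frame, end_frame) spans."""
    spans, start = [], 0
    for t in range(1, len(path) + 1):
        if t == len(path) or path[t] != path[start]:
            if path[start] != blank:
                spans.append((path[start], start, t))
            start = t
    return spans


# Toy example: 4 frames, classes 0 = blank, 1 and 2 = two tokens
probs = [[0.1, 0.8, 0.1], [0.1, 0.8, 0.1], [0.8, 0.1, 0.1], [0.1, 0.1, 0.8]]
toy_log_probs = [[math.log(p) for p in frame] for frame in probs]
toy_path = ctc_forced_align(toy_log_probs, [1, 2])
print(toy_path)                   # [1, 1, 0, 2]
print(spans_from_path(toy_path))  # [(1, 0, 2), (2, 3, 4)]
```

Multiplying the frame indices by the duration of one model frame turns these spans into timestamps. The real aligner does the same thing over thousands of frames, with log-probabilities produced by the speech model.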

In [None]:
%%shell
input_folder=/content/xhosa/ready
output_folder=/content/xhosa/aligned
cd /content/fairseq/
# Quoting the variables keeps filenames with spaces intact
for audio in "$input_folder"/*.wav; do
  filename="$(basename "$audio")"
  stem="${filename%.*}"
  output_path="$output_folder/$stem"
  # Start from a clean output folder for this recording
  rm -rf "$output_path"
  python -m examples.mms.data_prep.align_and_segment \
  --audio_filepath "$input_folder/$stem.wav" \
  --text_filepath "$input_folder/$stem.txt" \
  --lang xho \
  --outdir "$output_path" \
  --uroman /content/uroman/uroman
done

The `aligned` folder now contains one subfolder per original audio file, each holding a `manifest.json` (timestamps) and every audio chunk in .flac format. The cells at the end of this notebook let you save or download this folder.
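
If you want to look at the timestamps yourself, the manifest can be read with a few lines of Python. This sketch assumes that the manifest holds one JSON object per line with the fields `audio_start_sec`, `duration` and `text`; please open one of your own manifests to confirm the field names before relying on it. The helper name `read_manifest` is made up.

```python
import json


def read_manifest(path):
    """Return a list of (start_sec, end_sec, text), one per aligned line."""
    segments = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            entry = json.loads(line)
            # Field names below are an assumption about the manifest layout
            start = float(entry["audio_start_sec"])
            end = start + float(entry["duration"])
            segments.append((start, end, entry["text"]))
    return segments
```

Each tuple gives the start and end time, in seconds, of one line of the transcription.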

## Converting the aligned transcriptions to .TextGrid
The resulting TextGrid files can be opened in Praat or converted to .eaf or any other compatible format.
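
For reference, a TextGrid is a plain-text file listing time intervals and their labels. The sketch below writes a single interval tier by hand in Praat's long text format, filling the silences between sentences with empty intervals. It is only an illustration of the format, not the code of `excel_to_textgrid.py`, and it assumes the segments do not overlap; the function names and the tier name are made up.

```python
def fill_gaps(segments, total_duration):
    """Add empty intervals so the tier covers 0..total_duration without holes."""
    filled, cursor = [], 0.0
    for start, end, text in sorted(segments):
        if start > cursor:
            filled.append((cursor, start, ""))  # silence before this segment
        filled.append((start, end, text))
        cursor = end
    if cursor < total_duration:
        filled.append((cursor, total_duration, ""))  # trailing silence
    return filled


def to_textgrid(segments, total_duration, tier_name="sentences"):
    """Render (start, end, text) segments as a one-tier TextGrid string."""
    intervals = fill_gaps(segments, total_duration)
    lines = [
        'File type = "ooTextFile"',
        'Object class = "TextGrid"',
        "",
        "xmin = 0",
        f"xmax = {total_duration}",
        "tiers? <exists>",
        "size = 1",
        "item []:",
        "    item [1]:",
        '        class = "IntervalTier"',
        f'        name = "{tier_name}"',
        "        xmin = 0",
        f"        xmax = {total_duration}",
        f"        intervals: size = {len(intervals)}",
    ]
    for index, (start, end, text) in enumerate(intervals, start=1):
        escaped = text.replace('"', '""')  # double quotes are escaped by doubling
        lines += [
            f"        intervals [{index}]:",
            f"            xmin = {start}",
            f"            xmax = {end}",
            f'            text = "{escaped}"',
        ]
    return "\n".join(lines) + "\n"
```

In practice, the `textgrid` and `praatio` packages installed below take care of these details, so this is mainly useful for understanding what the output files contain.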

In [None]:
!pip install textgrid

### Adding timestamps to the Excel files

First, rename each `manifest.json` after its original recording and move it to the top of the `aligned` directory.

In [None]:
%%shell
input_directory="/content/xhosa/aligned/"

find "$input_directory" -type f -name "manifest.json" | while IFS= read -r manifest; do
  # Extract the directory of the manifest and the name of the subdirectory containing the manifest
  manifest_directory=$(dirname "$manifest")
  subdirectory_name=$(basename "$manifest_directory")

  # Move and rename the manifest file
  mv "$manifest" "/content/xhosa/aligned/$subdirectory_name.json"
done

In [None]:
%%shell
excel_directory="/content/xhosa/original"
aligned_directory="/content/xhosa/aligned"
for excel_file in "$excel_directory"/*.xlsx; do
  # Extract the filename without extension
  base_name=$(basename "$excel_file" .xlsx)
  # Construct the corresponding JSON file path
  json_file="$aligned_directory/$base_name.json"
  # Write a copy of the Excel file with timestamps into the aligned directory
  python3 /content/fa_xhosa/add_times_to_excel.py "$excel_file" "$json_file" "$aligned_directory/$base_name.xlsx"
done

In [None]:
!pip install praatio

In [None]:
%%shell
input_directory="/content/xhosa/aligned/"
output_directory="/content/xhosa/aligned/"

# Find all aligned excel files under the input_directory
find "$input_directory" -type f -name "*.xlsx" | while IFS= read -r excel_file; do
  # Extract the filename without extension
  base_name=$(basename "$excel_file" .xlsx)
  # Construct the corresponding output file path
  output_file_path="$output_directory/$base_name.TextGrid"
  # Call the Python script with the Excel file and the output file path
  python /content/fa_xhosa/excel_to_textgrid.py "$excel_file" -o "$output_file_path"

  echo "Processed $excel_file into $output_file_path"
done

In [None]:
!zip -r /content/xhosa/aligned.zip /content/xhosa/aligned

**SAVE** the aligned files to your Drive (this requires that Drive was mounted earlier in the notebook).

Note: `aligned.zip` decompresses into content -> xhosa -> aligned, not directly into aligned.
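
If you would rather have an archive that unpacks straight into `aligned`, one option is to build it from Python with `shutil.make_archive`. This is a sketch, and `zip_folder_flat` is a hypothetical helper name.

```python
import shutil
from pathlib import Path


def zip_folder_flat(folder, archive_stem):
    """Zip `folder` so that the archive's top-level entry is the folder itself.

    `archive_stem` is the output path without the .zip extension.
    Returns the path of the created archive.
    """
    folder = Path(folder).resolve()
    return shutil.make_archive(
        str(archive_stem),
        "zip",
        root_dir=str(folder.parent),  # paths in the archive are relative to this
        base_dir=folder.name,         # only this folder is archived
    )
```

For example, `zip_folder_flat("/content/xhosa/aligned", "/content/xhosa/aligned")` would write `/content/xhosa/aligned.zip`, which decompresses into a single `aligned` folder.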

In [None]:
!cp /content/xhosa/aligned.zip /content/drive/MyDrive/aligned.zip

**Download** the aligned files to your computer.

In [None]:
from google.colab import files
files.download('/content/xhosa/aligned.zip')