<a href="https://colab.research.google.com/github/rfclara/fa_xhosa/blob/main/xhosa_forced_alignement.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Aligning transcriptions and annotations - Xhosa corpus



# Introduction

The purpose of this notebook is to align interlinear glosses with audio for a transcribed corpus in Xhosa, one of the official languages of South Africa and Zimbabwe.

The transcriptions in this corpus are not aligned with the speech. We will use [CTC forced alignment](https://pytorch.org/audio/main/tutorials/ctc_forced_alignment_api_tutorial.html) to cut the recordings into small chunks and get the timestamps corresponding to their transcriptions.


Here, we will follow the necessary steps to prepare the data and automatically assign time stamps to each sentence.
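
To make the idea concrete, here is a minimal pure-Python sketch of the dynamic programming behind CTC forced alignment. It is a toy illustration, not the torchaudio or MMS implementation: the function names `ctc_forced_align` and `token_spans`, the label ids, and the emission probabilities are all invented for this example. Given per-frame log-probabilities and the known label sequence, it finds the most likely frame-by-frame path that spells that sequence, then collapses the path into first and last frame indices per label.

```python
import math


def ctc_forced_align(log_probs, tokens, blank=0):
    """Return the most likely per-frame label path that spells `tokens`.

    log_probs: list of frames, each a list of per-label log-probabilities.
    tokens: target label ids, without blanks.
    """
    # Interleave blanks around the targets: [blank, t1, blank, t2, ..., blank]
    ext = [blank]
    for token in tokens:
        ext += [token, blank]
    n_frames, n_states = len(log_probs), len(ext)

    score = [[-math.inf] * n_states for _ in range(n_frames)]
    back = [[0] * n_states for _ in range(n_frames)]
    score[0][0] = log_probs[0][ext[0]]
    if n_states > 1:
        score[0][1] = log_probs[0][ext[1]]

    for t in range(1, n_frames):
        for s in range(n_states):
            # Allowed predecessors: stay, advance by one state, or skip the
            # blank between two different labels.
            candidates = [s]
            if s >= 1:
                candidates.append(s - 1)
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                candidates.append(s - 2)
            best = max(candidates, key=lambda p: score[t - 1][p])
            score[t][s] = score[t - 1][best] + log_probs[t][ext[s]]
            back[t][s] = best

    # A valid path ends on the last label or on the final blank.
    s = n_states - 1
    if n_states > 1 and score[-1][n_states - 2] > score[-1][s]:
        s = n_states - 2
    if score[-1][s] == -math.inf:
        raise ValueError("not enough frames to align the token sequence")

    path = []
    for t in range(n_frames - 1, -1, -1):
        path.append(ext[s])
        s = back[t][s]
    return path[::-1]


def token_spans(path, blank=0):
    """Collapse a frame path into (label, first_frame, last_frame) spans."""
    spans = []
    for frame, label in enumerate(path):
        if label == blank:
            continue
        if spans and spans[-1][0] == label and spans[-1][2] == frame - 1:
            spans[-1] = (label, spans[-1][1], frame)
        else:
            spans.append((label, frame, frame))
    return spans


# Toy emissions: 5 frames, 3 labels (0 = blank). Values are made up.
emissions = [[math.log(p) for p in frame] for frame in [
    [0.8, 0.1, 0.1],
    [0.1, 0.8, 0.1],
    [0.1, 0.8, 0.1],
    [0.8, 0.1, 0.1],
    [0.1, 0.1, 0.8],
]]
path = ctc_forced_align(emissions, tokens=[1, 2])
spans = token_spans(path)
print(path)   # [0, 1, 1, 0, 2]
print(spans)  # [(1, 1, 2), (2, 4, 4)]
```

With a real acoustic model, the frame indices in each span are turned into timestamps by multiplying them by the duration of one output frame.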



<p align="center">
  <img src="https://github.com/cawoylel/nlp4all/blob/main/asr/illustrations/forced_aligner.png?raw=true" alt="forced alignment illustration" width=500 class="center">
<br>
    <em>
    Illustration of the forced alignment task, from nlp4all
    </em>
</p>

[MMS](https://github.com/facebookresearch/fairseq/blob/main/examples/mms/README.md) (Massively Multilingual Speech) provides a forced aligner built on a multilingual speech model trained on more than one thousand languages. You can check whether your language is covered here: https://dl.fbaipublicfiles.com/mms/misc/language_coverage_mms.html




# Installing the dependencies in the virtual environment


In [1]:
!apt install libicu-dev pkg-config

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
libicu-dev is already the newest version (70.1-2).
The following packages were automatically installed and are no longer required:
  libbz2-dev libpkgconf3 libreadline-dev
Use 'apt autoremove' to remove them.
The following packages will be REMOVED:
  pkgconf r-base-dev
The following NEW packages will be installed:
  pkg-config
0 upgraded, 1 newly installed, 2 to remove and 45 not upgraded.
Need to get 48.2 kB of archives.
After this operation, 11.3 kB disk space will be freed.
Get:1 http://archive.ubuntu.com/ubuntu jammy/main amd64 pkg-config amd64 0.29.2-1ubuntu3 [48.2 kB]
Fetched 48.2 kB in 0s (330 kB/s)
(Reading database ... 123598 files and directories currently installed.)
Removing r-base-dev (4.4.1-1.2204.0) ...
dpkg: pkgconf: dependency problems, but removing anyway as you requested:
 libsndfile1-dev:amd64 depends on pkg-config; however:
  Package pkg-config is not installed.

In [2]:
!apt-get install libsox-fmt-all sox # needed for processing audio
!apt-get install -y ffmpeg
!apt install libicu-dev pkg-config # needed for processing text and unicode symbols

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
The following packages were automatically installed and are no longer required:
  libbz2-dev libpkgconf3 libreadline-dev
Use 'apt autoremove' to remove them.
The following additional packages will be installed:
  libao-common libao4 libid3tag0 libmad0 libopencore-amrnb0 libopencore-amrwb0 libsox-fmt-alsa
  libsox-fmt-ao libsox-fmt-base libsox-fmt-mp3 libsox-fmt-oss libsox-fmt-pulse libsox3 libwavpack1
Suggested packages:
  libaudio2 libsndio6.1
The following NEW packages will be installed:
  libao-common libao4 libid3tag0 libmad0 libopencore-amrnb0 libopencore-amrwb0 libsox-fmt-all
  libsox-fmt-alsa libsox-fmt-ao libsox-fmt-base libsox-fmt-mp3 libsox-fmt-oss libsox-fmt-pulse
  libsox3 libwavpack1 sox
0 upgraded, 16 newly installed, 0 to remove and 45 not upgraded.
Need to get 800 kB of archives.
After this operation, 2,533 kB of additional disk space will be used.
Get:1 http://archive.ubunt

In [3]:
!pip uninstall torch torchaudio -y # we need to install the nightly version of torch
!pip install --pre torch torchaudio --index-url https://download.pytorch.org/whl/nightly/cu118

!pip install -q sox # for audio processing
!pip install -q ICU-Tokenizer # for tokenizing the text
!pip install pandas
!pip install tensorboardX

Found existing installation: torch 2.3.1+cu121
Uninstalling torch-2.3.1+cu121:
  Successfully uninstalled torch-2.3.1+cu121
Found existing installation: torchaudio 2.3.1+cu121
Uninstalling torchaudio-2.3.1+cu121:
  Successfully uninstalled torchaudio-2.3.1+cu121
Looking in indexes: https://download.pytorch.org/whl/nightly/cu118
Collecting torch
  Downloading https://download.pytorch.org/whl/nightly/cu118/torch-2.5.0.dev20240804%2Bcu118-cp310-cp310-linux_x86_64.whl (835.9 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 835.9/835.9 MB 2.3 MB/s eta 0:00:00
Collecting torchaudio
  Downloading https://download.pytorch.org/whl/nightly/cu118/torchaudio-2.4.0.dev20240804%2Bcu118-cp310-cp310-linux_x86_64.whl (3.3 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 3.3/3.3 MB 48.4 MB/s eta 0:00:00
Collecting nvidia-cuda-nvrtc-cu11==11.8.89 (from torch)
  Downloading https://download.pytorch.org/whl/nightly/cu11

In this step, we clone the repositories containing the code and resources needed for the alignment: `isi-nlp/uroman`, which romanizes text; `facebookresearch/fairseq`, which contains the MMS alignment script; and `rfclara/fa_xhosa`, which contains the helper scripts for this corpus.

In [4]:
!git clone https://github.com/isi-nlp/uroman.git

Cloning into 'uroman'...
remote: Enumerating objects: 579, done.
remote: Counting objects: 100% (295/295), done.
remote: Compressing objects: 100% (125/125), done.
remote: Total 579 (delta 189), reused 262 (delta 164), pack-reused 284
Receiving objects: 100% (579/579), 5.07 MiB | 22.46 MiB/s, done.
Resolving deltas: 100% (321/321), done.


In [5]:
!git clone https://github.com/facebookresearch/fairseq.git
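# `!cd` runs in a subshell and would not change the notebook's working directory,
# so the alignment cell further down changes into /content/fairseq itself.
#!pip install --editable ./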

Cloning into 'fairseq'...
remote: Enumerating objects: 35209, done.
remote: Counting objects: 100% (126/126), done.
remote: Compressing objects: 100% (73/73), done.
remote: Total 35209 (delta 68), reused 88 (delta 52), pack-reused 35083
Receiving objects: 100% (35209/35209), 25.23 MiB | 11.78 MiB/s, done.
Resolving deltas: 100% (25558/25558), done.


In [6]:
import os

In [7]:
!git clone https://github.com/rfclara/fa_xhosa
!mkdir -p /content/fa_xhosa  # -p: no error if the clone already created the folder
os.chdir("/content/fa_xhosa")
# manually uploading the Python files until I set the repository public
#from google.colab import files
#uploaded = files.upload()

Cloning into 'fa_xhosa'...
remote: Enumerating objects: 51, done.

## Preparing the data
Getting the audio files and the transcriptions.
Before continuing, put every audio file and transcription into a folder named `original` and compress it into `original.zip`. I recommend saving the archive to your Drive.

`original.zip` should decompress into one folder called `original` containing the audio files and the transcriptions. The filenames of each audio/transcription pair must match and differ only by their extension (.wav, .xlsx).

Example: `story_1.wav` and `story_1.xlsx`.
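
Before zipping, the pairing rule above can be checked with a few lines of Python. This is only a sketch: the helper name `find_unpaired` is hypothetical and is not part of the `fa_xhosa` scripts. It lists audio files that have no transcription with the same stem, and transcriptions that have no audio.

```python
from pathlib import Path


def find_unpaired(folder, audio_ext=".wav", text_ext=".xlsx"):
    """Return (audio stems without a transcription, transcription stems without audio)."""
    files = list(Path(folder).iterdir())
    audio = {p.stem for p in files if p.suffix.lower() == audio_ext}
    text = {p.stem for p in files if p.suffix.lower() == text_ext}
    return sorted(audio - text), sorted(text - audio)


# Example usage on your own machine, before creating original.zip:
# missing_transcription, missing_audio = find_unpaired("original")
# print("No .xlsx for:", missing_transcription)
# print("No .wav for:", missing_audio)
```

Both lists should be empty before you create the archive.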

(FASTER) **EITHER** give this notebook access to your Drive:
run the next cell. It will ask for permission to access your Drive, then copy the archive from your Drive to the virtual environment.

Make sure `original.zip` is placed directly in the main directory of your Drive.

In [8]:
# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

# Copy the zip file from Google Drive to the Colab environment
!cp /content/drive/MyDrive/original.zip /content

Mounted at /content/drive


__OR__ upload the *original.zip* archive containing the transcriptions and the recordings directly here (the next cell will ask you to browse for the file).

In [9]:
from google.colab import files
uploaded = files.upload()

Decompress `original.zip` into /content/xhosa

In [10]:
!mkdir /content/xhosa
!unzip /content/original.zip -d /content/xhosa

Archive:  /content/original.zip
   creating: /content/xhosa/original/
  inflating: /content/xhosa/original/.~lock.MN180626O_a.xlsx#  
  inflating: /content/xhosa/original/BLN150925D_b.xlsx  
  inflating: /content/xhosa/original/BLN150925M_b.xlsx  
  inflating: /content/xhosa/original/BU151210S_b.xlsx  
  inflating: /content/xhosa/original/BU160331M_a.xlsx  
  inflating: /content/xhosa/original/BU160331M_b.xlsx  
  inflating: /content/xhosa/original/BU160331M_c.xlsx  
  inflating: /content/xhosa/original/BU160331M_e.xlsx  
  inflating: /content/xhosa/original/BU160331M_g.xlsx  
  inflating: /content/xhosa/original/BU160401O.xlsx  
  inflating: /content/xhosa/original/BU191231O.xlsx  
  inflating: /content/xhosa/original/GX150515M_c.xlsx  
  inflating: /content/xhosa/original/LM180625S_a.xlsx  
  inflating: /content/xhosa/original/MN180626O_a.xlsx  
  inflating: /content/xhosa/original/MN180626O_b.xlsx  
  inflating: /content/xhosa/original/MTF170609D_k.xlsx  
  inflating: /content/xhosa

### Extracting the transcriptions from the Excel files
---



In [11]:
!python /content/fa_xhosa/extract_transcriptions.py /content/xhosa/original /content/xhosa/ready

Processed /content/xhosa/original/BU191231O.xlsx -> /content/xhosa/ready/BU191231O.txt
Processed /content/xhosa/original/GX150515M_c.xlsx -> /content/xhosa/ready/GX150515M_c.txt
Processed /content/xhosa/original/BU160401O.xlsx -> /content/xhosa/ready/BU160401O.txt
Processed /content/xhosa/original/BU160331M_e.xlsx -> /content/xhosa/ready/BU160331M_e.txt
Processed /content/xhosa/original/LM180625S_a.xlsx -> /content/xhosa/ready/LM180625S_a.txt
Processed /content/xhosa/original/BU160331M_c.xlsx -> /content/xhosa/ready/BU160331M_c.txt
Processed /content/xhosa/original/MN180626O_b.xlsx -> /content/xhosa/ready/MN180626O_b.txt
Processed /content/xhosa/original/MTF170609D_k.xlsx -> /content/xhosa/ready/MTF170609D_k.txt
Processed /content/xhosa/original/PSJ150516D_c.xlsx -> /content/xhosa/ready/PSJ150516D_c.txt
Processed /content/xhosa/original/BU160331M_a.xlsx -> /content/xhosa/ready/BU160331M_a.txt
Processed /content/xhosa/original/BU160331M_b.xlsx -> /content/xhosa/ready/BU160331M_b.txt
Pro

The last line of the previous cell's output should display the number of files that were processed correctly.

**OPTIONAL**
(but probably leads to a better alignment)

REMOVING COMMENTS BETWEEN ANGLE BRACKETS, e.g. \<code-switching>, \<laugh>
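
As an illustration of what this clean-up amounts to, here is a small regular-expression sketch. It is an assumption about the general approach, not the contents of `remove_comments.py`; the function name `strip_angle_comments` and the sample line are made up for the example.

```python
import re

# Matches one <...> annotation; nested brackets are not handled.
ANGLE_COMMENT = re.compile(r"<[^<>]*>")


def strip_angle_comments(line):
    """Remove <...> annotations and collapse the whitespace left behind."""
    without_comments = ANGLE_COMMENT.sub(" ", line)
    return re.sub(r"\s+", " ", without_comments).strip()


print(strip_angle_comments("ngoku <laugh> hayi bayadla"))  # ngoku hayi bayadla
```

Removing these annotations matters because the aligner would otherwise try to find audio for words that were never spoken.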


In [12]:
%%shell
for f in /content/xhosa/ready/*.txt; do
  python /content/fa_xhosa/remove_comments.py "$f"
done

Processed file '/content/xhosa/ready/BLN150925D_b.txt' successfully.
Processed file '/content/xhosa/ready/BLN150925M_b.txt' successfully.
Processed file '/content/xhosa/ready/BU151210S_b.txt' successfully.
Processed file '/content/xhosa/ready/BU160331M_a.txt' successfully.
Processed file '/content/xhosa/ready/BU160331M_b.txt' successfully.
Processed file '/content/xhosa/ready/BU160331M_c.txt' successfully.
Processed file '/content/xhosa/ready/BU160331M_e.txt' successfully.
Processed file '/content/xhosa/ready/BU160331M_g.txt' successfully.
Processed file '/content/xhosa/ready/BU160401O.txt' successfully.
Processed file '/content/xhosa/ready/BU191231O.txt' successfully.
Processed file '/content/xhosa/ready/GX150515M_c.txt' successfully.
Processed file '/content/xhosa/ready/LM180625S_a.txt' successfully.
Processed file '/content/xhosa/ready/MN180626O_a.txt' successfully.
Processed file '/content/xhosa/ready/MN180626O_b.txt' successfully.
Processed file '/content/xhosa/ready/MTF170609D_k.



The following cell displays the last 10 lines of the first transcription. Make sure it shows the expected result before continuing. All transcriptions can be found in /content/xhosa/ready and can be opened by double-clicking on them.

In [14]:
!tail -n 10 $(ls /content/xhosa/ready/*.txt | head -n 1)

benxibile isintu sabo ? Hayi ke
Uhm ya bebenxibile qha into eye yasiphazamisa kukuba siye safikela apho koli joyinti khona
{Ooh}
ngoku {h}ayi bayadla
 
ngoku awukwazi ukumbona {ingumntu} nxe {i}nguye enormal   ngoba ngoku u+ usoloko ebonakala edlile
Ee si+ siye sahlangana ke  phaya
bagiya ke shame
hayi bayayenza bon{a} into qha ingathi bangayenza benormal
uhm konakona hayi bezibuya


### Resampling the audios  <a name="resampling"></a>

Once the audio files are in place, we need to resample them. Many modern speech models only accept audio with a *16 kHz sampling rate*. We will use `ffmpeg` to resample the audio to 16 000 Hz, convert it to mono, and save the result as `.wav` files.

We will save the resampled wav files into the `ready` directory, which already contains the extracted .txt transcriptions.

In [15]:
import os
for file in os.listdir("/content/xhosa/original"):
    if file.endswith(".wav"):
        input_path = f"/content/xhosa/original/{file}"
        output_path = f"/content/xhosa/ready/{os.path.splitext(file)[0]}.wav"
        !ffmpeg -i "{input_path}" -ac 1 -ar 16000 "{output_path}"


ffmpeg version 4.4.2-0ubuntu0.22.04.1 Copyright (c) 2000-2021 the FFmpeg developers
  built with gcc 11 (Ubuntu 11.2.0-19ubuntu1)
  configuration: --prefix=/usr --extra-version=0ubuntu0.22.04.1 --toolchain=hardened --libdir=/usr/lib/x86_64-linux-gnu --incdir=/usr/include/x86_64-linux-gnu --arch=amd64 --enable-gpl --disable-stripping --enable-gnutls --enable-ladspa --enable-libaom --enable-libass --enable-libbluray --enable-libbs2b --enable-libcaca --enable-libcdio --enable-libcodec2 --enable-libdav1d --enable-libflite --enable-libfontconfig --enable-libfreetype --enable-libfribidi --enable-libgme --enable-libgsm --enable-libjack --enable-libmp3lame --enable-libmysofa --enable-libopenjpeg --enable-libopenmpt --enable-libopus --enable-libpulse --enable-librabbitmq --enable-librubberband --enable-libshine --enable-libsnappy --enable-libsoxr --enable-libspeex --enable-libsrt --enable-libssh --enable-libtheora --enable-libtwolame --enable-libvidstab --enable-libvorbis --enable-libvpx --enab
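
Once the resampling cell has finished, the result can be sanity-checked with Python's standard `wave` module. This is a sketch with hypothetical helper names (`wav_format`, `is_ready_for_alignment`); it only reads the WAV header and assumes the files are plain PCM, which is what `ffmpeg` writes for `.wav` by default.

```python
import wave


def wav_format(path):
    """Return (sample_rate_hz, n_channels) read from a PCM WAV header."""
    with wave.open(str(path), "rb") as wav:
        return wav.getframerate(), wav.getnchannels()


def is_ready_for_alignment(path, rate=16000, channels=1):
    """True if the file is mono audio sampled at 16 kHz."""
    return wav_format(path) == (rate, channels)


# Example usage in this notebook:
# from pathlib import Path
# for wav_path in sorted(Path("/content/xhosa/ready").glob("*.wav")):
#     print(wav_path.name, wav_format(wav_path), is_ready_for_alignment(wav_path))
```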

## Neural Forced Alignment  <a name="aligner"></a>

Using [torchaudio.functional.forced_align()](https://pytorch.org/audio/stable/generated/torchaudio.functional.forced_align.html#torchaudio-functional-forced-align), the following cell will automatically align each line of the transcription with its corresponding time in the audio file. This step may take from a few minutes to several hours, depending on the size of the corpus.

In [16]:
%%shell
input_folder=/content/xhosa/ready
output_folder=/content/xhosa/aligned
cd /content/fairseq/
for audio in "$input_folder"/*.wav; do
  filename="$(basename "$audio")"
  stem="${filename%.*}"
  output_path="$output_folder/$stem"
  # Remove any previous output for this recording before aligning again
  rm -rf "$output_path"
  python -m examples.mms.data_prep.align_and_segment \
  --audio_filepath "$input_folder/$stem.wav" \
  --text_filepath "$input_folder/$stem.txt" \
  --lang xho \
  --outdir "$output_path" \
  --uroman /content/uroman/uroman
done

Using torch version: 2.5.0.dev20240804+cu118
Using torchaudio version: 2.4.0.dev20240804+cu118
Using device:  cpu
Read 24 lines from /content/xhosa/ready/BLN150925D_b.txt
Can't open /content/uroman/uroman/../data/Scripts.txt
Can't open /content/uroman/uroman/../data/UnicodeData.txt
Can't open /content/uroman/uroman/../data/UnicodeDataOverwrite.txt
Can't open /content/uroman/uroman/../data/romanization-table.txt
Downloading model and dictionary...
100% 1.18G/1.18G [00:14<00:00, 88.8MB/s]
  state_dict = torch.load(model_path_name, map_location="cpu")
100% 79.0/79.0 [00:00<00:00, 272kB/s]
Using torch version: 2.5.0.dev20240804+cu118
Using torchaudio version: 2.4.0.dev20240804+cu118
Using device:  cpu
Read 29 lines from /content/xhosa/ready/BLN150925M_b.txt
Can't open /content/uroman/uroman/../data/Scripts.txt
Can't open /content/uroman/uroman/../data/UnicodeData.txt
Can't open /content/uroman/uroman/../data/UnicodeDataOverwrite.txt
Can't open /content/uroman/uroman/../data/romanization-ta



The next two cells let you save the `aligned` folder, which contains, for each original audio file, a manifest.json (timestamps) and every audio chunk in .flac format.
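
To inspect the timestamps yourself, a manifest can be read with a few lines of Python. This sketch assumes that each manifest holds one JSON object per line with the fields `audio_start_sec`, `duration` and `text`; open one of your own manifest.json files to confirm the field names before relying on it. The helper name `read_manifest` is hypothetical.

```python
import json


def read_manifest(path):
    """Yield (start_sec, end_sec, text) for each segment of a JSON-lines manifest.

    Assumed fields per line: "audio_start_sec", "duration", "text".
    """
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            entry = json.loads(line)
            start = float(entry["audio_start_sec"])
            yield start, start + float(entry["duration"]), entry["text"]


# Example usage (replace <recording> with one of the sub-folders of `aligned`):
# for start, end, text in read_manifest("/content/xhosa/aligned/<recording>/manifest.json"):
#     print(f"{start:8.2f} {end:8.2f}  {text}")
```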

**SAVE** the `aligned` folder, containing all the chunks and manifest.json (the file containing the timestamps), **to your local device.**

In [17]:
# Zip the 'aligned' folder into 'aligned.zip'
# (absolute paths are used, so the archive keeps the content/xhosa/aligned prefix)
!zip -r /content/xhosa/aligned.zip /content/xhosa/aligned

# Download the zip file
from google.colab import files
files.download('/content/xhosa/aligned.zip')


  adding: content/xhosa/aligned/ (stored 0%)
  adding: content/xhosa/aligned/BU191231O/ (stored 0%)
  adding: content/xhosa/aligned/BU191231O/segment15.flac (deflated 0%)
  adding: content/xhosa/aligned/BU191231O/segment61.flac (deflated 0%)
  adding: content/xhosa/aligned/BU191231O/segment35.flac (deflated 0%)
  adding: content/xhosa/aligned/BU191231O/segment57.flac (deflated 0%)
  adding: content/xhosa/aligned/BU191231O/segment8.flac (deflated 0%)
  adding: content/xhosa/aligned/BU191231O/segment22.flac (deflated 0%)
  adding: content/xhosa/aligned/BU191231O/segment23.flac (deflated 0%)
  adding: content/xhosa/aligned/BU191231O/segment65.flac (stored 0%)
  adding: content/xhosa/aligned/BU191231O/segment6.flac (deflated 0%)
  adding: content/xhosa/aligned/BU191231O/segment11.flac (deflated 0%)
  adding: content/xhosa/aligned/BU191231O/segment62.flac (deflated 0%)
  adding: content/xhosa/aligned/BU191231O/segment33.flac (deflated 0%)
  adding: content/xhosa/aligned/BU191231O/segment37.


**SAVE** the `aligned` folder, containing all the chunks and manifest.json (the file containing the timestamps), **to your Drive.**

In [18]:
# Copy aligned.zip into your Drive
!cp /content/xhosa/aligned.zip /content/drive/MyDrive/aligned.zip

Note: aligned.zip decompresses into content -> xhosa -> aligned, not directly into aligned.
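
If you would rather have an archive that unpacks straight into `aligned`, one option is Python's `shutil.make_archive`, which stores paths relative to a chosen root directory. This is an alternative sketch, not what the cells above do; the helper name `zip_folder_flat` is hypothetical.

```python
import shutil
from pathlib import Path


def zip_folder_flat(folder):
    """Zip `folder` next to itself so the archive unpacks to `<folder name>/...`."""
    folder = Path(folder)
    return shutil.make_archive(
        base_name=str(folder),        # ".zip" is appended automatically
        format="zip",
        root_dir=str(folder.parent),  # paths in the archive are relative to this
        base_dir=folder.name,         # only this sub-folder is archived
    )


# Example usage in this notebook:
# zip_folder_flat("/content/xhosa/aligned")  # writes /content/xhosa/aligned.zip
```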

## Converting the aligned transcriptions to .TextGrid
This lets you open them in Praat or convert them to .eaf or any other compatible format.
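
For readers curious about the target format, the sketch below writes a one-tier TextGrid by hand in Praat's long text format. It is illustrative only and is not how `excel_to_textgrid.py` works; the function name `write_textgrid`, the tier name, and the sample intervals are assumptions. Note that the tier ends at the last interval, not at the true end of the recording.

```python
def write_textgrid(intervals, path, tier_name="transcription"):
    """Write (start_sec, end_sec, text) intervals as a one-tier Praat TextGrid.

    Intervals must be sorted and non-overlapping. Gaps are filled with empty
    intervals because an IntervalTier has to cover the whole time span.
    """
    filled, cursor = [], 0.0
    for start, end, text in intervals:
        if start > cursor:
            filled.append((cursor, start, ""))
        filled.append((start, end, text))
        cursor = end

    lines = [
        'File type = "ooTextFile"',
        'Object class = "TextGrid"',
        "",
        "xmin = 0",
        f"xmax = {cursor}",
        "tiers? <exists>",
        "size = 1",
        "item []:",
        "    item [1]:",
        '        class = "IntervalTier"',
        f'        name = "{tier_name}"',
        "        xmin = 0",
        f"        xmax = {cursor}",
        f"        intervals: size = {len(filled)}",
    ]
    for index, (start, end, text) in enumerate(filled, 1):
        escaped = text.replace('"', '""')  # Praat doubles quotes inside strings
        lines += [
            f"        intervals [{index}]:",
            f"            xmin = {start}",
            f"            xmax = {end}",
            f'            text = "{escaped}"',
        ]
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("\n".join(lines) + "\n")
    return len(filled)


# Example usage with made-up timestamps:
# write_textgrid([(0.5, 1.75, "first line"), (1.75, 2.5, "second line")], "demo.TextGrid")
```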

In [19]:
!pip install textgrid

Collecting textgrid
  Downloading TextGrid-1.6.1.tar.gz (9.4 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: textgrid
  Building wheel for textgrid (setup.py) ... done
  Created wheel for textgrid: filename=TextGrid-1.6.1-py3-none-any.whl size=10148 sha256=1a788391c510cb22de68b485e78b91b2e38b6432fedf80e253f60b19cfb14770
  Stored in directory: /root/.cache/pip/wheels/23/41/f2/e2ef1817bd163de3c21dd078966bdd71bd5c4455841f4ec016
Successfully built textgrid
Installing collected packages: textgrid
Successfully installed textgrid-1.6.1


### Add timestamps to Excel files.

Rename each manifest.json after its original file and move it into the `aligned` directory.

In [20]:
%%shell
input_directory="/content/xhosa/aligned/"

find "$input_directory" -type f -name "manifest.json" | while IFS= read -r manifest; do
  # Extract the directory of the manifest and the name of the subdirectory containing the manifest
  manifest_directory=$(dirname "$manifest")
  subdirectory_name=$(basename "$manifest_directory")

  # Move and rename the manifest file
  mv "$manifest" "/content/xhosa/aligned/$subdirectory_name.json"
done



In [23]:
%%shell
excel_directory="/content/xhosa/original"
aligned_directory="/content/xhosa/aligned"
for excel_file in "$excel_directory"/*.xlsx; do
  # Extract the filename without extension
  base_name=$(basename "$excel_file" .xlsx)
  # Construct the corresponding JSON file path
  json_file="$aligned_directory/$base_name.json"
  # Write the timestamps from the JSON file into a copy of the Excel file
  python3 /content/fa_xhosa/add_times_to_excel.py "$excel_file" "$json_file" "$aligned_directory/$base_name.xlsx"
done

2024-08-04 21:46:37,560 - INFO - Excel rows: 130
2024-08-04 21:46:37,560 - INFO - JSON entries: 24
2024-08-04 21:46:37,603 - INFO - Updated Excel file created: /content/xhosa/aligned/BLN150925D_b.xlsx
2024-08-04 21:46:38,446 - INFO - Excel rows: 203
2024-08-04 21:46:38,446 - INFO - JSON entries: 29
2024-08-04 21:46:38,505 - INFO - Updated Excel file created: /content/xhosa/aligned/BLN150925M_b.xlsx
2024-08-04 21:46:39,327 - INFO - Excel rows: 48
2024-08-04 21:46:39,328 - INFO - JSON entries: 6
2024-08-04 21:46:39,353 - INFO - Updated Excel file created: /content/xhosa/aligned/BU151210S_b.xlsx
2024-08-04 21:46:40,162 - INFO - Excel rows: 27
2024-08-04 21:46:40,162 - INFO - JSON entries: 7
2024-08-04 21:46:40,191 - INFO - Updated Excel file created: /content/xhosa/aligned/BU160331M_a.xlsx
2024-08-04 21:46:40,994 - INFO - Excel rows: 31
2024-08-04 21:46:40,994 - INFO - JSON entries: 7
2024-08-04 21:46:41,015 - INFO - Updated Excel file created: /content/xhosa/aligned/BU160331M_b.xlsx
2024



In [25]:
!pip install praatio

Collecting praatio
  Downloading praatio-6.2.0-py3-none-any.whl.metadata (8.7 kB)
Downloading praatio-6.2.0-py3-none-any.whl (80 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 80.0/80.0 kB 1.9 MB/s eta 0:00:00
Installing collected packages: praatio
Successfully installed praatio-6.2.0


In [26]:
%%shell
input_directory="/content/xhosa/aligned/"
output_directory="/content/xhosa/aligned/"

# Find all aligned Excel files under the input directory
find "$input_directory" -type f -name "*.xlsx" | while IFS= read -r excel_file; do
  # Extract the filename without extension
  base_name=$(basename "$excel_file" .xlsx)
  # Construct the corresponding output file path
  output_file_path="$output_directory/$base_name.TextGrid"
  # Convert the timestamped Excel file; report success only if the script succeeded
  if python /content/fa_xhosa/excel_to_textgrid.py "$excel_file" -o "$output_file_path"; then
    echo "Processed $excel_file into $output_file_path"
  else
    echo "Failed to convert $excel_file" >&2
  fi
done

TextGrid file created: /content/xhosa/aligned//BU191231O.TextGrid
Processed /content/xhosa/aligned/BU191231O.xlsx into /content/xhosa/aligned//BU191231O.TextGrid
TextGrid file created: /content/xhosa/aligned//GX150515M_c.TextGrid
Processed /content/xhosa/aligned/GX150515M_c.xlsx into /content/xhosa/aligned//GX150515M_c.TextGrid
TextGrid file created: /content/xhosa/aligned//BU160331M_e.TextGrid
Processed /content/xhosa/aligned/BU160331M_e.xlsx into /content/xhosa/aligned//BU160331M_e.TextGrid
TextGrid file created: /content/xhosa/aligned//LM180625S_a.TextGrid
Processed /content/xhosa/aligned/LM180625S_a.xlsx into /content/xhosa/aligned//LM180625S_a.TextGrid
TextGrid file created: /content/xhosa/aligned//BU160331M_c.TextGrid
Processed /content/xhosa/aligned/BU160331M_c.xlsx into /content/xhosa/aligned//BU160331M_c.TextGrid
Traceback (most recent call last):
  File "/content/fa_xhosa/excel_to_textgrid.py", line 77, in <module>
    main()
  File "/content/fa_xhosa/excel_to_textgrid.py", l



In [27]:
!zip -r /content/xhosa/aligned.zip /content/xhosa/aligned

updating: content/xhosa/aligned/ (stored 0%)
updating: content/xhosa/aligned/BU191231O/ (stored 0%)
updating: content/xhosa/aligned/BU191231O/segment15.flac (deflated 0%)
updating: content/xhosa/aligned/BU191231O/segment61.flac (deflated 0%)
updating: content/xhosa/aligned/BU191231O/segment35.flac (deflated 0%)
updating: content/xhosa/aligned/BU191231O/segment57.flac (deflated 0%)
updating: content/xhosa/aligned/BU191231O/segment8.flac (deflated 0%)
updating: content/xhosa/aligned/BU191231O/segment22.flac (deflated 0%)
updating: content/xhosa/aligned/BU191231O/segment23.flac (deflated 0%)
updating: content/xhosa/aligned/BU191231O/segment65.flac (stored 0%)
updating: content/xhosa/aligned/BU191231O/segment6.flac (deflated 0%)
updating: content/xhosa/aligned/BU191231O/segment11.flac (deflated 0%)
updating: content/xhosa/aligned/BU191231O/segment62.flac (deflated 0%)
updating: content/xhosa/aligned/BU191231O/segment33.flac (deflated 0%)
updating: content/xhosa/aligned/BU191231O/segment37.

**SAVE** the aligned files to your Drive.

In [29]:
!cp /content/xhosa/aligned.zip /content/drive/MyDrive/aligned.zip

**Download** the aligned files to your computer.

In [30]:
from google.colab import files
files.download('/content/xhosa/aligned.zip')
