# PROCESSING AND PHONEMIZATION OF THE ABKHAZ DATA 🌄


## 1. PROCESSING 🔨

First, we filter the data so that we work only with the sentences of the most prolific speaker (1039 sentences). We then collect all grapheme annotations from the data frame into a list and, in the same way, build a list of the corresponding audio file names, which the data frame also provides. The audio also needs to be converted from .mp3 to .wav, for which we used ffmpeg and PowerShell commands.
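
The conversion commands themselves are not part of this notebook. As a rough sketch of that step in Python rather than PowerShell, assuming `ffmpeg` is available on the PATH, the function names, directory layout and example file name below are hypothetical:

```python
import shutil
import subprocess
from pathlib import Path


def build_ffmpeg_command(mp3_path: Path, wav_dir: Path) -> list[str]:
    """Return the ffmpeg argument list that converts one .mp3 file to .wav."""
    wav_path = wav_dir / (mp3_path.stem + ".wav")
    # -y overwrites an existing output file; -i names the input file
    return ["ffmpeg", "-y", "-i", str(mp3_path), str(wav_path)]


def convert_clips(mp3_dir: Path, wav_dir: Path) -> int:
    """Convert every .mp3 in mp3_dir; return the number of files converted."""
    if shutil.which("ffmpeg") is None:
        raise RuntimeError("ffmpeg was not found on PATH")
    wav_dir.mkdir(parents=True, exist_ok=True)
    converted = 0
    for mp3_path in sorted(mp3_dir.glob("*.mp3")):
        # check=True raises CalledProcessError if ffmpeg fails on a file
        subprocess.run(build_ffmpeg_command(mp3_path, wav_dir), check=True)
        converted += 1
    return converted
```

Calling `convert_clips(Path("clips"), Path("wav"))` would then convert a whole folder; ffmpeg infers the output format from the `.wav` extension.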

In [None]:
import pandas as pd

In [None]:
# download the latest version of the Abkhaz data from Common Voice and upload it to the content folder
# load the uploaded data from the content folder and choose only the relevant columns
# forward slashes keep '\a' and '\t' in the path from being read as escape sequences
tsv_file = 'content/cv-corpus-14.0-2023-06-23/ab/train.tsv'
client_id = "client_id"
content_column = 'sentence'
file_name_column = 'path'

# load only the relevant columns
df = pd.read_csv(tsv_file, sep='\t', usecols=[client_id, file_name_column, content_column], encoding='utf-8')

# keep only the rows of the speaker with the most sentences (identified earlier)
filtered_df = df[df['client_id'] == '68f6a8d9cdcc1f9b48b1690327761a26e7735653c0d48111e997603ed377410ae8df246246ba64a500e916a961be357fb51f8537ecf7c5e2e62b89223f7edb94']

# build parallel lists: transcript file names (.mp3 renamed to .txt) and sentences
file_names = []
sentences = []
for index, row in filtered_df.iterrows():
    name = str(row[file_name_column])
    cleaned_name = name.replace('.mp3', '.txt')
    content = str(row[content_column])
    file_names.append(cleaned_name)
    sentences.append(content)
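
The hard-coded `client_id` above was established in an earlier step that is not shown here. A minimal sketch of how such a speaker could be found with pandas, assuming the same column names, the toy data frame and its IDs are made up for illustration, and the `_sketch` names are hypothetical:

```python
import pandas as pd

# Toy stand-in for train.tsv; the IDs, paths and sentences are made up.
toy_df = pd.DataFrame({
    "client_id": ["a", "b", "a", "c", "a", "b"],
    "path": [f"clip_{i}.mp3" for i in range(6)],
    "sentence": [f"sentence {i}" for i in range(6)],
})

# Count rows per speaker and pick the speaker with the most sentences.
counts = toy_df["client_id"].value_counts()
top_speaker = counts.idxmax()
top_count = int(counts.max())

# The two lists can also be built without an explicit loop.
top_df = toy_df[toy_df["client_id"] == top_speaker]
file_names_sketch = top_df["path"].str.replace(".mp3", ".txt", regex=False).tolist()
sentences_sketch = top_df["sentence"].astype(str).tolist()
```

On the real data, `counts.head()` would show whether one speaker clearly dominates before the ID is fixed in the filter.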

## 2. PHONEMIZATION ✍

We iterate over the list of sentences built above and submit each annotation through the input box of the online phonemization tool provided by [Baltoslav](https://baltoslav.eu/ipa/index.php?mova=en&j=ap&t). We then collect the generated IPA transcription from the output box and write it to a file named after the corresponding audio file.
The Selenium setup code is mostly taken from this [blog post](https://blog.devgenius.io/use-selenium-webdriver-in-google-colab-d5f2dba1d9f5).

In [None]:
!pip install selenium
!apt-get update
!apt install -yq chromium-chromedriver

Collecting selenium
  Downloading selenium-4.11.2-py3-none-any.whl (7.2 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 7.2/7.2 MB 16.4 MB/s eta 0:00:00
Collecting trio~=0.17 (from selenium)
  Downloading trio-0.22.2-py3-none-any.whl (400 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 400.2/400.2 kB 23.8 MB/s eta 0:00:00
Collecting trio-websocket~=0.9 (from selenium)
  Downloading trio_websocket-0.10.3-py3-none-any.whl (17 kB)
Collecting outcome (from trio~=0.17->selenium)
  Downloading outcome-1.2.0-py2.py3-none-any.whl (9.7 kB)
Collecting wsproto>=0.14 (from trio-websocket~=0.9->selenium)
  Downloading wsproto-1.2.0-py3-none-any.whl (24 kB)
Collecting h11<1,>=0.9.0 (from wsproto>=0.14->trio-websocket~=0.9->selenium)
  Downloading h11-0.14.0-py3-none-any.whl (58 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 58.3/58.3 kB 6.3 MB/s eta 0:00:00
Install

In [None]:
!wget https://dl.google.com/linux/direct/google-chrome-stable_current_amd64.deb && apt install ./google-chrome-stable_current_amd64.deb


--2023-08-08 23:17:24--  https://dl.google.com/linux/direct/google-chrome-stable_current_amd64.deb
Resolving dl.google.com (dl.google.com)... 64.233.188.136, 64.233.188.93, 64.233.188.190, ...
Connecting to dl.google.com (dl.google.com)|64.233.188.136|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 95154560 (91M) [application/x-debian-package]
Saving to: ‘google-chrome-stable_current_amd64.deb’


2023-08-08 23:17:25 (128 MB/s) - ‘google-chrome-stable_current_amd64.deb’ saved [95154560/95154560]

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
Note, selecting 'google-chrome-stable' instead of './google-chrome-stable_current_amd64.deb'
The following additional packages will be installed:
  libu2f-udev libvulkan1 mesa-vulkan-drivers
The following NEW packages will be installed:
  google-chrome-stable libu2f-udev libvulkan1 mesa-vulkan-drivers
0 upgraded, 4 newly installed, 0 to remove and 16 not upgraded.
Need to g

In [None]:
! apt-get install -y chromium-browser

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
chromium-browser is already the newest version (1:85.0.4183.83-0ubuntu2.22.04.1).
chromium-browser set to manually installed.
0 upgraded, 0 newly installed, 0 to remove and 16 not upgraded.


In [None]:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.common.exceptions import NoSuchElementException
from selenium.common.exceptions import StaleElementReferenceException
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions
from bs4 import BeautifulSoup

In [None]:
# configure Chrome to run headless (no display is available in Colab)
options = webdriver.ChromeOptions()
options.add_argument('--headless')
options.add_argument('--no-sandbox')
options.add_argument('--disable-dev-shm-usage')
options.add_argument("lang=en")
options.add_argument("start-maximized")
options.add_argument("disable-infobars")
options.add_argument("--disable-extensions")
options.add_argument("--incognito")
options.add_argument("--disable-blink-features=AutomationControlled")

driver = webdriver.Chrome(options=options)

In [None]:
url = "https://baltoslav.eu/ipa/index.php?mova=en&j=ap&t"
driver.get(url)

In [None]:
def get_ipa(sentence):
  # find the input box, empty it and type in the sentence
  input_box = driver.find_element(By.ID, 'wpis')
  input_box.clear()
  input_box.send_keys(sentence)
  # press enter on the submit button
  driver.find_element(By.CLASS_NAME, "guzik").send_keys(Keys.ENTER)
  # wait up to some_timeout seconds for the output box to be present
  some_timeout = 0.5
  ignored_exceptions = (NoSuchElementException, StaleElementReferenceException)
  output_box = WebDriverWait(driver, some_timeout, ignored_exceptions=ignored_exceptions)\
                        .until(expected_conditions.presence_of_element_located((By.ID, "izid")))
  # get the finished transcription from the output box
  ipa_transcription = output_box.get_attribute("value")
  return ipa_transcription
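
`presence_of_element_located` succeeds as soon as the output element exists in the DOM, so it does not by itself guarantee that the new transcription has arrived. If stale results ever show up, one option is to poll until the box content changes. The helper below is a hypothetical sketch, not part of Selenium, and it assumes that a finished transcription is non-empty and differs from the previous one:

```python
import time
from typing import Callable


def wait_for_new_value(read_value: Callable[[], str], previous: str,
                       timeout: float = 5.0, poll: float = 0.1) -> str:
    """Poll read_value() until it returns a non-empty string other than previous."""
    deadline = time.monotonic() + timeout
    while True:
        value = read_value()
        if value and value != previous:
            return value
        if time.monotonic() >= deadline:
            raise TimeoutError("output did not change before the timeout")
        time.sleep(poll)
```

In this notebook, `read_value` could be something like `lambda: driver.find_element(By.ID, "izid").get_attribute("value")`, with `previous` set to the transcription returned for the preceding sentence. Two consecutive sentences with identical transcriptions would time out under this assumption, so the caller would need to handle that case.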

In [None]:
# prepare a dictionary that maps each sentence to its file name
name_sentence_dict = {key: value for key, value in zip(sentences, file_names)}

In [None]:
# write the transcription in a text file with the name of the audio
for sentence in sentences:
  ipa = get_ipa(sentence)
  name = name_sentence_dict[sentence]
  path_name = f"/content/{name}"
  with open(path_name, 'w', encoding='utf-8') as txt_file:
    txt_file.write(ipa)
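
Because the dictionary is keyed by sentence, a sentence that occurs more than once keeps only its last file name, so an earlier clip with the same text would get no transcription file. A sketch of an alternative that pairs the two lists directly and skips files already written, so an interrupted run can be resumed. The function name and the stand-in phonemizer are hypothetical; in the notebook, `get_ipa` would take the phonemizer's place:

```python
import tempfile
from pathlib import Path
from typing import Callable, Iterable


def write_transcriptions(pairs: Iterable[tuple[str, str]], out_dir: Path,
                         phonemize: Callable[[str], str]) -> list[str]:
    """Write one IPA file per (sentence, file name) pair; return names written."""
    written = []
    for sentence, name in pairs:
        target = out_dir / name
        if target.exists():
            continue  # already transcribed in an earlier run
        target.write_text(phonemize(sentence), encoding="utf-8")
        written.append(name)
    return written


# Demonstration with a stand-in phonemizer instead of the browser session.
with tempfile.TemporaryDirectory() as tmp:
    demo_pairs = [("same text", "clip_1.txt"), ("same text", "clip_2.txt")]
    demo_written = write_transcriptions(demo_pairs, Path(tmp), str.upper)
    demo_content = (Path(tmp) / "clip_2.txt").read_text(encoding="utf-8")
```

With the notebook's variables, the call would look like `write_transcriptions(zip(sentences, file_names), Path("/content"), get_ipa)`.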

In [None]:
# zip it up for download
!zip -r /content/file.zip /content

  adding: content/ (stored 0%)
  adding: content/.config/ (stored 0%)
  adding: content/.config/active_config (stored 0%)
  adding: content/.config/.last_survey_prompt.yaml (stored 0%)
  adding: content/.config/.last_update_check.json (deflated 25%)
  adding: content/.config/.last_opt_in_prompt.yaml (stored 0%)
  adding: content/.config/config_sentinel (stored 0%)
  adding: content/.config/logs/ (stored 0%)
  adding: content/.config/logs/2023.08.07/ (stored 0%)
  adding: content/.config/logs/2023.08.07/13.30.52.136078.log (deflated 91%)
  adding: content/.config/logs/2023.08.07/13.32.22.654937.log (deflated 56%)
  adding: content/.config/logs/2023.08.07/13.31.19.080579.log (deflated 58%)
  adding: content/.config/logs/2023.08.07/13.31.54.146752.log (deflated 57%)
  adding: content/.config/logs/2023.08.07/13.32.21.790778.log (deflated 57%)
  adding: content/.config/logs/2023.08.07/13.31.45.305961.log (deflated 86%)
  adding: content/.config/default_configs.db (deflated 98%)
  adding: co
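
As the listing shows, zipping all of `/content` also picks up Colab's `.config` directory. A sketch that archives only the transcription files using the standard `zipfile` module; the function name and the demo files are hypothetical:

```python
import tempfile
import zipfile
from pathlib import Path


def zip_transcriptions(src_dir: Path, archive_path: Path) -> list[str]:
    """Add every .txt file directly inside src_dir to a new zip archive."""
    names = []
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for txt_path in sorted(src_dir.glob("*.txt")):
            # arcname stores the bare file name instead of the full path
            archive.write(txt_path, arcname=txt_path.name)
            names.append(txt_path.name)
    return names


# Demonstration on a temporary directory with one transcript and one other file.
with tempfile.TemporaryDirectory() as tmp:
    tmp_dir = Path(tmp)
    (tmp_dir / "a.txt").write_text("ipa", encoding="utf-8")
    (tmp_dir / "b.log").write_text("noise", encoding="utf-8")
    demo_names = zip_transcriptions(tmp_dir, tmp_dir / "file.zip")
    with zipfile.ZipFile(tmp_dir / "file.zip") as archive:
        demo_listing = archive.namelist()
```

In the notebook this would be called as `zip_transcriptions(Path("/content"), Path("/content/file.zip"))` before the download cell.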

In [None]:
# download
from google.colab import files
files.download("/content/file.zip")
