# CoVoST 2 dataset

<a target="_blank" href="https://colab.research.google.com/github/shreyjasuja/re_s2st/blob/main/download_covost2.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

CoVoST 2 is a large-scale multilingual speech translation corpus covering translations from 21 languages into English and from English into 15 languages. The dataset is built from Mozilla's open-source Common Voice database of crowdsourced voice recordings, and the corpus represents 2,900 hours of speech.

## Download the dataset

The dataset is available through the Hugging Face **datasets** library, but the audio files for each source language must be downloaded manually from the [Common Voice database](https://commonvoice.mozilla.org/en/datasets). As stated on its [GitHub page](https://github.com/facebookresearch/covost?tab=readme-ov-file#covost-2), CoVoST 2 covers the following translation directions:

**X into English:** French, German, Spanish, Catalan, Italian, Russian, Chinese, Portuguese, Persian, Estonian, Mongolian, Dutch, Turkish, Arabic, Swedish, Latvian, Slovenian, Tamil, Japanese, Indonesian, Welsh

**English into X:** German, Catalan, Chinese, Persian, Estonian, Mongolian, Turkish, Arabic, Swedish, Latvian, Slovenian, Tamil, Japanese, Indonesian, Welsh
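Because English is itself the source language for the English-into-X direction, audio is needed for 22 Common Voice languages in total. The sketch below tallies them. The language codes are an assumption (they follow the style used by CoVoST and Common Voice) and should be checked against the Common Voice site before use.

```python
# Language codes are assumed, not taken from this notebook; verify before use.
X_TO_EN = {
    "fr": "French", "de": "German", "es": "Spanish", "ca": "Catalan",
    "it": "Italian", "ru": "Russian", "zh-CN": "Chinese", "pt": "Portuguese",
    "fa": "Persian", "et": "Estonian", "mn": "Mongolian", "nl": "Dutch",
    "tr": "Turkish", "ar": "Arabic", "sv-SE": "Swedish", "lv": "Latvian",
    "sl": "Slovenian", "ta": "Tamil", "ja": "Japanese", "id": "Indonesian",
    "cy": "Welsh",
}
EN_TO_X = ["de", "ca", "zh-CN", "fa", "et", "mn", "tr", "ar", "sv-SE",
           "lv", "sl", "ta", "ja", "id", "cy"]

# Every X-into-English source language, plus English for English-into-X.
source_languages = sorted(set(X_TO_EN) | {"en"})
print(len(X_TO_EN), len(EN_TO_X), len(source_languages))
```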

We therefore need to download the voice data from the [Common Voice database](https://commonvoice.mozilla.org/en/datasets) for each source language listed above. Follow these steps:

1. Download the helper shell script that fetches the audio files, and make it executable with `chmod +x download_data.sh`.




In [1]:
!wget https://raw.githubusercontent.com/shreyjasuja/re_s2st/main/scripts/download_data.sh
!chmod +x download_data.sh

--2024-04-03 01:57:20--  https://raw.githubusercontent.com/shreyjasuja/re_s2st/main/scripts/download_data.sh
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 911 [text/plain]
Saving to: ‘download_data.sh’


2024-04-03 01:57:20 (27.0 MB/s) - ‘download_data.sh’ saved [911/911]



2. On the Common Voice datasets page, select the source language.

3. Choose **Common Voice Corpus 4** as the version.

4. Enter your email address, then read and accept the terms and conditions.

5. Do not click the **Download Dataset Bundle** button; instead, copy its link.

6. Run `download_data.sh` with the copied URL as its argument: \
`!./download_data.sh "<link_to_download>"`

**Note**: `download_data.sh` saves the data in a specific directory structure based on language codes. Please do not modify this shell script, since the inference code relies on that layout.

In [2]:
# Example: this link was valid when the notebook was created; signed links expire, so copy a fresh one
!./download_data.sh "https://storage.googleapis.com/common-voice-prod-prod-datasets/cv-corpus-4-2019-12-10/nl.tar.gz?X-Goog-Algorithm=GOOG4-RSA-SHA256&X-Goog-Credential=gke-prod%40moz-fx-common-voice-prod.iam.gserviceaccount.com%2F20240403%2Fauto%2Fstorage%2Fgoog4_request&X-Goog-Date=20240403T000757Z&X-Goog-Expires=43200&X-Goog-SignedHeaders=host&X-Goog-Signature=22e9073d7737d426b14fc059d289e1e08b017fdd0b71b3079b6a9a639700b3d4a4db9df36a32c11d6591369e8dd0ff8442b6a4de00f5e04d776e1c7e0a199329f571cb9a6e46ed6735e76b3a3a4a21ee91060a36890b9806a6016e3a2b6245a9b7bb44ba5e1945b6b555bb1fd72e118fc12ecdfe8df7f2dd936b453a123879af1581691825677ceff1ab2e0e0396b225b793c4f1993463b432295e17f56747c53f8bd154b07cf67980a0af9660023f0abd0e4679fa1a8dcea4e8a2acbed0bdf111f774697c340ed1c910be552b92cb2c8fe87200fefabb787844266ca3595f060aebcd9e75438591301965d97fddd0d7434118ca7e7771bf9a3a02685774477a"

--2024-04-03 01:57:26--  https://storage.googleapis.com/common-voice-prod-prod-datasets/cv-corpus-4-2019-12-10/nl.tar.gz?X-Goog-Algorithm=GOOG4-RSA-SHA256&X-Goog-Credential=gke-prod%40moz-fx-common-voice-prod.iam.gserviceaccount.com%2F20240403%2Fauto%2Fstorage%2Fgoog4_request&X-Goog-Date=20240403T000757Z&X-Goog-Expires=43200&X-Goog-SignedHeaders=host&X-Goog-Signature=22e9073d7737d426b14fc059d289e1e08b017fdd0b71b3079b6a9a639700b3d4a4db9df36a32c11d6591369e8dd0ff8442b6a4de00f5e04d776e1c7e0a199329f571cb9a6e46ed6735e76b3a3a4a21ee91060a36890b9806a6016e3a2b6245a9b7bb44ba5e1945b6b555bb1fd72e118fc12ecdfe8df7f2dd936b453a123879af1581691825677ceff1ab2e0e0396b225b793c4f1993463b432295e17f56747c53f8bd154b07cf67980a0af9660023f0abd0e4679fa1a8dcea4e8a2acbed0bdf111f774697c340ed1c910be552b92cb2c8fe87200fefabb787844266ca3595f060aebcd9e75438591301965d97fddd0d7434118ca7e7771bf9a3a02685774477a
Resolving storage.googleapis.com (storage.googleapis.com)... 142.251.16.207, 172.253.62.207, 142.251.163.207, ...
Con

Repeat the steps above for every source language. Once that is done, a `data` folder will exist with one sub-directory per language code, and each sub-directory will contain a compressed `tar.gz` file.
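To confirm which languages have already been downloaded, the layout described above can be inspected with a few lines of Python. This is a minimal sketch that assumes the `data/<language code>/*.tar.gz` structure; the function name `downloaded_archives` is hypothetical and not part of the repository.

```python
from pathlib import Path

def downloaded_archives(data_dir="data"):
    """Map each language sub-directory to the .tar.gz files found inside it."""
    root = Path(data_dir)
    if not root.is_dir():
        return {}
    return {
        lang_dir.name: sorted(p.name for p in lang_dir.glob("*.tar.gz"))
        for lang_dir in sorted(root.iterdir())
        if lang_dir.is_dir()
    }

# Print one line per language directory, flagging any that lack an archive.
for lang, archives in downloaded_archives().items():
    print(lang, archives or "no archive found")
```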

## Persist the data on Chameleon Object Store containers

Because the download links for the source audio carry signed authentication that expires, we persist the downloaded data in a Chameleon [Object Store](https://chameleoncloud.readthedocs.io/en/latest/technical/swift.html#id2) container to avoid downloading it again. These containers can be accessed from any bare-metal instance on Chameleon.

Persisting the data involves two steps:

1. Initialize the container and upload the data.
2. Download the data for subsequent experiments.

Only step 1 is performed here. Step 2 is performed on whichever server needs access to the data.
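The expiry of a download link can be read from the link itself: in Google Cloud Storage signed URLs, `X-Goog-Date` is the time the link was issued and `X-Goog-Expires` is its lifetime in seconds. The sketch below uses a shortened form of the example link from the download step (signature omitted); the helper name `signed_url_expiry` is hypothetical.

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import urlparse, parse_qs

def signed_url_expiry(url):
    """Return the UTC time at which a signed URL stops working."""
    query = parse_qs(urlparse(url).query)
    issued = datetime.strptime(query["X-Goog-Date"][0], "%Y%m%dT%H%M%SZ")
    issued = issued.replace(tzinfo=timezone.utc)
    return issued + timedelta(seconds=int(query["X-Goog-Expires"][0]))

# Shortened form of the example link; only the two relevant parameters are kept.
url = ("https://storage.googleapis.com/common-voice-prod-prod-datasets/"
       "cv-corpus-4-2019-12-10/nl.tar.gz"
       "?X-Goog-Date=20240403T000757Z&X-Goog-Expires=43200")
expiry = signed_url_expiry(url)
print(expiry.isoformat())
print("expired" if datetime.now(timezone.utc) > expiry else "still valid")
```

For the example link, the lifetime of 43200 seconds means it was valid for 12 hours after it was issued.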

### Initialize the container and upload the data

Once the data has been downloaded using the URLs in the cells above, we upload the compressed files directly to our Object Store. We also upload a pre-processing script that extracts the data whenever it is downloaded from the Object Store. The authentication for `openstack` is already defined in the `openrc` file, which is prefetched into the instance when it is built.

In [4]:
from getpass import getpass
import os
import subprocess

# Source the OpenStack credentials, then list the containers in the Object Store.
command = ['bash', '-c', 'source openrc && openstack container list']

proc = subprocess.Popen(command, stdin=subprocess.PIPE, stdout=subprocess.PIPE, stderr=subprocess.PIPE, text=True)
stdout, stderr = proc.communicate()  # stderr holds any error message if the command fails

The captured output should list the containers for your account:

In [5]:
print(stdout)

+-----------------------+
| Name                  |
+-----------------------+
| CoVoST2_data          |
| CoVoST2_data_segments |
+-----------------------+
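The call above captures `stderr` and the exit status but never inspects them, so a failed command produces empty output without explanation. A small wrapper can surface such failures. This is only a sketch: `run_with_env` is a hypothetical helper, and it uses `subprocess.run` in place of `Popen`.

```python
import shlex
import subprocess

def run_with_env(env_file, args):
    """Source env_file in bash, then run args; raise if the command fails."""
    shell_cmd = f"source {shlex.quote(env_file)} && {shlex.join(args)}"
    result = subprocess.run(["bash", "-c", shell_cmd],
                            capture_output=True, text=True)
    if result.returncode != 0:
        raise RuntimeError(result.stderr.strip() or f"command failed: {shell_cmd}")
    return result.stdout

# Usage on the instance (requires the openrc file and the openstack client):
# print(run_with_env("openrc", ["openstack", "container", "list"]))
```

Passing the arguments as a list and quoting them with `shlex.join` also avoids problems if a container name contains spaces or shell metacharacters.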



Since the `data` folder holds compressed audio files, add the shell script that extracts them:

In [6]:
!wget https://raw.githubusercontent.com/shreyjasuja/re_s2st/main/scripts/extract_and_cleanup.sh -P data

--2024-04-03 02:19:18--  https://raw.githubusercontent.com/shreyjasuja/re_s2st/main/scripts/extract_and_cleanup.sh
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.111.133, 185.199.109.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 934 [text/plain]
Saving to: ‘data/extract_and_cleanup.sh.1’


2024-04-03 02:19:18 (26.6 MB/s) - ‘data/extract_and_cleanup.sh.1’ saved [934/934]
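The contents of `extract_and_cleanup.sh` are not reproduced in this notebook, so the following is not that script. It is a hypothetical Python equivalent of what the name suggests: extract each `data/<language code>/*.tar.gz` archive in place, then delete the archive. The function names are assumptions for illustration.

```python
import tarfile
from pathlib import Path

def extract_and_remove(archive_path):
    """Extract a .tar.gz next to itself, then delete the archive."""
    archive = Path(archive_path)
    with tarfile.open(archive, "r:gz") as tar:
        tar.extractall(path=archive.parent)  # only use on trusted archives
    archive.unlink()
    return archive.parent

def extract_all(data_dir="data"):
    """Apply extract_and_remove to every <data_dir>/<lang>/*.tar.gz archive."""
    return [extract_and_remove(p)
            for p in sorted(Path(data_dir).glob("*/*.tar.gz"))]
```

Calling `extract_all("data")` after downloading from the Object Store would leave each language directory with its extracted contents and no archive.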



#### Initialize a container

Use the command `openstack container create <container_name>` in the subprocess call below to create a container.

In [8]:
container_name="covost2"

In [None]:
command = ['bash', '-c', 'source openrc && openstack container create '+container_name]

proc = subprocess.Popen(command, stdin=subprocess.PIPE, stdout=subprocess.PIPE, stderr=subprocess.PIPE, text=True)
stdout, stderr = proc.communicate()
print(stdout)

#### Upload the dataset

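The upload command uses `--changed` to skip files that have not changed since the last upload, and `--segment-size` to upload large files in segments no larger than 4831838208 bytes (4.5 GiB). See Chameleon's documentation on [large object support](https://chameleoncloud.readthedocs.io/en/latest/technical/swift.html#large-object-support).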
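The segment size passed to the upload command below works out to exactly 4.5 GiB. As a quick check of the arithmetic, and to estimate how many pieces a given archive is split into, consider this small sketch (the 20 GiB file size is an illustrative value, not a measured one):

```python
import math

GIB = 1024 ** 3
SEGMENT_SIZE = 4831838208  # value passed to --segment-size in the upload command

def segment_count(file_size_bytes, segment_size=SEGMENT_SIZE):
    """Number of pieces a file is uploaded in (1 means it is not split)."""
    return max(1, math.ceil(file_size_bytes / segment_size))

print(SEGMENT_SIZE / GIB)        # 4.5
print(segment_count(20 * GIB))   # 5
```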

In [None]:
command = ['bash', '-c', 'source openrc && /home/cc/.local/bin/swift --os-auth-type v3applicationcredential upload --changed --segment-size 4831838208 '+container_name+' data/']
# password = getpass("Please enter your password: ")  # Use getpass.getpass() to input this securely as shown above

proc = subprocess.Popen(command, stdin=subprocess.PIPE, stdout=subprocess.PIPE, stderr=subprocess.PIPE, text=True)
stdout, stderr = proc.communicate()

print(stdout)
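Step 2, downloading the data on another instance, is not shown in this notebook. A sketch of the corresponding command, mirroring the upload call above, might look as follows. The helper `swift_download_command` is hypothetical, and the sketch assumes `swift` is on the `PATH` and that an `openrc` file is present on the target instance.

```python
import shlex

def swift_download_command(container, output_dir=".", swift_bin="swift"):
    """Build the shell command that downloads every object in a container."""
    swift = [swift_bin, "--os-auth-type", "v3applicationcredential",
             "download", container, "--output-dir", output_dir]
    return "source openrc && " + shlex.join(swift)

# The resulting string can be passed to ['bash', '-c', ...] as in the cells above.
print(swift_download_command("covost2"))
```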