Library for speech processing with Wav2Vec2 models: https://github.com/jonatasgrosman/huggingsound <br>
API of the Russian National Corpus: https://github.com/kunansy/RNC <br>
Large Wav2Vec2 model, not fine-tuned, pretrained on speech in 53 languages (including the common_voice dataset): "facebook/wav2vec2-large-xlsr-53"<br>
Fine-tuned Wav2Vec2 model, most probably useless for us: its token set contains characters of the Cyrillic alphabet but no tokens that mark lexical stress, which we also want to use: "jonatasgrosman/wav2vec2-large-xlsr-53-russian"<br>
Training arguments (might be useful for performing ablation studies): https://huggingface.co/transformers/v4.4.2/_modules/transformers/training_args.html
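
The objection to the fine-tuned model above comes down to its vocabulary. A minimal sketch of a token list that also carries a stress mark might look like this. It assumes that stress is written as a combining acute accent (U+0301) after the stressed vowel, which these notes do not state, and `build_stress_tokens` is a hypothetical helper.

```python
# Assumption: stress is marked with a combining acute accent placed after the vowel.
STRESS_MARK = "\u0301"


def build_stress_tokens(stress_mark=STRESS_MARK):
    """Return the lowercase Russian letters plus one extra token for lexical stress."""
    # 'а'..'я' is a contiguous block of 32 letters; 'ё' lies outside it.
    letters = [chr(code) for code in range(ord("а"), ord("я") + 1)]
    letters.append("ё")
    return letters + [stress_mark]


tokens = build_stress_tokens()
stressed_word = "во" + STRESS_MARK + "лк"  # 'волк' with the stress on 'о'
# Every character of the stressed word should be covered by the token list.
print(len(tokens), all(char in tokens for char in stressed_word))
```

A list like this could then be handed to huggingsound's `TokenSet` when fine-tuning, as that library's README describes for other alphabets. Whether extra tokens such as an apostrophe or hyphen are needed depends on the transcripts.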

# Mount Google Drive
Mounting Google Drive is necessary for working with the files saved there. I have shared the folder with the RNC data with you. To make it accessible, open "Shared with me" in your Google Drive, right-click the RussianNationalCorpus folder, and choose "Add shortcut to Drive".

In [1]:
from google.colab import drive
drive.mount("/content/drive")
import os
DATA_PATH = "/content/drive/MyDrive/RussianNationalCorpus"

Mounted at /content/drive
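
If the shortcut was not added, the mount still succeeds but the data folder is missing. A small check along these lines can catch that early; `check_data_folder` is a hypothetical helper, not part of the repository.

```python
import os


def check_data_folder(data_path):
    """Return the sorted entries of data_path, or raise if the folder is not visible."""
    if not os.path.isdir(data_path):
        raise FileNotFoundError(
            f"{data_path} not found: mount the drive and add the shortcut first"
        )
    return sorted(os.listdir(data_path))


# Path taken from the cell above; outside Colab it will usually not exist.
DATA_PATH = "/content/drive/MyDrive/RussianNationalCorpus"
try:
    print(check_data_folder(DATA_PATH))
except FileNotFoundError as err:
    print(err)
```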


In [2]:
! git clone https://github.com/vitreusx/stress.git

Cloning into 'stress'...
remote: Enumerating objects: 17, done.
remote: Counting objects: 100% (17/17), done.
remote: Compressing objects: 100% (16/16), done.
remote: Total 17 (delta 5), reused 3 (delta 0), pack-reused 0
Unpacking objects: 100% (17/17), done.


In [15]:
REPO_PATH = '/content/stress'  # folder created by the git clone above
DOWNLOADING_PATH = os.path.join(REPO_PATH, 'download_examples.py')
CLEANING_PATH = os.path.join(REPO_PATH, 'clean_csv.py')
TRAINING_PATH = os.path.join(REPO_PATH, 'run_training.py')
EVALUATION_PATH = os.path.join(REPO_PATH, 'evaluate_model.py')

# Install the rnc and huggingsound libraries
Attention: after installing the huggingsound library, you will most probably need to edit one of its files, trainer.py. In Colab it is located at /usr/local/lib/python3.7/dist-packages/huggingsound/trainer.py (to find the usr folder, open the file browser in the left panel and click the two dots above the sample_data folder to reach the root directory). The change is to replace self.use_amp with self.use_cuda_amp on lines 434 and 451.

After making the changes, you need to reinstall the library: in Colab, restart the runtime first and then run the pip install command again.

Making the changes is not required if you only want to run the evaluation.
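
The manual edit described above could also be scripted. The sketch below is one possible way to do it, not part of huggingsound: `patch_lines` is a hypothetical helper, and the path and line numbers are simply those quoted in the note, so they may not match other library versions.

```python
def patch_lines(path, line_numbers, old="self.use_amp", new="self.use_cuda_amp"):
    """Replace `old` with `new` on the given 1-indexed lines; return the lines changed."""
    with open(path, encoding="utf-8") as handle:
        lines = handle.readlines()
    changed = []
    for number in line_numbers:
        index = number - 1
        # Only touch a line if it exists and still contains the old attribute name.
        if 0 <= index < len(lines) and old in lines[index]:
            lines[index] = lines[index].replace(old, new)
            changed.append(number)
    with open(path, "w", encoding="utf-8") as handle:
        handle.writelines(lines)
    return changed


# Assumed location, taken from the note above; check it in the file browser first.
TRAINER_PATH = "/usr/local/lib/python3.7/dist-packages/huggingsound/trainer.py"
# changed = patch_lines(TRAINER_PATH, [434, 451])
# print(changed)  # an empty list means those lines did not contain self.use_amp
```

Restricting the replacement to the listed lines keeps the patch as narrow as the manual instruction. Running it a second time changes nothing, because the patched lines no longer contain `self.use_amp`.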

In [3]:
!pip install rnc
!pip install huggingsound

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting rnc
  Downloading rnc-0.9.0-py3-none-any.whl (28 kB)
Collecting types-ujson<4.3.0,>=4.2.1
  Downloading types_ujson-4.2.1-py3-none-any.whl (2.0 kB)
Collecting types-aiofiles<0.9.0,>=0.8.4
  Downloading types_aiofiles-0.8.8-py3-none-any.whl (5.8 kB)
Collecting aiofiles<0.9.0,>=0.8.0
  Downloading aiofiles-0.8.0-py3-none-any.whl (13 kB)
Collecting ujson<5.2.0,>=5.1.0
  Downloading ujson-5.1.0-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (43 kB)
     |████████████████████████████████| 43 kB 2.0 MB/s 
Collecting aiohttp<3.9.0,>=3.8.1
  Downloading aiohttp-3.8.1-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (1.1 MB)
     |████████████████████████████████| 1.1 MB 11.4 MB/s 
Collecting beautifulsoup4<4.11.0,>=4.10.0
  Downloading beautifulsoup4-4.10.0-py3-none-any.whl (97 kB)
     |█████████████████████████

# Download samples if they are not downloaded yet
We may need to import the nest_asyncio library and call nest_asyncio.apply(). Without it, Colab has problems managing nested asynchronous event loops.

In [9]:
import nest_asyncio
nest_asyncio.apply()
words_list = ['волк', 'вертолёт']
WORDS = ' '.join(words_list)
!python $DOWNLOADING_PATH --help

usage: download_examples.py [-h] [--data_path DATA_PATH]
                            [--examples_basename BASENAME]
                            [--helper_basename HELPER_NAME]
                            [--media_folder MEDIA_FOLDER]
                            [--pages_per_word PAGES_PER_WORD]
                            words [words ...]

positional arguments:
  words                 List of the words to query the Russian National Corpus
                        for.

optional arguments:
  -h, --help            show this help message and exit
  --data_path DATA_PATH
                        Path to the directory in which the csv file and the
                        media folder should be created. Defaults to empty
                        string '' which might work if data sits in the current
                        working directory. If it doesn't work, then you need
                        to specify the path.
  --examples_basename BASENAME
                        Name of the csv file
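
Based on the usage text above, an actual download call could be assembled roughly as follows. This is a sketch: `build_download_command` is a hypothetical helper, only flags visible in the usage line are used, and the `pages_per_word` value is illustrative because its help text is cut off here.

```python
import shlex


def build_download_command(script_path, words, data_path=None, pages_per_word=None):
    """Assemble the argument list for download_examples.py from the flags in its usage text."""
    command = ["python", script_path]
    if data_path is not None:
        command += ["--data_path", data_path]
    if pages_per_word is not None:
        command += ["--pages_per_word", str(pages_per_word)]
    # The query words are positional arguments and come last.
    return command + list(words)


command = build_download_command(
    "/content/stress/download_examples.py",
    ["волк", "вертолёт"],
    data_path="/content/drive/MyDrive/RussianNationalCorpus",
    pages_per_word=1,  # illustrative value, not a recommendation
)
# Quote each part so the line can be pasted after "!" in a notebook cell.
print(" ".join(shlex.quote(part) for part in command))
```

The same list can be passed to `subprocess.run(command, check=True)` instead of being pasted into a cell.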

In [12]:
!python $CLEANING_PATH --help

  File "/content/stress/clean_csv.py", line 32
    def drop_duplicates_and_overwrite(dataframe)
                                               ^
SyntaxError: invalid syntax
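
The traceback points at line 32 of clean_csv.py, where the `def` line lacks its trailing colon, so the script fails before it can print its help. A quick way to catch this kind of error before running a script is to compile its source first. The sketch below uses the built-in `compile`; `find_syntax_error` is a hypothetical helper and the `pass` body is only a placeholder, not the script's real code.

```python
def find_syntax_error(source, filename="<script>"):
    """Return (line number, message) for the first syntax error, or None if the source parses."""
    try:
        compile(source, filename, "exec")
    except SyntaxError as err:
        return err.lineno, err.msg
    return None


# The failing line from the traceback, followed by a placeholder body.
broken = "def drop_duplicates_and_overwrite(dataframe)\n    pass\n"
# The same line with the missing colon added.
fixed = "def drop_duplicates_and_overwrite(dataframe):\n    pass\n"

print(find_syntax_error(broken))
print(find_syntax_error(fixed))
```

To check the real file, read it and pass its text, for example `find_syntax_error(open(CLEANING_PATH).read(), CLEANING_PATH)`.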


# Run training
First run the script with --help to see which arguments it accepts.

In [13]:
!python $TRAINING_PATH --help

usage: run_training.py [-h] [--data_path DATA_PATH] [--train_data TRAIN_DATA]
                       [--eval_data EVAL_DATA] [--use_model_from_the_web]
                       [--model_folder MODEL_FOLDER]
                       [--model_web_location MODEL_WEB_LOCATION]
                       [--output_dir OUTPUT_DIR] [--not_overwrite_output_dir]
                       [--num_train_epochs NUM_TRAIN_EPOCHS]

optional arguments:
  -h, --help            show this help message and exit
  --data_path DATA_PATH
                        Path to the folder containing the csv file and the
                        model folder. Defaults to empty string '' which might
                        work if data sits in the current working directory. If
                        it doesn't work, then you need to specify the path.
  --train_data TRAIN_DATA
                        Name of the csv file with the training data, needs to
                        be located in the data path. Default: "all_data20" -
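
For ablation studies over the number of epochs, the training calls could be generated rather than typed by hand. This is only a sketch: `ablation_commands` and the output folder naming are hypothetical, the epoch values are illustrative, and how `--output_dir` is resolved relative to the data path is not visible in the truncated help above.

```python
def ablation_commands(script_path, data_path, epoch_values, output_root="ablation"):
    """Build one run_training.py argument list per epoch count, each with its own output folder."""
    commands = []
    for epochs in epoch_values:
        commands.append([
            "python", script_path,
            "--data_path", data_path,
            "--num_train_epochs", str(epochs),
            # Hypothetical naming scheme so that runs do not overwrite each other.
            "--output_dir", f"{output_root}_epochs{epochs}",
        ])
    return commands


runs = ablation_commands(
    "/content/stress/run_training.py",
    "/content/drive/MyDrive/RussianNationalCorpus",
    epoch_values=[1, 3, 5],  # illustrative values only
)
for run in runs:
    print(" ".join(run))
```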
 

# Evaluate the model
First run the script with --help to see which arguments it accepts.

In [16]:
!python $EVALUATION_PATH --help

usage: evaluate_model.py [-h] [--transcribe_sents TRANSCRIBE]
                         [--evaluate_sents EVALUATE] [--data_path DATA_PATH]
                         [--no_csv] [--examples_basename EXAMPLES_BASENAME]
                         [--examples_folder EXAMPLES_FOLDER]
                         [--model_folder MODEL_FOLDER]

optional arguments:
  -h, --help            show this help message and exit
  --transcribe_sents TRANSCRIBE
                        How many sentences to transcribe. 0 means: don't
                        produce transcrtiptions. -1 means: run on all examples
                        from the csv file.
  --evaluate_sents EVALUATE
                        On how many sentences to run the evaluation. 0 means:
                        don't run the evaluation. -1 means: run on all
                        examples from the csv file.
  --data_path DATA_PATH
                        Path to the folder containing the csv file and the
                        model folder.

In [17]:
!python $EVALUATION_PATH --transcribe_sents 5 --evaluate_sents 0 --data_path $DATA_PATH --examples_basename all_data20.csv

06/28/2022 16:40:30 - INFO - huggingsound.speech_recognition.model - Loading model...
06/28/2022 16:40:51 - INFO - rnc - Requested to 'https://processing.ruscorpora.ru/search.xml' [0;1) with params {'env': 'alpha', 'api': '1.0', 'lang': 'ru', 'dpp': 5, 'spd': 10, 'text': 'lexgramm', 'out': 'normal', 'sort': 'i_grtagging', 'nodia': 0, 'lex1': 'и', 'mode': 'murco'}
06/28/2022 16:40:51 - DEBUG - rnc - Worker-1: Requested to 'https://processing.ruscorpora.ru/search.xml' with '{'env': 'alpha', 'api': '1.0', 'lang': 'ru', 'dpp': 5, 'spd': 10, 'text': 'lexgramm', 'out': 'normal', 'sort': 'i_grtagging', 'nodia': 0, 'lex1': 'и', 'mode': 'murco', 'p': 0}'
06/28/2022 16:40:52 - DEBUG - rnc - Worker-1: Received from 'https://processing.ruscorpora.ru/search.xml' with '{'env': 'alpha', 'api': '1.0', 'lang': 'ru', 'dpp': 5, 'spd': 10, 'text': 'lexgramm', 'out': 'normal', 'sort': 'i_grtagging', 'nodia': 0, 'lex1': 'и', 'mode': 'murco', 'p': 0}'
06/28/2022 16:40:52 - INFO - rnc - Request was successful