Library for speech processing with Wav2Vec2 models: https://github.com/jonatasgrosman/huggingsound <br>
API of the Russian National Corpus: https://github.com/kunansy/RNC <br>
Large Wav2Vec2 model without fine-tuning, pretrained on unlabeled speech in 53 languages (Common Voice among other datasets): "facebook/wav2vec2-large-xlsr-53"<br>
Fine-tuned Wav2Vec2 model, most probably of no use to us: its token set consists of characters of the Cyrillic alphabet, whereas we also want tokens that mark lexical stress: "jonatasgrosman/wav2vec2-large-xlsr-53-russian"<br>
Training arguments (might be useful for performing ablation studies): https://huggingface.co/transformers/v4.4.2/_modules/transformers/training_args.html

# Mount Google Drive
Mounting Google Drive is necessary for working with the files saved there. I have shared the folder containing the RNC data with you. To work with it, open "Shared with me" in your Google Drive, right-click the RussianNationalCorpus folder and choose "Add shortcut to Drive".

In [1]:
from google.colab import drive
drive.mount("/content/drive")
import os
DATA_PATH = "/content/drive/MyDrive/RussianNationalCorpus"
DOWNLOADING_PATH = os.path.join(DATA_PATH, 'download_examples.py')
TRAINING_PATH = os.path.join(DATA_PATH, 'run_training.py')
EVALUATION_PATH = os.path.join(DATA_PATH, 'evaluate_model.py')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
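
If the shortcut was not added, the paths defined above will not resolve and the later cells fail with a less obvious error. A minimal sketch of a sanity check, assuming the three scripts sit directly in the shared folder as the cell above implies (`find_missing` is a hypothetical helper, not part of any library):

```python
import os

def find_missing(paths):
    """Return the subset of paths that do not exist on disk."""
    return [p for p in paths if not os.path.exists(p)]

# Assumed layout: the three scripts sit directly in the shared folder.
DATA_PATH = "/content/drive/MyDrive/RussianNationalCorpus"
SCRIPTS = [
    os.path.join(DATA_PATH, name)
    for name in ("download_examples.py", "run_training.py", "evaluate_model.py")
]

missing = find_missing([DATA_PATH] + SCRIPTS)
if missing:
    print("Not found (is the Drive shortcut in place?):")
    for path in missing:
        print("  ", path)
else:
    print("All expected paths are present.")
```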


# pip install the rnc and huggingsound libraries
Attention: after installing the huggingsound library, you will most probably need to edit one of its files, trainer.py. In Colab it is located at /usr/local/lib/python3.7/dist-packages/huggingsound/trainer.py (to find the usr folder, open the file browser in the left panel and click the two dots above the sample_data folder to reach the root directory). The change consists of replacing self.use_amp with self.use_cuda_amp on lines 434 and 451.

After making the changes, you need to reinstall the library. In Colab, restart the runtime first and then run the pip install command again.

Making the changes is not required if you only want to run the evaluation.
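
The manual edit described above could also be scripted. The following is only a sketch: `patch_lines` and `patch_file` are hypothetical helpers, and the line numbers 434 and 451 are taken from the note above, so they may not match other versions of the library. It is worth inspecting those lines before overwriting the file.

```python
def patch_lines(text, line_numbers, old, new):
    """Replace `old` with `new` only on the given 1-based line numbers."""
    lines = text.splitlines(keepends=True)
    for number in line_numbers:
        index = number - 1
        if 0 <= index < len(lines):
            lines[index] = lines[index].replace(old, new)
    return "".join(lines)

def patch_file(path, line_numbers, old="self.use_amp", new="self.use_cuda_amp"):
    """Rewrite `path` in place; return True if anything changed."""
    with open(path, encoding="utf-8") as handle:
        original = handle.read()
    patched = patch_lines(original, line_numbers, old, new)
    if patched != original:
        with open(path, "w", encoding="utf-8") as handle:
            handle.write(patched)
    return patched != original

# Small demonstration on an in-memory string: only line 1 is touched.
sample = "a = self.use_amp\nb = 1\nc = self.use_amp\n"
print(patch_lines(sample, [1], "self.use_amp", "self.use_cuda_amp"))
```

On Colab the call would then look like `patch_file("/usr/local/lib/python3.7/dist-packages/huggingsound/trainer.py", [434, 451])`, assuming the file sits at that location. Running it a second time changes nothing, because the replaced text no longer contains `self.use_amp`.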

In [2]:
!pip install rnc
!pip install huggingsound

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


# Download samples if they are not downloaded yet
We may need the nest_asyncio library here: without it, Colab has problems managing nested asynchronous event loops.

In [None]:
import nest_asyncio
nest_asyncio.apply()
!python $DOWNLOADING_PATH

# Run training
First run the script with --help to see which arguments it accepts

In [5]:
!python $TRAINING_PATH --help

usage: run_training.py [-h] [--data_path DATA_PATH] [--train_data TRAIN_DATA]
                       [--eval_data EVAL_DATA] [--use_model_from_the_web]
                       [--model_folder MODEL_FOLDER]
                       [--model_web_location MODEL_WEB_LOCATION]
                       [--output_dir OUTPUT_DIR] [--not_overwrite_output_dir]
                       [--num_train_epochs NUM_TRAIN_EPOCHS]

optional arguments:
  -h, --help            show this help message and exit
  --data_path DATA_PATH
                        Path to the folder containing the csv file and the
                        model folder. Defaults to empty string '' which might
                        work if data sits in the current working directory. If
                        it doesn't work, then you need to specify the path.
  --train_data TRAIN_DATA
                        Name of the csv file with the training data, needs to
                        be located in the data path.
  --eval_data EVAL_DATA
 

In [None]:
!python $TRAINING_PATH --data_path $DATA_PATH

06/20/2022 00:50:10 - INFO - huggingsound.speech_recognition.model - Loading model...
06/20/2022 00:50:15 - INFO - rnc - Requested to 'https://processing.ruscorpora.ru/search.xml' [0;1) with params {'env': 'alpha', 'api': '1.0', 'lang': 'ru', 'dpp': 5, 'spd': 10, 'text': 'lexgramm', 'out': 'normal', 'sort': 'i_grtagging', 'nodia': 0, 'lex1': 'и', 'mode': 'murco'}
06/20/2022 00:50:15 - DEBUG - rnc - Worker-1: Requested to 'https://processing.ruscorpora.ru/search.xml' with '{'env': 'alpha', 'api': '1.0', 'lang': 'ru', 'dpp': 5, 'spd': 10, 'text': 'lexgramm', 'out': 'normal', 'sort': 'i_grtagging', 'nodia': 0, 'lex1': 'и', 'mode': 'murco', 'p': 0}'
06/20/2022 00:50:17 - DEBUG - rnc - Worker-1: Received from 'https://processing.ruscorpora.ru/search.xml' with '{'env': 'alpha', 'api': '1.0', 'lang': 'ru', 'dpp': 5, 'spd': 10, 'text': 'lexgramm', 'out': 'normal', 'sort': 'i_grtagging', 'nodia': 0, 'lex1': 'и', 'mode': 'murco', 'p': 0}'
06/20/2022 00:50:17 - INFO - rnc - Request was successful
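
For ablation studies it can help to generate one training command per setting instead of editing the cell by hand. The sketch below uses only flags that appear in the usage line above (`--data_path`, `--output_dir`, `--num_train_epochs`); the epoch values and output folder names are hypothetical, and how the script resolves `--output_dir` is not shown in the truncated help text, so treat that part as an assumption.

```python
import shlex

def training_command(script, data_path, output_dir, num_train_epochs):
    """Build the argument list for one run_training.py invocation."""
    return [
        "python", script,
        "--data_path", data_path,
        "--output_dir", output_dir,
        "--num_train_epochs", str(num_train_epochs),
    ]

DATA_PATH = "/content/drive/MyDrive/RussianNationalCorpus"
TRAINING_PATH = DATA_PATH + "/run_training.py"

# Hypothetical ablation over the number of epochs.
for epochs in (1, 3, 5):
    command = training_command(
        TRAINING_PATH, DATA_PATH, f"model_epochs_{epochs}", epochs
    )
    # Print a shell-safe version of the command for inspection.
    print(" ".join(shlex.quote(part) for part in command))
```

Each list can be passed to `subprocess.run(command, check=True)` to launch the run from Python rather than through a `!python` cell.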

# Evaluate the model
First run the script with --help to see which arguments it accepts

In [4]:
!python $EVALUATION_PATH --help

usage: evaluate_model.py [-h] [--transcribe_sents TRANSCRIBE]
                         [--evaluate_sents EVALUATE] [--data_path DATA_PATH]
                         [--examples_basename BASENAME]
                         [--model_folder MODEL_FOLDER]

optional arguments:
  -h, --help            show this help message and exit
  --transcribe_sents TRANSCRIBE
                        How many sentences to transcribe
  --evaluate_sents EVALUATE
                        On how many sentences to run the evaluation
  --data_path DATA_PATH
                        Path to the folder containing the csv file and the
                        model folder. Defaults to empty string '' which might
                        work if data sits in the current working directory. If
                        it doesn't work, then you need to specify the path.
  --examples_basename BASENAME
                        Name of the csv file, needs to be located in the data
                        path.
  --model_folder 

In [3]:
!python $EVALUATION_PATH --transcribe_sents 5 --data_path $DATA_PATH

06/21/2022 14:43:50 - INFO - huggingsound.speech_recognition.model - Loading model...
06/21/2022 14:43:57 - INFO - rnc - Requested to 'https://processing.ruscorpora.ru/search.xml' [0;1) with params {'env': 'alpha', 'api': '1.0', 'lang': 'ru', 'dpp': 5, 'spd': 10, 'text': 'lexgramm', 'out': 'normal', 'sort': 'i_grtagging', 'nodia': 0, 'lex1': 'и', 'mode': 'murco'}
06/21/2022 14:43:57 - DEBUG - rnc - Worker-1: Requested to 'https://processing.ruscorpora.ru/search.xml' with '{'env': 'alpha', 'api': '1.0', 'lang': 'ru', 'dpp': 5, 'spd': 10, 'text': 'lexgramm', 'out': 'normal', 'sort': 'i_grtagging', 'nodia': 0, 'lex1': 'и', 'mode': 'murco', 'p': 0}'
06/21/2022 14:43:57 - DEBUG - rnc - Worker-1: Received from 'https://processing.ruscorpora.ru/search.xml' with '{'env': 'alpha', 'api': '1.0', 'lang': 'ru', 'dpp': 5, 'spd': 10, 'text': 'lexgramm', 'out': 'normal', 'sort': 'i_grtagging', 'nodia': 0, 'lex1': 'и', 'mode': 'murco', 'p': 0}'
06/21/2022 14:43:57 - INFO - rnc - Request was successful