## Wav2vec Manifest File Preparation

Clone fairseq

In [None]:
!git clone https://github.com/pytorch/fairseq
%cd fairseq
!pip install --editable ./

Mount drive

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


Install soundfile

In [None]:
!pip install soundfile

Define the path to the audio folder and the path to the folder where the manifest files will be written. This assumes you have downloaded the necessary fairseq files and created a folder for the manifest (named `manifest_file` in the path below). You will need to change these two paths accordingly.

In [None]:
wavs_path  = r"/content/drive/MyDrive/fairseq/1h"
manifest   = r"/content/drive/MyDrive/fairseq/manifest_file" 

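Use wav2vec_manifest.py to generate the tsv files. `--valid-percent` is the fraction of the data (a value between 0 and 1) held out for validation during training, so 0.1 holds out 10%. Set `--ext` to match the extension of your audio files.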

In [None]:
!python examples/wav2vec/wav2vec_manifest.py $wavs_path --dest $manifest --ext flac --valid-percent 0.1
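
Each tsv file written by wav2vec_manifest.py starts with the root audio directory on the first line, followed by one line per audio file holding the relative path and the number of audio frames, separated by a tab. As a quick sanity check, the sketch below parses that layout. The helper name `read_manifest`, the sample file names and frame counts, and the 16 kHz sample rate are assumptions made for illustration, not output from a real run.

```python
import io

def read_manifest(lines):
    """Parse manifest lines into (root_dir, [(relative_path, num_frames), ...])."""
    lines = [ln.rstrip("\n") for ln in lines if ln.strip()]
    root = lines[0]  # first line: root directory of the audio files
    entries = []
    for ln in lines[1:]:
        rel_path, frames = ln.split("\t")
        entries.append((rel_path, int(frames)))
    return root, entries

# Made-up sample content; with a real file, pass open(train_dir) instead.
sample = io.StringIO(
    "/content/drive/MyDrive/fairseq/1h\n"
    "speaker1/utt1.flac\t52400\n"
    "speaker1/utt2.flac\t31200\n"
)
root, entries = read_manifest(sample)

# Total duration in seconds, assuming 16 kHz audio (an assumption).
total_seconds = sum(n for _, n in entries) / 16000
```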

Build the paths to the new train.tsv and valid.tsv files

Note: if the original model was trained on Librispeech, you will need to rename valid.tsv to dev_other.tsv, since that is the naming scheme used by some Librispeech scripts. The same applies to valid.wrd and valid.ltr later on.

In [None]:
train_dir  = f"{manifest}/train.tsv"
valid_dir  = f"{manifest}/valid.tsv"

In [None]:
train_dir

Use libri_labels.py to generate the label files (.wrd and .ltr) from Librispeech data

In [None]:
!python examples/wav2vec/libri_labels.py $train_dir --output-dir $manifest --output-name train
!python examples/wav2vec/libri_labels.py $valid_dir --output-dir $manifest --output-name valid
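
For reference, the .wrd file holds one word-level transcript per line, and the .ltr file holds the same transcript spelled out letter by letter, with `|` marking word boundaries. A minimal sketch of that conversion, using a hypothetical helper name `words_to_letters` and a made-up transcript:

```python
def words_to_letters(transcript):
    """Convert a word-level transcript into the letter-level .ltr format."""
    # Spaces become "|" word boundaries, every symbol is separated by a
    # space, and a final "|" closes the last word.
    return " ".join(list(transcript.replace(" ", "|"))) + " |"

wrd_line = "HE HAD ALSO"                # one line as it would appear in train.wrd
ltr_line = words_to_letters(wrd_line)   # matching line for train.ltr
print(ltr_line)  # H E | H A D | A L S O |
```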

Or use a modified version of libri_labels.py to get labels from personal data

*Note: in the following cells you will need to replace* path-to-script *with the location where you saved these scripts.*

In [None]:
!python /content/drive/MyDrive/path-to-script/modLibri_labels.py $train_dir --output-dir $manifest --output-name train
!python /content/drive/MyDrive/path-to-script/modLibri_labels.py $valid_dir --output-dir $manifest --output-name valid

Create a lexicon from train.wrd and valid.wrd

In [None]:
train_wrd = f"{manifest}/train.wrd"
valid_wrd = f"{manifest}/valid.wrd"

In [None]:
!python /content/drive/MyDrive/path-to-script/wav2vec2_lexicon.py --train_dir $train_wrd --valid_dir $valid_wrd --output_dir $manifest

lexicon: ['CAREERING\t C A R E E R I N G |', 'HOMEWARD\t H O M E W A R D |', 'AND\t A N D |', 'THROUGH\t T H R O U G H |', 'THIS\t T H I S |', 'CHANNEL\t C H A N N E L |', 'OF\t O F |', 'POVERTY\t P O V E R T Y |', 'INACTION\t I N A C T I O N |', 'THE\t T H E |', 'CONTINENT\t C O N T I N E N T |', 'SPED\t S P E D |', 'ITS\t I T S |', 'WEALTH\t W E A L T H |', 'INDUSTRY\t I N D U S T R Y |', 'NOW\t N O W |', 'AGAIN\t A G A I N |', 'CLUMPS\t C L U M P S |', 'PEOPLE\t P E O P L E |', 'RAISED\t R A I S E D |', 'CHEER\t C H E E R |', 'GRATEFULLY\t G R A T E F U L L Y |', 'OPPRESSED\t O P P R E S S E D |', 'THEIR\t T H E I R |', 'SYMPATHY\t S Y M P A T H Y |', 'HOWEVER\t H O W E V E R |', 'WAS\t W A S |', 'FOR\t F O R |', 'BLUE\t B L U E |', 'CARS\t C A R S |', 'HE\t H E |', 'HAD\t H A D |', 'ALSO\t A L S O |', 'BEEN\t B E E N |', 'FORTUNATE\t F O R T U N A T E |', 'ENOUGH\t E N O U G H |', 'TO\t T O |', 'SECURE\t S E C U R E |', 'SOME\t S O M E |', 'POLICE\t P O L I C E |', 'CONTRACTS\t C O
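
The output above suggests that each lexicon entry is a word, a tab, and the word's letters separated by spaces and terminated by `|`. The wav2vec2_lexicon.py script itself is not shown here, so the following is only a sketch of how such entries could be built from .wrd lines; the function name `build_lexicon` and the sample lines are assumptions.

```python
def build_lexicon(wrd_lines):
    """Build lexicon entries of the form 'WORD\\t W O R D |' from transcripts."""
    spellings = {}
    for line in wrd_lines:
        for word in line.split():
            if word not in spellings:  # keep the first occurrence only
                spellings[word] = " ".join(word) + " |"
    # A space follows the tab to mirror the entries shown in the output above.
    return [f"{word}\t {spelling}" for word, spelling in spellings.items()]

# Made-up transcript lines standing in for train.wrd and valid.wrd
lexicon = build_lexicon(["AND THE", "THE CARS"])
for entry in lexicon:
    print(entry)
```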

Get letter counts for letter dictionary (to create dict.ltr.txt)

In [None]:
!python /content/drive/MyDrive/path-to-script/ltr_counter.py --train_dir $train_wrd --valid_dir $valid_wrd --output_dir $manifest
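
A fairseq dictionary file such as dict.ltr.txt lists one symbol per line followed by its count. The ltr_counter.py script is not shown here, so the sketch below only illustrates the idea under the assumption that every letter is counted and that each word contributes one `|` boundary symbol; the function names and sample lines are hypothetical.

```python
from collections import Counter

def count_letters(wrd_lines):
    """Count letters across transcripts, plus one '|' boundary per word."""
    counts = Counter()
    for line in wrd_lines:
        for word in line.split():
            counts.update(word)   # one count per character of the word
            counts["|"] += 1      # word-boundary symbol used in .ltr files
    return counts

def format_dict(counts):
    """Format counts as '<symbol> <count>' lines, most frequent first."""
    return [f"{symbol} {n}" for symbol, n in counts.most_common()]

# Made-up transcript lines standing in for train.wrd and valid.wrd
counts = count_letters(["HE HAD", "HE"])
dict_lines = format_dict(counts)
```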

Voila! You are done. The manifest folder should now contain the following files:

*   train.tsv
*   train.wrd
*   train.ltr
*   valid.tsv (or dev_other.tsv)
*   valid.wrd (or dev_other.wrd)
*   valid.ltr (or dev_other.ltr)
*   dict.ltr.txt
*   lexicon.txt
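
If the validation files need the dev_other names mentioned earlier, they can be renamed in one pass once all of them exist. This is a minimal sketch; the helper name `rename_valid_split` is hypothetical, and it only touches the three valid.* files that are present.

```python
from pathlib import Path

def rename_valid_split(manifest_dir, new_name="dev_other"):
    """Rename valid.tsv, valid.wrd and valid.ltr to <new_name>.<ext>."""
    renamed = []
    for ext in ("tsv", "wrd", "ltr"):
        src = Path(manifest_dir) / f"valid.{ext}"
        if src.exists():  # skip files that have not been generated
            dst = src.with_name(f"{new_name}.{ext}")
            src.rename(dst)
            renamed.append(dst.name)
    return renamed

# Usage in the notebook (uncomment to run against the manifest folder):
# rename_valid_split(manifest)
```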

