<a href="https://colab.research.google.com/github/xiaoyufan/speech-data-augmentation/blob/main/train_baseline.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Preprocess data

## Setup

### Create logger

In [1]:
from importlib import reload
import logging
import sys

reload(logging)

LOGGING_LEVEL = 'DEBUG'

logger = logging.getLogger('baseline')
logger.setLevel(LOGGING_LEVEL)

formatter = logging.Formatter('[%(asctime)s - logger %(name)s - %(levelname)s] %(message)s')

ch = logging.StreamHandler(sys.stdout)
ch.setFormatter(formatter)
logger.addHandler(ch)

logger.debug('debug test')
logger.info('info test')

[2020-12-16 14:28:43,347 - logger baseline - DEBUG] debug test
[2020-12-16 14:28:43,348 - logger baseline - INFO] info test
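
Reloading `logging` presumably keeps re-runs of this cell from stacking a second handler on the same logger, which would print every message twice. A minimal alternative sketch, assuming a hypothetical helper named `get_notebook_logger` (not part of this notebook), attaches the handler only once:

```python
import logging
import sys


def get_notebook_logger(name, level='DEBUG'):
  """Return a stdout logger; safe to call repeatedly (hypothetical helper)."""
  logger = logging.getLogger(name)
  logger.setLevel(level)
  # Attach the stream handler only on the first call.
  if not logger.handlers:
    handler = logging.StreamHandler(sys.stdout)
    handler.setFormatter(logging.Formatter(
        '[%(asctime)s - logger %(name)s - %(levelname)s] %(message)s'))
    logger.addHandler(handler)
  return logger


# Simulate running the cell twice; the second call reuses the same handler.
first = get_notebook_logger('baseline-sketch')
second = get_notebook_logger('baseline-sketch')
second.info('printed once')
```

With this guard the cell can be re-run without reloading the module, at the cost of not picking up a changed format string until the kernel restarts.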


### Mount Google Drive

In [2]:
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

Mounted at /content/drive


### Configurations

#### Configure mode

In [3]:
MODES = {
  'BASELINE': 'baseline',
}
MODE =  MODES['BASELINE']
logger.info(f'Notebook runs in {MODE} mode.')

[2020-12-16 14:29:09,904 - logger baseline - INFO] Notebook runs in baseline mode.


#### Get notebook's start time

In [4]:
from datetime import datetime
import pytz

NB_RUN_TIME = datetime.now(tz=pytz.timezone('US/Eastern')).strftime('%Y%m%d-%H%M%S')
logger.info(f'Notebook started at {NB_RUN_TIME}.')

[2020-12-16 14:29:09,945 - logger baseline - INFO] Notebook started at 20201216-092909.
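
The log timestamp (14:29:09) and the run ID (092909) differ by five hours because the log formatter uses the machine's local clock, while `NB_RUN_TIME` is converted to US/Eastern. The sketch below reproduces the run ID from the logged time using only the standard library. It assumes the Colab clock is UTC and hard-codes the winter offset UTC-5 instead of using `pytz`, so it is only valid outside daylight saving time.

```python
from datetime import datetime, timedelta, timezone

# Assumption: fixed UTC-5 offset (Eastern Standard Time, valid in December).
EST = timezone(timedelta(hours=-5), name='EST')

# The time shown in the log line above, interpreted as UTC.
logged_utc = datetime(2020, 12, 16, 14, 29, 9, tzinfo=timezone.utc)
run_id = logged_utc.astimezone(EST).strftime('%Y%m%d-%H%M%S')
print(run_id)  # 20201216-092909
```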


#### Other configurations

In [5]:
DEEPSPEECH_LOG_LEVEL = '1'
DEEPSPEECH_PATH = '/content/DeepSpeech'
PROJECT_ROOT_PATH = '/content/drive/MyDrive/nlp-project'
DATASET_PATH = f'{PROJECT_ROOT_PATH}/cmu_arctic'
FORCE_TRANSFORM_DATA = False

#### Set input and output paths

In [6]:
TEST_DIR = 'test'
TEST_FILES_PATH = f'{DATASET_PATH}/{TEST_DIR}/test_files.csv'
WAV_TEST_DIR = f'{DATASET_PATH}/{TEST_DIR}/audio'

if MODE == MODES['BASELINE']:
  TRAIN_DIR = 'train_baseline'
  TRAIN_FILES_PATH = f'{DATASET_PATH}/{TRAIN_DIR}/train_files.csv'
  WAV_TRAIN_DIR = f'{DATASET_PATH}/{TRAIN_DIR}/audio'

  OUTPUT_PATH = f'{PROJECT_ROOT_PATH}/xiaoyu-baseline/{NB_RUN_TIME}'
else:
  raise NotImplementedError

### Install packages

In [7]:
%%bash -s "$DEEPSPEECH_PATH"
DEEPSPEECH_PATH=$1

# Clone DeepSpeech only if it is not already present
if [ ! -d "$DEEPSPEECH_PATH" ] ; then
  git clone --branch v0.9.2 https://github.com/mozilla/DeepSpeech "$DEEPSPEECH_PATH"
fi

cd "$DEEPSPEECH_PATH"
pip install --upgrade pip==20.2.2 wheel==0.34.2 setuptools==49.6.0
pip install --upgrade -e .

# pip uninstall tensorflow -y
pip install --upgrade tensorflow==1.15.4
pip install tensorflow-gpu==1.15.4

# Install other packages
pip install pandas
# Pin numpy: tensorflow 1.15.4 requires numpy>=1.16.0,<1.19.0, and pip otherwise reports the installed numpy 1.19.4 as incompatible.
pip install --upgrade numpy==1.16.0

Collecting pip==20.2.2
  Downloading https://files.pythonhosted.org/packages/5a/4a/39400ff9b36e719bdf8f31c99fe1fa7842a42fa77432e584f707a5080063/pip-20.2.2-py2.py3-none-any.whl (1.5MB)
Collecting wheel==0.34.2
  Downloading https://files.pythonhosted.org/packages/8c/23/848298cccf8e40f5bbb59009b32848a4c38f4e7f3364297ab3c3e2e2cd14/wheel-0.34.2-py2.py3-none-any.whl
Collecting setuptools==49.6.0
  Downloading https://files.pythonhosted.org/packages/c3/a9/5dc32465951cf4812e9e93b4ad2d314893c2fa6d5f66ce5c057af6e76d85/setuptools-49.6.0-py3-none-any.whl (803kB)
Installing collected packages: pip, wheel, setuptools
  Found existing installation: pip 19.3.1
    Uninstalling pip-19.3.1:
      Successfully uninstalled pip-19.3.1
  Found existing installation: wheel 0.36.1
    Uninstalling wheel-0.36.1:
      Successfully uninstalled wheel-0.36.1
  Found existing installation: setuptools 50.3.2
    Uninstalling setuptools-50.3.2:
      Successfully uninstalled setuptools-50.3.2
Successfully installed

Cloning into '/content/DeepSpeech'...
Note: checking out 'b2920c755717499fad8e49f2c27091f438357653'.

You are in 'detached HEAD' state. You can look around, make experimental
changes and commit them, and you can discard any commits you make in this
state without impacting any branches by performing another checkout.

If you want to create a new branch to retain commits you create, you may
do so (now or later) by using -b with the checkout command again. Example:

  git checkout -b <new-branch-name>

ERROR: datascience 0.10.6 has requirement folium==0.2.1, but you'll have folium 0.8.3 which is incompatible.
ERROR: After October 2020 you may experience errors when installing or updating packages. This is because pip will change the way that it resolves dependency conflicts.

We recommend you use --use-feature=2020-resolver to test your packages with the new resolver before it becomes the default.

umap-learn 0.4.6 requires numba!=0.47,>=0.46, but you'll have numba 0.47.0 which is incompa

### Check TensorFlow version and whether it runs on GPU

In [8]:
import tensorflow as tf
logger.info(f'tensorflow version: {tf.__version__}')

# Query the device name once; an empty string means no GPU is available
gpu_device_name = tf.test.gpu_device_name()
if gpu_device_name:
  logger.info(f'Using Default GPU Device: {gpu_device_name}')
else:
  logger.info('Not using GPU')

[2020-12-16 14:31:05,710 - logger baseline - INFO] tensorflow version: 1.15.4
[2020-12-16 14:31:07,128 - logger baseline - INFO] Using Default GPU Device: /device:GPU:0


## Transform CMU Arctic dataset into DeepSpeech format

### Preprocess transcripts

In [9]:
%%bash

pip install jiwer

Collecting jiwer
  Downloading jiwer-2.2.0-py3-none-any.whl (13 kB)
Collecting python-Levenshtein
  Downloading python-Levenshtein-0.12.0.tar.gz (48 kB)
Building wheels for collected packages: python-Levenshtein
  Building wheel for python-Levenshtein (setup.py): started
  Building wheel for python-Levenshtein (setup.py): finished with status 'done'
  Created wheel for python-Levenshtein: filename=python_Levenshtein-0.12.0-cp36-cp36m-linux_x86_64.whl size=144796 sha256=49b9cf6ae7b3d0df9d6bb7eff38a8fb576e91999f41c8cde1acdc62eb23c8413
  Stored in directory: /root/.cache/pip/wheels/79/c3/a1/cbdd8b154234b3e571d121b65be7d53354cc77e223e8f271c8
Successfully built python-Levenshtein
Installing collected packages: python-Levenshtein, jiwer
Successfully installed jiwer-2.2.0 python-Levenshtein-0.12.0


In [10]:
import jiwer
import re

transformation = jiwer.Compose([
  jiwer.ToLowerCase(),
  jiwer.RemoveWhiteSpace(replace_by_space=True),
  jiwer.RemoveMultipleSpaces(),
  jiwer.Strip(),
]) 

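# Punctuation to replace with spaces. The apostrophe is not in the class,
# so contractions such as "i'll" are kept. The hyphen is escaped explicitly
# rather than matched through the accidental range `,-.`.
PUNCTUATIONS_TO_REMOVE = re.compile(r'[!"#$%&()*+,\-./\\:;<=>?@\[\]^_`{|}~]')


def preprocess_transcript(raw):
  """Replace punctuation with spaces, then lowercase and normalize whitespace."""
  processed = PUNCTUATIONS_TO_REMOVE.sub(' ', raw)
  processed = transformation(processed)
  return processed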
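
To see what this normalization produces without installing `jiwer`, the sketch below approximates the same steps with the standard library alone: punctuation to spaces, lowercasing, whitespace collapsing, and stripping. The function name `normalize_transcript` is hypothetical, and the two sample sentences are assumed raw forms of transcripts in the dataset.

```python
import re

# Same character class as PUNCTUATIONS_TO_REMOVE; the apostrophe is kept.
PUNCTUATION = re.compile(r'[!"#$%&()*+,\-./\\:;<=>?@\[\]^_`{|}~]')


def normalize_transcript(raw):
  """Stdlib approximation of preprocess_transcript (hypothetical helper)."""
  text = PUNCTUATION.sub(' ', raw)
  text = text.lower()
  # Collapse any run of whitespace into one space, then trim both ends.
  return re.sub(r'\s+', ' ', text).strip()


first = normalize_transcript('Author of the danger trail, Philip Steels, etc.')
second = normalize_transcript("God bless 'em, I hope I'll go on seeing them forever.")
print(first)   # author of the danger trail philip steels etc
print(second)  # god bless 'em i hope i'll go on seeing them forever
```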

### Generate data files

In [11]:
import os
import pandas

from pathlib import Path


def load_data_file():
  data_file_path = f'{DATASET_PATH}/cmuarctic.data.txt'

  with open(data_file_path) as file:
    data_file = {}

    for line in file:
      wav_filename = (line.split('"')[0]).split(' ')[1]
      transcript = line.split('"')[1]
      data_file[wav_filename] = transcript

    return data_file


def generate_files_list(wav_dir, output_path):
  COLUMNS = ['wav_filename', 'wav_filesize', 'transcript']
  files_list_data = []

  data_file = load_data_file()

  for path in Path(wav_dir).rglob('*.wav'):
    wav_path_relative_to_wav_dir = str(path.relative_to(wav_dir))
    wav_filename = path.name
    # Look up the transcript by utterance ID, i.e. the filename without its extension
    raw_transcript = data_file[path.stem]
    transcript = preprocess_transcript(raw_transcript)

    logger.debug(f'Wav: {wav_path_relative_to_wav_dir}; Transcript: {transcript}')

    file_data = (str(path), path.stat().st_size, transcript)
    files_list_data.append(file_data)

  df = pandas.DataFrame(data=files_list_data, columns=COLUMNS)
  df.to_csv(output_path, index=False)


def transform_data():
  # Reuse existing data files unless a rebuild is forced
  if (not FORCE_TRANSFORM_DATA and
      os.path.exists(TRAIN_FILES_PATH) and
      os.path.exists(TEST_FILES_PATH)):
    logger.info(f'Skipping transforming data. Data files {TRAIN_FILES_PATH} and {TEST_FILES_PATH} already exist.')
    return

  # Generate files list
  generate_files_list(wav_dir=WAV_TRAIN_DIR, output_path=TRAIN_FILES_PATH)
  generate_files_list(wav_dir=WAV_TEST_DIR, output_path=TEST_FILES_PATH)

  logger.info(f'Train data files generated at {TRAIN_FILES_PATH}.')
  logger.info(f'Test data files generated at {TEST_FILES_PATH}.')
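
`load_data_file` relies on positional splits, which assumes every line of `cmuarctic.data.txt` has the shape `( <utterance id> "<transcript>" )`. A slightly more defensive sketch, using a hypothetical `parse_data_line` helper and a sample line in that assumed format, matches the line with a regular expression and fails loudly on anything unexpected:

```python
import re

# Assumed line shape: ( arctic_a0001 "Some transcript." )
DATA_LINE = re.compile(r'^\(\s*(?P<utt_id>\S+)\s+"(?P<transcript>.*)"\s*\)\s*$')


def parse_data_line(line):
  """Return (utterance id, raw transcript) or raise ValueError (hypothetical helper)."""
  match = DATA_LINE.match(line)
  if match is None:
    raise ValueError(f'Unexpected line format: {line!r}')
  return match.group('utt_id'), match.group('transcript')


utt_id, transcript = parse_data_line(
    '( arctic_a0001 "Author of the danger trail, Philip Steels, etc." )\n')
print(utt_id, '->', transcript)
```

Raising on a malformed line surfaces a corrupted data file immediately, instead of as an `IndexError` or a `KeyError` later in `generate_files_list`.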

In [12]:
transform_data()

[2020-12-16 14:31:19,108 - logger baseline - DEBUG] Wav: female_160/arctic_a0001.wav; Transcript: author of the danger trail philip steels etc
[2020-12-16 14:31:19,109 - logger baseline - DEBUG] Wav: female_160/arctic_a0003.wav; Transcript: for the twentieth time that evening the two men shook hands
[2020-12-16 14:31:19,111 - logger baseline - DEBUG] Wav: female_160/arctic_a0002.wav; Transcript: not at this particular case tom apologized whittemore
[2020-12-16 14:31:19,113 - logger baseline - DEBUG] Wav: female_160/arctic_a0006.wav; Transcript: god bless 'em i hope i'll go on seeing them forever
[2020-12-16 14:31:19,114 - logger baseline - DEBUG] Wav: female_160/arctic_a0007.wav; Transcript: and you always want to see it in the superlative degree
[2020-12-16 14:31:19,116 - logger baseline - DEBUG] Wav: female_160/arctic_a0004.wav; Transcript: lord but i'm glad to see you again phil
[2020-12-16 14:31:19,118 - logger baseline - DEBUG] Wav: female_160/arctic_a0005.wav; Transcript: will we
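
Before starting a long training run, it can be worth checking the generated CSV. The sketch below is one hedged way to do that with the standard library: it verifies the three columns written by `generate_files_list` and reports any transcript character outside an allowed set. The helper name `check_manifest`, the throwaway sample file, and the allowed character set (lowercase letters, space, apostrophe) are assumptions; the real set would come from `alphabet.txt`.

```python
import csv
import os
import tempfile

EXPECTED_COLUMNS = ['wav_filename', 'wav_filesize', 'transcript']
# Assumption: stand-in for the characters listed in alphabet.txt.
ALLOWED_CHARS = set("abcdefghijklmnopqrstuvwxyz '")


def check_manifest(csv_path, allowed_chars=ALLOWED_CHARS):
  """Return a list of (row number, problem) tuples; empty means no issues."""
  problems = []
  with open(csv_path, newline='') as f:
    reader = csv.DictReader(f)
    if reader.fieldnames != EXPECTED_COLUMNS:
      return [(0, f'unexpected columns: {reader.fieldnames}')]
    for row_number, row in enumerate(reader, start=1):
      unknown = set(row['transcript']) - allowed_chars
      if unknown:
        problems.append((row_number, f'unknown characters: {sorted(unknown)}'))
  return problems


# Demo on a throwaway two-row manifest; the second transcript contains a digit.
with tempfile.TemporaryDirectory() as tmp_dir:
  sample_path = os.path.join(tmp_dir, 'sample_files.csv')
  with open(sample_path, 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(EXPECTED_COLUMNS)
    writer.writerow(['a.wav', 1000, 'will we'])
    writer.writerow(['b.wav', 2000, 'chapter 2'])
  problems = check_manifest(sample_path)

print(problems)  # [(2, "unknown characters: ['2']")]
```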

# Train a baseline model

Train a DeepSpeech model on the CMU Arctic dataset to establish the baseline.

In [13]:
%%bash -s "$DEEPSPEECH_PATH" "$TRAIN_FILES_PATH" "$TEST_FILES_PATH" "$OUTPUT_PATH" "$DEEPSPEECH_LOG_LEVEL" "$NB_RUN_TIME" "$DATASET_PATH"
DEEPSPEECH_PATH=$1
TRAIN_FILES_PATH=$2
TEST_FILES_PATH=$3
OUTPUT_PATH=$4
DEEPSPEECH_LOG_LEVEL=$5
NB_RUN_TIME=$6
DATASET_PATH=$7

cd "$DEEPSPEECH_PATH"

TRAIN_BATCH_SIZE=8
EPOCHS=80
echo "===== Configurations =====
Dataset used: $DATASET_PATH
TRAIN_BATCH_SIZE: $TRAIN_BATCH_SIZE
EPOCHS: $EPOCHS
" 2>&1 | tee -a "$NB_RUN_TIME.log"

python DeepSpeech.py \
  --alphabet_config_path "$DATASET_PATH/alphabet.txt" \
  --train_files "$TRAIN_FILES_PATH" \
  --test_files "$TEST_FILES_PATH" \
  --checkpoint_dir "$OUTPUT_PATH/checkpoints" \
  --export_dir "$OUTPUT_PATH/models" \
  --train_batch_size $TRAIN_BATCH_SIZE \
  --epochs $EPOCHS \
  --log_level "$DEEPSPEECH_LOG_LEVEL" \
  --test_output_file "$OUTPUT_PATH/test-output.txt" \
  --summary_dir "$OUTPUT_PATH/summary" 2>&1 | tee -a "$NB_RUN_TIME.log"

# Keep a copy of the run log next to the checkpoints and exported model
cp "$NB_RUN_TIME.log" "$OUTPUT_PATH/log.txt"

===== Configurations =====
Dataset used: /content/drive/MyDrive/nlp-project/cmu_arctic
TRAIN_BATCH_SIZE: 8
EPOCHS: 80

2020-12-16 14:31:25.444317: W tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:39] Overriding allow_growth setting because the TF_FORCE_GPU_ALLOW_GROWTH environment variable is set. Original config value was 0.
I1216 14:31:28.035244 140656033421184 utils.py:141] NumExpr defaulting to 4 threads.
Instructions for updating:
Use `tf.compat.v1.data.get_output_types(iterator)`.
W1216 14:31:29.066642 140656033421184 deprecation.py:323] From /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/data/ops/iterator_ops.py:347: Iterator.output_types (from tensorflow.python.data.ops.iterator_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.compat.v1.data.get_output_types(iterator)`.
Instructions for updating:
Use `tf.compat.v1.data.get_output_shapes(iterator)`.
W1216 14:31:29.066936 140656033421184 deprecation.py:323] Fro