# Keyword Spotting Dataset Curation

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ShawnHymel/ei-keyword-spotting/blob/master/ei-audio-dataset-curation.ipynb)

Use this tool to download the Google Speech Commands Dataset, combine it with your own keywords, mix in some background noise, and upload the curated dataset to Edge Impulse. From there, you can train a neural network to classify spoken words and upload it to a microcontroller to perform real-time keyword spotting.

 1. Upload samples of your own keyword (optional)
 2. Adjust parameters in the Settings cell (you will need an [Edge Impulse](https://www.edgeimpulse.com/) account)
 3. Run the rest of the cells! ('shift' + 'enter' on each cell)



In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


### Upload your own keyword samples
You are welcome to use my [custom keyword dataset](https://github.com/ShawnHymel/custom-speech-commands-dataset), but note that it's limited and that I can't promise it will work well. If you want to use it, uncomment the `###Download custom dataset` cell below. You may also add your own recorded keywords to the extracted folder (`/content/custom_keywords`) to augment what's already there.

If you'd rather upload your own custom keyword dataset, follow these instructions:

In the file browser in the left pane, create a directory structure for your keyword audio samples. All samples for a given keyword should be in a directory named after that keyword.

The audio samples should be mono `.wav` files, 1 second long. Bit rate and bit depth should not matter. Samples shorter than 1 second are padded with zeros, and samples longer than 1 second are truncated to 1 second. The exact name of each `.wav` file does not matter, because each file is read, mixed with background noise, and saved to a separate file with an auto-generated name. The directory name does matter: it determines the class name during neural network training.
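
To make the padding and truncation rule concrete, here is a minimal sketch in plain Python. The function name `fit_to_length` is hypothetical, and this is not the curation script's actual code; it only illustrates the behavior described above on a list of sample values.

```python
def fit_to_length(samples, sample_rate=16000, seconds=1.0):
    """Return a copy of `samples` that is exactly `seconds` long.

    Hypothetical helper for illustration only; the real curation script
    may implement this step differently.
    """
    target = int(sample_rate * seconds)
    # Truncate anything past the target length
    clipped = list(samples[:target])
    # Zero-pad anything shorter than the target length
    return clipped + [0] * (target - len(clipped))


# Tiny example with a 4 Hz "sample rate" so the result is easy to read
print(fit_to_length([1, 2, 3], sample_rate=4))           # [1, 2, 3, 0]
print(fit_to_length([1, 2, 3, 4, 5, 6], sample_rate=4))  # [1, 2, 3, 4]
```

At the notebook's default 16000 Hz sample rate, every output clip would contain exactly 16000 samples.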

Right-click on each keyword directory and upload all of your samples. Your directory structure should look like the following:

```
/
|- content
|--- custom_keywords
|----- keyword_1
|------- 000.wav
|------- 001.wav
|------- ...
|----- keyword_2
|------- 000.wav
|------- 001.wav
|------- ...
|----- ...
```
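
Before running the later cells, it can help to confirm that the upload produced this layout. The sketch below uses only the standard library; `count_wav_files` is a hypothetical helper, and the path `/content/custom_keywords` is the example location from the tree above.

```python
from pathlib import Path


def count_wav_files(root):
    """Map each keyword directory under `root` to its number of .wav files."""
    counts = {}
    for keyword_dir in sorted(Path(root).iterdir()):
        if keyword_dir.is_dir():
            counts[keyword_dir.name] = len(list(keyword_dir.glob("*.wav")))
    return counts


# Assumed location; change this if you uploaded your samples elsewhere
root = Path("/content/custom_keywords")
if root.is_dir():
    for keyword, n in count_wav_files(root).items():
        print(f"{keyword}: {n} samples")
```

A keyword that reports 0 samples usually means the files landed one directory too high or too low.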




In [None]:
### Update Node.js to the latest stable version
!npm cache clean -f
!npm install -g n
!n stable

npm WARN using --force I sure hope you know what you are doing.
/tools/node/bin/n -> /tools/node/lib/node_modules/n/bin/n
+ n@7.2.2
added 1 package from 2 contributors in 0.322s

   ╭────────────────────────────────────────────────────────────────╮
   │                                                                │
   │      New major version of npm available! 6.14.8 → 7.12.0       │
   │   Changelog: https://github.com/npm/cli/releases/tag/v7.12.0   │
   │               Run npm install -g npm to update!                │
   │                                                                │
   ╰────────────────────────────────────────────────────────────────╯

  installing : node-v14.16.1
       mkdir : /usr/

In [None]:
### Install required packages and tools
!python -m pip install soundfile
!npm install -g --unsafe-perm edge-impulse-cli

[K[?25h[37;40mnpm[0m [0m[30;43mWARN[0m [0m[35mdeprecated[0m request-promise@4.2.6: request-promise has been deprecated because it extends the now deprecated request package, see https://github.com/request/request/issues/3142
[0m[37;40mnpm[0m [0m[30;43mWARN[0m [0m[35mdeprecated[0m request@2.88.2: request has been deprecated, see https://github.com/request/request/issues/3142
[K[?25h[37;40mnpm[0m [0m[30;43mWARN[0m [0m[35mdeprecated[0m @zeit/dockerignore@0.0.5: "@zeit/dockerignore" is no longer maintained
[K[?25h[37;40mnpm[0m [0m[30;43mWARN[0m [0m[35mdeprecated[0m har-validator@5.1.5: this library is no longer supported
[K[?25h/usr/local/bin/edge-impulse-blocks -> /usr/local/lib/node_modules/edge-impulse-cli/build/cli/blocks.js
/usr/local/bin/edge-impulse-uploader -> /usr/local/lib/node_modules/edge-impulse-cli/build/cli/uploader.js
/usr/local/bin/edge-impulse-daemon -> /usr/local/lib/node_modules/edge-impulse-cli/build/cli/daemon.js
/usr/local/bin

In [None]:
### Settings (You probably do not need to change these)
BASE_DIR = "/content"
OUT_DIR = "keywords_curated"
GOOGLE_DATASET_FILENAME = "speech_commands_v0.02.tar.gz"
GOOGLE_DATASET_URL = "http://download.tensorflow.org/data/" + GOOGLE_DATASET_FILENAME
GOOGLE_DATASET_DIR = "google_speech_commands"
CUSTOM_KEYWORDS_FILENAME = "main.zip"
CUSTOM_KEYWORDS_URL = "https://github.com/ShawnHymel/custom-speech-commands-dataset/archive/" + CUSTOM_KEYWORDS_FILENAME
CUSTOM_KEYWORDS_DIR = "custom_keywords"
CUSTOM_KEYWORDS_REPO_NAME = "custom-speech-commands-dataset-main"
CURATION_SCRIPT = "dataset-curation.py"
CURATION_SCRIPT_URL = "https://raw.githubusercontent.com/smlee00/STM32-Keyword-Spotting-with-Edge-Impulse/master/" + CURATION_SCRIPT
UTILS_SCRIPT_URL = "https://raw.githubusercontent.com/smlee00/STM32-Keyword-Spotting-with-Edge-Impulse/master/utils.py"
NUM_SAMPLES = 1500    # Target number of samples to mix and send to Edge Impulse
WORD_VOL = 1.0        # Relative volume of word in output sample
BG_VOL = 0.1          # Relative volume of noise in output sample
SAMPLE_TIME = 1.0     # Time (seconds) of output sample
SAMPLE_RATE = 16000   # Sample rate (Hz) of output sample
BIT_DEPTH = "PCM_16"  # Options: [PCM_16, PCM_24, PCM_32, PCM_U8, FLOAT, DOUBLE]
BG_DIR = "_background_noise_"
TEST_RATIO = 0.2      # 20% reserved for test set, rest is for training
EI_INGEST_TEST_URL = "https://ingestion.edgeimpulse.com/api/test/data"
EI_INGEST_TRAIN_URL = "https://ingestion.edgeimpulse.com/api/training/data"
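
As a quick sanity check on these settings, the arithmetic below shows how many clips per label end up in each set and how many audio samples each clip holds. It repeats the default values from the cell above and assumes the test split is computed as `int(TEST_RATIO * n)`, which is how the upload cell later in this notebook does it.

```python
# Default values copied from the Settings cell
NUM_SAMPLES = 1500
TEST_RATIO = 0.2
SAMPLE_TIME = 1.0
SAMPLE_RATE = 16000

# Per-label split, using the same rounding as the upload cell
num_test = int(TEST_RATIO * NUM_SAMPLES)
num_train = NUM_SAMPLES - num_test

# Number of audio samples (frames) in each 1-second output clip
frames_per_clip = int(SAMPLE_TIME * SAMPLE_RATE)

print(num_test, num_train, frames_per_clip)  # 300 1200 16000
```

With the defaults, each label contributes 300 test clips and 1200 training clips.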

In [None]:
### Download Google Speech Commands Dataset
# Use %cd here: "!cd" runs in a temporary subshell and does not change the notebook's directory
%cd {BASE_DIR}
!wget {GOOGLE_DATASET_URL}
!mkdir -p {GOOGLE_DATASET_DIR}
!echo "Extracting..."
!tar xfz {GOOGLE_DATASET_FILENAME} -C {GOOGLE_DATASET_DIR}

--2021-05-08 07:23:37--  http://download.tensorflow.org/data/speech_commands_v0.02.tar.gz
Resolving download.tensorflow.org (download.tensorflow.org)... 142.250.99.128, 2607:f8b0:400e:c0c::80
Connecting to download.tensorflow.org (download.tensorflow.org)|142.250.99.128|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2428923189 (2.3G) [application/gzip]
Saving to: ‘speech_commands_v0.02.tar.gz’


2021-05-08 07:23:56 (122 MB/s) - ‘speech_commands_v0.02.tar.gz’ saved [2428923189/2428923189]

Extracting...


In [None]:
### Pull out background noise directory
%cd {BASE_DIR}
!mv "{GOOGLE_DATASET_DIR}/{BG_DIR}" "{BG_DIR}"

In [None]:
### (Optional) Download custom dataset: uncomment the code in this cell if you want to use my custom dataset

## Download, extract, and move dataset to separate directory
# %cd {BASE_DIR}
# !wget {CUSTOM_KEYWORDS_URL}
# !echo "Extracting..."
# !unzip -q {CUSTOM_KEYWORDS_FILENAME}
# !mv "{CUSTOM_KEYWORDS_REPO_NAME}/{CUSTOM_KEYWORDS_DIR}" "{CUSTOM_KEYWORDS_DIR}"

In [None]:
### User Settings (do change these)

# Location of your custom keyword samples (e.g. "/content/custom_keywords").
# Leave blank ("") for no custom keywords. Set to the CUSTOM_KEYWORDS_DIR
# variable to use samples from my custom-speech-commands-dataset repo.
CUSTOM_DATASET_PATH = ""

# Edge Impulse > your_project > Dashboard > Keys
EI_API_KEY = "<your Edge Impulse API key>"  # Keep this secret; do not commit a real key

# Comma separated words. Must match directory names (that contain samples).
TARGETS = "marvin, on, off, bed"
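
Because every target must match a directory name, a small check before curation can catch typos early. The sketch below is an assumption about how such a comma-separated string could be parsed; the names `parse_targets` and `missing_targets` are hypothetical, and the curation script's own parsing is not shown in this notebook.

```python
def parse_targets(targets):
    """Split a comma-separated string into clean keyword names."""
    return [word.strip() for word in targets.split(",") if word.strip()]


def missing_targets(targets, available_dirs):
    """Return the targets that have no matching directory name."""
    available = set(available_dirs)
    return [t for t in parse_targets(targets) if t not in available]


print(parse_targets("marvin, on, off, bed"))
# ['marvin', 'on', 'off', 'bed']

# Example with a made-up directory listing that lacks "bed"
print(missing_targets("marvin, on, off, bed", ["marvin", "on", "off"]))
# ['bed']
```

In the notebook, the list of available directories could come from `os.listdir(GOOGLE_DATASET_DIR)` plus your custom keyword directory, if you use one.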

In [None]:
### Download curation and utils scripts
!wget {CURATION_SCRIPT_URL}
!wget {UTILS_SCRIPT_URL}

In [None]:
### Perform curation and mixing of samples with background noise
%cd {BASE_DIR}
!python {CURATION_SCRIPT} \
  -t "{TARGETS}" \
  -n {NUM_SAMPLES} \
  -w {WORD_VOL} \
  -g {BG_VOL} \
  -s {SAMPLE_TIME} \
  -r {SAMPLE_RATE} \
  -e {BIT_DEPTH} \
  -b "{BG_DIR}" \
  -o "{OUT_DIR}" \
  "{GOOGLE_DATASET_DIR}" \
  "{CUSTOM_DATASET_PATH}"

-----------------------------------------------------------------------
Keyword Dataset Curation Tool
v0.1
-----------------------------------------------------------------------
No directory named ''. Ignoring.
Gathering random background noise snippets (1500 files)
Progress: |██████████████████████████████████████████████████| 100.0% Complete
Mixing: marvin (1500 files)
Progress: |██████████████████████████████████████████████████| 100.0% Complete
Mixing: on (1500 files)
Progress: |██████████████████████████████████████████████████| 100.0% Complete
Mixing: off (1500 files)
Progress: |██████████████████████████████████████████████████| 100.0% Complete
Mixing: bed (1500 files)
Progress: |██████████████████████████████████████████████████| 100.0% Complete
No directory named ''. Ignoring.
Mixing: _unknown (1500 files)
Progress: |██████████████████████████████████████████████████| 100.0% Complete
Done!
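
The mixing step reported above combines each keyword clip with a background noise snippet at the relative volumes `WORD_VOL` and `BG_VOL`. The curation script's implementation is not shown here, so the following is only a rough sketch under one assumption: that mixing is a weighted sum of two equal-length clips of floating-point samples, clamped to the range -1.0 to 1.0. The function name `mix` is hypothetical.

```python
def mix(word, noise, word_vol=1.0, bg_vol=0.1):
    """Weighted sum of two equal-length clips, clamped to [-1.0, 1.0].

    Illustrative assumption only; not taken from dataset-curation.py.
    """
    if len(word) != len(noise):
        raise ValueError("clips must have the same length")
    mixed = []
    for w, n in zip(word, noise):
        value = word_vol * w + bg_vol * n
        # Clamp so the result stays within the valid floating-point audio range
        mixed.append(max(-1.0, min(1.0, value)))
    return mixed


# A loud word sample plus noise is clamped rather than allowed to overflow
print(mix([0.5, -0.5, 1.0], [1.0, 1.0, 1.0]))
```

With `BG_VOL = 0.1`, the noise contributes at one tenth of its original amplitude, so the keyword stays clearly dominant in the output clip.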


In [None]:
### Use CLI tool to send curated dataset to Edge Impulse

%cd {BASE_DIR}

# Imports
import os
import random

# Seed with system time
random.seed()

# Go through each category in our curated dataset
for dir in os.listdir(OUT_DIR):
  
  # Create list of files for one category
  paths = []
  for filename in os.listdir(os.path.join(OUT_DIR, dir)):
    paths.append(os.path.join(OUT_DIR, dir, filename))

  # Shuffle and divide into test and training sets
  random.shuffle(paths)
  num_test_samples = int(TEST_RATIO * len(paths))
  test_paths = paths[:num_test_samples]
  train_paths = paths[num_test_samples:]

  # Create arguments list (as a string) for CLI call
  test_paths = ['"' + s + '"' for s in test_paths]
  test_paths = ' '.join(test_paths)
  train_paths = ['"' + s + '"' for s in train_paths]
  train_paths = ' '.join(train_paths)
  
  # Send test files to Edge Impulse
  !edge-impulse-uploader \
    --category testing \
    --label {dir} \
    --api-key {EI_API_KEY} \
    --silent \
    {test_paths}

  # Send training files to Edge Impulse
  !edge-impulse-uploader \
    --category training \
    --label {dir} \
    --api-key {EI_API_KEY} \
    --silent \
    {train_paths}

Streaming output truncated to the last 5000 lines.
[ 707/1200] Uploading keywords_curated/on/on.0018.wav OK (407 ms)
[ 708/1200] Uploading keywords_curated/on/on.0379.wav OK (439 ms)
[ 709/1200] Uploading keywords_curated/on/on.0295.wav OK (270 ms)
[ 710/1200] Uploading keywords_curated/on/on.0803.wav OK (253 ms)
[ 711/1200] Uploading keywords_curated/on/on.0096.wav OK (320 ms)
[ 712/1200] Uploading keywords_curated/on/on.0039.wav OK (270 ms)
[ 713/1200] Uploading keywords_curated/on/on.0118.wav OK (288 ms)
[ 714/1200] Uploading keywords_curated/on/on.1318.wav OK (274 ms)
[ 715/1200] Uploading keywords_curated/on/on.0800.wav OK (239 ms)
[ 716/1200] Uploading keywords_curated/on/on.0616.wav OK (453 ms)
[ 717/1200] Uploading keywords_curated/on/on.0137.wav OK (326 ms)
[ 718/1200] Uploading keywords_curated/on/on.1175.wav OK (293 ms)
[ 719/1200] Uploading keywords_curated/on/on.1012.wav OK (279 ms)
[ 720/1200] Uploading keywords_curated/on/on.0417.wav OK (246 ms)
[ 721/1200]