This script prepares the data required to train a diagnostics model. The original Cambridge sound database contains 200 GB of sound data; we use only the English voice data. The preparation work consists of the following steps.

1. Merge the separate metadata .csv files into one
2. Filter the merged metadata file and keep only the wavs that are needed (filter by symptom, language, etc.)
3. Loop through samples and save audio files in a new folder (rather than in hierarchical folders)
4. Save a new metadata file that describes the info of these filtered audio samples

In [1]:
import pandas as pd
import numpy as np
import os
import shutil
import glob
from path import Path

In [2]:
OG_PATH = '/mnt/d/projects/COVID-datasets/Cambridge/covid19_data_0426'
WAV_FOLDER = OG_PATH + '/covid19_data_0426'
METADATA_FOLDER = OG_PATH + '/all_metadata'

In [3]:
# Merge
df_android = pd.read_csv(os.path.join(METADATA_FOLDER, 'results_raw_20210426_lan_yamnet_android_noloc.csv'), sep=';')
df_ios = pd.read_csv(os.path.join(METADATA_FOLDER, 'results_raw_20210426_lan_yamnet_ios_noloc.csv'), sep=';')
df = pd.concat((df_ios, df_android), ignore_index=True)
# Drop the first column
df = df.iloc[:, 1:]
df.shape

(48045, 18)
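
Because the merge stacks two separately exported files, it can help to confirm that both share the same columns before concatenating. The following is a minimal sketch of such a check; the helper name `concat_checked` and the toy tables are hypothetical and not part of the original notebook.

```python
import pandas as pd

def concat_checked(frames):
    """Concatenate DataFrames after confirming they share the same columns."""
    reference = list(frames[0].columns)
    for i, frame in enumerate(frames[1:], start=1):
        if list(frame.columns) != reference:
            raise ValueError(f"frame {i} has different columns: {list(frame.columns)}")
    return pd.concat(frames, ignore_index=True)

# Toy stand-ins for the iOS and Android metadata tables (values are made up).
ios = pd.DataFrame({"Uid": ["a"], "Language": ["en"]})
android = pd.DataFrame({"Uid": ["b", "c"], "Language": ["it", "en"]})

merged = concat_checked([ios, android])
print(merged.shape)
```

In the notebook this would be called as `concat_checked([df_ios, df_android])` in place of the plain `pd.concat`.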

In [4]:
# Get only EN samples
df_new = df[df['Language']=='en'].copy()

# Generate cough label (cough vs. non-cough), based on the reported symptoms
df_new['Cough-label'] = 'non'
mask1 = df_new.Symptoms.str.contains('cough', case=False, na=False)
df_new.loc[mask1, 'Cough-label'] = 'cough'

# Get those that have the 'valid' checkmark
df_new = df_new[df_new['Voice check'] == 'v']

# The audio files are already in .wav format, but the original metadata still lists .m4a, so we update the filenames here
df_new['Voice filename'] = df_new['Voice filename'].str.replace(r'\.m4a$', '.wav', regex=True)

# Add full path to the audios as we will need it later
df_new['voice-path'] = WAV_FOLDER + '/' + df_new['Uid']+'/'+df_new['Folder Name']+'/'+df_new['Voice filename']

# Clean up index
df_new = df_new.reset_index(drop=True)

# Summary
df_new['Cough-label'].value_counts()

non      13476
cough     9189
Name: Cough-label, dtype: int64
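
The labeling and filename steps above can be tried on a small toy table. This is only a sketch: the column names mirror the notebook, but the symptom strings and filenames are invented for illustration. It shows two details: `na=False` treats missing symptom entries as non-cough, and escaping the dot with `regex=True` rewrites only a real `.m4a` extension.

```python
import numpy as np
import pandas as pd

# Toy rows; the column names mirror the notebook, the values are made up.
toy = pd.DataFrame({
    "Symptoms": ["drycough,fever", "fever", np.nan, "WetCough"],
    "Voice filename": ["a.m4a", "b.m4a", "xm4a", "c.wav"],
})

# Case-insensitive match; missing symptom entries count as non-cough.
mask = toy["Symptoms"].str.contains("cough", case=False, na=False)
toy["Cough-label"] = np.where(mask, "cough", "non")

# Escaped dot: only a literal ".m4a" suffix is replaced, so "xm4a" is left alone.
toy["Voice filename"] = toy["Voice filename"].str.replace(r"\.m4a$", ".wav", regex=True)

print(toy["Cough-label"].tolist())
print(toy["Voice filename"].tolist())
```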

In [5]:
WAV_FOLDER_EN = os.path.join(WAV_FOLDER, 'EN') 

In [7]:
# # create a new folder to store the EN wav files
# os.mkdir(WAV_FOLDER_EN)

# # start copying...
# # wav files in the new folder are named as "00001.wav", "00002.wav", ...
# for idx , row in df_new.iterrows():
#     src = row['voice-path']
#     dst = os.path.join(WAV_FOLDER_EN, f'{idx:05}.wav')
#     shutil.copyfile(src, dst)
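
If some source files might be absent, the copy loop could record them instead of stopping at the first error. Below is a minimal sketch under that assumption; `copy_flat` is a hypothetical helper, and the temporary files stand in for the real recordings.

```python
import os
import shutil
import tempfile

def copy_flat(paths, dst_folder):
    """Copy each existing source file to dst_folder/<index>.wav; return missing indices."""
    os.makedirs(dst_folder, exist_ok=True)
    missing = []
    for idx, src in enumerate(paths):
        if not os.path.isfile(src):
            missing.append(idx)  # keep numbering aligned with the row index
            continue
        shutil.copyfile(src, os.path.join(dst_folder, f"{idx:05}.wav"))
    return missing

# Demo on a temporary hierarchy: one real file, one path that does not exist.
with tempfile.TemporaryDirectory() as tmp:
    src_dir = os.path.join(tmp, "user1", "rec1")
    os.makedirs(src_dir)
    real = os.path.join(src_dir, "voice.wav")
    with open(real, "wb") as fh:
        fh.write(b"RIFF")
    missing = copy_flat([real, os.path.join(src_dir, "absent.wav")], os.path.join(tmp, "EN"))
    copied = sorted(os.listdir(os.path.join(tmp, "EN")))

print(missing, copied)
```

With the notebook's data, `copy_flat(df_new['voice-path'], WAV_FOLDER_EN)` would produce the same file names as the loop above, since the index was reset to 0, 1, 2, ...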

In [6]:
# Create a new column holding the new paths; these paths are used to access the wav files during training.
df_new['voice-path-new'] = [
    os.path.join('./data_og/Cambridge/wav/EN', f'{idx:05}.wav') for idx in df_new.index
]

In [7]:
df_new['voice-path-new']

0        ./data_og/Cambridge/wav/EN/00000.wav
1        ./data_og/Cambridge/wav/EN/00001.wav
2        ./data_og/Cambridge/wav/EN/00002.wav
3        ./data_og/Cambridge/wav/EN/00003.wav
4        ./data_og/Cambridge/wav/EN/00004.wav
                         ...                 
22660    ./data_og/Cambridge/wav/EN/22660.wav
22661    ./data_og/Cambridge/wav/EN/22661.wav
22662    ./data_og/Cambridge/wav/EN/22662.wav
22663    ./data_og/Cambridge/wav/EN/22663.wav
22664    ./data_og/Cambridge/wav/EN/22664.wav
Name: voice-path-new, Length: 22665, dtype: object

In [36]:
local_data_path = '/home/yizhu/projects/transfer-learning-diagnostics/data_og/Cambridge/metadata/'
df_new.to_csv(os.path.join(local_data_path,'EN-cough-metadata.csv'),index=False,header=True,sep=';')
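
Since the metadata file is written with `;` as the separator, any code that loads it later needs to pass the same separator. A small round-trip sketch on an in-memory buffer, with made-up rows, illustrates this:

```python
import io
import pandas as pd

# Toy metadata; the values are invented for illustration.
toy = pd.DataFrame({"Uid": ["a", "b"], "Symptoms": ["drycough,fever", "fever"]})

# Write with the same options as the notebook, but into memory instead of a file.
buffer = io.StringIO()
toy.to_csv(buffer, index=False, header=True, sep=";")
buffer.seek(0)

# Reading back requires the matching separator.
restored = pd.read_csv(buffer, sep=";")
print(list(restored.columns))
```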

Now simply run ```! zip -r EN.zip EN/*.wav``` from the parent of the EN folder to create a zip file of its contents, then move it to the ```./data_og/Cambridge/wav``` folder. Our pipeline automatically unzips the file and prepares it for subsequent training.
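
If the `zip` command-line tool is not available, the archive could also be built from Python with the standard `zipfile` module. This is a sketch only: `zip_wavs` is a hypothetical helper, and the layout it writes (`EN/<file>.wav` inside the archive) is an assumption about what the pipeline expects.

```python
import os
import tempfile
import zipfile

def zip_wavs(folder, zip_path):
    """Write every .wav in folder into zip_path under '<folder name>/<file>'."""
    base = os.path.basename(os.path.normpath(folder))
    names = sorted(n for n in os.listdir(folder) if n.endswith(".wav"))
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for name in names:
            zf.write(os.path.join(folder, name), arcname=f"{base}/{name}")
    return len(names)

# Demo on a temporary folder with two dummy wav files and one unrelated file.
with tempfile.TemporaryDirectory() as tmp:
    en = os.path.join(tmp, "EN")
    os.makedirs(en)
    for name in ("00000.wav", "00001.wav", "notes.txt"):
        with open(os.path.join(en, name), "wb") as fh:
            fh.write(b"RIFF")
    zip_path = os.path.join(tmp, "EN.zip")
    count = zip_wavs(en, zip_path)
    with zipfile.ZipFile(zip_path) as zf:
        members = zf.namelist()

print(count, members)
```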