# Virufy COVID Quickstart

Hello fellow fighters! We welcome you as allies in the battle against COVID-19. This notebook provides a quick tutorial on how to download our data, preprocess it, and quickly get started training models. 

## Part 1: Setup

First, we import some packages. If you are running this in Colab they should all come pre-installed. If you're running this locally, you might need to install these packages first. 

In [None]:
from sklearn.model_selection import train_test_split
import pandas as pd
import os
import librosa
import librosa.display
import cv2
import numpy as np
import json
import matplotlib.pyplot as plt
import warnings
warnings.simplefilter("ignore")
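
If you're running locally, a quick way to see what still needs installing is to probe for each module before importing it. The sketch below uses only the standard library. The `REQUIRED` mapping from import names to pip package names and the `missing_packages` helper are our own assumptions for illustration (for example, `cv2` is usually provided by `opencv-python`), so adjust them to your environment.

```python
import importlib.util

# Import name -> pip package name (assumed; adjust for your environment)
REQUIRED = {
    "sklearn": "scikit-learn",
    "pandas": "pandas",
    "librosa": "librosa",
    "cv2": "opencv-python",
    "numpy": "numpy",
    "matplotlib": "matplotlib",
}

def missing_packages(required=REQUIRED):
    """Return the pip names of packages whose import name cannot be found."""
    return [pip_name for module, pip_name in required.items()
            if importlib.util.find_spec(module) is None]

missing = missing_packages()
if missing:
    print("Install with: pip install " + " ".join(missing))
else:
    print("All required packages are available.")
```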

## Part 2: Data Download

Now, we download the CoughVID data from a different, open-source Virufy repo:

In [3]:
# Download coughvid data in CDF format
# Run once 
!git clone "https://github.com/virufy/virufy-cdf-coughvid.git"
%cd virufy-cdf-coughvid

Cloning into 'virufy-cdf-coughvid'...
remote: Enumerating objects: 4, done.
remote: Counting objects: 100% (4/4), done.
remote: Compressing objects: 100% (4/4), done.
remote: Total 16076 (delta 2), reused 0 (delta 0), pack-reused 16072
Receiving objects: 100% (16076/16076), 720.17 MiB | 34.79 MiB/s, done.
Resolving deltas: 100% (17/17), done.
Checking out files: 100% (16061/16061), done.
/content/virufy-cdf-coughvid


## Part 3: Data Cleaning
Now we're ready to load our data into memory! 

If you listen to the recordings, you might notice that some of them aren't coughs. To help you, we've already run a model that predicts how likely each sound file is to be a cough (the `cough_detected` column). Here, we filter our dataset, keeping only recordings with a score above 0.7, i.e. those judged more than 70% likely to be coughs.

In [None]:
coughvid = pd.read_csv("virufy-cdf-coughvid.csv")
msk = (coughvid.loc[:,'cough_detected'] > 0.7)
coughvid = coughvid.loc[msk,:]
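
Since the 0.7 cutoff is a choice rather than a rule, it can help to see how many recordings survive at a few candidate thresholds before committing to one. Here is a minimal sketch on a toy table; the values and the `thresholds` list are made up for illustration, and on the real data you would apply the same comparison to the unfiltered `cough_detected` column.

```python
import pandas as pd

# Toy stand-in for the real table; only the cough_detected column matters here
toy = pd.DataFrame({"cough_detected": [0.05, 0.35, 0.65, 0.75, 0.92, 0.99]})

# Count how many recordings survive each candidate threshold
thresholds = [0.5, 0.7, 0.9]
kept = {t: int((toy["cough_detected"] > t).sum()) for t in thresholds}
print(kept)
```

A stricter threshold gives cleaner coughs but fewer training examples, so this trade-off is worth revisiting once you have a working model.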

Let's take a quick look at our labels! 


In [4]:
# Filtering cough_detected to > 0.7 is advisable.
# The 0.7 threshold can be tuned as part of model development; we recommend
# testing different thresholds once you have a working model.
coughvid.head()

Unnamed: 0.1,Unnamed: 0,source,patient_id,cough_detected,cough_path,age,biological_sex,reported_gender,submission_date,pcr_test_date,pcr_result_date,respiratory_condition,fever_or_muscle_pain,pcr_test_result,pcr_test_result_inferred,covid_symptoms
1,1,coughvid,f9c950cd-6d37-4598-bf31-bd47cdf1a720,0.8882,virufy-cdf-coughvid/f9c950cd-6d37-4598-bf31-bd...,35.0,male,male,2020-04-13T19:44:42.219149+00:00,,,False,False,untested,negative,False
3,3,coughvid,1284866a-8849-47f9-96bf-ffc2e1db9305,0.9599,virufy-cdf-coughvid/1284866a-8849-47f9-96bf-ff...,,,,2020-04-15T18:04:29.010816+00:00,,,,,untested,untested,
4,4,coughvid,d7d518f7-768c-4407-9706-3d59435f52cf,0.9869,virufy-cdf-coughvid/d7d518f7-768c-4407-9706-3d...,19.0,male,male,2020-05-29T12:52:05.866134+00:00,,,False,True,untested,negative,False
6,6,coughvid,8b1646d1-416d-4230-9015-2203dff0dd87,0.9963,virufy-cdf-coughvid/8b1646d1-416d-4230-9015-22...,28.0,male,male,2020-04-19T08:39:15.488719+00:00,,,False,False,untested,negative,False
7,7,coughvid,e61e9d35-7f7e-483d-ae80-63c2eaf5ee10,0.9708,virufy-cdf-coughvid/e61e9d35-7f7e-483d-ae80-63...,30.0,male,male,2020-04-19T02:08:55.341624+00:00,,,False,False,untested,negative,False


In [None]:
# Disclaimer: we have inferred some of these pcr_test_result labels based on other columns
# Target = pcr_test_result_inferred
# Positive, negative, untested

coughvid['pcr_test_result_inferred'].head(30)

0     negative
1     untested
2     negative
3     negative
4     negative
5     negative
6     untested
7     untested
8     negative
9     negative
10    negative
11    negative
12    untested
13    negative
14    untested
15    negative
16    negative
17    negative
18    negative
19    positive
20    negative
21    untested
22    negative
23    untested
24    untested
25    untested
26    untested
27    negative
28    untested
29    untested
Name: pcr_test_result_inferred, dtype: object

There are a lot of recordings labeled as 'untested'. These can't be used directly in supervised learning, so for now we'll filter out those labels as well, keeping only the recordings that are 'positive' or 'negative'.

In [7]:
# Filter out untested results
msk = (coughvid.loc[:,'pcr_test_result_inferred']=='untested')
coughvid = coughvid.loc[~msk,:]
coughvid


Unnamed: 0.1,Unnamed: 0,source,patient_id,cough_detected,cough_path,age,biological_sex,reported_gender,submission_date,pcr_test_date,pcr_result_date,respiratory_condition,fever_or_muscle_pain,pcr_test_result,pcr_test_result_inferred,covid_symptoms
1,1,coughvid,f9c950cd-6d37-4598-bf31-bd47cdf1a720,0.8882,virufy-cdf-coughvid/f9c950cd-6d37-4598-bf31-bd...,35.0,male,male,2020-04-13T19:44:42.219149+00:00,,,False,False,untested,negative,False
4,4,coughvid,d7d518f7-768c-4407-9706-3d59435f52cf,0.9869,virufy-cdf-coughvid/d7d518f7-768c-4407-9706-3d...,19.0,male,male,2020-05-29T12:52:05.866134+00:00,,,False,True,untested,negative,False
6,6,coughvid,8b1646d1-416d-4230-9015-2203dff0dd87,0.9963,virufy-cdf-coughvid/8b1646d1-416d-4230-9015-22...,28.0,male,male,2020-04-19T08:39:15.488719+00:00,,,False,False,untested,negative,False
7,7,coughvid,e61e9d35-7f7e-483d-ae80-63c2eaf5ee10,0.9708,virufy-cdf-coughvid/e61e9d35-7f7e-483d-ae80-63...,30.0,male,male,2020-04-19T02:08:55.341624+00:00,,,False,False,untested,negative,False
8,8,coughvid,1084d790-fc11-4649-b0fb-159d8a6d84b4,0.9980,virufy-cdf-coughvid/1084d790-fc11-4649-b0fb-15...,30.0,female,female,2020-04-20T17:57:58.677742+00:00,,,False,False,untested,negative,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
16044,16044,coughvid,b24cb0fe-6c03-48e0-9476-88f76ea5b9fc,0.7968,virufy-cdf-coughvid/b24cb0fe-6c03-48e0-9476-88...,23.0,male,male,2020-04-13T20:54:49.265412+00:00,,,False,False,untested,negative,False
16046,16046,coughvid,f3fd464e-a95c-4588-96d9-4fc705821a75,0.9288,virufy-cdf-coughvid/f3fd464e-a95c-4588-96d9-4f...,64.0,female,female,2020-04-19T08:53:15.160449+00:00,,,True,True,untested,negative,False
16047,16047,coughvid,bb17ca61-9aa8-45d2-9b3e-52c53965641f,0.9957,virufy-cdf-coughvid/bb17ca61-9aa8-45d2-9b3e-52...,37.0,female,female,2020-04-10T04:37:49.199245+00:00,,,False,False,positive,positive,
16048,16048,coughvid,5a3c254e-1eb7-47f3-a014-77c98decf526,0.9585,virufy-cdf-coughvid/5a3c254e-1eb7-47f3-a014-77...,25.0,female,female,2020-05-20T02:47:06.889201+00:00,,,False,False,untested,negative,False


Our cleaned data consists of 5386 recordings, each labelled with 'positive' or 'negative' and a selection of clinical features. 
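
The label preview earlier showed only a few 'positive' rows, so it's worth checking the class balance before training. The sketch below uses toy labels (the 9-to-1 ratio is an assumption for illustration, not the real distribution); on the real data you would call the same methods on `coughvid['pcr_test_result_inferred']`.

```python
import pandas as pd

# Toy labels standing in for coughvid['pcr_test_result_inferred']
labels = pd.Series(["negative"] * 9 + ["positive"], name="pcr_test_result_inferred")

counts = labels.value_counts()                 # absolute count per class
shares = labels.value_counts(normalize=True)   # fraction of the dataset per class
print(counts)
print(shares)
```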

## Part 4: Data Preprocessing

Now that we have a clean dataset, we split it into train and validation sets (the validation split is named `cdf_test` in the code below). We'll train on the train split and use the held-out split to decide when to stop training.

In [8]:
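# Train/test split, stratified on whether a recording is labeled positive
# (strings are compared with ==; `is` tests object identity and is unreliable here)
stratify_labels = coughvid["pcr_test_result_inferred"].map(lambda x: x if x == "positive" else "not_positive")
cdf_train, cdf_test = train_test_split(coughvid, test_size=0.2, random_state=0, stratify=stratify_labels)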

In [None]:
cdf_train.shape, cdf_test.shape

((4308, 17), (1078, 17))
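
To see what stratification does, here is a small sketch on toy data. The 10% positive rate is an assumption chosen for illustration. With `stratify` set, both splits keep the same share of positives as the full dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy dataset: 100 samples, 10% positive (assumed proportions, for illustration)
y = np.array(["positive"] * 10 + ["negative"] * 90)
X = np.arange(len(y)).reshape(-1, 1)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Share of positives in each split
train_rate = np.mean(y_train == "positive")
test_rate = np.mean(y_test == "positive")
print(train_rate, test_rate)
```

Without `stratify`, a rare class can end up over- or under-represented in the smaller split purely by chance.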

Here, we define our custom preprocessing pipeline. We extract the following relevant audio features:
- Mel-Frequency Cepstral Coefficients (MFCCs) 
- Mel-Spectrograms

We also cache these features so that the preprocessing only needs to be run once. 
Feel free to modify this section in any way you like!

In [None]:
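# Functions to process audio files into images and json features
def trim_silence(x, *args):
    # Expected positional arguments: pad, db_max, frame_length, hop_length
    try:
        pad,db_max,frame_length,hop_length = args[0],args[1],args[2],args[3]
    except IndexError:
        print('Please enter the following arguments: pad,db_max,frame_length,hop_length')
        return

    # Note: frame_length and hop_length are accepted, but the trim itself uses fixed values (256, 64)
    _, ints = librosa.effects.trim(x, top_db=db_max, frame_length=256, hop_length=64)
    # Keep `pad` samples of context on each side of the non-silent interval
    start = int(max(ints[0]-pad, 0))
    end   = int(min(ints[1]+pad, len(x)))
    return x[start:end]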

def process_cough_file(path,trim,*args):
    # Optional positional arguments: sr, removeaudio, chunk (seconds), db_max
    try:
        sr,removeaudio,chunk,db_max = args[0],args[1],args[2],args[3]
    except IndexError:
        sr,removeaudio,chunk,db_max = 48000,False,3,50
    try:
        x,sr = librosa.load(path, sr=sr)
    except Exception:
        return -1  # file could not be read; extract() skips int results

    # Skip clips shorter than 0.3 s or longer than 30 s; extract() skips tuple results
    if len(x)/sr < 0.3 or len(x)/sr > 30:
        return None,None
    hop_length = np.floor(0.010*sr).astype(int) #10ms
    win_length = np.floor(0.020*sr).astype(int) #20ms

    if removeaudio:
        os.remove(path)  # deletes the source audio file from disk

    x = trim(x, 0.25*sr, db_max,win_length,hop_length)
    x = x[:np.floor(chunk*sr).astype(int)]

    #pads to chunk size if smaller
    x_pad = np.zeros(int(sr*chunk))
    x_pad[:min(len(x_pad), len(x))] = x[:min(len(x_pad), len(x))]

    return [x_pad,sr,hop_length,win_length]

def get_melspec(sdir,audio,sr,name):
    #Mel Spectrogram
    plt.ioff()
    fig      = plt.figure()
    melspec  = librosa.feature.melspectrogram(y=audio,sr=sr)
    s_db     = librosa.power_to_db(melspec, ref=np.max)
    librosa.display.specshow(s_db)
    fig.canvas.draw()
    # Copy the rendered canvas into an (H, W, 4) RGBA array
    # (replaces the deprecated np.fromstring / tostring_rgb combination)
    img = np.array(fig.canvas.buffer_rgba())
    plt.close(fig=fig)
    #img = img[80:250,80:300]

    savepath = os.path.join(sdir,name+'.png') # Currently saving melspectrogram images to the folders specified in extract features
    # OpenCV writes images in BGR order, so convert from RGBA first
    cv2.imwrite(savepath,cv2.cvtColor(img, cv2.COLOR_RGBA2BGR))
    return savepath

def get_rawMFCCs(audio,sr,*args):
    # Optional positional arguments: hop_length, win_length, n_mfcc, n_mels, n_fft
    try:
        hop_length,win_length,n_mfcc,n_mels,n_fft = args[0],args[1],args[2],args[3],args[4]
    except IndexError:
        hop_length = np.floor(0.010*sr).astype(int) #10ms
        win_length = np.floor(0.020*sr).astype(int) #20ms (currently unused)
        n_mfcc,n_mels,n_fft=13,13,2048

    rawMFCCs    = librosa.feature.mfcc(y=audio,sr=sr, n_mfcc=n_mfcc,n_mels=n_mels, n_fft=n_fft, hop_length=hop_length)
    # Average each coefficient over time: one n_mfcc-length vector per recording
    rawMFCCs    = np.mean(rawMFCCs.T,axis=0).tolist()

    return rawMFCCs

def getlabel(key, dataframe, chosen):
    # Label of the first row whose id column matches key
    return dataframe.loc[dataframe[chosen['id']]==key][chosen['pcr']].tolist()[0]

def extract(df, chosen, savedir):
    os.makedirs(savedir, exist_ok=True)

    keys, dirs = df[chosen['id']].tolist(),df[chosen['path']].tolist()
    audio_objs = [process_cough_file(path,trim_silence) for path in dirs]
    # process_cough_file returns an int or a tuple for files that could not be used
    false_indices = {i for i, obj in enumerate(audio_objs) if isinstance(obj,(int,tuple))}

    audio_objs = [obj for i, obj in enumerate(audio_objs) if i not in false_indices]
    # Unpack the per-file results column-wise (avoids building a ragged NumPy array)
    audio,sr,hop_length,win_length = zip(*audio_objs)

    dirs = [dirs[i] for i in range(len(dirs)) if i not in false_indices]
    keys = [keys[i] for i in range(len(keys)) if i not in false_indices]
    data = {key:{'DIR':get_melspec(savedir,a_i,sr_i,key),
             'rawMFCC':get_rawMFCCs(a_i,sr_i),
             'label':getlabel(key, df, chosen)} for key,a_i,sr_i in list(zip(keys,audio,sr))}
    return data

def filter_DF(df):
    names = list(df.columns)
    chosen= {}
    for name in names:
        if 'inferred' in name.lower():chosen['pcr'] = name # Choosing the target (pcr_test_result_inferred)
        elif 'path' in name.lower():chosen['path'] = name
        elif 'patient' in name.lower() or 'id' == name.lower() :chosen['id'] = name
    return df[[chosen['id'],chosen['pcr'],chosen['path']]].dropna().reset_index(), chosen 

def extract_features(train_df, test_df):
    train_dataframe, train_chosen = filter_DF(train_df)
    test_dataframe, test_chosen = filter_DF(test_df)
    
    train_features = extract(train_dataframe, train_chosen, 'train_melspecs/')
    test_features = extract(test_dataframe, test_chosen, 'test_melspecs/')
    
    return train_features, test_features
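
The fixed-length step inside `process_cough_file` is easier to follow in isolation. Below is a minimal NumPy-only sketch of the same idea: clips longer than `chunk` seconds are truncated and shorter ones are zero-padded at the end. The helper name `fix_length` is hypothetical and not part of the pipeline above. Unlike the pipeline, it keeps the input dtype instead of always returning float64.

```python
import numpy as np

def fix_length(x, sr, chunk):
    """Truncate or zero-pad a 1-D signal to exactly `chunk` seconds."""
    n = int(sr * chunk)              # target number of samples
    out = np.zeros(n, dtype=x.dtype)
    m = min(n, len(x))
    out[:m] = x[:m]                  # copy what fits; the rest stays zero
    return out

sr = 48000
short = np.ones(sr * 1, dtype=np.float32)   # 1 second of audio
long_ = np.ones(sr * 5, dtype=np.float32)   # 5 seconds of audio

print(fix_length(short, sr, chunk=3).shape)
print(fix_length(long_, sr, chunk=3).shape)
```

Giving every clip the same length means every spectrogram has the same shape, which is what most image-style models expect.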

In [None]:
# Json format dictionaries
train_features, test_features = extract_features(cdf_train, cdf_test)

In [None]:
# Optional: Save json features
with open('train_features.json', 'w') as f:
    json.dump(train_features, f, indent=4)
with open('test_features.json', 'w') as f:
    json.dump(test_features, f, indent=4)
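
Once the features are cached, a later session can reload them instead of rerunning the preprocessing. The sketch below writes a tiny stand-in file with the same layout as `train_features` (the ids, paths, and MFCC values are made up) and then turns it into a feature matrix and a binary target. Encoding 'positive' as 1 is our own convention here, not something the pipeline above prescribes.

```python
import json
import os
import tempfile

import numpy as np

# Tiny stand-in with the same layout as train_features (all values are made up)
toy_features = {
    "id-a": {"DIR": "train_melspecs/id-a.png", "rawMFCC": [0.1] * 13, "label": "negative"},
    "id-b": {"DIR": "train_melspecs/id-b.png", "rawMFCC": [0.2] * 13, "label": "positive"},
}

path = os.path.join(tempfile.mkdtemp(), "toy_features.json")
with open(path, "w") as f:
    json.dump(toy_features, f, indent=4)

# Reload the cache and build a feature matrix X and a binary target y
with open(path) as f:
    cached = json.load(f)

keys = sorted(cached)
X = np.array([cached[k]["rawMFCC"] for k in keys])
y = np.array([1 if cached[k]["label"] == "positive" else 0 for k in keys])
print(X.shape, y)
```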

## Part 5: Model Training

After all that, we're finally ready to begin training a model! Have fun :) 

In [None]:
# Feel free to use all columns, or just the audio files, to predict the pcr_test_result_inferred column
'''
  Your code here
'''



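
If you'd like a starting point, here is one possible baseline: a logistic regression on 13-dimensional feature vectors shaped like the mean MFCCs from Part 4. This is only a sketch under stated assumptions. The data below is synthetic (random numbers with a label tied to the first feature), so the score says nothing about real coughs, and the choice of model and of ROC AUC as the metric is ours, not a recommendation from the dataset authors.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

# Synthetic stand-ins for the 13 mean-MFCC features (not real cough data)
n_train, n_test, n_feat = 200, 50, 13
X_train = rng.normal(size=(n_train, n_feat))
X_test = rng.normal(size=(n_test, n_feat))
# Tie the label to the first feature so there is something to learn
y_train = (X_train[:, 0] > 0.5).astype(int)
y_test = (X_test[:, 0] > 0.5).astype(int)

# class_weight="balanced" reweights classes inversely to their frequency
clf = LogisticRegression(class_weight="balanced", max_iter=1000)
clf.fit(X_train, y_train)

scores = clf.predict_proba(X_test)[:, 1]   # predicted probability of class 1
auc = roc_auc_score(y_test, scores)
print(f"ROC AUC on synthetic data: {auc:.2f}")
```

On the real data, `X_train` and `y_train` would come from the `rawMFCC` and `label` entries of `train_features`.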

## Part 6: Saving the Trained Model

If you developed your model in TensorFlow or Keras, the following code chunk saves it in a standardized way: the architecture as a JSON file and the weights as an HDF5 file. We accept PyTorch trained models as well, but you'll have to write your own saving code for now! :)

In [None]:
# Saving a model
def save_model(your_model,savedir,name):
    # Drop any file extension from the given name
    if '.' in name:
        name = name.split('.')[0]

    if not os.path.isdir(savedir):
        os.mkdir(savedir)

    # Architecture goes to <name>.json, weights to <name>.h5
    Model_JSON = your_model.to_json()
    with open(os.path.join(savedir,name+'.json'), "w") as json_file:
        json_file.write(Model_JSON)
    your_model.save_weights(os.path.join(savedir,name+'.h5'))