# PART1: Data Exploration

## Introductory notes
The original article is available at: https://www.nature.com/articles/s41597-021-00937-4


The dataset is described there and it is worth a read. Some interesting notes from the article:
* Metadata explanation: https://www.nature.com/articles/s41597-021-00937-4/tables/2
* Expert labels are not fully consistent, and the experts often disagree on the same audio recording. Cough type and congestion are the features on which the four experts tend to agree most. Everything else should be used with care, if at all.
* A certain number of records do not contain any cough audio; they are spurious. The authors ran an XGBoost classifier to identify records with coughs, and its score is stored in the column cough_detected. They suggest cutting at cough_detected > 0.8, which should leave you with a contamination of non-cough audio of about 4.6%.
* On top of that, the SNR variable gives the Signal-to-Noise Ratio of the cough audio. A high SNR corresponds to clearer cough sounds.
* The XGBoost classifier that identifies whether there is any coughing sound at all was based on the audio features listed at https://www.nature.com/articles/s41597-021-00937-4/tables/3
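
The recommended cut on the classifier score can be sketched as follows. This is a minimal example on toy data, not the authors' code: the helper name `select_cough_records` and the toy values are assumptions, and only the column name `cough_detected` and the 0.8 threshold come from the notes above.

```python
import pandas as pd

def select_cough_records(df, min_score=0.8, score_col="cough_detected"):
    """Hypothetical helper: keep rows whose score is strictly above min_score.

    Rows with a missing score compare as False and are therefore dropped.
    """
    return df[df[score_col] > min_score].copy()

# Toy metadata standing in for metadata_compiled.csv (values are made up)
toy = pd.DataFrame({
    "uuid": ["a", "b", "c", "d"],
    "cough_detected": [0.95, 0.40, 0.81, None],
    "SNR": [12.3, 0.0, 25.1, 7.8],
})

selected = select_cough_records(toy)
print(f"kept {len(selected)} of {len(toy)} records")
```

Keeping the threshold as a parameter makes it easy to check how many records survive at looser cuts before settling on 0.8.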



In [None]:
# general purpose libraries
import numpy as np
import datetime as dt
import pandas as pd
import os
import pickle
from timeit import default_timer as timer
from collections import OrderedDict

pd.set_option("display.max_columns", None)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        if filename.endswith(".csv"):
            print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
# plots and visualisation
import matplotlib.pyplot as plt
import plotly.graph_objects as ply_go
import plotly.figure_factory as ply_ff
import plotly.colors as ply_colors #.sequential.Oranges as orange_palette
from plotly.offline import init_notebook_mode, iplot
init_notebook_mode(connected=True)

## Import metadata file and visualise key metadata properties

In [None]:
data_dir = '/kaggle/input/covid19-cough-audio-classification/'
metadata_file = "metadata_compiled.csv"
metadata=pd.read_csv(data_dir+metadata_file,sep=",")
print(metadata.columns)

# convert strings 'True'/'False' to genuine booleans
cols_to_boolean = (['respiratory_condition', 'fever_muscle_pain',
                     'dyspnea_1', 'wheezing_1', 'stridor_1','choking_1', 'congestion_1', 'nothing_1',
                     'dyspnea_2', 'wheezing_2', 'stridor_2','choking_2', 'congestion_2', 'nothing_2',
                     'dyspnea_3', 'wheezing_3', 'stridor_3','choking_3', 'congestion_3', 'nothing_3',
                     'dyspnea_4', 'wheezing_4', 'stridor_4','choking_4', 'congestion_4', 'nothing_4'])
# Map each flag explicitly: a plain astype(bool) would turn the string 'False' into True.
# Missing values are not in the lookup, so they stay NaN.
bool_lookup = {True: True, False: False, 'True': True, 'False': False}
for c in cols_to_boolean:
    metadata[c] = metadata[c].map(bool_lookup)

print("NULL or NA records for each column:")
print( metadata.isnull().sum() )
    
cols_to_fillna = ['gender', 'status','diagnosis_1','diagnosis_2','diagnosis_3','diagnosis_4']
metadata[cols_to_fillna]=metadata[cols_to_fillna].fillna('n/a')

#print(metadata.dtypes)
#print(metadata.shape)
metadata.head(5)
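
A note on the boolean conversion in the cell above: a plain `astype(bool)` cast is only safe when the values are already Python booleans, because any non-empty string, including `'False'`, is truthy. The sketch below shows the difference on a toy column; the helper `to_nullable_bool` is hypothetical and the toy values are made up.

```python
import numpy as np
import pandas as pd

# Toy column as it would look if the flags had been read as literal strings
raw = pd.Series(["True", "False", np.nan, "False"], dtype=object)

def to_nullable_bool(col):
    """Hypothetical helper: map bool-like values to True/False, leave anything else as NaN."""
    lookup = {True: True, False: False, "True": True, "False": False}
    return col.map(lookup)

naive = raw.dropna().astype(bool)  # truthiness of the strings, not their meaning
safe = to_nullable_bool(raw)       # explicit mapping, missing values preserved

print(naive.tolist())
print(safe.tolist())
```

The explicit lookup also turns unexpected spellings (for example `'true'`) into NaN instead of silently guessing, which makes them visible in the null counts.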

## Visualise metadata

In [None]:
my_title_layout = dict({"text":"my distribution", 'xanchor':'center', 'x':0.5, 'y':0.9, 'font':{'size':24}})
my_xaxis_layout = dict(title=dict(text="my x axis", font={'size':16}))
my_layout = dict(title=my_title_layout,
                xaxis= my_xaxis_layout)
bin_size_dict = dict(cough_detected=0.001,SNR=0.5, age=1, gender=1, respiratory_condition=1, fever_muscle_pain=1, status=1 )
xaxis_title_dict = dict(cough_detected="Cough Detection Score",SNR="Signal-to-Noise Ratio" , age="Age", 
                        gender="Gender", respiratory_condition="Resp. Condition", fever_muscle_pain="Fever", status="Status" )

for c in ['cough_detected','SNR', 'age', 'gender','respiratory_condition','fever_muscle_pain', 'status' ]:
    hist_data = ply_go.Histogram(x=metadata[c], name=c, showlegend=False, xbins={'size':bin_size_dict[c]})
    fig = ply_go.Figure(data=[hist_data], layout=my_layout)
    fig.update_layout(title={'text': c+" distribution"}, xaxis={"title":{"text":xaxis_title_dict[c]}})
    fig.show()
###


fig = ply_go.Figure( layout=my_layout)
for tmp_diag in metadata['status'].unique():
    violin_data = ply_go.Violin(x=metadata.loc[metadata['status']==tmp_diag, 'status'],
                                y=metadata.loc[metadata['status']==tmp_diag, 'age'],
                                name=tmp_diag,
                                box_visible=True,
                                meanline_visible=True)
    fig.add_trace(violin_data)
    #end for
fig.update_layout(title={'text': "Distribution of AGE by type of DIAGNOSIS"}, xaxis={"title":{"text":None}}, 
                  yaxis={"title":{"text":"AGE [years]"}})
fig.show()


fig = ply_go.Figure( layout=my_layout)
for tmp_diag in metadata['status'].unique():
    violin_data = ply_go.Violin(x=metadata.loc[metadata['status']==tmp_diag, 'status'],
                                y=metadata.loc[metadata['status']==tmp_diag, 'cough_detected'],
                                name=tmp_diag,
                                box_visible=True,
                                meanline_visible=True)
    fig.add_trace(violin_data)
    #end for loop on unique statuses

    
fig.update_layout(title={'text': "Distribution of cough detection score by type of DIAGNOSIS"}, 
                  xaxis={"title":{"text":None}}, 
                  yaxis={"title":{"text":"Cough Detection Score"}})
fig.show()



fig = ply_go.Figure( layout=my_layout)
for tmp_diag in metadata['status'].unique():
    violin_data = ply_go.Violin(x=metadata.loc[(metadata['status']==tmp_diag)&(metadata['SNR']<100), 'status'],
                                y=metadata.loc[(metadata['status']==tmp_diag)&(metadata['SNR']<100), 'SNR'],
                                name=tmp_diag,
                                box_visible=True,
                                meanline_visible=True)
    fig.add_trace(violin_data)
    #end for loop on unique statuses

    
fig.update_layout(title={'text': "Distribution of SNR by type of DIAGNOSIS"}, 
                  xaxis={"title":{"text":None}}, 
                  yaxis={"title":{"text":"Signal-to-Noise Ratio"}})
fig.show()


In [None]:
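def summarise_pivot_df(df, xcols, ycols, valcol):
    # count the valcol entries for each (xcols, ycols) combination and pivot the counts into a 2D table
    summary_df = df[xcols+ycols+valcol].copy() # work on a copy, so the .loc assignments below do not raise SettingWithCopyWarning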
    summary_df.loc[summary_df[xcols[0]].isnull(),xcols] = 'n/a' #replace NA with a default string
    summary_df.loc[summary_df[ycols[0]].isnull(),ycols] = 'n/a' #replace NA with a default string
    summary_df = summary_df.groupby(xcols+ycols).count().reset_index()
    print(summary_df)
    pivot_df = pd.pivot_table(data=summary_df,values=valcol, index=xcols,columns=ycols)
    pivot_df.columns = [ c[1] for c in pivot_df.columns ] # get rid of multiindex
    return pivot_df

def pandas_to_plotly_heatdata(df):
    #print(df.index)
    return {'x': df.columns.tolist(),
            'y': df.index.tolist(),
            'z': df.values.tolist()}

# Heatmap Fever vs status
meta_summary_df = summarise_pivot_df(metadata, ['fever_muscle_pain'], ['status'], ['uuid'])
meta_summary_df = meta_summary_df[['healthy','symptomatic','COVID-19','n/a']]
n = meta_summary_df.sum().sum()
print(meta_summary_df.head(5) )

heat_data = ply_go.Heatmap(pandas_to_plotly_heatdata(meta_summary_df), 
                           colorscale=ply_colors.sequential.Oranges,
                           colorbar={'title':{'text':"Entries", 'side':"top"}},
                           text=meta_summary_df.values)
rounded_annotation = [ ["NA" if pd.isnull(c) else "{:.0f}".format(c) for c in r] for r in heat_data['z']]
fig = ply_ff.create_annotated_heatmap(z=heat_data['z'], 
                                      x=heat_data['x'],
                                      y=[i for i,t in enumerate(heat_data['y'])],
                                      annotation_text=rounded_annotation,
                                      colorscale=heat_data['colorscale'],
                                      showscale=True,
                                      colorbar=heat_data['colorbar']  )
fig.update_layout( yaxis={"title":{"text":"Fever / Muscle Pain"},
                          "tickmode":'array',"tickvals":[2,1,0],"ticktext":['n/a','Yes','No']})
fig.show()

heat_data = ply_go.Heatmap(pandas_to_plotly_heatdata(100.0*meta_summary_df/n) ,
                           colorscale=ply_colors.sequential.Oranges,
                          colorbar={'title':{'text':"Percentage", 'side':"top"}})
rounded_annotation = [ [ "NA" if pd.isna(c)  else "{:.2f}%".format(c)  for c in r] for r in heat_data['z']]
fig = ply_ff.create_annotated_heatmap(z=heat_data['z'], 
                                      x=heat_data['x'],
                                      y=[i for i,t in enumerate(heat_data['y'])],
                                      annotation_text=rounded_annotation,
                                      colorscale=heat_data['colorscale'],
                                      showscale=True,
                                      colorbar=heat_data['colorbar'])
fig.update_layout( yaxis={"title":{"text":"Fever / Muscle Pain"},
                          "tickmode":'array',"tickvals":[2,1,0],"ticktext":['n/a','Yes','No']})
fig.show()

# Heatmap RespCond vs status
meta_summary_df = summarise_pivot_df(metadata, ['respiratory_condition'], ['status'], ['uuid'])
meta_summary_df = meta_summary_df[['healthy','symptomatic','COVID-19','n/a']]
n = meta_summary_df.sum().sum()
#print(meta_summary_df.head(5) )
#print( pandas_to_plotly_heatdata(meta_summary_df) )
heat_data = ply_go.Heatmap(pandas_to_plotly_heatdata(meta_summary_df), 
                           colorscale=ply_colors.sequential.Oranges,
                           colorbar={'title':{'text':"Entries", 'side':"top"}},
                           text=meta_summary_df.values)
rounded_annotation = [ ["NA" if pd.isnull(c) else "{:.0f}".format(c) for c in r] for r in heat_data['z']]
fig = ply_ff.create_annotated_heatmap(z=heat_data['z'], 
                                      x=heat_data['x'],
                                      #y=heat_data['y'],#
                                      y=[int(i) for i,t in enumerate(heat_data['y']) ],
                                      annotation_text=rounded_annotation,
                                      colorscale=heat_data['colorscale'],
                                      showscale=True,
                                      colorbar=heat_data['colorbar']  )
fig.update_layout( yaxis={"title":{"text":"Respiratory Condition"},
                          "tickmode":'array',"tickvals":[2,1,0,],"ticktext":['n/a','Yes','No']})
fig.show()
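
The count tables behind these heatmaps can also be obtained with `pandas.crosstab`, which groups, counts and pivots in one call. The following is a minimal sketch on toy data, not a drop-in replacement for `summarise_pivot_df`: the toy rows are made up, the flag is relabelled to the strings Yes/No/n/a so that the index has a single type, and absent combinations come out as 0 rather than NaN.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the metadata table (values are made up)
toy = pd.DataFrame({
    "uuid": ["a", "b", "c", "d", "e", "f"],
    "fever_muscle_pain": [True, False, np.nan, True, False, False],
    "status": ["healthy", "COVID-19", np.nan, "healthy", "symptomatic", "healthy"],
})

# Give missing values an explicit label so they show up as their own row/column
symptom = toy["fever_muscle_pain"].map({True: "Yes", False: "No"}).fillna("n/a")
status = toy["status"].fillna("n/a")

counts = pd.crosstab(symptom, status)                # entries per (symptom, status) cell
percent = 100.0 * counts / counts.to_numpy().sum()   # same table as a share of all entries

print(counts)
print(percent.round(2))
```

Because the row labels are already strings, the resulting table could be passed to a heatmap without remapping the tick labels by position.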


In [None]:
# Heatmap Age vs status
metadata['age_class'] = 0 # NAs will end up here
metadata.loc[ (metadata['age']<40),'age_class'] = 1
metadata.loc[ (metadata['age']>=40) &(metadata['age']<60),'age_class'] = 2
metadata.loc[ (metadata['age']>=60),'age_class'] = 3

meta_summary_df = summarise_pivot_df(metadata, ['age_class'], ['status'], ['uuid'])
meta_summary_df = meta_summary_df[['healthy','symptomatic','COVID-19','n/a']]
n = meta_summary_df.sum().sum()
print(meta_summary_df.head(5) )
#print( pandas_to_plotly_heatdata(meta_summary_df) )
heat_data = ply_go.Heatmap(pandas_to_plotly_heatdata(meta_summary_df), 
                           colorscale=ply_colors.sequential.Oranges,
                           colorbar={'title':{'text':"Entries", 'side':"top"}},
                           text=meta_summary_df.values)
rounded_annotation = [ ["NA" if pd.isnull(c) else "{:.0f}".format(c) for c in r] for r in heat_data['z']]
fig = ply_ff.create_annotated_heatmap(z=heat_data['z'], 
                                      x=heat_data['x'],
                                      y=heat_data['y'],#
                                      #y=[int(i) for i,t in enumerate(heat_data['y']) ],
                                      annotation_text=rounded_annotation,
                                      colorscale=heat_data['colorscale'],
                                      showscale=True,
                                      colorbar=heat_data['colorbar']  )
fig.update_layout( yaxis={"title":{"text":"Age"},
                          "tickmode":'array',"tickvals":[3,2,1,0,],"ticktext":['>= 60 yo','40 - 60 yo','< 40 yo', 'n/a']})
fig.show()
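
The age classes used for this heatmap can also be built in a single call with `pandas.cut`. A minimal sketch on toy ages, assuming the same edges as above (40 and 60, lower edge included) and the same convention of sending missing ages to class 0; the helper name `age_to_class` is hypothetical.

```python
import numpy as np
import pandas as pd

def age_to_class(age):
    """Hypothetical helper: 1 for age < 40, 2 for 40 <= age < 60, 3 for age >= 60, 0 for missing."""
    # right=False makes each bin include its lower edge: [-inf, 40), [40, 60), [60, inf)
    codes = pd.cut(age, bins=[-np.inf, 40, 60, np.inf], labels=False, right=False)
    # codes are 0, 1, 2 (NaN for missing ages); shift by one and send NaN to class 0
    return (codes + 1).fillna(0).astype(int)

ages = pd.Series([25, 40, 59, 60, 93, np.nan])
print(age_to_class(ages).tolist())
```

Keeping the bin edges in one list makes it harder for the class boundaries and the tick labels to drift apart.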

## Concluding remarks

A couple of things to keep in mind for the next stages:
* there are a lot of audio records without a diagnosis label in the "status" column. We might want to drop them at first
* the dataset is imbalanced: the healthy:COVID-19 ratio is about 11:1
* a fair number of entries with a COVID-19 status have a low cough_detected score. The dataset authors recommend keeping only entries with cough_detected > 0.8, but this might result in the loss of a significant fraction of the already limited COVID-19 sample. A possibility could be to train different ML models for different categories of records (categorised by age, cough_detected, SNR, etc.).
* according to the data, there are 90+ year old seniors who connected to the website, recorded an audio sample of their cough and filled in a web form with their personal details, all of this while being diagnosed with COVID-19. Allow me to be suspicious... we might decide to apply a cut on the maximum age of the entries
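
The trade-off between the cough_detected cut and the size of the COVID-19 sample can be quantified before committing to a threshold. A minimal sketch on toy data, assuming only the `status` and `cough_detected` columns; the helper `retention_by_status` is hypothetical and the numbers are made up for illustration.

```python
import pandas as pd

def retention_by_status(df, threshold=0.8):
    """Hypothetical helper: per status, count labelled records before and after the score cut."""
    labelled = df.dropna(subset=["status"])  # ignore records without a status label
    total = labelled.groupby("status").size()
    kept = labelled[labelled["cough_detected"] > threshold].groupby("status").size()
    kept = kept.reindex(total.index, fill_value=0)  # statuses that lose every record get 0
    return pd.DataFrame({"total": total, "kept": kept, "kept_fraction": kept / total})

# Toy records: four healthy, two COVID-19, one without a label
toy = pd.DataFrame({
    "status": ["healthy"] * 4 + ["COVID-19"] * 2 + [None],
    "cough_detected": [0.99, 0.95, 0.85, 0.10, 0.90, 0.30, 0.99],
})

summary = retention_by_status(toy)
ratio_after_cut = summary.loc["healthy", "kept"] / summary.loc["COVID-19", "kept"]
print(summary)
print(f"healthy:COVID-19 after the cut = {ratio_after_cut:.1f}:1")
```

Running the same summary for a few thresholds would show how quickly the COVID-19 sample shrinks and whether the class imbalance gets worse after the cut.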

