<h1><center><span style="color:blue">DEEP LEARNING MODEL OF COSWARA PROJECT'S COVID-19 DATASET</span></center></h1>

<p>
This work responds to an assignment whose objective is to build a deep learning model using the Coswara project's Covid-19 dataset. I have focused mainly on data analysis and data processing rather than on building a carefully tuned, high-performing deep learning model. The aim is to test the claim, often repeated by data scientists, that a well-studied and well-processed dataset can often reduce the need for hyperparameter tuning of machine learning models.
</p>

## I. Description

Project [Coswara](https://github.com/iiscleap/Coswara-Data) by Indian Institute of Science (IISc) Bangalore is an attempt to build a diagnostic tool for Covid-19 based on respiratory, cough and speech sounds. The project is in the data collection stage now. It requires the participants to provide a recording of breathing sounds, cough sounds, sustained phonation of vowel sounds and a counting exercise. 

**NOTE:** This repository contains the raw audio data received at [https://coswara.iisc.ac.in/](https://coswara.iisc.ac.in/). 

The annotation process is ongoing on [GitHub](https://github.com/iiscleap/Coswara-Exp) and lags behind the data uploaded here. The data repository for Project Coswara can be found at this [link](https://coswara.iisc.ac.in/). To view more information about the database, such as the distributions of gender, age, etc., [click here](https://iiscleap.github.io/coswara-blog/coswara/2020/11/23/visualize_coswara_data_metadata.html).

Each folder contains metadata and recordings corresponding to a person. To download and extract the data, you can run the script extract_data.py. Voice samples collected include breathing sounds (fast and slow), cough sounds (deep and shallow), phonation of sustained vowels (/a/ as in made, /i/,/o/), and counting numbers at slow and fast pace. Metadata information collected includes the participant's age, gender, location (country, state/ province), current health status (healthy/ exposed/ positive/recovered) and the presence of comorbidities (pre-existing medical conditions). 

## II. Important Links

- [Link to dataset used in this study](https://raw.githubusercontent.com/iiscleap/Coswara-Data/master/combined_data.csv)

- [Google Colab notebook link showing visualizations](https://colab.research.google.com/github/iiscleap/coswara-blog/blob/master/_notebooks/2020-11-23-visualize_coswara_data_metadata.ipynb)

- [Binder notebook link showing some visualizations](https://hub.gke2.mybinder.org/user/iiscleap-coswara-blog-ska67jbp/notebooks/_notebooks/2020-11-23-visualize_coswara_data_metadata.ipynb)

<h2><span style="color:blue">1. Initiate Data Analysis</span></h2>

<p>
    <ul>
        <li>Install two essential libraries, namely, "watermark" and "opendatasets", if not already installed. Use "watermark" to check some essential information.</li><br />
        <li>Check the environment of this notebook and the version of Python interpreter running it.</li><br />
        <li>Load all essential packages and check their versions. Then, import all necessary modules from the imported packages.</li><br />
    </ul>
</p>

In [None]:
!pip install watermark --quiet
# !pip install opendatasets --quiet

In [None]:
%load_ext watermark

In [None]:
%watermark -a "Tirthankar Dutta"
%watermark -dhmntuz

In [None]:
import os, sys, platform
from platform import python_version

env = sys.executable
py_version = python_version()

print()
print("notebook env: %s" % (env))
print("python --version: %s" % (py_version))
print()

In [None]:
import warnings
warnings.filterwarnings("ignore") 


In [None]:
import numpy as np 
import pandas as pd 
import matplotlib as mpl
import seaborn as sns 
import re, pprint
import sklearn as skl 
import scipy as scp
import statsmodels as sms
import IPython as ipy
import tensorflow as tf
# import opendatasets as od
import imblearn as imb

In [None]:
print()
%watermark --iversion
print()

In [None]:
from IPython.display import display
from IPython import InteractiveShell
InteractiveShell.ast_node_interactivity="all"

from matplotlib import pyplot as plt 

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import RobustScaler, StandardScaler, MinMaxScaler, LabelEncoder
from sklearn.metrics import confusion_matrix, classification_report

# from imblearn.over_sampling import SMOTE, RandomOverSampler
# from imblearn.under_sampling import NearMiss
from imblearn.combine import SMOTEENN, SMOTETomek

from tensorflow import keras
from tensorflow.keras.metrics import BinaryAccuracy, AUC

<h2><span style="color:blue">2. Load Dataset</span></h2>

<p>
    <ul>
        <li>Obtain and save dataset to disk using the "opendatasets" library</li><br />
        <li>Load dataset from disk as a Pandas dataframe</li>    
    </ul>
</p>

In [None]:
# data_url = "https://raw.githubusercontent.com/iiscleap/Coswara-Data/master/combined_data.csv"
# dataset = od.download(data_url)

In [None]:
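def get_file_path(file_name, search_path):
    """Return the first path under search_path whose file name matches file_name."""
    for root, _dirs, files in os.walk(search_path):
        if file_name in files:
            return os.path.join(root, file_name)
    # Fail with a clear message instead of an IndexError on an empty list
    raise FileNotFoundError(f"'{file_name}' not found under '{search_path}'")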


file_name = "combined_data.csv"
pwd = os.getcwd()
file_path = get_file_path(file_name, pwd)

print()
print("current working directory: %s" % (pwd))
print()
print("absolute file path: %s" % (file_path))
print()

In [None]:
data = pd.read_csv(file_path, header="infer")

<h2><span style="color:blue">3. Explore and understand the dataset</span></h2>

<p>
    <ul>
        <li>Configure Pandas display options.</li><br />
        <li>Display dataset</li><br />
        <li>Check dataset dimensionality</li><br />  
        <li>Check the number of unique data types and their distributions present in the dataset</li><br />
        <li>Identify the columns to delete based on the following criteria:</li><br />
        <ol>
            <li>Redundancy - No useful information</li><br />
            <li>Missing values - Missing value cut-off/threshold is 90%</li><br />
        </ol>
        <li>Get a summary of dataset</li><br />
        <li>Get descriptive statistics of the numerical and categorical variables separately.</li>
    </ul>
</p>
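
The 90% missing-value criterion can also be computed without an explicit loop. The sketch below is a minimal illustration on a hypothetical toy frame (not the Coswara data); `split_by_missing_pct` is an illustrative helper name, and the column names are made up.

```python
import numpy as np
import pandas as pd

def split_by_missing_pct(df, threshold_pct=90.0):
    """Return (above, at_or_below): missing-value percentages per column."""
    pct = df.isnull().mean() * 100          # share of nulls per column, in percent
    pct = pct[pct > 0].round(2)             # keep only columns that have nulls
    above = pct[pct > threshold_pct]
    at_or_below = pct[pct <= threshold_pct]
    return above, at_or_below

# Hypothetical toy frame, not the Coswara data
toy = pd.DataFrame({
    "id": range(20),
    "mostly_missing": [1.0] + [np.nan] * 19,    # 19 of 20 values missing
    "some_missing": [np.nan] * 5 + ["y"] * 15,  # 5 of 20 values missing
})

above, at_or_below = split_by_missing_pct(toy)
print(above.to_dict())
print(at_or_below.to_dict())
```

Columns in `above` are candidates for deletion; columns in `at_or_below` are kept and have their missing values filled later.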

In [None]:
pd.set_option("display.precision", 4)
pd.set_option("display.max_rows", 50)
pd.set_option("display.max_columns", 40)
pd.set_option("display.max_colwidth", None)

In [None]:
data

In [None]:
print()
print(f"dataset dimensionality: {data.shape}")
print()

In [None]:
print()
print(f"unique data types and number of each unique data type in the dataset: \n{data.dtypes.value_counts()}")
print()

In [None]:
print()
print(f"label distribution of target variable 'covid_status':\n{data['covid_status'].value_counts()}")
print()

In [None]:
mvs_dict_gt_threshold = {}
mvs_dict_le_threshold = {}
vars_with_no_mvs = []
for var in data.columns.to_list(): 
    missing_vals_magn = data[var].isnull().sum()
    if missing_vals_magn > 0: 
        missing_vals_pct = (missing_vals_magn / data.shape[0]) * 100
        if missing_vals_pct <= 90.0: 
            mvs_dict_le_threshold[var] = [missing_vals_magn, round(missing_vals_pct, 2)]
        else:
            mvs_dict_gt_threshold[var] = [missing_vals_magn, round(missing_vals_pct, 2)]
    else:
        vars_with_no_mvs.append(var)

print()
print(f"total number of variables with missing values: "
      f"{len(mvs_dict_gt_threshold) + len(mvs_dict_le_threshold)} \n")

print("****" * 15)
print(f"number of variables with missing values > 90%: {len(mvs_dict_gt_threshold)}")
print("****" * 15)
for var, mvs_info in mvs_dict_gt_threshold.items():
    print(f"{var}: {mvs_info[0]}, {mvs_info[1]}%")


print()

print("****" * 15)
print(f"number of variables with missing values <= 90%: {len(mvs_dict_le_threshold)}")
print("****" * 15)
for var, mvs_info in mvs_dict_le_threshold.items():
    print(f"{var}: {mvs_info[0]}, {mvs_info[1]}%")

print()

In [None]:
print()
print(f"number of variables without missing values: {len(vars_with_no_mvs)}")
print()
print(f"variables without missing values:\n\n{vars_with_no_mvs}")

In [None]:
print()
print("basic information of dataset:")
data.info()
print()

In [None]:
print()
print("descriptive statistics of the numerical variables:")
data.describe(exclude="object").transpose()
print()

In [None]:
print()
print("descriptive statistics of the categorical variables:")
data.describe(include="object").transpose()
print()

<h2><span style="color:blue">4. Summary</span></h2>

<p>
    <ul>
        <li>The dataset consists of 1895 records and 35 variables. Out of these 35 variables, 33 are of <code>object</code> data type, 1 is of <code>float64</code> data type, and 1 is of <code>int64</code> data type.</li><br />
        <li>Out of the 35 variables, the one labeled "covid_status" is the "target" variable. The remaining 34 variables are "feature" variables. Analysis of the target variable "covid_status" reveals the following:</li><br />
         <ol>
            <li>The target is a highly imbalanced multiclass (7-class) variable.</li><br />
            <li>The label of the majority class is called "healthy".</li><br />
            <li>Keeping in mind our time crunch and the highly imbalanced nature of the target variable, we will reduce this imbalanced multiclass classification task to a binary classification problem.</li><br />
            <li>To do this, we shall first relabel "healthy" $\rightarrow$ "Unaffected" and then group the remaining labels/classes into a single category called "Affected". <span style="color:brown">We shall call the resulting binary class distribution an "effective" 2-class distribution.</span></li><br />
            </ol>
        <li>28 variables have null/missing values.</li><br />
        <ol>
            <li>Out of the 28 variables, 6 have missing values <= 90%. These 6 variables are "l_l", "rU", "smoker", "um", "cough" and "test_status".</li><br />
            <li>Out of the 28 variables, 22 have missing values > 90%. They include "cold", "ht", "diabetes", "fever", "asthma", "ihd", "bd", "st", "ftg", "mp", "loss_of_smell", "cld", "diarrhoea", "pneumonia", "ctScan", "testType", "test_date", "vacc", "ctScore", "others_resp" and "others_preexist".</li><br />
        </ol>
        <li>There are 7 variables, including the "target" variable, that do not have missing values. Out of these 7 variables, the first variable (column) labeled "id" is a redundant column.</li>
    </ul>
</p>
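
The regrouping described above can be sketched compactly. The example below uses a hypothetical handful of labels (the label names are those that appear in this notebook's mapping). Unlike an explicit dictionary, this variant sends every label other than "healthy" to "Affected", including labels that were never listed, which is an assumption that may or may not be wanted.

```python
import pandas as pd

# Hypothetical sample of raw labels, not the real label distribution
status = pd.Series([
    "healthy", "positive_mild", "healthy",
    "recovered_full", "no_resp_illness_exposed",
])

# Keep "healthy" where it occurs; everything else becomes "Affected"
binary_status = status.where(status == "healthy", "Affected")
binary_status = binary_status.replace({"healthy": "Unaffected"})

print(binary_status.value_counts().to_dict())
```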

<h2><span style="color:blue">5. First set of data preprocessing</span></h2>

<p>
    <ul>
        <li>We will first regroup the classes of the target variable, thereby reducing a 7-class variable to a 2-class variable.</li><br />
        <li>We will then delete 23 variables from the dataset out of which 22 variables contain missing values > 90% while the last variable is the "id" variable which (as mentioned) is redundant.</li><br />
        <li>Finally, we shall perform an analysis of the variables without missing values.</li>
    </ul>
</p>

In [None]:
reduced_data = data.copy()

In [None]:
reduced_data['covid_status'] = reduced_data['covid_status'].replace(
    {
        'healthy': "Unaffected",
        'no_resp_illness_exposed': "Affected",
        'resp_illness_not_identified': "Affected",
        'recovered_full': 'Affected', 
        'positive_mild': 'Affected',
        'positive_asymp': 'Affected', 
        'positive_moderate': 'Affected'
    })

In [None]:
tmp_target = np.array(reduced_data.loc[:, "covid_status"])

label_encoder = LabelEncoder()
tmp_target = label_encoder.fit_transform(tmp_target)
print()
print(f"result of applying label_encoder: {label_encoder.classes_}")
print()

reduced_data = reduced_data.drop(columns=["covid_status"], axis=1)
reduced_data['covid_status'] = tmp_target 

In [None]:
print()
print("label distribution of encoded & regrouped 'covid_status' variable:")
print(reduced_data["covid_status"].value_counts().sort_values(ascending=False))
print()

In [None]:
cols_to_delete = ["id"] + list(mvs_dict_gt_threshold.keys())
print()
print(f"columns to delete: {cols_to_delete}")
print()

In [None]:
reduced_data = reduced_data.drop(columns=cols_to_delete, axis=1)

print()
print(f"dimensionality of reduced dataset (after variables deletion): {reduced_data.shape}")
print()
print("reduced dataset:")
reduced_data
print()

<h2><span style="color:blue">6. Analysis and preprocessing of categorical variables without missing values, including the "target" variable</span></h2>
<p>
    <ul> 
        <li>Description of the variables:</li><br />
        <ol>
            <li>"ep": Proficient in English (y/n)</li>
            <li>"a":  Age</li>
            <li>"covid_status": Health status (e.g., healthy, positive_mild, recovered_full)</li>
            <li>"g": Gender (male/female/other)</li>
            <li>"l_c": Country</li>
            <li>"l_s": State</li><br />
        </ol>
        <li>We will work at a much coarser granularity (owing to time constraints). We therefore drop the 'l_s' variable and replace the 'l_c' variable with a new variable 'region', which keeps "India" (the primary country) as its own group and groups all remaining countries by continent.</li>
    </ul>
</p>
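
When grouping countries into regions, it helps to surface any country that the lookup does not cover. The sketch below is a minimal illustration with a hypothetical, abbreviated lookup and made-up input (including a deliberately unknown country); it uses `Series.map`, which yields `NaN` for missing keys, whereas `Series.replace` would silently leave an unknown country name in place as its own region.

```python
import pandas as pd

# Hypothetical, abbreviated lookup; the notebook's full mapping covers many more countries
country_to_region = {
    "United States": "North America",
    "Germany": "Europe",
    "Japan": "Asia",
    "Brazil": "South America",
}

countries = pd.Series(["India", "Germany", "Japan", "India", "Atlantis"])

# map() yields NaN for countries missing from the lookup, which makes gaps easy to spot
region = countries.map(country_to_region)
region = region.mask(countries == "India", "India")   # India stays its own group
unmapped = sorted(countries[region.isna()].unique())

print(region.fillna("Other").tolist())
print("unmapped:", unmapped)
```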

In [None]:
vars_with_no_mvs.remove("id")
print()
print("variables without missing values after deleting 'id' variable:")
print(*vars_with_no_mvs, sep=", ", end="\n\n")

In [None]:
print() 
print("data types of the variables without missing values:")
reduced_data[vars_with_no_mvs].dtypes

In [None]:
reduced_data = reduced_data.drop(columns=['l_s'], axis=1)

In [None]:
reduced_data['l_c'].unique()

In [None]:
reduced_data['region'] = reduced_data['l_c'].replace(
    {
        'United States': "North America",
        'Canada': "North America",
        'France': "Europe", 
        'Finland': "Europe", 
        'Germany': "Europe",
        'China': "Asia", 
        'Oman': "Asia", 
        'Ireland': "Europe", 
        'Switzerland': "Europe", 
        'Iran': "Asia",
        'Ukraine': "Europe", 
        'United Arab Emirates': "Asia", 
        'United Kingdom': "Europe",
        'Netherlands The': "Europe", 
        'Hungary': "Europe", 
        'Israel': "Asia", 
        'Turkey': "Europe", 
        'Singapore': "Asia",
        'Qatar': "Asia", 
        'Saudi Arabia': "Asia", 
        'Mexico': "North America", 
        'Spain': "Europe", 
        'Malaysia': "Asia", 
        'Portugal': "Europe",
        'Japan': "Asia", 
        'Bahrain': "Asia", 
        'Sri Lanka': "Asia", 
        'Philippines': "Asia", 
        'Argentina': "South America",
        'Brazil': "South America", 
        'Indonesia': "Asia", 
        'Ecuador': "South America", 
        'Italy': "Europe", 
        'Korea South': "Asia",
        'Belgium': "Europe", 
        'Sweden': "Europe", 
        'Norway': "Europe", 
        'Romania': "Europe", 
        'Iraq': "Asia", 
        'Syria': "Asia",
        'Russia': "Europe", 
        'Vietnam': "Asia", 
        'Bangladesh': "Asia", 
        'Greece': "Europe", 
        # NOTE: Egypt is geographically in Africa; mapping it to "Africa" would add a new region category
        'Egypt': "Asia"
})

In [None]:
reduced_data = reduced_data.drop(columns=['l_c'], axis=1)

In [None]:
reduced_data['region'].value_counts()

In [None]:
reduced_data

In [None]:
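# "seaborn-ticks" was renamed to "seaborn-v0_8-ticks" in Matplotlib 3.6
ticks_style = "seaborn-v0_8-ticks" if "seaborn-v0_8-ticks" in plt.style.available else "seaborn-ticks"
plt.style.use(ticks_style)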
plt.rcParams["font.size"] = 12.0

In [None]:
fig, axes = plt.subplots(2, 2, figsize=(15,13), dpi=100)

ax1 = axes[0][0]
sns.countplot(x=reduced_data["covid_status"], \
    palette="YlGnBu", ax=ax1)
ax1.set_title("Class/label distribution of covid status", \
    fontsize=18)
ax1.set_xlabel('Labels', fontsize = 15)
ax1.set_ylabel('Label Distribution (weight)', fontsize = 15)
x_offset = -0.1; y_offset = -207.5
for p in ax1.patches:
    b = p.get_bbox()
    val = "{:0.2f}".format(b.y1 + b.y0)        
    ax1.annotate(val, ((b.x0 + b.x1)/2 + x_offset, b.y1 + y_offset), fontsize=14)

ax2 = axes[0][1]
sns.histplot(reduced_data["a"], kde=True, palette="YlGnBu", ax=ax2)
ax2.set_title("Histogram plot of age", fontsize=18)
ax2.set_xlabel('Age', fontsize = 15)
ax2.set_ylabel('Distribution', fontsize = 15)

ax3 = axes[1][0]
sns.barplot(x=reduced_data["covid_status"], y=reduced_data["a"], \
    palette="YlGnBu", ci=1.0, ax=ax3)
ax3.set_title("Variation in age vs covid status", fontsize=18)
ax3.set_xlabel('Covid Status', fontsize = 15)
ax3.set_ylabel('Age', fontsize = 15)
x_offset = -0.1; y_offset = -20.5
for p in ax3.patches:
    b = p.get_bbox()
    val = "{:0.2f}".format(b.y1 + b.y0)        
    ax3.annotate(val, ((b.x0 + b.x1)/2 + x_offset, b.y1 + y_offset), fontsize=14)

ax4 = axes[1][1]
sns.boxplot(x=reduced_data["a"], palette="YlGnBu", ax=ax4)
ax4.set_title("Detection of outliers in age variable using boxplot", fontsize=18)
ax4.set_xlabel('Values of age variable', fontsize = 15)
ax4.set_ylabel('Age', fontsize = 15)

plt.show();

In [None]:
fig, axes = plt.subplots(3, 1, figsize=(15,25), dpi=100)

ax1 = axes[0]
sns.countplot(x=reduced_data["ep"], hue=reduced_data["covid_status"], \
    palette="YlGnBu", ax=ax1)
ax1.set_title("Variation in covid status vs English proficiency", fontsize=18)
ax1.set_xlabel('English Proficiency', fontsize = 15)
ax1.set_ylabel('Num. of Records', fontsize = 15)
ax1.legend(fontsize = 12, title="Covid Status", title_fontsize = 12)
# ax1.legend(fontsize = 12, bbox_to_anchor= (1.26, 1.02), title="Covid Status", title_fontsize = 12)
x_offset = -0.1; y_offset = 20.5
for p in ax1.patches:
    b = p.get_bbox()
    val = "{:0.1f}".format(b.y1 + b.y0)        
    ax1.annotate(val, ((b.x0 + b.x1)/2 + x_offset, b.y1 + y_offset), fontsize=14)

ax2 = axes[1]
sns.countplot(x=reduced_data["g"], hue=reduced_data["covid_status"], \
    palette="YlGnBu", ax=ax2)
ax2.set_title("Variation in Covid status vs gender", fontsize=18)
ax2.set_xlabel('Gender', fontsize = 15)
ax2.set_ylabel('Num. of Records', fontsize = 15)
ax2.legend(fontsize = 12, title="Covid Status", title_fontsize = 12)
# ax2.legend(fontsize = 12, bbox_to_anchor= (1.0, 1.02), title="Covid Status", title_fontsize = 12)
x_offset = -0.1; y_offset = 20.5
for p in ax2.patches:
    b = p.get_bbox()
    val = "{:0.1f}".format(b.y1 + b.y0)        
    ax2.annotate(val, ((b.x0 + b.x1)/2 + x_offset, b.y1 + y_offset), fontsize=14)

ax3 = axes[2]
sns.countplot(x=reduced_data["region"], hue=reduced_data["covid_status"], \
    palette="YlGnBu", ax=ax3)
ax3.set_title("Variation in Covid status vs region", fontsize=18)
ax3.set_xlabel('Region', fontsize = 15)
ax3.set_ylabel('Num. of Records', fontsize = 15)
ax3.legend(fontsize = 12, title="Covid Status", title_fontsize = 12)
# ax3.legend(fontsize = 12, bbox_to_anchor= (1.0, 1.02), title="Covid Status", title_fontsize = 12)
x_offset = -0.1; y_offset = 20.5
for p in ax3.patches:
    b = p.get_bbox()
    val = "{:0.1f}".format(b.y1 + b.y0)        
    ax3.annotate(val, ((b.x0 + b.x1)/2 + x_offset, b.y1 + y_offset), fontsize=14)

plt.show();

In [None]:
print()
print(f"value of skewness in the age variable: {round(reduced_data['a'].skew(),2)}")

<h2><span style="color:blue">7. Summary</span></h2>
<p>
    <ol> 
       <li>The variable "a" (age) is moderately positively skewed (skewness 0.96), and therefore we will not treat its outliers manually. Instead, we shall use a <code>RobustScaler()</code> + <code>StandardScaler()</code> combination (from the <code>sklearn.preprocessing</code> module) to deal with both the skewness and the outliers.</li><br />
        <li>The effects of features like "region" and "g" (gender) on the target variable appear to be very asymmetric, as these variables suffer from a high degree of class imbalance.</li>
    </ol>
</p>
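
One property is worth keeping in mind when combining the two scalers: with default settings both are affine maps of the form (x - a) / b, so standardising after robust scaling yields the same values as standardising alone, and neither step changes the skewness. The NumPy sketch below writes the two default transforms out by hand (it does not call scikit-learn) on hypothetical right-skewed ages; `robust_scale`, `standard_scale` and `skewness` are illustrative helpers, and `skewness` is the population version, which differs slightly from the bias-corrected value returned by Pandas `skew()`.

```python
import numpy as np

def robust_scale(x):
    """(x - median) / IQR, the default RobustScaler transform written by hand."""
    q1, q3 = np.percentile(x, [25, 75])
    return (x - np.median(x)) / (q3 - q1)

def standard_scale(x):
    """(x - mean) / std, the default StandardScaler transform written by hand."""
    return (x - x.mean()) / x.std()

def skewness(x):
    """Population skewness (third standardised moment)."""
    z = (x - x.mean()) / x.std()
    return float(np.mean(z ** 3))

rng = np.random.default_rng(0)
age = rng.gamma(shape=4.0, scale=8.0, size=1000) + 15   # hypothetical right-skewed ages

both = standard_scale(robust_scale(age))
standard_only = standard_scale(age)

print("same values:", np.allclose(both, standard_only))
print("same skewness:", np.isclose(skewness(age), skewness(both)))
```

If the skewness itself needs to be reduced, a non-linear transform (for example a logarithm) would be required; that is outside what this notebook does.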

<h2><span style="color:blue">8. Analysis and preprocessing of categorical variables with missing values</span></h2>

<p>
    <ul>
        <li>At the level of granularity at which we are working, the variable 'l_l' (locality-specific records) is redundant, so we will remove it first.</li><br />
        <li>Next, as the data type of the remaining 5 variables is <code>object</code>, we will replace the null values in all of them with "Unknown".</li><br />
        <li>We will then preprocess them as prescribed below:</li><br />
        <ol>
            <li>"rU": Assuming that ("y", "n") stand for ("Yes", "No"), i.e., ("True", "False") respectively, we will change ("y", "n") $\rightarrow$ ("True", "False").</li><br />
            <li>"smoker": Assuming that ("n", "y") refer to ("No", "Yes"), i.e., ("False", "True"), we will change ("y", "n") $\rightarrow$ ("True", "False"), thereby reducing a 4-class variable to a 2-class one.</li><br />
            <li>"um": We shall change ("n", "y") $\rightarrow$ ("False", "True") in line with the above.</li><br />
            <li>"cough": No preprocessing required.</li><br />
            <li>"test_status": We will change ("na", "p", "n") $\rightarrow$ ("NA", "True", "False").</li><br /> 
            </ol>
            <li>Next, we will look at the distributions of the labels/classes of these variables.</li><br />
            <li>Finally, we will typecast all 5 variables from <code>object</code> to <code>category</code> data type and perform One-Hot Encoding using the Pandas <code>get_dummies()</code> function.</li>
    </ul>
</p>
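
The fill, relabel and encode steps listed above can be tried on a single toy column first. The sketch below uses a hypothetical four-row frame (not the Coswara data) and shows which dummy columns `get_dummies()` produces when `drop_first=True` removes the first category in sorted order.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column, not the Coswara data
toy = pd.DataFrame({"smoker": ["y", "n", np.nan, "y"]})

# 1) fill missing values, 2) relabel, 3) cast to category
toy["smoker"] = toy["smoker"].fillna("Unknown").replace({"y": "True", "n": "False"})
toy["smoker"] = toy["smoker"].astype("category")

# 4) one-hot encode; the first category ("False") is dropped as the reference level
encoded = pd.get_dummies(toy, columns=["smoker"], drop_first=True)
print(encoded.columns.tolist())
```

A row with both dummy columns equal to zero therefore stands for the dropped reference level, here "False".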

In [None]:
reduced_data = reduced_data.drop(columns=['l_l'], axis=1)

In [None]:
cols_with_mvs = list(mvs_dict_le_threshold.keys())
print()
print(f"list of variables with missing values before removal of 'l_l' variable:\n{cols_with_mvs}")
print() 

cols_with_mvs.remove('l_l')
print()
print(f"list of variables with missing values after removal of 'l_l' variable:\n{cols_with_mvs}")
print() 

In [None]:
# fillna() replaces every missing value; the np.NaN alias was removed in NumPy 2.0
reduced_data[cols_with_mvs] = reduced_data[cols_with_mvs].fillna("Unknown")

In [None]:
reduced_data[ ['rU', 'smoker', 'um'] ] = reduced_data[ ['rU', 'smoker', 'um'] ].replace(
    {
        "y": "True",
        "n": "False"
    })

In [None]:
reduced_data['test_status'] = reduced_data['test_status'].replace(
    {
        "na": "NA",
        "p": "True",
        "n": "False",
    })

In [None]:
reduced_data

In [None]:
print()
print("value/class distributions of the 5 variables:")
print()

for var in cols_with_mvs:
    print(f"distribution for {var}:\n\n{reduced_data[var].value_counts()}")
    print()

print()

In [None]:
fig, axes = plt.subplots(5, 1, figsize=(18,35), dpi=100)

ax1 = axes[0]
sns.countplot(x=reduced_data['covid_status'], hue=reduced_data['rU'], \
    palette="YlGnBu", ax=ax1)
ax1.set_title("Covid Status vs Returning User", fontsize=18)
ax1.set_xlabel('Covid Status', fontsize = 15)
ax1.set_ylabel('Num. of Records', fontsize = 15)
ax1.legend(fontsize = 12, title="rU", title_fontsize = 12)
# ax1.legend(fontsize = 12, bbox_to_anchor= (1.26, 1.02), title="Covid Status", title_fontsize = 12)
x_offset = -0.05; y_offset = 5.5
for p in ax1.patches:
    b = p.get_bbox()
    val = "{:0.1f}".format(b.y1 + b.y0)        
    ax1.annotate(val, ((b.x0 + b.x1)/2 + x_offset, b.y1 + y_offset), fontsize=14)


ax2 = axes[1]
sns.countplot(x=reduced_data['covid_status'], hue=reduced_data['smoker'], \
    palette="YlGnBu", ax=ax2)
ax2.set_title("Covid Status vs Smoker", fontsize=18)
ax2.set_xlabel('Covid Status', fontsize = 15)
ax2.set_ylabel('Num. of Records', fontsize = 15)
ax2.legend(fontsize = 12, title="smoker", title_fontsize = 12)
# ax1.legend(fontsize = 12, bbox_to_anchor= (1.26, 1.02), title="Covid Status", title_fontsize = 12)
x_offset = -0.05; y_offset = 7.5
for p in ax2.patches:
    b = p.get_bbox()
    val = "{:0.1f}".format(b.y1 + b.y0)        
    ax2.annotate(val, ((b.x0 + b.x1)/2 + x_offset, b.y1 + y_offset), fontsize=14)

ax3 = axes[2]
sns.countplot(x=reduced_data['covid_status'], hue=reduced_data['um'], \
    palette="YlGnBu", ax=ax3)
ax3.set_title("Covid Status vs Using Mask", fontsize=18)
ax3.set_xlabel('Covid Status', fontsize = 15)
ax3.set_ylabel('Num. of Records', fontsize = 15)
ax3.legend(fontsize = 12, title="um", title_fontsize = 12)
# ax1.legend(fontsize = 12, bbox_to_anchor= (1.26, 1.02), title="Covid Status", title_fontsize = 12)
x_offset = -0.05; y_offset = 7.5
for p in ax3.patches:
    b = p.get_bbox()
    val = "{:0.1f}".format(b.y1 + b.y0)        
    ax3.annotate(val, ((b.x0 + b.x1)/2 + x_offset, b.y1 + y_offset), fontsize=14)

ax4 = axes[3]
sns.countplot(x=reduced_data['covid_status'], hue=reduced_data['cough'], \
    palette="YlGnBu", ax=ax4)
ax4.set_title("Covid Status vs Cough", fontsize=18)
ax4.set_xlabel('Covid Status', fontsize = 15)
ax4.set_ylabel('Num. of Records', fontsize = 15)
ax4.legend(fontsize = 12, title="cough", title_fontsize = 12)
# ax1.legend(fontsize = 12, bbox_to_anchor= (1.26, 1.02), title="Covid Status", title_fontsize = 12)
x_offset = -0.05; y_offset = 7.5
for p in ax4.patches:
    b = p.get_bbox()
    val = "{:0.1f}".format(b.y1 + b.y0)        
    ax4.annotate(val, ((b.x0 + b.x1)/2 + x_offset, b.y1 + y_offset), fontsize=14)

ax5 = axes[4]
sns.countplot(x=reduced_data['covid_status'], hue=reduced_data['test_status'], \
    palette="YlGnBu", ax=ax5)
ax5.set_title("Covid Status vs Test Status", fontsize=18)
ax5.set_xlabel('Covid Status', fontsize = 15)
ax5.set_ylabel('Num. of Records', fontsize = 15)
ax5.legend(fontsize = 12, title="test_status", title_fontsize = 12)
# ax1.legend(fontsize = 12, bbox_to_anchor= (1.26, 1.02), title="Covid Status", title_fontsize = 12)
x_offset = -0.05; y_offset = 7.5
for p in ax5.patches:
    b = p.get_bbox()
    val = "{:0.1f}".format(b.y1 + b.y0)        
    ax5.annotate(val, ((b.x0 + b.x1)/2 + x_offset, b.y1 + y_offset), fontsize=14)

plt.show();

In [None]:
for var in cols_with_mvs:
    reduced_data[var] = reduced_data[var].astype('category')
    reduced_data = pd.get_dummies(reduced_data, columns=[var], drop_first=True)

In [None]:
reduced_data

<h2><span style="color:blue">9. Remaining preprocessing and data partitioning</span></h2>

<p>
    <ul>
        <li>We will now TypeCast all remaining categorical variables from <code>object</code> to <code>category</code> data type and then perform One-Hot Encoding of the categorical variables using Pandas <code>get_dummies()</code> functionality.</li><br />
        <li>Next, scale the "a" (age) variable appropriately.</li><br />
        <li>Partition the dataset first into seen and unseen (holdout) datasets. Then, split the seen dataset further into a train dataset and a validation dataset. The validation dataset will be used for evaluating and tuning the deep learning model, while the unseen dataset will be used solely for testing the model. The partition sizes, as shares of the full dataset, will be as follows:</li><br />
        <ol>
            <li>Seen:Unseen :: 85:15</li><br />
            <li>Train:Validation :: 70:15</li><br />
        </ol>   
    </ul>
</p>
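
Because the second split only sees the "seen" 85% of the rows, a validation share of 15% of the full dataset corresponds to a `test_size` of 0.15 / 0.85 (about 0.176) in the second call. The sketch below works through that arithmetic for a hypothetical row count of 2000; `two_stage_split_sizes` is an illustrative helper, and the exact counts produced by `train_test_split` can differ by a row because of rounding.

```python
def two_stage_split_sizes(n_rows, holdout_frac=0.15, valid_frac=0.15):
    """Both fractions are relative to the full dataset."""
    # Fraction to pass as test_size in the second split, which only sees the seen rows
    second_test_size = valid_frac / (1.0 - holdout_frac)
    n_unseen = round(n_rows * holdout_frac)
    n_seen = n_rows - n_unseen
    n_valid = round(n_seen * second_test_size)
    n_train = n_seen - n_valid
    return second_test_size, n_train, n_valid, n_unseen

# Hypothetical row count, chosen only to make the shares easy to read
second_test_size, n_train, n_valid, n_unseen = two_stage_split_sizes(2000)
print(round(second_test_size, 4), n_train, n_valid, n_unseen)
```

Passing 0.15 again in the second call would instead give roughly a 72:13:15 train/validation/unseen partition.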

In [None]:
cat_vars_to_preprocess = reduced_data.select_dtypes(include="object", exclude=["int64", "category"]).columns.to_list()
print()
print(cat_vars_to_preprocess)
print()

In [None]:
reduced_data[cat_vars_to_preprocess] = reduced_data[cat_vars_to_preprocess].astype('category')

In [None]:
for var in cat_vars_to_preprocess:
    reduced_data = pd.get_dummies(reduced_data, columns=[var], drop_first=True)

In [None]:
reduced_data

In [None]:
# Select the target by name rather than by position, which depends on the column order
target = reduced_data['covid_status']
reduced_data = reduced_data.drop(columns=['covid_status'])
features = reduced_data.copy()

In [None]:
print()
print(f"features.shape: {features.shape}")
print(f"target.shape: {target.shape}")

In [None]:
sme = SMOTEENN()
features_sme, target_sme = sme.fit_resample(features, target)

print()
print(f"after applying SMOTEENN technique, features.shape: {features_sme.shape}")
print(f"after applying SMOTEENN technique, target.shape: {target_sme.shape}")

In [None]:
features_sme

In [None]:
target_sme.value_counts().sort_values(ascending=False)

In [None]:
# robust_scaler = RobustScaler()
# transformed_age = robust_scaler.fit_transform(reduced_data['a'].values.reshape(-1,1))

# std_scaler = StandardScaler()
# transformed_age = std_scaler.fit_transform(transformed_age.reshape(-1,1))

In [None]:
min_max_scaler = MinMaxScaler()
transformed_age = min_max_scaler.fit_transform(features_sme['a'].values.reshape(-1,1))

In [None]:
indx_age = features_sme.columns.get_loc('a')
# Drop the unscaled column from features_sme itself, so that only the scaled "age" remains
features_sme = features_sme.drop(columns=['a'])
features_sme.insert(indx_age, "age", transformed_age.ravel())

In [None]:
# target = np.array(reduced_data.loc[:, "covid_status_Unaffected"].values)
# reduced_data = reduced_data.drop(columns=["covid_status_Unaffected"], axis=1)
# reduced_data["covid_status_Unaffected"] = target
# reduced_data

In [None]:
# X = np.array(reduced_data.iloc[:, 0:-1])
# y = np.array(reduced_data.iloc[:, -1:])

In [None]:
X = np.array(features_sme)
y = np.array(target_sme)

In [None]:
print()
print("dimensionality of X:", X.shape)
print("dimensionality of y:", y.shape)

In [None]:
X_seen, X_unseen, y_seen, y_unseen = train_test_split(X, y, \
    test_size=0.15, random_state=101, stratify=y)

In [None]:
print()
print(f"dimensionality of X_seen: {X_seen.shape}")
print(f"dimensionality of X_unseen: {X_unseen.shape}")
print(f"dimensionality of y_seen: {y_seen.shape}")
print(f"dimensionality of y_unseen: {y_unseen.shape}")
print()

In [None]:
# The validation set should be 15% of the full dataset, i.e. 0.15 / 0.85 of the seen data
X_seen_train, X_seen_valid, y_seen_train, y_seen_valid = \
    train_test_split(X_seen, y_seen, test_size=0.15 / 0.85, random_state=101, stratify=y_seen)

In [None]:
print()
print(f"dimensionality of X_seen_train: {X_seen_train.shape}")
print(f"dimensionality of X_seen_valid: {X_seen_valid.shape}")
print(f"dimensionality of y_seen_train: {y_seen_train.shape}")
print(f"dimensionality of y_seen_valid: {y_seen_valid.shape}")
print()

<h2><span style="color:blue">10. Deep learning analysis of seen (train) dataset using Tensorflow and Keras</span></h2>

<p> 
  <ul>
      <li>We define a fully connected dense network of four layers. The architecture of the neural network is as follows:</li><br />
      <ol>
          <li>The first (input) layer has 19 neurons corresponding to one neuron per feature. The activation function is <code>relu</code>.</li><br />
          <li>The second and third, i.e., hidden layers, each has 8 neurons with activation function <code>relu</code>.</li><br />
          <li>The last (output) layer has 1 neuron corresponding to the target variable with activation function <code>sigmoid</code>.</li><br />
        </ol>
        <li>No automatic (programmatic) hyperparameter tuning of the model is performed, owing to time constraints and the nature of the assignment.</li><br />
        <li>The aforementioned NN architecture was the best of the variants tried during minimal manual hyperparameter tuning, namely, changing the number of neurons per hidden layer and studying the output.</li><br />
        <li>Since we have reduced the multiclass classification problem to an effective binary task, we have used the <code>model.predict()</code> method.</li><br /> 
        <li>Had we instead used an effective 3-class classification problem, or kept the original multiclass problem, we would have used the <code>model.predict_classes()</code> method, as it predicts the classes directly. Note that <code>predict_classes()</code> was removed in TensorFlow 2.6; <code>np.argmax(model.predict(X), axis=-1)</code> gives the class indices for a softmax output.</li>
    </ul>
</p>
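
To make the difference between the two prediction modes concrete, the sketch below converts probabilities into class labels with NumPy only. The arrays are hypothetical stand-ins for what `model.predict()` would return for a sigmoid output and for a 3-class softmax output; no model is trained here.

```python
import numpy as np

# Hypothetical model.predict() output for a sigmoid head: shape (n_samples, 1)
sigmoid_probs = np.array([[0.91], [0.08], [0.50], [0.63]])
# Threshold at 0.5, the same value passed to BinaryAccuracy in this notebook
binary_labels = (sigmoid_probs > 0.5).astype(int).ravel()

# Hypothetical model.predict() output for a 3-class softmax head: shape (n_samples, 3)
softmax_probs = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.3, 0.6],
    [0.2, 0.5, 0.3],
])
# The predicted class is the index of the largest probability in each row
multiclass_labels = np.argmax(softmax_probs, axis=-1)

print(binary_labels.tolist())
print(multiclass_labels.tolist())
```

Note that a probability of exactly 0.5 is assigned to class 0 under the strict `>` comparison used here.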

In [None]:
# ******************************************
# Keras model parameters:
# ******************************************
my_num_epochs = 75
my_num_batches = 32
my_num_runs = 10
my_act_func_first_layer = "relu"
my_act_func_inner_layers = "relu"
my_act_func_last_layer = "sigmoid"
my_num_feats = X.shape[1]
my_loss_func = "binary_crossentropy"
my_optimizer = "adam"

my_auc = AUC(num_thresholds=1000, curve='PR', \
    summation_method='interpolation', name="auc", dtype=float, \
        thresholds=None, multi_label=False, num_labels=None, \
            label_weights=None, from_logits=False)

my_bin_accuracy = BinaryAccuracy(name='binary_accuracy', \
    dtype=None, threshold=0.5)

my_metrics = [my_bin_accuracy, my_auc]

<strong>Note:</strong> Check the following links to learn more about the metric classes used for binary classification and their respective <em>**kwargs</em><br /> 

  1. [AUC](https://www.tensorflow.org/api_docs/python/tf/keras/metrics/AUC)

  2. [Binary Accuracy](https://www.tensorflow.org/api_docs/python/tf/keras/metrics/BinaryAccuracy)
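
To make the two metrics concrete, here is a minimal NumPy sketch of what they measure. The labels and scores are invented for illustration, and the helper names (`binary_accuracy`, `precision_recall_at`) are hypothetical. Keras's `AUC(curve='PR')` approximates the area under the curve traced by such precision/recall pairs over many thresholds; its exact interpolation is not reproduced here.

```python
import numpy as np

def binary_accuracy(y_true, y_score, threshold=0.5):
    """Fraction of samples whose thresholded score matches the label."""
    y_pred = (np.asarray(y_score) > threshold).astype(int)
    return float(np.mean(y_pred == np.asarray(y_true)))

def precision_recall_at(y_true, y_score, threshold):
    """Precision and recall of the positive class at a single threshold."""
    y_true = np.asarray(y_true)
    y_pred = (np.asarray(y_score) > threshold).astype(int)
    tp = int(np.sum((y_pred == 1) & (y_true == 1)))
    fp = int(np.sum((y_pred == 1) & (y_true == 0)))
    fn = int(np.sum((y_pred == 0) & (y_true == 1)))
    # Convention assumed here: precision is 1.0 when nothing is predicted positive.
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Invented labels and predicted probabilities.
y_true = [0, 0, 1, 1, 1, 0]
y_score = [0.1, 0.6, 0.8, 0.4, 0.9, 0.2]

acc = binary_accuracy(y_true, y_score)
print(f"binary accuracy at 0.5: {acc:.3f}")

# Each threshold yields one point of the precision-recall curve.
for t in (0.3, 0.5, 0.7):
    p, r = precision_recall_at(y_true, y_score, t)
    print(f"threshold {t}: precision {p:.2f}, recall {r:.2f}")
```

Raising the threshold trades recall for precision, which is why a single accuracy figure at 0.5 and a threshold-free PR AUC complement each other.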

In [None]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
model = Sequential()
model.add(Dense(19, input_dim=my_num_feats, \
    activation=my_act_func_first_layer))
model.add(Dense(8, activation=my_act_func_inner_layers))
model.add(Dense(8, activation=my_act_func_inner_layers))
model.add(Dense(1, activation=my_act_func_last_layer))
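
As a sanity check on the architecture, the number of trainable parameters in a stack of `Dense` layers can be counted by hand: each layer contributes `inputs * units` weights plus `units` biases. The sketch below assumes 19 input features, as stated in the architecture description above; `dense_param_count` is a hypothetical helper, not part of Keras.

```python
def dense_param_count(num_inputs, layer_units):
    """Count weights and biases of consecutive fully connected layers."""
    total = 0
    fan_in = num_inputs
    for units in layer_units:
        total += fan_in * units + units  # weight matrix + bias vector
        fan_in = units                   # this layer feeds the next one
    return total

# Layer widths of the model above: 19 -> 8 -> 8 -> 1, fed by 19 features.
total_params = dense_param_count(19, [19, 8, 8, 1])
print(total_params)  # 621
```

The figure should match the "Total params" line reported by `model.summary()` if the input really has 19 features.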

In [None]:
model.compile(loss=my_loss_func, optimizer=my_optimizer, metrics=my_metrics)

In [None]:
def prepare_model(X, y, num_epochs, num_batches, \
    verbose_flag, metrics, num_runs):
    # Trains the global `model` for `num_runs` consecutive rounds of
    # `num_epochs` epochs each. The weights are not reset between rounds,
    # so training is cumulative (num_epochs * num_runs epochs in total).
    # `metrics` is unused here: the metrics are fixed in model.compile().

    loss = []
    bin_accuracy = []
    auc = []
    for _ in range(num_runs):
        model.fit(X, y, epochs=num_epochs, batch_size=num_batches,
                  verbose=verbose_flag)
        # evaluate() returns [loss, binary_accuracy, auc]. Note that this
        # scores the same data the model was just fitted on.
        run_loss, run_accuracy, run_auc = model.evaluate(X, y, verbose=0)
        loss.append(run_loss * 100)
        bin_accuracy.append(run_accuracy * 100)
        auc.append(run_auc * 100)

    print()
    avg_loss = round(np.mean(loss), 2)
    avg_bin_accuracy = round(np.mean(bin_accuracy), 2)
    avg_auc = round(np.mean(auc), 2)
    print(f"avg. loss {avg_loss} after {num_runs} runs")
    print(f"avg. accuracy {avg_bin_accuracy} after {num_runs} runs")
    print(f"avg. auc {avg_auc} after {num_runs} runs")
    print()

    # One panel per quantity, each plotted against the run index.
    fig, axes = plt.subplots(1, 3, figsize=(20, 10), dpi=100)
    runs = list(range(num_runs))
    panels = [("Loss", loss), ("Binary Accuracy", bin_accuracy), ("AUC", auc)]
    for ax, (label, values) in zip(axes, panels):
        sns.lineplot(x=runs, y=values, marker="o", markersize=12, ax=ax)
        ax.set_title(f"{label} vs Iterations", fontsize=18)
        ax.set_xlabel("Iterations", fontsize=14)
        ax.set_ylabel(label, fontsize=14)

    plt.show()

In [None]:
def display_model_parameters():
    print()
    print("****" * 15)
    print("model parameters")
    print("****" * 15)
    print()
    print(f"number of epochs: {my_num_epochs}")
    print(f"batch size: {my_num_batches}")
    print(f"number of features: {my_num_feats}")
    print(f"activation function for first layer: {my_act_func_first_layer}")
    print(f"activation function for inner layers: {my_act_func_inner_layers}")
    print(f"activation function for outer/last layer: {my_act_func_last_layer}")
    print(f"loss function minimized: {my_loss_func}")
    print(f"optimizer: {my_optimizer}")
    print(f"evaluation metrics: {[m.name for m in my_metrics]}")
    print()

In [None]:
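def check_model_outputs(y_actual, y_predicted):
    # np.ravel() flattens both inputs to 1-D, so this works whether they
    # arrive as lists, NumPy arrays, or pandas Series.
    check_model = pd.DataFrame({
        'actual values': np.ravel(y_actual),
        'predicted values': np.ravel(y_predicted),
    })
    return check_model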

In [None]:
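def model_performance_report(results):
    # `results` is the DataFrame returned by check_model_outputs():
    # column 0 holds the actual labels, column 1 the predicted labels.

    display_model_parameters()

    y_actual = results.iloc[:, 0]
    y_pred = results.iloc[:, 1]

    plt.figure(figsize=(18, 12), dpi=100)
    # fmt="d" prints the cell counts as integers rather than in
    # scientific notation.
    sns.heatmap(confusion_matrix(y_actual, y_pred), square=True, fmt="d",
        cmap="YlGnBu", linewidths=0.1, annot=True, annot_kws={"fontsize": 18})
    plt.show()

    print(classification_report(y_actual, y_pred))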
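
For reading the heatmap, it helps to recall how a binary confusion matrix is laid out. The sketch below rebuilds one with NumPy on invented labels, using the same convention as scikit-learn's `confusion_matrix` (rows are actual classes, columns are predicted classes), and derives the F1 score of the positive class from it. The function name `binary_confusion_matrix` is hypothetical.

```python
import numpy as np

def binary_confusion_matrix(y_actual, y_pred):
    """2x2 count matrix: rows = actual class, columns = predicted class."""
    cm = np.zeros((2, 2), dtype=int)
    for actual, pred in zip(y_actual, y_pred):
        cm[int(actual), int(pred)] += 1
    return cm

# Invented labels, for illustration only.
y_actual = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred = [0, 1, 0, 1, 0, 1, 1, 0]

cm = binary_confusion_matrix(y_actual, y_pred)

# Row-major unpacking gives: true negatives, false positives,
# false negatives, true positives.
tn, fp, fn, tp = cm.ravel()

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(cm)
print(f"precision {precision:.2f}, recall {recall:.2f}, f1 {f1:.2f}")
```

These are the same per-class quantities that `classification_report` tabulates for class 1.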

In [None]:
# **********************************************
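# The following settings give equivalent results, since each run
# continues training the same model (750 epochs in total either way):
# (num_epochs, num_runs) == (150, 5)
# (num_epochs, num_runs) == (75, 10)
# num_batches = 10 and 32 also give equivalent results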
# **********************************************
prepare_model(X_seen_train, y_seen_train, my_num_epochs, \
    my_num_batches, 0, my_metrics, my_num_runs)
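
The equivalence noted in the comment above follows from the fact that successive `fit()` calls continue from the current weights rather than starting over. The toy sketch below illustrates this with a hypothetical one-parameter `ToyModel` trained by plain gradient descent. It is a stand-in for the idea only: with Keras, shuffling and the Adam optimizer make the two schedules approximately, not exactly, equivalent.

```python
class ToyModel:
    """One-parameter 'model' that minimizes (w - 3)^2 by gradient descent."""

    def __init__(self, learning_rate=0.1):
        self.w = 0.0
        self.learning_rate = learning_rate

    def fit(self, epochs):
        # Continues from the current weight instead of re-initializing it,
        # which is also how repeated Keras fit() calls behave.
        for _ in range(epochs):
            grad = 2.0 * (self.w - 3.0)
            self.w -= self.learning_rate * grad

# Schedule A: 5 runs of 150 epochs.
model_a = ToyModel()
for _ in range(5):
    model_a.fit(epochs=150)

# Schedule B: 10 runs of 75 epochs.
model_b = ToyModel()
for _ in range(10):
    model_b.fit(epochs=75)

# Both schedules apply the same 750 updates in the same order.
print(model_a.w == model_b.w)  # True
```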

In [None]:
seen_train_loss, seen_train_accuracy, seen_train_auc = \
    model.evaluate(X_seen_train, y_seen_train, verbose=0)

print()
print("Accuracy - seen train dataset: %.2f"% (seen_train_accuracy*100))
print()
print("AUC - seen train dataset: %.2f"% (seen_train_auc*100))

In [None]:
# Threshold the sigmoid probabilities at 0.5 to obtain 0/1 class labels.
pred_on_seen_train_dataset = \
    (model.predict(X_seen_train) > 0.5).astype(int).ravel()

model_on_seen_train_dataset = check_model_outputs(y_seen_train, \
    pred_on_seen_train_dataset)

model_on_seen_train_dataset.head(3)

In [None]:
model_performance_report(model_on_seen_train_dataset)

In [None]:
seen_valid_loss, seen_valid_accuracy, seen_valid_auc = \
    model.evaluate(X_seen_valid, y_seen_valid, verbose=0)

print()
print("Accuracy - seen valid dataset: %.2f"% (seen_valid_accuracy*100))
print()
print("AUC - seen valid dataset: %.2f"% (seen_valid_auc*100))

In [None]:
# Threshold the sigmoid probabilities at 0.5 to obtain 0/1 class labels.
pred_on_seen_valid_dataset = \
    (model.predict(X_seen_valid) > 0.5).astype(int).ravel()

model_on_seen_valid_dataset = check_model_outputs(y_seen_valid, \
    pred_on_seen_valid_dataset)

model_on_seen_valid_dataset.head(3)

In [None]:
model_performance_report(model_on_seen_valid_dataset)

In [None]:
unseen_loss, unseen_accuracy, unseen_auc = \
    model.evaluate(X_unseen, y_unseen, verbose=0)

print()
print("Accuracy - unseen dataset: %.2f"% (unseen_accuracy*100))
print()
print("AUC - unseen dataset: %.2f"% (unseen_auc*100))

In [None]:
# Threshold the sigmoid probabilities at 0.5 to obtain 0/1 class labels.
pred_on_unseen_dataset = \
    (model.predict(X_unseen) > 0.5).astype(int).ravel()

model_on_unseen_dataset = check_model_outputs(y_unseen, \
    pred_on_unseen_dataset)

model_on_unseen_dataset.head(3)

In [None]:
model_performance_report(model_on_unseen_dataset)