# Advanced Certification Programme in AI and MLOps
## A programme by IISc and TalentSprint
### Mini-Project Notebook: Structured Data Classification

## Problem Statement

Predict whether a patient has heart disease.

## Learning Objectives

At the end of the experiment, you will be able to

* understand the Cleveland Clinic Foundation for Heart Disease dataset
* pre-process this dataset
* build a neural network architecture/model using the Keras Sequential or Functional API
* perform model training
* perform inference on unseen data
* build a Gradio interface for this application

## Introduction

This example demonstrates structured data classification, starting from a raw
CSV file. The data includes both numerical and categorical features. Preprocessing
normalizes the numerical features and vectorizes the categorical ones.
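
As a rough illustration of the two preprocessing steps named above, the sketch below standardizes one numerical column and one-hot encodes one categorical column with pandas. The toy frame and its values are made up for illustration; the notebook itself later uses scikit-learn's `StandardScaler` and keeps the categorical codes as integers.

```python
import pandas as pd

# Toy frame standing in for the heart data (values are illustrative only)
toy = pd.DataFrame({
    "age": [63, 67, 37, 41],
    "cp": [1, 4, 3, 2],
})

# Normalize a numerical feature: subtract the mean, divide by the standard deviation
age_mean = toy["age"].mean()
age_std = toy["age"].std(ddof=0)
toy["age_scaled"] = (toy["age"] - age_mean) / age_std

# Vectorize a categorical feature: one indicator column per category
cp_onehot = pd.get_dummies(toy["cp"], prefix="cp")
print(cp_onehot.columns.tolist())
```

After scaling, the column has zero mean and unit variance, so features measured in different units contribute on a comparable scale.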

### Dataset

[Our dataset](https://archive.ics.uci.edu/ml/datasets/heart+Disease) is provided by the
Cleveland Clinic Foundation for Heart Disease.
It's a CSV file with 303 rows. Each row contains information about a patient (a
**sample**), and each column describes an attribute of the patient (a **feature**). We
use the features to predict whether a patient has a heart disease (**binary
classification**).

Here's the description of each feature:

Column| Description| Feature Type
------------|--------------------|----------------------
Age | Age in years | Numerical
Sex | (1 = male; 0 = female) | Categorical
CP | Chest pain type (0, 1, 2, 3, 4) | Categorical
Trestbps | Resting blood pressure (in mm Hg on admission) | Numerical
Chol | Serum cholesterol in mg/dl | Numerical
FBS | Fasting blood sugar > 120 mg/dl (1 = true; 0 = false) | Categorical
RestECG | Resting electrocardiogram results (0, 1, 2) | Categorical
Thalach | Maximum heart rate achieved | Numerical
Exang | Exercise induced angina (1 = yes; 0 = no) | Categorical
Oldpeak | ST depression induced by exercise relative to rest | Numerical
Slope | Slope of the peak exercise ST segment | Numerical
CA | Number of major vessels (0-3) colored by fluoroscopy | Both numerical & categorical
Thal | 3 = normal; 6 = fixed defect; 7 = reversible defect | Categorical
Target | Diagnosis of heart disease (1 = true; 0 = false) | Target

In [None]:
#@title Download the data
!wget -qq https://cdn.iisc.talentsprint.com/AIandMLOps/Datasets/heart.csv
print("Data Downloaded Successfully!!")
!ls | grep '.csv'

Data Downloaded Successfully!!
heart.csv


## Grading = 10 Points

### Import Required Packages

In [None]:
import tensorflow as tf
import numpy as np
import pandas as pd
from tensorflow import keras
from tensorflow.keras import layers

## Load the data and pre-process it [3 Marks]

### Load data into a Pandas dataframe

Hint: `pd.read_csv`

In [None]:
file_url = "/content/heart.csv"
## YOUR CODE HERE
heart_df = pd.read_csv(file_url)


Check the shape of the dataset:

In [None]:
## YOUR CODE HERE
heart_df.shape

(303, 14)

Check the preview of a few samples:

Hint: `head()`

In [None]:
## YOUR CODE HERE
heart_df.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,63,1,1,145,233,1,2,150,0,2.3,3,0,fixed,0
1,67,1,4,160,286,0,2,108,1,1.5,2,3,normal,1
2,67,1,4,120,229,0,2,129,1,2.6,2,2,reversible,0
3,37,1,3,130,250,0,0,187,0,3.5,3,0,normal,0
4,41,0,2,130,204,0,2,172,0,1.4,1,0,normal,0


Draw some inference from the data. What does the target column indicate?

The last column, "target", indicates whether the patient has a heart disease (1) or not
(0).

### Missing values

In [None]:
# Check whether any missing values are present
## YOUR CODE HERE
heart_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 303 entries, 0 to 302
Data columns (total 14 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       303 non-null    int64  
 1   sex       303 non-null    int64  
 2   cp        303 non-null    int64  
 3   trestbps  303 non-null    int64  
 4   chol      303 non-null    int64  
 5   fbs       303 non-null    int64  
 6   restecg   303 non-null    int64  
 7   thalach   303 non-null    int64  
 8   exang     303 non-null    int64  
 9   oldpeak   303 non-null    float64
 10  slope     303 non-null    int64  
 11  ca        303 non-null    int64  
 12  thal      303 non-null    object 
 13  target    303 non-null    int64  
dtypes: float64(1), int64(12), object(1)
memory usage: 33.3+ KB
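
`info()` reports the non-null count per column. A more direct check, sketched here on a made-up frame, is to count the nulls explicitly with `isna().sum()`; the toy values are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Toy frame with one missing value in each column (illustrative values)
toy = pd.DataFrame({
    "chol": [233.0, np.nan, 229.0],
    "thal": ["fixed", "normal", None],
})

# Number of missing entries per column
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Single yes/no answer for the whole frame
has_missing = bool(toy.isna().any().any())
```

On the heart data the same call would be `heart_df.isna().sum()`.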


### Show the unique values present in each categorical column

- Remove the rows that have '1' or '2' as the value in the `thal` column

In [None]:
# Show all the columns in dataframe
## YOUR CODE HERE

heart_df.columns

Index(['age', 'sex', 'cp', 'trestbps', 'chol', 'fbs', 'restecg', 'thalach',
       'exang', 'oldpeak', 'slope', 'ca', 'thal', 'target'],
      dtype='object')

In [None]:
# Print the unique values present in each categorical column

categorical_cols = ['sex', 'cp', 'fbs', 'restecg', 'exang', 'ca', 'thal']

## YOUR CODE HERE

for col in categorical_cols:
    print('Unique values in categorical column:', col)
    print(heart_df[col].unique())


Unique values in categorical column: sex
[1 0]
Unique values in categorical column: cp
[1 4 3 2 0]
Unique values in categorical column: fbs
[1 0]
Unique values in categorical column: restecg
[2 0 1]
Unique values in categorical column: exang
[0 1]
Unique values in categorical column: ca
[0 3 2 1]
Unique values in categorical column: thal
['fixed' 'normal' 'reversible' '1' '2']


In [None]:
# Print the unique values present in each categorical column along with their counts

## YOUR CODE HERE
for col in categorical_cols:
    print('categorical column counts:')
    print(heart_df[col].value_counts())

categorical column counts:
sex
1    205
0     98
Name: count, dtype: int64
categorical column counts:
cp
4    142
3     84
2     49
1     24
0      4
Name: count, dtype: int64
categorical column counts:
fbs
0    258
1     45
Name: count, dtype: int64
categorical column counts:
restecg
0    149
2    146
1      8
Name: count, dtype: int64
categorical column counts:
exang
0    204
1     99
Name: count, dtype: int64
categorical column counts:
ca
0    176
1     67
2     40
3     20
Name: count, dtype: int64
categorical column counts:
thal
normal        168
reversible    115
fixed          18
1               1
2               1
Name: count, dtype: int64


- Remove the rows that have '1' or '2' as the value in the `thal` column

In [None]:
# Find the indices of the rows that have '1' or '2' as the value in the `thal` column

idx = heart_df.index[heart_df['thal'].isin(['1', '2'])].to_list()  ## YOUR CODE HERE

idx

[247, 252]

In [None]:
heart_df.iloc[252]

Unnamed: 0,252
age,57.0
sex,0.0
cp,1.0
trestbps,130.0
chol,236.0
fbs,0.0
restecg,0.0
thalach,174.0
exang,0.0
oldpeak,0.0


In [None]:
# Drop the rows found above, using their index labels

## YOUR CODE HERE
heart_df = heart_df.drop(index=idx)
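
An alternative to looking up index labels, sketched here on a made-up frame, is to keep only the rows whose `thal` value is one of the expected categories. The toy values and the name `valid_thal` are assumptions for illustration.

```python
import pandas as pd

# Toy frame containing two stray `thal` codes ('1' and '2')
toy = pd.DataFrame({
    "age": [63, 57, 52, 41],
    "thal": ["fixed", "1", "normal", "2"],
})

# Keep only rows with a recognized category, then renumber the index
valid_thal = ["fixed", "normal", "reversible"]
cleaned = toy[toy["thal"].isin(valid_thal)].reset_index(drop=True)
print(cleaned)
```

Resetting the index is optional. It avoids gaps in the index labels, which matters if rows are later accessed by position and by label interchangeably.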

In [None]:
# Recheck the unique values present in each categorical column

## YOUR CODE HERE

categorical_cols = ['sex', 'cp', 'fbs', 'restecg', 'exang', 'ca', 'thal']

for col in categorical_cols:
    print('Unique values in categorical column:', col)
    print(heart_df[col].unique())

Unique values in categorical column: sex
[1 0]
Unique values in categorical column: cp
[1 4 3 2 0]
Unique values in categorical column: fbs
[1 0]
Unique values in categorical column: restecg
[2 0 1]
Unique values in categorical column: exang
[0 1]
Unique values in categorical column: ca
[0 3 2 1]
Unique values in categorical column: thal
['fixed' 'normal' 'reversible']


### Convert the categorical values present in `thal` column to numerical labels

Hint: Create a dictionary mapping

In [None]:
## YOUR CODE HERE
cat_to_num_dict = { 'fixed' : 0, 'normal' : 1 , 'reversible' : 2}

heart_df = heart_df.replace({"thal": cat_to_num_dict})

categorical_cols = ['sex', 'cp', 'fbs', 'restecg', 'exang', 'ca', 'thal']

for col in categorical_cols:
    print('Unique values in categorical column:', col)
    print(heart_df[col].unique())

Unique values in categorical column: sex
[1 0]
Unique values in categorical column: cp
[1 4 3 2 0]
Unique values in categorical column: fbs
[1 0]
Unique values in categorical column: restecg
[2 0 1]
Unique values in categorical column: exang
[0 1]
Unique values in categorical column: ca
[0 3 2 1]
Unique values in categorical column: thal
[0 1 2]
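
Another way to apply a dictionary mapping is `Series.map`, which turns any value missing from the dictionary into `NaN` so that unmapped categories are easy to spot. A minimal sketch on toy data (the frame and the name `thal_to_label` are assumptions for illustration):

```python
import pandas as pd

thal_to_label = {"fixed": 0, "normal": 1, "reversible": 2}
toy = pd.DataFrame({"thal": ["fixed", "normal", "reversible", "normal"]})

# Map each category to its numeric label
toy["thal"] = toy["thal"].map(thal_to_label)

# A value absent from the dictionary would show up as NaN here
unmapped = int(toy["thal"].isna().sum())

# Make the integer dtype explicit once every value is mapped
toy["thal"] = toy["thal"].astype("int64")
print(toy["thal"].tolist())
```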




### Split the dataset into training and testing sets

In [None]:
from sklearn.model_selection import train_test_split

## YOUR CODE HERE (perform stratified sampling/splitting)
cols = [x for x in heart_df.columns if x != 'target']
X = heart_df[cols]
y = heart_df['target']

X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8, random_state=42, stratify=y)
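
To check that stratification preserved the class balance, the class proportions of the two splits can be compared. A minimal sketch on synthetic labels (70 negatives and 30 positives, invented for illustration):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic data: 100 samples, 30% positive class
toy = pd.DataFrame({"feature": range(100), "target": [0] * 70 + [1] * 30})
X_toy, y_toy = toy[["feature"]], toy["target"]

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, train_size=0.8, random_state=42, stratify=y_toy
)

# With binary 0/1 labels, the mean is the share of the positive class
train_ratio = y_tr.mean()
test_ratio = y_te.mean()
print(train_ratio, test_ratio)
```

Both ratios should be close to the overall positive share. Without `stratify`, they can drift apart on small datasets.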

### Scale the numerical features

In [None]:
numerical_cols = ['age', 'trestbps', 'chol', 'thalach', 'oldpeak', 'slope']

In [None]:
from sklearn.preprocessing import StandardScaler

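## YOUR CODE HERE
scaler = StandardScaler()

# Fit on the training split only, then reuse the same statistics for the test split.
# Selecting `numerical_cols` means X_train and X_test keep only these six columns.
X_train = scaler.fit_transform(X_train[numerical_cols])
X_test = scaler.transform(X_test[numerical_cols])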
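
If the categorical columns should remain available to the model, one option is to scale the numerical columns in place on a copy of each split. This is a sketch on toy data under that assumption, not the notebook's recorded approach; `toy_scaler`, `num_cols` and the values are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

num_cols = ["age", "chol"]  # numerical columns of the toy data
train = pd.DataFrame({"age": [63, 67, 37, 41], "chol": [233, 286, 250, 204], "sex": [1, 1, 1, 0]})
test = pd.DataFrame({"age": [56, 62], "chol": [236, 268], "sex": [1, 0]})

toy_scaler = StandardScaler()
train_scaled = train.copy()
test_scaled = test.copy()

# Overwrite only the numerical columns; `sex` is left untouched
train_scaled[num_cols] = toy_scaler.fit_transform(train[num_cols])
test_scaled[num_cols] = toy_scaler.transform(test[num_cols])
print(train_scaled)
```

The result is still a DataFrame with every original column, so `train_scaled.shape[1]` reflects all features rather than only the scaled ones.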

In [None]:
X_train

array([[ 1.36263632, -0.96359526,  5.86807145,  0.49804813,  0.4293436 ,
         0.6915294 ],
       [-0.28239747, -0.67955834,  1.41437479,  1.01475512, -0.74445175,
        -0.9355986 ],
       [ 0.59495389,  0.45658935,  0.81806395,  0.92863729,  0.0939735 ,
         0.6915294 ],
       ...,
       [ 0.59495389, -0.67955834, -1.32492817, -2.2577225 , -0.9121368 ,
        -0.9355986 ],
       [ 1.03362956, -0.11148449,  1.00441109, -1.13819068,  0.7647137 ,
         0.6915294 ],
       [ 0.59495389,  0.74062628,  0.61308209, -0.27701236,  1.43545389,
         0.6915294 ]])

In [None]:
X_test

array([[ 4.85284967e-01,  2.38804043e+00, -1.86347140e-03,
        -2.33953444e-01, -9.12136802e-01,  6.91529397e-01],
       [ 1.36263632e+00,  1.13827797e+00, -6.91347890e-01,
         6.74589685e-02, -2.41396603e-01,  6.91529397e-01],
       [-6.11404225e-01, -7.93173109e-01, -1.86533487e+00,
        -9.65955017e-01, -2.41396603e-01, -9.35598596e-01],
       [ 1.58197416e+00,  1.59273705e+00, -2.81384182e-01,
        -7.50660436e-01, -8.28294277e-01,  6.91529397e-01],
       [ 2.65947129e-01, -1.11484492e-01, -2.44114754e-01,
         1.10087295e+00, -9.12136802e-01,  6.91529397e-01],
       [-1.05007990e+00, -2.25099262e-01,  1.09758466e+00,
         9.28637289e-01, -9.12136802e-01, -9.35598596e-01],
       [-1.81776234e+00, -6.79558340e-01, -3.37288324e-01,
         1.44534428e+00,  2.27387914e+00,  6.91529397e-01],
       [ 8.14291725e-01, -4.52328801e-01, -7.47252032e-01,
         6.27224877e-01, -9.12136802e-01, -9.35598596e-01],
       [ 9.23960644e-01,  4.56589355e-01, -1.008

## Building the model [3 Marks]

* Use tf.keras.layers.Input() for input layer
* Add dense layers
* Add dropout layers
* Add a classification layer at the end


In [None]:
X_train.shape

(240, 6)

In [None]:
# Create model
## YOUR CODE HERE
heart_classif_model = keras.Sequential(name="heart_disease_classifier_model")
heart_classif_model.add(layers.Input(shape=(X_train.shape[1],)))  # one input per feature column
heart_classif_model.add(layers.Dense(32, activation=tf.nn.relu))
heart_classif_model.add(layers.Dense(2, activation=tf.nn.sigmoid))  # one output unit per class

heart_classif_model.summary()

In [None]:
# Compile model with 'adam' optimizer, appropriate loss and metric

## YOUR CODE HERE

heart_classif_model.compile(optimizer=keras.optimizers.Adam(),
                            loss=keras.losses.SparseCategoricalCrossentropy(),
                            metrics=["accuracy"])
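
One common alternative for binary classification, sketched here as an assumption rather than as the expected solution, is a single sigmoid output unit paired with binary cross-entropy, plus a dropout layer as suggested in the task list. The helper name `build_binary_classifier`, the dropout rate and the model name are hypothetical.

```python
def build_binary_classifier(n_features, hidden_units=32, dropout_rate=0.2):
    """Hypothetical helper: one sigmoid output unit trained with binary cross-entropy."""
    # Imported inside the function so the definition itself does not require TensorFlow
    from tensorflow import keras
    from tensorflow.keras import layers

    model = keras.Sequential(name="heart_disease_binary_sketch")
    model.add(layers.Input(shape=(n_features,)))
    model.add(layers.Dense(hidden_units, activation="relu"))
    model.add(layers.Dropout(dropout_rate))           # randomly zeroes activations during training
    model.add(layers.Dense(1, activation="sigmoid"))  # output: probability of class 1
    model.compile(optimizer=keras.optimizers.Adam(),
                  loss=keras.losses.BinaryCrossentropy(),
                  metrics=["accuracy"])
    return model
```

It could be called as `build_binary_classifier(X_train.shape[1])`. With this design, `predict()` returns one probability per sample, which can be thresholded at 0.5 to obtain a class label.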

In [None]:
# Perform training
epochs = 50
batch_size = 32
validation_split = 0.2

heart_classif_model.fit(x=X_train, y=y_train, epochs=epochs, batch_size=batch_size,
                        validation_split=validation_split)

Epoch 1/50
6/6 ━━━━━━━━━━━━━━━━━━━━ 4s 65ms/step - accuracy: 0.7081 - loss: 0.7690 - val_accuracy: 0.7083 - val_loss: 0.6886
Epoch 2/50
6/6 ━━━━━━━━━━━━━━━━━━━━ 0s 11ms/step - accuracy: 0.7380 - loss: 0.6982 - val_accuracy: 0.7083 - val_loss: 0.6931
Epoch 3/50
6/6 ━━━━━━━━━━━━━━━━━━━━ 0s 8ms/step - accuracy: 0.7269 - loss: 0.6932 - val_accuracy: 0.7083 - val_loss: 0.6931
Epoch 4/50
6/6 ━━━━━━━━━━━━━━━━━━━━ 0s 11ms/step - accuracy: 0.7363 - loss: 0.6931 - val_accuracy: 0.7083 - val_loss: 0.6931
Epoch 5/50
6/6 ━━━━━━━━━━━━━━━━━━━━ 0s 7ms/step - accuracy: 0.7320 - loss: 0.6931 - val_accuracy: 0.7083 - val_loss: 0.6931
Epoch 6/50
6/6 ━━━━━━━━━━━━━━━━━━━━ 0s 11ms/step - accuracy: 0.7649 - loss: 0.6931 - val_accuracy: 0.7083 - val_loss: 0.6931
Epoch 7/50
6/6 ━━━━━━━━━━━━━━━━━━━━

<keras.src.callbacks.history.History at 0x7b12736930d0>

In [None]:
# Performance on test set

heart_classif_model.evaluate(x=X_test, y=y_test)

2/2 ━━━━━━━━━━━━━━━━━━━━ 0s 7ms/step - accuracy: 0.7413 - loss: 0.6931


[0.6931471824645996, 0.7213114500045776]
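
The first number returned by `evaluate` is the loss and the second is the accuracy. A loss of 0.6931 is numerically equal to ln 2, which is the cross-entropy of a two-class prediction that puts probability 0.5 on each class, so a loss sitting at this value suggests the model is not yet separating the classes. The small sketch below, which uses only the standard library, shows where that number comes from; the function name is hypothetical.

```python
import math

def cross_entropy(prob_true_class):
    """Cross-entropy for one sample: -log of the probability given to the true class."""
    return -math.log(prob_true_class)

uninformative_loss = cross_entropy(0.5)  # both classes judged equally likely
confident_loss = cross_entropy(0.9)      # 90% probability on the correct class

print(round(uninformative_loss, 4))      # 0.6931
print(confident_loss < uninformative_loss)
```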

## Inference on new data [1 Mark]

To get a prediction for a new sample, you can simply call `model.predict()`.

In [None]:
# Inference on new data

sample = {
    "age": 60,
    "sex": 1,
    "cp": 1,
    "trestbps": 145,
    "chol": 233,
    "fbs": 1,
    "restecg": 2,
    "thalach": 150,
    "exang": 0,
    "oldpeak": 2.3,
    "slope": 3,
    "ca": 0,
    "thal": "fixed",
}


In [None]:
## YOUR CODE HERE
sample_df = pd.DataFrame([sample])

# Map the textual `thal` value to the numeric label used during training
sample_df = sample_df.replace({"thal": cat_to_num_dict})

heart_classif_model.predict(x=sample_df)


1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 44ms/step



array([[1., 1.]], dtype=float32)
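
A model expects new samples to go through the same preprocessing as its training data. The sketch below, on toy data, selects the training columns from a sample dictionary and applies a scaler that was already fitted. `prepare_sample`, `toy_scaler` and the toy values are hypothetical stand-ins for the notebook's own `scaler` and `numerical_cols`.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

num_cols = ["age", "chol"]  # toy stand-in for numerical_cols
train = pd.DataFrame({"age": [63, 67, 37, 41], "chol": [233, 286, 250, 204]})
toy_scaler = StandardScaler().fit(train[num_cols])

def prepare_sample(sample, scaler, columns):
    """Hypothetical helper: pick the training columns from a dict and scale them."""
    sample_df = pd.DataFrame([sample])
    # Column selection also fixes the column order expected by the scaler
    return scaler.transform(sample_df[columns])

new_patient = {"age": 60, "sex": 1, "chol": 233}
model_input = prepare_sample(new_patient, toy_scaler, num_cols)
print(model_input.shape)
```

The returned array has one row and as many columns as the scaler was fitted on, which is the shape a model trained on those columns would accept.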

In [None]:
test = heart_classif_model.predict(x=sample_df)
type(test)

1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 18ms/step


numpy.ndarray

In [None]:
test[0][0]

1.0

In [None]:
tf.sigmoid(test)

<tf.Tensor: shape=(1, 2), dtype=float32, numpy=array([[0.7310586, 0.7310586]], dtype=float32)>

## Gradio Implementation [3 Marks]

Create a Gradio interface for this `Heart Disease Prediction` application. For the feature values the user provides as input, perform a prediction using the trained model and return the result to the user.

Make use of gradio elements such as Textbox, Radio buttons, etc.

In [None]:
%%capture
!pip -q install gradio

In [None]:
import gradio as gr

In [None]:
# UI - Input components
## YOUR CODE HERE ...



age_input = gr.Number(label = 'Enter the age of the Individual')
sex_input = gr.Number(label = 'Enter the sex of the Individual')
cp_input = gr.Number(label = 'Enter the cp of the Individual')
trestbps_input = gr.Number(label = 'Enter the trestbps of the Individual')
chol_input = gr.Number(label = 'Enter the chol of the Individual')
fbs_input = gr.Number(label = 'Enter the fbs of the Individual')
restecg_input = gr.Number(label = 'Enter the restecg of the Individual')
thalach_input = gr.Number(label = 'Enter the thalach of the Individual')
exang_input = gr.Number(label = 'Enter the exang of the Individual')
oldpeak_input = gr.Number(label = 'Enter the oldpeak of the Individual')
slope_input = gr.Number(label = 'Enter the slope of the Individual')
ca_input = gr.Number(label = 'Enter the ca of the Individual')
thal_input = gr.Textbox(label = 'Enter the thal of the Individual')


# UI - Output component
## YOUR CODE HERE ...
# We create the output
output = gr.Textbox()




In [None]:
# Label prediction function

## YOUR CODE HERE

def predict_output(age, sex, cp, trestbps, chol, fbs, restecg, thalach,
                   exang, oldpeak, slope, ca, thal):
    """Build a one-row DataFrame from the UI values and return the model output."""
    feature_names = ['age', 'sex', 'cp', 'trestbps', 'chol', 'fbs', 'restecg',
                     'thalach', 'exang', 'oldpeak', 'slope', 'ca', 'thal']
    feature_values = [age, sex, cp, trestbps, chol, fbs, restecg, thalach,
                      exang, oldpeak, slope, ca, thal]

    sample_df = pd.DataFrame([dict(zip(feature_names, feature_values))])
    # Map the textual `thal` value ('fixed', 'normal', 'reversible') to its numeric label
    sample_df = sample_df.replace({"thal": cat_to_num_dict})

    return heart_classif_model.predict(x=sample_df)[0][0]
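
A Gradio `Textbox` output is easier to read when the function returns a sentence rather than a raw number. A minimal sketch, assuming the model returns a single probability of heart disease between 0 and 1 (an assumption that does not hold for a two-unit output layer); `describe_prediction` and the 0.5 threshold are hypothetical.

```python
def describe_prediction(probability, threshold=0.5):
    """Hypothetical helper: turn a disease probability into a short message for the UI."""
    if probability >= threshold:
        label = "Heart disease likely"
    else:
        label = "Heart disease unlikely"
    return f"{label} (model output: {probability:.2f})"

print(describe_prediction(0.83))
print(describe_prediction(0.12))
```

The prediction function passed to `gr.Interface` could wrap its return value in such a helper so that the user sees a labelled result.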


In [None]:
# Create Gradio interface object and launch it with (share=True)

## YOUR CODE HERE

input_list = [age_input, sex_input, cp_input, trestbps_input, chol_input, fbs_input, restecg_input,
              thalach_input, exang_input, oldpeak_input, slope_input, ca_input, thal_input]


app = gr.Interface(fn = predict_output, inputs= input_list, outputs=output)
app.launch(debug=True)


Running Gradio in a Colab notebook requires sharing enabled. Automatically setting `share=True` (you can turn this off by setting `share=False` in `launch()` explicitly).

Colab notebook detected. This cell will run indefinitely so that you can see errors and logs. To turn off, set debug=False in launch().
* Running on public URL: https://d815ed090eb26a77fd.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from the terminal in the working directory to deploy to Hugging Face Spaces (https://huggingface.co/spaces)


1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 38ms/step


Keyboard interruption in main thread... closing server.
Killing tunnel 127.0.0.1:7863 <> https://d815ed090eb26a77fd.gradio.live


