<a href="https://colab.research.google.com/github/starkjones/Neural-Networks/blob/main/Simple_Neural_Network_Exercise.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Simple Neural Network Exercise**
Jonathan Jones

22.06.07

## **Data Dictionary**

1. Age: age of the patient [years]

2. Sex: sex of the patient [M: Male, F: Female]

3. ChestPainType: chest pain type [TA: Typical Angina, ATA: Atypical Angina, NAP: Non-Anginal Pain, ASY: Asymptomatic]

4. RestingBP: resting blood pressure [mm Hg]

5. Cholesterol: serum cholesterol [mg/dl]

6. FastingBS: fasting blood sugar [1: if FastingBS > 120 mg/dl, 0: otherwise]

7. RestingECG: resting electrocardiogram results [Normal: Normal, ST: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV), LVH: showing probable or definite left ventricular hypertrophy by Estes' criteria]

8. MaxHR: maximum heart rate achieved [Numeric value between 60 and 202]

9. ExerciseAngina: exercise-induced angina [Y: Yes, N: No]

10. Oldpeak: ST depression induced by exercise relative to rest [Numeric value]

11. ST_Slope: the slope of the peak exercise ST segment [Up: upsloping, Flat: flat, Down: downsloping]

12. HeartDisease: output class [1: heart disease, 0: Normal]

In [23]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [25]:
# Import Libraries:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.model_selection import train_test_split
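# Import Keras consistently through TensorFlow rather than mixing it with the standalone keras package
from tensorflow import keras
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense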

1. Explore and clean the data if needed

In [26]:
# Import Data:
data = '/content/drive/MyDrive/Colab Notebooks/Week 11/heart - heart.csv'

df = pd.read_csv(data)

df.head()

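| | Age | Sex | ChestPainType | RestingBP | Cholesterol | FastingBS | RestingECG | MaxHR | ExerciseAngina | Oldpeak | ST_Slope | HeartDisease |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 40 | M | ATA | 140 | 289 | 0 | Normal | 172 | N | 0.0 | Up | 0 |
| 1 | 49 | F | NAP | 160 | 180 | 0 | Normal | 156 | N | 1.0 | Flat | 1 |
| 2 | 37 | M | ATA | 130 | 283 | 0 | ST | 98 | N | 0.0 | Up | 0 |
| 3 | 48 | F | ASY | 138 | 214 | 0 | Normal | 108 | Y | 1.5 | Flat | 1 |
| 4 | 54 | M | NAP | 150 | 195 | 0 | Normal | 122 | N | 0.0 | Up | 0 |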


In [27]:
# convert column names to lower case:

df.columns = df.columns.str.lower()

In [28]:
# Duplicated rows:

df.duplicated().sum()

0

In [29]:
# Missing values:

df.isna().sum()

age               0
sex               0
chestpaintype     0
restingbp         0
cholesterol       0
fastingbs         0
restingecg        0
maxhr             0
exerciseangina    0
oldpeak           0
st_slope          0
heartdisease      0
dtype: int64

In [30]:
# Datatypes and dictionary conformity: 

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 918 entries, 0 to 917
Data columns (total 12 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   age             918 non-null    int64  
 1   sex             918 non-null    object 
 2   chestpaintype   918 non-null    object 
 3   restingbp       918 non-null    int64  
 4   cholesterol     918 non-null    int64  
 5   fastingbs       918 non-null    int64  
 6   restingecg      918 non-null    object 
 7   maxhr           918 non-null    int64  
 8   exerciseangina  918 non-null    object 
 9   oldpeak         918 non-null    float64
 10  st_slope        918 non-null    object 
 11  heartdisease    918 non-null    int64  
dtypes: float64(1), int64(6), object(5)
memory usage: 86.2+ KB


In [31]:
# Check numerical data for inconsistencies:

df.describe().round(2)

| | age | restingbp | cholesterol | fastingbs | maxhr | oldpeak | heartdisease |
|---|---|---|---|---|---|---|---|
| count | 918.0 | 918.0 | 918.0 | 918.0 | 918.0 | 918.0 | 918.0 |
| mean | 53.51 | 132.4 | 198.8 | 0.23 | 136.81 | 0.89 | 0.55 |
| std | 9.43 | 18.51 | 109.38 | 0.42 | 25.46 | 1.07 | 0.5 |
| min | 28.0 | 0.0 | 0.0 | 0.0 | 60.0 | -2.6 | 0.0 |
| 25% | 47.0 | 120.0 | 173.25 | 0.0 | 120.0 | 0.0 | 0.0 |
| 50% | 54.0 | 130.0 | 223.0 | 0.0 | 138.0 | 0.6 | 1.0 |
| 75% | 60.0 | 140.0 | 267.0 | 0.0 | 156.0 | 1.5 | 1.0 |
| max | 77.0 | 200.0 | 603.0 | 1.0 | 202.0 | 6.2 | 1.0 |


In [46]:
# Remove rows where restingbp or cholesterol is 0 (an implausible reading, treated here as a missing value):

rbp = df['restingbp'] == 0
chol = df['cholesterol'] == 0

dfc = df[~(rbp | chol)]

In [49]:
dfc.describe().round(2)

| | age | restingbp | cholesterol | fastingbs | maxhr | oldpeak | heartdisease |
|---|---|---|---|---|---|---|---|
| count | 746.0 | 746.0 | 746.0 | 746.0 | 746.0 | 746.0 | 746.0 |
| mean | 52.88 | 133.02 | 244.64 | 0.17 | 140.23 | 0.9 | 0.48 |
| std | 9.51 | 17.28 | 59.15 | 0.37 | 24.52 | 1.07 | 0.5 |
| min | 28.0 | 92.0 | 85.0 | 0.0 | 69.0 | -0.1 | 0.0 |
| 25% | 46.0 | 120.0 | 207.25 | 0.0 | 122.0 | 0.0 | 0.0 |
| 50% | 54.0 | 130.0 | 237.0 | 0.0 | 140.0 | 0.5 | 0.0 |
| 75% | 59.0 | 140.0 | 275.0 | 0.0 | 160.0 | 1.5 | 1.0 |
| max | 77.0 | 200.0 | 603.0 | 1.0 | 202.0 | 6.2 | 1.0 |
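
The filtering step above drops every row in which either measurement is recorded as 0. As a minimal sketch of the same masking logic, the example below uses a small hypothetical frame (`toy` and its values are invented for illustration and are not taken from the heart dataset) and also counts the zeros per column before filtering.

```python
import pandas as pd

# Hypothetical mini-frame for illustration only; the notebook itself
# works on the 918-row heart dataset loaded into `df`.
toy = pd.DataFrame({
    'restingbp':   [140, 0, 130, 120],
    'cholesterol': [289, 180, 0, 0],
})

# Count how many zero placeholders each column contains.
zero_counts = (toy[['restingbp', 'cholesterol']] == 0).sum()

# Keep only rows where neither column is zero (same logic as the cell above).
keep = ~((toy['restingbp'] == 0) | (toy['cholesterol'] == 0))
toy_clean = toy[keep].copy()  # .copy() avoids chained-assignment warnings later

print(zero_counts)
print(len(toy), '->', len(toy_clean), 'rows')
```

Counting the zeros first makes it easier to see which column accounts for most of the dropped rows before deciding between dropping and imputing.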


2. Perform a train-test split on your data

In [50]:
# Feature selection:
X = dfc.drop(columns = 'heartdisease')

# Target:
y = dfc['heartdisease']

# Train test split: 

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 42, stratify = y)

3. Use a column transformer to scale the numeric features and one-hot encode the categorical features.

In [51]:
# Scaler:
scaler = StandardScaler()

# One Hot Encoder:
# (scikit-learn 1.2 renamed `sparse` to `sparse_output`, and `sparse` was removed in 1.4)
OHE = OneHotEncoder(sparse = False, handle_unknown= 'ignore')

In [52]:
from sklearn.pipeline import make_pipeline
from sklearn.compose import make_column_selector, make_column_transformer

# Column selection / separation by data type:

cat = make_column_selector(dtype_include= 'object')
num = make_column_selector(dtype_include= 'number')

# Preprocessing tuples:

categorical_tuple = (OHE, cat(X_train))
numeric_tuple = (scaler, num(X_train))

# Column transformer:

preprocessor = make_column_transformer(numeric_tuple, categorical_tuple, remainder= 'passthrough')

preprocessor


ColumnTransformer(remainder='passthrough',
                  transformers=[('standardscaler', StandardScaler(),
                                 ['age', 'restingbp', 'cholesterol',
                                  'fastingbs', 'maxhr', 'oldpeak']),
                                ('onehotencoder',
                                 OneHotEncoder(handle_unknown='ignore',
                                               sparse=False),
                                 ['sex', 'chestpaintype', 'restingecg',
                                  'exerciseangina', 'st_slope'])])

In [53]:
preprocessor.fit_transform(X_train, y_train)

array([[-1.38391574,  0.35874106,  0.72401052, ...,  0.        ,
         0.        ,  1.        ],
       [-0.54040052, -0.77104064,  0.13982767, ...,  0.        ,
         0.        ,  1.        ],
       [-0.75127932,  0.24576289, -0.04377266, ...,  0.        ,
         1.        ,  0.        ],
       ...,
       [-0.85671873,  0.35874106, -0.3609005 , ...,  0.        ,
         0.        ,  1.        ],
       [ 0.51399351,  0.92363191, -0.44435519, ...,  0.        ,
         1.        ,  0.        ],
       [-0.75127932,  0.35874106,  0.49033738, ...,  0.        ,
         0.        ,  1.        ]])
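
The transformed array is wider than `X_train`, because each categorical column is expanded into one column per category. The sketch below estimates the resulting width from the data dictionary alone. It assumes every listed category actually occurs in the training split; the names `categories`, `numeric`, `n_raw` and `n_encoded` are introduced here for illustration.

```python
# Category levels as listed in the data dictionary.
# Assumption: each level appears at least once in X_train.
categories = {
    'sex': ['M', 'F'],
    'chestpaintype': ['TA', 'ATA', 'NAP', 'ASY'],
    'restingecg': ['Normal', 'ST', 'LVH'],
    'exerciseangina': ['Y', 'N'],
    'st_slope': ['Up', 'Flat', 'Down'],
}
numeric = ['age', 'restingbp', 'cholesterol', 'fastingbs', 'maxhr', 'oldpeak']

# Raw feature count: one column per original feature.
n_raw = len(numeric) + len(categories)

# Encoded feature count: scaled numeric columns plus one column per category level.
n_encoded = len(numeric) + sum(len(levels) for levels in categories.values())

print(n_raw, n_encoded)
```

In the notebook, the authoritative figure is `preprocessor.fit_transform(X_train).shape[1]`, which reflects the categories actually present in the training data.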

In [54]:
# # Pipelines:

# cat_pipe = make_pipeline(OHE)
# num_pipe = make_pipeline(scaler)

4. Define your base sequential model

In [55]:
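# NOTE: X_train.shape[1] is the raw column count (11). After one-hot encoding,
# the preprocessor outputs more columns than this, which is the likely cause of
# the ValueError raised when the model is fit. The input size should instead be
# taken from the transformed data, e.g. preprocessor.fit_transform(X_train).shape[1].
inputshape = X_train.shape[1]
inputshape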

# Model instantiation:
sm = Sequential()

5. Include the number of features of each sample in your input layer

In [56]:
# First hidden layer: 
sm.add(Dense(11, activation = 'relu', input_dim = inputshape))

# Second:
sm.add(Dense(6, activation = 'relu'))

6. Use the correct activation function and the correct number of neurons for your output layer

In [57]:
# Output layer:
sm.add(Dense(1, activation = 'sigmoid'))

In [58]:
# Network Summary:

sm.summary()

Model: "sequential_2"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 dense_6 (Dense)             (None, 11)                132       
                                                                 
 dense_7 (Dense)             (None, 6)                 72        
                                                                 
 dense_8 (Dense)             (None, 1)                 7         
                                                                 
Total params: 211
Trainable params: 211
Non-trainable params: 0
_________________________________________________________________
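
The parameter counts in the summary can be checked by hand: a Dense layer has one weight per input-unit pair plus one bias per unit. A minimal sketch, with `dense_params` as a helper name introduced here for illustration:

```python
def dense_params(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units

# input_dim followed by the number of units in each Dense layer above.
layer_sizes = [11, 11, 6, 1]

# Pair each layer's input size with its unit count.
per_layer = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]
total = sum(per_layer)

print(per_layer, total)  # [132, 72, 7] 211
```

The first entry confirms that the network was built for 11 input features, since 11 * 11 + 11 = 132.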


7. Compile your model with the correct loss function and an optimizer (‘adam’ is a fine choice)

In [59]:
# Accuracy used as evaluation metric

sm.compile(optimizer = 'adam', loss = 'binary_crossentropy', metrics = ['acc'])
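
Binary cross-entropy suits a single sigmoid output because it penalizes confident wrong probabilities heavily. The sketch below is a plain-Python illustration of the formula, not Keras's implementation. The clipping constant `eps` is an assumption added here to avoid `log(0)`.

```python
import math

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean of -[y*log(p) + (1-y)*log(1-p)] over all samples."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # clip to keep log() finite
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

# Confident and correct predictions give a small loss...
low = binary_crossentropy([1, 0], [0.9, 0.1])
# ...while confident and wrong predictions give a large one.
high = binary_crossentropy([1, 0], [0.1, 0.9])
print(round(low, 4), round(high, 4))
```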

In [62]:
# Fit model:

# Construct pipeline:

sm_pipe = make_pipeline(preprocessor, sm)

# Fit:
sm_pipe.fit(X_train, y_train)

ValueError: ignored
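
The error is most likely a shape mismatch: the network expects 11 inputs, but the preprocessor emits more columns once the categorical features are one-hot encoded. One possible fix, sketched below, is to transform the data first and size the input layer from the transformed array. The helper `build_and_fit` and its arguments are hypothetical names introduced for this sketch. It also passes validation data to `fit` so that the returned history contains `val_loss`.

```python
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Input


def build_and_fit(X_train_proc, y_train, X_test_proc, y_test, epochs=10):
    """Build the 11-6-1 network sized to the processed data, then fit it."""
    n_features = X_train_proc.shape[1]  # columns AFTER scaling and one-hot encoding
    model = Sequential([
        Input(shape=(n_features,)),
        Dense(11, activation='relu'),
        Dense(6, activation='relu'),
        Dense(1, activation='sigmoid'),
    ])
    model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['acc'])
    history = model.fit(
        X_train_proc, np.asarray(y_train),
        validation_data=(X_test_proc, np.asarray(y_test)),
        epochs=epochs, verbose=0,
    )
    return model, history
```

In the notebook this could be called with `preprocessor.fit_transform(X_train)` and `preprocessor.transform(X_test)` as the processed arrays, and the returned `history` passed to `plot_history(history, 'acc')`. The epoch count of 10 is an arbitrary default, not a value from the original exercise.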

8. Plot your model’s training history.

In [None]:
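def plot_history(history, metric=None):
  """Plot training and validation loss and, optionally, one metric.

  `metric` is a string and must match a name passed to the `metrics`
  argument in the compile step (e.g. 'acc'). The model must have been
  fit with validation data so that the 'val_' keys exist in the history."""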
  fig, axes = plt.subplots(2,1, figsize = (5,10))
  axes[0].plot(history.history['loss'], label = "train")
  axes[0].plot(history.history['val_loss'], label='test')
  axes[0].set_title('Loss')
  axes[0].legend()
  if metric:
    axes[1].plot(history.history[metric], label = 'train')
    axes[1].plot(history.history['val_' + metric], label = 'test')
    axes[1].set_title(metric)
    axes[1].legend()

  plt.show()

9. Evaluate your models with appropriate metrics.

In [None]:
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, ConfusionMatrixDisplay
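
A sigmoid output is a probability, so it has to be thresholded before these metrics can be applied. The sketch below uses hypothetical labels and probabilities (`y_true` and `y_prob` are invented for illustration). In the notebook they would be `y_test` and the model's predictions on the processed test features. The 0.5 cutoff is a common default, not a requirement.

```python
import numpy as np
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Hypothetical values for illustration only.
y_true = np.array([0, 0, 1, 1, 1, 0])
y_prob = np.array([0.10, 0.65, 0.80, 0.40, 0.95, 0.20])

# Convert probabilities to class labels with a 0.5 threshold.
y_pred = (y_prob >= 0.5).astype(int)

acc = accuracy_score(y_true, y_pred)
cm = confusion_matrix(y_true, y_pred)  # rows: true class, columns: predicted class

print(f'accuracy: {acc:.3f}')
print(cm)
print(classification_report(y_true, y_pred))
```

For a medical screening task such as this one, recall on the heart-disease class may matter more than overall accuracy, which is why the classification report is printed alongside it.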

After you’ve created, fit, and evaluated your first model, try 2 more versions of it with different numbers of layers and neurons to see if you can create a model that scores better on the testing data.