# **Diabetes Prediction using Deep Neural Network (DNN)** (Solution)

Inspired by Dipanjan Sarkar et al. 2018. [Practical Machine Learning with Python](https://link.springer.com/book/10.1007/978-1-4842-3207-1).

## **Overview**
This notebook demonstrates how to build a **deep neural network (DNN)** model using **TensorFlow and Keras** to predict whether a patient has diabetes. We will use the **Pima Indians Diabetes Dataset** and go through the process of **data loading, preprocessing, model creation, training, and evaluation**.  This notebook provides a practical example of applying deep learning techniques to a binary classification problem in healthcare.

---

## **Learning Objectives**
By the end of this notebook, you will be able to:
- Load and explore the **Pima Indians Diabetes Dataset**.
- Preprocess data for neural network training, including **feature scaling**.
- Build a **sequential deep neural network model** using **TensorFlow/Keras**.
- Train the DNN model to predict diabetes using the dataset.
- Evaluate the model's performance using **accuracy, precision, recall, F1-score, and confusion matrix**.
- Visualize model training progress using **accuracy and loss curves**.
- Make **predictions on new sample data** using the trained DNN model.

---

## **Tasks to Complete**
1. **Load and Explore the Dataset:**
    - Read the Pima Indians Diabetes Dataset from a CSV file.
    - Display dataset information and sample data to understand its structure and features.

2. **Preprocess the Data:**
    - Prepare features and labels for model training.
    - Split the dataset into **training and testing sets**.
    - **Standardize numerical features** using `StandardScaler`.

3. **Build and Train a Deep Neural Network (DNN) Model:**
    - Create a **Sequential model** using TensorFlow/Keras with multiple dense layers and ReLU activation, and a sigmoid output layer for binary classification.
    - **Compile the model** with binary cross-entropy loss, Adam optimizer, and accuracy metric.
    - **Train the DNN model** using the training data, including validation during training.

4. **Evaluate Model Performance:**
    - Make predictions on the test set.
    - Evaluate and display model performance metrics: **accuracy, precision, recall, F1-score, classification report, and confusion matrix**.
    - Plot **training accuracy and loss curves** to visualize training progress.

5. **Make Predictions on New Samples:**
    - Create new sample patient data.
    - Preprocess the new sample data using the **same scaler** fitted on the training data.
    - Use the trained DNN model to **predict diabetes risk probabilities** for the new samples.
    - Display the predicted outcomes and probabilities for each sample.

---

## **Prerequisites**
Before running this notebook, ensure you have the following:
- **Python 3.6+**
- **Required Python libraries installed:**
    ```bash
    pip install numpy pandas matplotlib scikit-learn tensorflow
    ```
    *(Keras ships with TensorFlow as `tensorflow.keras`, which is what this notebook imports, so no separate Keras installation is needed.)*

---

## **Get Started**
- **Launch an AWS SageMaker Notebook Instance:** Follow the instructions in the "Setup an AWS SageMaker Notebook Instance" section to create and configure your instance, and select a GPU-enabled instance type (e.g., g4dn.xlarge for CUDA support, with 50 GB of storage).
- Select the `conda_tensorflow2_p310` kernel in the SageMaker notebook instance.
- Execute the code cells in the notebook sequentially to follow the steps of building, training, and evaluating the diabetes prediction DNN model.


### Import necessary dependencies

In [None]:
# Import the NumPy library for numerical operations in Python.
import numpy as np

# Import the Pandas library for data manipulation and analysis using DataFrames.
import pandas as pd

# Import the Pyplot module from Matplotlib library for creating static, interactive, and animated visualizations in Python.
import matplotlib.pyplot as plt

# Import the 'read_csv' function from the Pandas library to read comma-separated values (CSV) files into DataFrame.
from pandas import read_csv

# Import the 'Counter' class from the 'collections' module for counting hashable objects.
from collections import Counter

# Import the 'metrics' module from scikit-learn library for model evaluation metrics.
from sklearn import metrics

# Import the 'train_test_split' function from scikit-learn to split data into training and testing sets.
from sklearn.model_selection import train_test_split

# Import specific modules from scikit-learn for preprocessing data: LabelEncoder, StandardScaler, and label_binarize.
from sklearn.preprocessing import LabelEncoder, StandardScaler, label_binarize

# Import the 'Sequential' model from TensorFlow Keras to build neural network models layer by layer.
from tensorflow.keras.models import Sequential

# Import the 'Dense' layer from TensorFlow Keras to create densely connected neural network layers.
from tensorflow.keras.layers import Dense

# Import the 'logging' module for configuring logging levels (e.g., suppressing TensorFlow warnings)
import logging

# Import TensorFlow, a deep learning framework
import tensorflow as tf

# Import the 'random' module for generating random numbers (used for reproducibility)
import random

# Set TensorFlow's logger to only display ERROR-level messages (suppresses INFO/WARNING logs)
tf.get_logger().setLevel(logging.ERROR)

# Import the 'warnings' module to handle Python warnings
import warnings

# Ignore all warnings (e.g., deprecation warnings, NumPy/TensorFlow alerts)
warnings.filterwarnings('ignore')

# Enable matplotlib to display plots inline within the notebook.
%matplotlib inline

# Pima Indians Diabetes Dataset

## Overview
The **Pima Indians Diabetes Dataset** is a well-known dataset used for binary classification tasks in machine learning, specifically for predicting whether a patient has diabetes based on various medical attributes. The dataset originates from the **National Institute of Diabetes and Digestive and Kidney Diseases (NIDDK)** and focuses on female patients of **Pima Indian heritage**.

## Source
- **Dataset Repository:** [UCI Machine Learning Repository](https://archive.ics.uci.edu/dataset/12/pima+indians+diabetes)
- **Original Source:** National Institute of Diabetes and Digestive and Kidney Diseases
- **Purpose:** Predicting the onset of diabetes based on diagnostic measurements.

## Dataset Description
The dataset contains **768 samples** with **8 numerical features** and **1 binary target variable** (diabetes outcome).

### **Features:**
1. **Pregnancies** – Number of times pregnant  
2. **Glucose** – Plasma glucose concentration at 2 hours in an oral glucose tolerance test  
3. **BloodPressure** – Diastolic blood pressure (mm Hg)  
4. **SkinThickness** – Triceps skinfold thickness (mm)  
5. **Insulin** – 2-Hour serum insulin (mu U/ml)  
6. **BMI** – Body mass index (weight in kg / (height in m)²)  
7. **DiabetesPedigreeFunction** – Diabetes pedigree function (genetic influence)  
8. **Age** – Age of the patient (years)  
9. **Outcome** – Target variable, not an input feature (1 = Diabetic, 0 = Non-Diabetic)  

## Summary Statistics
- **Total samples:** 768  
- **Diabetes positive cases (Outcome = 1):** ~35%  
- **Diabetes negative cases (Outcome = 0):** ~65%  
- **Missing values:** Some attributes contain zero values which may indicate missing data (e.g., Glucose, BloodPressure).
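
The zero-as-missing pattern noted above can be checked with a quick count. The sketch below uses a few hypothetical rows rather than the real file, and it assumes that a zero in `Glucose` or `BloodPressure` is implausible while a zero in `Pregnancies` is a valid value.

```python
import pandas as pd

# Hypothetical rows for illustration only; they are not taken from the dataset.
sample = pd.DataFrame({
    "Pregnancies": [6, 0, 8, 0],       # zero is a legitimate value here
    "Glucose": [148, 0, 183, 89],      # zero most likely means "not recorded"
    "BloodPressure": [72, 66, 0, 0],   # zero most likely means "not recorded"
})

# Columns in which a zero is treated as a missing-value placeholder (an assumption).
suspect_columns = ["Glucose", "BloodPressure"]

# Count the zeros per suspect column.
zero_counts = (sample[suspect_columns] == 0).sum()
print(zero_counts)
```

Running the same count on the loaded DataFrame would show how many rows are affected before deciding whether to impute or drop them.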

## Example Usage
This dataset is frequently used in **machine learning** and **statistical modeling** for:
- Logistic Regression
- Decision Trees & Random Forests
- Support Vector Machines (SVM)
- Deep Learning
- Feature Engineering and Imputation Techniques

## References
- UCI Machine Learning Repository: [Pima Indians Diabetes Dataset](https://archive.ics.uci.edu/dataset/12/pima+indians+diabetes)
- Smith, J. W., et al. "Using the ADAP learning algorithm to forecast the onset of diabetes mellitus." In Proceedings of the Annual Symposium on Computer Application in Medical Care. American Medical Informatics Association, 1988.

### Load dataset

In [None]:
# Specify the path to the pima-indians-diabetes dataset CSV file.
diabetes_data = "../../Data/pima-indians-diabetes.csv"

# Define a list of column names for the dataset, based on the dataset description.
columns = [
    'Pregnancies', # Column name for number of pregnancies.
    'Glucose', # Column name for plasma glucose concentration.
    'BloodPressure', # Column name for diastolic blood pressure.
    'SkinThickness', # Column name for triceps skin fold thickness.
    'Insulin', # Column name for 2-hour serum insulin.
    'BMI', # Column name for body mass index.
    'DiabetesPedigreeFunction', # Column name for diabetes pedigree function.
    'Age', # Column name for age.
    'Outcome' # Column name for class variable (diabetes outcome).
]

# Load the dataset from the CSV file using pandas read_csv function.
df = read_csv(
    diabetes_data, # Path to the CSV file.
    header=None, # Indicate that the CSV file has no header row.
    names=columns, # Assign the defined column names to the DataFrame.
    na_values="?", # Treat question marks '?' in the data as Not a Number (NaN) values.
    sep=',' # Specify that the data is separated by commas.
)

# Print information about the loaded dataset.
print("Dataset Info:")

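# Print a concise summary of the DataFrame, including data types and non-null counts.
# df.info() prints directly and returns None, so it is not wrapped in print().
df.info()

# Print a header for the sample rows, preceded by a blank line for readability.
print("\nSample Data:")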

# Display the first few rows of the DataFrame to get a glimpse of the data.
print(df.head())

## Domain Knowledge

### Key Health Indicators

- **Pregnancies:** Number of times pregnant
- **Glucose:** Plasma glucose concentration (mg/dL)
- **BloodPressure:** Diastolic blood pressure (mm Hg)
- **SkinThickness:** Triceps skin fold thickness (mm)
- **Insulin:** 2-Hour serum insulin (mu U/ml)
- **BMI:** Body mass index (kg/m²)
- **DiabetesPedigreeFunction:** Diabetes risk genetic score
- **Age:** Years
- **Outcome:** Diabetes diagnosis (0 = Negative, 1 = Positive)

### Define helper functions for model evaluation

In [None]:
# Accuracy, F1-score, and ROC-AUC can vary between runs of a Deep Neural Network (DNN)
# because weight initialization, data shuffling, and some GPU operations are random.
# The settings below reduce that run-to-run variability.

# Fix seeds for reproducibility 
np.random.seed(42)
tf.random.set_seed(42)
random.seed(42)

# Disable GPU Non-Determinism
tf.config.experimental.enable_op_determinism()

# Define a function called 'get_metrics' that takes true labels and predicted labels as input.
def get_metrics(true_labels, predicted_labels):
    """Calculate and print performance metrics."""
    # Print the Accuracy score, rounded to 4 decimal places, by comparing true labels and predicted labels.
    print("Accuracy:", np.round(metrics.accuracy_score(true_labels, predicted_labels), 4))
    
    # Print the Precision score, rounded to 4 decimal places, using weighted averaging for multi-class if needed.
    print("Precision:", np.round(metrics.precision_score(true_labels, predicted_labels, average="weighted"), 4))
    
    # Print the Recall score, rounded to 4 decimal places, using weighted averaging for multi-class if needed.
    print("Recall:", np.round(metrics.recall_score(true_labels, predicted_labels, average="weighted"), 4))
    
    # Print the F1 Score, rounded to 4 decimal places, using weighted averaging for multi-class if needed.
    print("F1 Score:", np.round(metrics.f1_score(true_labels, predicted_labels, average="weighted"), 4))

# Define a function called 'display_classification_report' that takes true labels, predicted labels, and class names as input.
def display_classification_report(true_labels, predicted_labels, classes):
    """Display classification report.""" # Docstring for the function explaining its purpose.
    # Prints the classification report using scikit-learn's metrics.classification_report function, using provided true labels, predicted labels, and class labels.
    print(metrics.classification_report(true_labels, predicted_labels, labels=classes)) 

# Defines a function called 'display_confusion_matrix' that takes true labels, predicted labels, and class names as input.
def display_confusion_matrix(true_labels, predicted_labels, classes):
    """Display confusion matrix."""
    # Calculates the confusion matrix using scikit-learn's metrics.confusion_matrix function.
    cm = metrics.confusion_matrix(true_labels, predicted_labels, labels=classes)
    
    # Creates a Pandas DataFrame from the confusion matrix 'cm', using class names for index and columns.
    cm_frame = pd.DataFrame(cm, index=classes, columns=classes)
    
    # Prints the title "Confusion Matrix:" to the console.
    print("Confusion Matrix:")
    
    # Prints the Pandas DataFrame 'cm_frame' which represents the confusion matrix.
    print(cm_frame)

# Define function to display comprehensive model performance metrics
def display_model_performance_metrics(true_labels, predicted_labels, classes):
    """Display model performance metrics."""
    # Prints a header indicating the start of model performance metrics display.
    print("Model Performance Metrics:")
    
    # Prints a separator line for visual clarity.
    print("-" * 30)
    
    # Calls the 'get_metrics' function to calculate and print accuracy, precision, recall, and F1 score.
    get_metrics(true_labels, predicted_labels)
    
    # Prints a header indicating the start of the classification report.
    print("\nClassification Report:")
    
    # Prints a separator line for visual clarity.
    print("-" * 30)
    
    # Calls the 'display_classification_report' function to print the classification report.
    display_classification_report(true_labels, predicted_labels, classes)
    
    # Prints a header indicating the start of the confusion matrix display.
    print("\nConfusion Matrix:")
    
    # Prints a separator line for visual clarity.
    print("-" * 30)
    
    # Calls the 'display_confusion_matrix' function to print the confusion matrix.
    display_confusion_matrix(true_labels, predicted_labels, classes)

# Define a function to plot the Receiver Operating Characteristic (ROC) curve for a classification model.
def plot_model_roc_curve(clf, features, true_labels, label_encoder=None, class_names=None):
    """Plot ROC curve for the model."""
    # Determine class labels from classifier, label encoder, or provided class names.
    # Check if the classifier object 'clf' has attribute 'classes_' (scikit-learn classifiers).
    if hasattr(clf, "classes_"): 
        # Get class labels from the classifier object.
        class_labels = clf.classes_ 
        
    # If 'label_encoder' is provided.    
    elif label_encoder: 
        # Get class labels from the label encoder.
        class_labels = label_encoder.classes_ 

    # If 'class_names' are directly provided.
    elif class_names: 
        # Use the provided class names.
        class_labels = class_names 

    # If class labels cannot be determined from any source.
    else: 
        # Raise a ValueError exception indicating inability to get class labels.
        raise ValueError("Unable to derive prediction classes!") 

    # Get the number of classes.
    n_classes = len(class_labels) 
    
    # Binarize true labels for ROC curve calculation, handling multi-class if necessary.
    y_test = label_binarize(true_labels, classes=class_labels) 

    # For binary classification (2 classes).
    if n_classes == 2:
        # Get probability scores for the positive class (index 1) if 'predict_proba' is available, otherwise use 'decision_function'.
        y_score = clf.predict_proba(features)[:, 1] if hasattr(clf, "predict_proba") else clf.decision_function(features)

        # Calculate False Positive Rate, True Positive Rate, and thresholds for ROC curve.
        # 'roc_curve' and 'auc' are accessed through the imported 'metrics' module.
        fpr, tpr, _ = metrics.roc_curve(y_test, y_score) 

        # Calculate Area Under the ROC Curve (AUC).
        roc_auc = metrics.auc(fpr, tpr)

        # Plot ROC curve for binary class with AUC value in label.
        plt.plot(fpr, tpr, label=f"ROC curve (area = {roc_auc:.2f})", linewidth=2.5) 
    
    # For multi-class classification (more than 2 classes).
    else:
        # Get probability scores for all classes.
        y_score = clf.predict_proba(features) if hasattr(clf, "predict_proba") else clf.decision_function(features) 

        # Iterate through each class.
        for i in range(n_classes): 
            # Calculate FPR, TPR for each class vs. rest.
            fpr, tpr, _ = metrics.roc_curve(y_test[:, i], y_score[:, i]) 

            # Calculate AUC for each class.
            roc_auc = metrics.auc(fpr, tpr) 

            # Plot ROC curve for each class with AUC value and linestyle.
            plt.plot(fpr, tpr, label=f"ROC curve of class {class_labels[i]} (area = {roc_auc:.2f})", linestyle=":", linewidth=2) 

    # Plot the diagonal line representing random guessing.
    plt.plot([0, 1], [0, 1], "k--") 

    # Set x-axis limits from 0 to 1.
    plt.xlim([0.0, 1.0]) 

    # Set y-axis limits from 0 to 1.05.
    plt.ylim([0.0, 1.05]) 

    # Set x-axis label.
    plt.xlabel("False Positive Rate") 

    # Set y-axis label.
    plt.ylabel("True Positive Rate") 

    # Set plot title.
    plt.title("Receiver Operating Characteristic (ROC) Curve") 

    # Display legend in the lower right corner.
    plt.legend(loc="lower right") 

    # Show the plot.
    plt.show() 
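
The ROC helper above looks for `classes_` and `predict_proba`, which scikit-learn classifiers provide but a Keras `Sequential` model does not. One possible workaround is a thin adapter. The sketch below is hypothetical: `SigmoidModelAdapter` and `_StubModel` are names invented for illustration, and the stub only stands in for a trained model whose `predict()` returns the positive-class probability with shape `(n_samples, 1)`.

```python
import numpy as np

class SigmoidModelAdapter:
    """Hypothetical adapter exposing classes_ and predict_proba for a model
    whose predict() returns P(class 1) with shape (n_samples, 1)."""

    def __init__(self, model, classes=(0, 1)):
        self.model = model
        self.classes_ = np.asarray(classes)

    def predict_proba(self, features):
        # Flatten the (n_samples, 1) output to a 1-D vector of P(class 1).
        p1 = np.asarray(self.model.predict(features)).reshape(-1)
        # Return two columns: P(class 0) and P(class 1).
        return np.column_stack([1.0 - p1, p1])

class _StubModel:
    """Stand-in for a trained model, used only to make this sketch runnable."""

    def predict(self, features):
        # Logistic function of each row's sum, returned with shape (n_samples, 1).
        z = np.asarray(features, dtype=float).sum(axis=1, keepdims=True)
        return 1.0 / (1.0 + np.exp(-z))

adapter = SigmoidModelAdapter(_StubModel())
proba = adapter.predict_proba([[0.0, 0.0], [1.0, 2.0]])
print(proba.shape)
print(proba.sum(axis=1))
```

With a trained model, the wrapped object could then be passed as `clf` to `plot_model_roc_curve`, provided `roc_curve` and `auc` are resolvable in the notebook's namespace.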

### Prepare features and labels

In [None]:
# Features: Select all columns except 'Outcome' as features (independent variables).
X = df.drop(columns=['Outcome']).values

# Target: Select the 'Outcome' column as the target variable (dependent variable), representing diabetes status (1=Diabetic, 0=Non-Diabetic).
y = df['Outcome'].values

### Split data into train and test sets

In [None]:
# Split dataset into training and testing sets (80% train, 20% test)
train_X, test_X, train_y, test_y = train_test_split(X, y, test_size=0.2, random_state=42)

# Print a header to indicate class distribution information is being displayed.
print("\nClass Distribution:")

# Print the class distribution (counts of each class) in the training labels (train_y).
print("Train:", Counter(train_y))

# Print the class distribution (counts of each class) in the test labels (test_y).
print("Test:", Counter(test_y))

### Scale features

In [None]:
# Scale features using StandardScaler to standardize the training features.
scaler = StandardScaler().fit(train_X)

# Transform the training features 'train_X' using the fitted scaler and store the result in 'train_SX'.
train_SX = scaler.transform(train_X)

# Transform the test features 'test_X' using the fitted scaler and store the result in 'test_SX'.
test_SX = scaler.transform(test_X)
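
As a rough illustration of what the scaling step computes, the sketch below standardizes a tiny hypothetical array by hand. It uses the population standard deviation (`ddof=0`), which is what `StandardScaler` uses, and it applies the training statistics to the test row rather than recomputing them.

```python
import numpy as np

# Hypothetical feature values (two columns), for illustration only.
train = np.array([[1.0, 100.0], [3.0, 140.0], [5.0, 180.0]])
test = np.array([[3.0, 120.0]])

# Statistics are computed on the training split only.
mean = train.mean(axis=0)
std = train.std(axis=0)  # population standard deviation (ddof=0)

# z = (x - mean) / std, column by column.
train_scaled = (train - mean) / std
test_scaled = (test - mean) / std  # reuse the training mean and std

print(train_scaled.mean(axis=0))
print(train_scaled.std(axis=0))
print(test_scaled)
```

Fitting on the training split alone keeps information about the test set out of the preprocessing step.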

### Build and train the DNN model

In [None]:
# Build a Sequential Deep Neural Network model using Keras.
model = Sequential([
    # Add a Dense layer with 16 neurons, ReLU activation, and input shape matching the number of features.
    Dense(16, activation="relu", input_shape=(train_SX.shape[1],)),
    
    # Add another Dense hidden layer with 16 neurons and ReLU activation.
    Dense(16, activation="relu"),
    
    # Add one more Dense hidden layer with 16 neurons and ReLU activation.
    Dense(16, activation="relu"),
    
    # Add the output Dense layer with 1 neuron and sigmoid activation for binary classification.
    Dense(1, activation="sigmoid")
])

# Configure the model for training: use 'adam' as the optimizer, 'binary_crossentropy'
# as the loss function, and 'accuracy' as the evaluation metric.
model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])

# Train the compiled DNN model using the training data.
history = model.fit(
    train_SX, train_y,
    epochs=10, # Train for 10 epochs.
    batch_size=5, # Use a batch size of 5 during training.
    validation_split=0.1, # Use 10% of the training data for validation.
    verbose=1 # Display training progress for each epoch.
)
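
To get a feel for the size of this network, the number of trainable parameters can be worked out by hand. The sketch below assumes 8 input features feeding the three 16-unit hidden layers and the single output unit defined above; `dense_param_count` is a hypothetical helper, not part of Keras.

```python
def dense_param_count(layer_sizes):
    """Count trainable parameters in a stack of fully connected layers.

    layer_sizes lists the input width followed by each layer's unit count.
    """
    total = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        # One weight per input-unit pair, plus one bias per unit.
        total += n_in * n_out + n_out
    return total

# 8 input features -> 16 -> 16 -> 16 -> 1
sizes = [8, 16, 16, 16, 1]
param_total = dense_param_count(sizes)
print(param_total)  # 705
```

The result can be compared with the total reported by `model.summary()`.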

### Evaluate model performance

In [None]:
# Evaluate model performance on test set
# Predict class labels (0 or 1) for the test set based on probability threshold of 0.5.
# model.predict returns shape (n_samples, 1); ravel() flattens it to match test_y.
y_pred = (model.predict(test_SX) > 0.5).astype(int).ravel()

# Display model performance metrics (accuracy, precision, recall, F1-score, confusion matrix, classification report).
display_model_performance_metrics(test_y, y_pred, classes=[0, 1])

# Plot training history
# Create a new figure with a size of 12x4 inches for the plots.
plt.figure(figsize=(12, 4))

# Create the first subplot in a 1x2 grid (1 row, 2 columns), which is for accuracy.
plt.subplot(1, 2, 1)

# Plot the training accuracy from the training history.
plt.plot(history.history['accuracy'], label='Train Accuracy')

# Plot the validation accuracy from the training history.
plt.plot(history.history['val_accuracy'], label='Validation Accuracy')

# Set the title of the accuracy subplot.
plt.title('Accuracy')

# Display the legend for the accuracy subplot.
plt.legend()

# Create the second subplot in a 1x2 grid (1 row, 2 columns), which is for loss.
plt.subplot(1, 2, 2)

# Plot the training loss from the training history.
plt.plot(history.history['loss'], label='Train Loss')

# Plot the validation loss from the training history.
plt.plot(history.history['val_loss'], label='Validation Loss')

# Set the title of the loss subplot.
plt.title('Loss')

# Display the legend for the loss subplot.
plt.legend()

# Show the complete figure with both subplots.
plt.show()

## Model Training and Validation Performance Analysis

The output above shows the deep learning model's performance both on the unseen test dataset and during training. The sections below break down each component to identify the model's strengths and areas for potential improvement.

### 1. Model Performance Metrics (Test Set)

These metrics evaluate the model's ability to generalize to new, unseen data (the test set).

*   **Accuracy: 0.7403**
    -   **Interpretation:** The model correctly classified approximately 74.03% of the samples in the test dataset. This provides an overall measure of correctness.
*   **Precision: 0.7425**
    -   **Interpretation:** This is the support-weighted average of the per-class precision values (0.80 for Class 0 and 0.63 for Class 1). Precision for a class is the fraction of predictions of that class that are correct, so it reflects how well the model avoids false positives for that class.
*   **Recall: 0.7403**
    -   **Interpretation:** This is the support-weighted average of the per-class recall values (0.79 for Class 0 and 0.65 for Class 1). Recall for a class is the fraction of its actual instances that the model finds, so it reflects how well the model avoids false negatives. With support weighting, this average always equals accuracy, which is why both are 0.7403.
*   **F1 Score: 0.7413**
    -   **Interpretation:** The F1-score is the harmonic mean of precision and recall; the value here is the support-weighted average of the per-class F1-scores. A score of 0.7413 sits close to both weighted precision and weighted recall, so neither dominates.

**Overall Test Set Performance:** The model demonstrates reasonable performance on the test set, with all four metrics at about 74%. This suggests a moderate level of generalization to unseen data.

### 2. Classification Report

This report offers a class-wise breakdown of the model's performance, providing insights into how well it performs for each class (0 and 1).

| Metric        | Class 0 | Class 1 |
|---------------|---------|---------|
| **Precision** | 0.80    | 0.63    |
| **Recall**    | 0.79    | 0.65    |
| **F1-Score**  | 0.80    | 0.64    |
| **Support**   | 99      | 55      |

*   **Class 0 Performance:** The model performs well on Class 0, with precision, recall, and F1-score of 0.80, 0.79, and 0.80, respectively. This indicates the model is effective at identifying and classifying instances of Class 0.
*   **Class 1 Performance:** Performance is lower for Class 1, with precision, recall, and F1-score of 0.63, 0.65, and 0.64, respectively. A recall of 0.65 means that roughly one in three diabetic patients in the test set is missed.
*   **Support:** The 'support' indicates the number of actual instances of each class in the test set. Class 0 (99 instances) is more prevalent than Class 1 (55 instances), suggesting a potential class imbalance.
*   **Accuracy (Report): 0.74** - The classification report shows the same accuracy as the 0.7403 reported above, rounded to two decimal places.
*   **Macro Avg F1-Score: 0.72** - The unweighted average F1-score across both classes.
*   **Weighted Avg F1-Score: 0.74** - The F1-score weighted by the support (number of instances) for each class, aligning with the F1-score reported in the initial metrics.

**Class-Specific Performance:** The model is better at predicting Class 0 than Class 1. The class imbalance (more instances of Class 0) might contribute to this difference in performance.

### 3. Confusion Matrix

The confusion matrix visually summarizes the performance by showing counts of True Positives, True Negatives, False Positives, and False Negatives.

| Predicted Class | Class 0 | Class 1 |
|-----------------|---------|---------|
| **Actual Class 0** | 78 (TN) | 21 (FP) |
| **Actual Class 1** | 19 (FN) | 36 (TP) |

*   **True Negatives (TN) = 78:**  Correctly predicted Class 0 when the actual class was 0.
*   **False Positives (FP) = 21:** Incorrectly predicted Class 1 when the actual class was 0 (Type I Error).
*   **False Negatives (FN) = 19:** Incorrectly predicted Class 0 when the actual class was 1 (Type II Error).
*   **True Positives (TP) = 36:** Correctly predicted Class 1 when the actual class was 1.

**Confusion Analysis:** The confusion matrix reinforces the observation that the model performs better for Class 0 (78 of 99 correct) than for Class 1 (36 of 55 correct). False Positives (21) and False Negatives (19) occur in similar numbers, so the errors are fairly balanced in absolute terms. Because Class 1 is the smaller class, however, the 19 False Negatives are about 35% of all actual diabetic cases, and reducing them may matter most in a screening application.
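
The headline metrics can be reproduced directly from the four confusion-matrix counts. The sketch below is a plain-Python check using the values in the table above; `prf` and `weighted` are hypothetical helper names, and the weighting mirrors the `average="weighted"` setting used in `get_metrics`.

```python
# Counts taken from the confusion matrix reported above.
tn, fp, fn, tp = 78, 21, 19, 36
total = tn + fp + fn + tp

def prf(true_pos, false_pos, false_neg):
    """Return precision, recall, and F1 for one class."""
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Class 1 treats "diabetic" as positive; for Class 0 the cell roles swap.
p1, r1, f1_1 = prf(tp, fp, fn)
p0, r0, f1_0 = prf(tn, fn, fp)

# Support is the number of actual instances of each class.
support0, support1 = tn + fp, fn + tp

def weighted(value0, value1):
    """Average two per-class values in proportion to class support."""
    return (value0 * support0 + value1 * support1) / total

accuracy = (tn + tp) / total
print(round(accuracy, 4))                # 0.7403
print(round(weighted(p0, p1), 4))        # 0.7425
print(round(weighted(r0, r1), 4))        # 0.7403
print(round(weighted(f1_0, f1_1), 4))    # 0.7413
```

The third line matches the first because support-weighted recall reduces to the fraction of all samples classified correctly.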

### 4. Accuracy and Loss Curves (Training History)

These plots illustrate the model's learning process over training epochs, showing trends in accuracy and loss for both the training and validation datasets.

*   **Accuracy Curve:**
    -   **Train Accuracy (Blue):**  Increases steadily and plateaus around 0.80, indicating good learning on the training data.
    -   **Validation Accuracy (Orange):**  Also increases but plateaus slightly lower than training accuracy, around 0.73. The validation accuracy generally follows the trend of the training accuracy, suggesting the model is learning effectively without severe overfitting.
*   **Loss Curve:**
    -   **Train Loss (Blue):** Decreases consistently, indicating the model is minimizing the error on the training data.
    -   **Validation Loss (Orange):** Decreases initially and then plateaus at a slightly higher value than the training loss. The validation loss is slightly above the training loss, which is expected, and it doesn't show a significant increase, further suggesting no major overfitting.

**Training Dynamics:** The training curves show that the model is learning and improving over epochs. The validation curves follow a similar trend to the training curves, indicating that the model is generalizing reasonably well to unseen data during training. The slight gap between training and validation performance is normal and might be reduced with further optimization.
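
One simple way to read the loss curve numerically is to find the epoch with the lowest validation loss and see how many epochs followed it without improvement. The sketch below uses hypothetical loss values shaped like `history.history['val_loss']`; they are illustrative and are not results from this notebook.

```python
# Hypothetical per-epoch validation losses (illustrative values only).
val_loss = [0.66, 0.58, 0.53, 0.51, 0.50, 0.49, 0.50, 0.49, 0.51, 0.52]

# Index of the first epoch that reaches the minimum validation loss.
best_epoch = min(range(len(val_loss)), key=val_loss.__getitem__)

# Number of epochs trained after the best one.
epochs_since_best = len(val_loss) - 1 - best_epoch

print(f"Best epoch: {best_epoch + 1}, val_loss={val_loss[best_epoch]}")
print(f"Epochs since best: {epochs_since_best}")
```

A long run of epochs after the minimum, with validation loss rising while training loss keeps falling, would be a sign of overfitting.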

### **Summary and Potential Improvements**

The model demonstrates a **moderate level of performance** in classifying the dataset, achieving about 74% accuracy and F1-score on the test set. It performs better on Class 0 compared to Class 1, potentially due to class imbalance. The training curves suggest the model is learning without severe overfitting.

**Potential areas for improvement include:**

*   **Addressing Class Imbalance:** Techniques to handle class imbalance (e.g., oversampling, undersampling, class weights) could improve performance, especially for Class 1.
*   **Hyperparameter Tuning:** Experimenting with different model architectures (layers, neurons), optimizers, learning rates, and regularization techniques might lead to better results.
*   **Feature Engineering/Selection:** Exploring feature engineering or selection to enhance the model's ability to discriminate between classes could be beneficial.
*   **Increasing Data:**  More training data, if available, could help the model generalize better and improve overall performance.

By addressing these areas, you could potentially optimize the model to achieve higher accuracy and a more balanced performance across both classes.
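
As one illustration of the class-weight idea, the sketch below computes "balanced" weights with the common heuristic `n_samples / (n_classes * class_count)`. The helper name `balanced_class_weights` and the label vector are hypothetical; the 65/35 split only approximates the dataset's class ratio.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}

# Hypothetical label vector with a 65/35 split (not the actual training labels).
labels = [0] * 65 + [1] * 35
weights = balanced_class_weights(labels)
print(weights)
```

A dictionary of this form can be passed to Keras through the `class_weight` argument of `model.fit`, so that errors on the minority class contribute more to the loss.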

### Prediction using new patients

In [None]:
# 1. Create new sample data (replace with your desired sample data)
# Define a dictionary 'new_samples_data' to hold data for new samples.
new_samples_data = {
    # Define data for 'Sample1' as a dictionary of features and their values.
    'Sample1': {'Pregnancies': 2, 'Glucose': 100, 'BloodPressure': 70, 'SkinThickness': 25, 'Insulin': 50, 'BMI': 30.0, 'DiabetesPedigreeFunction': 0.5, 'Age': 35},
    
    # Define data for 'Sample2' as a dictionary of features and their values.
    'Sample2': {'Pregnancies': 8, 'Glucose': 180, 'BloodPressure': 90, 'SkinThickness': 35, 'Insulin': 100, 'BMI': 40.0, 'DiabetesPedigreeFunction': 1.0, 'Age': 50}
}
# Create a Pandas DataFrame 'new_samples_df' from the 'new_samples_data' dictionary, orienting it by index.
new_samples_df = pd.DataFrame.from_dict(new_samples_data, orient='index')

# 2. Preprocess the new samples using the SAME scaler fitted on training data
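# Transform the new sample data using the 'scaler' object (fitted on training data) to standardize features.
# The scaler was fitted on a NumPy array, so pass the values in the training column order.
new_samples_scaled = scaler.transform(new_samples_df[columns[:-1]].values)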

# 3. Make predictions
# Get probability predictions from the 'model' for the scaled new samples.
new_probabilities = model.predict(new_samples_scaled) # Get probabilities

# 4. Display the predictions
# Print a header to indicate predictions for new samples.
print("\n--- Predictions for New Samples ---")
# Iterate through the new samples using their index and sample name.
for i, sample_name in enumerate(new_samples_data.keys()):
    # Predict the class label (0 or 1) as a plain integer by thresholding the probability at 0.5.
    prediction = int(new_probabilities[i][0] > 0.5)
    
    # Calculate the probability of being 'non-diabetic' (class 0) in percentage.
    # Probability for 'non-diabetic' class
    probability_non_diabetic = (1 - new_probabilities[i][0]) * 100 
    
    # Calculate the probability of being 'diabetic' (class 1) in percentage.
    # Probability for 'diabetic' class
    probability_diabetic = new_probabilities[i][0] * 100   
    
    # Print the sample name.
    print(f"Sample: {sample_name}")
    
    # Print the predicted outcome (class label).
    print(f"  Predicted Outcome: {prediction}")
    
    # Print the probability of being 'non-diabetic'.
    print(f"  Probability (non-diabetic): {probability_non_diabetic:.2f}%")
    
    # Print the probability of being 'diabetic'.
    print(f"  Probability (diabetic): {probability_diabetic:.2f}%")
    
    # Print a separator line for better readability.
    print("-" * 30)

## **Conclusion**
This notebook successfully demonstrated the development and evaluation of a Deep Neural Network model for diabetes prediction using the Pima Indians Diabetes Dataset. 
- We covered the essential steps of a machine learning workflow, including data loading, preprocessing, model building with TensorFlow/Keras, training, and performance evaluation. 
- The model achieved about 74% accuracy on the test set when predicting diabetes from the provided health metrics.
- This example provides a foundation for understanding how deep learning can be applied to healthcare classification problems and can be further extended by exploring different model architectures, hyperparameter tuning, and feature engineering techniques to potentially improve prediction performance.

## Clean up

Remember to shut down your Jupyter Notebook environment and delete any unnecessary files or resources once you've completed the tutorial.
