# Neural Network to Determine Test Results of Patients

## What Are We Solving?

We are building this neural network to predict the test results of patients in the dataset. There are three possible outcomes (normal, abnormal, and inconclusive), which makes this a multi-class classification problem. That would normally call for multinomial logistic regression, but given the time frame of this project, we merge the normal and inconclusive cases to get a binary classification problem.

#### Note:
Any words in bold within this file are defined in the file `glossary.md` in the GitHub repository.

### Import Libraries and Dataset

In [1]:
import numpy as np
import pandas as pd
import sklearn
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from scipy import stats
import matplotlib.pyplot as plt
import urllib.request
import io
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset
import opacus

df = pd.read_csv("../data/healthcare_dataset.csv")
df.head()

Unnamed: 0,Name,Age,Gender,Blood Type,Medical Condition,Date of Admission,Doctor,Hospital,Insurance Provider,Billing Amount,Room Number,Admission Type,Discharge Date,Medication,Test Results
0,Bobby JacksOn,30,Male,B-,Cancer,2024-01-31,Matthew Smith,Sons and Miller,Blue Cross,18856.281306,328,Urgent,2024-02-02,Paracetamol,Normal
1,LesLie TErRy,62,Male,A+,Obesity,2019-08-20,Samantha Davies,Kim Inc,Medicare,33643.327287,265,Emergency,2019-08-26,Ibuprofen,Inconclusive
2,DaNnY sMitH,76,Female,A-,Obesity,2022-09-22,Tiffany Mitchell,Cook PLC,Aetna,27955.096079,205,Emergency,2022-10-07,Aspirin,Normal
3,andrEw waTtS,28,Female,O+,Diabetes,2020-11-18,Kevin Wells,"Hernandez Rogers and Vang,",Medicare,37909.78241,450,Elective,2020-12-18,Ibuprofen,Abnormal
4,adrIENNE bEll,43,Female,AB+,Cancer,2022-09-19,Kathleen Hanna,White-White,Aetna,14238.317814,458,Urgent,2022-10-09,Penicillin,Abnormal


## Pre-processing the Data

### Binary Classification

To begin preprocessing, we turn our target column into a binary label. Using the `replace` method, we collapse the three values in the `Test Results` column into two, `Abnormal` and `Non-Abnormal`, and then map them to 1 and 0 respectively.

In [2]:
df['Test Results'] = df['Test Results'].replace({
    'Normal' : 'Non-Abnormal',
    'Inconclusive' : 'Non-Abnormal',
    'Abnormal' : 'Abnormal'})

df['Test Results'] = df['Test Results'].apply(lambda x: 1 if x == 'Abnormal' else 0)
df.head()

Unnamed: 0,Name,Age,Gender,Blood Type,Medical Condition,Date of Admission,Doctor,Hospital,Insurance Provider,Billing Amount,Room Number,Admission Type,Discharge Date,Medication,Test Results
0,Bobby JacksOn,30,Male,B-,Cancer,2024-01-31,Matthew Smith,Sons and Miller,Blue Cross,18856.281306,328,Urgent,2024-02-02,Paracetamol,0
1,LesLie TErRy,62,Male,A+,Obesity,2019-08-20,Samantha Davies,Kim Inc,Medicare,33643.327287,265,Emergency,2019-08-26,Ibuprofen,0
2,DaNnY sMitH,76,Female,A-,Obesity,2022-09-22,Tiffany Mitchell,Cook PLC,Aetna,27955.096079,205,Emergency,2022-10-07,Aspirin,0
3,andrEw waTtS,28,Female,O+,Diabetes,2020-11-18,Kevin Wells,"Hernandez Rogers and Vang,",Medicare,37909.78241,450,Elective,2020-12-18,Ibuprofen,1
4,adrIENNE bEll,43,Female,AB+,Cancer,2022-09-19,Kathleen Hanna,White-White,Aetna,14238.317814,458,Urgent,2022-10-09,Penicillin,1


### Removing Personally Identifiable Information (PII)

When we import our CSV file to create our model, we want to get rid of **PII** that is not relevant to our problem. This lets us make better use of the data while maintaining a base level of privacy. We will remove the columns `Name`, `Doctor`, `Hospital`, `Room Number`, `Insurance Provider`, and `Billing Amount`. We remove `Billing Amount` because its values are large, nearly unique to each person, and do not help us in our classification.

In [3]:
df_clean = df.drop(columns=[
    'Name', 'Doctor', 'Hospital', 'Room Number', 'Insurance Provider', 'Billing Amount'])


### One-hot Encoding

**One-hot encoding** is a preprocessing method that lets machine-learning models work with categorical data, and in pandas it is performed with the function `get_dummies`. It converts categorical variables into numerical values by creating a binary column for each category of a feature and assigning a 1 if the row belongs to that category, 0 otherwise.

In [4]:
categorical_cols = ['Age', 'Gender', 'Blood Type', 'Medical Condition', 
                    'Date of Admission', 'Admission Type', 'Discharge Date', 'Medication']
df_processed = pd.get_dummies(df_clean, columns=categorical_cols)
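To see what `get_dummies` does on a small scale, here is a minimal sketch on a made-up three-row frame. The frame `toy` and its values are hypothetical and are not taken from the healthcare dataset.

```python
import pandas as pd

# Hypothetical three-row frame, not taken from the healthcare dataset
toy = pd.DataFrame({
    "Gender": ["Male", "Female", "Male"],
    "Test Results": [0, 1, 0],
})

# Each distinct value of "Gender" becomes its own 0/1 indicator column
encoded = pd.get_dummies(toy, columns=["Gender"], dtype=int)
print(encoded.columns.tolist())
print(encoded)
```

Because every distinct value gets its own column, encoding columns with many distinct values (such as ages or dates) produces a very wide table, which is likely why the feature matrix printed in the next section has 3,784 columns.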

### Split, Scale, and Standardize

We separate the **features** (X) from the **target** (y) so the model learns only from the inputs. A **train/test split** is used to evaluate how well the model generalizes to new data. **Standardizing** the features ensures all of the numerical values share a similar scale. This prevents large-magnitude features (such as a raw dollar amount like `Billing Amount`, which we dropped above) from dominating the gradient updates and helps gradient-based optimizers converge faster and more stably. We train with Adam, an optimizer in the **stochastic gradient descent** family.

In [5]:
# split features (X) and target (y)
X = df_processed.drop('Test Results', axis=1).values
y = df_processed['Test Results'].values

# sanity-check the shapes
print(f"Shape of X: {X.shape}")
print(f"Shape of y: {y.shape}")

# create train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# standardize numerical features using StandardScaler
scaler = StandardScaler()
# X train and test
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

Shape of X: (55500, 3784)
Shape of y: (55500,)
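The reason for calling `fit_transform` on the training set but only `transform` on the test set can be seen in a small sketch. The arrays `X_train_toy` and `X_test_toy` below are hypothetical values chosen for illustration; the point is that the scaler learns its mean and standard deviation from the training data alone and reuses them on the test data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical features: column 0 is small-scale, column 1 is large-scale
X_train_toy = np.array([[1.0, 1000.0], [2.0, 2000.0], [3.0, 3000.0]])
X_test_toy = np.array([[2.0, 4000.0]])

scaler_toy = StandardScaler()
X_train_scaled = scaler_toy.fit_transform(X_train_toy)  # learns mean/std from train
X_test_scaled = scaler_toy.transform(X_test_toy)        # reuses the train statistics

print(X_train_scaled.mean(axis=0))  # each column is centered at 0
print(X_train_scaled.std(axis=0))   # each column has unit standard deviation
print(X_test_scaled)
```

After scaling, both training columns have mean 0 and standard deviation 1 even though their raw magnitudes differ by a factor of a thousand. The test row is not re-centered on itself, so no information from the test set leaks into preprocessing.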


## Convert To Tensors

PyTorch models operate on **tensors**, which are a specialized data structure similar to arrays and matrices. They are used to encode the inputs and outputs of a model along with the model's parameters. We convert our NumPy arrays to tensors so the model can compute gradients and update weights during training. Wrapping the tensors in a **TensorDataset** and a **DataLoader** gives us efficient **mini-batch** processing, plus **shuffling** of the training data on each pass.

In [6]:
# convert the numpy arrays to PyTorch tensors
X_train_tensor = torch.tensor(X_train, dtype=torch.float32)
y_train_tensor = torch.tensor(y_train, dtype=torch.float32).view(-1, 1)
X_test_tensor = torch.tensor(X_test, dtype=torch.float32)
y_test_tensor = torch.tensor(y_test, dtype=torch.float32).view(-1, 1)

# create DataLoaders for batch processing
train_dataset = TensorDataset(X_train_tensor, y_train_tensor)
test_dataset = TensorDataset(X_test_tensor, y_test_tensor)

train_loader = DataLoader(train_dataset, batch_size=64, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=64, shuffle=False)
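To make mini-batching concrete, the sketch below iterates over a small `DataLoader` and records the shape of each batch. The tensors `features` and `labels` are random, hypothetical stand-ins for the real training tensors.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical data: 10 samples with 4 features each, plus a 0/1 label column
features = torch.randn(10, 4)
labels = torch.randint(0, 2, (10,)).float().view(-1, 1)

loader = DataLoader(TensorDataset(features, labels), batch_size=4, shuffle=True)

# Each iteration yields one (inputs, targets) mini-batch
batch_shapes = [(tuple(xb.shape), tuple(yb.shape)) for xb, yb in loader]
for x_shape, y_shape in batch_shapes:
    print(x_shape, y_shape)
```

With 10 samples and a batch size of 4, the loader yields two full batches and one final batch of 2. The same happens with our real loaders whenever the dataset size is not a multiple of 64.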

## Create the Neural Network

### What is a Neural Network?

This phrase is thrown around a lot in the computer science world. *Neural Network*. Sounds fancy. When I was a freshman in college I heard this coupled with AI more than my own name. In short, a neural network is made up of layers of nodes. There is an input layer, some number of hidden layers, and an output layer. The input and output layers are intuitive: the input layer is where your processed data enters, and the output layer provides the prediction. The hidden layers sit in between. Each one transforms the output of the layer before it, and together they learn the intermediate patterns that connect the inputs to the prediction.
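To make "layers of nodes" a little more concrete: in a fully connected layer, each node computes a weighted sum of its inputs plus a bias. The sketch below checks this against PyTorch's `nn.Linear`. The sizes (5 inputs, 3 nodes) and the names `layer`, `x`, and `manual` are hypothetical and chosen only for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical layer: 5 input features feeding 3 nodes
layer = nn.Linear(5, 3)
x = torch.randn(2, 5)  # a batch of 2 samples

# Each node's output is a weighted sum of the inputs plus a bias
manual = x @ layer.weight.T + layer.bias
out = layer(x)

print(out.shape)
print(torch.allclose(out, manual, atol=1e-6))
```

Stacking several such layers, with a non-linear function between them, is what gives the network its input, hidden, and output structure.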

### Explaining the Code

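We create a class, `MedicalNN`, that subclasses `nn.Module`, the base class PyTorch provides for all neural network modules. Our initialization function starts by creating the first layer, which maps the input to the first hidden layer. `ReLU` (Rectified Linear Unit) is the activation function applied after each hidden layer: it outputs zero for negative inputs and passes positive inputs through unchanged.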
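As a quick illustration of the two activation functions used in the network, the sketch below applies `nn.ReLU` and `nn.Sigmoid` to a few hand-picked values. The tensor `values` is hypothetical.

```python
import torch
import torch.nn as nn

values = torch.tensor([-2.0, 0.0, 3.0])

relu_out = nn.ReLU()(values)        # negatives become 0, positives pass through
sigmoid_out = nn.Sigmoid()(values)  # every value is squashed into (0, 1)

print(relu_out)
print(sigmoid_out)
```

The sigmoid maps an input of 0 to exactly 0.5, which is why 0.5 is a natural cutoff when turning the network's output into a 0/1 prediction.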

In [9]:
class MedicalNN(nn.Module):
    def __init__(self, input_dim):
        super(MedicalNN, self).__init__()
        # first layer is input to hidden
        self.layer1 = nn.Linear(input_dim, 64)
        self.relu = nn.ReLU()
        # second is hidden to hidden
        self.layer2 = nn.Linear(64, 32)
        # third is hidden to output
        self.layer3 = nn.Linear(32, 1)
        # we then use the sigmoid for binary classification
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        x = self.layer1(x)
        x = self.relu(x)
        x = self.layer2(x)
        x = self.relu(x)
        x = self.layer3(x)
        x = self.sigmoid(x)
        # return the predicted probability; without this, forward returns None
        return x

# initialize the model
input_dim = X.shape[1]
model = MedicalNN(input_dim)

# define the loss and optimizer
loss = nn.BCELoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)
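With a model, a loss, and an optimizer defined, one way training might proceed is sketched below. This is an assumption about a typical PyTorch loop, not the notebook's own training code. To keep it self-contained it uses synthetic data and a stand-in network with the same layer pattern as `MedicalNN`; the names `X_demo`, `y_demo`, `demo_loader`, `demo_model`, `criterion`, and `demo_optimizer` are all hypothetical.

```python
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Hypothetical stand-in data: 256 samples, 8 features; label depends on feature 0
X_demo = torch.randn(256, 8)
y_demo = (X_demo[:, :1] > 0).float()
demo_loader = DataLoader(TensorDataset(X_demo, y_demo), batch_size=64, shuffle=True)

# Stand-in network with the same layer pattern as MedicalNN
demo_model = nn.Sequential(
    nn.Linear(8, 64), nn.ReLU(),
    nn.Linear(64, 32), nn.ReLU(),
    nn.Linear(32, 1), nn.Sigmoid(),
)
criterion = nn.BCELoss()
demo_optimizer = optim.Adam(demo_model.parameters(), lr=0.001)

epoch_losses = []
for epoch in range(20):
    demo_model.train()
    running = 0.0
    for xb, yb in demo_loader:
        demo_optimizer.zero_grad()                  # clear gradients from the last step
        loss_value = criterion(demo_model(xb), yb)  # forward pass and loss
        loss_value.backward()                       # backpropagate
        demo_optimizer.step()                       # update the weights
        running += loss_value.item() * xb.size(0)
    epoch_losses.append(running / len(X_demo))      # mean loss over the epoch

print(f"first epoch loss: {epoch_losses[0]:.4f}")
print(f"last epoch loss:  {epoch_losses[-1]:.4f}")
```

For the real model, the same loop would iterate over `train_loader` and call `model`, `loss`, and `optimizer` from the cell above. Tracking the mean loss per epoch is a simple way to confirm that training is making progress.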
