# Assignment 2: Neural Network (One Hidden Layer) with Optimizer

<h2> <b> <u> Dataset background:</u></b> </h2>
<ul>
    <li>Data: Diabetic Encounters (1-14 days/each) from 130 Hospitals for 10 years (1999-2008) </li>
    <li>Goal: Predict if a diabetic patient will be readmitted to a hospital (less than 30 days, after 30 days, or never)</li>
    <li>Target Feature: readmitted </li>
    <li> <a href = "https://archive.ics.uci.edu/ml/datasets/Diabetes+130-US+hospitals+for+years+1999-2008">Dataset Source</li>
</ul>



In [1]:
## import all required libraries 
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import MinMaxScaler

#display all columns of dataframe
pd.pandas.set_option('display.max_columns', None) 

In [2]:
#import dataset 
dataset_url = "https://raw.githubusercontent.com/ronakHegde98/CS-4372-Computational-Methods-for-Data-Scientists/master/data/diabetic_data.csv"
df = pd.read_csv(dataset_url)

print(f"Initial Dataset Shape: {df.shape}")
df.sample(5)

Initial Dataset Shape: (101766, 50)


Unnamed: 0,encounter_id,patient_nbr,race,gender,age,weight,admission_type_id,discharge_disposition_id,admission_source_id,time_in_hospital,payer_code,medical_specialty,num_lab_procedures,num_procedures,num_medications,number_outpatient,number_emergency,number_inpatient,diag_1,diag_2,diag_3,number_diagnoses,max_glu_serum,A1Cresult,metformin,repaglinide,nateglinide,chlorpropamide,glimepiride,acetohexamide,glipizide,glyburide,tolbutamide,pioglitazone,rosiglitazone,acarbose,miglitol,troglitazone,tolazamide,examide,citoglipton,insulin,glyburide-metformin,glipizide-metformin,glimepiride-pioglitazone,metformin-rosiglitazone,metformin-pioglitazone,change,diabetesMed,readmitted
80218,247042512,88533495,Caucasian,Male,[40-50),?,2,6,7,5,BC,Emergency/Trauma,68,6,38,2,0,8,428,518,250.42,9,,,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,Up,No,No,No,No,No,Ch,Yes,>30
101382,438525890,40711671,Caucasian,Male,[70-80),?,1,1,7,1,MC,?,44,0,17,0,0,1,414,786,425,9,,,Up,No,No,No,No,No,No,Steady,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,Ch,Yes,NO
56463,162110298,86376672,Caucasian,Male,[70-80),?,2,1,7,12,MC,InternalMedicine,86,1,31,0,0,0,486,428,427,9,,,Steady,No,No,No,No,No,No,No,No,No,Steady,No,No,No,No,No,No,No,No,No,No,No,No,Ch,Yes,NO
68576,193669110,43385490,Caucasian,Female,[70-80),?,1,6,7,2,MC,?,52,2,9,0,0,0,569,285,250,6,,,No,No,No,No,No,No,Steady,No,No,No,No,No,No,No,No,No,No,Down,No,No,No,No,No,Ch,Yes,>30
12375,50407170,1294047,AfricanAmerican,Male,[40-50),?,3,1,1,1,?,Orthopedics-Reconstructive,51,1,10,0,0,0,722,250,?,2,,,Steady,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,Steady,No,No,No,No,No,Ch,Yes,NO


In [3]:
## check if patients have multiple records
print(f"There are {np.sum(df['patient_nbr'].value_counts() > 1)} patients with multiple records")

There are 16773 patients with multiple records


In [4]:
categorical_cols = [col for col in df.columns if df[col].dtype == np.dtype(np.object)]
print(f"There are {len(categorical_cols)} categorical columns and {len(df.columns)-len(categorical_cols)} numerical columns")

There are 37 categorical columns and 13 numerical columns


<h2> Handling Missing Values </h2>

In [5]:
## sum all missing values for each row of df (axis 0 is row)
missing_count = np.sum(np.sum(np.equal(df, '?'), axis=0))
print(f"There are {missing_count} '?' values in our dataset which is approx {np.round((missing_count/(np.multiply(df.shape[0], df.shape[1])))*100,2)}% of our entire dataset")

There are 192849 '?' values in our dataset which is approx 3.79% of our entire dataset


In [6]:
## convert ?'s into np.nan
df.replace("?", np.nan, inplace=True)

In [7]:
print("Columns with missing data")
missing_cols = df.columns[df.isnull().any()].tolist()
for col in missing_cols:
    print(' ' + col + ': ' + str(df[col].isna().sum()))

Columns with missing data
 race: 2273
 weight: 98569
 payer_code: 40256
 medical_specialty: 49949
 diag_1: 21
 diag_2: 358
 diag_3: 1423


In [8]:
## drop rows where gender is Unknown/Invalid
df.drop(df[df['gender'] == "Unknown/Invalid"].index, axis=0, inplace=True)

## dropping columns that have many missing values
dropped_columns = ['weight', 'payer_code', 'medical_specialty']
dropped_columns.append("encounter_id")
dropped_columns.append('discharge_disposition_id')

## dropping columns that have little to no variability
for col in categorical_cols:
    if(df[col].value_counts(normalize=True).max() > 0.948):
        dropped_columns.append(col)
        
df.drop(columns=dropped_columns, axis=1, inplace=True)
df.dropna(inplace=True)

<h2> Some Patients have multiple records </h2>

In [9]:
## one record per patient (where they had max of time_in_hospital)
df = df.loc[df.groupby("patient_nbr", sort=False)['time_in_hospital'].idxmax()]
df.drop(columns = ['patient_nbr'], inplace=True)

In [10]:
## convert our categorical variable (if readmitted -> 1 else 0)
df['readmitted'] = np.where(df['readmitted']!='NO',1,0)

In [11]:
## convert age ranges to the midpoint of the ranges
new_ages = {
    "[0-10)": 5,
    "[10-20)": 15,
    "[20-30)": 25,
    "[30-40)": 35,
    "[40-50)": 45,
    "[50-60)": 55,
    "[60-70)": 65,
    "[70-80)": 75,
    "[80-90)": 85,
    "[90-100)": 95
}

df['age'] = df['age'].map(new_ages)

In [12]:
max_glu_serums = {
    "None": 0,
    "Norm": 100,
    ">200": 200,
    ">300": 300
}
df['max_glu_serum'] = df['max_glu_serum'].map(max_glu_serums)

In [13]:
A1CResult_map = {
    "None": 0,
    "Norm": 5,
    ">7": 7,
    ">8": 8
}
df['A1Cresult'] = df['A1Cresult'].map(A1CResult_map)

In [14]:
#converting binary variables into -1 or 1
df['change'] = np.where(df['change']=='No',-1,1)
df['diabetesMed'] = np.where(df['diabetesMed']=='No',-1,1)

In [15]:
drug_codes = {
    "No": -20,
    "Down": -10, 
    "Steady": 0,
    "Up": 10    
}
drugs = ['metformin','glipizide','glyburide', 'pioglitazone', 'rosiglitazone','insulin'] 
for drug in drugs:
    df[drug] = df[drug].map(drug_codes)

In [16]:
## mapping diagnosis categories according to paper (else 800 plus features)
diagnosis_cols = ['diag_1', 'diag_2', 'diag_3']

for col in diagnosis_cols:
    df['tmp'] = np.nan
    df.loc[(df[col].str.contains("250")), col] = '250'
    df.loc[(df[col].str.startswith('V')) | (df[col].str.startswith('E')), col] = '-999' 

    df[col] = df[col].astype(float)
    
    #convert the correct ranges based on values given in paper
    df.loc[(((df[col] >=390) & (df[col]<=460)) | (df[col] == 785)), 'tmp'] = 'Circulatory'
    df.loc[(((df[col] >=460) & (df[col]<=519)) | (df[col] == 786)), 'tmp'] = 'Respiratory'
    df.loc[(((df[col] >=520) & (df[col]<=579)) | (df[col] == 787)), 'tmp'] = 'Digestive'
    df.loc[(((df[col] >=580) & (df[col]<=629)) | (df[col] == 788)), 'tmp'] = 'Genitourinary'
    df.loc[((df[col] >=800) & (df[col]<=999)), 'tmp'] = 'Injury'
    df.loc[((df[col] >=710) & (df[col]<=739)), 'tmp'] = 'Musculoskeletal'
    df.loc[((df[col] >=140) & (df[col]<=239)), 'tmp'] = 'Neoplasms'
    df.loc[(df[col] == 250), 'tmp'] = 'Diabetes'
    
    df['tmp'].fillna(value = "Other", inplace=True)
    
    df[col] = df['tmp']
    df.drop(columns=['tmp'], inplace=True)
    

In [17]:
## admission_source_id
df['tmp'] = np.nan
col = 'admission_source_id'
df.loc[((df[col].between(4,6)) | (df[col] == 10) | (df[col] == 18) | (df[col] == 22) | (df[col].between(25,26))), 'tmp'] = "Transfer_Source"
df.loc[df[col].between(1,3), 'tmp'] = "Referral_Source"
df.loc[((df[col].between(11,14))| (df[col].between(23,24))), 'tmp'] = "Birth_Source"
df.loc[df[col] == 7, 'tmp'] = "Emergency_Source"
df.loc[((df[col] == 8) | (df[col]==19)), 'tmp'] = "Other"
        
df['tmp'].fillna(value = "Unknown", inplace=True)
df[col] = df['tmp']
df.drop(columns=['tmp'], inplace=True)


##mapping admission type_id
df['tmp'] = np.nan
col = 'admission_type_id'
df.loc[df[col] == 1, 'tmp'] = 'Emergency_Type'
df.loc[df[col] == 2, 'tmp'] = 'Urgent_Type'
df.loc[df[col] == 3, 'tmp'] = 'Elective_Type'
df.loc[df[col] == 7, 'tmp'] = 'Trauma_Type'
df.loc[df[col] == 4, 'tmp'] = 'Newborn_Type'

df['tmp'].fillna(value = "Unknown", inplace=True)
df[col] = df['tmp']
df.drop(columns=['tmp'], inplace=True)


In [18]:
def one_hot_encoder(df, cols):
    """one-hot encoding function for all our categorical columns"""
    
    for col in cols:
        if("admission" in col):
            dummies = pd.get_dummies(df[col], drop_first=False)
        else:
            dummies = pd.get_dummies(df[col], prefix=col, drop_first=False)
        df = pd.concat([df, dummies], axis=1)   
        df.drop([col],axis=1, inplace=True)
    return df

In [19]:
#one-hot encoding 
categorical_columns = [col for col in df.columns if df[col].dtype == np.dtype(object)]
df = one_hot_encoder(df, categorical_columns)
df.columns = map(str.lower, df.columns)

#train-test-split
target_variable = 'readmitted'
Y_feature = df[target_variable]
X_features = df.drop(columns=[target_variable])
X_train, X_test, y_train, y_test = train_test_split(X_features,Y_feature, test_size=0.2, random_state = 42)

In [20]:
# normalize of numerical columns
mm_scaler = MinMaxScaler()
X_train = pd.DataFrame(mm_scaler.fit_transform(X_train), columns = X_train.columns) 
X_test = pd.DataFrame(mm_scaler.fit_transform(X_test), columns = X_test.columns)

In [24]:
y_train = y_train.values.reshape(y_train.shape[0],1)
y_test = y_test.values.reshape(y_test.shape[0],1)

In [30]:
from copy import deepcopy
class NeuralNet:

    def __init__(self, X_train, y_train, h=4):
        #np.random.seed(1)
        # h represents the number of neurons in the hidden layers
        self.X = X_train
        self.y = y_train

        # Find number of input and output layers from the dataset
        input_layer_size = self.X.shape[0]
        
        
        self.output_layer_size = 1

        # assign random weights to matrices in network
        # number of weights connecting layers = (no. of nodes in previous layer) x (no. of nodes in following layer)
        self.W_hidden = 2 * np.random.random((h, input_layer_size)) - 1
        self.Wb_hidden = 2 * np.random.random((h,1)) - 1

        self.W_output = 2 * np.random.random((self.output_layer_size,h)) - 1
        self.Wb_output = np.ones((self.output_layer_size,1))

        self.deltaOut = np.zeros((self.output_layer_size, 1))
        self.deltaHidden = np.zeros((h, 1))
        self.h = h
            

    def __activation(self, x, activation):
        if activation == "sigmoid":
            self.__sigmoid(self, x)
        elif activation == "tanh":
            self.__tanh(self,x)
        elif activation == "relu":
            self.__relu(self,x)
     

    def __activation_derivative(self, x, activation):
        if activation == "sigmoid":
            self.__sigmoid_derivative(self, x)
        elif activation == "tanh":
            self.__tanh_derivative(self,x)
        elif activation == "relu":
            self.__relu_derivative(self,x)

    def __sigmoid(self, x):
        return 1 / (1 + np.exp(-x))
    
    def __tanh(self, x):
        return np.tanh(x)
    
    def __relu(self, x):
        return np.maximum(0, x)

    def __sigmoid_derivative(self, x):
        return x * (1 - x)
    
    def __tanh_derivative(self, x):
        return 1-(np.tanh(x))**2
    
    def __relu_derivative(self,x):
        return (x>0)*1


    # Below is the training function
    def train(self, activation, max_iterations=100, learning_rate=0.00001, momentum = 0.90):
        
        update_weight_output, update_weight_output_b, update_weight_hidden, update_weight_hidden_b = 0,0,0,0
        
        for iteration in range(max_iterations):
            out = self.forward_pass(activation)
            
            error = 0.5 * np.power((out - self.y), 2)
            
            
            self.past_delta = [deepcopy(update_weight_output),
                                        deepcopy(update_weight_output_b),
                                        deepcopy(update_weight_hidden),
                                        deepcopy(update_weight_hidden_b)]
            
            
            self.backward_pass(out, activation)
            
            update_weight_output = learning_rate * (1-momentum) * np.dot(self.deltaOut,self.X_hidden.T) + momentum*self.past_delta[0]
            
            update_weight_output_b = learning_rate * (1-momentum) * np.dot(self.deltaOut, np.ones((np.size(self.X, 1), 1))) + momentum*self.past_delta[1]
            
            update_weight_hidden = learning_rate * (1-momentum)* np.dot(self.deltaHidden,self.X.T) + momentum*self.past_delta[2]
            
            update_weight_hidden_b = learning_rate * (1-momentum)* np.dot(self.deltaHidden,np.ones((np.size(self.X, 1), 1))) + momentum*self.past_delta[3]

            self.W_output += update_weight_output
            self.Wb_output += update_weight_output_b
            self.W_hidden += update_weight_hidden
            self.Wb_hidden += update_weight_hidden_b
            

        print("After " + str(max_iterations) + " iterations, the total error is " + str(np.average(np.sum(error))))
#         print("The final weight vectors are (starting from input to output layers) \n" + str(self.W_hidden))
#         print("The final weight vectors are (starting from input to output layers) \n" + str(self.W_output))

#         print("The final bias vectors are (starting from input to output layers) \n" + str(self.Wb_hidden))
#         print("The final bias vectors are (starting from input to output layers) \n" + str(self.Wb_output))

    def forward_pass(self, activation):
        # pass our inputs through our neural network
        in_hidden = np.dot(self.W_hidden, self.X) + self.Wb_hidden

        if activation == "sigmoid":
            self.X_hidden = self.__sigmoid(in_hidden)
        elif activation == "tanh":
            self.X_hidden = self.__tanh(in_hidden)
        elif activation == "relu":
            self.X_hidden = self.__relu(in_hidden)

        in_output = np.dot(self.W_output, self.X_hidden) + self.Wb_output
        
        # output 
        if activation == "sigmoid":
            out = self.__sigmoid(in_output)
        elif activation == "tanh":
            out = self.__tanh(in_output)
        elif activation == "relu":
            out = self.__relu(in_output)
        return out

    def backward_pass(self, out, activation):
        # pass our inputs through our neural network
        self.compute_output_delta(out, activation)
        self.compute_hidden_delta(activation)
        


    def compute_output_delta(self, out, activation):
        if activation == "sigmoid":
            delta_output = (self.y - out) * (self.__sigmoid_derivative(out))
        elif activation == "tanh":
            delta_output = (self.y - out) * (self.__tanh_derivative(out))
        elif activation == "relu":
            delta_output = (self.y - out) * (self.__relu_derivative(out))

        self.deltaOut = delta_output

    def compute_hidden_delta(self, activation):
        
        if activation == "sigmoid":
            delta_hidden_layer = (self.W_output.T.dot(self.deltaOut)) * (self.__sigmoid_derivative(self.X_hidden))
        elif activation == "tanh":
            delta_hidden_layer = (self.W_output.T.dot(self.deltaOut)) * (self.__tanh_derivative(self.X_hidden))
        elif activation == "relu":
            delta_hidden_layer = (self.W_output.T.dot(self.deltaOut)) * (self.__relu_derivative(self.X_hidden))
        
        self.deltaHidden = delta_hidden_layer


    def predict(self, X_test, y_test, activation):
        print("inside predict")
        predict_hidden = np.dot(self.W_hidden, X_test) + self.Wb_hidden
        
        self.X_hidden = self.__relu(predict_hidden)
        
        if(activation == "sigmoid"):
            self.X_hidden = self.__sigmoid(predict_hidden)
        elif(activation=="relu"):
            self.X_hidden = self.__relu(predict_hidden)
        elif(activation == "tanh"):
            self.X_hidden = self.__tanh(predict_hidden)
        
        predict_output = np.dot(self.W_output, self.X_hidden) + self.Wb_output
        
        if(activation == "sigmoid"):
            out = self.__sigmoid(predict_output)
        elif(activation=="relu"):
            out = self.__relu(predict_output)
        elif(activation == "tanh"):
            out = self.__tanh(predict_output)
        
        
        error = 0.5 * np.power((out - y_test), 2)
        print(f"Error on Test Dataset is {np.sum(error)}")
        return out
    


In [26]:
nn_model = NeuralNet(X_train.iloc[:,0:9].T,y_train.T, h=20)
nn_model.train(activation="relu")


for 

Error for iteration 0 is 9503.728045576492
Error for iteration 1 is 9503.72770436775
Error for iteration 2 is 9503.727056257449
Error for iteration 3 is 9503.72613227508
Error for iteration 4 is 9503.724960470958
Error for iteration 5 is 9503.72356619035
Error for iteration 6 is 9503.721972324678
Error for iteration 7 is 9503.720199541183
Error for iteration 8 is 9503.718266492631
Error for iteration 9 is 9503.716190008381
Error for iteration 10 is 9503.713985268332
Error for iteration 11 is 9503.71166596102
Error for iteration 12 is 9503.709244427222
Error for iteration 13 is 9503.706731790218
Error for iteration 14 is 9503.704138073886
Error for iteration 15 is 9503.701472309665
Error for iteration 16 is 9503.698742633393
Error for iteration 17 is 9503.695956372907
Error for iteration 18 is 9503.69312012725
Error for iteration 19 is 9503.69023983827
Error for iteration 20 is 9503.687320855288
Error for iteration 21 is 9503.684367993501
Error for iteration 22 is 9503.68138558671
Error

Error for iteration 186 is 9502.459309471782
Error for iteration 187 is 9502.443484823876
Error for iteration 188 is 9502.427708848854
Error for iteration 189 is 9502.411988507336
Error for iteration 190 is 9502.396330134927
Error for iteration 191 is 9502.380737312444
Error for iteration 192 is 9502.365213199468
Error for iteration 193 is 9502.349760576708
Error for iteration 194 is 9502.334381884168
Error for iteration 195 is 9502.31772599215
Error for iteration 196 is 9502.300155963121
Error for iteration 197 is 9502.282463530262
Error for iteration 198 is 9502.264674096477
Error for iteration 199 is 9502.24681046185
Error for iteration 200 is 9502.228893078487
Error for iteration 201 is 9502.210940281042
Error for iteration 202 is 9502.192968495048
Error for iteration 203 is 9502.174992425213
Error for iteration 204 is 9502.155902958364
Error for iteration 205 is 9502.135964258583
Error for iteration 206 is 9502.11587553844
Error for iteration 207 is 9502.095670929813
Error for ite

In [27]:
predictions = nn_model.predict(X_test.iloc[:,0:9].T,y_test.T,activation="tanh")
predictions = np.around(predictions, 0).astype(np.int32)

inside predict
Error on Test Dataset is 4544.888536219123
