## Neural net
Implemented in Keras, with a lot of help from scikit-learn.
We first train a medium-sized net and see that it instantly overfits. We then get similar and more stable results with a tiny net.

In [None]:
import numpy as np # linear algebra
np.set_printoptions(precision=0, suppress=True)
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

from subprocess import check_output
print(check_output(["ls", "../input"]).decode("utf8"))

from sklearn.preprocessing import Imputer, StandardScaler
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.manifold import TSNE
from sklearn.metrics import confusion_matrix, roc_curve, roc_auc_score

import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
from keras.models import Sequential, Model
from keras.optimizers import Adam
from keras.regularizers import l2
from keras.layers import Dense, BatchNormalization, Dropout
from keras.callbacks import ModelCheckpoint, ReduceLROnPlateau
import keras.backend as K

## Load the data

In [None]:
with open('../input/diabetes.csv') as f:
    print('\n'.join(f.readline().split(',')[:-1]))

In [None]:
raw_data = np.loadtxt('../input/diabetes.csv', skiprows=1, delimiter=',')
raw_feat = raw_data[:,:-1]
raw_labels = raw_data[:,-1] 

## Preprocessing

In [None]:
x_train, x_test, y_train, y_test = train_test_split(raw_feat, raw_labels, 
                                                    test_size=0.3, random_state=700)

In [None]:
imp = Imputer(missing_values = 0) #replace zero values by mean
clean_feat = raw_feat.copy()
clean_feat[:,1:] = imp.fit_transform(raw_feat[:,1:]) #We don't want to do this for pregnancies
print(raw_feat[:8,4], '\n', clean_feat[:8,4])

In [None]:
#We must do this again for the train/test split: fitting the imputer on all data
#would leak test-set information into the column means.
x_train = x_train.copy()
x_train[:,1:] = imp.fit_transform(x_train[:,1:])
x_test[:,1:] = imp.transform(x_test[:,1:])
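
The leakage argument in the cell above can be illustrated without scikit-learn. The following is a minimal NumPy sketch, assuming zeros mark missing values as in this notebook. The toy arrays and the helper names `fit_column_means` and `impute` are hypothetical, introduced only for illustration.

```python
import numpy as np

def fit_column_means(x, missing=0.0):
    """Column means computed over the non-missing entries only."""
    mask = x != missing
    counts = mask.sum(axis=0)
    sums = np.where(mask, x, 0.0).sum(axis=0)
    return sums / np.maximum(counts, 1)  # guard against all-missing columns

def impute(x, means, missing=0.0):
    """Replace missing entries by the supplied column means."""
    return np.where(x == missing, means, x)

# Toy data (hypothetical): zeros stand for missing measurements
toy_train = np.array([[100.0, 30.0], [0.0, 20.0], [120.0, 0.0]])
toy_test = np.array([[0.0, 40.0], [90.0, 0.0]])

train_means = fit_column_means(toy_train)        # statistics from train only
toy_train_clean = impute(toy_train, train_means)
toy_test_clean = impute(toy_test, train_means)   # test reuses train statistics
print(train_means)
print(toy_test_clean)
```

The test rows never contribute to `train_means`, which is the property that `imp.fit_transform` on the training set followed by `imp.transform` on the test set is meant to guarantee.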

In [None]:
scaler = StandardScaler() #scale to zero mean and unit variance
clean_feat = scaler.fit_transform(clean_feat)
x_train = scaler.fit_transform(x_train)
x_test = scaler.transform(x_test)

This time we just do PCA and t-SNE for some visualization. We could also use the PCA data for training, but with the neural nets here, it doesn't really affect performance.

In [None]:
pca = PCA(n_components=7)
clean_pca = pca.fit_transform(clean_feat)
plt.scatter(clean_pca[:,0], clean_pca[:,1], color=np.where(raw_labels>0.5,'r','g'))

In [None]:
tsne = TSNE(n_iter=3000)
tsne_data = tsne.fit_transform(clean_pca)
plt.scatter(tsne_data[:,0], tsne_data[:,1], color=np.where(raw_labels>0.5,'r','g'))

In [None]:
pca_train = pca.fit_transform(x_train)
pca_test = pca.transform(x_test)

We can now choose whether to train our nets on original or PCA data. 

## Train the model
First we choose a medium-sized neural net. We have trained this on original data, as PCA does not make a huge difference. Instead, the problem is that it soon starts to overfit.

In [None]:
model = Sequential()
model.add(Dense(32, activation='relu', input_dim = x_train.shape[1]))
model.add(Dense(64, activation='relu'))
model.add(BatchNormalization())
model.add(Dropout(0.5))
model.add(Dense(256, activation='relu'))
model.add(Dropout(0.75))
model.add(Dense(1, activation='sigmoid'))

In [None]:
model.compile(optimizer=Adam(1e-4), loss='binary_crossentropy', metrics=['accuracy'])
hist = model.fit(x_train, y_train, validation_data=(x_test, y_test), 
          epochs=50, verbose=0, batch_size=8)
print(model.evaluate(x_train, y_train, verbose=0))
print(model.evaluate(x_test, y_test, verbose=0))

In [None]:
plt.plot(hist.history['loss'], color='b')
plt.plot(hist.history['val_loss'], color='r')
plt.show()
plt.plot(hist.history['acc'], color='b')
plt.plot(hist.history['val_acc'], color='r')
plt.show()

We achieve about 77% test accuracy on average, although this varies with the random train/test split and the random initialization of the net weights. Training longer will not help, as the net is already starting to overfit. Overfitting is expected, as the net has almost 20,000 parameters and we train on fewer than 5,000 data values. Getting more data would be very helpful.

We can't expect to achieve much better on this dataset - the original article reports 76% accuracy.
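
The parameter and data counts quoted above can be checked by hand. This is a rough sketch, assuming 8 input features, 768 rows in `diabetes.csv`, and a 70% training share. Those figures are assumptions about the dataset, not values printed by the notebook.

```python
def dense_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out

n_features = 8  # assumption: number of feature columns
layer_params = [
    dense_params(n_features, 32),
    dense_params(32, 64),
    4 * 64,                   # BatchNormalization: gamma, beta, moving mean, moving variance
    dense_params(64, 256),    # Dropout layers add no parameters
    dense_params(256, 1),
]
total_params = sum(layer_params)

n_rows = 768                     # assumption: rows in the dataset
n_train = int(0.7 * n_rows)      # roughly the 70% training share
train_values = n_train * n_features

print(total_params, train_values)
```

With far more parameters than training values, the net has enough capacity to memorize the training set, which matches the overfitting seen in the loss curves.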

## Visualization

In [None]:
y_true = (raw_labels + 0.5).astype("int")
y_pred = (model.predict(clean_feat) + 0.5).astype("int").reshape(-1,)
color = np.where(y_true * y_pred == 1, 'g', 'r')
color[np.where(y_true * (1-y_pred) == 1)[0]] = 'b'
color[np.where(y_pred * (1-y_true) == 1)[0]] = 'y'

In the plots below, blue is false-negatives, yellow false-positives, green is where the net correctly predicts positive, and red where the net correctly predicts negative. With a perfect net, it would be all green and red.

In [None]:
for i in range(4):
    for j in range(i+1,4):
        plt.scatter(clean_pca[:,i], clean_pca[:,j], color=color)
        plt.show()

Overall, we see that the net has been rather conservative, with classification close to linear. 

## Overfitting
If we continue to train, we can get over 95% training set accuracy, but poor generalization.  In fact, let's forget about the test set and train on our full data.

In [None]:
model.compile(optimizer=Adam(3e-4), loss='binary_crossentropy', metrics=['accuracy'])
hist = model.fit(clean_feat, raw_labels, 
          epochs=500, verbose=0, batch_size=8)
#Evaluate on the same full data set that we just trained on
print(model.evaluate(clean_feat, raw_labels, verbose=0))

We now plot the results of the overfitted net.

In [None]:
y_true = (raw_labels + 0.5).astype("int")
y_pred = (model.predict(clean_feat) + 0.5).astype("int").reshape(-1,)
color = np.where(y_true * y_pred == 1, 'g', 'r')
color[np.where(y_true * (1-y_pred) == 1)[0]] = 'b'
color[np.where(y_pred * (1-y_true) == 1)[0]] = 'y'

In [None]:
for i in range(4):
    for j in range(i+1,4):
        plt.scatter(clean_pca[:,i], clean_pca[:,j], color=color)
        plt.show()

As expected, it is an almost perfect (green and red) fit, with only a few blue and yellow dots.

## Small is beautiful
Let's finish this by training a tiny net, to see if we can avoid overfitting. In addition to reducing the size, I have also added both L2 and Dropout regularization.

In [None]:
model = Sequential()
model.add(Dense(8, activation='relu', kernel_regularizer=l2(.1), input_dim = x_train.shape[1]))
model.add(Dense(8, activation='relu', kernel_regularizer=l2(.05)))
model.add(BatchNormalization())
model.add(Dense(16, activation='relu'))
model.add(Dropout(0.75))
model.add(Dense(1, activation='sigmoid'))

These three additions are not strictly necessary, but could improve performance slightly.

In [None]:
#To keep the best weights, in case of overfitting
callback1 = ModelCheckpoint('tiny.h5', monitor='val_loss', 
                           save_best_only=True, save_weights_only=True)

#To reduce learning rate over time
callback2 = ReduceLROnPlateau(monitor='val_loss', factor=0.5, patience=100, 
                              mode='min', epsilon=0.05, min_lr=1e-8)

#To adjust for the fact y_train has twice as many 0's as 1's
#Does not seem to improve results in this case
#sample_weight = (1 + y_train) 

In [None]:
# Label smoothing, sometimes improves net training by avoiding 
# areas where the activation function has flat gradient.
y_train = 0.9 * y_train + 0.05  
y_test = 0.9 * y_test + 0.05

In [None]:
def float_accuracy(y_true, y_pred):
    """
    Equivalent to Keras' built-in binary_accuracy, but can be used with label smoothing.
    """
    return K.mean(K.equal(K.round(y_true), K.round(y_pred)), axis=-1)
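
Why rounding both arguments matters can be seen on a few numbers. This is a small NumPy sketch with hypothetical predictions. It assumes that the built-in metric compares the labels directly with the rounded predictions, which is how `binary_accuracy` was defined in Keras versions of this era.

```python
import numpy as np

toy_labels = np.array([0.0, 1.0, 1.0, 0.0])
toy_smoothed = 0.9 * toy_labels + 0.05       # labels become 0.05 and 0.95
toy_preds = np.array([0.2, 0.7, 0.4, 0.1])   # hypothetical sigmoid outputs

# Direct comparison: a smoothed label never equals a rounded prediction
naive_acc = np.mean(toy_smoothed == np.round(toy_preds))

# Rounding both sides first recovers the 0/1 classes, as in float_accuracy
float_acc = np.mean(np.round(toy_smoothed) == np.round(toy_preds))

print(naive_acc, float_acc)
```

The direct comparison reports zero accuracy regardless of how good the predictions are, while the rounded version counts three of the four toy predictions as correct.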

In [None]:
model.compile(optimizer=Adam(1e-4), loss='binary_crossentropy', metrics=[float_accuracy])

hist = model.fit(x_train, y_train, validation_data=(x_test, y_test),
                 epochs=1000, verbose=0, batch_size=8, 
                 callbacks = [callback1, callback2])
print(model.evaluate(x_train, y_train, verbose=0))
print(model.evaluate(x_test, y_test, verbose=0))

In [None]:
plt.plot(hist.history['loss'], color='b')
plt.plot(hist.history['val_loss'], color='r')
plt.show()
plt.plot(hist.history['float_accuracy'], color='b')
plt.plot(hist.history['val_float_accuracy'], color='r')
plt.show()

Even with the regularizations, the net will overfit if trained long enough. Fortunately we saved the best weights with 'ModelCheckpoint':

In [None]:
model.load_weights('tiny.h5')
print(model.evaluate(x_train, y_train, verbose=0))
print(model.evaluate(x_test, y_test, verbose=0))
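
The idea behind keeping only the best weights can be sketched in a few lines of plain Python. The validation losses below are hypothetical, and the loop is a simplification of what `ModelCheckpoint` with `save_best_only=True` does, not its actual implementation.

```python
# Hypothetical validation losses per epoch: improving, then overfitting
toy_val_loss = [0.70, 0.62, 0.58, 0.55, 0.57, 0.61, 0.66]

best_loss = float("inf")
best_epoch = None
for epoch, loss in enumerate(toy_val_loss):
    if loss < best_loss:
        # This is the point where the weights would be written to disk
        best_loss, best_epoch = loss, epoch

print("best epoch:", best_epoch, "val_loss:", best_loss)
```

Later epochs with a higher validation loss never overwrite the stored weights, so loading the file afterwards restores the net as it was before overfitting set in.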

## Sensitivity / specificity trade-off
We first create a confusion matrix of our result.

In [None]:
threshold = 0.5
y_true = np.where(y_test > 0.5, 1, 0).astype("int")
y_pred = np.where(model.predict(x_test) > threshold, 1, 0).astype("int").reshape(-1,)
cm = confusion_matrix(y_true, y_pred)
pos = np.sum(cm[1]) #number of actual positives (row 1; row 0 holds the actual negatives)
print(cm)

Remember that a confusion matrix is defined as:

[true negatives, false positives]

[false negatives, true positives]
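
The layout above can be reproduced by hand on toy labels. The following is a minimal NumPy sketch. The helper name `binary_confusion` and the toy arrays are hypothetical, and the rows index the actual class while the columns index the predicted class.

```python
import numpy as np

def binary_confusion(y_true, y_pred):
    """2x2 count matrix: rows = actual class, columns = predicted class."""
    cm = np.zeros((2, 2), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm

toy_true = np.array([0, 0, 0, 1, 1, 1, 1, 0])
toy_pred = np.array([0, 1, 0, 1, 0, 1, 1, 0])

cm_toy = binary_confusion(toy_true, toy_pred)
tn, fp, fn, tp = cm_toy.ravel()

sensitivity = tp / (tp + fn)   # share of actual positives that are detected
specificity = tn / (tn + fp)   # share of actual negatives that are cleared
print(cm_toy)
print(sensitivity, specificity)
```

Sensitivity is computed from the second row and specificity from the first row, which is why the row order matters when reading the matrix.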

What if we wanted to avoid false positives, even if that led to a higher proportion of false-negatives? The easy way is just to change the cutoff:

In [None]:
threshold = 0.6
y_true = np.where(y_test > 0.5, 1, 0).astype("int")
y_pred = np.where(model.predict(x_test) > threshold, 1, 0).astype("int").reshape(-1,)
cm = confusion_matrix(y_true, y_pred)
print(cm)

Scikit-learn has methods for doing this automatically, returning all thresholds and corresponding parameters. We here draw two graphs that can be directly compared to the end page of the original article from which this dataset is taken: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2245318/pdf/procascamc00018-0276.pdf

In [None]:
y_score = model.predict(x_test)
fpr, tpr, thresholds = roc_curve(y_true, y_score)
plt.plot(thresholds, 1.-fpr)
plt.plot(thresholds, tpr)
plt.show()
crossover_index = np.min(np.where(1.-fpr <= tpr))
crossover_cutoff = thresholds[crossover_index]
crossover_specificity = 1.-fpr[crossover_index]
print("Crossover at {0:.2f} with specificity {1:.2f}".format(crossover_cutoff, crossover_specificity))
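
The trade-off behind the crossover can be mimicked by hand on toy data. This sketch assumes hypothetical labels and scores, and it scans a few fixed cutoffs rather than the thresholds that `roc_curve` returns. The helper name `sens_spec` is made up for the example.

```python
import numpy as np

toy_true = np.array([0, 0, 0, 0, 1, 1, 1, 1])
toy_score = np.array([0.1, 0.2, 0.35, 0.6, 0.3, 0.55, 0.7, 0.9])

def sens_spec(y_true, y_score, cutoff):
    """Sensitivity and specificity when predicting positive above the cutoff."""
    pred = y_score > cutoff
    pos = y_true == 1
    sensitivity = float(np.mean(pred[pos]))    # positives predicted positive
    specificity = float(np.mean(~pred[~pos]))  # negatives predicted negative
    return sensitivity, specificity

results = {}
for cutoff in (0.25, 0.5, 0.65):
    results[cutoff] = sens_spec(toy_true, toy_score, cutoff)
    print(cutoff, results[cutoff])
```

Raising the cutoff lowers sensitivity and raises specificity. In this toy example the two curves cross at a cutoff of 0.5.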

In [None]:
plt.plot(fpr, tpr)
plt.show()
print("ROC area under curve is {0:.2f}".format(roc_auc_score(y_true, y_score)))

We have achieved results no worse than the 0.76 of the original article. The exact result varies quite a bit between different runs, but is usually between 0.70 and 0.80.
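
As a side note, the ROC area under the curve has a simple interpretation: it is the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as one half. The following NumPy sketch computes it on hypothetical toy scores. The helper name `auc_by_pairs` is made up for the example.

```python
import numpy as np

def auc_by_pairs(y_true, y_score):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = y_score[y_true == 1]
    neg = y_score[y_true == 0]
    diff = pos[:, None] - neg[None, :]   # every positive against every negative
    wins = np.sum(diff > 0)
    ties = np.sum(diff == 0)
    return (wins + 0.5 * ties) / diff.size

toy_true = np.array([0, 0, 0, 0, 1, 1, 1, 1])
toy_score = np.array([0.1, 0.2, 0.35, 0.6, 0.3, 0.55, 0.7, 0.9])

toy_auc = auc_by_pairs(toy_true, toy_score)
print("Toy AUC: {0:.4f}".format(toy_auc))
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation, so values between 0.70 and 0.80 indicate a useful but far from perfect classifier.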