# Deep Learning and Time Since TB Infection in Macaques

I am going to apply deep learning algorithms to the monkey data. I need to:
- Transfer over the files for middle and late infection: the microarray data in one file and the clinical data in another, restricted to those monkeys.
- Set up a training and test set.
    - I want 3 latent and 3 active monkeys in the test set.
- Before I set up a 10-fold cross-validation scheme, it is okay to first check whether I can get a model to train on the training set. To start with good practice, I want to train only on the training set rather than on the whole dataset.

Current Progress/Questions:
 - I learned that batch normalization was causing my training bugs in the MLP model. The IRIS dataset helped me determine this. One task is to learn why batch normalization was causing these problems in Keras. I will postpone this for now.
 - Just start applying MLP models to the monkey data, to see if they can be trained.
 - Set up systematic 10-fold cross-validation experiments on the monkey data. Stratify folds by TB status. Search over hyperparameters, perhaps with a Python script rather than a notebook.
 - Go through Jeremy Howard's code from his structured data lecture to learn how he used Keras.
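
The hyperparameter search in the plan above could be driven from a plain Python script. A minimal sketch, assuming a hypothetical grid: the learning rate 0.00004 and the dropout values 0.0, 0.5 and 0.8 appear later in these notes, while the other values and the name `iter_configs` are placeholders.

```python
from itertools import product

# Hypothetical search grid; only lr=0.00004 and the dropout values
# come from these notes, everything else is a placeholder.
GRID = {
    "lr": [0.00004, 0.0004],
    "dropout": [0.0, 0.5, 0.8],
    "batch_size": [5, 32],
}

def iter_configs(grid):
    """Yield one dict per combination of hyperparameter values."""
    names = sorted(grid)
    for values in product(*(grid[name] for name in names)):
        yield dict(zip(names, values))

configs = list(iter_configs(GRID))
print(len(configs))   # 2 * 3 * 2 combinations
print(configs[0])
```

Each yielded dict could then be passed to a model-building function and scored with cross-validation.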

## Read in the data

In [1]:
import pandas as pd
import numpy as np
import keras
from keras import backend as K
from keras.utils.data_utils import get_file
from keras.utils import np_utils
from keras.utils.np_utils import to_categorical
from keras.models import Sequential, Model
from keras.layers import Input, Embedding, Reshape, merge, LSTM, Bidirectional
from keras.layers import TimeDistributed, Activation, SimpleRNN, GRU
from keras.layers.core import Flatten, Dense, Dropout, Lambda
from keras.regularizers import l2, l1
from keras.layers.normalization import BatchNormalization
from keras.optimizers import SGD, RMSprop, Adam
#from keras.utils.layer_utils import layer_from_config
from keras.metrics import categorical_crossentropy, categorical_accuracy
from keras.layers.convolutional import *
from keras.preprocessing import image, sequence
from keras.preprocessing.text import Tokenizer

from keras.wrappers.scikit_learn import KerasClassifier

from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold
from sklearn.preprocessing import LabelEncoder
from sklearn.pipeline import Pipeline

Using Theano backend.


In [2]:
path  = "/master/rault/TB"
data_path = path + "/data"

In [3]:
%cd $data_path
%ls 

pheno = pd.read_table("Monkey_PhenoData_middle-late.txt")
expres = pd.read_table("Monkey_Processed_ExpressionData_middle-late.txt")
#Monkey_PhenoData_middle-late.txt
#Monkey_Processed_ExpressionData_middle-late.txt

/master/rault/TB/data
Monkey_PhenoData_middle-late.txt
Monkey_Processed_ExpressionData_middle-late.txt


## Make a Train and Test Set

In [4]:
# Set seed to be consistent
import random
random.seed(100)

# select the latent monkeys
latent_monkeys = pheno.loc[pheno["clinical.status"] == "Latent"]["monkeyid"].tolist()

# select the active monkeys
active_monkeys = pheno.loc[pheno["clinical.status"] == "Active"]["monkeyid"].tolist()

# set(latent_monkeys) & set(active_monkeys) #-> They are correctly disjoint

# Randomly select 3 latent monkeys
test_latent_monkeys = random.sample(latent_monkeys, 3)

# randomly select 3 active monkeys
test_active_monkeys = random.sample(active_monkeys, 3)

test_monkeys = test_latent_monkeys + test_active_monkeys

# remove these monkeys from the training set  and put in a test set (both the clinical variables and the expression)
train_pheno = pheno.loc[pheno["monkeyid"].isin(set(pheno["monkeyid"]) - set(test_monkeys))]
test_pheno = pheno.loc[pheno["monkeyid"].isin(test_monkeys)]

#set(train_pheno["monkeyid"]) & set(test_pheno["monkeyid"]) #-> They are correctly disjoint

train_exprs = expres[expres.index.isin(list(train_pheno.index))]
test_exprs = expres[expres.index.isin(list(test_pheno.index))]

train_exprs = train_exprs.astype(float)
test_exprs = test_exprs.astype(float)

# Convert to NumPy arrays; .values replaces DataFrame.as_matrix(),
# which was deprecated and later removed from pandas
train_exprs = train_exprs.values
test_exprs = test_exprs.values
#X = dataset[:,0:4].astype(float)
# set(test_exprs.index) & set(train_exprs.index) #-> They are correctly disjoint
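
The split above keeps every sample from a held-out monkey out of the training set. The same idea as a small standalone sketch on toy data; the function name `split_by_monkey` and the toy IDs are hypothetical and not taken from the dataset.

```python
import random

def split_by_monkey(monkey_ids, statuses, n_per_status=3, seed=100):
    """Return (train_idx, test_idx) with whole monkeys held out.

    monkey_ids[i] and statuses[i] describe sample i. For each status,
    n_per_status monkeys are drawn at random for the test set.
    """
    rng = random.Random(seed)
    test_monkeys = set()
    for status in sorted(set(statuses)):
        # sorted() makes the draw reproducible for a given seed
        monkeys = sorted({m for m, s in zip(monkey_ids, statuses) if s == status})
        test_monkeys.update(rng.sample(monkeys, n_per_status))
    test_idx = [i for i, m in enumerate(monkey_ids) if m in test_monkeys]
    train_idx = [i for i, m in enumerate(monkey_ids) if m not in test_monkeys]
    return train_idx, test_idx

# Toy data: 8 monkeys (4 latent, 4 active), 2 samples each.
monkey_ids = [f"M{i // 2}" for i in range(16)]
statuses = ["Latent" if i < 8 else "Active" for i in range(16)]
train_idx, test_idx = split_by_monkey(monkey_ids, statuses, n_per_status=1)

train_monkeys = {monkey_ids[i] for i in train_idx}
held_out = {monkey_ids[i] for i in test_idx}
print(sorted(held_out), len(train_idx), len(test_idx))
```

Checking that `train_monkeys` and `held_out` are disjoint is the same sanity check as the commented-out set intersections in the cell above.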


In [5]:
#training_set.index
train_set[train_set.index.isin(['GSM2227796'])]  # This somehow works! so can subset by the rows in this way.

NameError: name 'train_set' is not defined

In [6]:
test_exprs

array([[ 3.39462752,  4.48762888,  5.96935506, ...,  3.87268578,
         4.32225507,  6.96685288],
       [ 4.12352933,  4.15348654,  5.53672825, ...,  5.24086738,
         5.91572326,  6.71413883],
       [ 3.39462752,  3.42627115,  5.63217959, ...,  3.87268578,
         3.54491484,  7.23525856],
       ..., 
       [ 3.25470823,  3.90686403,  5.60626114, ...,  3.98340472,
         6.18424082,  6.8451059 ],
       [ 3.59531892,  4.6930189 ,  5.71572579, ...,  5.22307736,
         5.31778927,  6.75539068],
       [ 3.25470823,  4.58647098,  6.24496345, ...,  4.99904663,
         6.18319178,  6.72783263]])

## Prepare the data for loading into keras

This website from Jason Brownlee has an excellent tutorial on loading data with pandas and then using it with Keras. I can use his code to help me.

https://machinelearningmastery.com/multi-class-classification-tutorial-keras-deep-learning-library/

In [9]:

encoder = LabelEncoder()
encoder.fit(train_pheno["time.period"])
encoded_Y_train = encoder.transform(train_pheno["time.period"])
encoded_Y_test = encoder.transform(test_pheno["time.period"])

monkey_encoder = LabelEncoder()
monkey_encoder.fit(pheno["monkeyid"])
enc_monkey_train = monkey_encoder.transform(train_pheno["monkeyid"])
enc_monkey_test = monkey_encoder.transform(test_pheno["monkeyid"])

# One-hot encoding
train_Y = np_utils.to_categorical(encoded_Y_train)
test_Y = np_utils.to_categorical(encoded_Y_test)


In [10]:
print(encoded_Y_train)
print(train_pheno["time.period"])
print(encoded_Y_test)
print(test_pheno["time.period"])
print(train_Y)
print(train_pheno["monkeyid"])
print(enc_monkey_train)

[1 0 1 0 1 0 0 1 0 1 0 1 0 1 0 1 0 0 1 0 1 1 1 1 1 1 0 1 0 1 0 1 0 0 0 1 0
 0 0 0 0 0 1 1 0 0 1 1 1 0 0 1 0 1 1 1 0 1 0 0 0 0 0 1 1 0 0 1 1 1 0 1 1 1
 0 1 1 0 1 1 1 1 1 1 1 1 1 0 1 1 0 1 0 0 0 0 1 0 0 1 0 1 0 1 1 0 1 0 0 1 1
 1 1 1 0 0 0 0 0 0 0 1 0 1 1 1 0 0 1 1 0 0 1 0 0 0 1 0 0 0 0 0 1 1 0 1 1 0
 0 0 1 0 1 1 1 1 0 0 1 0 0 1 1 1 0 1 0 1 1 0 0 0 0 1 0 0 0 1 1 0 0 1 1 0 0
 1 1 0 1 0 1 1 1 1 0 0 0 1 0 0 1 0 1 1 1 1 0 0 1 0 1 0 0 1 1 1 0 0 0 1 1 0
 0 0 1 0 1 1 1 1 1 0 1 0 1 0 0 1 1 1 1 1 1 1 1 0]
GSM2227796    middle
GSM2227797      late
GSM2227799    middle
GSM2227800      late
GSM2227801    middle
GSM2227805      late
GSM2227806      late
GSM2227807    middle
GSM2227808      late
GSM2227809    middle
GSM2227810      late
GSM2227812    middle
GSM2227814      late
GSM2227815    middle
GSM2227816      late
GSM2227818    middle
GSM2227820      late
GSM2227823      late
GSM2227825    middle
GSM2227827      late
GSM2227828    middle
GSM2227832    middle
GSM2227834    middle
GSM2227835    mid

# Set up 10-fold cross-validation, stratifying by monkey

Tutorials to help me with grid searching:
https://machinelearningmastery.com/grid-search-hyperparameters-deep-learning-models-python-keras/
https://machinelearningmastery.com/use-keras-deep-learning-models-scikit-learn-python/

I need to look up how to do a k-fold split that is stratified by class (middle vs. late infection) and also keeps each monkey within a single fold.
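
One way to keep monkeys intact is to assign whole monkeys, rather than individual samples, to folds. The stdlib sketch below does only that and prints the per-fold class counts so the balance can be inspected; it does not enforce stratification. `group_folds` and the toy data are hypothetical. scikit-learn's `GroupKFold`, and `StratifiedGroupKFold` in newer releases, cover the same need.

```python
import random
from collections import Counter, defaultdict

def group_folds(groups, n_splits=10, seed=7):
    """Assign every sample to a fold so that no group spans two folds.

    Groups are shuffled, then placed largest-first into whichever
    fold currently holds the fewest samples.
    """
    members = defaultdict(list)
    for idx, g in enumerate(groups):
        members[g].append(idx)
    names = sorted(members)
    random.Random(seed).shuffle(names)
    # Stable sort: groups of equal size keep their shuffled order.
    names.sort(key=lambda g: len(members[g]), reverse=True)
    folds = [[] for _ in range(n_splits)]
    for g in names:
        smallest = min(folds, key=len)
        smallest.extend(members[g])
    return folds

# Toy data: 20 monkeys with 4 samples each, alternating middle/late.
groups = [f"M{i // 4}" for i in range(80)]
labels = ["middle" if i % 2 == 0 else "late" for i in range(80)]
folds = group_folds(groups, n_splits=10)

for k, fold in enumerate(folds):
    print(k, len(fold), Counter(labels[i] for i in fold))
```

If every monkey contributes both middle and late samples, the class balance follows from the grouping; otherwise the printed counts show where it breaks down.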

In [51]:
estimator = KerasClassifier(build_fn=baseline_model, epochs=200, batch_size=5, verbose=0)
kfold = KFold(n_splits=10, shuffle=True, random_state=seed)
results = cross_val_score(estimator, X, dummy_y, cv=kfold) # , verbose=3)
print("Baseline: %.2f%% (%.2f%%)" % (results.mean()*100, results.std()*100))

(246, 9050)

In [9]:
p = 0.0

model = Sequential([
    #BatchNormalization(input_shape=train_exprs.shape[1:]),
    Dense(5000, activation="relu", input_shape=train_exprs.shape[1:]),
    #BatchNormalization(),
    Dropout(p),
    Dense(500, activation="relu"),
    #BatchNormalization(),
    Dropout(p),
    Dense(50, activation="relu"),
    #BatchNormalization(),
    Dropout(p),
    Dense(10, activation="relu"),
    #BatchNormalization(),
    Dropout(p),
    Dense(1, activation="sigmoid")
])

The 5000 -> 500 -> 50 -> 10 -> 1 model with no dropout:
 - With no minibatches and Adam at lr=0.00004, it reached 80% accuracy on the training set in 156 epochs and 84% at 200 epochs. There was substantial oscillation in between, but accuracy mostly stayed around 80%.
 - I expect it to do poorly on the test set because of overfitting, perhaps 65% accuracy, maybe worse.
 - WOW! I was dead wrong! It gives 75% accuracy on the test set and 85.78% final accuracy on the training set. I have a high variance problem, but this is creeping up toward 80% accuracy! I think it may be possible to hit 80% accuracy with deep learning, with other optimized machine learning approaches (like gradient boosted machines), or with pathway feature engineering.
     - The test data come from monkeys set aside completely at random, stratified by latent and active TB, so this estimate is valid.
 - For model refinement I need to start doing cross-validation. It is more expensive computationally, but it will let me get a final unbiased estimate.
 - The next step is to start with cross-validation.
 

In [11]:
#lr=0.00004 # Learning rate was 0.00004 in my one working keras code
model.compile(Adam(lr=0.00004), loss='binary_crossentropy', metrics=['accuracy'])
model.fit(train_exprs, [[y] for y in encoded_Y_train], batch_size=train_exprs.shape[0], epochs=200)
# I think encoded_Y_train should not be passed as it is

#model.fit(train_exprs, encoded_Y_train, validation_data = (test_exprs, test_Y), batch_size=train_exprs.shape[0], epochs=30)
# I was getting problems from train_exprs being a pandas object. Probably can learn how to change that closer to 


#da_dis_model = Sequential(get_my_layers(p))
#da_dis_model.compile(optimizer=Adam(lr=0.001),
#             loss="categorical_crossentropy",
#             metrics=['accuracy'])

#da_dis_model.fit(da_conv_feat, da_trn_labels, batch_size=batch_size, nb_epoch=2, 
 #                   validation_data=(conv_val_feat, val_labels))

Epoch 1/200
...
Epoch 200/200


<keras.callbacks.History at 0x7f387750a978>

In [18]:
print(model.predict_classes(train_exprs))
# So the dense 2 activation just predicts all one class for the test data
#print(model.predict(test_exprs))

# It predicts all the same class for both. How can that be?
print("Now what is ground truth for training data")
print(encoded_Y_train)

# I still don't understand what the model is outputting. The predictions from model.predict don't seem to match the labels I gave; they are not strict 0/1 predictions.
# Something is definitely messed up, and I don't know what it is. But if random forest can get 70% accuracy, then I should be able to get something.

from sklearn.metrics import confusion_matrix, accuracy_score
print(confusion_matrix(encoded_Y_train, model.predict_classes(train_exprs, verbose=0)))
print(accuracy_score(encoded_Y_train, model.predict_classes(train_exprs, verbose=0)))

print(confusion_matrix(encoded_Y_test, model.predict_classes(test_exprs, verbose=0)))
print(accuracy_score(encoded_Y_test, model.predict_classes(test_exprs, verbose=0)))

# WOW! FIRST TRY! THIS IS 75% accuracy! 


#sklearn.metrics.confusion_matrix(y_true, y_pred, labels=None, sample_weight=None)

 [0]
 [1]
 [0]
 [1]
 [0]
 [0]
 [1]
 [0]
 [1]
 [0]
 [1]
 [0]
 [1]
 [0]
 [1]
 [0]
 [0]
 [1]
 [0]
 [1]
 [0]
 [1]
 [1]
 [1]
 [0]
 [1]
 [0]
 [0]
 [1]
 [0]
 [1]
 [0]
 [1]
 [0]
 [1]
 [0]
 [0]
 [0]
 [0]
 [0]
 [1]
 [1]
 [0]
 [0]
 [0]
 [1]
 [1]
 [1]
 [1]
 [0]
 [0]
 [0]
 [1]
 [1]
 [0]
 [1]
 [1]
 [1]
 [0]
 [0]
 [0]
 [0]
 [1]
 [1]
 [0]
 [0]
 [1]
 [1]
 [1]
 [0]
 [1]
 [1]
 [1]
 [0]
 [1]
 [0]
 [0]
 [0]
 [1]
 [1]
 [1]
 [1]
 [0]
 [1]
 [1]
 [1]
 [0]
 [1]
 [1]
 [0]
 [1]
 [0]
 [0]
 [0]
 [0]
 [1]
 [0]
 [0]
 [1]
 [1]
 [1]
 [0]
 [1]
 [1]
 [0]
 [1]
 [0]
 [0]
 [1]
 [1]
 [1]
 [1]
 [1]
 [0]
 [0]
 [1]
 [0]
 [0]
 [0]
 [0]
 [1]
 [0]
 [1]
 [1]
 [1]
 [0]
 [0]
 [0]
 [1]
 [0]
 [0]
 [0]
 [1]
 [0]
 [0]
 [1]
 [0]
 [0]
 [0]
 [1]
 [1]
 [0]
 [1]
 [0]
 [1]
 [1]
 [0]
 [0]
 [0]
 [1]
 [0]
 [1]
 [0]
 [0]
 [1]
 [0]
 [0]
 [0]
 [0]
 [0]
 [1]
 [1]
 [1]
 [0]
 [1]
 [0]
 [1]
 [1]
 [0]
 [0]
 [0]
 [0]
 [1]
 [0]
 [0]
 [0]
 [1]
 [1]
 [0]
 [0]
 [1]
 [1]
 [0]
 [0]
 [1]
 [1]
 [0]
 [1]
 [0]
 [0]
 [1]
 [1]
 [0]
 [0]
 [0]
 [0]
 [1]
 [0]
 [0]
 [1]


The model is definitely predicting all 0's for the output, so there is still a problem with how I am passing in the data: Keras reports 100% accuracy when the accuracy is not in fact 100%. I need to figure this out. Basically everything I learned today was incorrect because there is a bug in how my outputs are mapped to inputs. Maybe if I just put the labels in a list comprehension it will work correctly.
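
A sanity check for this kind of mismatch is to recompute accuracy directly from the predicted probabilities rather than trusting the number printed during fitting. A sigmoid output is a probability, so it has to be thresholded (0.5 here, an assumption) before it can be compared with 0/1 labels. A minimal sketch with made-up numbers; `probs` stands in for the flattened output of `model.predict`.

```python
def threshold(probs, cutoff=0.5):
    """Turn sigmoid probabilities into hard 0/1 class predictions."""
    return [1 if p > cutoff else 0 for p in probs]

def confusion(y_true, y_pred):
    """2x2 confusion matrix, rows = true class, columns = predicted class."""
    matrix = [[0, 0], [0, 0]]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix

def accuracy(y_true, y_pred):
    """Fraction of predictions that equal the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Made-up values for illustration only.
probs = [0.10, 0.45, 0.49, 0.30, 0.20, 0.48]
y_true = [0, 1, 0, 1, 0, 1]

y_pred = threshold(probs)
print(y_pred)                     # every sample lands in class 0
print(confusion(y_true, y_pred))
print(accuracy(y_true, y_pred))
```

If the recomputed accuracy disagrees with the value printed during fitting, the two numbers are not measuring the same thing, and that is worth tracking down before interpreting either one.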

In [83]:
model.fit(train_exprs, train_Y, batch_size=train_exprs.shape[0], epochs=10)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7f4173b1e828>

Guess what! The following model architecture fit wonderfully: I can fit the training data perfectly. Now we will see if I can fit the test data. As a first step I can use a validation split, maybe 80-20, just to see.

Also, is my batch normalization helping the training? I remember training taking forever before in R. Okay, batch normalization in the middle layers does speed up training a bit, but the model still trains reliably. When I don't apply batch normalization on the INITIAL layer, though, the model doesn't go anywhere from the beginning, and I have to fiddle with the learning rate. Starting at 1e-6, then going to 0.001, and then to 0.00001 (when 0.001 didn't really budge) went okay. Thus the initial batch normalization (i.e. input normalization) was HUGELY critical in getting the model to fit easily, and the batch normalizations in the middle sped up training.
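
An initial batch-normalization layer acts roughly like standardizing each input feature. A more conventional way to get a similar effect, sketched here as an alternative rather than what this notebook ran, is to standardize with the training-set mean and standard deviation and reuse those statistics on the test set. The helper names are hypothetical.

```python
from statistics import mean, pstdev

def fit_scaler(rows):
    """Per-column mean and (population) std from the training rows."""
    columns = list(zip(*rows))
    means = [mean(col) for col in columns]
    # Guard against constant columns, which would divide by zero.
    stds = [pstdev(col) or 1.0 for col in columns]
    return means, stds

def apply_scaler(rows, means, stds):
    """Standardize rows with statistics computed elsewhere."""
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in rows]

train = [[3.0, 10.0], [5.0, 10.0], [7.0, 10.0]]
test = [[5.0, 12.0]]

means, stds = fit_scaler(train)
train_scaled = apply_scaler(train, means, stds)
test_scaled = apply_scaler(test, means, stds)   # uses train statistics only
print(train_scaled)
print(test_scaled)
```

Computing the statistics on the training rows only keeps information from the test set out of the preprocessing.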

ALSO, REMEMBER! IN CROSS VALIDATION, I IDEALLY NEED TO SEPARATE ACCORDING TO MONKEY, NOT JUST RANDOMLY, SO RANDOM IS NOT GOING TO WORK. But we can try anyway

Okay, with an 80-20 validation split (among samples, not monkeys), I get 60% accuracy on validation even though the training data is fit completely. Therefore, huge overfitting. Let's add dropout to see what happens.
 - 0.8 dropout totally killed my ability to train.
 - 0.5 dropout gets to 91% accuracy in 30 epochs with 80% of the training set, but validation accuracy does not change in any epoch.

Now, using my test data as my validation data, just to start out:
With 0.5 dropout, in 30 epochs I get 91.55% accuracy on the full training set and 50% accuracy on the test set at every epoch. It is totally training on noise. What if I lower the complexity of the model?

One hidden layer with 5000 hidden units gets 98.78% accuracy on the training set in 30 epochs, with no budge on test (50% accuracy). I wonder if the data is somehow wrong or randomized. I get the same result with just 10 hidden units. I am going to see if random forest works, as I know it works in R.
The model that fit well at first:
model = Sequential([
    BatchNormalization(input_shape=train_exprs.shape[1:]), # this line needs work
    Dense(5000, activation="relu"),
    BatchNormalization(),
    Dropout(p),
    Dense(500, activation="relu"),
    BatchNormalization(),
    Dropout(p),
    Dense(50, activation="relu"),
    BatchNormalization(),
    Dropout(p),
    Dense(10, activation="relu"),
    BatchNormalization(),
    Dropout(p),
    Dense(2, activation="softmax")
])

# Sanity check: Try RandomForest with R default parameters (expect 70% test accuracy)

This code runs so fast! A lot faster than in R on my computer. The RandomForests classifier trained on the full training set and used to predict on the full test set obtains 72.9% accuracy. Therefore, my data is intact. I don't know why 

In [102]:
from sklearn.ensemble import RandomForestClassifier

# 500 trees, sqrt(n_features) candidate features per split, out-of-bag scoring enabled
clf = RandomForestClassifier(n_estimators=500, oob_score=True, bootstrap=True, max_features="sqrt")
clf.fit(train_exprs, encoded_Y_train)

# One importance value per expression feature
print(clf.feature_importances_)

[  1.10455459e-04   3.21039856e-05   5.36269603e-05 ...,   3.87808318e-04
   2.47868704e-04   3.68127795e-04]


In [107]:
from sklearn.metrics import confusion_matrix, accuracy_score
test_pred = clf.predict(test_exprs)
print(confusion_matrix(encoded_Y_test, test_pred)) 
print(accuracy_score(encoded_Y_test, test_pred)) 

[[17  7]
 [ 6 18]]
0.729166666667
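
The accuracy follows directly from the confusion matrix printed above: the correctly classified samples sit on the diagonal. A short worked check, with rows as true classes and columns as predicted classes, following scikit-learn's convention:

```python
# Confusion matrix reported for the random forest on the test set.
cm = [[17, 7],
      [6, 18]]

total = sum(sum(row) for row in cm)
correct = cm[0][0] + cm[1][1]
accuracy = correct / total

# Per-class recall: fraction of each true class that was recovered.
recall_class0 = cm[0][0] / sum(cm[0])
recall_class1 = cm[1][1] / sum(cm[1])

print(total, correct)             # 48 35
print(round(accuracy, 4))         # 0.7292
print(round(recall_class0, 4), round(recall_class1, 4))  # 0.7083 0.75
```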


# Debugging the incorrect loss display of keras with the monkey data

## Keras shows increasing accuracy on the training set when it predicts all of one class at the end of training on the training set.

### To debug this I am just going to try to do standard keras with the IRIS dataset, another structured dataset
### I am using code from Jason Brownlee found at:

https://machinelearningmastery.com/multi-class-classification-tutorial-keras-deep-learning-library/

In [2]:

import numpy
import pandas
from keras.models import Sequential
from keras.layers import Dense
from keras.wrappers.scikit_learn import KerasClassifier
from keras.utils import np_utils
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold
from sklearn.preprocessing import LabelEncoder
from sklearn.pipeline import Pipeline

Using Theano backend.


In [2]:



# fix random seed for reproducibility
seed = 7
numpy.random.seed(seed)

In [3]:

# load dataset
dataframe = pandas.read_csv("http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data", header=None)
#dataframe = pandas.read_csv("iris.csv", header=None)
dataset = dataframe.values
X = dataset[:,0:4].astype(float)
Y = dataset[:,4]


In [4]:
# encode class values as integers
encoder = LabelEncoder()
encoder.fit(Y)
encoded_Y = encoder.transform(Y)
# convert integers to dummy variables (i.e. one hot encoded)
dummy_y = np_utils.to_categorical(encoded_Y)

In [15]:
def baseline_model():
	# create model
	model = Sequential()
	model.add(Dense(8, input_dim=4, activation='relu'))
	model.add(Dense(3, activation='softmax'))
	# Compile model
	model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
	return model

### At first I will do his cross-validation code just to reproduce what he did. Then I will do it without cross-validation. Though cross-validation may be the way to go to really show whether my model is working correctly or not.

In [12]:
estimator = KerasClassifier(build_fn=baseline_model, epochs=200, batch_size=5, verbose=0)



In [13]:

kfold = KFold(n_splits=10, shuffle=True, random_state=seed)

In [16]:
dummy_y

array([[ 1.,  0.,  0.],
       [ 1.,  0.,  0.],
       [ 1.,  0.,  0.],
       [ 1.,  0.,  0.],
       [ 1.,  0.,  0.],
       [ 1.,  0.,  0.],
       [ 1.,  0.,  0.],
       [ 1.,  0.,  0.],
       [ 1.,  0.,  0.],
       [ 1.,  0.,  0.],
       [ 1.,  0.,  0.],
       [ 1.,  0.,  0.],
       [ 1.,  0.,  0.],
       [ 1.,  0.,  0.],
       [ 1.,  0.,  0.],
       [ 1.,  0.,  0.],
       [ 1.,  0.,  0.],
       [ 1.,  0.,  0.],
       [ 1.,  0.,  0.],
       [ 1.,  0.,  0.],
       [ 1.,  0.,  0.],
       [ 1.,  0.,  0.],
       [ 1.,  0.,  0.],
       [ 1.,  0.,  0.],
       [ 1.,  0.,  0.],
       [ 1.,  0.,  0.],
       [ 1.,  0.,  0.],
       [ 1.,  0.,  0.],
       [ 1.,  0.,  0.],
       [ 1.,  0.,  0.],
       [ 1.,  0.,  0.],
       [ 1.,  0.,  0.],
       [ 1.,  0.,  0.],
       [ 1.,  0.,  0.],
       [ 1.,  0.,  0.],
       [ 1.,  0.,  0.],
       [ 1.,  0.,  0.],
       [ 1.,  0.,  0.],
       [ 1.,  0.,  0.],
       [ 1.,  0.,  0.],
       [ 1.,  0.,  0.],
       [ 1.,  0.

In [19]:
results = cross_val_score(estimator, X, dummy_y, cv=kfold, verbose=3)
print("Baseline: %.2f%% (%.2f%%)" % (results.mean()*100, results.std()*100))

[CV]  ................................................................
[CV] ...................................... , score=1.0, total=   3.2s
[CV]  ................................................................


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    3.2s remaining:    0.0s


[CV] ....................... , score=0.9333333373069763, total=   3.3s
[CV]  ................................................................


[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    6.5s remaining:    0.0s


[CV] ...................................... , score=1.0, total=   3.3s
[CV]  ................................................................
[CV] ...................................... , score=1.0, total=   3.1s
[CV]  ................................................................
[CV] ...................................... , score=1.0, total=   3.3s
[CV]  ................................................................
[CV] ...................................... , score=1.0, total=   3.3s
[CV]  ................................................................
[CV] ...................................... , score=1.0, total=   3.1s
[CV]  ................................................................
[CV] ....................... , score=0.9333333373069763, total=   3.3s
[CV]  ................................................................
[CV] ....................... , score=0.9333333373069763, total=   3.3s
[CV]  ................................................................
[CV] .

[Parallel(n_jobs=1)]: Done  10 out of  10 | elapsed:   32.3s finished


### There appears to be nothing wrong with Keras and scikit-learn, as I was able to run this prediction correctly. The next step is to break the IRIS dataset up into a training set and a small test set, as I did before, then use the same training and validation code, then predict on training and predict on test.

### If this works, then I need to copy this code line by line into my own code and retry it. If that doesn't work, then I should go straight to 10-fold cross-validation on my training set.

In [1]:
from sklearn.model_selection import train_test_split

In [7]:
X_train, X_test, y_train, y_test = train_test_split(X, dummy_y, test_size=0.33, random_state=42, stratify=dummy_y)

In [10]:
print(X.shape)
print(X_train.shape)
print(X_test.shape)
print(dummy_y.shape)
print(y_train.shape)
print(y_test.shape)

(150, 4)
(100, 4)
(50, 4)
(150, 3)
(100, 3)
(50, 3)


In [96]:
p = 0.0

model = Sequential([
    BatchNormalization(input_shape=X_train.shape[1:]),
    Dense(5000, activation="relu", input_shape=X_train.shape[1:]),
    #BatchNormalization(),
    #Dropout(p),
    Dense(500, activation="relu"),
    #BatchNormalization(),
    #Dropout(p),
    Dense(50, activation="relu"),
    #BatchNormalization(),
    #Dropout(p),
    Dense(10, activation="relu"),
    #BatchNormalization(),
    #Dropout(p),
    Dense(3, activation="softmax")
])

In [97]:
#lr=0.001
model.compile(Adam(lr=0.001), loss='categorical_crossentropy', metrics=['accuracy'])

In [98]:
model.fit(X_train, y_train, batch_size=X_train.shape[0], epochs=300, validation_data = (X_test, y_test), verbose=1)

# Same problem where training loss goes down with increased accuracy, but validation accuracy doesn't change. Let's see how it predicts things

Train on 100 samples, validate on 50 samples
Epoch 1/300
...
Epoch 300/300


<keras.callbacks.History at 0x7f95f929fa20>

In [43]:
def baseline_model():
	# create model
	model = Sequential()
	model.add(Dense(8, input_dim=4, activation='relu'))
	model.add(Dense(3, activation='softmax'))
	# Compile model
	model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
	return model
model = baseline_model()
model.fit(X_train, y_train, batch_size=5, epochs=200, validation_data = (X_test, y_test))

Train on 100 samples, validate on 50 samples
Epoch 1/200
...
Epoch 200/200


<keras.callbacks.History at 0x7f961dc7bc88>

In [None]:
print(model.predict_classes(X_train))
print(y_train)

 model.predict is giving all of one class. This doesn't make sense given the accuracy reported during training.
 What if I simplify the model?

 With Jason Brownlee's model I am still predicting mostly one class after 9 epochs.

 After 500 epochs I get to 83% accuracy on training, 82% accuracy on test.

 After 1000 epochs, it actually didn't do so well, only 57% accuracy, maybe from something stochastic.

 After 5000 epochs, it gets to 98% training accuracy, 100% test accuracy. Let's see what it looks like on predicting. Yes, it predicts totally correctly.

 5000 epochs seems like a long time. What if I decrease the batch size?

 Yes, a batch size of 5 instead of the whole training set did allow me to train quicker (in just 200 epochs, 94% training accuracy, 100% test accuracy).

 - I guess that's part of the benefit of minibatching.
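
Part of the explanation may simply be the number of weight updates: one update is applied per batch, so a smaller batch size means more updates per epoch. Counting them for the two runs above (100 training samples):

```python
import math

def n_updates(n_samples, batch_size, epochs):
    """Total gradient updates: one per batch, ceil(n / batch) batches per epoch."""
    return math.ceil(n_samples / batch_size) * epochs

full_batch = n_updates(n_samples=100, batch_size=100, epochs=5000)
mini_batch = n_updates(n_samples=100, batch_size=5, epochs=200)

print(full_batch)   # 5000
print(mini_batch)   # 4000
```

The two runs end up with a similar number of updates, which is consistent with (though not proof of) the update count mattering more than the epoch count here.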

NOw, I still do not know why my previous model can get 100% training accuracy but not actually train well. Might it have to do with the batch normalization?

So I repeat my previous error of high training accuracy, no validation accuracy, but predicts one class on training and test. Now I will remove batch normalization.

Removing the first batch normalization allowed me to predict all samples as belonging to one of two classes. But I still have the training loss bug of high accuracy but actually poor prediction. I have no idea why this is occuring. I am going to put back intial batch normalization and remove all batch normalization in the middle.

Having first batch normalization and removing all middle batch normalizations did not remove the bug of high training accuracy in model fit output but still just predicting 1 class at prediction time. Going 50 epochs with this model the first time did eventually increase validation score.

Removing only the dropout lines did not remove the bug. How about I remove all batch normalization etc., but still have a deep network?

Just dense layers and the bug is gone. I have 66% accuracy on both training and test, and this is reflected in 1 class correct and all others predicted a different class. Can I go up all the way with more epochs (and this is with the batch size equal to the full dataset, by the way)?

50 epochs with the full network, I get up to near 100% accuracy on both sets. I want to try 50 epochs on Jason's network

What are the results of these experiments? It looks preliminarily that having several layers speeds up training when there are no minibatches. I do not know yet if it speeds up training relative to the shallow network when batch size is 5. That is something to try.

I hypothesize now that something with batch normalization is causing the bug of apparent high training accuracy from the output but actually poor training accuracy when I predict on the currently trained model on the training set. This may just be some code bug. The validation set accurately depicts the error. Therefore, I could potentially still use such a network as long as I use a confusion matrix to compare it with other networks. To test this hypothesis, I just need to add back in all batch normalization to get the error.
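
One way to check for the mismatch directly is to recompute accuracy from the model's predictions instead of trusting the number printed during fitting. This is a minimal sketch, assuming one-hot labels like `y_train` and integer class predictions like those returned by `model.predict_classes`. `accuracy_from_predictions` is a hypothetical helper, and the arrays below are toy values, not results from this notebook.

```python
import numpy as np

def accuracy_from_predictions(y_onehot, predicted_classes):
    """Fraction of samples whose predicted class index matches the one-hot label."""
    true_classes = np.argmax(y_onehot, axis=1)
    return float(np.mean(true_classes == np.asarray(predicted_classes)))

# Toy example: 6 samples, 3 balanced classes, one-hot encoded
toy_y = np.eye(3)[[0, 1, 2, 0, 1, 2]]

# A model that always predicts class 0 is right for only a third of the samples
toy_pred_one_class = np.zeros(6, dtype=int)
print(accuracy_from_predictions(toy_y, toy_pred_one_class))

# A model that predicts every sample correctly
toy_pred_perfect = np.array([0, 1, 2, 0, 1, 2])
print(accuracy_from_predictions(toy_y, toy_pred_perfect))
```

With the real model this would be called as `accuracy_from_predictions(y_train, model.predict_classes(X_train, verbose=False))` and compared with the training accuracy reported by `model.fit`.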

So add back in first normalization, add back in last and add in all.
Add in all with 50 epochs, still have the bug and the model doesn't converge yet.
Remove first batch normalization, still just predicts 3 classes.

Remove all batch normalization, gets to 95% accuracy in 8 epochs, 98-100% accuracy in 50.

Now just first batch normalization, predicting 1 of 2 classes, 66% accuracy by epoch 7 and it doesn't change by epoch 50. There is no mismatch bug in accuracy. Can we go more epochs and get it to converge? Yes, after 300 epochs we get convergence with near 100% accuracy on training and validation. It may be that the bug is still present, because the training accuracy went high initially (basically 100%) but the test accuracy took a lot longer: 243 epochs until it was 96%.

The end message is that batch normalization, as I am using it, hurts training on the IRIS dataset. Maybe it is bad for structured data in general. Perhaps I should look to Jeremy Howard's structured dataset network for inspiration on structured data.

I still don't know what the origin of the bug is, but it is clearly related to batch normalization, especially in the middle layers. Maybe I am using it incorrectly with the Theano backend? I should look this up, but in the meantime I can try with the monkey data with no batch normalization.
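
One candidate explanation, not verified against this notebook's Keras version: batch normalization normalizes with the current batch's statistics during training but with slowly updated moving averages at prediction time. With full-batch training there is only one moving-average update per epoch, so the two modes can disagree for a long time. The NumPy sketch below illustrates the mechanism only. The momentum (0.99), epsilon (1e-3) and the zero/one initial moving statistics are assumptions, and the data are toy values.

```python
import numpy as np

EPS = 1e-3        # assumed epsilon
MOMENTUM = 0.99   # assumed moving-average momentum

def bn_train(x):
    """Training mode: normalize with the statistics of the current batch."""
    return (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + EPS)

def bn_infer(x, moving_mean, moving_var):
    """Inference mode: normalize with the stored moving averages."""
    return (x - moving_mean) / np.sqrt(moving_var + EPS)

rng = np.random.RandomState(0)
x = rng.normal(loc=5.0, scale=2.0, size=(100, 4))  # toy features, not the IRIS data

# Moving averages start at mean 0 / variance 1 (assumed) and get one update per batch
moving_mean, moving_var = np.zeros(4), np.ones(4)
for _ in range(10):  # 10 epochs of full-batch training means only 10 updates
    moving_mean = MOMENTUM * moving_mean + (1 - MOMENTUM) * x.mean(axis=0)
    moving_var = MOMENTUM * moving_var + (1 - MOMENTUM) * x.var(axis=0)

train_out = bn_train(x)
infer_out = bn_infer(x, moving_mean, moving_var)
print("training-mode mean: ", train_out.mean(axis=0).round(2))
print("inference-mode mean:", infer_out.mean(axis=0).round(2))
```

In this toy run the same inputs are centered near 0 in training mode but far from 0 in inference mode, so later layers would see very different activations at prediction time. If this is what happens in the real model, more updates (smaller batches or more epochs) would let the moving averages catch up, which would be consistent with the late convergence noted above.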

In [None]:
from sklearn.preprocessing import LabelBinarizer
# Fit on the predicted class indices so lb.inverse_transform can map one-hot rows back to class indices.
# Caveat: lb.classes_ only contains classes the model actually predicted.
lb = LabelBinarizer()
lb.fit(model.predict_classes(X_train))
lb.classes_
lb.transform([1, 2, 0])

In [39]:
from sklearn.metrics import confusion_matrix
#sklearn.metrics.confusion_matrix(y_true, y_pred, labels=None, sample_weight=None)

In [99]:
# Predicted class indices for the training set
print(model.predict_classes(X_train, verbose=False))

# True class indices, recovered from the one-hot labels
print(lb.inverse_transform(y_train))
# In each matrix, rows are true classes and columns are predicted classes
print("Training Confusion Matrix")
print(confusion_matrix(lb.inverse_transform(y_train), model.predict_classes(X_train, verbose=False)))
print("Test Confusion Matrix")
print(confusion_matrix(lb.inverse_transform(y_test), model.predict_classes(X_test, verbose=False)))

[0 2 0 1 2 2 2 0 2 2 1 2 1 1 0 0 2 2 0 2 0 2 2 0 2 1 0 1 2 1 2 0 1 0 1 2 0
 2 0 2 1 1 2 0 1 1 2 0 1 0 2 1 2 0 1 1 2 1 1 1 0 1 0 0 2 0 1 1 0 2 0 0 0 2
 0 2 2 0 0 0 2 1 0 0 2 2 1 1 1 1 2 0 1 0 2 2 1 2 1 2]
[0 2 0 1 2 2 2 0 2 2 1 2 1 1 0 0 2 2 0 2 0 2 2 0 2 1 0 1 2 1 2 0 1 0 1 2 0
 2 0 2 1 1 2 0 1 1 2 0 1 0 2 1 2 0 1 1 1 1 1 1 0 1 0 0 2 0 1 1 0 2 0 0 0 2
 0 2 2 0 0 0 2 1 0 0 2 1 1 1 1 1 2 0 1 0 2 2 1 2 1 2]
Training Confusion Matrix
[[33  0  0]
 [ 0 31  2]
 [ 0  0 34]]
Test Confusion Matrix
[[16  1  0]
 [ 0 16  1]
 [ 0  0 16]]
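
The accuracies implied by these two matrices can be recomputed directly from the counts. A small sketch, with the matrices copied from the output above; `overall_accuracy` and `per_class_recall` are helper names introduced here for illustration.

```python
import numpy as np

# Confusion matrices copied from the printed output (rows: true class, columns: predicted)
train_cm = np.array([[33, 0, 0],
                     [0, 31, 2],
                     [0, 0, 34]])
test_cm = np.array([[16, 1, 0],
                    [0, 16, 1],
                    [0, 0, 16]])

def overall_accuracy(cm):
    """Correct predictions (the diagonal) divided by all samples."""
    return np.trace(cm) / cm.sum()

def per_class_recall(cm):
    """For each true class, the fraction of its samples predicted correctly."""
    return np.diag(cm) / cm.sum(axis=1)

print("train accuracy:", overall_accuracy(train_cm))
print("test accuracy: ", overall_accuracy(test_cm))
print("train recall per class:", per_class_recall(train_cm).round(3))
print("test recall per class: ", per_class_recall(test_cm).round(3))
```

This gives 98/100 = 0.98 on the training set and 48/50 = 0.96 on the test set, with every error involving class 1 or class 2.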


In [None]:
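from sklearn.model_selection import train_test_split

# Hold out a third of the samples; dummy_y holds the one-hot labels
X_train, X_test, y_train, y_test = train_test_split(
    X, dummy_y, test_size=0.33, random_state=42)

# Shapes of the full, training and test arrays (features first, then labels)
for arr in (X, X_train, X_test, dummy_y, y_train, y_test):
    print(arr.shape)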

(150, 4)
(100, 4)
(50, 4)
(150, 3)
(100, 3)
(50, 3)
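
The split above is random, so class counts in the training set can drift slightly (the training confusion matrix shows 33/33/34). Since the plan is to stratify by TB status on the monkey data, here is a NumPy-only sketch of a stratified split. `stratified_split_indices` is a hypothetical helper, and the labels are a toy stand-in with 50 samples per class.

```python
import numpy as np

def stratified_split_indices(labels, test_fraction, seed=42):
    """Pick the same fraction of every class for the test set."""
    rng = np.random.RandomState(seed)
    labels = np.asarray(labels)
    train_idx, test_idx = [], []
    for cls in np.unique(labels):
        # Shuffle the indices of this class, then cut off the test share
        idx = rng.permutation(np.where(labels == cls)[0])
        n_test = int(round(test_fraction * len(idx)))
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return np.sort(train_idx), np.sort(test_idx)

labels = np.repeat([0, 1, 2], 50)  # toy labels: 3 classes, 50 samples each
train_idx, test_idx = stratified_split_indices(labels, test_fraction=1 / 3)

print("train counts per class:", np.bincount(labels[train_idx]))
print("test counts per class: ", np.bincount(labels[test_idx]))
```

scikit-learn's `train_test_split` also accepts a `stratify` argument that serves the same purpose, which is probably the simpler route in practice.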


In [None]:
# This is from my State Farm distracted driver code
import random
random.seed(100)   # So subjects selected are consistent
# Sort the unique subjects so random.sample draws from a fixed order
b = sorted(set(a['subject']) - {'p072'})   # {'p072'}, not set('p072'), which is a set of characters
subs_val = random.sample(b, 3)   # Decided on 3 drivers with further consultation from Jeremy Howard's notebook
print("Validation subjects: " + ', '.join(subs_val))

# Build 'classname/img' relative paths, then keep the rows for the validation subjects
a['val.file'] = a[['classname', 'img']].apply(lambda x: '/'.join(x), axis=1)
tab_val = a.loc[a['subject'].isin(subs_val)]
val_files = tab_val['val.file'].tolist()
val_files[0:2]