PA9: Neural Networks with TensorFlow

In this assignment, you will:

1. Implement neural networks as a powerful approach to supervised machine learning,
2. Practice using state-of-the-art software tools and programming paradigms for machine learning,
3. Investigate how the learning parameters affect neural network performance, as evaluated on an empirical data set.

For this assignment, we will use a well-known dataset:

[Higgs](https://archive.ics.uci.edu/ml/datasets/HIGGS). Some information regarding this dataset: The data has been produced using Monte Carlo simulations. The first 21 features (columns 2-22) are kinematic properties measured by the particle detectors in the accelerator. The last seven features are functions of the first 21 features; these are high-level features derived by physicists to help discriminate between the two classes.

For local testing you will use the sample dataset provided to you with this notebook.
When submitting on EdX, your code will be evaluated on a much larger sample of this dataset.

The file format of the data set is as follows:

• The first row contains a comma-separated list of the names of the label and attributes

• Each successive row represents a single instance

• The first entry of each instance is the label to be learned, and all other entries (following the commas) are attribute values.

• All attributes are numerical i.e. real numbers.
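
The layout described above can be read with pandas roughly as follows. This is only a sketch: the helper name `load_dataset` and the three-row inline sample are hypothetical stand-ins, not part of the assignment files, and the real file has 28 attribute columns.

```python
import io

import pandas as pd

# Hypothetical three-row sample in the described layout (not real HIGGS data).
SAMPLE_CSV = """label,f1,f2,f3
1,0.5,-1.2,0.8
0,1.1,0.3,-0.4
1,0.9,0.0,2.1
"""


def load_dataset(source):
    """Read a CSV whose first row is a header and whose first column is the label."""
    frame = pd.read_csv(source)
    labels = frame.iloc[:, 0].to_numpy(dtype=float)       # first entry of each row
    attributes = frame.iloc[:, 1:].to_numpy(dtype=float)  # all remaining entries
    return attributes, labels


attributes, labels = load_dataset(io.StringIO(SAMPLE_CSV))
print(attributes.shape, labels.shape)
```

Passing a file path instead of the `io.StringIO` object should work the same way, assuming the file really does start with a header row.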

Exercise 1: 

Your goal is to complete the function below, named train_nn, which behaves as follows:

1) It should take as input six parameters:
    
    a. The path to a file containing a data set (e.g., higgs_sample.csv)
    
    b. The number of neurons to use in the hidden layer
    
    c. The learning rate to use during backpropagation
    
    d. The number of iterations to use during training
    
    e. The percentage of instances to use for a training set
    
    f. A random seed as an integer
    
For example, the call train_nn("higgs_sample.csv", 20, 0.001, 1000, 0.75, 12345) will create a neural network with 20 neurons in the hidden layer and train it using a learning rate of 0.001 and 1000 iterations through higgs_sample.csv with a random seed of 12345, where 75% of the data is used for training (and the remaining 25% for testing).
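
Parameters e and f together determine a reproducible train/test split. One way this might look, assuming a seeded NumPy permutation is an acceptable way to shuffle the instances; the helper name `split_train_test` is hypothetical and the assignment does not prescribe this method:

```python
import numpy as np


def split_train_test(attributes, labels, train_fraction, seed):
    """Shuffle row indices with a fixed seed, then cut at the training fraction."""
    rng = np.random.RandomState(seed)
    order = rng.permutation(len(labels))
    n_train = int(round(train_fraction * len(labels)))
    train_idx, test_idx = order[:n_train], order[n_train:]
    return (attributes[train_idx], labels[train_idx],
            attributes[test_idx], labels[test_idx])


# Toy data: 20 instances with 2 attributes each (illustrative only).
X = np.arange(40, dtype=float).reshape(20, 2)
y = np.arange(20, dtype=float)

X_train, y_train, X_test, y_test = split_train_test(X, y, 0.75, 12345)
print(X_train.shape, X_test.shape)
```

Because the permutation depends only on the seed, calling the helper again with the same arguments yields the same split.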

2) You should create a neural network in TensorFlow that will be learned from the training data. The key parameters of the network's architecture are determined by your input parameters and the size of your data set:
    
    a. The number of attributes in the input layer is the length of each instance’s
    attribute list (which is the same for all instances)
    
    b. The number of neurons in a hidden layer will be inputted to the program as a
    parameter. Each hidden neuron should use tf.sigmoid as its activation function.
    
    c. The number of output neurons is 1, since this is a binary classification task, and it should use tf.sigmoid as its activation function.
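
To make the shapes in this architecture concrete, here is a NumPy sketch of the forward pass (not the TensorFlow graph itself). The function names are hypothetical, and the bias terms are an assumption, since the description above does not say whether biases are required.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def init_params(n_inputs, n_hidden, seed):
    """Random weights for an (n_inputs -> n_hidden -> 1) network; biases assumed."""
    rng = np.random.RandomState(seed)
    return {
        "W1": rng.normal(size=(n_inputs, n_hidden)),
        "b1": np.zeros(n_hidden),
        "W2": rng.normal(size=(n_hidden, 1)),
        "b2": np.zeros(1),
    }


def forward(X, params):
    """Sigmoid hidden layer followed by a single sigmoid output neuron."""
    hidden = sigmoid(X @ params["W1"] + params["b1"])
    return sigmoid(hidden @ params["W2"] + params["b2"])


# 28 attributes (as in the HIGGS data) and 20 hidden neurons, for 5 random instances.
params = init_params(n_inputs=28, n_hidden=20, seed=12345)
X = np.random.RandomState(0).normal(size=(5, 28))
pred = forward(X, params)
print(pred.shape)
```

Each row of `pred` holds one value between 0 and 1, which can be read as the network's score for the positive class.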
    
3) You should use different cost/loss functions that the network tries to minimize depending on the number of labels:
    
    a. For binary classification we will use the sum of squared error:

$$SSE(X) = \sum_{j=1}^{n}({y_j - \hat{y}_j})^2$$


    The function tf.reduce_sum will allow you to sum across all instances.
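
As a small worked example of the formula, the NumPy sketch below computes the SSE for four hand-picked instances; `np.sum` plays the role that tf.reduce_sum plays in the TensorFlow graph. The values are illustrative only.

```python
import numpy as np


def sum_squared_error(y_true, y_pred):
    """SSE = sum over instances of (y_j - y_hat_j)^2."""
    return float(np.sum((y_true - y_pred) ** 2))


y_true = np.array([1.0, 0.0, 1.0, 0.0])
y_pred = np.array([0.5, 0.5, 1.0, 0.0])

# Per-instance squared errors are 0.25, 0.25, 0.0 and 0.0.
sse = sum_squared_error(y_true, y_pred)
print(sse)  # 0.5
```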
    

4) For the implementation of Backpropagation, you should use tf.train.AdamOptimizer

For more on optimizers, you may follow this link: TODO

5) You should train your network using your inputted learning rate and for the inputted number of iterations. The iterations are simply a loop that calls Backpropagation a fixed number of times.
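
The fixed-iteration loop can be illustrated without TensorFlow. The sketch below deliberately swaps in plain gradient descent for tf.train.AdamOptimizer so that every update is visible; it is not the optimizer the assignment asks for. The function name `train`, the bias terms, and the toy data are assumptions made for illustration.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def train(X, y, n_hidden, learning_rate, iterations, seed):
    """Fit a one-hidden-layer sigmoid network by plain gradient descent on SSE."""
    rng = np.random.RandomState(seed)
    W1 = rng.normal(scale=0.5, size=(X.shape[1], n_hidden))
    b1 = np.zeros(n_hidden)
    W2 = rng.normal(scale=0.5, size=(n_hidden, 1))
    b2 = np.zeros(1)
    losses = []
    for _ in range(iterations):
        # Forward pass over the whole training set.
        hidden = sigmoid(X @ W1 + b1)
        pred = sigmoid(hidden @ W2 + b2)
        losses.append(float(np.sum((y - pred) ** 2)))
        # Backward pass: gradients of SSE through both sigmoid layers.
        d_out = 2.0 * (pred - y) * pred * (1.0 - pred)
        d_hidden = (d_out @ W2.T) * hidden * (1.0 - hidden)
        # One gradient-descent step per iteration.
        W2 -= learning_rate * (hidden.T @ d_out)
        b2 -= learning_rate * d_out.sum(axis=0)
        W1 -= learning_rate * (X.T @ d_hidden)
        b1 -= learning_rate * d_hidden.sum(axis=0)
    return losses


# Toy problem: the label is 1 when the first attribute is positive.
rng = np.random.RandomState(0)
X = rng.normal(size=(20, 3))
y = (X[:, :1] > 0).astype(float)  # shape (20, 1), matching the output layer

losses = train(X, y, n_hidden=4, learning_rate=0.01, iterations=500, seed=12345)
print(losses[0], losses[-1])
```

In the TensorFlow version, the backward pass and update lines would be replaced by a single run of the optimizer's minimize op inside the same loop.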

TODOs:

- Biases?
- Mean normalize?
- How to evaluate?


In [10]:
import pandas as pd

In [19]:
## Looking at the data
input = pd.read_csv("higgs_small.csv",header=None)

In [20]:
input.describe()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,19,20,21,22,23,24,25,26,27,28
count,100000.0,100000.0,100000.0,100000.0,100000.0,100000.0,100000.0,100000.0,100000.0,100000.0,...,100000.0,100000.0,100000.0,100000.0,100000.0,100000.0,100000.0,100000.0,100000.0,100000.0
mean,0.52834,0.990371,-0.003816,-0.001638,0.995049,-0.007613,0.987114,-0.003,0.000438,0.998344,...,-0.007575,-0.004029,0.992721,1.032611,1.023153,1.050193,1.010189,0.973081,1.031873,0.959199
std,0.499199,0.561837,1.004841,1.006189,0.59536,1.006997,0.473118,1.008685,1.00843,1.027402,...,1.009173,1.00709,1.396788,0.652455,0.37161,0.164857,0.398267,0.523552,0.363394,0.313257
min,0.0,0.274697,-2.434976,-1.742508,0.001283,-1.743944,0.139976,-2.968735,-1.741237,0.0,...,-2.497265,-1.742691,0.0,0.110875,0.303144,0.133012,0.295983,0.048125,0.30335,0.350939
25%,0.0,0.590936,-0.741244,-0.868047,0.575635,-0.881465,0.676336,-0.688235,-0.867542,0.0,...,-0.725017,-0.877028,0.0,0.791306,0.846631,0.985775,0.767261,0.673792,0.81917,0.769964
50%,1.0,0.854835,-0.002976,0.000971,0.890268,-0.011024,0.892163,-2.5e-05,-0.003822,1.086538,...,-0.010455,-0.009698,0.0,0.8956,0.950719,0.989742,0.917302,0.874004,0.947037,0.871038
75%,1.0,1.236776,0.735292,0.86822,1.290871,0.865868,1.167809,0.683233,0.871223,2.173076,...,0.71077,0.869386,3.101961,1.025925,1.083218,1.020762,1.141633,1.139816,1.139032,1.057478
max,1.0,7.805887,2.433894,1.743236,7.998711,1.743229,7.064657,2.969674,1.741454,2.173076,...,2.498009,1.743372,3.101961,18.428827,10.038273,4.565248,7.442589,11.994177,7.318191,6.015647


In [7]:
import tensorflow as tf


In [13]:
## Bogus as of now
working_locally = True

In [90]:
import tensorflow as tf

filename_queue = tf.train.string_input_producer(["higgs_small.csv"])


line_reader = tf.TextLineReader()
key, csv_row = line_reader.read(filename_queue)

In [91]:
record_defaults = [[0.0]]*29
all_columns = tf.decode_csv(csv_row, record_defaults=record_defaults)

In [92]:
# Turn the features back into a tensor.
features = tf.stack(all_columns[1:])
labels = tf.stack(all_columns[0])

In [130]:
# Parameters
learning_rate = 0.01
training_epochs = 50
batch_size = 1000
display_step = 1
num_examples= 100000

# Network Parameters
n_hidden_1 = 20 # 1st layer number of features
n_hidden_2 = 15 # 2nd layer number of features
n_input = 28 
n_classes = 1 

# tf Graph input
x = tf.placeholder("float", [None, n_input])
y = tf.placeholder("float", [None, 1])



In [131]:
# Store layers weight & bias
weights = {
    'h1': tf.Variable(tf.random_normal([n_input, n_hidden_1])),
    'h2': tf.Variable(tf.random_normal([n_hidden_1, n_hidden_2])),
    'out': tf.Variable(tf.random_normal([n_hidden_2, n_classes]))
}
biases = {
    'b1': tf.Variable(tf.random_normal([n_hidden_1])),
    'b2': tf.Variable(tf.random_normal([n_hidden_2])),
    'out': tf.Variable(tf.random_normal([n_classes]))
}

In [132]:

# Create model
def multilayer_perceptron(x, weights, biases):
    # Hidden layer with SIGMOID activation
    layer_1 = tf.add(tf.matmul(x, weights['h1']), biases['b1'])
    layer_1 = tf.nn.sigmoid(layer_1)
    # Hidden layer with SIGMOID activation
    layer_2 = tf.add(tf.matmul(layer_1, weights['h2']), biases['b2'])
    layer_2 = tf.nn.sigmoid(layer_2)
    # Output layer with SIGMOID activation
    out_layer = tf.matmul(layer_2, weights['out']) + biases['out']
    out_layer_sigmoid = tf.nn.sigmoid(out_layer)
    return out_layer_sigmoid



In [133]:

# Construct model
pred = multilayer_perceptron(x, weights, biases)

# Define loss and optimizer
cost = tf.reduce_sum((y-pred)**2)
optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate).minimize(cost)

# cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=pred, labels=y))
# optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate).minimize(cost)

# Initializing the variables
init = tf.global_variables_initializer()




In [134]:
import numpy as np

In [135]:
with tf.Session() as sess:
    #tf.initialize_all_variables().run()
    sess.run(init)
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(coord=coord)

    for epoch in range(training_epochs):
        avg_cost = 0.
        total_batch = int(num_examples/batch_size)
        # Collect one batch of num_examples - 90000 = 10000 rows from the input queue

        x_batch = []
        y_batch = []
        for i in range(num_examples-90000):
            example, label = sess.run([features, labels])
            x_batch.append(example)
            y_batch.append(label)
            
        y_batch = np.asarray(y_batch,)
        # Run optimization op (backprop) and cost op (to get loss value)
        _, c = sess.run([optimizer, cost], feed_dict={x: x_batch,
                                                      y: y_batch})
        # Compute average loss
        avg_cost += c / total_batch
        # Display logs per epoch step
        if epoch % display_step == 0:
            print ("Epoch:", '%04d' % (epoch+1), "cost=", \
                "{:.9f}".format(avg_cost))
    print ("Optimization Finished!")
    coord.request_stop()
    coord.join(threads)

INFO:tensorflow:Error reported to Coordinator: <class 'tensorflow.python.framework.errors_impl.CancelledError'>, Enqueue operation was cancelled
	 [[Node: input_producer_5/input_producer_5_EnqueueMany = QueueEnqueueManyV2[Tcomponents=[DT_STRING], timeout_ms=-1, _device="/job:localhost/replica:0/task:0/device:CPU:0"](input_producer_5, input_producer_5/Const, ^input_producer_5/Assert/Assert)]]


ValueError: Cannot feed value of shape (10000,) for Tensor 'Placeholder_23:0', which has shape '(?, 1)'
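
The ValueError above reports a fed value of shape (10000,) for a placeholder of shape (?, 1). That is consistent with y_batch being built from scalar labels, which gives a 1-D array, while the placeholder y expects one column per row. One possible fix, shown here on a hypothetical four-element stand-in for y_batch, is to reshape the labels into a column before feeding them:

```python
import numpy as np

# Hypothetical stand-in for the scalar labels collected one by one in the loop.
y_batch = [1.0, 0.0, 1.0, 1.0]

labels_1d = np.asarray(y_batch)        # shape (4,): what the loop currently produces
labels_2d = labels_1d.reshape(-1, 1)   # shape (4, 1): compatible with [None, 1]

print(labels_1d.shape, labels_2d.shape)
```

In the training cell this would amount to `y_batch = np.asarray(y_batch).reshape(-1, 1)` before the `sess.run` call. Whether that alone is enough for the cell to run to completion is untested here.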