Problem 1.)
I chose to use a diamonds dataset from Kaggle and to build my NN in TensorFlow.

[Link: https://www.kaggle.com/shivam2503/diamonds ]

This dataset contains a variety of attributes for a series of diamonds, as well as their price, which is what I aim to predict. I did need to clean the data a bit, mostly because some of the values were strings instead of numbers; mapping those strings to numbers meant I didn't have to drop any data. Additionally, since I wanted to do a binary classification task, I decided to predict whether the price of the diamond was over or under $4,000 (approximately the average price).

The main tools I used were GradientTape in TensorFlow, to record and calculate the gradients at every step of my algorithm, and the Adam optimizer. I experimented with a wide variety of other tools, but these two are ultimately what got my project working.

GradientTape essentially works by recording the operations applied to the variables it watches (in our case, the weights and biases) during the forward pass, and then using that record to calculate the derivative of the loss with respect to each of them. This was an absolute necessity in making backpropagation efficient, and it streamlined my code. Unfortunately, it also meant I was unable to use my own custom loss function, as the input to my function would need to be cast to another variable type (a non-differentiable operation), so I used TensorFlow's built-in sigmoid cross-entropy instead.

[Tape: https://www.tensorflow.org/api_docs/python/tf/GradientTape ]
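
A minimal sketch of the tape in use, assuming a toy one-layer model (the data values and the names `w`, `b`, and `bce` are illustrative and not taken from my network). It suggests that a hand-written binary cross-entropy stays differentiable as long as it uses only TensorFlow ops on float tensors, while a path that goes through a comparison and an integer cast yields no gradient at all.

```python
import tensorflow as tf

# Toy data: 4 examples with 3 features each (illustrative values only).
x = tf.constant([[0.5, 1.0, -1.0],
                 [1.5, -0.5, 0.0],
                 [-1.0, 0.2, 0.3],
                 [0.0, 0.0, 1.0]], dtype=tf.float32)
y = tf.constant([[1.0], [0.0], [0.0], [1.0]], dtype=tf.float32)

# Trainable variables are watched by the tape automatically.
w = tf.Variable(tf.zeros([3, 1]), name="w")
b = tf.Variable(tf.zeros([1]), name="b")

def bce(probs, labels, eps=1e-7):
    """Hand-written binary cross-entropy built only from TensorFlow ops."""
    probs = tf.clip_by_value(probs, eps, 1.0 - eps)  # keep log() finite
    return -tf.reduce_mean(labels * tf.math.log(probs)
                           + (1.0 - labels) * tf.math.log(1.0 - probs))

# Differentiable path: every op between the variables and the loss is a
# float TensorFlow op, so the tape can trace it.
with tf.GradientTape() as tape:
    probs = tf.nn.sigmoid(tf.matmul(x, w) + b)
    loss = bce(probs, y)
grad_w, grad_b = tape.gradient(loss, [w, b])

# Broken path: thresholding and casting to an integer type cuts the
# connection between the loss and the variables.
with tf.GradientTape() as bad_tape:
    probs = tf.nn.sigmoid(tf.matmul(x, w) + b)
    hard = tf.cast(tf.cast(probs > 0.5, tf.int32), tf.float32)
    bad_loss = tf.reduce_mean(tf.abs(hard - y))
bad_grads = bad_tape.gradient(bad_loss, [w, b])

print("loss:", float(loss))
print("gradient shapes:", grad_w.shape, grad_b.shape)
print("gradients through the cast:", bad_grads)
```

The second tape returns `None` for both variables, which is the kind of failure a cast inside a loss function can cause.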

I opted to use the Adam optimizer, mainly because I wanted to challenge myself. It also helped my NN train more efficiently and accurately, which definitely sped up the process of hyperparameter tuning.

[Adam: https://www.tensorflow.org/api_docs/python/tf/compat/v1/train/AdamOptimizer?hl=ko ]
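
To give a feel for what the optimizer does on each step, here is a rough sketch of the standard Adam update rule for a single scalar parameter, written in plain Python. The function name `adam_step`, the learning rate of 0.1, and the toy objective (theta - 3)^2 are assumptions made for illustration; TensorFlow's own implementation may differ in details such as where epsilon is applied.

```python
import math

def adam_step(theta, grad, m, v, t, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter (t is the 1-based step count)."""
    m = beta1 * m + (1 - beta1) * grad          # running mean of gradients
    v = beta2 * v + (1 - beta2) * grad * grad   # running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)                # correct the bias from starting at 0
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v

# Minimize the toy objective f(theta) = (theta - 3)^2, whose gradient is 2 * (theta - 3).
theta, m, v = 0.0, 0.0, 0.0
for t in range(1, 501):
    grad = 2.0 * (theta - 3.0)
    theta, m, v = adam_step(theta, grad, m, v, t)

print("theta after 500 steps:", theta)
```

Because each step is scaled by the running gradient magnitude, the very first update moves the parameter by roughly the learning rate regardless of how large the raw gradient is.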

I should mention that a portion of my code uses concepts from the website below. However, because my dataset was vastly different, I needed to customize a lot of the code, such as the layer sizes, activation functions, and dev set comparisons. The article is still very informative and was a great tool while debugging.

[Link: https://adventuresinmachinelearning.com/python-tensorflow-tutorial/ ]

In [38]:
# Import TensorFlow, pandas, and NumPy
import tensorflow as tf
import pandas as pd
import numpy as np

# Read in the data
data = pd.read_csv('diamonds.csv')

# Binary label: 1 if the price is strictly above $4,000, otherwise 0
def priceGEQ(price):
    return 1 if price > 4000 else 0

#https://www.geeksforgeeks.org/replacing-strings-with-numbers-in-python-for-data-analysis/
cutNums = {"Fair": 1, "Good": 2, "Very Good": 3, "Premium": 4, "Ideal": 5}
data['cutNums'] = [cutNums[item] for item in data['cut']]
colorNums = {"J": 1, "I": 2, "H": 3, "G": 4, "F": 5, "E": 6, "D": 7}
data['colorNums'] = [colorNums[item] for item in data['color']]
clarityNums = {"I3": 1, "I2": 2, "I1": 3, "SI2": 4, "SI1": 5, "VS2": 6, "VS1": 7, "VVS2": 8, "VVS1": 9, "IF": 10, "FL": 11}
data['clarityNums'] = [clarityNums[item] for item in data['clarity']]
data['priceBools'] = [priceGEQ(item) for item in data['price']]
# drop the index column, the original string columns, and the raw price
data = data.drop(['Unnamed: 0', 'cut', 'color', 'clarity', 'price'], axis=1)

#split into training, dev, and testing
from sklearn.model_selection import train_test_split
X = data.drop('priceBools',axis=1)
Y = data[['priceBools']]
X_new, X_test, Y_new, Y_test = train_test_split(X, Y, test_size = 0.02,random_state=0)
X_train, X_dev, Y_train, Y_dev = train_test_split(X_new, Y_new, test_size = 0.1,random_state=0)

#https://www.geeksforgeeks.org/deep-neural-net-with-forward-and-back-propagation-from-scratch-python/
#https://adventuresinmachinelearning.com/python-tensorflow-tutorial/

#Hyperparameters
numHiddenNodes = 10
batch_size = 300
num_epochs = 30

# now declare the weights connecting the input to the hidden layer
W1 = tf.Variable(tf.random.normal([9, numHiddenNodes], stddev=0.03), name='W1')
b1 = tf.Variable(tf.random.normal([numHiddenNodes]), name='b1')
# and the weights connecting the hidden layer to the output layer
W2 = tf.Variable(tf.random.normal([numHiddenNodes, 1], stddev=0.03), name='W2')
b2 = tf.Variable(tf.random.normal([1]), name='b2')

#https://www.geeksforgeeks.org/ml-mini-batch-gradient-descent-with-python/
# function to create a list containing mini-batches
def create_mini_batches(X, y, batch_size):
    mini_batches = []
    # stack features and labels so they are shuffled together
    data = np.hstack((X, y))
    np.random.shuffle(data)
    n_minibatches = data.shape[0] // batch_size

    # full-size batches
    for i in range(n_minibatches):
        mini_batch = data[i * batch_size:(i + 1) * batch_size, :]
        X_mini = mini_batch[:, :-1]
        Y_mini = mini_batch[:, -1].reshape((-1, 1))
        mini_batches.append((X_mini, Y_mini))
    # leftover examples, if any, form one smaller final batch
    if data.shape[0] % batch_size != 0:
        mini_batch = data[n_minibatches * batch_size:, :]
        X_mini = mini_batch[:, :-1]
        Y_mini = mini_batch[:, -1].reshape((-1, 1))
        mini_batches.append((X_mini, Y_mini))
    return mini_batches

def forward_prop (X, W1, b1, W2, b2):
    Z1 = tf.add(tf.matmul(tf.cast(X, tf.float32), W1), b1)
    A1 = tf.nn.relu(Z1)
    Z2 = tf.add(tf.matmul(A1, W2), b2)
    A2 = tf.nn.sigmoid(Z2)
    return Z1, A1, Z2, A2

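def cross_entropy (A, Y):
  # Hand-written binary cross-entropy (not used in the training loop below).
  # Expects float tensors and uses TensorFlow ops throughout; the small
  # epsilon keeps log() finite when A is exactly 0 or 1.
  eps = 1e-10
  logs = tf.multiply(Y, tf.math.log(A + eps)) + tf.multiply(1 - Y, tf.math.log(1 - A + eps))
  return -tf.reduce_mean(logs)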

def loss_fn(logits, labels):
    # mean sigmoid cross-entropy, computed from the raw logits (Z2)
    return tf.reduce_mean(
        tf.nn.sigmoid_cross_entropy_with_logits(labels=labels, logits=logits))

# setup the optimizer (Adam with its default settings)
optimizer = tf.keras.optimizers.Adam()

total_batch = int(len(Y_train) / batch_size)
for epoch in range(num_epochs):
    avg_loss = 0
    my_batches = create_mini_batches(X_train, Y_train, batch_size)
    for i in range(total_batch):
        batch_x, batch_y = my_batches[i]
        # convert the NumPy batch to float32 tensors
        batch_x = tf.constant(batch_x, dtype=tf.float32)
        batch_y = tf.constant(batch_y, dtype=tf.float32)
        with tf.GradientTape() as tape:
            # trainable tf.Variables are watched by the tape automatically
            Z1, A1, Z2, A2 = forward_prop(batch_x, W1, b1, W2, b2)
            #loss = cross_entropy(A2, batch_y)
            loss = loss_fn(Z2, batch_y)
        gradients = tape.gradient(loss, [W1, b1, W2, b2])
        #gradients = optimizer.compute_gradients(tf.fill([1], loss), [W1, b1, W2, b2])
        optimizer.apply_gradients(zip(gradients, [W1, b1, W2, b2]))
        avg_loss += loss / total_batch
    # evaluate on the dev set after each epoch (the printed label says "test set")
    _, _, _, dev_probs = forward_prop(X_dev, W1, b1, W2, b2)
    dev_preds = dev_probs.numpy()[:, 0] > 0.5
    dev_acc = np.mean(dev_preds == Y_dev['priceBools'].to_numpy())
    print(f"Epoch: {epoch + 1}, loss={avg_loss:.3f}, test set      accuracy={dev_acc*100:.3f}%")
print("\nTraining complete!")

#test set
_, _, _, test_probs = forward_prop(X_test, W1, b1, W2, b2)
test_preds = test_probs.numpy()[:, 0] > 0.5
test_acc = np.mean(test_preds == Y_test['priceBools'].to_numpy())
print(f"Final test accuracy={test_acc*100:.3f}%")

Epoch: 1, loss=0.637, test set      accuracy=64.782%
Epoch: 2, loss=0.497, test set      accuracy=84.207%
Epoch: 3, loss=0.378, test set      accuracy=89.408%
Epoch: 4, loss=0.309, test set      accuracy=89.938%
Epoch: 5, loss=0.265, test set      accuracy=94.496%
Epoch: 6, loss=0.235, test set      accuracy=94.250%
Epoch: 7, loss=0.213, test set      accuracy=94.912%
Epoch: 8, loss=0.197, test set      accuracy=95.877%
Epoch: 9, loss=0.184, test set      accuracy=95.158%
Epoch: 10, loss=0.173, test set      accuracy=95.725%
Epoch: 11, loss=0.164, test set      accuracy=95.423%
Epoch: 12, loss=0.156, test set      accuracy=95.915%
Epoch: 13, loss=0.150, test set      accuracy=95.952%
Epoch: 14, loss=0.145, test set      accuracy=96.104%
Epoch: 15, loss=0.140, test set      accuracy=95.688%
Epoch: 16, loss=0.136, test set      accuracy=96.066%
Epoch: 17, loss=0.133, test set      accuracy=96.066%
Epoch: 18, loss=0.129, test set      accuracy=96.123%
Epoch: 19, loss=0.127, test set      

Problem 3.) 

As soon as my NN was working, it was already fairly accurate. My choice to use ReLU in the hidden layer, then sigmoid at the output helped a lot, as the activation functions were chosen with my dataset in mind. From there, I grouped the three other main hyperparameters together (number of hidden nodes, batch size, number of epochs) and changed them one by one, trying to get the best combination of accuracy and runtime. Interestingly, before I increased the number of hidden nodes, the accuracy used to stagnate for the first few epochs, then make a huge jump upwards before proceeding normally. I chose not to use regularization, as this project had already given me enough of a headache. I did, however, use the Adam optimizer, which most definitely paid off. The values I eventually reached seem pretty stable, and appear to generalize well.
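
The one-at-a-time tuning described above could be sketched roughly as follows. Everything here is hypothetical: `one_at_a_time_search` is an illustrative helper, and `toy_score` is a made-up stand-in for "train the network and return dev accuracy", with its peak placed arbitrarily so the example has something to find. It does not reproduce any measured results.

```python
def one_at_a_time_search(score_fn, start, candidates):
    """Tune each hyperparameter in turn while holding the others fixed."""
    best = dict(start)
    best_score = score_fn(**best)
    for name, values in candidates.items():
        for value in values:
            trial = dict(best, **{name: value})  # change only one setting
            score = score_fn(**trial)
            if score > best_score:
                best, best_score = trial, score
    return best, best_score

def toy_score(num_hidden, batch_size, num_epochs):
    # Hypothetical stand-in for a real train-and-evaluate run; higher is better.
    return -((num_hidden - 10) ** 2
             + ((batch_size - 300) / 100) ** 2
             + ((num_epochs - 30) / 10) ** 2)

start = {"num_hidden": 4, "batch_size": 100, "num_epochs": 10}
candidates = {
    "num_hidden": [4, 10, 20],
    "batch_size": [100, 300, 1000],
    "num_epochs": [10, 30, 50],
}
best, best_score = one_at_a_time_search(toy_score, start, candidates)
print(best)
```

In a real run, `score_fn` would train the network with the given settings, and it could subtract a runtime penalty from the dev accuracy to reflect the accuracy-versus-runtime trade-off.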