### Lesson 8

For this assignment you will start from the perceptron neural network notebook (Simple Perceptron Neural Network.ipynb) and modify the Python code to turn it into a multi-layer neural network classifier. To test your system, use the RedWhiteWine.csv file with the goal of building a red or white wine classifier. Use all the features in the dataset, allowing the network to decide how to build the internal weighting system. To review the data attributes, download L08_WineQuality.pdf. Perform each of the following tasks and answer the related questions:

- Use the provided RedWhiteWine.csv file. Include ALL the features with “Class” being your output vector
- Use the provided Simple Perceptron Neural Network notebook to develop a multi-layer feed-forward/backpropagation neural network
- Be able to adjust the following between experiments:
    - Learning Rate
    - Number of epochs
    - Depth of architecture—number of hidden layers between the input and output layers
    - Number of nodes in a hidden layer—width of the hidden layers
    - (optional) Momentum
- Determine which neural network structure and hyperparameter settings result in the best predictive capability
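
The adjustable settings listed above could be gathered in one place so that each experiment changes a single object. The following is a minimal sketch only: the `Hyperparams` container, the `layer_sizes` helper, and all default values are hypothetical and are not part of the provided notebook.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Hyperparams:
    """Hypothetical container for the settings varied between experiments."""
    learning_rate: float = 0.01
    num_epochs: int = 100
    # One entry per hidden layer (depth); each value is that layer's width
    hidden_layers: List[int] = field(default_factory=lambda: [8])
    momentum: float = 0.0  # 0.0 means plain gradient descent


def layer_sizes(n_inputs: int, hp: Hyperparams, n_outputs: int = 1) -> List[int]:
    """Return [input, hidden..., output] sizes for the network."""
    return [n_inputs] + list(hp.hidden_layers) + [n_outputs]


# Example values only: 12 inputs, two hidden layers of 16 and 8 nodes
hp = Hyperparams(learning_rate=0.05, num_epochs=200, hidden_layers=[16, 8])
print(layer_sizes(12, hp))  # [12, 16, 8, 1]
```

Changing depth then means adding or removing entries in `hidden_layers`, and changing width means editing the numbers themselves.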

In [1]:
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

### Import dataset

In [2]:
data = pd.read_csv('RedWhiteWine.csv')

In [3]:
data.head()

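| Unnamed: 0 | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality | Class |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 | 1 |
| 1 | 7.8 | 0.88 | 0.0 | 2.6 | 0.098 | 25.0 | 67.0 | 0.9968 | 3.2 | 0.68 | 9.8 | 5 | 1 |
| 2 | 7.8 | 0.76 | 0.04 | 2.3 | 0.092 | 15.0 | 54.0 | 0.997 | 3.26 | 0.65 | 9.8 | 5 | 1 |
| 3 | 11.2 | 0.28 | 0.56 | 1.9 | 0.075 | 17.0 | 60.0 | 0.998 | 3.16 | 0.58 | 9.8 | 6 | 1 |
| 4 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 | 1 |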


### Define predictors (X) and target (Y)

In [4]:
X = data.iloc[:,0:11]
Y = data.iloc[:,12]

In [5]:
# turn X and Y dataframe values to array
X = X.to_numpy()

In [6]:
Y = Y.to_numpy()

In [7]:
# reshape for passing into model
Y = Y.reshape((Y.shape[0], 1))

In [8]:
# Normalize the X
scaler = StandardScaler().fit(X)
X = scaler.transform(X)
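
The functions defined later in this notebook compute `np.dot(W1, X)` and read the number of examples from `np.shape(X)[1]`, so they treat X as (features, examples) and Y as (1, examples). The sketch below uses a small synthetic array (not the wine data) to show the per-column standardization that `StandardScaler` performs and how the arrays could be transposed into that layout. The names `X_demo`, `X_cols`, and `Y_cols` are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
X_demo = rng.normal(loc=5.0, scale=2.0, size=(6, 3))   # 6 examples, 3 features (synthetic)
Y_demo = np.array([0, 1, 1, 0, 1, 0]).reshape(-1, 1)   # shape (6, 1)

# Standardize each column: subtract the column mean, divide by the column std
X_scaled = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)
print(np.allclose(X_scaled.mean(axis=0), 0.0))  # True
print(np.allclose(X_scaled.std(axis=0), 1.0))   # True

# Transpose so that each column is one example
X_cols = X_scaled.T   # features x examples
Y_cols = Y_demo.T     # 1 x examples
print(X_cols.shape, Y_cols.shape)  # (3, 6) (1, 6)
```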

### Logistic (Sigmoid) Function

In [9]:
# Numerically stable logistic (s-shaped) function
def sigmoid(x):
    x = np.clip(x, -500, 500)   # keep np.exp within floating-point range
    # pick the form element-wise so the sign test applies to each entry
    return np.where(x >= 0, 1/(1 + np.exp(-x)), np.exp(x)/(1 + np.exp(x)))
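
To illustrate why the sign of each entry matters, here is a sketch of a sigmoid that only ever exponentiates non-positive numbers, so it needs no clipping. The name `stable_sigmoid` is hypothetical and the test values are arbitrary.

```python
import numpy as np


def stable_sigmoid(x):
    """Element-wise logistic function that never exponentiates a positive number."""
    x = np.asarray(x, dtype=float)
    out = np.empty_like(x)
    pos = x >= 0
    out[pos] = 1.0 / (1.0 + np.exp(-x[pos]))   # exp of a non-positive value
    ex = np.exp(x[~pos])                       # exp of a negative value
    out[~pos] = ex / (1.0 + ex)
    return out


z = np.array([-1000.0, -2.0, 0.0, 2.0, 1000.0])
print(stable_sigmoid(z))  # values run from 0 to 1, with 0.5 at z = 0
```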

### Initialize Parameters

In [10]:
# dim3 is reserved for a hidden-layer dimension but is not used yet
def init_parameters(dim1, dim2, dim3=1, std=1e-1, random=True):
    if random:
        return np.random.random([dim1, dim2]) * std   # uniform values in [0, std)
    else:
        return np.zeros([dim1, dim2])
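
For a multi-layer network, one weight matrix and one bias vector are needed per layer. The sketch below is one possible way to build them from a list of layer sizes. `init_layers` is a hypothetical helper, and it swaps in zero-centred normal draws in place of the `np.random.random` call used above, which is an assumption rather than the notebook's method.

```python
import numpy as np


def init_layers(sizes, std=0.1, seed=None):
    """Hypothetical helper: sizes = [n_inputs, hidden..., n_outputs]."""
    rng = np.random.default_rng(seed)
    params = {}
    for l in range(1, len(sizes)):
        # Weight matrix maps layer l-1 (columns) to layer l (rows)
        params[f"W{l}"] = rng.standard_normal((sizes[l], sizes[l - 1])) * std
        params[f"B{l}"] = np.zeros((sizes[l], 1))
    return params


params = init_layers([11, 8, 1], seed=0)
for name, value in params.items():
    print(name, value.shape)
# W1 (8, 11)
# B1 (8, 1)
# W2 (1, 8)
# B2 (1, 1)
```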

### Forward Propagation
Here, I am assuming a single-layer network. Note that even with a single-layer network, the layer itself can have multiple nodes. I am also using vectorized operations, i.e. no explicit loops, which lets the network process multiple inputs at once.

In [11]:
# Single-layer network: forward propagation
# Takes the weight matrix, bias vector, and input matrix X (Y is accepted but not used)
def fwd_prop(W1,bias,X,Y):

    Z1 = np.dot(W1,X) + bias # weighted sum plus bias; X is read as (features, examples)
    A1 = sigmoid(Z1)         # sigmoid turns Z1 into predicted probabilities

    return(A1)
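
Extending the forward pass to several layers amounts to repeating the same two steps, feeding each layer's activations into the next. A minimal sketch, assuming parameters stored as `W1, B1, ..., WL, BL` in a dict and sigmoid activations on every layer (the `forward` function and the synthetic inputs are illustrative, not from the provided notebook):

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-np.clip(z, -500, 500)))


def forward(params, X):
    """Hypothetical multi-layer forward pass; X has shape (features, examples)."""
    n_layers = len(params) // 2            # one W and one B per layer
    activations = [X]                      # A0 is the input itself
    for l in range(1, n_layers + 1):
        Z = params[f"W{l}"] @ activations[-1] + params[f"B{l}"]
        activations.append(sigmoid(Z))
    return activations                     # activations[-1] is the prediction


rng = np.random.default_rng(0)
params = {"W1": rng.standard_normal((4, 3)) * 0.1, "B1": np.zeros((4, 1)),
          "W2": rng.standard_normal((1, 4)) * 0.1, "B2": np.zeros((1, 1))}
X = rng.standard_normal((3, 5))            # 3 features, 5 synthetic examples
acts = forward(params, X)
print([a.shape for a in acts])  # [(3, 5), (4, 5), (1, 5)]
```

Keeping every layer's activations is deliberate: backpropagation needs them to compute the gradients.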

### Backpropagation

In [12]:
# Single-layer network: backpropagation

def back_prop(A1,W1,bias,X,Y):

    m = np.shape(X)[1] # number of examples, used to average the cost and gradients

    # Cross entropy loss function
    cost = (-1/m)*np.sum(Y*np.log(A1) + (1-Y)*np.log(1-A1)) # mean cross-entropy
    dZ1 = A1 - Y                                            # gradient w.r.t. Z1: prediction minus label
    dW1 = (1/m) * np.dot(dZ1, X.T)                          # gradient w.r.t. the weights
    dBias = (1/m) * np.sum(dZ1, axis = 1, keepdims = True)  # gradient w.r.t. the bias

    grads ={"dW1": dW1, "dB1":dBias} # gradients for the weights and bias

    return(grads,cost)
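
With one hidden layer, the output-layer gradients keep the same form as above, and the hidden-layer error is obtained by passing `dZ2` back through `W2` and multiplying by the sigmoid derivative `A1 * (1 - A1)`. The following sketch assumes sigmoid activations in both layers and checks one gradient entry against a finite-difference estimate on synthetic data. All function names here are hypothetical.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-np.clip(z, -500, 500)))


def cost_fn(p, X, Y):
    A1 = sigmoid(p["W1"] @ X + p["B1"])
    A2 = sigmoid(p["W2"] @ A1 + p["B2"])
    m = X.shape[1]
    return (-1 / m) * np.sum(Y * np.log(A2) + (1 - Y) * np.log(1 - A2))


def back_prop_2layer(p, X, Y):
    m = X.shape[1]
    A1 = sigmoid(p["W1"] @ X + p["B1"])
    A2 = sigmoid(p["W2"] @ A1 + p["B2"])
    dZ2 = A2 - Y                                   # output-layer error
    dW2 = (1 / m) * dZ2 @ A1.T
    dB2 = (1 / m) * np.sum(dZ2, axis=1, keepdims=True)
    dZ1 = (p["W2"].T @ dZ2) * A1 * (1 - A1)        # error pushed back through W2
    dW1 = (1 / m) * dZ1 @ X.T
    dB1 = (1 / m) * np.sum(dZ1, axis=1, keepdims=True)
    return {"W1": dW1, "B1": dB1, "W2": dW2, "B2": dB2}


rng = np.random.default_rng(1)
X = rng.standard_normal((3, 6))                            # synthetic inputs
Y = rng.integers(0, 2, size=(1, 6)).astype(float)          # synthetic labels
p = {"W1": rng.standard_normal((4, 3)) * 0.1, "B1": np.zeros((4, 1)),
     "W2": rng.standard_normal((1, 4)) * 0.1, "B2": np.zeros((1, 1))}
grads = back_prop_2layer(p, X, Y)

# Central finite difference on a single weight as a sanity check
eps = 1e-6
plus = {k: v.copy() for k, v in p.items()}
minus = {k: v.copy() for k, v in p.items()}
plus["W1"][0, 0] += eps
minus["W1"][0, 0] -= eps
numeric = (cost_fn(plus, X, Y) - cost_fn(minus, X, Y)) / (2 * eps)
diff = abs(numeric - grads["W1"][0, 0])
print(diff)  # should be very small if the analytic gradient is right
```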

### Gradient Descent


In [13]:
def run_grad_desc(num_epochs,learning_rate,X,Y,n_1):

    n_0, m = np.shape(X)                        # X is read as (n_0 features, m examples)

    W1 = init_parameters(n_1, n_0, random=True) # weights: n_1 nodes by n_0 inputs
    B1 = init_parameters(n_1, 1, random=True)   # one bias per node

    loss_array = np.ones([num_epochs])*np.nan   # pre-fill the loss history with NaNs

    for i in np.arange(num_epochs):
        A1 = fwd_prop(W1,B1,X,Y)                # get predicted vector
        grads,cost = back_prop(A1,W1,B1,X,Y)    # get gradients and cost from backprop

        W1 = W1 - learning_rate*grads["dW1"]    # gradient descent step on the weights
        B1 = B1 - learning_rate*grads["dB1"]    # gradient descent step on the bias

        loss_array[i] = cost                    # record the cross-entropy for this epoch

    parameter = {"W1":W1,"B1":B1}               # final trained parameters

    return(parameter,loss_array)
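
The optional momentum setting could be added by keeping a running velocity for each parameter and updating the parameter with that velocity instead of the raw gradient. A minimal sketch of the classical momentum update, using a hypothetical `momentum_step` helper and a toy quadratic in place of the wine data:

```python
import numpy as np


def momentum_step(params, grads, velocity, learning_rate, momentum=0.9):
    """Hypothetical update: v <- momentum * v - lr * grad, then param <- param + v."""
    for key in params:
        velocity[key] = momentum * velocity[key] - learning_rate * grads[key]
        params[key] = params[key] + velocity[key]
    return params, velocity


# Toy check on f(w) = w^2, whose gradient is 2w and whose minimum is at w = 0
params = {"w": np.array([5.0])}
velocity = {"w": np.zeros(1)}
for _ in range(200):
    grads = {"w": 2 * params["w"]}
    params, velocity = momentum_step(params, grads, velocity, learning_rate=0.05)
print(abs(params["w"][0]) < 1e-3)  # True
```

Setting `momentum=0.0` reduces this to the plain update used in `run_grad_desc`.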

### Running the Experiment

In [14]:
num_epochs = 100
learning_rate = 0.1
params, loss_array = run_grad_desc(num_epochs,learning_rate,X,Y,n_1= 1 )
print(loss_array[num_epochs-1])

  
  


nan
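
One common way a cross-entropy loss becomes nan is a prediction saturating to exactly 0 or 1, so that the expression evaluates `0 * log(0)`. Whether that is the only cause here is not established. The sketch below shows the effect on hand-picked values and a frequently used guard that clips predictions away from 0 and 1; the `cross_entropy` helper and the `eps` value are assumptions.

```python
import numpy as np


def cross_entropy(A, Y, eps=1e-12):
    """Mean binary cross-entropy with predictions clipped into [eps, 1 - eps]."""
    A = np.clip(A, eps, 1 - eps)
    m = Y.shape[1]
    return (-1 / m) * np.sum(Y * np.log(A) + (1 - Y) * np.log(1 - A))


A = np.array([[1.0, 0.0, 0.8]])   # two saturated predictions
Y = np.array([[1.0, 0.0, 1.0]])

# Unguarded version, with NumPy warnings silenced for the demonstration
with np.errstate(divide="ignore", invalid="ignore"):
    unguarded = (-1 / 3) * np.sum(Y * np.log(A) + (1 - Y) * np.log(1 - A))

print(np.isnan(unguarded), np.isfinite(cross_entropy(A, Y)))  # True True
```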


In [15]:
### Experimenting with lower learning rate

In [16]:
num_epochs = 100
learning_rate = 0.0001
params, loss_array = run_grad_desc(num_epochs,learning_rate,X,Y,n_1= 1 )
print(loss_array[num_epochs-1])

2197.4488126813476


In [17]:
### Experimenting with fewer epochs

In [18]:
num_epochs = 10
learning_rate = 0.0001
params, loss_array = run_grad_desc(num_epochs,learning_rate,X,Y,n_1= 1 )
print(loss_array[num_epochs-1])

6000.019554810447


#### Findings
It's difficult to determine the ideal neural network structure, as I'm not confident my first ANN was a good baseline to build from. Lowering the learning rate from 0.1 to 0.0001 in my second experiment replaced the nan loss with a finite one (about 2197), which was more promising than my third experiment, where reducing the number of epochs from 100 to 10 raised the final loss to about 6000.
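
The loss values above do not directly say how many wines are classified correctly. A rough sketch of an accuracy check that could accompany the loss, assuming predictions of shape (1, examples) and a 0.5 decision threshold (the `accuracy` helper and the sample values are illustrative, not results from this notebook):

```python
import numpy as np


def accuracy(A, Y, threshold=0.5):
    """Fraction of examples whose thresholded prediction matches the label."""
    preds = (A >= threshold).astype(int)
    return float(np.mean(preds == Y))


A = np.array([[0.9, 0.2, 0.6, 0.4]])   # made-up predicted probabilities
Y = np.array([[1, 0, 0, 0]])           # made-up labels
print(accuracy(A, Y))  # 0.75
```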