<img align="left" src="https://lever-client-logos.s3.amazonaws.com/864372b1-534c-480e-acd5-9711f850815c-1524247202159.png" width=200>
<br></br>
<br></br>

## *Data Science Unit 4 Sprint 2*

# Sprint Challenge - Neural Network Foundations

Table of Problems

1. [Defining Neural Networks](#Q1)
2. [Chocolate Gummy Bears](#Q2)
    - Perceptron
    - Multilayer Perceptron
4. [Keras MMP](#Q3)

<a id="Q1"></a>
## 1. Define the following terms:

- **Neuron:** An individual node in a neural network.
- **Input Layer:** The layer that receives input data (must be same size as the input data's columns/lists)
- **Hidden Layer:** The layer that performs the computations based on weights.
- **Output Layer:** The layer that returns a value/prediction based on the hidden layer and its own weights
- **Activation:** Applying the weights of a neuron to a mathematical function (like a sigmoid function)
- **Backpropagation:** Re-balancing weights after an output has been found, and then re-running the network to find a better output


## 2. Chocolate Gummy Bears <a id="Q2"></a>

Right now, you're probably thinking, "yuck, who the hell would eat that?". Great question. Your candy company wants to know too. And you thought I was kidding about the [Chocolate Gummy Bears](https://nuts.com/chocolatessweets/gummies/gummy-bears/milk-gummy-bears.html?utm_source=google&utm_medium=cpc&adpos=1o1&gclid=Cj0KCQjwrfvsBRD7ARIsAKuDvMOZrysDku3jGuWaDqf9TrV3x5JLXt1eqnVhN0KM6fMcbA1nod3h8AwaAvWwEALw_wcB). 

Let's assume that a candy company has gone out and collected information on the types of Halloween candy kids ate. Our candy company wants to predict the eating behavior of witches, warlocks, and ghosts -- aka costumed kids. They shared a sample dataset with us. Each row represents a piece of candy that a costumed child was presented with during "trick" or "treat". We know if the candy was `chocolate` (or not chocolate) or `gummy` (or not gummy). Your goal is to predict if the costumed kid `ate` the piece of candy. 

If both chocolate and gummy equal one, you've got a chocolate gummy bear on your hands!?!?!
![Chocolate Gummy Bear](https://ed910ae2d60f0d25bcb8-80550f96b5feb12604f4f720bfefb46d.ssl.cf1.rackcdn.com/3fb630c04435b7b5-2leZuM7_-zoom.jpg)

In [1]:
import pandas as pd
candy = pd.read_csv('chocolate_gummy_bears.csv')

In [3]:
print(candy.shape)
candy.head()

(10000, 3)


Unnamed: 0,chocolate,gummy,ate
0,0,1,1
1,1,0,1
2,0,1,1
3,0,0,0
4,1,1,0


### Perceptron

To make predictions on the `candy` dataframe. Build and train a Perceptron using numpy. Your target column is `ate` and your features: `chocolate` and `gummy`. Do not do any feature engineering. :P

Once you've trained your model, report your accuracy. You will not be able to achieve more than ~50% with the simple perceptron. Explain why you could not achieve a higher accuracy with the *simple perceptron* architecture, because it's possible to achieve ~95% accuracy on this dataset. Provide your answer in markdown (and *optional* data anlysis code) after your perceptron implementation. 

In [106]:
import numpy as np
# Start your candy perceptron here
class Perceptron(object):
    def __init__(self, rate=0.01, niter=10):
        self.rate = rate
        self.niter = niter
    
    def fit(self, X, y):
        """Fit training data
        X: Training vectors, X.shape : [#Samples, #features]
        y: Target values, y.shape : [#samples]"""
        
        # weights
        self.weight = np.zeros(1 + X.shape[1])
        # no. of misclassificaiton
        self.errors = [] # Number of misclassifications
        for i in range(self.niter):
            error = 0
            for xi, target in zip(X, y):
                predicted = self.predict(xi)
                delta_w = self.rate * (target - predicted)
                self.weight[1:] += delta_w * xi
                self.weight[0] += delta_w
                if delta_w != 0.0:
                    error += 1
            self.errors.append(error)
        return self
    
    def net_input(self, X):
            """Calculate net input"""
            return np.dot(X, self.weight[1:]) + self.weight[0]
    
    def predict(self, X):
        """Return class label after unit step"""
        return np.where(self.net_input(X) >= 0.0, 1, -1)

X = candy[['chocolate', 'gummy']].values
y = candy['ate'].values

print(X.shape)
print(X[0])

(10000, 2)
[0 1]


In [107]:
pn = Perceptron()
fitted = pn.fit(X, y)

In [108]:
print(f' Accuracy for the 10 iterations are: {[(1 - x/len(X)) for x in fitted.errors]}')

 Accuracy for the 10 iterations are: [0.3722, 0.3732, 0.3729, 0.3732, 0.3729, 0.3732, 0.3729, 0.3732, 0.3729, 0.3732]


**Why can't a Perceptron achieve higher accuracy?**
The reason that these errors are going to remain roughly the same is that there is no backpropagation happening. There would be more variance if the weights were randomized, but currently they all start at 0. The model will run once through the dataframe, and then will finish with X number of errors. It can't then go back and retrain/rebalance weights to get a better answer.

### Multilayer Perceptron <a id="Q3"></a>

Using the sample candy dataset, implement a Neural Network Multilayer Perceptron class that uses backpropagation to update the network's weights. Your Multilayer Perceptron should be implemented in Numpy. 
Your network must have one hidden layer.

Once you've trained your model, report your accuracy. Explain why your MLP's performance is considerably better than your simple perceptron's on the candy dataset. 

In [109]:
class NeuralNetwork:
    def __init__(self):
        # Set up Architecture of Neural Network
        self.inputs = 2
        self.hiddenNodes = 2
        self.outputNodes = 1

        # Initial Weights
        # 2x3 Matrix Array for the First Layer
        self.weights1 = np.random.rand(self.inputs, self.hiddenNodes)
       
        # 3x1 Matrix Array for Hidden to Output
        self.weights2 = np.random.rand(self.hiddenNodes, self.outputNodes)
        
    def sigmoid(self, s):
        return 1 / (1+np.exp(-s))
    
    def sigmoidPrime(self, x):
        s = self.sigmoid(x)
        return s * (1 - s)
    
    def feed_forward(self, X):
        """
        Calculate the NN inference using feed forward.
        aka "predict"
        """
        
        # Weighted sum of inputs => hidden layer
        self.hidden_sum = np.dot(X, self.weights1)
        
        # Activations of weighted sum
        self.activated_hidden = self.sigmoid(self.hidden_sum)
        
        # Weight sum between hidden and output
        self.output_sum = np.dot(self.activated_hidden, self.weights2)
        
        # Final activation of output
        self.activated_output = self.sigmoid(self.output_sum)
        
        return self.activated_output
        
    def backward(self, X,y,o):
        """
        Backward propagate through the network
        """
        
        # Error in Output
        self.o_error = y - o
        
        # Apply Derivative of Sigmoid to error
        # How far off are we in relation to the Sigmoid f(x) of the output
        # ^- aka hidden => output
        self.o_delta = self.o_error * self.sigmoidPrime(o)
        
        # z2 error
        self.z2_error = self.o_delta.dot(self.weights2.T)
        
        # How much of that "far off" can explained by the input => hidden
        self.z2_delta = self.z2_error * self.sigmoidPrime(self.activated_hidden)
        
        # Adjustment to first set of weights (input => hidden)
        self.weights1 += X.T.dot(self.z2_delta)
        # Adjustment to second set of weights (hidden => output)
        self.weights2 += self.activated_hidden.T.dot(self.o_delta)
        

    def train(self, X, y):
        o = self.feed_forward(X)
        self.backward(X,y,o)

In [110]:
# Didn't need to reshape the X, but the Y did need to become 1 row
y = y.reshape(y.shape[0], 1)

X.shape, y.shape

((10000, 2), (10000, 1))

In [111]:
nn = NeuralNetwork()
fitted_nn = nn.train(X, y)

In [113]:
print(f' Accuracy for the 10 iterations are: {[(1 - x/len(X)) for x in nn.o_error]}')

 Accuracy for the 10 iterations are: [array([0.99996016]), array([0.99995945]), array([0.99996016]), array([1.00005779]), array([1.00006154]), array([0.99996016]), array([1.00005779]), array([1.00006154]), array([1.00005779]), array([1.00005779]), array([0.99996016]), array([1.00006154]), array([1.00005779]), array([0.99996016]), array([0.99996016]), array([1.00005945]), array([0.99995945]), array([0.99996016]), array([1.00006154]), array([0.99996016]), array([0.99996016]), array([0.99996016]), array([1.00006154]), array([1.00006154]), array([0.99995945]), array([0.99995945]), array([1.00005779]), array([1.00005779]), array([1.00005779]), array([1.00005779]), array([0.99995945]), array([0.99996016]), array([1.00006154]), array([0.99995945]), array([0.99995945]), array([0.99996016]), array([0.99996016]), array([1.00006154]), array([1.00005779]), array([0.99995945]), array([0.99995945]), array([1.00005779]), array([1.00006154]), array([0.99996016]), array([0.99995945]), array([1.00006154

P.S. Don't try candy gummy bears. They're disgusting. 

In [114]:
class Neural_Network(object):
    def __init__(self):        
        #Define Hyperparameters
        self.inputLayerSize = 2
        self.outputLayerSize = 1
        self.hiddenLayerSize = 3
        
        #Weights (parameters)
        self.W1 = np.random.randn(self.inputLayerSize,self.hiddenLayerSize)
        self.W2 = np.random.randn(self.hiddenLayerSize,self.outputLayerSize)
        
    def forward(self, X):
        #Propogate inputs though network
        self.z2 = np.dot(X, self.W1)
        self.a2 = self.sigmoid(self.z2)
        self.z3 = np.dot(self.a2, self.W2)
        yHat = self.sigmoid(self.z3) 
        return yHat
        
    def sigmoid(self, z):
        #Apply sigmoid activation function to scalar, vector, or matrix
        return 1/(1+np.exp(-z))
    
    def sigmoidPrime(self,z):
        #Gradient of sigmoid
        return np.exp(-z)/((1+np.exp(-z))**2)
    
    def costFunction(self, X, y):
        #Compute cost for given X,y, use weights already stored in class.
        self.yHat = self.forward(X)
        J = 0.5*sum((y-self.yHat)**2)
        return J
        
    def costFunctionPrime(self, X, y):
        #Compute derivative with respect to W and W2 for a given X and y:
        self.yHat = self.forward(X)
        
        delta3 = np.multiply(-(y-self.yHat), self.sigmoidPrime(self.z3))
        dJdW2 = np.dot(self.a2.T, delta3)
        
        delta2 = np.dot(delta3, self.W2.T)*self.sigmoidPrime(self.z2)
        dJdW1 = np.dot(X.T, delta2)  
        
        return dJdW1, dJdW2
    
    #Helper Functions for interacting with other classes:
    def getParams(self):
        #Get W1 and W2 unrolled into vector:
        params = np.concatenate((self.W1.ravel(), self.W2.ravel()))
        return params
    
    def setParams(self, params):
        #Set W1 and W2 using single paramater vector.
        W1_start = 0
        W1_end = self.hiddenLayerSize * self.inputLayerSize
        self.W1 = np.reshape(params[W1_start:W1_end], (self.inputLayerSize , self.hiddenLayerSize))
        W2_end = W1_end + self.hiddenLayerSize*self.outputLayerSize
        self.W2 = np.reshape(params[W1_end:W2_end], (self.hiddenLayerSize, self.outputLayerSize))
        
    def computeGradients(self, X, y):
        dJdW1, dJdW2 = self.costFunctionPrime(X, y)
        return np.concatenate((dJdW1.ravel(), dJdW2.ravel()))

In [119]:
from scipy import optimize
class trainer(object):
    def __init__(self, N):
        #Make Local reference to network:
        self.N = N
        
    def callbackF(self, params):
        self.N.setParams(params)
        self.J.append(self.N.costFunction(self.X, self.y))   
        
    def costFunctionWrapper(self, params, X, y):
        self.N.setParams(params)
        cost = self.N.costFunction(X, y)
        grad = self.N.computeGradients(X,y)
        
        return cost, grad
        
    def train(self, X, y):
        #Make an internal variable for the callback function:
        self.X = X
        self.y = y

        #Make empty list to store costs:
        self.J = []
        
        params0 = self.N.getParams()

        options = {'maxiter': 200, 'disp' : True}
        _res = optimize.minimize(self.costFunctionWrapper, params0, jac=True, method='BFGS', \
                                 args=(X, y), options=options, callback=self.callbackF)

        self.N.setParams(_res.x)
        self.optimizationResults = _res

In [121]:
nn = Neural_Network()
trained_nn = trainer(nn)
trained_nn = trained_nn.train(X, y)

Optimization terminated successfully.
         Current function value: 256.295819
         Iterations: 48
         Function evaluations: 61
         Gradient evaluations: 61


In [122]:
print("Predicted Output: \n" + str(nn.forward(X))) 
print("Loss: \n" + str(np.mean(np.square(y - nn.forward(X))))) # mean sum squared loss

Predicted Output: 
[[0.94741068]
 [0.94777019]
 [0.94741068]
 ...
 [0.94741068]
 [0.94741068]
 [0.94777019]]
Loss: 
0.05125916383849585


**Answer:** 
Using both a handmade multi-layer perceptron and the Stochastic GD-based implementation we were shown from Welch Labs, we were able to get very high matching rates for the predicted output, thanks to backpropagation!

## 3. Keras MMP <a id="Q3"></a>

Implement a Multilayer Perceptron architecture of your choosing using the Keras library. Train your model and report its baseline accuracy. Then hyperparameter tune at least two parameters and report your model's accuracy.
Use the Heart Disease Dataset (binary classification)
Use an appropriate loss function for a binary classification task
Use an appropriate activation function on the final layer of your network.
Train your model using verbose output for ease of grading.
Use GridSearchCV or RandomSearchCV to hyperparameter tune your model. (for at least two hyperparameters)
When hyperparameter tuning, show you work by adding code cells for each new experiment.
Report the accuracy for each combination of hyperparameters as you test them so that we can easily see which resulted in the highest accuracy.
You must hyperparameter tune at least 3 parameters in order to get a 3 on this section.

In [124]:
from sklearn.model_selection import GridSearchCV
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.wrappers.scikit_learn import KerasClassifier
from sklearn.preprocessing import StandardScaler


df = pd.read_csv('https://raw.githubusercontent.com/ryanleeallred/datasets/master/heart.csv')
df = df.sample(frac=1)
print(df.shape)
df.head()

(303, 14)


Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
237,60,1,0,140,293,0,0,170,0,1.2,1,2,3,0
145,70,1,1,156,245,0,0,143,0,0.0,2,0,2,1
26,59,1,2,150,212,1,1,157,0,1.6,2,0,2,1
293,67,1,2,152,212,0,0,150,0,0.8,1,0,3,0
57,45,1,0,115,260,0,0,185,0,0.0,2,0,2,1


In [131]:
np.random.seed(42)
X = df.drop(columns=['target']).values
y = df['target'].values

scaler = StandardScaler()

X = scaler.fit_transform(X)

inputs = X.shape[1]

def make_model():
        model = Sequential()
        model.add(Dense(14, input_shape=(inputs,), activation='relu', ))
        model.add(Dense(32, activation='relu'))
        model.add(Dense(1))
        # Compile model
        model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
        return model

model = KerasClassifier(build_fn=make_model, verbose=1)

fitted = model.fit(X, y, 
          validation_split=0.25, 
          epochs=20, 
          batch_size=5
         )


Train on 227 samples, validate on 76 samples
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


In [135]:
# BASELINE MODEL ACCURACY (AFTER 20 EPOCHS, BATCH SIZE=5):
fitted.history['accuracy'][-1]

0.8678414

In [148]:
def create_model(optimizer):
        model = Sequential()
        model.add(Dense(14, input_shape=(inputs,), activation='relu', ))
        model.add(Dense(32, activation='relu'))
        model.add(Dense(1))
        # Compile model
        model.compile(loss='binary_crossentropy', optimizer=optimizer, metrics=['accuracy'])
        return model
    
model2 = KerasClassifier(build_fn=create_model, verbose=1)


parameters = {'batch_size': [5, 10, 20, 32],
              'epochs': [10, 20, 35, 50],
              'optimizer': ['SGD','Adagrad', 'Adadelta', 'Adam', 'Adamax', 'Nadam']
             }

grid = GridSearchCV(estimator=model2, param_grid=parameters, n_jobs = -1)
grid_result = grid.fit(X, y, epochs=30)

print(f"Best: {grid_result.best_score_} using {grid_result.best_params_}")
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print(f"Means: {mean}, Stdev: {stdev} with: {param}") 



Train on 303 samples
Epoch 1/30
Epoch 2/30
Epoch 3/30
Epoch 4/30
Epoch 5/30
Epoch 6/30
Epoch 7/30
Epoch 8/30
Epoch 9/30
Epoch 10/30
Epoch 11/30
Epoch 12/30
Epoch 13/30
Epoch 14/30
Epoch 15/30
Epoch 16/30
Epoch 17/30
Epoch 18/30
Epoch 19/30
Epoch 20/30
Epoch 21/30
Epoch 22/30
Epoch 23/30
Epoch 24/30
Epoch 25/30
Epoch 26/30
Epoch 27/30
Epoch 28/30
Epoch 29/30
Epoch 30/30
Best: 0.8254098296165466 using {'batch_size': 10, 'epochs': 35, 'optimizer': 'Adam'}
Means: 0.5548633873462677, Stdev: 0.1594836147356541 with: {'batch_size': 5, 'epochs': 10, 'optimizer': 'SGD'}
Means: 0.5747540771961213, Stdev: 0.08074583013879362 with: {'batch_size': 5, 'epochs': 10, 'optimizer': 'Adagrad'}
Means: 0.455409836769104, Stdev: 0.03807233685447295 with: {'batch_size': 5, 'epochs': 10, 'optimizer': 'Adadelta'}
Means: 0.8086885213851929, Stdev: 0.02604237728903427 with: {'batch_size': 5, 'epochs': 10, 'optimizer': 'Adam'}
Means: 0.6761202096939087, Stdev: 0.06118305981408235 with: {'batch_size': 5, 'epochs':

Seems like the best batch size is 10, the best number of epochs was 35, and the best optimizer is Adam, just based on the non-random Gridsearch.