<img align="left" src="https://lever-client-logos.s3.amazonaws.com/864372b1-534c-480e-acd5-9711f850815c-1524247202159.png" width=200>
<br></br>
<br></br>

## *Data Science Unit 4 Sprint 2*

# Sprint Challenge - Neural Network Foundations

Table of Problems

1. [Defining Neural Networks](#Q1)
2. [Chocolate Gummy Bears](#Q2)
    - Perceptron
    - Multilayer Perceptron
4. [Keras MMP](#Q3)

<a id="Q1"></a>
## 1. Define the following terms:

- **Neuron:** A Neuron is a number in the network in between synapses that informs the values of neurons in the layers before and after it through the weights of the synapses.
- **Input Layer:** The input layer is the first layer of neurons activated in the network, and each neuron represents the observations of one feature of the dataset.
- **Hidden Layer:** The hidden layers are the layers of neurons between the input layer and the output layer. Each neuron of a hidden layer is calculated by activating the sum of the weights x biases of the previous layer.
- **Output Layer:** The output layer is the last layer of neurons activated in the network. Their neurons are calculated the same way as in a hidden layer, but the number of neurons in the ouput layer reflects the number of unique classes we want to predict. If MNIST for example, we use 10 output neurons. For regression, we use 1 output neuron, and the same for single classification.
- **Activation:** The activation function is a function that normalizes the values of the neurons between a set range. Examples include tanh, sigmoid, and relu.
- **Backpropagation:** Backpropogation is the process by which the error measured between the output values of the network and the actual values of the dataset are fed back through the weights and neurons to recalculate the initial weights of the neural network. This process is repeated through a set number of epochs.


## 2. Chocolate Gummy Bears <a id="Q2"></a>

Right now, you're probably thinking, "yuck, who the hell would eat that?". Great question. Your candy company wants to know too. And you thought I was kidding about the [Chocolate Gummy Bears](https://nuts.com/chocolatessweets/gummies/gummy-bears/milk-gummy-bears.html?utm_source=google&utm_medium=cpc&adpos=1o1&gclid=Cj0KCQjwrfvsBRD7ARIsAKuDvMOZrysDku3jGuWaDqf9TrV3x5JLXt1eqnVhN0KM6fMcbA1nod3h8AwaAvWwEALw_wcB). 

Let's assume that a candy company has gone out and collected information on the types of Halloween candy kids ate. Our candy company wants to predict the eating behavior of witches, warlocks, and ghosts -- aka costumed kids. They shared a sample dataset with us. Each row represents a piece of candy that a costumed child was presented with during "trick" or "treat". We know if the candy was `chocolate` (or not chocolate) or `gummy` (or not gummy). Your goal is to predict if the costumed kid `ate` the piece of candy. 

If both chocolate and gummy equal one, you've got a chocolate gummy bear on your hands!?!?!
![Chocolate Gummy Bear](https://ed910ae2d60f0d25bcb8-80550f96b5feb12604f4f720bfefb46d.ssl.cf1.rackcdn.com/3fb630c04435b7b5-2leZuM7_-zoom.jpg)

In [1]:
import pandas as pd
candy = pd.read_csv('chocolate_gummy_bears.csv')

In [2]:
candy.head()

Unnamed: 0,chocolate,gummy,ate
0,0,1,1
1,1,0,1
2,0,1,1
3,0,0,0
4,1,1,0


In [48]:
candy.shape

(10000, 3)

### Perceptron

To make predictions on the `candy` dataframe. Build and train a Perceptron using numpy. Your target column is `ate` and your features: `chocolate` and `gummy`. Do not do any feature engineering. :P

Once you've trained your model, report your accuracy. You will not be able to achieve more than ~50% with the simple perceptron. Explain why you could not achieve a higher accuracy with the *simple perceptron* architecture, because it's possible to achieve ~95% accuracy on this dataset. Provide your answer in markdown (and *optional* data anlysis code) after your perceptron implementation. 

In [112]:
import numpy as np

X = candy[['chocolate', 'gummy']].values
y = candy['ate'].values
print(X.shape)
class Perceptron(object):
    
    def __init__(self, bias = 10, n_iter = 10):
        self.bias = bias
        self.n_iter = n_iter
        
    def sigmoid(self, x):
        return 1/(1+np.exp(-x))
    
    def sigmoid_prime(self, x):
        sx = self.sigmoid(x)
        return sx*(1-sx)
    
    def fit(self, X, y):
        weights = np.random.rand(X.shape[1], 1)
        
        for iteration in range(10000):
    
            # Weighted sum of inputs / weights
            weighted_sum = np.dot(X, weights)

            # Activate!
            self.activated_output = self.sigmoid(weighted_sum)
            # Cac error
            error = y.reshape(-1, 1) - self.activated_output

            adjustments = error * self.sigmoid_prime(self.activated_output)
            # Update the Weights
            weights += np.dot(X.T, adjustments)
    
    def predict(self, X):
        return self.activated_output

(10000, 2)


In [113]:
pct = Perceptron()
pct.fit(X, y)
y_pred = pct.predict(X)

  del sys.path[0]


In [120]:
from sklearn.metrics import accuracy_score
print(y_pred.shape)
print(y.reshape(-1, 1).shape)
accuracy_score(y.reshape(-1, 1), y_pred.round())

(10000, 1)
(10000, 1)


0.7229

This single-layer perceptron doesn't produce useful accuracy because single-layer perceptrons are terrible at the XOR problem, a linerarly non-separable problem. A multi-layer perceptron should fare better because it can implement differentiable gradient descent. I'm surprised that i got 72% accuracy. I think it was because I rounded my NN outputs. Otherwise sklearn accuracy score didn't like that I mixed binary with float NN outputs.

### Multilayer Perceptron <a id="Q3"></a>

Using the sample candy dataset, implement a Neural Network Multilayer Perceptron class that uses backpropagation to update the network's weights. Your Multilayer Perceptron should be implemented in Numpy. 
Your network must have one hidden layer.

Once you've trained your model, report your accuracy. Explain why your MLP's performance is considerably better than your simple perceptron's on the candy dataset. 

In [63]:
class Neural_Network(object):
    def __init__(self):        
        #Define Hyperparameters
        self.inputLayerSize = 2
        self.outputLayerSize = 1
        self.hiddenLayerSize = 3
        
        #Weights (parameters)
        self.W1 = np.random.randn(self.inputLayerSize,self.hiddenLayerSize)
        self.W2 = np.random.randn(self.hiddenLayerSize,self.outputLayerSize)
        
    def forward(self, X):
        #Propogate inputs though network
        self.z2 = np.dot(X, self.W1)
        self.a2 = self.sigmoid(self.z2)
        self.z3 = np.dot(self.a2, self.W2)
        yHat = self.sigmoid(self.z3) 
        return yHat
        
    def sigmoid(self, z):
        #Apply sigmoid activation function to scalar, vector, or matrix
        return 1/(1+np.exp(-z))
    
    def sigmoidPrime(self,z):
        #Gradient of sigmoid
        return np.exp(-z)/((1+np.exp(-z))**2)
    
    def costFunction(self, X, y):
        #Compute cost for given X,y, use weights already stored in class.
        self.yHat = self.forward(X)
        J = 0.5*sum((y-self.yHat)**2)
        return J
        
    def costFunctionPrime(self, X, y):
        #Compute derivative with respect to W and W2 for a given X and y:
        self.yHat = self.forward(X)
        
        delta3 = np.multiply(-(y-self.yHat), self.sigmoidPrime(self.z3))
        dJdW2 = np.dot(self.a2.T, delta3)
        
        delta2 = np.dot(delta3, self.W2.T)*self.sigmoidPrime(self.z2)
        dJdW1 = np.dot(X.T, delta2)  
        
        return dJdW1, dJdW2
    
    #Helper Functions for interacting with other classes:
    def getParams(self):
        #Get W1 and W2 unrolled into vector:
        params = np.concatenate((self.W1.ravel(), self.W2.ravel()))
        return params
    
    def setParams(self, params):
        #Set W1 and W2 using single paramater vector.
        W1_start = 0
        W1_end = self.hiddenLayerSize * self.inputLayerSize
        self.W1 = np.reshape(params[W1_start:W1_end], (self.inputLayerSize , self.hiddenLayerSize))
        W2_end = W1_end + self.hiddenLayerSize*self.outputLayerSize
        self.W2 = np.reshape(params[W1_end:W2_end], (self.hiddenLayerSize, self.outputLayerSize))
        
    def computeGradients(self, X, y):
        dJdW1, dJdW2 = self.costFunctionPrime(X, y)
        return np.concatenate((dJdW1.ravel(), dJdW2.ravel()))
    
from scipy import optimize
class trainer(object):
    def __init__(self, N):
        #Make Local reference to network:
        self.N = N
        
    def callbackF(self, params):
        self.N.setParams(params)
        self.J.append(self.N.costFunction(self.X, self.y))   
        
    def costFunctionWrapper(self, params, X, y):
        self.N.setParams(params)
        cost = self.N.costFunction(X, y)
        grad = self.N.computeGradients(X,y)
        
        return cost, grad
        
    def train(self, X, y):
        #Make an internal variable for the callback function:
        self.X = X
        self.y = y

        #Make empty list to store costs:
        self.J = []
        
        params0 = self.N.getParams()

        options = {'maxiter': 200, 'disp' : True}
        _res = optimize.minimize(self.costFunctionWrapper, params0, jac=True, method='BFGS', \
                                 args=(X, y), options=options, callback=self.callbackF)

        self.N.setParams(_res.x)
        self.optimizationResults = _res

In [121]:
NN = Neural_Network()
T = trainer(NN)
T.train(X,y.reshape(-1, 1))

         Current function value: 256.295819
         Iterations: 91
         Function evaluations: 226
         Gradient evaluations: 214


In [122]:
print("Predicted Output: \n" + str(NN.forward(X))) 
print("Loss: \n" + str(np.mean(np.square(y - NN.forward(X))))) # mean sum squared loss

Predicted Output: 
[[0.94741067]
 [0.9477702 ]
 [0.94741067]
 ...
 [0.94741067]
 [0.94741067]
 [0.9477702 ]]
Loss: 
0.44874084240820267


P.S. Don't try candy gummy bears. They're disgusting. 

## 3. Keras MMP <a id="Q3"></a>

Implement a Multilayer Perceptron architecture of your choosing using the Keras library. Train your model and report its baseline accuracy. Then hyperparameter tune at least two parameters and report your model's accuracy.
Use the Heart Disease Dataset (binary classification)
Use an appropriate loss function for a binary classification task
Use an appropriate activation function on the final layer of your network.
Train your model using verbose output for ease of grading.
Use GridSearchCV or RandomSearchCV to hyperparameter tune your model. (for at least two hyperparameters)
When hyperparameter tuning, show you work by adding code cells for each new experiment.
Report the accuracy for each combination of hyperparameters as you test them so that we can easily see which resulted in the highest accuracy.
You must hyperparameter tune at least 3 parameters in order to get a 3 on this section.

In [123]:
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.read_csv('https://raw.githubusercontent.com/ryanleeallred/datasets/master/heart.csv')
df = df.sample(frac=1)
print(df.shape)
df.head()

(303, 14)


Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
287,57,1,1,154,232,0,0,164,0,0.0,2,1,2,0
77,59,1,1,140,221,0,1,164,1,0.0,2,0,2,1
47,47,1,2,138,257,0,0,156,0,0.0,2,0,2,1
68,44,1,1,120,220,0,1,170,0,0.0,2,0,2,1
33,54,1,2,125,273,0,0,152,0,0.5,0,1,2,1


In [161]:
import pandas as pd
import numpy as np
import sklearn
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder, LabelEncoder
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from tensorflow import keras
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.wrappers.scikit_learn import KerasClassifier

In [162]:
features = df.columns[1:len(df)-1]
target = 'target'
X = df[features]
y = df[target]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=True, random_state=42)

In [163]:
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

In [197]:
def keras_clf(hidden_layers=[64, 64, 64], dropout_rate=0, optimizer='adam', activation='relu', input_dim=12545, output_dim=1):
    model = Sequential()
    for index, layers in enumerate(hidden_layers):
        #Input layer
        if index == 0:
            model.add(Dense(layers, input_dim=input_dim, activation=activation))
        #Hidden layers
        else:
            model.add(Dense(layers, activation=activation))
        #Dropout
        if dropout_rate:
            model.add(Dropout(rate=dropout_rate))
    #Output layer
    if output_dim > 2:
        model.add(Dense(output_dim, activation='softmax'))
        loss='categorical_crossentropy'
    else:
        model.add(Dense(output_dim, activation=activation))
        loss='binary_crossentropy'
    model.compile(loss=loss, optimizer=optimizer, metrics=['accuracy'])    
    return model

In [200]:
history = model.fit(X_train, y_train, validation_split=.1, verbose=0, batch_size=40, epochs=100)

In [201]:
scores = model.evaluate(X_test, y_test)



In [202]:
f'{model.metrics_names[1]}: {scores[1]}'

'acc: 1.0'

In [203]:
input_dim=X_train.shape[1]
output_dim=1

In [204]:
model_keras = KerasClassifier(build_fn=keras_clf,
                              input_dim=input_dim,
                              output_dim=output_dim)

In [205]:
range_epochs = range(10, 30, 4)
range_layers = [[64, 64, 64, 64], [32, 32, 32, 32, 32], [100, 100, 100]]
range_batchsize = range(20, 20000, 2000)
range_optimizer = ['sgd', 'rmsprop', 'adagrad', 'adadelta', 'adam', 'adamax', 'nadam']
range_activations = ['relu', 'tanh', 'sigmoid']
range_dropouts = [n/100 for n in range(20, 50)]

keras_params_options = {
    'epochs': range_epochs,
    'dropout_rate': range_dropouts,
    'batch_size': range_batchsize,
    'optimizer': range_optimizer,
    'activation': range_activations,
#     'hidden_layers': range_layers
}

In [206]:
search = RandomizedSearchCV(model_keras,
                             param_distributions=keras_params_options,
                             cv=3,
                             n_iter=2,
                             n_jobs=1,
                             verbose=1)
best = search.fit(X_train, y_train)

Fitting 3 folds for each of 2 candidates, totalling 6 fits


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


Epoch 1/14
Epoch 2/14
Epoch 3/14
Epoch 4/14
Epoch 5/14
Epoch 6/14
Epoch 7/14
Epoch 8/14
Epoch 9/14
Epoch 10/14
Epoch 11/14
Epoch 12/14
Epoch 13/14
Epoch 14/14
Epoch 1/14
Epoch 2/14
Epoch 3/14
Epoch 4/14
Epoch 5/14
Epoch 6/14
Epoch 7/14
Epoch 8/14
Epoch 9/14
Epoch 10/14
Epoch 11/14
Epoch 12/14
Epoch 13/14
Epoch 14/14
Epoch 1/14
Epoch 2/14
Epoch 3/14
Epoch 4/14
Epoch 5/14
Epoch 6/14
Epoch 7/14
Epoch 8/14
Epoch 9/14
Epoch 10/14
Epoch 11/14
Epoch 12/14
Epoch 13/14
Epoch 14/14
Epoch 1/14
Epoch 2/14
Epoch 3/14
Epoch 4/14
Epoch 5/14
Epoch 6/14
Epoch 7/14
Epoch 8/14
Epoch 9/14
Epoch 10/14
Epoch 11/14
Epoch 12/14
Epoch 13/14
Epoch 14/14
Epoch 1/14
Epoch 2/14
Epoch 3/14
Epoch 4/14
Epoch 5/14
Epoch 6/14
Epoch 7/14
Epoch 8/14
Epoch 9/14
Epoch 10/14
Epoch 11/14
Epoch 12/14
Epoch 13/14
Epoch 14/14
Epoch 1/14
Epoch 2/14
Epoch 3/14
Epoch 4/14
Epoch 5/14
Epoch 6/14
Epoch 7/14
Epoch 8/14
Epoch 9/14
Epoch 10/14
Epoch 11/14
Epoch 12/14
Epoch 13/14
Epoch 14/14


[Parallel(n_jobs=1)]: Done   6 out of   6 | elapsed:   23.3s finished


Epoch 1/14
Epoch 2/14
Epoch 3/14
Epoch 4/14
Epoch 5/14
Epoch 6/14
Epoch 7/14
Epoch 8/14
Epoch 9/14
Epoch 10/14
Epoch 11/14
Epoch 12/14
Epoch 13/14
Epoch 14/14


In [208]:
best.best_estimator_.score(X_test, y_test)



0.8032787