# Non-Linear Decision Boundaries
Example: XOR Gate
<div>
<img src="https://cdn-images-1.medium.com/max/1600/1*CyGlr8VjwtQGeNsuTUq3HA.jpeg" alt="XOR Decision Boundary" width="75%" style="display: inline-flex;"/>
</div>

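XOR is not linearly separable, so a single perceptron cannot represent it. We get around this by building XOR from OR, NAND, and AND gates, each of which is linearly separable:  
$a \oplus b = (a + b) \cdot (\overline{a \cdot b})$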
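
The gate decomposition above can be checked with a short truth table. This is a minimal sketch; the function name `xor_from_gates` is made up for illustration.

```python
from itertools import product


def xor_from_gates(a: int, b: int) -> int:
    """Compute XOR as AND(OR(a, b), NAND(a, b)) for inputs in {0, 1}."""
    or_out = a | b            # OR gate: (a + b)
    nand_out = 1 - (a & b)    # NAND gate: NOT (a AND b)
    return or_out & nand_out  # AND gate combines the two


# Build the full truth table over both binary inputs
truth_table = {(a, b): xor_from_gates(a, b) for a, b in product([0, 1], repeat=2)}
for (a, b), out in truth_table.items():
    print(f"{a} XOR {b} = {out}")
```

Each of the three gates is linearly separable on its own, which is why a network with one hidden layer can represent their combination.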


# Feed-Forward and Feedback ANNs
#### Components
* Model's architecture/topology
  * Describes the types of neurons
  * Structure of the connections between them
* Activation functions
* Learning algorithm:
  * Finds the optimal values of the weights
  
#### Feed-Forward Neural Networks
* Defined by their acyclic graphs
* Information travels in one direction, towards the output
* Learn a function to map input to output

#### Feedback (recurrent) Neural Networks
* Contain cycles
* Network's behavior changes over time due to feedback
* Used for processing sequences of inputs, such as translating documents and transcription
* Not implemented in Scikit-Learn
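
To make the contrast concrete, here is a minimal NumPy sketch of one feed-forward layer next to one recurrent step. The weights are random placeholders, and the names `W_in`, `W_rec`, `feed_forward_step`, and `recurrent_step` are assumptions made for illustration, not part of any library.

```python
import numpy as np

rng = np.random.default_rng(0)
W_in = rng.normal(size=(3, 2))   # input -> hidden weights (placeholder values)
W_rec = rng.normal(size=(3, 3))  # hidden -> hidden weights: the feedback connection


def feed_forward_step(x):
    # Output depends only on the current input
    return np.tanh(W_in @ x)


def recurrent_step(x, h_prev):
    # Output depends on the current input AND the previous hidden state
    return np.tanh(W_in @ x + W_rec @ h_prev)


x = np.array([1.0, 0.5])

# Feed-forward: the same input always gives the same output
ff_first = feed_forward_step(x)
ff_second = feed_forward_step(x)

# Recurrent: the same input gives different outputs as the state evolves
h1 = recurrent_step(x, np.zeros(3))
h2 = recurrent_step(x, h1)

print(ff_first, ff_second)
print(h1, h2)
```

Feeding the same input twice leaves the feed-forward output unchanged, while the recurrent output moves because the hidden state is fed back in.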

# Multi-Layer Perceptrons
<img src="https://www.oreilly.com/library/view/getting-started-with/9781786468574/graphics/B05474_04_05.jpg" alt="drawing" width="75%" style="display: inline-flex;"/>

#### Training MLPs
* Use gradient descent to minimize a real-valued function, $C$
  * $C = f(v_1, v_2)$
  * A small change in the variables produces a small change in the output
  * Change in value of $v_1 = \Delta v_1$
  * Change in value of $v_2 = \Delta v_2$
  * Change in value of $C = \Delta C$
  * $\Delta C \approx \frac{\partial C}{\partial v_1}\Delta v_1 +
  \frac{\partial C}{\partial v_2}\Delta v_2$
    * Represent the changes as a vector: $\Delta v = (\Delta v_1, \Delta v_2)^T$
    * Gradient vector of $C$:
    $\nabla C = (\frac{\partial C}{\partial v_1},\frac{\partial C}{\partial v_2})^T$
    * Rewritten: $\Delta C \approx \nabla C \cdot \Delta v$
    * On each iteration, $\Delta C$ should be negative to decrease the value
    of the cost function. To do this we set $\Delta v$ to the following:
      * $\Delta v = -\eta\nabla C$, where $\eta$ (eta) is a hyperparameter called
      the learning rate
      * Then $\Delta C \approx -\eta\nabla C \cdot\nabla C = -\eta\|\nabla C\|^2$, and since $\|\nabla C\|^2 \geq 0$, $\Delta C \leq 0$
  * In words: multiply $\|\nabla C\|^2$ by the learning rate
  * Negate the product to get the (negative) change in $C$
  * Calculate the gradient of $C$ on each iteration and update the variables: $v \rightarrow v - \eta\nabla C$
* Calculating Error with Squared Error Cost Function
  * $E = \frac{1}{2}\sum_{i=1}^n (y_i - \hat{y}_i)^2$, where $y_i$ is the true value and $\hat{y}_i$ is the prediction
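
The update rule $\Delta v = -\eta\nabla C$ can be tried on a toy cost function. This is a minimal sketch: the cost $C(v_1, v_2) = v_1^2 + v_2^2$, the starting point, and the learning rate are all illustrative assumptions, not values from the notes above.

```python
import numpy as np


def cost(v):
    # Illustrative cost: C(v1, v2) = v1^2 + v2^2, minimized at (0, 0)
    return float(np.sum(v ** 2))


def grad(v):
    # Gradient of the illustrative cost: (2*v1, 2*v2)
    return 2 * v


eta = 0.1                   # learning rate (assumed value)
v = np.array([3.0, -4.0])   # arbitrary starting point
costs = [cost(v)]

for _ in range(50):
    v = v - eta * grad(v)   # v -> v - eta * grad C
    costs.append(cost(v))

print('Initial cost: %s' % costs[0])
print('Final cost: %s' % costs[-1])
```

With a learning rate this small the cost falls on every iteration; a learning rate that is too large can overshoot the minimum and make the cost grow instead.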
  
_*Note: Applying these equations by hand is quite lengthy.*_
_*It is recommended to work through an example that applies the activation*_
_*function and then uses the resulting error to adjust the weights.*_
_*It's simply too much to write out in markdown here.*_
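
As a much smaller stand-in for a full worked example, the sketch below trains a single sigmoid neuron with the squared error cost and gradient descent. It is not backpropagation through a multi-layer network. The OR-gate data, the learning rate, and the iteration count are assumptions chosen for illustration.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


# OR gate: linearly separable, so one neuron is enough (illustrative choice)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 1], dtype=float)

w = np.zeros(2)  # weights
b = 0.0          # bias
eta = 1.0        # learning rate (assumed value)

errors = []
for _ in range(5000):
    # Forward pass: apply the activation function
    y_hat = sigmoid(X @ w + b)
    # Squared error cost: E = 1/2 * sum((y - y_hat)^2)
    errors.append(0.5 * np.sum((y - y_hat) ** 2))
    # dE/dz for each sample, using sigmoid'(z) = y_hat * (1 - y_hat)
    delta = (y_hat - y) * y_hat * (1 - y_hat)
    # Gradient descent update of the weights and bias
    w -= eta * (X.T @ delta)
    b -= eta * np.sum(delta)

final_predictions = (sigmoid(X @ w + b) > 0.5).astype(int)
print('Error before training: %s' % errors[0])
print('Error after training: %s' % errors[-1])
print('Predictions: %s' % final_predictions)
```

A multi-layer network repeats the same pattern, with the error term passed backwards through each layer by the chain rule.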

In [2]:
# Training an MLP to approximate XOR
from sklearn.neural_network import MLPClassifier

# Define XOR data
y = [0, 1, 1, 0]
X = [[0, 0], [0, 1], [1, 0], [1, 1]]

# Uses logistic sigmoid activation function
# 2 units in one hidden layer
clf = MLPClassifier(solver='lbfgs', activation='logistic',
                    hidden_layer_sizes=(2,), max_iter=100,
                    random_state=20)
clf.fit(X, y)

predictions = clf.predict(X)
print('Accuracy: %s' % clf.score(X, y))
for i, p in enumerate(predictions[:10]):
    print('True: %s, Predicted: %s' % (y[i], p))

Accuracy: 1.0
True: 0, Predicted: 0
True: 1, Predicted: 1
True: 1, Predicted: 1
True: 0, Predicted: 0


In [4]:
print('Weights connecting the input layer and the hidden layer: \n%s' % clf.coefs_[0])
print('Hidden layer bias weights: \n%s' % clf.intercepts_[0])
print('Weights connecting the hidden layer and the output layer: \n%s' % clf.coefs_[1])
print('Output layer bias weight: \n%s' % clf.intercepts_[1])

Weights connecting the input layer and the hidden layer: 
[[6.11803938 6.35656371]
 [5.79147868 6.14551925]]
Hidden layer bias weights: 
[-9.38637917 -2.77751758]
Weights connecting the hidden layer and the output layer: 
[[-14.9548178]
 [ 14.5308097]]
Output layer bias weight: 
[-7.22845316]
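
To see how these numbers produce the XOR predictions, the forward pass can be reproduced by hand with NumPy. This is a sketch that copies the printed weights above and assumes logistic activations in both the hidden and output layers, matching `activation='logistic'` and a binary target.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


# Weights and biases copied from the printed output above
W_hidden = np.array([[6.11803938, 6.35656371],
                     [5.79147868, 6.14551925]])
b_hidden = np.array([-9.38637917, -2.77751758])
W_out = np.array([[-14.9548178],
                  [14.5308097]])
b_out = np.array([-7.22845316])

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])

# Input layer -> hidden layer
hidden = sigmoid(X @ W_hidden + b_hidden)
# Hidden layer -> output layer, giving the probability of class 1
prob = sigmoid(hidden @ W_out + b_out).ravel()
manual_predictions = (prob > 0.5).astype(int)
print(manual_predictions)
```

The first hidden unit only fires strongly when both inputs are 1, and its large negative output weight cancels the second unit in exactly that case, which is what produces the XOR pattern.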


In [5]:
# Training a multi-layer perceptron to classify handwritten digits
from sklearn.datasets import load_digits
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPClassifier

# The main guard is needed because cross-validation with n_jobs=-1 may start additional processes
if __name__ == '__main__':
    # Load scikit-learn's built-in 8x8 handwritten digits dataset (not the full MNIST dataset)
    digits = load_digits()
    X = digits.data
    y = digits.target
    # Create a pipeline which scales the features before fitting an MLPClassifier;
    # scaling is particularly useful for helping ANNs converge more quickly.
    # The MLP contains an input layer, two hidden layers with 150 and 100 units respectively,
    # and an output layer. The regularisation parameter (alpha) is increased to 0.1 and
    # the maximum number of iterations is increased to 300.
    pipeline = Pipeline([
        ('ss', StandardScaler()),
        ('mlp', MLPClassifier(hidden_layer_sizes=(150, 100), alpha=0.1, max_iter=300, random_state=20))
    ])
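    # cv=3 matches the three fold scores in the recorded output (newer scikit-learn versions default to 5 folds)
    print(cross_val_score(pipeline, X, y, cv=3, n_jobs=-1))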

[0.95348837 0.96160267 0.90604027]


### Adding more hidden units or hidden layers, or using grid search to tune the hyperparameters, may help increase accuracy.
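
A grid search over the pipeline could look like the sketch below. The parameter grid is a small, hypothetical set of values chosen to keep the run short; it is not a recommended search space.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_digits(return_X_y=True)

pipeline = Pipeline([
    ('ss', StandardScaler()),
    ('mlp', MLPClassifier(max_iter=300, random_state=20)),
])

# Hypothetical grid: pipeline parameters are addressed as <step name>__<parameter>
param_grid = {
    'mlp__hidden_layer_sizes': [(50,), (100,)],
    'mlp__alpha': [0.01, 0.1],
}

# Fit every combination with 3-fold cross-validation and keep the best one
grid = GridSearchCV(pipeline, param_grid, cv=3)
grid.fit(X, y)

print('Best parameters: %s' % grid.best_params_)
print('Best cross-validated accuracy: %s' % grid.best_score_)
```

Because the scaler sits inside the pipeline, it is refit on each training fold, so the validation folds do not leak into the scaling statistics.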