<a href="https://colab.research.google.com/github/rahiakela/natural-language-processing-in-action/blob/5-baby-steps-with-neuralnetworks/perceptrons_and_backpropagation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Baby steps with neural networks (perceptrons and backpropagation)

## Neural networks

As the availability of processing power and memory has exploded over the course of the decade, an old technology has come into its own again. First proposed in the 1950s by Frank Rosenblatt, the perceptron offered a novel algorithm for finding patterns in data.

The basic concept lies in a rough mimicry of the operation of a living neuron cell.
As electrical signals flow into the cell through the dendrites (see figure) into the
nucleus, an electric charge begins to build up. When the cell reaches a certain level of
charge, it fires, sending an electrical
signal out through the axon.

<img src='https://github.com/rahiakela/img-repo/blob/master/perceptron-1.JPG?raw=1' width='800'/>

The key concept to notice here is the way the cell weights incoming signals
when deciding when to fire. The neuron will dynamically change those weights
in the decision-making process over the course of its life.

## Perceptron

Rosenblatt’s original project was to teach a machine to recognize images. The original
perceptron was a conglomeration of photo-receptors and potentiometers, not a computer
in the current sense. But implementation specifics aside, Rosenblatt’s concept
was to take the features of an image and assign a weight, a measure of importance, to
each one. The features of the input image were each a small subsection of the image.

A grid of photo-receptors would be exposed to the image. Each receptor would see
one small piece of the image. The brightness of the image that a particular photoreceptor
could see would determine the strength of the signal that it would send to
the associated “dendrite.”

Each dendrite had an associated weight in the form of a potentiometer. Once
enough signal came in, it would pass the signal into the main body of the “nucleus” of
the “cell.” Once enough of those signals from all the potentiometers passed a certain
threshold, the perceptron would fire down its axon, indicating a positive match on the
image it was presented with. If it didn’t fire for a given image, that was a negative classification
match. Think “hot dog, not hot dog” or “iris setosa, not iris setosa.”

## A numerical perceptron

### Motivation

Basically, you’d like to take an example from a dataset, show it to an algorithm, and
have the algorithm say yes or no. That’s all you’re doing so far. The first piece you
need is a way to determine the features of the sample. Choosing appropriate features
turns out to be a surprisingly challenging part of machine learning.

 In “normal”
machine learning problems, like predicting home prices, your features might be
square footage, last sold price, and ZIP code. 

Or perhaps you’d like to predict the species
of a certain flower using the Iris dataset. In that case your features would be petal
length, petal width, sepal length, and sepal width.

In Rosenblatt’s experiment, the features were the intensity values of each pixel
(subsections of the image), one pixel per photo receptor. You then need a set of
weights to assign to each of the features. 

Don’t worry yet about where these weights
come from. Just think of them as a percentage of the signal to let through into the
neuron. If you’re familiar with linear regression, then you probably already know
where these weights come from.

Generally, you’ll see the individual features denoted as $x_i$, where $i$ is a
reference integer. And the collection of all features for a given example is
denoted as $X$, representing a vector:

$$X = [x_1, x_2, \ldots, x_i, \ldots, x_n]$$

And similarly, you’ll see the associated weights for each feature as $w_i$, where $i$
corresponds to the index of the feature $x_i$ associated with that weight. And the
weights are generally represented as a vector $W$:

$$W = [w_1, w_2, \ldots, w_i, \ldots, w_n]$$

With the features in hand, you just multiply each feature ($x_i$) by the corresponding
weight ($w_i$) and then sum up:

$$(x_1 \cdot w_1) + (x_2 \cdot w_2) + \ldots + (x_i \cdot w_i) + \ldots + (x_n \cdot w_n)$$

The one piece you’re missing here is the neuron’s threshold to fire or not. And it’s
just that, a threshold. Once the weighted sum is above a certain threshold, the perceptron
outputs 1. Otherwise it outputs 0.

You can represent this threshold with a simple step function (labeled “Activation
Function” in figure).

<img src='https://github.com/rahiakela/img-repo/blob/master/perceptron-2.JPG?raw=1' width='800'/>

### Detour through bias

The bias is an “always on”
input to the neuron. The neuron has a weight dedicated to it just as with every other
element of the input, and that weight is trained along with the others in the exact
same way. This is represented in two ways in the literature around neural networks.
You may see the input represented as the base input vector, say of n elements, with a 1 appended to the beginning or the end of the vector, giving you an (n+1)-dimensional
vector. The position of the 1 is irrelevant to the network, as long as it’s consistent
across all of your samples. Alternatively, the bias is left out of the input vector and its
weight is kept as a separate term that is added to the weighted sum, which is how the code
in this notebook handles it.

The reason for having the bias weight at all is that you need the neuron to be resilient
to inputs of all zeros. It may be the case that the network needs to learn to output
0 in the face of inputs of 0, but it may not. Without the bias term, the neuron would
output 0 * weight = 0 for any weights you started with or tried to learn. With the bias
term, you won’t have this problem.
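A minimal sketch of that point, using made-up numbers (`zero_input`, `weights`, and `bias_weight` below are illustrative values, not taken from a trained model): with an all-zero input the weighted sum is zero whatever the weights are, and only the bias term can move the activation level.

```python
import numpy as np

# Illustrative values only; any weights give the same result for a zero input.
zero_input = np.zeros(3)
weights = np.array([0.7, -1.3, 2.5])
bias_weight = 0.6

# Without a bias term the weighted sum of an all-zero input is always 0.
activation_without_bias = np.dot(zero_input, weights)

# With a bias term (its input is always 1) the activation can be nonzero.
activation_with_bias = np.dot(zero_input, weights) + (bias_weight * 1)

print(activation_without_bias)  # 0.0
print(activation_with_bias)     # 0.6
```

Whether the neuron should fire on an all-zero input is then something training can decide by adjusting `bias_weight`.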

The figure below is a rather neat visualization of the analogy between some of the signals
within a biological neuron in your brain and the signals of an artificial neuron used for
deep learning.

<img src='https://github.com/rahiakela/img-repo/blob/master/perceptron-3.JPG?raw=1' width='800'/>

And in mathematical terms, the output of your perceptron, denoted f(x), looks like:

<img src='https://github.com/rahiakela/img-repo/blob/master/perceptron-4.JPG?raw=1' width='800'/>
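For readers who cannot see the image, one common way to write a thresholded perceptron of this kind is sketched here. The symbols $t$ (the firing threshold) and $w_b$ (the bias weight) are introduced only for this illustration, and the figure may use different notation:

$$f(x) = \begin{cases} 1 & \text{if } \sum_{i=1}^{n} x_i w_i + w_b > t \\ 0 & \text{otherwise} \end{cases}$$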

Your perceptron hasn’t learned anything just yet. But you have achieved something
quite important. You’ve passed data into a model and received an output. That output
is likely wrong, given you said nothing about where the weight values come from. But
this is where things will get interesting.

### A Pythonic neuron

Calculating the output of the neuron described earlier is straightforward in Python.
You can also use the numpy dot function to multiply your two vectors together:

In [5]:
import numpy as np

# raw data
example_input = [1, .2, .1, .05, .2]
example_weights = [.2, .12, .4, .6, .90]

# convert into numpy array
input_vector = np.array(example_input)
weights = np.array(example_weights)

# bias term
bias_weight = .2

# The multiplication by one (* 1) is just to emphasize that the bias_weight is like all the other weights:
# it’s multiplied by an input value, only the bias_weight input feature value is always 1.
activation_level = np.dot(input_vector, weights) + (bias_weight * 1)
activation_level

0.6740000000000002

With that, if you use a simple threshold activation function and choose a threshold of
.5, your next step is the following:

In [6]:
threshold = .5
 
if activation_level >= threshold:
   perceptron_output = 1
else:
   perceptron_output = 0

# see the result
perceptron_output

1

Given the example_input and that particular set of weights, this perceptron will output 1. But if you have several example_input vectors and the expected outcome associated with each (a labeled dataset), you can decide whether the perceptron is correct or
not for each guess.

### Class is in session

The perceptron learns by altering the weights up or down as a function of how
wrong the system’s guess was for a given input. But from where does it start? The
weights of an untrained neuron start out random! Random values, near zero, are usually
chosen from a normal distribution.
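As a rough sketch of that kind of initialization, assuming NumPy's `default_rng` generator and a standard deviation of 0.01 (the seed, the spread, and the name `num_features` are arbitrary choices made for this illustration):

```python
import numpy as np

# A seeded generator keeps the sketch reproducible; the seed value is arbitrary.
rng = np.random.default_rng(seed=42)

num_features = 5  # hypothetical input size

# Draw weights from a normal distribution centered on 0 with a small spread.
initial_weights = rng.normal(loc=0.0, scale=0.01, size=num_features)
initial_bias_weight = rng.normal(loc=0.0, scale=0.01)

print(initial_weights)
print(initial_bias_weight)
```

The OR example later in this notebook uses small uniform random values instead, which serves the same purpose of starting near zero.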

And from there you can start to learn. Many different samples are shown to the system,
and each time the weights are readjusted a small amount based on whether the
neuron output was what you wanted or not. With enough examples (and under the
right conditions), the error should tend toward zero, and the system learns.

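The trick is, and this is the key to the whole concept, that each weight is adjusted by
how much it contributed to the resulting error. A larger weight (which lets that data
point affect the result more) should be blamed more for the rightness/wrongness of
the perceptron’s output for that given input. In the update below, the adjustment to
each weight is the error (expected output minus actual output) multiplied by the input
value that weight was applied to.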

In [7]:
expected_output = 0
new_weights = []

#for i, x in enumerate(example_input):
#  # For example, in the first index above: new_weight = .2 + (0 - 1) * 1 = -0.8
#  new_weights.append(weights[i] + (expected_output - perceptron_output) * x)

new_weights = [weights[i] + (expected_output - perceptron_output) * x for i, x in enumerate(example_input)]
weights = np.array(new_weights)

# Original weights
example_weights

[0.2, 0.12, 0.4, 0.6, 0.9]

In [8]:
# New weights
weights

array([-0.8 , -0.08,  0.3 ,  0.55,  0.7 ])

This process of exposing the network over and over to the same training set can,
under the right circumstances, lead to an accurate predictor even on input that the
perceptron has never seen.

### Logic is a fun thing to learn

Let’s try to get the computer to understand the concept of logical OR. If either
one side or the other of the expression is true (or both sides are), the logical OR statement
is true. Simple enough. For this toy problem, you can easily model every possible
example by hand (this is rarely the case in reality). Each sample consists of two signals,
each of which is either true (1) or false (0).

In [0]:
sample_data = [ 
    [0, 0],    # False, False
    [0, 1],    # False, True
    [1, 0],    # True, False
    [1, 1]     # True, True
]

expected_results = [
    0,    # (False OR False) gives False
    1,    # (False OR True ) gives True
    1,    # (True OR False) gives True
    1,    # (True OR True ) gives True                
]

activation_threshold = 0.5

You need a few tools to get started: numpy to do the vector (array) multiplication
and, through its random module, to initialize the weights.

In [10]:
weights = np.random.random(2) / 1000   # Small random float 0 <= w < .001
weights

array([0.00032362, 0.00043451])

You need a bias weight as well.

In [11]:
bias_weight = np.random.random() / 1000
bias_weight

0.0007900798942765753

Then you can pass it through your pipeline and get a prediction for each of your four
samples.

In [12]:
for idx, sample in enumerate(sample_data):
  input_vector = np.array(sample)
  activation_level = np.dot(input_vector, weights) + (bias_weight * 1)
  if activation_level > activation_threshold:
    perceptron_output = 1
  else:
    perceptron_output = 0
  print('Predicted {}'.format(perceptron_output))
  print('Expected: {}'.format(expected_results[idx]))
  print()    

Predicted 0
Expected: 0

Predicted 0
Expected: 1

Predicted 0
Expected: 1

Predicted 0
Expected: 1



Your random weight values didn’t help your little neuron out that much—one right
and three wrong. Let’s send it back to school. Instead of just printing 1 or 0, you’ll
update the weights at each iteration.

In [13]:
for iteration_num in range(5):
  correct_answers = 0

  for idx, sample in enumerate(sample_data):
    input_vector = np.array(sample)
    activation_level = np.dot(input_vector, weights) + (bias_weight * 1)

    if activation_level > activation_threshold:
      perceptron_output = 1
    else:
      perceptron_output = 0

    if perceptron_output == expected_results[idx]:
      correct_answers += 1

    # This is where the magic happens.
    new_weights = [weights[i] + (expected_results[idx] - perceptron_output) * x for i, x in enumerate(sample)]
    # The bias weight is updated as well, just like those associated with the inputs.
    bias_weight = bias_weight + ((expected_results[idx] - perceptron_output) * 1)
    weights = np.array(new_weights)

  # Report once per pass over the dataset, after all four samples have been seen.
  print('{} correct answers out of 4, for iteration {}'.format(correct_answers, iteration_num))

3 correct answers out of 4, for iteration 0
2 correct answers out of 4, for iteration 1
3 correct answers out of 4, for iteration 2
4 correct answers out of 4, for iteration 3
4 correct answers out of 4, for iteration 4


Haha! What a good student your little perceptron is. By updating the weights in the inner loop, the perceptron is learning from its experience of the dataset.

This is what is known as convergence. A model is said to converge when its error
function settles to a minimum, or at least a consistent value. Sometimes you’re not so lucky. 

Sometimes a neural network bounces around looking for optimal weights to satisfy
the relationships in a batch of data and never converges.

### Perceptron verdict

The basic perceptron has an inherent flaw. If the data isn’t linearly separable, or the
relationship cannot be described by a linear relationship, the model won’t converge
and won’t have any useful predictive power. It won’t be able to predict the target variable
accurately.

<img src='https://github.com/rahiakela/img-repo/blob/master/linearly-separable-data.JPG?raw=1' width='800'/>

Linearly separable data points are no problem for a perceptron.
Crossed-up data, on the other hand, will cause a single-neuron perceptron to forever spin its
wheels without learning to predict anything better than a random guess, a
flip of a coin. It’s not possible to draw a single line between your two classes (dots and Xs).

<img src='https://github.com/rahiakela/img-repo/blob/master/nonlinearly-separable-data.JPG?raw=1' width='800'/>

A perceptron finds a linear equation that describes the relationship between the features
of your dataset and the target variable in your dataset. In that sense it is doing
something very close to linear regression, with a threshold applied to the result.
A perceptron cannot describe a nonlinear equation or a nonlinear relationship.
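To make the limitation concrete, here is a small sketch that runs the same kind of training loop on logical OR (linearly separable) and on logical XOR (not linearly separable). The helper `train_perceptron` is written for this illustration, and it starts from zero weights rather than random ones so that the run is reproducible:

```python
import numpy as np

def train_perceptron(samples, targets, epochs=20, threshold=0.5):
    """Train a single threshold neuron; return correct-answer counts per epoch."""
    weights = np.zeros(len(samples[0]))
    bias_weight = 0.0
    history = []
    for _ in range(epochs):
        correct = 0
        for sample, target in zip(samples, targets):
            x = np.array(sample)
            output = 1 if np.dot(x, weights) + bias_weight > threshold else 0
            if output == target:
                correct += 1
            # Perceptron update: shift each weight by error * input.
            error = target - output
            weights = weights + error * x
            bias_weight = bias_weight + error * 1
        history.append(correct)
    return history

samples = [[0, 0], [0, 1], [1, 0], [1, 1]]
or_history = train_perceptron(samples, [0, 1, 1, 1])
xor_history = train_perceptron(samples, [0, 1, 1, 0])

print('OR :', or_history)
print('XOR:', xor_history)
```

The OR run settles at four correct answers out of four. The XOR run never gets there, because a pass with no mistakes would require a single fixed line that separates the two classes.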

A lot of relationships between data values aren’t linear, and there’s no good linear
regression or linear equation that describes those relationships. And many datasets
aren’t linearly separable into classes with lines or planes, because most data in the
world isn’t that clean.

But the perceptron idea didn’t die easily. It resurfaced when the Rumelhart-McClelland
collaboration (which Geoffrey Hinton was involved in) showed you
could use the idea to solve the XOR problem with multiple perceptrons in concert.

The key breakthrough by Rumelhart-McClelland was the discovery of a way to allocate
the error appropriately to each of the perceptrons. The way they did this was to use
an old idea called backpropagation. With this idea for backpropagation across layers
of neurons, the first modern neural network was born.
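As a sketch of what "multiple perceptrons in concert" can look like, the network below wires three threshold neurons together to compute XOR. The weights are chosen by hand for illustration and are not learned by backpropagation; the function names `step` and `xor_network` are made up for this example:

```python
import numpy as np

def step(activation_level, threshold=0.5):
    """Threshold activation: fire (1) above the threshold, otherwise 0."""
    return 1 if activation_level > threshold else 0

def xor_network(x1, x2):
    inputs = np.array([x1, x2])
    # Hidden neuron 1 behaves like OR: fires if at least one input is on.
    h_or = step(np.dot(inputs, [1.0, 1.0]))
    # Hidden neuron 2 behaves like NAND: fires unless both inputs are on.
    h_nand = step(np.dot(inputs, [-1.0, -1.0]) + 2.0)
    # Output neuron behaves like AND over the two hidden outputs.
    return step(np.dot([h_or, h_nand], [1.0, 1.0]) - 1.0)

results = [xor_network(a, b) for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]]
print(results)  # [0, 1, 1, 0]
```

Backpropagation matters because it gives a way to find weights like these automatically, by assigning each neuron its share of the output error.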

Even though they could solve complex (nonlinear) problems, neural networks were,
for a time, too computationally expensive.

They proved impractical for common use, and they found their
way back to the dusty shelves of academia and supercomputer experimentation. This
began the second “**AI Winter**” that lasted from around 1990 to about 2010. But eventually
computing power, backpropagation algorithms, and the proliferation of raw
data, like labeled images of cats and dogs, caught up. 

Computationally expensive algorithms and limited datasets were no longer show-stoppers. Thus the third age of neural networks began.


### Keras: Neural networks in Python

Writing a neural network in raw Python is a fun experiment and can be helpful in putting
all these pieces together, but Python is at a disadvantage regarding speed, and the
sheer number of calculations you’re dealing with can make even moderately sized networks intractable. Many Python libraries, though, get you around the speed problem:
PyTorch, Theano, TensorFlow, Lasagne, and many more.

Keras is a high-level wrapper with an accessible API for Python. The exposed API
can be used with three different backends almost interchangeably: Theano,
TensorFlow from Google, and CNTK from Microsoft. Each has its own low-level implementation
of the basic neural network elements and has highly tuned linear algebra
libraries to handle the dot products, making the matrix multiplications of neural networks
as efficient as possible.

Let’s look at the simple XOR problem and see if you can train a network using
Keras.

In [17]:
import numpy as np
from keras.models import Sequential
from keras.layers import Dense, Activation
from keras.optimizers import SGD

# Our examples for an exclusive OR.
x_train = np.array([
    [0, 0],
    [0, 1],
    [1, 0],
    [1, 1]                
])
y_train = np.array([
    [0],
    [1],
    [1],
    [0]                
])

# The fully connected hidden layer will have 10 neurons.
model = Sequential()
model.add(Dense(10, input_dim=2))
model.add(Activation('tanh'))
model.add(Dense(1))
# The output layer has one neuron to output a single binary classification value (0 or 1).
model.add(Activation('sigmoid'))
model.summary()

Model: "sequential_4"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_4 (Dense)              (None, 10)                30        
_________________________________________________________________
activation_3 (Activation)    (None, 10)                0         
_________________________________________________________________
dense_5 (Dense)              (None, 1)                 11        
_________________________________________________________________
activation_4 (Activation)    (None, 1)                 0         
Total params: 41
Trainable params: 41
Non-trainable params: 0
_________________________________________________________________


In [18]:
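# `lr` is the argument name in the Keras version used here; newer releases call it `learning_rate`.
sgd = SGD(lr=0.1)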
model.compile(loss='binary_crossentropy', optimizer=sgd, metrics=['accuracy'])




SGD is the stochastic gradient descent optimizer you imported. This is just how the
model will try to minimize the error, or loss. lr is the learning rate, the fraction applied
to the derivative of the error with respect to each weight. Higher values will speed up
learning, but may force the model away from the global minimum by shooting past the
goal; smaller values will be more precise but increase the training time and leave the
model more vulnerable to local minima.
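The effect of the learning rate can be sketched without Keras on a toy problem. The example below assumes a one-weight loss of `w**2` (chosen only because its derivative, `2*w`, is easy to follow) and a hypothetical helper `gradient_descent`; it is not how Keras implements SGD internally:

```python
def gradient_descent(learning_rate, start=1.0, steps=20):
    """Minimize the toy loss w**2, whose derivative with respect to w is 2*w."""
    w = start
    for _ in range(steps):
        gradient = 2 * w
        # Step against the gradient, scaled by the learning rate.
        w = w - learning_rate * gradient
    return w

small = gradient_descent(0.01)     # creeps slowly toward the minimum at 0
moderate = gradient_descent(0.1)   # gets much closer in the same number of steps
too_large = gradient_descent(1.1)  # overshoots on every step and moves away

print(small, moderate, too_large)
```

This toy loss has a single minimum, so it shows only the speed and overshoot trade-off, not the local-minimum behavior mentioned above.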

The loss function itself is also defined as a
parameter; here it’s binary_crossentropy. The metrics parameter is a list of
options for the output stream during training.

In [19]:
model.predict(x_train)

array([[0.5       ],
       [0.5662843 ],
       [0.6204096 ],
       [0.66136605]], dtype=float32)

The predict method gives the raw output of the last layer, which would be generated
by the sigmoid function in this example.

In [20]:
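# predict_classes exists in the Keras version used here; later releases removed it
# in favor of thresholding the output of predict at 0.5.
model.predict_classes(x_train)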

array([[0],
       [1],
       [1],
       [1]], dtype=int32)
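The class labels above are consistent with cutting the raw sigmoid outputs at 0.5. A small NumPy sketch of that step, assuming a strict "greater than 0.5" rule (an assumption that matches the outputs shown, where the first sample sits exactly at 0.5 and lands in class 0):

```python
import numpy as np

# Raw sigmoid outputs copied from the model.predict(x_train) result above.
raw_outputs = np.array(
    [[0.5], [0.5662843], [0.6204096], [0.66136605]], dtype=np.float32
)

# Anything strictly above 0.5 is assigned to class 1, everything else to class 0.
predicted_classes = (raw_outputs > 0.5).astype('int32')

print(predicted_classes.ravel())  # [0 1 1 1]
```

Compared with the XOR targets `[0, 1, 1, 0]`, the untrained model gets the last sample wrong, which is expected because no training has happened yet.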
