# CNNs!

CNNs achieve state-of-the-art results in a variety of problem areas, including voice user interfaces, natural language processing, and computer vision. Let's explore some examples:

- [WaveNet](https://deepmind.com/blog/article/wavenet-generative-model-raw-audio): AI trained to sing.
- [Text Classification](http://www.wildml.com/2015/12/implementing-a-cnn-for-text-classification-in-tensorflow/)
- [Facebook's novel CNN approach](https://engineering.fb.com/2017/05/09/ml-applications/a-novel-approach-to-neural-machine-translation/)
- [Play Atari games](https://deepmind.com/research/publications/human-level-control-through-deep-reinforcement-learning). The [code](https://sites.google.com/a/deepmind.com/dqn/) is also available.
- [Play pictionary](https://quickdraw.withgoogle.com/#) with a CNN
- Some of the world's most famous paintings have been turned into 3D for the visually impaired. Although the article does not mention how this was done, we note that it is possible to use a CNN to [predict depth](https://www.cs.nyu.edu/~deigen/depth/) from a single image.
- Check out [this research](https://research.googleblog.com/2017/03/assisting-pathologists-in-detecting.html) that uses CNNs to localize breast cancer.
- CNNs are used to [save endangered species](https://blogs.nvidia.com/blog/2016/11/04/saving-endangered-species/?adbsc=social_20170303_70517416)!
- An app called [FaceApp](http://www.digitaltrends.com/photography/faceapp-neural-net-image-editing/) uses a CNN to make you smile in a picture or change genders.

In general, CNNs can look at images as a whole and learn to identify patterns such as prominent colors and shapes, or whether a texture is fuzzy or smooth and so on. The shapes and colors that define any image and any object in an image are often called features. 

Let's cover how a CNN can learn to identify these features and how a CNN can be used for image classification. 


## What is a feature?

A helpful way to think about a **feature** is to consider what we are visually drawn to when we first see an object and when we tell different objects apart. For example, what do we look at to distinguish a cat from a dog? The shape of the eyes, the size, and how they move are just a couple of examples of visual features.

As another example, say we see a person walking toward us and we want to see if it is someone we know; we may look at their face, and beyond that at their general shape and eyes. The distinct shape of a person and their eye color are great examples of distinguishing features.


## MNIST Database

How can deep learning be used to recognize a single object in an image? The MNIST database contains 70,000 small gray scale images of handwritten digits (60,000 for training and 10,000 for testing). Each image depicts one of the numbers zero through nine. This database is perhaps one of the most famous in the field of machine and deep learning. It was one of the first databases used to prove the usefulness of neural networks, and it has continued to inform the development of new architectures over time.

Using deep learning, we can take a data-driven approach to training an algorithm that can examine these images and discover patterns that distinguish one item from another. Our algorithm will need to attain some level of understanding of how images of one item differ from images of another item. The first step in recognizing patterns in images is learning how images are seen by computers. Before we start to design algorithms, we first visualize the data and take a closer look at the images.

We can appreciate [this figure](https://www.kaggle.com/benhamner/popular-datasets-over-time) that shows datasets referenced over time in [NIPS](https://nips.cc/) papers.

Any gray scale image is interpreted by a computer as an array: a grid of values in which each cell is called a pixel, and each pixel has a numerical value:

<img src="assets/VisualizeData.png">

Each image in the MNIST database is 28 pixels high and 28 pixels wide, so a computer understands it as a 28 by 28 array. In a typical gray scale image, white pixels are encoded as the value 255, and black pixels are encoded as zero. Gray pixels fall somewhere in between, with light gray being closer to 255. Color images have similar numerical representations for each pixel color.

The MNIST images have actually gone through a quick pre-processing step. They have been rescaled so that each image has pixel values in a range from zero to one, as opposed to from 0-255:

<img src="assets/rescaledMNIST.png">

To go from a range of 0-255 to a range of zero to one, we have to divide every pixel value by 255. This step is known as normalization, and it is common practice in many deep learning techniques. Normalization helps our algorithm to train better.

We typically want normalized values because neural networks rely on gradient calculations. These networks are trying to learn how important, or how weighty, a certain pixel should be in determining the class of an image. Normalizing the pixel values helps these gradient calculations stay consistent, and not get so large that they slow down or prevent a network from training.
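As a minimal sketch of the rescaling step described above, the following plain-Python example divides every pixel by 255. The 3x3 image and the helper name `rescale_to_unit_range` are made up for illustration; they are not part of the MNIST tooling.

```python
# Hypothetical 3x3 gray scale image with raw pixel values in the range 0-255.
raw_image = [
    [0, 128, 255],
    [64, 192, 32],
    [255, 0, 16],
]


def rescale_to_unit_range(image, max_value=255):
    """Divide every pixel by max_value so that all values land in [0, 1]."""
    return [[pixel / max_value for pixel in row] for row in image]


scaled_image = rescale_to_unit_range(raw_image)

# Black (0) stays 0.0 and white (255) becomes 1.0.
print(scaled_image[0][0], scaled_image[0][2])  # 0.0 1.0
```

In practice a library usually performs this step for us when it loads the images, so the sketch is only meant to show the arithmetic.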

We already know one method of classification: the multi-layer perceptron (MLP). How might we input this image data into an MLP? Recall that MLPs only take vectors as input. In order to use an MLP with images, we first have to convert any image array into a vector. This process is called flattening. To understand this process, imagine we have a matrix with 16 pixel values (4x4). Instead of representing this as a 4x4 matrix, we can construct a vector with 16 entries, where the first four entries of our vector correspond to the first row of our old array. The second four entries correspond to the second row, and so on.

After we convert our images into vectors, they can be fed into the input layer of an MLP:

<img src="assets/flatteningImage7.png">

So, in the case of our MNIST images, which are 28x28 matrices, the flattened vector for each image will have 784 entries.
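The flattening step can be sketched in a few lines of plain Python. The helper `flatten` and the example grids below are illustrative assumptions, not code from the original material.

```python
def flatten(image):
    """Concatenate the rows of a 2D grid, top to bottom, into one flat list."""
    return [pixel for row in image for pixel in row]


# A 4x4 grid whose values make the row order easy to follow.
grid = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
    [13, 14, 15, 16],
]

vector = flatten(grid)
print(vector[:4])   # [1, 2, 3, 4]  (the first row)
print(vector[4:8])  # [5, 6, 7, 8]  (the second row)

# A blank image with MNIST's dimensions flattens to 784 entries.
mnist_sized_image = [[0.0] * 28 for _ in range(28)]
print(len(flatten(mnist_sized_image)))  # 784
```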

Data normalization is typically done by subtracting the mean (the average of all pixel values) from each pixel, and then dividing the result by the standard deviation of all the pixel values. Sometimes you'll see an approximation here, where we use a mean and standard deviation of 0.5 to center the pixel values. [Read more about the Normalize transformation in PyTorch](https://pytorch.org/docs/stable/torchvision/transforms.html#transforms-on-torch-tensor).

The distribution of such data should resemble a [Gaussian function](https://mathworld.wolfram.com/GaussianFunction.html) centered at zero. For image inputs we need the pixel numbers to be positive, so we often choose to scale the data to the normalized range `[0, 1]` instead.
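To make the two variants above concrete, here is a small sketch using only the Python standard library. The function names and the five sample pixel values are assumptions chosen for illustration; the second function only mirrors the arithmetic of using a mean and standard deviation of 0.5, it does not call PyTorch.

```python
import statistics


def standardize(pixels):
    """Subtract the mean and divide by the standard deviation of all pixels.

    Assumes the pixels are not all identical (otherwise the deviation is 0).
    """
    mean = statistics.fmean(pixels)
    std = statistics.pstdev(pixels)
    return [(p - mean) / std for p in pixels]


def normalize_with_fixed_stats(pixels, mean=0.5, std=0.5):
    """The approximation: use a fixed mean and standard deviation of 0.5."""
    return [(p - mean) / std for p in pixels]


# Hypothetical pixel values already rescaled to [0, 1].
pixels = [0.0, 0.25, 0.5, 0.75, 1.0]

standardized = standardize(pixels)
approximated = normalize_with_fixed_stats(pixels)

print(approximated)  # [-1.0, -0.5, 0.0, 0.5, 1.0]
```

With the fixed values of 0.5, inputs in `[0, 1]` are mapped to `[-1, 1]`, which centers them around zero without computing statistics over the dataset.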


## MLP Structure & Class Scores

Once we have normalized and flattened our data into vectors, we then create a neural network for discovering the patterns in our training data. After training, our network should be able to look at totally new images that it hasn't trained on, and classify the content in those images. This previously unseen data is often called test data. 

So, coming back to our MNIST dataset, we have converted the images into vectors with 784 entries. The input layer of our MLP should therefore have 784 nodes. We want the output layer to distinguish between 10 different digit types, from zero to nine, so we will want the last layer to have 10 nodes. Our model will take in a flattened image and produce 10 output values, one for each possible class. These output values are often called class scores.

A high class score indicates that a network is very certain that a given input image falls into a certain class. The class scores are often represented as a vector of values or even as a bar graph indicating the relative strengths of the scores. 

The part of your MLP architecture that is up to you to define lies between the input and output layers. How many hidden layers do we want to include, and how many nodes should be in each one? This is a recurring question when defining an MLP architecture. A good recommendation is to start by looking at any papers or related work you can find that may act as a guide. For image classification with an MLP, you can start with 1 or 2 hidden layers.
You can look at [this file](https://github.com/keras-team/keras/blob/1a3ee8441933fc007be6b2beb47af67998d50737/examples/mnist_mlp.py) that used to be in the Keras repository.

We know that the more hidden layers we include in the network, the more complex the patterns it will be able to detect, but we don't want to add unnecessary complexity either. Intuitively, for small images, two hidden layers sounds very reasonable.

That intuition helps for solving the MNIST digit dataset, but we could also continue the research:
- Keep looking and see if we can find another structure that appeals to us.
- When we find a model or two that look interesting, try them out in code and see how well they perform.


## Loss & Optimization

Now that we have defined the structure of our MLP, let's talk about how it will actually learn from the MNIST data. What happens when it sees an input image? Imagine we take an image from the MNIST dataset, say a digit two. We flatten the image into a vector, it goes through our two hidden layers, and we get ten class scores from the output layer. Again, a high score means that the network is more certain that the input image is of that particular class.

At the beginning, the network will attempt to classify the image but will get it wrong, so we have to tell it to learn from its mistakes. As a network trains, we measure any mistakes that it makes using a loss function, whose job is to measure the difference between the predicted and true class labels. Then, using backpropagation, we can compute the gradient of the loss with respect to the model's weights. In this way, we quantify how bad a particular weight is and find out which weights in the network are responsible for any errors. Finally, using that calculation, we can choose an optimization function like gradient descent to give us a way to calculate a better weight value.

Towards this goal, the first thing we need to do is make our output layer a bit more interpretable. What is commonly done is to apply a softmax activation function to convert these scores into probabilities. To apply a softmax function to this output layer, we begin by evaluating the exponential function at each of the scores, then we add up all of the resulting values. Then we divide each of these values by the sum. When we work through the math, we get 10 new values. Now each value gives the probability that the image depicts its corresponding class.
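The softmax steps above (exponentiate, sum, divide) can be sketched as follows. The ten class scores are made-up numbers for illustration, and subtracting the maximum score before exponentiating is a common numerical-stability trick that the text does not mention; it does not change the result.

```python
import math


def softmax(scores):
    """Convert raw class scores into probabilities that sum to 1."""
    # Assumption: shift by the max score for numerical stability.
    shift = max(scores)
    exponentials = [math.exp(s - shift) for s in scores]
    total = sum(exponentials)
    return [e / total for e in exponentials]


# Hypothetical class scores for the digits 0 through 9.
class_scores = [1.6, 0.2, 5.4, 0.8, 0.1, 1.0, 0.3, 2.1, 0.9, 0.4]

probabilities = softmax(class_scores)
predicted_digit = probabilities.index(max(probabilities))

print(predicted_digit)  # 2
```

Because the exponential is monotonic, the class with the highest score also gets the highest probability; softmax only changes how the scores are expressed, not their ranking.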

Our goal is still to update the weights of the network in response to its mistakes, so that it predicts two as the most likely label. In a perfect world, the network would predict that the image is 100 percent likely to be the true class. In order to get the model's prediction closer to the ground truth, we will need to define some measure of exactly how far off the model currently is from perfection. We can use a loss function to measure any errors between the true image classes and our predicted classes, and backpropagation will then find out which model parameters are responsible for those errors.

Since we are constructing a multi-class classifier, we will use categorical cross entropy loss. To calculate the loss, we begin by looking at the model's predicted probability of the true class. Cross entropy loss takes the negative log of that probability. Now, for argument's sake, say instead that the weights of the network were slightly different and the model returned a different set of predicted probabilities, with a higher probability for the true class. That prediction is much better than the one first obtained, and when we calculate the cross entropy loss, we get a much smaller value.
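The comparison above can be reproduced with a short sketch. Both probability vectors are invented for illustration (each sums to 1, and the true class is assumed to be the digit two); the function name `cross_entropy` is likewise an assumption.

```python
import math


def cross_entropy(probabilities, true_class):
    """Negative log of the probability the model assigned to the true class."""
    return -math.log(probabilities[true_class])


true_class = 2  # the image depicts the digit two

# Hypothetical first prediction: only 16% on the true class.
poor_prediction = [0.05, 0.05, 0.16, 0.30, 0.05, 0.05, 0.14, 0.10, 0.05, 0.05]

# Hypothetical prediction after slightly different weights: 80% on the true class.
better_prediction = [0.02, 0.02, 0.80, 0.04, 0.02, 0.02, 0.02, 0.02, 0.02, 0.02]

poor_loss = cross_entropy(poor_prediction, true_class)
better_loss = cross_entropy(better_prediction, true_class)

print(better_loss < poor_loss)  # True
```

A prediction of exactly 1.0 for the true class would give a loss of zero, which matches the "perfect world" case.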

In general, it is possible to show that the categorical cross entropy loss is defined in such a way that the loss is lower when the model's prediction agrees more with the true class label, and higher when the prediction and the true class label disagree. As a model trains, its goal will be to find the weights that minimize that loss function and therefore give us the most accurate predictions. So a loss function and backpropagation give us a way to quantify how bad a particular network weight is, based on how close the predicted and true class labels are to one another. Next, we need a way to calculate a better weight value.

Previously, we reviewed the error function and the goal of finding a way to descend to its lowest value. This is the role of an optimizer. The standard method for minimizing the loss and optimizing for the best weight values is called gradient descent. We have already been introduced to a number of ways to perform gradient descent, and each method has a corresponding optimizer. All the optimizers are racing towards the minimum of the function. I encourage you to experiment with all of the available optimizers in your code!

<img src="assets/GradientDescentFunctions.png">
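To illustrate the basic idea behind all of these optimizers, here is a minimal sketch of plain gradient descent on a single weight. The toy loss `(w - 3)^2`, the learning rate, and the number of steps are assumptions chosen so the behavior is easy to check; a real network has many weights and gets its gradients from backpropagation.

```python
def loss(w):
    """Toy loss with a single minimum at w = 3."""
    return (w - 3.0) ** 2


def gradient(w):
    """Derivative of the toy loss with respect to w."""
    return 2.0 * (w - 3.0)


def gradient_descent(start, learning_rate=0.1, steps=100):
    """Repeatedly step against the gradient to reduce the loss."""
    w = start
    for _ in range(steps):
        w -= learning_rate * gradient(w)
    return w


best_w = gradient_descent(start=0.0)
print(round(best_w, 6))  # 3.0
```

Each step moves the weight a little in the direction that lowers the loss, which is the "descent" the optimizers in the figure are all performing in their own way.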

## ReLU activation function

To recap, the process for training an MLP to classify images is:
- Load and visualize data
- Define a neural network
- Train the model
- Evaluate the performance of our trained model on a test dataset

When defining our MLP model, we have talked about defining the input, hidden, and output layers. Let's point out a couple of things here. First, the `__init__` function: to define any neural network in PyTorch, you have to define and name any layers that have learned weight values in the `__init__` function. Next, we have to define the feedforward behavior of our network. This is just how an input `x` will be passed through the various layers and transformed. Make sure to flatten the input image by using the `view` function. Once we have our vector, we pass it to the first fully-connected layer we defined. A ReLU should generally be applied to the output of every hidden layer so that those outputs are consistent, non-negative values. Finally, we return the transformed `x`.

The purpose of an activation function is to scale the outputs of a layer so that they are consistent, small values. Much like normalizing input values, this step ensures that our model trains efficiently.

ReLU stands for "Rectified Linear Unit", and it is one of the most commonly used activation functions for hidden layers. It is an activation function simply defined as the positive part of the input. So for an input image with any negative pixel values, this would turn all those values to `0`. Sometimes this will be referred to as "clipping" the values to zero, meaning that zero is the lower bound.
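A model following this description might look like the sketch below. The class name `Net`, the layer names, and the hidden size of 512 are assumptions for illustration, and the random batch stands in for real MNIST images.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class Net(nn.Module):
    def __init__(self, hidden_1=512, hidden_2=512):
        super().__init__()
        # Layers with learned weights are defined and named here.
        self.fc1 = nn.Linear(28 * 28, hidden_1)  # 784 inputs
        self.fc2 = nn.Linear(hidden_1, hidden_2)
        self.fc3 = nn.Linear(hidden_2, 10)       # 10 class scores

    def forward(self, x):
        # Flatten each image into a vector of 784 entries.
        x = x.view(-1, 28 * 28)
        # Apply a ReLU to the output of every hidden layer.
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        # Return the raw class scores from the output layer.
        return self.fc3(x)


model = Net()

# Hypothetical batch of 4 single-channel 28x28 images.
batch = torch.rand(4, 1, 28, 28)
scores = model(batch)
print(scores.shape)  # torch.Size([4, 10])
```

The output layer returns raw class scores rather than probabilities; whether a softmax is applied inside the model or by the loss function is a design choice left open here.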
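As a minimal sketch of that definition, ReLU can be written as the maximum of zero and the input. The sample values are arbitrary and only chosen to show the clipping behavior.

```python
def relu(x):
    """Return the positive part of x: negative inputs are clipped to zero."""
    return max(0.0, x)


values = [-2.0, -0.5, 0.0, 1.5, 3.0]
activated = [relu(v) for v in values]

print(activated)  # [0.0, 0.0, 0.0, 1.5, 3.0]
```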