# CNNs!

CNNs achieve state-of-the-art results in a variety of problem areas, including voice user interfaces, natural language processing, and computer vision. Let's explore some examples:

- [WaveNet](https://deepmind.com/blog/article/wavenet-generative-model-raw-audio): AI trained to sing.
- [Text Classification](http://www.wildml.com/2015/12/implementing-a-cnn-for-text-classification-in-tensorflow/)
- [Facebook's novel CNN approach](https://engineering.fb.com/2017/05/09/ml-applications/a-novel-approach-to-neural-machine-translation/)
- [Play Atari games](https://deepmind.com/research/publications/human-level-control-through-deep-reinforcement-learning). The [code](https://sites.google.com/a/deepmind.com/dqn/) is also available.
- [Play pictionary](https://quickdraw.withgoogle.com/#) with a CNN
- Some of the world's most famous paintings have been turned into 3D for the visually impaired. Although the article does not mention how this was done, we note that it is possible to use a CNN to [predict depth](https://www.cs.nyu.edu/~deigen/depth/) from a single image.
- Check out [this research](https://research.googleblog.com/2017/03/assisting-pathologists-in-detecting.html) that uses CNNs to localize breast cancer.
- CNNs are used to [save endangered species](https://blogs.nvidia.com/blog/2016/11/04/saving-endangered-species/?adbsc=social_20170303_70517416)!
- An app called [FaceApp](http://www.digitaltrends.com/photography/faceapp-neural-net-image-editing/) uses a CNN to make you smile in a picture or change genders.

In general, CNNs can look at images as a whole and learn to identify patterns such as prominent colors and shapes, or whether a texture is fuzzy or smooth and so on. The shapes and colors that define any image and any object in an image are often called features. 

Let's cover how a CNN can learn to identify these features and how a CNN can be used for image classification. 


## What is a feature?

A helpful way to think about a **feature** is to consider what we are visually drawn to when we first see an object and when we identify different objects. For example, what do we look at to distinguish a cat from a dog? The shape of the eyes, the size, and how they move are just a couple of examples of visual features.

As another example, say we see a person walking toward us and we want to see if it is someone we know; we may look at their face and, beyond that, at their general shape and their eyes. The distinct shape of a person and their eye color are great examples of distinguishing features.


## MNIST Database

How can deep learning be used to recognize a single object in an image? The MNIST database contains 70,000 small grayscale images of handwritten digits. Each image depicts one of the numbers zero through nine. This database is perhaps one of the most famous in the field of machine learning and deep learning. It was one of the first databases used to prove the usefulness of neural networks, and it has continued to inform the development of new architectures over time.

Using deep learning, we can take a data-driven approach to training an algorithm that can examine these images and discover patterns that distinguish one digit from another. Our algorithm will need to attain some level of understanding of how images of one digit differ from images of another. The first step in recognizing patterns in images is learning how images are seen by computers. Before we start to design algorithms, we should first visualize the data and take a closer look at the images.

We can appreciate [this figure](https://www.kaggle.com/benhamner/popular-datasets-over-time) that shows datasets referenced over time in [NIPS](https://nips.cc/) papers.

Any grayscale image is interpreted by a computer as an array: a grid of values in which each cell is called a pixel, and each pixel has a numerical value:

<img src="assets/VisualizeData.png">

Each image in the MNIST database is 28 pixels high and 28 pixels wide, so a computer understands it as a 28 by 28 array. In a typical grayscale image, white pixels are encoded as the value 255, and black pixels are encoded as zero. Gray pixels fall somewhere in between, with light gray being closer to 255. Color images have similar numerical representations for each pixel color.
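To make the array view concrete, here is a minimal sketch using NumPy. The 4x4 pixel values are made up for illustration (they are not taken from MNIST), and the names `image` and `mnist_like` are hypothetical.

```python
import numpy as np

# A hypothetical 4x4 grayscale "image": 0 is black, 255 is white.
image = np.array(
    [
        [0, 0, 255, 255],
        [0, 128, 128, 255],
        [0, 128, 200, 255],
        [0, 0, 255, 255],
    ],
    dtype=np.uint8,
)

print(image.shape)               # (rows, columns)
print(image[1, 1])               # the pixel in row 1, column 1 (a mid-gray value)
print(image.min(), image.max())  # darkest and brightest pixel values

# An MNIST-sized image has the same structure, with 28 rows and 28 columns.
mnist_like = np.zeros((28, 28), dtype=np.uint8)
print(mnist_like.shape)
```

Indexing with `image[row, column]` returns a single pixel value, which is all a computer "sees" at that location.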

The MNIST images have actually gone through a quick pre-processing step. They have been rescaled so that each image has pixel values in a range from zero to one, as opposed to from 0-255:

<img src="assets/rescaledMNIST.png">

To go from a range of 0-255 to a range of zero to one, we divide every pixel value by 255. This step is known as normalization, and it is common practice in many deep learning techniques. Normalization helps our algorithm train better.

We typically want normalized values because neural networks rely on gradient calculations. These networks are trying to learn how important, or how weighty, a certain pixel should be in determining the class of an image. Normalizing the pixel values helps these gradient calculations stay consistent and keeps them from getting so large that they slow down or prevent a network from training.
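The rescaling step described above can be sketched in a few lines of NumPy. The array `raw` holds made-up pixel values for illustration, not real MNIST data.

```python
import numpy as np

# Hypothetical raw pixel values in the 0-255 range (not real MNIST data).
raw = np.array([[0, 64, 128], [191, 255, 32]], dtype=np.uint8)

# Cast to float32 (a common dtype for network inputs), then rescale to [0, 1].
rescaled = raw.astype(np.float32) / 255.0

print(rescaled.min(), rescaled.max())
```

Black pixels stay at zero, white pixels become one, and every gray value lands proportionally in between.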

We already know one method of classification: the multi-layer perceptron (MLP). How might we input this image data into an MLP? Recall that MLPs only take vectors as input. In order to use an MLP with images, we first have to convert any image array into a vector. This process is called flattening. To understand it, imagine we have a matrix with 16 pixel values (4x4). Instead of representing this as a 4x4 matrix, we can construct a vector with 16 entries, where the first four entries of our vector correspond to the first row of our old array, the second four entries correspond to the second row, and so on.

After we convert our images into vectors, they can be fed into the input layer of an MLP:

<img src="assets/flatteningImage7.png">

So, in the case of our MNIST images, which are 28x28 matrices, the flattened vector has 784 entries.
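Flattening can be sketched with NumPy as follows. The 4x4 matrix is filled with the numbers 0 through 15 purely so the ordering is easy to follow; `matrix`, `vector`, and `mnist_like` are illustrative names.

```python
import numpy as np

# A 4x4 matrix whose entries are 0..15, so the order after flattening is easy to read.
matrix = np.arange(16).reshape(4, 4)

# Row-major flattening: the first four entries are the first row, the next four the second row.
vector = matrix.flatten()
print(vector[:4])   # first row of the matrix
print(vector[4:8])  # second row of the matrix

# The same operation on an MNIST-sized array gives a vector with 28 * 28 = 784 entries.
mnist_like = np.zeros((28, 28))
flat = mnist_like.reshape(-1)
print(flat.shape)
```

Note that flattening discards the explicit 2D layout: the vector keeps every pixel value but no longer records which pixels were vertical neighbors.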

Data normalization is typically done by subtracting the mean (the average of all pixel values) from each pixel, and then dividing the result by the standard deviation of all the pixel values. Sometimes you'll see an approximation here, where we use a mean and standard deviation of 0.5 to center the pixel values. [Read more about the Normalize transformation in PyTorch](https://pytorch.org/docs/stable/torchvision/transforms.html#transforms-on-torch-tensor).

The distribution of such data should resemble a [Gaussian function](https://mathworld.wolfram.com/GaussianFunction.html) centered at zero. For image inputs we need the pixel numbers to be positive, so we can often choose to scale the data in a normalized range `[0, 1]`.
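As a rough sketch of the two normalization options described above, the snippet below standardizes some pixel values and then applies the 0.5 approximation. The `pixels` array is random data standing in for an image already rescaled to `[0, 1]`, not real MNIST data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical pixel values already rescaled to the range [0, 1].
pixels = rng.random((28, 28))

# Standardization: subtract the mean of all pixels, then divide by their standard deviation.
standardized = (pixels - pixels.mean()) / pixels.std()
print(standardized.mean(), standardized.std())  # approximately 0 and 1

# Approximation with mean = std = 0.5: maps values in [0, 1] onto [-1, 1].
approx = (pixels - 0.5) / 0.5
print(approx.min(), approx.max())
```

The second form applies the same arithmetic as a normalize transform configured with a mean and standard deviation of 0.5; it centers the values around zero without computing statistics from the data.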