<a href="https://colab.research.google.com/github/rahiakela/deep-learning--from-basics-to-practice/blob/23-keras-part-1/1-preparing-data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Data Pre-processing

There are many fine deep-learning libraries out there, and each has its
advantages. Rather than try to cover many libraries, we’ll focus on just
one, called Keras. This library is powerful, easy to use, popular, free,
and open-source.

One of the nice things about working with Keras is that a typical session
of building and training a machine-learning system requires very
little routine Python programming. The actual deep learning code is
often the easiest part of the program: we build the network with just a
few lines, and train it with just one or two function calls. Most of the
rest of the program is made of supporting tasks, such as getting the
input data, cleaning it, structuring it for use in the network, writing
routines for saving data and visualizing results, and so on.

### Setup

In [1]:
from keras.datasets import mnist
from keras import backend as keras_backend
from keras.utils import to_categorical
import matplotlib.pyplot as plt

import numpy as np

keras_backend.set_image_data_format('channels_last')

Using TensorFlow backend.


## Loading the Data

Loading the MNIST set is easy, since it’s provided
with Keras. To get it, we import the mnist module and then use
its load_data() function to get the data.

This returns two tuples:
 * the training data and
 * the test data.

Each tuple in turn contains two NumPy arrays:
* the features (that is, the images), and
* the labels.

We can use Python’s tuple unpacking to assign all four arrays to
our own variables with just one statement.

In [0]:
random_seed = 42
np.random.seed(random_seed)

# load MNIST data and save sizes
(samples_train, labels_train), (samples_test, labels_test) = mnist.load_data()

Downloading data from https://s3.amazonaws.com/img-datasets/mnist.npz


When we use a technique like cross-validation,
we break down our input data into the training set, the validation
set, and the test set. We train many variations of the system using
the training set, and after each training run we evaluate the performance
with the validation set.
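As a rough illustration of that three-way split, here is a minimal NumPy sketch. The data is a made-up stand-in (100 random "images"), and the 60/20/20 proportions are an assumption chosen for the example, not something MNIST or Keras prescribes.

```python
import numpy as np

# Hypothetical stand-in data: 100 random 28x28 "images" with integer labels.
rng = np.random.default_rng(42)
samples = rng.integers(0, 256, size=(100, 28, 28), dtype=np.uint8)
labels = rng.integers(0, 10, size=100)

# Shuffle the indices once so samples and labels stay aligned.
order = rng.permutation(len(samples))

# Assumed proportions: 60% training, 20% validation, the rest for testing.
n_train = len(samples) * 6 // 10
n_val = len(samples) * 2 // 10

train_idx = order[:n_train]
val_idx = order[n_train:n_train + n_val]
test_idx = order[n_train + n_val:]

X_train_part, y_train_part = samples[train_idx], labels[train_idx]
X_val_part, y_val_part = samples[val_idx], labels[val_idx]
X_test_part, y_test_part = samples[test_idx], labels[test_idx]

print(X_train_part.shape, X_val_part.shape, X_test_part.shape)
# (60, 28, 28) (20, 28, 28) (20, 28, 28)
```

Because the three index ranges come from a single permutation, no sample can end up in more than one set.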

The shape of samples_train is (60000, 28, 28), which tells us that it is a 3D block of 60,000 layers.
Each layer holds a 28 by 28 image. The labels_train variable has the shape (60000,): it is a 1D
array of 60,000 elements (we’ll see that each is a number from 0 to 9).

The extra comma at the end of (60000,) is Python’s notation for a tuple
with a single element. It tells us that the shape has one dimension of
length 60,000, and is not just the number 60,000 surrounded by parentheses.
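To make the shape notation concrete, here is a small sketch that uses tiny stand-in arrays (6 images rather than 60,000; the names images and targets are made up for this example).

```python
import numpy as np

# Small stand-ins for the MNIST arrays (6 images instead of 60,000).
images = np.zeros((6, 28, 28), dtype=np.uint8)
targets = np.zeros(6, dtype=np.uint8)

print(images.shape)   # (6, 28, 28): three dimensions
print(targets.shape)  # (6,): one dimension

# The trailing comma is what makes a one-element tuple.
print(type((6,)))  # <class 'tuple'>
print(type((6)))   # <class 'int'>
```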

## Looking at the Data

There are at least two potential sources of problems to keep an eye
out for. 

* Content problems are numerical issues with the data itself.
* Structural problems are issues regarding how the data is organized.

### Content problems

<img src='https://github.com/rahiakela/img-repo/blob/master/content-problems-1.JPG?raw=1' width='800'/>

There are four standout issues.
First, some of the images bleed very close to the edge of the 28 by 28
box, rather than sitting inside a relatively thick black border of 4 pixels
all around that the original paper describes.

Some images from the MNIST training set that demonstrate
a bleeding of the image very near, or right up to, the border. The
number above each example shows its index in the training set.

<img src='https://github.com/rahiakela/img-repo/blob/master/content-problems-2.JPG?raw=1' width='800'/>

Second, some of the digits appear to have had pieces cropped away,
substantially changing their shape.

Some images from the MNIST training set that have been
cropped, chopping away some of what seems very likely to have been
drawn, and sometimes creating multiple, disconnected pieces.

Third, some of the images are noisy. Sometimes this means that lines
thin out or disappear. More often there are spurious regions of white,
perhaps due to errors during cropping or thresholding. These don’t
usually cause much confusion to human observers, but these artifacts
have the potential to throw off a computerized network.

<img src='https://github.com/rahiakela/img-repo/blob/master/content-problems-3.JPG?raw=1' width='800'/>

Finally, there are some examples that seem challenging to interpret,
either because of how they were drawn, or how they were processed.

<img src='https://github.com/rahiakela/img-repo/blob/master/content-problems-5.JPG?raw=1' width='800'/>




### Structural problems 

Our main interest is in the shapes of the variables that we got from
mnist.load_data().

In [0]:
print(f'samples_train shape : {samples_train.shape}')
print(f'labels_train shape : {labels_train.shape}')
print(f'samples_test shape : {samples_test.shape}')
print(f'labels_test shape : {labels_test.shape}')

samples_train shape : (60000, 28, 28)
labels_train shape : (60000,)
samples_test shape : (10000, 28, 28)
labels_test shape : (10000,)


Our training data, samples_train, is in a 3D block. Using our (away, down,
right) convention, it’s 60,000 slices deep, where each vertical slice is
28 by 28 units.

<img src='https://github.com/rahiakela/img-repo/blob/master/content-problems-6.JPG?raw=1' width='800'/>

The test data is set up the same way, except the stack is only 10,000
images deep.
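The (away, down, right) convention maps directly onto the three indices of the array. The sketch below uses a made-up stack of 3 tiny "images" filled with consecutive integers, so each entry is easy to identify.

```python
import numpy as np

# Hypothetical stack of 3 "images", each 4 rows by 5 columns.
stack = np.arange(3 * 4 * 5).reshape(3, 4, 5)

one_image = stack[1]        # second slice "away" from us
one_row = stack[1, 2]       # third row "down" within that slice
one_pixel = stack[1, 2, 3]  # fourth entry to the "right" in that row

print(one_image.shape)  # (4, 5)
print(one_row)          # [30 31 32 33 34]
print(one_pixel)        # 33
```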

We’re going to reshape our data in the following sections, so let’s stash
the original height and width of each image in a variable. We’ll also
multiply them together and save that as the total number of pixels per
image.

In [0]:
image_height = samples_train.shape[1]
image_width = samples_train.shape[2]

number_of_pixels = image_height * image_width
number_of_pixels

784

The labels are given to us as one-dimensional lists. The training label
list labels_train has, as expected, a length of 60,000, since it’s providing
one label for each sample in the training set.

In [0]:
print(f'start of labels_train: {labels_train[:15]}')

start of labels_train: [5 0 4 1 9 2 1 3 1 4 3 5 3 6 1]


So each entry in labels_train is an integer. We expect it to be the label of
the corresponding image in samples_train. It always pays to check, so let’s
look at the first 15 images in samples_train.

<img src='https://github.com/rahiakela/img-repo/blob/master/content-problems-7.JPG?raw=1' width='800'/>
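When a plotting library isn’t handy, a quick text rendering can serve as a sanity check. The helper below, render_ascii, is a hypothetical function written for this sketch, and the 5 by 5 "image" is made up. With the real data we could presumably pass it something like samples_train[0].

```python
import numpy as np

def render_ascii(image, threshold=128):
    """Return a text picture: '#' where a pixel is at or above threshold."""
    return "\n".join(
        "".join("#" if value >= threshold else "." for value in row)
        for row in image
    )

# Hypothetical 5x5 "image" of a vertical stroke, like a crude digit 1.
tiny_image = np.zeros((5, 5), dtype=np.uint8)
tiny_image[:, 2] = 255

picture = render_ascii(tiny_image)
print(picture)
```

The threshold of 128 is an arbitrary midpoint of the 0 to 255 range; a lower value would show fainter strokes as well.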

Now let’s look at the data itself. We print an arbitrary
little rectangle from within the first image of samples_train. A handy bit of
Python to keep in mind is that by simply typing the name of a variable
into the interpreter (rather than calling print()), we sometimes
get more information about the variable.



In [0]:
samples_train[0, 5:12, 5:12]

array([[  0,   0,   0,   0,   0,   0,   0],
       [  0,   0,   0,  30,  36,  94, 154],
       [  0,   0,  49, 238, 253, 253, 253],
       [  0,   0,  18, 219, 253, 253, 253],
       [  0,   0,   0,  80, 156, 107, 253],
       [  0,   0,   0,   0,  14,   1, 154],
       [  0,   0,   0,   0,   0,   0, 139]], dtype=uint8)

As we might expect from grayscale image data, all of the values are between 0 and 255.

In [0]:
labels_train[:15]

array([5, 0, 4, 1, 9, 2, 1, 3, 1, 4, 3, 5, 3, 6, 1], dtype=uint8)
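The extra information we get from typing a bare variable name comes from the difference between str(), which print() uses, and repr(), which the interpreter shows. A small sketch with a made-up array of five labels:

```python
import numpy as np

first_labels = np.array([5, 0, 4, 1, 9], dtype=np.uint8)

# print() uses str(), which shows only the values.
print(str(first_labels))   # [5 0 4 1 9]

# Typing the bare name shows repr(), which also reports the dtype.
print(repr(first_labels))  # array([5, 0, 4, 1, 9], dtype=uint8)
```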

In [0]:
samples_train[0, :, :]

array([[  0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
          0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
          0,   0],
       [  0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
          0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
          0,   0],
       [  0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
          0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
          0,   0],
       [  0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
          0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
          0,   0],
       [  0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
          0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
          0,   0],
       [  0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   3,
         18,  18,  18, 126, 136, 175,  26, 166, 255, 247, 127,   0,   0,
          0,   0],
       [  

In [0]:
labels_train[:1]

array([5], dtype=uint8)

To use this data for training with Keras, we need to turn the training
and test sample data into normalized floating-point numbers, and turn
the labels into one-hot encodings.

## Train-test Splitting

Most data sets require us to manually split them into training and test
sets. The MNIST data has already been split for us, but for completeness,
let’s see how we’d do the job if we had to.

The easiest and most common approach is to use scikit-learn’s
train_test_split() function to do all the work for us. Suppose that
the MNIST data came to us as only two tensors, called samples and
labels, and we want to split it into a training set and a test set. A typical
test set is often around 20% or 30% of the starting data, so let’s go
down the middle with 25%.

In [0]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(samples, labels, test_size=0.25)

<img src='https://github.com/rahiakela/img-repo/blob/master/content-problems-8.JPG?raw=1' width='800'/>
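To see the split run end to end, here is a sketch with made-up stand-ins for samples and labels (100 blank images). The random_state argument is an addition for this example; it only makes the shuffle repeatable.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-ins for the unsplit data: 100 blank 28x28 images
# and 100 integer labels cycling through 0 to 9.
samples = np.zeros((100, 28, 28), dtype=np.uint8)
labels = np.arange(100) % 10

X_train, X_test, y_train, y_test = train_test_split(
    samples, labels, test_size=0.25, random_state=42
)

print(X_train.shape, X_test.shape)  # (75, 28, 28) (25, 28, 28)
print(y_train.shape, y_test.shape)  # (75,) (25,)
```

The samples and labels are shuffled together, so each image keeps its own label after the split.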

## Fixing the Data Type

Our samples arrive as 8-bit unsigned integers (uint8), but Keras works
with floating-point numbers. In fact, Keras expects the specific type of
floats that matches its internal floatx parameter, which is float32 by default.

Now that we know the format Keras expects for our floating-point
numbers, we can return to our job of converting our samples into that
form. The easy way to do this is to use the function cast_to_floatx()
from the Keras backend, which takes a tensor as an argument and
casts every element of that tensor into the type specified by the current
value of floatx.

In [0]:
X_train = keras_backend.cast_to_floatx(samples_train)
X_test = keras_backend.cast_to_floatx(samples_test)
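To see what the cast changes, here is a NumPy-only stand-in. It assumes floatx is at its default value of 'float32' and uses astype() in place of the Keras call, so it is a sketch of the effect rather than of the Keras implementation. It also shows one practical reason the cast matters: dividing in place by 255.0 fails on an integer array.

```python
import numpy as np

# Assumption: floatx is at its default value, 'float32'.
assumed_floatx = "float32"

# One row of pixel values, stored as 8-bit unsigned integers.
pixels_uint8 = np.array([0, 0, 49, 238, 253, 253, 253], dtype=np.uint8)

# NumPy stand-in for the cast: same values, new dtype.
pixels_float = pixels_uint8.astype(assumed_floatx)
print(pixels_uint8.dtype, "->", pixels_float.dtype)  # uint8 -> float32

# In-place division cannot store its float result in a uint8 array.
division_failed = False
try:
    pixels_uint8 /= 255.0
except TypeError:
    division_failed = True
print("in-place division on uint8 failed:", division_failed)

# On the float copy the same operation works.
pixels_float /= 255.0
```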

## Normalizing the Data

The networks that we’ll be building to categorize the
MNIST data will use convolution layers near the start, and those will
work best with data that has been normalized so that each feature has
been scaled to fit the range 0 to 1.

Note that normalization is just for the features, and not the labels. The
labels need to refer to the 10 different classes from 0 to 9, and we don’t
want to change those values.

Our feature data in X_train and X_test is originally made of integers in the range 0 to 255. This is a common range for a channel of image data. We’ve just converted these values to 32-bit floats, so we could say that they’re now in the range 0.0 to 255.0.

We said above that we need to normalize our data to the range
[0.0, 1.0]. This helps to keep neuron
outputs in the same range, which helps with regularization and delays
the onset of overfitting. And if we’re using an activation function
like a sigmoid, it keeps our functions from saturating.
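To get a feel for saturation, the sketch below evaluates a sigmoid and its slope at a raw pixel value of 255 and at the scaled value of 1.0. It makes a simplifying assumption for illustration: the pixel reaches the sigmoid unchanged, as if through a weight of 1 with no bias.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_slope(x):
    # Derivative of the sigmoid: s(x) * (1 - s(x)).
    s = sigmoid(x)
    return s * (1.0 - s)

# Assumption: the input reaches the sigmoid unchanged (weight 1, no bias).
raw_pixel = 255.0
scaled_pixel = raw_pixel / 255.0

print(sigmoid(raw_pixel), sigmoid_slope(raw_pixel))
print(sigmoid(scaled_pixel), sigmoid_slope(scaled_pixel))
```

At the raw value the slope is essentially zero, so there is almost no gradient to learn from. At the scaled value the sigmoid is still in its responsive region.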

We know that our pixels in the training and test data are in the range
[0, 255]. All we want is to rescale all the pixels in the same way, compressing
them from the range [0, 255] to the range [0,1]. Conceptually,
this is like converting measurements in millimeters into kilometers, or
vice-versa.

We can scale our input data with NumPy’s interp() routine, which
performs exactly this kind of linear rescaling. It takes an array (or tensor), an input
range, and an output range. For each entry it will find its location in
the first range (0 to 255) and find its corresponding position in the
second range (0 to 1).

In [0]:
X_train = np.interp(X_train, [0, 255], [0, 1])
X_test = np.interp(X_test, [0, 255], [0, 1])

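This works perfectly, but since we know our data is in the range 0 to
255, we can accomplish the same thing just by dividing all the pixels
by 255.0. The two approaches are alternatives: we should apply one or
the other, because applying both would scale the data twice.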

In [0]:
X_train /= 255.0
X_test /= 255.0
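A quick check on a handful of made-up pixel values suggests the two approaches agree. The sketch also prints the result types, since interp() returns 64-bit floats even when given 32-bit input.

```python
import numpy as np

# Made-up pixel values, already converted to 32-bit floats.
pixels = np.array([0.0, 51.0, 127.5, 255.0], dtype=np.float32)

scaled_interp = np.interp(pixels, [0, 255], [0, 1])
scaled_divide = pixels / 255.0

print(np.allclose(scaled_interp, scaled_divide))  # True
print(scaled_interp.dtype, scaled_divide.dtype)   # float64 float32
```

If we use interp() and want to stay in the floatx type, we would need to cast the result again afterward. Plain division keeps the float32 type as it is.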

Let’s gather everything we’ve seen so far in one place.

In [0]:
from keras.datasets import mnist
from keras import backend as keras_backend

# load MNIST data and save sizes
(X_train, y_train), (X_test, y_test) = mnist.load_data()

image_height = X_train.shape[1]
print(f'image_height = {image_height}')
image_width = X_train.shape[2]
print(f'image_width = {image_width}')
number_of_pixels = image_height * image_width
print(f'number_of_pixels = {number_of_pixels}')
print()

# convert to floating-point
X_train = keras_backend.cast_to_floatx(X_train)
X_test = keras_backend.cast_to_floatx(X_test)
print(f'Before scaling: \n {X_train[:1]}')
print()

# scale data to range [0, 1]
X_train /= 255.0
X_test /= 255.0
print(f'After scaling: \n {X_train[:1]}')

image_height = 28
image_width = 28
number_of_pixels = 784

Before scaling: 
 [[[  0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.
     0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.]
  [  0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.
     0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.]
  [  0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.
     0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.]
  [  0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.
     0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.]
  [  0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.
     0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.]
  [  0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   3.  18.
    18.  18. 126. 136. 175.  26. 166. 255. 247. 127.   0.   0.   0.   0.]
  [  0.   0.   0.   0.   0.   0.   0.   

Our training and test samples are now in floating-point format and
scaled from 0.0 to 1.0.

Now let’s pre-process the labels so that they’re ready for use.

## Fixing the Labels

We know that the MNIST data contains images of digits from 0 to 9.
So in our network we’ll create an output layer with 10 neurons, one for
each digit. Each neuron will produce a probability that the image it’s
just been fed corresponds to that digit. The neuron with the highest
value will be the network’s final prediction for the input.

We’d like to compute an error value that tells us how close these 10
values are to the values we want. To make this comparison easy, we
represent the label for each image using one-hot encoding.

For example, the one-hot encoding of the label 3 is a list of 10 elements, where all are 0 except for a 1 in slot 3 (counting from 0).

<img src='https://github.com/rahiakela/img-repo/blob/master/one-hot-encodings.JPG?raw=1' width='800'/>

In this imaginary example, the network has given the value 3 the
greatest probability, but it’s given each of the other digits some chance
of being right, too. 

A perfect answer from the network would be a
probability of 1 that the input is a 3, so all other choices would have
a probability of 0. In other words, a perfect prediction would be the
same as the label. The more the two are different, the higher the error.

The one-hot form of the label simplifies this comparison of the output
and the label.
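Here is a small sketch of that comparison. The two prediction vectors are made-up numbers, and the sum of squared differences is used only as a simple stand-in error measure; it is an assumption for this example, and a real network would more typically use something like categorical cross-entropy.

```python
import numpy as np

# One-hot label for the digit 3.
label = np.array([0, 0, 0, 1, 0, 0, 0, 0, 0, 0], dtype=np.float64)

# Hypothetical network outputs (each sums to 1); the numbers are made up.
confident = np.array([0.01, 0.01, 0.02, 0.90, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01])
unsure = np.array([0.05, 0.05, 0.10, 0.40, 0.05, 0.05, 0.05, 0.15, 0.05, 0.05])

def squared_error(prediction, target):
    # Stand-in error measure: sum of squared differences.
    return float(np.sum((prediction - target) ** 2))

print(squared_error(label, label))      # 0.0 for a perfect prediction
print(squared_error(confident, label))  # small error
print(squared_error(unsure, label))     # larger error
```

Both predictions pick 3 as the most likely digit, but the less confident one is farther from the label and so gets the larger error.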

Turning each integer in a list into a one-hot encoding is such a common
task that Keras provides a utility for it. The routine to_categorical()
looks through an array of integers and finds the largest value, so it
knows how many 0’s are needed to represent all the values that need
to be encoded. It then makes a one-hot encoding for each integer in
the list. The output of to_categorical() is a list of these encodings,
which are themselves lists of 0’s and 1’s.

Let’s see one-hot encoding in action.

In [0]:
from keras.utils import to_categorical

# print the first 5 entries of the original y_train array
y_train[:5]

array([5, 0, 4, 1, 9], dtype=uint8)

In [0]:
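# encode the y_train array as one-hot lists; keep the result in a separate
# variable so the integer labels in y_train stay available for later cells
y_train_one_hot = to_categorical(y_train)
# print the first 5 one-hot encoded entries
y_train_one_hot[:5]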

array([[0., 0., 0., 0., 0., 1., 0., 0., 0., 0.],
       [1., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 1., 0., 0., 0., 0., 0.],
       [0., 1., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 1.]], dtype=float32)

As we can see, the output is a 2D grid with one row for each input.
Every entry is 0 except for a single 1, located at the index corresponding
to the original y_train value for that row.
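The same grid can be reproduced with plain NumPy, which may help show what the encoding is doing. This is a stand-in for illustration, not how Keras implements it: row i of an identity matrix is the one-hot encoding of the integer i.

```python
import numpy as np

labels = np.array([5, 0, 4, 1, 9], dtype=np.uint8)

# Width of each encoding: the largest label plus 1, since labels start at 0.
num_classes = int(labels.max()) + 1

# Indexing the identity matrix by the labels encodes all of them at once.
one_hot = np.eye(num_classes, dtype=np.float32)[labels]

print(one_hot.shape)  # (5, 10)
print(one_hot)
```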

We might be tempted to simply pass y_train and then y_test to
to_categorical() in succession and move on, but that could introduce
a subtle bug. The problem is that the largest value in one list
might be different than the largest value in the other, giving us lists of
different sizes.

For instance, suppose that the test data was missing any images of
the digit 9. That means that y_test will contain only the digits 0 to 8.
When we use to_categorical() we’ll get back a list that has only 9
items. This will cause trouble later when we want to compare it to the
values in our output layer, which has a score for each of 10 categories.

We don’t have to worry about this problem with the MNIST data,
because it has examples of every digit in both sets, but it might come
up in other data sets.

There’s an easy, general solution that will always avoid this problem.
It involves using an optional argument to to_categorical() that overrides
its scanning step. This argument, called num_classes, tells the
routine to always make lists of the given length.

The value of num_classes has to be at least big enough to encode all
the possible values, or we’ll get an error. If num_classes is bigger than
necessary, that’s fine, and the extra values at the end will always be 0.
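The mismatch is easy to reproduce with a small NumPy stand-in for the encoder. The helper one_hot_encode is a hypothetical function written for this sketch, and the label arrays are made up; it only mimics the behavior described above.

```python
import numpy as np

def one_hot_encode(labels, num_classes=None):
    """NumPy stand-in for a one-hot encoder (hypothetical helper)."""
    labels = np.asarray(labels)
    if num_classes is None:
        # Mimic the scanning step: infer the width from the largest label.
        num_classes = int(labels.max()) + 1
    return np.eye(num_classes, dtype=np.float32)[labels]

train_labels = np.array([0, 3, 9, 7])  # contains a 9
test_labels = np.array([0, 3, 8, 7])   # hypothetical set with no 9

# Inferred widths disagree between the two sets.
print(one_hot_encode(train_labels).shape)  # (4, 10)
print(one_hot_encode(test_labels).shape)   # (4, 9)

# A fixed width keeps both encodings compatible.
print(one_hot_encode(test_labels, num_classes=10).shape)  # (4, 10)
```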

In [0]:
# combine the input lists to find largest value
# in either list, then add 1 because the values start at 0
number_of_classes = 1 + max(np.append(y_train, y_test)).astype(np.int32)
print(f'number_of_classes: {number_of_classes}')

# encode each list into one-hot arrays of the size we just found
y_train = to_categorical(y_train, num_classes=number_of_classes)
y_test = to_categorical(y_test, num_classes=number_of_classes)

number_of_classes: 10


Sometimes we want the original list of integers somewhere else in the
program, such as when we do cross-validation. We can “undo”
the one-hot encoding in two ways. If the one-hot encoding is represented
as a regular Python list (that is, not a NumPy array), we can use
Python’s built-in index() method.

In [0]:
one_hot = [0, 0, 0, 1, 0, 0, 0, 0, 0, 0]
print(f'one-hot represents the integer {one_hot.index(1)}')

one-hot represents the integer 3


If the one-hot version is a NumPy array, then we can’t use index(),
because NumPy arrays don’t support that method. There are several ways
to use NumPy to find the index of a single 1 in a list of 0’s.
One of them uses NumPy’s argmax() function, which
returns the index of the largest value in a list.

In [0]:
one_hot_np = np.array([0, 0, 0, 1, 0, 0, 0, 0, 0, 0])
print(f'one_hot_np represents the integer {np.argmax(one_hot_np)}')

one_hot_np represents the integer 3
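The same idea extends to a whole grid of encodings at once. In this sketch the grid is typed in by hand to match the five encoded labels shown earlier; np.nonzero() is included as one possible alternative to argmax().

```python
import numpy as np

encoded = np.array([
    [0, 0, 0, 0, 0, 1, 0, 0, 0, 0],
    [1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
], dtype=np.float32)

# axis=1 looks for the largest value within each row.
decoded = np.argmax(encoded, axis=1)
print(decoded)  # [5 0 4 1 9]

# Alternative: np.nonzero returns the (row, column) positions of the 1s.
rows, columns = np.nonzero(encoded)
print(columns)  # [5 0 4 1 9]
```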


Because one-hot encoding is so common, scikit-learn also offers a
tool to perform it. It’s in the preprocessing module, and is called
OneHotEncoder().
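A minimal usage sketch follows, with a made-up array of five labels. It calls toarray() on the result because the encoder returns a sparse matrix by default. It also suggests that the width issue discussed above applies here too: unless we list the categories ourselves, only the values present in the data get a column.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

labels = np.array([5, 0, 4, 1, 9])

# OneHotEncoder expects a 2D input with one column per feature.
column = labels.reshape(-1, 1)

# By default the categories are learned from the data, so only the
# five digits actually present get a column.
learned = OneHotEncoder().fit_transform(column).toarray()
print(learned.shape)  # (5, 5)

# Listing the categories explicitly fixes the width at 10.
fixed = OneHotEncoder(categories=[np.arange(10)]).fit_transform(column).toarray()
print(fixed.shape)  # (5, 10)
```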

## Pre-Processing All in One Place

To recap, we began by reading in (and possibly downloading) the
MNIST data, and then prepared each image for Keras by changing it
from integers to floats, and then normalized it. Then we created one-hot
encodings of our labels.

In [0]:
import numpy as np
from keras.datasets import mnist
from keras import backend as keras_backend
from keras.utils import to_categorical

# load MNIST data and save sizes
(X_train, y_train), (X_test, y_test) = mnist.load_data()

image_height = X_train.shape[1]
print(f'image_height = {image_height}')
image_width = X_train.shape[2]
print(f'image_width = {image_width}')
number_of_pixels = image_height * image_width
print(f'number_of_pixels = {number_of_pixels}')
print()

# convert to floating-point
X_train = keras_backend.cast_to_floatx(X_train)
X_test = keras_backend.cast_to_floatx(X_test)
print(f'Before scaling: \n {X_train[:1]}')
print()

# scale data to range [0, 1]
X_train /= 255.0
X_test /= 255.0
print(f'After scaling: \n {X_train[:1]}')
print()

# save the original y_train and y_test
original_y_train = y_train
original_y_test = y_test

# replace label data with one-hot encoded versions
number_of_classes = 1 + max(np.append(y_train, y_test)).astype(np.int32)
print(f'number_of_classes: {number_of_classes}')

# encode each list into one-hot arrays of the size we just found
y_train = to_categorical(y_train, num_classes=number_of_classes)
y_test = to_categorical(y_test, num_classes=number_of_classes)

image_height = 28
image_width = 28
number_of_pixels = 784

Before scaling: 
 [[[  0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.
     0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.]
  [  0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.
     0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.]
  [  0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.
     0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.]
  [  0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.
     0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.]
  [  0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.
     0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.]
  [  0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   3.  18.
    18.  18. 126. 136. 175.  26. 166. 255. 247. 127.   0.   0.   0.   0.]
  [  0.   0.   0.   0.   0.   0.   0.   