# Ch 3 - Getting Started with Neural Networks

## 3.1 Anatomy of a Neural Network

Training a neural network revolves around the following objects:

- **Layers**, which are combined into a **network** (or **model**)

- The **input data** and corresponding **targets**

- The **loss function**, which defines the feedback signal used for learning

- The **optimizer**, which determines how learning proceeds



![NN1](Images/03_01.jpg)


The network maps the input data to predictions. The loss function then compares these predictions to the targets, producing a loss value: a measure of how well the network's predictions match what was expected. The optimizer uses this loss value to update the network's weights. 
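The loop described above can be sketched in a few lines of NumPy. This is a minimal illustration, not Keras code: the toy data, the single linear layer, the learning rate, and the use of plain full-batch gradient descent (rather than a stochastic variant) are all assumptions chosen to keep the example short.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy input data and targets: targets follow y = 3x + 1 plus a little noise.
inputs = rng.normal(size=(256, 1))
targets = 3.0 * inputs + 1.0 + 0.01 * rng.normal(size=(256, 1))

# The "network": a single linear layer with weights w and bias b.
w = np.zeros((1, 1))
b = np.zeros((1,))
learning_rate = 0.1

losses = []
for step in range(200):
    predictions = inputs @ w + b                    # network: inputs -> predictions
    error = predictions - targets
    loss = float(np.mean(error ** 2))               # loss: compare predictions to targets
    losses.append(loss)
    grad_w = 2.0 * inputs.T @ error / len(inputs)   # gradient of the loss w.r.t. w
    grad_b = 2.0 * error.mean(axis=0)               # gradient of the loss w.r.t. b
    w -= learning_rate * grad_w                     # optimizer: update the weights
    b -= learning_rate * grad_b

print(f"first loss: {losses[0]:.4f}, last loss: {losses[-1]:.6f}")
print("learned w, b:", w.ravel(), b)
```

The loss value is the only feedback signal the update step uses, which is why the choice of loss function matters so much later in this chapter.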



### 3.1.1 Layers: the Building Blocks of Deep Learning


The **layer** is the fundamental data structure in neural networks. It is a data-processing module that takes one or more tensors as input and outputs one or more tensors. 

Some layers are stateless, but they more frequently have a state: the layer's **weights**, one or several tensors learned with stochastic gradient descent, which together contain the network's knowledge. 

Different layers are appropriate for different tensor formats and different types of data processing. 

- **Dense Layers**: Simple vector data stored in 2D tensors of shape (samples, features) is often processed by densely connected layers, also called fully connected or dense layers (the Dense class in Keras).

- **Recurrent Layers**: Sequence data stored in 3D tensors of shape (samples, timesteps, features) is typically processed by recurrent layers such as an LSTM layer.

- **2D Convolution Layers**: Image data stored in 4D tensors is usually processed by 2D convolution layers (Conv2D).
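To make the three tensor formats concrete, here is a small NumPy sketch. The batch size and feature sizes are arbitrary, and the (samples, height, width, channels) ordering for images is an assumption, since the text above only says that image data is 4D.

```python
import numpy as np

# Vector data: 128 samples, each a 784-dimensional feature vector.
vector_batch = np.zeros((128, 784))

# Sequence data: 128 samples, 100 timesteps, 16 features per timestep.
sequence_batch = np.zeros((128, 100, 16))

# Image data: 128 samples of 28x28 pixels with 1 channel.
# The (samples, height, width, channels) ordering is an assumption here.
image_batch = np.zeros((128, 28, 28, 1))

for name, batch in [("vector", vector_batch),
                    ("sequence", sequence_batch),
                    ("image", image_batch)]:
    print(f"{name}: {batch.ndim}D tensor of shape {batch.shape}")
```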


Building deep-learning models in Keras is done by clipping together compatible layers to form useful data-transformation pipelines. The notion of **layer compatibility** here refers specifically to the fact that every layer will only accept input tensors of a certain shape and will return output tensors of a certain shape.




#### Example:
(A dense layer with 32 output units)




In [7]:
from keras import layers

layer = layers.Dense(32, input_shape=(784, ))

This creates a layer that only accepts 2D input tensors whose feature dimension (axis 1) is 784. Axis 0, the batch dimension, is unspecified, so any batch size is accepted. The layer returns a tensor whose feature dimension has been transformed to 32.

This layer can only be connected to a downstream layer that expects 32-dimensional vectors as its input.
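What such a layer does to shapes can be sketched in NumPy. This is only an illustration of the shape contract, assuming a dense layer computes `batch @ kernel + bias` with no activation; the names `kernel`, `bias`, and `dense_forward` are hypothetical and not part of Keras.

```python
import numpy as np

rng = np.random.default_rng(0)

input_dim, units = 784, 32

# Weights of the layer: a kernel matrix and a bias vector.
kernel = rng.normal(scale=0.01, size=(input_dim, units))
bias = np.zeros(units)

def dense_forward(batch):
    """Apply output = batch @ kernel + bias (no activation)."""
    if batch.ndim != 2 or batch.shape[1] != input_dim:
        raise ValueError(f"expected shape (batch, {input_dim}), got {batch.shape}")
    return batch @ kernel + bias

# Any batch size works, because axis 0 is left unspecified.
for batch_size in (1, 64):
    out = dense_forward(np.zeros((batch_size, input_dim)))
    print(batch_size, "->", out.shape)

print("trainable parameters:", kernel.size + bias.size)
```

The shape check in `dense_forward` is the incompatibility the text refers to: an input with any feature size other than 784 is rejected.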

When using Keras, you don't have to worry about compatibility because the layers you add to your models are dynamically built to match the shape of the incoming layer. 

In [9]:
from keras import models

model = models.Sequential()
model.add(layers.Dense(32, input_shape=(784, )))
model.add(layers.Dense(32))

The second layer didn't receive an input shape argument - instead, it automatically inferred its input shape as being the output shape of the layer that came before. 
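The idea of inferring an input shape from the previous layer can be sketched with two toy classes. `TinyDense` and `TinySequential` are hypothetical stand-ins written for this note; they are not how Keras is implemented, only a minimal model of the behavior described above.

```python
import numpy as np

class TinyDense:
    """Hypothetical stand-in for a dense layer; weights are created lazily."""
    def __init__(self, units, input_dim=None):
        self.units = units
        self.input_dim = input_dim
        self.kernel = None

    def build(self, input_dim):
        self.input_dim = input_dim
        self.kernel = np.zeros((input_dim, self.units))

class TinySequential:
    """Hypothetical stand-in for a linear stack of layers."""
    def __init__(self):
        self.layers = []

    def add(self, layer):
        if self.layers:
            # Infer the input size from the previous layer's output size.
            input_dim = self.layers[-1].units
        elif layer.input_dim is not None:
            input_dim = layer.input_dim
        else:
            raise ValueError("the first layer needs an explicit input_dim")
        layer.build(input_dim)
        self.layers.append(layer)

model = TinySequential()
model.add(TinyDense(32, input_dim=784))
model.add(TinyDense(32))   # no input size given: inferred as 32

for layer in model.layers:
    print(layer.kernel.shape)
```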

### 3.1.2 Models: Networks of Layers

The most common deep-learning model is a linear stack of layers, mapping a single input to a single output. You will also be exposed to a broader variety of network topologies:

- Two-branch networks

- Multihead networks

- Inception blocks

The topology of a network defines a **hypothesis space**. We defined machine learning as "searching for useful representations of some input data, within a predefined space of possibilities, using guidance from a feedback signal." By choosing a network topology, you constrain your **space of possibilities** (hypothesis space) to a specific series of tensor operations, mapping input data to output data. Then you will look for a good set of values for the weight tensors involved in these tensor operations. 


Picking the right **network architecture** is more art than science, and although there are some best practices and principles you can rely on, only practice can help you become a proper neural-network architect. 

### 3.1.3 Loss Functions and Optimizers: Keys to Configuring the Learning Process


Once you define a network architecture, you still have to choose two more things:

- **A loss function (objective function)**: The quantity that will be minimized during training. It represents a measure of success for the task at hand.

- **Optimizer**: Determines how the network will be updated based on the loss function. It implements a specific variant of stochastic gradient descent (SGD).


A neural network that has multiple outputs may have multiple loss functions (one per output). But the gradient-descent process must be based on a single scalar loss value; so, for multiloss networks, all losses are combined (via averaging) into a single scalar quantity.
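As a tiny sketch of that combination step, assume a network with two outputs whose losses have already been computed. The output names and loss values below are made up for illustration.

```python
# Hypothetical per-output loss values for a network with two outputs.
per_output_losses = {"age_mse": 0.30, "genre_crossentropy": 0.90}

# Combine them into one scalar by averaging, as described above.
total_loss = sum(per_output_losses.values()) / len(per_output_losses)

print(f"scalar loss used for gradient descent: {total_loss:.2f}")
```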


Choosing the right objective function for the right problem is important: your network will take any shortcut it can, to minimize the loss; so if the objective doesn't fully correlate with success for the task at hand, your network will end up doing things you might not want. 


When it comes to common problems such as classification, regression, and sequence prediction, there are simple guidelines you can follow to choose the correct loss. Only when you're working on truly new research problems will you have to develop your own objective functions.

For example:

- For a two-class classification problem, you'll use binary crossentropy

- For a many-class classification problem, you'll use categorical crossentropy

- For a regression problem, you'll use mean-squared error

- For a sequence-learning problem, you'll use connectionist temporal classification (CTC).
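For intuition, three of these losses can be written out in NumPy. These are textbook formulas in a minimal form, assuming predictions are probabilities for the crossentropy losses and one-hot targets for the categorical case; the clipping constant `eps` is an assumption added to avoid `log(0)`. CTC is omitted because it does not fit in a few lines.

```python
import numpy as np

def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean binary crossentropy; y_pred holds probabilities in (0, 1)."""
    y_pred = np.clip(y_pred, eps, 1.0 - eps)   # avoid log(0)
    return float(-np.mean(y_true * np.log(y_pred)
                          + (1.0 - y_true) * np.log(1.0 - y_pred)))

def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean categorical crossentropy; rows of y_true are one-hot."""
    y_pred = np.clip(y_pred, eps, 1.0)
    return float(-np.mean(np.sum(y_true * np.log(y_pred), axis=1)))

def mean_squared_error(y_true, y_pred):
    """Mean of the squared differences between targets and predictions."""
    return float(np.mean((y_true - y_pred) ** 2))

labels = np.array([1.0, 0.0, 1.0])
mostly_right = np.array([0.9, 0.1, 0.8])
mostly_wrong = np.array([0.1, 0.9, 0.2])

print("binary crossentropy, good predictions:", binary_crossentropy(labels, mostly_right))
print("binary crossentropy, bad predictions: ", binary_crossentropy(labels, mostly_wrong))
print("mean squared error:", mean_squared_error(np.array([2.0, 4.0]), np.array([1.0, 6.0])))
```

In each case a lower value means the predictions are closer to the targets, which is what makes these quantities usable as a training signal.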



## 3.2 Introduction to Keras

Keras is a deep-learning framework for Python that provides a convenient way to define and train almost any kind of deep-learning model. Keras was initially developed for researchers, with the aim of enabling fast experimentation. It is distributed under the MIT license, which means it can be freely used in commercial projects.

Keras has the following features:

- Allows the same code to run seamlessly on CPU or GPU.

- Has a user-friendly API that makes it easy to quickly prototype deep-learning models.

- Has built-in support for convolutional networks (for computer vision), recurrent networks (for sequence processing), and any combination of both.

- Supports arbitrary network architectures: multi-input or multi-output models, layer sharing, model sharing, and so on. This means Keras is appropriate for building essentially any deep-learning model, from a generative adversarial network to a neural Turing machine.



![Frameworks](Images/03_02.jpg)






### 3.2.1 Keras, TensorFlow, Theano, and CNTK


Keras is a model-level library, providing high-level building blocks for developing deep-learning models. It doesn’t handle low-level operations such as tensor manipulation and differentiation. Instead, it relies on a specialized, well-optimized tensor library to do so, serving as the backend engine of Keras. Rather than choosing a single tensor library and tying the implementation of Keras to that library, Keras handles the problem in a modular way (see figure 3.3); thus several different backend engines can be plugged seamlessly into Keras. 

Currently, the three existing backend implementations are the TensorFlow backend, the Theano backend, and the Microsoft Cognitive Toolkit (CNTK) backend. In the future, it’s likely that Keras will be extended to work with even more deep-learning execution engines.


![stack](Images/03_03.jpg)





TensorFlow, CNTK, and Theano are some of the primary platforms for deep learning today. [Theano](http://deeplearning.net/software/theano) is developed by the MILA lab at Université de Montréal, [TensorFlow](https://www.tensorflow.org/) is developed by Google, and [CNTK](https://github.com/Microsoft/CNTK) is developed by Microsoft. Any piece of code that you write with Keras can be run with any of these backends without having to change anything in the code: you can seamlessly switch between them during development, which often proves useful, for instance if one of these backends proves to be faster for a specific task. We recommend using the TensorFlow backend as the default for most of your deep-learning needs, because it’s the most widely adopted, scalable, and production ready.


Via TensorFlow (or Theano, or CNTK), Keras is able to run seamlessly on both CPUs and GPUs. When running on CPU, TensorFlow is itself wrapping a low-level library for tensor operations called [Eigen](http://eigen.tuxfamily.org). On GPU, TensorFlow wraps a library of well-optimized deep-learning operations called the NVIDIA CUDA Deep Neural Network library (cuDNN). 










### 3.2.2 Developing with Keras: a Quick Overview


A typical Keras workflow looks like the MNIST example:

1. Define training data: input tensors and target tensors.

2. Define a network of layers (or **model**) that maps your inputs to your targets.

3. Configure the learning process by choosing a loss function, an optimizer, and some metrics to monitor.

4. Iterate on your training data by calling the fit( ) method of your model.



There are two ways to define a model:

- Using the Sequential class, which only supports linear stacks of layers (the most common network architecture).

- Using the **functional API** for directed acyclic graphs of layers, which lets you build completely arbitrary architectures.


#### Using the Sequential Class: 

In [None]:
from keras import models
from keras import layers


model = models.Sequential()

model.add(layers.Dense(32, activation='relu', input_shape=(784, )))
model.add(layers.Dense(10, activation='softmax'))

#### Using the Functional API:

With the functional API, you are manipulating the data tensors that the model processes and applying layers to this tensor as if they were functions.

In [None]:
input_tensor = layers.Input(shape=(784, ))
x = layers.Dense(32, activation='relu')(input_tensor)
output_tensor = layers.Dense(10, activation='softmax')(x)

model = models.Model(inputs=input_tensor, outputs=output_tensor)

Once your model is defined, it doesn't matter if you are using a Sequential model or the functional API. All of the following steps are the same.

The learning process is configured in the compilation step, where you specify the optimizer and loss function(s) that the model should use, as well as the metrics you want to monitor during training. 

#### Compiling the model:

In [None]:
from keras import optimizers


model.compile(optimizer=optimizers.RMSprop(lr=0.001),
              loss='mse',
              metrics=['accuracy'])

The learning process consists of passing NumPy arrays of input data (and the corresponding target data) to the model via the fit( ) method, similar to what you would do in scikit-learn and several other machine learning libraries.

In [None]:
model.fit(input_tensor, target_tensor, batch_size=128, epochs=10)

## 3.3 Setting Up a Deep-Learning Workstation



It is recommended that you run deep-learning code on a modern NVIDIA GPU. Some applications like image processing with convolutional networks and sequence processing with recurrent neural networks will be very slow on CPU. 



Even for applications that can realistically be run on CPU, you’ll generally see a speed increase by a factor of 5 or 10 by using a modern GPU. If you don’t want to install a GPU on your machine, you can alternatively consider running your experiments on an AWS EC2 GPU instance or on Google Cloud Platform. But note that cloud GPU instances can become expensive over time.


It is better to use a Unix workstation. Although it’s technically possible to use Keras on Windows (all three Keras backends support Windows), it is not recommended.







### 3.3.1 Jupyter Notebooks: the Preferred Way to Run Deep-Learning Experiments


A notebook is a file generated by the [Jupyter Notebook app](https://jupyter.org), which you can edit in your browser. It mixes the ability to execute Python code with rich text-editing capabilities for annotating what you’re doing. A notebook also allows you to break up long experiments into smaller pieces that can be executed independently, which makes development interactive and means you don’t have to rerun all of your previous code if something goes wrong late in an experiment.


We recommend using Jupyter notebooks to get started with Keras, although that isn’t a requirement: you can also run standalone Python scripts or run code from within an IDE such as PyCharm. All the code examples in this book are available as open source notebooks; you can download them from the book’s [website](http://www.manning.com/books/deep-learning-with-python).





### 3.3.2 Getting Keras Running: Two Options

#### To get started:

- Use the official EC2 Deep Learning [AMI](https://aws.amazon.com/machine-learning/amis/), and run Keras experiments as Jupyter notebooks on EC2. Do this if you don’t already have a GPU on your local machine. Appendix B provides a step-by-step guide.


OR 


- Install everything from scratch on a local Unix workstation. You can then run either local Jupyter notebooks or a regular Python codebase. Do this if you already have a high-end NVIDIA GPU. Appendix A provides an Ubuntu-specific, step-by-step guide.





### 3.3.3 Running Deep-Learning Jobs in the Cloud: Pros and Cons


If you’re not already equipped with a GPU that you can use for deep learning (a recent, high-end NVIDIA GPU), then running deep-learning experiments in the cloud is a simple, low-cost way for you to get started without having to buy any additional hardware. If you’re using Jupyter notebooks, the experience of running in the cloud is no different from running locally. As of mid-2017, the cloud offering that makes it easiest to get started with deep learning is definitely AWS EC2.



But if you’re a heavy user of deep learning, this setup isn’t sustainable in the long term, or even for more than a few weeks. EC2 instances are expensive: the instance type recommended in appendix B (the p2.xlarge instance, which won’t provide you with much power) costs $0.90 per hour as of mid-2017. Meanwhile, a solid consumer-class GPU will cost you somewhere between $1,000 and $1,500, a price that has been fairly stable over time, even as the specs of these GPUs keep improving. If you’re serious about deep learning, you should set up a local workstation with one or more GPUs.
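A quick back-of-the-envelope calculation makes the trade-off concrete. It uses only the hourly rate and the GPU price range quoted above, and it deliberately ignores everything else (electricity, the rest of the workstation, and the difference in speed between the two options), so treat it as a rough sketch rather than a cost model.

```python
def break_even_hours(gpu_price, hourly_rate):
    """Hours of cloud rental that cost as much as buying the GPU outright."""
    return gpu_price / hourly_rate

HOURLY_RATE = 0.90          # dollars per hour, the instance rate quoted above
GPU_PRICES = (1000, 1500)   # dollars, the consumer GPU price range quoted above

for price in GPU_PRICES:
    hours = break_even_hours(price, HOURLY_RATE)
    days = hours / 24       # if the instance ran around the clock
    print(f"${price}: about {hours:.0f} hours, or {days:.0f} days of nonstop use")
```

Under these assumptions, the rental cost catches up with the purchase price after somewhere between roughly 1,100 and 1,700 hours of use.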


EC2 is a great way to get started. You could follow the code examples in this book entirely on an EC2 GPU instance. But if you’re going to be a power user of deep learning, get your own GPUs. 


### 3.3.4 What is the Best GPU for Deep Learning?


If you’re going to buy a GPU, which one should you choose? The first thing to note is that it must be an NVIDIA GPU. NVIDIA is the only graphics computing company that has invested heavily in deep learning so far, and modern deep-learning frameworks can only run on NVIDIA cards.




## 3.4 Classifying Movie Reviews: a Binary Classification Example


Two-class classification, or binary classification, may be the most widely applied kind of machine-learning problem. In this example, we will learn to classify movie reviews as positive or negative, based on the text content of the reviews.

### 3.4.1 The IMDB Dataset

The IMDB dataset is a set of 50,000 highly polarized reviews from the Internet Movie Database. They’re split into 25,000 reviews for training and 25,000 reviews for testing, each set consisting of 50% negative and 50% positive reviews.



Just because a model performs well on its training data doesn’t mean it will perform well on data it has never seen; and what you care about is your model’s performance on new data (because you already know the labels of your training data—obviously you don’t need your model to predict those). For instance, it’s possible that your model could end up merely memorizing a mapping between your training samples and their targets, which would be useless for the task of predicting targets for data the model has never seen before. We’ll go over this point in much more detail in the next chapter.
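The memorization failure mode can be shown with a deliberately silly "model": a lookup table. The toy samples and labels below are made up, and `predict` is a hypothetical helper, but the pattern is the one described above: perfect results on the training data and nothing useful on unseen data.

```python
# Hypothetical toy data: each sample is a tuple of word indices with a 0/1 label.
train_samples = [(1, 14, 22), (1, 43, 530), (1, 973, 16)]
train_targets = [1, 0, 1]
test_samples = [(1, 14, 530), (1, 66, 4)]

# A "model" that only memorizes the mapping from training samples to targets.
lookup = dict(zip(train_samples, train_targets))

def predict(sample):
    # Returns None for anything it has not seen verbatim during "training".
    return lookup.get(sample)

correct = sum(predict(s) == t for s, t in zip(train_samples, train_targets))
train_accuracy = correct / len(train_samples)
unseen_predictions = [predict(s) for s in test_samples]

print("training accuracy:", train_accuracy)
print("predictions on unseen samples:", unseen_predictions)
```

This is why the dataset is split in two: the test set is what tells you whether the model has learned anything beyond its training samples.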




Just like the MNIST dataset, the IMDB dataset comes packaged with Keras. It has already been preprocessed: the reviews (sequences of words) have been turned into sequences of integers, where each integer stands for a specific word in a dictionary.

In [12]:
import requests
import ssl

requests.packages.urllib3.disable_warnings()


try:
    _create_unverified_https_context = ssl._create_unverified_context
except AttributeError:
    # Legacy Python that doesn't verify HTTPS certificates by default
    pass
else:
    # Handle target environment that doesn't support HTTPS verification
    ssl._create_default_https_context = _create_unverified_https_context

In [13]:
from keras.datasets import imdb

(train_data, train_labels), (test_data, test_labels) = imdb.load_data(num_words=10000)

Downloading data from https://s3.amazonaws.com/text-datasets/imdb.npz


The argument num_words=10000 means you’ll only keep the top 10,000 most frequently occurring words in the training data. Rare words will be discarded. This allows you to work with vector data of manageable size.

The variables train_data and test_data are lists of reviews; each review is a list of word indices (encoding a sequence of words). train_labels and test_labels are lists of 0s and 1s, where 0 stands for negative and 1 stands for positive.

In [14]:
train_data[0]

[1,
 14,
 22,
 16,
 43,
 530,
 973,
 1622,
 1385,
 65,
 458,
 4468,
 66,
 3941,
 4,
 173,
 36,
 256,
 5,
 25,
 100,
 43,
 838,
 112,
 50,
 670,
 2,
 9,
 35,
 480,
 284,
 5,
 150,
 4,
 172,
 112,
 167,
 2,
 336,
 385,
 39,
 4,
 172,
 4536,
 1111,
 17,
 546,
 38,
 13,
 447,
 4,
 192,
 50,
 16,
 6,
 147,
 2025,
 19,
 14,
 22,
 4,
 1920,
 4613,
 469,
 4,
 22,
 71,
 87,
 12,
 16,
 43,
 530,
 38,
 76,
 15,
 13,
 1247,
 4,
 22,
 17,
 515,
 17,
 12,
 16,
 626,
 18,
 2,
 5,
 62,
 386,
 12,
 8,
 316,
 8,
 106,
 5,
 4,
 2223,
 5244,
 16,
 480,
 66,
 3785,
 33,
 4,
 130,
 12,
 16,
 38,
 619,
 5,
 25,
 124,
 51,
 36,
 135,
 48,
 25,
 1415,
 33,
 6,
 22,
 12,
 215,
 28,
 77,
 52,
 5,
 14,
 407,
 16,
 82,
 2,
 8,
 4,
 107,
 117,
 5952,
 15,
 256,
 4,
 2,
 7,
 3766,
 5,
 723,
 36,
 71,
 43,
 530,
 476,
 26,
 400,
 317,
 46,
 7,
 4,
 2,
 1029,
 13,
 104,
 88,
 4,
 381,
 15,
 297,
 98,
 32,
 2071,
 56,
 26,
 141,
 6,
 194,
 7486,
 18,
 4,
 226,
 22,
 21,
 134,
 476,
 26,
 480,
 5,
 144,
 30,
 5535,
 18,

In [15]:
train_labels[0]

1

Because you're restricting yourself to the top 10,000 most frequent words, no word index will exceed 10,000.

In [16]:
max([max(sequence) for sequence in train_data])

9999
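To see how such an integer sequence maps back to text, here is a sketch with a made-up, four-word dictionary. Two things are assumptions: the miniature `toy_word_index` (the real dictionary ships with the dataset and is far larger), and the offset of 3, which reflects the convention that indices 0, 1, and 2 are reserved for "padding", "start of sequence", and "unknown".

```python
# Hypothetical miniature word index (word -> rank); the real one comes with
# the dataset and is much larger.
toy_word_index = {"the": 1, "film": 2, "was": 3, "brilliant": 4}

# Assumption: encoded reviews shift every rank by 3, because 0, 1 and 2 are
# reserved for "padding", "start of sequence" and "unknown".
INDEX_OFFSET = 3
reverse_word_index = {rank: word for word, rank in toy_word_index.items()}

def decode_review(encoded):
    """Map each integer back to its word; reserved or unknown indices become '?'."""
    return " ".join(reverse_word_index.get(i - INDEX_OFFSET, "?") for i in encoded)

encoded_review = [1, 4, 5, 6, 7, 2]
print(decode_review(encoded_review))
```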



![reviews](Images/03_04.jpg)


### 3.4.2 Preparing the Data

You can’t feed lists of integers into a neural network. You have to turn your lists into tensors. There are two ways to do that:


- Pad your lists so that they all have the same length, turn them into an integer tensor of shape (samples, word_indices), and then use as the first layer in your network a layer capable of handling such integer tensors (the Embedding layer, which we’ll cover in detail later in the book).


- One-hot encode your lists to turn them into vectors of 0s and 1s. This would mean, for instance, turning the sequence [3, 5] into a 10,000-dimensional vector that would be all 0s except for indices 3 and 5, which would be 1s. Then you could use as the first layer in your network a Dense layer, capable of handling floating-point vector data.
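The first option, padding, can be sketched in NumPy as follows. `pad_to_same_length` is a hypothetical helper written for this note; padding at the end with 0 is an assumption, and a library utility may pad on the other side or truncate long sequences.

```python
import numpy as np

def pad_to_same_length(sequences, pad_value=0):
    """Right-pad integer lists with pad_value up to the longest length."""
    max_len = max(len(seq) for seq in sequences)
    padded = np.full((len(sequences), max_len), pad_value, dtype="int64")
    for i, seq in enumerate(sequences):
        padded[i, :len(seq)] = seq   # copy the real indices, leave the rest padded
    return padded

toy_reviews = [[1, 14, 22], [1, 43], [1, 530, 973, 16]]
padded = pad_to_same_length(toy_reviews)

print(padded.shape)   # (samples, word_indices)
print(padded)
```

The result is a single integer tensor, which is the form an Embedding layer expects; the second option, one-hot encoding, is the one used in the rest of this section.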



![binary_matrix](Images/03_05.jpg)




#### Vectorizing Data:

In [19]:
import numpy as np


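def vectorize_sequences(sequences, dimension=10000):
    # Create an all-zero matrix of shape (len(sequences), dimension)
    results = np.zeros((len(sequences), dimension))
    for i, sequence in enumerate(sequences):
        # Set the indices listed in this sequence to 1 in row i
        results[i, sequence] = 1.
    return results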

x_train = vectorize_sequences(train_data)
x_test = vectorize_sequences(test_data)

#### Samples now:

In [20]:
x_train[0]

array([0., 1., 1., ..., 0., 0., 0.])
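The same encoding is easier to inspect at a smaller scale. The sketch below repeats the logic with a 10-dimensional vector and the [3, 5] example from earlier; `one_hot_sequences` is just a renamed copy for illustration. The second toy sequence shows a side effect worth knowing: repeated indices and word order are lost, because each position only records whether a word occurs at all.

```python
import numpy as np

def one_hot_sequences(sequences, dimension):
    results = np.zeros((len(sequences), dimension))
    for i, sequence in enumerate(sequences):
        results[i, sequence] = 1.0   # set every listed index to 1 at once
    return results

encoded = one_hot_sequences([[3, 5], [3, 3, 7]], dimension=10)

print(encoded[0])         # 1s at positions 3 and 5, 0s elsewhere
print(encoded[1].sum())   # index 3 appears twice but is only counted once
```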

#### Vectorizing labels:

In [21]:
y_train = np.asarray(train_labels).astype('float32')
y_test = np.asarray(test_labels).astype('float32')

Now the data is ready to be fed into a neural network.

### 3.4.3 Building Your Network










### 3.4.4 Validating Your Approach

### 3.4.5 Using a Trained Network to Generate Predictions on New Data

### 3.4.6 Further Experiments

### 3.4.7 Wrapping Up

## 3.5 Classifying Newswires: a Multiclass Classification Example

### 3.5.1 The Reuters Dataset

### 3.5.2 Preparing the Data

### 3.5.3 Building Your Network

### 3.5.4 Validating Your Approach

### 3.5.5 Generating Predictions on New Data

### 3.5.6 A Different Way to Handle the Labels and the Loss

### 3.5.7 The Importance of Having Sufficiently Large Intermediate Layers

### 3.5.8 Further Experiments

### 3.5.9 Wrapping Up

## 3.6 Predicting House Prices: a Regression Example

### 3.6.1 The Boston Housing Price Dataset

### 3.6.2 Preparing the Data

### 3.6.3 Building Your Network

### 3.6.4 Validating Your Approach Using K-fold Validation

### 3.6.5 Wrapping Up