# Python Text Crash Course_Part 2 Data Preparation
## Missing, Distinct Value, Scaling

## Full-Day Workshop for Users Learning Data Science with Python
### 2017 Dec Timothy CL Lam
This is meant for internal use. Parts of the content are copied from external sources; it is not for commercial purposes.


# The Case for Convolutional Neural Networks
Given a dataset of grayscale images with a standardized size of 32 × 32 pixels each, a
traditional feedforward neural network would require 1,024 input weights (plus one bias) for
each neuron in its first hidden layer. This is fair enough, but flattening the image matrix of
pixels into a long vector of pixel values loses all of the spatial structure in the image. Unless
all of the images are perfectly resized, the neural network will have great difficulty with the
problem.

Convolutional Neural Networks expect and preserve the spatial relationship between pixels
by learning internal feature representations using small squares of input data. Features are
learned and used across the whole image, allowing the objects in the images to be shifted or
translated in the scene and still be detectable by the network. This is why the network is
so useful for object recognition in photographs, picking out digits, faces, objects and so on with
varying orientation. In summary, below are some of the benefits of using convolutional neural
networks:


- They use fewer parameters (weights) to learn than a fully connected network.

- They are designed to be invariant to object position and distortion in the scene.

- They automatically learn and generalize features from the input domain.
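
To make the first benefit concrete, the sketch below counts the learnable parameters of a fully connected layer and of a convolutional layer applied to the 32 × 32 grayscale image described above. The 128-unit hidden layer and the 32 filters of size 3 × 3 are illustrative assumptions, not values from this workshop.

```python
def dense_params(n_inputs, n_units):
    """Weights plus one bias for every unit of a fully connected layer."""
    return n_units * (n_inputs + 1)


def conv_params(field_height, field_width, in_channels, n_filters):
    """Weights plus one bias per filter; each filter is reused at every position."""
    return n_filters * (field_height * field_width * in_channels + 1)


n_pixels = 32 * 32  # flattened grayscale image: 1,024 input values

print(dense_params(n_pixels, 1))    # 1025: a single neuron (1,024 weights + 1 bias)
print(dense_params(n_pixels, 128))  # 131200: hypothetical 128-unit hidden layer
print(conv_params(3, 3, 1, 32))     # 320: hypothetical 32 filters of 3 x 3
```

The convolutional layer stays small because its parameter count depends on the filter size, not on the image size.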

## Building Blocks of Convolutional Neural Networks
There are three types of layers in a Convolutional Neural Network:
1. Convolutional Layers.
2. Pooling Layers.
3. Fully-Connected Layers. 

## Convolutional Layers
Convolutional layers are composed of filters and feature maps.

###  Filters
The filters are essentially the neurons of the layer. They have weighted inputs and generate
an output value like a neuron. The input size is a fixed square called a patch or a receptive
field. If the convolutional layer is the input layer, then the input patch will be pixel values. If
it is deeper in the network architecture, then the convolutional layer will take its input from a
feature map from the previous layer.
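
As a minimal sketch of a single filter acting like a neuron, the function below computes the weighted sum of one 3 × 3 patch and passes it through an activation. The ReLU activation, the hand-picked edge-detecting weights, and the patch values are all assumptions made for illustration; in a real network the weights are learned.

```python
def filter_activation(patch, weights, bias=0.0):
    """Weighted sum of a square patch, followed by a ReLU activation (assumed)."""
    total = bias
    for patch_row, weight_row in zip(patch, weights):
        for value, weight in zip(patch_row, weight_row):
            total += value * weight
    return max(0.0, total)


# Hypothetical weights that respond to a dark-to-bright vertical edge.
edge_weights = [[-1, 0, 1],
                [-1, 0, 1],
                [-1, 0, 1]]

edge_patch = [[0, 0, 1],
              [0, 0, 1],
              [0, 0, 1]]
flat_patch = [[1, 1, 1],
              [1, 1, 1],
              [1, 1, 1]]

print(filter_activation(edge_patch, edge_weights))  # 3.0: the edge is detected
print(filter_activation(flat_patch, edge_weights))  # 0.0: no edge in a uniform patch
```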

###  Feature Maps
The feature map is the output of one filter applied to the previous layer. A given filter is drawn
across the entire previous layer and moved one pixel at a time. Each position results in an
activation of the neuron and the output is collected in the feature map. You can see that if the
receptive field is moved one pixel from activation to activation, then the field will overlap with
the previous activation by (field width - 1) input values along the direction of movement.

The distance that the filter is moved across the input from the previous layer for each activation is
referred to as the stride. If the size of the previous layer is not cleanly divisible by the size of
the filter's receptive field and the size of the stride, then it is possible for the receptive field to
attempt to read off the edge of the input feature map. In this case, techniques like zero padding
can be used to invent mock inputs with zero values for the receptive field to read.
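
The sliding of a filter with a given stride and optional zero padding can be sketched in plain Python as follows. The function names and the all-ones filter are hypothetical, no activation function is applied, and this version simply skips positions where the field would run off the edge, so padding is what lets the border pixels be covered.

```python
def zero_pad(image, pad):
    """Surround a 2D list with `pad` rows and columns of zeros."""
    width = len(image[0])
    blank = [0] * (width + 2 * pad)
    padded = [blank[:] for _ in range(pad)]
    for row in image:
        padded.append([0] * pad + list(row) + [0] * pad)
    padded.extend(blank[:] for _ in range(pad))
    return padded


def feature_map(image, weights, stride=1, pad=0):
    """Slide a square filter over the image and collect the weighted sums."""
    image = zero_pad(image, pad)
    size = len(weights)
    result = []
    for top in range(0, len(image) - size + 1, stride):
        out_row = []
        for left in range(0, len(image[0]) - size + 1, stride):
            total = 0
            for i in range(size):
                for j in range(size):
                    total += image[top + i][left + j] * weights[i][j]
            out_row.append(total)
        result.append(out_row)
    return result


image = [[r * 5 + c for c in range(5)] for r in range(5)]  # 5 x 5 test image
ones = [[1] * 3 for _ in range(3)]                          # 3 x 3 filter of ones

print(len(feature_map(image, ones)))            # 3: stride 1, no padding
print(len(feature_map(image, ones, stride=2)))  # 2: stride 2, no padding
print(len(feature_map(image, ones, pad=1)))     # 5: padding preserves the size
```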

## Pooling Layers
The pooling layers down-sample the previous layer's feature map. Pooling layers follow a sequence
of one or more convolutional layers and are intended to consolidate the features learned and
expressed in the previous layer's feature map. As such, pooling may be considered a technique
to compress or generalize feature representations and generally reduce the overfitting of the
training data by the model.

They too have a receptive field, often much smaller than that of the convolutional layer. Also, the
stride, or number of inputs that the receptive field is moved for each activation, is often equal to
the size of the receptive field to avoid any overlap. Pooling layers are often very simple, taking
the average or the maximum of the input values in order to create their own feature map.
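
A minimal sketch of pooling with a non-overlapping receptive field is shown below. The 4 × 4 feature map values and the helper names are made up for illustration; the same function covers both maximum and average pooling by swapping the reducing function.

```python
def pool2d(feature_map, size=2, stride=2, reduce=max):
    """Down-sample a 2D feature map by reducing each window to one value."""
    pooled = []
    for top in range(0, len(feature_map) - size + 1, stride):
        out_row = []
        for left in range(0, len(feature_map[0]) - size + 1, stride):
            window = [feature_map[top + i][left + j]
                      for i in range(size) for j in range(size)]
            out_row.append(reduce(window))
        pooled.append(out_row)
    return pooled


def mean(values):
    return sum(values) / len(values)


activations = [[1, 3, 2, 0],
               [4, 2, 1, 1],
               [0, 1, 5, 6],
               [2, 2, 7, 8]]

print(pool2d(activations))               # [[4, 2], [2, 8]]
print(pool2d(activations, reduce=mean))  # [[2.5, 1.0], [1.25, 6.5]]
```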

## Fully Connected Layers
Fully connected layers are the normal flat feedforward neural network layer. These layers may
have a nonlinear activation function or a softmax activation in order to output probabilities
of class predictions. Fully connected layers are used at the end of the network after feature
extraction and consolidation have been performed by the convolutional and pooling layers. They
are used to create final nonlinear combinations of features and for making predictions by the
network.
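
The following sketch shows how a fully connected layer followed by a softmax turns a flattened feature vector into class probabilities. The feature values, weights, and the choice of two classes are hypothetical; in practice the weights are learned during training.

```python
import math


def dense(inputs, weights, biases):
    """One fully connected layer: a weighted sum plus bias for each output unit."""
    return [sum(x * w for x, w in zip(inputs, unit_weights)) + bias
            for unit_weights, bias in zip(weights, biases)]


def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    shift = max(scores)  # subtracting the maximum improves numerical stability
    exps = [math.exp(score - shift) for score in scores]
    total = sum(exps)
    return [value / total for value in exps]


features = [0.5, 1.0, -0.5]        # hypothetical flattened features
weights = [[1.0, 0.0, 0.0],        # weights for class 0
           [0.0, 1.0, 0.0]]        # weights for class 1
biases = [0.0, 0.0]

probabilities = softmax(dense(features, weights, biases))
print(probabilities)  # two probabilities; class 1 receives the larger share
```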

# Convolutional Neural Networks Best Practices

Now that we know about the building blocks for a convolutional neural network and how the
layers hang together, we can review some best practices to consider when applying them.

- Input Receptive Field Dimensions: The default is 2D for images, but could be 1D such as for words in a sentence or 3D for video that adds a time dimension.

- Receptive Field Size: The patch should be as small as possible, but large enough to see features in the input data. It is common to use 3 × 3 on small images and 5 × 5 or 7 × 7 and more on larger image sizes.

- Stride Width: Use the default stride of 1. It is easy to understand and you don't need padding to handle the receptive field falling off the edge of your images. This could be increased to 2 or larger for larger images.

- Number of Filters: Filters are the feature detectors. Generally fewer filters are used at the input layer and increasingly more filters are used at deeper layers.

- Padding: Set to zero, and called zero padding, when reading non-input data (positions beyond the edge of the input). This is useful when you cannot or do not want to standardize input image sizes or when you want to use receptive field and stride sizes that do not neatly divide up the input image size.

- Pooling: Pooling is a destructive or generalization process to reduce overfitting. Receptive field size is almost always set to 2 × 2 with a stride of 2 to discard 75% of the activations from the output of the previous layer.

- Data Preparation: Consider standardizing input data, both the dimensions of the images and pixel values.

- Pattern Architecture: It is common to pattern the layers in your network architecture. This might be one, two or some number of convolutional layers followed by a pooling layer. This structure can then be repeated one or more times. Finally, fully connected layers are often only used at the output end and may be stacked one, two or more deep.

- Dropout: CNNs have a habit of overfitting, even with pooling layers. Dropout should be used, for example between fully connected layers and perhaps after pooling layers.
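
To see how these choices interact, the sketch below traces the spatial size of a 32 × 32 input through a repeated convolution-then-pooling pattern, using the standard output-size relation `(input + 2 * pad - field) // stride + 1`. The specific stack (two 3 × 3 convolutions with one pixel of zero padding, each followed by 2 × 2 pooling with a stride of 2) is an assumed example, not an architecture taken from this workshop.

```python
def output_size(input_size, field_size, stride=1, pad=0):
    """Width (or height) of the feature map produced by a sliding receptive field."""
    return (input_size + 2 * pad - field_size) // stride + 1


# (name, receptive field size, stride, zero padding): a hypothetical pattern
layers = [
    ("conv 3x3", 3, 1, 1),
    ("pool 2x2", 2, 2, 0),
    ("conv 3x3", 3, 1, 1),
    ("pool 2x2", 2, 2, 0),
]

size = 32
trace = [size]
for name, field, stride, pad in layers:
    size = output_size(size, field, stride, pad)
    trace.append(size)
    print(f"{name}: {size} x {size}")

# Fraction of activations that survive one 2 x 2 pooling layer with stride 2.
kept = (trace[2] ** 2) / (trace[1] ** 2)
print(f"kept {kept:.0%}, discarded {1 - kept:.0%}")  # kept 25%, discarded 75%
```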

# Project: Predict Sentiment From Movie Reviews
Sentiment analysis is a natural language processing problem where text is understood and the
underlying intent is predicted. In this lesson you will discover how you can predict the sentiment
of movie reviews as either positive or negative in Python using the Keras deep learning library.
After completing this step-by-step tutorial, you will know:
- About the IMDB sentiment analysis problem for natural language processing and how to load it in Keras.
- How to use word embedding in Keras for natural language problems.
- How to develop and evaluate a Multilayer Perceptron model for the IMDB problem.
- How to develop a one-dimensional convolutional neural network model for the IMDB problem. 

In [1]:
import numpy
from keras.datasets import imdb
from matplotlib import pyplot
# load the dataset
(X_train, y_train), (X_test, y_test) = imdb.load_data()
X = numpy.concatenate((X_train, X_test), axis=0)
y = numpy.concatenate((y_train, y_test), axis=0)


Using TensorFlow backend.


Downloading data from https://s3.amazonaws.com/text-datasets/imdb.npz

The dataset contains 50,000 movie reviews from IMDB (25,000 for training and 25,000 for testing), labeled by sentiment (positive/negative). Reviews have been preprocessed, and each review is encoded as a sequence of word indexes (integers). For convenience, words are indexed by overall frequency in the dataset, so that for instance the integer "3" encodes the 3rd most frequent word in the data. This allows for quick filtering operations such as: "only consider the top 10,000 most common words, but eliminate the top 20 most common words".
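
The filtering operation quoted above can be sketched in plain Python as follows. Keras's `imdb.load_data` accepts `num_words` and `skip_top` arguments for this purpose; the helper below is only a hypothetical stand-in that shows the idea, and replacing filtered words with the index 2 as an out-of-vocabulary marker is an assumption of this sketch. The sample review is a short made-up sequence.

```python
def filter_by_rank(sequence, num_words=None, skip_top=0, oov_index=2):
    """Keep word indexes in [skip_top, num_words); replace the rest with oov_index."""
    return [
        word if word >= skip_top and (num_words is None or word < num_words)
        else oov_index
        for word in sequence
    ]


# Hypothetical encoded review: small integers are the most frequent words.
review = [1, 14, 22, 16, 43, 530, 973, 1622, 22665]

# Top 10,000 most common words, without the 20 most common ones.
filtered = filter_by_rank(review, num_words=10000, skip_top=20)
print(filtered)  # [2, 2, 22, 2, 43, 530, 973, 1622, 2]
```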

In [2]:
# summarize size
print("Training data: ")
print(X.shape)
print(y.shape)

Training data: 
(50000,)
(50000,)


In [9]:
df [:]

((array([ list([1, 14, 22, 16, 43, 530, 973, 1622, 1385, 65, 458, 4468, 66, 3941, 4, 173, 36, 256, 5, 25, 100, 43, 838, 112, 50, 670, 22665, 9, 35, 480, 284, 5, 150, 4, 172, 112, 167, 21631, 336, 385, 39, 4, 172, 4536, 1111, 17, 546, 38, 13, 447, 4, 192, 50, 16, 6, 147, 2025, 19, 14, 22, 4, 1920, 4613, 469, 4, 22, 71, 87, 12, 16, 43, 530, 38, 76, 15, 13, 1247, 4, 22, 17, 515, 17, 12, 16, 626, 18, 19193, 5, 62, 386, 12, 8, 316, 8, 106, 5, 4, 2223, 5244, 16, 480, 66, 3785, 33, 4, 130, 12, 16, 38, 619, 5, 25, 124, 51, 36, 135, 48, 25, 1415, 33, 6, 22, 12, 215, 28, 77, 52, 5, 14, 407, 16, 82, 10311, 8, 4, 107, 117, 5952, 15, 256, 4, 31050, 7, 3766, 5, 723, 36, 71, 43, 530, 476, 26, 400, 317, 46, 7, 4, 12118, 1029, 13, 104, 88, 4, 381, 15, 297, 98, 32, 2071, 56, 26, 141, 6, 194, 7486, 18, 4, 226, 22, 21, 134, 476, 26, 480, 5, 144, 30, 5535, 18, 51, 36, 28, 224, 92, 25, 104, 4, 226, 65, 16, 38, 1334, 88, 12, 16, 283, 5, 16, 4472, 113, 103, 32, 15, 16, 5345, 19, 178, 32]),
         list([1, 1

# Word Embeddings
A recent breakthrough in the field of natural language processing is called word embedding. This
is a technique where words are encoded as real-valued vectors in a high dimensional space, where
the similarity between words in terms of meaning translates to closeness in the vector space.
Discrete words are mapped to vectors of continuous numbers. This is useful when working with
natural language problems with neural networks as we require numbers as input values.

Keras provides a convenient way to convert positive integer representations of words into a
word embedding by an Embedding layer. The layer takes arguments that define the mapping
including the maximum number of expected words, also called the vocabulary size (e.g. the
largest integer value that will be seen as an input). The layer also allows you to specify the
dimensionality for each word vector, called the output dimension.
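
Conceptually, an embedding can be thought of as a lookup table with one row per word index and one column per output dimension. The sketch below illustrates that idea in plain Python; it is not the Keras implementation. The tiny vocabulary size, the 4-dimensional vectors, and the random initial values are assumptions for illustration, and in a trained model the vectors would be learned.

```python
import random


def make_embedding_table(vocab_size, output_dim, seed=0):
    """One small random vector per word index (learned in a real model)."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.05, 0.05) for _ in range(output_dim)]
            for _ in range(vocab_size)]


def embed(sequence, table):
    """Map each integer word index to its vector."""
    return [table[index] for index in sequence]


table = make_embedding_table(vocab_size=50, output_dim=4)
vectors = embed([1, 14, 22, 16], table)  # hypothetical 4-word review

print(len(vectors), len(vectors[0]))  # 4 4: four words, each a 4-dimensional vector
```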

We would like to use a word embedding representation for the IMDB dataset. Let's say
that we are only interested in the 5,000 most used words in the dataset. Therefore our
vocabulary size will be 5,000. We can choose to use a 32-dimensional vector to represent each
word. Finally, we may choose to cap the maximum review length at 500 words, truncating
reviews longer than that and padding reviews shorter than that with 0 values. We would load
the IMDB dataset as follows: