## Data Loading 

+ There are two components to working with data in neon.
+ The first is a data iterator (**NervanaDataIterator**), which feeds the model minibatches of data during training or evaluation.
+ The second is a dataset class (**Dataset**), which handles loading and preprocessing the data.

## Data Iterators

+ Data iterators are Python iterables: they implement the ```__iter__``` method, and iterating over one yields a new minibatch of data at each step.
+ Neon provides two kinds of iterator classes, chosen according to the size of the data.
+ If your data is small enough to fit in memory, use the ```ArrayIterator``` class.
+ If your data is too large for memory, save it in the HDF5 format and use the ```HDF5Iterator``` class, which loads chunks of data to send to the model. This approach works with any type of data.
+ Custom iterators can also be built for specific use cases by extending these classes.
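
To make the iterable idea concrete, here is a minimal sketch in plain NumPy. The class `MinibatchIterable` is hypothetical and is not part of neon; it only illustrates how an object with an ```__iter__``` method can hand out one minibatch per step.

```python
import numpy as np


class MinibatchIterable:
    """Hypothetical stand-in for a data iterator; not part of neon."""

    def __init__(self, X, y, batch_size):
        self.X = X
        self.y = y
        self.batch_size = batch_size

    def __iter__(self):
        # Each call starts a fresh pass over the data and yields
        # one (features, labels) minibatch per step.
        for start in range(0, len(self.X), self.batch_size):
            stop = start + self.batch_size
            yield self.X[start:stop], self.y[start:stop]


X = np.arange(20).reshape(10, 2)   # 10 examples, 2 features each
y = np.arange(10)                  # 10 labels
batches = list(MinibatchIterable(X, y, batch_size=4))
```

Because ```__iter__``` restarts the pass each time, the same object can be looped over once per epoch. How neon's own iterators treat a final partial batch is not shown here.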

### ArrayIterators

+ The ArrayIterator class provides for iteration over minibatches of data that has been preloaded into memory as numpy arrays. 
+ This iterator supports classification, regression, and autoencoder tasks.

#### Classification

+ The example below sets up an image classification problem with 10,000 images.
+ Each image is 32x32 pixels with 3 color channels (R, G, B), for a total of 32×32×3=3,072 features.

In [None]:
from neon.backends import gen_backend
from neon.data import ArrayIterator
import numpy as np

# ArrayIterator loads data into the backend, so generate one first.
# The backend type and batch size used here are illustrative choices.
be = gen_backend(backend='cpu', batch_size=128)

# X holds the features and y holds the labels.
# The data in X must have shape (# examples, feature size).
X = np.random.rand(10000, 3072)  # X.shape = (10000, 3072)

# For classification, the labels y must have shape (# examples, 1). y must also
# consist of integers from 0 to nclass-1, where nclass is the number of categories.
y = np.random.randint(0, 10, (10000, 1))  # y.shape = (10000, 1)

# The features X and labels y are passed to ArrayIterator to be loaded into the backend.
# nclass, the number of classes, is set to 10.
# lshape, the local shape of the features, is set to (3, 32, 32) to represent
# the image dimensions: 32x32 pixels with 3 channels.
train = ArrayIterator(X=X, y=y, nclass=10, lshape=(3, 32, 32))

+ Importantly, the labels y for classification should be integers from **0** to **K−1**, where **K** is the number of classes. 
+ These labels are stored in the backend in a one-hot representation. 
+ This means that if we have **N** labels with **K** classes, the labels will be stored in an **N×K** binary matrix.
+ Each row will be all zeros except at the **k-th** position, where **k** is the class of that label, which will be **one**.

\begin{split}y = (0, 2, 0, 1, 2, 3) \rightarrow \left( \begin{array}{cccc}
1 & 0  & 0 & 0\\
0 & 0  & 1 & 0\\
1 & 0  & 0 & 0\\
0 & 1  & 0 & 0\\
0 & 0  & 1 & 0\\
0 & 0  & 0 & 1 \end{array}  \right).\end{split}
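
The same mapping can be reproduced with NumPy as a quick check. This is only an illustration of the encoding, not neon's backend code, and the variable names are chosen for this example.

```python
import numpy as np

y = np.array([0, 2, 0, 1, 2, 3])  # N = 6 integer labels
K = 4                             # number of classes

# Indexing the K x K identity matrix by label selects, for each label,
# the row that has a single one at that label's class index.
one_hot = np.eye(K, dtype=int)[y]  # shape (N, K)
```

Each row of `one_hot` sums to one, and `one_hot.argmax(axis=1)` recovers the original labels.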

#### Regression

+ In regression, the model output for each training example is a vector ***ŷ*** that is compared against a desired vector ***y*** with a cost function (such as mean squared error).
+ We first create the iterator. By default, ArrayIterator assumes classification, so for regression we must set ```make_onehot = False``` to turn off the one-hot representation.

In [None]:
from neon.data import ArrayIterator
import numpy as np

# Generate random inputs X
X = np.random.rand(1000, 1)
# Targets follow y = 2x + 1 with a small amount of Gaussian noise added
y = 2*X + 1 + 0.01*np.random.randn(1000, 1)

train = ArrayIterator(X=X, y=y, make_onehot=False)
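
To illustrate how a cost function compares ***ŷ*** with ***y***, the sketch below computes a mean squared error in plain NumPy on data generated the same way as above. The helper `mean_squared_error` is defined here for illustration only. It is not neon's cost implementation, which may scale the value differently.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((1000, 1))
y = 2 * X + 1 + 0.01 * rng.standard_normal((1000, 1))


def mean_squared_error(y_hat, y):
    """Average of the squared differences between predictions and targets."""
    return float(np.mean((y_hat - y) ** 2))


# A prediction that matches the underlying line is off only by the noise.
mse_good = mean_squared_error(2 * X + 1, y)
# A prediction of all zeros is far from the targets.
mse_bad = mean_squared_error(np.zeros_like(y), y)
```

The first value is on the order of the noise variance, while the second is much larger. Training aims to shrink this gap.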

#### Autoencoders

+ Autoencoders are a special case of regression where the desired outputs ***_y_*** are the input features ***X***. 
+ For convenience, you can omit the labels ***y*** when constructing the iterator; the input features are then used as the targets.

In [None]:
# Example construction of ArrayIterator for Autoencoder task with MNIST
from neon.data import MNIST
from neon.data import ArrayIterator

mnist = MNIST()

# load the MNIST data
(X_train, y_train), (X_test, y_test), nclass = mnist.load_data()

# Set input and target to X_train
train = ArrayIterator(X_train, lshape=(1, 28, 28))

#### Sequential Data

For sequence data, where data are fed to the model across multiple time steps, the required shape of the input depends on how the data is used.

***Word Embeddings***
+ Often, data such as sentences are encoded as a sequence of integers, where each integer corresponds to a word in the vocabulary.
+ This encoding is often used in conjunction with embedding layers.
+ In this case, the input data should be formatted to have shape **(T, N)**, where **T** is the number of time steps and **N** is the batch size.
+ The embedding layer takes this input and passes data of shape **(F, T∗N)** to a subsequent recurrent neural network, where **F** is the number of features (in this case, the embedding dimension). For an example, see imdb_lstm.py.
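
The shape change described above can be sketched with a plain NumPy lookup. This is not neon's embedding layer. The sizes, the random embedding table, and the column ordering of the flattened **T∗N** axis are assumptions made for this example.

```python
import numpy as np

T, N = 5, 3             # time steps, batch size
vocab_size, F = 50, 8   # vocabulary size, embedding dimension

rng = np.random.default_rng(0)
word_ids = rng.integers(0, vocab_size, size=(T, N))  # integer-encoded input, shape (T, N)
embedding = rng.standard_normal((F, vocab_size))     # one column per vocabulary word

# Look up one embedding column per word; flattening (T, N) row by row
# places time step t of batch item n at column t * N + n (assumed ordering).
embedded = embedding[:, word_ids.reshape(-1)]        # shape (F, T * N)
```

Each column of `embedded` is the **F**-dimensional vector for one word at one time step.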

***Time Series***  
+ Time series data should be formatted to have shape **(F, T∗N)**, where:
  + **F** is the number of features
  + **T** is the number of time steps
  + **N** is the batch size (user controlled)
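
As a rough sketch of this layout, the snippet below rearranges a batch of sequences stored as **(N, T, F)** into a 2D array of shape **(F, T∗N)**. The starting layout and the ordering of columns within the **T∗N** axis are assumptions for illustration. Check them against the layers you use.

```python
import numpy as np

N, T, F = 3, 4, 2   # batch size, time steps, features

# Hypothetical starting point: one (T, F) sequence per batch item.
sequences = np.arange(N * T * F, dtype=float).reshape(N, T, F)

# Move features to the front, then flatten time and batch into one axis.
# With this ordering, time step t of batch item n lands in column t * N + n.
laid_out = sequences.transpose(2, 1, 0).reshape(F, T * N)
```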

### HDF5Iterator

**What is HDF5 ?**  

+ HDF5 is a technology suite for managing extremely large and complex data collections. It includes:
  + A versatile data model that can represent very complex data objects and a wide variety of metadata.
  + A completely portable file format with no limit on the number or size of data objects in the collection.
  + A software library that runs on a range of computational platforms, from laptops to massively parallel systems, and implements a high-level API with C, C++, Fortran 90, and Java interfaces.
  + A rich set of integrated performance features that allow for access time and storage space optimizations.
  + Tools and applications for managing, manipulating, viewing, and analyzing the data in the collection.

```HDF5Iterator``` uses an HDF5 formatted data file to store the input and target data arrays so the data size is not limited by on-host and/or on-device memory capacity.

To use the ```HDF5Iterator```, the data arrays need to be stored in an HDF5 file with the following format:  
+ The input data is stored in an HDF5 dataset named ```input``` and the target output, if needed, in a dataset named ```output```. The data arrays have the same format as the arrays used to initialize the ```ArrayIterator``` class.
+ The ```input``` dataset also requires an attribute named ```lshape```, which specifies the local shape of each flattened input example (for images, ```(C, H, W)```).
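
The file layout above can be sketched with h5py alone. The snippet writes a small file with random stand-in data, then reads it back one slice at a time, which is the general idea behind iterating over data that does not fit in memory. The file name, array sizes, and batch size are assumptions for this example, and the read loop is not neon's ```HDF5Iterator``` implementation.

```python
import os
import tempfile

import h5py
import numpy as np

# Random stand-in data: 100 flattened 28x28 single-channel images and their labels.
X = np.random.rand(100, 784).astype(np.float32)
y = np.random.randint(0, 10, size=(100, 1))

path = os.path.join(tempfile.mkdtemp(), 'toy.h5')
with h5py.File(path, 'w') as f:
    f.create_dataset('input', data=X)
    f['input'].attrs['lshape'] = (1, 28, 28)  # (C, H, W)
    f.create_dataset('output', data=y)

batch_size = 32
batch_shapes = []
with h5py.File(path, 'r') as f:
    lshape = tuple(int(d) for d in f['input'].attrs['lshape'])
    n_examples = f['input'].shape[0]
    for start in range(0, n_examples, batch_size):
        # Slicing an h5py dataset reads only the requested rows from disk.
        x_batch = f['input'][start:start + batch_size]
        y_batch = f['output'][start:start + batch_size]
        batch_shapes.append((x_batch.shape, y_batch.shape))
```

Note that the product of the entries in `lshape` equals the flattened feature size (1 × 28 × 28 = 784).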

For alternate target label formats, such as converting the targets to a one-hot vector, or for autoencoder data, the ```HDF5IteratorOneHot``` and ```HDF5IteratorAutoencoder``` subclasses are included. These subclasses demonstrate how to extend the ```HDF5Iterator``` to handle different input and target data formats or transformations.

In [None]:
from neon.data import HDF5IteratorOneHot, MNIST
import h5py
import numpy as np

# load the MNIST data set; the path is a placeholder for your data directory
dataset = MNIST(path='path/to/save/downloadeddata/')
# split into train and test sets
(X_train, y_train), (X_test, y_test), nclass = dataset.load_data()

datasets = {'train': (X_train, y_train),
            'test': (X_test, y_test)}

for key in ['train', 'test']:
    # the context manager closes the file even if an error occurs
    with h5py.File('mnist_%s.h5' % key, 'w') as df:
        # input images
        in_dat = datasets[key][0]
        df.create_dataset('input', data=in_dat)
        df['input'].attrs['lshape'] = (1, 28, 28)  # (C, H, W)

        target = datasets[key][1].reshape((-1, 1))  # make it a 2D array
        df.create_dataset('output', data=target)
        df['output'].attrs['nclass'] = 10

# setup a training set iterator
# use the iterator that generates 1-hot output. other HDF5Iterator (sub) classes are
# available for different data layouts
train_set = HDF5IteratorOneHot('mnist_train.h5')
valid_set = HDF5IteratorOneHot('mnist_test.h5')

## Datasets

+ The ```Dataset``` class is a base class for commonly used datasets.
+ We recommend creating a class for your dataset that handles the loading and preprocessing of the data.
+ Datasets should implement ```gen_iterators()```, which returns a dictionary of the data iterators used for training and evaluation.
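
The pattern can be sketched without neon as follows. `ToyDataset` is hypothetical and does not subclass neon's ```Dataset```; the ```'train'``` and ```'valid'``` keys and the plain generators standing in for data iterators are assumptions for this example. In real code, ```gen_iterators()``` would typically build ```ArrayIterator``` objects instead.

```python
import numpy as np


class ToyDataset:
    """Hypothetical dataset class showing the load/preprocess/iterate split."""

    def __init__(self, n_train=80, n_valid=20, seed=0):
        self.n_train = n_train
        self.n_valid = n_valid
        self.seed = seed

    def load_data(self):
        rng = np.random.default_rng(self.seed)
        X = rng.random((self.n_train + self.n_valid, 1))
        y = 2 * X + 1
        # Preprocessing: centre the features using training statistics only.
        X = X - X[:self.n_train].mean()
        n = self.n_train
        return (X[:n], y[:n]), (X[n:], y[n:])

    def gen_iterators(self, batch_size=16):
        (X_tr, y_tr), (X_va, y_va) = self.load_data()

        def batches(X, y):
            for start in range(0, len(X), batch_size):
                yield X[start:start + batch_size], y[start:start + batch_size]

        # One iterator per split, keyed by split name (assumed key names).
        return {'train': batches(X_tr, y_tr), 'valid': batches(X_va, y_va)}


iterators = ToyDataset().gen_iterators()
first_X, first_y = next(iterators['train'])
```

Keeping loading and preprocessing inside the class means training scripts only need to ask for the iterators.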

***Neon provides dataset objects for handling many stock datasets.***
  
  
They are as follows:  
+ MNIST is a dataset of handwritten digits, consisting of 60,000 training samples and 10,000 test samples. Each image is 28x28 greyscale pixels.
+ CIFAR10 is a dataset consisting of 50,000 training samples and 10,000 test samples. There are 10 categories and each sample is a 32x32 RGB color image.
+ Text datasets (e.g. Penn Treebank, Hutter Prize, and Shakespeare) have object classes for loading, and sometimes preprocessing, the data.
+ ImageNet

##### MNIST

In [None]:
from neon.data import MNIST

mnist = MNIST(path='path/to/save/downloadeddata/')
train_set = mnist.train_iter
valid_set = mnist.valid_iter

##### CIFAR10

In [None]:
from neon.data import CIFAR10

cifar10 = CIFAR10()
train = cifar10.train_iter
test = cifar10.valid_iter

##### Penn Treebank

In [None]:
from neon.data import PTB

# number of time steps per sequence; the value here is illustrative
time_steps = 20

# download Penn Treebank and parse at the word level
ptb = PTB(time_steps, tokenizer="newline_tokenizer")
train_set = ptb.train_iter

#### Low level dataset operations

+ Some applications require access to the underlying data to generate more complex data iterators. 
+ This can be done by using the ```load_data``` method of the ```Dataset``` class and its subclasses.
+ The method returns the data arrays that are used to generate the data iterators.

In [None]:
# For example, the code below shows how to generate a data iterator to train an autoencoder on the MNIST dataset

from neon.data import MNIST
from neon.data import ArrayIterator

mnist = MNIST()
# get the raw data arrays, both train set and validation set
(X_train, y_train), (X_test, y_test), nclass = mnist.load_data()

# generate an ArrayIterator with no target data;
# this will return the image itself as the target
train = ArrayIterator(X_train, lshape=(1, 28, 28))

### Useful Resources 

+ Neon Documentation : http://neon.nervanasys.com/docs/latest/index.html  
+ Neon Examples : https://github.com/NervanaSystems/neon/tree/master/examples
+ GitHub Repo : https://github.com/NervanaSystems/neon