**Author: Christian Urcuqui**

**Date: 3 August 2018**

# notMNIST

This notebook works with the notMNIST dataset, a variant of MNIST (Modified National Institute of Standards and Technology), which is one of the most widely used image datasets in image processing and machine learning. MNIST consists of handwritten digits, while notMNIST contains letters rendered in a variety of fonts.
These datasets are a good way to say "hello world" in TensorFlow: many examples are available, and several of them use this data for the first steps in deep learning, from a simple neural network with a softmax function up to a ConvNet architecture.


<img src="../../Utilities/notmnist.jpeg" width="250">

In this Jupyter notebook I explore different examples to find out which solution works best. The notebook's content is divided into these sections:

- [Libraries](#Libraries)
- [Dataset](#Dataset)
- [Classical-ML](#Classical-ML)
- [Udemy-DeepLearning](#Udemy-DeepLearning)
- [Book](#Book)
- [References](#References)



To solve this problem I used two example sources: the book [1] and the Udemy course [2].


## Libraries

+ **imageio 2.2.0**: a Python library that provides an easy interface to read and write a wide range of image data.
+ **matplotlib**: a library for making plots.
+ **numpy**: a library for mathematical and scientific operations.
+ **tarfile**: allows us to process tar files.
+ **IPython**: the interactive Python library that Jupyter builds on; here it provides the display utilities.
+ **six**: a Python library that smooths over the differences between Python 2 and Python 3.


In [1]:
from __future__ import print_function
import imageio
import matplotlib.pyplot as plt
import numpy as np
import os
import sys
import tarfile
from IPython.display import display, Image
from sklearn.linear_model import LogisticRegression
from six.moves.urllib.request import urlretrieve
from six.moves import cPickle as pickle
import tensorflow as tf
from tensorflow.examples.tutorials.mnist import input_data
# Config the matplotlib backend as plotting inline in IPython
%matplotlib inline



## Dataset 
The next code downloads the dataset to our local machine. The repository contains characters rendered in a variety of fonts as 28x28 images. It only includes labeled images for the letters 'A' through 'J', that is, 10 classes.

In [3]:
url = 'https://commondatastorage.googleapis.com/books1000/'
last_percent_reported = None
data_root = '.' # Change me to store data elsewhere

def download_progress_hook(count, blockSize, totalSize):
  """A hook to report the progress of a download. This is mostly intended for users with
  slow internet connections. Reports every 5% change in download progress.
  """
  global last_percent_reported
  percent = int(count * blockSize * 100 / totalSize)

  if last_percent_reported != percent:
    if percent % 5 == 0:
      sys.stdout.write("%s%%" % percent)
      sys.stdout.flush()
    else:
      sys.stdout.write(".")
      sys.stdout.flush()
      
    last_percent_reported = percent
        
def maybe_download(filename, expected_bytes, force=False):
  """Download a file if not present, and make sure it's the right size."""
  dest_filename = os.path.join(data_root, filename)
  if force or not os.path.exists(dest_filename):
    print('Attempting to download:', filename) 
    filename, _ = urlretrieve(url + filename, dest_filename, reporthook=download_progress_hook)
    print('\nDownload Complete!')
  statinfo = os.stat(dest_filename)
  if statinfo.st_size == expected_bytes:
    print('Found and verified', dest_filename)
  else:
    raise Exception(
      'Failed to verify ' + dest_filename + '. Can you get to it with a browser?')
  return dest_filename

train_filename = maybe_download('notMNIST_large.tar.gz', 247336696)
test_filename = maybe_download('notMNIST_small.tar.gz', 8458043)

Attempting to download: notMNIST_large.tar.gz
0%....5%....10%....15%....20%....25%....30%....35%....40%....45%....50%....55%....60%....65%....70%....75%....80%....85%....90%....95%....100%
Download Complete!
Found and verified .\notMNIST_large.tar.gz
Attempting to download: notMNIST_small.tar.gz
0%....5%....10%....15%....20%....25%....30%....35%....40%....45%....50%....55%....60%....65%....70%....75%....80%....85%....90%....95%....100%
Download Complete!
Found and verified .\notMNIST_small.tar.gz


### Extract the dataset

Once the datasets are downloaded, the next code extracts these files.

In [None]:
num_classes = 10
np.random.seed(133)

def maybe_extract(filename, force=False):
  root = os.path.splitext(os.path.splitext(filename)[0])[0]  # remove .tar.gz
  if os.path.isdir(root) and not force:
    # You may override by setting force=True.
    print('%s already present - Skipping extraction of %s.' % (root, filename))
  else:
    print('Extracting data for %s. This may take a while. Please wait.' % root)
    tar = tarfile.open(filename)
    sys.stdout.flush()
    tar.extractall(data_root)
    tar.close()
  data_folders = [
    os.path.join(root, d) for d in sorted(os.listdir(root))
    if os.path.isdir(os.path.join(root, d))]
  if len(data_folders) != num_classes:
    raise Exception(
      'Expected %d folders, one per class. Found %d instead.' % (
        num_classes, len(data_folders)))
  print(data_folders)
  return data_folders
  
train_folders = maybe_extract(train_filename)
test_folders = maybe_extract(test_filename)

Extracting data for .\notMNIST_large. This may take a while. Please wait.


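Another option is to download the data through the helper included in the TensorFlow examples (`tensorflow.examples.tutorials.mnist`). Note that this helper fetches the original MNIST handwritten digits rather than notMNIST.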

In [3]:
DATA_DIR = '/tmp/data'
NUM_STEPS= 1000
MINIBATCH_SIZE = 100

data = input_data.read_data_sets(DATA_DIR, one_hot=True)



Instructions for updating:
Please use alternatives such as official/mnist/dataset.py from tensorflow/models.
Instructions for updating:
Please write your own downloading logic.
Instructions for updating:
Please use tf.data to implement this functionality.
Extracting /tmp/data\train-images-idx3-ubyte.gz
Instructions for updating:
Please use tf.data to implement this functionality.
Extracting /tmp/data\train-labels-idx1-ubyte.gz
Instructions for updating:
Please use tf.one_hot on tensors.
Extracting /tmp/data\t10k-images-idx3-ubyte.gz
Extracting /tmp/data\t10k-labels-idx1-ubyte.gz
Instructions for updating:
Please use alternatives such as official/mnist/dataset.py from tensorflow/models.
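
The `one_hot=True` argument used above asks for each label as a vector with a single 1 at the position of its class. The following is only a minimal NumPy sketch of that encoding; the names `labels_demo` and `one_hot_demo` and the label values are made up for illustration and are not part of the TensorFlow helper.

```python
import numpy as np

# Hypothetical integer labels for four images (values chosen for illustration).
labels_demo = np.array([3, 0, 9, 3])
num_classes_demo = 10

# Row i of the identity matrix is the one-hot vector for class i,
# so indexing the identity matrix with the labels encodes all of them at once.
one_hot_demo = np.eye(num_classes_demo, dtype=np.float32)[labels_demo]

print(one_hot_demo.shape)
print(one_hot_demo[0])
```

Taking `np.argmax` along each row recovers the original integer labels, which is the same trick the evaluation code uses later with `tf.argmax`.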


## Udemy

"Let's take a peek at some of the data to make sure it looks sensible. Each example should be an image of a character A through J rendered in a different font. Display a sample of the images that we just downloaded. Hint: you can use the package IPython.display."

### Problem 1

In [8]:
print(data.train)

print(data.train.images.shape)  # 55,000 training images, each 28 by 28 pixels flattened to 784 values

<tensorflow.contrib.learn.python.learn.datasets.mnist.DataSet object at 0x000002B30DF7CF60>
(55000, 784)
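
Each row of `data.train.images` holds one image flattened to 784 values. As a small sketch of how such a row could be turned back into a 2D picture for display, the code below uses a synthetic array in place of a real image; `flat_image_demo` and `image_2d_demo` are hypothetical names, and row-major flattening is assumed.

```python
import numpy as np

# Synthetic stand-in for one row of data.train.images (784 float values).
flat_image_demo = np.linspace(0.0, 1.0, 784, dtype=np.float32)

# Undo the flattening: 784 values become 28 rows of 28 columns.
image_2d_demo = flat_image_demo.reshape(28, 28)

print(image_2d_demo.shape)
# In the notebook, plt.imshow(image_2d_demo, cmap='gray') would display it.
```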


### Load the data in a more manageable format

The next code formats the entire dataset into a 3D array (image index, x, y) of floating point values, normalized to have approximately zero mean and a standard deviation of about 0.5, which makes training easier down the road.
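
One common way to get this kind of normalization, sketched here under the assumption of 8-bit grayscale pixels (values from 0 to 255), is to subtract half the pixel depth and divide by the pixel depth. The image below is random synthetic data and all names ending in `_demo` are hypothetical; this is not necessarily the exact code the course uses.

```python
import numpy as np

pixel_depth_demo = 255.0  # assumption: 8-bit grayscale images

# Synthetic 28x28 image with integer pixel values in [0, 255].
rng = np.random.RandomState(0)
raw_image_demo = rng.randint(0, 256, size=(28, 28)).astype(np.float32)

# Shift by half the depth and divide by the depth: values land in [-0.5, 0.5].
normalized_demo = (raw_image_demo - pixel_depth_demo / 2) / pixel_depth_demo

print(normalized_demo.min(), normalized_demo.max(), normalized_demo.mean())
```

Stacking many such 28x28 arrays along a new first axis would give the (image index, x, y) layout described above.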

## Book

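The first step the book focuses on is unrolling the image pixels into a single long vector denoted $x$. The evidence for a given digit, for example digit 0, is then the weighted sum of the pixel values: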
<br>

$xw^0 = \sum_{i}x_{i}w_{i}^0$
<br>

The idea is to learn one weight per pixel for each of the 10 possible digits, so that the weighted sum measures the evidence for that digit. The softmax function then turns these evidence values into probabilities between 0 and 1. The final assignment is the digit that accumulates the most evidence:

$digit=argmax(xW)$
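
As a rough illustration of the rule above, the following NumPy sketch computes the evidence for one synthetic image and picks the class with the highest score. The weights are random and untrained, and `x_demo`, `W_demo`, `evidence_demo`, and `digit_demo` are hypothetical names, so the predicted digit itself is meaningless; only the shapes and the argmax step matter.

```python
import numpy as np

rng = np.random.RandomState(0)
x_demo = rng.rand(1, 784)       # one synthetic flattened image
W_demo = rng.randn(784, 10)     # random (untrained) weights, one column per digit

# Each column of W_demo produces the evidence for one digit.
evidence_demo = np.dot(x_demo, W_demo)          # shape (1, 10)

# The prediction is the digit whose column accumulated the most evidence.
digit_demo = int(np.argmax(evidence_demo, axis=1)[0])

print(evidence_demo.shape, digit_demo)
```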




In [5]:
# A placeholder is a value supplied when the computation is triggered, while a Variable is an element manipulated by the computation.
# The image is a placeholder because we supply it when running the computation graph.
# Each image has size 784 (28*28 pixels unrolled into a single vector).

# The shape [None, 784] means that each image has size 784; None indicates that we are not yet specifying how many of these images
# we will use at once.
x = tf.placeholder(tf.float32, [None, 784])
W = tf.Variable(tf.zeros([784, 10]))

# We have 10 digits; as in the previous two lines, we define a placeholder, this time for the 10-class labels.
y_true = tf.placeholder(tf.float32, [None, 10])
# tf.matmul multiplies the two matrices
y_pred = tf.matmul(x, W)

# Note that the book used a softmax cross entropy function that is now deprecated.
# Cross entropy is a measure of similarity, a natural choice when the model outputs probabilities. It is
# usually called the loss function; here it is computed on the softmax of the predictions.
cross_entropy = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits_v2(logits=y_pred, labels=y_true))

# The next step is to decide how to minimize the loss function; in this case we use gradient descent.
# 0.5 is the learning rate 
gd_step = tf.train.GradientDescentOptimizer(0.5).minimize(cross_entropy)

# This is the evaluation procedure 
correct_mask = tf.equal(tf.argmax(y_pred,1), tf.argmax(y_true,1))
accuracy = tf.reduce_mean(tf.cast(correct_mask, tf.float32))

with tf.Session() as sess:
    # The next line initializes all variables
    sess.run(tf.global_variables_initializer())
    # training loop, controlled by the number of steps and the minibatch size
    for _ in range(NUM_STEPS):
        # the next line gets a subset of the data 
        batch_xs, batch_ys = data.train.next_batch(MINIBATCH_SIZE)
        # run one gradient descent step on this minibatch
        sess.run(gd_step, feed_dict={x: batch_xs, y_true: batch_ys})

    # evaluate the accuracy on the test set; the result is stored in ans
    ans = sess.run(accuracy, feed_dict={x: data.test.images, y_true: data.test.labels})

print("Accuracy: {:.4}%".format(ans*100))

Accuracy: 91.9%
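
To make the gradient descent step less of a black box, here is a small NumPy sketch of the same idea on toy data: softmax regression with zero-initialized weights and a learning rate of 0.5. It is not what `GradientDescentOptimizer` runs internally; the data is synthetic, the sizes (200 samples, 20 features, 3 classes) are made up, it uses the full batch instead of minibatches, and every name is hypothetical.

```python
import numpy as np

def softmax_rows(z):
    # Subtract the row-wise max for numerical stability before exponentiating.
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def mean_cross_entropy(probs, y_onehot):
    # A small epsilon avoids log(0).
    return float(-np.mean(np.sum(y_onehot * np.log(probs + 1e-12), axis=1)))

rng = np.random.RandomState(0)
n_demo, d_demo, k_demo = 200, 20, 3      # toy sizes, not MNIST's 784 and 10
X_demo = rng.randn(n_demo, d_demo)
true_W_demo = rng.randn(d_demo, k_demo)  # hidden weights used only to create labels
y_demo = np.eye(k_demo)[np.argmax(X_demo.dot(true_W_demo), axis=1)]

W_gd_demo = np.zeros((d_demo, k_demo))   # zero initialization, as in the cell above
learning_rate_demo = 0.5

loss_before = mean_cross_entropy(softmax_rows(X_demo.dot(W_gd_demo)), y_demo)
for _ in range(100):
    probs = softmax_rows(X_demo.dot(W_gd_demo))
    # Gradient of the mean cross entropy with respect to the weights.
    grad = X_demo.T.dot(probs - y_demo) / n_demo
    W_gd_demo -= learning_rate_demo * grad
loss_after = mean_cross_entropy(softmax_rows(X_demo.dot(W_gd_demo)), y_demo)

print(loss_before, loss_after)
```

With zero weights every class gets the same probability, so the starting loss equals the logarithm of the number of classes; each update then moves the weights in the direction that lowers the loss.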


## TensorFlow tutorial for beginners

The TensorFlow website has tutorials that show how the theory is applied in code. I would like to take some parts of them in order to draw some conclusions.

### Softmax Regression

The TensorFlow tutorial [3] uses a softmax regression like the book's example, but it includes a bias term:

$Evidence = \sum_{j}W_{i,j}x_{j}+b_{i}$

The index $j$ runs over the pixels of the image $x$, and $i$ is the class. Once the evidence for each class has been computed, the softmax function converts it into probabilities:

$y=Softmax(Evidence)$

$y=Softmax(Wx+b)$
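
A minimal NumPy sketch of the softmax step follows, assuming a made-up evidence vector for three classes. The function name `softmax_demo` and the values are hypothetical; subtracting the maximum before exponentiating is a common numerical-stability trick and is not something the tutorial text requires.

```python
import numpy as np

def softmax_demo(evidence):
    # Subtracting the max does not change the result but avoids overflow in exp.
    shifted = evidence - np.max(evidence)
    exps = np.exp(shifted)
    return exps / np.sum(exps)

evidence_vec_demo = np.array([2.0, 1.0, 0.1])   # made-up evidence for three classes
probs_demo = softmax_demo(evidence_vec_demo)

print(probs_demo, probs_demo.sum())
```

The outputs are positive, sum to 1, and keep the ordering of the evidence, so the class with the most evidence also gets the highest probability.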

In another notebook I commented on the softmax function and its use in this area. The next image, from the tutorial, shows the architecture of the NN.

<img src="https://www.tensorflow.org/images/softmax-regression-scalargraph.png" height="250" width="250">
    
### Cross entropy

Cross entropy is the way to calculate the cost, or loss, of our model; in other words, it tells us how far the model's results are from the desired outcomes. Here $y$ is the predicted probability distribution and $y'$ is the true distribution (the one-hot label). This is the function:


$H_{y'}(y) = - \sum_{i}y'_{i}log(y_{i})$
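
The formula can be checked by hand with a tiny NumPy sketch. The label and the two predictions below are made-up values, and `cross_entropy_demo` is a hypothetical helper, not a TensorFlow function.

```python
import numpy as np

def cross_entropy_demo(y_true_dist, y_pred_dist):
    # H_{y'}(y) = -sum_i y'_i * log(y_i)
    return float(-np.sum(y_true_dist * np.log(y_pred_dist)))

y_prime_demo = np.array([0.0, 1.0, 0.0])        # one-hot true label (class 1)
good_pred_demo = np.array([0.05, 0.90, 0.05])   # confident and correct
bad_pred_demo = np.array([0.70, 0.20, 0.10])    # mostly wrong

ce_good_demo = cross_entropy_demo(y_prime_demo, good_pred_demo)
ce_bad_demo = cross_entropy_demo(y_prime_demo, bad_pred_demo)

print(ce_good_demo, ce_bad_demo)
```

Because the label is one-hot, only the predicted probability of the true class contributes, so the loss reduces to the negative logarithm of that probability and grows as the prediction gets worse.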

    
    

In [None]:
# this is the solution proposed in the tutorial

x = tf.placeholder(tf.float32, [None, 784])

W = tf.Variable(tf.zeros([784, 10]))

b = tf.Variable(tf.zeros([10]))

# x has shape [None, 784] and W has shape [784, 10], so x must come first in the product
y = tf.nn.softmax(tf.matmul(x, W) + b)


## References

[1] Hope, T., Resheff, Y. S., & Lieder, I. (2017). Learning TensorFlow: A Guide to Building Deep Learning Systems. O'Reilly Media, Inc.

[2] Udemy 

[3] https://www.tensorflow.org/versions/r1.2/get_started/mnist/beginners