# Train a Neural Net on the notMNIST Dataset 

### Steps
1. Data curation - Download
1. Data exploration
1. Data validation 
    1. Data balance check
    2. Data check after shuffling
1. Model using Logistic Regression

[Assignment](https://github.com/tensorflow/examples/blob/master/courses/udacity_deep_learning/1_notmnist.ipynb)

[Dataset](http://yaroslavvb.blogspot.com/2011/09/notmnist-dataset.html)

## Data Curation, Exploration and Validation 
- Download the data only if it is not already on disk
- Explore the data
- Check that the data is valid

In [6]:
from __future__ import print_function
import imageio
import matplotlib.pyplot as plt
import numpy as np
import sys
import os
from os.path import expanduser
import tarfile
from IPython.display import display, Image
from sklearn.linear_model import LogisticRegression
from six.moves.urllib.request import urlretrieve
from six.moves import cPickle as pickle

The data consists of characters rendered in a variety of fonts as 28x28 images. The labels are limited to 'A' through 'J' (10 classes). The training set has about 500k labeled examples and the test set has 19,000.
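
To make the numbers above concrete, here is a minimal sketch of the label encoding and the feature count a linear classifier would see. The names `class_names`, `label_to_index` and `num_features` are illustrative, and the alphabetical index order is an assumption rather than something this notebook fixes.

```python
import string

image_size = 28   # images are 28x28 pixels
num_classes = 10  # labels 'A' through 'J'

# Assumption: classes are indexed alphabetically, 'A' -> 0 ... 'J' -> 9.
class_names = list(string.ascii_uppercase[:num_classes])
label_to_index = {name: i for i, name in enumerate(class_names)}

# A flattened image becomes one feature vector of this length.
num_features = image_size * image_size

print(class_names)
print(num_features)
```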

Download each archive into the ~/Downloads directory if it doesn't already exist there, then verify its size.

In [9]:
url = 'https://commondatastorage.googleapis.com/books1000/'
data_root = expanduser('~')+'/Downloads/'

def maybe_download(filename, expected_bytes, force=False):
    """Download the file if it isn't present and make sure it is the expected size"""
    dest_file = os.path.join(data_root, filename)
    if force or not os.path.exists(dest_file):
        print('Attempting to download file:{} to {}'.format(filename, dest_file))
        urlretrieve(url + filename, dest_file)
        print('Download complete')
    print('verifying file size...')
    statinfo = os.stat(dest_file)
    if statinfo.st_size == expected_bytes:
        print('verified size of file {}, expected:{}, actual:{}'.format(filename, expected_bytes, statinfo.st_size))
    else:
        raise Exception('Mismatch in size for file, expected:{}, actual:{}'.format(expected_bytes, statinfo.st_size))

    return dest_file

train_filename = maybe_download('notMNIST_large.tar.gz', 247336696)
test_filename = maybe_download('notMNIST_small.tar.gz', 8458043)

verifying file size...
verified size of file notMNIST_large.tar.gz, expected:247336696, actual:247336696
verifying file size...
verified size of file notMNIST_small.tar.gz, expected:8458043, actual:8458043
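
The large archive is roughly 247 MB, so the download can sit silently for a while. One option is to pass a progress callback to `urlretrieve` through its `reporthook` argument. The sketch below is not part of the original cell; `make_progress_hook` is a hypothetical helper name, and the 5% step size is an arbitrary choice.

```python
import sys

def make_progress_hook(out=sys.stdout, step=5):
    """Build a urlretrieve reporthook that prints progress every `step` percent."""
    state = {'last': None}  # last percentage that was printed

    def hook(count, block_size, total_size):
        # urlretrieve calls this with (blocks so far, block size, total bytes).
        if total_size <= 0:
            return  # the server did not report a size, so no percentage exists
        percent = min(100, count * block_size * 100 // total_size)
        rounded = percent - percent % step
        if rounded != state['last']:
            out.write('{}% '.format(rounded))
            out.flush()
            state['last'] = rounded

    return hook
```

It would be used as `urlretrieve(url + filename, dest_file, reporthook=make_progress_hook())`. The size check afterwards still matters, since a progress display says nothing about whether the file arrived intact.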


Extract the dataset from the compressed .tar.gz file. This should give you a set of directories, labeled A through J.
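
Before unpacking, it can be useful to confirm which class folders an archive actually contains. The following is a minimal sketch under the assumption that the archive is laid out as `<root>/<class>/<image>`; `class_folders_in_archive` is a hypothetical helper, not something the assignment provides. Listing a gzip-compressed tar requires reading through the whole file, so on the large archive this takes a while.

```python
import posixpath
import tarfile

def class_folders_in_archive(archive_path):
    """Return the sorted second-level folder names that contain files."""
    names = set()
    with tarfile.open(archive_path) as tar:
        for member in tar:
            # Tar member names always use '/' as the separator.
            parts = posixpath.normpath(member.name).split('/')
            # Assumed layout: root/class/file, so depth >= 3 means a file in a class folder.
            if len(parts) >= 3:
                names.add(parts[1])
    return sorted(names)
```

If the assumed layout holds, calling this on either archive should list ten names, which is the same condition the extraction cell checks after unpacking.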

In [12]:
num_classes = 10
np.random.seed(37)

def maybe_extract(filename, force=False):
    # remove .tar.gz from filename
    root = os.path.splitext(os.path.splitext(filename)[0])[0]
    if os.path.isdir(root) and not force:
        # The archive has already been unpacked; avoid extracting it again.
        print('{} already exists, skipping extraction of {}'.format(root, filename))
    else:
        print('extracting file to path:{}'.format(root))
        tar = tarfile.open(filename)
        sys.stdout.flush()
        tar.extractall(data_root)
        tar.close()
    data_folders = [
        os.path.join(root, d) for d in sorted(os.listdir(root))
        if os.path.isdir(os.path.join(root, d))
    ]
    if len(data_folders) != num_classes: 
        raise Exception(
            'Number of extracted class folders does not match. Expected={}, extracted={}'.format(
                num_classes, len(data_folders)))
    print('folders extracted:{}'.format(data_folders))
    return data_folders

train_folders = maybe_extract(train_filename)
test_folder = maybe_extract(test_filename)

extracting file to path:/Users/sunnyshah/Downloads/notMNIST_large
folders extracted:['/Users/sunnyshah/Downloads/notMNIST_large/A', '/Users/sunnyshah/Downloads/notMNIST_large/B', '/Users/sunnyshah/Downloads/notMNIST_large/C', '/Users/sunnyshah/Downloads/notMNIST_large/D', '/Users/sunnyshah/Downloads/notMNIST_large/E', '/Users/sunnyshah/Downloads/notMNIST_large/F', '/Users/sunnyshah/Downloads/notMNIST_large/G', '/Users/sunnyshah/Downloads/notMNIST_large/H', '/Users/sunnyshah/Downloads/notMNIST_large/I', '/Users/sunnyshah/Downloads/notMNIST_large/J']
extracting file to path:/Users/sunnyshah/Downloads/notMNIST_small
folders extracted:['/Users/sunnyshah/Downloads/notMNIST_small/A', '/Users/sunnyshah/Downloads/notMNIST_small/B', '/Users/sunnyshah/Downloads/notMNIST_small/C', '/Users/sunnyshah/Downloads/notMNIST_small/D', '/Users/sunnyshah/Downloads/notMNIST_small/E', '/Users/sunnyshah/Downloads/notMNIST_small/F', '/Users/sunnyshah/Downloads/notMNIST_small/G', '/Users/sunnyshah/Downloads/not