## Context

This dataset was created by Yaroslav Bulatov by taking some publicly available fonts and extracting glyphs from them to make a dataset similar to MNIST. There are 10 classes, with letters A-J.

## Content

A set of training and test images of the letters A to J in various typefaces. Each image is 28x28 pixels.

## Acknowledgements

The dataset can be found on the TensorFlow GitHub page as well as on Yaroslav's blog.

## Inspiration

This is a pretty good dataset to train classifiers! According to Yaroslav:

> Judging by the examples, one would expect this to be a harder task than MNIST. This seems to be the case -- logistic regression on top of stacked auto-encoder with fine-tuning gets about 89% accuracy whereas same approach gives 98% on MNIST. Dataset consists of small hand-cleaned part, about 19k instances, and large uncleaned dataset, 500k instances. Two parts have approximately 0.5% and 6.5% label error rate. I got this by looking through glyphs and counting how often my guess of the letter didn't match its unicode value in the font file.

Enjoy!

In [1]:
import numpy as np
import tensorflow as tf
import os
import pandas as pd




In [2]:
import matplotlib.pyplot as plt
import cv2
from PIL import Image


In [3]:
from sklearn.model_selection import train_test_split

In [4]:
from sklearn.preprocessing import OneHotEncoder

## Getting the path names to images

In [5]:
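parentDir = 'notMNIST_large/'
print(parentDir)
data = []   # [image path, class folder] pairs for every readable image
total = 0   # files seen so far
good = 0    # files that PIL could open
for folder in os.listdir(parentDir):
    if folder != '.DS_Store':  # skip macOS metadata
        for file in os.listdir(parentDir + folder):
            if total % 10000 == 0:
                print(total, good)
            total += 1
            try:
                img_path = parentDir + folder + '/' + file
                # Opening only reads the header; close the handle right away
                with Image.open(img_path):
                    pass
                data.append([img_path, folder])
                good += 1
            except OSError:
                # Unreadable or corrupt image file: skip it
                pass

dataset = pd.DataFrame(data)
dataset.head()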


                                

notMNIST_large/
0 0
10000 10000
20000 20000
30000 30000
40000 40000
50000 50000
60000 60000
70000 70000
80000 80000
90000 90000
100000 100000
110000 109999
120000 119999
130000 129999
140000 139998
150000 149998
160000 159997
170000 169997
180000 179997
190000 189997
200000 199997
210000 209997
220000 219997
230000 229997
240000 239997
250000 249997
260000 259997
270000 269997
280000 279997
290000 289997
300000 299997
310000 309997
320000 319997
330000 329997
340000 339997
350000 349997
360000 359997
370000 369997
380000 379997
390000 389997
400000 399997
410000 409996
420000 419996
430000 429996
440000 439996
450000 449996
460000 459996
470000 469996
480000 479996
490000 489996
500000 499996
510000 509996
520000 519995


| | 0 | 1 |
|---|---|---|
| 0 | notMNIST_large/I/VmFkaW0ncyBXcml0aW5nLnR0Zg==.png | I |
| 1 | notMNIST_large/I/Q3JlZXBpbmcgRXZpbC50dGY=.png | I |
| 2 | notMNIST_large/I/Y2FyaWNhdHVyZS50dGY=.png | I |
| 3 | notMNIST_large/I/Q2l0eSBEIEVFIEJvbGQucGZi.png | I |
| 4 | notMNIST_large/I/S2VwbGVyU3RkLUNuU3ViaC5vdGY=.png | I |
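
The file names in the listing above look like base64-encoded font file names. The sketch below decodes two of them using only the standard library. `decode_font_name` is a hypothetical helper introduced for illustration, and it assumes every file stem is valid standard base64, which has only been checked for the paths shown here.

```python
import base64
from pathlib import Path


def decode_font_name(image_path):
    """Decode the base64 file stem of a notMNIST image path into text."""
    stem = Path(image_path).stem  # drop the directories and the ".png" suffix
    raw = base64.b64decode(stem)
    # Font names are not guaranteed to be UTF-8, so replace undecodable bytes.
    return raw.decode("utf-8", errors="replace")


print(decode_font_name("notMNIST_large/I/VmFkaW0ncyBXcml0aW5nLnR0Zg==.png"))
# Vadim's Writing.ttf
print(decode_font_name("notMNIST_large/I/Y2FyaWNhdHVyZS50dGY=.png"))
# caricature.ttf
```

Decoding the names can help when inspecting mislabeled or unreadable glyphs, since it points back to the font each image came from.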


In [6]:
print(len(dataset))

529114


## Set batch size and epochs

The batch size should be neither too large nor too small.

In [7]:
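batch_size = 16
num_epochs = 100  # not used below; training length is set by `steps` in classifier.train

def input_func(features, labels, batch_size):
    """Build a batched tf.data.Dataset of (image, label) pairs from image paths."""

    def parser(image, label):
        # Read and decode one PNG, then shape it as a 28x28 single-channel float image
        img = tf.image.decode_png(tf.read_file(image))
        img = tf.reshape(img, [28, 28, 1])
        img = tf.cast(img, tf.float32, "cast")

        return img, label

    dataset = tf.data.Dataset.from_tensor_slices((features, labels))
    dataset = dataset.map(parser)
    dataset = dataset.batch(batch_size)

    return dataset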
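
To make the batching step concrete, here is a pure-Python stand-in for what `dataset.batch(batch_size)` does to a sequence of elements. It is not TensorFlow code: `batches` is a hypothetical helper, and it mirrors the default behaviour in which a final, smaller batch is kept rather than dropped.

```python
def batches(items, batch_size):
    """Yield consecutive lists of at most batch_size items."""
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # the final partial batch is kept, not dropped
        yield batch


print(list(batches(range(10), 4)))
# [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```

Because the last batch can be shorter, the model reshapes with `-1` in the batch dimension instead of assuming exactly 16 images per step.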
    

## Define model architecture
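The model uses two blocks, each consisting of a convolutional layer followed by a max-pooling layer, then a dense layer with dropout and a 10-unit output layer. The layout follows the original MNIST CNN, here with 64 and 128 filters in the two convolutional layers.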
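
As a quick check of the shapes involved, the arithmetic below traces a 28x28 single-channel image through two "same"-padded 5x5 convolutions and two 2x2 max-pools with stride 2, using 64 and 128 filters and a 1024-unit dense layer as in the model code. The helper names are made up for this sketch, and the parameter counts assume every layer has a bias term.

```python
def conv_same(size):
    # "same" padding with stride 1 preserves the spatial size
    return size


def max_pool(size, pool=2, stride=2):
    # unpadded ("valid") pooling
    return (size - pool) // stride + 1


size = 28
size = max_pool(conv_same(size))  # 14 after the first block
size = max_pool(conv_same(size))  # 7 after the second block
flat = size * size * 128          # length of the flattened feature vector

# weights + biases per layer
params = {
    "conv1": 5 * 5 * 1 * 64 + 64,
    "conv2": 5 * 5 * 64 * 128 + 128,
    "dense": flat * 1024 + 1024,
    "logits": 1024 * 10 + 10,
}

print(size, flat)            # 7 6272
print(sum(params.values()))  # 6640394
```

Almost all of the parameters sit in the dense layer, which is why the flattened size of 7 * 7 * 128 matters more for model size than the convolution kernels do.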

In [12]:
def my_model(features, labels, mode, params):
    # `features` is a batch of 28x28x1 float images produced by input_func

    # FIRST LAYER
    # ---conv layer with 64 filters, 5x5 kernel, and relu activation
    # ---pool layer with 2x2 pool window and stride of 2x2
    conv1 = tf.layers.conv2d(inputs=features, filters=64, kernel_size=(5, 5), padding="same", activation=tf.nn.relu)
    pool1 = tf.layers.max_pooling2d(inputs=conv1, pool_size=(2, 2), strides=(2, 2))
    
    # SECOND LAYER
    # ---conv layer with 128 filters, 5x5 kernel, and relu activation
    # ---pool layer with 2x2 pool window and stride of 2x2
    conv2 = tf.layers.conv2d(inputs=pool1, filters=128, kernel_size=(5, 5), padding="same", activation=tf.nn.relu)
    pool2 = tf.layers.max_pooling2d(inputs=conv2, pool_size=(2, 2), strides=(2, 2))
    
    print(pool2.shape)
    
    # DENSE LAYER
    # ---flatten output into vector
    # ---dropout to prevent overfitting
    pool2_flat = tf.reshape(pool2, [-1, 7 * 7 * 128])
    dense = tf.layers.dense(inputs=pool2_flat, units=1024, activation=tf.nn.relu)
    dropout = tf.layers.dropout(inputs=dense, rate=0.2, training=mode == tf.estimator.ModeKeys.TRAIN)
    
    logits = tf.layers.dense(inputs=dropout, units=10)
    predictions = {
        "classes": tf.argmax(input=logits, axis=1),
        "probabilities": tf.nn.softmax(logits, name="softmax_tensor")
    }
    
    if mode == tf.estimator.ModeKeys.PREDICT:
        return tf.estimator.EstimatorSpec(mode=mode, predictions=predictions)
    
    onehot_labels = tf.one_hot(indices=tf.cast(labels, tf.int32), depth=10)
    onehot_labels = tf.reshape(onehot_labels, [-1, 10])
    loss = tf.losses.softmax_cross_entropy(onehot_labels=onehot_labels, logits=logits)
#     loss = tf.losses.sparse_softmax_cross_entropy(labels=labels, logits=logits)
    
    if mode == tf.estimator.ModeKeys.TRAIN:
        optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.001)
        train_op = optimizer.minimize(loss=loss,global_step=tf.train.get_global_step())
        return tf.estimator.EstimatorSpec(mode=mode, loss=loss, train_op=train_op)
    print(labels.shape)
    print(predictions["classes"].shape)
    eval_metric_ops = {"accuracy": tf.metrics.accuracy(labels=labels, predictions=predictions["classes"])}
    return tf.estimator.EstimatorSpec(mode=mode, loss=loss, eval_metric_ops=eval_metric_ops)
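
The loss and the accuracy metric used in the model can be illustrated without TensorFlow. The sketch below is a plain-Python version of softmax cross-entropy for one example and of accuracy over a small batch. The function names are illustrative only, and this is not how TensorFlow implements these ops internally.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, label):
    """Negative log-probability assigned to the true class."""
    return -math.log(softmax(logits)[label])


def accuracy(batch_logits, labels):
    """Fraction of examples whose highest-scoring class equals the label."""
    correct = 0
    for logits, label in zip(batch_logits, labels):
        predicted = max(range(len(logits)), key=logits.__getitem__)
        correct += predicted == label
    return correct / len(labels)


print(cross_entropy([0.0, 0.0], 0))             # log(2), about 0.693
print(accuracy([[2.0, 1.0], [0.5, 3.0]], [0, 0]))  # 0.5
```

With 10 classes, a model that outputs uniform probabilities would score a loss of log(10), about 2.30, which gives a baseline for reading the loss values in the training log.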



## Training and Evaluating
-> split the data into train and test sets (roughly 2:1, using test_size=0.33)
-> instantiate the estimator with my_model as the CNN
-> use pd.factorize on the label column of the DataFrame to get integer labels, then convert the resulting splits to np arrays
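
One detail of the label step is worth spelling out: with its default settings, `pd.factorize` numbers the labels in order of first appearance, not alphabetically. Since the directory listing above starts with folder `I`, the letter `I` would get code 0. The function below is a pure-Python imitation of that behaviour for illustration, not the pandas implementation.

```python
def factorize(values):
    """Return (codes, uniques), numbering values by first appearance."""
    codes, uniques, index = [], [], {}
    for value in values:
        if value not in index:
            index[value] = len(uniques)
            uniques.append(value)
        codes.append(index[value])
    return codes, uniques


codes, uniques = factorize(["I", "I", "A", "B", "A"])
print(codes)    # [0, 0, 1, 2, 1]
print(uniques)  # ['I', 'A', 'B']
```

Keeping the second return value of `pd.factorize` (the uniques) makes it possible to map predicted class indices back to letters later. Passing `sort=True` would instead give alphabetical codes.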

In [13]:
# Split the data; pd.factorize turns the letter labels into integer codes.
X_train, X_test, y_train, y_test = train_test_split(dataset[0], pd.factorize(dataset[1])[0], test_size=0.33, random_state=42)

# Build CNN.
classifier = tf.estimator.Estimator(model_fn=my_model)

# Convert the splits to np arrays.
X_train = np.asarray(X_train)
y_train = np.asarray(y_train)

X_test = np.asarray(X_test)
y_test = np.asarray(y_test)

# Train the model.
classifier.train(input_fn=lambda: input_func(X_train, y_train, batch_size), steps=22000)

# Evaluate the model.
eval_result = classifier.evaluate(input_fn=lambda: input_func(X_test, y_test, batch_size))

print('\nTest set accuracy: {accuracy:0.3f}\n'.format(**eval_result))


INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_log_step_count_steps': 100, '_num_worker_replicas': 1, '_train_distribute': None, '_session_config': None, '_keep_checkpoint_max': 5, '_task_type': 'worker', '_save_checkpoints_secs': 600, '_save_summary_steps': 100, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x1a36dfacc0>, '_keep_checkpoint_every_n_hours': 10000, '_save_checkpoints_steps': None, '_model_dir': '/var/folders/th/svpqqvhs62790bm9gczzcth40000gn/T/tmpywd8sp92', '_global_id_in_cluster': 0, '_is_chief': True, '_task_id': 0, '_num_ps_replicas': 0, '_master': '', '_tf_random_seed': None, '_evaluation_master': '', '_service': None}
INFO:tensorflow:Calling model_fn.
(?, 7, 7, 128)
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Saving checkpoints for 1 into 

INFO:tensorflow:global_step/sec: 8.5536
INFO:tensorflow:loss = 0.17652255, step = 6801 (11.691 sec)
INFO:tensorflow:global_step/sec: 8.65197
INFO:tensorflow:loss = 0.022440376, step = 6901 (11.558 sec)
INFO:tensorflow:global_step/sec: 8.79559
INFO:tensorflow:loss = 0.5456935, step = 7001 (11.369 sec)
INFO:tensorflow:global_step/sec: 8.7829
INFO:tensorflow:loss = 0.32647312, step = 7101 (11.386 sec)
INFO:tensorflow:global_step/sec: 8.92626
INFO:tensorflow:loss = 0.15609053, step = 7201 (11.205 sec)
INFO:tensorflow:global_step/sec: 8.43597
INFO:tensorflow:loss = 0.5737689, step = 7301 (11.852 sec)
INFO:tensorflow:global_step/sec: 8.34834
INFO:tensorflow:loss = 0.28764644, step = 7401 (11.978 sec)
INFO:tensorflow:global_step/sec: 8.07093
INFO:tensorflow:loss = 0.29358056, step = 7501 (12.390 sec)
INFO:tensorflow:global_step/sec: 7.87814
INFO:tensorflow:loss = 0.20512776, step = 7601 (12.693 sec)
INFO:tensorflow:global_step/sec: 8.07717
INFO:tensorflow:loss = 0.2808183, step = 7701 (12.380

INFO:tensorflow:global_step/sec: 8.51918
INFO:tensorflow:loss = 0.77469754, step = 14801 (11.738 sec)
INFO:tensorflow:global_step/sec: 8.98101
INFO:tensorflow:loss = 0.41336948, step = 14901 (11.134 sec)
INFO:tensorflow:global_step/sec: 9.24278
INFO:tensorflow:loss = 0.010055601, step = 15001 (10.819 sec)
INFO:tensorflow:global_step/sec: 8.55556
INFO:tensorflow:loss = 0.5097092, step = 15101 (11.688 sec)
INFO:tensorflow:global_step/sec: 8.69452
INFO:tensorflow:loss = 0.13287695, step = 15201 (11.502 sec)
INFO:tensorflow:Saving checkpoints for 15263 into /var/folders/th/svpqqvhs62790bm9gczzcth40000gn/T/tmpywd8sp92/model.ckpt.
INFO:tensorflow:global_step/sec: 7.71985
INFO:tensorflow:loss = 0.68018055, step = 15301 (12.953 sec)
INFO:tensorflow:global_step/sec: 7.85197
INFO:tensorflow:loss = 0.20181614, step = 15401 (12.736 sec)
INFO:tensorflow:global_step/sec: 7.79403
INFO:tensorflow:loss = 0.5163596, step = 15501 (12.830 sec)
INFO:tensorflow:global_step/sec: 8.79008
INFO:tensorflow:loss 

INFO:tensorflow:Saving dict for global step 22000: accuracy = 0.90286815, global_step = 22000, loss = 0.3268237

Test set accuracy: 0.903
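
For context on the run above, the rough calculation below estimates how much of the training split 22000 steps at a batch size of 16 actually cover. It assumes that `train_test_split` rounds the test count up and gives the rest to training, and it takes about 8.5 steps per second as a typical rate from the log, so the figures are estimates, not measured values.

```python
import math

n_samples = 529114   # len(dataset) printed earlier
test_size = 0.33
batch_size = 16
train_steps = 22000
steps_per_sec = 8.5  # rough value read off the training log

n_test = math.ceil(test_size * n_samples)  # assumed rounding rule
n_train = n_samples - n_test
steps_per_epoch = math.ceil(n_train / batch_size)
epochs_covered = train_steps * batch_size / n_train
est_minutes = train_steps / steps_per_sec / 60

print(n_train, n_test)           # 354506 174608
print(steps_per_epoch)           # 22157
print(round(epochs_covered, 3))  # 0.993
print(round(est_minutes))        # 43
```

Under these assumptions the model sees just under one pass over the training data in roughly 43 minutes, and the `num_epochs = 100` setting never comes into play because the input function does not repeat the dataset.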

