<a href="https://colab.research.google.com/github/squeeko/DeepChem_projects/blob/master/DC_2_Learn_MNIST_Classifiers.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Tutorial Part 2: Learning MNIST Digit Classifiers

In the previous tutorial, we learned the basics of loading data into DeepChem and using the core DeepChem objects to manipulate it. In this tutorial, you'll put those pieces together and train a basic image classification model in DeepChem. Why learn this material in DeepChem? Part of the reason is that image processing is an increasingly important part of AI for the life sciences, so knowing how to train image processing models will help you use some of the more advanced DeepChem features.

The MNIST dataset contains images of handwritten digits along with their human-annotated labels. The learning challenge is to train a model that maps each digit image to its true label. MNIST has been a standard machine learning benchmark for decades.

In [1]:
# As always, we need to run the setup when working in Google Colab.

!curl -Lo conda_installer.py https://raw.githubusercontent.com/deepchem/deepchem/master/scripts/colab_install.py
import conda_installer
conda_installer.install()
!/root/miniconda/bin/conda info -e



add /root/miniconda/lib/python3.6/site-packages to PYTHONPATH
python version: 3.6.9
fetching installer from https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
done
installing miniconda to /root/miniconda
done
installing rdkit, openmm, pdbfixer
added omnia to channels
added conda-forge to channels
done
conda packages installation finished!


# conda environments:
#
base                  *  /root/miniconda



In [2]:
!pip install --pre deepchem
import deepchem
deepchem.__version__


'2.4.0-rc1.dev'

In [3]:
import deepchem as dc
import tensorflow as tf
import numpy as np
from tensorflow.keras.layers import Reshape, Conv2D, Flatten, Dense

In [8]:
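# load_data returns ((train_images, train_labels), (test_images, test_labels)).
mnist = tf.keras.datasets.mnist.load_data(path='mnist.npz')
# Add a channel axis and scale pixel values from [0, 255] to [0, 1].
train_images = mnist[0][0].reshape((-1, 28, 28, 1)) / 255
valid_images = mnist[1][0].reshape((-1, 28, 28, 1)) / 255
# Wrap the arrays in DeepChem datasets; the MNIST test split serves as the validation set.
train = dc.data.NumpyDataset(train_images, mnist[0][1])
valid = dc.data.NumpyDataset(valid_images, mnist[1][1])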

Now create the model. We use two convolutional layers followed by two dense layers. The final layer outputs ten numbers for each sample. These correspond to the ten possible digits.

How does the model know how to interpret the output? That is determined by the loss function. We specify SparseSoftmaxCrossEntropy. This is a very convenient class that implements a common case:



1. Each label is an integer that is interpreted as a class index (i.e., which of the ten digits this sample depicts).
2. The outputs are passed through a softmax function, and the result is interpreted as a probability distribution over those same classes.

The model learns to produce a large output for the correct class, and small outputs for all other classes.
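
To make the two points above concrete, here is a minimal pure-Python sketch of a sparse softmax cross-entropy computation for a single sample. It is not DeepChem's implementation; the helper names `softmax` and `sparse_softmax_cross_entropy` and the example numbers are hypothetical, chosen only for illustration.

```python
import math

def softmax(logits):
    """Convert raw model outputs into probabilities that sum to 1."""
    largest = max(logits)
    # Subtracting the maximum keeps exp() from overflowing; the result is unchanged.
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sparse_softmax_cross_entropy(logits, label):
    """Negative log-probability that softmax assigns to the integer class `label`."""
    return -math.log(softmax(logits)[label])

# Hypothetical outputs for one sample: ten numbers, one per digit.
logits = [0.1, 0.2, 0.0, 5.0, 0.3, 0.1, 0.0, 0.2, 0.1, 0.0]

loss_correct = sparse_softmax_cross_entropy(logits, label=3)  # largest output is at index 3
loss_wrong = sparse_softmax_cross_entropy(logits, label=7)    # label disagrees with the output

# The loss is lower when the large output lines up with the label.
print(loss_correct < loss_wrong)
```

Minimizing this quantity over the training set is what pushes the output for the correct class up relative to the others.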



In [9]:
keras_model = tf.keras.Sequential([
    Conv2D(filters=32, kernel_size=5, activation=tf.nn.relu),
    Conv2D(filters=64, kernel_size=5, activation=tf.nn.relu),
    Flatten(),
    Dense(1024, activation=tf.nn.relu),
    Dense(10),
])

model = dc.models.KerasModel(keras_model, dc.models.losses.SparseSoftmaxCrossEntropy())
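
It can help to trace how the image shape changes as it passes through the layers. The sketch below does the arithmetic by hand, assuming the Keras `Conv2D` defaults of no padding and a stride of 1 (the model above does not override them). The helper `conv_output_size` is a hypothetical name used only for this illustration.

```python
def conv_output_size(input_size, kernel_size, stride=1):
    """Spatial size after a convolution with no padding."""
    return (input_size - kernel_size) // stride + 1

size = 28      # MNIST images are 28 x 28 pixels
channels = 1   # a single grayscale channel

# Apply the two convolutional layers in order.
for filters in (32, 64):
    size = conv_output_size(size, kernel_size=5)
    channels = filters
    print(f"after Conv2D({filters}): {size} x {size} x {channels}")

# Flatten turns the remaining grid into one long vector per sample.
flattened = size * size * channels
print("after Flatten:", flattened)
# after Conv2D(32): 24 x 24 x 32
# after Conv2D(64): 20 x 20 x 64
# after Flatten: 25600
```

Under these assumptions the first dense layer receives 25,600 inputs per sample, so most of the model's weights sit in that layer.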

In [10]:
model.fit(train, nb_epoch=2)

0.02732362985610962

Let's see how well it works. We ask the model to predict the class of every sample in the validation set. Remember there are ten outputs for each sample. We use argmax() to identify the largest one, which corresponds to the predicted class.

In [11]:
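# predict_on_batch returns ten outputs per sample; argmax picks the predicted digit.
prediction = np.argmax(model.predict_on_batch(valid.X), axis=1)
# Pass the true labels first, then the predictions.
score = dc.metrics.accuracy_score(valid.y, prediction)
print('Validation set accuracy', score)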

Validation set accuracy 0.9892


It gets about 99% of samples correct. Not too bad for such a simple model!
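
To see what the argmax-and-compare step is doing, here is a small pure-Python sketch on made-up data. The outputs and labels below are hypothetical and use three classes instead of ten to keep the example short; the `argmax` helper stands in for `np.argmax` applied to one row.

```python
def argmax(values):
    """Index of the largest entry in a list."""
    return max(range(len(values)), key=lambda i: values[i])

# Hypothetical model outputs for four samples and three classes.
outputs = [
    [2.0, 0.1, -1.0],
    [0.3, 0.2, 1.5],
    [0.0, 4.0, 0.5],
    [1.0, 0.9, 0.8],
]
labels = [0, 2, 1, 2]  # hypothetical true classes

# The predicted class is the position of the largest output in each row.
predictions = [argmax(row) for row in outputs]

# Accuracy is the fraction of predictions that match the labels.
accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)
print(predictions, accuracy)  # [0, 2, 1, 0] 0.75
```

The last sample is misclassified because its largest output is at index 0 while its label is 2, which gives three correct out of four.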

