# Grokking **FizzBuzz** using `MXNet`

Last year, Joel Grus wrote a brilliant [article](http://joelgrus.com/2016/05/23/fizz-buzz-in-tensorflow/) on *fizzbuzz*. He had attended an interview and was asked about *fizzbuzz*. He went on to solve it using `tensorflow`. Now, you may wonder why anyone would use deep learning for this. At least, that is what I thought, and I didn't pay much attention to the code.

This summer, I had an opportunity to interview with an AI startup that I really liked. And guess what? I was asked to solve `fizzbuzz` using deep learning. Long story short, neither Joel nor I got the job!

But this made me think about why `fizzbuzz` makes sense as a deep learning exercise. Before we get to that, what is the `fizzbuzz` problem?

**What is fizzbuzz?**

Given an integer `x`, the output is determined by the following rules, checked in order:

- if `x` is divisible by 15, the output is "fizzbuzz"
- if `x` is divisible by 5, the output is "buzz"
- if `x` is divisible by 3, the output is "fizz"
- otherwise, the output is `x`

A typical output sequence looks like this:

| Input   |      Output      | 
|----------|:-------------:|
| 1 |  1 |
| 2 |  2 |
| 3 | "fizz" |
| 4 | 4 |
| 5 | "buzz" |
| 6 | "fizz" |
| 7 | 7 |
| 8 | 8 |
| 9 | "fizz" |
| 10 | "buzz" |
| 11 | 11 |
| 12 | "fizz" |
| 13 | 13 |
| 14 | 14 |
| 15 | "fizzbuzz" |
| 16 | 16 |
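
For reference, the rules can be written directly in a few lines of plain Python. This is a minimal sketch; the function name `fizz_buzz_rule` is made up for illustration and is not used elsewhere in the article.

```python
def fizz_buzz_rule(x):
    """Return the fizzbuzz output for a positive integer x."""
    # Check 15 first: every multiple of 15 is also a multiple of 3 and of 5.
    if x % 15 == 0:
        return "fizzbuzz"
    if x % 5 == 0:
        return "buzz"
    if x % 3 == 0:
        return "fizz"
    return str(x)

# Reproduce the rows of the table for the inputs 1..16
first_sixteen = [fizz_buzz_rule(x) for x in range(1, 17)]
print(first_sixteen)
```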

If we know the rules that generate the data, there's really no need for machine learning. Unfortunately, in real life, we only have the data. The goal of machine learning is to learn the function that generated the data. In this respect, `fizzbuzz` provides us with an easy-to-understand dataset and allows us to understand and explore the algorithms better.

What follows below is a pedantic exercise in understanding how `MXNet` can be used to solve the `fizzbuzz` problem. 

**What is `MXNet`?**

`MXNet` is a scalable open-source deep learning framework. It scales to multiple GPUs and multiple machines. `MXNet` is the deep learning framework of choice at AWS. It is supported by Intel, Dato, Baidu, Microsoft and MIT, amongst others.


**What's the structure of the article?**

In the subsequent sections, we will do the following:

1. Generate the fizzbuzz data
2. Divide the data into train and test
3. Structure the problem as a multi-class classification problem
4. Build a logistic regression model in `MXNet` from scratch
5. Build a logistic regression model using `MXNet`
6. Introduce `Gluon`
7. Build a multi-layer-perceptron model using `Gluon`
8. Build a Convolutional Neural Network model

**Import Libraries**

In [1]:
import numpy as np
import mxnet as mx
from mxnet import autograd
from mxnet import gluon
from mxnet import nd
import os
mx.random.seed(1)

In [2]:
ctx = mx.cpu()

**Define a function to encode the integer to its binary representation**

In [3]:
def binary_encode(i, num_digits):
    return np.array([i >> d & 1 for d in range(num_digits)])
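
To make the encoding concrete, here is a small sketch of what `binary_encode` returns for one input. Note that the bits come out least significant bit first. The decoding step is only an illustration added here to show that the mapping is reversible; it is not part of the original notebook.

```python
import numpy as np

def binary_encode(i, num_digits):
    # Bit d of i lands at position d, so the least significant bit comes first.
    return np.array([i >> d & 1 for d in range(num_digits)])

# 10 is 1010 in binary; the encoded vector lists the bits in reverse order
encoded = binary_encode(10, 4)
print(encoded)

# Decoding reverses the mapping: sum of bit * 2**position
decoded = sum(int(bit) << d for d, bit in enumerate(encoded))
print(decoded)
```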

**Define a function to label the data and map the labels back to categorical strings**

In [4]:
def fizz_buzz_encode(i):
    if   i % 15 == 0: 
        return 0
    elif i % 5  == 0: 
        return 1
    elif i % 3  == 0: 
        return 2
    else:             
        return 3
    
def fizz_buzz(i, prediction):
    if prediction == 0:
        return "fizzbuzz"
    elif prediction == 1:
        return "buzz"
    elif prediction == 2:
        return "fizz"
    else:
        return str(i)

**Create the NumPy arrays for training, validation and test data**

In [5]:
MAX_NUMBER = 100000
# Use the built-in int: the np.int alias has been removed from recent NumPy releases
NUM_DIGITS = int(np.log2(MAX_NUMBER)) + 1
# Train on 101..49999, validate on 50000..99999, test on 1..100
trainX = np.array([binary_encode(i, NUM_DIGITS) for i in range(101, MAX_NUMBER // 2)])
trainY = np.array([fizz_buzz_encode(i)          for i in range(101, MAX_NUMBER // 2)])
valX = np.array([binary_encode(i, NUM_DIGITS) for i in range(MAX_NUMBER // 2, MAX_NUMBER)])
valY = np.array([fizz_buzz_encode(i)          for i in range(MAX_NUMBER // 2, MAX_NUMBER)])
testX = np.array([binary_encode(i, NUM_DIGITS) for i in range(1, 101)])
testY = np.array([fizz_buzz_encode(i)          for i in range(1, 101)])
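
Before training, it helps to check how large each split is and how the four classes are distributed. The sketch below recomputes this with NumPy only. The `splits` and `class_counts` names are introduced here for illustration.

```python
import numpy as np

MAX_NUMBER = 100000
NUM_DIGITS = int(np.log2(MAX_NUMBER)) + 1  # number of bits per input vector

def fizz_buzz_encode(i):
    # Same label scheme as in the article: 0=fizzbuzz, 1=buzz, 2=fizz, 3=number
    if i % 15 == 0:
        return 0
    elif i % 5 == 0:
        return 1
    elif i % 3 == 0:
        return 2
    return 3

splits = {
    "train": range(101, MAX_NUMBER // 2),
    "val": range(MAX_NUMBER // 2, MAX_NUMBER),
    "test": range(1, 101),
}

class_counts = {}
for name, numbers in splits.items():
    labels = np.array([fizz_buzz_encode(i) for i in numbers])
    class_counts[name] = np.bincount(labels, minlength=4)
    # Print the split size and the fraction of each class
    print(name, len(numbers), class_counts[name] / len(numbers))
```

Roughly 8 out of every 15 integers are divisible by neither 3 nor 5, so the "number" class makes up a little over half of each split. That is a useful baseline to keep in mind when reading the accuracies later on.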

**Create the `MXNet` `NDArrayIter` objects for training, validation and test data**

In [6]:
batch_size = 100
num_inputs = NUM_DIGITS
num_outputs = 4
train_data = mx.io.NDArrayIter(trainX, trainY,
                               batch_size, shuffle=True)
val_data = mx.io.NDArrayIter(valX, valY,
                               batch_size, shuffle=True)
test_data = mx.io.NDArrayIter(testX, testY,
                              batch_size, shuffle=False)

**Let's define the function to calculate the accuracy of a model**

In [7]:
def evaluate_accuracy(data_iterator, net):
    acc = mx.metric.Accuracy()
    data_iterator.reset()
    for batch in data_iterator:
        data = batch.data[0].as_in_context(ctx)
        label = batch.label[0].as_in_context(ctx)
        output = net(data)
        # The predicted class is the index of the largest score
        predictions = nd.argmax(output, axis=1)
        acc.update(preds=predictions, labels=label)
    # Note: predictions holds only the last batch; the accuracy covers all batches
    return predictions, acc.get()[1]
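
The core of the accuracy computation is an argmax over the class scores followed by a comparison with the labels. The NumPy sketch below shows that step in isolation. The helper `accuracy_from_scores` and the toy scores are made up for illustration and do not use the MXNet metric API.

```python
import numpy as np

def accuracy_from_scores(scores, labels):
    # Pick the highest-scoring class per row and compare with the true labels
    predictions = np.argmax(scores, axis=1)
    return float(np.mean(predictions == labels))

# Four samples, four classes; the third sample is misclassified
scores = np.array([[0.1, 0.2, 0.3, 2.0],
                   [1.5, 0.1, 0.2, 0.3],
                   [0.2, 0.1, 0.9, 0.4],
                   [0.3, 0.8, 0.1, 0.2]])
labels = np.array([3, 0, 3, 1])

acc = accuracy_from_scores(scores, labels)
print(acc)
```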

### Logistic Regression from Scratch

**Define the bias and weight matrix**

In [8]:
# Note: weight_scale is not used below; W and b are drawn from a standard normal
weight_scale = .01

W = nd.random_normal(shape=(num_inputs, num_outputs))
b = nd.random_normal(shape=num_outputs)

params = [W, b]

**Allocate space for each parameter's gradients**

In [9]:
for param in params:
    param.attach_grad()

We pass the raw linear output `yhat_linear` to the `softmax_cross_entropy` loss function, which computes the softmax and its logarithm in a single step.

In [10]:
def softmax_cross_entropy(yhat_linear, y):
    return - nd.nansum(y * nd.log_softmax(yhat_linear), axis=0, exclude=True)
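
Computing the log of the softmax in one step matters for numerical stability. The following NumPy sketch illustrates the idea under the assumption that the loss is the usual per-sample cross-entropy over one-hot labels. The function names are hypothetical and this is not how `nd.log_softmax` is implemented internally.

```python
import numpy as np

def log_softmax(z):
    # Subtract the row maximum so that exp() cannot overflow
    z = z - z.max(axis=1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=1, keepdims=True))

def softmax_cross_entropy_np(yhat_linear, y_one_hot):
    # One loss value per sample: minus the log-probability of the true class
    return -(y_one_hot * log_softmax(yhat_linear)).sum(axis=1)

# First row: uniform scores. Second row: a very large score for the true class,
# which would overflow in a naive exp()-then-log() implementation.
logits = np.array([[0.0, 0.0, 0.0, 0.0],
                   [1000.0, 0.0, 0.0, 0.0]])
y = np.eye(4)[[3, 0]]  # one-hot labels for classes 3 and 0

losses = softmax_cross_entropy_np(logits, y)
print(losses)
```

With uniform scores over four classes the loss is `log(4)`, which is the value a model starts from when it knows nothing about the data.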

**Define the model**

In [11]:
def net(X):
    y_linear = nd.dot(X, W) + b
    return y_linear

**Define the Optimizer**

In [12]:
def SGD(params, lr):
    for param in params:
        param[:] = param - lr * param.grad
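
To see what a single update does, here is one SGD step for softmax regression written out in NumPy, with the gradient derived by hand instead of by `autograd`. It is a sketch on a toy batch (the integers 1 to 8 as 4-bit vectors), and it assumes the loss is summed over the batch. All names are local to this example.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def total_loss(X, Y, W, b):
    # Cross-entropy summed over the batch
    p = softmax(X @ W + b)
    return float(-np.sum(Y * np.log(p)))

# Toy batch: the integers 1..8 as 4-bit vectors, least significant bit first
X = np.array([[i >> d & 1 for d in range(4)] for i in range(1, 9)], dtype=float)
labels = np.array([3, 3, 2, 3, 1, 2, 3, 3])  # fizzbuzz labels for 1..8
Y = np.eye(4)[labels]

W = np.zeros((4, 4))
b = np.zeros(4)
lr = 0.01

loss_before = total_loss(X, Y, W, b)

# Gradient of the summed cross-entropy with respect to W and b
p = softmax(X @ W + b)
grad_W = X.T @ (p - Y)
grad_b = (p - Y).sum(axis=0)

# The SGD update: move each parameter against its gradient
W -= lr * grad_W
b -= lr * grad_b

loss_after = total_loss(X, Y, W, b)
print(loss_before, loss_after)
```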

**Let's run the training loop**

In [13]:
epochs = 100
learning_rate = .01
smoothing_constant = .01

for e in range(epochs):
    train_data.reset()
    for i, batch in enumerate(train_data):
        data = batch.data[0].as_in_context(ctx)
        label = batch.label[0].as_in_context(ctx)
        label_one_hot = nd.one_hot(label, 4)
        with autograd.record():
            output = net(data)
            loss = softmax_cross_entropy(output, label_one_hot)
        loss.backward()
        SGD(params, learning_rate)
        curr_loss = nd.mean(loss).asscalar()
        moving_loss = (curr_loss if ((i == 0) and (e == 0)) 
                       else (1 - smoothing_constant) * moving_loss + (smoothing_constant) * curr_loss)

    _,val_accuracy = evaluate_accuracy(val_data, net)
    _,train_accuracy = evaluate_accuracy(train_data, net)
    print("Epoch %s. Loss: %s, Train_acc %s, Val_acc %s" %
          (e, moving_loss, train_accuracy, val_accuracy))

Epoch 99. Loss: 1.18811465231, Train_acc 0.532825651303, Val_acc 0.5332


**Let's see what the model predicts**

In [14]:
predictions,test_accuracy = evaluate_accuracy(test_data, net)
output = np.vectorize(fizz_buzz)(np.arange(1, 101), predictions.asnumpy().astype(int))
print(output)
print("Test Accuracy : ",test_accuracy)

['1' '2' '3' '4' '5' '6' '7' '8' '9' '10' '11' '12' '13' '14' '15' '16'
 '17' '18' '19' '20' '21' '22' '23' '24' '25' '26' '27' '28' '29' '30' '31'
 '32' '33' '34' '35' '36' '37' '38' '39' '40' '41' '42' '43' '44' '45' '46'
 '47' '48' '49' '50' '51' '52' '53' '54' '55' '56' '57' '58' '59' '60' '61'
 '62' '63' '64' '65' '66' '67' '68' '69' '70' '71' '72' '73' '74' '75' '76'
 '77' '78' '79' '80' '81' '82' '83' '84' '85' '86' '87' '88' '89' '90' '91'
 '92' '93' '94' '95' '96' '97' '98' '99' '100']
Test Accuracy :  0.53
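
The output above contains no "fizz" or "buzz" at all: the model labels every input as a plain number. The short check below shows that this strategy on its own already scores 0.53 on the integers 1 to 100, so the logistic regression has not learned anything beyond the majority class. The variable names are chosen for this example only.

```python
def fizz_buzz_encode(i):
    # Same label scheme as in the article: 0=fizzbuzz, 1=buzz, 2=fizz, 3=number
    if i % 15 == 0:
        return 0
    elif i % 5 == 0:
        return 1
    elif i % 3 == 0:
        return 2
    return 3

true_labels = [fizz_buzz_encode(i) for i in range(1, 101)]

# A "model" that always predicts class 3, i.e. the number itself
always_number = [3] * len(true_labels)

matches = sum(p == t for p, t in zip(always_number, true_labels))
baseline_accuracy = matches / len(true_labels)
print(baseline_accuracy)
```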


### MultiLayer Perceptron using Gluon

**Let's reset the training, validation and test data**

In [15]:
train_data.reset()
val_data.reset()
test_data.reset()

**Define the Gluon Sequential Model**

In [16]:
num_hidden = 64
net = gluon.nn.Sequential()
with net.name_scope():
    net.add(gluon.nn.Dense(num_inputs, activation="relu"))
    net.add(gluon.nn.Dense(num_hidden, activation="relu"))
    net.add(gluon.nn.Dense(num_hidden, activation="relu"))
    net.add(gluon.nn.Dense(num_outputs))
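
To get a feel for the size of this network, the NumPy sketch below mirrors its layer structure: ReLU after every layer except the last. It assumes `num_inputs` is 17, which follows from `MAX_NUMBER = 100000`. The random weights are placeholders, so only the shapes and the parameter count are meaningful.

```python
import numpy as np

# Input width followed by the output width of each of the four Dense layers
layer_sizes = [17, 17, 64, 64, 4]

rng = np.random.default_rng(0)
weights = [rng.normal(scale=0.1, size=(n_in, n_out))
           for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]
biases = [np.zeros(n_out) for n_out in layer_sizes[1:]]

def forward(x):
    h = x
    for k, (W, b) in enumerate(zip(weights, biases)):
        h = h @ W + b
        # ReLU on every layer except the output layer
        if k < len(weights) - 1:
            h = np.maximum(h, 0.0)
    return h

# A batch of 5 random bit vectors produces 5 rows of 4 class scores
batch = rng.integers(0, 2, size=(5, 17)).astype(float)
scores = forward(batch)

num_params = sum(W.size + b.size for W, b in zip(weights, biases))
print(scores.shape, num_params)
```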

**Initialize the parameters**

In [17]:
net.collect_params().initialize(mx.init.Xavier(magnitude=2.24), ctx=ctx)

**Softmax Cross Entropy Loss**

In [18]:
loss = gluon.loss.SoftmaxCrossEntropyLoss()

**Stochastic Gradient Descent Optimizer**

In [19]:
trainer = gluon.Trainer(net.collect_params(), 'sgd', {'learning_rate': .02,'momentum':0.9})



**Let's train the MLP model**

In [20]:
epochs = 100
moving_loss = 0.
best_accuracy = 0.
best_epoch = -1

for e in range(epochs):
    train_data.reset()
    for i, batch in enumerate(train_data):
        data = batch.data[0].as_in_context(ctx)
        label = batch.label[0].as_in_context(ctx)
        with autograd.record():
            output = net(data)
            cross_entropy = loss(output, label)
        # Backpropagate outside the recording scope
        cross_entropy.backward()
        trainer.step(data.shape[0])
        if i == 0:
            moving_loss = nd.mean(cross_entropy).asscalar()
        else:
            moving_loss = .99 * moving_loss + .01 * nd.mean(cross_entropy).asscalar()

    _,val_accuracy = evaluate_accuracy(val_data, net)
    _,train_accuracy = evaluate_accuracy(train_data, net)
    
    # Keep only the checkpoint with the best validation accuracy so far
    if val_accuracy > best_accuracy:
        best_accuracy = val_accuracy
        if best_epoch != -1:
            print('deleting previous checkpoint...')
            os.remove('mlp-%d.params' % best_epoch)
        best_epoch = e
        print('Best validation accuracy found. Checkpointing...')
        net.save_params('mlp-%d.params' % e)
    print("Epoch %s. Loss: %s, Train_acc %s, Val_acc %s" %
          (e, moving_loss, train_accuracy, val_accuracy))

Best validation accuracy found. Checkpointing...
Epoch 99. Loss: 0.021456909065, Train_acc 0.99498997996, Val_acc 0.44902


**Let's see what the model predicts**

In [21]:
net.load_params('mlp-%d.params'%(best_epoch), ctx)

In [22]:
predictions,test_accuracy = evaluate_accuracy(test_data, net)
output = np.vectorize(fizz_buzz)(np.arange(1, 101), predictions.asnumpy().astype(int))
print(output)
print("Test Accuracy : ",test_accuracy)

['1' '2' '3' '4' 'buzz' '6' '7' '8' 'fizz' 'buzz' '11' 'fizz' '13' '14'
 '15' 'buzz' '17' 'fizz' '19' 'buzz' 'fizz' '22' '23' 'fizz' 'buzz' '26'
 'fizz' '28' '29' 'fizzbuzz' '31' 'buzz' 'fizz' '34' 'buzz' 'fizz' '37'
 '38' '39' 'buzz' '41' 'fizz' '43' 'buzz' '45' 'buzz' '47' 'fizz' '49'
 'buzz' 'fizz' '52' '53' 'fizz' 'buzz' '56' 'fizz' '58' '59' 'fizz' 'buzz'
 '62' 'fizz' '64' 'buzz' '66' '67' '68' '69' 'buzz' '71' 'fizz' '73' '74'
 '75' '76' '77' '78' '79' 'buzz' 'fizz' '82' '83' 'fizz' 'buzz' '86' '87'
 '88' '89' 'fizzbuzz' '91' '92' 'fizz' '94' 'buzz' 'fizz' '97' '98' '99'
 'buzz']
Test Accuracy :  0.83


### CNN using mxnet symbol

**Let's reshape the data from (x_dim, y_dim) to (x_dim, number of channels = 1, y_dim)**

In [23]:
trainX= trainX.reshape(trainX.shape[0],1,trainX.shape[1])
valX= valX.reshape(valX.shape[0],1,valX.shape[1])
testX= testX.reshape(testX.shape[0],1,testX.shape[1])
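
The reshape only inserts a channel axis of size 1 between the sample axis and the bit axis; the values themselves are untouched. A tiny NumPy example, with array names made up for illustration:

```python
import numpy as np

# Two samples with three features each
X = np.arange(6).reshape(2, 3)

# Insert a channel axis: (samples, features) -> (samples, 1, features)
X_with_channel = X.reshape(X.shape[0], 1, X.shape[1])

print(X.shape, X_with_channel.shape)
```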

**Prepare the NDArrayIters corresponding to Training, Testing and Validation data**

In [24]:
train_data = mx.io.NDArrayIter(trainX, trainY,
                               batch_size, shuffle=True)
val_data = mx.io.NDArrayIter(valX, valY,
                               batch_size, shuffle=True)
test_data = mx.io.NDArrayIter(testX, testY,
                              batch_size, shuffle=False)

**Define the CNN Model**

In [25]:
data = mx.sym.var('data')
# first conv layer
conv1 = mx.sym.Convolution(data=data, kernel=(2,), num_filter=20)
relu1 = mx.sym.Activation(data=conv1, act_type="relu")
pool1 = mx.sym.Pooling(data=relu1, pool_type="max", kernel=(2,), stride=(2,))
# second conv layer
conv2 = mx.sym.Convolution(data=pool1, kernel=(2,), num_filter=50)
relu2 = mx.sym.Activation(data=conv2, act_type="relu")
pool2 = mx.sym.Pooling(data=relu2, pool_type="max", kernel=(2,), stride=(2,))
# first fully connected layer
flatten = mx.sym.flatten(data=pool2)
fc1 = mx.sym.FullyConnected(data=flatten, num_hidden=500)
relu3 = mx.sym.Activation(data=fc1, act_type="relu")
# second fully connected layer
fc2 = mx.sym.FullyConnected(data=relu3, num_hidden=num_outputs)
# softmax loss
lenet = mx.sym.SoftmaxOutput(data=fc2, name='softmax')
cnn_model = mx.mod.Module(symbol=lenet, context=ctx)
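
It is worth tracing how the sequence length shrinks through the convolution and pooling layers. The sketch below does this arithmetic in plain Python. It assumes an input length of 17 bits, a convolution stride of 1 with no padding, and pooling that discards incomplete windows. The helper `output_length` is hypothetical and not an MXNet function.

```python
def output_length(length, kernel, stride=1, pad=0):
    # Standard size formula for a 1-D convolution or pooling window
    return (length + 2 * pad - kernel) // stride + 1

length = 17                                         # bits per input vector
length = output_length(length, kernel=2)            # first convolution
after_conv1 = length
length = output_length(length, kernel=2, stride=2)  # first max pooling
after_pool1 = length
length = output_length(length, kernel=2)            # second convolution
after_conv2 = length
length = output_length(length, kernel=2, stride=2)  # second max pooling
after_pool2 = length

# 50 filters in the second convolution, each with after_pool2 positions
flattened_size = 50 * after_pool2
print(after_conv1, after_pool1, after_conv2, after_pool2, flattened_size)
```

Under these assumptions only 3 positions per filter survive the second pooling step, so the fully connected layers see a heavily compressed view of the 17 input bits.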



**Train the CNN Model**

In [26]:
cnn_model.fit(train_data,
                eval_data=val_data,
                optimizer='sgd',
                optimizer_params={'learning_rate':0.01,'momentum':0.9},
                eval_metric='acc',
                num_epoch=100)

**Let's see what the model predicts**

In [27]:
acc = mx.metric.Accuracy()
cnn_model.score(test_data, acc)
probabilities = cnn_model.predict(test_data)
predictions = nd.argmax(probabilities, axis=1)
output = np.vectorize(fizz_buzz)(np.arange(1, 101), predictions.asnumpy().astype(int))
print(output)
print("Test Accuracy : ",acc.get_name_value()[0][1])

['1' '2' '3' '4' '5' '6' '7' '8' '9' '10' '11' '12' '13' '14' '15' '16'
 '17' '18' '19' '20' '21' '22' '23' '24' '25' '26' '27' '28' '29' '30' '31'
 '32' '33' '34' '35' '36' '37' '38' '39' '40' '41' '42' '43' '44' '45' '46'
 '47' '48' '49' '50' '51' '52' '53' '54' '55' '56' '57' '58' '59' '60' '61'
 '62' '63' '64' '65' '66' '67' '68' '69' '70' '71' '72' '73' '74' '75' '76'
 '77' '78' '79' '80' '81' '82' '83' '84' '85' '86' '87' '88' '89' '90' '91'
 '92' '93' '94' '95' '96' '97' '98' '99' '100']
Test Accuracy :  0.53
