# <span style="color:#0b486b">TensorFlow, Word Embedding, and Text Analytics</span>

This notebook covers practical machine learning knowledge and skills in TensorFlow, word embedding, and text analytics. Some sections have been partially completed to help you get started.

* Before you start, read the entire notebook carefully once to understand what you need to do. <br><br>

* In each cell marked with **#YOU ARE REQUIRED TO INSERT YOUR CODES IN THIS CELL**, you **must** supply your own code where instructed. <br>

## <span style="color:#0b486b">Instructions</span>

This tutorial contains **two** parts:

* Part 1: Deep Feedforward Neural Network
* Part 2: Word2Vec, text analytics and application

**Hint**: You are strongly recommended to go thoroughly through the lectures and practical lab sessions from Weeks 5 to 9.

## <span style="color:#0b486b">Part 1: Deep Feedforward Neural Network </span>

Demonstrate the deep learning knowledge that you have acquired from the lectures. Most of the content is drawn from the Week 5, 6 and 7 materials on deep neural networks.

*Run the following cell to create the necessary subfolders. You must **not** modify this code and **must** run it first*.

In [1]:
# Create necessary subfolders to store intermediate files.

import os
if not os.path.exists("./models/dnn0"):
    os.makedirs("models/dnn0")

The first part applies a DNN to recognize the letters A-J. You have worked with the MNIST dataset in your practical sessions, which should have given you a good sense of how to apply a DNN to image recognition tasks.

You are going to work with the **notMNIST** dataset for the *letter recognition task*. The dataset contains 10 classes of letters, A-J, taken from different fonts. You will see some examples in the visualization task below. A short blog post about the data can be found [here](http://yaroslavvb.blogspot.com.au/2011/09/notmnist-dataset.html).

Here we consider only a small subset, which can be found at [this link](http://yaroslavvb.com/upload/notMNIST/notMNIST_small.mat). This file has already been downloaded and stored in the `datasets` subfolder of this folder. The file is in *Matlab* format, so our first task is to:

####  <span style="color:red">**Question 1.1**</span>. Load the data into two *`numpy array`* variables:
* *`x`*: storing features with dimension `[num_samples, width, height]` (`num_samples`: number of samples, `width`: image width, `height`: image height), and
* *`y`*: storing labels with dimension `num_samples`. 

Enter the missing code in the following cell to complete this question.

In [28]:
# YOU ARE REQUIRED TO INSERT YOUR CODES IN THIS CELL

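import numpy as np
import scipy.io as sio

# Load the MATLAB file; `images` is stored with the sample axis last.
data = sio.loadmat("datasets/notMNIST_small.mat")
x, y = data['images'], data['labels']
# Move the sample axis to the front: [width, height, num_samples] -> [num_samples, width, height].
x = np.rollaxis(x, axis=2)
print(x.shape)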

(18724, 28, 28)
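
The axis reordering in Question 1.1 can be checked on a small synthetic array before touching the real file. The sketch below is illustrative only: `images_last` is a made-up stand-in for the loaded data, and `np.moveaxis` is used as an equivalent way to bring the sample axis to the front.

```python
import numpy as np

# Synthetic stand-in for the MATLAB array: 28x28 images stacked along the last axis.
num_samples = 5
images_last = np.arange(28 * 28 * num_samples).reshape(28, 28, num_samples)

# Move the sample axis (axis 2) to the front: (28, 28, N) -> (N, 28, 28).
images_first = np.moveaxis(images_last, 2, 0)

print(images_last.shape)   # (28, 28, 5)
print(images_first.shape)  # (5, 28, 28)

# Each entry along the new first axis is one complete image.
assert np.array_equal(images_first[3], images_last[:, :, 3])
```

Indexing `images_first[k]` then returns the k-th image directly, which is the layout the later cells rely on.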


####  <span style="color:red">**Question 1.2**</span>. Print out the total number of data points, and the *unique* labels in this dataset.


In [35]:
#YOU ARE REQUIRED TO INSERT YOUR CODES IN THIS CELL
print('Total data points:', len(y))
print('Number of unique labels:', len(np.unique(y)))

Total data points: 18724
Number of unique labels: 10
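
To list the label values themselves rather than only their count, `np.unique` can also return how often each value occurs. This is a minimal sketch on hypothetical labels (`y_demo` is invented for illustration, not taken from the dataset).

```python
import numpy as np

# Hypothetical labels standing in for `y`; the real labels come from the .mat file.
y_demo = np.array([0, 1, 1, 2, 2, 2, 0, 9])

# return_counts=True also reports the number of samples per distinct label.
unique_labels, counts = np.unique(y_demo, return_counts=True)

print("Total data points:", len(y_demo))
print("Unique labels:", unique_labels)
print("Samples per label:", dict(zip(unique_labels.tolist(), counts.tolist())))
```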


####  <span style="color:red">**Question 1.3**</span>. Display 100 images in a `10x10` grid, with each row showing 10 *random* images of one label.

#### You might decide to use the function `display_images` provided in the next cell, or you can write your own code.


In [36]:
# this function is a utility to display images from the dataset
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

def display_images(images, shape):
    fig = plt.figure(figsize=shape)
    fig.subplots_adjust(left=0, right=1, bottom=0, top=1, hspace=0.05, wspace=0.05)
    for i in range(np.prod(shape)):
        p = fig.add_subplot(shape[0], shape[1], i+1, xticks=[], yticks=[])
        p.imshow(images[i], cmap=plt.cm.bone)                

In [52]:
# YOU ARE REQUIRED TO INSERT YOUR CODES IN THIS CELL

unique_labels = np.unique(y)
images = []
for l in unique_labels:    
    idx = np.where(y == l)[0]
    idx = idx[np.random.permutation(len(idx))[:10]]    
    for i in idx:
        images += # INSERT YOUR CODE HERE

display_images(images, shape=(10, 10))

ValueError: operands could not be broadcast together with shapes (0,) (28,28) 
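
One way to think about the selection step in Question 1.3 is sketched below on synthetic data, without plotting. The arrays `x_demo` and `y_demo` are hypothetical stand-ins, and collecting images with `list.append` is just one possible choice.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for `x` and `y`: 3 labels, 20 tiny "images" each.
num_labels, per_label, side = 3, 20, 4
y_demo = np.repeat(np.arange(num_labels), per_label)
# Fill every image with its own label so the grouping is easy to verify.
x_demo = np.ones((len(y_demo), side, side)) * y_demo[:, None, None]

num_per_row = 10
images = []
for label in np.unique(y_demo):
    idx = np.where(y_demo == label)[0]
    # Pick 10 distinct random indices belonging to this label.
    chosen = rng.choice(idx, size=num_per_row, replace=False)
    for i in chosen:
        # append adds the whole 2-D image as a single list element.
        images.append(x_demo[i])

print(len(images), images[0].shape)  # 30 (4, 4)
```

Because images are appended label by label, consecutive groups of ten entries correspond to one row of the final grid.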

####  <span style="color:red">**Question 1.4**</span>. Use a *deep feedforward neural network* as the classifier to perform the image classification task with a *single train/test split*.
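
A single train/test split can be sketched with a shuffled index array, as below. The 80/20 ratio, the seed, and the synthetic `x_demo`/`y_demo` arrays are assumptions for illustration; the notebook does not prescribe them.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic stand-ins for the features and labels.
num_samples = 100
x_demo = rng.random((num_samples, 28, 28))
y_demo = rng.integers(0, 10, size=num_samples)

# Shuffle once, then cut at an assumed 80/20 boundary.
train_fraction = 0.8  # assumption, not prescribed by the notebook
perm = rng.permutation(num_samples)
num_train = int(train_fraction * num_samples)
train_idx, test_idx = perm[:num_train], perm[num_train:]

x_train, y_train = x_demo[train_idx], y_demo[train_idx]
x_test, y_test = x_demo[test_idx], y_demo[test_idx]

print(x_train.shape, x_test.shape)  # (80, 28, 28) (20, 28, 28)
```

Shuffling before cutting matters when the samples are stored grouped by label, since a plain slice could otherwise leave some classes out of one of the two sets.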

In [None]:
# YOU ARE REQUIRED TO INSERT YOUR CODES IN THIS CELL


When training a DNN, scaling the data is important. The pixel intensities of the images lie in the range [0, 255], which makes it difficult for the neural network to learn; rescaling them to a small range such as [0, 1] typically helps.
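
As a minimal sketch of the scaling step, assuming 8-bit pixel values and a target range of [0, 1] (one common choice among several), the images can be divided by 255 and flattened into vectors. The array `x_demo` is synthetic.

```python
import numpy as np

# Synthetic 8-bit images standing in for `x`.
rng = np.random.default_rng(0)
x_demo = rng.integers(0, 256, size=(6, 28, 28)).astype(np.uint8)

# Convert to float before dividing, then map [0, 255] -> [0, 1].
x_scaled = x_demo.astype(np.float32) / 255.0

# Flatten each 28x28 image into a 784-dimensional vector for a feedforward network.
x_flat = x_scaled.reshape(len(x_scaled), -1)

print(x_flat.shape)  # (6, 784)
```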

In [None]:
# YOU ARE REQUIRED TO INSERT YOUR CODES IN THIS CELL


In [None]:
# YOU ARE REQUIRED TO INSERT YOUR CODES IN THIS CELL


In [None]:
# YOU ARE REQUIRED TO INSERT YOUR CODES IN THIS CELL [5 marks]

import tensorflow as tf

tf.reset_default_graph()

num_inputs = # INSERT YOUR CODE HERE
num_hidden1 = # INSERT YOUR CODE HERE
num_hidden2 = # INSERT YOUR CODE HERE
num_outputs = len(np.unique(y))

inputs = # INSERT YOUR CODE HERE
labels = # INSERT YOUR CODE HERE

In [None]:
# YOU ARE REQUIRED TO INSERT YOUR CODES IN THIS CELL [3 marks]

def neuron_layer(x, num_neurons, name, activation=None):
    with tf.name_scope(name):
        # INSERT YOUR CODE HERE
    if activation == "sigmoid":
        # INSERT YOUR CODE HERE
    elif activation == "relu":
        # INSERT YOUR CODE HERE
    else:
        return z
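
For intuition about what a fully connected layer computes, the NumPy sketch below applies an affine map followed by an optional activation. It is not the TensorFlow solution expected in the cell above; the function name `dense_forward`, the weight scale, and the layer width of 16 are all hypothetical.

```python
import numpy as np

def dense_forward(x, weights, bias, activation=None):
    """Compute activation(x @ weights + bias) for a batch of inputs."""
    z = x @ weights + bias
    if activation == "sigmoid":
        return 1.0 / (1.0 + np.exp(-z))
    if activation == "relu":
        return np.maximum(z, 0.0)
    return z  # linear output, e.g. for logits

rng = np.random.default_rng(0)
batch = rng.normal(size=(4, 784))            # 4 flattened 28x28 images
w = rng.normal(scale=0.01, size=(784, 16))   # 16 hidden units (arbitrary choice)
b = np.zeros(16)

hidden = dense_forward(batch, w, b, activation="relu")
print(hidden.shape)  # (4, 16)
```

Stacking two such layers with a nonlinearity and a final linear layer gives the hidden1, hidden2, logits structure used in the next cell.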

In [None]:
# YOU ARE REQUIRED TO INSERT YOUR CODES IN THIS CELL [7 marks]

with tf.name_scope("dnn"):
    hidden1 = # INSERT YOUR CODE HERE
    hidden2 = # INSERT YOUR CODE HERE
    logits = # INSERT YOUR CODE HERE
with tf.name_scope("loss"):
    xentropy = # INSERT YOUR CODE HERE
    loss = tf.reduce_mean(xentropy, name="loss")
    
with tf.name_scope("evaluation"):
    # INSERT YOUR CODE HERE
    
with tf.name_scope("train"):
    # INSERT YOUR CODE HERE (must define `grads` and `training_op`, which are used later)
    
    for var in tf.trainable_variables():
        tf.summary.histogram(var.op.name + "/values", var)
        
    for grad, var in grads:
        if grad is not None:
            tf.summary.histogram(var.op.name + "/gradients", grad)

# summary
accuracy_summary = # INSERT YOUR CODE HERE
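
The loss and evaluation steps can be illustrated outside TensorFlow. The NumPy sketch below assumes integer class labels (rather than one-hot vectors) and uses hypothetical helper names; it computes the per-sample softmax cross-entropy and the classification accuracy from raw logits.

```python
import numpy as np

def sparse_softmax_xentropy(logits, labels):
    """Per-sample cross-entropy between softmax(logits) and integer class labels."""
    shifted = logits - logits.max(axis=1, keepdims=True)  # for numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels]

def accuracy(logits, labels):
    """Fraction of samples whose highest-scoring class equals the label."""
    return float(np.mean(np.argmax(logits, axis=1) == labels))

logits_demo = np.array([[2.0, 0.5, 0.1],
                        [0.2, 0.1, 3.0],
                        [1.0, 1.5, 0.3]])
labels_demo = np.array([0, 2, 0])

# The scalar loss is the mean of the per-sample cross-entropies.
loss_demo = sparse_softmax_xentropy(logits_demo, labels_demo).mean()
print("loss:", loss_demo)
print("accuracy:", accuracy(logits_demo, labels_demo))
```

In this toy batch the first two rows are classified correctly and the third is not, so the accuracy is two thirds.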


In [None]:
# YOU ARE REQUIRED TO INSERT YOUR CODES IN THIS CELL [5 marks]

# merge all summaries
tf.summary.histogram('hidden1/activations', hidden1)
tf.summary.histogram('hidden2/activations', hidden2)
merged = # INSERT YOUR CODE HERE

init = # INSERT YOUR CODE HERE
saver = # INSERT YOUR CODE HERE

train_writer = tf.summary.FileWriter("models/dnn0/train", tf.get_default_graph())
test_writer = tf.summary.FileWriter("models/dnn0/test", tf.get_default_graph())

num_epochs = # INSERT YOUR CODE HERE
batch_size = # INSERT YOUR CODE HERE

<span style="color:red">**(d)**</span> **You are now required to write code to train the DNN.** Write your code in the following cell. <span style="color:red">**[5 points]**</span> </div>

In [None]:
# YOU ARE REQUIRED TO INSERT YOUR CODES IN THIS CELL

with tf.Session() as sess:
    init.run()
    print("Epoch\tTrain accuracy\tTest accuracy")
    for epoch in range(num_epochs):
        for idx_start in range(0, x_train.shape[0], batch_size):
            idx_end = # INSERT YOUR CODE HERE
            x_batch, y_batch = # INSERT YOUR CODE HERE
            sess.run(training_op, feed_dict={inputs: x_batch, labels: y_batch})
            
        summary_train, acc_train = # INSERT YOUR CODE HERE
        summary_test, acc_test = # INSERT YOUR CODE HERE
        
        train_writer.add_summary(summary_train, epoch)
        test_writer.add_summary(summary_test, epoch)
        
        print("{}\t{}\t{}".format(epoch, acc_train, acc_test))

    save_path = saver.save(sess, "models/dnn0.ckpt")
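
The mini-batch slicing used in the training loop can be sketched independently of TensorFlow. The generator below is a hypothetical helper (`iterate_minibatches` is not part of the notebook) showing one way to compute the end index so that the final, possibly smaller, batch is still included.

```python
import numpy as np

def iterate_minibatches(features, targets, batch_size):
    """Yield consecutive (features, targets) slices; the last batch may be smaller."""
    num_samples = features.shape[0]
    for idx_start in range(0, num_samples, batch_size):
        idx_end = min(idx_start + batch_size, num_samples)
        yield features[idx_start:idx_end], targets[idx_start:idx_end]

x_demo = np.arange(10 * 3).reshape(10, 3)
y_demo = np.arange(10)

batch_sizes = [len(xb) for xb, _ in iterate_minibatches(x_demo, y_demo, batch_size=4)]
print(batch_sizes)  # [4, 4, 2]
```

Every sample is visited exactly once per pass, which corresponds to one epoch in the loop above.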