# TensorFlow on Colaboratory

### Step 1: Getting Input Files from AWS Bucket

Public S3 data can be fetched directly with `!wget` in Colaboratory; however, the S3 bucket for this lab and assignment is private. In this case, we need to upload the input files to the Colaboratory working directory:

1) Please download the input data files from Canvas and save them to a local directory.  

2) Go to the Files tab:  

![How to bring up files tab](https://raw.githubusercontent.com/rbm2/RiceCOMP543/master/Colab_00.png)

3) Click `UPLOAD` and select the input data files on your local drive. 

4) Once the files are uploaded, you should see them in the "Files" tab, as follows:   

![Uploaded Files](https://raw.githubusercontent.com/rbm2/RiceCOMP543/master/Colab_01.png)

5) Please note that these files are deleted when the runtime is reset, so you may need to repeat this step after selecting `Reset All Runtimes...` (`Restart runtime` keeps your files).
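
Because uploaded files disappear when the runtime is reset, it can help to confirm that they are present before training. The following is a minimal sketch rather than part of the original lab: the helper name `missing_input_files` is hypothetical, and the file names are the ones opened by the training code later in this notebook.

```python
import os

# File names opened by the training code in this notebook.
REQUIRED_FILES = ["Holmes.txt", "war.txt", "william.txt"]

def missing_input_files(directory=".", required=REQUIRED_FILES):
    """Return the required file names that are not present in `directory`."""
    return [name for name in required
            if not os.path.isfile(os.path.join(directory, name))]

missing = missing_input_files()
if missing:
    print("Please upload:", ", ".join(missing))
else:
    print("All input files are present.")
```

Running this cell after every runtime reset gives a quick reminder of which files still need to be uploaded.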

### Step 2: Training Visualization Using TensorBoard

TensorBoard is a great tool for visualizing your training process, and you can find some examples [here](https://www.tensorflow.org/guide/summaries_and_tensorboard) to help you complete this lab. In this lab, we'll use a free service called [ngrok](https://ngrok.com/) to build the connection between the Google server and your local machine. Please follow the steps below to set up your TensorBoard connection: 

#### a) Download and unzip ngrok

In [0]:
!wget https://bin.equinox.io/c/4VmDzA7iaHb/ngrok-stable-linux-amd64.zip
!unzip ngrok-stable-linux-amd64.zip

#### b) Run TensorBoard

In [0]:
LOG_DIR = './log'
get_ipython().system_raw(
    'tensorboard --logdir {} --host 0.0.0.0 --port 6006 &'
    .format(LOG_DIR)
)

#### c) Run ngrok

In [0]:
get_ipython().system_raw('./ngrok http 6006 &')

#### d) Get URL

Generate the URL for TensorBoard.
 - You only need to generate this link ONCE; you can keep using this URL for TensorBoard until your runtime is terminated. 

In [0]:
! curl -s http://localhost:4040/api/tunnels | python3 -c \
    "import sys, json; print(json.load(sys.stdin)['tunnels'][0]['public_url'])"

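The `curl` one-liner above reads the JSON returned by ngrok's local API on port 4040 and prints the `public_url` of the first tunnel; it fails with an `IndexError` if ngrok has not registered a tunnel yet. The sketch below shows the same parsing step as a small function that returns `None` in that case. The function name `first_public_url` and the sample response body are hypothetical and only serve as an illustration.

```python
import json

def first_public_url(tunnels_json):
    """Return the public URL of the first tunnel, or None if there is none."""
    tunnels = json.loads(tunnels_json).get("tunnels", [])
    return tunnels[0]["public_url"] if tunnels else None

# Hypothetical response body, trimmed to the fields used here.
sample = '{"tunnels": [{"public_url": "https://example.ngrok.io", "proto": "https"}]}'
print(first_public_url(sample))

# An empty tunnel list means ngrok is not ready yet; wait and try again.
print(first_public_url('{"tunnels": []}'))
```
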
### Step 3: Enable GPU accelerator to speed up your training:

Running deep learning on different hardware can make a huge difference in runtime. Fortunately, Colaboratory offers a GPU accelerator to speed up your training process for FREE. Before running your code, please make sure to enable the GPU accelerator as follows: 

1)  Go to Menu bar, select `Edit`  
2)  Select `Notebook Settings`  
3)  Enable GPU Accelerator, and SAVE
![GPU Accelerator](https://raw.githubusercontent.com/rbm2/RiceCOMP543/master/Colab_02.png)  
4)  Now you are ready to run your TensorFlow code.  
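
To confirm that the accelerator is actually visible to TensorFlow, one option is to ask TensorFlow for the name of the GPU device. The following is a minimal sketch, assuming that `tf.test.gpu_device_name()` is available in the installed TensorFlow version; the helper `describe_accelerator` is a hypothetical name introduced only for this example.

```python
def describe_accelerator(device_name):
    """Turn a TensorFlow GPU device name into a short status message."""
    if device_name:
        return "GPU found at " + device_name
    return "No GPU found; check the notebook settings."

try:
    import tensorflow as tf
    # Returns a name such as '/device:GPU:0', or an empty string without a GPU.
    device_name = tf.test.gpu_device_name()
except ImportError:
    device_name = ""  # TensorFlow is not installed in this environment

print(describe_accelerator(device_name))
```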

### Step 4: Run your Lab 7 Code on Colab
- Please follow the steps below to complete this task:

    1) Make the same modification as in the AWS part to display the message at the end of training.
    
    2) Uncomment the TensorBoard code (lines 207-226 and 264-279) below. Now you can test your code and take a quick look at TensorBoard. **Hint**: To test your code, you can reduce `numTrainingIters` to a smaller number. 
    
    3) In the TensorBoard setup code (lines 207-226), please refer to the TensorBoard example ([link](https://www.tensorflow.org/guide/summaries_and_tensorboard)) and plot either of the following: `histogram of prediction` or `mean of Weight in hidden layer`. Of course, feel free to try more. **Hint**: You can enable line number display in `Tools` -> `Preference...` -> `Show Line Number`
    
    4) To check off, please run 5000 iterations and show a TA or instructor the last 20 outputs (to avoid slowing the browser down, output is printed every 100 iterations in Colab), the message at the end, and the plots in TensorBoard. 

- To avoid error messages, always go to `Runtime` in the menu bar and click `Restart Runtime` before you rerun the training. 

- To visualize your training progress, make sure that you run the code in **Step 2**, complete the TensorFlow code, and click the TensorBoard link generated in step d) after training is done.
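
To make the two suggested plots more concrete, the sketch below computes in plain NumPy what a histogram summary and a mean-weight scalar would record for a single training step. It is only an illustration under assumed stand-in data (random `predictions` and `weights` arrays with made-up shapes), not the TensorFlow summary code that the lab asks for.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for one batch of softmax outputs (100 x 3) and a weight matrix.
predictions = rng.dirichlet(np.ones(3), size=100)
weights = rng.normal(0, 0.05, size=(8, 4))

# A histogram summary records how the values of a tensor are distributed.
counts, bin_edges = np.histogram(predictions, bins=10, range=(0.0, 1.0))

# A scalar summary records one number per step, here the mean weight.
mean_weight = weights.mean()

print("histogram counts:", counts)
print("mean weight:", mean_weight)
```

In TensorBoard, the histogram is redrawn for every recorded step, so the plot shows how the distribution shifts as training progresses.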


In [0]:
# You may need to remove the log directory
# to avoid overlap with graphs from a previous run. 
!rm log -rf
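
The training cell that follows encodes each line of text as a matrix of one-hot rows (one row per character, 256 columns) and pads shorter lines at the front with all-zero rows. A minimal NumPy sketch of that encoding on a toy string is shown here; the helper names `one_hot_line` and `pad_front` are hypothetical, and unlike the notebook code this sketch does not transpose the padded matrix.

```python
import numpy as np

def one_hot_line(line, num_codes=256):
    """Encode each character with code < num_codes as a one-hot row."""
    codes = np.array([ord(ch) for ch in line if ord(ch) < num_codes], dtype=int)
    matrix = np.zeros((len(codes), num_codes))
    matrix[np.arange(len(codes)), codes] = 1
    return matrix

def pad_front(matrix, max_len):
    """Prepend all-zero rows so that the matrix has max_len rows."""
    padding = np.zeros((max_len - matrix.shape[0], matrix.shape[1]))
    return np.concatenate((padding, matrix), axis=0)

encoded = one_hot_line("Hi!")
padded = pad_front(encoded, 5)
print(encoded.shape, padded.shape)

# The last three rows hold the character codes of 'H', 'i' and '!'.
print(np.argmax(padded[2:], axis=1))
```

Padding at the front means that the real characters are the last inputs the RNN sees before the final classification.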

In [0]:
import numpy as np
import tensorflow as tf
sess = tf.InteractiveSession()

# the number of iterations to train for
numTrainingIters = 5000

# the number of hidden neurons that hold the state of the RNN
hiddenUnits = 1000

# the number of classes that we are learning over
numClasses = 3

# the number of data points in a batch
batchSize = 100

# this function takes a dictionary (called data) which contains
# (dataPointID, (classNumber, matrix)) entries.  Each matrix
# is a sequence of vectors; each vector has a one-hot-encoding of
# an ascii character, and the sequence of vectors corresponds to
# one line of text.  classNumber indicates which file the line of
# text came from.  
# 
# The argument maxSeqLen is the maximum length of a line of text
# seen so far.  fileName is the name of a file whose contents
# we want to add to data.  classNum is an indicator of the class
# we are going to associate with text from that file.  linesToUse
# tells us how many lines to sample from the file.
#
# The return val is the new maxSeqLen, as well as the new data
# dictionary with the additional lines of text added
def addToData (maxSeqLen, data, fileName, classNum, linesToUse):
    #
    # open the file and read it in
    with open(fileName) as f:
        content = f.readlines()
    #
    # sample linesToUse numbers; these will tell us what lines
    # from the text file we will use
    # [Note] randint generates a vector of size "linesToUse" with values from 0 to len(content) - 1
    myInts = np.random.randint (0, len(content), linesToUse)
    #
    # i is the key of the next line of text to add to the dictionary
    # [Note] dictionary is called "data" in this case, so i is the length of dictionary
    i = len(data)
    #
    # loop thru and add the lines of text to the dictionary
    for whichLine in myInts.flat: # myInts.flat is a 1-D iterator over myInts
        #
        # get the line and ignore it if it has nothing in it
        line = content[whichLine]
        if line.isspace () or len(line) == 0:
            continue
        #
        # take note if this is the longest line we've seen
        if len (line) > maxSeqLen:
            maxSeqLen = len (line)
        #
        # create the matrix that will hold this line
        temp = np.zeros((len(line), 256))
        #
        # j is the character we are on
        j = 0
        # 
        # loop thru the characters
        for ch in line:
            #
            # non-ascii? ignore
            if ord(ch) >= 256: # ord(c) gives the unicode of c
                continue
            #
            # one hot!
            temp[j][ord(ch)] = 1 # mark the ascii index 
            # 
            # move onto the next character
            j = j + 1
            #
        # remember the line of text
        # add this (class number, matrix_of_line) to end of data (dictionary)
        data[i] = (classNum, temp)
        #
        # move onto the next line
        i = i + 1
    #
    # and return the dictionary with the new data
    return (maxSeqLen, data) # (max length of the line in file, and the dictionary)

# this function takes as input a data set encoded as a dictionary
# (same encoding as the last function) and pre-pends every line of
# text with empty characters so that each line of text is exactly
# maxSeqLen characters in size
def pad (maxSeqLen, data):
    #
    # loop thru every line of text
    for i in data:
        #
        # access the matrix and the label
        temp = data[i][1]
        label = data[i][0]
        #
        # get the number of characters in this line
        numChars = temp.shape[0]
        #
        # and then pad so the line is the correct length
        padding = np.zeros ((maxSeqLen - numChars, 256))
        data[i] = (label, np.transpose (np.concatenate ((padding, temp), axis = 0)))
    #
    # return the new data set
    return data

# this generates a new batch of training data of size batchSize from the
# list of lines of text data. This version of generateData is useful for
# an RNN because the data set x is a NumPy array with dimensions
# [batchSize, 256, maxSeqLen]; it can be unstacked into a series of
# matrices containing one-hot character encodings for each data point
# using tf.unstack(inputX, axis=2)
def generateDataRNN (maxSeqLen, data):
    #
    # randomly sample batchSize lines of text
    myInts = np.random.randint (0, len(data), batchSize)
    #
    # stack all of the text into a matrix of one-hot characters
    x = np.stack ([data[i][1] for i in myInts.flat])
    #
    # and stack all of the labels into a vector of labels
    y = np.stack ([np.array((data[i][0])) for i in myInts.flat])
    #
    # return the pair
    return (x, y)

# this also generates a new batch of training data, but it represents
# the data as a NumPy array with dimensions [batchSize, 256 * maxSeqLen]
# where for each data point, all characters have been appended.  Useful
# for feed-forward network training
def generateDataFeedForward (maxSeqLen, data):
    #
    # randomly sample batchSize lines of text
    myInts = np.random.randint (0, len(data), batchSize)
    #
    # stack all of the text into a matrix of one-hot characters
    x = np.stack ([data[i][1].flatten () for i in myInts.flat]) # flatten turns matrix into 1-D form
    #
    # and stack all of the labels into a vector of labels
    y = np.stack ([np.array((data[i][0])) for i in myInts.flat])
    #
    # return the pair
    return (x, y)

# create the data dictionary
maxSeqLen = 0
data = {}

# load up the three data sets
(maxSeqLen, data) = addToData (maxSeqLen, data, "Holmes.txt", 0, 10000)
(maxSeqLen, data) = addToData (maxSeqLen, data, "war.txt", 1, 10000)
(maxSeqLen, data) = addToData (maxSeqLen, data, "william.txt", 2, 10000)

# pad each entry in the dictionary with empty characters as needed so
# that the sequences are all of the same length
data = pad (maxSeqLen, data)
        
# now we build the TensorFlow computation... there are two inputs, 
# a batch of text lines and a batch of labels
inputX = tf.placeholder(tf.float32, [batchSize, 256, maxSeqLen])
inputY = tf.placeholder(tf.int32, [batchSize])

# this is the initial state of the RNN, before processing any data
initialState = tf.placeholder(tf.float32, [batchSize, hiddenUnits])

# the weight matrix that maps the inputs and hidden state to a set of values
W = tf.Variable(np.random.normal(0, 0.05, (hiddenUnits + 256, hiddenUnits)), dtype=tf.float32)

# biases for the hidden values
b = tf.Variable(np.zeros((1, hiddenUnits)), dtype=tf.float32)

# weights and bias for the final classification
W2 = tf.Variable(np.random.normal (0, 0.05, (hiddenUnits, numClasses)),dtype=tf.float32)
b2 = tf.Variable(np.zeros((1,numClasses)), dtype=tf.float32)

# unpack the input sequences so that we have a series of matrices,
# each of which has a one-hot encoding of the current character from
# every input sequence
sequenceOfLetters = tf.unstack(inputX, axis=2)

# now we implement the forward pass
currentState = initialState
for timeTick in sequenceOfLetters:
    #
    # concatenate the state with the input, then compute the next state
    inputPlusState = tf.concat([timeTick, currentState], 1)  
    next_state = tf.tanh(tf.matmul(inputPlusState, W) + b) 
    currentState = next_state

# compute the set of outputs
outputs = tf.matmul(currentState, W2) + b2 # matmul

predictions = tf.nn.softmax(outputs) # softmax

# compute the loss
losses = tf.nn.sparse_softmax_cross_entropy_with_logits(logits=outputs, labels=inputY)
totalLoss = tf.reduce_mean(losses)

# use gradient descent to train
#trainingAlg = tf.train.GradientDescentOptimizer(0.02).minimize(totalLoss)
trainingAlg = tf.train.AdagradOptimizer(0.02).minimize(totalLoss)

# # TensorBoard Setup below ----------------
# # add Loss to summary
# tf.summary.scalar('Loss', totalLoss)

# # Refer to the TensorBoard example and plot either of the following: 
# # (Of course, feel free to try both!)
# # 1 - histogram of prediction
# # 2 - mean of Weight in hidden layer
# # put your code here: 


# # directory where the results from the training are saved
# result_dir = './log/' 

# # Build the summary operation based on the TF collection of Summaries.
# summary_op = tf.summary.merge_all()

# # Instantiate a SummaryWriter to output summaries and the Graph.
# summary_writer = tf.summary.FileWriter(result_dir, sess.graph)
# # TensorBoard Setup above ---------------


# and train!!

with tf.Session() as sess:
    #
    # initialize everything
    sess.run(tf.global_variables_initializer())
    #
    # and run the training iters
    for epoch in range(numTrainingIters):
        # 
        # get some data
        x, y = generateDataRNN (maxSeqLen, data)
        #
        # do the training epoch
        _currentState = np.zeros((batchSize, hiddenUnits))
        _totalLoss, _trainingAlg, _currentState, _predictions, _outputs = sess.run(
                [totalLoss, trainingAlg, currentState, predictions, outputs],
                feed_dict={
                    inputX:x,
                    inputY:y,
                    initialState:_currentState
                })
        #        
        # just FYI, compute the number of correct predictions
        numCorrect = 0
        for i in range (len(y)):
            maxPos = -1
            maxVal = 0.0
            for j in range (numClasses):
                if maxVal < _predictions[i][j]:
                    maxVal = _predictions[i][j]
                    maxPos = j
            if maxPos == y[i]:
                numCorrect = numCorrect + 1
        
#         # Tensorboard below ----------------
#         # output the training summary every 100 iterations
#         if epoch % 100 == 0:
#             # print out to the screen
#             print("Step", epoch, "Loss", _totalLoss, "Correct", numCorrect, "out of", batchSize)
#             # Update the events file which is used to monitor the training.
#             summary_str = sess.run(
#                 summary_op,
#                 feed_dict={
#                     inputX:x,
#                     inputY:y,
#                     initialState:_currentState
#                 })
#             summary_writer.add_summary(summary_str, epoch)
#             summary_writer.flush()         
#         # Tensorboard above -----------------
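
For readers who want to check their understanding of the computation, the forward pass and the accuracy count in the training cell can be restated in plain NumPy. The sketch below is an illustration only: it uses small, made-up dimensions and random stand-ins for the parameters and the input batch, and it replaces the nested counting loop with a single `np.argmax` comparison.

```python
import numpy as np

rng = np.random.default_rng(0)
batch_size, num_codes, seq_len, hidden_units, num_classes = 4, 256, 6, 8, 3

# Random stand-ins for the parameters and for one one-hot input batch.
W = rng.normal(0, 0.05, (hidden_units + num_codes, hidden_units))
b = np.zeros((1, hidden_units))
W2 = rng.normal(0, 0.05, (hidden_units, num_classes))
b2 = np.zeros((1, num_classes))
x = np.zeros((batch_size, num_codes, seq_len))
chars = rng.integers(0, num_codes, (batch_size, seq_len))
for n in range(batch_size):
    x[n, chars[n], np.arange(seq_len)] = 1
y = rng.integers(0, num_classes, batch_size)

# Forward pass: one tanh update per character position.
state = np.zeros((batch_size, hidden_units))
for t in range(seq_len):
    input_plus_state = np.concatenate([x[:, :, t], state], axis=1)
    state = np.tanh(input_plus_state @ W + b)

# Softmax over the class scores computed from the final state.
outputs = state @ W2 + b2
shifted = np.exp(outputs - outputs.max(axis=1, keepdims=True))
predictions = shifted / shifted.sum(axis=1, keepdims=True)

# Vectorized equivalent of the nested counting loop.
num_correct = int(np.sum(np.argmax(predictions, axis=1) == y))
print("Correct", num_correct, "out of", batch_size)
```

With untrained random weights the count should hover around chance level, which is one way to sanity-check the printed numbers at the start of training.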

#### Reminder: 
Now, please go back to the URL you generated in step 2-d and click that link to check your TensorBoard. 

Copyright ©2019 Christopher M Jermaine (cmj4@rice.edu), and Risa B Myers  (rbm2@rice.edu)

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    https://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.