**Introduction**

This notebook helps you rapidly develop a basic TensorFlow script. MNIST is one of the most widely used datasets for learning about neural networks, so we use it here. While image classification is generally done with convolutional neural networks (CNNs), it is useful to learn how to build a simple artificial neural network (ANN) before moving to more advanced approaches. Here and there, I will add notes for further reading that will help you improve on this code. These are generally marked as *Advanced Notes*. You can skip them on a first reading. If you are interested in a comprehensive book on neural networks, I recommend the book by Geron, "Hands-on machine learning with Scikit-Learn and TensorFlow".

Let's get started.

**The Data**

Data is at the heart of any neural network (NN): data, data, data and even more data. The idea is that a NN works like the human brain. You learn by watching movies, reading books, taking classes, etc. All of these activities share one common trait: your brain absorbs data. The same is true for NNs, except that the data are generally provided in a spreadsheet-like format. 
For this example, we have two files, one labelled test, the other train. The training set is what we use to "teach" our NN. You can think of it as a collection of lectures, where you go through general concepts and solve some exercises with the help of the instructor. In all of these activities you are supposed to think about how the problems are solved, and the answer is immediately available. Once your training is complete, you generally take a test or exam, where you apply your newly learned skills to problems that you have not seen before. This is where your competence is tested. A NN undergoes a similar teaching process. First, you use the training set (which contains the answers) to optimize the NN. Then you test how good your NN is by assessing its ability on the test set. The test set does not contain answers, so your NN must do the best it can to guess correctly. At the end, it is up to the user to "grade" the NN by checking how many answers are correct.

*Advanced Note*: But how do you generate a test and a training set? Well, generally you collect your data somehow. This can be a survey, a database search, recordings, etc. Once you have the data, you randomly split them into two sets, say 90% training and 10% test. I said: randomly. That's because you want to avoid biases. For instance, the MNIST database contains handwritten digits from 0 to 9. If you split your data so that the training set does *not* contain any 9s, you will run into trouble. How can your NN guess what a 9 is if it never saw one?
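
A random 90/10 split like the one described in this note can be sketched in plain Python. The function name `random_split`, the seed and the toy records are assumptions made for illustration; they are not part of the MNIST files.

```python
import random

def random_split(records, test_fraction=0.1, seed=0):
    """Shuffle a copy of the records and cut it into (train, test)."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)      # seeded, so the split is reproducible
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical toy data: (picture_id, digit) pairs, 10 pictures per digit
records = [(i, i % 10) for i in range(100)]
train, test = random_split(records)

print(len(train), len(test))                   # 90 10
print(sorted({digit for _, digit in train}))   # digits the NN gets to see in training
```

Printing the digits present in the training set is a cheap way to catch the "no 9s" problem before any training starts.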

So, let's start by reading the training set. As the file is a CSV (very common for this purpose), I will use the pandas package.

In [None]:
import pandas as pd

train_set=pd.read_csv("../input/train.csv")  # read the CSV file
train_set.head() #show the first few rows



The first column ("label") is the digit drawn. The remaining 784 columns (28x28 pixels) describe an image. Our task is to build a NN that "looks" at the picture and guesses its label.

*Advanced Note*: But why would I need a new module such as pandas for reading a file? Doesn't Python have its own functions? Well, yes and no. In this case, reading a file is all we did, so basic Python functions would be fine. But sometimes things are a bit more complicated. For instance, imagine that your data are from a survey and that people have left some questions without an answer. This is a case of missing data. If a lot of people answered all the questions, you may just drop the incomplete surveys. But sometimes you need all the data available, so what do you do with the missing data? After all, those people did answer some (but not all) of the questions. There is no unique solution to this problem. But packages like pandas offer functions to identify these records and "fix" them, whether by dropping them or by performing some other operation. In conclusion, pandas is not "just" for reading files, but also for checking the quality of the data, selecting them, fixing them and computing statistics on them. For instance, we may wonder if MNIST contains the same number of images for each digit. Or are there more 9s than 0s? To answer this question, we write:



In [None]:
import matplotlib.pyplot as plt  # For graphics

train_set["label"].hist()    # This is actually a pandas function
plt.show()                   # Show the plot! 
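
If you prefer numbers to a plot, the same balance check can be sketched with the standard library. The list `labels` below is a made-up stand-in for the "label" column, not the real data. In pandas itself, `train_set["label"].value_counts()` should give the equivalent table.

```python
from collections import Counter

# Hypothetical labels, standing in for the "label" column
labels = [0, 1, 1, 3, 9, 9, 9, 9, 4, 0]
counts = Counter(labels)

for digit in sorted(counts):
    print(digit, counts[digit])        # one line per digit: digit, number of pictures

most_common_digit, n_pictures = counts.most_common(1)[0]
print(most_common_digit, n_pictures)   # 9 4
```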

**Neural Networks**

Now that we have collected our data, analyzed them and made sure that there are no surprises, we can pass the data to our neural network (NN). A NN is composed of neurons connected to each other. (Neurons are often called "perceptrons", but we will not dig into the difference here. Instead, I will keep using the word "neuron", which is less confusing for beginners.) 
The only operation that a neuron performs is a weighted sum of its inputs. Thus, if a neuron has two inputs called \\(x_1\\) and \\(x_2\\), the result is: $$y=w_1 x_1 + w_2 x_2$$
In this expression, the w's are called weights. The weights are what changes when your NN learns. Thus, training means finding the optimal set of w's. Also, remember that the result y may be "altered" before leaving the neuron. Thus, the actual output of the neuron looks like: $$ z=F(y)$$
where F is called the *activation* function and can be any function. In practice, only a few functions are actually employed/useful. One very common choice is the step function, which returns:
$$
z=
\begin{cases}
-1 &  \text{if}\ y<0\\
0 &   \text{if}\ y=0 \\
+1 &  \text{if}\ y>0\\
\end{cases}
$$
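
To make the two formulas concrete, here is a minimal sketch of a single neuron in plain Python. The inputs and weights are arbitrary numbers picked for illustration.

```python
def step(y):
    """Step activation from the formula above: -1, 0 or +1 depending on the sign of y."""
    if y < 0:
        return -1
    if y > 0:
        return 1
    return 0

def neuron(inputs, weights, activation=step):
    """Weighted sum of the inputs, passed through the activation function F."""
    y = sum(w * x for w, x in zip(weights, inputs))
    return activation(y)

# Hypothetical inputs and weights, chosen only for illustration
print(neuron([1.0, 2.0], [0.5, -1.0]))   # y = 0.5 - 2.0 = -1.5, so z = -1
print(neuron([1.0, 2.0], [2.0, -1.0]))   # y = 2.0 - 2.0 = 0.0, so z = 0
```

Changing the weights changes the answer for the same inputs, which is exactly what training exploits.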

As you can guess, a single neuron on its own is limited in the type of behavior that it can capture. For instance, how can it learn to compute a logarithm or a square root if it is limited to sums and multiplications? The solution is to use more than one neuron. In practice, you design layers of neurons. Each layer is connected to the next one, meaning that the outputs of one layer are the inputs of the next layer. We will use 3 layers. The first layer has 300 neurons and is connected to the input (namely the list of images). The second layer has 100 neurons. This layer is called a "hidden" layer because it is not directly connected to the input or output. You can have as many hidden layers as you wish. The last layer has 10 neurons and is connected to the output. 


1. How many layers do I need? How many neurons per layer do I need? These are great questions. Long story short, there is no easy answer. The optimal setup is generally found by trial and error. That said, a lot of people have spent their time trying to improve existing NNs or proposing new ones. Thus, you should always start with a quick search of the existing work. In my case, I used the setup proposed by Geron in the book cited at the beginning. 
1. Why ten neurons in the output layer? This depends on what your network is supposed to be looking for. For instance, if I ask: What digit is drawn in the picture? I would expect one output. However, I decided to ask: What is the probability that this picture represents a given digit? As I have ten digits, I will obtain the probability that the picture represents each one of them. Hopefully, the digit with the highest probability is the correct one. In addition, these probabilities will tell me what is confusing the NN. For instance, a picture may have an almost 50/50 chance of belonging to two digits. Identifying those cases may help improve the algorithm (or maybe figure out that somebody's handwriting is illegible).
1. Why is each layer smaller than the previous one? This is because we want to lose information. This may appear unintuitive: How can we learn if we lose information? Isn't that "forgetting"?  Actually, it is not. Let's take an example. If I tell you that I have a spherical eraser on my desk, half an inch in diameter, you can imagine its shape. "Imagining" means making a "prediction". Assuming I am not lying, you would be able to find it on my desk, even though you never actually saw it before. How is that possible? Over the span of your life you have seen several "round" objects: a planet, a basketball, a candy, etc. From these experiences, you have learnt what a sphere looks like: no edges, not elongated in any direction, etc. What happened is that your brain looked at several objects, "forgot" the specific details of each one of them and "extracted" the features that matter for being a sphere. For instance, the "flavor" of a sphere is not important for its definition. The same is done in machine learning. The NN "reads" several handwritten digits, "extracts" the essential features from them and uses those to classify the picture. Thus, "learning" means extracting the important features. This should not be confused with memorization, where you learn every single detail of each object.

After this long introduction, we are finally ready to write down our code using tensorflow! (see comment below)


In [None]:
import tensorflow as tf

tf.reset_default_graph()                     # to avoid error message in the notebook. Probably you will never use it in a real script

##### Prepare data with Pandas ######
X_train=train_set.drop(labels=["label"],axis=1) # all data except the columns called "label". 
X_train=X_train.values                          # only the numbers, no column labels 
y_train=train_set["label"]                      # Keep only column "label"
y_train=y_train.values                          # only the numbers, no column labels 

###### TensorFlow starts here  ######
# Define the variables
X=tf.constant(X_train,dtype=tf.float32,name="X")  #input ..... Why float?!
y=tf.constant(y_train,dtype=tf.int64,name="y")    # Correct output: Not prediction, but correct answer!

# Build the Neuronal Network
with tf.name_scope("NeuronalNetwork"): 
    input_layer=tf.layers.dense(X,300, name="InputLayer",activation=tf.nn.relu)
    middle_layer=tf.layers.dense(input_layer,100, name="MiddleLayer",activation=tf.nn.relu)
    output_layer=tf.layers.dense(middle_layer,10, name="OutputLayer")
    

Let's go over the code slowly.

In the first section we prepared the data. If you look at our earlier code, you see that the CSV file stores data with a name for each column. We do not want our NN to learn the names of the columns (pixel1, pixel2, etc.), so we drop them. We also split the data into the pictures themselves (X_train) and the digits they represent (y_train). In practice, we want the NN to "look at" X_train and "guess" y_train.

Then we start writing our code using TensorFlow. First, we need to tell TensorFlow which variables we will be using. Why not use X_train or y_train directly? Why should you define new variables?  We will go through this later on, but for now it suffices to say that in an actual program you may have already defined lots of variables at this point. As a NN takes a lot of resources to compute, you want to make sure that it is only using the variables that it really needs. There is no need to pass file names, temporary variables, etc. if it is not using them.

After this section comes the network itself. The line starting with "with" is optional. We will see how to use it later on. For now, think of it as a "comment" line that describes the lines within it.

The "dense" type means that all of the outputs of each layer are connected to all of the inputs of the next layer. This is the most basic form of NN. Of course, people have devised other ways of connecting layers, but that's way beyond the scope of this tutorial. For each layer we need to specify where the data come from, how many neurons the layer has, a label for the layer (optional) and an activation function (optional). For instance, the first layer takes the data from X, uses 300 neurons, is called "InputLayer" and uses an activation function called relu. Remember that the activation function is the function F(y) that alters the output. We do not use an activation function for the last layer as that is our answer ... kind of (see below for more).
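
What a dense layer computes can be sketched in a few lines of plain Python. This is only an illustration of the idea, not how `tf.layers.dense` is implemented. One detail the equation above leaves out: by default a dense layer also adds a constant (a "bias") to each neuron's sum, so the sketch includes one. The helper names and the toy numbers are assumptions.

```python
def relu(y):
    """relu activation: negative values become 0, positive values pass through."""
    return max(0.0, y)

def dense(inputs, weights, biases, activation=None):
    """One dense layer: every neuron sees every input.

    weights[j] holds the weights of neuron j and biases[j] its bias.
    """
    outputs = []
    for neuron_weights, bias in zip(weights, biases):
        y = sum(w * x for w, x in zip(neuron_weights, inputs)) + bias
        outputs.append(activation(y) if activation else y)
    return outputs

def count_parameters(layer_sizes):
    """Number of weights and biases in a stack of dense layers."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

# Hypothetical tiny layer: 2 inputs, 3 neurons
hidden = dense([1.0, -2.0],
               [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
               [0.0, 0.0, 0.5],
               activation=relu)
print(hidden)                                   # [1.0, 0.0, 0.0]

# 784 pixels -> 300 -> 100 -> 10 neurons, as in the network above
print(count_parameters([784, 300, 100, 10]))    # 266610
```

So even this "basic" network has more than a quarter of a million numbers to optimize.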

So far, we have constructed the NN. But let's not forget our task: Train the network! So how do we teach it? As we discussed before, we follow the same path as for teaching people. Thus, we write down a test and check the answers! We know that the NN must guess the values of y_train. So, we need to define a grading system that tells the NN how well it is doing. One option is to check the accuracy, namely how many times it guesses correctly. This is the ultimate goal of the network, so it should be checked.

However, as I discussed above, the question I am asking is actually "What is the probability that this picture is digit D?" (where D is a number between 0 and 9). In my output layer, I have 10 neurons. The first neuron tells me how likely it is that the picture is a 0, the second how likely it is to be a 1, and so on up to the 10th neuron, which tells me how likely it is to be a 9.  Thus, it makes sense to assign a second "grade" according to this definition. This is the grade that I want the NN to "see" and improve. The grade that you share with the NN is called the "loss function".
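
Strictly speaking, the ten output neurons produce raw scores (usually called "logits"), since the last layer has no activation function. A function called softmax turns those scores into probabilities that add up to one. The sketch below shows the idea in plain Python; the ten scores are invented for illustration.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that add up to 1."""
    largest = max(logits)
    exps = [math.exp(v - largest) for v in logits]   # subtracting the max avoids overflow
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical output of the 10 neurons for one picture
logits = [0.1, 2.0, -1.0, 0.3, 0.0, 1.9, -0.5, 0.2, 0.0, -2.0]
probs = softmax(logits)

guess = max(range(10), key=lambda digit: probs[digit])
print(guess)                    # 1
print(round(sum(probs), 6))     # 1.0
print(probs[1] > probs[5])      # True, but only barely: the NN hesitates between 1 and 5
```

The near tie between two digits is the kind of "confusion" that the probabilities make visible.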

Let's implement our grading system!


In [None]:
##### Grading ######
# For human beings : Accuracy
with tf.name_scope("Accuracy"):
    Guess=tf.nn.in_top_k(output_layer,y,1)    # Pick the digit with highest probability
    accuracy=tf.reduce_mean(tf.cast(Guess, tf.float32))  # Count how many you got right and give %

# For the neuronal network: Loss Function
with tf.name_scope("LossFunction"):
    EntropyProbability=tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y,logits=output_layer) # compute the cross entropy
    loss=tf.reduce_mean(EntropyProbability,name="Loss") # compute average cross entropy


The code for "human beings" should be self-explanatory (even if the function names are not). The cross entropy may not be what you would pick at first. The truth is: In NNs it is used all the time. I am not going into why it is or is not a good choice. Just remember that for whatever problem you are dealing with, there is probably a cross entropy for that specific computation. And that's usually one of the loss functions to consider. It may or may not be the best choice. For this tutorial we assume it is. Once more, notice the "with" statements used as comments. Nope, I am not telling you why I am using those just yet.
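
For the curious, both grades can be sketched in plain Python. For a single picture, the cross entropy is minus the logarithm of the probability that the NN assigned to the correct digit. The probabilities below are invented, and three classes are used instead of ten to keep the example short.

```python
import math

def cross_entropy(probabilities, label):
    """Loss for one picture: minus the log of the probability given to the correct answer."""
    return -math.log(probabilities[label])

def accuracy(predictions, labels):
    """Fraction of pictures for which the guess equals the correct answer."""
    hits = sum(1 for guess, truth in zip(predictions, labels) if guess == truth)
    return hits / len(labels)

# Hypothetical probabilities for a picture whose correct label is 1
confident = [0.05, 0.90, 0.05]
unsure = [0.40, 0.35, 0.25]

print(cross_entropy(confident, 1))            # small loss: right and confident
print(cross_entropy(unsure, 1))               # larger loss: the correct digit got little probability
print(accuracy([1, 0, 2, 2], [1, 1, 2, 2]))   # 0.75
```

Unlike the accuracy, the loss changes smoothly when the probabilities change a little, which is what the optimizer needs.
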
Thus, we have created our first NN, and a grading system. Now we need to make the NN learn from its own mistakes. Let's call this a revision of the materials taught:


In [None]:
with tf.name_scope("Revision"):
    optimizer=tf.train.GradientDescentOptimizer(0.01)
    training_op=optimizer.minimize(loss)


The optimizer is the way our NN updates the values of the w's (see our equation above) and training_op says how learning is achieved: in this case, by minimizing the loss function, namely the average cross entropy. The 0.01 used in the optimizer is called the "learning rate". Once more, there is no unique way of finding the perfect number. Usually, you should try several and find the best one. What is the best one? Well, the learning rate tells you how fast you learn. Thus, you may think that the larger the better. However, if it gets too big, the optimization becomes unstable. Thus, you should be careful in its choice. Once more, you should do a search for your specific problem and see what people have found to be a good value.

Gradient descent is the actual mathematical operation used to update the w's. This is a fairly technical matter, so I will cowardly shy away from it. Remember that there are several optimizers and that to find the best one, you should try several and see what works best. Gradient descent is generally fairly accurate and efficient, so it is a good starting point.
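
Without going into the math, the effect of the learning rate can be seen on a toy problem with a single weight. The sketch below minimizes the made-up loss (w - 3)^2 by repeatedly moving w against the slope of the loss. The loss, the starting point and the three rates are assumptions chosen for illustration.

```python
def gradient_descent(learning_rate, steps=20, w=0.0):
    """Minimize the toy loss (w - 3)**2, whose slope (gradient) is 2 * (w - 3)."""
    for _ in range(steps):
        gradient = 2.0 * (w - 3.0)
        w = w - learning_rate * gradient   # step against the slope
    return w

for rate in (0.01, 0.1, 1.1):
    print(rate, gradient_descent(rate))
# 0.01: still far from the best value w = 3 after 20 steps (learning is slow)
# 0.1 : close to 3
# 1.1 : each step overshoots more than the previous one and w runs away
```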

This is mostly all you need to run a basic NN program. You will be happy to know, however, that after all this coding you have achieved .... absolutely nothing!! In fact, none of the TensorFlow operations you just defined has been executed! What just happened?!

**Running a TensorFlow Network**

This section is the kind of thing that you read once and then forget. But it is still worth a few words. We ended the last section saying that our code does not run. Why? Python is an *imperative* programming language. That means that you write a command, press enter and something happens right away. TensorFlow follows a *declarative* programming philosophy (fine, recent versions have added imperative programming features....). That means that, first of all, you describe the network, its connections, the variables used and all that stuff, but **not** how you execute it. Basically, it is like an electronics kit where you have all of your pieces, but you still must connect them in a meaningful way to build a radio. That's what you must do next! So far, we have only defined our components; now it is time to "connect" them together and run the NN.
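
The "describe first, compute later" idea can be imitated in a few lines of plain Python. This is only an analogy, not how TensorFlow works internally; the class `Node` and the helper functions are invented for this sketch.

```python
class Node:
    """A value that is described now and computed only when eval() is called."""
    def __init__(self, compute, name):
        self.compute = compute
        self.name = name

    def eval(self):
        return self.compute()

def constant(value, name):
    return Node(lambda: value, name)

def multiply(a, b, name):
    return Node(lambda: a.eval() * b.eval(), name)

def add(a, b, name):
    return Node(lambda: a.eval() + b.eval(), name)

# Describing the computation: nothing is calculated in these lines
x = constant(3.0, "x")
w = constant(2.0, "w")
y = multiply(w, x, "y")
z = add(y, constant(1.0, "one"), "z")

print(z.name)     # z    (so far z is only a description)
print(z.eval())   # 7.0  (the calculation happens here)
```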


In [None]:
import glob                                # To check if files exist. See below

init=tf.global_variables_initializer()   # Allocate and initialize all the variables
saver=tf.train.Saver()                   # Store the values of the w's (and other stuff)

n_epochs=50                              # how many "training sessions"
save_file="./model.ckpt"                 # checkpoint. Better have one.

with tf.Session() as sess:    
    if glob.glob(save_file + "*"):        # If we have run this before, just continue 
        print("Parameters Found. Continuing")
        saver.restore(sess,save_file)    
    else:
        print("New Optimization Started")
        init.run()                       # If it is the first time, initialize.
        
    for i in range(n_epochs):
        sess.run(training_op)   # this is a training session and how to train it (training_op is defined above)
        if i % 10==0:
            print("Epoch:",i,"Loss:",loss.eval(),"accuracy:",accuracy.eval())  # notice the eval()!!
        
    print("Final Loss:",loss.eval(),"Final Accuracy",accuracy.eval())                 # Best result
    save_path = saver.save(sess, save_file)     # Save stuff for later. It returns a path ... 
    print("Model data saved in %s" % save_path)      # ... and we can well print it!
    
print ("Calculation Completed")         # Just to make sure we reached the end  of the program    


Once more, the code should be self-explanatory. It is important not to forget the .eval(). Remember that TensorFlow does **not** execute the code until you say so. Thus, just defining accuracy and loss is not enough. You need to explicitly say when to evaluate them. 
The saver is used to store the variables and optimized w's so that you do not need to train the NN every time. Training actual NNs may take extremely long and require a lot of (expensive) resources, so it is better to always have a checkpoint system in place. 
Let's check the output.  The loss function is decreasing, which is good, but very slowly. Well, this is to be expected as we did **not** normalize the input.

**Scaling the Data**

What we just wrote is **too** basic a NN, and we should take a bit more care of the data we feed it. Otherwise, we may get bizarre or unexpected results. One problem is the range of the data. Our pictures are grayscale bitmaps. That means that each pixel of the picture is described by a number between 0 and 255. NNs are generally not very pleased with numbers that vary that much. Thus, the data are normalized so that their values are of order one. In our case, we can simply divide our data by 255. But more generally we can use a scaler. *This, by the way, is why we set X as float from the beginning. Once you start scaling data, they become non-integer (usually).* We will follow the second approach. The scaling should be done at the very beginning, before using TensorFlow. You should add:



In [None]:
from sklearn.preprocessing import StandardScaler

##### Prepare data with Pandas ######
## See code above ##

#### Normalize Data with scikit ####
scaler=StandardScaler()
scaler.fit(X_train)
X_train=scaler.transform(X_train)

###### TensorFlow starts here  ######
## See code above ##
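
What the scaler does to each column (each pixel position) can be sketched in plain Python: subtract the column's mean and divide by its standard deviation. The functions `fit` and `transform` below only mimic the idea behind the scaler's methods, and the three-pixel "pictures" are made up. As far as I know, StandardScaler also leaves constant columns unscaled instead of dividing by zero, which matters here because some pixels are always 0.

```python
import math

def fit(rows):
    """Compute the mean and standard deviation of each column."""
    n = len(rows)
    means, stds = [], []
    for column in zip(*rows):
        mean = sum(column) / n
        variance = sum((v - mean) ** 2 for v in column) / n
        means.append(mean)
        # A constant column has zero spread: use 1.0 so that we never divide by zero
        stds.append(math.sqrt(variance) or 1.0)
    return means, stds

def transform(rows, means, stds):
    """Shift and rescale every value using the statistics found by fit()."""
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]

# Hypothetical "pictures" with 3 pixels each; the last pixel is always 0
pixels = [[0, 255, 0], [255, 255, 0], [0, 0, 0], [255, 0, 0]]
means, stds = fit(pixels)
scaled = transform(pixels, means, stds)
print(scaled[0])   # [-1.0, 1.0, 0.0]
```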

Thus the full code now reads:

In [None]:

import pandas as pd
import tensorflow as tf
from sklearn.preprocessing import StandardScaler
import  glob

tf.reset_default_graph()                     # to avoid error message in the notebook. Probably you will never use it in a real script
train_set=pd.read_csv("../input/train.csv")  # read the CSV file

##### Prepare data with Pandas ######
X_train=train_set.drop(labels=["label"],axis=1) # all data except the columns called "label". 
X_train=X_train.values                          # only the numbers, no column labels 
y_train=train_set["label"]                      # Keep only column "label"
y_train=y_train.values                          # only the numbers, no column labels 

#### Normalize Data with scikit ####
scaler=StandardScaler()
scaler.fit(X_train)
X_train=scaler.transform(X_train)

###### TensorFlow starts here  ######
# Define the variables
X=tf.constant(X_train,dtype=tf.float32,name="X")  #input ..... Why float?!
y=tf.constant(y_train,dtype=tf.int64,name="y")    # Correct output: Not prediction, but correct answer!

# Build the Neuronal Network
with tf.name_scope("NeuronalNetwork"): 
    input_layer=tf.layers.dense(X,300, name="InputLayer",activation=tf.nn.relu)
    middle_layer=tf.layers.dense(input_layer,100, name="MiddleLayer",activation=tf.nn.relu)
    output_layer=tf.layers.dense(middle_layer,10, name="OutputLayer")

##### Grading ######
# For human beings : Accuracy
with tf.name_scope("Accuracy"):
    Guess=tf.nn.in_top_k(output_layer,y,1)    # Pick the digit with highest probability
    accuracy=tf.reduce_mean(tf.cast(Guess, tf.float32))  # Count how many you got right and give %

# For the neuronal network: Loss Function
with tf.name_scope("LossFunction"):
    EntropyProbability=tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y,logits=output_layer) # compute the cross entropy
    loss=tf.reduce_mean(EntropyProbability,name="Loss") # compute average cross entropy

with tf.name_scope("Revision"):
    optimizer=tf.train.GradientDescentOptimizer(0.01)
    training_op=optimizer.minimize(loss)

##### Lets run the neuronal network!  #####

init=tf.global_variables_initializer()   # Allocate and initialize all the variables
saver=tf.train.Saver()                   # Store the values of the w's (and other stuff)

n_epochs=50                              # how many "training sessions"
save_file="./modelNormalized.ckpt"                 # checkpoint. Better have one.

with tf.Session() as sess:    
    if glob.glob(save_file + "*"):        # If we have run this before, just continue 
        print("Parameters Found. Continuing")
        saver.restore(sess,save_file)    
    else:
        print("New Optimization Started")
        init.run()                       # If it is the first time, initialize.
        
    for i in range(n_epochs):
        sess.run(training_op)   # this is a training session and how to train it (training_op is defined above)
        if i % 10==0:
            print("Epoch:",i,"Loss:",loss.eval(),"accuracy:",accuracy.eval())  # notice the eval()!!
        
    print("Final Loss:",loss.eval(),"Final Accuracy",accuracy.eval())                 # Best result
    save_path = saver.save(sess, save_file)     # Save stuff for later. It returns a path ... 
    print("Model data saved in %s" % save_path)      # ... and we can well print it!
    
print ("Calculation Completed")         # Just to make sure we reached the end  of the program    


Now things make more sense than before. What you have here is the backbone of an actual training session for a NN. **Do not forget to scale your data!!** 
Of course, you can run this code a bit more than just 50 epochs. And thanks to the restore/save facilities, you do not need to do it in just one run. Now it is just a matter of optimizing the w's so that the loss is as small as possible and the accuracy is as high as possible. There are, however, a few tricks that can speed up and improve the optimization. I will show a few below. If you are curious about all of those "with tf.name_scope" I have been liberally adding to the code, feel free to jump to the end of this notebook. However, right now I want to focus on the prediction phase of your NN.  All those other sections can be considered optional as you already know enough for a basic NN framework. 

**Making Predictions**

After you have trained your NN, it is time to test how effective it is and make some predictions on a new dataset: The test data that we have avoided so far. The code is actually very simple compared to the training phase. We are, of course, using the parameters optimized earlier. 


In [None]:
import pandas as pd
import tensorflow as tf
import numpy as np
from sklearn.preprocessing import StandardScaler
import  glob
import sys

tf.reset_default_graph()                     # to avoid error message in the notebook. Probably you will never use it in a real script
test_set=pd.read_csv("../input/test.csv")  # read the CSV file

##### Prepare data with Pandas ######
#X_test=test_set.drop(labels=["label"],axis=1) # all data except the columns called "label". 
X_test=test_set.values                          # only the numbers, no column labels 

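#### Normalize Data with scikit ####
# Reuse the scaler fitted on the training set (variable "scaler" from the training cell above).
# Fitting a new scaler on the test set would scale the pixels differently from what the NN saw in training.
X_test=scaler.transform(X_test)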

###### TensorFlow starts here  ######
# Define the variables
X=tf.constant(X_test,dtype=tf.float32,name="X") 

# Build the Neuronal Network
with tf.name_scope("NeuronalNetwork"): 
    input_layer=tf.layers.dense(X,300, name="InputLayer",activation=tf.nn.relu)
    middle_layer=tf.layers.dense(input_layer,100, name="MiddleLayer",activation=tf.nn.relu)
    output_layer=tf.layers.dense(middle_layer,10, name="OutputLayer")

##### Lets run the neuronal network!  #####
init=tf.global_variables_initializer()   # Allocate and initialize all the variables
saver=tf.train.Saver()                   # Store the values of the w's (and other stuff)
n_epochs=50                              # how many "training sessions"
save_file="./modelNormalized.ckpt"                 # checkpoint. Better have one.

with tf.Session() as sess:    
    if glob.glob(save_file + "*"):        # This time you MUST have a trained NN
        saver.restore(sess,save_file)    
    else:
        sys.exit('I need a trained set of parameters!')    
    prediction=np.argmax(output_layer.eval(),axis=1)
    # REMEMBER to commit the notebook to save the output file!
    #Here we create the file for the output. It has to be CSV, so let's use pandas
    d={'ImageId': np.arange(1,1+X_test.shape[0]),'label': prediction }
    df=pd.DataFrame(data=d)
    df.to_csv('submission.csv',index=None)   # print data into csv. Ready for submission!
        
print ("Calculation Completed")         # Just to make sure we reached the end  of the program    


Notice that we do not have a y variable this time. In fact, these are predictions, so the code only outputs the answers and writes them to a file that can be submitted for a score. Also, remember that it is unlikely that your system will reach 100% correct predictions. Like a human brain, a NN can make mistakes. How many times have you mispronounced a word or confused two words while reading? The same is true for the best NN. Its accuracy can be very high (using a CNN you can reach above 99%), but never perfect.

That's it. This is more or less what you need for a functional, even though basic, NN. In what follows there are a few suggestions on how to improve your NN or how to visualize your NN. 


**Mini-Batches**

The idea of mini-batch sampling is actually quite simple. So far, we have used all the data at once for training. When using mini-batch sampling, you break up your training data into small groups (called mini-batches) and optimize the parameters on one batch at a time. One of the main reasons for using this approach is that it performs much better on modern machines, especially GPUs. There are a few things to take into account for our code to use mini-batches. First off, our variables X and y were defined as "constant" before, as they were read once and then used over and over. However, as we will now be using mini-batches, they are different for each mini-batch. Thus, we will need a different data type, while keeping the rest unchanged:


In [None]:
# Define the variables
X=tf.placeholder(shape=(None,28*28),dtype=tf.float32,name="X")  # training set 
y=tf.placeholder(shape=(None,),dtype=tf.int64,name="y")    # Correct output: Not prediction, but correct answer! (None,) is a 1-D vector of any length

# Build the Neuronal Network
with tf.name_scope("NeuronalNetwork"): 
    input_layer=tf.layers.dense(X,300, name="InputLayer",activation=tf.nn.relu)
    middle_layer=tf.layers.dense(input_layer,100, name="MiddleLayer",activation=tf.nn.relu)
    output_layer=tf.layers.dense(middle_layer,10, name="OutputLayer")

##### Grading ######
# For human beings : Accuracy
with tf.name_scope("Accuracy"):
    Guess=tf.nn.in_top_k(output_layer,y,1)    # Pick the digit with highest probability
    accuracy=tf.reduce_mean(tf.cast(Guess, tf.float32))  # Count how many you got right and give %

# For the neuronal network: Loss Function
with tf.name_scope("LossFunction"):
    EntropyProbability=tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y,logits=output_layer) # compute the cross entropy
    loss=tf.reduce_mean(EntropyProbability,name="Loss") # compute average cross entropy

with tf.name_scope("Revision"):
    optimizer=tf.train.GradientDescentOptimizer(0.01)
    training_op=optimizer.minimize(loss)



Besides the keyword "placeholder", we have added the parameter shape. None stands for "any size". In this example, None stands for the number of pictures. For X we specify that each picture contains 28x28 points. Notice that None will be the size of a mini-batch, not of the entire set. Also, we did not say that X contains data from X_train as we did before. That is why it is a "placeholder". It is there, but it does not really contain anything. We can do this in TensorFlow because it is not imperative: just because we write something does not mean that something happens! However, we must specify where the data come from when we actually want to perform the calculation. And things are only computed in the "with tf.Session" part. So, let's work on that next!

In [None]:
import numpy as np                      # for math
##### Lets run the neuronal network!  #####

init=tf.global_variables_initializer()   # Allocate and initialize all the variables
saver=tf.train.Saver()                   # Store the values of the w's (and other stuff)

n_epochs=50                              # how many "training sessions"
save_file="./modelMB.ckpt"                 # checkpoint. Better have one.

#### Mini Batches  #####
batch_size=5000                           # how many pictures per mini-batch. This number is huge! Use 50 instead
n_batches=int(np.floor(X_train.shape[0]/batch_size))   #how many minibatches

with tf.Session() as sess:    
    if glob.glob(save_file + "*"):        # If we have run this before, just continue 
        print("Parameters Found. Continuing")
        saver.restore(sess,save_file)    
    else:
        print("New Optimization Started")
        init.run()                       # If it is the first time, initialize.
        
    for i in range(n_epochs):
        for j in range(n_batches):
            X_batch=X_train[(batch_size*j):(batch_size*(j+1))]        # pictures belonging to mini-batch j
            y_batch=y_train[(batch_size*j):(batch_size*(j+1))]        # correct answers for the same mini-batch
            sess.run(training_op,feed_dict={X: X_batch, y: y_batch})  # this is a training session and how to train it (training_op is defined above)
        if i % 10==0:
            print("Epoch:",i,"Loss:",loss.eval(feed_dict={X: X_batch, y: y_batch}),"accuracy:",accuracy.eval(feed_dict={X: X_batch, y: y_batch}))  # notice the eval()!!
        
    print("Final Loss:",loss.eval(feed_dict={X: X_batch, y: y_batch}),"Final Accuracy",accuracy.eval(feed_dict={X: X_batch, y: y_batch}))                 # Result on the last mini-batch
    save_path = saver.save(sess, save_file)     # Save stuff for later. It returns a path ... 
    print("Model data saved in %s" % save_path)      # ... and we can well print it!
    
print ("Calculation Completed")         # Just to make sure we reached the end  of the program    


We imported numpy for doing some math, then we added two lines in the section "Mini Batches". Finally, we added three lines in the for loop: an inner loop over the mini-batches and the two lines that define the elements in each batch. Those elements are then passed to every function that performs a calculation, including training_op, accuracy and loss. I have also changed the save/restore file, but that is irrelevant.

*Advanced Note*: What is the best mini-batch size? As usual, there is no definitive answer to that question. It also depends on the particular hardware you are using. Here, I picked a number that allows the calculation to complete in a reasonable amount of time. However, 5000 is way too big. In real cases you want to use something smaller, such as 500 or, even better, 50. I would not go much smaller than 50.
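
One detail of the script above is worth spelling out: because n_batches is computed with a floor, any pictures left over after the last full mini-batch are never used for training. The sketch below shows one possible way to compute index ranges that also cover the leftover rows. The helper name batch_ranges and the toy numbers are hypothetical, chosen only for illustration.

```python
def batch_ranges(n_samples, batch_size):
    """Return (start, stop) index pairs that cover all n_samples rows."""
    ranges = []
    for start in range(0, n_samples, batch_size):
        stop = min(start + batch_size, n_samples)  # the last batch may be shorter
        ranges.append((start, stop))
    return ranges

# Toy numbers (hypothetical): 12 pictures, 5 per mini-batch
n_samples, batch_size = 12, 5
print(n_samples // batch_size)              # floor-based count, as in the script -> 2
print(batch_ranges(n_samples, batch_size))  # -> [(0, 5), (5, 10), (10, 12)]
```

Each pair could then be used as X_train[start:stop] and y_train[start:stop] inside the inner loop. Whether the shorter last batch is worth keeping is a judgment call; dropping it, as the script does, is also common.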

**Know Your Data**

So far, our approach has been: collect the data, feed them to a NN and hope for the best. While this is a legitimate approach, sometimes you can do better. As we discussed above, what NNs do is figure out patterns. For instance, if each digit is also written with a different color, your NN would figure it out. But the question is: is that relevant? You could probably have guessed that just by looking at the digits. Did you really need to spend thousands of dollars in computer time to see that? So maybe just print the digits with the same color and ask the NN to find some other pattern, for instance, to identify who the writer is. Bottom line: do not waste time finding "obvious" patterns.

How does this apply to our problem? Our digits are handwritten. Somebody received a piece of paper and was asked to write a digit on it. People are not going to write in the very same spot or at the very same angle. That means that each digit will be slightly shifted or rotated with respect to the center of the piece of paper. As this is an obvious feature, we can just feed rotated and translated digits so that the NN learns to recognize them. To rotate, I will use scikit-image.


In [None]:
from skimage.transform import rotate

train_set=pd.read_csv("../input/train.csv")  # read the CSV file

##### Prepare data with Pandas ######
X_train=train_set.drop(labels=["label"],axis=1) # all data except the columns called "label". 
X_train=X_train.values                          # only the numbers, no column labels 
y_train=train_set["label"]                      # Keep only column "label"
y_train=y_train.values                          # only the numbers, no column labels 

##### Augment data with skimage ######

# First we create two zero-filled matrices with the same shape as X_train that will contain the rotated images
X_rot20=np.zeros((X_train.shape[0],X_train.shape[1]))     
X_rotm20=np.zeros((X_train.shape[0],X_train.shape[1]))    
# then we rotate +20 and -20 degrees
for i in range (X_train.shape[0]):
    X_rot20[i]=rotate(X_train[i].reshape(28,28),20,preserve_range=True).reshape(1,784)
    X_rotm20[i]=rotate(X_train[i].reshape(28,28),-20,preserve_range=True).reshape(1,784)

##### New Data #####
X_train=np.concatenate((X_train,X_rot20,X_rotm20))    # Create a new array with all the data
y_train=np.concatenate((y_train,y_train,y_train))   # Even if we rotate an image, its answer does not change!


We needed to do some acrobatics with the shape of the matrix. That is because a picture is 28x28, but it is stored as a one-line vector of 784 numbers. At the end we overwrite the "old" X_train and y_train, so that the rest of the code is unchanged. In addition, we also performed some acrobatics with the color coding. The "preserve_range" option translates into "keep the same colors". More technically, our images have colors coded as integer numbers in the range 0 to 255. Without the "preserve_range" option, the function rotate converts the values into its own internal scale that uses floats. This is a problem because the rotated pictures would then be on a different scale than the originals, and our NN is supposed to identify the digits, not to tell different color codings apart!
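
To make the shape acrobatics concrete, here is a small sketch of how a one-line vector maps to a 28x28 picture and back. It uses a made-up vector (the numbers 0 to 783) as a stand-in for one row of X_train, so the variable names below are illustrative only.

```python
import numpy as np

flat = np.arange(784)                 # stand-in for one row of X_train
picture = flat.reshape(28, 28)        # 2-D view: picture[row, column]

row, column = 3, 5
# NumPy fills the picture row by row, so pixel (row, column)
# is element row*28 + column of the one-line vector
print(picture[row, column] == flat[row * 28 + column])   # -> True

back = picture.reshape(784)           # flatten again before storing
print(np.array_equal(back, flat))     # -> True
```

This is why the rotation has to happen on the 28x28 form: only there do neighbouring pixels sit next to each other in both directions.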

**Randomness**

Here is another aspect of "knowing your data". When you insert data, make sure you are not adding patterns. For instance, if your training set starts with a long list of 0s, followed by 1s, etc., the NN will learn this pattern. Once more, think about a student. If a student knows that the first answer is always 0, they will give that answer without even looking at the picture. The same is true for a NN. You do not want a NN that gets 100% because it has memorized (instead of learning!) the position of each digit. Thus, it is better to shuffle the data every time. This is an easy fix. After you have concatenated the data, add:

In [None]:
# Shuffle the indexes
shuffled_positions=np.random.permutation(X_train.shape[0])
X_train=X_train[shuffled_positions]
y_train=y_train[shuffled_positions]
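
The important point in the lines above is that one single permutation is applied to both X_train and y_train, so every picture keeps its own answer. The toy check below illustrates this; the arrays X_toy and y_toy are made up for the example, and the fixed seed is only there to make the run repeatable.

```python
import numpy as np

rng = np.random.RandomState(0)          # fixed seed so the run is repeatable

# Toy data (hypothetical): row i is filled with the value of its own label
y_toy = np.array([0, 1, 2, 3, 4])
X_toy = np.repeat(y_toy.reshape(-1, 1), 3, axis=1)   # shape (5, 3)

positions = rng.permutation(X_toy.shape[0])   # one permutation ...
X_shuffled = X_toy[positions]                 # ... applied to the pictures
y_shuffled = y_toy[positions]                 # ... and to the answers

# Every row still sits next to its own label
print(np.all(X_shuffled[:, 0] == y_shuffled))   # -> True
```

Shuffling X_train and y_train with two separate permutations would break this pairing and make the labels meaningless.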


Finally, the ultimate code. I have also added a few lines and functions to shift the picture up/down/left/right by one pixel. As usual: 5000 for the size of the mini-batch is huge. In real applications, decrease it to 500 or 50!

In [None]:
import numpy as np
import pandas as pd
import tensorflow as tf
from sklearn.preprocessing import StandardScaler
import  glob
from skimage.transform import rotate
#### Custom Functions to  move picture 1 pixel in every direction#####
def MoveUp(OneLinePic):
    tmp=OneLinePic.reshape(28,28)
    newPic=np.zeros((28,28))
    for i in range(27):
        newPic[i]=tmp[i+1]
    return newPic.reshape(1,784)

def MoveDown(OneLinePic):
    tmp=OneLinePic.reshape(28,28)
    newPic=np.zeros((28,28))
    for i in range(27):
        newPic[i+1]=tmp[i]
    return newPic.reshape(1,784)

def MoveLeft(OneLinePic):
    tmp=OneLinePic.reshape(28,28)
    newPic=np.zeros((28,28))
    for i in range(27):
        newPic[:,i]=tmp[:,i+1]
    return newPic.reshape(1,784)

def MoveRight(OneLinePic):
    tmp=OneLinePic.reshape(28,28)
    newPic=np.zeros((28,28))
    for i in range(27):
        newPic[:,i+1]=tmp[:,i]
    return newPic.reshape(1,784)

#### Here starts the code ####
tf.reset_default_graph()                     # to avoid error message in the notebook. Probably you will never use it in a real script
train_set=pd.read_csv("../input/train.csv")  # read the CSV file

##### Prepare data with Pandas ######
X_train=train_set.drop(labels=["label"],axis=1) # all data except the columns called "label". 
X_train=X_train.values                          # only the numbers, no column labels 
y_train=train_set["label"]                      # Keep only column "label"
y_train=y_train.values                          # only the numbers, no column labels 

##### Augment data with skimage ######

# First we create two zero-filled matrices with the same shape as X_train that will contain the rotated images
X_rot20=np.zeros((X_train.shape[0],X_train.shape[1]))     
X_rotm20=np.zeros((X_train.shape[0],X_train.shape[1]))    
# then we rotate +20 and -20 degrees
for i in range (X_train.shape[0]):
    X_rot20[i]=rotate(X_train[i].reshape(28,28),20,preserve_range=True).reshape(1,784)
    X_rotm20[i]=rotate(X_train[i].reshape(28,28),-20,preserve_range=True).reshape(1,784)
    
##### Augment data with Move Functions ######
X_up=np.zeros((X_train.shape[0],X_train.shape[1]))     
X_down=np.zeros((X_train.shape[0],X_train.shape[1]))    
X_left=np.zeros((X_train.shape[0],X_train.shape[1]))     
X_right=np.zeros((X_train.shape[0],X_train.shape[1]))    
for i in range (X_train.shape[0]):
    X_up[i]=MoveUp(X_train[i])
    X_down[i]=MoveDown(X_train[i])
    X_left[i]=MoveLeft(X_train[i])
    X_right[i]=MoveRight(X_train[i])

##### New Data #####
X_train=np.concatenate((X_train,X_rot20,X_rotm20,X_up,X_down,X_left,X_right))    # Create a new array with all the data
y_train=np.concatenate((y_train,y_train,y_train,y_train,y_train,y_train,y_train))   # Even if we rotate or shift an image, its answer does not change!

# Shuffle the indexes
shuffled_positions=np.random.permutation(X_train.shape[0])
X_train=X_train[shuffled_positions]
y_train=y_train[shuffled_positions]

#### Normalize Data with scikit ####
scaler=StandardScaler()
scaler.fit(X_train)
X_train=scaler.transform(X_train)

###### TensorFlow starts here  ######
# Define the variables
X=tf.placeholder(shape=(None,28*28),dtype=tf.float32,name="X")  # training set 
y=tf.placeholder(shape=(None),dtype=tf.int64,name="y")    # Correct output: Not prediction, but correct answer!

# Build the Neuronal Network
with tf.name_scope("NeuronalNetwork"): 
    input_layer=tf.layers.dense(X,300, name="InputLayer",activation=tf.nn.relu)
    middle_layer=tf.layers.dense(input_layer,100, name="MiddleLayer",activation=tf.nn.relu)
    output_layer=tf.layers.dense(middle_layer,10, name="OutputLayer")

##### Grading ######
# For human beings : Accuracy
with tf.name_scope("Accuracy"):
    Guess=tf.nn.in_top_k(output_layer,y,1)    # Pick the digit with highest probability
    accuracy=tf.reduce_mean(tf.cast(Guess, tf.float32))  # Count how many you got right and give %

# For the neuronal network: Loss Function
with tf.name_scope("LossFunction"):
    EntropyProbability=tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y,logits=output_layer) # compute the cross entropy
    loss=tf.reduce_mean(EntropyProbability,name="Loss") # compute average cross entropy

with tf.name_scope("Revision"):
    optimizer=tf.train.GradientDescentOptimizer(0.01)
    training_op=optimizer.minimize(loss)

##### Lets run the neuronal network!  #####

init=tf.global_variables_initializer()   # Allocate and initialize all the variables
saver=tf.train.Saver()                   # Store the values of the w's (and other stuff)

n_epochs=50                              # how many "training sessions"
save_file="./modelAugmented.ckpt"                 # checkpoint. Better have one.

#### Mini Batches  #####
batch_size=5000                           # how many pictures per mini-batch. This number is huge! Use 50 instead
n_batches=int(np.floor(X_train.shape[0]/batch_size))   #how many minibatches

with tf.Session() as sess:    
    if glob.glob(save_file + "*"):        # If we have run this before, just continue 
        print("Parameters Found. Continuing")
        saver.restore(sess,save_file)    
    else:
        print("New Optimization Started")
        init.run()                       # If it is the first time, initialize.
        
    for i in range(n_epochs):
        for j in range(n_batches):
            X_batch=X_train[(batch_size*j):(batch_size*(j+1))]        # pictures of the j-th mini-batch
            y_batch=y_train[(batch_size*j):(batch_size*(j+1))]        # matching answers of the j-th mini-batch
            sess.run(training_op,feed_dict={X: X_batch, y: y_batch})  # this is a training session and how to train it (training_op is defined above)
        if i % 10==0:
            print("Epoch:",i,"Loss:",loss.eval(feed_dict={X: X_batch, y: y_batch}),"accuracy:",accuracy.eval(feed_dict={X: X_batch, y: y_batch}))  # notice the eval()!!
        
    print("Final Loss:",loss.eval(feed_dict={X: X_batch, y: y_batch}),"Final Accuracy",accuracy.eval(feed_dict={X: X_batch, y: y_batch}))                 # Result on the last mini-batch
    save_path = saver.save(sess, save_file)     # Save stuff for later. It returns a path ... 
    print("Model data saved in %s" % save_path)      # ... and we can well print it!
    
print ("Calculation Completed")         # Just to make sure we reached the end  of the program    
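
The Move functions in the script above shift one picture at a time, row by row. As a possible alternative, the same shift can be written with array slicing so that the whole set is processed in one step. The sketch below does this for the "up" direction only and checks it against a loop that follows the logic of MoveUp. The function names shift_up_loop and shift_up_all and the random toy pictures are hypothetical.

```python
import numpy as np

def shift_up_loop(one_line_pic):
    """Same logic as MoveUp in the script: one picture, row by row."""
    tmp = one_line_pic.reshape(28, 28)
    new_pic = np.zeros((28, 28))
    for i in range(27):
        new_pic[i] = tmp[i + 1]
    return new_pic.reshape(784)

def shift_up_all(flat_pics):
    """Shift every picture in an (n, 784) array up by one pixel at once."""
    pics = flat_pics.reshape(-1, 28, 28)
    shifted = np.zeros(pics.shape)
    shifted[:, :-1, :] = pics[:, 1:, :]    # rows 1..27 move to rows 0..26
    return shifted.reshape(-1, 784)        # the bottom row stays zero

# Toy data (hypothetical): 4 random "pictures"
rng = np.random.RandomState(0)
X_toy = rng.randint(0, 256, size=(4, 784))

looped = np.array([shift_up_loop(p) for p in X_toy])
print(np.array_equal(looped, shift_up_all(X_toy)))   # -> True
```

The other three directions follow the same idea with the slices moved to the other side or to the column axis.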


**TensorBoard**

TensorFlow offers a package called TensorBoard that you can use to plot the progress of your training as well as show an interactive structure of your graph. Currently, you must run this package on your own computer to see the results, but altering the code to add this feature is very useful. 
One problem with NNs is that they are often composed of thousands of neurons and hundreds of layers. Showing all of those on a standard PC screen is generally not recommended. Besides, do you really need to see every single neuron? To limit the clutter on the screen, the "with tf.name_scope" lines are added. In practice, everything that belongs to one of these scopes is represented as one block. Each of these blocks is interactive: if you click on it, it expands to show what lies beneath. You may think this is not very useful to you, but you may as well use it anyway. I mean, it is like writing comments in the code, something that you **NEVER** forget to do.
The second feature of TensorBoard is that it can plot how your variables change during the training. In the example below, I will plot the loss function. I will also use the basic script written earlier as it is easier to handle. Here is the code:


In [None]:
import pandas as pd
import tensorflow as tf
from sklearn.preprocessing import StandardScaler
import glob

tf.reset_default_graph()                     # to avoid error message in the notebook. Probably you will never use it in a real script
train_set=pd.read_csv("../input/train.csv")  # read the CSV file

##### Prepare data with Pandas ######
X_train=train_set.drop(labels=["label"],axis=1) # all data except the columns called "label". 
X_train=X_train.values                          # only the numbers, no column labels 
y_train=train_set["label"]                      # Keep only column "label"
y_train=y_train.values                          # only the numbers, no column labels 

#### Normalize Data with scikit ####
scaler=StandardScaler()
scaler.fit(X_train)
X_train=scaler.transform(X_train)

###### TensorFlow starts here  ######
# Define the variables
X=tf.constant(X_train,dtype=tf.float32,name="X")  #input ..... Why float?!
y=tf.constant(y_train,dtype=tf.int64,name="y")    # Correct output: Not prediction, but correct answer!

# Build the Neuronal Network
with tf.name_scope("NeuronalNetwork"): 
    input_layer=tf.layers.dense(X,300, name="InputLayer",activation=tf.nn.relu)
    middle_layer=tf.layers.dense(input_layer,100, name="MiddleLayer",activation=tf.nn.relu)
    output_layer=tf.layers.dense(middle_layer,10, name="OutputLayer")

##### Grading ######
# For human beings : Accuracy
with tf.name_scope("Accuracy"):
    Guess=tf.nn.in_top_k(output_layer,y,1)    # Pick the digit with highest probability
    accuracy=tf.reduce_mean(tf.cast(Guess, tf.float32))  # Count how many you got right and give %

# For the neuronal network: Loss Function
with tf.name_scope("LossFunction"):
    EntropyProbability=tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y,logits=output_layer) # compute the cross entropy
    loss=tf.reduce_mean(EntropyProbability,name="Loss") # compute average cross entropy

with tf.name_scope("Revision"):
    optimizer=tf.train.GradientDescentOptimizer(0.01)
    training_op=optimizer.minimize(loss)

#### Enable TensorBoard #####
summary_data=tf.summary.scalar('Loss',loss)       #Data we want to track and plot with TensorBoard
file_writer=tf.summary.FileWriter("TensorLogs",tf.get_default_graph())

##### Lets run the neuronal network!  #####

init=tf.global_variables_initializer()   # Allocate and initialize all the variables
saver=tf.train.Saver()                   # Store the values of the w's (and other stuff)

n_epochs=50                              # how many "training sessions"
save_file="./modelNormalized.ckpt"                  # checkpoint. Better have one.
with tf.Session() as sess:    
    if glob.glob(save_file + "*"):        # If we have run this before, just continue 
        print("Parameters Found. Continuing")
        saver.restore(sess,save_file)    
    else:
        print("New Optimization Started")
        init.run()                       # If it is the first time, initialize.
        
    for i in range(n_epochs):
        sess.run(training_op)   # this is a training session and how to train it (training_op is defined above)
        
        if i % 10==0:
            print("Epoch:",i,"Loss:",loss.eval(),"accuracy:",accuracy.eval())  # notice the eval()!!
            file_writer.add_summary(summary_data.eval(),i)     #Plot the data into the logdir
    print("Final Loss:",loss.eval(),"Final Accuracy",accuracy.eval())                 # Best result
    save_path = saver.save(sess, save_file)     # Save stuff for later. It returns a path ... 
    print("Model data saved in %s" % save_path)      # ... and we can well print it!
    
print ("Calculation Completed")         # Just to make sure we reached the end  of the program    
file_writer.close()

Very few changes are needed. First, we added the new section for TensorBoard. We state that we want to plot the loss function and we create the object file_writer, which will store all the stuff that TensorBoard needs. We tell the object to write the data into the folder TensorLogs (if it does not exist, it will be created). **Be careful**: Make sure to empty this folder every time, otherwise TensorBoard mixes the data of old and new runs and gets confused.
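
An alternative to emptying the folder by hand is to give every run its own sub-folder, since TensorBoard shows each sub-folder of the log directory as a separate run. The sketch below builds such a folder name from the current time. The helper name new_log_dir and the "run-" prefix are illustrative choices, not part of TensorFlow.

```python
import os
from datetime import datetime

def new_log_dir(root="TensorLogs"):
    """Return a fresh sub-folder path under root, named after the current time."""
    stamp = datetime.now().strftime("%Y%m%d-%H%M%S")   # e.g. year-month-day, then time
    return os.path.join(root, "run-" + stamp)

log_dir = new_log_dir()
print(log_dir)
```

The returned path could then be passed to tf.summary.FileWriter in place of "TensorLogs", while the folder given to the tensorboard command stays the same.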

Then, in the execution phase, we use add_summary to update our data. Because the call sits inside the "if i % 10==0" block, the value of the loss is stored only every 10 epochs. Storing data at every epoch is a bad idea because it slows down the code. 

Finally, at the very last line, we close the file writer, which flushes the remaining data to disk.

That's it.

To visualize your plot, run in the command shell:

tensorboard --logdir TensorLogs

TensorBoard will answer with an HTTP address that you can copy and paste into your favourite browser. You will then be able to see and interact with your graph.