# Introduction

In this workshop, I will show you how to build an Artificial Neural Network (ANN) and how to tune its hyperparameters. I will use the Digit Recognizer data set, which is very popular among "Kagglers" who are interested in neural networks. The data set consists of images of handwritten digits from 0 to 9. Each image is 28x28 pixels, which means each image has 784 features (pixels). The data set is split into a train set and a test set. The train set has dimensions (42000x784), meaning it contains 42000 different images, while the test set has dimensions (28000x784).
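
As a minimal sketch of the pixel layout described above, the snippet below reshapes one flat row of 784 values into a 28x28 grid. The `row` array is synthetic stand-in data (an assumption for illustration), not a row taken from the data set.

```python
import numpy as np

# Hypothetical stand-in for one row of the data set: 784 pixel values in [0, 255].
row = np.arange(784) % 256

# Reshape the flat feature vector into the 28x28 image it represents.
image = row.reshape(28, 28)

print(image.shape)  # (28, 28)
# With row-major order, the pixel at row 1, column 0 is flat index 28.
print(image[1, 0] == row[28])
```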

**Metric:** I used accuracy as the metric to evaluate model performance.
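
For reference, accuracy is the fraction of predictions that match the true labels. The sketch below computes it on made-up labels; the helper name `accuracy` is only illustrative and is not part of the notebook's code.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

# Made-up labels for illustration: 4 of the 5 predictions are correct.
print(accuracy([3, 1, 4, 1, 5], [3, 1, 4, 7, 5]))  # 0.8
```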

**Train set splitting:** I split the train set into 67% train and 33% dev sets.

<font color = 'blue'>
 Content:
   
   1. [Data Loading and Pre-processing](#1)
   
   2. [Building Model and Optimize Hyperparameters](#2)

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

#pd.set_option("display.max_rows", 999)
#pd.set_option("display.max_columns", 999)
#pd.reset_option("display.max_rows")
#pd.reset_option("display.max_columns")

import matplotlib.pyplot as plt

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

<a id = '1'></a><br>
## 1. Data Loading and Pre-processing

In [None]:
# Data loading
train = pd.read_csv("/kaggle/input/digit-recognizer/train.csv")
test = pd.read_csv("/kaggle/input/digit-recognizer/test.csv")

In [None]:
# Let's look at the first 5 rows of the train set
train.head()

In [None]:
# Let's look at the first 5 rows of the test set
test.head()

In [None]:
# Target variable
y = train.pop("label")


In [None]:
# Let's look at the first 5 rows of the target variable
y.head()

In [None]:
# Unique values and their frequencies in the target variable
y.value_counts()

In [None]:
# The train set has 784 features (pixels) and 42000 images; the test set has 784 features and 28000 images.
train.shape,y.shape,test.shape

In [None]:
# Quick summary of the test set
test.info()

In [None]:
# Quick summary of the train set
train.info()

In [None]:
# Type of the values in y variable
y.dtype

**Note: Keras works with float32 data by default, and normalizing the pixels by dividing by 255 typically helps the ANN train faster and more stably.**

In [None]:
# Cast the input values to float32 and scale them to the interval [0, 1]

train = train.astype('float32')/255
test = test.astype('float32')/255
y = y.astype('float32')



In [None]:
# Converting pandas DataFrames to NumPy arrays
"""
Keras models accept three types of inputs:

NumPy arrays, just like Scikit-Learn and many other Python-based libraries. This is a good option if your data fits in memory.

TensorFlow Dataset objects. This is a high-performance option that is more suitable for datasets that do not fit in memory and that are streamed from disk or from a distributed filesystem.

Python generators that yield batches of data (such as custom subclasses of the keras.utils.Sequence class).
"""
train = train.to_numpy()
test = test.to_numpy()

In [None]:
# Splitting the training set into train and dev sets (X_test and y_test hold the dev set)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(train, y, test_size=0.33, random_state=42)

In [None]:
# The y variable has 10 different classes, so we represent each value in y as a one-hot vector.
# For example, this converts 1 to the vector [0., 1., 0., 0., 0., 0., 0., 0., 0., 0.].
"""
# Alternative: one-hot encoding of y with Keras
from keras.utils import to_categorical
y = to_categorical(y, num_classes=10)
"""
# tf.one_hot expects integer indices, so cast the float32 labels back to integers first
y_onehot = tf.one_hot(y.astype('int32'), depth=10)
y_onehot_train = tf.one_hot(y_train.astype('int32'), depth=10)
y_onehot_test = tf.one_hot(y_test.astype('int32'), depth=10)
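
To illustrate what the one-hot encoding above produces, here is a minimal NumPy sketch of the same idea. The `one_hot` helper is hypothetical and written only for illustration; the notebook itself relies on `tf.one_hot`.

```python
import numpy as np

def one_hot(labels, depth):
    """Return a (len(labels), depth) float32 array with a 1 at each label's index."""
    labels = np.asarray(labels, dtype=np.int64)
    encoded = np.zeros((labels.size, depth), dtype=np.float32)
    # Set the column matching each label to 1 in the corresponding row.
    encoded[np.arange(labels.size), labels] = 1.0
    return encoded

# The label 1 becomes a vector with a single 1 at index 1.
print(one_hot([1, 9], depth=10))
```

Each row contains exactly one 1, so the position of that 1 recovers the original label.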

<a id = '2'></a><br>
##  2. Building Model and Optimize Hyperparameters

In [None]:
# Importing libraries
from tensorflow import keras
from tensorflow.keras.models import Sequential  # creates a sequential model
from tensorflow.keras.layers import (           # layer types used to build the network
    Dense,
    Dropout,
    Flatten
)

from keras_tuner import HyperModel  # helps to tune hyperparameters (older releases were imported as `kerastuner`)

In [None]:
# Hyperparameter Tuning
# https://keras-team.github.io/keras-tuner/documentation/hyperparameters/
# For Beta and Epsilon:  https://www.tensorflow.org/api_docs/python/tf/keras/optimizers/Adam

"""
Hyperparameter tuning is at the heart of an ANN model and directly affects its performance. We can tune hyperparameters such as:

Learning rate: It determines how quickly the model learns and should be selected carefully. If it is too small, training is very slow because
each update moves the weights only slightly toward the minimum of the loss function. If it is too large, the updates can overshoot and the loss may never settle at a minimum.
Therefore I chose the candidate values 1e-2, 1e-3 and 1e-4.

The number of nodes: Nodes are the units that make up each layer of the ANN. We need to optimize their number; common choices are 32, 64, 128, 256 and 512.

The number of layers: Like the number of nodes, it determines the complexity of the ANN model. If you choose a very high number, the model can overfit (high variance).
If you choose a very small number like 2, it may underfit (high bias) on complex, non-linear problems. We use the Dense class to create layers.

Activation function: Without activation functions an ANN is a linear model (Z = W*X + b); activation functions make it non-linear. The most popular activation functions
for hidden layers are relu and tanh. For binary classification, use the "sigmoid" function in the output layer. For more than 2 classes, use the "softmax" function.

L2 regularization: Regularization is used to reduce the variance (overfitting) problem. One regularization technique is L2, which adds a penalty term to the loss function to punish large weights.
This pushes the weights closer to zero, which reduces the model complexity.

Dropout: It is another regularization technique. It randomly switches off some of the nodes in the chosen layers during training, using a Bernoulli distribution to decide which nodes
are switched off. It is very effective, like L2 regularization, and it is very common to use L2 and Dropout together.

Adam optimization: There are different optimization methods, such as "momentum" and "RMSProp", that speed up training and improve model performance. The "Adam" optimization technique
combines both momentum and RMSProp (Root Mean Square Propagation).

Batch size: The data is divided into small batches and the weights are updated after each batch. This speeds up training and can also improve performance. Because each batch gives
a noisy estimate of the loss, optimizers such as Adam smooth the updates with exponentially weighted averages of the gradients.
"""
class AnnHyperModel(HyperModel): 
    def __init__(self, input_shape):
        super().__init__()  # let HyperModel set up its own attributes
        self.input_shape = input_shape
        
        
    def build(self, hp):
        model = Sequential()
        model.add(
            layers.Dense(
                units=hp.Int('units', 8, 64, 4, default=8), 
                activation=hp.Choice(
                    'dense_activation', 
                    values=['relu', 'tanh', 'elu'],
                    default='relu'), 
                activity_regularizer=tf.keras.regularizers.l2(0.001),
                input_shape=self.input_shape) )  # use the shape passed to __init__, not a global
        
        model.add( 
            layers.Dense(
            units=hp.Int('units', 8, 64, 4, default=16), 
            activation=hp.Choice(
                'dense_activation', 
                values=['relu', 'tanh', 'elu'], 
                default='relu')))

        model.add( 
            layers.Dropout(
                hp.Float( 
                    'dropout',
                    min_value=0.0, 
                    max_value=0.1, 
                    default=0.005, 
                    step=0.01)))
        
    
        model.add(layers.Dense(10,activation = "softmax"))
        
        model.compile(
            optimizer=keras.optimizers.Adam(
            hp.Choice('learning_rate',
                      values=[1e-2, 1e-3, 1e-4]),
                      beta_1=0.9,
                      beta_2=0.999,
                      epsilon=1e-07)
            ,loss='mse',
            metrics=['accuracy'] )

        return model
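
The effect of the learning rate described in the notes above can be sketched on a toy problem. The snippet below runs plain gradient descent on f(w) = w**2, which is an assumption chosen for illustration and not the network's loss; the value 1.1 is deliberately too large to show divergence and is not one of the candidates searched in this notebook.

```python
def gradient_descent(lr, steps=50, w=1.0):
    """Minimize f(w) = w**2 (gradient 2*w) with a fixed learning rate."""
    for _ in range(steps):
        w = w - lr * 2 * w  # step against the gradient
    return w

# Distance from the minimum at w = 0 after 50 steps for each learning rate.
for lr in (1e-4, 1e-2, 1.1):
    print(lr, abs(gradient_descent(lr)))
```

With the smallest rate the weight barely moves, with the moderate rate it gets noticeably closer to the minimum, and with the oversized rate each step overshoots and the weight moves further away.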

In [None]:
# Create the object from the class
input_shape = (X_train.shape[1],)
hypermodel = AnnHyperModel(input_shape)

In [None]:
from keras_tuner.tuners import RandomSearch  # older releases were imported as `kerastuner`

In [None]:
# RandomSearch is here to do the hyperparameter search. 
# For more tuners: https://keras-team.github.io/keras-tuner/documentation/tuners/
tuner_rs = RandomSearch( hypermodel,
                        objective='val_accuracy', 
                        seed=42, 
                        max_trials=12)
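
Conceptually, a random search draws hyperparameter combinations at random from the search space and trains one model per draw. The sketch below shows only that sampling idea in plain Python; `SEARCH_SPACE` and `sample_trials` are hypothetical names, the dropout range is left out for brevity, and unlike a real tuner this simplified version does not avoid repeated combinations.

```python
import random

# Hypothetical search space mirroring ranges used in this notebook.
SEARCH_SPACE = {
    "units": list(range(8, 65, 4)),           # 8, 12, ..., 64
    "dense_activation": ["relu", "tanh", "elu"],
    "learning_rate": [1e-2, 1e-3, 1e-4],
}

def sample_trials(space, max_trials, seed):
    """Draw max_trials random hyperparameter combinations from the space."""
    rng = random.Random(seed)  # a fixed seed makes the draws reproducible
    return [
        {name: rng.choice(values) for name, values in space.items()}
        for _ in range(max_trials)
    ]

trials = sample_trials(SEARCH_SPACE, max_trials=12, seed=42)
for trial in trials[:3]:
    print(trial)
```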

In [None]:
# You can print a summary of the search space
tuner_rs.search_space_summary()

In [None]:
# Run the search to find the best model
tuner_rs.search(X_train, y_onehot_train, epochs=10, validation_data=(X_test, y_onehot_test), verbose=2)

In [None]:
# Choosing the best model among the trials
best_model = tuner_rs.get_best_models(num_models=1)[0] 
loss, acc = best_model.evaluate(X_test,y_onehot_test)  # evaluate returns the loss and the accuracy metric

In [None]:
# Shows layers of the model
best_model.layers

In [None]:
# Shows weights of the model (w,b)
best_model.weights

In [None]:
# Prints a summary of the model's contents.
# The "Param #" column is the total number of trainable parameters (weights and biases) in each layer.
# For example, a first layer with 2 nodes and 784 inputs would have 784 weights per node plus 2 biases: 2 * 784 + 2 = 1570 parameters.
best_model.summary()
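
The parameter count reported by the summary can be checked by hand. The sketch below uses a hypothetical helper, `dense_params`, with layer sizes chosen only as examples; the sizes in the tuned model depend on the search result.

```python
def dense_params(n_inputs, n_units):
    """Trainable parameters of a dense layer: one weight per input per unit, plus one bias per unit."""
    return n_inputs * n_units + n_units

print(dense_params(784, 2))   # 1570
print(dense_params(784, 64))  # 50240
print(dense_params(64, 10))   # 650
```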


In [None]:
# Model fitting
# steps per epoch = ceil(len(X_train) / batch_size)
best_model.fit(X_train, y_onehot_train,
          batch_size=100, epochs=10)   # epochs = number of full passes over the training data
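
As a small sketch of how the batch size relates to the number of weight updates per epoch, the helper below rounds up so that a final partial batch is still counted. The row count of 28140 is an assumed figure (roughly 67% of 42000) used only for illustration.

```python
import math

def steps_per_epoch(n_samples, batch_size):
    """Number of batches needed to see every sample once."""
    return math.ceil(n_samples / batch_size)

print(steps_per_epoch(28140, 100))  # 282
print(steps_per_epoch(1000, 100))   # 10
```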

In [None]:
# Model evaluation
test_loss, test_acc = best_model.evaluate(X_test, y_onehot_test)
print("Dev set accuracy: ", test_acc)
print("Dev set loss: ", test_loss)

## Confusion Matrix

"In the field of machine learning and specifically the problem of statistical classification, a confusion matrix, also known as an error matrix, is a specific table layout that allows visualization of the performance of an algorithm, typically a supervised learning one."

In [None]:
# Plot confusion matrix 
# Note: This code snippet for confusion-matrix is taken directly from the SKLEARN website.
def plot_confusion_matrix(cm, classes,
                          normalize=False,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=30)
    plt.yticks(tick_marks, classes)

    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]

    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, cm[i, j],
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.tight_layout()
    plt.ylabel('Actual class')
    plt.xlabel('Predicted class')

In [None]:
from collections import Counter
from sklearn.metrics import confusion_matrix
import itertools

# Predict class probabilities for the dev set
Y_pred = best_model.predict(X_test)
# Convert the predicted probability vectors to class indices
Y_pred_classes = np.argmax(Y_pred, axis = 1) 
# Convert the one-hot dev set labels to class indices
Y_true = np.argmax(y_onehot_test, axis = 1) 
# compute the confusion matrix
confusion_mtx = confusion_matrix(Y_true, Y_pred_classes) 
# plot the confusion matrix
plot_confusion_matrix(confusion_mtx, classes = range(10))
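
To make the layout of the matrix concrete, here is a minimal sketch that counts the entries by hand on made-up labels with three classes. The `confusion_counts` helper is hypothetical and only mirrors the convention used in the plot (rows are actual classes, columns are predicted classes); the notebook itself uses `confusion_matrix` from scikit-learn.

```python
import numpy as np

def confusion_counts(y_true, y_pred, n_classes):
    """Count (actual, predicted) pairs; rows are actual classes, columns are predicted classes."""
    cm = np.zeros((n_classes, n_classes), dtype=np.int64)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm

# Made-up labels for illustration.
y_true_demo = [0, 0, 1, 1, 2, 2]
y_pred_demo = [0, 1, 1, 1, 2, 0]
cm_demo = confusion_counts(y_true_demo, y_pred_demo, n_classes=3)
print(cm_demo)

# Diagonal entries are correct predictions, so accuracy is trace / total.
print(np.trace(cm_demo) / cm_demo.sum())
```

Off-diagonal entries show which classes are confused with each other, which a single accuracy number cannot reveal.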

## Prediction on the Dev Set

In [None]:
# The predict() method returns an array with one vector of class probabilities per dataset item.
predictions = best_model.predict(X_test)

In [None]:
# NumPy's argmax function returns the index of the highest value in the vector, which is the class with the highest predicted probability.
np.argmax(predictions[9])

In [None]:
# We can use sum to check that the values in a vector add up to one, because they are probability values.
np.sum(predictions[11])
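
The outputs of a softmax layer sum to one because softmax divides each exponentiated score by the total. The NumPy sketch below shows this on made-up scores; the `softmax` helper is written here for illustration and is not the Keras implementation.

```python
import numpy as np

def softmax(z):
    """Turn a vector of scores into probabilities that sum to one."""
    z = np.asarray(z, dtype=np.float64)
    shifted = z - z.max()  # subtracting the max avoids overflow without changing the result
    exp = np.exp(shifted)
    return exp / exp.sum()

scores = [2.0, 1.0, 0.1]  # made-up scores for three classes
probs = softmax(scores)
print(probs)
print(probs.sum())
print(np.argmax(probs))  # index of the most likely class
```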

## Prediction on the Test Set

In [None]:
# The predict() method returns an array with one vector of class probabilities per dataset item.
test_result = best_model.predict(test)

## Write results to csv 

In [None]:
# Saving the results to a csv file

# Convert each probability vector to a digit
results = np.argmax(test_result,axis = 1) # argmax gives the index of the highest probability in each prediction vector, which is the predicted digit, such as 2 or 3

results = pd.Series(results,name="Label")


submission = pd.concat([pd.Series(range(1,28001),name = "ImageId"),results],axis = 1)

submission.to_csv("test_submission.csv",index=False)

## I AM LOOKING FORWARD TO YOUR COMMENTS AND UPVOTES 

## If you have any questions, please do not hesitate to ask.

