# Model Architecture

The important steps involved are:

                           0. __init__ function
                           1. Parameter_Initialization
                           2. Forward_Prop
                           3. Cost_dAL_and_dZL
                           4. Back_Prop
                           5. Parameter_Update
                           6. fit
                           7. predict
                           8. grad_checking
                    
Some helper functions used in some of these steps are:

    1. Activation functions: -
                           1) sigmoid 
                           2) relu
                           3) leaky_relu
                           4) tanh (used directly in place using np.tanh() function)
                           5) softmax
                           6) linear
                           
    2. Activation gradient function
    
More helper functions, which are not part of the learning model per se but can be used with it:

    1. params_to_vector       --> convert the W's and b's into column vectors and concatenate them.
    2. grads_to_vector        --> convert the dW's and db's into column vectors and concatenate them.
    3. vectors_to_params      --> after forming the perturbed vectors (theta + epsilon) and (theta - epsilon),
                                  convert them back into W's and b's so that Forward_Prop can compute the cost.
    4. grad_checking          --> check that Back_Prop computes the correct gradients, i.e. that no mistakes
                                  were made in Forward_Prop or Back_Prop.
    5. Random_mini_batches    --> split X and Y into randomly shuffled mini_batches.
    6. decay                  --> decrease learning_rate after a fixed interval.

## Helper Functions

### params_to_vector:-

                      Given W[l] and b[l] for each layer, convert them into column vectors
                      and concatenate them together on top of each other.
                      result --> [[W1],[b1],...[WL],[bL]]
                             --> here each W[l] is first reshaped into a column vector
                             --> shape of result = (total # parameters, 1) --> column vector
                             
### grads_to_vector:-

                      Given dW[l] and db[l] for each layer, convert them into column vectors
                      and concatenate them together on top of each other.
                      result --> [[dW1],[db1],...[dWL],[dbL]]
                             --> [[dJ/dtheta1], [dJ/dtheta2], .....]
                             --> here each dW[l] is first reshaped into a column vector
                             --> shape of result = (total # parameters, 1) --> column vector --> this is my dJ/dtheta = grad
                             
### vectors_to_params:-
                
                      Given the W[l] and b[l] stacked vertically in one column vector,
                      reshape them back into their original shapes
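The round trip between a parameter dictionary and a single column vector can be sketched as follows. This is a minimal illustration, not necessarily the class's own implementation: the dictionary keys `"W1"`, `"b1"`, ... and the function signatures are assumptions made for the example.

```python
import numpy as np

def params_to_vector(params, L):
    """Stack W1, b1, ..., WL, bL on top of each other as one column vector."""
    pieces = []
    for l in range(1, L + 1):
        pieces.append(params["W" + str(l)].reshape(-1, 1))
        pieces.append(params["b" + str(l)].reshape(-1, 1))
    return np.concatenate(pieces, axis=0)

def vectors_to_params(theta, l_dims):
    """Inverse of params_to_vector; l_dims holds the input size at index 0."""
    params, start = {}, 0
    for l in range(1, len(l_dims)):
        w_size = l_dims[l] * l_dims[l - 1]
        params["W" + str(l)] = theta[start:start + w_size].reshape(l_dims[l], l_dims[l - 1])
        start += w_size
        params["b" + str(l)] = theta[start:start + l_dims[l]].reshape(l_dims[l], 1)
        start += l_dims[l]
    return params

# Hypothetical 3 -> 4 -> 1 network
l_dims = [3, 4, 1]
rng = np.random.default_rng(0)
params = {}
for l in range(1, len(l_dims)):
    params["W" + str(l)] = rng.standard_normal((l_dims[l], l_dims[l - 1]))
    params["b" + str(l)] = rng.standard_normal((l_dims[l], 1))

theta = params_to_vector(params, L=len(l_dims) - 1)
restored = vectors_to_params(theta, l_dims)
print(theta.shape)  # (21, 1): 4*3 + 4 + 1*4 + 1 parameters
```

grads_to_vector would follow the same pattern with the keys for dW[l] and db[l].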
                      
### grad_checking:-
                    
                    theta = params_to_vector(params)
                    for each component i of theta:
                        theta_plus = copy of theta with theta_plus[i] = theta[i] + epsilon
                        theta_minus = copy of theta with theta_minus[i] = theta[i] - epsilon
                        params_plus = vectors_to_params(theta_plus)
                        params_minus = vectors_to_params(theta_minus)
                        J_plus = Forward_Prop(params_plus) followed by --> Cost_dAL_and_dZL(params_plus)
                        J_minus = Forward_Prop(params_minus) followed by --> Cost_dAL_and_dZL(params_minus)
                        grad_approx[i] = (J_plus - J_minus)/(2*epsilon)
                    grad = grads_to_vector(grads) = [[dJ/dtheta1], [dJ/dtheta2],....]
                    
                    if norm(grad_approx - grad)/(norm(grad) + norm(grad_approx)) > 2*epsilon
                        --> There is a mistake in Back_Prop()
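As a self-contained sketch of the same check, the example below applies it to a single sigmoid unit instead of the full network, so that the analytic gradient is short enough to write inline. The functions `cost`, `analytic_grad` and `grad_check` are hypothetical stand-ins for the Forward_Prop / Cost / Back_Prop pipeline.

```python
import numpy as np

def cost(theta, X, Y):
    """Binary cross-entropy of one sigmoid unit; theta = [w; b] as a column vector."""
    w, b = theta[:-1], theta[-1]
    A = 1.0 / (1.0 + np.exp(-(w.T @ X + b)))
    return float(-np.mean(Y * np.log(A) + (1 - Y) * np.log(1 - A)))

def analytic_grad(theta, X, Y):
    """Gradient of the cost above, playing the role of Back_Prop."""
    w, b = theta[:-1], theta[-1]
    m = X.shape[1]
    A = 1.0 / (1.0 + np.exp(-(w.T @ X + b)))
    dZ = A - Y                                            # shape (1, m)
    dw = X @ dZ.T / m                                     # shape (n, 1)
    db = np.sum(dZ, axis=1, keepdims=True) / m            # shape (1, 1)
    return np.vstack([dw, db])

def grad_check(theta, X, Y, epsilon=1e-7):
    """Two-sided finite differences, perturbing one component of theta at a time."""
    grad_approx = np.zeros_like(theta)
    for i in range(theta.shape[0]):
        theta_plus, theta_minus = theta.copy(), theta.copy()
        theta_plus[i, 0] += epsilon
        theta_minus[i, 0] -= epsilon
        grad_approx[i, 0] = (cost(theta_plus, X, Y) - cost(theta_minus, X, Y)) / (2 * epsilon)
    grad = analytic_grad(theta, X, Y)
    return np.linalg.norm(grad_approx - grad) / (np.linalg.norm(grad) + np.linalg.norm(grad_approx))

rng = np.random.default_rng(0)
X = rng.standard_normal((3, 5))
Y = (rng.random((1, 5)) > 0.5).astype(float)
theta = rng.standard_normal((4, 1)) * 0.1
difference = grad_check(theta, X, Y)
print(difference)  # a very small number when the analytic gradient is correct
```

Because every component needs two extra forward passes, this check is normally run only occasionally on a small network, not during regular training.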
                    
### Random_mini_batches:-

                   Given mini_batch_size, X, and Y
                      --> randomly shuffle X and Y
                      --> get how many batches can be formed with batch size = mini_batch_size
                            --> get those mini_batches
                      --> if there are still some training examples left
                            --> get those too
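The steps above can be sketched in a few lines of NumPy. The function name and the `seed` argument are assumptions for the example; examples are stored column-wise, as elsewhere in these notes.

```python
import numpy as np

def random_mini_batches(X, Y, mini_batch_size, seed=0):
    """Shuffle the columns of X and Y together, then cut them into mini_batches."""
    m = X.shape[1]
    perm = np.random.default_rng(seed).permutation(m)
    X_shuffled, Y_shuffled = X[:, perm], Y[:, perm]
    batches = []
    for start in range(0, m, mini_batch_size):
        end = start + mini_batch_size        # slicing clips at m, so the last batch may be smaller
        batches.append((X_shuffled[:, start:end], Y_shuffled[:, start:end]))
    return batches

X = np.arange(20).reshape(2, 10).astype(float)   # 10 examples, 2 features
Y = np.arange(10).reshape(1, 10)                 # label j belongs to column j of X
batches = random_mini_batches(X, Y, mini_batch_size=4)
print([bx.shape[1] for bx, _ in batches])  # [4, 4, 2]
```

Using the same permutation for X and Y keeps every example paired with its own label.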
                            
### decay:-
 
                   Reduces the learning rate of the model at a fixed interval
                      --> new learning_rate = old_learning_rate/(1 + decay_rate*(epoch_num/fixed_interval))
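A literal reading of this formula is sketched below. Two things are assumptions here: that the division is applied to the initial learning rate (not the already-decayed one), and that the function is only called when `epoch_num` is a multiple of the interval.

```python
def decay(learning_rate0, decay_rate, epoch_num, decay_interval):
    """Learning rate after epoch_num epochs, computed from the initial rate."""
    return learning_rate0 / (1 + decay_rate * (epoch_num / decay_interval))

lr_500 = decay(0.1, decay_rate=1.0, epoch_num=500, decay_interval=500)
lr_1000 = decay(0.1, decay_rate=1.0, epoch_num=1000, decay_interval=500)
print(lr_500, lr_1000)
```

With decay_rate = 1.0 the rate is halved after the first interval and divided by three after the second.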

## Parameter_Initialization

### Vanishing/Exploding gradient: -

       During Initialization, if the Weights and bias are too big or too small, they can cause the dW's or db's
       to become either very big (exploding) or very small (vanishing), affecting the learning of the model.
       To reduce this, Initialization is done using "he" or "Xavier" Initialization.
       
               1. "he" Initialization:- 
                                       factor = sqrt(2/l_dims[l-1])  for layer l
                                       here l_dims[l-1] is the # units in layer l-1, i.e. the # inputs to layer l
                                       Note:- Used when activation function of the lth layer is "relu" or "leaky_relu"

                2. "Xavier" Initialization:-
                               factor = sqrt(1/l_dims[l-1]) for layer l
                               Note:- Used when activation function of the lth layer is "tanh" or "sigmoid" or "softmax"
                               
                Use these factors to scale the Weights of the model.
                shape of W[l] --> (l_dims[l], l_dims[l-1])
                shape of b[l] --> (l_dims[l], 1)
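A minimal sketch of this scaling is shown below. The function name, the `"W1"`/`"b1"` keys, the standard normal draw and the zero bias are assumptions for the example; `l_dims` already contains the input size at index 0 and `activation[l-1]` belongs to layer l.

```python
import numpy as np

def initialize_parameters(l_dims, activation, seed=0):
    rng = np.random.default_rng(seed)
    params = {}
    for l in range(1, len(l_dims)):
        fan_in = l_dims[l - 1]
        if activation[l - 1] in ("relu", "leaky_relu"):
            factor = np.sqrt(2.0 / fan_in)       # "he" Initialization
        else:
            factor = np.sqrt(1.0 / fan_in)       # "Xavier" Initialization
        params["W" + str(l)] = rng.standard_normal((l_dims[l], fan_in)) * factor
        params["b" + str(l)] = np.zeros((l_dims[l], 1))
    return params

params = initialize_parameters([400, 200, 1], ["relu", "sigmoid"])
print(params["W1"].shape, params["b1"].shape)  # (200, 400) (200, 1)
print(params["W1"].std())                      # close to sqrt(2/400)
```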
    
    
### Batch_norm (optional) :-

            Normalize Zorig (Zorig[l] = W[l]*A[l-1] + b[l]) so that our model can train faster
            
            Z_norm = (Zorig - mean(Zorig))/(sqrt(variance + epsilon))
            Z = gama_norm*Z_norm + beta_norm
            Note:- 1. epsilon is added to avoid division by zero.
                   2. gama_norm*Z_norm is element-wise product and not a dot product because dot product
                      will sum up values while element-wise product will only scale up/down which is what we need.
            
            for batch_norm we need a gama_norm vector and a beta_norm vector for each layer.
               shape of gama_norm:-
                             shape of Z = (l_dims[l], mini_batch_size)
                             gama_norm is a column vector of shape (l_dims[l], 1), broadcast across the mini_batch
                       
               shape of beta_norm:- 
                             beta_norm is also a column vector of shape (l_dims[l], 1)
                             
            if batch_norm == True:
                       Initialize gama_norm and beta_norm alongside the Weights and bias; they get updated
                       in the same way during back propagation
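The normalization above can be sketched for one layer as follows. The function name is an assumption; the mean and variance are taken along each row, i.e. over the examples of the mini_batch.

```python
import numpy as np

def batch_norm_forward(Z_orig, gama_norm, beta_norm, epsilon=1e-8):
    mean = np.mean(Z_orig, axis=1, keepdims=True)       # shape (l_dims[l], 1)
    variance = np.var(Z_orig, axis=1, keepdims=True)    # shape (l_dims[l], 1)
    Z_norm = (Z_orig - mean) / np.sqrt(variance + epsilon)
    Z = gama_norm * Z_norm + beta_norm                  # element-wise, broadcast over columns
    return Z, Z_norm

rng = np.random.default_rng(0)
Z_orig = rng.standard_normal((3, 64)) * 5 + 10          # 3 units, mini_batch of 64
gama_norm = np.full((3, 1), 2.0)
beta_norm = np.full((3, 1), 3.0)
Z, Z_norm = batch_norm_forward(Z_orig, gama_norm, beta_norm)
print(Z.mean(axis=1), Z.std(axis=1))  # each row has mean close to 3 and std close to 2
```

After normalization each row of Z has mean beta_norm and standard deviation gama_norm, whatever the scale of Z_orig was.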


### learning_algo (optional) :-
                  
               if learning algorithm = "momentum" or "Adam"
                   we need to Initialize V_dW and V_db to a zero matrix and a zero vector for each layer
                   they will get updated after each mini_batch
                   shape of V_dW = shape of dW = shape of W = (l_dims[l], l_dims[l-1])
                   shape of V_db = shape of db = shape of b = (l_dims[l], 1)
                   
               if learning algorithm = "RMSprop" or "Adam"
                   Initialize S_dW and S_db to a zero matrix and a zero vector for each layer
                   they will be updated after each mini_batch
                   shape of S_dW = shape of W
                   shape of S_db = shape of b
                   
                   
### Dropout (keep_prob != 1) :-
          
                Initialize D[l] for each layer which takes values as 0 and 1
                    --> 0 being that "neuron" is shut down
                    --> 1 being that "neuron" is not shut down during learning
                    --> shape of D[l] = shape of Z[l] = shape of A[l] --> (l_dims[l], mini_batch_size)
                    
                    D[l] = np.random.rand(l_dims[l], mini_batch_size)
                           --> create a random matrix 
                    D[l] = (D[l] < keep_prob).astype(int) 
                           --> on average, keep_prob*100 % of the values of D[l] are 1 and the rest are 0

## Forward_Prop

### linear_model:- 
              
              stores the linear calculated part of the model.
              Z[l] = W[l]*A[l-1] + b[l], l = 1,2,...,L
              and A[0] = X
              
### Activation_model:-
               
              stores the values after applying activation function on Z
              A[l] = activation(Z[l])
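A compact sketch of these two steps for a whole network is given below. The dictionary keys (`"W1"`, `"Z1"`, `"A1"`, ...) and the function signature are assumptions made for the example, and only a few of the activations are included.

```python
import numpy as np

def sigmoid(Z):
    return 1.0 / (1.0 + np.exp(-Z))

def relu(Z):
    return np.maximum(0, Z)

def forward_prop(X, params, activation):
    """activation[l-1] is the activation name of layer l."""
    functions = {"sigmoid": sigmoid, "relu": relu, "tanh": np.tanh, "linear": lambda Z: Z}
    linear_model, activation_model = {}, {"A0": X}
    for l in range(1, len(activation) + 1):
        A_prev = activation_model["A" + str(l - 1)]
        Z = params["W" + str(l)] @ A_prev + params["b" + str(l)]   # b broadcasts over columns
        linear_model["Z" + str(l)] = Z
        activation_model["A" + str(l)] = functions[activation[l - 1]](Z)
    return linear_model, activation_model

rng = np.random.default_rng(0)
params = {
    "W1": rng.standard_normal((4, 3)) * 0.1, "b1": np.zeros((4, 1)),
    "W2": rng.standard_normal((1, 4)) * 0.1, "b2": np.zeros((1, 1)),
}
X = rng.standard_normal((3, 5))                      # 3 features, 5 examples
linear_model, activation_model = forward_prop(X, params, ["relu", "sigmoid"])
print(activation_model["A2"].shape)  # (1, 5)
```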
              
### batch_norm:-

              if set to True, normalize Z[l] as described in "Parameter_Initialization" step.
            
### Dropout (keep_prob != 1):-

              A[l] = A[l].*D[l]
                    --> .* is element-wise product
              A[l] = A[l]/keep_prob
                    --> element-wise division, so that the expected value of A[l] stays the same
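The effect of the mask and the rescaling can be seen on a small example. This is a sketch using a single keep_prob value and an activation matrix of ones, both chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
keep_prob = 0.8
A = np.ones((50, 1000))                               # stand-in for A[l]

D = (rng.random(A.shape) < keep_prob).astype(int)     # 1 = kept, 0 = shut down
A_drop = A * D                                        # element-wise product
A_drop = A_drop / keep_prob                           # rescale the surviving units

print(D.mean())       # close to keep_prob
print(A_drop.mean())  # close to the mean of A before dropout
```

Dividing by keep_prob during training means nothing has to be rescaled at prediction time, when no units are dropped.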

## Cost_dAL_and_dZL

### Cost:-
      
       Cost = J(W[1], b[1], ...W[L],b[L]) + Regularization_cost
       a different formula is used to calculate J for each type of model ("binary", "multi", "reg")
       
       model = "binary":
              J = (-1/mini_batch_size)*sum(Y_true*log(AL) + (1-Y_true)*log(1-AL))
              here AL is the output layer's post-activation value
              
       model = "multi":
              J = (-1/mini_batch_size)*sum(Y_true*log(AL))
              
       model = "reg":
              cost_reg = "MSE":
                       J = (1/mini_batch_size)*sum((Y_true-AL)^2)
                       
              cost_reg = "MAE":
                       J = (1/mini_batch_size)*sum(|Y_true - AL|)
                       here |Y_true - AL| = abs(Y_true - AL)
                       
      Regularization_cost does not depend upon model and cost_reg
      
      Regularization_cost  = lambd*sum(W[l]^2)/(2*mini_batch_size)
           frobenius_norm = sqrt(sum(W^2)) --> square root of the sum of the element-wise squares of W
                   val1 = np.linalg.norm(W)
                   val2 = np.sqrt(np.sum(W**2))
                   val1 and val2 are the same thing. --> val1 = val2 = frobenius_norm
                   (in NumPy the power operator is **, while ^ is bitwise XOR)
           sum(W[l]^2) --> sum(frobenius_norm^2) --> sum of the element-wise squares of W[l], added up over all layers
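For the "binary" model, the total cost can be sketched as below. The function name and the `"W1"`/`"b1"` keys are assumptions; the numbers are chosen so the result can be checked by hand.

```python
import numpy as np

def binary_cost(AL, Y_true, params, lambd):
    m = Y_true.shape[1]
    J = -np.sum(Y_true * np.log(AL) + (1 - Y_true) * np.log(1 - AL)) / m
    L = len(params) // 2                               # one W and one b per layer
    squared_norms = sum(np.sum(params["W" + str(l)] ** 2) for l in range(1, L + 1))
    regularization_cost = lambd * squared_norms / (2 * m)
    return J + regularization_cost

AL = np.array([[0.5, 0.5]])
Y_true = np.array([[1.0, 0.0]])
params = {"W1": np.array([[1.0, 2.0]]), "b1": np.zeros((1, 1))}

total = binary_cost(AL, Y_true, params, lambd=0.4)
print(total)  # log(2) + 0.4*5/(2*2), i.e. about 1.193
print(np.isclose(np.linalg.norm(params["W1"]) ** 2, np.sum(params["W1"] ** 2)))  # True
```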
            
### "dAL" and "dZL" :-

          "dAL" = d(cost)/dAL = dJ/dAL + d(Regularization_cost)/dAL --> dJ/dAL
          "dZL" = d(cost)/dZL = dJ/dZL + d(Regularization_cost)/dZL  --> dJ/dZL
               --> "dZL" = (dJ/dAL)*(dAL/dZL) --> "dAL"*grad_activation(Z[L])
               --> A[L] = AL = activation(Z[L]) --> dAL/dZL = derivative of activation function wrt Z[L]
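As a quick numerical illustration of this chain rule, take the "binary" case listed in this section with a "sigmoid" output layer (an assumption for the example). Multiplying "dAL" by the sigmoid derivative gives the familiar AL - Y_true.

```python
import numpy as np

def sigmoid(Z):
    return 1.0 / (1.0 + np.exp(-Z))

rng = np.random.default_rng(0)
ZL = rng.standard_normal((1, 6))
Y_true = np.array([[1, 0, 1, 1, 0, 0]], dtype=float)

AL = sigmoid(ZL)
dAL = -Y_true / AL + (1 - Y_true) / (1 - AL)      # dJ/dAL for the "binary" cost
grad_activation = AL * (1 - AL)                   # derivative of sigmoid with respect to ZL
dZL = dAL * grad_activation                       # element-wise chain rule

print(np.allclose(dZL, AL - Y_true))  # True
```

The 1/m factor of the cost is left out here, as in the formulas of this section; it is applied later when dW and db are averaged over the mini_batch.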
               
           model = "binary":
                  "dAL" = dJ/dAL --> "dAL" = (-Y_true/AL) + (1-Y_true)/(1-AL)
                  "dZL" = "dAL"*grad_activation(Z[L])
                  
           model = "multi":
                  "dAL" = dJ/dAL  --> "dAL" = -Y_true/AL
                  "dZL" = AL - Y_true
                        --> with a "softmax" output layer, dJ/dZL simplifies to AL - Y_true, so it is used directly
                  
           model = "reg":
                    In "reg" model, activation function used for output layer is always "linear"
                    so, grad_activation(Z[L]) = 1
                    cost_reg = "MSE":
                          "dAL" = dJ/dAL  --> "dAL" = -2*(Y_true - AL)
                          "dZL" = "dAL"*grad_activation(Z[L]) = "dAL"
                          
                    cost_reg = "MAE":
                          "dAL" = dJ/dAL  --> "dAL" = (AL - Y_true)/|AL - Y_true| = sign(AL - Y_true)
                                          --> not defined where AL = Y_true; np.sign(AL - Y_true) returns 0 there
                          "dZL" = dJ/dZL -->  (dJ/dAL)*(dAL/dZL) --> "dAL"*(dAL/dZL)
                                         --> "dAL"*grad_activation(Z[L])
                                                       --> grad_activation(Z[L]) = 1
                          "dZL" = "dAL"
                          

## Back_Prop

"dZL" and "dAL" calculated from "Cost_dAL_and_dZL" function
 m = mini_batch_size
 can use Dropout with batch_norm too.
 
### Dropout (keep_prob != 1):-
 
                      apply:-
                              "dA[l]" = "dA[l]".*D[l]
                              "dA[l]" = "dA[l]"/keep_prob[l-1]
                                   --> l goes from 1, 2,.... L
                                   --> keep_prob[l-1] is the variable for the l_th layer
                             The above formulas apply whether or not batch_norm is used.

### without batch_norm:-

                         Formula used in Forward_Prop:-
                                     --> Z[l] = W[l]*A[l-1] + b[l]
                                     --> A[l] = activation(Z[l])
                          
                         1. "dW[l]" = d(cost)/dW[l] = dJ/dW[l] + d(Regularization_cost)/dW[l] 
                                     --> (dJ/dZ[l])*(dZ[l]/dW[l]) + lambd*W[l]/m
                                     --> "dZ[l]"*(dZ[l]/dW[l]) + lambd*W[l]/m
                                              --> dZ[l]/dW[l] = A[l-1]
                                     --> "dZ[l]"*A[l-1] + lambd*W[l]/m
                                              --> shape of dZ[l] = (l_dims[l], m), shape of A[l-1] = (l_dims[l-1], m)
                                              --> shape of dW[l] = shape of W[l] = (l_dims[l], l_dims[l-1])
                            "dW[l]" = (1/m)*"dZ[l]"*A[l-1].T + lambd*W[l]/m  
                                     --> * represents the dot (matrix) product
                                     --> A[l-1].T is transpose of A[l-1]
                                     --> This update is done after the model has seen m examples
                                     --> that's why it is averaged over m
                                      
                         2. "db[l]" = d(cost)/db[l] = dJ/db[l] + d(Regularization_cost)/db[l]
                                     --> (dJ/dZ[l])*(dZ[l]/db[l]) + 0
                                     --> "dZ[l]"*(dZ[l]/db[l])
                                              --> dZ[l]/db[l] = 1
                                     --> "dZ[l]"
                                              --> shape of dZ[l] = (l_dims[l], m)
                                              --> shape of db[l] = shape of b[l] = (l_dims[l], 1)
                            "db[l]" = sum("dZ[l]")/m
                                     --> seen m example so averaged over m.
                                     --> sum is done along each row to match the dimension of "db[l]"
                                      
                         3. "dA[l-1]" = d(cost)/dA[l-1] = (dJ/dZ[l])*(dZ[l]/dA[l-1]) + d(Regularization_cost)/dA[l-1]
                                     --> "dZ[l]"*(dZ[l]/dA[l-1]) + 0
                                              --> dZ[l]/dA[l-1] = W[l]
                                     --> "dZ[l]"*W[l]
                                              --> shape of "dZ[l]" = (l_dims[l], m)
                                              --> shape of "W[l]" = (l_dims[l], l_dims[l-1])
                                              --> shape of "dA[l-1]" = shape of A[l-1] = (l_dims[l-1], m)
                            "dA[l-1]" = (W[l].T)*"dZ[l]"
                                   --> "dA[l-1]" = "dA[l-1]".*D[l-1]
                                   --> "dA[l-1]" = "dA[l-1]"/keep_prob[l-2]
                                     
                         4. "dZ[l-1]" = d(cost)/dZ[l-1] = (dJ/dA[l-1])*(dA[l-1]/dZ[l-1]) + d(Regularization_cost)/dZ[l-1]
                                     --> "dA[l-1]"*(dA[l-1]/dZ[l-1]) + 0
                                              --> dA[l-1]/dZ[l-1]  = grad_activation(Z[l-1])
                                     --> "dA[l-1]"*grad_activation(Z[l-1])
                                              --> grad_activation(Z[l-1]) is gradient of activation function of (l-1)th layer
                                                  with respect to Z[l-1]
                                              --> shape of "dA[l-1]" = (l_dims[l-1], m)
                                              --> shape of grad_activation(Z[l-1]) = shape of A[l-1] = (l_dims[l-1], m)
                                              --> shape of "dZ[l-1]" = (l_dims[l-1], m)
                            "dZ[l-1]" = "dA[l-1]".*grad_activation(Z[l-1])
                                              --> .* represents the element-wise product
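The four formulas above translate almost line by line into NumPy. The sketch below handles one layer without Dropout and assumes, for illustration only, that layer l-1 uses "relu"; the function name is hypothetical.

```python
import numpy as np

def backward_layer(dZ, A_prev, W, Z_prev, lambd):
    """One Back_Prop step from layer l to layer l-1 (no batch_norm, no Dropout)."""
    m = A_prev.shape[1]
    dW = dZ @ A_prev.T / m + lambd * W / m             # shape of W[l]
    db = np.sum(dZ, axis=1, keepdims=True) / m         # sum along each row, shape of b[l]
    dA_prev = W.T @ dZ                                 # shape of A[l-1]
    dZ_prev = dA_prev * (Z_prev > 0)                   # element-wise; relu gradient is 0 or 1
    return dW, db, dA_prev, dZ_prev

rng = np.random.default_rng(0)
m = 8
Z_prev = rng.standard_normal((3, m))                   # Z[l-1]
A_prev = np.maximum(0, Z_prev)                         # A[l-1] = relu(Z[l-1])
W = rng.standard_normal((4, 3))                        # W[l]
dZ = rng.standard_normal((4, m))                       # "dZ[l]"

dW, db, dA_prev, dZ_prev = backward_layer(dZ, A_prev, W, Z_prev, lambd=0.1)
print(dW.shape, db.shape, dA_prev.shape, dZ_prev.shape)  # (4, 3) (4, 1) (3, 8) (3, 8)
```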
                                              
                                              
### With batch_norm:-
 
                         Formula used in Forward_Prop:-
                                     --> Z_orig[l] = W[l]*A[l-1] + b[l]
                                     --> Z_norm[l] = (Z_orig[l] - mean(Z_orig[l]))/sqrt(var(Z_orig[l]) + epsilon)
                                     --> Z[l] = gama_norm[l].*Z_norm[l] + beta_norm[l]
                                     --> A[l] = activation(Z[l])
                                     
                         "dZ[l]" and "dA[l]" are already available from the previous step of Back_Prop
                         1. "d_gama_norm[l]" = d(cost)/d_gama_norm[l]
                                              -->  (dJ/dZ[l])*(dZ[l]/d_gama_norm[l]) + d(Regularization_cost)/d_gama_norm[l]
                                     --> "dZ[l]"*(dZ[l]/d_gama_norm[l]) + 0
                                              --> dZ[l]/d_gama_norm[l] = Z_norm[l]
                                     --> "dZ[l]".*Z_norm[l]
                                              --> .* represent element-wise product
                                              --> shape of dZ[l] = shape of Z_norm = (l_dims[l], m)
                                              --> shape of d_gama_norm[l] = (l_dims[l], 1)
                            "d_gama_norm[l]" = sum("dZ[l]".*Z_norm[l])/m
                                              --> this sum is done along each row
                                              --> seen m example so averaged over m
                                              
                         2. "d_beta_norm[l]" =  d(cost)/d_beta_norm[l]
                                              -->  (dJ/dZ[l])*(dZ[l]/d_beta_norm[l]) + d(Regularization_cost)/d_beta_norm[l]
                                     --> "dZ[l]"*(dZ[l]/d_beta_norm[l]) + 0
                                              --> dZ[l]/d_beta_norm[l] = 1
                                     --> "dZ[l]"
                                              --> shape of dZ[l] = (l_dims[l], m)
                                              --> shape of d_beta_norm[l] = (l_dims[l], 1)
                            "d_beta_norm[l]" = sum("dZ[l]")/m
                                              --> this sum is done along each row
                                              --> seen m example so averaged over m
                                               
                         3. "dZ_orig[l]" = d(cost)/dZ_orig[l] = dJ/dZ_orig[l] + d(Regularization_cost)/dZ_orig[l]
                                     --> dJ/dZ_orig[l] + 0
                            "dZ_orig[l]" = gama_norm[l].*{"dZ[l]" - d_gama_norm[l].*Z_norm[l] - d_beta_norm[l]*I_m}/sqrt(var(Z_orig[l]) + epsilon)
                                     --> d_gama_norm[l] and d_beta_norm[l] are already averaged over m, so no extra factor of m appears
                                     --> .* is element-wise product
                                     --> * is dot product
                                     --> I_m is a row vector of ones of shape (1, m)
                                     
                         4. "dW[l]" = d(cost)/dW[l] = dJ/dW[l] + d(Regularization_cost)/dW[l] 
                                     --> (dJ/dZ_orig[l])*(dZ_orig[l]/dW[l]) + lambd*W[l]/m
                                     --> "dZ_orig[l]"*(dZ_orig[l]/dW[l]) + lambd*W[l]/m
                                              --> dZ_orig[l]/dW[l] = A[l-1]
                                     --> "dZ_orig[l]"*A[l-1] + lambd*W[l]/m
                                              --> shape of dZ_orig[l] = (l_dims[l], m), shape of A[l-1] = (l_dims[l-1], m)
                                              --> shape of dW[l] = shape of W[l] = (l_dims[l], l_dims[l-1])
                            "dW[l]" = (1/m)*"dZ_orig[l]"*A[l-1].T + lambd*W[l]/m  
                                      
                         5. "db[l]" = d(cost)/db[l] = dJ/db[l] + d(Regularization_cost)/db[l]
                                     --> (dJ/dZ_orig[l])*(dZ_orig[l]/db[l]) + 0
                                     --> "dZ_orig[l]"*(dZ_orig[l]/db[l])
                                              --> dZ_orig[l]/db[l] = 1
                                     --> "dZ_orig[l]"
                                              --> shape of dZ_orig[l] = (l_dims[l], m)
                                              --> shape of db[l] = shape of b[l] = (l_dims[l], 1)
                            "db[l]" = sum("dZ_orig[l]")/m
                                     --> sum is done along each row to match the dimension of "db[l]"
                                      
                         6. "dA[l-1]" = d(cost)/dA[l-1]
                                              --> (dJ/dZ_orig[l])*(dZ_orig[l]/dA[l-1]) + d(Regularization_cost)/dA[l-1]
                                     --> "dZ_orig[l]"*(dZ_orig[l]/dA[l-1]) + 0
                                              --> dZ_orig[l]/dA[l-1] = W[l]
                                     --> "dZ_orig[l]"*W[l]
                                              --> shape of "dZ_orig[l]" = (l_dims[l], m)
                                              --> shape of "W[l]" = (l_dims[l], l_dims[l-1])
                                              --> shape of "dA[l-1]" = shape of A[l-1] = (l_dims[l-1], m)
                            "dA[l-1]" = (W[l].T)*"dZ_orig[l]"
                                   --> "dA[l-1]" = "dA[l-1]".*D[l-1]
                                   --> "dA[l-1]" = "dA[l-1]"/keep_prob[l-2]
                                     
                         7. "dZ[l-1]" = d(cost)/dZ[l-1] = (dJ/dA[l-1])*(dA[l-1]/dZ[l-1]) + d(Regularization_cost)/dZ[l-1]
                                     --> "dA[l-1]"*(dA[l-1]/dZ[l-1]) + 0
                                              --> dA[l-1]/dZ[l-1]  = grad_activation(Z[l-1])
                                     --> "dA[l-1]"*grad_activation(Z[l-1])
                            "dZ[l-1]" = "dA[l-1]".*grad_activation(Z[l-1])
                                              --> .* represents the element-wise product
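The batch_norm part of this derivation is easy to get wrong, so a numerical check is useful. The sketch below writes "dZ_orig[l]" in terms of the row means d_gama_norm[l] and d_beta_norm[l] and compares it with finite differences. The toy cost J = sum(G.*Z)/m, where G stands in for "dZ[l]", and the function names are assumptions for the example.

```python
import numpy as np

def bn_forward(Z_orig, gama_norm, beta_norm, epsilon=1e-8):
    mean = Z_orig.mean(axis=1, keepdims=True)
    var = Z_orig.var(axis=1, keepdims=True)
    Z_norm = (Z_orig - mean) / np.sqrt(var + epsilon)
    return gama_norm * Z_norm + beta_norm, Z_norm, var

def bn_backward(dZ, Z_norm, var, gama_norm, epsilon=1e-8):
    m = dZ.shape[1]
    d_gama_norm = np.sum(dZ * Z_norm, axis=1, keepdims=True) / m    # row mean
    d_beta_norm = np.sum(dZ, axis=1, keepdims=True) / m             # row mean
    dZ_orig = gama_norm * (dZ - d_gama_norm * Z_norm - d_beta_norm) / np.sqrt(var + epsilon)
    return d_gama_norm, d_beta_norm, dZ_orig

rng = np.random.default_rng(0)
m = 5
Z_orig = rng.standard_normal((2, m))
gama_norm = rng.standard_normal((2, 1))
beta_norm = rng.standard_normal((2, 1))
G = rng.standard_normal((2, m))                  # stands in for "dZ[l]"

def J(Z_in):
    Z, _, _ = bn_forward(Z_in, gama_norm, beta_norm)
    return np.sum(G * Z) / m                     # toy cost, averaged over m

_, Z_norm, var = bn_forward(Z_orig, gama_norm, beta_norm)
_, _, dZ_orig = bn_backward(G, Z_norm, var, gama_norm)

h = 1e-6
numeric = np.zeros_like(Z_orig)
for i in range(Z_orig.shape[0]):
    for j in range(m):
        Z_plus, Z_minus = Z_orig.copy(), Z_orig.copy()
        Z_plus[i, j] += h
        Z_minus[i, j] -= h
        # multiply by m because "dZ_orig" follows the notes' convention of not yet being averaged
        numeric[i, j] = m * (J(Z_plus) - J(Z_minus)) / (2 * h)

max_err = np.max(np.abs(numeric - dZ_orig))
print(max_err)  # small when the backward formula matches the forward pass
```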
                                              

## Parameter_update

                         W[l] = W[l] - learning_rate*dW_factor
                         b[l] = b[l] - learning_rate*db_factor
                                   --> dW_factor and db_factor are calculated based on the learning algorithm used
                                   
                      
### "gd" learning Algorithm:-

                         Standard gradient descent is applied.
                         dW_factor = "dW[l]"
                         db_factor = "db[l]"
                         
### "momentum" learning Algorithm:- 
        
            for each Neural network layer -->
                         Initialized V_dW = 0 matrix of shape W, V_db = 0 vector of shape b
                         After a pass through a mini_batch
                                V_dW = beta1*V_dW + (1-beta1)*dW
                                V_db = beta1*V_db + (1-beta1)*db
                                dW_factor = V_dW
                                db_factor = V_db
                                
### "RMSprop" learning Algorithm:-

            for each Neural network layer -->
                         Initialized S_dW = 0 matrix of shape W, S_db = 0 vector of shape b
                         After a pass through a mini_batch
                                S_dW = beta2*S_dW + (1-beta2)*(dW^2)
                                S_db = beta2*S_db + (1-beta2)*(db^2)
                                dW_factor = dW/(sqrt(S_dW) + epsilon) 
                                db_factor = db/(sqrt(S_db) + epsilon)
                                
### "Adam" learning Algorithm:-

            for each Neural network layer -->
                         t = 0 --> "Adam" step counter, used for bias correction
                         Initialized V_dW = S_dW = 0 matrix of shape W, V_db = S_db = 0 vector of shape b
                         After a pass through a mini_batch
                                t += 1
                                V_dW = beta1*V_dW + (1-beta1)*dW
                                V_db = beta1*V_db + (1-beta1)*db
                                S_dW = beta2*S_dW + (1-beta2)*(dW^2)
                                S_db = beta2*S_db + (1-beta2)*(db^2)
                                V_dW_corrected = V_dW/(1-beta1^t)
                                V_db_corrected = V_db/(1-beta1^t)
                                S_dW_corrected = S_dW/(1-beta2^t)
                                S_db_corrected = S_db/(1-beta2^t)
                                dW_factor = V_dW_corrected/(sqrt(S_dW_corrected) + epsilon)
                                db_factor = V_db_corrected/(sqrt(S_db_corrected) + epsilon)
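One "Adam" step for a single weight matrix can be sketched as follows (the bias update has the same form). The function name and signature are assumptions for the example.

```python
import numpy as np

def adam_step(W, dW, V_dW, S_dW, t, learning_rate=0.001,
              beta1=0.9, beta2=0.999, epsilon=1e-8):
    t += 1
    V_dW = beta1 * V_dW + (1 - beta1) * dW              # momentum-style average
    S_dW = beta2 * S_dW + (1 - beta2) * dW ** 2         # RMSprop-style average
    V_dW_corrected = V_dW / (1 - beta1 ** t)            # bias correction
    S_dW_corrected = S_dW / (1 - beta2 ** t)
    dW_factor = V_dW_corrected / (np.sqrt(S_dW_corrected) + epsilon)
    W = W - learning_rate * dW_factor
    return W, V_dW, S_dW, t

W = np.array([[1.0, -2.0]])
dW = np.array([[0.5, -4.0]])
V_dW, S_dW, t = np.zeros_like(W), np.zeros_like(W), 0

W_new, V_dW, S_dW, t = adam_step(W, dW, V_dW, S_dW, t)
print(W_new)  # approximately [[ 0.999 -1.999]]
```

On the first step the corrected averages equal dW and dW^2, so every weight moves by about learning_rate in the direction opposite to the sign of its gradient, regardless of the gradient's size.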

## fit

### On Mini_batches :-
 
                    --> shuffle X and Y randomly
                    --> given the mini_batch_size, get mini_batch_X and mini_batch_Y from Random_mini_batches
                    --> Pass this batch through the learning model
                    --> update parameter based on learning from this pass
                    --> repeat for all the batches of X and Y
                    --> This whole process is called "1 epoch" or "1 pass"
                    --> train the model for "num epochs" or "max epochs"
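The loop structure can be illustrated on a much smaller model. The sketch below trains a single sigmoid unit with plain "gd" on synthetic data; the data, the hyperparameter values and the inline update are assumptions for the example and stand in for the full Forward_Prop / Back_Prop / Parameter_Update steps.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 200
X = rng.standard_normal((2, m))
Y = (X[0:1] + X[1:2] > 0).astype(float)               # synthetic binary labels

w, b = np.zeros((1, 2)), np.zeros((1, 1))
learning_rate, mini_batch_size, max_epochs = 0.1, 64, 20

def cost(w, b):
    A = 1.0 / (1.0 + np.exp(-(w @ X + b)))
    A = np.clip(A, 1e-12, 1 - 1e-12)                  # avoid log(0)
    return float(-np.mean(Y * np.log(A) + (1 - Y) * np.log(1 - A)))

initial_cost = cost(w, b)
for epoch in range(max_epochs):
    perm = rng.permutation(m)                         # shuffle X and Y together
    for start in range(0, m, mini_batch_size):
        idx = perm[start:start + mini_batch_size]
        X_batch, Y_batch = X[:, idx], Y[:, idx]
        A = 1.0 / (1.0 + np.exp(-(w @ X_batch + b)))  # forward pass
        dZ = A - Y_batch                              # backward pass
        k = X_batch.shape[1]
        w = w - learning_rate * (dZ @ X_batch.T) / k  # parameter update
        b = b - learning_rate * np.sum(dZ, axis=1, keepdims=True) / k
final_cost = cost(w, b)
print(initial_cost, final_cost)
```

Each inner iteration is one parameter update; each outer iteration is one epoch.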

## predict

## Tips for model Fitting

### Don't Use Batch_norm and Dropout together
 
                    1. batch_norm is used to speed up learning of the model
                    2. batch_norm makes the learning stable
                          --> can use large learning rate


# Creating a class DNN to build a deep learning model based on different conditions
import numpy as np
import math
class DNN:
    
    def __init__(self, l_dims, activation, mini_batch_size = 64, max_epochs = 3000, learning_rate = 0.001,
                 decay_rate = 1.0, decay_interval = 500, is_decay = False, epsilon = 0.00000001, lambd = 0, 
                 Dropout = False, keep_prob = None, batch_norm = False, cost_reg = "MSE", learning_algo = "gd", 
                 beta1 = 0.9, beta2 = 0.999,  leaky_para = 0.01, model_type = "binary"):
        
        '''
    Parameters:- 
                  l_dims --> type --> python list, value_type --> int, 
                            Note:- 1. contains the # hidden units for hidden layers and # output units for output layers.
                                   2. length of l_dims = L (total # hidden layers + output layer)
                                   3. later in the "fit" method, # input features are added into it at index 0.
                                   4. So, throughout the model, we will be considering the length of l_dims = L+1

                  activation --> type --> python list, value_type --> str,
                            Note:- 1. length of activation = L
                                   2. activation[l-1] --> stores the activation of the l-th layer (l = 1, ..., L)
                                   3. values --> "relu", "leaky_relu", "sigmoid", "tanh", "softmax" and "linear".

                  mini_batch_size --> hyperparameter, value_type --> int
                            Note:- 1. # training sample used before updating the model parameters 
                                   2. Default value --> 64
                                   3. other common values --> 32, 128, 256, 512, 1024 (powers of two)
                                   4. most frequently used values --> 64, 128, 256

                  max_epochs --> hyperparameter, value_type --> int
                            Note:- 1. # passes over the whole training set (each pass visits every mini_batch once)
                                   2. Default value --> 3000

                  learning_rate --> hyperparameter, value_type --> float
                            Note:- 1. how much of a step is taken in the downhill direction 
                                   2. Default value --> 0.001

                  decay_rate --> hyperparameter, value_type --> float
                            Note:- 1. used in formula to calculate learning_rate decay
                                   2. Default value --> 1.0
                                   
                  decay_interval --> hyperparameter, value_type --> int
                            Note:- 1. used in formula to calculate learning_rate decay
                                   2. Default value --> 500
                                   
                  is_decay --> boolean
                            Note:- 1. if set to True then "decay" function is used and
                                      learning rate decreases at a fixed interval
                                   2. Default value --> False

                  epsilon --> hyperparameter, value_type --> float
                            Note:- 1. used at different places to avoid dividing by zero
                                   2. Default value --> 0.00000001
                                   3. used in "batch_norm", "RMSprop" algorithm and "Adam" algorithm (defined below)

                  lambd --> hyperparameter, value_type --> float
                            Note:- 1. Used in Regularization
                                   2. Default value --> 0
                                   3. Train the model without regularization, get training_error and dev_error
                                   4. dev_error - training_error is large (Overfitting) 
                                   5. Regularization is one of the techniques used to reduce overfitting
                  
                  Dropout --> Boolean
                            Note:- 1. Default value --> False
                                   2. If set to True --> dropout will be implemented
                                   3. Dropout is another technique used to reduce overfitting
                  
                  keep_prob --> hyperparameter, value_type --> list
                            Note:- 1. Default value --> None
                                   2. if Dropout is True --> a list will be provided
                                   3. list contains float values in (0, 1], one per layer (1 means no unit is dropped)
                                   4. defines the fraction of units of each layer to be kept active during training

                  batch_norm --> type --> boolean
                            Note:- 1. Default value --> False
                                   2. if True, normalizes the layer's (hidden layers + output layer) Z values
                                   3. batch_norm is done to increase speed of learning

                  cost_reg --> type --> str
                            Note:- 1. Default value --> "MSE"
                                   2. Type of cost function used in regression model

                  learning_algo --> type --> str
                            Note:- 1. Type of learning algorithm is used to update the model parameters
                                   2. Default value --> "gd" short for standard gradient descent
                                   3. other values --> "momentum", "RMSprop", "Adam"

                  beta1 --> hyperparameter,  type --> float
                            Note:- 1. Default value --> 0.9
                                   2. hyperparameter for "momentum" and "Adam" learning algorithms
                                   3. most of the time it is not used as a hyperparameter and its value is fixed at 0.9

                  beta2 --> hyperparameter, type --> float
                            Note:- 1. Default value --> 0.999
                                   2. hyperparameter for "RMSprop" and "Adam" learning algorithms
                                   3. most of the time it is not used as a hyperparameter and its value is fixed at 0.999

                  leaky_para --> hyperparameter, type --> float
                            Note:- 1. Default value --> 0.01
                                   2. used in "leaky_relu" function defined later
                  
                  grad_check --> type --> boolean
                            Note:- 1. if set to True, checks if Back_Prop is implemented correctly
                                   2. Default value --> False

                  model_type --> type --> str
                            Note:- 1. defines the problem on which this class is used 
                                   2. Default value --> "binary" for binary classification problem
                                   3. other values --> "multi" for multi-class classification problem
                                                       "reg" for regression problem   

                  params --> dictionary to keep track of the weights and biases of the model.
                            Note:- 1. if batch_norm = True, it also stores gama and beta (scale and shift) for each layer

                  linear_model --> dictionary to keep track of Z's of forward propagation.
                            Note:- 1. Z = W*A + b (general form)

                  activation_model --> dictionary to keep track of A's of forward propagation.
                            Note:- 1. A = activation_function(Z)

                  grads --> dictionary to keep track of derivatives computed in backward propagation.
                            Note:- 1. dA's, dZ's, dW's and db's.

                    
        '''
        
        self.l_dims = l_dims
        self.activation = activation
        self.mini_batch_size = mini_batch_size
        self.max_epochs = max_epochs
        self.learning_rate = learning_rate
        self.decay_rate = decay_rate
        self.decay_interval = decay_interval
        self.is_decay = is_decay
        self.epsilon = epsilon
        self.lambd = lambd
        self.Dropout = Dropout
        if keep_prob is None:
            keep_prob = []
        self.keep_prob = keep_prob
        self.batch_norm = batch_norm
        self.cost_reg = cost_reg
        self.learning_algo = learning_algo
        self.beta1 = beta1
        self.beta2 = beta2
        self.leaky_para = leaky_para
        self.model_type = model_type
        
        self.params = {}  
        self.linear_model = {}
        self.activation_model = {}
        self.grads = {}
    
    
    # helper function to calculate activation function values for each layer
    # relu, leaky_relu and tanh activation function are used in hidden layers
    # and they are not used in output layer.
    # model --> "binary" --> output layer = "sigmoid"
    # model --> "multi"  --> output layer = "softmax"
    # model --> "reg"    --> output layer = "linear"
    
    # 1. relu function
    def relu(self, z):
        
        # it returns the element-wise maximum of 0 and z
        return np.maximum(0, z)
    
    
    # 2. leaky_ReLU function
    def leaky_relu(self, z):
        
        # return z if z >0
        #       else returns leaky_para*z 
        # where leaky_para is user defined and can take a default value of 0.01
        
        return np.maximum(self.leaky_para*z, z)
    
    
    # 3. sigmoid function 
    def sigmoid(self, z):
        
        # sigmoid(z) = 1/(1+e^(-z))
        return 1/(1+np.exp(-z))
    
    
    # 4. softmax function 
    def softmax(self, z):
        
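        # softmax(z)(i) = e^z(i)/sum(e^z(j)), computed separately for each column (example)
        # subtracting the column-wise max leaves the result unchanged but avoids overflow in np.exp
        
        t = np.exp(z - np.max(z, axis = 0, keepdims = True))
        sum_val = np.sum(t, axis = 0, keepdims = True)
        
        return t/sum_val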
    
    
    # 5. linear function (output layer activation function for regression)
    def linear(self, z):
        
        # linear_activation(z) = z
        return z
    
    
    # gradient of activation functions
    def grad_activation(self, z, activation):
        
        # given the type of activation function, compute its derivative
        # np.where builds a new array, so the caller's Z is not modified in place
        if activation == 'relu':
            
            # relu = max(0, z) so for z>0, derivative = 1, else derivative = 0.
            return np.where(z > 0.0, 1.0, 0.0)
        
        elif activation == "sigmoid":
            
            # sigmoid(z) = 1/(1+exp(-z))
            # derivative of sigmoid(z) = sigmoid(z)*(1-sigmoid(z))
            
            s = self.sigmoid(z)
            return s*(1.0 - s)
        
        elif activation == "tanh":
            
            # tanh(z) = (exp(z) - exp(-z))/(exp(z)+exp(-z))
            # derivative of tanh(z) :- (1 - tanh(z)^2)
            
            return 1.0 - np.power(np.tanh(z), 2)
        
        elif activation == "linear":
            
            # g(z) = z
            # derivative of g(z) with respect to z = 1
            
            return 1
        
        else: # for leaky_relu activation function
            
            # leaky_relu = max(leaky_para*z, z), so for z>0 derivative = 1, else, it is = leaky_para
            # a new array is returned so that the caller's Z is left untouched
            
            return np.where(z > 0.0, 1.0, self.leaky_para)
            
    
    # helper function for gradient checking
    # parameter to vector conversion --> used in grad_checking
    def params_to_vector(self):
        
        '''
        Parameters: self: 
                        params --> weights and biases, each reshaped into a column vector
                        l_dims --> needed to iterate over each layer
                        
                 Returns --> a single column vector in which W[l] and b[l] of every layer are stacked vertically
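
        Example: a minimal standalone sketch of the flatten / restore round trip, assuming
        a 3-4-1 network. The helper names `flatten` and `unflatten` are hypothetical; they only
        mirror what params_to_vector and vector_to_params do and are not part of this class.

```python
import numpy as np

def flatten(params, n_layers):
    # stack W1, b1, W2, b2, ... on top of each other as one column vector
    parts = []
    for l in range(1, n_layers + 1):
        parts.append(params["W" + str(l)].reshape((-1, 1)))
        parts.append(params["b" + str(l)].reshape((-1, 1)))
    return np.concatenate(parts, axis=0)

def unflatten(vector, l_dims):
    # slice the column vector back into W[l] (l_dims[l], l_dims[l-1]) and b[l] (l_dims[l], 1)
    restored, i = {}, 0
    for l in range(1, len(l_dims)):
        rows, cols = l_dims[l], l_dims[l - 1]
        restored["W" + str(l)] = vector[i:i + rows * cols].reshape((rows, cols))
        i += rows * cols
        restored["b" + str(l)] = vector[i:i + rows].reshape((rows, 1))
        i += rows
    return restored

l_dims = [3, 4, 1]                      # assumed layer sizes, input layer included
rng = np.random.default_rng(0)
params = {}
for l in range(1, len(l_dims)):
    params["W" + str(l)] = rng.standard_normal((l_dims[l], l_dims[l - 1]))
    params["b" + str(l)] = rng.standard_normal((l_dims[l], 1))

theta = flatten(params, len(l_dims) - 1)    # column vector of all parameters
restored = unflatten(theta, l_dims)         # same shapes and values as params
```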
         '''
        
        W_vector = self.params["W" + str(1)].reshape((-1, 1))
        b_vector = self.params["b" + str(1)].reshape((-1, 1))
        params_vector = np.concatenate(( W_vector, b_vector), axis = 0)
        
        # iterate over each layer
        for l in range(2, len(self.l_dims)):
            
            W_vector = self.params["W" + str(l)].reshape((-1, 1))
            b_vector = self.params["b" + str(l)].reshape((-1, 1))
            params_vector = np.concatenate((params_vector, W_vector, b_vector), axis = 0)
        
        return params_vector
    
    
    # gradients to vector conversion --> used in grad_checking
    def grads_to_vector(self):
        
        '''
        Parameters: self: 
                        grads --> dW's and db's, each reshaped into a column vector
                        l_dims --> needed to iterate over each layer
                        
                 Returns --> a single column vector in which dW[l] and db[l] of every layer are stacked vertically
        '''
        
        dW_vector = self.grads["dW" + str(1)].reshape((-1, 1))
        db_vector = self.grads["db" + str(1)].reshape((-1, 1))
        grads_vector = np.concatenate((dW_vector, db_vector), axis = 0)
        
        # iterate over each layer
        for l in range(2, len(self.l_dims)):
            
            dW_vector = self.grads["dW" + str(l)].reshape((-1, 1))
            db_vector = self.grads["db" + str(l)].reshape((-1, 1))
            grads_vector = np.concatenate((grads_vector, dW_vector, db_vector), axis = 0)
        
        return grads_vector
        
    
    # converting column vector of params to their original shape
    def vector_to_params(self, vector):
        
        '''
        Parameters: self: 
                       --> l_dims, params --> the reshaped slices are stored in params under "W_check" + str(l) and "b_check" + str(l)
                       --> vector:- column vector (as produced by params_to_vector) to be converted back
                       
        '''
        
        layer_dims = self.l_dims                # no. of hidden units of each layer including input layer
        i = 0                                   # used in indexing and slicing of the vector 
        for l in range(1, len(layer_dims)):
            
            row = layer_dims[l]                 
            column = layer_dims[l-1]
            self.params["W_check" + str(l)] = vector[i: i + row*column].reshape((row, column))
            i += row*column
            self.params["b_check" + str(l)] = vector[i:i + row].reshape((row, 1))
            i += row
            
#             print(self.params["W" + str(l)].shape, " = ", self.params["W_check" + str(l)], "\n")
#             print(self.params["b" + str(l)].shape, " = ", self.params["b_check" + str(l)], "\n")
            
    
    # grad_checking implemented
    def grad_checking(self, X, Y):
        
        '''
        Parameters: self: 
                        ForWard_Prop(), Cost_dAL_and_dZL() --> to calculate J_plus and J_minus
                        grads --> to get the column vector --> dJ/dtheta --> stored in grads
                        params --> to get theta+ and theta-
                        
                Return :- nothing; a confirmation message on the Back_Prop() implementation is printed
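
        Example: a minimal sketch of the same two-sided check on a toy cost whose gradient
        is known exactly. The cost function, the step size 1e-7 and all names below are
        assumptions made for illustration; they are not taken from this class.

```python
import numpy as np

def cost(theta):
    # toy cost J(theta) = 0.5 * sum(theta^2); its analytic gradient is theta itself
    return 0.5 * np.sum(theta ** 2)

theta = np.array([[1.0], [-2.0], [3.0]])
analytic = theta.copy()                  # dJ/dtheta for the toy cost
eps = 1e-7                               # assumed step size for this sketch
approx = np.zeros_like(theta)

for i in range(theta.shape[0]):
    plus, minus = theta.copy(), theta.copy()
    plus[i] += eps                       # nudge only the i-th parameter up
    minus[i] -= eps                      # and down
    approx[i] = (cost(plus) - cost(minus)) / (2 * eps)

# relative difference between the analytic and the numerical gradient
difference = np.linalg.norm(analytic - approx) / (
    np.linalg.norm(analytic) + np.linalg.norm(approx))
```

        A small relative difference suggests the analytic gradient matches the numerical one.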
    
        '''
         
        para_vector = self.params_to_vector()
        dJ_dtheta = self.grads_to_vector()
        num_para_values = para_vector.shape[0]
        J_plus = np.zeros((num_para_values, 1))
        J_minus = np.zeros((num_para_values, 1))
        grad_approx = np.zeros((num_para_values, 1))
        epsilon = self.epsilon                             # epsilon --> 1e-8
        
#         print("para_vector = ", para_vector, "\n para_vector shape = ", para_vector, "\n")
#         print("dJ_dtheta value = ", dJ_dtheta, "\n dJ_dtheta shape = ", dJ_dtheta.shape, "\n")
#         print("num_para_values = ", num_para_values, "\n")
#         print("J_plus = ", J_plus, "\n")
#         print("J_minus = ", J_minus, "\n")
#         print("grad_approx = ", grad_approx, "\n")
        
        for i in range(num_para_values):
            
#             print("nudgeing " + str(i) + "th parameter = ", i, "\n")
            
            # slight nudge to parameter theta_plus
            theta_plus = np.copy(para_vector)
#             print("before nudgeing theta_plus = ", theta_plus, "\n")
            theta_plus[i] = theta_plus[i] + epsilon
            self.vector_to_params(theta_plus)
            self.Forward_Prop(X, grad_check = True)
            J_plus[i] = self.Cost_dAL_and_dZL(Y, grad_check = True)
            
#             print("after nudgeing theta_plus = ", theta_plus, "\n")
#             print("J_plus after nudgeing = ", J_plus, "\n")
            
            # slight nudge to parameter theta_minus
            theta_minus = np.copy(para_vector)
#             print("before nudgeing theta_minus = ", theta_minus, "\n")
            theta_minus[i] = theta_minus[i] - epsilon
            self.vector_to_params(theta_minus)
            self.Forward_Prop(X, grad_check = True)
            J_minus[i] = self.Cost_dAL_and_dZL(Y, grad_check = True)
            
#             print("after nudgeing theta_minus = ", theta_minus, "\n")
#             print("J_minus after nudgeing = ", J_minus, "\n")
            
            # grad_approx calculation
            grad_approx[i] = (J_plus[i] - J_minus[i])/(2*epsilon)
            
#             print("grad_approx after nudgeing = ", grad_approx, "\n")
            
        numerator = np.linalg.norm(dJ_dtheta - grad_approx)
        denominator = np.linalg.norm(dJ_dtheta) + np.linalg.norm(grad_approx)
        difference = numerator/denominator
        
#         print("final difference = ", difference, "\n")
        
        if difference > 2*epsilon:
            
            print("There is a mistake in the Back_Prop! difference = " + str(difference))
            
        else:
            
            print("Back_Prop works perfectly fine! difference = " + str(difference))

            
    # random mini batches for training of the model
    def Random_mini_batches(self, X, Y, seed):
        
        '''
        Parameters: self: --> mini_batch_size to define the size of each mini batch
                    X, Y --> to convert them into mini batches of size mini_batch_size
                    seed --> so that each time same mini batches are created
                    
             Returns: mini_batches --> list containing tuples of X_batch and Y_batch
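
        Example: a minimal standalone sketch of shuffling and splitting, assuming columns are
        training examples. The helper name `make_batches` is hypothetical, and it uses
        np.random.default_rng in place of the np.random.seed call used by this method.

```python
import numpy as np

def make_batches(X, Y, batch_size, seed):
    # shuffle the columns of X and Y with the same permutation, then cut them into batches
    rng = np.random.default_rng(seed)
    perm = rng.permutation(X.shape[1])
    X, Y = X[:, perm], Y[:, perm]
    return [(X[:, k:k + batch_size], Y[:, k:k + batch_size])
            for k in range(0, X.shape[1], batch_size)]

X = np.arange(20).reshape((2, 10))       # 2 features, 10 examples
Y = np.arange(10).reshape((1, 10))       # label i belongs to example i
batches = make_batches(X, Y, batch_size=4, seed=0)
sizes = [bx.shape[1] for bx, _ in batches]   # two full batches and one smaller last batch
```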
        
        '''
        
        M = X.shape[1]                                         # total no. of training examples
        mini_batches = []                                      # to store X_batch and Y_batch
        np.random.seed(seed)                                   # to always get the same random shuffling of X and Y
        m = self.mini_batch_size                               # size of mini_batch
        
        # Shuffle X and Y randomly
        permutation = list(np.random.permutation(M))           # shuffled list of the column indices 0..M-1
        shuffled_X  = X[:, permutation]
        
        if self.model_type == "multi":
            shuffled_Y = Y[:, permutation].reshape(Y.shape)
        else:
            shuffled_Y = Y[:, permutation].reshape((1, M))
            
        num_full_batches = M // m                              # no. of full mini_batches of size mini_batch_size
        
#         print("total no of training examples = ", M, "\n")
#         print("size of mini_batch defined = ", m, "\n")
#         print("model_type = ", self.model_type, "\n")
#         print("shuffled_X = ", shuffled_X, "\n")
#         print("shuffled_Y = ", shuffled_Y, "\n")
#         print("total no of num_full_batches = ", num_full_batches, " with  size = ", m, "\n")
        
        for i in range(num_full_batches):
            
            mini_batch_X = shuffled_X[:, i*m: (i+1)*m]
            mini_batch_Y = shuffled_Y[:, i*m: (i+1)*m]
            mini_batch = (mini_batch_X, mini_batch_Y)
            mini_batches.append(mini_batch)
        
        if M % self.mini_batch_size != 0:
            
            mini_batch_X = shuffled_X[:, num_full_batches*m : M]
            mini_batch_Y = shuffled_Y[:, num_full_batches*m : M]
            mini_batch = (mini_batch_X, mini_batch_Y)
            mini_batches.append(mini_batch)
            
        return mini_batches
    
    
    # learning_rate decay on fixed interval
    def decay(self, epoch):
        
        '''
        Parameters: self: 
                         learning_rate, decay_rate, decay_interval, epoch needed to calculate new learning rate
                         
           Returns :- nothing; self.learning_rate is updated in place whenever epoch is a multiple of decay_interval
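
        Example: a minimal sketch of the update rule with assumed values (0.1, 1.0 and 10 are
        illustrative, not defaults of this class). Because each update starts from the
        already decayed rate, the reductions compound over time.

```python
learning_rate = 0.1      # assumed initial rate
decay_rate = 1.0         # assumed decay rate
interval = 10            # assumed decay interval in epochs
history = []

for epoch in range(0, 31):
    if epoch % interval == 0:
        # same formula as in decay(): divide the current rate by 1 + decay_rate * (epoch / interval)
        learning_rate = learning_rate / (1 + decay_rate * (epoch / interval))
        history.append((epoch, learning_rate))

# the rate is divided by 1, 2, 3 and 4 at epochs 0, 10, 20 and 30, i.e. by 24 in total
```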
        '''
        
        old_rate = self.learning_rate
        decay_rate = self.decay_rate
        interval = self.decay_interval
        if epoch % interval == 0:
            
#             print("old learning rate = ", old_rate, "\n")
            self.learning_rate = old_rate/(1 + decay_rate*(epoch/interval))
#             print("new learning rate = ", self.learning_rate, "\n")
     
    # Updating mean_test and var_test using exponentially weighted moving average
    def Update_mean_and_var(self):
        
        '''
        Parameters: self: 
                        beta1, mean_test[l], var_test[l], mean[l], var[l], l_dims --> to update mean_test and var_test
                        --> updating using exponentially weighted moving average
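
        Example: a minimal sketch of the moving-average update with assumed batch means.
        Starting from zeros, the first few averages stay well below the batch values.

```python
import numpy as np

beta1 = 0.9
running_mean = np.zeros((2, 1))                  # plays the role of mean_test[l]
batch_mean = np.array([[1.0], [2.0]])            # assumed mean of every mini batch

for _ in range(3):
    # exponentially weighted moving average, same formula as in Update_mean_and_var
    running_mean = beta1 * running_mean + (1 - beta1) * batch_mean

# after 3 identical batches: batch_mean * (1 - beta1**3) = batch_mean * 0.271
```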
        '''
        
        
        layer_dims = self.l_dims
        beta1 = self.beta1                                   # beta1 used in exponentially weighted moving average
            
        for l in range(1, len(layer_dims)):
            
            # parameter used for updating mean and variance using exponentially weighted moving average
            old_mean_test = self.params["mean_test" + str(l)]    # old mean_test calculated till previous mini_batch
            mean_batch = self.params["mean" + str(l)]            # mean calculated for the current mini_batch for layer l
            old_var_test = self.params["var_test" + str(l)]      # old var_test calculated till previous mini_batch
            var_batch = self.params["var" + str(l)]              # variance calculated for the current mini_batch for layer l
            
            # applying the formula and updating the values
            self.params["mean_test" + str(l)] = beta1*old_mean_test + (1-beta1)*mean_batch
            self.params["var_test" + str(l)] = beta1*old_var_test + (1-beta1)*var_batch
            
#             print("beta1 = ", beta1, "\n")
#             print("old mean_test" + str(l), " = ", old_mean_test, "\n")
#             print("mean" + str(l), "_batch = ", mean_batch, "\n")
#             print("new mean_test" + str(l), " = ", self.params["mean_test" + str(l)], "\n")
#             print("old var_test" + str(l), " = ", old_var_test, "\n")
#             print("var" + str(l), "_batch = ", var_batch, "\n")
#             print("new var_test" + str(l), " = ", self.params["var_test" + str(l)], "\n")
            
            
    # parameter initialization
    def Parameter_Initialization(self):
        
        '''
        Parameters: self: 
                        l_dims --> To iterate over each layer and access the # hidden unit 
                                     
                        params --> To initialize and store parameters
                                  1. W, b, gama_norm, beta_norm, V_dW, V_db, S_dW and S_db 
                                    
                        activation --> To get the factor for each layer, activation of that layer is needed
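
        Example: a minimal sketch of the two scaling factors for one layer, assuming 500 inputs
        and 400 units (illustrative sizes). The standard deviation of the sampled weights should
        end up close to the factor that was applied.

```python
import numpy as np

rng = np.random.default_rng(0)
fan_in, fan_out = 500, 400                       # assumed layer sizes

he_factor = np.sqrt(2 / fan_in)                  # used for relu and leaky_relu layers
xavier_factor = np.sqrt(1 / fan_in)              # used for the remaining activations

W_he = rng.standard_normal((fan_out, fan_in)) * he_factor
W_xavier = rng.standard_normal((fan_out, fan_in)) * xavier_factor
b = np.zeros((fan_out, 1))                       # biases start at zero
```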
                                          
        '''
#         print("Running Parameter Initialization function ! \n")
#         print("batch_norm = ", self.batch_norm, "\n")
#         print("learning Algorithm = ", self.learning_algo, "\n")
        
        activation_fun = self.activation        # activation function list for the layers
        layer_dims = self.l_dims                # list containing # hidden unit for each layer including input layer 
        
        for l in range(1, len(layer_dims)):
            
            # "he" initialization
            if activation_fun[l-1] == "relu" or activation_fun[l-1] == "leaky_relu":     
                
                mul_factor = np.sqrt(2/layer_dims[l-1])
                
                
            # "xavier" initialization for all other activation functions ("tanh", "sigmoid", "softmax", "linear")
            else:
                
                mul_factor = np.sqrt(1/layer_dims[l-1])
               
            self.params["W"+str(l)] = np.random.randn(layer_dims[l], layer_dims[l-1])*mul_factor     
            self.params["b"+str(l)] = np.zeros((layer_dims[l], 1))                                   
            
                
#             print("mul_factor = ", mul_factor, "\n")
#             print("W" + str(l), " = ", self.params["W" + str(l)], "\n")
#             print("b" + str(l), " = ", self.params["b" + str(l)], "\n")
                
            # batch_norm
            if self.batch_norm == True:
                
                # mean_test[l] is exponentially weighted avg values computed over all mini_batch_X
                # var_test[l] is exponentially weighted avg values computed over all mini_batch_X
                # mean[l] stores the mean of the current mini_batch_X
                # var[l] stores the variance of the current mini_batch_X
                
                shape_norm = (layer_dims[l], 1)
                self.params["gama" + str(l)] = np.random.randn(shape_norm[0], shape_norm[1])*mul_factor
                self.params["beta" + str(l)] = np.zeros((shape_norm))
                self.params["mean_test" + str(l)] = np.zeros((shape_norm)) 
                self.params["var_test" + str(l)] = np.zeros((shape_norm)) 
                self.params["mean" + str(l)] = np.zeros((shape_norm)) 
                self.params["var" + str(l)] = np.zeros((shape_norm)) 

#                 print("shape for norm = ", shape_norm, "\n")
#                 print("gama" + str(l), " = ", self.params["gama" + str(l)], "\n")
                
            
            # "momentum" learning algorithm or "Adam" learning algorithm
            if self.learning_algo == "momentum" or self.learning_algo == "Adam":
                
                self.params["V_dW"+str(l)] = np.zeros((layer_dims[l], layer_dims[l-1]))
                self.params["V_db"+str(l)] = np.zeros((layer_dims[l], 1))
                
#                 print("V_dW" + str(l), " = ", self.params["V_dW" + str(l)], "\n")
#                 print("V_db" + str(l), " = ", self.params["V_db" + str(l)], "\n")
                
                # for batch_norm
                if self.batch_norm == True:
                    
                    self.params["V_dgama" + str(l)] = np.zeros((layer_dims[l], 1))
                    self.params["V_dbeta" + str(l)] = np.zeros((layer_dims[l], 1))
                    
#                     print("V_dgama" + str(l), " = ", self.params["V_dgama" + str(l)], "\n")
#                     print("V_dbeta" + str(l), " = ", self.params["V_dbeta" + str(l)], "\n")
                

            # "RMSprop" learning algorithm or "Adam" learning algorithm
            if self.learning_algo == "RMSprop" or self.learning_algo == "Adam":
                
                self.params["S_dW"+str(l)] = np.zeros((layer_dims[l], layer_dims[l-1]))
                self.params["S_db"+str(l)] = np.zeros((layer_dims[l], 1))
                
#                 print("S_dW" + str(l), " = ", self.params["S_dW" + str(l)], "\n")
#                 print("S_db" + str(l), " = ", self.params["S_db" + str(l)], "\n")
                
                
                # for batch_norm
                if self.batch_norm == True:
                    
                    self.params["S_dgama" + str(l)] = np.zeros((layer_dims[l], 1))
                    self.params["S_dbeta" + str(l)] = np.zeros((layer_dims[l], 1))
                    
#                     print("S_dgama" + str(l), " = ", self.params["S_dgama" + str(l)], "\n")
#                     print("S_dbeta" + str(l), " = ", self.params["S_dbeta" + str(l)], "\n")
                

                    
    # forward propagation
    def Forward_Prop(self, X, grad_check = False, function = "fit"):
        
        '''
        Parameters: self: 
                        l_dims --> used just to iterate over the layers 
                        params --> used to calculate Z 
                        linear_model --> to keep track of Z
                        activation --> to get the type of activation function used at each layer
                        activation_model --> to keep track of A
                    
                    X --> training set
                        Note: stored in activation_model as "A0"
                              shape of X = (l_dims[0], batch_size)
                              l_dims[0] = n_features of X
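
        Example: a minimal standalone sketch of the two optional steps for one layer, batch
        normalization of Z followed by inverted dropout on A. The shapes, gamma = 1, beta = 0
        and keep_prob = 0.8 are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
Z = rng.standard_normal((3, 8)) * 5 + 2          # 3 units, 8 examples
eps = 1e-8

# batch norm: normalize each unit (row) over the mini batch, then scale and shift
mean = np.mean(Z, axis=1, keepdims=True)
var = np.var(Z, axis=1, keepdims=True)
Z_norm = (Z - mean) / np.sqrt(var + eps)
gamma, beta = np.ones((3, 1)), np.zeros((3, 1))
Z_out = gamma * Z_norm + beta

# relu activation followed by inverted dropout
A = np.maximum(0, Z_out)
keep_prob = 0.8
mask = (rng.random(A.shape) < keep_prob).astype(int)
A_drop = A * mask / keep_prob                    # scale up so the expected value is unchanged
```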
                              
        '''

        self.activation_model["A0"] = X 
        layer_dims = self.l_dims
        epsilon = self.epsilon                            # epsilon = 1e-8
        activation_fun = self.activation
        
#         print("check type = ", grad_check, "\n")
#         print("batch_norm = ", self.batch_norm, "\n")
#         print("Dropout = ", self.Dropout, "\n")
#         print("Used for = ", function, "\n")
        
        for l in range(1, len(layer_dims)): 
            
            # if gradient_checking is set to True
            if grad_check == True:
                
                W = self.params["W_check" + str(l)]
                b = self.params["b_check" + str(l)]
                
            else:
                
                W = self.params["W" + str(l)]
                b = self.params["b" + str(l)]
            
            A_prev = self.activation_model["A"+str(l-1)]
            
#             print("W" + str(l), " = ", W, "\n")
#             print("b" + str(l), " = ", b, "\n")
#             print("A_prev or A"+str(l-1), " = ",A_prev, "\n")
            
            # batch_normalization on Z[l] if batch_norm = True
            if self.batch_norm:
                
                Z_orig = np.dot(W, A_prev) + b
                self.linear_model["Z_orig" + str(l)] = Z_orig
#                 print("Z_orig = ", Z_orig, "\n")
                
                # During fitting the model
                if function == "fit":
                    
                    mean = np.mean(Z_orig, axis = 1, keepdims= True) # mean is calculated over each hidden unit (along row)
                    self.params["mean"+str(l)] = mean
                    var = np.var(Z_orig, axis = 1, keepdims= True)   # variance is calculated over each hidden unit (along row)
                    self.params["var" + str(l)] = var
                    
#                     print("mean for Z_orig"+ str(l), " = ", mean, "\n", "shape of mean = ", mean.shape, "\n")
#                     print("var for Z_orig"+ str(l), " = ", var , "\n", "shape of var = ", var.shape, "\n")
                
                
                # During predicting for the model
                else:
                    
                    mean = self.params["mean_test" + str(l)]
                    var = self.params["var_test" + str(l)]
                    
#                     print("mean used during testing = ", mean, "\n", "shape of mean = ", mean.shape, "\n")
#                     print("var used during testing = ", var , "\n", "shape of var = ", var.shape, "\n")
                
                
                Z_norm = (Z_orig - mean)/(np.sqrt(var + epsilon))
                self.linear_model["Z_norm" + str(l)] = Z_norm
                gama = self.params["gama" + str(l)]
                beta = self.params["beta" + str(l)]
                Z = np.multiply(gama, Z_norm) + beta
                self.linear_model["Z"+str(l)] = Z
                
#                 print("Z_norm"+str(l), " = ", Z_norm, "\n")
#                 print("gama"+str(l), " = ", gama, "\n")
#                 print("beta"+str(l), " = ", beta, "\n")
            
            # batch_norm = False
            else:
                
                Z = np.dot(W, A_prev) + b
                self.linear_model["Z"+str(l)] = Z
                
#             print("Z"+str(l), " = ", Z, "\n")
#             print("activation function of" + str(l) + "_th layer = ", activation_fun[l-1], "\n")
            # applying the activation function to Z[l]
            
            if activation_fun[l-1] == 'relu':
                
                A = self.relu(Z)                   # post_activation_function value
                
            elif activation_fun[l-1] == 'sigmoid':
                
                A = self.sigmoid(Z)
                
            elif activation_fun[l-1] == 'tanh' :
                
                A = np.tanh(Z)
                
            elif activation_fun[l-1] == "softmax":
                
                A = self.softmax(Z)
                
            elif activation_fun[l-1] == "linear":
                
                A = self.linear(Z)
                
            else:  # activation = 'leaky_relu'
                
                A = self.leaky_relu(Z)
            
            self.activation_model["A" + str(l)] = A
            # Applying Dropout (only applied during fitting and not in predicting)
            if self.Dropout and function == "fit":  
                
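                # NOTE: re-seeding here makes the mask depend only on the layer shape, so every
                # forward pass reuses the same mask; repeated passes are reproducible, but the
                # dropped units do not change from one iteration to the next
                np.random.seed(0)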
                shape_D = A.shape
                D = np.random.rand(shape_D[0], shape_D[1])
                self.params["D" + str(l)] = (D < self.keep_prob[l-1]).astype(int)
                A = np.multiply(A, self.params["D" + str(l)])
                self.activation_model["A" + str(l)] = A/self.keep_prob[l-1]    # scale up
                
#                 print("D"+str(l), " = ", self.params["D"+str(l)], "\n")
                
#             print("final A"+str(l), " = ", self.activation_model["A"+str(l)], "\n")    

    # cost of the model and dAL            
    def Cost_dAL_and_dZL(self, Y, grad_check = False):
        
        
        '''
        Parameters: self:
                        activation_model --> to get the output layer A (AL)
                        grads --> to store dAL (derivative of cost function w.r.t AL)
                        l_dims --> to access the AL
                        mini_batch_size --> used for calculating cost
                        cost_reg --> for "reg" model, which type of cost function is used
                        
                    Y --> used in calculating cost function
                               
        Returns:  cost function value for a given cost_type
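
        Example: a minimal sketch for the "binary" case without regularization, using assumed
        values for Z and Y. It checks that the shortcut dZL = AL - Y agrees with the chain
        rule dAL * sigmoid'(ZL).

```python
import numpy as np

Z = np.array([[0.5, -1.0, 2.0]])                 # assumed output-layer Z for 3 examples
Y = np.array([[1.0, 0.0, 1.0]])                  # assumed labels
AL = 1 / (1 + np.exp(-Z))                        # sigmoid output
m = Y.shape[1]

# binary cross-entropy cost, same expression as in Cost_dAL_and_dZL without the regularization term
cost = float(np.squeeze((-1 / m) * (np.dot(Y, np.log(AL).T) + np.dot(1 - Y, np.log(1 - AL).T))))

dAL = -np.divide(Y, AL) + np.divide(1 - Y, 1 - AL)
dZL_chain = dAL * AL * (1 - AL)                  # chain rule through the sigmoid
dZL_short = AL - Y                               # shortcut stored in grads
```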
        '''
#         print("check_type = ", grad_check, "\n")
#         print("model_type = ", self.model_type, "\n")
#         print("cost_reg = ", self.cost_reg, '\n')
        
        layer_dims = self.l_dims                    
        m = Y.shape[1]                                  # size of the current mini_batch / training set
        L = len(layer_dims) - 1                         # final layer index
        AL = self.activation_model["A"+str(L)]          # post activation value of output layer
        lambd = self.lambd                              # lambda --> used in calculating regularization cost
        frobenius_norm_square = 0                       # to get the sum of W[l]^2 over each layer
        
#         print("size of batch = ", m, "\n")
        for l in range(1, len(layer_dims)):
            
            # if gradient_checking is set to True
            if grad_check == True:
                
                W = self.params["W_check" + str(l)]
                b = self.params["b_check" + str(l)]
                
#                 print("W_check"+str(l), " = ", W, "\n")
#                 print("b_check"+str(l), " = ", b, "\n")
                
            else:
                
                W = self.params["W" + str(l)]
                b = self.params["b" + str(l)]
                
#                 print("W"+str(l), " = ", W, "\n")
#                 print("b"+str(l), " = ", b, "\n")
            
            frobenius_norm_square += np.power(np.linalg.norm(W),2)
            
#             print("frobenius_norm_square = ", frobenius_norm_square, "\n")
            
#         print("frobenius_norm_square final value = ", frobenius_norm_square, "\n")
#         print("A"+str(L), "= ", AL, "\n")
#         print("Y = ", Y, "\n")
        
        Regularization_cost = lambd*frobenius_norm_square/(2*m)
#         print("Regularization_cost = ", Regularization_cost, "\n")
        
        # for binary classification
        if self.model_type == "binary": 
            
            # for "binary" model, only one unit is used at the output layer
            # shape of AL --> (1, m), shape of Y --> (1, m) (Y is given as user input)
            
            cost = (-1/m)*(np.dot(Y, np.log(AL).T)+np.dot(1-Y, np.log(1-AL).T)) + Regularization_cost
            self.grads["dA"+str(L)] = -1*np.divide(Y, AL) + np.divide(1-Y, 1-AL)
            cost = np.squeeze(cost)                     # the dot products give a (1, 1) array; np.squeeze reduces it to a scalar
            self.grads["dZ" + str(L)] = AL - Y          # because activation function at output layer will always be sigmoid
            
        # for multi-class classification
        elif self.model_type == "multi":
            
            cost = (-1/m)*np.sum(np.multiply(Y, np.log(AL))) + Regularization_cost
            self.grads["dA"+str(L)] = -Y/AL
            cost = np.squeeze(cost)
            self.grads["dZ" + str(L)] = AL - Y
            
        
        # for regression 
        else:
            
            # for mean squared error
            if self.cost_reg == "MSE":

                cost = (1/m)*np.sum((Y-AL)*(Y-AL)) + Regularization_cost
                self.grads["dA"+str(L)] = -2*(Y-AL)
                cost = np.squeeze(cost)
                
            # for mean absolute error
            else:

                cost = (1/m)*np.sum(np.abs(Y-AL)) + Regularization_cost
                self.grads["dA"+str(L)] = np.sign(AL-Y)        # same as (AL-Y)/|Y-AL|, but 0 instead of nan where AL == Y
                cost = np.squeeze(cost)
                
            self.grads["dZ" + str(L)] = self.grads["dA" + str(L)]        # activation function is linear
            
#         print("cost = ", cost, "\n")
#         print("dA"+str(L), " = ", self.grads["dA"+str(L)], "\n")
#         print("dZ"+str(L), " = ", self.grads["dZ"+str(L)], "\n")
        
        return cost
    
    
    # back propagation
    def Back_Prop(self):
        
        '''
        Parameters: self: 
                        batch_size --> used in calculating grads
                        l_dims --> to iterate over each layer
                        grads --> to store derivatives 
                        linear_model, activation_model, params --> used to calculate grads
                        var --> stored variance is used in calculation
                                 
        '''
        
        layer_dims = self.l_dims
        L = len(layer_dims) - 1                  # output layer number
        m = self.grads["dA"+str(L)].shape[1]     # size of current mini batch
        I_m = np.ones((1, m))
        lambd = self.lambd
        epsilon = self.epsilon
        activation_fun = self.activation
        
#         print("size of batch = ", m, "\n")
#         print("I_m = ", I_m, "\n")
#         print("batch_norm = ", self.batch_norm, "\n")
#         print("Dropout = ", self.Dropout, "\n")
        
        # "dAL" and "dZL" were already calculated in Cost_dAL_and_dZL while computing the cost
        # d_Reg_W is derivative of regularization cost w.r.t W[l]
        
        for l in reversed(range(1, len(layer_dims))):
            
            # will be used in calculation of grads
            W = self.params["W" + str(l)]                     
            d_Reg_W = lambd*W/m
            dZ = self.grads["dZ" + str(l)]
            A_prev = self.activation_model["A" + str(l-1)]
            
            # when batch norm is applied to Z[l]
            if self.batch_norm == True:
                
                # calculating d_gama[l], d_beta[l], dZ_orig[l], dW[l], db[l], dA[l-1], dZ[l-1]
                # parameter from norm used in calculations.
                
                Z_norm = self.linear_model["Z_norm" + str(l)]
                var = self.params["var" + str(l)]
                gama = self.params["gama" + str(l)]
                
                # formulas applied to calculate d_gama[l], d_beta[l], dZ_orig[l], dW[l], db[l], dA[l-1], dZ[l-1]
                
                d_gama = np.sum(np.multiply(dZ, Z_norm), axis = 1, keepdims = True)/m
                d_beta = np.sum(dZ, axis = 1, keepdims = True)/m
                variable1 = m*dZ
                variable2 = np.dot(d_beta, I_m)
                variable3 = np.multiply(d_gama, Z_norm)
                variable4 = m*np.sqrt(var + epsilon)
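                # d_beta and d_gama are means over the batch (they were divided by m above), while the
                # batch norm gradient needs the sums over the batch, hence the factor m on variable2 and variable3
                # (assuming var is the biased variance of the current mini batch):
                # dZ_orig = gama*(m*dZ - sum(dZ) - Z_norm*sum(dZ.*Z_norm))/(m*sqrt(var + epsilon))
                dZ_orig = np.multiply(gama, (variable1 - m*variable2 - m*variable3))/variable4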
                dW = (1/m)*np.dot(dZ_orig, A_prev.T) + d_Reg_W
                db = (1/m)*np.sum(dZ_orig, axis = 1, keepdims = True)
                dA_prev = np.dot(W.T, dZ_orig)
                
                # storing the values in dictionary for further use
                self.grads["d_gama" + str(l)] = d_gama
                self.grads["d_beta" + str(l)] = d_beta
                self.grads["dZ_orig" + str(l)] = dZ_orig
                self.grads["dW"+str(l)] = dW
                self.grads["db"+str(l)] = db
                self.grads["dA"+str(l-1)] = dA_prev
                
#                 print("dZ"+str(l), " = ", dZ, "\n")
#                 print("Z_norm"+str(l), " = ", Z_norm, "\n")
#                 print("d_gama"+str(l), " = ", d_gama, "\n")
#                 print("d_beta"+str(l), " = ", d_beta, "\n")
#                 print("m*dZ"+str(l), " = ", variable1, "\n")
#                 print("d_beta"+str(l), "*I_m = ", variable2, "\n")
#                 print("d_gama_norm"+str(l),".*Z_norm"+str(l), " = ", variable3, "\n")
#                 print("m*sqrt(var"+str(l), "+ epsilon) = ", variable4, "\n")
#                 print("gama"+str(l), " = ", gama, "\n")
#                 print("dZ_orig"+str(l), " = ", dZ_orig, "\n")
#                 print("d_Reg_W = ", d_Reg_W, "\n")
#                 print("A"+str(l-1), " = ", A_prev, "\n")
#                 print("dW"+str(l), " = ", dW, "\n")
#                 print("db"+str(l), " = ", db, "\n")
#                 print("dA"+str(l-1), " = ", dA_prev, "\n")
                
            # When batch norm is not applied
            else:
                
                # formulas used to calculate dW[l], db[l], dA[l-1]
                dW = (1/m)*np.dot(dZ, A_prev.T) + d_Reg_W
                db = (1/m)*np.sum(dZ, axis = 1, keepdims = True)
                dA_prev = np.dot(W.T, dZ)
                
                # storing the values in grads dictionary for further use
                self.grads["dW"+str(l)] = dW
                self.grads["db"+str(l)] = db
                self.grads["dA"+str(l-1)] = dA_prev
                
                
#                 print("d_Reg_W = ", d_Reg_W, "\n")
#                 print("dZ"+str(l), " = ", self.grads["dZ"+str(l)], "\n")
#                 print("A"+str(l-1), " = ", self.activation_model["A"+str(l-1)], "\n")
#                 print("dW"+str(l), " = ", self.grads["dW"+str(l)], "\n")
#                 print("db"+str(l), " = ", self.grads["db"+str(l)], "\n")
#                 print("dA"+str(l-1), " = ", self.grads["dA"+str(l-1)], "\n")
                
            if l != 1:
                
                # Applying the Dropout mask of layer l-1 to "dA[l-1]" before "dZ[l-1]" is computed,
                # so that the units dropped in Forward_Prop receive no gradient.
                # Indexing follows the same convention as before: D[l] goes with keep_prob[l-1].
                if self.Dropout:

#                     print("dA"+str(l-1), " before = ", dA_prev, "\n")
                    D_prev = self.params["D" + str(l-1)]
                    dA_prev = np.multiply(dA_prev, D_prev)
                    dA_prev = dA_prev/self.keep_prob[l-2]
                    self.grads["dA" + str(l-1)] = dA_prev

#                     print("D"+str(l-1), " = ", D_prev, "\n")
#                     print("dA"+str(l-1), " after = ", dA_prev, "\n")
                
                Z_prev = self.linear_model["Z" + str(l-1)]
                grad_value = self.grad_activation(Z_prev, activation_fun[l-2])
                dZ_prev = np.multiply(dA_prev, grad_value)
                self.grads["dZ"+str(l-1)] = dZ_prev
                
#                 print("Z_prev = ", Z_prev)
#                 print("grad_value = ", grad_value, "\n")
#                 print("dZ"+str(l-1), " = ", dZ_prev, "\n")

    # updating the parameters Wl's and bl's        
    def Parameter_Update(self, t):
        
        '''
        Parameters: t --> number of updates done so far, used for bias correction in "Adam"
                    self:
                        learning_algo --> which algorithm to use for updating weights and biases of the model
                        l_dims --> to iterate over each layer
                        params, learning_rate, grads --> to update the parameters of the model
                        
        '''
        
        layer_dims = self.l_dims
        beta1 = self.beta1                                   # parameter used for "momentum" or "Adam" Algorithm
        beta2 = self.beta2                                   # parameter used for "RMSprop" or "Adam" Algorithm
        epsilon = self.epsilon
        
#         print("learning Algorithm = ", self.learning_algo, "\n")
#         print("batch_norm = ", self.batch_norm, "\n")
#         print("beta1 = ", beta1, "\n")
#         print("beta2 = ", beta2, "\n")
        
        # updating the parameter using different learning algorithms
        for l in range(1, len(layer_dims)):
            
            # "gd", standard gradient descent algorithm to update W's and b's
            if self.learning_algo == "gd":
                
                factor_dW = self.grads["dW" + str(l)]
                factor_db = self.grads["db" + str(l)]
                
                # for batch_norm
                if self.batch_norm == True:
                    
                    factor_dgama = self.grads["d_gama" + str(l)]
                    factor_dbeta = self.grads["d_beta" + str(l)] 
            
            
            # "momentum" algorithm
            elif self.learning_algo == "momentum":
                
                # parameter used for updating V_dW and V_db
                V_dW = self.params["V_dW" + str(l)]
                V_db = self.params["V_db" + str(l)]
                dW = self.grads["dW" + str(l)]
                db = self.grads["db" + str(l)]
                
                # updating and storing V_dW and V_db
                self.params["V_dW" + str(l)] = beta1*V_dW + (1-beta1)*dW
                self.params["V_db" + str(l)] = beta1*V_db + (1-beta1)*db
                
                factor_dW = self.params["V_dW" + str(l)]
                factor_db = self.params["V_db" + str(l)]
                
#                 print("V_dW" + str(l), " before = ", V_dW, "\n")
#                 print("V_db" + str(l), " before = ", V_db, "\n")
#                 print("dW" + str(l), " = ", dW, "\n")
#                 print("db" + str(l), " = ", db, "\n")
#                 print("V_dW" + str(l), " after = ", self.params["V_dW" + str(l)], "\n")
#                 print("V_db" + str(l), " after = ", self.params["V_db" + str(l)], "\n")
                
                
                # for batch_norm
                if self.batch_norm == True:
                    
                    # parameter used for updating V_dgama and V_dbeta
                    V_dgama = self.params["V_dgama" + str(l)]
                    V_dbeta = self.params["V_dbeta" + str(l)]
                    d_gama = self.grads["d_gama" + str(l)]
                    d_beta = self.grads["d_beta" + str(l)]
                    
                    # updating and storing V_dgama and V_dbeta
                    self.params["V_dgama" + str(l)] = beta1*V_dgama + (1 - beta1)*d_gama
                    self.params["V_dbeta" + str(l)] = beta1*V_dbeta + (1 - beta1)*d_beta
                    
                    factor_dgama = self.params["V_dgama" + str(l)]
                    factor_dbeta = self.params["V_dbeta" + str(l)]
                    
#                     print("V_dgama"+str(l), " before = ", V_dgama, "\n")
#                     print("V_dbeta"+str(l), " before = ", V_dbeta, "\n")
#                     print("d_gama"+str(l), " = ", d_gama, "\n")
#                     print("d_beta"+str(l), " = ", d_beta, "\n")
#                     print("V_dgama"+str(l), " after = ", self.params["V_dgama"+str(l)], "\n")
#                     print("V_dbeta"+str(l), " after = ", self.params["V_dbeta"+str(l)], "\n")
                
                    
            # "RMSprop" algorithm
            elif self.learning_algo == "RMSprop":
                
                # parameter used for updating S_dW and S_db
                S_dW = self.params["S_dW" + str(l)]
                S_db = self.params["S_db" + str(l)]
                dW = self.grads["dW" + str(l)]
                db = self.grads["db" + str(l)]
                
                # updating and storing S_dW and S_db
                self.params["S_dW" + str(l)] = beta2*S_dW + (1-beta2)*np.power(dW, 2)
                self.params["S_db" + str(l)] = beta2*S_db + (1-beta2)*np.power(db, 2)
                
                factor_dW = dW/(np.sqrt(self.params["S_dW" + str(l)]) + epsilon)
                factor_db = db/(np.sqrt(self.params["S_db" + str(l)]) + epsilon)
                
#                 print("S_dW"+str(l), " before = ", S_dW, "\n")
#                 print("S_db"+str(l), " before = ", S_db, "\n")
#                 print("dW"+str(l), " = ", dW, "\n")
#                 print("db"+str(l), " = ", db, "\n")
#                 print("S_dW"+str(l), " after = ", self.params["S_dW"+str(l)], "\n")
#                 print("S_db"+str(l), " after = ", self.params["S_db"+str(l)], "\n")
                
                
                # for batch_norm
                if self.batch_norm == True:
                    
                    # parameter used for updating S_dgama and S_dbeta
                    S_dgama = self.params["S_dgama" + str(l)]
                    S_dbeta = self.params["S_dbeta" + str(l)]
                    d_gama = self.grads["d_gama" + str(l)]
                    d_beta = self.grads["d_beta" + str(l)]
                    
                    # updating and storing S_dgama and S_dbeta
                    self.params["S_dgama" + str(l)] = beta2*S_dgama + (1 - beta2)*np.power(d_gama, 2)
                    self.params["S_dbeta" + str(l)] = beta2*S_dbeta + (1 - beta2)*np.power(d_beta, 2)
                    
                    factor_dgama = d_gama/(np.sqrt(self.params["S_dgama" + str(l)]) + epsilon)
                    factor_dbeta = d_beta/(np.sqrt(self.params["S_dbeta" + str(l)]) + epsilon)
                
#                     print("S_dgama"+str(l), " before = ", S_dgama, "\n")
#                     print("S_dbeta"+str(l), " before = ", S_dbeta, "\n")
#                     print("d_gama"+str(l), " = ", d_gama, "\n")
#                     print("d_beta"+str(l), " = ", d_beta, "\n")
#                     print("S_dgama"+str(l), " after = ", self.params["S_dgama"+str(l)], "\n")
#                     print("S_dbeta"+str(l), " after = ", self.params["S_dbeta"+str(l)], "\n")
                    
                    
            # "Adam" algorithm
            else:
                
                # parameter used to update V_dW , V_db, S_dW and S_db
                V_dW = self.params["V_dW" + str(l)]
                V_db = self.params["V_db" + str(l)]
                S_dW = self.params["S_dW" + str(l)]
                S_db = self.params["S_db" + str(l)]
                dW = self.grads["dW" + str(l)]
                db = self.grads["db" + str(l)]
                corr_factor_V = (1 - np.power(beta1, t))      # correction factor for V_dW and V_db
                corr_factor_S = (1 - np.power(beta2, t))      # correction factor for S_dW and S_db
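                # Worked example of the bias correction (a sketch, assuming V_dW starts at zero,
                # which depends on Parameter_Initialization, and taking beta1 = 0.9 for illustration):
                #     t = 1  -->  V_dW = 0.9*0 + 0.1*dW = 0.1*dW,  corr_factor_V = 1 - 0.9 = 0.1
                #                 V_dW_corrected = 0.1*dW/0.1 = dW
                # so the first corrected estimate equals the gradient itself instead of being
                # shrunk towards zero; as t grows, corr_factor_V approaches 1 and the
                # correction fades out.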
                
                # updating and storing V_dW, V_db, S_dW and S_db
                self.params["V_dW" + str(l)] = beta1*V_dW + (1-beta1)*dW
                self.params["V_db" + str(l)] = beta1*V_db + (1-beta1)*db
                self.params["S_dW" + str(l)] = beta2*S_dW + (1-beta2)*np.power(dW, 2)
                self.params["S_db" + str(l)] = beta2*S_db + (1-beta2)*np.power(db, 2)
                
                # corrected values of V_dW, V_db, S_dW and S_db
                V_dW_corrected = self.params["V_dW" + str(l)]/corr_factor_V
                V_db_corrected = self.params["V_db" + str(l)]/corr_factor_V
                S_dW_corrected = self.params["S_dW" + str(l)]/corr_factor_S
                S_db_corrected = self.params["S_db" + str(l)]/corr_factor_S
                
                # factors for updating W and b for neural layer
                factor_dW = V_dW_corrected/(np.sqrt(S_dW_corrected) + epsilon)
                factor_db = V_db_corrected/(np.sqrt(S_db_corrected) + epsilon)
                
#                 print("V_dW"+str(l), " before = ", V_dW, "\n")
#                 print("V_db"+str(l), " before = ", V_db, "\n")
#                 print("S_dW"+str(l), " before = ", S_dW, "\n")
#                 print("S_db"+str(l), " before = ", S_db, "\n")
#                 print("dW"+str(l), " = ", dW, "\n")
#                 print("db"+str(l), " = ", db, "\n")
#                 print("V_dW" + str(l), " after = ", self.params["V_dW" + str(l)], "\n")
#                 print("V_db" + str(l), " after = ", self.params["V_db" + str(l)], "\n")
#                 print("S_dW"+str(l), " after = ", self.params["S_dW"+str(l)], "\n")
#                 print("S_db"+str(l), " after = ", self.params["S_db"+str(l)], "\n")
#                 print("corr_factor_V = ", corr_factor_V, "\n")
#                 print("corr_factor_S = ", corr_factor_S, "\n")
#                 print("V_dW_corrected = ", V_dW_corrected, "\n")
#                 print("V_db_corrected = ", V_db_corrected, "\n")
#                 print("S_dW_corrected = ", S_dW_corrected, "\n")
#                 print("S_db_corrected = ", S_db_corrected, "\n")
                
                # for batch_norm
                if self.batch_norm == True:
                    
                    # parameter used for updating V_dgama, V_dbeta, S_dgama and S_dbeta
                    V_dgama = self.params["V_dgama" + str(l)]
                    V_dbeta = self.params["V_dbeta" + str(l)]
                    S_dgama = self.params["S_dgama" + str(l)]
                    S_dbeta = self.params["S_dbeta" + str(l)]
                    d_gama = self.grads["d_gama" + str(l)]
                    d_beta = self.grads["d_beta" + str(l)]
                    
                    # updating and storing V_dgama, V_dbeta, S_dgama and S_dbeta
                    self.params["V_dgama" + str(l)] = beta1*V_dgama + (1 - beta1)*d_gama
                    self.params["V_dbeta" + str(l)] = beta1*V_dbeta + (1 - beta1)*d_beta
                    self.params["S_dgama" + str(l)] = beta2*S_dgama + (1 - beta2)*np.power(d_gama, 2)
                    self.params["S_dbeta" + str(l)] = beta2*S_dbeta + (1 - beta2)*np.power(d_beta, 2)
                    
                    # corrected values of V_dgama, V_dbeta, S_dgama and S_dbeta
                    V_dgama_corrected = self.params["V_dgama" + str(l)]/corr_factor_V
                    V_dbeta_corrected = self.params["V_dbeta" + str(l)]/corr_factor_V
                    S_dgama_corrected = self.params["S_dgama" + str(l)]/corr_factor_S
                    S_dbeta_corrected = self.params["S_dbeta" + str(l)]/corr_factor_S

                    # factors for updating gama and beta
                    factor_dgama = V_dgama_corrected/(np.sqrt(S_dgama_corrected) + epsilon)
                    factor_dbeta = V_dbeta_corrected/(np.sqrt(S_dbeta_corrected) + epsilon)
                
#                     print("V_dgama"+str(l), " before = ", V_dgama, "\n")
#                     print("V_dbeta"+str(l), " before = ", V_dbeta, "\n")
#                     print("V_dgama"+str(l), " after = ", self.params["V_dgama"+str(l)], "\n")
#                     print("V_dbeta"+str(l), " after = ", self.params["V_dbeta"+str(l)], "\n")
#                     print("d_gama"+str(l), " = ", d_gama, "\n")
#                     print("d_beta"+str(l), " = ", d_beta, "\n")
#                     print("S_dgama"+str(l), " before = ", S_dgama, "\n")
#                     print("S_dbeta"+str(l), " before = ", S_dbeta, "\n")
#                     print("S_dgama"+str(l), " after = ", self.params["S_dgama"+str(l)], "\n")
#                     print("S_dbeta"+str(l), " after = ", self.params["S_dbeta"+str(l)], "\n")
#                     print("d_gama"+str(l), " = ", d_gama, "\n")
#                     print("d_beta"+str(l), " = ", d_beta, "\n")
                    
#                     print("V_dgama_corrected = ", V_dgama_corrected, "\n")
#                     print("V_dbeta_corrected = ", V_dbeta_corrected, "\n")
#                     print("S_dgama_corrected = ", S_dgama_corrected, "\n")
#                     print("S_dbeta_corrected = ", S_dbeta_corrected, "\n")

            
#             print("factor_dW = ", factor_dW, "\n")
#             print("factor_db = ", factor_db, "\n")
#             print("W"+str(l), " before = ", self.params["W"+str(l)], "\n")    
#             print("b"+str(l), " before = ", self.params["b"+str(l)], "\n")    
            
            self.params["W"+str(l)] -= self.learning_rate*factor_dW
            self.params["b"+str(l)] -= self.learning_rate*factor_db
            
#             print("W"+str(l), " after = ", self.params["W"+str(l)], "\n")    
#             print("b"+str(l), " after = ", self.params["b"+str(l)], "\n")    
            
            # updating for batch_norm
            if self.batch_norm == True:
                
#                 print("factor_dgama = ", factor_dgama, "\n")
#                 print("factor_dbeta = ", factor_dbeta, "\n")
#                 print("gama"+str(l), " before = ", self.params["gama"+str(l)], "\n")    
#                 print("beta"+str(l), " before = ", self.params["beta"+str(l)], "\n")    

                self.params["gama" + str(l)] -= self.learning_rate*factor_dgama
                self.params["beta" + str(l)] -= self.learning_rate*factor_dbeta
                
#                 print("gama"+str(l), " after = ", self.params["gama"+str(l)], "\n")    
#                 print("beta"+str(l), " after = ", self.params["beta"+str(l)], "\n")    

    
    # training the model to get optimum values of Wl's and bl's
    def fit(self, X, Y):
        
        '''
        Parameters: self: 
                        l_dims, Parameter_Initialization, Forward_Prop, Cost_dAL_and_dZL, Back_Prop, Parameter_Update --> to train the model
                        
                    X --> whole training set
                    Y --> true label set
                    shape of X --> (no. of training examples, no. of features)
                    shape of Y --> (no. of training examples, 1) -->  for "reg" or "binary" model
                               --> (no. of training examples, no. of class) --> for "multi" model
                    
        '''
        
        X = X.T                                      # to change the shape of X to (no. of features , no. of training example)
        
        if self.model_type == "multi":
            Y = Y.T.reshape(Y.shape[1], Y.shape[0])  # to change the shape of Y to (no. of class, no. of training ex.) 
            
        else:
            Y = Y.T.reshape(1, Y.shape[0])           # to change the shape to (1, no. of training example) 
        
        n_x = X.shape[0]                             # no. of features of X or dimension of zeroth layer or input layer
        self.l_dims.insert(0, n_x)                   # inserting the n_x at i = 0 position in l_dims list.
        M = X.shape[1]                               # total no. of training example
        self.Parameter_Initialization()
        costs = []
        epochs = []
        t = 0
#         print('shape of X = ', X.shape, "\n")
#         print("shape of Y = ", Y.shape, "\n")
        for epoch in range(self.max_epochs):
            
            seed = epoch + 1   # a new seed every epoch, so that the mini batches are shuffled differently each time
            mini_batches = self.Random_mini_batches(X, Y, seed)
            cost_total = 0    # to store the cost for each pass/epoch

            for batch in mini_batches:
                
                (mini_batch_X, mini_batch_Y) = batch
#                 print("mini_batch_X = ", mini_batch_X, "\n")
#                 print("mini_batch_Y = ", mini_batch_Y, "\n")
                self.Forward_Prop(mini_batch_X)
                
                # updating mean_test and var_test for each mini_batch
                # using exponentially weighted moving average when applied batch_norm
                if self.batch_norm:
                    
                    self.Update_mean_and_var()
                
                cost_total += self.Cost_dAL_and_dZL(mini_batch_Y)
                self.Back_Prop()
                t = t+1  # parameter for "Adam" learning Algorithm
                self.Parameter_Update(t)
            
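            # each cost is already a mean over its mini batch, so average over the number of mini batches
            cost_avg = cost_total/len(mini_batches)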
            
            if self.is_decay:
                
                self.decay(epoch)
            
            # Print the cost every 200 epochs and record it every 100 epochs
            if epoch % 200 == 0:
                print ("Cost after epoch %i: %f" %(epoch, cost_avg))
            if epoch % 100 == 0:
                costs.append(cost_avg)
                epochs.append(epoch)
                
        self.grad_checking(X, Y)
        
        return (costs, epochs)
        
    # predict function
    def predict(self, X):

        '''
        Parameters:- self:
                         params --> learned parameters from fit method
                     X --> samples for which prediction is to be done
                           shape of X --> (no. of samples, no. of features)
        Return: y_hat of shape (no. of samples, 1)
        '''
        X = X.T                   # to convert (# of sample, # of features) --> (# of features, # of samples)
        M = X.shape[1]            # number of samples 
        L = len(self.l_dims) - 1  # index for output layer 
        self.Forward_Prop(X, function = "predict")      # Forward_prop to get the AL(post activation function value of output layer)
        AL = self.activation_model["A"+str(L)]

        # for "binary" classification
        if self.model_type == "binary":

            # last layer activation function --> "sigmoid"
            # if AL >= 0.5 (threshold) --> y_hat = 1
            # else --> y_hat = 0

            AL[AL >= 0.5] = 1
            AL[AL < 0.5] = 0
            y_hat = AL.reshape((M, 1))

        # for "multi" class classification
        elif self.model_type == "multi":

            # AL is of shape (# of classes, # of samples)
            # AL[:,0] --> first column, with c rows whose values sum to 1 (probabilities)
            # the index of the maximum probability in each column is the predicted class
            # e.g. AL = [[0.1, 0.7], [0.9, 0.3]] --> y_hat = [[1], [0]]

            y_hat = np.argmax(AL, axis = 0).reshape((M, 1))

        # for "reg" problem
        else:

            # AL values are your y_hat
            # just reshape them

            y_hat = AL.reshape((M, 1))

        return y_hat