<a href="https://colab.research.google.com/github/raj-vijay/dl/blob/master/21_Gradient_Descent_in_TensorFlow.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Gradient descent**

Gradient descent is an optimization algorithm that minimizes a function by iteratively moving in the direction of steepest descent, which is given by the negative of the gradient. In machine learning, we use gradient descent to update the parameters of our model.
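
As a minimal sketch of this update rule in plain Python (not TensorFlow), the loop below repeatedly steps against the gradient. The function `gradient_descent` and the quadratic example are hypothetical illustrations and do not come from the notebook.

```python
def gradient_descent(grad, x0, learning_rate=0.1, steps=100):
    """Repeatedly step against the gradient, starting from x0."""
    x = x0
    for _ in range(steps):
        # Move in the direction of the negative gradient
        x = x - learning_rate * grad(x)
    return x

# Hypothetical example: f(x) = (x - 3)**2 has gradient 2 * (x - 3)
# and a single minimum at x = 3.
x_min = gradient_descent(lambda x: 2.0 * (x - 3.0), x0=0.0)
print(x_min)
```

On this convex function every step shrinks the distance to the minimum by a constant factor, so the result ends up very close to 3.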

**Stochastic gradient descent (SGD) optimizer**
- tf.keras.optimizers.SGD()
- learning_rate (typically between 0.001 and 0.5; determines how quickly the model parameters adjust during training)
- Simple and easy to interpret
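
To illustrate how the learning rate controls the speed of adjustment, the sketch below counts how many plain gradient steps are needed on a toy function for two different learning rates. The helper `steps_to_converge`, the function f(x) = x**2, and the tolerance are assumptions chosen for illustration.

```python
def steps_to_converge(learning_rate, x0=10.0, tol=1e-3, max_steps=100_000):
    """Count gradient steps on f(x) = x**2 until |x| < tol."""
    x = x0
    for step in range(max_steps):
        if abs(x) < tol:
            return step
        x -= learning_rate * 2.0 * x  # the gradient of x**2 is 2x
    return max_steps

fast = steps_to_converge(0.1)
slow = steps_to_converge(0.001)
print(fast, slow)
```

The smaller learning rate needs far more steps to reach the same tolerance. On less well-behaved functions a large learning rate can also overshoot, which this toy example does not show.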

**Root mean squared (RMS) propagation optimizer**
- Applies different learning rates to each feature
- tf.keras.optimizers.RMSprop()
- learning_rate
- momentum
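- rho (the decay factor for the moving average of squared gradients; named decay in the TensorFlow 1.x RMSPropOptimizer)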

Allows for momentum to both build and decay
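
A rough sketch of a single RMSprop update for one scalar parameter, in plain Python, may help show why each parameter effectively gets its own learning rate. This is the textbook form without momentum. The function name `rmsprop_step` is hypothetical, and the Keras implementation may differ in details such as where epsilon is applied.

```python
import math

def rmsprop_step(x, grad, avg_sq, learning_rate=0.01, rho=0.9, eps=1e-7):
    """One RMSprop update for a single scalar parameter (no momentum)."""
    # Moving average of squared gradients, decayed by rho
    avg_sq = rho * avg_sq + (1.0 - rho) * grad ** 2
    # Scale the step by the root of that average
    x = x - learning_rate * grad / (math.sqrt(avg_sq) + eps)
    return x, avg_sq

# Hypothetical example: two parameters whose gradients differ by a factor of 1000
x_a, s_a = rmsprop_step(0.0, grad=1000.0, avg_sq=0.0)
x_b, s_b = rmsprop_step(0.0, grad=1.0, avg_sq=0.0)
print(x_a, x_b)
```

Because each step is divided by the parameter's own gradient scale, both parameters move by almost the same amount even though their raw gradients differ by three orders of magnitude.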

**Adaptive moment (adam) optimizer**
- tf.keras.optimizers.Adam()
- learning_rate
- beta_1

Performs well with default parameter values
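
The following plain-Python sketch shows the standard Adam update for one scalar parameter, assuming the usual bias-corrected form. The function name `adam_step` is hypothetical. The argument names mirror the Keras ones, but the Keras implementation may differ in small details.

```python
import math

def adam_step(x, grad, m, v, t, learning_rate=0.001,
              beta_1=0.9, beta_2=0.999, eps=1e-7):
    """One Adam update for a single scalar parameter at step t (t starts at 1)."""
    m = beta_1 * m + (1.0 - beta_1) * grad        # moving average of gradients
    v = beta_2 * v + (1.0 - beta_2) * grad ** 2   # moving average of squared gradients
    m_hat = m / (1.0 - beta_1 ** t)               # bias correction for the zero start
    v_hat = v / (1.0 - beta_2 ** t)
    x = x - learning_rate * m_hat / (math.sqrt(v_hat) + eps)
    return x, m, v

# Hypothetical example: the first step on f(x) = (x - 3)**2, starting from x = 0
first_x, _, _ = adam_step(0.0, grad=2.0 * (0.0 - 3.0), m=0.0, v=0.0, t=1)
print(first_x)

# A short run of 100 steps on the same function
x, m, v = 0.0, 0.0, 0.0
for t in range(1, 101):
    x, m, v = adam_step(x, 2.0 * (x - 3.0), m, v, t, learning_rate=0.01)
print(x)
```

On the first step the bias-corrected ratio is close to the sign of the gradient, so the step size is roughly the learning rate whatever the gradient's magnitude. This is one reason the default values tend to work across differently scaled problems.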

**Default of Credit Card Clients Dataset**

This dataset comes from a study of customers' default payments in Taiwan that compared the predictive accuracy of the probability of default among six data mining methods. 

From the perspective of risk management, the result of predictive accuracy of the estimated probability of default will be more valuable than the binary result of classification - credible or not credible clients. 

Because the real probability of default is unknown, this study presented the novel Sorting Smoothing Method to estimate the real probability of default. With the real probability of default as the response variable (Y), and the predictive probability of default as the independent variable (X), the simple linear regression result (Y = A + BX) shows that the forecasting model produced by artificial neural network has the highest coefficient of determination; its regression intercept (A) is close to zero, and regression coefficient (B) to one. 

Therefore, among the six data mining techniques, artificial neural network is the only one that can accurately estimate the real probability of default.



Install the Kaggle package to access the Default of Credit Card Clients dataset from Kaggle.

In [None]:
!pip install kaggle



Create a .kaggle directory in the home directory (/root on Colab) to hold the Kaggle authentication JSON.

In [None]:
!mkdir ~/.kaggle

Copy the uploaded kaggle.json to ~/.kaggle/kaggle.json.

In [None]:
!cp /content/kaggle.json ~/.kaggle/kaggle.json

chmod 600 (equivalent to chmod a+rwx,u-x,g-rwx,o-rwx) sets the permissions so that the (u)ser/owner can read and write but not execute, while the (g)roup and (o)thers cannot read, write, or execute.
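
As a small illustration of what mode 600 means, the sketch below applies it to a throwaway temporary file (a stand-in for kaggle.json, not the real file) and reads the permission bits back. It assumes a POSIX system such as the Colab runtime.

```python
import os
import stat
import tempfile

# Create a throwaway file and restrict it to owner read/write only
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    path = tmp.name
os.chmod(path, 0o600)

st_mode = os.stat(path).st_mode
mode = stat.S_IMODE(st_mode)    # keep only the permission bits
symbolic = stat.filemode(st_mode)
print(oct(mode), symbolic)

os.remove(path)
```

The octal digits map to owner, group, and others in that order: 6 is read (4) plus write (2), and 0 grants nothing.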

In [None]:
!chmod 600 /root/.kaggle/kaggle.json

Download the Default of Credit Card Clients dataset from Kaggle.

In [None]:
!kaggle datasets download -d uciml/default-of-credit-card-clients-dataset

Downloading default-of-credit-card-clients-dataset.zip to /content
  0% 0.00/0.98M [00:00<?, ?B/s]
100% 0.98M/0.98M [00:00<00:00, 68.7MB/s]


In [None]:
import pandas as pd
file = '/content/default-of-credit-card-clients-dataset.zip'
data = pd.read_csv(file, compression = 'zip')

In [None]:
data.head()

Unnamed: 0,ID,LIMIT_BAL,SEX,EDUCATION,MARRIAGE,AGE,PAY_0,PAY_2,PAY_3,PAY_4,PAY_5,PAY_6,BILL_AMT1,BILL_AMT2,BILL_AMT3,BILL_AMT4,BILL_AMT5,BILL_AMT6,PAY_AMT1,PAY_AMT2,PAY_AMT3,PAY_AMT4,PAY_AMT5,PAY_AMT6,default.payment.next.month
0,1,20000.0,2,2,1,24,2,2,-1,-1,-2,-2,3913.0,3102.0,689.0,0.0,0.0,0.0,0.0,689.0,0.0,0.0,0.0,0.0,1
1,2,120000.0,2,2,2,26,-1,2,0,0,0,2,2682.0,1725.0,2682.0,3272.0,3455.0,3261.0,0.0,1000.0,1000.0,1000.0,0.0,2000.0,1
2,3,90000.0,2,2,2,34,0,0,0,0,0,0,29239.0,14027.0,13559.0,14331.0,14948.0,15549.0,1518.0,1500.0,1000.0,1000.0,1000.0,5000.0,0
3,4,50000.0,2,2,1,37,0,0,0,0,0,0,46990.0,48233.0,49291.0,28314.0,28959.0,29547.0,2000.0,2019.0,1200.0,1100.0,1069.0,1000.0,0
4,5,50000.0,1,2,1,57,-1,0,-1,0,0,0,8617.0,5670.0,35835.0,20940.0,19146.0,19131.0,2000.0,36681.0,10000.0,9000.0,689.0,679.0,0


The input layer contains 3 features: 
1. Education
2. Marital status, and
3. Age

which are available as borrower_features. 

The hidden layer contains 2 nodes and the output layer contains a single node.

In [None]:
borrower_features = data[['EDUCATION','MARRIAGE','AGE']]

In [None]:
borrower_features.head()

Unnamed: 0,EDUCATION,MARRIAGE,AGE
0,2,1,24
1,2,2,26
2,2,2,34
3,2,1,37
4,2,1,57


For each layer, we take the previous layer as an input, initialize a set of weights, compute the product of the inputs and weights, and then apply an activation function. 
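
The per-layer recipe can be sketched in plain Python without TensorFlow. The helper `dense`, the weight values, and the input values below are hypothetical. They only mirror a network with 3 inputs, 2 hidden nodes, and 1 output node.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dense(inputs, weights, bias, activation=sigmoid):
    """inputs: n floats; weights: n x m nested list; bias: m floats."""
    outputs = []
    for j in range(len(bias)):
        # Product of the inputs and column j of the weights, plus the bias
        z = sum(x * row[j] for x, row in zip(inputs, weights)) + bias[j]
        outputs.append(activation(z))
    return outputs

# Hypothetical values: 3 input features -> 2 hidden nodes -> 1 output node
features = [0.2, 0.1, 0.5]
hidden = dense(features, weights=[[0.1, -0.2], [0.4, 0.3], [-0.5, 0.2]], bias=[0.0, 0.0])
output = dense(hidden, weights=[[0.7], [-0.3]], bias=[0.1])
print(hidden, output)
```

Each call takes the previous layer's outputs as its inputs, and the sigmoid keeps every activation strictly between 0 and 1.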

In [None]:
borrower_features = data[['EDUCATION', 'MARRIAGE', 'AGE', 'BILL_AMT1', 'BILL_AMT2', 'BILL_AMT3', 'BILL_AMT4', 'BILL_AMT5', 'BILL_AMT6']]

In [None]:
borrower_features.head()

Unnamed: 0,EDUCATION,MARRIAGE,AGE,BILL_AMT1,BILL_AMT2,BILL_AMT3,BILL_AMT4,BILL_AMT5,BILL_AMT6
0,2,1,24,3913.0,3102.0,689.0,0.0,0.0,0.0
1,2,2,26,2682.0,1725.0,2682.0,3272.0,3455.0,3261.0
2,2,2,34,29239.0,14027.0,13559.0,14331.0,14948.0,15549.0
3,2,1,37,46990.0,48233.0,49291.0,28314.0,28959.0,29547.0
4,2,1,57,8617.0,5670.0,35835.0,20940.0,19146.0,19131.0


In [None]:
default = data['default.payment.next.month']

In [None]:
import tensorflow as tf

In [None]:
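# Convert to float32 tensors; the targets get shape (n, 1) to match the model output
borrower_features = tf.constant(borrower_features.to_numpy(), tf.float32)
default = tf.constant(default.to_numpy().reshape(-1, 1), tf.float32)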

In [None]:
# Define the model function
def model(bias, weights, features = borrower_features):
  product = tf.matmul(features, weights)
  return tf.keras.activations.sigmoid(product+bias)

In [None]:
# Compute the predicted values and loss
def loss_function(bias, weights, targets = default, features = borrower_features):
  # Pass features through so that a caller-supplied value is actually used
  predictions = model(bias, weights, features)
  return tf.keras.losses.binary_crossentropy(targets, predictions)

In [None]:
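# Initialize the trainable parameters (the original cell did not define them;
# these starting values are an assumption): one weight per feature and a scalar bias
weights = tf.Variable(tf.random.normal([9, 1]))
bias = tf.Variable(0.0)
# Minimize the loss function with RMS propagation
opt = tf.keras.optimizers.RMSprop(learning_rate=0.01, momentum=0.9)
opt.minimize(lambda: loss_function(bias, weights), var_list=[bias, weights])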

**Dangers of local minima**

Here we search for the global minimum of loss_function() using tf.keras.optimizers.SGD(). 

We do this twice, each time with a different initial value of the input to loss_function(). 

First, use x_1, which is a variable with an initial value of 6.0. 
Second, use x_2, which is a variable with an initial value of 0.3. 

The loss_function() is plotted and defined below.

![alt text](https://assets.datacamp.com/production/repositories/3953/datasets/42876c85cba5c14941a3fac191eff75b41597112/local_minima_dots_4_10.png)
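
To see how the starting point decides which minimum is reached, here is a plain-Python sketch (no TensorFlow) of the function 4cos(x - 1) + cos(2πx)/x, with a hand-derived derivative and ordinary gradient descent. The helper names `loss`, `loss_grad`, and `descend` are hypothetical, and the run only approximates what an SGD optimizer would do with the same learning rate.

```python
import math

def loss(x):
    return 4.0 * math.cos(x - 1.0) + math.cos(2.0 * math.pi * x) / x

def loss_grad(x):
    """Analytic derivative of loss(x)."""
    return (-4.0 * math.sin(x - 1.0)
            - 2.0 * math.pi * math.sin(2.0 * math.pi * x) / x
            - math.cos(2.0 * math.pi * x) / x ** 2)

def descend(x, learning_rate=0.01, steps=100):
    for _ in range(steps):
        x -= learning_rate * loss_grad(x)
    return x

end_a = descend(6.0)   # start to the right of the deep basin
end_b = descend(0.3)   # start inside a shallow basin near zero
print(end_a, loss(end_a))
print(end_b, loss(end_b))
```

In this sketch the two runs settle in different basins, and the run that starts at 0.3 stops at a noticeably higher loss than the run that starts at 6.0.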

In [None]:
import math

def loss_function(x):
  # Use TensorFlow ops so that gradients can flow back to x;
  # math.cos would convert the tensor to a plain Python float
  return 4.0*tf.cos(x-1) + tf.divide(tf.cos(2.0*math.pi*x), x)

In [None]:
# Initialize x_1 and x_2
x_1 = tf.Variable(6.0, dtype=tf.float32)
x_2 = tf.Variable(0.3, dtype=tf.float32)

# Define the optimization operation
opt = tf.keras.optimizers.SGD(learning_rate=0.01)

for j in range(100):
	# Perform minimization using the loss function and x_1
	opt.minimize(lambda: loss_function(x_1), var_list=[x_1])
	# Perform minimization using the loss function and x_2
	opt.minimize(lambda: loss_function(x_2), var_list=[x_2])

# Print x_1 and x_2 as numpy arrays
print(x_1.numpy(), x_2.numpy())

6.027515 0.25


**Avoiding local minima**

We had a simple optimization problem in one variable and gradient descent still failed to deliver the global minimum when we had to travel through local minima first. 

One way to avoid this problem is to use momentum, which allows the optimizer to break through local minima. 
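
A minimal sketch of how momentum builds up, in plain Python: with a constant gradient, the classical momentum update keeps part of the previous step, so the step size grows toward learning_rate * grad / (1 - momentum). The helper `momentum_steps` is hypothetical, and this is the classical form. Keras RMSprop applies its momentum to the rescaled gradient instead.

```python
def momentum_steps(grad, learning_rate=0.01, momentum=0.9, steps=200):
    """Return the step taken at each iteration for a constant gradient."""
    velocity = 0.0
    history = []
    for _ in range(steps):
        # Keep a fraction of the previous step and add the new gradient step
        velocity = momentum * velocity + learning_rate * grad
        history.append(velocity)
    return history

with_momentum = momentum_steps(grad=1.0, momentum=0.9)
without_momentum = momentum_steps(grad=1.0, momentum=0.0)
print(with_momentum[0], with_momentum[-1])
print(without_momentum[0], without_momentum[-1])
```

The larger accumulated step is what can carry the parameter through a shallow dip instead of stopping in it.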

In [None]:
# Initialize x_1 and x_2
x_1 = tf.Variable(0.05, dtype=tf.float32)
x_2 = tf.Variable(0.05, dtype=tf.float32)

# Define the optimization operation for opt_1 and opt_2
opt_1 = tf.keras.optimizers.RMSprop(learning_rate=0.01, momentum=0.99)
opt_2 = tf.keras.optimizers.RMSprop(learning_rate=0.01, momentum=0.00)

for j in range(100):
  # Define the minimization operation for opt_1
  opt_1.minimize(lambda: loss_function(x_1), var_list=[x_1])
  # Define the minimization operation for opt_2
  opt_2.minimize(lambda: loss_function(x_2), var_list=[x_2])

# Print x_1 and x_2 as numpy arrays
print(x_1.numpy(), x_2.numpy())

2.744511 0.24999999
