# Training Models

## Implementing Batch Gradient Descent with early stopping for Softmax Regression

We will implement Batch Gradient Descent with early stopping for Softmax Regression without using Scikit-Learn for the model itself. We only import the iris dataset from Scikit-Learn to test the algorithm.

In [5]:
import numpy as np
import scipy as sp
import matplotlib.pyplot as plt

from sklearn.datasets import load_iris

In [10]:
iris = load_iris()
X = iris['data']      # feature matrix, shape (150, 4)
y = iris['target']    # class indices 0, 1, 2

Next, add the bias term ($x_0 = 1$) to every instance and split the data into training, validation and test sets:

In [11]:
X_with_bias = np.c_[np.ones([len(X), 1]), X]
np.random.seed(1234)

test_ratio = 0.2
validation_ratio = 0.2
total_size = len(X_with_bias)

test_size = int(total_size * test_ratio)
validation_size = int(total_size * validation_ratio)
train_size = total_size - test_size - validation_size

rnd_indices = np.random.permutation(total_size)

X_train = X_with_bias[rnd_indices[:train_size]]
y_train = y[rnd_indices[:train_size]]
X_valid = X_with_bias[rnd_indices[train_size:-test_size]]
y_valid = y[rnd_indices[train_size:-test_size]]
X_test = X_with_bias[rnd_indices[-test_size:]]
y_test = y[rnd_indices[-test_size:]]
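
As a quick sanity check on the slicing above, the following sketch repeats the same index arithmetic for 150 rows (the size of iris) and confirms the sizes of the three index sets. The helper name `split_indices` is hypothetical and introduced only for illustration; it also uses a local random generator rather than the global seed.

```python
import numpy as np

def split_indices(total_size, test_ratio=0.2, validation_ratio=0.2, seed=1234):
    """Hypothetical helper mirroring the slicing used above."""
    test_size = int(total_size * test_ratio)
    validation_size = int(total_size * validation_ratio)
    train_size = total_size - test_size - validation_size

    rng = np.random.RandomState(seed)  # local generator, leaves the global seed untouched
    rnd_indices = rng.permutation(total_size)

    train_idx = rnd_indices[:train_size]
    valid_idx = rnd_indices[train_size:-test_size]
    test_idx = rnd_indices[-test_size:]
    return train_idx, valid_idx, test_idx

train_idx, valid_idx, test_idx = split_indices(150)
print(len(train_idx), len(valid_idx), len(test_idx))  # 90 30 30
```

Note that the negative-index slicing assumes `test_size > 0`: with a test size of zero, `rnd_indices[-0:]` would select every row.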

Now we have to convert the class indices into target class probabilities (in this case, one-hot vectors) in order to train the Softmax Regression model:

In [19]:
def one_hot(Y):
    n_classes = Y.max() + 1              # classes are assumed to be labeled 0..K-1
    m = len(Y)
    Y_one_hot = np.zeros((m, n_classes))
    Y_one_hot[np.arange(m), Y] = 1       # set column Y[i] of row i to 1
    return Y_one_hot

In [20]:
y_valid[:10]

array([1, 0, 1, 2, 1, 1, 1, 0, 0, 0])

In [21]:
one_hot(y_valid[:10])

array([[ 0.,  1.,  0.],
       [ 1.,  0.,  0.],
       [ 0.,  1.,  0.],
       [ 0.,  0.,  1.],
       [ 0.,  1.,  0.],
       [ 0.,  1.,  0.],
       [ 0.,  1.,  0.],
       [ 1.,  0.,  0.],
       [ 1.,  0.,  0.],
       [ 1.,  0.,  0.]])

The test worked, so we can now convert all our target values into one-hot matrices.

In [23]:
y_train_prob = one_hot(y_train)
y_valid_prob = one_hot(y_valid)
y_test_prob = one_hot(y_test)
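
One caveat: `one_hot` infers the number of classes from `Y.max()`, so a subset that happened to lack the highest class would produce a narrower matrix than the others. That is unlikely to matter for a random split of iris, but a variant that takes the class count explicitly avoids the issue. The name `one_hot_fixed` and its `n_classes` parameter are hypothetical, shown only as a sketch.

```python
import numpy as np

def one_hot_fixed(Y, n_classes):
    """Hypothetical variant of one_hot with an explicit class count."""
    Y = np.asarray(Y)
    m = len(Y)
    Y_one_hot = np.zeros((m, n_classes))
    Y_one_hot[np.arange(m), Y] = 1
    return Y_one_hot

# A subset containing only classes 0 and 1 still gets three columns.
demo = one_hot_fixed([1, 0, 1], n_classes=3)
print(demo)
```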

We can now write the softmax function:

In [24]:
def softmax(sk_X):
    top = np.exp(sk_X)                            # exponentiate every class score
    bottom = np.sum(top, axis=1, keepdims=True)   # per-row sum, kept 2-D for broadcasting
    return top / bottom
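
The direct formula can overflow when the scores are large, because `np.exp` returns `inf` for float64 inputs above roughly 709. A common variant, sketched here under the hypothetical name `softmax_stable`, subtracts each row's maximum before exponentiating; this leaves the result mathematically unchanged because the shift cancels between numerator and denominator.

```python
import numpy as np

def softmax_stable(logits):
    """Row-wise softmax that subtracts each row's maximum before exponentiating."""
    shifted = logits - np.max(logits, axis=1, keepdims=True)
    exps = np.exp(shifted)
    return exps / np.sum(exps, axis=1, keepdims=True)

logits = np.array([[1.0, 2.0, 3.0],
                   [1000.0, 1000.0, 1000.0]])  # np.exp(1000.0) would overflow to inf
probs = softmax_stable(logits)
print(probs.sum(axis=1))  # each row sums to 1
```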

In [25]:
n_inputs = X_train.shape[1]
n_outputs = len(np.unique(y_train))

In [26]:
print(n_inputs, n_outputs)

5 3


Now we can train the model. The two equations we need are the cost function and its gradients.

* The cost function:

$$J(\mathbf{\Theta}) =
-\dfrac{1}{m}\sum_{i=1}^{m}\sum_{k=1}^{K}{y_k^{(i)}\log\left(\hat{p}_k^{(i)}\right)}$$

* And the equation for the gradients:

$$\nabla_{\mathbf{\theta}^{(k)}} \, J(\mathbf{\Theta}) = \dfrac{1}{m} \sum_{i=1}^{m}{ \left ( \hat{p}^{(i)}_k - y_k^{(i)} \right ) \mathbf{x}^{(i)}}$$
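
The two equations translate fairly directly into vectorized NumPy. The sketch below assumes the parameters are stored as a matrix `Theta` of shape `(n_inputs, K)`, so that column `k` holds $\mathbf{\theta}^{(k)}$; the function name `cost_and_gradients` and the tiny random dataset are made up for illustration and are not part of the notebook.

```python
import numpy as np

def softmax(logits):
    exps = np.exp(logits - logits.max(axis=1, keepdims=True))
    return exps / exps.sum(axis=1, keepdims=True)

def cost_and_gradients(Theta, X, Y_one_hot):
    """Cross-entropy cost and its gradient for Theta of shape (n_inputs, K)."""
    m = len(X)
    P = softmax(X @ Theta)                      # estimated probabilities, shape (m, K)
    cost = -np.sum(Y_one_hot * np.log(P)) / m   # J(Theta)
    grads = X.T @ (P - Y_one_hot) / m           # column k is the gradient w.r.t. theta^(k)
    return cost, grads

# Tiny made-up dataset (illustrative values, not from iris).
rng = np.random.RandomState(0)
X_demo = np.c_[np.ones(6), rng.randn(6, 2)]     # bias column plus two features
Y_demo = np.eye(3)[[0, 1, 2, 0, 1, 2]]          # one-hot targets
Theta_demo = rng.randn(3, 3)

cost, grads = cost_and_gradients(Theta_demo, X_demo, Y_demo)
print(cost, grads.shape)
```

Stacking the per-class gradients as columns means a single matrix product computes all $K$ of them at once, and the result has the same shape as `Theta`, which keeps the update step a one-line subtraction.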

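Note that $\log\left(\hat{p}_k^{(i)}\right)$ is undefined when $\hat{p}_k^{(i)} = 0$ (NumPy returns `-inf`, and multiplying it by a zero target yields `nan`). So we will add a tiny value $\epsilon$ to $\hat{p}_k^{(i)}$ inside the logarithm, computing $\log\left(\hat{p}_k^{(i)} + \epsilon\right)$, to avoid getting nan values.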
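
To make the failure mode concrete, here is a small sketch with a saturated prediction that contains exact zeros. The value `1e-7` for `epsilon` is an assumption chosen for illustration; any small positive constant plays the same role.

```python
import numpy as np

epsilon = 1e-7  # assumed value; any small positive constant plays the same role

Y_one_hot = np.array([[1.0, 0.0, 0.0]])
P = np.array([[1.0, 0.0, 0.0]])   # a saturated prediction containing exact zeros

# Silence the expected warnings: log(0) is -inf, and 0 * -inf is nan.
with np.errstate(divide="ignore", invalid="ignore"):
    naive = -np.mean(np.sum(Y_one_hot * np.log(P), axis=1))

safe = -np.mean(np.sum(Y_one_hot * np.log(P + epsilon), axis=1))  # finite

print(naive, safe)
```

The price is a small bias in the reported cost (on the order of `epsilon`), which is negligible for monitoring training or for early stopping.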