# Softmax Regression
Softmax regression generalizes Logistic Regression: instead of **two** possible classes, the problem allows for more than two, which is called **multi-class** classification.


# Logistic Regression
In Logistic Regression, our challenge was **binary**: we need to distinguish between two  classes.   

We have as our inputs and outputs:
* A set of "m" samples, each a vector of "n" input features: 
$$X_i = \{x_{1i},x_{2i},x_{3i},...,x_{ni}\}$$
* A set of corresponding "m" outputs, each either 1 (positive/signal class) or 0 (negative/background class)
$$y_i \in \{0, 1\}$$

In this case, we defined the output of our classifier as:
$$h_\theta(X)=g(X\theta)= {1\over{1+e^{-X\theta}}}$$

and we said that for a given set of features for one sample,  $h_\theta(X)$ is the probability that y=1 for that specific set of features.   


We defined this cost function $J(\theta)$:
 $$J(\theta) = \sum_{i=1}^m \left[-y^{(i)}\log(h_\theta(X^{(i)})) - (1-y^{(i)})\log(1-h_\theta(X^{(i)}))\right]$$
 
 And the gradient of the cost function with respect to $\theta$:
 $${\partial J\over \partial \theta_j} = \sum_{i=1}^m(h_\theta(X^{(i)})  -y^{(i)})\cdot X_j^{(i)}$$
 
 
The goal of logistic regression is to choose the parameters $\theta$ so that our predictions $h_\theta(X)$  are as close to our sample classes *y* as possible, by minimizing the cost function $J(\theta)$.
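
As a minimal sketch of the hypothesis, cost, and gradient formulas above, the NumPy code below evaluates them on a tiny made-up data set. The function names (`sigmoid`, `logistic_cost_and_gradient`) and the toy arrays are assumptions chosen for illustration; they are not part of the original material.

```python
import numpy as np

def sigmoid(z):
    # g(z) = 1 / (1 + exp(-z))
    return 1.0 / (1.0 + np.exp(-z))

def logistic_cost_and_gradient(theta, X, y):
    # X is m by (n+1) with a leading column of ones, y holds 0/1 labels
    h = sigmoid(X @ theta)                                    # predicted P(y=1) per sample
    cost = np.sum(-y * np.log(h) - (1 - y) * np.log(1 - h))   # summed over the m samples
    grad = X.T @ (h - y)                                      # one entry per theta_j
    return cost, grad

# Hypothetical toy data: 3 samples, 1 feature plus the ones column
X_toy = np.array([[1.0, 0.5], [1.0, -1.5], [1.0, 2.0]])
y_toy = np.array([1.0, 0.0, 1.0])
theta0 = np.zeros(2)

cost0, grad0 = logistic_cost_and_gradient(theta0, X_toy, y_toy)
print("cost:", cost0, "gradient:", grad0)
```

With all parameters at zero every prediction is 0.5, so the cost is just 3 times log(2), which is a convenient sanity check before any fitting is done.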

This is illustrated with the figure below.   In this case, we assume we have 3 input features and 1 output.   We end up with **4** $\theta$ values.

![alt text](https://github.com/big-data-analytics-physics/data/blob/master/images/logistic_classification.jpg?raw=true)



# Softmax Regression
In Softmax Regression, we need to distinguish between more than two  classes.   
This is illustrated with the figure below.   In this case, we assume we have 3 input features and 3 output classes.  Remember that the outputs in our data sample are 1-hot: only 1 output is true at a time.   We end up with **12** $\theta$ values: 4 for each of the outputs.

![alt text](https://github.com/big-data-analytics-physics/data/blob/master/images/softmax_classifcation.jpg?raw=true)




We have as our inputs and outputs:
* A set of "m" samples, each a vector of "n" input features: 
$$X_i = \{x_{1i},x_{2i},x_{3i},...,x_{ni}\}$$
In the figure above, n=3.
* A set of corresponding "m" output vectors of length k, where only one of the k values is 1.0, and the others are 0.0 (this is the one-hot encoding):
$$y_i = \{y_{0i},y_{1i},y_{2i},...,y_{(k-1)i}\}$$
In the Figure above, k=3
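
To make the one-hot encoding concrete, here is a small sketch that converts integer class labels into one-hot rows using plain NumPy. The helper name `one_hot` and the example labels are assumptions for illustration only.

```python
import numpy as np

def one_hot(labels, num_classes):
    # Build an m by k array with a single 1.0 per row, in the column of the true class
    labels = np.asarray(labels, dtype=int)
    encoded = np.zeros((labels.size, num_classes))
    encoded[np.arange(labels.size), labels] = 1.0
    return encoded

# Hypothetical labels for m=4 samples and k=3 classes
y_labels = np.array([0, 2, 1, 2])
y_oneHot = one_hot(y_labels, num_classes=3)
print(y_oneHot)
```

Taking `np.argmax` along each row recovers the original integer labels, which is also how predictions are read off from the class probabilities.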

In this case, our classifier has "k" outputs, each of the form:
$$p_j = {{e^{X\theta_j}}\over{\sum_{\ell=0}^{k-1}e^{X\theta_\ell}}}$$

where $p_j$ is the probability that the class is $j$ (with $j = 0,...,k-1$) for that specific set of features, and $\theta_j$ is the vector of parameters for output $j$.   


We will define our cost function $J(\theta)$:
$$J(\theta) = -{1\over{m}}\sum_{i=1}^m \sum_{j=0}^{k-1}1[y^{(i)}=j] ~\log{e^{X^{(i)}\theta_j}\over{\sum_{\ell=0}^{k-1}e^{X^{(i)}\theta_\ell}}}$$
Note that the term $1[y^{(i)}=j]$ equals 1 when the true class of sample $i$ is $j$ (that is, when the one-hot output $y_j^{(i)}=1$), and 0 otherwise.

 And the gradient of the cost function with respect to $\theta$ for a single output $j$ is:
 $${\partial J\over{\partial \theta_j}} = -{1\over{m}}\sum_{i=1}^m X^{(i)} \left( 1[y^{(i)}=j]-{e^{X^{(i)}\theta_j}\over{\sum_{\ell=0}^{k-1} e^{X^{(i)}\theta_\ell} }} \right) $$
 Note that this is a vector of length (n+1) (for the n features plus the $\theta_0$ term), and note also that there are "k" of these vectors (one for each output).

We end up with **(n+1)*k** $\theta$ values.


## Implementation of the softmax function
We want to implement this:

$$p_j = {{e^{X\theta_j}}\over{\sum_{\ell=0}^{k-1}e^{X\theta_\ell}}}$$

where $p_j$ is the probability that the class is $j$ for that specific set of features.   


In [None]:
#
# softmax regression
import numpy as np

def softmax(Theta,Xp):
  # data has m input rows, n features, k outputs
  
  # assume Theta is (n+1) by k matrix
  # assume Xp is an m by (n+1) matrix
  z = np.dot(Xp,Theta)   # this is now an m by k matrix
  z = z - np.max(z,axis=1,keepdims=True)   # subtract each row's max, helps with big numbers
                         # see: https://stats.stackexchange.com/questions/304758/softmax-overflow
  expz = np.exp(z)       # compute the exponentials only once
  res = expz / np.sum(expz,axis=1,keepdims=True)   # each row now sums to 1
  return res  
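
The sketch below illustrates why the maximum is subtracted before exponentiating: shifting every score in a row by the same constant leaves the softmax unchanged, but it keeps `np.exp` from overflowing. The helper names `softmax_naive` and `softmax_shifted` and the score values are assumptions for illustration; the shift is applied per row here, which is one common variant.

```python
import numpy as np

def softmax_naive(z):
    # Direct translation of the formula, no protection against overflow
    e = np.exp(z)
    return e / np.sum(e, axis=1, keepdims=True)

def softmax_shifted(z):
    # Subtract each row's maximum first; the result is mathematically identical
    z = z - np.max(z, axis=1, keepdims=True)
    e = np.exp(z)
    return e / np.sum(e, axis=1, keepdims=True)

z_small = np.array([[1.0, 2.0, 3.0]])
z_large = z_small + 1000.0          # same score differences, much larger values

# exp(1001) overflows to inf, so the naive version computes inf/inf = nan
with np.errstate(over="ignore", invalid="ignore"):
    p_naive = softmax_naive(z_large)
p_shifted = softmax_shifted(z_large)

print("naive:  ", p_naive)
print("shifted:", p_shifted)
```

The shifted version returns the same probabilities for `z_large` as the naive version does for `z_small`, since only the differences between scores matter.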


## Next we implement the cost for softmax

Our cost function again $J(\theta)$:
$$J(\theta) = -{1\over{m}}\sum_{i=1}^m \sum_{j=0}^{k-1}1 \left\{ y^{(i)}=j \right\} ~\log{e^{X^{(i)}\theta_j}\over{\sum_{\ell=0}^{k-1}e^{X^{(i)}\theta_\ell}}}$$
Note that the term $1\left\{y^{(i)}=j\right\}$ equals 1 when the true class of sample $i$ is $j$ (that is, when the one-hot output $y_j^{(i)}=1$), and 0 otherwise.

The function we will actually implement below has an additional term for **regularization**, controlled by $\lambda$.   We initially set $\lambda$ to 0.
$$J(\theta) = -{1\over{m}}\sum_{i=1}^m \sum_{j=0}^{k-1}1 \left\{ y^{(i)}=j \right\} ~\log{e^{X^{(i)}\theta_j}\over{\sum_{\ell=0}^{k-1}e^{X^{(i)}\theta_\ell}}} +{\lambda\over{2}} \sum_{\ell=0}^{k-1}\lVert\theta_\ell\rVert^2$$


In [None]:
def calc_cost_softmax(Theta,Xp,yp_oneHot,Lambda):
  m = Xp.shape[0] #First we get the number of training examples
  probs = softmax(Theta,Xp) #m by k matrix of class probabilities
  #Cross-entropy averaged over the samples, plus the regularization term
  cost = (-1.0 / m) * np.sum(yp_oneHot * np.log(probs)) + (Lambda/2.0)*np.sum(np.square(Theta))
  return cost
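
A quick sanity check on the cost: when every parameter is zero, all k classes get probability 1/k, so the unregularized cost must equal log(k). The sketch below verifies this on random made-up data; `softmax_rows`, `cross_entropy_cost`, and the toy arrays are assumed names for illustration, written to mirror the formula above rather than to replace the notebook's own functions.

```python
import numpy as np

def softmax_rows(z):
    # Row-wise softmax with the usual max-shift for numerical stability
    z = z - np.max(z, axis=1, keepdims=True)
    e = np.exp(z)
    return e / np.sum(e, axis=1, keepdims=True)

def cross_entropy_cost(Theta, X, y_oneHot, lam=0.0):
    # Average cross-entropy plus (lam/2) times the sum of squared parameters
    m = X.shape[0]
    probs = softmax_rows(X @ Theta)
    return (-1.0 / m) * np.sum(y_oneHot * np.log(probs)) + (lam / 2.0) * np.sum(Theta ** 2)

rng = np.random.default_rng(0)
m, n, k = 6, 3, 10                                      # hypothetical sizes
X_toy = np.hstack([np.ones((m, 1)), rng.normal(size=(m, n))])
y_toy = np.eye(k)[rng.integers(0, k, size=m)]           # random one-hot labels
Theta_zero = np.zeros((n + 1, k))

cost_zero = cross_entropy_cost(Theta_zero, X_toy, y_toy)
print(cost_zero, np.log(k))
```

For the ten digit classes this starting cost is log(10), roughly 2.30, so a first-iteration cost far from that value usually points to a bug in the data preparation or the cost function.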


## Next we implement the gradient for softmax
Our gradient of the cost function with respect to $\theta$ for a single output $j$:
 $${\partial J\over{\partial \theta_j}} = -{1\over{m}}\sum_{i=1}^m X^{(i)} \left( 1\left\{y^{(i)}=j\right\}-{e^{X^{(i)}\theta_j}\over{\sum_{\ell=0}^{k-1} e^{X^{(i)}\theta_\ell} }} \right) $$
 Note that this is a vector of length (n+1) (for the n features plus the $\theta_{0j}$ term), and note also that there are "k" of these vectors (one for each output).
 
 Including the term for regularization, we get:
 $${\partial J\over{\partial \theta_j}} = -{1\over{m}}\sum_{i=1}^m X^{(i)} \left( 1\left\{y^{(i)}=j\right\}-{e^{X^{(i)}\theta_j}\over{\sum_{\ell=0}^{k-1} e^{X^{(i)}\theta_\ell} }} \right) + \lambda\theta_j$$
 


In [None]:
def calc_gradient_softmax(Theta,Xp,yp_oneHot,Lambda):
  m = Xp.shape[0] #First we get the number of training examples
  probs = softmax(Theta,Xp) #m by k matrix of class probabilities
  #Gradient is an (n+1) by k matrix, the same shape as Theta
  grad = (-1.0 / m) * np.dot(Xp.T,(yp_oneHot - probs)) + Lambda*Theta
  return grad



## Combine gradient and cost
Since both routines calculate the probabilities for all of our samples, it makes no sense to do this twice, so let's combine both functions into one:

In [None]:
def calc_cost_and_gradient_softmax(Theta,Xp,yp_oneHot,Lambda):
  m = Xp.shape[0] #First we get the number of training examples
  probs = softmax(Theta,Xp)
  cost = (-1 / m) * np.sum(yp_oneHot * np.log(probs)) + (Lambda/2.0)*np.sum(np.square(Theta))
  grad = (-1 / m) * np.dot(Xp.T,(yp_oneHot - probs)) + Lambda*Theta #And compute the gradient for that loss
  return cost,grad
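
One way to gain confidence in an analytic gradient is to compare it with a finite-difference estimate of the cost. The sketch below does this on small random data. The names `softmax_rows`, `cost_and_gradient`, and `numerical_gradient`, the step size `eps`, and the toy sizes are all assumptions for illustration; the cost and gradient expressions follow the formulas above.

```python
import numpy as np

def softmax_rows(z):
    z = z - np.max(z, axis=1, keepdims=True)
    e = np.exp(z)
    return e / np.sum(e, axis=1, keepdims=True)

def cost_and_gradient(Theta, X, y_oneHot, lam):
    m = X.shape[0]
    probs = softmax_rows(X @ Theta)
    cost = (-1.0 / m) * np.sum(y_oneHot * np.log(probs)) + (lam / 2.0) * np.sum(Theta ** 2)
    grad = (-1.0 / m) * (X.T @ (y_oneHot - probs)) + lam * Theta
    return cost, grad

def numerical_gradient(Theta, X, y_oneHot, lam, eps=1e-6):
    # Central difference: perturb one parameter at a time and re-evaluate the cost
    num = np.zeros_like(Theta)
    for idx in np.ndindex(*Theta.shape):
        step = np.zeros_like(Theta)
        step[idx] = eps
        c_plus, _ = cost_and_gradient(Theta + step, X, y_oneHot, lam)
        c_minus, _ = cost_and_gradient(Theta - step, X, y_oneHot, lam)
        num[idx] = (c_plus - c_minus) / (2.0 * eps)
    return num

rng = np.random.default_rng(1)
m, n, k = 8, 3, 4                                       # hypothetical sizes
X_toy = np.hstack([np.ones((m, 1)), rng.normal(size=(m, n))])
y_toy = np.eye(k)[rng.integers(0, k, size=m)]
Theta_toy = rng.normal(scale=0.1, size=(n + 1, k))

_, analytic = cost_and_gradient(Theta_toy, X_toy, y_toy, lam=0.1)
numeric = numerical_gradient(Theta_toy, X_toy, y_toy, lam=0.1)
max_diff = np.max(np.abs(analytic - numeric))
print("largest difference:", max_diff)
```

This check is far too slow for the full 785 by 10 parameter matrix on every iteration, but running it once on a small problem is a cheap way to catch sign or indexing mistakes.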

## Iterating until we converge
The basic algorithm then to implement gradient descent looks like this:
1. Initialize each of the $\theta$ parameters to some reasonable value (0 is common, or a random number).   Remember that $\theta$ has two axes:
  * One axis of length (n features + 1)
  * A separate axis of length (k) outputs
2. Choose a learning rate $\alpha$, a maximum number of allowed iterations, and a precision for the cost decrease to reach.   We will leave $\lambda$ as 0.0.
3. Have an outer loop that checks that we have not exceeded our maximum number of allowed iterations **AND** that the cost is still decreasing.
4. Calculate the gradient and update our parameters like so:
$$\theta_j := \theta_j - \alpha {\partial J\over \partial \theta_j}(\theta)$$
5. Calculate the cost for this iteration and compare it to the cost of the previous iteration.
6. If the change in cost is small enough (below our chosen precision), declare victory and jump out of the loop.

It is helpful to keep track of the cost for each iteration, so you can plot it and inspect its behavior.   And of course you need to keep track of the last value of the $\theta$ parameters so you can return them.

An implementation of this iteration algorithm is shown below.

In [None]:

def fit_data(Xp,yp_oneHot,learningRate,max_iterations,scale=True,delta=0.001,Lambda=0.0,iterations_min=2):
#
# Get the initial values
  m,features = Xp.shape   # this has the true "n" features +1 for the "ones" column
#
# How many outputs do we have
  m,outputs = yp_oneHot.shape
#
# Set the starting theta values
  Theta = np.zeros((features,outputs))
  print("Starting theta",Theta.shape)
  costList = []
#
# Start from a large "previous" cost so the first iteration always counts as a decrease
  cost = 1000000
  cost_change = delta+0.1
  iterations = 0
#
# In the while loop, "delta" is the precision: stop once the cost drops by less than delta
  while (iterations<max_iterations) and (cost_change>delta):
    last_cost = cost
#
# Get the cost and gradient
    cost,grad = calc_cost_and_gradient_softmax(Theta,Xp,yp_oneHot,Lambda)
    #print("cost,grad ",cost,grad)
#
# Update the theta parameters
    Theta = Theta - learningRate*grad
#
# Calculate the cost change
    cost_change = last_cost - cost
#
# Store the cost
    costList.append(cost)
    iterations += 1
    
  return Theta,iterations,costList
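
Before running on the digit images, it can help to try the same loop on a small problem where the answer is obvious. The sketch below runs gradient descent on three well-separated 2D clusters. Everything in it (the cluster centers, `softmax_rows`, `cost_and_gradient`, the learning rate and stopping values) is an assumption made up for this illustration, not part of the notebook's data or functions.

```python
import numpy as np

def softmax_rows(z):
    z = z - np.max(z, axis=1, keepdims=True)
    e = np.exp(z)
    return e / np.sum(e, axis=1, keepdims=True)

def cost_and_gradient(Theta, X, y_oneHot):
    m = X.shape[0]
    probs = softmax_rows(X @ Theta)
    cost = (-1.0 / m) * np.sum(y_oneHot * np.log(probs))
    grad = (-1.0 / m) * (X.T @ (y_oneHot - probs))
    return cost, grad

# Hypothetical toy data: 3 classes, 50 points each, 2 features plus the ones column
rng = np.random.default_rng(42)
centers = np.array([[-3.0, 0.0], [3.0, 0.0], [0.0, 4.0]])
labels = np.repeat(np.arange(3), 50)
features = centers[labels] + rng.normal(scale=0.5, size=(150, 2))
X_toy = np.hstack([np.ones((150, 1)), features])
y_toy = np.eye(3)[labels]

Theta = np.zeros((3, 3))                 # (n+1) by k
alpha, max_iterations, delta = 0.1, 500, 1e-5
costs = []
last_cost = np.inf
for iteration in range(max_iterations):
    cost, grad = cost_and_gradient(Theta, X_toy, y_toy)
    costs.append(cost)
    if last_cost - cost < delta:         # cost no longer decreasing enough: stop
        break
    Theta = Theta - alpha * grad         # gradient descent update
    last_cost = cost

preds = np.argmax(softmax_rows(X_toy @ Theta), axis=1)
accuracy = np.mean(preds == labels)
print("iterations:", len(costs), "final cost:", costs[-1], "accuracy:", accuracy)
```

The stored `costs` list can be plotted against the iteration number to inspect how quickly the cost falls for a given learning rate.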


## Get the Data
We will use the MNIST data sample to test our softmax regression algorithm.


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Form our test and train data
from sklearn.model_selection import train_test_split

#short = ""
short = "short_"
dfCombined = pd.DataFrame()
#
# Read in digits
for digit in range(10):
  print("digit",digit)
  fname = 'https://raw.githubusercontent.com/big-data-analytics-physics/data/master/ch3/digit_' + short + str(digit) + '.csv'
  df = pd.read_csv(fname,header=None)
  df['digit'] = digit
  dfCombined = pd.concat([dfCombined, df])


## Make Separate Test and Train Samples
We will do a simple 70/30 split to form our Train/Test sample.

We also need to:
* Scale the input data.   Since we know the input pixel data goes from 0-255, we can just divide by 255.
* Add the ones column to the input features.
* Convert our output labels to 1-hot.   We will use a **keras** utility for this.

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.utils import shuffle
from keras.utils import to_categorical   # older Keras versions expose this as keras.utils.np_utils.to_categorical

train_digits,test_digits = train_test_split(dfCombined, test_size=0.3, random_state=42)
yTrain = train_digits['digit'].values
XTrain = train_digits[train_digits.columns[:784]].values   # DataFrame.as_matrix was removed in pandas 1.0

yTest = test_digits['digit'].values
XTest = test_digits[test_digits.columns[:784]].values

#
# one hot encode the labels
num_classes = len(np.unique(yTrain))
print("Number distinct classes ",num_classes)
yTrain_oneHot = to_categorical(yTrain, num_classes=num_classes)
yTest_oneHot = to_categorical(yTest, num_classes=num_classes)
for i in range(10):
  print("digit ",yTrain[i],"encoding",yTrain_oneHot[i])
  
#
# We need to normalize our data - just divide by 255!
XTrain = XTrain/255.0
XTest = XTest / 255.0
#
# Add the ones column to the test and train sets
ones = np.ones((len(XTrain),1))
XTrain = np.append(ones,XTrain,axis=1)
ones = np.ones((len(XTest),1))
XTest = np.append(ones,XTest,axis=1)


In [None]:
iterations_max = 100
iterations_min = 50
learningRate = 0.1
delta = 0.0001
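# Pass delta and iterations_min explicitly; otherwise fit_data silently uses its own defaults
Theta,iterations,costList = fit_data(XTrain,yTrain_oneHot,learningRate,iterations_max,delta=delta,iterations_min=iterations_min)
print("Iterations ",iterations)
print("Cost:",costList[-1])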

In [None]:
def getProbsAndPreds(Theta,someX):
    probs = softmax(Theta,someX)
    preds = np.argmax(probs,axis=1)
    return probs,preds

def getAccuracy(Theta,someX,someY):
    prob,prede = getProbsAndPreds(Theta,someX)
    accuracy = sum(prede == someY)/(float(len(someY)))
    return accuracy
  
print('Training Accuracy: ', getAccuracy(Theta,XTrain,yTrain))
print('Test Accuracy: ', getAccuracy(Theta,XTest,yTest))