# Softmax function

Transforms a vector of scores into a proper probability distribution: the outputs are positive and sum to 1.

In [6]:
import numpy as np

def softmax(x):
    # exponentiate each score once, then normalize along axis 0 so the outputs sum to 1
    e = np.exp(x)
    return e/np.sum(e,axis=0)

# when the scores are multiplied by a large number, the probabilities get close to 0 and 1
# when the scores are multiplied by a small number, the probabilities get close to a uniform distribution
scores = [3.0,1.0,0.2]
print softmax(scores)
print softmax([300.0,100.0,20.0])
print softmax([0.03,0.01,0.002]) 


[ 0.8360188   0.11314284  0.05083836]
[  1.00000000e+000   1.38389653e-087   2.49772757e-122]
[ 0.33868604  0.3319796   0.32933436]
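
The `softmax` above exponentiates the raw scores, so much larger scores (for example 3000 instead of 300) overflow `np.exp` and the result becomes `nan`. A common remedy, sketched here as an assumption and not as part of the original notebook, is to subtract the largest score before exponentiating. The name `stable_softmax` is hypothetical.

```python
import numpy as np

def stable_softmax(x):
    # Hypothetical helper, not part of the original notebook.
    x = np.asarray(x, dtype=float)
    # Subtracting the maximum does not change the result, because the
    # factor exp(-max) cancels in the ratio, but it keeps exp() from overflowing.
    shifted = x - np.max(x, axis=0)
    e = np.exp(shifted)
    return e / np.sum(e, axis=0)

# Scores this large would produce inf/inf = nan in the plain version.
p = stable_softmax([3000.0, 1000.0, 200.0])
print(p)
```

For moderate scores such as `[3.0, 1.0, 0.2]` this sketch should agree with the plain version, since both compute the same ratio.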


# Cross entropy 

A measure of the dissimilarity between two probability distributions $\mathbf{s}$ and $\mathbf{l}$, given by $D(\mathbf{s},\mathbf{l})= -\sum_i l_i \log(s_i)$, where, in our context, $\mathbf{s}$ is the vector produced by the softmax and $\mathbf{l}$ is the one-hot encoding of the class label (one for the correct class and zero everywhere else). So when the correct class label is $j$, $D(\mathbf{s},\mathbf{l})= - \log(s_j)$.

In [8]:
def cross_entropy(S,L):
    return -np.sum(L*np.log(S))

scores  = [3.0,1.0,0.2]
class_1 = [1.0,0.0,0.0]
class_3 = [0.0,0.0,1.0]
print 'Estimated class probabilities     ',softmax(scores)
print 'Cross entropy for good prediction ',cross_entropy(softmax(scores),class_1) 
print 'Cross entropy for bad  prediction ',cross_entropy(softmax(scores),class_3) 

Estimated class probabilities      [ 0.8360188   0.11314284  0.05083836]
Cross entropy for good prediction  0.179104174785
Cross entropy for bad  prediction  2.97910417479
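
To make the one-hot simplification concrete, the following sketch checks, for each possible correct class $j$, that the full sum reduces to $-\log(s_j)$. The two functions are redefined so the block runs on its own; the `np.asarray` call is an addition so that plain lists are accepted.

```python
import numpy as np

def softmax(x):
    e = np.exp(x)
    return e / np.sum(e, axis=0)

def cross_entropy(S, L):
    return -np.sum(np.asarray(L) * np.log(S))

s = softmax([3.0, 1.0, 0.2])
for j in range(3):
    one_hot = np.zeros(3)
    one_hot[j] = 1.0
    # With a one-hot label only the j-th term of the sum survives.
    assert np.isclose(cross_entropy(s, one_hot), -np.log(s[j]))
    print("class %d: cross entropy = %.4f" % (j + 1, -np.log(s[j])))
```

The loss grows as the probability assigned to the correct class shrinks, which is why the "bad" prediction above is penalized more heavily than the "good" one.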


# Cross entropy loss minimization
Given $N$ training samples, a weight matrix $W$ and a bias vector $b$, we can write the cross entropy loss as

> $\frac{1}{N} \sum_{i=1}^{N} D(f(W\mathbf{x}_i + b),L_i)$

where $\mathbf{x}_i$ is the $i$-th example, $L_i$ is its one-hot class label vector and $f$ is the softmax function.

We will minimize this loss with respect to $W$ and $b$ by gradient descent.
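
A minimal sketch of this procedure follows. It rests on several assumptions that are not in the original notes: samples are stored as the columns of `X`, the toy data and the learning rate are made up for illustration, and the gradient uses the standard result that the derivative of the cross entropy with respect to the scores $W\mathbf{x}_i + b$ is $f(W\mathbf{x}_i + b) - L_i$.

```python
import numpy as np

def softmax(z):
    # Column-wise softmax; subtracting the column maximum avoids overflow.
    e = np.exp(z - np.max(z, axis=0))
    return e / np.sum(e, axis=0)

def loss_and_grads(W, b, X, L):
    # X: (d, N) samples as columns, L: (K, N) one-hot labels as columns.
    N = X.shape[1]
    S = softmax(W.dot(X) + b[:, None])   # (K, N) predicted probabilities
    loss = -np.sum(L * np.log(S)) / N    # average cross entropy
    dZ = (S - L) / N                     # gradient w.r.t. the scores W x + b
    return loss, dZ.dot(X.T), np.sum(dZ, axis=1)

# Toy data (made up for illustration): 3 classes, 2 features.
rng = np.random.RandomState(0)
centers = np.array([[0.0, 3.0, -3.0],
                    [3.0, -2.0, -2.0]])   # (d, K), one center per class
labels = np.repeat(np.arange(3), 30)      # 30 samples per class
X = centers[:, labels] + rng.randn(2, 90)
L = np.eye(3)[:, labels]                  # one-hot label columns

W = np.zeros((3, 2))
b = np.zeros(3)
learning_rate = 0.1                       # assumed value, not tuned
losses = []
for step in range(500):
    loss, dW, db = loss_and_grads(W, b, X, L)
    losses.append(loss)
    W -= learning_rate * dW
    b -= learning_rate * db

print("loss at first step: %.4f" % losses[0])
print("loss at last step:  %.4f" % losses[-1])
```

With all-zero parameters every class gets probability $1/3$, so the first recorded loss equals $\log 3$; the later values show how far gradient descent has reduced it on this toy problem.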

# Logistic regression in Python

Logistic regression is a scheme for binary classification problems involving $d$ variables $x_i , i =1,\ldots,d$. The output variable $y$ can take only the value $0$ or $1$. The classification scheme goes as follows:
* Compute $z = \theta_0 + \mathbf{x}^T \mathbf{\theta}$ where $\theta_0 ,\ldots,\theta_d$ are free parameters.
* Use the logistic function $s(z) = \frac{1}{1+e^{-z}}$ to compute a value between $0$ and $1$.
* We interpret this value as probability and predict the output class to be $1$ if $s(z) > \frac{1}{2}$ and $0$ otherwise.
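
The three steps above can be sketched as a single function. The parameter values and the input are hypothetical and chosen only for illustration; in practice $\theta_0,\ldots,\theta_d$ would come from training.

```python
import numpy as np

def predict(x, theta0, theta):
    # Step 1: linear score z = theta_0 + x^T theta
    z = theta0 + np.dot(x, theta)
    # Step 2: squash the score into (0, 1) with the logistic function
    p = 1.0 / (1.0 + np.exp(-z))
    # Step 3: threshold the probability at 1/2
    return p, int(p > 0.5)

# Hypothetical parameters and input, for illustration only.
theta0 = -1.0
theta = np.array([2.0, -0.5])
p, label = predict(np.array([1.0, 1.0]), theta0, theta)
print("p = %.4f, predicted class = %d" % (p, label))
```

Because $s(z) > \frac{1}{2}$ exactly when $z > 0$, the decision depends only on the sign of $z$, so the boundary between the two classes is linear in $\mathbf{x}$.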

Other references:
* [ML with python - logistic regression](http://aimotion.blogspot.co.il/2011/11/machine-learning-with-python-logistic.html). The data set used consists of the file [] from 

* https://github.com/justmarkham/gadsdc1/blob/master/logistic_assignment/kevin_logistic_sklearn.ipynb
* https://github.com/jcgillespie/Coursera-Machine-Learning
* http://www.ats.ucla.edu/stat/r/dae/logit.htm
* http://blog.yhat.com/posts/logistic-regression-and-python.html
* http://blog.smellthedata.com/2009/06/python-logistic-regression-with-l2.html
* http://nbviewer.ipython.org/github/tfolkman/learningwithdata/blob/master/Logistic%20Gradient%20Descent.ipynb

[Nando de Freitas' YouTube course](https://www.youtube.com/watch?v=w2OtwL5T1ow&list=PLE6Wd9FR--EdyJ5lbFl8UuGjecvVw66F6): look at the [logistic regression](https://www.youtube.com/watch?v=mz3j59aJBZQ) video. The basic idea is to write down the likelihood function and take the negative log-likelihood as the error function, which is then minimized by gradient descent.
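
Spelled out, the idea reads as follows. This is a sketch in notation introduced here for illustration and not taken from the video: $N$ training pairs $(\mathbf{x}^{(n)}, y^{(n)})$, predicted probabilities $p^{(n)}$, error function $E$ and step size $\eta$. Each sample has predicted probability

> $p^{(n)} = s(\theta_0 + \mathbf{x}^{(n)T} \mathbf{\theta})$

and, assuming independent samples, the likelihood of the observed labels is

> $\prod_{n=1}^{N} (p^{(n)})^{y^{(n)}} (1-p^{(n)})^{1-y^{(n)}}$

Taking the negative logarithm gives the error function

> $E(\theta) = -\sum_{n=1}^{N} \left[ y^{(n)} \log p^{(n)} + (1-y^{(n)}) \log(1-p^{(n)}) \right]$

Using $s'(z) = s(z)(1-s(z))$, its partial derivatives are

> $\frac{\partial E}{\partial \theta_j} = \sum_{n=1}^{N} (p^{(n)} - y^{(n)}) x_j^{(n)}$, with $x_0^{(n)} = 1$ for the intercept $\theta_0$,

and one gradient descent step updates every parameter by $\theta_j \leftarrow \theta_j - \eta \frac{\partial E}{\partial \theta_j}$.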

In [1]:
import numpy as np
import pandas as pd

%pwd

u'd:\\GitHub\\nn_deep'

# Logistic regression - single variable

The logistic function is $s(z) = \frac{1}{1+e^{-z}}$ and its derivative is $s'(z) = s(z) \cdot (1-s(z))$.

In [2]:
# the data set consists of two variables representing scores on two exams
# and the admission decision: 0 or 1
data = np.loadtxt(r'data/ex2data1.txt', delimiter=',')

X = data[:, 0:2]
y = data[:, 2]

print type(X),X.shape
print len(y)

<type 'numpy.ndarray'> (100L, 2L)
100


In [4]:
def sigmoid(z):
    return 1.0/(1.0+np.exp(-z))

def der_sigmoid(z):
    s = sigmoid(z)
    return s*(1.0-s)
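
As a sanity check on the derivative formula, the sketch below compares `der_sigmoid` with a central finite difference. The two functions are repeated so the block runs on its own; the grid of test points and the step size `h` are assumptions chosen for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def der_sigmoid(z):
    s = sigmoid(z)
    return s * (1.0 - s)

z = np.linspace(-5.0, 5.0, 11)
h = 1e-5  # step size for the central difference (assumed, not tuned)
# Central difference approximation of the derivative at each point of z.
numeric = (sigmoid(z + h) - sigmoid(z - h)) / (2.0 * h)
max_err = np.max(np.abs(numeric - der_sigmoid(z)))
print("largest deviation: %.2e" % max_err)
```

A small deviation suggests that the analytic formula $s(z)(1-s(z))$ and the numerical slope agree; the derivative peaks at $z = 0$, where it equals $\frac{1}{4}$.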
