Logistic regression is a model for binary classification: it predicts $\hat{y}$, the probability that the label $y$ is 1.

## Logistic loss function

$ L(\hat{y}, y)=-(y\log{\hat{y}}+(1-y)\log{(1-\hat{y})}) $  
If y=1: $L(\hat{y}, y)=-\log \hat{y}$ $\longleftarrow$ want $\log \hat{y}$ large, want $\hat{y}$ large  

If y=0: $L(\hat{y}, y)=-\log (1-\hat{y})$ $\longleftarrow$ want $\log (1-\hat{y})$ large, want $\hat{y}$ small   
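
A minimal sketch of the two cases above, assuming NumPy; the helper name `logistic_loss` and the sample probabilities are illustrative only:

```python
import numpy as np

def logistic_loss(y_hat, y):
    """Logistic (cross-entropy) loss for a single example."""
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

# y = 1: the loss shrinks as y_hat approaches 1
loss_pos_good = logistic_loss(0.9, 1)
loss_pos_bad = logistic_loss(0.1, 1)

# y = 0: the loss shrinks as y_hat approaches 0
loss_neg_good = logistic_loss(0.1, 0)
loss_neg_bad = logistic_loss(0.9, 0)

print(loss_pos_good, loss_pos_bad, loss_neg_good, loss_neg_bad)
```

A confident correct prediction gives a small loss, and a confident wrong prediction gives a large one.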

### How to get the logistic loss function?
If $y=1$: $p(y|x)=\hat{y}$  
If $y=0$: $p(y|x)=1-\hat{y}$  
$\rightarrow p(y|x)=\hat{y}^{y}(1-\hat{y})^{(1-y)}$  

Because the log function is strictly monotonically increasing, maximizing $\log p(y|x)$ gives the same result as maximizing $p(y|x)$.   

$$\log{p(y|x)}=\log \hat{y}^{y} (1-\hat{y})^{(1-y)}=y \log(\hat{y})+(1-y)\log(1-\hat{y})$$  

This is the negative of the loss function defined earlier. The negative sign is there because when training a learning algorithm we usually want to make the probability large, whereas in logistic regression we express this as minimizing a loss function. Minimizing the loss therefore corresponds to maximizing the log of the probability.   

**Cost on m examples**  
$$\log p(\text{labels in training set})=\log \prod_{i=1}^m p(y^{(i)}|x^{(i)})$$
$$\log p(\text{labels in training set})=\sum_{i=1}^m \log p(y^{(i)}|x^{(i)})=-\sum_{i=1}^{m}L(\hat{y}^{(i)},y^{(i)})=-m \times \text{cost function}$$  

We are really carrying out maximum likelihood estimation with the logistic regression model, under the assumption that the training examples are IID (independent and identically distributed). 
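
The identity between the log-likelihood and the cost can be checked numerically. This is a small sketch with made-up predictions and labels (`y_hat` and `y` are hypothetical values, not data from these notes):

```python
import numpy as np

# Hypothetical predictions and labels for m = 4 examples.
y_hat = np.array([0.9, 0.2, 0.7, 0.4])
y = np.array([1, 0, 1, 0])
m = y.size

# p(y|x) = y_hat^y * (1 - y_hat)^(1 - y) for each example
p = y_hat**y * (1 - y_hat)**(1 - y)
log_likelihood = np.log(np.prod(p))

# Average logistic loss over the examples (the cost function)
losses = -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
cost = losses.mean()

print(log_likelihood, -m * cost)  # the two values agree up to rounding
```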
### Why does the cost function use cross entropy instead of MSE?
With a sigmoid output, the MSE cost is not a convex function of the parameters, so it can have multiple local optima. The cross-entropy cost is convex, so every local optimum is also a global optimum.
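
One way to see the difference is to look at a single example. The sketch below assumes one hypothetical example with $x=1$, $y=1$ and no bias, so $\hat{y}=\sigma(w)$, and uses a second difference as a rough stand-in for the second derivative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One hypothetical example with x = 1, y = 1, no bias, so y_hat = sigmoid(w).
def mse_loss(w):
    return (sigmoid(w) - 1.0) ** 2

def ce_loss(w):
    return -np.log(sigmoid(w))

def second_difference(f, w, h=1.0):
    # A convex function never has a negative second difference,
    # so a negative value shows that f is not convex.
    return f(w - h) + f(w + h) - 2.0 * f(w)

mse_curv = second_difference(mse_loss, -3.0)
ce_curv = second_difference(ce_loss, -3.0)
print(mse_curv, ce_curv)
```

Around $w=-3$ the squared-error loss curves downward (negative value) while the cross-entropy loss curves upward (positive value).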

### What is the difference between the cost function and the loss function for logistic regression?
The loss function computes the error for a single training example; the cost function is the average of the loss functions of the entire training set.  
We want to find $w, b$ that minimize the cost function $J(w, b)$.

$\text{Cost function: } J(W, b)=\frac{1}{m}\sum_{i=1}^m L(\hat{y}^{(i)}, y^{(i)})=-\frac{1}{m}\sum_{i=1}^m [y^{(i)} \log \hat{y}^{(i)}+(1-y^{(i)})\log(1-\hat{y}^{(i)})]$  

## Gradient Descent

Repeat  
$$
\begin{cases}
W=W - \alpha \times \frac{\partial J(W,b)}{\partial W} \\\\
b=b - \alpha \times \frac{\partial J(W,b)}{\partial b}
\end{cases}
$$

$\alpha$: learning rate

### Note!
When writing code, we use dW instead of $\frac{\partial J(W,b)}{\partial W}$, db instead of $\frac{\partial J(W,b)}{\partial b}$  
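
To illustrate the update rule on its own, here is a sketch on a hypothetical one-dimensional cost $J(w)=(w-3)^2$ (not the logistic cost), using the `dw` naming from the note above:

```python
# Hypothetical 1-D cost J(w) = (w - 3)^2, whose minimum is at w = 3.
def grad_J(w):
    return 2.0 * (w - 3.0)

w = 0.0
alpha = 0.1  # learning rate
for _ in range(200):
    dw = grad_J(w)       # "dw" stands for dJ/dw
    w = w - alpha * dw   # gradient descent update
print(w)
```

Each step moves `w` against the gradient, so it approaches the minimizer.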

## Logistic regression derivatives

### Backpropagation of one sample 
$$z=w_1x_1+w_2x_2+b\rightarrow a=\sigma(z) \rightarrow L(a,y)$$  
Because logistic regression is binary classification, the loss function is $$ L(a,y)=-(y\log{a}+(1-y)\log{(1-a)}) $$ 
`da` =$\frac{dL(a,y)}{da}=-\frac{y}{a}+\frac{1-y}{1-a}$  

`dz` =$\frac{dL(a,y)}{dz}=\frac{dL(a,y)}{da} \frac{da}{dz}=(-\frac{y}{a}+\frac{1-y}{1-a}) a(1-a)=a-y$  

`dw1` =$\frac{\partial L}{\partial w_1}=\frac{dL}{dz} \frac{dz}{dw_1}=x_1 \frac{dL(a,y)}{dz}$  

`dw2` =$\frac{\partial L}{\partial w_2}=\frac{dL}{dz} \frac{dz}{dw_2}=x_2 \frac{dL(a,y)}{dz}$  

`db` =$\frac{\partial L}{\partial b}=\frac{dL}{dz} \frac{dz}{db}=1*\frac{dL(a,y)}{dz}$  

so repeat:  
$w_1:=w_1-\alpha dw_1$  
$w_2:=w_2-\alpha dw_2$  
$b:=b-\alpha db$
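
The derivatives above can be sanity-checked against finite differences. This is a sketch with made-up values for the example and the parameters; the helper names are illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(w1, w2, b, x1, x2, y):
    a = sigmoid(w1 * x1 + w2 * x2 + b)
    return -(y * np.log(a) + (1 - y) * np.log(1 - a))

# Hypothetical values for one training example and the parameters.
x1, x2, y = 1.5, -2.0, 1.0
w1, w2, b = 0.3, -0.1, 0.2

# Analytic gradients: dz = a - y, dw1 = x1 * dz, dw2 = x2 * dz, db = dz
a = sigmoid(w1 * x1 + w2 * x2 + b)
dz = a - y
dw1, dw2, db = x1 * dz, x2 * dz, dz

# Central finite differences as an independent check
eps = 1e-6
dw1_num = (loss(w1 + eps, w2, b, x1, x2, y) - loss(w1 - eps, w2, b, x1, x2, y)) / (2 * eps)
dw2_num = (loss(w1, w2 + eps, b, x1, x2, y) - loss(w1, w2 - eps, b, x1, x2, y)) / (2 * eps)
db_num = (loss(w1, w2, b + eps, x1, x2, y) - loss(w1, w2, b - eps, x1, x2, y)) / (2 * eps)

print(dw1, dw1_num)
print(dw2, dw2_num)
print(db, db_num)
```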


### Backpropagation of m samples

$J(W, b)=\frac{1}{m}\sum_i^m L(a^{(i)}, y^{(i)})$  
$a^{(i)}=\hat{y}^{(i)}=\sigma{(z^{(i)})}=\sigma(w^Tx^{(i)}+b)$, where $x^{(i)}=\left[
\begin{matrix}
x_1^{(i)} \\
x_2^{(i)} \\
\vdots \\
x_{n_x}^{(i)} \\
\end{matrix}
\right]$ and $a^{(i)}$ is a scalar for each example.  
$\frac{\partial J(w,b)}{\partial w_1}=\frac{1}{m}\sum_{i=1}^m \frac{\partial L(a^{(i)},y^{(i)})}{\partial w_1}$, in code: $dw_1^{(i)}= \frac{\partial L(a^{(i)},y^{(i)})}{\partial w_1}$  

$\frac{\partial J(w,b)}{\partial w_2}=\frac{1}{m}\sum_{i=1}^m \frac{\partial L(a^{(i)},y^{(i)})}{\partial w_2}$, in code: $dw_2^{(i)}= \frac{\partial L(a^{(i)},y^{(i)})}{\partial w_2}$  

$\frac{\partial J(w,b)}{\partial b}=\frac{1}{m}\sum_{i=1}^m \frac{\partial L(a^{(i)},y^{(i)})}{\partial b}$, in code: $db^{(i)}= \frac{\partial L(a^{(i)},y^{(i)})}{\partial b}$  

Pseudo code  
$J=0;dw_1=0;dw_2=0;db=0$  

$\text{for } i=1 \text{ to } m:$  
　　$z^{(i)}=W^Tx^{(i)}+b$  
　　$a^{(i)}=\sigma(z^{(i)})$  
　　$J+=-(y^{(i)}\log{a^{(i)}}+(1-y^{(i)})\log{(1-a^{(i)})})$  
　　$dz^{(i)}=a^{(i)}-y^{(i)}$  
　　$dw_1+=x_1^{(i)}dz^{(i)}$  
　　$dw_2+=x_2^{(i)}dz^{(i)}$  
　　$db+=dz^{(i)}$  
  
$J/=m$

$dw_1/=m$  

$dw_2/=m$  

$db/=m$  

$dw_1=\frac{\partial J}{\partial w_1}$    

repeat:

$w_1:=w_1-\alpha dw_1$  

$w_2:=w_2-\alpha dw_2$  

$b:=b-\alpha db$  
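
A runnable sketch of the pseudo code above for two features. The toy data, the learning rate and the helper name `one_pass` are assumptions made for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical toy data: m = 4 examples with two features each.
X = np.array([[0.5, 1.5, -1.0, -2.0],
              [1.0, 2.0, -0.5, -1.5]])   # shape (2, m)
Y = np.array([1, 1, 0, 0])
m = X.shape[1]

w1, w2, b = 0.0, 0.0, 0.0
alpha = 0.1  # learning rate

def one_pass(w1, w2, b):
    """One pass over the m examples: returns J, dw1, dw2, db."""
    J = dw1 = dw2 = db = 0.0
    for i in range(m):
        z = w1 * X[0, i] + w2 * X[1, i] + b
        a = sigmoid(z)
        J += -(Y[i] * np.log(a) + (1 - Y[i]) * np.log(1 - a))
        dz = a - Y[i]
        dw1 += X[0, i] * dz
        dw2 += X[1, i] * dz
        db += dz
    return J / m, dw1 / m, dw2 / m, db / m

J_start, _, _, _ = one_pass(w1, w2, b)
for _ in range(100):
    J, dw1, dw2, db = one_pass(w1, w2, b)
    w1 -= alpha * dw1
    w2 -= alpha * dw2
    b -= alpha * db
J_end, _, _, _ = one_pass(w1, w2, b)
print(J_start, J_end)
```

Note the two nested loops: one over gradient descent iterations and one over the training examples.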

**The per-feature updates above can be vectorized by treating $dw$ as a single vector.**  

Pseudo code  
$J=0;dw=np.zeros((n_x,1));db=0$  

$\text{for } i=1 \text{ to } m:$  
　　$z^{(i)}=W^Tx^{(i)}+b$  
　　$a^{(i)}=\sigma(z^{(i)})$  
　　$J+=-(y^{(i)}\log{a^{(i)}}+(1-y^{(i)})\log{(1-a^{(i)})})$  
　　$dz^{(i)}=a^{(i)}-y^{(i)}$  
　　$dw+=x^{(i)}dz^{(i)}$  
　　$db+=dz^{(i)}$  
  
$J/=m$

$dw/=m$

$db/=m$  

$dw=\frac{\partial J}{\partial w}$    

repeat:

$w:=w-\alpha dw$    

$b:=b-\alpha db$  



## Vectorization

In [1]:
import numpy as np

a = np.array([1, 2, 3, 4])
print(a)

[1 2 3 4]


In [11]:
a1 = np.array([1, 2])
print(a1)
b1 = np.array([3, 4])
print(b1)

c1 = np.dot(a1,b1)
print(c1)

[1 2]
[3 4]
11


In [6]:
np.random.rand(1000000)

array([0.40737197, 0.98558692, 0.18633978, ..., 0.38695059, 0.07289723,
       0.3516798 ])

In [7]:
np.random.rand(1000000).shape

(1000000,)

In [4]:
import time

a = np.random.rand(1000000)
b = np.random.rand(1000000)

tic = time.time()
c =  np.dot(a, b)
toc = time.time()
print(c)
print('Vectorized version:' + str(1000*(toc-tic))+'ms')

250092.56167622298
Vectorized version:0.5388259887695312ms


In [12]:
c = 0
tic = time.time()
for i in range(1000000):
    c += a[i]*b[i]
toc = time.time()
print(c)
print('For loop:'+str(1000*(toc-tic))+'ms')

250092.56167622848
For loop:445.69945335388184ms


Both GPUs and CPUs have parallelization instructions, sometimes called **SIMD** instructions, which stands for single instruction, multiple data. This means that if you use built-in functions such as `np.dot`, or other functions that do not require you to write an explicit for loop, Python can take much better advantage of parallelism and do the computation much faster. This is true for computations on both CPUs and GPUs. GPUs are remarkably good at these SIMD calculations, but CPUs are not bad at them either, just not as good as GPUs. In the timing above, the vectorized version took about 0.54 ms while the for loop took about 446 ms, which shows how much vectorization can speed up your code.

### Vectorizing Logistic Regression 

The vectorized implementation processes all $m$ training examples without an explicit for loop.  

$
X = \left[
\begin{matrix}
\vdots & \vdots & \cdots & \vdots \\
x^{(1)} & x^{(2)} & \cdots & x^{(m)} \\
\vdots & \vdots & \cdots & \vdots \\
\end{matrix}
\right]
$, $X \in \mathbb{R}^{n_x \times m}$   
$Z=\left[
\begin{matrix}
z^{(1)} & z^{(2)} & \cdots & z^{(m)} \\
\end{matrix}
\right]$ $=W^TX+ $ $\left[
\begin{matrix}
b & b & \cdots & b \\
\end{matrix}
\right]$ $=\left[
\begin{matrix}
W^Tx^{(1)}+b & W^Tx^{(2)}+b & \cdots & W^Tx^{(m)}+b \\
\end{matrix}
\right] $  
$A=\left[
\begin{matrix}
a^{(1)} & a^{(2)} & \cdots & a^{(m)} \\
\end{matrix}
\right]=\sigma(Z)$

`#code`  
`Z = np.dot(w.T,X)+b`  
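
A short sketch of this line with concrete shapes. The sizes and random data are hypothetical; the point is that the scalar `b` is broadcast across all $m$ columns and that the result matches the per-example computation:

```python
import numpy as np

n_x, m = 3, 5                      # hypothetical sizes
rng = np.random.default_rng(0)
w = rng.standard_normal((n_x, 1))  # column vector of weights
X = rng.standard_normal((n_x, m))  # one training example per column
b = 0.5                            # scalar bias, broadcast across all m columns

Z = np.dot(w.T, X) + b             # shape (1, m)
A = 1.0 / (1.0 + np.exp(-Z))       # sigmoid applied element-wise

# Same computation one example at a time, for comparison
Z_loop = np.array([[np.dot(w[:, 0], X[:, i]) + b for i in range(m)]])
print(Z.shape, np.allclose(Z, Z_loop))
```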

As mentioned above:  
$dz^{(i)}=a^{(i)}-y^{(i)}$  
$dZ=\left[
\begin{matrix}
dz^{(1)} & dz^{(2)} & \cdots & dz^{(m)} \\
\end{matrix}
\right]$  
$A=\left[
\begin{matrix}
a^{(1)} & a^{(2)} & \cdots & a^{(m)} \\
\end{matrix}
\right]$  
$Y=\left[
\begin{matrix}
y^{(1)} & y^{(2)} & \cdots & y^{(m)} \\
\end{matrix}
\right]$  
$\rightarrow dZ=A-Y$

As mentioned above:  
$dw+=x^{(i)}dz^{(i)}$  

$db+=dz^{(i)}$

$dw/=m$

$db/=m$  

so $\rightarrow$  
$dW = \frac{1}{m}XdZ^T=\frac{1}{m}\left[
\begin{matrix}
\vdots & \vdots & \cdots & \vdots \\
x^{(1)} & x^{(2)} & \cdots & x^{(m)} \\
\vdots & \vdots & \cdots & \vdots \\
\end{matrix}
\right]\left[
\begin{matrix}
dz^{(1)} \\
dz^{(2)} \\
\vdots \\
dz^{(m)} \\
\end{matrix}
\right]=\frac{1}{m}\left[
\begin{matrix}
x^{(1)}dz^{(1)}+x^{(2)}dz^{(2)}+\cdots+x^{(m)}dz^{(m)} \\
\end{matrix}
\right], dW \in \mathbb{R}^{n_x \times 1}$  

$db=\frac{1}{m}\sum_{i=1}^{m}dz^{(i)}$  
`#code:`  
`db = 1/m * np.sum(dZ)`  
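
To check that the matrix form matches the accumulation loop, here is a sketch with random stand-in values (`X` and `dZ` are hypothetical, with `dZ` playing the role of $A-Y$):

```python
import numpy as np

rng = np.random.default_rng(1)
n_x, m = 3, 6                              # hypothetical sizes
X = rng.standard_normal((n_x, m))
dZ = rng.standard_normal((1, m))           # stands in for A - Y

# Loop form: dw += x^(i) * dz^(i), then divide by m
dw_loop = np.zeros((n_x, 1))
db_loop = 0.0
for i in range(m):
    dw_loop += X[:, [i]] * dZ[0, i]
    db_loop += dZ[0, i]
dw_loop /= m
db_loop /= m

# Vectorized form
dW = np.dot(X, dZ.T) / m                   # shape (n_x, 1)
db = np.sum(dZ) / m
print(np.allclose(dW, dw_loop), np.isclose(db, db_loop))
```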

#### Summary 
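`for iter in range(1000):`  

　　　`Z = np.dot(W.T,X)+b`  
　　　`A = 1/(1+np.exp(-Z))`  
　　　`dZ = A-Y`  
　　　`dW = 1/m*np.dot(X,dZ.T)`  
　　　`db = 1/m*np.sum(dZ)`  
　　　`W = W - lr*dW`  
　　　`b = b - lr*db`

The loop over gradient descent iterations remains; only the loop over training examples is removed.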
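
A runnable sketch of this training loop, assuming a sigmoid activation for `A` and `np.dot` for the matrix product in `dW`. The toy data, the learning rate and the cost tracking are illustrative additions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical toy data, one example per column
X = np.array([[0.5, 1.5, -1.0, -2.0],
              [1.0, 2.0, -0.5, -1.5]])   # shape (n_x, m)
Y = np.array([[1, 1, 0, 0]])             # shape (1, m)
n_x, m = X.shape

W = np.zeros((n_x, 1))
b = 0.0
lr = 0.1

costs = []
for it in range(1000):
    Z = np.dot(W.T, X) + b               # forward pass for all m examples
    A = sigmoid(Z)
    costs.append(float(-np.mean(Y * np.log(A) + (1 - Y) * np.log(1 - A))))
    dZ = A - Y                           # backward pass
    dW = np.dot(X, dZ.T) / m
    db = np.sum(dZ) / m
    W = W - lr * dW                      # gradient descent update
    b = b - lr * db

predictions = (sigmoid(np.dot(W.T, X) + b) > 0.5).astype(int)
print(costs[0], costs[-1], predictions)
```

On this small separable data set the cost decreases and the thresholded predictions match the labels.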

### Python-Numpy vectors

In [13]:
import numpy as np

a = np.random.randn(5)

In [14]:
print(a)

[-1.8669049   0.24053453  0.51378543 -1.61745633  0.62625261]


In [15]:
a.shape 
# this is called a rank 1 array in Python and it's neither a row vector nor a column vector. 

(5,)

In [17]:
print(a.T)

[-1.8669049   0.24053453  0.51378543 -1.61745633  0.62625261]


In [18]:
print(np.dot(a, a.T))

6.81552354351696


`a` is the same as `a.T`, so I would recommend that when you code neural networks, you do not use data structures whose shape is `(5,)` or `(n,)`, that is, rank 1 arrays. Instead, if you create `a` with shape `(5, 1)`, this commits `a` to being a (5, 1) column vector. 

In [20]:
a = np.random.randn(5,1)
print(a)

[[ 0.19903633]
 [ 0.08218999]
 [ 0.13970819]
 [ 0.77363965]
 [-1.00748745]]


In [21]:
print(a.T)

[[ 0.19903633  0.08218999  0.13970819  0.77363965 -1.00748745]]


In the output above, printing the transpose of `np.random.randn(5,1)` shows two square brackets, whereas printing the transpose of `np.random.randn(5)` showed only one. That is the difference between a true 1 by 5 matrix and a rank 1 array.  

Use `assert(a.shape == (5, 1))` to make sure `a` has the vector shape you intend.
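
If you already have a rank 1 array, one option is to reshape it explicitly. This is a small sketch; the variable names `col` and `row` are illustrative:

```python
import numpy as np

a = np.random.randn(5)          # rank 1 array, shape (5,)
col = a.reshape(5, 1)           # explicit column vector
row = a.reshape(1, 5)           # explicit row vector

assert col.shape == (5, 1)
assert row.shape == (1, 5)

inner = np.dot(row, col)        # (1, 5) x (5, 1) -> (1, 1)
outer = np.dot(col, row)        # (5, 1) x (1, 5) -> (5, 5)
print(a.shape, a.T.shape, inner.shape, outer.shape)
```

With explicit shapes, the inner and outer products are unambiguous, while transposing the rank 1 array leaves its shape unchanged.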

In [22]:
print(np.dot(a, a.T))

[[ 0.03961546  0.01635879  0.027807    0.15398239 -0.2005266 ]
 [ 0.01635879  0.0067552   0.01148262  0.06358544 -0.08280539]
 [ 0.027807    0.01148262  0.01951838  0.10808379 -0.14075425]
 [ 0.15398239  0.06358544  0.10808379  0.5985183  -0.77943223]
 [-0.2005266  -0.08280539 -0.14075425 -0.77943223  1.01503096]]


In [25]:
assert(a.shape == (5, 1))

In [26]:
a = np.random.randn(4, 3)
b = np.random.randn(3, 2)

In [27]:
c = a*b # Note: in numpy the "*" operator is element-wise multiplication, which is different from "np.dot()".

ValueError: operands could not be broadcast together with shapes (4,3) (3,2) 

In [28]:
a = np.random.randn(3, 3)
b = np.random.randn(3, 1)

In [29]:
c = a*b

In [30]:
c

array([[-0.06921615, -0.13919856,  0.04555526],
       [-0.61205946, -0.97378702,  0.67751549],
       [-0.01218801, -0.05809639,  0.09167936]])
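
The two cases above differ because of broadcasting: shapes (3, 3) and (3, 1) are compatible for element-wise multiplication, while (4, 3) and (3, 2) are not, although `np.dot` accepts them as a matrix product. A sketch with fixed, made-up values so the effect is easy to see:

```python
import numpy as np

a = np.arange(9.0).reshape(3, 3)
b = np.array([[1.0], [10.0], [100.0]])   # shape (3, 1)

# b is broadcast across the columns: row i of a is scaled by b[i]
elementwise = a * b                      # shape (3, 3)
# Ordinary matrix product
matrix_product = np.dot(a, b)            # shape (3, 1)

# Shapes (4, 3) and (3, 2) cannot be broadcast, but np.dot works on them
p = np.ones((4, 3))
q = np.ones((3, 2))
pq = np.dot(p, q)                        # shape (4, 2)

print(elementwise)
print(matrix_product)
print(pq.shape)
```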