# Binary Classification

Binary classification applies when there are two labels to assign, e.g., cat / not cat for an image. 

An image is represented by three matrices of pixel intensities indexed by $[x,y]$, one each for the red, green and blue color channels. 

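A single training example is a pair $(x,y)$, where $x\in\mathcal{R}^{n_x}$ is the feature vector (for an image, all pixel values of the three color matrices unrolled into one vector) and $y\in\{0,1\}$ is the label.  
There are $m$ training examples $\{(x^{(1)},y^{(1)}), (x^{(2)},y^{(2)}), ... ,(x^{(m)},y^{(m)})\}$.  
When the distinction matters, $m=m_{\rm train}$ denotes the number of training examples and $m_{\rm test}$ the number of test examples. 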

The data for all training examples are collected in a single matrix 

$$
X=
\begin{bmatrix}
\vec{x}^{(1)},\vec{x}^{(2)},...,\vec{x}^{(m-1)},\vec{x}^{(m)}\\
\end{bmatrix}
\in \mathcal{R}^{n_x\times m}
$$

where each $\vec{x}^{(i)}$ is a vector with all features for a given training example. 

The labels are given by a row vector 

$$
Y = [y^{(1)},y^{(2)},...,y^{(m-1)},y^{(m)}] \in \mathcal{R}^{1\times m}
$$ 
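As a minimal sketch of this layout, the snippet below builds $X$ and $Y$ from a stack of images. The array names (`images`, `labels`), the image size and the random pixel values are all made up for illustration; only the resulting shapes $(n_x, m)$ and $(1, m)$ follow the definitions above.

```python
import numpy as np

# Hypothetical toy data: m = 4 RGB images of 2 x 2 pixels, values in [0, 255].
rng = np.random.default_rng(0)
images = rng.integers(0, 256, size=(4, 2, 2, 3))  # (m, height, width, channels)
labels = np.array([1, 0, 0, 1])                   # 1 = cat, 0 = not cat

m = images.shape[0]

# Unroll every image into one feature vector and stack the vectors as columns.
X = images.reshape(m, -1).T   # shape (n_x, m), here n_x = 2 * 2 * 3 = 12
Y = labels.reshape(1, m)      # shape (1, m)

n_x = X.shape[0]
print(X.shape, Y.shape)
```

Column `X[:, i]` then holds all features of training example $i$. The order in which pixels and channels are unrolled is a free choice, as long as it is the same for every image.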

## Logistic Regression

Given features $x$, what is $\hat{y}=P(y=1|x)$, i.e., the probability that $y=1$ (e.g., that the image shows a cat)?  
Parameters $\omega\in\mathcal{R}^{n_{x}}$, $b\in\mathcal{R}$.  
Output $\hat{y}=\sigma(\omega^T x + b)$, i.e., sigmoid function $\sigma(z)=1/(1+e^{-z})\in(0,1)$.  

The goal is to learn the parameters $\omega$ and $b$. 
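The output formula can be sketched in a few lines of NumPy. The function names `sigmoid` and `predict_proba` and the parameter values are illustrative assumptions, not part of the notes.

```python
import numpy as np

def sigmoid(z):
    """Logistic function 1 / (1 + exp(-z)), applied element-wise."""
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(w, b, x):
    """Return y_hat = sigmoid(w^T x + b) for a single feature vector x."""
    return sigmoid(np.dot(w, x) + b)

# Hypothetical parameters and features with n_x = 3.
w = np.array([0.5, -0.25, 0.1])
b = 0.2
x = np.array([1.0, 2.0, 3.0])

# Here z = 0.5 - 0.5 + 0.3 + 0.2 = 0.5, so y_hat = sigmoid(0.5).
y_hat = predict_proba(w, b, x)
print(y_hat)
```

Because the sigmoid maps any real $z$ into $(0,1)$, `y_hat` can be read as a probability.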

#### Cost function 

Given a training set with $m$ training examples, we need to learn the parameters. 
We have to compute the loss for each training example $x^{(i)}$, where the true label is $y^{(i)}$ and the model prediction is $\hat{y}^{(i)}=\sigma(z^{(i)})$ with $z^{(i)}=\omega^T x^{(i)} + b$ and $\sigma(z^{(i)})=1/(1+e^{-z^{(i)}})$.  

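Consider a **loss function** (**error function**) that measures how far the prediction $\hat{y}$ is from the true label $y$. 
The squared error $(\hat{y}-y)^2$ is a poor choice here: combined with the sigmoid, the optimization problem becomes non-convex with many local minima, so gradient descent may not find the global minimum. 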

Consider 
$$
\mathcal{L}(\hat{y},y)=-(y\log(\hat{y})+(1-y)\log(1-\hat{y}))
$$

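With this loss the optimization problem is **convex** in the parameters, and the loss behaves as intended:

- if $y=1$, $\mathcal{L}=-\log(\hat{y})$, so minimizing it _tries_ to make $\hat{y}$ **large** (close to $1$)  
- if $y=0$, $\mathcal{L}=-\log(1-\hat{y})$, so minimizing it _tries_ to make $\hat{y}$ **small** (close to $0$)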

> The loss function is defined for a single training example

> The cost function is defined for the entire training set

$$
J(\omega,b) = \frac{1}{m}\sum_{i=1}^{m}\mathcal{L}(\hat{y}^{(i)},y^{(i)})
$$
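A minimal sketch of the two definitions, assuming the predictions are already available as an array. The names `loss` and `cost` and the numbers are illustrative.

```python
import numpy as np

def loss(y_hat, y):
    """Cross-entropy loss for one example (element-wise for arrays)."""
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

def cost(Y_hat, Y):
    """Cost J: the average of the per-example losses."""
    return np.mean(loss(Y_hat, Y))

# Hypothetical predictions and true labels for m = 3 examples.
Y_hat = np.array([0.9, 0.2, 0.6])
Y = np.array([1, 0, 1])

J = cost(Y_hat, Y)
print(loss(0.9, 1), loss(0.1, 1), J)
```

A confident correct prediction (`loss(0.9, 1)`) is penalized far less than a confident wrong one (`loss(0.1, 1)`). In practice predictions of exactly $0$ or $1$ would need clipping to avoid $\log(0)$, which this sketch does not handle.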


## Gradient descent algorithm

For a convex function, gradient descent moves downhill, step by step, towards the global minimum. 

Repeat: 

$$
w := w - \alpha \frac{\partial J(w,b)}{\partial w} \\
b := b - \alpha \frac{\partial J(w,b)}{\partial b}
$$

Here $\alpha$ is the learning rate, and each derivative is the slope of $J$ along the corresponding parameter. 
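To illustrate the update rule in isolation, the sketch below runs it on a one-dimensional convex function, $J(w)=(w-3)^2$. This function is chosen purely for illustration and is not the logistic regression cost. The helper name `gradient_descent` and its default values are assumptions.

```python
def gradient_descent(grad, w0, alpha=0.1, num_iters=100):
    """Repeat w := w - alpha * dJ/dw, starting from w0."""
    w = w0
    for _ in range(num_iters):
        w = w - alpha * grad(w)
    return w

# J(w) = (w - 3)^2 has slope dJ/dw = 2 (w - 3) and its minimum at w = 3.
w_min = gradient_descent(lambda w: 2.0 * (w - 3.0), w0=0.0)
print(w_min)
```

With $\alpha=0.1$ each step shrinks the distance to the minimum by a constant factor. A learning rate that is too large would instead overshoot and may diverge.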

## Computation Graph 

Computation in a NN is organized into two steps:
- a _forward pass_ (compute the output of the NN), and
- a _backward pass_ (compute the derivatives and update the model parameters).  

A computation graph illustrates why the computation is organized this way. Given a computation composed of $3$ functions, each of which is an intermediate step, the computation graph shows these steps in order.  
Backward propagation can be understood by increasing one of the input values $X$ _slightly_. This slight increase propagates forward through the graph, and the way it propagates can be traced back to compute the derivatives.  
If we change one of the values, we evaluate the net change in the output and obtain derivatives such as $dv/da$ via the _chain rule_. 

> When computing derivatives, it is most efficient to proceed from right to left through the graph. 
> This is the key point behind back propagation.  
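The idea can be sketched on a small graph. The function $J = 3(a + bc)$, split into the steps $u = bc$, $v = a + u$, $J = 3v$, is an assumed example chosen for illustration, and the input values are arbitrary.

```python
def forward(a, b, c):
    """Forward pass through the three steps of the graph."""
    u = b * c
    v = a + u
    return 3.0 * v

a, b, c = 5.0, 3.0, 2.0
J = forward(a, b, c)   # u = 6, v = 11, J = 33

# Backward pass: start at the output and apply the chain rule right to left.
dJ_dv = 3.0            # J = 3 v
dJ_da = dJ_dv * 1.0    # v = a + u, so dv/da = 1
dJ_du = dJ_dv * 1.0    # v = a + u, so dv/du = 1
dJ_db = dJ_du * c      # u = b c,   so du/db = c
dJ_dc = dJ_du * b      # u = b c,   so du/dc = b

# Check: nudge one input slightly and measure the net change in J.
eps = 1e-3
numeric_dJ_da = (forward(a + eps, b, c) - J) / eps
numeric_dJ_db = (forward(a, b + eps, c) - J) / eps
print(dJ_da, numeric_dJ_da, dJ_db, numeric_dJ_db)
```

Going right to left, `dJ_dv` and `dJ_du` are computed once and reused for every input, which is where the efficiency comes from.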


## Logistic regression gradient descent

Consider the computation graph.  
$z=w^Tx + b$  
$\hat{y} = a = \sigma(z)$  
$\mathcal{L}(a,y) = -(y\log(a)+(1-y)\log(1-a))$  

Consider the graph 

$[ x_1, w_1, x_2,w_2, b ]\rightarrow z=w_1x_1+w_2x_2+b \rightarrow a=\sigma(z)\rightarrow\mathcal{L}(a,y)$

Compute the derivatives of the loss by moving backwards through the graph. First compute $d\mathcal{L}(a,y)/da$, then go one step further back and compute $dz=d\mathcal{L}(a,y)/dz$ via the _chain rule_. Finally, compute $d\mathcal{L}/dw_1$ alongside the derivatives with respect to the other parameters, $w_2$ and $b$.
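Worked out for the graph above, the chain rule gives the following. The shorthand $da$, $dw_1$, $dw_2$, $db$ for derivatives of $\mathcal{L}$ is assumed here by analogy with $dz$.

$$
da = \frac{d\mathcal{L}}{da} = -\frac{y}{a} + \frac{1-y}{1-a}, \qquad \frac{da}{dz} = \sigma(z)(1-\sigma(z)) = a(1-a)
$$

$$
dz = \frac{d\mathcal{L}}{da}\,\frac{da}{dz} = \Big(-\frac{y}{a} + \frac{1-y}{1-a}\Big)\,a(1-a) = -y(1-a) + (1-y)a = a - y
$$

$$
dw_1 = x_1\,dz, \qquad dw_2 = x_2\,dz, \qquad db = dz
$$

The last line follows from $z = w_1x_1 + w_2x_2 + b$, so that $\partial z/\partial w_1 = x_1$, $\partial z/\partial w_2 = x_2$ and $\partial z/\partial b = 1$.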

## Gradient descent on $m$ examples

Recall the definition of the **cost function**

$$
J(w,b) = \frac{1}{m}\sum_{i=1}^m \mathcal{L}(a^{(i)},y^{(i)})
$$

where $a^{(i)}=\hat{y}^{(i)}=\sigma(z^{(i)})=\sigma(w^Tx^{(i)}+b)$  

The derivative of the cost with respect to a weight $w_j$ _is also the average of the derivatives_ of the individual loss terms: 

$$
\frac{\partial}{\partial w_j}J(w,b)=\frac{1}{m}\sum_{i=1}^m\frac{\partial }{\partial w_j}\mathcal{L}(a^{(i)},y^{(i)})
$$

where the derivatives for each training example can be computed independently as before. 

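Initialize $J$, $dw_j$ and $db$ to $0$, then loop over the training examples: for each example $i$ compute $z^{(i)}$ and $a^{(i)}$, add its loss to $J$, compute $dz^{(i)}$, and add its contributions to $dw_j$ and $db$, which act as **accumulators**. After the loop, divide $J$, $dw_j$ and $db$ by $m$ to obtain the averages. 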

The limitation of this approach is that it needs multiple explicit `for` loops (one over the $m$ training examples and one over the $n_x$ features), which is inefficient. Solution: _vectorization_. 
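The loop-based procedure can be sketched as follows. The function name `gradients_with_loops` and the toy data are illustrative assumptions. The loops are written out explicitly on purpose, to show what vectorization later removes.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradients_with_loops(w, b, X, Y):
    """One pass over the training set; X is (n_x, m), Y is (1, m)."""
    n_x, m = X.shape
    J, db = 0.0, 0.0
    dw = np.zeros(n_x)                     # accumulators start at 0
    for i in range(m):                     # loop over training examples
        z_i = b
        for j in range(n_x):               # loop over features
            z_i += w[j] * X[j, i]
        a_i = sigmoid(z_i)
        y_i = Y[0, i]
        J += -(y_i * np.log(a_i) + (1 - y_i) * np.log(1 - a_i))
        dz_i = a_i - y_i
        for j in range(n_x):               # second loop over features
            dw[j] += X[j, i] * dz_i
        db += dz_i
    return J / m, dw / m, db / m           # averages over the m examples

# Hypothetical toy data: n_x = 2 features, m = 3 examples.
X = np.array([[1.0, 2.0, -1.0],
              [0.5, -1.5, 2.0]])
Y = np.array([[1, 0, 1]])
w = np.zeros(2)
b = 0.0

J, dw, db = gradients_with_loops(w, b, X, Y)
print(J, dw, db)
```

With all parameters at zero every prediction is $0.5$, so the cost equals $\log 2$ regardless of the data.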


## Vectorization

Implement the training of a NN for logistic regression without _any_ explicit `for` loops. 

This is done using the dot product $x\cdot y$, available as `numpy.dot()`, and the transpose $x^T$. 
Then $Z = w^T X + b$, with $Z=[z^{(1)},\dots,z^{(m)}]$, computes $z^{(i)}$ for all training examples in one line, without an explicit `for` loop.  
**Note** that $b$ here is a single number; it is __broadcast__ to a $1\times m$ vector for this operation. 
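A minimal sketch of the one-line forward computation, compared against an explicit loop. The data and parameter values are made up, and $w$ is stored as an $(n_x, 1)$ column here as an assumption about the layout.

```python
import numpy as np

# Hypothetical toy data: n_x = 2 features, m = 3 examples.
X = np.array([[1.0, 2.0, -1.0],
              [0.5, -1.5, 2.0]])   # shape (n_x, m)
w = np.array([[0.3], [-0.2]])      # shape (n_x, 1)
b = 0.1                            # a plain number

# Vectorized: one line, b is broadcast across all m columns.
Z = np.dot(w.T, X) + b             # shape (1, m)

# The same result with an explicit loop, for comparison only.
Z_loop = np.zeros((1, X.shape[1]))
for i in range(X.shape[1]):
    Z_loop[0, i] = w[0, 0] * X[0, i] + w[1, 0] * X[1, i] + b

print(Z, np.allclose(Z, Z_loop))
```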

## Vectorizing the gradient computation

For each example, $dz^{(i)} = a^{(i)} - y^{(i)}$. Construct the row vectors $dz = [...dz^{(i)}...]$, $A = [... a^{(i)} ...]$ and $Y=[...y^{(i)}...]$. 
Then the vectorized operation is $dz = A-Y$.  
To compute the derivatives $dw$ and $db$ we use 
$db = (1/m)\sum_{i=1}^m dz^{(i)}$ and $dw = (1/m) X dz^T$, where $X dz^T$ is the matrix multiplication of the $n_x\times m$ matrix $X$ with the $m\times 1$ vector $dz^T$. 
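Put together, one gradient descent step without explicit loops over examples or features might look like the sketch below. The function name `gradient_step`, the toy data and the learning rate are assumptions for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_step(w, b, X, Y, alpha):
    """One vectorized update; X is (n_x, m), Y is (1, m), w is (n_x, 1)."""
    m = X.shape[1]
    A = sigmoid(np.dot(w.T, X) + b)   # predictions, shape (1, m)
    dZ = A - Y                        # shape (1, m)
    dw = np.dot(X, dZ.T) / m          # shape (n_x, 1)
    db = np.sum(dZ) / m               # scalar
    return w - alpha * dw, b - alpha * db, dw, db

# Hypothetical toy data: n_x = 2 features, m = 3 examples.
X = np.array([[1.0, 2.0, -1.0],
              [0.5, -1.5, 2.0]])
Y = np.array([[1, 0, 1]])
w = np.zeros((2, 1))
b = 0.0

w_new, b_new, dw, db = gradient_step(w, b, X, Y, alpha=0.1)
print(dw.ravel(), db)
```

Repeating the step many times still needs a loop over iterations. Only the loops over examples and features are removed.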

## Broadcasting

Simple Python...
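As a minimal sketch of what broadcasting does in NumPy (all array values are made up for illustration): when two operands have different shapes, dimensions of size $1$ are stretched to match the other operand, and a plain number is stretched to the full shape.

```python
import numpy as np

A = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])            # shape (2, 3)

# A number is broadcast to every element.
scalar_sum = A + 10.0

# A (1, 3) row is repeated for each of the 2 rows of A.
row = np.array([[100.0, 200.0, 300.0]])
row_sum = A + row

# A (2, 1) column is repeated for each of the 3 columns of A.
col = np.array([[1.0], [2.0]])
col_sum = A + col

# Typical use: divide each column by its own total, with no explicit loop.
percent = 100.0 * A / A.sum(axis=0, keepdims=True)

print(scalar_sum, row_sum, col_sum, percent, sep="\n")
```

Passing `keepdims=True` keeps the column totals as a $(1, 3)$ array, which makes the shapes in the division explicit.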

## Cost function for logistic regression

$\hat{y} = \sigma(w^T x + b)$ where $\sigma(z) = 1 / (1+e^{-z})$

Interpret $\hat{y} = P(y=1|x)$, i.e., if $y=1$ then $p(y|x) = \hat{y}$, and if $y=0$ then $p(y|x) = 1-\hat{y}$. These two equations can be summarized as $p(y|x) = \hat{y}^y(1-\hat{y})^{(1-y)}$, and taking the logarithm gives 
$$
\log{(p(y|x))} = y\log\hat{y} + (1-y)\log(1-\hat{y}) = -\mathcal{L}(\hat{y},y)
$$

> Minimizing the loss is therefore the same as maximizing the log of the probability. 

Consider the overall cost function for the entire training set.

Assume that the training examples are drawn independently from the same distribution (i.i.d.). 
Then:
> The probability of the whole sample is given by the product of the individual probabilities

$p(\text{labels in training set}) = \prod_{i=1}^{m} p(y^{(i)}|x^{(i)})$

To perform maximum likelihood estimation, we choose the parameters that maximize the probability of observing the labels in the training set. Since the logarithm is monotonic, this is the same as maximizing 

$\log p(...) = \sum_{i=1}^{m}\log p(y^{(i)}|x^{(i)} ) = -\sum_{i=1}^{m}\mathcal{L}(\hat{y}^{(i)},y^{(i)})$

> Maximum likelihood estimation: choose the parameters that maximize $-\sum_{i=1}^{m}\mathcal{L}(\hat{y}^{(i)},y^{(i)})$

> Minimizing the cost function $J(w,b)$ is therefore maximum likelihood estimation for the logistic regression model, under the assumption that the training examples are independent and identically distributed. The factor $1/m$ in $J$ only rescales the sum and does not change the optimum.

This justifies the cost function for the logistic regression. 
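The relation between the likelihood and the cost can be checked numerically. In the sketch below the predictions and labels are made-up numbers. It verifies that the log of the product of probabilities equals $-mJ$.

```python
import numpy as np

# Hypothetical predictions y_hat and true labels y for m = 4 examples.
Y_hat = np.array([0.9, 0.2, 0.6, 0.3])
Y = np.array([1, 0, 1, 0])
m = len(Y)

# p(y | x) = y_hat^y * (1 - y_hat)^(1 - y) for each example.
p = Y_hat ** Y * (1 - Y_hat) ** (1 - Y)

# Log-likelihood of the whole training set (log of the product).
log_likelihood = np.sum(np.log(p))

# Cost J as the average cross-entropy loss.
losses = -(Y * np.log(Y_hat) + (1 - Y) * np.log(1 - Y_hat))
J = np.mean(losses)

print(log_likelihood, -m * J)
```

Both printed numbers agree, so parameters that make $J$ smaller make the observed labels more probable under the model.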