# Cost and gradient for univariate (simple) linear regression

Given the linear model:

$$
h_\theta(x) = \theta_0 + \theta_1 x
$$

And the following concrete training data:

$$
D_{train} = \{(0,1),(1,3),(2,6),(4,8)\}
$$

where each tuple $(x,y)$ consists of the feature $x$ and the target $y$.

**Task:**

For $\theta_0 = 1$ and $\theta_1 = 2$ calculate:

1. The cost:

$$ J_D(\theta_0, \theta_1)=\frac{1}{2m}\sum_{i=1}^{m}{(h_\theta(x^{(i)})-y^{(i)})^2} $$

2. The gradient $\nabla J$, i.e. the partial derivatives:

$$ \frac{\partial J (\theta_0, \theta_1)}{\partial \theta_0} $$

$$ \frac{\partial J (\theta_0, \theta_1)}{\partial \theta_1} $$


**Solution:**

1. Cost:

$$J_D(\theta_0, \theta_1)=\frac{1}{2m}\sum_{i=1}^{m}{(h_\theta(x^{(i)})-y^{(i)})^2}$$

2. Gradient:

$$\frac{\partial J (\theta_0, \theta_1)}{\partial \theta_0} =  \frac{1}{m}\sum_{i=1}^{m}(\theta_0 + \theta_1 x^{(i)} - y^{(i)})$$

$$\frac{\partial J (\theta_0, \theta_1)}{\partial \theta_1} = \frac{1}{m}\sum_{i=1}^{m}((\theta_0 + \theta_1 x^{(i)} - y^{(i)}) \cdot x^{(i)})$$
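The cost and the two partial derivatives can be evaluated numerically for $\theta_0 = 1$ and $\theta_1 = 2$ on the training data given above. The following is a minimal sketch in plain Python; the function names `h`, `cost` and `gradient` are chosen here for illustration and are not part of the exercise.

```python
# Training data from the exercise: (x, y) pairs.
data = [(0, 1), (1, 3), (2, 6), (4, 8)]

def h(theta0, theta1, x):
    """Linear hypothesis h_theta(x) = theta0 + theta1 * x."""
    return theta0 + theta1 * x

def cost(theta0, theta1, data):
    """Half mean squared error J_D(theta0, theta1)."""
    m = len(data)
    return sum((h(theta0, theta1, x) - y) ** 2 for x, y in data) / (2 * m)

def gradient(theta0, theta1, data):
    """Partial derivatives of J with respect to theta0 and theta1."""
    m = len(data)
    errors = [(h(theta0, theta1, x) - y, x) for x, y in data]
    d_theta0 = sum(e for e, _ in errors) / m
    d_theta1 = sum(e * x for e, x in errors) / m
    return d_theta0, d_theta1

J = cost(1, 2, data)          # 0.25
grad = gradient(1, 2, data)   # (0.0, 0.5)
print(J, grad)
```

The prediction errors are $0, 0, -1, 1$, so the squared errors sum to $2$ and $J = 2/(2 \cdot 4) = 0.25$. The errors sum to $0$, and weighted by $x$ they sum to $2$, which gives the gradient $(0, 0.5)$.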

# Logistic regression and regularization

### Pen & Paper Exercises

#### Task

Why is 

$$
\text{arg}\max_x f(x) = \text{arg}\min_x \left[ - \log f(x) \right] 
$$

for a positive function $f(x) > 0$?

#### Solution
 
The logarithm is a strictly monotonically increasing function, so we have 
$$
\text{arg}\max_x f(x) = \text{arg}\max_x \left[ \log f(x) \right] 
$$

with 
$$
 \text{arg}\min_x \left[ - \log f(x) \right] = \text{arg}\max_x \left[ \log f(x) \right] 
$$

we have the result

$$
\text{arg}\max_x f(x) = \text{arg}\min_x \left[ - \log f(x) \right] 
$$
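The identity can be checked numerically on a grid. The sketch below uses a hypothetical positive function $f(x) = \exp(-(x-1)^2)$, chosen only for illustration, and compares the grid point that maximizes $f$ with the grid point that minimizes $-\log f$.

```python
import math

# Hypothetical positive function with a single maximum at x = 1.
def f(x):
    return math.exp(-(x - 1.0) ** 2)

# Evaluation grid on [-3, 3] with step 0.01.
xs = [i / 100 for i in range(-300, 301)]

# Grid point that maximizes f.
x_argmax = max(xs, key=f)

# Grid point that minimizes the negative logarithm of f.
x_argmin = min(xs, key=lambda x: -math.log(f(x)))

print(x_argmax, x_argmin)
```

Both searches return the same grid point, because taking the logarithm preserves the ordering of the function values and the sign flip turns the maximum into a minimum.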

#### Logistic model

In logistic regression, the prediction of a learned model $h_\Theta(\vec x)$
can be interpreted as the probability that $\vec x$ belongs to the positive class $1$:

$$p(y=1\mid \vec x; \Theta) = h_\Theta(\vec x)$$

#### Task
What is the predicted probability of the negative class, $p(y=0\mid \vec x; \Theta)$, expressed in terms of $h_\Theta(\vec x)$?

#### Solution
$$p(y=0\mid \vec x; \Theta) = 1 - h_\Theta(\vec x)$$

#### Loss


The loss of an example $(\vec x^{(i)}, y^{(i)})$ with target value $y^{(i)}=1$ is
$$loss_{(\vec x^{(i)}, 1)} (\Theta) = - \log p(y=1\mid \vec x^{(i)}; \Theta)$$

The loss of an example $(\vec x^{(i)}, y^{(i)})$ with target value $y^{(i)}=0$ is
$$loss_{(\vec x^{(i)}, 0)} (\Theta) = - \log p(y=0\mid \vec x^{(i)}; \Theta)$$

Minimizing the loss therefore maximizes $p(y=k\mid \vec x; \Theta)$ for the target class $k$ by searching
in the $\Theta$-space.  

$p(y=k\mid \vec x; \Theta)$ is called the *likelihood* of $\Theta$ (for one example $(\vec x, y)$)
if it is considered as a function of $\Theta$.

$\mathcal L^{(i)}(\Theta) = \log p(y=y^{(i)}\mid \vec x^{(i)}; \Theta)$ is the log-likelihood
of $\Theta$ for an example $i$.

#### Task

Why is $p(y=k\mid \vec x; \Theta)$ not a probability with respect to $\Theta$?
Which property of a probability does not hold?

#### Solution

$p(y=k\mid \vec x; \Theta)$ is a probability with respect to $y$. The property
$\sum_k p(y=k\mid \vec x; \Theta)=1$ holds.

This property does not hold with respect to $\Theta$. 
Since $\Theta$ is continuous, the sum becomes an integral, and
$$\int_\Theta p(y\mid \vec x; \Theta) d\Theta$$
is not $1$ in general. Therefore, it is not a probability with respect to $\Theta$. Note that $\Theta$ is on the right side of the conditioning bar "$\mid$".

This is why $p(y=k\mid \vec x; \Theta)$ gets another name when it is considered w.r.t. $\Theta$:
the technical term **likelihood** is used.
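A small numerical experiment illustrates the difference. The sketch assumes a hypothetical one-parameter logistic model $h_\theta(x) = \sigma(\theta x)$ with a fixed scalar feature $x = 1$; the integration range $[-10, 10]$ is an arbitrary choice made only for this illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

x = 1.0  # a single fixed scalar feature (hypothetical example)

def p(y, theta):
    """p(y | x; theta) for the one-parameter model h = sigmoid(theta * x)."""
    h = sigmoid(theta * x)
    return h if y == 1 else 1.0 - h

# Summing over the classes y gives 1 for every theta.
sum_over_y = p(0, 0.7) + p(1, 0.7)

# Integrating over theta (left Riemann sum on [-10, 10]) does not give 1.
step = 0.01
integral_over_theta = sum(p(1, -10 + k * step) * step for k in range(2000))

print(sum_over_y, integral_over_theta)
```

The sum over $y$ is $1$, while the integral over $\theta$ is roughly $10$ on this range and keeps growing as the range is widened.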

#### i.i.d. and log-likelihood for all data

Note that the training data in logistic regression should be 
**i.i.d.** (independent and identically distributed):

A simple example of an i.i.d. data set is the coin toss of a (biased) coin.
Assume that the probability of head (class $y=1$) is $0.4$, i.e. $p(y=1)=0.4$.  
The probability of getting two heads in two throws is $0.4 \cdot 0.4$:
- Each throw has the same distribution (here: $p(y=1)=0.4$). Each throw of the same coin is **identically distributed**.
- The throws are **independent**. If we get a head on the first throw the probability of
getting a head on the second throw does not change.

So, the probability factorizes: $p(y^{(1)}=1, y^{(2)}=1)=p(y^{(1)}=1)p(y^{(2)}=1)$
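The factorization for the coin can be written as a product over the individual throws. This is a minimal sketch; the helper `sequence_probability` is a name introduced here for illustration.

```python
import math

p_head = 0.4  # p(y=1) for the coin from the text

def sequence_probability(tosses, p_head):
    """Probability of an i.i.d. sequence of tosses (1 = head, 0 = tail)."""
    return math.prod(p_head if y == 1 else 1.0 - p_head for y in tosses)

two_heads = sequence_probability([1, 1], p_head)   # 0.4 * 0.4
mixed = sequence_probability([1, 0, 1], p_head)    # 0.4 * 0.6 * 0.4

print(two_heads, mixed)
```

Each factor depends only on its own throw, which is exactly what independence and identical distribution provide.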

For our classification problem:

$p(\mathcal D_y \mid \mathcal D_x; \Theta) = \prod_i p(y=y^{(i)}\mid \vec x^{(i)}; \Theta)$ 

with 
- $\mathcal D_x= \{\vec x^{(1)}, \vec x^{(2)}, \dots , \vec x^{(m)}\}$
- $\mathcal D_y= \{y^{(1)}, y^{(2)}, \dots , y^{(m)}\}$
- $\mathcal D$ is the combination of $\mathcal D_x$ with $\mathcal D_y$:
$\mathcal D= \{ (\vec x^{(1)},y^{(1)}), (\vec x^{(2)},y^{(2)}), \dots , (\vec x^{(m)},y^{(m)})\}$. 

#### Task 
For the whole data set, the log-likelihood $\mathcal L_\mathcal D(\Theta)$ of a parameter set $\Theta$ is 
$\log p(\mathcal D_y \mid \mathcal D_x; \Theta)$.  
Note: The (log-)likelihood $\mathcal L_\mathcal D(\Theta)$ is a function of the parameters $\Theta$.
Never say "the (log-)likelihood of the data".

1. What is $\mathcal L_\mathcal D(\Theta) = \log p(\mathcal D_y \mid \mathcal D_x; \Theta)$ expressed by the $p(y=y^{(i)}\mid \vec x^{(i)}; \Theta)$?

2. What is the relation of the log-likelihood $\mathcal L^{(i)}(\Theta)$ (for the individual examples $(\vec x^{(i)}, y^{(i)})$) 
to the log-likelihood $\mathcal L_\mathcal D(\Theta)$ for the whole data set?

 
In logistic regression the cost function is the negative log-likelihood divided by the number of data examples $m$:

$$J (\Theta) = - \frac{\mathcal L_\mathcal D(\Theta)}{m}$$

3. What is the relation between the (log-)likelihood and the cost function of logistic regression? Derive the cost function of logistic regression from your result of 2.

#### Solution

1. 
$$ \mathcal L_\mathcal D(\Theta) = \log p(\mathcal D_y \mid \mathcal D_x; \Theta) = \log \prod_i p(y=y^{(i)}\mid \vec x^{(i)}; \Theta) =\sum_i \log p(y=y^{(i)}\mid \vec x^{(i)}; \Theta)  $$

2. From 1. we have

$$ \mathcal L_{\mathcal D}(\Theta) = \sum_i \mathcal L^{(i)}(\Theta) $$

3.
So $J (\Theta)$ is the negative average of $\mathcal L^{(i)}(\Theta)$.


$$J (\Theta) = - \frac{\mathcal L_\mathcal D(\Theta)}{m}= - \frac{1}{m} \sum_i \mathcal L^{(i)}(\Theta)$$


So, with $p(y=1\mid \vec x^{(i)}; \Theta) = h_\Theta(\vec x^{(i)})$ 
and $p(y=0\mid \vec x^{(i)}; \Theta) = 1- h_\Theta(\vec x^{(i)})$, we have:

$$
\begin{align}
J (\Theta) 
 & = - \frac{1}{m}  \sum_i \log p(y=y^{(i)}\mid \vec x^{(i)}; \Theta) \\
 &= - \frac{1}{m}  \sum_{i=1}^{m} 
    \left[  y^{(i)} \log h_\Theta({\vec x}^{(i)})+
      (1 - y^{(i)}) \log \left( 1- h_\Theta({\vec x}^{(i)})\right) \right]
\end{align}
$$

Multiplying the two log terms by $y^{(i)}$ and $(1 - y^{(i)})$, respectively, selects the term that belongs to the
target class and cancels out the other one.
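The equivalence of the two forms of the cost can be checked numerically. The sketch below uses hypothetical toy data and arbitrary parameters. It computes $J$ once from the likelihood product and once from the sum in which $y^{(i)}$ and $(1 - y^{(i)})$ select the log term.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def h(theta, x):
    """h_theta(x) = sigmoid(theta^T x'), with x' = (1, x_1, ..., x_n)."""
    return sigmoid(sum(t * xk for t, xk in zip(theta, [1.0] + list(x))))

# Hypothetical toy data: one feature per example, binary targets.
X = [[0.5], [1.5], [-1.0], [2.0]]
Y = [0, 1, 0, 1]
theta = [-0.5, 1.0]  # arbitrary parameters (theta_0, theta_1)

m = len(X)

# (a) Likelihood as a product of per-example probabilities, then -log / m.
likelihood = math.prod(h(theta, x) if y == 1 else 1.0 - h(theta, x)
                       for x, y in zip(X, Y))
J_product = -math.log(likelihood) / m

# (b) Sum form: y and (1 - y) select the matching log term.
J_cross_entropy = -sum(y * math.log(h(theta, x))
                       + (1 - y) * math.log(1.0 - h(theta, x))
                       for x, y in zip(X, Y)) / m

print(J_product, J_cross_entropy)
```

Both values agree up to floating-point rounding. In practice the sum form is preferable, because a product of many probabilities quickly underflows.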

#### Derivative of the logistic function

The sigmoid activation function is defined as $\sigma (z) = \frac{1}{1+\exp(-z)}$ 

**Task:**

Show that:
$$
\frac{d \sigma(z)}{d z} = \sigma(z)(1-\sigma(z))
$$

**Solution:**

\begin{equation}
\begin{split}
\frac{d \sigma(z)}{d z} & = \frac{d }{d z}  \left(\frac{1}{1+\exp(-z)}\right) \\
 & \\
 & \text{Quotient rule}\\
 & \\
 & = \frac{(1)'(1+\exp(-z)) - (1)(1+\exp(-z))'}{(1+\exp(-z))^2} \\
 & \\
 & = \frac{0(1+\exp(-z)) - (1)(-\exp(-z))}{(1+\exp(-z))^2} \\
 & \\ 
 & = \frac{\exp(-z)}{(1+\exp(-z))^2} \\ 
 & \\
 & \text{add and subtract 1 in the numerator}\\ 
 & \\ 
 & = \frac{ 1 + \exp(-z) - 1}{(1+\exp(-z))^2} \\ 
 & \\
 & = \frac{1 + \exp(-z) }{(1+\exp(-z))^2} - \frac{1}{(1+\exp(-z))^2} \\
 & \\
 & = \frac{1}{1+\exp(-z)} - \left( \frac{1}{1+\exp(-z)} \right)^2 \\
 & \\
 & = \sigma(z) - \sigma(z)^2 \\
 & \\
 & = \sigma(z) (1-\sigma(z))
\end{split}
\end{equation}
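The result can be sanity-checked with a central finite difference. This is a minimal sketch; the test points and the step size `eps` are arbitrary choices.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_prime(z):
    """Analytic derivative sigma(z) * (1 - sigma(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)

eps = 1e-6
max_error = 0.0
for z in [-3.0, -1.0, 0.0, 0.5, 2.0]:
    # Central difference approximation of d sigma / d z.
    numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)
    max_error = max(max_error, abs(numeric - sigmoid_prime(z)))

print(max_error)
```

The largest deviation between the numerical and the analytic derivative is at the level of rounding error. At $z = 0$ the derivative takes its maximum value $0.5 \cdot 0.5 = 0.25$.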

#### Task:

Now show that:
$$
\frac{\partial \sigma(z)}{\partial \theta_j} = \sigma(z)(1-\sigma(z)) \cdot x_j
$$


with 
- $z=\vec x'^T \vec \theta$

and
- $\vec \theta = (\theta_0, \theta_1, \dots, \theta_n)^T $
- $\vec x' = (x_0, x_1, \dots, x_n)^T $


Hint: Use the *chain rule of calculus*.

**Solution:**
    
    
Note that $z=\vec x'^T \vec \theta = \sum_{k=0}^{n} \theta_k x_k$
    
$$
\frac{\partial \sigma(\vec x'^T \vec \theta)}{\partial \theta_j} = 
\frac{\partial \sigma(z)}{\partial z} \frac{\partial z}{\partial \theta_j}=
\frac{\partial \sigma(z)}{\partial z} \frac{\partial(\sum_{k=0}^{n} \theta_k x_k)}{\partial \theta_j}= \sigma(z) (1-\sigma(z)) x_j
$$
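The chain-rule result can be verified per component in the same way. The vectors `x_prime` and `theta` below are hypothetical values chosen only for this check.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def z_of(theta, x_prime):
    """z = x'^T theta."""
    return sum(t * xk for t, xk in zip(theta, x_prime))

x_prime = [1.0, 2.0, -0.5]   # hypothetical x' with x_0 = 1
theta = [0.3, -0.2, 0.8]     # hypothetical parameters

# Analytic partial derivatives: sigma(z) * (1 - sigma(z)) * x_j.
s = sigmoid(z_of(theta, x_prime))
analytic = [s * (1.0 - s) * xj for xj in x_prime]

# Central differences, perturbing one theta_j at a time.
eps = 1e-6
numeric = []
for j in range(len(theta)):
    plus = list(theta)
    minus = list(theta)
    plus[j] += eps
    minus[j] -= eps
    diff = sigmoid(z_of(plus, x_prime)) - sigmoid(z_of(minus, x_prime))
    numeric.append(diff / (2 * eps))

print(analytic, numeric)
```

Each numerical partial derivative matches $\sigma(z)(1-\sigma(z)) \cdot x_j$, since only the term $\theta_j x_j$ of $z$ depends on $\theta_j$.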

**Task:**

Show from
$$
    \frac{\partial}{\partial \theta_j}  J(\theta)  =  
    \frac{\partial}{\partial \theta_j}  \left( - \frac{1}{m}  \sum_{i=1}^{m} 
    \left[  y^{(i)} \log h_\theta({\vec x}^{(i)})+
      (1 - y^{(i)}) \log \left( 1- h_\theta({\vec x}^{(i)})\right) \right] \right)
$$  
that
$$
\frac{\partial}{\partial \theta_j}  J(\theta)  =   \frac{1}{m}
     \sum_{i=1}^{m} \left( h_\theta({\vec x}^{(i)})- y^{(i)}\right) x_j^{(i)}
$$

with the hypothesis $h_\theta(\vec x^{(i)}) = \sigma\left({\vec x'}^{(i)T} \vec \theta\right)$.
So, with our classification cost function (from the maximum-likelihood principle), the 
partial derivatives (the components of the gradient) have a simple form.

**Hint:**

1. Make use of the result from above:

$$
\frac{\partial h_\theta(\vec x^{(i)})}{\partial \theta_j} = h_\theta(\vec x^{(i)})(1-h_\theta(\vec x^{(i)})) \cdot x_j^{(i)}
$$
2. and note that the chain rule for the derivative of the log is:

$$
\frac{\partial \log(f(a))}{\partial a} = \frac{\partial \log(f(a))}{\partial f} \frac{\partial f(a)}{\partial a} =
\frac{1}{f(a)} \frac{\partial f(a)}{\partial a}
$$

**Solution** (very explicit; it is much simpler than it looks):

$$\begin{align}
    \frac{\partial}{\partial \theta_j}  J(\theta) &= 
    \frac{\partial}{\partial \theta_j}  \left( - \frac{1}{m}  \sum_{i=1}^{m} 
    \left[  y^{(i)} \log h_\theta({\vec x}^{(i)})+
      (1 - y^{(i)}) \log \left( 1- h_\theta({\vec x}^{(i)})\right) \right] \right) \\ &=
        - \frac{1}{m}  \sum_{i=1}^{m} 
    \left[  y^{(i)} \frac{\partial}{\partial \theta_j}  \log h_\theta({\vec x}^{(i)})+
      (1 - y^{(i)}) \frac{\partial}{\partial \theta_j}  \log \left( 1- h_\theta({\vec x}^{(i)})\right) \right]  \\&=  
 - \frac{1}{m}  \sum_{i=1}^{m} 
    \left[  y^{(i)} \frac{h_\theta({\vec x}^{(i)}) (1-h_\theta({\vec x}^{(i)})) } {h_\theta({\vec x}^{(i)})} x_j^{(i)} -
      (1 - y^{(i)}) \frac{h_\theta({\vec x}^{(i)}) (1-h_\theta({\vec x}^{(i)})) }{\left( 1- h_\theta({\vec x}^{(i)})\right)} x_j^{(i)} \right]\\
    &=    - \frac{1}{m}  \sum_{i=1}^{m} 
    \left[  y^{(i)} {  (1-h_\theta({\vec x}^{(i)})) }   x_j^{(i)} -
      (1 - y^{(i)}) {h_\theta({\vec x}^{(i)}) } x_j^{(i)} \right]\\
    &=    - \frac{1}{m}  \sum_{i=1}^{m} 
    \left[  y^{(i)} {  (1-h_\theta({\vec x}^{(i)})) }  -
      (1 - y^{(i)}) {h_\theta({\vec x}^{(i)} }) \right]x_j^{(i)}\\ 
    &=    - \frac{1}{m}  \sum_{i=1}^{m} 
    \left[  y^{(i)} - y^{(i)} h_\theta({\vec x}^{(i)})   -
      ( h_\theta({\vec x}^{(i)}) - y^{(i)} h_\theta({\vec x}^{(i)}) ) \right]x_j^{(i)}\\
    &=    - \frac{1}{m}  \sum_{i=1}^{m} 
    \left[  y^{(i)} - y^{(i)} h_\theta({\vec x}^{(i)})   
      - h_\theta({\vec x}^{(i)}) + y^{(i)} h_\theta({\vec x}^{(i)}) \right]x_j^{(i)}\\
    &=    - \frac{1}{m}  \sum_{i=1}^{m} 
    \left[  y^{(i)}    
      - h_\theta({\vec x}^{(i)})  \right]x_j^{(i)}\\
  &=  \frac{1}{m}  \sum_{i=1}^{m} 
    \left[  h_\theta({\vec x}^{(i)}) - y^{(i)} \right]x_j^{(i)}
\end{align}
$$  
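The final gradient formula can be compared against a numerical gradient of the cost. The sketch below uses hypothetical toy data in which every $\vec x'$ already contains the bias entry $x_0 = 1$; the function names are chosen for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def h(theta, x_prime):
    """Hypothesis h_theta(x) = sigmoid(x'^T theta)."""
    return sigmoid(sum(t * xk for t, xk in zip(theta, x_prime)))

def cost(theta, X, Y):
    """Logistic regression cost J(theta) (negative average log-likelihood)."""
    m = len(X)
    return -sum(y * math.log(h(theta, x))
                + (1 - y) * math.log(1.0 - h(theta, x))
                for x, y in zip(X, Y)) / m

def gradient(theta, X, Y):
    """Analytic gradient: (1/m) * sum_i (h(x_i) - y_i) * x_ij."""
    m = len(X)
    return [sum((h(theta, x) - y) * x[j] for x, y in zip(X, Y)) / m
            for j in range(len(theta))]

# Hypothetical toy data; each x' starts with the bias entry x_0 = 1.
X = [[1.0, 0.5], [1.0, 1.5], [1.0, -1.0], [1.0, 2.0]]
Y = [0, 1, 0, 1]
theta = [-0.5, 1.0]

analytic = gradient(theta, X, Y)

# Central differences of the cost, perturbing one theta_j at a time.
eps = 1e-6
numeric = []
for j in range(len(theta)):
    plus = list(theta)
    minus = list(theta)
    plus[j] += eps
    minus[j] -= eps
    numeric.append((cost(plus, X, Y) - cost(minus, X, Y)) / (2 * eps))

print(analytic, numeric)
```

The analytic gradient has the same form as in linear regression, namely the prediction error times the feature, averaged over the examples. Only the hypothesis $h_\theta$ differs.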