# **Module 2: Basics of Neural Networks**

## Section I Logistic Regression as a Neural Network

#### 1. Binary Classification

- **Concept**: A classifier that produces a label *y*, either 0 (absent) or 1 (present), from a feature vector *x* with $n_x$ features
- ***Notation***: 
    - (x,y), $x \in \textbf{R}^{n_{x}}$, $y \in \{0,1\}$
        - **for *m* training examples**: ($x^{(1)}$,$y^{(1)}$), ($x^{(2)}$,$y^{(2)}$),... ($x^{(m)}$,$y^{(m)}$)
    - $X = [x^{(1)},x^{(2)},...x^{(m)}]$, where $x^{(i)} = [x^{(i)}_{1},x^{(i)}_{2},...x^{(i)}_{n_{x}}]^T; X \in \textbf{R}^{{n_{x}} \times m}, X.shape = (n_{x},m)$
    - $y = [y^{(1)},y^{(2)},...y^{(m)}], y \in \{0,1\}^{1 \times m}, y.shape = (1,m)$
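
The shape conventions above can be illustrated with a small NumPy sketch. The toy values and the names `examples`, `X`, and `y` are assumptions made up for illustration; only the shapes $(n_x, m)$ and $(1, m)$ come from the notation.

```python
import numpy as np

# Hypothetical toy data: m = 4 training examples, n_x = 3 features each.
examples = [
    (np.array([0.2, 0.7, 0.1]), 1),
    (np.array([0.9, 0.1, 0.4]), 0),
    (np.array([0.3, 0.8, 0.5]), 1),
    (np.array([0.6, 0.2, 0.9]), 0),
]

# Stack every x^(i) as one column, so X has shape (n_x, m).
X = np.column_stack([x_i for x_i, _ in examples])

# Collect the labels in a single row, so y has shape (1, m).
y = np.array([[y_i for _, y_i in examples]])

n_x, m = X.shape
print(X.shape, y.shape)
```

Column `X[:, i]` holds the features of example `i + 1`, because NumPy indexes from 0 while the notes index from 1.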

#### 2. Logistic Regression

- **Concept**: Given *x* ($x \in \textbf{R}^{n_{x}}$), produce $\hat{y}$, where $\hat{y} = P(y=1|x)$
- **Parameter**: $w \in \textbf{R}^{n_{x}}, b \in \textbf{R}$
- **Output**: $\hat{y} = \sigma (w^{T}x+b) = \sigma (z)$
    - $z = w^{T}x + b$
    - $\sigma (z) = \frac{1}{1+e^{-z}}$
        - If $z \rightarrow +\infty \Rightarrow e^{-z} \rightarrow 0 \Rightarrow \sigma (z) \rightarrow 1$
        - If $z \rightarrow -\infty \Rightarrow e^{-z} \rightarrow +\infty \Rightarrow \sigma (z) \rightarrow 0$
    
![Sigmoid](resource%20database%20for%20MD%20notes/Week2/1280px-Logistic-curve.svg.png )  
$\qquad\qquad\qquad\qquad\qquad\qquad$*Sigmoid Function*  
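
As a minimal sketch, the output formula of this subsection could be written in NumPy as follows. The function names `sigmoid` and `predict_proba` are hypothetical, and the code assumes `w` has shape $(n_x, 1)$ and `X` has shape $(n_x, m)$.

```python
import numpy as np

def sigmoid(z):
    """Elementwise logistic function 1 / (1 + e^{-z})."""
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(w, b, X):
    """Return y_hat = sigma(w^T X + b), one probability per column of X."""
    z = w.T @ X + b          # shape (1, m)
    return sigmoid(z)

# Limiting behaviour: large positive z gives ~1, large negative z gives ~0.
print(sigmoid(np.array([-10.0, 0.0, 10.0])))
```

For very negative `z`, `np.exp(-z)` can overflow and emit a warning; a production implementation would likely use a numerically stable variant, which this sketch leaves out.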

#### 3. Logistic Regression Cost Function

- **Loss (error) function**: $L(\hat{y},y) = -[y\log\hat{y}+(1-y)\log(1-\hat{y})]$
    - ***Meaning***: a function to measure how good our output $\hat{y}$ is when the true label is $y$.
        - If $y = 1$: $L(\hat{y},y) = -\log\hat{y} \Rightarrow \hat{y}_+\rightarrow 1 \leftrightarrow L(\hat{y},y) \rightarrow 0$
        - If $y = 0$: $L(\hat{y},y) = -\log(1-\hat{y}) \Rightarrow \hat{y}_-\rightarrow 0 \leftrightarrow L(\hat{y},y) \rightarrow 0$
    - ***Application***: on a single training sample of the whole set
- **Cost function**: $J(w,b) = \frac{1}{m}\sum\limits_{i=1}^{m} L(\hat{y}^{(i)},y^{(i)}) = -\frac{1}{m} \sum\limits_{i=1}^{m} [y^{(i)}\log\hat{y}^{(i)}+(1-y^{(i)})\log(1-\hat{y}^{(i)})]$
    - ***Application***: on the whole sample set
    - ***Characteristic***: $\underline{convex}$, so every local minimum is a global minimum; the *w* and *b* that attain it are called the **global optimum**
- **Task**: Given $\{(x^{(1)},y^{(1)}), (x^{(2)},y^{(2)}),... (x^{(m)},y^{(m)})\}$, want $\hat{y}^{(i)} \approx y^{(i)}$
    - ***Specifically***: Find the *w* and *b* that minimize the **cost function** *J*
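
The loss and cost formulas above translate almost directly into NumPy. The sketch below uses the hypothetical names `loss` and `cost`, and assumes `y_hat` and `y` both have shape $(1, m)$ with every prediction strictly between 0 and 1, so that the logarithms stay finite.

```python
import numpy as np

def loss(y_hat, y):
    """Per-example loss L(y_hat, y), computed elementwise."""
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

def cost(y_hat, y):
    """Cost J: the average of the per-example losses over all m examples."""
    m = y.shape[1]
    return float(np.sum(loss(y_hat, y)) / m)

y = np.array([[1, 0]])
good = np.array([[0.9, 0.1]])   # confident and correct
bad = np.array([[0.1, 0.9]])    # confident and wrong
print(cost(good, y), cost(bad, y))
```

The confident, correct predictions give a much smaller cost than the confident, wrong ones, which matches the two limiting cases listed under *Meaning*.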

#### 4. Gradient Descent

- **Use**: an iterative method to find the *w* and *b* at the **global optimum** of the cost function *J*
- **Procedure** (*w* and *b* are updated together on each iteration):
    - 1-1. $w_{new} = w - \alpha\frac{\partial J(w,b)}{\partial w}$
    - 1-2. $b_{new} = b - \alpha\frac{\partial J(w,b)}{\partial b}$
    - 2. *If not* converged [$J(w_{new},b_{new})$ is still noticeably lower than $J(w,b)$]:
        - 3. Set $w = w_{new}$, $b = b_{new}$ and *return to* Step 1
    - 4. *Else* converged:
        - 5-1. $w_{final} = w_{new}$
        - 5-2. $b_{final} = b_{new}$
- **Learning rate $\alpha$**: controls how big a step is taken on each iteration
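
The update loop above might look like this in NumPy. This is a sketch under several assumptions: the gradient expressions $\frac{\partial J}{\partial w} = \frac{1}{m}X(\hat{y}-y)^T$ and $\frac{\partial J}{\partial b} = \frac{1}{m}\sum_i(\hat{y}^{(i)}-y^{(i)})$ are the standard results for logistic regression and are not derived in this section; the stopping tolerance `tol`, the cap `max_iters`, the toy data, and all function names are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost_and_grads(w, b, X, y):
    """Return the cost J and its partial derivatives with respect to w and b."""
    m = X.shape[1]
    y_hat = sigmoid(w.T @ X + b)                      # shape (1, m)
    J = -np.sum(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat)) / m
    dw = X @ (y_hat - y).T / m                        # shape (n_x, 1)
    db = np.sum(y_hat - y) / m
    return float(J), dw, float(db)

def gradient_descent(X, y, alpha=0.1, tol=1e-6, max_iters=10_000):
    """Repeat the w and b updates until J stops changing by more than tol."""
    w = np.zeros((X.shape[0], 1))
    b = 0.0
    costs = []
    for _ in range(max_iters):
        J, dw, db = cost_and_grads(w, b, X, y)
        converged = bool(costs) and abs(costs[-1] - J) < tol
        costs.append(J)
        if converged:
            break
        w = w - alpha * dw    # step 1-1
        b = b - alpha * db    # step 1-2
    return w, b, costs

# Hypothetical one-feature data set: small x labelled 0, large x labelled 1.
X = np.array([[0.5, 1.5, 3.0, 4.0]])
y = np.array([[0, 0, 1, 1]])
w, b, costs = gradient_descent(X, y)
print(costs[0], costs[-1])
```

A smaller `alpha` takes more iterations to converge, while one that is too large can overshoot the minimum and make the cost increase.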

#### 5. Derivatives (Basic)

- **Meaning**: the derivative of a function is its slope at a given point; the slope can differ from point to point along the function
- **General functions to calculate derivatives**:
    - $f(x) = ax^b \Rightarrow f'(x) = abx^{b-1}$
    - $f(x) = \log_{a}x \Rightarrow f'(x) = \frac{1}{x\ln a}$
        - *Special case*: $f(x) = \ln x \Rightarrow f'(x) = \frac{1}{x}$
    - $f(x) = a^x \Rightarrow f'(x) = a^x \ln a$
        - *Special case*: $f(x) = e^x \Rightarrow f'(x) = e^x$
    - $f(x) = \sin x \Rightarrow f'(x) = \cos x$
    - $f(x) = \cos x \Rightarrow f'(x) = -\sin x$
    - $f(x) = \tan x \Rightarrow f'(x) = \frac{1}{\cos^2 x}$
    - $f(x) = \cot x \Rightarrow f'(x) = -\frac{1}{\sin^2 x}$
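
One way to sanity-check the rules above is to compare each one with a numerical slope. The sketch below uses a central difference; the helper `numerical_derivative`, the step size `h`, the evaluation point `x0`, and the specific constants ($a = 3$, $b = 2$, base 2) are all choices made for illustration.

```python
import math

def numerical_derivative(f, x, h=1e-6):
    """Approximate f'(x) by the slope between x - h and x + h."""
    return (f(x + h) - f(x - h)) / (2 * h)

# (label, function, derivative according to the rules listed above)
rules = [
    ("3x^2", lambda x: 3 * x**2, lambda x: 6 * x),
    ("ln x", math.log, lambda x: 1 / x),
    ("2^x", lambda x: 2**x, lambda x: 2**x * math.log(2)),
    ("sin x", math.sin, math.cos),
    ("cos x", math.cos, lambda x: -math.sin(x)),
    ("tan x", math.tan, lambda x: 1 / math.cos(x) ** 2),
]

x0 = 1.2
for label, f, f_prime in rules:
    approx = numerical_derivative(f, x0)
    print(f"{label}: numeric = {approx:.6f}, rule = {f_prime(x0):.6f}")
```

The two columns should agree to several decimal places; a clear mismatch would point to a mistake in the analytic rule or in its transcription.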

#### 6. Computational Graph

- **Use**: Calculate a function (e.g.,  **cost function** *J*) step-by-step from left to right in a graph
- **Example**: Given $J = 3(a+bc)=3(a+u)=3v$
![Computational graph](resource%20database%20for%20MD%20notes/Week2/Computational_graph.png)
    - ***Derivative***: 
        - $\frac{\mathrm{d}J}{\mathrm{d}v} = 3$ - one-step backward propagation
        - $\frac{\partial J}{\partial a} = \frac{\mathrm{d}J}{\mathrm{d}v}\frac{\partial v}{\partial a} = 3\times 1 = 3$
        - $\frac{\partial J}{\partial b} = \frac{\mathrm{d}J}{\mathrm{d}v}\frac{\partial v}{\partial u}\frac{\partial u}{\partial b} =3\times 1\times c = 3c$
- **Notation in code**: $\frac{\mathrm{d}J}{\mathrm{d}x}$ is stored in a variable named simply $\mathrm{d}x$ to keep names short
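
The example graph can be traced in code with one forward pass and one backward pass. This is a sketch: the function name `forward_backward` and the input values are hypothetical, and the derivative with respect to *c* is not listed above but follows from the same chain rule.

```python
def forward_backward(a, b, c):
    """Evaluate J = 3(a + bc) left to right, then its derivatives right to left."""
    # Forward pass through the graph.
    u = b * c
    v = a + u
    J = 3 * v

    # Backward pass; each name dX stands for dJ/dX.
    dv = 3.0          # J = 3v
    da = dv * 1.0     # v = a + u, so dv/da = 1
    du = dv * 1.0     # v = a + u, so dv/du = 1
    db = du * c       # u = bc, so du/db = c
    dc = du * b       # u = bc, so du/dc = b
    return J, {"dv": dv, "da": da, "du": du, "db": db, "dc": dc}

J, grads = forward_backward(a=5.0, b=3.0, c=2.0)
print(J, grads)
```

Each backward step reuses the derivative computed just before it (`dv`, then `du`), which is why working from right to left avoids repeating work.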