# Implement a basic neural network from scratch

This is a badly drawn but still valid fully connected neural net with 3 features, 1 hidden layer of 2 nodes, and a 1-dimensional output.
```
x1
  |\
  | \
  |  \
  |   \
  | |--node1
  | |   / \
  | |  /   \
x2--+-/     y
  | | \    /
  | |  \  /
  |-+- node2
    | /
    |/ 
    /
   /
x3
```

### Formulation

Here's the formula for the network above, given the feature values of a single data point.

First layer:
    
$$
\begin{bmatrix}
node1 \\
node2 \\
\end{bmatrix}
=
\sigma(
\begin{bmatrix} 
w_{11} & w_{12} & w_{13} \\
w_{21} & w_{22} & w_{23}
\end{bmatrix}
\begin{bmatrix}
x_1 \\
x_2 \\
x_3
\end{bmatrix}
+
\begin{bmatrix}
b_1 \\
b_2
\end{bmatrix}
)
$$

Second layer

$$
y = 
\sigma(
\begin{bmatrix}
\beta_1 & \beta_2
\end{bmatrix}
\begin{bmatrix}
node1 \\
node2 \\
\end{bmatrix}
+
d
)
$$

$w$ and $\beta$ are the weights, and $b$ and $d$ are the bias terms. The sigmoid function, $\sigma$, is used as the activation function for both layers:

$$\sigma(x) = \frac{1}{1 + e^{-x}}$$
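The two layer formulas translate almost directly into NumPy. The following is a minimal sketch for a single data point; the function name `forward_single` and all numeric values are made up for illustration and are not part of the original text.

```python
import numpy as np

def sigmoid(x):
    """Logistic function 1 / (1 + exp(-x)), applied element-wise."""
    return 1.0 / (1.0 + np.exp(-x))

def forward_single(x, W, b, beta, d):
    """Forward pass for one sample.

    x: (3,) features, W: (2, 3) hidden weights, b: (2,) hidden biases,
    beta: (2,) output weights, d: scalar output bias.
    """
    nodes = sigmoid(W @ x + b)      # first layer: node1, node2
    y = sigmoid(beta @ nodes + d)   # second layer: scalar output
    return nodes, y

# Illustrative values only (assumptions, chosen so the result is easy to check)
x = np.array([1.0, 2.0, 3.0])
W = np.zeros((2, 3))
b = np.zeros(2)
beta = np.array([1.0, -1.0])
d = 0.0

nodes, y = forward_single(x, W, b, beta, d)
print(nodes, y)
```

With all-zero hidden weights and biases, both hidden nodes equal $\sigma(0) = 0.5$, and because the two output weights cancel, the output is also $\sigma(0) = 0.5$.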

### Data vs prediction

Suppose there are $n$ samples. To go from their $x$ values to the predicted $y$ values for all samples at once, stack the samples as rows and slightly rearrange the formula above:

First layer

$$
\begin{bmatrix} 
node_{11} & node_{12} \\
node_{21} & node_{22} \\
\vdots & \vdots \\
node_{n1} & node_{n2}
\end{bmatrix}
=
\sigma(
\begin{bmatrix} 
x_{11} & x_{12} & x_{13} \\
x_{21} & x_{22} & x_{23} \\
\vdots & \vdots & \vdots \\
x_{n1} & x_{n2} & x_{n3}
\end{bmatrix}
\begin{bmatrix}
w_{11} & w_{21} \\
w_{12} & w_{22} \\
w_{13} & w_{23}
\end{bmatrix}
+
\begin{bmatrix}
b_1 & b_2 \\
b_1 & b_2 \\
\vdots & \vdots \\
b_1 & b_2
\end{bmatrix}
)
= \sigma(XW^T + B)
$$

Output layer

$$
\hat{Y} =
\begin{bmatrix}
\hat{y_1} \\
\hat{y_2} \\
\vdots \\
\hat{y_n}
\end{bmatrix}
=
\sigma(
\begin{bmatrix} 
node_{11} & node_{12} \\
node_{21} & node_{22} \\
\vdots & \vdots \\
node_{n1} & node_{n2}
\end{bmatrix}
\begin{bmatrix}
\beta_1 \\
\beta_2
\end{bmatrix}
+
\begin{bmatrix}
d \\
d \\
\vdots \\
d
\end{bmatrix}
)
= \sigma(Nodes \cdot \beta^T + D)
$$
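The matrix form maps onto NumPy with one matrix product per layer. This is a sketch under assumed shapes, with a hypothetical helper `forward_batch` and random illustrative data; the repeated-row matrices $B$ and $D$ never need to be built explicitly, because NumPy broadcasting adds the bias to every row.

```python
import numpy as np

def sigmoid(x):
    """Logistic function, applied element-wise."""
    return 1.0 / (1.0 + np.exp(-x))

def forward_batch(X, W, b, beta, d):
    """Forward pass for all n samples at once.

    X: (n, 3), W: (2, 3), b: (2,), beta: (2,), d: scalar.
    """
    nodes = sigmoid(X @ W.T + b)        # (n, 2); b is broadcast over rows, like B
    y_hat = sigmoid(nodes @ beta + d)   # (n,);  d is broadcast over rows, like D
    return nodes, y_hat

# Random illustrative inputs (assumptions, not values from the text)
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
W = rng.normal(size=(2, 3))
b = rng.normal(size=2)
beta = rng.normal(size=2)
d = 0.1

nodes, y_hat = forward_batch(X, W, b, beta, d)
print(nodes.shape, y_hat.shape)
```

Each row of `nodes` equals what the single-sample formula gives for the corresponding row of `X`, which is a quick way to convince yourself that the transposition in $XW^T$ is right.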

However, in this imperfect world, there's usually a gap between the actual and predicted values.
Instead of using the sum-of-squares error, let's introduce another cost function, one suited to
sigmoid outputs that are interpreted as probability estimates.
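
A common cost function for sigmoid outputs treated as probabilities is binary cross-entropy (log loss). Assuming that is the function intended here, it reads

$$
L = -\frac{1}{n}\sum_{i=1}^{n}\left[y_i \log \hat{y}_i + (1 - y_i)\log(1 - \hat{y}_i)\right]
$$

A minimal NumPy sketch follows; the name `binary_cross_entropy` and the clipping constant `eps` are illustrative choices, not something the text specifies.

```python
import numpy as np

def binary_cross_entropy(y, y_hat, eps=1e-12):
    """Mean binary cross-entropy between labels y (0 or 1) and predictions y_hat."""
    # Clip predictions away from exactly 0 and 1 so that log() stays finite.
    y_hat = np.clip(y_hat, eps, 1.0 - eps)
    return -np.mean(y * np.log(y_hat) + (1.0 - y) * np.log(1.0 - y_hat))

# Made-up labels and predictions for illustration
y = np.array([1.0, 0.0, 1.0])
good = binary_cross_entropy(y, np.array([0.9, 0.1, 0.8]))  # mostly right
bad = binary_cross_entropy(y, np.array([0.1, 0.9, 0.2]))   # mostly wrong
print(good, bad)
```

Confident wrong predictions are penalized much more heavily than confident right ones are rewarded, so `bad` comes out larger than `good`.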

The goal of model training is to minimize the loss function above. We want the weights and biases
that give the minimal loss.

### How to find best set of parameters?

Unlike linear regression, there's no closed-form solution for the optimal coefficients. We
need to iteratively "guess" a set of weights, calculate the errors, and infer from the errors
how we should modify the weights in a better direction (lower errors).

N x (update weights -> calculate errors -> find out how to update weights)

The practice of using the errors at the output layer to update the weights all the way back to
the very first hidden layer is called **back propagation**. How do we use the errors? It would be
nice if someone could tell us something like "if you change $w_1$ by 0.2, then you can reduce the
errors by 0.3". Keep in mind that while this tells us the _direction_ of the next move, it
doesn't tell us how far to move before we reach a minimum.
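
The "change one weight a little and see what happens to the error" idea can be tried numerically with a finite difference. The sketch below nudges a single weight of the 3-2-1 network and measures how the loss responds; it assumes binary cross-entropy as the loss, and the data, starting weights, and step sizes are all made up for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def loss(W, b, beta, d, X, y):
    """Binary cross-entropy of the 3-2-1 network (assumed loss function)."""
    nodes = sigmoid(X @ W.T + b)
    y_hat = sigmoid(nodes @ beta + d)
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

# Made-up data and starting weights, for illustration only
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = (X[:, 0] > 0).astype(float)
W = rng.normal(size=(2, 3))
b = np.zeros(2)
beta = rng.normal(size=2)
d = 0.0

# Nudge one weight up and down by h and measure how the loss responds
h = 1e-5
W_plus, W_minus = W.copy(), W.copy()
W_plus[0, 0] += h
W_minus[0, 0] -= h
slope = (loss(W_plus, b, beta, d, X, y) - loss(W_minus, b, beta, d, X, y)) / (2 * h)

# Take a small step against the slope and compare the loss before and after
W_new = W.copy()
W_new[0, 0] -= 0.01 * slope
loss_before = loss(W, b, beta, d, X, y)
loss_after = loss(W_new, b, beta, d, X, y)
print(f"slope={slope:.4f}, before={loss_before:.4f}, after={loss_after:.4f}")
```

The sign of `slope` gives the direction; the step factor `0.01` is an arbitrary choice, which is exactly the "how far to move" question the slope alone cannot answer.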

To get such a direction, we want to know the derivative of the loss function with respect to
each weight and bias. The prerequisite is that the loss function must be differentiable
w.r.t. all weights and biases.
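
For this small network the derivatives can be written out with the chain rule. The following is a sketch rather than a definitive implementation: it assumes binary cross-entropy as the loss, and the helper names (`forward`, `bce`, `gradients`), the toy data, the learning rate `lr`, and the number of steps are all hypothetical choices.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward(params, X):
    W, b, beta, d = params
    nodes = sigmoid(X @ W.T + b)         # hidden layer, (n, 2)
    y_hat = sigmoid(nodes @ beta + d)    # output layer, (n,)
    return nodes, y_hat

def bce(y, y_hat, eps=1e-12):
    """Binary cross-entropy (assumed loss function)."""
    y_hat = np.clip(y_hat, eps, 1 - eps)
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

def gradients(params, X, y):
    """Derivatives of the loss w.r.t. W, b, beta, d via the chain rule."""
    W, b, beta, d = params
    nodes, y_hat = forward(params, X)
    dz2 = (y_hat - y) / len(y)           # loss w.r.t. output pre-activation
    grad_beta = nodes.T @ dz2            # (2,)
    grad_d = dz2.sum()                   # scalar
    # Propagate back through the output weights and the hidden sigmoid
    dz1 = np.outer(dz2, beta) * nodes * (1 - nodes)   # (n, 2)
    grad_W = dz1.T @ X                   # (2, 3)
    grad_b = dz1.sum(axis=0)             # (2,)
    return grad_W, grad_b, grad_beta, grad_d

# Toy data: the label depends only on the sign of the first feature
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = (X[:, 0] > 0).astype(float)
params = [rng.normal(size=(2, 3)), np.zeros(2), rng.normal(size=2), 0.0]

initial_loss = bce(y, forward(params, X)[1])
lr = 0.5                                 # step size, an arbitrary choice
for _ in range(200):
    grads = gradients(params, X, y)
    # Move every parameter a small step against its derivative
    params = [p - lr * g for p, g in zip(params, grads)]
final_loss = bce(y, forward(params, X)[1])
print(f"loss before: {initial_loss:.4f}, after: {final_loss:.4f}")
```

The `(y_hat - y)` term appears because the derivative of binary cross-entropy combined with a sigmoid output simplifies to the prediction error, which is one reason this pairing is convenient.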

#### An