## Some Logistic Regression Concepts
$z = w^{T}x + b$  
$\hat{y} = a = \sigma(z)$  
$L(a, y) = -(y\log{a}+(1-y)\log{(1-a)})$  

Process of forward and backward propagation:
<img src="image/propagation.png" style="width:80%;">

Forward propagation computes the cost function from the calculations above. Backward propagation then applies the chain rule to compute the derivatives of the cost with respect to w and b, and these derivatives are used to update w and b.
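As a minimal sketch of one forward and one backward pass on a single example, the code below assumes the standard chain-rule results for the loss above: $dz = a - y$, $dw = x \cdot dz$, $db = dz$. The names `sigmoid` and `forward_backward`, the sample values, and the learning rate `alpha` are illustrative and not part of these notes.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward_backward(w, b, x, y):
    """One forward and backward pass for a single example (x, y)."""
    # forward: z -> a -> loss
    z = np.dot(w, x) + b
    a = sigmoid(z)
    loss = -(y * np.log(a) + (1 - y) * np.log(1 - a))
    # backward: the chain rule gives dL/dz = a - y
    dz = a - y
    dw = x * dz
    db = dz
    return loss, dw, db

# illustrative values (assumptions, not data from the notes)
w = np.array([0.1, -0.2])
b = 0.0
x = np.array([1.0, 2.0])
y = 1.0
alpha = 0.1

loss, dw, db = forward_backward(w, b, x, y)

# one gradient descent step on w and b
w_new = w - alpha * dw
b_new = b - alpha * db
new_loss, _, _ = forward_backward(w_new, b_new, x, y)
print(loss, new_loss)
```

With these values, a single update step should lower the loss on this example.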

## Some Calculations
<img src="image/calculations.jpg" style="width:50%;">

---


In [15]:
import numpy as np
import time

a = np.random.rand(10000000)
b = np.random.rand(10000000)

t1 = time.time()
c = np.dot(a,b)
t2 = time.time()
print(c)
print("numpy: "+ str(1000*(t2-t1)) + "ms")

t1 = time.time()
for i in range(10000000):
    # keeps only the last product; a full dot product would accumulate with c += a[i] * b[i]
    c = a[i] * b[i]
t2 = time.time()
print(c)
print("for loop: "+ str(1000*(t2-t1)) + "ms")

2500022.044102834
numpy: 3.172636032104492ms
0.5401602835576047
for loop: 1403.7532806396484ms


**From the experiment above, I found that using NumPy for vector and matrix calculations is much more efficient than a Python loop (about 3 ms versus about 1400 ms here).**
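For a like-for-like comparison, the loop has to accumulate the products so that both versions return the same dot product. The sketch below does that on a smaller array; the helper name `time_dot`, the array size, and the seed are assumptions made for illustration, and the timings will vary by machine.

```python
import time
import numpy as np

def time_dot(n=100_000, seed=0):
    """Time np.dot against an explicit accumulating loop on the same data."""
    rng = np.random.default_rng(seed)
    a = rng.random(n)
    b = rng.random(n)

    t1 = time.perf_counter()
    c_vec = np.dot(a, b)
    t_vec = time.perf_counter() - t1

    t1 = time.perf_counter()
    c_loop = 0.0
    for i in range(n):
        c_loop += a[i] * b[i]  # accumulate, so the loop computes the full dot product
    t_loop = time.perf_counter() - t1

    return c_vec, c_loop, t_vec, t_loop

c_vec, c_loop, t_vec, t_loop = time_dot()
print("np.dot:   " + str(1000 * t_vec) + "ms")
print("for loop: " + str(1000 * t_loop) + "ms")
```

The two results should agree up to floating-point rounding, which confirms that the timing compares the same computation.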

## Vectorizing Logistic Regression
x is the input matrix with shape ($n_x$, m): each column is one training example, and each row is one feature.

$w^T$ is a row vector of shape (1, $n_x$). Multiplying $w^T$ by x and then adding b gives z, a row vector of shape (1, m) with one entry per training example.
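The shapes above can be checked with a short sketch. The sizes `n_x = 3` and `m = 5`, the random inputs, and the `sigmoid` helper are assumptions chosen only for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

n_x, m = 3, 5                      # features, training examples (illustrative sizes)
rng = np.random.default_rng(0)
X = rng.random((n_x, m))           # each column is one training example
w = rng.random((n_x, 1)) * 0.01    # column vector, so w.T has shape (1, n_x)
b = 0.0

Z = np.dot(w.T, X) + b             # (1, n_x) times (n_x, m) gives (1, m); b is broadcast
A = sigmoid(Z)                     # predictions for all m examples at once

print(Z.shape, A.shape)
```

No loop over the examples is needed: one matrix product handles all m columns.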

***

## Neural Network
The superscript [i] denotes quantities in the i-th layer.  

A neural network simply repeats the z and a calculations, layer by layer.  

A neural network has three kinds of layers: input layer, hidden layer, and output layer.
- input layer: x, also written $a^{[0]}$
- hidden layer: activations such as $a^{[1]}$  
$a^{[1]}_1$ is the activation of node 1 in hidden layer 1
- output layer: the final layer, whose activation is the prediction $\hat{y}$

When counting layers, we don't count the input layer. A neural network with 1 input layer, 1 hidden layer, and 1 output layer is therefore a two-layer neural network.
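As a rough sketch of "repeating the z and a calculations", here is a forward pass through a two-layer network. The layer sizes, the random initialization, and the choice of tanh for the hidden layer with sigmoid for the output are assumptions made for this example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

n0, n1, n2, m = 3, 4, 1, 5         # input size, hidden units, output units, examples
rng = np.random.default_rng(0)

A0 = rng.random((n0, m))           # a[0] = x, one column per example
W1 = rng.standard_normal((n1, n0)) * 0.01
b1 = np.zeros((n1, 1))
W2 = rng.standard_normal((n2, n1)) * 0.01
b2 = np.zeros((n2, 1))

# layer 1 (hidden): z[1] = W[1] a[0] + b[1], a[1] = tanh(z[1])
Z1 = W1 @ A0 + b1
A1 = np.tanh(Z1)

# layer 2 (output): z[2] = W[2] a[1] + b[2], a[2] = sigmoid(z[2]) = y_hat
Z2 = W2 @ A1 + b2
A2 = sigmoid(Z2)

print(A1.shape, A2.shape)
```

Row 0 of `A1` holds $a^{[1]}_1$ for every example, and `A2` holds the prediction $\hat{y}$ for each column of the input.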

## Activation Functions
The sigmoid function is one example of an activation function.  

tanh(z) is another activation function. In hidden layers it usually works better than sigmoid, because its outputs are centered around 0 instead of 0.5. It is less suitable for the output layer of a binary classifier, where the output should lie between 0 and 1.  

Therefore, we can use tanh(z) in the hidden layers and the sigmoid function in the output layer.

The slopes of both tanh(z) and sigmoid are close to zero when |z| is large, which slows down gradient descent. The ReLU function, max(0, z), avoids this for positive z, so it is a common choice for hidden layers.

We don't use a linear activation function in hidden layers, because a composition of linear functions is still linear, so the network could not learn any non-linear relationships.
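To make the slope argument concrete, the sketch below defines the three activation functions and their derivatives and evaluates the derivatives at a few points. The function names and the sample points are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    return np.maximum(0, z)

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1 - s)

def tanh_grad(z):
    return 1 - np.tanh(z) ** 2

def relu_grad(z):
    # slope is 1 for z > 0 and 0 otherwise (the value at z = 0 is a convention)
    return (z > 0).astype(float)

z = np.array([-10.0, 0.0, 10.0])
print("sigmoid slope:", sigmoid_grad(z))
print("tanh slope:   ", tanh_grad(z))
print("ReLU slope:   ", relu_grad(z))
```

At z = 10 the sigmoid and tanh slopes are nearly zero, while the ReLU slope stays at 1.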

## Hyperparameters
- learning rate $\alpha$
- number of iterations
- number of hidden layers L
- number of hidden units
- choice of activation function

## Process

<img src="image/process.png" style="width:50%;">

---

## Implementing Neural Network

In [None]:
import numpy as np

'''
Shapes for layer l:
w[l]: (n[l], n[l-1])
b[l]: (n[l], 1)
'''

class MyNeuralNetwork:

    '''
    layer_size: units in each layer, starting with the input layer a[0]
    num_layer: number of layers, not counting the input layer

    weights[i] and bias[i] connect layer i to layer i+1
    '''
    def __init__(self, layer_size: list):
        self.num_layer = len(layer_size) - 1
        self.layer_size = layer_size

        self.weights = []
        self.bias = []

        # initialize weights with small random values in [0, 0.01) and biases with zeros
        for i in range(len(layer_size)-1):
            self.weights.append(np.random.rand(layer_size[i+1], layer_size[i]) * 0.01)
            self.bias.append(np.zeros((layer_size[i+1], 1)))

    # activation function: ReLU(z) = max(0, z), applied element-wise
    def ReLU(self, z):
        return np.maximum(0, z)


In [None]:
# the first element of [4,4,1] is the size of the input layer a[0]
model = MyNeuralNetwork([4,4,1])
print(model.weights)


[array([[0.00079833, 0.00033678, 0.00597505, 0.0065092 ],
       [0.00112352, 0.00494675, 0.0041837 , 0.00972023],
       [0.00460516, 0.00712167, 0.00108812, 0.00651455],
       [0.00930487, 0.00247874, 0.00938701, 0.00063321]]), array([[0.00954513, 0.00402281, 0.00719258, 0.00356439]])]
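
One possible next step is a forward pass that loops over the weight and bias lists. The sketch below is standalone and only mirrors the initialization used above. The function `forward`, the use of ReLU in hidden layers with sigmoid in the output layer, and the random input are assumptions, not part of the class as written.

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(weights, biases, x):
    """Repeat z = w a + b and a = g(z) for each layer; hypothetical helper."""
    a = x
    last = len(weights) - 1
    for l, (w, b) in enumerate(zip(weights, biases)):
        z = w @ a + b
        # sigmoid on the output layer, ReLU on hidden layers (an assumption)
        a = sigmoid(z) if l == last else relu(z)
    return a

# same layout as MyNeuralNetwork([4,4,1]): 4 inputs, 4 hidden units, 1 output
layer_size = [4, 4, 1]
weights = [np.random.rand(layer_size[i + 1], layer_size[i]) * 0.01
           for i in range(len(layer_size) - 1)]
biases = [np.zeros((layer_size[i + 1], 1)) for i in range(len(layer_size) - 1)]

x = np.random.rand(4, 3)           # 3 examples with 4 features each
y_hat = forward(weights, biases, x)
print(y_hat.shape)
```

Because each w[l] has shape (n[l], n[l-1]), the output of one layer always matches the input expected by the next.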
