# Loss Functions
---

**1. Mean Absolute Error**

$\text{Mean Absolute Error (MAE)} = \frac{1}{n}\sum_{i=1}^{n} \left| y_i - \hat{y}_i \right|$

In [None]:
model.compile(optimizer="adam", 
              loss="mean_absolute_error",
              metrics=["accuracy"]
)
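To make the MAE formula concrete, it can be evaluated by hand with NumPy. This is a small sketch: the helper name `mean_absolute_error` and the sample values are made up for illustration and are not part of the Keras API.

```python
import numpy as np

def mean_absolute_error(y_true, y_predicted):
    """Average of the absolute differences between targets and predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_predicted = np.asarray(y_predicted, dtype=float)
    return np.mean(np.abs(y_true - y_predicted))

# Illustrative values (assumed, not from any dataset)
y_true = np.array([3.0, 5.0, 2.0, 7.0])
y_predicted = np.array([2.5, 5.0, 4.0, 8.0])

# Absolute errors are 0.5, 0, 2 and 1, so MAE = 3.5 / 4 = 0.875
mae = mean_absolute_error(y_true, y_predicted)
```

Every error contributes in proportion to its size, so a single large miss does not dominate the average.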

**2. Mean Squared Error**

$\text{Mean Squared Error (MSE)} = {\frac{1}{n}}{\sum_{i=1}^{n} (y_i - {\hat{y}_i})^2}$

In [None]:
model.compile(optimizer="adam", 
              loss="mean_squared_error",
              metrics=["accuracy"]
)
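The MSE formula can be checked the same way. The sketch below uses an assumed helper name, `mean_squared_error`, and made-up values; it is only meant to show how squaring weights the errors.

```python
import numpy as np

def mean_squared_error(y_true, y_predicted):
    """Average of the squared differences between targets and predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_predicted = np.asarray(y_predicted, dtype=float)
    return np.mean(np.square(y_true - y_predicted))

# Illustrative values (assumed, not from any dataset)
y_true = np.array([3.0, 5.0, 2.0, 7.0])
y_predicted = np.array([2.5, 5.0, 4.0, 8.0])

# Squared errors are 0.25, 0, 4 and 1, so MSE = 5.25 / 4 = 1.3125
mse = mean_squared_error(y_true, y_predicted)
```

Here the single error of 2 accounts for 4 of the 5.25 total, which is why MSE is more sensitive to outliers than MAE.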

**3. Binary Cross Entropy**

_Also known as log loss_

$\text{Binary Cross Entropy} = -\frac{1}{n}\sum_{i=1}^{n} \left[ y_i \log(\hat{y}_i) + (1-y_i) \log(1 - \hat{y}_i) \right]$

In [None]:
model.compile(optimizer="adam", 
              loss="binary_crossentropy",
              metrics=["accuracy"]
)
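As a rough illustration of the binary cross entropy formula, the snippet below evaluates it directly with NumPy for two sets of made-up predictions. The variable names and values are assumptions chosen for the example.

```python
import numpy as np

# Illustrative labels and predicted probabilities (assumed values)
y_true = np.array([1.0, 0.0, 1.0, 0.0])
confident = np.array([0.9, 0.1, 0.8, 0.2])  # mostly right and fairly sure
unsure = np.array([0.5, 0.5, 0.5, 0.5])     # no information at all

# Direct translation of the formula: negative mean of the per-sample log terms
loss_confident = -np.mean(
    y_true * np.log(confident) + (1 - y_true) * np.log(1 - confident)
)
loss_unsure = -np.mean(
    y_true * np.log(unsure) + (1 - y_true) * np.log(1 - unsure)
)
```

Predicting 0.5 everywhere gives a loss of $\log 2$, while the confident and mostly correct predictions give a smaller loss.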

# Gradient Descent
---

### Batch Gradient Descent
Use **all** training samples for one forward pass, then adjust the weights.


<br>
In each epoch, assuming _log loss_ with a sigmoid output:

**1. Update the weights**

$w = w - \alpha \cdot \frac{\partial L}{\partial w}$,  
<br>where $w$ = weight, $\alpha$ = learning rate, $L$ = loss


$\frac{\partial L}{\partial w} = \frac{1}{n}\sum_{i=1}^{n} x_i \left(y_{\text{predicted},i} - y_{\text{true},i}\right)$

<br>



**2. Update the bias**

$b = b - \alpha \cdot \frac{\partial L}{\partial b}$,  
<br>where $b$ = bias, $\alpha$ = learning rate, $L$ = loss


$\frac{\partial L}{\partial b} = \frac{1}{n}\sum_{i=1}^{n} \left(y_{\text{predicted},i} - y_{\text{true},i}\right)$

<br>


In [None]:
def sigmoid(y):
    ...

def log_loss(y_true, y_predicted):
    ...
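The two helpers above are left as stubs. One possible way to fill them in, combined with the weight and bias updates from this section, is sketched below. The function bodies, the `eps` clipping parameter, the argument name `z`, the toy data and the learning rate are all assumptions made for this example.

```python
import numpy as np

def sigmoid(z):
    """Squash any real value into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def log_loss(y_true, y_predicted, eps=1e-15):
    """Binary cross entropy; predictions are clipped to avoid log(0)."""
    y_predicted = np.clip(y_predicted, eps, 1 - eps)
    return -np.mean(y_true * np.log(y_predicted)
                    + (1 - y_true) * np.log(1 - y_predicted))

# Toy, linearly separable data (made up for this sketch)
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y_true = np.array([0.0, 0.0, 1.0, 1.0])

w, bias, learning_rate = np.zeros(1), 0.0, 0.5
initial_cost = log_loss(y_true, sigmoid(X @ w + bias))  # all predictions are 0.5

for _ in range(1000):
    # forward pass over all samples
    y_predicted = sigmoid(X @ w + bias)
    error = y_predicted - y_true
    # gradient steps: mean of x_i * error for w, mean of error for bias
    w = w - learning_rate * (X.T @ error) / len(X)
    bias = bias - learning_rate * np.mean(error)

final_cost = log_loss(y_true, sigmoid(X @ w + bias))
```

On this toy data the cost should fall well below its starting value as the decision boundary settles between the two classes.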

In [1]:
# Code sample

import numpy as np

def batch_gradient_descent(X, y_true, epochs, learning_rate=0.01):
    # Note: this sample fits a linear model with MSE loss,
    # not the log loss assumed in the formulas above.
    number_of_features = X.shape[1]
    w = np.ones(shape=(number_of_features))
    bias = 0
    total_samples = X.shape[0]
    cost_list = []
    epoch_list = []

    for i in range(epochs):
        # forward pass over all training samples
        y_predicted = np.dot(w, X.T) + bias

        # gradients of the MSE loss with respect to w and bias
        w_d = -(2/total_samples)*(X.T.dot(y_true-y_predicted))
        bias_d = -(2/total_samples)*np.sum(y_true-y_predicted)

        w = w - learning_rate * w_d
        bias = bias - learning_rate * bias_d

        # cost of the predictions made before this update
        cost = np.mean(np.square(y_true - y_predicted))

        # record the cost every 10th epoch
        if i % 10 == 0:
            cost_list.append(cost)
            epoch_list.append(i)
    return w, bias, cost, cost_list, epoch_list



### Stochastic Gradient Descent
Use one (randomly picked) sample for a forward pass, then adjust the weights.

In [3]:
# Code sample

import random
import numpy as np

def stochastic_gradient_descent(X, y_true, epochs, learning_rate=0.01):
    number_of_features = X.shape[1]
    w = np.ones(shape=(number_of_features))
    bias = 0
    total_samples = X.shape[0]

    cost_list = []
    epoch_list = []

    for i in range(epochs):
        # get random index from X -- stochastic
        random_index = random.randint(0, total_samples-1)
        sample_X = X[random_index]
        sample_y = y_true[random_index]

        # forward pass on the single sample
        y_predicted = np.dot(w, sample_X.T) + bias

        # using mse loss on this one sample; the 1/total_samples factor
        # only scales the step (a smaller effective learning rate)
        w_d = -(2/total_samples)*(sample_X.T.dot(sample_y-y_predicted))
        bias_d = -(2/total_samples)*(sample_y-y_predicted)

        w = w - learning_rate * w_d
        bias = bias - learning_rate * bias_d

        cost = np.mean(np.square(sample_y - y_predicted))

        # record the cost every 100th iteration
        if i % 100 == 0:
            cost_list.append(cost)
            epoch_list.append(i)

    return w, bias, cost, cost_list, epoch_list


# Dropout Regularization
---

Why does dropout help with overfitting?
- A neuron can't rely on any single input, because that input might be dropped at random
- Neurons are discouraged from learning redundant details of the inputs
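
The random dropping described above can be sketched in plain NumPy. This is an illustrative version of "inverted" dropout, where surviving activations are scaled up by `1 / (1 - rate)`. The function name `dropout`, the seed and the array shape are assumptions for the example, not the Keras implementation.

```python
import numpy as np

def dropout(activations, rate, rng):
    """Zero each activation with probability `rate`; rescale the survivors."""
    keep_mask = rng.random(activations.shape) >= rate
    return activations * keep_mask / (1.0 - rate)

rng = np.random.default_rng(seed=0)
activations = np.ones((1000, 60))  # stand-in for the outputs of a 60-unit layer

dropped = dropout(activations, rate=0.5, rng=rng)

# Roughly half of the entries are zeroed; the rest are scaled from 1.0 to 2.0
fraction_zeroed = np.mean(dropped == 0)
```

Because a different mask is drawn on every call, no downstream unit can count on a particular input being present. The rescaling keeps the expected activation the same as without dropout.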

### Example (adding a Dropout layer in `keras`)

In [None]:
from tensorflow import keras  # assumes the TensorFlow-bundled Keras

model = keras.Sequential([
  # input layer
  keras.layers.Dense(60, input_dim=60, activation="relu"),
  # randomly drop 50% of this layer's outputs during training
  keras.layers.Dropout(0.5),
  # hidden layer(s)
  keras.layers.Dense(60, activation="relu"),
  keras.layers.Dropout(0.5),
  # output layer
  keras.layers.Dense(1, activation="sigmoid")
])