(Cost_Function)= 
# Chapter 4 -- Cost Function

The previous two examples involve the same number of students, so comparing their cross-entropy values directly is fair. If the data sets differ in size, however, the summed cross-entropy grows with the number of samples, so a very accurate model evaluated on 1000 data points could still have a larger total than a poor model evaluated on 5. To make a per-sample ("per capita") comparison, we divide by the number of data points $m$:

$$
Cost = -\frac{1}{m}\sum_{i=1}^m \left[ y_i \ln(p_i) + (1-y_i) \ln(1-p_i) \right]
$$ (eq4_1)

where $p_i$ is the model's predicted probability that the event happens for sample $i$. This is the same quantity as the $\hat{y}$ notation we used before:

$$
\hat{y} = \sigma(\vec{w}\cdot\vec{x}+b)
$$ (eq4_2)
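To see why the $\frac{1}{m}$ factor in the cost matters, the sketch below compares the summed and the averaged cross-entropy for two data sets of different sizes. The function name `cross_entropy` and both data sets are hypothetical, chosen only for illustration.

```python
import numpy as np

def cross_entropy(y, p, average=True):
    """Binary cross-entropy; divided by the number of samples when average=True."""
    y = np.asarray(y, dtype=float)
    p = np.asarray(p, dtype=float)
    total = -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
    return total / y.size if average else total

# Hypothetical data: a small set with weak predictions, a large set with strong ones.
y_small = np.array([1, 0, 1, 0, 1])
p_small = np.array([0.6, 0.4, 0.6, 0.4, 0.6])   # only 60% confident in the true label

y_large = np.tile([1, 0], 500)                   # 1000 labels
p_large = np.tile([0.95, 0.05], 500)             # 95% confident in the true label

sum_small = cross_entropy(y_small, p_small, average=False)
sum_large = cross_entropy(y_large, p_large, average=False)
mean_small = cross_entropy(y_small, p_small)
mean_large = cross_entropy(y_large, p_large)

print(f"summed:   small={sum_small:.3f}, large={sum_large:.3f}")
print(f"averaged: small={mean_small:.3f}, large={mean_large:.3f}")
```

The summed value is larger for the 1000-sample set even though its predictions are better, whereas the averaged value ranks the two models as expected.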

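Finally, replacing each $p_i$ with $\hat{y}_i = \sigma(\vec{w}\cdot\vec{x}_i+b)$ gives the cost as a function of the weights and bias:

$$
Cost(\vec{w},b) = -\frac{1}{m}\sum_{i=1}^m \left[ y_i \ln\left( \sigma(\vec{w}\cdot\vec{x}_i+b)\right)+(1-y_i) \ln\left(1- \sigma(\vec{w}\cdot\vec{x}_i+b)\right) \right]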
$$ (eq4_3)

What we want is to find the weights $\vec{w}$ and bias $b$ that minimise the cost. This is a multi-variable calculus problem: at a minimum, the partial derivatives of the cost with respect to $\vec{w}$ and $b$ are all zero.

## Mathematical Theory

### Sigmoid Function

The sigmoid function, $\sigma(z)$, is often used in logistic regression and neural networks to map any real number to a value strictly between 0 and 1, which can then be read as a probability. It is defined as:

$$
\sigma(z) = \frac{1}{1 + e^{-z}}
$$
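A direct translation of this formula can trigger an overflow warning in `np.exp` when $z$ is a large negative number. The following is a minimal sketch of one common workaround, which evaluates the function piecewise so that the exponent is never positive. The name `stable_sigmoid` is an illustrative choice, and the function assumes array-like input.

```python
import numpy as np

def stable_sigmoid(z):
    """Sigmoid evaluated piecewise so np.exp never receives a positive argument."""
    z = np.atleast_1d(np.asarray(z, dtype=float))
    out = np.empty_like(z)
    pos = z >= 0
    # For z >= 0: 1 / (1 + e^{-z}), where e^{-z} <= 1.
    out[pos] = 1.0 / (1.0 + np.exp(-z[pos]))
    # For z < 0: the algebraically equivalent e^{z} / (1 + e^{z}), where e^{z} < 1.
    exp_z = np.exp(z[~pos])
    out[~pos] = exp_z / (1.0 + exp_z)
    return out

values = stable_sigmoid(np.array([-1000.0, -2.0, 0.0, 2.0, 1000.0]))
print(values)
```

Both branches compute the same mathematical function; only the order of operations differs.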


### Gradient Descent

To minimize the cost function, we use an optimization algorithm called gradient descent. Gradient descent updates the weights $\vec{w}$ and bias $b$ iteratively to find the minimum of the cost function. The update rules are given by:

$$
\vec{w} := \vec{w} - \alpha \frac{\partial}{\partial \vec{w}} Cost(\vec{w}, b)
$$


$$
b := b - \alpha \frac{\partial}{\partial b} Cost(\vec{w}, b)
$$


where $\alpha$ is the learning rate.
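The role of $\alpha$ is easiest to see on a one-dimensional toy problem. The sketch below applies the same update rule to the hypothetical function $f(w) = (w-3)^2$, whose derivative is $2(w-3)$ and whose minimum is at $w = 3$. The function, the starting point, and the three learning rates are chosen purely for illustration.

```python
def descend(alpha, w=0.0, steps=50):
    """Run `steps` gradient-descent updates on f(w) = (w - 3)**2."""
    for _ in range(steps):
        grad = 2 * (w - 3)      # df/dw
        w = w - alpha * grad    # same form of update as for the cost function
    return w

w_small = descend(alpha=0.01)   # small steps: still far from the minimum after 50 updates
w_good = descend(alpha=0.1)     # moderate steps: close to w = 3
w_large = descend(alpha=1.1)    # steps too large: the iterates move away from w = 3

print(w_small, w_good, w_large)
```

Each update multiplies the distance to the minimum by $1 - 2\alpha$, so in this toy case the iterates converge only when $|1 - 2\alpha| < 1$.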

### Derivatives of the Cost Function

The partial derivatives of the cost function with respect to the weights and bias are:

$$
\frac{\partial}{\partial \vec{w}} Cost(\vec{w}, b) = \frac{1}{m} \sum_{i=1}^m (\hat{y}_i - y_i) \vec{x}_i
$$


$$
\frac{\partial}{\partial b} Cost(\vec{w}, b) = \frac{1}{m} \sum_{i=1}^m (\hat{y}_i - y_i)
$$
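These expressions follow from the chain rule. As a sketch of the derivation, write $z_i = \vec{w}\cdot\vec{x}_i + b$ and $\hat{y}_i = \sigma(z_i)$, and use the identity $\sigma'(z) = \sigma(z)\,(1-\sigma(z))$:

$$
\frac{\partial}{\partial z_i}\left[ y_i \ln \hat{y}_i + (1-y_i)\ln(1-\hat{y}_i) \right]
= \frac{y_i}{\hat{y}_i}\,\hat{y}_i(1-\hat{y}_i) - \frac{1-y_i}{1-\hat{y}_i}\,\hat{y}_i(1-\hat{y}_i)
= y_i(1-\hat{y}_i) - (1-y_i)\hat{y}_i
= y_i - \hat{y}_i
$$

Since $\partial z_i / \partial \vec{w} = \vec{x}_i$ and $\partial z_i / \partial b = 1$, multiplying by the prefactor $-\frac{1}{m}$ and summing over $i$ gives the two formulas above.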



```python
import numpy as np


def sigmoid(z):
    """Map real-valued input z to a value between 0 and 1."""
    return 1 / (1 + np.exp(-z))


def compute_cost(X, y, w, b, eps=1e-15):
    """Average cross-entropy cost over the m rows of X."""
    m = X.shape[0]
    # Compute the predictions once; clip them so np.log never receives exactly 0.
    p = np.clip(sigmoid(X.dot(w) + b), eps, 1 - eps)
    return -(1 / m) * np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))


def gradient_descent(X, y, w, b, alpha, num_iterations):
    """Minimise the cost by gradient descent; return w, b and the cost history."""
    m = X.shape[0]
    cost_history = []

    for i in range(num_iterations):
        # Compute the predictions
        predictions = sigmoid(X.dot(w) + b)

        # Compute the gradients
        dw = (1 / m) * X.T.dot(predictions - y)
        db = (1 / m) * np.sum(predictions - y)

        # Update the weights and bias (without modifying the caller's array in place)
        w = w - alpha * dw
        b = b - alpha * db

        # Compute and store the cost after the update
        cost = compute_cost(X, y, w, b)
        cost_history.append(cost)

        if i % 100 == 0:
            print(f"Iteration {i}: Cost {cost:.4f}")

    return w, b, cost_history


# Example data: four samples with two features each
X = np.array([[1, 2], [2, 3], [3, 4], [4, 5]])
y = np.array([0, 0, 1, 1])

# Initialize parameters
w = np.zeros(X.shape[1])
b = 0.0
alpha = 0.01
num_iterations = 1000

# Perform gradient descent
w, b, cost_history = gradient_descent(X, y, w, b, alpha, num_iterations)

# Final weights and bias
print(f"Final weights: {w}")
print(f"Final bias: {b}")

# Final cost
final_cost = compute_cost(X, y, w, b)
print(f"Final cost: {final_cost:.4f}")
```


```text
Iteration 0: Cost 0.6883
Iteration 100: Cost 0.6192
Iteration 200: Cost 0.5918
Iteration 300: Cost 0.5665
Iteration 400: Cost 0.5431
Iteration 500: Cost 0.5216
Iteration 600: Cost 0.5017
Iteration 700: Cost 0.4832
Iteration 800: Cost 0.4661
Iteration 900: Cost 0.4503
Final weights: [ 0.91260242 -0.20690417]
Final bias: -1.1195065937669517
Final cost: 0.4357
```
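One way to gain confidence in the analytic gradients from the Derivatives of the Cost Function section is to compare them with central finite differences of the cost. The following is a minimal, self-contained sketch. The helper names `cost`, `analytic_grad`, and `numeric_grad`, the test point `(w, b)`, and the step size `eps` are assumptions made for this example.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def cost(X, y, w, b):
    """Average cross-entropy cost."""
    p = sigmoid(X.dot(w) + b)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def analytic_grad(X, y, w, b):
    """Gradients from the closed-form expressions (1/m) * sum((y_hat - y) * x) and (1/m) * sum(y_hat - y)."""
    err = sigmoid(X.dot(w) + b) - y
    return X.T.dot(err) / X.shape[0], err.mean()

def numeric_grad(X, y, w, b, eps=1e-6):
    """Central finite-difference approximation of the same gradients."""
    dw = np.zeros_like(w)
    for j in range(w.size):
        step = np.zeros_like(w)
        step[j] = eps
        dw[j] = (cost(X, y, w + step, b) - cost(X, y, w - step, b)) / (2 * eps)
    db = (cost(X, y, w, b + eps) - cost(X, y, w, b - eps)) / (2 * eps)
    return dw, db

X = np.array([[1, 2], [2, 3], [3, 4], [4, 5]], dtype=float)
y = np.array([0, 0, 1, 1], dtype=float)
w = np.array([0.3, -0.2])   # arbitrary test point
b = 0.1

dw_a, db_a = analytic_grad(X, y, w, b)
dw_n, db_n = numeric_grad(X, y, w, b)
print("max |dw difference|:", np.max(np.abs(dw_a - dw_n)))
print("|db difference|:", abs(db_a - db_n))
```

If the two sets of gradients agree to within a small tolerance, the closed-form expressions are very likely implemented correctly.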
