# Building A Model For Multi-Class Classification
A simple Logistic Regression model cannot be used directly for multi-class classification. It is equivalent to a single-neuron model and can only separate 2 classes. To apply Logistic Regression to a multi-class problem, multiple Logistic Regression models have to be built, one per class, using the One-vs-Rest (OvR) strategy.

Therefore, if there are 3 classes, then 3 models will have to be trained.

<img src = "../artifacts/neural_networks_11.png" alt = "drawing" width = "500"/>

### Can the above be accomplished using a single model?
In multi-class classification,
- The probability of a given data point belonging to each of the classes (A, B or C) is calculated.
- The data point would be classified into the class for which the probability value was the highest.
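
The One-vs-Rest decision rule described above can be sketched as follows. This is a minimal illustration, not a trained model: the weights, biases and the data point `x` are made-up values, and the `sigmoid` helper is defined here only for the example.

```python
import numpy as np

def sigmoid(z):
    """Logistic function, maps any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical parameters of three independently trained binary models
# (one per class), each with 2 feature weights and a bias.
models = {
    "A": (np.array([1.0, -1.0]), 0.0),
    "B": (np.array([-0.5, 2.0]), 0.1),
    "C": (np.array([0.3, 0.3]), -1.0),
}

x = np.array([0.5, 1.5])  # a single data point with 2 features (assumed values)

# Each model scores the data point separately.
probabilities = {name: sigmoid(w @ x + b) for name, (w, b) in models.items()}

# The data point is assigned to the class with the highest probability.
predicted_class = max(probabilities, key=probabilities.get)
print(predicted_class)  # B
```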

An intuition that the output layer should have 3 outputs (one for each class) can be drawn from the above.

Therefore, the Neural Network looks as follows,

<img src = "../artifacts/neural_networks_12.png" alt = "drawing" width = "500"/>

Observations (compared with the three separate models),
- The number of outputs is the same.
- The number of connections is the same.

The difference,
- Computation happens together.

So, instead of multiplying a weight vector with the data matrix once per model, a single weight matrix is multiplied with the data matrix.
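
A small NumPy sketch of this equivalence, using random numbers as stand-ins for real data and weights (the shapes and the seed are assumptions made for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2))  # 5 data points, 2 features
W = rng.normal(size=(2, 3))  # one column of weights per output neuron

# Three separate models: one weight vector at a time.
separate = np.column_stack([X @ W[:, j] for j in range(3)])

# Single model: all three neurons computed together.
together = X @ W

print(together.shape)                   # (5, 3)
print(np.allclose(separate, together))  # True
```

Both routes give the same numbers; the matrix form simply computes them in one operation.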

A model built by using multiple neurons is called a Neural Network (NN).

# Notations In Neural Networks
### Inputs
Consider that there are 2 features, $x_1$ and $x_2$, and $m$ data points. This is represented as a matrix where each row is a data point,

$
\begin{bmatrix}
x_{11} & x_{12} \\
x_{21} & x_{22} \\
\vdots & \vdots \\
x_{m1} & x_{m2}
\end{bmatrix}$

### Neuron
A neuron is represented using $f_i$.

Where,
- $i$ = neuron number. For example, $f_1$ represents the first neuron.

### Weights
Weights are defined by the notation $w_{ij}$.

Where,
- $i$ = index of the source (here, the input $x_i$)
- $j$ = index of the destination neuron $f_j$

The weight associated with input $x_1$ going to neuron $f_2$ is represented as $w_{12}$. Similarly, other weights are represented as, $w_{11}$, $w_{13}$, $w_{21}$, $w_{22}$, $w_{23}$.

### Bias
Each neuron will have a bias term associated with it. The bias vector is represented as,

$b = \begin{bmatrix}
b_1 \space b_2 \space b_3
\end{bmatrix}$

### Z value
The z value represents the linear operation, i.e., the weighted sum of the inputs with their respective weights, plus the bias of the neuron.
- $z_1 = w_{11} * x_1 + w_{21} * x_2 + b_1$.
- $z_2 = w_{12} * x_1 + w_{22} * x_2 + b_2$.
- $z_3 = w_{13} * x_1 + w_{23} * x_2 + b_3$.
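
The same computation in matrix form, as a minimal sketch. All numbers are made up for illustration, and the bias vector $b$ from the Bias section is assumed to be added to every row.

```python
import numpy as np

X = np.array([[1.0, 2.0],
              [3.0, 4.0]])        # m = 2 data points, 2 features
W = np.array([[0.1, 0.2, 0.3],    # w_11, w_12, w_13
              [0.4, 0.5, 0.6]])   # w_21, w_22, w_23
b = np.array([0.01, 0.02, 0.03])  # b_1, b_2, b_3

# One row of z values per data point, one column per neuron.
Z = X @ W + b

# Cross-check z_1 for the first data point against the scalar formula.
x1, x2 = X[0]
z1 = W[0, 0] * x1 + W[1, 0] * x2 + b[0]
print(Z.shape)                  # (2, 3)
print(np.isclose(Z[0, 0], z1))  # True
```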

### Output
Each neuron applies its activation function to its z value to produce the outputs: $a_1'$, $a_2'$, $a_3'$.

<img src = "../artifacts/neural_networks_13.png" alt = "drawing" width = "500">

### Problem with this formulation
What if the model predicts a probability value greater than 0.5 for multiple classes (i.e., $A$, $B$, $C$ = $\begin{bmatrix}1 \space 1 \space 0\end{bmatrix}$ or $A$, $B$, $C$ = $\begin{bmatrix}1 \space 1 \space 1\end{bmatrix}$)?

<img src = "../artifacts/neural_networks_14.png" alt = "drawing" width = "500">

The actual requirement is that the sum of all output probabilities should be equal to 1.

### How can multiple outputs be 1?
The sigmoid activation function is applied to each output independently, and its range is $\sigma(z) \in (0, 1)$. Therefore, multiple probability values can be greater than 0.5 at the same time, and hence multiple class labels can be 1.
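
A short sketch of the problem, with hypothetical z values chosen so that two outputs cross the 0.5 threshold:

```python
import numpy as np

def sigmoid(z):
    """Logistic function applied element-wise."""
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([2.0, 1.0, -3.0])  # hypothetical z values for classes A, B, C
a = sigmoid(z)                  # each output is squashed independently

labels = (a > 0.5).astype(int)
print(labels)   # [1 1 0]
print(a.sum())  # greater than 1, so the outputs are not a probability distribution
```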

# Softmax Classifier
Consider 3 outputs, $z_1$, $z_2$ and $z_3$. A function is required that maps $z_1$, $z_2$ and $z_3$ to probabilities whose sum is equal to 1.

The softmax function achieves exactly this. It is mathematically represented as,

$p_i = \frac{e^{z_i}}{\sum_{j = 1}^{k} e^{z_j}}$.

Here $p_i$ refers to the probability of the data point belonging to class $i$, and $k$ is the number of classes. The denominator in the equation is the normalization term that makes $p_1 + p_2 + p_3 = 1$ (for $k = 3$).
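
A minimal NumPy sketch of the softmax function. Subtracting the maximum before exponentiating is a common numerical-stability trick that is not part of the formula above; it leaves the result unchanged because the same factor cancels in the numerator and denominator. The input values are hypothetical.

```python
import numpy as np

def softmax(z):
    """Softmax of a 1-D array of z values."""
    z = np.asarray(z, dtype=float)
    shifted = z - np.max(z)  # stability shift, does not change the output
    exp_z = np.exp(shifted)
    return exp_z / exp_z.sum()

p = softmax([2.0, 1.0, -3.0])  # hypothetical z values
print(p)
print(p.sum())  # 1.0 up to floating-point rounding
```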

Softmax can be thought of as a sigmoid-like function for the multi-class setting.

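### Why not directly use $\frac{z_i}{\sum_{j = 1}^{k} z_j}$? Why exponentiate $z_i$?
The intuitive reason is that $e^{z_i}$ is always positive, whereas $z_i$ itself ranges from $-\infty$ to $\infty$. Exponentiating therefore ensures that every $p_i$ is non-negative and lies between 0 and 1. Dividing the raw $z_i$ by their sum could produce negative values, or a division by zero when the sum is 0.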
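
To make this concrete, here is a small sketch with hypothetical z values, one of which is negative:

```python
import numpy as np

z = np.array([2.0, -1.0, 1.0])  # hypothetical z values, one of them negative

# Direct normalization: one "probability" comes out negative.
direct = z / z.sum()
print(direct)  # [ 1.  -0.5  0.5]

# Exponentiating first keeps every value strictly positive.
exponentiated = np.exp(z) / np.exp(z).sum()
print(exponentiated)
```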

Besides this, the softmax function has some other desirable properties,
1. It is nicely differentiable, since $\frac{de^{x}}{dx} = e^{x}$.
2. The difference $z_i - z_j$ equals the log odds $\log(p_i / p_j)$, so the z values can be interpreted as unnormalized log probabilities.

Now consider the following representation,

<img src = "../artifacts/neural_networks_15.png" alt = "drawing" width = "500">

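If the z values were normalized directly instead of being exponentiated,
- The ratio of the probabilities would be 1 : 3 : 6.
- However, softmax pushes the probability of the largest value closer to 1, while still giving the other classes a non-zero probability. Hence the term softmax (a soft version of max).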
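
The comparison can be reproduced numerically. The z values below are an assumption: $[1, 3, 6]$ is chosen only because it yields the 1 : 3 : 6 ratio mentioned above, and may differ from the values in the figure.

```python
import numpy as np

z = np.array([1.0, 3.0, 6.0])  # assumed z values giving a 1 : 3 : 6 ratio

# Direct normalization keeps the original ratio.
direct = z / z.sum()
print(direct)  # [0.1 0.3 0.6]

# Softmax concentrates most of the probability on the largest z value.
softmax_p = np.exp(z) / np.exp(z).sum()
print(softmax_p.round(4))
```

With these inputs the largest value receives well over 90% of the probability under softmax, compared with 60% under direct normalization.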

Hence, softmax is used as the activation function in NN.

# Training A Neural Network
