### NN
- Input to NN --> feature vector

- 1D vector - Classic input to a neural network, similar to rows in a spreadsheet. Common in predictive modeling.
- 2D Matrix - Grayscale image input to a CNN.
- 3D Matrix - Color image input to a CNN.
- nD Matrix - Higher-order input to a CNN.

- Regression or two-class classification --> networks typically have a single output neuron
- Classification with more than two classes --> networks have one output neuron for each category

### Input and Neurons
- An NN typically takes floating-point vectors as input
- A vector is a 1D array, e.g. [0.75, 0.55, 0.2]

### Hidden Neurons
- Previously, networks were limited to one hidden layer --> Reason: training more layers took too much time
- Compute is now plentiful, so networks can have multiple layers of hidden neurons
- Training refers to the *__process that determines good weight values__*

### Bias Neurons
- Generally marked with a 1, which is called the bias activation
- Not connected to the previous layer
- Each layer has a bias neuron that provides a constant value
- Allows the program to shift the output of an activation function
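
The shifting effect can be sketched with a single neuron. This is a minimal illustration, assuming one input, a sigmoid activation, and a hypothetical helper `neuron_output`; the weight and bias values are made up for the example.

```python
import math

def sigmoid(x):
    """Logistic function: squashes x into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def neuron_output(x, weight, bias_weight):
    """Hypothetical single neuron: the bias neuron always outputs 1,
    so its contribution is just its weight."""
    bias_activation = 1.0
    return sigmoid(weight * x + bias_weight * bias_activation)

# Without a bias, the sigmoid crosses 0.5 at x = 0.
no_bias = neuron_output(0.0, weight=1.0, bias_weight=0.0)

# A bias weight of -2 shifts the curve right: the 0.5 crossing moves to x = 2.
shifted = neuron_output(2.0, weight=1.0, bias_weight=-2.0)

print(no_bias, shifted)
```

Changing the weight alone only makes the curve steeper or flatter; moving it left or right needs the bias term.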

### Neurons (also called nodes, units or summations)
- __Input Neurons__: Each input neuron maps to one element of the feature vector
- __Hidden Neurons__: Allow the NN to abstract the input and process it into the output
- __Output Neurons__: Each output neuron calculates one part of the output
- __Bias Neurons__: Work like the y-intercept of a linear equation (y = mx + b), i.e. the *b*

### Input and Output Neurons
- The input layer takes the input vector, one element per input neuron, and passes it on as a vector of the same shape
- The length of the input vector or array must equal the number of input neurons
    - E.g. three input neurons need a three-element vector:
        [0.5, 0.75, 0.3]

### Hidden Neurons
- Takes input from input neurons or other hidden neurons
- Helps to understand input and form the output
- Most of the time, multiple hidden layers are stacked
- Training means finding the best weight values
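
To make the data flow concrete, here is a minimal forward pass through one hidden layer. It is a sketch only: the helper `layer` is hypothetical, and the weights are illustrative, untrained values.

```python
def relu(x):
    """ReLU activation: passes positive values, clips negatives to 0."""
    return max(0.0, x)

def layer(inputs, weights, biases, activation):
    """Hypothetical dense layer: each neuron computes a weighted sum of
    its inputs plus a bias, then applies the activation function."""
    return [
        activation(sum(w * v for w, v in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]

# Illustrative weights: 3 input neurons -> 2 hidden neurons -> 1 output neuron.
hidden_w = [[0.2, -0.4, 0.1], [0.5, 0.3, -0.2]]
hidden_b = [0.0, 0.1]
output_w = [[1.0, -1.0]]
output_b = [0.05]

x = [0.5, 0.75, 0.3]                                     # one value per input neuron
hidden = layer(x, hidden_w, hidden_b, relu)              # hidden layer uses ReLU
output = layer(hidden, output_w, output_b, lambda v: v)  # linear output neuron

print(hidden, output)
```

Training would adjust the numbers in `hidden_w`, `hidden_b`, `output_w` and `output_b`; the structure of the computation stays the same.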

### Bias Neurons
- Output a fixed value, such as 1
- Each layer apart from output layer will have bias neurons
- Not connected with previous layers

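### Why are bias neurons needed?
- The activation function specifies the output of a neuron
- The bias shifts that activation function, which changing the weights alone cannot do
- The derivative of a function is taken to measure its *sensitivity*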

### Activation functions
- ReLU: Used for output of hidden layers
- Softmax: Used for output of classification
- Linear: Used for output of regression (or two-class classification)

### Linear Activation function

$$
    f(x) = x 
$$ 
- The derivative of a linear activation function is a constant, so backpropagation gets the same gradient whatever the input is
- With linear activations everywhere, the last layer is a linear function of the first layer, so the whole NN collapses into the equivalent of a single layer
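
The collapse can be checked numerically. A minimal sketch, assuming scalar layers with made-up weights: two stacked linear layers give exactly the same outputs as one linear layer with combined weight and bias.

```python
def linear_layer(x, w, b):
    """A layer with linear activation f(x) = x: just w * x + b."""
    return w * x + b

# Illustrative weights and biases for two stacked linear layers.
w1, b1 = 2.0, 1.0
w2, b2 = -3.0, 0.5

def two_layers(x):
    return linear_layer(linear_layer(x, w1, b1), w2, b2)

# Algebraically: w2 * (w1 * x + b1) + b2 = (w2 * w1) * x + (w2 * b1 + b2)
w_combined = w2 * w1
b_combined = w2 * b1 + b2

def one_layer(x):
    return linear_layer(x, w_combined, b_combined)

for x in [-1.0, 0.0, 2.5]:
    print(x, two_layers(x), one_layer(x))
```

A non-linear activation such as ReLU between the layers breaks this equivalence, which is why hidden layers use one.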

<img src='../misc/Linear.png' width='800'/>

### Rectified Linear Units (ReLU)

$$
    f(x) = \max(0, x)
$$
- Good for hidden layers
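
A minimal sketch of ReLU and its derivative in plain Python. Treating the derivative at exactly 0 as 0 is a common convention assumed here, not something stated in these notes.

```python
def relu(x):
    """f(x) = max(0, x)"""
    return max(0.0, x)

def relu_derivative(x):
    """1 for positive inputs, 0 otherwise (value at x == 0 is a convention)."""
    return 1.0 if x > 0 else 0.0

values = [-2.0, -0.5, 0.0, 0.5, 2.0]
print([relu(v) for v in values])
print([relu_derivative(v) for v in values])
```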

<img src='../misc/ReLU.png' width='800'/>

### Softmax Activation function

- Generally used in classification problems, usually in the output layer of the NN
- Each neuron outputs the *probability* of its class
- The probabilities sum to 100%
- 1 output neuron for each class

$$
    f_i(x) = \frac{\exp(x_i)}{\sum_j \exp(x_j)}
$$

- The raw output of each neuron is not a probability but one element of the output vector; applying the formula above to that vector gives the probability of each class
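
The formula translates almost directly into code. A minimal sketch in plain Python; the input scores are made up for illustration.

```python
import math

def softmax(scores):
    """Exponentiate each raw score, then divide by the sum of all exponentials."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

raw_outputs = [2.0, 1.0, 0.1]   # one raw score per class (illustrative)
probs = softmax(raw_outputs)

print(probs)
print(sum(probs))               # the probabilities add up to 1 (within rounding)
```

The largest raw score always gets the largest probability, so the predicted class is the index of the maximum.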

### Step Activation Function

- Step activation is 1 if x >= 0.5 and 0 otherwise
- Mainly used in binary classification
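
A minimal sketch of the step function as defined above. The `threshold` parameter is an assumption added for illustration, defaulting to the 0.5 used in these notes.

```python
def step(x, threshold=0.5):
    """Returns 1 if x reaches the threshold, otherwise 0."""
    return 1 if x >= threshold else 0

print([step(v) for v in [-1.0, 0.49, 0.5, 3.0]])
```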

<img src='../misc/Step.png' width='800'/>

### Sigmoid Activation Function

$$
    f(x) = \frac{1}{1 + e^{-x}}
$$
- Also called Logistic activation function
- Used to ensure that values stay within a relatively small range, between 0 and 1
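
A short sketch showing the squashing behaviour: large negative inputs approach 0, large positive inputs approach 1, and an input of 0 maps to 0.5. The sample inputs are arbitrary.

```python
import math

def sigmoid(x):
    """f(x) = 1 / (1 + e^(-x))"""
    return 1.0 / (1.0 + math.exp(-x))

for x in [-10.0, -1.0, 0.0, 1.0, 10.0]:
    print(x, sigmoid(x))
```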

<img src='../misc/sigmoid.png' width='800'/>

### Hyperbolic Tangent Activation Function

- Outputs the value in range -1 to 1

$$
    f(x) = \tanh(x)
$$
- HTAN/tanh has advantages over Sigmoid:
    - training with tanh often converges faster than with sigmoid, since its outputs are centered on 0
    
Reference: [Sigmoid vs Tanh](https://www.baeldung.com/cs/sigmoid-vs-tanh-functions)
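
The two functions are closely related: tanh is a sigmoid rescaled to the range -1 to 1, via the identity tanh(x) = 2 * sigmoid(2x) - 1. A minimal sketch checking this numerically, with arbitrary sample inputs:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def tanh_via_sigmoid(x):
    """tanh expressed as a rescaled, recentered sigmoid."""
    return 2.0 * sigmoid(2.0 * x) - 1.0

for x in [-2.0, -0.5, 0.0, 0.5, 2.0]:
    print(x, math.tanh(x), tanh_via_sigmoid(x))
```

Unlike sigmoid, tanh maps an input of 0 to an output of 0, so its outputs are centered on zero.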

<img src='../misc/tanh.png' width='800'/>

### Derivative of the Sigmoid function

First, apply the reciprocal rule.

$$
s'(x) = \frac{d}{dx} \left( \frac{1}{e^{-x}+1} \right) = \frac{-\frac{d}{dx}(e^{-x}+1)}{(e^{-x}+1)^2}
$$

Second, differentiation is linear, so we can take the derivative of each term separately. The constant 1 goes away.

$$
= \frac{-\frac{d}{dx}(e^{-x})}{(e^{-x}+1)^2}
$$

Next, exponential function rule with chain rule.

$$
= \frac{-( e^{-x} \cdot \frac{d}{dx}({-x}))}{(e^{-x}+1)^2}
$$

Next, the derivative of $-x$ is $-1$.

$$
= \frac{-(-1)\cdot{e}^{-x}}{(e^{-x}+1)^2}
$$

Eliminate the double negative.

$$
= \frac{e^{-x}}{(e^{-x}+1)^2}
$$

We could stop with the above derivative. However, most texts do not present it as the final form of the sigmoid function's derivative. For “computational efficiency” we algebraically transform it to reuse the original sigmoid function.

$$
= \left( \frac{1}{e^{-x}+1} \right) \left( \frac{e^{-x}}{e^{-x}+1} \right)
$$

Since $\frac{e^{-x}}{e^{-x}+1} = 1 - \frac{1}{e^{-x}+1} = 1 - s(x)$, this reduces to the commonly used form of the sigmoid’s derivative.

$$
s'(x)= s(x)(1-s(x))
$$
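
The result can be sanity-checked numerically. This sketch compares the form s(x)(1 - s(x)), the quotient form e^(-x) / (e^(-x) + 1)^2, and a central finite-difference estimate; the function names and the step size `h` are choices made for this example.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def derivative_sigmoid_form(x):
    """s'(x) = s(x) * (1 - s(x))"""
    s = sigmoid(x)
    return s * (1.0 - s)

def derivative_quotient_form(x):
    """s'(x) = e^(-x) / (e^(-x) + 1)^2"""
    return math.exp(-x) / (math.exp(-x) + 1.0) ** 2

def derivative_numeric(x, h=1e-6):
    """Central finite-difference approximation of the derivative."""
    return (sigmoid(x + h) - sigmoid(x - h)) / (2.0 * h)

for x in [-3.0, 0.0, 1.5]:
    print(x, derivative_sigmoid_form(x), derivative_quotient_form(x), derivative_numeric(x))
```

The sigmoid form is convenient in backpropagation because s(x) has already been computed during the forward pass.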

### Softmax
- Commonly used to convert *logits* (a vector of raw scores) into probability values
- Calculated by exponentiating each score and then normalizing the results to sum to 1
$$
    P(y_i) = \frac{e^{z_i}}{\sum_{j=1}^K e^{z_j}}
$$
Where:

*$P(y_i)$* is the probability of class $i$

*$z_i$* is the raw score (logit) for class $i$

*$K$* is the total number of classes.

### LogSoftmax
- Often preferred over softmax for numerical stability
- Works on log-probabilities directly, so the probabilities themselves never need to be computed
- Helps prevent overflow or underflow errors
- Log softmax is also useful when applying the negative log-likelihood loss function, as it simplifies the mathematical calculations by eliminating the need to compute the actual probabilities.
$$
    \log P(y_i) = z_i - \log \sum_{j=1}^K e^{z_j}
$$
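
A minimal sketch of log softmax in plain Python. It subtracts the largest logit before exponentiating, which is the standard log-sum-exp trick; that trick is an addition for this example, not part of the formula above, and it leaves the result unchanged mathematically. The logits are made up and chosen large enough that `math.exp` on them directly would overflow.

```python
import math

def log_softmax(logits):
    """log P(y_i) = z_i - log(sum_j exp(z_j)), computed stably."""
    m = max(logits)
    # log(sum_j exp(z_j)) = m + log(sum_j exp(z_j - m)); each exponent is <= 0
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - log_sum_exp for z in logits]

big_logits = [1000.0, 1001.0, 1002.0]   # math.exp(1000.0) alone raises OverflowError
log_probs = log_softmax(big_logits)

print(log_probs)
print(sum(math.exp(lp) for lp in log_probs))   # recovered probabilities sum to ~1
```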


### Cross-Entropy Loss
#### For Binary Classification:

$$
    CE(y,\hat{y}) = -(y \log(\hat{y}) + (1-y) \log(1-\hat{y}))
$$
where

$y$ is the true label (0 or 1)

$\hat{y}$ is the predicted probability of belonging to class 1
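
A minimal sketch of binary cross-entropy for a single example, taking a true label of 0 or 1 and a predicted probability of class 1. The `eps` clipping is an assumption added here to avoid `log(0)`; the example predictions are made up.

```python
import math

def binary_cross_entropy(y, y_hat, eps=1e-12):
    """-(y * log(y_hat) + (1 - y) * log(1 - y_hat)) for one example."""
    y_hat = min(max(y_hat, eps), 1.0 - eps)   # keep log() away from 0
    return -(y * math.log(y_hat) + (1 - y) * math.log(1.0 - y_hat))

confident_right = binary_cross_entropy(1, 0.9)   # true class 1, predicted 0.9
confident_wrong = binary_cross_entropy(1, 0.1)   # true class 1, predicted 0.1

print(confident_right, confident_wrong)
```

The loss is small when the predicted probability agrees with the label and grows quickly as the prediction becomes confidently wrong.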

#### For Multi-class classification
- Often called *categorical cross entropy loss* or *softmax loss*
$$
    CE(y,\hat{y}) = -\sum_{i=1}^N y_i \log \hat{y}_i
$$
where

$N$ is the number of classes

$y_i$ is the true probability of class $i$

$\hat{y}_i$ is the predicted probability of class $i$, produced by the softmax function
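
A minimal sketch of categorical cross-entropy for one example, assuming a one-hot true distribution and predicted probabilities that already sum to 1. The `eps` guard against `log(0)` and the sample values are assumptions for illustration.

```python
import math

def categorical_cross_entropy(y_true, y_pred, eps=1e-12):
    """-sum_i y_i * log(y_hat_i) over all classes."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(y_true, y_pred))

y_true = [0.0, 1.0, 0.0]    # one-hot: the correct class is index 1
y_pred = [0.1, 0.7, 0.2]    # e.g. the output of a softmax layer

loss = categorical_cross_entropy(y_true, y_pred)
print(loss)
```

With one-hot labels only the term for the correct class survives, so the loss equals the negative log of the probability assigned to that class.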

### PyTorch

- To create an uninitialized tensor, use *__torch.Tensor__*
- To create a tensor from existing data, use *__torch.tensor__*, which copies the data and infers its dtype

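### PyTorch training loop explanation
```python
# training loop (model, criterion, optimizer, x and y are assumed to be defined beforehand)
for epoch in range(1000):
    optimizer.zero_grad()      # reset gradients left over from the previous iteration
    out = model(x)             # forward pass: compute predictions
    loss = criterion(out, y)   # compare predictions with the targets
    loss.backward()            # backpropagation: compute gradients of the loss
    optimizer.step()           # update the weights using those gradients
```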

Explanation: