# Common math used in ML

This guide covers the most common non-linear, non-calculus functions in machine learning, and how to express them  
in NumPy and PyTorch.  It also covers why they are useful.

- Activation functions
    - Softmax/Sigmoid
    - ReLU
- Entropy and Cross-Entropy
- Seeding
- Minmax
- Mean and variance
- Sampling
- T-test


In [None]:
## Imports we need

import torch as tch
import torch.nn.functional as F

## Activation Functions

An activation function in deep learning is applied to the output of a linear equation so that the result is no longer  
linear.  Without activation functions, stacked layers would collapse into a single linear transformation, and problems  
would boil down to a system of linear equations.

Two of the most common activation functions are sigmoid (and its multi-class generalization, softmax) and ReLU (Rectified Linear Unit).  Softmax is defined as:

$ \huge{\sigma_i = \frac{e^{z_i}}{\sum_{j} e^{z_j}}} $

In [None]:
def softmax(z: tch.Tensor):
    """Map input values to a probability distribution (a non-linear mapping)

    The z values are the input values; each one is mapped to a number between 0 and 1 representing a probability.
    This is often used for classification.  The output values always sum to 1

    Parameters
    ----------
    z : tch.Tensor
        input values

    Returns
    -------
    tch.Tensor
        mapped probability of ith values
    """
    num = z.exp()  # e^z[i] for each element in z
    denom = tch.sum(num)
    return num / denom

z = tch.rand(3)
print(z)
softmax(z)
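
The other two activation functions named above can be sketched the same way. This is a minimal illustration, not a reference implementation: the helper names `relu` and `sigmoid` and the sample values are assumptions made here. It also shows why sigmoid and softmax are grouped together: softmax over the pair `(z, 0)` gives the same probability as sigmoid of `z`.

```python
import torch as tch


def relu(z: tch.Tensor) -> tch.Tensor:
    """Keep positive values and replace negative values with 0."""
    return tch.clamp(z, min=0)


def sigmoid(z: tch.Tensor) -> tch.Tensor:
    """Squash each value independently into the range (0, 1)."""
    return 1 / (1 + tch.exp(-z))


z = tch.tensor([-2.0, 0.0, 3.0])
relu_out = relu(z)        # negative entries become 0, positive entries pass through
sigmoid_out = sigmoid(z)  # an input of 0 maps to 0.5

# Softmax over the two values (3, 0) assigns the first one the same
# probability that sigmoid assigns to 3 on its own.
pair = tch.tensor([3.0, 0.0])
two_class = tch.softmax(pair, dim=0)
```

PyTorch ships these as `tch.relu` and `tch.sigmoid`; the hand-written versions are only there to make the arithmetic visible.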

## Entropy and Cross Entropy

Entropy is a measure of "surprise" or, equivalently, of how uncertain we are about an outcome.  50/50 odds are  
the most "surprising" and have the highest entropy, because we don't know which outcome is more likely.  When something has a  
90% chance or 10% chance, the outcome (whether for or against) is better known, and thus the entropy is low.  Another  
way to think about entropy is that observing a low-entropy outcome provides less information and observing a high-entropy one provides more.

Entropy is measured as:

$ \large{H = - \sum_{i=1}^{N} p(x_i) \log_{2}(p(x_i))} $

Where
- `p(x_i)` is the probability of the i-th of the `N` possible events happening

The probabilities `p(x_i)` must sum to 1:

```python
import math

events = [.25, .75]
assert math.isclose(sum(events), 1.0)  # compare with a tolerance to allow for floating-point rounding
```

In [None]:
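def entropy(x: tch.Tensor):
    """Entropy in bits of a tensor of probabilities (entries must be > 0 and sum to 1)"""
    return -1 * tch.sum(x * tch.log2(x))  # log base 2, matching the formula above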

x = tch.tensor([.25, .75])
entropy(x)
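
To check the claim that 50/50 odds carry the most entropy, the short plain-Python sketch below evaluates the entropy formula for a few two-outcome distributions. The helper name `entropy_bits` and the chosen probabilities are assumptions made for this illustration; terms with probability 0 are skipped, following the usual convention that they contribute nothing.

```python
import math


def entropy_bits(probs):
    """Entropy in bits; outcomes with probability 0 contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


fair = entropy_bits([0.5, 0.5])     # most uncertain two-outcome case
skewed = entropy_bits([0.9, 0.1])   # one outcome is much more likely
certain = entropy_bits([1.0, 0.0])  # the outcome is already known
```

The fair case comes out at exactly 1 bit, the 90/10 case at a little under half a bit, and the certain case at 0.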

## Cross Entropy

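Cross entropy measures how well one probability distribution `q` (for example, a model's predictions) matches another distribution `p` (for example, the true labels) over the same events.  It is lowest when the two distributions are identical, in which case it equals the entropy of `p`.

It is defined as:

$ \large{H(p, q) = - \sum_{i=1}^{N} p(x_i) \log_{2}(q(x_i))} $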

In [None]:
def cross_entropy(t: tch.Tensor):
    ...
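
The cell above is left unimplemented. As a rough sketch only, the block below assumes the standard definition of cross entropy between a true distribution `p` and a predicted distribution `q`, the negative sum of `p * log2(q)`. The function name `cross_entropy_sketch`, its two-argument signature, and the sample distributions are assumptions made here and differ from the one-argument stub.

```python
import torch as tch


def cross_entropy_sketch(p: tch.Tensor, q: tch.Tensor) -> tch.Tensor:
    """Cross entropy in bits between true probabilities p and predicted probabilities q."""
    return -tch.sum(p * tch.log2(q))


p = tch.tensor([0.25, 0.75])       # "true" distribution
q_close = tch.tensor([0.30, 0.70])  # prediction close to p
q_far = tch.tensor([0.90, 0.10])    # prediction far from p

h_self = cross_entropy_sketch(p, p)         # equals the entropy of p
h_close = cross_entropy_sketch(p, q_close)  # slightly larger than h_self
h_far = cross_entropy_sketch(p, q_far)      # much larger than h_self
```

The further the prediction drifts from `p`, the larger the value, which is why cross entropy is commonly used as a classification loss.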

## Mean and Variance

The mean summarizes the central value of a set of data, but it can be misleading depending on the distribution
or variance of the data; for example, a single extreme value can pull it far from where most values lie.

The mean or average is defined as:

$ \large{\overline{x} = \frac{1}{n} \sum_{i=1}^{n} x_{i}} $

```python
nums = [2, 5, -1, 3]
avg = sum(nums)/len(nums)
```

The variance of a data set is a measure of how dispersed the values are.  Imagine a curve that is not too high but broad, vs a curve
that is tall but narrow, where both curves are centered on the same mean.  Variance is a way to measure how spread out or variable 
values are from the mean.

Variance is defined as:

$ \large{\sigma^2 = \frac{1}{n - 1} \sum_{i=1}^{n} (x_{i} - \overline{x})^2} $

Dividing by `n - 1` rather than `n` gives the sample variance, which is the usual choice when the data is a sample from a larger population.

The standard deviation is the square root of the variance.

$ \large{\sigma = \sqrt{\frac{1}{n - 1} \sum_{i=1}^{n} (x_{i} - \overline{x})^2}} $
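
The mean, variance, and standard deviation can be worked through in plain Python. The sketch below reuses the `nums` list from the mean example; the variable names and the cross-check against the standard-library `statistics` module are additions made for this illustration. It assumes the `n - 1` (sample) form of the variance.

```python
import math
import statistics

nums = [2, 5, -1, 3]
n = len(nums)

mean = sum(nums) / n
# Sum of squared distances from the mean, divided by n - 1
variance = sum((x - mean) ** 2 for x in nums) / (n - 1)
std_dev = math.sqrt(variance)

# statistics.variance and statistics.stdev also divide by n - 1
lib_variance = statistics.variance(nums)
lib_std_dev = statistics.stdev(nums)
```

For this list the mean is 2.25, the variance is 6.25, and the standard deviation is 2.5.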
