# Deep Understanding of Deep Learning Practice
## Chapter: Statistics, NumPy, PyTorch
### Topic: Mean & Variance
#### Shahjalal Shanto
##### Department of Chemistry, University of Chittagong
**26 February 2024**

## Mean 

**The mean is the most widely used measure of central tendency. It is the simple average of the dataset: the sum of all values divided by the number of values.**

**Note:** 
* Easily affected by outliers

<center> <b>Formula of mean</b> </center>


$$
\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i
$$
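
The sensitivity to outliers noted above can be seen in a small sketch. The numbers here are hypothetical, chosen only so that one value sits far from the rest:

```python
import numpy as np

# Hypothetical data: five similar values
values = np.array([2.0, 3.0, 3.0, 4.0, 3.0])

# The same values plus a single outlier
values_with_outlier = np.append(values, 100.0)

mean_clean = np.mean(values)                  # (2+3+3+4+3) / 5 = 3.0
mean_outlier = np.mean(values_with_outlier)   # (15+100) / 6, about 19.17

print(mean_clean)
print(mean_outlier)
```

A single extreme value moves the mean from 3 to roughly 19, even though five of the six observations are unchanged.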

## Variance and standard deviation

**Variance and standard deviation measure the dispersion of a set of data points around its mean value. There are different formulas for population and sample variance and standard deviation. The sample variance divides by n-1 instead of n because this makes it an unbiased estimator of the population variance; the sample standard deviation is its square root.**

 <center> <b>Sample Variance Formula </b> </center>
$$
s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2
$$


<center> <b> Population Variance Formula </b> </center>
$$
\sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2
$$

<center> <b> Sample standard deviation Formula </b> </center>
$$
s = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2}
$$

<center> <b> Population standard deviation Formula </b> </center>
$$
\sigma = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2}
$$

* $s$ represents the sample standard deviation ($s^2$ is the sample variance).
* $\sigma$ represents the population standard deviation ($\sigma^2$ is the population variance).
* $x_i$ represents each individual observation.
* $\bar{x}$ represents the sample mean (average), used in the sample formulas.
* $\mu$ represents the population mean (average), used in the population formulas.
* $n$ represents the sample size (number of observations in the sample).
* $N$ represents the population size (number of observations in the population).

**Note**
* Calculate the sample variance when the dataset you’re working with represents a sample taken from a larger population of interest.
* Calculate the population variance when the dataset you’re working with represents an entire population, i.e. every value that you’re interested in.

* Further reference: https://www.statology.org/sample-variance-vs-population-variance/
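
As a minimal sketch of how this choice looks in code, the standard deviation can be computed both ways with `np.std` and its `ddof` argument. The cross-check against Python's built-in `statistics` module is an addition for illustration, and the data is the same small list used in the cells below:

```python
import statistics
import numpy as np

x = [1, 2, 4, 6, 5, 4, 0]

# Population standard deviation: NumPy's default (ddof=0), divides by N
pop_std = np.std(x)

# Sample standard deviation: ddof=1, divides by n-1
sample_std = np.std(x, ddof=1)

# Cross-check with the standard library, which has one function for each case
pop_std_check = statistics.pstdev(x)
sample_std_check = statistics.stdev(x)

print(pop_std, pop_std_check)
print(sample_std, sample_std_check)
```

The sample value is always the larger of the two, because it divides the same sum of squared deviations by a smaller number.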

In [11]:
# import libraries
import numpy as np

In [14]:
# create a list of numbers to compute the mean and variance of
x = [1,2,4,6,5,4,0]
n = len(x)

# compute the mean
mean1 = np.mean(x)
mean2 = np.sum(x) / n

# print them
print(mean1)
print(mean2)

3.142857142857143
3.142857142857143


In [15]:
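# variance

var1 = np.var(x)  # NumPy default is ddof=0: population variance (divides by n)
var2 = (1/(n-1)) * np.sum( (np.array(x)-mean1)**2 )  # sample variance (divides by n-1)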

print(var1)
print(var2)

4.122448979591836
4.809523809523809


In [16]:
# using np.var and explicitly setting ddof (delta degrees of freedom) to 1

var3 = np.var(x,ddof=1)

print(var3)
print(var2)

4.809523809523809
4.809523809523809
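
The claim that dividing by n-1 gives an unbiased estimate can be checked empirically. The sketch below is an illustration under assumed settings (seed, sample size, and number of trials are arbitrary choices): it draws many small samples from a population with known variance and averages the two estimators.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen arbitrarily for reproducibility

# Population: integers 0..19, each equally likely.
# The variance of a discrete uniform distribution over k values is (k^2 - 1) / 12.
true_var = (20**2 - 1) / 12  # 33.25

n = 5             # small sample size, so the bias is easy to see
trials = 100_000  # number of independent samples

# Each row is one sample of size n
samples = rng.integers(0, 20, size=(trials, n))

# Average each estimator over all trials
avg_biased = np.var(samples, axis=1, ddof=0).mean()
avg_unbiased = np.var(samples, axis=1, ddof=1).mean()

print(true_var)
print(avg_biased)    # expected to land near true_var * (n-1)/n
print(avg_unbiased)  # expected to land near true_var
```

On average the `ddof=0` estimate falls short of the true variance by a factor of (n-1)/n, while the `ddof=1` estimate centers on it.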


In [17]:
# does it matter for large N? 

N = 10000
x = np.random.randint(0,high=20,size=N)

var0 = np.var(x,ddof=0) # default
var1 = np.var(x,ddof=1) # unbiased

print(var0)
print(var1)

32.95552111
32.95881699169917


**For a large dataset, the difference between the biased and unbiased estimates becomes smaller: the two differ only by a factor of N/(N-1), which approaches 1 as N grows.**
* These concepts are very useful in DL, especially for normalizing data or model weights.
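
As one possible illustration of the normalization mentioned above, the sketch below standardizes a small array (z-scoring): subtract the mean and divide by the standard deviation. Using the population standard deviation (`ddof=0`) here is an assumption made for the example, not a rule.

```python
import numpy as np

x = np.array([1, 2, 4, 6, 5, 4, 0], dtype=float)

# z-score: shift to zero mean, scale to unit standard deviation (ddof=0)
z = (x - x.mean()) / x.std()

print(z.mean())  # zero up to floating-point rounding
print(z.std())   # one up to floating-point rounding
```

Libraries differ in their defaults: `np.var` and `np.std` use `ddof=0`, while PyTorch's `torch.var` and `torch.std` apply the n-1 correction by default, so it is worth checking which one a given normalization step uses.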