# Gaussian Distributions

Gaussian distributions are continuous probability distributions. Earlier, when modeling the probabilities of a spinning wheel, you used a uniform distribution. To model <U>uncertainty</U> in a sensor measurement, we use a Gaussian distribution instead.


* population refers to the entire set of data points. For example, if you were measuring people's weights, the population would be all people in the world.
* sample refers to a part of the population. In the weights example, you might take a random sample of the population, since it would be practically impossible to measure the weights of all humans.
* mean is the average value. For the population in the weights example, this would be the average weight of all humans.
* standard deviation measures the spread in the data. Does the data tend to hover around the mean, or is the data more spread out?
* ref: [sample and population](https://stattrek.com/sampling/populations-and-samples.aspx)
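
The difference between population statistics and sample statistics can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the population below is synthetic (10,000 made-up weights in kg, not real measurements), and the sample size of 100 is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # fixed seed so the run is repeatable

# Hypothetical population: 10,000 synthetic weights in kg (not real data)
population = rng.normal(loc=70.0, scale=15.0, size=10_000)

# A random sample of 100 data points, drawn without replacement
sample = rng.choice(population, size=100, replace=False)

# Population statistics: ddof=0 divides by N
population_mean = population.mean()
population_std = population.std(ddof=0)

# Sample statistics: ddof=1 divides by n - 1 (the usual sample estimate)
sample_mean = sample.mean()
sample_std = sample.std(ddof=1)

print(f"population: mean = {population_mean:.2f}, std = {population_std:.2f}")
print(f"sample:     mean = {sample_mean:.2f}, std = {sample_std:.2f}")
```

The sample values should land close to the population values, but not exactly on them, because a sample only sees part of the data.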


## Gaussian Equation

<img src='./figs/gaussian.png'>

* $\mu$: population mean
* $\sigma$: standard deviation of the distribution
* $x$: variable
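
Written out with the symbols above, the standard form of the Gaussian probability density function is shown below. This assumes the figure uses the standard form, which is also what the density function in the demo code of this section computes.

$$
f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \, e^{-\frac{(x - \mu)^2}{2\sigma^2}}
$$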

> From statistics, the mean and standard deviation are constants when dealing with a population. So for a specific population, the only value that varies is x.


* The mean value is the center of the bell curve.
* As the standard deviation increases, the bell curve becomes wider and flatter, so uncertainty increases as well.
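
To see the effect of the standard deviation numerically, the sketch below evaluates the density at the mean for a few values of sigma. The mean of 50 and the sigma values are arbitrary choices made for illustration, and the code uses only the standard library.

```python
import math

def gaussian_density(x, mu, sigma):
    """Gaussian probability density at x for mean mu and standard deviation sigma."""
    coefficient = 1.0 / math.sqrt(2.0 * math.pi * sigma ** 2)
    exponent = -((x - mu) ** 2) / (2.0 * sigma ** 2)
    return coefficient * math.exp(exponent)

mu = 50.0  # assumed mean, chosen only for this illustration

# Density at the mean (the peak of the bell curve) for several sigma values
peak_heights = {sigma: gaussian_density(mu, mu, sigma) for sigma in (5.0, 10.0, 20.0)}

for sigma, height in peak_heights.items():
    print(f"sigma = {sigma:>4}: density at the mean = {height:.5f}")
```

Doubling sigma halves the peak height. The total area under the curve must stay equal to 1, so a wider curve has to be a flatter one.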


## Demo: Probability Density Function

The code cell below calculates a Gaussian probability density function two ways. First, we use the density function from the previous exercises. Then, we compare the result against the SciPy library's implementation.

You'll see that the results are exactly the same.

```python
from scipy.stats import norm
import numpy as np

# our solution to calculate the probability density function
def gaussian_density(x, mu, sigma):
    return (1/np.sqrt(2*np.pi*np.power(sigma, 2.))) * np.exp(-np.power(x - mu, 2.) / (2 * np.power(sigma, 2.)))

print("Probability density function our solution: mu = 50, sigma = 10, x = 50")
print(gaussian_density(50, 50, 10))
print("\nProbability density function SciPy: mu = 50, sigma = 10, x = 50")
print(norm(loc = 50, scale = 10).pdf(50))
```

Output:

```
Probability density function our solution: mu = 50, sigma = 10, x = 50
0.03989422804014327

Probability density function SciPy: mu = 50, sigma = 10, x = 50
0.03989422804014327
```


You might wonder why the Gaussian distribution is called "norm" in the SciPy library; it's because the Gaussian distribution is also called the normal distribution. 

Also, note that to initialize the distribution, the loc keyword is the mean and the scale keyword is the standard deviation.

### Calculating Probability

The area under the probability density function represents probability. Your job will be to write a function that calculates the probability between two x-values. For example, using the winter San Francisco temperature example, what is the probability that the temperature is between 30 degrees and 50 degrees? 

The SciPy library has a function that calculates the area under the curve for you. It's called cdf ([cumulative distribution function](https://en.wikipedia.org/wiki/Cumulative_distribution_function)). You can use the SciPy cdf method in much the same way as the pdf method. Run the code cell below to see an example.

```python
norm(loc = 50, scale = 10).cdf(50)
```

Output:

```
0.5
```

Why is the output 0.5? The cdf method gives you the area under the curve from x = -infinity through the input, which in this case was 50. The area under the curve is 0.5 meaning there is a 50% chance that the temperature is between -infinity and 50 degrees.
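
Because cdf gives the area from -infinity up to its input, one way to get the probability between two x-values is to subtract two cdf values. The sketch below is one possible approach, not necessarily the intended exercise solution. The function name `probability_between` is hypothetical, and the mean of 50 and standard deviation of 10 are reused from the demo above as assumed parameters.

```python
from scipy.stats import norm

def probability_between(lower, upper, mu, sigma):
    """Probability that a Gaussian variable falls between lower and upper."""
    distribution = norm(loc=mu, scale=sigma)
    # Area up to `upper` minus area up to `lower` leaves the area in between
    return distribution.cdf(upper) - distribution.cdf(lower)

# Assumed parameters: mu = 50 and sigma = 10, as in the demo above
p = probability_between(30, 50, mu=50, sigma=10)
print(p)
```

With these assumed parameters, 30 lies two standard deviations below the mean, so the result is roughly 0.477.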

Run the code cell below to see a visualization of the area under the curve from -infinity to 50.
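
The following is a minimal sketch of such a visualization, assuming Matplotlib is installed. The function name `plot_area_up_to` and the plotting range of four standard deviations on either side of the mean are illustrative choices, not taken from the original notebook cell.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

def plot_area_up_to(upper, mu, sigma):
    """Plot the Gaussian pdf and shade the area from the left edge up to `upper`."""
    distribution = norm(loc=mu, scale=sigma)
    # A range of mu +/- 4 sigma covers nearly all of the probability mass
    x = np.linspace(mu - 4 * sigma, mu + 4 * sigma, 500)
    y = distribution.pdf(x)

    fig, ax = plt.subplots()
    ax.plot(x, y, label="probability density")
    # Shade only the part of the curve to the left of `upper`
    ax.fill_between(x, y, where=(x <= upper), alpha=0.3,
                    label=f"area = {distribution.cdf(upper):.2f}")
    ax.set_xlabel("x")
    ax.set_ylabel("density")
    ax.legend()
    return fig, ax

fig, ax = plot_area_up_to(50, mu=50, sigma=10)
plt.show()
```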

## Central Limit Theorem

The central limit theorem says that if you take sufficiently large samples from a population and calculate the mean of each sample, those sample means will be approximately normally distributed. The theorem holds as long as the sample size is large enough and the observations are independent and random.

### A Population
A population consists of all of the values of a data set. 

<img src='./figs/clm.png'>

For example, the value 15 shows up in the population about 160 times. The value 50 shows up in the population about 70 times. In total, this population has 10,000 data points.

Consider randomly grabbing 100 data points from this population. Call these 100 data points a <U>sample</U>, and calculate the mean value of the sample. If you repeated this random sampling over and over, **the mean values would have a Gaussian distribution**.

It's amazing that a population distribution that does not look Gaussian at all produces sample means that are approximately Gaussian once you take the mean of many samples.
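
The repeated-sampling experiment can be simulated directly. The sketch below makes several assumptions: the population is synthetic (10,000 values drawn from a skewed exponential distribution, standing in for the non-Gaussian population in the figure), and the number of repetitions (2,000) is an arbitrary choice. Only the sample size of 100 comes from the text.

```python
import numpy as np

rng = np.random.default_rng(seed=42)  # fixed seed so the run is repeatable

# Assumed non-Gaussian population: 10,000 values from a skewed distribution
population = rng.exponential(scale=20.0, size=10_000)

# Draw many samples of 100 data points each and record every sample's mean
sample_size = 100
num_samples = 2_000
sample_means = np.array([
    rng.choice(population, size=sample_size).mean()
    for _ in range(num_samples)
])

# The theorem predicts that the sample means cluster around the population
# mean with a spread close to population_std / sqrt(sample_size)
predicted_spread = population.std() / np.sqrt(sample_size)

print(f"population mean:          {population.mean():.2f}")
print(f"mean of sample means:     {sample_means.mean():.2f}")
print(f"std of sample means:      {sample_means.std():.2f}")
print(f"predicted (sigma/sqrt n): {predicted_spread:.2f}")
```

Plotting a histogram of `sample_means` should show a bell shape, even though a histogram of `population` is strongly skewed.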