# The Normal Distribution, Standard Normal Distribution, Standard Scores, and One-Sample z-tests

# Normal distribution 


## What are the parameters that characterize the normal distribution?

## What is the empirical rule? 

## Next, let's create a normal distribution with `numpy` and visualize it

Use `numpy` to draw a sample of 3000 values from a normal distribution with mean $\mu = 20$ and standard deviation $\sigma = 0.5$.

```python
import numpy as np
normal = np.random.normal(loc=20, scale=0.5, size=3000)
```

Create a normalized histogram for this distribution using `matplotlib`. Set bins = 20. Make sure to get the bin positions and counts for each of the obtained bins.

```python
import matplotlib.pyplot as plt
%matplotlib notebook
```

```python
plt.hist(normal, bins=20)
```

```
(array([  2.,   7.,  13.,  22.,  47.,  95., 134., 248., 357., 359., 418.,
        377., 346., 247., 157.,  98.,  34.,  19.,  14.,   6.]),
 array([18.19155271, 18.36361774, 18.53568278, 18.70774782, 18.87981285,
        19.05187789, 19.22394292, 19.39600796, 19.568073  , 19.74013803,
        19.91220307, 20.0842681 , 20.25633314, 20.42839818, 20.60046321,
        20.77252825, 20.94459328, 21.11665832, 21.28872336, 21.46078839,
        21.63285343]),
 <a list of 20 Patch objects>)
```

Calculate the density function with $\mu$, $\sigma$, and the bin information obtained before.
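One possible way to approach this step is to evaluate the normal density $f(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-(x-\mu)^2 / (2\sigma^2)}$ at the bin edges. The sketch below is only an illustration: it assumes the bin edges come from `np.histogram` rather than from the `plt.hist` call, and the names `rng`, `sample`, `heights`, `bin_edges`, and `density`, as well as the fixed seed, are not part of the original notebook.

```python
import numpy as np

mu, sigma = 20, 0.5

# Illustrative sample; the seed is an assumption added for reproducibility.
rng = np.random.default_rng(0)
sample = rng.normal(loc=mu, scale=sigma, size=3000)

# density=True rescales the bar heights so the histogram integrates to 1.
heights, bin_edges = np.histogram(sample, bins=20, density=True)

# Normal probability density evaluated at each of the 21 bin edges.
density = np.exp(-((bin_edges - mu) ** 2) / (2 * sigma**2)) / (
    sigma * np.sqrt(2 * np.pi)
)
```

With matplotlib, calling `plt.hist(sample, bins=20, density=True)` and then `plt.plot(bin_edges, density)` should overlay this curve on the normalized histogram.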

Plot the normalized histogram (set bins = 20) along with the density function

Use seaborn to visualize the distribution and plot the KDE

# Standard normal distributions 

## Compare and contrast the normal distribution and the standard normal distribution. What is the empirical rule for the standard normal distribution? 
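As a quick numerical check of the empirical rule, the probability that a standard normal value falls within $k$ standard deviations of the mean can be computed from the cumulative distribution function. This is a minimal sketch using `scipy.stats.norm`; the dictionary name `within` is illustrative.

```python
from scipy.stats import norm

# Probability that a standard normal value lies within k standard
# deviations of the mean, for k = 1, 2, 3.
within = {k: norm.cdf(k) - norm.cdf(-k) for k in (1, 2, 3)}

for k, prob in within.items():
    print(f"within {k} standard deviation(s): {prob:.4f}")
```

The three values come out at roughly 0.68, 0.95, and 0.997, which is where the "68-95-99.7" name for the rule comes from.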

## How do you standardize a normal distribution? 

## Standardize the distribution you created above and use seaborn to visualize the distribution and plot the KDE
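A minimal sketch of the standardization step, assuming the sample's own mean and standard deviation are used (substituting the known $\mu = 20$ and $\sigma = 0.5$ would be an equally reasonable choice). The seed and the names `sample` and `standardized` are illustrative.

```python
import numpy as np

# Illustrative sample; the seed is an assumption added for reproducibility.
rng = np.random.default_rng(0)
sample = rng.normal(loc=20, scale=0.5, size=3000)

# Subtract the mean from every value, then divide by the standard deviation.
standardized = (sample - sample.mean()) / sample.std()

print(standardized.mean(), standardized.std())
```

After this transformation the mean is 0 and the standard deviation is 1, up to floating-point error. In recent seaborn versions, `sns.histplot(standardized, kde=True)` is one way to draw the histogram with a KDE overlay.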

# Standard score (z-score)

## Why is the standard score a useful statistic? 

## Let's use a real-world dataset. 

Let's look at the Combined Cycle Power Plant dataset from the [UCI Machine Learning repository](https://archive.ics.uci.edu/ml/datasets/combined+cycle+power+plant). The dataset contains 9568 observations collected from a combined cycle power plant over a period of six years. Features in the dataset consist of the hourly average ambient variables Temperature (T), Ambient Pressure (AP), Relative Humidity (RH), and Exhaust Vacuum (V), and the net hourly electrical energy output (EP) of the plant. 

We'll look at the ambient pressure (AP) variable, which is measured in units of millibars. 

Let's start by loading the data into a `pandas DataFrame` and inspecting the first five rows of the dataframe.

Use seaborn to visualize the distribution of the ambient pressure (`AP`) feature. Plot the KDE. 

**What is the mean ambient pressure? What is the standard deviation of the ambient pressure?**

Standardize the ambient pressure and use seaborn to visualize the standardized distribution of the ambient pressure (set `kde = True`). 

What are the mean and standard deviation of the standardized distribution of the ambient pressure?

**What is the z-score corresponding to an observed ambient pressure of 1025 millibars? Interpret the result.** 

**Suppose an observation of ambient pressure has a z-score of -2. Interpret this z-score. What is the observed ambient pressure?**
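Both questions rely on the same relationship, $z = (x - \mu)/\sigma$, read in either direction. The helpers below are hypothetical names introduced for illustration, and the mean and standard deviation are placeholders, not the actual statistics of the `AP` column; substitute the values computed from the dataset.

```python
def z_score(x, mu, sigma):
    """Number of standard deviations that x lies above (+) or below (-) mu."""
    return (x - mu) / sigma


def from_z_score(z, mu, sigma):
    """Recover the raw observation that corresponds to a given z-score."""
    return mu + z * sigma


# Placeholder values only, NOT the real mean and standard deviation of AP.
mu_placeholder, sigma_placeholder = 1000.0, 10.0

z = z_score(1025, mu_placeholder, sigma_placeholder)     # 2.5 with these placeholders
x = from_z_score(-2, mu_placeholder, sigma_placeholder)  # 980.0 with these placeholders
```

A z-score of -2 always means "two standard deviations below the mean", whatever the units; converting back to millibars requires the dataset's own $\mu$ and $\sigma$.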

# Statistical Testing with z-scores and p-values 

## What makes a sample representative of a population? 

## What is the probability of a z-score being less than 0? 

Hint: Look at the image below.

<img src="images/cumprob.png" width="500">



## Let's go back to the Combined Cycle Power Plant dataset

Assume that the Combined Cycle Power Plant dataset covers the entire period during which the plant was operational, so that it can be treated as the population.

**What is the probability of observing an ambient pressure less than 1001.4 millibars?**

**What is the probability of observing an ambient pressure greater than or equal to 1010 millibars?**
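One way to sketch these calculations, assuming the ambient pressure is approximately normal: convert the observation to a z-score and evaluate the standard normal CDF. The function names and the placeholder parameters below are illustrative assumptions; the real mean and standard deviation of `AP` should be computed from the data.

```python
from scipy.stats import norm

# Placeholder parameters, NOT the real mean and standard deviation of AP.
mu_placeholder, sigma_placeholder = 1000.0, 10.0


def prob_below(x, mu, sigma):
    """P(X < x) for X ~ Normal(mu, sigma), via the standard normal CDF."""
    return norm.cdf((x - mu) / sigma)


def prob_at_least(x, mu, sigma):
    """P(X >= x); the survival function sf(z) equals 1 - cdf(z)."""
    return norm.sf((x - mu) / sigma)


p_low = prob_below(1001.4, mu_placeholder, sigma_placeholder)
p_high = prob_at_least(1010, mu_placeholder, sigma_placeholder)
```

Because the normal distribution is continuous, $P(X \geq x)$ and $P(X > x)$ are equal, so the same call answers both forms of the question.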

## What is a statistical hypothesis? What is hypothesis testing?

## When are one-sample z-tests used?

## Let's perform one-sample z-tests!

Recall the test statistic for a one-sample z-test is the z-statistic: 

$$ \large \text{z-statistic} = \dfrac{\bar x - \mu_0}{{\sigma}/{\sqrt{n}}} $$

* $\bar x$ is your sample mean
* $n$ is the number of items in your sample 
* $\sigma$ is the population standard deviation
* $\mu_0$ is the population mean assumed under the null hypothesis

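The z-statistic differs from the standard score formula: we divide the standard deviation by the square root of $n$ because we are comparing a _sample mean_, not a single observation, to the population mean. The quantity $\sigma/\sqrt{n}$ is the standard error of the mean, i.e. the standard deviation of the sampling distribution of $\bar x$.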

Imagine we have measured the blood pressure for a population of individuals. The average blood pressure for this population is 72.5 mm Hg, with a standard deviation of 12.5 mm Hg. 

We then measure the blood pressure of 30 other individuals. Here are the observed blood pressures (in units of mm Hg): 

`62.9, 66.2, 65.0, 84.7, 68.2, 73.1, 68.3, 57.6, 65.8, 67.8, 54.0, 66.8, 56.4, 54.3, 48.3, 
73.9, 62.2, 53.0, 52.2, 74.5, 66.1, 66.7, 77.7, 73.6, 76.5, 64.2, 59.5, 66.1, 58.3, 64.9`

We want to know if the average blood pressure of these 30 individuals is __significantly lower__ than the population's average blood pressure, at a significance level of $\alpha$ = 0.05.

State the null and alternative hypotheses for this problem.

Perform a one-sample z-test. Interpret the result of the test. 
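A minimal sketch of the lower-tailed test, taking the hypotheses to be $H_0: \mu = 72.5$ and $H_1: \mu < 72.5$, where $\mu$ is the mean blood pressure of the population the 30 individuals were drawn from. Variable names are illustrative; the numbers come from the problem statement.

```python
import numpy as np
from scipy.stats import norm

blood_pressure = np.array([
    62.9, 66.2, 65.0, 84.7, 68.2, 73.1, 68.3, 57.6, 65.8, 67.8,
    54.0, 66.8, 56.4, 54.3, 48.3, 73.9, 62.2, 53.0, 52.2, 74.5,
    66.1, 66.7, 77.7, 73.6, 76.5, 64.2, 59.5, 66.1, 58.3, 64.9,
])

mu_0 = 72.5    # population mean under the null hypothesis (mm Hg)
sigma = 12.5   # known population standard deviation (mm Hg)
alpha = 0.05   # significance level

n = blood_pressure.size
x_bar = blood_pressure.mean()
z_stat = (x_bar - mu_0) / (sigma / np.sqrt(n))

# Lower-tailed test: the p-value is the area to the left of the z-statistic.
p_value = norm.cdf(z_stat)
reject_null = p_value < alpha

print(f"x_bar = {x_bar:.2f}, z = {z_stat:.2f}, p = {p_value:.5f}")
```

The sample mean is about 64.96 mm Hg and the z-statistic is about -3.30, so the p-value falls well below 0.05 and the null hypothesis would be rejected in favor of the alternative.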

Now, we want to know if the average blood pressure of these 30 individuals is significantly **different** from the population's average blood pressure, at a significance level of $\alpha$ = 0.05.

State the null and alternative hypotheses.

Perform a one-sample z-test. Interpret the results of the test. 
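For the two-tailed version, the hypotheses can be taken as $H_0: \mu = 72.5$ and $H_1: \mu \neq 72.5$, and the p-value counts both tails. The helper `one_sample_z_test` below is a hypothetical function written for this sketch, not a library API; its `alternative` argument is likewise an assumption of this example.

```python
import numpy as np
from scipy.stats import norm


def one_sample_z_test(sample, mu_0, sigma, alternative="two-sided"):
    """Return (z_statistic, p_value) for a one-sample z-test with known sigma."""
    sample = np.asarray(sample, dtype=float)
    z = (sample.mean() - mu_0) / (sigma / np.sqrt(sample.size))
    if alternative == "less":
        p = norm.cdf(z)            # area in the lower tail
    elif alternative == "greater":
        p = norm.sf(z)             # area in the upper tail
    elif alternative == "two-sided":
        p = 2 * norm.sf(abs(z))    # both tails, using symmetry
    else:
        raise ValueError("alternative must be 'less', 'greater', or 'two-sided'")
    return z, p


blood_pressure = [
    62.9, 66.2, 65.0, 84.7, 68.2, 73.1, 68.3, 57.6, 65.8, 67.8,
    54.0, 66.8, 56.4, 54.3, 48.3, 73.9, 62.2, 53.0, 52.2, 74.5,
    66.1, 66.7, 77.7, 73.6, 76.5, 64.2, 59.5, 66.1, 58.3, 64.9,
]

z_two, p_two = one_sample_z_test(blood_pressure, mu_0=72.5, sigma=12.5)
z_less, p_less = one_sample_z_test(blood_pressure, 72.5, 12.5, alternative="less")
```

The z-statistic is the same for both tests; only the p-value changes. Here the two-sided p-value is twice the lower-tailed one and still falls below $\alpha$ = 0.05.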

## What is a p-value? What is the importance of $\alpha$, the significance threshold, in hypothesis testing? 

## Summary

### Key Takeaways: 

* Normal distributions are characterized by two parameters: the mean, $\mu$, and standard deviation, $\sigma$. Normal distributions are symmetric about the mean. The standard normal distribution is a special case of the normal distribution where $\mu = 0$ and $\sigma = 1$. Any normal distribution can be standardized by subtracting the mean $\mu$ from each value and dividing the result by the standard deviation $\sigma$. 

* The z-score tells us how many standard deviations above or below the mean an observation is. z-scores allow us to compare scores from different normal distributions. 

$$\large \text{z} = \frac{x - \mu}{\sigma}$$ 

* z-scores and probabilities: 
    * To compute the probability of obtaining a z-score less than a given value z, use `scipy.stats.norm.cdf(z)`. 
    * To compute the probability of obtaining a z-score greater than or equal to a given value z, use `1 - scipy.stats.norm.cdf(z)`.

* Samples are representative of populations when they accurately reflect the members of the entire population. 

* A statistical hypothesis is an assumption about a population parameter. There are two types of hypotheses: null and alternative hypotheses. You set a null hypothesis, draw a sample, and test your null hypothesis based on that sample.

* A p-value is the probability of observing a test statistic at least as extreme as the one actually observed, assuming the null hypothesis is true.  
    * **A p-value answers the question: what are the chances of getting your result if the null hypothesis is true?**

* The one-sample z-test is used when you want to know if your sample comes from a particular population whose standard deviation is known. The one-sample z-test is used only for tests related to the sample mean. The test statistic of one-sample z-tests is called the z-statistic. 

* When performing hypothesis tests, we either have enough evidence or do not have enough evidence to reject the null hypothesis in favor of the alternative, depending on the significance level $\alpha$ chosen. 