# Normal Distribution

The normal distribution is unimodal and symmetric. It is sometimes called the bell curve because its shape resembles a bell. However, it is not just any symmetric unimodal curve: it follows very strict guidelines about how variably the data are distributed around the mean. While many variables are nearly normal, none are exactly normal because of these strict guidelines. The normal distribution has two parameters, the mean ($\mu$) and the standard deviation ($\sigma$).

Here we see two normal distributions, both centered at 0 but with different standard deviations. They show how changing the spread changes the overall shape of the distribution: a larger standard deviation gives a wider, flatter curve.

<img src="https://cdn.rawgit.com/rogergranada/MOOCs/3d1b63cc/Coursera/Duke%20University/Probability-intro/Week%204/images/normal_distribution.svg" width="400" align="center"/>
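A figure like this one could be drawn in base R roughly as follows. This is only a sketch: the standard deviations 1 and 2 are assumptions chosen for illustration, not values read off the figure.

```r
# Two normal densities centered at 0.
# The standard deviations (1 and 2) are illustrative assumptions.
x <- seq(-8, 8, length.out = 400)
narrow <- dnorm(x, mean = 0, sd = 1)
wide   <- dnorm(x, mean = 0, sd = 2)

# Draw the narrower curve first, then overlay the wider one.
plot(x, narrow, type = "l", ylab = "density")
lines(x, wide, lty = 2)
legend("topright", legend = c("sd = 1", "sd = 2"), lty = c(1, 2))
```

The curve with the larger standard deviation has a lower peak and heavier spread, while the total area under each curve remains 1.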

### Strict Rules

What are these strict rules that govern the variability of normally distributed data around the mean? For nearly normally distributed data, about 68% of observations fall within one standard deviation of the mean, about 95% fall within two standard deviations, and about 99.7% fall within three standard deviations. This is known as the 68-95-99.7% rule. It's possible for observations to fall four, five, or even more standard deviations away from the mean, but these occurrences are very rare if the data are nearly normal.
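These percentages can be checked with `pnorm`, which returns the area under the normal curve to the left of a value. The sketch below uses the standard normal distribution (mean 0, standard deviation 1), so the values -1, -2, -3 and 1, 2, 3 are already in units of standard deviations; the variable names are ours.

```r
# Area within k standard deviations of the mean, for k = 1, 2, 3.
k <- 1:3
within_k_sd <- pnorm(k) - pnorm(-k)
round(within_k_sd, 4)
# 0.6827 0.9545 0.9973
```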

<img src="https://cdn.rawgit.com/rogergranada/MOOCs/f06fcb47/Coursera/Duke%20University/Probability-intro/Week%204/images/rules.svg" width="400" align="center"/>

**Practice**:

*A doctor collects a large set of heart rate measurements that approximately follow a normal distribution. He reports only three statistics: the mean, 110 beats per minute; the minimum, 65 beats per minute; and the maximum, 155 beats per minute. Which of the following is most likely to be the standard deviation of the distribution?*  
*(a) 5*   
*(b) 15*    
*(c) 35*  
*(d) 90*   

<img src="https://cdn.rawgit.com/rogergranada/MOOCs/2b1fd7c1/Coursera/Duke%20University/Probability-intro/Week%204/images/bell_curve.svg" width="400" align="center"/>

We're going to make use of the fact that in a normal distribution, almost all of the data lie within 3 standard deviations of the mean. In a large sample, the minimum and maximum should therefore sit roughly 3 standard deviations from the mean.

(a) 5: $\mu \pm (3 \times \sigma) = 110 \pm (3 \times 5) = (95, 125)$<br>
(b) 15: $\mu \pm (3 \times \sigma) = 110 \pm (3 \times 15) = (65, 155)$ &#10004;<br>
(c) 35: $\mu \pm (3 \times \sigma) = 110 \pm (3 \times 35) = (5, 215)$<br>
(d) 90: $\mu \pm (3 \times \sigma) = 110 \pm (3 \times 90) = (-160, 380)$<br>

Only a standard deviation of 15 gives a range that matches the reported minimum of 65 and maximum of 155.
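The same comparison can be sketched in R by computing the interval $\mu \pm 3\sigma$ for each candidate and checking which one matches the reported minimum and maximum. The variable names here are our own.

```r
mu <- 110
candidate_sd <- c(a = 5, b = 15, c = 35, d = 90)

# Interval covering almost all of the data for each candidate SD.
lower <- mu - 3 * candidate_sd
upper <- mu + 3 * candidate_sd
data.frame(sd = candidate_sd, lower, upper)

# Candidate whose interval matches the reported minimum (65) and maximum (155).
best <- names(candidate_sd)[lower == 65 & upper == 155]
best
# "b"
```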

*A college admissions officer wants to determine which of the two applicants scored better on their standardized test with respect to the other test takers, Pam, who earned an 1800 on her SAT, or Jim, who scored a 24 on his ACT?*  
SAT scores ~ N(mean = 1500, SD = 300)<br>
ACT scores ~ N(mean = 21, SD = 5)<br>

<img src="https://cdn.rawgit.com/rogergranada/MOOCs/a526924d/Coursera/Duke%20University/Probability-intro/Week%204/images/sat_scores.svg" width="800" align="center"/>

We can't just compare the raw scores of 1,800 and 24 and conclude that Pam did better because her score is higher, since the two tests are measured on different scales. Instead, we want to figure out how many standard deviations above the respective means of their distributions Pam and Jim scored.

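The standard deviation of SAT scores is 300, so Pam scored one standard deviation above the mean. To calculate this, we first find how far Pam's score is from the mean, then divide that distance by the standard deviation. Doing the same for Jim shows that he scored 0.6 standard deviations above the ACT mean.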

Pam: $\frac{1800-1500}{300} = 1$<br>
Jim: $\frac{24-21}{5} = 0.6$<br>
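The two calculations are simple enough to reproduce directly in R. This sketch just repeats the arithmetic above; the variable names are ours.

```r
# Standard deviations above the mean for each applicant.
z_pam <- (1800 - 1500) / 300   # SAT ~ N(mean = 1500, SD = 300)
z_jim <- (24 - 21) / 5         # ACT ~ N(mean = 21, SD = 5)

c(Pam = z_pam, Jim = z_jim)
# Pam 1.0, Jim 0.6

z_pam > z_jim
# TRUE
```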

Plotting these values on the same distribution, we can see that Pam did indeed do better than Jim.

<img src="https://cdn.rawgit.com/rogergranada/MOOCs/f06fcb47/Coursera/Duke%20University/Probability-intro/Week%204/images/pam_jim.svg" width="400" align="center"/>

### Standardizing with Z Scores

We define a standardized score, or Z-score, as the number of standard deviations an observation falls below or above the mean. The Z comes from the z in "standardize". It might seem odd not to use S, the initial letter of the word, but S is usually reserved for standard deviations, and we don't want to confuse our abbreviations. We will refer to standardized scores as **Z-scores** from this point onwards. We calculate the Z-score of an observation as the observation minus the mean, divided by the standard deviation.

$$Z = \frac{\text{observation} - \mu}{\sigma}$$

By definition, the Z-score of the mean is 0, because plugging in the mean as the observation gives a zero in the numerator of our calculation.

Z-scores are also useful for identifying unusual observations. Usually, observations with $|Z| > 2$, that is, more than 2 standard deviations below or above the mean, are considered unusual. While we introduce Z-scores in the context of the normal distribution, note that they are defined for distributions of any type. Any distribution with a finite mean and standard deviation allows us to calculate a Z-score for an observation, whatever distribution the random variable follows.
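As a rough illustration, the Z-score formula and the $|Z| > 2$ rule of thumb can be written as a small R helper. The function name `z_score` and the sample scores are hypothetical, made up for this sketch; the mean and standard deviation are the SAT values used earlier.

```r
# Z-score: number of standard deviations an observation lies from the mean.
z_score <- function(x, mu, sigma) (x - mu) / sigma

# Hypothetical SAT scores, standardized against N(mean = 1500, SD = 300).
scores <- c(1500, 1800, 2200, 850, 1350)
z <- z_score(scores, mu = 1500, sigma = 300)

# Flag observations more than 2 standard deviations from the mean.
unusual <- abs(z) > 2
data.frame(scores, z = round(z, 2), unusual)
```

Here only 2200 and 850 are flagged, and the score equal to the mean (1500) gets a Z-score of 0.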

### Percentiles

A percentile is the percentage of observations that fall below a given data point. Graphically, it is the area under the probability distribution curve to the left of that observation.

<img src="https://cdn.rawgit.com/rogergranada/MOOCs/f06fcb47/Coursera/Duke%20University/Probability-intro/Week%204/images/percentile.svg" width="400" align="center"/>

In R, the function `pnorm` gives the percentile of an observation, given the mean and the standard deviation of the distribution. For example, `pnorm` of -1, for a distribution with mean 0 and standard deviation 1, is about 0.1587. We can obtain the same probability using a [web applet](https://gallery.shinyapps.io/dist_calc/), which does not require access to R.

```r
pnorm(-1, mean=0, sd=1)
```
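To connect the function back to the picture, one can check that this percentile really is the area under the density curve to the left of the observation. The sketch below numerically integrates `dnorm` with base R's `integrate`; the variable name `area` is ours.

```r
# Area under the standard normal density to the left of -1.
area <- integrate(dnorm, lower = -Inf, upper = -1, mean = 0, sd = 1)
area$value

# The same quantity from pnorm; the two agree up to numerical error.
pnorm(-1, mean = 0, sd = 1)
```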

**Practice**:

1) *SAT scores are distributed normally with mean 1,500 and standard deviation 300. Pam earned an 1,800 on her SAT. What is Pam's percentile score?*

The first thing to do is to always draw the curve, mark the mean, and shade the area of interest. Here we have a normal distribution with mean 1,500, and to find the percentile score associated with an SAT score of 1,800, we shade the area under the curve below 1,800.

<img src="https://cdn.rawgit.com/rogergranada/MOOCs/f06fcb47/Coursera/Duke%20University/Probability-intro/Week%204/images/practice_pam.svg" width="400" align="center"/>

We can do this in R with the `pnorm` function. The first argument is the observation of interest, the second is the mean, and the third is the standard deviation. The function returns a percentile of 0.8413, meaning that Pam scored better than 84.13% of the SAT takers.

```r
pnorm(1800, mean=1500, sd=300)
```

If we wanted to find the area above the observation instead, we would simply take the complement of this value, since the total area under the curve is always 1. So Pam scored worse than 1 - 0.8413 = 0.1587, or 15.87%, of the test takers.

$P(\text{score} > 1800) = 1 - 0.8413 = 0.1587$<br>
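In R, the upper-tail area can also be requested directly with the `lower.tail = FALSE` argument of `pnorm`, rather than subtracting from 1. A short sketch, with variable names of our choosing:

```r
# Area below and above Pam's score of 1800.
below <- pnorm(1800, mean = 1500, sd = 300)
above <- pnorm(1800, mean = 1500, sd = 300, lower.tail = FALSE)

above            # about 0.1587
below + above    # the two areas sum to 1
```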

2) *A friend of yours tells you that she scored in the top 10% on the SAT. What is the lowest possible score she could have gotten?*

Remember, SAT scores are normally distributed with mean 1,500 and standard deviation 300. We're looking for the cutoff value for the top 10% of the distribution. This is a different problem from the one we worked on earlier, because this time we don't know the value of the observation of interest. But we do know, or at least we can get, its percentile score. Since the total area under the curve is 1, the percentile score associated with the cutoff value for the top 10% is

$P = 1 - 0.10 = 0.90$<br>

If we know the mean, the standard deviation, and the Z-score, we can solve for the unknown observation. Using a standard normal probability table, we can find the Z-score associated with the 90th percentile: we locate 0.90 in the body of the table and read the Z-score off its edges. The table does not contain exactly 0.9, but the closest value is 0.8997, and reading off the edges of the table gives a Z-score of 1.28.

We know that this Z-score of 1.28 equals the unknown observation minus the mean, divided by the standard deviation. Calling the unknown observation `X`:

$Z = \frac{\text{observation} - \mu}{\sigma} $<br>
$1.28 = \frac{X - 1500}{300} \ \ \ \rightarrow \ \ \ X = (1.28 \times 300) + 1500 = 1884$

Thus, if you scored above 1,884, you know that you're in the top 10% of the distribution. We could also do this in R, this time with the `qnorm` function: `pnorm` is for probabilities, and `qnorm` is for quantiles, or cutoff values. `qnorm` takes the percentile as its first argument and the mean and the standard deviation as its second and third, just like the function we saw earlier.

```r
qnorm(0.9, mean=1500, sd=300)
```
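The table-based calculation and the direct `qnorm` call are two routes to the same number. The sketch below follows the manual route in R: get the Z-score for the 90th percentile from the standard normal, then rescale with $X = \mu + Z \times \sigma$. The variable names are ours.

```r
# Z-score for the 90th percentile of the standard normal.
z <- qnorm(0.9)            # about 1.2816; the table value 1.28 is rounded

# Convert the Z-score back to the SAT scale: X = mu + Z * sigma.
cutoff <- 1500 + z * 300
cutoff                     # about 1884.5

# Same result as passing the mean and SD to qnorm directly.
all.equal(cutoff, qnorm(0.9, mean = 1500, sd = 300))
```

The small difference from the hand calculation (1884) comes from rounding the Z-score to two decimals in the table.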

### Questions

1) SAT scores are distributed nearly normally with mean 1500 and standard deviation 300. According to the 68-95-99.7% rule, which of the following is false?  
&#9744; Roughly 68% of students score between 1200 and 1800 on the SAT.  
&#9744; Roughly 95% of students score between 900 and 2100 on the SAT.  
&#9744; Roughly 99.7% of students score between 600 and 2400 on the SAT.  
&#9745; No students can score below 600 on the SAT.  
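To see why the last statement is false, we can compute the share of scores expected below 600, which is 3 standard deviations below the mean. A quick sketch under the stated normal model:

```r
# Proportion of SAT scores expected below 600 under N(mean = 1500, SD = 300).
p_below_600 <- pnorm(600, mean = 1500, sd = 300)
p_below_600
# about 0.00135
```

The proportion is small but not zero, so scores below 600 are rare rather than impossible.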

2) Scores on a standardized test are nearly normally distributed with a mean of 100 and a standard deviation of 20. If these scores are converted to standard normal Z scores, which of the following statements will be correct?  
&#9745; The mean will be 0, and the median should be roughly 0 as well.  
&#9744; The mean will equal 0, but the median cannot be determined.  
&#9744; The mean of the standardized Z scores will equal 100.  
&#9744; The mean of the standardized Z scores will equal 5.  
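A small simulation can illustrate the correct answer. This is only a sketch: the sample size and the seed are arbitrary choices of ours, and simulated values will be close to, not exactly, the theoretical ones.

```r
set.seed(1)  # arbitrary seed, for reproducibility only

# Simulated test scores from N(mean = 100, SD = 20), then standardized.
scores <- rnorm(10000, mean = 100, sd = 20)
z <- (scores - 100) / 20

# Both should be close to 0, since the distribution is symmetric.
mean(z)
median(z)
```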

3) ACT scores are distributed nearly normally with mean 21 and standard deviation 5. Jim scored a 24 on his ACT. Which of the following is true?  
&#9744; Jim's Z score is -0.6  
$Z = \frac{\text{observation} - \mu}{\sigma} = \frac{24 - 21}{5} = 0.6$<br>
&#9745; Jim scored better than approximately 72.57% of ACT takers.  
&#9744; 72.57% of ACT takers scored better than Jim.  
&#9744; Jim's percentile score is 60%.  

```r
pnorm(24, mean=21, sd=5)
```

4) ACT scores are distributed nearly normally with mean 21 and standard deviation 5. A friend of yours tells you that she scored in the bottom 10% on the ACT. What is the highest possible score she could have gotten? Choose the closest answer.  
&#9745; 14.6  
&#9744; 27.4  
&#9744; 12.75  
&#9744; 29.25  

*Solving*:  
The Z score cut-off for the bottom 10% is Z = -1.28, therefore:

$Z = \frac{\text{observation} - \mu}{\sigma} $<br>
$-1.28 = \frac{X - 21}{5} \ \ \ \rightarrow \ \ \ X = (-1.28 \times 5) + 21 = 14.6$

```r
qnorm(0.1, mean=21, sd=5)
```
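Another way to check the answer is to work in the opposite direction: compute the percentile of each answer choice and see which lies closest to 0.10. The vector name `choices` is ours.

```r
# Answer choices from the question.
choices <- c(14.6, 27.4, 12.75, 29.25)

# Percentile of each choice under N(mean = 21, SD = 5).
percentiles <- pnorm(choices, mean = 21, sd = 5)
round(percentiles, 4)

# Choice whose percentile is closest to the bottom 10% cutoff.
closest <- choices[which.min(abs(percentiles - 0.10))]
closest
# 14.6
```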