# Probability Distributions

In [None]:
from IPython.display import Markdown
base_path = (
    "https://raw.githubusercontent.com/rezahabibi96/GitBook/refs/heads/main/"
    "books/applied-statistics-with-python/.resources"
)

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.stats import norm
from matplotlib.pyplot import figure

import requests
from PIL import Image
from io import BytesIO

## Probability Distributions

There are many important probability distributions with numerous applications in Engineering, Biology, Medicine, Finance, etc. In this chapter, only a few distributions that are important in Statistics are considered. 

We have already studied simple finite probability distributions in Chapter 3. The table below, for example, reviews the payout distribution for an insurance company.

|              | No Accidents | One Accident  | Two Accidents |
| ------------ | ------------ | ------------- | -----         |
| Payout       | 0            | 5000          | 10000         |
| Probability  | 0.96         | 0.03          | 0.01          |

The above is an example of a **discrete random variable** with a finite set of distinct values, each assigned a specific probability. More generally, a discrete probability distribution can have a **countably infinite** set of outcomes. The only requirements are:

1. $0 \le p(s) \le 1$ for each outcome $s$ in the sample space $S$
2. $\sum_{s\in S} p(s) = 1$
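As a quick check, the two requirements can be verified for the payout table above. This is a minimal sketch in plain Python; the dictionary name `payout_probs` is an illustrative choice and not part of the original notebook.

```python
import math

# Payout (in dollars) -> probability, copied from the insurance table above.
# `payout_probs` is a hypothetical name used only for this illustration.
payout_probs = {0: 0.96, 5000: 0.03, 10000: 0.01}

# Requirement 1: every probability lies between 0 and 1.
all_in_range = all(0 <= p <= 1 for p in payout_probs.values())

# Requirement 2: the probabilities sum to 1 (up to floating-point rounding).
total = sum(payout_probs.values())
sums_to_one = math.isclose(total, 1.0)

print(all_in_range, sums_to_one)
```

The comparison uses `math.isclose` rather than `==` because sums of decimal fractions are not always exact in floating-point arithmetic.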

On the other hand, a **continuous random variable** can take any value within a given range. It is often associated with measurements such as height, weight, time, temperature, or distance. Unlike discrete random variables, continuous ones have an uncountably infinite number of possible outcomes. Any interval of real numbers contains infinitely many values, so the probability of any single specific value is 0. Instead, probabilities are computed as areas under a continuous probability distribution function (often called a probability density function). The two requirements for a continuous probability distribution function are:

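1. $f(x) \ge 0$ for all $x$ (unlike a probability, the value of $f(x)$ itself may exceed 1; for example, a uniform distribution on $[0, 0.5]$ has height 2)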
2. $\int_{-\infty}^{\infty} f(x) dx = 1$

For example, an electric outlet voltage varies slightly around the prescribed value of 120 volts. One assumption might be that it has a **uniform distribution** in the range 119–121 V, i.e., the values are spread evenly over the range of possibilities.

For a general **uniform distribution** on the interval $[a,b]$, the probability distribution function is given by:

$$
f(x) = \begin{cases} 
\frac{1}{b-a} & \text{if } a < x < b \\
0 & \text{if } x \le a \text{ or } x \ge b 
\end{cases}
$$

**Example** (outlet voltage, with $a = 119$ and $b = 121$)

$$
f(x) = \begin{cases} 
\frac{1}{121-119} = \frac{1}{2} & \text{if } 119 < x < 121 \\
0 & \text{if } x \le 119 \text{ or } x \ge 121 
\end{cases}
$$

Note that the area under any probability distribution is always 1.

Total Area = base·height = $(b-a) \cdot \frac{1}{b-a} = 1$

The probability of any particular event (interval) is equal to the area under the probability distribution function over that interval. For example, to find the probability that the voltage is between 119 and 120.5 V, we compute:

$$P(119 < V < 120.5) = \text{base} \cdot \text{height} = (120.5 - 119) \cdot \frac{1}{2} = 0.75$$

Also, note that a single particular value has zero width and therefore no area, so its probability is 0:

$$P(V = 120) = 0$$

We could ask instead for a tight interval around the value of interest:

$$P(119.9 < V < 120.1) = (120.1 - 119.9) \cdot \frac{1}{2} = 0.1$$
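The base-times-height calculations above can be sketched as a small helper. This is only an illustration in plain Python; the function name `uniform_interval_probability` is hypothetical and does not come from the original notebook.

```python
def uniform_interval_probability(lo, hi, a, b):
    """P(lo < X < hi) for X uniform on [a, b], computed as base * height."""
    if b <= a:
        raise ValueError("need a < b")
    # Clip the requested interval to [a, b]; the density is 0 outside it.
    left = max(lo, a)
    right = min(hi, b)
    base = max(right - left, 0.0)
    height = 1.0 / (b - a)
    return base * height


a, b = 119.0, 121.0  # outlet voltage range from the example

p_wide = uniform_interval_probability(119.0, 120.5, a, b)   # about 0.75
p_point = uniform_interval_probability(120.0, 120.0, a, b)  # 0.0: a single value has no area
p_tight = uniform_interval_probability(119.9, 120.1, a, b)  # about 0.1

print(p_wide, p_point, p_tight)
```

The tight-interval result may print with a tiny floating-point error (slightly below 0.1), which is expected.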

## Normal Distribution

### Normal Distribution Model

One of the most important distributions in Statistics is the **normal (bell-shaped)** distribution, which is symmetric and unimodal. IQ scores, SAT scores, heights, baby weights, and many other practical quantities closely follow this shape: most values cluster in a bell around the middle, and more extreme values on either end have low probability. The distribution is completely described by its mean $\mu$ (the center of the bell curve) and its standard deviation $\sigma$ (the spread). The figure below shows a typical distribution of IQ scores with mean 100 and standard deviation 15 (the region within one standard deviation of the mean is shaded red). Generally, we denote the normal distribution as $N(\mu,\sigma)$.

The probability distribution function is given by:

$$f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$

However, its antiderivative cannot be written in closed form using elementary functions, so areas under the curve must be computed numerically. Changing the mean $\mu$ shifts the normal curve horizontally, while $\sigma$ specifies the spread, as illustrated in the figure below. On the left, we show distributions of height for populations of three different countries with means 66 in, 69 in, and 72 in, respectively, but with the same standard deviation of 3 in. On the right, all distributions have the same mean of 69 in, but the standard deviations are 3 in, 6 in, and 9 in, respectively.
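To illustrate the numerical computation of an area, the sketch below integrates the density formula over one standard deviation around the mean of the IQ distribution and compares the result with SciPy's built-in normal CDF. It assumes SciPy's `quad` and `norm`; the helper name `normal_pdf` is an illustrative choice, not part of the original notebook.

```python
import math

from scipy.integrate import quad
from scipy.stats import norm


def normal_pdf(x, mu, sigma):
    """Normal density written out directly from the formula above."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma**2)) / math.sqrt(2 * math.pi * sigma**2)


mu, sigma = 100, 15  # IQ scale used in this chapter

# Area within one standard deviation of the mean, by numerical integration.
area_quad, _abs_err = quad(normal_pdf, mu - sigma, mu + sigma, args=(mu, sigma))

# The same area from SciPy's normal CDF, for comparison.
area_cdf = norm.cdf(mu + sigma, loc=mu, scale=sigma) - norm.cdf(mu - sigma, loc=mu, scale=sigma)

print(round(area_quad, 4), round(area_cdf, 4))
```

Both approaches should agree to many decimal places, giving an area of roughly 0.68.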

The normal distribution with mean $\mu = 0$ and standard deviation $\sigma = 1$ is called the **standard normal distribution**. Its probability distribution function is:

$$f(x) = \frac{1}{\sqrt{2\pi}} e^{-x^2/2}$$

It is used to compute areas under any other normal distribution $N(\mu,\sigma)$, but first the value of interest must be standardized to a Z-score, which follows the standard normal distribution with mean 0 and standard deviation 1. The **Z-score** is the number of standard deviations above or below the mean:

$$z = \frac{x-\mu}{\sigma}$$

For example, someone with an IQ score of 130 on the standard IQ scale with mean $\mu = 100$ and standard deviation $\sigma = 15$ has the standardized score:

$$z = \frac{x-\mu}{\sigma} = \frac{130-100}{15} = 2$$

On the other hand, a raw score of 85 implies:

$$z = \frac{x-\mu}{\sigma} = \frac{85-100}{15} = -1$$

Generally, any score above the mean leads to positive $z$, and any score below the mean produces negative $z$, while the mean score produces $z = \frac{\mu-\mu}{\sigma} = 0$.

Note that you can always subtract the mean and divide by the standard deviation for any distribution, so z-scores can be defined for any type of data or distribution, not just normally distributed data. Z-scores are convenient for comparing variables measured on different scales.
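Standardizing works the same way for any dataset. The sketch below applies it to a small right-skewed sample; the array `waiting_times` and its values are made up for illustration, and using the population standard deviation (NumPy's default `ddof=0`) is an assumption.

```python
import numpy as np

# Hypothetical, clearly non-normal (right-skewed) sample; the values are made up.
waiting_times = np.array([1.0, 2.0, 2.0, 3.0, 4.0, 6.0, 9.0, 21.0])

mean = waiting_times.mean()
std = waiting_times.std()  # population standard deviation (ddof=0); an assumption

# Subtract the mean and divide by the standard deviation, element by element.
z_scores = (waiting_times - mean) / std

# After standardizing, the z-scores have mean 0 and standard deviation 1.
print(np.round(z_scores, 2))
print(abs(z_scores.mean()) < 1e-12, abs(z_scores.std() - 1) < 1e-12)
```

Standardizing shifts and rescales the data but does not change its shape, so the z-scores of a skewed sample remain skewed.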

For example, suppose that in a given year the SAT scores are normally distributed with mean $\mu = 1000$ and standard deviation $\sigma = 250$, and Regents scores are also normally distributed with mean $\mu = 60$ and standard deviation $\sigma = 20$. Jane got 1210 on her SAT and John got 70 on his Regents exam. The scores are on completely different scales, so we cannot compare them directly. However, we can find the corresponding standardized z-scores.

$$z_{\text{Jane}} = \frac{1210-1000}{250} = 0.84$$
$$z_{\text{John}} = \frac{70-60}{20} = 0.5$$

Therefore, relative to the other test takers, Jane did better on her SAT than John did on his Regents exam.
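The comparison can be reproduced with a one-line helper. This is a minimal sketch; the function name `z_score` and the variable names are illustrative and not taken from the original notebook.

```python
def z_score(x, mu, sigma):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mu) / sigma


z_jane = z_score(1210, mu=1000, sigma=250)  # SAT
z_john = z_score(70, mu=60, sigma=20)       # Regents

print(z_jane, z_john)
print("Jane" if z_jane > z_john else "John", "scored higher relative to their peers")
```

Because both results are expressed in standard deviations from their own means, they can be compared directly even though the raw scores cannot.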

### Normal Probability Calculations

Let's now start discussing probability questions for normal distributions.

For example, what is the fraction (percentage) of students who scored **below** John's score? Or equivalently, if we select a random student, what is the probability that their score is below 70? This is given by the left-tail area under the normal probability distribution function shown in the figure below.

I created my own (user-defined) function `plot_normal()` with the `def` keyword; its input parameters are listed in parentheses. The function does not return a value; it plots the area under the normal curve with the given parameters together with the corresponding area under the standard normal curve.
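The left-tail area itself can be computed directly, without plotting. The sketch below assumes SciPy's `norm.cdf` (the `norm` object imported at the top of the chapter); it is not the body of `plot_normal()`, and the variable names are illustrative.

```python
from scipy.stats import norm

mu, sigma = 60, 20  # Regents score distribution
x = 70              # John's score

# Left-tail area P(X < 70) from the normal CDF.
p_below = norm.cdf(x, loc=mu, scale=sigma)

# Same area after standardizing: z = (70 - 60) / 20 = 0.5 on the standard normal.
z = (x - mu) / sigma
p_below_z = norm.cdf(z)

print(round(p_below, 4), round(p_below_z, 4))  # both print 0.6915
```

Working with the original scale (`loc`, `scale`) or with the z-score gives the same area, which is the point of standardization.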

Therefore, 69.15% of students have scores below John's score of 70.

The next common question to ask would be: what is the fraction (percentage) of students who scored **above** John's score (**at least** as good as John's)? Or equivalently, if we select a random student, what is the probability that their score is above John's score of 70? This is given by the right-tail area under the normal probability distribution function shown in the figure below.

The total area under any probability distribution is 1, so the right-tail area equals 1 minus the left-tail area found above:

$$P(X > 70) = 1 - P(X < 70) = 1 - 0.6915 = 0.3085$$
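A minimal sketch of the right-tail calculation, assuming SciPy's `norm`. Its survival function `norm.sf` returns the same right-tail area directly; the variable names are illustrative.

```python
from scipy.stats import norm

mu, sigma = 60, 20  # Regents score distribution
x = 70              # John's score

# Right-tail area as the complement of the left-tail area.
p_above = 1 - norm.cdf(x, loc=mu, scale=sigma)

# Equivalent: the survival function, defined as 1 - CDF.
p_above_sf = norm.sf(x, loc=mu, scale=sigma)

print(round(p_above, 4), round(p_above_sf, 4))  # both print 0.3085
```

So about 30.85% of students scored above John.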

Finally, a school board may decide that students who scored **between** 45 and 59 require extra help. What proportion of students falls within these bounds? Or equivalently, what is the probability that a randomly chosen student falls in these bounds?

This probability is best obtained as a **difference** between the probability of scoring below 59 and the probability of scoring below 45, as shown in the figure below:

$$P(45 < X < 59) = P(X < 59) - P(X < 45)$$
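The difference of two left-tail areas can be sketched as follows, again assuming SciPy's `norm.cdf` and the Regents parameters used above; the variable names are illustrative.

```python
from scipy.stats import norm

mu, sigma = 60, 20   # Regents score distribution
low, high = 45, 59   # bounds set by the school board

# P(45 < X < 59) = P(X < 59) - P(X < 45)
p_below_high = norm.cdf(high, loc=mu, scale=sigma)
p_below_low = norm.cdf(low, loc=mu, scale=sigma)
p_between = p_below_high - p_below_low

print(round(p_between, 4))
```

Under these assumptions, roughly a quarter of students fall within the bounds.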