# Distributions in Statistics

![](https://storage.googleapis.com/rg-ai-bootcamp/toolkits/discrete-vs-continues-min.png)

_Discrete vs. continuous data (Source: [g2.com](https://www.g2.com/articles/discrete-vs-continuous-data))_

## Distribution Overview

What is a **Distribution** in Statistics? 🤔 A distribution describes the values a variable can take and how often each value occurs. Also known as a **probability distribution**, it shows how probability is spread across the different outcomes. It helps uncover patterns, randomness, and frequency, which are essential in both theory and practice.

Before we begin, let's understand the key notation. **Notation** is a system of symbols representing specific concepts, crucial for clear communication in fields like mathematics and statistics.

In probability, an uppercase letter such as $Y$ denotes a random variable (a quantity whose value is not known until the experiment is run), while the lowercase $y$ denotes one specific value that $Y$ could take. The notation $P(Y=y)$, or $p(y)$ for short, describes the probability that the random variable $Y$ equals the specific value $y$.

For instance, let $Y$ be the number of red marbles drawn from a bag ![](https://storage.googleapis.com/rg-ai-bootcamp/toolkits/marbles-min.png). If $y$ is a specific count like 3 or 5, we might ask:
- What is the probability of drawing 5 red marbles? This can be expressed as $P (Y=5)$ or $p (5)$.

Why do we need to know this? ⚠️ Understanding this is crucial as many decisions in fields like data science are based on understanding probabilities. For example, machine learning models often make predictions based on probability.

Now, we'll proceed with some key definitions in statistical distribution.

### Population vs Sample

In data analysis, we differentiate between:

- **Population data**: All the data.
- **Sample data**: A subset of the population data.

Look at this illustration for further clarification:

![](https://storage.googleapis.com/rg-ai-bootcamp/toolkits/population-vs-sample2-min.png)

For example, if an employer surveys a department's commute methods, that data is the **population** for that department, but is a **sample** for the entire company.

Why is it important? 🤔 This distinction is crucial in statistical distributions as it affects measures of central tendency, spread, and distribution shape. These measures can significantly differ between a population and a sample. The type of data also determines the statistical tests used and how accurately we infer population characteristics.

### Mean vs Variance

Two primary metrics define distributions:

- **Mean** ($\mu$ for population, $\bar{x}$ for sample): The average value of the distribution.
- **Variance** ($\sigma^2$ for population, $s^2$ for sample): Measures data dispersion by quantifying the spread from the mean.

The mean provides a central value, and the variance shows how widely the values are spread around it.

One issue with variance is that it's in squared units, which can be hard to interpret. ⚠️ For example, if time is in seconds, variance would be in seconds squared ($\text{seconds}^2$). To address this, we introduce the Standard Deviation ($\sigma$ for population, $s$ for sample). So, what is **Standard Deviation**? 🤔 The positive square root of variance ($\sqrt{\sigma^2}$). It has the same units as the mean, making it directly interpretable.

The standard deviation reflects the range within which values fall around the mean:

| low standard deviation | high standard deviation |
| --- | --- |
| ![](https://storage.googleapis.com/rg-ai-bootcamp/toolkits/stddev-bell-1-min.png) | ![](https://storage.googleapis.com/rg-ai-bootcamp/toolkits/stddev-bell-2-min.png) |
| If data is clustered towards the center (low standard deviation), more data falls within that range, resulting in a tall bell curve. | If less data falls within the range, the data is more spread out, resulting in a wider bell curve. |
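
To make the population and sample notation concrete, here is a minimal sketch using NumPy. The timing values are made up for illustration. It assumes the usual convention that the population formulas divide by $n$ (`ddof=0`) while the sample formulas divide by $n-1$ (`ddof=1`).

```python
import numpy as np

# Hypothetical task durations in seconds (illustrative values only)
times = np.array([12.0, 15.0, 11.0, 14.0, 13.0])

mean = times.mean()

# Population formulas divide by n (ddof=0, NumPy's default)
pop_var = times.var(ddof=0)      # units: seconds squared
pop_std = times.std(ddof=0)      # units: seconds, same as the mean

# Sample formulas divide by n - 1 (ddof=1)
sample_var = times.var(ddof=1)
sample_std = times.std(ddof=1)

print(f"mean = {mean:.2f} s")
print(f"population: variance = {pop_var:.2f} s^2, std = {pop_std:.3f} s")
print(f"sample:     variance = {sample_var:.2f} s^2, std = {sample_std:.3f} s")
```

Note how the standard deviation comes back in seconds, which is why it is easier to interpret than the variance.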

Why is understanding this relationship important? 🤔 Understanding the relationship between the **mean and variance** is crucial in comprehending the spread of a distribution. By definition, variance is the **expected value** ($E$) of the squared deviation from the mean for all values, represented as:

$$\sigma^2 = E((Y - \mu)^2)$$

Where $Y$ is a random variable, $\mu$ is its mean, and $E$ is the expected value operator.

This can be simplified to:

$$\sigma^2 = E(Y^2) - \mu^2$$

This simpler equation allows for the calculation of variance using just the mean and expected square value of a random variable, which is useful in data analysis and probability distributions.

**For example**:

Let's say we repeatedly drew marbles from a bag that contained a mix of red and blue marbles in 5 attempts, and the number of red marbles we drew each time was `Y = [1, 2, 3, 4, 5]`.

- **Formula 1 $\rightarrow \sigma^2 = E((Y - \mu)^2)$**

  - **Step 1:** Calculate the mean $(\mu)$:

    $$\mu = \frac{1 + 2 + 3 + 4 + 5}{5} = 3$$

    This is the average number of red marbles drawn from the bag across all 5 experiments.

  - **Step 2:** Find the difference between each data point and the mean, $(Y - \mu)$, which gives us: `[-2, -1, 0, 1, 2]`.

    This measures how each individual experimental outcome varies from the average outcome.

  - **Step 3:** Square these differences to give: `[4, 1, 0, 1, 4]`.

    We square the differences to get positive values, and to give more weight to larger differences.

  - **Step 4:** Find the expected value or average of these squared differences to give the variance $(\sigma^2)$:

    $$\sigma^2 = \frac{4 + 1 + 0 + 1 + 4}{5} = 2$$

  The variance of 2 is measured in squared units. Its square root, $\sqrt{2} \approx 1.41$, is the standard deviation: the typical distance between the number of red marbles drawn in an experiment and the mean of 3.

- **Formula 2 $\rightarrow \sigma^2 = E(Y^2) - \mu^2$**

  - **Step 1:** Square each data point to give: `[1, 4, 9, 16, 25]`.

  - **Step 2:** Find the expected value or average of these squares, $E(Y^2)$, which is:

    $$E(Y^2) = \frac{1 + 4 + 9 + 16 + 25}{5} = 11$$

  - **Step 3:** Subtract the square of the mean from this value to give the variance $(\sigma^2)$:

    $$\sigma^2 = E(Y^2) - \mu^2 = 11 - 3^2 = 11 - 9 = 2$$

  Again, our variance is 2, indicating the same level of spread in our experimental outcomes.

In both cases, we get the same result for the variance: 2. The second formula is often convenient for hand calculations because it needs only two averages, $E(Y^2)$ and $\mu$, and no per-value differences. In floating-point code, however, it can lose precision when the mean is large compared with the spread, so the first formula is the safer choice there.
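
The two calculations above can be checked with a short NumPy sketch. It treats the five observed counts as the whole population, as the worked example does, so every average divides by 5.

```python
import numpy as np

# Number of red marbles drawn in each of the 5 experiments
Y = np.array([1, 2, 3, 4, 5], dtype=float)

mu = Y.mean()                                   # mean of the outcomes

# Formula 1: expected value of the squared deviation from the mean
var_formula_1 = np.mean((Y - mu) ** 2)

# Formula 2: expected value of the squares minus the squared mean
var_formula_2 = np.mean(Y ** 2) - mu ** 2

print(f"mean = {mu}")
print(f"formula 1 variance = {var_formula_1}")
print(f"formula 2 variance = {var_formula_2}")
print(f"np.var (population) = {np.var(Y)}")
```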

Remember, **high variance** signifies widely spread values, while **low variance** indicates closer values. Understanding the mean-variance relationship aids in recognizing the stability of a probability distribution.

As we'll see in the next section, knowing a specific distribution, we can derive a more exact formula to outline this relationship.

## Types of Probability Distributions

Probability distributions are primarily classified by the kind of **possible outcomes** they describe:

- **Discrete distributions**: Used when outcomes are **countable** (they can be listed one by one), like rolling a die 🎲.
- **Continuous distributions**: Used when outcomes can take **any value in a range**, like measuring time ⏰ or distance 📏.

Before moving forward, it's vital to familiarize ourselves with the notation used to define these distributions. The notation of a distribution is structured as follows:

![](https://storage.googleapis.com/rg-ai-bootcamp/toolkits/notation-min.png)

It starts with the **variable name** ($X$) for our values, followed by a **tilde sign** ($\sim$). Next, we write a capital letter representing the **distribution type** ($N$) and, within parentheses, we list some dataset characteristics. These typically include the **mean** ($\mu$) and **variance** ($\sigma^2$) but can vary depending on the specific distribution type.

**Discrete Distributions**:

- **Bernoulli Distribution** is used in **binary classification** problems in machine learning, representing a situation with only two outcomes. An example is a model predicting whether an email is spam or not.

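- **Binomial Distribution** counts how many times a binary outcome occurs across a **fixed number of independent trials**. An example is counting how many of 10 incoming emails a model flags as spam. (Problems with more than two possible outcomes per trial, such as predicting 'sunny', 'rainy', or 'cloudy' weather, are described by the multinomial distribution instead.)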

**Continuous Distributions**:

- **Normal Distribution** is a commonly **observed distribution** in many natural phenomena and is often used in machine learning for regression problems. It's widely used due to its unique statistical properties and its ability to accurately describe many natural occurrences. An example is a model predicting the height of trees, where most predictions cluster around the mean value with rare outliers.

OK, now let's discuss it in more depth.

## Bernoulli Distribution

The **Bernoulli Distribution**, denoted as $Bern(p)$, is used for experiments with binary outcomes such as **coin tosses** or **yes/no questions**. The variable $p$ represents the probability of success.

A Bernoulli Distribution is suitable for single trial events with two possible outcomes: **success** (usually coded '1') or **failure** (coded '0'). Examples of such events include flipping a coin (heads or tails), answering a single true-or-false question (true or false), and voting in a two-party election (Democratic or Republican).

![](https://storage.googleapis.com/rg-ai-bootcamp/toolkits/event-with-min.png)

The graph of a Bernoulli Distribution consists of two bars representing the two possible outcomes.

![](https://storage.googleapis.com/rg-ai-bootcamp/toolkits/graph-bd-min.png)

One bar reaches up to the associated probability $p$, and the other one attains a height of $1-p$, because the total probability must be equal to 1.

In the Bernoulli distribution, "1" is typically assigned to the outcome considered to be "success" and "0" to "failure".

For instance, in the context of a coin toss 🪙:

- If Heads = 1 and Tails = 0, then the expected value $E(X)$ of this random variable is $p$, the probability of "heads": $E(X) = p$.
- Conversely, if Heads = 0 and Tails = 1, the expected value $E(X)$ becomes $1-p$, the probability of "tails": $E(X) = 1-p$.

This shows that the assignment of 0 and 1 affects the expected value of the distribution. By convention, 1 is assigned to the outcome we are interested in (the "success"), which is often the more likely one ($p > 1-p$).

### Expected Values

In a Bernoulli Distribution, the expected value is calculated as follows:

$$E(X) = 1 \cdot p + 0 \cdot (1-p) = p$$

This means that the expected value equals the probability $p$ of the preferred outcome.

Let's consider a fair coin toss where the preferred outcome is getting a 'Head'. If we denote 'Head' as 1 and 'Tail' as 0, and because the coin is fair, the probability $p$ of getting a 'Head' is 0.5.

We can substitute $p$ with 0.5 to compute the expected value:

$$E(X) = 1 \cdot 0.5 + 0 \cdot (1-0.5) = 0.5$$

This results in an expected value of 0.5, which signifies the probability of getting a 'Head' when tossing the coin. Essentially, if you toss the coin repeatedly, you could expect 'Heads' to appear about half the time.

In a Bernoulli Distribution, where we only have one trial and a preferred outcome, the expected value primarily indicates the likelihood of the preferred outcome occurring.
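
The claim that heads should appear about half the time can be illustrated with a small simulation. This is only a sketch: the seed and the number of tosses are arbitrary choices, and a different seed will give a slightly different sample mean.

```python
import numpy as np

rng = np.random.default_rng(seed=0)   # fixed seed so the run is repeatable
p = 0.5                               # probability of heads for a fair coin

# Each toss is 1 (heads) with probability p, otherwise 0 (tails)
tosses = (rng.random(100_000) < p).astype(int)

# The average of many 0/1 outcomes estimates E(X) = p
sample_mean = tosses.mean()
print(f"theoretical E(X) = {p}, simulated mean = {sample_mean:.4f}")
```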

### Variance

In a discrete probability distribution, the variance is computed using the formula:

$$\sigma^2 = \sum_i (x_i - \mu)^2 \cdot P(x_i)$$

A Bernoulli Distribution has only two outcomes, **success** ($x_1 = 1$) and **failure** ($x_0 = 0$), and its expected value or mean $\mu$ equals the probability of success $p$. With these values, the formula simplifies to:

$$\sigma^2 = p(1-p)$$

You can verify this yourself by substituting 1 and 0 for $x_i$: the sum becomes $(1-p)^2 \cdot p + (0-p)^2 \cdot (1-p)$, which factors to $p(1-p)\,[(1-p) + p] = p(1-p)$. For a fair coin, $p = 0.5$ and $\mu = 0.5$, so both formulas give $0.25$.

Therefore, regardless of the expected value, the variance of a Bernoulli distribution is given by $p(1−p)$. This simple form is due to the specific properties of the Bernoulli distribution, with only two outcomes and a mean equal to the probability of success.
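
As a quick numerical check of this simplification, the sketch below evaluates the general discrete variance formula for a 0/1 variable and compares it with $p(1-p)$. The helper name `discrete_variance` and the chosen values of $p$ are illustrative.

```python
def discrete_variance(values, probs):
    """Variance of a discrete distribution: sum of (x_i - mu)^2 * P(x_i)."""
    mu = sum(x * q for x, q in zip(values, probs))
    return sum((x - mu) ** 2 * q for x, q in zip(values, probs))


for p in (0.1, 0.5, 0.7):
    # Outcomes 0 (failure) and 1 (success) with probabilities 1 - p and p
    general = discrete_variance([0, 1], [1 - p, p])
    shortcut = p * (1 - p)
    print(f"p = {p}: general formula = {general:.4f}, p(1-p) = {shortcut:.4f}")
```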

The standard deviation is the square root of the variance, so we also have:

- **Variance**: $\sigma^2 = p(1-p)$
- **Standard Deviation**: $\sigma = \sqrt{p(1-p)}$

Even though we can calculate variance and standard deviation using these formulas, they add little information beyond $p$ itself in a Bernoulli Distribution context, as the whole distribution is determined by that single parameter.

### Example John Bakery

John, a baker, has a special recipe for a popular type of bread. However, due to variations in ingredients and baking conditions, not every loaf turns out perfectly. John has observed that 70% of the time, the bread turns out just right.

$$P(Perfect🍞)=0.7$$

- Perfect bread 🍞 = 1
- Imperfect bread 🍞 = 0

$$E(X)=p=0.7$$

In this scenario, we assign the outcome of a perfect loaf to be 1, and $p$ to be 0.7. Therefore, the expected value is $p$, or 0.7. The variance can be calculated using the formula $\sigma^2 = p(1-p)$, which gives us a value of $0.7 \times 0.3 = 0.21$.

In summary, John can expect 70% of his bread loaves to be perfect, with a variance of **0.21**. The variance is not a percentage: it is in squared units of the 0/1 outcome. Its square root, the standard deviation $\sqrt{0.21} \approx 0.46$, measures how much a single loaf's outcome (0 or 1) typically differs from the expected value of 0.7.

In [3]:
# @title #### John Bakery Chart

import matplotlib.pyplot as plt
import numpy as np
import ipywidgets as widgets

def plot_bernoulli(p=0.7):
    variance = p * (1 - p)

    outcomes = np.array([0, 1])

    probabilities = np.array([1-p, p])

    plt.bar(outcomes, probabilities, color=['orange', 'green'])

    plt.xticks(outcomes, ['Imperfect Bread (0)', 'Perfect Bread (1)'])

    plt.ylabel('Probability')
    plt.xlabel('Outcomes')
    plt.title('Bernoulli Distribution of Bread Quality')

    plt.text(0.5, 0.5, 'Variance: {:.2f}'.format(variance), ha='center', va='center',
             bbox=dict(facecolor='white', alpha=0.5, boxstyle="round,pad=0.5"))

    plt.show()

widgets.interact(plot_bernoulli, p=(0, 1, 0.01));

In this context, the variance of **0.21** serves several purposes for John:

- **Measuring Consistency**: Variance displays how consistently John bakes bread. Lower variance signifies more consistent results, whereas higher variance indicates more variability.
- **Predicting Future Outcomes**: It can also aid John in forecasting future outcomes. For instance, when estimating how many perfect loaves he could bake in a month, this variance helps predict expected variations.
- **Strategic Decision-Making**: Knowledge of variance assists John in deciding whether to modify his bread-baking recipe or process. A high variance could indicate excessive uncertainty in the baking process, hinting at a potential need for increased consistency.

## Binomial Distribution

A Binomial Distribution describes the number of successes in a series of independent, identical Bernoulli events. It is denoted as $B(n,p)$, where:

- $n$ represents the number of trials
- $p$ indicates the probability of success in each trial

For example, $X \sim B(10, 0.6)$ means "the variable X follows a Binomial Distribution with 10 trials and a success probability of 0.6 in each individual trial".

A Bernoulli Distribution can be seen as a Binomial Distribution with only one trial, i.e., $Bern(p) = B(1,p)$.
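
This equivalence can be checked numerically with `scipy.stats`. The sketch below compares the probabilities that `bernoulli` and a one-trial `binom` assign to the outcomes 0 and 1. The value $p = 0.7$ is just an illustrative choice.

```python
import numpy as np
from scipy.stats import bernoulli, binom

p = 0.7                         # illustrative success probability (an assumption)
outcomes = np.array([0, 1])     # failure and success

bern_probs = bernoulli.pmf(outcomes, p)    # P(X = 0), P(X = 1) under Bern(p)
binom_probs = binom.pmf(outcomes, 1, p)    # the same outcomes under B(1, p)

print("Bern(p):", bern_probs)
print("B(1, p):", binom_probs)
print("identical:", np.allclose(bern_probs, binom_probs))
```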

### Bernoulli vs Binomial Distribution

![](https://storage.googleapis.com/rg-ai-bootcamp/toolkits/pop-quiz-min.png)

Let's consider a scenario where you're in a class, and there's a surprise pop quiz with 10 true-or-false questions.

![](https://storage.googleapis.com/rg-ai-bootcamp/toolkits/quiz-min.png)

Guessing an answer for a single question is a Bernoulli event (two outcomes: true or false). Guessing answers for all the questions on the quiz is a Binomial event.

The difference lies in the expected values:

- The expected value of a Bernoulli event is the probability $p$ of success in a single trial.
- The expected value of a Binomial event indicates the number of times we expect a specific outcome to occur.

### Graph of Binomial Distribution

The graph of a Binomial Distribution shows the likelihood of achieving a desired outcome a certain number of times. If we execute $n$ trials, our graph would have $n+1$ bars, each bar representing a unique value from 0 to $n$.

![](https://storage.googleapis.com/rg-ai-bootcamp/toolkits/binomial-graph-min.png)

For example, if we're flipping an unfair coin twice, we would need three bars to represent the three possible outcomes:

- Zero tails (both flips are heads)
- One tail (one flip is tails and one flip is heads, in any order)
- Two tails (both flips are tails)

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import binom
from ipywidgets import interact

def plot_binom(n=10, p=0.5):
    # Define the distribution
    dist = binom(n, p)

    # Values
    x = np.arange(n+1)

    # Probabilities
    probs = dist.pmf(x)

    # Create the plot
    plt.bar(x, probs, color='blue')

    # Set the title
    plt.title('Binomial Distribution')

    # Set the labels
    plt.xlabel('Number of Successes')
    plt.ylabel('Probability')

    # Show the plot
    plt.show()

# Create the interactive plot
interact(plot_binom, n=(1, 20), p=(0, 1, 0.01))

### The Probability Function

The probability function of the Binomial Distribution gives the probability of obtaining a certain result exactly $y$ times during $n$ trials. As before, the probability of the desired outcome in a single trial is $p$, and the probability of the alternative outcome is $1-p$.

To calculate the probability of getting our chosen outcome exactly $y$ times during $n$ trials, we also need to consider the alternative outcomes occurring $n−y$ times.

![](https://storage.googleapis.com/rg-ai-bootcamp/toolkits/probability-function-min.png)

The number of ways to achieve our desired outcome is calculated using combinations, denoted as:

$$\binom{n}{y} = C_{n}^{y} = \frac{n!}{y!(n-y)!}$$

Where $!$ denotes factorial (the product of all positive integers up to that number).

For instance, when flipping a coin 3 times, there are 3 ways to get tails exactly twice. We can use the combination formula where $n=3$ (the number of times we flip the coin) and $y=2$ (we want to get tails twice):

$$C_{3}^{2} = \frac{3!}{2!(3-2)!}$$

Let's break down each part of the formula:

- **Factorial n ($3!$):** This is the product of all positive integers up to $n$ (in this case, up to 3). So, $3! = 3 \times 2 \times 1 = 6$.
- **Factorial y ($2!$):** Same as above, namely the product of all positive integers up to $y$ (in this case, up to 2). So, $2! = 2 \times 1 = 2$.
- **Factorial n-y ($1!$):** This is the product of all positive integers up to $n-y$ (in this case, up to 1). So, $1! = 1$.

Substituting these values into the formula:

$$C_{3}^{2} = \frac{6}{2 \times 1} = \frac{6}{2} = 3$$

Here is the visualization:

<table>
  <tr>
    <th>Flip 1</th><th>Flip 2</th><th>Flip 3</th>
  </tr>
  <tr>
    <td>Tails 🪙</td><td>Tails 🪙</td><td>Heads 🪙</td>
  </tr>
  <tr>
    <td>Tails 🪙</td><td>Heads 🪙</td><td>Tails 🪙</td>
  </tr>
  <tr>
    <td>Heads 🪙</td><td>Tails 🪙</td><td>Tails 🪙</td>
  </tr>
</table>

So, there are 3 different ways to get **tails** 2 times by tossing the coin 3 times, as shown in the table. In other words, we validate that our combination formula works correctly in this example.
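
The same count can be reproduced in Python, both with the factorial formula and by listing every possible sequence of three flips. This is a small sketch using only the standard library; the `"H"`/`"T"` labels are an illustrative encoding of heads and tails.

```python
from itertools import product
from math import comb, factorial

n, y = 3, 2   # 3 flips, exactly 2 tails

# Combination formula: n! / (y! * (n - y)!)
by_formula = factorial(n) // (factorial(y) * factorial(n - y))

# math.comb computes the same binomial coefficient directly
by_comb = comb(n, y)

# Brute force: enumerate all 2^3 flip sequences and keep those with 2 tails
sequences = [seq for seq in product("HT", repeat=n) if seq.count("T") == y]

print(by_formula, by_comb, len(sequences))
for seq in sequences:
    print(" ".join(seq))
```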

Next, the **Probability Function** of the Binomial Distribution is given by:

$$p(y) = \binom{n}{y} \cdot p^{y} \cdot (1-p)^{n-y}$$

This equation represents the product of the number of ways to choose $y$ elements from $n$, times $p$ to the power of $y$, times $1-p$ to the power of $n-y$.

In the context of flipping a coin, if we flip a coin 3 times ($n=3$) and want to know the probability of getting tails twice ($y=2$), with the probability of getting tails on one flip being 0.5 ($p=0.5$), we can substitute these values into our formula:

$$p(2) = \binom{3}{2} \cdot (0.5)^{2} \cdot (1-0.5)^{(3-2)} = 3 \cdot (0.5)^{2} \cdot (0.5)^{1} = 3 \cdot 0.25 \cdot 0.5 = 0.375$$

So, the probability of getting tails twice out of three flips is **0.375**, or **37.5%**.
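
The probability function translates almost directly into code. The following is a minimal sketch using only the standard library. The function name `binomial_pmf` is illustrative and not part of any library.

```python
from math import comb


def binomial_pmf(y, n, p):
    """P(Y = y) for Y ~ B(n, p): ways to pick y successes times their probability."""
    return comb(n, y) * p**y * (1 - p) ** (n - y)


# Probability of exactly 2 tails in 3 flips of a fair coin
prob_two_tails = binomial_pmf(2, 3, 0.5)
print(f"p(2) = {prob_two_tails}")

# Sanity check: the probabilities of 0, 1, 2 and 3 tails must add up to 1
total = sum(binomial_pmf(y, 3, 0.5) for y in range(4))
print(f"sum over all outcomes = {total}")
```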

### Example John Bakery

John knows that there's a 60% chance ($p=0.6$) that he will sell more than 120 loaves of bread on a given day, and a 40% chance ($1-p=0.4$) that he will not. He wants to calculate the likelihood that his bread sales will exceed 120 loaves on exactly 3 days of a 5-day work week.

Given:

- Number of successes ($y$): **3**
- Number of trials ($n$): **5**
- Probability of success ($p$): **0.6**

We can use the binomial distribution formula:

$$p(y) = \binom{n}{y} \cdot p^{y} \cdot (1-p)^{n-y}$$

Substituting $n=5$, $y=3$, $p=0.6$ and $1-p=0.4$ yields:

$$p(3) = \binom{5}{3} \cdot 0.6^{3} \cdot 0.4^{2} = 10 \cdot 0.216 \cdot 0.16 = 0.3456$$

So, there's a **34.56%** chance of selling more than 120 loaves precisely three times over a work week.
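
John's calculation can be reproduced with `scipy.stats.binom`. The sketch below also answers a related question that is not asked in the text, namely the chance of at least three good days, as an illustration of the survival function `sf`.

```python
from scipy.stats import binom

n, p = 5, 0.6   # 5 working days, 60% chance of selling more than 120 loaves

# Probability of exactly 3 good days
exactly_three = binom.pmf(3, n, p)

# Probability of at least 3 good days: sf(2) = P(Y > 2) = P(Y >= 3)
at_least_three = binom.sf(2, n, p)

print(f"P(Y = 3)  = {exactly_three:.4f}")
print(f"P(Y >= 3) = {at_least_three:.4f}")
```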

### Expected Values

The expected value is a weighted average of all possible outcomes of a random variable, with the weights being the probabilities of each outcome. For a binomial distribution $Y \sim B(n,p)$, the expected value $E(Y)$ simplifies to $E(Y) = n \cdot p$.

For example, if you flip a fair coin 100 times ($n=100$, $p=0.5$), you can expect heads to come up about **50** times.
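
To see that the weighted-average definition and the shortcut $n \cdot p$ agree, the following sketch computes the expected value both ways for the coin example. It uses only the standard library.

```python
from math import comb

n, p = 100, 0.5   # 100 flips of a fair coin

# Probability of each possible number of heads, from 0 to n
pmf = [comb(n, y) * p**y * (1 - p) ** (n - y) for y in range(n + 1)]

# Weighted average: each outcome y weighted by its probability
expected_value = sum(y * prob for y, prob in enumerate(pmf))

print(f"weighted average = {expected_value:.4f}")
print(f"n * p            = {n * p}")
```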

### Variance and Standard Deviation

Variance measures how spread out a group of numbers is from its average value. For a binomial distribution, the variance can be calculated using the formula $\sigma^2 = n \cdot p \cdot (1-p)$.

Let's relate this to John's Bakery. John works for 5 days in a week $(n=5)$, and the probability of selling more than 120 loaves of bread per day is 0.6 $(p=0.6)$.

Thus, the variance is $5 \cdot 0.6 \cdot 0.4 = 1.2$. This variance indicates how far the number of days with sales above 120 loaves tends to deviate from the number of such days John expects.

The standard deviation, denoted as $\sigma$, is the square root of the variance. So, in this case of John's Bakery, the standard deviation is the square root of 1.2, which is approximately 1.1.
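
These numbers can also be derived from the definition of variance rather than the shortcut formula. The sketch below builds the full binomial distribution for John's week and computes the mean, variance, and standard deviation from it, using only the standard library.

```python
from math import comb, sqrt

n, p = 5, 0.6   # 5 working days, 60% chance of a day with more than 120 loaves

# Probability of 0, 1, ..., 5 such days
pmf = [comb(n, y) * p**y * (1 - p) ** (n - y) for y in range(n + 1)]

# Mean and variance from their definitions
mu = sum(y * q for y, q in enumerate(pmf))
variance = sum((y - mu) ** 2 * q for y, q in enumerate(pmf))
std_dev = sqrt(variance)

print(f"mean     = {mu:.4f}  (n*p       = {n * p})")
print(f"variance = {variance:.4f}  (n*p*(1-p) = {n * p * (1 - p):.4f})")
print(f"std dev  = {std_dev:.4f}")
```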

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import binom
from ipywidgets import interact

def plot_binom(n=5, p=0.6, y=3):
    # Defining the binomial distribution
    dist = binom(n, p)

    # Define the range of successes we want to plot
    x = np.arange(n+1)

    # Calculating the probability for each number of successes
    probs = dist.pmf(x)

    # Create the bar plot
    plt.bar(x, probs, color='blue')

    # Highlight the bar that represents exactly y successes
    plt.bar(y, dist.pmf(y), color='red')

    # Display the probability on the plot
    plt.text(y, dist.pmf(y), 'p = {:.4f}'.format(dist.pmf(y)), ha='center', va='bottom', color='red')

    # Adding labels and title
    plt.title("Binomial Distribution of John Bakery Sales")
    plt.xlabel("Number of Days with Sales Above 120 Loaves")
    plt.ylabel("Probability")

    # Show variance and standard deviation
    plt.text(n, max(probs)/2, 'Variance: {:.2f}\nStandard Deviation: {:.2f}'.format(dist.var(), dist.std()),
             ha='right', va='center', fontsize=10,
             bbox=dict(facecolor='white', alpha=0.5, boxstyle="round,pad=0.5"))

    # Show the plot
    plt.show()

# Create the interactive plot
interact(plot_binom, n=(1, 20), p=(0, 1, 0.01), y=(0, 20, 1))

Hence, even though John expects to sell more than 120 loaves on about 3 days in a week ($n \cdot p = 5 \cdot 0.6 = 3$), the actual number of days typically differs from that expected value by about 1.1 days. Therefore, in a typical workweek, bread sales exceeding 120 loaves might happen on roughly 2 to 4 days.

## Normal Distribution

The Normal Distribution is one of the most commonly found continuous distributions in nature and life. It is denoted as $N(\mu, \sigma^2)$, where $\mu$ signifies the mean, and $\sigma^2$ indicates the variance of the distribution.

> For example, $X \sim N(\mu, \sigma^2)$ is read as "the variable 'X' follows a Normal Distribution with a mean of '$\mu$' and a variance of '$\sigma^2$'".

This distribution appears in many shapes and forms. In most cases, we would know the numerical values of $\mu$ and $\sigma^2$ when dealing with actual data.

### Events following a Normal Distribution

![](https://storage.googleapis.com/rg-ai-bootcamp/toolkits/grown-male-lion-min.png)

A Normal Distribution represents an event with various outcomes, where the majority of the outcomes are centered around the mean. Such events include:

- The weight of a full-grown male lion 🦁 (outcomes: weights between 150 and 250 kilograms)
- The height of human beings
- The daily sales of loaves of bread at John's Bakery (outcomes: daily sales ranging between 40 and 160 loaves)

When working with a Normal Distribution, we usually know the mean and variance, or we can estimate them from previous data.

### Graph of Normal Distribution

The graph of a Normal Distribution is **bell-shaped** 🔔, signifying that the majority of the data is centered around the mean.

![](https://storage.googleapis.com/rg-ai-bootcamp/toolkits/bell-curve-min.png)

Values further away from the mean are less likely to occur. The graph is symmetric with regard to the mean, suggesting that values equally far away in opposing directions are equally likely.

### Expected Value and Variance

In a Normal Distribution, the expected value equals its mean, $\mu$. The variance $\sigma^2$ is usually given when we define the distribution. However, if it isn't, we can deduce it from the expected values of $X$ and $X^2$ using the following formula:

$$\sigma^2 = E(X^2) - [E(X)]^2$$

In this equation, $E(X^2)$ represents the expected value of the squared variable, and $[E(X)]^2$ is the squared expected value of the variable.
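
As a rough illustration of this formula, the sketch below draws simulated values from a normal distribution and estimates the variance as the mean of the squares minus the square of the mean. The parameters (mean 100, standard deviation 20), the seed, and the sample size are illustrative assumptions, and the estimate will only be close to, not exactly equal to, the true variance of 400.

```python
import numpy as np

rng = np.random.default_rng(seed=42)     # fixed seed so the run is repeatable
mu, sigma = 100, 20                      # illustrative parameters (assumptions)

# Simulated draws from N(mu, sigma^2)
x = rng.normal(loc=mu, scale=sigma, size=200_000)

mean_of_squares = np.mean(x ** 2)        # estimate of E(X^2)
square_of_mean = np.mean(x) ** 2         # estimate of [E(X)]^2
variance_estimate = mean_of_squares - square_of_mean

print(f"E(X^2) - [E(X)]^2 = {variance_estimate:.2f}")
print(f"true sigma^2      = {sigma ** 2}")
```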

### 68, 95, 99.7 Rule

One of the significant features of the Normal Distribution is the "68, 95, 99.7" rule, also known as the empirical rule.

![](https://storage.googleapis.com/rg-ai-bootcamp/toolkits/empirical-rule-normal-distribution-graph-min.png)

_Empirical Rule (Source: [inchcalculator.com](https://www.inchcalculator.com/empirical-rule-calculator/))_

This rule states that for any normally distributed event:

- About 68% of all outcomes fall within 1 standard deviation ($\sigma$) of the mean ($\mu$)
- About 95% of outcomes fall within 2 standard deviations
- About 99.7% of outcomes fall within 3 standard deviations

This rule underscores the rarity of outliers in Normal Distributions. It also illustrates how much we can infer about a dataset if we know that it is normally distributed.
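
The three percentages are rounded values that follow from the normal cumulative distribution function. A minimal sketch with `scipy.stats.norm`, using the standard normal (mean 0, standard deviation 1) so that $k$ directly counts standard deviations:

```python
from scipy.stats import norm

coverages = {}
for k in (1, 2, 3):
    # Probability of landing between -k and +k standard deviations of the mean
    coverages[k] = norm.cdf(k) - norm.cdf(-k)
    print(f"within {k} standard deviation(s): {coverages[k]:.4f}")
```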

### Example John Bakery

Assume John owns a bakery where he sells bread, and he tracks his daily sales over several months. He discovers an average daily sale of 100 loaves with a standard deviation of 20 loaves. Assuming the bread sales follow a normal distribution (a sensible assumption given many random factors affecting the sales), we can utilize properties of the normal distribution to make future sales predictions.

Applying the "68, 95, 99.7" rule (also known as the empirical or 3-sigma rule), we can state that:

- 68% of the time, daily sales will range between 80 and 120 loaves (100 ± 20, or mean ± 1 standard deviation).
- 95% of the time, daily sales will fall between 60 and 140 loaves (100 ± 2x20, or mean ± 2 standard deviations).
- 99.7% of the time, daily sales will lie between 40 and 160 loaves (100 ± 3x20, or mean ± 3 standard deviations).

But why do we use a normal distribution and make such predictions? Here's why:

1. **Understanding**: Recognizing the distribution of sales helps John to understand daily sale fluctuations and set his expectations.

2. **Planning**: By understanding the distribution, John can enhance his inventory and operational planning. For instance, he could ensure he always has enough raw materials to bake at least 120 loaves each day, as this meets demand on about 84% of days (the 68% of days with sales between 80 and 120 loaves, plus the roughly 16% of days with sales below 80).

3. **Decision-Making**: If John considers increasing his production, he can examine the distribution and determine how often he might need extra capacity. For example, if contemplating expanding to make 160 loaves daily, he could observe that demand will exceed this capacity on only about 0.15% of days (the 0.3% of days outside the 40 to 160 range is split evenly between very low and very high sales).

4. **Risk Estimation**: The normal distribution is also useful for anticipating risks. For instance, John could use it to calculate the probability that sales will drop below a certain point (say, 50 loaves per day) that could make his operation financially unviable.
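
The kinds of probabilities discussed in this list can be computed from the cumulative distribution function. The sketch below assumes, as the example does, that daily sales follow a normal distribution with mean 100 and standard deviation 20. The thresholds 120, 160, and 50 are the ones mentioned in the list.

```python
from scipy.stats import norm

# Daily sales modelled as N(100, 20^2), as assumed in the example
sales = norm(loc=100, scale=20)

p_at_most_120 = sales.cdf(120)   # demand covered by baking 120 loaves
p_above_160 = sales.sf(160)      # demand exceeding a 160-loaf capacity
p_below_50 = sales.cdf(50)       # sales dropping below 50 loaves

print(f"P(sales <= 120) = {p_at_most_120:.4f}")
print(f"P(sales > 160)  = {p_above_160:.5f}")
print(f"P(sales < 50)   = {p_below_50:.5f}")
```

Because the normal distribution is continuous, $P(X < 50)$ and $P(X \le 50)$ are the same value, so `cdf` serves for both.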

In summary, the normal distribution is a highly valuable tool in statistics and data analysis and can assist in various aspects of business and decision-making.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm
from ipywidgets import interact

def plot_normal(mean=100, std_dev=20):
    # Short aliases for the distribution parameters
    mu = mean
    sigma = std_dev

    # Define the range of values we want to plot
    x = np.linspace(mu - 3*sigma, mu + 3*sigma, 100)

    # Calculate the normal distribution
    y = norm.pdf(x, mu, sigma)

    # Create the plot
    plt.figure(figsize=[10,5])
    plt.plot(x, y, 'b', label='Normal Distribution')

    # Add labels and title
    plt.title("Normal Distribution of Bakery's Daily Sales")
    plt.xlabel("Loaves of Bread Sold Per Day")
    plt.ylabel("Probability Density")

    # Show variance and standard deviation
    plt.text(mu + 1.5*sigma, max(y), 'Variance: {:.2f}\nStandard Deviation: {:.2f}'.format(sigma**2, sigma),
             ha='right', va='top', fontsize=10,
             bbox=dict(facecolor='white', alpha=0.5, boxstyle="round,pad=0.5"))

    # Add vertical lines for 68-95-99.7 rule within the distribution curve
    for i, color in zip([1, 2, 3], ['green', 'orange', 'purple']):
        plt.vlines([mu - i*sigma, mu + i*sigma], 0, norm.pdf(mu + i*sigma, mu, sigma), colors=color, linestyles='dashed')

        # Calculate required probabilities for each i
        prob = norm.cdf(mu + i*sigma, mu, sigma) - norm.cdf(mu - i*sigma, mu, sigma)

        # Show the probability as percentage in the legend
        plt.plot([], [], color=color, linestyle='dashed', label=f'Mean ± {i*sigma} loaves: {prob:.1%} of days')

    # Add vertical line at the mean
    plt.vlines(mu, 0, norm.pdf(mu, mu, sigma), colors='red', linestyles='dashed')
    plt.plot([], [], color='red', linestyle='dashed', label=f'Mean: {mu} loaves')

    # Show the plot with legend outside
    plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left')

    # Show the plot
    plt.tight_layout()
    plt.show()

# Create the interactive plot
interact(plot_normal, mean=(80, 120, 1), std_dev=(10, 30, 1))

Data well beyond three standard deviations from the mean (here, below 40 or above 160 loaves) could be **outliers**: unusual values that stand apart from the rest. Understanding a Normal Distribution allows us to calculate probabilities, predict outcomes, and spot these outliers more easily.