![purple-divider](https://user-images.githubusercontent.com/7065401/52071927-c1cd7100-2562-11e9-908a-dde91ba14e59.png)
**$$\large\textbf{Probability Distributions of Discrete Random Variables}$$**
$$\large-\text{ Computation and Visualization }-$$

![green-divider](https://user-images.githubusercontent.com/7065401/52071924-c003ad80-2562-11e9-8297-1c6595f8a7ff.png)
## **Random Variables and Probability Distributions**



>Random variables and probability distributions are two of the most important concepts in statistics. A random variable assigns a numerical value to each outcome of a random experiment, that is, a process that generates uncertain outcomes. A probability distribution assigns a probability to each possible value of a random variable.

>**Discrete Random Variables** take on a countable number of values. An example is the number of heads when flipping a coin three times.

>The **probability distribution** is often described using a **probability mass function (PMF)**. It gives the probability that the random variable takes on any specific value.
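
As a small illustration of these definitions, the sketch below enumerates the three-coin-flip example and builds its PMF by counting. It assumes a fair coin, which the text above does not state, and the names `outcomes` and `coin_pmf` are chosen here for illustration only.

```python
from collections import Counter
from itertools import product

# Enumerate all 2**3 equally likely outcomes of three fair coin flips.
outcomes = list(product("HT", repeat=3))

# The random variable X maps each outcome to its number of heads.
heads_counts = Counter(outcome.count("H") for outcome in outcomes)

# The PMF assigns a probability to each value that X can take.
coin_pmf = {k: count / len(outcomes) for k, count in sorted(heads_counts.items())}

for k, prob in coin_pmf.items():
    print(f"P(X = {k}) = {prob}")
```

The four probabilities add up to 1, as they must for any probability distribution.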

![green-divider](https://user-images.githubusercontent.com/7065401/52071924-c003ad80-2562-11e9-8297-1c6595f8a7ff.png)
## **Discrete Random Variables and Definitions of Terms for Discrete Probability Distributions**

> ### **1. Probability Mass, Cumulative Probability, and Percent Point Functions**

>>#### **$\color{red}{\textbf{pmf:}}$ Probability Mass Function** gives the height of a single bar:
* $f(x\color{red}{=}k)$

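>>#### **$\color{red}{\textbf{cdf:}}$ Cumulative Distribution Function** gives the sum of bar heights up to and including a value, $f(x \color{red}{\le} k)$. Differences of two such sums give the sum of bar heights between two end points.
These four cases describe the different scenarios where we include some endpoints but not others: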
* $f(k_1 \color{red}{\le} x \color{red}{\le} k_2)$
* $f(k_1 < x \color{red}{\le} k_2)$
* $f(k_1 \color{red}{\le} x < k_2)$
* $f(k_1 < x < k_2)$

>>#### **$\color{red}{\textbf{ ppf:}}$ Percent Point Function** is the inverse of ```cdf```. Given a percentile rank, it gives the corresponding value of the random variable.
* $k = f^{-1}(p)$, where $p$ is the percentile rank written as a probability between 0 and 1

<details>
  <summary><font size = 4.5px><b>Visualize pdf, cdf, and ppf</b></font></summary>
<img src = "https://drive.google.com/uc?id=1PIjz9XFU1uLGZAsbLSl2rMXuQ3qC4dx1" />
</details>
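
The three functions above can be tried out on a small binomial example. This is a minimal sketch: the values `n = 10`, `p = 0.5`, `k1 = 3`, and `k2 = 6` are assumptions picked for illustration, and the helper `cdf` is a local shorthand rather than part of the lab. Each of the four endpoint cases is written as a difference of two cumulative probabilities.

```python
from scipy import stats

# Hypothetical example values, not taken from the lab problems
n, p = 10, 0.5
k1, k2 = 3, 6


def cdf(k):
    """Local shorthand for P(X <= k) with X ~ B(n, p)."""
    return stats.binom.cdf(k, n, p)


# pmf: height of the single bar at k1, P(X = k1)
single_bar = stats.binom.pmf(k1, n, p)

# cdf differences: sum of bar heights between two endpoints
both_included = cdf(k2) - cdf(k1 - 1)        # P(k1 <= X <= k2)
left_excluded = cdf(k2) - cdf(k1)            # P(k1 <  X <= k2)
right_excluded = cdf(k2 - 1) - cdf(k1 - 1)   # P(k1 <= X <  k2)
both_excluded = cdf(k2 - 1) - cdf(k1)        # P(k1 <  X <  k2)

# ppf: smallest k whose cumulative probability reaches 0.5
median_k = stats.binom.ppf(0.5, n, p)

print(single_bar, both_included, left_excluded, right_excluded, both_excluded, median_k)
```

Because the variable is discrete, excluding an endpoint simply shifts the corresponding `cdf` argument by one.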


![green-divider](https://user-images.githubusercontent.com/7065401/52071924-c003ad80-2562-11e9-8297-1c6595f8a7ff.png)
## **Python Syntax Structure for Probability Distributions**


###$\large\hspace{20mm}$ **`scipy.stats.`<font color = "orange">distribution</font>.$\color{green}{\text{method}}( \color{magenta}{\text{parameters}})$**

### **<font color = "orange">distributions</font>:**

|<font size = 3 px>Distribution</font>|<font size = 3 px>Binomial</font>|<font size = 3 px>Uniform</font>|<font size = 3 px>Normal</font>|<font size = 3 px>Student's t</font>|<font size = 3 px>Chi-Square $\chi^2$</font>|<font size = 3 px>f</font>|
|:--:|--|--|--|--|--|--|
|<font size = 3 px>**Python syntax** </font>|<font size = 5 px>`binom`</font>|<font size = 5 px>`uniform`</font>|<font size = 5 px>`norm`</font>|<font size = 5 px>`t`</font>|<font size = 5 px>`chi2`</font>|<font size = 5 px>`f`</font>|

### **<font color = "green">methods:</font>**

|<font size = 3 px>statistics<br>term</font>|<font size = 3 px>probability <br> mass function</font>|<font size = 3 px>probability<br> density function</font>|<font size = 3 px>cumulative <br> probability <br> function</font>|<font size = 3 px>percent <br> point function</font>|<font size = 3 px>summary<br>statistic</font>|<font size = 3 px>mean</font>|<font size = 3 px>median</font>|<font size = 3 px>standard <br> deviation</font>|<font size = 3 px>confidence<br> interval</font>|<font size = 3 px>random<br> variables</font>|
|:--:|--|--|--|--|--|--|--|--|--|--|
|<font size = 3 px>**syntax** </font>|<font size = 5 px>`pmf`</font>|<font size = 5 px>`pdf`</font>|<font size = 5 px>`cdf`</font>|<font size = 5 px>`ppf`</font>|<font size = 5 px>`stats`</font>|<font size = 5 px>`mean`</font>|<font size = 5 px>`median`</font>|<font size = 5 px>`std`</font>|<font size = 5 px>`interval`</font>|<font size = 5 px>`rvs`</font>|

### **<font color = "magenta">parameters</font>:**

|<font size = 3 px>Statistics term</font>|<font size = 3 px>number of trials <br> (sample size)</font>|<font size = 3 px>value of <br>binomial random variable</font>|<font size = 3 px>probability of success <br> of each binomial trial</font>|<font size = 3 px>percentile</font>|<font size = 3 px>mean</font>|<font size = 3 px>standard deviation</font>|<font size = 3 px>degrees of freedom</font>|
|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|
|<font size = 3 px>**Python syntax** </font>|<font size = 5 px>`n`</font>|<font size = 5 px>`k`</font>|<font size = 5 px>`p`</font>|<font size = 5 px>`q`</font>|<font size = 5 px>`loc`</font>|<font size = 5 px>`scale`</font>|<font size = 5 px>`df`</font>|
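
To make the `scipy.stats.distribution.method(parameters)` pattern concrete, here is a short sketch that calls one method on several of the distributions listed above. The numeric arguments are illustrative assumptions, not values from the lab.

```python
from scipy import stats

# Binomial: P(X = 3) for n = 10 trials with success probability p = 0.5
binom_prob = stats.binom.pmf(3, n=10, p=0.5)

# Normal: P(Z <= 1.96) with mean loc = 0 and standard deviation scale = 1
norm_area = stats.norm.cdf(1.96, loc=0, scale=1)

# Student's t: value below which 97.5% of the distribution lies, with df = 10
t_value = stats.t.ppf(0.975, df=10)

# Uniform on [loc, loc + scale] = [0, 1]: density at x = 0.5
uniform_height = stats.uniform.pdf(0.5, loc=0, scale=1)

print(f"binom.pmf   -> {binom_prob:.4f}")
print(f"norm.cdf    -> {norm_area:.4f}")
print(f"t.ppf       -> {t_value:.4f}")
print(f"uniform.pdf -> {uniform_height:.4f}")
```

Discrete distributions such as `binom` provide `pmf`, while continuous ones such as `norm`, `t`, and `uniform` provide `pdf` instead.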

![green-divider](https://user-images.githubusercontent.com/7065401/52071924-c003ad80-2562-11e9-8297-1c6595f8a7ff.png)
## **Binomial Random Variable:**


> ###  **1. Mathematical Notation of Probability Mass Function:** $X \sim B(n,p)$
<details>
  <summary><b>Show explanation of parameters of the binomial distribution</b></summary>
$B: \text{Binomial Probability Mass Distribution Function}\\
X: \text{ a discrete random variable} $

<b>The Parameters:</b>\
$n: \text{number of trials} \\
p: \text{probability of a success on each trial}$
</details>
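
For reference, the probability mass function of $X \sim B(n,p)$ is commonly written as $P(X = k) = \binom{n}{k} p^k (1-p)^{n-k}$. The sketch below evaluates that formula directly with the standard library. The function name `binomial_pmf` is made up for this illustration, and the values `n = 20`, `p = 0.41`, and `k = 12` are borrowed from Lab Work 1.

```python
from math import comb


def binomial_pmf(k, n, p):
    """Hypothetical helper: P(X = k) for X ~ B(n, p), straight from the formula."""
    # comb(n, k) counts the ways to place k successes among n trials;
    # p**k * (1 - p)**(n - k) is the probability of any one such arrangement.
    return comb(n, k) * p**k * (1 - p) ** (n - k)


prob = binomial_pmf(12, 20, 0.41)
print(f"P(X = 12) = {prob:.4f}")
```

The result should agree with what `scipy.stats.binom.pmf(12, 20, 0.41)` reports in the lab.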


>### **2. Python syntax of the ```scipy.stats``` module for Binomial Distributions**

#### $\hspace{20mm}$ **Syntax Structure:** $\hspace{20mm}$ `scipy.stats.binom`.**$\color{green}{\text{method}}( \color{magenta}{\text{parameters}})$**

|Methods and parameters|Output|
|--|--|
|```rvs(n, p, loc=0, size=1, random_state=None)```|Random variates|
|```pmf(k, n, p, loc=0)```|Probability mass function|
|```cdf(k, n, p, loc=0)```|Cumulative distribution function|
|```ppf(q, n, p, loc=0)```|Percent point function (inverse of cdf — percentiles)|
|```stats(n, p, loc=0, moments='mv')```|Mean ('m'), variance ('v'), skew ('s'), and/or kurtosis ('k')|
|```median(n, p, loc=0)```|Median of the distribution.|
|```mean(n, p, loc=0)```|Mean of the distribution.|
|```var(n, p, loc=0)```|Variance of the distribution.|
|```std(n, p, loc=0)```|Standard deviation of the distribution.|
|```interval(cl, n, p, loc=0)```|Endpoints of the equal-tailed range that contains the fraction `cl` (for example, 0.95) of the distribution|
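
A quick way to get familiar with these methods is to call several of them on one distribution. The sketch below assumes `n = 20` and `p = 0.41` (the values from Lab Work 1); the 95% level and the seed `0` are chosen only for illustration.

```python
from scipy import stats

n, p = 20, 0.41  # values from Lab Work 1

# Summary statistics of the distribution itself (no data involved)
mean, var = stats.binom.stats(n, p, moments="mv")
median = stats.binom.median(n, p)
std = stats.binom.std(n, p)

# Equal-tailed range that contains 95% of the distribution
low, high = stats.binom.interval(0.95, n, p)

# Five simulated draws; random_state fixes the seed so reruns match
sample = stats.binom.rvs(n, p, size=5, random_state=0)

print(f"mean={float(mean):.2f}, var={float(var):.3f}, std={std:.3f}, median={median:.0f}")
print(f"95% of the distribution lies between {low:.0f} and {high:.0f}")
print(f"simulated draws: {sample}")
```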



![green-divider](https://user-images.githubusercontent.com/7065401/52071924-c003ad80-2562-11e9-8297-1c6595f8a7ff.png)
##🔖 **$\color{blue}{\textbf{Lab Work 1:}}$ Practice using `cdf`, `pmf`, `ppf` with binomial distributions**

<details>
  <summary><b>Show problem </b></summary>
It has been stated that about 41% of adult workers have a high school diploma but do not pursue any further education.  If 20 adult workers are randomly selected, find the probability that at most 12 of them have a high school diploma but do not pursue any further education. How many adult workers do you expect to have a high school diploma but to not pursue any further education?  
</details>

### **1.** Find the probability that $\color{blue}{\textbf{exactly}}$ 12 of the 20 randomly selected adult workers have a high school diploma but do not pursue any further education.

In [4]:
import scipy.stats as stats

# Because it says "exactly", we will use `pmf`
status = "exactly"
n = 20 # number of (binomial) trials
p = 0.41 # probability of "success" (adult worker has a high school diploma but does not pursue any further education)
k = 12 # value of random variable / number of success cases

probability = stats.binom.pmf(
    k, # number of success
    n, # sample size
    p  # probability
)
print(f"The probability that {status} {k} people did not pursue higher ed after high school is {probability:.4f} or {probability:.2%}")

# Note: ppf is the inverse of the cumulative probability (cdf), not of this single-value pmf

The probability that exactly 12 people did not pursue higher ed after high school is 0.0417 or 4.17%


### **2.** Find the probability that $\color{green}{\textbf{at most}}$ 12 of them have a high school diploma but do not pursue any further education.

In [6]:
# What does "at most" mean? From zero up to, and including, 12
status = "at most"
n = 20 # number of (binomial) trials
p = 0.41 # probability of "success" (adult worker has a high school diploma but does not pursue any further education)
k = 12 # value of random variable / number of success cases

# Use the cumulative distribution function (cdf) because we need the probability of twelve or fewer people
# cdf adds up the pmf values for k = 0, 1, 2, ..., 12
probability = stats.binom.cdf(
    k, # number of success
    n, # sample size
    p  # probability
)
print(f"The probability that {status} {k} people did not pursue higher ed after high school is {probability:.4f} or {probability:.2%}")

The probability that at most 12 people did not pursue higher ed after high school is 0.9738 or 97.38%


### 📓 $\color{red}{\text{Teacher Note:} }$
The code below is another way to calculate the probability of at most 12.  This uses a for loop to add up the individual probabilities.

In [9]:
probability = 0
# The cdf method does this summation for us
# A separate loop variable keeps k at 12 for the print statement below
for i in range(0, k + 1):
    probability += stats.binom.pmf(i, n, p)

print(f"The probability that {status} {k} people did not pursue higher ed after high school is {probability:.4f} or {probability:.2%}")

The probability that at most 12 people did not pursue higher ed after high school is 0.9738 or 97.38%


### **3.** Find the probability that $\color{orange}{\textbf{more than}}$ 12 of them have a high school diploma but do not pursue any further education.

In [10]:
# Use the complement definition of probability
status = "more than"
n = 20 # number of (binomial) trials
p = 0.41 # probability of "success" (adult worker has a high school diploma but does not pursue any further education)
k = 12 # value of random variable / number of success cases

# "More than 12" covers k = 13, 14, 15, ..., 20
# cdf(12) adds up the pmf values for k = 0 to 12, so the complement rule gives
# P(X > 12) = 1 - P(X <= 12) = 1 - cdf(12)
probability = 1 - stats.binom.cdf(
    k, # number of success
    n, # sample size
    p  # probability
)
print(f"The probability that {status} {k} people did not pursue higher ed after high school is {probability:.4f} or {probability:.2%}")

The probability that more than 12 people did not pursue higher ed after high school is 0.0262 or 2.62%
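
SciPy also exposes this complement directly as the survival function `sf`, which is defined as `1 - cdf`. A minimal sketch, reusing the values from the cell above, shows that both routes give the same probability:

```python
from scipy import stats

n, p, k = 20, 0.41, 12

# Complement rule written out by hand: P(X > k) = 1 - P(X <= k)
via_complement = 1 - stats.binom.cdf(k, n, p)

# Survival function: the same quantity in a single call
via_sf = stats.binom.sf(k, n, p)

print(f"1 - cdf: {via_complement:.4f}")
print(f"sf:      {via_sf:.4f}")
```

For very small tail probabilities, `sf` can be more accurate than subtracting from 1.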


### **4.** Find the probability that $\color{orange}{\textbf{less than}}$ 12 of them have a high school diploma but do not pursue any further education.

In [12]:
# "less than" means not including twelve, so we need k = 0, 1, 2, 3, ..., 11
status = "less than"
n = 20 # number of (binomial) trials
p = 0.41 # probability of "success" (adult worker has a high school diploma but does not pursue any further education)
k = 12 # value of random variable / number of success cases

# Calculate cdf up to 12 ("at most 12") and then subtract pmf(k=12) ("exactly 12") to get "less than 12"
# Equivalently, P(X < 12) = P(X <= 11) = cdf(11)
probability = stats.binom.cdf(
    k, # number of success <=
    n, # sample size
    p  # probability
)
probability_pmf_12 = stats.binom.pmf(
    k, # k = 12
    n,
    p
)
final_probability = probability - probability_pmf_12
print(f"The probability that {status} {k} people did not pursue higher ed after high school is {final_probability:.4f} or {final_probability:.2%}")

The probability that less than 12 people did not pursue higher ed after high school is 0.9321 or 93.21%


### **5.** Find the probability that $\color{green}{\textbf{at least}}$ 12 of them have a high school diploma but do not pursue any further education.

In [17]:
status = "at least"

n = 20 # number of (binomial) trials
p = 0.41 # probability of "success" (adult worker has a high school diploma but does not pursue any further education)
k = 12 # value of random variable / number of success cases

probability_cdf_more_than_12 = 1 - stats.binom.cdf(
    k, # number of success <=
    n, # sample size
    p  # probability
)
probability_pmf_12 = stats.binom.pmf(
    k, # k = 12
    n,
    p
)
final_probability = probability_cdf_more_than_12 + probability_pmf_12
print(f"The probability that {status} {k} people did not pursue higher ed after high school is {final_probability:.4f} or {final_probability:.2%}")

The probability that at least 12 people did not pursue higher ed after high school is 0.0679 or 6.79%
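
The five wordings in this lab each map to a single call once the endpoint is shifted by one where needed. The helper below is a hypothetical convenience function (its name and the phrase strings are assumptions made for this sketch, not part of SciPy); it uses `sf`, SciPy's survival function, which equals `1 - cdf`.

```python
from scipy import stats


def binom_probability(phrase, k, n, p):
    """Hypothetical helper: translate a wording into one scipy.stats.binom call."""
    if phrase == "exactly":      # P(X = k)
        return stats.binom.pmf(k, n, p)
    if phrase == "at most":      # P(X <= k)
        return stats.binom.cdf(k, n, p)
    if phrase == "less than":    # P(X < k) = P(X <= k - 1)
        return stats.binom.cdf(k - 1, n, p)
    if phrase == "more than":    # P(X > k)
        return stats.binom.sf(k, n, p)
    if phrase == "at least":     # P(X >= k) = P(X > k - 1)
        return stats.binom.sf(k - 1, n, p)
    raise ValueError(f"unknown phrase: {phrase!r}")


n, p, k = 20, 0.41, 12
phrases = ("exactly", "at most", "less than", "more than", "at least")
results = {phrase: float(binom_probability(phrase, k, n, p)) for phrase in phrases}

for phrase, prob in results.items():
    print(f"P({phrase} {k}) = {prob:.4f}")
```

Note that "less than" and "at most" add up with their opposites: `less than + at least = 1` and `at most + more than = 1`.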


### **6.** Find the 70th percentile (that is, the smallest number of people $k$ for which the probability that at most $k$ of the 20 do not pursue any further education is at least $\color{green}{\textbf{70%}}$).

In [23]:
# Use ppf, which is the inverse of cdf
per = 0.7 # percentile rank written as a probability (70th percentile)
n = 20 # number of (binomial) trials
p = 0.41 # probability of "success" (adult worker has a high school diploma but does not pursue any further education)

# ppf (percent point function) returns the smallest value k of the random variable for which cdf(k) is at least `per`

percentile = stats.binom.ppf(
    per, # percent probability for the percentile
    n, # sample size
    p  # probability
)

print(f"The {per * 100:.0f}th percentile is {percentile:.0f} people who did not pursue higher ed after high school.")

The 70th percentile is 9 people who did not pursue higher ed after high school.
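
One way to check a percentile from `ppf` is to confirm that it is the first value whose cumulative probability reaches the requested level. The sketch below does that for the 70th percentile of $B(20, 0.41)$; the variable names are chosen for this illustration.

```python
from scipy import stats

n, p, q = 20, 0.41, 0.70

# ppf returns a float, so convert it to an integer count
k_q = int(stats.binom.ppf(q, n, p))

below = stats.binom.cdf(k_q - 1, n, p)  # P(X <= k_q - 1), expected to be below q
at = stats.binom.cdf(k_q, n, p)         # P(X <= k_q), expected to reach q

print(f"ppf({q}) = {k_q}")
print(f"cdf({k_q - 1}) = {below:.4f} < {q} <= cdf({k_q}) = {at:.4f}")
```

Because the distribution is discrete, the cumulative probability at the percentile usually overshoots the requested level rather than matching it exactly.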


![green-divider](https://user-images.githubusercontent.com/7065401/52071924-c003ad80-2562-11e9-8297-1c6595f8a7ff.png)
##🔖 **$\color{blue}{\textbf{Lab Work 2:}}$ Compute the descriptive statistics for a binomial random variable  $X \sim B(n, p)$**

### **1.** Compute the mean and standard deviation of a binomial distribution $X \sim B(20, 0.41)$ using the formulas
> **Mean** = $np$ and **std** = $\sqrt{np(1-p)}$, where $n$ is the number of trials and $p$ is the probability of success in each trial.

In [34]:
import numpy as np

n = 20
p = 0.41
# Keep the exact values: the mean and standard deviation of a distribution do not have to be integers
mean = n * p

std = np.sqrt(n * p * (1 - p))

print(f"""The mean of this binomial distribution of people who did not go to college after high school is equal to {mean:.2f}
Number of people: {n}
Probability of not going to college after high school: {p}
Standard deviation: {std:.2f} people""")

The mean of this binomial distribution of people who did not go to college after high school is equal to 8.20
Number of people: 20
Probability of not going to college after high school: 0.41
Standard deviation: 2.20 people
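
The formulas $np$ and $\sqrt{np(1-p)}$ are shortcuts for the general definitions of the mean, $\sum_k k\,P(X=k)$, and the variance, $\sum_k (k-\mu)^2 P(X=k)$. As a sanity check, the sketch below evaluates those sums over the whole PMF of $B(20, 0.41)$; the variable names are chosen for this illustration.

```python
import numpy as np
from scipy import stats

n, p = 20, 0.41
k = np.arange(0, n + 1)         # every value X can take: 0, 1, ..., 20
pmf = stats.binom.pmf(k, n, p)  # probability of each value

# Mean and variance from their definitions as probability-weighted sums
mean_from_pmf = np.sum(k * pmf)
var_from_pmf = np.sum((k - mean_from_pmf) ** 2 * pmf)
std_from_pmf = np.sqrt(var_from_pmf)

print(f"mean: {mean_from_pmf:.4f}  (formula np = {n * p:.4f})")
print(f"std:  {std_from_pmf:.4f}  (formula sqrt(np(1-p)) = {np.sqrt(n * p * (1 - p)):.4f})")
```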


### **2.** Use the Python plotting library to visualize $B(20, 0.41)$ and its parameters mean and standard deviation.
Refer to "[How to use Binomial Distribution](https://colab.research.google.com/drive/1vGWwdwrR_vWtXMnK-IwRZOc0NtfpXc-M?usp=sharing)" for more options.

In [61]:
import plotly.graph_objects as go

# Create input values of `k` = 0, 1, 2, 3, ..., 20
k = np.arange(0, n + 1) # np.arange excludes the stop value, so n + 1 is needed to include k = n
# Calculate the pmf for each k value and save it in p_k
p_k = stats.binom.pmf(k, n, p)

fig = go.Figure()
fig.add_trace(
    go.Bar(
        x = k,
        y = p_k,
        marker_color="indianred",
        name = "Binomial distribution")
)
fig.update_layout(
    width=800,
    height=500,
    title=f"Binomial Distribution B(n, p) = B({n}, {p})",
    title_x = 0.5,
    yaxis_title = "Probability mass function (pmf)",
    xaxis_title = "Number of success<br>(Number of people who do not pursue higher ed)",
    showlegend=True
)
fig.add_trace(
    # Add a vertical line at the mean plus two standard deviations
    go.Scatter(
        x = (mean + 2 * std, mean + 2 * std),
        y = [0, max(p_k)],
        mode="lines",
        line=dict(
            color="cyan",
            width=4,
            dash="dash"
        ),
        name="mean + 2std"
    )
)
fig.add_trace(
    # Add a vertical line at the mean
    go.Scatter(
        x = (mean, mean),
        y = [0, max(p_k)],
        mode="lines",
        line=dict(
            color="yellow",
            width=4,
            dash="dash"
        ),
        name="mean"
    )
)
fig.add_trace(
    # Add a vertical line at the mean minus two standard deviations
    go.Scatter(
        x = (mean - 2 * std, mean - 2 * std),
        y = [0, max(p_k)],
        mode="lines",
        line=dict(
            color="cyan",
            width=4,
            dash="dash"
        ),
        name="mean - 2std"
    )
)
fig.show()

### **3.** **Create a Binomial Distribution Plotter App $X \sim B(n,p):$** Copy the mean and std calculation below the input parameters of n and p.



In [74]:
# @title Specify `n` and `p` {"run":"auto"}
user_n = 50 # @param {"type":"slider","min":0,"max":75,"step":1}
user_p = 0.8 # @param {"type":"slider","min":0,"max":1,"step":0.01}


mean = user_n * user_p # exact mean, not rounded
std = np.sqrt(user_n * user_p * (1 - user_p)) # exact standard deviation
k = np.arange(0, user_n + 1) # np.arange excludes the stop value, so user_n + 1 is needed to include k = user_n
# Calculate the pmf for each k value and save it in p_k
p_k = stats.binom.pmf(k, user_n, user_p)

fig = go.Figure()
fig.add_trace(
    go.Bar(
        x = k,
        y = p_k,
        marker_color="indianred",
        name = "Binomial distribution")
)
fig.update_layout(
    width=800,
    height=500,
    title=f"Binomial Distribution B(n, p) = B({user_n}, {user_p})",
    title_x = 0.5,
    yaxis_title = "Probability mass function (pmf)",
    xaxis_title = "Number of success<br>(Number of people who do not pursue higher ed)",
    showlegend=True
)
fig.add_trace(
    # Add a vertical line at the mean plus two standard deviations
    go.Scatter(
        x = (mean + 2 * std, mean + 2 * std),
        y = [0, max(p_k)],
        mode="lines",
        line=dict(
            color="cyan",
            width=4,
            dash="dash"
        ),
        name="mean + 2std"
    )
)
fig.add_trace(
    # Add a vertical line at the mean
    go.Scatter(
        x = (mean, mean),
        y = [0, max(p_k)],
        mode="lines",
        line=dict(
            color="yellow",
            width=4,
            dash="dash"
        ),
        name="mean"
    )
)
fig.add_trace(
    # Add a vertical line at the mean minus two standard deviations
    go.Scatter(
        x = (mean - 2 * std, mean - 2 * std),
        y = [0, max(p_k)],
        mode="lines",
        line=dict(
            color="cyan",
            width=4,
            dash="dash"
        ),
        name="mean - 2std"
    )
)
fig.show()

### **4.** Change the number of trials or probability of success to see how the symmetry of the distribution changes. When is the distribution left-skewed, right-skewed, or symmetric?




When the probability is 0.50, the distribution is symmetric about its mean and, for a large enough number of trials, closely resembles a normal (bell-shaped) distribution. Decreasing the value of p below 0.5 gives a right-skewed distribution, while increasing the value of p above 0.5 gives a left-skewed distribution. For a fixed p, increasing the number of trials makes the skew less pronounced.
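
The direction of the skew can also be checked numerically. The sketch below uses the standard skewness formula for a binomial distribution, $(1 - 2p)/\sqrt{np(1-p)}$, which is not stated in this notebook; the function name `binomial_skewness` and the sample values of `n` and `p` are assumptions for illustration.

```python
from math import sqrt


def binomial_skewness(n, p):
    """Hypothetical helper: skewness of B(n, p) from the closed-form formula."""
    return (1 - 2 * p) / sqrt(n * p * (1 - p))


n = 20
skew_by_p = {p: binomial_skewness(n, p) for p in (0.2, 0.5, 0.8)}

for p, skew in skew_by_p.items():
    if skew > 0:
        shape = "right-skewed"
    elif skew < 0:
        shape = "left-skewed"
    else:
        shape = "symmetric"
    print(f"B({n}, {p}): skewness = {skew:+.3f} ({shape})")

# For a fixed p, more trials shrink the skewness toward zero
skew_n20 = binomial_skewness(20, 0.2)
skew_n200 = binomial_skewness(200, 0.2)
print(f"p = 0.2: n = 20 gives {skew_n20:+.3f}, n = 200 gives {skew_n200:+.3f}")
```

A positive value means the longer tail is on the right, and a negative value means it is on the left.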

![green-divider](https://user-images.githubusercontent.com/7065401/52071924-c003ad80-2562-11e9-8297-1c6595f8a7ff.png)
##🔖 **$\color{green}{\textbf{TODO: }}$ Practice using ```cdf, pmf```, and `ppf` with binomial random variables.**

<details>
  <summary><b>Show problem </b></summary>
It has been stated that about 70% of adult workers have a high school diploma but do not pursue any further education.  If 50 adult workers are randomly selected, find the probability that at most 30 of them have a high school diploma but do not pursue any further education. How many adult workers do you expect to have a high school diploma but to not pursue any further education?  
</details>

### **1.** Find the probability that $\color{blue}{\textbf{exactly}}$ 30 of 50 have a high school diploma but do not pursue any further education and present the numerical values and their interpretations in statements either in a text cell or code cell with print command and f-string.

In [78]:
# Because it says "exactly", we will use `pmf`
status = "exactly"
n = 50 # number of (binomial) trials
p = 0.7 # probability of "success" (adult worker has a high school diploma but does not pursue any further education)
k = 30 # value of random variable / number of success cases

probability = stats.binom.pmf(
    k, # number of success
    n, # sample size
    p  # probability
)
print(f"The probability that {status} {k} people did not pursue higher ed after high school is {probability:.4f} or {probability:.2%}")

# Note: ppf is the inverse of the cumulative probability (cdf), not of this single-value pmf

The probability that exactly 30 people did not pursue higher ed after high school is 0.0370 or 3.70%


### **2.** Find the probability that $\color{green}{\textbf{at most}}$ 30 of 50 have a high school diploma but do not pursue any further education and present the numerical values and their interpretations in statements either in a text cell or code cell with print command and f-string.

In [81]:
status = "at most"
n = 50 # number of (binomial) trials
p = 0.7 # probability of "success" (adult worker has a high school diploma but does not pursue any further education)
k = 30

# Use the cumulative distribution function (cdf) because we need the probability of thirty or fewer people
# cdf adds up the pmf values for k = 0, 1, 2, ..., 30
probability = stats.binom.cdf(
    k, # number of success
    n, # sample size
    p  # probability
)
print(f"The probability that {status} {k} people did not pursue higher ed after high school is {probability:.4f} or {probability:.2%}")

The probability that at most 30 people did not pursue higher ed after high school is 0.0848 or 8.48%


### **3.** Find the probability that $\color{orange}{\textbf{more than}}$ 30 of 50 have a high school diploma but do not pursue any further education and present the numerical values and their interpretations in statements either in a text cell or code cell with print command and f-string.

In [82]:
status = "more than"
n = 50 # number of (binomial) trials
p = 0.7 # probability of "success" (adult worker has a high school diploma but does not pursue any further education)
k = 30 # value of random variable / number of success cases

# "More than 30" covers k = 31, 32, 33, ..., 50
# cdf(30) adds up the pmf values for k = 0 to 30, so the complement rule gives
# P(X > 30) = 1 - P(X <= 30) = 1 - cdf(30)
probability = 1 - stats.binom.cdf(
    k, # number of success
    n, # sample size
    p  # probability
)
print(f"The probability that {status} {k} people did not pursue higher ed after high school is {probability:.4f} or {probability:.2%}")

The probability that more than 30 people did not pursue higher ed after high school is 0.9152 or 91.52%


### **4.** Find the probability that $\color{orange}{\textbf{less than}}$ 30 of 50 have a high school diploma but do not pursue any further education and present the numerical values and their interpretations in statements either in a text cell or code cell with print command and f-string.

In [83]:
status = "less than"
n = 50 # number of (binomial) trials
p = 0.7 # probability of "success" (adult worker has a high school diploma but does not pursue any further education)
k = 30 # value of random variable / number of success cases

probability = stats.binom.cdf(
    k, # number of success <=
    n, # sample size
    p  # probability
)
probability_pmf_30 = stats.binom.pmf(
    k, # k = 30
    n,
    p
)
final_probability = probability - probability_pmf_30
print(f"The probability that {status} {k} people did not pursue higher ed after high school is {final_probability:.4f} or {final_probability:.2%}")

The probability that less than 30 people did not pursue higher ed after high school is 0.0478 or 4.78%


### **5.** Find the probability that $\color{green}{\textbf{at least}}$ 30 of 50 have a high school diploma but do not pursue any further education and present the numerical values and their interpretations in statements either in a text cell or code cell with print command and f-string.

In [84]:
status = "at least"

n = 50 # number of (binomial) trials
p = 0.7 # probability of "success" (adult worker has a high school diploma but does not pursue any further education)
k = 30 # value of random variable / number of success cases

probability_cdf_more_than_30 = 1 - stats.binom.cdf(
    k, # number of success <=
    n, # sample size
    p  # probability
)
probability_pmf_30 = stats.binom.pmf(
    k, # k = 30
    n,
    p
)
final_probability = probability_cdf_more_than_30 + probability_pmf_30
print(f"The probability that {status} {k} people did not pursue higher ed after high school is {final_probability:.4f} or {final_probability:.2%}")

The probability that at least 30 people did not pursue higher ed after high school is 0.9522 or 95.22%


### **6.** Find the 90th percentile (that is, the smallest number of people $k$ for which the probability that at most $k$ of the 50 do not pursue any further education is at least $\color{green}{\textbf{90%}}$) and present the numerical values and their interpretations in statements either in a text cell or code cell with print command and f-string.

In [85]:
per = 0.9 # percentile rank written as a probability (90th percentile)
n = 50 # number of (binomial) trials
p = 0.7 # probability of "success" (adult worker has a high school diploma but does not pursue any further education)

# ppf (percent point function) returns the smallest value k of the random variable for which cdf(k) is at least `per`

percentile = stats.binom.ppf(
    per, # percent probability for the percentile
    n, # sample size
    p  # probability
)

print(f"The {per * 100:.0f}th percentile is {percentile:.0f} people who did not pursue higher ed after high school.")

The 90th percentile is 39 people who did not pursue higher ed after high school.


### **7.** Compute the mean and standard deviation of the binomial distribution $B(50, 0.7)$ using the formulas and present the numerical values and their interpretations in statements either in a text cell or code cell with print command and f-string.
> **Mean** = $np$ and **std** = $\sqrt{np(1-p)}$, where $n$ is the number of trials and $p$ is the probability of success in each trial.

In [87]:
n = 50
p = 0.7
# Keep the exact values: the mean and standard deviation of a distribution do not have to be integers
mean = n * p

std = np.sqrt(n * p * (1 - p))

print(f"""The mean of this binomial distribution of people who did not go to college after high school is equal to {mean:.2f}
Number of people: {n}
Probability of not going to college after high school: {p}
Standard deviation: {std:.2f} people""")

The mean of this binomial distribution of people who did not go to college after high school is equal to 35.00
Number of people: 50
Probability of not going to college after high school: 0.7
Standard deviation: 3.24 people


### **8.** Use Python plotting library to visualize $B(50, 0.7)$, its mean, and 2 standard deviations above and below the mean and present the numerical values and their interpretations in statements either in a text cell or code cell with print command and f-string.
Refer to "[How to use Binomial Distribution](https://colab.research.google.com/drive/1vGWwdwrR_vWtXMnK-IwRZOc0NtfpXc-M?usp=sharing)" for more options.

In [88]:
k = np.arange(0, n + 1) # np.arange excludes the stop value, so n + 1 is needed to include k = n
# Calculate the pmf for each k value and save it in p_k
p_k = stats.binom.pmf(k, n, p)

fig = go.Figure()
fig.add_trace(
    go.Bar(
        x = k,
        y = p_k,
        marker_color="indianred",
        name = "Binomial distribution")
)
fig.update_layout(
    width=800,
    height=500,
    title=f"Binomial Distribution B(n, p) = B({n}, {p})",
    title_x = 0.5,
    yaxis_title = "Probability mass function (pmf)",
    xaxis_title = "Number of success<br>(Number of people who do not pursue higher ed)",
    showlegend=True
)
fig.add_trace(
    # Add a vertical line at the mean plus two standard deviations
    go.Scatter(
        x = (mean + 2 * std, mean + 2 * std),
        y = [0, max(p_k)],
        mode="lines",
        line=dict(
            color="cyan",
            width=4,
            dash="dash"
        ),
        name="mean + 2std"
    )
)
fig.add_trace(
    # Add a vertical line at the mean
    go.Scatter(
        x = (mean, mean),
        y = [0, max(p_k)],
        mode="lines",
        line=dict(
            color="yellow",
            width=4,
            dash="dash"
        ),
        name="mean"
    )
)
fig.add_trace(
    # Add a vertical line at the mean minus two standard deviations
    go.Scatter(
        x = (mean - 2 * std, mean - 2 * std),
        y = [0, max(p_k)],
        mode="lines",
        line=dict(
            color="cyan",
            width=4,
            dash="dash"
        ),
        name="mean - 2std"
    )
)
fig.show()

### **9.** Describe your observation of the distribution in comparison to the distribution above.

Like the earlier distribution with `n = 20`, this distribution is roughly bell-shaped, and because `n = 50` allows more possible values, its outline looks smoother. Since `p = 0.7` is above 0.5, it is slightly left-skewed, whereas the earlier distribution (`p = 0.41`) was slightly right-skewed. This distribution is centered at a larger value (a mean of 35 compared with 8.2) and has a wider spread, with a standard deviation of about 3.24 compared with about 2.20 for the earlier distribution.
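
To put numbers on this comparison, the sketch below computes the mean, the standard deviation, and the probability that falls within two standard deviations of the mean (the region between the dashed lines in the plots) for both distributions. It uses only the standard library; the helper names `binom_pmf` and `two_sigma_summary` are hypothetical and exist only for this illustration.

```python
from math import ceil, comb, floor, sqrt


def binom_pmf(k, n, p):
    """Hypothetical helper: P(X = k) for X ~ B(n, p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def two_sigma_summary(n, p):
    """Return (mean, std, probability within mean +/- 2 std) for B(n, p)."""
    mean = n * p
    std = sqrt(n * p * (1 - p))
    # Only whole numbers of successes between the two bounds can occur
    lowest = max(0, ceil(mean - 2 * std))
    highest = min(n, floor(mean + 2 * std))
    inside = sum(binom_pmf(k, n, p) for k in range(lowest, highest + 1))
    return mean, std, inside


results = {(n, p): two_sigma_summary(n, p) for n, p in [(20, 0.41), (50, 0.7)]}

for (n, p), (mean, std, inside) in results.items():
    print(f"B({n}, {p}): mean = {mean:.2f}, std = {std:.2f}, "
          f"P(within 2 std) = {inside:.4f}")
```

A normal distribution places roughly 95% of its probability within two standard deviations of the mean, so values near that figure are consistent with the bell shape seen in both plots.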