In [1]:
import numpy as np
import pandas as pd
import math
import scipy.stats as st

## Introduction to Statistical Methods Formulas and Code

### Statistics

- field that focuses on understanding and analyzing data, providing us with a toolkit of methods and techniques to collect, organize, summarize, and interpret data in a meaningful way
- goal is to uncover patterns, relationships, and insights that can help us make informed decisions and draw reliable conclusions
- descriptive statistics: methods that help us summarize and describe data in a concise and informative way
- inferential statistics: methods that help us draw conclusions about a population based on a sample of data from that population

#### Types of variables
1. Qualitative (categorical) variables
    - variables that can be placed into distinct categories, according to some characteristic or attribute
    - nominal variables: variables that have no natural ordering
    - ordinal variables: variables that have a natural ordering
2. Quantitative (numerical) variables
    - variables that are measured on a numeric scale
    - discrete variables: variables that can take only a finite or countable set of values (for example, counts)
    - continuous variables: variables that can take any value in an interval of real numbers
        1. interval variables: variables that have no natural zero point, ratios of values are not meaningful
        2. ratio variables: variables that have a natural zero point, ratios of values are meaningful

#### Measures of central tendency
They are measures that describe the center of a distribution of data
- mean: the average of a set of values
$${\displaystyle {\mu}={\bar {x}}={\frac {1}{n}}\left(\sum_{i=1}^{n}{x_{i}}\right)={\frac {x_{1}+x_{2}+\cdots +x_{n}}{n}}}$$
- median: the middle value of a set of values sorted in increasing order
$$\mathrm{median}(x) = \begin{cases} 
    x_{\frac{n+1}{2}} & \text{if } n \text{ is odd} \\
    \frac{x_{\frac{n}{2}} + x_{\frac{n}{2} + 1}}{2} & \text{if } n \text{ is even}
\end{cases}$$
- mode: the most frequently occurring value in a set of values

The mean, the median, and the mode each answer the question “Where is the center of the data set?” The nature of the data set, as indicated by a relative frequency histogram, determines which one gives the best answer.

| Measure | When to Use |
|---------|-------------|
| Mean    | When data is normally distributed and there are no extreme outliers. |
| Median  | When data has outliers or is skewed, providing a robust central value. |
| Mode    | When identifying the most frequent or common value in categorical data. |

Other results:
$$\text{mean} - \text{mode} \approx 3(\text{mean} - \text{median})$$
$$\text{midrange} = \frac{\min + \max}{2}$$
$$\text{range} = \max - \min$$

##### Skewness

| Distribution | Characteristics | Mean vs Median vs Mode |
|--|--|--|
| Positive | Data skewed towards higher values<br>Majority of observations on the left side<br>Long tail on the right side | Mean > Median > Mode    |
| Negative | Data skewed towards lower values<br>Majority of observations on the right side<br>Long tail on the left side | Mean < Median < Mode     |
| Symmetric | Balanced distribution<br>Observations evenly spread on both sides | Mean ≈ Median ≈ Mode    |

<img src="https://upload.wikimedia.org/wikipedia/commons/c/cc/Relationship_between_mean_and_median_under_different_skewness.png" width="400" style="display: block; margin-left: auto; margin-right: auto; padding-top: 10px; padding-bottom: 10px;">

Bi-modal distributions have two peaks, while multi-modal distributions have more than two peaks.

#### Measures of variability
They are measures that describe the spread of a distribution of data. Together with a measure of center they describe the distribution more completely, and they indicate how well a single value such as the mean represents the entire distribution.
- range: the difference between the maximum and minimum values
- variance: the average squared deviation from the mean
$$\sigma^2 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n}$$
- standard deviation: the square root of the variance
$$\sigma = \sqrt{\frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n}}$$
- coefficient of variation: the ratio of the standard deviation to the mean
$$CV = \frac{\sigma}{\bar{x}}$$

The range, the standard deviation, and the variance each give a quantitative answer to the question “How variable are the data?”

| Measure | When to use |
|--|--|
| Range | you need a quick measure of the spread or dispersion of data and want to know the difference between the highest and lowest values in the dataset.       |
| Variance | you want to quantify the average squared deviation of data points from the mean, providing a measure of how much the data points vary from the mean value. |
| Standard Deviation | you want a measure of the dispersion of data that is easy to interpret and represents the typical distance between each data point and the mean. |
| Coefficient of Variation | you want to compare the relative variability between datasets with different units of measurement, allowing you to assess the variation relative to the mean. |

The formulas above are for a population. For a sample, we write $s^2$ and $s$ and use $n-1$ instead of $n$ in the denominator:
- this makes $s^2$ an unbiased estimate of the population variance
- the adjustment accounts for the one degree of freedom lost by estimating the mean from the same data; dividing by $n$ would tend to underestimate the population variance
- dividing by $n-1$ gives a slightly larger estimate of the variability, and the difference becomes negligible as $n$ grows
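The difference between the population and sample formulas can be checked numerically. The sketch below is only an illustration on a small data set; in NumPy the denominator is controlled by the `ddof` argument (`ddof=0` divides by $n$, `ddof=1` divides by $n-1$).

```python
import numpy as np

# Small illustrative data set
data = np.array([20, 22, 30, 33, 33, 35, 35, 35, 35, 36, 40, 41, 42, 51, 54])
n = len(data)

pop_var = np.var(data, ddof=0)      # divides by n
sample_var = np.var(data, ddof=1)   # divides by n - 1
pop_std = np.std(data, ddof=0)
sample_std = np.std(data, ddof=1)
cv = sample_std / np.mean(data)     # coefficient of variation (unitless)

print(f'Population variance : {pop_var:.3f}, std : {pop_std:.3f}')
print(f'Sample variance     : {sample_var:.3f}, std : {sample_std:.3f}')
print(f'Coefficient of variation : {cv:.3f}')
```

The sample variance is always the larger of the two, by the factor $n/(n-1)$.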

#### 5-point summary
- minimum: the smallest value in the dataset
- first quartile: the value such that 25% of the data falls below
- median: the middle value in the dataset
- third quartile: the value such that 75% of the data falls below
- maximum: the largest value in the dataset

#### z-scores
- z-score: the number of standard deviations a data point is from the mean
- is the number $z$ given by the formula: $$z = \frac{x - \mu}{\sigma}$$

#### Boxplots
- boxplots are a graphical representation of the 5-point summary
- outliers are observations that fall outside the upper and lower fences
<img src="https://i.imgur.com/BgJweoR.png" width="400" style="display: block; margin-left: auto; margin-right: auto; padding-top: 10px; padding-bottom: 10px;">

#### The Empirical Rule
**If** a data set has an approximately bell-shaped relative frequency histogram, then:
- Approximately 68% of the data lie within one standard deviation of the mean, that is, in the interval with endpoints $\bar{x} \pm s$ for samples and with endpoints $\mu \pm \sigma$ for populations
- Approximately 95% of the data lie within two standard deviations of the mean
- Approximately 99.7% of the data lies within three standard deviations of the mean

#### Chebyshev’s Theorem
For any data set, the proportion of observations that lie within $k$ standard deviations of the mean is at least $1 - \frac{1}{k^2}$, where $k$ is any positive number larger than 1.
- at least 75% of the data lie within two standard deviations, 89% within three standard deviations, etc.
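As a rough check of z-scores and Chebyshev's Theorem, the sketch below standardizes a small illustrative data set and compares the observed proportion of values within $k$ standard deviations against the bound $1 - \frac{1}{k^2}$. The population standard deviation (`ddof=0`) is used, treating the data set as the whole population.

```python
import numpy as np

data = np.array([20, 22, 30, 33, 33, 35, 35, 35, 35, 36, 40, 41, 42, 51, 54])

mu = data.mean()
sigma = data.std(ddof=0)          # population standard deviation
z_scores = (data - mu) / sigma    # number of standard deviations from the mean

results = {}
for k in (2, 3):
    within = np.mean(np.abs(z_scores) < k)   # observed proportion within k std
    bound = 1 - 1 / k**2                     # Chebyshev lower bound
    results[k] = (within, bound)
    print(f'k = {k}: observed {within:.3f}, Chebyshev bound {bound:.3f}')
```

The observed proportions are usually well above the bound, because Chebyshev's Theorem has to hold for every possible data set, not only bell-shaped ones.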

In [4]:
data = np.array([20, 22, 30, 33, 33, 35, 35, 35, 35, 36, 40, 41, 42, 51, 54])
print(f'Len : {len(data)}, sorted : {sorted(data)}')
print(f'Mean : {np.mean(data)}')
print(f'Median : {np.median(data)}')
print(f'Mode : {st.mode(data, keepdims=False)}')
q1 = np.quantile(data, 0.25, method='midpoint')
q2 = np.quantile(data, 0.5, method='midpoint')
q3 = np.quantile(data, 0.75, method='midpoint')
iqr = q3 - q1
# Boxplot fences: observations outside [lower_fence, upper_fence] are flagged as outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
print(f'Q1 : {q1}, Median : {q2}, Q3 : {q3}')
print(f'IQR : {iqr}')
print(f'Lower fence : {lower_fence}, Upper fence : {upper_fence}')

Len : 15, sorted : [20, 22, 30, 33, 33, 35, 35, 35, 35, 36, 40, 41, 42, 51, 54]
Mean : 36.13333333333333
Median : 35.0
Mode : ModeResult(mode=35, count=4)
Q1 : 33.0, Median : 35.0, Q3 : 40.5
IQR : 7.5
Lower fence : 21.75, Upper fence : 51.75


### Probability

- A population is any specific collection of objects of interest. A sample is any subset or subcollection of the population, including the case that the sample consists of the whole population, in which case it is termed a census.
- A measurement is a number or attribute computed for each member of a population or of a sample. The measurements of sample elements are collectively called the sample data.
- A parameter is a number that summarizes some aspect of the population as a whole. A statistic is a number computed from the sample data.
- Statistics computed from samples vary randomly from sample to sample. Conclusions made about population parameters are statements of probability.

#### Random experiments
- random experiments are actions that occur by chance, and their outcomes are not predictable
- sample space: the set of all possible outcomes of a random experiment
    - discrete sample space: a sample space with a finite or countably infinite number of outcomes
    - continuous sample space: a sample space whose outcomes form an interval of real numbers (uncountably many outcomes)
- event: a subset of the sample space
- probability: a numerical measure of the likelihood that an event will occur. When all outcomes in the sample space $S$ are equally likely:
$$ P(A) = \frac{\text{number of outcomes in A}}{\text{number of outcomes in S}} = \frac{\text{number of favorable outcomes}}{\text{number of possible outcomes}}$$
- empirical probability: the relative frequency of an event occurring in a series of trials
$$ P_{empirical}(A) = \frac{\text{number of times A occurs}}{\text{number of observations}}$$
- theoretical probability: the probability of an event occurring based on mathematical reasoning
- law of large numbers: as the number of trials increases, the empirical probability of an event will converge to the theoretical probability of that event
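The law of large numbers can be illustrated with a short simulation. This is a sketch only: the fair six-sided die, the event "roll a six", and the seed are choices made for the example.

```python
import numpy as np

rng = np.random.default_rng(seed=0)
rolls = rng.integers(1, 7, size=100_000)   # fair die, values 1..6 (upper bound exclusive)
theoretical = 1 / 6

is_six = rolls == 6
for n in (100, 1_000, 10_000, 100_000):
    empirical = is_six[:n].mean()          # relative frequency over the first n rolls
    print(f'n = {n:>7}: empirical = {empirical:.4f}, theoretical = {theoretical:.4f}')

final_estimate = is_six.mean()
```

The empirical probability fluctuates for small numbers of trials and settles near $1/6$ as the number of trials grows.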

#### Events
- an event is a subset of the sample space of a random experiment
- an event $A$ occurs on a particular trial of a random experiment if the outcome of that trial is in $A$
- complement of an event: the set of all outcomes in the sample space that are not in the event
  - the complement of an event $A$ is denoted by $A^c$
    - $A^c = S - A, A \cup A^c = S, A \cap A^c = \emptyset$
    - $P(A^c) = 1 - P(A)$
- union of two events: the set of all outcomes that are in either event
  - the union of two events $A$ and $B$ is denoted by $A \cup B$
- intersection of two events: the set of all outcomes that are in both events
  - the intersection of two events $A$ and $B$ is denoted by $A \cap B$
- mutually exclusive / disjoint events: events that have no outcomes in common
  - if $A$ and $B$ are mutually exclusive, then $A \cap B = \emptyset$
- independent events: events that have no effect on each other
- addition rule: the probability of the union of two events is equal to the sum of the probabilities of the individual events minus the probability of their intersection
$$ P(A \cup B) = P(A) + P(B) - P(A \cap B)$$
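For a small sample space with equally likely outcomes, the complement and addition rules can be verified by direct enumeration. The sketch below assumes two fair dice and two illustrative events; the helper `prob` is a name introduced only for this example.

```python
from fractions import Fraction
from itertools import product

# Sample space for two fair dice: 36 equally likely outcomes
sample_space = list(product(range(1, 7), repeat=2))

def prob(event):
    """P(event) when every outcome in sample_space is equally likely."""
    return Fraction(len(event), len(sample_space))

A = {o for o in sample_space if sum(o) % 2 == 0}   # the sum is even
B = {o for o in sample_space if sum(o) >= 9}       # the sum is at least 9

p_union_direct = prob(A | B)                       # count outcomes in the union
p_union_rule = prob(A) + prob(B) - prob(A & B)     # addition rule
p_complement = prob(set(sample_space) - A)         # P(A^c)

print(p_union_direct, p_union_rule, p_complement)
```

Subtracting $P(A \cap B)$ is what prevents outcomes in both events from being counted twice.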

#### Probability
- the probability of an outcome $e$ in a sample space $S$ is a number $P$ between $0$ and $1$ that measures the likelihood that $e$ will occur on a single trial of a random experiment. The probability of an event $E$ is the sum of the probabilities of the outcomes in $E$.
- a number assigned to each member of the sample space of a random experiment that satisfies the following axioms:
    1. $0 \leq P(A) \leq 1$
    2. $P(S) = 1$
    3. For two events $A$ and $B$, if $A$ and $B$ are mutually exclusive, then $P(A \cup B) = P(A) + P(B)$

### Conditional probability

The conditional probability of $A$ given $B$, denoted $P(A|B)$, is the probability that event $A$ has occurred in a trial of a random experiment for which it is known that event $B$ has definitely occurred.
For any two events $A$ and $B$ with $P(B) > 0$, the conditional probability of $A$ given $B$ is defined as:
$$ P(A|B) = \frac{P(A \cap B)}{P(B)}$$

Conditional probability relation for three events $A$, $B$, and $C$:
$$ P(A \cap B \cap C) = P(A|B \cap C)P(B \cap C) = P(A|B \cap C)P(B|C)P(C)$$

#### Independent events

- In general $P(A | B)$ differs from $P(A)$, but not always. If $P(A | B) = P(A)$, then $A$ and $B$ are independent events and the occurrence of $B$ has no effect on the likelihood of $A$.
  - $P(A|B) = P(A)$ if and only if $P(A \cap B) = P(A)P(B)$, that is, the probability of $A$ and $B$ occurring together is equal to the product of their individual probabilities
  - if A and B are not independent, then they are dependent and $P(A \cap B) \neq P(A)P(B)$
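Conditional probability and independence can also be checked by enumeration. The following sketch again assumes two fair dice; the events and the helper names `prob`, `cond_prob`, and `independent` are hypothetical and chosen only for illustration.

```python
from fractions import Fraction
from itertools import product

sample_space = list(product(range(1, 7), repeat=2))   # two fair dice

def prob(event):
    return Fraction(len(event), len(sample_space))

def cond_prob(a, b):
    """P(a | b) = P(a and b) / P(b), defined only when P(b) > 0."""
    return prob(a & b) / prob(b)

def independent(a, b):
    return prob(a & b) == prob(a) * prob(b)

A = {o for o in sample_space if o[0] == 6}       # first die shows 6
B = {o for o in sample_space if sum(o) == 7}     # the sum is 7
C = {o for o in sample_space if sum(o) >= 9}     # the sum is at least 9

print(f'P(A) = {prob(A)}, P(A|B) = {cond_prob(A, B)}, P(A|C) = {cond_prob(A, C)}')
print(f'A, B independent: {independent(A, B)}; A, C independent: {independent(A, C)}')
```

Knowing that the sum is 7 says nothing about the first die, so $A$ and $B$ are independent; knowing that the sum is at least 9 makes a 6 on the first die more likely, so $A$ and $C$ are dependent.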



### Random variables
- random variables are variables that take on numerical values based on the outcome of a random experiment
- discrete random variables: random variables that can take only a finite or countable set of values
- continuous random variables: random variables that can take any value in an interval of real numbers

#### Probability distributions of discrete random variables

The probability distribution of a discrete random variable $X$ is a list of each possible value of $X$ together with the probability that $X$ takes that value in one trial of the experiment.
The probabilities in the probability distribution of a discrete random variable $X$ must satisfy the following two conditions:
1. $0 \leq P(X = x) \leq 1$ for each possible value $x$ of $X$
2. $\sum_{\text{all } x} P(X = x) = 1$

Example: the probability distribution of $X$, the sum of two fair dice, is given by:
$$\begin{array}{c|ccccccccccc} x &2 &3 &4 &5 &6 &7 &8 &9 &10 &11 &12 \\ \hline P(x) &\dfrac{1}{36} &\dfrac{2}{36} &\dfrac{3}{36} &\dfrac{4}{36} &\dfrac{5}{36} &\dfrac{6}{36} &\dfrac{5}{36} &\dfrac{4}{36} &\dfrac{3}{36} &\dfrac{2}{36} &\dfrac{1}{36} \\ \end{array}$$

- $P(X \geq 9) = P(X = 9) + P(X = 10) + P(X = 11) + P(X = 12) = \dfrac{10}{36} = \dfrac{5}{18}$
- $P(\text{X is even}) = P(X = 2) + P(X = 4) + P(X = 6) + P(X = 8) + P(X = 10) + P(X = 12) = \dfrac{18}{36} = \dfrac{1}{2}$

#### Mean of a discrete random variable

The mean (expected value / expectation) of a discrete random variable $X$ is the weighted average of the possible values of $X$, where the weights are the probabilities of the values of $X$.
$$ \mu = E(X) = \sum_{\text{all } x} xP(x)$$

- the mean of a discrete random variable is the long-run average value of the variable

#### Variance and standard deviation of a discrete random variable

The variance of a discrete random variable $X$ is the weighted average of the squared deviations of the possible values of $X$ from the mean of $X$, where the weights are the probabilities of the values of $X$.
$$ \sigma^2 = Var(X) = \sum(x - \mu)^2 P(x) = [\sum x^2 P(x)] - \mu^2$$

The standard deviation of a discrete random variable $X$ is the square root of the variance of $X$.
$$ \sigma = \sqrt{Var(X)}$$
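As a worked example, the mean and variance formulas can be applied to the sum of two fair dice. The sketch below builds the probability distribution by enumeration and uses exact fractions, so both forms of the variance formula can be compared exactly.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Probability distribution of X = sum of two fair dice
counts = Counter(a + b for a, b in product(range(1, 7), repeat=2))
pmf = {x: Fraction(c, 36) for x, c in sorted(counts.items())}

mean = sum(x * p for x, p in pmf.items())                         # E(X)
var_def = sum((x - mean) ** 2 * p for x, p in pmf.items())        # definition
var_shortcut = sum(x**2 * p for x, p in pmf.items()) - mean**2    # shortcut form
std = float(var_def) ** 0.5

print(f'mean = {mean}, variance = {var_def}, std = {std:.4f}')
```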

#### Probability distribution of a continuous random variable
With continuous random variables one is concerned not with the event that the variable assumes a single particular value, but with the event that the random variable assumes a value in a particular interval.

The probability distribution of a continuous random variable $X$ is an assignment of probabilities to intervals of decimal numbers using a function $f(x)$, called a density function, in the following way: the probability that $X$ assumes a value in the interval $[a, b]$ is equal to the area of the region that is bounded above by the graph of the equation $y=f(x)$, bounded below by the x-axis, and bounded on the left and right by the vertical lines through $a$ and $b$. The probability density function $f(x)$ must satisfy the following two conditions:
1. $f(x) \geq 0$ for all $x$
2. $\int_{-\infty}^{\infty} f(x) dx = 1$

### Binomial distribution

- The discrete random variable $X$ that counts the number of successes in $n$ identical, independent trials of a procedure that always results in either of two outcomes, `success` or `failure` and in which the probability of success on each trial is the same number $p$, is called the binomial random variable with parameters $n$ and $p$.
    $$~X \sim Bin(n, p)$$
- There is a formula for the probability that the binomial random variable with parameters $n$ and $p$ will take a particular value $x$.
    $$ P(X = x) = \binom{n}{x} p^x (1 - p)^{n - x} = \frac{n!}{x!(n - x)!} p^x (1 - p)^{n - x}$$
- There are special formulas for the mean, variance, and standard deviation of the binomial random variable with parameters $n$ and $p$ that are much simpler than the general formulas that apply to all discrete random variables.
    $$ \mu = E(X) = np$$
    $$ \sigma^2 = Var(X) = np(1 - p)$$
    $$ \sigma = \sqrt{Var(X)} = \sqrt{np(1 - p)}$$
- Cumulative probability distribution tables, when available, facilitate computation of probabilities encountered in typical practical situations.
  - In place of $P(X = x)$, we can use $P(X \leq x) = P(X = 0) + P(X = 1) + \cdots + P(X = x)$
  - Here, $P(X \geq x) = 1 - P(X < x) = 1 - P(X \leq x - 1)$
  - and $P(x) = P(X \leq x) - P(X \leq x - 1)$
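In code, `scipy.stats.binom` can play the role of the cumulative probability tables. The sketch below uses hypothetical parameters $n = 10$ and $p = 0.3$ and compares the formula for $P(X = x)$ with the library values.

```python
import math
import scipy.stats as st

n, p = 10, 0.3   # hypothetical parameters, chosen for illustration
x = 3

pmf_formula = math.comb(n, x) * p**x * (1 - p) ** (n - x)   # P(X = x) from the formula
pmf_scipy = st.binom.pmf(x, n, p)
p_at_most = st.binom.cdf(x, n, p)                           # P(X <= x)
p_at_least = 1 - st.binom.cdf(x - 1, n, p)                  # P(X >= x) = 1 - P(X <= x - 1)

mean = n * p
var = n * p * (1 - p)

print(f'P(X = {x}) = {pmf_formula:.4f}')
print(f'P(X <= {x}) = {p_at_most:.4f}, P(X >= {x}) = {p_at_least:.4f}')
print(f'mean = {mean}, variance = {var:.2f}, std = {math.sqrt(var):.4f}')
```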

### Normal distribution

- The probability distribution corresponding to the density function for the bell curve with parameters $\mu$ and $\sigma$ is called the normal distribution with mean $\mu$ and standard deviation $\sigma$. A continuous random variable whose probabilities are described by the normal distribution with mean $\mu$ and standard deviation $\sigma$ is called a normally distributed random variable, or a normal random variable for short, with mean $\mu$ and standard deviation $\sigma$.
    $$~X \sim N(\mu, \sigma)$$
- The density curve for the normal distribution is symmetric about the mean $\mu$.
- The density curve for the normal distribution with mean $\mu$ and standard deviation $\sigma$ is given by the equation:
    $$ y = f(x) = \frac{1}{\sigma \sqrt{2\pi}} e^{-\frac{1}{2}(\frac{x - \mu}{\sigma})^2}$$
    where $\pi \approx 3.14159$ and $e \approx 2.71828$ 
- The standard normal distribution is the normal distribution with mean $\mu = 0$ and standard deviation $\sigma = 1$; a standard normal random variable is denoted by $Z \sim N(0, 1)$.
    $$ y = f(x) = \frac{1}{\sqrt{2\pi}} e^{-\frac{1}{2}x^2}$$
    - rules that help in computing probabilities for $Z$:
      - $P(Z \leq z) = P(Z < z) + P(Z = z) = P(Z < z)$
      - $P(Z \geq z) = 1 - P(Z < z)$
      - $P(z_1 \leq Z \leq z_2) = P(Z \leq z_2) - P(Z < z_1)$
- If $X$ is a normally distributed random variable with mean $\mu$ and standard deviation $\sigma$, then
    $$P(X \leq a) = P\left(\frac{X - \mu}{\sigma} \leq \frac{a - \mu}{\sigma}\right) = P\left(Z \leq \frac{a - \mu}{\sigma}\right)$$
    $$P(a < X < b) = P\left( \frac{a - \mu}{\sigma} < Z < \frac{b - \mu}{\sigma} \right)$$
    where $Z$ is a standard normal random variable, and $a$ and $b$ are any two real numbers with $a < b$.
    - The new endpoints $\frac{a - \mu}{\sigma}$ and $\frac{b - \mu}{\sigma}$ are called the standard score or z-score of the original endpoints $a$ and $b$.
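The standardization step can be sketched with `scipy.stats.norm`. The values $\mu = 100$, $\sigma = 15$ and the endpoints below are hypothetical; the point is that converting the endpoints to z-scores gives the same probability as working with $X$ directly.

```python
import scipy.stats as st

mu, sigma = 100, 15   # hypothetical normal population
a, b = 85, 115

# Convert the endpoints to z-scores and use the standard normal distribution
z_a = (a - mu) / sigma
z_b = (b - mu) / sigma
p_via_z = st.norm.cdf(z_b) - st.norm.cdf(z_a)

# Same probability computed directly from N(mu, sigma)
p_direct = st.norm.cdf(b, loc=mu, scale=sigma) - st.norm.cdf(a, loc=mu, scale=sigma)

print(f'z-scores: {z_a}, {z_b}')
print(f'P({a} < X < {b}) = {p_via_z:.4f}')
```

Here the endpoints are one standard deviation on either side of the mean, so the result agrees with the 68% figure of the Empirical Rule.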

### Sampling distributions

- The sampling distribution of a statistic is the probability distribution of the statistic when the statistic is computed from samples of the same size from the same population.
- There are formulas that relate the mean and standard deviation of the sample mean to the mean and standard deviation of the population from which the sample is drawn.
- For example, consider the sample mean $\bar{X}$ as a random variable when the sample size is $n$. Its mean is $\mu_{\bar{X}}$ and its standard deviation is $\sigma_{\bar{X}}$. Then:
    $$ \mu_{\bar{X}} = \mu$$
    $$ \sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}}$$ 
    where $\mu$ and $\sigma$ are the mean and standard deviation of the population.
- The shape of the sampling distribution of $\bar{X}$ is approximately normal if the sample size is large enough.
- As $n$ increases, the shape of the sampling distribution of $\bar{X}$ becomes more and more like the shape of the normal distribution. The probabilities on the lower and upper ends shrink, and the probabilities in the middle become larger in relation. 
<img src="https://i.imgur.com/9lItK6A.jpg" width="400" style="display: block; margin-left: auto; margin-right: auto; padding-top: 10px; padding-bottom: 10px;">

#### Central limit theorem
- In general, one may start with any distribution and the sampling distribution of the sample mean will increasingly resemble the bell-shaped normal curve as the sample size increases. This is the content of the Central Limit Theorem.
- For sample sizes of 30 or more, the sampling distribution of the sample mean is approximately normal, regardless of the shape of the population distribution, with mean $\mu_{\bar{X}} = \mu$ and standard deviation $\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}}$. The larger the sample size, the better the approximation.

<img src="https://i.imgur.com/0vzPLFF.jpg" width="600" style="display: block; margin-left: auto; margin-right: auto; padding-top: 10px; padding-bottom: 10px;">

- The importance of CLT is that it allows us to make probability statements about the sample mean, in relation to its value in comparison to the population mean. 
Realize there are two distributions involved: 
1. $X$, the population distribution, mean $\mu$, standard deviation $\sigma$
2. $\bar{X}$, the sampling distribution of the sample mean, mean $\mu_{\bar{X}} = \mu$, standard deviation $\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}}$

- For samples of any size drawn from a normal population, the sampling distribution of the sample mean is normal. For samples of any size drawn from a non-normal population, the sampling distribution of the sample mean is approximately normal if the sample size is 30 or more.
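A simulation gives a feel for these statements. The sketch below assumes a strongly skewed population (exponential with mean 2 and standard deviation 2), draws many samples of size $n = 40$, and compares the mean and standard deviation of the sample means with $\mu$ and $\frac{\sigma}{\sqrt{n}}$. The population, sample size, and seed are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

pop_mean, pop_std = 2.0, 2.0   # an exponential with scale 2 has mean 2 and std 2
n = 40                         # size of each sample
num_samples = 20_000           # number of repeated samples

samples = rng.exponential(scale=pop_mean, size=(num_samples, n))
sample_means = samples.mean(axis=1)   # one sample mean per row

print(f'mean of sample means : {sample_means.mean():.4f} (mu = {pop_mean})')
print(f'std of sample means  : {sample_means.std(ddof=1):.4f} '
      f'(sigma / sqrt(n) = {pop_std / np.sqrt(n):.4f})')
```

A histogram of `sample_means` would look close to a bell curve even though the population itself is far from normal.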

#### Sample proportion
- There are formulas that relate the mean and standard deviation of the sample proportion to the mean and standard deviation of the population from which the sample is drawn.
- the sample proportion $\hat{p}$ is the fraction of the sample that has a certain characteristic, as opposed to the population proportion $p$.
- viewed as a random variable, $\hat{P}$ has a mean $\mu_{\hat{P}}$ and a standard deviation $\sigma_{\hat{P}}$.
- relation to population proportion $p$:
    $$ \mu_{\hat{P}} = p$$
    $$ \sigma_{\hat{P}} = \sqrt{\frac{p(1 - p)}{n}}$$
- CLT applies to sample proportion as well, but the condition is more complex:
  - for large samples, the sample proportion is approximately normally distributed with mean $p$ and standard deviation $\sqrt{\frac{p(1 - p)}{n}}$.
  - to check if sample size is large enough, we need to check if $\left[ p - 3 \sigma_{\hat{P}}, p + 3 \sigma_{\hat{P}} \right]$ lies wholly within the interval $[0, 1]$.
    - since $p$ is unknown, we use $\hat{p}$ instead.
    - since $\sigma_{\hat{P}}$ is unknown, we use $\sqrt{\frac{\hat{p}(1 - \hat{p})}{n}}$ instead.
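The sample-size check for proportions is easy to automate. The function below is a minimal sketch; its name and the example values are hypothetical, and it substitutes $\hat{p}$ for $p$ as described above.

```python
import math

def sample_is_large_enough(p_hat, n):
    """Check whether [p_hat - 3*se, p_hat + 3*se] lies wholly within [0, 1]."""
    se = math.sqrt(p_hat * (1 - p_hat) / n)   # estimated standard deviation of the sample proportion
    lower, upper = p_hat - 3 * se, p_hat + 3 * se
    return 0 <= lower and upper <= 1

print(sample_is_large_enough(0.30, 50))   # interval stays inside [0, 1]
print(sample_is_large_enough(0.05, 50))   # lower endpoint falls below 0
```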

### Estimation

- The goal of estimation is to use sample data to estimate the value of an unknown population parameter.
- Point estimation : use a single value to estimate a population parameter
  - e.g. use sample mean $\bar{x}$ to estimate population mean $\mu$
  - problem: we don't know how reliable the estimate is
- Interval estimation : use an interval of values to estimate a population parameter
  - we use the data to compute $E$, such that $[\bar{x} - E, \bar{x} + E]$ has a certain probability of containing the population parameter $\mu$.
  - we do this in such a way that $95\%$ of all the intervals constructed from sample data will contain the population parameter $\mu$.
  - $E$ is called the margin of error, and the interval is called the $95\%$ confidence interval for $\mu$.

The empirical rule states that you must go about 2 standard deviations in either direction from the mean to capture $95\%$ of the values of $\bar{X}$.

The key idea is that, in sample after sample, $95\%$ of the values of $\bar{X}$ lie in the interval $[\mu - E, \mu + E]$. So if we adjoin to each side of the point estimate $\bar{x}$ a wing of length $E$, $95\%$ of the time the resulting interval will contain the population mean $\mu$.
- the $95\%$ confidence interval is thus $\bar{x} \pm 1.960 \frac{\sigma}{\sqrt{n}}$
  - for a different confidence level, use a different multiplier instead of $1.960$.
  - Here, $1.960$ is the value of $z$ such that $P(-1.960 < Z < 1.960) = 0.95$, and is given by $z_{\alpha/2} = z_{0.025} = 1.960$, where $\alpha = 0.05$

<img src="https://i.imgur.com/u3ME5Qj.png" width="600" style="display: block; margin-left: auto; margin-right: auto; padding-top: 10px; padding-bottom: 10px;">

In selecting the correct formula for construction of a confidence interval for a population mean ask two questions: is the population standard deviation $\sigma$ known or unknown, and is the sample large or small?


#### Large sample $100(1 - \alpha)\%$ confidence interval for $\mu$
- if $\sigma$ is known:
    $$ \bar{x} \pm z_{\alpha/2} \frac{\sigma}{\sqrt{n}}$$
- if $\sigma$ is unknown:
    $$ \bar{x} \pm z_{\alpha/2} \frac{s}{\sqrt{n}}$$

A sample of size $n$ is considered large if $n \geq 30$. When $\sigma$ is known and the population is normal or approximately normal, the first formula is valid for any sample size.
The number $E = z_{\alpha/2} \frac{\sigma}{\sqrt{n}}$ or $E = z_{\alpha/2} \frac{s}{\sqrt{n}}$ is called the margin of error for the estimate $\bar{x}$ of $\mu$.
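A large-sample interval can be computed in a few lines. The summary statistics below ($n = 49$, $\bar{x} = 35$, $s = 14$) are hypothetical; `scipy.stats.norm.ppf` is used to obtain $z_{\alpha/2}$ instead of a table.

```python
import math
import scipy.stats as st

n, x_bar, s = 49, 35.0, 14.0   # hypothetical sample summary (sigma unknown, n >= 30)
confidence = 0.95
alpha = 1 - confidence

z = st.norm.ppf(1 - alpha / 2)   # z_{alpha/2}: cuts off a right tail of area alpha/2
E = z * s / math.sqrt(n)         # margin of error
ci = (x_bar - E, x_bar + E)

print(f'z = {z:.3f}, E = {E:.3f}')
print(f'{confidence:.0%} confidence interval: ({ci[0]:.2f}, {ci[1]:.2f})')
```

Changing `confidence` changes only the multiplier $z_{\alpha/2}$; a higher confidence level gives a wider interval.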

#### Small sample $100(1 - \alpha)\%$ confidence interval for $\mu$

When $\sigma$ is unknown and the sample is small, we use the Student's $t$ distribution instead of the standard normal ($z$) distribution. The $t$ distribution is similar to the $z$ distribution, but it is more spread out. The spread increases as the degrees of freedom decrease. The $t$ distribution is symmetric and bell-shaped, but it has more area in the tails than the $z$ distribution.

- if $\sigma$ is known:
    $$ \bar{x} \pm z_{\alpha/2} \frac{\sigma}{\sqrt{n}}$$
- if $\sigma$ is unknown:
    $$ \bar{x} \pm t_{\alpha/2, n-1} \frac{s}{\sqrt{n}}$$ 
    with the degrees of freedom $df = n - 1$

The population must be normal or approximately normal. 
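The small-sample interval with $\sigma$ unknown can be sketched as follows. The 15 values are an illustrative sample, and the code assumes, without checking, that they come from an approximately normal population. The manual calculation is compared with `scipy.stats.t.interval`.

```python
import numpy as np
import scipy.stats as st

data = np.array([20, 22, 30, 33, 33, 35, 35, 35, 35, 36, 40, 41, 42, 51, 54])
n = len(data)
x_bar = data.mean()
s = data.std(ddof=1)             # sample standard deviation
alpha = 0.05

t_crit = st.t.ppf(1 - alpha / 2, df=n - 1)   # t_{alpha/2} with n - 1 degrees of freedom
E = t_crit * s / np.sqrt(n)                  # margin of error
ci = (x_bar - E, x_bar + E)

# Same interval from SciPy, as a cross-check
ci_scipy = st.t.interval(0.95, n - 1, loc=x_bar, scale=s / np.sqrt(n))

print(f't critical value: {t_crit:.3f} (compare z = {st.norm.ppf(1 - alpha / 2):.3f})')
print(f'95% confidence interval: ({ci[0]:.2f}, {ci[1]:.2f})')
```

The $t$ critical value is larger than the corresponding $z$ value, which widens the interval to reflect the extra uncertainty from estimating $\sigma$ by $s$.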

#### Large sample estimation for population proportion $p$

- $100(1 - \alpha)\%$ confidence interval for $p$:
    $$ \hat{p} \pm z_{\alpha/2} \sqrt{\frac{\hat{p}(1 - \hat{p})}{n}}$$

#### Finding the minimum sample size

We have a population with mean $\mu$ and standard deviation $\sigma$. We want to estimate the population mean $\mu$ to within $E$ with $100(1 - \alpha)\%$ confidence. What is the minimum sample size $n$ required?
$$ n = \left( \frac{z_{\alpha/2} \sigma}{E} \right)^2 \text{rounded up}$$

For population proportion $p$, we have:
$$ n = \left( \frac{z_{\alpha/2} \sqrt{p(1 - p)}}{E} \right)^2 \text{rounded up}$$
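Both sample-size formulas translate directly into code. The function names and the example inputs below are hypothetical; `math.ceil` performs the rounding up.

```python
import math
import scipy.stats as st

def min_n_for_mean(sigma, E, confidence=0.95):
    """Smallest n that estimates a mean to within E at the given confidence level."""
    z = st.norm.ppf(1 - (1 - confidence) / 2)
    return math.ceil((z * sigma / E) ** 2)

def min_n_for_proportion(p, E, confidence=0.95):
    """Smallest n that estimates a proportion to within E, given a planning value p."""
    z = st.norm.ppf(1 - (1 - confidence) / 2)
    return math.ceil(z**2 * p * (1 - p) / E**2)

print(min_n_for_mean(sigma=15, E=2))            # 217
print(min_n_for_proportion(p=0.5, E=0.03))      # 1068
```

Since $p(1 - p)$ is largest at $p = 0.5$, using $p = 0.5$ as the planning value gives the most conservative (largest) sample size.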

### Hypothesis testing

- The null hypothesis $H_0$ is a statement about the value of a population parameter that is assumed to be true until there is convincing evidence to the contrary (status quo).
- The alternative hypothesis $H_a$ is a statement that is accepted if the sample data provide sufficient evidence that the null hypothesis is false.
- Hypothesis testing is a statistical procedure that uses sample data to decide between two competing claims (hypotheses) about a population parameter.
- Two conclusions are possible:
  - Reject the null hypothesis $H_0$ in favor of the alternative hypothesis $H_a$.
  - Do not reject the null hypothesis $H_0$.

#### Logic of hypothesis testing

The test procedure is based on the initial assumption that $H_0$ is true.

The criterion for judging between $H_0$ and $H_a$ based on the sample data is: if the value of $\bar{X}$ would be highly unlikely to occur if $H_0$ were true, but favors the truth of $H_a$, then we reject $H_0$ in favor of $H_a$. Otherwise, we do not reject $H_0$.

Supposing for now that $\bar{X}$ follows a normal distribution, when the null hypothesis is true, the density function for the sample mean $\bar{X}$ must be a bell curve centered at $\mu_0$. Thus, if $H_0$ is true, then $\bar{X}$ is likely to take a value near $\mu_0$ and is unlikely to take values far away. Our decision procedure, therefore, reduces simply to:

- If $H_a$ has the form $H_a: \mu < \mu_0$, then reject $H_0$ if $\bar{x}$ is far to the left of $\mu_0$ (rejection region is $(-\infty, C]$, left-tailed test)
- If $H_a$ has the form $H_a: \mu > \mu_0$, then reject $H_0$ if $\bar{x}$ is far to the right of $\mu_0$ (rejection region is $[C, \infty)$, right-tailed test)
- If $H_a$ has the form $H_a: \mu \neq \mu_0$, then reject $H_0$ if $\bar{x}$ is far away from $\mu_0$ in either direction (rejection region is $(-\infty, C] \cup [C', \infty)$, two-tailed test)

Rejection region is therefore the set of values of $\bar{x}$ that are far away from $\mu_0$ in the direction indicated by $H_a$. The critical value or critical values of a test of hypotheses are the number or numbers that determine the rejection region.

Procedure for selecting $C$:
- define a rare event : an event is rare if it has a probability of occurring that is less than or equal to $\alpha$. (say $\alpha = 0.01$)
- then critical value $C$ is the value of $\bar{x}$ that cuts off a tail of area $\alpha$ in the appropriate direction.
  - when the rejection region is in two tails, the critical values are the values of $\bar{x}$ that cut off a tail of area $\alpha/2$ in each direction.

<img src="https://i.imgur.com/4xBkgrW.png" width="600" style="display: block; margin-left: auto; margin-right: auto; padding-top: 10px; padding-bottom: 10px;">

For example, $z_{0.005} = 2.58$ is the critical value for a test of $H_0: \mu = 100$ against $H_a: \mu \neq 100$ at the $\alpha = 0.01$ level of significance. The critical value will be $$ C = 100 \pm 2.58 \cdot \sigma_{\bar{x}} = 100 \pm 2.58 \cdot \frac{\sigma}{\sqrt{n}}$$
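The two-tailed example can be turned into a small decision rule. In the sketch below, $\sigma = 15$ and $n = 36$ are hypothetical values added to make the critical values concrete, and `reject_h0` is a helper name introduced only for this example.

```python
import math
import scipy.stats as st

mu_0 = 100            # value of the mean under H0
alpha = 0.01          # level of significance
sigma, n = 15, 36     # hypothetical population std and sample size

se = sigma / math.sqrt(n)                 # standard deviation of the sample mean
z_crit = st.norm.ppf(1 - alpha / 2)       # z_{alpha/2}, about 2.58 for alpha = 0.01
lower = mu_0 - z_crit * se                # left critical value
upper = mu_0 + z_crit * se                # right critical value

def reject_h0(x_bar):
    """Two-tailed test: reject H0 when x_bar falls in either tail of the rejection region."""
    return x_bar <= lower or x_bar >= upper

print(f'critical values: {lower:.2f} and {upper:.2f}')
print(reject_h0(104.0), reject_h0(107.0))
```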



### Comparing two populations

Consider two populations:
1. Population 1 with mean $\mu_1$ and standard deviation $\sigma_1$, sampling we get a sample of size $n_1$ with sample mean $\bar{x}_1$ and sample standard deviation $s_1$
2. Population 2 with mean $\mu_2$ and standard deviation $\sigma_2$, sampling we get a sample of size $n_2$ with sample mean $\bar{x}_2$ and sample standard deviation $s_2$

Our goal is to compare the two populations, by estimating the difference between the two population means $\mu_1 - \mu_2$, using the samples.

Samples from two distinct populations are independent if each one is drawn without reference to the other, and has no connection with the other.

#### 100(1 - $\alpha$)% confidence interval for $\mu_1 - \mu_2$ for large samples
- A point estimate for the difference in two population means is simply the difference in the corresponding sample means.
- A confidence interval for the difference in two population means is computed using a formula in the same fashion as was done for a single population mean.
$$ (\bar{x}_1 - \bar{x}_2) \pm z_{\alpha/2} \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$$

#### Hypothesis testing for $\mu_1 - \mu_2$ for large samples
The same five-step procedure used to test hypotheses concerning a single population mean is used to test hypotheses concerning the difference between two population means. The only difference is in the formula for the standardized test statistic.

$$ H_0: \mu_1 - \mu_2 = D_0$$
$$ H_a: \mu_1 - \mu_2 < D_0 \text{ or } \mu_1 - \mu_2 > D_0 \text{ or } \mu_1 - \mu_2 \neq D_0$$

Standardized test statistic:
$$ Z = \frac{(\bar{x}_1 - \bar{x}_2) - D_0}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}$$
The samples must be independent and the population distributions must be normal or the sample sizes must be large.

#### 100(1 - $\alpha$)% confidence interval for $\mu_1 - \mu_2$ for small samples
$$ (\bar{x}_1 - \bar{x}_2) \pm t_{\alpha/2} \sqrt{s_p^2 \left( \frac{1}{n_1} + \frac{1}{n_2} \right)}$$
where $s_p^2$ is the pooled sample variance, defined as
$$ s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}$$
and the number of degrees of freedom is $n_1 + n_2 - 2$.

#### Hypothesis testing for $\mu_1 - \mu_2$ for small samples
$$ T = \frac{(\bar{x}_1 - \bar{x}_2) - D_0}{\sqrt{s_p^2 \left( \frac{1}{n_1} + \frac{1}{n_2} \right)}}$$
The test statistic has Student's t distribution with $n_1 + n_2 - 2$ degrees of freedom.
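The pooled procedure can be sketched on two small made-up samples. The data are hypothetical, and the code assumes independent samples from normal populations with equal variances. The test statistic divides by the pooled standard error $\sqrt{s_p^2 \left( \frac{1}{n_1} + \frac{1}{n_2} \right)}$, the same quantity used in the confidence interval, and is compared with `scipy.stats.ttest_ind` for the case $D_0 = 0$.

```python
import numpy as np
import scipy.stats as st

# Hypothetical small samples
sample_1 = np.array([12.1, 11.8, 13.0, 12.6, 11.5, 12.9])
sample_2 = np.array([11.2, 11.9, 10.8, 11.5, 12.0])
n1, n2 = len(sample_1), len(sample_2)
df = n1 + n2 - 2

# Pooled sample variance and pooled standard error
sp_sq = ((n1 - 1) * sample_1.var(ddof=1) + (n2 - 1) * sample_2.var(ddof=1)) / df
se = np.sqrt(sp_sq * (1 / n1 + 1 / n2))

D0 = 0
diff = sample_1.mean() - sample_2.mean()
T = (diff - D0) / se                        # standardized test statistic

t_crit = st.t.ppf(0.975, df)                # t_{alpha/2} for a 95% interval
ci = (diff - t_crit * se, diff + t_crit * se)

reference = st.ttest_ind(sample_1, sample_2, equal_var=True)
print(f'T = {T:.4f} (scipy: {reference.statistic:.4f}), df = {df}')
print(f'95% CI for mu_1 - mu_2: ({ci[0]:.3f}, {ci[1]:.3f})')
```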


### Regression and correlation
Two variables $x$ and $y$ have a deterministic linear relationship if points plotted from $(x, y)$ pairs lie exactly along a single straight line. In practice it is common for two variables to exhibit a relationship that is close to linear but which contains an element, possibly large, of randomness.

#### Linear correlation coefficient
The linear correlation coefficient is a number computed directly from the data that measures the strength of the linear relationship between the two variables $x$ and $y$. The linear correlation coefficient is denoted by $r$ and is defined by the following formula:
$$ r = \frac{\sum{(x_i - \bar{x})(y_i - \bar{y})}}{\sqrt{\sum{(x_i - \bar{x})^2} \sum{(y_i - \bar{y})^2}}} = \frac{S_{xy}}{\sqrt{S_{xx}S_{yy}}}$$
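where $S_{xy}$, $S_{xx}$, and $S_{yy}$ are the sums of squares defined by
$$ S_{xy} = \sum{(x_i - \bar{x})(y_i - \bar{y})}, \quad S_{xx} = \sum{(x_i - \bar{x})^2}, \quad S_{yy} = \sum{(y_i - \bar{y})^2}$$
and $\bar{x}$ and $\bar{y}$ are the sample means of the $x$ and $y$ values, respectively.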

Properties:
1. value of $r$ is always between -1 and 1, inclusive
2. sign of $r$ indicates the direction of the linear relationship between $x$ and $y$
3. size of $|r|$ indicates the strength of the linear relationship between $x$ and $y$
   - $|r|$ close to 1 indicates a strong linear relationship, $|r|$ close to 0 indicates a weak linear relationship
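The formula for $r$ can be computed directly from the sums of squares. The paired data below are hypothetical; `np.corrcoef` is used only as a cross-check.

```python
import numpy as np

# Hypothetical paired observations with a nearly linear relationship
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 11.9])

S_xy = np.sum((x - x.mean()) * (y - y.mean()))
S_xx = np.sum((x - x.mean()) ** 2)
S_yy = np.sum((y - y.mean()) ** 2)

r = S_xy / np.sqrt(S_xx * S_yy)
r_numpy = np.corrcoef(x, y)[0, 1]   # off-diagonal entry of the correlation matrix

print(f'r = {r:.4f} (numpy: {r_numpy:.4f})')
```

Here $r$ is positive and close to 1, consistent with points that lie almost on an increasing straight line.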

### $\chi^2$ tests and F-tests

#### $\chi^2$ test for independence
All the $\chi^2$ distributions form a family, each specified by a parameter called the number of degrees of freedom. The number of degrees of freedom for a $\chi^2$ distribution is equal to the number of independent standard normal random variables that are squared and summed to obtain the $\chi^2$ random variable.

<img src="https://i.imgur.com/T4Ow1S0.jpg" width="600" style="display: block; margin-left: auto; margin-right: auto; padding-top: 10px; padding-bottom: 10px;">

The value of the $\chi^2$ random variable with $df = k$ that cuts off a right tail with an area of $c$ is denoted by $\chi^2_c$.
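Critical values of this kind can be obtained from `scipy.stats.chi2` rather than a table. The degrees of freedom and tail area below are illustrative choices.

```python
import scipy.stats as st

df = 5      # degrees of freedom
c = 0.05    # right-tail area

# The value that cuts off a right tail of area c is the (1 - c) quantile
critical_value = st.chi2.ppf(1 - c, df)

# Check: the area to the right of the critical value should equal c
right_tail_area = st.chi2.sf(critical_value, df)

print(f'chi-square critical value (df = {df}, c = {c}): {critical_value:.3f}')
print(f'right-tail area: {right_tail_area:.4f}')
```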


