In [1]:
import numpy as np
import pandas as pd
import math
import scipy.stats as st

## Introduction to Statistical Methods Formulas and Code

### Statistics

- field that focuses on understanding and analyzing data, providing us with a toolkit of methods and techniques to collect, organize, summarize, and interpret data in a meaningful way
- goal is to uncover patterns, relationships, and insights that can help us make informed decisions and draw reliable conclusions
- descriptive statistics: methods that help us summarize and describe data in a concise and informative way
- inferential statistics: methods that help us draw conclusions about a population based on a sample of data from that population

#### Types of variables
1. Qualitative (categorical) variables
    - variables that can be placed into distinct categories, according to some characteristic or attribute
    - nominal variables: variables that have no natural ordering
    - ordinal variables: variables that have a natural ordering
2. Quantitative (numerical) variables
    - variables that are measured on a numeric scale
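    - discrete variables: variables that take countable values, typically counts (finitely many values, or countably infinite such as 0, 1, 2, ...)
    - continuous variables: variables that can take any value within an interval
    - by level of measurement, quantitative variables are further classified as:
        1. interval variables: variables that have no natural zero point, so differences are meaningful but ratios are not (e.g. temperature in °C)
        2. ratio variables: variables that have a natural zero point, so ratios of values are meaningful (e.g. height, weight)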
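
The nominal/ordinal distinction can be sketched in pandas, which marks a categorical column as ordered or unordered. The column values below (`colors`, `sizes`) are made-up illustrations, not data from these notes.

```python
import pandas as pd

# Nominal: categories with no natural ordering (hypothetical values)
colors = pd.Series(["red", "blue", "red", "green"], dtype="category")

# Ordinal: categories with a declared ordering small < medium < large
size_type = pd.CategoricalDtype(categories=["small", "medium", "large"], ordered=True)
sizes = pd.Series(["medium", "small", "large", "small"], dtype=size_type)

print(colors.cat.ordered)            # False: comparisons like < are not defined
print(sizes.cat.ordered)             # True
print(sizes.max())                   # largest category under the declared order
print((sizes > "small").tolist())    # element-wise comparison uses the order
```

Order-based summaries such as `max()` or the median category only make sense for the ordinal column; for the nominal column, the mode is the natural summary.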

#### Measures of central tendency
They are measures that describe the center of a distribution of data
- mean: the average of a set of values
$${\displaystyle {\bar {x}}={\frac {1}{n}}\left(\sum_{i=1}^{n}{x_{i}}\right)={\frac {x_{1}+x_{2}+\cdots +x_{n}}{n}}}$$
- median: the middle value of the sorted values $x_{(1)} \le x_{(2)} \le \cdots \le x_{(n)}$
$$\mathrm{median}(x) = \begin{cases} 
    x_{\left(\frac{n+1}{2}\right)} & \text{if } n \text{ is odd} \\
    \frac{x_{\left(\frac{n}{2}\right)} + x_{\left(\frac{n}{2} + 1\right)}}{2} & \text{if } n \text{ is even}
\end{cases}$$
- mode: the most frequently occurring value in a set of values
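
As a minimal sketch, the three measures can be computed directly from their definitions. The helper names (`mean`, `median`, `modes`) and the `scores` sample are illustrative assumptions; in practice `np.mean`, `np.median`, and `scipy.stats.mode` do the same job.

```python
from collections import Counter

def mean(values):
    """Arithmetic mean: sum of the values divided by their count."""
    return sum(values) / len(values)

def median(values):
    """Middle value of the sorted data; average of the two middle values if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                       # 1-based position (n + 1) / 2
    return (ordered[mid - 1] + ordered[mid]) / 2  # 1-based positions n/2 and n/2 + 1

def modes(values):
    """All values that occur most often (there can be more than one)."""
    counts = Counter(values)
    top = max(counts.values())
    return sorted(v for v, c in counts.items() if c == top)

scores = [2, 3, 3, 5, 7, 10]  # hypothetical sample with even n = 6
print(mean(scores), median(scores), modes(scores))
```

With an even number of values, the median falls between the third and fourth sorted values, so it need not be an observed value.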

| Measure | When to Use |
|---------|-------------|
| Mean    | When data is normally distributed and there are no extreme outliers. |
| Median  | When data has outliers or is skewed, providing a robust central value. |
| Mode    | When identifying the most frequent or common value in categorical data. |

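Other results (the first is an empirical rule of thumb for unimodal, moderately skewed data):
$$\text{mean} - \text{mode} \approx 3(\text{mean} - \text{median})$$
$$\text{midrange} = \frac{\min + \max}{2}$$
$$\text{range} = \max - \min$$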

##### Skewness

| Distribution | Characteristics | Mean vs Median vs Mode |
|--|--|--|
| Positive | Data skewed towards higher values<br>Majority of observations on the left side<br>Long tail on the right side | Mean > Median > Mode    |
| Negative | Data skewed towards lower values<br>Majority of observations on the right side<br>Long tail on the left side | Mean < Median < Mode     |
| Symmetric | Balanced distribution<br>Observations evenly spread on both sides | Mean ≈ Median ≈ Mode    |

<img src="https://upload.wikimedia.org/wikipedia/commons/c/cc/Relationship_between_mean_and_median_under_different_skewness.png" width="400" style="display: block; margin-left: auto; margin-right: auto; padding-top: 10px; padding-bottom: 10px;">

Bimodal distributions have two peaks; more generally, multimodal distributions have two or more peaks.
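
The mean/median ordering in the skewness table can be checked on a small sample. The data below are made up for illustration, and `scipy.stats.skew` is used here as one common skewness measure; the notes themselves do not prescribe a specific one.

```python
import numpy as np
import scipy.stats as st

# Hypothetical right-skewed sample: most values are small, with a long right tail
right_skewed = np.array([1, 2, 2, 3, 3, 3, 4, 10, 20])
# Mirroring the sample flips the tail to the left
left_skewed = -right_skewed

for name, sample in [("right-skewed", right_skewed), ("left-skewed", left_skewed)]:
    print(f"{name}: skew={st.skew(sample):.2f}, "
          f"mean={np.mean(sample):.2f}, median={np.median(sample):.2f}")
```

The long tail pulls the mean further than the median, which is why the mean sits on the tail side of the median in both cases.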

#### Measures of variability
They are measures that describe the spread of a distribution of data. Together with a measure of center they give a more complete picture of the distribution, and they indicate how well a single value (such as the mean) represents the whole dataset.
- range: the difference between the maximum and minimum values
- variance: the average squared deviation from the mean
$$\sigma^2 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n}$$
- standard deviation: the square root of the variance
$$\sigma = \sqrt{\frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n}}$$
- coefficient of variation: the ratio of the standard deviation to the mean
$$CV = \frac{\sigma}{\bar{x}}$$

| Measure | When to use |
|--|--|
| Range | you need a quick measure of the spread or dispersion of data and want to know the difference between the highest and lowest values in the dataset.       |
| Variance | you want to quantify the average squared deviation of data points from the mean, providing a measure of how much the data points vary from the mean value. |
| Standard Deviation | you want a measure of the dispersion of data that is easy to interpret and represents the typical distance between each data point and the mean. |
| Coefficient of Variation | you want to compare the relative variability between datasets with different units of measurement, allowing you to assess the variation relative to the mean. |

The formulas above are for a population. For a sample, we use $n-1$ instead of $n$ in the denominator (Bessel's correction):
$$s^2 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n-1}$$
- this makes $s^2$ an unbiased estimator of the population variance (its square root $s$ is still slightly biased as an estimator of the population standard deviation)
- the adjustment accounts for the one degree of freedom lost by estimating the mean from the same sample: deviations from the sample mean are, on average, smaller than deviations from the true population mean, so dividing by $n$ underestimates the variance
- dividing by $n-1$ gives a slightly larger estimate; the difference shrinks as $n$ grows
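
A short sketch of the population and sample versions in NumPy, where the `ddof` argument sets how much is subtracted from $n$ in the denominator. The `data` values are a made-up example chosen so the arithmetic is easy to follow (mean 5, squared deviations summing to 32).

```python
import numpy as np

data = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])  # hypothetical values

pop_var = np.var(data, ddof=0)      # divide by n      -> 32 / 8
sample_var = np.var(data, ddof=1)   # divide by n - 1  -> 32 / 7
pop_std = np.std(data, ddof=0)      # square root of the population variance
cv = pop_std / np.mean(data)        # coefficient of variation (unitless)

print(f"Population variance : {pop_var}")
print(f"Sample variance     : {sample_var:.4f}")
print(f"Population std      : {pop_std}")
print(f"CV                  : {cv}")
```

Note that NumPy defaults to `ddof=0`, while pandas' `Series.var()` and `Series.std()` default to `ddof=1`, so the two libraries give different results on the same data unless `ddof` is set explicitly.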

#### 5-point summary
- minimum: the smallest value in the dataset
- first quartile: the value such that 25% of the data falls below
- median: the middle value in the dataset
- third quartile: the value such that 75% of the data falls below
- maximum: the largest value in the dataset

#### Boxplots
- boxplots are a graphical representation of the 5-point summary
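- outliers are observations that fall outside the fences: below the lower fence $Q_1 - 1.5 \cdot IQR$ or above the upper fence $Q_3 + 1.5 \cdot IQR$, where $IQR = Q_3 - Q_1$ is the interquartile range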
<img src="https://i.imgur.com/BgJweoR.png" width="400" style="display: block; margin-left: auto; margin-right: auto; padding-top: 10px; padding-bottom: 10px;">


In [4]:
data = np.array([20, 22, 30, 33, 33, 35, 35, 35, 35, 36, 40, 41, 42, 51, 54])
print(f'Len : {len(data)}, sorted : {sorted(data)}')
print(f'Mean : {np.mean(data)}')
print(f'Median : {np.median(data)}')
print(f'Mode : {st.mode(data, keepdims=False)}')
# Quartiles; 'midpoint' averages the two neighbouring order statistics
q1 = np.quantile(data, 0.25, method='midpoint')
q2 = np.quantile(data, 0.5, method='midpoint')
q3 = np.quantile(data, 0.75, method='midpoint')
iqr = q3 - q1
# Boxplot fences (not the data minimum and maximum): points beyond them are outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
print(f'Q1 : {q1}, Median : {q2}, Q3 : {q3}')
print(f'IQR : {iqr}')
print(f'Lower fence : {lower_fence}, Upper fence : {upper_fence}')

Len : 15, sorted : [20, 22, 30, 33, 33, 35, 35, 35, 35, 36, 40, 41, 42, 51, 54]
Mean : 36.13333333333333
Median : 35.0
Mode : ModeResult(mode=35, count=4)
Q1 : 33.0, Median : 35.0, Q3 : 40.5
IQR : 7.5
Lower fence : 21.75, Upper fence : 51.75
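
As a follow-up sketch, the same 1.5 × IQR rule can be used to list the observations a boxplot would draw as outliers. The block repeats the data and quartile settings from the cell above so that it runs on its own; the variable names are illustrative.

```python
import numpy as np

data = np.array([20, 22, 30, 33, 33, 35, 35, 35, 35, 36, 40, 41, 42, 51, 54])

q1 = np.quantile(data, 0.25, method='midpoint')
q3 = np.quantile(data, 0.75, method='midpoint')
iqr = q3 - q1

# Fences of the boxplot: anything outside [lower_fence, upper_fence] is flagged
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = data[(data < lower_fence) | (data > upper_fence)]
print(f'Outliers : {outliers.tolist()}')
```

Here the smallest and largest observations (20 and 54) fall just outside the fences at 21.75 and 51.75, so both are flagged.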


### Probability

#### Random experiments
- random experiments are processes whose outcome cannot be predicted with certainty, even though the set of possible outcomes is known
- sample space: the set of all possible outcomes of a random experiment
    - discrete sample space: a sample space with a finite or countably infinite number of outcomes
    - continuous sample space: a sample space with uncountably many outcomes, such as an interval of real numbers
- event: a subset of the sample space
- probability: a numerical measure of the likelihood that an event will occur; when all outcomes in the sample space $S$ are equally likely:
$$ P(A) = \frac{\text{number of outcomes in A}}{\text{number of outcomes in S}} = \frac{\text{number of favorable outcomes}}{\text{number of possible outcomes}}$$
- empirical probability: the relative frequency of an event occurring in a series of trials
$$ P_{empirical}(A) = \frac{\text{number of times A occurs}}{\text{number of observations}}$$
- theoretical probability: the probability of an event occurring based on mathematical reasoning
- law of large numbers: as the number of trials increases, the empirical probability of an event will converge to the theoretical probability of that event
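
The law of large numbers can be illustrated with a simulation. This is a sketch under assumed settings: a fair six-sided die, the event "roll a 6", and an arbitrary seed for reproducibility.

```python
import numpy as np

rng = np.random.default_rng(seed=0)          # seed chosen arbitrarily
rolls = rng.integers(1, 7, size=100_000)     # fair die: integers 1..6 (7 is exclusive)

theoretical = 1 / 6                          # P(roll a 6) by counting outcomes

for n in (100, 1_000, 10_000, 100_000):
    empirical = np.mean(rolls[:n] == 6)      # relative frequency in the first n rolls
    print(f'n = {n:>7}: empirical = {empirical:.4f}, '
          f'gap = {abs(empirical - theoretical):.4f}')

final_gap = abs(np.mean(rolls == 6) - theoretical)
```

The gap does not have to shrink at every step, but it tends toward zero as the number of trials grows.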

#### Events
- an event is a subset of the sample space of a random experiment
- complement of an event: the set of all outcomes in the sample space that are not in the event
- union of two events: the set of all outcomes that are in at least one of the two events
- intersection of two events: the set of all outcomes that are in both events
- mutually exclusive events: events that have no outcomes in common
- independent events: events for which the occurrence of one does not change the probability of the other, i.e. $P(A \cap B) = P(A)P(B)$
- addition rule: the probability of the union of two events is equal to the sum of the probabilities of the individual events minus the probability of their intersection
$$ P(A \cup B) = P(A) + P(B) - P(A \cap B)$$
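
Events map naturally onto Python sets, which gives a quick way to check the addition rule by counting. The sketch below assumes one roll of a fair die with equally likely outcomes; the events `A` and `B` and the helper `prob` are illustrative choices.

```python
from fractions import Fraction

sample_space = set(range(1, 7))   # one roll of a fair die

def prob(event):
    """Classical probability: favorable outcomes over possible outcomes."""
    return Fraction(len(event & sample_space), len(sample_space))

A = {2, 4, 6}   # roll is even
B = {4, 5, 6}   # roll is greater than 3

union_direct = prob(A | B)                        # count outcomes in the union
union_by_rule = prob(A) + prob(B) - prob(A & B)   # addition rule
complement_A = prob(sample_space - A)             # complement of A

print(union_direct, union_by_rule, complement_A)
```

Subtracting $P(A \cap B)$ corrects for the outcomes 4 and 6, which would otherwise be counted twice.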

#### Probability
- a number $P(A)$ assigned to each event $A$ (a subset of the sample space $S$ of a random experiment) that satisfies the following axioms:
    1. $0 \leq P(A) \leq 1$
    2. $P(S) = 1$
    3. For two events $A$ and $B$, if $A$ and $B$ are mutually exclusive, then $P(A \cup B) = P(A) + P(B)$
- it is a measure of the likelihood of occurrence of an event

#### Random variables
- random variables are variables that take on numerical values based on the outcome of a random experiment
- discrete random variables: random variables that take a finite or countably infinite number of values
- continuous random variables: random variables that can take any value within an interval
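
A random variable is simply a function from outcomes to numbers. As a sketch, the sum of two fair dice is a discrete random variable, and its distribution can be tabulated by counting; the names `X` and `pmf` are illustrative.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Sample space: all 36 ordered outcomes of rolling two fair dice
outcomes = list(product(range(1, 7), repeat=2))

def X(outcome):
    """Random variable: maps an outcome (d1, d2) to the sum of the two dice."""
    return sum(outcome)

counts = Counter(X(o) for o in outcomes)
pmf = {value: Fraction(count, len(outcomes)) for value, count in sorted(counts.items())}

for value, p in pmf.items():
    print(f'P(X = {value:>2}) = {p}')
```

The probabilities over all values of `X` sum to 1, with the most likely value being 7.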

### Conditional probability

For any two events $A$ and $B$ with $P(B) > 0$, the conditional probability of $A$ given $B$ is defined as:
$$ P(A|B) = \frac{P(A \cap B)}{P(B)}$$
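
The definition can be checked by counting on a small sample space. This sketch assumes two fair dice; the events (`A`: first die shows 6, `B`: the sum is 8) are chosen purely for illustration.

```python
from fractions import Fraction
from itertools import product

sample_space = set(product(range(1, 7), repeat=2))   # 36 equally likely outcomes

def prob(event):
    """Classical probability on the two-dice sample space."""
    return Fraction(len(event), len(sample_space))

A = {o for o in sample_space if o[0] == 6}     # first die shows 6
B = {o for o in sample_space if sum(o) == 8}   # the sum is 8

p_A_given_B = prob(A & B) / prob(B)            # P(A|B) = P(A ∩ B) / P(B)

print(f'P(A)   = {prob(A)}')
print(f'P(A|B) = {p_A_given_B}')
```

Conditioning on `B` shrinks the sample space to the five outcomes that sum to 8, of which only (6, 2) lies in `A`, so the probability of `A` changes from 1/6 to 1/5.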

Chain (multiplication) rule for three events $A$, $B$, and $C$ with $P(B \cap C) > 0$:
$$ P(A \cap B \cap C) = P(A|B \cap C)P(B \cap C) = P(A|B \cap C)P(B|C)P(C)$$
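
A worked example of this factorization, assuming a standard 52-card deck and three draws without replacement: let $C$, $B$, and $A$ be the events that the first, second, and third card is an ace. The sketch compares the product of conditional probabilities with a direct count of ordered draws.

```python
import math
from fractions import Fraction

p_C = Fraction(4, 52)           # first card is an ace
p_B_given_C = Fraction(3, 51)   # second is an ace, given the first was
p_A_given_BC = Fraction(2, 50)  # third is an ace, given the first two were

# Chain rule: P(A ∩ B ∩ C) = P(A | B ∩ C) * P(B | C) * P(C)
p_chain = p_A_given_BC * p_B_given_C * p_C

# Direct count: ordered ways to draw 3 of the 4 aces over ordered 3-card draws
p_count = Fraction(math.perm(4, 3), math.perm(52, 3))

print(p_chain, p_count)
```

Both routes give the same probability, which is the point of the rule: a joint probability can be built up one conditioning step at a time.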
