# Basic Statistics Concepts

_Summarized by QH_  
_First version: 2022-11-01_  
_Last updated on: 2023-01-20_  

# Terminology

* __Population__: Any large collection of objects or individuals.
* __Parameter__: Any summary number that describes the entire population (e.g. population mean).
* __Sample__: A representative group drawn from the population.
* __Statistic__: Any summary number that describes the sample (e.g. sample mean).
* __Confidence Interval__: A range of values, computed from the sample, that is expected to contain a population parameter with a stated level of confidence (e.g. 95%).
* __Hypothesis__: An assumption (claim) about a population parameter, which is tested using sample data.
* __Contingency Table__: A table that summarizes the joint frequencies of two categorical variables in an $r \times c$ layout, where $r$ and $c$ are the numbers of levels of the row and column variables, respectively.
* __Type I Error__: Rejecting the null hypothesis given the null hypothesis is true.
* __Type II Error__: Failing to reject the null hypothesis given the alternative hypothesis is true.
* __Power__: The probability of rejecting the null hypothesis given the alternative hypothesis is true, for a test that accepts a Type I error rate of $\alpha$. Power equals $1 - \beta$, where $\beta$ is the probability of a Type II error.
* Sample size calculation:
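    * __Margin of Error__: The maximum amount by which the sample results are expected to differ from those of the actual population.
    * __Confidence Level__: The probability that the sample accurately reflects the greater population, i.e. the long-run proportion of samples whose estimate falls within the margin of error of the true population value.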
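
As a rough illustration of how margin of error and confidence level drive the required sample size, the sketch below uses the standard normal-approximation formula for estimating a proportion, $n = z^2 p (1 - p) / e^2$. The formula, the function name `sample_size_for_proportion`, and the default $p = 0.5$ (the most conservative choice) are assumptions made for illustration and are not taken from these notes.

```python
import math
from statistics import NormalDist


def sample_size_for_proportion(margin_of_error, confidence_level, p=0.5):
    """Smallest n whose margin of error for a proportion is at most the target.

    Assumes simple random sampling and the normal approximation.
    """
    # Two-sided critical value of the standard normal distribution.
    z = NormalDist().inv_cdf(1 - (1 - confidence_level) / 2)
    return math.ceil(z ** 2 * p * (1 - p) / margin_of_error ** 2)


n_required = sample_size_for_proportion(margin_of_error=0.05, confidence_level=0.95)
print(n_required)
```

A smaller margin of error or a higher confidence level both increase the required sample size.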


# Concepts
## __Hypothesis testing__
* General Procedures:
    * Step 1: Making an initial assumption.
    * Step 2: Collecting evidence (data).
    * Step 3: Based on the available evidence (data), deciding whether to reject or not reject the initial assumption.
* Critical Value Approach:
    * __Principle__: The critical value approach involves determining "likely" or "unlikely" by determining whether or not the observed test statistic is more extreme than would be expected if the null hypothesis were true. That is, it entails comparing the observed test statistic to some cutoff value, called the "critical value." If the test statistic is more extreme than the critical value, then the null hypothesis is rejected in favor of the alternative hypothesis. If the test statistic is not as extreme as the critical value, then the null hypothesis is not rejected.
    * Example on testing population mean $\mu$:
        1. Specify the null and alternative hypotheses.
        2. Using the sample data and assuming the null hypothesis is true, calculate the value of the test statistic. To conduct the hypothesis test for the population mean $\mu$, we use the t-statistic $t^* = \frac{\bar{x} - \mu_0}{s / \sqrt{n}}$, where $\mu_0$ is the value of $\mu$ under the null hypothesis. Under the null hypothesis, $t^*$ follows a t-distribution with $n - 1$ degrees of freedom.
        3. Determine the critical value by finding the value of the known distribution of the test statistic such that the probability of making a Type I error, which is denoted $\alpha$ (Greek letter "alpha") and is called the "significance level of the test", is small (typically 0.01, 0.05, or 0.10).
        4. Compare the test statistic to the critical value. If the test statistic is more extreme in the direction of the alternative than the critical value, reject the null hypothesis in favor of the alternative hypothesis. If the test statistic is less extreme than the critical value, do not reject the null hypothesis.
* P-Value Approach:
    * __Principle__: The P-value approach involves determining "likely" or "unlikely" by determining the probability, assuming the null hypothesis were true, of observing a more extreme test statistic in the direction of the alternative hypothesis than the one observed. If the P-value is small, say less than (or equal to) $\alpha$, then it is "unlikely." And, if the P-value is large, say more than $\alpha$, then it is "likely."
    * Example on testing population mean $\mu$:
        1. Specify the null and alternative hypotheses.
        2. Using the sample data and assuming the null hypothesis is true, calculate the value of the test statistic. To conduct the hypothesis test for the population mean $\mu$, we use the t-statistic $t^* = \frac{\bar{x} - \mu_0}{s / \sqrt{n}}$, where $\mu_0$ is the value of $\mu$ under the null hypothesis. Under the null hypothesis, $t^*$ follows a t-distribution with $n - 1$ degrees of freedom.
        3. Using the known distribution of the test statistic, calculate the P-value: "If the null hypothesis is true, what is the probability that we'd observe a more extreme test statistic in the direction of the alternative hypothesis than we did?" 
        4. Set the significance level $\alpha$, the probability of making a Type I error, to be small (0.01, 0.05, or 0.10). Compare the P-value to $\alpha$. If the P-value is less than (or equal to) $\alpha$, reject the null hypothesis in favor of the alternative hypothesis. If the P-value is greater than $\alpha$, do not reject the null hypothesis.
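
The two approaches above can be sketched side by side for a two-sided test of a population mean. This is a minimal illustration, not a reference implementation: the sample values, the hypothesized mean `mu_0`, and the helper names `t_pdf` and `two_sided_p_value` are all hypothetical, the critical value is copied from a standard t-table, and the P-value is obtained by numerically integrating the t-density using only the standard library. In practice a routine such as `scipy.stats.ttest_1samp` would normally be used instead.

```python
import math
from statistics import mean, stdev


def t_pdf(x, df):
    """Density of the t-distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)


def two_sided_p_value(t_stat, df, steps=2000):
    """P(|T| >= |t_stat|), via Simpson's rule on [0, |t_stat|] (steps must be even)."""
    upper = abs(t_stat)
    h = upper / steps
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area_zero_to_t = total * h / 3
    # The density is symmetric, so each tail holds 0.5 minus the central half.
    return 2 * (0.5 - area_zero_to_t)


# Hypothetical sample; H0: mu = 3 versus HA: mu != 3 (two-sided), alpha = 0.05.
sample = [2, 4, 4, 4, 5, 5, 7, 9]
mu_0 = 3.0
alpha = 0.05

n = len(sample)
df = n - 1
t_stat = (mean(sample) - mu_0) / (stdev(sample) / math.sqrt(n))

# Critical value approach: two-sided cutoff for alpha = 0.05 and 7 degrees of
# freedom, copied from a standard t-table rather than computed.
t_critical = 2.365
reject_by_critical_value = abs(t_stat) > t_critical

# P-value approach: probability of a statistic at least this extreme under H0.
p_value = two_sided_p_value(t_stat, df)
reject_by_p_value = p_value <= alpha

print(f"t* = {t_stat:.3f}, p-value = {p_value:.4f}")
print(f"reject H0 (critical value): {reject_by_critical_value}")
print(f"reject H0 (p-value): {reject_by_p_value}")
```

For the same significance level, both approaches lead to the same decision; the P-value additionally indicates how strong the evidence against the null hypothesis is.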

### Student's t-test
### Chi-square test of independence
* Null Hypothesis ($H_0$): The two categorical variables are independent.  
* Alternative Hypothesis ($H_A$): The two categorical variables are dependent.  
* Chi-Square Test statistic: $\chi^2 = \sum (O - E)^2 / E$, where $O$ is the observed frequency and $E$ is the expected frequency under the null hypothesis computed as: $E = \frac{\text{row total} \times \text{column total}}{\text{sample size}}$.
* The test statistic is compared with the critical value $\chi^2_\alpha$ of the chi-square distribution with $(r - 1)(c - 1)$ degrees of freedom, and the null hypothesis is rejected if $\chi^2 > \chi^2_\alpha$.
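
The calculation above can be sketched in plain Python. The $2 \times 2$ table of counts is hypothetical, and the critical value is copied from a standard chi-square table for $\alpha = 0.05$ with 1 degree of freedom; no continuity correction is applied here.

```python
# Hypothetical 2 x 2 contingency table of observed counts
# (rows: levels of variable A, columns: levels of variable B).
observed = [
    [20, 30],
    [30, 20],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Expected count under independence: row total * column total / sample size.
expected = [[r * c / n for c in col_totals] for r in row_totals]

# Sum (O - E)^2 / E over every cell of the table.
chi2 = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)
dof = (len(row_totals) - 1) * (len(col_totals) - 1)

# Critical value for alpha = 0.05 with 1 degree of freedom, from a chi-square table.
chi2_critical = 3.841
reject_null = chi2 > chi2_critical
print(f"chi2 = {chi2:.3f}, dof = {dof}, reject H0: {reject_null}")
```

Library routines such as `scipy.stats.chi2_contingency` perform the same computation and also return a P-value; note that it applies Yates' continuity correction to $2 \times 2$ tables by default, so its statistic can differ from the uncorrected value computed here.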



### Contingency Table and Confusion matrix

For a binary classification problem, the confusion matrix is laid out as follows:

|Predicted\Actual| Actual Positive| Actual Negative|Metrics|
|:-|:-|:-|:-|
|__Predicted Positive__|# True Positive (TP)|# False Positive (FP)|_Precision_: Percentage of correct <br> predictions of all the predicted positive cases. <br> $\frac{TP}{TP + FP}$ |
|__Predicted Negative__|# False Negative (FN)|# True Negative (TN)| |
|__Metrics__|_Recall_ or _True Positive Rate_ or _Sensitivity_: Percentage of correct predictions <br> of all actual positive cases, $\frac{TP}{TP + FN}$| _False Positive Rate_ or (1 - _Specificity_): Percentage of incorrect predictions <br> of all actual negative cases, $\frac{FP}{FP + TN}$ |_Accuracy_: Percentage of correct predictions of all cases, $\frac{TP + TN}{TP+FP+TN+FN}$ |

When we encounter an imbalanced classification problem such as the following, accuracy may not be the best measure:

|Predicted\Actual| Actual Positive| Actual Negative|
|:-|:-|:-|
|__Predicted Positive__| 3 | 2 |
|__Predicted Negative__| 2 | 93|

The accuracy is $(3 + 93) / 100 = 96\%$. However, if we predict all cases to be negative, the confusion matrix becomes:

|Predicted\Actual| Actual Positive| Actual Negative|
|:-|:-|:-|
|__Predicted Positive__| 0 | 0 |
|__Predicted Negative__| 5 | 95|

The accuracy is $95\%$, which is almost as high. However, this classifier is useless, since it never predicts the positive case.
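
To make the comparison concrete, the following sketch computes the metrics from the confusion matrix table for both cases. The function name `classification_metrics` and the choice to report an undefined ratio as NaN are illustrative assumptions.

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and false positive rate from the four counts."""
    def ratio(numerator, denominator):
        # A ratio with an empty denominator is undefined; report it as NaN.
        return numerator / denominator if denominator else float("nan")

    return {
        "accuracy": ratio(tp + tn, tp + fp + fn + tn),
        "precision": ratio(tp, tp + fp),
        "recall": ratio(tp, tp + fn),
        "false_positive_rate": ratio(fp, fp + tn),
    }


# Counts taken from the two confusion matrices above.
model = classification_metrics(tp=3, fp=2, fn=2, tn=93)
all_negative = classification_metrics(tp=0, fp=0, fn=5, tn=95)

for name, metrics in [("model", model), ("all negative", all_negative)]:
    print(name, {k: round(v, 3) for k, v in metrics.items()})
```

The two accuracies are within one percentage point of each other, while recall drops from 0.6 to 0 for the classifier that always predicts negative, and its precision is undefined because it makes no positive predictions.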

More useful metrics for imbalanced problems include the following:

#### F1-score
It is the harmonic mean of precision and recall:
$$F1 = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$$
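
A minimal sketch of the formula, applied to the counts from the imbalanced example. The function name `f1_score` is illustrative, and returning 0.0 when precision or recall is undefined is a convention assumed here, not something stated in these notes.

```python
def f1_score(tp, fp, fn):
    """F1 from raw counts; returns 0.0 when precision or recall is undefined or zero."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


print(f1_score(tp=3, fp=2, fn=2))   # model from the imbalanced example
print(f1_score(tp=0, fp=0, fn=5))   # classifier that always predicts negative
```

Unlike accuracy, the F1-score ignores true negatives, so the always-negative classifier scores 0 instead of looking nearly as good as the model.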

#### Area under Precision-Recall (PR)
It is the area under the precision-recall curve, which plots precision against recall as the classification threshold varies.

#### Area under ROC
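It is the area under the receiver operating characteristic (ROC) curve, which plots the true positive rate against the false positive rate as the classification threshold varies.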

<img src="FPR_TPR_curve.png" alt="TPR-FPC_png" width="450"/>
<img src="AUC.png" alt="AUC_png" width="415"/>

Source: Google Machine Learning Crash Course
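
As a rough sketch of how the area under the ROC curve can be computed, the code below lowers the classification threshold one example at a time, records the resulting (false positive rate, true positive rate) points, and integrates with the trapezoidal rule. The labels, scores, and function names are hypothetical, and the sketch assumes that all scores are distinct and that both classes are present.

```python
def roc_points(labels, scores):
    """(FPR, TPR) points obtained by lowering the threshold past each score in turn."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    ranked = sorted(zip(scores, labels), reverse=True)  # highest score first
    points = [(0.0, 0.0)]
    tp = fp = 0
    for _, label in ranked:
        if label == 1:
            tp += 1
        else:
            fp += 1
        points.append((fp / n_neg, tp / n_pos))
    return points


def trapezoid_auc(points):
    """Area under a piecewise-linear curve given as (x, y) points sorted by x."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Hypothetical true labels (1 = positive) and predicted scores.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

points = roc_points(labels, scores)
auc_value = trapezoid_auc(points)
print(f"AUC = {auc_value:.3f}")
```

The result equals the fraction of (positive, negative) pairs in which the positive example receives the higher score, which is 8 of 9 pairs for this data.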

## Sample Size Calculation


# References
1. Penn State Online Statistics Courses: https://online.stat.psu.edu/statprogram/reviews/statistical-concepts/terminology
