# Essential Concepts for Statistics

## General framework

- Where and when is it used?
- Definition in simple English
- When value changes
- Application (optional)

## Power

- Where and when is it used?
    - Used in a binary hypothesis test
- Definition in simple English
    - Probability that a test detects an effect when the effect is actually present
    - P(reject Null | Alternative is true), i.e. P(reject Null | Null is false)
        - Power = 1 - P(Type II error) = 1 - β
- When value changes
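    - The higher the power, the better the test is at detecting a real effect
        - Power increases with a larger sample size, a larger effect size, or a higher significance level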
- Application (optional)
    - Experiment design: calculate the minimum sample size needed to reach a target power
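
The sample-size application can be sketched with the usual normal-approximation formula. This is a minimal sketch, assuming a two-sided, two-sample z-test on means with equal group sizes and a known common standard deviation; the function names and the example numbers (`effect=0.5`, `sd=1.0`) are illustrative and not taken from the source.

```python
import math
from statistics import NormalDist


def min_sample_size_per_group(effect, sd, alpha=0.05, power=0.8):
    """Approximate per-group n for a two-sided, two-sample z-test on means."""
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha / 2)  # critical value for the test
    z_beta = std_normal.inv_cdf(power)           # quantile for the target power
    n = 2 * ((z_alpha + z_beta) * sd / effect) ** 2
    return math.ceil(n)


def approx_power(effect, sd, n_per_group, alpha=0.05):
    """Approximate power of the same test (ignores the tiny opposite-tail term)."""
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha / 2)
    se = sd * math.sqrt(2 / n_per_group)  # standard error of the difference in means
    return std_normal.cdf(abs(effect) / se - z_alpha)


n_needed = min_sample_size_per_group(effect=0.5, sd=1.0)
print(n_needed)  # 63
print(approx_power(effect=0.5, sd=1.0, n_per_group=n_needed))
```

Halving the effect size roughly quadruples the required sample size, because `effect` enters the formula squared.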

## Type I Error
- Where and when is it used?
    - Categorize errors in a binary hypothesis test
- Definition in simple English
    - Reject null hypothesis when it is true
    - Conclude the findings are significant when in fact they occurred by chance (false positive)
    - P(reject Null | Null is true), controlled by the significance level α
- When value changes
    - The higher the Type I error rate, the more false positives, so the less reliable a significant result is
- Application (optional)
    - A/B testing: the test reports a difference between variants when there is none

## Type II Error
- Where and when is it used?
    - Categorize errors in a binary hypothesis test
- Definition in simple English
    - Fail to reject null hypothesis when it is false
    - Conclude no significant effect, but in fact there is
    - P(fail to reject Null | Null is false) = β = 1 - Power
- When value changes
    - The higher the Type II error rate, the more real effects the test misses
- Application (optional)
    - A/B testing: the test reports no difference between variants when there is one
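
Both error types can be estimated by simulating many A/B tests. This is a rough sketch, assuming two groups drawn from normal distributions with a known standard deviation and a two-sided z-test; `rejection_rate` and all of its parameters are hypothetical names chosen for illustration.

```python
import math
import random
from statistics import NormalDist, fmean


def rejection_rate(true_diff, n=50, sigma=1.0, alpha=0.05, trials=2000, seed=0):
    """Fraction of simulated A/B tests that reject the null of no difference."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    se = sigma * math.sqrt(2 / n)  # standard error of the difference in means
    rejections = 0
    for _ in range(trials):
        mean_a = fmean(rng.gauss(0.0, sigma) for _ in range(n))
        mean_b = fmean(rng.gauss(true_diff, sigma) for _ in range(n))
        if abs(mean_b - mean_a) / se > z_crit:
            rejections += 1
    return rejections / trials


# Null is true (no difference): every rejection is a Type I error.
type_1_rate = rejection_rate(true_diff=0.0)

# Null is false (real difference of 0.3): every non-rejection is a Type II error.
type_2_rate = 1 - rejection_rate(true_diff=0.3)

print(f"Type I rate ~ {type_1_rate:.3f}, Type II rate ~ {type_2_rate:.3f}")
```

With these settings the Type I rate should land near `alpha`, while the Type II rate stays high because 50 users per group is too few to reliably detect a difference of 0.3 standard deviations.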

## Confidence Interval
- Where and when is it used?
    - To quantify how much an estimate from a sample could vary
    - A CI is used to estimate the true value
        - Note: the true value is fixed but unknown
        - A given CI either contains the true value or it does not
        - Confidence level: the proportion of CIs (each built from a new sample) that cover the true value
            - e.g. draw 100 samples and build a 95% CI from each; about 95 of the 100 CIs are expected to cover the true value
- Definition in simple English
    - An interval of numbers computed from a sample
    - The confidence level says how often such intervals cover the true value
    - P(CI covers the true value) = confidence level
    - $$\bar x \pm z\frac{s}{\sqrt{n}}$$
        - where $\bar x$ is the sample mean, $z$ is the critical value for the chosen confidence level (e.g. 1.96 for 95%), $s$ is the sample standard deviation, and $n$ is the sample size
- When value changes
    - The wider the CI, the more uncertainty about the estimate
        - Less data, wider CI
        - Higher confidence level, wider CI
        - Higher variation/s.d., wider CI
- Application (optional)
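
The formula and the repeated-sampling reading of the confidence level can be checked with a short simulation. This is a minimal sketch, assuming normally distributed data and the z-based interval from the formula above; the function names and the simulated population (`true_mean=10.0`, `sd=2.0`) are made up for illustration.

```python
import math
import random
from statistics import NormalDist, fmean, stdev


def mean_confidence_interval(sample, level=0.95):
    """z-based CI for the mean: x_bar +/- z * s / sqrt(n)."""
    n = len(sample)
    x_bar = fmean(sample)
    s = stdev(sample)  # sample standard deviation
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)  # about 1.96 for 95%
    margin = z * s / math.sqrt(n)
    return x_bar - margin, x_bar + margin


def coverage(true_mean=10.0, sd=2.0, n=100, level=0.95, repeats=1000, seed=0):
    """Fraction of CIs, each built from a fresh sample, that cover the true mean."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(repeats):
        sample = [rng.gauss(true_mean, sd) for _ in range(n)]
        low, high = mean_confidence_interval(sample, level)
        hits += low <= true_mean <= high
    return hits / repeats


print(coverage())  # expected to be close to 0.95
```

Each individual interval either covers the true mean or not; the 95% describes the long-run fraction of intervals that do.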

## P-value
- Where and when is it used?
    - Commonly used in hypothesis testing
    - Connects the observed data to the conclusion of the test
- Definition in simple English
    - A conditional probability
    - P(results at least as extreme as the observed results | Null is true)
- When value changes
    - The lower the p-value, the stronger the evidence against the null hypothesis
    - 0.05 is a common cut-off in practice
    - If p-value < 0.05, reject the null hypothesis; otherwise fail to reject it (NEVER accept it!)
- Application (optional)
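
The definition can be made concrete with a small calculation. This is a sketch under stated assumptions: a two-sided, one-sample z-test with a known standard deviation, where the function name and the example numbers are hypothetical.

```python
import math
from statistics import NormalDist


def two_sided_p_value(sample_mean, null_mean, sd, n):
    """P(a result at least as extreme as the observed one | Null is true)."""
    z = (sample_mean - null_mean) / (sd / math.sqrt(n))  # observed test statistic
    return 2 * (1 - NormalDist().cdf(abs(z)))            # both tails count as "extreme"


# Observed mean 10.5 from n=64 draws, null mean 10, known sd 2 -> z = 2.0
p = two_sided_p_value(sample_mean=10.5, null_mean=10.0, sd=2.0, n=64)
print(round(p, 4))  # 0.0455

alpha = 0.05
print("reject null" if p < alpha else "fail to reject null")
```

Here the p-value falls just below 0.05, so the null is rejected at that cut-off; a slightly smaller sample or observed difference would flip the decision, which is why the cut-off should not be treated as a sharp boundary between "true" and "false".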

## Confusion Matrix

![Confusion matrix](../images/confusion_matrix.png)

# Credits & References

- [5 Statistics Concepts in Data Science Interviews | Power, Errors, Confidence Interval, P value (YouTube)](https://youtu.be/Allap_hrjyo)
- [40 Statistics Interview Problems](https://towardsdatascience.com/40-statistics-interview-problems-and-answers-for-data-scientists-6971a02b7eee)