# Statistical Hypothesis

The main goal in many research studies is to check whether the data collected support certain statements or predictions. 
* A statistical hypothesis is a conjecture about a population parameter. The conjecture may or may not be true.
* EX: The mean income for a resident of Denver is equal to the mean income for a resident of Seattle.
  * Population parameter: mean income
  * One population consists of residents of Denver while the other consists of residents of Seattle
 
### Null Hypothesis
The null hypothesis is the default assumption we start with in a statistical test. It states that there is no effect, no difference, or no relationship between the groups we are comparing.

#### Alternative Hypothesis
The alternative hypothesis states that a difference or relationship exists between the groups we're comparing.

##### Why null hypothesis?
* The null hypothesis provides the baseline for statistical inference.
* When we run an experiment, we collect data and calculate a p-value.
* If the p-value is below a significance threshold, we reject the null hypothesis, meaning data this extreme would be unlikely if only chance were at work.
* If not, we fail to reject the null, meaning we don't have strong enough evidence to claim a real effect.

#### INTERVIEW RELATED
The null hypothesis is the assumption that nothing has changed — that the treatment has no effect compared to the control. In A/B testing, it means both groups have the same metric values, and any differences we see are just due to random variation. We test against this baseline to decide whether the observed effect is statistically significant.

### Design the study
After stating the hypothesis, the researcher designs the study.
* Select the correct statistical test.
  * Statistical test -- uses the data obtained from a sample to make a decision about whether the null hypothesis should be rejected.
* Choose an appropriate level of significance
* Formulate a plan for conducting the study.

### Type I and Type II Error
#### Type I Error
Rejecting H0 when H0 is true: we conclude that the results are statistically significant when, in reality, they came about purely by chance or because of unrelated factors.
The probability of a Type I error is denoted as $\alpha$.

#### Type II Error
Not rejecting H0 when H0 is false.
The probability of a Type II error is denoted as $\beta$.

#### P-Value
The p-value is the probability of observing data at least as extreme as what you actually saw, assuming the null hypothesis is true.

In simpler words: if there were really no effect, how surprising would my data be? \
Example: \
If A and B truly had the same conversion rate, what's the probability I'd see a difference this big or bigger just by random chance? (p-value)

Interpreting p-value: \
Low p-value: The data are very unlikely under the null hypothesis --> we reject H0 --> evidence suggests a real effect. \
High p-value: The data are consistent with random chance under H0 --> we fail to reject H0 --> no strong evidence of a real effect.
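
The A/B conversion-rate question above can be made concrete with a two-proportion z-test. The following is a minimal sketch, not the only valid test: the conversion counts are hypothetical, the function name `two_proportion_p_value` is made up for illustration, and the normal approximation assumes reasonably large samples.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value for H0: both groups share one conversion rate."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Under H0 the best estimate of the common rate pools both groups.
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    std_err = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / std_err
    # Probability of a |z| at least this large under a standard normal.
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical counts: 200 of 2000 users convert in A, 250 of 2000 in B.
z, p_value = two_proportion_p_value(200, 2000, 250, 2000)
alpha = 0.05
print(f"z = {z:.2f}, p-value = {p_value:.4f}")
print("reject H0" if p_value < alpha else "fail to reject H0")
```

With identical observed rates in both groups the statistic is 0 and the p-value is 1, which matches the intuition that such data are not surprising at all under H0.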

##### Assuming the null hypothesis is true, a low p-value means the observed result falls into a region that is very unlikely to occur, so we may consider rejecting H0.

* The p-value does not tell you whether something is true.
* When used, p-values need to be complemented by effect sizes and measures of uncertainty (confidence intervals).
* To answer 'How likely is this result to be true?', it is better to use a Bayesian approach or a false-discovery rate together with the p-value.

#### INTERVIEW RELATED
The p-value measures how likely your observed data (or more extreme) would be if there were truly no difference between your groups. In A/B testing, a low p-value means the observed difference is unlikely to be due to chance under the null hypothesis, so we may conclude the treatment had an effect.

Imagine a curve showing the probability of each possible result if the null hypothesis were true. We pick a significance level (say 5%) and shade the extreme tails; those are the results we’d consider too unlikely under H₀. The p-value is the total probability of seeing a result as extreme or more extreme than our observed one. If the observed result falls within the shaded 5% region (equivalently, if the p-value is below 0.05), we reject the null hypothesis; otherwise, we don’t.

#### Significance Level
* The significance level is a threshold that you set at the beginning of your study and compare the p-value against. It is denoted as $\alpha$.
* It is the maximum probability of committing a Type I error that you are willing to accept.
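
One way to see that the significance level bounds the Type I error rate is to simulate many A/A experiments, where the null hypothesis is true by construction, and count how often it is rejected. The sketch below is illustrative only: it assumes normally distributed data with a known standard deviation of 1 and uses a z-test, and the function names are hypothetical.

```python
import random
from math import sqrt
from statistics import NormalDist, mean


def z_test_p_value(sample_a, sample_b, sigma=1.0):
    """Two-sided z-test p-value for equal means, assuming known sigma."""
    std_err = sigma * sqrt(1 / len(sample_a) + 1 / len(sample_b))
    z = (mean(sample_b) - mean(sample_a)) / std_err
    return 2 * (1 - NormalDist().cdf(abs(z)))


def false_positive_rate(alpha, n_experiments=2000, n_per_group=50, seed=0):
    """Share of A/A experiments (H0 true) in which H0 is rejected."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_experiments):
        # Both groups come from the same N(0, 1) population, so H0 holds.
        a = [rng.gauss(0, 1) for _ in range(n_per_group)]
        b = [rng.gauss(0, 1) for _ in range(n_per_group)]
        if z_test_p_value(a, b) < alpha:
            rejections += 1
    return rejections / n_experiments


rate = false_positive_rate(alpha=0.05)
print(f"observed Type I error rate at alpha = 0.05: {rate:.3f}")
```

The observed rejection rate should land close to the chosen alpha, and lowering alpha lowers it.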

### Statistical Significance & Practical Significance & Power
Statistical significance: indicates whether an observed effect is unlikely to be due to chance alone, assessed by the p-value. \
Practical significance: indicates whether an effect is large enough to be meaningful in the real world, assessed by the effect size.

#### Effect Size
Effect size is a quantitative measure of the magnitude of a phenomenon. In the context of A/B testing or statistical hypothesis testing, it represents how large the difference or association is between two groups or variables, independent of sample size. While a p-value only tells you whether a difference is statistically significant, the effect size tells you how meaningful or practically important that difference is.

* Statistical significance alone can be misleading because it is influenced by the sample size. As long as the true effect is not exactly zero, increasing the sample size makes it more likely to find a statistically significant effect, no matter how small the effect truly is in the real world.
* In contrast, effect sizes do not systematically grow or shrink with the sample size; a larger sample only makes the estimate more precise.
* It is better to report effect sizes and confidence intervals together with the p-value when presenting the relationship or difference between two groups.

### Power
Power is the probability of correctly rejecting H0 when H1 is true. \
Type II error ($\beta$) is the probability of failing to reject H0 when H1 is true. \
Type I error ($\alpha$) is the probability of rejecting H0 when H0 is true.

Therefore, Power = 1 - $\beta$: the higher the statistical power of a study, the lower the risk of a Type II error ($\beta$).

#### Statistical power is determined by...
* Effect sizes: Larger effects are more easily detected.
* Sample size: Larger samples reduce sampling error and increase power.
* Significance level: Increasing the significance level increases the power.
* Measurement error: Systematic and random errors in recorded data reduce power.
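
The first three drivers above can be checked numerically. This is a rough sketch that assumes a two-sided, two-sample z-test with equal group sizes and a standardized effect size (Cohen's d); the function name `z_test_power` and the example values are hypothetical, and measurement error is not modelled.

```python
from math import sqrt
from statistics import NormalDist


def z_test_power(effect_size, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-sample z-test.

    effect_size is the standardized mean difference (Cohen's d).
    """
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    # Expected value of the z statistic when the true effect is effect_size.
    shift = effect_size * sqrt(n_per_group / 2)
    # Probability that the statistic lands in either rejection tail.
    upper_tail = 1 - std_normal.cdf(z_crit - shift)
    lower_tail = std_normal.cdf(-z_crit - shift)
    return upper_tail + lower_tail


baseline = z_test_power(0.3, 100)
print(f"baseline (d=0.3, n=100, alpha=0.05): {baseline:.3f}")
print(f"larger effect (d=0.5):               {z_test_power(0.5, 100):.3f}")
print(f"larger sample (n=300):               {z_test_power(0.3, 300):.3f}")
print(f"higher alpha (alpha=0.10):           {z_test_power(0.3, 100, alpha=0.10):.3f}")
```

When the true effect is zero, the same formula returns alpha, which is the Type I error rate rather than power in any useful sense.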

#### Type I Error is even worse...
A type I error means mistakenly going against the main statistical assumption of a null hypothesis. This may lead to new policies, practices or treatments that are inadequate or a waste of resources. \
In practical terms, however, either type of error could be worse depending on your research context.

#### How to reduce Type I Error
* The risk of making a type I error is the significance level (or alpha) that you choose. That's a value that you set at the beginning of your study to assess the statistical probability of obtaining your results (p-value).
* The significance level is usually set at 0.05. This means that you reject the null hypothesis only when your results have a 5% chance of occurring, or less, if the null hypothesis is actually true.
* To reduce the type I error probability, you can set a lower significance level.

#### How to reduce Type II Error
* The risk of making a type II error is inversely related to the statistical power of a test. Power is the extent to which a test can correctly detect a real effect when there is one.
* To (indirectly) reduce the risk of a type II error, you can increase the sample size or the significance level to increase the statistical power. Raising the significance level, however, also raises the risk of a type I error.

## Effect Size and Power

#### What happens to p-value as sample size increases?
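Here $\beta$ denotes the true effect being tested (for example, a regression coefficient), not the Type II error rate. Unless $\beta$ is exactly equal to 0, the p-value will approach 0 as the sample size grows. If $\beta$ is exactly 0, the p-value does not converge to any value; it stays spread between 0 and 1 regardless of the sample size. A similar mathematical relationship exists between the test statistic and the sample size in common statistical tests, including regression models with multiple independent variables.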

#### Statistical Significance (p-value)
* Hypothesis testing has traditionally focused on p-values, declaring statistical significance when the p-value is less than an alpha of 0.05. This approach has a major weakness.
* With a large enough sample size, an experiment can eventually reject the null hypothesis by detecting trivially small differences that turn out to be statistically significant.
* This is one reason why clinical trials run with very large samples can reach statistical significance even for small effects. A large sample size reduces the standard error toward zero, which in turn inflates the t statistic and commensurately lowers the p-value toward 0.
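
The effect of sample size on the p-value can be illustrated by holding a tiny observed difference fixed and letting only the sample size grow. This sketch assumes a two-sample z-test with equal group sizes and unit standard deviation; the effect of 0.02 standard deviations and the function name are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def p_value_for_fixed_effect(d, n_per_group):
    """Two-sided z-test p-value when the observed standardized
    difference between two equal-sized groups is exactly d."""
    # The standard error of the difference is sqrt(2 / n), so z = d * sqrt(n / 2).
    z = d * sqrt(n_per_group / 2)
    return 2 * (1 - NormalDist().cdf(abs(z)))


tiny_effect = 0.02  # hypothetical: 2% of a standard deviation
for n in (100, 10_000, 100_000, 1_000_000):
    p = p_value_for_fixed_effect(tiny_effect, n)
    print(f"n per group = {n:>9,}  p-value = {p:.4g}")
```

The effect size never changes in this loop; only the standard error shrinks, which is why the p-value alone says little about practical importance.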

### Practical Significance (Effect Sizes)
Effect sizes tell you how meaningful the relationship between variables or the difference between groups is. \
A large effect size means that a research finding has practical significance, while a small effect size indicates limited practical applications. 

## Measures of Effect Size 
The measures of effect size can be grouped into 3 categories, based on how they define the effect.
The groups are:
### Metrics based on the correlation (R Family)
  * Most popular -- Pearson's R. Measures the degree of linear association between two real-valued variables.
  * Coefficient of determination ($R^2$) 
    * It states what proportion of the dependent variable (y)'s variance is explained (predictable) by the independent variables (x). Generally, a higher $R^2$ indicates a better fit for the model.

  When using simple linear regression (with one independent variable) with the intercept included, the coefficient of determination ($R^2$) is simply the square of Pearson's R. Because Pearson's R is squared, the coefficient of determination does not convey the direction of the correlation.

#### Coefficient of Determination $R^2$
  $R^2$ is a comparison of the residual sum of squares ($SS_{res}$) with the total sum of squares ($SS_{tot}$):
  $$ R^2 = 1 - \frac{SS_{res}}{SS_{tot}}$$
  where: \
  The total sum of squares is the sum of the squared vertical distances between the data points and the horizontal line at the mean of y. It represents the variation of the observed data.

  The residual sum of squares is the sum of the squared vertical distances between the data points and the best-fitted line. It measures how well the regression model represents the data.

  If $R^2$ = 1, all of the data points fall perfectly on the regression line. The predictor x accounts for all of the variation in y ($SS_{res}$ = 0).\
  If $R^2$ = 0, the estimated regression line is perfectly horizontal. The predictor x accounts for none of the variation in y, therefore, there's no linear relationship between x and y.

  $R^2$ expresses the proportion of the variation in y that is explained by variation in x. R expresses the strength and direction of the linear relationship between x and y. Some caveats:

  * $R^2$ and R quantify the strength of a linear relationship only. It is possible that $R^2$ = 0 and R = 0, suggesting there's no linear relationship between y and x, and yet a perfect curved relationship exists.
  * In simple linear regression, $R^2$ equals the square of Pearson's R. For multiple or nonlinear regression, $R^2$ is defined as the proportion of variance explained.
  * $R^2$ and R can both be greatly affected by just one data point or a few data points (outliers).
  * A statistically significant $R^2$ does not imply that the slope $\beta_1$ is meaningfully different from 0. The larger the dataset, the easier it is to reject the null hypothesis and claim statistical significance, which does not imply practical significance.
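
The definition $R^2 = 1 - SS_{res}/SS_{tot}$ and its link to Pearson's R can be verified on a small example. The sketch below uses only the standard library; the data points are made up, and the helper names (`fit_line`, `r_squared`, `pearson_r`) are hypothetical. The second dataset shows a perfect curve for which both R and $R^2$ are 0.

```python
from math import sqrt
from statistics import mean


def fit_line(x, y):
    """Least-squares slope and intercept for y = intercept + slope * x."""
    x_bar, y_bar = mean(x), mean(y)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    slope = s_xy / s_xx
    return slope, y_bar - slope * x_bar


def r_squared(x, y):
    """R^2 = 1 - SS_res / SS_tot for the fitted straight line."""
    slope, intercept = fit_line(x, y)
    y_bar = mean(y)
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)
    return 1 - ss_res / ss_tot


def pearson_r(x, y):
    """Pearson correlation coefficient."""
    x_bar, y_bar = mean(x), mean(y)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_yy = sum((yi - y_bar) ** 2 for yi in y)
    return s_xy / sqrt(s_xx * s_yy)


# Made-up, roughly linear data: R^2 matches the square of Pearson's R.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
print(f"R^2 = {r_squared(x, y):.4f}, r^2 = {pearson_r(x, y) ** 2:.4f}")

# A perfect curve (y = x^2) on symmetric x values: no linear relationship.
x_sym = [-2, -1, 0, 1, 2]
y_curve = [xi ** 2 for xi in x_sym]
print(f"R^2 = {r_squared(x_sym, y_curve):.4f}, r = {pearson_r(x_sym, y_curve):.4f}")
```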

### Metrics based on differences (D Family)        
* Calculate the difference between the mean values of the samples. Usually the difference is standardized by dividing it by the standard deviation.
* Effect sizes do not depend on the sample size the way test statistics do, because the unit of statistical distance in effect size analysis is the standard deviation instead of the standard error. The standard deviation describes the spread of the data and does not shrink as the sample grows, whereas the standard error decreases as the sample size increases.

#### Why standardized?
* You can compare the standardized difference across variables.
* You don't have to be familiar with the scaling of the variables.

#### Cohen's d -- Standardized Mean Difference
Cohen's d is one of the most common ways to measure effect size. The difference between two means is expressed in terms of the number of standard deviations:
$$ d = \frac{\mu_1-\mu_2}{s} $$
$$ s = \sqrt{\frac{(n_1-1)s_1^2 + (n_2-1)s_2^2}{n_1+n_2-2}}$$
where $\mu_1$ and $\mu_2$ are the means of the two groups, s is the pooled standard deviation, $s_1$ and $s_2$ are the standard deviations of the two independent samples, and $n_1$ and $n_2$ are their sizes.
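
The two formulas above translate directly into a few lines of code. This is a minimal sketch using the standard library; the function name `cohens_d` and the sample values are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(sample_1, sample_2):
    """Standardized mean difference using the pooled standard deviation."""
    n_1, n_2 = len(sample_1), len(sample_2)
    # statistics.variance divides by n - 1, matching s_1^2 and s_2^2 above.
    pooled_var = (
        (n_1 - 1) * variance(sample_1) + (n_2 - 1) * variance(sample_2)
    ) / (n_1 + n_2 - 2)
    return (mean(sample_1) - mean(sample_2)) / sqrt(pooled_var)


# Made-up measurements for two independent groups.
treatment = [5.1, 5.8, 6.3, 6.0, 5.5, 6.7]
control = [4.9, 5.2, 5.6, 5.0, 5.4, 5.9]
d = cohens_d(treatment, control)
print(f"Cohen's d = {d:.2f}")
```

Swapping the two samples flips the sign of d, so it is worth stating which group is subtracted from which when reporting it.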

* There's considerable overlap between the two distributions even when Cohen's d indicates a large effect size.
* Values of Cohen's d across disciplines: In psychology, the mean effect size is 0.4, with 30% of effects below 0.2 and 17% greater than 0.8. Benchmarks differ by field, however: an average effect size of d = 0.4 is also reported with 0.2, 0.4, and 0.6 considered small, medium, and large effect sizes. Medical research is often associated with small effect sizes, often in the 0.05 to 0.2 range. Despite being small, these effects often represent meaningful outcomes such as saving lives.
* Cohen's d is very frequently used in estimating the required sample size for an A/B test. In general, a lower value of Cohen's d indicates the need for a larger sample size, and vice versa.
### Metrics for categorical variables
#### Odds
* The odds of an event is the ratio of the frequency (or likelihood) of its occurrence to the frequency (or likelihood) of its non-occurrence.
#### Odds Ratio
* The odds ratio is a comparison of the odds of an event after exposure to a risk factor with the odds of that event in a control or reference situation. That is, the odds of an event in the Treatment group divided by the odds of that event in the Control group.
  $$\frac{p_1/(1-p_1)}{p_2/(1-p_2)} = \frac{p_1/q_1}{p_2/q_2} = \frac{p_1q_2}{p_2q_1}$$
* An odds ratio of...
  * 1.0 (or close to 1.0) indicates that x is not associated with y; the odds of the event are the same between groups.
  * greater than 1.0 indicates a positive association, the event odds increase compared to the reference group (or per unit increase). EX. OR=2 means the odds are twice as high.
  * smaller than 1.0 indicates a negative association, the event odds decrease compared to the reference group. EX. OR=0.5 means the odds are half as high. 
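
A short numeric example of the odds and odds ratio definitions, using hypothetical conversion rates for a treatment and a control group. The function names are made up for illustration.

```python
def odds(p):
    """Odds of an event that occurs with probability p (0 <= p < 1)."""
    return p / (1 - p)


def odds_ratio(p_treatment, p_control):
    """Odds in the treatment group divided by odds in the control group."""
    return odds(p_treatment) / odds(p_control)


# Hypothetical rates: 30 of 100 convert in treatment, 20 of 100 in control.
p_1 = 30 / 100
p_2 = 20 / 100
print(f"odds (treatment) = {odds(p_1):.3f}")
print(f"odds (control)   = {odds(p_2):.3f}")
print(f"odds ratio       = {odds_ratio(p_1, p_2):.3f}")
```

Here the odds ratio is above 1, so the odds of the event are higher in the treatment group. Note that it differs from the ratio of the two probabilities (0.30 / 0.20 = 1.5), which is the relative risk.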

## Power Analysis
Power analysis is built from four quantities:
##### significance level, effect size, power, and sample size.
All four of these variables are linked together and changing one of them impacts the others. Following this relationship, if three of these variables are known then we can determine the fourth unknown variable, and this is what power analysis is all about. 

### Power analysis is in...
* Experiment Design
  * Select the alpha, power and effect size that is relevant for the experiment, and consequently calculate the sample size that will be needed for such an experiment.
  * Use the tt_ind_solve_power() function in statsmodels. It requires
  * ##### effect_size (standardized effect size), alpha, power, ratio. In addition, alternative sets whether the test is powered to detect two-sided or one-sided effects. With nobs1 left as None, it returns an estimated sample size for the first group.

* Validate the findings of an experiment
  * By using the given sample size, effect_size and significance level, you can
    ##### determine the power of the conducted experiment
    to conclude whether the probability of committing a Type II error is acceptable from the decision-making perspective.

* Sensitivity Test
  * Analyze the impact of changing one variable on the rest of the three. The results can be plotted on a graph to explain the behaviour of the experiment. 
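
The three uses above can be sketched with a normal approximation, using only the standard library. This is not the statsmodels implementation: `tt_ind_solve_power()` solves the t-test version, so its numbers may differ slightly. The function names and example values below are hypothetical, and equal group sizes are assumed.

```python
from math import ceil, sqrt
from statistics import NormalDist


def sample_size_per_group(effect_size, alpha=0.05, power=0.80):
    """Experiment design: per-group n for a two-sided, two-sample z-test."""
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha / 2)
    z_power = std_normal.inv_cdf(power)
    return ceil(2 * ((z_alpha + z_power) / effect_size) ** 2)


def achieved_power(effect_size, n_per_group, alpha=0.05):
    """Validation: approximate power given the sample size actually used.

    The far rejection tail is ignored; it is negligible for a real effect.
    """
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha / 2)
    return std_normal.cdf(effect_size * sqrt(n_per_group / 2) - z_alpha)


# Experiment design: how many users per group to detect d = 0.5?
n_needed = sample_size_per_group(effect_size=0.5)
print(f"required n per group: {n_needed}")

# Validation: what power did an experiment with that n have?
print(f"achieved power: {achieved_power(0.5, n_needed):.3f}")

# Sensitivity: how the required sample size reacts to the effect size.
for d in (0.2, 0.3, 0.5, 0.8):
    print(f"d = {d}: n per group = {sample_size_per_group(d)}")
```

The sensitivity loop makes the trade-off visible: halving the effect size roughly quadruples the required sample size, because n scales with 1 / d².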

## Summary
* Statistical power is the probability that a test will correctly reject a false null hypothesis. Statistical power has relevance only when the null hypothesis is false.
* The higher the statistical power for a given experiment, the lower the probability of making type II error.
* ##### High Statistical Power: Small risk of committing type II error
* It is common to design experiments with a statistical power of 80% or higher, which means at most a 20% probability of committing a type II error.
* Power analysis answers questions like 'how much statistical power does my study have?' and 'how big a sample size do I need?'. Power analyses are normally run before a study is conducted. A prospective or a priori power analysis can be used to estimate any one of the four power parameters but is most often used to estimate
##### required sample sizes.
* As practitioners, we can start with sensible defaults for some parameters, such as a significance level of 0.05 and a power level of 0.80. We can then estimate a desirable minimum effect size, specific to the experiment being performed. Finally, we can estimate a minimum sample size based on all the other parameters in the power analysis.