# Statistical Considerations in Testing

## [Important source for statistics](https://greenteapress.com/wp/think-stats-2e/)

## Where, when, and how do statistics matter when doing experiments?
During this lesson, we will dive deep into the following topics:

* How can statistics be used to design your experiments?
* Pitfalls to avoid when analyzing outcome data

In this lesson, we will show you the value of statistics both in the planning stage and in the analysis of your experiments.

![1](images/1.PNG)

### Lesson outline
What statistics do you need to know in testing? We will talk about each of the following throughout this lesson.

* Statistical significance
* Practical significance
* Dummy tests
* Non-parametric tests
* Missing data
* Early stopping

### Learning objectives
By the end of this lesson, you will be able to:

* Describe applications of statistics in the real world
* Apply statistical techniques and considerations when evaluating the data collected during an experiment
* Establish key metrics



## Practical Significance
Even if an experiment result shows a statistically significant difference in an evaluation metric between control and experimental groups, that does not necessarily mean that the experiment was a success. If there are any costs associated with deploying a change, those costs might outweigh the benefits expected based on the experiment results. **Practical significance** refers to the level of effect that you need to observe in order for the experiment to be called a true success and for the change to actually be deployed. Not all experiments imply a practical significance boundary, but where one is relevant, it is an important factor in interpreting outcomes.

If you compare the confidence interval for an evaluation metric statistic against both the null baseline and the practical significance bound, a few cases can arise.

### Confidence interval is fully in practical significance region
(Below, $m_0$ indicates the null statistic value, $d_{min}$ the practical significance bound, and the blue line the confidence interval for the observed statistic. We assume that we're looking for a positive change, ignoring the negative equivalent for $d_{min}$.)

![2](images/2.png)

If the confidence interval for the statistic includes neither the null value nor the practical significance bound, and lies entirely beyond both, then the experimental manipulation can be concluded to have a statistically and practically significant effect. This is the clearest case for implementing the manipulation.

### Confidence interval completely excludes any part of practical significance region

![3](images/3.png)

If the confidence interval does not include any values that would be considered practically significant, this is a clear case for us to not implement the experimental change. This includes the case where the metric is statistically significant but its interval does not extend past the practical significance bound. With such a low chance of practical significance being achieved on the metric, we should be wary of implementing the change.

### Confidence interval includes points both inside and outside practical significance bounds

This leaves the trickiest cases to consider, where the confidence interval straddles the practical significance bound. In each of these cases, it is uncertain whether practical significance has been achieved. In an ideal world, you would be able to collect more data to reduce your uncertainty, reducing the scenario to one of the previous cases. If that is not possible, you'll need to weigh the risks carefully in order to make a recommendation on whether or not to follow through with a tested change. Your analysis might also reveal subsets of the population or aspects of the manipulation that do work, which can help refine further studies or experiments.
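The three cases above can be sketched as a small decision helper. This is only an illustration, assuming a positive desired change and a practical significance bound expressed on the same scale as the statistic (so it sits above the null value); the function name `classify_interval` and the returned labels are hypothetical, not part of the lesson.

```python
def classify_interval(ci_low, ci_high, null_value, practical_bound):
    """Label a confidence interval relative to a practical significance bound.

    Assumes we are looking for a positive change, so practical_bound must be
    greater than null_value. All names and labels here are illustrative.
    """
    if practical_bound <= null_value:
        raise ValueError("practical_bound must be above null_value")
    if ci_low > ci_high:
        raise ValueError("ci_low must not exceed ci_high")

    if ci_low > practical_bound:
        # Whole interval lies in the practical significance region.
        return "implement"
    if ci_high < practical_bound:
        # No value in the interval is practically significant, even if the
        # interval excludes the null (statistically significant).
        return "do not implement"
    # Interval straddles the practical significance bound.
    return "uncertain"


# Hypothetical intervals for a difference in conversion rate.
print(classify_interval(0.03, 0.05, null_value=0.0, practical_bound=0.02))
print(classify_interval(0.005, 0.015, null_value=0.0, practical_bound=0.02))
print(classify_interval(0.01, 0.03, null_value=0.0, practical_bound=0.02))
```

The "uncertain" label is where judgment comes in: the code can flag the case, but it cannot weigh the risks for you.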

## Experiment Size

After computing the number of observations needed for an experiment to reliably detect a specified level of experimental effect (that is, to reach the desired statistical power), we need to divide by the expected number of observations per day in order to get a minimum experiment length. We want to make sure that an experiment can be completed in a reasonable time frame so that if we do have a successful effect, it can be deployed as soon as possible and resources can be freed up to run new experiments. What a 'reasonable time frame' means will depend on how important a change will be, but if the length of time is beyond a month or two, that's probably a sign that it's too long.
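As a quick sketch of that division, the helper below rounds up to whole days. The function name and the traffic figures are made up for illustration.

```python
import math


def minimum_experiment_days(observations_needed, observations_per_day):
    """Minimum whole days required to collect the needed observations."""
    if observations_per_day <= 0:
        raise ValueError("observations_per_day must be positive")
    # Round up: a partial day still has to be run in full.
    return math.ceil(observations_needed / observations_per_day)


# Hypothetical numbers: 10,000 observations needed, 1,500 arriving per day.
print(minimum_experiment_days(10_000, 1_500))  # 7
```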

There are a few ways that an experiment's duration can be reduced. We could, of course, change our statistical parameters. Accepting higher Type I or Type II error rates will reduce the number of observations needed. So too will increasing the minimum effect size we aim to detect: it's much easier to detect larger changes.
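To see how these parameters move the required sample size, here is a minimal sketch using the common normal-approximation formula for a two-sided test comparing two proportions. The lesson does not prescribe a particular formula, so treat this as one reasonable choice; the function name `sample_size_per_group` and the example rates are assumptions.

```python
import math
from statistics import NormalDist


def sample_size_per_group(p_null, p_alt, alpha=0.05, power=0.80):
    """Approximate observations per group for a two-proportion z-test.

    alpha is the Type I error rate; 1 - power is the Type II error rate.
    Uses a normal approximation, so it is a rough planning estimate only.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_beta = NormalDist().inv_cdf(power)
    p_bar = (p_null + p_alt) / 2
    sd_null = math.sqrt(2 * p_bar * (1 - p_bar))
    sd_alt = math.sqrt(p_null * (1 - p_null) + p_alt * (1 - p_alt))
    n = ((z_alpha * sd_null + z_beta * sd_alt) / (p_alt - p_null)) ** 2
    return math.ceil(n)


# Hypothetical baseline rate of 10%.
baseline = sample_size_per_group(0.10, 0.12)
print("baseline:", baseline)
print("higher Type I rate:", sample_size_per_group(0.10, 0.12, alpha=0.10))
print("higher Type II rate:", sample_size_per_group(0.10, 0.12, power=0.70))
print("larger effect:", sample_size_per_group(0.10, 0.14))
```

Each of the three variations should print a smaller number than the baseline, which is the tradeoff described above: fewer observations in exchange for more error risk or less sensitivity to small effects.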

Another option is to change the unit of diversion. A 'wider' unit of diversion will result in more observations being generated. For example, you could consider moving from a cookie-based diversion in a web-based experiment to an event-based diversion like pageviews. The tradeoff is that event-based diversion could create inconsistent website experiences for users who visit the site multiple times.

> Example of experiment with fewer observations

![5](images/5.PNG)

> Example of experiment with larger samples

![6](images/6.PNG)