# Statistical Considerations in Testing

## [Important source for statistics](https://greenteapress.com/wp/think-stats-2e/)

## Where, when, and how do statistics matter when doing experiments?
During this lesson, we will dive deep into the following topics:

* How can statistics be used to design your experiments?
* Pitfalls to avoid when analyzing outcome data

In this lesson, we will show you the value of statistics both in the planning stage and in the analysis of your experiments.

![1](images/1.PNG)

### Lesson outline
What statistics do you need to know in testing? We will talk about each of the following throughout this lesson.

* Statistical significance
* Practical significance
* Dummy tests
* Non-parametric tests
* Missing data
* Early stopping

### Learning objectives
By the end of this lesson, you will be able to:

* Describe applications of statistics in the real world
* Apply statistical techniques and considerations when evaluating the data collected during an experiment.
* Establish key metrics



## Practical Significance
Even if an experiment result shows a statistically significant difference in an evaluation metric between control and experimental groups, that does not necessarily mean that the experiment was a success. If there are any costs associated with deploying a change, those costs might outweigh the benefits expected based on the experiment results. **Practical significance** refers to the level of effect that you need to observe in order for the experiment to be called a true success and for the change to actually be implemented. Not all experiments imply a practical significance boundary, but it's an important factor in the interpretation of outcomes where it is relevant.

If you consider the confidence interval for an evaluation metric statistic against the null baseline and practical significance bound, there are a few cases that can come about.

### Confidence interval is fully in practical significance region
(Below, $m_0$ indicates the null statistic value, $d_{min}$ the practical significance bound, and the blue line the confidence interval for the observed statistic. We assume that we're looking for a positive change, ignoring the negative equivalent for $d_{min}$.)

![2](images/2.png)

If the confidence interval for the statistic lies entirely beyond both the null value and the practical significance bound, then the experimental manipulation can be concluded to have a statistically and practically significant effect. This is the clearest case for implementing the manipulation.

### Confidence interval completely excludes any part of practical significance region

![3](images/3.png)

If the confidence interval does not include any values that would be considered practically significant, this is a clear case for us to not implement the experimental change. This includes the case where the metric is statistically significant but the interval does not extend past the practical significance bound. With such a low chance of practical significance being achieved on the metric, we should be wary of implementing the change.

### Confidence interval includes points both inside and outside practical significance bounds

This leaves the trickiest cases to consider, where the confidence interval straddles the practical significance bound. In each of these cases, it is uncertain whether practical significance has been achieved. In an ideal world, you would be able to collect more data to reduce your uncertainty, reducing the scenario to one of the previous cases. Failing that, you'll need to consider the risks carefully in order to make a recommendation on whether or not to follow through with a tested change. Your analysis might also reveal subsets of the population or aspects of the manipulation that do work, which can be used to refine further studies or experiments.
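The three cases above can be summarized as a small decision rule. The sketch below is illustrative only: the function name and the returned labels are hypothetical, and it assumes that a positive change is desired, that `d_min` is expressed on the same scale as the statistic, and that `d_min` lies above `m_0`.

```python
def classify_interval(ci_low, ci_high, m_0, d_min):
    """Compare a confidence interval with the null value and practical bound.

    Assumes a positive change is desired and that d_min > m_0.
    Returns (decision, statistically_significant).
    """
    # The interval excludes the null value on the positive side.
    statistically_significant = ci_low > m_0

    if ci_low > d_min:
        # Entire interval lies inside the practical significance region.
        decision = "implement"
    elif ci_high < d_min:
        # No value in the interval reaches the practical significance bound.
        decision = "do not implement"
    else:
        # The interval straddles the practical significance bound.
        decision = "uncertain"
    return decision, statistically_significant


print(classify_interval(0.03, 0.05, m_0=0.0, d_min=0.02))    # ('implement', True)
print(classify_interval(0.005, 0.015, m_0=0.0, d_min=0.02))  # ('do not implement', True)
print(classify_interval(0.01, 0.03, m_0=0.0, d_min=0.02))    # ('uncertain', True)
```

The second call shows a result that is statistically significant yet still falls short of the practical bound.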

## Experiment Size

After computing the number of observations needed for an experiment to reliably detect a specified level of experimental effect (that is, to reach the desired statistical power), we need to divide by the expected number of observations per day in order to get a minimum experiment length. We want to make sure that an experiment can be completed in a reasonable time frame so that if we do have a successful effect, it can be deployed as soon as possible and resources can be freed up to run new experiments. What a 'reasonable time frame' means will depend on how important a change will be, but if the length of time is beyond a month or two, that's probably a sign that it's too long.

There are a few ways that an experiment's duration can be reduced. We could, of course, change our statistical parameters. Accepting higher Type I or Type II error rates will reduce the number of observations needed. So too will increasing the minimum effect size we aim to detect: it's much easier to detect larger changes.
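As a rough sketch of the calculation described above, the code below estimates a per-group sample size and converts it into a minimum duration. It is one possible approach rather than the lesson's own method: it assumes a rate-based metric, a one-sided two-proportion z-test with the usual normal approximation, and two equally sized groups. The function names and example rates are hypothetical.

```python
import math
from statistics import NormalDist


def sample_size_per_group(p_null, p_alt, alpha=0.05, power=0.80):
    """Approximate observations per group for a one-sided two-proportion z-test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha)  # critical value for the Type I error rate
    z_beta = NormalDist().inv_cdf(power)       # quantile for power = 1 - Type II error rate
    sd_null = math.sqrt(2 * p_null * (1 - p_null))
    sd_alt = math.sqrt(p_null * (1 - p_null) + p_alt * (1 - p_alt))
    n = ((z_alpha * sd_null + z_beta * sd_alt) / (p_alt - p_null)) ** 2
    return math.ceil(n)


def experiment_days(n_per_group, obs_per_day, n_groups=2):
    """Minimum whole days needed, where obs_per_day is the total across groups."""
    return math.ceil(n_per_group * n_groups / obs_per_day)


n_base = sample_size_per_group(0.10, 0.12)
print(n_base, experiment_days(n_base, obs_per_day=500))

# Each relaxation below lowers the required sample size.
print(sample_size_per_group(0.10, 0.12, alpha=0.10))  # higher Type I error rate
print(sample_size_per_group(0.10, 0.12, power=0.70))  # higher Type II error rate
print(sample_size_per_group(0.10, 0.13))              # larger effect to detect
```

Comparing the printed sample sizes shows how each of the statistical parameters trades reliability for a shorter experiment.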

Another option is to change the unit of diversion. A 'wider' unit of diversion will result in more observations being generated. For example, you could consider moving from a cookie-based diversion in a web-based experiment to an event-based diversion like pageviews. The tradeoff is that event-based diversion could create inconsistent website experiences for users who visit the site multiple times.

> Example of experiment with fewer observations

![5](images/5.PNG)

> Example of experiment with larger samples

![6](images/6.PNG)

## Using Dummy Test (AA Test)
When it comes to designing an experiment, it might be useful to run a dummy test as a predecessor to or as part of that process. In a dummy test, you will implement the same steps that you would in an actual experiment to assign the experimental units into groups. However, the experimental manipulation won't actually be implemented, and the groups will be treated equivalently.

There are multiple reasons to run a dummy test. First, a dummy test can expose errors in the randomization or assignment procedures. If an invariant metric is found to have a statistically significant difference, or if some other systematic bias is identified, then even a short dummy test is worth the investment because it helps avoid larger problems down the line. A second reason to run a dummy test is to collect data on metrics' behaviors. If historic data is not enough to predict the outcome of recorded metrics or allow for experiment duration to be computed, then a dummy test can be useful for getting baselines.
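One simple invariant metric is the number of units assigned to each group. The sketch below checks whether the observed split is consistent with the intended one, using a normal approximation to the binomial distribution without continuity correction. The function name, the 50/50 default split, and the example counts are assumptions made for illustration.

```python
import math
from statistics import NormalDist


def invariant_count_check(n_control, n_experiment, p_expected=0.5, alpha=0.05):
    """Two-sided z-test that the control share of units equals p_expected.

    Returns (z, p_value, flagged), where flagged=True suggests a problem
    with randomization or assignment.
    """
    n_total = n_control + n_experiment
    expected = p_expected * n_total
    sd = math.sqrt(p_expected * (1 - p_expected) * n_total)
    z = (n_control - expected) / sd
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value, p_value < alpha


print(invariant_count_check(5000, 5000))  # perfectly balanced groups
print(invariant_count_check(5000, 5400))  # imbalance worth investigating
```

A flagged result does not say what went wrong, only that the assignment procedure deserves a closer look before any evaluation metric is trusted.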

Of course, performing a dummy test requires an investment of resources, the most important of which is time. If time is of the essence, then you may need to just go ahead with the experiment, keeping an eye on invariant metrics for any trouble. An alternative approach is to perform a hybrid test. In the A/B testing paradigm, this can take the form of an A/A/B test. That is, we split the data into three groups: two control and one experimental. A comparison between control groups can be used to learn about null-environment properties before making inferences on the effect of the experimental manipulation.
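A minimal sketch of assigning units to the three groups of an A/A/B test is shown below. It uses a hash of the unit identifier so that the same unit always lands in the same group. The group labels, the salt string, and the identifier format are hypothetical, and equal-sized groups are an assumption rather than a requirement.

```python
import hashlib
from collections import Counter


def assign_group(unit_id, groups=("control_1", "control_2", "experiment"),
                 salt="aab-test"):
    """Deterministically map a unit identifier to one of the groups."""
    digest = hashlib.sha256(f"{salt}:{unit_id}".encode("utf-8")).hexdigest()
    # Interpret the hash as an integer and reduce it to a group index.
    return groups[int(digest, 16) % len(groups)]


# The same unit is always assigned to the same group.
print(assign_group("user-42") == assign_group("user-42"))  # True

# Group sizes should come out roughly equal.
print(Counter(assign_group(f"user-{i}") for i in range(3000)))
```

Comparing `control_1` against `control_2` then plays the role of the dummy test, while either control (or both pooled) can be compared against `experiment`.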

## Missing Data

Three strategies for working with missing values include:

1. We can remove (or “drop”) the rows or columns holding the missing values.
2. We can impute the missing values.
3. We can build models that work around them, and only use the information provided.

Though dropping rows and/or columns holding missing values is quite easy to do using numpy and pandas, it is often not appropriate.

Understanding why the data is missing is important before dropping these rows and columns. There are a number of situations in which dropping values is not a good idea. These include:

1. Dropping data values associated with the effort or time an individual put into a survey.
2. Dropping data values associated with sensitive information.

In either of these cases, the missing values hold information. A quick removal of the rows or columns associated with these missing values would discard information that could be used to better inform models.

Instead of removing these values, we might keep track of the missing values using indicator values, or counts associated with how many questions an individual skipped.
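A minimal pandas sketch of this idea follows. The survey data and column names are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical survey responses; NaN marks a skipped question.
survey = pd.DataFrame({
    "age": [34, np.nan, 52, 41],
    "income": [np.nan, np.nan, 61000, 48000],
    "satisfaction": [4, 3, np.nan, 5],
})

# One indicator column per question: 1 if the answer is missing, else 0.
indicators = survey.isna().astype(int).add_suffix("_missing")

# Number of questions each respondent skipped.
skipped = survey.isna().sum(axis=1).rename("n_skipped")

tracked = pd.concat([survey, indicators, skipped], axis=1)
print(tracked)
```

The indicator and count columns can then be used as features in their own right, so the fact that a value was skipped is kept even if the value itself is later imputed.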

### When is it OK to remove data?

The previous section covered cases in which dropping rows or columns associated with missing values would not be a good idea. There are other cases in which dropping rows or columns associated with missing values would be okay.

A few instances in which dropping a row might be okay are:

1. Dropping missing data associated with mechanical failures.
2. Dropping rows where the missing value is in the column that you are interested in predicting.

There are also cases, unrelated to missing values, in which you should consider dropping data:

1. Dropping columns with no variability in the data.
2. Dropping data associated with information that you know is not correct.

Before removing data, think about why it is missing or why it was entered incorrectly, and whether an alternative to dropping the values might be used instead.

### Other considerations when removing data

One common strategy for working with missing data is to understand the proportion of a column that is missing. If a large proportion of a column is missing data, this is a reason to consider dropping it.

There are easy ways to use pandas to create dummy variables to track the missing values, so you can see if these missing values actually hold information (regardless of the proportion that is missing) before choosing to remove a full column.
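The sketch below shows one way to do both checks with pandas: compute the share of missing values per column, then compare an outcome between rows with and without a missing value. The data, the column names, and the choice of `purchased` as the outcome are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical data in which income is missing for half of the rows.
df = pd.DataFrame({
    "income": [np.nan, 52000, np.nan, 61000, np.nan, 48000],
    "age": [25, 31, 47, 38, 29, 55],
    "purchased": [0, 1, 0, 1, 0, 1],
})

# Proportion of missing values in each column.
missing_share = df.isna().mean()
print(missing_share)

# Dummy variable tracking whether income is missing.
df["income_missing"] = df["income"].isna().astype(int)

# If the outcome differs between the two groups, the missingness itself
# carries information and the column should not be dropped blindly.
outcome_by_missing = df.groupby("income_missing")["purchased"].mean()
print(outcome_by_missing)
```

In this contrived example the outcome differs completely between the two groups, which is exactly the situation where dropping the column would throw away a useful signal.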

>If an entire column or row is missing, we can remove it, as there is no information being provided.

>For the column with mixed heights, we should be able to (for the most part) map those to a consistent measurement (all meters or all feet). We don't want to just drop this.

>If the response is missing for a row, there is no target for a model to learn from. You might be interested in predicting those values later, but without a target/response, your model cannot learn from them. These rows are not providing information for training any sort of supervised learning model.

>Though it is common to drop columns just because not many values exist, there may be value to grouping rows that have a column missing as compared to rows that do not have a missing value for that particular column.

>If there is no variability (all the values are the same) in a column, it does not provide value for prediction or finding differences in your data. It should be dropped for this reason. Keeping it doesn't really hurt, but it can lead to confusing results as we will see later in this lesson.

>When you have incorrect data, you do not want to input this information into your conclusions. You should attempt to correct these values, or you may need to drop them.
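The removal cases described above can be sketched with pandas as follows. The data frame and its column names (`feature`, `empty`, `constant`, `response`) are invented for the example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "feature": [1.0, 2.0, np.nan, 4.0, np.nan],
    "empty": [np.nan] * 5,                       # column with no information
    "constant": ["a", "a", "a", "a", np.nan],    # no variability
    "response": [10.0, np.nan, 30.0, 40.0, np.nan],
})

# Remove rows and columns that are entirely missing.
cleaned = df.dropna(axis=0, how="all").dropna(axis=1, how="all")

# Remove rows with no response, since they cannot be used for training.
cleaned = cleaned.dropna(subset=["response"])

# Remove columns with no variability among the remaining rows.
n_unique = cleaned.nunique()
cleaned = cleaned.drop(columns=n_unique[n_unique <= 1].index)

print(cleaned)
```

Note that the row with a missing `feature` but a known `response` is kept, since that missing value might still be imputed or tracked with an indicator.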

## Analyzing Multiple Metrics

If you have multiple evaluation metrics to track in your experiment, you should be careful to specify the criteria for calling your experiment a success so that you don't make excessive errors in your judgment. To keep things simple, consider a scenario with two metrics and the following assumptions:

* Assume independent measures
* Success if either metric shows statistical significance
* 5% Type I error rate

Given the above assumptions, what is the probability that we falsely call the experiment a success, assuming that there is no actual effect of our changes?

> p(both not significant) = (0.95)*(0.95) = 0.9025

> p(at least one significant) = 1 - 0.9025 = 0.0975 = **9.75%** 

### Type I Errors

If you're tracking multiple evaluation metrics, make sure that you're aware of how the Type I error rates on individual metrics can affect the overall chance of making some kind of Type I error. The simplest case we can consider is if we have _n_ independent evaluation metrics, and seeing a statistically significant result on any one of them would be enough to call the manipulation a success. In this case, the probability of making at least one Type I error is given by $\alpha_{over} = 1 - (1-\alpha_{ind})^n$, illustrated in the image below for individual $\alpha_{ind} = .05$ and $\alpha_{ind} = .01$:

![7](images/7.png)
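The formula is easy to evaluate directly. The short sketch below (the function name is hypothetical) reproduces the 9.75% figure from the two-metric example and shows how quickly the overall error rate grows with the number of independent metrics.

```python
def overall_error_rate(alpha_ind, n_metrics):
    """Probability of at least one Type I error across independent metrics."""
    return 1 - (1 - alpha_ind) ** n_metrics


for n in (1, 2, 5, 10):
    print(n, round(overall_error_rate(0.05, n), 4))
```

For `n = 2` this gives 0.0975, matching the calculation earlier in this section.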

To protect against this, we need to introduce a correction factor on the individual test error rate so that the overall error rate is at most the desired level. A conservative approach is to divide the overall error rate by the number of metrics tested:
$$
\alpha_{ind} = \frac{\alpha_{over}}{n}
$$

This is known as the Bonferroni correction. If we assume independence between metrics, we can do a little bit better with the Šidák correction:
$$
\alpha_{ind} = 1 - (1 - \alpha_{over})^{\frac{1}{n} }
$$

![8](images/8.png)

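The line for the Šidák correction is only slightly higher than the line drawn by the Bonferroni correction, meaning that it allows a slightly larger individual error rate for the same overall error rate.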
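Both corrections are one-liners in code. The sketch below uses the standard forms, Bonferroni as $\alpha_{over}/n$ and Šidák as $1 - (1 - \alpha_{over})^{1/n}$, with hypothetical function names, and checks that the Šidák value recovers the desired overall rate when metrics are independent.

```python
def bonferroni(alpha_over, n_metrics):
    """Individual error rate under the Bonferroni correction."""
    return alpha_over / n_metrics


def sidak(alpha_over, n_metrics):
    """Individual error rate under the Sidak correction (assumes independence)."""
    return 1 - (1 - alpha_over) ** (1 / n_metrics)


n = 5
a_bonf = bonferroni(0.05, n)
a_sidak = sidak(0.05, n)
print(a_bonf, a_sidak)

# Overall error rate implied by each individual rate for independent metrics.
print(1 - (1 - a_bonf) ** n)   # slightly below 0.05 (conservative)
print(1 - (1 - a_sidak) ** n)  # 0.05 up to floating-point error
```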

In real life, evaluation scenarios are rarely so straightforward. Metrics will likely be correlated in some way, rather than being independent. If a positive correlation exists, then knowing the outcome of one metric will make it more likely for a correlated metric to also point in the same way. In this case, the corrections above will be more conservative than necessary, resulting in an overall error rate smaller than the desired level. (In cases of negative correlation, the true error rate could go either way, depending on the types of tests performed.)

In addition, we might need multiple metrics to show statistical significance to call an experiment a success, or there may be different degrees of success depending on which metrics appear to be moved by the manipulation. One metric may not be enough to make it worth deploying a change tested in an experiment. Reducing the individual error rate will make it harder for a truly significant effect to show up as statistically significant. That is, reducing the Type I error rate will also increase the Type II error rate – another conservative shift.

## Early Stopping

As the workspace below shows, there are significant risks in peeking ahead and making an early decision if doing so is not planned for in the design. If you haven't accounted for the effects of peeking on your error rate, then it's best to resist the temptation to look at the results early, and only perform a final analysis at the end of the experiment. This is another reason why it's important to design an experiment ahead of any data collection.
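The effect of unplanned peeking can be illustrated with a small simulation. This is a sketch under stated assumptions, not the course workspace itself: both groups share the same true success rate (so every significant result is a Type I error), results are inspected at five evenly spaced points, and each inspection uses a two-sided pooled two-proportion z-test at the 5% level. The function name and all parameter values are hypothetical.

```python
import numpy as np
from statistics import NormalDist


def peeking_error_rates(n_sims=2000, n_per_group=1000, n_peeks=5,
                        p=0.1, alpha=0.05, seed=0):
    """Return (error rate if stopping at any significant peek,
    error rate if testing only at the planned end)."""
    rng = np.random.default_rng(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)

    # Per-group sample sizes at each peek; the last one is the planned end.
    peek_sizes = np.arange(1, n_peeks + 1) * n_per_group // n_peeks

    # Cumulative success counts for two groups with the same true rate.
    control = (rng.random((n_sims, n_per_group)) < p).cumsum(axis=1)
    experiment = (rng.random((n_sims, n_per_group)) < p).cumsum(axis=1)

    c = control[:, peek_sizes - 1]
    e = experiment[:, peek_sizes - 1]
    pooled = (c + e) / (2 * peek_sizes)
    se = np.sqrt(pooled * (1 - pooled) * 2 / peek_sizes)
    z = (e - c) / peek_sizes / se

    significant = np.abs(z) > z_crit
    return significant.any(axis=1).mean(), significant[:, -1].mean()


any_peek_rate, final_only_rate = peeking_error_rates()
print(f"stop at first significant peek: {any_peek_rate:.3f}")
print(f"test only at the end:           {final_only_rate:.3f}")
```

Because the final test is one of the peeks, the first rate can never be lower than the second; in practice it comes out well above the nominal 5% level.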

Note that there are ways of putting together a design to allow for making an early decision on an experiment. In the workspace, we showed how to treat the problem like a multiple comparisons problem, adjusting the individual test-wise error rate to preserve an overall error rate. For continuous tracking, [this page](https://www.evanmiller.org/sequential-ab-testing.html) describes a rule of thumb for rate-based metrics, tracking the number of successes in each group and stopping the experiment once the counts' sum or difference exceeds some threshold. More generally, tests like the [sequential probability ratio test](https://en.wikipedia.org/wiki/Sequential_probability_ratio_test) can be developed to make an early stopping decision while an experiment is running, if it looks statistically unlikely for a metric to move past or fall back against the statistical significance bound.