# Uses of A/B testing

In the video, you saw how Electronic Arts used A/B testing on their website when launching SimCity 5. One version of the page showed an advertisement for a discount, and one version did not. Half the users saw one version of the page, and the other half saw the second version of the page.

What is the main reason to use an A/B test?

- It provides a way to check outcomes of competing scenarios and decide which way to proceed.

# Calculating the sample mean

The `late_shipments` dataset contains supply chain data on the delivery of medical supplies. Each row represents one delivery of a part. The `late` column denotes whether or not the part was delivered late. A value of `"Yes"` means that the part was delivered late, and a value of `"No"` means the part was delivered on time.

You'll begin your analysis by calculating a point estimate (or sample statistic), namely the proportion of late shipments.

In pandas, a value's proportion in a categorical DataFrame column can be quickly calculated using the syntax:

`prop = (df['col'] == val).mean()`

`late_shipments` is available, and pandas is loaded as `pd`.

In [None]:
# # Print the late_shipments dataset
# print(late_shipments)

# # Calculate the proportion of late shipments
# late_prop_samp = (late_shipments["late"] == "Yes").mean()

# # Print the results
# print(late_prop_samp)
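
As a minimal sketch of the comparison syntax, the toy DataFrame below stands in for `late_shipments` (the name `toy_shipments` and its eight rows are made up for illustration). It also shows how the single proportion relates to what `value_counts(normalize=True)` returns, which is a Series with one proportion per category.

```python
import pandas as pd

# Hypothetical stand-in for late_shipments: 2 late deliveries out of 8
toy_shipments = pd.DataFrame(
    {"late": ["No", "Yes", "No", "No", "No", "Yes", "No", "No"]}
)

# Comparing to "Yes" gives a Boolean Series; its mean is the share of True values
late_prop_toy = (toy_shipments["late"] == "Yes").mean()
print(late_prop_toy)  # 0.25

# value_counts(normalize=True) returns the proportion of every category,
# so the "Yes" entry has to be selected to get the same number
all_props = toy_shipments["late"].value_counts(normalize=True)
print(all_props["Yes"])  # 0.25
```

The comparison form yields a single number directly, which is convenient when the result feeds into later arithmetic such as a z-score.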

# Calculating a z-score

Since variables have arbitrary ranges and units, we need to standardize them. For example, a hypothesis test that gave different answers if the variables were in Euros instead of US dollars would be of little value. Standardization avoids that.
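
The Euros-versus-dollars point can be checked numerically. The sketch below uses made-up prices and a hypothetical exchange rate, and for brevity it estimates the standard error of the mean with the formula `s / sqrt(n)` rather than a bootstrap distribution. The helper `z_score_of_mean` is an illustrative name, not part of any library.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Made-up prices in US dollars and a hypothetical exchange rate
prices_usd = rng.normal(loc=100.0, scale=15.0, size=500)
usd_to_eur = 0.9
prices_eur = prices_usd * usd_to_eur


def z_score_of_mean(sample, hypothesized_mean):
    """Standardize the sample mean against a hypothesized mean."""
    # Formula-based standard error of the mean (assumption: not bootstrapped)
    std_error = np.std(sample, ddof=1) / np.sqrt(len(sample))
    return (np.mean(sample) - hypothesized_mean) / std_error


# The same hypothesis, expressed in each currency
z_usd = z_score_of_mean(prices_usd, 98.0)
z_eur = z_score_of_mean(prices_eur, 98.0 * usd_to_eur)

# Rescaling multiplies numerator and denominator by the same factor
print(np.isclose(z_usd, z_eur))  # True
```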

One standardized value of interest in a hypothesis test is called a z-score. To calculate it, you need three numbers: the sample statistic (point estimate), the hypothesized statistic, and the standard error of the statistic (estimated from the bootstrap distribution).

The sample statistic is available as `late_prop_samp`.

`late_shipments_boot_distn` is a bootstrap distribution of the proportion of late shipments, available as a list.

pandas and numpy are loaded with their usual aliases.

In [1]:
# # Hypothesize that the proportion is 6%
# late_prop_hyp = 0.06

# # Calculate the standard error
# std_error = np.std(late_shipments_boot_distn, ddof = 1)

# # Find z-score of late_prop_samp
# z_score = (np.mean(late_prop_samp) - late_prop_hyp) / std_error

# # Print z_score
# print(z_score)
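
Because the course data is not bundled here, the following sketch runs the same three steps on synthetic data. The sample (61 late shipments out of 1,000), the number of resamples, and the names `late_flags`, `boot_distn_toy`, and `late_prop_toy` are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=42)

# Hypothetical sample: 1 marks a late shipment, 0 an on-time one
late_flags = np.array([1] * 61 + [0] * 939)
late_prop_toy = late_flags.mean()

# Bootstrap distribution: resample with replacement and record
# the proportion of late shipments in each resample
boot_distn_toy = [
    rng.choice(late_flags, size=len(late_flags), replace=True).mean()
    for _ in range(2000)
]

# Hypothesized proportion, standard error, and z-score
late_prop_hyp = 0.06
std_error = np.std(boot_distn_toy, ddof=1)
z_score = (late_prop_toy - late_prop_hyp) / std_error

print(std_error, z_score)
```

The standard deviation of the bootstrap distribution plays the role of the standard error, so the z-score measures how many standard errors the sample proportion lies from the hypothesized value.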

# Criminal trials and hypothesis tests

In the video, you saw how hypothesis testing follows a similar process to criminal trials.

Which of the following correctly matches up a criminal trial with properties of a hypothesis test?

- Just as the defendant is initially assumed not guilty, the null hypothesis is first assumed to be true.


# Left tail, right tail, two tails

Hypothesis tests are used to determine whether the sample statistic lies in the tails of the null distribution. However, the way that the alternative hypothesis is phrased affects which tail(s) we are interested in.

<center><img src="images/01.07.jpg" style="width: 400px; height: 300px;"/></center>
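
In code, the choice of tail changes which area under the standard normal curve is taken as the p-value. An alternative phrased as "less than" uses the left tail, "greater than" the right tail, and "different from" both tails. The sketch below uses a made-up z-score of 1.5 and assumes the null distribution is approximately normal.

```python
from scipy.stats import norm

z_score = 1.5  # hypothetical value, for illustration only

# Left-tailed: alternative says the parameter is less than the hypothesized value
p_left = norm.cdf(z_score)

# Right-tailed: alternative says the parameter is greater than the hypothesized value
p_right = 1 - norm.cdf(z_score)

# Two-tailed: alternative says the parameter differs in either direction
p_two = 2 * (1 - norm.cdf(abs(z_score)))

print(p_left, p_right, p_two)
```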

# Calculating p-values

In order to determine whether to choose the null hypothesis or the alternative hypothesis, you need to calculate a p-value from the z-score.

You'll now return to the late shipments dataset and the proportion of late shipments.

The null hypothesis, $H_0$, is that the proportion of late shipments is six percent.

The alternative hypothesis, $H_A$, is that the proportion of late shipments is greater than six percent.

The observed sample statistic, `late_prop_samp`, the hypothesized value, `late_prop_hyp` (6%), and the bootstrap standard error, `std_error`, are available. `norm` from `scipy.stats` has also been loaded without an alias.

What type of test should be used for this alternative hypothesis?
- Right-tailed

In [2]:
# # Calculate the z-score of late_prop_samp
# z_score = (np.mean(late_prop_samp) - late_prop_hyp) / std_error

# # Calculate the p-value
# p_value = 1 - norm.cdf(z_score)
                 
# # Print the p-value
# print(p_value) 
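
A self-contained version of this calculation with made-up inputs might look like the sketch below. The values of `samp_prop_toy` and `std_error_toy` are invented for illustration and are not the course results. `norm.sf` is SciPy's survival function, which equals `1 - norm.cdf` and is often preferred for very small right-tail probabilities.

```python
from scipy.stats import norm

# Made-up inputs, not the course values
samp_prop_toy = 0.061
late_prop_hyp = 0.06
std_error_toy = 0.0058

# Standardize, then take the area to the right of the z-score
z_score = (samp_prop_toy - late_prop_hyp) / std_error_toy
p_value = 1 - norm.cdf(z_score)

# Equivalent right-tail probability via the survival function
p_value_sf = norm.sf(z_score)

print(z_score, p_value, p_value_sf)
```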

# Decisions from p-values

The p-value, denoted here as $p$, is a measure of the amount of evidence to reject the null hypothesis or not. By comparing the p-value to the significance level, $\alpha$, you can make a decision about which hypothesis to support.

Which of the following is the correct conclusion from the decision rule for a significance level $\alpha$?

- If the p-value is less than or equal to the significance level, you reject the null hypothesis.
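
The decision rule can be written as a small function. `decide` is a hypothetical helper name, and the default significance level of 0.05 is only a common convention, not something fixed by the exercise.

```python
def decide(p_value, alpha=0.05):
    """Apply the decision rule at significance level alpha."""
    if p_value <= alpha:
        return "reject the null hypothesis"
    return "fail to reject the null hypothesis"


print(decide(0.44))  # fail to reject the null hypothesis
print(decide(0.03))  # reject the null hypothesis
print(decide(0.05))  # reject the null hypothesis
```

Note that a p-value exactly equal to the significance level falls on the "reject" side of the rule.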


# Calculating a confidence interval

If you give a single estimate of a sample statistic, you are bound to be wrong by some amount. For example, the hypothesized proportion of late shipments was 6%. Even if the evidence is consistent with the null hypothesis that the proportion of late shipments equals this value, the proportion in any new sample of shipments is likely to be a little different due to sampling variability. Consequently, it's a good idea to state a confidence interval. That is, you say, "we are 95% 'confident' that the proportion of late shipments is between A and B" (for some values of A and B).

Sampling in Python demonstrated two methods for calculating confidence intervals. Here, you'll use quantiles of the bootstrap distribution to calculate the confidence interval.

`late_prop_samp` and `late_shipments_boot_distn` are available; `pandas` and `numpy` are loaded with their usual aliases.

In [3]:
# # Calculate 95% confidence interval using quantile method
# lower = np.quantile(late_shipments_boot_distn, 0.025)
# upper = np.quantile(late_shipments_boot_distn, 0.975)

# # Print the confidence interval
# print((lower, upper))

Does the confidence interval match up with the conclusion to stick with the original assumption that 6% is a reasonable value for the unknown population parameter?

- Yes. Since 0.06 is included in the 95% confidence interval and we failed to reject the null hypothesis due to a large p-value, the two results agree.
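
The check described in this question can be sketched on a synthetic bootstrap distribution. The array `boot_distn_toy` below is drawn from a normal distribution with made-up center and spread, purely as a stand-in for `late_shipments_boot_distn`.

```python
import numpy as np

rng = np.random.default_rng(seed=1)

# Stand-in bootstrap distribution (made-up center and spread)
boot_distn_toy = rng.normal(loc=0.061, scale=0.0076, size=5000)

# Quantile method: cut 2.5% off each tail to get a 95% interval
lower, upper = np.quantile(boot_distn_toy, [0.025, 0.975])

# Does the interval contain the hypothesized proportion?
late_prop_hyp = 0.06
contains_hyp = lower <= late_prop_hyp <= upper

print((lower, upper), contains_hyp)
```

When the hypothesized value lies inside the interval, that is consistent with failing to reject the null hypothesis at the matching significance level.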

# Type I and type II errors

For hypothesis tests and for criminal trials, there are two states of truth and two possible outcomes. Two combinations are correct test outcomes, and there are two ways it can go wrong.

The errors are known as false positives (or "type I errors") and false negatives (or "type II errors").

<center><img src="images/01.12.jpg" style="width: 400px; height: 300px;"/></center>
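
The four combinations of truth and test outcome can be enumerated in a few lines. `classify_outcome` is a hypothetical helper written for illustration.

```python
def classify_outcome(null_is_true, reject_null):
    """Label one combination of the true state and the test decision."""
    if null_is_true and reject_null:
        # Rejected a null hypothesis that was actually true
        return "false positive (type I error)"
    if not null_is_true and not reject_null:
        # Failed to reject a null hypothesis that was actually false
        return "false negative (type II error)"
    return "correct decision"


# Walk through all four combinations
for null_is_true in (True, False):
    for reject_null in (True, False):
        label = classify_outcome(null_is_true, reject_null)
        print(f"null true: {null_is_true}, rejected: {reject_null} -> {label}")
```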
