# Introduction to Hypothesis Testing

# 1. Hypothesis tests and z-scores 

<b>1.1 Uses of A/B testing</b>

A company used A/B testing on their website when launching a product. One version of the page showed an advertisement for a discount, and one version did not. Half the users saw one version of the page, and the other half saw the second version of the page.

What is the main reason to use an A/B test?

- It lets users vote on their preferred web page.

- It allows you to only give discounts to half your users.

- It is a method used to directly determine the sample size needed for your analysis.

- It provides a way to check outcomes of competing scenarios and decide which way to proceed.  (True)

- It reduces the number of errors in production.

A/B testing lets you compare scenarios to see which best achieves some goal.

<b>1.2 Calculating the sample mean</b>

The late_shipments dataset contains supply chain data on the delivery of medical supplies. Each row represents one delivery of a part. The late column denotes whether or not the part was delivered late. A value of "Yes" means that the part was delivered late, and a value of "No" means the part was delivered on time.

You'll begin your analysis by calculating a point estimate (or sample statistic), namely the proportion of late shipments.

In pandas, a value's proportion in a categorical DataFrame column can be quickly calculated using the syntax:

```python
prop = (df['col'] == val).mean()
```
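As a minimal sketch of why this syntax works, the example below uses a small hypothetical DataFrame (the name `toy` and its values are made up for illustration and are not part of the late_shipments dataset). Comparing a column to a value gives a boolean Series, and the mean of a boolean Series is the fraction of `True` entries.

```python
import pandas as pd

# Hypothetical toy data: five deliveries, two of them late
toy = pd.DataFrame({"late": ["Yes", "No", "No", "Yes", "No"]})

# The comparison yields a boolean Series; its mean is the share of True values
prop_late = (toy["late"] == "Yes").mean()
print(prop_late)
```

Two of the five rows match, so the proportion here is 2/5.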

In [2]:
import pandas as pd
late_shipments = pd.read_feather('C:\\Users\\yazan\\Desktop\\Data_Analytics\\9-Introduction to Hypothesis Testing\\Datasets\\late_shipments.feather')
print(late_shipments.head())

        id       country managed_by  fulfill_via vendor_inco_term  \
0  36203.0       Nigeria   PMO - US  Direct Drop              EXW   
1  30998.0      Botswana   PMO - US  Direct Drop              EXW   
2  69871.0       Vietnam   PMO - US  Direct Drop              EXW   
3  17648.0  South Africa   PMO - US  Direct Drop              DDP   
4   5647.0        Uganda   PMO - US  Direct Drop              EXW   

  shipment_mode  late_delivery late product_group    sub_classification  ...  \
0           Air            1.0  Yes          HRDT              HIV test  ...   
1           Air            0.0   No          HRDT              HIV test  ...   
2           Air            0.0   No           ARV                 Adult  ...   
3         Ocean            0.0   No           ARV                 Adult  ...   
4           Air            0.0   No          HRDT  HIV test - Ancillary  ...   

  line_item_quantity line_item_value pack_price unit_price  \
0             2996.0       266644.00      

In [3]:
# Calculate the proportion of late shipments in the sample; that is, the mean
# of the cases where the late column is "Yes".

# Calculate the proportion of late shipments
late_prop_samp = (late_shipments['late_delivery']==1).mean()

# Print the results
print(late_prop_samp)

0.061


The proportion of late shipments in the sample is 0.061, or 6.1%.

<b>1.3 Calculating a z-score</b>

Since variables have arbitrary ranges and units, we need to standardize them. For example, a hypothesis test that gave different answers if the variables were in Euros instead of US dollars would be of little value. Standardization avoids that.
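The currency point can be checked with a short sketch. The prices and the exchange rate below are hypothetical values chosen only for illustration, and `standardize` is a helper name introduced here, not something from the course material. Subtracting the mean and dividing by the standard deviation gives the same values whichever currency the data is expressed in.

```python
import numpy as np

# Hypothetical prices in US dollars (illustrative values only)
prices_usd = np.array([10.0, 12.5, 9.0, 15.0, 11.0])

# Assumed exchange rate, chosen only for illustration
usd_to_eur = 0.9
prices_eur = prices_usd * usd_to_eur

def standardize(values):
    """Subtract the mean and divide by the sample standard deviation."""
    return (values - values.mean()) / values.std(ddof=1)

# The standardized values agree regardless of the currency used
print(np.allclose(standardize(prices_usd), standardize(prices_eur)))
```

Rescaling multiplies both the deviations from the mean and the standard deviation by the same factor, so the factor cancels.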

One standardized value of interest in a hypothesis test is called a z-score. To calculate it, you need three numbers: the sample statistic (point estimate), the hypothesized statistic, and the standard error of the statistic (estimated from the bootstrap distribution).
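Put together, the z-score is the sample statistic minus the hypothesized statistic, divided by the standard error. The sketch below wraps that calculation in a helper; the function name `z_score_from_bootstrap` and the numbers passed to it are hypothetical, and the standard error is assumed to be the sample standard deviation of the bootstrap distribution.

```python
import numpy as np

def z_score_from_bootstrap(sample_stat, hypothesized_stat, boot_distn):
    """Standardize a sample statistic against a hypothesized value.

    The standard error is estimated as the sample standard deviation
    of the bootstrap distribution.
    """
    std_error = np.std(boot_distn, ddof=1)
    return (sample_stat - hypothesized_stat) / std_error

# Hypothetical numbers for illustration only
boot_distn = [0.058, 0.061, 0.064, 0.059, 0.063]
z = z_score_from_bootstrap(0.061, 0.06, boot_distn)
print(z)
```

A real bootstrap distribution would contain thousands of replicates rather than five; the short list only keeps the arithmetic easy to follow.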

The sample statistic is available as late_prop_samp.

late_shipments_boot_distn is a bootstrap distribution of the proportion of late shipments, available as a list.

pandas and numpy are loaded with their usual aliases.

In [5]:
import numpy as np

# Calculate the proportion of late shipments
late_prop_samp = (late_shipments['late_delivery'] == 1).mean()

# Build the bootstrap distribution of the proportion of late shipments
late_shipments_boot_distn = []

# Step 3. Repeat steps 1 & 2 many times, appending to a list
for i in range(5000):
    # Step 1. Resample with replacement
    resample = late_shipments.sample(frac=1, replace=True)
    # Step 2. Calculate the point estimate
    late_shipments_boot_distn.append(np.mean(resample['late_delivery']))

# Hypothesize that the proportion is 6%
late_prop_hyp = 0.06

# Calculate the standard error
std_error = np.std(late_shipments_boot_distn, ddof=1)

# Find z-score of late_prop_samp
z_score = (late_prop_samp - late_prop_hyp) / std_error

# Print z_score
print(z_score)

0.13221588287567415
