# Proportion Tests

# 1. One-sample proportion tests

<b>1.1 t for proportions?</b>

Some of the hypothesis tests in this course have used a <i>z</i> test statistic and some have used a <i>t</i> test statistic. To get the correct p-value, you need to use the right type of test statistic.

Do tests of proportion(s) use a <i>z</i> or a <i>t</i> test statistic, and why?

- <i>t</i> : There are two estimates used for unknown values in the test statistic for proportion(s).

- <i>z</i> : Since the population standard deviation is always known for proportions, we always compute z-scores.

- <b><font color='green'><i>z</i> : The test statistic for proportion(s) has only one estimate of a parameter instead of two.</font></b>

- <i>t</i> : Proportions are ratios, so you need to estimate the numerator and the denominator.

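A <i>t</i>-test is needed for tests of mean(s) because the test statistic relies on two estimates: the sample statistic in the numerator and the sample standard deviation in the standard error. The extra estimate adds variability, which the <i>t</i>-distribution accounts for. For proportions, the standard error is computed from the hypothesized proportion, so only one quantity is estimated.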
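The claim that a <i>z</i> statistic is appropriate can be checked with a small simulation. The sketch below draws many samples under a null hypothesis and confirms that the resulting test statistics have a mean near 0 and a standard deviation near 1. The values of p_0, n, the number of simulations, and the seed are assumptions chosen for illustration, not values taken from the dataset.

```python
import numpy as np

# Assumed values for illustration only
p_0 = 0.06        # hypothesized proportion
n = 1000          # sample size
n_sims = 10_000   # number of simulated samples

rng = np.random.default_rng(42)

# Simulate the number of "successes" in each sample when the null hypothesis is true
counts = rng.binomial(n, p_0, size=n_sims)
p_hats_sim = counts / n

# The standard error uses only p_0 and n, so p_hat is the only estimate involved
std_error = np.sqrt(p_0 * (1 - p_0) / n)
z_scores = (p_hats_sim - p_0) / std_error

# Both values should be close to those of a standard normal (0 and 1)
print(z_scores.mean(), z_scores.std())
```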

<b>1.2 Test for single proportions</b>

In Chapter 1, you calculated a p-value for a test hypothesizing that the proportion of late shipments was greater than 6%. In that chapter, you used a bootstrap distribution to estimate the standard error of the statistic. An alternative is to use an equation for the standard error based on the sample proportion, hypothesized proportion, and sample size.

$$
z = \frac{\hat{p} - p_{0}}{\sqrt{\frac{p_{0}(1 - p_{0})}{n}}}
$$
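As a quick check of the formula, here is a minimal sketch that wraps it in a function. The function name one_sample_prop_z and the counts (61 late shipments out of 1000) are assumptions made for this illustration.

```python
import numpy as np
from scipy.stats import norm


def one_sample_prop_z(p_hat, p_0, n):
    """Return the z-score for a one-sample proportion test."""
    std_error = np.sqrt(p_0 * (1 - p_0) / n)
    return (p_hat - p_0) / std_error


# Illustrative numbers (assumed): 61 late shipments in a sample of 1000
z = one_sample_prop_z(p_hat=61 / 1000, p_0=0.06, n=1000)

# Right-tailed p-value, matching an alternative of "greater than"
p_value = norm.sf(z)
print(z, p_value)
```

Here norm.sf(z) is the survival function, equivalent to 1 - norm.cdf(z) but numerically more accurate for large z.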

In [76]:
# Load the late_shipments DataFrame
import pandas as pd
import numpy as np

late_shipments = pd.read_feather("C:\\Users\\yazan\\Desktop\\Data_Analytics\\9-Introduction to Hypothesis Testing\\Datasets\\late_shipments.feather")
print(late_shipments.head())

        id       country managed_by  fulfill_via vendor_inco_term  \
0  36203.0       Nigeria   PMO - US  Direct Drop              EXW   
1  30998.0      Botswana   PMO - US  Direct Drop              EXW   
2  69871.0       Vietnam   PMO - US  Direct Drop              EXW   
3  17648.0  South Africa   PMO - US  Direct Drop              DDP   
4   5647.0        Uganda   PMO - US  Direct Drop              EXW   

  shipment_mode  late_delivery late product_group    sub_classification  ...  \
0           Air            1.0  Yes          HRDT              HIV test  ...   
1           Air            0.0   No          HRDT              HIV test  ...   
2           Air            0.0   No           ARV                 Adult  ...   
3         Ocean            0.0   No           ARV                 Adult  ...   
4           Air            0.0   No          HRDT  HIV test - Ancillary  ...   

  line_item_quantity line_item_value pack_price unit_price  \
0             2996.0       266644.00      

In [77]:
from scipy.stats import norm

# Hypothesize that the proportion of late shipments is 6%
p_0 = 0.06

# Calculate the sample proportion of late shipments
p_hat = (late_shipments['late'] == "Yes").mean()

# Calculate the sample size
n = len(late_shipments)

# Calculate the numerator and denominator of the test statistic
numerator = p_hat - p_0
denominator = np.sqrt(p_0 * (1 - p_0) / n)

# Calculate the test statistic
z_score = numerator / denominator

# Calculate the p-value from the z-score
p_value = 1 - norm.cdf(z_score)

# Print the p-value
print(p_value)

0.44703503936503364


While bootstrapping can be used to estimate the standard error of any statistic, it is computationally intensive. For proportions, a closed-form standard error based on the hypothesized proportion and the sample size is much cheaper to compute.
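To make the trade-off concrete, the sketch below compares a bootstrap standard error with the closed-form one on a synthetic sample. The data (61 late shipments out of 1000), the number of bootstrap replicates, and the seed are all assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic sample (assumed): 1000 shipments, 61 of them late
late = np.array([1] * 61 + [0] * 939)
n = len(late)
p_hat = late.mean()

# Bootstrap: resample with replacement and record the proportion each time
n_boot = 5000
boot_props = np.array([
    rng.choice(late, size=n, replace=True).mean() for _ in range(n_boot)
])
boot_se = boot_props.std(ddof=1)

# Closed-form standard error using the sample proportion
formula_se = np.sqrt(p_hat * (1 - p_hat) / n)

print(boot_se, formula_se)
```

The two values should agree closely, yet the bootstrap needs thousands of resamples. Note that the test statistic plugs in the hypothesized proportion rather than the sample proportion, because the standard error is computed assuming the null hypothesis is true.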

# 2. Two-sample proportion test

<b>2.1 Test of two proportions</b>

You may wonder if the amount paid for freight affects whether or not the shipment was late. Recall that in the late_shipments dataset, whether or not the shipment was late is stored in the late column. Freight costs are stored in the freight_cost_group column, and the categories are "expensive" and "reasonable".

The hypotheses to test, with "late" corresponding to the proportion of late shipments for that group, are

<i>H<sub>0</sub></i> : <i>late<sub>expensive</sub></i> - <i>late<sub>reasonable</sub></i> = 0

<i>H<sub>A</sub></i> : <i>late<sub>expensive</sub></i> - <i>late<sub>reasonable</sub></i> > 0

p_hats contains the estimates of population proportions (sample proportions) for each freight_cost_group:

```
freight_cost_group  late
expensive           Yes     0.082569
reasonable          Yes     0.035165
Name: late, dtype: float64
```

ns contains the sample sizes for these groups:

```
freight_cost_group
expensive     545
reasonable    455
Name: late, dtype: int64
```
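For reference, the standard pooled two-sample test statistic can be written as follows. The subscripted symbols are notation introduced here for illustration: each group has its own sample proportion and sample size, and the unsubscripted p-hat is the pooled proportion across both groups.

$$
\hat{p} = \frac{n_{\text{expensive}} \, \hat{p}_{\text{expensive}} + n_{\text{reasonable}} \, \hat{p}_{\text{reasonable}}}{n_{\text{expensive}} + n_{\text{reasonable}}}
$$

$$
z = \frac{\hat{p}_{\text{expensive}} - \hat{p}_{\text{reasonable}}}{\sqrt{\frac{\hat{p}(1 - \hat{p})}{n_{\text{expensive}}} + \frac{\hat{p}(1 - \hat{p})}{n_{\text{reasonable}}}}}
$$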

In [78]:
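# Calculate the proportion of late and on-time shipments within each freight cost group
p_hats = late_shipments.groupby('freight_cost_groups')['late'].value_counts(normalize=True)
print(p_hats)

# Count the non-missing 'late' values in each freight cost group
ns = late_shipments.groupby('freight_cost_groups')['late'].count()
print(ns)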

freight_cost_groups  late
expensive            No      0.920904
                     Yes     0.079096
reasonable           No      0.964835
                     Yes     0.035165
Name: late, dtype: float64
freight_cost_groups
expensive     531
reasonable    455
Name: late, dtype: int64


In [81]:
# Calculate the pooled estimate of the population proportion
p_hat = (p_hats[("reasonable", "Yes")] * ns["reasonable"] + p_hats[("expensive", "Yes")] * ns["expensive"]) / (ns["reasonable"] + ns["expensive"])

# Calculate p_hat times (1 - p_hat)
p_hat_times_not_p_hat = p_hat * (1 - p_hat)

# Divide this by each of the sample sizes and then sum
p_hat_times_not_p_hat_over_ns = p_hat_times_not_p_hat / ns["expensive"] + p_hat_times_not_p_hat / ns["reasonable"]

# Calculate the standard error
std_error = np.sqrt(p_hat_times_not_p_hat_over_ns)

# Calculate the z-score
z_score = (p_hats[('expensive', 'Yes')] - p_hats[('reasonable', 'Yes')]) / std_error
print('z_score = ',z_score)

# Calculate the p-value from the z-score
p_value = 1-norm.cdf(z_score)

# Print p_value
print('p_value = ', p_value)

z_score =  2.922648567784529
p_value =  0.0017353400023595311


You can calculate a p-value for a two-sample proportion test using (a rather exhausting amount of) arithmetic. This small p-value (about 0.002) suggests that the proportion of late shipments is larger for expensive freight than for reasonable freight.
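The arithmetic above can be collected into a small reusable function. This is a minimal sketch: the function name two_sample_prop_z is hypothetical, and the counts passed to it (45 late out of 545 expensive shipments, 16 late out of 455 reasonable ones) are used here for illustration.

```python
import numpy as np
from scipy.stats import norm


def two_sample_prop_z(successes, ns):
    """Pooled z-score for the difference of two proportions (first minus second)."""
    successes = np.asarray(successes, dtype=float)
    ns = np.asarray(ns, dtype=float)
    p_hats = successes / ns

    # Pooled proportion across both groups
    p_pooled = successes.sum() / ns.sum()

    # Standard error of the difference under the null hypothesis
    std_error = np.sqrt(p_pooled * (1 - p_pooled) * (1 / ns).sum())
    return (p_hats[0] - p_hats[1]) / std_error


# Illustrative counts: 45 late of 545 expensive, 16 late of 455 reasonable
z_score = two_sample_prop_z(successes=[45, 16], ns=[545, 455])
p_value = norm.sf(z_score)  # right-tailed p-value
print(z_score, p_value)
```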

<b>2.2 proportions_ztest() for two samples</b>

Calculating the p-value by hand took a lot of effort, so although it is useful to see how the calculations work, it isn't practical in real-world analyses. For everyday use, the proportions_ztest() function from the statsmodels package is a better choice.

Recall the hypotheses.

<i>H<sub>0</sub></i> : <i>late<sub>expensive</sub></i> - <i>late<sub>reasonable</sub></i> = 0

<i>H<sub>A</sub></i> : <i>late<sub>expensive</sub></i> - <i>late<sub>reasonable</sub></i> > 0

In [80]:
# Count the late column values for each freight_cost_group
late_by_freight_cost_group = late_shipments.groupby("freight_cost_groups")['late'].value_counts()

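# Create an array of the "Yes" counts for each freight_cost_group
# (hard-coded: 45 expensive, 16 reasonable)
success_counts = np.array([45, 16])

# Create an array of the total number of rows in each freight_cost_group
# (hard-coded: 545 expensive, 455 reasonable; the grouped counts printed earlier
# show 531 expensive rows, so this z-score differs from the manual calculation)
n = np.array([45 + 500, 16 + 439])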

# Run a z-test on the two proportions
from statsmodels.stats.proportion import proportions_ztest
stat, p_value = proportions_ztest(count=success_counts, nobs=n, alternative='larger')

# Print the results
print(stat, p_value)

3.1190401865206128 0.0009072060637051224
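Rather than typing the counts by hand, they can be derived from the grouped value counts. The sketch below shows one way to do this on a synthetic DataFrame. The frame df and its counts are assumptions standing in for late_shipments, not the real data.

```python
import pandas as pd

# Synthetic stand-in for late_shipments (assumed counts, not the real data)
df = pd.DataFrame({
    "freight_cost_group": ["expensive"] * 545 + ["reasonable"] * 455,
    "late": ["Yes"] * 45 + ["No"] * 500 + ["Yes"] * 16 + ["No"] * 439,
})

# Count "Yes"/"No" per group and spread the late values into columns
counts = (
    df.groupby("freight_cost_group")["late"]
    .value_counts()
    .unstack(fill_value=0)
    .loc[["expensive", "reasonable"]]  # fix the group order explicitly
)

# Successes are the "Yes" counts; totals are the row sums
success_counts = counts["Yes"].to_numpy()
n = counts.sum(axis=1).to_numpy()
print(success_counts, n)
```

The resulting arrays can be passed to proportions_ztest() as count and nobs, which avoids hard-coded numbers drifting away from the data.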
