# Hypothesis Testing in Python

## Chapter 4: Non-Parametric Tests

In [15]:
import pandas as pd
import pingouin as pg

Large sample size: t-test and ANOVA:
- One Sample: $n \geq 30$
- Two Samples: $n_1 \geq 30$, $n_2 \geq 30$
- Paired Samples: at least 30 pairs of observations across samples. Number of rows in the data $\geq 30$
- ANOVA: at least 30 observations in each sample. $n_i \geq 30$ for all values of $i$

Large sample size: proportion test:
- One Sample:
    - Number of successes in sample is greater than or equal to 10
        - $n \times \hat{p} \geq 10$
    - Number of failures in sample is greater than or equal to 10
        - $n \times (1 - \hat{p}) \geq 10$
- Two Sample:
    - Number of successes in each sample is greater than or equal to 10
        - $n_1 \times \hat{p_1} \geq 10$
        - $n_2 \times \hat{p_2} \geq 10$
    - Number of failures in each sample is greater than or equal to 10
        - $n_1 \times (1 - \hat{p_1}) \geq 10$
        - $n_2 \times (1 - \hat{p_2}) \geq 10$
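
The conditions above can be checked directly from a sample size and a sample proportion. The following is a minimal sketch in plain Python. The function name `large_enough_for_proportion_test` and its `threshold` parameter are illustrative and not part of any library. The first call uses the 61 late shipments out of 999 that are counted later in this notebook; the second uses made-up numbers.

```python
def large_enough_for_proportion_test(n, p_hat, threshold=10):
    """Return True if both expected successes and failures reach the threshold."""
    successes = n * p_hat          # n * p_hat
    failures = n * (1 - p_hat)     # n * (1 - p_hat)
    return successes >= threshold and failures >= threshold


# 61 late shipments out of 999 (counts taken from the notebook output)
late_ok = large_enough_for_proportion_test(n=999, p_hat=61 / 999)

# Hypothetical small sample: only about 5 expected successes
small_ok = large_enough_for_proportion_test(n=50, p_hat=0.1)

print(late_ok, small_ok)
```

For a two-sample test, the same check would be applied to each sample separately, and both must pass.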

Large sample size: chi-square tests:
- The number of successes in each group is greater than or equal to 5
    - $n_i \times \hat{p_i} \geq 5$ for all values of $i$
- The number of failures in each group is greater than or equal to 5
    - $n_i \times (1 - \hat{p_i}) \geq 5$ for all values of $i$

Sanity check:
1. If the bootstrap distribution doesn't look normal, assumptions likely aren't valid
2. Check for randomness, independence, and sample size
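
One way to run the bootstrap sanity check is to resample the data with replacement, compute the statistic each time, and look at the shape of the resulting distribution. The sketch below uses only the standard library. The sample values, the helper names `bootstrap_means` and `skewness`, and the choice of skewness as a rough normality indicator are all assumptions made for illustration; a histogram would serve the same purpose.

```python
import random
import statistics


def bootstrap_means(values, n_resamples=2000, seed=0):
    """Resample with replacement and return the mean of each resample."""
    rng = random.Random(seed)
    n = len(values)
    return [statistics.fmean(rng.choices(values, k=n)) for _ in range(n_resamples)]


def skewness(xs):
    """Population skewness: close to 0 for a symmetric, normal-looking shape."""
    m = statistics.fmean(xs)
    s = statistics.pstdev(xs)
    return sum(((x - m) / s) ** 3 for x in xs) / len(xs)


# Hypothetical small, strongly right-skewed sample
sample = [1, 2, 2, 3, 3, 3, 4, 5, 8, 40]
boot = bootstrap_means(sample)

# A clearly positive value hints that the bootstrap distribution is not normal
print(round(skewness(boot), 2))
```

A skewness far from zero on a small sample like this one is a hint that a parametric test may not be appropriate.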

Parametric tests:
- z-test, t-test, and ANOVA are all parametric tests
- They assume a normal distribution
- They require sufficiently large sample sizes

In [2]:
late_shipments = pd.read_feather("late_shipments.feather")
late_shipments

Unnamed: 0,id,country,managed_by,fulfill_via,vendor_inco_term,shipment_mode,late_delivery,late,product_group,sub_classification,...,line_item_quantity,line_item_value,pack_price,unit_price,manufacturing_site,first_line_designation,weight_kilograms,freight_cost_usd,freight_cost_groups,line_item_insurance_usd
0,36203.0,Nigeria,PMO - US,Direct Drop,EXW,Air,1.0,Yes,HRDT,HIV test,...,2996.0,266644.00,89.00,0.89,"Alere Medical Co., Ltd.",Yes,1426.0,33279.83,expensive,373.83
1,30998.0,Botswana,PMO - US,Direct Drop,EXW,Air,0.0,No,HRDT,HIV test,...,25.0,800.00,32.00,1.60,"Trinity Biotech, Plc",Yes,10.0,559.89,reasonable,1.72
2,69871.0,Vietnam,PMO - US,Direct Drop,EXW,Air,0.0,No,ARV,Adult,...,22925.0,110040.00,4.80,0.08,Hetero Unit III Hyderabad IN,Yes,3723.0,19056.13,expensive,181.57
3,17648.0,South Africa,PMO - US,Direct Drop,DDP,Ocean,0.0,No,ARV,Adult,...,152535.0,361507.95,2.37,0.04,"Aurobindo Unit III, India",Yes,7698.0,11372.23,expensive,779.41
4,5647.0,Uganda,PMO - US,Direct Drop,EXW,Air,0.0,No,HRDT,HIV test - Ancillary,...,850.0,8.50,0.01,0.00,Inverness Japan,Yes,56.0,360.00,reasonable,0.01
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
995,13608.0,Uganda,PMO - US,Direct Drop,DDP,Air,0.0,No,ARV,Adult,...,121.0,9075.00,75.00,0.62,"Janssen-Cilag, Latina, IT",Yes,43.0,199.00,reasonable,12.72
996,80394.0,"Congo, DRC",PMO - US,Direct Drop,EXW,Air,0.0,No,HRDT,HIV test,...,292.0,9344.00,32.00,1.60,"Trinity Biotech, Plc",Yes,99.0,2162.55,reasonable,13.10
997,61675.0,Zambia,PMO - US,Direct Drop,EXW,Air,1.0,Yes,HRDT,HIV test,...,2127.0,170160.00,80.00,0.80,"Alere Medical Co., Ltd.",Yes,881.0,14019.38,expensive,210.49
998,39182.0,South Africa,PMO - US,Direct Drop,DDP,Ocean,0.0,No,ARV,Adult,...,191011.0,861459.61,4.51,0.15,"Aurobindo Unit III, India",Yes,16234.0,14439.17,expensive,1421.41


In [3]:
late_shipments = late_shipments[late_shipments["vendor_inco_term"] != "DDU"].copy()

In [4]:
# Count the freight_cost_groups values
counts = late_shipments["freight_cost_groups"].value_counts()

# Print the result
print(counts)

# Inspect whether the counts are big enough
print((counts >= 30).all())


freight_cost_groups
expensive     531
reasonable    454
Name: count, dtype: int64
True


In [5]:
# Count the late values
counts = late_shipments["late"].value_counts()

# Print the result
print(counts)

# Inspect whether the counts are big enough
print((counts >= 10).all())


late
No     938
Yes     61
Name: count, dtype: int64
True


In [6]:
# Count the values of freight_cost_groups grouped by vendor_inco_term
counts = late_shipments.groupby("vendor_inco_term")["freight_cost_groups"].value_counts()

# Print the result
print(counts)

# Inspect whether the counts are big enough
print((counts >= 5).all())


vendor_inco_term  freight_cost_groups
CIP               reasonable              34
                  expensive               16
DDP               expensive               55
                  reasonable              45
EXW               expensive              423
                  reasonable             302
FCA               reasonable              73
                  expensive               37
Name: count, dtype: int64
True


In [7]:
# Count the shipment_mode values
counts = late_shipments["shipment_mode"].value_counts()

# Print the result
print(counts)

# Inspect whether the counts are big enough
print((counts >= 30).all())


shipment_mode
Air            905
Ocean           88
Air Charter      6
Name: count, dtype: int64
False


In [8]:
sample_dem_data = pd.read_feather("dem_votes_potus_12_16.feather")
sample_dem_data

Unnamed: 0,state,county,dem_percent_12,dem_percent_16
0,Alabama,Bullock,76.305900,74.946921
1,Alabama,Chilton,19.453671,15.847352
2,Alabama,Clay,26.673672,18.674517
3,Alabama,Cullman,14.661752,10.028252
4,Alabama,Escambia,36.915731,31.020546
...,...,...,...,...
495,Wyoming,Uinta,19.065464,14.191263
496,Wyoming,Washakie,20.131846,13.948610
497,Alaska,District 3,33.514582,16.301064
498,Alaska,District 18,61.284271,52.810051


In [9]:
# Conduct a paired t-test on dem_percent_12 and dem_percent_16
paired_test_results = pg.ttest(
    x=sample_dem_data["dem_percent_12"],
    y=sample_dem_data["dem_percent_16"],
    paired=True,
    alternative="two-sided"
) 

# Print paired t-test results
print(paired_test_results)

                T  dof alternative          p-val         CI95%   cohen-d  \
T-test  30.298384  499   two-sided  3.600634e-115  [6.39, 7.27]  0.454202   

              BF10  power  
T-test  2.246e+111    1.0  


In [10]:
# Conduct a Wilcoxon test on dem_percent_12 and dem_percent_16
wilcoxon_test_results = pg.wilcoxon(
    x=sample_dem_data["dem_percent_12"],
    y=sample_dem_data["dem_percent_16"],
    alternative="two-sided"
) 

# Print Wilcoxon test results
print(wilcoxon_test_results)

           W-val alternative         p-val       RBC      CLES
Wilcoxon  2401.0   two-sided  1.780396e-77  0.961661  0.644816


With a large sample (500 pairs), the parametric paired t-test and the non-parametric Wilcoxon signed-rank test lead to the same conclusion: both give an extremely small p-value.
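
To see where a value like `W-val` comes from, here is a rough sketch of the signed-rank statistic in plain Python: take the paired differences, rank their absolute values, and sum the ranks separately for positive and negative differences. The helper names and the toy data are hypothetical, and the sketch leaves out the p-value and effect sizes that `pg.wilcoxon` reports.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def wilcoxon_w(x, y):
    """Two-sided signed-rank statistic: the smaller of the two rank sums."""
    diffs = [a - b for a, b in zip(x, y) if a != b]  # drop zero differences
    ranks = average_ranks([abs(d) for d in diffs])
    w_plus = sum(r for d, r in zip(diffs, ranks) if d > 0)
    w_minus = sum(r for d, r in zip(diffs, ranks) if d < 0)
    return min(w_plus, w_minus)


# Hypothetical paired measurements
x = [10, 12, 9, 15, 11]
y = [8, 13, 5, 9, 8]
print(wilcoxon_w(x, y))  # 1.0: only the smallest difference is negative
```

For 500 pairs with no zero differences, the ranks sum to $500 \times 501 / 2 = 125250$, so a `W-val` of 2401 means almost all of the rank weight sits on one side. This is consistent with the reported `RBC` of about 0.96.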

In [12]:
# Select the weight_kilograms and late columns
weight_vs_late = late_shipments[["weight_kilograms", "late"]]

# Convert weight_vs_late into wide format
weight_vs_late_wide = weight_vs_late.pivot(columns="late",
                                           values="weight_kilograms")

weight_vs_late_wide

late,No,Yes
0,,1426.0
1,10.0,
2,3723.0,
3,7698.0,
4,56.0,
...,...,...
995,43.0,
996,99.0,
997,,881.0
998,16234.0,


In [14]:
# Run a two-sided Wilcoxon-Mann-Whitney test on weight_kilograms vs. late
wmw_test = pg.mwu(
    x=weight_vs_late_wide["No"],
    y=weight_vs_late_wide["Yes"],
    alternative="two-sided"
)

# Print the test results
print(wmw_test)

       U-val alternative     p-val       RBC      CLES
MWU  19131.0   two-sided  0.000014  0.331294  0.334353


The small p-value suggests that shipment weight differs between late and on-time shipments. The Wilcoxon-Mann-Whitney test is useful when you cannot satisfy the assumptions of a parametric test that compares two means, such as the t-test.
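
The `U-val` can be understood as a pair count: compare every value in `x` with every value in `y` and count how often the `x` value is larger, with ties counting as one half. The sketch below is a brute-force illustration in plain Python with hypothetical weights. The function name is an assumption, and real implementations use ranks rather than nested loops.

```python
def mann_whitney_u(x, y):
    """U for x: number of (x_i, y_j) pairs with x_i > y_j, ties counted as 0.5."""
    u = 0.0
    for a in x:
        for b in y:
            if a > b:
                u += 1.0
            elif a == b:
                u += 0.5
    return u


# Hypothetical shipment weights in kilograms
on_time = [10, 56, 43, 99]
late = [1426, 881, 50]

u_on_time = mann_whitney_u(on_time, late)
# Share of pairs in which the on-time shipment is the heavier one
share = u_on_time / (len(on_time) * len(late))
print(u_on_time, round(share, 3))
```

Dividing the reported `U-val` of 19131 by the number of pairs, $938 \times 61 = 57218$, gives about 0.334, which matches the `CLES` column in the output. Assuming `CLES` is read as the probability that a value from `x` exceeds a value from `y`, an on-time shipment is heavier than a late one in only about a third of the pairs, so late shipments tend to be heavier.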

In [16]:
# Run a Kruskal-Wallis test on weight_kilograms vs. shipment_mode
kw_test = pg.kruskal(data=late_shipments, dv="weight_kilograms", between="shipment_mode")

# Print the results
print(kw_test)

                Source  ddof1           H         p-unc
Kruskal  shipment_mode      2  124.983244  7.248254e-28


The Kruskal-Wallis test yielded a very small p-value, so there is evidence that at least one of the three shipment modes has a different weight distribution from the others. The Kruskal-Wallis test is the non-parametric counterpart of ANOVA, which tests for a difference in means across multiple groups.
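
As a rough illustration of what the `H` statistic measures, the sketch below ranks all observations together and compares the rank sums of the groups using $H = \frac{12}{N(N+1)} \sum_i \frac{R_i^2}{n_i} - 3(N+1)$. The function name and the toy groups are hypothetical, and the sketch omits the tie correction that statistical libraries typically apply, so it should not be expected to reproduce library output on data with many ties.

```python
def kruskal_h(*groups):
    """Kruskal-Wallis H without tie correction, using pooled mean ranks."""
    pooled = [v for g in groups for v in g]
    n_total = len(pooled)

    # Map each distinct value to the mean of the positions it occupies
    positions = {}
    for pos, v in enumerate(sorted(pooled), start=1):
        positions.setdefault(v, []).append(pos)
    rank_of = {v: sum(p) / len(p) for v, p in positions.items()}

    # Sum over groups of (rank sum squared) / (group size)
    rank_term = sum(sum(rank_of[v] for v in g) ** 2 / len(g) for g in groups)
    return 12 / (n_total * (n_total + 1)) * rank_term - 3 * (n_total + 1)


# Three hypothetical, fully separated groups
h = kruskal_h([1, 2, 3], [4, 5, 6], [7, 8, 9])
dof = 3 - 1  # number of groups minus one, as in the ddof1 column
print(round(h, 2), dof)
```

The larger the differences between the groups' average ranks, the larger `H` becomes. With three groups the statistic has two degrees of freedom, which matches `ddof1` in the output above.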