# Assumptions in hypothesis testing


## Common assumptions of hypothesis tests
Hypothesis tests make assumptions about the dataset that they are testing, and the conclusions you draw from the test results are only valid if those assumptions hold. While some assumptions differ between types of test, others are common to all hypothesis tests.

* Sample observations have no direct relationship with each other.


All hypothesis tests assume that the data are collected at random from the population, that each row is independent of the others, and that the sample size is "big enough".

In [2]:
import pandas as pd
import numpy as np

late_shipments = pd.read_feather("/kaggle/input/late-shipments-dataset-to-perform-hypothesis-test/late_shipments.feather")
print(late_shipments)

          id       country managed_by  fulfill_via vendor_inco_term  \
0    36203.0       Nigeria   PMO - US  Direct Drop              EXW   
1    30998.0      Botswana   PMO - US  Direct Drop              EXW   
2    69871.0       Vietnam   PMO - US  Direct Drop              EXW   
3    17648.0  South Africa   PMO - US  Direct Drop              DDP   
4     5647.0        Uganda   PMO - US  Direct Drop              EXW   
..       ...           ...        ...          ...              ...   
995  13608.0        Uganda   PMO - US  Direct Drop              DDP   
996  80394.0    Congo, DRC   PMO - US  Direct Drop              EXW   
997  61675.0        Zambia   PMO - US  Direct Drop              EXW   
998  39182.0  South Africa   PMO - US  Direct Drop              DDP   
999   5645.0      Botswana   PMO - US  Direct Drop              EXW   

    shipment_mode  late_delivery late product_group    sub_classification  \
0             Air            1.0  Yes          HRDT              HIV t

## Testing sample size
In order to conduct a hypothesis test and be sure that the result is fair, a sample must meet three requirements: it is a random sample of the population, the observations are independent, and there are enough observations. Of these, only the last condition is easily testable with code.

The minimum sample size depends on the type of hypothesis test you want to perform. You'll now test some scenarios on the late_shipments dataset.

Note that the .all() method from pandas can be used to check if all elements are true. For example, given a DataFrame df with numeric entries, you check to see if all its elements are less than 5, using (df < 5).all().
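
As a minimal sketch of how .all() behaves, the example below uses a made-up DataFrame (df here is hypothetical, not part of late_shipments). On a DataFrame, .all() reduces each column separately; on a Series, such as the output of .value_counts(), it returns a single boolean.

```python
import pandas as pd

# Hypothetical toy DataFrame, used only to illustrate .all()
df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

# Element-wise comparison gives a DataFrame of booleans
below_five = df < 5

# On a DataFrame, .all() reduces each column to one boolean
per_column = below_five.all()
print(per_column)

# Chaining a second .all() collapses the result to one overall answer
everything_below_five = bool(below_five.all().all())
print(everything_below_five)

# On a Series (like the result of .value_counts()), one .all() is enough
counts = pd.Series({"x": 40, "y": 35})
all_big_enough = bool((counts >= 30).all())
print(all_big_enough)
```

This is why a single .all() is sufficient in the cells that follow: .value_counts() returns a Series.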

* Get the count of each value in the freight_cost_groups column of late_shipments.
  Insert a suitable number to inspect whether the counts are "big enough" for a two-sample t-test.
* Get the count of each value in the late column of late_shipments.
  Insert a suitable number to inspect whether the counts are "big enough" for a one-sample proportion test.
* Get the count of each value in the freight_cost_groups column of late_shipments grouped by vendor_inco_term.
  Insert a suitable number to inspect whether the counts are "big enough" for a chi-square independence test.
* Get the count of each value in the shipment_mode column of late_shipments.
  Insert a suitable number to inspect whether the counts are "big enough" for an ANOVA test.

In [4]:
# Count the freight_cost_groups values
counts = late_shipments['freight_cost_groups'].value_counts()

# Print the result
print(counts)

# Inspect whether the counts are big enough
print((counts >= 30).all())

freight_cost_groups
expensive     531
reasonable    455
Name: count, dtype: int64
True


In [5]:
# Count the late values
counts = late_shipments["late"].value_counts()

# Print the result
print(counts)

# Inspect whether the counts are big enough
print((counts >= 10).all())

late
No     939
Yes     61
Name: count, dtype: int64
True


In [7]:
# Count the values of freight_cost_groups grouped by vendor_inco_term
counts = late_shipments.groupby("vendor_inco_term")["freight_cost_groups"].value_counts()

# Print the result
print(counts)

# Inspect whether the counts are big enough
print((counts >= 5).all())

vendor_inco_term  freight_cost_groups
CIP               reasonable              34
                  expensive               16
DDP               expensive               55
                  reasonable              45
DDU               reasonable               1
EXW               expensive              423
                  reasonable             302
FCA               reasonable              73
                  expensive               37
Name: count, dtype: int64
False


In [8]:
# Count the shipment_mode values
counts = late_shipments["shipment_mode"].value_counts()

# Print the result
print(counts)

# Inspect whether the counts are big enough
print((counts >= 30).all())

shipment_mode
Air            906
Ocean           88
Air Charter      6
Name: count, dtype: int64
False


**Setting a great example for an ample sample! While randomness and independence of observations can't easily be tested programmatically, you can test that your sample sizes are big enough to make a hypothesis test appropriate. Based on the last result, we should be a little cautious of the ANOVA test results given the small sample size for Air Charter.**
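
A boolean from .all() says that some group is too small, but not which one. The sketch below, which rebuilds the shipment_mode counts from the output above as a standalone Series, shows one way to list the groups that fall under the threshold. The names min_count and too_small are illustrative.

```python
import pandas as pd

# Counts copied from the shipment_mode output above
counts = pd.Series({"Air": 906, "Ocean": 88, "Air Charter": 6}, name="count")

# Threshold used for the ANOVA check in this notebook
min_count = 30

# Boolean indexing keeps only the groups below the threshold
too_small = counts[counts < min_count]
print(too_small)
```

The same pattern works on the grouped counts from the chi-square check, where it would surface the vendor_inco_term group with a single observation.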

# Wilcoxon signed-rank test
You'll explore the difference between the proportion of county-level votes for the Democratic candidate in 2012 and 2016 to identify if the difference is significant.

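The task description assumes that sample_dem_data has columns dem_percent_12 and dem_percent_16 in addition to state and county names. The Kaggle file loaded below uses different column names (for example per_dem_2012, per_dem_2016, votes_dem_2012, and votes_dem_2016), and the test cells below are run on the vote-count columns votes_dem_2012 and votes_dem_2016. The following packages are also used: pingouin and pandas as pd.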


* Conduct a paired t-test on the percentage columns using an appropriate alternative hypothesis.
* Conduct a Wilcoxon signed-rank test on the same columns.

In [9]:
sample_dem_data = pd.read_csv("/kaggle/input/2012-2016-presidential-elections/US_County_Level_Presidential_Results_12-16.csv")
sample_dem_data.head()



Unnamed: 0.1,Unnamed: 0,combined_fips,votes_dem_2016,votes_gop_2016,total_votes_2016,per_dem_2016,per_gop_2016,diff_2016,per_point_diff_2016,state_abbr,...,FIPS,total_votes_2012,votes_dem_2012,votes_gop_2012,county_fips,state_fips,per_dem_2012,per_gop_2012,diff_2012,per_point_diff_2012
0,0,2013,93003.0,130413.0,246588.0,0.377159,0.52887,37410,-0.151711,AK,...,2013,,,,,,,,,
1,1,2016,93003.0,130413.0,246588.0,0.377159,0.52887,37410,-0.151711,AK,...,2016,,,,,,,,,
2,2,2020,93003.0,130413.0,246588.0,0.377159,0.52887,37410,-0.151711,AK,...,2020,,,,,,,,,
3,3,2050,93003.0,130413.0,246588.0,0.377159,0.52887,37410,-0.151711,AK,...,2050,,,,,,,,,
4,4,2060,93003.0,130413.0,246588.0,0.377159,0.52887,37410,-0.151711,AK,...,2060,,,,,,,,,


In [12]:
!pip install pingouin

import pingouin



In [14]:
# Conduct a paired t-test on the 2016 and 2012 Democratic vote counts
# (this cell uses votes_dem_2016 and votes_dem_2012, not the percentage columns)
paired_test_results = pingouin.ttest(x=sample_dem_data['votes_dem_2016'],
                                     y=sample_dem_data['votes_dem_2012'],
                                     paired=True,
                                     alternative="two-sided")

# Print paired t-test results
print(paired_test_results)

               T   dof alternative     p-val              CI95%   cohen-d  \
T-test  0.177061  3111   two-sided  0.859472  [-289.13, 346.54]  0.000416   

         BF10     power  
T-test  0.021  0.050062  


In [15]:
# Conduct a Wilcoxon signed-rank test on the same vote-count columns
wilcoxon_test_results = pingouin.wilcoxon(x=sample_dem_data['votes_dem_2016'], 
                                          y=sample_dem_data['votes_dem_2012'],
                                          alternative="two-sided")


# Print Wilcoxon test results
print(wilcoxon_test_results)

             W-val alternative          p-val      RBC      CLES
Wilcoxon  809005.0   two-sided  1.698817e-226 -0.66532  0.460459
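
To make the W-val column less of a black box, here is a minimal sketch of how a signed-rank statistic can be computed by hand. The paired values are made up, and the comparison uses scipy.stats.wilcoxon rather than pingouin, so treat it as an illustration of the idea and not as pingouin's exact implementation.

```python
import numpy as np
from scipy import stats

# Made-up paired measurements (hypothetical, not the election data)
before = np.array([52.0, 47.5, 60.1, 38.2, 44.0, 55.3])
after = np.array([49.0, 48.0, 55.0, 36.0, 45.5, 50.0])

# Step 1: paired differences
d = after - before

# Step 2: rank the absolute differences (1 = smallest)
ranks = stats.rankdata(np.abs(d))

# Step 3: sum the ranks separately for positive and negative differences
r_plus = ranks[d > 0].sum()
r_minus = ranks[d < 0].sum()

# Step 4: the two-sided statistic is the smaller of the two sums
w_by_hand = min(r_plus, r_minus)

# Compare with SciPy's two-sided signed-rank test
scipy_result = stats.wilcoxon(after, before, alternative="two-sided")
print(r_plus, r_minus, w_by_hand, scipy_result.statistic)
```

Because only the ranks of the differences enter the statistic, one very large difference counts no more than the largest rank.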




# Wilcoxon-Mann-Whitney
Another class of non-parametric hypothesis tests is called rank sum tests. Ranks are the positions of numeric values from smallest to largest. Think of them as positions in running events: whoever has the fastest (smallest) time is rank 1, second fastest is rank 2, and so on.

By calculating on the ranks of the data instead of the actual values, you can avoid making assumptions about the shape of the data's distribution, such as normality. This makes rank-based tests more robust, in the same way that a median is more robust than a mean.

One common rank-based test is the Wilcoxon-Mann-Whitney test, which is like a non-parametric t-test.
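
The running-event analogy can be sketched in a few lines. The runner names and times below are made up for illustration; the point is that replacing the slowest time with an extreme value changes the mean but leaves both the ranks and the median untouched.

```python
import pandas as pd

# Made-up finishing times in seconds (hypothetical runners)
times = pd.Series([12.4, 9.8, 11.1, 10.5], index=["Ana", "Ben", "Cy", "Dee"])

# Rank 1 goes to the fastest (smallest) time
ranks = times.rank()
print(ranks)

# Replace the slowest time with an extreme outlier
times_outlier = times.copy()
times_outlier["Ana"] = 1000.0
ranks_outlier = times_outlier.rank()

# The ranks and the median are unchanged; the mean is not
print(ranks.equals(ranks_outlier))
print(times.median(), times_outlier.median())
print(times.mean(), times_outlier.mean())
```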


* Select weight_kilograms and late from late_shipments, assigning the name weight_vs_late.
* Convert weight_vs_late from long-to-wide format, setting columns to 'late'.
* Run a Wilcoxon-Mann-Whitney test for a difference in weight_kilograms when the shipment was late and on-time.
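
The long-to-wide step can be easier to follow on a tiny example. The sketch below uses a made-up five-row DataFrame (toy is hypothetical) with the same two column names. After pivoting, each row has a value in exactly one of the new columns and NaN in the other, so the sketch drops the missing values explicitly before treating each column as a sample.

```python
import pandas as pd

# Hypothetical stand-in for late_shipments[['weight_kilograms', 'late']]
toy = pd.DataFrame({
    "weight_kilograms": [10.0, 250.0, 32.0, 1200.0, 75.0],
    "late": ["No", "Yes", "No", "Yes", "No"],
})

# One column per value of 'late'; the original row index is kept
wide = toy.pivot(columns="late", values="weight_kilograms")
print(wide)

# Each column is one sample once the NaN placeholders are dropped
on_time = wide["No"].dropna()
late = wide["Yes"].dropna()
print(len(on_time), len(late))
```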

In [16]:
# Select the weight_kilograms and late columns
weight_vs_late = late_shipments[['weight_kilograms', 'late']]

# Convert weight_vs_late into wide format
weight_vs_late_wide = weight_vs_late.pivot(columns='late', 
                                           values='weight_kilograms')


# Run a two-sided Wilcoxon-Mann-Whitney test on weight_kilograms vs. late
wmw_test = pingouin.mwu(x=weight_vs_late_wide['No'], y=weight_vs_late_wide['Yes'], alternative='two-sided')


# Print the test results
print(wmw_test)

       U-val alternative     p-val       RBC      CLES
MWU  19134.0   two-sided  0.000014 -0.331902  0.334049


They tried to make me use parameters, but I said "No, no, no". The small p-value here suggests that shipment weight differs between late and on-time shipments. The Wilcoxon-Mann-Whitney test is useful when you cannot satisfy the assumptions of a parametric test comparing two means, like the t-test.

# Kruskal-Wallis
Recall that the Kruskal-Wallis test is a non-parametric version of the ANOVA test. Where ANOVA compares means across multiple groups, Kruskal-Wallis works on the ranks of the values.

late_shipments is available, and the following packages have been loaded: pingouin and pandas as pd.

* Run a Kruskal-Wallis test on weight_kilograms between the different shipment modes in late_shipments.

In [17]:
# Run a Kruskal-Wallis test on weight_kilograms vs. shipment_mode
kw_test = pingouin.kruskal(data=late_shipments, 
                           dv='weight_kilograms',
                           between='shipment_mode')


# Print the results
print(kw_test)

                Source  ddof1           H         p-unc
Kruskal  shipment_mode      2  125.096618  6.848799e-28


Great work! The Kruskal-Wallis test yielded a very small p-value, so there is evidence that at least one of the three shipment modes has a different weight distribution than the others. The Kruskal-Wallis test is comparable to an ANOVA, which tests for a difference in means across multiple groups.
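
For intuition about the H column, the sketch below computes a Kruskal-Wallis statistic by hand on made-up groups and checks it against scipy.stats.kruskal. The group values are hypothetical and contain no ties, so the textbook formula H = 12 / (N (N + 1)) * sum(R_j^2 / n_j) - 3 (N + 1) applies without a tie correction. It uses SciPy directly and is an illustration of the idea, not a description of pingouin's internals.

```python
import numpy as np
from scipy import stats

# Made-up weights for three hypothetical groups (no tied values)
groups = [
    [2.1, 3.4, 1.9, 4.0],
    [5.5, 6.1, 4.8, 7.2],
    [9.0, 8.3, 10.4],
]

# Rank all observations together, ignoring group membership
all_values = np.concatenate(groups)
all_ranks = stats.rankdata(all_values)

# Split the ranks back into their groups
sizes = [len(g) for g in groups]
ranks_per_group = np.split(all_ranks, np.cumsum(sizes)[:-1])

# H compares each group's rank sum R_j with what chance alone would give
n_total = len(all_values)
rank_term = sum(r.sum() ** 2 / len(r) for r in ranks_per_group)
h_by_hand = 12.0 / (n_total * (n_total + 1)) * rank_term - 3 * (n_total + 1)

# Compare with SciPy's implementation
h_scipy, p_value = stats.kruskal(*groups)
print(h_by_hand, h_scipy, p_value)
```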