# Non-Parametric Tests

# 1. Assumptions in hypothesis testing

<b>1.1 Common assumptions of hypothesis tests</b>

Hypothesis tests make assumptions about the dataset that they are testing, and the conclusions you draw from the test results are only valid if those assumptions hold. While some assumptions differ between types of test, others are common to all hypothesis tests.

Which of the following statements is a common assumption of hypothesis tests?

Possible Answers

- Sample observations are collected deterministically from the population.

- Sample observations are correlated with each other.

- <b><font color ='green'>Sample observations have no direct relationship with each other.</font></b>

- Sample sizes are greater than thirty observations.

All hypothesis tests assume that the data are collected at random from the population, that each row is independent of the others, and that the sample size is "big enough".

<b>1.2 Testing sample size</b>

To conduct a hypothesis test and trust its result, a sample must meet three requirements: it is a random sample of the population, the observations are independent, and there are enough observations. Of these, only the last condition is easily testable with code.

The minimum sample size depends on the type of hypothesis test you want to perform. You'll now test some scenarios on the late_shipments dataset.

Note that the pandas .all() method checks whether all elements are true. For example, given a DataFrame df with numeric entries, (df < 5).all() checks, column by column, whether all elements are less than 5. On a Series, such as the output of .value_counts(), it returns a single Boolean.
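As a minimal sketch of how `.all()` behaves, the example below uses a small made-up DataFrame and Series (the names `df`, `counts`, and the values are illustrative assumptions, not taken from late_shipments). It shows the per-column result for a DataFrame, one way to collapse that to a single answer, and the single Boolean returned for a Series of counts.

```python
import pandas as pd

# Hypothetical toy data, used only to illustrate .all()
df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

# On a DataFrame, .all() reduces each column separately
per_column = (df < 5).all()

# Chaining a second .all() collapses the per-column results to one Boolean
overall = bool((df < 5).all().all())

# On a Series (like the output of .value_counts()), .all() gives one Boolean
counts = pd.Series({"x": 40, "y": 12})
big_enough = bool((counts >= 30).all())

print(per_column)
print(overall)
print(big_enough)
```

Because `.value_counts()` returns a Series, the checks in the cells that follow each produce a single `True` or `False`.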

In [16]:
import pandas as pd
late_shipments = pd.read_feather("C:\\Users\\yazan\\Desktop\\Data_Analytics\\9-Introduction to Hypothesis Testing\\Datasets\\late_shipments.feather")

In [17]:
'''Get the count of each value in the freight_cost_groups column of late_shipments.
Insert a suitable number to inspect whether the counts are "big enough" for a two-sample t-test.'''

# Count the freight_cost_groups values
counts = late_shipments['freight_cost_groups'].value_counts()

# Print the result
print(counts)

# Inspect whether the counts are big enough
print((counts >= 30).all())

expensive     531
reasonable    455
Name: freight_cost_groups, dtype: int64
True


In [18]:
'''Get the count of each value in the late column of late_shipments.
Insert a suitable number to inspect whether the counts are "big enough" for a one-sample proportion test.'''

# Count the late values
counts = late_shipments['late'].value_counts()

# Print the result
print(counts)

# Inspect whether the counts are big enough
print((counts >= 10).all())

No     939
Yes     61
Name: late, dtype: int64
True


In [19]:
'''Get the count of each value in the freight_cost_groups column of late_shipments grouped by vendor_inco_term.
Insert a suitable number to inspect whether the counts are "big enough" for a chi-square independence test.'''
# Count the values of freight_cost_groups grouped by vendor_inco_term
counts = late_shipments.groupby('vendor_inco_term')['freight_cost_groups'].value_counts()

# Print the result
print(counts)

# Inspect whether the counts are big enough
print((counts >= 5).all())

vendor_inco_term  freight_cost_groups
CIP               reasonable              34
                  expensive               16
DDP               expensive               55
                  reasonable              45
DDU               reasonable               1
EXW               expensive              423
                  reasonable             302
FCA               reasonable              73
                  expensive               37
Name: freight_cost_groups, dtype: int64
False


In [20]:
'''Get the count of each value in the shipment_mode column of late_shipments.
Insert a suitable number to inspect whether the counts are "big enough" for an ANOVA test.'''

# Count the shipment_mode values
counts = late_shipments['shipment_mode'].value_counts()

# Print the result
print(counts)

# Inspect whether the counts are big enough
print((counts >= 30).all())

Air            906
Ocean           88
Air Charter      6
Name: shipment_mode, dtype: int64
False


While randomness and independence of observations can't easily be tested programmatically, you can test that your sample sizes are big enough to make a hypothesis test appropriate. Based on the last result, we should be cautious about the ANOVA test results: Air Charter has only 6 shipments, well below the threshold of 30.
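The four checks above can be gathered into one helper that also reports which groups fall short. This is only a sketch: the names `MIN_COUNTS` and `small_groups` are hypothetical, the thresholds are the rule-of-thumb values used in the cells above, and the list of shipment modes is rebuilt from the printed counts rather than read from the dataset.

```python
from collections import Counter

# Rule-of-thumb minimum counts, as used in the cells above (hypothetical name)
MIN_COUNTS = {
    "t_test": 30,       # per group
    "proportion": 10,   # per outcome (successes and failures)
    "chi_square": 5,    # per cell of the contingency table
    "anova": 30,        # per group
}


def small_groups(values, test):
    """Return the groups whose counts fall below the threshold for `test`."""
    threshold = MIN_COUNTS[test]
    counts = Counter(values)
    return {group: n for group, n in counts.items() if n < threshold}


# Shipment modes rebuilt from the counts printed above
modes = ["Air"] * 906 + ["Ocean"] * 88 + ["Air Charter"] * 6

print(small_groups(modes, "anova"))  # {'Air Charter': 6}
```

Returning the offending groups, rather than a single Boolean, makes it easier to see where the caution applies.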

# 2. Non-parametric tests