<a href="https://colab.research.google.com/github/wisnercelucus/utils/blob/main/sampling_analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# An Analysis of the Significance of the Difference between Planned and Achieved Sample Sizes in the PARE Baseline Study

*By Wisner CELUCUS*

In this notebook, we conduct a t-test to determine whether the sample sizes we intended to achieve in the PARE baseline study across all targeted communes differ significantly from the sample sizes achieved by the data collection firm, AGIRED.

A visual comparison of the two series suggests that the difference between them is small. If that holds, there is no need to apply a weighting strategy to the dataset to correct the distribution of household selections across the targeted communes. To test this formally, we state two hypotheses. The null hypothesis (H0) is that the distribution of households achieved by the firm is not significantly different from the distribution of households we planned to attain for the study. The alternative hypothesis (H1) is that the distribution of households achieved by the firm differs significantly from the planned distribution.

In the following notebook cells, we use Python's `statsmodels` to perform a two-sample t-test on the planned and achieved sample sizes. We expect to find insufficient evidence to reject the null hypothesis (H0).

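We fail to reject H0 if the `p-value` of the t-test is greater than 0.05, that is, if the difference is not statistically significant at the 5% level. Note that failing to reject H0 does not prove it; it only means the data provide no evidence against it.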
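
The decision rule above can be written as a small helper. This is a minimal sketch: the function name `decide` and the `alpha` parameter are illustrative and not part of the original notebook, and the p-values passed to it are made-up examples.

```python
def decide(p_value: float, alpha: float = 0.05) -> str:
    """Apply the decision rule at significance level alpha (hypothetical helper)."""
    if p_value > alpha:
        # Not significant: the data give no evidence against H0.
        return "fail to reject H0"
    # Significant: the data are unlikely under H0.
    return "reject H0"


# Illustrative p-values only, not results from the study.
print(decide(0.20))
print(decide(0.01))
```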


In [None]:
# Mount the drive to access the sampling dataset.
from google.colab import drive

drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
# import pandas to read the csv file.
import pandas as pd

In [None]:
# Import ztest from the statsmodels package. This test is applied when the sample size is larger than 30.
from statsmodels.stats.weightstats import ztest

# Import ttest_ind from the statsmodels package. This test is applied when the sample size is smaller than 30.
from statsmodels.stats.weightstats import ttest_ind as ttest

In [None]:
# Load the csv file containing the data
sample_df = pd.read_csv('/content/drive/My Drive/PARE/sampling.csv', encoding="latin-1")

In [None]:

df = sample_df.copy()

In [None]:
df.shape

(16, 9)

In [None]:
# Let's take a look to make sure the data is correctly loaded.
df.head()

| | Department | Selected Communes | # of SDEs expected | Total # household interviews expected | # of SDEs Achieved | Total # household interviews Achieved | # of SDEs diff | Total # household interviews diff | % difference between expected and achieved (# interviews) |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Nord | Saint-Raphaël | 5 | 85 | 5 | 85 | 0 | 0 | 0% |
| 1 | Nord | Acul du Nord | 6 | 153 | 9 | 152 | 3 | -1 | -1% |
| 2 | Nord | Limonade | 11 | 85 | 5 | 84 | -6 | -1 | -1% |
| 3 | Nord | Pignon | 3 | 102 | 6 | 98 | 3 | -4 | -4% |
| 4 | Nord-Est | Terrier-Rouge | 8 | 136 | 8 | 136 | 0 | 0 | 0% |
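
The household difference columns in this preview can be reproduced from the expected and achieved counts. The sketch below does so in plain Python for the five communes shown above. The variable names are illustrative, and rounding the percentage to the nearest integer is an assumption about how the `% difference` column was produced.

```python
# Expected and achieved household interviews for the five previewed communes.
expected = [85, 153, 85, 102, 136]
achieved = [85, 152, 84, 98, 136]

# Absolute difference: achieved minus expected.
diffs = [a - e for e, a in zip(expected, achieved)]

# Percentage difference relative to the expected count, rounded to an integer
# (assumed rounding; the source file may have been computed differently).
pct_diffs = [round(100 * (a - e) / e) for e, a in zip(expected, achieved)]

for e, a, d, p in zip(expected, achieved, diffs, pct_diffs):
    print(f"expected={e} achieved={a} diff={d} pct={p}%")
```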


In [None]:
# Select the expected number of households per commune
x = df['Total # household interviews expected']

In [None]:
# Select the achieved number of households per commune
y = df['Total # household interviews Achieved']

In [None]:
# Perform a t-test since we have fewer than 30 data entries (16 communes)
ttest(x, y)

(0.05342808399947886, 0.9577450345436508, 30.0)
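
The returned tuple holds the t statistic, the p-value, and the degrees of freedom, in that order. By default, `ttest_ind` in `statsmodels` pools the variances of the two samples, so the degrees of freedom are n1 + n2 - 2, which gives 16 + 16 - 2 = 30 here. The sketch below reproduces the pooled computation in plain Python using only the five communes displayed by `df.head()`, so its statistic does not match the full 16-commune result above. The function name `pooled_t` is illustrative.

```python
from math import sqrt
from statistics import mean, variance


def pooled_t(a, b):
    """Two-sample t statistic with pooled variance (hypothetical helper)."""
    n1, n2 = len(a), len(b)
    dof = n1 + n2 - 2
    # Weighted average of the two sample variances.
    pooled_var = ((n1 - 1) * variance(a) + (n2 - 1) * variance(b)) / dof
    # Standard error of the difference between the two means.
    std_err = sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (mean(a) - mean(b)) / std_err, dof


# Only the five previewed communes, not the full dataset.
expected = [85, 153, 85, 102, 136]
achieved = [85, 152, 84, 98, 136]

t_stat, dof = pooled_t(expected, achieved)
print(t_stat, dof)
```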

In [None]:
#help(ttest)

In [None]:
# Perform an optional z-test to cross-check the results of the t-test.
ztest(x, y, value=0)

(0.05342808399947886, 0.9573908293674199)
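
The z-test and the t-test report the same statistic; only the reference distribution differs (standard normal versus Student's t with 30 degrees of freedom), which is why the two p-values differ slightly. As a check, the two-sided normal p-value can be recomputed from the statistic with the standard library alone. The helper name `two_sided_normal_p` is illustrative, and the result should agree closely with the z-test p-value reported above.

```python
from math import erfc, sqrt


def two_sided_normal_p(z: float) -> float:
    """Two-sided p-value under the standard normal: 2 * (1 - Phi(|z|))."""
    return erfc(abs(z) / sqrt(2))


# Test statistic reported by both ttest and ztest above.
z_stat = 0.05342808399947886
p_value = two_sided_normal_p(z_stat)
print(p_value)
```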

The t-test yields a p-value of 0.958, which is greater than 0.05. Therefore, we do not have sufficient evidence to reject the null hypothesis.

In conclusion, the test provides no evidence that the dataset needs to be weighted.