## Sample size calculation for stage sampling

In the sections below, we illustrate a simple example of sample size calculation in the context of household surveys using stage sampling designs. Let's assume that we want to calculate the sample size for a vaccination survey in Senegal and that we want to stratify the sample by administrative region. We will use the 2017 Senegal Demographic and Health Survey (DHS) (see <https://www.dhsprogram.com/publications/publication-FR345-DHS-Final-Reports.cfm>) to get an idea of the coverage rates for some of the main vaccine doses.

The table below shows the 2017 Senegal DHS vaccination coverage (in percent) among children aged 12 to 23 months for the hepatitis B birth dose (HepB0), the first and third doses of diphtheria, tetanus and pertussis vaccine (DTP1 and DTP3), the first dose of measles-containing vaccine (MCV1), and basic vaccination. Basic vaccination refers to children aged 12 to 23 months who received the BCG vaccine, three doses of DTP-containing vaccine, three doses of polio vaccine, and the first dose of measles-containing vaccine.

| Region        | HepB0   | DTP1    | DTP3    | MCV1    | Basic vaccination  |
| :------------ | :-----: | :-----: | :-----: | :-----: | :----------------: |
| Dakar         | 53.6    | 99.1    | 98.5    | 97.0    | 84.9               |
| Ziguinchor    | 47.1    | 98.6    | 94.1    | 93.6    | 80.9               |
| Diourbel      | 62.8    | 94.6    | 88.2    | 86.1    | 68.2               |
| Saint-Louis   | 40.1    | 99.1    | 97.2    | 94.7    | 80.6               |
| Tambacounda   | 45.0    | 83.3    | 72.7    | 65.3    | 47.0               |
| Kaolack       | 63.9    | 99.6    | 92.2    | 89.3    | 79.7               |
| Thies         | 62.3    | 100.0   | 98.8    | 91.6    | 83.4               |
| Louga         | 49.8    | 96.2    | 87.8    | 81.5    | 67.8               |
| Fatick        | 62.7    | 98.5    | 93.8    | 90.3    | 76.6               |
| Kolda         | 32.8    | 94.4    | 87.3    | 85.6    | 63.7               |
| Matam         | 43.1    | 94.3    | 88.1    | 79.4    | 68.7               |
| Kaffrine      | 56.9    | 98.0    | 93.6    | 88.7    | 76.6               |
| Kedougou      | 44.4    | 70.7    | 60.2    | 46.5    | 33.6               |
| Sedhiou       | 46.6    | 96.8    | 90.4    | 89.9    | 74.2               |

Data collection happened from April to December 2018. Therefore, the data shown in the table represent children born from October 2016 to December 2017. For the purpose of this tutorial, we will assume that these vaccination coverage rates still hold.


```python
import numpy as np
import pandas as pd

import samplics
from samplics.sampling import SampleSize
```

The first step is to instantiate the *SampleSize* class to create an object that represents the parameter of interest, the sample size calculation method, and the stratification status. In this example, we want to calculate sample sizes for proportions, using the Wald method, for a stratified design. This is achieved with the following snippet of code.

```python
SampleSize(
    parameter="proportion", method="wald", stratification=True
)
```

Because we are using a stratified design, it is best to specify the expected coverage levels by stratum. If that information is not available, aggregate expected proportions can be used across the strata. The 2017 Senegal DHS published the coverage rates by region, so the information is available by stratum. To pass it to *Samplics*, we use a Python dictionary, here filled with the basic vaccination coverage rates, as follows.

```python
expected_coverage = {
    "Dakar": 0.849,
    "Ziguinchor": 0.809,
    "Diourbel": 0.682,
    "Saint-Louis": 0.806,
    "Tambacounda": 0.470,
    "Kaolack": 0.797,
    "Thies": 0.834,
    "Louga": 0.678,
    "Fatick": 0.766,
    "Kolda": 0.637,
    "Matam": 0.687,
    "Kaffrine": 0.766,
    "Kedougou": 0.336,
    "Sedhiou": 0.742,
}
```

Now we want to calculate the sample size with a desired precision of 0.07, which means that we want the confidence intervals of the expected vaccination coverage rates to have a half-width of 7 percentage points, e.g. an expected rate of 90% with a confidence interval of [83%, 97%].

Given that information, we can calculate the sample size using the *SampleSize* class as follows.

```python
# Target coverage rates
expected_coverage = {
    "Dakar": 0.849,
    "Ziguinchor": 0.809,
    "Diourbel": 0.682,
    "Saint-Louis": 0.806,
    "Tambacounda": 0.470,
    "Kaolack": 0.797,
    "Thies": 0.834,
    "Louga": 0.678,
    "Fatick": 0.766,
    "Kolda": 0.637,
    "Matam": 0.687,
    "Kaffrine": 0.766,
    "Kedougou": 0.336,
    "Sedhiou": 0.742,
}

# Declare the sample size calculation parameters
sen_vaccine_wald = SampleSize(
    parameter="proportion", method="wald", stratification=True
)

# Calculate the sample size
sen_vaccine_wald.calculate(target=expected_coverage, precision=0.07)

# Show the calculated sample size
sen_vaccine_wald.samp_size
```

```
{'Dakar': 101.0,
 'Ziguinchor': 122.0,
 'Diourbel': 171.0,
 'Saint-Louis': 123.0,
 'Tambacounda': 196.0,
 'Kaolack': 127.0,
 'Thies': 109.0,
 'Louga': 172.0,
 'Fatick': 141.0,
 'Kolda': 182.0,
 'Matam': 169.0,
 'Kaffrine': 141.0,
 'Kedougou': 175.0,
 'Sedhiou': 151.0}
```
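The sample sizes above are consistent with the textbook Wald formula for a proportion, n = z² p (1 − p) / d², rounded up to the next integer. The sketch below is a minimal standard-library illustration of that formula and not the *Samplics* implementation. It assumes a 95% confidence level (the text does not state one, but this choice reproduces the figures above), and the function name `wald_sample_size` is hypothetical.

```python
import math
from statistics import NormalDist


def wald_sample_size(p: float, half_width: float, alpha: float = 0.05) -> int:
    """Textbook Wald sample size for a proportion under simple random sampling.

    Assumption: alpha = 0.05 (95% confidence level), which is not stated in the tutorial.
    """
    z = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 when alpha = 0.05
    return math.ceil(z**2 * p * (1 - p) / half_width**2)


# Three of the regional coverage rates used in this tutorial
for region, p in {"Dakar": 0.849, "Tambacounda": 0.470, "Kedougou": 0.336}.items():
    print(region, wald_sample_size(p, half_width=0.07))
```

Because p (1 − p) is largest at p = 0.5, strata with coverage close to 50% (such as Tambacounda) need the largest samples for a given precision.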

*SampleSize* calculates the sample sizes and stores them in the *samp_size* attribute, which is a Python dictionary. If a dataframe is better suited to the use case, the *to_dataframe* method can be used to create one.

```python
sen_vaccine_wald_size = sen_vaccine_wald.to_dataframe()

sen_vaccine_wald_size
```

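|     | _stratum    | _target | _precision | _samp_size |
| :-- | :---------- | :-----: | :--------: | :--------: |
| 0   | Dakar       | 0.849   | 0.07       | 101.0      |
| 1   | Ziguinchor  | 0.809   | 0.07       | 122.0      |
| 2   | Diourbel    | 0.682   | 0.07       | 171.0      |
| 3   | Saint-Louis | 0.806   | 0.07       | 123.0      |
| 4   | Tambacounda | 0.47    | 0.07       | 196.0      |
| 5   | Kaolack     | 0.797   | 0.07       | 127.0      |
| 6   | Thies       | 0.834   | 0.07       | 109.0      |
| 7   | Louga       | 0.678   | 0.07       | 172.0      |
| 8   | Fatick      | 0.766   | 0.07       | 141.0      |
| 9   | Kolda       | 0.637   | 0.07       | 182.0      |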


The sample size calculation above assumes that the design effect (DEFF) is equal to 1. In the context of complex sampling designs, DEFF usually differs from 1: stage sampling and unequal weights tend to increase the design effect above 1. The 2017 Senegal DHS reported a design effect of 1.963 (1.401^2) for basic vaccination. Hence, we will use the design effect provided by DHS to calculate the sample size.

```python
sen_vaccine_wald.calculate(target=expected_coverage, precision=0.07, deff=1.401 ** 2)

sen_vaccine_wald.to_dataframe()
```

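|     | _stratum    | _target | _precision | _samp_size |
| :-- | :---------- | :-----: | :--------: | :--------: |
| 0   | Dakar       | 0.849   | 0.07       | 198.0      |
| 1   | Ziguinchor  | 0.809   | 0.07       | 238.0      |
| 2   | Diourbel    | 0.682   | 0.07       | 334.0      |
| 3   | Saint-Louis | 0.806   | 0.07       | 241.0      |
| 4   | Tambacounda | 0.47    | 0.07       | 384.0      |
| 5   | Kaolack     | 0.797   | 0.07       | 249.0      |
| 6   | Thies       | 0.834   | 0.07       | 214.0      |
| 7   | Louga       | 0.678   | 0.07       | 336.0      |
| 8   | Fatick      | 0.766   | 0.07       | 276.0      |
| 9   | Kolda       | 0.637   | 0.07       | 356.0      |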


Since the sample design is stratified, the sample size calculation will be more precise if DEFF is specified at the stratum level. Again, the 2017 Senegal DHS provides this information by region, so we use it to calculate the sample sizes.

```python
# Expected design effects (DEFF) by region
expected_deff = {
    "Dakar": 1.100 ** 2,
    "Ziguinchor": 1.100 ** 2,
    "Diourbel": 1.346 ** 2,
    "Saint-Louis": 1.484 ** 2,
    "Tambacounda": 1.366 ** 2,
    "Kaolack": 1.360 ** 2,
    "Thies": 1.109 ** 2,
    "Louga": 1.902 ** 2,
    "Fatick": 1.100 ** 2,
    "Kolda": 1.217 ** 2,
    "Matam": 1.403 ** 2,
    "Kaffrine": 1.256 ** 2,
    "Kedougou": 2.280 ** 2,
    "Sedhiou": 1.335 ** 2,
}

# Calculate sample sizes using deff at the stratum level
sen_vaccine_wald.calculate(target=expected_coverage, precision=0.07, deff=expected_deff)

# Convert sample sizes to a dataframe
sen_vaccine_wald.to_dataframe()
```

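|     | _stratum    | _target | _precision | _samp_size |
| :-- | :---------- | :-----: | :--------: | :--------: |
| 0   | Dakar       | 0.849   | 0.07       | 122.0      |
| 1   | Ziguinchor  | 0.809   | 0.07       | 147.0      |
| 2   | Diourbel    | 0.682   | 0.07       | 309.0      |
| 3   | Saint-Louis | 0.806   | 0.07       | 270.0      |
| 4   | Tambacounda | 0.47    | 0.07       | 365.0      |
| 5   | Kaolack     | 0.797   | 0.07       | 235.0      |
| 6   | Thies       | 0.834   | 0.07       | 134.0      |
| 7   | Louga       | 0.678   | 0.07       | 620.0      |
| 8   | Fatick      | 0.766   | 0.07       | 171.0      |
| 9   | Kolda       | 0.637   | 0.07       | 269.0      |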
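The DEFF-adjusted results, both with a single overall value and with stratum-level values, appear consistent with multiplying the simple-random-sampling Wald size by DEFF before rounding up. The sketch below illustrates that arithmetic with the standard library only. It is not the *Samplics* implementation; the function name `deff_adjusted_size` is hypothetical, and a 95% confidence level is assumed.

```python
import math
from statistics import NormalDist


def deff_adjusted_size(
    p: float, half_width: float, deff: float = 1.0, alpha: float = 0.05
) -> int:
    """Wald sample size for a proportion, inflated by the design effect.

    Assumption: alpha = 0.05 (95% confidence level), which is not stated in the tutorial.
    """
    z = NormalDist().inv_cdf(1 - alpha / 2)
    n_srs = z**2 * p * (1 - p) / half_width**2  # size under simple random sampling
    return math.ceil(deff * n_srs)  # inflate first, then round up


# Overall design effect for basic vaccination
print(deff_adjusted_size(0.849, 0.07, deff=1.401**2))  # Dakar
# Stratum-level design effects
print(deff_adjusted_size(0.849, 0.07, deff=1.100**2))  # Dakar
print(deff_adjusted_size(0.678, 0.07, deff=1.902**2))  # Louga
```

Louga illustrates why stratum-level values matter: its design effect is well above the overall value, so its required sample size grows considerably, while Dakar's shrinks.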


The sample size calculation above does not account for attrition of the sample due to non-response. In the 2017 Senegal DHS, the overall household and women response rate was about 94.2%. We pass this rate to the calculation through the *resp_rate* argument.

```python
# Calculate sample sizes with a resp_rate of 94.2%
sen_vaccine_wald.calculate(
    target=expected_coverage, precision=0.07, deff=expected_deff, resp_rate=0.942
)

# Convert sample sizes to a dataframe
sen_vaccine_wald.to_dataframe()
```

|     | _stratum    | _target | _precision | _samp_size |
| :-- | :---------- | :-----: | :--------: | :--------: |
| 0   | Dakar       | 0.849   | 0.07       | 129.511677 |
| 1   | Ziguinchor  | 0.809   | 0.07       | 156.050955 |
| 2   | Diourbel    | 0.682   | 0.07       | 328.025478 |
| 3   | Saint-Louis | 0.806   | 0.07       | 286.624204 |
| 4   | Tambacounda | 0.47    | 0.07       | 387.473461 |
| 5   | Kaolack     | 0.797   | 0.07       | 249.469214 |
| 6   | Thies       | 0.834   | 0.07       | 142.250531 |
| 7   | Louga       | 0.678   | 0.07       | 658.174098 |
| 8   | Fatick      | 0.766   | 0.07       | 181.528662 |
| 9   | Kolda       | 0.637   | 0.07       | 285.562633 |
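The values above match the DEFF-adjusted sample sizes divided by the response rate, and they are no longer whole numbers. Since a survey cannot interview a fraction of a respondent, one reasonable follow-up is to round each value up. The sketch below shows that arithmetic for two strata; the helper name `inflate_for_nonresponse` is hypothetical, and rounding up is a suggestion rather than something the output above does.

```python
import math


def inflate_for_nonresponse(n: float, resp_rate: float) -> float:
    """Divide the required number of respondents by the expected response rate."""
    if not 0 < resp_rate <= 1:
        raise ValueError("resp_rate must be in (0, 1]")
    return n / resp_rate


# DEFF-adjusted sample sizes for two strata, taken from the previous step
for region, n in {"Dakar": 122, "Louga": 620}.items():
    inflated = inflate_for_nonresponse(n, resp_rate=0.942)
    # Show the raw inflated value and the whole number of units to select
    print(region, round(inflated, 6), math.ceil(inflated))
```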


The World Health Organization (WHO) recommends using the Fleiss method for calculating sample sizes for vaccination coverage surveys (see https://www.who.int/immunization/documents/who_ivb_18.09/en/). The Fleiss method adjusts for asymmetries in the distribution of proportions. The API is used in the same way as shown above, with *method="fleiss"*.

```python
sen_vaccine_fleiss = SampleSize(
    parameter="proportion", method="fleiss", stratification=True
)

sen_vaccine_fleiss.calculate(
    target=expected_coverage, precision=0.07, deff=expected_deff, resp_rate=0.942
)

sen_vaccine_fleiss.to_dataframe()
```

|     | _stratum    | _target | _precision | _samp_size |
| :-- | :---------- | :-----: | :--------: | :--------: |
| 0   | Dakar       | 0.849   | 0.07       | 190.021231 |
| 1   | Ziguinchor  | 0.809   | 0.07       | 210.191083 |
| 2   | Diourbel    | 0.682   | 0.07       | 398.089172 |
| 3   | Saint-Louis | 0.806   | 0.07       | 384.288747 |
| 4   | Tambacounda | 0.47    | 0.07       | 409.766454 |
| 5   | Kaolack     | 0.797   | 0.07       | 329.087049 |
| 6   | Thies       | 0.834   | 0.07       | 200.636943 |
| 7   | Louga       | 0.678   | 0.07       | 794.055202 |
| 8   | Fatick      | 0.766   | 0.07       | 228.237792 |
| 9   | Kolda       | 0.637   | 0.07       | 324.840764 |
