# Sample size calculation for stage sampling

In the cells below, we illustrate a simple example of sample size calculation for household surveys that use stage sampling designs. Assume that we want to calculate the sample size for a vaccination survey in Senegal, stratified by administrative region. We will use the 2017 Senegal Demographic and Health Survey (DHS) (see <https://www.dhsprogram.com/publications/publication-FR345-DHS-Final-Reports.cfm>) to get an idea of the vaccination coverage rates for some of the main vaccine doses: the hepatitis B birth dose (HepB0), the first and third doses of diphtheria, tetanus and pertussis vaccine (DTP1 and DTP3), the first dose of measles-containing vaccine (MCV1), and basic vaccination. Basic vaccination refers to children aged 12-23 months who received the BCG vaccine, three doses of DTP-containing vaccine, three doses of polio vaccine, and the first dose of measles-containing vaccine.

The table below shows the 2017 Senegal DHS vaccination coverage (in percent) for children aged 12 to 23 months.

| Region        | HepB0   | DTP1    | DTP3    | MCV1    | Basic vaccination  |
| :------------ | :-----: | :-----: | :-----: | :-----: | :----------------: |
| Dakar         | 53.6    | 99.1    | 98.5    | 97.0    | 84.9               |
| Ziguinchor    | 47.1    | 98.6    | 94.1    | 93.6    | 80.9               |
| Diourbel      | 62.8    | 94.6    | 88.2    | 86.1    | 68.2               |
| Saint-Louis   | 40.1    | 99.1    | 97.2    | 94.7    | 80.6               |
| Tambacounda   | 45.0    | 83.3    | 72.7    | 65.3    | 47.0               |
| Kaolack       | 63.9    | 99.6    | 92.2    | 89.3    | 79.7               |
| Thies         | 62.3    | 100.0   | 98.8    | 91.6    | 83.4               |
| Louga         | 49.8    | 96.2    | 87.8    | 81.5    | 67.8               |
| Fatick        | 62.7    | 98.5    | 93.8    | 90.3    | 76.6               |
| Kolda         | 32.8    | 94.4    | 87.3    | 85.6    | 63.7               |
| Matam         | 43.1    | 94.3    | 88.1    | 79.4    | 68.7               |
| Kaffrine      | 56.9    | 98.0    | 93.6    | 88.7    | 76.6               |
| Kedougou      | 44.4    | 70.7    | 60.2    | 46.5    | 33.6               |
| Sedhiou       | 46.6    | 96.8    | 90.4    | 89.9    | 74.2               |

The 2017 Senegal DHS data collection happened from April to December 2018. Therefore, the data shown in the table represent children born from October 2016 to December 2017. For the purpose of this tutorial, we will assume that these vaccine coverage rates still hold. Furthermore, we will use the basic vaccination coverage rates to calculate sample size.


In [1]:
from samplics.sampling import SampleSize

The first step is to create an object using the *SampleSize* class with the parameter of interest, the sample size calculation method, and the stratification status. In this example, we want to calculate sample sizes for proportions, using the Wald method for a stratified design. This is achieved with the following snippet of code.

```python
SampleSize(
    parameter="proportion", method="wald", stratification=True
)
```

Because we are using a stratified sample design, it is best to specify the expected coverage levels by stratum. If the information is not available by stratum, aggregated values can be used across the strata. The 2017 Senegal DHS published coverage rates by region, so the information is available by stratum. To provide this information to *Samplics*, we use a Python dictionary as follows:

```python
expected_coverage = {
    "Dakar": 0.849,
    "Ziguinchor": 0.809,
    "Diourbel": 0.682,
    "Saint-Louis": 0.806,
    "Tambacounda": 0.470,
    "Kaolack": 0.797,
    "Thies": 0.834,
    "Louga": 0.678,
    "Fatick": 0.766,
    "Kolda": 0.637,
    "Matam": 0.687,
    "Kaffrine": 0.766,
    "Kedougou": 0.336,
    "Sedhiou": 0.742,
}
```

Now, we want to calculate the sample size with a desired precision of 0.07, which means that we want the confidence interval around each expected vaccination coverage rate to have a half-width of 7 percentage points; e.g., an expected rate of 90% will have a confidence interval of [83%, 97%]. Note that the desired precision can be specified by stratum, in the same way as the target coverage, using a Python dictionary.
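As a small illustration of the last point, per-stratum precision can be laid out in a dictionary keyed by the same region names as the target coverage. The tighter 0.05 value for Dakar is a hypothetical choice made for this sketch only, and just three regions are shown for brevity.

```python
# Expected coverage for three of the regions (values from the table above).
expected_coverage = {"Dakar": 0.849, "Tambacounda": 0.470, "Kedougou": 0.336}

# Hypothetical per-stratum precision: 0.07 everywhere, tighter in Dakar.
half_ci_by_region = {region: 0.07 for region in expected_coverage}
half_ci_by_region["Dakar"] = 0.05

# Confidence interval implied by each (coverage, half-width) pair.
implied_ci = {
    region: (
        round(p - half_ci_by_region[region], 3),
        round(p + half_ci_by_region[region], 3),
    )
    for region, p in expected_coverage.items()
}

for region, (low, high) in implied_ci.items():
    print(f"{region}: [{low:.3f}, {high:.3f}]")
```

A dictionary like `half_ci_by_region`, extended to all fourteen regions, would then be passed as the `half_ci` argument of `calculate()` in place of the single value 0.07.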

Given that information, we can calculate the sample size using the *SampleSize* class as follows.

In [2]:
# target coverage rates
expected_coverage = {
    "Dakar": 0.849,
    "Ziguinchor": 0.809,
    "Diourbel": 0.682,
    "Saint-Louis": 0.806,
    "Tambacounda": 0.470,
    "Kaolack": 0.797,
    "Thies": 0.834,
    "Louga": 0.678,
    "Fatick": 0.766,
    "Kolda": 0.637,
    "Matam": 0.687,
    "Kaffrine": 0.766,
    "Kedougou": 0.336,
    "Sedhiou": 0.742,
}

# Declare the sample size calculation parameters
sen_vaccine_wald = SampleSize(
    parameter="proportion", method="wald", stratification=True
)

# calculate the sample size
sen_vaccine_wald.calculate(target=expected_coverage, half_ci=0.07)

# show the calculated sample size
print("\nCalculated sample sizes by stratum:")
sen_vaccine_wald.samp_size


Calculated sample sizes by stratum:


{'Dakar': 101,
 'Ziguinchor': 122,
 'Diourbel': 171,
 'Saint-Louis': 123,
 'Tambacounda': 196,
 'Kaolack': 127,
 'Thies': 109,
 'Louga': 172,
 'Fatick': 141,
 'Kolda': 182,
 'Matam': 169,
 'Kaffrine': 141,
 'Kedougou': 175,
 'Sedhiou': 151}
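As a sanity check, the figures above can be compared with the textbook Wald formula for a proportion, n = z² p (1 - p) / d², rounded up. The sketch below assumes a 95% confidence level (the tutorial does not state one) and uses only the standard library; the helper name `wald_sample_size` is introduced here for illustration and is not part of *Samplics*. That it matches the displayed values suggests, but does not prove, that the library computes them this way.

```python
import math
from statistics import NormalDist


def wald_sample_size(p: float, half_ci: float, alpha: float = 0.05) -> int:
    """Textbook Wald sample size for a proportion, rounded up."""
    z = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 when alpha = 0.05
    return math.ceil(z**2 * p * (1 - p) / half_ci**2)


# Three regions taken from the expected_coverage dictionary.
for region, p in {"Dakar": 0.849, "Tambacounda": 0.470, "Kedougou": 0.336}.items():
    print(region, wald_sample_size(p, half_ci=0.07))
# Prints 101, 196 and 175, matching the dictionary shown above.
```

Note that Tambacounda, whose expected coverage is closest to 50%, needs the largest sample, because p (1 - p) peaks at p = 0.5.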

*SampleSize* calculates the sample sizes and stores them in the *samp_size* attribute, which is a Python dictionary. If a dataframe is better suited for the use case, the method *to_dataframe()* can be used to create a pandas dataframe.

In [3]:
sen_vaccine_wald_size = sen_vaccine_wald.to_dataframe()

sen_vaccine_wald_size

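|     | _parameter  | _stratum     | _target | _half_ci | _samp_size |
| :-- | :---------- | :----------- | :-----: | :------: | :--------: |
| 0   | proportion  | Dakar        | 0.849   | 0.07     | 101        |
| 1   | proportion  | Ziguinchor   | 0.809   | 0.07     | 122        |
| 2   | proportion  | Diourbel     | 0.682   | 0.07     | 171        |
| 3   | proportion  | Saint-Louis  | 0.806   | 0.07     | 123        |
| 4   | proportion  | Tambacounda  | 0.47    | 0.07     | 196        |
| 5   | proportion  | Kaolack      | 0.797   | 0.07     | 127        |
| 6   | proportion  | Thies        | 0.834   | 0.07     | 109        |
| 7   | proportion  | Louga        | 0.678   | 0.07     | 172        |
| 8   | proportion  | Fatick       | 0.766   | 0.07     | 141        |
| 9   | proportion  | Kolda        | 0.637   | 0.07     | 182        |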


The sample size calculation above assumes that the design effect (DEFF) is equal to 1. A design effect of 1 corresponds to a sampling design with a variance equivalent to that of a simple random sample of the same size. In the context of complex sampling designs, DEFF is often different from 1. Stage sampling and unequal weights usually increase the design effect above 1. The 2017 Senegal DHS reported a design effect of 1.963 (1.401^2) for basic vaccination. Hence, we will use the design effect provided by the DHS to calculate the sample size.

In [4]:
sen_vaccine_wald.calculate(
    target=expected_coverage, half_ci=0.07, deff=1.401 ** 2
)

sen_vaccine_wald.to_dataframe()

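|     | _parameter  | _stratum     | _target | _half_ci | _samp_size |
| :-- | :---------- | :----------- | :-----: | :------: | :--------: |
| 0   | proportion  | Dakar        | 0.849   | 0.07     | 198        |
| 1   | proportion  | Ziguinchor   | 0.809   | 0.07     | 238        |
| 2   | proportion  | Diourbel     | 0.682   | 0.07     | 334        |
| 3   | proportion  | Saint-Louis  | 0.806   | 0.07     | 241        |
| 4   | proportion  | Tambacounda  | 0.47    | 0.07     | 384        |
| 5   | proportion  | Kaolack      | 0.797   | 0.07     | 249        |
| 6   | proportion  | Thies        | 0.834   | 0.07     | 214        |
| 7   | proportion  | Louga        | 0.678   | 0.07     | 336        |
| 8   | proportion  | Fatick       | 0.766   | 0.07     | 276        |
| 9   | proportion  | Kolda        | 0.637   | 0.07     | 356        |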
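The effect of DEFF can be sketched as a simple inflation of the simple-random-sampling size: multiply the unrounded Wald size by DEFF, then round up. The snippet below is a standard-library sketch under an assumed 95% confidence level; `wald_sample_size_deff` is a hypothetical helper, not a *Samplics* function.

```python
import math
from statistics import NormalDist

Z = NormalDist().inv_cdf(0.975)  # assumed 95% confidence level


def wald_sample_size_deff(p: float, half_ci: float, deff: float = 1.0) -> int:
    """Wald sample size inflated by a design effect, rounded up once at the end."""
    n_srs = Z**2 * p * (1 - p) / half_ci**2  # size under simple random sampling
    return math.ceil(deff * n_srs)


deff = 1.401**2  # design effect reported for basic vaccination
print(wald_sample_size_deff(0.849, 0.07, deff))  # Dakar: prints 198
print(wald_sample_size_deff(0.470, 0.07, deff))  # Tambacounda: prints 384
```

Because the inflation is applied before rounding, the results are not exactly 1.963 times the rounded sizes obtained with DEFF equal to 1.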


Since the sample design is stratified, the sample size calculation will be more precise if DEFF is specified at the stratum level; these values are available in the 2017 Senegal DHS report. Some regions have a design effect below 1. To be conservative with our sample size calculation, we will use 1.21 (that is, 1.100^2) as the minimum design effect.

In [5]:
# Design effects by stratum, with 1.100 ** 2 = 1.21 as the minimum
expected_deff = {
    "Dakar": 1.100 ** 2,
    "Ziguinchor": 1.100 ** 2,
    "Diourbel": 1.346 ** 2,
    "Saint-Louis": 1.484 ** 2,
    "Tambacounda": 1.366 ** 2,
    "Kaolack": 1.360 ** 2,
    "Thies": 1.109 ** 2,
    "Louga": 1.902 ** 2,
    "Fatick": 1.100 ** 2,
    "Kolda": 1.217 ** 2,
    "Matam": 1.403 ** 2,
    "Kaffrine": 1.256 ** 2,
    "Kedougou": 2.280 ** 2,
    "Sedhiou": 1.335 ** 2,
}

# Calculate sample sizes using deff at the stratum level
sen_vaccine_wald.calculate(
    target=expected_coverage, half_ci=0.07, deff=expected_deff
)

# Convert sample sizes to a dataframe
sen_vaccine_wald.to_dataframe()

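|     | _parameter  | _stratum     | _target | _half_ci | _samp_size |
| :-- | :---------- | :----------- | :-----: | :------: | :--------: |
| 0   | proportion  | Dakar        | 0.849   | 0.07     | 122        |
| 1   | proportion  | Ziguinchor   | 0.809   | 0.07     | 147        |
| 2   | proportion  | Diourbel     | 0.682   | 0.07     | 309        |
| 3   | proportion  | Saint-Louis  | 0.806   | 0.07     | 270        |
| 4   | proportion  | Tambacounda  | 0.47    | 0.07     | 365        |
| 5   | proportion  | Kaolack      | 0.797   | 0.07     | 235        |
| 6   | proportion  | Thies        | 0.834   | 0.07     | 134        |
| 7   | proportion  | Louga        | 0.678   | 0.07     | 620        |
| 8   | proportion  | Fatick       | 0.766   | 0.07     | 171        |
| 9   | proportion  | Kolda        | 0.637   | 0.07     | 269        |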
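The minimum design effect used in this cell can be applied programmatically instead of by hand. The sketch below floors each pre-squared value at 1.100 and then squares it. The 0.95 entry for Dakar is a made-up placeholder (the report's actual figure is not quoted in this tutorial); the Thies and Louga values are the ones used in the cell above.

```python
# Values to be squared into design effects. Dakar's 0.95 is hypothetical.
reported_deft = {"Dakar": 0.95, "Thies": 1.109, "Louga": 1.902}

MIN_DEFT = 1.100  # floor used in this tutorial; MIN_DEFT ** 2 is 1.21

# Floor each value, then square it to obtain the DEFF passed to calculate().
floored_deff = {
    region: max(deft, MIN_DEFT) ** 2 for region, deft in reported_deft.items()
}

for region, deff in floored_deff.items():
    print(f"{region}: {deff:.3f}")
```

Only regions below the floor are changed, so Thies and Louga keep their reported values.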


The sample size calculation above does not account for attrition of sample sizes due to non-response. In the 2017 Senegal DHS, the overall household and women response rate was about 94.2%.

In [6]:
# Calculate sample sizes with a resp_rate of 94.2%
sen_vaccine_wald.calculate(
    target=expected_coverage, half_ci=0.07, deff=expected_deff, resp_rate=0.942
)

# Convert sample sizes to a dataframe
sen_vaccine_wald.to_dataframe(
    col_names=[
        "Parameter",
        "region",
        "vaccine_coverage",
        "precision",
        "number_12_23_months",
    ]
)

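|     | Parameter   | region       | vaccine_coverage | precision | number_12_23_months |
| :-- | :---------- | :----------- | :--------------: | :-------: | :-----------------: |
| 0   | proportion  | Dakar        | 0.849            | 0.07      | 130                 |
| 1   | proportion  | Ziguinchor   | 0.809            | 0.07      | 157                 |
| 2   | proportion  | Diourbel     | 0.682            | 0.07      | 329                 |
| 3   | proportion  | Saint-Louis  | 0.806            | 0.07      | 287                 |
| 4   | proportion  | Tambacounda  | 0.47             | 0.07      | 388                 |
| 5   | proportion  | Kaolack      | 0.797            | 0.07      | 250                 |
| 6   | proportion  | Thies        | 0.834            | 0.07      | 143                 |
| 7   | proportion  | Louga        | 0.678            | 0.07      | 659                 |
| 8   | proportion  | Fatick       | 0.766            | 0.07      | 182                 |
| 9   | proportion  | Kolda        | 0.637            | 0.07      | 286                 |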
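The non-response adjustment amounts to dividing the required sample size by the expected response rate. The sketch below reproduces the displayed values by rounding up twice: once after applying DEFF and once after dividing by the response rate. This two-step rounding is inferred from the numbers shown here rather than taken from the *Samplics* documentation. The helper name and the 95% confidence level are likewise assumptions of this sketch.

```python
import math
from statistics import NormalDist

Z = NormalDist().inv_cdf(0.975)  # assumed 95% confidence level


def wald_sample_size_full(
    p: float, half_ci: float, deff: float = 1.0, resp_rate: float = 1.0
) -> int:
    """Wald sample size with design effect and non-response inflation."""
    # Step 1: design-adjusted size, rounded up.
    n_design = math.ceil(deff * Z**2 * p * (1 - p) / half_ci**2)
    # Step 2: inflate for expected non-response, rounded up again.
    return math.ceil(n_design / resp_rate)


print(wald_sample_size_full(0.849, 0.07, deff=1.100**2, resp_rate=0.942))  # Dakar: 130
print(wald_sample_size_full(0.678, 0.07, deff=1.902**2, resp_rate=0.942))  # Louga: 659
```

Rounding only once at the end would give 658 for Louga rather than the 659 shown, which is why the intermediate rounding is kept here.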


### Fleiss method

The World Health Organization (WHO) recommends using the Fleiss method for calculating sample sizes for vaccination coverage surveys (see https://www.who.int/immunization/documents/who_ivb_18.09/en/). To use the Fleiss method, the calls are the same as in the examples above, except that *method="fleiss"*.

In [7]:
sen_vaccine_fleiss = SampleSize(
    parameter="proportion", method="fleiss", stratification=True
)

sen_vaccine_fleiss.calculate(
    target=expected_coverage, half_ci=0.07, deff=expected_deff, resp_rate=0.942
)


sen_vaccine_sample = sen_vaccine_fleiss.to_dataframe(
    col_names=[
        "Parameter",
        "region",
        "vaccine_coverage",
        "precision",
        "number_12_23_months",
    ]
)
sen_vaccine_sample.head(15)

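|     | Parameter   | region       | vaccine_coverage | precision | number_12_23_months |
| :-- | :---------- | :----------- | :--------------: | :-------: | :-----------------: |
| 0   | proportion  | Dakar        | 0.849            | 0.07      | 191                 |
| 1   | proportion  | Ziguinchor   | 0.809            | 0.07      | 211                 |
| 2   | proportion  | Diourbel     | 0.682            | 0.07      | 399                 |
| 3   | proportion  | Saint-Louis  | 0.806            | 0.07      | 385                 |
| 4   | proportion  | Tambacounda  | 0.47             | 0.07      | 410                 |
| 5   | proportion  | Kaolack      | 0.797            | 0.07      | 330                 |
| 6   | proportion  | Thies        | 0.834            | 0.07      | 201                 |
| 7   | proportion  | Louga        | 0.678            | 0.07      | 795                 |
| 8   | proportion  | Fatick       | 0.766            | 0.07      | 229                 |
| 9   | proportion  | Kolda        | 0.637            | 0.07      | 325                 |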


At this point, we have the number of children aged 12-23 months needed to achieve the desired precision given the expected proportions, using either the Wald or the Fleiss calculation method.

### Number of households

To obtain the number of households, we need to know the expected average number of children aged 12-23 months per household. This information can be obtained from census data or from survey rosters. Since the design is stratified, it is best to obtain the information per stratum. In this example, we will assume that 5.2% of the population is between 12 and 23 months of age and apply that figure to all strata and households, so the code below divides the number of children by 0.052. Hence, the minimum number of households to select is:

In [8]:
sen_vaccine_sample["number_households"] = round(
    sen_vaccine_sample["number_12_23_months"] / 0.052, 0
)

sen_vaccine_sample.head(15)

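|     | Parameter   | region       | vaccine_coverage | precision | number_12_23_months | number_households |
| :-- | :---------- | :----------- | :--------------: | :-------: | :-----------------: | :---------------: |
| 0   | proportion  | Dakar        | 0.849            | 0.07      | 191                 | 3673.0            |
| 1   | proportion  | Ziguinchor   | 0.809            | 0.07      | 211                 | 4058.0            |
| 2   | proportion  | Diourbel     | 0.682            | 0.07      | 399                 | 7673.0            |
| 3   | proportion  | Saint-Louis  | 0.806            | 0.07      | 385                 | 7404.0            |
| 4   | proportion  | Tambacounda  | 0.47             | 0.07      | 410                 | 7885.0            |
| 5   | proportion  | Kaolack      | 0.797            | 0.07      | 330                 | 6346.0            |
| 6   | proportion  | Thies        | 0.834            | 0.07      | 201                 | 3865.0            |
| 7   | proportion  | Louga        | 0.678            | 0.07      | 795                 | 15288.0           |
| 8   | proportion  | Fatick       | 0.766            | 0.07      | 229                 | 4404.0            |
| 9   | proportion  | Kolda        | 0.637            | 0.07      | 325                 | 6250.0            |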


Similarly, the number of clusters to select can be obtained by dividing the number of households by the number of households to be selected per cluster.
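A minimal sketch of that last step, using two of the Fleiss results from above. The figure of 20 households per cluster is a hypothetical value chosen only for illustration, and `clusters_needed` is a helper introduced here, not part of *Samplics*. Households are rounded as in the cell above, and clusters are rounded up so that the household target is always met.

```python
import math

CHILDREN_PER_HOUSEHOLD = 0.052  # value used in the tutorial
HOUSEHOLDS_PER_CLUSTER = 20     # hypothetical number selected per cluster


def clusters_needed(n_children: int) -> tuple[int, int]:
    """Return (households, clusters) for a required number of children."""
    households = round(n_children / CHILDREN_PER_HOUSEHOLD)
    clusters = math.ceil(households / HOUSEHOLDS_PER_CLUSTER)
    return households, clusters


# Number of children aged 12-23 months from the Fleiss calculation.
for region, n_children in {"Dakar": 191, "Louga": 795}.items():
    households, clusters = clusters_needed(n_children)
    print(f"{region}: {households} households, {clusters} clusters")
```

In practice, the per-cluster take would be set from fieldwork constraints and could differ by stratum.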