# 2. Sample Selection 

## Table of Contents

- [Objective](#section0)

- [Sampling Stage 1 - Selection of Primary Sampling Units (PSUs)](#section1)
    - [PSU Probability of Selection](#section11)
    - [PSU Selection](#section12)

- [Sampling Stage 2 - Selection of Secondary Sampling Units (SSUs)](#section2)
    - [SSU Probability of Selection](#section21)
    - [SSU Selection](#section22)


## Objective <a name="section0"></a>

This tutorial illustrates the use of the class *Sample* for selecting a random sample from a finite frame of sampling units. More precisely, the objectives of this tutorial are:

- Overview of some common sample selection techniques
- Introduction to the attributes and methods of the class *Sample*

**Important.** This tutorial is not designed to teach sample selection; rather, it provides a minimal review as a reference for the user. At the end of this tutorial, the user should have a good understanding of the *Sample* API.


In [1]:
#%load_ext lab_black 

# In this cell, all the necessary python packages and classes are imported in the workspace.
import numpy as np
import pandas as pd

# first run (from the terminal): pip3 install samplics
from samplics import Sample

In the sections below, we select samples using several random selection procedures. Each section gives a short review of the selection method. The three main selection methods implemented in the class *Sample* are simple random selection (SRS), systematic selection (SYS), and probability proportional to size (PPS).

The class *Sample* has two main methods: *inclusion_probs* and *select*. *inclusion_probs* computes the probabilities of selection and *select* draws random samples. For more details visit: samplics.io/readthedocs/samplics.sampling

Some of the attributes of the class *Sample* are the following:
- *method* indicates the selection method. The currently implemented methods are: srs, sys, pps-brewer, pps-hv (for the Hanurav-Vijayan algorithm), pps-murphy, pps-sampford, and pps-sys.
- *stratification* indicates if selection is stratified.
- *with_replacement* indicates if selection is with replacement.
- *fpc* provides the finite population correction - **(To Do list)**

## Sampling at Stage 1 - Selection of Primary Sampling Units (PSUs) <a name="section1"></a>

The file *psu_frame.csv*, shown below, contains synthetic data on 100 clusters classified by region. Each cluster represents a group of households. In the file, each cluster has an associated number of households (*number_households_census*) and a status variable (*cluster_status*) indicating whether the cluster is in scope.

This synthetic data represents a simplified version of the enumeration area (EA) frames found in many countries and used by major household survey programs such as the Demographic and Health Surveys (DHS), the Population-based HIV Impact Assessment (PHIA), and the Multiple Indicator Cluster Surveys (MICS).

In [2]:
psu_frame = pd.read_csv("psu_frame.csv")

psu_frame.head(25)

Unnamed: 0,cluster,region,number_households_census,cluster_status,comment
0,1,North,105,1,
1,2,North,85,1,
2,3,North,95,1,
3,4,North,75,1,
4,5,North,120,1,
5,6,North,90,1,
6,7,North,130,1,
7,8,North,55,1,
8,9,North,30,1,
9,10,North,600,1,due to a large building


Often, sampling frames are not available for the sampling units of interest. For example, most countries do not have an accurate list of all households or people living in the country. Even if such a frame exists, it may not be operationally or financially feasible to select sampling units directly without any form of clustering.

Hence, multistage sampling is a common strategy used by large national household surveys for selecting samples of households and people. At the first stage, geographic or administrative clusters of households are selected. At the second stage, a frame of households is created from the selected clusters and a sample of households is selected. At the third stage (if applicable), a sample of people is selected from the households in the sample. This is a high-level description of the process; actual implementations are usually much less straightforward, with many adjustments to address complexities.

### PSU Probability of Selection <a name="section11"></a>

At the first stage, we use the probability proportional to size (pps) method to select a random sample of clusters. The measure of size is the number of households (*number_households_census*) as provided in the psu sampling frame. The sample is stratified by region. The probabilities for stratified pps are obtained as follows: \begin{equation} p_{hi} = \frac{n_h M_{hi}}{\sum_{i=1}^{N_h}{M_{hi}}} \end{equation} where $p_{hi}$ is the probability of selection for unit $i$ from stratum $h$, $M_{hi}$ is the measure of size (mos), and $n_h$ and $N_h$ are the sample size and the total number of clusters in stratum $h$, respectively.
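
As a quick check of the formula, the sketch below computes $p_{hi}$ by hand for the ten *North* clusters listed above. It assumes a stratum sample size of $n_h = 2$, which is the value this tutorial uses for *North*, and it relies on plain pandas rather than the *SAMPLICS* API. The column name `mos` is only an illustrative label.

```python
import pandas as pd

# Measures of size (number_households_census) for the ten North clusters shown above
north = pd.DataFrame(
    {
        "cluster": range(1, 11),
        "mos": [105, 85, 95, 75, 120, 90, 130, 55, 30, 600],
    }
)

n_h = 2  # assumed sample size for the North stratum

# p_hi = n_h * M_hi / (sum of M_hi over the stratum)
north["p_hi"] = n_h * north["mos"] / north["mos"].sum()

print(north.round(6))
```

The stratum total is 1385 households, so cluster 1 gets 2 × 105 / 1385 ≈ 0.1516 and cluster 10 gets 2 × 600 / 1385 ≈ 0.8664. By construction, the probabilities within a stratum sum to $n_h$.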

### PSU Sample size

For a stratified sample design, the sample size is a Python dictionary, which allows us to pair the strata with the sample sizes. Let's say that we want to select 3 clusters from stratum *East*, 2 from *West*, 2 from *North*, and 3 from *South*. The code below shows how to create the Python dictionary. Note that it is important to spell the keys of the dictionary correctly: they must match the values of the variable representing the stratum (in our case, *region*).

In [3]:
psu_sample_size = {"East":3, "West": 2, "North": 2, "South": 3}
print(psu_sample_size)

{'East': 3, 'West': 2, 'North': 2, 'South': 3}


The function *array_to_dict()* converts an array to a dictionary by pairing the values of the array with their frequencies. We can use this function to calculate the number of clusters per stratum and store it in a Python dictionary. Then, we modify the values of the dictionary to create the sample size dictionary.

If some of the clusters are certainties, an exception will be raised. Hence, the user will have to handle the certainties manually. Better handling of certainties is planned for future versions of *SAMPLICS*.
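
One way to spot certainties before calling the library is to compute $n_h M_{hi} / \sum_i M_{hi}$ directly and flag any value of 1 or more. The sketch below does this with pandas. The helper `flag_certainties`, the toy frame, and its column names are hypothetical and are not part of *SAMPLICS*; how the library itself detects certainties is not shown here.

```python
import pandas as pd


def flag_certainties(frame, stratum_col, mos_col, sample_size):
    """Hypothetical helper: True where n_h * M_hi / sum(M_hi) is at least 1."""
    stratum_total = frame.groupby(stratum_col)[mos_col].transform("sum")
    n_h = frame[stratum_col].map(sample_size)
    probs = n_h * frame[mos_col] / stratum_total
    return probs >= 1


# Toy frame (made-up values): cluster 3 dominates stratum A
toy_frame = pd.DataFrame(
    {
        "cluster": [1, 2, 3, 4, 5, 6],
        "region": ["A", "A", "A", "B", "B", "B"],
        "size": [10, 20, 500, 40, 50, 60],
    }
)
toy_sample_size = {"A": 2, "B": 2}

toy_frame["certainty"] = flag_certainties(toy_frame, "region", "size", toy_sample_size)
print(toy_frame)
```

In this toy frame, only cluster 3 is flagged, because 2 × 500 / 530 exceeds 1. A common manual treatment is to take such units with certainty and select the remaining sample from the other units in the stratum.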

In [4]:
from samplics import array_to_dict

frame_size = array_to_dict(psu_frame["region"])
print(f"The number of clusters per stratum is: {frame_size} \n")

psu_sample_size = frame_size.copy()
psu_sample_size["East"] = 3
psu_sample_size["North"] = 2
psu_sample_size["South"] = 3
psu_sample_size["West"] = 2
print(f"The sample size per stratum is: {psu_sample_size}")

The number of clusters per stratum is: {'East': 25, 'North': 10, 'South': 20, 'West': 45} 

The sample size per stratum is: {'East': 3, 'North': 2, 'South': 3, 'West': 2}
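
For readers who want to see how such a frequency dictionary can be built without *SAMPLICS*, the sketch below uses NumPy. It only illustrates the idea and is not the implementation of *array_to_dict()*; the small `regions` array is made up.

```python
import numpy as np

# Made-up stratum labels, one per cluster
regions = np.array(["East", "North", "East", "West", "South", "West", "East"])

# np.unique returns the sorted distinct values and how often each occurs
values, counts = np.unique(regions, return_counts=True)
freq = {str(v): int(c) for v, c in zip(values, counts)}

print(freq)  # {'East': 3, 'North': 1, 'South': 1, 'West': 2}
```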


In [5]:
stage1_design = Sample(method="pps-sys", stratification=True, with_replacement=False)

psu_frame["psu_prob"] = stage1_design.inclusion_probs(
    psu_frame["cluster"], 
    psu_sample_size, 
    psu_frame["region"],
    psu_frame["number_households_census"],
    )

psu_frame.head(15)

Unnamed: 0,cluster,region,number_households_census,cluster_status,comment,psu_prob
0,1,North,105,1,,0.151625
1,2,North,85,1,,0.122744
2,3,North,95,1,,0.137184
3,4,North,75,1,,0.108303
4,5,North,120,1,,0.173285
5,6,North,90,1,,0.129964
6,7,North,130,1,,0.187726
7,8,North,55,1,,0.079422
8,9,North,30,1,,0.043321
9,10,North,600,1,due to a large building,0.866426


### PSU Selection <a name="section12"></a>

In this subsection, we select a sample of psus using the pps method. Above, we calculated the probabilities of selection. That separate step is not required to select a sample with *SAMPLICS*, because the *select()* method also returns the probabilities. As shown below, *select()* returns three arrays.
* The first array indicates the selected units (i.e. 1=selected and 0=not selected).
* The second array provides the number of hits, which is useful when the sample is selected with replacement.
* The third array is the probability of selection.

NB: *np.random.seed()* fixes the random seed to allow us to reproduce the random selection. 

In [6]:
np.random.seed(23)

psu_frame["psu_sample"], psu_frame["psu_hits"], psu_frame["psu_probs"] = stage1_design.select(
    psu_frame["cluster"], 
    psu_sample_size, 
    psu_frame["region"], 
    psu_frame["number_households_census"]
    )

psu_frame.head(15)

Unnamed: 0,cluster,region,number_households_census,cluster_status,comment,psu_prob,psu_sample,psu_hits,psu_probs
0,1,North,105,1,,0.151625,0,0,0.151625
1,2,North,85,1,,0.122744,0,0,0.122744
2,3,North,95,1,,0.137184,0,0,0.137184
3,4,North,75,1,,0.108303,0,0,0.108303
4,5,North,120,1,,0.173285,0,0,0.173285
5,6,North,90,1,,0.129964,0,0,0.129964
6,7,North,130,1,,0.187726,1,1,0.187726
7,8,North,55,1,,0.079422,0,0,0.079422
8,9,North,30,1,,0.043321,0,0,0.043321
9,10,North,600,1,due to a large building,0.866426,1,1,0.866426
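
To make the three returned arrays more concrete, here is a minimal sketch of a textbook pps systematic selection within a single stratum: cumulate the measures of size, draw a random start within one sampling interval, and step through the cumulated sizes. This is an illustration under that assumption, not the *SAMPLICS* implementation, and the function name `pps_sys_sketch` is hypothetical. It uses the ten *North* measures of size with a sample size of 2.

```python
import numpy as np


def pps_sys_sketch(mos, n, rng):
    """Textbook pps systematic selection in one stratum (illustrative only)."""
    mos = np.asarray(mos, dtype=float)
    cum = np.cumsum(mos)
    interval = cum[-1] / n
    start = rng.uniform(0, interval)
    points = start + interval * np.arange(n)
    # Unit i is hit when a selection point falls in [cum[i-1], cum[i])
    idx = np.searchsorted(cum, points, side="right")
    idx = np.minimum(idx, mos.size - 1)  # guard against floating-point edge cases
    hits = np.bincount(idx, minlength=mos.size)
    sample = (hits > 0).astype(int)
    probs = n * mos / cum[-1]
    return sample, hits, probs


rng = np.random.default_rng(23)
north_mos = [105, 85, 95, 75, 120, 90, 130, 55, 30, 600]
sample, hits, probs = pps_sys_sketch(north_mos, n=2, rng=rng)

print(sample)
print(hits)
print(probs.round(6))
```

Because every *North* cluster is smaller than the sampling interval (1385 / 2 = 692.5), no cluster can be hit twice, so `sample` and `hits` coincide here. The selected clusters will generally differ from those in the output above, since the random number generator and the algorithm details are not the same.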


The default setting (*sample_only=False*) returns the entire frame. We can easily reduce the output to the sample by filtering on *psu_sample == 1*. However, if we are only interested in the sample, we can use *sample_only=True* when calling *select()*. This reduces the output to the sampled units and converts it to a pandas dataframe (*pd.DataFrame*). Note that the columns in the dataframe are reduced to the minimum.
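
The filtering route can be sketched with a small pandas example. The toy dataframe below copies five *North* rows from the output above; the name `toy` is only illustrative.

```python
import pandas as pd

# Five rows copied from the selection output above
toy = pd.DataFrame(
    {
        "cluster": [6, 7, 8, 9, 10],
        "psu_sample": [0, 1, 0, 0, 1],
        "psu_probs": [0.129964, 0.187726, 0.079422, 0.043321, 0.866426],
    }
)

# Keep only the selected units; all original columns are retained
selected = toy.loc[toy["psu_sample"] == 1]
print(selected)
```

Unlike the *sample_only=True* output, this approach keeps every column of the frame, which is convenient when later steps need variables such as the census counts.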

In [8]:
np.random.seed(23)

psu_sample = stage1_design.select(
    psu_frame["cluster"], 
    psu_sample_size, 
    psu_frame["region"], 
    psu_frame["number_households_census"],
    sample_only = True
    )

psu_sample

Unnamed: 0,_samp_unit,_stratum,_mos,_sample,_hits,_probs
6,7,North,130,1,1,0.187726
9,10,North,600,1,1,0.866426
15,16,South,190,1,1,0.209174
23,24,South,75,1,1,0.082569
28,29,South,200,1,1,0.220183
33,34,East,305,1,1,0.210587
44,45,East,450,1,1,0.310702
51,52,East,700,1,1,0.483314
63,64,West,300,1,1,0.091673
85,86,West,280,1,1,0.085561


The systematic selection method can be implemented with or without replacement. Other available algorithms for selecting samples with unequal probabilities of selection are the Brewer, Hanurav-Vijayan, Murphy (only for a sample size of 2), and Sampford methods. Each of these techniques is specified as shown below; calling *select()* then works the same way.

```python
Sample(method="pps-sys", with_replacement=True)
Sample(method="pps-sys", with_replacement=False)
Sample(method="pps-brewer", with_replacement=False)
Sample(method="pps-hv", with_replacement=False)
Sample(method="pps-murphy", with_replacement=False)
Sample(method="pps-sampford", with_replacement=False)
```
For example, if we wanted to select the sample using the Sampford method, we could use the following code.

In [9]:
np.random.seed(23)

stage1_sampford = Sample(method="pps-sampford", stratification=True, with_replacement=False)

psu_sample = stage1_sampford.select(
    psu_frame["cluster"], 
    psu_sample_size, 
    psu_frame["region"], 
    psu_frame["number_households_census"],
    sample_only=True
    )

psu_sample.head(15)

Unnamed: 0,_samp_unit,_stratum,_mos,_sample,_hits,_probs
2,3,North,95,1,1,0.137184
9,10,North,600,1,1,0.866426
10,11,South,25,1,1,0.027523
19,20,South,110,1,1,0.121101
24,25,South,95,1,1,0.104587
35,36,East,95,1,1,0.065593
37,38,East,80,1,1,0.055236
50,51,East,135,1,1,0.093211
93,94,West,60,1,1,0.018335
95,96,West,95,1,1,0.02903


## Sampling at Stage 2 - Selection of Secondary Sampling Units (SSUs) <a name="section2"></a>

To select the second stage sample, we need the second stage frame, that is, the list of all the households in the 10 clusters (psus) selected above. DHS, PHIA, MICS, and other large-scale surveys visit the selected clusters and construct the list of all households in them. In this tutorial, we simulate the second stage frame.

Assume that the psu frame was obtained from the previous census, conducted several years earlier. We also assume that the number of households listed in each cluster follows a normal distribution with a mean 5% higher than the census value and a standard deviation of 0.15 times the census number of households.

In [10]:
# Create a synthetic second stage frame
census_size = psu_frame.loc[psu_frame["psu_sample"]==1, "number_households_census"].values
stratum_names = psu_frame.loc[psu_frame["psu_sample"]==1, "region"].values
cluster = psu_frame.loc[psu_frame["psu_sample"]==1, "cluster"].values

np.random.seed(15)

listing_size = np.zeros(census_size.size)
for k in range(census_size.size):
    listing_size[k] = np.random.normal(1.05*census_size[k], 0.15*census_size[k])
    
listing_size = listing_size.astype(int)
# Separate accumulators for household, region, and cluster identifiers
hh_id, rr_id, cl_id = [], [], []
for k, s in enumerate(listing_size):
    hh_k1 = np.char.array(np.repeat(stratum_names[k], s)).astype(str)
    hh_k2 = np.char.array(np.arange(1, s+1)).astype(str)
    cl_k = np.repeat(cluster[k], s)
    hh_k = np.char.add(np.char.array(cl_k).astype(str), hh_k2)
    hh_id = np.append(hh_id, hh_k)
    rr_id = np.append(rr_id, hh_k1)
    cl_id = np.append(cl_id, cl_k)

ssu_frame = pd.DataFrame(cl_id.astype(int))
ssu_frame.rename(columns={0: "cluster"}, inplace=True)
ssu_frame["region"] = rr_id
ssu_frame["household"] = hh_id

ssu_frame.head(15)

Unnamed: 0,cluster,region,household
0,7,North,71
1,7,North,72
2,7,North,73
3,7,North,74
4,7,North,75
5,7,North,76
6,7,North,77
7,7,North,78
8,7,North,79
9,7,North,710


In [11]:
ssu_counts = ssu_frame.groupby("cluster").count()
ssu_counts.drop(columns="region", inplace=True)
ssu_counts.reset_index(inplace=True)
ssu_counts.rename(columns={"household":"number_households_listed"}, inplace=True)

# psu_sample now holds the reduced sample_only output (columns _samp_unit, _stratum, ...),
# so the census counts are taken from psu_frame, which carries the pps-sys selection in psu_sample.
pd.merge(
    psu_frame.loc[psu_frame["psu_sample"] == 1, ["cluster", "region", "number_households_census"]],
    ssu_counts[["cluster", "number_households_listed"]],
    on=["cluster"],
)

According to the simulated second stage frame, we get the same number of households in cluster 7 as the census. However, in clusters 10, 16, 29, and 64, we listed more households than during the census. Finally, in the remaining clusters, we found fewer households than the census.

Now that we have simulated a second stage frame, let's use *SAMPLICS* to calculate the inclusion probabilities and to select a sample. We assume that the second stage sample size is 150 households and the strategy is to select 15 households per cluster. 

### SSU Probability of Selection <a name="section21"></a>

The second-stage probabilities of selection are conditional on the first-stage realization. For this stage, simple random selection (srs) and systematic selection (sys) are common methods for selecting households. In this example, we use srs to select 15 households from each cluster. Considering all clusters, the second-stage selection is a stratified srs where the clusters are the strata. More generally, we have \begin{equation} p_{hij} = \frac{m_{hi}}{M_{hi}^{'}} \end{equation} where $p_{hij}$ is the conditional probability of selection for unit $j$ from stratum $h$ and cluster $i$, and $m_{hi}$ and $M_{hi}^{'}$ are the sample size and the number of secondary sampling units listed for stratum $h$ and cluster $i$, respectively.
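
As a worked example of this ratio, the sketch below computes the conditional probabilities with pandas for a toy second-stage frame. Cluster 7 has 130 listed households, as in this tutorial; cluster 99 and its 200 households are hypothetical, and the names `toy_ssu` and `m` are only illustrative.

```python
import pandas as pd

# Toy second-stage frame: one row per listed household
toy_ssu = pd.DataFrame({"cluster": [7] * 130 + [99] * 200})

m = 15  # households to select in each cluster

# Number of listed households in the cluster of each row (M'_hi)
listed = toy_ssu["cluster"].map(toy_ssu["cluster"].value_counts())

# Conditional probability of selection: m_hi / M'_hi
toy_ssu["ssu_prob"] = m / listed

print(toy_ssu.groupby("cluster")["ssu_prob"].first().round(6))
```

Every household in cluster 7 gets 15 / 130 ≈ 0.1154 and every household in cluster 99 gets 15 / 200 = 0.075. Within each cluster, the probabilities sum to the cluster sample size of 15.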


In this scenario, the sample size is the same in each stratum. Hence, the parameter *sample_size* does not need to be a Python dictionary; we only provide 15 in the function call.

In [None]:
stage2_design = Sample(method="srs", stratification=True, with_replacement=False)

ssu_frame["ssu_prob"] = stage2_design.inclusion_probs(
    ssu_frame["household"], 15, ssu_frame["cluster"]
    )

ssu_frame.sample(20)

### SSU Selection <a name="section22"></a>

The second stage sample is selected from the SSU frame (*ssu_frame*) using the variable *cluster* as the stratification variable. The sample is selected without replacement according to the specification of the second stage design *stage2_design*. Hence, both *ssu_sample* and *ssu_hits* sum to 150 and each selected household is hit only once (i.e. *ssu_hits* = 1).

```python
ssu_frame["ssu_sample"].sum()  # 150
ssu_frame["ssu_hits"].sum()  # 150
```

In [None]:
np.random.seed(11)
ssu_sample, ssu_hits, _ = stage2_design.select(ssu_frame["household"], 15, ssu_frame["cluster"])

ssu_frame["ssu_sample"] = ssu_sample
ssu_frame["ssu_hits"] = ssu_hits

ssu_frame[ssu_frame["ssu_sample"]==1].sample(15)
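
The idea behind this step, drawing a fixed number of units without replacement within each cluster, can be sketched with NumPy alone. This is an illustration of stratified srs in general, not the *SAMPLICS* code; the function name `stratified_srs_sketch` and the toy cluster sizes are hypothetical.

```python
import numpy as np


def stratified_srs_sketch(strata, m, rng):
    """Select m units without replacement in each stratum (illustrative only)."""
    strata = np.asarray(strata)
    sample = np.zeros(strata.size, dtype=int)
    for s in np.unique(strata):
        positions = np.flatnonzero(strata == s)  # row positions of stratum s
        chosen = rng.choice(positions, size=m, replace=False)
        sample[chosen] = 1
    return sample


rng = np.random.default_rng(11)

# Toy frame: three clusters with made-up numbers of listed households
toy_clusters = np.repeat([7, 10, 16], [130, 620, 200])
toy_sample = stratified_srs_sketch(toy_clusters, m=15, rng=rng)

print(toy_sample.sum())  # 45
```

Since the draw is without replacement, the indicator array doubles as the number of hits, which is why the two sums agree in the tutorial.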

To use systematic selection, we just need to replace *method="srs"* with *method="sys"*.

Another common approach is to use a rate for selecting the sample. Instead of selecting 15 households from the 130 listed in the first cluster, we may want to select with a rate of 15/130, and similarly for the other clusters, which gives the following rates.

In [None]:
rates = np.repeat(15, 10) / ssu_counts["number_households_listed"].values

ssu_rates = dict(zip(np.unique(ssu_frame["cluster"]), rates))

ssu_rates

A sample is selected using the rates as follows:

In [None]:
np.random.seed(22)

stage2_design2 = Sample(method="sys", stratification=True, with_replacement=False)

ssu_sample_r, ssu_hits_r, _ = stage2_design2.select(
    ssu_frame["household"], stratum=ssu_frame["cluster"], samp_rate=ssu_rates
    )

ssu_sample2 = pd.DataFrame(
    data={
        "household":ssu_frame["household"], 
        "ssu_sample_r":ssu_sample_r,
        "ssu_hits_r":ssu_hits_r
    })

ssu_sample2.head(25)
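
To see how a rate drives a systematic selection, consider the textbook version of the procedure: the sampling interval is the inverse of the rate, a random start is drawn within the first interval, and every unit whose position is reached by a step is selected. The sketch below applies this to a single cluster of 130 households with a rate of 15/130. It is an assumption-laden illustration, not the *SAMPLICS* implementation, and `sys_by_rate_sketch` is a hypothetical name.

```python
import numpy as np


def sys_by_rate_sketch(n_units, rate, rng):
    """Equal-probability systematic selection driven by a rate (illustrative only)."""
    interval = 1 / rate
    start = rng.uniform(0, interval)
    # Generate slightly more points than needed, then keep those inside the frame
    n_points = int(np.ceil(n_units * rate)) + 1
    points = start + interval * np.arange(n_points)
    points = points[points < n_units]
    sample = np.zeros(n_units, dtype=int)
    sample[np.floor(points).astype(int)] = 1
    return sample


rng = np.random.default_rng(22)
toy_sample_r = sys_by_rate_sketch(130, 15 / 130, rng)

print(toy_sample_r.sum())
print(np.flatnonzero(toy_sample_r))
```

Here the rate times the cluster size is a whole number, so the sketch selects 15 households. When that product is not a whole number, the realized sample size can differ by one depending on the random start.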

Let's store the first- and second-stage samples.

In [None]:
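# psu_sample holds the reduced sample_only output, so the first-stage sample is taken from psu_frame
psu_frame.loc[psu_frame["psu_sample"] == 1, ["cluster", "region", "psu_prob"]].to_csv("psu_sample.csv")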

ssu_sample = ssu_frame.loc[ssu_frame["ssu_sample"]==1]
ssu_sample[["cluster", "household", "ssu_prob"]].to_csv("ssu_sample.csv")