# 2. Sample Selection 

This tutorial walks you through using the class *Selection* to select a random sample from a finite frame of sampling units.

## Objective:
- Overview of some common sample selection techniques
- Introduction to the attributes and methods of class *Selection*

**Important.** This tutorial is not designed to teach sample selection; it provides a brief review of the methods as a reference for the user. By the end of this tutorial, the user should have a good understanding of the *Selection* API.


In [None]:
import numpy as np
import pandas as pd

import survmeth as svm  # the cells below call svm.SizeSelection and svm.SimpleSelection
from survmeth import Selection

The file *psu_frame.csv*, shown below, contains synthetic data on 100 clusters classified by region. Each cluster represents a group of households. In the file, each cluster has an associated number of households (number_households) and a status variable (cluster_status) indicating whether the cluster is in scope.

This synthetic data represents a simplified version of the enumeration area (EA) frames used in many countries by major household survey programs such as the Demographic and Health Surveys (DHS), the Population-based HIV Impact Assessment (PHIA), and the Multiple Indicator Cluster Surveys (MICS).

In [38]:
psu_frame = pd.read_csv("psu_frame.csv")

psu_frame.head(20)

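| Unnamed: 0 | cluster | region | number_households | cluster_status | comment |
| --- | --- | --- | --- | --- | --- |
| 0 | 1 | North | 105 | 1 | |
| 1 | 2 | North | 85 | 1 | |
| 2 | 3 | North | 95 | 1 | |
| 3 | 4 | North | 75 | 1 | |
| 4 | 5 | North | 120 | 1 | |
| 5 | 6 | North | 90 | 1 | |
| 6 | 7 | North | 130 | 1 | |
| 7 | 8 | North | 55 | 1 | |
| 8 | 9 | North | 30 | 1 | |
| 9 | 10 | North | 600 | 1 | due to a large building |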


In the sections below, the user will select samples using several random selection procedures. Each section includes a short review of the selection method. The three main selection methods implemented in the class *Selection* are simple random selection (SRS), systematic selection (SYS), and selection with probability proportional to size (PPS).
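As a point of reference for the simplest of these methods, here is a minimal NumPy sketch of simple random selection without replacement. The helper `srs_sample` and its arguments are hypothetical names used only for illustration; this is not the *Selection* implementation.

```python
import numpy as np


def srs_sample(frame_size, sample_size, seed=None):
    """Illustrative helper: simple random sample without replacement."""
    rng = np.random.default_rng(seed)
    # Under SRS every unit has the same inclusion probability n / N
    probs = np.full(frame_size, sample_size / frame_size)
    # Draw n distinct positions out of the N positions in the frame
    selected = rng.choice(frame_size, size=sample_size, replace=False)
    # Encode the sample as a 0/1 indicator aligned with the frame
    indicator = np.zeros(frame_size, dtype=int)
    indicator[selected] = 1
    return indicator, probs


indicator, probs = srs_sample(frame_size=10, sample_size=3, seed=12345)
```

The indicator has exactly `sample_size` ones, and the inclusion probabilities sum to the sample size.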

The class *Selection* has two main methods: *inclusion_probs* and *select*. *inclusion_probs* computes the probability of selection and *select* draws the random sample. For more details visit: samplics.io/readthedocs/samplics.sampling

Some of the attributes of the class *Selection* are the following:
- *method* indicates the selection method. The current methods are: srs, sys, pps-brewer, pps-hv (for the Hanurav-Vijayan algorithm), pps-murphy, pps-sampford, and pps-sys.
- *stratification* indicates if selection is stratified.
- *with_replacement* indicates if selection is with replacement.
- *fpc* provides the finite population correction - **(To Do list)**

## Sampling STAGE 1 - Selection of Primary Sampling Units (PSUs)

Often, sampling frames are not available for the sampling units of interest. For example, most countries do not have an accurate list of all households or people living in the country. Even if such a frame exists, it may not be operationally or financially feasible to select sampling units directly, without any form of clustering.

Hence, multistage sampling is a common strategy used by large national household surveys for selecting samples of households and people.

### PSU Probability of Selection 

The sampling units will be selected with probabilities proportional to size (pps). <br>
The measure of size is the number of households (number_households) provided in <br>
the sampling frame. The sample will be stratified by region. 

For stratified pps sampling, the probability of selection is computed within each stratum as: $p_k = \frac{n M_k}{\sum_{i=1}^{N}{M_i}}$ <br>
where $p_k$ is the probability of selection for unit $k$, $M_k$ is its measure of size (mos), <br>
$n$ is the number of clusters to select in the stratum, and $N$ is the total number of clusters in that stratum of the sampling frame.
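To make the formula concrete, the sketch below applies it to a small toy frame. The data frame `toy`, its column names, and the sample sizes in `n_by_region` are made up for illustration and are not taken from the tutorial's frame.

```python
import pandas as pd

# Hypothetical toy frame: two strata with made-up measures of size
toy = pd.DataFrame({
    "region": ["North", "North", "North", "South", "South"],
    "mos": [100, 300, 600, 50, 150],
})
# Hypothetical number of clusters to select in each stratum
n_by_region = {"North": 2, "South": 1}

# n for each row, and the stratum total of the measure of size
n = toy["region"].map(n_by_region)
total_mos = toy.groupby("region")["mos"].transform("sum")

# p_k = n * M_k / sum(M_i), computed within each stratum
toy["p"] = n * toy["mos"] / total_mos
# North: 0.2, 0.6, 1.2   South: 0.25, 0.75

# A very large unit can yield p_k > 1, which is not a valid probability
toy["exceeds_one"] = toy["p"] > 1
```

Within each stratum the values sum to $n$. The third North unit gets 1.2, which shows why unusually large units (such as the 600-household cluster in the frame) need special handling; a common remedy is to select such units with certainty and apply pps to the rest.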

### Sample size by stratum 

The table below provides the stratum (region), the sample size (sample_size), <br>
the total size measure (total_size), and the ratio (ratio = $\frac{n}{\sum_{i=1}^{N}{M_i}}$).

In [None]:
# Aggregate the frame by region: number of clusters and total households
psu_sample_size = (
    psu_frame.groupby("region")[["cluster_status", "number_households"]]
    .sum()
    .rename(columns={
        "cluster_status": "total_clusters",
        "number_households": "total_households"
    })
)

# Default of 2 clusters per region, overridden for North and West.
# .loc avoids chained assignment, which may silently fail to update the frame.
psu_sample_size["sample_size"] = 2
psu_sample_size.loc["North", "sample_size"] = 1
psu_sample_size.loc["West", "sample_size"] = 5

# Turn the region index back into a column so it can be used as a merge key
psu_sample_size = psu_sample_size.reset_index()

psu_sample_size

In [None]:
# Merge sample information to frame
psu_frame = pd.merge(psu_frame, psu_sample_size, on="region")

# Probabilities of selection
psu_frame["psu_probs_ref"] = psu_frame["sample_size"] * psu_frame["number_households"] / psu_frame["total_households"]

psu_frame.head()

In [None]:
stratified_pps = svm.SizeSelection(stratification=True, with_replacement=False)

# Number of clusters to select in each region
psu_sample_size = {"North": 1, "South": 2, "East": 2, "West": 5}
psu = psu_frame["cluster_id"]
psu_mos = psu_frame["number_households"]
region = psu_frame["region"]

psu_probs = stratified_pps.inclusion_probability(psu, psu_sample_size, psu_mos, region)

psu_frame["psu_probs"] = psu_probs

# The PSU weight is the inverse of the selection probability;
# units with zero probability keep a weight of 0
psu_weight = np.zeros(psu_probs.size)
nonzero = psu_probs != 0
psu_weight[nonzero] = 1 / psu_probs[nonzero]

psu_frame["psu_weight"] = psu_weight
psu_frame.head(15)

### PSU Selection

**To do:**
- Describe the steps of systematic selection.
- Show each step separately in a short Python script.
- Finish with the SurvMeth call.
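The steps of a textbook pps systematic selection within one stratum can be sketched as follows. This is a generic illustration under stated assumptions, not the algorithm SurvMeth runs internally; the helper `pps_systematic` is a hypothetical name, and the measures of size are the ten North clusters displayed earlier, used here as if they formed a complete stratum.

```python
import numpy as np


def pps_systematic(mos, sample_size, seed=None):
    """Illustrative helper: pps systematic selection in a single stratum."""
    rng = np.random.default_rng(seed)
    mos = np.asarray(mos, dtype=float)
    # Step 1: cumulate the measures of size over the frame order
    cum_mos = np.cumsum(mos)
    # Step 2: sampling interval = total size / number of selections
    interval = cum_mos[-1] / sample_size
    # Step 3: random start in [0, interval)
    start = rng.uniform(0, interval)
    # Step 4: selection points start, start + interval, start + 2 * interval, ...
    points = start + interval * np.arange(sample_size)
    # Step 5: unit k is hit when cum_mos[k-1] <= point < cum_mos[k]
    hit_index = np.searchsorted(cum_mos, points, side="right")
    hit_index = np.minimum(hit_index, mos.size - 1)  # guard against rounding at the top end
    # Count hits per unit; a unit larger than the interval can be hit more than once
    hits = np.bincount(hit_index, minlength=mos.size)
    sample = (hits > 0).astype(int)
    return sample, hits


mos = [105, 85, 95, 75, 120, 90, 130, 55, 30, 600]
sample, hits = pps_systematic(mos, sample_size=3, seed=23)
```

With three selections the interval is 1385 / 3, about 461.7, so the 600-household cluster is larger than the interval and is hit at least once whatever the random start.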

In [None]:
np.random.seed(23)
psu_sample, psu_hits = stratified_pps.select(psu, psu_sample_size, psu_mos, region)

psu_frame["psu_sample"] = psu_sample
psu_frame["psu_hits"] = psu_hits

psu_frame.head(15)

#print(np.sum(psu_frame["psu_sample"].values))
#print(np.sum(psu_frame["psu_hits"].values))

The selected sample consists of the rows with psu_sample == 1.

In [None]:
psu_frame.loc[psu_frame["psu_sample"]==1]

## Sampling STAGE 2 - Selection of Secondary Sampling Units (SSUs)

**To do:**
- Describe the steps of systematic selection.
- Show each step separately in a short Python script.
- Finish with the SurvMeth call.
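For households listed within a single cluster, an equal-probability systematic selection can be sketched as below. This is a generic illustration with a fractional sampling interval, not the SurvMeth implementation; the helper `systematic_sample` and the listing size of 110 are hypothetical.

```python
import numpy as np


def systematic_sample(frame_size, sample_size, seed=None):
    """Illustrative helper: equal-probability systematic selection from one list."""
    rng = np.random.default_rng(seed)
    # Step 1: sampling interval (may be fractional)
    interval = frame_size / sample_size
    # Step 2: random start in [0, interval)
    start = rng.uniform(0, interval)
    # Step 3: selection points, mapped to 0-based list positions by taking the floor
    points = start + interval * np.arange(sample_size)
    positions = np.floor(points).astype(int)
    positions = np.minimum(positions, frame_size - 1)  # guard against rounding at the top end
    # Step 4: encode the sample as a 0/1 indicator aligned with the list
    indicator = np.zeros(frame_size, dtype=int)
    indicator[positions] = 1
    return indicator


# 12 households out of a hypothetical listing of 110
indicator = systematic_sample(frame_size=110, sample_size=12, seed=11135)
```

Because the interval is at least 1 whenever the listing is at least as long as the sample, consecutive selection points fall on distinct list positions, and every household has the same inclusion probability of `sample_size / frame_size`.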

In [None]:
# Create a synthetic second-stage frame from the selected PSUs
selected = psu_frame.loc[psu_frame["psu_sample"] == 1]
census_size = selected["number_households"].values
stratum_names = selected["region"].values
cluster_id = selected["cluster_id"].values

np.random.seed(323241)

# Simulate the household listing: sizes vary around 105% of the census size
listing_size = np.zeros(census_size.size)
for k in range(census_size.size):
    listing_size[k] = np.random.normal(1.05 * census_size[k], 0.1 * census_size[k])

listing_size = listing_size.astype(int)

# Build one row per listed household.
# Household id = region name + cluster id + household number within the cluster.
hh_id, rr_id, cl_id = [], [], []
for k, s in enumerate(listing_size):
    hh_numbers = np.arange(1, s + 1).astype(str)
    prefix = f"{stratum_names[k]}{cluster_id[k]}"
    hh_id.extend(np.char.add(prefix, hh_numbers))
    rr_id.extend(np.repeat(stratum_names[k], s))
    cl_id.extend(np.repeat(cluster_id[k], s))

ssu_frame = pd.DataFrame({
    "cluster_id": np.asarray(cl_id, dtype=int),
    "region": rr_id,
    "ssu_id": hh_id,
})

# Number of listed households per selected cluster
ssu_frame.groupby("cluster_id").count()

In [None]:
stratified_srs = svm.SimpleSelection(stratification=True, with_replacement=False)

# Select 12 households per cluster; clusters act as strata at this stage
ssu_sample_size = 12
ssu = ssu_frame["ssu_id"]
cluster = ssu_frame["cluster_id"]

ssu_probs = stratified_srs.inclusion_probability(ssu, ssu_sample_size, cluster)

ssu_frame["ssu_probs"] = ssu_probs

# The SSU weight is the inverse of the selection probability;
# units with zero probability keep a weight of 0
ssu_weight = np.zeros(ssu_probs.size)
nonzero = ssu_probs != 0
ssu_weight[nonzero] = 1 / ssu_probs[nonzero]

ssu_frame["ssu_weight"] = ssu_weight
ssu_frame.sample(15)

In [None]:
np.random.seed(11135)
ssu_sample, ssu_hits = stratified_srs.select(ssu, ssu_sample_size, cluster)

ssu_frame["ssu_sample"] = ssu_sample
ssu_frame["ssu_hits"] = ssu_hits

ssu_frame.sample(15)

#print(np.sum(ssu_frame["ssu_sample"].values))
#print(np.sum(ssu_frame["ssu_hits"].values))

## Sampling STAGE 3 - Tertiary Sampling Unit (TSU)

Maybe not necessary

## Base (Design) Weight
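
Under a two-stage design, the base weight of a household is commonly defined as the inverse of its overall probability of selection, that is, the product of the stage-level probabilities. The sketch below illustrates this on made-up numbers; the column names follow the ones used earlier in this tutorial, but the data frame `stage_probs` and its values are hypothetical.

```python
import pandas as pd

# Hypothetical stage-level probabilities for three sampled households
stage_probs = pd.DataFrame({
    "ssu_id": ["North31", "North32", "West121"],
    "psu_probs": [0.10, 0.10, 0.25],   # probability of selecting the cluster
    "ssu_probs": [0.12, 0.12, 0.10],   # probability of selecting the household within its cluster
})

# Overall probability = product of the probabilities at each stage
stage_probs["overall_prob"] = stage_probs["psu_probs"] * stage_probs["ssu_probs"]

# Base (design) weight = inverse of the overall probability,
# which equals the product of the stage-level weights
stage_probs["base_weight"] = 1 / stage_probs["overall_prob"]
```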