# 3. Sample Weighting

## Table of Contents

- [Objective](#section0)

- [Design (base) Weight](#section1)

- [Non-Response Adjustment](#section2)

- [Post-Stratification](#section3)

- [Normalization](#section4)

## Objective <a name="section0"></a>

In [1]:
import numpy as np
import pandas as pd

import samplics as svm
from samplics.weighting import SampleWeight

## Design (base) weight <a name="section1"></a>

The design weight is the inverse of the overall probability of selection, which is the product of the first-stage and second-stage probabilities of selection.

In [0]:
psu_sample = pd.read_csv("psu_sample.csv")
ssu_sample = pd.read_csv("ssu_sample.csv")

full_sample = pd.merge(
    psu_sample[["cluster", "region", "psu_prob"]], 
    ssu_sample[["cluster", "household", "ssu_prob"]], 
    on="cluster")

full_sample["inclusion_prob"] = full_sample["psu_prob"] * full_sample["ssu_prob"] 
full_sample["design_weight"] = 1 / full_sample["inclusion_prob"] 

full_sample.head(25)
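
As a small self-contained illustration of the same computation, the sketch below uses made-up selection probabilities (the values and the `toy` data frame are assumptions for illustration, not taken from the tutorial files).

```python
import pandas as pd

# Hypothetical two-stage selection probabilities (illustration only)
toy = pd.DataFrame(
    {
        "cluster": [1, 1, 2],
        "psu_prob": [0.25, 0.25, 0.5],   # first-stage (cluster) probability
        "ssu_prob": [0.5, 0.5, 0.125],   # second-stage (household) probability
    }
)

# Overall inclusion probability is the product of the stage probabilities
toy["inclusion_prob"] = toy["psu_prob"] * toy["ssu_prob"]

# The design weight is the inverse of the overall inclusion probability
toy["design_weight"] = 1 / toy["inclusion_prob"]

print(toy)
```

A household selected with an overall probability of 0.125 receives a design weight of 8, i.e. it represents eight households in the population.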

To illustrate the class *SampleWeight*, we will simulate non-response status. 

In [0]:
np.random.seed(7)
full_sample["response_status"] = np.random.choice(
    ["ineligible","respondent", "non-respondent","unknown"], 
    size=full_sample.shape[0], 
    p=(0.10, 0.70, 0.15, 0.05)
    )

full_sample[["cluster", "region","design_weight", "response_status"]].head(25)

## Non-Response Adjustment <a name="section2"></a>

In [0]:
status_mapping = {
    "in": "ineligible", "rr": "respondent", "nr": "non-respondent", "uk":"unknown"
    }

full_sample["nr_weight"] = SampleWeight().adjust(
    samp_weight=full_sample["design_weight"], 
    adjust_class=full_sample["region"], 
    resp_status=full_sample["response_status"], 
    resp_dict=status_mapping
    )

full_sample[["cluster", "region","design_weight", "response_status", "nr_weight"]].head(25)

**Important.** By default, *adjust()* expects the standard response status codes "in", "rr", "nr", and "uk", where "in" means ineligible, "rr" means respondent, "nr" means non-respondent, and "uk" means unknown eligibility.

If we called *adjust()* without the parameter *resp_dict* on a response variable that uses other codes, the run would fail with an assertion error. The current error message is the following: *The response status must only contains values in ('in', 'rr', 'nr', 'uk') or the mapping should be provided using response_dict parameter*. The variable passed as *resp_status* can contain any codes, but a mapping is required whenever it is not constructed using the standard codes "in", "rr", "nr", and "uk".
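
Before calling *adjust()*, it can be handy to check whether a response variable already uses the standard codes. The helper below is a hypothetical convenience function written for this illustration (it is not part of samplics); it only reports whether a mapping would be needed.

```python
import pandas as pd

STANDARD_CODES = ("in", "rr", "nr", "uk")


def needs_mapping(resp_status):
    """Return True if resp_status contains values outside the standard codes.

    Hypothetical helper for illustration; not a samplics function.
    """
    observed = set(pd.Series(resp_status).unique())
    return not observed.issubset(STANDARD_CODES)


print(needs_mapping(["rr", "nr", "in", "uk"]))          # standard codes only
print(needs_mapping(["respondent", "non-respondent"]))  # custom labels
```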

To further illustrate the mapping of response status, let's assume that we have response_status2 which has the values 100 for ineligible, 200 for non-respondent, 300 for respondent, and 999 for unknown. 

In [0]:
response_status2 = np.repeat(100, full_sample["response_status"].shape[0])
response_status2[full_sample["response_status"]=="non-respondent"] = 200
response_status2[full_sample["response_status"]=="respondent"] = 300
response_status2[full_sample["response_status"]=="unknown"] = 999

pd.crosstab(response_status2, full_sample["response_status"])

To use *response_status2*, we need to map the values 100, 200, 300, and 999 to "in", "nr", "rr", and "uk", respectively. This mapping is done below through the Python dictionary *status_mapping2*. Using *status_mapping2* in the call to *adjust()* leads to the same adjustment as in the previous run, i.e. *nr_weight* and *nr_weight2* contain the same adjusted weights.

In [0]:
status_mapping2 = {"in": 100, "nr": 200, "rr": 300, "uk": 999}

full_sample["nr_weight2"] = SampleWeight().adjust(
    samp_weight=full_sample["design_weight"], 
    adjust_class=full_sample["region"], 
    resp_status=response_status2, 
    resp_dict=status_mapping2
    )

full_sample[["cluster", "region","design_weight", "response_status", "nr_weight", "nr_weight2"]].head(25)
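
Note the direction of the mapping: the dictionary is keyed by the standard code, and the user's own value is the dictionary value. The sketch below inverts such a dictionary to recode a small made-up vector into the standard codes. It only illustrates the direction of the mapping and is not a description of what *adjust()* does internally; `raw_status` and `value_to_code` are names introduced for this example.

```python
import pandas as pd

# Standard code -> value used in the data (same layout as status_mapping2)
status_mapping2 = {"in": 100, "nr": 200, "rr": 300, "uk": 999}

# Invert the dictionary: value used in the data -> standard code
value_to_code = {value: code for code, value in status_mapping2.items()}

# Hypothetical response status vector using the numeric codes
raw_status = pd.Series([300, 100, 200, 999, 300])
standard_status = raw_status.map(value_to_code)

print(standard_status.tolist())
```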

If the response status variable only takes the values "in", "nr", "rr", and "uk", then it is not necessary to provide the mapping dictionary, i.e. *resp_dict* can be omitted from the call to *adjust()*.

In [0]:
response_status3 = np.repeat("in", full_sample["response_status"].shape[0])
response_status3[full_sample["response_status"]=="non-respondent"] = "nr"
response_status3[full_sample["response_status"]=="respondent"] = "rr"
response_status3[full_sample["response_status"]=="unknown"] = "uk"

full_sample["nr_weight3"] = SampleWeight().adjust(
    samp_weight=full_sample["design_weight"], 
    adjust_class=full_sample["region"], 
    resp_status=response_status3
    )

full_sample[["cluster", "region","design_weight", "response_status", "nr_weight", "nr_weight2", "nr_weight3"]].head(25)
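
To give some intuition for what a non-response adjustment does, here is a minimal sketch of a textbook weighting-class adjustment: within each class, the weights of non-respondents are transferred to respondents. This is a simplification that ignores units of unknown eligibility, so it should not be read as the exact algorithm implemented by *adjust()*. The function name, its arguments, and the toy data are assumptions made for this example.

```python
import pandas as pd


def weighting_class_adjust(df, weight="design_weight", klass="region", status="status"):
    """Textbook weighting-class adjustment using codes 'rr', 'nr', 'in'.

    Simplified sketch: units with unknown eligibility ('uk') are not handled.
    """
    w = df[weight].astype(float)
    is_rr = df[status] == "rr"
    is_nr = df[status] == "nr"

    # Sum of respondent and non-respondent weights within each class
    rr_sum = w.where(is_rr, 0.0).groupby(df[klass]).transform("sum")
    nr_sum = w.where(is_nr, 0.0).groupby(df[klass]).transform("sum")

    # Respondents absorb the weight of non-respondents in their class
    factor = (rr_sum + nr_sum) / rr_sum

    adjusted = w.copy()
    adjusted[is_rr] = w[is_rr] * factor[is_rr]
    adjusted[is_nr] = 0.0  # non-respondents no longer carry weight
    return adjusted        # ineligible units keep their weight


toy = pd.DataFrame(
    {
        "region": ["East", "East", "East", "West", "West"],
        "status": ["rr", "nr", "in", "rr", "rr"],
        "design_weight": [10.0, 10.0, 5.0, 20.0, 20.0],
    }
)
toy["nr_weight"] = weighting_class_adjust(toy)
print(toy)
```

In the East class, the single respondent takes over the weight of the single non-respondent, so the total weight of eligible units in the class is unchanged.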

## Post-Stratification <a name="section3"></a>

Poststratification is useful to compensate for underrepresentation in the sample or to correct for nonsampling error. The most common poststratification method consists of adjusting the sample weights so that they sum to some "known" control values within poststratification classes (domains).

Let us assume that we have a very reliable external source, e.g. a recent census, that provides the number of households by region. The external source has the following control data: 3700 households for East, 1500 for North, 2800 for South, and 6500 for West.

In [0]:
# Just dropping a couple of variables not needed for the rest of the tutorial
full_sample.drop(columns=["psu_prob", "ssu_prob", "inclusion_prob", "nr_weight2", "nr_weight3"], inplace=True)

census_households = {"East":3700, "North": 1500, "South": 2800, "West":6500}

full_sample["ps_weight"] = SampleWeight().poststratify(
    samp_weight=full_sample["nr_weight"], control=census_households, domain=full_sample["region"]
    )

full_sample.head()

In [0]:
sum_of_weights = full_sample[["region", "nr_weight", "ps_weight"]].groupby("region").sum()
sum_of_weights.reset_index(inplace=True)
sum_of_weights.head()
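
The adjustment described above can be sketched in plain pandas: within each domain, every weight is multiplied by the ratio of the control total to the current sum of weights. The function below is a minimal illustration written for this tutorial (its name and the toy data are assumptions), not the samplics implementation.

```python
import pandas as pd


def poststratify_to_totals(weights, domain, control):
    """Scale weights so that they sum to the control total in each domain."""
    domain_sum = weights.groupby(domain).transform("sum")  # current sum per domain
    target = domain.map(control)                           # control total per unit
    return weights * target / domain_sum


toy = pd.DataFrame(
    {
        "region": ["East", "East", "North"],
        "nr_weight": [100.0, 300.0, 50.0],
    }
)
control = {"East": 800, "North": 200}

toy["ps_weight"] = poststratify_to_totals(toy["nr_weight"], toy["region"], control)
print(toy.groupby("region")["ps_weight"].sum())
```

All units in the same domain receive the same adjustment factor (here 2 for East and 4 for North), which is why the relative sizes of the weights within a domain are preserved.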

In [0]:
full_sample["ps_adjust_fct"] = round(full_sample["ps_weight"] / full_sample["nr_weight"], 12)

pd.crosstab(full_sample["ps_adjust_fct"] , full_sample["region"])

In [0]:
known_ratios = {"East":0.25, "North": 0.10, "South": 0.20, "West":0.45}
full_sample["ps_weight2"] = SampleWeight().poststratify(
    samp_weight=full_sample["nr_weight"], factor=known_ratios, domain=full_sample["region"]
    )

full_sample.head()

In [0]:
sum_of_weights2 = full_sample[["region", "nr_weight", "ps_weight2"]].groupby("region").sum()
sum_of_weights2.reset_index(inplace=True)
sum_of_weights2["ratio"] = sum_of_weights2["ps_weight2"] / sum(sum_of_weights2["ps_weight2"])
sum_of_weights2.head()
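
When only the relative sizes of the domains are known, one way to proceed is to keep the overall sum of weights fixed and rescale each domain so that its share of the total matches the known ratio. The sketch below illustrates that idea under this assumption; the function name and toy data are made up for the example, and *poststratify()* may define its factors differently.

```python
import numpy as np
import pandas as pd


def poststratify_to_shares(weights, domain, shares):
    """Rescale weights so each domain's share of the total matches `shares`.

    Assumption of this sketch: the overall sum of weights is left unchanged.
    """
    total = weights.sum()
    domain_sum = weights.groupby(domain).transform("sum")
    return weights * domain.map(shares) * total / domain_sum


toy = pd.DataFrame(
    {
        "region": ["East", "East", "North"],
        "nr_weight": [100.0, 300.0, 100.0],
    }
)
shares = {"East": 0.5, "North": 0.5}

toy["ps_weight2"] = poststratify_to_shares(toy["nr_weight"], toy["region"], shares)
ratio = toy.groupby("region")["ps_weight2"].sum() / toy["ps_weight2"].sum()
print(ratio)
```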

Poststratification classes can also be formed using variables beyond the ones involved in the sampling design. For example, socio-economic variables such as age group, gender, race, and education are often used to form poststratification classes/cells.

## Calibration 

Calibration is a more general concept for adjusting sample weights to sum to known constants. In this tutorial, we consider the generalized regression (GREG) class of calibration. Assume that we have the estimator $\hat{\mathbf{Y}} = \sum_{i \in s} w_i y_i$ and that known population totals $\mathbf{X} = (\mathbf{X}_1, \ldots, \mathbf{X}_p)^T$ are available. Working under the model $Y_i | \mathbf{x}_i = \mathbf{x}^T_i \mathbf{\beta} + \epsilon_i$, the GREG estimator of the population total is

$$\hat{\mathbf{Y}}_{GR} = \hat{\mathbf{Y}} + (\mathbf{X} - \hat{\mathbf{X}})^T\hat{\mathbf{B}}$$

where $\hat{\mathbf{B}}$ is the weighted least squares estimate of $\mathbf{\beta}$ and $\hat{\mathbf{X}}$ is the survey estimate of $\mathbf{X}$.

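The essence of the GREG approach is to find, under the regression model, the adjusted weights $w^{*}_i$ that are closest to the initial weights $w_i$, that is, the values $z_i$ that minimize the distance $h(z) = \sum_{i \in s} \frac{c_i (w_i - z_i)^2}{w_i}$ subject to the calibration constraint $\sum_{i \in s} z_i \mathbf{x}_i = \mathbf{X}$.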
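
With $c_i = 1$, minimizing the chi-square-type distance $\sum_{i \in s} (w_i - z_i)^2 / w_i$ subject to $\sum_{i \in s} z_i \mathbf{x}_i = \mathbf{X}$ has the closed-form solution $w^{*}_i = w_i (1 + \mathbf{x}^T_i \mathbf{\lambda})$, where $\mathbf{\lambda} = (\sum_{i \in s} w_i \mathbf{x}_i \mathbf{x}^T_i)^{-1} (\mathbf{X} - \hat{\mathbf{X}})$. The NumPy sketch below implements this formula on simulated data. The function name, the simulated auxiliary variables, and the control totals are assumptions made for the illustration; this is not the samplics calibration routine.

```python
import numpy as np


def linear_calibration(weights, x, totals):
    """Linear (GREG-type) calibration with c_i = 1.

    weights: initial weights, shape (n,)
    x: auxiliary variables, shape (n, p)
    totals: known population totals, shape (p,)
    """
    w = np.asarray(weights, dtype=float)
    x = np.asarray(x, dtype=float)
    totals = np.asarray(totals, dtype=float)

    x_hat = x.T @ w                  # survey estimate of the totals
    t_mat = x.T @ (x * w[:, None])   # sum of w_i * x_i x_i^T
    lam = np.linalg.solve(t_mat, totals - x_hat)

    return w * (1.0 + x @ lam)       # calibrated weights


rng = np.random.default_rng(12345)
n = 50
w = rng.uniform(5, 15, size=n)                                  # initial weights
x = np.column_stack([np.ones(n), rng.integers(0, 6, size=n)])   # intercept + count
totals = np.array([600.0, 1300.0])                              # hypothetical controls

w_cal = linear_calibration(w, x, totals)
print(x.T @ w_cal)  # reproduces the control totals up to rounding error
```

Including a column of ones in `x` forces the calibrated weights to sum to the first control total, here a hypothetical population size of 600. Note that linear calibration can produce negative weights when the controls are far from the survey estimates.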

Let us simulate three auxiliary variables: education, poverty, and under_five (the number of children under five in the household).

In [0]:
np.random.seed(150)
education = np.random.choice(("Low", "Medium", "High"), size=150, p=(0.40, 0.50, 0.10))
poverty = np.random.choice(("No", "Yes"), size=150, p=(0.70, 0.30))
under_five = np.random.choice((0,1,2,3,4,5), size=150, p=(0.05, 0.35, 0.25, 0.20, 0.10, 0.05))

## Normalization <a name="section4"></a>

DHS and MICS normalize the final sample weights to sum to the sample size. We can use the class method *normalize()* to ensure that the sample weights sum to some constant across the sample or within each normalization domain, e.g. stratum.

In [0]:
full_sample["norm_weight"] = SampleWeight().normalize(full_sample["nr_weight"])


full_sample[["cluster", "region", "nr_weight", "norm_weight"]].head(25)

print((full_sample.shape[0], full_sample["norm_weight"].sum()))

When *normalize()* is called with only the sample weights as argument, the weights are normalized to sum to the length of the sample weight vector.

In [0]:
full_sample["norm_weight2"] = SampleWeight().normalize(full_sample["nr_weight"], control=300)

print(full_sample["norm_weight2"].sum())
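
The arithmetic behind this kind of normalization is a simple rescaling: each weight is multiplied by the target total divided by the current sum of weights. The sketch below illustrates it with a hypothetical helper and made-up weights; it is not the samplics implementation.

```python
import numpy as np
import pandas as pd


def normalize_weights(weights, control=None):
    """Rescale weights to sum to `control` (default: the number of weights)."""
    target = len(weights) if control is None else control
    return weights * target / weights.sum()


w = pd.Series([2.0, 4.0, 6.0, 8.0])

print(normalize_weights(w).sum())               # sums to the sample size
print(normalize_weights(w, control=300).sum())  # sums to the chosen constant
```

Normalization changes the scale of the weights but not their relative sizes, so weighted means and proportions are unaffected, whereas estimated totals are.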

In [0]:
full_sample["norm_weight3"] = SampleWeight().normalize(full_sample["nr_weight"], domain=full_sample["region"])

# Sum only the numeric columns; text columns such as response_status are skipped
weight_sum = full_sample.groupby(["region"]).sum(numeric_only=True)
weight_sum.reset_index(inplace=True)
weight_sum[["region", "nr_weight", "norm_weight", "norm_weight3"]]

In [0]:
norm_level = {"East": 10, "North": 20, "South": 30, "West": 50}

full_sample["norm_weight4"] = SampleWeight().normalize(full_sample["nr_weight"], norm_level, full_sample["region"])

# Sum only the numeric columns; text columns such as response_status are skipped
weight_sum = full_sample.groupby(["region"]).sum(numeric_only=True)
weight_sum.reset_index(inplace=True)
weight_sum[["region", "nr_weight", "norm_weight", "norm_weight2", "norm_weight3", "norm_weight4"]]
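
Normalizing by domain applies the same rescaling separately within each domain. The sketch below assumes that, when no control values are given, each domain's weights are scaled to sum to the number of sampled units in that domain; the function name and toy data are introduced only for this illustration.

```python
import numpy as np
import pandas as pd


def normalize_by_domain(weights, domain, control=None):
    """Rescale weights within each domain.

    control: optional dict {domain: target sum}. If omitted, each domain is
    scaled to its own sample size (an assumption of this sketch).
    """
    grouped = weights.groupby(domain)
    domain_sum = grouped.transform("sum")
    if control is None:
        target = grouped.transform("count")  # units per domain
    else:
        target = domain.map(control)
    return weights * target / domain_sum


toy = pd.DataFrame(
    {
        "region": ["East", "East", "West"],
        "nr_weight": [10.0, 30.0, 5.0],
    }
)

toy["norm_default"] = normalize_by_domain(toy["nr_weight"], toy["region"])
toy["norm_level"] = normalize_by_domain(
    toy["nr_weight"], toy["region"], control={"East": 10, "West": 50}
)
print(toy.groupby("region")[["norm_default", "norm_level"]].sum())
```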