# 3. Sample Weighting

## Table of Contents

- [Objective](#section0)

- [Design (base) Weight](#section1)

- [Non-Response Adjustment](#section2)

- [Post-Stratification](#section3)

- [Normalization](#section4)

## Objective <a name="section0"></a>

In [1]:
import numpy as np
import pandas as pd

import samplics as svm
from samplics.weighting import SampleWeight

## Design (base) weight <a name="section1"></a>

The design weight is the inverse of the overall probability of selection, which is the product of the first-stage and second-stage probabilities of selection.

In [2]:
psu_sample = pd.read_csv("psu_sample.csv")
ssu_sample = pd.read_csv("ssu_sample.csv")

full_sample = pd.merge(
    psu_sample[["cluster", "region", "psu_prob"]], 
    ssu_sample[["cluster", "household", "ssu_prob"]], 
    on="cluster")

full_sample["inclusion_prob"] = full_sample["psu_prob"] * full_sample["ssu_prob"] 
full_sample["design_weight"] = 1 / full_sample["inclusion_prob"] 

full_sample.head(25)

| | cluster | region | psu_prob | household | ssu_prob | inclusion_prob | design_weight |
|---:|---:|:---|---:|---:|---:|---:|---:|
| 0 | 7 | North | 0.187726 | 72 | 0.115385 | 0.021661 | 46.166667 |
| 1 | 7 | North | 0.187726 | 73 | 0.115385 | 0.021661 | 46.166667 |
| 2 | 7 | North | 0.187726 | 75 | 0.115385 | 0.021661 | 46.166667 |
| 3 | 7 | North | 0.187726 | 715 | 0.115385 | 0.021661 | 46.166667 |
| 4 | 7 | North | 0.187726 | 722 | 0.115385 | 0.021661 | 46.166667 |
| 5 | 7 | North | 0.187726 | 724 | 0.115385 | 0.021661 | 46.166667 |
| 6 | 7 | North | 0.187726 | 755 | 0.115385 | 0.021661 | 46.166667 |
| 7 | 7 | North | 0.187726 | 761 | 0.115385 | 0.021661 | 46.166667 |
| 8 | 7 | North | 0.187726 | 764 | 0.115385 | 0.021661 | 46.166667 |
| 9 | 7 | North | 0.187726 | 782 | 0.115385 | 0.021661 | 46.166667 |
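
As a quick check of the definition, the first row of the output can be recomputed by hand. The sketch below copies the two displayed (rounded) probabilities from that row; the variable names are only illustrative.

```python
# Values copied from the first row of the output above (rounded as displayed).
psu_prob = 0.187726  # first-stage (cluster) selection probability
ssu_prob = 0.115385  # second-stage (household) selection probability

# Overall inclusion probability is the product of the two stages.
inclusion_prob = psu_prob * ssu_prob

# The design weight is the inverse of the inclusion probability.
design_weight = 1 / inclusion_prob

print(f"inclusion_prob = {inclusion_prob:.6f}")
print(f"design_weight  = {design_weight:.2f}")
```

Because the inputs are rounded to six decimals, the recomputed weight agrees with the tabulated 46.166667 only up to rounding.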


To illustrate the class *SampleWeight*, we will simulate non-response status. 

In [3]:
np.random.seed(7)
full_sample["response_status"] = np.random.choice(
    ["ineligible","respondent", "non-respondent","unknown"], 
    size=full_sample.shape[0], 
    p=(0.10, 0.70, 0.15, 0.05)
    )

full_sample[["cluster", "region","design_weight", "response_status"]].head(25)

| | cluster | region | design_weight | response_status |
|---:|---:|:---|---:|:---|
| 0 | 7 | North | 46.166667 | ineligible |
| 1 | 7 | North | 46.166667 | respondent |
| 2 | 7 | North | 46.166667 | respondent |
| 3 | 7 | North | 46.166667 | respondent |
| 4 | 7 | North | 46.166667 | unknown |
| 5 | 7 | North | 46.166667 | respondent |
| 6 | 7 | North | 46.166667 | respondent |
| 7 | 7 | North | 46.166667 | ineligible |
| 8 | 7 | North | 46.166667 | respondent |
| 9 | 7 | North | 46.166667 | respondent |


## Non-Response Adjustment <a name="section2"></a>

In [4]:
status_mapping = {
    "in": "ineligible", "rr": "respondent", "nr": "non-respondent", "uk":"unknown"
    }

full_sample["nr_weight"] = SampleWeight().adjust(
    samp_weight=full_sample["design_weight"], 
    adjust_class=full_sample["region"], 
    resp_status=full_sample["response_status"], 
    resp_dict=status_mapping
    )

full_sample[["cluster", "region","design_weight", "response_status", "nr_weight"]].head(25)

| | cluster | region | design_weight | response_status | nr_weight |
|---:|---:|:---|---:|:---|---:|
| 0 | 7 | North | 46.166667 | ineligible | 46.166667 |
| 1 | 7 | North | 46.166667 | respondent | 60.236508 |
| 2 | 7 | North | 46.166667 | respondent | 60.236508 |
| 3 | 7 | North | 46.166667 | respondent | 60.236508 |
| 4 | 7 | North | 46.166667 | unknown | 0.0 |
| 5 | 7 | North | 46.166667 | respondent | 60.236508 |
| 6 | 7 | North | 46.166667 | respondent | 60.236508 |
| 7 | 7 | North | 46.166667 | ineligible | 46.166667 |
| 8 | 7 | North | 46.166667 | respondent | 60.236508 |
| 9 | 7 | North | 46.166667 | respondent | 60.236508 |


**Important.** By default, *adjust()* expects the standard response status codes "in", "rr", "nr", and "uk", where "in" means ineligible, "rr" means respondent, "nr" means non-respondent, and "uk" means unknown eligibility.

If we called *adjust()* without the parameter *resp_dict*, the run would fail with an assertion error. The current error message is the following: *The response status must only contains values in ('in', 'rr', 'nr', 'uk') or the mapping should be provided using response_dict parameter*. For the call to run without *resp_dict*, the response status must take only the codes "in", "rr", "nr", or "uk". The response status variable can contain any codes, but a mapping is required whenever it is not built from the standard codes.
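
The requirement on the codes can also be checked, or satisfied by recoding, before calling *adjust()*. The sketch below uses pandas only; `STANDARD_CODES` and `to_standard_codes` are hypothetical helper names introduced for illustration and are not part of samplics. The mapping follows the same orientation as the dictionary passed above (standard code to custom value).

```python
import pandas as pd

# Standard codes quoted in the text above.
STANDARD_CODES = ("in", "rr", "nr", "uk")


def to_standard_codes(status, mapping):
    """Recode a response-status vector to the standard codes.

    `mapping` maps each standard code to the custom value used in the data
    (hypothetical helper, not a samplics function).
    """
    status = pd.Series(status)
    # Invert the mapping: custom value -> standard code.
    inverse = {custom: code for code, custom in mapping.items()}
    recoded = status.map(inverse)
    if recoded.isna().any():
        unmapped = sorted(set(status[recoded.isna()]))
        raise ValueError(f"Unmapped response status values: {unmapped}")
    return recoded


status = ["respondent", "unknown", "ineligible", "non-respondent"]
mapping = {
    "in": "ineligible", "rr": "respondent", "nr": "non-respondent", "uk": "unknown"
}

recoded = to_standard_codes(status, mapping)
print(recoded.tolist())
```

A vector recoded this way contains only the standard codes, so it could be passed without a mapping dictionary.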

To further illustrate the mapping of response status, let's assume that we have response_status2 which has the values 100 for ineligible, 200 for non-respondent, 300 for respondent, and 999 for unknown. 

In [5]:
response_status2 = np.repeat(100, full_sample["response_status"].shape[0])
response_status2[full_sample["response_status"]=="non-respondent"] = 200
response_status2[full_sample["response_status"]=="respondent"] = 300
response_status2[full_sample["response_status"]=="unknown"] = 999

pd.crosstab(response_status2, full_sample["response_status"])

| response_status2 | ineligible | non-respondent | respondent | unknown |
|---:|---:|---:|---:|---:|
| 100 | 16 | 0 | 0 | 0 |
| 200 | 0 | 23 | 0 | 0 |
| 300 | 0 | 0 | 106 | 0 |
| 999 | 0 | 0 | 0 | 5 |


To use *response_status2*, we need to map the values 100, 200, 300 and 999 to "in", "nr", "rr", and "uk", respectively. This mapping is done below through the Python dictionary *status_mapping2*. Using *status_mapping2* in the call to *adjust()* produces the same adjustment as in the previous run, i.e. *nr_weight* and *nr_weight2* contain the same adjusted weights.

In [6]:
status_mapping2 = {"in": 100, "nr": 200, "rr": 300, "uk": 999}

full_sample["nr_weight2"] = SampleWeight().adjust(
    samp_weight=full_sample["design_weight"], 
    adjust_class=full_sample["region"], 
    resp_status=response_status2, 
    resp_dict=status_mapping2
    )

full_sample[["cluster", "region","design_weight", "response_status", "nr_weight", "nr_weight2"]].head(25)


| | cluster | region | design_weight | response_status | nr_weight | nr_weight2 |
|---:|---:|:---|---:|:---|---:|---:|
| 0 | 7 | North | 46.166667 | ineligible | 46.166667 | 46.166667 |
| 1 | 7 | North | 46.166667 | respondent | 60.236508 | 60.236508 |
| 2 | 7 | North | 46.166667 | respondent | 60.236508 | 60.236508 |
| 3 | 7 | North | 46.166667 | respondent | 60.236508 | 60.236508 |
| 4 | 7 | North | 46.166667 | unknown | 0.0 | 0.0 |
| 5 | 7 | North | 46.166667 | respondent | 60.236508 | 60.236508 |
| 6 | 7 | North | 46.166667 | respondent | 60.236508 | 60.236508 |
| 7 | 7 | North | 46.166667 | ineligible | 46.166667 | 46.166667 |
| 8 | 7 | North | 46.166667 | respondent | 60.236508 | 60.236508 |
| 9 | 7 | North | 46.166667 | respondent | 60.236508 | 60.236508 |


If the response status variable only takes the values "in", "nr", "rr" and "uk", then it is not necessary to provide the mapping dictionary, i.e. *resp_dict* can be omitted from the call to *adjust()*.

In [7]:
response_status3 = np.repeat("in", full_sample["response_status"].shape[0])
response_status3[full_sample["response_status"]=="non-respondent"] = "nr"
response_status3[full_sample["response_status"]=="respondent"] = "rr"
response_status3[full_sample["response_status"]=="unknown"] = "uk"

full_sample["nr_weight3"] = SampleWeight().adjust(
    samp_weight=full_sample["design_weight"], 
    adjust_class=full_sample["region"], 
    resp_status=response_status3
    )

full_sample[["cluster", "region","design_weight", "response_status", "nr_weight", "nr_weight2", "nr_weight3"]].head(25)

| | cluster | region | design_weight | response_status | nr_weight | nr_weight2 | nr_weight3 |
|---:|---:|:---|---:|:---|---:|---:|---:|
| 0 | 7 | North | 46.166667 | ineligible | 46.166667 | 46.166667 | 46.166667 |
| 1 | 7 | North | 46.166667 | respondent | 60.236508 | 60.236508 | 60.236508 |
| 2 | 7 | North | 46.166667 | respondent | 60.236508 | 60.236508 | 60.236508 |
| 3 | 7 | North | 46.166667 | respondent | 60.236508 | 60.236508 | 60.236508 |
| 4 | 7 | North | 46.166667 | unknown | 0.0 | 0.0 | 0.0 |
| 5 | 7 | North | 46.166667 | respondent | 60.236508 | 60.236508 | 60.236508 |
| 6 | 7 | North | 46.166667 | respondent | 60.236508 | 60.236508 | 60.236508 |
| 7 | 7 | North | 46.166667 | ineligible | 46.166667 | 46.166667 | 46.166667 |
| 8 | 7 | North | 46.166667 | respondent | 60.236508 | 60.236508 | 60.236508 |
| 9 | 7 | North | 46.166667 | respondent | 60.236508 | 60.236508 | 60.236508 |


## Post-Stratification <a name="section3"></a>

Poststratification is useful to compensate for underrepresentation in the sample or to correct for nonsampling error. The most common poststratification method consists of adjusting the sample weights to ensure that they sum to some "known" control values by poststratification classes (domains).

Let us assume that we have a very reliable external source, e.g. a recent census, that provides the number of households by region. The external source has the following control data: 3700 households for East, 1500 for North, 2800 for South and 6500 for West.

In [0]:
# Just dropping a couple of variables not needed for the rest of the tutorial
full_sample.drop(columns=["psu_prob", "ssu_prob", "inclusion_prob", "nr_weight2", "nr_weight3"], inplace=True)

census_households = {"East":3700, "North": 1500, "South": 2800, "West":6500}

full_sample["ps_weight"] = SampleWeight().poststratify(
    samp_weight=full_sample["nr_weight"], control=census_households, domain=full_sample["region"]
    )

full_sample.head()

In [0]:
sum_of_weights = full_sample[["region", "nr_weight", "ps_weight"]].groupby("region").sum()
sum_of_weights.reset_index(inplace=True)
sum_of_weights.head()

In [0]:
full_sample["ps_adjust_fct"] = round(full_sample["ps_weight"] / full_sample["nr_weight"], 12)

pd.crosstab(full_sample["ps_adjust_fct"] , full_sample["region"])
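
In the standard form of poststratification described at the start of this section, the adjustment factor is constant within a domain and equals the control total divided by the sum of the input weights in that domain. The pandas sketch below shows that arithmetic on toy data; the toy weights, the control values and the column names are assumptions made for illustration, and the code is not the samplics implementation.

```python
import pandas as pd

# Toy data: two domains with arbitrary input weights (illustrative values only).
toy = pd.DataFrame({
    "region": ["East", "East", "North", "North", "North"],
    "weight": [10.0, 30.0, 5.0, 5.0, 10.0],
})
control = {"East": 100.0, "North": 60.0}  # hypothetical control totals

# Sum of the input weights within each unit's own domain.
domain_sum = toy.groupby("region")["weight"].transform("sum")

# One adjustment factor per domain: control total / current weighted total.
toy["ps_factor"] = toy["region"].map(control) / domain_sum
toy["ps_weight"] = toy["weight"] * toy["ps_factor"]

ps_totals = toy.groupby("region")["ps_weight"].sum()
print(toy)
print(ps_totals)
```

After the adjustment, the weights in each domain sum to that domain's control total, while the relative sizes of the weights within a domain are unchanged.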

In [0]:
known_ratios = {"East":0.25, "North": 0.10, "South": 0.20, "West":0.45}
full_sample["ps_weight2"] = SampleWeight().poststratify(
    samp_weight=full_sample["nr_weight"], factor=known_ratios, domain=full_sample["region"]
    )

full_sample.head()

In [0]:
sum_of_weights2 = full_sample[["region", "nr_weight", "ps_weight2"]].groupby("region").sum()
sum_of_weights2.reset_index(inplace=True)
sum_of_weights2["ratio"] = sum_of_weights2["ps_weight2"] / sum(sum_of_weights2["ps_weight2"])
sum_of_weights2.head()
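
The *ratio* column computed above suggests one way to read ratio-based poststratification: each domain's share of the total weight is forced to a known proportion while the overall sum of the weights is preserved. The sketch below implements that reading on toy data. It is an assumption about the intended behaviour, not a description of how samplics handles its *factor* argument, and all names and values are illustrative.

```python
import pandas as pd

toy = pd.DataFrame({
    "region": ["East", "East", "North", "North", "North"],
    "weight": [10.0, 30.0, 5.0, 5.0, 10.0],
})
ratios = {"East": 0.25, "North": 0.75}  # hypothetical known shares, summing to 1

total_weight = toy["weight"].sum()

# Target total for each unit's domain = known share * overall sum of weights.
target = toy["region"].map(ratios) * total_weight
domain_sum = toy.groupby("region")["weight"].transform("sum")

toy["ps_weight2"] = toy["weight"] * target / domain_sum

# Share of the adjusted weights held by each domain.
shares = toy.groupby("region")["ps_weight2"].sum() / toy["ps_weight2"].sum()
print(shares)
```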

Poststratification classes can also be formed using variables beyond the ones involved in the sampling design. For example, socio-economic variables such as age group, gender, race and education are often used to form poststratification classes/cells.

## Calibration 

Calibration is a more general concept for adjusting sample weights to sum to known constants. In this tutorial, we consider the generalized regression (GREG) class of calibration. Assume that we have $\hat{\mathbf{Y}} = \sum_{i \in s} w_i y_i$ and that the known population totals $\mathbf{X} = (\mathbf{X}_1, ..., \mathbf{X}_p)^T$ are available. Working under the model $Y_i | \mathbf{x}_i = \mathbf{x}^T_i \mathbf{\beta} + \epsilon_i$, the GREG estimator of the population total is 

$$\hat{\mathbf{Y}}_{GR} = \hat{\mathbf{Y}} + (\mathbf{X} - \hat{\mathbf{X}})^T\hat{\mathbf{B}}$$

where $\hat{\mathbf{B}}$ is the weighted least squares estimate of $\mathbf{\beta}$ and $\hat{\mathbf{X}}$ is the survey estimate of $\mathbf{X}$.

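The essence of the GREG approach is, under the regression model, to find the adjusted weights $w^{*}_i$ that are the closest to $w_i$, that is, the values $z_i$ that minimize $h(z) = \sum_{i \in s} \frac{c_i (w_i - z_i)^2}{w_i}$ subject to the calibration constraint $\sum_{i \in s} z_i \mathbf{x}_i = \mathbf{X}$.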
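
For the chi-square type distance with $c_i = 1$, the calibrated weights have the closed form $w^{*}_i = w_i (1 + \mathbf{x}^T_i \mathbf{\lambda})$, where $\mathbf{\lambda}$ solves $\left(\sum_{i \in s} w_i \mathbf{x}_i \mathbf{x}^T_i\right) \mathbf{\lambda} = \mathbf{X} - \hat{\mathbf{X}}$. The NumPy sketch below applies this formula to simulated data. It is a minimal illustration under the stated assumption $c_i = 1$, with made-up control totals, and is not the samplics implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50

# Simulated input weights and two auxiliary variables (illustrative only).
w = rng.uniform(20, 60, size=n)
x = np.column_stack([
    rng.integers(0, 2, size=n),  # binary indicator, e.g. a poverty flag
    rng.integers(0, 6, size=n),  # count variable, e.g. children under five
]).astype(float)

x_hat = w @ x                               # survey estimates of the totals
controls = x_hat * np.array([1.05, 0.95])   # hypothetical known totals

# Solve (sum_i w_i x_i x_i^T) lambda = X - X_hat.
m = x.T @ (x * w[:, None])
lam = np.linalg.solve(m, controls - x_hat)

# Calibrated weights: w_i * (1 + x_i^T lambda).
w_cal = w * (1.0 + x @ lam)

print("before:", x_hat)
print("after: ", w_cal @ x)
print("target:", controls)
```

The calibrated weights reproduce the control totals exactly (up to floating point error), which is the defining property of calibration.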

Let us simulate three auxiliary variables, namely education, poverty and under_five (number of children under five in the household), and assume that we have the following control totals:
* Total number of under five children: 6300 in the East, 4000 in the North, 6500 in the South and 14000 in the West. 
* Poverty (Yes: in poverty / No: not in poverty)

    | Region &nbsp;| Poverty &nbsp;| Number of households |
    |:--------|:--------:|:--------------------:|
    | East    |    No    |       2600           |
    |         |    Yes   |       1200           |
    | North   |    No    |       1500           |
    |         |    Yes   |        200           |
    | South   |    No    |       1800           |
    |         |    Yes   |       1100           |
    | West    |    No    |       4500           |
    |         |    Yes   |       2200           |

* Education (Low: less than secondary, Medium: secondary completed, and High: More than secondary)

    | Region &nbsp;| Education &nbsp;| Number of households |
    |:--------|:--------:|:------:|
    | East    | Low      | 2000   |
    |         | Medium   | 1400   |
    |         | High     |  350   |
    | North   | Low      |  550   |
    |         | Medium   |  700   |
    |         | High     |  250   |
    | South   | Low      | 1300   |
    |         | Medium   | 1200   |
    |         | High     |  350   |
    | West    | Low      | 2100   |
    |         | Medium   | 4000   |
    |         | High     |  500   |

In [0]:
np.random.seed(150)
full_sample["education"] = np.random.choice(("Low", "Medium", "High"), size=150, p=(0.40, 0.50, 0.10))
full_sample["poverty"] = np.random.choice((0, 1), size=150, p=(0.70, 0.30))
full_sample["under_five"] = np.random.choice((0,1,2,3,4,5), size=150, p=(0.05, 0.35, 0.25, 0.20, 0.10, 0.05))
full_sample.head()

We will now calibrate the nonresponse weight (*nr_weight*) to ensure that the estimated number of households in poverty is equal to 4,700 and the estimated total number of children under five is 30,800. The control numbers 4,700 and 30,800 are obtained by summing the regional control totals given above. 

The class *SampleWeight()* uses the method *calibrate(samp_weight, aux_vars, control, domain, scale, bounded, additive)* to adjust the weight using the GREG approach. 
* The control values must be stored in a Python dictionary, e.g. totals = {"poverty": 4700, "under_five": 30800}. In this case, we have two numerical variables: poverty with values in {0, 1} and under_five with values in {0, 1, 2, 3, 4, 5}. 
* *aux_vars* is the matrix of covariates.


In [0]:
totals = {"poverty": 4700, "under_five": 30800}

full_sample["calib_weight"] = SampleWeight().calibrate(
    full_sample["nr_weight"], full_sample[["poverty", "under_five"]], totals
    )

full_sample[["cluster", "region", "household", "nr_weight", "calib_weight"]].head(15)

We can confirm that the estimated totals for the auxiliary variables are equal to their control values.

In [0]:
poverty = full_sample["poverty"]
under_5 = full_sample["under_five"]
nr_weight = full_sample["nr_weight"]
calib_weight = full_sample["calib_weight"]

print(f"Total estimated number of poor households was {sum(poverty*nr_weight):.2f} before and {sum(poverty*calib_weight):.2f} after adjustment \n")
print(f"Total estimated number of children under 5 was {sum(under_5*nr_weight):.2f} before and {sum(under_5*calib_weight):.2f} after adjustment \n")

If we want to control by domain, we can do so using the parameter *domain* of *calibrate()*. First, we need to update the Python dictionary holding the control values, because those values now have to be provided for each domain. The dictionary becomes a nested dictionary: the top-level keys are the domain values, i.e. East, North, South and West, and each top-level value is itself a dictionary mapping the auxiliary variables to their control values in that domain. 

In [0]:
totals_by_domain = {
    "East": {"poverty": 1200, "under_five": 6300}, 
    "North": {"poverty": 200, "under_five": 4000}, 
    "South": {"poverty": 1100, "under_five": 6500}, 
    "West": {"poverty": 2200, "under_five": 14000}
    }

full_sample["calib_weight_d"] = SampleWeight().calibrate(
    full_sample["nr_weight"], full_sample[["poverty", "under_five"]], totals_by_domain, full_sample["region"]
    )

full_sample[["cluster", "region", "household", "nr_weight", "calib_weight", "calib_weight_d"]].head(15)

Note that the GREG domain estimates above do not have the additive property. That is, the GREG domain estimates do not sum to the overall GREG estimate. To see this, let's assume that we want to estimate the number of households. 

In [0]:
print(f"The number of households using the overall GREG is: {sum(full_sample['calib_weight']):.2f} \n")
print(f"The number of households using the domain GREG is: {sum(full_sample['calib_weight_d']):.2f} \n")

We can force the additive property by setting the *additive* flag to true, as shown below; by default, the flag is set to false. However, with the additive property, a set of adjusted sample weights is created for each domain. Hence, four sets of adjusted sample weights will be created for our example. The output is no longer a vector but a matrix with four columns. To estimate a given domain, the user will have to use the associated column of the matrix. 

**Important**
* Note that GREG can produce negative weights. Future versions of the library will implement optional modifications to address negative or large weights. 
* Also, note that units outside of the domain of interest have non-zero sampling weights. This is necessary to achieve the additive property. 

In [0]:
calib_weight_d2 = SampleWeight().calibrate(
    full_sample["nr_weight"], full_sample[["poverty", "under_five"]], totals_by_domain, full_sample["region"], additive=True
    )

calib_weight_d2 = pd.DataFrame(calib_weight_d2, columns=["East", "North", "South", "West"])
calib_weight_d2.head(15)

In [0]:
print(f"The GREG domain estimates for the number of households are:\n{calib_weight_d2.sum()} \n")

print(f"The sum of the GREG domain estimates (with the additive property) for the number of households is: {sum(calib_weight_d2.sum()):.2f} which is the same as the overall GREG estimate previously calculated as {sum(full_sample['calib_weight']):.2f}. In summary:\n")

print(f"The number of households using the overall GREG estimate is: {sum(full_sample['calib_weight']):.2f} \n")
print(f"The number of households using the domain GREG estimates is: {sum(full_sample['calib_weight_d']):.2f} - with ADDITIVE=FALSE \n")
print(f"The number of households using the domain GREG estimates is: {sum(calib_weight_d2.sum()):.2f} - with ADDITIVE=TRUE \n")

All the calibration auxiliary variables seen above are numerical, but categorical variables may also be used. The approach used by the class method *calibrate()* is to transform the categorical variables into dummy variables, which are used to fit the associated regression model. The user has to provide the control values associated with the adjustment cells. To facilitate this process, the user can take advantage of the method *calib_covariates(data, x_cat, x_cont, domain)*. This method takes the auxiliary variables as input and returns a matrix of the auxiliary variables and a dictionary to be updated with the control values. These two output objects are inputs to *calibrate()*.

Assume that we want to calibrate the weights based on the variables *education* (categorical) and *under_five* (numerical). We can use the snippet of code below to create the matrix of auxiliary variables and the dictionary object that maps adjustment classes to control values. Note that the categorical variables are represented by one entry per value, while the continuous variable has only one entry in the dictionary. 

In [0]:
aux_vars, control_dict = SampleWeight().calib_covariates(full_sample, x_cat=["education"], x_cont=["under_five"])

print(f"The dictionary for mapping domains to control values: {control_dict}\n")
print(f"A slice of the matrix of auxiliary variables:\n{aux_vars[0:14,]}")
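
To make the dummy-variable idea concrete, the sketch below builds a comparable covariate matrix and an empty control template with pandas alone. It only illustrates the transformation described above: the column order, the placeholder values and the variable names are assumptions, and the objects returned by *calib_covariates()* may be laid out differently.

```python
import pandas as pd

# Toy auxiliary data: one categorical and one numerical variable.
toy = pd.DataFrame({
    "education": ["Low", "Medium", "High", "Low"],
    "under_five": [1, 0, 2, 3],
})

# One 0/1 dummy column per category of the categorical variable.
dummies = pd.get_dummies(toy["education"], dtype=float)

# The numerical variable is kept as a single column.
aux = pd.concat([dummies, toy[["under_five"]].astype(float)], axis=1)

# Template to be filled with control totals: one key per column.
control_template = {name: None for name in aux.columns}

print(aux)
print(control_template)
```

Each category contributes its own key (and needs its own control total), whereas the numerical variable contributes a single key.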

The control values are 30800 for under_five, 5950 for low education, 7300 for medium education, and 1450 for high education. Let's update *control_dict* then run *calibrate()*.

In [0]:
control_dict["under_five"] = 30800
control_dict["Low"] = 5950
control_dict["Medium"] = 7300
control_dict["High"] = 1450

full_sample["calib_weight2"] = SampleWeight().calibrate(full_sample["nr_weight"], aux_vars, control_dict)

full_sample[["cluster", "region", "household", "nr_weight", "calib_weight2"]].head(15)

In [0]:
aux_vars, control_dict = SampleWeight().calib_covariates(full_sample, x_cat=["education"], x_cont=["under_five"], domain="region")

from pprint import pprint
pprint(control_dict)
print(f"\nA slice of the matrix of auxiliary variables:\n{aux_vars[0:14,]}")

In [0]:
control_dict["East"]["under_five"] = 6300
control_dict["East"]["Low"] = 2000
control_dict["East"]["Medium"] = 1400
control_dict["East"]["High"] = 350

control_dict["North"]["under_five"] = 4000
control_dict["North"]["Low"] = 550
control_dict["North"]["Medium"] = 700
control_dict["North"]["High"] = 250

control_dict["South"]["under_five"] = 6500
control_dict["South"]["Low"] = 1300
control_dict["South"]["Medium"] = 1200
control_dict["South"]["High"] = 350

control_dict["West"]["under_five"] = 14000
control_dict["West"]["Low"] = 2100
control_dict["West"]["Medium"] = 4000
control_dict["West"]["High"] = 500

full_sample["calib_weight3"] = SampleWeight().calibrate(full_sample["nr_weight"], aux_vars, control_dict, full_sample["region"])

full_sample[["cluster", "region", "household", "education", "under_five", "nr_weight", "calib_weight3"]].head(15)

## Normalization <a name="section4"></a>

Sometimes surveys adjust their sample weights to sum to arbitrary constants. This is known as normalizing sample weights. Sample weight normalization is less common in modern surveys. However, DHS and MICS still normalize their final sample weights to sum to the sample size. Note that estimates of totals are not meaningful when using normalized weights, but relative estimates such as means, proportions or ratios remain valid because the normalization constant cancels out. 

We can use the class method *normalize(samp_weight, control, domain)* to adjust sample weights to sum to some constant across the sample or by normalization domain. Users should be careful when normalizing by domain, as it will change the distribution of the weights across normalization domains. *normalize()* implements the domain parameter for completeness and flexibility, but it should seldom be used in practice. 

In [0]:
full_sample["norm_weight"] = SampleWeight().normalize(full_sample["nr_weight"])


full_sample[["cluster", "region", "nr_weight", "norm_weight"]].head(25)

print((full_sample.shape[0], full_sample["norm_weight"].sum()))

When *normalize()* is called with only the parameter *samp_weight* then the sample weights are normalized to sum to the length of the sample weight vector. 
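
The arithmetic behind this default is a single rescaling: each weight is multiplied by the sample size divided by the sum of the weights. The NumPy sketch below shows it on made-up values and also checks that a weighted mean is unaffected; it is an illustration of the idea, not the samplics code.

```python
import numpy as np

# Made-up weights and a binary outcome for five sampled units.
w = np.array([46.2, 60.2, 60.2, 12.5, 80.0])
y = np.array([1.0, 0.0, 1.0, 1.0, 0.0])

# Normalize so that the weights sum to the sample size.
n = len(w)
norm_w = w * n / w.sum()

# A relative estimate (weighted mean) before and after normalization.
mean_before = np.sum(w * y) / np.sum(w)
mean_after = np.sum(norm_w * y) / np.sum(norm_w)

# An estimated total, by contrast, is rescaled by the same constant.
total_before = np.sum(w * y)
total_after = np.sum(norm_w * y)

print(norm_w.sum(), mean_before, mean_after)
print(total_before, total_after)
```

The weighted mean is identical under both sets of weights, while the estimated total shrinks by the factor `n / w.sum()`, which is why totals are not meaningful with normalized weights.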

In [0]:
full_sample["norm_weight2"] = SampleWeight().normalize(full_sample["nr_weight"], control=300)

print(f"{full_sample['norm_weight2'].sum():.2f}")

In [0]:
full_sample["norm_weight3"] = SampleWeight().normalize(full_sample["nr_weight"], domain=full_sample["region"])

weight_sum = full_sample.groupby(["region"]).sum()
weight_sum.reset_index(inplace=True)
weight_sum[["region", "nr_weight", "norm_weight", "norm_weight3"]]

As for the other methods, the control values by domain are provided using a Python dictionary that maps each domain to the associated normalization level.

In [0]:
norm_level = {"East": 10, "North": 20, "South": 30, "West": 50}

full_sample["norm_weight4"] = SampleWeight().normalize(full_sample["nr_weight"], norm_level, full_sample["region"])

weight_sum = full_sample.groupby(["region"]).sum()
weight_sum.reset_index(inplace=True)
weight_sum[["region", "nr_weight", "norm_weight", "norm_weight2", "norm_weight3", "norm_weight4",]]
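
The caution about normalizing by domain can be seen on a small example: rescaling each domain to its own level changes the share of the total weight held by each domain. The pandas sketch below uses toy weights and hypothetical levels; it shows the effect and is not the samplics implementation.

```python
import pandas as pd

toy = pd.DataFrame({
    "region": ["East", "East", "North", "North", "North"],
    "weight": [10.0, 30.0, 5.0, 5.0, 10.0],
})
levels = {"East": 10.0, "North": 20.0}  # hypothetical normalization levels

# Rescale the weights within each domain to sum to that domain's level.
domain_sum = toy.groupby("region")["weight"].transform("sum")
toy["norm_weight"] = toy["weight"] * toy["region"].map(levels) / domain_sum

# Share of the total weight held by each domain, before and after.
share_before = toy.groupby("region")["weight"].sum() / toy["weight"].sum()
share_after = toy.groupby("region")["norm_weight"].sum() / toy["norm_weight"].sum()

print(share_before)
print(share_after)
```

In this toy case East holds two thirds of the total weight before normalization and one third afterwards, so estimates that pool the two domains would change.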