# 3. Sample Weighting

## Table of Contents

- [Objective](#section0)

- [Design (base) Weight](#section1)

- [Non-Response Adjustment](#section2)

- [Post-Stratification](#section3)

- [Normalization](#section4)

## Objective <a name="section0"></a>

In [1]:
import numpy as np
import pandas as pd

import samplics as svm
from samplics.weighting import SampleWeight

## Design (base) weight <a name="section1"></a>

The design weight is the inverse of the overall probability of selection, which is the product of the first-stage and second-stage probabilities of selection. In the code below, these are the columns *psu_prob* and *ssu_prob*.

In [2]:
psu_sample = pd.read_csv("psu_sample.csv")
ssu_sample = pd.read_csv("ssu_sample.csv")

full_sample = pd.merge(
    psu_sample[["cluster", "region", "psu_prob"]], 
    ssu_sample[["cluster", "household", "ssu_prob"]], 
    on="cluster")

full_sample["inclusion_prob"] = full_sample["psu_prob"] * full_sample["ssu_prob"] 
full_sample["design_weight"] = 1 / full_sample["inclusion_prob"] 

full_sample.head(25)

| | cluster | region | psu_prob | household | ssu_prob | inclusion_prob | design_weight |
|---|---|---|---|---|---|---|---|
| 0 | 7 | North | 0.187726 | 72 | 0.115385 | 0.021661 | 46.166667 |
| 1 | 7 | North | 0.187726 | 73 | 0.115385 | 0.021661 | 46.166667 |
| 2 | 7 | North | 0.187726 | 75 | 0.115385 | 0.021661 | 46.166667 |
| 3 | 7 | North | 0.187726 | 715 | 0.115385 | 0.021661 | 46.166667 |
| 4 | 7 | North | 0.187726 | 722 | 0.115385 | 0.021661 | 46.166667 |
| 5 | 7 | North | 0.187726 | 724 | 0.115385 | 0.021661 | 46.166667 |
| 6 | 7 | North | 0.187726 | 755 | 0.115385 | 0.021661 | 46.166667 |
| 7 | 7 | North | 0.187726 | 761 | 0.115385 | 0.021661 | 46.166667 |
| 8 | 7 | North | 0.187726 | 764 | 0.115385 | 0.021661 | 46.166667 |
| 9 | 7 | North | 0.187726 | 782 | 0.115385 | 0.021661 | 46.166667 |
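
As a quick check of the formula, the following sketch recomputes the design weight of the first household listed above from its two selection probabilities. The values are copied from that row and are rounded to six decimals, so the result matches *design_weight* only up to rounding.

```python
# Values copied from the first row shown above (rounded to six decimals).
psu_prob = 0.187726  # first-stage selection probability (cluster)
ssu_prob = 0.115385  # second-stage selection probability (household)

# Overall inclusion probability is the product of the two stages.
inclusion_prob = psu_prob * ssu_prob

# The design weight is the inverse of the inclusion probability.
design_weight = 1 / inclusion_prob

print(round(inclusion_prob, 6))  # 0.021661
print(round(design_weight, 2))   # 46.17
```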


To illustrate the class *SampleWeight*, we simulate a response status (ineligible, respondent, non-respondent, or unknown) for each sampled household.

In [3]:
np.random.seed(7)
full_sample["response_status"] = np.random.choice(
    ["ineligible","respondent", "non-respondent","unknown"], 
    size=full_sample.shape[0], 
    p=(0.10, 0.70, 0.15, 0.05)
    )

full_sample[["cluster", "region","design_weight", "response_status"]].head(25)

| | cluster | region | design_weight | response_status |
|---|---|---|---|---|
| 0 | 7 | North | 46.166667 | ineligible |
| 1 | 7 | North | 46.166667 | respondent |
| 2 | 7 | North | 46.166667 | respondent |
| 3 | 7 | North | 46.166667 | respondent |
| 4 | 7 | North | 46.166667 | unknown |
| 5 | 7 | North | 46.166667 | respondent |
| 6 | 7 | North | 46.166667 | respondent |
| 7 | 7 | North | 46.166667 | ineligible |
| 8 | 7 | North | 46.166667 | respondent |
| 9 | 7 | North | 46.166667 | respondent |


## Non-Response Adjustment <a name="section2"></a>

In [14]:
status_mapping = {
    "in": "ineligible", "rr": "respondent", "nr": "non-respondent", "uk":"unknown"
    }

full_sample["nr_weight"] = SampleWeight().adjust(
    samp_weight=full_sample["design_weight"], 
    adjust_class=full_sample["region"], 
    resp_status=full_sample["response_status"], 
    resp_dict=status_mapping
    )

full_sample[["cluster", "region","design_weight", "response_status", "nr_weight"]].head(25)

| | cluster | region | design_weight | response_status | nr_weight |
|---|---|---|---|---|---|
| 0 | 7 | North | 46.166667 | ineligible | 46.166667 |
| 1 | 7 | North | 46.166667 | respondent | 60.236508 |
| 2 | 7 | North | 46.166667 | respondent | 60.236508 |
| 3 | 7 | North | 46.166667 | respondent | 60.236508 |
| 4 | 7 | North | 46.166667 | unknown | 0.0 |
| 5 | 7 | North | 46.166667 | respondent | 60.236508 |
| 6 | 7 | North | 46.166667 | respondent | 60.236508 |
| 7 | 7 | North | 46.166667 | ineligible | 46.166667 |
| 8 | 7 | North | 46.166667 | respondent | 60.236508 |
| 9 | 7 | North | 46.166667 | respondent | 60.236508 |
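
To make the effect of such an adjustment concrete, here is a minimal pandas sketch of a weighting-class adjustment on toy data. It assumes a simple rule that is consistent with the pattern above: within each class, respondents absorb the weights of non-respondents and of units with unknown eligibility, ineligible units keep their weights, and all other units get a weight of zero. The function name *simple_nr_adjust*, the toy values, and the rule itself are illustrative assumptions. This is not the samplics implementation, and *adjust()* may handle unknown eligibility differently.

```python
import pandas as pd

# Toy data (hypothetical values): a single adjustment class with equal weights.
toy = pd.DataFrame({
    "adjust_class": ["A"] * 5,
    "weight": [10.0, 10.0, 10.0, 10.0, 10.0],
    "status": ["rr", "rr", "nr", "in", "uk"],
})


def simple_nr_adjust(df):
    """Assumed rule: respondents absorb 'nr' and 'uk' weights within each class."""
    is_rr = df["status"] == "rr"
    is_in = df["status"] == "in"
    absorbed = df["status"].isin(["rr", "nr", "uk"])

    # Per-class totals, broadcast back to every row of the class.
    total = df["weight"].where(absorbed, 0.0).groupby(df["adjust_class"]).transform("sum")
    rr_total = df["weight"].where(is_rr, 0.0).groupby(df["adjust_class"]).transform("sum")
    factor = total / rr_total  # assumes every class has at least one respondent

    adjusted = pd.Series(0.0, index=df.index)          # 'nr' and 'uk' stay at zero
    adjusted[is_rr] = df["weight"][is_rr] * factor[is_rr]
    adjusted[is_in] = df["weight"][is_in]              # ineligible units unchanged
    return adjusted


toy["adj_weight"] = simple_nr_adjust(toy)
print(toy)
```

In this toy class the adjustment factor is 40 / 20 = 2, so each respondent's weight doubles and the total weight (50) is preserved.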


**Important.** By default, *adjust()* expects the standard response status codes "in", "rr", "nr", and "uk", where "in" means ineligible, "rr" means respondent, "nr" means non-respondent, and "uk" means unknown eligibility.

If we called *adjust()* in the example above without the parameter *resp_dict*, the run would fail with an assertion error. The current error message is the following: *The response status must only contains values in ('in', 'rr', 'nr', 'uk') or the mapping should be provided using response_dict parameter*. For the call to run without *resp_dict*, the response status must take only the codes "in", "rr", "nr", or "uk". The variable passed as *resp_status* can contain any codes, but a mapping is required whenever it is not built from the standard codes.

To further illustrate the mapping of response status, let's assume that we have a variable *response_status2* that takes the values 100 for ineligible, 200 for non-respondent, 300 for respondent, and 999 for unknown.

In [15]:
response_status2 = np.repeat(100, full_sample["response_status"].shape[0])
response_status2[full_sample["response_status"]=="non-respondent"] = 200
response_status2[full_sample["response_status"]=="respondent"] = 300
response_status2[full_sample["response_status"]=="unknown"] = 999

pd.crosstab(response_status2, full_sample["response_status"])

| row_0 \ response_status | ineligible | non-respondent | respondent | unknown |
|---|---|---|---|---|
| 100 | 16 | 0 | 0 | 0 |
| 200 | 0 | 23 | 0 | 0 |
| 300 | 0 | 0 | 106 | 0 |
| 999 | 0 | 0 | 0 | 5 |


To use *response_status2*, we need to map the values 100, 200, 300, and 999 to "in", "nr", "rr", and "uk", respectively. This mapping is done below through the Python dictionary *status_mapping2*. Using *status_mapping2* in the call to *adjust()* leads to the same adjustment as in the previous run, i.e. *nr_weight* and *nr_weight2* contain the same adjusted weights.

In [17]:
status_mapping2 = {"in": 100, "nr": 200, "rr": 300, "uk": 999}

full_sample["nr_weight2"] = SampleWeight().adjust(
    samp_weight=full_sample["design_weight"], 
    adjust_class=full_sample["region"], 
    resp_status=response_status2, 
    resp_dict=status_mapping2
    )

full_sample[["cluster", "region","design_weight", "response_status", "nr_weight", "nr_weight2"]].head(25)

| | cluster | region | design_weight | response_status | nr_weight | nr_weight2 |
|---|---|---|---|---|---|---|
| 0 | 7 | North | 46.166667 | ineligible | 46.166667 | 46.166667 |
| 1 | 7 | North | 46.166667 | respondent | 60.236508 | 60.236508 |
| 2 | 7 | North | 46.166667 | respondent | 60.236508 | 60.236508 |
| 3 | 7 | North | 46.166667 | respondent | 60.236508 | 60.236508 |
| 4 | 7 | North | 46.166667 | unknown | 0.0 | 0.0 |
| 5 | 7 | North | 46.166667 | respondent | 60.236508 | 60.236508 |
| 6 | 7 | North | 46.166667 | respondent | 60.236508 | 60.236508 |
| 7 | 7 | North | 46.166667 | ineligible | 46.166667 | 46.166667 |
| 8 | 7 | North | 46.166667 | respondent | 60.236508 | 60.236508 |
| 9 | 7 | North | 46.166667 | respondent | 60.236508 | 60.236508 |


If the response status variable only takes values "in", "nr", "rr" and "uk" then it is not necessary to provide the mapping dictionary to the function. 

In [18]:
response_status3 = np.repeat("in", full_sample["response_status"].shape[0])
response_status3[full_sample["response_status"]=="non-respondent"] = "nr"
response_status3[full_sample["response_status"]=="respondent"] = "rr"
response_status3[full_sample["response_status"]=="unknown"] = "uk"

full_sample["nr_weight3"] = SampleWeight().adjust(
    samp_weight=full_sample["design_weight"], 
    adjust_class=full_sample["region"], 
    resp_status=response_status3
    )

full_sample[["cluster", "region","design_weight", "response_status", "nr_weight", "nr_weight2", "nr_weight3"]].head(25)

| | cluster | region | design_weight | response_status | nr_weight | nr_weight2 | nr_weight3 |
|---|---|---|---|---|---|---|---|
| 0 | 7 | North | 46.166667 | ineligible | 46.166667 | 46.166667 | 46.166667 |
| 1 | 7 | North | 46.166667 | respondent | 60.236508 | 60.236508 | 60.236508 |
| 2 | 7 | North | 46.166667 | respondent | 60.236508 | 60.236508 | 60.236508 |
| 3 | 7 | North | 46.166667 | respondent | 60.236508 | 60.236508 | 60.236508 |
| 4 | 7 | North | 46.166667 | unknown | 0.0 | 0.0 | 0.0 |
| 5 | 7 | North | 46.166667 | respondent | 60.236508 | 60.236508 | 60.236508 |
| 6 | 7 | North | 46.166667 | respondent | 60.236508 | 60.236508 | 60.236508 |
| 7 | 7 | North | 46.166667 | ineligible | 46.166667 | 46.166667 | 46.166667 |
| 8 | 7 | North | 46.166667 | respondent | 60.236508 | 60.236508 | 60.236508 |
| 9 | 7 | North | 46.166667 | respondent | 60.236508 | 60.236508 | 60.236508 |
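
When the survey codes are already described by a dictionary such as *status_mapping2*, the recoding into standard codes can also be written as a lookup. The sketch below inverts the dictionary and applies it with the pandas *map()* method. The names *code_to_standard* and *raw_status*, and the five example values, are made up for illustration.

```python
import pandas as pd

# Standard code -> survey code, as in status_mapping2 above.
status_mapping2 = {"in": 100, "nr": 200, "rr": 300, "uk": 999}

# Invert the dictionary so that survey codes become the lookup keys.
code_to_standard = {code: std for std, code in status_mapping2.items()}

# Hypothetical response status values using the survey codes.
raw_status = pd.Series([100, 300, 300, 200, 999])

# Recode every value to its standard code.
standard_status = raw_status.map(code_to_standard)
print(standard_status.tolist())  # ['in', 'rr', 'rr', 'nr', 'uk']
```

Any code missing from the dictionary becomes *NaN* after *map()*, which makes unexpected values easy to spot before the weights are adjusted.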


## Post-Stratification <a name="section3"></a>

## Normalization <a name="section4"></a>

DHS and MICS normalize the final sample weights to sum to the sample size. We can use the class method *normalize()* to ensure that the sample weights sum to some constant across the sample or by normalization domain, e.g. stratum.

In [8]:
full_sample["norm_weight"] = SampleWeight().normalize(full_sample["nr_weight"])


full_sample[["cluster", "region", "nr_weight", "norm_weight"]].head(25)

print((full_sample.shape[0], full_sample["norm_weight"].sum()))

(150, 150.00000000000003)


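When *normalize()* is called with only the sample weights, as in the cell above, the weights are normalized to sum to the length of the sample weight vector. A different total can be requested with *control*, and *domain* normalizes the weights separately within each domain; both options are used in the cells that follow.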

In [9]:
full_sample["norm_weight2"] = SampleWeight().normalize(full_sample["nr_weight"], control=300)

print(full_sample["norm_weight2"].sum())

300.00000000000006
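
The arithmetic behind these two results can be sketched in a few lines of NumPy: each weight is multiplied by the target total and divided by the current sum of weights. The function *normalize_sketch* and the example weights are hypothetical; the sketch reproduces the behaviour seen above (weights summing to the number of records by default, or to a chosen total) and is not the samplics source code.

```python
import numpy as np


def normalize_sketch(weights, control=None):
    """Rescale weights so they sum to `control` (default: number of weights)."""
    weights = np.asarray(weights, dtype=float)
    if control is None:
        control = weights.size
    return weights * control / weights.sum()


# Hypothetical adjusted weights, including a zero weight.
w = [46.2, 60.2, 0.0, 60.2, 46.2]

print(normalize_sketch(w).sum())               # 5, up to floating-point rounding
print(normalize_sketch(w, control=300).sum())  # 300, up to floating-point rounding
```

Zero weights remain zero after rescaling, so units dropped by the non-response adjustment still count toward the default total but carry no weight.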


In [10]:
full_sample["norm_weight3"] = SampleWeight().normalize(full_sample["nr_weight"], domain=full_sample["region"])

weight_sum = full_sample.groupby(["region"]).sum()
weight_sum.reset_index(inplace=True)
weight_sum[["region", "nr_weight", "norm_weight", "norm_weight3"]]

| | region | nr_weight | norm_weight | norm_weight3 |
|---|---|---|---|---|
| 0 | East | 3698.703391 | 38.419768 | 45.0 |
| 1 | North | 1454.25 | 15.10582 | 30.0 |
| 2 | South | 2801.88962 | 29.104239 | 45.0 |
| 3 | West | 6485.783333 | 67.370173 | 30.0 |


In [11]:
norm_level = {"East": 10, "North": 20, "South": 30, "West": 50}

full_sample["norm_weight4"] = SampleWeight().normalize(full_sample["nr_weight"], norm_level, full_sample["region"])

weight_sum = full_sample.groupby(["region"]).sum()
weight_sum.reset_index(inplace=True)
weight_sum[["region", "nr_weight", "norm_weight", "norm_weight2", "norm_weight3", "norm_weight4",]]

| | region | nr_weight | norm_weight | norm_weight2 | norm_weight3 | norm_weight4 |
|---|---|---|---|---|---|---|
| 0 | East | 3698.703391 | 38.419768 | 76.839535 | 45.0 | 10.0 |
| 1 | North | 1454.25 | 15.10582 | 30.21164 | 30.0 | 20.0 |
| 2 | South | 2801.88962 | 29.104239 | 58.208478 | 45.0 | 30.0 |
| 3 | West | 6485.783333 | 67.370173 | 134.740347 | 30.0 | 50.0 |
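
The normalization by domain shown above can be sketched with a pandas group-wise transform. In this sketch, each weight is divided by the sum of weights in its domain and multiplied by the domain's target total, which is either the number of records in the domain or a value looked up in a dictionary such as *norm_level*. The toy data and the column names *norm_by_size* and *norm_by_level* are assumptions made for illustration, and the code is not taken from samplics.

```python
import pandas as pd

# Hypothetical weights for two domains (values are made up for illustration).
toy = pd.DataFrame({
    "region": ["East", "East", "North", "North", "North"],
    "weight": [30.0, 10.0, 5.0, 5.0, 10.0],
})

grouped = toy.groupby("region")["weight"]
domain_total = grouped.transform("sum")   # sum of weights in the row's domain
domain_size = grouped.transform("count")  # number of records in the row's domain

# Each domain sums to its own number of records.
toy["norm_by_size"] = toy["weight"] * domain_size / domain_total

# Each domain sums to a chosen total.
norm_level = {"East": 10, "North": 20}
toy["norm_by_level"] = toy["weight"] * toy["region"].map(norm_level) / domain_total

sums = toy.groupby("region")[["norm_by_size", "norm_by_level"]].sum()
print(sums)
```

Here East sums to 2 and North to 3 under the first rule, and to 10 and 20 under the second, which mirrors the pattern of *norm_weight3* and *norm_weight4* in the table above.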
