# Estimation of population parameters


The objective of this tutorial is to illustrate the use of the *samplics* estimation APIs. There are two main classes: *TaylorEstimator* and *ReplicateEstimator*. The former uses linearization methods to estimate the variance of population parameters, while the latter uses replicate-based methods (bootstrap, BRR/Fay, and jackknife).

In [3]:
from IPython.display import Image, display

import numpy as np
import pandas as pd

import samplics 
from samplics.datasets import load_nhanes2, load_nhanes2brr, load_nhanes2jk, load_nmihs
from samplics.estimation import TaylorEstimator, ReplicateEstimator

## Taylor approximation <a name="section1"></a>

In [4]:
# Load Nhanes sample data
nhanes2_dict = load_nhanes2()
nhanes2 = nhanes2_dict["data"]

nhanes2.head(15)

Unnamed: 0,stratid,psuid,race,highbp,highlead,zinc,diabetes,finalwgt
0,1,1,1,0,,104.0,0.0,8995
1,1,1,1,0,0.0,111.0,0.0,25964
2,1,1,3,0,,102.0,0.0,8752
3,1,1,1,1,,109.0,1.0,4310
4,1,1,1,0,0.0,99.0,0.0,9011
5,1,1,1,1,,101.0,0.0,4310
6,1,1,1,0,0.0,93.0,0.0,3201
7,1,1,1,1,,83.0,0.0,25386
8,1,1,1,0,,98.0,0.0,12102
9,1,1,2,0,0.0,98.0,0.0,4312


We calculate the survey mean of the zinc level using Stata and get the following:

[Image: Stata output for the survey mean of zinc]

Using *samplics*, the same estimate can be obtained with the code below.

In [3]:
zinc_mean_str = TaylorEstimator("mean")
zinc_mean_str.estimate(
    y=nhanes2["zinc"],
    samp_weight=nhanes2["finalwgt"],
    stratum=nhanes2["stratid"],
    psu=nhanes2["psuid"],
    remove_nan=True,
)

print(zinc_mean_str)

SAMPLICS - Estimation of Mean

Number of strata: 31
Number of psus: 62
Degree of freedom: 31

     MEAN       SE       LCI       UCI       CV
87.182067 0.494483 86.173563 88.190571 0.005672
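
To make the linearization idea concrete, here is a minimal pure-Python sketch of a weighted mean and its linearized standard error under a stratified design with PSUs treated as sampled with replacement. It follows the usual textbook formula and is not taken from the *samplics* source; the function name `linearized_mean_se` and the toy data are hypothetical, and the sketch assumes every stratum has at least two PSUs.

```python
from collections import defaultdict
from math import sqrt


def linearized_mean_se(y, w, stratum, psu):
    """Weighted mean and its linearized standard error (with-replacement PSUs)."""
    total_w = sum(w)
    mean = sum(wi * yi for wi, yi in zip(w, y)) / total_w

    # Linearized score of each unit, summed within each (stratum, PSU) cell.
    psu_totals = defaultdict(lambda: defaultdict(float))
    for yi, wi, h, j in zip(y, w, stratum, psu):
        psu_totals[h][j] += wi * (yi - mean) / total_w

    # Between-PSU variance within each stratum, summed over strata.
    variance = 0.0
    for totals in psu_totals.values():
        z = list(totals.values())
        n_h = len(z)  # assumed to be >= 2
        z_bar = sum(z) / n_h
        variance += n_h / (n_h - 1) * sum((zj - z_bar) ** 2 for zj in z)
    return mean, sqrt(variance)


# Toy data (hypothetical): 2 strata with 2 PSUs each.
y = [10, 12, 14, 16, 20, 22, 24, 26]
w = [2, 1, 1, 2, 3, 1, 1, 3]
stratum = [1, 1, 1, 1, 2, 2, 2, 2]
psu = [1, 1, 2, 2, 1, 1, 2, 2]

mean, se = linearized_mean_se(y, w, stratum, psu)
print(f"mean={mean:.4f}, se={se:.4f}")
```

Grouping the scores by PSU before taking the variance is what lets the clustering of the design enter the standard error.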



If we remove the stratum parameter, we get the following with Stata:

[Image: Stata output for the survey mean of zinc without stratification]

With *samplics*, we get:

In [4]:
zinc_mean_nostr = TaylorEstimator("mean")
zinc_mean_nostr.estimate(
    y=nhanes2["zinc"], samp_weight=nhanes2["finalwgt"], psu=nhanes2["psuid"], remove_nan=True
)

print(zinc_mean_nostr)

SAMPLICS - Estimation of Mean

Number of strata: 1
Number of psus: 2
Degree of freedom: 1

     MEAN       SE       LCI       UCI       CV
87.182067 0.742622 77.746158 96.617976 0.008518
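
The two outputs differ much more in their confidence intervals than in their standard errors. The sketch below reconstructs LCI, UCI, and CV from the printed MEAN and SE, assuming a 95% t-based interval with the reported degrees of freedom. The critical values are hardcoded from a standard t-table (an assumption of this sketch, not something read from *samplics*), and the helper name `summarize` is hypothetical.

```python
# Values copied from the two printed outputs above.
mean = 87.182067
designs = {
    "stratified": {"se": 0.494483, "t_crit": 2.039513},     # 31 degrees of freedom
    "unstratified": {"se": 0.742622, "t_crit": 12.706205},  # 1 degree of freedom
}


def summarize(mean, se, t_crit):
    """Return (LCI, UCI, CV) for a t-based interval."""
    half_width = t_crit * se
    return mean - half_width, mean + half_width, se / mean


results = {name: summarize(mean, **d) for name, d in designs.items()}
for name, (lci, uci, cv) in results.items():
    print(f"{name}: LCI={lci:.4f}, UCI={uci:.4f}, CV={cv:.6f}")
```

Under these assumptions the printed intervals are reproduced up to rounding. Without the stratum variable, only two distinct PSU identifiers remain, which leaves a single degree of freedom and a very large critical value.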



The other parameters currently implemented in *TaylorEstimator* are TOTAL, PROPORTION, and RATIO. TOTAL and PROPORTION use the same function call as MEAN. For RATIO, the denominator variable must also be provided through the parameter *x*.

In [5]:
ratio_bp_lead = TaylorEstimator("ratio")
ratio_bp_lead.estimate(
    y=nhanes2["highbp"],
    samp_weight=nhanes2["finalwgt"],
    x=nhanes2["highlead"],
    stratum=nhanes2["stratid"],
    psu=nhanes2["psuid"],
    remove_nan=True,
)

print(ratio_bp_lead)

SAMPLICS - Estimation of Ratio

Number of strata: 31
Number of psus: 62
Degree of freedom: 31

  RATIO       SE     LCI      UCI       CV
5.93255 0.553058 4.80458 7.060519 0.093224
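
For reference, the ratio estimate and the linearized score used for its variance can be written as follows. This is the standard textbook form, not a formula quoted from the *samplics* source; $w_i$ stands for *samp_weight*, and $\hat{R}$ and $z_i$ are notation introduced here.

$$
\hat{R} = \frac{\sum_i w_i y_i}{\sum_i w_i x_i}, \qquad z_i = \frac{w_i \,(y_i - \hat{R}\, x_i)}{\sum_k w_k x_k}
$$

The variance of $\hat{R}$ is then approximated by the between-PSU variance of the totals of $z_i$ within each stratum, summed over strata. With $x_i = 1$ for every unit, $\hat{R}$ reduces to the weighted mean.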


## Replicate-based variance estimation <a name="section2"></a>

#### Bootstrap  <a name="section21"></a>

In [6]:
# Load NMIHS sample data
nmihs_dict = load_nmihs()
nmihs = nmihs_dict["data"]

nmihs.head(15)

Unnamed: 0,finalwgt,birth_weight,bsrw1,bsrw2,bsrw3,bsrw4,bsrw5,bsrw6,bsrw7,bsrw8,...,bsrw41,bsrw42,bsrw43,bsrw44,bsrw45,bsrw46,bsrw47,bsrw48,bsrw49,bsrw50
0,24.67243,1270,49.403603,0.0,49.403603,0.0,24.701801,49.403603,24.701801,0.0,...,0.0,0.0,0.0,0.0,49.403603,0.0,74.105408,49.403603,24.701801,49.403603
1,23.56827,879,23.596327,47.192654,0.0,0.0,47.192654,23.596327,47.192654,23.596327,...,47.192654,23.596327,0.0,47.192654,23.596327,47.192654,23.596327,47.192654,23.596327,23.596327
2,24.67243,794,24.701801,0.0,24.701801,0.0,0.0,24.701801,0.0,0.0,...,24.701801,0.0,0.0,24.701801,24.701801,24.701801,0.0,49.403603,24.701801,98.807205
3,20.33146,1446,40.711327,0.0,0.0,20.355663,40.711327,20.355663,0.0,20.355663,...,0.0,20.355663,61.066994,61.066994,40.711327,40.711327,20.355663,20.355663,81.422653,40.711327
4,21.83328,830,21.859272,21.859272,0.0,0.0,0.0,21.859272,0.0,21.859272,...,65.577812,0.0,21.859272,21.859272,21.859272,0.0,21.859272,0.0,0.0,0.0
5,23.56827,1304,70.788986,23.596327,23.596327,23.596327,0.0,23.596327,47.192654,0.0,...,47.192654,47.192654,23.596327,23.596327,23.596327,23.596327,0.0,23.596327,23.596327,0.0
6,18.67915,1106,18.701387,56.10416,0.0,0.0,18.701387,18.701387,18.701387,0.0,...,18.701387,0.0,18.701387,0.0,18.701387,18.701387,18.701387,18.701387,18.701387,37.402775
7,24.6337,1418,24.663025,49.32605,0.0,24.663025,0.0,24.663025,0.0,24.663025,...,24.663025,0.0,0.0,0.0,49.32605,49.32605,24.663025,0.0,24.663025,0.0
8,20.33146,1474,0.0,40.711327,40.711327,0.0,20.355663,0.0,0.0,40.711327,...,40.711327,0.0,20.355663,20.355663,20.355663,20.355663,0.0,20.355663,20.355663,20.355663
9,20.33146,454,0.0,20.355663,20.355663,20.355663,61.066994,0.0,0.0,40.711327,...,0.0,61.066994,0.0,20.355663,0.0,20.355663,20.355663,20.355663,81.422653,0.0



Let's estimate the average birth weight using the bootstrap weights. 

In [7]:
# The bootstrap replicate weights are stored in columns bsrw1 to bsrw50

birthwgt = ReplicateEstimator("bootstrap", "mean").estimate(
    y=nmihs["birth_weight"],
    samp_weight=nmihs["finalwgt"],
    rep_weights=nmihs.loc[:, "bsrw1":"bsrw50"],
    remove_nan=True,
)

print(birthwgt)

SAMPLICS - Estimation of Mean

Number of strata: None
Number of psus: None
Degree of freedom: 49

       MEAN        SE         LCI         UCI       CV
2679.127143 31.053792 2616.722212 2741.532074 0.011591
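
As a rough illustration of what a replicate-based standard error does, the sketch below recomputes the weighted mean once per replicate weight column and averages the squared deviations from the full-sample estimate. This is a common form of the bootstrap variance and is only assumed to resemble what *ReplicateEstimator* does internally; some implementations center on the mean of the replicate estimates instead. The function names and toy data are hypothetical.

```python
from math import sqrt


def weighted_mean(y, w):
    return sum(wi * yi for wi, yi in zip(w, y)) / sum(w)


def bootstrap_se(y, full_weights, rep_weights):
    """rep_weights: list of replicate weight columns, each as long as y."""
    theta = weighted_mean(y, full_weights)
    replicates = [weighted_mean(y, rw) for rw in rep_weights]
    n_rep = len(replicates)
    # Average squared deviation of the replicate estimates from theta.
    variance = sum((t - theta) ** 2 for t in replicates) / n_rep
    return theta, sqrt(variance)


# Toy data (hypothetical): 3 units and 4 replicate weight columns.
y = [1.0, 2.0, 3.0]
full_weights = [1.0, 1.0, 1.0]
rep_weights = [
    [2.0, 1.0, 0.0],
    [0.0, 1.0, 2.0],
    [1.0, 1.0, 1.0],
    [1.0, 2.0, 0.0],
]

theta, se = bootstrap_se(y, full_weights, rep_weights)
print(f"mean={theta:.4f}, se={se:.4f}")
```

Note that the variance here depends only on the replicate weights, which is why no stratum or PSU variable is passed in the call above.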


#### Balanced repeated replication (BRR)  <a name="section22"></a>

In [8]:
# Load NHANES II sample data with BRR weights
nhanes2brr_dict = load_nhanes2brr()
nhanes2brr = nhanes2brr_dict["data"]

nhanes2brr.head(15)

Unnamed: 0,height,weight,finalwgt,brr_1,brr_2,brr_3,brr_4,brr_5,brr_6,brr_7,...,brr_23,brr_24,brr_25,brr_26,brr_27,brr_28,brr_29,brr_30,brr_31,brr_32
0,174.59801,62.48,8995,0,17990,17990,0,17990,0,0,...,17990,0,0,17990,17990,0,17990,0,0,17990
1,152.297,48.759998,25964,0,51928,51928,0,51928,0,0,...,51928,0,0,51928,51928,0,51928,0,0,51928
2,164.09801,67.25,8752,0,17504,17504,0,17504,0,0,...,17504,0,0,17504,17504,0,17504,0,0,17504
3,162.59801,94.459999,4310,0,8620,8620,0,8620,0,0,...,8620,0,0,8620,8620,0,8620,0,0,8620
4,163.09801,74.279999,9011,0,18022,18022,0,18022,0,0,...,18022,0,0,18022,18022,0,18022,0,0,18022
5,147.09801,66.0,4310,0,8620,8620,0,8620,0,0,...,8620,0,0,8620,8620,0,8620,0,0,8620
6,153.89799,54.549999,3201,0,6402,6402,0,6402,0,0,...,6402,0,0,6402,6402,0,6402,0,0,6402
7,160.0,58.970001,25386,0,50772,50772,0,50772,0,0,...,50772,0,0,50772,50772,0,50772,0,0,50772
8,164.0,68.949997,12102,0,24204,24204,0,24204,0,0,...,24204,0,0,24204,24204,0,24204,0,0,24204
9,176.59801,65.43,4312,0,8624,8624,0,8624,0,0,...,8624,0,0,8624,8624,0,8624,0,0,8624


Let's estimate the ratio of weight to height using the BRR weights.

In [9]:
brr = ReplicateEstimator("brr", "ratio")

ratio_wgt_hgt = brr.estimate(
    y=nhanes2brr["weight"],
    samp_weight=nhanes2brr["finalwgt"],
    x=nhanes2brr["height"],
    rep_weights=nhanes2brr.loc[:, "brr_1":"brr_32"],
    remove_nan=True,
)

print(ratio_wgt_hgt)

SAMPLICS - Estimation of Ratio

Number of strata: None
Number of psus: None
Degree of freedom: 16

   RATIO      SE      LCI     UCI       CV
0.426082 0.00273 0.420295 0.43187 0.006407
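
In the listing above, every BRR weight is either 0 or twice *finalwgt*, which is the pattern of standard BRR (a Fay coefficient of 0). The sketch below shows that weight adjustment and the matching variance formula in their usual textbook form. The function names and the `fay_coef` argument are hypothetical and are not part of the *samplics* API shown in this tutorial.

```python
def brr_replicate_weight(weight, in_half_sample, fay_coef=0.0):
    """Replicate weight of a unit for one half-sample (Fay's method if fay_coef > 0)."""
    if in_half_sample:
        return weight * (2.0 - fay_coef)
    return weight * fay_coef


def brr_variance(theta, replicates, fay_coef=0.0):
    """Average squared deviation of the replicate estimates, rescaled for Fay's method."""
    n_rep = len(replicates)
    return sum((t - theta) ** 2 for t in replicates) / (n_rep * (1.0 - fay_coef) ** 2)


# First unit of the listing: finalwgt = 8995.
print(brr_replicate_weight(8995, in_half_sample=True))   # unit kept in the half-sample
print(brr_replicate_weight(8995, in_half_sample=False))  # unit dropped from the half-sample
```

With a Fay coefficient strictly between 0 and 1, no unit is dropped entirely, which can help when some domains are small.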


#### Jackknife  <a name="section23"></a>

In [10]:
# Load NHANES II sample data with jackknife weights
nhanes2jk_dict = load_nhanes2jk()
nhanes2jk = nhanes2jk_dict["data"]

nhanes2jk.head(15)

Unnamed: 0,height,weight,finalwgt,jkw_1,jkw_2,jkw_3,jkw_4,jkw_5,jkw_6,jkw_7,...,jkw_53,jkw_54,jkw_55,jkw_56,jkw_57,jkw_58,jkw_59,jkw_60,jkw_61,jkw_62
0,174.59801,62.48,8995,0,17990,8995,8995,8995,8995,8995,...,8995,8995,8995,8995,8995,8995,8995,8995,8995,8995
1,152.297,48.759998,25964,0,51928,25964,25964,25964,25964,25964,...,25964,25964,25964,25964,25964,25964,25964,25964,25964,25964
2,164.09801,67.25,8752,0,17504,8752,8752,8752,8752,8752,...,8752,8752,8752,8752,8752,8752,8752,8752,8752,8752
3,162.59801,94.459999,4310,0,8620,4310,4310,4310,4310,4310,...,4310,4310,4310,4310,4310,4310,4310,4310,4310,4310
4,163.09801,74.279999,9011,0,18022,9011,9011,9011,9011,9011,...,9011,9011,9011,9011,9011,9011,9011,9011,9011,9011
5,147.09801,66.0,4310,0,8620,4310,4310,4310,4310,4310,...,4310,4310,4310,4310,4310,4310,4310,4310,4310,4310
6,153.89799,54.549999,3201,0,6402,3201,3201,3201,3201,3201,...,3201,3201,3201,3201,3201,3201,3201,3201,3201,3201
7,160.0,58.970001,25386,0,50772,25386,25386,25386,25386,25386,...,25386,25386,25386,25386,25386,25386,25386,25386,25386,25386
8,164.0,68.949997,12102,0,24204,12102,12102,12102,12102,12102,...,12102,12102,12102,12102,12102,12102,12102,12102,12102,12102
9,176.59801,65.43,4312,0,8624,4312,4312,4312,4312,4312,...,4312,4312,4312,4312,4312,4312,4312,4312,4312,4312


In this case, stratification was used to calculate the jackknife weights. The stratum variable is not indicated in the dataset or the survey design description. However, the description states that the number of strata is 31 and the number of replicates is 62, which implies $n_h = 2$ PSUs per stratum. Hence, the jackknife replicate coefficient is $(n_h - 1) / n_h = (2-1) / 2 = 0.5$. Now we can call *estimate()* and specify *rep_coefs=0.5*.

In [11]:
jackknife = ReplicateEstimator("jackknife", "ratio")

ratio_wgt_hgt2 = jackknife.estimate(
    y=nhanes2jk["weight"],
    samp_weight=nhanes2jk["finalwgt"],
    x=nhanes2jk["height"],
    rep_weights=nhanes2jk.loc[:, "jkw_1":"jkw_62"],
    rep_coefs=0.5,
    remove_nan=True,
)

print(ratio_wgt_hgt2)

SAMPLICS - Estimation of Ratio

Number of strata: None
Number of psus: None
Degree of freedom: 61

   RATIO       SE      LCI      UCI      CV
0.423502 0.003464 0.416574 0.430429 0.00818
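
The jackknife weights listed earlier follow the delete-one-PSU pattern: a weight of 0 in the replicate that drops the unit's own PSU, a doubled weight in the replicate that drops the other PSU of the same stratum, and the unchanged weight elsewhere. The sketch below reproduces that pattern and the variance formula with per-replicate coefficients. It is a textbook form, the function names are hypothetical, and it assumes the first listed rows belong to stratum 1, PSU 1.

```python
def jackknife_weight(weight, unit_stratum, unit_psu, drop_stratum, drop_psu, n_psus):
    """Weight of one unit in the replicate that deletes (drop_stratum, drop_psu)."""
    if unit_stratum != drop_stratum:
        return float(weight)               # other strata are untouched
    if unit_psu == drop_psu:
        return 0.0                         # the deleted PSU
    return weight * n_psus / (n_psus - 1)  # remaining PSUs are scaled up


def jackknife_variance(theta, replicates, rep_coefs):
    """Sum of squared deviations, each scaled by its replicate coefficient."""
    return sum(c * (t - theta) ** 2 for c, t in zip(rep_coefs, replicates))


# First unit of the listing: finalwgt = 8995, assumed to be in stratum 1, PSU 1.
row = [
    jackknife_weight(8995, 1, 1, drop_stratum=1, drop_psu=1, n_psus=2),
    jackknife_weight(8995, 1, 1, drop_stratum=1, drop_psu=2, n_psus=2),
    jackknife_weight(8995, 1, 1, drop_stratum=2, drop_psu=1, n_psus=2),
]
print(row)
```

With two PSUs per stratum, every coefficient equals $(2-1)/2 = 0.5$, which is the value passed as *rep_coefs* above.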
