## Small Area Estimation (SAE): Area Level Model

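Small area estimation (SAE) deals with estimating a population parameter for subpopulations (areas) whose samples are too small for the direct survey estimates to be reliable. The area level model works with area-level summaries: the direct estimate for each area, its standard error, and area-level auxiliary variables.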

In [1]:
import numpy as np
import pandas as pd

import samplics 
from samplics.sae import EblupAreaModel

### Milk Expenditure data

To illustrate the EblupAreaModel class, we will use the Milk Expenditure dataset used in Rao and Molina (2015). As mentioned in the book, this dataset was originally used by Arora and Lahiri (1997) and later by You and Chapman (2006). For R users, this dataset is also used by the R package sae (https://cran.r-project.org/web/packages/sae/index.html).

The Milk Expenditure data contains 43 observations on the average expenditure on fresh milk for the year 1989. The dataset has the following variables: major area (major_area), small area (small_area), sample size (samp_size), direct survey estimate of average expenditure (direct_est), standard error of the direct estimate (std_error), and coefficient of variation of the direct estimate (coef_variance).

In [2]:
milk_exp = pd.read_csv("../../../datasets/docs/expenditure_on_milk.csv")

nb_obs = 15
print(f"\nLast {nb_obs} observations of the Milk Expenditure dataset\n")
milk_exp.tail(nb_obs)


Last 15 observations of the Milk Expenditure dataset



Unnamed: 0,major_area,small_area,samp_size,direct_est,std_error,coef_variance
28,4,29,238,0.796,0.106,0.133
29,4,30,207,0.565,0.089,0.158
30,4,31,165,0.886,0.225,0.254
31,4,32,153,0.952,0.205,0.215
32,4,33,210,0.807,0.119,0.147
33,4,34,383,0.582,0.067,0.115
34,4,35,255,0.684,0.106,0.155
35,4,36,226,0.787,0.126,0.16
36,4,37,224,0.44,0.092,0.209
37,4,38,212,0.759,0.132,0.174
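
As a quick sanity check on how the columns relate, the coefficient of variation can be recomputed as the standard error divided by the direct estimate. The sketch below hard-codes three of the rows displayed above (small areas 29 to 31) rather than reading the CSV file; the column name `cv_recomputed` is introduced here for illustration only.

```python
import pandas as pd

# Three rows copied from the displayed output (small areas 29, 30, 31).
rows = pd.DataFrame(
    {
        "small_area": [29, 30, 31],
        "direct_est": [0.796, 0.565, 0.886],
        "std_error": [0.106, 0.089, 0.225],
        "coef_variance": [0.133, 0.158, 0.254],
    }
)

# Coefficient of variation = standard error / estimate, rounded like the data.
rows["cv_recomputed"] = (rows["std_error"] / rows["direct_est"]).round(3)
print(rows)
```

For these rows, the recomputed values agree with the `coef_variance` column to three decimals.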


### Empirical best linear unbiased predictor (EBLUP)

As the milk expenditure dataset shows, some of the coefficients of variation are not small, which indicates instability of the direct survey estimates. Hence, we can try to reduce the variability of the estimates by smoothing them through modeling. For illustration purposes, we will model the average expenditure on milk using the major areas as auxiliary variables.
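
For reference, the area level (Fay-Herriot) model is usually written in the following textbook form. The notation is an assumption made here for illustration and is not taken from the samplics documentation: $\hat{\theta}_d$ is the direct estimate for area $d$, $x_d$ the vector of auxiliary variables, $u_d$ the area random effect, and $e_d$ the sampling error with variance $\psi_d$ treated as known (here, the squared standard error of the direct estimate).

$$
\hat{\theta}_d = x_d^{\top}\beta + u_d + e_d, \qquad u_d \sim N(0, \sigma_u^2), \qquad e_d \sim N(0, \psi_d), \qquad d = 1, \dots, m
$$

Fitting the model amounts to estimating the fixed effects $\beta$ and the random-effect variance $\sigma_u^2$.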

First, we use the method *fit()* to estimate the model parameters. The pandas function *get_dummies()* creates a matrix of dummy values (0 and 1) from the categorical variable *major_area*.
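
To make the dummy coding concrete, here is a small sketch on a made-up series (the values in `major_area_toy` are illustrative, not the real data). With `drop_first=True`, the first category is dropped and acts as the reference level, so four major areas yield three dummy columns.

```python
import pandas as pd

# Toy categorical variable with four levels (illustrative values only).
major_area_toy = pd.Series([1, 1, 2, 3, 4, 4], name="major_area")

# One column per level except the first, which becomes the reference level.
X_toy = pd.get_dummies(major_area_toy, drop_first=True)

# Recent pandas versions return booleans; cast to int to display 0/1.
print(X_toy.astype(int))
```

Rows belonging to major area 1 have zeros in every column, which is why an intercept is needed to represent that level.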

In [3]:
area = milk_exp["small_area"]
yhat = milk_exp["direct_est"]
X = pd.get_dummies(milk_exp["major_area"], drop_first=True)
sigma_e = milk_exp["std_error"]

## REML method
fh_model_reml = EblupAreaModel(method="REML")
fh_model_reml.fit(
    yhat=yhat, X=X, area=area, error_std=sigma_e, intercept=True, tol=1e-8,
)

print(f"\nThe estimated fixed effects are: {fh_model_reml.fixed_effects}")
print(f"\nThe estimated standard error of the area random effects is: {fh_model_reml.re_std}")
print(f"\nThe convergence statistics are: {fh_model_reml.convergence}")
print(f"\nThe goodness of fit statistics are: {fh_model_reml.goodness}\n")


The estimated fixed effects are: [ 0.96818899  0.13278031  0.22694622 -0.24130104]

The estimated standard error of the area random effects is: 0.136199615091212

The convergence statistics are: {'achieved': True, 'iterations': 7, 'precision': 4.8490397079603564e-09}

The goodness of fit statistics are: {'loglike': -9.403474415513806, 'AIC': 30.80694883102761, 'BIC': 41.37414952518898}



Now that the model has been fitted, we can obtain the EBLUP average expenditure on milk by running *predict()*, which is a method of the *EblupAreaModel* class. This run produces two main attributes, *area_est* and *area_mse*, which are Python dictionaries pairing the small areas with the EBLUP estimates and the MSE estimates, respectively.

In [4]:
fh_model_reml.predict(
    X=X, area=area, intercept=True
)

import pprint
pprint.pprint(fh_model_reml.area_est)

{1: 1.0219705448470267,
 2: 1.0476019518832937,
 3: 1.0679514268850938,
 4: 0.7608165634164006,
 5: 0.8461570426977274,
 6: 0.9743727062092652,
 7: 1.0584526732855357,
 8: 1.0977762564423168,
 9: 1.2215454913423593,
 10: 1.1951460164712615,
 11: 0.7852149170863973,
 12: 1.2139462074222371,
 13: 1.2096597223605203,
 14: 0.9834964402356507,
 15: 1.186424709535009,
 16: 1.1556981135233584,
 17: 1.22634125101869,
 18: 1.2856489898727417,
 19: 1.2363248413266228,
 20: 1.2349601399238859,
 21: 1.090301626523384,
 22: 1.1923057228469687,
 23: 1.1216467660137082,
 24: 1.2230297222963116,
 25: 1.1938054444127775,
 26: 0.7627195900552479,
 27: 0.7649551536523862,
 28: 0.7338443883489107,
 29: 0.7699295545743627,
 30: 0.6134416227081902,
 31: 0.7695560730689732,
 32: 0.795825312822418,
 33: 0.7723188482183636,
 34: 0.6102300678743078,
 35: 0.7001781895145358,
 36: 0.7592788108093533,
 37: 0.52988633522673,
 38: 0.7434466782997076,
 39: 0.7548996333852704,
 40: 0.7701919661644319,
 41: 0.748116424
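
To see where these numbers come from, the sketch below recomputes the EBLUP for small areas 29 and 30 by hand, using the standard shrinkage form of the area level model: a weighted average of the direct estimate and the regression prediction, with weight `gamma = sigma_u**2 / (sigma_u**2 + std_error**2)`. The fixed effects and random-effect standard deviation are copied from the REML output above. The coefficient ordering (intercept, then dummies for major areas 2, 3 and 4) is an assumption, and this is not a description of how samplics computes the values internally.

```python
import numpy as np

# Values copied from the REML fit output above.
beta = np.array([0.96818899, 0.13278031, 0.22694622, -0.24130104])
sigma_u = 0.136199615091212

# Small areas 29 and 30 (both in major area 4), from the displayed data rows.
direct_est = np.array([0.796, 0.565])
std_error = np.array([0.106, 0.089])

# Assumed design rows: intercept, then dummies for major areas 2, 3 and 4.
X_rows = np.array([[1, 0, 0, 1], [1, 0, 0, 1]])

# Regression (synthetic) prediction for each area.
synthetic = X_rows @ beta

# Shrinkage weight: closer to 1 when the direct estimate is precise.
gamma = sigma_u**2 / (sigma_u**2 + std_error**2)

# EBLUP as a weighted average of direct and synthetic estimates.
eblup = gamma * direct_est + (1 - gamma) * synthetic
print(dict(zip([29, 30], eblup)))
```

Under these assumptions, the results agree with the *area_est* entries for areas 29 and 30 to about four decimals. Area 30 has the smaller standard error, so its weight `gamma` is larger and its EBLUP stays closer to the direct estimate.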

We can use the utility method *to_dataframe()* to output the estimates as a pandas dataframe. The method returns the area, the estimate, and its MSE estimate. We can use *col_names* to customize the column names, for example `col_names = ["small_area", "eblup_estimate", "eblup_mse"]`. If *col_names* is not provided, "_area", "_estimates" and "_mse" are used as defaults.

In [5]:
milk_est_reml = fh_model_reml.to_dataframe(col_names = ["small_area", "eblup_estimate", "eblup_mse"])
print(f"\nThe dataframe version of the area level estimates:\n\n {milk_est_reml}")


The dataframe version of the area level estimates:

     small_area  eblup_estimate  eblup_mse
0            1        1.021971   0.013460
1            2        1.047602   0.005373
2            3        1.067951   0.005702
3            4        0.760817   0.008542
4            5        0.846157   0.009580
5            6        0.974373   0.011671
6            7        1.058453   0.015926
7            8        1.097776   0.010587
8            9        1.221545   0.014184
9           10        1.195146   0.014902
10          11        0.785215   0.007694
11          12        1.213946   0.016337
12          13        1.209660   0.012563
13          14        0.983496   0.012117
14          15        1.186425   0.012031
15          16        1.155698   0.011709
16          17        1.226341   0.010860
17          18        1.285649   0.013691
18          19        1.236325   0.011035
19          20        1.234960   0.013080
20          21        1.090302   0.009949
21          22        
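
Because the MSE is on the squared scale, it can help to convert it into a coefficient of variation, which is directly comparable to the *coef_variance* column of the direct estimates. The sketch below does this for the first three rows printed above, hard-coded so the snippet runs on its own; the names `milk_est_toy` and `eblup_cv` are hypothetical and not part of samplics.

```python
import numpy as np
import pandas as pd

# First three rows of the REML output above, copied by hand.
milk_est_toy = pd.DataFrame(
    {
        "small_area": [1, 2, 3],
        "eblup_estimate": [1.021971, 1.047602, 1.067951],
        "eblup_mse": [0.013460, 0.005373, 0.005702],
    }
)

# CV of the EBLUP: root MSE divided by the estimate.
milk_est_toy["eblup_cv"] = (
    np.sqrt(milk_est_toy["eblup_mse"]) / milk_est_toy["eblup_estimate"]
)
print(milk_est_toy)
```

The same two lines can be applied to the full dataframe returned by *to_dataframe()*.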

We could also fit the model parameters using the maximum likelihood (ML) method, which will affect the MSE estimation as well. To estimate the area means using the ML method, we only need to set *method="ML"* and then run the prediction as follows.

In [6]:
## ML method
fh_model_ml = EblupAreaModel(method="ML")
fh_model_ml.fit(
    yhat=yhat, X=X, area=area, error_std=sigma_e, intercept=True, tol=1e-8,
)

fh_model_ml.predict(
    X=X, area=area, intercept=True
)

milk_est_ml = fh_model_ml.to_dataframe(col_names = ["small_area", "eblup_estimate", "eblup_mse"])


print(f"\nThe dataframe version of the ML area level estimates:\n\n {milk_est_ml}")


The dataframe version of the ML area level estimates:

     small_area  eblup_estimate  eblup_mse
0            1        1.016173   0.013580
1            2        1.043697   0.005513
2            3        1.062817   0.005851
3            4        0.775349   0.008735
4            5        0.855490   0.009775
5            6        0.973586   0.011841
6            7        1.047478   0.015934
7            8        1.095344   0.010822
8            9        1.205409   0.014346
9           10        1.181256   0.015036
10          11        0.803370   0.007911
11          12        1.196775   0.016405
12          13        1.196159   0.012771
13          14        0.991405   0.012335
14          15        1.186883   0.012192
15          16        1.159036   0.011877
16          17        1.223237   0.011041
17          18        1.275519   0.013805
18          19        1.232285   0.011214
19          20        1.230442   0.013214
20          21        1.098577   0.010138
21          22     

Similarly, we can use the Fay-Herriot method as follows.

In [7]:
## FH method
fh_model_fh = EblupAreaModel(method="FH")
fh_model_fh.fit(
    yhat=yhat, X=X, area=area, error_std=sigma_e, intercept=True, tol=1e-8,
)

fh_model_fh.predict(
    X=X, area=area, intercept=True
)

milk_est_fh = fh_model_fh.to_dataframe(col_names = ["small_area", "eblup_estimate", "eblup_mse"])


print(f"\nThe dataframe version of the FH area level estimates:\n\n {milk_est_fh}")


The dataframe version of the FH area level estimates:

     small_area  eblup_estimate  eblup_mse
0            1        1.017976   0.012757
1            2        1.044964   0.005314
2            3        1.064481   0.005632
3            4        0.770692   0.008323
4            5        0.852512   0.009284
5            6        0.973826   0.011178
6            7        1.050857   0.014868
7            8        1.096165   0.010253
8            9        1.210505   0.013471
9           10        1.185640   0.014095
10          11        0.797569   0.007558
11          12        1.202150   0.015325
12          13        1.200459   0.012039
13          14        0.988971   0.011640
14          15        1.186745   0.011467
15          16        1.157992   0.011182
16          17        1.224223   0.010424
17          18        1.278680   0.012914
18          19        1.233566   0.010581
19          20        1.231860   0.012386
20          21        1.095955   0.009600
21          22     
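
To compare the three fitting methods, the per-method dataframes can be placed side by side. The sketch below hard-codes the first three rows of each output above so that it runs without samplics; the helper `make_frame` and the suffixed column names are introduced here for illustration only.

```python
import pandas as pd


def make_frame(estimates, mses):
    """Build a small frame shaped like the to_dataframe() output above."""
    return pd.DataFrame(
        {"small_area": [1, 2, 3], "eblup_estimate": estimates, "eblup_mse": mses}
    )


# First three rows of each printed table, copied by hand.
frames = {
    "reml": make_frame([1.021971, 1.047602, 1.067951], [0.013460, 0.005373, 0.005702]),
    "ml": make_frame([1.016173, 1.043697, 1.062817], [0.013580, 0.005513, 0.005851]),
    "fh": make_frame([1.017976, 1.044964, 1.064481], [0.012757, 0.005314, 0.005632]),
}

# Index by area, tag each column with its method, then join column-wise.
comparison = pd.concat(
    [df.set_index("small_area").add_suffix(f"_{name}") for name, df in frames.items()],
    axis=1,
)
print(comparison)

# Method with the smallest MSE estimate in each area.
mse_cols = [f"eblup_mse_{name}" for name in frames]
print(comparison[mse_cols].idxmin(axis=1))
```

With the real results, the same pattern applies to `milk_est_reml`, `milk_est_ml` and `milk_est_fh`. For the three areas shown here the point estimates differ only in the second or third decimal, while the MSE estimates differ somewhat more across methods.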