## Small Area Estimation (SAE): Area Level Model

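Small area estimation concerns estimating population parameters, such as means or totals, for subpopulations (areas or domains) where the sample is too small for direct survey estimates to be reliable. An area level model relates the direct estimates to area-level auxiliary variables so that each area borrows strength from the others.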

In [111]:
import numpy as np
import pandas as pd

import samplics 
from samplics.sae import EblupAreaModel

### Milk Expenditure data

To illustrate the EblupAreaModel class, we will use the Milk Expenditure dataset used in Rao and Molina (2015). As mentioned in the book, this dataset was originally used by Arora and Lahiri (1997) and later by You and Chapman (2006). For R users, this dataset is also used by the R package sae (https://cran.r-project.org/web/packages/sae/index.html).

The Milk Expenditure data contains 43 observations on the average expenditure on fresh milk for the year 1989. The dataset has the following variables: major area (major_area), small area (small_area), sample size (samp_size), direct survey estimate of average expenditure (direct_est), standard error of the direct estimate (std_error), and coefficient of variation of the direct estimate (coef_variance).

In [112]:
milk_exp = pd.read_csv("../../../datasets/docs/expenditure_on_milk.csv")

nb_obs = 15
print(f"\nLast {nb_obs} observations of the Milk Expenditure dataset\n")
milk_exp.tail(nb_obs)


Last 15 observations of the Milk Expenditure dataset



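| Unnamed: 0 | major_area | small_area | samp_size | direct_est | std_error | coef_variance |
|---|---|---|---|---|---|---|
| 28 | 4 | 29 | 238 | 0.796 | 0.106 | 0.133 |
| 29 | 4 | 30 | 207 | 0.565 | 0.089 | 0.158 |
| 30 | 4 | 31 | 165 | 0.886 | 0.225 | 0.254 |
| 31 | 4 | 32 | 153 | 0.952 | 0.205 | 0.215 |
| 32 | 4 | 33 | 210 | 0.807 | 0.119 | 0.147 |
| 33 | 4 | 34 | 383 | 0.582 | 0.067 | 0.115 |
| 34 | 4 | 35 | 255 | 0.684 | 0.106 | 0.155 |
| 35 | 4 | 36 | 226 | 0.787 | 0.126 | 0.16 |
| 36 | 4 | 37 | 224 | 0.44 | 0.092 | 0.209 |
| 37 | 4 | 38 | 212 | 0.759 | 0.132 | 0.174 |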
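
The coefficient of variation in *coef_variance* appears to be the standard error divided by the direct estimate. As a quick check, the sketch below recomputes it for the rows displayed above. The values are copied by hand from that output, and the column name `cv_check` is introduced here only for illustration.

```python
import numpy as np
import pandas as pd

# Rows copied from the displayed output (small areas 29 to 38)
rows = pd.DataFrame(
    {
        "small_area": [29, 30, 31, 32, 33, 34, 35, 36, 37, 38],
        "direct_est": [0.796, 0.565, 0.886, 0.952, 0.807, 0.582, 0.684, 0.787, 0.44, 0.759],
        "std_error": [0.106, 0.089, 0.225, 0.205, 0.119, 0.067, 0.106, 0.126, 0.092, 0.132],
        "coef_variance": [0.133, 0.158, 0.254, 0.215, 0.147, 0.115, 0.155, 0.16, 0.209, 0.174],
    }
)

# Coefficient of variation = standard error / estimate
rows["cv_check"] = rows["std_error"] / rows["direct_est"]

# Compare with the reported column, allowing for three-decimal rounding
print(np.allclose(rows["cv_check"], rows["coef_variance"], atol=1e-3))
```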


### Empirical best linear unbiased predictor (EBLUP)

As shown in the milk expenditure dataset, some of the coefficients of variation are not small, which indicates instability of the direct survey estimates. Hence, we can try to reduce the variability of the estimates by smoothing them through modeling. For illustration purposes, we will model the average expenditure on milk using the major areas as auxiliary variables.
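
For reference, the basic area level model in Rao and Molina (2015) is the Fay-Herriot model. The notation below is a sketch under the assumption that *EblupAreaModel* fits this standard form; the symbols $\theta_d$, $\psi_d$, and $\gamma_d$ are introduced here and are not used elsewhere in this notebook. For small area $d$, with direct estimate $\hat{\theta}_d$ and auxiliary vector $\mathbf{x}_d$:

$$
\hat{\theta}_d = \mathbf{x}_d^{\top}\boldsymbol{\beta} + u_d + e_d, \qquad u_d \sim N(0, \sigma_u^2), \quad e_d \sim N(0, \psi_d),
$$

where $\psi_d$ is the sampling variance of the direct estimate, treated as known (here, the square of *std_error*). The EBLUP is then a weighted average of the direct estimate and the regression prediction:

$$
\hat{\theta}_d^{\,EBLUP} = \hat{\gamma}_d \, \hat{\theta}_d + (1 - \hat{\gamma}_d) \, \mathbf{x}_d^{\top}\hat{\boldsymbol{\beta}}, \qquad \hat{\gamma}_d = \frac{\hat{\sigma}_u^2}{\hat{\sigma}_u^2 + \psi_d}.
$$

Areas with a large sampling variance get a small weight $\hat{\gamma}_d$ and are pulled more strongly toward the regression prediction.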

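First, we use the method *fit()* to estimate the model parameters. The pandas function *get_dummies()* creates a matrix of dummy values (0 and 1) from the categorical variable *major_area*. With *drop_first=True*, the dummy for the first major area is dropped, so that area serves as the reference level.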
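
As a small illustration of what *get_dummies()* returns, the sketch below uses a toy series standing in for *major_area*. The series and the name `X_toy` are made up for this example and are not part of the dataset.

```python
import pandas as pd

# Toy categorical variable with four levels, standing in for major_area
major_area = pd.Series([1, 1, 2, 3, 4, 4], name="major_area")

# drop_first=True drops the dummy for the first level (1),
# which then acts as the reference category
X_toy = pd.get_dummies(major_area, drop_first=True)

print(X_toy.columns.tolist())
print(X_toy.shape)
```

Four levels yield three dummy columns. Rows belonging to the reference level have zeros in every column.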

In [113]:
area = milk_exp["small_area"]
yhat = milk_exp["direct_est"]  # direct survey estimates
X = pd.get_dummies(milk_exp["major_area"], drop_first=True)  # dummies for the major areas
sigma_e = milk_exp["std_error"]  # standard errors of the direct estimates

# REML method
fh_model_reml = EblupAreaModel(method="REML")
fh_model_reml.fit(
    yhat=yhat, X=X, area=area, error_std=sigma_e, tol=1e-8,
)

print(f"\nThe estimated fixed effects are: {fh_model_reml.fixed_effects}")
print(f"\nThe estimated standard error of the area random effects is: {fh_model_reml.re_std}")
print(f"\nThe convergence statistics are: {fh_model_reml.convergence}")
print(f"\nThe goodness of fit statistics are: {fh_model_reml.goodness}\n")


The estimated fixed effects are: [ 0.96818899  0.13278031  0.22694622 -0.24130104]

The estimated standard error of the area random effects is: 0.136199615091212

The convergence statistics are: {'achieved': True, 'iterations': 7, 'precision': 4.8490397079603564e-09}

The goodness of fit statistics are: {'loglike': -9.403474415513806, 'AIC': 30.80694883102761, 'BIC': 41.37414952518898}



Now that the model has been fitted, we can obtain the EBLUP average expenditure on milk by running *predict()*, which is a method of the *EblupAreaModel* class. This run produces two main attributes, *area_est* and *area_mse*, which are Python dictionaries pairing the small areas with the EBLUP estimates and the MSE estimates, respectively.

In [114]:
fh_model_reml.predict(
    X=X, area=area,
)

import pprint
pprint.pprint(fh_model_reml.area_est)

{1: 1.0219705448470267,
 2: 1.0476019518832937,
 3: 1.0679514268850938,
 4: 0.7608165634164006,
 5: 0.8461570426977274,
 6: 0.9743727062092652,
 7: 1.0584526732855357,
 8: 1.0977762564423168,
 9: 1.2215454913423593,
 10: 1.1951460164712615,
 11: 0.7852149170863973,
 12: 1.2139462074222371,
 13: 1.2096597223605203,
 14: 0.9834964402356507,
 15: 1.186424709535009,
 16: 1.1556981135233584,
 17: 1.22634125101869,
 18: 1.2856489898727417,
 19: 1.2363248413266228,
 20: 1.2349601399238859,
 21: 1.090301626523384,
 22: 1.1923057228469687,
 23: 1.1216467660137082,
 24: 1.2230297222963116,
 25: 1.1938054444127775,
 26: 0.7627195900552479,
 27: 0.7649551536523862,
 28: 0.7338443883489107,
 29: 0.7699295545743627,
 30: 0.6134416227081902,
 31: 0.7695560730689732,
 32: 0.795825312822418,
 33: 0.7723188482183636,
 34: 0.6102300678743078,
 35: 0.7001781895145358,
 36: 0.7592788108093533,
 37: 0.52988633522673,
 38: 0.7434466782997076,
 39: 0.7548996333852704,
 40: 0.7701919661644319,
 41: 0.748116424
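
To see where these values come from, the sketch below reproduces the estimate for small area 29 by hand. It assumes the standard Fay-Herriot shrinkage form, a weighted average of the direct estimate and the regression prediction. It also assumes that the first fixed effect is an intercept and the last one is the coefficient of major area 4. Neither assumption is stated in the output above. All numbers are copied from the printed results.

```python
# Values copied from the outputs above
direct_est = 0.796            # direct estimate for small area 29
std_error = 0.106             # its standard error
intercept = 0.96818899        # first fixed effect (assumed to be the intercept)
beta_major4 = -0.24130104     # last fixed effect (assumed to be major area 4)
re_std = 0.136199615091212    # estimated std. deviation of the area random effects

# Regression (synthetic) prediction for an area in major area 4
synthetic = intercept + beta_major4

# Shrinkage weight: share of the total variance due to the random effect
gamma = re_std**2 / (re_std**2 + std_error**2)

# EBLUP as a weighted average of the direct and synthetic estimates
eblup_29 = gamma * direct_est + (1 - gamma) * synthetic

print(f"gamma = {gamma:.4f}, EBLUP = {eblup_29:.4f}")
```

Under these assumptions the result agrees with the value of about 0.7699 reported for area 29 in *area_est*.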

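We can try to convert the estimates into a pandas DataFrame with a *to_dataframe()* method. In the version used for this run, the call fails because the *EblupAreaModel* object has no such attribute.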

In [117]:
milk_est = fh_model_reml.to_dataframe()
print(milk_est)


AttributeError: 'EblupAreaModel' object has no attribute 'to_dataframe'

In [100]:
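def dict_to_dataframe(col_names, *args):
    """Combine dictionaries that share the same keys into one DataFrame.

    The first column holds the keys of the first dictionary; each following
    column holds the values of one dictionary, in the order given.
    """
    values = []
    for arg in args:
        if not isinstance(arg, dict):
            raise AssertionError("All input parameters must be dictionaries with the same keys.")
        values.append(list(arg.values()))

    # One row per key, one column per dictionary
    values_df = pd.DataFrame(values).T
    values_df.insert(0, "0", list(args[0].keys()))
    values_df.columns = col_names

    return values_df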
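
As an alternative sketch, pandas can build the same kind of table directly from the dictionaries, since a dictionary of dictionaries is aligned on the inner keys. The example uses rounded values for the first three areas, copied from the outputs in this notebook. The names `est_alt` and `rel_rmse` are introduced here for illustration only.

```python
import numpy as np
import pandas as pd

# First three areas, copied from the notebook outputs (rounded)
area_est = {1: 1.021971, 2: 1.047602, 3: 1.067951}
area_mse = {1: 0.013460, 2: 0.005373, 3: 0.005702}

# Outer keys become columns, inner keys become the index
est_alt = pd.DataFrame({"estimate": area_est, "mse": area_mse})
est_alt = est_alt.rename_axis("area").reset_index()

# Relative root MSE, comparable to a coefficient of variation
est_alt["rel_rmse"] = np.sqrt(est_alt["mse"]) / est_alt["estimate"]

print(est_alt)
```

Unlike the helper above, this form matches values by key rather than by position, so it does not depend on the dictionaries having the same ordering.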

In [101]:
area_est = fh_model_reml.area_est
area_mse = fh_model_reml.area_mse

est_data = dict_to_dataframe(["area", "estimate", "mse"], area_est, area_mse)

print(est_data)

    area  estimate       mse
0      1  1.021971  0.013460
1      2  1.047602  0.005373
2      3  1.067951  0.005702
3      4  0.760817  0.008542
4      5  0.846157  0.009580
5      6  0.974373  0.011671
6      7  1.058453  0.015926
7      8  1.097776  0.010587
8      9  1.221545  0.014184
9     10  1.195146  0.014902
10    11  0.785215  0.007694
11    12  1.213946  0.016337
12    13  1.209660  0.012563
13    14  0.983496  0.012117
14    15  1.186425  0.012031
15    16  1.155698  0.011709
16    17  1.226341  0.010860
17    18  1.285649  0.013691
18    19  1.236325  0.011035
19    20  1.234960  0.013080
20    21  1.090302  0.009949
21    22  1.192306  0.017244
22    23  1.121647  0.011292
23    24  1.223030  0.013625
24    25  1.193805  0.008066
25    26  0.762720  0.009205
26    27  0.764955  0.009205
27    28  0.733844  0.016477
28    29  0.769930  0.007801
29    30  0.613442  0.006099
30    31  0.769556  0.015442
31    32  0.795825  0.014658
32    33  0.772319  0.009025
33    34  0.61