Code associated with paper: Plug-in Regularized Estimation of High-Dimensional Parameters in Nonlinear Semiparametric Models, Chernozhukov, Nekipelov, Semenova, Syrgkanis, 2018


vsyrgkanis/plugin_regularized_estimation


Introduction

Code associated with the paper: Plug-in Regularized Estimation of High-Dimensional Parameters in Nonlinear Semiparametric Models, Chernozhukov, Nekipelov, Semenova, Syrgkanis, 2018.

The code requires Python 3.6, as well as standard Python packages such as numpy, scipy, and scikit-learn.

Running the Monte Carlo Simulations

Linear Heterogeneous Treatment Effects

To replicate the experiments with linear heterogeneous treatment effect estimation, run:

```
cp sweep_config_linear.py_example sweep_config_linear.py
python sweep_mc_from_config.py --config sweep_config_linear
```

The DGP and estimation methods for this application are contained in linear_te.py.

The above code produces the following figures:

Figure ell2_error_dgp_dgp1_growing_kappa: growing support size of the nuisance.

Figure plot2_ell2_varying_sigma_eta: varying support size and variances of the errors.
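For intuition about this experiment's setup, here is a hypothetical, heavily simplified linear heterogeneous treatment effect DGP: a sparse effect vector `theta` interacts with a treatment `t` that itself depends on the controls `x` through a sparse nuisance `mu`. All names and parameter choices below are illustrative assumptions; the actual DGP is the one implemented in linear_te.py.

```python
import numpy as np

# Illustrative sketch only; the experiment's real DGP lives in linear_te.py.
np.random.seed(1)
n, d, kappa = 5000, 20, 3
theta = np.zeros(d)
theta[:kappa] = 1.0            # sparse heterogeneous treatment effect
mu = np.zeros(d)
mu[:kappa] = 1.0               # sparse nuisance governing treatment and baseline
x = np.random.normal(0, 1, size=(n, d))
t = x @ mu + np.random.normal(0, 1, size=n)            # treatment depends on x
y = (x @ theta) * t + x @ mu + np.random.normal(0, 1, size=n)
```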

Heterogeneous Treatment Effects with a Logistic Link

To replicate the experiments with logistic heterogeneous treatment effect estimation, run:

```
cp config_logistic.py_example config_logistic.py
python mc_from_config.py --config config_logistic
```

The DGP and estimation methods for this application are contained in logistic_te.py.

The above code produces the following figures:

Figures l1_decrease_logistic and l1_error_logistic: l1 and l2 errors and error decreases (Direct - Ortho).

Conditional Moment Models with Missing Data

To replicate the experiments on estimation of conditional moment models with missing data, run:

```
cp config_missing_data.py_example config_missing_data.py
python mc_from_config.py --config config_missing_data
```

```
cp sweep_config_missing_data.py_example sweep_config_missing_data.py
python sweep_mc_from_config.py --config sweep_config_missing_data
```

The DGP and estimation methods for this application are contained in missing_data.py.

The above code produces the following figures:

Figure l2_errors_missing: l2 errors and error decreases for each benchmark method.

Figures dmr_kz_1_sx_3_dx_se and ipmr_kz_1_se_1_dx_sx: percentage decrease of error with varying dimension of the nuisance and error variances.

Games of Incomplete Information

To replicate the experiments on estimation in games of incomplete information, run:

```
cp config_games.py_example config_games.py
python mc_from_config.py --config config_games
```

```
cp sweep_config_games.py_example sweep_config_games.py
python sweep_mc_from_config.py --config sweep_config_games
```

The DGP and estimation methods for this application are contained in games.py.

The above code produces the following figures:

Figures l2_decrease_one_player and l1_error_one_player: one-player DGP.

Figures l2_decrease_two_player and l1_error_two_player: two-player DGP.

Figure l2_games_varying_p_sigma_n_10000: one-player DGP, varying dimension and variance of the covariates.

MCPY library

The library folder mcpy contains library code for running generic Monte Carlo experiments from config files and for saving and reloading the results. Check out the notebook example_mcpy.ipynb for a simple example of how to use the library.

A simple config dictionary lets you run Monte Carlo experiments for a given configuration of the DGP and estimation-method parameters. You can specify an arbitrary set of estimation methods to apply to each sample, an arbitrary set of DGPs for generating samples, arbitrary metrics to evaluate, and arbitrary plots to create from the experimental results. The Monte Carlo class runs many experiments; in each one it generates a sample from each DGP, runs each estimation method on each sample, and calculates each metric on the returned result. The plotting functions then receive the collection of all experimental results and create figures. The package offers a basic set of plotting functions, but you can define your own and add them to the config dictionary. See e.g. config_games.py_example, config_logistic.py_example and config_missing_data.py_example for sample config variable definitions.
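The experiment loop described above can be sketched in a few lines. This is a simplified stand-in, not the mcpy implementation, which also handles saving, reloading, and plotting:

```python
import numpy as np

# Simplified sketch of the Monte Carlo experiment loop: for each experiment,
# draw a sample from each dgp, run each method on it, and score each metric.
def run_mc(config):
    np.random.seed(config['mc_opts']['seed'])
    results = {d: {m: {k: [] for k in config['metrics']}
                   for m in config['methods']} for d in config['dgps']}
    for _ in range(config['mc_opts']['n_experiments']):
        for d_name, d_fn in config['dgps'].items():
            data, truth = d_fn(config['dgp_opts'])
            for m_name, m_fn in config['methods'].items():
                estimate = m_fn(data, config['method_opts'])
                for k_name, k_fn in config['metrics'].items():
                    results[d_name][m_name][k_name].append(k_fn(estimate, truth))
    return results

# Toy demo: estimate a mean with a trivial "dgp", "method", and "metric".
toy = {
    'dgps': {'normal': lambda o: (np.random.normal(0, 1, o['n']), 0.0)},
    'methods': {'mean': lambda data, o: np.mean(data)},
    'metrics': {'abs_err': lambda est, truth: abs(est - truth)},
    'dgp_opts': {'n': 50}, 'method_opts': {},
    'mc_opts': {'n_experiments': 10, 'seed': 123},
}
res = run_mc(toy)
```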

Consider the following simple example: suppose we want to generate data from a linear model and test how well OLS and Lasso estimate the coefficients as the dimension of the features changes. We first create a dgp function that takes a dictionary of options as input and returns a dataset and the true parameter:

```python
import numpy as np

def dgp(dgp_opts):
    true_param = np.zeros(dgp_opts['n_dim'])
    true_param[:dgp_opts['kappa']] = 1.0
    x = np.random.normal(0, 1, size=(dgp_opts['n_samples'], dgp_opts['n_dim']))
    y = np.matmul(x, true_param)
    return (x, y), true_param
```

Then we define two functions that take a dataset and a dictionary of options as input and return the estimated coefficients for each method:

```python
def ols(data, method_opts):
    x, y = data
    from sklearn.linear_model import LinearRegression
    return LinearRegression().fit(x, y).coef_

def lasso(data, method_opts):
    x, y = data
    from sklearn.linear_model import Lasso
    return Lasso(alpha=method_opts['l1_reg']).fit(x, y).coef_
```
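Before wiring these into a config, it can help to sanity-check the pieces directly. A minimal smoke test of the same setup, inlined here so it is self-contained:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Lasso

# Generate one noiseless sample from the example dgp and fit both methods.
np.random.seed(0)
true_param = np.zeros(10)
true_param[:2] = 1.0
x = np.random.normal(0, 1, size=(100, 10))
y = x @ true_param

ols_coef = LinearRegression().fit(x, y).coef_
lasso_coef = Lasso(alpha=0.01).fit(x, y).coef_

ols_err = np.linalg.norm(ols_coef - true_param)
lasso_err = np.linalg.norm(lasso_coef - true_param)
```

Since the model is noiseless and the number of samples exceeds the dimension, OLS recovers the true parameter essentially exactly, while the lasso estimate is slightly shrunk by the l1 penalty.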

Now we are ready to write the config file, which specifies the Monte Carlo simulation to run as well as the performance metrics to compute and the plots to generate at the end:

```python
from mcpy import metrics
from mcpy import plotting

CONFIG = {
    # Functions to be used for data generation
    'dgps': {'linear_dgp': dgp},
    # Dictionary of parameters to the dgp functions
    'dgp_opts': {'n_dim': 10, 'n_samples': 100, 'kappa': 2},
    # Estimation methods to evaluate
    'methods': {'ols': ols, 'lasso': lasso},
    # Dictionary of parameters to the estimation functions
    'method_opts': {'l1_reg': 0.01},
    # Metrics to evaluate. Each metric takes two inputs: estimated param, true param
    'metrics': {'l1_error': metrics.l1_error, 'l2_error': metrics.l2_error},
    # Options for the monte carlo simulation
    'mc_opts': {'n_experiments': 10, 'seed': 123},
    # Which of the methods is the proposed one vs a benchmark method. Used in plotting
    'proposed_method': 'lasso',
    # Target folder for saving plots and results
    'target_dir': 'test_ols',
    # Whether to reload monte carlo results from the folder if results with
    # the same config spec exist from a previous run
    'reload_results': False,
    # Which plots to generate: either a plot spec dictionary or an ad hoc plot function.
    # A plot spec contains {'metrics', 'methods', 'dgps', 'metric_transforms'},
    # each specifying a subset to plot (defaults to all if not specified).
    # An ad hoc function takes as input (param_estimates, metric_results, config).
    'plots': {'all_metrics': {}, 'param_hist': plotting.plot_param_histograms}
}
```
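The two metrics referenced in the config are simple norms of the estimation error; any function with the signature metric(estimate, true_param) can be plugged into 'metrics'. A minimal sketch of equivalent implementations (the real ones live in mcpy.metrics):

```python
import numpy as np

# Minimal equivalents of the two metrics used in the config above.
def l1_error(estimate, true_param):
    return np.linalg.norm(np.asarray(estimate) - np.asarray(true_param), ord=1)

def l2_error(estimate, true_param):
    return np.linalg.norm(np.asarray(estimate) - np.asarray(true_param), ord=2)
```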

We can then run this Monte Carlo simulation as easily as:

```python
from mcpy.monte_carlo import MonteCarlo
estimates, metric_results = MonteCarlo(CONFIG).run()
```

This code saves the plots in target_dir. In particular, it saves two figures that depict the distribution of the l1 and l2 errors across the 10 experiments.

A sweep config dictionary allows you to specify a whole list of values for each dgp option, rather than a single value. The MonteCarloSweep class then executes Monte Carlo experiments for each combination of parameters. The provided plotting functions can, for instance, plot how each metric varies as a single parameter varies, averaging performance over the settings of the remaining parameters. Such plots are created for each dgp and metric, and each plot contains the results for each method. This is used, for instance, in the linear treatment effect experiment. See e.g. sweep_config_linear.py_example for a sample sweep-config variable definition.
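The parameter sweep amounts to taking the cartesian product of all list-valued dgp options. A minimal sketch of that expansion (a hypothetical helper, not the MonteCarloSweep internals):

```python
import itertools

# Expand a sweep spec into one options dict per parameter combination.
# Options given as lists are swept; scalar options are held fixed.
def expand_sweep(dgp_opts):
    keys = list(dgp_opts)
    vals = [v if isinstance(v, list) else [v] for v in dgp_opts.values()]
    return [dict(zip(keys, combo)) for combo in itertools.product(*vals)]

combos = expand_sweep({'n_dim': [10, 100, 200],
                       'n_samples': [100, 200, 300],
                       'kappa': 2})
# 3 values of n_dim x 3 values of n_samples x 1 value of kappa = 9 settings
```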

For instance, returning to the linear example, we could examine how the l1 and l2 errors change as a function of the feature dimension or the number of samples. We can perform such a sweeping Monte Carlo by defining a sweep config:

```python
SWEEP_CONFIG = {
    'dgps': {'linear_dgp': dgp},
    'dgp_opts': {'n_dim': [10, 100, 200], 'n_samples': [100, 200, 300], 'kappa': 2},
    'methods': {'ols': ols, 'lasso': lasso},
    'method_opts': {'l1_reg': 0.01},
    'metrics': {'l1_error': metrics.l1_error, 'l2_error': metrics.l2_error},
    'mc_opts': {'n_experiments': 10, 'seed': 123},
    'proposed_method': 'lasso',
    'target_dir': 'test_sweep_ols',
    'reload_results': False,
    # No per-instance plots
    'plots': {},
    # Plots across the sweep of parameters
    'sweep_plots': {
        # Errors as a function of the dimension, holding the samples fixed at 100
        'var_dim_at_100_samples': {'varying_params': ['n_dim'], 'select_vals': {'n_samples': [100]}},
        # Errors as a function of n_samples, holding the dimension fixed at 100
        'var_samples_at_100_dim': {'varying_params': ['n_samples'], 'select_vals': {'n_dim': [100]}},
        # 2d contour of the median metric of each method as two parameters vary simultaneously
        'var_samples_and_dim': {'varying_params': [('n_samples', 'n_dim')]},
        # Difference between each method in a designated list and the 'proposed_method' in the config
        'error_diff_var_samples_and_dim': {'varying_params': [('n_samples', 'n_dim')], 'methods': ['ols'],
                                           'metric_transforms': {'diff': metrics.transform_diff}}
    }
}
```

We can then run the sweeping Monte Carlo with:

```python
from mcpy.monte_carlo import MonteCarloSweep
sweep_keys, sweep_estimates, sweep_metric_results = MonteCarloSweep(SWEEP_CONFIG).run()
```

The sweep_plots entry lets you define which plots to save as some subset of parameters varies while the others are held to a subset of their values. For instance, the four sweep plot specs above create 8 plots, one per spec and metric. The four plots corresponding to the l2 error are as follows:

Figures var_samples_at_100_dim and var_samples_and_dim show how lasso outperforms OLS when the number of samples is smaller than the dimension.

ORTHOPY library

The library folder orthopy contains modifications to standard estimation methods, such as logistic regression, that are required for orthogonal estimation, e.g. adding an orthogonal correction term to the loss or adding an offset to the index.

More concretely, the class LogisticWithOffsetAndGradientCorrection is an estimator adhering to the fit and predict specification of sklearn that enables fitting an "orthogonal" logistic regression. Its fit method minimizes a regularized modified weighted logistic loss, with sample weights w_i, gradient corrections g_i and index offsets v_i:

    min_theta  (1/n) * sum_i w_i * ell_i(theta)  +  alpha_l1 * ||theta||_1  +  alpha_l2 * ||theta||_2^2

where the modified logistic loss for each sample is defined as:

    ell_i(theta) = log(1 + exp(<x_i, theta> + v_i)) - y_i * (<x_i, theta> + v_i) + g_i * <x_i, theta>
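Since the original equation images may not render here, the following numpy sketch evaluates one plausible form of this objective: the standard logistic loss on the offset index <x_i, theta> + v_i, plus a linear gradient-correction term g_i * <x_i, theta> and the two penalties. This is our reading of the documented interface, not necessarily the exact orthopy implementation.

```python
import numpy as np

# Hedged sketch of the modified weighted logistic loss; the authoritative
# version is the one implemented in orthopy.
def modified_logistic_loss(theta, X, y, offsets=None, grad_corrections=None,
                           sample_weights=None, alpha_l1=0.1, alpha_l2=0.1):
    n = X.shape[0]
    v = np.zeros(n) if offsets is None else np.ravel(offsets)
    g = np.zeros(n) if grad_corrections is None else np.ravel(grad_corrections)
    w = np.ones(n) if sample_weights is None else np.ravel(sample_weights)
    index = X @ theta + v
    # log(1 + exp(u)) - y * u is the logistic loss on index u;
    # g * <x, theta> is the linear gradient-correction term.
    per_sample = np.logaddexp(0.0, index) - np.ravel(y) * index + g * (X @ theta)
    penalty = alpha_l1 * np.sum(np.abs(theta)) + alpha_l2 * np.sum(theta ** 2)
    return np.mean(w * per_sample) + penalty

# At theta = 0 with no offsets or corrections, each sample contributes log(2).
X = np.random.RandomState(0).normal(size=(8, 3))
y = np.array([0., 1., 0., 1., 1., 0., 1., 0.])
loss0 = modified_logistic_loss(np.zeros(3), X, y, alpha_l1=0.0, alpha_l2=0.0)
```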

The specification of the class follows a standard sklearn paradigm:

```python
class LogisticWithOffsetAndGradientCorrection:
    def __init__(self, alpha_l1=0.1, alpha_l2=0.1, tol=1e-6):
        ''' Initialize.
        alpha_l1 : ell_1 regularization weight
        alpha_l2 : ell_2 regularization weight
        tol : minimization tolerance
        '''

    def fit(self, X, y, offsets=None, grad_corrections=None, sample_weights=None):
        ''' Fits coefficient theta by minimizing the regularized modified logistic loss.
        Parameters
        X : (n x d) matrix of features
        y : (n x 1) matrix of labels
        offsets : (n x 1) matrix of index offsets v
        grad_corrections : (n x 1) matrix of gradient corrections g
        sample_weights : (n x 1) matrix of sample weights w
        '''

    @property
    def coef_(self):
        ''' Fitted coefficient '''

    def predict_proba(self, X, offsets=None):
        ''' Probabilistic prediction. Returns an (n x 2) matrix of probabilities '''

    def predict(self, X, offsets=None):
        ''' Binary prediction '''

    def score(self, X, y_true, offsets=None):
        ''' AUC score '''

    def accuracy(self, X, y_true, offsets=None):
        ''' Prediction accuracy '''
```

Empirical application
