In [6]:
!pip install --no-cache-dir --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple lightweight-mcnnm
import platform
import jax
import jax.numpy as jnp
from mcnnm import estimate, generate_data, complete_matrix

In [7]:
# Check if the platform is macOS and the machine is Apple Silicon (ARM architecture)
if platform.system() == "Darwin" and platform.machine() == "arm64":
    jax.config.update('jax_platforms', 'cpu')  # Avoid problems with Metal on Apple Silicon Machines

## Example 1: Staggered Adoption with Cross-Validation (Default)
In this example, we generate a dataset with covariates and staggered adoption treatment assignment and use the default cross-validation method for selecting the regularization parameters.

In [8]:
data, true_params = generate_data(nobs=500, nperiods=100, seed=42, assignment_mechanism='staggered', 
                                  X_cov=True, Z_cov=True, V_cov=True)

Y = jnp.array(data.pivot(index='unit', columns='period', values='y').values)
W = jnp.array(data.pivot(index='unit', columns='period', values='treat').values)
X = jnp.array(true_params['X'])
Z = jnp.array(true_params['Z'])
V = jnp.array(true_params['V'])

results = estimate(Y, W, X=X, Z=Z, V=V)

print(f"\nTrue effect: {true_params['treatment_effect']}, Estimated effect: {results.tau:.4f}")
print(f"Chosen lambda_L: {results.lambda_L:.4f}, lambda_H: {results.lambda_H:.4f}")


True effect: 1.0, Estimated effect: 1.0200
Chosen lambda_L: 0.0010, lambda_H: 0.0010


The `generate_data` function creates a synthetic dataset with staggered adoption treatment assignment. The `assignment_mechanism` parameter is set to `staggered`, which means that each unit adopts the treatment at a random time point with a specified probability.
By default, the `estimate` function uses cross-validation to select the regularization parameters `lambda_L` and `lambda_H`. Cross-validation splits the data into K folds (default is 5) and evaluates the model performance on each fold to select the best parameters.
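
To make the K-fold idea above concrete, here is a minimal NumPy sketch of one way to split the observed (untreated) cells of a treatment matrix into K validation folds. It is an illustration only and is not taken from the package's internals; the helper name `kfold_masks` and its arguments are hypothetical.

```python
import numpy as np

def kfold_masks(W, k=5, seed=0):
    """Split the untreated cells of W into k validation folds (hypothetical helper).

    Returns a list of (train_mask, val_mask) boolean arrays shaped like W.
    """
    rng = np.random.default_rng(seed)
    observed = (W == 0)                    # cells where the untreated outcome is observed
    idx = np.flatnonzero(observed)         # flat indices of those cells
    rng.shuffle(idx)                       # randomize before splitting
    folds = np.array_split(idx, k)         # k nearly equal groups of cells
    masks = []
    for fold in folds:
        val = np.zeros(W.size, dtype=bool)
        val[fold] = True
        val = val.reshape(W.shape)
        train = observed & ~val            # fit on the remaining observed cells
        masks.append((train, val))
    return masks

# Toy treatment matrix: 6 units, 8 periods, two units treated from period 5 onward
W = np.zeros((6, 8), dtype=int)
W[:2, 5:] = 1
masks = kfold_masks(W, k=5)
```

For each candidate pair of regularization parameters, a model would be fit on the `train` cells of every fold and scored on the matching `val` cells; the pair with the lowest average validation error is kept.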

## Example 2: Block Assignment with Holdout Validation
In this example, we generate a dataset without covariates using block treatment assignment and use holdout validation for selecting the regularization parameters.

In [9]:
data, true_params = generate_data(nobs=1000, nperiods=50, seed=123, assignment_mechanism='block', 
                                  treated_fraction=0.4, X_cov=False, Z_cov=False, V_cov=False)

Y = jnp.array(data.pivot(index='unit', columns='period', values='y').values)
W = jnp.array(data.pivot(index='unit', columns='period', values='treat').values)

results = estimate(Y, W, validation_method='holdout')

print(f"\nTrue effect: {true_params['treatment_effect']}, Estimated effect: {results.tau:.4f}")
print(f"Chosen lambda_L: {results.lambda_L:.4f}, lambda_H: {results.lambda_H:.4f}")


True effect: 1.0, Estimated effect: 1.0305
Chosen lambda_L: 0.0010, lambda_H: 0.0010


Here, the `assignment_mechanism` is set to `block`, which means that a specified fraction of units (determined by `treated_fraction`) are treated in the second half of the time periods.
The `validation_method` parameter in the `estimate` function is set to `holdout`, indicating that holdout validation should be used for selecting the regularization parameters. Holdout validation splits the data into a training set and a validation set based on time: it uses the earlier time periods for training and the later time periods for validation. Holdout validation is substantially faster than cross-validation but may be less accurate, especially if the number of time periods is small.
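
The time-based split described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the package's actual holdout routine: the helper `holdout_masks` and the `train_fraction` argument are hypothetical names introduced here.

```python
import numpy as np

def holdout_masks(W, train_fraction=0.8):
    """Split untreated cells by time (hypothetical helper).

    Earlier periods go to training, later periods to validation.
    """
    n_periods = W.shape[1]
    cutoff = int(n_periods * train_fraction)   # first validation period
    observed = (W == 0)                        # only untreated cells are usable
    periods = np.arange(n_periods)
    train = observed & (periods < cutoff)[None, :]
    val = observed & (periods >= cutoff)[None, :]
    return train, val

# Toy block design: 4 units, 10 periods, two units treated in the second half
W = np.zeros((4, 10), dtype=int)
W[:2, 5:] = 1
train, val = holdout_masks(W, train_fraction=0.8)
```

In this toy block design only the never-treated units contribute validation cells, because the treated units have no untreated outcomes in the late periods. With few periods the validation set becomes very small, which is one reason holdout validation can be less accurate.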

## Example 3: Single Treated Unit with Covariates
In this example, we generate a dataset with a single treated unit and include covariates in the estimation.

In [10]:
data, true_params = generate_data(nobs=100, nperiods=200, seed=456, assignment_mechanism='single_treated_unit', 
                                  X_cov=True, Z_cov=True, V_cov=True)

Y = jnp.array(data.pivot(index='unit', columns='period', values='y').values)
W = jnp.array(data.pivot(index='unit', columns='period', values='treat').values)
X = jnp.array(true_params['X'])
Z = jnp.array(true_params['Z'])
V = jnp.array(true_params['V'])

results = estimate(Y, W, X=X, Z=Z, V=V)

print(f"\nTrue effect: {true_params['treatment_effect']}, Estimated effect: {results.tau:.4f}")
print(f"Chosen lambda_L: {results.lambda_L:.4f}, lambda_H: {results.lambda_H:.4f}")


True effect: 1.0, Estimated effect: 1.1468
Chosen lambda_L: 0.0010, lambda_H: 0.0010


The `assignment_mechanism` is set to `'single_treated_unit'`, which means that only one randomly selected unit is treated in the second half of the time periods.
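
The three assignment mechanisms used in these examples can be pictured as different shapes of the treatment matrix `W`. The sketch below builds toy versions of each with NumPy. It follows the verbal descriptions in this notebook rather than the source of `generate_data`, so the function names, the `adoption_prob` argument, and the per-period reading of the adoption probability are all assumptions.

```python
import numpy as np

def staggered(n_units, n_periods, adoption_prob=0.05, seed=0):
    """Each unit may adopt at a random period and then stays treated (assumed reading)."""
    rng = np.random.default_rng(seed)
    W = np.zeros((n_units, n_periods), dtype=int)
    for i in range(n_units):
        draws = np.flatnonzero(rng.random(n_periods) < adoption_prob)
        if draws.size > 0:
            W[i, draws[0]:] = 1            # treated from first adoption onward
    return W

def block(n_units, n_periods, treated_fraction=0.4, seed=0):
    """A fixed fraction of units is treated in the second half of the periods."""
    rng = np.random.default_rng(seed)
    W = np.zeros((n_units, n_periods), dtype=int)
    n_treated = int(n_units * treated_fraction)
    treated = rng.choice(n_units, size=n_treated, replace=False)
    W[treated, n_periods // 2:] = 1
    return W

def single_treated_unit(n_units, n_periods, seed=0):
    """One randomly chosen unit is treated in the second half of the periods."""
    rng = np.random.default_rng(seed)
    W = np.zeros((n_units, n_periods), dtype=int)
    W[rng.integers(n_units), n_periods // 2:] = 1
    return W

W_staggered = staggered(10, 8)
W_block = block(10, 8)
W_single = single_treated_unit(10, 8)
```

Comparing `W_block.sum()` with `W_single.sum()` shows how few treated cells the single-unit design provides: the average treatment effect is identified from a single row of the matrix.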

In this example, we include unit-specific covariates `X`, time-specific covariates `Z`, and unit-time specific covariates `V` in the estimation. The `estimate` function automatically handles the presence of covariates and estimates their coefficients along with the treatment effect.

With this dataset, the estimate (1.1468) is further from the true effect (1.0) than in the previous examples. With only a single treated unit, the cross-validation procedure struggles to find a valid loss during parameter selection. The warning message "No valid loss found in cross_validate" indicates that cross-validation could not find a set of regularization parameters that yielded a finite loss value.

This issue arises because with only a single treated unit, there might not be enough information to reliably estimate the treatment effect, especially when using cross-validation. The limited treatment variation can make it challenging for the model to distinguish the treatment effect from the noise in the data.

In such cases, it may be more appropriate to use a different validation method, such as holdout validation, or to rely on domain knowledge to set the regularization parameters manually. Additionally, increasing the number of observations or treated units can help improve the estimation accuracy and stability.

It's important to note that the performance of the estimation method can be sensitive to the specific dataset and the chosen assignment mechanism. While the `estimate` function aims to handle various scenarios, there may be limitations in extreme cases like having only a single treated unit. It's always a good practice to carefully evaluate the results, consider the characteristics of the dataset, and interpret the findings in the context of the specific application.

# Covariates
The `generate_data` function allows you to include three types of covariates in the generated dataset:

1. **Unit-specific covariates (X):** These are characteristics or features that vary across units but remain constant over time. For example, in a study of students' academic performance, unit-specific covariates could include variables like gender, age, or socioeconomic status. These covariates capture the inherent differences between units that may influence the outcome variable.
2. **Time-specific covariates (Z):** These are factors that change over time but are the same for all units at each time point. For instance, in an analysis of sales data, time-specific covariates could include variables like market trends, seasonal effects, or economic indicators. These covariates reflect the temporal variations that affect all units simultaneously.
3. **Unit-time specific covariates (V):** These are covariates that vary both across units and over time. They capture the unique characteristics of each unit at each time point. For example, in a healthcare study, unit-time specific covariates could include individual patients' medical measurements or treatment adherence recorded at different time points. These covariates allow for capturing the dynamic and personalized aspects of each unit's experience.

All three covariate types can also be passed to the estimator, mirroring the description of the estimator in [Athey et al. (2021)](https://www.tandfonline.com/doi/full/10.1080/01621459.2021.1891924).

In the `generate_data` function, you can control the inclusion of these covariates using the boolean flags `X_cov`, `Z_cov`, and `V_cov`. Setting a flag to `True` incorporates the respective type of covariates into the generated dataset, while setting it to `False` excludes them.
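
As a rough illustration of how the three covariate types differ in shape, the NumPy sketch below builds toy arrays and combines them into a covariate contribution to the outcome, in the spirit of the model in Athey et al. (2021). The array shapes, the dimension names `P`, `Q`, and `J`, and the coefficient arrays `H` and `beta` are assumptions made for this example; they are not read from the package.

```python
import numpy as np

rng = np.random.default_rng(0)
N, T = 5, 4            # number of units and periods
P, Q, J = 2, 3, 2      # hypothetical counts of X, Z, and V covariates

X = rng.normal(size=(N, P))      # unit-specific: one row per unit, constant over time
Z = rng.normal(size=(T, Q))      # time-specific: one row per period, shared by all units
V = rng.normal(size=(N, T, J))   # unit-time specific: one vector per (unit, period) cell

H = rng.normal(size=(P, Q))      # assumed coefficients linking unit and time covariates
beta = rng.normal(size=J)        # assumed coefficients on the unit-time covariates

xz_term = X @ H @ Z.T            # (N, T): sum over p, q of X[i, p] * H[p, q] * Z[t, q]
v_term = V @ beta                # (N, T): sum over j of V[i, t, j] * beta[j]
covariate_part = xz_term + v_term
```

Each term is an `N` by `T` matrix, so the covariate contribution lines up cell by cell with the outcome matrix `Y` and the treatment matrix `W` used in the examples.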